The Unicode committee is very clear that U+2019 (RIGHT SINGLE QUOTATION MARK) should represent the English apostrophe.

Section 6.2 of the Unicode Standard 7.0.0 states:

U+2019 […] is preferred where the character is to represent a punctuation mark, as for contractions: “We’ve been here before.”

This is very, very wrong. The character you should use to represent the English apostrophe is U+02BC (MODIFIER LETTER APOSTROPHE). I’m here to tell you why.

Using U+2019 is inconsistent with the rest of the standard

Earlier in section 6.2, the standard explains the difference between punctuation marks and modifier letters:

Punctuation marks generally break words; modifier letters generally are considered part of a word.

Consider any English word with an apostrophe, e.g. “don’t”. The word “don’t” is a single word. It is not the word “don” juxtaposed against the word “t”. The apostrophe is part of the word, which, in Unicode-speak, means it’s a modifier letter, not a punctuation mark, regardless of what colloquial English calls it.

According to the Unicode character database, U+2019 is a punctuation mark (General Category = Pf), while U+02BC is a modifier letter (General Category = Lm). Since English apostrophes are part of the words they’re in, they are modifier letters, and hence should be represented by U+02BC, not U+2019.
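You can check those categories yourself. Here is a minimal sketch using Python's standard `unicodedata` module; the constant names and the `describe` helper are mine, purely for illustration.

```python
import unicodedata

RIGHT_SINGLE_QUOTE = "\u2019"   # RIGHT SINGLE QUOTATION MARK
MODIFIER_APOSTROPHE = "\u02bc"  # MODIFIER LETTER APOSTROPHE


def describe(ch: str) -> str:
    """Return the code point, character name, and General Category of ch."""
    return (
        f"U+{ord(ch):04X} {unicodedata.name(ch)} "
        f"(General Category = {unicodedata.category(ch)})"
    )


for ch in (RIGHT_SINGLE_QUOTE, MODIFIER_APOSTROPHE):
    print(describe(ch))
```

The first line reports Pf and the second reports Lm, matching the character database.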

(It would be different if we were talking about French. In French, I think it makes more sense to consider «L’Homme» as two words, or «jusqu’ici» as two words. But that’s a conversation for another time. Right now I’m talking about English.)

Using U+2019 breaks regular expressions

When doing word matching on Unicode text, programmers might reasonably assume they can detect “words” with the regex /\w+/ (which, in a Unicode context, matches characters with General Category L*, M*, N*, or Pc). This won’t actually work with English words that contain apostrophes if the apostrophes are represented as U+2019, but it will work if the apostrophes are represented as U+02BC.
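To make that concrete, here is a small sketch in Python, whose `re` module is Unicode-aware for `str` patterns. Python's exact definition of `\w` differs in detail from the category list above, but it treats these two characters the same way. The `words` helper is a hypothetical name of mine.

```python
import re

# In Python 3, \w in a str pattern matches Unicode word characters.
WORD = re.compile(r"\w+")


def words(text: str) -> list[str]:
    """Split text into 'words' the way a naive programmer might."""
    return WORD.findall(text)


with_quote = "don\u2019t"     # apostrophe encoded as U+2019
with_modifier = "don\u02bct"  # apostrophe encoded as U+02BC

print(words(with_quote))      # the word is split in two
print(words(with_modifier))   # the word stays in one piece
```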

To be fair, this problem exists in ASCII right now, where /\w+/ fails to match \x27 (the ASCII apostrophe). This leads to common bugs where users named O’Brien get told they can’t enter their name on a form, or where blog titles get auto-formatted as “Don’T Stop The Music”. Programmers soon learn they need to include the ASCII apostrophe in their regex as an exception.
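Both bugs are easy to reproduce. This is only a sketch of the kind of code that causes them, not anyone's actual form validator or blog engine; `is_valid_name` and `naive_title` are made-up names.

```python
import re

NAME = re.compile(r"\w+")  # hypothetical "names are made of word characters" rule


def is_valid_name(name: str) -> bool:
    """Accept a name only if the whole string is word characters."""
    return NAME.fullmatch(name) is not None


def naive_title(text: str) -> str:
    """Capitalize every run of word characters, as a naive title-caser might."""
    return re.sub(r"\w+", lambda m: m.group().capitalize(), text)


print(is_valid_name("O'Brien"))             # rejected because of \x27
print(naive_title("don't stop the music"))  # the "t" is treated as its own word
```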

But we shouldn’t be perpetuating this problem. When a programmer is writing a regex that can match text in Chinese, Arabic, or any other human language supported by Unicode, they shouldn’t have to add an exception for English. Furthermore, if apostrophes are represented as U+2019, the programmer would have to add both \x27 and \u2019 to their regex as exceptions.
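Here is roughly what that exception looks like, again as a Python sketch with names of my own choosing. Note the side effect: once U+2019 is whitelisted as a word character, a genuine closing single quote gets glued onto the word before it, because the regex has no way to tell the two uses apart.

```python
import re

# The workaround: treat both \x27 and U+2019 as word characters.
WORD_WITH_EXCEPTIONS = re.compile(r"[\w'\u2019]+")

# Apostrophes now match...
print(WORD_WITH_EXCEPTIONS.findall("O'Brien don\u2019t"))

# ...but so does the closing quote in a quoted word.
print(WORD_WITH_EXCEPTIONS.findall("\u2018hello\u2019"))
```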

The solution is to represent apostrophes as U+02BC, and let programmers simply write /\w+/ to match words like O’Brien and don’t.

[Edit: If you’re about to tell me that word segmentation should be done using UAX #29, guess what: Using U+2019 for the apostrophe breaks UAX #29! Using U+02BC would fix it. See comments. –Ed]

Using U+2019 means that word processors can’t distinguish between apostrophes and actual quotation marks, leading to a heap of problems

How many times have you seen things like ‘Tis the Season or Up and at ‘em with the apostrophe curled the wrong way, because someone’s word processor mistook the apostrophe for an opening single quotation mark? Or, have you ever cut-and-pasted a block of text to use as a quote, put quotation marks around it, and then had to manually change all the nested quotation marks from double to single? Or maybe you’ve received text from the UK to use in your American presentation, but first you had to change all the quotation marks, because the UK prefers single-quotes while the US prefers double-quotes.

These are all things your word processor should be able to handle automatically and properly, but it can’t due to the ambiguity of whether a U+2019 character represents a single quotation mark or an apostrophe. We wouldn’t have these problems if apostrophes were represented by U+02BC. Allow me to explain.

1) In my perfect world, word processors would automatically ensure that quotes (and nested quotes) were properly formatted according to your locale’s conventions, whether it be US, UK, or one of the myriad crazy quote conventions found across Europe. All quote marks would be reformatted on the fly according to their position in the paragraph, e.g. changing ‘ to “ (and vice versa) or ’ to ” (and vice versa).

2) Right now, Microsoft Word users use the " key to type a double-quote (either opening or closing) and the ' key to type either a single-quote (either opening or closing) or an apostrophe. In my perfect world, since the word processor could automatically convert between single- and double-quotes as needed, you’d only need one key (the " key) to type all quotation marks, and the ' key would be reserved exclusively for typing apostrophes. Therefore, the word processor would know that '-t-i-l means ʼtil, not ‘til.

But because U+2019 can represent either an apostrophe or single quote, it’s hard for word processors to do (1), which means they also can’t do (2).
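To show why (1) depends on unambiguous apostrophes, here is a minimal sketch of a quote re-nester in Python. It is an illustration under stated assumptions, not how any real word processor works: the `STYLES` table and the `renest` function are hypothetical, only two locales are covered, and it assumes every ‘ ’ “ ” in the input is a real quotation mark.

```python
OPENERS = {"\u2018", "\u201c"}  # ‘ and “
CLOSERS = {"\u2019", "\u201d"}  # ’ and ”

# Hypothetical locale table: (opening, closing) marks, outermost level first.
STYLES = {
    "US": [("\u201c", "\u201d"), ("\u2018", "\u2019")],
    "UK": [("\u2018", "\u2019"), ("\u201c", "\u201d")],
}


def renest(text: str, locale: str) -> str:
    """Rewrite quotation marks by nesting depth, following the locale's style."""
    levels = STYLES[locale]
    depth = 0
    out = []
    for ch in text:
        if ch in OPENERS:
            out.append(levels[depth % len(levels)][0])
            depth += 1
        elif ch in CLOSERS and depth > 0:
            depth -= 1
            out.append(levels[depth % len(levels)][1])
        else:
            out.append(ch)  # ordinary characters, including U+02BC, pass through
    return "".join(out)


uk_good = "\u2018She said \u201cdon\u02bct\u201d to me,\u2019 he wrote."  # U+02BC apostrophe
uk_bad = "\u2018She said \u201cdon\u2019t\u201d to me,\u2019 he wrote."   # U+2019 apostrophe

print(renest(uk_good, "US"))  # quotes are re-nested correctly
print(renest(uk_bad, "US"))   # the apostrophe is mistaken for a closing quote
```

With U+02BC, the UK-style text converts cleanly to US style. With U+2019, the apostrophe in “don’t” pops the nesting stack early and every closing mark after it comes out wrong.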

So whenever you see ‘Til we meet again with the apostrophe curled the wrong way, remember that’s because of the Unicode committee telling you to use U+2019 for English apostrophes.
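For the curious, the wrong-way curl usually comes from a heuristic along these lines. This is a deliberately naive Python sketch and an assumption on my part about how such auto-formatting behaves; `naive_smart_quotes` is a made-up name, not any product's real algorithm.

```python
import re


def naive_smart_quotes(text: str) -> str:
    """Curl straight single quotes using position alone.

    A ' at the start of the text or after whitespace is assumed to be an
    opening quote; every other ' is assumed to be a closing quote.
    """
    text = re.sub(r"(^|\s)'", "\\1\u2018", text)
    return text.replace("'", "\u2019")


print(naive_smart_quotes("don't"))               # mid-word: curls correctly
print(naive_smart_quotes("'Til we meet again"))  # word-initial: curls the wrong way
print(naive_smart_quotes("Up and at 'em"))       # same problem
```

The heuristic has nothing to go on except position, because the key it is reacting to means both "quote" and "apostrophe".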

Common bloody sense

For God’s sake, apostrophes are not closing quotation marks!

U+2019 (RIGHT SINGLE QUOTATION MARK) is classed as a closing punctuation mark. Its general category is Pf, which is defined in section 4.5 of the standard as meaning:

Punctuation, final quote (may behave like Ps [opening punctuation] or Pe [closing punctuation] depending on usage [They’re talking about right-to-left text –Ed])

When you use U+2019 to represent an apostrophe, it’s behaving as neither a Ps nor a Pe. (Or, if it is, it’s an unbalanced one. As of Unicode 6.3, the Unicode bidi algorithm attempts to detect bracket pairs for bidi processing. It would be unable to do the same for quotation marks, due to all these unbalanced “quotation marks”.)
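The imbalance is easy to measure. As a rough sketch (the `quote_balance` helper is my own invention, and counting categories is a simplification of real pair matching), tally the Pi and Pf characters in the standard's own example sentence:

```python
import unicodedata


def quote_balance(text: str) -> tuple[int, int]:
    """Return (number of Pi characters, number of Pf characters) in text."""
    initial = sum(1 for ch in text if unicodedata.category(ch) == "Pi")
    final = sum(1 for ch in text if unicodedata.category(ch) == "Pf")
    return initial, final


with_quote = "\u201cWe\u2019ve been here before.\u201d"     # apostrophe as U+2019
with_modifier = "\u201cWe\u02bcve been here before.\u201d"  # apostrophe as U+02BC

print(quote_balance(with_quote))     # one opening mark, two "closing" marks
print(quote_balance(with_modifier))  # one opening mark, one closing mark
```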

Compare that to U+02BC (MODIFIER LETTER APOSTROPHE) which has “apostrophe” right in its name. LOOK AT IT. RIGHT THERE, IT SAYS APOSTROPHE.

Which do you think makes more sense for representing apostrophes?

C’mon, let’s fix this.