Manifesto

This document contains special characters. Without proper rendering support, you may see question marks, boxes, or other symbols.

Our goal is to promote the use and support of the UTF-8 encoding and to convince others that it should be the default choice for storing text strings in memory or on disk, for communication and for all other uses. We believe that our approach improves performance, reduces the complexity of software and helps prevent many Unicode-related bugs. We suggest that other encodings of Unicode (or text, in general) belong to rare edge cases of optimization and should be avoided by mainstream users.

In particular, we believe that the very popular UTF-16 encoding (often mistakenly referred to as ‘widechar’ or simply ‘Unicode’ in the Windows world) has no place in library APIs except for specialized text-processing libraries, e.g. ICU.

This document also recommends choosing UTF-8 for internal string representation in Windows applications, despite the fact that this standard is less popular there, both due to historical reasons and the lack of native UTF-8 support in the API. We believe that, even on this platform, the following arguments outweigh the lack of native support. Also, we recommend forgetting forever what ‘ANSI codepages’ are and what they were used for. It is in the user’s bill of rights to mix any number of languages in any text string.

Across the industry, many localization-related bugs have been blamed on programmers’ lack of knowledge of Unicode. We, however, believe that for an application that does not specialize in text, the infrastructure can and should make it possible for the program to be unaware of encoding issues. For instance, a file copy utility should not be written differently to support non-English file names. In this manifesto, we will also explain what a programmer should do if they do not want to dive into all the complexities of Unicode and do not really care about what is inside the string.

Furthermore, we would like to suggest that counting or otherwise iterating over Unicode code points should not be seen as a particularly important task in text-processing scenarios. Many developers mistakenly see code points as a kind of successor to ASCII characters. This has led to software design decisions such as Python’s O(1) code point access for strings. The truth, however, is that Unicode is inherently more complicated, and there is no universal definition of such a thing as a ‘Unicode character’. We see no particular reason to favor Unicode code points over Unicode grapheme clusters, code units or perhaps even words in a language. On the other hand, seeing UTF-8 code units (bytes) as the basic unit of text seems particularly useful for many tasks, such as parsing commonly used textual data formats. This is due to a particular feature of this encoding. Graphemes, code units, code points and other relevant Unicode terms are explained in Section 5. Operations on encoded text strings are discussed in Section 7.

In 1988, Joseph D. Becker published the first Unicode draft proposal. At the basis of his design was the naïve assumption that 16 bits per character would suffice. In 1991, the first version of the Unicode standard was published, with code points limited to 16 bits. In the following years many systems added support for Unicode and switched to the UCS-2 encoding. It was especially attractive for new technologies, such as the Qt framework (1992), Windows NT 3.1 (1993) and Java (1995).

However, it was soon discovered that 16 bits per character would not do for Unicode. In 1996, the UTF-16 encoding was created so that existing systems would be able to work with non-16-bit characters. This effectively nullified the rationale behind choosing a 16-bit encoding in the first place, namely being a fixed-width encoding. Currently Unicode spans over 109,449 characters, about 74,500 of them being CJK ideographs.

Nagoya City Science Museum. Photo by Vadim Zlotnik.

Microsoft has often mistakenly used ‘Unicode’ and ‘widechar’ as synonyms for both ‘UCS-2’ and ‘UTF-16’. Furthermore, since UTF-8 cannot be set as the encoding for the narrow-string WinAPI, one must compile the code with the UNICODE define. Windows C++ programmers are taught that Unicode must be done with ‘widechars’ (or worse, with the compiler-setting-dependent TCHARs, which allow a programmer to opt out of supporting all Unicode code points). As a result of this mess, many Windows programmers are now quite confused about what the right thing to do about text is.

At the same time, in the Linux and Web worlds, there is a silent agreement that UTF-8 is the best encoding to use for Unicode. Not only does it give a shorter representation for English, and therefore for computer languages (such as C++, HTML and XML), than for any other text, it is also seldom less efficient than UTF-16 for commonly used character sets.

In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes.

UTF-8 is endianness independent. UTF-16 comes in two flavors: UTF-16LE and UTF-16BE (for the two different byte orders, respectively). Here we name them collectively as UTF-16.

Widechar is 2 bytes in size on some platforms, 4 on others.

UTF-8 and UTF-32 yield the same order when sorted lexicographically. UTF-16 does not.

UTF-8 favors efficiency for English letters and other ASCII characters (one byte per character) while UTF-16 favors several Asian character sets (2 bytes instead of 3 in UTF-8). This is what made UTF-8 the favorite choice in the Web world, where English HTML/XML tags are intermixed with any-language text. Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both in UTF-16 and UTF-8.

UTF-16 is often misused as a fixed-width encoding, even by Windows applications themselves: in the plain Windows edit control (until Vista), it takes two backspaces to delete a character that takes 4 bytes in UTF-16. On Windows 7, the console displays such characters as two invalid characters, regardless of the font being used.

Many third-party libraries for Windows do not support Unicode: they accept narrow-string parameters and pass them to the ANSI API, sometimes even for file names. In the general case, it is impossible to work around this, as a string may not be representable completely in any ANSI code page (if it contains characters from a mix of Unicode blocks). What Windows programmers normally do for file names is get an 8.3 path to the file (if it already exists) and feed it into such a library. This is not possible if the library is supposed to create a non-existing file. It is not possible if the path is very long and the 8.3 form is longer than MAX_PATH. It is not possible if short-name generation is disabled in the OS settings.

In C++, there is no way to return Unicode from std::exception::what() other than using UTF-8. There is no way to support Unicode for localeconv other than using UTF-8.

UTF-16 remains popular today, even outside the Windows world. Qt, Java, C#, Python (prior to the CPython v3.3 reference implementation, see below) and ICU all use UTF-16 for internal string representation.

Let’s go back to the file copy utility. In the UNIX world, narrow strings are considered UTF-8 by default almost everywhere. Because of that, the author of the file copy utility would not need to care about Unicode. Once tested on ASCII strings for file name arguments, it would work correctly for file names in any language, as arguments are treated as opaque cookies. The code of the file copy utility would not need to change at all to support foreign languages. fopen() would accept Unicode seamlessly, and so would argv.

Now let’s see how to do this on Microsoft Windows, a UTF-16 based architecture. Making a file copy utility that can accept file names in a mix of several different Unicode blocks (languages) there requires advanced trickery. First, the application must be compiled as Unicode-aware. In that case, it cannot have a main() function with standard-C parameters; it will instead receive a UTF-16 encoded argv. To convert a Windows program written with narrow text in mind to support Unicode, one has to do deep refactoring and take care of each and every string variable.

The standard library shipped with MSVC is poorly implemented with respect to Unicode support. It forwards narrow-string parameters directly to the OS ANSI API, and there is no way to override this: changing std::locale does not work. It is impossible to open a file with a Unicode name on MSVC using the standard features of C++. The standard way to open a file is:

std::fstream fout("abc.txt");

The only way around this is Microsoft’s own hack, a non-standard extension that accepts a wide-string parameter.

On Windows, the HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\ACP registry key enables receiving non-ASCII characters through the narrow API, but only from a single ANSI codepage. An unimplemented value of 65001 would probably resolve the cookie issue on Windows. If Microsoft implements support for this ACP value, it will help wider adoption of UTF-8 on the Windows platform.

For Windows programmers and multi-platform library vendors, we further discuss our approach to handling text strings and refactoring programs for better Unicode support in the How to do text on Windows section.

Here is an excerpt of the definitions regarding characters, code points, code units and grapheme clusters according to the Unicode Standard with our comments. You are encouraged to refer to the relevant sections of the standard for a more detailed description.

Приве́т नमस्ते שָׁלוֹם How many characters do you see?

Code point: Any numerical value in the Unicode codespace. [§3.4, D10] For instance: U+3243F.

Code unit: The minimal bit combination that can represent a unit of encoded text. [§3.9, D77] For example, UTF-8, UTF-16 and UTF-32 use 8-bit, 16-bit and 32-bit code units respectively. The above code point will be encoded as four code units ‘f0 b2 90 bf’ in UTF-8, as two code units ‘d889 dc3f’ in UTF-16 and as a single code unit ‘0003243f’ in UTF-32. Note that these are just sequences of groups of bits; how they are stored on an octet-oriented medium depends on the endianness of the particular encoding. When stored, the above UTF-16 code units become ‘d8 89 dc 3f’ in UTF-16BE and ‘89 d8 3f dc’ in UTF-16LE.

Abstract character: A unit of information used for the organization, control, or representation of textual data. [§3.4, D7] The standard further says in §3.1: ‘For the Unicode Standard, [...] the repertoire is inherently open. Because Unicode is a universal encoding, any abstract character that could ever be encoded is a potential candidate to be encoded, regardless of whether the character is currently known.’ The definition is indeed abstract: whatever one can think of as a character is an abstract character. For example, tengwar letter ungwe is an abstract character, although it is not yet representable in Unicode.

Encoded character, coded character: A mapping between a code point and an abstract character. [§3.4, D11] For example, U+1F428 is a coded character which represents the abstract character 🐨 koala. This mapping is neither total, nor injective, nor surjective:

- Surrogates, noncharacters and unassigned code points do not correspond to abstract characters at all.

- Some abstract characters can be encoded by different code points; U+03A9 greek capital letter omega and U+2126 ohm sign both correspond to the same abstract character ‘Ω’, and must be treated identically.

- Some abstract characters cannot be encoded by a single code point. These are represented by sequences of coded characters. For example, the only way to represent the abstract character ю́ cyrillic small letter yu with acute is by the sequence U+044E cyrillic small letter yu followed by U+0301 combining acute accent. Moreover, for some abstract characters, there exist representations using multiple code points in addition to the single coded character form. The abstract character ǵ can be coded by the single code point U+01F5 latin small letter g with acute, or by the sequence <U+0067 latin small letter g, U+0301 combining acute accent>.

User-perceived character: Whatever the end user thinks of as a character. This notion is language dependent. For instance, ‘ch’ is two letters in English and Latin, but considered to be one letter in Czech and Slovak.

Grapheme cluster: A sequence of coded characters that ‘should be kept together’. [§2.11] Grapheme clusters approximate the notion of user-perceived characters in a language-independent way. They are used for, e.g., cursor movement and selection.

Glyph: A particular shape within a font. Fonts are collections of glyphs designed by a type designer. It is the responsibility of the text shaping and rendering engine to convert a sequence of code points into a sequence of glyphs within the specified font. The rules for this conversion may be complicated, locale dependent, and are beyond the scope of the Unicode standard.

‘Character’ may refer to any of the above. The Unicode Standard uses it as a synonym for coded character. [§3.4] When a programming language or a library documentation says ‘character’, it typically means a code unit. When an end user is asked about the number of characters in a string, they will count the user-perceived characters. A programmer might count characters as code units, code points, or grapheme clusters, depending on their level of Unicode expertise. For example, this is how Twitter counts characters. In our opinion, a string length function does not necessarily have to return one for the string ‘🐨’ to be considered Unicode-compliant.

So, most Unicode code points take the same number of bytes in UTF-8 and in UTF-16. This includes Russian, Hebrew, Greek and all non-BMP code points, which take 2 or 4 bytes in both encodings. Latin letters, together with punctuation marks and the rest of ASCII, take more space in UTF-16, while some Asian characters take more in UTF-8. Couldn’t Asian programmers, theoretically, object to dumping UTF-16, which would save them 50% of the memory per character?

Here’s the reality. Saving half the memory is true only in artificially constructed examples containing only characters in the U+0800 to U+FFFF range. However, computer-to-computer text interfaces dominate all other usages of text. This includes XML, HTTP, filesystem paths and configuration files; they all use almost exclusively ASCII characters, and in fact UTF-8 is very popular in the respective Asian countries.

For dedicated storage of Chinese books, UTF-16 may still be used as a fair optimization. As soon as the text is retrieved from such storage, it should be converted to the standard compatible with the rest of the world. In any case, if storage is at a premium, lossless compression will be used, and then UTF-8 and UTF-16 take roughly the same space. Furthermore, ‘in the said languages, a glyph conveys more information than a [L]atin character so it is justified for it to take more space.’ (Tronic, UTF-16 considered harmful).

Here are the results of a simple experiment. The space used by the HTML source of some web page (the Japan article, retrieved from the Japanese Wikipedia on 2012-01-01) is shown in the first column. The second column shows the results for the text with markup removed, that is, ‘select all, copy, paste into a plain text file’.

                  HTML Source (Δ UTF-8)    Dense text (Δ UTF-8)
UTF-8             767 KB (0%)              222 KB (0%)
UTF-16            1,186 KB (+55%)          176 KB (−21%)
UTF-8 zipped      179 KB (−77%)            83 KB (−63%)
UTF-16LE zipped   192 KB (−75%)            76 KB (−66%)
UTF-16BE zipped   194 KB (−75%)            77 KB (−65%)

As can be seen, UTF-16 takes about 50% more space than UTF-8 on real data, saves only 20% for dense Asian text, and hardly competes with general-purpose compression algorithms. The Chinese translation of this manifesto takes 58.8 KiB in UTF-16, and only 51.7 KiB in UTF-8.

Popular text-based data formats (e.g. CSV, XML, HTML, JSON, RTF and the source code of computer programs) often use ASCII characters as structure control elements and may contain both ASCII and non-ASCII text data strings. Working with a variable-length encoding, where ASCII-inherited code points are shorter than other code points, may seem like a difficult task, because encoded character boundaries within the string are not immediately known. This has driven software architects to opt for a fixed-width UCS-4 encoding (e.g. Python v3.3). In fact, this is both unnecessary and does not solve any real problem we know of.

By design, UTF-8 guarantees that the encoded value of an ASCII character, or an ASCII substring, will never match part of a multi-byte encoded character. The same is true for UTF-16. In both encodings, every code unit of a multi-unit encoded code point has its most significant bit set to 1.

To find, say, the ‘<’ sign marking the beginning of an HTML tag, or an apostrophe (') in a UTF-8 encoded SQL statement to defend against SQL injection, do as you would for an all-English plain ASCII string. The encoding guarantees this works: every non-ASCII character is encoded in UTF-8 as a sequence of bytes, each of them having a value greater than 127. This leaves no place for a collision in the naïve algorithm, which is simple, fast and elegant, with no need to care about encoded character boundaries.

Also, you can search for a non-ASCII, UTF-8 encoded substring in a UTF-8 string as if it were a plain byte array; there is no need to mind code point boundaries. This is thanks to another design feature of UTF-8: a leading byte of an encoded code point can never hold a value that corresponds to one of the trailing bytes of any other code point.

As we already noted, there is a popular idea that counting, splitting, indexing or otherwise iterating over code points in a Unicode string should be considered a frequent and important operation. In this section, we review this in further detail.

This is a common mistake made by those who think that UTF-16 is a fixed-width encoding. It is not. In fact, UTF-16 is a variable-length encoding. Refer to this FAQ if you deny the existence of non-BMP characters.

This depends on the meaning of the misused word ‘character’. It is true that we can count code units and code points in constant time in UTF-32. However, code points do not correspond to user-perceived characters. Even in the Unicode formalism, some code points correspond to coded characters and some to noncharacters.

We think that the importance of code points is frequently overstated. This is due to a common misunderstanding of the complexity of Unicode, which merely reflects the complexity of human languages. It is easy to tell how many characters there are in ‘Abracadabra’, but let’s go back to the following string:

Приве́т नमस्ते שָׁלוֹם

It consists of 22 (!) code points, but only 16 grapheme clusters. It may be reduced to 20 code points if converted to NFC. Yet, the number of code points in it is irrelevant to almost any software engineering task, with perhaps the sole exception of converting the string to UTF-32. For example:

For cursor movement, text selection and the like, grapheme clusters shall be used.

For limiting the length of a string in input fields, file formats, protocols, or databases, the length is measured in code units of some predetermined encoding. The reason is that any length limit is derived from the fixed amount of memory allocated for the string at a lower level, be it in memory, on disk or in a particular data structure.

The size of the string as it appears on the screen is unrelated to the number of code points in the string. One has to communicate with the rendering engine for this. A code point does not necessarily occupy one column, even in monospaced fonts and terminals. POSIX takes this into account.

No, because the number of user-perceived characters that can be represented in Unicode is virtually infinite. Even in practice, most characters do not have a fully composed form. For example, the NFD string from the example above, which consists of three real words in three real languages, consists of 20 code points in NFC. This is still far more than the 16 user-perceived characters it has.

Unicode support in libraries and programming languages is frequently judged by the value returned for the ‘length of the string’ operation. By this evaluation of Unicode support, most popular languages, such as C#, Java, and even ICU itself, would not support Unicode. For example, the length of the one-character string ‘🐨’ is often reported to be 2 where UTF-16 is used for the internal string representation, and 4 for languages that internally use UTF-8. The source of the misconception is that the specifications of these languages use the word ‘character’ to mean a code unit, while the programmer expects it to be something else.

That said, the code unit count returned by those APIs is of the highest practical importance. When writing a UTF-8 string to a file, it is the length in bytes which is important. Counting any other type of ‘characters’ is, on the other hand, not very helpful.

UTF-16 is the worst of both worlds, being both variable length and too wide. It exists only for historical reasons and creates a lot of confusion. We hope that its usage will further decline.

Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and convert them back and forth when using platform APIs that don’t support UTF-8 but accept wide strings (e.g. the Windows API). Performance is seldom an issue of any relevance when dealing with string-accepting system APIs (e.g. UI code and file system APIs), and there is a great advantage to using the same encoding everywhere else in the application, so we see no sufficient reason to do otherwise.

Speaking of performance, machines often use strings to communicate (e.g. HTTP headers, XML, SOAP). Many see this as a mistake in itself, but regardless, it is nearly always done in English and in ASCII, giving UTF-8 a further advantage there. Using different encodings for different kinds of strings significantly increases complexity and the resulting bugs.

In particular, we believe that adding wchar_t to the C++ standard was a mistake, and so are the Unicode additions to C++11. What must be demanded from implementations, though, is that the basic execution character set be capable of storing any Unicode data. Then, every std::string or char* parameter would be Unicode-compatible. ‘If it accepts text, it should be Unicode-compatible’, and with UTF-8 this is easy to achieve.

The standard facets have many design flaws. These include std::numpunct, std::moneypunct and std::ctype not supporting variable-length encoded characters (non-ASCII UTF-8 and non-BMP UTF-16), or not having the information to perform the conversions. They must be fixed:

decimal_point() and thousands_sep() shall return a string rather than a single code unit. This is how C locales do it, through the localeconv function, albeit in a non-customizable way.

toupper() and tolower() shall not be phrased in terms of code units, as that does not work in Unicode. For example, the Latin ﬄ ligature must be converted to FFL and the German ß to SS (there is a capital form ẞ, but the casing rules follow the traditional ones). In addition, some languages (e.g. Greek) have special final forms of some lowercase letters, so case conversion routines must be aware of a letter’s position to perform the conversion correctly.

This section is dedicated to multi-platform library development and to Windows programming. The problem with the Windows platform is that it does not (yet) support Unicode-compatible narrow-string system APIs. The only way to pass Unicode strings to the Windows API is by converting them to UTF-16 (also known as wide strings).

Note that our guidelines differ significantly from Microsoft’s original guide to Unicode conversion. Our approach is based on performing the wide-string conversion as close to API calls as possible, and never holding wide string data. In the previous sections we explained why this typically results in better performance, stability, code simplicity and interoperability with other software.

Do not use wchar_t or std::wstring anywhere other than at points adjacent to APIs accepting UTF-16.

Do not use _T("") or L"" literals anywhere other than in parameters to APIs accepting UTF-16.

Do not use types, functions, or their derivatives that are sensitive to the UNICODE constant, such as LPTSTR, CreateWindow() or the _T() macro. Instead, use LPWSTR, CreateWindowW() and explicit L"" literals.

Yet, UNICODE and _UNICODE are always defined, so that passing narrow UTF-8 strings to the ANSI WinAPI does not get silently compiled. This is done in the VS project settings, under the name Use Unicode character set.

std::string and char* variables are considered UTF-8, anywhere in the program.

If you have the privilege of writing in C++, the narrow()/widen() conversion functions below can be handy for inline conversion syntax. Of course, any other UTF-8/UTF-16 conversion code would do.

Only use Win32 functions that accept widechars (LPWSTR), never those which accept LPTSTR or LPSTR. Pass parameters this way: ::SetWindowTextW(widen(someStdString or "string literal").c_str()). The policy uses the conversion functions described below. See also the note on conversion performance.

With MFC strings:

CString someoneElse; // something that arrived from MFC.
// Converted as soon as possible, before passing any further away from the API call:
std::string s = str(boost::format("Hello %s\n") % narrow(someoneElse));
AfxMessageBox(widen(s).c_str(), L"Error", MB_OK);

For .NET developers: using the native UTF-16 based string class may be hard to avoid. Remember that this implementation detail leaks heavily through the interface of this class. For example, the string[index] operation may return part of a character (as it would with a UTF-8 byte array). When serializing strings to output files or communication devices, remember to specify Encoding.UTF8. Be ready to pay the performance penalties for conversion, e.g. in ASP.NET web applications, which typically generate UTF-8 HTML output.

Always produce text output files in UTF-8.

Using fopen() should anyway be avoided for RAII/OOD reasons. However, if necessary, use _wfopen() and WinAPI conventions as described above.

Never pass std::string or const char* filename arguments to the fstream family. The MSVC CRT does not support UTF-8 arguments, but it has a non-standard extension which should be used as follows:

Convert std::string arguments to std::wstring with widen: std::ifstream ifs(widen("hello"), std::ios_base::binary); We will have to manually remove the conversion when MSVC’s attitude to fstream changes.

This code is not multi-platform and may have to be changed manually in the future.

Alternatively use a set of wrappers that hide the conversions.

This guideline uses the conversion functions from the Boost.Nowide library (it is not yet a part of Boost):

std::string narrow(const wchar_t *s);
std::wstring widen(const char *s);
std::string narrow(const std::wstring &s);
std::wstring widen(const std::string &s);

The library also provides a set of wrappers for commonly used standard C and C++ library functions that deal with files, as well as means of reading and writing UTF-8 through iostreams.

These functions and wrappers are easy to implement using Windows’ MultiByteToWideChar and WideCharToMultiByte functions. Any other (possibly faster) conversion routines can be used.

This manifesto was written by Pavel Radzivilovsky, Yakov Galka and Slava Novgorodov. It is the result of our experience and research into real-world Unicode issues and mistakes made by real-world programmers. Our goal is to improve awareness of text issues and to inspire industry-wide changes that make Unicode-aware programming easier, ultimately improving the experience of users of programs written by human engineers. None of us is involved in the Unicode consortium.

Special thanks to Glenn Linderman for providing information about Python, and to Markus Künne, Jelle Geerts, Lazy Rui and Jan Rüegg for reporting bugs and typos in this document.

Much of the text was inspired by discussions on StackOverflow initiated by Artyom Beilis, the author of Boost.Locale. Additional inspiration came from the development conventions at VisionMap and Michael Hartl’s tauday.org.

You can leave comments/feedback on the Facebook UTF-8 Everywhere page. Your help and feedback are much appreciated.



Bitcoin donate to: 1UTF8gQmvChQ4MwUHT6XmydjUt9TsuDRn

The cash will be used for research and promotion.