Unicode® 5.2.0

Released: 2009 October 1 (Announcement)

Version 5.2.0 has been superseded by the latest version of the Unicode Standard.

Version 5.2.0 of the Unicode Standard consists of the core specification (The Unicode Standard, Version 5.2), together with the delta and archival code charts for this version, the 5.2.0 Unicode Standard Annexes, and the 5.2.0 Unicode Character Database (UCD). The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

Version 5.2.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 5.2.0, defined by: The Unicode Standard, Version 5.2 (Mountain View, CA: The Unicode Consortium, 2009. ISBN 978-1-936213-00-9). (http://www.unicode.org/versions/Unicode5.2.0/)

A complete specification of the contributory files for Unicode 5.2.0 is found on the page Components for 5.2.0. That page also provides the recommended reference format for Unicode Standard Annexes.

The text of The Unicode Standard, Version 5.2, as well as the delta and archival code charts, is available via the navigation links on this page. The charts and the Unicode Standard Annexes may be printed, while the other files may be viewed but not printed. The Unicode 5.2 Web Bookmarks page has links to all sections of the online text. A zipped version of the core specification (10 MB) is also available for download.

This page summarizes important changes to the standard from Unicode 5.1.0. The core specification and the Unicode Standard Annexes are not delta documents; they incorporate all of the textual changes for their updates for Version 5.2.0.

The Unicode Standard, Version 5.2, adds 6,648 characters and significantly improves the documentation of conformance requirements for the specification of normalization forms, canonical ordering, and the status of types of properties. Version 5.2 brings improved clarity of presentation in many Unicode Standard Annexes.

Seven new contemporary scripts have been added in Version 5.2: Bamum, Javanese, Lisu, Meetei Mayek, Samaritan, Tai Tham, and Tai Viet. New character additions to existing scripts now provide greater support for Abkhaz, Canadian Aboriginal Syllabics, Coptic, Devanagari, Khamti Shan, Malayalam, and Myanmar. Of particular note are Devanagari additions in support of Vedic Sanskrit. Encoding Vedic is significant because Sanskrit is one of the principal languages for the religious heritage of India, and because Vedic represents the earliest attested phase of the language.

The seven contemporary scripts and newly encoded individual characters expand support of language and orthographic communities in Africa, India, China, Central Asia, Southeast Asia, and the Middle East.

Other character additions include important modern use symbols and historic characters. With Unicode Version 5.2, scholars will now have access to the Gardiner set of Egyptian Hieroglyphs as well as other important historic scripts: Imperial Aramaic, Avestan, Kaithi, Old South Arabian, and Old Turkic. Several key symbol sets were added or expanded: the ARIB set of Japanese broadcasting symbols, additional number forms used in India, and currency symbols.

This latest version of the Unicode Standard has exactly the same character assignments as ISO/IEC 10646:2003 plus Amendments 1 through 6.

Unicode Version 5.2:

Updates stability policies to add property value stability guarantees for identifier-related properties, a guarantee of property, property alias and property value alias stability, and a policy on alias uniqueness.

Incorporates into Chapter 3, Conformance, the formal definitions of normalization formerly presented in Unicode Standard Annex #15, "Unicode Normalization Forms." Sections that were modified include sections 3.6 and 3.11.

Revises Section 3.5, Properties, to better explain the status of Normative, Informative, Provisional, and Contributory properties.

Clarifies the definition of Deprecated and its relationship to "strongly discouraged," and updates the set of Deprecated characters in view of this clearer definition.

Updates best practices for the use of replacement characters.

Improves the description of compatibility characters in Chapter 2, General Structure.

Adds standardized named sequences for Tamil.

Contains significant changes to properties and behavioral specifications.

Errata incorporated into Unicode 5.2.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 5.2.0, see the list of current Updates and Errata.

The Unicode Character Encoding Stability Policy has been updated. This update strengthens normalization stability, adds stability policy for case pairs, and extends constraints on property values. For the current statement of these policies, see Unicode Character Encoding Stability Policy.

6,648 new character assignments were made to the Unicode Standard, Version 5.2.0 (over and above what was in Unicode 5.1.0). The character repertoire corresponds to ISO/IEC 10646:2003 plus Amendments 1 through 6.

The exact list of characters added for Version 5.2.0 is documented in the file DerivedAge.txt in the Unicode Character Database. Among the characters added, there are a few notable cases which may impact existing implementations. These cases are highlighted here, so that implementers can check for any problematical assumptions in their code.

There are three new characters in the newly-encoded Kaithi script that will require changes in implementations which make hard-coded assumptions about composition during normalization. Most new characters added to the standard with decompositions cannot be generated by the operations toNFC() or toNFKC(), but these three can. Implementers should check their code carefully to ensure that it handles these three characters correctly.

U+1109A KAITHI LETTER DDDHA
U+1109C KAITHI LETTER RHA
U+110AB KAITHI LETTER VA
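One way to check an implementation against these three characters is a round trip through NFD and NFC. The sketch below uses Python's standard unicodedata module, which reflects whatever Unicode version the running interpreter ships (assumed here to be 5.2 or later); the helper name round_trips_through_nfc is illustrative only.

```python
import unicodedata

# The three Kaithi letters named in the text above.
KAITHI_COMPOSABLE = ["\U0001109A", "\U0001109C", "\U000110AB"]


def round_trips_through_nfc(ch: str) -> bool:
    """Return True if the character decomposes under NFD and NFC recomposes it."""
    decomposed = unicodedata.normalize("NFD", ch)
    recomposed = unicodedata.normalize("NFC", decomposed)
    return decomposed != ch and recomposed == ch


for ch in KAITHI_COMPOSABLE:
    # Print the code point, its name, and whether NFC regenerates it.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), round_trips_through_nfc(ch))
```

An implementation that only composes from a fixed, pre-5.2 table of composite characters would fail this check for all three letters.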

One of the compatibility CJK ideographs added in this version has a decomposition mapping to a unified CJK ideograph in Extension B. The effect of this is that for the first time a character in the BMP normalizes to a character not in the BMP:

toNFC(U+FA6C) = U+242EE

Implementers should check their implementations of normalization to ensure they are not assuming that no BMP character can normalize to a non-BMP character.
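The BMP to non-BMP case can be illustrated as follows. This is a minimal sketch using Python's unicodedata module (assumed to carry Unicode 5.2 or later data); utf16_units is a hypothetical helper that counts UTF-16 code units, included to show that normalization can lengthen a UTF-16 buffer.

```python
import unicodedata


def utf16_units(text: str) -> int:
    """Count the UTF-16 code units needed to encode the text."""
    return len(text.encode("utf-16-le")) // 2


source = "\uFA6C"  # compatibility ideograph in the BMP
normalized = unicodedata.normalize("NFC", source)

print(f"U+{ord(source):04X} -> U+{ord(normalized):04X}")
print("UTF-16 units before:", utf16_units(source))
print("UTF-16 units after: ", utf16_units(normalized))
```

Code that normalizes UTF-16 text in place, or sizes an output buffer on the assumption that a BMP character stays in the BMP, is the kind of implementation this case affects.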

Any hard-coded range assumptions about Unified CJK Ideographs in implementations may need fixing, because the end range for those has changed from U+9FC3 to U+9FCB in this version. There is also an entirely new block of CJK Unified Ideographs: CJK Unified Ideographs Extension C (U+2A700..U+2B73F), with characters encoded in the range U+2A700 to U+2B734.
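The effect of a stale range table can be sketched as below. Only the end points quoted above are taken from this page; the block start U+4E00 is an assumption not stated here, Extensions A and B are deliberately left out, and the function names are hypothetical.

```python
URO_START = 0x4E00      # assumption: start of the CJK Unified Ideographs block
URO_END_5_1 = 0x9FC3    # last assigned ideograph in Unicode 5.1, per the text
URO_END_5_2 = 0x9FCB    # last assigned ideograph in Unicode 5.2, per the text
EXT_C_ASSIGNED = (0x2A700, 0x2B734)  # assigned range of Extension C, per the text


def in_main_block(cp: int, end: int = URO_END_5_2) -> bool:
    """Test membership in the assigned part of the main ideograph block."""
    return URO_START <= cp <= end


def in_extension_c(cp: int) -> bool:
    """Test membership in the assigned part of Extension C."""
    first, last = EXT_C_ASSIGNED
    return first <= cp <= last


# A code point added in 5.2 is missed by a table frozen at 5.1.
print(in_main_block(0x9FC4, end=URO_END_5_1))  # stale bound
print(in_main_block(0x9FC4))                   # updated bound
print(in_extension_c(0x2B734))
```

Keeping the bounds in named constants, rather than repeating literals through the code, makes the next range extension a one-line change.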

There is now an assigned Hangul jamo character at U+11A7. This may interfere with some implementations' boundary testing for Hangul decomposition.
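The boundary issue can be illustrated with the arithmetic step that composes an LV syllable with a trailing consonant. The sketch below assumes the usual constants of the Hangul syllable algorithm (SBase = U+AC00, TBase = U+11A7, TCount = 28), which are not restated on this page; compose_lv_t is a hypothetical name. TBase itself stands for "no trailing consonant", so a test written as TBase <= t would wrongly accept the newly assigned U+11A7.

```python
S_BASE = 0xAC00   # first precomposed Hangul syllable (assumed constant)
T_BASE = 0x11A7   # one before the first trailing consonant (assumed constant)
T_COUNT = 28      # trailing consonants plus the "none" slot at index 0
S_COUNT = 11172   # number of precomposed Hangul syllables


def compose_lv_t(lv: int, t: int):
    """Compose an LV syllable with a trailing jamo, or return None."""
    s_index = lv - S_BASE
    is_lv = 0 <= s_index < S_COUNT and s_index % T_COUNT == 0
    t_index = t - T_BASE
    # Strict lower bound: index 0 (U+11A7) is not a trailing consonant.
    if is_lv and 0 < t_index < T_COUNT:
        return lv + t_index
    return None


print(compose_lv_t(0xAC00, 0x11A8))  # composes
print(compose_lv_t(0xAC00, 0x11A7))  # must not compose
```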

There are a number of new Hangul jamo characters added for support of Old Korean. Some of these are encoded in new blocks. An implementation may run into trouble if it assumes that the repertoire of conjoining jamos is fixed, or that all conjoining jamos occur only in the Hangul Jamo block, U+1100..U+11FF.
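A block-based test that no longer assumes a single jamo block might look like the sketch below. The two extension block ranges (Hangul Jamo Extended-A and Extended-B) are not given on this page and are stated here as assumptions to be checked against Blocks.txt; the test covers block membership only, not whether a code point is assigned.

```python
# Block ranges; only the first is quoted in the text above.
JAMO_BLOCKS = [
    (0x1100, 0x11FF),  # Hangul Jamo
    (0xA960, 0xA97F),  # Hangul Jamo Extended-A (assumption, see Blocks.txt)
    (0xD7B0, 0xD7FF),  # Hangul Jamo Extended-B (assumption, see Blocks.txt)
]


def in_conjoining_jamo_block(cp: int) -> bool:
    """Return True if the code point lies in any conjoining jamo block."""
    return any(first <= cp <= last for first, last in JAMO_BLOCKS)


print(in_conjoining_jamo_block(0x1100))
print(in_conjoining_jamo_block(0xA960))
print(in_conjoining_jamo_block(0x3131))  # compatibility jamo, not conjoining
```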

New uppercase parenthesized symbols have been added. Unlike the circled letter symbols, there are no uppercase/lowercase relationships for these new characters.
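The difference between the two symbol sets can be observed through case mapping. The sketch below uses Python's built-in str.lower(), which follows the Unicode data of the running interpreter (assumed to be 5.2 or later); has_lowercase_mapping is a hypothetical helper, and U+1F110 is taken as a representative of the new parenthesized capitals.

```python
circled_a = "\u24B6"            # CIRCLED LATIN CAPITAL LETTER A
parenthesized_a = "\U0001F110"  # PARENTHESIZED LATIN CAPITAL LETTER A


def has_lowercase_mapping(ch: str) -> bool:
    """Return True if lowercasing changes the character."""
    return ch.lower() != ch


print(has_lowercase_mapping(circled_a))
print(has_lowercase_mapping(parenthesized_a))
```

Code that pairs uppercase and lowercase symbol forms by offset, as is possible for the circled letters, should not extend that assumption to the parenthesized capitals.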

Character Assignment Overview

The new character additions were to both the BMP and the SMP (Plane 1). The following table shows the allocation of code points in Unicode 5.2.0. For more information on the specific characters, see the file DerivedAge.txt in the Unicode Character Database. For more details of character counts, see Appendix D, Changes from Previous Versions in Unicode 5.2.

Type            Count
Graphic         107,154
Format          142
Control         65
Private Use     137,468
Surrogate       2,048
Noncharacter    66
Reserved        867,169
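As a consistency check, the counts in the allocation above should add up to the size of the Unicode codespace, U+0000..U+10FFFF, which is 0x110000 code points. A short sketch, with the figures copied from this page:

```python
# Code point allocation in Unicode 5.2.0, as listed above.
ALLOCATION = {
    "Graphic": 107_154,
    "Format": 142,
    "Control": 65,
    "Private Use": 137_468,
    "Surrogate": 2_048,
    "Noncharacter": 66,
    "Reserved": 867_169,
}

CODESPACE_SIZE = 0x110000  # 17 planes of 65,536 code points each

total = sum(ALLOCATION.values())
print(total, total == CODESPACE_SIZE)
```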

There are several changes to conformance requirements in Unicode 5.2 that impact implementations. The most important of these are noted specifically here.

The formal definitions of normalization formerly presented in Unicode Standard Annex #15, "Unicode Normalization Forms," have been moved to Chapter 3, Conformance.

A key conformance clause on the modification of character sequences, C7, has been tightened to eliminate security risks resulting from deletion of noncharacters from uninterpreted text strings. In Unicode 5.2, the conformance requirements now disallow their removal, except where strings are explicitly being modified.
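For reference, the noncharacters affected by clause C7 can be identified with a simple predicate. The sketch below assumes the standard definition (U+FDD0..U+FDEF plus the last two code points of each of the 17 planes), which is not restated on this page; is_noncharacter is a hypothetical name.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True for the noncharacter code points of the codespace."""
    # Contiguous range in the Arabic Presentation Forms-A block.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of every plane: xxFFFE and xxFFFF.
    return (cp & 0xFFFE) == 0xFFFE


noncharacter_count = sum(is_noncharacter(cp) for cp in range(0x110000))
print(noncharacter_count)
```

Under the tightened clause, a process that merely passes text through would use such a predicate to detect noncharacters, not to strip them.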

The status of Normative, Informative, Provisional, and Contributory properties is clarified in Section 3.5, Properties.

The types of code points are clarified in Chapters 2, 3, and 4, with coordinated updates in Unicode Standard Annex #44, "Unicode Character Database."

The PropertyAliases.txt file in the Unicode Character Database is now designated as the normative listing of Unicode character properties and their names.

The BidiTest.txt file in the Unicode Character Database is a new feature in Unicode 5.2. This file contains test cases for assessing conformance to the Unicode Bidirectional Algorithm.

There are additional changes in Unicode conformance requirements due to changes in the UCD data files and the Unicode Standard Annexes listed below.

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 5.2.0 can be found in UAX #44, Unicode Character Database. The most significant changes include:

There are new case-related properties in DerivedCoreProperties.txt and DerivedNormalizationProps.txt. The new case-related derived properties are NFKC_Casefold, Case_Ignorable, Cased, Changes_When_Lowercased, Changes_When_Uppercased, Changes_When_Titlecased, Changes_When_Casemapped, Changes_When_Casefolded, and Changes_When_NFKC_Casefolded.
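Some of these derived properties can be approximated at run time. The sketch below assumes that they are defined in the form toLowercase(NFD(X)) ≠ NFD(X), and likewise for uppercase and case folding; it relies on Python's own case mapping and on the Unicode version of the running interpreter, so results may differ from the 5.2.0 data files. The function names are hypothetical.

```python
import unicodedata


def _nfd(ch: str) -> str:
    return unicodedata.normalize("NFD", ch)


def changes_when_lowercased(ch: str) -> bool:
    """Approximate Changes_When_Lowercased for a single character."""
    nfd = _nfd(ch)
    return nfd.lower() != nfd


def changes_when_uppercased(ch: str) -> bool:
    """Approximate Changes_When_Uppercased for a single character."""
    nfd = _nfd(ch)
    return nfd.upper() != nfd


def changes_when_casefolded(ch: str) -> bool:
    """Approximate Changes_When_Casefolded for a single character."""
    nfd = _nfd(ch)
    return nfd.casefold() != nfd


for ch in ("A", "a", "\u00DF", "1"):
    print(repr(ch), changes_when_lowercased(ch),
          changes_when_uppercased(ch), changes_when_casefolded(ch))
```

For conformance work, the values published in DerivedCoreProperties.txt remain the reference; a run-time approximation of this kind is mainly useful as a cross-check.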

Contributory is considered to be a distinct status for a Unicode character property. Contributory properties are neither normative nor informative. The status of all character properties is listed in the property table in UAX #44, Unicode Character Database.

Two new joining groups, FARSI YEH and NYA, were added. These new joining groups may require an update to implementations of Arabic shaping rules.

There is a new data file in the Unicode Character Database, CJKRadicals.txt, which maps the radical numbers used in the Unicode Radical-Stroke Index to the actual Unicode code points for the corresponding radicals. Unlike other files, the first field is not a code point number.
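A parser for this file therefore has to treat the first field as a label, not as hexadecimal. The sketch below assumes a layout of three semicolon-separated fields (radical number; radical character code point; unified ideograph code point) with "#" comments; the layout and the two sample lines are illustrative assumptions to be verified against the actual file, and all names are hypothetical.

```python
from typing import List, NamedTuple, Optional


class RadicalEntry(NamedTuple):
    radical: str                  # radical number, kept as text (not a code point)
    radical_char: Optional[int]   # code point of the radical character
    ideograph: Optional[int]      # code point of the unified ideograph


# Illustrative sample lines only; not a copy of the real data file.
SAMPLE = """\
# radical number; radical character; unified ideograph
1; 2F00; 4E00
2; 2F01; 4E28
"""


def _hex_or_none(field: str) -> Optional[int]:
    return int(field, 16) if field else None


def parse_cjk_radicals(text: str) -> List[RadicalEntry]:
    """Parse lines in the assumed three-field layout."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 3:
            continue  # skip lines that do not fit the assumed layout
        entries.append(RadicalEntry(fields[0],
                                    _hex_or_none(fields[1]),
                                    _hex_or_none(fields[2])))
    return entries


for entry in parse_cjk_radicals(SAMPLE):
    print(entry)
```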

The Unihan.txt file in Unihan.zip is split into 8 separate files within the zip file, organized by category. See UAX #38, Unicode Han Database (Unihan) for details.

In Version 5.2, many of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.