Unicode® 12.0.0

2019 March 5 (Announcement)

Version 12.0.0 has been superseded by the latest version of the Unicode Standard.

This page summarizes the important changes for the Unicode Standard, Version 12.0.0. This version supersedes all previous versions of the Unicode Standard.

Unicode 12.0 adds 554 characters, for a total of 137,928 characters. These additions include 4 new scripts, for a total of 150 scripts, as well as 61 new emoji characters.

The new scripts and characters in Version 12.0 add support for lesser-used languages and unique written requirements worldwide. Funds from the Adopt-a-Character program provided support for some of these additions. The new scripts and characters include:

Elymaic, historically used to write Achaemenid Aramaic in the southwestern portion of modern-day Iran

Nandinagari, historically used to write Sanskrit and Kannada in southern India

Nyiakeng Puachue Hmong, used to write modern White Hmong and Green Hmong languages in Laos, Thailand, Vietnam, France, Australia, Canada, and the United States

Wancho, used to write the modern Wancho language in India, Myanmar, and Bhutan

Miao script additions to write several Miao and Yi dialects in China

Popular symbol additions:

61 emoji characters, including several new emoji for accessibility. For complete statistics regarding all emoji as of Unicode 12.0, see Emoji Counts. For more information about emoji additions for Unicode 12.0, including new emoji ZWJ sequences and emoji modifier sequences, see Emoji Recently Added, v12.0.

Marca registrada sign

Heterodox and fairy chess symbols

Additional support for lesser-used languages and scholarly work was extended worldwide, including:

Hiragana and Katakana small letters, used to write archaic Japanese

Tamil historic fractions and symbols, used in South India

Lao letters, used to write Pali

Latin letters used in Egyptological and Ugaritic transliteration

Hieroglyph format controls, enabling full formatting of quadrats for Egyptian Hieroglyphs

Important glyph corrections, including:

Bopomofo, with significant improvements

Won currency sign, changed to align with modern Korean fonts

Synchronization

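Several other important Unicode specifications have been updated for Version 12.0. The following four Unicode Technical Standards are versioned in synchrony with the Unicode Standard, because their data files cover the same repertoire. All have been updated to Version 12.0:

UTS #10, Unicode Collation Algorithm

UTS #39, Unicode Security Mechanisms

UTS #46, Unicode IDNA Compatibility Processing

UTS #51, Unicode Emoji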

Some of the changes in Version 12.0 and associated Unicode Technical Standards may require modifications to implementations. For more information, see the migration and modification sections of UTS #10, UTS #39, UTS #46, and UTS #51.

This version of the Unicode Standard is also synchronized with ISO/IEC 10646:2017, fifth edition, plus Amendments 1 and 2 to the fifth edition, plus the following additions from the CD for the sixth edition:

61 emoji characters

U+1E94B ADLAM NASALIZATION MARK

See Sections D through H below for additional details regarding the changes in this version of the Unicode Standard, its associated annexes, and the other synchronized Unicode specifications.

Version 12.0 of the Unicode Standard consists of:

The core specification

The code charts (delta and archival) for this version

The Unicode Standard Annexes

The Unicode Character Database (UCD)

The core specification gives the general principles, requirements for conformance, and guidelines for implementers. The code charts show representative glyphs for all the Unicode characters. The Unicode Standard Annexes supply detailed normative information about particular aspects of the standard. The Unicode Character Database supplies normative and informative data for implementers to allow them to implement the Unicode Standard.

The core specification is available as a single PDF (14 MB) for viewing. Links to the individual chapters and appendices of the core specification are also available in the navigation bar on the left of this page.

Several sets of code charts are available. They serve different purposes:

The latest set of code charts for the Unicode Standard is available online. Those charts are always the most current code charts available, and may be updated at any time. The charts are organized by scripts and blocks for easy reference. An online index by character name is also provided.

For Unicode 12.0.0 in particular, two additional sets of code chart pages are provided:

A set of delta code charts showing the new blocks and any blocks in which characters were added for Unicode 12.0.0. The new characters are visually highlighted in the charts.

A set of archival code charts that represents the entire set of characters, names and representative glyphs at the time of publication of Unicode 12.0.0.

The delta and archival code charts are a stable part of this release of the Unicode Standard. They will never be updated.

Links to the individual Unicode Standard Annexes are available in the navigation bar on the left of this page. The list of significant changes in the content of the Unicode Standard Annexes for Version 12.0 can be found in Section G below.

Data files for Version 12.0 of the Unicode Character Database are available. The ReadMe.txt in that directory provides a roadmap to the functions of the various subdirectories. Zipped versions of the UCD for bulk download are available, as well.

Version 12.0.0 of the Unicode Standard should be referenced as:

The Unicode Consortium. The Unicode Standard, Version 12.0.0, (Mountain View, CA: The Unicode Consortium, 2019. ISBN 978-1-936213-22-1)

http://www.unicode.org/versions/Unicode12.0.0/

The terms “Version 12.0” or “Unicode 12.0” are abbreviations for the full version reference, Version 12.0.0.

The citation and permalink for the latest published version of the Unicode Standard are:

The Unicode Consortium. The Unicode Standard.

http://www.unicode.org/versions/latest/

A complete specification of the contributory files for Unicode 12.0 is found on the page Components for 12.0.0. That page also provides the recommended reference format for Unicode Standard Annexes. For examples of how to cite particular portions of the Unicode Standard, see also the Reference Examples.

Errata incorporated into Unicode 12.0 are listed by date in a separate table. For corrigenda and errata after the release of Unicode 12.0, see the list of current Updates and Errata.

There were no significant changes to the Stability Policy of the core specification between Unicode 11.0 and Unicode 12.0.

Four new scripts were added with accompanying new block descriptions:

Script                    Number of Characters
Elymaic                   23
Nandinagari               65
Nyiakeng Puachue Hmong    71
Wancho                    59
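As a quick cross-check of the figures quoted on this page, the following Python sketch tallies the per-script counts listed above against the overall totals. The variable names are illustrative and are not taken from any Unicode data file.

```python
# Counts quoted on this page for Unicode 12.0.
NEW_SCRIPT_CHARACTERS = {
    "Elymaic": 23,
    "Nandinagari": 65,
    "Nyiakeng Puachue Hmong": 71,
    "Wancho": 59,
}
TOTAL_ADDED = 554           # characters added in Version 12.0
TOTAL_CHARACTERS = 137_928  # total characters as of Version 12.0

# Characters that belong to the four new scripts.
in_new_scripts = sum(NEW_SCRIPT_CHARACTERS.values())
# Remaining additions (emoji, symbols, additions to existing scripts, and so on).
in_other_additions = TOTAL_ADDED - in_new_scripts
# Total implied for the previous version.
previous_total = TOTAL_CHARACTERS - TOTAL_ADDED

print(in_new_scripts, in_other_additions, previous_total)  # 218 336 137374
```

In other words, 218 of the 554 new characters belong to the four new scripts, and the remaining 336 are spread across the other additions described on this page.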

Changes in the Unicode Standard Annexes are listed in Section G.

Character Assignment Overview

554 characters have been added. Most character additions are in new blocks, but there are also character additions to a number of existing blocks. For details, see delta code charts.

There are no significant new conformance requirements in Unicode 12.0.

The detailed listing of all changes to the contributory data files of the Unicode Character Database for Version 12.0 can be found in UAX #44, Unicode Character Database. The changes listed there include character additions and property revisions to existing characters that will affect implementations. Some of the important impacts on implementations migrating from earlier versions of the standard are highlighted in Section M.

In Version 12.0, some of the Unicode Standard Annexes have had significant revisions. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UAX, linked directly from the following list of UAXes.

Unicode Standard Annex changes:

UAX #9, Unicode Bidirectional Algorithm: Text was added in BD2 to guarantee that max_depth can be treated as a constant (with value 125).

UAX #11, East Asian Width: No significant changes in this version.

UAX #14, Unicode Line Breaking Algorithm: The behavior of NNBSP was clarified for Mongolian. References to CLDR and UTS #35 as a source for tailoring were added.

UAX #15, Unicode Normalization Forms: No significant changes in this version.

UAX #24, Unicode Script Property: No significant changes in this version.

UAX #29, Unicode Text Segmentation: The derivation of Lower and Upper for Sentence_Break was updated for Georgian, to account for the difference in how casing in Georgian interacts with sentence boundaries. Surrogate code points were moved from Control to XX for the Grapheme_Cluster_Break property, to eliminate the need to have isolated surrogate code points in the test cases. Fullwidth digits were moved to Numeric for Word_Break and Sentence_Break, to address an inconsistency in handling of boundaries for digits.

UAX #31, Unicode Identifier and Pattern Syntax: The context specified for A2 was tightened up, by requiring $Letter at the end of the sequence. The new scripts for Unicode 12.0 were added to Tables 4 and 7.

UAX #34, Unicode Named Character Sequences: The occurrence of initial hyphen-minus in Unicode character names was clarified.

UAX #38, Unicode Han Database (Unihan): The syntax and/or descriptions for several Unihan data fields were significantly updated: kIRG_GSource, kIRG_JSource, kIRG_KSource, and kIRG_TSource. The discussion of kDefaultSortKey was replaced with a description of the actual sorting algorithm used to generate the radical-stroke charts.

UAX #41, Common References for Unicode Standard Annexes: All references were updated for Unicode 12.0.

UAX #42, Unicode Character Database in XML: New code point attributes, values, and patterns were added.

UAX #44, Unicode Character Database: Clarification was added about the meaning of “abbreviated” property aliases. The note on the derivation of Default_Ignorable_Code_Point was updated to account for the Egyptian Hieroglyph format controls. The note about Grapheme_Extend was updated to explain the current relationship to GCB=Extend. Documentation was added for the new file USourceRSChart.pdf in Table 5.

UAX #45, U-Source Ideographs: New values, A and B, were added to the status field, to account for CJK ideographs encoded in Extensions A or B. Documentation was added regarding the addition of a new comments field to the data file, USourceData.txt. Numerous entries have been added to that data file, and many entries were corrected to indicate their correct extension and code point, if encoded.

UAX #50, Unicode Vertical Text Layout: No significant changes in this version.
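The constant max_depth mentioned in the UAX #9 entry above can be illustrated with a small Python sketch. It computes only the next embedding level for an explicit embedding or override and reports overflow when the result would exceed 125. The overflow counters and the directional status stack of the full algorithm are deliberately omitted, and the function name is hypothetical.

```python
from typing import Optional

MAX_DEPTH = 125  # value stated for max_depth in UAX #9, BD2


def next_embedding_level(current: int, rtl: bool) -> Optional[int]:
    """Return the least odd (rtl) or least even (ltr) level greater than
    `current`, or None if that level would exceed MAX_DEPTH."""
    if rtl:
        # Right-to-left embeddings use odd levels.
        candidate = current + 1 if current % 2 == 0 else current + 2
    else:
        # Left-to-right embeddings use even levels.
        candidate = current + 2 if current % 2 == 0 else current + 1
    return candidate if candidate <= MAX_DEPTH else None


print(next_embedding_level(0, rtl=True))     # 1
print(next_embedding_level(124, rtl=True))   # 125
print(next_embedding_level(124, rtl=False))  # None (126 would exceed max_depth)
```

Because the limit is a fixed constant, an implementation along these lines could size its level stack statically instead of making it configurable.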

There are also significant revisions in the Unicode Technical Standards whose versions are synchronized with the Unicode Standard. The most important of these changes are listed below. For the full details of all changes, see the Modifications section of each UTS, linked directly from the following list of UTSes.

Unicode Technical Standard changes:

UTS #10, Unicode Collation Algorithm: No significant changes in this version.

UTS #39, Unicode Security Mechanisms: The discussion of simplified versus traditional CJK characters as part of the enhancements for spoof detection was removed, because any effective approach for that would need to be more sophisticated. The criteria for exclusions for the listing of Not_XID in the data files were clarified.

UTS #46, Unicode IDNA Compatibility Processing: Table 4, IDNA Comparisons was frozen at the Unicode 11.0 level, with appropriate recaptioning and explanation added. Additional tweaks to the stats in the table for each subsequent release have proven to be of little additional benefit.

UTS #51, Unicode Emoji: Several definitions were updated, and a new definition for “RGI Set” was added. A new section about marking gender in emoji input has been added, as well as numerous clarifications about multi-person groupings, emoji and text presentation selectors, and the significance of the word “FACE” in emoji names. The mechanisms for support of skin tone distinctions when using multi-person emoji are now more fully described.

There are a significant number of changes in Unicode 12.0 which may impact implementations upgrading to Version 12.0 from earlier versions of the standard. The most important of these are listed and explained here, to help focus on the issues most likely to cause unexpected trouble during upgrades.

Script-related Changes

Four new scripts have been added in Unicode 12.0.0. Some of these scripts have particular attributes which may cause issues for implementations. The more important of these attributes are summarized here.

Nandinagari is a complex script of the Indic type.

Ottoman Siyaq numerals, when combined to represent large numbers, have complex formatting requirements.

A set of Egyptian format controls has been added in a new block in the range U+13430..U+13438. While these are intended for use with the existing Egyptian Hieroglyphs script, their use involves a complicated extension to the rendering model for hieroglyphs to enable quadrat formation. Implementers who wish to support these format controls will need to study the specification in the supporting proposal documents. See, in particular, L2/17-112.

U+1E94B ADLAM NASALIZATION MARK has been added for the Adlam script. Although the Adlam script was encoded earlier, implementations have run into trouble attempting to implement the Adlam nasalization mark with characters such as U+0027 APOSTROPHE. The new character is intended to eliminate those problems, but Adlam implementations will need to be updated to add the character and its correct rendering to Adlam fonts.
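For the Egyptian format controls described above, an implementation that does not yet support quadrat formation may at least want to detect their presence in text. The following Python sketch checks against the range U+13430..U+13438 quoted on this page. The function names are hypothetical, and the sketch does not attempt any rendering logic.

```python
from typing import List, Tuple

# Range quoted above for the Egyptian hieroglyph format controls.
FORMAT_CONTROL_FIRST = 0x13430
FORMAT_CONTROL_LAST = 0x13438


def is_hieroglyph_format_control(ch: str) -> bool:
    """True if the single character `ch` falls in U+13430..U+13438."""
    return FORMAT_CONTROL_FIRST <= ord(ch) <= FORMAT_CONTROL_LAST


def find_format_controls(text: str) -> List[Tuple[int, str]]:
    """Return (index, 'U+XXXXX') for every format control found in `text`."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if is_hieroglyph_format_control(ch)
    ]


# Two hieroglyphs with one format control between them.
sample = "\U00013000\U00013430\U00013001"
print(find_format_controls(sample))  # [(1, 'U+13430')]
```

A check like this could be used to fall back to a simpler layout, or to log a warning, when text relies on controls that the renderer does not handle.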

Casing Issues

A few new uppercase Latin letters have been added, which form case pairs with existing lowercase Latin letters. Casing tables should be checked carefully.
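One way to check casing tables is to round-trip known case pairs through the runtime's own case mapping. The Python sketch below assumes the caller supplies the pairs, for example extracted from UnicodeData.txt for Version 12.0. The function name and the sample input are illustrative only.

```python
from typing import Iterable, List, Tuple


def broken_case_pairs(pairs: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Return the (lowercase, uppercase) pairs that do not round-trip
    through this runtime's case mappings."""
    return [
        (lower, upper)
        for lower, upper in pairs
        if lower.upper() != upper or upper.lower() != lower
    ]


# Illustrative input only; real pairs should come from UnicodeData.txt.
sample_pairs = [("a", "A"), ("\u00e9", "\u00c9")]
print(broken_case_pairs(sample_pairs))  # []
```

A runtime whose casing data predates Version 12.0 would be expected to report the newly paired letters, because uppercasing the existing lowercase letters would leave them unchanged.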

General Character Property Changes

Numerous updates have been made to the Alphabetic and Diacritic property values, to help keep the DUCET table for collation stable when initial weights are assigned based on character property values. Most of the affected characters are tone marks for lesser-known scripts.

A Script_Extensions property value of {Latn Mong} has been added for U+202F NARROW NO-BREAK SPACE. Implementations that support Script_Extensions should check that they are handling this character appropriately, and that its identification in both Latin and Mongolian script runs is correct.
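The effect of the Script_Extensions change on script-run detection can be sketched in Python as follows. The table is a hand-written excerpt containing only the value given above; a real implementation would load ScriptExtensions.txt. The function name is hypothetical, and the treatment of Common and Inherited characters is a simplifying assumption.

```python
from typing import Dict, FrozenSet

# Hand-written excerpt; a real implementation would load ScriptExtensions.txt.
SCRIPT_EXTENSIONS: Dict[int, FrozenSet[str]] = {
    0x202F: frozenset({"Latn", "Mong"}),  # NARROW NO-BREAK SPACE in Unicode 12.0
}


def can_join_run(code_point: int, script: str, run_script: str) -> bool:
    """True if the character may belong to a run of `run_script`.

    `script` is the character's Script property value, supplied by the caller.
    """
    if code_point in SCRIPT_EXTENSIONS:
        # An explicit Script_Extensions value limits the scripts it can join.
        return run_script in SCRIPT_EXTENSIONS[code_point]
    if script in ("Zyyy", "Zinh"):
        # Assumption: other Common/Inherited characters join any run.
        return True
    return script == run_script


print(can_join_run(0x202F, "Zyyy", "Mong"))  # True
print(can_join_run(0x202F, "Zyyy", "Latn"))  # True
print(can_join_run(0x202F, "Zyyy", "Cyrl"))  # False
```

Under this sketch, U+202F continues Latin and Mongolian runs but would end a run in any other script, which is the kind of behavior change worth checking for during an upgrade.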

Numeric Property Changes

Unicode 12.0 adds a large number of Tamil characters used for fractional values in traditional accounting practices. Some of these fraction characters introduce fractional values distinct from those noted for fraction characters in prior versions of the UCD. Implementations which handle numeric values of Unicode characters and which have special assumptions about how to deal with fractional values should take note of the following new fractional values among the Tamil fractions:

1/320, 1/80, 1/64, 1/32, 3/64

Note that these Tamil fractions share structural similarities (and many values) with Malayalam fractions. See DerivedNumericValues.txt for details.
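One assumption these values can break is that a character's numeric value is stored exactly as a binary floating-point number. The Python sketch below parses the fractional values listed above as exact rationals and reports which of them have a power-of-two denominator. The helper name is hypothetical, and the check is intended only for small values such as these.

```python
from fractions import Fraction

# New fractional values listed above for the Tamil fractions.
NEW_TAMIL_VALUES = ["1/320", "1/80", "1/64", "1/32", "3/64"]


def is_exact_in_binary_float(value: Fraction) -> bool:
    """True if the denominator is a power of two, so that float(value)
    represents a small value like these without rounding."""
    denominator = value.denominator
    return denominator & (denominator - 1) == 0


for text in NEW_TAMIL_VALUES:
    value = Fraction(text)  # exact rational, no rounding
    print(text, is_exact_in_binary_float(value))
```

Here 1/64, 1/32, and 3/64 are exact in binary floating point, while 1/320 and 1/80 are not, so implementations that compare numeric values may prefer a rational representation.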

CJK/Unihan Changes

The regular expressions for the kIRG_GSource and kIRG_JSource properties were completely rewritten to be comprehensible. See UAX #38 for details.

Standardized Variation Sequences

Many new standardized variation sequences have been added to represent distinctions between variants of some common East Asian punctuation characters.

New Data Files Added to the UCD

A new radical/stroke index was added, for easier lookup of U-Source ideographs.

Code Charts