How Complex is Tangut ?

Last year my friend Nathan Hill kindly invited me to give a talk on Tangut at my alma mater. I accepted with some trepidation because I am still very much at the start of a long and steep learning curve with regard to Tangut, but I hoped that by the time the talk was due to be given in May this year I would have something interesting and exciting to talk about. Unfortunately I got tied up with other stuff (Tangut, ironically), so in the end my talk turned out to be more of a general introduction to the structure of the Tangut script and some of the issues that I have faced over the last year or so in preparing an encoding proposal for Tangut. But anyway, the talk didn't go too badly, and so I thought that I would convert my PowerPoint slides into a four-part series of blog posts.

Notes for an introductory talk on the Tangut script given at SOAS on 21st May 2009

Part 1 : How Complex is Tangut ?

Part 2 : Untangling the Web of Characters

Part 3 : Tangut Homographs

1.1 The Age of New Scripts

During the 10th to 13th centuries a number of new scripts were devised by peoples who had come into contact with (and conflict with) China, and who wanted to assert their national identity and cultural superiority by means of their own, unique and distinct writing systems (colour-coded to show their current Unicode status):

[See Documents relating to the encoding of the Tangut, Jurchen and Khitan scripts for Unicode encoding proposals]

Three of these scripts, Large Khitan, Jurchen and Tangut, are structurally similar to Chinese, and I will look at their similarities and differences, both amongst themselves and in relation to Chinese, below.

1.2 Khitan Large Script

Closely modelled on Chinese

Many characters borrowed directly from Chinese

Some with the same meaning (e.g. 皇帝 in the text below)

Some as phonetic borrowings

Many other characters derived from Chinese characters by adding or removing strokes (e.g. 東 with two extra strokes on the 6th line from the right in the text below)

Few or no characters composed of multiple elements with large numbers of strokes (i.e. no characters like Chinese 雙)

Uses exactly the same stroke types as Chinese

Largely undeciphered

Transcription of a Khitan Memorial Stone

Source: Mínzú Yǔwén 民族语文 2005 no.4 page 54


1.3 Jurchen

Very similar to Khitan Large Script

Many characters derived from Khitan and/or Chinese

Relatively few direct borrowings from Chinese compared with Khitan

No characters with large numbers of strokes or composed from multiple complex elements

Uses exactly the same stroke types as Chinese

Largely deciphered

Drawing of a "Medallion" with a Jurchen inscription

Source: S. W. Bushell, "Inscriptions in the Juchen and Allied Scripts" in Actes du Onzième Congrès International des Orientalistes (1897) 2nd section page 21

(originally from Fāngshì Mòpǔ 方氏墨譜 [Mr. Fang's Catalogue of Ink Cakes] (1588) vol. 1 folio 33)

Table of Chinese, Khitan and Jurchen Numerals

Source: Daniel Kane, The Sino-Jurchen Vocabulary of the Bureau of Interpreters (1989) page 21

1.4 Tangut

Only superficially similar to Chinese

Characters are not obviously derived directly from Chinese or Khitan characters, although they are clearly influenced by Chinese

Discrete elements arranged into a square character

Appears crowded compared with Chinese, with few non-complex characters

Most characters composed of two or three distinct components, and only a few characters are themselves elemental components

Mostly written using the same stroke types as used for writing Chinese, but some stroke types and stroke constructions are unique to Tangut

Higher proportion of diagonal and oblique strokes than in Chinese

No closed elements (i.e. no box elements like Chinese 口 and 囗)

Chrysographic Edition of the Lotus Sutra

Source: 中国少数民族文字字符总集

Fragment of a Memorial Stone from the Western Xia Royal Tombs

Source: 大夏寻踪——西夏文物特展 (Vanished Exhibition on Western Xia artefacts at the National Museum of China)

[Can you spot the characters meaning "one" and "three" ?]

1.5 Stroke Complexity

Tangut is renowned as being very complex in terms of the structure of its individual characters, but I wanted to try to determine exactly how complex Tangut is, and how it compares with Chinese, Khitan and Jurchen, so I produced the following graphs to show the distribution of characters by stroke count in these various scripts.

Distribution of Tangut Characters by Stroke Count

Data derived from Proposal for a revised Tangut character set for encoding in the SMP of the UCS (SC2/WG2/N3577) Appendix A.

Distribution of Traditional CJK Characters by Stroke Count

Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5), excluding simplified characters (mostly those characters with a kTraditionalVariant field).

Distribution of Simplified CJK Characters by Stroke Count

Data derived from the kTotalStrokes field of the Unihan Database for those characters defined in Unicode 1.0 (i.e. U+4E00 through U+9FA5) that have the kXHC1983 field but do not have the kSimplifiedVariant field (i.e. most simplified characters in the 1983 edition of Xiàndài Hànyǔ Cídiǎn 现代汉语词典).
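The two CJK selections described above can be sketched in code. This is only a rough illustration of the filtering logic, not the script actually used for the graphs: it assumes the Unihan data is available as tab-separated lines of the form code point, field name, value, and the function names and the handful of sample lines are made up for the example (the kXHC1983 values are placeholders, since only the presence of the field matters here).

```python
from collections import Counter

def parse_unihan(lines):
    """Collect {code point: {field: value}} from Unihan-style tab-separated lines."""
    data = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cp, field, value = line.split("\t", 2)
        data.setdefault(int(cp[2:], 16), {})[field] = value
    return data

def is_traditional(fields):
    # "Traditional" set: anything without a kTraditionalVariant field.
    return "kTraditionalVariant" not in fields

def is_simplified(fields):
    # "Simplified" set: has kXHC1983 but no kSimplifiedVariant field.
    return "kXHC1983" in fields and "kSimplifiedVariant" not in fields

def stroke_histogram(data, keep):
    """Count characters in U+4E00..U+9FA5 by kTotalStrokes, filtered by keep()."""
    hist = Counter()
    for cp, fields in data.items():
        if 0x4E00 <= cp <= 0x9FA5 and "kTotalStrokes" in fields and keep(fields):
            # Use the first value if the field lists more than one.
            hist[int(fields["kTotalStrokes"].split()[0])] += 1
    return hist

# Hand-made sample lines in the assumed layout (not a copy of the database).
SAMPLE = [
    "# illustrative sample only",
    "U+3400\tkTotalStrokes\t5",            # outside U+4E00..U+9FA5, so ignored
    "U+4E00\tkTotalStrokes\t1",            # 一
    "U+4E00\tkXHC1983\t(placeholder)",
    "U+99AC\tkTotalStrokes\t10",           # 馬
    "U+99AC\tkSimplifiedVariant\tU+9A6C",
    "U+9A6C\tkTotalStrokes\t3",            # 马
    "U+9A6C\tkTraditionalVariant\tU+99AC",
    "U+9A6C\tkXHC1983\t(placeholder)",
]

data = parse_unihan(SAMPLE)
traditional = stroke_histogram(data, is_traditional)
simplified = stroke_histogram(data, is_simplified)
```

With this sample, 一 lands in both histograms, 馬 only in the traditional one and 马 only in the simplified one, which mirrors how the two selections overlap for characters that were never simplified.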

Distribution of Large Khitan Characters by Stroke Count

Data derived from the transcription of a Khitan memorial stone given in Mínzú Yǔwén 民族语文 2005 no.4 page 54 and page 55.

Distribution of Jurchen Characters by Stroke Count

Data derived from Jin Qizong 金啓孮, Nüzhenwen Cidian 女真文辞典 [Dictionary of Jurchen Characters] (Beijing: Wenwu Chubanshe, 1984).

Stroke Count Data for Traditional CJK, Simplified CJK, Tangut, Jurchen and Khitan

Strokes   CJK Traditional   CJK Simplified   Tangut   Jurchen   Khitan
1         10                2                0        3         0
2         37                22               0        6         6
3         80                60               0        25        28
4         157               143              3        165       52
5         240               215              32       287       60
6         386               351              65       401       41
7         664               568              160      293       34
8         957               759              310      147       18
9         1,125             851              524      37        10
10        1,369             923              773      13        4
11        1,555             901              847      0         2
12        1,636             870              885      0         0
13        1,546             761              782      0         0
14        1,446             594              640      0         0
15        1,502             534              473      0         0
16        1,251             409              336      0         0
17        1,020             311              173      0         0
18        793               175              106      0         0
19        716               168              60       0         0
20        519               105              29       0         0
21        394               79               15       0         0
22        304               47               6        0         0
23        240               40               1        0         0
24        149               21               1        0         0
25        107               22               0        0         0
26        54                6                0        0         0
27        52                1                0        0         0
28        26                1                0        0         0
29        13                1                0        0         0
30        8                 0                0        0         0
31        5                 0                0        0         0
32        3                 1                0        0         0
33        4                 1                0        0         0
34        0                 0                0        0         0
35        1                 0                0        0         0
36        1                 1                0        0         0
37        0                 0                0        0         0
38        0                 0                0        0         0
39        1                 0                0        0         0
40        0                 0                0        0         0
41        0                 0                0        0         0
42        0                 0                0        0         0
43        0                 0                0        0         0
44        0                 0                0        0         0
45        0                 0                0        0         0
46        0                 0                0        0         0
47        0                 0                0        0         0
48        1                 0                0        0         0
Total     18,373            8,943            6,221    1,377     255
Mean      13.46             11.49            12.09    6.01      5.43
Mode      12                10               12       6         5
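The Total, Mean and Mode rows follow directly from the per-stroke counts. As a quick check, here is a small sketch that recomputes them for the Tangut, Jurchen and Khitan columns; the dictionaries simply restate the non-zero counts from the table, and the function name summarise is my own choice for this example.

```python
# Non-zero counts from the table above, as {stroke count: number of characters}.
TANGUT = {4: 3, 5: 32, 6: 65, 7: 160, 8: 310, 9: 524, 10: 773, 11: 847,
          12: 885, 13: 782, 14: 640, 15: 473, 16: 336, 17: 173, 18: 106,
          19: 60, 20: 29, 21: 15, 22: 6, 23: 1, 24: 1}
JURCHEN = {1: 3, 2: 6, 3: 25, 4: 165, 5: 287, 6: 401, 7: 293, 8: 147,
           9: 37, 10: 13}
KHITAN = {2: 6, 3: 28, 4: 52, 5: 60, 6: 41, 7: 34, 8: 18, 9: 10, 10: 4, 11: 2}

def summarise(dist):
    """Return (total characters, mean stroke count to 2 places, modal stroke count)."""
    total = sum(dist.values())
    mean = sum(strokes * count for strokes, count in dist.items()) / total
    mode = max(dist, key=dist.get)  # stroke count with the most characters
    return total, round(mean, 2), mode

for name, dist in [("Tangut", TANGUT), ("Jurchen", JURCHEN), ("Khitan", KHITAN)]:
    print(name, summarise(dist))
```

The mean here is weighted by the number of characters at each stroke count, i.e. it is the average over the character repertoire, not over running text.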

Comparison of CJK, Tangut, Jurchen and Khitan Stroke Counts

Jurchen and Large Khitan are the two scripts that appear to be most similar to Chinese, yet they are actually the most different when it comes to stroke count, both averaging fewer than half as many strokes as traditional CJK characters (6.01 and 5.43 against 13.46). This difference is probably due to the fact that Large Khitan and Jurchen characters do not have any high stroke count radicals such as 言 "speech" (7 strokes), 金 "gold" (8 strokes), 馬 "horse" (10 strokes) and 鳥 "bird" (11 strokes) that are very common in Chinese characters.

On the other hand, it was a surprise (to me at least) to see how closely the contour of Tangut matches that of traditional Chinese, as I had always assumed that Tangut characters must, on average, be much more complex than Chinese characters. Tangut does not have any characters with very few strokes (fewer than 4 strokes) or very many strokes (more than 24 strokes), which distinguishes it from Chinese, but if you ignore the lower and upper ends of the graph the distribution of stroke counts for Tangut is very close to that of traditional Chinese. Why then does Tangut text look so much more complex and more crowded than Chinese? That could be answered with another graph which took into account each character's frequency of occurrence. A large proportion of high frequency Chinese characters have very few strokes (e.g. 一二三人女山火水大小中), and conversely Chinese characters with very many strokes tend to occur less frequently, with the result that normal Chinese text always has a large proportion of characters with few strokes. In contrast to the situation with Chinese, there does not appear to be any relationship between frequency and stroke count for Tangut characters, so that normal Tangut text is uniformly composed of characters with 12±6 strokes, with the result that it appears denser and more crowded than Chinese.
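The difference between counting each character once and counting it every time it occurs can be shown with a toy calculation. This is a sketch only: the ten-character "text" is invented for the illustration and is not real corpus data, and the function names are hypothetical.

```python
# Stroke counts for a few Chinese characters used in the toy text.
STROKES = {"一": 1, "人": 2, "山": 3, "水": 4, "馬": 10, "雙": 18}

# Invented sample text in which low-stroke characters recur often.
TEXT = "一人一山一水人馬一雙"

def mean_strokes_by_type(strokes, text):
    """Average stroke count over distinct characters (each counted once)."""
    distinct = set(text)
    return sum(strokes[c] for c in distinct) / len(distinct)

def mean_strokes_by_token(strokes, text):
    """Average stroke count over running text (each occurrence counted)."""
    return sum(strokes[c] for c in text) / len(text)

by_type = mean_strokes_by_type(STROKES, TEXT)
by_token = mean_strokes_by_token(STROKES, TEXT)
print(f"per distinct character: {by_type:.2f}, per character of text: {by_token:.2f}")
```

In this toy text the per-occurrence average is well below the per-character average because the simple characters repeat. If stroke count and frequency are unrelated, as appears to be the case for Tangut, the two averages would be expected to stay close together.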

1.6 Structure of Tangut Characters

Individual Tangut characters not obviously derived directly from Chinese or Khitan characters

Limited set of component elements

Elements are themselves built from simpler elements by the addition of 1 or 2 strokes

Most characters constructed from 2 or 3 component elements

Very few basic elements are also characters in their own right

Series of components are constructed from a basic element, on the one hand by the addition of strokes to the basic element to make other simple components (vertical progression in the diagrams below), and on the other hand by combining these simple components with other components to make complex components (horizontal progression in the diagrams below).

Series of Tangut Components (Example A)

Series of Tangut Components (Example B)

Due to this incremental process many character components are very similar to each other, and when two or three such similar components (coloured red in the diagram below) are combined in different ways to make different characters (coloured blue in the diagram below), the resulting characters are very easily confused with one another.

Eleven Characters composed from different combinations of Five Components

1.7 Tangut Radicals

Not true radicals (determinatives)

But simply aids to character lookup

Chinese dictionaries select leftmost or topmost character element as the radical

Most Russian dictionaries base the radical on the character element at the bottom right corner of the character

In the example below, the same radical is used in both Li Fanwen's dictionary and Kychanov's dictionary, but in the former it is a lefthand radical, and in the latter it is a bottom right radical. This shows how most horizontally aligned components can occur equally on the left side or on the right side of a character, and it is largely an arbitrary decision of dictionary compilers as to whether it is treated as a lefthand side radical or a righthand side radical.

Li Fanwen 2008 Kychanov 2006

The proposed Unicode character ordering is based on 527 left-based radicals (including some top, bottom and enclosing radicals where there is no lefthand component). The advantage of this system of ordering is that it is consistent and allows for deterministic lookup of characters, but the disadvantage is that there are some high stroke-count radicals with very few members.

N3577 Appendix A

1.8 Structural Analysis

Because Tangut characters are composed of a limited set of component elements arranged in different configurations, they are very amenable to structural analysis

Nishida’s 1966 dictionary gives a structural analysis of each character

Table of Tangut Component Configurations identified by Nishida

Source: Nishida Tatsuo 西田龍雄, Seikago no kenkyū 西夏語の研究 (1964) page 246

Entry in Nishida's 1966 Tangut Dictionary

Source: Nishida Tatsuo 西田龍雄, Seikabun Shōjiten 西夏文小字典 (1966) no. 10-103

The Unicode proposal gives an Ideographic Description Sequence (IDS) for each proposed character. This borrows a character description syntax that was designed for CJK characters, but which from Unicode 6.0 onwards will no longer be restricted to CJK characters.
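To give an idea of how an IDS describes a character, here is a minimal sketch of a parser for the prefix syntax. It uses CJK examples because they are easy to read, and it only knows the twelve Ideographic Description Characters at U+2FF0 to U+2FFB (of which ⿲ and ⿳ take three operands and the rest take two); the function name parse_ids and the nested-tuple output are my own choices for the example, not anything defined by the proposal.

```python
# Ideographic Description Characters U+2FF0..U+2FFB.
TERNARY = {"\u2FF2", "\u2FF3"}  # ⿲ and ⿳ take three operands
BINARY = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TERNARY  # the rest take two

def parse_ids(ids):
    """Parse an IDS string into a nested tuple (operator, operand, ...)."""
    def node(pos):
        if pos >= len(ids):
            raise ValueError("IDS ends before all operands are supplied")
        ch = ids[pos]
        if ch in BINARY or ch in TERNARY:
            arity = 3 if ch in TERNARY else 2
            parts = [ch]
            pos += 1
            for _ in range(arity):
                sub, pos = node(pos)  # each operand may itself be a sub-sequence
                parts.append(sub)
            return tuple(parts), pos
        return ch, pos + 1  # an ordinary character or component

    tree, end = node(0)
    if end != len(ids):
        raise ValueError("trailing characters after a complete IDS")
    return tree

print(parse_ids("⿰女子"))      # left-right: 女 beside 子
print(parse_ids("⿱雨⿰木目"))  # top-bottom, with a nested left-right part
```

Because the operators come before their operands, a sequence can be read left to right without brackets, and the same approach works unchanged for Tangut components once they have code points of their own.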
