For this experiment, I extracted readings from Jim Breen’s venerable Kanjidic, and structural analysis from KanjiVG, a publicly available database of graphical and component decompositions. (If you find any errors in the tables, please report them, so that I can either debug my code or forward corrections to the original sources.) Regarding my goals:

I was especially interested in components that could be used reliably as a guide to pronunciation.

I also chose to ignore approximate readings, out of didactic interest (in my experience as a Japanese student, I found approximations to be more trouble than they are worth). I looked only for exact matches.

I chose KanjiVG because I was interested in a synchronic analysis, that is, an analysis of the structure of the kanji as they are now, not of their historical (traditional, Seal, or pre-Qin) forms. I ignored history and looked for correlations between modern visual components and modern readings, in the spirit of testing how much information is still present in the system.

There’s quite a bit of data to massage, and it can be tricky to measure what exactly is a “good” phonetic component. In the next section I make some important definitions about metrics.

First of all, the results will differ significantly depending on which kanji set (our universe) is analyzed. We’ll investigate two such sets:

Quantifying phonetic series

Within each kanji set, the basic variables to relate are:

A few thousand kanji, where each kanji is made of one or more components, and each kanji has zero or more (on-yomi) readings.

A component series is the set of kanji that include a certain component. Here are some examples from the Jōyō set:

| Component | Size of series | Kanji in series |
|---|---|---|
| 走 | 7 | 越 起 趣 走 超 徒 赴 |
| 青 | 7 | 情 晴 清 精 請 青 静 |
| 包 | 6 | 包 抱 泡 砲 胞 飽 |
| 乍 | 5 | 作 搾 昨 詐 酢 |
| 及 | 4 | 及 吸 扱 級 |
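As a rough illustration of how such component series can be collected, here is a minimal Python sketch. The `COMPONENTS` table is a tiny hand-written stand-in (an assumption for illustration, not actual KanjiVG output), and `component_series` is a hypothetical helper name, not part of the code used for this experiment.

```python
from collections import defaultdict

# Toy decomposition table, written by hand for illustration only.
# A real run would derive this mapping from KanjiVG.
COMPONENTS = {
    "及": ["及"],
    "吸": ["口", "及"],
    "扱": ["扌", "及"],
    "級": ["糸", "及"],
    "包": ["包"],
    "抱": ["扌", "包"],
}


def component_series(components_of):
    """Map each component to the set of kanji that include it."""
    series = defaultdict(set)
    for kanji, parts in components_of.items():
        for part in parts:
            series[part].add(kanji)
    return dict(series)


series = component_series(COMPONENTS)
print(len(series["及"]), sorted(series["及"]))
```

Inverting the kanji-to-components mapping once gives every component series in a single pass, which is convenient when the same kanji set is queried for many components.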

For our purposes, a phonetic series is a set of kanji that share a component and a reading. If we add each kanji’s readings to the table above, interesting patterns appear:

| Component | Size of series | Kanji in series (with readings) |
|---|---|---|
| 走 | 7 | 越 (etsu, otsu), 起 (ki), 趣 (shu), 走 (sou), 超 (chou), 徒 (to), 赴 (fu) |
| 青 | 7 | 情 (sei, jou), 晴 (sei), 清 (sei, shou, shin), 精 (sei, shou, shiyau), 請 (sei, shou, shin), 青 (sei, shou), 静 (sei, jou) |
| 包 | 6 | 包 (hou), 抱 (hou), 泡 (hou), 砲 (hou), 胞 (hou), 飽 (hou) |
| 乍 | 5 | 作 (saku, sa), 搾 (saku), 昨 (saku), 酢 (saku), 詐 (sa) |
| 及 | 4 | 及 (kyuu), 吸 (kyuu), 扱 (kyuu, sou), 級 (kyuu) |

First, consider the 走-series. No two kanji in it share a reading! 走 is not a phonetic component at all; that is, the 走-series is not a phonetic series.

Compare this with the 包-series. All six kanji that include 包 are pronounced hou. In other words, 包-hou is a phonetic series of size 6.

Now consider the 乍-series. Almost there! Four out of five kanji include the reading saku, but 詐 breaks the pattern; it’s only read as sa. In this case the 乍-saku phonetic series (size 4, 作搾昨酢) is smaller than the full 乍 component series (5, 作搾昨詐酢). We say this phonetic series covers 4 out of 5 kanji, or that it has a kanji coverage of 4/5 = 80%.
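The kanji coverage of the 乍-saku series can be checked with a short Python sketch. The readings are copied from the table above; the function name `kanji_coverage` is an assumption made for this example.

```python
# Readings of every kanji in the 乍 component series (from the table above).
SAKU_SERIES = {
    "作": {"saku", "sa"},
    "搾": {"saku"},
    "昨": {"saku"},
    "酢": {"saku"},
    "詐": {"sa"},
}


def kanji_coverage(readings_of, reading):
    """Fraction of kanji in the component series that have the given reading."""
    members = [k for k, readings in readings_of.items() if reading in readings]
    return len(members) / len(readings_of)


saku_coverage = kanji_coverage(SAKU_SERIES, "saku")
print(f"{saku_coverage:.0%}")
```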

What about the 青-series? It does include a phonetic series, with 100% coverage: all characters do have a shared reading, sei. However, most characters also have extra, unpredictable readings, so that this component is less predictive than 包, and we should measure this. Call readings coverage the ratio of predicted readings to all the readings of all kanji in which the component appears (that is, all kanji in the larger component series). 青-sei is a phonetic series of size 7, with 100% kanji coverage but only 43.75% (7/16) readings coverage.

I hope it’s clear by now that 及-kyuu is a size-4 phonetic series with 100% kanji coverage and 80% (4/5) readings coverage.
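To double-check the readings coverage figures for 青-sei and 及-kyuu, here is a small Python sketch. The readings are taken from the table above; `readings_coverage` is a hypothetical helper name, and exact fractions are used only to avoid rounding noise.

```python
from fractions import Fraction

# Readings per kanji, copied from the table above.
AO_SERIES = {
    "情": ["sei", "jou"],
    "晴": ["sei"],
    "清": ["sei", "shou", "shin"],
    "精": ["sei", "shou", "shiyau"],
    "請": ["sei", "shou", "shin"],
    "青": ["sei", "shou"],
    "静": ["sei", "jou"],
}
KYUU_SERIES = {
    "及": ["kyuu"],
    "吸": ["kyuu"],
    "扱": ["kyuu", "sou"],
    "級": ["kyuu"],
}


def readings_coverage(readings_of, reading):
    """Predicted readings divided by all readings of all kanji in the series."""
    predicted = sum(1 for readings in readings_of.values() if reading in readings)
    total = sum(len(readings) for readings in readings_of.values())
    return Fraction(predicted, total)


sei = readings_coverage(AO_SERIES, "sei")
kyuu = readings_coverage(KYUU_SERIES, "kyuu")
print(sei, float(sei))
print(kyuu, float(kyuu))
```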

Phonetic series with 100% kanji and 100% readings coverage (like 包-hou) are especially useful; these ratings mean that, whenever the component appears, one can be sure of all readings of the kanji. We call these perfect series. Second in importance are those with 100% kanji but less than 100% readings coverage (like 及-kyuu and 青-sei); let’s name them semiperfect series. If you see a semiperfect phonetic component, you can be sure of at least one of the kanji’s readings. Series with less than 100% kanji coverage are not as useful, since you have to memorize the exceptions anyway; these are imperfect series.
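The three categories can be expressed as a small classifier. This is only a sketch of the definitions above, with a hypothetical function name `classify` and readings copied from the earlier table; it is not the code used to produce the results.

```python
HOU_SERIES = {k: {"hou"} for k in "包抱泡砲胞飽"}
KYUU_SERIES = {"及": {"kyuu"}, "吸": {"kyuu"}, "扱": {"kyuu", "sou"}, "級": {"kyuu"}}
SAKU_SERIES = {"作": {"saku", "sa"}, "搾": {"saku"}, "昨": {"saku"},
               "酢": {"saku"}, "詐": {"sa"}}


def classify(readings_of, reading):
    """Label a component-reading pair as perfect, semiperfect or imperfect."""
    members = [k for k, readings in readings_of.items() if reading in readings]
    total_readings = sum(len(readings) for readings in readings_of.values())
    if len(members) < len(readings_of):
        return "imperfect"    # kanji coverage below 100%
    if len(members) == total_readings:
        return "perfect"      # every reading of every kanji is predicted
    return "semiperfect"      # all kanji covered, but extra readings exist


labels = {
    "包-hou": classify(HOU_SERIES, "hou"),
    "及-kyuu": classify(KYUU_SERIES, "kyuu"),
    "乍-saku": classify(SAKU_SERIES, "saku"),
}
print(labels)
```

When kanji coverage is 100%, readings coverage reaches 100% exactly when the number of member kanji equals the total number of readings, which is why a single comparison suffices for the perfect case.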

Attentive readers might have noticed that a single component can belong to several phonetic series; 乍, for example, could also be described as a very imperfect predictor for sa, working for 作詐 but not 搾昨酢. It of course performs better as a predictor for saku, since in that role it gets 80% kanji coverage and 66% readings coverage over 4 kanji, rather than 40% and 33% over just 2. We’re now in a position to choose the best series for a component or a kanji: the rating criteria will be: