To get started, I grabbed a couple of datasets and graphed character frequency vs. character complexity for the first 5,000 characters, the number an ordinary educated Chinese person would know. I obtained the probability of each character appearing (the most common 7 each make up about 1% of all Chinese text apiece, for instance) and used the logarithm formula from earlier to extract the information content of each one:
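That logarithm formula is the standard self-information formula, I(c) = −log₂ p(c): the rarer a character, the more bits it carries. A minimal sketch in Python (the probabilities here are illustrative round numbers, not the exact figures from my dataset):

```python
import math

def information_content(p):
    """Self-information in bits: rarer characters carry more information."""
    return -math.log2(p)

# Illustrative probabilities: a very common character appearing in ~4% of
# text vs. a rare one appearing once in 10,000 characters.
bits_common = information_content(0.04)    # about 4.6 bits
bits_rare = information_content(0.0001)    # about 13.3 bits
```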

r-squared is 0.15 for simplified and 0.07 for traditional.

The correlation here is unconvincing, although there is an interesting lower bound that seems to appear for how much information a character might contain based on its stroke count. Next I compared character frequency index against stroke count. On all following graphs, a higher frequency index means the character is less common, so the most common character has frequency index 1. In this way, frequency index indicates information content of a character, because less frequent characters have lower probabilities of appearing.

r-squared is 0.14 for simplified and 0.07 for traditional.

The two methods tell us the same thing — there is, at best, a very weak correlation between how complex a Chinese character appears and how much information it carries. To compare a very appley apple to a very orangey orange, I also graphed English word length vs. frequency index and found an even more profound lack of meaningful trend.

r-squared is 0.05.

It’s clear that neither English nor Chinese resembles an optimal encoding. This is inevitable because words or characters aren’t just random jumbles of strokes or letters but are constructed with intention, so that it’s clear that hydroplane has something to do with water, and that 談 has something to do with speech. There’s a balance to be struck here — add too much complexity to a word and it becomes redundant and ornamental, but remove too much, and your language is basically a Huffman code; every word is the optimal length for its frequency, but contains no information about its own meaning, like a string of 1s and 0s. So while this is just one measuring stick we could choose to assess Chinese with, it does make for an interesting analysis of the two character sets. Simplified’s r-squared value is still twice traditional’s — what’s with that?
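To make the Huffman comparison concrete, here’s a sketch of Huffman coding in Python. The character frequencies are stand-ins loosely modeled on the most common characters, not my actual corpus figures; the point is only that the most frequent symbol gets the shortest codeword, regardless of meaning:

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Build a Huffman code: more frequent symbols get shorter codewords."""
    tiebreak = count()  # avoids comparing dicts when frequencies tie
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing 0 and 1
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

# Illustrative frequencies only — not the real corpus values
codes = huffman_codes({"的": 0.042, "一": 0.011, "是": 0.010,
                       "不": 0.009, "在": 0.007})
```

With these inputs, 的 gets a one-bit codeword and the rest get three bits each — optimal lengths, zero hint of meaning.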

To better describe how the frequency list differs from the stroke-count list, I needed a new metric to describe the “disorder” of a list of numbers — how far the list is from its ordered state. For this I turned to insertion sort, which operates by swapping adjacent elements in a list until it is ordered. While considered inefficient, this algorithm has the useful property that it takes zero swaps to sort an already sorted list, and O(n²) swaps to sort a “maximally unsorted” list — i.e. one that is sorted in reverse order! So I wrote an insertion sort that counts and returns how many swaps it performs, then defined disorder for a list as follows:

disorder(L) = swaps to sort L / swaps to sort reversed(sorted(L))

The disorder for any list is a value between 0 and 1, where 0 means the list is already in order, and 1 means the list is in reverse order.
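A minimal version of that swap-counting insertion sort and the disorder metric (a sketch; the actual script is in the repo linked at the end):

```python
def counting_insertion_sort(lst):
    """Insertion sort via adjacent swaps; returns the number of swaps."""
    a = list(lst)
    swaps = 0
    for i in range(1, len(a)):
        j = i
        while j > 0 and a[j - 1] > a[j]:
            a[j - 1], a[j] = a[j], a[j - 1]
            swaps += 1
            j -= 1
    return swaps

def disorder(lst):
    """0 for an already-sorted list, 1 for a reverse-sorted one."""
    worst = counting_insertion_sort(sorted(lst, reverse=True))
    return counting_insertion_sort(lst) / worst if worst else 0.0
```

Running this on the stroke counts of the ten most common characters reproduces the worked example below: 21 swaps to sort, 43 in the worst case, disorder ≈ 0.488.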

Insertion sort in action (source)

Now I could take a look at the disorder in the frequency-indexed list of stroke counts of a set of characters. By way of example, the ten most common Chinese characters are:

的, 一, 是, 不, 了, 在, 人, 有, 我, 他

The corresponding stroke count list is:

8, 1, 9, 4, 2, 6, 2, 6, 7, 5

It takes 21 swaps to order this list. The same list, sorted and reversed, is:

9, 8, 7, 6, 6, 5, 4, 2, 2, 1

It takes 43 swaps to put this list in order. Therefore the disorder of the list is 21 / 43 = 0.488. Not off to a great start! If disorder is low (< 0.5), that means that there is a positive correlation between complexity and information. If it is high (> 0.5) there is a negative correlation. For the first ten characters, there doesn’t appear to be much of a correlation at all! Undaunted, I applied this procedure to several sets of characters:

The first 5,000 characters by frequency (simplified)

The first 5,000 characters by frequency (traditional)

The 1,676 simplified characters among the first 5,000 that actually differ from their traditional counterparts, which I refer to as diff-simplified

The 1,676 traditional equivalents of those actually simplified characters, which I refer to as diff-traditional

Here is the result:

Ringing in at 0.3–0.4, these disorder levels indicate the same weakly positive correlation between complexity and information we’ve already seen. Honestly, I was a little disappointed to see how little difference there was between the character sets. Since the simplified sets display less disorder, they do in fact provide a better encoding in the information-theoretic sense — but not to the extent that I was expecting, or hoping. The script I wrote also provided some other statistics. Here are a few interesting bits that I decided didn’t merit a chart:

With 9,933 characters (all the ones in my frequency table), the difference between disorder in the two character sets is more pronounced, at 0.30 for simplified to 0.37 for traditional.

For the first 5,000 characters, the simplified set’s mean/median stroke count is 10.3/10 and the traditional set’s is 12.1/12.

The most complex character in my dataset is 鱺 (eel) with 30 strokes. In simplified it looks like 鲡.
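The mean/median figures are just a couple of standard-library calls over the stroke-count lists. For instance, over the ten most common characters from earlier:

```python
from statistics import mean, median

# Stroke counts of the ten most common characters: 的 一 是 不 了 在 人 有 我 他
strokes = [8, 1, 9, 4, 2, 6, 2, 6, 7, 5]
print(mean(strokes), median(strokes))  # mean 5, median 5.5
```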

Here are a few more charts I made while I was at it:

Using simplified tends to save about two strokes per character.

The stroke count of Chinese characters in this range appears to be distributed about normally…or is it binomially? Very qualitatively, this makes sense. The simplest characters only have a few strokes to combine in relatively few ways. There are many more very complex characters than very simple ones, but most of them fall right of the 5,000 mark. In the middle is a sweet spot where you have a lot of ways to combine a large pool of relatively simple characters, often just by merging two of them horizontally or vertically. As an aside, it’s hard to overstate how versatile this process is, and I recommend HanziCraft as a great way to visualize how characters are constructed. Anyway, as expected, this histogram shows simplification in action — the simplified distribution is tighter and shifted to the left.

This chart shows character frequency vs. number of strokes removed by simplification. It’s interesting because it reveals bands at 3 and 5 strokes saved containing tons of characters that were simplified by virtue of their radical. At 3 strokes saved, you have dozens of characters with a 糹(silk) on the side that is simplified to 纟, for instance 绿, 练, and 细. Similarly simplified are 金→钅(gold/metal), 貝→贝 (money/value), and 車→车 (car), all of which have a 3-stroke differential. At 5 strokes, you’ll likewise see many, many characters which contain a 言→讠 (speech), 食→饣 (food), or 門→门 (door) simplification. On the other hand, there are some characters visible below the zero line that actually gained strokes in simplification! The most common of these is 強→强 (strong), which I actually think gained some nice squareness at the cost of one stroke. Finally, with a whopping differential of 21 strokes, 廳→厅 (hall) is the most drastic change in the entire script. As 丁’s phonetic clue is almost as useful as 聽’s, I chalk this one up as a win, but I’ve seen several users of traditional react with horror to this change.

Like I said before, I love simplified characters. My view is usually that more simplification is better, because fewer strokes means less wrist pain for everyone. But I got a taste of what it must be like to have your entire writing system upended when I discovered the list of second-round simplifications that the Chinese government tried and failed to implement in the 1970s and 80s. Although these characters aren’t part of my native language or my heritage, I still recoiled instinctively when I saw these proposed changes:

Current version of the character on the left, proposed simplification on the right (source)

All of these simplifications are highly logical and rely mostly on homophonic substitution. But after just a few months of taking Chinese classes and getting familiar with 原, 菜, and 酒, learning about their etymology and construction, it seemed cruel to rip all that suddenly-meaningful ‘stuff’ out just to save a few strokes here and there. It’s an arbitrary line drawn in the sand — this is what Chinese characters looked like when I learned them, and it would make me sad, somehow, if they were simplified further. The rational part of me knows that if character simplification continues, it will probably be for the benefit of the billions of people who use them every day. But I’ll always have a place in my heart for 藏, 幕, 疑, and the thousands of other needlessly complicated characters whose complexity, after a fashion, invites investigation and untangling. And until Obama tells me otherwise, I’ll continue to keep my there’s, theirs, and they’res apart as well. ■

谢谢, 奥巴马 (source)

Data Sources

All of the code I used to generate this data is available on GitHub.