Character Distributions at a Glance

Define the conditional right / left distribution of a character as the distribution of characters immediately following / preceding it. These distributions carry important information about a language and can, for example, be used to separate vowels from consonants. However, when represented simply as frequency tables they aren't very illustrative, so I decided to make a visualization.
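The definition above can be sketched in a few lines of Python. This is only an illustration, not the code used for the post: the function name `conditional_distributions` and the toy input string are made up here, and it assumes every character of the input (including spaces and line breaks) counts as a symbol.

```python
from collections import Counter, defaultdict

def conditional_distributions(text):
    """Return (right, left) conditional distributions of a text.

    right[c][d] = P(next character is d | current character is c)
    left[c][d]  = P(previous character is d | current character is c)
    """
    right_counts = defaultdict(Counter)
    left_counts = defaultdict(Counter)
    # Walk over all adjacent character pairs (a, b).
    for a, b in zip(text, text[1:]):
        right_counts[a][b] += 1
        left_counts[b][a] += 1

    def normalize(counts):
        # Turn raw pair counts into probabilities per conditioning character.
        result = {}
        for c, counter in counts.items():
            total = sum(counter.values())
            result[c] = {d: n / total for d, n in counter.items()}
        return result

    return normalize(right_counts), normalize(left_counts)

# Toy example (hypothetical input, not the Moby-Dick or Voynich data).
right, left = conditional_distributions("banana_bandana")
print(right["n"])  # distribution of characters following "n"
print(left["n"])   # distribution of characters preceding "n"
```

Printed as plain dictionaries like this, the distributions are exactly the kind of frequency table that is hard to read at a glance.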

A single conditional distribution $p$ is visualized in the following way. First, calculate Shannon's diversity index $d_1(p) = \exp\left(-\sum_i p_i \ln p_i\right)$, which measures how many different values a distribution can effectively take. As you may recognize, it is an exponentiated entropy. The diversity index of a uniform distribution over $n$ values is $n$.
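As a quick check of the formula, here is a minimal sketch of the diversity index. The function name `diversity_index` is chosen for illustration, and zero probabilities are skipped on the usual convention that 0 · ln 0 = 0.

```python
import math

def diversity_index(p):
    """Shannon diversity index: exp of the entropy (in nats) of p."""
    # Terms with p_i == 0 contribute nothing, so they are skipped.
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return math.exp(entropy)

uniform = diversity_index([0.25] * 4)     # uniform over 4 values, close to 4
skewed = diversity_index([0.97, 0.03])    # nearly deterministic, close to 1
print(uniform, skewed)
```

A distribution dominated by a single outcome gets a diversity index only slightly above 1, even though it formally has two possible values.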

Then draw a bar with unit height and length $d_1(p)$, subdivide it into sections with lengths proportional to $p_i$, and sort the sections by length. Since $d_1(p) = 1 / \prod_j p_j^{p_j}$, the length of section $i$ is $p_i \, d_1(p) = p_i / \prod_j p_j^{p_j}$, which can be interpreted as the ratio of the corresponding conditional character probability to the weighted geometric mean of the $p_j$ with weights $p_j$. The complete visualization is a stack of left ($p_L$) and right ($p_R$) conditional distribution bars, sorted by the total diversity index $d_1(p_L) \times d_1(p_R)$.
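The bar construction can be sketched as follows. The helper names (`diversity`, `bar_sections`, `stack_order`) are hypothetical, and two details are assumptions rather than something the text specifies: sections are sorted longest first, and the stack is ordered from highest to lowest total diversity.

```python
import math

def diversity(p):
    """Shannon diversity index exp(-sum p_i ln p_i)."""
    return math.exp(-sum(x * math.log(x) for x in p if x > 0))

def bar_sections(p):
    """Section lengths p_i * d_1(p) of one bar (assumed longest first)."""
    p = list(p)
    d1 = diversity(p)
    return sorted((x * d1 for x in p), reverse=True)

def stack_order(left, right):
    """Order characters by d_1(p_L) * d_1(p_R) (descending is an assumption).

    left and right map each character to a dict of conditional probabilities.
    """
    chars = left.keys() & right.keys()
    return sorted(
        chars,
        key=lambda c: diversity(left[c].values()) * diversity(right[c].values()),
        reverse=True,
    )

p = [0.2, 0.5, 0.3]
sections = bar_sections(p)
# Weighted geometric mean of the p_i, with the p_i themselves as weights.
geo_mean = math.exp(sum(x * math.log(x) for x in p))
print(sections)
print([x / geo_mean for x in sorted(p, reverse=True)])  # same values up to rounding
```

The sections add up to the bar length `diversity(p)`, so a wide bar with many thin sections indicates an unpredictable neighbor, while a short bar with one dominant section indicates a nearly fixed one.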

Now let's take a look at the visualizations of English (statistics collected from Moby-Dick) and of the Voynich manuscript. '_' denotes spaces, '/' denotes new lines, and '*' marks the unconditional distributions.

Some interesting, even if well-known, features of Voynichese are apparent:

It is predictable in the sense of having relatively low conditional distribution entropies. In English, the bigram entropy is 7.41 bits and the unigram entropy is 4.09 bits, the ratio being 1.81. For Voynichese, the corresponding figure is 6.04 / 3.88 = 1.56.

EVA q is followed by o 97.5% of the time, and n is preceded by i 97.4% of the time. Could these combinations actually be single characters?

m, g, n and y appear mostly at the ends of words.

Beginnings and ends of lines influence the character statistics. m and g appear especially frequently at line ends, while the line beginnings have an increased proportion of p and t (it has been suggested that gallows can serve as paragraph markings).

Similar-looking gallows (k and t, f and p) have very similar statistics, but p and t appear much more frequently at the line beginnings. A similar phenomenon occurs with r and m, but m usually ends the line. Could the symbols with extra loops (p, t and m) be special newline graphic variants of f, k and r?
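The bigram-to-unigram entropy ratio quoted in the first observation can be estimated along these lines. This is a sketch under assumptions: the function names are illustrative, bigrams are taken as overlapping adjacent pairs, and the text does not say exactly how spaces and line breaks were handled when the quoted figures were computed.

```python
import math
from collections import Counter

def entropy_bits(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    total = sum(counts.values())
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def bigram_unigram_ratio(text):
    """Ratio of bigram entropy to unigram entropy for a string."""
    unigram = entropy_bits(Counter(text))
    # Overlapping adjacent pairs (an assumption about how bigrams are counted).
    bigram = entropy_bits(Counter(zip(text, text[1:])))
    return bigram / unigram

# Toy input: strictly alternating characters are almost fully predictable.
print(bigram_unigram_ratio("abababab"))
```

For a long text, a ratio near 2 would mean that adjacent characters are nearly independent, and a ratio near 1 would mean that each character almost determines the next, so 1.56 for Voynichese against 1.81 for English points to more predictable character sequences.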
