English vs Voynichese

To avoid complications from the differences between Currier languages and to keep the text as uniform as possible, I focused on a single section of the Voynich manuscript: quire 20.

As a benchmark, I used a portion of Genesis from the King James Bible with a similar number of words (about 10,500).

In both cases, I used only five POS classes. Such a number is obviously too small to represent all the different part-of-speech categories correctly, but it makes analysis and discussion easier. I fed the whole text to the algorithm, without any punctuation or sentence-boundary marks.
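As a rough illustration of the kind of input described above, here is a minimal Python sketch that turns raw text into a flat list of word tokens with no punctuation or sentence boundaries. The function name `tokenize` and the exact rules (lowercasing, keeping only alphabetic runs) are my assumptions, not necessarily the preprocessing actually used; splitting on apostrophes would at least be consistent with the stray token "s" that appears in the English word lists.

```python
import re

def tokenize(raw_text: str) -> list[str]:
    """Lowercase the text and keep only runs of letters.

    Assumption: punctuation, digits and sentence boundaries are simply
    dropped, so "Noah's" becomes the two tokens "noah" and "s".
    """
    return re.findall(r"[a-z]+", raw_text.lower())

sample = "And God said, Let there be light: and there was light."
tokens = tokenize(sample)
print(" ".join(tokens))
```

The resulting list is one unbroken stream of words, which is the only information the tagger would see under these assumptions.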

These are the 20 most frequent words in each class for the English text, with their number of occurrences:

| C:0 | C:1 | C:2 | C:3 | C:4 |
|---|---|---|---|---|
| tokens:1860 | tokens:2159 | tokens:2095 | tokens:2048 | tokens:2294 |
| types:131 | types:84 | types:122 | types:399 | types:439 |
| ratio:14.1 | ratio:25.7 | ratio:17.1 | ratio:5.1 | ratio:5.2 |
| hapax:67 | hapax:31 | hapax:41 | hapax:190 | hapax:189 |
| and 1117 | the 866 | of 441 | it 93 | earth 99 |
| that 141 | he 146 | in 163 | be 87 | said 93 |
| upon 58 | his 131 | to 131 | him 79 | lord 87 |
| which 52 | i 109 | unto 121 | all 76 | years 65 |
| lived 34 | god 102 | was 109 | thee 67 | will 53 |
| forth 27 | a 98 | shall 95 | abram 59 | sons 47 |
| into 24 | thou 82 | is 79 | them 53 | man 45 |
| came 24 | every 69 | after 69 | not 43 | hundred 44 |
| as 24 | thy 68 | were 67 | noah 41 | had 37 |
| out 22 | they 62 | for 66 | me 37 | name 34 |
| on 21 | their 36 | with 65 | her 37 | wife 33 |
| but 21 | my 35 | begat 64 | there 31 | land 33 |
| up 15 | s 33 | from 61 | went 30 | waters 32 |
| saying 13 | she 30 | shalt 34 | also 26 | day 32 |
| then 12 | an 29 | when 25 | you 20 | have 30 |
| because 12 | two 21 | made 25 | this 20 | days 30 |
| at 12 | cain 18 | called 23 | one 20 | seed 27 |
| wives 10 | three 15 | make 22 | old 20 | ark 26 |
| therefore 9 | five 14 | behold 22 | daughters 20 | flesh 25 |
| where 7 | nine 12 | let 21 | eat 19 | son 23 |

class:0 mostly conjunctions and adverbs

class:1 determiners and numbers (but also subject pronouns)

class:2 12 verbs + 8 prepositions

class:3 contains several object pronouns (him, thee, them, me, her), but it is quite mixed

class:4 16 nouns + 4 verbs (3 of which are auxiliary); even if they do not appear among the most frequent words, several adjectives are also assigned to this class

“Hapax” is the number of hapax legomena (i.e. words that occur only once in the whole text). As expected, this value is anti-correlated with the tokens/type ratio: classes 3 and 4, with about 5 tokens per type, contain roughly 190 hapax legomena each, while classes 0 to 2, with ratios between 14 and 26, contain between 31 and 67. Classes that include fewer word types tend to have fewer hapax legomena. The number of tokens per class is roughly constant (between 1860 and 2294). This suggests that function words may concentrate in classes with a high tokens/type ratio and a low number of hapax legomena.
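The per-class figures discussed above (tokens, types, tokens/type ratio, hapax) are simple to compute from a tagged text. The following Python sketch shows one way to do it; the function name `class_statistics` and the input format (a list of `(word, class_id)` pairs) are assumptions made for illustration, not the output format of the tagging software.

```python
from collections import Counter, defaultdict

def class_statistics(tagged):
    """Compute tokens, types, tokens/type ratio and hapax count per class.

    tagged: list of (word, class_id) pairs (hypothetical input format).
    A hapax legomenon is a word that occurs once in the whole text.
    """
    tagged = list(tagged)
    text_counts = Counter(word for word, _ in tagged)
    per_class = defaultdict(Counter)
    for word, cls in tagged:
        per_class[cls][word] += 1

    stats = {}
    for cls, counts in per_class.items():
        tokens = sum(counts.values())
        types = len(counts)
        hapax = sum(1 for word in counts if text_counts[word] == 1)
        stats[cls] = {
            "tokens": tokens,
            "types": types,
            "ratio": tokens / types,
            "hapax": hapax,
        }
    return stats

toy = [("the", 1), ("earth", 4), ("the", 1), ("ark", 4), ("and", 0)]
for cls, row in sorted(class_statistics(toy).items()):
    print(cls, row)
```

On a toy input like this, a class made of one repeated word gets a high ratio and no hapax legomena, while a class of words seen once each gets a ratio of 1 and as many hapax legomena as types, which is the anti-correlation noted above.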

The most frequent sequences of two consecutive classes are (the numbers correspond to occurrences of the sequence in the tagged text):

1_4 1828

2_3 1010

3_0 917

2_1 868

4_2 862

4_0 804

0_1 801

3_2 500

They can be represented by the following graph:

Some sequences that match those illustrated in the graph:

1:the 4:days 2:of 3:enos 2:were 1:nine 4:hundred 0:and 1:five 4:years

1:the 4:bow 2:shall 3:be 2:in 1:the 4:cloud

1:a 4:dove 2:from 3:him 2:to 3:see 2:if 1:the 4:waters 2:were

0:and 1:the 4:lord 2:plagued 3:pharaoh 0:and 1:his 4:house

0:that 1:his 4:brother 2:was 3:taken
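Counts of consecutive class pairs like those listed earlier can be obtained by sliding a window of two over the sequence of class labels. Here is a minimal Python sketch; the function name `class_bigrams` is hypothetical, and the example tags are taken from the first sequence above ("the days of enos were nine hundred and five years").

```python
from collections import Counter

def class_bigrams(class_sequence):
    """Count each pair of consecutive class labels, keyed as 'a_b'."""
    pairs = zip(class_sequence, class_sequence[1:])
    return Counter(f"{a}_{b}" for a, b in pairs)

# Class labels of: the days of enos were nine hundred and five years
tags = [1, 4, 2, 3, 2, 1, 4, 0, 1, 4]
counts = class_bigrams(tags)
for pair, n in counts.most_common():
    print(pair, n)
```

Even in this ten-word fragment the pair 1_4 occurs three times, in line with it being by far the most frequent sequence in the whole text.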

Of course, many more sequences appear a significant number of times. It is also evident that the word classes are not clear-cut. Yet the results show that this software can detect meaningful grammatical structure, at least for the most frequent words, even with a relatively short text.