Our analysis was performed on a publicly available digital version of the Voynich text in the European Voynich Alphabet (EVA) transcription system. This digital version reproduces the ordering of the manuscript’s current binding, where, except for around a dozen scattered folios, all pages belonging to each “thematic” section are contiguous. Our results were obtained for a reordered version of the manuscript, in which those scattered folios were aggregated into the corresponding sections (see Materials and Methods).

Most Informative Words in the Voynich Text

In any sizable piece of written human language that articulates information about several subjects, certain words are tightly related to the main topics dealt with in the text. If the Voynich manuscript contains a meaningful text encrypted by translation into a coded or invented language, statistical signatures in the distribution of tokens could be used to identify candidates for such keywords.

Methods for detecting content-bearing keywords in language samples have a long history [7], [8]. Some of the most successful approaches have looked not only at the frequencies of words, but also at their distribution over the sample [6], [8]–[10]. In particular, the distribution profile of the occurrences of each individual word has turned out to be a key feature to assess the word’s relevance to the overall meaning of a text. While uninformative words tend to have an approximately homogeneous (Poissonian) distribution, the most relevant words are scattered more irregularly, and their occurrences are typically clustered [11]–[13]. The tendency of content-bearing words to cluster over certain parts of the text is a direct consequence of their varying relation to the local semantic context as the text progresses and its meaning unfolds. Over long spans, the clustering patterns of words develop a systematic statistical structure that determines the degree of local specificity of their usage in successive contextual domains. Word clustering has been preliminarily reported in the Voynich text [14].
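The clustering diagnostic described above can be illustrated with the variance-to-mean ratio (index of dispersion) of a word's counts per fixed-size block: it is close to 1 for a Poisson-like, uninformative word and larger for a clustered, content-bearing one. The following is a minimal Python sketch on invented toy data; it conveys only the underlying idea, not the exact methods of [11]–[13]:

```python
def dispersion(tokens, word, block_size):
    """Variance-to-mean ratio of the word's counts per block.
    Approximately 1 for a Poisson-like (uninformative) word,
    noticeably larger than 1 when the word's occurrences cluster."""
    counts = [tokens[i:i + block_size].count(word)
              for i in range(0, len(tokens), block_size)]
    mean = sum(counts) / len(counts)
    var = sum((c - mean) ** 2 for c in counts) / len(counts)
    return var / mean if mean > 0 else 0.0

# Toy text: "topic" clusters in the first half, "the" is spread evenly.
tokens = ["the", "topic", "a"] * 10 + ["the", "b", "c"] * 10
print(dispersion(tokens, "topic", 10) > dispersion(tokens, "the", 10))  # True
```

The clustered word yields a dispersion well above that of the uniformly spread function word, which is the signature exploited by keyword-detection methods.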

In our analysis, we used an information-theoretical measure that quantifies the amount of information that the distribution of words bears about the sections where they appear in the text [6]. Words that are uniformly scattered contribute little or no information, since their distribution cannot tag any specific section of the text. On the contrary, words that appear only in certain contextual domains contribute much information, because their distribution identifies those specific sections. The information measure, given by Eq. 2, depends parametrically on a length scale (a given number of words) that defines the size of local domains (see Materials and Methods for an overview).
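A measure of this kind can be sketched as the plain mutual information between word identity and equal-size consecutive parts of the text. The Python sketch below uses invented toy data and may omit refinements present in the actual Eq. 2 (for instance, a shuffled-text baseline correction); it shows only the basic computation of per-word contributions:

```python
import math
from collections import Counter

def word_information(tokens, n_parts):
    """Bits/word each word contributes to the mutual information between
    word identity and equal-size consecutive parts of the text."""
    size = len(tokens) // n_parts
    used = tokens[:size * n_parts]          # drop the trailing remainder
    N, totals = len(used), Counter(used)
    parts = [Counter(used[j * size:(j + 1) * size]) for j in range(n_parts)]
    return {w: sum((pc[w] / N) * math.log2(n_parts * pc[w] / f)
                   for pc in parts if pc[w])
            for w, f in totals.items()}

# "x" occurs only in the first half, so it tags that part; "a" is uniform.
tokens = ["x", "a"] * 5 + ["b", "a"] * 5
info = word_information(tokens, 2)
print(info["x"], info["a"])   # 0.25 0.0
```

The localized word contributes 0.25 bits/word, while the uniformly distributed word contributes nothing, exactly the qualitative behavior described above.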

Figure 1A shows the information in bits per word as a function of the scale of contextual domains for several information-carrying sequences, comprising natural and artificial languages, the Voynich manuscript, and the genetic code (details about the individual sequences are given in Materials and Methods). All cases share a similar overall pattern, with low information at both large and small scales. This feature is a consequence of the fact that, in those two limits, there is poor specificity in the profile of the distribution of words over the text. At sufficiently small scales, each word occurs at most once in each domain, making its distribution uninformative about specific locations. In the opposite limit, when the scale becomes comparable to the total length of the text, all words have a more or less uniform distribution, which again leads to low information.

Figure 1. Comparison of the Voynich manuscript and different information-carrying sequences. A) Information in word distribution as a function of the scale for the Voynich manuscript compared to five other language and symbolic sequences (F: Fortran; C: Chinese; V: Voynich; E: English; L: Latin; Y: yeast DNA). The number of words in all sequences was equal to that of the Voynich text; if the original sequence was longer, the additional words were not considered. B) Scale of maximal information for the sequences considered in A (see Materials and Methods for more details on the language and symbolic sources). https://doi.org/10.1371/journal.pone.0066344.g001

In all cases, in turn, the information about the identity of the different local domains attains a maximum at an optimal scale. This is the scale at which the heterogeneity in the distribution of word frequencies over the text is largest. At that particular scale, the frequency profile of the words can, on average, tag different parts most efficiently. For the texts in natural human languages, the maximum achieved by the information ranges from approximately 0.2 bits/word for Latin to 0.6 bits/word for Chinese. This difference can be attributed to the disparate vocabulary sizes, resulting from the different degrees of inflection in the respective languages. Latin is a highly inflected language, with nouns and adjectives changing by declension and verbs by conjugation; it typically shows a very rapid growth of vocabulary size as a function of text length [15]. On the contrary, Chinese texts usually require a smaller number of tokens (in this case, characters) for comparable text lengths. Therefore, assuming comparable total information, a smaller effective vocabulary implies more information per word. Interestingly, the three natural-language texts attain maximal information at a similar scale of around 600–800 words. The maximal information for the Voynich manuscript is slightly above that of English, and significantly below that of Chinese. Moreover, as can be seen in Figure 1B, the scale at which maximal information is reached for the Voynich text is very similar to that of the human-language examples. In contrast, the scales of maximal information for the DNA sequence of the yeast Saccharomyces cerevisiae and for the Fortran source code are markedly different from those of the human-language texts and the Voynich manuscript. These results suggest that the overall statistics of word distribution in the Voynich manuscript are comparable with those of real human languages.
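The emergence of an optimal scale can be reproduced in a toy setting by scanning the part size. The sketch below uses plain mutual information on an invented text with four topical blocks (the paper's Eq. 2 may additionally include corrections, such as a shuffled-text baseline, that are omitted here); the information peaks when the part size matches the topical structure and vanishes when a single part covers the whole text:

```python
import math
from collections import Counter

def total_information(tokens, part_size):
    """Total information (bits/word) of the word distribution when the text
    is cut into consecutive parts of part_size tokens each."""
    n_parts = len(tokens) // part_size
    used = tokens[:n_parts * part_size]     # drop the trailing remainder
    N, totals = len(used), Counter(used)
    return sum((n / N) * math.log2(n_parts * n / totals[w])
               for j in range(n_parts)
               for w, n in Counter(used[j * part_size:(j + 1) * part_size]).items())

# Toy text: four 30-word "topics", each with its own keyword plus filler.
tokens = []
for topic in ["alpha", "beta", "gamma", "delta"]:
    tokens += [topic, "the", "of"] * 10
scores = {s: total_information(tokens, s) for s in (30, 60, 120)}
# Highest at the topic-block scale (30); zero when one part spans the text.
```

In this toy example, parts of 30 words align exactly with the topical blocks and maximize the information, while at the full-text scale every word is trivially uniform and the information drops to zero, mirroring the large-scale limit discussed above.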

The total information given by Eq. 2 is a sum of contributions from individual words. It is therefore possible to assign an information value to each word in the lexicon, corresponding to its term in the sum [6]. This allows the individual words to be ranked according to their contribution to the overall information. Table 1 shows the 30 most informative words in the Voynich text (in the EVA transcription), ranked by their contribution to the total information, computed both at the optimal scale and with respect to the division of the text into its “thematic” sections. The same procedure applied to texts written in known languages yields a list of keywords that closely relate to their general semantic content [6]. All the words listed in Table 1 make a substantial contribution to the information that their individual distribution profiles bear about the different sections of the text. Even though the optimal scale is 807 words while the average size of the “thematic” sections is above 7500 words, some words appear in both columns of Table 1, in particular among the most informative. This is because these top words are both highly frequent and strongly non-uniformly distributed over the different “thematic” sections, with some of them being used in only one or two sections of the text. The strong specificity of their distribution is also captured by a partition of the text into sections of equal size, as in the first column of Table 1, thus leading to a high information value.
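Turning the per-word contributions into a ranking is then a matter of sorting. The sketch below applies plain mutual information to an invented two-section toy text (the helper name and data are hypothetical, and the paper's Eq. 2 may differ in detail); section-specific words rise to the top while evenly spread function words fall to the bottom:

```python
import math
from collections import Counter

def rank_words(tokens, n_parts, top=5):
    """Rank words by their contribution (bits/word) to the information
    about equal-size consecutive parts of the text."""
    size = len(tokens) // n_parts
    used = tokens[:size * n_parts]          # drop the trailing remainder
    N, totals = len(used), Counter(used)
    parts = [Counter(used[j * size:(j + 1) * size]) for j in range(n_parts)]
    score = {w: sum((pc[w] / N) * math.log2(n_parts * pc[w] / f)
                    for pc in parts if pc[w])
             for w, f in totals.items()}
    return sorted(score, key=score.get, reverse=True)[:top]

# "herb" and "star" each belong to one half; "the" is spread evenly.
tokens = ["herb", "the"] * 20 + ["star", "the"] * 20
print(rank_words(tokens, 2, top=2))   # the two section-specific words first
```

Applied to a real corpus, the same sorting step is what produces a keyword list analogous to Table 1.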