As I see it, there are four foundational tasks that need to be done to wrangle Voynichese into a properly usable form:

* Task #1: Transcribing Voynichese text into a reliable computer-readable raw transcription e.g. EVA qokeedy

* Task #2: Parsing the raw transcription to determine Voynichese’s fundamental units (its tokens) e.g. [qo][k][ee][dy]

* Task #3: Clustering the pages / folios into groups where the text shares distinct features e.g. Currier A vs Currier B

* Task #4: Normalizing the clusters e.g. how A tokens / patterns map to B tokens / patterns, etc

I plan to tackle these four areas in separate posts, to try to build up a substantive conversation on each topic in turn.

Takahashi’s EVA transcription

Rene Zandbergen points out that, of all the different “EVA” transcriptions that appear interleaved in the EVA interlinear file, “the only one that was really done in EVA was the one from Takeshi. He did not use the fully extended EVA, which was probably not yet available at that time. All other transcriptions have been translated from Currier, FSG etc to EVA.”

This is very true, and is the main reason why Takeshi Takahashi’s transcription is the one most researchers tend to use. Yet aside from not using extended EVA, there are a fair few idiosyncratic things Takeshi did that reduce its reliability, e.g. as Torsten Timm points out “Takahashi reads sometimes ikh where other transcriptions read ckh“.

So the first thing to note is that the EVA interlinear transcription file’s interlinearity arguably doesn’t actually help us much at all. In fact, until such time as multiple genuinely EVA transcriptions get put in there, its interlinearity is more of an archaeological historical burden than something that gives researchers any kind of noticeable statistical gain.

What this suggests to me is that, given the high quality of the scans we now have, we really should be able to collectively determine a single ‘omega’ stroke transcription: and even where any ambiguity remains (see below), we really ought to be able to capture that ambiguity within the EVA 2.0 transcriptions itself.

EVA, Voyn-101, and NEVA

The Voyn-101 transcription used a glyph-based Voynichese transcription alphabet derived by the late Glen Claston, who invested an enormous amount of his time to produce a far more all-encompassing transcription style than EVA did. GC was convinced that many (apparently incidental) differences in the ways letter shapes were put on the page might encipher different meanings or tokens in the plaintext, and so ought to be captured in a transcription.

So in many ways we already have a better transcription, even if it is one very much tied to the glyph-based frame of reference that GC was convinced Voynichese used (he firmly believed in Leonell Strong’s attempted decryption).

Yet some aspects of Voynichese writing slipped through the holes in GC’s otherwise finely-meshed net, e.g. the scribal flourishes on word-final EVA n shapes, a feature that I flagged in Curse back in 2006. And I would be unsurprised if the same were to hold true for word-final -ir shapes.

All the same, GC’s work on v101 could very well be a better starting point for EVA 2.0 than Takeshi’s EVA. Philip Neal writes: “if people are interested in collaborating on a next generation transcription scheme, I think v101/NEVA could fairly easily be transformed into a fully stroke-based transcription which could serve as the starting point.”

EVA, spaces, and spatiality

For Philip Neal, one key aspect of Voynichese that EVA neglects is measurements of “the space above and below the characters – text above, blank space above etc.”

To which Rene adds that “for every character (or stroke) its coordinates need to be recorded separately”, for the reason that “we have a lot of data to do ‘language’ statistics, but no numerical data to do ‘hand’ statistics. This would, however, be solved by […having] the locations of all symbols recorded plus, of course their sizes. Where possible also slant angles.”

The issue of what constitutes a space (EVA .) or a half-space (EVA ,) has also not been properly defined. To get around this, Rene suggests that we should physically measure all spaces in our transcription and then use a software filter to transform that (perhaps relative to the size of the glyphs around it) into a space (or indeed half-space) as we think fit.

To which I’d point out that there are also many places where spaces and/or half-spaces seem suspect for other reasons. For example, it would not surprise me if spaces around many free-standing ‘or’ groups (such as the famous “space transposition” sequence “or or oro r”) are not actually spaces at all. So it could well be that there would be context-dependent space-recognition algorithms / filters that we might very well want to use.

Though this at first sounds like a great deal of work to be contemplating, Rene is undaunted. To make it work, he thinks that “[a] number of basics should be agreed, including the use of a consistent ‘coordinate system’. Again, there is a solution by Jason Davies [i.e. voynichese.com], but I think that it should be based on the latest series of scans at the Beinecke (they are much flatter). My proposal would be to base it on the pixel coordinates.”

For me, even though a lot of this would be nice things to have (and I will be very interested to see Philip’s analysis of tall gallows, long-tailed characters and space between lines), the #1 frustration about EVA is still the inconsistencies and problems of the raw transcription itself.

Though it would be good to find a way of redesigning EVA 2.0 to take these into account, perhaps it would be better to find a way to stage delivery of these features (hopefully via OCR!), just so we don’t end up designing something so complicated that it never actually gets done. 🙁

EVA and Neal Keys

One interesting (if arguably somewhat disconcerting) feature of Voynichese was pointed out by Philip Neal some years ago. He noted that where Voynichese words end in a gallows character, they almost always appear on the top line of a page (sometimes the top line of a paragraph). Moreover, these had a strong preference for being single-leg gallows (EVA p and EVA f); and also for appearing in nearby pairs with a short, often anomalous-looking stretch of text between them. And they also tend to occur about 2/3rds of the way across the line in which they fall.

Rather than call these “top-line-preferring-single-leg-gallows-preferring-2/3rd-along-the-top-line-preferring-anomalous-text-fragments“, I called these “Neal Keys”. This term is something which other researchers (particularly linguists) ever since have taken objection with, because it superficially sounds as though it is presupposing that this is a cryptographic mechanism. From my point of view, those same researchers didn’t object too loudly when cryptologist Prescott Currier called his Voynichese text clusters “languages”: so perhaps on balance we’re even, OK?

I only mention this because I think that EVA 2.0 ought to include a way of flagging likely Neal Keys, so that researchers can filter them in or out when they carry out their analyses.

EVA and ambiguity

As I discussed previously, one problem with EVA is that it doesn’t admit to any uncertainty: by which I mean that once a Voynichese word has been transcribed into EVA, it is (almost always) then assumed to be 100% correct by all the people and programmes that subsequently read it. Yet we now have good enough scans to be able to tell that this is simply not true, insofar as there are a good number of words that do not conform to EVA’s model for Voynichese text, and for which just about any transcription attempt will probably be unsatisfactory.

For example, the word at the start of the fourth line on f2r:

Here, the first part could possibly be “sh” or “sho”, while the second part could possibly be “aiidy” or “aiily”: in both cases, however, any transcriber attempting to reduce it to EVA would be far from certain.

Currently, the most honest way to transcribe this in EVA would be “sh*,aii*y” (where ‘*’ indicates “don’t know / illegible”). But this is an option that isn’t taken as often as it should.

I suspect that in cases like this, EVA should be extended to try to capture the uncertainty. One possible way would be to include a percentage value that an alternate reading is correct. In this example, the EVA transcription could be “sh!{40%=o},aiid{40%=*}y”, where “!{40%=o}” would mean “the most likely reading is that there is no character there (i.e. ‘!’), but there is a 40% chance that the character should be ‘o'”.

For those cases where two or more EVA characters are involved (e.g. where there is ambiguity between EVA ch and EVA ee), the EVA string would instead look like “ee{30%=ch}”. And on those occasions where there is a choice between a single letter and a letter pair, this could be transcribed as “!e{30%=ch}”.

For me, the point about transcribing with ambiguity is that it allows people doing modelling experiments to filter out words that are ambiguous (i.e. by including a [discard words containing any ambiguous glyphs] check box). Whatever’s going on in those words, it would almost always be better to ignore them rather than to include them.

EVA and Metadata

Rene points out that the metadata “were added to the interlinear file, but this is indeed independent from EVA. It is part of the file format, and could equally be used in files using Currier, v101 etc.” So we shouldn’t confuse the usefulness of EVA with its metadata.

In many ways, though, what we would really like to have in the EVA metadata is some really definitive clustering information: though the pages currently have A and B, there are (without any real doubt) numerous more finely-grained clusters, that have yet to be determined in a completely rigorous and transparent (open-sourced) way. However, that is Task #3, which I hope to return to shortly.

In some ways, the kind of useful clustering I’m describing here is a kind of high-level “final transcription” feature, i.e. of how the transcription might well look much further down the line. So perhaps any talk of transcription

How to deliver EVA 2.0?

Rene Zandbergen is in no doubt that EVA 2.0 should not be in an interlinear file, but in a shared online database. There is indeed a lot to be said for having a cloud database containing a definitive transcription that we all share, extend, mutually review, and write programmes to access (say, via RESTful commands).

It would be particularly good if the accessors to it included a large number of basic filtering options: by page, folio, quire, recto/verso, Currier language, [not] first words, [not] last words, [not] first lines, [not] labels, [not] key-like texts, [not] Neal Keys, regexps, and so forth – a bit like voynichese.com on steroids. 🙂

It would also be sensible if this included open-source (and peer-reviewed) code for calculating statistics – raw instance counts, post-parse statistics, per-section percentages, 1st and 2nd order entropy calculations, etc.

Many of these I built into my JavaScript Voynichese state machine from 2003: there, I wrote a simple script to convert the interlinear file into JavaScript (developers now would typically use JSON or I-JSON).

However, this brings into play the questions of boundaries (how far should this database go?), collaboration (who should make this database), methodology (what language or platform should it use?), and also of resources (who should pay for it?).

One of the strongest reasons for EVA’s success was its simplicity: and given the long (and complex) shopping list we appear to have, it’s very hard to see how EVA 2.0 will be able to compete with that. But perhaps we collectively have no choice now.