
Last month, I taught a short course on "Corpus-based Linguistic Research" at the LSA Institute in Ann Arbor, in which the participants were asked to do individual projects. One of the undergraduates in the class, Alex R., undertook to examine the time-course of variability in English spelling, starting with the Paston Letters, which are "a collection of letters and papers consisting of the correspondence of members of the Paston family of Norfolk gentry, and others connected with them in England, between the years 1422 and 1509".

There's plenty of variation. Here's Alex's inventory of some of the ways that Wednesday is spelled in that collection:

For context, here's the start of one letter, from Agnes Paston to her son John, written in 1447:

Soon, I grete ȝow wel wyth Goddys blyssyng and myn; and I latte ȝow wette þat my cosyn Clere wrytted to me þat sche spake wyth Schrowpe aftyre þat he had byen wyth me at Norwyche, and tolde here what chere þat I had made hym; and he seyde to here he lyked wel by þe chere þat I made hym. He had swyche wordys to my cosyn Clere þat lesse þan þe made hym good chere and ȝaf hym wordys of conforth at London he wolde no more speke of þe matyre. My cosyn Clere thynkyth þat it were a foly to forsake hym lesse þan ȝe knew of on owdyre as good ore bettere, and I haue assayde ȝowre sustere and I fonde here neuer so wylly to noon as sche is to hym, ȝyf it be so þat his londe stande cleere. I sent ȝow a letter by Brawnton fore sylke and fore þis matyre be-fore my cosyn Clere wrote to me, þe qwyche was wrytten on þe Wednysday nexȝt aftyre Mydsomere Day. Ser Herry Ynglows is ryȝth besy a-bowt Schrowpe fore on of his doȝhteres.

The usual story about English spelling regularization is that it developed as a result of printing. Alex is interested in the hypothesis that spelling was already becoming more consistent in hand-written documents before printing would have had any effect, perhaps due to some of the same forces that tend to create linguistic consistency in a speech community.

One of the problems in this general area is that the available corpora are generally not lemmatized — that is, when a text says "Wendysday" there's no straightforward automatic way to determine that this represents the same word as other letter-strings like "Wendisdaye" or "Wednysdaye" or "Wednysday".

Although there are many programs that purport to "lemmatize" English text, none of them are adequate even for modern text in standard spellings, since there is no standard way to identify English words at the level of dictionary entries or major sub-entries — there's a commonly-accepted fiction that the letter-string corresponding to the standard spelling of the stem ought to be good enough. And for older texts, or for modern texts with non-standard spelling, even that inadequate solution is not easily available.

I pointed Alex in the direction of some interesting recent work by Jacob Eisenstein ("What to do about bad language on the internet", NAACL-HLT 2013) on the analogous problems in "normalizing" modern social-media text. Alex's background is in the humanities, so the world of computer text hacking is new to him, but he's making good progress, as you'll see below.

As far as I know, no one has yet made a serious attempt (for instance) to learn a weighted transducer that would connect letter-strings in historical texts to the corresponding modern spellings — much less to do what we really need, which is to connect such letter-strings to stable lexical identifiers at the level of entries and major sub-entries in a work like the Oxford English Dictionary. At the recent OED Symposium, I proposed that the OED should work with others to build a large historical corpus annotated with such identifiers, and of course also to create taggers that would do this annotation automatically. This would imply licensing the identifiers for appropriate use by others — an alternative approach would be to try to extend the Wiktionary in directions that would make such a project possible.
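To make the idea of connecting historical letter-strings to modern spellings a bit more concrete, here is a minimal Python sketch. It is not a learned transducer: it is a plain weighted edit distance, which can be thought of as a one-state weighted transducer, and the costs are hand-set guesses rather than values estimated from data. The names `CHEAP_SUBS`, `CHEAP_INDEL`, `weighted_distance` and `normalize`, and the tiny lexicon in the usage note, are all hypothetical.

```python
# Hand-set, purely illustrative costs. A serious system would learn
# these weights from aligned historical/modern word pairs.
CHEAP_SUBS = {frozenset("yi"), frozenset("uv")}  # letters often interchanged
CHEAP_INDEL = set("e")                           # e.g. an optional final -e
CHEAP = 0.25
FULL = 1.0


def sub_cost(a, b):
    """Cost of rewriting letter a as letter b."""
    if a == b:
        return 0.0
    return CHEAP if frozenset((a, b)) in CHEAP_SUBS else FULL


def indel_cost(ch):
    """Cost of inserting or deleting a single letter."""
    return CHEAP if ch in CHEAP_INDEL else FULL


def weighted_distance(source, target):
    """Weighted Levenshtein distance between two letter-strings."""
    # prev[j] is the cost of turning the consumed part of source
    # into the first j letters of target.
    prev = [0.0]
    for t in target:
        prev.append(prev[-1] + indel_cost(t))
    for s in source:
        cur = [prev[0] + indel_cost(s)]
        for j, t in enumerate(target, start=1):
            cur.append(min(
                prev[j] + indel_cost(s),       # delete s
                cur[j - 1] + indel_cost(t),    # insert t
                prev[j - 1] + sub_cost(s, t),  # substitute or match
            ))
        prev = cur
    return prev[-1]


def normalize(form, lexicon):
    """Return the lexicon entry closest to form under the weighted distance."""
    return min(lexicon, key=lambda w: weighted_distance(form.lower(), w))
```

Calling `normalize("Wendisdaye", ["monday", "wednesday", "thursday"])` picks the nearest entry in that toy lexicon. A sketch like this only maps strings to strings; it does nothing about the harder problem of mapping to stable lexical identifiers, and it has no notion of context.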

Anyhow, what reminded me of these issues today was an email from Alex, which I reproduce below:

I hope you are well. I'm halfway through my time in Edinburgh, working at the Festival doing technical production. It's been hectic and stressful but I'm still finding time to work on my Python skills and the Paston Letters. I'm hoping to get in touch with Jacob soon and learn more about finite state transducers and his work.

I thought the following might amuse you – my first breakfast experiment! I wanted to practise processing the XML file I have to remove all the guff and play around with lists/dictionaries/functions/loops/frequencies/tokenisation/etc so I wrote some stuff in Python to do this. It took around 30 minutes.

"Suppose you are playing a non-standard variant of Scrabble. The board is large with each side being over a million tiles wide. It has no bonus letter/word score tiles at all.

It is the first turn and you get to place your tiles first. By a stroke of luck, you notice that the 841,995 tiles you are currently holding in your hand will allow you to place the entire text of the Paston Letters (without spaces, numbers (unless in Roman numeral notation) or punctuation marks) in a straight line along the middle of the board.

Before doing so, you decide to calculate the total score.

The standard letter scores are the same as in Present Day English with the following additions based on frequency profiles: ȝ (yogh) is worth the same as "Q" at 10 points, þ (thorn) is worth the same as "K" at 5 points. French "é", despite having the same frequency as Q is only worth 1 point, the same as "e", on account of England's friction with France during the period the Paston Letters were written.

The final score is 1,594,464 (plus an extra 50 for using all your tiles)."
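What follows is not Alex's code, just a minimal sketch of how a tally like this might be computed in Python. The letter values are the standard English Scrabble ones plus the three additions described in the puzzle. The function name `scrabble_score` is made up for illustration, and the choice to silently skip any character without a listed value is an assumption, since the puzzle doesn't say how other special characters were handled.

```python
import unicodedata

# Standard English Scrabble letter values.
SCORES = {}
for letters, value in [
    ("aeilnorstu", 1),
    ("dg", 2),
    ("bcmp", 3),
    ("fhvwy", 4),
    ("k", 5),
    ("jx", 8),
    ("qz", 10),
]:
    for ch in letters:
        SCORES[ch] = value

# Additions described in the puzzle.
SCORES["ȝ"] = 10  # yogh, same as Q
SCORES["þ"] = 5   # thorn, same as K
SCORES["é"] = 1   # same as plain e


def scrabble_score(text):
    """Return (tile_count, score) for the scorable letters in text.

    Assumption: spaces, digits, punctuation and any letter without a
    listed value are skipped. Roman numerals are ordinary letters, so
    they are counted.
    """
    # NFC composes 'e' + combining acute into a single 'é'.
    text = unicodedata.normalize("NFC", text).lower()
    tiles = 0
    total = 0
    for ch in text:
        if ch in SCORES:
            tiles += 1
            total += SCORES[ch]
    return tiles, total
```

For example, `scrabble_score("Wednysday")` returns `(9, 20)`. The 50-point bonus for using all the tiles would be added separately to the score for the full text.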

In my experience, the impulse to have fun programming is an excellent predictor of the rate of skill development.

By the way, the OED gives these variants for Wednesday:

α. OE Wodnesdæg, OE Wodnesdoeg (Northumbrian), OE Wodnessdæg, lOE Wodenesdei, lOE Wodnes dægge (dative), lOE Wodnesdæig, lOE Wodnesdeg, lOE Wodnesdeig, lOE Wodnosdæg, lOE–eME Wodnesdei, eME Wodnesdæȝ, eME Wodnesdawes (plural), ME Wodeinsday, ME Wodenesday, ME Wodenisday, ME Wodenysday, ME Wodinsdai, ME Wodnesday, ME Wodnysday, ME–15 Wodensday, 15 Wodinsday; Sc. pre-17 Vodenisday, pre-17 Vodinsday, pre-17 Vodnisday, pre-17 Vodynnis day, pre-17 Voidinisday, pre-17 Woddinnesdaye, pre-17 Woddinnisday, pre-17 Woddinsday, pre-17 Woddnesday, pre-17 Woddynsday, pre-17 Wodenisday, pre-17 Wodinsday, pre-17 Wodnisday, 18–19 Wodensday; N.E.D. (1926) also records a form lME Wodinsday.

β. eME Wednesdei, eME Weodnesdei, ME Weddenesday, ME Weddensdaye, ME Weddynisday, ME Wedenesday, ME Wedenisdai, ME Wedenysday, ME Wednesdai, ME Wednesseday, ME Wednysdaye, ME Wedonesday, ME 16 Wedensday, ME–15 Wedinsday, ME–15 Wednysday, ME–15 Wedynsday, ME–16 Wednisday, ME– Wednesday, lME Weddysday, 15 Weddinsday, 15 Weddynsday, 15 Wedensdaye, 15 Wedenysdaye, 15 Wednesdaie, 15 Wednisdaye, 15 Wednsdaye, 15 Wedynsdaye, 15–16 Wednesdaye, 16 Weddensday, 17 Wedonsday; Sc. pre-17 Vadinsday, pre-17 Vadynisday, pre-17 Veddensday, pre-17 Veddnesday, pre-17 Veddnsday, pre-17 Veddyinsday, pre-17 Veddynisday, pre-17 Vedenysday, pre-17 Vedinnisday, pre-17 Vedinsday, pre-17 Vednesday, pre-17 Vednisday, pre-17 Vednysday, pre-17 Waddinsday, pre-17 Wadinesday, pre-17 Wadinsdaye, pre-17 Wadnysdaye, pre-17 Weddansday, pre-17 Weddensday, pre-17 Weddenseday, pre-17 Weddinisday, pre-17 Weddinissday, pre-17 Weddinsday, pre-17 Weddnesday, pre-17 Weddnysday, pre-17 Weddynisday, pre-17 Weddynnisday, pre-17 Wedenisdaye, pre-17 Wedinday, pre-17 Wedinsday, pre-17 Wednisday, pre-17 Wednysday, pre-17 Wedynnisda, pre-17 Wedynsday, pre-17 Wedynysday, pre-17 Widinsday, pre-17 17 Wadinsday, pre-17 17– Wednesday, pre-17 18– Wadnesday, 17 Wedensday, 17 Wednsday, 17 Wednsdy, 17– Wadensday, 18 Wadnsday, 18 Wedsinday, 19– Wadsday; N.E.D. (1926) also records forms ME Wedonesdai, lME Weddynsday.

γ. eME Wendesdei, ME Wendesdai, ME Wendesday, ME Wendesdaye, ME Wendisday, ME Wendisdaye, 19– Wensdeh (Eng. regional (Yorks.)); Sc. pre-17 Wandisday, pre-17 Wendinsday, pre-17 Wendisday, pre-17 17 Wendsday; N.E.D. (1926) also records a form lME Wyndenesse day.

δ. ME Vennysday, ME Wannysday, ME Wanysday, ME Wennessday, ME Wenstay, ME Wenysday, ME Wonnysday, ME Wonysday, ME–15 Wenesday, ME–15 Wennesday, ME–15 Wennysday, ME–16 Wensdaie, ME–16 Wensdaye, ME–17 Wensday, lME Whenys day, lME Wonesday, lME Wonesdaye, 15 Wensdye, 16 Weansday, 18 Wennesdei (Irish English (Wexford)); Sc. pre-17 17– Wensday, 17– Wansday; N.E.D. (1926) also records a form ME Wannesdai.

I suspect that the list is incomplete, and hereby offer a free lifetime LLOG subscription to the first reader who can find a historically-attested variant that's missing from the OED's list.
