A recent post on Patrick Vennebush blog Math Jokes for 4 Mathy Folks asserted that the rule of thumb “I before E except after C” was “total bullshit.” This got me thinking: the “I before E except after C” rule (let’s just call it the IEC rule) was almost certainly developed without any research at all, just based on the subjective experience of educators. It’s not a horrible rule, but certainly we can more intelligently construct better rules of this kind (for lack of a better term, I’ll be refering to these as “trigram spelling rules”). We have the technology, and I have nothing better to do with my night off 🙂

You can find my full methodology and analysis in the following IPython notebook:

http://nbviewer.ipython.org/gist/dmarx/b6a095d2b161eccb18a3

The short version

I used a larger word list than Patrick (233,621 words), but my analysis still corroborated his. I observed the following bigram and trigram frequencies for the IEC rule:

ie: 3950 cie: 256 ei: 2607 cei: 156

I thought that perhaps although the IEC rule doesn’t work when we look at the unique words in our vocabulary, perhaps it might hold true if we look at trigram and bigram frequencies across word usage in written text. Here are the frequencies for the IEC rule in the Brown corpus:

ie: 13275 cie: 1310 ei: 5677 cei: 485

Nope, still no good.

Instead of the IEC rule, here are some alternatives (taken from my vocabulary analysis, not the word usage analysis). For each rule of the form “A before B except after C” below, the bigram frequency percentage count(AB)/(count(AB)+count(BA)) is at least 1/2, and the laplace smoothed trigram frequency ratio (1+count(cba))/(1+count(cab)) is maximized:

P before E except after C

pe: 8052 cpe: 0 ep: 5053 cep: 955

E before U except after Q

eu: 2620 qeu: 0 ue: 1981 que: 949

I before C except after I

ic: 26140 iic: 1 ci: 6561 ici: 1830

T before E except after M

te: 27265 mte: 2 et: 11743 met: 2684

R before D except after N

rd: 3641 nrd: 0 dr: 2738 ndr: 808

Update

After posting this article, there was some discussion that the optimal rules should focus on vowel placement and have a higher bigram ratio than the 1/2 threshold I used. Here are two “better” rules that satisfy these condiitons:

O before U except after Q

ou 12144 qou 0 uo 671 quo 122

I before O except after J

io 15247 jio 0 oi 4040 joi 95