
Andrew Gelman is justifiably impressed by Laura Wattenberg's ruminations on rhyme (warning: the second link triggers one of those insufferable ads that starts playing loud sounds as soon as the page comes up, so mute your audio before clicking). Here is Ms. Wattenberg, without the musical background:

Here's a little pet peeve of mine: nothing rhymes with orange. You've heard that before, right? Orange is famous for its rhymelessness. There's even a comic strip called "Rhymes with Orange." Fine then, let me ask you something. What the heck rhymes with purple?

If you stop and think about it, you'll find that English is jam-packed with rhymeless common words. What rhymes with empty, or olive, or silver, or circle? You can even find plenty of one-syllable words like wolf, bulb, and beige. Yet orange somehow became notorious for its rhymelessness, with the curious result that people now assume its status is unique.

Andrew wrote to ask about this, and so I did a bit of looking around for information about the statistics of rhyme.

To my surprise, I was not able to find any work on the distribution of sizes of rhyme equivalence classes in English. As I wrote to Andrew, there's been a lot of interest on the part of psycholinguists in various local density measures on lexical neighborhoods — how many words are near a given word (in terms of some sort of edit distance) in spelling or sound — because such measures play an important role in theories of speech perception, reading, speech errors, language learning, and so on.

Choosing somewhat at random among recent publications, there's a historical survey in Davis et al., "Re(de)fining the orthographic neighborhood: The role of addition and deletion neighbors in lexical decision and reading", Journal of Experimental Psychology: Human Perception and Performance, 2009, and a useful discussion in Magnuson et al., "The dynamics of lexical competition during spoken word recognition", Cognitive Science 31(1) 2007.

Those references will give you the flavor of this work, which now amounts to thousands if not tens of thousands of articles. (Google Scholar claims 393,000 hits for lexical neighborhood, but perhaps this is as unreliable as other Google hit counts are these days.)

As far as I know, none of these innumerable articles has looked at the distribution of "rhymes" in the poetic sense, and in particular at the distribution of number of rhymes overall, or conditioned on various source-word properties (frequency, length in segments or in syllables, semantic field, etc.).

I found a few hints of such work in the computational linguistics literature, for example Roy Byrd and Martin Chodorow, "Using an On-line Dictionary to find rhyming words and pronunciations for unknown words", Proceedings of the 23rd Annual Meeting of the Association for Computational Linguistics, 1985. In that paper, Roy and Martin treat us to the apotheosis of plain-text GUIs.

But there's no information about the distribution of rhyme-set sizes in their dictionary.

So I decided to give it a shot myself, naively thinking that it would be easy.

Of course I understood that the easy answer would be meaningless. That is, you could get a large number of different answers depending on how you define "rhyme", how big your wordlist is, whose pronunciations you use, and so on.

It's like asking "how many English words are there?". In fact, an exact word-list (and thus word-count) is presupposed by questions like "how many rhymeless English words are there?", "how many English words have exactly one (or two, or seventeen) rhymes?", etc.

But, I naively thought, getting versions of these numbers is easy to do, given a choice of dictionary and a definition of "rhyme". And I told Andrew that as classes come to an end, I might have a bit of extra hacking time to try it.

And indeed the basic hacking is pretty straightforward.

I started with the CELEX2 epw ("English phonology, wordforms") table, which has 160,595 entries. I eliminated entries with internal white space or hyphens, leaving 119,277. There are a fair number of alternative pronunciations, so that after a couple of lines of scripting I was left with 187,576 word/pronunciation pairs. Using a definition of "rhyme" as "identical in pronunciation from the main-stressed vowel to the end", a few more lines of scripting revealed 50,344 rhyme equivalence classes (i.e. sets of rhyming words), of which 30,905 (61% of rhyme sets, 16% of words+pronunciations) are singletons. The rest of the histogram of rhyme-set sizes starts like this:

Number of Rhyme Sets    Number of Words in Set
              30,905                         1
               7,120                         2
               2,458                         3
               2,277                         4
               2,857                         5
                 733                         6
                 437                         7
                 467                         8
                 259                         9
                 525                        10

The largest equivalence-class by far is the -ation set, with 1,978 members.
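The basic computation can be sketched in a few lines of Python. This is a minimal reconstruction, not the original script: it assumes CMU-dictionary-style ARPAbet transcriptions (with stress digits on vowels) rather than CELEX2's DISC notation, and a toy lexicon in place of the real wordform table.

```python
from collections import Counter, defaultdict

def rhyme_key(phones):
    """Everything from the main-stressed vowel to the end of the word.

    `phones` is an ARPAbet transcription like ['P', 'EY1', 'T', 'R', 'AH0', 'N'];
    vowels carry stress digits, with 1 marking primary stress.
    """
    for i, p in enumerate(phones):
        if p.endswith("1"):
            return tuple(phones[i:])
    return tuple(phones)  # no marked primary stress: use the whole transcription

# Toy lexicon of (word, pronunciation) pairs
lexicon = [
    ("matron", ["M", "EY1", "T", "R", "AH0", "N"]),
    ("patron", ["P", "EY1", "T", "R", "AH0", "N"]),
    ("orange", ["AO1", "R", "AH0", "N", "JH"]),
    ("wolf",   ["W", "UH1", "L", "F"]),
]

# Group words into rhyme equivalence classes
classes = defaultdict(set)
for word, phones in lexicon:
    classes[rhyme_key(phones)].add(word)

# Histogram of class sizes: how many sets have 1 member, 2 members, ...
histogram = Counter(len(members) for members in classes.values())
# here: one pair (matron, patron) and two singletons (orange, wolf)
```

With a real dictionary in place of the toy lexicon, the `histogram` counter is exactly the table above.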

If we believed this table, it would certainly convince us that singletons dominate the English rhyme scene.

And that qualitative conclusion is quite likely to stand up. But a bit of inspection establishes that these specific numbers should not be trusted very far, especially the counts of sets with small numbers of members.

Here are some of the issues:

1. What is a "rhyme"?

2. What wordforms should be in the list? (rare words, compounds, productive derivation…)

3. How should we treat words with alternative pronunciations?

4. CELEX2 has got a fair number of mistakes (or at least inconsistencies) in its pronunciation fields.

Let's consider (or at least exemplify) these one at a time.

What is a "rhyme"? If you ask the Oxford Dictionary of Rhymes about orange, it gives you

• Falange, flange • avenge, henge, revenge, Stonehenge • arrange, change, counterchange, estrange, exchange, grange, interchange, Lagrange, mange, part-exchange, range, short-change, strange • binge, cringe, fringe, hinge, impinge, singe, springe, …

These count under some definitions, but not the one I'm using.

What wordforms should be in the list? Among my 30,905 singletons are monkish and feasting. CELEX2 doesn't include skunkish and bee sting (and my script would have removed bee sting anyhow).

How should we count words with alternative pronunciations? Patron has two pronunciations, one with /eɪ/ and one with /æ/, while matron has only one, with /eɪ/. In the way that I created my rhyme sets, patron is a member of two of them: one singleton (CELEX2 has no other wordforms that rhyme with patron-with-an-/æ/) and one pair (since patron-with-an-/eɪ/ is a member of patron, matron). Should the singleton version be counted, or not?
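The double-membership problem can be seen directly by building classes from a lexicon in which one word carries two pronunciations. Again a sketch with invented ARPAbet entries standing in for the CELEX2 data:

```python
from collections import defaultdict

def rhyme_key(phones):
    # everything from the primary-stressed vowel (stress digit 1) to the end
    i = next(i for i, p in enumerate(phones) if p.endswith("1"))
    return tuple(phones[i:])

# "patron" gets two pronunciations, "matron" only one
entries = [
    ("patron", ["P", "EY1", "T", "R", "AH0", "N"]),  # patron with /eɪ/
    ("patron", ["P", "AE1", "T", "R", "AH0", "N"]),  # patron with /æ/
    ("matron", ["M", "EY1", "T", "R", "AH0", "N"]),
]

classes = defaultdict(set)
for word, phones in entries:
    classes[rhyme_key(phones)].add(word)

# patron now sits in two classes: the pair {matron, patron} and a singleton {patron}
sizes = sorted(len(members) for members in classes.values())  # [1, 2]
```

Counting this way, patron contributes to both the singleton tally and the pair tally, which is exactly the ambiguity described above.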

And here's an example of the interaction of alternative pronunciations with errors or inconsistencies in CELEX2. The two wordforms gobblers and wobblers are given (mistaken?) alternative pronunciations with a syllabic /l/, perhaps inherited from gobble and wobble. Or maybe the original dictionary authors were recording an alternative pronunciation in which gobbler has three syllables, gob-ble-er? In any case, cobblers lacks this alternative. So we end up with one triple (cobblers, gobblers, wobblers) and one pair (gobblers, wobblers).

So for these and various other reasons, I don't think that my specific numbers should be trusted very far. However, some checking of random samples convinces me that any similar exercise, carried out on any similar sort of dictionary, will also reach the conclusion that overall, orange is indeed the norm: singleton rhyme-sets are more common than any other size.

The situation is (unsurprisingly) different for monosyllables. Among the 26,841 monosyllabic spelling/pronunciation pairs, I found 1,078 rhyme equivalence classes, of which 160 (15% of rhyme sets, 0.6% of words) are singletons. (Even among monosyllables, however, there are more singletons than any other set size — 63 pairs, 39 triples, 24 4-tuples, etc.)

Somewhat to my surprise, the initial-stressed two-syllable words showed about the same pattern as the vocabulary at large, with perhaps even a higher proportion of singletons: Among the 42,663 initial-stressed disyllabic word/pronunciation pairs, there were 17,212 rhyme equivalence classes, of which 11,042 (64% of rhyme sets, 26% of words) were singletons.
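Filtering the lexicon down to monosyllables or to initial-stressed disyllables is straightforward once vowels are identifiable. A sketch under the same ARPAbet assumption, approximating syllable count by vowel count (which is what such scripts typically do):

```python
def syllable_count(phones):
    # ARPAbet vowels end in a stress digit (0, 1, or 2); consonants don't
    return sum(p[-1].isdigit() for p in phones)

def is_initial_stressed_disyllable(phones):
    vowels = [p for p in phones if p[-1].isdigit()]
    return len(vowels) == 2 and vowels[0].endswith("1")

# "wolf" is a monosyllable; "patron" is an initial-stressed disyllable;
# "arrange" is a disyllable stressed on the second syllable
assert syllable_count(["W", "UH1", "L", "F"]) == 1
assert is_initial_stressed_disyllable(["P", "EY1", "T", "R", "AH0", "N"])
assert not is_initial_stressed_disyllable(["AH0", "R", "EY1", "N", "JH"])
```

Running the same class-building step over each filtered sublist yields the monosyllable and disyllable figures quoted above.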

Thanks to available data and powerful text-analysis software, all of this was easy — it took quite a bit longer to explain than it did to do. What would really be hard would be doing it so that the results could be trusted — that would require exploring the consequences of all the alternative choices that I've ignored. And the main thing that I've learned from what I've done so far is how large and complex that space of choices really is.
