
Over the years, we've viewed the phenomenon of word aversion from several angles — a recent discussion, with links to earlier posts, can be found here. What we're calling word aversion is a feeling of intense, irrational distaste for the sound or sight of a particular word or phrase, not because its use is regarded as etymologically or logically or grammatically wrong, nor because it's felt to be over-used or redundant or trendy or non-standard, but simply because the word itself somehow feels unpleasant or even disgusting.

Some people react in this way to words whose offense seems to be entirely phonetic: cornucopia, hardscrabble, pugilist, wedge, whimsy. In other cases, it's plausible that some meaning-related associations play a role: creamy, panties, ointment, tweak. Overall, the commonest object of word aversion in English, judging from many discussions in web forums and comments sections, is moist.

One problem with web forums and comments sections as sources of evidence is that they don't tell us what fraction of the population experiences the phenomenon of word aversion, either in general or with respect to some particular word like moist. Dozens of commenters may join the discussion in a forum that has at most thousands of readers, but we can't tell whether they represent one person in five or one person in a hundred; nor do we know how representative of the general population a given forum or comments section is.

Pending other approaches, it occurred to me that we might be able to learn something from looking at usage in literary works. Authors who are squicked by moist, for example, will plausibly tend to find alternatives. (Well, in some cases the effect might motivate over-use; but never mind that for now…)

So for this morning's Breakfast Experiment™, I downloaded the April 2010 Project Gutenberg DVD, and took a quick look.

Let me say first that from the point of view of corpus analysis, the Gutenberg DVD is a mess. The text files have multiple formats, with different sorts of boilerplate fore and aft; there are many duplicate works, arranged in ways that make automated un-duplication difficult; there is no master list of path names to texts; in some cases, hyphenation and other relics of print editions are preserved; and so on. I say this not mainly to complain about the quality of free ice cream (though I hope that at some time in the future, someone will produce a more analysis-friendly version), but to make the point that the clean-up I was able to accomplish in the space of an hour may very well have a few bugs. So take the following with a suitably-sized grain of salt.

I chose this list of 50 well-represented authors for a start. After removing duplicates and other extraneous material, the collection boiled down to 125,896,608 words, or an average of about 2.5 million words per author. (The minimum per-author word count was 118,881 and the maximum was 9,362,632.)

There were 798 occurrences of moist, for an overall frequency of 6.34 per million. This is roughly consistent with the frequencies seen in the Google Books ngram collection for English Fiction:

Among the 50 authors, there are four whose Gutenberg works (as processed by me this morning) have no instances of moist: Gertrude Atherton, Jane Austen, Fanny Burney, and Mary Andrews. The overall word counts for these four authors are 1,235,294, 786,308, 1,069,712, and 269,060, so if the underlying probability of moist were really about 6.34 per million, the probability of getting zero moists in the corresponding number of random draws from the urn of words would be about 0.0004, 0.007, 0.001, and 0.18, respectively.

Does this mean anything? Well, there's the Bonferroni-correction problem. And anyhow, there's no reason to expect that everyone has the same underlying frequency for every word: there are differences in topic choice as well as individual differences in word-use preferences.
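For readers who want to check the urn arithmetic, here is a minimal Python sketch. It takes the pooled rate (798 occurrences in 125,896,608 words) and the four word counts from the text, and computes the chance of drawing zero moists in that many independent draws. The comparison against a Bonferroni-style threshold rests on two assumptions of mine, not on anything stated above: a conventional significance level of 0.05, and one test per surveyed author (50 tests).

```python
import math

# Word counts quoted in the post for the four authors with no "moist" at all.
WORD_COUNTS = {
    "Gertrude Atherton": 1_235_294,
    "Jane Austen": 786_308,
    "Fanny Burney": 1_069_712,
    "Mary Andrews": 269_060,
}

RATE = 798 / 125_896_608  # pooled per-word frequency, about 6.34 per million
N_AUTHORS = 50            # assumption: one test per surveyed author
ALPHA = 0.05              # assumption: conventional significance level


def prob_zero(n_words: int, rate: float = RATE) -> float:
    """Probability of zero hits in n_words independent draws at the given rate."""
    # (1 - rate) ** n_words, computed via logs for numerical stability.
    return math.exp(n_words * math.log1p(-rate))


threshold = ALPHA / N_AUTHORS  # Bonferroni-corrected per-test threshold

for author, n_words in WORD_COUNTS.items():
    p = prob_zero(n_words)
    verdict = "below" if p < threshold else "not below"
    print(f"{author}: P(zero) = {p:.4f} ({verdict} {threshold:.4f})")
```

Under those assumptions only the smallest of the four probabilities falls below the corrected threshold, and even that says nothing about the second objection: that authors need not share a single underlying rate in the first place.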

A plot of all 50 authors' empirical moist-frequencies, compared (say) to their empirical dry-frequencies, helps us to visualize what's going on:

The numbers reference the previously-cited list of authors; and (because there have been suggestions that the effect might be gendered) I've plotted female authors in red, and male authors in blue.

It's clear that different authors have different underlying rates of moist-usage. For Bret Harte, whose 56 moists in 2,522,731 words make him the moistest author, the naive 95% confidence interval for the rate of moist usage would be 17.1 to 28.8 per million. For Mark Twain, who has only 2 moists in 3,436,448 words, a similarly-calculated confidence interval comes out as .07 to 2.1 per million.
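The text doesn't say which interval method lies behind these figures, so the following is only a sketch of one common choice, the Wilson score interval for a binomial proportion. The function name and the choice of method are my assumptions; the bounds it produces need not agree exactly with the numbers quoted above, particularly for very small counts.

```python
import math


def wilson_interval(hits: int, n_words: int, z: float = 1.959964):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    p_hat = hits / n_words
    denom = 1 + z * z / n_words
    center = (p_hat + z * z / (2 * n_words)) / denom
    half = z * math.sqrt(
        p_hat * (1 - p_hat) / n_words + z * z / (4 * n_words ** 2)
    ) / denom
    return center - half, center + half


# Counts and corpus sizes as quoted in the post.
for name, hits, n_words in [("Harte", 56, 2_522_731), ("Twain", 2, 3_436_448)]:
    low, high = wilson_interval(hits, n_words)
    print(f"{name}: {low * 1e6:.2f} to {high * 1e6:.2f} per million")
```

Other methods (an exact binomial interval, for example) give somewhat different bounds when the count is as small as Twain's 2, but on any of them the two authors' intervals are nowhere near overlapping.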

I should note that one of Mark Twain's 2 moists is not really moist — it comes from his story "The Stolen White Elephant":

He took a pen and some paper. "Now–name of the elephant?"

"Hassan Ben Ali Ben Selim Abdallah Mohammed Moist Alhammal Jamsetjejeebhoy Dhuleep Sultan Ebu Bhudpoor."

"Very well. Given name?"

"Jumbo."

Anyhow, it's quite clear that Twain's propensity to use the word moist is substantially smaller than Harte's.

Still, Twain's one genuine use of the word is in a rather positive context, in chapter XXXIX of Roughing It, describing a visit to an island in Mono Lake:

When we reached the top and got within the wall, we found simply a shallow, far-reaching basin, carpeted with ashes, and here and there a patch of fine sand. In places, picturesque jets of steam shot up out of crevices, giving evidence that although this ancient crater had gone out of active business, there was still some fire left in its furnaces. Close to one of these jets of steam stood the only tree on the island–a small pine of most graceful shape and most faultless symmetry; its color was a brilliant green, for the steam drifted unceasingly through its branches and kept them always moist. It contrasted strangely enough, did this vigorous and beautiful outcast, with its dead and dismal surroundings. It was like a cheerful spirit in a mourning household.

There's some indication that Harte is in general more concerned with moisture, by whatever name, than Twain is, but still, moist remains unexpectedly rare in Twain's writing. For the other humidity-words wet, damp, dry, and arid, Twain runs at roughly a third to a half the rate of Harte, while his moist usage is nearly two orders of magnitude less frequent:

Raw counts (N):

        Overall words   moist N   wet N   damp N   dry N   arid N
Twain       3,436,448         1      86       40      36        7
Harte       2,522,731        56     141       92      51       13

Rates per million words (MW):

        Overall words   moist/MW   wet/MW   damp/MW   dry/MW   arid/MW
Twain       3,436,448       0.29     25.0      11.6     10.5       2.0
Harte       2,522,731       22.2     55.9      36.5     20.2       5.2
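The per-million rates follow directly from the raw counts, and the Twain-to-Harte ratio for each word makes the contrast easy to see. A minimal sketch in Python, using only the counts given above (the names `COUNTS` and `per_million` are mine):

```python
# Raw counts from the table above: author -> (total words, {word: occurrences}).
COUNTS = {
    "Twain": (3_436_448, {"moist": 1, "wet": 86, "damp": 40, "dry": 36, "arid": 7}),
    "Harte": (2_522_731, {"moist": 56, "wet": 141, "damp": 92, "dry": 51, "arid": 13}),
}


def per_million(hits: int, total_words: int) -> float:
    """Occurrences per million words."""
    return hits / total_words * 1_000_000


rates = {
    author: {word: per_million(n, total) for word, n in words.items()}
    for author, (total, words) in COUNTS.items()
}

for word in rates["Twain"]:
    ratio = rates["Twain"][word] / rates["Harte"][word]
    print(
        f"{word:>5}: Twain {rates['Twain'][word]:6.2f}/MW, "
        f"Harte {rates['Harte'][word]:6.2f}/MW, ratio {ratio:.3f}"
    )
```

The ratios for wet, damp, dry, and arid cluster together, while the ratio for moist sits far below all of them.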

Perhaps that lonely moist was inserted by an editor?

Anyhow, if we look at a violin plot of the empirical distribution of moist-frequencies and dry-frequencies (per million words) in the 50 authors surveyed, we see a hint of a bimodal distribution among the moist-frequencies:

Random usage-noise, or the literary signature of moist-aversion? I'm not sure, but breakfast is over.
