When Does a Word Become a Word?

“A shot of expresso, please.” “You mean ‘espresso,’ don’t you?” A baffled customer, a smug barista—media is abuzz with one version or another of this story. But the real question is not whether “expresso” is a correct spelling, but rather how spellings evolve and enter dictionaries. Lexicographers do not directly decide that; the data does. Long and frequent usage may qualify a word for endorsement. Moreover, I believe the emergent proliferation of computational approaches can help to form an even deeper insight into the language. The tale of expresso is a thriller from a computational perspective.

In the past I had taken the incorrectness of expresso for granted. And how could I not, with the thriving pop-culture of “no X in espresso” posters, t-shirts and even proclamations from music stars such as “Weird Al” Yankovic. Until a statement in a recent note by Merriam-Webster’s online dictionary caught my eye: “… expresso shows enough use in English to be entered in the dictionary and is not disqualified by the lack of an x in its Italian etymon.” Can this assertion be quantified? I hope this computational treatise will convince you that it can. But to set the backdrop right, let’s first look into the history.

History of Industry and Language

In the 19th century’s steam age, many engineers tackled steam applications accelerating the coffee-brewing process to increase customer turnover, as coffee was a booming business in Europe. The original espresso machine is usually attributed to Angelo Moriondo from Turin, who obtained a patent in 1884 for “new steam machinery for the economic and instantaneous confection of coffee beverage.” But despite further engineering improvements (see the Smithsonian), for decades espresso remained only a local Italian delight. And for words to jump between languages, industries need to jump the borders—this is how industrial evolution triggers language evolution. The first Italian to truly venture the espresso business internationally was Achille Gaggia, a coffee bartender from Milan.

In 1938 Gaggia patented a new method using the celebrated lever-driven piston mechanism allowing new record-brewing pressures, quick espresso shots and, as a side effect, even crema foam, a future signature of an excellent espresso. This allowed the Gaggia company (founded in 1948) to commercialize the espresso machines as a consumer product for use in bars. There was about a decade span between the original 1938 patent and its 1949 industrial implementation.

Around 1950, espresso machines began crossing Italian borders to the United Kingdom, America and Africa. This is when the first large spike happens in the use of the word espresso in the English language. The spike and following rapid growth are evident from the historic WordFrequencyData of published English corpora plotted across the 20th century:

The function above gets TimeSeries data for the frequencies of words w in a fixed time range from 1900–2000 that, of course, can be extended if needed. The data can be promptly visualized with DateListPlot :

The much less frequent expresso also gains its popularity slowly but steadily. Its simultaneous growth is more obvious with the log-scaled vertical frequency axis. To be able to easily switch between log and regular scales and also improve the visual comprehension of multiple plots, I will define a function:

The plot below also compares the espresso/expresso pair to a typical pair acknowledged by dictionaries, unfocused/unfocussed, stemming from American/British usage:



The overall temporal behavior of frequencies for these two pairs is quite similar, as it is for many other words of alternative orthography acknowledged by dictionaries. So why is espresso/expresso so controversial? A good historical account is given by Slate Magazine, which, as does Merriam-Webster, supports the official endorsement of expresso. And while both articles give a clear etymological reasoning, the important argument for expresso is its persistent frequent usage (even in such distinguished publications as The New York Times). As it stands as of the date of this blog, the following lexicographic vote has been cast in support of expresso by some selected trusted sources I scanned through. Aye: Merriam-Webster online, Harper Collins online, Random House online. Nay: Cambridge Dictionary online, Oxford Learner’s Dictionaries online, Oxford Dictionaries online (“The spelling expresso is not used in the original Italian and is strictly incorrect, although it is common”; see also the relevant blog), Garner’s Modern American Usage, 3rd edition (“Writers frequently use the erroneous form [expresso]”).

In times of dividing lines, data helps us to refocus on the whole picture and dominant patterns. To stress diversity of alternative spellings, consider the pair amok/amuck:

Of a rather macabre origin, amok came to English around the mid-1600s from the Malay amuk, meaning “murderous frenzy,” referring to a psychiatric disorder of a manic urge to murder. The pair amok/amuck has interesting characteristics. Both spellings can be found in dictionaries. The WordFrequencyData above shows the rich dynamics of oscillating popularity, followed by the competitive rival amuck becoming the underdog. The difference in orthography does not have a typical British/American origin, which should affect how alternative spellings are sampled for statistical analysis further below. And finally, the Levenshtein EditDistance is not equal to 1…

… in contrast to many typical cases such as:

This will also affect the sampling of data. My goal is to extract from a dictionary a data sample large enough to describe the diversity of alternatively spelled words that are also structurally close to the espresso/expresso pair. If the basic statistics of this sample assimilate the espresso/expresso pair well, then it quantifies and confirms Merriam-Webster’s assertion that “expresso shows enough use in English to be entered in the dictionary.” But it also goes a step further, because now all pairs from the dictionary sample can be considered as precedents for legitimizing expresso.

Dictionary as Data

Alternative spellings come in pairs and should not be considered separately because there is statistical information in their relation to each other. For instance, the word frequency of expresso should not be compared with the frequency of an arbitrary word in a dictionary. Contrarily, we should consider an alternative spelling pair as a single data point with coordinates {f + , f – } denoting higher/lower word frequency of more/less popular spelling correspondingly, and always in that order. I will use the weighted average of a word frequency over all years and all data corpora. It is a better overall metric than a word frequency at a specific date, and avoids the confusion of a frequency changing its state between higher f + and lower f – at different time moments (as we saw for amok/amuck). Weighted average is the default value of WordFrequencyData when no date is specified as an argument.

The starting point is a dictionary that is represented in the Wolfram Language by WordList and contains 84,923 definitions:

There are many types of dictionaries with quite varied sizes. There is no dictionary in the world that contains all words. And, in fact, all dictionaries are outdated as soon as they are published due to continuous language evolution. My assumption is that the exact size or date of a dictionary is unimportant as long as it is “modern and large enough” to produce a quality sample of spelling variants. The curated built-in data of the Wolfram Language, such as WordList , does a great job at this.

We notice right away that language is often prone to quite simple laws and patterns. For instance, it is widely assumed that lengths of words in an English dictionary…

… follow quite well one of the simplest statistical distributions, the PoissonDistribution . The Wolfram Language machine learning function FindDistribution picks up on that easily:

My goal is to search for such patterns and laws in the sample of alternative spellings. But first they need to be extracted from the dictionary.

Extracting Spelling Variants

For ease of data processing and analysis, I will make a set of simplifications. First of all, only the following basic parts of speech are considered to bring data closer to the espresso/expresso case:

This reduces the dictionary to 84,487 words:

Deletion of duplicates is necessary, because the same word can be used as several parts of speech. Further, the words containing any characters beyond the lowercase English alphabet are excluded:

This also removes all proper names, and drops the number of words to 63,712:

Every word is paired with the list of its definitions, and every list of definitions is sorted alphabetically to ensure exact matches in determining alternative spellings:

Next, words are grouped by their definitions; single-word groups are removed, and definitions themselves are removed too. The resulting dataset contains 8,138 groups:

Different groups of words with the same definition have a variable number of words n ≥ 2…

… where m is the number of groups. They follow a remarkable power law. Very roughly for order for magnitudes m~200000 n-5.

Close synonyms are often grouped together:

This happens because WordDefinition is usually quite concise:

To separate synonyms from alternative spellings, I could use heuristics based on orthographic rules formulated for classes such as British versus American English. But that would be too complex and unnecessary. It is much easier to consider only word pairs that differ by a small Levenshtein EditDistance . It is highly improbable for synonyms to differ by just a few letters, especially a single one. So while this excludes not only synonyms but also alternative spellings such as amok/amuck, it does help to select words closer to espresso/expresso and hopefully make the data sample more uniform. The computations can be easily generalized to a larger Levenshtein EditDistance , but it would be important and interesting to first check the most basic case:

This reduces the sample size to 2,882 pairs:

Mutations of Spellings

Alternative spellings are different orthographic states of the same word that have different probabilities of occurrence in the corpora. They can inter-mutate based on the context or environment they are embedded into. Analysis of such mutations seems intriguing. The mutations can be extracted with help of the SequenceAlignment function. It is based on algorithms from bioinformatics identifying regions of similarity in DNA, RNA or protein sequences, and often wandering into other fields such as linguistics, natural language processing and even business and marketing research. The mutations can be between two characters or a character and a “hole” due to character removal or insertion:

In the extracted mutations’ data, the “hole” is replaced by a dash (-) for visual distinction:

The most probable letters to participate in a mutation between alternative spellings can be visualized with Tally . The most popular letters are s and z thanks to the British/American endings -ise/-ize, surpassed only by the popularity of the “hole.” This probably stems from the fact that dropping letters often makes orthography and phonetics easier.

Querying Word Frequencies

The next step is to get the WordFrequencyData for all

2 x 2882 = 5764 words of alternative spelling stored in the variable samedefspair . WordFrequencyData is a very large dataset, and it is stored on Wolfram servers. To query frequencies for a few thousands words efficiently, I wrote some special code that can be found in the notebook attached at the end of this blog. The resulting data is an Association containing alternative spellings with ordered pairs of words as keys and ordered pairs of frequencies as values. The higher-frequency entry is always first:

The size of the data is slightly less than the original queried set because for some words, frequencies are unknown:

Basic Analysis

Having obtained the data, I am now ready to check how well the frequencies of espresso/expresso fall within this data:

As a start, I will examine if there are any correlations between lower and higher frequencies. Pearson’s Correlation coefficient, a measure of the strength of the linear relationship between two variables, gives a high value for lower versus higher frequencies:

But plotting frequency values at their natural scale hints that a log scale could be more appropriate:

And indeed for log-values of frequencies, the Correlation strength is significantly higher:

Fitting the log-log of data reveals a nice linear fit…

… with sensible statistics of parameters:

In the frequency space, this shows a simple and quite remarkable power law that sheds light on the nature of correlations between the frequencies of less and more popular spellings of the same word:

Log-log space gives a clear visualization of the data. Obviously due to {greater, smaller} sorting of coordinates {f + , f – }, all data points cannot exceed the Log[f – ]==Log[f + ] limiting orange line. The purple line is the linear fit of the power law. The red circle is the median of the data, and the red dot is the value of the espresso/expresso frequency pair:



A simple, useful transformation of the coordinate system will help our understanding of the data. Away from log-frequency vs. log-frequency space we go. The distance from a data point to the orange line Log[f – ]==Log[f + ] is the measure of how many times larger the higher frequency is than the lower. It is given by a linear transformation—rotation of the coordinate system by 45 degrees. Because this distance is given by difference of logs, it relates to the ratio of frequencies:

This random variable is well fit by the very famous and versatile WeibullDistribution , which is used universally for weather forecasting to describe wind speed distributions; survival analysis; reliability, industrial and electrical engineering; extreme value theory; forecasting technological change; and much more—including, now, word frequencies:

One of the most fascinating facts is “The Unreasonable Effectiveness of Mathematics in the Natural Sciences,” which is the title of a 1960 paper by the physicist Eugene Wigner. One of its notions is that mathematical concepts often apply uncannily and universally far beyond the context in which they were originally conceived. We might have glimpsed at that in our data.

Using statistical tools, we can figure out that in the original space the frequency ratio obeys a distribution with a nice analytic formula:

It remains to note that the other corresponding transformed coordinate relates to the frequency product…

… and is the position of a data point along the orange line Log[f – ]==Log[f + ]. It reflects how popular, on average, a specific word pair is among other pairs. One can see that the espresso/expresso value lands quite above the median, meaning the frequency of its usage is higher than half of the data points.

Nearest can find the closest pairs to espresso/expresso measured by EuclideanDistance in the frequency space. Taking a look at the 50 nearest pairs shows just how typical the frequencies espresso/expresso are, shown below by a red dot. Many nearest neighbors, such as energize/energise and zombie/zombi, belong to the basic everyday vocabulary of most frequent usage:

The temporal behavior of frequencies for a few nearest neighbors shows significant diversity and often is generally reminiscent of such behavior for the espresso/expresso pair that was plotted at the beginning of this article:

Networks of Mutation

Frequencies allow us to define a direction of mutation, which can be visualized by a DirectedEdge always pointing from lower to higher frequency. A Tally of the edges defines weights (or not-normalized probabilities) of particular mutations.

For clarity of visualization, all edges with weights less than 10% of the maximum value are dropped. The most popular mutation is s→z->1, with maximum weight 1. It is interesting to note that reverse mutations might occur too; for instance, z→s->0.0347938, but much less often:

Thus a letter can participate in several types of mutations, and in this sense mutations form a network. The size of the vertex is correlated with the probability of a letter to participate in any mutation (see the variable vertex above):

The larger the edge weight, the brighter the edge:

The letters r and g participate mostly in the deletion mutation. Letters with no edges participate in very rare mutations.

Among a few interesting substructures, one of the obvious is the high clustering of vowels. A Subgraph of vowels can be easily extracted…

… and checked for completeness, which yields False due to many missing edges from and to u:

Nevertheless, as you might remember, the low-weight edges were dropped for a better visual of high-weight edges. Are there any interesting observations related to low-weight edges? As a matter of fact, yes, there are. Let’s quickly rebuild a full subgraph for only vowels. Vertex sizes are still based on the tally of letters in mutations:

All mutations of vowels in the dictionary can be extracted with the help of MemberQ :

In order to visualize exactly the number of vowel mutations in the dictionary, the edge style is kept uniform and edge labels are used for nomenclature:

And now when we consider all (even small-weight) mutations, the graph is complete:

But this completeness is quite “weak” in the sense that there are many edges with a really small weight, in particular two edges with weight 1:

This means that there is only one alternative word pair for e→u mutations, and likewise for i→o mutations. With the help of a lookup function…

… these pairs can be found as:

Thus, thanks to these unique and quite exotic words, our dictionaries have e→u and i→o mutations. Let’s check WordDefinition for these terms:

The word yarmulke is a quite curious case. First of all, it has three alternative spellings:

Additionally, the Merriam-Webster Dictionary suggests a rich etymology: “Yiddish yarmlke, from Polish jarmułka & Ukrainian yarmulka skullcap, of Turkic origin; akin to Turkish yağmurluk rainwear.” The Turkic class of languages is quite wide:

Together with the other mentioned languages, Turkic languages mark a large geographic area as the potential origin and evolution of the word yarmulke:

This evolution has Yiddish as an important stage before entering English, while Yiddish itself has a complex cultural history. English usage of yarmulke spikes around 1940–1945, hence World War II and the consequent Cold War era are especially important in language migration, correlated probably to the world migration and changes in Jewish communities during these times.

These complex processes brought many more Yiddish words to English (my personal favorites are golem and glitch), but only a single one resulted in the introduction of the mutation e→u in the whole English dictionary (at least within our dataset). So while there are really no s↔x mutations currently in English (as in espresso/expresso), this is not a negative indicator because there are cases of mutations that are unique to a single or just a few words. And actually, there are many more such mutations with a small weight than with a large weight:

So while the s→z mutation happens in 777 words, it is the only mutation with that weight:

On the other hand, there are 61 unique mutations that happen only once in a single word, as can be seen from the plot above. So in this sense, the most weighted s→z mutation is an outlier, and if expresso enters a dictionary, then the espresso/expresso pair will join the majority of unique mutations with weight 1. These are the mutation networks for the first four small weights:

As the edge weight gets larger, networks become simpler—degenerating completely for very large weights. Let’s examine a particular set of mutations with a small weight—for instance, weight 2:

This means there are only two unique alternative spellings (four words) for each mutation out of the whole dictionary:

Red marks a less popular letter, printed as a superscript of the more popular one. While the majority of these pairs are truly alternative spellings with a sometimes curiously dynamic history of usage…

… some occasional pairs, like distrust/mistrust, indicate blurred lines between alternative spellings and very close synonyms with close orthographic forms—here the prefixes mis- and dis-. Such rare situations can be considered as a source of noise in our data if someone does not want to accept them as true alternative spellings. My personal opinion is that the lines are blurred indeed, as the prefixes mis- and dis- themselves can be considered alternative spellings of the same semantic notion.

These small-weight mutations (white dots in the graph below) are distributed among the rest of the data (black dots) really well, which reflects on their typicality. This can be visualized by constructing a density distribution with SmoothDensityHistogram , which uses SmoothKernelDistribution behind the scenes:

Some of these very exclusive, rare alternative spellings are even more or less frequently used than espresso/expresso, as shown above for the example of weight 2, and can be also shown for other weights. Color and contour lines provide a visual guide for where the values of density of data points lie.

Conclusion

The following factors affirm why expresso should be allowed as a valid alternative spelling.

Espresso/expresso falls close to the median usage frequencies of 2,693 official alternative spellings with Levenshtein EditDistance equal to 1

equal to 1 The frequency of espresso/expresso usage as whole pair is above the median, so it is more likely to be found in published corpora than half of the examined dataset

Many nearest neighbors of espresso/expresso in the frequency space belong to a basic vocabulary of the most frequent everyday usage

The history of espresso/expresso usage in English corpora shows simultaneous growth for both spellings, and by temporal pattern is reminiscent of many other official alternative spellings

The uniqueness of the s→x mutation in the espresso/expresso pair is typical, as numerous other rare and unique mutations are officially endorsed by dictionaries

So all in all, it is ultimately up to you how to interpret this analysis or spell the name of the delightful Italian drink. But if you are a wisenheimer type, you might consider being a tinge more open-minded. The origin of words, as with the origin of species, has its dark corners, and due to inevitable and unpredictable language evolution, one day your remote descendants might frown on the choice of s in espresso.