In this paper, we introduce a method for mapping lexical innovation, which we then use to track the origin and spread of new words on American Twitter, based on a multi-billion-word corpus of Tweets collected between 2013 and 2014. We first extract fifty-four emerging words from the corpus by searching for words that are very uncommon at the end of 2013 but whose use rises dramatically over the course of 2014. We then map the origin and spread of each of these words. Based on these results, we identify five main regional patterns of lexical innovation on American Twitter, primarily associated with the West Coast, the Northeast, the Mid-Atlantic, the Deep South, and the Gulf Coast. We conclude by proposing explanations for these results and by discussing their significance to theories of language variation and change, including both the actuation and diffusion of lexical innovations.

2. A Corpus of American Tweets

The corpus analyzed in this study represents American Twitter and consists of 8.9 billion words of geo-coded American mobile Twitter data, totaling 980 million Tweets written by 7 million users from across the contiguous United States, posted and downloaded between October 11th, 2013, and November 22nd, 2014, using the Twitter API (http://dev.twitter.com) (see Huang et al. 2016; Grieve et al. 2017; Nini et al. 2017). We focused on Twitter because this variety of language provides a uniquely large and accessible source of geo-coded and time-stamped natural language data (see note 2). Specifically, our corpus represents mobile Twitter, as geo-coded Tweets are produced when users post on their smartphones with geocoding enabled, resulting in a record of the precise longitude and latitude of the user at the time of posting.
To identify patterns of lexical innovation in American Twitter, we stratified the corpus by both time and geography. The corpus was stratified by day based on the time-stamp information provided with each Tweet. Although the period of time represented by the corpus spans 409 days, in total the corpus includes 399 days, because ten days are missing due to various technical issues that interrupted data collection. The corpus contains an average of 22 million words per day, but ranges from 10 to 29 million words per day.
The corpus was stratified by county using the longitude and latitude provided with each Tweet. In total, the corpus contains 3075 county equivalents out of 3108 total county equivalents in the contiguous United States, with missing data primarily occurring in small, sparsely populated counties in the Central States. On average, the corpus contains 2 million words per county, but ranges from 300 to 300 million words per county. Overall, 98 percent of the counties are represented by at least 10,000 words and 79 percent of the counties are represented by at least 100,000 words.
It is important to acknowledge that Twitter cannot be used to track all the emerging words of the English language. Twitter can provide only a partial picture of lexical innovation, as many words that are emerging in other varieties of English will be absent from Twitter. A different set of emerging words, for example, would likely be found in a corpus of workplace interaction or scientific writing. New words in other varieties might also follow different regional patterns. New scientific vocabulary, for example, might originate from universities and spread along academic networks. Furthermore, the demographics of Twitter will also affect the types of emerging words we identify. Although Twitter is a very popular social media platform, with 21 percent of all Americans using Twitter regularly, including users from a wide range of demographic backgrounds, the user base of American Twitter differs from the general population, with somewhat higher engagement from younger people, African Americans, and urban residents (Duggan & Brenner 2013; see note 3). Consequently, our analysis will be more likely to identify innovations originating from these specific demographic groups. Despite these limitations, at this time, Twitter is the only variety of language that can be sampled at the necessary scale and with the necessary metadata to allow for the type of analysis reported in this study, providing us with a unique chance to observe regional patterns of lexical innovation in natural language on a large scale.

3. Finding Emerging Words

To map patterns of lexical innovation in American Twitter, we first identified a set of emerging word forms in the corpus, which are relatively new word forms that are entering into general usage on Twitter for the first time. For the purposes of this study, we define a “word form” as a case-insensitive string of alphabetic characters, hyphens, and apostrophes. Notably, this corpus-based definition treats creative spellings and acronyms as distinct forms, as Twitter is a type of written language. We then define an “emerging word form” as a word form that is very uncommon at the start of the period of time under analysis, but whose relative frequency rises over that period of time.
Specifically, we identified potential emerging word forms in the corpus following the procedure described in Grieve et al. (2017). First, we extracted the word forms that occurred at least 500 times in the complete corpus, which amounted to 97,246 distinct types (see note 4). Second, we calculated the relative frequency of each form in the 2013 segment of the corpus to measure the popularity of the form at the start of the period of time under analysis. Third, we measured the relative frequencies of each of the 97,246 forms across the 399 days in the corpus, normalized to control for variation in the total number of words per day. Fourth, we measured the degree to which the usage of each form showed a monotonic increase over time by comparing the rank of the relative frequency per day of each form to the chronological rank of the day using a Spearman correlation coefficient. Finally, we extracted the 398 forms with a Spearman correlation coefficient larger than .70 and with a 2013 relative frequency smaller than once per million words. These correlation coefficient and relative frequency cut-offs were selected because they are common thresholds in correlation analysis (Hinkel et al. 2003) and corpus linguistics (Biber et al. 1998).
Furthermore, they allow for a sufficiently large sample of emerging words to be identified so that common patterns of lexical innovation can be mapped through a multivariate spatial analysis. Setting these values differently would result in a somewhat different sample, but under reasonable settings, a similar core set of forms would be identified and the analyses that follow would be largely the same. This set of 398 potential emerging word forms was then filtered by hand. First, all proper nouns (e.g., partynextdoor, timehop) were removed from the list to focus on newly emerging word forms as opposed to people and products that happened to become popular over the course of 2014. Three forms, idgt, lituation, and thotful, that were introduced as proper nouns in song titles, were retained, as they were being used regularly outside that context in the corpus. Second, all dictionary words (e.g., feminists, infusion) were removed from the list, using the Merriam-Webster Dictionary for reference, to focus the analysis on the identification of relatively new word formations, as opposed to well-established words whose recent usage might not reflect their origin. Finally, numerous terms, including many acronyms and abbreviations, primarily related to the medical industry (e.g., pacu, cath), whose frequency increased on Twitter in 2014 due to a growing use of geo-coded employment advertisements, were also removed from the list. Through this process, eighty-one true emerging word forms were identified. These are listed in Table 1, along with a working definition, established through online searches and close readings of Tweets from the corpus containing these forms. The table is organized into fifty-four word lemmas, which were obtained by grouping together all inflected forms and variant spellings. The most common form is listed in the first column. In most cases, the lemma consists of a single form, but some lemmas contain up to eleven forms. 
This creates an imbalance in the dataset, with certain lemmas represented by far more forms than others, which is especially problematic for the multivariate analysis. We therefore chose to focus our analysis on one form per lemma, specifically the fifty-four emerging word forms listed in the first column of Table 1.
Table 1. Emerging Words
For the most part, these fifty-four words can be characterized as relatively new forms of everyday slang (Green 2011) and appear to be used across various registers, including spoken varieties (Grieve et al. 2017). Common topical domains include family and friends (e.g., boolin, famo), relationships and sex (e.g., baeless, pullout), intoxication (e.g., traphouse, xans), technology (e.g., candids, celfie), and Japanese culture (e.g., senpai, waifu). Most of these words are the result of standard word formation processes, such as compounding (e.g., fuckboys, traphouse), derivation (e.g., unbae, lituation), truncation (e.g., notifs, xans), and blending (e.g., brazy, boolin). The list also includes eleven acronyms (e.g., gmfu, tfw), which represent multiword expressions that generally occur in spoken language, but whose use as acronyms on Twitter is presumably encouraged by the length restrictions placed on Tweets. Similarly, the list contains twelve forms that are spelling variants of established words, which seem to mark non-standard meanings (e.g., gainz, litt) or pronunciations (e.g., bruuh, yaas). Notably, a number of these words appear to be associated with African American English, which is perhaps not surprising given the demographics of the Twitter user base. For example, derivatives of bae and thot were largely popularized through urban music, while brazy and boolin are associated with the predominantly African American Bloods street gang, who often modify existing words by replacing the letter ‘c,’ which is associated with the rival Crips gang, with the letter ‘b’ (Grieve et al. 2017).
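The automatic part of the extraction procedure described earlier in this section (frequency floor, start-of-period relative frequency, and Spearman correlation against chronological order) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the input layout (one list of daily counts per form), the `start_days` parameter standing in for the "2013 segment", and the regular expression used for word forms are all hypothetical choices made for the example.

```python
import re

# Assumed reading of "word form": ASCII letters, hyphens and apostrophes,
# lower-cased, and required to start with a letter (our simplification).
WORD_FORM = re.compile(r"[a-z][a-z'-]*")

def tokenize(text):
    """Split a Tweet into case-insensitive word forms."""
    return WORD_FORM.findall(text.lower())

def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def find_emerging(daily_counts, daily_totals, start_days,
                  min_total=500, min_rho=0.70, max_start_per_million=1.0):
    """Return (form, rho, start_rate) for forms passing all three filters.

    daily_counts: {form: [count on day 0, day 1, ...]}
    daily_totals: [total words on day 0, day 1, ...]
    start_days:   number of initial days treated as the start segment
    """
    days = list(range(len(daily_totals)))
    start_total = sum(daily_totals[:start_days])
    emerging = []
    for form, counts in daily_counts.items():
        if sum(counts) < min_total:          # overall frequency floor
            continue
        start_rate = 1e6 * sum(counts[:start_days]) / start_total
        relative = [c / t for c, t in zip(counts, daily_totals)]
        rho = spearman(relative, days)       # monotonic rise over time
        if rho > min_rho and start_rate < max_start_per_million:
            emerging.append((form, rho, start_rate))
    return sorted(emerging, key=lambda row: -row[1])
```

The default thresholds mirror the values reported above (500 occurrences, a coefficient above .70, fewer than one occurrence per million words at the start). The manual filtering of proper nouns, dictionary words, and job-advertisement terms is not something this sketch attempts.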

4. Mapping Lexical Innovation

There are various ways to map the origin and spread of an emerging word. The simplest approach is to map the relative frequency of the word in all texts from each location in a corpus over a series of time periods, normalized by the total number of words in all texts from that location during those periods. It is also possible to map the cumulative relative frequency of a word by calculating its relative frequency in all the texts from each location from the start of the corpus up to a series of points in time. We prefer mapping cumulative relative frequencies because it helps to further control for variation in sample size across locations, as the amount of data available on a given day in a given county is often very limited, while also highlighting locations where the word had been used in the past, making regional patterns clearer. For example, Figure 1 maps the cumulative relative frequency per billion words of baeless from the start of the corpus up until eight points in time (see note 5). Darker shades indicate a high relative frequency in that county, lighter shades indicate a low relative frequency, and white indicates no occurrences of the word; the scale is based on the quartiles for non-zero values in the complete corpus. The maps show that the earliest usages of baeless were concentrated in the South in a number of largely discontinuous counties, especially in Georgia. The word then spread through much of the South by the end of the first quarter of 2014, before moving to urban areas farther afield, first to the Midwest by mid-2014, and then eventually to the Northeast and the West. By the end of 2014 the word had spread across most of the country, although its usage was still concentrated in the South.
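As a rough illustration of the cumulative relative frequency described above, the following Python sketch computes, for a single county, the rate per billion words from the start of the corpus up to each day, and then takes a snapshot across counties at one point in time. The function names and the input layout (per-day lists of word counts and total word counts for each county) are assumptions made for the example, and a county with no data so far is given a value of zero.

```python
from itertools import accumulate

def cumulative_rate_per_billion(word_counts, total_counts):
    """Cumulative relative frequency per billion words for one county.

    word_counts:  occurrences of the word on each day
    total_counts: total words collected in the county on each day
    Returns one value per day, each covering the corpus start up to that day.
    """
    cumulative_word = accumulate(word_counts)
    cumulative_total = accumulate(total_counts)
    return [1e9 * w / t if t else 0.0
            for w, t in zip(cumulative_word, cumulative_total)]

def snapshot(county_word_counts, county_total_counts, day_index):
    """Map-ready values: {county: cumulative rate up to and including day_index}."""
    return {
        county: cumulative_rate_per_billion(
            county_word_counts[county], county_total_counts[county]
        )[day_index]
        for county in county_word_counts
    }
```

Calling `snapshot` at several day indices would give the kind of series of maps described above, one dictionary of county values per time point.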
In addition to a series of maps, it is also possible to plot change in the usage of a word over time on one map, which simplifies not only visualization but also multivariate analysis. There is, however, no standard solution to this problem. A basic approach is to map the relative frequencies of the word for some period of time (e.g., any one of the maps presented in Figure 1), but this ignores change over time, making it difficult to distinguish the origin of a word from its spread. Another approach is to map the time since the word first occurred, but this ignores the amount of data available at each location; in other words, patterns of usage can be obscured by variation in sample size, which tends to correspond to patterns of population density. A better option is to take both types of information into consideration and map the time since the word first reached a specific relative frequency threshold, allowing for the origin and spread of a word to be plotted together, while controlling for variation in the amount of data at each location. For example, Figure 2 maps the number of days since the cumulative relative frequency of baeless first reached 1087 occurrences per billion words by county, where darker shades indicate the word hit this relative frequency threshold at a relatively early date. We selected a relative frequency threshold of 1087 occurrences per billion words because baeless is used at least this often in 25 percent of counties in the complete corpus, although we could have used another threshold (see below). We then measured and mapped the number of days, relative to the end of the corpus, that had elapsed since the cumulative relative frequency of the word first reached this third-quartile relative frequency threshold in each county.
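The single-map measure described above can be sketched in Python as follows, assuming that a per-day series of cumulative relative frequencies is already available for each county. The function names are hypothetical, the third quartile is taken with the standard library's default quantile method (the paper does not say which estimator was used), and returning `None` for counties that never reach the threshold is our own convention.

```python
import statistics

def third_quartile(final_rates):
    """75th percentile of the counties' cumulative rates at the end of the corpus."""
    return statistics.quantiles(final_rates, n=4)[2]

def days_since_threshold(cumulative_rates, threshold):
    """Days elapsed, counted back from the last day, since the threshold was first reached.

    cumulative_rates: cumulative relative frequency on each day for one county
    Returns None if the county never reaches the threshold (an assumed convention).
    """
    last_day = len(cumulative_rates) - 1
    for day, rate in enumerate(cumulative_rates):
        if rate >= threshold:
            return last_day - day
    return None

def emergence_map(county_rates):
    """{county: days since the third-quartile threshold was first reached}."""
    threshold = third_quartile([rates[-1] for rates in county_rates.values()])
    return {county: days_since_threshold(rates, threshold)
            for county, rates in county_rates.items()}
```

Larger values correspond to counties where the word reached the threshold earlier, which matches the darker shades described for Figure 2.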
These lexical emergence maps appear to successfully represent the underlying patterns visible in the geographical time series, as can be seen by comparing Figures 1 and 2, which both show that the form originates in the South, especially Georgia.
Lexical emergence maps for the fifty-four words are presented in Appendix I (available in supplementary materials online). Various sources of lexical innovation are attested in these maps. Perhaps the clearest overall pattern is a distinction between words that are primarily associated with the South (e.g., baeless, fallback), many of which appear to come from African American English, and words that are primarily associated with the rest of the United States (e.g., amirite, gainz). There are also several words associated with more specific parts of the country, including Louisiana (e.g., idgt, lordt), Georgia (e.g., boolin, brazy), the Mid-Atlantic (e.g., thottin, tookah), the West Coast (e.g., cosplay, tbfh), and New York (e.g., litt, lituation). Many of these maps also show regional patterns of spread, with words reaching much of the rest of the country on Twitter by the end of 2014. To confirm this interpretation, we ran a series of global Moran’s I spatial autocorrelation analyses, which test the degree to which the values of a regional variable exhibit spatial clustering or dispersion (Bivand et al. 2008; Grieve 2018). Specifically, we calculated Moran’s I using a twenty-five neighboring county spatial weights matrix (see section 4). We found that all fifty-four lexical emergence maps exhibit significant levels of global spatial autocorrelation (p < .0001). In other words, our analysis shows that the origin and spread of lexical innovations on Twitter is geographically patterned, as broadly predicted by the wave model.
This basic result was to be expected, since all new words must begin somewhere and, if they are to enter into common usage, must spread out from this source. This is nevertheless the first time that such a large number of emerging forms has been mapped at the same time, providing empirical support for an important theoretical assumption in linguistics. Although physical distance affects the emergence of new words on Twitter, it leaves much about these maps unexplained. Emerging words do not first occur in one county or in one cluster of counties before spreading out radially to adjacent counties. Counties with the earliest attestations of a word do tend to be found primarily in one part of the country, but these regions are often large and contain many counties where the word is never used at all, and there are often early usages scattered across the rest of the country. Other factors must therefore affect the spread of emerging words on social media. These factors appear to include population density, as predicted by the hierarchical model, given that these emerging words often seem to spread to major urban areas before reaching the rest of the country. Overall, however, regional patterns of lexical innovation on Twitter are highly complex, making it difficult to draw generalizations through the manual analysis of so many maps. To identify common regional patterns of lexical innovation we therefore conducted a multivariate spatial analysis of these maps.
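The global Moran's I test mentioned earlier in this section can be illustrated with a small Python sketch. It is not the authors' code: it assumes binary (unstandardized) weights over each location's k nearest neighbors, Euclidean distance between coordinate pairs standing in for county centroids, and a one-sided permutation test for significance, none of which is specified in the text. The function names are hypothetical, and the default `k=25` simply echoes the twenty-five-neighbor weights matrix reported above.

```python
import math
import random

def knn_neighbors(coords, k=25):
    """For each location, the indices of its k nearest other locations."""
    neighbors = []
    for i, (xi, yi) in enumerate(coords):
        distances = sorted(
            (math.hypot(xi - xj, yi - yj), j)
            for j, (xj, yj) in enumerate(coords) if j != i
        )
        neighbors.append([j for _, j in distances[:k]])
    return neighbors

def morans_i(values, neighbors):
    """Global Moran's I with binary weights (1 for a listed neighbor, else 0)."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    weight_sum = sum(len(nb) for nb in neighbors)
    cross = sum(deviations[i] * deviations[j]
                for i, nb in enumerate(neighbors) for j in nb)
    variance_sum = sum(d * d for d in deviations)
    return (n / weight_sum) * cross / variance_sum

def permutation_test(values, neighbors, n_perm=999, seed=0):
    """Observed I and a one-sided pseudo p-value for positive autocorrelation."""
    rng = random.Random(seed)
    observed = morans_i(values, neighbors)
    shuffled = list(values)
    at_least_as_large = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if morans_i(shuffled, neighbors) >= observed:
            at_least_as_large += 1
    return observed, (at_least_as_large + 1) / (n_perm + 1)
```

Values of I well above zero indicate that nearby locations tend to have similar values (clustering), while values below zero indicate dispersion. In practice an established spatial statistics package would normally be preferred over a hand-rolled version like this one.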

7. Conclusion

In this paper, we introduced methods for mapping individual and common patterns of lexical innovation in large time-stamped and geo-coded corpora. We then reported the results of using these methods to map the origin and spread of new words in a multi-billion-word corpus of American Twitter collected between 2013 and 2014. Based on the maps for fifty-four emerging words, we identified five common regional patterns of lexical innovation, primarily associated with the West Coast, the Northeast, the Mid-Atlantic, the Deep South, and the Gulf Coast. Because this is the first time that such a large sample of emerging words (or of any type of linguistic innovation) has been mapped in one variety of language, these results extend our understanding of the actuation and diffusion of linguistic innovation in several ways. In addition to mapping individual emerging words and identifying five common patterns of lexical innovation on American Twitter, we believe our study has made five more general contributions to our understanding of language change:
1. Regional patterns of lexical innovation can be observed in written online communication, even though most of these words do not appear to have first occurred on social media.
2. Emerging words on Twitter tend to originate from a small number of hubs of lexical innovation and spread along relatively consistent pathways of diffusion.
3. Emerging words on Twitter tend to originate from urban areas, but the cultural influence of an urban area appears to be more important than its size.
4. The diffusion of emerging words is affected by geography and population density, as predicted by the wave and hierarchical models, but also by cultural patterns, with emerging words tending to spread within cultural regions first.
5. African American English is the main source of lexical innovation on American Twitter.
The degree to which these results can be generalized across different registers, dialects, eras, and languages, as well as different levels of linguistic analysis, is an open question. Twitter is only one variety of language, which does not account for a large percentage of most people’s linguistic output and which presumably is not the variety where most of these words were first used or through which most of these words are primarily spread. A corpus of Twitter can therefore only partially reflect patterns of lexical innovation in the language as a whole, as would be the case for a corpus representing any variety of modern American English, including spoken varieties—especially in the modern world, where communication takes place across so many channels, online and off. However, given that almost all these words appear to be used in everyday speech, and given that Twitter is a marginal variety of language that is geographically unconstrained and that therefore should not necessarily show such patterns, we believe our results may in fact reflect the general spread of these words in American English. Regardless, this analysis has provided us with an unprecedented testing ground for theories of actuation and diffusion, and has clearly demonstrated that these patterns are far more complex than has been previously observed. Finally, this study has provided a methodological framework for future research on the spatial analysis of linguistic innovation, by showing how the origin and spread of emerging words can be measured and mapped. Crucially, this paper has introduced a method for reducing a geographical time series for a single word down to a single map, which is not only useful for visualization, but as a precursor for dimension reduction, as demonstrated in this study. Although this method was used here to study emerging words, it could also be used to map the use of any linguistic form over time. 
This study has also shown how multivariate spatial analysis, an approach that was developed for the analysis of dialect patterns, can be used to identify common sources of linguistic innovation. More broadly, this study has illustrated how the quantitative analysis of very large corpora of natural, written, online communication allows for new research questions of considerable general importance to linguistic theory to be pursued. There can be no doubt that as more data becomes available online from across a wider range of varieties, and as techniques from data science become more widely accepted in linguistics, our understanding of language variation and change will continue to be enriched.

Acknowledgements In addition to our funders, we would like to thank Alexandra D’Arcy, Matthew Gordon, Peter Grund, Alice Kasakoff, David Saad, Nikhil Sonnad, Emily Waibel, and two anonymous reviewers for their comments on this study, as well as Hans-Jörg Schmid, Daphné Kerremans, Jelena Prokic, and Quirin Würschinger, who invited us to present this research at the Dynamics of Lexical Innovation Workshop at LMU Munich, 28–30 June 2017.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported in this article was funded by the Arts and Humanities Research Council (UK), the Economic and Social Research Council (UK), Jisc (UK) (Jisc grant reference number 3154), and the Institute of Museum and Library Services (US), as part of the Digging into Data Challenge (Round 3).

Notes

1. More specifically, Croft (2000) makes a distinction between “actuation” or “innovation,” on the one hand, and “diffusion” or “propagation,” on the other. We, however, use the term “innovation” to refer to new linguistic forms in general (i.e., to refer to a form as opposed to a process), and we thus refer to both the actuation and the diffusion of linguistic innovations.

2. Throughout this paper, we use the term “variety of language” to refer to a type of language defined based on extra-linguistic criteria, specifically the situational, social, and temporal context in which that language is produced, as is standard in corpus linguistics (e.g., Biber et al. 1998). In particular, our Twitter corpus is defined in terms of all three of these extra-linguistic criteria: it is situationally defined as being composed of texts posted on Twitter, socially defined as texts posted by users from the United States, and temporally defined as texts posted between 2013 and 2014.

3. Based on survey data, Duggan and Brenner (2013) provide the percentage of people from different demographic groups who use Twitter, as opposed to the demographics of Twitter users. However, taking population statistics for the United States into consideration, it is possible to estimate the demographics of Twitter users based on these results. For example, Duggan and Brenner (2013) find that 14 percent of White Americans and 26 percent of African Americans use Twitter. Given that there are approximately 233 million White Americans (74 percent of the general population) and 40 million African Americans (13 percent of the general population) (U.S. Census Bureau 2015), we can infer that there are approximately three times more White Americans than African Americans in our corpus, compared to approximately six times more White Americans than African Americans in the general population.

4. We set this threshold to maximize the number of words included in our analysis, while excluding less frequent forms, so we could focus on those forms that we felt could potentially exhibit patterns across time (399 days) and space (3075 counties). In particular, we found that forms that occurred much less frequently did not generally show clear regional patterns when mapped, and we found that further lowering this threshold resulted in relatively few additional forms being identified. Applying a lower threshold would therefore have little effect on our results. If our method were applied to other corpora, this threshold would need to be re-evaluated, especially if the dimensions of the corpus differed.

5. Baeless is an adjective that means to be without a partner and is a derivation of bae, itself a recent formation created most likely through the truncation of babe.

Supplemental Material

Supplemental material for this article is available online.