Quantifying semantic and contextual change

The Macroscope provides researchers with the ability to examine two distinct but related aspects of linguistic change in individual words over historical time, as shown in Fig. 2. First, diachronic word embeddings computed from the co-occurrence matrix enable us to discover words that are semantically similar to a given word for a given year (i.e., revealing the semantic or synonym structure surrounding a word). These semantically related words will be referred to as synonyms for the remainder of this article (top portions of Fig. 2). Second, the co-occurrence matrix provides information regarding the context of a given word in a given year. Words that co-occur with the target word will be referred to as context words for the remainder of this article (bottom portions of Fig. 2).

Fig. 2 Conceptual framework summarizing the key features of the Macroscope. The Macroscope permits synchronic (left) and diachronic (right) analysis of the semantic/synonym (top) and contextual/co-occurrence (bottom) structures of words.

In addition to being able to “focus” the Macroscope on the semantics and contextual structure of an individual word in a particular year, the true power of the Macroscope is harnessed when the researcher “zooms” out to obtain a bird’s eye view of changes in the semantic and contextual structure of words over historical time. Below we describe how the Macroscope can be used to examine the semantic (synonym) and contextual (co-occurrence) structures of individual words for a specific year (i.e., zooming in) and over historical time (i.e., zooming out). In the analyses described below, techniques from network analysis are employed to help with the interpretation and visualization of the synonym and co-occurrence structures of words. All analyses can be easily replicated using the Macroscope, and the user can download the network graphs along with the data used to construct the graphs.

Synchronic semantic structure of words: Historical synonyms

How do we know what a word meant in the past? Using diachronic word embeddings, the Macroscope can quantify semantic similarity by computing the cosine similarity of the word embeddings for any pair of words. Therefore, a word’s historical meaning can be inferred by finding its most semantically similar words in a given time period (i.e., synonyms).
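The synonym lookup described above can be sketched in a few lines. This is a minimal illustration rather than the Macroscope's own code: the function names and the toy three-dimensional vectors are invented for the example, and real embeddings would have many more dimensions.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def nearest_words(target, embeddings, k=5):
    """Return the k words most similar to `target`, excluding the target itself."""
    scores = [
        (word, cosine_similarity(embeddings[target], vec))
        for word, vec in embeddings.items()
        if word != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, invented purely for illustration (not taken from the corpus).
toy_embeddings = {
    "anxiety":    [0.9, 0.8, 0.1],
    "depression": [0.8, 0.9, 0.2],
    "fear":       [0.2, 0.5, 0.9],
    "table":      [-0.7, 0.1, 0.3],
}
print(nearest_words("anxiety", toy_embeddings, k=2))
```

Running the same lookup against the embeddings of a different year would yield that year's synonyms, under the assumption that each year has its own embedding table.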

Anxiety and depression are conceptualized as two distinct emotions by psychologists, yet often they are experienced by the general population as the same feeling (Barrett, 2017). To examine how these concepts are represented in the written language and produced and read by people who do not necessarily have a psychology background, we used the Macroscope to identify the synonyms of anxiety, depression, and fear using co-occurrence data from the year 2000 (see Table 1). Anxiety and depression share many synonyms that are associated with mental disorders. In contrast, fear, another commonly experienced negative emotion, appears to have different synonyms from anxiety and depression.

Table 1 Top five closest synonyms of depression, anxiety, fear, disgust, and anger from the year 2000, provided by the Macroscope.

To better capture how these three emotion concepts are related to each other, the Macroscope provides a network graph representing the semantic similarity structure of their synonyms. The nodes shown in the network represent the top five synonyms for fear, depression, and anxiety as identified above, as well as the words fear, depression, and anxiety themselves. The edges between nodes are weighted by the strength of semantic similarity between word pairs (i.e., the cosine similarity between word embeddings). Only edges whose weights exceed a threshold of .8 are shown in the network (this value can be set by the user). If the synonyms of two words share a high degree of semantic similarity (i.e., if they are connected to each other in the semantic network), this indicates that the two words are likely to be used in similar contexts and are semantically “close” to each other. Higher semantic similarity among the synonyms of two words offers an additional layer of depth for investigating how similar the meanings of the two words are, even if the two words do not share the same synonyms. Though previous tools have provided quantitative information about word similarity (e.g., BEAGLE from Jones & Mewhort, 2007; LSA from Landauer, Foltz, & Laham, 1998), the present example demonstrates how the Macroscope provides and visualizes additional information about the broader semantic similarity structure of words via their synonyms. Figure 3 (left panel) shows that the synonyms of anxiety and depression are synonyms of each other but are distinct from those of fear. Although psychologists treat anxiety and depression as two separate constructs, they appear to be used in semantically similar contexts in written language.
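The edge-selection step can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the Macroscope's implementation: the function name `similarity_edges` and the two-dimensional toy vectors are hypothetical, and only the .8 threshold comes from the text.

```python
from itertools import combinations

import numpy as np

def similarity_edges(embeddings, threshold=0.8):
    """Return (word_a, word_b, weight) for each pair whose cosine similarity exceeds `threshold`."""
    words = sorted(embeddings)
    matrix = np.array([embeddings[w] for w in words], dtype=float)
    # Normalize rows so that a plain dot product equals the cosine similarity.
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    sims = unit @ unit.T
    return [
        (words[i], words[j], float(sims[i, j]))
        for i, j in combinations(range(len(words)), 2)
        if sims[i, j] > threshold
    ]

# Toy vectors, invented for illustration only.
toy_vectors = {
    "anxiety":    [1.0, 0.1],
    "depression": [0.9, 0.2],
    "fear":       [0.1, 1.0],
    "dread":      [0.2, 0.9],
}
edges = similarity_edges(toy_vectors, threshold=0.8)
for word_a, word_b, weight in edges:
    print(f"{word_a} -- {word_b}: {weight:.3f}")
```

Lowering the threshold adds weaker edges and makes the graph denser, which is presumably why the value is left to the user.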

Fig. 3 (Left) Synonym structure of anxiety, depression, and fear. (Right) Synonym structure of disgust, fear, and anger. The nodes represent the emotion concepts of interest and the top five most similar synonyms for each of the emotion concepts. The colors represent the community structure of nodes in the network, with each community represented with a different color. Community structure was detected by an algorithm proposed by Blondel, Guillaume, Lambiotte, and Lefebvre (2008).

The same network approach used to represent concepts and their synonyms can also provide insights into the overlapping and distinctive components of two concepts. A similar analysis was conducted for the emotion words fear, disgust, and anger, three of the six basic emotions that are proposed to exist universally across cultures (Ekman, 1992). The results indicated that all three negative emotions intersect with some of each other’s synonyms (see Table 1). Figure 3 (right panel) shows that the concepts of anger, fear, and disgust share similar connections to such words as disappointment, bitterness, and loathing. However, each of these emotion concepts is also marked by its own unique components, which make the concepts distinct from each other: disgust is linked with dismay, anger with rage and resentment, and fear with dread.

Diachronic semantic structure of words: Semantic drift analysis

With diachronic language data, the Macroscope is able to track how the semantics of individual words change over time. In the following examples we show how several words “move” along a path in a semantic space defined by their historical synonyms. A longer path moving from one point in the semantic space to another indicates significant changes in a word’s semantic meaning over time. In contrast, a path that stays within a confined semantic space suggests that the word has retained its meaning over the time window examined.

Using the Macroscope, the user can conduct a semantic drift analysis by inputting the word of interest, beginning and end time points (e.g., the years 1850 and 2000), and intervening intervals (e.g., spaced every 50 years). A semantic space is then constructed for a target word by searching for its historical synonyms at the beginning time point (1850) and its modern synonyms at the end time point (2000). All synonyms’ word embeddings are taken in their modern sense (2000). The Macroscope also retrieves the historical word embeddings of the target word for each time point of interest (i.e., 1900, 1950) and aligns these historical embeddings to its modern embedding using orthogonal Procrustes (Schönemann, 1966), an algorithm to map one matrix to another of the same shape. Finally, these word embeddings are visualized in a two-dimensional space using principal component analysis (PCA). All synonyms in this two-dimensional space are represented in their modern sense. Although in reality all word meanings fluctuate over time, we elected to adopt this approach in order to provide a clearer understanding of how changes in a word’s historical meaning occur over time, as benchmarked against its modern sense.
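The alignment and projection steps can be sketched with NumPy. This is a minimal sketch under stated assumptions: the rotation is fitted on full embedding matrices whose rows correspond to the same words in both years (the text does not say exactly which rows are used), the data are random stand-ins, and the function names are hypothetical.

```python
import numpy as np

def procrustes_rotation(source, target):
    """Orthogonal matrix R minimizing ||source @ R - target||_F (rows = shared words)."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt

def pca_2d(points):
    """Project the rows of `points` onto their first two principal components."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.default_rng(0)
modern = rng.normal(size=(50, 10))                # stand-in for year-2000 embeddings
q, _ = np.linalg.qr(rng.normal(size=(10, 10)))    # an arbitrary orthogonal matrix
historical = modern @ q.T                         # same geometry, rotated axes

rotation = procrustes_rotation(historical, modern)
aligned = historical @ rotation                   # historical vectors in the modern space
print(np.allclose(aligned, modern))

# Project a few modern "synonym" rows plus one aligned historical row to 2-D.
coords = pca_2d(np.vstack([modern[:5], aligned[:1]]))
print(coords.shape)
```

Because the toy historical matrix is an exact rotation of the modern one, the alignment recovers the modern vectors; with real data the residual after alignment is what reflects semantic change.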

We used the Macroscope to examine the semantic change of three words that have been previously documented in historical linguistics (Jeffers & Lehiste, 1979). The first three panels of Fig. 4, in the top row and lower left, show semantic drift analyses of broadcast, cell, and car from the year 1850 to 2000 (at 50-year intervals). In 1850, the word broadcast referred to “disperse upon the ground by hand” and was closely associated with agricultural activity. In 2000, the word broadcast referred to radio and other media-related concepts. Our analysis shows that this change primarily took place between 1900 and 1950, the time period during which radio and television were invented (Fig. 4, top left). Cell changed its dominant meaning from “a chamber in a prison” to a biological term, and this change predominantly took place between 1850 and 1900 (Fig. 4, top right). In 1850 the word car referred to a horse-driven wagon, but after the automobile was invented in 1885, it quickly acquired its modern sense. The semantic drift analysis shows that by 1900 car was no longer associated with a wagon (Fig. 4, bottom left), but with modern transportation vehicles such as bus and truck. In addition, we conducted a similar analysis for a word that was likely to have been semantically stable over time: happy. The semantic drift analysis confirmed our intuitions: The word happy remained within the same semantic space over the past 150 years.

Fig. 4 Semantic drift analysis for (top left) broadcast, (top right) cell, (bottom left) car, and (bottom right) happy from 1850 to 2000 at 50-year intervals. The blue dots indicate words that are semantically related to the target word of interest (i.e., its synonyms at the first and last time points). The path taken by the red dots indicates the “drift” in semantics of the target word from 1850 to 1900, from 1900 to 1950, and from 1950 to 2000.

The semantic drift analysis shown in Fig. 4 offers a qualitative visualization of how word meanings have changed over history, but it is not easy to use such visualizations to quantitatively compare semantic stability between words (e.g., the semantic path traveled by happy relative to the path traveled by broadcast from 1850 to 2000). Previous work has examined the properties of words that appear to show the highest degree of stability over historical time (e.g., Hamilton et al., 2016; Monaghan, 2014; Pagel, Atkinson, & Meade, 2007). Since the Macroscope provides information on diachronic changes in semantics, it can be used to quantify the semantic stability of words as follows:

$$ \mathrm{Stability}\left(w_i, t\right) = \mathrm{cos\_sim}\left(w_i(T),\, w_i(T+t)\right) $$

where \( w_i(T) \) refers to the word embedding of word \( w_i \) in year T, and t is the number of years separating the two time points being compared. Semantic similarity ranges from 0 to 1. For example, the similarity of happy between the years 1850 and 2000 is .74, much higher than the values for words that underwent greater semantic change, such as broadcast (.08), cell (.17), and car (.47). This allows researchers to examine potential forces that may have influenced semantic change. As a baseline for further examination, the Macroscope provides the semantic stability of a word in relation to its modern and historical word embeddings. Using this method, we retrieved the ten most stable words from 1800 to 2000. They are and, the, when, his, he, they, him, in, them, and a. A complete list of word stability between these two time points can be downloaded from the Macroscope.
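A stability ranking of this kind could be computed as in the sketch below. The vectors are invented two-dimensional stand-ins and are assumed to be already aligned across years; the function names are hypothetical, and the resulting scores are not the values reported above.

```python
import numpy as np

def stability(vec_start, vec_end):
    """Cosine similarity between one word's embeddings at two time points."""
    a = np.asarray(vec_start, dtype=float)
    b = np.asarray(vec_end, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_by_stability(start_embeddings, end_embeddings):
    """Sort the words present at both time points from most to least stable."""
    shared = start_embeddings.keys() & end_embeddings.keys()
    scores = {w: stability(start_embeddings[w], end_embeddings[w]) for w in shared}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Invented vectors, assumed to live in a common (aligned) space.
emb_1850 = {"happy": [1.0, 0.2], "broadcast": [0.1, 1.0]}
emb_2000 = {"happy": [0.9, 0.3], "broadcast": [1.0, 0.1]}

ranking = rank_by_stability(emb_1850, emb_2000)
for word, score in ranking:
    print(f"{word}: {score:.2f}")
```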

Synchronic contextual structure of words

Synonym analysis provides an accessible way to examine the semantic structure of words, based on the assumption that words that are used in similar contexts are also semantically related to each other (e.g., Jones & Mewhort, 2007). On the other hand, identifying the particular context(s) in which a word has been used can help us understand how polysemous words are used in their different senses across varying contexts, furthering our understanding of the relationship between the semantic and co-occurrence structures of words. For instance, it is possible for words to have a stable semantic/synonym structure but a varying co-occurrence structure over time. A concrete example can be seen in the word woman. Although the semantic meaning of the word woman has not changed much over the past 200 years, in recent decades the word has increasingly been used in the context of social issues surrounding feminism, gender discrimination, and abortion—contexts that were not commonly discussed during the 1800s.

The following co-occurrence networks of the words monitor, nuclear, option, and gay show how the Macroscope can be used to understand the contextual structure of words. All networks were centered at the target word of interest. The context words, represented as nodes in the network, were selected on the basis of their PPMI value with the target word. The edges were weighted by the PPMI values between each word pair. Next, nodes with a low co-occurrence frequency with the target word and edges signaling low PPMI values were removed. During the procedure, arbitrary thresholds for parameters must be specified in order to produce meaningful network graphs. The networks presented below were constructed using a PPMI threshold of 3 and a minimum co-occurrence frequency of 200 times per ten billion words. Communities are subgroupings of nodes that are more likely to be connected to each other than to other nodes within the network. Community structures of the network are detected using an algorithm introduced by Blondel, Guillaume, Lambiotte, and Lefebvre (2008), based on modularity optimization, which uses an iterative process that defines each node as a community at the first step and merges them until modularity (a measure of the strength of the communities) is optimized.
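The node-filtering step can be sketched as follows. Several details are assumptions made for illustration: PPMI is computed with a base-2 logarithm from raw counts, the PPMI cutoff is applied as a strict inequality, and all counts and function names are invented. Only the two threshold values come from the text.

```python
import math

def ppmi(pair_count, target_count, context_count, total):
    """Positive pointwise mutual information from raw counts (log base 2 assumed)."""
    if pair_count == 0:
        return 0.0
    pmi = math.log2((pair_count * total) / (target_count * context_count))
    return max(0.0, pmi)

def select_context_words(pair_counts, word_counts, target, total,
                         min_ppmi=3.0, min_per_unit=200, unit=10_000_000_000):
    """Keep context words that pass both the PPMI and the normalized frequency thresholds."""
    kept = {}
    for context, count in pair_counts.items():
        rate = count / total * unit  # co-occurrences per `unit` words
        score = ppmi(count, word_counts[target], word_counts[context], total)
        if score > min_ppmi and rate >= min_per_unit:
            kept[context] = score
    return kept

# Invented counts for a hypothetical corpus of ten billion words.
total_words = 10_000_000_000
word_counts = {"monitor": 1_000_000, "screen": 2_000_000,
               "the": 600_000_000, "crt": 500}
pair_counts = {"screen": 5_000, "the": 70_000, "crt": 150}

kept = select_context_words(pair_counts, word_counts, "monitor", total_words)
print(kept)
```

In this toy data the function word is dropped for low PPMI and the rare word for low co-occurrence frequency, which shows why the two thresholds complement each other.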

Figure 5a shows the contextual network structure of monitor in the year 2000. Community detection analysis of the contextual network showed approximately three distinct contexts in which the word was used: as a computer device, in healthcare-related settings, and with a group of nouns that it often accompanies. From the contextual network structure of monitor, one can infer that it is used as a noun or a verb. As a noun, monitor is often referred to as a computer device; as a verb, monitor is often used in medical settings.
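To make the notion of community strength concrete, the sketch below computes the modularity score of a given partition of a weighted, undirected network. It is not the Blondel et al. (2008) algorithm itself, only the quantity that such an algorithm seeks to optimize; the node names, edge weights, and function name are invented for illustration.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity of a partition of an undirected weighted graph without self-loops."""
    total_weight = sum(w for _, _, w in edges)  # m, the sum of all edge weights
    degree = defaultdict(float)
    internal = defaultdict(float)
    for a, b, w in edges:
        degree[a] += w
        degree[b] += w
        if community_of[a] == community_of[b]:
            internal[community_of[a]] += w
    community_degree = defaultdict(float)
    for node, k in degree.items():
        community_degree[community_of[node]] += k
    # Q = sum over communities of (internal weight / m) - (community degree / 2m)^2
    return sum(
        internal[c] / total_weight - (community_degree[c] / (2 * total_weight)) ** 2
        for c in community_degree
    )

# Two tightly knit groups of hypothetical context words joined by a single edge.
toy_edges = [
    ("screen", "keyboard", 1.0), ("keyboard", "mouse", 1.0), ("screen", "mouse", 1.0),
    ("patient", "heart", 1.0), ("heart", "blood", 1.0), ("patient", "blood", 1.0),
    ("mouse", "patient", 1.0),
]
split = {"screen": 0, "keyboard": 0, "mouse": 0, "patient": 1, "heart": 1, "blood": 1}
single = {node: 0 for node in split}

q_split = modularity(toy_edges, split)
q_single = modularity(toy_edges, single)
print(round(q_split, 3), round(q_single, 3))
```

The two-community partition scores higher than lumping every node together, which is the kind of comparison a modularity-optimizing method makes when deciding whether to merge communities.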

Fig. 5 The contextual network structure of (a) monitor, (b) nuclear, (c) gay in year 2000, (d) gay in year 1850, and (e) option. The nodes represent the context words that co-occurred with the target word in a given year. The size of nodes is proportional to their usage frequency in a given year. The nodes were included in the networks if they had a PPMI value greater than 3 with other words, and a minimum co-occurrence frequency of 200 times out of one billion words with the target word. The colors represent the community structure of nodes in the network and each community is represented with a different color.

Figure 5b shows the contextual network structure of nuclear in the year 2000, which shows that the word is used in a number of distinct contexts: It can refer to a power source, physical phenomena, a technology known as nuclear magnetic resonance, or a weapon associated with some countries (Iraq, Cuba, Korea) but not with other nuclear-armed states.

Figure 5e shows an example of what the contextual structure of a polysemous word such as option looks like. Other than the conventional context of choosing among various possibilities, option also refers to a financial instrument. As Fig. 5e shows, its contextual structure in the year 2000 was divided into two components. One involves its traditional sense, which incorporates use of the option button on a keyboard. The other component consists of finance-related terms. It is important to note that such information would not be available if one only analyzed the synonyms of option in the year 2000 (which are options, cancel, default, item, and choose), further highlighting how an analysis of a word’s contextual structure can complement the analysis of that word’s semantic structure.

As we mentioned earlier, understanding the contextual usage of a concept can be useful for inferring changes in the sociocultural environment. Figure 5c shows the context in which the word gay was used in the year 2000. It was not only associated with homosexuality, but also with a political movement associated with issues that extended beyond gay rights, such as feminism and abortion. Sexually transmitted diseases such as HIV and AIDS also appeared in this context, reflecting a social awareness of the association between homosexuality and the way that these diseases were transmitted among communities of gay men during the AIDS epidemic in the 1980s and 1990s. In contrast, 150 years earlier, not only did all these associations not exist, the word gay simply did not refer to homosexuality. The contextual structure analysis suggests that the word gay in 1850 was used in contexts involving fashionable clothes, cheerful mood, and pleasant colors (Fig. 5d).

Diachronic contextual structure of words

In addition to quantifying the contextual structure of words at a static point in time, the Macroscope allows users to quantify changes in the contextual structure of words diachronically. Figure 6 shows how the frequency of co-occurrence of the words co-occurring with gay and nuclear has changed between the years 1950 and 2000. The words with the largest blue bars extending to the right (top of each y-axis) are those whose frequency of co-occurrence with the given word has increased the most from 1950 to 2000, whereas the words with largest red bars extending to the left (bottom of each y-axis) are those whose frequency of co-occurrence with the given word has declined the most from 1950 to 2000. For instance, for the word gay, lesbian and bisexual increased the most in their frequency of co-occurrence, whereas happy and hearted decreased the most in their frequency of co-occurrence. For the word nuclear, weapons and magnetic increased the most in their frequency of co-occurrence, whereas molecule and spin decreased the most in their frequency of co-occurrence, reflecting the increased usage of nuclear for a weapon of destruction in recent years, as compared to its scientific sense in the 1950s.
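The comparison behind this kind of plot can be sketched as a simple difference between two years. The frequencies below are invented, and the use of a raw difference (rather than, say, a ratio) is an assumption; the function name is hypothetical.

```python
def largest_changes(cooc_start, cooc_end, k=2):
    """Return the k largest increases and the k largest decreases in co-occurrence frequency."""
    words = cooc_start.keys() | cooc_end.keys()
    delta = {w: cooc_end.get(w, 0.0) - cooc_start.get(w, 0.0) for w in words}
    ranked = sorted(delta.items(), key=lambda item: item[1], reverse=True)
    increases = [pair for pair in ranked[:k] if pair[1] > 0]
    decreases = [pair for pair in ranked[::-1][:k] if pair[1] < 0]
    return increases, decreases

# Invented co-occurrence frequencies with the target word "gay".
cooc_1950 = {"happy": 40.0, "hearted": 30.0, "lesbian": 1.0, "bisexual": 0.5, "young": 20.0}
cooc_2000 = {"happy": 5.0, "hearted": 2.0, "lesbian": 60.0, "bisexual": 35.0, "young": 18.0}

increases, decreases = largest_changes(cooc_1950, cooc_2000, k=2)
print("increased most:", increases)
print("decreased most:", decreases)
```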

Fig. 6 Words whose frequency of co-occurrence with gay and nuclear changed the most from 1950 to 2000. Words that increased the most in their frequency of co-occurrence with the target word from 1950 to 2000 are shown in blue near the top and words that decreased the most are shown in red near the bottom. The x-axes on the left and right side of the y-axis are scaled differently so that the y-axis is centered in the middle of the graph.

Although the previous analysis shows the largest changes in the frequency of co-occurring words between two time points, it is not completely clear to what extent a word would have “lost” its old meaning. For instance, it is possible for a word’s old meaning to still be in use, albeit not as commonly used as before. In addition, the previous analysis does not contain information regarding fine-grained changes in the frequency of co-occurring words during the time period between the two specified time points.

One way to address these questions would be to examine the extent to which a given word co-occurred with words found in its historical context. These context words can be obtained from the synchronic contextual structure analysis described earlier (see Fig. 5). Users of the Macroscope can also enter words that are of particular interest in their research. The co-occurrence values in Fig. 7 (on the y-axis) were computed by summing the number of times the target word co-occurred with each word of interest (in this case, from its historical context identified in the contextual structure analysis in Fig. 5) in each consecutive year after the historical reference year.
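The summation described above can be sketched as follows. The yearly counts and word lists are invented, and the data layout (a dictionary of per-year co-occurrence counts for one target word) is an assumption made for the example.

```python
def context_cooccurrence_by_year(cooc_by_year, context_words):
    """For each year, sum the target word's co-occurrence counts over `context_words`."""
    return {
        year: sum(counts.get(word, 0) for word in context_words)
        for year, counts in sorted(cooc_by_year.items())
    }

# Invented yearly co-occurrence counts for the target word "gay".
gay_cooc = {
    1850: {"merry": 30, "colours": 12, "lesbian": 0},
    1950: {"merry": 10, "colours": 4, "lesbian": 1},
    2000: {"merry": 2, "lesbian": 50, "rights": 40},
}
context_1850 = ["merry", "colours"]   # hypothetical historical context words
context_2000 = ["lesbian", "rights"]  # hypothetical modern context words

old_sense = context_cooccurrence_by_year(gay_cooc, context_1850)
new_sense = context_cooccurrence_by_year(gay_cooc, context_2000)
print(old_sense)
print(new_sense)
```

Plotting the two resulting series against year gives one curve per sense, which is how a decline in the older usage and the rise of the newer one can be read off side by side.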

Fig. 7 Co-occurrence frequency between the target word and its context words from 1850 and 2000. The context words were derived from the synchronic contextual structure analysis described earlier (see Fig. 5 for examples). The co-occurrence frequency was computed by summing the number of times the target word co-occurred with each single word in the list of context words.

For instance, gay in 1850 co-occurred with words associated with cheerfulness, bright colors, and fashion (Fig. 5d), and in 2000 it co-occurred with words associated with homosexuality and sexually transmitted diseases (Fig. 5c). The Macroscope can take these two lists of context words and compute their respective co-occurrence frequencies with the target word gay in order to capture how frequently its meaning in 1850 and its meaning in 2000 have been used over the entire corpus (i.e., from 1800 to 2009). Figure 7 (left side) shows that the overall usage frequency of gay can largely be decomposed into two trends, with each corresponding to a different sense of gay. The co-occurrence between gay and its context words in the year 1850 declined quickly after 1900, whereas the co-occurrence between gay and its context words in the year 2000 emerged in the mid-1960s and increased dramatically from the 1980s. The pattern suggests that the old meaning of gay has been largely overwritten by its new, emerging meaning.

Another example is the word option (shown on the right side of Fig. 7). When looking at the contemporary contextual structure of option (Fig. 5e), one can easily see that the word refers to financial instruments: A stock option refers to stock granted by a company to its employees as part of a remuneration package, and a lease option refers to a real estate contract that gives the lessee an option to buy the property. Visual inspection of Figs. 7d and f shows that a lease option probably existed in some form before the 19th century, whereas a stock option was first introduced in the 1920s, and the usage of this sense has continued to grow since the 1980s.

By combining the synchronic contextual structure analysis of words with a diachronic analysis of the co-occurrence frequencies of context words with the target word, the Macroscope provides an accessible quantitative approach to tracking the association strength between a word and its various contextual structures over history, which could be used to investigate the evolution of word meanings or cultural change over time.

Diachronic changes in word sentiment

So far we have demonstrated how the Macroscope can be used to investigate the semantic and contextual structures of words at a specific point in time and across historical time. Below we show how the Macroscope can also be used to examine diachronic changes in word sentiment, and how that information can be used to infer cultural changes due to urbanization and to understand changing social perceptions of risk.

Example 1: Cultural changes due to urbanization

Greenfield (2013) analyzed the changing psychology of culture in the United States as a consequence of urbanization by selecting two lists of words, associated with urban and rural cultural values, respectively, and tracking their usage frequency over time. She found that words signaling urban values have proliferated in the United States over the past century, along with a declining trend among words signaling rural values. The Macroscope not only can track the usage frequencies of these words over time, but also can track the sentiment change of words over time. Here we use the Macroscope to extend Greenfield’s results by analyzing the sentiment of words that co-occurred with the words associated with urban and rural values over historical time.

The results reproduce Greenfield’s analysis (see the left side of Fig. 8), showing that the frequency of give and obliged (rural values; in blue) decreased over time, and the frequency of get and choose (urban values; in orange) increased over time. The Macroscope adds information by showing that the sentiments of get and choose increased at a faster rate than did the sentiments of give and obliged (see the right side of Fig. 8). The increasingly positive sentiment of urban value words complements and extends Greenfield’s argument, because the increasing usage of words such as get and choose does not by itself imply that urban values are viewed positively and are increasingly being adopted by people. To provide a counterexample, if a word is used more frequently but has an increasingly negative sentiment (such as the word gay in the 1980s during the AIDS epidemic), the concept may instead be viewed as dangerous and unfavorable.

Fig. 8 Frequency (left column) and valence (right column) from the Macroscope. The left graphs show the usage frequencies for words associated with urban values (get and choose in orange) and words associated with rural values (give and obliged in blue) over historical time. The right graphs show the change in sentiment for the same words, along with the change in sentiment for happy and death, respectively a high- and a low-valenced word whose sentiment is stable over time.

Example 2: Changing social perceptions of risk

Risk, as defined by the Oxford English Dictionary, is a synonym for danger, hazard, and fear. However, sociologists and anthropologists have argued that risk represents more than just objective dangers or hazards in the real world. Instead, the notion of risk has been used to motivate social regulation and control or has acted as a surrogate for other ideological concerns (Beck, 1992). In this example, we used the Macroscope to examine the relationships between risk and its synonyms over the past 200 years. Our results showed that usage of risk experienced a rapid proliferation after the 1950s, as compared to the stable usage of hazard and the declining usage of danger (Fig. 9, top left). Correspondingly, the contextual sentiments of danger and hazard remained stable over time, whereas the sentiment of risk became increasingly negative (Fig. 9, top right). Output from the Macroscope (Fig. 9, bottom) shows how risk and its synonyms (i.e., danger and hazard) have drifted in semantic space between 1800 and 2000: Danger and hazard have had fairly limited semantic drift as compared to risk, which in the year 2000 was primarily associated with words related to medicine and health.