Significance The independence between sound and meaning is believed to be a crucial property of language: across languages, sequences of different sounds are used to express similar concepts (e.g., Russian “ptitsa,” Swahili “ndege,” and Japanese “tori” all mean “bird”). However, a careful statistical examination of words from nearly two-thirds of the world’s languages reveals that unrelated languages very often use (or avoid) the same sounds for specific referents. For instance, words for tongue tend to have l or u, “round” often appears with r, and “small” with i. These striking similarities call for a reexamination of the fundamental assumption of the arbitrariness of the sign.

Abstract It is widely assumed that one of the fundamental properties of spoken language is the arbitrary relation between sound and meaning. Some exceptions in the form of nonarbitrary associations have been documented in linguistics, cognitive science, and anthropology, but these studies only involved small subsets of the 6,000+ languages spoken in the world today. By analyzing word lists covering nearly two-thirds of the world’s languages, we demonstrate that a considerable proportion of 100 basic vocabulary items carry strong associations with specific kinds of human speech sounds, occurring persistently across continents and linguistic lineages (linguistic families or isolates). Prominently among these relations, we find property words (“small” and i, “full” and p or b) and body part terms (“tongue” and l, “nose” and n). The areal and historical distribution of these associations suggests that they often emerge independently rather than being inherited or borrowed. Our results therefore have important implications for the language sciences, given that nonarbitrary associations have been proposed to play a critical role in the emergence of cross-modal mappings, the acquisition of language, and the evolution of our species’ unique communication system.

Although there is substantial debate in the language sciences over how to best characterize the features of spoken language, there is nonetheless a general consensus that the relationship between sound and meaning is largely arbitrary (1–3). Plenty of exceptions exist, however, within individual languages. For instance, ideophones—a class of words found in many languages—convey a communicative function (or meaning) through the depiction of sensory imagery (4). In the Mel language Kisi Kisi (spoken in Sierra Leone), hábá means “(human) wobbly, clumsy movement,” and hábá-hábá-hábá “(human) prolonged, extreme wobbling”; here, repetition serves as a way to convey the meaning of intensity. More generally, the resemblance between certain aspects of the acoustic basis of speech and their referents, “iconicity,” is the most researched and well-known case of nonarbitrary associations between sound and meaning (5, 6). “Systematicity,” in contrast, refers to (statistical) regularities that are common to a particular set of words, created by historical contingencies and analogical processes (5). For example, word-initial gl- in English evokes the idea of a visual phenomenon (as in glare, glance, glimmer) (7). At a larger scale, there is evidence that the phonological properties of whole morphosyntactic classes of words (like verbs and nouns) are distinct in several languages (8).

The evidence of recurring regularities in sound–meaning mappings across multiple languages is considerably more modest, despite its potential importance for fundamental questions about language evolution and the role of basic perceptual biases in cognition. For example, certain shape–sound associations—known as the bouba-kiki effect (9–11)—are believed to rely on the ability that humans [and perhaps also other primate species (12)] have for associating stimuli across different modalities (13). Other plausible sources of cross-linguistic associations include, for instance, the relationship across many animal species between vocalization frequency and animal size (14), the mimicry of referents via unconscious mouth gesturing (15), and the persistence of vestiges of a conjectured early human language (16).

Experimental studies support the hypothesis that humans are indeed sensitive to such associations. It has been demonstrated several times that participants perform above chance when asked to pair up words with opposite meanings (antonyms) in languages unknown to them (17), and that English speakers might even be able to decide on the concreteness of words from languages to which they have not been exposed (18). However, this evidence for nonarbitrary sound–meaning associations pertains only to narrow pockets of the vocabulary, making it unclear whether a more general pressure toward arbitrariness may overpower such potential biases when considering a more semantically diverse selection of the vocabulary (2, 19).

A further issue with current studies of nonarbitrariness in sound–meaning correspondences is that, save for a single exception (20), cross-linguistic corpus studies of nonarbitrary associations have tended to rely on a small number of languages (maximally 200) and to focus on small, semantically restricted sets of words, ranging from phonation-related organs (21) to South American animals (15), spatial orientation (demonstratives) (14, 22), repair initiators (like huh? in English) (23), and the conceptualization of magnitude in Australian languages (24). These studies involve confirmatory analyses, aiming to test specific hypotheses regarding sound–meaning correspondences; as a consequence, they are guided by a priori intuitions or indirectly by findings from other disciplines. These limitations may help explain, at least in part, why language scientists typically consider nonarbitrary associations to be marginal phenomena that may only apply to small, strictly circumscribed regions of the vocabulary (3). In this paper, we therefore conduct a comprehensive set of analyses involving a semantically diverse set of words from close to two-thirds of the world’s languages.

Testing Associations on a Global Scale The availability of a large collection of word lists allows us to search for statistically robust associations in an unsupervised, theory-neutral manner. The data consist of 28–40 lexical items from 6,452 word lists, with a subset of 328 word lists having up to 100 items (25). Words are transcribed into a phonologically simplified system consisting of 34 consonants and 7 vowels, which we refer to collectively as “symbols” (Table S1). These words belong to what is often referred to as “basic vocabulary,” including for instance pronouns, body part terms, property words, motion verbs, and nouns describing natural phenomena (26). The word lists include both languages and dialects, spanning 62% of the world’s languages and about 85% of its lineages (Fig. 1). A lineage is a maximal set of languages that can be shown to have a common ancestor. Such a set may have only one member (an isolate) or multiple members (a family).

Fig. 1. Geographic distribution of the 6,452 word lists from the ASJP database (25). Colors distinguish different linguistic macroareas, regions with relatively little or no contact between them (but with much internal contact between their populations). These are North America (orange), South America (dark green), Eurasia (blue), Africa (green), Papua New Guinea and the Pacific Islands (red), and Australia (fuchsia).

Table S1. ASJP symbols and their description

Regarding the classification of languages, the Glottolog genealogical classification is preferable over other available alternatives because it is the only one to classify every living or extinct language while providing brief pointers to justifications for all choices taken—however, a less conservative independent classification was used additionally in the main test (see below).
We stratify languages geographically by dividing the world’s landmass into six largely independent linguistic macroareas: North America, South America, Eurasia, Africa, Greater New Guinea, and Australia—these regions have a history of attested contact within them but little contact between them in prehistorical times (27). To guarantee that only truly global associations were selected, we screened the sound–meaning associations, keeping only those where the concept and symbol were attested in languages from at least 10 different lineages and found in no less than three different macroareas. We aim to capture robust and widespread tendencies in sound–meaning associations, where “tendency” should be understood as a systematic bias in the frequency with which certain words tend to carry specific symbols in contrast to their baseline occurrence in other words. Crucially, a strong tendency does not imply that a signal has an extremely high frequency of occurrence, and conversely a very frequent sound–meaning co-occurrence is not sufficient evidence to discount chance. Importantly, whatever advantage a sound–meaning pairing might confer in terms of learning or processing, it has to be considered in the context of a myriad of competing factors that shape the phonetic and phonological fabric of words, from articulatory production costs (28) to systemic constraints due to the similarity with other lexical elements (29). Our statistical approach consists in a series of tests where the presence of a symbol in a word is contrasted against a suitable subset of other words, and then the bias is evaluated across lineages. To begin, we calculate, for each concept and symbol, a genealogically balanced average ratio of the times they co-occur in a word of a language for which both symbol and concept are attested. We simulated the same quantity based on the rest of the concepts and compared it with the previously computed quantity (Materials and Methods). 
The associated P value roughly estimates the chance of finding the same or more extreme (genealogically balanced) average by picking any word other than the target one. Notice that this includes both recurring sound–meaning pairings and their complement, sound–meaning associations that are observed less often than expected given our null model. Crucially, a sequence of tests needs to be applied to ensure that potential associations are not statistical artifacts (see Materials and Methods and SI Materials and Methods for more details). First, we used two independent worldwide language classifications with contrasting degrees of conservativeness (30, 31). Second, we controlled the false discovery rate at a 5% expected level of false positives (for both classifications independently) so as to avoid an inflated number of associations due to multiple comparisons. Third, word length is trivially correlated with the chance of finding any particular symbol. There is considerable variance in the (genealogically balanced) length of the words in our dataset, with some pronouns, negation, and basic verbs (like say and give) consisting only of about three symbols on average, whereas some color words and body part terms contain over five (Fig. S1). We filter out associations that also emerge when all of the symbols of all of the words of each language are randomly permuted while keeping word lengths fixed.

Fig. S1. On the Left, genealogically balanced average of the number of characters for each of the 40 concepts with most coverage in ASJP. The horizontal bars represent approximate 95% CI for the average. On the Right, distribution of the genealogically balanced average for all of the concepts in ASJP. In both graphs, the vertical blue bar represents the mean value across all concepts in ASJP.

Fourth, besides the mere number of symbols, word length might be a confound due to the fact that different phonotactic restrictions might apply accordingly.
For instance, in a language that only allows consonant–vowel structures and also prohibits the presence of word-initial liquids, no monosyllabic words will carry liquids. To remedy this, we performed a test similar to the first one described but this time comparing words only with the length-matched equivalents of different concepts. Finally, to filter out associations due to areal contact or unresolved genealogy, we looked for associations that could be detected within the linguistic macroareas independently. Thus, we restricted our attention to associations that passed all these statistical controls and for which a bias consistent with the worldwide trend could be found in at least three macroareas, with no single area showing a bias in the opposite direction. It should be noted that the overall testing scheme is conservative and that it is likely to have a large false-negative rate. Also working against our analyses is the fact that the core set of concepts we use was originally gathered due to their exceptional phylogenetic persistence and resistance to borrowing, thus rendering them less likely to be adapted to potential functional biases that might underlie specific sound–meaning associations. Moreover, it is not clear a priori whether the granularity of our phonetic descriptions is sufficiently fine to capture widespread sound–meaning relations—for instance, the oppositions between voiced and unvoiced consonants and between rounded and unrounded vowels have been suggested to bear importance for sound symbolism (22, 32), but each feature pair is usually conflated under a single symbol in the database. For these reasons, the associations found in our analyses should be regarded as providing a lower-bound estimate of the presence of nonarbitrariness in sound–meaning pairings.
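The third control (random permutation of symbols with word lengths held fixed) can be illustrated with a minimal Python sketch. It assumes, for simplicity, that each character of a transcribed word is exactly one symbol; the function name `permute_symbols` and the toy word list are hypothetical and are not taken from the study's own code.

```python
import random


def permute_symbols(words, rng):
    """Shuffle all symbols of one language's word list across its words.

    `words` maps each concept to its transcribed word (a string in which
    every character is assumed to be one symbol). The returned mapping has
    the same concepts and the same word lengths, but the symbols have been
    redistributed at random over the whole list.
    """
    # Pool every symbol of every word of the language, then shuffle the pool.
    pool = [symbol for word in words.values() for symbol in word]
    rng.shuffle(pool)

    # Deal the shuffled symbols back out, preserving each word's length.
    permuted = {}
    start = 0
    for concept, word in words.items():
        permuted[concept] = "".join(pool[start:start + len(word)])
        start += len(word)
    return permuted


# Toy word list (invented forms, for illustration only).
toy_language = {"tongue": "lela", "nose": "nu", "small": "ici"}
shuffled = permute_symbols(toy_language, random.Random(0))
```

Repeating this shuffle many times for every language yields a reference distribution in which any remaining symbol–concept association can only be due to word length, which is the kind of association the filter is meant to discard.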

SI Materials and Methods Positional Test. We simulate, for each language and signal, random positions of the relevant signal-associated symbol based on all of the available positions in the word according to the consonant/vowel distinction. Concretely, we calculate the number of times the phone is initial when its simulated counterpart is not, averaging genealogically and respecting the vowel and consonant template of each word. Then we compare this quantity in the original word list against n = 1,000 simulations and consider those cases in which the original bias is larger than 95% of the simulated cases. These results can be observed in Table S6.

Areal and Population Test. For each positive signal we calculated the great circle distances—i.e., the distance in kilometers of the shortest geodesic connecting two points on the surface of the Earth—involving all languages having both the relevant symbol and concept (but not necessarily the signal) and their nearest language from a different lineage that has the (positive) signal (dnn). The hypothesis is that a small distance from a language that has a signal will influence the likelihood of signal presence in a given language. Only signals belonging to the group of 28–40 better attested concepts were used for the analysis, and only one dialect per language was chosen. Extinct languages were excluded from the analyses. For the testing, we used a generalized logistic model with random effects

$$\operatorname{logit}\left(E[\text{signal presence}]\right) = \alpha + \left(\beta_{\mathrm{dnn}} + \beta_{\mathrm{dnn}}^{\mathrm{lineage}}\right)\log(1 + \mathrm{dnn}) + \beta_{\mathrm{pop}}\log(\mathrm{population}) + \alpha^{\mathrm{lineage}},$$

where the superscripted coefficients ($\beta_{\mathrm{dnn}}^{\mathrm{lineage}}$ and $\alpha^{\mathrm{lineage}}$) are random effects structured according to the lineage. Lineage as a random intercept is introduced as a means of accounting for the varying baseline presence of the signals within lineages, and its inclusion as a random slope aims to capture the fact that lineages have spread with different rates across the globe.
The logarithmic transforms aim to reduce the effect of population and distance outliers. P values were estimated through an asymptotic likelihood ratio test. Apart from the estimated coefficients, we calculated the genealogically balanced mean difference in probability of having a signal for two reference points, one variable at a time. For population, the difference was calculated between fixing all languages’ populations to 10,000 individuals and a single individual, and for dnn between 1,000 km—which is roughly the maximum radius of linguistic areas as defined in AUTOTYP (56)—and 0 km (which corresponds to the situation where both languages are spoken at the same place). The results can be observed in Table S5.

Word Similarity Test. Ideally, a proper phylogenetic test in the context of language history would comprise some kind of data carrying a phylogenetic signal (like cognate sets or collections of regular sound changes) and a sound evolutionary model that would lead to a tree or a distribution of trees. Unfortunately, such trees exist for only a handful of language families (57, 58). Instead, we approach the question of both phylogenetic stability and ancestry of signals by analyzing word form similarity, which serves as a proxy for cognacy. If it is a correct hypothesis that signals render words less prone to change and that they are prehistoric vestiges, then, after controlling for concept, symbol, and lineage, we would expect to find that the similarity among words is predicted by signals. The distance between words used here is the Levenshtein distance, which has found several uses in linguistics and often correlates with perceptual, processing, and other meaningful lexical differences (59, 60). The Levenshtein distance between strings x and y, LD(x, y), is defined as the minimum number of substitutions, additions, or deletions of characters necessary to make two strings identical.
For instance, “Zultus” and “sulus”—star in Uyghur and Sakha (two Turkic languages), respectively—have a Levenshtein distance of 2: a change of “Z” to “s” and the deletion of “t” in the Sakha word. The normalized Levenshtein distance is simply $l = LD(x, y) / \max(|x|, |y|)$. For every family with at least six languages and every combination of concept and symbol, we calculated the Levenshtein distance between all members of two groups: word pairs for a concept belonging to a combination, and word pairs for a concept sharing at least one symbol but not the symbol relevant for the combination. For instance, given a family with three languages having the forms ana, ena, and ete for the concept “rock,” and considering the combination rock-n, we will have the two following groups: (ana,ena) and (ena,ete). Families with fewer than three distances in any of the groups were excluded from the analysis. To summarize the previous information, we calculated, for each family, the probability of choosing a distance in the signal-sharing group and another in the non–signal-sharing group and finding that the first is smaller than the second [$\Pr(l_s < l_{-s})$]. The larger this quantity, the more reliable an estimator of word form similarity the association is. Then we implemented the following β regression mixed model with logistic link function and constant precision parameter

$$\operatorname{logit}\left(E\left[\Pr(l_s < l_{-s})\right]\right) = \sum_{i \in \text{concepts}} \beta_i I_i + \sum_{j \in \text{symbols}} \beta_j I_j + \alpha_{\text{signalhood}} + \alpha^{\text{lineage}},$$

where the i and j indexes run over the set of concepts and symbols, respectively; the coefficient “signalhood” indicates whether the combination of concept and symbol is to be found in Table S2. “Signalhood” was coded as a single level common to all individual positive signals. $\alpha^{\text{lineage}}$ stands for a random intercept according to lineage.
To cope with a few values of $\Pr(l_s < l_{-s})$ identical to 1 (that account for less than 0.5% of the data), we applied the transformation $t(x) = (x(N - 1) + 0.5)/N$ to the values (61). As a way of accounting for the more robust evidence provided by lineages with a large number of distance pairs to be compared, we included a weight for each observation equal to the logarithm of the number of such pairs involved—however, the results did not differ considerably from the unweighted case. Overall, the model quality is heavily dominated by lineage: 86% vs. 3% of explained deviance with and without the lineage random effect, respectively.
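The quantities used in the word similarity test can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used for the analyses: the function names are hypothetical, each character is assumed to be one symbol, and N in the transformation is assumed to be the number of observations entering the regression.

```python
def levenshtein(x, y):
    """Minimum number of substitutions, additions, or deletions turning x into y."""
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,               # delete cx
                current[j - 1] + 1,            # add cy
                previous[j - 1] + (cx != cy),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def normalized_levenshtein(x, y):
    """l = LD(x, y) / max(|x|, |y|); defined as 0 for two empty strings."""
    longest = max(len(x), len(y))
    return levenshtein(x, y) / longest if longest else 0.0


def prob_signal_closer(signal_distances, other_distances):
    """Estimate Pr(l_s < l_-s) over all cross-group pairs of distances."""
    pairs = [(s, o) for s in signal_distances for o in other_distances]
    return sum(s < o for s, o in pairs) / len(pairs)


def squeeze(x, n):
    """t(x) = (x(N - 1) + 0.5) / N, which moves values of exactly 1 below 1."""
    return (x * (n - 1) + 0.5) / n


# The Uyghur/Sakha example from the text: one substitution and one deletion.
star_distance = levenshtein("Zultus", "sulus")

# The rock-n example: (ana, ena) share n; (ena, ete) share e but not n.
with_signal = [normalized_levenshtein("ana", "ena")]
without_signal = [normalized_levenshtein("ena", "ete")]
share = prob_signal_closer(with_signal, without_signal)
```

The toy rock-n family has a single distance per group, so it would be excluded by the three-distance minimum described above; it is used here only to show how the two groups feed into the probability.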

Strong Worldwide Associations Our analysis detected 74 (positive and negative) sound–meaning associations, involving 30 concepts and 23 symbols. All of these associations are referred to as “signals” (Table 1; more detail is provided in Tables S2 and S3).

Table 1. Summary of signals found in the ASJP database

Table S2. Complete list of positive signals found in the ASJP database

Table S3. Complete list of negative signals found in the ASJP database

Signals will be described in terms of the most relevant information about them: the frequency of the symbol in the words corresponding to the concept (p), the ratio between that frequency and the frequency in other words (RR), the number of lineages that were analyzed for the global association ($n_l$), and the ratio between the number of areas where the association was independently found and the total number of tested areas ($a_s/a_t$). Some concepts are associated with more than one signal. These are expected to be correlated; across languages, it is often observed that there are preferences or restrictions with regard to the co-occurrence of symbols within one and the same word for either diachronic or synchronic phonotactic reasons. As an example, it is known that high front vowels trigger palatalization (33), so it is not surprising that the voiceless palato-alveolar affricate C appears with i in the signals of small. In a set of testable pairs of signals (Materials and Methods), signals sharing a concept tend to be significantly associated about 41% of the time, against only 8% of signals involving different concepts (Table S4).

Table S4. Dependencies between signals involving the same concept

The signals found in our analysis show a mixture of well-known and new associations.
In line with the considerable literature on magnitude sound symbolism, the concept small was found to be associated with the high front vowel i (RR = 1.58, p = 0.61, $n_l$ = 78, $a_s/a_t$ = 3/5), consistent with findings linking vowel height quality and size (14, 17), and with the palatal consonant C (RR = 5.12, p = 0.41, $n_l$ = 61, $a_s/a_t$ = 3/4), also in agreement with previous work (14, 24). We also observed a strong association between round and r sounds (RR = 2.48, p = 0.37, $n_l$ = 56, $a_s/a_t$ = 4/5). Although most recent research has emphasized the role of consonants in shape–sound meaning associations like this (34, 35), the usual hypothesis in this direction concerned the correlation between vowel roundedness and round objects (11), an association that appears as a tendency in our analyses without reaching the minimum statistical threshold established before. Both small and round have been linked to the phenomenon of cross-modal mapping (10, 13, 36). Another property word, full, is endowed with a pair of signals involving voiced (RR = 1.91, p = 0.22, $n_l$ = 213, $a_s/a_t$ = 4/6) and unvoiced bilabial stops (RR = 2.11, p = 0.13, $n_l$ = 231, $a_s/a_t$ = 5/6). Some of the strongest signals found correspond to body parts. Tongue was very strongly associated with the lateral “l” (RR = 2.77, p = 0.41, $n_l$ = 280, $a_s/a_t$ = 6/6) and the mid and low front vowels e (RR = 1.54, p = 0.11, $n_l$ = 322, $a_s/a_t$ = 5/6) and E (RR = 1.73, p = 0.11, $n_l$ = 164, $a_s/a_t$ = 4/6). Nose was found to be associated most strongly with the alveolar nasal n (RR = 1.47, p = 0.35, $n_l$ = 334, $a_s/a_t$ = 4/6) and the high back vowel u (RR = 1.38, p = 0.35, $n_l$ = 325, $a_s/a_t$ = 4/6). The link between nose and nasality has been noted previously (37), in particular in reference to the conjecture that terms for body parts used in phonation make use of the distinctive qualities provided by the relevant organ (21).
Breasts was associated with the bilabial nasal consonant m (RR = 1.63, p = 0.32, $n_l$ = 320, $a_s/a_t$ = 4/6) and the high back vowel u (RR = 1.46, p = 0.37, $n_l$ = 317, $a_s/a_t$ = 4/6). Similar associations were found in the nursery terms for mother, a concept with which it often colexifies. It has been suggested that this might be due to the mouth configuration of suckling babies or to the sounds feeding babies produce (38, 39). Although this study lends support to a number of associations that were either elicited in experiments or conjectured based on a much smaller number of languages, it also provides telling negative evidence on others. Together with the association between high front vowels and the concept of small, there have been reports of a connection between back low vowels and the notion of big (22). However, big ($n_l$ = 73) and large ($n_l$ = 74) did not show any relevant signature of association with o in our sample at the global level. Similarly, an analogous front/back vowel opposition has been proposed to hold between proximal and distal pronouns—the purported explanation being that proximal referents tend to be small, whereas distal referents are usually large (22). The concepts this ($n_l$ = 71) and that ($n_l$ = 74), however, do not show any associations with i and o (respectively).

Origins and Nature of the Associations As discussed in the previous sections, there are multiple theories that attempt to elucidate why humans find that some sounds are more convenient or salient in association with certain meanings. How these hypothesized mechanisms lead to the widespread biases in vocabularies we find here is a complex question that is unlikely to be fully answered by the inspection of wordlists. Nonetheless, we can attempt to evaluate some of the potential consequences of those theories given the coarse level of detail of our data. Functional advantages might increase the likelihood of signals being borrowed across languages in contact with one another, thus producing spatial diffusion patterns (39) (Fig. 2). The existence of opposing factors obscures definitive inferences in this direction, however: basic vocabulary items are particularly resistant to borrowing, but unresolved genealogy involving nearby languages would be confounded with borrowing. In the same direction, large populations have been claimed to be more efficient at gaining and retaining nonarbitrary sound–meaning associations given a potential functional value (39), which is coherent with recent evidence from some Austronesian languages showing that larger populations gain new words at a faster rate (40).

Fig. 2. Competing configurations of the spatial distribution of the tested languages. Blue and fuchsia dots represent languages with and without a specific signal, respectively. In the panel to the Left, the likelihood of a language having the signal is correlated with its geographical distance to its nearest neighbor, and on the Right, there is no spatial structure.

We determined whether present-day log population size and log distance to the nearest genealogically unrelated language bearing the (positive) signal are effective predictors for signal presence, via a mixed-effects logistic model (Table S5 and SI Materials and Methods).
At α = 0.05, log population turned out to be significant in about one-third of the cases, but the effect was small and as many times positive as it was negative, which rules out a consistent role for population. Only one-fifth of the signals showed sensitivity to the distance of nearest neighbors with signal, with all of the cases having an effect in the direction predicted by our model. On average, and in contrast to the case in which a language and its signal-bearing nearest genealogically unrelated neighbor are spoken in exactly the same place, the probability of finding the signal also in the language drops by 28%.

Table S5. Spatial and population analysis

From a historical perspective, it has been suggested that sound–meaning associations might be evolutionarily preserved features of spoken language (41), potentially hindering regular sound change (17). Furthermore, it has been claimed that widespread sound–meaning associations might be vestiges of one or more large-scale prehistoric protolanguages (16). Tellingly, some of the signals found here feature prominently in reconstructed “global etymologies” (42, 43) that have been used for deep phylogeny inference (44). If signals are inherited from an ancestral language spoken in remote prehistory, we might expect them to be distributed similarly to inherited, cognate words; that is, their distribution should to a large extent be congruent with the nodes defining their linguistic phylogeny (see Fig. 3 for illustration).

Fig. 3. Genealogical trees of languages where leaves are words for specific referents. In the figure to the Left, cognate classes (depicted as different shapes) are associated with signal presence (blue shapes), whereas to the Right there is no such correspondence.

A direct evaluation of this hypothesis is infeasible due to the absence of etymological dictionaries for all but a few families.
However, it can be tested indirectly given that cognate words are expected to be more similar to one another than noncognates (45). We investigated whether the presence of the signal-bearing symbol was a better indicator of overall form similarity between words than other shared symbols, using a β mixed-regression model that distinguishes the effects of symbols, concept, and lineage (SI Materials and Methods). The model is heavily dominated by the effect of lineage, and signal presence (although significant) has a negligible effect in the direction opposite to the one predicted: the genealogically balanced average effect is less than a 0.5% decrease in similarity for those words sharing a signal-related symbol compared with those sharing some other symbol. Consistency in word position is important for establishing cognacy (45, 46). Further support for the idea that signals are not residuals of deep history comes from the analysis of the position within the word in which they occur, in particular whether they have a clear word-initial bias. All in all, we find that signals do not have a consistent cross-linguistic preference or dispreference in this respect beyond well-established cross-linguistic phonotactic patterns, such as the avoidance of liquids or the prevalence of dorsal and labial stops in word-initial position (47, 48) (SI Materials and Methods and Table S6).

Table S6. Analysis of word-initial position bias

These results suggest that, although it is possible that the presence of signals in some families is symptomatic of a particularly pervasive cognate set, this is not the usual case. Hence, the explanation for the observed prevalence of sound–meaning associations across the world has to be found elsewhere (49).

Conclusion We have demonstrated that a substantial proportion of words in the basic vocabulary are biased to carry or to avoid specific sound segments, both across continents and linguistic lineages. Given that our analyses suggest that phylogenetic persistence or areal dispersal are unlikely to explain the widespread presence of these signals, we are left with the alternative that the signals are due to factors common to our species, such as sound symbolism, iconicity, communicative pressures, or synesthesia. We expect future research to further elucidate the role and interaction of these factors in driving the observed sound–meaning association biases, and to extend the scope of our findings to a broader portion of the vocabulary. The outcome of our analyses has consequences for historical-comparative linguistics, where it has been suggested that there is a small set of ultraconserved words that are particularly useful for establishing ancient genealogical relations beyond the limits of the comparative method (44). However, some of these words are involved in the signals discovered here: we is associated with the alveolar nasal, hear with the velar nasal, and ash with the vowel u. Thus, proposals of far-reaching etymologies based on words of similar form and meaning should be accompanied by an evaluation of whether the observed lexical similarities might have resulted from the kinds of signal discussed in this paper rather than common inheritance. More generally, even though it is unclear whether the locus of the emergence of signals is in the invention or historical development of lexical roots, our findings have implications for the study of the dynamics of lexical phonology. In summary, our results provide insights into the constraints that affect how we communicate, suggesting that despite the immense flexibility of the world’s languages, some sound–meaning associations are preferred by culturally, historically, and geographically diverse human groups.

Materials and Methods Basic Vocabulary Word Lists. The dataset used for this study is drawn from version 16 of the Automated Similarity Judgment Program (ASJP) database (25). ASJP comprises 6,895 word lists from around 62% of the world’s languages, covering 85% of families, isolates, and unclassified languages [using the Ethnologue (50) for these statistics]. After removing artificial languages, pidgins, and creoles, and varieties whose ISO-639-3 code cannot be confirmed, the number goes down to 6,447 word lists, corresponding to 4,298 different languages and 359 lineages. The database was not constructed for the specific purpose of studying sound symbolism, but rather for identifying genealogical relations among languages. For this reason, it generally consists of the 40-item subset of the 100-item so-called Swadesh list (51), whose items are assumed to remain stable as languages diverge into different lineages over time (52). Of these word lists, 328 additionally contain the remaining 60 Swadesh list items. Words are rendered in a unified transcription system, which facilitates cross-linguistic comparison but also ignores phonetic details such as vowel length, nasalization, tones, and retroflexion. Vowel quality distinctions are merged into seven categories (high front, mid front, low front, high-mid central, low central, high back, and mid-low back) (see ref. 53 for a discussion of the system). Each 40-item word list provides translational equivalents, when available, for the following items: blood, bone, breast, come, die, dog, drink, ear, eye, fire, fish, full, hand, hear, horn, I, knee, leaf, liver, louse, mountain, name, new, night, nose, one, path, person, see, skin, star, stone, sun, tongue, tooth, tree, two, water, we, and you (sg).
The additional Swadesh list items contained in some of the word lists are as follows: all, ash, bark, belly, big, bird, bite, black, burn, claw, cloud, cold, dry, earth, eat, egg, feather, flesh, fly, foot, give, good, grease, green, hair, head, heart, hot, kill, know, lie, long, man, many, moon, mouth, neck, not, rain, red, root, round, sand, say, seed, sit, sleep, small, smoke, stand, swim, tail, that, this, walk, what, white, who, woman, and yellow.

Associations Between Symbols and Concepts. The fundamental statistic in our analysis is $p_{ij}$, the maximum-likelihood estimator (i.e., the sample frequency) for the probability of finding that concept i has at least one instance of symbol j, after randomly choosing a lineage, a language within the lineage, and a dialect within the language (if any), in that sequential order. Naturally, this calculation is restricted to the set of dialects of languages for which the concept and the symbol are attested (which we will refer to as $S_{ij}$); for each of those sets, this quantity is formally

$$p_{ij} = \frac{1}{|L|} \sum_{k=1}^{|L|} \left( \frac{1}{|L_k|} \sum_{l=1}^{|L_k|} \frac{1}{|L_{kl}|} \sum_{d=1}^{|L_{kl}|} \pi_{ijkld} \right).$$

The sets $L$, $L_k$, and $L_{kl}$ are, respectively, the sets of all lineages, of languages within lineage k, and of dialects of language l within lineage k. $\pi_{ijkld}$ is a binary variable that takes value 1 if there is at least one instance of symbol j in the word for concept i for dialect d of language l from lineage k (always within the set $S_{ij}$) and 0 otherwise. This computation is conservative in that all languages known to belong to the same genealogical group influence the aggregated statistics in the same way regardless of the group’s size, but on the other hand it guarantees the minimum possible bias arising from the dependence among the languages’ words. To avoid testing cases whose coverage is insufficiently wide, we evaluated only those associations for which $S_{ij}$ comprises at least 10 lineages in each of at least three different macroareas.
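The nested averaging behind the genealogically balanced frequency can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `balanced_frequency`, the nested-dictionary layout, and the toy presence values are all hypothetical.

```python
def balanced_frequency(presence):
    """Genealogically balanced frequency of a symbol for one concept.

    `presence` maps lineage -> language -> list of 0/1 values, one per
    dialect (1 if the word for the concept contains the symbol).
    This layout is an assumption made for illustration only.
    """
    lineage_means = []
    for languages in presence.values():
        # Average over dialects within each language first ...
        language_means = [sum(d) / len(d) for d in languages.values()]
        # ... then over languages within the lineage.
        lineage_means.append(sum(language_means) / len(language_means))
    # Finally average over lineages, so each lineage counts once.
    return sum(lineage_means) / len(lineage_means)


# Invented toy data: two lineages, three languages, four dialects.
toy_presence = {
    "lineage_A": {"lang_1": [1, 0], "lang_2": [1]},
    "lineage_B": {"lang_3": [0]},
}
p_ij = balanced_frequency(toy_presence)
print(p_ij)  # 0.375
```

In the toy data, lineage_A contributes (0.5 + 1) / 2 = 0.75 and lineage_B contributes 0, so the balanced value is 0.375, whereas a raw average over the four dialects would give 0.5. The difference shows how a well-documented lineage is prevented from dominating the estimate.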
Conversely, for each dialect of each language, we calculated the proportion of words other than that associated with i that have symbol j, and we denote this as $\pi_{-ijkld}$, and similarly the genealogically balanced average as $p_{-ij}$. These probabilities are used to produce $n_{\mathrm{sim}} = 1{,}000$ Monte Carlo simulations of symbol j presence/absence for all of the languages in $S_{ij}$; the set of $p_{-ij}$ values resulting from these simulations will be called $\zeta_{ij}$. The purpose is to compare $\zeta_{ij}$ with $p_{ij}$ to answer the question: does symbol j appear much more (or much less) often when a subset of words referring to concept i is selected than in a randomly picked set of words from the same languages? The two-tailed P value for a particular concept i and symbol j is then as follows (54):

$$P = \frac{1}{n_{\mathrm{sim}} + 1} \left( 2 \min \left\{ \left| \{ x \in \zeta_{ij} : x \geq p_{ij} \} \right| , \left| \{ x \in \zeta_{ij} : x \leq p_{ij} \} \right| \right\} + 1 \right),$$

where $| \cdot |$ is the cardinality of the set. The large number of tests performed requires a control for type I errors. We perform a false discovery rate (FDR) analysis fixing the FDR rejection threshold to 0.05, which means that we will allow no more than 5% of false positives on average. For this purpose, we use the method described in ref. 55. The basic idea is that the distribution of P values comes from a mixture of a uniform distribution (that corresponds to the baseline of tests where no associations beyond chance are present) and a distribution of true positives concentrated near P = 0. The method used here learns the mixture proportion of the uniform distribution from values of P from 1 down to a threshold that is adjusted to reduce the false nondiscovery rate. This entire procedure was repeated with a different, less conservative, genealogical classification, namely the one provided by the World Atlas of Language Structures (WALS) (30). For our analysis, we only considered associations that were below the defined FDR level according to both classifications.
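The Monte Carlo test described above can be illustrated with a short Python sketch. It assumes that each simulation draws symbol presence per dialect as an independent Bernoulli trial with the dialect's background probability, which is one reading of the text. The function names, the data layout, and the toy probabilities are hypothetical, and the FDR step of ref. 55 is not reproduced here.

```python
import random


def balanced_average(values):
    """Average dialects within languages, languages within lineages, then lineages."""
    lineage_means = []
    for languages in values.values():
        language_means = [sum(d) / len(d) for d in languages.values()]
        lineage_means.append(sum(language_means) / len(language_means))
    return sum(lineage_means) / len(lineage_means)


def simulate_null(background, n_sim=1000, seed=0):
    """Return n_sim simulated balanced frequencies (the set called zeta in the text).

    `background` maps lineage -> language -> list of per-dialect probabilities
    that a word other than the target concept contains the symbol.
    Assumption: presence is drawn as an independent Bernoulli trial per dialect.
    """
    rng = random.Random(seed)
    zeta = []
    for _ in range(n_sim):
        draw = {
            lineage: {
                language: [1 if rng.random() < q else 0 for q in dialects]
                for language, dialects in languages.items()
            }
            for lineage, languages in background.items()
        }
        zeta.append(balanced_average(draw))
    return zeta


def two_tailed_p(p_ij, zeta):
    """Two-tailed Monte Carlo P value, following the formula in the text."""
    at_least = sum(1 for x in zeta if x >= p_ij)
    at_most = sum(1 for x in zeta if x <= p_ij)
    return (2 * min(at_least, at_most) + 1) / (len(zeta) + 1)


# Invented toy background probabilities, for illustration only.
toy_background = {
    "lineage_A": {"lang_1": [0.2, 0.3], "lang_2": [0.1]},
    "lineage_B": {"lang_3": [0.25]},
}
zeta = simulate_null(toy_background)
p_value = two_tailed_p(0.9, zeta)
```

The added 1 in the numerator and denominator keeps the P value strictly positive, so the smallest attainable value with 1,000 simulations is 1/1,001.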
The estimated proportion of true negatives (the uniform component of the mixture) learned from both classifications was around 0.65. Regarding possible confounds due to word length, we performed two extra tests on those associations that successfully passed the previous test. First, we repeated the same global test using the Glottolog classification, this time comparing $p_{ij}$ with simulations obtained from words with exactly the same number of symbols in each language (and dialect). Second, for each language (and dialect) in $S_{ij}$, in each of n = 1,000 independent simulations, we sampled without replacement random symbols from words other than i, as many as the length of word i. This effectively produces, for each word i, a random counterpart equivalent to shuffling all of the symbols corresponding to all of the words of a language while keeping word lengths constant. Over each of those sets, the same association test based on the Glottolog classification was performed. In both of these procedures, we imposed a stricter cutoff: if any of the simulations yielded a value as extreme as or more extreme than the observed $p_{ij}$, we no longer regarded the association as being of potential interest. Finally, for each macroarea with at least 10 independent lineages in $S_{ij}$, we analyzed the presence of a significant direction of association as in the main associations test (computing both empirical and random probabilities using only the languages of that area), with the difference that we flagged each macroarea-specific association with P ≤ 0.1. It should be noted that this does not imply a softer rejection threshold than in the worldwide case: we only keep associations that display a bias consistent with the worldwide trend in at least one-half of the macroareas, with the extra condition that no macroarea should exhibit a bias in the opposite direction.
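The second word-length control, in which a random counterpart of a word is built from the symbols of the other words of the same language, can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' code: the function name `random_counterpart`, the dictionary layout, and the toy word list (invented symbol sequences, not real ASJP entries) are assumptions.

```python
import random


def random_counterpart(words, concept, rng):
    """Draw a random stand-in for the word expressing `concept`.

    `words` maps concept -> list of transcription symbols for one language
    (or dialect). Symbols are sampled without replacement from all other
    words, and the result has the same length as the original word.
    """
    pool = [s for c, w in words.items() if c != concept for s in w]
    # Guard for tiny word lists: never request more symbols than the pool holds.
    k = min(len(words[concept]), len(pool))
    return rng.sample(pool, k)


# Invented toy word list, for illustration only.
toy_words = {
    "tongue": ["l", "a", "l", "u"],
    "nose": ["n", "a", "s"],
    "small": ["k", "i", "t", "i"],
    "stone": ["p", "e", "t", "o"],
}
rng = random.Random(0)
counterpart = random_counterpart(toy_words, "tongue", rng)
has_symbol = int("l" in counterpart)  # presence of the symbol in the stand-in
```

Repeating the draw many times per language and feeding the resulting presence values into the balanced average would give a length-matched null distribution to compare against the observed frequency.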
To summarize: only associations that successfully satisfied all of the requirements of the overall association test (with Glottolog and WALS classifications independently), the word-length and the matched-length tests, and for which a consistent preference in at least one-half of the macroareas could be found were considered “signals.”

Association Between Signals. As in the previous case, we analyze sets of languages for which both the concept and the symbol associated with a pair of signals were present in at least 10 lineages in each of (at least) three macroareas. The association between signals, which we will refer to as A and B here, was tested by means of a simple mixed-effects logistic model as follows:

$$\operatorname{logit}(\text{signal A presence}) = \alpha_{\text{signal B presence}} + \alpha_{\text{lineage}},$$

where $\alpha_{\text{signal B presence}}$ is the coefficient related to the presence of signal B, and $\alpha_{\text{lineage}}$ is a random coefficient structured according to lineage. To the results obtained by comparing all of the pairwise associations between signals belonging to the core 40 words, we applied a threshold on the FDR of 5%. About 12% of the 2,062 cases satisfied this condition. The results of associations regarding same-concept signals and the genealogically balanced average effect of the presence of signal B on A can be found in Table S4.

Acknowledgments We acknowledge the comments of Bernard Comrie, Brent O. Berlin, Stephen C. Levinson, Mark Dingemanse, Russell Gray, and Eric W. Holman. We also thank Jeremy Collins and Stefany Moreno for assistance with other aspects of the manuscript. H.H.’s research was made possible thanks to the financial support of the Language and Cognition Department at the Max Planck Institute for Psycholinguistics, Max-Planck Gesellschaft, and European Research Council’s Advanced Grant 269484 (“INTERACT”) to Stephen C. Levinson. S.W.’s research was supported by the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013)/European Research Council Grant Agreement 295918 and by a subsidy of the Russian Government to support the Program of Competitive Development of Kazan Federal University. D.E.B.’s research was funded by the Max Planck Institute International Research School.