To identify structural patterns of musical discourse we first need to build a ‘vocabulary’ of musical elements (Fig. 1). To do so, we encode the dataset descriptions by a discretization of their values, yielding what we call music codewords20 (see Supplementary Information, SI). In the case of pitch, the descriptions of each song are additionally transposed to an equivalent main tonality, such that all of them are automatically considered within the same tonal context or key. Next, to quantify long-term variations of a vocabulary, we need to obtain samples of it at different periods of time. For that we perform a Monte Carlo sampling in a moving window fashion. In particular, for each year, we sample one million beat-consecutive codewords, considering entire tracks and using a window length of 5 years (the window is centered at the corresponding year such that, for instance, for 1994 we sample one million consecutive beats by choosing full tracks whose year annotation is between 1992 and 1996, both included). This procedure, which is repeated 10 times, guarantees a representative sample with a smooth evolution over the years.
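As an illustration only, the moving-window sampling described above could be sketched as follows. The data layout (a list of `(year, codeword_sequence)` pairs), the function name `sample_window`, drawing tracks with replacement and truncating the last track to hit the exact sample size are all assumptions made for this sketch; the text does not specify these details.

```python
import random

def sample_window(tracks, year, n_beats, half_width=2, rng=None):
    """Collect n_beats codewords from whole tracks annotated near `year`.

    tracks: list of (year, codeword_sequence) pairs (hypothetical layout).
    half_width=2 gives a 5-year window centered at `year`.
    Assumptions: tracks are drawn with replacement, and the last track is
    truncated so that exactly n_beats codewords are returned.
    """
    rng = rng or random.Random()
    pool = [seq for y, seq in tracks if abs(y - year) <= half_width and seq]
    if not pool:
        raise ValueError(f"no tracks annotated within the window around {year}")
    sample = []
    while len(sample) < n_beats:
        sample.extend(rng.choice(pool))  # append one full track at a time
    return sample[:n_beats]
```

Calling this function 10 times per year with `n_beats=1_000_000` would mirror the repetition count and sample size quoted in the text.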

Figure 1 Method schematic summary with pitch data. The dataset contains the beat-based music descriptions of the audio rendition of a musical piece or score (G, Em and D7 on the top of the staff denote chords). For pitch, these descriptions reflect the harmonic content of the piece15 and encapsulate all sounding notes of a given time interval into a compact representation11,12, independently of their articulation (they consist of the 12 pitch class relative energies, where a pitch class is the set of all pitches that are a whole number of octaves apart, e.g. notes C1, C2 and C3 all collapse to pitch class C). All descriptions are encoded into music codewords, using a binary discretization in the case of pitch. Codewords are then used to perform frequency counts and as nodes of a complex network whose links reflect transitions between subsequent codewords.
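A minimal sketch of how a 12-bin pitch class description might be transposed and binarized into a codeword is given below. The threshold of 0.5, the bin ordering (C at index 0) and the availability of an estimated tonic index `key_index` are assumptions for illustration; the actual discretization is described in the SI.

```python
def transpose_to_reference(chroma, key_index):
    """Circularly shift a 12-bin pitch class vector so the tonic lands on bin 0.

    key_index is the bin of the (separately estimated) tonic; key estimation
    itself is not shown here.
    """
    if len(chroma) != 12:
        raise ValueError("expected 12 pitch class energies")
    return list(chroma[key_index:]) + list(chroma[:key_index])

def pitch_codeword(chroma, threshold=0.5):
    """Binary discretization: 1 where the relative energy exceeds the threshold.

    The threshold value is an assumption of this sketch.
    """
    if len(chroma) != 12:
        raise ValueError("expected 12 pitch class energies")
    return tuple(1 if energy > threshold else 0 for energy in chroma)
```

For example, a G major triad (strong energy at G, B and D) in the key of G maps, after transposition, to the same codeword as a C major triad in the key of C: bins 0, 4 and 7 set to 1.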

We first count the frequency of usage of pitch codewords (i.e. the number of times each codeword type appears in a sample). We observe that the most used pitch codewords generally correspond to well-known harmonic items21, while unused codewords correspond to strange/dissonant pitch combinations (Fig. 2a). Sorting the frequency counts in decreasing order reveals a very clear pattern behind the data: a power law17 of the form $z \propto r^{-\alpha}$, where z corresponds to the frequency count of a codeword, r denotes its rank (i.e. r = 1 for the most used codeword and so forth) and α is the power law exponent. Specifically, we find that the distribution of codeword frequencies for a given year is well fitted by $P(z) \propto (c + z)^{-\beta}$ for $z > z_{\min}$, where we take z as the random variable22, β = 1 + 1/α as the exponent and c as a constant (Fig. 2b). A power law indicates that a few codewords are very frequent while the majority are highly infrequent (intuitively, the latter provide the small musical nuances necessary to make a discourse attractive to listeners3,4,5). Nonetheless, it also implies that there is no characteristic frequency or rank separating the most used codewords from largely unused ones (except for the largest rank values, due to the finiteness of the vocabulary). Another non-trivial consequence of power-law behavior is that when α ≤ 2, extreme events (i.e. very rare codewords) will certainly show up in a continuous discourse, provided the listening time is sufficient and the pre-arranged dictionary of musical elements is big enough.
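The rank-frequency construction and the relation between the two exponents can be sketched as follows. The least squares slope on log-log axes is used here only as a crude stand-in for estimating α; it is not the fitting procedure behind the reported values, and the function names are hypothetical.

```python
from collections import Counter
import math

def rank_frequency(codewords):
    """Frequency counts sorted in decreasing order (index 0 is rank r = 1)."""
    return sorted(Counter(codewords).values(), reverse=True)

def loglog_slope(freqs):
    """Least squares slope of log z against log r; -slope roughly estimates alpha.

    A crude illustration only: needs at least two ranks and strictly
    positive frequencies.
    """
    xs = [math.log(r) for r in range(1, len(freqs) + 1)]
    ys = [math.log(z) for z in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def beta_from_alpha(alpha):
    """Density exponent from the rank-frequency exponent: beta = 1 + 1/alpha."""
    return 1.0 + 1.0 / alpha
```

On synthetic counts that follow $z \propto r^{-1}$ exactly, `loglog_slope` returns a value close to -1, and `beta_from_alpha(1.0)` gives 2.0.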

Figure 2 Pitch distributions and networks. (a) Examples of the rank-frequency distribution (relative frequencies z′ such that $\sum_r z'_r = 1$). For ease of visualization, curves are chronologically shifted by a factor of 10 in the vertical axis. Some frequent and infrequent codewords are shown. (b) Examples of the density values and their fits, taking z as the random variable. Curves are chronologically shifted by a factor of 10 in the horizontal axis. (c) Average shortest path length l versus clustering coefficient C for pitch networks (right) and their randomized versions (left). Randomized networks were obtained by swapping pairs of links chosen at random, avoiding multiple links and self-connections. Values of l and C were calculated without considering the 10 highest degree nodes (see SI). Arrows indicate chronology (red and blue colors indicate values for more and less recent years, respectively).
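The link-swapping randomization mentioned in panel (c) can be illustrated with a degree-preserving swap on an undirected simple graph. Treating the network as undirected, the number of swaps and the cap on attempts are assumptions of this sketch, and `swap_randomize` is a hypothetical name.

```python
import random

def swap_randomize(edges, n_swaps, rng=None, max_tries=None):
    """Randomize an undirected simple graph by swapping endpoints of link pairs.

    edges: iterable of (u, v) links with u != v and no duplicates.
    Each accepted swap turns (a, b), (c, d) into (a, d), (c, b), which keeps
    every node's degree; swaps creating self-connections or multiple links
    are rejected.
    """
    rng = rng or random.Random()
    edge_list = [tuple(e) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    present = {frozenset(e) for e in edge_list}
    if max_tries is None:
        max_tries = 100 * n_swaps  # assumption: give up after many rejections
    done = tries = 0
    while done < n_swaps and tries < max_tries:
        tries += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # links are undirected, so try either orientation
        if a == d or c == b:
            continue  # would create a self-connection
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present or new1 == new2:
            continue  # would create a multiple link
        present.discard(frozenset((a, b)))
        present.discard(frozenset((c, d)))
        present.update((new1, new2))
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list
```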

Importantly, we find this power-law behavior to be invariant across years, with practically the same fit parameters. In particular, the exponent β remains close to an average of 2.18 ± 0.06 (corresponding to α around 0.85), which is similar to Zipf's law in linguistic text corpora23 and contrasts with the exponents found in previous small-scale, symbolic-based music studies24,25. The slope of the least squares linear regression of β as a function of the year is not statistically significant (p > 0.05, t-test). This makes evident the high stability of the distribution of pitch codeword frequencies across more than 50 years of music. However, it could well be that, even though the distribution is the same for all years, codeword rankings changed (e.g. a certain codeword was used frequently in 1963 but became mostly unused by 2005). To assess this possibility we compute Spearman's rank correlation coefficients26 for all possible year pairs and find that they are all extremely high, with an average of 0.97 ± 0.02 and a minimum above 0.91. These high correlations indicate that codeword rankings barely vary over the years.
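For reference, a Spearman rank correlation between the codeword counts of two years can be computed as the Pearson correlation of their ranks. The sketch below is self-contained; the dictionary layout (codeword to count), the treatment of missing codewords as zero counts and the averaging of tied ranks are assumptions made here.

```python
def average_ranks(values):
    """Rank values in decreasing order (1 = largest), averaging over ties."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(counts_a, counts_b):
    """Spearman correlation between two codeword -> count dictionaries.

    Codewords missing from one year are given a count of 0 (assumption).
    Raises ZeroDivisionError if all ranks in one year are tied.
    """
    vocab = list(set(counts_a) | set(counts_b))
    ra = average_ranks([counts_a.get(w, 0) for w in vocab])
    rb = average_ranks([counts_b.get(w, 0) for w in vocab])
    n = len(vocab)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5
```

Two years with identical codeword orderings give a coefficient of 1 regardless of the absolute counts, which is why this measure complements the comparison of fit parameters.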

Codeword frequency distributions provide a generic picture of vocabulary usage. However, they do not account for discourse syntax, just as a simple selection of words does not necessarily constitute an intelligible sentence. One way to account for syntax is to look at local interactions or transitions between codewords, which define explicit relations that capture most of the underlying regularities of the discourse and that can be directly mapped into a network or graph18,19. Hence, analogously to language-based analyses27,28,29, we consider the transition networks formed by codeword successions, where each node represents a codeword and each link represents a transition (see SI). The topology of these networks and common metrics extracted from them can provide us with valuable clues about the evolution of musical discourse.
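A transition network of this kind can be sketched with plain adjacency sets. Treating links as undirected and unweighted, and ignoring transitions from a codeword to itself, are simplifying assumptions of this example; the construction actually used is detailed in the SI.

```python
from collections import defaultdict

def transition_network(sequences):
    """Build an adjacency-set network from codeword sequences (one per track).

    Assumptions: links are undirected and unweighted, self-transitions are
    ignored, and no link is created across track boundaries.
    """
    links = defaultdict(set)
    for seq in sequences:
        for word in seq:
            links[word]  # touching the key registers every codeword as a node
        for a, b in zip(seq, seq[1:]):
            if a != b:
                links[a].add(b)
                links[b].add(a)
    return dict(links)

def count_links(network):
    """Number of undirected links (each one is stored in two adjacency sets)."""
    return sum(len(neighbors) for neighbors in network.values()) // 2
```

Comparing `count_links(network)` with `len(network)` gives a quick check of whether the number of links is of the same order as the number of nodes.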

All the transition networks we obtain are sparse, meaning that the number of links connecting codewords is of the same order of magnitude as the number of codewords. Thus, in general, only a limited number of transitions between codewords is possible. Such constraints would allow for music recognition and enjoyment, since these capacities are grounded in our ability for guessing/learning transitions3,4,8 and a non-sparse network would increase the number of possibilities in a way that guessing/learning would become unfeasible. Thinking in terms of originality and creativity, a sparse network means that there are still many ‘composition paths’ to be discovered. However, some of these paths could run into the aforementioned guessing/learning tradeoff9. Overall, network sparseness provides a quantitative account of music's delicate balance between predictability and surprise.

In sparse networks, the most fundamental characteristic of a codeword is its degree k, which measures the number of links to other codewords. With pitch networks, this quantity is distributed according to a power law $P(k) \propto k^{-\gamma}$ for $k > k_{\min}$, with the same fit parameters for all considered years. The exponent γ, which has an average of 2.20 ± 0.06, is similar to that of many other real complex networks18 and the median of the degree k is always 4. Nevertheless, we observe important trends in the other considered network metrics, namely the average shortest path length l, the clustering coefficient C and the assortativity relative to random, Γ. Specifically, l slightly increases from 2.9 to 3.2, values comparable to the ones obtained when randomizing the network links. The values of C show a considerable decrease from 0.65 to 0.45 and are much higher than those obtained for the randomized network. Thus, the small-worldness30 of the networks decreases with years (Fig. 2c). This trend implies that pitch codewords become harder to reach. The number of hops or steps to jump from one codeword to the other (as reflected by l) tends to increase and, at the same time, the local connectivity of the network (as reflected by C) tends to decrease. Additionally, Γ is always below 1, which indicates that the networks are always less assortative than random (i.e. well-connected nodes are less likely to be connected among them), a tendency that grows with time if we consider the biggest hubs of the network (SI). The latter suggests that there are fewer direct transitions between ‘referential’ or common codewords. Overall, a joint reduction of the small-worldness and the network assortativity shows a progressive restriction of pitch transitions, with fewer transition options and more defined paths between codewords.
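The two small-world metrics discussed above can be computed on an undirected adjacency-set network as in the following sketch. Using the average of local clustering coefficients (rather than, say, global transitivity) and averaging path lengths only over mutually reachable node pairs are assumptions; the text does not state which variants were used.

```python
from collections import deque

def clustering_coefficient(adj):
    """Average local clustering coefficient C.

    adj: dict mapping each node to the set of its neighbors (undirected,
    every neighbor must also be a key). Nodes with degree < 2 contribute 0.
    """
    if not adj:
        return 0.0
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue
        nb = list(neighbors)
        closed = sum(1 for i in range(k) for j in range(i + 1, k)
                     if nb[j] in adj[nb[i]])
        total += 2.0 * closed / (k * (k - 1))
    return total / len(adj)

def average_shortest_path_length(adj):
    """Mean number of hops l over ordered pairs of distinct reachable nodes."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("nan")
```

A triangle gives C = 1 and l = 1, whereas a three-node path gives C = 0 and l = 4/3, which illustrates the direction of the trend described: lower C and higher l mean a less small-world network.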

As opposed to pitch, timbre provides a different picture. Even though the distribution of timbre codeword frequencies is also well fitted by a power law (Fig. 3a), the parameters of this distribution vary across years. In particular, since 1965, β steadily decreases to values approaching 4 (Fig. 3b). Although such large values of β would imply that other fits could also be acceptable, the power law provides a simple parameterization to compare the changes over the years (and is not rejected in a likelihood ratio test against other alternatives). Smaller values of β indicate less timbral variety: frequent codewords become more frequent and infrequent ones become even less frequent. This is evidence of a growing homogenization of the global timbral palette. It also points towards a progressive tendency to follow more fashionable, mainstream sonorities. Interestingly, rank correlation coefficients are generally below 0.7, with an average of 0.57 ± 0.15 (Fig. 3c). These rather low rank correlations would attenuate the sensation that contemporary popular music is becoming more homogeneous in terms of timbre. The fact that frequent timbres of a certain time period become infrequent after some years could mask global homogeneity trends to listeners.

Figure 3 Timbre distributions. (a) Examples of the density values and fits taking z as the random variable. (b) Fitted exponents β. (c) Spearman's rank correlation coefficients for all possible year pairs.

Compared to timbre codeword frequencies, metrics obtained from timbre transition networks show no substantial variation. Again, similar median degrees (all equal to 8) and degree distributions were observed for all considered years. However, we were not able to achieve a proper fit for the latter (SI). Values of Γ are larger than 1 and have been increasing since 1965. Thus, in contrast to pitch, timbre networks are more assortative than random. The values of l fluctuate around 4.8 and C is always below 0.01. Notably, both are close to the values obtained with randomly wired networks. This close-to-random topology quantitatively demonstrates that, as opposed to language, timbral contrasts (or transitions) are rarely the basis for a musical discourse1. This does not mean that timbre is a meaningless facet. Global timbre properties, like the aforementioned power law and rankings, are clearly important for music categorization tasks2,11 (one example is genre classification31). Notice, however, that the evolving characteristics of musical discourse have important implications for artificial or human systems dealing with such tasks. For instance, the homogenization of the timbral palette and general timbral restrictions clearly challenge tasks exploiting this facet. A further example is found with the aforementioned restriction of pitch codeword connectivity, which could hinder song recognition systems (artificial song recognition systems are rooted in pitch codeword-like sequences, cf.32).

Loudness distributions are generally well fitted by a reversed log-normal function (Fig. 4a). Plotting them provides a visual account of the so-called loudness race (or loudness war), a term used to describe the apparent competition to release recordings with increasing loudness33,34, perhaps with the aim of catching potential customers' attention in a music broadcast (from our point of view, loudness changes are not only the result of technological developments but, in part, also the result of conscious decisions made by musicians and producers in the musical creation process, cf.33). The empirical median of the loudness values x grows from −22 dB FS to −13 dB FS (Fig. 4b), with a least squares linear regression yielding a slope of 0.13 dB/year (p < 0.01, t-test). In contrast, the absolute difference between the first and third quartiles of x remains constant around 9.5 dB (Fig. 4c), with a regression slope that is not statistically significant (p > 0.05, t-test). This shows that, although music recordings become louder, their absolute dynamic variability has been conserved, understanding dynamic variability as the range between higher and lower loudness passages of a recording34. However, and perhaps most importantly, one should notice that digital media cannot output signals over 0 dB FS35, which severely restricts the possibilities for maintaining the dynamic variability if the median continues to grow.
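The two loudness summaries used here, the median and the interquartile difference $|Q_1 - Q_3|$, together with a least squares slope over years, can be sketched as follows. The quartile interpolation method and the function names are assumptions of this example, and the significance test reported in the text is not reproduced.

```python
import statistics

def loudness_summary(values):
    """Return (median, |Q1 - Q3|) of loudness values in dB FS.

    Assumption: quartiles use the 'inclusive' interpolation method of
    statistics.quantiles; other conventions give slightly different values.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return median, abs(q1 - q3)

def regression_slope(years, values):
    """Least squares slope of values against years (e.g. dB per year)."""
    n = len(years)
    mx, my = sum(years) / n, sum(values) / n
    sxx = sum((x - mx) ** 2 for x in years)
    sxy = sum((x - mx) * (y - my) for x, y in zip(years, values))
    return sxy / sxx
```

Applying `loudness_summary` to each yearly sample and then `regression_slope` to the resulting medians (or interquartile differences) would yield trend estimates of the kind quoted above.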

Figure 4 Loudness distributions. (a) Examples of the density values and fits of the loudness variable x. (b) Empirical distribution medians. (c) Dynamic variability, expressed as absolute loudness differences between the first and third quartiles of x, $|Q_1 - Q_3|$.

Finally, we look at loudness transition networks, which show comparable degree distributions, a median degree between 13 and 14, values of l between 8 and 10 and a Γ fluctuating around 1.08. Notably, l is appreciably above the values obtained for randomly wired networks. The values of C have an average of 0.59 ± 0.02, also well above the values obtained for the random networks. These two observations suggest that the network has a one-dimensional character, implying that no extreme loudness transitions occur (one rarely finds loudness transitions to drive a musical discourse). The very stable metrics obtained for loudness networks imply that, despite the race towards louder music, the topology of loudness transitions is maintained.