Birdsong data sets

We analyzed song recordings from four different species: European starling (Sturnus vulgaris), Bengalese finch (Lonchura striata domestica), Cassin’s vireo (Vireo cassinii), and California thrasher (Toxostoma redivivum). As the four data sets were each hand-segmented or algorithmically segmented by different research groups, the segmentation methodology varies between species. The choice of the acoustic unit used in our analyses is somewhat arbitrary, and we use the term syllable synonymously across all four species in this text; however, the units referred to here as syllables for the California thrasher and Cassin’s vireo are sometimes referred to as phrases in other work21,22,34,35. Information about the length and diversity of each syllable repertoire is provided in Extended Data Table 1.

The Bengalese finch data set33,52 was recorded from sound-isolated individuals and was hand-labeled. The Cassin’s vireo21,34,63 and the California thrasher35 data sets were acquired from the Bird-DB40 database of wild recordings, and were recorded from the Sierra Nevada and Santa Monica mountains, respectively. Both data sets are hand-labeled. The European starling song64 was collected from wild-caught male starlings (sexed by morphological characteristics) 1 year of age or older. Starling song was recorded at either 44.1 or 48 kHz over the course of several days to weeks, at various points throughout the year in sound-isolated chambers. Some European starlings were administered testosterone before audio recordings to increase singing behavior. The methods for annotating the European starling data set are detailed in the “Corpus annotation for European starlings” section.

Procedures and methods comply with all relevant ethical regulations for animal testing and research and were carried out in accordance with the guidelines of the Institutional Animal Care and Use Committee at the University of California, San Diego.

Speech corpora

Phone transcripts were taken from four different data sets: the Buckeye corpus of spontaneous conversational American-English speech36, the IMS GECO corpus of spontaneous German speech37, the AsiCA corpus of spontaneous Italian speech of the Calabrian dialect38 (south Italian), and the CSJ corpus of spontaneous Japanese speech39.

The American-English speech corpus (Buckeye) consists of conversational speech taken from 40 speakers in Columbus, Ohio. Alongside the recordings, the corpus includes transcripts of the speech and time-aligned segmentations into words and phones. Phonetic alignment was performed in two steps: first using automatic HMM alignment, followed by hand adjustment and relabeling by a trained human labeler. The Buckeye data set also transcribes pauses, which we use as the basis for utterance boundaries in our analyses.

The German speech corpus (GECO) consists of 46 dialogs ~25 min in length each, in which previously unacquainted female subjects are recorded conversing with one another. The GECO corpus is automatically aligned at the phoneme and word level using forced alignment65 from manually generated orthographic transcriptions. A second algorithmic step is then used to segment the data set into syllables65.

The Italian speech data (AsiCA) consist of directed, informative, and spontaneous recordings. Only the spontaneous subset of the data set was used for our analysis to remain consistent with the other data sets. The spontaneous subset of the data set consists of 61 transcripts each lasting an average of 35 min. The AsiCA data set is transcribed using a hybrid orthographic/phonetic transcription method where certain phonetic features were noted with International Phonetic Alphabet labels.

The CSJ consists of hand-transcribed spontaneous speech from monologues and conversations. We use the core subset of the corpus, both because it is the fully annotated subset of the data set and because it is similar in size to the other data sets used. The core subset contains over 500,000 words annotated for phonemes and several other speech features, and consists primarily of spontaneously spoken monologues. CSJ is also annotated at the level of the mora, a syllable-like unit consisting of one or more phonemes and serving as the basis of the 5–7–5 structure of the haiku66. In addition, CSJ is transcribed at the level of Inter-Pausal Units (IPUs), which are periods of continuous speech surrounded by pauses of at least 200 ms. We refer here to IPUs as utterances to remain consistent with the Buckeye data set.

As each of the data sets was transcribed using a different methodology, this disparity between transcription methods may account for some differences in the observed MI decay. The impact of using different transcription methods is at present unknown. The same caveat applies to the birdsong data sets.

Corpus annotation for European starlings

The European starling corpus was annotated using a novel unsupervised segmentation and annotation algorithm maintained at GitHub.com/timsainb/AVGN. An outline of the algorithm is given here.

Spectrograms of each song bout were created by taking the absolute value of the one-sided short-time Fourier transform of the band-pass-filtered waveform. The resulting power was normalized from 0 to 1, log-scaled, and thresholded to remove low-amplitude background noise in each spectrogram. The threshold for each spectrogram was set dynamically. Beginning at a base-power threshold, all power in the spectrogram below that threshold was set to zero. We then estimated the periods of silence in the spectrogram as stretches where the sum of the power over all frequency channels at a given time point was equal to zero. If there were no stretches of silence lasting at least n seconds (described below), the power threshold was increased and the process was repeated until our criterion for minimum-length silence was met or the maximum threshold was exceeded. Song bouts for which the maximum threshold was exceeded were excluded as too noisy. This step also filtered out putative bouts composed of nonvocal sounds. Thresholded spectrograms were convolved with a Mel filter, with 32 equally spaced frequency bands between the high and low cutoffs of the Butterworth bandpass filter, then rescaled between 0 and 255.
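The following sketch illustrates this dynamic thresholding loop in Python, assuming a precomputed log-scaled spectrogram; the function and parameter names, and the threshold step of 0.25, are illustrative rather than the exact AVGN implementation.

```python
import numpy as np


def longest_silent_run(silent):
    """Length (in frames) of the longest run of consecutive True values."""
    longest = current = 0
    for is_silent in silent:
        current = current + 1 if is_silent else 0
        longest = max(longest, current)
    return longest


def dynamically_threshold(log_spec, base_thresh, max_thresh, min_silence_s,
                          frames_per_s, step=0.25):
    """Raise the power threshold until the bout contains a silent stretch of at
    least `min_silence_s` seconds; return None (bout too noisy) if the maximum
    threshold is exceeded first."""
    min_silence_frames = int(min_silence_s * frames_per_s)
    thresh = base_thresh
    while thresh <= max_thresh:
        thresholded = np.where(log_spec < thresh, 0.0, log_spec)
        # a frame is silent when the summed power across frequency channels is zero
        silent_frames = thresholded.sum(axis=0) == 0
        if longest_silent_run(silent_frames) >= min_silence_frames:
            return thresholded
        thresh += step
    return None
```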

To segment song bouts into syllables, we computed the spectral envelope of each song spectrogram as the sum of power across the Mel-scaled frequency channels at every time sample in the spectrogram. We defined syllables operationally as periods of continuous vocalization bracketed by silence. To find syllables, we first marked silences at minima in the spectral envelope and considered the signal between each silence as a putative syllable. We then compared the duration of the putative syllable with an upper bound on the expected syllable length for the species. If the putative syllable was longer than the expected syllable length, it was assumed to be a concatenation of two or more syllables that had not yet been segmented, and the threshold for silence was raised to find the boundary between those syllables. This process was repeated iteratively for each putative syllable until it was either segmented into multiple syllables or a maximum threshold was reached, at which point it was accepted as a long syllable. This dynamic segmentation algorithm is important for capturing certain introductory whistles in European starling song, which can be several times longer than any other syllable in a bout.
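A simplified sketch of this iterative segmentation, operating on a spectral envelope (summed power per time frame); the threshold schedule and function names are illustrative assumptions, not the exact AVGN implementation.

```python
import numpy as np


def find_vocal_stretches(envelope, start, stop, silence_thresh):
    """Onset/offset frame pairs of continuous above-threshold signal in envelope[start:stop]."""
    vocal = (envelope[start:stop] > silence_thresh).astype(int)
    changes = np.flatnonzero(np.diff(np.r_[0, vocal, 0]))
    return [(start + on, start + off) for on, off in zip(changes[::2], changes[1::2])]


def segment_syllables(envelope, silence_thresh, max_silence_thresh, thresh_step,
                      max_syll_s, frames_per_s):
    """Segment a bout into syllables; putative syllables longer than `max_syll_s`
    are re-segmented at progressively higher silence thresholds until they split
    or the maximum threshold is reached, at which point they are accepted as
    long syllables."""
    max_syll_frames = int(max_syll_s * frames_per_s)
    syllables = []
    queue = find_vocal_stretches(envelope, 0, len(envelope), silence_thresh)
    while queue:
        onset, offset = queue.pop(0)
        if offset - onset <= max_syll_frames:
            syllables.append((onset, offset))
            continue
        # too long: raise the silence threshold for this stretch only
        thresh, pieces = silence_thresh, [(onset, offset)]
        while len(pieces) == 1 and thresh < max_silence_thresh:
            thresh += thresh_step
            pieces = find_vocal_stretches(envelope, onset, offset, thresh)
        if len(pieces) > 1:
            queue = pieces + queue              # re-check the newly split pieces
        else:
            syllables.append((onset, offset))   # accepted as a long syllable
    return sorted(syllables)
```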

Several hyperparameters were used in the segmentation algorithm. The minimum and maximum expected lengths of a syllable in seconds (ebr_min, ebr_max) were set to 0.25 and 0.75 s, respectively. The minimum number of syllables expected in a bout (min_num_sylls) was set to 20. The maximum threshold for silence (max_thresh), relative to the maximum of the spectral envelope, was set to 2%. To exclude overly noisy song, a minimum length of silence (min_silence_for_spec) was required in each bout, set at 0.5 s. The base spectrogram (log) threshold for power considered to be spectral background noise (spec_thresh) was set at 4.0. This threshold was set dynamically, with the minimum spectral background noise threshold (spec_thresh_min) set to 3.5.
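For reference, these values can be collected into a single configuration, shown here as a Python dictionary keyed by the parameter names used above.

```python
# Segmentation hyperparameters as reported above (times in seconds)
segmentation_params = {
    "ebr_min": 0.25,              # minimum expected syllable length (s)
    "ebr_max": 0.75,              # maximum expected syllable length (s)
    "min_num_sylls": 20,          # minimum number of syllables expected per bout
    "max_thresh": 0.02,           # maximum silence threshold (fraction of envelope maximum)
    "min_silence_for_spec": 0.5,  # minimum length of silence required per bout (s)
    "spec_thresh": 4.0,           # base log-power threshold for background noise
    "spec_thresh_min": 3.5,       # minimum dynamic spectrogram threshold
}
```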

We reshaped the syllable spectrograms to create uniformly sized inputs for the dimensionality reduction algorithm. Syllable time axes were resized using spline interpolation such that 32 frames spanned the upper limit of syllable length for each species (e.g., a starling’s longest syllables are ~1 s, so all syllables were resampled to 32 samples/s). Syllables shorter than this upper bound were zero-padded on either side to 32 time samples, and syllables longer than the upper bound were resized to 32 time samples.
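A minimal sketch of this reshaping step, assuming a (frequency × time) syllable spectrogram and using scipy's spline interpolation; for brevity it collapses the resampling and resizing into a single zoom.

```python
import numpy as np
from scipy.ndimage import zoom


def reshape_syllable(syll_spec, n_time_samples=32):
    """Pad or spline-interpolate the time axis of a (freq x time) syllable
    spectrogram to exactly `n_time_samples` samples."""
    n_freq, n_time = syll_spec.shape
    if n_time > n_time_samples:
        # longer than the upper bound: compress the time axis by spline interpolation
        return zoom(syll_spec, (1, n_time_samples / n_time), order=3)
    # shorter: zero-pad equally on either side
    pad_left = (n_time_samples - n_time) // 2
    pad_right = n_time_samples - n_time - pad_left
    return np.pad(syll_spec, ((0, 0), (pad_left, pad_right)), mode="constant")
```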

Multiple algorithms exist to transcribe birdsong corpora into discrete elements. Our method is unique in that it does not rely on supervised (experimenter) element labeling or hand-engineered acoustic features specific to individual species beyond syllable length. The method consists of two steps: (1) project the complex features of each birdsong data set onto a two-dimensional space using the UMAP dimensionality reduction algorithm41 and (2) apply a clustering algorithm to determine element boundaries67. Necessary parameters (e.g. the minimum cluster size) were set based upon visual inspection of the distributions of categories in the two-dimensional latent space. We demonstrate the output of this method in Fig. 1, both on a European starling data set using our automated transcription and on the Cassin’s vireo, California thrasher, and Bengalese finch data sets. The dimensionality reduction procedure was also applied to the Cassin’s vireo, Bengalese finch, and California thrasher data sets, but using hand segmentations rather than algorithmically segmented boundaries. For these three species, the hand labels are used rather than labels derived from the UMAP projection.
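The two-step projection and labeling can be sketched as below, using the umap-learn package and, as one choice of density-based clusterer, HDBSCAN; the minimum cluster size shown is an arbitrary placeholder and was in practice set by visual inspection as described above.

```python
import numpy as np
import umap
import hdbscan


def label_syllables(syllable_specs, min_cluster_size=50):
    """Project flattened syllable spectrograms to two dimensions with UMAP and
    assign discrete labels with a density-based clusterer (HDBSCAN here)."""
    flat = np.stack([spec.ravel() for spec in syllable_specs])
    embedding = umap.UMAP(n_components=2).fit_transform(flat)
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(embedding)
    return embedding, labels   # a label of -1 marks noise/unclustered syllables
```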

Song bouts

Data sets were either provided already segmented into bouts by the authors of each data set, as in the case of the Bengalese finches, or were segmented into bouts based upon inter-syllable gaps of >60 s in the case of the Cassin’s vireo and California thrasher, and >10 s in the case of the European starling. These thresholds were set based upon the distribution of inter-syllable gaps for each species (Supplementary Fig. 9).
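A sketch of this gap-based bout segmentation, assuming per-syllable onset and offset times in seconds; the function name is illustrative.

```python
import numpy as np


def segment_into_bouts(onsets, offsets, max_gap_s):
    """Group syllables into bouts wherever the silent gap between a syllable
    offset and the next onset exceeds `max_gap_s` (e.g. 60 s for Cassin's vireo
    and California thrasher, 10 s for European starlings)."""
    onsets, offsets = np.asarray(onsets), np.asarray(offsets)
    gaps = onsets[1:] - offsets[:-1]
    bout_breaks = np.flatnonzero(gaps > max_gap_s) + 1
    return np.split(np.arange(len(onsets)), bout_breaks)   # syllable indices per bout
```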

Mutual information estimation

We calculated MI using distributions of pairs of syllables (or phones) separated by some distance within the vocal sequence. For example, in the sequence “a → b → c → d → e”, where letters denote exemplars of specific syllable or phone categories, the distribution of pairs at a distance of “2” would be ((a, c), (b, d), (c, e)). We calculate MI between these pairs of elements as:

$$\hat I(X,Y) = \hat S(X) + \hat S(Y) - \hat S(X,Y),$$ (1)

where X is the distribution of single elements (a, b, c) in the example, and Y is the distribution of single elements (c, d, e). \(\hat S(X)\) and \(\hat S(Y)\) are the marginal entropies of the distributions of X and Y, respectively, and \(\hat S(X,Y)\) is the entropy of the joint distribution of X and Y, ((a, c), (b, d), (c, e)). We employ the Grassberger68 method for entropy estimation used by Lin and Tegmark3, which accounts for the under-sampling of the true entropy in finite samples:

$$\hat S = {\mathrm{log}}_2(N) - \frac{1}{N}\mathop {\sum}\limits_{i = 1}^K {N_i} \psi \left( {N_i} \right),$$ (2)

where ψ is the digamma function, K is the number of categories (e.g. syllables or phones), \(N_i\) is the number of elements in the ith category, and N is the total number of elements in each distribution. We account for the lower bound of MI by calculating the MI on the same data set, where the syllable sequence order is shuffled:

$$\hat I_{{\rm{sh}}}(X,Y) = \hat S\left( {X_{{\rm{sh}}}} \right) + \hat S\left( {Y_{{\rm{sh}}}} \right) - \hat S\left( {X_{{\rm{sh}}},Y_{{\rm{sh}}}} \right),$$ (3)

where \(X_{{\rm{sh}}}\) and \(Y_{{\rm{sh}}}\) refer to the same distributions as X and Y described above, taken from shuffled sequences. This shuffling consists of a permutation of each individual sequence being used in the analysis, which differs depending on the type of analysis (e.g. a bout of song in the analysis shown in Supplementary Fig. 5 versus an entire day of song in Fig. 4).

Finally, we subtract the estimated lower bound of the MI from the original MI estimate:

$${\rm{MI}} = \hat I - \hat I_{{\rm{sh}}}$$ (4)
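A sketch of the full estimate (Eqs. 1–4) for a single distance d, assuming a sequence of hashable syllable labels; note that the shuffle shown here is a single whole-sequence permutation, whereas the analyses above permute each individual sequence separately.

```python
import numpy as np
from scipy.special import digamma


def grassberger_entropy(counts):
    """Entropy estimate from category counts, following Eq. (2)."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = counts.sum()
    return np.log2(n) - np.sum(counts * digamma(counts)) / n


def mi_at_distance(labels, d):
    """Eq. (1): MI between elements of an integer-coded sequence separated by d."""
    x, y = labels[:-d], labels[d:]
    h_x = grassberger_entropy(np.bincount(x))
    h_y = grassberger_entropy(np.bincount(y))
    joint = x * (labels.max() + 1) + y          # unique integer code for each (x, y) pair
    h_xy = grassberger_entropy(np.bincount(joint))
    return h_x + h_y - h_xy


def shuffle_corrected_mi(sequence, d, seed=0):
    """Eq. (4): subtract the MI of the shuffled sequence (Eq. 3) from the MI of
    the original sequence."""
    labels = np.unique(sequence, return_inverse=True)[1]   # map syllable labels to integers
    shuffled = np.random.default_rng(seed).permutation(labels)
    return mi_at_distance(labels, d) - mi_at_distance(shuffled, d)
```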

Mutual information decay fitting

To determine the shape of the MI decay, we fit three decay models to the MI as a function of element distance: an exponential decay model, a power-law decay model, and a composite model of both, termed the composite decay:

$${\mathrm{exponential}}\,{\mathrm{decay}} = a \ast e^{ - x \ast b} + c$$ (5)

$${\mathrm{power}}{\text{-}}{\mathrm{law}}\,{\mathrm{decay}} = a \ast x^b + c$$ (6)

$${\mathrm{composite}}\,{\mathrm{decay}} = a \ast e^{ - x \ast b} + c \ast x^d + f$$ (7)

where x represents the inter-element distance between units (e.g., phones or syllables). To fit the models on a logarithmic scale, we computed the residuals between the log of the MI and the log of the model estimate of the MI. Because our distances were necessarily sampled linearly as integers, we scaled the residuals during fitting by the log of the distance between elements, emphasizing the fit of the decay in log scale. The models were fit using the lmfit Python package69.
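A sketch of fitting the composite model (Eq. 7) with lmfit on synthetic MI values; the starting values, bounds, and the particular log-distance weighting shown are illustrative assumptions, not the exact fitting configuration.

```python
import numpy as np
from lmfit import Parameters, minimize


def composite_model(params, x):
    """Eq. (7): exponential plus power-law decay with a constant offset."""
    p = params.valuesdict()
    return p["a"] * np.exp(-x * p["b"]) + p["c"] * x ** p["d"] + p["f"]


def log_residuals(params, x, mi):
    """Residuals between log(MI) and log(model), weighted to emphasize a log-scale fit."""
    resid = np.log(mi) - np.log(composite_model(params, x))
    return resid * np.log(x + 1)   # illustrative log-distance weighting

params = Parameters()
params.add("a", value=1.0, min=0)
params.add("b", value=0.5, min=0)
params.add("c", value=1.0, min=0)
params.add("d", value=-1.0, max=0)   # negative exponent for power-law decay
params.add("f", value=0.01, min=0)

distances = np.arange(1, 101).astype(float)
mi_values = 0.8 * np.exp(-distances / 5) + 0.3 * distances ** -0.6 + 0.01   # synthetic MI decay
result = minimize(log_residuals, params, args=(distances, mi_values))
```

The exponential and power-law fits follow the same pattern with the corresponding three-parameter models (Eqs. 5 and 6).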

Model selection

We used the Akaike information criterion (AIC) to compare the relative quality of the exponential, composite, and power-law models. AIC takes into account goodness-of-fit and model simplicity, by penalizing larger numbers of parameters in each model (3 for the exponential and power-law models, 5 for the composite model). All comparisons use the AICc42 estimator, which imposes an additional penalty (beyond the penalty imposed by AIC) to correct for higher-parameter models overfitting on smaller data sets. We choose the best-fit model for the MI decay of each bird’s song and the human speech phone data sets using the difference in AICc between models42. In the text, we report the relative probability of a given model (in comparison to other models), which is computed directly from the AICc42 (see Supplementary Information). We report the results using log-transformed data in the main text (Extended Data Tables 3 and 4).
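For a least-squares fit, AICc and the relative probability of one model versus the best model can be computed as below; this is one standard formulation assuming Gaussian residuals, shown for illustration.

```python
import numpy as np


def aicc(n_obs, n_params, residual_sum_of_squares):
    """Corrected Akaike information criterion for a least-squares fit."""
    aic = n_obs * np.log(residual_sum_of_squares / n_obs) + 2 * n_params
    correction = 2 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return aic + correction


def relative_probability(aicc_model, aicc_best):
    """Probability that a model minimizes information loss relative to the best model."""
    return np.exp((aicc_best - aicc_model) / 2)
```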

To determine a reasonable range of element-to-element distances for all the birdsong and speech data sets, we analyzed the relative goodness-of-fit (AICc) and proportion of variance explained (r2) for each model on decays over distances ranging from 15 to 1000 phones/syllables apart. The composite model provides the best fit for distances up to at least 1000 phones in each language (Supplementary Fig. 10) and at least the first 100 syllables for all songbird species (Supplementary Fig. 11). To keep analyses consistent across languages and songbird species, we report on analyses using distances up to 100 elements (syllables in birdsong and phones in speech). Figures 3 and 4 show a longer range of decay in each language and songbird species, plotted up to element distances where the coefficient of determination (r2) remained within 99.9% of its value when fit to 100-element distances.

Curvature of decay fits

We calculated the curvature for those signals best fit by a composite model in log space (log-distance and log-MI):

$$\kappa = \frac{{|y\prime\prime |}}{{\left( {1 + y^{\prime 2}} \right)^{\frac{3}{2}}}}$$ (8)

where y is the log-scaled MI. We then found the local minimum of the curvature function and the local maximum that follows it, which corresponds to the “knee” of the exponential portion of the decay function: the transition from a primary contribution of the exponential decay to a primary contribution of the power-law decay.
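A sketch of this knee-finding step using numerical derivatives of the fitted decay in log-log coordinates; the discrete-extremum logic shown is illustrative.

```python
import numpy as np


def find_knee(log_distance, log_mi_fit):
    """Curvature (Eq. 8) of the fitted decay in log space, and the index of the
    local curvature maximum following the first local minimum (the 'knee')."""
    dy = np.gradient(log_mi_fit, log_distance)
    d2y = np.gradient(dy, log_distance)
    curvature = np.abs(d2y) / (1 + dy ** 2) ** 1.5
    interior = curvature[1:-1]
    minima = np.flatnonzero((interior < curvature[:-2]) & (interior < curvature[2:])) + 1
    maxima = np.flatnonzero((interior > curvature[:-2]) & (interior > curvature[2:])) + 1
    knee = None
    if len(minima):
        later_maxima = maxima[maxima > minima[0]]
        if len(later_maxima):
            knee = later_maxima[0]
    return curvature, knee
```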

Sequence analyses

Our primary analysis was performed on sequences of syllables that were produced within the same day, to allow for both within-bout and between-bout dynamics to be present. To do so, we considered all syllables produced within the same day as a single sequence and computed MI over pairs of syllables that crossed bouts, regardless of the delay in time between the pairs of syllables. In addition to the primary within-day analysis, we performed three controls to observe whether the observed MI decay was due purely to within-bout or between-bout organization. The first control was to compute the MI only between syllables that occur within the same bout (as defined by a 10 s gap between syllables). Similar to the primary analysis (Fig. 4), the best-fit model for within-bout MI decay is the composite model (Supplementary Figs 7b and 5). To more directly dissociate within-bout and between-bout syllable dependencies in songbirds, we computed the MI decay after removing either within- or between-bout structure. To do this, we shuffled the ordering of bouts within a day while retaining the order of syllables within each bout (Supplementary Fig. 7c), or shuffled the order of syllables within each bout while retaining the ordering of bouts (Supplementary Fig. 7d). Analyses were performed on individual songbirds with at least 150 syllables in their data set (Supplementary Fig. 7), and on the full data set of all birds in a given species. We performed similar shuffling analyses on the speech data sets (Supplementary Fig. 2). For speech, we shuffled the order of phones within words (while preserving word order) to remove within-word information, and shuffled word order (while preserving within-word phone ordering) to remove between-word information. We used a similar shuffling strategy at the utterance level to remove within- and between-utterance information. The speech data sets were not broken down into individuals due to limitations in data set size at the individual level, and because language is clearly shared between individuals in each speech data set.
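The two bout-level shuffles can be sketched as below for a single day of song, represented as a list of bouts (lists of syllable labels); the same logic applies to words and utterances in the speech data.

```python
import numpy as np

rng = np.random.default_rng(0)


def shuffle_between_bouts(bouts):
    """Permute bout order within a day, keeping syllable order within each bout
    (removes between-bout structure)."""
    return [bouts[i] for i in rng.permutation(len(bouts))]


def shuffle_within_bouts(bouts):
    """Permute syllable order within each bout, keeping bout order
    (removes within-bout structure)."""
    return [list(rng.permutation(bout)) for bout in bouts]


# Example: one 'day' of three bouts of syllable labels
day = [["a", "b", "b", "c"], ["a", "c"], ["d", "a", "b"]]
between_removed = shuffle_between_bouts(day)
within_removed = shuffle_within_bouts(day)
```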

To address the possibility that repeating syllables might account for long-range order, we performed separate analyses on both the original syllable sequences (as produced by the bird) and compressed sequences in which all sequentially repeated syllables were counted as a single syllable. The original and compressed sequences show similar MI decay shapes (Supplementary Fig. 12). We also assessed how our results relate to the timescale of segmentation and discretization of syllables or phones by computing the decay in MI between discretized spectrograms of speech and birdsong at different temporal resolutions (Supplementary Fig. 13) for a subset of the data. Long-range relationships are present throughout both speech and birdsong regardless of segmentation, but the pattern of MI decay does not follow the hypothesized decay models as closely as that observed when the signals are discretized to phones or syllables, supporting the nonarbitrariness of these low-level production units.
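Compressing sequential repeats amounts to run-length collapsing the syllable sequence, as in this short sketch.

```python
from itertools import groupby


def compress_repeats(syllables):
    """Collapse runs of sequentially repeated syllables into a single instance."""
    return [label for label, _ in groupby(syllables)]


compress_repeats(["a", "a", "b", "b", "b", "c", "a"])   # -> ['a', 'b', 'c', 'a']
```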

Computational models

We compared the MI decay of sequences produced by three different artificial grammars: (1) Markov models used to describe the song of two Bengalese finches31,32, (2) the hierarchical model proposed by Lin and Tegmark3, and (3) a model composed of both the hierarchical model advocated by Lin and Tegmark and a Markov model. While these models do not capture the full array of possible sequential models and their signatures in MI decay, they capture well the predictions made based upon the literature discussed2,3,6,7,14 and provide an illustration of what would be expected given our competing hypotheses. With each model, we generate corpora of sequences, then compute the MI decay of the sequences using the same methods as with the birdsong and speech data. We also fit a power-law, exponential, and composite model to the MI decay, in the same manner (Fig. 2).

A Markov model is a sequential model in which the probability of transitioning to a state \(x_n\) depends solely on the previous state \(x_{n-1}\). Sequences are generated from a Markov model by sampling an initial state \(x_0\) from the set of possible states S. \(x_0\) is then followed by a new state sampled from the probability distribution \(P(x_n|x_{n-1})\). Markov models can thus be captured by a matrix M of conditional probabilities \(M_{ab} = P(x_n = a|x_{n-1} = b)\), where \(a \in S\) and \(b \in S\). In the example (Fig. 2b) we produce a set of 65,536 (\(2^{16}\)) sequences from Markov models describing two Bengalese finches31,32.
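A sketch of sampling from such a Markov model, with rows of the transition matrix indexed by the previous state; the toy three-state matrix is illustrative (the Bengalese finch models have many more states).

```python
import numpy as np


def sample_markov(transition_matrix, initial_probs, length, rng=np.random.default_rng(0)):
    """Sample a sequence of `length` states from a first-order Markov model;
    transition_matrix[b] is the distribution P(x_n | x_{n-1} = b)."""
    n_states = len(initial_probs)
    sequence = [rng.choice(n_states, p=initial_probs)]
    for _ in range(length - 1):
        sequence.append(rng.choice(n_states, p=transition_matrix[sequence[-1]]))
    return sequence


# Toy example with three states
T = np.array([[0.1, 0.8, 0.1],
              [0.3, 0.2, 0.5],
              [0.6, 0.2, 0.2]])
markov_sequence = sample_markov(T, initial_probs=[1 / 3, 1 / 3, 1 / 3], length=50)
```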

The hierarchical model from Lin and Tegmark3 samples sequences recursively, in a similar manner to how the Markov model samples sequences sequentially. Specifically, a state \(x_0\) is drawn probabilistically from the set of possible states S, as in the Markov model. The initial state \(x_0\) is then replaced (rather than followed, as in the Markov model) by q new states (rather than a single state, as in the Markov model), which are similarly sampled probabilistically as \(P(x_i|x_0)\), where \(x_i\) is any of the q new states replacing \(x_0\). The hierarchical grammar can therefore similarly be captured by a conditional probability matrix \(M_{ab} = P(x_{l+1} = a|x_l = b)\). The difference between the two models is that the sampled states are replaced recursively in the hierarchical model, whereas in the Markov model they are appended sequentially to the initial state. In the example (Fig. 2a) we produce a set of 1000 sequences from a model parameterized with an alphabet of 5 states recursively subsampled 12 times, with 2 states replacing the initial state at each subsampling (generating sequences of length 4096).
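A sketch of recursive sampling from this hierarchical model; with 5 states, q = 2, and 12 levels of replacement it generates sequences of length 2^12 = 4096, as in the example above. The uniform transition matrix shown is an arbitrary placeholder.

```python
import numpy as np


def sample_hierarchical(transition_matrix, initial_probs, depth, q,
                        rng=np.random.default_rng(0)):
    """Recursively replace each state with `q` new states drawn from
    P(x_{l+1} | x_l), for `depth` levels (sequence length grows as q**depth)."""
    n_states = len(initial_probs)
    sequence = [rng.choice(n_states, p=initial_probs)]
    for _ in range(depth):
        sequence = [child
                    for parent in sequence
                    for child in rng.choice(n_states, size=q, p=transition_matrix[parent])]
    return sequence


# e.g. a 5-state model with a uniform transition matrix, sampled 12 levels deep
T5 = np.full((5, 5), 0.2)
hierarchical_sequence = sample_hierarchical(T5, initial_probs=[0.2] * 5, depth=12, q=2)
```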

The final model combines the Markov model and the hierarchical model by using Markov-generated sequences as the end states of the hierarchical model. Specifically, the combined model generates sequences in a three-step process: (1) a Markov model is used to generate a number of sequences equal to the number of possible states of the hierarchical model (S); (2) the combined model is sampled in the exact same manner as the hierarchical model to produce sequences; (3) the end states of the hierarchical model are then replaced with their corresponding Markov-generated sequences from (1). In the example (Fig. 2c) we produce sequences in the same manner as the hierarchical model. Each state of these sequences is then replaced with a sequence between 2 and 5 states long generated by a Markov model with an alphabet of 25 states.
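A sketch of the combined model, reusing the sample_markov and sample_hierarchical functions from the sketches above; the transition matrices and the subsequence-length range are illustrative assumptions.

```python
import numpy as np


def sample_combined(hier_T, hier_p0, depth, q, markov_T, markov_p0,
                    subseq_lengths=(2, 5), rng=np.random.default_rng(0)):
    """Hierarchical sampling whose end states are replaced by short
    Markov-generated subsequences (steps 1-3 above)."""
    n_states = len(hier_p0)
    # (1) one Markov-generated subsequence per hierarchical state
    subsequences = [sample_markov(markov_T, markov_p0,
                                  rng.integers(subseq_lengths[0], subseq_lengths[1] + 1))
                    for _ in range(n_states)]
    # (2) sample end states from the hierarchical model, (3) replace each with its subsequence
    end_states = sample_hierarchical(hier_T, hier_p0, depth, q, rng=rng)
    return [element for state in end_states for element in subsequences[state]]
```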

Neither the hierarchical model nor the combined model is meant to exhaustively sample the potential ways in which hierarchical signals can be formed or combined with Markovian processes. Instead, both models are meant to illustrate the theory proposed by prior work and to act as a baseline for comparison for our analyses on real-world signals.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.