Why do distantly related mammals like sheep, giant pandas, and fur seals produce bleats that are characterized by vibrato-like fundamental frequency (F0) modulation? To answer this question, we used psychoacoustic tests and comparative analyses to investigate whether this distinctive vocal feature has evolved to improve the perception of formants, key acoustic components of animal calls that encode important information about the caller’s size and identity. Psychoacoustic tests on humans confirmed that vibrato-like F0 modulation improves the ability of listeners to detect differences in the formant patterns of synthetic bleat-like stimuli. Subsequent phylogenetically controlled comparative analyses revealed that vibrato-like F0 modulation has evolved independently in six mammalian orders in vocal signals with relatively high F0 and, therefore, low spectral density (i.e., fewer harmonic overtones). We also found that mammals modulate the vibrato in these calls over greater frequency extents when the number of harmonic overtones per formant is low, suggesting that this is a mechanism to improve formant perception in calls with low spectral density. Our findings constitute the first evidence that formant perception in non-speech sounds is improved by fundamental frequency modulation and provide a mechanism for the convergent evolution of bleat-like calls in mammals. They also indicate that selection pressures for animals to transmit important information encoded by formant frequencies (on size and identity, for example) are likely to have been a key driver in the evolution of mammal vocal diversity.

Results and Discussion

Figure 1. The Effect of Different F0 Characteristics on Formant Resolution. (A) Power spectrum of a 500 Hz tone without F0 modulation and three formants at 550 Hz, 1650 Hz, and 2750 Hz, labeled F1, F2, and F3, respectively. (B) Power spectrum with a 250 Hz unmodulated F0 and the same formant pattern as in (A). The lower F0 and concomitant closer harmonic spacing of the 250 Hz tone in (B) provide greater formant resolution. (C) A 500 Hz F0 tone modulated in a sinusoidal fashion at 50% of its mean value. The sinusoidal F0 modulation (SFM) highlights the same formant pattern shown in (A) and (B) in more detail. (D) A section of a giant panda bleat spectrogram in which the third and fourth harmonics (labeled Upper and Lower) of F0 strike the third formant. The darkening on the spectrogram indicates an increase in frequency amplitude as a harmonic coincides with the centre frequency of the formant, which is predicted to increase its perceptual salience.

According to the source-filter theory of voice production, the key components of mammal vocal signals are produced in two stages. A source signal is produced in the larynx and is characterized by its fundamental frequency (F0), which corresponds to the rate of vocal fold vibration in the larynx and determines the perceived pitch of the signal. This source signal is then filtered in the supra-laryngeal vocal tract, whose resonance properties determine the formants that appear as broadband frequency maxima in the sound spectrum and determine the timbre of the signal. One of the key assumptions of the source-filter theory is that F0 and formants can vary independently, and both have been shown to provide receivers with important biosocial information on the caller’s phenotype or motivational state. However, the value of F0 can also affect the resolution of the formants and hence the availability of any information encoded by them. For example, when F0 is high and the distance between the harmonic overtones (integer multiples of F0) is large, formant peaks are poorly resolved because the density of harmonics sampling the spectral envelope is relatively low (Figure 1). Consistent with this, studies involving human listeners have confirmed that both the discrimination of vowels (which relies on formant perception) and the discrimination of formant patterns in synthetic voice-like signals are poorer when F0 is raised.
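The effect of F0 on how densely harmonics sample the formants can be illustrated numerically. The following is a minimal sketch rather than the authors' analysis: it lists the harmonics of an unmodulated F0 and reports how far the nearest harmonic lies from each of the Figure 1 formant frequencies (550, 1650, and 2750 Hz). The function names and the 4 kHz upper limit are assumptions made for illustration.

```python
def harmonics(f0_hz, max_hz=4000.0):
    """Return the harmonic frequencies (integer multiples of F0) up to max_hz."""
    count = int(max_hz // f0_hz)
    return [f0_hz * k for k in range(1, count + 1)]


def nearest_harmonic_distance(f0_hz, formant_hz, max_hz=4000.0):
    """Distance in Hz from a formant centre to the closest harmonic."""
    return min(abs(h - formant_hz) for h in harmonics(f0_hz, max_hz))


FORMANTS_HZ = [550.0, 1650.0, 2750.0]  # F1-F3 as given in the Figure 1 caption

for f0 in (500.0, 250.0):
    distances = [nearest_harmonic_distance(f0, f) for f in FORMANTS_HZ]
    print(f"F0 = {f0:.0f} Hz: {len(harmonics(f0))} harmonics up to 4 kHz, "
          f"nearest-harmonic distances (Hz) = {distances}")
```

Halving F0 doubles the number of harmonics in the same band, so each formant peak tends to have a harmonic closer to its centre frequency.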

High spectral density can be achieved by a low F0 (Figure 1B) or broadband frequency noise/deterministic chaos. However, vocal signals with low F0 or deterministic chaos may be selected against in affiliative contexts because they are typically associated with aggressive intent. An alternative mechanism for increasing harmonic density in signals with relatively high F0s is to modulate F0 so that the harmonics scan a wider frequency bandwidth (Figure 1C). If F0 is repeatedly modulated, then the likelihood and number of times that harmonics cross vocal tract resonances increase (Figure 1D), which should improve the resolution and perceptual salience of formants. However, work testing this hypothesis on humans has yielded conflicting results, with some studies indicating that vibrato-like F0 modulation may slightly improve formant perception, while others suggest that rapid F0 modulation hinders formant perception. Whether F0 modulation improves the salience of formants therefore remains an open question.
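The scanning argument can be made concrete with a small sketch. It computes the frequency band swept by each harmonic when F0 is modulated and lists which harmonics can reach a given formant centre. This is an illustration under stated assumptions, not the authors' method: the modulation extent is treated as the peak-to-peak F0 swing expressed as a fraction of mean F0, F0 is assumed to swing symmetrically around its mean, and all function names are hypothetical.

```python
def harmonic_sweep(mean_f0_hz, extent, harmonic_number):
    """Frequency range (low, high) scanned by one harmonic.

    Assumes F0 swings symmetrically around its mean, with `extent` being
    the peak-to-peak swing as a fraction of the mean (0.5 = 50%).
    """
    low_f0 = mean_f0_hz * (1.0 - extent / 2.0)
    high_f0 = mean_f0_hz * (1.0 + extent / 2.0)
    return harmonic_number * low_f0, harmonic_number * high_f0


def harmonics_reaching(formant_hz, mean_f0_hz, extent, max_harmonic=20):
    """Harmonic numbers whose sweep includes the formant centre frequency."""
    hits = []
    for k in range(1, max_harmonic + 1):
        low, high = harmonic_sweep(mean_f0_hz, extent, k)
        if low <= formant_hz <= high:
            hits.append(k)
    return hits


# 500 Hz mean F0 and the three formants from the Figure 1 caption.
for extent in (0.0, 0.5):
    for formant in (550.0, 1650.0, 2750.0):
        hits = harmonics_reaching(formant, 500.0, extent)
        print(f"extent = {extent:.0%}, formant = {formant:.0f} Hz: harmonics {hits}")
```

Under these assumptions, no harmonic of an unmodulated 500 Hz F0 coincides with any of the three formant centres, whereas with modulation each formant is crossed by at least one harmonic.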

Figure 2. Psychoacoustic Test Stimuli and Results. (A and B) The upper panel shows spectrograms of vocal stimuli with (A) and without (B) sinusoidal F0 modulation (SFM). F0 = 250 Hz in both spectrograms; H1 and H2 refer to the first and second harmonic, respectively. The fourth formant (F4) is shifted up by 3% in the second presentation of each stimulus pair. Spectrogram settings: window length = 0.05 s, frequency step = 20 Hz, dynamic range = 40 dB, Gaussian window shape. (C) Estimates ± SEM of the proportion of correct classifications made by participants for the two F0 modulation conditions. (D) The relationship between the number of times that harmonics enter the F4 bandwidth (F4 strikes) and the percentage of correct classifications in the SFM stimuli. Each of the data points represents the mean correct classification rate across all 34 human subjects for each log10-transformed F4 strike value. See also Figure S1 for spectrograms of all the vocal stimuli.

In this study, we hypothesized that nonhuman animal vocalizations with sinusoidal F0 modulation (defined as an unbroken F0 contour that is modulated in a periodic, sinusoidal fashion), such as sheep bleats, horse whinnies, and marmoset trills, may have evolved to highlight the formant pattern in high F0 calls with concomitant low harmonic density. To investigate this hypothesis, we combined human psychoacoustic tests and phylogenetically controlled comparative analyses. Psychoacoustic tests on human listeners sought to determine whether sinusoidal F0 modulation (hereafter SFM) improves the perceptual salience of formants in non-speech, bleat-like stimuli based on a source-filter production mechanism (for stimulus preparation, see STAR Methods). Participants were presented with synthetic vocal stimuli with or without SFM, and across a range of different mean F0s. The stimuli were presented in pairs that either were identical or differed only in the value of the fourth formant (F4). F4 was chosen because it is more likely to vary with vocal tract length and voice individuality than the lower three formants (which are used to produce vowel sounds), yet it remains well within the range of peak human auditory sensitivity. To test whether humans can perceive subtle differences in formant patterns, we shifted F4 up 3%, or down 3%, from its original value of 3,850 Hz (Figures 2 and S1). Previous work indicates that formant shifts of 3% and above are perceptible to humans, whereas shifts below this magnitude are significantly harder to detect.

On the premise that F0 modulation increases the likelihood that harmonics cross formants (Figure 1D), we predicted that human participants would be best at detecting small shifts in the value of the fourth formant (F4) in stimuli with SFM (Figures 2A and S1). The best supported generalized estimating equations (GEE) model, with the lowest quasi-likelihood under independence model criterion (QIC) value, included the mean F0 of the stimulus pairs, the subject’s gender, and the interaction term “F0 modulation condition × gender.” This binary logistic GEE model revealed a significant main effect of F0 modulation condition (SFM versus no F0 modulation) on classification performance (correct or incorrect) (GEE: n = 34, Wald $\chi^2_{1,1223}$ = 28.32, p < 0.001): subjects were better at classifying the stimuli with SFM as sounding the same or different than they were for stimuli with no F0 modulation (Figure 2C). Men were overall better than women at classifying stimuli (Wald $\chi^2_{1,1223}$ = 6.66, p = 0.010), and a significant interaction between F0 modulation condition and gender (Wald $\chi^2_{1,1223}$ = 5.85, p = 0.016) revealed that the positive effect of SFM on formant perception was strongest in male participants. Classification rates were also higher for stimuli with lower mean F0 (Wald $\chi^2_{3,1221}$ = 45.29, p < 0.001).
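For a chi-square statistic with one degree of freedom, the upper-tail p value has the closed form p = erfc(√(χ²/2)), which allows a quick consistency check on the Wald statistics quoted for gender (6.66) and for the interaction term (5.85). The sketch below is a worked check using only the Python standard library, not a reimplementation of the GEE model; it assumes that both tests have one degree of freedom, and the function name is hypothetical.

```python
import math


def chi2_sf_1df(statistic):
    """Upper-tail p value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Wald statistics quoted in the text for gender and for the interaction term.
for label, wald in (("gender", 6.66), ("F0 modulation x gender", 5.85)):
    print(f"{label}: Wald chi-square = {wald}, p = {chi2_sf_1df(wald):.3f}")
```

The two values round to 0.010 and 0.016, in agreement with the p values reported in the text.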

To further explore how SFM improves the perception of F4 shifts, we calculated the number of times a harmonic entered the F4 bandwidth for the different classes of SFM stimuli (STAR Methods) and performed a linear regression between these values and the mean proportion of correct classifications. The number of times that a harmonic crossed into the F4 bandwidth in the SFM stimuli (F4 strikes) was positively correlated with the mean proportion of correct classifications ($F_{1,9}$ = 5.58, p = 0.042) (Figure 2D). Taken together, the results of the psychoacoustic tests confirm that SFM improves the ability of human listeners to perceive small shifts in the formant pattern of bleat-like stimuli and indicate that improved perception of formant variation in the SFM stimuli is driven by an increase in the number of times that harmonics strike formants.
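One way to count F4 strikes is sketched below. The actual procedure is described in the STAR Methods and is not reproduced here, so this is an illustration under assumptions: a strike is counted each time any harmonic moves from outside to inside a band centred on F4, and the 200 Hz bandwidth, 8 Hz modulation rate, 20% modulation extent, 1 s duration, and 1 ms sampling step are hypothetical values. Only the 250 Hz F0 and the 3,850 Hz F4 come from the text.

```python
import math


def f0_contour(mean_f0_hz, extent, rate_hz, duration_s, step_s=0.001):
    """Sampled sinusoidal F0 contour; `extent` is peak-to-peak swing / mean."""
    n = int(round(duration_s / step_s)) + 1
    amplitude = mean_f0_hz * extent / 2.0
    return [mean_f0_hz + amplitude * math.sin(2.0 * math.pi * rate_hz * i * step_s)
            for i in range(n)]


def count_strikes(contour_hz, formant_hz, bandwidth_hz, max_harmonic=40):
    """Count the times any harmonic moves from outside to inside the formant band."""
    low = formant_hz - bandwidth_hz / 2.0
    high = formant_hz + bandwidth_hz / 2.0
    strikes = 0
    for k in range(1, max_harmonic + 1):
        inside_before = low <= k * contour_hz[0] <= high
        for f0 in contour_hz[1:]:
            inside_now = low <= k * f0 <= high
            if inside_now and not inside_before:
                strikes += 1
            inside_before = inside_now
    return strikes


flat = f0_contour(250.0, 0.0, 8.0, 1.0)        # no F0 modulation
modulated = f0_contour(250.0, 0.2, 8.0, 1.0)   # hypothetical 20% SFM at 8 Hz
print("strikes without SFM:", count_strikes(flat, 3850.0, 200.0))
print("strikes with SFM:", count_strikes(modulated, 3850.0, 200.0))
```

Because an unmodulated contour never changes, it produces no strikes by this definition, while any modulation wide enough to carry a harmonic across the band edge produces repeated strikes.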

Figure 3. Examples of Sinusoidal F0 Modulation in Mammals. (A and B) Phylogeny used to control for shared ancestry between different mammal species (A) and examples of SFM calls from each of the 21 species that produce these vocalizations (B), often called “bleats” or “trills”. Red stars denote species that produce SFM calls. Spectrograms were generated using the following settings: window length = 0.01–0.05 s, frequency step = 20 Hz, dynamic range = 40 dB, Gaussian window shape. For the Baird’s pocket gopher, pygmy marmoset, and gray-headed flying fox, the spectrograms were taken from the literature because we could not obtain recordings for these species. (C) How the F0 characteristics of SFM calls were measured. Complete cycles of SFM are labeled FM1, FM2, etc. See also Tables S1 and S3 for acoustic data on SFM calls and the sample composition.

Figure 4. Support for the Hypothesis that Sinusoidal F0 Modulation Functions to Highlight Formants in Calls with Low Spectral Density. (A) Estimates ± SEM of log10 harmonics-to-formant ratio for SFM (n = 21) and other call types (n = 71) derived from a Brownian motion + Pagel’s lambda (λ) model with log10 body mass as a covariate. SFM calls have a significantly lower harmonics-to-formant ratio than other call types. (B) The relationship between log10 harmonics-to-formant ratio and % F0 modulation in the SFM calls of 21 mammal species. The dotted line represents the slope and intercept obtained from the PGLS regression. % F0 modulation is significantly higher in SFM calls with lower harmonics-to-formant ratios. See also Tables S1 and S3.

Having established that SFM improves formant perception in human listeners, and making the assumption that these observations could generalize to other mammalian species, we then used phylogenetically controlled comparative analyses to investigate the possible factors behind the evolution of SFM in nonhuman mammals. Data on mean F0 for 92 mammal species (Table S1) revealed that SFM has evolved independently in six mammalian orders: Carnivora, Primates, Artiodactyla, Perissodactyla, Rodentia, and Chiroptera (Figure 3). We predicted that SFM should evolve in nonhuman mammal calls with fewer harmonics per formant when compared to other call types. A Brownian motion + Pagel’s lambda (λ) phylogenetic generalized least-squares (PGLS) regression model with log10 body mass as a covariate (Table S2) was used to examine the relationship between harmonics-to-formant ratio (see STAR Methods for estimation of this parameter) and call type (SFM calls versus all other call types). SFM calls (n = 21 species) had a significantly lower log10 harmonics-to-formant ratio than other call types (n = 71 species) (PGLS: estimate ± SE = −0.240 ± 0.094, λ = 0.71, $t_{3,89}$ = −2.419, p = 0.013) (Figure 4A).
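The text defers the estimation of the harmonics-to-formant ratio to the STAR Methods. As a rough illustration only, the sketch below assumes that the ratio is the predicted formant spacing divided by mean F0, with formant spacing taken from the standard uniform-tube approximation ΔF = c / (2 · VTL), where VTL is vocal tract length. The speed of sound (350 m/s), the 17.5 cm vocal tract, and the F0 values are hypothetical inputs, not data from the study, and the function names are invented for this example.

```python
import math

SPEED_OF_SOUND_M_S = 350.0  # assumed value for warm, humid air in the vocal tract


def formant_spacing_hz(vocal_tract_length_m):
    """Formant spacing of a uniform tube closed at one end: c / (2 * VTL)."""
    return SPEED_OF_SOUND_M_S / (2.0 * vocal_tract_length_m)


def harmonics_per_formant(mean_f0_hz, vocal_tract_length_m):
    """Predicted formant spacing divided by the harmonic spacing (mean F0)."""
    return formant_spacing_hz(vocal_tract_length_m) / mean_f0_hz


# Hypothetical callers: the same 17.5 cm vocal tract with different mean F0s.
for f0 in (100.0, 250.0, 500.0):
    ratio = harmonics_per_formant(f0, 0.175)
    print(f"F0 = {f0:.0f} Hz: ratio = {ratio:.2f}, log10 ratio = {math.log10(ratio):.3f}")
```

Under this assumed definition, raising F0 while holding vocal tract length constant lowers the ratio, which is the sense in which high-F0 calls have low spectral density.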

Finally, we examined the relationship between the harmonics-to-formant ratio and percent modulation of F0 in SFM calls from 21 mammal species (Table S3). We predicted that F0 would be modulated over a greater frequency range in species that produce SFM calls with relatively lower harmonics-to-formant ratios, and in which the harmonics would be required to scan a larger frequency range in order to strike formants. A Brownian motion evolutionary model with habitat, social structure, and log10 body mass as covariates (Table S4) revealed that % F0 modulation was negatively correlated with log10 harmonics-to-formant ratio (estimate ± SE = −13.249 ± 5.185, $t_{5,16}$ = −2.555, p = 0.021) (Figure 4B). This result indicates that F0 is modulated over a relatively wider frequency range in SFM calls with fewer harmonics per formant, supporting the contention that SFM functions to scan the spectrum so that harmonics are more likely to excite formants.
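The exact definition of % F0 modulation is not given in this section, so the following sketch shows only one plausible way such a measure could be computed from the complete modulation cycles (FM1, FM2, etc.) marked on a call. It assumes that the measure is the peak-to-peak F0 swing of each cycle expressed as a percentage of that cycle's midpoint F0, averaged across cycles. The function name and the example measurements are hypothetical.

```python
def percent_f0_modulation(cycles):
    """Mean peak-to-peak F0 swing per cycle, as a percentage of the cycle midpoint.

    `cycles` is a list of (f0_min_hz, f0_max_hz) pairs, one per complete
    modulation cycle. The cycle mean is approximated by the midpoint of the
    minimum and maximum F0.
    """
    if not cycles:
        raise ValueError("at least one complete cycle is required")
    percents = []
    for f0_min, f0_max in cycles:
        midpoint = (f0_min + f0_max) / 2.0
        percents.append(100.0 * (f0_max - f0_min) / midpoint)
    return sum(percents) / len(percents)


# Hypothetical measurements for three cycles of one call.
example_cycles = [(450.0, 550.0), (440.0, 560.0), (460.0, 540.0)]
print(f"% F0 modulation = {percent_f0_modulation(example_cycles):.1f}")
```

Expressing the swing relative to F0 rather than in absolute Hz keeps the measure comparable across species whose mean F0s differ widely.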

The results of these psychoacoustic and comparative investigations provide insights into the convergent evolution of a distinctive form of mammal vocalization that is often referred to as a bleat or trill and is characterized by periodic SFM. The psychoacoustic tests on human subjects confirmed that participants were significantly better at detecting small differences in the formant pattern of synthetic vocal stimuli with SFM. We also provide clear evidence that formant perception in the SFM stimuli is improved as modulating harmonics cross formants to provide them with excitation energy, and, consistent with previous findings, we found that men performed better than women in the discrimination task. Taken together, the results of the psychoacoustic tests support the general hypothesis that SFM enhances formant perception and could therefore have evolved in nonhuman mammals for this purpose. Indeed, these observations are likely to generalize to nonhuman mammal receivers, as formants have been shown to be both perceptually discriminable and important in size and identity communication across a wide range of mammalian species.

The subsequent comparative analysis revealed that SFM occurs in mammal calls characterized by a relatively higher F0 (and therefore a lower expected harmonic density in the absence of modulation) and that SFM calls with fewer harmonics per formant tend to be modulated over greater frequency ranges. These results provide strong support for the hypothesis that SFM functions to highlight formants in calls with low harmonic densities, in which an unmodulated F0 would not produce sufficient spectral density to resolve the formants. Interestingly, a spectral density of less than 4.4 harmonics per formant is known to significantly impair human formant perception for vowel discrimination, and all of the species producing calls with SFM had a ratio of harmonics to predicted formant spacing of 4.4 or less (Table S3). Accordingly, we suggest that once the harmonic density drops to around 4 harmonics per formant, SFM becomes an effective mechanism for highlighting functional information encoded by formants.

Mammal calls often have distinctive formant patterns, attributed to individual differences in vocal tract morphology, and several studies in humans and nonhuman mammals have confirmed the importance of formants as cues to individual identity. We therefore suggest that SFM is important for highlighting individually distinctive formant patterns in a range of nonhuman mammal calls. Another mechanism for highlighting formants is to produce calls with a low F0 or broadband frequency noise; however, because low-frequency or noisy calls are typically associated with aggression, SFM is likely to be favored for highlighting formant-related information in affiliative contexts. Although the exact social context of SFM call production is not documented for all of the species in the analysis, only one of the 21 SFM call types in the dataset is produced in an agonistic context (the giant otter scream) (Table S3). All of the other examples are produced in nonaggressive contexts when animals are thought to be promoting contact with conspecifics and in which identity cueing is likely to be important, such as mother-offspring contact, promoting contact with mating partners, and reuniting with social group members. It is also possible that other characteristics of SFM calls are used for identity cueing in these contexts. For example, the rate and extent of F0 modulation contribute to individual vocal distinctiveness in Australian sea lions and goats, respectively, and may therefore be individually distinctive components of SFM calls in other species.

In conclusion, these investigations have provided a highly plausible scenario for the convergent evolution of bleat-like calls in terrestrial mammal vocal signals and highlight the importance of an interdisciplinary approach to tackling questions about the evolution of mammal vocal diversity. Future work should probe the ability of nonhuman mammals to discriminate between different callers using re-synthesized SFM calls with varying levels of F0 modulation. Investigations could also be extended to other vertebrates and to vocal signals that do not have SFM but nevertheless contain strong F0 modulation that may function to highlight important formant-related information. By adopting a phylogenetically controlled comparative approach, these studies may reveal other examples of convergent evolution in vocal signal structure and allow researchers to gain a better understanding of how and why certain features of animal vocalizations evolve independently in distantly related species.