Previously it was shown that male mice, when they encounter female mice or their pheromones, emit ultrasonic vocalizations with frequencies ranging over 30–110 kHz. Here, we show that these vocalizations have the characteristics of song, consisting of several different syllable types, whose temporal sequencing includes the utterance of repeated phrases. Individual males produce songs with characteristic syllabic and temporal structure. This study provides a quantitative initial description of male mouse songs, and opens the possibility of studying song production and perception in an established genetic model organism.

Copyright: © 2005 Holy and Guo. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The term “song” has been used with a variety of connotations, so that Broughton [ 13 ] offers three different definitions of song: a sensu latissimo, a “sound of animal origin which is not both accidental and meaningless,” which includes relatively simple vocalizations often described as “calls”; a sensu stricto, “a series of notes [or syllables], generally of more than one type, uttered in succession and so related as to form a recognizable sequence or pattern in time”; and a sensu strictissimo, “a complete succession of periods or phrases,” in which a song consists of several distinct motifs, often delivered in a characteristic sequence.

As the terminology used to describe animal vocalizations is varied, we adopt the following definitions. A “syllable” is a unit of sound separated by silence from other sound units [ 14 ]; it may consist of one or more “notes,” continuous markings on a sonogram. A “syllable type” is a category of syllable, observed regularly in the animal's vocalization, distinct from other syllable types. A “phrase” is a sequence of syllables uttered in close succession. A “phrase type” or “motif” is a sequence of several syllables, falling into one or more syllable types, where the entire sequence is observed repeatedly in the animal's vocalization.

Here, we provide a quantitative description of the ultrasonic vocalizations of the adult male mouse, and show that they display unexpected richness, including several syllable types organized into phrases and motifs. Thus, these vocalizations display the characteristics of song [ 1 , 3 , 13 ]. Different males, even though genetically identical, show small but significant differences in syllable usage and the temporal structure of their songs. These results indicate that communication among mice may be more complex than previously appreciated. Because of the ubiquity of the mouse for physiological and genetic investigations, these observations may lead to new opportunities in studies of the biological basis of song production and perception.

While numerous studies have focused on the circumstances leading to ultrasound production, few have examined the sounds themselves. Sales [ 7 ] observed that these vocalizations consisted of a series of discrete utterances, with species-specific differences in vocalizations. Some diversity was also noted among the utterances within a species [ 6 , 7 ], but it was not determined whether this latter variability was continuous—as in the case, for example, of the “random” variability observed when a single word is spoken many times—or whether the utterances fall into distinct categories. In a recent quantitative study of mouse vocalizations, Liu et al. [ 12 ] studied changes in pup vocalizations during the first 2 wk after birth, and compared these to adult vocalizations. However, this study focused only on the aggregate properties of vocalizations, measuring parameters such as median pitch and call rate, which, if applied to humans, would be more analogous to “voice” than to speech. To date, no study that we know of has examined whether the discrete utterances consist of distinct syllable types, or whether these vocalizations have significant temporal structure.

Rodents produce a variety of social vocalizations, including vocalizations audible to humans, like postpartum sounds and distress calls, as well as ultrasonic vocalizations [ 5 , 6 ]. In mice, ultrasonic vocalizations utilize frequencies higher than 30 kHz [ 7 ], and therefore cannot be detected directly by human ears. A number of studies have shown that mice produce ultrasonic vocalizations in at least two situations: pups produce “isolation calls” when cold or when removed from the nest [ 8 ], and males emit “ultrasonic vocalizations” in the presence of females or when they detect their urinary pheromones [ 6 , 9 – 11 ]. Most commonly, these sounds have been recorded using a detector with narrow frequency tuning [ 9 , 10 ], which suffices to estimate the amount of vocalization. However, because of its narrow frequency tuning, such a detector does not record the acoustical details of these vocalizations.

Many animals communicate using sound. Often, brief sounds are produced to warn of danger or mediate aggressive encounters. Some species, however, produce long sequences of vocalizations often called “songs.” Most commonly, these long sequences are generated as a part of courtship. For example, many insects and amphibians [ 1 ] advertise their presence and identity with a single type of utterance—which, depending on the species, might be described as a chirp, click, or whine—repeated several times to form a “phrase,” with silent gaps between phrases. The utterance, its repetition rate, and the number of repetitions in a phrase are characteristic of the species [ 1 ]. More complex vocalizations are observed in many birds [ 2 ], as well as in a few mammals such as whales [ 3 ] and bats [ 4 ]. These species generate multiple types of sounds organized in more intricate phrases.

Results

Listening to Ultrasonic Vocalizations

To induce ultrasonic vocalizations, male mice of the B6D2F1 strain were presented with sex-specific odors applied on cotton swabs (Figure 1). We tested dilute urine of either sex (BALB/c strain) and mixtures of urine from both sexes. (The correspondence between stimulus identity and vocal response will be reported elsewhere.) We recorded all sounds in the chamber with a microphone with flat frequency response from 20 Hz to 100 kHz. While these vocalizations are well beyond the range of human hearing, we make them audible through two techniques. Most straightforward is to play them back slowly. When slowed 16×, these vocalizations sound like a series of breathy whistles (Audio S1). However, slow playback makes it difficult for human listeners to develop an appreciation of the temporal sequence of the vocalizations. Using a phase vocoder algorithm [15], the pitch of these vocalizations can be dropped several octaves without lengthening the duration of the playback. These pitch-shifted vocalizations are reminiscent of birdsong (Audio S2). Readers are urged to listen to these recordings.
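Slowed playback of the kind described above amounts to relabeling a recording with a lower sampling rate, which lowers pitch and stretches duration by the same factor. The following is a minimal Python sketch, assuming the recording is available as PCM WAV data. The function names `slow_down` and `make_tone`, and the 256 kHz sampling rate used in the usage note, are illustrative assumptions rather than details of the study. The phase-vocoder pitch shift [15] is a different technique and is not shown.

```python
import io
import math
import struct
import wave


def slow_down(wav_bytes: bytes, factor: int = 16) -> bytes:
    """Return the same samples tagged with a sampling rate `factor` times lower.

    Playing the result back lowers every frequency by `factor` and
    lengthens the recording by the same factor.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate // factor)  # only the header changes
        dst.writeframes(frames)
    return out.getvalue()


def make_tone(freq_hz: float, rate_hz: int, n_samples: int) -> bytes:
    """Hypothetical helper: a mono 16-bit sine tone, used only for the demo."""
    samples = [
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / rate_hz))
        for i in range(n_samples)
    ]
    out = io.BytesIO()
    with wave.open(out, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate_hz)
        w.writeframes(struct.pack("<%dh" % n_samples, *samples))
    return out.getvalue()
```

For example, `slow_down(make_tone(64000.0, 256000, 2560), 16)` yields a file whose header declares 16 kHz, so a 64 kHz tone would be heard at 4 kHz and last 16 times longer.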


Figure 1. Male Mice Vocalize in the Ultrasound after Olfactory Exploration of Urinary Cues.
A cotton swab containing female mouse urine (top) was introduced at approximately 30 s into a 210-s trial. Arrow indicates the time of first contact with the cotton swab. Recorded acoustical power is represented as a function of time and frequency, with shading increasing with power. Power below 25 kHz was truncated. Bottom, an expansion of a 2-s period showing vocalizations in greater detail. Individual syllables, as identified by an automated algorithm, are spanned by magenta lines below. https://doi.org/10.1371/journal.pbio.0030386.g001

Elementary Features of Vocalizations

Male mouse ultrasonic vocalizations consisted of a rapid series of “chirp-like” syllables in the 30–110 kHz band (Figure 1). Syllables were of varying duration (approximately 30–200 ms), uttered at rates of about ten per second. Most syllables involved rapid sweeps in frequency, with rates of approximately 1 kHz/ms typical. Over tens of seconds, periods of closely spaced syllables alternated with periods of silence. These features of adult male vocalizations, and their analogs for the isolation calls of mouse pups, have been previously described [7,12].

The microphone recorded a variety of sounds in the test chamber, including noises from movement, gnawing, contact with the cage wall, audible squeaks, and ultrasonic vocalizations. For the purposes of this study, we excluded sounds other than ultrasonic vocalizations. The majority of extraneous sounds fell below 30 kHz, and were excluded by selecting the appropriate frequency band. However, some sounds, particularly brief “snaps,” penetrated into the frequency band of the ultrasonic vocalizations. We developed an automated algorithm to recognize ultrasonic vocalizations in terms of their generic features. Subjectively, the algorithm appears no worse than a well-trained human in identifying these vocalizations (see Materials and Methods; Figure 1).

Features of Syllables: Pitch Changes

As reported previously [7], inspection (Figure 1) suggests that some syllables involve relatively sudden, large changes (“jumps”) in frequency. To determine whether these frequency jumps are stereotyped or random, we analyzed a collection of 750 syllables uttered by one mouse in a single 210-s trial. We simplified our description of each syllable by extracting the dominant frequency (the “pitch”) as a function of time (Figure 2A). For each syllable, we compared the pitch at one moment with the pitch in the next time bin, approximately 1 ms later. These pitch pairs were pooled for all 750 syllables, resulting in a total of 31,303 consecutive pitch pairs. This analysis (Figure 2B) revealed four distinct clusters of pitch changes. The long cluster along the diagonal corresponds to the gradual shift in pitch occurring at most time points in all syllables. Two distinct off-diagonal clusters reveal large, stereotyped jumps to or from comparatively low frequencies (35–50 kHz). These downward (“d”) and upward (“u”) jumps are often paired in a syllable (see below and insets for Figure 2B), and will be collectively described as “low jumps.” The cluster just below the diagonal, containing transitions from 70–90 kHz down to 55–70 kHz, results from a third type of jump (“high jump,” or “h”). These jumps were often, but not exclusively, associated with a brief “grace note” at the beginning of a syllable (see jump labeled “h” in lower inset, Figure 2B).


Figure 2. Characterization of Pitch Changes during Syllables.
(A) Two examples of syllables, represented in terms of their sonogram (top member of each pair of panels) and the extracted pitch versus time (bottom member of pairs). (B) Plot of pitch at one time point versus the next time point (Δt = 1.02 ms). All such pitch pairs in all syllables from a single trial with 750 syllables are shown, representing a total of 31,303 pitch changes. Particular pitch jumps are placed within the context of their individual syllables at right (top syllable, 98 ms in duration; bottom syllable, 33 ms in duration). (C) Pitch pairs analyzed for single 210-s trials from 45 different mice, containing in aggregate 15,543 syllables and over 600,000 pitch pairs. The distribution of pitch pairs is represented as a two-dimensional histogram; the correspondence between grayscale and number of observations is indicated in the color bar at right. Polygons define the clusters corresponding to the three jump types “u,” “h,” and “d.” (D) Numbers of each type of pitch jump per trial (45 mice, one trial each). https://doi.org/10.1371/journal.pbio.0030386.g002

These pitch jumps were identified in Figure 2B from a single 210-s recording from one mouse. To determine whether these jumps are stereotypic features of the ultrasonic vocalizations of all male mice, we performed the same analysis for a 210-s trial from each of 45 different males. The pitch changes in adjacent time bins are pooled across mice in Figure 2C. Both the number of clusters and their positions and sizes are essentially unchanged, and examples of all three types of jumps were broadly distributed across mice (Figure 2D). Thus, at least for similarly aged males of the B6D2F1 strain, these pitch jumps are a universal feature of ultrasonic vocalizations.
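The pitch-pair analysis can be illustrated with a short Python sketch. It assumes each syllable has already been reduced to a pitch track in kHz, sampled about once per millisecond. The rectangular frequency windows below are taken loosely from the ranges quoted in the text, and the 5 kHz minimum jump size is an assumption added to separate jumps from gradual sweeps; the study itself defines the clusters with polygons (Figure 2C), which are not reproduced here. All function and constant names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Assumed rectangular stand-ins for the cluster polygons of Figure 2C.
LOW_BAND_KHZ = (35.0, 50.0)    # low-frequency band reached by "d", left by "u"
HIGH_FROM_KHZ = (70.0, 90.0)   # starting range of an "h" jump
HIGH_TO_KHZ = (55.0, 70.0)     # landing range of an "h" jump
MIN_JUMP_KHZ = 5.0             # assumption: smaller changes count as gradual


def pitch_pairs(pitch_khz: Sequence[float]) -> List[Tuple[float, float]]:
    """Pair the pitch in each time bin with the pitch in the next bin."""
    return list(zip(pitch_khz, pitch_khz[1:]))


def classify_pair(f_now: float, f_next: float) -> str:
    """Label one pitch pair as 'gradual', 'd', 'u', 'h', or 'other'."""
    if abs(f_next - f_now) < MIN_JUMP_KHZ:
        return "gradual"
    now_low = LOW_BAND_KHZ[0] <= f_now <= LOW_BAND_KHZ[1]
    next_low = LOW_BAND_KHZ[0] <= f_next <= LOW_BAND_KHZ[1]
    if next_low and f_now > LOW_BAND_KHZ[1]:
        return "d"  # downward jump into the low band
    if now_low and f_next > LOW_BAND_KHZ[1]:
        return "u"  # upward jump out of the low band
    if (HIGH_FROM_KHZ[0] <= f_now <= HIGH_FROM_KHZ[1]
            and HIGH_TO_KHZ[0] <= f_next <= HIGH_TO_KHZ[1]):
        return "h"  # high jump
    return "other"


def jump_signature(pitch_khz: Sequence[float]) -> str:
    """Concatenate the jump labels found in one syllable, in temporal order."""
    labels = (classify_pair(a, b) for a, b in pitch_pairs(pitch_khz))
    return "".join(label for label in labels if label in ("d", "u", "h"))
```

With these assumed windows, a made-up pitch track such as `[80, 79, 78, 45, 44, 45, 76, 77]` is labeled `"du"`: a downward low jump followed by an upward one.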

Pitch Jumps and Mechanisms of Sound Production

Many syllables with low jumps display both a fundamental frequency and a faint first harmonic during the low-frequency period (Figure 3A; see also Figures 1 and 2A). The frequency of the harmonic is almost precisely twice that of the fundamental, suggesting the involvement of a resonator in the production of these sounds. A priori, this resonator might be the vocal folds of the larynx. However, based on the effect of partial replacement of air with helium, Roberts [16] argued that these sounds are not produced by the vibration of vocal cords. Instead, he proposed that ultrasound arises from an aerodynamic whistle, and showed that mechanical whistles can produce sounds similar to the examples described by Sales [7], including pitch jumps.


Figure 3. Features of Vocalizations Relating to Mechanisms of Sound Production.
(A) Syllable with both a fundamental and first harmonic. (B) Abundance of frequency (vertical axis is frequency, continued from [A]) in syllables with (LJ+) and without (LJ−) low jumps. (C) Average pitch (top) and mean ± standard deviation log10(power) (bottom) as a function of time, surrounding a downward low jump (for syllables with low jumps) or surrounding the upward crossing of 75 kHz (for syllables without low jumps). Power units are arbitrary but consistent between syllable types. Color scheme is as in (B). (D) Syllable showing extensive temporal overlap and independent frequency modulation among the different notes in the syllable. Syllables are from the same trial analyzed in Figure 2B. https://doi.org/10.1371/journal.pbio.0030386.g003

While our recordings appear largely consistent with Roberts's results, several features of these vocalizations indicate that their production is more sophisticated than that of a whistle from a rigid, static pipe. The rigid whistles investigated by Roberts had a characteristic relationship between frequency and fluid velocity [16]. Frequency was fairly stable over a range of velocities, and would suddenly jump to a new frequency at yet higher or lower velocities. In contrast, the pitch of mouse vocalizations is modulated considerably, in both a continuous and discrete (jump) fashion. Despite their stereotyped form, jumps were not obligatory upon reaching a particular frequency. While down-type jumps began from frequencies of 65–80 kHz (see Figure 2B), these frequencies were well-sampled even in syllables that lack these jumps (Figure 3B). Furthermore, if jumps were produced by changes in air velocity, one might expect to see differences in vocal power between cases where jumps do and do not occur.
In contrast with this expectation, the power distributions of syllables both with and without “d” jumps overlap considerably (Figure 3C), although variability in the mouse's head position and orientation relative to the microphone could obscure a true relationship. Finally, the fine-scale temporal structure of pitch jumps appears to be inconsistent with the nonlinear properties of purely static whistles. During a downward low jump, the pitch of the preceding phase overlaps in time with the pitch in the succeeding phase (Figure 3A), often by 5–10 ms. This behavior is apparently not observed in pitch jumps arising from mode-locking nonlinearities [17], where changes in pitch are nearly instantaneous. In a few cases, both tones were present simultaneously for longer periods, with one frequency modulated and the other nearly fixed (Figure 3D). In birdsong, similar observations were used by Greenewalt [18] to posit two sites of sound production—specifically, that birds could independently control the left and right sides of their syrinx. This assertion was later confirmed directly [19]. Examples such as Figure 3D may indicate that mice have at least two sites of ultrasound production. However, the strength of this conclusion is tempered by our incomplete knowledge of the nonlinear properties of aerodynamic whistles [20].

Temporal Sequencing of Syllables

In sonograms of mouse vocalizations, complex syllable sequences can be identified: Figure 6A shows an example of a phrase in which three “hdu” syllables with descending low-frequency bands (labeled “A”) are followed by six “hdu” syllables with ascending low-frequency bands (labeled “B”); the phrase is finished off by an “h” syllable (almost a SS, but for the brief grace note), an A-type “hdu,” and an SS (Audio S3). An example of a motif can be seen in Figure 6B, in which a phrase beginning with 2–3 SSs followed by 6–8 “du” syllables is repeated three times. The consistency of this repeated sequence, in the context of the whole, is easily noted in pitch-shifted playbacks (Audio S4). Finally, there are regularities in the syllable types over longer time scales. Figure 6C shows an example of a trial that begins with a series of SSs, has a middle period with many syllables containing low jumps, and ends with repeated blocks of “h” syllables.

To determine whether such examples are statistically significant, we investigated the temporal structure of these vocalizations quantitatively in terms of two models of syllable selection. To simplify the analysis, we grouped syllables into only two categories, depending on whether they did (“1”) or did not (“0”) contain one or more low jumps. We considered whether individual syllables might be selected randomly. In the first model, we tested whether the probability of selecting a syllable was based purely on the prevalence of each type, so that each syllable is selected independently of all others. In the second model, the selection probability depended on the identity of the previous syllable (Figure 7A): from the data, we calculated the conditional probability p_{i→j} to choose a syllable of type j after a syllable of type i (i, j = 0, 1).
We also used a third state (a “gap”) to represent a silent period lasting more than 0.5 s, to ensure that the analyzed state transitions occurred within a phrase. Omitting the gap state from the model did not qualitatively change the results.
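The transition-probability model can be sketched in a few lines of Python. The sketch assumes each syllable is summarized as an onset time, an offset time (both in seconds), and a flag for whether it contains a low jump. A gap state is inserted wherever the silence between syllables exceeds 0.5 s, as in the text. The function names and the tuple layout are hypothetical, and the timing values in the usage note are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

GAP_S = 0.5  # silent periods longer than this become a "g" state


def to_states(syllables: Sequence[Tuple[float, float, bool]]) -> List[str]:
    """Convert (onset_s, offset_s, has_low_jump) tuples to '0', '1', 'g' states."""
    states: List[str] = []
    previous_offset = None
    for onset, offset, has_low_jump in syllables:
        if previous_offset is not None and onset - previous_offset > GAP_S:
            states.append("g")
        states.append("1" if has_low_jump else "0")
        previous_offset = offset
    return states


def transition_probabilities(states: Sequence[str]) -> Dict[Tuple[str, str], float]:
    """Estimate p(i -> j) as the fraction of transitions out of i that go to j."""
    counts = Counter(zip(states, states[1:]))
    totals: Counter = Counter()
    for (i, _j), n in counts.items():
        totals[i] += n
    return {(i, j): n / totals[i] for (i, j), n in counts.items()}
```

For instance, five syllables timed as `[(0.00, 0.05, False), (0.10, 0.15, True), (0.20, 0.25, True), (1.00, 1.05, False), (1.10, 1.15, True)]` give the state sequence `0, 1, 1, g, 0, 1`, from which the estimated probability of `1 → 1` is 0.5 and that of `0 → 1` is 1.0.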


Figure 7. Quantitative Modeling of Syllable Temporal Sequences.
(A) A three-state Markov model, where the states correspond to syllables with (“1”) or without (“0”) low jumps, and to a gap of greater than 0.5 s in the sequence. Arrows indicate possible choices for the next state; transition probabilities are calculated from the observed sequence of syllables and gaps. (B) Observed numbers of the eight distinct three-syllable combinations, and the number expected from two models: “syllable prevalence” picks the next syllable randomly based on the proportion of each type, whereas “transition probability” employs the Markov model diagrammed in (A). (C) Comparison of transition probabilities to type 1 syllables with the prevalence of type 1 syllables. “Prevalence of 1” is n_1/(n_0 + n_1), where n_i is the number of syllables of type i; prevalence of transition g→1 is calculated as n_{g→1}/(n_{g→0} + n_{g→1}), where n_{i→j} is the number of observed transitions from state i to state j (g = gap); and prevalence of 1→1 is n_{1→1}/(n_{1→0} + n_{1→1}). Each point represents the results from a single trial, of 81 qualifying trials (see text). https://doi.org/10.1371/journal.pbio.0030386.g007

We then examined the prevalence of all possible three-syllable combinations (see Materials and Methods) in terms of these two models. As shown in Figure 7B, the first model, based purely on prevalence, does a poor job of predicting the distribution of three-syllable combinations (p << 10^−10). The transition-probability model provides a much more accurate description of the temporal structure. However, it, too, is insufficient (p ≈ 10^−6) to capture all of the higher-order structure of these three-syllable sequences. Similar conclusions apply to four- and five-syllable sequences. Therefore, we find that syllables are not chosen independently in random order.
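The comparison of observed and expected three-syllable counts can be sketched as follows. This Python example is a simplified illustration, not the procedure of Materials and Methods: it ignores the gap state, counts overlapping triplets in a single sequence of `"0"` and `"1"` characters, and summarizes the misfit with a chi-square statistic without converting it to a p-value. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Dict, Tuple

Triplet = Tuple[str, str, str]


def triplet_counts(seq: str) -> Counter:
    """Count overlapping three-syllable combinations in a '0'/'1' sequence."""
    return Counter(zip(seq, seq[1:], seq[2:]))


def expected_prevalence(seq: str) -> Dict[Triplet, float]:
    """Expected counts if every syllable is drawn independently by prevalence."""
    n_triplets = len(seq) - 2
    prev = {s: seq.count(s) / len(seq) for s in "01"}
    return {t: n_triplets * prev[t[0]] * prev[t[1]] * prev[t[2]]
            for t in product("01", repeat=3)}


def expected_markov(seq: str) -> Dict[Triplet, float]:
    """Expected counts if each syllable depends only on the previous one."""
    pairs = Counter(zip(seq, seq[1:]))
    row = {i: pairs[(i, "0")] + pairs[(i, "1")] for i in "01"}
    p = {(i, j): pairs[(i, j)] / row[i] for i in "01" for j in "01"}
    first = Counter(seq[:-2])  # how often each type starts a triplet
    return {t: first[t[0]] * p[(t[0], t[1])] * p[(t[1], t[2])]
            for t in product("01", repeat=3)}


def chi_square(observed: Counter, expected: Dict[Triplet, float]) -> float:
    """Sum of (observed - expected)^2 / expected over all combinations."""
    return sum((observed[t] - e) ** 2 / e for t, e in expected.items() if e > 0)
```

On a made-up sequence with strong block structure, such as `"0000011111" * 20`, the transition-probability expectation gives a much smaller chi-square statistic than the prevalence expectation, which mirrors the qualitative ordering of the two models reported above.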
From examples of raw sonograms (see Figure 1), it appears that type 1 syllables (those with low jumps) tend to be grouped in blocks. To examine this aspect of sequencing, we compared the prevalence of type 1 syllables against the likelihood that the next syllable after a type 1 would also be a type 1. For the example in Figure 7B, 258/750 (34%) of syllables were of type 1, but the likelihood of a successive type 1 was much higher (58%). On the basis of counting statistics (binomial distribution), this difference is highly significant (p << 10^−10). To determine whether this tendency to repeat low-jump syllables is a universal feature of these vocalizations, we recorded the vocalizations of 45 socially experienced males over a period of 3 wk. Over the 3 wk, each animal participated in nine trials, each 210 s in duration, during which the male was presented with either a blank (non-odorized) cotton swab or one with 20 μl of dilute mouse urine (see Materials and Methods). Of the more than 400 trials, 81 (from a total of 32 different males) contained ten or more examples each of type 0 and 1 syllables, and were tagged as “qualifying trials.” These qualifying trials contained sufficient numbers of each syllable type to allow measurement of the syllable prevalence and transition probabilities. We consistently found that type 1 syllables were more likely following another type 1 (Figure 7C): in 78/81 qualifying trials, type 1 syllables were more likely following another type 1 than would have been predicted from their overall prevalence. This demonstrates a strong tendency for male mice to utter low-jump syllables in blocks. Similarly, we found that type 1 syllables were very unlikely to be used at the beginning of a phrase: after a gap, the likelihood of a type 1 syllable was lower (in 78/81 trials) than would have been predicted from chance selection of syllable types (Figure 7C).
A related phenomenon is seen in zebra finch song, in which phrases often begin with an “introductory note” [25]. We conclude that these vocalizations display strong temporal regularities. Therefore, mouse ultrasonic vocalizations contain the two elements most commonly used to define song [1,3,13]: the vocalizations contain multiple syllable types, and these syllables are uttered in regular, repeated temporal sequences. We therefore label these vocalizations as songs.
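The binomial comparison of prevalence with repeat likelihood can be sketched in Python. The prevalence 258/750 comes from the text, but the number of transitions out of type 1 syllables is not reported, so the counts in the usage note (150 repeats out of 258 transitions, about 58%) are assumptions chosen only to match the quoted percentage. The function name is hypothetical, and the exact test used by the authors may differ in detail.

```python
import math


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(
        math.comb(n, i) * p**i * (1.0 - p) ** (n - i)
        for i in range(k, n + 1)
    )


# Null hypothesis: the syllable after a type 1 is type 1 with probability
# equal to the overall prevalence of type 1 syllables (258/750, from the text).
prevalence = 258 / 750

# Assumed counts, not reported in the text: 258 transitions out of type 1
# syllables, of which 150 (about 58%) lead to another type 1.
n_transitions = 258
n_repeats = 150

p_value = binomial_upper_tail(n_repeats, n_transitions, prevalence)
```

Under these assumed counts the tail probability falls well below 10^−10, in line with the significance level quoted for this trial.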