Significance Past work characterizes songbirds as having a strong bias to rely on absolute pitch for the recognition of tone sequences. In a series of behavioral experiments, we find that the human percepts of both pitch and timbre are poor descriptions of the perceptual cues used by birds for melody recognition. We suggest instead that auditory sequence recognition in some species reflects more direct perception of acoustic spectral shape. Signals that preserve this shape, even in the absence of pitch, allow for generalization of learned patterns.

Abstract Humans easily recognize “transposed” musical melodies shifted up or down in log frequency. Surprisingly, songbirds seem to lack this capacity, although they can learn to recognize human melodies and use complex acoustic sequences for communication. Decades of research have led to the widespread belief that songbirds, unlike humans, are strongly biased to use absolute pitch (AP) in melody recognition. This work relies almost exclusively on acoustically simple stimuli that may belie sensitivities to more complex spectral features. Here, we investigate melody recognition in a species of songbird, the European Starling (Sturnus vulgaris), using tone sequences that vary in both pitch and timbre. We find that small manipulations altering either pitch or timbre independently can drive melody recognition to chance, suggesting that both percepts are poor descriptors of the perceptual cues used by birds for this task. Instead we show that melody recognition can generalize even in the absence of pitch, as long as the spectral shapes of the constituent tones are preserved. These results challenge conventional views regarding the use of pitch cues in nonhuman auditory sequence recognition.

Songbirds are an important animal model for studying the sensorimotor mechanisms of vocal learning and the processing of learned, complex sound sequences (1–4). Although birds lack the six-layered mammalian neocortex (5), the avian auditory system follows the general vertebrate plan (6), including telencephalic circuits organized in a radial columnar pattern that are anatomically (7, 8), genetically (9), and functionally (10) analogous to the mammalian auditory cortical microcircuit. Likewise, songbirds and humans share evolutionarily convergent features of their vocal production biomechanics (11, 12) and of brain circuitry that underlies the rare trait of vocal learning (13, 14). Many aspects of auditory processing are also similar in songbirds and humans. For example, humans and European Starlings (Sturnus vulgaris) have similar frequency sensitivity thresholds and auditory filter widths (15–17), perceive the pitch of the missing fundamental (18), and parse multiple pure-tone sequences (separated in frequency) into separate auditory streams (19, 20). At higher levels, the “musical” nature of birdsong has long been appreciated by humans (21), and some songbirds can readily learn to discriminate and imitate human melodic sequences (22–24).

Given these similarities, it is surprising to find a major difference in how humans and songbirds perceive sequences of tones. Humans readily recognize tone sequences that are shifted up or down in log frequency (e.g., the “Happy Birthday” tune played on a piccolo or a tuba), because the pattern of relative pitches (the pitch interval sequence) is maintained. This ability appears effortless to humans: It is present in infancy and is a universal of human music cognition (25–27). In fact, the human ability to use relationships between acoustic cues to recognize sound sequences appears to extend beyond pitch, including loudness and perceptual brightness (28). In contrast, multiple studies over the past three decades indicate that songbirds lack relational pitch processing for tone sequences (22, 29, 30; but see refs. 31, 32). Although songbirds can easily learn to discriminate between sequences of several tones (say, ascending vs. descending sequences of four pitches or between the opening phrases of two different human melodies), even modest generalization to frequency-shifted versions of the same relative pitch patterns requires extensive training (33), and this generalization is restricted to narrow frequency ranges near the training tones (34). However, songbirds can do relational processing for certain aspects of tone sequence structure. Starlings, for example, can learn to discriminate between tone sequences of different tempi and can generalize this discrimination to novel sequences at double the training tempo (35). They can also learn to discriminate between tone sequences that increase versus decrease in loudness and generalize this discrimination to different loudness ranges (36).

Past work has characterized the difference in how humans and songbirds recognize transposed pitch sequences in terms of a reliance on relative versus absolute pitch (AP) in tone sequence perception. Although most humans rely primarily on relative pitch for recognizing tone sequences, songbirds are thought to exhibit a strong bias for relying on AP cues in recognizing tone patterns (37). Here AP does not refer to the human ability to assign a note name or pitch chroma to a tone, such as “G sharp,” but the more general ability to recognize tones on the basis of their AP height. This has been amply demonstrated in songbirds (38, 39). (For pure tones, AP height corresponds to frequency, whereas for complex harmonic tones, it corresponds to fundamental frequency.) However, the view that songbirds gravitate to AP cues in recognizing tone sequences is based on studies using fairly simple acoustic stimuli, such as pure tones or harmonic tones that vary in pitch but have a fairly stable spectral shape over the course of the sequence (e.g., sequences of piano tones). More natural sound sequences, including numerous animal vocalizations, human speech and song, and multi-instrument music, vary in both pitch and spectral shape over time.

The question of whether the AP bias demonstrated for songbirds with acoustically simple tone sequences holds for sequences that also evolve spectrally over time is particularly salient given the recent finding that starlings are able to recognize frequency-shifted versions of conspecific songs, including songs shifted outside of the frequency range of the training songs (22). Starling songs are spectrotemporally complex, with salient changes in spectral shape over time, and include narrow-band whistles, harmonic warbles, and broader band bursts and rattles that vary in their strength of periodic pitch cues (40). Thus, it is possible that generalization across frequency-shifted songs reflects the birds’ ability to detect patterns of spectrotemporal change over time, independent of absolute frequency.

In humans, spectral structure is a critical element used in speech perception and plays an important role in the percept of timbre (41). Starlings are also able to recognize harmonic tone complexes based on spectral structure despite changes in the absolute frequencies of the spectral components (42). Here we investigate how songbirds perceive tone sequences that systematically vary over time in both pitch and timbre. We find that neither pitch nor timbre alone can provide sufficient information to permit accurate tone sequence generalization. Surprisingly, however, generalization is strong for acoustic manipulations that preserve the temporal pattern of spectral shapes and remove pitch (rather than shift it). These results suggest that the absolute spectral envelope (i.e., the overall pattern of spectral amplitudes across particular frequency bands), rather than AP, may be the salient cue for songbirds recognizing sequences of sounds that have both pitch and spectral variation.

Discussion Our results challenge the long-held view that songbirds, unlike humans, rely primarily on AP cues for the recognition of tone sequences. By using sound sequences that simultaneously vary in pitch and timbre (as many natural sound sequences do) and by using an acoustic resynthesis technique from speech science that removes pitch cues (noise vocoding) but preserves overall spectral shape, we find that the absolute spectral envelope of each sound (i.e., overall spectral shape across particular frequencies) drives the recognition of tone patterns rather than AP. These results are surprising given the similarities between starlings and humans in basic psychoacoustic abilities (16) and given that these birds do perceive the pitch of tones with complex harmonic structure, including the pitch of the missing fundamental (18). Thus, although the pitch of a sound may be salient to songbirds, unlike humans they do not seem to use pitch to generalize across sound patterns. This was particularly notable in experiment 3, where songbirds transferred much more readily to noise-vocoded versions of the ascending versus descending training sequences than to piano-tone versions that preserved the pattern of AP of the training stimuli. To human ears, the piano-tone versions sound quite similar to the training sequences because of their identical pitch patterns, whereas the noise-vocoded versions sound strikingly different from the training stimuli (compare Audio Files S1–S3 to Audio Files S6–S11). However, for the birds, this pattern of perceptual similarity seems to be reversed: The noise-vocoded stimuli are treated as more similar to the training stimuli than are the piano-tone versions that preserve AP. In humans, speech recognition is famously robust to the pitch-degrading manipulations introduced by noise vocoders (43), whereas similar manipulations have severe impacts on music perception (44). 
Our observation that birds rely on spectral shape features to recognize sound sequences suggests a similarity to human speech recognition. An additional implication of our results is that unlike humans, songbirds may not have largely independent percepts of pitch and timbre (a possibility also suggested by ref. 45). Although hearing scientists have often considered pitch and timbre as distinct dimensions of auditory perception, this distinction may not be an automatic consequence of having a complex auditory system. Indeed, research shows that even humans do not always fully separate these percepts (46–49). For example, in a four-alternative choice task with two tones presented (no change, pitch change, timbre change, both change), nonmusicians reported that both pitch and timbre had changed 26% of the time when in fact the pitch had remained constant and only the timbre changed (50). Musicians, however, made this error just 2% of the time, even though the two timbres (piano vs. trumpet) were easily discriminable by nonmusicians. This raises the idea that the perceptual separability of pitch and timbre is experience-dependent (presumably musicians are better at perceiving pitch and timbre independently because they have more exposure listening to the same instruments playing at different pitches). Likewise, the perception of pitch itself may be more plastic than traditionally appreciated. Individuals considered to have AP can show considerable variability in the range of frequencies they label as the same pitch (51), and recent work shows that exposure to subtly detuned music can significantly alter the note categories of adults with AP (52). On their surface, our results may seem to be at odds with earlier work showing that starlings can recognize similar spectral structures at different APs (42). In that study, however, pitch did not change within a tone sequence during discrimination training, unlike the current work.
The distinct spectral structures used by ref. 42 may have also differed in other perceptual properties, such as degree of consonance or dissonance, that can drive generalization (53). Similarly, if variation in spectral shape (rather than AP) drives tone sequence recognition in songbirds, then one might ask why starlings were unable to recognize frequency-shifted versions of the training sequences (Fig. 2). Although frequency-shifting these sequences preserves the relative relationship among spectral components of each tone (and across the tone sequence), it nonetheless alters the absolute frequencies (Figs. S3–S5). It thus seems that the absolute spectral envelope governs avian tone sequence recognition. For pure tones, the spectral band envelope corresponds directly to pitch; for complex tones, the spectral band envelope contributes to both pitch and timbre percepts.

Fig. S3. Example comparisons of the overall absolute spectral shape for notes in the training and novel pitch test sounds. Each panel shows the frequency spectra for one of the four notes (blue lines) in a training stimulus (Fig. S1B) and the corresponding notes (red lines) in a novel pitch stimulus (Fig. S1E) that is one semitone lower.

Fig. S4. As in Fig. S3 but comparing the spectra of notes in a training stimulus (blue lines; Fig. S1B) to spectra from notes in a novel timbre stimulus (red lines; Fig. S1G).

Fig. S5. As in Fig. S3 but comparing the spectra of notes in a training stimulus (blue lines; Fig. S2B) to spectra from notes in a noise-vocoded version of the same stimulus (red lines; Fig. S2E).

Importantly, absolute spectral envelope is not likely to be the only perceptual feature that songbirds can use for auditory recognition. Previously, we showed that starlings can maintain the learned recognition of conspecific songs even when those songs are shifted in frequency by large amounts (22).
Although the precise cues that starlings use to recognize frequency-shifted conspecific songs require future study, such manipulations alter the absolute spectral envelope of the signals. However, the spectrotemporal complexity of songs and other natural stimuli provides additional perceptual cues (e.g., rhythm and amplitude envelope) that are invariant across frequency shifts. Moreover, the importance of any given cue can vary depending on listening task. Sensitivity to these features in specific tasks may help to explain prior evidence suggesting specialized song-processing mechanisms in birds (15). Our results indicate that behavioral effects tied classically to changes in the frequencies of pure tones (29, 30, 37–39) should not be strictly interpreted as changes to the percept of pitch. Instead, we suggest a revised perspective on melody recognition by songbirds. We propose that unlike humans, for whom pitch plays a dominant role in the perception of melodic sequences, songbirds rely on a perceptual representation that appears more closely tied to absolute spectral envelope. This surprising difference has implications both for research in the cognitive psychology of auditory perception and for neuroscientists investigating the physiological and computational processes underlying auditory recognition. A promising avenue of research lies in linking cross-species differences in the physiological organization of the auditory system to observed differences in the use of auditory cues. Further research manipulating spectral shape and pitch salience of tone sequences, for example using noise-vocoding or sinusoidal-vocoding, which allow control over spectral resolution while removing versus preserving pitch cues, respectively (54, 55), will help researchers understand species differences in auditory sequence recognition.

Materials and Methods Subjects. Five adult wild-caught European Starlings (S. vulgaris) of unknown sex were tested. No subjects had previously been used in other tasks or had prior exposure to experimental stimuli. Before beginning experimental training, subjects were housed in a large mixed-sex aviary. Stimuli. Stimuli were sequences of four complex tones with no intervening pauses. Each tone was 368 ms in duration so that the full sequences lasted about 1.5 s. We created the sequences by MIDI synthesis as 16-bit, 44.1-kHz wave files using the built-in QuickTime MIDI synthesizer in Mac OS 10.6. General MIDI instrument codes for the sounds were 69 (oboe), 53 (sung “aah” formant), 60 (muted trumpet), and 81 (synthesizer square wave). Our criterion in selecting these sounds was to choose instruments with a sustained amplitude envelope, so that pitch and spectral shape (and not amplitude envelope) were the primary cues to distinguish the tones. Training stimuli. From the set of synthesized tones, we created six training stimuli, each a sequence of four tones. Each tone within a sequence was distinct in both pitch and timbre from the other tones in that sequence. For three of the sequences, the pitch of each tone increased systematically with intervals of two semitones between each note from start to end, and for the three remaining sequences, the pitch of the tones decreased by the same intervals. The order of timbres in the ascending and descending pitch sequences also differed systematically so that the serial pattern structure between the two types of sequences was redundant across pitch and timbre (Fig. 1B). The lowest ascending training stimulus started on Bb4 (466.16 Hz) and continued to C5 (523.25 Hz), D5 (587.33 Hz), and E5 (659.26 Hz) (corresponding to MIDI notes 70, 72, 74, and 76, respectively). The corresponding descending stimulus used the same pitches in reverse order, starting at E5 and ending at Bb4.
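The correspondence between MIDI note numbers and frequencies quoted above follows from standard equal temperament. The following is a minimal sketch, assuming a tuning reference of A4 = 440 Hz (MIDI note 69), which the text does not state explicitly. The function names are illustrative and are not part of the original synthesis pipeline.

```python
A4_MIDI = 69     # MIDI note number of A4
A4_HZ = 440.0    # assumed tuning reference (not stated in the text)

def midi_to_hz(note):
    """Equal-tempered frequency in Hz for a MIDI note number."""
    return A4_HZ * 2.0 ** ((note - A4_MIDI) / 12.0)

def tone_sequence(start_note, n_tones=4, step=2, ascending=True):
    """MIDI notes of a sequence whose tones lie `step` semitones apart.

    `start_note` is the lowest note; descending sequences reverse the order.
    """
    notes = [start_note + i * step for i in range(n_tones)]
    return notes if ascending else notes[::-1]

lowest_up = tone_sequence(70)                     # Bb4, C5, D5, E5
lowest_down = tone_sequence(70, ascending=False)  # E5, D5, C5, Bb4

for note in lowest_up:
    print(note, round(midi_to_hz(note), 2))
```

Under the 440-Hz assumption, the printed frequencies agree with the values given for Bb4, C5, D5, and E5 to two decimal places. Sequences starting on other notes are obtained by passing a different `start_note`.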
The other two ascending stimuli started on C5 and D5, ending on F#5 and G#5, respectively, whereas descending stimuli used the same pitches in reverse order. Thus, the two other ascending and descending stimuli represented upward shifts of two and four semitones relative to the original Bb4–C5–D5–E5 or E5–D5–C5–Bb4 sequence. All stimuli were normalized to a mean power of 65 dB. Pitch-shifted stimuli. We synthesized test stimuli with the same interval spacing and timbre sequences as the training stimuli but starting at pitches not heard during training. The novel ascending stimuli started at Bb3, D3, F#3, A4, B4, C#5, F5, G5, and Bb5. Relative to the lowest ascending training sequence starting on Bb4, these sequences represent shifts of –12, –8, –6, –3, –1, 1, 3, 6, 8, and 12 semitones, respectively. Novel sequences starting at B4 and C#5 lie between two training stimuli but were never heard during training, whereas the other test stimuli lie partly or entirely outside of the training frequency range. Piano-tone stimuli. We also constructed novel timbre versions of the training stimuli (that is, three ascending and three descending sequences) matched in AP and duration pattern to the sequences in Fig. 1 but using only piano tones. In addition, we synthesized versions of these novel timbre sequences that were shifted by ±1, ±3, ±6, ±8, and ±12 semitones relative to the lowest-frequency ascending/descending pair of training stimuli. Noise-vocoded stimuli. To disrupt pitch cues while retaining the frequency-specific spectral shape of each tone, we created noise-vocoded versions of the training stimuli. Noise vocoding is accomplished by dividing an acoustic signal into a fixed number of frequency bands, extracting the amplitude envelope within each band, and then using this envelope to modulate band-pass filtered white noise. These amplitude-modulated noise bands are then recombined to create the noise-vocoded signal (see ref. 56).
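The three steps of noise vocoding (band splitting, envelope extraction, and noise modulation) can be sketched in plain Python. This is an illustration only and is not the Praat script used in the study. The second-order band-pass filters, the 30-Hz envelope smoothing, the 16-kHz sample rate, and the three example bands are all assumptions chosen to keep the sketch short and free of external dependencies.

```python
import math
import random

def bandpass(signal, sample_rate, low_hz, high_hz):
    """Second-order band-pass filter (standard biquad with 0 dB peak gain)."""
    center = math.sqrt(low_hz * high_hz)      # geometric center frequency
    q = center / (high_hz - low_hz)           # quality factor from bandwidth
    w0 = 2.0 * math.pi * center / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0          # b1 is zero for this filter
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out, x1, x2, y1, y2 = [], 0.0, 0.0, 0.0, 0.0
    for x in signal:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out

def envelope(band, sample_rate, cutoff_hz=30.0):
    """Amplitude envelope: full-wave rectification, then one-pole low-pass."""
    coeff = math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, state = [], 0.0
    for x in band:
        state = coeff * state + (1.0 - coeff) * abs(x)
        out.append(state)
    return out

def noise_vocode(signal, sample_rate, band_edges_hz, seed=0):
    """Replace each band of `signal` with envelope-modulated band-limited noise."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    mixed = [0.0] * len(signal)
    for low, high in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        env = envelope(bandpass(signal, sample_rate, low, high), sample_rate)
        carrier = bandpass(noise, sample_rate, low, high)
        for i, (e, c) in enumerate(zip(env, carrier)):
            mixed[i] += e * c                 # recombine the modulated bands
    return mixed

# Hypothetical example: a 0.25-s, 800-Hz pure tone and three illustrative bands.
SAMPLE_RATE = 16000
EDGES = [200.0, 500.0, 1250.0, 3000.0]
tone = [math.sin(2.0 * math.pi * 800.0 * n / SAMPLE_RATE) for n in range(4000)]
vocoded = noise_vocode(tone, SAMPLE_RATE, EDGES)
```

The output keeps the coarse distribution of energy across bands but carries no periodic fine structure, because every band is driven by noise. A production implementation would normally use steeper filters and calibrated output levels, neither of which this sketch attempts.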
Noise vocoding has been used for many years in speech research to investigate the role of detailed spectral structure (independent of pitch) in speech perception (43). Noise-vocoded speech sounds somewhat like whispered speech and is highly intelligible to humans if the number of frequency bands is sufficiently large (e.g., 15 bands between 50 and 8,000 Hz, as in ref. 57), because this preserves the overall time-varying shape of the speech spectrum. We constructed noise-vocoded versions of our training stimuli by dividing each original training stimulus into 16 logarithmically spaced frequency bands, with the first band spanning 50 Hz to 193 Hz and the 16th band spanning 8,865 Hz to 11,000 Hz. We then computed the amplitude envelope for each of these bands and applied it to band-limited white noise. Vocoding was done using a custom-written Praat script (see computer script found in Dataset S1). Procedures. Each subject was trained and tested in four phases: shaping, recognition training, recognition testing, and novel stimulus transfer using a two-alternative choice operant training procedure. Further details on the operant training are provided in previous publications (e.g., refs. 22, 58). Subjects were housed individually and trained inside a sound isolation chamber with access to an operant panel (Fig. 1A). During training, subjects initiated trials when the house lights were on (matched to local daylight). Water was freely available, and animals were not fed except when earning a food reward after completing an experimental trial. All procedures were completed as part of a protocol approved by the University of California, San Diego Institutional Animal Care and Use Committee. Shaping. During shaping, each subject was trained to obtain food from a hopper underneath the food port (Fig. 1A) by pecking the center response port. 
After pecking the center port 100 times, they were trained to peck the center port and then either the left or right response port (cued randomly with a flashing light) for a food reward. Each subject completed several hundred trials pecking the left and right response ports. Recognition training. After shaping, subjects learned to associate ascending stimuli (with the characteristic timbre sequence; see Fig. S1) with the left response port and descending stimuli with the right response port. On each trial, a peck to the center response port started playback of a randomly selected training stimulus from a speaker behind the operant panel. Pecks to the left or right response port within 2 s after the stimulus playback ended led to reinforcement. Incorrect responses were punished with 10–20 s of lights out; correct responses resulted in 2 s of food access paired with a secondary visual reinforcement (blinking LEDs in all three response ports). To improve performance and increase the number of trials performed, we transitioned subjects over multiple sessions to a fixed-ratio reinforcement schedule where they were fed only if they responded correctly to a fixed number of consecutive trials, otherwise receiving only the secondary visual reinforcer for correct responses. Incorrect responses or nonresponses reset the count of correct trials. Eventually the fixed ratio was set at six trials for each subject. Recognition testing. To test generalization of the pitch-shifted and novel timbre tone sequences, we used a probe procedure. On 66% of trials, we presented the training stimuli (randomly selected as in the initial training), and on the remaining 33% of trials, we presented one of the pitch-shifted or novel timbre stimuli. To keep response rates high, we required subjects to respond correctly to six consecutive training stimulus trials. 
Responses to the test stimuli were never immediately followed by reinforcement and did not affect the consecutive correct response counter. Failure to respond to any stimulus reset the response counter so that six correct responses to training stimuli were again required to receive a food reward. Transfer procedure. We used a transfer procedure rather than a probe-recognition procedure in experiment 3. In the transfer procedure, subjects were switched immediately from sessions in which the training stimuli were presented on all trials to sessions in which the test stimuli were presented on all trials. During these test sessions, subjects were reinforced (or punished) for correct (or incorrect) responses to test stimuli, as during the training sessions. Above-chance performance during initial transfer trials and/or rapid acquisition indicates the generalization, or transfer, of learning from the training to the test stimuli. Initial transfer performance that falls to chance and takes longer to recover indicates weaker generalization and the need to relearn the recognition task anew. The primary difference compared with the probe procedure is that the transfer procedure differentially reinforces responses and thus affords subjects an opportunity to learn stimulus-response associations, providing access to a more subtle behavioral measure in acquisition rate. We exposed subjects to a mean of 2,636 trials with the noise-vocoded stimuli (range, 1,389–3,308) before returning them to the original training stimuli. After ensuring stable, accurate recognition (mean = 94.4% correct; range, 87.6–97.0% correct), we then transferred them to piano-tone versions of the training stimuli.
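The bookkeeping for the fixed-ratio schedule with probe trials, as described under Recognition training and Recognition testing, can be sketched as a small counter. This is a hedged illustration and not the software used in the study. The class name, the outcome labels, the choice to restart the count after a food reward, and the absence of any consequence for a missed trial beyond the counter reset are all assumptions.

```python
FIXED_RATIO = 6  # consecutive correct training trials required for food

class FixedRatioSchedule:
    """Tracks consecutive correct responses to training stimuli."""

    def __init__(self, ratio=FIXED_RATIO):
        self.ratio = ratio
        self.streak = 0

    def record(self, is_probe, response, correct_side):
        """Return the outcome of one trial.

        `response` is "left", "right", or None when the subject did not respond.
        Outcomes: "food", "blink" (secondary visual reinforcer),
        "lights_out" (punishment), or "none".
        """
        if response is None:
            self.streak = 0          # a missed trial resets the counter
            return "none"
        if is_probe:
            return "none"            # probes are never reinforced or counted
        if response != correct_side:
            self.streak = 0
            return "lights_out"
        self.streak += 1
        if self.streak == self.ratio:
            self.streak = 0          # assumption: the count restarts after food
            return "food"
        return "blink"

schedule = FixedRatioSchedule()
outcomes = [schedule.record(False, "left", "left") for _ in range(5)]
probe_outcome = schedule.record(True, "right", "left")   # streak stays at 5
sixth_outcome = schedule.record(False, "left", "left")   # completes the ratio
```

In this sketch a response to a probe trial leaves the counter untouched, whereas a missed trial of either kind resets it, matching the two rules stated in the text.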

Acknowledgments The research was partially funded by National Institutes of Health Grant R01DC008358. A.D.P. was supported by Neurosciences Research Foundation as part of its program on music and the brain at The Neurosciences Institute.