Species-specific vocalizations fall into two broad categories: those that emerge during maturation, independent of experience, and those that depend on early-life interactions with conspecifics. Human language and the communication systems of a small number of other species, including songbirds, fall into this latter class of vocal learning. Self-monitoring has long been assumed to play an important role in the vocal learning of speech [], and studies demonstrate that perception of one's own voice is crucial for both the development and lifelong maintenance of vocalizations in humans and songbirds []. Experimental modifications of auditory feedback can also change vocalizations in both humans and songbirds []. However, with the exception of large manipulations of timing [], no study to date has directly examined the use of auditory feedback in speech production under the age of 4. Here we use a real-time formant perturbation task [] to compare the responses of toddlers, young children, and adults to altered feedback. Children and adults reacted to this manipulation by changing their vowels in a direction opposite to the perturbation. Surprisingly, toddlers' speech did not change in response to altered feedback, suggesting that long-held assumptions regarding the role of self-perception in articulatory development need to be reconsidered.

An examination of individuals' baseline utterances revealed that production variability decreased with age. The average individual's standard deviation in F1 and F2 during production of baseline utterances is plotted in Figure 3. For both F1 and F2, an ANOVA revealed a significant effect of group [F1: F(2,69) = 37.23, p < 0.001; F2: F(2,69) = 22.32, p < 0.001]. Multiple comparisons with Bonferroni correction confirmed that for both F1 and F2, the differences between all groups were significant (p < 0.05).

Standard deviation in F1 and F2 of an average individual's production of baseline utterances for each of the three groups. Standard error bars are shown.

To verify these observations, we computed individual measures of compensation in F1 and F2. For both formants, an analysis of variance (ANOVA) revealed a significant effect of group [F1: F(2,69) = 7.23, p < 0.01; F2: F(2,69) = 6.38, p < 0.01]. Multiple comparisons with Bonferroni correction confirmed that compensation by the adults and young children differed significantly from that of the toddlers (p < 0.01 for both F1 and F2), but no significant difference between the adults and young children was observed (p > 0.99 for both F1 and F2).
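The group comparison above can be sketched in a few lines. This is an illustrative implementation rather than the authors' analysis code, and the example data in the test are hypothetical; note, though, that the reported degrees of freedom [F(2,69)] correspond to three groups and 72 talkers in total.

```python
import numpy as np

def one_way_anova_F(groups):
    """F statistic for a one-way ANOVA over a list of 1-D samples.

    With k groups and N observations in total, the statistic is
    distributed as F(k - 1, N - k) under the null hypothesis --
    here F(2, 69) for 3 groups and 72 talkers.
    """
    k = len(groups)
    N = sum(len(g) for g in groups)
    grand = np.mean(np.concatenate([np.asarray(g, float) for g in groups]))
    # Between-group sum of squares: group means vs. the grand mean.
    ss_between = sum(len(g) * (np.mean(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares: observations vs. their own group mean.
    ss_within = sum(np.sum((np.asarray(g, float) - np.mean(g)) ** 2)
                    for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (N - k))
```

The Bonferroni-corrected follow-up comparisons amount to multiplying each of the three pairwise p-values by three before comparing it to the significance threshold.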

The normalized results, averaged across individuals in each group, are plotted in Figure 2. As in previous formant perturbation experiments [], the adults spontaneously compensated, altering the frequencies of F1 and F2 in a direction opposite to that of the perturbation (top panel). The young children compensated in a manner similar to the adults (middle panel). However, the toddlers did not alter their production of F1 or F2 in response to the perturbation (bottom panel).

For each utterance, the “steady-state” F1 and F2 frequencies were determined by averaging estimates of that formant from 40% to 80% of the way through the vowel. These values were then normalized for each individual by subtracting the average of that individual's baseline utterances, defined as the last 15 utterances before feedback was altered (i.e., utterances 6–20). For statistical analyses, individual measures of compensation in F1 and F2 were computed, with the magnitude based on the difference in average frequency between the last 20 utterances (i.e., utterances 31–50) and the baseline used in normalization. The sign was determined by whether the change in production opposed (positive) or followed (negative) the direction of the perturbation.
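As a concrete sketch of the measures just described (illustrative only; the function and variable names are ours, not the authors', and the data in the usage example are hypothetical):

```python
import numpy as np

def steady_state(formant_track):
    """Average a per-utterance formant track over 40-80% of the vowel."""
    n = len(formant_track)
    return float(np.mean(formant_track[int(0.4 * n):int(0.8 * n)]))

def compensation(utterance_means, shift_sign):
    """Signed compensation for one talker and one formant.

    utterance_means : steady-state values for the 50 utterances, in order.
    shift_sign      : +1 if the perturbation raised this formant,
                      -1 if it lowered it.
    Positive values mean production moved opposite to the perturbation.
    """
    baseline = np.mean(utterance_means[5:20])    # utterances 6-20
    shifted = np.mean(utterance_means[30:50])    # utterances 31-50
    return float(-shift_sign * (shifted - baseline))
```

For example, a talker whose F1 means sit at 700 Hz during baseline and drop to 650 Hz under a +200 Hz F1 shift has a compensation of +50 Hz, because the change opposes the perturbation.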

We tested three groups of native English speakers: adults (26 adult females, mean age 18.9 years), young children (26 children, mean age 51.5 months), and toddlers (20 children, mean age 29.8 months). Each talker produced 50 utterances of the word “bed.” To elicit these utterances from the young children and toddlers, we developed a video game in which the children helped a robot cross a virtual playground by saying the robot's “magic” word, “bed” (Figure 1B). During the first 20 utterances, talkers received normal acoustic feedback through a pair of headphones. During the last 30 utterances, talkers received feedback in which the frequencies of their first and second formants (F1 and F2, respectively) were perturbed using a real-time formant shifting system: F1 was increased by 200 Hz and F2 was decreased by 250 Hz. This manipulation changed talkers' productions of the word “bed” into their own voice saying the word “bad.”
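The trial schedule just described can be summarized as a simple lookup (a minimal sketch of the design, not the experiment's actual control software):

```python
def feedback_shift(utterance_index):
    """(F1, F2) feedback shift in Hz for a given 1-indexed utterance.

    Utterances 1-20: baseline phase, unaltered feedback.
    Utterances 21-50: shift phase, F1 raised 200 Hz and F2 lowered
    250 Hz, so a talker producing "bed" hears something closer to
    their own voice saying "bad".
    """
    if utterance_index <= 20:
        return (0, 0)
    return (200, -250)
```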

In the current study, we examine real-time compensatory behavior in vowel production when auditory feedback is modified, using a rapid signal processing system to change the formant frequencies of vowels produced by children and adults. Previous work with adults has demonstrated that when talkers receive auditory feedback in which their own vowel formants are shifted to new locations in the vowel space, they rapidly compensate, altering the formant frequencies of the vowels they produce in a direction opposite to the perturbation []. This response pattern has been interpreted as evidence for a predictive mechanism in speech motor control [] and demonstrates that even adult speakers remain reliant on auditory feedback to fine-tune the accuracy of their vocal productions.

In humans, there is a clearly defined linkage between vocal tract configuration and the acoustic structure of speech. The two vocal tract configurations shown in Figure 1A have different resonant frequencies, leading to the amplification of different harmonics in the speech signal. Speech researchers call these amplified harmonics “formants,” and listeners rely heavily on formants to determine what consonant or vowel a speaker intended to produce. As speakers shift the configuration of their vocal tract, the formant structure of their utterances shifts accordingly. By attending to the linkage between their own unique vocal tract configurations and the resulting speech acoustics, young children could fine-tune the mapping between the motor commands sent from the brain to the vocal-production organs and the resulting acoustic output.
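The resonance idea can be illustrated with the textbook quarter-wavelength approximation of a neutral vocal tract: a uniform tube closed at the glottis and open at the lips. This is a standard acoustic idealization, not a model from this study:

```python
def tube_resonances(length_cm, n_formants=3, speed_of_sound=35000.0):
    """Resonances F_n = (2n - 1) * c / (4L) of a uniform closed-open tube.

    length_cm is the vocal tract length; speed_of_sound is in cm/s.
    A 17.5 cm adult tract gives roughly 500, 1500, and 2500 Hz; a
    shorter child's tract shifts every formant proportionally upward,
    which is one reason child and adult vowel acoustics differ.
    """
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]
```

Halving the tube length doubles every resonance, which hints at the scale of the remapping a rapidly growing vocal tract imposes on a young learner.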

(A) Midsagittal adult vocal tract showing the positioning of articulators when producing two different vowels that differ in height and frontness of the vocal tract constriction. The different tongue positions result in different resonances in the vocal tract and perception of different vowels.

Discussion

Our data suggest that by the age of 4, children are monitoring their speech productions in an adult-like manner. Toddlers, in contrast, do not appear to self-regulate their vowel acoustics as adults and young children do: feedback discrepancies with their own speech simply do not produce compensatory behavior. At first blush, these results seem paradoxical. Perceptual attunement to the vowel space of the native language is in evidence by 6 months of age []. Infants readily detect small deviations in others' pronunciation of familiar words [] and begin babbling in prosodic patterns characteristic of the language they have been exposed to []. By the age of 24 months, American children have an average vocabulary of about 300 words []. Thus, by 2 years of age, toddlers appear to be well on their way to acquiring the sound structure of their native language. If toddlers do not automatically monitor their own speech productions for accuracy as adults and young children do, then how do they learn to produce the speech sounds used in their language community? We see two kinds of possible answers: (1) explanations consistent with the idea that feedback error correction is important at all ages but that its role is context-dependent in young children, and (2) hypotheses suggesting that error correction based on feedback of the child's own speech develops only after the internal representation of a sound category is robust.

One context-dependent explanation for our data is that children may require different cognitive and/or social conditions to learn language at different ages. For example, Baldwin [] showed that by 18 months, the social cue of a speaker's gaze direction is more important for infant lexical acquisition than cues that had previously been dominant, such as the salience of an object or the temporal contiguity of object and name. Similarly, the speech processing behavior of very young children during word learning varies with cognitive demands: in some online speech testing procedures, young children do not attach labels to objects as readily as they do when given more naturalistic contextual support or simpler tasks [].

Social context might also modulate when auditory feedback can influence the sound representation. As has been shown with songbirds, social or public use of vocalizations can be differentiated from vocal practice in early learning, and feedback plays a different role in each type of vocalization []. For our 2-year-olds, the minimal speech produced by the adults during the task may have created a situation in which fine-tuning of production was minimized. In addition, the words produced by the children were reinforced by the video game independent of the accuracy of the vowel: the robot progressed through the playground whether or not the child compensated. Note, however, that this was true for both the toddlers and the young children, so it cannot by itself explain the age-related changes.

Alternatively, and more in line with our second class of explanation, feedback error correction may not be adaptive during the earliest stages of word production, perhaps because of the magnitude of variability in toddlers' motor activities. If production variance alone were the issue, compensation should only be observed once variability is reduced to a tolerable level. To explore this hypothesis, we conducted two analyses. In the first, regressions were computed between an individual's compensation magnitude and production variability in the baseline values of F1 and F2. Whether the regressions were carried out within age groups or with the compensation results of the toddler and young children groups pooled together, no significant relationship was found (p > 0.3). In the second analysis, we tested whether the perturbation influenced articulation even though the youngest children did not compensate; it is conceivable that altered feedback might induce instability even before mature compensatory behavior has developed. To test this, we compared the standard deviation of each individual's last 15 utterances in the baseline phase and in the shift phase. For both the toddlers and the young children, no significant difference in standard deviation was observed for either F1 or F2. These results suggest that variability per se is not the issue.
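The two checks can be sketched as follows. This is an illustrative reconstruction under our reading of the analysis (per-talker arrays of steady-state formant values), not the authors' code; the slice boundaries follow the utterance numbering given in the methods.

```python
import numpy as np

def regression_slope(baseline_sd, compensation):
    """Least-squares slope of compensation magnitude on baseline SD,
    one (x, y) pair per talker."""
    slope, _intercept = np.polyfit(baseline_sd, compensation, 1)
    return float(slope)

def phase_sds(utterance_means):
    """Within-talker SDs over the last 15 utterances of each phase:
    utterances 6-20 (baseline) and 36-50 (shift)."""
    baseline = float(np.std(utterance_means[5:20], ddof=1))
    shift = float(np.std(utterance_means[35:50], ddof=1))
    return baseline, shift
```

A near-zero regression slope and matched baseline/shift SDs would correspond to the null results reported above.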

An additional possibility in line with our second class of explanation is that the rapid growth of the vocal tract during the first two years of life may combine with motor variability to make feedback-based control suboptimal. The first two years of life are a period of rapid change in vocal tract size and configuration, primarily due to descent of the larynx []. A consequence of this rapid growth is abrupt change in vowel formant values between ages 1 and 4 []. One solution during this early phase of vocal learning is for learners to regulate their productions against the vocalizations of their communication partners rather than using their own ill-defined targets for feedback-based error correction. This suggestion is consistent with growing evidence that contingent adult behaviors shape the course of vocal learning in both birdsong acquisition and speech development [].

The most remarkable evidence for socially guided vocal learning comes from the study of the brown-headed cowbird []. Juvenile males, raised in isolation with females that do not sing, nevertheless acquired mature, species-specific songs. Video analyses revealed that the immature male vocalizations were shaped by visual feedback from the females (small wing strokes). Thus, without hearing mature male models, these young males learned songs that carried the markers of regional dialects and that could strongly elicit female mating responses. As this example demonstrates, adult input to the vocal learning process can vary over a wide spectrum, ranging from an acoustic template for assessing articulation error [] to nonverbal reinforcement of correct articulation [].

The period between ages 1 and 4 is marked by many other cognitive and linguistic developments associated with speech processing. For example, there are questions concerning the immaturity of the receptive phonology of children in this age group when they are engaged in word learning [], despite evidence that even younger children can make fine-grained speech discriminations. Although the auditory speech perception system and the auditory control of speech articulation clearly overlap and share resources, each system appears to have unique requirements and neural architectures tuned to meet them. Single-cell populations in the auditory cortex of nonhuman primates are selectively activated or inhibited during the animal's own vocalizations as compared to listening to others []. Distinct functional magnetic resonance imaging activity in feedback compared to listening conditions has also been shown in humans []. The auditory feedback system itself has several functional components, including a mapping between articulator movements and acoustics, an error detection system, and a computational model that learns from errors and computes new trajectories for speech movements. All of these components must undergo development, because the vocal tract changes in size and shape and articulatory precision improves over time. Only through real-time perturbation experiments of the kind performed here will we be able to begin to tease apart the components of this complex processing network and understand the passage to mature communication.

In summary, an age-related difference in the use of auditory feedback to control speech production was observed. When exposed to altered feedback in which formant frequencies were perturbed, both 4-year-olds and adults compensated, but 2-year-olds did not. These results suggest that the auditory feedback component of the speech motor-control system is either suppressed in infants and toddlers or develops between 2 and 4 years of age. Although it is not possible to distinguish between these two hypotheses using the present data, the finding that toddlers do not monitor their own auditory feedback in the manner of adults has broad implications for models of speech learning.