From Scholarpedia

Sine-wave speech is an intelligible synthetic acoustic signal composed of three or four time-varying sinusoids. Together, these few sinusoids replicate the estimated frequency and amplitude pattern of the resonance peaks of a natural utterance (Remez et al., 1981). The intelligibility of sine-wave speech, stripped of the acoustic constituents of natural speech, cannot depend on simple recognition of familiar momentary acoustic correlates of phonemes. In consequence, proof of the intelligibility of such signals refutes many descriptions of speech perception that feature canonical acoustic cues to phonemes. The perception of the linguistic properties of sine-wave speech is said to depend instead on sensitivity to acoustic modulation independent of the elements composing the signal and their specific auditory effects.

The ability of listeners to integrate asynchronously varying, harmonically unrelated tones into a single perceptual stream has also posed a strong challenge to accounts of auditory perceptual organization. Perceptual organization is the function by which auditory experience is resolved into individual sensory streams, each issuing from a distinct source. In standard accounts, stream formation occurs by grouping similar auditory sensory elements into streams, a proposal that owes much to the Gestalt principles of figural organization (Wertheimer, 1923). Although sine-wave speech is coherent perceptually, neither sine-wave components nor their respective sensory properties are similar to each other at the fine acoustic grain described in the Gestalt-based account. For this reason, sine-wave speech requires an alternative account of perceptual organization that depends instead on sensitivity to coordinate variation, transcending the details of elementary signal properties and their isolated auditory effects. Research has exploited these properties of sine-wave speech to examine perceptual organization of a spoken auditory scene, perceptual analysis of the linguistic properties of utterances, and the perceptual identification of individual talkers.

Although numerical methods of automatic estimation have been used to derive frequency and amplitude values for synthesizing sine-wave speech, these techniques are prone to error, and the estimates they provide require extensive correction before they are suitable for use as synthesis parameters. Accordingly, old-fashioned practices of acoustic analysis are often used, in which a phonetician inspects a spectral display and picks frequency and amplitude values by hand for this form of copy synthesis.
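Once frequency and amplitude values have been picked, frame by frame, for each replicated resonance, the synthesis itself is straightforward. The sketch below is a minimal illustration, not the synthesizer used in the original studies; the function name, the frame rate, and the representation of the tracks as per-frame (frequency, amplitude) pairs are all assumptions for the example.

```python
import numpy as np

def sinewave_speech(tracks, frame_rate=100, sample_rate=8000):
    """Synthesize a sine-wave replica from hand-corrected formant tracks.

    `tracks` is a list of tone tracks (typically three or four), each a
    sequence of (frequency_hz, amplitude) pairs sampled once per analysis
    frame.  Frequency and amplitude are interpolated to the audio rate, and
    phase is accumulated so each tone glides smoothly between frames.
    """
    n_frames = len(tracks[0])
    hop = sample_rate // frame_rate                 # samples per frame
    t = np.arange(n_frames * hop) / hop             # time in frame units
    frames = np.arange(n_frames)
    signal = np.zeros(n_frames * hop)
    for track in tracks:
        freq = np.interp(t, frames, [f for f, _ in track])
        amp = np.interp(t, frames, [a for _, a in track])
        phase = 2.0 * np.pi * np.cumsum(freq) / sample_rate
        signal += amp * np.sin(phase)
    return signal
```

Phase accumulation (rather than computing `sin(2*pi*f*t)` directly) matters here: because the resonance frequencies vary continuously, each tone's instantaneous frequency must be integrated to avoid discontinuities at frame boundaries.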

Figure 1: Spectrogram of a natural utterance, "Jazz and swing fans like fast music."

Media:jazz_nat.mp3 An example of natural speech.

Figure 2: Spectrogram of a sine-wave replica of the utterance in Figure 1.

Media:jazz_sw.mp3 An example of a sine-wave replica.

Perceptual organization

Perceptual organization is the function by which proximal sensory samples are resolved into coherent streams, each stream specific to a distal object and event. In the auditory modality, perceptual organization resolves sensory samples into an array of sound sources (Bregman & Pinker, 1978). Whether a listening environment is noisy or quiet, a perceiver who understood a talker’s utterance must have treated the acoustically diverse constituents of speech as a stream of sound produced by a single vocal source. Because none of the acoustic elements of speech is unique to speech (Stevens, 1999) — the vocal spectrum is made of whistles, clicks, hisses, buzzes and hums — the organizational function that finds sources of spoken sound cannot identify a speech stream by appraising acoustic elements one by one. Instead, the variation of an aggregate pattern is probably identified and tracked perceptually by virtue of its characteristic modulation (Remez et al., 1994; Remez, 2005; Stevens & Blumstein, 1981).

The dissimilarity among the natural acoustic constituents of vocal sound also shows that the perceptual integrity of speech entails the resolution of coherence among acoustic constituents without requiring likeness in detail among the integrated elements. Speech is produced as a train of syllables with a vowel at the nucleus and consonantal contrasts at syllable onsets and offsets. Finding and following this regular albeit unpredictable linguistic sequence requires sensitivity to complex acoustic patterns independent of the momentary attributes of the constituents. Such perceptual flexibility about sensory details is characteristic of the perceptual organization of natural speech, and underlies the integration of concurrent sine-waves expressing the spectrotemporal properties of speech despite unspeechlike components.

Because each sine-wave component differs at the fine grain from the others composing the pattern, the aggregation of sinusoidal constituents into a single perceptual stream depends on a listener’s sensitivity to their coordinate variation despite their dissimilarity in detail. Moreover, the sinusoidal constituents differ greatly from natural acoustic products of vocalization, both in their physical properties and in their sensory effects. For this reason, it is doubtful that coherence of the tonal elements of sine-wave speech occurs as a consequence of similarity to acoustic elements typical of speech, or by the resemblance of the timbre of sine-wave constituents to natural acoustic products of vocalization. Instead, the pattern of frequency and amplitude variation of the aggregate configuration evokes an impression of linguistic properties despite the prominent unnatural quality of the constituents.

Grouping of the tone components of sine-wave speech therefore confounds a leading account of perceptual organization, Auditory Scene Analysis (Bregman, 1990), in both aspects of its conceptualization: 1) the initial Gestalt-derived grouping function and 2) the ensuing resort to schematic knowledge to guide the repair of erroneous grouping. In the first stage of this model, acoustic elements coalesce into groups according to similarity in the physical properties correlated with auditory sensation: similarity in frequency among signal elements; in proportional frequency change; in fundamental frequency; in frequency modulation; in amplitude modulation; in synchrony of onset and offset; in the shape or composition of the acoustic spectrum. In Auditory Scene Analysis, while perceptual grouping is effortful, requiring attention (Carlyon et al., 2001; Cusack et al., 2004), the temporal and spectral granularity of its similarity standard is fine (reviewed by Remez et al., 1994; Remez, 2005). For example, grouping by similarity warrants synchronized and strictly proportional frequency change, as typifies the harmonics of a common fundamental frequency. However, frequencies replicated in sine-wave speech are the result of the natural resonance of the column of air enclosed by the vocal anatomy. As movements of relatively independent articulators change the configuration of the vocal tract, this changes the frequency and amplitude of the natural resonances sustained by the enclosed column of air. But, as independent effects of articulatory changes, the lowest resonance does not determine the frequency of the upper resonances, unlike harmonics of a common fundamental. The harmonic relation observed among the Fourier components of a periodic source is therefore not observed among the time-varying resonant frequencies of natural speech. 
Because sine-wave speech replicates resonance peaks that are unrelated harmonically, tone components that follow their frequency variation are unrelated harmonically. Overall, the within-signal correlation of the components of sine-wave speech is relatively low (Stone & Moore, 2008).
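The contrast drawn here, between harmonics that shift in strict proportion with a common fundamental and resonances that move independently with the articulators, can be made concrete with a toy calculation. The frequency values below are illustrative vowel formants, not data from the studies cited:

```python
import numpy as np

# Harmonics of a periodic source: when f0 rises 10%, every component rises
# by the same proportion, so the ratios among components never change.
f0 = np.array([100.0, 110.0])                  # fundamental, before/after
harmonics = np.outer(f0, [1, 2, 3])            # first three harmonics
assert np.allclose(harmonics[1] / harmonics[0], 1.1)

# Vocal-tract resonances (illustrative formant values for two vowels): each
# resonance moves independently as the articulators reshape the tract, so
# no single scale factor relates one configuration to the next.
formants = np.array([[700.0, 1200.0, 2600.0],  # roughly /a/-like
                     [300.0, 2300.0, 3000.0]]) # roughly /i/-like
ratios = formants[1] / formants[0]
assert not np.allclose(ratios, ratios[0])      # change is not proportional
```

This is why tones tracking the resonance peaks fail the proportional-frequency-change criterion of Gestalt-based grouping even though they cohere perceptually.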

Although the sine-wave components of a sentence are dissimilar in detail, the coarse-grain pace of their spectrotemporal variation derives from the syllable cycle of the natural utterance that served as the model for the synthesis. Some studies of digitally altered speech have reported effects of a hypothetical integrative perceptual function acting at this syllabic grain, roughly 6-8 Hz (for example, Saberi & Perrott, 1999). Such a function would offer a hypothetical auditory basis for the perceptual organization of sine-wave speech, one that overrides the other acoustic differences among the components. However, direct tests of temporal coherence of sine-wave sentences have shown that the integration window is far narrower, perhaps no greater than 50 ms (Remez et al., in press). Accordingly, it is unlikely that the coarse-grain, syllable-paced amplitude characteristics of the tone components of sine-wave speech contribute to their perceptual integration.

Other investigations have shown that imposing fine-grain amplitude modulation on the tonal components promotes intelligibility, though only within a range restricted to 50-100 Hz (Carrell & Opie, 1992). Nonetheless, sine-wave sentences of phonemically balanced composition can be highly intelligible without comodulation of tone components (Remez et al., in press), indicating that acoustic and auditory similarity by common pulsing is not necessary for perceptual organization of the sine-wave constituents of a sentence. Moreover, in natural speech the range of amplitude comodulation attributable to laryngeal pulsing spans 75-300 Hz, a range far wider than the evident range of cohesion by comodulation. Despite evidence of effectiveness within strict limits, this physical feature is unlikely to play a significant role in perceptual stream formation, either for natural or for sine-wave speech.

Auditory Scene Analysis proposes a second stage of perceptual organization in which a perceiver’s long-term schematic knowledge of the correspondence between acoustics and sound sources is subsequently applied to grouping. This function contrasts with the fast grouping in the first stage, for which similarity in auditory form is said to draw acoustic elements into a perceptual stream. This portion of the model has received few tests, and it remains largely hypothetical (Bregman, 1990; Darwin, 1997). Further, it is implausible as an account of the perceptual organization of sine-wave speech, for two reasons. First, schematic knowledge is described as a representation of the typical sensory effects of familiar sound sources. This notion of an auditory schema shares much with top-down memory structures (Bobrow & Norman, 1975). But sine-wave spectra are not typical of speech, nor are they perceived to be similar to speech even when they are highly intelligible. The perceptual coherence of sine-wave components occurs despite a manifest lack of typicality or similarity to natural sensory samples, and for this reason the device of schematic knowledge as conceived in Auditory Scene Analysis is inadequate to explain the perceptual effects observed with sine-wave speech.

Second, the opportunity for a secondary process occurring after a lag to correct a grouping error committed by a preliminary mechanism depends on the durability of auditory samples. An auditory sensory trace must last long enough for a knowledge-based function to correct a perceptual stream that erroneously formed or failed to form by Gestalt similarity. Measures of auditory sensory persistence converge (Elliott, 1962; Howell & Darwin, 1977; Pisoni & Tash, 1974): a trace fades significantly within 100 ms, well before the detection of an erroneous grouping could invoke a lumbering resort to a knowledge-based alternative to Gestalt similarity criteria. It is simply implausible to suppose that the role of experience in perceptual organization of speech takes this form of slowly established cognitive correction of an auditory grouping, whether knowledge is schematic or more versatile. Computational implementations of scene analysis face different costs than natural systems do (Klatt, 1989; Wang & Brown, 2006). Those instrumental techniques are free to treat auditory sensory samples as persisting and simultaneously available over wide spans. In stark contrast, descriptions of human perceivers are bound by the findings of auditory physiology and correlated psychoacoustics. Computational implementations that observe the characteristic urgency of the sensory functions in human perception are likely to be instructive as models of perceptual organization, while those that falsely presume the durability of ephemeral sensory samples sacrifice plausibility, whatever that conceptualization gains.

Recently, a new argument purporting to buttress the claims of Auditory Scene Analysis against the falsifying evidence of sine-wave speech was offered (Darwin, 2008). The argument asserts that “the default condition for the auditory system is to treat everything as coming from a single source, and only to segregate different sources if the evidence is sufficient (p. 1013).” The implied premise of this assertion is that the dissimilarities among sine-wave components are simply insufficient to split the tones into separate streams, and that they coalesce by default into a single stream that a perceiver treats as speech. This claim is simply false as a description of the perceptual effects of sine-wave speech and, therefore, false as a hypothesis about the perceptual organization of sine-wave speech. The evidence, supplied in Condition A of the original research report (Remez et al., 1981), is that the default treatment of sine-wave replicas of speech is as independent sound sources, one per sinusoid. This was observed in a test condition in which listeners, having been told nothing about the nature of the sounds, were simply asked to report their spontaneous impressions of the tone patterns. Listeners mainly reported hearing science fiction sounds, computer bleeps, electronic music, and several simultaneous sounds, although the rare listener did report speech. Integration of the tones to compose a speech stream occurred under a different instruction, in this case, to transcribe the tone patterns as if they were synthetic speech. This finding of coincidence of perceptual grouping and instructional set was replicated with different test material by Remez et al. (2001) and by Liebenthal et al. (2003).
In contrast to the hypothetical default grouping proposed in the mistaken new argument, the observed default organization of the tonal components of sine-wave speech is split into separate simultaneous groups, much as the Gestalt-derived principles of Auditory Scene Analysis actually claim.

Indeed, a listener’s first impression of sine-wave speech is dominated by its auditory form: a collection of whistles varying in pitch and loudness. Lacking the harmonically related components, broadband resonances and aperiodicities typical of natural vocalization, sine-wave replicas of utterances do not resemble speech in acoustic detail, nor do such sounds evoke the auditory quality of natural vocalization. Few listeners spontaneously notice linguistic properties, yet the instruction to listen to a sine-wave replica of a natural utterance as if it were synthetic speech, without arduous training or other special conditions, is all the help that naïve listeners need to provide accurate transcriptions of its linguistic properties. Perception of the linguistic properties of the replicated utterance and of the personal attributes of the replicated talker occurs by virtue of the coarse-grain spectrotemporal properties of speech that are preserved in the frequency and amplitude modulation of the tonal constituents. Yet, even when the tonal constituents are grouped to compose a speech stream, the auditory form of individual tones remains prominent. This evidence that the tones are both grouped and split into separate streams at the same time reveals the organizational bistability of sine-wave speech (Remez et al., 2001). The similarity-based principles of the generic auditory account of perceptual organization predict that the tones should be segregated into separate perceptual streams; grouping by modulation sensitivity predicts that the tones should be grouped as speech despite their detailed dissimilarities. Because both kinds of grouping occur concurrently, sine-wave speech is organizationally bistable. Neither kind of grouping reduces to the other.

Figure 3: Spectrogram of a noise-band replica of the utterance in Figure 1.

Media:jazz_noise_band.mp3 An example of noise-band vocoded speech based on the natural sample in Figure 1.



Independent corroboration of perceptual integration without similarity or familiarity of elementary auditory constituents is found in perceptual studies of noise-band vocoded speech (Shannon et al., 1995) and of acoustic chimeras (Smith et al., 2002). In each of these kinds of signal, an aspect of the spectrotemporal variation of speech is imposed on a nonspeech carrier. In noise-band vocoded speech, three or four band-limited noise sources of stationary frequency span the range of speech without overlapping, and each band is excited to the extent given by the momentary integrated energy within its frequency range in a natural sample of speech. The result is an intelligible composite of noise bands changing in amplitude though not in center frequency, expressing the spectrotemporal properties of natural speech at coarse grain. Grouping can depend neither on similarity among the noise-band components in frequency or frequency change, nor on coincident onset and offset, nor on static or changing amplitude.
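The noise-band scheme described above can be sketched in a few lines. The band edges, filter order, and use of the Hilbert envelope below are illustrative choices for the example, not the parameters of Shannon et al. (1995):

```python
import numpy as np
from scipy.signal import butter, sosfilt, hilbert

def noise_vocode(speech, sample_rate, band_edges):
    """Replace each analysis band of `speech` with band-limited noise
    modulated by that band's momentary envelope.  The noise bands keep
    stationary frequency limits; only their amplitudes vary over time."""
    rng = np.random.default_rng(0)
    out = np.zeros(len(speech))
    for lo, hi in band_edges:
        sos = butter(4, [lo, hi], btype="bandpass", fs=sample_rate,
                     output="sos")
        band = sosfilt(sos, speech)                 # speech in this band
        envelope = np.abs(hilbert(band))            # momentary band energy
        noise = sosfilt(sos, rng.standard_normal(len(speech)))
        out += envelope * noise                     # excite noise by envelope
    return out
```

With three or four bands spanning the speech range (for instance, 100-800, 800-2400, and 2400-3800 Hz at an 8 kHz sampling rate), the output preserves only the coarse amplitude pattern of the original within each band, which is the property the perceptual studies exploit.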

To produce an acoustic chimera, the spectrum envelope of a speech sample is excited with the fine temporal structure extracted from an arbitrary acoustic signal. The resulting chimerical spectrum depends on the coincidence of the spectrum envelope of the speech source with the fine-structure of the second source. With sufficient resolution in each, a bistable organization occurs in which events identifiable from the fine structure are apprehended concurrently with a series of words evoked by the time-varying spectrum. The ease with which perceivers handle these anomalous speech signals reflects the ordinariness of this perceptual function.
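A single-band version of the chimera operation can be sketched with the Hilbert transform, which separates a signal into a slowly varying envelope and a rapidly varying fine structure (cosine of the instantaneous phase). This is a simplified illustration of the decomposition used by Smith et al. (2002); their chimeras apply the same exchange within each of several filter bands.

```python
import numpy as np
from scipy.signal import hilbert

def chimera(envelope_source, fine_structure_source):
    """Impose the Hilbert envelope of one signal on the Hilbert fine
    structure of another, within a single analysis band."""
    n = min(len(envelope_source), len(fine_structure_source))
    envelope = np.abs(hilbert(envelope_source[:n]))
    fine = np.cos(np.angle(hilbert(fine_structure_source[:n])))
    return envelope * fine
```

Calling `chimera(speech, music)` yields a signal whose amplitude contour follows the speech while its rapid oscillations follow the music; the multi-band version does this band by band after filtering both sources into matching channels.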





Figure 4: Spectrogram of a musical sample used to produce fine-structure for an acoustic chimera of speech.

Media:tickle.mp3 An example of a musical excerpt used for its fine structure.

Figure 5: Spectrogram of an acoustic chimera with the spectrum envelope of the utterance in Figure 1 and the fine-structure of the musical sample in Figure 4.

Media:jazz_chi.mp3 An example of the chimera that results from combining the musical fine structure with the estimated spectrum envelope of a natural speech sample.

Overall, studies of sine-wave speech reveal the effect of sensitivity to acoustic modulation independent of the detailed characteristics of a carrier. This function is critical for establishing the perceptual integrity of sound streams that originate in complex sound sources. An account of perceptual grouping by Gestalt functions focused on momentary acoustic elements, supplemented by schemas, presents an incomplete characterization of auditory perceptual organization. At the present time, no evidence designates either of these dynamics of perceptual organization as specializations in the biological sense, although it is clear which function is crucial in tracking acoustic correlates of speech for phonetic perceptual analysis.

Perceptual analysis

A key link in the speech chain is the production of vocal sound with phonological purpose, and the search for an effective account integrating physical, physiological, articulatory and linguistic aspects remains a keynote of studies of speech. It is often presumed that a reciprocal function ties typical acoustic effects of phonetic expression to auditory sensory analysis of speech sounds, and that the identification of consonants and vowels depends critically on the comparison of an unidentified sensory pattern to perceptual standards characterizing the likely sensory details of linguistically determined contrasts within a language. This normative axiom of perception has deep roots in psychology, and offers a corollary about sine-wave speech that is both clear and false. If speech perception depended on matching a sensory sample to a long-term representation of the likeliest sensory properties of consonants and vowels within a language, sine-wave speech would be unintelligible. Its acoustic and auditory features are unlike speech sounds in detail. The likeness is apparent only when speech spectra are considered in the abstract.

To be precise, an inventory of the physical, auditory, and psychoacoustic attributes of sine-wave speech would find only remote correspondence to natural speech. To note the abstract similarity of these two signals requires indifference to the details that have held the focus of researchers under the rubric of acoustic cues (Raphael, 2005). None of the spectral elements of natural speech is found in sine-wave speech, which consists of three or four spectral lines instead of resonances, and which is stripped of the frication, aspiration and bursts that are familiar short-term aperiodic correlates of contrasts in voicing, manner and articulatory place. Without these ingredients of vocal sound, sine-wave speech remains effective perceptually because the spectrotemporal pattern superordinate to the psychoacoustic details evokes linguistic impressions even when the subordinate constituents are wrong for speech. The perceiver experiences both states: linguistic properties conveyed by the spectrotemporal modulation, and auditory form impressions of the changing pitch and loudness of individual tone components.

Figure 6: This histogram shows the distribution of consonant errors committed by naïve listeners grouped by type of error. Sine-wave errors are compared here to the error distribution for identifying natural CV syllables in noise at -5 dB S/N. The height of the dark bar indicates performance for listening to natural syllables in noise, and the light bar for listening to sine-wave syllables. The distributions do not differ between natural speech in noise and sine-wave speech. Performance in sine-wave CV consonant identification by experienced listeners is almost free of error.

Findings of intelligibility of sine-wave speech for every phone class (see Figure 6) demote the isolable phonetic cues from their status as the causes of speech perception to mere constituents of a regular albeit unpredictable causal spectral pattern. Removed from a speech signal, few acoustic correlates of phonetic contrasts elicit an impression of speech at all. When the cumulative pattern is preserved, the specific aspect of the element within the pattern might matter very little in the perception of a segmental contrast. Studies of sine-wave speech have therefore revealed the relative perceptual independence of analysis of the psychoacoustic forest and of individual psychoacoustic trees, to offer an analogy. Indeed, the studies show that a perceptually effective acoustic pattern confers causal status on an element within it. For this reason, a listener perceives speech as if the commitment to the particular sensory realization of linguistic contrasts were flexible. Such readiness to find functional contrasts in the least expected acoustic or auditory form opposes the fixity of an auditory norming rationale for perception.





Perceptual identification of individual talkers

The vocal quality of a sine-wave sentence is consistently reported as highly unnatural (Remez et al., 1981). This reflects the contrapuntal whistling of the tones accompanying the sequence of syllables, for no natural talker ever produces anything approximating sine-wave speech. With such extraordinary vocal timbre, it is perhaps surprising that listeners are able nonetheless to identify distinctive characteristics of the talker whose speech sample provided the spectral model for sine-wave replication (Remez et al., 1997; Sheffert et al., 2002).

Research on the identification of talkers by ear distinguishes linguistic properties from indexical properties of speech (Abercrombie, 1967). If the former includes the composition of the message denoted by an utterance, the latter comprises the attributes specific to the talker who spoke it, and classically consists of the acoustic effects of a talker’s regional and social group, of the talker’s age and sex, and of the talker’s affective state and arousal. Studies of perceptual identification of individual talkers from sine-wave sentences suggest that listeners notice and remember idiosyncratic articulatory habits that produce perceptually distinctive variants. While the perception of linguistic properties depends on the recognition of attributes that are shared within a linguistic community, independent of the talker who produces them, the perception of a specific talker rests on characteristics that are individually distinctive, independent of the linguistic properties of any specific utterance (Kreiman, 1997).

With this definition, researchers have treated indexical properties of a talker as a second message parallel to linguistic form (Bricker & Pruzansky, 1976; Hollien, 2001), with its own set of acoustic correlates of the qualitative differences among voices: smooth-rough, breathy-full, steady-shaky, etc. Sine-wave speech omits a component following the fundamental frequency of phonation (Remez & Rubin, 1984, 1993) and therefore does not convey a correlate of vocal pitch, nor does a sine-wave copy of a natural utterance preserve the broad-band resonance structure, glottal spectrum or other detailed acoustic attributes that are often proposed as acoustic correlates of vocal quality. Nonetheless, sine-wave speech has replicated the acoustic correlates of vocal tract scale (Remez et al., 1987) by preserving the frequency variation of vocal resonances, and naïve listeners are capable of identifying the sex of a talker from a sine-wave sample (Brungart et al., 2006; Fellowes et al., 1997). Identification of some individuals persists when vocal tract scale differences are eliminated (Fellowes et al., 1997), evidence of the indexical effect of fine-grained phonetic expression of phoneme contrasts preserved in sine-wave speech. These include many allophonic variations, among them the held or released stop consonants occurring at the end of a word; the graded variations in the expression of r; and the alternative use of the tongue tip or tongue blade to produce d.

Lexically, such phonetic contrasts do not distinguish one English word from another. Indexically, because talkers are hypothetically consistent in their use of these allophones and because this use is memorable to listeners, these linguistically governed features are available for the perceptual identification of specific individuals. Studies of sine-wave talker identification illustrate that idiosyncratic phonetic articulatory habits can be indexical. Even with the acoustic correlates of vocal tract scale variation removed, listeners can identify individual talkers from the phonetic contrasts that reliably mark the speech of particular individuals. Although evidence suggests that qualitative properties of the voice are powerfully salient (Kreiman, 1997), and the allophone variants composing a talker’s idiosyncratic style are less so (Sheffert et al., 2002), both kinds of indexical property are available. Indeed, idiosyncratic phonetic sources of indexical attributes survive conditions of communication of novel messages, and conditions that obscure or distort vocal quality, such as the telephone or a bad cold, as well as sine-wave speech. This implicit understanding might underwrite the perceptual reliance on this aspect of speech, independent of vocal quality, in the identification of talkers.

References

Abercrombie, D. (1967). Elements of General Phonetics. Chicago: Aldine.

Bobrow, D. G., & Norman, D. A. (1975). Some principles of memory schemata. In D. G. Bobrow & A. M. Collins (Eds.), Representation and Understanding (pp. 131-149). New York: Academic Press.

Bregman, A. S. (1990). Auditory Scene Analysis. Cambridge, Massachusetts: MIT Press.

Bregman, A. S., & Pinker, S. (1978). Auditory streaming and the building of timbre. Canadian Journal of Psychology, 32, 19-31.

Bricker, P. D., & Pruzansky, S. (1976). Speaker Recognition. In N. J. Lass (Ed.), Contemporary Issues in Experimental Phonetics (pp. 295-326). New York: Academic Press.

Brungart, D., Iyer, N., & Simpson, B. (2006). Monaural speech segregation using synthetic speech signals. Journal of the Acoustical Society of America, 119, 2327-2333.

Carlyon, R. P., Cusack, R., Foxton, J. M., & Robertson, I. H. (2001). Effects of attention and unilateral neglect on auditory stream segregation. Journal of Experimental Psychology: Human Perception and Performance, 27, 115–127.

Carrell, T. D., & Opie, J. M. (1992). The effect of amplitude comodulation on auditory object formation in sentence perception. Perception & Psychophysics, 52, 437-445.

Cusack, R., Deeks, J., Aikman, G., & Carlyon, R. P. (2004). Effects of location, frequency region, and time course of selective attention on Auditory Scene Analysis. Journal of Experimental Psychology: Human Perception and Performance, 30, 643–656.

Darwin, C. J. (1997). Auditory grouping. Trends in Cognitive Science, 1, 327-333.

Darwin, C. J. (2008). Listening to speech in the presence of other sounds. Philosophical Transactions of the Royal Society B, 363, 1011-1021.

Elliott, L. L. (1962). Backward and forward masking of probe tones of different frequencies. Journal of the Acoustical Society of America, 34, 1116-1117.

Fellowes, J. M., Remez, R. E., & Rubin, P. E. (1997). Perceiving the sex and identity of a talker without natural vocal timbre. Perception & Psychophysics, 59, 839-849.

Hollien, H (2001). Forensic Voice Identification. New York: Academic Press.

Howell, P., & Darwin, C. J. (1977). Some properties of auditory memory for rapid formant transitions. Memory & Cognition, 5, 700-708.

Klatt, D. H. (1989). Review of selected models of speech perception. In W. Marslen-Wilson (Ed.), Lexical Representation and Process (pp. 169-226) Cambridge, MA: MIT Press.

Kreiman, J. (1997). Listening to voices: Theory and practice in voice perception research. In K. Johnson & J. W. Mullenix (Eds.), Talker Variability in Speech Processing (pp. 85-108). San Diego: Academic Press.

Liebenthal, E., Binder, J. R., Piorkowski, R. L., & Remez, R. E. (2003). Short-term reorganization of auditory analysis induced by phonetic experience. Journal of Cognitive Neuroscience, 15, 549-558.

Pisoni, D. B., & Tash, J. (1974). Reaction times to comparisons within and across phonetic categories. Perception & Psychophysics, 15, 285-290.

Raphael, L. J. (2005). Acoustic cues to the perception of segmental phonemes. In D. B. Pisoni and R. E. Remez (Eds.), The Handbook of Speech Perception (pp. 182-206). Oxford: Blackwell.

Remez, R. E. (2005). Perceptual organization of speech. In D. B. Pisoni and R. E. Remez (Eds.), The Handbook of Speech Perception (pp. 28-50). Oxford: Blackwell.

Remez, R. E., & Rubin, P. E. (1984). On the perception of intonation from sinusoidal sentences. Perception & Psychophysics, 35, 429-440.

Remez, R. E., & Rubin, P. E. (1993). On the intonation of sinusoidal sentences: Contour and pitch height. Journal of the Acoustical Society of America, 94, 1983-1988.

Remez, R. E., Fellowes, J. M., & Rubin, P. E. (1997). Talker identification based on phonetic information. Journal of Experimental Psychology: Human Perception and Performance, 23, 651-666.

Remez, R. E., Ferro, D. F., Wissig, S. C., & Landau, C. A. (in press). Asynchrony tolerance in the perceptual organization of speech. Psychonomic Bulletin & Review.

Remez, R. E., Pardo, J. S., Piorkowski, R. L., & Rubin, P. E. (2001). On the bistability of sine-wave analogues of speech. Psychological Science, 12, 24-29.

Remez, R. E., Rubin, P. E., Berns, S. M., Pardo, J. S., & Lang, J. M. (1994). On the perceptual organization of speech. Psychological Review, 101, 129-156.

Remez, R. E., Rubin, P. E., Nygaard, L. C., & Howell, W. A. (1987). Perceptual normalization of vowels produced by sinusoidal voices. Journal of Experimental Psychology: Human Perception and Performance, 13, 40-61.

Remez, R. E., Rubin, P. E., Pisoni, D. B., & Carrell, T. D. (1981). Speech perception without traditional speech cues. Science, 212, 947-950.

Saberi, K., & Perrott, D. R. (1999). Cognitive restoration of reversed speech. Nature, 398, 760.

Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270, 303-304.

Sheffert, S. M., Pisoni, D. B., Fellowes, J. M., & Remez, R. E. (2002). Learning to recognize talkers from natural, sinewave and reversed speech samples. Journal of Experimental Psychology: Human Perception and Performance, 28, 1447-1469.

Smith, Z. M., Delgutte, B., & Oxenham, A. J. (2002). Chimaeric sounds reveal dichotomies in auditory perception. Nature, 416, 87-90.

Stevens, K. N. (1999). Acoustic Phonetics. Cambridge, Massachusetts: MIT Press.

Stevens, K. N., & Blumstein, S. E., (1981). The search for invariant acoustic correlates of phonetic features. In P. D. Eimas & J. L. Miller (Eds.), Perspectives on the Study of Speech (pp. 1-38). Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Stone, M. A., & Moore, B. C. J. (2008). Effects of spectro-temporal modulation changes produced by multi-channel compression on intelligibility in a competing-speech task. Journal of the Acoustical Society of America, 123, 1063-1076.

Wang, D. L., & Brown, G. J. (Eds.) (2006). Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. New York: Wiley IEEE Press.

Wertheimer, M. (1923). Untersuchungen zur Lehre von der Gestalt, II. Psychologische Forschung, 4, 301-350. [Translated as "Laws of organization in perceptual forms," in W. D. Ellis (Ed.), A Sourcebook of Gestalt Psychology (pp. 71-88). London: Routledge & Kegan Paul, 1938.]



Internal references

Eugene M. Izhikevich (2006) Bursting. Scholarpedia, 1(3):1300.

James Meiss (2007) Dynamical systems. Scholarpedia, 2(2):1629.

Mark Aronoff (2007) Language. Scholarpedia, 2(5):3175.

Howard Eichenbaum (2008) Memory. Scholarpedia, 3(3):1747.

Kendall E. Atkinson (2007) Numerical analysis. Scholarpedia, 2(8):3163.

Sadaoki Furui (2008) Speaker recognition. Scholarpedia, 3(4):3715.

Arkady Pikovsky and Michael Rosenblum (2007) Synchronization. Scholarpedia, 2(12):1459.





Recommended reading

Pardo, J. S., & Remez, R. E. (2006). The perception of speech. In M. Traxler and M. A. Gernsbacher (Eds.), Handbook of Psycholinguistics, 2nd Edition (pp. 201-248). New York: Academic Press

Remez, R. E. (2005). Perceptual organization of speech. In D. B. Pisoni and R. E. Remez (Eds.), The Handbook of Speech Perception (pp. 28-50). Oxford: Blackwell.

Remez, R. E., Rubin, P. E., Berns, S. M., Pardo, J. S., & Lang, J. M. (1994). On the perceptual organization of speech. Psychological Review, 101, 129-156.

See also

Synchronization, Speech recognition, Speaker recognition