Intensive musical training places significant demands on auditory processing (e.g., in making subtle distinctions between sounds in terms of pitch, timing and timbre) and on cognitive abilities such as auditory attention and working memory. These demands are not unique to music. Speech perception, for example, also depends on detailed auditory analysis operating in concert with working memory and auditory attention1. The shared demands of musical training and speech perception may rely on partly overlapping brain mechanisms: growing evidence suggests that the brain networks involved in music and speech processing are not entirely segregated within the cerebral cortex and may in fact have a significant degree of overlap2,3,4,5. This raises a fundamental question: are linguistic and musical abilities related in significant ways, or do they constitute largely distinct mental faculties, as suggested by some theorists6?

One way to address the question is to compare musically trained and untrained individuals on language processing tasks. If musically trained individuals show benefits on these tasks, this would suggest neurobiological connections between music and speech processing. Such benefits could arise because 1) individuals who are innately advantaged in certain auditory and cognitive processes shared by music and speech are attracted to musical training, 2) musical training enhances speech processing via experience-dependent neural plasticity in brain networks shared by speech and music, or because of a combination of 1) and 2)7,8.

There are now numerous studies comparing musically trained and untrained individuals across a variety of language tasks. For some abilities, including speech intonation perception9,10, vocal affect discrimination in sentences11,12 and production and perception of second language phonological contrasts13,14, multiple studies have found musical training to be associated with enhanced speech processing. One area where research has produced less consistent results, however, concerns speech perception in “noise” (meaning, generally, unwanted sounds not limited to white noise or speech-shaped noise). This is an important ability in everyday life since speech is often heard in the context of other sounds and is also an ability in which normal-hearing individuals can vary widely15.

The idea that musicians might show benefits in speech-in-noise perception seems plausible. In the practice of their art, musicians depend upon their ability to listen selectively to individual instruments within a musical ensemble and to shift the focus of attention from one instrument to another at will. This bears a striking similarity to the problem of attending to a specific human voice among several competing voices, an extensively studied problem known as the “cocktail party” problem16,17. Conversing in a “cocktail party” type of environment has been shown to be extremely challenging for listeners with sensorineural hearing loss18, for cochlear implantees19 and even for some listeners with clinically normal hearing15.

There are, of course, several differences between selective listening in musical and linguistic contexts. For example, members of a musical ensemble are typically playing the same piece (although different instruments may be playing different parts), while the cocktail party problem involves selecting a given talker from various independent conversations. However, to the extent that both situations place demands on the capacity for selective listening in a complex auditory scene and these demands engage brain networks shared by music and speech processing, then one might expect musicians to have an enhanced ability to select and attend to a target talker in the presence of competing (masking) talkers.

To date, research on speech perception in multiple-source environments by musicians has produced equivocal results. On the one hand, several studies have reported small but statistically significant benefits for musicians on standard tests of speech-in-noise perception. For example, Parbery-Clark et al.20 demonstrated a small but significant performance advantage for young adult musicians over non-musicians in two clinical tests of speech understanding in noise (overall effect size <1 dB between groups). On the other hand, two more recent studies (Ruggles et al.21; Boebinger et al.22) found no benefit for musicians in tests of speech-in-noise perception. Given these inconsistent results, further research on this topic is warranted because of its potential theoretical and practical significance. In terms of basic research, if musicians show clear advantages for hearing speech in noise, this would offer researchers a useful population for exploring the mechanisms (sensory and cognitive) that contribute to better speech-in-noise perception. This in turn could help hearing scientists understand the factors underlying the large individual differences mentioned above. From a practical perspective, if musical training actually causes improvements in speech-in-noise perception, this would have significant implications for designing training programs to enhance this ability in normal and clinical populations7.

The current study examines speech perception in musically trained and untrained individuals, using a multiple-talker masking approach. We focus not on questions of causality, which require longitudinal studies with random assignment to musical vs. nonmusical training, but rather on attempting to determine whether musicians show benefits for selective listening in a cocktail-party-like listening task. Unlike most previous studies, we use competing sounds that consist of intelligible sentences that are spatially separated from the target sentence. This emulates an ecologically realistic situation in which one seeks to understand an interlocutor whom one is facing directly while trying to ignore nearby speakers. To help distinguish between the different factors that contribute to masking in such situations, we separately manipulate two types of masking caused by the interfering speech: informational and energetic masking (henceforth, IM and EM). EM occurs when maskers overlap in time and frequency with the target, producing competition for representation at the auditory periphery (e.g., “sensory interference”). IM occurs when maskers are highly similar to and/or confusable with the target, thus producing competition at physiological sites beyond the auditory periphery (e.g., “cognitive interference”). Using Gaussian white noise or speech-shaped noise to mask speech, for example, creates high EM but little IM, since there is no other intelligible signal competing for cognitive processing. By using speech as the masking stimulus we create both EM and IM, but crucially, we can manipulate the maskers in specific ways to vary the amount of IM in different conditions, from very high to very low.

The modulation of IM in our stimuli is based on manipulating both the spatial location and the intelligibility of the masking speech. In terms of spatial location, it has been demonstrated that the intelligibility of target speech is improved considerably when the competing maskers are spatially separated from the target23,24, or appear to be separated from the target25,26, relative to the case when all of the sounds arise from the same location, an effect referred to as “spatial release from masking” (SRM). It also has been shown that the IM component of speech-on-speech masking may play a critical role in determining the magnitude of SRM (e.g., reviewed in27). In other words, much of the benefit listeners receive from spatially separating the target speech from the masker speech seems to be due to cognitive factors producing a release from IM (e.g., the ability of the listener to focus on a target signal and suppress the cognitive/linguistic processing of distractor signals) rather than exclusively to sensory factors producing a release from EM (e.g., reduction of within-channel competition for representation of the target28).

In terms of intelligibility, we manipulated the IM produced by the masking speech by either playing it forward (in which case it was normal and fully intelligible) or by reversing its time-domain signal (rendering it unintelligible). The comparison of performance with these two maskers has the advantage that they have very similar spectrotemporal structures (see Fig. 1) and thus are expected to produce equivalent amounts of EM while differing substantially in the amount of IM they produce. However, it should be noted that the relative benefit for target speech intelligibility produced by time-reversing masker speech may depend crucially on the specific procedures used (e.g., the speech corpus, the way the target speech is designated as separate from the masker speech, other segregation cues present such as talker sex differences, etc.; see recent review in29) and some studies have reported little or no effect of masker time reversal30,31,32. Generally, if the procedures involved produce little IM for forward masker speech, then time reversal would likely not provide much of a benefit. Here, we used neural modeling of auditory peripheral processing to verify that, for the stimuli used in this study, forward and reversed maskers produced similar EM of the target temporal features (see supporting information for details) and therefore any differences in the masking they produced could reasonably be attributed to differences in IM.
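One reason forward and reversed maskers are expected to be closely matched in energy can be illustrated with a minimal sketch: reversing a real-valued signal in time leaves its long-term magnitude spectrum and total energy unchanged, even though the waveform itself differs. The sketch below is an illustration only and is not the neural modeling described above. The sample rate, the amplitude-modulated noise used as a stand-in for speech, and all variable names are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 16000                      # assumed sample rate in Hz (hypothetical)
t = np.arange(fs) / fs          # one second of samples

# Synthetic stand-in for a speech masker: noise carrier shaped by an
# asymmetric, syllable-rate (4 Hz) ramp envelope. Not real speech.
envelope = ((t * 4) % 1.0) ** 2
forward = envelope * rng.standard_normal(fs)

# Time reversal of the time-domain signal.
reversed_masker = forward[::-1]

# Long-term magnitude spectra and total energy of both versions.
mag_forward = np.abs(np.fft.rfft(forward))
mag_reversed = np.abs(np.fft.rfft(reversed_masker))

same_spectrum = np.allclose(mag_forward, mag_reversed)
same_energy = np.isclose(np.sum(forward**2), np.sum(reversed_masker**2))
same_waveform = np.allclose(forward, reversed_masker)

print("same magnitude spectrum:", same_spectrum)
print("same total energy:", same_energy)
print("same waveform:", same_waveform)
```

Matching long-term spectra is a weaker statement than matching EM, because reversal also flips the shape of each temporal envelope segment (onsets become offsets). That is why a check based on a peripheral auditory model, as described above, is more informative than this spectral comparison.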

Figure 1. (A) Speaker locations relative to listener. (B, C) Example target and masker waveforms and spectrograms for forward and reversed speech. Target: “Jane took two new toys”; Forward masker 1: “Sue bought six red pens”; Forward masker 2: “Lynn held nine cold bags”.

There are four conditions in our study. In all conditions, the target is a short intelligible sentence (e.g., “Jane saw two red shoes”) coming from directly ahead of the participant. The target is always presented with two other similar sentences (spoken by different speakers) which serve as maskers. In conditions 1 and 2 the maskers are intelligible and are either colocated with the target (condition 1) or spatially separated from it (condition 2). In conditions 3 and 4 the maskers are unintelligible (time-reversed) and are again either colocated with the target (condition 3) or spatially separated from it (condition 4). This results in a set of conditions in which EM is very similar but in which IM is gradually reduced, from very high in condition 1, through intermediate in conditions 2 and 3, to very low in condition 4.
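The four conditions form a 2 × 2 design (masker intelligibility × spatial configuration), which can be laid out as a small sketch. The class and field names are hypothetical. The rule that maps the number of available release cues (time reversal, spatial separation) onto an IM label is an illustrative encoding of the ordering described above, not a quantitative model.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Condition:
    number: int
    masker_intelligible: bool   # True = forward speech, False = time-reversed
    spatially_separated: bool   # False = maskers colocated with the target

    @property
    def im_level(self) -> str:
        # Count the cues that reduce IM: unintelligible maskers and
        # spatial separation. Zero cues = very high IM, two = very low.
        cues = int(not self.masker_intelligible) + int(self.spatially_separated)
        return ("very high", "intermediate", "very low")[cues]


# Enumerate in the order used in the text: conditions 1-2 use forward
# maskers, conditions 3-4 use reversed maskers; colocated comes first.
conditions = [
    Condition(number, intelligible, separated)
    for number, (intelligible, separated) in enumerate(
        product([True, False], [False, True]), start=1
    )
]

for c in conditions:
    masker = "forward" if c.masker_intelligible else "reversed"
    layout = "separated" if c.spatially_separated else "colocated"
    print(f"Condition {c.number}: {masker} maskers, {layout}, IM {c.im_level}")
```

Laying the design out this way makes explicit that conditions 2 and 3 each remove one source of IM, while condition 4 removes both.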

We predicted that if musicians showed a benefit for hearing speech in noise, the degree of this benefit would be strongly modulated by the amount of IM created by the maskers. This prediction was based on prior research showing musician advantages on auditory cognitive tasks using nonlinguistic stimuli33,34. It was also based on research with nonlinguistic stimuli (e.g., tone bursts) showing that musicians are better at concurrent sound segregation and less susceptible to IM than non-musicians35,36. The current work built on this prior work, but employed intelligible, spatialized speech as the key stimulus, in order to determine if musicians showed advantages for speech-in-noise perception in more ecologically valid situations.