In our daily lives, auditory stream segregation allows us to differentiate concurrent sound sources and to make sense of the scene we are experiencing. However, a combination of segregation and the concurrent integration of auditory streams is necessary in order to analyze the relationship between streams and thus perceive a coherent auditory scene. The present functional magnetic resonance imaging study investigates the relative roles and neural underpinnings of these listening strategies in multi-part musical stimuli. We compare a real human performance of a piano duet with a synthetic rendition of the same duet in a prioritized integrative attention paradigm that required the simultaneous segregation and integration of auditory streams. In so doing, we manipulate the degree to which the attended part of the duet leads either structurally (attend melody vs. attend accompaniment) or temporally (asynchronies vs. no asynchronies between parts), and thus the relative contributions of integration and segregation to the assessment of the leader-follower relationship. We show that the perceived relationship between parts is biased towards the conventional structural hierarchy of western music, in which the melody generally dominates (leads) the accompaniment. Moreover, the assessment varies as a function of cognitive load, as shown through difficulty ratings, and of the interaction between the temporal and structural relationship factors. Neurally, the temporal relationship between parts, an important cue for stream segregation, elicited distinct activity in the planum temporale. By contrast, the integration required when listening to both the temporally separated performance stimulus and the temporally fused synthetic stimulus resulted in activation of the intraparietal sulcus.
These results support the hypothesis that the planum temporale and the intraparietal sulcus are key structures underlying the mechanisms of segregation and integration of auditory streams, respectively.

Introduction

Multi-part music is an example of a complex auditory scene. Bregman [1] has proposed that stream segregation, and through it auditory scene analysis, is based on general gestalt principles such as temporal proximity or closeness in pitch. Through these principles, stream segregation in multi-part music is based, for example, on distances in pitch space: tones separated by small distances are grouped into the same musical part, while large distances between pitches allow parts to be differentiated (for more details on segregation cues in music see [2], [3]). Another proposed grouping cue is the hierarchical structural relationship of melody and accompaniment, with the melody dominating perceptually over the harmonizing accompaniment [1], [4], [5]. However, segregating music into its component streams is often made more challenging when different parts share the same or a similar timbre (e.g. string quartets or piano duets) and when harmony between the parts causes horizontal (i.e. over time) and vertical (i.e. fusion of tones within chords) grouping to compete for perception [1], [6], [7]. Temporal components such as differences in note onsets, or asynchronies between parts, may represent more reliable cues in such situations [1], [6], [8].
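The pitch-proximity principle described above can be illustrated with a minimal sketch (this is a hypothetical toy algorithm for exposition, not the authors' analysis method): each incoming note is greedily assigned to the stream whose most recent pitch is closest, or starts a new stream when every existing stream is too far away in pitch space.

```python
# Hypothetical sketch of Bregman-style pitch-proximity grouping:
# each incoming note joins the stream whose most recent pitch is
# closest, or starts a new stream if all streams are too distant.
def group_by_pitch_proximity(pitches, max_interval=7):
    """Greedily assign MIDI pitches to streams by pitch proximity.

    max_interval: largest semitone gap still grouped into one stream
    (an illustrative threshold, not an empirically fitted value).
    """
    streams = []  # each stream is a list of pitches
    for p in pitches:
        # distance from this note to the last note of every stream
        candidates = [(abs(p - s[-1]), s) for s in streams]
        if candidates:
            dist, nearest = min(candidates, key=lambda c: c[0])
            if dist <= max_interval:
                nearest.append(p)
                continue
        streams.append([p])  # too distant: heard as a new stream
    return streams

# An alternating high/low sequence splits into two streams,
# much as a melody separates from a bass accompaniment:
print(group_by_pitch_proximity([72, 48, 74, 47, 76, 45, 77, 43]))
```

With a small threshold the interleaved sequence yields the two streams `[72, 74, 76, 77]` and `[48, 47, 45, 43]`, mirroring how large pitch distances between parts support segregation while small distances within a part support sequential grouping.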

The perceptual analysis of complex auditory scenes relies upon two specific mechanisms, stream segregation and stream integration. While stream segregation is necessary to group sequential auditory information coming from different sources, integration, as a higher order process, then places streams into the same representational space to allow for an assessment of the relationship between them (i.e. distance, space, structural importance) [9]–[11]. Two neuroanatomical structures have been implicated in these mechanisms.

It has been proposed that the planum temporale (PT) is involved in segregating incoming auditory streams [12], [13]. More specifically, relevant information about stimulus attributes such as spatial position and movement [13], temporal cues [14], [15], or general spectro-temporal patterns is used to segregate streams, and the resulting stream information is then forwarded to the parietal lobe for further processing [12], [13].

The integration of information from different sources, on the other hand, is achieved through the involvement of the inferior parietal cortex (IPC). Across sensory modalities, the IPC has been implicated in processing the relationships between objects [10] and their magnitudes [9], [16]. In the auditory domain, this brain area has been shown to be activated during the assessment of pitch relations, such as comparing a melody to a reversed melody [10], [17], [18], and during the assessment of temporal relations, such as comparing time intervals [19]–[22].

It has been hypothesized that a form of divided attention, termed “prioritized integrative attention”, is employed when listening to or producing multi-part music [11], [23]–[25]. This kind of attention allows the listener to prioritize one of the streams while still integrating the rest, so as to capture a holistic soundscape and to assess the relationships between the parts. Prioritized integrative attention may thus be uniquely suited to the investigation of auditory scene analysis, where both segregation and integration of streams are required.

Relationships between streams can be determined on the basis of different stream attributes (e.g. louder than, higher in pitch than, faster than) and are especially important in music [26], as they contribute to the perception of a “conversation-like” relationship between the voices of instruments (cf. [27]). This relationship may also be more abstract, encompassing, for example, leader and follower roles between the different instrumental parts [28]. Leading and following in music can be described on a temporal basis: one player intentionally or unintentionally produces sounds slightly ahead in time and, as such, is temporally leading [28]–[32]. Alternatively, leading and following can be defined structurally, with the melody leading and the accompaniment following, as is conventionally the case in many western styles of music [1], [4], [5], [11], [27], [33]. A hierarchy in which the melody leads or even perceptually dominates the accompaniment is sometimes considered analogous to visual figure-ground perception, with the melody defining the figure and the accompaniment the background [4], [5]. In everyday life, music listeners are generally more familiar with this structural relationship (melody lead) than with the reverse (accompaniment lead), which can influence their perception via top-down mechanisms [11]. Leader and follower roles can thus be defined either through a temporal manipulation, which relates to asynchronies between voices, or through the structural relationship of a musical piece, which relates to the hierarchical convention that the melody leads in western music.

In a recent paper [11] we showed that the two kinds of relationship (structural and temporal) interact at both the behavioral and the neural level, highlighting the value of prioritized integrative attention tasks for ongoing research in music perception. That study explored the interaction of the leader-follower relationship factors by manipulating the temporal relationship, contrasting a natural performance stimulus without a global leader with a version containing an exaggerated global temporal leader. The exaggeration, although synthetically created, remained within the range of natural performance asynchronies. The effect of the temporal relationship on behavioral and neural responses could not, however, be interpreted strongly in favor of the segregation mechanism, as both kinds of stimuli could be segregated on the basis of temporal cues. In the present study, the same task was used to explore in greater detail the neural underpinnings of segregation and integration as mechanisms involved in listening to multi-part music (piano duets). The leader-follower relationship was manipulated by using a recording of a real performance of the duet, which included natural local temporal variations between parts (asynchronies), and contrasting it with a synthetically computer-generated version of the duet containing no temporal variations within or between parts. The use of a synthetic control stimulus is consistent with common practice in imaging studies of music listening, which employ synthetic stimuli instead of natural performances (e.g. [34]). Participants were cued to follow (prioritize) one of the two duet streams and therefore to segregate the streams present in the piano duet stimulus.
A question about the leader-follower relationship between the parts of the duet, presented after the listening task, additionally required participants to concurrently integrate the second stream into a common representational space with the first. Participants judged whether the attended part was leading or following relative to the second duet part. Only by integrating the two streams could a picture of the leader-follower relationship between the melody and accompaniment of the duet emerge.

In the performance stimulus, depending on the direction of the asynchrony, either the melody or the accompaniment part was temporally leading or following locally, but not globally across the entire recording (i.e., the median asynchrony between parts was close to zero). As such, while local asynchronies could serve as cues for segregating the two piano duet streams, no global temporal cue indicated an overall leader. Both parts of the piano duet had the same instrumental timbre; the cues available for segregation therefore differed between the two stimulus types only in the temporal relationship between parts [1], [8]. The temporal relationship between parts, being one possible factor defining leader-follower roles in music, was expected to drive the perception of the leader-follower relationship. Nevertheless, it was unclear whether the temporally separated performance stimulus or the temporally fused synthetic stimulus, which was much simpler owing to the absence of a changing temporal relationship between parts, would be more difficult to judge.
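The temporal structure described above can be made concrete with a toy computation (the onset times below are invented for illustration and are not the study's data): local asynchronies are the signed onset differences between nominally simultaneous notes of the two parts, and a median close to zero indicates that neither part leads globally even though local leads alternate.

```python
# Hypothetical illustration of local vs. global temporal leads.
# Onset times (seconds) are invented; they are not the study's data.
from statistics import median

melody_onsets        = [0.00, 0.50, 1.00, 1.50, 2.00]
accompaniment_onsets = [0.02, 0.47, 1.00, 1.53, 1.98]

# Positive value: melody sounds first (leads locally);
# negative value: accompaniment sounds first.
asynchronies = [a - m for m, a in zip(melody_onsets,
                                      accompaniment_onsets)]

print(asynchronies)          # signs alternate: only local leads
print(median(asynchronies))  # 0.0: no global temporal leader
```

Here both positive and negative asynchronies occur, so each part leads at some moments, yet the median asynchrony is zero, matching the "no global leader" property of the performance stimulus; the synthetic stimulus would correspond to all asynchronies being exactly zero.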

For the subjective assessment of the leader-follower relationship, we thus posited that the performance stimulus could be rated on the basis of its temporal relationship, its structural relationship, or, as participants were not explicitly aware of these two components, a combination of both factors. By contrast, the leader-follower relationship between the parts of the synthetic stimulus could only be based on the structural, hierarchical relationship. A comparison of the two stimulus types would thus shed light both on the integration of the structural and temporal relationship factors and on segregation processes driven by the difference in temporal relationship cues, which we hypothesized to involve the PT. The assessment of the relationship between parts, and thus their integration, was in turn expected to be reflected in activations common to both stimulus types (the temporally separable performance stimulus and the temporally inseparable synthetic stimulus) within the IPC.