It’s easy to think about music as just a sequence of sounds – recorded and encoded in a Spotify stream, these days, but still: an acoustic phenomenon that we respond to because of how it sounds. The source of music’s power, according to this account, lies in the notes themselves. To pick apart how music affects us would be a matter of analysing the notes and our responses to them: in come notes, out tumbles our perception of music. How does Leonard Cohen’s Hallelujah work its magic? Simple: the fourth, the fifth, the minor fall, the major lift…

Yet thinking about music in this way – as sound, notes and responses to notes, kept separate from the rest of human experience – relegates music to a special, inscrutable sphere accessible only to the initiated. Notes, after all, are things that most people feel insecure about singing, and even less sure about reading. The vision of an isolated note-calculator in the brain, taking sound as input and producing musical perceptions as output, consigns music to a kind of mental silo.

But how could a cognitive capacity so removed from the rest of human experience have possibly evolved independently? And why would something so rarefied generate such powerful emotions and memories for so many of us?

In fact, the past few decades of work in the cognitive sciences of music have demonstrated with increasing persuasiveness that the human capacity for music is not cordoned off from the rest of the mind. On the contrary, music perception is deeply interwoven with other perceptual systems, making music less a matter of notes, the province of theorists and professional musicians, and more a matter of fundamental human experience.

Brain imaging produces a particularly clear picture of this interconnectedness. When people listen to music, no single ‘music centre’ lights up. Instead, a widely distributed network activates, including areas devoted to vision, motor control, emotion, speech, memory and planning. Far from revealing an isolated, music-specific area, the most sophisticated technology we have available to peer inside the brain suggests that listening to music calls on a broad range of faculties, testifying to how deeply its perception is interwoven with other aspects of human experience. Beyond just what we hear, what we see, what we expect, how we move, and the sum of our life experiences all contribute to how we experience music.

If you close your eyes, you might be able to picture a highly expressive musical performance: you might see, for instance, a mouth open wide, a torso swaying, and arms lifting a guitar high into the air. Once you start picturing this expressive display, it’s easy to start hearing the sounds it might produce. In fact, it might be difficult to picture these movements without also imagining the sound.

Or you could look – with the volume muted – at two performances of the same piano sonata on YouTube, one by an artist who gesticulates and makes emotional facial expressions, and the other by a tight-lipped pianist who sits rigid and unmoving at the keyboard. Despite the fact that the only information you’re receiving is visual, you’ll likely imagine very different sounds: from the first pianist, highly expressive fluctuations in dynamics and timing, and from the second, more straightforward and uninflected progressions.

Could it be that visual information actually affects the perception of musical sound, and contributes substantially to the overall experience of a performance? Numerous studies have attempted to address this question. In one approach, the psychologist Bradley Vines at McGill University in Canada and colleagues video-recorded performances intended to be highly expressive as well as ‘deadpan’ performances, in which performers are instructed to play with as little expressivity as possible. Then the researchers presented these recordings to the participants, either showing them just the video with no sound, or playing them just the audio with no video, or playing them the full audiovisual recording – or, in a particularly sneaky twist, playing them a hybrid video, in which the video from the expressive performance was paired with the audio from the deadpan performance, and vice versa.

It turns out that participants tend to describe as more expressive and emotional whichever performance is paired with the more expressive video – rather than the recording with the more expressive sound. In a separate experiment, the psychologist Chia-Jung Tsay at University College London showed that people predicted the winners of music competitions more successfully when they watched silent videos of their performances than when they merely heard the performances, or watched the video with the sound on.

Music, it seems, is a highly multimodal phenomenon. The movements that produce the sound contribute essentially, not just peripherally, to our experience of it – and the visual input can sometimes outweigh the influence of the sound itself.

Visual information can convey not only a performance’s emotional content, but also its basic structural characteristics. Work by the psychologists Bill Thompson at Macquarie University in Sydney and Frank Russo at Ryerson University in Toronto showed that people could judge the size of an interval being sung even when they couldn’t hear it – merely by watching facial expressions and head movements. When video of a person singing a larger interval was crossed with audio from a smaller one, people actually heard the interval as larger. Similarly, when Michael Schutz and Scott Lipscomb, then both at Northwestern University in Illinois, crossed video of a percussionist playing a long note with audio from a short note, people actually heard the note’s duration as longer.

Multisensory integration at this basic level feeds into some of the higher-level effects of vision on perceived emotion. For example, pairing audio of a sung minor interval, typically heard as sad, with video footage of someone singing a major interval, typically heard as happy, leads to the minor interval being rated as happier.

A musical experience is more than an audiovisual signal. Maybe you’re trying out a new band because your best friend recommended it, or because you’re doing your parent a favour. Maybe you’re experiencing a concert in a gorgeous hall with a blissed-out audience, or maybe you’ve wandered into a forlorn venue with a smattering of bored-looking folks, all of whom seem to have positioned themselves as far from the stage as possible. These situations elicit markedly different sets of expectations. The information and inferences brought to the concert can make or break it before it even starts.

Joshua Bell is a star violinist who plays at the world’s great concert halls. People regularly pay more than $100 per ticket to hear him perform. Everything about the setting of a typical concert implies how worthy the music is of a listener’s full attention: the grand spaces with far-away ceilings, the hush among the thousand attendees, the elevation of the stage itself. In 2007, a reporter from the Washington Post had an idea for a social experiment: what would happen if this world-renowned violinist performed incognito in the city’s subway? Surely the exquisiteness of his sound would lure morning commuters out of their routine and into a rhapsodic listening experience.

Instead, across the 35 minutes that he performed the music of Bach, only seven people stopped for any length of time. Passers-by left a total of $32 and, after the last note sounded, there was no applause – only the continued rustle of people hurrying to their trains. Commentators have interpreted this anecdote as emblematic of many things: the time pressures faced by urban commuters, the daily grind’s power to overshadow potentially meaningful moments, or the preciousness of childhood (several children stopped to listen, only to be pulled away by their parents). But just as significantly, it could suggest that the immense power of Bell’s violin-playing does not lie exclusively in the sounds that he’s producing. Without overt or covert signalling that prepared them to have a significant aesthetic experience, listeners did not activate the filters necessary to absorb the aspects of his sound that, in other circumstances, might lead to rhapsodic experiences. Even musicianship of the highest level is susceptible to these framing effects. The sound just isn’t enough.

Other studies also suggest a powerful role for context in the experience of music. In 2016, my colleague Carolyn Kroger and I, at the University of Arkansas, exposed participants to pairs of performances of the same excerpt, but told them that one was performed by a world-renowned professional pianist and the other by a conservatory student: people consistently preferred the performance labelled as professional – whether they were listening to the professional, to the student, or had in fact just heard the exact same sound played twice. And, in another factor unrelated to the sound itself, listeners tended to show a preference for the second excerpt that they heard in the pair. When these two factors coincided – when the second performance was also primed as professional – their tendency to prefer it was especially strong. My own subsequent neuroimaging work using the same paradigm revealed that reward circuitry was activated in response to the professional prime, and that this activation persisted throughout the duration of the excerpt; this finding is in line with previous neuroimaging studies that demonstrated the sensitivity of the reward network to contextual information, which can alter or even improve the pleasantness of a sensory experience.

It’s not only our sense of the quality of a performance that is manipulable by extrinsic information; our sense of its expressive content can also vary. In a recent study, we told people that we had special information about the musical excerpts that they were going to hear: in particular, we knew something about the composer’s intent when writing each one. Unbeknown to the participants, we created the intent descriptions so that some were highly positive, some highly negative, and some neutral. For example, we could say that a composer wrote the piece to celebrate the wedding of a dear friend, to mourn the loss of a friend, or to fulfil a commission. We scrambled the description-excerpt pairings so that the same excerpts were matched with different descriptions for different participants. In each trial, participants read the composer-intent description, listened to the excerpt, and answered questions about it.

When told that the excerpt had been written for some positive reason, people heard the music as happier, but when told that the excerpt had been written in a negative circumstance, they heard it as sadder. Recasting the emotional tenor of an excerpt had important consequences for the listeners’ experience of it. People liked the excerpts more and were more moved by them when they thought they had been written for a happy reason (intriguingly, another part of the same study showed that people liked and were more moved by poetry when they thought it had been written for a sad reason). The social and communicative context within which a performance occurs – rudimentarily approximated by intent descriptions in this study – can imbue the same sounds with very different meanings.

The right music can get a roomful of people dancing. Even people at classical concerts that discourage overt movement sometimes find it irresistible to tap a finger or foot. Neuroimaging has revealed that passive music-listening can activate the motor system. This intertwining of music and movement is a deep and widespread phenomenon, prevalent in cultures throughout the world. Infants’ first musical experiences often involve being rocked as they’re sung to. The interconnection means not only that what we hear can influence how we move, but also that how we move can influence what we hear.

To investigate this influence, the psychologists Jessica Phillips-Silver and Laurel Trainor at McMaster University in Ontario bounced babies either every two or every three beats while the babies listened to an ambiguous musical excerpt, one that could be heard as having accents every two beats or every three. During this exposure phase, babies were hearing the same music, but some of them were being moved in a duple pattern (every two beats, or a march) and some of them were being moved in a triple pattern (every three beats, or a waltz). In a later test phase, babies were presented with versions of the excerpt featuring added accents every two or every three beats, translating the emphasis from the kinaesthetic to the auditory domain. They listened longer to the version that matched the bouncing pattern to which they had been exposed – babies who had been bounced every two beats preferred the version with a clear auditory duple meter, and babies who had been bounced every three beats preferred the version with the triple meter. To put it another way, these infants transferred the patterns they had learned kinaesthetically, through movement, to the patterns they were experiencing auditorily, through sound. What they perceived in the sound was framed by the way they had moved.

Testing whether this transfer from movement to sound occurs in adults required a few modifications to the study design – it’s not as easy to pick up adults and bounce them. Instead, adults were taught how to bend their knees every two or three beats as a musical excerpt played. And rather than devising a listening-time paradigm to infer aspects of perception from preverbal infants, researchers simply asked participants which of two excerpts sounded more similar to the one in the exposure phase. Participants chose from versions of the excerpt to which auditory accents had been added every two or three beats. Mirroring results with the infants, the adults judged the version to be most similar when it featured the accent pattern that matched the way they’d moved. The effect persisted even when participants were blindfolded while moving, demonstrating that perception could transfer from movement to sound even in the absence of a mediating visual influence. Movements much subtler than full-body bouncing can also influence auditory perception. Participants asked to detect target tones occurring on the beat from within a series of distractor tones performed better when they tapped a finger on a noiseless pad than when they listened without tapping.

Together, these findings paint an embodied picture of music-listening, in which the experience is shaped not just by what you see, hear and know about the music, but also by the way you physically interact with it. This is true in the more common participatory musical cultures around the world, where everyone tends to join in the music-making, but also in the less common presentational cultures, where circumstances seem to call for stationary, passive listening. Even in these contexts, when and how a person moves can shape what they hear.

The musical vocabularies and styles that people hear while growing up can shape the structures and expressive elements they are capable of hearing in a new piece. For example, people show better recognition memory for, and different emotional responses to, new music composed in a culturally familiar style, as compared with new music from an unfamiliar culture. But it’s not just previous musical exposure that shapes their perceptual system: the linguistic soundscape within which a person is raised also reconfigures how they orient to music.

In languages such as English, the pitch at which a word is pronounced doesn’t influence its dictionary meaning. Motorcycle means a two-wheeled vehicle with an engine whether I say it in a really high or really low voice. But other languages, such as Mandarin Chinese and Thai, are tone languages: when Chinese speakers say ma with a high, stable pitch it means ‘mother’, but if they say it with a pitch that starts high, declines, then goes back up again, it means ‘horse’. The centrality of pitch to basic definitional content in these languages means that tone-language speakers produce and attend to pitch differently than non-tone-language speakers, day in and day out over the course of years. This cumulative sonic environment tunes the auditory system in ways that alter basic aspects of music perception. Speakers of tone languages, for example, detect and repeat musical melodies and pitch relationships more accurately than non-tone-language speakers.

The psychologist Diana Deutsch at the University of California, San Diego concocted tritones (two pitches separated by half an octave) using digitally manipulated tones of ambiguous pitch height. People heard these tritones as ascending or descending (the first note lower or higher than the second) depending on the linguistic background in which they had been raised. Speakers of English who grew up in California tended to hear a particular tritone as ascending, but English speakers raised in the south of England tended to hear it as descending. Chinese listeners raised in villages with different dialects showed similar differences. A striking characteristic of this ‘tritone paradox’ is that listeners who hear the interval as ascending generally experience this upward motion as part of the perception, and have trouble imagining what it would be like to experience it the other way, and vice versa for listeners who hear it as descending. The effect influences what feels like the raw perception of the sound, not some interpretation layered on later. Culture and experience can change how music is heard, not just how people derive meaning from it.

Music’s interdependence with so many diverse capacities likely underlies some of its beneficial and therapeutic applications. As the late neurologist Oliver Sacks showed in Musicophilia (2007), when a person with dementia listens to music from her adolescence, she can become engaged and responsive, revealing the extent to which these tunes carry robust autobiographical memories.

Music cannot be conceptualised as a straightforwardly acoustic phenomenon. It is a deeply culturally embedded, multimodal experience. At a moment in history when neuroscience enjoys almost magical authority, it is instructive to be reminded that the path from sound to perception weaves through imagery, memories, stories, movement and words. Lyrics aside, the power of Cohen’s Hallelujah doesn’t stem directly from the fourth, the fifth, or even the minor fall or the major lift. Contemporary experiences of the song tend to be coloured by exposure to myriad cover versions, and their prominent use in movies such as Shrek. The sound might carry images of an adorable green ogre or of a wizened man from Montreal, or feelings experienced at a concert decades ago.

Despite sometimes being thought about as an abstract art form, akin to the world of numbers and mathematics, music carries with it and is shaped by nearly all other aspects of human experience: how we speak and move, what we see and know. Its immense power to sweep people up into its sound relies fundamentally on these tight linkages between hearing and our myriad other ways of sensing and knowing.