Vocal imitation is a hallmark of human spoken language, which, along with other advanced cognitive skills, has fuelled the evolution of human culture. Comparative evidence has revealed that although the ability to copy sounds from conspecifics is mostly uniquely human among primates, a few distantly related taxa of birds and mammals have also independently evolved this capacity. Remarkably, field observations of killer whales have documented the existence of group-differentiated vocal dialects that are often referred to as traditions or cultures and are hypothesized to be acquired non-genetically. Here we use a do-as-I-do paradigm to study the abilities of a killer whale to imitate novel sounds uttered by conspecific (vocal imitative learning) and human models (vocal mimicry). We found that the subject made recognizable copies of all familiar and novel conspecific and human sounds tested and did so relatively quickly (most during the first 10 trials and three in the first attempt). Our results lend support to the hypothesis that the vocal variants observed in natural populations of this species can be socially learned by imitation. The capacity for vocal imitation shown in this study may scaffold the natural vocal traditions of killer whales in the wild.

1. Introduction

Learning a previously unknown behaviour by observation from another individual [1] enables the non-genetic transfer of information between individuals and constitutes a potential driver for the diffusion and consolidation of group-specific behavioural phenotypes (i.e. traditions and cultures) [2,3]. Imitation of novel sounds, also known as vocal production learning [4], and defined as learning to produce a novel sound just from hearing it, is a core property of human speech which has fuelled the evolution of another adaptation unique in our species, human culture [5]. Although the ability to copy sounds from conspecifics is widespread in birds, it is strikingly rare in mammals [4,6], and among primates it is uniquely human [7,8] (but see [9]). Cetaceans are one of the few mammalian taxa capable of vocal production learning. Several cetacean species in the wild exhibit substantial behavioural diversity between sympatric groups in terms of the acoustic features of their vocal repertoires (songs, calls) [10,11]. It has been suggested that imitative learning may underpin these behaviours, with experimental evidence for the ability for sound imitation demonstrated mainly in the bottlenose dolphin [11–13] and recently in the beluga [14,15].

Among cetaceans, the killer whale (Orcinus orca) stands out regarding vocal dialects in the wild [16]. Each matrilineal unit or pod within a population has been documented to have a unique vocal dialect, including a combination of unique and shared call types [17–19]. These dialects are believed to be transmitted via social learning [16–18], not only from mother to offspring (vertical transmission), but also between matrilines (horizontal transmission) [18–21]. Moreover, the similar acoustic features found between different populations in the same area do not correlate with geographical distance [22]. As many of these group-differentiated signatures are not explained by ecological factors or genetic inheritance, the hypothesis that they may have been acquired by social learning, particularly imitation, appears plausible [16–24].

Elucidating the precise mechanism of social learning involved is difficult, however, particularly for acoustic communication in wild populations. Although killer whales are capable of learning novel motor actions from conspecifics through imitation [25], the experimental evidence for vocal production learning is still scarce in this species. There are reports of killer whales in the field and in captive settings suggesting that they can copy novel calls from conspecifics [26,27], and even from heterospecifics such as bottlenose dolphins [28] or sea lions [24]. One Icelandic female was found to match novel calls from a Northern Resident female with whom she had been housed together for several years [26]. Two juvenile killer whales, separated from their natal pods, were observed to mimic the barks of sea lions in a field study [24]. Crance et al. [27] and Musser et al. [28] took advantage of two unplanned cross-socializing experimental situations to show that two juvenile males learned novel calls from an unrelated but socially close adult male, and three individuals learned novel whistles from a dolphin, respectively.

However, as suggestive as these reports are, the lack of experimental controls limits what can be concluded about the underlying acquisition mechanisms. Experimental data are needed to ascertain whether vocal learning is a plausible mechanism underlying the complexity of vocal traditions in wild killer whales. Moreover, to the best of our knowledge, not even anecdotal reports exist of killer whales spontaneously mimicking human speech, similar to those reported in some birds (e.g. parrots [29], mynahs [30]) and mammals (elephants [31], seals [32], belugas [14]).

In most mammals, sound production occurs in the vocal folds within the larynx (the sound source) and the supralaryngeal vocal tract, consisting of the pharyngeal, oral and nasal cavities (the filter) [33]. In humans, this apparatus increases in complexity due to the unusual neurological and motor control that we can exert on these structures [33,34]. By contrast, toothed cetaceans (e.g. killer whales, belugas and dolphins) have evolved a pneumatic sound production system in the nasal complex passages (instead of the larynx) involving bilateral structures such as a pair of phonic lips, which can operate as two independent sound sources and filters [35,36]. This difference in the sound production system between toothed cetaceans and humans makes the investigation of cetacean vocal production particularly valuable for comparative analyses of flexible vocal production.

Here we report an experimental study of sound learning and mimicry in a killer whale listening to familiar or novel sounds uttered by a conspecific or a human model and requested to reproduce them on command (‘Do this!’). The do-as-I-do paradigm [37] involves the copying of another's untrained (familiar or novel) motor or vocal action using a specific previously trained signal in the absence of results-based cues. The do-as-I-do training method has been successfully used in studies of primates, birds, dogs and two species of cetaceans [12,25,38]. In fact, we used this method to test production imitation of novel motor actions in this same group of killer whales [25]. Ultimately, we wanted to test whether production imitation learning may be a candidate to explain the group-specific vocal patterns documented in wild killer whale populations.

2. Material and methods

(a) Subjects

We tested a 14-year-old female killer whale (Orcinus orca), named Wikie, housed at Marineland Aquarium in Antibes, France. The conspecific model (Moana) was her own 3-year-old calf, born in Marineland. Wikie had been trained for a variety of examination and exercise behaviours with standard operant conditioning procedures and fish/tactile positive reinforcement. Her participation in our previous action imitation study [25] meant that she already knew the ‘copy’ command.

(b) Procedure

The study comprised three phases. Phase 1 involved retraining and reinforcing the subject to respond to the gesture-based command ‘copy’ (‘Do that!’) given by the trainer, which had been used 4 years earlier in our previous study of action imitation [25]. Phase 2 involved testing the subject's response to the trainer's copy command when the model uttered familiar vocalizations (n = 3 different sounds), that is, vocalizations that the subject had already performed herself, either because she had been trained with them or because they were part of her natural repertoire (table 1). Finally, phase 3 involved testing the subject with novel sounds (n = 11 different sounds), that is, sounds that the subject had neither heard nor uttered previously. To ensure that the unfamiliar sounds (conspecific and human) were as different as possible from what the whales had produced before, we compared them with 278 sound samples extracted from Hodgins' baseline recordings of the vocal repertoire of this same group of killer whales [39], in which she had identified up to 11 distinct discrete call types; we found none that matched the novel conspecific or human sounds in our sample. In addition, before running the experiment we recorded 28 h of in-air spontaneous sounds produced by the killer whales during their free time to check whether the subject (or any other killer whale in the group) uttered sounds similar to the novel sounds in our sample (further details are given in the electronic supplementary material). Phase 3 comprised two testing conditions: a conspecific model (condition 1) and a human model (condition 2). In condition 1, the subject first listened to a conspecific model's performance that included three familiar sounds and five novel sounds (test trials), and then was signalled to copy them.
The sounds were presented in two formats: (1) performed live by a killer whale model and (2) played through a speaker (e.g. conspecific sounds such as the airy atonal ‘breathy’ and ‘strong’ raspberries, or tonal whiny siren sounds such as ‘wolf’). In condition 2, the subject also listened to three familiar and six other novel sounds (test trials), but now they were produced by a human model (e.g. a human laugh, ‘ah ah’, or human words such as ‘one two’; electronic supplementary material, table S1 gives the complete description of each sound). In both conditions, the sounds were presented with the constraint that no more than three test trials of the novel sound could occur in a row. In each session, only a single novel sound was presented to the subject. We also interspersed the three familiar sounds that had been used in the previous phases and control (‘non-copy’) trials, during which the subject's trainer did not make the copy sign and instead asked for any other trained action that the subject was regularly requested to perform during the aquarium shows. The subject was positively reinforced with fish and/or tactile and voice reinforcement signals whenever she gave a correct response as judged in real time by two observers (Wikie's trainer and one experimenter), but only when she was asked to copy familiar sounds or perform familiar actions (control trials). During the test trials (novel sounds from conspecific and human models), the subject received no rewards (or experimenter-given feedback) regardless of her response, thus making real-time judgements unnecessary. All the sounds were requested and performed when the subject's head was above the water surface with her blowhole exposed.

Table 1. Total number of trials for each sound tested, number of trials until the model's sound was judged to be copied by the subject (according to two experimenters who listened to the sound recordings after the test and then confirmed by six independent observers) and percentage of correct trials since the first full copy.

sound                                                  no. trials   first trial copied   % correct since the first copy
familiar sounds
  song (SO)                                            394          1                    100
  birdy (BI)                                           316          34                   98
  blow (BL)                                            371          2                    99
  through human model (transfer sessions)
    SO                                                 30           1                    100
    BL                                                 30           1                    100
novel sounds
  conspecific alive model
    strong raspberry (SR)                              30           10                   19
    creaking door (CD)                                 30           2                    100
    breathy raspberry (BR)                             30           3                    30
  conspecific through speaker
    SR                                                 30           1                    100
    CD                                                 30           4                    44
    BR                                                 30           1                    57
    wolf (WO)                                          30           17                   36
    elephant (EL)                                      30           6                    28
  conspecific through human model (transfer sessions)
    SR                                                 30           1                    100
  human
    ah ah (AA)                                         30           17                   14
    hello (HE)                                         30           1                    55
    bye bye (BB)                                       30           12                   21
    Amy (AM)                                           30           8                    26
    one two (OT)                                       30           3                    36
    one two three (OTT)                                30           1                    23

Three different set-ups were used. (a) Conspecific live condition: the two trainers (TM and TS; M for model and S for subject) were positioned on different sides of a wooden panel 2 m long × 1.90 cm high placed in a position in which S and M could see each other and their own trainer, but could not see the other trainer's commands. TM was positioned on the right side of the panel, and TS was on the left side; thus, the trainers were in a position from which they were not able to see each other's signals either (figure 1). (b) Conspecific speaker condition: two trainers were also required; one trainer held the speaker and another (TS) gave the copy command to the subject. (c) Human live condition: just one trainer was needed, as he both uttered the sound and gave the ‘copy’ signal (figure 1). Table 1 gives the complete list of sounds by phase and electronic supplementary material, table S1 gives the description of sounds (see audio samples in electronic supplementary material).

Figure 1. Experimental set-up. (a) Conspecific live condition. The two trainers (TM and TS; M for model and S for subject) were positioned on different sides of a wooden panel 2 m long × 1.90 cm high placed in a position in which S and M could see each other and their own trainer, but could not see the other trainer's commands. (b) Conspecific speaker condition. One trainer held the speaker and another (TS) gave the copy command to the subject. (c) Human live condition. Just one trainer was needed, as he both uttered the sound and gave the ‘copy’ signal.

All sessions were videotaped and were recorded with Fostex Fr2 and Zoom H-4N digital recorders and a Rode NTG-2 condenser shotgun microphone. To play the sounds in the speaker condition, a sound launcher app for iOS ‘SoundPad Live’ was developed. The sounds were played through an iPad to an Ik Multimedia ‘I Loud’ portable Bluetooth speaker.

(c) Coding and data analysis

The analysis comprised two steps. In the first step we used a traditional method of categorization that consisted of using acoustic inputs and selecting the sounds that appeared most similar [23,26,39–41]. That is, one experimenter listened to each test trial, and scored whether the subject's response correctly matched the sound uttered by the model. Six judges, blind to the sound uttered by the model, then listened to pairs of sounds (model and candidate copies) and were asked to judge whether the copy matched the model sample (scoring yes for correct matching and no for non-matching) across six samples (three correct and three incorrect, the latter chosen randomly from the pool of sounds emitted by the subject) for each demonstrated sound.

Next, using a visual inspection of the waveform, we analysed two time domain-related parameters, namely the number and duration of bursts, of a random sample of five copies of each novel vocalization using Adobe Audition and then we calculated the intra-class correlation coefficient (ICC) as a measure of concordance between model and copy sounds. The ICC for absolute agreement was estimated using a two-way random effects model.
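As a rough illustration of this concordance measure, the sketch below computes a single-measure ICC for absolute agreement under a two-way random effects model in plain Python. The function name and the burst durations are hypothetical, and the source does not state which software was used for the ICC, so this should be read as an assumption-laden sketch rather than the authors' analysis code.

```python
def icc_absolute_agreement(ratings):
    """Single-measure ICC, two-way random effects, absolute agreement.

    ratings: list of rows, one per measured item (e.g. one burst); each row
    holds one value per 'rater' (here: model duration, copy duration).
    """
    n = len(ratings)        # number of items (rows)
    k = len(ratings[0])     # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement: systematic offsets between columns lower the ICC.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical burst durations in seconds: (model, copy) for each burst.
example_bursts = [(0.42, 0.45), (0.31, 0.30), (0.55, 0.60), (0.20, 0.22)]
print(round(icc_absolute_agreement(example_bursts), 3))
```

Because the absolute-agreement form is used, a copy that is consistently longer or shorter than the model by a fixed amount is penalized, whereas a consistency-type ICC would ignore such an offset.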

We then ran a detailed objective analysis in which the demonstrated and imitated sounds selected in the first step were subjected to an analysis of matching accuracy using algorithms implemented in Matlab version 2014a, using the signal processing toolbox version 6.21 (R2014a) and the additional code and scripts designed by Lerch [42] (http://www.audiocontentanalysis.org/code/). These analyses involved several steps.

First, we selected and extracted a subset of acoustic features (e.g. statistics, timbre or quality of sound, intensity-related, tonal or temporal) of both the model and the copy sounds. All of these features were computed using a 20 ms time window with Hamming windowing and an overlap of 50% (hop 10 ms). The challenge was to select, in an exploratory approach, a subset of these features in the time and frequency domains that a priori seemed suitable for comparing sounds made by two species that use the remarkably different sound production mechanisms described above. The main features selected were as follows: (i) spectral pitch contour ACF (autocorrelation function of the magnitude spectrum), which shows the evolution of the fundamental frequency over time; (ii) time energy evolution, which allows us to compare the evolution of the energy pattern over time between the model's and the subject's acoustic signals (temporal regularity and rhythm); and (iii) pitch class profile, a histogram-like 12-dimensional vector (corresponding to the 12 notes of the chromatic musical scale) with each dimension representing both the number of occurrences of the specific pitch class in a time frame and its energy or velocity throughout the analysis block [42]. Figure 2 presents an example of a waveform, spectrogram and pitch class profile of the model and the copy of one human (tonal) novel sound and of one conspecific (atonal) novel sound. (See electronic supplementary material, figures S2–S4 for one example for each spectral analysis for each of the main features selected and for a complete list of all features selected.)

Figure 2. (a) Waveform and spectrogram of the model (i) and the copy (ii) of the human (tonal) novel sound ‘HE’. Note the harmonic pattern in both signals. (b) ‘HE’ pitch class profile of the model (i) and the copy (ii). (c) Waveform and spectrogram of the model (i) and the copy (ii) of the conspecific (atonal) novel sound ‘BR’. Note the inharmonic pattern in both signals. (d) ‘BR’ pitch class profile of the model (i) and the copy (ii).
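The framing scheme described above (20 ms Hamming-windowed frames with a 10 ms hop) and two of the simpler ingredients, a per-frame energy contour and the mapping from a frequency to one of 12 pitch classes, can be sketched in plain Python as follows. The function names, the test tone and the equal-temperament tuning with A4 = 440 Hz are assumptions made for illustration; the study itself used Matlab code by Lerch [42], whose implementation details may differ.

```python
import math


def hamming(length):
    """Hamming window coefficients for a frame of `length` samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (length - 1))
            for i in range(length)]


def frame_energy(signal, sample_rate, win_ms=20.0, hop_ms=10.0):
    """Mean squared amplitude of each Hamming-windowed frame."""
    win = int(round(sample_rate * win_ms / 1000.0))   # samples per frame
    hop = int(round(sample_rate * hop_ms / 1000.0))   # 50% overlap by default
    window = hamming(win)
    energies = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win]
        energies.append(sum((s * w) ** 2 for s, w in zip(frame, window)) / win)
    return energies


def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to a pitch class 0..11 (C = 0), assuming A4 = 440 Hz."""
    semitones_from_a4 = round(12 * math.log2(freq_hz / a4_hz))
    return (semitones_from_a4 + 9) % 12   # A is pitch class 9 when C is 0


# Hypothetical input: 0.1 s of a 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(800)]
energy_contour = frame_energy(tone, sample_rate)
print(len(energy_contour), pitch_class(200.0))
```

A pitch class profile would then accumulate, frame by frame, how much spectral energy falls into each of the 12 classes; that accumulation step is omitted here to keep the sketch short.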

Second, once these features were selected, all the characteristics of each frame were compacted into a single vector. Finally, for the comparison it was necessary to then take into account that these signals were of different duration. We used a dynamic time warping (DTW) method to deal with the alignment task, that is, with the operations of stretching and compressing audio parts allowing similar shapes to match even if they are out of phase in the time domain. DTW represents a family of algorithms developed for the automated recognition of human speech that allows for limited compression and expansion of the time axis of a signal to maximize frequency overlap with a reference signal [42]. DTW is a more robust distance measure for time series capable of quantifying similarity (or dissimilarity) in an optimal way [42] as, typically, a dissimilarity function is a Euclidean distance measure that calculates and cumulates a cost according to a correspondence function (where a zero cost indicates a perfect match). That is, the higher the matching cost, the more dissimilar (less similar) the two sequences.
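The alignment idea can be illustrated with the classic dynamic-programming form of DTW, using a Euclidean distance between per-frame feature vectors as the local cost. This is a minimal sketch under stated assumptions (unconstrained warping, symmetric step pattern, hypothetical function name and toy data); it is not the Matlab implementation used in the study, which may apply different step patterns or constraints.

```python
import math


def dtw_cost(seq_a, seq_b):
    """Accumulated DTW cost between two sequences of feature vectors.

    Each element of seq_a and seq_b is a tuple of per-frame features.
    A cost of zero means the sequences match perfectly after warping.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # acc[i][j] = minimum cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean distance
            acc[i][j] = local + min(
                acc[i - 1][j],      # stretch seq_b (repeat its frame)
                acc[i][j - 1],      # stretch seq_a (repeat its frame)
                acc[i - 1][j - 1],  # advance both sequences
            )
    return acc[n][m]


# Toy one-dimensional contours: the second is a time-stretched copy of the first.
model = [(0.0,), (1.0,), (2.0,)]
stretched_copy = [(0.0,), (1.0,), (1.0,), (2.0,)]
print(dtw_cost(model, stretched_copy))
```

In this toy case the stretched copy aligns with zero cost even though the two sequences differ in length, which is the property that makes DTW suitable for comparing a model sound with a copy of different duration.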

DTW has been widely documented and used in digital signal processing, artificial intelligence tasks such as pattern recognition (e.g. sign and gestural language), music information retrieval, audio forensics and machine learning [42], and has recently proved to be an excellent technique for assessing matching accuracy between sounds produced by marine mammals and in particular for automatic classification of killer whale call types [43–45]. In the present study, DTW was used to measure the dissimilarity, across the previously selected subset of acoustic features, between the audio signal of the demonstrated sound and that of the subject, revealing the extent of alignment or synchronization between both signals.

Finally, in order to establish relative comparisons between any model-copy sound pair, a ‘dissimilarity index’ scale was constructed, which allowed us to calibrate the distance measures obtained in the DTW analyses and thus establish how similar or dissimilar the two sounds (demonstrated sound and that of the subject sound) were in all the subsets of features selected. As the dissimilarity index does not have a fixed upper limit, we rescaled the index into an interval from 0 to 1 to quantitatively assess the degree of dissimilarity. As in the non-rescaled version, 0 in this scale represents a perfect copy (i.e. a sound compared with a copy of itself) and 1 represents maximum dissimilarity. To establish this ceiling value (the top of the scale), we chose a main benchmark value, technically referred to as ‘anchor’. As the value depends on the particular vocalizations analysed, indices of dissimilarity were calculated between four randomly chosen demonstration sounds and copies uttered by the subject that corresponded to other different demonstrated sounds. The benchmark value chosen was the round score closest to the maximum found (940 378 score for ‘Amy’ paired with ‘one two three’), which accordingly in this case was rounded to 1 000 000 (see electronic supplementary material for a complete list of DTW dissimilarity index scores). The rescaled dissimilarity index represents the division of the accumulated distance in relation to the distance value of the anchor of dissimilarity. Among these same four pairs of different sounds we also took the lowest score (the more similar) as another benchmark for what could be considered bad and good copies. Finally, another benchmark was included to serve as a reference point for what could be considered a ‘high-quality match’ (i.e. a human copying another human known word). For this we calculated the dissimilarity index between the sound ‘hello’ produced by the trainer and the experimenter's copy of the same sound (figure 4).
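The rescaling described above amounts to dividing each raw accumulated DTW distance by the anchor value. The short sketch below shows this, together with a simplified good/bad decision that treats any score below the lowest incorrect random pair as a good copy. Only the anchor (1 000 000) and the maximum raw score (940 378) come from the text; the function names and all other scores are hypothetical placeholders.

```python
ANCHOR = 1_000_000  # ceiling benchmark ('anchor') reported in the text


def rescale_dissimilarity(raw_dtw_cost, anchor=ANCHOR):
    """Rescale a raw DTW dissimilarity score by the anchor (0 = perfect copy)."""
    if raw_dtw_cost < 0:
        raise ValueError("DTW cost cannot be negative")
    return raw_dtw_cost / anchor


def is_good_copy(raw_dtw_cost, incorrect_pair_costs):
    """Simplified rule: good if below the lowest incorrect random pair."""
    return raw_dtw_cost < min(incorrect_pair_costs)


# 940 378 is the reported maximum ('Amy' paired with 'one two three');
# the other three incorrect-pair scores are hypothetical placeholders.
incorrect_pairs = [940_378, 610_000, 720_000, 805_000]
hypothetical_copy_score = 150_000

print(rescale_dissimilarity(940_378))
print(is_good_copy(hypothetical_copy_score, incorrect_pairs))
```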

3. Results

Inter-observer reliability of whether model and subject sounds matched was high (Fleiss's weighted κ: 0.8; p < 0.001; observed agreement = 0.90).
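For readers unfamiliar with this type of statistic, the sketch below computes the standard (unweighted) Fleiss' kappa for several judges assigning each sound pair to a 'match' or 'no match' category. The text reports a weighted variant whose exact form is not specified here, so this is an illustrative assumption rather than a reproduction of the reported value; the function name and the rating counts are hypothetical.

```python
def fleiss_kappa(counts):
    """Unweighted Fleiss' kappa.

    counts[i][j] = number of judges who assigned item i to category j.
    Every item must be rated by the same number of judges.
    """
    n_items = len(counts)
    n_judges = sum(counts[0])
    n_categories = len(counts[0])

    # Overall proportion of assignments falling into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_judges)
             for j in range(n_categories)]
    # Agreement among judges on each individual item.
    p_item = [(sum(c * c for c in row) - n_judges)
              / (n_judges * (n_judges - 1)) for row in counts]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical counts for six judges: columns are (match, no match).
example_counts = [[6, 0], [0, 6], [3, 3]]
print(round(fleiss_kappa(example_counts), 3))
```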

(a) Familiar sounds

The subject correctly copied all of the trained sounds, whether demonstrated by a conspecific or by a human. In phase 1, the subject recalled the copy command given by the trainer 4 years before, as indicated by her response in the first trial. Phase 2 involved testing the subject's response to the trainer's copy command when the model uttered familiar sounds. With the copy signal alone, the sound ‘song’ was copied in the first trial, ‘blow’ was copied in the second trial (first session) and ‘birdy’ was accurately matched in the 34th trial (sixth session). Wikie reached the criterion for moving to the final experimental phase (i.e. 90% correct trials on these intermixed familiar sounds) in the seventh session. In phase 3, Wikie also correctly copied all of the trained conspecific sounds uttered by a human model in the transfer sessions (n = 2), in each case in the first trial. In sum, Wikie made recognizable copies of the demonstrated sounds as judged in real time by two observers, Wikie's trainer and one experimenter, and later confirmed by both after listening to the recordings.

(b) Novel sounds

The subject produced recognizable copies of all of the untrained sounds, whether demonstrated by a conspecific or by a human (as judged by two experimenters who listened to the sound recordings after the test and then confirmed by six independent observers). In the live conspecific condition, all the novel sounds (n = 3) were copied by the 10th trial (‘strong raspberry’), with one sound copied in the second trial (‘creaking door’) and the other in the third trial (‘breathy raspberry’). In the conspecific through speaker condition, the novel sounds (n = 2) were copied by the 17th trial (‘wolf’), with the other sound copied in the sixth trial (‘elephant’). In the conspecific through human model condition, the novel sound tested (n = 1) was copied in the first trial (‘strong raspberry’). Finally, in the human sound condition, although the copies of the novel sounds (n = 6) were not perfect, Wikie produced recognizable copies of all the human model sounds by the 17th trial (‘ah ah’), with two sounds copied in the first trial (‘hello’ and ‘one, two, three’).

Visual examination of spectral patterns revealed a good matching of the demonstrated sound and the subject's copy in several of the acoustic features analysed. For all sound parameters tested, no differences were observed between the model's sound and the subject's match in the total number of bursts (Cohen's κ = 1, p < 0.0005). When tested with novel conspecific sounds, a high concordance was found between burst duration of the model's sound and the subject's copy (ICC: 0.79; p < 0.001, n = 31 bursts). When tested with human sounds, a very high concordance between burst duration of the model's sound and the subject's copy was found (ICC: 0.89; p < 0.001, n = 65 bursts), showing better performance compared to killer whale sounds.

In the automated quantitative analysis, the DTW showed an optimal overlap represented by a diagonal line alignment between both sounds (demonstrated and copy) in all the examples for each sound judged by the experimenters as correct imitations in phase 1. This diagonal line alignment of the ‘shortest line’ between both signals indicated similarity in all features selected [42]. Figure 3 presents an example of a DTW analysis in the matching of the subject and the human model for the sound ‘hello’ (tonal), the conspecific's novel sound ‘breathy raspberry’ (atonal) and the familiar sounds ‘birdy’ (tonal) and ‘blow’ (atonal) (see the electronic supplementary material, figure S1 for one DTW example of all the other novel sounds tested). Although the fundamental frequency of copies made by human and killer whale models was remarkably dissimilar, the outline F0 contours turned out to be very similar. Figure 4 shows a representation of a DTW distance dissimilarity index between the demonstrated sound and the best match (the lowest DTW value) among the random sample of five copies of each vocalization type of the subject for each and every sound tested plus four ‘incorrect’ reference control points (corresponding to randomly chosen demonstrated sounds paired with copies that corresponded to other different subjects' sounds) and another ‘high-quality copy’ reference control point (human copying another human known word; see the electronic supplementary material for a complete list of DTW dissimilarity index scores). Overall, expected matches (when demonstration and copy were of the same sound type) did match, while expected non-matches (when demonstration and copy were of different sound types) did not. 
Specifically, we found that copies of familiar conspecific sounds fell below a dissimilarity index threshold (horizontal red dotted line below the lowest incorrect random pair copy) that divided our results into good or bad copies and most of them were close to the ‘high-quality match’ score (human imitating human anchor), with one score being below this value (‘blow’). Copies of novel conspecific sounds were located very close to this ‘high-quality match’ score and novel speech sounds demonstrated by humans were distributed across the whole range of good copies, with one even below this ‘high-quality match’ benchmark. If we take as a criterion of matching accuracy the values obtained with familiar sounds from conspecifics, we observe that except for the sound ‘blow’, which is the simplest untrained sound consisting only of a single burst of atonal voiceless breath (see electronic supplementary material, second example on sound file no. 1), the copies of novel conspecific sounds and three of novel speech sounds (‘Amy’, ‘hello’ and ‘ah ah’) were even more closely matched than the tonal familiar conspecific sounds.

Figure 3. Dynamic time warping familiar and novel conspecific and human sounds (tonal and atonal). In both axes all the characteristic features of the signals are aligned and the black line shows the shortest path (minimum distance) between the model and the subject sound streams. (a) DTW familiar sound ‘BL’ (atonal) of the model and the copy. (b) DTW familiar sound ‘BI’ (tonal) of the model and the copy. (c) DTW novel sound ‘HE’ (tonal) of the model and the copy. (d) DTW novel sound ‘BR’ (atonal) of the model and the copy.

Figure 4. Dynamic time warping dissimilarity index distribution. Distribution of the DTW dissimilarity index between the model and the copy for each vocalization; familiar (blue dots), killer whale novel (green dots) and human novel (turquoise dots). Five control benchmarks (red dots) are also represented where the first one corresponds to the ‘high-quality match’ score (human-imitating-human benchmark) and the others correspond to the four randomly chosen incorrect copies (model sounds paired with copies that corresponded to other different models). The horizontal red dotted line below the lowest incorrect random pair copy serves as a benchmark for dividing the results between good and bad copies.

Finally, analysing the features selected for the DTW analysis separately, the spectrogram analysis revealed that Wikie produced harmonics when exposed to tonal sounds (including human sounds), but not when exposed to atonal or noisy sounds (figure 2; electronic supplementary material, figures S2–S4).

4. Discussion

Although the subject did not make perfect copies of all novel conspecific and human sounds, they were recognizable copies as assessed by both external independent blind observers and the acoustic analysis. There was great variability in the number of good copies produced after a sound was copied for the first time (table 1). Possible factors that could explain this variability are the difficulty in producing novel sounds and some uncontrolled factors such as variation in motivational levels across sessions. Additionally, our non-differential reinforcement regime (good copies of novel sounds were not reinforced to avoid shaping) may have also contributed to this variability. Consequently, it is conceivable that our data represent a conservative estimate of the killer whale's capacity for vocal imitation.

According to the DTW dissimilarity scale (figure 4), all the copies of novel conspecific utterances fell below the dissimilarity index threshold for good and bad copies (pairs of different demonstrated and copied sounds randomly chosen) and most of them were close or even fell below the ‘high-quality match’ score, as represented by the human-copying-human anchor. Similarly, although three of the copies of human sounds were only close to the dissimilarity index threshold for good and bad copies (incorrect randomly paired copies), the other three fell close to the ‘high-quality match’ score (human-copying-human anchor); that is, they were very accurate copies, with one falling even below this benchmark. This accuracy level is particularly remarkable given that the subject possessed a very different sound production system compared to humans. Some parameters such as the fundamental frequency were sometimes drastically different between the human model and Wikie's copies, but the outline F0 contours were nonetheless quite similar (figure 4).

Overall, the DTW analyses revealed that the accuracy of copies was much higher when these were of the same sound than when they involved a different sound, which strongly suggests that the copies were specific to the demonstrated sound. We believe that the subject's responses represent a case of vocal imitation rather than response facilitation, as the latter form of social learning does not apply to individuals reproducing a model's novel sound [46]. Moreover, the subject's perfect performance in the control ‘non-copy’ trials (i.e. performing a trained action or sound different from that of the model) ruled out automatic response facilitation (i.e. copying the model's sound spontaneously) [46] because she only copied what she was requested to do.

DTW analyses also revealed that the subject's copies of novel conspecific and human sounds were in most cases even more accurate than were the copies of familiar sounds. For three of the novel speech sounds (‘hello’, ‘Amy’ and ‘ah ah’), the accuracy of the copies was even greater than the matching accuracy of some of the familiar sounds uttered by the conspecific model. Moreover, four copies of novel sounds were found to be high-quality matches, as they were close to the benchmark score of a human copy of the human sound, and one was even a better match (see ‘breathy raspberry’ in figure 4). A greater copying accuracy for novel compared to familiar sounds might suggest that the cognitive mechanisms responsible for producing familiar and novel sounds do not fully overlap. It is possible that the matching of familiar sounds relies more heavily on response facilitation than on imitation, such that the subject's copy is shaped more by the general characteristics of the stored representation than by the sound's specific individual components. In contrast, learning to match a novel action or sound might require the subject to carefully process the individual components of the auditory experience, which might generate a better match. The subject's matching accuracy is all the more remarkable as she was able to accomplish it (a) in the absence of extensive trial and error across all the experimental conditions, (b) in response to sounds presented in-air and not in-water (the species' usual medium for acoustic communication), and (c) through the use of a sound production system that greatly differs from that of the model when matching speech sounds [35,36]. Note that the subject readily matched the harmonic quality of human tonal sounds (figure 2; electronic supplementary material, figures S2–S4).

The anatomical structures involved in sound production of cetaceans differ from those used by terrestrial mammals and birds, in that cetaceans are adapted to an aquatic lifestyle where the sound-producing organs compress while diving because of water pressure-related changes [35]. This has been hypothesized to have favoured the development of vocal learning in marine mammals, as they need substantial voluntary control over sound production in order to reliably generate the same sounds at different depths [47].

Our experimental findings lend support to the hypothesis that the group-differentiated acoustic dialects that have been documented in many field studies of killer whales [16–23] and other cetaceans [10] can be acquired and maintained through social learning and, more specifically, through imitation. These results add to the growing database of socially learned sounds reported in previous non-experimental and experimental studies of killer whales and other cetaceans (dolphins [11–13]; belugas [14,15]). Compared to the fission–fusion societies of bottlenose dolphins, however, the social systems of killer whales are reported to be more strongly structured and closed [10,16]. Thus, the well-developed propensity of killer whales to copy what others are doing would be consistent with the body of observations on group-specific acoustic dialects, synchronized behaviour and sophisticated cooperative strategies documented in this species [10].

The results reported here show that killer whales have evolved the ability to control sound production and qualify as open-ended vocal learners. It can be argued that because our experimental design included in-air (rather than in-water) sounds, the positive results obtained cannot directly reflect the killer whale's capacity for learning to copy underwater sounds in their natural environment. However, our main objective was to test whether killer whales are capable of learning novel sounds through imitative learning, regardless of the type of sound (in-air versus in-water) and the model (conspecifics versus heterospecifics). The atypical nature of the sounds that we used represents a strength rather than a weakness in relation to our main question because it evidences flexibility not just in what is copied but in how it is copied. With regard to what is copied, our data show that killer whales can copy sounds outside their usual repertoire, which is an important piece of information if one wants to know not only what a species does, but also what it can do, under a variable set of circumstances. With regard to how it is copied, our data might indicate that the sensory–perceptual and cognitive skills recruited in imitating in-air sounds are ancestral traits, dating back to the terrestrial ancestors of cetaceans. Moreover, given the highly derived state of the sound-producing apparatus uniquely evolved by cetaceans, the imitative capacities found in this study also underscore the fine-tuned ability of this species to flexibly produce accurate matches of heterospecific in-air sounds.

Future experimental studies of imitation of in-water sounds demonstrated by conspecifics are needed to firmly establish the role of social learning in the killer whale's vocal dialects in the wild. Finally, we extended the DTW analysis used in previous studies [39,44,45,48] by incorporating several additional features of killer whales' demonstrated and imitated sounds into the algorithm. However, the results must be taken with caution because the choice of features was exploratory. Further studies are thus needed to standardize the assessment of the matching accuracy of different sound features and to validate the dissimilarity index. Although we see great potential in this analytical approach for comparative studies of vocal learning, its applicability may vary depending on the study's objectives, the sounds investigated and the species' vocal production system.
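
To illustrate the general logic of a DTW-based dissimilarity index computed over more than one acoustic feature, a minimal sketch is given below. It is not the algorithm used in this study: the two features per time frame (a frequency value and an amplitude value), the Euclidean frame distance, the normalization by the summed sequence lengths and the toy sequences are all assumptions made for illustration only.

```python
import math


def dtw_dissimilarity(a, b):
    """Length-normalised DTW cost between two sequences of feature vectors.

    Each element of `a` and `b` is one time frame, given as a tuple of
    features (here, hypothetically, frequency and amplitude).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = d + min(
                cost[i - 1][j],      # stretch a frame of b
                cost[i][j - 1],      # stretch a frame of a
                cost[i - 1][j - 1],  # advance both sequences
            )
    # Assumed normalisation, so that longer sounds are not penalised.
    return cost[n][m] / (n + m)


# Toy (hypothetical) sounds: (frequency, amplitude) per frame.
model = [(f, 1.0) for f in (1.0, 2.0, 3.0, 4.0)]            # demonstrated sound
copy = [(f, 1.0) for f in (1.0, 1.0, 2.0, 3.0, 3.0, 4.0)]   # slower copy, same contour
other = [(f, 1.0) for f in (4.0, 3.0, 2.0, 1.0)]            # a different sound

matched = dtw_dissimilarity(model, copy)
mismatched = dtw_dissimilarity(model, other)
print(matched < mismatched)
```

In this sketch, a copy that follows the same contour as the model at a different tempo receives a lower dissimilarity than a comparison with a different sound, which is the kind of same-sound versus different-sound contrast on which the accuracy comparisons rest.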

Electronic supplementary material is available in the online content of the paper and at https://figshare.com/s/2991d28752ca0690e843. This includes the methods, raw data, figures S1–S5 and 12 audio file examples (electronic supplementary material, audio file S1: 3 conspecific familiar sounds; electronic supplementary material, audio file S2.1–S2.5: 5 conspecific novel sounds; electronic supplementary material, audio file S3.1–S3.6: 6 human novel sounds).

Ethics

The Ethics and Animal Welfare Committee (CEBA-MEDUC) of the School of Medicine, Pontifical Catholic University of Chile, approved this research. This research adhered to the legal requirements of the country (France) in which the work was carried out and to Marineland institutional guidelines.

Data accessibility

This article has no additional data.

Authors' contributions

J.Z.A. conceived the study. J.Z.A., M.V.H.-L. and J.C. designed the experiment, which was conducted by J.Z.A. and M.V.H.-L. M.V.H.-L. designed and carried out the data analyses and interpretation. J.Z.A. and L.G. performed the sound analyses and interpretation. J.Z.A. and M.V.H.-L. drafted the paper. J.Z.A. and F.C. co-wrote the paper. J.C. and F.A. helped to write and provided critical revisions and some of the ideas in the paper. J.Z.A., F.C., F.A. and J.C. provided financial support. All the authors have read and approved the final manuscript.

Competing interests

We declare we have no competing interests.

Funding

This project was conducted at the Marineland Aquarium Antibes, France and supported by a Postdoctoral Scholarship FONDECYT no. 3140580 to J.Z.A. This study was partly funded by project grants PSI2011-29016-C02-01, PSI2014-51890-C2-1-P (MINECO, Spain) and UCM-BSCH GR3/14-940813 (Universidad Complutense de Madrid y Banco Santander Central Hispano) to F.C.

Acknowledgements

Special thanks are expressed to Andrea Hodgins for sharing her data and making her thesis available to us and to Manuel Castellote for his advice on sound analyses. We are grateful to the directors of Parques Reunidos and the Marineland Aquarium of Antibes, Jesús Fernández and Jon Kershaw, for allowing us to conduct this research. Furthermore, we appreciate the work of head coaches Lindsay Rubinacam and Duncan Versteegh, and the team of orca trainers for their help and support. Special thanks are expressed also to Amy Walton for her dedication and training of the killer whale model Moana. Finally, we want to thank Francisco Serradilla from Universidad Politécnica de Madrid for developing the sound launcher app for iOS.

Footnotes

Electronic supplementary material is available online at https://dx.doi.org/10.6084/m9.figshare.c.3982647.