Orang-utan wookies and the species-specific repertoire

Data Collection

To test the first prediction of this study and verify the idiosyncrasy of wookies and their novelty among the known orang-utan repertoire, we recorded spontaneous wookies from Rocky (studbook ID: 3331) during interactions with the human experimenter (MEH) between April and May 2012 at the Indianapolis Zoo, where he is currently housed. We used a ZOOM H4Next Handy recorder via the inbuilt mic standing on a miniature tripod at approximately 0.5 m distance from the subject. Recordings were collected at a bit depth of 24 bit and a sampling rate of 48 kHz and saved in wav format. These settings yielded high-quality audio recordings and are standard for the collection of orang-utan call behaviour in captivity and the wild. The original version of wookies has been produced by Rocky for at least the last 6.5 years. It was already apparent when the experimenters first met Rocky, when he was 3.5 years old. It is unclear how he originally learned the vocalization, and no recordings are available from earlier years. Wookies are produced by the subject to gather attention from caretakers16,21. Recordings from the known orang-utan call repertoire available from previous work22 were used to draw a comparison with wookies.

Data analyses

In order to verify the novelty of wookies in relation to the remaining orang-utan call repertoire, we assessed the largest database ever assembled of orang-utan calls22, currently spanning more than 12,000 observation hours across 9 wild and 6 captive populations, and comprising more than 120 individuals. We compared wookies produced spontaneously (i.e. not given in response to human wookie-versions) with the spectrally most similar vocalization known to be produced by orang-utans – the grumph22. Grumphs were the only vocalizations presently described in the orang-utan repertoire to exhibit a complete overlap in frequency range with wookies (grumphs: 86–1723 Hz, wookies: 99.6–1418 Hz). Both calls were the only orang-utan vocalizations to fall below 100 Hz and simultaneously reach above 350 Hz22 (Fig. 1). Wookies were produced with ingressive air-flow, whereas grumphs were presumably produced with egressive air-flow (like various other orang-utan calls)22. Nevertheless, we decided to conduct a comprehensive acoustic comparison in order to verify, with confidence, wookies’ idiosyncrasy and prevent claims of novelty strictly based on one immeasurable articulatory feature (i.e. air-flow direction). For this comparative analysis, grumphs were sampled from wild adolescent males of similar age to Rocky in order to control for as many potentially confounding factors as possible; primarily, sex and body size variation. In order to control for potential geographic variation in grumph acoustics, all wild adolescent males were sampled from the same population (i.e. Ketambe Forest, Aceh, Sumatra, Indonesia).

Figure 1 Spectrographic representation of two orang-utan grumphs followed by two wookies.

To acoustically compare wookies with orang-utan grumphs, acoustic measures were conducted with Praat, using “voice report” standard settings, except for voicing threshold in the pitch settings, which was set to 0.15. Seven acoustic parameters describing vocal fold oscillation were measured: duration, median pitch, mean pitch, pitch standard deviation, minimum pitch, maximum pitch and pitch amplitude. In addition, three acoustic parameters describing supralaryngeal action were measured: first, second and third formant. Because these parameters directly express the position of the tongue and jaw during vocal production, they were used to assess whether wookies also involved different oral manoeuvres, besides different oscillation patterns at the vocal folds.

Statistical analyses were conducted using nonparametric tests with IBM SPSS Statistics 21 (SPSS, Inc.). To compare the differences between wookies and grumphs, one would typically use a Mann-Whitney U test for each parameter. However, because different individuals contributed several calls each to our dataset, the assumption of data independence required by Mann-Whitney U tests was violated. As such, we opted to conduct Kruskal-Wallis tests between individuals for each parameter, while correcting for multiple testing using Bonferroni correction. We expected the Kruskal-Wallis test results to show the following: for each parameter, our study subject should differ from all other individuals, while all other individuals should not differ among themselves, since wookies derived only from our study subject whereas grumphs derived from all the remaining individuals. For these analyses, we included our subject and the other adolescent males for whom a sample size larger than one was available (i.e. 2 individuals with 24 and 12 calls). This operation resulted in the exclusion of three adolescent males for which only one grumph recording was available.
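As an illustration of the between-individual comparison described above, the following is a minimal sketch of the Kruskal-Wallis H statistic (with the standard tie correction) and a Bonferroni-adjusted significance threshold. It is not the SPSS procedure used in the study; the function names, the toy values and the choice of ten tests (one per acoustic parameter) are assumptions made only for the example.

```python
from itertools import chain


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H for a list of groups (one list of values per individual)."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    return h / (1 - ties / (n_total ** 3 - n_total))


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Toy values (not study data): one hypothetical parameter for three individuals.
toy_groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h_value = kruskal_wallis_h(toy_groups)
threshold = bonferroni_alpha(0.05, 10)  # assumed: ten parameters tested
```

The H statistic would then be compared against a chi-square distribution with (number of groups − 1) degrees of freedom, and the resulting p value against the adjusted threshold.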

Orang-utan vocal fold action in match trials

Data Collection

To test the second prediction of this study, experimental testing was conducted with Rocky during April and May 2012 at the Indianapolis Zoo. The zoo’s committee provided ethical approval and permission to conduct research, and the methods were carried out in accordance with the approved guidelines. The “do-as-I-do” paradigm was selected for match trials because it has previously been used successfully to invoke voluntary call responses in captive orang-utans19,20. The human demonstrator used protective gloves and a facial mask at all times and always interacted with Rocky through enclosed mesh. Rocky was rewarded during trial sessions with customary food snacks (i.e. raisins and dried plums) or drinks, prepared and provided by full-time orang-utan caretakers at the zoo. Caretakers ensured that the items used differed in no noticeable way in terms of the subject’s food preferences, and food rewards did not vary within trial sessions.

Under the “do-as-I-do” test paradigm, the human demonstrator presented Rocky with random sequences (Runs test, Z = −4.751, p < 0.001) of human wookie-versions varying in frequency (Hz) – low vs. high wookies. In total, 513 trials were presented (272 low, 241 high), divided across 13 sessions (~49 trials/session, ~472 seconds/session) over the course of 5 days. The subject typically responded to the model signal within approximately 500 ms.
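For readers unfamiliar with the runs test reported above, the following is a minimal sketch of the Wald-Wolfowitz runs statistic for a two-category sequence, using the large-sample normal approximation without continuity correction. Whether SPSS applied a continuity correction in the study is not stated here, so the sketch should not be read as reproducing the reported Z value. The function name and the toy sequences are assumptions.

```python
import math


def runs_test_z(sequence):
    """Z statistic of the Wald-Wolfowitz runs test for a two-category sequence."""
    categories = sorted(set(sequence))
    if len(categories) != 2:
        raise ValueError("sequence must contain exactly two categories")
    n1 = sum(1 for item in sequence if item == categories[0])
    n2 = len(sequence) - n1
    n = n1 + n2
    # A run is a maximal block of identical consecutive items.
    runs = 1 + sum(1 for a, b in zip(sequence, sequence[1:]) if a != b)
    expected = 2.0 * n1 * n2 / n + 1
    variance = 2.0 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    return (runs - expected) / math.sqrt(variance)


# Toy sequences (not study data): "L" = low model call, "H" = high model call.
clustered = list("LLLLHHHH")    # 2 runs
alternating = list("LHLHLHLH")  # 8 runs
z_clustered = runs_test_z(clustered)
z_alternating = runs_test_z(alternating)
```

In this formulation, a negative Z corresponds to fewer runs than expected for the given numbers of low and high items, and a positive Z to more runs than expected.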

Trial sessions were recorded at ~0.5 m distance from the subject with a ZOOM H4Next Handy recorder via the inbuilt mic standing on a miniature tripod. Recordings were collected at a bit depth of 24 bit and a sampling rate of 48 kHz and saved in wav format. These settings yielded high-quality audio recordings. Rocky only joined trial sessions voluntarily and never refused to participate. Rocky was never food deprived during trial sessions, and trial sessions never interfered with normal feeding times or the working schedule at the orang-utan enclosure, so as to avoid imposing any stress. Rocky was tested when he and his cohort (four other orang-utans) were housed in their individual quarters.

During trial sessions, only the first reply immediately after the human model was considered for analyses, unless the human demonstrator verbally instructed (repeating the call model or saying the name of the variant to be matched, “low” or “high”) the focal to repeat, in which case we considered the call produced after the last instruction provided by the human demonstrator, or the last call produced by the focal before the human demonstrator verbally closed the bout (e.g. by saying “yes” or “very good”). We did not consider calls when overlap between human model and orang-utan match reply did not allow suitable extraction of acoustic parameters from both calls (i.e. focal was too quick to reply).

We intentionally selected a human demonstrator with no previous voice training or music experience. Because our main aim was fundamentally evolutionary, we deliberately avoided using a demonstrator with vocal skills well beyond those potentially present in a human ancestor. We required model calls to be as “raw” and natural-sounding as possible. No a priori guidelines were given to the human demonstrator before match trials and no acoustic treatment was given to her utterances. Moreover, we purposefully did not obstruct the human demonstrator from deploying her natural behaviour during the interaction (e.g. occasional approximation to the subject, occasional arm movement). Crucially, this decision allowed the demonstrator to keep the subject engaged and cooperative during the tests. Nevertheless, we were adamant about providing no training sessions, opportunities or time to the subject before the match trials, and the subject was presented with a human demonstrator with whom he was not familiar. These factors assured that our subject did not develop conditioned responses.

Data analyses

In order to compare the acoustic profile and general vocal fold oscillation between human- and orang-utan-produced wookies, we selected and analyzed call maximum frequency (Hz). This parameter was also used to compare the subject’s wookie sub-variants with each other (spontaneous, high and low). Maximum frequency is the frequency at which maximum energy (dB) occurs within a call. For this reason, maximum frequency contributes disproportionally to pitch and, in the case of wookies, it represented one of the best proxies available for pitch (Spearman test between maximum frequency and mean pitch, r = 0.341, N spontaneous wookies = 124, p < 0.001). Moreover, maximum frequency was equal to the fundamental frequency (F0) in 93.4% of 500 measured cases. Therefore, maximum frequency provided one of the most reliable measures of the oscillation rate of the vocal folds and its perception. In order to assess the subject’s level of accuracy during the task, we also conducted the same test but analysing low and high wookies separately.
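The definition of maximum frequency given above (the frequency at which maximum energy occurs) can be illustrated with a minimal sketch that finds the strongest bin of a Hann-windowed discrete Fourier transform. This is not the Raven Pro implementation; the function name and the synthetic test tone (a 200 Hz fundamental with a weaker 400 Hz harmonic) are assumptions chosen only so that the expected answer is known.

```python
import cmath
import math


def peak_frequency(samples, sample_rate):
    """Frequency (Hz) of the strongest bin in a Hann-windowed DFT of `samples`."""
    n = len(samples)
    # Hann window reduces spectral leakage at the edges of the excerpt.
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(samples)]
    best_bin, best_power = 0, -1.0
    for k in range(1, n // 2):  # skip the DC bin, stop below Nyquist
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return best_bin * sample_rate / n


# Synthetic signal (not study data): 0.1 s at 2000 Hz, so bins are 10 Hz apart.
rate = 2000
tone = [math.sin(2 * math.pi * 200 * t / rate)
        + 0.4 * math.sin(2 * math.pi * 400 * t / rate)
        for t in range(200)]
peak = peak_frequency(tone, rate)
```

In this toy signal the fundamental carries the most energy, so the peak coincides with F0, which is the situation reported for the large majority of measured wookies.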

Besides maximum frequency, we measured duration and maximum power (dB) within each call. Because all recordings were conducted at a constant distance from the study subject, maximum power could be used as a proxy of glottal air pressure during call production. This measure allowed us, thus, to monitor the contribution of abdominal action (generating air current within the vocal tract) during the production of wookies exhibiting different maximum frequencies.

Maximum frequency, duration and maximum power were extracted from recordings using the Raven Pro software package (version 1.5, Ithaca, NY: The Cornell Lab of Ornithology), with a Hann window type and a spectrogram grid spacing of 2.93 Hz. The use of other important parameters characterizing vocal fold oscillation (e.g. harmonics-to-noise ratio) was hampered because these parameters are particularly susceptible to recording settings20.

Nonparametric statistical analyses were conducted using IBM SPSS Statistics 21 (SPSS, Inc.). Spearman binomial correlation test was used to assess a potential effect of human model calls on the responses produced by the study subject. Wilcoxon signed ranks test was used to identify potential differences between wookie subvariants produced by the study subject. Discriminant function analyses were used to assess whether wookie subvariants produced by the study subject could be distinguished perceptually. Discriminant function analyses were conducted both by setting prior probabilities (i.e. chance probability of correct assignment) equal between all groups and by computing prior probabilities based on group size. Because our data set for these analyses derived from the same individual, this did not require conducting a permuted discriminant function analysis. A permuted analysis would have otherwise allowed controlling for a possible confounding variable. For instance, if several individuals had contributed wookie subvariants, the permuted analysis would have allowed controlling for individual variation while assessing the capacity to correctly distinguish wookie subvariants.

Because receivers sense acoustic signals holistically instead of attending to one or few acoustic parameters separately23, we tested whether low and high wookies produced by Rocky were overall perceptually distinct from each other by using automated classification algorithms, combined with artificial neural networks (ANN) and mel frequency cepstral coefficients (MFCC)24, a classification method that scans and analyses signals based on their general acoustic profile. These analyses allowed assessing the differences between wookie sub-variants while taking into consideration their complete acoustic profile simultaneously, rather than one acoustic parameter at a time. For both feature extraction and network analyses, Matlab R2012b (The MathWorks, Inc., Natick, MA, U.S.A.) was used. The MFCCs in this study were computed using the ‘melcepst’-routine available in the toolbox Voicebox. We optimized both MFCC and ANN according to published guidelines24. To acquire an MFCC, each call was sliced into seven frames using a Hamming window, two-thirds frame overlap and 16 mel-spaced filters24. We used 10 hidden layer neurons and 100 iterations to obtain an optimal ANN24. To increase the reliability of the results, every call was tested against seven neural networks, and the condition proposed by the majority of the networks was considered final24. Calls were tested using a leave-one-out procedure24.
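The combination of majority voting across seven networks with a leave-one-out procedure can be sketched as follows. The sketch deliberately replaces the MFCC features and the neural networks with a toy nearest-mean classifier on a single number per call, so it illustrates only the voting and hold-out logic, not the Matlab/Voicebox analysis itself. All function names and values are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Label proposed by most classifiers; seven voters and two labels cannot tie."""
    return Counter(votes).most_common(1)[0][0]


def nearest_mean_trainer(train_calls, train_labels, seed):
    """Toy stand-in for training one network; `seed` is unused in this stand-in."""
    means = {}
    for label in set(train_labels):
        values = [c for c, l in zip(train_calls, train_labels) if l == label]
        means[label] = sum(values) / len(values)
    # The returned "network" assigns a call to the label with the closest mean.
    return lambda call: min(means, key=lambda label: abs(call - means[label]))


def classify_leave_one_out(calls, labels, train, n_networks=7):
    """Classify each call with classifiers trained on all the other calls."""
    predictions = []
    for i, held_out in enumerate(calls):
        train_calls = calls[:i] + calls[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        votes = [train(train_calls, train_labels, seed)(held_out)
                 for seed in range(n_networks)]
        predictions.append(majority_label(votes))
    return predictions


# Toy feature values (not study data): one made-up number per call.
toy_calls = [100, 110, 120, 300, 310, 320]
toy_labels = ["low", "low", "low", "high", "high", "high"]
predicted = classify_leave_one_out(toy_calls, toy_labels, nearest_mean_trainer)
accuracy = sum(p == t for p, t in zip(predicted, toy_labels)) / len(toy_labels)
```

Using an odd number of voters for a two-class problem means a majority always exists, which is one practical reason an ensemble of seven is convenient.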

Lastly, we conducted Spearman binomial correlation tests between maximum frequency, duration and maximum power of the subject’s wookies in order to investigate general production dynamics. With these analyses, we were particularly interested in examining to what extent low and high wookies could have been produced strictly by means of changes in glottal air pressure generated by abdominal control (rather than by vocal fold control).
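As a worked illustration of the rank correlation used above, the following minimal sketch computes Spearman's coefficient as the Pearson correlation of average ranks, which handles ties. It is not the SPSS routine; the function names and the toy values are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman rank correlation between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (spread_x * spread_y)


# Toy values (not study data): maximum frequency (Hz) vs. maximum power (dB).
toy_frequency = [10, 20, 30, 40]
toy_power = [1, 4, 9, 16]
rho_increasing = spearman_rho(toy_frequency, toy_power)
rho_decreasing = spearman_rho(toy_frequency, toy_power[::-1])
```

A coefficient near zero between maximum frequency and maximum power would be the pattern expected if frequency changes were not driven by changes in air pressure alone.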