Animal retrieval and care

Adult bats (Rousettus aegyptiacus) were captured in a natural roost near Herzliya, Israel. This roost is regularly inhabited by a colony of 5,000 to 10,000 bats. The bats were kept in acoustic chambers, large enough to allow flight, and fed with a variety of local fruit. All experiments were reviewed and approved by the Animal Care Committee of Tel Aviv University (Number L-13-016), and were performed in accordance with its regulations and guidelines regarding the care and use of animals for experimental procedures. The use of bats was approved by the Israeli National Park Authority.

Bat housing and monitoring

The bats were housed in 2 identical cages (acoustic chambers; for illustrations refer to ref. 36), with 6 females, 5 pups, and 1 male in cage 1, and 4 females, 4 pups, 1 male, and 1 young (of unknown sex) in cage 2. The cages were continuously monitored for 75 days, with IR-sensitive cameras and omnidirectional electret ultrasound microphones (Avisoft-Bioacoustics Knowles FG-O; 2 microphones in each cage). Audio was sampled using an Avisoft-Bioacoustics UltraSoundGate 1216H A/D converter with a sampling rate of 250 kHz. The chambers were acoustically isolated and their walls were covered with foam to diminish echoes. Raw audio recordings were automatically segmented and filtered for noises and echolocation clicks, leaving only bat social communication calls (see ref. 36 for details of this process). Video was synchronized to the audio, resulting in a short movie accompanying each audio recording. Videos were then analyzed by trained students, who identified the circumstances of each call (emitter, addressee, context, and behavioral response; see details below). The bats were individually marked using a collar with a reflective disc. The observers were cross-validated during their training to ensure qualified annotations. An emitter bat was recognized by its mouth movements, and 2–3 cameras could be used to verify a distinct assignment. If there was any doubt regarding the emitter’s identity, we excluded the vocalization from the analysis. This conservative approach is the main reason for the exclusion of almost 90% of the vocalizations from our analysis. Events in which two bats vocalized together (or shortly after each other) were negligible in number and could be easily distinguished from the spectrograms.

Classification tasks

We managed to annotate 19,021 calls with all of the required details for classification. We then only used individuals for which we had enough vocalizations in at least 3 of the tested contexts (at least 15 per context). The analyzed data hence consist of 14,863 calls produced by 7 adult females (F1–F7). We classified the emitter of the vocalization among these 7 females (the males produced far fewer vocalizations, hence were not used in this study). For extending the emitter recognition to a larger number of individuals, we used all bats, including adults which were previously recorded in the same setup, excluding pups and those with fewer than 400 recorded vocalizations, ending up with 15 individuals (Supplementary Fig. S3). Four aggressive contexts were included in the analysis: (1) Feeding aggression – interactions during feeding or in close proximity to the food; (2) Mating aggression – produced by females in protest against males’ mating attempts; (3) Perch aggression – emitted when two bats that perched close to each other confronted one another, displaying aggressive acts accompanied by rapid movements of the wings, but with no close contact; and (4) Sleep aggression – squabbling over locations, or other aggression, in the day-time sleeping cluster. For the emitter and addressee classification we also included vocalizations for which the context was not conclusive (“General” in Supplementary Table S1). In this General (unidentified) aggression context the interacting bats were usually 10–20 cm apart, while in the other contexts they were ca. 0–10 cm apart. The Mating aggression context was not used in the addressee classification task, as these vocalizations were exclusively directed toward the male (hence identifying the addressee in this case could result from solely identifying the context). In the prediction of the addressees of the vocalizations we used all addressees with at least 20 calls addressed to them.
The outcome of a vocal interaction was defined as one of two options: (1) Depart – the two bats split after the interaction, where either both went their own way, or one of them left and the other stayed in place; (2) Remain together – the two interacting bats stayed in the same position (in close proximity) after the interaction ended. In controlling for emitter/context influence on addressee/outcome classifications (i.e. vocalizations in a specific context by individual emitters) we allowed classes (addressees or outcomes) with at least 10 calls, in order to extend the coverage of different cases.
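The emitter inclusion rule described above (at least 15 calls per context in at least 3 contexts) can be illustrated with a minimal sketch. The `(emitter, context)` record layout and the function name `eligible_emitters` are assumptions made for illustration; they are not taken from the original analysis pipeline.

```python
from collections import Counter, defaultdict


def eligible_emitters(calls, min_calls=15, min_contexts=3):
    """Return emitters with >= min_calls calls in >= min_contexts contexts.

    `calls` is an iterable of (emitter, context) pairs, one per annotated call
    (a hypothetical layout chosen for this sketch).
    """
    counts = Counter(calls)          # (emitter, context) -> number of calls
    well_sampled = defaultdict(int)  # emitter -> number of well-sampled contexts
    for (emitter, _context), n in counts.items():
        if n >= min_calls:
            well_sampled[emitter] += 1
    return {e for e, k in well_sampled.items() if k >= min_contexts}
```

The thresholds are parameters, so the same function could express the looser rule of at least 10 calls per class used in the control analyses.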

Feature extraction for classification

Egyptian fruit bat vocalizations consist of multisyllabic sequences, with short periods of silence between the syllables (Fig. 1E, Supplementary Fig. S1). Each vocalization was first automatically segmented (as described in ref. 36), retrieving only the voiced segments (see “voiced” and “unvoiced” bars in Supplementary Fig. S1). Then, a sliding window of 20 ms (with an overlap of 19 ms between consecutive windows) was used to extract 64 Mel-frequency cepstral coefficients (MFCC) from each window. The MFCC assumes a logarithmic pitch scale, which is typical for mammals (including bats; ref. 46). The mel-scale was originally tuned for human perception. However, as we did not intend to mimic the bat’s auditory system precisely, but only to test for available information, and as there is no equivalent bat scale, we chose to use it. The feature vectors retrieved from all segments were joined into one set of several 64-dimensional feature vectors representing the vocalization (Supplementary Fig. S1). The MFCCs were normalized by subtracting their mean for every recording channel (2 channels in each cage), as is commonly done to reduce (recording) channel biases.
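The windowing arithmetic and the per-channel mean subtraction can be sketched as follows. At the stated 250 kHz sampling rate, a 20 ms window spans 5,000 samples and a 19 ms overlap implies a 1 ms (250-sample) step. The MFCC computation itself is omitted. The function names, the dictionary layout, and the choice to drop a trailing partial window are assumptions of this sketch rather than details given in the text.

```python
SAMPLE_RATE_HZ = 250_000  # sampling rate stated in the text
WINDOW_MS = 20            # window length stated in the text
OVERLAP_MS = 19           # overlap between consecutive windows


def frame_bounds(n_samples, sample_rate=SAMPLE_RATE_HZ,
                 window_ms=WINDOW_MS, overlap_ms=OVERLAP_MS):
    """Return (start, end) sample indices of every full window in a segment."""
    window = sample_rate * window_ms // 1000              # 5000 samples
    hop = sample_rate * (window_ms - overlap_ms) // 1000  # 250 samples
    if n_samples < window:
        return []  # assumption: segments shorter than one window yield no frames
    return [(s, s + window) for s in range(0, n_samples - window + 1, hop)]


def subtract_channel_mean(vectors_by_channel):
    """Subtract, within each recording channel, the mean of every coefficient.

    `vectors_by_channel` maps a channel id to a list of feature vectors
    (hypothetical layout); each vector is a list of floats.
    """
    normalized = {}
    for channel, vectors in vectors_by_channel.items():
        dim = len(vectors[0])
        means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
        normalized[channel] = [
            [v[d] - means[d] for d in range(dim)] for v in vectors
        ]
    return normalized
```

Under these assumptions, a 40 ms voiced segment (10,000 samples) yields 21 overlapping windows, each of which would contribute one 64-dimensional MFCC vector.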

Classification algorithm and cross validation

The GMM-UBM algorithm was used for classification (following ref. 39, which used it for human speaker recognition). In short, given a labeled training set of vocalizations, for each class (e.g. emitter “F1”, the context “Feeding aggr.”, etc.) all sets of feature vectors from all vocalizations of this class are pooled together into one mega-set. This mega-set is then modeled by a Gaussian mixture model (GMM) of 16 Gaussian components. However, instead of directly fitting the GMM onto the data, the GMM parameters are assessed using an adaptive method, based on a universal background model (UBM) (see details of the procedure in Supplementary Fig. S1 and Supplementary Methods). The UBM is a GMM fitted to another set of data, which was not used for training or testing. To this end, we employed the data which was not part of the analysis due to lack of detailed annotations. We drew a random sample of syllables from all of the vocalizations for which the identity of the pair was known but the role of each individual was not certain (i.e. who addressed whom). This sample constituted the background set of 3900 syllables, and its corresponding UBM was used for all of the classifications. A test sample, i.e. a vocalization unseen by the model training algorithm, then received a score for each possible class (e.g., each context). The score was computed as the ratio between the likelihood that the sample was drawn from the specified class (computed using the learned GMM) and the likelihood that the sample was drawn from the UBM. This process results in each sample in the test set having a score for each possible class, and the class with the maximum score is chosen as the prediction for this sample.
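The scoring step can be illustrated with a minimal sketch. It assumes diagonal-covariance mixtures, works in the log domain, and averages the log-likelihood ratio over the feature vectors of a vocalization; these choices are common in GMM-UBM systems but are assumptions here, not details confirmed by the text. The UBM-based adaptation of the class models is not shown, and the `(weights, means, variances)` tuple layout and all function names are hypothetical.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-density of vector x under a diagonal-covariance Gaussian mixture."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, m, v in zip(x, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        component_logs.append(log_p)
    peak = max(component_logs)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


def llr_score(frames, class_gmm, ubm):
    """Mean log-likelihood ratio of a vocalization: class model versus UBM."""
    total = sum(
        gmm_log_likelihood(x, *class_gmm) - gmm_log_likelihood(x, *ubm)
        for x in frames
    )
    return total / len(frames)


def predict(frames, class_gmms, ubm):
    """Return the class with the highest score, together with all scores."""
    scores = {
        label: llr_score(frames, gmm, ubm) for label, gmm in class_gmms.items()
    }
    return max(scores, key=scores.get), scores
```

Because the same UBM term is subtracted for every class, it does not change which class wins for a given vocalization; it does, however, put the scores of different vocalizations on a comparable scale.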
Because some classes in a few classification tasks had small sample sizes, we adopted a leave-one-out approach for cross-validation in all tasks: iteratively, over the entire set, one vocalization at a time is left out of the training set and then given a prediction by the trained model (which means that every prediction is made on a vocalization unseen by the training algorithm). The success of the classification was measured using the balanced accuracy (BA): first, the confusion matrix is normalized by each class size (i.e., the sum of each row is 1, and the diagonal holds the fraction of correct predictions in each class); then, the BA is the average of the confusion matrix diagonal. To estimate a p-value for each success rate we ran permutation tests, where we permuted the labels of the original set (e.g., we mixed the contexts). 100 permutations were used in each test. To exclude any influence of context-dependent background noises on the classification success, we verified that our results could not be replicated by classifying non-voiced recordings from the analyzed contexts.
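The evaluation loop can be sketched as follows. The `fit`, `predict`, and `evaluate` callables are hypothetical placeholders for the classifier and the full cross-validated run. The text does not state how the p-value was computed from the 100 permutations; the estimator used here, (1 + number of permuted BAs at least as large as the observed BA) / (1 + number of permutations), is an assumption.

```python
import random
from collections import defaultdict


def leave_one_out_predictions(samples, labels, fit, predict):
    """Predict each sample with a model trained on all other samples."""
    predictions = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        predictions.append(predict(model, samples[i]))
    return predictions


def balanced_accuracy(true_labels, predicted_labels):
    """Mean per-class recall: the average diagonal of the row-normalized
    confusion matrix."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        totals[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / totals[c] for c in totals) / len(totals)


def permutation_p_value(labels, evaluate, n_permutations=100, seed=0):
    """Estimate a p-value by re-running `evaluate` on shuffled labels.

    `evaluate(labels)` is assumed to run the whole cross-validated
    classification on the given labeling and return its balanced accuracy.
    """
    rng = random.Random(seed)
    observed = evaluate(labels)
    at_least_as_good = 0
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)
        if evaluate(shuffled) >= observed:
            at_least_as_good += 1
    return (at_least_as_good + 1) / (n_permutations + 1)
```

With 100 permutations, the smallest p-value this estimator can report is 1/101, which is roughly 0.01.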

2D projections for visualization

Each vocalization is represented in our data as a set of 64-dimensional vectors, where this set is of varying size, depending on the duration of the vocalization. In order to illustrate the interplay between the acoustics of different classes we projected each vocalization onto a 2-dimensional plane (Figs 2B and 3B,D,E). To this end, we assigned each vocalization a new “feature” vector containing the scores it received from our algorithm for each class. Thus, each vocalization was represented by a single C-dimensional vector (where C is the number of classes in the classification task, e.g. 7 for emitter classification). For visualization, we then used the first 2 dimensions of a linear discriminant analysis applied to this new set of C-dimensional vectors. This process can be viewed as a type of “multi-dimensional scaling”: from a variable number of dimensions (each vocalization was described by a different number of 64-dimensional vectors), through the lens of our models, onto a 2-dimensional plane. Importantly, this process is done on scores received when the vocalizations were in the test set, i.e. unseen by the training algorithm.
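The projection step can be sketched with a textbook linear discriminant analysis written in NumPy. This is a minimal illustration, not the authors' implementation: the function name `lda_projection` is hypothetical, and the use of a pseudo-inverse of the within-class scatter matrix is an assumption made so that the sketch tolerates correlated score dimensions.

```python
import numpy as np


def lda_projection(scores, labels, n_components=2):
    """Project C-dimensional score vectors onto the leading LDA axes.

    `scores` has one row per vocalization and one column per class score;
    `labels` holds the true class of each vocalization.
    """
    X = np.asarray(scores, dtype=float)
    y = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    dim = X.shape[1]
    within = np.zeros((dim, dim))   # within-class scatter
    between = np.zeros((dim, dim))  # between-class scatter
    for label in np.unique(y):
        group = X[y == label]
        group_mean = group.mean(axis=0)
        centered = group - group_mean
        within += centered.T @ centered
        offset = (group_mean - overall_mean)[:, None]
        between += len(group) * (offset @ offset.T)
    # Discriminant axes: leading eigenvectors of pinv(within) @ between.
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(within) @ between)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    axes = eigvecs[:, order].real
    return (X - overall_mean) @ axes
```

Each row of the returned array gives the 2D coordinates of one vocalization, which could then be drawn as a scatter plot colored by class.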