This study was approved by the University of Washington Institutional Review Board. The methods were performed in accordance with the University of Washington’s ethical, professional, and legal standards. The 9-1-1 dataset was provided by Public Health-Seattle & King County, Division of Emergency Medical Services. For the sleep apnea dataset, human participants in the polysomnography studies provided written informed consent.

Datasets

The positive data represent a subset of 9-1-1 calls that (a) involved a known cardiac arrest and (b) had been identified as containing cardiac arrest-associated agonal breathing instances. The negative data consist of recordings of 12 patients sleeping in a sleep lab, recorded on a Samsung Galaxy S4.

Our agonal breathing recordings are sourced from 9-1-1 emergency calls made between 2009 and 2017, provided by Public Health-Seattle & King County, Division of Emergency Medical Services. There are 729 calls totaling 82 h (Fig. 2a). The provided recordings include only calls involving cardiac arrest, and specifically those determined to contain occurrences of agonal breathing, either through audible identification of agonal breathing or from the caller’s description of the breathing. Each call is further rated by the 9-1-1 operator and an EMS quality assurance reviewer with a confidence score indicating the presence of audible agonal instances. We train our classifier on audio from calls that both the operator and the reviewer rated with high confidence as containing audible agonal instances. These instances predominantly occurred when the 9-1-1 operator asked the caller to place the phone next to the victim’s mouth so that the breathing could be identified. A clinician experienced in identifying agonal breathing listened to a subset of recordings with the researcher and pointed out instances of agonal breathing. The trained researcher then identified all instances of agonal breathing that did not co-occur with interfering sounds such as human speech, by listening to the 162 calls (19 h) and manually recording timestamps at which agonal breathing was heard during the call. For every timestamp annotation, we extract 2.5 s of audio from the start of each agonal breath, yielding a total of 236 clips of agonal breathing instances. The female to male ratio was 0.5 and the median age was 62 (IQR: 21).
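
As an illustration of this clip-extraction step, a minimal Python sketch is shown below; the file path, annotation format, and use of the soundfile library are our own illustrative assumptions rather than part of the original tooling.

```python
import soundfile as sf

CLIP_SECONDS = 2.5   # audio extracted from the start of each annotated breath

def extract_clips(call_wav, timestamps_s):
    """Cut a fixed-length clip at each annotated agonal-breath onset.

    call_wav     -- path to one 9-1-1 call recording (hypothetical file)
    timestamps_s -- annotated onsets, in seconds from the start of the call
    """
    audio, sr = sf.read(call_wav)
    clip_len = int(CLIP_SECONDS * sr)
    clips = []
    for t in timestamps_s:
        start = int(t * sr)
        clip = audio[start:start + clip_len]
        if len(clip) == clip_len:        # drop clips truncated by the end of the call
            clips.append(clip)
    return clips
```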

Two independent researchers confirmed the presence of agonal breathing sounds. They were first trained with examples of agonal breathing sounds. They then listened to the 236 clips and were instructed to mark clips that did not contain agonal breathing. The first researcher marked 1 of the 236 clips as not agonal breathing (classifying it as a cough sound), but marked all remaining 235 clips as containing agonal breathing. The second researcher marked all 236 clips as containing agonal breathing.

Our negative dataset consists of 83 h of audio from polysomnographic studies of 12 different patients (Supplementary Table 1). The female to male ratio was 1 and the median age was 57.5 (IQR: 10.25). The mean numbers of hypopnea, central apnea, and obstructive apnea events across patients were 41, 24, and 26, respectively. The mean apnea–hypopnea index (AHI) was 13, where a value of 0–5 is considered ‘no apnea’, 5–15 ‘mild apnea’, 15–30 ‘moderate apnea’, and values > 30 ‘severe apnea’.29 Apnea and hypopnea events were annotated by trained sleep technicians, from which the AHI was calculated. The negative dataset also includes interfering sounds that might be played while a person is asleep: a podcast, a sleep soundscape, and white noise.
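
For reference, the AHI severity bands cited above can be expressed as a simple lookup; the cut-offs below follow the convention described in the text.

```python
def ahi_severity(ahi):
    """Map an apnea-hypopnea index (events per hour) to a severity label."""
    if ahi < 5:
        return "no apnea"
    if ahi < 15:
        return "mild apnea"
    if ahi < 30:
        return "moderate apnea"
    return "severe apnea"

# e.g. the mean AHI of 13 reported above falls in the 'mild apnea' band
assert ahi_severity(13) == "mild apnea"
```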

We augment the data by playing the recordings over the air at distances of 1, 3, and 6 m, in the presence of interference from indoor and outdoor sounds at different volumes, and with a noise cancellation filter applied. The recordings were captured on different devices, namely an Amazon Echo, an iPhone 5s, and a Samsung Galaxy S4. Similarly, for the negative dataset, portions of the sleep data from all patients were played over the air and recorded on different devices as well as over a phone connection: we played a 5 min portion of audio from each patient over the air at different distances and recorded the data on an Amazon Echo, on an iPhone 5s, and over a phone connection. The entire dataset for cross-validation consists of 14,621 data points: 7316 agonal breathing instances and 7305 instances of negative data.

The classifier’s false positive rate is evaluated on a set of real-world sleep sounds that occur in bedroom settings. We recruited 35 subjects through the Amazon Mechanical Turk platform and asked them to record themselves sleeping in their own bedrooms with their smartphones. All submitted recordings were manually reviewed to confirm the presence of sleep sounds. The female to male ratio was 0.35, the median age was 33.00 (IQR: 13.00), the median recording duration was 4.48 h (IQR: 3.12 h), and 28 unique smartphones were used across all subjects (Fig. 3a).

Data preparation

We note that the audio clips were sampled at a frequency of 8 kHz, which is standard for audio received over a telephone. All audio clips are normalized to the range of −1 to 1. The audio clips are then passed through Google’s VGGish14 model to extract feature embeddings from each audio waveform. The VGGish model transforms the waveform into a compact embedding: it resamples all audio waveforms to 16 kHz and then computes a spectrogram using the short-time Fourier transform. A log-mel spectrogram is generated and passed through the network, and PCA is applied to the network’s output to produce a 256-dimensional embedding.
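
A sketch of this front end is shown below, using librosa for the resampling and log-mel computation; the frame parameters (25 ms windows, 10 ms hop, 64 mel bands) follow the published VGGish configuration, and the embedding itself is produced by the pretrained VGGish network rather than by this snippet.

```python
import numpy as np
import librosa

VGGISH_SR = 16000   # VGGish operates on 16 kHz audio

def log_mel_spectrogram(path):
    """Compute a VGGish-style log-mel spectrogram for one audio clip.

    The clip (8 kHz telephone-quality audio) is normalized to [-1, 1]
    and resampled to 16 kHz before the short-time Fourier transform.
    """
    audio, _ = librosa.load(path, sr=VGGISH_SR)            # load and resample
    audio = audio / np.max(np.abs(audio))                   # normalize to [-1, 1]
    mel = librosa.feature.melspectrogram(
        y=audio, sr=VGGISH_SR,
        n_fft=int(0.025 * VGGISH_SR),                        # 25 ms window
        hop_length=int(0.010 * VGGISH_SR),                   # 10 ms hop
        n_mels=64)
    return np.log(mel + 1e-6)                                # log-mel with a small offset

# The log-mel frames are then passed through the pretrained VGGish network,
# and PCA of the network output yields the embedding used by the classifier.
```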

Training algorithm

We performed k-fold cross-validation (k = 10). For any given fold, none of the breathing instances in the validation set occurred in the training set. We evaluate detection accuracy such that no audio file in the validation set appears under any of the different recording conditions in the training set (e.g., if a file played at 6 m is present in the validation set, the same file played at 1 m is not present in the training set). We use a support vector machine with a radial basis function kernel and a regularization parameter (C) of 10. To reduce bias in our classifier, we partitioned the data such that recordings from the same call did not straddle the training and validation sets. During cross-validation, a subject in the training set never occurred in the validation set, or vice versa.
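
A minimal sketch of this grouped cross-validation, assuming the embeddings are stacked in X with binary labels y and a per-call or per-subject identifier in groups, could look as follows with scikit-learn; the helper name is our own.

```python
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.svm import SVC

def cross_validate(X, y, groups, k=10):
    """10-fold cross-validation of the RBF-kernel SVM (C = 10).

    X      -- (n_samples, n_features) VGGish embeddings
    y      -- binary labels (1 = agonal breathing, 0 = negative)
    groups -- call/subject identifier per sample, so recordings from the same
              source never appear in both the training and validation folds
    """
    scores = []
    for train_idx, val_idx in GroupKFold(n_splits=k).split(X, y, groups):
        clf = SVC(kernel="rbf", C=10)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[val_idx], y[val_idx]))
    return np.mean(scores)
```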

Benchmark experiments

To record audio indefinitely on the Echo, we used the Echo’s Drop In feature, which streams audio to another smartphone. That smartphone was plugged into a laptop, which recorded the audio data received on the smartphone’s audio interface. Audio from the Echo is streamed at 16 kHz and recorded at 44.1 kHz; the iPhone recorded data at 44.1 kHz. Each of the 236 audio clips is prepended with a frequency modulated continuous wave (FMCW) chirp. An FMCW chirp has good auto-correlation properties; as a result, we can cross-correlate the recordings from the Echo and iPhone with the chirp to determine the exact timestamp of each audio clip. Each audio clip can then be extracted and transformed into an input for the classifier.
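
A sketch of this alignment step is shown below using scipy; the chirp duration and frequency sweep are placeholder values rather than the exact parameters used in the experiments.

```python
import numpy as np
from scipy.signal import chirp, correlate

FS = 44100   # recording sample rate on the Echo/iPhone

def make_chirp(duration=0.5, f0=100, f1=8000):
    """Linear FMCW chirp prepended to each clip (parameters are illustrative)."""
    t = np.arange(0, duration, 1 / FS)
    return chirp(t, f0=f0, f1=f1, t1=duration, method="linear")

def locate_clip_starts(recording, reference_chirp, n_clips):
    """Find chirp onsets in a long recording via cross-correlation.

    Each audio clip starts immediately after its detected chirp.
    """
    corr = correlate(recording, reference_chirp, mode="valid")
    peaks = np.argsort(np.abs(corr))[::-1]   # strongest correlation peaks first
    starts = []
    for p in peaks:
        # keep only peaks that are well separated from already-accepted ones
        if all(abs(p - s) > len(reference_chirp) for s in starts):
            starts.append(p)
        if len(starts) == n_clips:
            break
    return sorted(s + len(reference_chirp) for s in starts)
```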

In our benchmark scenarios, we evaluate the detection accuracy of our classifier across different distances on a second-generation Amazon Echo and an iPhone 5s. We played the 236 audio clips of agonal breathing from an AmazonBasics wireless Bluetooth speaker and recorded the audio on the Echo and iPhone. The sound intensity of the recordings was ~70 dBA at a distance of 1 m. We fixed the locations of the Echo and iPhone and placed the speaker at different distances.

To evaluate the audio interference cancellation algorithm, we set the iPhone 5s to play music at two different volumes (45 and 67 dBA) while simultaneously recording audio. We then ran an acoustic interference cancellation algorithm that allows the smartphone to locally reduce the interference of its own audio transmissions. We used an adaptive least mean squares (LMS) filter that adapts the device’s known transmission to match its contribution to the received audio, which is then subtracted from the recording. Our filter uses the Sign-Data LMS algorithm with 100 weights and a step size of 0.05.
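
A minimal numpy sketch of the Sign-Data LMS update is shown below, assuming the device’s own playback is available as the reference signal; it mirrors the 100-tap, 0.05 step-size configuration described above but is not the deployed implementation.

```python
import numpy as np

def sign_data_lms(mic, playback, n_taps=100, mu=0.05):
    """Sign-Data LMS interference cancellation.

    mic      -- audio recorded by the phone's microphone
    playback -- the phone's own audio transmission (reference signal)
    Returns the residual signal with the device's playback suppressed.
    """
    w = np.zeros(n_taps)
    residual = np.zeros(len(mic))
    for n in range(n_taps, len(mic)):
        x = playback[n - n_taps:n][::-1]   # most recent reference samples first
        y = np.dot(w, x)                   # estimated interference at the mic
        e = mic[n] - y                     # residual after cancellation
        w += mu * e * np.sign(x)           # Sign-Data LMS weight update
        residual[n] = e
    return residual
```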

When evaluating system performance in the presence of interfering sounds, we use two external speakers: one plays the agonal breathing recordings and the other plays the interfering noise. The interfering noise is played at a sound intensity of ~55 dBA measured at a distance of 1 m. The interfering sounds are played outside the room containing the agonal breathing speaker and the recording device, to simulate sounds that would be heard from outside a bedroom.

Run-time analysis

The most time-consuming operations in the detection pipeline are the fast Fourier transforms (FFTs) required to generate the spectrogram and the inferences run on the audio embeddings. Our iPhone 7 implementation of the detection algorithm used the Accelerate framework to perform the FFTs and Monte Carlo sampling to approximate the radial basis function kernel. On an iPhone 7, performing the FFTs to generate a single log-mel spectrogram takes 16 ms and running inference with the support vector machine takes 5 ms. While the classifier can in principle run locally on the Echo device, Amazon currently does not allow third-party programs to locally analyze data. Thus, to estimate the performance of our system when run natively on an Amazon Echo, we ran our pipeline on an iPhone 4, which shares the same Cortex-A8 processor as the Echo. On an iPhone 4, computing the spectrogram takes 40 ms and making predictions takes 18 ms.
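
The Monte Carlo kernel approximation mentioned above corresponds to random Fourier features; a sketch using scikit-learn’s RBFSampler with a linear SVM is shown below (gamma and the number of components are illustrative, not the deployed values), so that inference reduces to a fixed projection and dot product of the kind that maps naturally onto the Accelerate framework.

```python
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Monte Carlo approximation of the RBF kernel: the embedding is projected onto
# randomly sampled Fourier features, after which a linear model suffices.
approx_rbf_svm = make_pipeline(
    RBFSampler(gamma=1.0, n_components=512, random_state=0),
    LinearSVC(C=10),
)
# Usage: approx_rbf_svm.fit(X_train, y_train); approx_rbf_svm.predict(X_val)
```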

Reporting summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this paper.