Study participants

The study was conducted in the adult ICU of Intermountain LDS Hospital (Salt Lake City, Utah). Participants included patients admitted to rooms equipped with computer vision depth sensors between August and October 2017, as well as staff entering these rooms. The purpose of this study was to develop and validate computer vision algorithms to detect the occurrence of patient mobility activities and to characterize descriptive attributes of those activities, such as their duration and the number of personnel assisting. As such, we did not access patient clinical data or quantify the number of patients monitored, as this information was not necessary to validate algorithmic performance. The study protocol was approved by the Intermountain Healthcare Institutional Review Board. Informed consent was waived because the protocol posed no more than minimal risk to participants.

Data collection and annotation

Depth sensors capture 3D volumetric images of humans and objects based on their distance from the sensor, thereby providing visual information while preserving privacy. Sensors were mounted directly facing the bed in seven individual patient rooms, and image data were collected 24 h a day during the study period (2 months). Supplementary Figure 2 shows a floor plan for the Intermountain LDS Hospital ICU, including the location of each sensor and the relative configuration of each room in the study.

To create a curated data set of mobility event occurrences for model training and evaluation, data were manually reviewed and annotated by trained research assistants for four separate activities related to patient mobilization: patient getting into bed, out of bed, into chair, and out of chair. The number of personnel assisting with each mobility activity was also annotated. Owing to the temporal sparsity of patient mobility activities (which makes occurrences difficult to find in long stretches of recorded data), a web-based application was developed that allowed nursing staff to flag the approximate times of the patient mobility activities they witnessed, providing research assistants with time stamps for focused retrospective review. These coarse time stamps enabled research assistants to examine only the flagged periods of data when identifying and labeling mobility activities, avoiding manual review of thousands of hours of recordings. Three trained research assistants reviewed these sampled periods to provide precise temporal annotations, with each occurrence of a mobility activity reviewed by one research assistant. To assess the consistency of manual review across the different research assistants, a subset of the data was annotated by all three; frame-level inter-rater reliability on this subset was 0.894 using Fleiss’s kappa.28
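
As a concrete illustration, frame-level Fleiss’s kappa can be computed directly from the raters’ per-frame labels. The sketch below is a minimal, generic implementation; the label names and rating layout are hypothetical, not the study’s annotation schema.

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss's kappa for a set of frames, each labeled by the same
    number of raters; ratings[i] is the list of category labels the
    raters assigned to frame i (labels here are hypothetical)."""
    n_frames = len(ratings)
    n_raters = len(ratings[0])

    # Mean per-frame agreement: P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
    p_bar = 0.0
    totals = Counter()
    for frame in ratings:
        counts = Counter(frame)
        totals.update(counts)
        p_bar += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    p_bar /= n_frames

    # Chance agreement P_e from the marginal category proportions
    n_total = n_frames * n_raters
    p_e = sum((c / n_total) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)

# e.g. three raters agreeing perfectly on every frame yields kappa = 1.0
perfect = [["bed_exit"] * 3, ["bed_entry"] * 3, ["chair_exit"] * 3]
kappa = fleiss_kappa(perfect)
```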

Training and test data sets

A total of 563 mobility events were annotated and included in the final, curated data set, comprising 154 instances of patient getting out of bed, 182 of getting into bed, 112 of getting out of chair, and 115 of getting into chair. The final data set included 98,801 frames of data, totaling 5.7 h. From the collected data set, 67% of the mobility activity instances and their surrounding frames were randomly assigned to training and 33% to testing. As such, 379 instances of patient mobility activities were used for training, and the remaining 184 instances were used for testing. The test data set included 48 instances of patient getting out of bed, 64 of getting into bed, 32 of getting out of chair, and 40 of getting into chair.
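
The event-level split described above can be sketched as follows. Splitting at the event level (rather than the frame level) keeps all frames of one event on the same side of the split; the `split_events` helper, seed, and event IDs are illustrative assumptions, not the study’s actual procedure.

```python
import random

def split_events(events, train_frac=0.67, seed=0):
    """Randomly split annotated mobility events into training and test
    sets at the event level, so that frames from a single event never
    appear in both sets."""
    shuffled = list(events)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]

# 563 annotated events, as in the curated data set (IDs are hypothetical)
train, test = split_events([f"event_{i}" for i in range(563)])
```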

Augmentation of training data set

An augmentation data set was used during the training of the neural network for temporal detection of mobility activities and their duration. To improve algorithm performance, additional data comprising simulations of the targeted mobility activities were used to augment the training set during model development. These simulations provided scripted instances of mobility activities over a short period of time, making them far less labor intensive to annotate than non-simulated activities, which occur infrequently over long stretches of time. These data were collected during clinician-led mobility activity simulations in two of the seven patient rooms equipped with computer vision sensors in the LDS Hospital ICU, as well as in a dedicated patient simulation room at Stanford University. In total, the simulations added 318 occurrences of mobility activities, totaling 41,353 frames of additional training data: 97 instances of patient getting out of bed, 93 of getting into bed, 59 of getting out of chair, and 69 of getting into chair. Supplementary Figure 3 shows how simulation data were incorporated into the training data set. The simulation data were used only to improve training of the model (by providing 318 additional training examples) and not to evaluate algorithm accuracy, so the evaluation remains based solely on patient data. We chose not to include any simulation data in the test data set because we felt it would be a less-direct measure of how the algorithm would perform on data from a real-world patient care environment.
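
The augmentation step amounts to concatenating the simulation events onto the real-patient training set while leaving the test set untouched. A trivial sketch, with hypothetical event IDs:

```python
def build_training_set(real_training_events, simulation_events):
    """Augment the training set with clinician-led simulation events.
    Simulation data contribute training examples only; the test set
    remains real patient data so that evaluation reflects real-world
    performance."""
    return list(real_training_events) + list(simulation_events)

# 379 real training events + 318 simulated events, as in the paper
training = build_training_set([f"real_{i}" for i in range(379)],
                              [f"sim_{i}" for i in range(318)])
```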

Supplementary Table 2 shows the performance statistics for the algorithm with and without the addition of the simulation data to the training data set. Obtaining training data through simulation was a time-efficient way to enhance the neural network’s performance, improving the mean sensitivity and specificity on the evaluation data set from 82.93% and 84.44% to 87.20% and 89.20%, respectively. Adding the simulation data provided more examples for all activity classes and increased the variability seen during training. A comparison of the AUC (an aggregate measure of classification performance) for each activity class shows the improvement obtained by adding the simulation data to the training set (Supplementary Figure 4).

Model for detection of mobility activities and their duration

The algorithm for temporal detection of the mobility activities and their duration was a multi-label recurrent convolutional neural network model.29 We used an 18-layer ResNet convolutional neural network30 pre-trained on the large-scale ImageNet31 data set and fine-tuned on our own data to initially extract informative visual features from every frame of data. We subsequently used a two-layer bidirectional long short-term memory recurrent network to reason over temporal structure in consecutive 64-frame sequences of these features. An ensemble of six such models was used to produce the final detection output.

Model for detection of healthcare personnel

The algorithm for quantifying the number of personnel involved in each mobility activity was based on the YOLOv232 convolutional neural network architecture for object detection. The network was trained to predict the spatial locations of people in each image frame using annotated bounding boxes in 1379 frames of patient data. Evaluated against human annotation, the trained person detector achieved a spatial average precision of 0.66. After applying the person detector to the image data, post-processing was used to smooth detections over time. The maximum number of people detected over the duration of a mobility activity, minus one to account for the patient, was used to quantify the number of healthcare personnel involved in each activity. In the data set, 7% of activities involved no healthcare personnel, 51% involved one, 32% involved two, and 10% involved three.
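
The counting step can be sketched from per-frame person counts. The paper states only that detections were smoothed over time, so the sliding-median filter and its window size below are assumptions chosen to suppress spurious single-frame detections.

```python
from statistics import median

def personnel_count(frame_person_counts, window=15):
    """Estimate the number of healthcare personnel assisting a mobility
    activity from per-frame person-detection counts. A sliding median
    (window size is an assumption) suppresses transient false
    detections; one detected person is the patient, so
    personnel = max smoothed count - 1."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_person_counts)):
        lo, hi = max(0, i - half), min(len(frame_person_counts), i + half + 1)
        smoothed.append(median(frame_person_counts[lo:hi]))
    return max(0, int(max(smoothed)) - 1)
```

For example, a single-frame spike to five detections within a stretch of two-person frames is filtered out, while a sustained run of three-person frames yields a count of two personnel.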

Evaluation of algorithm performance

Algorithm accuracy was assessed by comparing the predictions made by the algorithms against the manual annotations of the data set (the ground truth standard). Sensitivity, specificity, and receiver operating characteristic calculations were performed using Python 3.6 (Python Software Foundation, https://www.python.org/).
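
These metrics can be computed directly from frame-level predictions and ground truth, as in the following sketch (the binary label vectors are illustrative, not study data):

```python
def sensitivity_specificity(y_true, y_pred):
    """Frame-level sensitivity and specificity against the manually
    annotated ground truth (binary labels for one activity class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def roc_points(y_true, scores):
    """ROC curve: sweep a threshold over the model's per-frame scores
    and record (false-positive rate, true-positive rate) pairs."""
    points = []
    for thr in sorted(set(scores), reverse=True):
        preds = [1 if s >= thr else 0 for s in scores]
        sens, spec = sensitivity_specificity(y_true, preds)
        points.append((1 - spec, sens))
    return points
```

The AUC reported in Supplementary Figure 4 is the area under the curve traced by these (FPR, TPR) points.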

Code availability

Full code is available from the authors upon reasonable request.