Regular and non-invasive measurements of vital signs are important in both hospital and home situations because of their fundamental role in the diagnosis of health conditions and the monitoring of well-being. Such measurements include pulse rate (PR), pulse rate variability (PRV), breathing rate, blood oxygen level (SpO2), and blood pressure. Conventionally, these vital signs are measured using contact sensors, which include electrocardiogram probes, chest straps, pulse-oximeters, and blood pressure cuffs. These contact methods, however, are not well suited to many settings (e.g., day-to-day monitoring of the elderly, who find it difficult to follow instructions accurately, or monitoring newborn babies, who have delicate skin).

To enable contact-free vital sign estimation, a new breed of algorithms is being developed for use with off-the-shelf CMOS cameras (e.g., mobile phone cameras).1–3 With this new class of algorithms, small imperceptible changes in skin color (caused by cardio-synchronous variations in the blood that lies under the surface of the skin) can be magnified and enhanced. Although the idea of using cameras to measure skin color changes, for the purpose of obtaining vital sign information, has been around for a decade, a robust implementation of the technique has remained elusive. Indeed, there have been three main practical challenges. First, small color changes are harder to extract for people with darker skin tones. Second, low lighting conditions can cause low signal-to-noise ratios (SNRs), which move the signal of interest closer to the noise floor. Lastly, natural movements of subjects in front of the camera can cause significant noise, which corrupts the signal of interest.

In our recent research we have addressed these three challenges and have proposed a new method in which we ‘divide and conquer’ the problem.4 We achieve this by tracking different parts of the face independently. We then compute a ‘goodness metric’ for each tracked part and subsequently combine the signals in an adaptive way that is based on our defined metric. The steps involved in our algorithm are illustrated in Figure 1.



Figure 1. Steps involved in the ‘distancePPG’ algorithm for estimating camera-based photoplethysmogram (PPG) signals. MRC: Maximum ratio combining. KLT: Kanade-Lucas-Tomasi feature tracker. ROI: Region of interest.

The optics that underlie our camera-based vital sign measurement system are similar to those in a conventional reflectance pulse-oximeter: when light passes through tissue, some of it is absorbed by the chromophores that are present in the blood and the rest is reflected (i.e., back-scattered) to the photodetector. The amount of light that is absorbed depends on the volume of blood in the catchment area and on the concentration of major chromophores (e.g., hemoglobin and oxyhemoglobin). When the volume of blood beneath the skin changes in sync with the pumping of the heart, the amount of light reflected also changes. That variation, which appears as a skin-color change, can thus be recorded as a photoplethysmogram (PPG) signal with the use of a pulse-oximeter. The major differences in our contactless camera-based system are that the light source is at a much larger distance than for contact-based pulse-oximeters (i.e., we use ambient light) and that the photodetector is replaced by a CMOS sensor grid. This sensor is also at a larger distance (i.e., 1–2 m away) than the photodetector of a pulse-oximeter. These two differences mean that the PPG signal measured using cameras in natural light is significantly weaker than that measured with a pulse-oximeter.

In our method, we track each of the different regions (e.g., forehead, cheeks, chin) of a face separately as the person moves in front of the camera. To extract the associated PPG waveform (under motion), we use a deformable face tracker5 and a Kanade-Lucas-Tomasi feature tracker.6 One major insight that has resulted from our work is that the different regions of interest (ROIs) of a face have different PPG signal strengths. The spatial differences in the signal strength are caused by spatial variations in the depth and density of arterioles beneath the skin surface, combined with spatial variations in the intensity of light that is incident on different regions of the face. We use the goodness metric to quantify the PPG signal strength from each ROI. We define this metric in the frequency domain as the ratio of the power of the PPG signal around the pulse rate to the power of the PPG signal in the rest of the frequency band (0.5–5Hz). The definition of our goodness metric is illustrated schematically in Figure 2(a). Since the PPG signal strengths that are obtained from different ROIs can also change over time (because of changes to lighting conditions or movements of individuals in front of the camera), we re-compute our estimate of the goodness metric every 10s. We treat the PPG signals from different ROIs like signals received by the elements of a wireless antenna array, and use a maximum ratio combining (MRC) algorithm to provide an overall PPG estimate that maximizes the SNR.7 Sample PPG signals (obtained from different regions of a face, with associated weightings) and our final estimate of the camera-based PPG signal, obtained using the weighted-averaging algorithm, are shown in Figure 2(c).
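The goodness metric and the weighted combination described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the published implementation: the function names, the synthetic ROI signals, the use of a plain FFT periodogram, and the choice to use the goodness metric directly as the combining weight are all assumptions, and the pulse rate is assumed to be known in advance.

```python
import numpy as np

def goodness_metric(signal, fs, pulse_rate_hz, b=0.1, band=(0.5, 5.0)):
    """Power near the pulse rate divided by the power in the rest of the band.

    b (Hz) and band (Hz) follow the values quoted in the text; the periodogram
    estimate of the spectrum is an assumption made for this sketch.
    """
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                  # remove the DC level
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2         # periodogram (unnormalized)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    near_pr = in_band & (np.abs(freqs - pulse_rate_hz) <= b)
    pulse_power = power[near_pr].sum()
    rest_power = power[in_band & ~near_pr].sum()
    return pulse_power / (rest_power + 1e-12)        # small term avoids 0/0

def combine_rois(roi_signals, fs, pulse_rate_hz):
    """Weighted average of ROI signals, weighted by each ROI's goodness metric."""
    roi_signals = np.asarray(roi_signals, dtype=float)
    weights = np.array([goodness_metric(s, fs, pulse_rate_hz) for s in roi_signals])
    centered = roi_signals - roi_signals.mean(axis=1, keepdims=True)
    combined = (weights[:, None] * centered).sum(axis=0) / weights.sum()
    return combined, weights

# Hypothetical data: three ROIs over one 10 s window at 30 fps, sharing a
# 1.2 Hz (72 bpm) pulse of different strength, each with unit-variance noise.
fs = 30.0
t = np.arange(300) / fs
rng = np.random.default_rng(0)
pulse = np.sin(2 * np.pi * 1.2 * t)
rois = [a * pulse + rng.normal(0.0, 1.0, t.size) for a in (1.0, 0.5, 0.1)]
combined, weights = combine_rois(rois, fs, pulse_rate_hz=1.2)
```

In this sketch the ROI with the strongest pulse component receives the largest weight, so noisy regions contribute little to the combined estimate. Re-running `combine_rois` on each new 10s window would mirror the periodic re-estimation of the metric described in the text.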



Figure 2. Illustrations of various aspects of the distancePPG algorithm. (a) The goodness metric (G_i) is defined by the area under the power spectral density of the PPG signal, Ŷ_i(f). PR: Pulse rate. b: Small frequency band (e.g., 0.1Hz) around the pulse rate. B_1, B_2: Frequency bandwidth of the PPG signal. (b) An image of a face with the goodness metric overlaid. Red and blue regions have high and low values of the goodness metric, respectively. (c) The PPG signal extracted from the four regions of the face, labeled 1–4 in (b). The weighted average camera PPG estimate, p̂(t), is also shown, which is very similar in shape to the pulse-oximeter (pulse-ox) PPG signal.
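Using the symbols from the Figure 2 caption, the verbal definition of the goodness metric (power around the pulse rate divided by power in the rest of the band) can be written as the following ratio. The integration limits PR − b and PR + b are an assumption about how the band b around the pulse rate is applied; B_1 and B_2 correspond to the 0.5Hz and 5Hz band edges quoted in the text.

```latex
G_i \;=\;
\frac{\displaystyle\int_{PR-b}^{PR+b} \hat{Y}_i(f)\,\mathrm{d}f}
     {\displaystyle\int_{B_1}^{B_2} \hat{Y}_i(f)\,\mathrm{d}f
      \;-\; \int_{PR-b}^{PR+b} \hat{Y}_i(f)\,\mathrm{d}f}
```

A region whose spectrum is concentrated at the pulse rate gives a large G_i, whereas a region dominated by broadband noise gives a G_i close to zero.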

For different skin tones (pale white to brown), our ‘distancePPG’ algorithm improves the SNR of the estimated PPG signal by an average of 4.1dB compared with previous methods.3, 8 This improvement to the SNR of the camera-based PPG signal significantly reduces the errors in PR and PRV estimates. Some of our key distancePPG algorithm results, from different situational tests, are illustrated in Figures 3 and 4.4 In addition, we have found that our distancePPG algorithm reduces the root mean square error for the estimation of PRV (using a 30fps camera) to below 16ms.
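To put these two figures in context, the short calculation below converts the decibel gain into a linear ratio and gives the frame interval of a 30fps camera. It assumes that the quoted SNR gain is a power ratio (i.e., 10 log10 of the ratio); the subscripts "new" and "old" are labels introduced here for illustration.

```latex
\frac{\mathrm{SNR}_{\text{new}}}{\mathrm{SNR}_{\text{old}}}
  = 10^{4.1/10} \approx 2.57,
\qquad
T_{\text{frame}} = \frac{1}{30\,\mathrm{s}^{-1}} \approx 33.3\,\mathrm{ms}
```

Under that assumption, a 4.1dB gain corresponds to roughly 2.6 times the SNR, and a PRV error below 16ms is less than half of one frame interval.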



Figure 3. Top: Bland-Altman plot showing a comparison of PR results that are derived using the distancePPG algorithm and ground-truth pulse-oximeter measurements. Bottom: Comparison of PR results that are derived using older PPG estimation methods and ground-truth pulse-oximeter measurements. The results are shown for 12 subjects with diverse skin tones (four subjects had fair skin, four had olive skin, and four had brown skin). All recordings were made while the subjects were at rest. Note the different y-axis scales. The mean bias (average difference between the pulse-oximeter-derived and camera-based PR results) is -0.02 beats per minute (bpm), with a 95% limit of agreement between -0.75 and 0.72bpm for the 12 subjects. For comparison, older PPG estimation methods (which average over the whole face) give results that have a mean bias of -0.4bpm, with a 95% limit of agreement between -4.5 and 3.7bpm.



Figure 4. Top: Bland-Altman plot showing a comparison of PR results derived using the distancePPG algorithm and ground-truth pulse-oximeter measurements. Bottom: Comparison of PR results derived using older PPG estimation methods and ground-truth pulse-oximeter measurements. The results are given for three separate motion scenarios (reading text, watching video, and talking) for five subjects with diverse skin tones. The mean bias achieved using the distancePPG method is 0.48bpm, with a 95% limit of agreement between -5.73 and 6.70bpm. For comparison, the results shown in the bottom plot have a mean bias of 7.17bpm, with a 95% limit of agreement between -18.70 and 33.04bpm.
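The mean bias and 95% limits of agreement quoted in the captions of Figures 3 and 4 are standard Bland-Altman statistics. The sketch below shows one common way to compute them, taking the limits as the bias plus or minus 1.96 sample standard deviations of the differences. The function name and the pulse-rate values are hypothetical and are not data from the study.

```python
import numpy as np

def bland_altman(reference, estimate):
    """Return (bias, lower limit, upper limit) for paired measurements.

    The difference is taken as reference minus estimate, matching the caption's
    "pulse-oximeter-derived minus camera-based" convention.
    """
    reference = np.asarray(reference, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    diff = reference - estimate
    bias = diff.mean()
    sd = diff.std(ddof=1)            # sample standard deviation of differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical pulse rates in bpm (illustration only).
pulse_ox_pr = [60.0, 72.0, 80.0, 65.0]
camera_pr = [61.0, 71.0, 80.0, 66.0]
bias, lower, upper = bland_altman(pulse_ox_pr, camera_pr)
```

A bias near zero with narrow limits indicates close agreement between the camera-based and pulse-oximeter readings, which is how the distancePPG results in the top panels compare with the older methods in the bottom panels.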

Non-contact monitoring of vital signs is an exciting prospect, both from a research perspective and because of its widespread clinical and wellness applications. There are several fundamental challenges, however, that need to be solved before this concept can be ubiquitously used in clinics and homes. Namely, motion-induced changes in surface/specular reflections (which can produce significant levels of noise) need to be rejected, weak PPG signals need to be faithfully estimated, and vital sign estimates must be robust to variations in skin tone (melanin content), lighting conditions, and camera location. We have shown with our distancePPG algorithm that these challenges can be overcome (to some extent) by designing custom signal processing and computer vision algorithms. Our methodology has already been further exploited to improve the accuracy of SpO2 estimation (with the use of a red-green-blue camera).9 We believe that by jointly designing the illumination, camera optics, and signal processing algorithms we will be able to further improve the performance of our approach. Specifically, to improve performance under severe motion scenarios (e.g., videos of people exercising in a gym), we are working on designing customized illumination and advanced signal processing techniques. We will use these to separate skin color changes that are caused by motion artifacts (surface reflection) from those that are caused by the PPG signal (subsurface reflection). By solving challenges in non-contact vital sign monitoring under motion, we believe that the accuracy of other tissue imaging techniques (e.g., terahertz, near-IR, and IR imaging) can also be improved, as these modalities are also affected by motion artifacts.

This work was partially supported by the National Science Foundation's Division of Computer and Network Systems (grant 1126478) and Division of Information and Intelligent Systems (grant 1116718), a Rice University graduate fellowship, a Texas Instruments fellowship, and the Texas Higher Education Coordinating Board (grant THECB-NHARP 13308).

Mayank Kumar, Ashok Veeraraghavan, Ashutosh Sabharwal
Electrical and Computer Engineering, Rice University
Houston, TX

Mayank Kumar is a graduate student. His research interests lie in the areas of signal processing, scalable health, and machine learning. Ashok Veeraraghavan is currently an assistant professor. His research interests are in the broad areas of computational imaging, computer vision, and scalable health.10 Ashutosh Sabharwal is a professor and an IEEE Fellow for his contributions to theory and experimentation of wireless systems and communications. He is also the founder of the WARP project and the Rice Scalable Health Initiative.