By Adrian Colyer, Venture Partner, Accel.

DeepSense: a unified deep learning framework for time-series mobile sensing data processing, Yao et al., WWW’17

DeepSense is a deep learning framework that runs on mobile devices, and can be used for regression and classification tasks based on data coming from mobile sensors (e.g., motion sensors). An example of a classification task is heterogeneous human activity recognition (HHAR) – detecting which activity someone might be engaged in (walking, biking, standing, and so on) based on motion sensor measurements. Another example is biometric motion analysis where a user must be identified from their gait. An example of a regression task is tracking the location of a car using acceleration measurements to infer position.

Compared to the state of the art, DeepSense provides an estimator with far smaller tracking error on the car tracking problem, and outperforms state-of-the-art algorithms on the HHAR and biometric user identification tasks by a large margin.

Despite a general shift towards remote cloud processing for a range of mobile applications, we argue that it is intrinsically desirable that heavy sensing tasks be carried out locally on-device, due to the usually tight latency requirements, and the prohibitively large data transmission requirement as dictated by the high sensor sampling frequency (e.g., accelerometer, gyroscope). Therefore we also demonstrate the feasibility of implementing and deploying DeepSense on mobile devices by showing its moderate energy consumption and low overhead for all three tasks on two different types of smart device.

I’d add that on-device processing is also an important component of privacy for many potential applications.

In working through this paper, I ended up with quite a few sketches in my notebook before I reached a proper understanding of how DeepSense works. In this write-up I’m going to focus on taking you through the core network design, and if that piques your interest, the rest of the evaluation details etcetera should then be easy to pick up from the paper itself.

Processing the data from a single sensor



Let’s start off by considering a single sensor (ultimately we want to build applications that combine data from multiple sensors). The sensor may provide multi-dimensional measurements, for example a motion sensor that reports motion along the x, y, and z axes. We collect sensor readings in each of these d dimensions at regular intervals (i.e., a time series), which we can represent as a matrix with one row per dimension and one column per reading.

We’re going to process the data in non-overlapping windows of width τ. Dividing the number of data points in the time series sample by τ gives us the total number of windows, T. For example, if we have 5 seconds of motion sensor data and we divide it into windows lasting 0.25s each, we’ll have 20 windows.
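To make the window arithmetic concrete, here is a minimal sketch in plain Python. The 100 Hz sampling rate, the placeholder readings, and the helper name `split_into_windows` are assumptions for illustration only; none of them come from the paper.

```python
def split_into_windows(samples, window_len):
    """Split samples into non-overlapping windows of window_len readings.

    Any trailing partial window is dropped.
    """
    n_windows = len(samples) // window_len
    return [samples[i * window_len:(i + 1) * window_len] for i in range(n_windows)]


sample_rate_hz = 100   # hypothetical sampling rate
duration_s = 5         # 5 seconds of data, as in the example above
tau_s = 0.25           # window width in seconds

samples = list(range(sample_rate_hz * duration_s))   # 500 placeholder readings
window_len = round(tau_s * sample_rate_hz)            # 25 readings per window
windows = split_into_windows(samples, window_len)

print(len(windows))  # 20
```

With 500 readings and 25 readings per window, T comes out at 20, matching the 5 s / 0.25 s example.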

Finding patterns in the time series data works better in the frequency domain than in the time domain, so the next step is to take each of the T windows and pass it through a Fourier transform, resulting in f frequency components, each with a magnitude and a phase. This gives us a d x 2f matrix for each window.

We’ve got T of these, and we can pack all of that data into a d x 2f x T tensor.
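The windowing and Fourier steps can be sketched with NumPy as follows. This is an illustration under assumptions rather than the authors' code: it uses the one-sided real FFT (`np.fft.rfft`), so f is `window_len // 2 + 1`, and it lays out all magnitudes followed by all phases. The paper may order the magnitude and phase values differently. The function name `to_frequency_tensor` and the random input are hypothetical.

```python
import numpy as np


def to_frequency_tensor(readings, window_len):
    """Turn a (d, n) array of readings into a (d, 2f, T) tensor.

    Each non-overlapping window is Fourier transformed and stored as
    f magnitudes followed by f phases (layout is an assumption).
    """
    d, n = readings.shape
    T = n // window_len
    # Drop any trailing partial window, then cut into T windows: (d, T, window_len)
    windows = readings[:, :T * window_len].reshape(d, T, window_len)
    # One-sided FFT along time within each window: (d, T, f)
    spectrum = np.fft.rfft(windows, axis=-1)
    # Magnitude and phase side by side: (d, T, 2f)
    features = np.concatenate([np.abs(spectrum), np.angle(spectrum)], axis=-1)
    # Reorder axes to (d, 2f, T)
    return features.transpose(0, 2, 1)


rng = np.random.default_rng(0)
readings = rng.standard_normal((3, 500))   # d = 3 axes, 500 placeholder readings
tensor = to_frequency_tensor(readings, window_len=25)
print(tensor.shape)  # (3, 26, 20)
```

Slicing `tensor[:, :, t]` then gives the d x 2f matrix for window t.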

It’s handy for the implementation to have everything nicely wrapped up in a single tensor at this point, but actually we’re going to process slice by slice in the T dimension (one window at a time). Each d x 2f window slice is passed through a convolutional neural network component comprising three stages:

First we use 2D convolutional filters to capture interactions among dimensions, and in the local frequency domain. The output is then passed through 1D convolutional filter layers to capture high-level relationships. The output of the last filter layer is flattened to yield a sensor feature vector.
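A rough NumPy sketch of this three-stage component for a single window slice is shown below. Several things here are assumptions: the first-layer filters are taken to span all d rows and a small band of frequency columns, the filter width of 3 is arbitrary, the weights are random rather than learned, and batch normalization is left out. The 64 filters per layer and the ReLU activation follow the paper's description. The function names are hypothetical.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def conv_valid(x, w):
    """Slide filters along the column (frequency) axis with no padding.

    x: (rows, n) input; w: (n_filters, rows, width) filters covering all rows.
    Returns (n_filters, n - width + 1). As in most deep learning libraries,
    this is a cross-correlation (the filter is not flipped).
    """
    n_filters, rows, width = w.shape
    assert rows == x.shape[0]
    n_out = x.shape[1] - width + 1
    out = np.empty((n_filters, n_out))
    for j in range(n_out):
        patch = x[:, j:j + width]   # (rows, width)
        out[:, j] = np.tensordot(w, patch, axes=([1, 2], [0, 1]))
    return out


def sensor_feature_vector(window_slice, weights):
    """Apply the conv layers in turn (ReLU after each), then flatten."""
    h = window_slice
    for w in weights:
        h = relu(conv_valid(h, w))
    return h.ravel()


rng = np.random.default_rng(0)
d, two_f, n_filters, width = 3, 16, 64, 3   # width is a hypothetical choice
weights = [
    rng.standard_normal((n_filters, d, width)) * 0.1,          # 2D layer over d x width
    rng.standard_normal((n_filters, n_filters, width)) * 0.1,  # 1D layer over 64 channels
    rng.standard_normal((n_filters, n_filters, width)) * 0.1,  # 1D layer over 64 channels
]
window_slice = rng.standard_normal((d, two_f))
features = sensor_feature_vector(window_slice, weights)
print(features.shape)  # (640,)
```

The same `conv_valid` helper serves both layer types: after the first layer, the 64 filter outputs play the role of rows, so the later layers are 1D convolutions along frequency over 64 channels.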

Combining data from multiple sensors



Follow the above process for each of the K sensors that are used by the application. We now have K sensor feature vectors that we can pack into a matrix with K rows.

The sensor feature matrix is then fed through a second convolutional neural network component with the same structure as the one we just looked at. That is, a 2D convolutional filter layer followed by two 1D layers. Again, we take the output of the last filter layer and flatten it into a combined sensors feature vector. The window width τ is tacked onto the end of this vector.
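The data flow of this merge step can be sketched as below. The function `combine_sensors` and its `merge_cnn` argument are hypothetical names; the merge network itself is left as a pluggable callable (defaulting to a pass-through) so that only the packing, flattening, and appending of τ are shown. The sensor vectors are made-up values.

```python
import numpy as np


def combine_sensors(sensor_vectors, tau, merge_cnn=None):
    """Pack K sensor feature vectors into a K-row matrix, run the merge
    network (a placeholder here), flatten, and append the window width tau.
    """
    matrix = np.stack(sensor_vectors)   # (K, feature_len)
    merged = merge_cnn(matrix) if merge_cnn is not None else matrix
    return np.concatenate([np.ravel(merged), [tau]])


# Two hypothetical sensors with 4 features each
accel = np.array([0.1, 0.2, 0.3, 0.4])
gyro = np.array([0.5, 0.6, 0.7, 0.8])
combined = combine_sensors([accel, gyro], tau=0.25)
print(combined.shape)  # (9,)
```

In a real model, `merge_cnn` would be the second convolutional component described above, and the flattened length would depend on its filter sizes.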

For each convolutional layer, DeepSense learns 64 filters, and uses ReLU as the activation function. In addition, batch normalization is applied at each layer to reduce internal covariate shift.

Now we have a combined sensors feature vector for one time window. Repeat the above process for all T windows.