Introduction

Tasks that humans take for granted are often difficult for machines to complete. That’s why when you’re asked to prove yourself human through those CAPTCHA tests, you’re always asked a ridiculously simple question, e.g., identifying whether an image contains a road sign, or selecting the subset of images that contain food (see Moravec’s Paradox). These tests are effective in determining whether a user is human precisely because image recognition in context is difficult for machines. Training computers to accurately answer these kinds of questions in an automated, efficient way for large amounts of data is complicated.

To get around this, companies like Facebook and Amazon spend a lot of money to manually deal with image and video classification problems. For example, TechRepublic suggests that manual labeling of data may be the “future blue-collar job”, something that we already see companies like Facebook doing to curate newsfeed stories. Reviewing millions of images and videos to identify certain types of content by hand is extremely tedious and expensive. Yet despite this, few techniques exist to efficiently analyze image and video content in an automated way.

In this post, I will describe how, as a Fellow for Insight Data Science, I built a classification machine learning algorithm (Crash Catcher!) that employs a hierarchical recurrent neural network to isolate specific, relevant content from millions of hours of video. In my case, my algorithm reviews dashboard camera footage to identify whether or not a car crash occurs. For businesses that may have millions of hours of video to sift through (for instance, an auto insurance company), the tool I created is extremely useful to automatically extract important and relevant content.

Building a Dataset: Dashboard Camera Footage

In the ideal world, the perfect data set for this particular problem would be a large repository of videos with thousands upon thousands of examples of both car crashes and non-crashes. Each video would also contain clear metadata and be consistent in terms of video (e.g., location of camera in car, video quality and duration) and content (e.g., type of car crash, whether it be “head-on” or “t-bone”).

However, the reality is that this type of data doesn’t exist, or is practically impossible to come by. Instead, it is strewn across the web and is contributed by a wide array of individuals across the world. While data collection and preprocessing is not exactly the most glamorous aspect of any kind of Data Science work, it is definitely a significant component that cannot and should not be overlooked.

A “negative” example of a non-accident driving experience.

I obtained access to a private set of videos from VSLab Research, an academic research institution trying to predict car accidents. Their videos were short, four-second clips, with approximately 600 videos of accidents and about a thousand videos without accidents (just normal, boring driving scenes). However, this data had two main problems: too much variation in the types of crashes, and an imbalance between crash (“positive”) and non-crash (“negative”) videos.

Too Much Variation in Crashes

When looking through the videos, I found an exceptional diversity of vehicular accidents — any combination of semis, cars, motorcycles, mopeds, bicyclists, and pedestrians. After removing duplicate driving scenes from the data, I was left with 439 negative videos and 600 positive videos. In my first attempt to train a model with this dataset, my algorithm barely beat a random guess as to whether or not a video contained an accident. Variation in data is great when it can be accurately captured by your model! But when there isn’t enough data to adequately model the complex variations, underfitting often arises, which is exactly what happened in my initial attempt.

Left: videos showing just some of the variety in the VSLab dataset. Right: ROC curve for initial model attempt… :(

Being able to accurately predict so many different kinds of accidents would require far more data than I had available. De-scoping the project to a smaller problem made sense: if I focused only on car collisions, I should be able to predict whether a video contained a car crash.

I manually trimmed the dataset to focus only on automobile collisions, leaving just 36 positive videos of car crashes. That amounts to only 7.5% of the data being positive examples, which creates a new problem: an imbalanced dataset.

Dealing with Imbalanced Data

Imbalanced data can be tricky. If I were to train an algorithm on the 439 negative examples and 36 positive examples in my dataset, the resulting model could simply predict that there were never any crashes and still achieve about 92% accuracy (439 of 475 videos). That accuracy, however, doesn’t reflect the fact that the model can’t actually recognize when a car crash occurs!
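The arithmetic behind that misleading baseline is easy to check. Here is a quick sketch of what a degenerate “always predict no crash” classifier would score on this dataset:

```python
# A degenerate classifier that always predicts "no crash" on the
# imbalanced dataset (439 negatives, 36 positives).
negatives = 439
positives = 36

# It gets every negative right and every positive wrong.
baseline_accuracy = negatives / (negatives + positives)
print(f"Majority-class accuracy: {baseline_accuracy:.1%}")

# Yet its recall on crashes is zero -- it never finds a single accident.
crash_recall = 0 / positives
print(f"Crash recall: {crash_recall:.0%}")
```

High accuracy with zero recall on the positive class is the classic symptom of training on imbalanced data.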

To actually capture positive examples with a model, I wanted (needed!) more examples to create a balanced data set — so I turned to YouTube! I scraped short clips of auto accidents from dashboard camera footage from various YouTube uploads. This resulted in 93 new positive examples, bringing the total to 129. By randomly selecting the same number of negative examples, I created a balanced dataset of 258 videos.
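Balancing by random undersampling of the majority class can be sketched in a few lines. The file names below are purely illustrative, not the actual dataset:

```python
import random

# Sketch of balancing by randomly undersampling the negative class.
# File names are illustrative placeholders, not the real dataset.
random.seed(42)  # reproducible selection

positive_videos = [f"crash_{i:03d}.mp4" for i in range(129)]
negative_videos = [f"normal_{i:03d}.mp4" for i in range(439)]

# Randomly keep exactly as many negatives as there are positives.
negatives_kept = random.sample(negative_videos, len(positive_videos))

balanced = positive_videos + negatives_kept
random.shuffle(balanced)
print(len(balanced))  # 258
```

Discarding negatives does throw away data, but with only 129 positives available, a balanced 258-video set is what the model can realistically learn from.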

Pre-Processing Videos and Images

One of the biggest challenges of working with video is the amount of data. 258 videos may not seem like a lot, but when each video is broken up into individual frames there are over 25,000 individual images!

Upper left: an original frame from a video. Middle: grayscale version of original frame. Lower right: downsampled version of grayscaled frame.

Each image can be thought of as a two-dimensional array of pixels (originally 1280x720), where each pixel carries red, green, and blue (RGB) color levels, making the full frame a 3-D array. This much color information isn’t necessary for the analysis, so I collapsed each 3-D RGB array into a single-channel, two-dimensional grayscale array. I also down-sampled each image by a factor of five, reducing each frame to a 256x144 array. All of this shrinks the data without losing any truly crucial information from the images.
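The per-frame reduction can be sketched with plain NumPy (in practice a library like OpenCV would do the conversion; the synthetic frame and luminance weights below are illustrative):

```python
import numpy as np

# Sketch of the per-frame reduction: RGB -> grayscale -> 5x downsample.
# A synthetic 720x1280 RGB frame stands in for a real video frame.
frame = np.random.randint(0, 256, size=(720, 1280, 3), dtype=np.uint8)

# Collapse the three color channels to one using standard luminance weights.
gray = (0.299 * frame[..., 0] +
        0.587 * frame[..., 1] +
        0.114 * frame[..., 2]).astype(np.uint8)

# Keep every fifth pixel in each dimension (a crude 5x downsample).
small = gray[::5, ::5]

print(gray.shape)   # (720, 1280)
print(small.shape)  # (144, 256)
```

Note that NumPy reports shapes as (height, width), so the 256x144 frame appears as a (144, 256) array.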

The Nitty Gritty: A Hierarchical Recurrent Neural Network

Video datasets are particularly challenging because of their structure — while each frame in the video might be understood using standard image recognition models, understanding overall context is more difficult. Each video is a datapoint I wanted to classify as having/not having a car accident. Yet, each video is really a set of individual images in a time-dependent sequence. There is both a hierarchical structure to the data as well as a time dependency — the model I chose had to address both these characteristics.

To tackle these dependencies, I initially used a pre-trained convolutional neural network (Google’s Inception model) to vectorize each frame of each video into a set of features. But because there weren’t drastic changes from frame to frame, I wasn’t picking up useful information and the model performed barely better than random (still!). After gathering advice from the vast Insight alumni network, I decided to train my own hierarchical recurrent neural network (HRNN). This approach allowed me to train a model to understand the flow of features and objects within a single video, and translate that into patterns that differentiate videos with crashes from videos without crashes.

On the left, a single segment of a recurrent neural network. The circular loop indicates the recursive nature of a recurrent neural network. If we “unfold” the neuron, we can look at how it changes over each iteration. The inputs (x(t-1) , x(t) , x(t+1) …) are filtered through the neuron and then the output (o(t-1) , o(t) , o(t+1)) is controlled by “gates”. These gates determine how much information is retained in memory for the next iteration and what is passed along in the output.

The HRNN is essentially a recurrent neural network wrapped inside another recurrent neural network (specifically, long short-term memory). The first neural network analyzes the time-dependent sequence of the images within each video, tracing objects or features as they move or change throughout the clip (e.g., car headlights or car bumpers). The second recurrent neural network takes the patterns and features encoded by the first neural network and learns patterns to discern which videos contain accidents and which do not.
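This nesting follows the standard Keras hierarchical-RNN pattern: an inner LSTM encodes each frame and an outer LSTM reads the sequence of frame encodings. The sketch below is illustrative rather than the project’s actual code; the layer sizes and the 10-frame clip length are assumptions:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative hierarchical RNN: an inner LSTM encodes each grayscale
# frame row-by-row, and an outer LSTM reads the resulting sequence of
# frame vectors to classify the whole clip. Sizes are assumptions.
frames, height, width = 10, 144, 256  # downsampled clip dimensions

inputs = keras.Input(shape=(frames, height, width))
# Inner LSTM, applied to every frame independently via TimeDistributed:
# each frame's rows form a sequence, reduced to one 64-d frame vector.
frame_encodings = layers.TimeDistributed(layers.LSTM(64))(inputs)
# Outer LSTM: reads the frame vectors in time order to capture motion.
video_encoding = layers.LSTM(64)(frame_encodings)
# Binary output: probability that the clip contains a crash.
output = layers.Dense(1, activation="sigmoid")(video_encoding)

model = keras.Model(inputs, output)
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# A single random "video" just to confirm the shapes line up.
dummy = np.random.rand(1, frames, height, width).astype("float32")
print(model.predict(dummy, verbose=0).shape)  # (1, 1)
```

The `TimeDistributed` wrapper is what makes the model hierarchical: the same inner LSTM is reused for every frame, and only the outer LSTM sees the time dimension across frames.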

The videos were all four-second clips, so I tweaked the code so that the algorithm can account for a video of any length. This setup makes it more useful for companies, which would have longer videos they want to analyze. The code splits longer videos into short segments that can be independently and simultaneously screened by my HRNN to detect which portions of the video (if any) contain an accident. This means the analysis of each segment can be parallelized across multiple GPUs/nodes to reduce the overall time it takes to process a video. So much more practical!
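The segmentation step itself is simple bookkeeping. A minimal sketch, assuming roughly 25 fps footage so that 100 frames corresponds to the four-second clips the model was trained on (the function and frame rate are illustrative, not the project’s exact code):

```python
# Sketch of chunking a long video into fixed-length segments so each
# can be screened independently (and in parallel) by the classifier.
def split_into_segments(num_frames, frames_per_segment=100):
    """Return (start, end) frame index pairs for each segment.

    At ~25 fps, 100 frames approximates the four-second clips the
    model was trained on (an assumption for this sketch).
    """
    return [(start, min(start + frames_per_segment, num_frames))
            for start in range(0, num_frames, frames_per_segment)]

# A one-minute video at 25 fps: 1500 frames -> 15 four-second segments.
segments = split_into_segments(1500)
print(len(segments))   # 15
print(segments[0])     # (0, 100)
print(segments[-1])    # (1400, 1500)
```

Each (start, end) pair can then be dispatched to a separate GPU or worker, which is what makes the per-segment screening embarrassingly parallel.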

How well did the model do?

I used 60% of my dataset to train and 20% to validate my HRNN model. I tuned the model’s hyperparameters (e.g., number of layers of neurons, number of videos loaded into memory at a time, loss function, number of epochs) to optimize for accuracy, iterating over different options for these hyperparameters using the training and validation sets.
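A 60/20/20 split over the 258 balanced videos can be sketched as follows (integer indices stand in for the actual video files, and the rounding choices are illustrative):

```python
import random

# Sketch of a 60/20/20 train/validation/test split over the
# 258 balanced videos (indices stand in for the video files).
random.seed(0)
indices = list(range(258))
random.shuffle(indices)  # shuffle before splitting to avoid ordering bias

n_val = round(0.2 * len(indices))        # 52 videos for validation
n_test = round(0.2 * len(indices))       # 52 videos held out for testing
n_train = len(indices) - n_val - n_test  # remaining 154 for training

train = indices[:n_train]
val = indices[n_train:n_train + n_val]
test = indices[n_train + n_val:]

print(len(train), len(val), len(test))  # 154 52 52
```

The key discipline is that the test indices are set aside before any tuning begins and never touched until the final evaluation.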

You may be wondering about the other 20% of the data I haven’t mentioned — these 52 videos were my holdout test set, used to assess the final performance of the model after tuning. On this test set, the model achieved an accuracy of over 81%, well above random chance, as the ROC (receiver operating characteristic) curve shows. Considering the complexity of the task, reaching such a high level of accuracy was a pleasant surprise!

ROC curves for the training, validation, and test set for the final tuned HRNN model. Overall accuracy is over 80% for the training, validation, and test sets. The dotted line indicates the expected performance if we were randomly guessing whether video contained a car accident or not.

The overall performance and accuracy of the HRNN demonstrate that it generalizes fairly well to videos it hasn’t seen before. The model accurately identified the vast majority of the positive examples in the test set as crashes. Reducing false negatives (crashes incorrectly identified as normal driving scenes) is key, as these are the cases that matter most to companies sifting through large quantities of video, even if the occasional video without a crash is predicted to have one (a false positive).

Conclusion

As tasks become more complex or data become more diverse, exponentially more data is needed to train an accurate neural network. In general, more data + more variety = a more generalizable model. To train this algorithm to accurately predict a wider variety of situations (e.g., classifying head-on collision vs rear-end collision vs car-truck collisions), more data is necessary.

The obvious path for getting more data is to find more examples — but from personal experience, I know that this is a tedious, time-intensive task. Generating “new” data by slightly altering the data we already have is a much more feasible alternative. Applying rotations, horizontal flips, changes in image quality, or other variations to each video would create new content for the HRNN. Although humans can easily recognize altered videos as transformations of the original content, to a machine they look like new and different data. These alterations generate a “larger” dataset and can improve the generalization of the predictions to never-before-seen data.
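The simplest of these augmentations, a horizontal flip, can be sketched in NumPy (the toy clip below is illustrative):

```python
import numpy as np

# Sketch of simple video augmentation: a horizontal flip applied to
# every frame yields a "new" clip the network treats as distinct data.
def flip_video_horizontally(video):
    """video: array of shape (frames, height, width) or (..., channels)."""
    return video[:, :, ::-1]  # reverse the width axis of every frame

# Toy clip: 4 grayscale frames of 144x256.
clip = np.random.rand(4, 144, 256)
flipped = flip_video_horizontally(clip)

print(flipped.shape)  # (4, 144, 256)
```

One caveat for driving footage: a horizontal flip also mirrors road signs and lane positions, so whether it is a safe augmentation depends on whether the model should be invariant to left/right orientation.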

The hardest part of this project was pairing the appropriate dataset with an appropriate deep learning approach to understand video context. This experience was exceedingly challenging but highly rewarding — I was impressed by how much I learned about data processing and deep learning techniques in building my final data product. It was satisfying to put all the pieces together to make something that holds significant value not just for myself, but also for companies and organizations trying to understand video content.