We‘ll pass each frame of video through the pipeline, one frame at a time.

The first step in the pipeline is to detect all possible parking spaces in a frame of video. Obviously we need to know which parts of the image are parking spaces before we can detect which parking spaces are unoccupied.

The second step is to detect all the cars in each frame of video. This will let us track the movement of each car from frame to frame.

The third step is to determine which of the parking spaces are currently occupied by cars and which aren’t. This requires combining the results of the first and second steps.

And the last step is to send a notification when a parking space becomes newly available. This will be based on changes in car positions between frames of video.

We can accomplish each of these steps a number of different ways using a variety of technologies. There’s no single right or wrong way to build this pipeline and different approaches will have different advantages and disadvantages. Let’s walk through each step!

Step 1: Detecting Parking Spaces in an Image

Here’s what our camera view looks like:

We need be able to scan that image and get back a list of areas that are valid to park in, like this:

Valid parking locations on this city street.

The lazy approach would be hardcoding the locations of each parking space into the program by hand instead of trying to detect the parking spaces automatically. But then if we ever move the camera or want to detect parking spaces on a different street, we’d have to hardcode the parking space locations again by hand. That sucks, so let’s find an automatic way to detect parking spaces.

One idea might be to look for parking meters and assume that each meter has a parking space beside it:

Detecting parking meters in an image.

But there are some complications with this approach. First, not every parking spot has a parking meter — in fact, we are most interested in finding spots that we don’t have to pay for! And second, just knowing the location of a parking meter doesn’t tell us exactly where the parking space is. It just gets us a little closer.

Another idea is to build an object detection model that looks for the parking space hash marks drawn on the road, like this:

Notice the tiny yellow marks — those are where the boundaries of each parking space are drawn on the road.

But this approach is also painful. First of all, the parking space line markers in my city are really small and hard to see from a distance, so they will also be hard to detect with a computer. And second, the street is full of all kinds of unrelated lines and marks. It will be difficult to separate out which lines are parking spaces and which lines are lane dividers or crosswalks.

Whenever you have a problem that seems difficult, take a few minutes to see if you can think of a different way to approach the problem that sidesteps some of the technical challenges. What exactly is a parking space anyway? A parking space is just a place where a car parks for a long time. So maybe we don’t need to detect parking spaces at all. Why can’t we just detect cars that don’t move for a long time and assume that they are in parking spaces?

In other words, valid parking spaces are just places containing non-moving cars:

The bounding box of each car here is actually a parking space! We don’t need to actually detect parking spaces if we can detect stationary cars.

So if we can detect cars and figure out which ones aren’t moving between frames of video, we can infer the locations of parking spaces. Easy enough–-let’s move on to detecting the cars!

Detecting Cars in an Image

Detecting cars in a frame of video is a textbook object detection problem. There are lots of machine learning approaches we could use to detect an object in an image. Here are some of the most common object detection algorithms, in order from “old school” to “new school”:

Train a HOG (Histogram of Oriented Gradients) object detector and slide it over our image to find all the cars. This older, non-deep-learning approach is relatively fast to run, but it won’t handle cars rotated in different orientations very well.

object detector and slide it over our image to find all the cars. This older, non-deep-learning approach is relatively fast to run, but it won’t handle cars rotated in different orientations very well. Train a CNN (Convolutional Neural Network) object detector and slide it over our image until we find all the cars. This approach is accurate, but not that efficient since we have to scan the same image multiple times with the CNN to find all the cars throughout the image. And while it can easily find cars rotated in different orientations, it requires a lot more training data than a HOG-based object detector.

object detector and slide it over our image until we find all the cars. This approach is accurate, but not that efficient since we have to scan the same image multiple times with the CNN to find all the cars throughout the image. And while it can easily find cars rotated in different orientations, it requires a lot more training data than a HOG-based object detector. Use a newer deep learning approach like Mask R-CNN, Faster R-CNN or YOLO that combines the accuracy of CNNs with clever design and efficiency tricks that greatly speed up the detection process. This will run relatively fast (on a GPU) as long as we have a lot of training data to train the model.

In general, we want to choose the simplest solution that will get the job done with the least amount of training data and not assume that we need the newest, flashiest algorithm. But in this specific case, Mask R-CNN is a reasonable choice despite being fairly flashy and new.

The Mask R-CNN architecture is designed in such a way where it detects objects across the entire image in a computationally efficient manner without using a sliding window approach. In other words, it runs fairly quickly. With a modern GPU, we should be able to detect objects in high-res videos at several frames a second. That should be fine for this project.

In addition, Mask R-CNN gives us lots of information about each detected object. Most object detection algorithms only return the bounding box of each object. But Mask R-CNN will not only give us the location of each object, but it will also give us an object outline (or mask), like this:

To train Mask R-CNN, we need lots of pictures of the kinds objects that we want to detect. We could go outside and take pictures of cars and trace out all the cars in those images, but that would take a couple days of work. Luckily, cars are common objects that lots of people want to detect, so several public datasets of car images already exist.

There’s a popular dataset called COCO (short for Common Objects In Context) that has images annotated with object masks. In this dataset, there are over 12,000 images with cars already outlined. Here’s one of the images in the COCO dataset:

An image in the COCO dataset with the objects already outlined.

This data is perfect for training a Mask R-CNN model.

But wait, things get even better! Since it’s so common to want to build object detection models using the COCO dataset, lots of people have already done it and shared their results. So instead of training our own model, we can start with a pre-trained model that can detect cars out of the box. For this project, we’ll use the great open source Mask R-CNN implementation from Matterport which comes with a pre-trained model.

Side note: Don’t be afraid of training a custom Mask R-CNN object detector! It’s time consuming to annotate the data, but it is not that difficult. If you want to walk through training a custom Mask R-CNN model using your own data, check out my book.

If we run the pre-trained model on our camera image, this is what is detected out of the box: