Computer vision is an exciting discipline in computer science. Research has been concerned with the topic for decades, but only with the latest developments in big data and artificial intelligence has it become possible to create stunning new applications. Processing is getting faster and cheaper thanks to cloud technologies and new GPUs. With pay-as-you-go pricing models, you can enter the arena without the risk of major upfront investments. Small embedded systems like the NVIDIA Jetson make it possible to build innovative, mobile, smart devices that combine high processing power with low power consumption.

Millions of years ago, the so-called Cambrian explosion happened: in a relatively short period of time, the biodiversity on earth "exploded." Some researchers think that one of the reasons for this was the development of eyesight, and that computers are on a similar path today. But compared to evolution, the progress of computer vision capabilities has been happening much, much faster.

Cars, robots, and drones are starting to understand what we see in pictures and videos. Computer vision as an interface between machines and humans will gain much more importance within the next few years.

We started a small project to develop an interactive drone with computer vision capabilities. I will use this project as an example to demonstrate some basic computer vision concepts.

This is the demo video from our drone project:

In the following article, I will try to explain step-by-step how we implemented the person detector with simple algorithms.

Getting Started

There are various frameworks for computer vision. The most popular is OpenCV. I would also recommend taking a look at dlib.

According to OpenCV:

"OpenCV is released under a BSD license and hence it’s free for both academic and commercial use. It has C++, C, Python and Java interfaces and supports Windows, Linux, Mac OS, iOS and Android. OpenCV was designed for computational efficiency and with a strong focus on real-time applications. Written in optimized C/C++, the library can take advantage of multi-core processing. Enabled with OpenCL, it can take advantage of the hardware acceleration of the underlying heterogeneous compute platform."

Depending on your preferences and previous knowledge, you can develop on various platforms. To get a simple start, I would recommend setting up a development environment on Ubuntu 16.04 with Python 3.x and OpenCV 3.x. I use a virtual machine on my MacBook based on VMware Workstation. (The integration of external hardware sometimes works best compared to other virtualization solutions.) You can also get those components to work on other operating systems, but in that case, advanced "version-and-dependency-conflict-fumbling" knowledge is often required.

dlib is not as powerful as OpenCV. Nevertheless, some of its functions are worth a closer look — for example, the facial landmark detector and the correlation tracker. The correlation tracker can be seen in action in this example from our drone:

GPU or CPU?

Some algorithms are based on CUDA for using the GPU. For that, you need a graphics card from NVIDIA. If you don’t have one, you can rent a GPU instance on AWS or get a developer board (like the NVIDIA Jetson TX). It is not necessary for a first start, but more advanced algorithms (neural networks, deep learning, etc.) run much faster with hardware acceleration. In this area, it is often not the wisest choice to always use the latest and greatest version. Maybe you will need an older Ubuntu version and not the most current Linux kernel to be able to compile all drivers and dependencies. In the AWS marketplace, you can find GPU instances on which OpenCV, Python, CUDA, and the like are already pre-installed and ready to run (based on Ubuntu 14.04 as of May 2017).

Installation of OpenCV With Python Wrappers

Disclaimer: On the internet, you can find many tutorials on how to install OpenCV. I'm not trying to reinvent the wheel here. Just set up an Ubuntu VM and then follow one of those tutorials step-by-step.

OpenCV is written in C/C++, but starting with the Python wrappers was much easier for me. It depends on your previous knowledge, but in my case, it got me to working prototypes much faster. The differences in performance are not noticeable for most use cases.

Computer Vision Basics

The progress in computer vision primarily happens with the help of neural networks and deep learning. But to get started in this area, you should cover the basics first.

This short video explains the basics and shows some examples of code. You can also see how we created the first part of the simple object detector from the drone video:

Images Are Multidimensional Arrays

An image is represented as a multidimensional array. In Python, it is a NumPy array (numpy.ndarray), while in C++, it is a cv::Mat. The coordinates (0, 0) are in the upper-left corner. When you have a colored image, there will be three color values for each coordinate. Depending on the resolution and color depth, those arrays can vary in size. The color values go from 0 to 255. Note that with OpenCV, you first specify the Y and then the X coordinate (which is often confusing).
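To make this layout concrete, here is a small sketch using a synthetic NumPy array instead of a loaded image (the dimensions are arbitrary example values):

```python
import numpy as np

# create a synthetic 100x200 BGR image (height comes first!), all black
image = np.zeros((100, 200, 3), dtype=np.uint8)

# shape is (rows, columns, channels) = (height, width, channels)
print(image.shape)  # (100, 200, 3)

# set the pixel at y=10, x=20 to pure red in BGR order
image[10, 20] = (0, 0, 255)

# read it back: three values in the range 0..255
b, g, r = image[10, 20]
print(b, g, r)  # 0 0 255
```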

The following code is reading an image file and executes some basic operations on pixel level:

import cv2

# read image from HD
image = cv2.imread("test.png")

# read color values at position y, x
y = 100
x = 50
(b, g, r) = image[y, x]

# print color values to screen
print(b, g, r)

# set pixel color to RED in BGR color scheme
image[y, x] = (0, 0, 255)

# choose region of interest at (x, y) with dimension 50x50 pixel
region_of_interest = image[y:y+50, x:x+50]

# show image on screen
cv2.imshow("Bild", image)

# show region of interest in separate window
cv2.imshow("ROI", region_of_interest)

# set all ROI pixels to green
region_of_interest[:, :] = (0, 255, 0)

# now show modified image; note that ROI is a "pointer" to the original image
cv2.imshow("Modified Image", image)

# wait for a key - important, otherwise nothing would be visible
cv2.waitKey(0)

Color Space

The default color space in OpenCV is BGR (Blue Green Red). The same model is more commonly known as RGB, which regularly causes confusion about the channel order in OpenCV. (But, of course, there is a good reason for this: “It was this way and will, therefore, stay this way.”)
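Because the pixel data is just a NumPy array, swapping between BGR and RGB is a one-liner; this sketch simply reverses the channel axis (OpenCV also provides cv2.cvtColor with cv2.COLOR_BGR2RGB for the same purpose):

```python
import numpy as np

# a single-pixel "image" holding pure red in BGR order
bgr = np.array([[[0, 0, 255]]], dtype=np.uint8)

# reversing the last axis turns BGR into RGB (and vice versa)
rgb = bgr[:, :, ::-1]

print(tuple(rgb[0, 0]))  # (255, 0, 0)
```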

Depending on the color space you are working in, there are advantages and disadvantages for your application. The HSV color space, for example, is easier to handle if you are filtering for specific color ranges. If you want to filter everything that is "kind of orange" in a BGR color space, that's not so easy. HSV is also not as heavily affected by changes in lighting. If you convert an image to grayscale, it will only have one color channel. This makes sense if you want to reduce the amount of data or optimize processing time.
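To see what a grayscale conversion actually does with the data, here is a hand-rolled sketch of the usual luminance formula (OpenCV's cv2.COLOR_BGR2GRAY uses the same BT.601 weights: 0.299 R + 0.587 G + 0.114 B; the tiny test image is made up for illustration):

```python
import numpy as np

# a tiny 1x2 BGR image: one blue pixel, one white pixel
image = np.array([[[255, 0, 0], [255, 255, 255]]], dtype=np.uint8)

# weighted sum over the channel axis; note the BGR order of the weights
weights = np.array([0.114, 0.587, 0.299])
gray = (image @ weights).round().astype(np.uint8)

print(gray.tolist())  # [[29, 255]]
```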

Here is another small example:

import cv2

# init webcam
cam = cv2.VideoCapture(0)

# read frame from webcam
ret, image = cam.read()

# convert image to grayscale
image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# show image in separate window
cv2.imshow("Bild modifiziert", image)

# wait for key press
cv2.waitKey(0)

Common Algorithms and Methods

With computer vision, you sometimes have to think outside the box to implement more complex functions. The computer doesn't really understand what's shown on an image — it just sees digits representing color values. Here are a few methods that are indispensable as basic tools for a computer visionary!

Thresholding

Thresholding is often used to filter specific areas of an image with specific (color) properties. One thresholding method is binary thresholding in which you define a threshold and receive a black and white image as output. All pixels that exceed the threshold are white while the other pixels are black. With this method, you can, for example, search all orange pixels in an image (like the marker in our demo video).

Threshold masks are often the basis for further analytics.
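In OpenCV, binary thresholding is available as cv2.threshold(...) with cv2.THRESH_BINARY; as a sketch of what happens underneath, here is the same operation expressed directly in NumPy (the threshold value and the tiny input are made-up examples):

```python
import numpy as np

# a small grayscale "image" with values around a threshold of 128
gray = np.array([[10, 200], [128, 129]], dtype=np.uint8)

# binary thresholding: everything above the threshold becomes white (255),
# everything else becomes black (0)
threshold = 128
mask = np.where(gray > threshold, 255, 0).astype(np.uint8)

print(mask.tolist())  # [[0, 255], [0, 255]]
```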

Here’s the code to the video:

## check color ranges of tennis ball
import cv2

# init webcam
cam = cv2.VideoCapture(0)

# define region of interest
x, y, w, h = 400, 400, 100, 100

# show webcam stream
while cam.isOpened():
    # read frame from cam
    ret, frame = cam.read()

    # convert frame to HSV color scheme
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # draw rectangle in frame
    cv2.rectangle(frame, (x, y), (x+w, y+h), (255, 255, 255), thickness=1)

    # print color values to screen
    cv2.putText(frame, "HSV: {0}".format(frame[y+1, x+1]), (x, 600),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (255, 255, 255), thickness=2)

    # show frame
    cv2.imshow("frame", frame)

    # wait for key press
    key = cv2.waitKey(1) & 0xff

    # if ESC, exit
    if key == 27:
        break

With the color values, we filter by range:

import cv2

# init webcam
cam = cv2.VideoCapture(0)

# define color ranges
lower_yellow = (18, 100, 210)
upper_yellow = (40, 160, 245)

# show webcam stream
while cam.isOpened():
    # read frame from webcam
    ret, frame = cam.read()

    # convert frame to HSV
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # filter image for color ranges
    mask = cv2.inRange(frame, lower_yellow, upper_yellow)

    # show mask
    cv2.imshow("threshold", mask)

    # wait for key
    key = cv2.waitKey(1) & 0xff

    # if ESC, exit
    if key == 27:
        break

Finding Contours

For black and white images, there are efficient algorithms to find contours in them. They recognize contiguous pixels and group them together into blobs. Additionally, you can use certain properties of these contours for further analysis — for example, the area or the edge of a contour — and you can request a bounding box in return. We use this in our demo video to find the position of the orange marker. We only search for contours with a certain minimum area (this way, we can filter out single “noisy” pixels that also fall in the orange range).

Here we are trying to find the tennis ball in the image and we are filtering the noise pixels out of it:

import cv2

# init webcam
cam = cv2.VideoCapture(0)

# define color ranges
lower_yellow = (18, 100, 210)
upper_yellow = (40, 160, 245)

# show webcam stream
while cam.isOpened():
    # read frame from cam
    ret, frame = cam.read()

    # convert frame to HSV
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)

    # filter for color ranges
    mask = cv2.inRange(frame, lower_yellow, upper_yellow)

    # find contours on mask of "tennis-ball" pixels
    # (note: OpenCV 3.x returns three values here; other versions return two)
    _, contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

    # now find the largest contour, which is most likely the tennis ball
    # for this, we use the area of the contour
    if len(contours) > 0:
        tennis_ball = max(contours, key=cv2.contourArea)

        # draw bounding box around tennis ball
        x, y, w, h = cv2.boundingRect(tennis_ball)
        cv2.rectangle(frame, (x, y), (x+w, y+h), (0, 255, 0), thickness=3)

    # show frame
    cv2.imshow("frame", frame)

    # wait for key
    key = cv2.waitKey(1) & 0xff

    # if ESC, exit
    if key == 27:
        break

Background Subtraction

When you have a static camera, there are various (relatively easy) methods to detect motion in an image. Basically, you assume that everything that is not moving is the background. To put it simply, you subtract the pixel color values of the current frame from the ones of the previous frame. When there is no change, you receive 0 as a result (no motion). But this model is too simple to use in practice because it's too easily affected by minor changes in light and environmental influences (like wind). Many different algorithms have been developed over the past decades, and each has its own advantages and disadvantages. There is no single algorithm that fits all and works in all situations. But you can find a comprehensive list of known algorithms here.
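The naive subtraction model described above can be sketched in a few lines (with real frames, you would typically use cv2.absdiff on consecutive grayscale images; the threshold of 30 and the tiny frames here are arbitrary example values):

```python
import numpy as np

# two consecutive "frames": a dark scene in which one pixel brightens
previous = np.array([[10, 10], [10, 10]], dtype=np.uint8)
current = np.array([[10, 10], [10, 200]], dtype=np.uint8)

# absolute difference per pixel; cast to int first to avoid uint8 wrap-around
diff = np.abs(current.astype(int) - previous.astype(int))

# pixels that changed by more than a (hand-picked) threshold count as motion
motion_mask = diff > 30

print(motion_mask.tolist())  # [[False, False], [False, True]]
```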

A very commonly used algorithm works with the Gaussian Mixture Model (GMM), called MOG2 in OpenCV. A newer algorithm, for example, is SubSENSE.

Here is a little demo video:

Detectors

With OpenCV or dlib, you get various "standard" detectors. A program that can detect faces within webcam streams can be hacked together with a few lines of Python code. Are such programs suitable for practical use? I don’t think so. Those standard detectors have high error rates (lots of false positives/negatives). Traditionally, you would use the following methods to find faces in an image.

Haar Cascade Classifier

Haar cascade classifiers go back to a paper by Viola and Jones from 2001. The algorithm is relatively fast; you can run it in real time with a small resolution and reduced framerate on a Raspberry Pi. OpenCV already ships with some of these pre-trained Haar cascade classifiers to recognize faces of people or cats... but I must mention that this classifier also regularly recognizes the backrest of my chair as a face.

HOG Detectors

With histogram of oriented gradients (HOG) detectors, the image is split into a grid of cells. For each cell of the grid, the dominant gradient directions are computed and collected into histograms, which together form a numerical feature vector that can be compared. The processing power needed for all these calculations is much higher compared to Haar cascade classifiers, but so is the accuracy.
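As a toy illustration of the idea (not the full HOG pipeline with cells, blocks, and normalization — the patch and bin count are my own example values), this sketch computes gradient orientations for one image patch and bins them into a magnitude-weighted histogram:

```python
import numpy as np

# a patch with a single vertical edge: dark left half, bright right half
patch = np.zeros((8, 8), dtype=float)
patch[:, 4:] = 255.0

# simple central-difference gradients in x and y
gx = np.zeros_like(patch)
gy = np.zeros_like(patch)
gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
gy[1:-1, :] = patch[2:, :] - patch[:-2, :]

# orientation (in degrees, 0..180) and magnitude per pixel
orientation = np.rad2deg(np.arctan2(gy, gx)) % 180
magnitude = np.hypot(gx, gy)

# magnitude-weighted histogram with 9 orientation bins, as in classic HOG
hist, _ = np.histogram(orientation, bins=9, range=(0, 180), weights=magnitude)

# a vertical edge produces horizontal gradients, so bin 0 (~0 degrees) dominates
print(int(np.argmax(hist)))  # 0
```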

You can think of a simplified visualization of a HOG feature vector as follows. The dominant gradients in each box are easily recognizable as a face:

  ______
 / _  _ \
 |   /  |
 |  __  |
 \ ____ /

In our drone demo video, we are using a Haar cascade classifier that only searches for faces near the detected orange marker. This way, we have a fast real-time detector and can reduce the false positives to get a good overall result.

Here is the part of the video with the face detector near the markers:

With Deep Learning, the Future Is Now

Finally, I want to give a brief outlook on the area of deep learning. Many researchers are working in this field, and you can find impressive demos and algorithms on YouTube. One of those is YOLO. I don’t mean the buzzword of 2012 — it stands for “you only look once.” Behind it is a convolutional neural network that can recognize different classes of objects in real time (depending on the hardware).

We tried this algorithm and let our drone fly through our office. We rented a GPU instance on AWS and installed YOLO there. We started a pre-configured TensorFlow image from the Amazon Marketplace and followed this manual. You can, of course, try to set up an instance all by yourself, but this is not a trivial task. This article can help you get started. In this experiment, we clearly recognized how much difference a GPU makes — the algorithm needed 15-20 seconds per frame on CPU, and with GPU support, this dropped down to 6 ms!

You can see the result from our drone flight here:

Further interesting examples are:

Anyone who's interested in learning more about this field should read the book Deep Learning by Ian Goodfellow et al. According to this book, the size of artificial neural networks doubles roughly every 2.4 years. In 2015, the number of neurons in large networks like GoogLeNet was somewhere between the brain of a bee and the brain of a frog. Also, some specialized ANNs were already superior to humans at certain tasks. If the development continues at this pace, the biggest neural networks are expected to reach the size of the human brain around 2056.
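The 2056 estimate can be roughly reproduced with a back-of-the-envelope calculation. The starting values below are my own assumptions: a 2015 network with bee-scale neuron counts (on the order of 10^6) and a human brain on the order of 10^11 neurons; only the 2.4-year doubling time is quoted from the book:

```python
import math

neurons_2015 = 1e6      # assumed: roughly bee-brain scale in 2015
neurons_human = 1e11    # assumed: order of magnitude of the human brain
doubling_years = 2.4    # doubling time quoted from the book

# number of doublings needed, then converted to calendar years
doublings = math.log2(neurons_human / neurons_2015)
year = 2015 + doublings * doubling_years

print(round(year))  # 2055 - close to the book's extrapolation of 2056
```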