In this tutorial, we will discuss the various Face Detection methods in OpenCV and Dlib and compare the methods quantitatively. We will share code in C++ and Python for the following Face Detectors :

Haar Cascade Face Detector in OpenCV Deep Learning based Face Detector in OpenCV HoG Face Detector in Dlib Deep Learning based Face Detector in Dlib

We will not go into the theory of any of them and only discuss their usage. We will also share some rules of thumb on which model to prefer according to your application.

Throughout the post, we will assume image size of 300×300.

1. Haar Cascade Face Detector in OpenCV

Haar Cascade based Face Detector was the state-of-the-art in Face Detection for many years since 2001, when it was introduced by Viola and Jones. There has been many improvements in the recent years. OpenCV has many Haar based models which can be found here.

Code

Please download the code from the link below. We have provided code snippets throughout the blog for better understanding. You will find cpp and python files for each face detector along with a separate file which compares all the methods together ( run-all.py and run-all.cpp ). We also share all the models required for running the code.

Download Code

To easily follow along this tutorial, please download code by clicking on the button below. It’s FREE! Download Code To easily follow along this tutorial, please download code by clicking on the button below. It’s FREE!

Python

faceCascade = cv2.CascadeClassifier('./haarcascade_frontalface_default.xml') faces = faceCascade.detectMultiScale(frameGray) for face in faces: x1, y1, w, h = face x2 = x1 + w y2 = y1 + h

C++

faceCascadePath = "./haarcascade_frontalface_default.xml"; faceCascade.load( faceCascadePath ) std::vector<Rect> faces; faceCascade.detectMultiScale(frameGray, faces); for ( size_t i = 0; i < faces.size(); i++ ) { int x1 = faces[i].x; int y1 = faces[i].y; int x2 = faces[i].x + faces[i].width; int y2 = faces[i].y + faces[i].height; }

The above code snippet loads the haar cascade model file and applies it to a grayscale image. the output is a list containing the detected faces. Each member of the list is again a list with 4 elements indicating the (x, y) coordinates of the top-left corner and the width and height of the detected face.

Pros

Works almost real-time on CPU Simple Architecture Detects faces at different scales

Cons

The major drawback of this method is that it gives a lot of False predictions. Doesn’t work on non-frontal images. Doesn’t work under occlusion

2. DNN Face Detector in OpenCV

This model was included in OpenCV from version 3.3. It is based on Single-Shot-Multibox detector and uses ResNet-10 Architecture as backbone. The model was trained using images available from the web, but the source is not disclosed. OpenCV provides 2 models for this face detector.

Floating point 16 version of the original caffe implementation ( 5.4 MB ) 8 bit quantized version using Tensorflow ( 2.7 MB )

We have included both the models along with the code.

Code

Python

DNN = "TF" if DNN == "CAFFE": modelFile = "res10_300x300_ssd_iter_140000_fp16.caffemodel" configFile = "deploy.prototxt" net = cv2.dnn.readNetFromCaffe(configFile, modelFile) else: modelFile = "opencv_face_detector_uint8.pb" configFile = "opencv_face_detector.pbtxt" net = cv2.dnn.readNetFromTensorflow(modelFile, configFile)

C++

const std::string caffeConfigFile = "./deploy.prototxt"; const std::string caffeWeightFile = "./res10_300x300_ssd_iter_140000_fp16.caffemodel"; const std::string tensorflowConfigFile = "./opencv_face_detector.pbtxt"; const std::string tensorflowWeightFile = "./opencv_face_detector_uint8.pb"; #ifdef CAFFE Net net = cv::dnn::readNetFromCaffe(caffeConfigFile, caffeWeightFile); #else Net net = cv::dnn::readNetFromTensorflow(tensorflowWeightFile, tensorflowConfigFile); #endif

We load the required model using the above code. If we want to use floating point model of Caffe, we use the caffemodel and prototxt files. Otherwise, we use the quantized tensorflow model. Also note the difference in the way we read the networks for Caffe and Tensorflow.

Python

blob = cv2.dnn.blobFromImage(frameOpencvDnn, 1.0, (300, 300), [104, 117, 123], False, False) net.setInput(blob) detections = net.forward() bboxes = [] for i in range(detections.shape[2]): confidence = detections[0, 0, i, 2] if confidence > conf_threshold: x1 = int(detections[0, 0, i, 3] * frameWidth) y1 = int(detections[0, 0, i, 4] * frameHeight) x2 = int(detections[0, 0, i, 5] * frameWidth) y2 = int(detections[0, 0, i, 6] * frameHeight)

C++

#ifdef CAFFE cv::Mat inputBlob = cv::dnn::blobFromImage(frameOpenCVDNN, inScaleFactor, cv::Size(inWidth, inHeight), meanVal, false, false); #else cv::Mat inputBlob = cv::dnn::blobFromImage(frameOpenCVDNN, inScaleFactor, cv::Size(inWidth, inHeight), meanVal, true, false); #endif net.setInput(inputBlob, "data"); cv::Mat detection = net.forward("detection_out"); cv::Mat detectionMat(detection.size[2], detection.size[3], CV_32F, detection.ptr<float>()); for(int i = 0; i < detectionMat.rows; i++) { float confidence = detectionMat.at<float>(i, 2); if(confidence > confidenceThreshold) { int x1 = static_cast<int>(detectionMat.at<float>(i, 3) * frameWidth); int y1 = static_cast<int>(detectionMat.at<float>(i, 4) * frameHeight); int x2 = static_cast<int>(detectionMat.at<float>(i, 5) * frameWidth); int y2 = static_cast<int>(detectionMat.at<float>(i, 6) * frameHeight); cv::rectangle(frameOpenCVDNN, cv::Point(x1, y1), cv::Point(x2, y2), cv::Scalar(0, 255, 0),2, 4); } }

In the above code, the image is converted to a blob and passed through the network using the forward() function. The output detections is a 4-D matrix, where

The 3rd dimension iterates over the detected faces. (i is the iterator over the number of faces)

The fourth dimension contains information about the bounding box and score for each face. For example, detections[0,0,0,2] gives the confidence score for the first face, and detections[0,0,0,3:6] give the bounding box.

The output coordinates of the bounding box are normalized between [0,1]. Thus the coordinates should be multiplied by the height and width of the original image to get the correct bounding box on the image.

Pros

The method has the following merits :

Most accurate out of the four methods Runs at real-time on CPU Works for different face orientations – up, down, left, right, side-face etc. Works even under substantial occlusion Detects faces across various scales ( detects big as well as tiny faces )

The DNN based detector overcomes all the drawbacks of Haar cascade based detector, without compromising on any benefit provided by Haar. We could not see any major drawback for this method except that it is slower than the Dlib HoG based Face Detector discussed next.

It would be safe to say that it is time to bid farewell to Haar-based face detector and DNN based Face Detector should be the preferred choice in OpenCV.

3. HoG Face Detector in Dlib

This is a widely used face detection model, based on HoG features and SVM. You can read more about HoG in our post. The model is built out of 5 HOG filters – front looking, left looking, right looking, front looking but rotated left, and a front looking but rotated right. The model comes embedded in the header file itself.

The dataset used for training, consists of 2825 images which are obtained from LFW dataset and manually annotated by Davis King, the author of Dlib. It can be downloaded from here.

Code

Python

hogFaceDetector = dlib.get_frontal_face_detector() faceRects = hogFaceDetector(frameDlibHogSmall, 0) for faceRect in faceRects: x1 = faceRect.left() y1 = faceRect.top() x2 = faceRect.right() y2 = faceRect.bottom()

C++

frontal_face_detector hogFaceDetector = get_frontal_face_detector(); // Convert OpenCV image format to Dlib's image format cv_image<bgr_pixel> dlibIm(frameDlibHogSmall); // Detect faces in the image std::vector<dlib::rectangle> faceRects = hogFaceDetector(dlibIm); for ( size_t i = 0; i < faceRects.size(); i++ ) { int x1 = faceRects[i].left(); int y1 = faceRects[i].top(); int x2 = faceRects[i].right(); int y2 = faceRects[i].bottom(); cv::rectangle(frameDlibHog, Point(x1, y1), Point(x2, y2), Scalar(0,255,0), (int)(frameHeight/150.0), 4); }

In the above code, we first load the face detector. Then we pass it the image through the detector. The second argument is the number of times we want to upscale the image. The more you upscale, the better are the chances of detecting smaller faces. However, upscaling the image will have substantial impact on the computation speed. The output is in the form of a list of faces with the (x, y) coordinates of the diagonal corners.

Pros

Fastest method on CPU Works very well for frontal and slightly non-frontal faces Light-weight model as compared to the other three. Works under small occlusion

Basically, this method works under most cases except a few as discussed below.

Cons

The major drawback is that it does not detect small faces as it is trained for minimum face size of 80×80. Thus, you need to make sure that the face size should be more than that in your application. You can however, train your own face detector for smaller sized faces. The bounding box often excludes part of forehead and even part of chin sometimes. Does not work very well under substantial occlusion Does not work for side face and extreme non-frontal faces, like looking down or up.

4. CNN Face Detector in Dlib

This method uses a Maximum-Margin Object Detector ( MMOD ) with CNN based features. The training process for this method is very simple and you don’t need a large amount of data to train a custom object detector. For more information on training, visit the website.

The model can be downloaded from the dlib-models repository.

It uses a dataset manually labeled by its Author, Davis King, consisting of images from various datasets like ImageNet, PASCAL VOC, VGG, WIDER, Face Scrub. It contains 7220 images. The dataset can be downloaded from here

Code

Python

dnnFaceDetector = dlib.cnn_face_detection_model_v1("./mmod_human_face_detector.dat") faceRects = dnnFaceDetector(frameDlibHogSmall, 0) for faceRect in faceRects: x1 = faceRect.rect.left() y1 = faceRect.rect.top() x2 = faceRect.rect.right() y2 = faceRect.rect.bottom()

C++

String mmodModelPath = "./mmod_human_face_detector.dat"; net_type mmodFaceDetector; deserialize(mmodModelPath) >> mmodFaceDetector; // Convert OpenCV image format to Dlib's image format cv_image<bgr_pixel> dlibIm(frameDlibMmodSmall); matrix<rgb_pixel> dlibMatrix; assign_image(dlibMatrix, dlibIm); // Detect faces in the image std::vector<dlib::mmod_rect> faceRects = mmodFaceDetector(dlibMatrix); for ( size_t i = 0; i < faceRects.size(); i++ ) { int x1 = faceRects[i].rect.left(); int y1 = faceRects[i].rect.top(); int x2 = faceRects[i].rect.right(); int y2 = faceRects[i].rect.bottom(); cv::rectangle(frameDlibMmod, Point(x1, y1), Point(x2, y2), Scalar(0,255,0), (int)(frameHeight/150.0), 4); }

The code is similar to the HoG detector except that in this case, we load the cnn face detection model. Also, the coordinates are present inside a rect object.

Pros

Works for different face orientations Robust to occlusion Works very fast on GPU Very easy training process

Cons

Very slow on CPU Does not detect small faces as it is trained for minimum face size of 80×80. Thus, you need to make sure that the face size should be more than that in your application. You can however, train your own face detector for smaller sized faces. The bounding box is even smaller than the HoG detector.

5. Accuracy Comparison

I tried to evaluate the 4 models using the FDDB dataset using the script used for evaluating the OpenCV-DNN model. However, I found surprising results. Dlib had worse numbers than Haar, although visually dlib outputs look much better. Given below are the Precision scores for the 4 methods.



Where,

AP_50 = Precision when overlap between Ground Truth and predicted bounding box is at least 50% ( IoU = 50% )

AP_75 = Precision when overlap between Ground Truth and predicted bounding box is at least 75% ( IoU = 75% )

AP_Small = Average Precision for small size faces ( Average of IoU = 50% to 95% )

AP_medium = Average Precision for medium size faces ( Average of IoU = 50% to 95% )

AP_Large = Average Precision for large size faces ( Average of IoU = 50% to 95% )

mAP = Average precision across different IoU ( Average of IoU = 50% to 95% )

On closer inspection I found that this evaluation is not fair for Dlib.

5.1. Evaluating accuracy the wrong way!

According to my analysis, the reasons for lower numbers for dlib are as follows :

The major reason is that dlib was trained using standard datasets BUT, without their annotations. The images were annotated by its author. Thus, I found that even when the faces are detected, the bounding boxes are quite different than that of Haar or OpenCV-DNN. They were smaller and often clipped parts of forehead and chin as shown below.



This can be further explained from the AP_50 and AP_75 scores in the above graph. AP_X means precision when there is X% overlap between ground truth and detected boxes. The AP_75 scores for dlib models are 0 although AP_50 scores are higher than that of Haar. This only means that the Dlib models are able to detect more faces than that of Haar, but the smaller bounding boxes of dlib lower their AP_75 and other numbers. The second reason is that dlib is unable to detect small faces which further drags down the numbers.

Thus, the only relevant metric for a fair comparison between OpenCV and Dlib is AP_50 ( or even less than 50 since we are mostly comparing the number of detected faces ). However, this point should always be kept in mind while using the Dlib Face detectors.

6. Speed Comparison

We used a 300×300 image for the comparison of the methods. The MMOD detector can be run on a GPU, but the support for NVIDIA GPUs in OpenCV is still not there. So, we evaluate the methods on CPU only and also report result for MMOD on GPU as well as CPU.

Hardware used

Processor : Intel Core i7 6850K – 6 Core

RAM : 32 GB

GPU : NVIDIA GTX 1080 Ti with 11 GB RAM

OS : Linux 16.04 LTS

Programming Language : Python

We run each method 10000 times on the given image and take 10 such iterations and average the time taken. Given below are the results.

As you can see that for the image of this size, all the methods perform in real-time, except MMOD. MMOD detector is very fast on a GPU but is very slow on a CPU.

It should also be noted that these numbers can be different on different systems.

7. Comparison under various conditions

Apart from accuracy and speed, there are some other factors which help us decide which one to use. In this section we will compare the methods on the basis of various other factors which are also important.

7.1. Detection across scale

We will see an example where, in the same video, the person goes back n forth, thus making the face smaller and bigger. We notice that the OpenCV DNN detects all the faces while Dlib detects only those faces which are bigger in size. We also show the size of the detected face along with the bounding box.

It can be seen that dlib based methods are able to detect faces of size upto ~(70×70) after which they fail to detect. As we discussed earlier, I think this is the major drawback of Dlib based methods. Since it is not possible to know the size of the face before-hand in most cases. We can get rid of this problem by upscaling the image, but then the speed advantage of dlib as compared to OpenCV-DNN goes away.

7.2. Non-frontal Face

Non-frontal can be looking towards right, left, up, down. Again, to be fair with dlib, we make sure the face size is more than 80×80. Given below are some examples.

As expected, Haar based detector fails totally. HoG based detector does detect faces for left or right looking faces ( since it was trained on them ) but not as accurately as the DNN based detectors of OpenCV and Dlib.

7.3. Occlusion

Let us see how well the methods perform under occlusion.

Again, the DNN methods outperform the other two, with OpenCV-DNN slightly better than Dlib-MMOD. This is mainly because the CNN features are much more robust than HoG or Haar features.

8. Conclusion

We had discussed the pros and cons of each method in the respective sections. I recommend to try both OpenCV-DNN and HoG methods for your application and decide accordingly. We share some tips to get started.

General Case

In most applications, we won’t know the size of the face in the image before-hand. Thus, it is better to use OpenCV – DNN method as it is pretty fast and very accurate, even for small sized faces. It also detects faces at various angles. We recommend to use OpenCV-DNN in most

For medium to large image sizes

Dlib HoG is the fastest method on CPU. But it does not detect small sized faces ( < 70x70 ). So, if you know that your application will not be dealing with very small sized faces ( for example a selfie app ), then HoG based Face detector is a better option. Also, If you can use a GPU, then MMOD face detector is the best option as it is very fast on GPU and also provides detection at various angles.

High resolution images

Since feeding high resolution images is not possible to these algorithms ( for computation speed ), HoG / MMOD detectors might fail when you scale down the image. On the other hand, OpenCV-DNN method can be used for these since it detects small faces.

Have any other suggestions? Please mention in the comments and we’ll update the post with them!

Subscribe & Download Code

If you liked this article and would like to download code (C++ and Python) and example images used in this post, please subscribe to our newsletter. You will also receive a free Computer Vision Resource Guide. In our newsletter, we share OpenCV tutorials and examples written in C++/Python, and Computer Vision and Machine Learning algorithms and news.

Subscribe Now

References

[FDDB Comparison code]

[Dlib Blog]

[dlib mmod python example]

[dlib mmod cpp example]

[OpenCV DNN Face detector]

[Haar Based Face Detector]