The “Leg guy” is my personal favorite :)

The reasons why our users want to hide their number plates vary, and we are motivated to secure the data on our site, so it seems natural to build privacy features for our users. One such feature is an anonymous phone number for sellers: e.g., when you sell your car, we create a temporary phone number for you. Callers don’t see your real number, because the temporary number acts as a proxy for incoming calls, and after you’ve sold your car nobody can reach you through it. This means no spam and no extra offers after you’ve closed the deal. Back to the number plates: let’s hide them on car photos to protect user privacy.

Methods overview

To automate the process of number plate detection we can use convolutional neural networks for object detection. There are two main families of architectures for this task: one-shot detectors, e.g. SSD, YOLO, RetinaNet, and two-shot detectors, e.g. the R-CNN series (Faster R-CNN, Mask R-CNN).

Solving an object detection task means predicting the four coordinates of the bounding box that encloses the object of interest.

Object detection models can find many objects of different classes in one photo, which is redundant for us: people usually sell only one car per advertisement. There are exceptions to the “one car in the photo” rule, but those are just accidents, e.g. someone photographs their car in a parking lot and another car with a visible number plate ends up in the frame. Another property of these networks is that by default they output bounding boxes with sides parallel to the coordinate axes. This happens because they rely on predefined bounding boxes of different shapes, called anchor boxes, which are not rotated. That is a disadvantage for our task, since a number plate is usually not parallel to the coordinate axes.

Let’s dig in a little bit. To solve object detection with a two-shot architecture, we first extract a feature matrix from the picture using a convolutional backbone (e.g. resnet34).

Then, sliding a window over the feature matrix, we classify each position: does this bounding box contain an object or not? The candidate boxes are predefined as k different anchor boxes, because objects come in different scales and aspect ratios. At this stage, the classification is accompanied by another task: regression of the four coordinates of the anchor box, which corrects the location of the bounding box.
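To make the sliding-window step concrete, here is a toy numpy sketch of how the k anchor boxes are laid out over the feature matrix. The scales, ratios, and stride below are made up for illustration, not the values of any production model:

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Generate k = len(scales) * len(ratios) axis-aligned anchor boxes
    (x1, y1, x2, y2) centered at every sliding-window position of the
    feature matrix. All sizes here are illustrative assumptions."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Map the feature-matrix cell back to image-pixel coordinates
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = make_anchors(feat_h=4, feat_w=4, stride=16)
print(anchors.shape)  # 4*4 positions, 2 scales * 3 ratios each -> (96, 4)
```

In a real detector each of these anchors gets an objectness score and a coordinate correction; here we only show how their number grows with the feature-matrix size.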

Then there is the second stage with two different heads:

The first one classifies the object. The second one again regresses the coordinates of the bounding box, but this time it is not about coarse location: here we need to tighten the bounding box to increase the ratio of the object area to the bounding box area.

But if you train a two-shot detector, e.g. Mask R-CNN, on a dataset of car images with annotated number plates, you will get something like this:

Do you see the problem? Formally we solved the task, and you can’t see the number plate, but aesthetically this is a failure. How can we make it look better? The easiest way is to also predict the angle by which to rotate the rectangle with the Avito logo. That would be some sort of a solution, but a better way is to predict a rotated bounding box. To do that we need to change the bbox regressor head of the network so that it predicts not only the coordinates but also the rotation angle of the box. The best solution is to predict a transformation matrix for the bbox that warps it to fit the number plate perfectly.
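Predicting an angle on top of the usual (center, size) parametrization means the head’s output has to be decoded into rotated corners. A minimal numpy sketch of that decoding (a toy illustration, not the actual regression head):

```python
import numpy as np

def rbox_corners(cx, cy, w, h, angle_rad):
    """Turn a rotated-box parametrization (center, size, angle) into the
    four corner coordinates. This is the extra decoding step that a
    rotated-bbox head implies compared to an axis-aligned one."""
    # Corners of an axis-aligned box centered at the origin
    base = np.array([[-w / 2, -h / 2], [w / 2, -h / 2],
                     [w / 2,  h / 2], [-w / 2,  h / 2]])
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rot = np.array([[c, -s], [s, c]])      # 2D rotation matrix
    return base @ rot.T + np.array([cx, cy])

corners = rbox_corners(100, 50, 80, 20, np.pi / 6)  # a plate tilted by 30 degrees
print(corners.shape)  # (4, 2)
```

The rotation keeps the box centered at (cx, cy); only the corner positions move, which is exactly what an axis-aligned detector cannot express.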

Besides two-shot detectors like Mask R-CNN, there are one-shot detectors like RetinaNet. It differs from the previous architecture in how it predicts: it outputs the class and the bounding box immediately, without the intermediate region proposal step used in two-shot detectors. For rotated bounding box prediction you need to change the box subnet head of the model.

One example of an architecture for predicting rotated bounding boxes is DRBox. This model does not use a region proposal stage, which makes it a modification of one-shot models. It starts from K predefined rotated bounding boxes (rboxes) and predicts, for each of them, the probability that it contains an object, its coordinates, its size, and an additional rotation angle.

It is possible to modify one of the mentioned architectures and train it with rotated bounding boxes, but do we really need to?

First, we have only two classes: “there is a number plate in the photo” and “there is no number plate in the photo”. What we need is just binary classification; we don’t need any complex functionality for multi-class problems.

Second, there is only one object of interest per picture, so there is no need to detect many objects of the same class in one photo. Why? Because people photograph only their own car when they want to sell it. There are corner cases when a picture taken in a parking lot also shows other cars, but that is rare enough to neglect. That’s why we will use a simple network that predicts the four vertices of the number plate.

Data

There are two steps: get images with cars and number plates, and annotate them. The first task is already solved within Avito’s architecture: we store all the image data from the advertisements on our servers. The second task can be solved with any mechanical turk service; we used Toloka. Our task for annotators:

“There is a photo of a car. You need to highlight the number plate of the car with a quadrangle. The sides of the quadrangle must be as close as possible to the sides of the number plate.”

With the help of Toloka you can annotate many kinds of data: score the quality of SERPs, annotate classes of texts and images, annotate videos, etc. These tasks are completed by Toloka users for a money reward that you set. It is very convenient for annotating a big dataset, but getting high-quality annotations is hard: there are a lot of bots on Toloka whose goal is to collect your money by solving your tasks with some random or more sophisticated strategy.

There are special rules and checks to ban bots from annotating your data. The main ban instrument is control questions: you annotate a small part of the dataset yourself and mix those annotated examples into the Toloka tasks. If an annotator frequently fails on these control questions, you ban them and remove their annotations from the resulting dataset. Such control examples are called “honeypots”.
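The honeypot-based ban rule boils down to tracking each annotator’s accuracy on the control questions. A minimal sketch of such a check (the thresholds here are hypothetical, not the ones we used):

```python
def should_ban(honeypot_results, min_answers=5, min_accuracy=0.7):
    """Decide whether to ban an annotator based on their record on control
    questions ("honeypots"). honeypot_results is a list of booleans, one
    per control question answered. Both thresholds are assumptions."""
    if len(honeypot_results) < min_answers:
        return False  # not enough evidence to judge yet
    accuracy = sum(honeypot_results) / len(honeypot_results)
    return accuracy < min_accuracy

print(should_ban([True, True, False, True, True, True]))  # ~0.83 accuracy -> False
print(should_ban([False, False, True, False, False]))     # 0.2 accuracy  -> True
```

A real platform layers more signals on top (submit speed, Captcha failures), but the core idea is exactly this accuracy threshold.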

For a classification task you can easily measure an annotator’s correctness, but for object detection it’s a bit harder. The classical way is to use IoU (intersection over union): the area of the intersection of the annotated box and the reference box divided by the area of their union.
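For axis-aligned boxes IoU is straightforward to compute; a minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over union for two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)  # zero if the boxes don't overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes -> 1.0
print(iou((0, 0, 10, 10), (20, 20, 30, 30)))  # disjoint boxes  -> 0.0
```

For two arbitrary rotated quadrilaterals the intersection is a general polygon, so the computation is much more involved.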

If the IoU is lower than a threshold, the annotator is banned. However, calculating IoU for two arbitrary quadrilaterals is harder, not least because of rotation. Another complication is that in Toloka the metric has to be computed in JavaScript. That’s why we invented a small hack. Assume that every vertex of our reference annotation has a neighborhood: if the annotator places each of their vertices within the bounds of those neighborhoods, the annotation is adequate; otherwise it is probably a bot, and it must be banned. Another example of a rule is the fast-submit check: a human can’t annotate thousands of images in a second. Obviously, you should also verify your users with a Captcha from time to time.

There are many more rules for preventing bots from annotating your data. If you configure them carefully you can get a decently annotated dataset, though for the best quality you will probably need to hire annotators directly. The best part: our dataset has 4,000 images, we used an overlap of three (three different annotators solved each task), and all of it cost us only $28. Annotating data on Toloka is really cheap if you configure it right.
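The vertex-neighborhood hack can be sketched in a few lines (Python here instead of Toloka’s JavaScript, and the radius is a hypothetical value):

```python
import numpy as np

def annotation_ok(reference, candidate, radius=10.0):
    """Our hack instead of quadrilateral IoU: accept the annotation only if
    every candidate vertex lies within a fixed-radius neighborhood of the
    corresponding reference vertex. Assumes both quadrangles list their
    vertices in the same order; the radius is an illustrative assumption."""
    reference, candidate = np.asarray(reference), np.asarray(candidate)
    distances = np.linalg.norm(reference - candidate, axis=1)
    return bool(np.all(distances <= radius))

ref = [(0, 0), (100, 0), (100, 30), (0, 30)]
good = [(2, 1), (99, -3), (103, 28), (1, 31)]   # every vertex within 10 px
bad = [(2, 1), (99, -3), (150, 80), (1, 31)]    # one vertex far outside
print(annotation_ok(ref, good), annotation_ok(ref, bad))  # True False
```

This avoids polygon intersection entirely, which is what made it easy to port to the annotation platform.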

Model architecture

Let’s create a CNN that predicts the four vertices of the number plate polygon. First we need a feature extraction network, e.g. resnet18; then we add a head for regressing the four vertices. Since the sides of the number plate quadrilateral are not parallel to the coordinate axes, we need all eight coordinates instead of the four that suffice in the classical axis-aligned rectangle case. Second, we add another head for binary classification of the image: is there a number plate in the image or not? We need the second head because sometimes, even in car-selling ads, an image may not contain a visible plate, e.g. it could be a photo of a part of the car.

Our model must ignore such images.
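To make the two-head design concrete, here is a toy numpy forward pass with random weights, just to show the shapes involved (the real model is a fine-tuned resnet18 backbone in MXNet, not these matrices):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# A feature vector as it would come out of a resnet18-style backbone
# after global pooling (512 channels for resnet18).
features = rng.normal(size=512)

# Head 1: regression of 8 numbers, i.e. (x, y) for each of the 4 plate vertices.
w_reg = rng.normal(size=(8, 512)) * 0.01   # random weights, illustration only
vertices = (w_reg @ features).reshape(4, 2)

# Head 2: binary classification, "is there a number plate at all?"
w_cls = rng.normal(size=512) * 0.01
p_plate = sigmoid(w_cls @ features)

print(vertices.shape)          # (4, 2)
print(0.0 <= p_plate <= 1.0)   # True
```

Both heads share the same backbone features; only their small output layers differ, which is what makes joint training cheap.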

The two heads must be trained simultaneously. For this purpose, we add photos without a number plate to the dataset; their regression target is (0,0,0,0,0,0,0,0) and the target for the “is there a number plate” classifier is 0. Now we can create a unified loss function for both heads as a sum of the following losses. For the regression into 8 coordinates, we use the smooth L1 loss.

This loss can be interpreted as a combination of L1 and L2: it acts as L1 when the absolute value of the argument is large and as L2 when it is near zero. For the classification task, we use binary cross-entropy loss. The feature extraction network is resnet18 with weights pretrained on the ImageNet dataset; we fine-tune it with the two new heads on our dataset. In this task we used the MXNet framework, because it is one of the main deep learning frameworks we use at Avito. We are not bound to a single framework, since our production runs on a microservice infrastructure, but we have a large code base in MXNet, so we reuse it.
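The unified loss can be sketched in numpy as follows. The equal weighting of the two terms is an assumption for illustration; the smooth L1 uses the standard form 0.5·x² for |x| < 1 and |x| − 0.5 otherwise:

```python
import numpy as np

def smooth_l1(pred, target):
    """Smooth L1 over all coordinates: L2-like near zero, L1-like far away."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def bce(p, y, eps=1e-7):
    """Binary cross-entropy for a single predicted probability p and label y."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def total_loss(pred_coords, true_coords, pred_prob, true_label):
    """Unified loss: smooth L1 over the 8 vertex coordinates plus BCE for
    the 'is there a plate' head. Equal weighting is an assumption here."""
    return smooth_l1(pred_coords, true_coords) + bce(pred_prob, true_label)

# A negative example: no plate, so the coordinate target is all zeros.
coords = np.zeros(8)
loss = total_loss(np.full(8, 0.1), coords, pred_prob=0.2, true_label=0)
print(round(float(loss), 4))  # 0.2631
```

Note how the all-zeros target for negative images keeps the regression head well-defined even when there is no plate to regress to.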

After we achieved good accuracy on our dataset, we challenged our designers to create a good replacement for the number plate polygon with the Avito logo on it. Once we had a good Avito number plate logo at our disposal, we added functionality to compute the brightness of the original number plate polygon and adjust the logo brightness to match it, and with that we were ready for production.
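The brightness matching can be as simple as averaging the luminance of the pixels inside the plate polygon. A sketch with a toy boolean mask (in practice the mask would be rasterized from the predicted vertices):

```python
import numpy as np

def mean_brightness(image, mask):
    """Mean luminance of the RGB image pixels selected by the boolean mask.
    Uses ITU-R BT.601 luma weights; returns 0.0 for an empty mask."""
    region = image[mask]                              # (n_pixels, 3) RGB values
    luma = region @ np.array([0.299, 0.587, 0.114])   # per-pixel luminance
    return float(luma.mean()) if len(luma) else 0.0

# Toy 4x4 image: dark left half, bright right half.
image = np.zeros((4, 4, 3))
image[:, 2:] = 255.0
mask = np.zeros((4, 4), dtype=bool)
mask[:, 2:] = True  # pretend the plate polygon covers the bright half

print(mean_brightness(image, mask))  # ~255.0 for pure white pixels
```

The resulting value can then be used to scale the logo overlay so it does not look pasted on.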

Production

The problems of reproducing results, support, and further development are practically solved in the world of backend and frontend development, but in machine learning they are still open. There can be many potential problems with ML solutions: e.g. you created your model in a Jupyter notebook with a couple of comments, and it fails to start on another server with a different CUDA/cuDNN/NVIDIA driver/framework version/etc. Such problems can be solved by approaching your ML experiments with more structure.

We solve reproducibility problems with several instruments. First, we use an nvidia-docker environment for our ML experiments and production. We put all the dependencies into the Docker image and use the same library versions for data iterators, augmentations, and inference in both production and experiments. For example, if you need to tweak a model, you pull the repository and start a bootstrap script: on any server it downloads the Docker image with the latest environment for your task and the current production model weights, and you can start experimenting. We have automated the fine-tuning of most models on new data, but sometimes you still need to start a Jupyter notebook for fine-tuning, and that is done through the same process.

The weights of our models are stored in Git LFS, a technology for storing large binary files in Git. Before that we used Artifactory, but it is more convenient to get the right version of the weights when you pull a branch. We have unit tests for our models, so you can’t ship a new production model that fails those tests. Our microservice infrastructure runs on Kubernetes. To deploy a new model we run A/B tests, and the decision about the model’s future is based on the statistics.

The result: we deployed number plate hiding, and the 95th percentile of processing one image with our model is 250 ms.