Using Deep Learning to automatically rank millions of hotel images

At idealo.de we trained two Deep Neural Networks to assess the aesthetic and technical quality of images 🙂😐🙁

Aesthetic visualisations from our trained model (low to high aesthetics from left to right)

At idealo.de (the leading price comparison website in Europe and one of the largest portals in the German e-commerce market) we provide one of the best hotel price comparisons available on the market. For each hotel we receive dozens of images and face the challenge of choosing the most “attractive” image for each offer on our offer comparison pages, as photos can be just as important for bookings as reviews. Given that we have millions of hotel offers, we end up with more than 100 million images for which we need an “attractiveness” assessment.

We addressed the need to automatically assess image quality by implementing an aesthetic and technical image quality classifier based on Google’s research paper “NIMA: Neural Image Assessment”. NIMA consists of two Convolutional Neural Networks (CNNs) that predict the aesthetic and technical quality of images, respectively. The models are trained via transfer learning, where ImageNet pre-trained CNNs are fine-tuned for each quality classification task.

In this article, we will present our training approach and insights that we’ve gained throughout the process. We will then try to shed some light on what the trained models actually learned by visualising the convolutional filter weights and output nodes of our trained models.

We’ve published the trained models and code on GitHub. The provided code allows one to use any of the pre-trained CNNs in Keras, so we are looking forward to contributions that explore other CNNs for image quality assessments 😃

Training

The aesthetic and technical classifiers were trained in a transfer learning setup. We used the MobileNet architecture with ImageNet weights, and replaced the last dense layer in MobileNet with a dense layer that outputs 10 classes (scores 1 to 10).
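In Keras, a model along these lines might be assembled as follows (a minimal sketch; the dropout rate of 0.75 follows the NIMA paper, and the exact setup in our repo may differ):

```python
from tensorflow.keras.applications import MobileNet
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Model

def build_nima_model(weights="imagenet"):
    # MobileNet base without its original 1000-class ImageNet head,
    # global average pooling gives a single feature vector per image
    base = MobileNet(input_shape=(224, 224, 3), weights=weights,
                     include_top=False, pooling="avg")
    x = Dropout(0.75)(base.output)
    # New head: softmax over the 10 score classes (scores 1 to 10)
    out = Dense(10, activation="softmax")(x)
    return Model(inputs=base.input, outputs=out)
```

Because the head is a softmax over ten classes, the model predicts a full distribution of ratings rather than a single score; the mean of that distribution can then serve as the image's quality score.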

Earth Mover’s Loss

A special feature of NIMA is the use of the Earth Mover’s Loss (EML) as the loss function, in contrast to the Categorical Cross Entropy (CCE) loss that is generally applied in Deep Learning classification tasks. The EML can be understood as the amount of “earth” that needs to be moved to make two probability distributions equal. A useful attribute of this loss function is that it captures the inherent order of the classes. For our image quality ratings, the scores 4, 5, and 6 are more closely related than 1, 5, and 10, i.e. we would like to punish a prediction of 4 more when the true score is 10 than when the true score is 5. CCE does not capture this relationship, and it is often not required in object classification tasks (e.g. misclassifying a tree as a dog is as bad as classifying it as a cat).
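For ordered classes, the EML can be computed from the cumulative distributions of the true and predicted ratings. A NumPy sketch (the NIMA paper uses the squared variant, r = 2):

```python
import numpy as np

def earth_movers_distance(y_true, y_pred, r=2):
    """Normalized Earth Mover's Distance between two rating
    distributions over ordered score classes (r=2 as in NIMA)."""
    cdf_true = np.cumsum(y_true, axis=-1)
    cdf_pred = np.cumsum(y_pred, axis=-1)
    # Average the r-th power of the CDF differences, then take the r-th root
    return np.mean(np.abs(cdf_true - cdf_pred) ** r, axis=-1) ** (1.0 / r)
```

With one-hot rating distributions this reproduces the ordering property described above: predicting score 4 when the true score is 10 costs more than predicting 4 when the true score is 5, whereas CCE would penalize both mistakes equally.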

In order to use the EML we need, for each image, a distribution of ratings across all ten score classes. For the AVA dataset, which is used to train the aesthetic classifier, these distribution labels are available. For the TID2013 dataset, used for the technical classifier, we inferred the distribution from the mean score given for each image. For more details on our distribution inference check out our GitHub repo.
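One simple way to turn a mean score into a distribution is to place a Gaussian bump at the mean and normalize it over the discrete score classes. Note this is an illustrative sketch only, with an assumed standard deviation; it is not necessarily the exact inference method used in our repo:

```python
import numpy as np

def infer_distribution(mean_score, std=1.5, n_classes=10):
    """Illustrative: spread a mean score over the classes 1..10
    using a discretized, normalized Gaussian (std is an assumption)."""
    scores = np.arange(1, n_classes + 1)
    probs = np.exp(-0.5 * ((scores - mean_score) / std) ** 2)
    return probs / probs.sum()
```

The result is a valid probability distribution peaked at the class nearest the mean score, which can then be fed to the Earth Mover's Loss just like the AVA rating distributions.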

Fine-tuning stages

We train the models in a two-stage process: