The goal of this challenge is to advance research on learning knowledge and representations from web data. Web data contain not only huge numbers of images but also rich meta information about those images, which can be exploited to learn good representations and models.

This year, we organize two tasks to evaluate the learned knowledge and representation: (1) WebVision Image Classification Task, and (2) WebVision Video Classification Task.

WebVision Image Classification Task: Overview

The WebVision dataset is composed of training, validation, and test sets. The training set is downloaded from the Web without any human annotation. The validation and test sets are human annotated; the labels of the validation data are provided, but the labels of the test data are withheld. To imitate the setting of learning from web data, participants are required to train their models solely on the training set and submit classification results on the test set. The validation set may only be used to evaluate algorithms during development (see details in Honor Code).

Each submission must provide a list of 5 labels for each image, in descending order of confidence. Recognition accuracy is evaluated based on the label that best matches the ground truth label for the image. Specifically, an algorithm produces a label list \(c_i\), \(i=1,...,5\) for each image, and the ground truth labels of the image are \(y_j\), \( j = 1,..., n \), where \(n\) is the number of ground truth labels. The error of this prediction is defined as: $$E = \frac{1}{n} \sum_{j=1}^n \min_{i} d(c_i, y_j),$$ where \(d(c_i,y_j)\) is 0 if \(c_i=y_j\) and 1 otherwise. Since different concepts have different numbers of test images in the WebVision 2.0 dataset, we calculate the mean error for each concept individually, and the final error is the average of these mean errors across all classes. For this version of the challenge, there is only one ground truth label for each image (i.e., \(n=1\)), so the error for an image is 0 if any of its 5 predicted labels matches the ground truth and 1 otherwise.
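As an illustration of the metric described above, the following is a minimal Python sketch of the class-averaged top-5 error for the single-label case (\(n=1\)). It is not the official evaluation code: the function name `class_averaged_top5_error`, the dictionary-based input format, and the choice to count an image with no submitted prediction as an error are all assumptions made for this example.

```python
from collections import defaultdict


def class_averaged_top5_error(predictions, ground_truth):
    """Class-averaged top-5 error for single-label images (n = 1).

    predictions:  dict mapping image id -> list of labels, most confident first
                  (hypothetical input format, assumed for this sketch).
    ground_truth: dict mapping image id -> the single ground truth label.
    """
    per_class_errors = defaultdict(list)
    for image_id, true_label in ground_truth.items():
        # Only the first 5 labels are considered; a missing prediction is
        # treated as an error (an assumption of this sketch).
        top5 = predictions.get(image_id, [])[:5]
        # min_i d(c_i, y) is 0 if any of the 5 labels matches, 1 otherwise.
        error = 0.0 if true_label in top5 else 1.0
        per_class_errors[true_label].append(error)

    # Mean error per concept first, then the average over all concepts.
    class_means = [sum(errs) / len(errs) for errs in per_class_errors.values()]
    return sum(class_means) / len(class_means)


if __name__ == "__main__":
    ground_truth = {"img_a": 1, "img_b": 1, "img_c": 2}
    predictions = {
        "img_a": [1, 7, 8, 9, 10],   # hit for class 1
        "img_b": [3, 4, 5, 6, 7],    # miss for class 1
        "img_c": [9, 8, 2, 6, 5],    # hit for class 2
    }
    print(class_averaged_top5_error(predictions, ground_truth))
```

In this toy example, class 1 has a mean error of 0.5 and class 2 has a mean error of 0, so the class-averaged error is 0.25, whereas a plain per-image average would give 1/3. The difference shows why averaging per concept matters when concepts have different numbers of test images.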

WebVision Image Classification Task: Benchmarks

This year, to facilitate algorithm development, we provide a benchmark model trained using ResNet-50, which achieves 71.49% top-5 accuracy on the validation set. A development kit, including code for reproducing this benchmark result, is provided at the GitHub link.

Collecting data for large-scale action classification is becoming more and more time consuming. This puts a natural limit on the size of current benchmarks and makes it unlikely that action recognition will ever have ImageNet-scale benchmarks with millions of samples. Additionally, such datasets usually rely on a hand-crafted class vocabulary built from easy-to-search categories, since their authors need to cover many different scenarios and, at the same time, identify unique, distinguishable actions. But techniques developed and fine-tuned on such data do not naturally transfer to applications in the wild. To address this problem, we want to move some steps away from the usual action classification setting and explore the problem of learning actions from real-life videos without human supervision. The WebVision video track runs as part of the Workshop on Visual Understanding by Learning from Web Data. This workshop aims to promote advances in learning state-of-the-art visual models from webly supervised data. We want to transfer this idea to the case of learning action representations from video subtitles without any human supervision.