Transfer learning for violence detection in images

Kalliatakis et al. (2017)[4] have compiled the Human Rights Understanding (HRUN) data set. This collection of 400 manually labeled image files includes photos of child soldiers and violent interactions between police officers and civilians.

A data set this small can only be processed effectively with the help of transfer learning.

To this end, the authors compare the performance of 10 established convolutional neural networks. The models are either pre-trained on the ImageNet data set or — in the case of the 8-layer Places architecture — optimized for a collection of 10 million scene photographs[5]. To use the models as feature extractors, the layers generating the predictions are removed.

The second component of the model is a linear SVM classifier that is trained on the HRUN data set and accepts the extracted features as input.
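The two-stage pipeline lends itself to a compact sketch. The snippet below replaces the CNN features with random, linearly separable vectors (the real pipeline would feed in activations from the truncated network) and trains a minimal linear SVM by subgradient descent on the regularized hinge loss; the cluster parameters, dimensions, and learning rate are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the CNN features: two linearly separable clusters.
X = np.vstack([rng.normal(+1.0, 0.5, (50, 10)),
               rng.normal(-1.0, 0.5, (50, 10))])
y = np.hstack([np.ones(50), -np.ones(50)])     # +1 = violation, -1 = benign

# Linear SVM trained by subgradient descent on the regularized hinge loss.
w, b = np.zeros(10), 0.0
lam, lr = 1e-3, 0.1                            # regularization, learning rate
for _ in range(200):
    for i in rng.permutation(len(X)):
        if y[i] * (X[i] @ w + b) < 1:          # sample violates the margin
            w += lr * (y[i] * X[i] - lam * w)
            b += lr * y[i]
        else:                                  # outside the margin: only shrink w
            w -= lr * lam * w

accuracy = np.mean(np.sign(X @ w + b) == y)
```

On separable toy data like this, the classifier reaches near-perfect training accuracy; the point is only to show how cheap the second stage is once features exist.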

Human rights violations recognition pipeline in Kalliatakis et al. (2017)

Using a 50/50 split for training and test images, the authors report excellent results. The transfer learning approach reached an average precision of 90% for the child soldiers category and close to 96% for violent interactions between the police and civilians. Interestingly, the best results were achieved with the Places architecture.

Detection of violent videos

Videos, of course, are sequences of images. While most state-of-the-art image classification systems use convolutional layers in one form or another, sequential data is frequently processed by Long Short-Term Memory (LSTM) Networks. Consequently, a combination of these two building blocks is expected to perform well on a video classification task.

One such combination has the self-descriptive name of ConvLSTM[6]. A standard LSTM uses matrix multiplication to weight the input and the previous hidden state inside its gates. In ConvLSTM, these multiplications are replaced by convolutions.
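A minimal single-channel ConvLSTM cell makes the substitution concrete. The sketch below is a bare-bones NumPy version: the kernel size, the initialization, and the omission of bias terms and peephole connections are simplifying assumptions, not details from the ConvLSTM paper.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded 'same' 2-D cross-correlation of a single-channel map."""
    kh, kw = w.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.empty(x.shape)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * w)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """LSTM gates in which the matrix multiplications are replaced by
    convolutions, so hidden state and cell state keep a spatial layout."""
    def __init__(self, k=3, seed=0):
        rng = np.random.default_rng(seed)
        # one input kernel (W) and one hidden-state kernel (U) per gate
        self.W = {g: rng.normal(scale=0.1, size=(k, k)) for g in "ifoc"}
        self.U = {g: rng.normal(scale=0.1, size=(k, k)) for g in "ifoc"}

    def step(self, x, h, c):
        i = sigmoid(conv2d_same(x, self.W["i"]) + conv2d_same(h, self.U["i"]))
        f = sigmoid(conv2d_same(x, self.W["f"]) + conv2d_same(h, self.U["f"]))
        o = sigmoid(conv2d_same(x, self.W["o"]) + conv2d_same(h, self.U["o"]))
        g = np.tanh(conv2d_same(x, self.W["c"]) + conv2d_same(h, self.U["c"]))
        c = f * c + i * g            # cell state update, elementwise as in LSTM
        h = o * np.tanh(c)           # hidden state stays a 2-D feature map
        return h, c
```

Because the states are feature maps rather than flat vectors, the cell can track where in the frame motion occurs, not just that it occurs.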

A paper by Sudhakaran and Lanz (2017) tests how well this approach works for the detection of violence in video content[7].

To force the network to model the changes over time, the authors use the difference of two adjacent frames as the input at each step. The AlexNet architecture is then used to generate a vector representation that is sent to the ConvLSTM instance. The final hidden state, after all frames have been processed, is forwarded to a sequence of fully connected layers that computes the classification.
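The four stages of this pipeline can be traced with stand-ins for the heavyweight parts. In the sketch below, average pooling takes the place of AlexNet and a leaky tanh accumulation takes the place of the ConvLSTM; only the frame differencing and the final-state classification mirror the described architecture directly.

```python
import numpy as np

rng = np.random.default_rng(1)
frames = rng.random((8, 32, 32))              # 8 toy grayscale frames

# Step 1: the input at step t is the difference of adjacent frames.
diffs = frames[1:] - frames[:-1]              # shape (7, 32, 32)

# Step 2: stand-in feature extractor (AlexNet in the paper);
# here just 4x4 average pooling down to an 8x8 map.
def features(img):
    return img.reshape(8, 4, 8, 4).mean(axis=(1, 3))

# Step 3: stand-in recurrent update (ConvLSTM in the paper):
# a leaky, bounded accumulation of the per-step features.
h = np.zeros((8, 8))
for d in diffs:
    h = np.tanh(0.9 * h + features(d))

# Step 4: final hidden state -> fully connected layer -> score.
w = rng.normal(scale=0.1, size=h.size)
score = 1.0 / (1.0 + np.exp(-(h.ravel() @ w)))
```

The structural point survives the simplifications: only the final state, which has seen every frame difference, reaches the classifier.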

Block diagram of the model proposed by Sudhakaran & Lanz (2017)

The model is evaluated on three small data sets. The Hockey Fight Dataset consists of 500 videos of ice hockey matches, showing either fights or other content. The Movies Dataset contains 100 fight scenes and 100 scenes without violence. The Violent-Flows Crowd Violence Dataset is a collection of 246 videos depicting violent and non-violent crowd behavior at sports events. To augment the data, the authors perform random cropping and horizontal flipping.

The paper reports a second place on the Violent-Flows data set, state-of-the-art results for violence detection in ice hockey videos, and a perfect result on the Movies Dataset.

For the Hockey Fight Dataset, using the difference of two adjacent frames as input and a pre-trained AlexNet for feature extraction increases the accuracy from 94% to 97% compared to a randomly initialized network with individual frames as input.

These results are remarkable considering that violent and non-violent scenes can exhibit a high degree of feature overlap. A closer look at some of the lower-level details is required, for example, to distinguish a fight from a hug in an ice hockey match.

Violence as a detectable anomaly

In a civilized society, peaceful co-existence is the norm and violence is the exception. This fortunate fact allows Sultani et al. (2018)[8] to treat intelligent surveillance as an anomaly detection problem. In addition to interpersonal violence, the 13 anomalies they consider include arson, theft, and accidents.

Using the search functionality on YouTube and LiveLeak, the researchers compiled a set of videos showing real-world anomalies. Only unedited recordings by surveillance cameras made it into the final collection of 1,900 videos. The data set is equally balanced between anomalies (labeled as positive) and normal events (labeled as negative).

Multiple-instance learning

Each video is represented as a bag of m temporal segments. In the positive case, at least one of the m segments is assumed to contain an anomaly. In the negative case, none of the segments contains an anomaly.

To make labeling feasible for a larger number of videos, annotators provided labels at the level of bags, not at the level of individual segments. In other words, the data set tells you whether a given video shows any anomaly at all. It does not tell you when the anomaly occurs.

In the notation below, V^i denotes the i-th segment in a bag B representing a video V, and the subscripts a and n mark anomalous and normal videos, respectively.

The function f assigns an anomaly score between 0 and 1 to each segment.

A key idea is to push the highest-scoring segment of an anomalous bag as far away from the highest-scoring segment of a normal bag as possible. This objective is expressed by the following hinge loss, which ranks the maxima of the two bags:

l(B_a, B_n) = max(0, 1 - max_i f(V_a^i) + max_i f(V_n^i))

In the best possible case, the highest segment score is 1 for the anomalous video and 0 for the normal video. This results in a loss of max(0, 1 - 1 + 0) = 0.

In the worst case, the scores are reversed and the loss is max(0, 1 - 0 + 1) = 2.
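Both boundary cases are easy to verify numerically. The sketch below assumes per-segment scores are already available and simply implements max(0, 1 - max f(V_a^i) + max f(V_n^i)) over two toy bags.

```python
import numpy as np

def mil_ranking_loss(scores_anom, scores_norm):
    """Hinge loss on the highest-scoring segment of each bag:
    max(0, 1 - max_i f(V_a^i) + max_i f(V_n^i))."""
    return max(0.0, 1.0 - np.max(scores_anom) + np.max(scores_norm))

# Best case: the anomalous bag peaks at 1, the normal bag at 0.
best = mil_ranking_loss(np.array([0.2, 1.0, 0.1]), np.array([0.0, 0.0]))

# Worst case: the peak scores are reversed.
worst = mil_ranking_loss(np.array([0.0, 0.0]), np.array([1.0, 0.3]))
```

Note that only the maximum of each bag enters the loss, which is exactly what lets the model learn from bag-level labels alone.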

Spatiotemporal feature learning with C3D

The scoring function f operates on a representation extracted by the pre-trained Convolutional 3D (C3D) network, an architecture that was specifically designed with transfer learning in mind.

Images are two-dimensional. Video analysis is spatio-temporal: it adds time as a third dimension. In the C3D network described in Tran et al. (2015)[9], videos are resized to 128x171 (roughly a 4:3 aspect ratio) and split into clips of 16 frames each. Using three color channels, the input has a size of 3x16x128x171. Convolutional filters in this network have a d x k x k format, where d refers to the temporal dimension and k x k refers to the spatial dimensions. Empirical results suggest that a 3x3x3 configuration is an appropriate choice.
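The shape arithmetic behind these dimensions is the standard convolution output formula, applied once per axis. The helper below assumes stride 1 and padding 1, which preserves all three dimensions as in C3D's convolutional layers; other strides show how the dimensions shrink.

```python
def conv3d_out_dims(d, h, w, k=3, stride=1, pad=1):
    """Output dimensions of a k x k x k convolution, using the standard
    formula (n + 2*pad - k) // stride + 1 for each axis."""
    out = lambda n: (n + 2 * pad - k) // stride + 1
    return out(d), out(h), out(w)

# A 16-frame clip resized to 128x171: with stride 1 and padding 1,
# a 3x3x3 filter preserves all three dimensions.
dims = conv3d_out_dims(16, 128, 171)
```

In C3D itself the shrinking is done by the pooling layers rather than the convolutions, so the same formula with larger strides describes those stages too.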

The first five blocks in the network consist of one or two convolutional layers followed by a pooling operation. To generate predictions, the computation is continued by a sequence of two fully connected layers (identified as fc6 and fc7) and finally completed by a softmax layer. The authors of the C3D network trained the model on the Sports-1M data set, a collection of more than one million videos from 487 sports categories.

The representational power of the trained model can be reused for other tasks. A video from a different data set is first split into the required format of 16-frame clips. The fc6 activations of the individual clips are then averaged to form an L2-normalized feature vector with 4096 entries.
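Turning per-clip fc6 activations into a single video descriptor is essentially a two-line operation. The sketch below substitutes random vectors for the real activations; only the averaging and L2 normalization reflect the described procedure.

```python
import numpy as np

rng = np.random.default_rng(2)
n_frames, fc6_dim = 80, 4096
n_clips = n_frames // 16                        # split into 16-frame clips

# Stand-in for the fc6 activation of each clip (4096 values in C3D).
clip_activations = rng.random((n_clips, fc6_dim))

video_feature = clip_activations.mean(axis=0)   # average over the clips
video_feature /= np.linalg.norm(video_feature)  # L2-normalize
```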

Real-time processing

Going back to the anomaly detector, this feature vector is used as the input to a 3-layer fully connected neural network with dropout. The last layer in this architecture has a single unit and computes the anomaly score by applying the sigmoid activation function to its weighted input.
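A forward pass through such a scoring head is a short NumPy exercise. The layer widths (512, 32, 1) and the 60% dropout rate follow the configuration reported by Sultani et al. (2018); the weight initialization here is an arbitrary stand-in for the trained parameters.

```python
import numpy as np

rng = np.random.default_rng(3)

def relu(z):
    return np.maximum(z, 0.0)

def dropout(z, p=0.6, train=False):
    """Inverted dropout; the identity at inference time."""
    if not train:
        return z
    mask = rng.random(z.shape) >= p
    return z * mask / (1.0 - p)

x = rng.random(4096)                            # averaged C3D feature vector

W1, b1 = rng.normal(scale=0.01, size=(512, 4096)), np.zeros(512)
W2, b2 = rng.normal(scale=0.01, size=(32, 512)), np.zeros(32)
W3, b3 = rng.normal(scale=0.01, size=(1, 32)), np.zeros(1)

h1 = dropout(relu(W1 @ x + b1))
h2 = dropout(relu(W2 @ h1 + b2))
score = (1.0 / (1.0 + np.exp(-(W3 @ h2 + b3))))[0]   # anomaly score in (0, 1)
```

At inference time the dropout layers are identities, so scoring a video reduces to three matrix-vector products and a sigmoid, which is what makes real-time processing plausible.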