Synthetic Data Generation

To generate synthetic video data at scale, we used the Unity game engine. All visible objects in any scene were either downloaded from the Unity Asset Store or modeled manually in Blender.

To maximize variation in the videos, the appearance of each scene was composed anew for every recording: colors and materials were exchanged, objects were swapped or shown at random, the lighting was varied, and different objects were picked to perform the actions. This, as well as the execution of the actions and the recording and labeling of the videos, was fully automated via the scripting interface that Unity provides. The result is a program that, when executed, renders videos of 14 different actions in real time and writes the individual frames to disk together with a text file containing the label. The framework can easily be extended by adding further scenes or actions.
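The exact directory layout written by the generator is not spelled out above; as a minimal sketch, the following Python snippet assumes a hypothetical layout with one directory per clip (frame_0000.png, frame_0001.png, ... plus a label.txt containing the action name) and shows how such a clip could be read back for training.

    import os
    from PIL import Image

    def load_clip(clip_dir):
        """Load one generated clip: its frames and the action label.

        Assumes a hypothetical layout written by the generator:
            clip_dir/frame_0000.png, frame_0001.png, ...
            clip_dir/label.txt  (a single line with the action name)
        """
        frame_files = sorted(f for f in os.listdir(clip_dir) if f.endswith(".png"))
        frames = [Image.open(os.path.join(clip_dir, f)).convert("RGB") for f in frame_files]
        with open(os.path.join(clip_dir, "label.txt")) as fh:
            label = fh.read().strip()
        return frames, label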

Experiments

To investigate the importance of variation in the video backgrounds, we rendered three different datasets, each containing videos that originated from only a subset of the created scenes. A separate network was trained on each dataset and subsequently applied to real-world data. Action classification was more successful the more variation the training videos showed, i.e. the more scenes the videos were rendered in. Accuracy on real data also increased with the number of training videos, but the improvement stagnated beyond roughly 3,000 samples per class.

Since the performance of trained networks on synthetic data is of little practical interest, and rendering videos for every specific use case is infeasible, transfer learning is a pivotal concept. Synthetic data can be used to train the weights of the deeper layers of the neural network, while the upper layers are fine-tuned on real-world datasets of the required classes. The advantage is that very fine-grained features can be extracted from a large and densely labeled synthetic dataset, while only a small dataset is needed for fine-tuning. In addition, fine-tuning requires considerably less training time than pre-training.
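To illustrate this setup (the actual architecture used in the experiments is not specified here, so a generic 3D-ResNet from torchvision serves as a stand-in), a minimal PyTorch sketch freezes the feature-extracting layers and makes only the final classification layer trainable:

    import torch
    import torch.nn as nn
    from torchvision.models.video import r3d_18

    NUM_NEW_CLASSES = 14  # real-world classes used for fine-tuning

    # Generic 3D-ResNet as a stand-in for the actual architecture.
    model = r3d_18()
    # Hypothetical checkpoint produced by pre-training on the synthetic data:
    # model.load_state_dict(torch.load("synthetic_pretrained.pth"))

    # Freeze the deeper (feature-extracting) layers ...
    for param in model.parameters():
        param.requires_grad = False

    # ... and replace the last fully connected layer with a fresh head
    # for the new classes; only this layer will be trained.
    model.fc = nn.Linear(model.fc.in_features, NUM_NEW_CLASSES)

    optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()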

To get an indication of how well pre-training on synthetic data works, we pre-trained the same architecture with different amounts of synthetic data as well as with real-world data of the same 14 classes. Afterwards, we fine-tuned the last fully connected layer on 14 new classes of real data. The evaluation took place on a real dataset of videos that the network had never seen before.
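For the evaluation step, a minimal sketch of computing overall and per-class accuracy on the held-out real videos might look as follows (assuming a hypothetical PyTorch DataLoader real_test_loader that yields batches of clip tensors and integer labels for the 14 new classes):

    import torch
    from collections import defaultdict

    def evaluate(model, real_test_loader, device="cpu"):
        """Compute overall and per-class accuracy on held-out real videos."""
        model.eval()
        correct = defaultdict(int)
        total = defaultdict(int)
        with torch.no_grad():
            for clips, labels in real_test_loader:
                preds = model(clips.to(device)).argmax(dim=1).cpu()
                for pred, label in zip(preds, labels):
                    total[int(label)] += 1
                    correct[int(label)] += int(pred == label)
        per_class = {c: correct[c] / total[c] for c in total}
        overall = sum(correct.values()) / sum(total.values())
        return overall, per_class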

The results showed that pre-training on a large synthetic dataset (more than 3,000 samples per class) worked best and even surpassed pre-training on real data. In particular, the classes “Moving [something] and [something] away from each other” and “Moving [something] and [something] closer to each other” were classified with high accuracy in all cases. This can be explained by the classes used for pre-training, which included “Pushing [something] from right to left” and “Pushing [something] from left to right”. Since these were detected reasonably well, we assume that features detecting movements in either direction developed during pre-training. During fine-tuning on the new classes, these features could be recombined to also detect movements of two different objects. Actions that share no movement with any of the pre-training classes were classified rather poorly.