The clip marking problem

Have you ever shot a video of someone doing a hard and repetitive task? Think action sports (skateboarding, snowboarding, inline skating) or random things like bottle flipping or making your cat do something funny. It usually takes a lot of tries until the trick is eventually landed, and only one or a few of those takes end up being used. When going through the footage afterwards, it takes a lot of time to browse through all the tries just to find the few that were successful.

Old analog solutions

Back in the days of tapes there was one popular solution to this. After a successful shot, people would put a hand right in front of the lens. Later, when rewinding the tape with the preview on, they knew which parts needed to be captured.

Modern ways

We don’t capture footage from tapes anymore. After a filming session we end up with multiple files. Some cameras have an option to annotate or lock video clips, but this is a slow process and definitely not handy if you are right there in the action, filming multiple people. You need to mark your clips as fast as possible; there is simply no time for browsing through camera menus. Another way is to mark clips with a hand and then look at the thumbnails. This is not a perfect solution either. Different file explorers generate thumbnails for different timestamps, so your hand might not be picked up. On top of that, if you have lots of footage it is easy to accidentally skip a marked clip if you don’t notice the thumbnail.

Implementation

Luckily it is 2019 and object detection in images has never been easier, thanks to neural networks. From the user’s perspective the application is simple:

1. Choose a source directory with video clips

2. Analyse the last few seconds of every clip and detect a hand in front of the lens

3. If there is a hand in a clip, copy it to the destination directory for marked clips

To implement such a solution I decided to go with good old OpenCV for loading videos and grabbing frames. For object detection I chose the TensorFlow Object Detection API. I was also lucky enough to find a great hand detection model made by EvilPort2.

The main program flow is as follows:
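A minimal sketch of that flow might look like the snippet below; SOURCE_DIR, DEST_DIR and the exact file handling are illustrative rather than the project’s actual code.

```python
import glob
import os
import shutil

SOURCE_DIR = "/path/to/clips"    # illustrative: directory with the raw clips
DEST_DIR = "/path/to/marked"     # illustrative: directory for hand-marked clips


def main():
    # Find all MP4 files in the source directory
    for video_path in sorted(glob.glob(os.path.join(SOURCE_DIR, "*.mp4"))):
        # Run hand detection on the last seconds of the clip
        # (find_hand_in_video is defined in the listings below)
        if find_hand_in_video(video_path):
            # Copy marked clips to the destination directory
            shutil.copy2(video_path, DEST_DIR)


if __name__ == "__main__":
    main()
```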

First we find all MP4 files in the source directory. For each file a call to find_hand_in_video is performed to run detection. Finally, if find_hand_in_video returns True, the file is copied to the destination directory. The find_hand_in_video function begins with the following code:
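A sketch of that opening, assuming only the last 3 seconds of each clip are analysed (SECONDS_TO_ANALYSE is an illustrative name for that constant):

```python
import cv2

SECONDS_TO_ANALYSE = 3  # illustrative constant: only the clip's ending is checked


def find_hand_in_video(video_path):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    total_frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    # Start 3 seconds' worth of frames before the end of the clip
    first_frame = max(0, total_frames - SECONDS_TO_ANALYSE * fps)
    cap.set(cv2.CAP_PROP_POS_FRAMES, first_frame)
    # ... frame grabbing and detection follow in the next listings
```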

We don’t want to process the whole video: the last 3 seconds are enough, since that is where hand marks are expected. The first frame to analyse is calculated by subtracting 3 seconds’ worth of frames (3 times the FPS) from the total frame count of the video. This can be achieved by reading the CAP_PROP_FRAME_COUNT property and setting CAP_PROP_POS_FRAMES. Once the video is loaded into the cap variable we can start grabbing frames:
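Continuing the sketch inside find_hand_in_video, a grab/retrieve loop that decodes only every SKIPFRAMES-th frame could look like this (the value 40 comes from the discussion below):

```python
    # (inside find_hand_in_video, continued)
    SKIPFRAMES = 40                   # analyse roughly one frame out of every 40
    frame_id = 0
    while cap.grab():                 # grab() advances to the next frame without decoding it
        frame_id += 1
        if frame_id % SKIPFRAMES != 0:
            continue                  # cheaply skip frames we do not analyse
        ok, frame = cap.retrieve()    # decode only the frames we kept
        if not ok:
            continue
        # the hand detector runs on this frame (see the listings below)
```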

We don’t want to process every frame either; that would be too heavy. There are two main ways to skip frames using OpenCV’s VideoCapture:

1. Set CAP_PROP_POS_FRAMES to the desired frame number

2. Use the lightweight cap.grab() to move to the next frame, skip frames with if frame_id % SKIPFRAMES != 0, and obtain the data of the remaining frames using cap.retrieve()

While method 1 is easier, in our case it is simply slower. Jumping to an arbitrary frame is time-consuming, because the decoder typically has to seek to the nearest keyframe and decode forward from there. If the frames we need are relatively close to each other, it makes more sense to iterate one by one and filter out the unwanted frames. Since we only process every 40th frame, the second method is more suitable. Once a frame is obtained we can perform detection:
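The detection itself can live in a small helper. The sketch below assumes a TensorFlow 1.x frozen graph exported by the Object Detection API, with its standard tensor names (image_tensor, detection_boxes, detection_scores); MODEL_PATH and the detect_hand_in_frame name are illustrative, not necessarily the ones used in the original project.

```python
import numpy as np
import tensorflow as tf

MODEL_PATH = "hand_inference_graph/frozen_inference_graph.pb"  # illustrative path to the exported model

# Load the frozen detection graph once, up front
detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(MODEL_PATH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")
sess = tf.Session(graph=detection_graph)


def detect_hand_in_frame(frame):
    """Return (scores, boxes) for a single frame, sorted by confidence."""
    image_tensor = detection_graph.get_tensor_by_name("image_tensor:0")
    boxes_tensor = detection_graph.get_tensor_by_name("detection_boxes:0")
    scores_tensor = detection_graph.get_tensor_by_name("detection_scores:0")
    # The model expects a batch of images, so add a leading dimension
    image_expanded = np.expand_dims(frame, axis=0)
    boxes, scores = sess.run(
        [boxes_tensor, scores_tensor],
        feed_dict={image_tensor: image_expanded})
    # Drop the batch dimension again before returning
    return np.squeeze(scores), np.squeeze(boxes)
```

Note that OpenCV delivers frames in BGR order while Object Detection API models are usually trained on RGB images, so converting the frame with cv2.cvtColor before feeding it to the model may improve detection quality.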

In TensorFlow all data is represented as tensors (basically multi-dimensional arrays), and a graph is composed of operations that pass tensors between them. The model we used was created with the Object Detection API. Such models expose output tensors from which we can read detection scores and detection boxes. np.expand_dims and np.squeeze are used to convert between the array shapes that OpenCV and TensorFlow expect. Moving back to find_hand_in_video:
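Back inside the frame loop, the final check could look like this sketch; SCORE_THRESHOLD and WIDTH_THRESHOLD are assumed module-level constants (for example 0.5 each), and the boxes use the Object Detection API’s normalised [ymin, xmin, ymax, xmax] convention:

```python
        # (inside the frame loop of find_hand_in_video, continued)
        scores, boxes = detect_hand_in_frame(frame)
        score = scores[0]                    # detections are sorted by confidence; take the best one
        ymin, xmin, ymax, xmax = boxes[0]    # normalised [0, 1] box coordinates
        box_width = xmax - xmin
        if score > SCORE_THRESHOLD and box_width > WIDTH_THRESHOLD:
            cap.release()
            return True                      # the clip is considered hand-marked

    # no sufficiently large hand was found in any analysed frame
    cap.release()
    return False
```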

We are only interested in a single hand detection, which is why we grab the first item from the scores and boxes arrays. We compare the score to SCORE_THRESHOLD, but we also need to check how big the detected hand actually is. If we skip this check we might end up with lots of false positives: after all, we do not want to detect the hands of people in the shot, only a big hand right in front of the lens. Luckily the model provides the corner points of the box that encloses the detected item. If both the score and the box width are above their thresholds, the video is considered hand-marked.

Conclusion

Nowadays object detection in images is much easier thanks to tools with neural network capabilities like TensorFlow. In the past we usually had to dive deep into computer vision theory to accomplish similar tasks. Not having to worry about that, and instead just creating or finding good models, is a pleasant and refreshing approach. There are more things to improve in the solution presented above: I plan to add gesture detection to categorize clips, and a voice detection feature would be great for automatically naming clips. There are so many use cases for neural networks in image and audio detection that I highly encourage you to try them out; it is definitely worth investing some time.