Human pose estimation, i.e. computationally detecting human body posture, is on the rise. Technologies capable of detecting the human body's joints are becoming both effective and accessible. And they look astonishing; have a look:

This video is made using OpenPose, and it's impressive

OpenPose is the first real-time system to jointly detect human body, hand and facial keypoints (130 keypoints in total) on single images. In addition, the system's computational performance on body keypoint estimation is invariant to the number of people detected in the image.

Result using OpenPose

This OpenPose library is a wonderful example of *buzzword incoming* Deep Learning. The library is built upon a neural network and was developed at Carnegie Mellon University. OpenPose uses an interesting pipeline to achieve its robust performance. If you want to dig into this topic, the paper “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields” gives an overview of the inner workings of the system.

If you are looking for alternative algorithms, take a look at DeeperCut. Here is an implementation in Python.

Result using DeeperCut

Mobile Development — The Opportunities

So, the magic of Deep Learning gives us 18 human body joints. WHAT CAN WE DO WITH IT? A lot, actually. Samin does a good job of listing the possibilities in his blog.

Pose Output Format OpenPose
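In code, the output of OpenPose's COCO body model can be represented roughly like this. The 18-keypoint order below matches OpenPose's COCO model; the `parse_person` helper is purely illustrative and not part of the OpenPose API:

```python
# The 18 keypoints of OpenPose's COCO body model, in output order.
COCO_KEYPOINTS = [
    "Nose", "Neck", "RShoulder", "RElbow", "RWrist",
    "LShoulder", "LElbow", "LWrist", "RHip", "RKnee",
    "RAnkle", "LHip", "LKnee", "LAnkle", "REye",
    "LEye", "REar", "LEar",
]

def parse_person(flat):
    """Turn a flat [x0, y0, c0, x1, y1, c1, ...] list into a dict.

    Each detected person comes out as (x, y, confidence) triples,
    one per keypoint; a confidence of 0 means the joint wasn't found.
    """
    assert len(flat) == 3 * len(COCO_KEYPOINTS)
    return {
        name: (flat[3 * i], flat[3 * i + 1], flat[3 * i + 2])
        for i, name in enumerate(COCO_KEYPOINTS)
    }
```

So one picture with one person yields 18 named (x, y) positions plus a confidence per joint, which is exactly the data we'll want to compare later.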

The fact that these pose estimators work with just a normal camera opens doors for mobile development. Imagine using this technology with your shitty smartphone camera. OK, I admit: I want to develop for smartphones, so (ideally?) everything runs embedded on the smartphone itself. This means the implementation must be capable of running on the smartphone's CPU or GPU. BUT, as these estimation algorithms currently run on some pretty decent GPUs, this is surely a point of discussion. Most new smartphones have a GPU on board, but are they capable of running these frameworks? To nuance this for my case: the (multi-person) keypoint detection doesn't need to be real-time, as there is only one picture (no real-time video) and a delay of 1 to 1.5 seconds is acceptable.

However, working embedded is not a must; it's also possible to outsource the estimation/matching algorithm to a central server with a decent GPU. This choice, embedded or outsourced, involves a lot of parameters (performance/computation power, server cost, accuracy, mobile battery usage, server communication delay, multi-platform support, scalability, mobile data usage (less important), …).

Similarity of different poses — The Application

I’m working on a project where a person must mimic a predefined pose (call it the model pose). A picture is taken of the person mimicking this predefined pose. Then, with the help of OpenPose, the pose of the person is extracted from this image and compared with the predefined pose. Finally, a scoring mechanism decides how well the two poses match, or whether they match at all.
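The pipeline just described can be sketched in a few lines. Everything here is illustrative: `estimator` stands in for whatever pose extractor ends up being used (e.g. OpenPose), and `compare` for the yet-to-be-designed scoring mechanism; neither name is a real API call.

```python
# Hypothetical sketch of the mimic-and-score pipeline described above.
def score_pose(photo, model_pose, estimator, compare):
    """Extract a pose from `photo` and score it against `model_pose`.

    `estimator`: maps an image to a list of (x, y) joints, or None
                 if no person was detected (stand-in for OpenPose).
    `compare`:   maps two joint lists to a similarity score.
    """
    detected = estimator(photo)
    if detected is None:  # no person found in the picture
        return 0.0
    return compare(detected, model_pose)
```

The interesting open question is of course what `compare` should do, which is exactly problem 2 below.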

This project raises two big problems, though:

1. Porting OpenPose to Mobile Platform

I didn’t tell you yet, but OpenPose is built with Caffe, a Deep Learning framework. Caffe is OK, but it doesn’t support Android, not to mention iOS. There are plenty of other Deep Learning libraries, but the mainstream one is TensorFlow, a Machine Learning framework developed by Google. And surprise, surprise: TensorFlow supports Android (of course, it’s made by Google) and iOS! Great!

So, TensorFlow it is then? I don’t know for sure yet; I’m still considering and researching the options. I hope to get back to this in the near future.

The community working on this topic is quite small, but it is growing. Ale Solano is currently working on porting the OpenPose Caffe library to TensorFlow, and he is blogging about it! He’s making good progress.

UPDATE: Part 2 (it works!)

2. How to determine similarity between 2 sets of matching 2D points

Let’s say the previous step succeeded, and we have the 2D joint coordinates of two poses:
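One possible approach, a sketch rather than the project's final scoring mechanism, is to first normalize each joint set for position and scale (translate to the centroid, scale to unit size) and then take the mean Euclidean distance between corresponding joints:

```python
import math

def normalize(points):
    """Translate a point set to its centroid and scale it to unit size,
    making the comparison invariant to where the person stands and how
    large they appear in the image."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    scale = math.sqrt(sum(x * x + y * y for x, y in centered)) or 1.0
    return [(x / scale, y / scale) for x, y in centered]

def pose_distance(a, b):
    """Mean Euclidean distance between two normalized joint sets.

    0 means the poses are identical (up to translation and scale);
    larger values mean the poses match less well.
    """
    na, nb = normalize(a), normalize(b)
    return sum(math.dist(p, q) for p, q in zip(na, nb)) / len(na)
```

Note this deliberately ignores rotation and joint confidences; handling those (e.g. with a full Procrustes alignment, or confidence-weighted distances) is part of the open question.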