Human Pose Estimation

Human Pose Estimation is an important task in Computer Vision that has gained a lot of attention in recent years. It has a wide range of applications, such as human-computer interaction, gaming, action recognition, computer-assisted living, and special effects, and it has progressed rapidly with the advent of neural networks in the deep learning era.

Figure 1 HPE

So, what is Human Pose Estimation?

The goal of human pose estimation is to estimate the joint locations of one or more human bodies in 2D or 3D space, most commonly from a single image. The joints are connected to form a skeleton that describes the pose of the person.
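To make the "joints plus skeleton" idea concrete, here is a minimal sketch of how a pose can be stored. The five-joint skeleton, the joint names, and the `bone_lengths` helper are all made up for illustration; real datasets define their own joint sets and connectivity.

```python
import numpy as np

# Hypothetical 5-joint skeleton; real datasets define their own joint sets.
JOINT_NAMES = ["head", "neck", "pelvis", "left_hand", "right_hand"]
# Each bone connects a pair of joint indices.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def bone_lengths(joints, bones=BONES):
    """Return the Euclidean length of every bone.

    joints: array of shape (num_joints, D) with D = 2 or 3.
    """
    joints = np.asarray(joints, dtype=float)
    starts = joints[[a for a, _ in bones]]
    ends = joints[[b for _, b in bones]]
    return np.linalg.norm(ends - starts, axis=1)

# A toy 2D pose in pixel coordinates (x, y).
pose_2d = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 4.0], [-3.0, 5.0], [3.0, 5.0]])
print(bone_lengths(pose_2d))
```

The same function works for a 3D pose, since only the last dimension of the array changes.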

We will focus on the most popular and recent works on 2D and 3D Human Pose Estimation.

Figure: The overall architecture of OpenPose

OpenPose is one of the most popular open-source tools for body, foot, hand, and facial keypoint detection. It makes use of Part Affinity Fields (PAFs), a set of 2D vector fields that encode the location and orientation of limbs over the image domain. As shown in the figure, the feature map F is passed through several convolutional layers to generate the PAFs (L) and the confidence maps (S) for every joint location. The process is repeated for several iterations, and the network refines its predictions at every stage. OpenPose is still widely used in many research projects.
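One way to see why a vector field is useful: two candidate joints can be scored as a limb by checking how well the field along the segment between them points in the segment's direction. The sketch below illustrates that idea on a toy field. The function name `paf_score`, the nearest-pixel sampling, and the number of samples are assumptions of this example, not OpenPose's actual code.

```python
import numpy as np

def paf_score(paf, p_a, p_b, num_samples=10):
    """Average alignment between a PAF and the segment p_a -> p_b.

    paf: array of shape (2, H, W) holding the x and y components of the field.
    p_a, p_b: (x, y) pixel coordinates of two candidate joints.
    """
    p_a = np.asarray(p_a, dtype=float)
    p_b = np.asarray(p_b, dtype=float)
    direction = p_b - p_a
    length = np.linalg.norm(direction)
    if length == 0:
        return 0.0
    unit = direction / length
    score = 0.0
    for t in np.linspace(0.0, 1.0, num_samples):
        # Sample the field at the nearest pixel along the segment.
        x, y = np.rint(p_a + t * direction).astype(int)
        score += paf[0, y, x] * unit[0] + paf[1, y, x] * unit[1]
    return score / num_samples

# Toy field: a horizontal limb along row 5, pointing in the +x direction.
paf = np.zeros((2, 10, 10))
paf[0, 5, 2:8] = 1.0

print(paf_score(paf, (2, 5), (7, 5)))  # candidates lying on the limb
print(paf_score(paf, (2, 2), (7, 2)))  # candidates away from the limb
```

A pair of joints that lies on the limb scores high, while a pair elsewhere in the image scores zero, which is what lets keypoints be grouped into per-person skeletons.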

Figure 3: A diagram of our approach. The building block of our network is a linear layer, followed by batch normalization, dropout and a ReLU activation. This is repeated twice, and the two blocks are wrapped in a residual connection. The outer block is repeated twice. The input to our system is an array of 2d joint positions, and the output is a series of joint positions in 3d.

In this work, the authors implement a lightweight and fast network that can process 300 frames per second. After extracting the 2D joint locations, they exploit the low dimensionality of the 2D input and use a simple neural network, shown in Figure 3, which has a small number of parameters and is easy to train. The network estimates the coordinates of the joints in 3D space.
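The block structure from the caption (linear layer, batch normalization, dropout, ReLU, repeated twice inside a residual connection, with two such outer blocks) can be sketched as a NumPy forward pass. This is a rough illustration only: the weights are random and untrained, the joint count and hidden width are arbitrary, and the input and output linear layers are an assumption added here so that the dimensions line up.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_JOINTS, HIDDEN = 16, 64  # illustrative sizes, not taken from the article

def linear_params(n_in, n_out):
    # Random, untrained weights: the outputs below are meaningless numbers.
    return rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out)

def layer(x, params):
    """Linear -> batch norm -> dropout -> ReLU, in inference mode."""
    w, b = params
    x = x @ w + b
    # Batch norm at inference is a fixed affine map; this sketch assumes
    # running mean 0 and variance 1, which makes it the identity.
    # Dropout is also the identity at inference time.
    return np.maximum(x, 0.0)

def residual_block(x, params_1, params_2):
    """Two layers wrapped in a skip connection."""
    return x + layer(layer(x, params_1), params_2)

# Assumed input/output projections around the two residual blocks.
w_in = linear_params(2 * NUM_JOINTS, HIDDEN)
blocks = [(linear_params(HIDDEN, HIDDEN), linear_params(HIDDEN, HIDDEN))
          for _ in range(2)]
w_out = linear_params(HIDDEN, 3 * NUM_JOINTS)

def lift_to_3d(joints_2d):
    """Map (batch, NUM_JOINTS, 2) 2D joints to (batch, NUM_JOINTS, 3)."""
    x = joints_2d.reshape(len(joints_2d), -1)
    x = x @ w_in[0] + w_in[1]
    for params_1, params_2 in blocks:
        x = residual_block(x, params_1, params_2)
    x = x @ w_out[0] + w_out[1]
    return x.reshape(len(joints_2d), NUM_JOINTS, 3)

joints_2d = rng.normal(size=(4, NUM_JOINTS, 2))
print(lift_to_3d(joints_2d).shape)  # (4, 16, 3)
```

Because the input is just a short vector of 2D coordinates rather than an image, the whole model is a handful of matrix multiplications, which is why it can be so fast.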

DensePose: Dense Human Pose Estimation In The Wild

DensePose adopts the architecture of Mask-RCNN with the Feature Pyramid Network (FPN) features, and ROI-Align pooling so as to obtain dense part labels and coordinates within each of the selected regions.

As shown below, the method adopts a fully-convolutional network on top of the ROI-pooling that is entirely devoted to generating per-pixel classification results for selection of surface part and regressing local coordinates within each part.
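To illustrate what "per-pixel classification plus per-part coordinate regression" produces, the sketch below decodes such outputs for a tiny region. The tensor layout (background in channel 0, one U and one V map per part) and the function name `decode_dense_pose` are assumptions of this example rather than the DensePose implementation.

```python
import numpy as np

def decode_dense_pose(part_logits, u_maps, v_maps):
    """Turn per-pixel head outputs into a part index and (u, v) per pixel.

    part_logits: (num_parts + 1, H, W), channel 0 is background.
    u_maps, v_maps: (num_parts, H, W), one regressed map per body part.
    """
    part = part_logits.argmax(axis=0)              # (H, W) winning class
    # Index of the matching regression channel; background pixels are
    # clamped to channel 0 here and masked out below.
    channel = np.clip(part - 1, 0, None)[None]     # (1, H, W)
    u = np.take_along_axis(u_maps, channel, axis=0)[0]
    v = np.take_along_axis(v_maps, channel, axis=0)[0]
    foreground = part > 0
    return part, np.where(foreground, u, 0.0), np.where(foreground, v, 0.0)

# Toy 1x2 region with two body parts: pixel 0 belongs to part 2,
# pixel 1 is background.
part_logits = np.array([[[0.1, 0.9]], [[0.2, 0.05]], [[0.7, 0.05]]])
u_maps = np.array([[[0.1, 0.2]], [[0.3, 0.4]]])
v_maps = np.array([[[0.5, 0.6]], [[0.7, 0.8]]])

part, u, v = decode_dense_pose(part_logits, u_maps, v_maps)
print(part, u, v)
```

Each foreground pixel ends up with a part label and a local coordinate pair read from that part's regression maps, which is the dense correspondence the method is after.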

The DensePose-RCNN system can be trained directly using the annotated points as supervision. However, the authors obtain substantially better results by ‘inpainting’ the values of the supervision signal on positions that are not originally annotated. To achieve this, a learning-based approach is adopted, where a ‘teacher’ network is trained first: a fully-convolutional neural network that reconstructs the ground-truth values given scale-normalized images and the segmentation masks.

The VIBE framework is trained to predict human pose and shape from video (image sequences) instead of single images. Using a temporal network, it produces kinematically plausible human motion. The authors leverage two sources of unpaired information by training a sequence-based generative adversarial network (GAN). Given the video of a person, a temporal model is trained to predict the parameters of the SMPL body model for each frame, while a motion discriminator tries to distinguish between real and regressed sequences. A CNN extracts features for each frame, which are passed through gated recurrent units (GRUs) to model the motion. These features are used to estimate the body model parameters. The estimated body model and samples from a motion capture dataset are given to the discriminator, which learns to distinguish between fake and real examples.
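The temporal part of this pipeline can be sketched as a GRU that walks over per-frame features and regresses body model parameters at every step. The code below is a minimal NumPy illustration with random, untrained weights. The feature size, hidden size, number of output parameters, and the omission of biases are all assumptions of this example, and the CNN and the motion discriminator are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT, HIDDEN, NUM_PARAMS = 32, 16, 85  # illustrative sizes (assumptions)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init(n_in, n_out):
    return rng.normal(0.0, 0.1, size=(n_in, n_out))

# Untrained GRU weights: update gate (z), reset gate (r), candidate (h).
W = {k: init(FEAT, HIDDEN) for k in "zrh"}
U = {k: init(HIDDEN, HIDDEN) for k in "zrh"}
W_out = init(HIDDEN, NUM_PARAMS)

def gru_step(x, h):
    """One GRU update (biases omitted for brevity)."""
    z = sigmoid(x @ W["z"] + h @ U["z"])
    r = sigmoid(x @ W["r"] + h @ U["r"])
    h_new = np.tanh(x @ W["h"] + (r * h) @ U["h"])
    return (1.0 - z) * h + z * h_new

def regress_sequence(frame_features):
    """Map per-frame features (T, FEAT) to body parameters (T, NUM_PARAMS)."""
    h = np.zeros(HIDDEN)
    outputs = []
    for x in frame_features:
        h = gru_step(x, h)          # hidden state carries motion context
        outputs.append(h @ W_out)   # per-frame body model parameters
    return np.stack(outputs)

features = rng.normal(size=(8, FEAT))  # stand-in for CNN features of 8 frames
print(regress_sequence(features).shape)  # (8, 85)
```

Since the hidden state is carried from frame to frame, the prediction for a frame depends on the frames before it, which is what lets a model of this kind produce smoother motion than a per-image regressor.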

Figure: The proposed multi-task approach for human pose estimation and action recognition. This method provides 2D/3D pose estimation from single images or frame sequences. Pose and visual information are used to predict actions in a unified framework, and both predictions are refined by K prediction blocks.

A multi-task framework is trained to jointly estimate 2D or 3D human poses from monocular colour images and classify human actions from video sequences.

Figure Overview of the multi-task network architecture

Input images are fed through the entry-flow, which extracts low-level visual features. The extracted features are then processed by a sequence of downscaling and upscaling pyramids indexed by p. Each Prediction Block (PB) is supervised on pose and action predictions, which are then re-injected into the network, producing a new feature map that is refined by further downscaling and upscaling pyramids.

This flexible network architecture can run inference on individual frames for pose estimation or on entire videos for action recognition.

Figure 6 Algebraic triangulation

This is a multi-view approach that uses multiple cameras at different angles for more precise pose estimation. Two solutions are presented to leverage information from multiple 2D views.

Algebraic triangulation

As shown in Figure 6, the RGB images are passed through a 2D CNN to extract the joints’ heatmaps and a confidence score for each joint in each camera view. The joint keypoints are extracted by applying soft-argmax to the heatmaps. A linear algebraic triangulation method is then used to compute the 3D locations of the joints. The triangulation algorithm assumes that the joint coordinates from each camera view are independent of each other. Because some joints cannot be precisely estimated from some views due to occlusions, the network uses learnable weights (a fully connected layer) to control each camera’s contribution during the triangulation.
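The two steps described above, soft-argmax on a heatmap and confidence-weighted linear triangulation, can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions: the cameras have identity intrinsics and rotation, the confidences are passed in by hand instead of being predicted by a network, and the function names are made up for this example.

```python
import numpy as np

def soft_argmax_2d(heatmap, beta=1.0):
    """Differentiable keypoint location: softmax-weighted mean of pixel coords."""
    h, w = heatmap.shape
    weights = np.exp(beta * (heatmap - heatmap.max()))
    weights /= weights.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return np.array([(weights * xs).sum(), (weights * ys).sum()])

def triangulate(projections, points_2d, confidences):
    """Confidence-weighted linear (DLT) triangulation of one joint.

    projections: (C, 3, 4) camera projection matrices.
    points_2d:   (C, 2) joint location in each view.
    confidences: (C,) per-view weights; low values down-weight a view.
    """
    rows = []
    for P, (x, y), w in zip(projections, points_2d, confidences):
        rows.append(w * (x * P[2] - P[0]))
        rows.append(w * (y * P[2] - P[1]))
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def camera(tx):
    # Identity intrinsics and rotation, camera shifted along x (toy setup).
    return np.hstack([np.eye(3), np.array([[tx], [0.0], [0.0]])])

# Soft-argmax on a toy heatmap with a sharp peak at x=3, y=1.
heatmap = np.zeros((4, 6))
heatmap[1, 3] = 50.0
print(soft_argmax_2d(heatmap))

# Project a known 3D point into three views, then corrupt the third view.
projections = np.stack([camera(0.0), camera(-1.0), camera(1.0)])
point_3d = np.array([0.5, 0.2, 4.0])
homog = projections @ np.append(point_3d, 1.0)     # (3, 3)
points_2d = homog[:, :2] / homog[:, 2:3]
points_2d[2] += 0.3   # pretend the third view is occluded and wrong

print(triangulate(projections, points_2d, np.array([1.0, 1.0, 0.0])))
print(triangulate(projections, points_2d, np.array([1.0, 1.0, 1.0])))
```

With the unreliable view weighted at zero, the original point is recovered up to numerical error, while equal weights let the bad observation pull the estimate away. That is the motivation for learning the per-camera weights.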

Figure 7 Volumetric triangulation

Volumetric triangulation approach

The major difference here is that the feature maps are unprojected into 3D volumes. A fixed-size 3D cube around the person is filled by projecting the output of the 2D network along projection rays inside the cube. The volumetric maps from all views are aggregated and fed to the V2V network, a 3D convolutional neural network that outputs 3D heatmaps. Soft-argmax is applied to the 3D heatmaps to output the joint locations. At the time of writing, this approach is the state of the art on the Human3.6M dataset, one of the largest datasets of 3D human poses.
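The unprojection step can be sketched by projecting every voxel center into each view and reading off the 2D map at that pixel. The example below is a simplified illustration, not the paper's implementation: it uses nearest-neighbour sampling, single-channel maps, aggregation by summation, a made-up toy camera setup, and it skips the V2V network entirely by applying a 3D soft-argmax directly to the aggregated volume. It also assumes every voxel lies in front of every camera.

```python
import numpy as np

def unproject(feature_maps, projections, voxel_centers):
    """Fill a voxel grid by sampling each view's 2D map along projection rays.

    feature_maps:  (C, H, W) one single-channel 2D map per camera.
    projections:   (C, 3, 4) camera projection matrices.
    voxel_centers: (N, 3) world coordinates of the voxel centers.
    Returns (C, N): one sampled value per view and voxel.
    """
    num_views, h, w = feature_maps.shape
    homogeneous = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    volumes = np.zeros((num_views, len(voxel_centers)))
    for c in range(num_views):
        proj = homogeneous @ projections[c].T          # (N, 3)
        xy = proj[:, :2] / proj[:, 2:3]
        # Nearest-neighbour lookup (a simplification of this sketch).
        cols = np.rint(xy[:, 0]).astype(int)
        rows = np.rint(xy[:, 1]).astype(int)
        inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
        volumes[c, inside] = feature_maps[c, rows[inside], cols[inside]]
    return volumes

def soft_argmax_3d(volume, voxel_centers):
    """Softmax-weighted mean of the voxel centers."""
    weights = np.exp(volume - volume.max())
    weights /= weights.sum()
    return weights @ voxel_centers

# Toy setup: two cameras sharing intrinsics K, the second shifted along x.
K = np.array([[8.0, 0.0, 8.0], [0.0, 8.0, 8.0], [0.0, 0.0, 1.0]])

def camera(tx):
    return K @ np.hstack([np.eye(3), np.array([[tx], [0.0], [0.0]])])

projections = np.stack([camera(0.0), camera(-2.0)])

# A 3x3x3 cube of voxel centers in front of both cameras.
axis_xy = np.array([-1.0, 0.0, 1.0])
axis_z = np.array([3.0, 4.0, 5.0])
voxel_centers = np.array([[x, y, z] for x in axis_xy
                          for y in axis_xy for z in axis_z])

# Heatmaps with a peak where the point (0, 0, 4) projects in each view.
heatmaps = np.zeros((2, 17, 17))
heatmaps[0, 8, 8] = 20.0   # view 1: pixel (x=8, y=8)
heatmaps[1, 8, 4] = 20.0   # view 2: pixel (x=4, y=8)

volume = unproject(heatmaps, projections, voxel_centers).sum(axis=0)
print(soft_argmax_3d(volume, voxel_centers))
```

Only the voxel that is consistent with the peaks in both views accumulates the full response, so the soft-argmax lands very close to (0, 0, 4). In the actual method a 3D CNN processes the aggregated volume before this final step.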

Figure 8 Results from volumetric approach

Conclusion

We presented some of the most popular recent advances in Human Pose Estimation. We can expect many more solutions, as deep learning is still advancing and being applied in Computer Vision.

Also, if you are interested in diving deep into Computer Vision with Deep Learning, a good place to start is the Deep Learning in Computer Vision course by HSE University.

References