Learning to Reconstruct People

Reconstructing human body shapes from a single image or a video stream is a challenging problem. The reconstruction needs to be accurate, so that automated body measurements agree with real ones, and practical, meaning it is fast and uses as few sensors as possible. Interest in the problem is partly driven by growing demand for applications such as telepresence, virtual and augmented reality, cinematography, virtual try-on, gaming, and body health monitoring.

There has been steady research on recovering human body shape and pose from images and video, especially since the advent of deep neural networks. This progress builds on advances in computing, computer graphics, and deep learning techniques.

This blogpost is divided into:

Background
- 3D Morphable Models
- Blend Skinning
- SMPL

Datasets and Tools
- HumanEva (2009)
- Human3.6m (2014)
- SURREAL (2017)
- UP-3D (2017)
- DensePose (2018)
- JTA (2018)
- 3DPW (2018)
- Tools: Menpo, TF-Graphics

Current Research
- Human shape from silhouettes using generative HKS descriptors and cross-modal neural networks (2017)
- End-to-end recovery of human shape and pose (2018)
- Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation (2018)
- Learning to estimate 3D human pose and shape from a single color image (2018)
- Video-Based Reconstruction of 3D people models (2018)
- Deep AutoEncoder for Combined Human Pose Estimation and Body Model Upscaling (2018)
- Detailed Human Avatars from Monocular Video (2018)
- BodyNet: Volumetric Inference of 3D Human Body Shapes (2018)
- Learning to Reconstruct People in Clothing from a Single RGB Camera (2019)
- DenseBody: Directly Regressing Dense 3D Human Pose and Shape From a Single Color Image (2019)



The intention of this blogpost is to gather the current research on recovering human shape from images/video in one place and possibly provide a primer for someone interested in the field.

Hence, this post will be updated regularly.

Before diving into the research paper summaries, it’s useful to discuss how a “mean template”, or deformable model, is built (the term 3DMM usually refers to a face template model in the literature).

Background

3D Morphable Models

Large-scale face model (LSFM)

3D Morphable models (3DMMs) are powerful 3D statistical models, usually of human face shape and texture. A 3DMM is constructed by first establishing group-wise dense correspondences between a set of training meshes (so that vertices are consistent across all meshes), and then performing some form of dimensionality reduction, typically PCA, on the registered data to produce a low-dimensional model.

Establishing dense correspondence is a challenging problem and usually relies on soft constraints, typically landmarks that guide an optimal similarity alignment between each mesh and the (annotated) template. Non-rigid iterative closest point (NICP) is then usually performed to deform the template so that it takes the shape of the input mesh, with the landmarks acting as a soft constraint. This is susceptible to failure because both landmark localization and NICP are non-convex optimization problems that are sensitive to initialization.

Nevertheless, 3DMMs are powerful priors on 3D shape that can be leveraged in fitting algorithms to reconstruct accurate and complete 3D representations from data-deficient sources (e.g., noisy depth scans). Any new input 3D mesh can thus be projected onto the model subspace by finding the shape vector that generates a shape instance as close as possible to the (registered) input.

3DMMs also provide a mechanism to encode any 3D shape in a low-dimensional feature space, a compact representation that makes many 3D shape analysis problems tractable.
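As a concrete illustration, here is a minimal PCA shape model sketched in NumPy. The data is random stand-in data, and the mesh count, vertex count, and component count are toy values, not from any real dataset: projecting a registered mesh into the model and decoding it back yields the closest shape instance in the subspace.

```python
import numpy as np

# Toy 3DMM: build a PCA shape model from "registered" meshes and
# project a new (registered) mesh onto the model subspace.
rng = np.random.default_rng(0)
n_meshes, n_vertices = 50, 100
X = rng.normal(size=(n_meshes, 3 * n_vertices))    # each row: flattened mesh

mean_shape = X.mean(axis=0)
# PCA via SVD of the centered data
U, S, Vt = np.linalg.svd(X - mean_shape, full_matrices=False)
n_components = 10
basis = Vt[:n_components]                          # (10, 3N) shape basis

def encode(mesh):
    """Low-dimensional shape vector for a registered mesh."""
    return basis @ (mesh - mean_shape)

def decode(alpha):
    """Reconstruct a shape instance from shape coefficients."""
    return mean_shape + basis.T @ alpha

new_mesh = X[0]
alpha = encode(new_mesh)           # compact representation
recon = decode(alpha)              # closest instance in the model subspace
```

Because the basis rows are orthonormal, `decode(encode(x))` is the orthogonal projection of `x` onto the model subspace (plus the mean), which is exactly the "as close as possible" instance described above.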

Blend Skinning and Blend Shapes

(Linear) blend skinning (LBS) is the idea of transforming vertices inside a single mesh by a blend of multiple transforms, i.e., each vertex in the mesh surface is transformed by a weighted combination of its neighboring bones’ transforms. Blend skinning thus attaches the surface of a mesh to an underlying “skeletal” structure. There has been a lot of research on automatically rigging LBS models, taking a collection of meshes and inferring the bones, joints, and blend weights. The problem with this method is that the models do not span a space of body shapes and often produce unnatural results.
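A minimal LBS sketch, with toy vertices, two bones, and hand-picked weights (all values illustrative):

```python
import numpy as np

def lbs(vertices, weights, transforms):
    """Linear blend skinning.
    vertices: (N,3); weights: (N,K); transforms: (K,4,4) per-bone rigid transforms."""
    n = vertices.shape[0]
    homo = np.hstack([vertices, np.ones((n, 1))])            # homogeneous coords
    # Per-vertex blended transform: sum_k w[v,k] * T_k
    blended = np.einsum('nk,kij->nij', weights, transforms)  # (N,4,4)
    out = np.einsum('nij,nj->ni', blended, homo)
    return out[:, :3]

# Two bones: identity, and a 90-degree rotation about z
T0 = np.eye(4)
T1 = np.eye(4)
c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
T1[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]

verts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
w = np.array([[1.0, 0.0],    # fully bound to bone 0
              [0.0, 1.0]])   # fully bound to bone 1
posed = lbs(verts, w, np.stack([T0, T1]))
```

With fractional weights (e.g., `[0.5, 0.5]`), a vertex follows a blend of both bone transforms, which is what produces the characteristic (and sometimes unnatural) LBS deformations near joints.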

In contrast to skeleton-subspace deformation methods (i.e., blend skinning), a pose space deformation (PSD) model defines deformations relative to a base shape, where these deformations are a function of the articulated pose. Pose-dependent deformations can also be described in terms of the coefficients of basis vectors, i.e., by learning a low-dimensional PCA basis for each joint’s deformations. There has been a lot of research into learning an efficient, linear, and realistic model from example meshes. “Skinned Multi-Person Linear” (SMPL), however, aims to be a realistic poseable model that covers the space of human shape variation.

SMPL

SMPL aims to make the human body model as simple and standard as possible. The SMPL model can realistically represent a wide range of human body shapes, can be posed with natural pose-dependent deformations, exhibits soft-tissue dynamics, is efficient to animate, and is compatible with existing rendering engines.

(a) The SMPL model is defined by a mean template shape represented by a vector of \(N\) concatenated vertices \(\widehat{T} \in \mathbb{R}^{3N}\) in the zero pose, \(\widehat{\theta}^*\), and a set of blend weights, \(W \in \mathbb{R}^{N \times K}\).

(b) \(B_S(\widehat{\beta})\) is a blend shape function that takes an input vector of shape params \(\widehat{\beta}\) and returns a blend shape capturing the subject’s identity. The function \(J(\widehat{\beta})\) predicts the locations of the \(K\) joints.

(c) \(B_P(\widehat{\theta})\) is a pose-dependent blend shape function that takes as input a vector of pose params \(\widehat{\theta}\) and accounts for the effects of pose-dependent deformations. The corrective blend shapes of these functions are added together in the rest pose.

(d) A blend skinning function \(W (.)\) is applied to rotate the vertices around the estimated joint centers with smoothing defined by blend weights.

The resultant model \(M(\widehat{\beta},\widehat{\theta}; \Phi)\) maps shape and pose params to vertices. Given a particular skinning method, SMPL’s goal is to learn \(\Phi\) to correct for the method’s limitations in modeling the training meshes. For more details, please refer to MPI SMPL 2015.

A key component of this model is that the pose blend shapes are formulated as a linear function of the elements of the part rotation matrices. On top of that, with the low polygon count, a simple vertex topology (for both men and women), clean quad structure, and a standard rig, SMPL makes a realistic learned model accessible to animators.

SMPL decomposes body shape into identity-dependent shape and non-rigid pose-dependent shape based on a vertex-based skinning approach that uses corrective blend shapes (blend shape represented as a vector of concatenated vertex offsets).
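The formulation above can be sketched schematically. In this toy NumPy version, the vertex/joint counts are tiny and the learned parameters \(\Phi\) (template, shape basis, pose basis, blend weights) are random stand-ins, not real SMPL data:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 20, 2                                 # vertices, joints (toy sizes)
T_bar = rng.normal(size=(N, 3))              # mean template shape
S = rng.normal(size=(10, N, 3)) * 0.01       # identity blend shape basis
P = rng.normal(size=(9 * K, N, 3)) * 0.001   # pose corrective blend shape basis
W_weights = np.abs(rng.normal(size=(N, K)))
W_weights /= W_weights.sum(axis=1, keepdims=True)   # convex skinning weights

def smpl_like(beta, pose_feat, joint_transforms):
    """beta: (10,) shape params; pose_feat: (9K,) rotation-matrix features
    (zero in the rest pose); joint_transforms: (K,4,4) rigid joint transforms."""
    # Identity and pose corrective blend shapes are added in the rest pose
    t_posed = T_bar + np.einsum('b,bnc->nc', beta, S) \
                    + np.einsum('p,pnc->nc', pose_feat, P)
    # Standard blend skinning on the corrected template
    homo = np.hstack([t_posed, np.ones((N, 1))])
    blended = np.einsum('nk,kij->nij', W_weights, joint_transforms)
    return np.einsum('nij,nj->ni', blended, homo)[:, :3]

# Zero shape, zero pose features, identity transforms -> the template itself
verts = smpl_like(np.zeros(10), np.zeros(9 * K), np.stack([np.eye(4)] * K))
```

Note how the pose blend shapes enter linearly through `pose_feat`, mirroring SMPL’s key design choice of formulating them as a linear function of the part rotation matrix elements.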

Datasets

Training systems capable of solving complex 3D vision tasks most often require large quantities of data. As labeling data is a costly and complex process, it is important to have mechanisms to design machine learning models that can comprehend the 3D world while being trained without much supervision. 3D shape ground truth is usually either limited or hard to obtain. In the case of human shape reconstruction, SMPL tends to be a popular choice since the representation generates high-quality 3D meshes while the system estimates only a handful of parameters i.e., 72 pose params and 10 shape params. This model also allows optimizing directly for the surface by using a 3D per-vertex loss.
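The 3D per-vertex loss mentioned above is straightforward once a mesh representation like SMPL provides vertices in a canonical order; a minimal sketch (toy data):

```python
import numpy as np

def per_vertex_loss(pred_vertices, gt_vertices):
    """Mean Euclidean distance between predicted and ground-truth vertices.
    pred/gt: (N,3) arrays in the same canonical vertex order."""
    return np.linalg.norm(pred_vertices - gt_vertices, axis=1).mean()

gt = np.zeros((4, 3))
pred = np.zeros((4, 3))
pred[0] = [3.0, 4.0, 0.0]          # one vertex off by 5 units
loss = per_vertex_loss(pred, gt)   # (5 + 0 + 0 + 0) / 4 = 1.25
```

This only works because the representation guarantees dense correspondence; with unordered point clouds one would need a matching-based loss instead.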

Major datasets used are related to either 3D pose estimation or ones with (dense) 2d-3d (image-surface) correspondences.

HumanEva (2009)

HumanEva-I contains calibrated video sequences (4 subjects, 6 common indoor actions) that are synchronized with 3D body poses. Error metrics for computing error in 2D and 3D pose are also provided. HumanEva-II uses an additional camera, as well as multi-action scenarios such as running around a loop while performing actions.

Human3.6m (2014)

Human3.6m: A real image showing multiple people in different poses (left), and a matching sample of actors in similar poses (middle) together with their reconstructed 3D poses from the dataset, displayed using a synthetic 3D model (right).

Human3.6M contains 3.6 million frames of 3D human poses (11 actors, 17 common indoor scenarios). The dataset, together with code for the associated large-scale learning models, features, visualization tools, mixed reality augmentations, and the evaluation server, is available online.

SURREAL (2017)

SURREAL data generation pipeline

SURREAL (Synthetic hUmans foR REAL tasks), built upon Human3.6m, possibly provides one of the most diverse data generation pipelines for training human shape recovery models. Images (~6.5 million) in SURREAL are rendered from 3D sequences of MoCap data. The SMPL model is used to decompose the body into pose and shape parameters, which are sampled independently and rendered against randomly sampled environments to produce images. The generated RGB images are accompanied by 2D/3D poses, surface normals, optical flow, depth images, and body-part segmentation maps.

UP-3D (2017)

UP-3D (Unite the People) is an “in-the-wild” dataset with high-quality 3D body model fits for multiple human pose datasets, obtained with a human in the loop (humans only sort good fits from bad ones). The authors demonstrated that training a pose estimator on the full 91-keypoint dataset improves the state of the art for 3D human pose estimation on the two popular benchmark datasets HumanEva and Human3.6M.

DensePose (2018)

DensePose (by Facebook AI Research) contains manually annotated image-to-surface correspondences (~5 million) for 50K COCO images, built via a multi-stage annotation pipeline. DensePose also proposes a variant of Mask R-CNN to densely regress part-specific UV coordinates within every human region.

JTA (2018)

JTA (Joint Track Auto) is a huge dataset for pedestrian pose estimation and tracking in urban scenarios, created by exploiting the highly photorealistic video game Grand Theft Auto V developed by Rockstar North. The dataset contains 512 full-HD videos (256 for training, 256 for testing), each 30 seconds long and recorded at 30 fps.

3DPW (2018)

3DPW is possibly the first in-the-wild dataset with accurate 3D poses for evaluation (60 in-the-wild videos with 2D and 3D pose annotations, including camera pose and scanned models with different clothing variations). Each sequence comes with its corresponding models; the poses were estimated using IMUs and a handheld 2D video camera.

Tools

Here are a few tools that might come in handy when working with human body recovery models. General-purpose tools like Blender are therefore not listed.

Menpo

Menpo is an open-source software framework for constructing and fitting visual deformable models written in Python. The Menpo project contains the associated tooling that provides end-to-end solutions for 2D and 3D deformable modeling with support for various training and fitting algorithms for deformable modeling. The framework also comes with a tool for bulk annotation for model training as well as pre-trained landmark localization models.

TF-Graphics

Announced in May 2019, TensorFlow Graphics is an add-on TensorFlow library with differentiable geometry and graphics layers as well as 3D viewer functionality (TensorBoard 3D).

This Colab Notebook is a great primer on how to use TF-Graphics. Check out https://github.com/tensorflow/graphics for more details and examples.

TensorBoard 3D support

Current Literature Takeaways

Human shape from silhouettes using generative HKS descriptors and cross-modal neural networks (2017)

Link to the paper

Learns body shape representation from 3D shape descriptors and maps this representation to 3D shapes.

Performs cross-modality learning by first learning representative features through CNNs and then passing them through shared encoding layers, with the objective of regressing to the embedding space. It can leverage multi-view data at training time to boost predictions for a single view at test time.

Starts from 3D shape descriptors (HKS, the Heat Kernel Signature, which is invariant to isometric deformations) and encodes them into a new shape-embedding space, from which the full body mesh is decoded or possible views of the body can be regressed.
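The HKS itself has a compact closed form: for vertex \(v\) and diffusion time \(t\), \(\mathrm{HKS}(v, t) = \sum_i e^{-\lambda_i t}\,\phi_i(v)^2\), where \((\lambda_i, \phi_i)\) are Laplace-Beltrami eigenpairs. A toy NumPy sketch using a graph Laplacian as a stand-in for the mesh Laplace-Beltrami operator (the 4-vertex "mesh" is purely illustrative):

```python
import numpy as np

def hks(adjacency, times):
    """Heat kernel signature per vertex: HKS(v,t) = sum_i exp(-l_i t) phi_i(v)^2."""
    degree = np.diag(adjacency.sum(axis=1))
    laplacian = degree - adjacency
    evals, evecs = np.linalg.eigh(laplacian)       # eigenpairs of the Laplacian
    # (V, T) signature matrix: squared eigenvector entries weighted by exp(-l t)
    return (evecs ** 2) @ np.exp(-np.outer(evals, times))

# 4-vertex cycle graph as a toy "mesh"
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
sig = hks(A, times=np.array([0.1, 1.0, 10.0]))
```

Because the signature depends only on the Laplacian spectrum, symmetric vertices of this toy graph get identical signatures, illustrating the isometry invariance that makes HKS attractive as a shape descriptor.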

End-to-end recovery of human shape and pose (2018)

Link to the paper

Human mesh recovery (HMR) can be trained with and without paired (2D-to-3D) supervision. A convolutional encoder takes in an image and infers the latent 3D representation of the human that minimizes the joint reprojection error. The 3D parameters are then passed to a discriminator that judges whether they come from a real human shape and pose.

Uses adversarial training, backed by a large database of 3D human meshes, to tell whether the generated human body shape and pose are realistic.

3D meshes can be inferred directly from image features (in the wild) with an end-to-end trained network, without using 2D-to-3D keypoint supervision.

Neural Body Fitting: Unifying Deep Learning and Model-Based Human Pose and Shape Estimation (2018)

Link to the paper

A proxy-CNN is used to infer 12 semantic parts from an image of a human body. An encoding CNN processes the semantic part probability maps to predict SMPL body model parameters.

Explicit part representations (part segmentations or joint heatmaps) can be more useful for 3D pose/shape estimation than RGB images and plain silhouettes (on Human3.6M, the error drops from 98.5mm to 27.8mm when using a 12-part segmentation; better segmentations are also found to give better accuracy).
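A part segmentation map is typically fed to the downstream regression network as per-pixel one-hot (or probability) channels; a minimal sketch with toy resolution and labels (the 12-part + background layout is assumed here for illustration):

```python
import numpy as np

def one_hot_parts(seg, n_parts=13):
    """seg: (H,W) integer part labels in [0, n_parts); background is label 0.
    Returns an (H,W,n_parts) one-hot tensor suitable as CNN input channels."""
    return np.eye(n_parts)[seg]

seg = np.zeros((4, 4), dtype=int)
seg[1:3, 1:3] = 5                # a toy "part 5" region in a 4x4 image
maps = one_hot_parts(seg)
```

In practice the proxy CNN outputs soft part probability maps rather than hard labels, but the channel layout consumed by the parameter-regression network is the same.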

Learning to estimate 3D human pose and shape from a single color image (2018)

Link to the paper

A CNN predicts 2D joint heatmaps and silhouette masks from an image. Two further networks estimate the SMPL parameters: pose from the keypoints and shape from the silhouettes, respectively. A differentiable renderer projects the 3D mesh back to the image, which allows fine-tuning the framework by optimizing the consistency of the projection with 2D annotations (2D keypoints or masks).

This framework can thus be trained without any images carrying 3D shape ground truth, and it has fast running times.
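The core of such 2D-supervised training is a reprojection loss. The sketch below is only a schematic weak-perspective version with made-up data; the paper's actual camera model and loss weighting differ:

```python
import numpy as np

def project(joints3d, scale, trans):
    """Weak-perspective projection: s * (x, y) + t."""
    return scale * joints3d[:, :2] + trans

def reprojection_loss(joints3d, keypoints2d, visibility, scale, trans):
    """Mean squared distance between projected joints and 2D annotations,
    counting only visible keypoints."""
    proj = project(joints3d, scale, trans)
    diff = (proj - keypoints2d) ** 2
    return (visibility[:, None] * diff).sum() / max(visibility.sum(), 1)

joints3d = np.array([[0.0, 0.0, 2.0], [1.0, 1.0, 2.0]])  # predicted 3D joints
kp2d = np.array([[0.0, 0.0], [2.0, 2.0]])                # annotated 2D keypoints
vis = np.array([1.0, 1.0])                               # both joints visible
loss = reprojection_loss(joints3d, kp2d, vis, scale=2.0, trans=np.zeros(2))
```

Because every step is differentiable, gradients flow from the 2D annotations back into the 3D parameter predictors, which is what removes the need for 3D ground truth.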

Video-Based Reconstruction of 3D people models (2018)

Link to the paper

Aims to generate 3D shape models (incl. clothing) of people from a video, using a method that transforms the silhouette cones corresponding to the dynamic human silhouettes into a common (canonical) reference frame to obtain a visual hull.

A personalized blend shape model is generated by fitting the pose with the SMPL model, then unposing the silhouette camera rays (i.e., removing human motion), and then optimizing the shape in the canonical T-pose.

Accurate 3D models incl. clothing and hair can be extracted from a video of a person moving in front of a camera such that the person can be seen from all sides.
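The "unposing" step can be illustrated as inverting blend skinning: given the per-vertex blended transforms for the estimated pose, observed posed vertices are mapped back to the canonical T-pose. A toy NumPy sketch (the blended transform is assumed invertible, which holds here for convex weights over near-identity rigid transforms):

```python
import numpy as np

def blend(weights, transforms):
    """Per-vertex blended skinning transforms: (N,4,4)."""
    return np.einsum('nk,kij->nij', weights, transforms)

def apply(mats, vertices):
    """Apply one 4x4 transform per vertex."""
    homo = np.hstack([vertices, np.ones((len(vertices), 1))])
    return np.einsum('nij,nj->ni', mats, homo)[:, :3]

# One static bone and one bone rotated about z
T0 = np.eye(4)
T1 = np.eye(4)
c, s = np.cos(0.3), np.sin(0.3)
T1[:3, :3] = [[c, -s, 0], [s, c, 0], [0, 0, 1]]
transforms = np.stack([T0, T1])

canonical = np.array([[0.5, 0.1, 0.0], [0.2, 0.4, 0.0]])
w = np.array([[0.8, 0.2], [0.3, 0.7]])

G = blend(w, transforms)
posed = apply(G, canonical)                  # forward skinning (the observation)
recovered = apply(np.linalg.inv(G), posed)   # "unposing" back to the T-pose
```

In the paper this inversion is applied to silhouette camera rays rather than known vertices, but the principle of removing the motion via inverse skinning transforms is the same.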

Detailed Human Avatars from Monocular Video (2018)

Link to the paper

Follow-up to the paper above.

First, a medium-level body shape model is estimated based on segmentation, and then more details are added using shape-from-shading. Texture is then computed using a semantic prior and a novel graph-cut optimization strategy.

Integrates facial landmark detections and shape-from-shading from multiple frames. The person’s pose is tracked using SMPL.

Introduces a new texture-stitching binary optimization that efficiently merges the appearance of multiple frames into a single coherent texture.

BodyNet: Volumetric Inference of 3D Human Body Shapes (2018)

Link to the paper

Proposes to directly infer volumetric body shapes from a single image based on an end-to-end trainable network.

Two sub-networks infer 2D pose and 2D part segmentation from an image. These predictions, combined with the RGB features, are fed to another network predicting 3D pose. All sub-networks are then combined into a final network that infers the volumetric shape. The SMPL model is fit to the final volumetric shape for evaluation.

Sub-networks related to 2D pose, 2D segmentation, and 3D pose are pre-trained and fine-tuned jointly for the task of volumetric shape estimation using multi-view re-projection losses.

Learning to Reconstruct People in Clothing from a Single RGB Camera (2019)

Link to the paper

A CNN predicts a 3D human shape in a canonical pose from semantic images of a video, together with per-image pose information computed from 2D joint detections. The pose information is used to refine the shape via ‘render and compare’ optimization using the same predictor. A graph-convolution-based decoder learns per-vertex offsets, and the entire model is end-to-end trainable.

Semantic labels are projected back into the SMPL texture space, and different views are fused using graph-cut-based optimization. This enables fully synthetic generation of paired (2D-3D) training data.

Instance-specific optimization is performed at test time to better model instance-specific details.

DenseBody: Directly Regressing Dense 3D Human Pose and Shape From a Single Color Image (2019)

Link to the paper

Proposes to represent human bodies in UV space, using a UV position map to represent the 3D human mesh.

An encoder-decoder network is trained to regress full human 3D mesh directly from a single color image without any intermediate sub-tasks (segmentation, 2D pose-estimation).
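The UV position map idea can be sketched as scattering each vertex's (x, y, z) into an image at its (u, v) texture coordinate, so the mesh becomes a regular grid a CNN can regress. Toy resolution, vertices, and UVs below are illustrative (a real pipeline would also rasterize face interiors, not just nearest pixels):

```python
import numpy as np

def uv_position_map(vertices, uvs, size=32):
    """vertices: (N,3); uvs: (N,2) in [0,1].
    Nearest-pixel scatter into an (size, size, 3) position map."""
    pos_map = np.zeros((size, size, 3))
    px = np.clip((uvs * (size - 1)).round().astype(int), 0, size - 1)
    pos_map[px[:, 1], px[:, 0]] = vertices     # store xyz at each (v_row, u_col)
    return pos_map

verts = np.array([[0.1, 0.2, 0.3], [0.5, 0.5, 0.5]])
uvs = np.array([[0.0, 0.0], [1.0, 1.0]])
pmap = uv_position_map(verts, uvs)
```

Because the map lives on a fixed 2D grid, an image-to-image encoder-decoder can predict the whole mesh in a single pass, which is what lets DenseBody skip intermediate sub-tasks.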

Authors of the respective papers and articles reserve the rights to images and content. Comments, suggestions, and improvements are welcome. Please e-mail me at s.[lastname]@tum.de.