Understanding the Outputs of Multi-Pose Estimation

Shape of the Outputs

I started with the blog post, then explored the PoseNet tfjs source code. As I understand it, the outputs of multi-pose estimation should be four 4-dimensional arrays, shaped as follows:

Scores: [1] [height] [width] [number of keypoints]

Offsets: [1] [height] [width] [number of keypoints * 2]

Displacements (Forward): [1] [height] [width] [number of edges * 2]

Displacements (Backward): [1] [height] [width] [number of edges * 2]

The last dimension of the offsets and displacements arrays is the number of keypoints/edges multiplied by 2. This is because half of the values are the x components of the vectors and the other half are the y components.
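To make the "two halves" layout concrete, here is a minimal sketch that splits one last-dimension slice into (x, y) pairs. The function name `split_vectors` and the `y_first` flag are my own, not part of PoseNet. The default assumes the y components come first, which is how I read the tfjs source; flip the flag if your model stores them the other way round.

```python
def split_vectors(last_dim_values, count, y_first=True):
    """Split a flat slice of length count * 2 into a list of (x, y) pairs.

    last_dim_values: the values at one [height][width] cell of the offsets
                     or displacements array.
    count:           number of keypoints (offsets) or edges (displacements).
    y_first:         ASSUMPTION: the first half holds y, the second half x.
    """
    if len(last_dim_values) != count * 2:
        raise ValueError(
            f"expected {count * 2} values, got {len(last_dim_values)}"
        )
    first_half = last_dim_values[:count]
    second_half = last_dim_values[count:]
    ys, xs = (first_half, second_half) if y_first else (second_half, first_half)
    return list(zip(xs, ys))


# Three vectors stored as [y0, y1, y2, x0, x1, x2]
vectors = split_vectors([1, 2, 3, 10, 20, 30], count=3)
```

The length check is the same one that would have flagged the broken model file discussed later: a slice that is not exactly twice the vector count cannot be split into x and y halves.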

The scores array is the keypoint heatmap, and the offsets array holds the offset vectors shown in the picture. The blog post explains them in detail.

The displacements arrays (or displacement vectors) are used when we traverse along a part-based graph (its edges) to locate a target keypoint from a known source keypoint. We start by finding a root keypoint, which is the keypoint with the highest score in a local window on the heatmap. The root keypoint then becomes our first known source keypoint.
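The root keypoint search can be sketched roughly as below. This is my own simplified version, not the tfjs code: `scores` is a nested list shaped [height][width][keypoints] (batch dimension dropped), and `score_threshold` and `radius` are hypothetical parameters I picked for illustration.

```python
def is_local_maximum(scores, y, x, keypoint_id, radius=1):
    """True if scores[y][x][keypoint_id] is the highest score in its window."""
    height, width = len(scores), len(scores[0])
    score = scores[y][x][keypoint_id]
    # Clamp the window to the heatmap borders.
    for yy in range(max(y - radius, 0), min(y + radius + 1, height)):
        for xx in range(max(x - radius, 0), min(x + radius + 1, width)):
            if scores[yy][xx][keypoint_id] > score:
                return False
    return True


def find_root_candidates(scores, score_threshold=0.5, radius=1):
    """Return (score, y, x, keypoint_id) tuples, best score first."""
    candidates = []
    for y, row in enumerate(scores):
        for x, cell in enumerate(row):
            for keypoint_id, score in enumerate(cell):
                if score < score_threshold:
                    continue
                if is_local_maximum(scores, y, x, keypoint_id, radius):
                    candidates.append((score, y, x, keypoint_id))
    candidates.sort(reverse=True)
    return candidates


# A 3x3 heatmap with a single keypoint channel and one clear peak.
heatmap = [
    [[0.1], [0.1], [0.1]],
    [[0.1], [0.9], [0.1]],
    [[0.1], [0.1], [0.1]],
]
roots = find_root_candidates(heatmap)
```

The first entry of the returned list is the strongest candidate and can serve as the first known source keypoint.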

Part-based Graph

PoseNet outputs 17 body parts and the parts are chained in a graph. When the source keypoint is a child node, we use the backward displacements array to locate the parent. When the source keypoint is a parent node, we use the forward displacements to locate the child.

The graph is a tree, so with 17 parts there should be 16 edges.
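For reference, here is the graph written out as (parent, child) pairs. The part names and the edge order are as I remember them from the PoseNet tfjs source, so treat them as an assumption and check them against the version you use. The helper `edges_from` is my own and only illustrates when the forward or backward array applies.

```python
PART_NAMES = [
    "nose", "leftEye", "rightEye", "leftEar", "rightEar",
    "leftShoulder", "rightShoulder", "leftElbow", "rightElbow",
    "leftWrist", "rightWrist", "leftHip", "rightHip",
    "leftKnee", "rightKnee", "leftAnkle", "rightAnkle",
]

# (parent, child) pairs; the list index is the edge id.
# ASSUMPTION: order recalled from the tfjs source, verify before relying on it.
POSE_CHAIN = [
    ("nose", "leftEye"), ("leftEye", "leftEar"),
    ("nose", "rightEye"), ("rightEye", "rightEar"),
    ("nose", "leftShoulder"), ("leftShoulder", "leftElbow"),
    ("leftElbow", "leftWrist"), ("leftShoulder", "leftHip"),
    ("leftHip", "leftKnee"), ("leftKnee", "leftAnkle"),
    ("nose", "rightShoulder"), ("rightShoulder", "rightElbow"),
    ("rightElbow", "rightWrist"), ("rightShoulder", "rightHip"),
    ("rightHip", "rightKnee"), ("rightKnee", "rightAnkle"),
]


def edges_from(part):
    """Return (edge_id, target_part, direction) for every edge touching part.

    "forward" means part is the parent, so the forward displacements apply;
    "backward" means part is the child, so the backward displacements apply.
    """
    result = []
    for edge_id, (parent, child) in enumerate(POSE_CHAIN):
        if part == parent:
            result.append((edge_id, child, "forward"))
        elif part == child:
            result.append((edge_id, parent, "backward"))
    return result


left_eye_edges = edges_from("leftEye")
```

Every part except the nose appears exactly once as a child, which is why 17 parts give 16 edges.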

Calculation

To locate a target keypoint on the heatmap from a source keypoint:

targetKeypointHeatmapPositions = sourceKeypointHeatmapPositions + displacementVectors

To find the actual position of a keypoint on an image:

keypointPositions = heatmapPositions * outputStride + offsetVectors
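The two formulas above can be sketched as plain vector arithmetic. This follows the formulas exactly as written in this post; the function names are mine, positions and vectors are (x, y) tuples, and I have not verified here which units a given model uses for its displacement vectors, so check that against your model before relying on it.

```python
def target_heatmap_position(source_heatmap_position, displacement_vector):
    """targetKeypointHeatmapPositions = sourceKeypointHeatmapPositions + displacementVectors"""
    sx, sy = source_heatmap_position
    dx, dy = displacement_vector
    return (sx + dx, sy + dy)


def keypoint_image_position(heatmap_position, offset_vector, output_stride):
    """keypointPositions = heatmapPositions * outputStride + offsetVectors"""
    hx, hy = heatmap_position
    ox, oy = offset_vector
    return (hx * output_stride + ox, hy * output_stride + oy)


# Heatmap cell (5, 7) with offset (3.0, -2.0) and an output stride of 16.
image_xy = keypoint_image_position((5, 7), (3.0, -2.0), output_stride=16)
```

With an output stride of 16, heatmap cell (5, 7) maps to pixel (80, 112) before the offset is added, and the offset then refines it to (83.0, 110.0).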

Problem with multi_person_mobilenet_v1_075_float.tflite

The outputs of multi_person_mobilenet_v1_075_float.tflite from TensorFlow’s webpage are:

Scores: [1] [23] [17] [17]

Offsets: [1] [23] [17] [34]

Displacements (Forward): [1] [23] [17] [64]

Displacements (Backward): [1] [23] [17] [1]

The shapes of the two displacements arrays are different. That doesn't seem right, because with 16 edges the forward and backward displacements arrays should both have 32 values in the last dimension: the x and y of 16 vectors. When I first experimented with the model, I got index-out-of-range exceptions when traversing backward from the root keypoint.
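A small sanity check like the one below would have caught the problem before any traversal. It is only a sketch: `check_displacement_shapes` is my own helper, and it takes the shapes as plain tuples, however you read them from your interpreter.

```python
def check_displacement_shapes(forward_shape, backward_shape, num_edges=16):
    """Return a list of human-readable problems; empty means the shapes look fine."""
    expected_last_dim = num_edges * 2  # x and y for every edge
    problems = []
    for name, shape in (("forward", forward_shape), ("backward", backward_shape)):
        if len(shape) != 4:
            problems.append(f"{name}: expected 4 dimensions, got {len(shape)}")
        elif shape[-1] != expected_last_dim:
            problems.append(
                f"{name}: last dimension is {shape[-1]}, expected {expected_last_dim}"
            )
    if forward_shape != backward_shape:
        problems.append("forward and backward shapes differ")
    return problems


# The shapes reported by the model file discussed above.
issues = check_displacement_shapes((1, 23, 17, 64), (1, 23, 17, 1))
```

For the shapes above this reports three problems: both last dimensions are wrong, and the two shapes do not match each other.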

I found a StackOverflow thread discussing this model file, and others are having issues with it too. Big thanks to the answerer: his version of the converted tflite file generates the expected outputs.