Last year, Huawei’s P30 Pro smartphone shook the world with an astonishing Leica 50x zoom camera, yet this achievement overshadowed another feature of the device that may be equally impressive: a time-of-flight (TOF) rear camera. This kind of TOF camera, popularized earlier by the Xbox Kinect, can be used to build a real-time depth scanner that reconstructs 3D models on the fly! Since I had a difficult time understanding the mathematics of KinectFusion (linked at the end), the original paper that proposed such a method, I wanted to share an overview of the entire pipeline of this 3D reconstruction algorithm. Enjoy!

Note: Only short, illustrative code sketches are presented in this article. The full implementation of the mathematics behind KinectFusion can be long and daunting, so I believe introducing the mathematical concepts and insights on the data structures is much better for understanding. Actual code on GitHub is provided at the end.

Overview

3D reconstruction has been a popular research topic in computer vision for many years. Its ultimate goal is to create 3D models from multiple images, where these images can be RGB or depth-based.

KinectFusion is one of the classic methods, developed in 2011, that uses depth images — images where each pixel stores a depth value instead of RGB values — as the only input to generate an entire 3D model.

Prerequisites

This is my attempt at explaining the methods of KinectFusion in the most straightforward and simplified way possible. Nonetheless, a little background in matrix algebra, such as matrix addition and multiplication, will come in handy for faster and better understanding. Here are the two essential linear algebra concepts required:

1. Matrix Transformation

Imagine a 2D plane with an x-axis and a y-axis, where a point can be represented in Cartesian form as (x, y). Any translation can be represented by a vector (cx, cy), and a point, after translation, will have the new coordinates (x+cx, y+cy). We can therefore use matrix computation to represent a translation. Similarly, a rotation can be represented through a rotation matrix.

By combining the rotation matrix with the translation matrix, we can deduce a matrix equation that transforms any point to its new coordinates after the transformation. For example, we can express any point (x₀, y₀) that is computed through a transformation of (x, y) as the following:
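$$\begin{pmatrix} x_0 \\ y_0 \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} c_x \\ c_y \end{pmatrix}$$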

where θ is the angle rotated counterclockwise and cx, cy are the translations along the x and y-axis respectively. One thing to note is that this equation performs the rotation first and then the translation.

The same idea can be applied to three-dimensional space; the matrices become larger, but the concept stays the same.
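As a quick illustration, here is the 2D case in a few lines of NumPy (a minimal sketch; the function name is my own):

```python
import numpy as np

def transform_2d(point, theta, c):
    """Rotate `point` counterclockwise by `theta` radians, then translate by `c`."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ point + c

# Rotating (1, 0) by 90 degrees and then shifting by (2, 3) gives (2, 4).
print(transform_2d(np.array([1.0, 0.0]), np.pi / 2, np.array([2.0, 3.0])))
```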

2. Matrix Projection

A projection of a real-world object onto the camera image can be viewed through the pinhole camera model.

The pixel on the image plane with coordinates (u, v) corresponds to the point (x, y, z) on the 3D object and can be found through a projection using the concept of similar triangles:

$$u = f\,\frac{x}{z}, \qquad v = f\,\frac{y}{z}$$

Relationship between 3D coordinates and 2D coordinates, where f is the focal length of the camera

Therefore, we can also represent this computation in matrix form, using what is widely known as the camera intrinsic matrix K:
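$$K = \begin{pmatrix} f & 0 & t_x \\ 0 & f & t_y \\ 0 & 0 & 1 \end{pmatrix}$$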

where tx and ty are offsets that move the camera's optical center to the center of the image plane.

The projected pixel coordinate (u, v) can then be calculated by:

$$z \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = K \begin{pmatrix} x \\ y \\ z \end{pmatrix}$$

Projection Equation (Equation 1)

where z is the depth (the z-coordinate) of the point P = (x, y, z).

With these two concepts in mind, we can dive into the four procedures of real-time 3D reconstruction.

Procedure 1 — Surface Measurement

Assume we have acquired a depth map of the object in front of the camera. The first step is to compute a vertex map and a normal map by converting every pixel in the depth map back to its corresponding point in the three-dimensional world. This can be done by inverting the projection and applying it to every pixel:

$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = z\, K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}$$

Rearranging the Projection Equation (Equation 2)

The normal vector of a point can then be retrieved by computing the cross product of the difference vectors between neighboring vertices, since the cross product gives us a vector orthogonal to both. Concretely, n(u, v) = (V(u+1, v) − V(u, v)) × (V(u, v+1) − V(u, v)), normalized to unit length.

It is also important to know that depth maps retrieved by sensors are usually prone to noise, and hence the original KinectFusion paper applies a bilateral filter to each pixel and uses the filtered depth image to compute the vertex and normal maps. This filter, in short, smooths out the noise while preserving the sharp edges that other filters often blur.

This surface measurement procedure will be applied to every depth map retrieved throughout the entire process. The vertex and normal maps are essential elements for camera pose estimation and constructing a 3D model.

Implementation

After obtaining a depth image, treat the position of each pixel as (u, v) and its depth value as z. Obtain K from the specs provided by the device with the TOF camera, and compute the 3D points using Equation 2. For the Xbox Kinect, f = 525.0, tx = 319.5, and ty = 239.5. We can create a class named Points and store a vector of them to represent the point cloud.
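As a rough sketch, here is how this back-projection might look in vectorized NumPy (the function names and the array layout are my own, using plain arrays instead of a Points class; the intrinsics are the Kinect values quoted above):

```python
import numpy as np

# Kinect intrinsics quoted above: f = 525.0, tx = 319.5, ty = 239.5.
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])
K_inv = np.linalg.inv(K)

def depth_to_vertex_map(depth):
    """Back-project an (H, W) depth image into an (H, W, 3) vertex map (Equation 2)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # homogeneous (u, v, 1)
    return depth[..., None] * (pixels @ K_inv.T)          # (x, y, z) = z * K^-1 * (u, v, 1)

def vertex_to_normal_map(vertices):
    """Normals from the cross product of neighboring vertex differences."""
    du = np.diff(vertices, axis=1)[:-1]      # V(u+1, v) - V(u, v)
    dv = np.diff(vertices, axis=0)[:, :-1]   # V(u, v+1) - V(u, v)
    n = np.cross(du, dv)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-12)
```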

Procedure 2 — Mapping and Surface Reconstruction

To explain this procedure, we will first introduce a function called the TSDF (Truncated Signed Distance Function). Let's take a look at the following illustration of a 2D TSDF:

2D TSDF Function

The red curve in the middle represents the surface of the object, and in this case the camera is to the right of the grid of pixels above. The SDF is a function where the value of each pixel is determined by its distance from the surface of the object along the direction to the camera — the closer the pixel is to the surface, the closer its value gets to 0. All values in "front" of the surface (closer to the camera) are positive, while all values "behind" it are negative. The TSDF is similar, except that we set a threshold: if a pixel is too far from the surface, its value is truncated to 1 or -1. This reduces the actual computing time for real-time reconstruction, as we can skip through the 1s.

In the 3-dimensional space, the idea of a TSDF stays the same except that we assign a value to each voxel (3D pixel).

But why do we need a TSDF? Because TSDFs are very easy to fuse. If we have two TSDFs defined over the same voxel grid, all we have to do is average the value of each voxel, and we get a fused TSDF.
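Here is a minimal sketch of such a volume (the class and attribute names are my own). Each voxel keeps a running weighted average of the TSDF values observed so far, which is the weighted form of this averaging used in the paper:

```python
import numpy as np

class TSDFVolume:
    def __init__(self, resolution=256):
        self.tsdf = np.ones((resolution,) * 3, dtype=np.float32)    # 1 = "far in front"
        self.weight = np.zeros((resolution,) * 3, dtype=np.float32)

    def fuse(self, new_tsdf, new_weight=1.0):
        """Fuse a newly measured TSDF grid into the volume by weighted averaging."""
        w = self.weight
        self.tsdf = (w * self.tsdf + new_weight * new_tsdf) / (w + new_weight)
        self.weight = w + new_weight
```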

Implementation

In order to implement the TSDF, we have to imagine that there is a grid of voxels in front of the camera. For example, we can use the very first frame captured and imagine that there are 256x256x256 voxels in front of it, representing a 2x2x2 m³ space. Each voxel can be represented by a class Voxel holding its TSDF value. Now we can calculate which pixel on the image plane a voxel corresponds to if we were to project it. We can then take the depth value of that pixel, compute the signed difference between it and the voxel's own depth, and truncate the result to obtain the TSDF value.

One thing we will notice is that the projected pixel coordinates will generally not be whole numbers. In that case, we have to combine the depth values of the pixels surrounding the projected point (for example, by bilinear interpolation) to obtain a better estimate of the depth, as sketched below.
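A possible helper for this, assuming the standard bilinear weighting over the four surrounding pixels (the function name is mine, and bounds checking is omitted):

```python
import numpy as np

def bilinear_depth(depth, u, v):
    """Depth at a non-integer pixel (u, v), bilinearly weighted over its 4 neighbors."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    du, dv = u - u0, v - v0
    d = depth[v0:v0 + 2, u0:u0 + 2]          # the surrounding 2x2 neighborhood
    return ((1 - du) * (1 - dv) * d[0, 0] + du * (1 - dv) * d[0, 1] +
            (1 - du) * dv * d[1, 0] + du * dv * d[1, 1])
```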

Now, for every camera frame afterwards, if we have the new pose of the camera, we can apply the inverse of the camera pose transformation to each voxel so that the voxels stay in the same space in which we first created them. In simpler words, the same voxel will now be projected onto a different pixel, since the camera has turned or shifted. We can thus keep fusing TSDFs together to reconstruct the model! (If we have the camera pose, of course, which is the most important step in KinectFusion.)
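Putting these pieces together, here is a hypothetical sketch of one integration step over the whole grid (the function and parameter names are my own; missing-depth handling is omitted, the nearest pixel is used instead of interpolation, and a real implementation like the paper's runs this on the GPU):

```python
import numpy as np

def integrate_frame(volume, depth, K, cam_pose, trunc=0.03, voxel_size=2.0 / 256):
    """Fuse one depth frame into a TSDFVolume (from the sketch above).

    depth    : (H, W) depth image in meters
    cam_pose : 4x4 camera-to-world transform (estimated in Procedure 4)
    """
    res = volume.tsdf.shape[0]
    h, w = depth.shape

    # World coordinates of every voxel center in the fixed 2x2x2 m cube.
    grid = np.indices((res, res, res)).reshape(3, -1).T
    world = (grid + 0.5) * voxel_size

    # Bring the voxels into the camera frame via the inverse camera pose.
    cam = (np.linalg.inv(cam_pose) @ np.c_[world, np.ones(len(world))].T).T[:, :3]
    in_front = cam[:, 2] > 0
    cam, grid = cam[in_front], grid[in_front]

    # Project into pixel coordinates with Equation 1 (nearest pixel, no interpolation).
    u = np.round(K[0, 0] * cam[:, 0] / cam[:, 2] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * cam[:, 1] / cam[:, 2] + K[1, 2]).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)

    # Signed distance between the measured depth and the voxel's depth, truncated.
    sdf = depth[v[ok], u[ok]] - cam[ok, 2]
    tsdf = np.clip(sdf / trunc, -1.0, 1.0)

    # Weighted running-average fusion, as in the fuse() sketch above.
    i, j, k = grid[ok].T
    w_old = volume.weight[i, j, k]
    volume.tsdf[i, j, k] = (w_old * volume.tsdf[i, j, k] + tsdf) / (w_old + 1.0)
    volume.weight[i, j, k] = w_old + 1.0
```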

Procedure 3 — Ray Casting

This procedure allows us to visualize the TSDF at any angle we want, by sending imaginary rays through the TSDF to detect the surface it encodes.

In this procedure, we generate an imaginary ray from every pixel that travels toward the 3D point computed through back-projection with Equation 2. The ray begins at the minimum depth of the pixel (as any TSDF value at a smaller depth will be 1), marches with a step size of one voxel length (so no voxel is skipped along the way), and stops when there is a zero crossing (a change from a positive to a negative TSDF value), which marks a surface. The march also stops when there is a negative-to-positive change (a back face) or when it exits the boundaries of the voxel space we created when computing the TSDF.

Notice that through ray casting, we can now get a vertex map of the fusion of all the previous TSDFs. We can also compute a normal map by calculating the gradient of the TSDF values of the nearby voxels at the zero-crossing surface. We will call these the global vertex map and global normal map, as they will be used in the final stage of KinectFusion — Pose Estimation — to estimate the pose of each frame from only a new depth map and the pre-existing TSDF.

Implementation

We can calculate the unit vector from every pixel in the direction of the 3D point that this particular pixel captures. We emit this imaginary ray starting from depth 0, moving a single voxel length at a time, and take the TSDF value of the voxel the ray is currently intersecting. When we hit a zero crossing (a positive TSDF value followed by a negative one), we return the 3D coordinates of that voxel.

Interpolation could also be applied, since the intersection will not land at the center of a voxel every time, but it adds noticeable computation for a real-time system. A simpler approach is to just take the nearest voxel.
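Here is a minimal sketch of this nearest-voxel ray march (the names and the fixed 2x2x2 m cube are my assumptions, carried over from the sketches above):

```python
import numpy as np

def cast_ray(volume, origin, direction, max_steps=2000):
    """March one ray through the TSDF until a positive-to-negative zero crossing.

    origin, direction : ray start point and unit direction in the volume's frame
    Returns the 3D coordinates of the surface hit, or None.
    """
    res = volume.tsdf.shape[0]
    voxel_size = 2.0 / res                 # same 2x2x2 m cube as before
    prev = 1.0
    for n in range(max_steps):
        p = origin + n * voxel_size * direction    # one voxel length per step
        ijk = np.floor(p / voxel_size).astype(int)
        if np.any(ijk < 0) or np.any(ijk >= res):
            return None                    # exited the voxel space
        val = volume.tsdf[tuple(ijk)]
        if prev > 0 and val < 0:           # zero crossing: a surface
            return p                       # nearest-voxel variant, no interpolation
        if prev < 0 and val > 0:           # negative-to-positive: a back face
            return None
        prev = val
    return None
```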

Procedure 4 — Pose Estimation

Pose Estimation is the final and most important component of the KinectFusion algorithm. If the pose of the camera is known every time a new depth frame is taken, we can easily fuse the new depth frame into the TSDF and gradually generate a 3D model as more frames are captured.

KinectFusion uses a method called ICP for pose estimation. ICP stands for Iterative Closest Point, an algorithm that minimizes the difference between two point clouds. The mathematical equations behind ICP can be complex, but the idea is straightforward: gradually rotate and translate one of the two matching point clouds so that the two clouds eventually align in terms of the positions and normals of their points.

This ICP algorithm can thus be used to compare a new point cloud captured by the TOF camera with the aforementioned global vertex map. Since the frame rate of capturing depth images is high, the movement between two frames will be fairly small, and hence ICP provides a fairly accurate estimate of how the camera has moved from frame to frame.

Implementation

We obtain a vertex map and a normal map from Procedure 1 for a newly input depth image. We then obtain the global vertex map through ray casting and the global normal map from the gradients inside the TSDF. We compare the normal maps and vertex maps through the ICP algorithm (a full mathematical explanation of this comparison can be found in Section 3.5 of the KinectFusion paper) to eventually obtain the rotation and translation of the camera when it took the new depth image.
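As a sketch of the core of that comparison, here is one linearized point-to-plane ICP step under the small-angle approximation described in Section 3.5 of the paper. Finding the matched pairs (projective data association) is omitted, and all names are my own:

```python
import numpy as np

def icp_step(src, dst, normals):
    """One linearized point-to-plane ICP step (small-angle approximation).

    src     : (N, 3) matched points from the new frame's vertex map
    dst     : (N, 3) corresponding points from the global vertex map
    normals : (N, 3) normals of the global map points
    Returns a 4x4 incremental transform that moves src toward dst.
    """
    # Minimize sum(((R @ p + t - q) . n)^2) with R ~ I + skew([alpha, beta, gamma]):
    # each row of A is [p x n, n], and the residual is (q - p) . n.
    A = np.hstack([np.cross(src, normals), normals])   # (N, 6)
    r = np.einsum('ij,ij->i', dst - src, normals)      # (N,)
    x, *_ = np.linalg.lstsq(A, r, rcond=None)
    alpha, beta, gamma, tx, ty, tz = x
    return np.array([[1.0,   -gamma, beta,  tx],
                     [gamma,  1.0,  -alpha, ty],
                     [-beta,  alpha, 1.0,   tz],
                     [0.0,    0.0,   0.0,  1.0]])
```

Iterating this step (re-matching points, solving, and composing the returned transforms) yields the camera pose of the new frame relative to the global model.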

Workflow

The overall flow of the program will hence be the following:

1. Obtain the first depth map and perform Procedure 1 and Procedure 2 to compute its vertex and normal maps and store the measurement in the TSDF. Since only one image has been obtained so far, the fusing part of Procedure 2 can be skipped.

2. Obtain a new depth image and perform Procedure 1 on it, while processing the TSDF with Procedure 3, so that we have vertex and normal maps for both the TSDF and the new depth image.

3. Perform Procedure 4 to obtain the difference in pose between the two point clouds, then transform the point cloud generated by the new depth image by this pose and fuse it into the TSDF via Procedure 2.

4. Repeat Steps 2 and 3 until all the depth images are fused together to form the final model.

Code

Attached is my ongoing code. As this program is built almost completely from scratch, there are still minor bugs in the ICP algorithm, which will hopefully be fixed some day.

Code:

https://github.com/ttchengab/KinectFusion

You can also test out your own KinectFusion program with the TUM RGBD-SLAM datasets which can be found here:

https://vision.in.tum.de/data/datasets/rgbd-dataset

End Note

3D reconstruction is a computer vision topic that is not often discussed, due to its complex mathematics. Hopefully this article gives you a basic overview of one of the most influential methods of 3D reconstruction. I hope to introduce other computer vision and deep learning approaches to 3D reconstruction in the near future!

KinectFusion Paper:

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/ismar2011.pdf

If you enjoyed this article, make sure you 👏 so that more people will have a chance to read it too!