In our last blog post (part 1), we took a look at how algorithms detect keypoints in camera images. These keypoints form the basis of our world tracking and environment recognition. But for Mixed Reality, that alone is not enough: we also have to calculate a keypoint's 3D position in the real world, which is often derived from the spatial distances between the device and multiple keypoints. This process is called Simultaneous Localization and Mapping (SLAM), and it is responsible for all the world tracking we see in ARCore and ARKit.





What we will cover today:

How ARCore and ARKit do their SLAM/Visual Inertial Odometry

Can we D.I.Y. our own SLAM with reasonable accuracy, to understand the process better

Sensing the world: as a computer

When we start any augmented reality application, on mobile or elsewhere, the first thing it tries to do is detect a plane. When you first start an MR app built on ARKit or ARCore, the system doesn't know anything about its surroundings. It starts processing data from the camera and pairs it up with data from other sensors.

Once it has those data, it tries to do the following two things:

Build a point cloud mesh of the environment by building a map

Assign a relative position of the device within that perceived environment

From our previous article, we know it's not always easy to build this map from unique feature points and to maintain it. However, that becomes easy in certain scenarios if you have the freedom to place beacons at different known locations. Something we did at Mozfest 2016, when Mozilla still had the Magnets project, which we utilized as our beacons. A similar approach is used in a few museums to provide turn-by-turn navigation to points of interest as their indoor navigation system. However, Augmented Reality systems don't have this luxury.
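The beacon idea above can be sketched with a little code. Given the known positions of a few beacons and the measured distances to each, the device's position falls out of a least-squares solve. This is a minimal 2D sketch; the function name and the setup are my own, not taken from any AR SDK:

```python
import numpy as np

def trilaterate(beacons, distances):
    """Estimate a 2D position from distances to beacons at known locations.

    Subtracting the first beacon's circle equation from the others yields
    a linear system A x = b, which we solve by least squares.
    """
    beacons = np.asarray(beacons, dtype=float)
    d = np.asarray(distances, dtype=float)
    p0, d0 = beacons[0], d[0]
    A = 2 * (beacons[1:] - p0)
    b = (d0**2 - d[1:]**2
         + np.sum(beacons[1:]**2, axis=1) - np.sum(p0**2))
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos
```

With three beacons and exact distances this recovers the position exactly; with noisy distances and more beacons, the least-squares solve averages the error out, which is precisely the luxury a markerless AR system does not have.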

A little saga about relationships

We will start with a map... about relationships. Or rather, "A Stochastic Map For Uncertain Spatial Relationships" by Smith et al.

Aligning the Virtual World

In the real world, you have precise and correct information about the exact location of every object. In the AR world, however, that is not the case. To understand the problem, let's assume we are in an empty room and our mobile has detected a reliable unique anchor (A) (or that can be a stationary beacon), and our position is at (B). In a perfect situation, we know the distance between A and B, and if we want to move towards a third point (C), we can infer exactly how we need to move.

Unfortunately, in the world of AR and SLAM we have to work with imprecise knowledge about the positions of A and C. This results in uncertainties and the need to continually correct the locations.

The points have a relative spatial relationship with each other, which allows us to derive a probability distribution over every possible position. Some of the common methods to deal with the uncertainty and correct positioning errors are the Kalman Filter (this is what we used at Mozfest), Maximum a Posteriori estimation and Bundle Adjustment. Since these estimations are not perfect, every new sensor update also has to update the estimation model.
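The Kalman Filter mentioned above can be illustrated with a deliberately tiny example: tracking a single scalar position from noisy readings. Each predict step grows the uncertainty, each measurement shrinks it again. This is a minimal sketch with made-up noise parameters, not what ARCore or ARKit actually run:

```python
def kalman_1d(z_measurements, process_var=1e-3, meas_var=0.1**2):
    """Minimal 1D Kalman filter: track a scalar position from noisy readings."""
    x, p = 0.0, 1.0               # initial state estimate and its variance
    estimates = []
    for z in z_measurements:
        p = p + process_var       # predict: uncertainty grows between updates
        k = p / (p + meas_var)    # Kalman gain: how much to trust the reading
        x = x + k * (z - x)       # correct the estimate with the measurement
        p = (1 - k) * p           # uncertainty shrinks after the update
        estimates.append(x)
    return estimates
```

Feeding it a stream of readings around a fixed position makes the estimate converge towards that position, which is exactly the "every new sensor update also updates the estimation model" loop described above, in one dimension.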

To map our surroundings reliably in Augmented Reality, we need to continually update our measurement data, under the assumption that every sensory input we get contains some inaccuracy. We can take help from Lu and Milios in their paper "Globally Consistent Range Scan Alignment for Environment Mapping" to understand the issue.

Image credits: Lu, F., & Milios, E. (1997). Globally consistent range scan alignment for environment mapping

Here in figure a, we see how going from position P1...Pn accumulates small measurement errors over time, until the resulting environment map is wrong. But when we align the scans, as in figure b, the result is considerably improved. To do that, the algorithm keeps track of all local frame data and a network of spatial relations among them.
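The correction in figure b can be sketched in one dimension. Poses integrated from odometry drift, but adding a loop-closure constraint ("I am back where I started") and solving all constraints together by least squares spreads the error over the whole trajectory. This is a toy version of the idea, with my own function names, not Lu and Milios' actual algorithm:

```python
import numpy as np

def optimize_poses(odometry, loop_closures):
    """Least-squares correction of 1D poses from relative constraints.

    odometry: measured steps pose[i+1] - pose[i]
    loop_closures: (i, j, offset) constraints meaning pose[j] - pose[i] = offset
    """
    n = len(odometry) + 1
    rows, b = [], []
    for i, step in enumerate(odometry):           # chain constraints
        r = np.zeros(n); r[i + 1], r[i] = 1, -1
        rows.append(r); b.append(step)
    for i, j, off in loop_closures:               # loop-closure constraints
        r = np.zeros(n); r[j], r[i] = 1, -1
        rows.append(r); b.append(off)
    r = np.zeros(n); r[0] = 1                     # pin the first pose at 0
    rows.append(r); b.append(0.0)
    poses, *_ = np.linalg.lstsq(np.array(rows), np.array(b), rcond=None)
    return poses
```

Without the loop closure, the accumulated step errors stay in the map (figure a); with it, the solver distributes them across all poses (figure b).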

A common problem at this point is deciding how much data to store so that the above keeps working correctly. To reduce complexity, the algorithm often reduces the number of keyframes it stores.
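A common heuristic for thinning keyframes is to keep a new one only when the camera has moved or turned enough since the last stored keyframe. A minimal sketch of that idea, with thresholds I picked arbitrarily for illustration:

```python
import math

def select_keyframes(poses, min_dist=0.25, min_angle=math.radians(15)):
    """Thin a trajectory: keep a frame only if the camera moved or turned enough.

    poses: list of (x, y, heading) tuples; returns indices of kept keyframes.
    """
    kept = [0]                                    # always keep the first frame
    for i, (x, y, th) in enumerate(poses[1:], start=1):
        kx, ky, kth = poses[kept[-1]]
        moved = math.hypot(x - kx, y - ky)
        # wrap the heading difference into [-pi, pi] before comparing
        turned = abs(math.atan2(math.sin(th - kth), math.cos(th - kth)))
        if moved >= min_dist or turned >= min_angle:
            kept.append(i)
    return kept
```

A slow, smooth camera path collapses to a handful of keyframes, while fast motion keeps more of them, which is the trade-off between map fidelity and storage the paragraph above describes.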





Let's build the map, a.k.a. SLAM

To make Mixed Reality feasible, SLAM has the following challenges to handle:

Monocular camera input

Real-time

Drift

Skeleton of SLAM

How do we deal with these in a Mixed Reality scene?

We start with the principles laid out by Cadena et al. in their paper "Past, Present, and Future of Simultaneous Localization and Mapping: Toward the Robust-Perception Age". From that paper, we can see that the standard architecture of SLAM looks something like this:

Image Credit: Cadena et al

If we deconstruct the diagram, we get the following four modules:

Sensor: On mobiles, this is primarily the camera, augmented by the accelerometer, the gyroscope and, depending on the device, a light sensor. Apart from Project Tango enabled phones, no Android device had a depth sensor.

Front End: The feature extraction and anchor identification happens here, as we described in the previous post.

Back End: Does error correction to compensate for drift, and also takes care of localizing the pose model and the overall geometric reconstruction.

SLAM estimate: The result, containing the tracked features and their locations.

To better understand this, we can take a look at one of the open source implementations of SLAM.
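The four modules above can be wired together as a skeleton. Every name here is my own invention for illustration; it is the shape of the pipeline from Cadena et al.'s diagram, not the API of ARCore, ARKit or any real SLAM system:

```python
from dataclasses import dataclass, field

@dataclass
class SlamEstimate:
    """The SLAM estimate: device pose plus the tracked landmarks."""
    pose: tuple = (0.0, 0.0, 0.0)
    landmarks: dict = field(default_factory=dict)   # feature id -> position

class FrontEnd:
    """Feature extraction and data association (see part 1)."""
    def process(self, frame):
        # a real system would detect keypoints and match them to the map;
        # here we just pass through whatever features the frame carries
        return frame.get("features", [])

class BackEnd:
    """Error correction: filtering or bundle adjustment of pose and map."""
    def optimize(self, estimate, features):
        # a real system would run a Kalman filter or bundle adjustment;
        # here we simply fold the new features into the landmark map
        for fid, pos in features:
            estimate.landmarks[fid] = pos
        return estimate

class SlamPipeline:
    def __init__(self):
        self.front, self.back = FrontEnd(), BackEnd()
        self.estimate = SlamEstimate()

    def on_sensor_data(self, frame):
        features = self.front.process(frame)        # sensor -> front end
        self.estimate = self.back.optimize(self.estimate, features)
        return self.estimate                        # -> SLAM estimate
```

The point of the sketch is the data flow: sensor frames go in, the front end turns them into features, the back end reconciles them with the existing estimate, and the SLAM estimate is what the rendering layer finally consumes.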





D.I.Y. SLAM: Taking a peek at ORB-SLAM

To get a hands-on understanding of how SLAM works, let's take a look at a recent algorithm by Montiel et al. called ORB-SLAM. We will use the code of its successor, ORB-SLAM2. The algorithm is available on GitHub under GPL-3.0, and I found this excellent blog which goes into the nifty details of how to run ORB-SLAM2 on your own computer. I highly encourage you to read it to avoid problems during setup.

His talk is also available here and is very interesting to watch.