Hello again! Andrew Melim here to continue our blog series on the Constellation controller tracking system for Oculus Quest and Rift S. In our first post, I shared how my team has approached increasing fidelity with constellation-tracked controllers. Today we are going to continue our deep dive into the continuous improvements we are shipping to the controller tracking system. We'll cover a recent update where we began applying a fresh approach to how we handle LED matching, a key challenge we need to solve in the constellation tracking algorithm. Have any questions? Feel free to share them in the comments section below.

Going from 2D blobs to 3D pose

There is a classic problem in photogrammetry called the Perspective-n-Point (PnP) problem, which seeks the pose of a camera given an image containing points whose 3D positions are known. The problem we solve to make controller tracking work is the reverse: we have a good estimate of the headset camera pose, and we want to find the pose of the controller, which has known 3D points on it (the LEDs).
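To make the geometry concrete, here is a minimal Python sketch of the forward model that a pose solver of this kind has to invert: given a candidate controller pose, predict where each LED lands in the image. It assumes an ideal pinhole camera with no lens distortion, and the names (`project_leds`, `focal_px`, `cx`, `cy`) and numbers are illustrative assumptions, not taken from the tracking system.

```python
def project_leds(leds, rotation, translation, focal_px, cx, cy):
    """Project LED positions (controller frame, metres) into pixel coordinates.

    rotation is a 3x3 row-major matrix and translation a 3-vector that together
    map controller coordinates into the camera frame (hypothetical convention).
    """
    pixels = []
    for p in leds:
        # Rigid transform: camera-frame point = R * p + t.
        x, y, z = (
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        if z <= 0:
            pixels.append(None)  # Behind the camera, so not visible.
            continue
        # Pinhole projection (assumed model, no distortion).
        pixels.append((focal_px * x / z + cx, focal_px * y / z + cy))
    return pixels


# Illustrative values: controller 1 m in front of the camera, not rotated.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
leds = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
pixels = project_leds(leds, identity, (0.0, 0.0, 1.0), 500.0, 320.0, 240.0)
```

Solving for the controller pose amounts to finding the rotation and translation for which these predicted pixels line up with the detected blobs.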

Once we detect each of the IR LEDs in the camera image, we need to determine a mapping from each blob to the specific LED on the controller that emitted the light. We use the known LED positions and their mapping to blobs to solve for the position and orientation of the controller. Each part of this process has to be performed efficiently since we only use data on the headset to perform tracking.

During this process of matching blobs to LEDs, we generate a set of hypotheses about which possible matches are likely correct. To compute an accurate pose, a minimum number of matches is required. Our algorithms are able to probabilistically determine the most likely correct matches based on a wide range of variables we process. An incorrect mapping bakes in errors that lead directly to incorrect pose estimation, so it’s important that we get enough accurate matches in each frame.

Improved matching with multiple views

To solve the LED matching problem, we implemented multiple methods that run on every frame, but broadly they fall into two main classes. The first is a class of statistically exhaustive methods that brute-force the solution, which we colloquially refer to as “brute matching”. These methods are used when we have no prior information about the controller's position and orientation. When we do have the controller’s pose from the previous camera image, we can use it to search in a much smaller window, a class of methods we call “proximity matching”.
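As a rough illustration of the proximity idea, the sketch below takes the LED positions predicted in the image from the previous pose and pairs each with the nearest detected blob inside a small search radius. The greedy nearest-first strategy, the function name `proximity_match`, and the `max_dist_px` radius are assumptions made for this example, not a description of the production matcher.

```python
import math


def proximity_match(predicted, blobs, max_dist_px=10.0):
    """Greedily pair predicted LED projections with nearby blobs.

    predicted: led_id -> (u, v) predicted from the prior pose
    blobs:     list of detected blob centres (u, v)
    Returns blob index -> led_id for every accepted pair.
    """
    candidates = []
    for led_id, (pu, pv) in predicted.items():
        for blob_idx, (bu, bv) in enumerate(blobs):
            dist = math.hypot(pu - bu, pv - bv)
            if dist <= max_dist_px:  # Only search a small window.
                candidates.append((dist, blob_idx, led_id))

    matches, used_leds = {}, set()
    # Accept the closest pairs first; each blob and LED is used at most once.
    for dist, blob_idx, led_id in sorted(candidates):
        if blob_idx in matches or led_id in used_leds:
            continue
        matches[blob_idx] = led_id
        used_leds.add(led_id)
    return matches


predicted = {0: (100.0, 100.0), 1: (200.0, 200.0)}
blobs = [(102.0, 101.0), (400.0, 400.0), (198.0, 203.0)]
matches = proximity_match(predicted, blobs)
```

Here the blob at (400, 400) falls outside every search window and is left unmatched, which is the kind of stray detection a small window helps reject.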

Initially, the pipeline searched for blobs in one camera at a time. This required both the hypothesis-generating blobs and the validating blobs to be in the same camera, which meant at least 4 blobs had to be detected in one camera for a successful match. Furthermore, it introduced a high likelihood of contradictory matching results across cameras. To combat this, the new matching pipeline takes advantage of the stereo-camera calibration data when evaluating LED-blob correspondences in different cameras. This allows us to rely on the relationship of each camera to one another to help resolve uncertainties.

This method improves scenarios where no individual camera has enough matches, but the camera images combined do. The scenarios that suffered most were those where the controllers are near the edge of the field of view, too far away, or too close, or where there is occlusion.

Since matching results are evaluated overall instead of per camera, this also rules out inconsistent matching results. It helps expand the trackable volume of the controllers and reduces 3DOF instances due to tracking loss near the headset or near the edges of the field of view, enabling a smoother tracking experience.
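The following Python sketch shows one way to picture this pooled evaluation: a single pose hypothesis is scored against the blobs in every camera, and the inliers are summed instead of being judged camera by camera. It is a simplified illustration under stated assumptions: cameras are modeled as ideal pinholes whose axes are aligned with the headset and which differ only by position, and all names and values (`count_inliers`, `tol_px`, the camera offsets) are hypothetical.

```python
def project(point, cam_offset, focal_px=500.0, cx=320.0, cy=240.0):
    """Pinhole projection for a camera located at cam_offset in the headset frame.

    Assumes camera axes aligned with the headset axes (a simplification; real
    calibration data would also include rotation and lens distortion).
    """
    x, y, z = (point[i] - cam_offset[i] for i in range(3))
    return (focal_px * x / z + cx, focal_px * y / z + cy)


def count_inliers(leds, cameras, matches, tol_px=2.0):
    """Count LED-blob matches that agree with one shared pose hypothesis.

    leds:    led_id -> 3D LED position in the headset frame under the hypothesis
    cameras: camera name -> camera position in the headset frame
    matches: camera name -> {led_id: observed blob centre (u, v)}
    """
    counts = {}
    for name, cam_matches in matches.items():
        inliers = 0
        for led_id, (bu, bv) in cam_matches.items():
            pu, pv = project(leds[led_id], cameras[name])
            if abs(pu - bu) <= tol_px and abs(pv - bv) <= tol_px:
                inliers += 1
        counts[name] = inliers
    return counts


leds = {
    0: (0.00, 0.00, 0.5),
    1: (0.02, 0.00, 0.5),
    2: (0.00, 0.02, 0.5),
    3: (0.02, 0.02, 0.5),
}
cameras = {"left": (-0.05, 0.0, 0.0), "right": (0.05, 0.0, 0.0)}
matches = {
    "left": {0: (370.5, 240.2), 1: (389.6, 239.9)},
    "right": {2: (270.3, 260.1), 3: (290.0, 259.5)},
}
counts = count_inliers(leds, cameras, matches)
total = sum(counts.values())
```

In this example each camera sees only two LEDs, so neither reaches four matches alone, but the pooled total does. Because every match is checked against the same hypothesis, the cameras cannot settle on contradictory answers.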

Computing more with less data

Theoretically, given only one camera image, you need at least three LEDs to be in view to solve for the controller’s pose. However, using only three points leads to multiple possible solutions, so we require at least four correct matches to robustly solve for the pose.

It is fairly common for cameras to see only 3, 2, or even 1 LED at a time, so we designed solvers that use other information and work with fewer LEDs in view. These new solvers allow us to track in those particularly challenging orientations:

P2P pose solver

Uses 2 matches and the orientation from the prior pose to solve for the position component of the pose.

Reduces minimum matching requirement to 3 matches (2 hypothesis generating matches and 1 validating match).

P1P pose solver

Uses the predicted pose to validate matches directly, instead of validating via statistical or nearest neighbor predictions.

Reduces minimum matching requirement to 2 matches (ensures translation and scale are constrained properly for stereo-pose optimization).

Uses position-only stereo-pose optimization when fewer than 4 LEDs are available, to avoid under-constraining orientation.

After extensive experimentation, we discovered that both P2P and P1P solvers require very accurate prior information (good tracking state and accurate prediction), as they rely on the predicted pose as a hard constraint in solving the problem.
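To see why a trusted orientation lets two matches pin down position, consider the sketch below. With the orientation fixed, each LED-blob match gives two equations that are linear in the unknown translation, so two matches give four equations for three unknowns. This is a minimal least-squares illustration of that idea, not necessarily how the P2P solver is implemented; it assumes normalized, undistorted image coordinates, and `solve_position` and `det3` are hypothetical helper names.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve_position(rotated_leds, observations):
    """Solve for translation t given LEDs already rotated by the prior orientation.

    rotated_leds: LED positions q = R * p, expressed along the camera axes
    observations: normalized image coordinates (x, y) of the matched blobs
    Projection gives x = (qx + tx) / (qz + tz), which rearranges to the linear
    equations tx - x * tz = x * qz - qx and ty - y * tz = y * qz - qy.
    """
    ata = [[0.0] * 3 for _ in range(3)]  # Normal equations: (A^T A) t = A^T b.
    atb = [0.0] * 3
    for (qx, qy, qz), (x, y) in zip(rotated_leds, observations):
        rows = [((1.0, 0.0, -x), x * qz - qx),
                ((0.0, 1.0, -y), y * qz - qy)]
        for a, b in rows:
            for i in range(3):
                atb[i] += a[i] * b
                for j in range(3):
                    ata[i][j] += a[i] * a[j]
    d = det3(ata)
    if abs(d) < 1e-12:
        raise ValueError("degenerate configuration: observations coincide")
    # Cramer's rule on the 3x3 system.
    t = []
    for k in range(3):
        m = [row[:] for row in ata]
        for i in range(3):
            m[i][k] = atb[i]
        t.append(det3(m) / d)
    return tuple(t)


# Synthetic check: generate observations from a known translation, then recover it.
true_t = (0.1, -0.05, 0.6)
rotated_leds = [(0.03, 0.0, 0.01), (-0.02, 0.04, -0.01)]
observations = [
    ((q[0] + true_t[0]) / (q[2] + true_t[2]), (q[1] + true_t[1]) / (q[2] + true_t[2]))
    for q in rotated_leds
]
estimate = solve_position(rotated_leds, observations)
```

The sketch also shows why the prior has to be accurate: the orientation enters the equations as a fixed input, so any error in it goes straight into the recovered position with nothing to correct it.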

In brute matching, however, prior information is less reliable, so we observed issues like mismatching with the wrong controller or matching to ceiling lights. This led us to develop robust state machines that allow us to transition between the various solvers, ensuring we use the correct approach for the extremely wide range of difficult motions that people encounter every day while playing on their Quest or Rift S devices.
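A toy version of such a state machine might look like the Python sketch below. The match-count thresholds (4, 3, and 2) come from the solver descriptions above, but the two states, the transition rules, and the names `TrackingState` and `step` are assumptions for illustration; a real system would likely weigh more signals, such as how accurate the pose prediction is.

```python
from enum import Enum


class TrackingState(Enum):
    LOST = "lost"          # No reliable prior pose is available.
    TRACKING = "tracking"  # The previous frame produced a trusted pose.


def step(state, num_matches):
    """Pick a solver for this frame and return (solver name, next state)."""
    if state is TrackingState.TRACKING:
        # A reliable prior lets us fall back to solvers that need fewer matches.
        if num_matches >= 4:
            return "full_pose", TrackingState.TRACKING
        if num_matches == 3:
            return "p2p", TrackingState.TRACKING
        if num_matches == 2:
            return "p1p", TrackingState.TRACKING
        return "none", TrackingState.LOST
    # Without a reliable prior, only accept a full solution from 4+ matches,
    # since P2P and P1P treat the predicted pose as a hard constraint.
    if num_matches >= 4:
        return "full_pose", TrackingState.TRACKING
    return "none", TrackingState.LOST


state = TrackingState.LOST
solvers_used = []
for num_matches in [4, 3, 2, 1, 3, 4]:
    solver, state = step(state, num_matches)
    solvers_used.append(solver)
```

Note how three matches are enough while tracking but not right after tracking is lost: the same measurement is treated differently depending on how much the prior can be trusted.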

While these improvements have helped significantly, there is still more room for us to improve the overall experience. In our next blog post, we will look at how we tackled another very challenging problem with constellation tracking. Stay tuned!

- Andrew Melim