Making the VR experience simple and portable was the main goal of the Oculus Quest, and it definitely accomplishes that. But going from things in the room tracking your headset to your headset tracking things in the room was a complex process. I talked with Facebook CTO Mike Schroepfer (“Schrep”) about the journey from “outside-in” to “inside-out.”

When you move your head and hands around with a VR headset and controllers, some part of the system has to track exactly where those things are at all times. There are two ways this is generally attempted.

One approach is to have sensors in the room you’re in, watching the devices and their embedded LEDs closely — looking from the outside in. The other is to have the sensors on the headset itself, which watches for signals in the room — looking from the inside out.

Both have their merits, but if you want a system to be wireless, your best bet is inside-out, since you don’t have to wirelessly send signals between the headset and the computer doing the actual position tracking, which can add hated latency to the experience.

Facebook and Oculus set a goal a few years back not just to achieve inside-out tracking, but to make it as good as or better than the wired systems that run on high-end PCs. And it would have to run anywhere, not just in a set scene with boundaries set by beacons or something, and do so within seconds of putting it on. The result is the impressive Quest headset, which succeeded with flying colors at this task (though it’s not much of a leap in others).

What’s impressive about it isn’t just that it can track objects around it and translate that to an accurate 3D position of itself, but that it can do so in real time on a chip with a fraction of the power of an ordinary computer.

“I’m unaware of any system that’s anywhere near this level of performance,” said Schroepfer. “In the early days there were a lot of debates about whether it would even work or not.”

The term for what the headset does is simultaneous localization and mapping, or SLAM. It basically means building a map of your environment in 3D while also figuring out where you are in that map. Naturally robots have been doing this for some time, but they generally use specialized hardware like lidar, and have a more powerful processor at their disposal. All the new headsets would have are ordinary cameras.

“In a warehouse, I can make sure my lighting is right, I can put fiducials on the wall, which are markers that can help reset things if I get errors. That’s like a dramatic simplification of the problem, you know?” Schroepfer pointed out. “I’m not asking you to put fiducials up on your walls. We don’t make you put QR codes or precisely positioned GPS coordinates around your house.”

“It’s never seen your living room before, and it just has to work. And in a relatively constrained computing environment: we’ve got a mobile CPU in this thing. And most of that mobile CPU is going to the content, too. The robot isn’t playing Beat Saber at the same time it’s cruising through the warehouse.”

It’s a difficult problem in multiple dimensions, then, which is why the team has been working on it for years. Ultimately, several factors came together. One was simply that mobile chips became powerful enough that something like this is even possible. But Facebook can’t really take credit for that.

More important was the ongoing work in computer vision that Facebook’s AI division has been doing under the eye of Yann LeCun and others there. Machine learning models frontload a lot of the processing necessary for computer vision problems, and the resulting inference engines are lighter weight, if not necessarily well understood. Putting efficient, edge-oriented machine learning to work inched this problem closer to having a possible solution.

Most of the labor, however, went into coordinating the multiple systems that have to work together in real time to do the SLAM work.

“I wish I could tell you it’s just this really clever formula, but there’s lots of bits to get this to work,” Schroepfer said. “For example, you have an IMU on the system, an inertial measurement unit, and that runs at a very high frequency, maybe 1000 Hz, much higher than the rest of the system [i.e. the sensors, not the processor]. But it has a lot of error. And then we run the tracker and mapper on separate threads. And actually we multi-threaded the mapper, because it’s the most expensive part [i.e. computationally]. Multi-threaded programming is a pain to begin with, but you do it across these three, and then they share data in interesting ways to make it quick.”

Schroepfer caught himself here: “I’d have to spend like three hours to take you through all the grungy bits.”
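To make the IMU trade-off concrete, here is a minimal sketch of the general idea of pairing a fast but drifting sensor with a slower, steadier one. It is not Oculus’ method: it is a one-dimensional toy using a simple alpha-beta style correction, and every name and number in it (the 30 Hz camera rate, the bias, the blend factor) is an assumption chosen for illustration. Only the roughly 1000 Hz IMU rate comes from the interview.

```python
import random


def fuse_imu_with_camera(true_velocity=0.5, duration_s=2.0, imu_hz=1000,
                         camera_hz=30, accel_bias=0.2, blend=0.2, seed=0):
    """Toy 1D comparison of IMU-only dead reckoning vs. camera-corrected fusion.

    All parameters are hypothetical. Returns the absolute position error (in
    meters) of the IMU-only estimate and of the fused estimate.
    """
    rng = random.Random(seed)
    dt = 1.0 / imu_hz
    steps_per_frame = imu_hz // camera_hz  # IMU samples between camera frames
    frame_dt = steps_per_frame * dt

    true_pos = 0.0
    imu_only_pos = fused_pos = 0.0
    imu_only_vel = fused_vel = true_velocity

    for step in range(1, int(duration_s * imu_hz) + 1):
        # The device moves at constant velocity, so true acceleration is zero.
        true_pos += true_velocity * dt

        # The IMU is fast but wrong: it reports a constant bias plus noise.
        accel = accel_bias + rng.gauss(0.0, 0.05)

        # Dead reckoning: integrate acceleration twice. Errors accumulate.
        imu_only_vel += accel * dt
        imu_only_pos += imu_only_vel * dt
        fused_vel += accel * dt
        fused_pos += fused_vel * dt

        # Every camera frame, nudge the fused estimate toward the slower but
        # drift-free camera measurement (alpha-beta style correction).
        if step % steps_per_frame == 0:
            camera_pos = true_pos + rng.gauss(0.0, 0.002)
            error = camera_pos - fused_pos
            fused_pos += blend * error
            fused_vel += blend * error / frame_dt

    return abs(imu_only_pos - true_pos), abs(fused_pos - true_pos)


if __name__ == "__main__":
    imu_err, fused_err = fuse_imu_with_camera()
    print(f"IMU only: {imu_err:.3f} m off, fused: {fused_err:.3f} m off")
```

The point of the sketch is only the shape of the problem: the high-rate sensor fills in the gaps between camera frames, and the camera keeps the high-rate sensor from wandering off. A real system tracks full 3D position and orientation and, as Schroepfer says, spreads the work across several threads.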

Part of the process of creating Insight, as the tracking system is called, was also extensive testing, for which they used a commercial OptiTrack motion tracking rig as ground truth. They’d track a user playing with the headset and controllers, using the OptiTrack setup to measure the precise motions made.

To see how the algorithms and sensing system performed, they’d basically play back the data from that session to a simulated version of it: video of what the camera saw, data from the IMU and any other relevant metrics. If the simulation was close to the ground truth they’d collected externally, good. If it wasn’t, the engineers would adjust the system’s parameters and they’d run the simulation again. Over time, the smaller, more efficient system drew closer and closer to producing the same tracking data the OptiTrack rig had recorded.
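The replay-and-compare loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the team’s tooling: `replay` is a hypothetical stand-in tracker (plain exponential smoothing), `smoothing` is a made-up tunable parameter, and root-mean-square error is just one plausible way to score an estimated trajectory against ground truth.

```python
import math


def trajectory_rmse(estimated, ground_truth):
    """Root-mean-square distance between two equal-length lists of 3D points."""
    if len(estimated) != len(ground_truth):
        raise ValueError("trajectories must have the same number of samples")
    squared = sum(math.dist(e, g) ** 2 for e, g in zip(estimated, ground_truth))
    return math.sqrt(squared / len(ground_truth))


def replay(recorded_positions, smoothing):
    """Hypothetical tracker: exponentially smooth noisy recorded positions.

    Stands in for feeding a recorded session back through the real system.
    """
    estimate = recorded_positions[0]
    trajectory = []
    for sample in recorded_positions:
        estimate = tuple(smoothing * s + (1.0 - smoothing) * e
                         for s, e in zip(sample, estimate))
        trajectory.append(estimate)
    return trajectory


def tune(recorded_positions, ground_truth, candidates):
    """Replay the session once per candidate setting and keep the best score."""
    scores = {c: trajectory_rmse(replay(recorded_positions, c), ground_truth)
              for c in candidates}
    best = min(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Made-up session: a straight-line motion, recorded with jitter on one axis.
    truth = [(0.01 * i, 0.0, 0.0) for i in range(100)]
    recorded = [(x, 0.05 if i % 2 else -0.05, z)
                for i, (x, _, z) in enumerate(truth)]
    best, scores = tune(recorded, truth, [0.1, 0.3, 0.5, 1.0])
    print("best smoothing:", best, "error (m):", round(scores[best], 4))
```

Because the session is recorded once and replayed many times, each parameter change is judged against exactly the same motion, which is what makes this kind of comparison repeatable.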

Ultimately it needed to be as good as or better than the standard Rift headset. Years after the original, no one would buy a headset that was a step down in any way, no matter how much cheaper it was.

“It’s one thing to say, well my error rate compared to ground truth is whatever, but how does it actually manifest in terms of the whole experience?” said Schroepfer. “As we got towards the end of development, we actually had a couple passionate Beat Saber players on the team, and they would play on the Rift and on the Quest. And the goal was, the same person should be able to get the same high score or better. That was a good way to reset our micro-metrics and say, well this is what we actually need to achieve the end experience that people want.”

It doesn’t hurt that it’s cheaper, too. Lidar is expensive enough that even auto manufacturers are careful how they implement it, and time-of-flight or structured-light approaches like Kinect also bring the cost up. Yet they massively simplify the problem, being 3D sensing tools to begin with.

“What we said was, can we get just as good without that? Because it will dramatically reduce the long-term cost of this product,” he said. “When you’re talking to the computer vision team here, they’re pretty bullish on cameras with really powerful algorithms behind them being the solution to many problems. So our hope is that for the long run, for most consumer applications, it’s going to all be inside-out tracking.”

I pointed out that VR is not considered by all to be a healthy industry, and that technological solutions may not do much to solve a more multi-layered problem.

Schroepfer replied that there are basically three problems facing VR adoption: cost, friction and content. Cost is self-explanatory, though it would be wrong to say VR has gotten a lot cheaper over the years. PlayStation VR established a low-cost entry early on, but “real” VR has remained expensive. Friction is how difficult it is to get from “open the box” to “play a game,” and historically it has been a sticking point for VR. Oculus Quest addresses both these issues quite well, being priced at $400 and, as our review noted, very easy to just pick up and use. All that computer vision work wasn’t for nothing.

Content is still thin on the ground, though. There have been some hits, like Superhot and Beat Saber, but nothing to really draw crowds to the platform (if it can be called that).

“What we’re seeing is, as we get these headsets out and in developers’ hands, that people come up with all sorts of creative ideas. I think we’re in the early stages. These platforms take some time to marinate,” Schroepfer admitted. “I think everyone should be patient, it’s going to take a while. But this is the way we’re approaching it, we’re just going to keep plugging away, building better content, better experiences, better headsets as fast as we can.”