This sounds amazing, but what are the technical challenges that must be addressed to deliver this experience? At the highest level, it decomposes into three problem areas:

1. Scene understanding. The system needs to collect information about the user’s surroundings, and interpret the structure and content of that environment. That is, it must go from streams of raw sensor data to a structured representation of shapes, materials, objects, and light sources.

2. Tracking. It needs to track the user’s head position and orientation at the very least; for some applications, it may also need to read the user’s gaze, body posture, gait, hand pose, etc.

3. Display. It must decide what to display and then draw that content to the HMD. This step is a combination of the graphics rendering work performed by 3D gaming engines or digital effects packages, and whatever fancy new projection technology is used by the HMD to deliver imagery to the eyes.

Crucially, all of this must be performed under very stringent latency and quality constraints, including maintaining high and uninterrupted refresh rates. While the VR community is still exploring the various performance requirements needed to maintain a convincing “presence” (see the Oculus Best Practices Guide for a snapshot of current thinking), a good rule of thumb is that changes in the physical world (a user’s head movement, a change in lighting, a physical object moving to occlude a virtual one, etc.) must be accurately reflected in the simulated content within a very few tens of milliseconds. With any more lag, the human perceptual system notices the discrepancies, often at a subconscious level, and the sense of immersion and “magic” is lost. Fall behind or make any number of subtle mistakes and the experience will fail to convince.
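
To make those numbers concrete, here is a back-of-the-envelope sketch of the budget. The 90 Hz refresh rate, 20 ms motion-to-photon target, and per-stage costs are illustrative assumptions, not published Magic Leap specifications:

```python
# Back-of-the-envelope motion-to-photon budget.
# All numbers below are assumed, for illustration only.
REFRESH_HZ = 90            # assumed display refresh rate
MOTION_TO_PHOTON_MS = 20   # a rough "few tens of milliseconds" target

frame_budget_ms = 1000.0 / REFRESH_HZ  # time available per frame

# Hypothetical per-stage costs (ms) that must all fit in the budget:
stages = {
    "sensing": 4.0,        # camera/IMU capture and readout
    "scene_update": 6.0,   # tracking + scene understanding
    "render": 7.0,         # drawing the virtual content
    "display": 2.0,        # scan-out / projection to the eyes
}
total_ms = sum(stages.values())

print(f"frame budget: {frame_budget_ms:.1f} ms")
verdict = "within" if total_ms <= MOTION_TO_PHOTON_MS else "over"
print(f"pipeline total: {total_ms:.1f} ms ({verdict} the {MOTION_TO_PHOTON_MS} ms target)")
```

The point of the arithmetic: at 90 Hz the whole pipeline gets roughly 11 ms per frame, so every one of the capabilities discussed below must run in low single-digit milliseconds.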

I don’t know much about the hardware needed for problem 3 (projection), so I won’t comment here except to note that the size of the recent investment round and the astonished commentary by those who have tried a demo suggest good progress on this front. For problem 2 (tracking), Magic Leap will face many of the same challenges as other VR systems like Oculus. These difficulties are considerable, but it’s entirely plausible that continued R&D will result in cheap, rock-solid, performant tracking systems in the near future. The tracking on the Oculus DK2 is already pretty good, for example, though of course it does require that pesky external camera.

That leaves problem 1, which is a real doozy. To deliver on the full promise of “cinematic reality”, the system needs to understand the environment, in real time, well enough to inject new elements that appear to be natural, legitimate participants of the scene. I think this means, essentially, solving all of computer vision. To see why, let’s break problem 1 down a bit more.

First, remember that the system, via the HMD and possibly other as-yet-unspecified sensor platforms, will be collecting sensory signals from the environment. There will certainly be cameras, whether traditional, infrared, or depth, as well as microphones, accelerometers, and so on. These streams form the raw data from which the system must interpret its surroundings, possibly in conjunction with information from maps and other pre-existing databases.

3D mesh reconstruction. Crudely speaking, this means translating the raw sensor streams into a “wireframe” version of the scene that describes the shapes and geometries, as well as the boundaries between objects. This information is fundamental for determining where virtual characters and objects can exist in the scene, how they can move while respecting the other occupants of the environment, and so on. Without this knowledge, virtual elements will need to be well separated from real objects, or risk appearing in impossible or unnatural locations.
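
As a hint of what the first step involves, here is a minimal sketch of back-projecting a depth pixel into a 3D point with a pinhole camera model — the raw material from which such meshes are fused. The intrinsics are made-up illustrative values, not any real device’s calibration:

```python
def backproject(u, v, depth, fx, fy, cx, cy):
    """Map a pixel (u, v) with measured depth (meters) to a 3D point
    in the camera frame, using pinhole camera intrinsics."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

# Assumed intrinsics for a hypothetical 640x480 depth camera.
fx = fy = 525.0
cx, cy = 319.5, 239.5

# A pixel at the image center maps straight down the optical axis.
print(backproject(319.5, 239.5, 2.0, fx, fy, cx, cy))  # (0.0, 0.0, 2.0)
```

A real pipeline would fuse millions of such points across frames (e.g. with truncated signed distance fields) before extracting a mesh — this sketch shows only the geometric seed of that process.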

Surface and texture. It’s one thing to understand the shape of things, but it’s also important to understand their textures and material composition — fabric, wood, metal, glass, rubber, and plastic all behave differently, especially with respect to how they interact with light sources. Without a good sense of these surface properties, the interactions between real and virtual elements will be constrained and unrealistic.

Object recognition. This is a matter of attaching labels and other metadata to different pieces of the wireframe and surface scene description. Conceptually, this is similar to the recognition performed by Google’s image search — though in a much messier, faster-moving, and less constrained setting. Object recognition, including identification of key subparts like handles, knobs, hinges, etc., will determine how virtual content can interact with and respond to the real world.
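
Conceptually, the output of this stage might look like metadata attached to reconstructed geometry. A hypothetical, deliberately simplified scene description — the names and fields here are invented for illustration:

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    label: str          # semantic label, e.g. "door"
    mesh_id: int        # index into the reconstructed mesh
    confidence: float   # recognizer's certainty in the label
    subparts: list = field(default_factory=list)  # handles, hinges, ...

door = SceneObject("door", mesh_id=42, confidence=0.93,
                   subparts=[SceneObject("handle", 43, 0.88),
                             SceneObject("hinge", 44, 0.71)])

# Virtual content can then query the scene for interaction points:
graspable = [p for p in door.subparts if p.label == "handle"]
print([p.label for p in graspable])  # ['handle']
```

The interesting part is the query at the end: a virtual character that wants to “open” a real door needs exactly this kind of subpart-level metadata, not just a raw mesh.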

Pose. For some uses, it will be important to understand where the real people are in the scene, what pose their body is in, and even their gait and trajectory. This problem is conceptually similar to the one solved by a system like Microsoft’s Kinect, but it will need to function in much more varied and cluttered environments — and with potentially many more people at a time. There’s a huge difference between grokking one or two people in a living room performing stereotyped movements and tracking a diverse, shifting crowd in a street or on a beach.
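
Even the simplest slice of this problem — predicting where a tracked person will be a moment from now, so a virtual character can step out of the way — illustrates what’s needed. A constant-velocity sketch (a production system would use something more robust, such as a Kalman filter over full skeletal state):

```python
def predict(positions, timestamps, dt):
    """Constant-velocity extrapolation from the last two tracked
    positions. positions: [(x, y), ...]; timestamps in seconds."""
    (x0, y0), (x1, y1) = positions[-2], positions[-1]
    span = timestamps[-1] - timestamps[-2]
    vx, vy = (x1 - x0) / span, (y1 - y0) / span
    return (x1 + vx * dt, y1 + vy * dt)

# A pedestrian walking ~1.4 m/s along x, sampled at 10 Hz:
pos = [(0.0, 0.0), (0.14, 0.0)]
ts = [0.0, 0.1]
print(predict(pos, ts, dt=0.5))  # ~0.84 m further along x
```

Multiply this by every person in a crowd, add occlusions and erratic motion, and the gap between a living-room Kinect and a street scene becomes clear.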

Lighting inference. This is one of the most interesting and difficult problems: in order to produce realistic virtual content that “fits” into the real world, the system needs to figure out how to illuminate the objects. In other words, it needs to understand something about the light sources, and how those sources interact with the rest of the environment — including absorption, transparency and translucency, reflection, etc. Get this wrong, and the virtual content will be lit in a manner inconsistent with the rest of the scene, and things will feel subtly (or drastically) wrong.
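
To see why this matters, consider the simplest possible shading model, Lambertian diffuse reflection: a surface’s brightness depends directly on the estimated light direction, so a wrong estimate visibly mis-lights every virtual object. A toy sketch, not how a production renderer works:

```python
import math

def lambert(normal, light_dir, intensity=1.0):
    """Diffuse brightness: light intensity scaled by the cosine of the
    angle between the surface normal and the direction to the light."""
    nx, ny, nz = normal
    lx, ly, lz = light_dir
    n_len = math.sqrt(nx*nx + ny*ny + nz*nz)
    l_len = math.sqrt(lx*lx + ly*ly + lz*lz)
    cos_theta = (nx*lx + ny*ly + nz*lz) / (n_len * l_len)
    return intensity * max(0.0, cos_theta)

up = (0.0, 1.0, 0.0)  # a horizontal surface, normal pointing up

# Correct estimate: light directly overhead -> surface fully lit.
print(lambert(up, (0.0, 1.0, 0.0)))  # 1.0
# Wrong estimate: light guessed at the horizon -> the virtual object
# renders dark while the real objects around it are brightly lit.
print(lambert(up, (1.0, 0.0, 0.0)))  # 0.0
```

And this is the easy case: real scenes add shadows, reflections, translucency, and multiple colored sources, all of which must be inferred from the same sensor streams.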

And that’s just the beginning. In order for scripted virtual characters to interact well with real people, they’ll need to understand all kinds of unspoken social signals. Even a behavior that seems effortless, such as walking with a flow of pedestrians down a sidewalk, in fact requires the system to correctly interpret a complex set of implicit and culturally defined cues.

Solving these problems means the difference between being able to create a baby elephant floating in the empty space of two cupped hands, and a tarantula sitting right on your hand. Between a submarine floating well above a busy sunlit street, and a virtual character moving naturally and convincingly in the crowd of pedestrians and the flow of traffic on the street below. Or between a whale flying high over a beach full of people, and that same whale finding an appropriate place in the water to breach without “crushing” some unlucky humans. In short, these scene understanding capabilities will be crucial in enabling Magic Leap content to descend from the metaphorical and literal heights and instead become intimately entwined with the inherent drama of the physical world. Will these be truly blended realities, or will the real world serve as just a prettier stage for segregated virtual players?