To illustrate the problem, let's take two extreme examples. In the first case, you find a photo in the pile that looks almost exactly like the photo you have. You can easily estimate that your photo was taken fractionally behind & to the left of the photo in the pile, so you now have a really accurate estimate of the position from which your photo was taken. This is the equivalent of asking “player 2” to go and stand right beside “player 1” when player 2 starts their game. Then it’s easy for player 2’s system to figure out where it is relative to player 1, the systems can align their coordinates (location), and the app can run happily.

In the other example, it turns out that unbeknownst to you, all the photos in your pile were taken facing roughly south, while your photo faces north. There is almost nothing in common between your photo and what’s in the pile. This is the AR equivalent of trying to play a virtual board game where player 1 is on one side of the table, and player 2 sits down on the opposite side and tries to join the game. Apart from some parts of the table itself (which you see reversed compared to what’s in the pile), it is *very* hard for the systems to synchronize their maps (relocalize).

The difference between these examples illustrates why a claim of supporting “multi-player” AR probably also means significant UX compromises for the user. In my experience building multi-player AR systems since 2012, the UX challenges of the first example (requiring people to stand side by side to start) are too hard for users to overcome. They need a lot of hand-holding and explanations, and the friction is too high. Getting a consumer-grade multi-player experience means solving the 2nd case (and more).

In addition to the 2nd case above, the photos in the pile could be taken from vastly different distances, under different lighting conditions (morning vs afternoon shadows are reversed), or using different camera models which affect how the image looks compared to yours (that brown wall may not be the same brown in your image as in mine). You also may not even have GPS available (e.g. indoors), so you can’t even start with a rough idea of where you might be.

The final “fun” twist to all this, is that users get bored waiting. If the relocalization process takes more than 1–2 seconds, the user generally moves the device in some way, and you have to start all over again!

Accurate & robust relocalization (in all cases) is still one of the outstanding hard problems for AR (and robots, and autonomous cars etc).

How does Relocalization work?

So how does it actually work? How are these problems being solved today? What’s coming soon?

At its core, relocalization is a very specific type of search problem. You are searching through a SLAM map, which covers a physical area, to find where your device is located in the coordinates of that map. SLAM maps usually contain 2 types of data: a sparse point-cloud of all the trackable 3D points in that space, and a whole bunch of keyframes. A keyframe is just one frame of video captured and saved as a photo every now & then as the system runs. The system decides when to capture a keyframe based on how far the device has moved since the last keyframe, and on the tradeoffs the system designer makes for performance. More keyframes saved means more chance of finding a match when relocalizing, but takes more storage space and means the set of keyframes takes longer to search through.
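The keyframe-capture decision above can be sketched as a simple threshold test. A minimal sketch in Python, assuming a toy pose format of (x, y, z, yaw in degrees); the threshold values and function names are illustrative, not taken from any real SLAM system:

```python
import math

# Illustrative thresholds; real systems tune these per device and per use case.
MIN_TRANSLATION_M = 0.25   # metres moved since the last keyframe
MIN_ROTATION_DEG = 15.0    # degrees rotated since the last keyframe

def should_capture_keyframe(last_pose, current_pose):
    """Return True when the device has moved or rotated enough that a new
    keyframe would add coverage to the map. Poses are (x, y, z, yaw_deg)."""
    dx = current_pose[0] - last_pose[0]
    dy = current_pose[1] - last_pose[1]
    dz = current_pose[2] - last_pose[2]
    translation = math.sqrt(dx * dx + dy * dy + dz * dz)
    # Smallest angular difference, handling wrap-around at 360 degrees.
    rotation = abs(current_pose[3] - last_pose[3]) % 360.0
    rotation = min(rotation, 360.0 - rotation)
    return translation > MIN_TRANSLATION_M or rotation > MIN_ROTATION_DEG
```

Raising the thresholds trades relocalization coverage for a smaller, faster-to-search map, which is exactly the designer tradeoff described above.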

So the search process actually has 2 pieces. The first is as described above with the Polaroids example: you are comparing your current live camera image to the set of keyframes in the SLAM map. The second is that as soon as you turn your device on, it instantly builds a tiny 3D point-cloud of its own based only on what it currently sees, and it searches through the SLAM sparse point-cloud for a match. This is like having a 3D jigsaw puzzle piece (the tiny point-cloud from your camera) and trying to find the match in a huge 3D jigsaw… where every piece is flat gray on both sides.
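The first piece, keyframe matching, amounts to a nearest-neighbour search over compact image descriptors. A minimal sketch, assuming toy integer descriptors compared with Hamming distance; real systems use bag-of-words or learned descriptors, and follow a candidate match with geometric verification against the point cloud (the second piece, not shown here):

```python
def hamming(a, b):
    """Hamming distance between two integer bit-string descriptors."""
    return bin(a ^ b).count("1")

def relocalize(query_descriptor, keyframes, max_distance=10):
    """Stage 1 of relocalization: find the stored keyframe whose descriptor
    is closest to the live camera frame's. `keyframes` is a list of dicts
    with a 'descriptor' field (the schema here is hypothetical).
    Returns the best keyframe, or None if nothing is close enough."""
    if not keyframes:
        return None
    scored = [(hamming(query_descriptor, kf["descriptor"]), kf)
              for kf in keyframes]
    scored.sort(key=lambda pair: pair[0])
    best_dist, best_kf = scored[0]
    if best_dist > max_distance:
        return None  # no keyframe matches: relocalization fails
    return best_kf
```

In a real system a successful stage-1 match only narrows the search; the device's tiny local point-cloud is then aligned against the candidate keyframe's 3D points to recover a full 6-degree-of-freedom pose.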

Here’s a simplified overview of how most of today’s SLAM systems build their SLAM map using a combination of Optical Features (sparse 3D point cloud) and a database of “keyframes”.

Due to the limited amount of time available before a user gets bored, and the modest compute power of today’s mobile devices, most of the effort in relocalization goes into reducing the size of the “search window” before having to do any type of brute-force searching through the SLAM map. Better GPS, better trackers and better sensors are all very helpful in this regard.
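Shrinking the search window can be as simple as discarding keyframes that a rough GPS fix rules out, before any image matching runs at all. A hypothetical sketch, assuming each keyframe stores a metric position; the field names are illustrative:

```python
import math

def prune_by_gps(keyframes, gps_xy, radius_m=30.0):
    """Keep only keyframes captured within `radius_m` metres of a rough
    GPS fix, shrinking the set the brute-force matcher must scan.
    `keyframes` are dicts with a metric 'pos' (x, y) field; this schema
    is an assumption for illustration, not any real SLAM API."""
    gx, gy = gps_xy
    return [kf for kf in keyframes
            if math.hypot(kf["pos"][0] - gx, kf["pos"][1] - gy) <= radius_m]
```

Better sensor priors (GPS, compass, a still-running tracker) simply let `radius_m` shrink, which is why they help so much within the 1–2 second patience budget.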

How is it really being done today in apps?

Poorly! There are broadly 5 ways that relocalization is being done today for inside-out tracking systems (it’s easy for outside-in, like an HTC Vive, as the external lighthouse boxes give common coordinates to all the devices they track). These ways are:

Rely on GPS for both devices and just use lat/long as the common coordinate system. This is simple, but the common object we both want to look at will be placed in different physical locations for each phone, up to the amount of error in a GPS location (many meters!). This is how Pokemon Go currently supports multi-player, but because the “MMO” back-end is still quite simple, it’s actually closer to “multiple people playing the same single-player game in the same location”. This isn’t entirely accurate, as once the pokemon is caught other people can’t capture it, so there is some simple state management going on.

Here’s what happens when you rely on GPS alone for relocalization. We don’t see the object where it is “supposed” to be, and we don’t even see it in the same place on 2 different devices.

Rely on a common physical tracking marker image (or QR code). This means we both point our phones at a marker on the table in front of us, and both our apps treat the marker as the origin (0,0,0) of the coordinate system. This means the real world and the virtual world are consistent across both phones. This works quite well; it’s just that no one will ever carry the marker around with them, so it’s a dead end for real-world use.

Here’s an app that uses a printed image that all the devices use for relocalization in order to share their coordinates

Copy the SLAM maps between devices, and ask the users to stand beside each other with player 2 holding their phone very close to player 1’s. Technically this can work quite well; however, the UX is just a major problem for users to overcome. This is how we did it at Dekko for Tabletop Speed.

Just guess. If I start my ARKit app standing in a certain place, my app will put the origin at the start coordinates. You can come along later and start your app standing in the same place, and just hope that wherever the system sets your origin is roughly in the same physical place as my origin. It’s technically much simpler than copying SLAM maps, the UX hurdles are about the same, and the errors across our coordinate systems aren’t too noticeable if the app design isn’t too sensitive. You just have to rely on users doing the right thing….

Constrain the multi-player UX to be OK with low-accuracy location and asynchronous interactions. Ingress and AR treasure-hunt type games fall into this category. Achieving high-accuracy real-time interactions is the challenge. I do believe there will always be great use-cases that rely on asynchronous multi-user interactions, and it’s the job of AR UX designers to uncover these.
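To see why the GPS-only option above makes a poor shared origin, it helps to convert a lat/long difference into metres. A small sketch using the equirectangular approximation; with typical GPS error of several metres, two phones standing side by side can easily end up ~10 m apart in their “shared” coordinate frame:

```python
import math

EARTH_RADIUS_M = 6_371_000.0

def latlon_to_local_metres(origin, point):
    """Offset in metres (east, north) of `point` from `origin`, each given
    as (lat_deg, lon_deg). An equirectangular approximation, which is fine
    over the few-metre scales an AR session cares about."""
    lat0, lon0 = map(math.radians, origin)
    lat1, lon1 = map(math.radians, point)
    east = (lon1 - lon0) * math.cos((lat0 + lat1) / 2.0) * EARTH_RADIUS_M
    north = (lat1 - lat0) * EARTH_RADIUS_M
    return east, north
```

A latitude difference of just 0.0001° (well within GPS noise) already corresponds to roughly 11 m on the ground, so content anchored by raw GPS alone will sit in visibly different physical spots on each device.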

It’s worth noting that all of the above solutions have existed for many years, and yet the number of real-time multi-player apps that people are using is pretty much zero… All the solutions above IMO fall into the bucket of an engineer being able to say “look it works, we do multi-player!” but end users just find it too much hassle for too little benefit.

What’s the state of the art in research (and coming soon to consumer)?

While the relocalization method described above is the most common approach, there are others that are seeing great results in the labs and should come to commercial products soon. One uses full-frame neural network regression (PoseNet) to estimate the pose of the device. This looks capable of getting your pose accurate to about a meter or so under a wide range of conditions. Another method regresses the pose of the camera for each pixel in the image.

Posenet is indicative of where systems are headed

Can the relocalization problem really be solved for consumers?

Yes! In fact there have been some pretty big improvements over the last 12 months based on state-of-the-art research results. Deep learning systems are giving impressive results for reducing the search window for relocalizing in large areas, or at very wide angles to the initial user. Searching a SLAM map built from dense 3D point clouds of the scene (rather than the sparse point clouds used for tracking) is also enabling new relocalization algorithms that are very robust. I’ve seen confidential systems that can relocalize from any angle at very long range in real-time on mobile hardware, and support many, many users simultaneously. Assuming the results seen in research carry over into commercial-grade systems, then I believe this will provide the “consumer grade” solutions we expect.

But these are still only partial solutions to fully solving relocalization for precise lat/long, for GPS-denied environments, and for parts of the world where no SLAM system has ever been before (cold start). However, I’ve seen demos that solve most of these point problems, and I believe it will just take a clever team to gradually integrate them into a complete solution. Large-scale relocalization is on the verge of being primarily an engineering problem now, not a science problem.

Can’t Google or Apple just do this? Not really.

Google has demoed a service called VPS for their discontinued Tango platform, which enabled some relocalization capabilities between devices. Sort of a shared SLAM map in the cloud. It didn’t support multi-player, but it went a ways towards solving the hard technical parts. It was never publicly available, so I can’t say how well it worked in the real world, but the demos looked good (as all demos do). All the major AR platform companies are working on improving the relocalizers that are part of ARKit, ARCore, Hololens, Snap etc. This is primarily to make their tracking systems more reliable, but this work can help with multi-player also…

VPS is a good example of a cloud-hosted shared SLAM map. However it is completely tied to Google’s SLAM algorithms and data structures, and won’t be used by Apple, Microsoft or other SLAM OEMs (who would conceivably want their own systems, or partner with a neutral 3rd party).

The big problem that every major platform has with multi-player is that at best they can enable multi-player within their eco-system: ARCore to ARCore, or ARKit to ARKit and so on. This is because for cross-platform relocalization to work, there needs to be a common SLAM map on both systems. This would mean that Apple would have to give Google access to their raw SLAM data, and vice versa (plus Hololens, Magic Leap also opening up etc). While technically possible, this is a commercial bridge too far, as the key differentiators in the UX between various AR systems are largely a combination of hw+sw integration first, then the SLAM mapping system’s capabilities.

So in the absence of all the big platforms agreeing to open all their data to each other, the options are either:

an independent & neutral 3rd party acts as a cross-platform relocalization service; or

a common open relocalization platform emerges.

My personal belief is that due to the very tight integration between the SLAM relocalization algorithms and the data structures, a dedicated purpose-built system will outperform (from a UX aspect) a common open system for many years. This has long been the case in computer vision: open platforms such as OpenCV or the various open SLAM systems (ORB-SLAM, LSD-SLAM etc) are great systems, but don’t provide the same level of optimized performance as focused in-house systems. To date, no AR platform company I know of is running, or considering running, an open SLAM system (though many similar algorithmic techniques are applied in the optimized proprietary systems).

Note that this doesn’t mean I believe open platforms have no place in the ARCloud. On the contrary, I think there will be many services that will benefit from an open approach. However, I don’t think we as an industry understand the large-scale AR problems well enough yet to say specifically that this system needs to be open while that system needs to be as optimized as possible.

Relocalization != Multi-player. It’s also critical for…

This post is ostensibly about why multi-player is hard for AR, and it turns out it’s hard specifically for AR because it’s hard to make relocalization consumer-grade. There’s a whole bunch of other things to build to enable AR multi-player, which I touched on above, and which could be hard to build, but they are all previously solved problems. But… there are other ways that relocalization really matters, beyond just multi-player. Here’s a few:

The “cold start” problem: this refers to the very first time you launch an app or turn on your HMD, and it has to figure out where it is. Generally today systems don’t even bother to try & solve this; they just call wherever they start (0,0,0). Autonomous cars, cruise missiles and other systems that need to track their location obviously can’t do this, but they have a ton of extra sensors to rely on. Having the AR system relocalize as the very first thing it does means that persistent AR apps can be built, as the coordinate system will be consistent from session to session. If you dropped your pokemon at some specific coordinates yesterday, when you relocalize after turning your device on the next day, those coordinates will still be used and the pokemon will still be there. Note that these coordinates could be unique to your system, and not necessarily absolute/global coordinates (lat/long) shared by everyone else (unless we all localize into a common global coordinate system, which is where things will ultimately end up).

The absolute coordinates problem: this refers to finding your coordinates in terms of lat/long to an “AR usable” level of accuracy, which means accurate to “sub-pixel” levels. Sub-pixel means the coordinates are accurate enough that the virtual content would be drawn using the same pixels on my device as on yours, if your device were in the exact same physical spot. Usually sub-pixel is used in tracking to refer to jitter/judder: a pose accurate to sub-pixel levels means the content doesn’t jitter when the device is still. It’s also a number that doesn’t have a metric equivalent, as each pixel can correspond to slightly different physical distances depending on the resolution of the device (pixel sizes) and on how far away the device is pointing (a pixel covers more physical space if you are looking a long way away). In practice, sub-pixel accuracy isn’t strictly necessary, as users can’t really tell if the content is inconsistent by a few cm between my device and yours. Getting accurate lat/long coordinates is essential for any location-based commerce services (e.g. the virtual sign over the door needs to be over the right building), as well as for navigation.

This is what you get when you don’t have accurate absolute coordinates (or a 3D mesh of the city)

The lost-tracking problem: the last way in which relocalization matters is that it is a key part of the tracker. While it would be nice if trackers never “lost tracking”, even the best trackers can encounter corner cases that confuse the sensors, e.g. getting into a moving vehicle will confuse the IMU in a VIO system, while blank walls can confuse the camera system. When tracking is lost, the system needs to go back and compare the current sensor input to the SLAM map to relocalize, so that any content is kept consistent within the current session of the app. If tracking can’t be recovered, then the coordinates are reset to (0,0,0) and all the content is reset too.
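The sub-pixel discussion above can be made concrete: the physical width one pixel covers grows linearly with distance. A quick sketch, assuming an illustrative 60° horizontal field of view and a 1920-pixel-wide image (real devices vary):

```python
import math

def metres_per_pixel(distance_m, horizontal_fov_deg=60.0, image_width_px=1920):
    """Physical width (in metres) that one pixel covers at a given distance.
    The FOV and resolution defaults are illustrative phone-camera values."""
    # Width of the visible scene at that distance, from basic trigonometry.
    view_width_m = 2.0 * distance_m * math.tan(math.radians(horizontal_fov_deg / 2.0))
    return view_width_m / image_width_px
```

With these numbers, a pixel spans roughly a millimetre at 2 m but about 3 cm at 50 m, which is why a few cm of coordinate inconsistency is often imperceptible at range but glaring on a tabletop.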

Yeah but when?

Will this remain science-fiction?

So when will end users be able to play true multi-player Pokemon, or StarWars Holochess with their friend? It can be done today, if users are OK to accept a poor-quality relocalization UX. If you are OK to relocalize using a common printed marker, or to rely on “my GPS gives the exact same result as your GPS” and accept that you might see the pokemon through your phone on the sidewalk while I see the same one through my phone placed in the middle of the road… then it can be done right now. But users have generally found that UX to be at least a little bit broken (even if it “technically” works). In terms of when we will see a solid UX, I expect solutions to come to market around Q2–Q3 2018. I know this because that’s roughly when my startup 6D.ai is expecting to have a solution ready, and I know other startups (e.g. Escher Reality) are working on the same problems. There’s a chance Apple or Google may release an update to ARCore or ARKit that allows “in eco-system” multi-player, but I’d be surprised if they bring something to market faster than a startup can. I hear Apple has an ARKit update planned for around April that may support vertical plane detection amongst other tweaks, while Google seems to be very focused on bringing ARCore to market on large numbers of devices as a priority.

Wrap-up

So true multi-player is IMO the single feature most likely to boost user engagement with AR apps (there are others, such as absolute coordinates and very large-scale outdoor apps). At a minimum it should allow far more engaging non-gaming smartphone AR apps to be built… but developers still have to learn what to build, and this will take a while. Adding multi-player to a bad concept won’t make it a compelling UX.

It’s a hard technical problem to solve relocalization, especially cross-platform, to the level that “it just works” for consumers. Once that problem is solved, the rest is just replicating work that has already been done for real-time MMO gaming platforms. As I indicated in my first ARKit post, 2018 is going to be an exciting year for AR-enabling infrastructure….