I'd intended to write about this in much more detail earlier - but plenty of things have gotten in the way. Sometimes there's just not time to write everything out. But since much of this has still not been discussed in the media a month after WWDC, it seemed worth spending an hour or so to write it out.

If you enjoy this sort of thing, or would like to hear about a major new AR announcement I'll be unveiling later this month, please follow me here on LinkedIn or on Facebook at my Public Page.

You all know the story, already.

Or at least, you likely believe you do. At the WWDC keynote on June 5th, Apple announced (among other things) ARKit - an augmented reality SDK (tracking and light estimation) to be built into iOS 11, releasing later this year.

They showed some neat demos, everybody got more excited about AR, and the rest is still unfolding.

Now, I was annoyed.

Since 2015, I've been expecting something from Apple. I've been expecting them to release Metaio with an Apple flair: maybe some neat, super-friendly authoring tools - a sort of iLife for AR, basically a re-skin of Metaio Creator - perhaps even something aimed at the industrial sector, like Metaio Engineer, or drawing upon that technology for visualizing spaces and machines pre-construction.

Seriously - look up Metaio Creator and Metaio Engineer on the Wayback Machine to understand what I'm talking about.

It hadn't stopped there, though. They already owned PrimeSense - the company behind the depth-sensing tech in the original Kinect. They owned LuxVue - the company that was widely rumored to have been the display supplier for Google Glass. They owned WiFiSLAM, which could get very accurate location within a building - with obvious implications for AR, since indoor positional data could provide better rough calibration.

So when they announced that they were releasing what appeared to be a very basic part of Metaio - just the SLAM part - it was a bit of a letdown.

I was pissed.

I'd already decided that I didn't really care about iOS. Now - I've been a huge iPhone fan, but with Apple's sluggish movement in a lot of technology areas that I'm interested in, it's hard to stay enthusiastic. I'm a developer and futurist who likes to keep rushing ahead, and "here's another phone. It's thinner" doesn't really do it for me. Not when Google is pushing something like Tango.

But Apple totally buried the lede.

I can only imagine why - and it's pure speculation - but Apple seems to be on a kick of being quietly awesome right now instead of blowing their own horn. That's an interesting new color on them and I think it suits them.

Because when they announced ARKit, they quietly stepped over the thing that made it significant - not just in a "Hurray! This is built into the OS!" kind of way, but in a "Wait... They did WHAT???" kind of way.

ARKit is the first publicly available real-world-scale monocular SLAM solution. Before this moment, every other system required some variety of external indication of scale - or relied on specialized camera hardware to generate that indication.
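To make that concrete, here's roughly what it looks like from the developer's side - a minimal sketch against the iOS 11 APIs (names are as of the current SDK and could still shift between betas). Note what's absent: no marker, no depth sensor, and the positions that come back are already in meters.

```swift
import UIKit
import ARKit

class ARDemoViewController: UIViewController, ARSessionDelegate {
    let session = ARSession()

    override func viewDidLoad() {
        super.viewDidLoad()
        session.delegate = self
        // World tracking: one camera plus the IMU - no markers, no depth sensor.
        let config = ARWorldTrackingConfiguration()
        config.planeDetection = .horizontal
        session.run(config)
    }

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        // The camera transform is expressed in meters from the session origin.
        let p = frame.camera.transform.columns.3
        print(String(format: "Camera at (%.2f, %.2f, %.2f) m", p.x, p.y, p.z))
    }
}
```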

Apple is not the first to do it. There have been plenty of papers - from the University of California, Riverside, from Oxford, from groups in Zurich - an endless series of academic work demonstrating that it was possible. But we hadn't seen much in the way of actual publicly available SDKs.

There's a terrific deck from JPL explaining the situation in detail, in layman's language but with plenty of technical data & equations to satisfy the truly curious.

But my point is - this was a hard problem. It was the stuff of PhD theses and the Jet Propulsion Laboratory, not mobile app developers. This was, in fact, the sort of hard problem that caused Microsoft to use a Kinect v2-derived sensor array for HoloLens and that drove Google to push the development of a custom-sensor basis for Tango.

The whole world had concluded that, at least for now, there was no way to get scale-accurate data without at least a pair of cameras, or some variety of structured light, time-of-flight sensor, etc.

So how did Apple do it?

The workaround that everyone has been using typically involves markers. Now - markers can have other purposes, such as triggering requests for *specific* data, but for AR the most significant contribution is known scale. We know a given sticker was printed 12cm wide. For 3D objects, we know the scale of the reference object as well - and if confronted with an otherwise identical object at a different scale, our tracking solutions tend to misunderstand. If we use a coffee cup as reference and somebody brings in a giant coffee cup, there's no way for the algorithm to know that. Not usually.

What every system is looking for is some sort of sensor data that provides information in meters. We can get angles and proportions from ordinary SLAM, or even less extraordinary CV. But if we want to know whether a surface is a meter wide or an inch, we need some method of attaching scale to it - and you can't get that from just a camera.

Apple turned to their IMU. For reference, the iPhone 7 uses the STMicro LSM6DSM. It's quite likely the best 6-axis IMU being made today - it's certainly the best made by STMicro (they make a 9-DOF part that adds three axes of magnetometer data, but it's chunky by comparison and doesn't provide anything more dimensionally significant). Compare the Samsung Galaxy S7, for instance, which used the LSM6DS3 - smaller, but less accurate and less robust.

This gave Apple a huge advantage: they can guarantee that every iPhone manufactured after a certain date has a particular, very high performance IMU. That isn't an assumption you can make about Android devices. Their internal hardware is all over the place, and without calibrating each particular device you can't know how noisy the IMU is, how temporally consistent it is, how it responds to temperature, how well it recovers from shock, and so on.

So how does the IMU help?

The IMU provides a couple of things. It has a gyroscope that gives you rotation-rate (and thus orientation) info. And it has an accelerometer. Now, for most developers and users, the accelerometer is what tells the OS and applications that the device has been rotated. It's how an app knows where "down" is - because gravity registers as a constant acceleration. And acceleration is measured in m/s^2. When you move the device while you're holding it, you are applying force to the phone - accelerating it yourself, through space, then applying force to stop it - and the accelerometer reads all of that too.
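You can watch this directly through Core Motion. A minimal sketch - with one gotcha worth calling out: Core Motion reports acceleration in units of g, so you multiply by roughly 9.81 to get m/s^2:

```swift
import CoreMotion

let motion = CMMotionManager()  // keep a strong reference to this in real code
motion.deviceMotionUpdateInterval = 1.0 / 100.0  // poll at 100 Hz

motion.startDeviceMotionUpdates(to: .main) { data, _ in
    guard let d = data else { return }

    // d.gravity is the constant ~1g pull that tells the OS where "down" is.
    // d.userAcceleration is what's left: the force *you* apply moving the phone.
    // Both are reported in units of g, so convert to m/s^2:
    let g = 9.81
    let ax = d.userAcceleration.x * g
    let ay = d.userAcceleration.y * g
    let az = d.userAcceleration.z * g
    print(String(format: "user acceleration: (%.2f, %.2f, %.2f) m/s^2", ax, ay, az))
}
```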

And all of that data - every push and pull and lift and wiggle - is measured in meters per second squared. Even if the device doesn't know what a given motion means, integrate that acceleration twice and you get displacement in meters. Combine that with information from the camera, which can generate a rough point cloud, and you now know exactly how big that point cloud is in real-world terms.
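To illustrate the principle - and only the principle; Apple's actual fusion is far more sophisticated than this, presumably a Kalman-style filter - here's a toy sketch of how metric displacement from the IMU can pin scale onto the camera's unitless motion. Every name here is hypothetical:

```swift
import Foundation

// A hypothetical gravity-removed IMU sample: acceleration in m/s^2 plus a timestamp.
struct IMUSample {
    let ax, ay, az: Double
    let t: TimeInterval
}

// Naive double integration of acceleration into displacement (meters).
// Raw integration like this drifts badly within seconds - real VIO continuously
// cross-corrects IMU against camera - but it shows where the meters come from.
func metricDisplacement(_ samples: [IMUSample]) -> (x: Double, y: Double, z: Double) {
    guard samples.count > 1 else { return (0, 0, 0) }
    var v = (x: 0.0, y: 0.0, z: 0.0)  // velocity, m/s
    var p = (x: 0.0, y: 0.0, z: 0.0)  // position, m
    for i in 1..<samples.count {
        let dt = samples[i].t - samples[i - 1].t
        v.x += samples[i].ax * dt; v.y += samples[i].ay * dt; v.z += samples[i].az * dt
        p.x += v.x * dt; p.y += v.y * dt; p.z += v.z * dt
    }
    return p
}

// The camera's estimate of the same motion is in arbitrary units; the ratio of
// the IMU's metric distance to the visual distance pins the whole point cloud
// to real-world scale.
func scaleFactor(imuMeters: Double, visualUnits: Double) -> Double {
    return imuMeters / visualUnits
}
```

The hard part Apple solved isn't the arithmetic - it's keeping that drifting, noisy integration honest by correcting it against the camera, frame after frame, on a phone.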

What does this mean for other software SDKs?

We don't really even have to talk about specific software SDK providers here, but we can call out PTC's Vuforia, Wikitude, EasyAR, Catchoom and others - and since most of them are addressing a significant need that Apple is not, namely cloud-based target recognition and management, there's still room for them to operate. Apple's machine learning and image recognition frameworks aren't really a great fit for this, at least not without an additional layer of code to bring that in. ARKit also isn't any good at tracking handheld objects - my Augmented Tarot app, for instance, is a 100% nonstarter without something like Vuforia powering it.

But folks that have been relying on hardware solutions for spatial understanding - like Microsoft on HoloLens and Google with Tango - have an opportunity to rethink their strategy.

So what happens now?

There are a few things that I imagine are happening:

1. I bet Google is waking up some sleepy projects.

Google has probably pulled a bunch of developers into a room and tasked them with solving the same problem Apple managed to solve: world-scale-accurate monocular SLAM.

Now, this isn't as much of a blow as some people might think, because Tango isn't really about the sensor - it's about a framework that makes information from that sensor useful. An algorithm that meaningfully derives scale from the IMU, calibrating it against known forces like gravity, doesn't affect how Tango works from the SDK side. Tango itself doesn't care where the data comes from: time-of-flight, structured light, and stereo disparity are all options on the table, and they all produce the same thing. Adding VIO SLAM as another source isn't terribly disruptive - it's just challenging.

Tango has some advantages: ARKit needs to move around. That's usually a trivial requirement, since people are moving around all the time, but that movement is essential - without translation there's no parallax baseline for the camera and no acceleration for the IMU to measure, so there's nothing to derive scale from. You can mount a Tango device on a tripod. It won't work as well as if you moved it around, and there's hardly any need to do AR from a stationary device, but the point is that there's an inherently more robust dataset coming from a device with a dedicated active sensor or fixed stereo pair.
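You can actually watch this constraint through the API. A sketch using ARKit's tracking-state callback (part of the session delegate protocol) - on a stationary device, my expectation is the session tends to sit in a limited state, because there's no baseline to resolve:

```swift
import ARKit

// Implemented inside your ARSessionDelegate. Called whenever tracking quality
// changes. With no translation there's no baseline, so world-scale tracking
// can't fully resolve on a tripod-mounted device.
func session(_ session: ARSession, cameraDidChangeTrackingState camera: ARCamera) {
    switch camera.trackingState {
    case .normal:
        print("Full 6-DOF tracking - metric scale resolved")
    case .limited(let reason):
        print("Limited tracking: \(reason)")  // e.g. initializing, insufficient features
    case .notAvailable:
        print("Tracking unavailable")
    }
}
```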

Tango devices have a special sensor using structured light, ToF, or other variations of scale reference. Remember: prior to the WWDC announcement, most of the world had pretty much conceded that there was no way to get dependable scale data in monocular SLAM unless you had targets of known scale in the scene. Heck, Kudan (quite reasonably at the time) has a page explaining why you can't do it. Prior VIO SLAM work was aimed at robotics, autonomous drones and self-driving vehicles, where small amounts of spatial drift weren't that important as long as you had decent loop closure. Those solutions worried about angles more than distances, had a lot of other options available to help fine-tune scale, and could always depend on having a specific, high-performance IMU if the use case demanded it.

Knowing that we can actually do this with consumer devices - that it's not experimental and isn't a waste of time - is going to draw a lot more effort into that area. So Google has to be addressing that.

Beyond that aspect of Tango, Google has an exceptional approach to storing, recognizing and sharing area descriptions - so the structure of an area recognized by one device can be trivially shared with other Tango-enabled devices. That's not device-specific, and it can work across multiple Tango devices using different hardware to derive depth and scale.

2. Software SDK providers are going to focus on what makes them different

I've made no secret of the fact that I think per-app fees for augmented reality SDKs are complete nonsense. That's not just the anarchist/scofflaw part of me; it's the practical part that sees that none of them provide something in their basic license that cannot - that should not - be integrated at the OS level, completely eliminating them from the market.

If all a developer is asking to do is recognize a preset number of targets stored on the local device, and get some structural info about their surroundings, that should be *free* or carry a single, reasonable, fixed cost. You don't get to complain that it cost you a lot of money to develop - the market won't care about that when someone like Apple comes along and releases ARKit. When developers don't need you any more, your expenses don't matter. Maybe this is why some providers have absurd fees: they've already calculated that their days are numbered, and if they're ever going to recoup those R&D costs, they need to do so NOW.

What people will pay for - and where the real long-term cash is in augmented reality - is simplified authoring tools, cloud recognition, and other features like application plugins: connecting AutoCAD to 3D object recognition, for instance, so development of augmented reality applications can proceed alongside product design and manufacturing.

And, of course, actual utility.

3. Microsoft is going to shrug and keep on going

The HoloLens and Windows Mixed Reality are such mature products at this point, and are so clearly defined as a way of interacting with spaces and information, with ample abstraction between sensors and user experience, that I feel like the core teams at Microsoft that deal with spatial understanding probably said, "Oh, cool, someone got it working."

Maybe that means they reinvigorate some existing teams, reach out to a few universities that have done prominent academic research, and get them involved with hardware developers to build an improved, radically lower-power, less expensive headset.