To pull it off, the researchers scanned the environment with a Kinect to create a 3D scene, which the computer at first sees as a single undifferentiated object. However, a human user can touch and interact with objects in real time (as shown in the video below), then vocally call out the name of each ("banana," for instance). The system then separates each touched object from the surrounding surfaces and creates a "class" for it. All of that is done using local resources (a laptop, for instance) for better interactivity.
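Conceptually, the touch-plus-voice step amounts to growing a region outward from the touched point and tagging everything reached with the spoken label. Here's a minimal sketch of that idea; the 2D occupancy grid and the function names are stand-ins for illustration, not Microsoft's actual data structures:

```python
from collections import deque

def segment_touch(grid, touch, label, labels):
    """Flood-fill from the touched cell across occupied neighbors,
    assigning every reached cell the spoken class label.

    grid:   set of (x, y) cells occupied by scanned geometry
    touch:  (x, y) cell the user touched
    labels: dict mapping (x, y) -> class name, updated in place
    """
    queue = deque([touch])
    seen = {touch}
    while queue:
        x, y = queue.popleft()
        labels[(x, y)] = label
        # Visit 4-connected neighbors that belong to the scanned surface.
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if nxt in grid and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return labels

# Two disconnected blobs: touching one labels only that one.
grid = {(0, 0), (0, 1), (5, 5)}
labels = segment_touch(grid, (0, 0), "banana", {})
```

After the call, only the connected cells `(0, 0)` and `(0, 1)` carry the "banana" label, while the distant cell `(5, 5)` stays unlabeled, which mirrors how touching one object leaves the rest of the scene untouched.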

Behind the scenes, however, software running on more powerful systems learns as new objects are labeled, and can deduce whether an object belongs to a specific class like "chair." In a final pass, it further refines the objects and produces the finished 3D scene. Microsoft says that its online system can not only "rapidly segment 3D scenes," but also "learn from these labels" to perform the tasks better and faster in the future.

Once the system is perfected, users may one day be able to just walk around with a Kinect or other depth-sensing camera and create a detailed map of a scene, complete with individual objects. The possibilities for using such maps are endless, but Microsoft cited a few examples "from robot guidance, to aiding partially sighted people, to helping us find objects and navigate our worlds, or experience new types of augmented realities."