A gif demonstrating how a Convolutional Neural Network can be used for a VR level editor type interface.

Seriously … It’s like Harry Potter.

TL;DR: video here.

A while back, I wrote the first blog post in a series about the intersection of AI, creativity, and 3D content generation. This post is a continuation of that series.

My dream VR application is ultimately a seamless extension of my imagination. Sure, it’s a lofty goal, but my intuition is that machine learning techniques can help make this idea a reality.

Specifically, this blog post explores the use of convolutional neural networks to dramatically change interaction design in VR.

Designing VR in VR

Anyone who has designed a VR application will tell you that working in traditional 2D mediums will only get you so far. To make VR, you need to be in VR.

However, since VR is such a new medium, the industry supporting VR creation is also new. Companies like Unity and Epic are creating VR level editors, but these end up feeling like VR desktops: menus upon menus, entrenched in design principles for 2D mediums.

A demonstration of a VR editor.

For me, the process of creating a VR environment in real time needs to be fast and effortless. But how do we create such a system without making use of 2D menus?

What if we could draw what we wanted? What if, instead of making me navigate through a series of options, the system could understand what I need?

Enter Machine Learning

If you think about it, by entering into a VR environment, everything you do within that environment is being translated into data. Everything you look at, every twitch of your arm, every action you take can potentially be recorded.

Dystopian cyberpunk ramifications aside, this is quite compelling from an ML standpoint, because what any ML model needs is lots of quality data. For the purposes of this blog post, we’ll be looking at how convolutional neural networks (CNNs) can be used for gesture recognition that replaces current 2D menu design principles.

CNNs have a remarkable ability to recognize data that has any sort of spatial relationship, such as images. What if we could use this ability to create interfaces that figure out what you want? For instance, instead of picking a prop from a long list of items, what if I just sketched it out?

I’m definitely not the first person to think of this. I’m essentially talking about Google Quick Draw or AutoDraw in VR. In fact, companies like Adobe are already exploring this type of effect in their products. Check out this recent demonstration of ProjectQuick3D. I’m not sure whether these implementations use a CNN or something else, but functionally the result is the same.

My First Model

As a first step, I took three classes from the Quick Draw dataset and used them to train a simple CNN. For the architecture, I took a basic MNIST example, since the data for MNIST and Quick Draw are very similar.
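To get a feel for how small such a network is, the sketch below walks a 28x28 bitmap through an MNIST-style stack in plain Python and reports each layer’s output shape and parameter count. The layer sizes (two 3x3 convolutions with 32 and 64 filters, a 2x2 max pool, a 128-unit dense layer) follow one common version of the Keras MNIST CNN example. They are an assumption, not a record of the model trained for this post, and the helper names are made up for illustration.

```python
# Layer sizes are borrowed from a common Keras MNIST CNN example (assumption).
# Dropout layers are left out: they change neither shapes nor parameter counts.

def conv2d(shape, filters, kernel=3):
    """Unpadded convolution: return (output shape, trainable parameter count)."""
    height, width, channels = shape
    out = (height - kernel + 1, width - kernel + 1, filters)
    params = (kernel * kernel * channels + 1) * filters  # weights plus one bias per filter
    return out, params


def max_pool(shape, size=2):
    """Non-overlapping max pooling has no trainable parameters."""
    height, width, channels = shape
    return (height // size, width // size, channels), 0


def dense(inputs, units):
    """Fully connected layer: one weight per input plus a bias, per unit."""
    return (units,), (inputs + 1) * units


def summarize(num_classes=3, input_shape=(28, 28, 1)):
    """List (layer name, output shape, parameter count) for the assumed stack."""
    rows = []
    shape, params = conv2d(input_shape, 32)
    rows.append(("conv 3x3 x32", shape, params))
    shape, params = conv2d(shape, 64)
    rows.append(("conv 3x3 x64", shape, params))
    shape, params = max_pool(shape)
    rows.append(("max pool 2x2", shape, params))
    flat = shape[0] * shape[1] * shape[2]
    shape, params = dense(flat, 128)
    rows.append(("dense 128", shape, params))
    shape, params = dense(128, num_classes)
    rows.append(("softmax output", shape, params))
    return rows


for name, shape, params in summarize():
    print(f"{name:<16} {str(shape):<14} {params:>9,}")
```

Under these assumptions, nearly all of the roughly 1.2 million parameters sit in the first dense layer, and going from 3 classes to more only grows the final layer.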

Drawing a circle creates a sphere, while drawing a triangle creates a cube. For those that are detail oriented, I’m sorry, drawing a triangle on a trackpad is much easier than drawing a square :(

As a proof of concept, I started with a 2D interface on my MacBook Pro. The network is trained on circles, squares, and triangles. The idea is that when a user draws a shape, an associated shape is instantiated in 3D.
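Before a drawing can reach an MNIST-style network, the stroke has to become a small image. Here is a minimal sketch, assuming the stroke arrives as a list of (x, y) points and the network expects 28x28 bitmaps like the ones Quick Draw and MNIST provide. The function `rasterize` and its parameters are hypothetical names for illustration, not the code behind the demo.

```python
SIZE = 28  # assumed input resolution, matching MNIST and the Quick Draw bitmaps


def rasterize(points, size=SIZE, margin=2):
    """Scale a stroke into a size x size grid of 0/1 pixels."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    # One scale for both axes, so the shape keeps its aspect ratio.
    span = max(max(xs) - min_x, max(ys) - min_y) or 1.0
    scale = (size - 1 - 2 * margin) / span

    def to_pixel(p):
        return (margin + (p[0] - min_x) * scale, margin + (p[1] - min_y) * scale)

    grid = [[0] * size for _ in range(size)]
    # Walk each segment in sub-pixel steps and mark every pixel it touches.
    for start, end in zip(points, points[1:]):
        (x0, y0), (x1, y1) = to_pixel(start), to_pixel(end)
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        for i in range(steps + 1):
            t = i / steps
            col = round(x0 + (x1 - x0) * t)
            row = round(y0 + (y1 - y0) * t)
            grid[row][col] = 1
    return grid


triangle = [(0, 0), (10, 0), (5, 8), (0, 0)]
for row in rasterize(triangle):
    print("".join("#" if pixel else "." for pixel in row))
```

The resulting grid would then be passed to the classifier. A real pipeline would likely also match the stroke thickness and anti-aliasing of the training data, which this sketch ignores.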

Knowing that the basic principle worked, it was time to take the whole thing into VR!

I decided to go with Leap Motion because I wanted the interactions to feel natural and effortless. While Leap Motion’s tracking wasn’t as precise as, say, a Vive controller, I found that once I accounted for its tracking quirks, the interactions felt very fluid.

A simple model with classes for “tree”, “bush”, and “flower”

So this was super cool, and it was quite surprising how well it worked. But here’s where I ran into the first problem with my assumptions.

Drawing was a frictionless interaction, but only the first time you do it. Imagine having to sketch a tree for every tree in a virtual forest. Some additional UI thinking was required.

What if instead of drawing a tree, I drew a square? In other words what if I mapped each object to a primitive shape?

“Circle” = “Bush”, “Square” = “Tree”, “Triangle” = “Flower”

This made the interaction easier, quicker, and very satisfying. However, it’s easy to imagine how you could run out of primitive shapes quickly. Perhaps you could use numbers, but then we’re moving away from the idea of effortlessly conjuring a 3D object. If you have to remember an arbitrary mapping, it’s too hard.

My next step was to separate the drawing and placing mechanics. Switching modes is done by tapping different spheres aligned with my wrist. Above my hand is an icon indicating which object I currently have selected. This lets me choose a new object intuitively and then place it as many times as I like.
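The shape-to-object mapping itself can be as small as a dictionary lookup on the classifier’s output. A minimal sketch, assuming the model returns one probability per class in a fixed order; the label order, the 0.6 confidence threshold, and the function name are all assumptions made for illustration.

```python
# Hypothetical mapping from recognized shapes to scene objects,
# mirroring the pairing described above.
PREFAB_FOR_SHAPE = {"circle": "bush", "square": "tree", "triangle": "flower"}
LABELS = ["circle", "square", "triangle"]  # assumed order of the model's outputs


def pick_prefab(probabilities, threshold=0.6):
    """Return the object for the most likely shape, or None if the model is unsure."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    if probabilities[best] < threshold:
        return None  # skip ambiguous scribbles rather than spawn the wrong object
    return PREFAB_FOR_SHAPE[LABELS[best]]


print(pick_prefab([0.10, 0.85, 0.05]))
print(pick_prefab([0.40, 0.35, 0.25]))
```

The table makes the scaling problem concrete: every new object needs its own memorable shape, and the dictionary grows faster than the supply of shapes anyone can remember.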

Switching things up with a very hacky interface. Here you can see drawing and placement as two separate interaction modes.
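One way to think about the two modes is as a tiny state machine: sketching only changes the current selection, and placing reuses that selection as often as needed. The sketch below is illustrative only; the class, method names, and gesture events are hypothetical and not taken from the actual project.

```python
from enum import Enum


class Mode(Enum):
    DRAW = "draw"
    PLACE = "place"


class Editor:
    """Hypothetical controller for the draw/place split (names are illustrative)."""

    def __init__(self):
        self.mode = Mode.DRAW
        self.selected = None  # object type shown on the icon above the hand
        self.placed = []      # (object type, position) pairs

    def tap_mode_sphere(self, mode):
        """Called when one of the wrist spheres is tapped."""
        self.mode = mode

    def on_sketch_recognized(self, object_type):
        # Drawing only changes the selection; it never spawns anything.
        if self.mode is Mode.DRAW:
            self.selected = object_type

    def on_place_gesture(self, position):
        # Placing reuses the selection, so one sketch can yield many objects.
        if self.mode is Mode.PLACE and self.selected is not None:
            self.placed.append((self.selected, position))


editor = Editor()
editor.on_sketch_recognized("tree")
editor.tap_mode_sphere(Mode.PLACE)
for x in range(3):
    editor.on_place_gesture((x, 0, 0))
print(editor.placed)
```

With this split, a forest needs one sketch and many placement gestures, rather than one sketch per tree.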

It was time to really dive into CNNs and start making a custom model that could use more classes. Since I’m still an ML newb, I used a relatively small dataset of 11 classes. After manually playing around with the architecture for a day or two, I stumbled upon Hyperas, a library that would help automate my architecture optimization. The final model I used was this:

This resulted in above 95% accuracy on the evaluation set. Now I had more classes, meaning I could magically pluck more objects out of thin air. However, since the classes from Quick Draw are seemingly arbitrary, I was left with a rather random model.