Light field landscape generated with the help of style transfer

As a Creative Technologist at MediaMonks, a global production agency, I am always being asked about ML, AI, neural networks, and so on. What are they? What can they do? How can we use them?

This post represents the first in a series I will be writing exploring the space where AI, Creativity, and 3D content meet.

The way I see it, AI will be the greatest creative tool mankind has ever created. Current visions of AI include millions of unemployed workers or even an apocalypse. But what if our future involves intertwining our creative processes with artificial intelligence instead of being replaced by it?

To begin looking into this future, I started with a simple problem: How do I use neural networks to help procedurally create 3D landscapes? For instance, imagine standing in a virtual space and saying “Computer, make this space look like a Studio Ghibli landscape.”

Landscape from Howl’s Moving Castle

How would we go about doing something like that? Since I’m relatively new to machine learning, my first intuition was to use style transfer.

Style transfer uses the trained filters of a deep convolutional neural network to generate a new image that minimizes two losses at once: a content loss measured against one input image (the content image, say your selfie) and a style loss measured against another (the style image, say a Van Gogh painting).

I know that sounds super technical, so let’s put it this way. Deep convolutional neural networks have recently become really successful at classifying images. You give one a picture, say a picture of a dog, and the neural network can tell that it’s a dog.

The ability of a machine to understand what’s in an image is immensely powerful because it means that it can understand all the details that make up that image. It understands that a dog has droopy ears and a pointy face whereas a cat has pointy ears and a flat face. To put it simply, what a style transfer network does is take the details from two images and combine them.
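To make the "combine the details" idea a bit more concrete, here is a minimal sketch of the two losses that style transfer balances. It is not the implementation I used: the function names and weights are made up for illustration, and the feature maps are assumed to be plain NumPy arrays of shape (channels, height, width), where a real setup would take them from the layers of a pretrained network.

```python
import numpy as np

def gram_matrix(features):
    """Channel-to-channel correlations of a (channels, height, width) feature map.

    The spatial layout is summed away, which is why this captures
    "style" (textures, colors) rather than where things are.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)

def content_loss(generated, content):
    """Mean squared difference between feature maps, position by position."""
    return float(np.mean((generated - content) ** 2))

def style_loss(generated, style):
    """Mean squared difference between the Gram matrices of two feature maps."""
    return float(np.mean((gram_matrix(generated) - gram_matrix(style)) ** 2))

def total_loss(generated, content, style, content_weight=1.0, style_weight=1e3):
    """Weighted sum that the optimizer would try to minimize.

    The default weights are placeholder hyperparameters, not tuned values.
    """
    return (content_weight * content_loss(generated, content)
            + style_weight * style_loss(generated, style))

# Random arrays standing in for real network activations (an assumption).
rng = np.random.default_rng(0)
content_features = rng.random((8, 16, 16))
style_features = rng.random((8, 16, 16))
print(total_loss(content_features, content_features, style_features))
```

The style transfer process repeatedly nudges the pixels of the output image so that this combined number goes down. The two weights are among the hyperparameters you end up tweaking.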

So how do we use this for 3D content generation? A common problem in 3D procedural content generation is terrain generation. Many games combine various forms of noise to create mountains, hills, and plains in the form of a grayscale height map. Something like this:

Height map generated by noise.

These height maps are then used to displace vertices in a plane, thereby creating hills, valleys, and mountains. These look pretty nice when rendered in 3D, but they don’t hold a candle to the real thing.
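For readers who want to see the idea in code, here is a rough NumPy sketch of both steps: layering several octaves of smooth noise into a grayscale height map, then using it to lift the vertices of a flat plane. This is not the Unity shader I wrote. The function names, the choice of value noise, and all the parameters are assumptions picked to keep the example short.

```python
import numpy as np

def value_noise(size, cells, rng):
    """Smooth noise: random values on a coarse grid, interpolated up to size x size."""
    grid = rng.random((cells + 1, cells + 1))
    coords = np.linspace(0, cells, size, endpoint=False)
    i = coords.astype(int)          # index of the grid cell each pixel falls in
    t = coords - i                  # position inside that cell, 0..1
    t = t * t * (3 - 2 * t)         # smoothstep to soften the cell borders
    top = grid[i][:, i] * (1 - t)[None, :] + grid[i][:, i + 1] * t[None, :]
    bottom = grid[i + 1][:, i] * (1 - t)[None, :] + grid[i + 1][:, i + 1] * t[None, :]
    return top * (1 - t)[:, None] + bottom * t[:, None]

def fractal_height_map(size=256, octaves=5, seed=0):
    """Sum octaves of noise: each one has half the amplitude and twice the detail."""
    rng = np.random.default_rng(seed)
    height = np.zeros((size, size))
    amplitude, cells, total = 1.0, 4, 0.0
    for _ in range(octaves):
        height += amplitude * value_noise(size, cells, rng)
        total += amplitude
        amplitude *= 0.5
        cells *= 2
    return height / total           # normalized to the 0..1 grayscale range

def displace_plane(height_map, height_scale=1.0):
    """Turn a height map into (x, y, z) vertices, with y taken from the map."""
    rows, cols = height_map.shape
    xs, zs = np.meshgrid(np.linspace(0, 1, cols), np.linspace(0, 1, rows))
    return np.stack([xs, height_map * height_scale, zs], axis=-1)

height_map = fractal_height_map(size=128)
vertices = displace_plane(height_map, height_scale=0.3)
print(vertices.shape)
```

The low octaves give you the big mountains and valleys, and the high octaves add the small bumps. In a game engine the displacement step would usually happen in a vertex shader rather than on the CPU.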

Height map taken from actual elevation data of the earth.

Even though we can generate nice looking and complex terrain with plain old mathematics, it’s still really hard to simulate all of the processes that create real terrain, such as plate tectonics and erosion.

But this is where neural networks come in. With the power of style transfer, we don’t have to. We just have to create the general shapes we want, and then the neural network will add all the realistic looking details.

So going back to the Ghibli use case, I had to find real world elevation data that looked like the terrain in the image. To me it looked like somewhere in the Alps. By doing a Google image search for “Alps” I came across this image:

Which turned out to be Val Gardena, a valley in the Italian Dolomites. Knowing that, I went over to terrain.party, a website where you can download elevation data from the Earth. I searched Val Gardena and was able to download this height map:

Perhaps in the future you could automate this whole process, but for now manually searching on the web was fine.

Now it was time to start building everything. Using Unity, I wrote a noise shader that would allow me to tweak the noise generation in realtime. After a lot of experimentation, I discovered that the best results occur when the generated noise has some similarities with the desired output:

Procedural noise on the left, real terrain data on the right.

Finally it was time to use style transfer. After lots of Googling and YouTube, I settled on using this implementation from this video from Siraj Raval’s YouTube channel. If you’re trying to learn machine learning, I can’t recommend his channel enough! He’s killing it!

Here’s what the style transfer output looks like:

Style transfer output on the left, real terrain on the right. Both are planes whose vertices are being displaced by the height map texture.

Pretty cool, right? There are probably hundreds of things you could do to optimize this, but for me it was good enough. Now it was time to drop a virtual camera into the generated terrain, and use the original Ghibli image as the style image and the camera render as the content image. Just like with the noise, I decided that it would be better if the content image had similarities with the style image. I therefore imported the height map into Blender in order to do some very simple texture painting.

The height map in Blender. Full height map terrain on left, and zoomed in height map terrain on right.

I picked a camera position and angle that resembled the Ghibli image:

A simple Blender render.

And after that I did some very simple texture painting, gave it a sky color, and added a simple particle system for the flowers.

An obviously very quick and dirty render.

And this is where it gets cool. It was time to run the simple Blender render through style transfer and see what came out. After spending lots of time tweaking the hyperparameters I got this:

The final output of the whole process.

And here’s a final comparison with the input image:

Obviously nowhere near as good as the original masters at Studio Ghibli, but not bad for a mindless machine either.

But where to go from here? Ideally you’d want to do this whole process in real time, but even with a GPU-enabled neural network on my VR PC, the style transfer process takes around 4 minutes. Obviously this is nowhere near fast enough for real time 3D graphics. You could try using fast style transfer, which can output an image in under a second, but I didn’t like the artifacts it created in my images. So how could we take this still image and make something that was 3D and rendered in real time?

The answer was light fields!

Light fields were demonstrated way back in 1996! But they have been gaining popularity with the rise of VR, helped along by companies like OTOY. A light field is just a fancy term for an array of images taken by an array of cameras. They look like this:

The array captures the rays of light passing through a given volume from many viewpoints, which makes it possible to synthesize new camera angles. See the original video here. These are great for realtime graphics because you can pre-render everything. This way you get the freedom of real time graphics, but with the image quality of pre-rendered scenes.
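Here is a deliberately simplified sketch of what "synthesizing a new camera angle" can mean: given a grid of pre-rendered views and a camera position that falls between grid points, blend the four nearest views. The function name and the array layout (rows, cols, height, width, 3) are assumptions for this example. A full light field renderer does more than this, since it also has to work out which ray each output pixel corresponds to, and that part is skipped here.

```python
import numpy as np

def blend_views(views, u, v):
    """Bilinearly blend the four views nearest to camera position (u, v).

    views: array of shape (rows, cols, height, width, 3)
    u, v:  continuous column and row coordinates in the camera grid,
           clamped to the edges of the grid
    """
    rows, cols = views.shape[:2]
    u = float(np.clip(u, 0, cols - 1))
    v = float(np.clip(v, 0, rows - 1))
    c0, r0 = int(np.floor(u)), int(np.floor(v))
    c1, r1 = min(c0 + 1, cols - 1), min(r0 + 1, rows - 1)
    fu, fv = u - c0, v - r0         # how far we are toward the next camera
    top = (1 - fu) * views[r0, c0] + fu * views[r0, c1]
    bottom = (1 - fu) * views[r1, c0] + fu * views[r1, c1]
    return (1 - fv) * top + fv * bottom

# Tiny random images standing in for an 8x8 grid of pre-rendered views.
rng = np.random.default_rng(0)
views = rng.random((8, 8, 4, 4, 3))
in_between = blend_views(views, u=3.5, v=2.0)
print(in_between.shape)
```

Because every input view is already rendered, producing an in-between view is just a few texture lookups and multiplications, which is the kind of work a GPU can do every frame.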

Now it was time to create my own style transfer light field and light field renderer. I basically reimplemented Andrew Lowndes’ WebGL light field renderer in Unity.

By doing some Blender scripting in Python, I was able to output an 8x8 grid of images which were separately sent through the style transfer network. They were then stitched together into a single image in Unity. Here is the generated style transfer light field:

Downsized because the original is 8192 × 8192!
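The stitching step itself is simple. I did it in Unity, but as a rough stand-in, here is how it could look in NumPy. The function name is made up, and the tiles are assumed to be same-sized arrays of shape (height, width, 3) listed row by row. If the tiles are square and packed without padding, an 8192 × 8192 atlas on an 8x8 grid works out to 1024 × 1024 per tile.

```python
import numpy as np

def stitch_light_field(tiles, grid=8):
    """Pack grid*grid images (row-major order) into one big atlas image."""
    if len(tiles) != grid * grid:
        raise ValueError(f"expected {grid * grid} tiles, got {len(tiles)}")
    # Join each row of tiles side by side, then stack the rows vertically.
    rows = [np.concatenate(tiles[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

# Small placeholder tiles so the example runs quickly; each tile is filled
# with its own index so you can see where it ends up in the atlas.
tiles = [np.full((16, 16, 3), index, dtype=np.uint8) for index in range(64)]
atlas = stitch_light_field(tiles)
print(atlas.shape)
```

Keeping everything in one texture means the renderer only has to bind a single image and can pick the right view by offsetting its texture coordinates.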

And here is a video of the light field renderer in action!

As you can see, by fully pre-rendering multiple camera views, we can create a light field renderer that offers many of the things a traditional rendering system does. But it also allows us to use generative neural networks to create 3D content now!

But the title of this post is “Neural Networks and The Future of 3D Procedural Content Generation.” Is this really how procedural content will be created in the future? In this exact way? Probably not. There are lots of optimizations to be done, and perhaps other generative algorithms such as GANs would be better for this type of task.

What this post demonstrates is the idea that neural networks could radically change how we generate 3D content. I went with light fields because currently my GPU is not fast enough to run style transfer or any other generative network at 60 FPS. But if we do get to that point, it’s entirely possible we will see generative neural networks become an alternative rendering pipeline to the standard rasterization approach. In this way, neural networks could generate each frame of a game in real time, based on realtime feedback from the user.

But it also potentially allows for a much more powerful creative approach, for the creator and the end user. Imagine playing Gears of War, but then telling the computer “Keep the gameplay, story, and 3D models, but make it look like Zelda: Breath of the Wild.” This is how creating or playing a future gaming experience could be, all because computers now know what things “look like” and can make other things “look like” them too.

Like I said earlier, the ability of computers to understand an image, while maybe not as impressive as Samantha from Her or HAL from 2001, is still an incredibly powerful thing. It’s a recent innovation and there are still so many possibilities to discover!