This article is a follow-up to my devcom’18 talk “Artpocalypse — the AI Future of Art Production”. I divided the problem into three domains: 2D texture and material workflows, rigging and animation, and finally 3D content. In the conclusion I tackle some of the fundamental issues of generating art content for games with the help of artificial intelligence, including the economic ones, and clarify where the click-bait title comes from.

I won’t delve into all the areas where machine learning will impact the gaming industry: level design, music generation, multiplayer and chat ‘policing’ or matchmaking, as well as writing code or bug tracking. These are all fascinating subjects, each deserving its own piece. Even narrowing it down to computer vision powered by deep neural networks, we can see many interesting solutions on the horizon, e.g. super-resolution upscaling or denoising raytraced scenes in real time, but especially in the area of AR/VR: the Google-backed Ubiquity6, 6D.AI, Niantic, Magic Leap and obviously Microsoft’s HoloLens team all deserve a mention in this context. Yet this piece will take on just one slice of the grand matter of machine learning in gaming: art content creation. This is a huge topic on its own and I present only an overview of the subject. I try to provide relevant links and references for all the research projects, demos and solutions I discuss. Every single one of them would deserve a separate in-depth piece, and I can only recommend further exploration of the subject.

Scene segmentation, depth estimation and object recognition all allow for creating richer experiences in AR. Image: Niantic

Before I delve into examples of automated art content creation and start shedding some light on the current solutions and state-of-the-art research in this area, I should briefly mention what’s so special about neural networks and deep learning. They are by no means a new concept; in fact, the research papers that were foundational to today’s breakthrough technologies were published two or three decades ago. Convolutional networks, for example, the “hot new thing” of the 2012 ImageNet competition that smoked previously used methods in terms of image recognition accuracy, were already successfully used in 1989. One of the reasons for this delayed implementation on a wider scale was that, apart from appropriate mathematical models, neural networks need lots of training data and huge processing capabilities (interestingly, it is thanks to consumer gaming hardware, namely GPUs, that the cost of computing dropped significantly). The conditions for the perfect deep learning storm were finally met around the time Geoffrey Hinton’s team won the aforementioned image recognition competition and the first autonomous cars crossed the finish line of the DARPA autonomous driving challenge.

So what’s so special about deep neural networks in the context of image recognition, generation and processing? Speaking very generally, deep neural networks are great at finding patterns, and that’s something we often link with intelligence or even creativity. Whether it’s Jesus sightings on a piece of toast, getting spooked by a strange shadow in our peripheral vision or contemplating abstract art, it’s all about recognizing, extrapolating or generating patterns. Even telling jokes is often about breaking apparent patterns and finding ones that are less obvious, counter-intuitive. For some evolutionary reason, it’s something that tickles our primate brains. Art often thrives in the space between noise and randomness on one side, and order and recognizable patterns on the other.

Our brains immediately recognize the familiar ‘pattern’ of a human silhouette. Sculpture by Antony Gormley

The overwhelming majority of deep neural network models currently used for supervised learning and image recognition tasks roughly fit my very crude description. This is a fascinating field of research and I’m not even scratching the surface here. I just want to stress that there’s no mystery behind those algorithms: no “black box”, no advanced mathematical concepts apart from some basic calculus and linear algebra. Of course, as mentioned before, most networks need huge amounts of examples, often labelled data, and thousands upon thousands of iterations to tweak the parameters of the network to achieve a satisfactory level of accuracy.
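To make the “no mystery” claim concrete, here is a minimal sketch of that tweaking loop: a single-layer network trained by gradient descent on made-up toy data. Everything here, from the data to the learning rate, is illustrative, not any particular production model.

```python
import numpy as np

# Toy data: 100 samples, 3 features, binary labels. A stand-in for
# the huge labelled datasets real networks need.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X.sum(axis=1) > 0).astype(float)

w = rng.normal(size=3)            # the network's trainable parameters
b = 0.0
lr = 0.1                          # learning rate

for step in range(1000):          # thousands of tweaking iterations
    z = X @ w + b                 # linear algebra: one layer's output
    p = 1.0 / (1.0 + np.exp(-z))  # sigmoid activation
    grad = p - y                  # basic calculus: d(loss)/d(z) for cross-entropy
    w -= lr * (X.T @ grad) / len(X)
    b -= lr * grad.mean()
```

That really is the whole trick; deep models just stack more layers and repeat this loop on far more data.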

2D WORKFLOWS - is AI production ready?

Before we get to actual game art production I need to mention one very important and fairly recent addition to the growing family of models: so-called GANs, or generative adversarial networks. The ‘adversarial’ part is the key to this ingenious model: two networks are trained side by side, one an artist, a forger, the other an art critic, trying to distinguish between an original sample from the dataset and one crafted by the forger network. Behind most of the incredible and sometimes disturbing results, whether it’s dreaming up non-existent celebrities, turning horses into zebras, line drawings into cats, or creating so-called deep fakes, there’s a GAN involved (or a VAE, a variational autoencoder, a close ‘cousin’ of GANs).
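For the curious, here is a heavily simplified sketch of that forger-versus-critic loop in PyTorch. The toy 16-dimensional ‘samples’ stand in for real images; an actual GAN like BigGAN or StyleGAN uses deep convolutional networks, but the adversarial training dance is the same.

```python
import torch
import torch.nn as nn

def real_data(n):
    # Toy 'originals': 16-dim vectors from a shifted normal distribution,
    # standing in for real images from a dataset.
    return torch.randn(n, 16) * 2.0 + 3.0

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 16))  # forger
D = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))  # critic
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
loss = nn.BCEWithLogitsLoss()

for step in range(2000):
    # 1) Train the critic to tell originals from forgeries.
    fake = G(torch.randn(64, 8)).detach()
    real = real_data(64)
    d_loss = loss(D(real), torch.ones(64, 1)) + loss(D(fake), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 2) Train the forger to fool the critic.
    fake = G(torch.randn(64, 8))
    g_loss = loss(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```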

Since my talk at Devcom in August last year, there have been several big breakthroughs and new publications that made a huge impact and attracted interest both inside and outside the machine learning community. Starting with DeepMind’s BigGAN, which generates images from any of the 1000 categories of the ImageNet dataset (and can obviously mix and blend them). The best way to get a feel for the capabilities of the model is to check out Ganbreeder.

Another incredible publication was StyleGAN from NVIDIA. Among other things, it can create incredibly realistic faces, potentially giving some control over semantically meaningful parameters (age, gender, hair color etc.). The paper shows that it can also generate cars, bedrooms and cats (of course!). This research has also made the news and became known outside the machine learning community thanks to thispersondoesnotexist.com, a site created by Phillip Wang that displays a random, non-existent person with every refresh of the webpage.

Synthesised, non-existing people from the StyleGAN model
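Much of that control comes from the model’s latent space. The sketch below is purely illustrative, with a hypothetical stub standing in for a trained generator: blending two latent vectors is the basic mechanic behind sliders for age, gender or hair color (StyleGAN itself adds a more elaborate style-based architecture on top of this idea).

```python
import numpy as np

def generator(z):
    """Hypothetical stand-in for a trained GAN generator; StyleGAN maps
    a latent vector like this to a photorealistic face image."""
    return np.random.rand(256, 256, 3)  # placeholder 'image'

# Walking the latent space: blend two identities by interpolating
# between their latent vectors.
z_a, z_b = np.random.randn(512), np.random.randn(512)
frames = [generator((1 - t) * z_a + t * z_b) for t in np.linspace(0.0, 1.0, 5)]
```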

You might be asking yourself how all of this relates to the everyday work of game artists. Apart from being an interesting gimmick, is it actually useful in game content production? The answer is ‘yes’: generating materials, creating texture variations, upscaling, and processing photogrammetrically captured textures are all areas where we already have some tools available, all powered by machine learning, including GANs.

NVIDIA released three tools as part of GameWorks: Materials & Textures tools. It’s more of a showcase of the potential of deep learning in game development than a production-ready tool, but it’s definitely worth trying.

Another company that has been working for some time on a very promising piece of software, named Artomatix, also relies heavily on recent advances in computer vision and deep learning. The app has many options to seamlessly tile your scanned materials, autogenerate additional map channels, and transfer style across multiple assets.
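Publicly documented texture-synthesis research of this kind typically builds on neural style transfer, where ‘style’ is captured by correlations between CNN feature channels. Below is a minimal sketch of that Gram-matrix style loss, with random tensors standing in for features extracted by a pretrained network; I’m not claiming this is Artomatix’s actual implementation.

```python
import torch

def gram_matrix(features):
    # features: (channels, height, width) activation map from a CNN layer
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return (f @ f.T) / (c * h * w)  # channel-to-channel correlations

# Stand-ins for feature maps, e.g. from a pretrained VGG layer run on a
# style exemplar and on the texture being generated.
style_feats = torch.randn(64, 32, 32)
output_feats = torch.randn(64, 32, 32, requires_grad=True)

# Style loss matches texture statistics, not pixel positions, which is
# why the approach lends itself to seamless, non-repeating textures.
style_loss = ((gram_matrix(output_feats) - gram_matrix(style_feats)) ** 2).sum()
style_loss.backward()  # gradients flow back to the generated texture
```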

Another product, currently in closed beta (I have a feeling this might change at GDC ’19), is Alchemist from Allegorithmic, first shown publicly at SIGGRAPH 2018. Similarly to Artomatix, it relies heavily on recent advancements in computer vision and machine learning to streamline materials-processing workflows. You can find out more about Alchemist by watching the recent showcase.

This novel approach to 2D texture and material production pipelines, where the latest advances in machine learning and computer vision disrupt current workflows, is perfectly captured by the graphic from Allegorithmic. It will become an ever more important pillar of game content production.

Animation

Animation is one of the most crucial parts of game content creation pipelines. It’s also a skill that is hard to master, takes both a lot of technical ability and artistic flair, and is tremendously time consuming. As a result, the ‘animation’ part of the budget is often a big burden for many teams. Not surprisingly, there have been many attempts at automating and streamlining the whole workflow. I remember the jaw-dropping results that NaturalMotion’s Endorphin showed more than 15 years ago: a system that, already back then, relied heavily on machine learning and procedural techniques. Yet despite a few successful use cases (notably the GTA series, via NaturalMotion’s runtime engine Euphoria), this approach didn’t become common practice.

More recently we’ve had a few examples of companies trying to make rigging and animation more accessible. One notable example of such a solution, which even works within your browser, with all the computational heavy lifting done in the cloud, is Mixamo. It allows for fast and easy rigging and applying animations to any humanoid character. Now part of Adobe Creative Cloud, it has unfortunately been sidetracked, and it seems there are no further plans for development. Despite Mixamo’s unfortunate fate, it showed how the whole process could be improved and how the barrier of entry for creating animated characters could be lowered significantly.

It could be that an end-to-end pipeline isn’t the best approach, and we need to break the whole pipeline into individual components with more control over the process. Motion capture, once reserved for big-budget Hollywood movies, has become ubiquitous even in bigger indie titles, but it’s still not accessible to everyone and often creates more problems than it solves for inexperienced teams.

Even in the case of processing motion capture data, there have been at least two major research projects outside of academia focused on automatic clean-up and post-processing with the help of deep neural networks. Independently, teams from Adobe and Ubisoft were able to achieve impressive results using this approach. This also shows that finding patterns in data, as long as the input has a consistent structure, is something deep learning is able to tackle. Personally, I might add that cleaning up mocap data is one job I’m actually glad is getting automated.
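As a rough illustration of the idea, and not the actual Adobe or Ubisoft architecture, a denoising autoencoder over windows of joint positions might look like this: corrupt clean mocap synthetically, then train the network to map noisy capture back onto the learned ‘motion manifold’.

```python
import torch
import torch.nn as nn

# Each training sample: a window of F frames x J joints x 3 coordinates,
# flattened into one vector.
F, J = 60, 22
dim = F * J * 3

autoencoder = nn.Sequential(
    nn.Linear(dim, 512), nn.ReLU(),
    nn.Linear(512, 128), nn.ReLU(),   # compressed 'motion manifold'
    nn.Linear(128, 512), nn.ReLU(),
    nn.Linear(512, dim),
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-4)

clean = torch.randn(256, dim)                  # stand-in for real clips
noisy = clean + 0.1 * torch.randn_like(clean)  # simulated marker jitter

for epoch in range(100):
    recon = autoencoder(noisy)        # network's attempt at clean motion
    loss = ((recon - clean) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```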

Machine learning allows us to capture poses and track hands and facial expressions without markers, often using just an app, or even by processing YouTube clips. I should note that many of the currently available solutions are not as robust as the set-ups provided by multimillion-dollar motion capture studios. Current state-of-the-art AI pose estimation could be compared to Google Translate: it’s not hard to find cases where it fails, but at the same time it’s orders of magnitude easier and cheaper than hiring a personal translator. Taking into account the exponential advances in this field and the huge interest from many big tech companies (Magic Leap, Facebook, Google, Microsoft, all working on their own AR solutions), I would be surprised if we don’t get high-quality motion capture solutions from low-quality 2D video input within the next few years. There are already several companies and research projects in this field worth trying.

Left to right: OpenPose (pose estimation, hand and face tracking), RADiCAL (an AI-powered app for mocap) and DensePose (a Facebook Research project matching a 3D mesh to a single 2D input).
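Conceptually, turning video into animation data with these tools boils down to running a per-frame 2D pose estimator and then stabilizing the result over time. In the sketch below, `estimate_keypoints` is a hypothetical stub; in practice you would plug in a real model such as OpenPose.

```python
import numpy as np

def estimate_keypoints(frame):
    """Hypothetical stand-in for a real 2D pose estimator such as
    OpenPose; returns (x, y, confidence) per joint for one frame."""
    return np.random.rand(18, 3)  # 18 COCO-style joints

def video_to_track(frames, min_conf=0.3):
    track, prev = [], None
    for frame in frames:
        kp = estimate_keypoints(frame)
        xy = kp[:, :2].copy()
        if prev is not None:
            low = kp[:, 2] < min_conf
            xy[low] = prev[low]   # hold last good position for shaky joints
        track.append(xy)
        prev = xy
    return np.array(track)        # (frames, joints, 2) animation curves

frames = [None] * 120             # stand-in for decoded video frames
animation = video_to_track(frames)
```

The held-position trick is the crudest possible clean-up; the neural post-processing discussed above exists precisely because raw per-frame detections are this jittery.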

For facial performance capture we also have quite a few machine learning solutions at our disposal, from the USC/NVIDIA and Disney Research work on single-view, real-time performance capture all the way to commercially available solutions like Dynamixyz or Unity’s facial performance capture powered by an iPhone. All markerless, running in real time, and often requiring nothing more than a smartphone and a bit of acting chops.

High-quality facial performance capture from a single view, without any markers.

AR performance capture using an iPhone, shown at Unite Berlin 2018

Let’s assume we’ve captured all the necessary motion capture sequences with our phone, or by scraping the web for clips of people doing parkour or taking martial arts classes, and the AI took care of all the processing and clean-up. Using current workflows, we would often end up in the land of finite-state-machine spaghetti: the more complex the animations and interactions we’d like to implement, the more complicated the whole set-up becomes, at some point being completely unmanageable for any mortal.

image: ReCore/Unity
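For readers who haven’t had the pleasure, here is a toy version of such a state machine. Every added state or input multiplies the hand-authored transitions, which is exactly how the spaghetti grows.

```python
# A toy animation state machine. Every new state multiplies the
# hand-authored transitions; real controllers quickly dwarf this.
transitions = {
    ("idle", "move_input"): "walk",
    ("walk", "run_input"):  "run",
    ("walk", "stop_input"): "idle",
    ("run",  "stop_input"): "idle",
    ("idle", "jump_input"): "jump",
    ("walk", "jump_input"): "jump",
    ("run",  "jump_input"): "jump",
    ("jump", "landed"):     "idle",
    # add crouch, climb, swim, attack... and the table explodes
}

state = "idle"
for event in ["move_input", "jump_input", "landed"]:
    state = transitions.get((state, event), state)
    print(state)  # walk, jump, idle
```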

Fortunately, machine learning comes to the rescue once again and is able to sample our existing motion library to create realistic character movement that conforms to the surroundings in a way that would otherwise require an immensely elaborate set-up.

Above, left to right: Phase-Functioned Neural Networks, Unity’s Kinematica and Mode-Adaptive Neural Networks for Quadruped Motion Control.
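The simplest member of this family, motion matching (the idea behind Kinematica), can be sketched in a few lines: treat the motion library as a database of per-frame feature vectors and repeatedly search it for the frame that best matches the character’s current pose and desired trajectory. The data below is random and stands in for features extracted from real mocap; neural approaches like Phase-Functioned Neural Networks replace the explicit search with a learned function.

```python
import numpy as np

rng = np.random.default_rng(1)
# Stand-in motion database: per-frame feature vectors (joint positions,
# velocities, future-trajectory samples) extracted from mocap clips.
db_features = rng.normal(size=(10_000, 27))
db_poses = rng.normal(size=(10_000, 66))  # full pose stored per frame

def motion_match(query, weights):
    """Nearest-neighbour search: find the library frame whose situation
    best matches the character's current pose plus desired trajectory."""
    dists = np.linalg.norm((db_features - query) * weights, axis=1)
    return int(np.argmin(dists))

weights = np.ones(27)       # tune to favour trajectory over pose, etc.
query = rng.normal(size=27) # built from the live character every N frames
best = motion_match(query, weights)
next_pose = db_poses[best]  # play from here until the next search
```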

Going 3D — challenges of the third dimension

Finally we get to what is, in my opinion, the most difficult area for automated content creation: 3D models. First of all, it’s literally more dimensions, so it’s immediately more computationally challenging for any machine learning algorithm, even when we try to extend the methods used in 2D image generation and use voxels to describe 3D data. Voxels produce interesting results and give us the possibility to play around with GANs in three dimensions, but unless we settle for a voxelized aesthetic, we will sooner or later find that most content cannot be directly derived from a voxel description (even with optimization methods like octrees).
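To see why the extra dimension hurts, compare the 2D and 3D versions of the same convolution. This is a PyTorch sketch with random tensors; the cell counts are the point, not the model.

```python
import torch
import torch.nn as nn

# The same convolution idea, one dimension up: Conv2d on images
# becomes Conv3d on voxel grids.
conv2d = nn.Conv2d(1, 32, kernel_size=3, padding=1)
conv3d = nn.Conv3d(1, 32, kernel_size=3, padding=1)

image = torch.randn(1, 1, 256, 256)     #  65,536 cells
voxels = torch.randn(1, 1, 64, 64, 64)  # 262,144 cells at far coarser resolution
print(conv2d(image).shape, conv3d(voxels).shape)

# Doubling voxel resolution multiplies memory by 8: a 512^3 grid already
# holds about 134 million cells, which is why voxel GANs stay low-res.
```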

Neural network models that use convolutions work really well for structured data with consistent input dimensions. Images of a certain width and height are basically this ideal input, but with 3D and the unstructured nature of the data, things get trickier and we end up with non-Euclidean geometries, graph mathematics and so on. There’s already a lot of research in this area, as other industries and domains deal with unstructured data as well, but nothing with such a proven track record and breadth of applications as 2D image processing, recognition and generation.
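A mesh, for instance, is a graph of vertices and edges with no regular grid for a convolution kernel to slide over. One common workaround is graph convolution; below is a minimal NumPy sketch of the propagation rule from Kipf and Welling’s GCN paper, applied to a toy five-vertex ‘mesh’.

```python
import numpy as np

# Build the adjacency matrix of a tiny mesh loop (5 vertices).
num_v = 5
A = np.zeros((num_v, num_v))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]:
    A[i, j] = A[j, i] = 1

A_hat = A + np.eye(num_v)                 # add self-connections
D_inv_sqrt = np.diag(A_hat.sum(1) ** -0.5)
norm_A = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization

H = np.random.randn(num_v, 3)             # per-vertex features (e.g. positions)
W = np.random.randn(3, 8)                 # learnable weights
H_next = np.maximum(norm_A @ H @ W, 0)    # one graph-conv layer with ReLU
```

Instead of sliding a kernel, each vertex mixes its neighbours’ features; stacking such layers gives a network that can consume arbitrary mesh connectivity.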

“Model matching”

One possible approach to 3D content generation is relying on an existing database of models and parts of objects. Let’s call it, for lack of a better term, automated kitbashing. There are domains where this will prove useful (user-generated content or AR games, for example), but we are limited by the size and diversity of the dataset.
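A minimal version of automated kitbashing is embedding-based retrieval: encode the query (a photo, a rough blockout) and every library asset into the same descriptor space, then take nearest neighbours. The `embed` function below is a hypothetical stub; systems like IM2CAD learn such descriptors rather than hard-coding them.

```python
import numpy as np

def embed(obj):
    """Hypothetical encoder mapping an image crop or rough shape to a
    descriptor vector; retrieval systems learn something like this."""
    return np.random.rand(128)

# Precomputed descriptors for every model in the asset library.
library = {name: embed(name) for name in ["chair_01", "sofa_02", "lamp_03"]}

def retrieve(query_obj, k=1):
    q = embed(query_obj)
    scored = sorted(library.items(),
                    key=lambda kv: np.linalg.norm(kv[1] - q))
    return [name for name, _ in scored[:k]]  # closest library assets

print(retrieve("photo_of_a_chair"))
```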

IM2CAD by a University of Washington team and Modeling by Example from Princeton

One of the more interesting examples of 3D model retrieval and alignment. After 5 years this project definitely deserves an updated AR version

More interestingly, we could use this approach to create and rapidly prototype whole new scenes from an existing database of assets. Promethean AI shows a glimpse of the opportunities an AI-driven workflow creates for level designers and environment artists alike, letting them focus on aesthetics and actual design rather than painstakingly spending time modeling and placing ‘scene fillers’. The machine learning part, from what has been shown publicly so far, is mainly visible in the speech-to-text commands and semantics, but AI could easily play a bigger role in the visual aspects of such software: recognizing and matching objects, materials or lighting conditions from gathered reference materials or mood boards, and turning the relationship between the user and the software into one analogous to that between an art director and 3D artists.

Preview of Promethean AI software in action — Teaser trailer

Photogrammetry

The idea of automatically creating 3D models without modeling them manually isn’t new, and photogrammetry has been used in game development for some time now. There are some clear benefits to this, but also many limitations, and some teams might find it creates more problems than it solves (problems with consistency in visual style or with optimization are a common occurrence).

The very process of turning images into models is also being optimized with DNNs and could potentially allow for better results from sparse or lower-quality data, but the current methods and software are good enough for their purposes. The real game changer will be improving the post-processing that turns photogrammetrically acquired point clouds into mesh geometry with topology and textures that could be considered game-ready.
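The already-automated part of that pipeline can be sketched with the open-source Open3D library; everything downstream of this script (clean topology, sensible UVs) is where the hard, unsolved work begins. The input file name is, of course, just an assumption.

```python
import open3d as o3d

# Load a photogrammetry point cloud and turn it into a mesh via
# Poisson surface reconstruction, then decimate toward a game budget.
pcd = o3d.io.read_point_cloud("scan.ply")  # assumed input file
pcd.estimate_normals()                     # Poisson needs oriented normals
mesh, densities = o3d.geometry.TriangleMesh.create_from_point_cloud_poisson(
    pcd, depth=9)
mesh = mesh.simplify_quadric_decimation(target_number_of_triangles=20_000)
o3d.io.write_triangle_mesh("scan_mesh.obj", mesh)
```

The result is a watertight but messy triangle soup; a human (or future ML tool) still has to retopologize and unwrap it before it is truly game-ready.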

Slide from a 2016 GDC presentation by a team from DICE, and a new pipeline for processing geometry within Houdini

So far we are making the processing of models captured with photogrammetry more efficient step by step. There’s huge potential for greater improvements with machine learning on many of these tasks. Some crucial parts of those pipelines still aren’t solved: automated retopology and even good UV mapping still aren’t at “human level”, and we frequently sacrifice quality for quantity.

The Holy Grail

One important exception to the annoyingly unstructured nature of 3D data, and at the same time a type of asset that is tremendously important in many game productions, is the human body, especially the face. Both the CG industry in general and game content creation teams in particular have proved that achieving realistic digital human representations is possible, and we’ve been climbing out of the Uncanny Valley for the last couple of years.

Unfortunately, most of the projects we’ve seen so far were “one-offs”, tech demos requiring a tremendous amount of man-hours, resources and know-how. As impressive as those results are, such workflows limit the adoption of super-realistic game characters to games with budgets of at least tens of millions of dollars.

Facebook’s “Codec Avatars” revealed recently in more depth

Fortunately, in most cases we can reuse topology and UV layouts across many different models, with just a tiny loss in quality, for the benefit of being able to feed both the vertex positions and the texture maps through a neural network in a nicely structured way.
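This is the trick behind classic 3D morphable models, and it works with nothing fancier than PCA: once every scan shares the same topology, a head is just a fixed-length vector, and the principal components become ‘identity sliders’. A toy sketch, with random data standing in for real scans:

```python
import numpy as np

# With shared topology, every head scan becomes a fixed-length vector
# of vertex positions: exactly the structured input neural networks
# (or plain PCA, as in classic 3D morphable models) want.
num_scans, num_verts = 200, 5000
scans = np.random.randn(num_scans, num_verts * 3)  # stand-in for real data

mean_face = scans.mean(axis=0)
# PCA via SVD: the top components capture the main axes of variation.
U, S, Vt = np.linalg.svd(scans - mean_face, full_matrices=False)
basis = Vt[:50]                         # 50 principal shape components

params = np.random.randn(50) * 0.5      # one point in 'face space'
new_face = mean_face + params @ basis   # synthesize a novel head
new_face = new_face.reshape(num_verts, 3)
```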

French company EISKO has a proven track record of creating digital humans and is venturing into automated rigging and animation with Polywink

GANs would be ideal for such a set-up, but surprisingly, most of the academic research around 3D face reconstruction took a different approach until very recently. It could be that the lack of publicly available, high-quality data is the biggest obstacle (we’re talking about potentially thousands of head scans), but at the same time big players like Magic Leap and Facebook are putting a lot of effort into creating 3D avatars, and they definitely have the resources to create such datasets (and, one can only hope, will ultimately share them with the research community).

Some earlier examples of 3D face models generated automatically from single-shot input.

The current state of the art in single-shot 3D face reconstruction, not surprisingly powered by GANs. Paper: GANFIT: Generative Adversarial Network Fitting for High Fidelity 3D Face Reconstruction.

Generating digital characters from photos has been the subject of research for many years: from the ongoing efforts of Paul Debevec’s team at the USC Institute for Creative Technologies, covering numerous breakthrough photogrammetric set-ups, capturing skin microstructure, FACS for animation, and using variational autoencoders (a type of deep neural network I mentioned previously) for hair synthesis, to a long list of amazing publications from Disney Research focused on capturing individual facial features and parts such as teeth, eyes or facial hair. It seems that putting it all together into a complete end-to-end solution is a matter of the not-too-distant future.

3D Reconstruction of Facial Hair and Skin from Disney Research

Whole bodies are slightly more problematic, as the clothing and various accessories our game characters often wear make input consistency a bit more challenging. But we’re still able to capture a lot of the variation between different body types and postures and turn it into customizable parameters. Similarly to capturing human faces, whole-body capture is a very active field of research. Once again we should acknowledge the extensive research from USC’s Vision & Graphics Lab and the Max Planck Institute’s work on ‘Virtual Humans’.

Even for more diversified input, such as different animals, by restricting our efforts to quadrupeds that fit a certain biological blueprint (four legs, a head with a mouth and two ears, a body with a tail), we can retrieve not only a pretty acceptable 3D model of a given animal from a single-shot input, but also its texture. Of course the results aren’t AAA quality, but it’s a remarkable start, and even being able to work from a rough shape of an animal could speed up modeling significantly.

Are we heading for the ARTPOCALYPSE?

The provocative title is a reference to the Indiepocalypse, a concept coined in the industry a few years ago. It refers to the ongoing problem of an overabundance of games and game content, brought on largely by the democratization of game development through access to game engines like Unity or Unreal, affordable game content, and the availability of seemingly effortless publishing platforms. As great as the expansion of game development was, the problem we are experiencing is that gamers’ ability to buy and play all the new titles did not match the sudden tsunami of content. Supply and demand. This obviously meant that many great games went unnoticed, despite solid development and marketing efforts within the budgetary constraints of indie developers.

I won’t focus on the aspect of AGI (artificial general intelligence), on the question of whether computers can create art, or spend much time debating whether neural networks can replace all artistic jobs or “just” a significant part of the daily craftsmanship that goes into creating games. Don’t get me wrong, those are all incredibly important, hard questions, absolutely worth pondering. But with respect to the job security of my fellow game artists, they are largely beside the point: automating significant portions of current art “pipelines” could disrupt game development on its own. Let’s be absolutely honest, most “game art” is craftsmanship, not high art. In terms of artistic style, photorealism, or stylized variations of it, still dominates, especially in big-budget productions, and trying to mimic reality or reuse a retro style unfortunately makes us more susceptible to automation. On the other hand, there’s already a significant research effort by economists trying to estimate the impact of automation on the job market. Jobs that are highly repetitive and do not require social skills are most at risk; by that measure, game designers and artists are at the other end of the spectrum. I personally believe that the current wave of automation will have a big impact on the whole economy, but people working in one of the most creative industries out there should not lose sleep over their own livelihood.

With the rise of artificial intelligence, we should ask ourselves whether we face the possibility of experiencing a similar problem with game art production. As the quote attributed to Mark Twain goes, history doesn’t repeat itself, but it rhymes: there’s no reason to believe that making content creation much easier and more accessible to a larger group of people won’t affect the industry, just as easier publishing and development did. It’s too early to predict the outcome, but it won’t be business as usual, for sure.

We have some examples from the past of certain artistic jobs disappearing due to technological advances. The clearest example came with the development of photography. It’s hard to imagine now, when every purse and every pocket holds a tiny picture-making device, but before the 1840s the only way to get a portrait of a loved one, or a picturesque landscape of your favourite place on Earth, was to commission an artist to paint or draw it. The upper class and wealthier people could afford to hire a painter, but the less fortunate could at best order a small silhouette cut-out of a loved one or family member. The emergence of photography influenced and changed traditional art forms like painting, and it displaced the silhouette cut-out entirely, as the ‘new technology’ turned out to be a far more viable option.

There were many factors that influenced art and facilitated the rise of new art forms in the late 19th and early 20th centuries, but it’s hard to believe that the emergence of photography didn’t play a role in the Cambrian explosion of new art forms and styles that deviated from previous attempts to master the accurate depiction of reality.

This article opens with ‘The Fall of the Rebel Angels’, painted by Bruegel, one of my favourite painters, a brilliant 16th-century Flemish artist, portraying an apocalyptic vision. I do not believe an apocalypse awaits us, although changes are on the horizon and it is up to us to make the best of them, to prepare and adjust.

Neural networks are just another tool that lets us work more efficiently. Just as an excavator on a building site is vastly more efficient than someone equipped with a simple shovel, the art content creation tools of the not-too-distant future will allow creators to work faster, augmenting their individual abilities to bring creative visions to life.

Runway.ML — one of the first examples of an AI-powered tool for “Augmented Creativity”

I will end this piece with another of Bruegel’s paintings, Children’s Games, as I hope that the new AI-driven creative tools will allow us to discover new and fascinating gameplay and mechanics, venture into new art styles and novel game aesthetics. By lowering the barriers to entry, we lower the risk of experimentation and allow game creators and artists to take full advantage of the playful exploration the new tools enable.