We’re living in the age of the next industrial revolution: the first three freed most humans from hard physical labor, while this one aims to take over the last domain of human dominance on this planet: our intelligence. In this article, we will put aside the ethical, political and social effects of such a revolution and concentrate on its technical side. What we see in the media today looks a bit different from real dominance of machines over humans… or does it?

Generative AI == Hyped AI

The most rapidly growing areas of artificial intelligence in the last few years have been computer vision, natural language processing, speech processing and, of course, various customer analytics applications like recommender systems (you may not like it, but targeted advertisements are accurate enough to grow companies’ revenues). And what is presented as a demonstration of state-of-the-art performance in each of these areas? Almost everywhere, it is things like DeepFake videos, realistically generated images of faces, artificial voice recordings that sound like real ones and, of course, the fake-news-generating Transformers from OpenAI.

ICLR 2019 statistics: generative modeling in top-3 topics, https://ailab.criteo.com/iclr-2019-stats-trends-and-best-papers/

A very reasonable question you may ask is

But what do all these deep fakes and text generation have to do with intelligence? Is that creativity?

No, it’s just some complicated non-linear statistics.

Can it replace artists, writers or analysts?

Not really; at the moment they aren’t even particularly helpful to them.

Don’t we have more important problems than generating cats in high resolution, “undressing” people in photos and making Mark Zuckerberg say ridiculous things? Then why is so much time and money of the brightest minds and the most powerful companies spent on it? To answer these questions we need to dive deeper into machine learning basics, in particular, into what happens inside these models, like neural networks, when they are trained to solve the problems we teach them. If you’re interested in the topic, I also recommend reading my article on other alternative use cases of generative models. And… a motivational quote for today:

What I cannot create, I do not understand — Richard Feynman

How “normal” machine learning algorithms work

Let’s have a look at what modern machine learning algorithms can do apart from generating things. Most AI applications look like the following:

Given an ECG record, predict whether arrhythmia occurred in it;

Given the market trade history, forecast future price movements;

Given the history of movie views of yours and your friends, recommend a movie to watch.

In mathematical terms, we have a function with a lot of degrees of freedom (lately such functions are deep neural networks) that, once these degrees of freedom (or weights, or parameters) are found correctly, is able to map complex input data (images, text, sounds, statistics) to some defined outputs, which can be sets of categories, real values, or even really complex structured outputs like graphs.

f — a function that maps inputs x to outputs y with a set of parameters w

How do we find the correct parameters? Usually, we define some criterion of goodness to maximize (for instance, classification accuracy) and a mathematical surrogate for it (like cross-entropy, also called a loss function); then, having a differentiable data-modeling function and a differentiable loss function, we can run a numerical optimization process that maximizes the performance of the model on empirical observations with respect to its degrees of freedom.

The cross-entropy function that has to be optimized with respect to the parameters of the model:

L(w) = −(1/N) Σ_{n} [ y_{n} log y^{p}_{n} + (1 − y_{n}) log(1 − y^{p}_{n}) ]

Here y^{p}_{n} is the output of the model for the input x_{n}, and y_{n} is the true corresponding label from the dataset.

The gradient descent update rule for the above-mentioned loss function; it is what trains the model:

w ← w − η ∂L(w)/∂w
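The pipeline above (a parametric model, a cross-entropy loss, gradient descent on w) can be sketched for a tiny logistic-regression model in plain NumPy; the data here is synthetic and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic binary classification data: inputs x_n with labels y_n
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = (X @ true_w + rng.normal(scale=0.1, size=200) > 0).astype(float)

w = np.zeros(3)   # degrees of freedom / parameters of the model
lr = 0.1          # learning rate (the eta in the update rule)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for step in range(500):
    p = sigmoid(X @ w)  # model outputs y^p_n
    # Cross-entropy between predictions p and true labels y
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    grad = X.T @ (p - y) / len(y)  # dL/dw for logistic regression
    w -= lr * grad                 # gradient descent update

accuracy = np.mean((sigmoid(X @ w) > 0.5) == y)
```

The found w is exactly the "set of parameters" discussed below: it is optimal only for this particular data and this particular task.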

At the end of the optimization process, if you have a big enough dataset of inputs and corresponding correct outputs and you have chosen a good data-modeling function, the found parameters will be able to map, let’s say, lung X-ray images to the corresponding health-state categories. Often even better than humans do. The key result of the whole training process is, of course, the set of parameters that is supposed to be optimal for some particular problem on some particular data. But is it optimal in general?

Possible biases in supervised learning

We already know that supervised models can perform extremely well on numerous tasks, but this outstanding accuracy has its price. Even though AI researchers do a great job creating more and more powerful mathematical models, the people who feed these models often misuse them horribly, or the models simply aren’t meant to meet the expectations placed on them.

Overfitting

Overfitting has many faces; in practice we mainly consider 3 possible situations:

A weakly regularized model just remembers the training data and doesn’t generalize to live data;

We don’t have enough labeled samples for training the model, so again, we don’t generalize to the live data;

Labeled samples from the training and validation data are totally different from the testing data, which is why, again, performance on the live data shrinks.

Mathematically, all this means that our parameters w are not capable of describing the patterns in data beyond the training set.
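The first situation can be sketched in NumPy with a polynomial model whose capacity is deliberately too high for 15 training points (the data and the degrees chosen here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# A small noisy dataset drawn from a simple underlying function
x_train = rng.uniform(-1, 1, size=15)
y_train = np.sin(np.pi * x_train) + rng.normal(scale=0.2, size=15)
x_test = rng.uniform(-1, 1, size=100)
y_test = np.sin(np.pi * x_test) + rng.normal(scale=0.2, size=100)

def mse(w, x, y):
    return np.mean((np.polyval(w, x) - y) ** 2)

# Degree-12 polynomial: enough freedom to memorize 15 noisy points
w_over = np.polyfit(x_train, y_train, deg=12)
# Degree-3 polynomial: capacity better matched to the underlying function
w_simple = np.polyfit(x_train, y_train, deg=3)

# The over-parameterized model has a near-zero training error...
train_err = mse(w_over, x_train, y_train)
# ...but a much larger error on held-out ("live") data
test_err = mse(w_over, x_test, y_test)
```

The coefficients w_over describe the training points almost perfectly yet fail to describe the patterns outside the training set, which is the mathematical face of overfitting mentioned above.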

Human bias

With the wide adoption of machine learning models by many businesses, a lot of so-called human biases in decision making related to sexism, racism, chauvinism and other negative patterns have surfaced, and these can literally ruin other people’s lives (see the great example linked under the picture above). Well, what could we expect from algorithms that learn from our past?

Mathematically, it means that the parameters w are affected, first of all, not by the nature of the data and its true properties, but by the feedback y, which might be biased.

Model bias

ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness, https://openreview.net/forum?id=Bygh9j09KX

Algorithms that work well in the research phase can fail on totally unexpected problems. For example, it turned out that today’s widely used convolutional neural networks (CNNs) suffer from adversarial attacks and tend to learn not the shape of visual objects, but rather their texture (see the picture and the link above). The problem probably lies in the core mathematical operation chosen for the model: convolutions, which are not as robust as they seem.

Another example is reinforcement learning agents that are supposed to perform equally well in different environments with the same goals and objectives, but fail when you substitute, let’s say, the orange ball they have to eat with a blue square. This can be considered overfitting, but the heart of the problem lies in the design of the algorithm as well.

Mathematically, it means that the vector w is built incorrectly, first of all because of the structure of the network itself.

As we can see, most of these problems are diagnosed through the parameters of the model: their values and their structure. In most cases, the “add more (correct) data” rule works, i.e. training on more labeled data that covers the problematic situations (as Tesla does for Autopilot). Lately, some quasi-generative approaches have also become popular, such as fine-tuning models pre-trained on other, bigger datasets and probably different tasks. In reality, this is just a hot-fix. Why? Because with literally every new unseen sub-case you need to retrain your algorithm, which is not exactly how we expect machine intelligence to function. If we have a hypothetical app that distinguishes cats from dogs in images, we don’t want to retrain it with every new unseen breed of animal, but rather to infer the decision somehow from the already-seen breeds of cats and dogs.

A historical look at statistical learning

For completeness, let’s have a look at how machine learning is defined in a rather classical ML book, “Pattern Recognition and Machine Learning” by Christopher Bishop. The author presents the modeling pipeline as three high-level options, in decreasing order of complexity:

First learn the conditional data-generating model, and afterwards apply Bayes’ theorem to model posterior class probabilities;

Learn posterior class probabilities directly, and afterwards hard-code decision theory for classification;

Find a discriminative function f(x) that outputs classes of x directly.

Bayesian framework for classification: first model the class-conditional densities p(x|C_{k}) for each class C_{k} separately, and afterwards apply Bayes’ theorem

As we can see, almost all of modern deep learning follows the last option, which is the easiest and most superficial one. The main problem with the fully Bayesian approach, though, is that at the moment it can’t be applied directly to high-dimensional complex data.
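The first, generative option can be sketched in NumPy for one-dimensional data with Gaussian class-conditional densities (the data here is synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two classes with Gaussian class-conditional densities p(x|C_k)
x0 = rng.normal(loc=-1.0, scale=1.0, size=500)  # samples of class C_0
x1 = rng.normal(loc=2.0, scale=1.0, size=500)   # samples of class C_1

# Step 1: learn the generative model for each class separately
mu0, s0 = x0.mean(), x0.std()
mu1, s1 = x1.mean(), x1.std()
prior0 = prior1 = 0.5

def gauss(x, mu, s):
    # Gaussian density, our model of p(x|C_k)
    return np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def posterior_c1(x):
    # Step 2: Bayes' theorem, p(C_1|x) = p(x|C_1) p(C_1) / p(x)
    l0 = gauss(x, mu0, s0) * prior0
    l1 = gauss(x, mu1, s1) * prior1
    return l1 / (l0 + l1)

# A bonus of the generative route: we can sample new x's from p(x|C_k)
new_sample = rng.normal(mu1, s1)
```

Note that, unlike a discriminative function f(x), this model of p(x|C_k) also lets us generate new data, which is exactly the connection to generative modeling discussed next.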

What happens inside a generative neural network?

Since we agree that straightforward supervised learning might not be the most optimal paradigm for learning effective and generalized representations, and we have looked at the Bayesian approach, whose first stage is closely related to generative modeling, let’s examine the main algorithms for generative modeling today and discuss why they are much more powerful tools in a data scientist’s arsenal. We will review generative adversarial networks (GANs) and variational autoencoders (VAEs) as the most popular ones and the ones that have shown the most prominent results lately.

Generative adversarial networks

GANs are neural-network-based architectures that consist of two models: the first is called the generator G (or sometimes, the artist) and the second the discriminator D (or the critic). As you might guess, the generator is the part responsible for generating objects, and the discriminator is the model that tells whether a sample produced by the generator looks real or not. Both networks are trained jointly, with the generator being penalized by the discriminator for creating samples that are not realistic enough. The practical results are indeed astonishing, but digging into the representation for later reuse is not that straightforward: we will see this in the next chapter.

The loss function for GANs aims to maximize the accuracy of the discriminator network on real samples while minimizing it on samples produced by the generator network:

min_{w_g} max_{w_d} (1/N) Σ_{i} [ log D(x_{i}) + log(1 − D(G(z_{i}))) ]

Training GANs is also said to be a min-max game problem because of the “competition” between D and G. Here x_{i} is a real sample from the data, z_{i} is the random noise that is input to the generator network, w_{d} are the weights of D, and w_{g} are the weights of G.
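The min-max value that D maximizes and G minimizes can be computed in a toy NumPy sketch, with trivial stand-in functions for the two networks (the weights and the 1-D data are purely illustrative, not a trained GAN):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-ins for the two networks: a logistic "discriminator" on 1-D
# data and an affine "generator" mapping noise to samples
w_d = np.array([2.0, -1.0])  # discriminator weights (slope, bias)
w_g = np.array([0.5, 1.0])   # generator weights (scale, shift)

def D(x):
    # probability the discriminator assigns to "this sample is real"
    return 1 / (1 + np.exp(-(w_d[0] * x + w_d[1])))

def G(z):
    # the generator turns random noise into a fake sample
    return w_g[0] * z + w_g[1]

x_real = rng.normal(loc=1.0, size=64)  # real samples x_i
z = rng.normal(size=64)                # noise inputs z_i

# The GAN value function: D is trained to maximize it (classify real vs
# fake correctly), G is trained to minimize it (fool D)
V = np.mean(np.log(D(x_real))) + np.mean(np.log(1 - D(G(z))))
```

In a real GAN, each training step alternates a gradient ascent update of w_d and a gradient descent update of w_g on this same quantity.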

Variational autoencoders

VAEs are relatively simpler models, even though they also consist of two neural networks. The first one (the encoder) is trained to encode the input into some compressed code, and the second one (the decoder) to reconstruct the initial input from this code. The idea is that this compressed representation, if chosen and trained correctly, can contain all the needed information from the input while having a much lower dimension. We can be sure that this code is sufficient if the input can actually be reconstructed from it by the decoder network. Also, if we sample this code from some distribution, we can generate new realistic samples of data with the decoder from a random code. There are also approaches for controlling this code and the particular properties associated with each of its elements.
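The encode/sample/decode flow described above can be sketched in NumPy, with random untrained matrices standing in for the two networks (all sizes and names here are illustrative assumptions, not a trained VAE):

```python
import numpy as np

rng = np.random.default_rng(5)

x_dim, code_dim = 16, 2  # input size and latent "code" size

# Illustrative random weights standing in for the encoder/decoder nets
W_enc_mu = rng.normal(size=(x_dim, code_dim)) * 0.1
W_enc_logvar = rng.normal(size=(x_dim, code_dim)) * 0.1
W_dec = rng.normal(size=(code_dim, x_dim)) * 0.1

def encode(x):
    # the encoder outputs parameters of a distribution over the code
    return x @ W_enc_mu, x @ W_enc_logvar

def decode(z):
    # the decoder reconstructs an input-sized vector from a code
    return z @ W_dec

x = rng.normal(size=(4, x_dim))  # a batch of 4 inputs
mu, logvar = encode(x)

# Sample the code around mu (the "reparameterization trick" in real VAEs)
z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
x_rec = decode(z)  # reconstruction of the input

# Generation: sample a code from the prior N(0, I) and decode it
z_new = rng.normal(size=(1, code_dim))
x_new = decode(z_new)
```

Training would push x_rec toward x while keeping the code distribution close to the prior, which is exactly what makes decoding a random z_new produce realistic samples.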