The State of AI in 2020

Learn where our future is headed by understanding AI’s past. This article provides an in-depth overview of the past, present, and future of AI.

There is no doubt that artificial intelligence, machine learning, and data science have become the most powerful and forward-looking forces in technology over the past decade. These technologies have enabled breakthrough insights and applications that may truly change the world for the better. This is, of course, thanks to the symbiosis of data collection, hardware innovation, and driven researchers that took hold over the 2010s. It has let us endow computers with mind-boggling abilities in everything from vision to natural language processing to audio understanding to complex signal processing. To understand where we are headed, it is important to understand how these breakthroughs took shape and where we currently stand. This article aims to do just that, as well as cast light on the limitations of AI in its current state in order to form a vision for the future.

In This Article

Where We Came From
Where We Are Today
Where We Are Headed

Where We Came From

Since the mid-1900s AI has taken many different forms: everything from automata, to linear regression and perceptrons, to decision trees, and eventually neural networks and deep learning. As progress is made and the public becomes aware of what this “AI” truly is, each method commonly makes its inevitable shift in perception from an intelligence to a statistical technique. The 2010s put most of these earlier forms behind us and allowed the shift from simple neural networks to the breakthrough methods of deep learning.

Neural Networks to Deep Learning

While neural networks have been theorized since the mid-1900s, compute and data constraints did not allow them to be successfully implemented until the 2000s. This led to an exponential expansion in research on this mathematical method, which allows machines to learn patterns in both linear and nonlinear data. Conceptually, a neural network is more or less a set of stacked and connected linear regressions, each followed by an activation function. As more of these “neurons” are added, the network has more trainable parameters and can therefore model more complex patterns. The mathematics and structure have been detailed further here.
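To make the “stacked linear regressions with an activation function” idea concrete, here is a minimal sketch in plain Python. All weights, biases, and inputs are arbitrary illustrative values, not a trained model:

```python
import math

def neuron(inputs, weights, bias):
    # A single "neuron": a linear combination of the inputs
    # (i.e., a linear regression) passed through a nonlinear
    # activation function (sigmoid here).
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weight_rows, biases):
    # A layer is just many neurons sharing the same inputs.
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Two stacked layers form a minimal feed-forward network.
hidden = layer([0.5, -1.2], [[0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1])
output = layer(hidden, [[0.7, -0.2]], [0.05])
```

Adding more neurons per layer, or more layers, adds trainable parameters in exactly this structure.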

Feed Forward Neural Network (Source[9])

By around 2010, computing and data capabilities had finally begun to catch up with these theories. Researchers and engineers realized that, by adding many more layers and more neurons per layer, this statistical technique could model patterns for regression or classification in a near-human fashion. Very quickly these systems gained the ability to match or beat rule-based systems on simpler datasets. As GPU technology allowed for deeper networks, deep learning took over the machine learning space and, in effect, started a more widespread machine learning revolution than ever before.

Convolutional and Recurrent Neural Networks

As deep learning took shape, it quickly became clear that computer science problems in language and vision could be solved far more effectively with this new form of AI. Because research had become so widespread, rather than just building deeper and deeper networks, new and old methods within the existing framework were developed and leveraged. Language models began to leverage and adapt recurrent neural networks, while computer vision adopted the convolutional neural network. By the end of the decade the two forms had inevitably mixed in many ways, but their independent paths were critical to their prominence.

Recurrent neural networks build on the idea that the current output depends on the outputs of previous time steps. This, of course, is the nature of language and most other time-series data. When humans speak, the next word is not chosen at random from a base context, but depends on the prior words and the context. RNNs, and more specifically LSTMs (long short-term memory networks), are implemented for exactly this reason. This architecture recurrently feeds past outputs back in as inputs to produce the next output. These inputs can either be fed into the front of a typical neural network or drive more complex internal gating operations, as seen in LSTMs. This method enabled the most powerful leaps in NLP, along with many other time-series applications. For the first time, language generation and prediction became truly feasible. For a more detailed look at LSTMs, check here.
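The recurrence can be sketched in a few lines of plain Python. This is a vanilla RNN cell, not a full LSTM (which adds gating on top of the same idea): the hidden state carries information from step to step. All weights here are arbitrary illustrative values, not trained ones:

```python
import math

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # The new hidden state mixes the current input with the
    # previous hidden state -- this recurrence is what lets each
    # output depend on everything that came before it.
    h = []
    for i in range(len(b)):
        z = b[i]
        z += sum(W_x[i][j] * x_t[j] for j in range(len(x_t)))
        z += sum(W_h[i][j] * h_prev[j] for j in range(len(h_prev)))
        h.append(math.tanh(z))
    return h

# Unroll over a toy sequence of 1-D inputs with a 2-unit hidden state.
W_x = [[0.5], [-0.3]]
W_h = [[0.1, 0.2], [0.0, 0.4]]
b = [0.0, 0.0]
h = [0.0, 0.0]
for x_t in [[1.0], [0.5], [-1.0]]:
    h = rnn_step(x_t, h, W_x, W_h, b)
```

The final `h` summarizes the whole sequence, which is why the last hidden state is often used for prediction.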

Generic Convolutional Neural Network (Source[10])

Convolutional neural networks have by now taken over much of the AI space, and their creator, Yann LeCun, predicts they can handle nearly any AI task we may think of. Whether or not this proves true, CNNs have enabled near-perfect accuracy on complex computer vision tasks. The concept arose from the three-dimensional nature of images. Classical deep learning networks typically used one-dimensional input, so image pixel values were flattened into a single vector, losing much of the spatial relationship between pixels. A convolution instead accepts a 3D input (e.g., RGB images) and performs a sliding dot product across it with a set of filter values. This allows multiple feature maps to be learned which, during training, typically come to dissect different features of an image. Overall, because each filter spans all channels and is shared across positions, the network needs far fewer weights while extracting much deeper image features. Over the last 10 years, many efforts to refine CNNs have led to monumental breakthroughs in computer vision, many of them detailed in victories such as the ImageNet results.
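The sliding dot product can be sketched directly. This toy example uses a single channel for clarity (a real CNN filter spans all input channels) and a hand-picked vertical-edge-style filter rather than learned values:

```python
def conv2d(image, kernel):
    # Slide the kernel over the image and take a dot product at
    # each position -- the "sliding dot product" of a convolution.
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A 3x3 filter applied to a 4x4 single-channel image that has a
# vertical edge between its left and right halves.
image = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 0, 0]]
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
feature_map = conv2d(image, kernel)  # 2x2 feature map
```

A trained CNN learns many such kernels, each producing its own feature map.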

Training and Libraries

Of course, none of this progress would have been possible without the ability to train these mathematical models, and to do so through accessible code. Backpropagation is the standard method used to train any neural network. It computes the gradient of the loss with respect to every weight, propagating derivatives backward layer by layer, and uses those gradients to tune each weight in the network. This requires complex programming and hardware, especially for the more complex architectures.
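As a sketch of the idea, here is backpropagation worked out by hand for a single sigmoid neuron with a squared-error loss on one toy training example. Real libraries automate exactly this chain-rule computation across millions of weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, target = [1.0, 2.0], 1.0     # one toy training example
w, b, lr = [0.0, 0.0], 0.0, 1.0

for _ in range(100):
    # Forward pass.
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    y = sigmoid(z)
    # Backward pass: for loss L = (y - target)^2, the chain rule
    # gives dL/dz = 2*(y - target) * y*(1 - y), then
    # dL/dw_i = dL/dz * x_i and dL/db = dL/dz.
    dz = 2.0 * (y - target) * y * (1.0 - y)
    w = [wi - lr * dz * xi for xi, wi in zip(x, w)]
    b -= lr * dz

# After training, the neuron's output is close to the target.
y_final = sigmoid(sum(xi * wi for xi, wi in zip(x, w)) + b)
```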

Luckily, GPU technology, along with new mathematical methods, has developed to make this training efficient. Technologies such as NVIDIA CUDA, which allows mathematical operations to execute on massively parallel GPUs, have taken training time down from days to minutes. This allows for faster iteration and, in turn, faster results. At the same time, methods like batch normalization and dropout, and powerful optimizers like Adam, take this even further and allow for predictable and effective training sessions.

Finally, none of this progress would have been possible without the explosion of language and library support that came with the rise of Python and libraries like TensorFlow, Keras, and PyTorch. Python has grown to be the most prominent programming language thanks to its ease of use and library support, and it is one of the most suitable languages for mathematical programming. This has let more people experiment and build new concepts without the headaches of more complex languages. Above all, the creation of libraries like those listed above allows an even wider range of people to get their hands on this technology, creating a more democratized technology ecosystem. Altogether, this last decade has given more capabilities to a wider range of people, which is how truly exponential shifts in technology occur.

Where We Are Today

Natural Language Processing

The current state of the art in language processing has effectively merged recurrent and convolutional ideas while creating new methods of its own. All of it centers on Transformers, self-attention, and word embeddings. Together, these concepts model relationships between words in massively parallel feed-forward networks that can effectively understand entire corpora.

A transformer does exactly what its name suggests: it transforms a set of word embeddings into another set of embeddings or similar structures. This is particularly effective for machine translation, text generation, or vector creation for classification. This is a great in-depth look at transformers, but in short, they pair encoder and decoder networks trained to accomplish the tasks above. A powerful concept that lies at the heart of the transformer is self-attention. Self-attention was developed at Google as a method to model the recurrent and spatial nature of language in a single network pass. To do this, each input sequence is processed through query, key, and value matrices whose learned values come to model the relationships among all words in the sequence.
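The core of self-attention, scaled dot-product attention, can be sketched in plain Python. Here the queries, keys, and values are simply the raw token embeddings; in a real transformer each is a separate learned projection of the input:

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # For each query: dot it with every key, scale by sqrt(d_k),
    # softmax the scores, then take the weighted sum of the values.
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(wt * val[j] for wt, val in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three 2-D token embeddings attending to one another.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = attention(X, X, X)
```

Because every token scores every other token in one pass, the whole computation parallelizes, unlike a step-by-step RNN.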

Transformer Encoder (Source[2])

These concepts have been successfully implemented in the BERT and GPT-2 architectures. While each network differs in its own way, both implement an ultra-deep transformer model to accomplish sequence-to-sequence and multi-task objectives. OpenAI, the creator of GPT-2, currently has models available for download (after some controversy) that allow researchers to prompt the network and receive creative, “AI”-written text. There are numerous use cases for this in content creation, chatbots, and text generation. The availability of such a powerful network should allow the technology to continue its exponential growth.

A quote from a GPT-2 model I fine-tuned on philosophical texts:

“The meaning of life is to create the illusion of possibility for those who lack this ability….”

Computer Vision

As convolutions have become more effective for image processing tasks and beyond, the community has settled on going deeper and wider in its network architectures. This has brought the powerful trend of residual networks. A residual neural network pieces together residual blocks, in which the block's input is carried forward along a skip connection and added back to the block's output, to build the deepest trainable CNNs. Related architectures, such as Google's state-of-the-art Inception network, instead run two or more parallel convolution branches that are concatenated at the end, and hybrids like Inception-ResNet combine both ideas. These designs leverage powerful parallel compute hardware to achieve better results than ever.
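In the ResNet sense, a residual block adds the block's input back to its output via a skip connection, so the layers only have to learn a correction on top of the identity. A minimal sketch with dense layers standing in for convolutions (toy dimensions; zero weights are chosen to make the identity behavior visible):

```python
def relu(v):
    return [max(0.0, x) for x in v]

def dense(v, W, b):
    return [sum(W[i][j] * v[j] for j in range(len(v))) + b[i]
            for i in range(len(b))]

def residual_block(x, W1, b1, W2, b2):
    # Two transformations, then the original input is added back
    # through the skip connection before the final activation.
    h = relu(dense(x, W1, b1))
    h = dense(h, W2, b2)
    return relu([hi + xi for hi, xi in zip(h, x)])

W_zero = [[0.0, 0.0], [0.0, 0.0]]
b_zero = [0.0, 0.0]
# With all-zero weights the block reduces to the identity -- this
# easy "do nothing" default is what makes very deep stacks of
# residual blocks trainable.
out = residual_block([0.5, 2.0], W_zero, b_zero, W_zero, b_zero)
```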

Google Inception (Source[10])

At the same time, new types of convolution have been developed to take things further. A step beyond the Google Inception network is the Xception network. This network is similar to its predecessor but uses a new type of convolution called the depthwise separable convolution, which allows for far lower parameter counts and increased accuracy. This is much better described in the paper above, but in short, it breaks the convolution operation into two parts: a depthwise convolution, which slides across each channel separately, and a pointwise convolution, which slides across all channels one pixel at a time. This is yet another example of the power of innovating deep within the methods we already know.
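The parameter savings are easy to verify with a little arithmetic. A standard convolution needs one k×k×c_in filter per output channel, while the separable version needs one k×k filter per input channel plus a 1×1 channel-mixing step:

```python
def standard_conv_params(k, c_in, c_out):
    # Each of c_out filters spans all c_in input channels.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise part: one k x k filter per input channel.
    # Pointwise part: 1x1 convolutions mixing the channels.
    return k * k * c_in + c_in * c_out

# 3x3 kernels mapping 256 input channels to 256 output channels.
std = standard_conv_params(3, 256, 256)   # 589,824 weights
sep = separable_conv_params(3, 256, 256)  #  67,840 weights
```

For this layer the separable form uses roughly one-ninth the weights, which is where the efficiency gains come from (bias terms are omitted for simplicity).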

A final trend, not just in computer vision but across all use cases, is transfer learning. Transfer learning is prominent because heavily trained networks need not be retrained from scratch for each use case. It typically takes a pre-trained feature-extraction network, minus its classification layers, and freezes its weights. The frozen network's outputs are then fed into the user's own classifier, and the training process only trains this new classifier. This allows computer vision networks like Xception, which have been trained for months, to be adapted to any domain-specific use case.
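The freezing idea can be sketched without any framework: a fixed "pre-trained" feature extractor whose weights are never updated, and a small new classifier head that is the only thing trained. All values here are toy illustrations, not real pre-trained weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# "Pre-trained" feature extractor: frozen, never updated below.
FROZEN_W = [[1.0, -1.0], [0.5, 0.5]]

def extract(x):
    # ReLU features from the frozen layer.
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)))
            for row in FROZEN_W]

# Only the new classifier head (w, b) is trained.
w, b, lr = [0.0, 0.0], 0.0, 0.5
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0)]

def predict(x):
    f = extract(x)
    return sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)

for _ in range(200):
    for x, target in data:
        f = extract(x)                      # frozen features
        y = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b)
        dz = 2.0 * (y - target) * y * (1.0 - y)
        w = [wi - lr * dz * fi for wi, fi in zip(w, f)]
        b -= lr * dz
```

Since gradients only flow into the head, training is cheap even when the frozen extractor is enormous.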

Generative Adversarial Networks

Art Generated By My GAN implementation

Though they have not yet seen as much business use, generative adversarial networks are one of the most interesting forms of AI today. The concept is to build a generator network that maps a vector into an image matrix or similar output. This output is fed to a discriminator network, which learns to tell generated content from real content. As both networks are trained in tandem, the generator gets better at fooling the discriminator even as the discriminator gets better at catching it. The final result of this game-theoretic equilibrium is a generator capable of creating near-real content. This has led to some quite amazing art, writing, and, of course, deep fakes.
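The adversarial loop can be sketched with one-dimensional "data": a linear generator and a logistic discriminator trained in alternation. This is a deliberately toy setup (real GANs use deep networks on images), but it shows the tandem update that drives the generator's output toward the real data:

```python
import math, random

random.seed(0)

def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp for numerical safety
    return 1.0 / (1.0 + math.exp(-z))

# Generator g(z) = a*z + b maps 1-D noise toward the real data.
# Discriminator d(x) = sigmoid(u*x + v) scores how "real" x looks.
a, b = 1.0, 0.0
u, v = 0.1, 0.0
lr = 0.05
REAL_MEAN = 4.0   # the "real" data lives around 4.0

for step in range(2000):
    z = random.gauss(0.0, 1.0)
    real = random.gauss(REAL_MEAN, 0.1)
    fake = a * z + b
    # Discriminator step: raise d(real), lower d(fake).
    dr, df = sigmoid(u * real + v), sigmoid(u * fake + v)
    u += lr * ((1.0 - dr) * real - df * fake)
    v += lr * ((1.0 - dr) - df)
    # Generator step: nudge its output toward higher d(fake).
    df = sigmoid(u * fake + v)
    grad = (1.0 - df) * u        # gradient of log d(fake) w.r.t. fake
    a += lr * grad * z
    b += lr * grad
```

After training, the generator's mean output (`b`) has drifted from 0 toward the real data's neighborhood, illustrating the equilibrium the text describes.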

Training and Deployment

AI has reached a point where it is usable, and almost critical, in some production software systems. The methods detailed above have reached levels of accuracy acceptable for releasing this technology into the wild. This is very much thanks to powerful cloud technology that allows for scalable deployments and ideal training scenarios.

Deploying powerful models to the public requires the right architecture and training regime. This is exactly where neural architecture search and hyperparameter tuning come into play. Using immense computing resources, neural architecture search trains combinatorially many candidate architectures (varying layers, neurons, methods, and so on). The best-performing candidate is then selected, and its hyperparameters are tuned on top for an even better final model.
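A drastically simplified sketch of this search: exhaustively score every combination in a small space and keep the best. The `evaluate` function here is a made-up stand-in that peaks at one configuration; a real search would train each candidate network and measure validation accuracy:

```python
import itertools

def evaluate(architecture, hyperparams):
    # Hypothetical stand-in for "train and validate this candidate".
    layers, units = architecture
    lr = hyperparams["lr"]
    # Toy score that peaks at 3 layers, 64 units, lr=0.01.
    return (-abs(layers - 3)
            - abs(units - 64) / 64
            - abs(lr - 0.01) * 100)

search_space = {
    "layers": [2, 3, 4],
    "units": [32, 64, 128],
    "lr": [0.001, 0.01, 0.1],
}

best_score, best_config = float("-inf"), None
for layers, units, lr in itertools.product(
        search_space["layers"], search_space["units"], search_space["lr"]):
    score = evaluate((layers, units), {"lr": lr})
    if score > best_score:
        best_score, best_config = score, (layers, units, lr)
```

Real systems replace this brute-force loop with smarter strategies (Bayesian optimization, evolutionary search, weight sharing), precisely because each "evaluate" call is an expensive training run.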

In 2020, library and cloud capabilities make it simple to deploy AI-powered applications. The cloud can accommodate common forms of AI, provide resources to scale, and manage application life cycles. Libraries, at the same time, have allowed the data science and software development roles to separate: portable model file formats mean developers only need to understand the given library, in nearly any programming language, to deploy the data scientist's model.

Where We Are Headed

In the very near future, the trend toward more complex architectures and search algorithms will continue to proliferate. But soon enough, questions will be raised about how efficient and smart these brute-force approaches are. It has already been shown that these methods carry a significant carbon footprint. As more of society begins to see AI for what it really is and considers these ramifications, the community will demand new methods. One approach that has recently surfaced is modifying training methods to move away from resource-intensive backpropagation. One example is Greedy InfoMax optimization (GIM), which allows for layer-level training and, especially in the case of transfer learning, can greatly reduce the number of variables to train by requiring fewer gradients.

Hype Cycle (Source[11])

Technological progress is a process of discovery, leading to research and invention of methods, and finally to implementation. As we move through this cycle, emphasis falls above all on the current stage. We are currently in the implementation stage of what we know as AI, in which the discoveries and innovations of deep learning are being rapidly applied to nearly every business problem. This, of course, stifles the overall discovery effort for radically new machine learning methods. As we loop through innovation in this narrow space over the next few years, widespread disillusionment may set in as deep learning is seen for what it really is. As the world learns that “AI” is not magic and begins to see the limitations of what is really just a lot of math, it may demand that research push harder toward real artificial general intelligence (AGI). That is when the AI hype cycle will pull out of the disillusionment phase and give way to steady new innovation.

This will mean radically new methods outside of deep learning will begin to take shape. Reinforcement learning will continue to innovate, but to truly reach human-level or greater intelligence, unpredictable paradigm shifts may need to occur. These methods may begin to center on quantum mechanics and quantum computing, because quantum computing allows the deeply probabilistic nature of AI and decision-making to be modeled in ways never before possible. Quantum systems also allow for extremely high-dimensional representations that can factor in much more than a single input at a time. As humans, we make decisions and interact with the physics of the world around us; the overall determinant of each of our actions is the physical system of the universe. It may make more sense to build AI systems on the concepts of quantum mechanics rather than on the brain. In all, if we are investing so much time in AI, shouldn't our target be something better than our own brains that exists as a tool completely in our control? This may be where the unpredictable future of the technology that will define most of our lifetimes is headed.

The opportunities are endless: massive changes in methods can be brought to life by any unique mind, debates on AI and data ethics will continue, and businesses will rely more and more on these methods as their most valuable resource. Taking the time to understand where we came from and where we are going lets everyone develop their own vision of the future. The global matrix of these unique human visions is what will lead us into a bright future with AI at our side.