
Ilya Sutskever of OpenAI is one of the most prominent advocates of the idea that 'brute force computation' is all you will ever need. You can find his arguments in this keynote talk at Nvidia a few months ago (September 2018):

If you have not seen it, I highly recommend that you do. The keynote sums up three recent developments at OpenAI: long-horizon reinforcement learning (see: OpenAI Five), transfer learning from virtual environments to real environments (see: Learning Dexterity), and unsupervised neural language models (see: Improving Language Understanding).

Sutskever captures the gist of each of these projects. Although it is important to know what each project accomplished, it is more important to recognize the general high-level conclusions that each project reveals. Here is my summary of the four surprising conclusions, each illustrated with a short code sketch after the list:

1. Deep Reinforcement Learning is Consistently Scalable — The OpenAI Five project reveals that, given enough compute resources, reinforcement learning (RL) can scale with the kind of consistency that we also find in unsupervised learning (see the first sketch after this list). This conclusion is a bit surprising given the growing chorus of recent arguments about the inconsistency of RL training. Alex Irpan wrote "Deep Reinforcement Learning Doesn't Work Yet" (Feb 14, 2018); another post argued that "Reinforcement Learning never worked, and 'deep' only helped a bit" (Feb 23, 2018); Kurenkov wrote "Reinforcement Learning's Foundational Flaw" (July 8, 2018). With headlines like these, it would be all too easy to junk the method. Yet OpenAI has revealed that brute force computation is all you need! It is as if the neural network winter that lasted for decades lasted just a few months for deep reinforcement learning: it was thought not to work, yet now it really works!

2. Complex Strategy and Tactics Require Only a Few Neurons — The LSTM driving OpenAI Five consisted of only around 4,000 LSTM units. The architecture shown below is an earlier version of the model that required only 1,024 LSTM units; this was scaled up to 4,000+ units by the time of "The International" competition (see the second sketch after this list).

3. Transfer Learning from Virtual to Reality Needs Only Randomization of the Unknowns — The Learning Dexterity project reveals that to transfer knowledge from a simulated virtual environment to a real environment, one should train the network with all of the significant unknowns randomized. In this way, the network learns to compensate for these variables when it is eventually tasked to perform in the real environment (see the third sketch after this list).

4. Richer Neural Models Automagically Improve Higher-Level NLP Tasks — The Improving Language Understanding project revealed that improving the neural word embedding via unsupervised learning automagically unlocks massive improvements in many higher-level NLP tasks. Other groups like Google (see: BERT), AllenAI (see: ELMo) and Fast.AI (see: ULMFiT) have come to a similar conclusion about the benefits of richer language models. In the figure below, the same embedding is used as a base layer for different NLP tasks (see also the fourth sketch after this list):
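
To make conclusion 1 concrete: a policy-gradient estimate is an average over many rollouts, so throwing more parallel rollouts (that is, more compute) at the problem directly reduces the variance of each update. A toy numerical illustration of this averaging effect, not OpenAI's actual training code:

```python
import random
import statistics

# Toy illustration: averaging more rollouts (bigger batches) shrinks the
# noise in a policy-gradient-style estimate. The "return" here is a
# stand-in random variable, not an actual environment.
def noisy_return():
    return random.gauss(1.0, 5.0)

for batch_size in (8, 64, 512):
    estimates = [statistics.mean(noisy_return() for _ in range(batch_size))
                 for _ in range(200)]
    print(batch_size, round(statistics.stdev(estimates), 3))
```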
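For conclusion 2, it is worth seeing what a 4,000-unit LSTM amounts to. A minimal PyTorch sketch of the parameter count of a single-layer LSTM of roughly that width; the input size of 1,024 is an arbitrary placeholder, not OpenAI Five's actual observation encoding:

```python
import torch.nn as nn

# Single-layer LSTM roughly the width reported for OpenAI Five (~4,000 units).
# input_size is a placeholder; OpenAI Five's real input encoding differs.
lstm = nn.LSTM(input_size=1024, hidden_size=4096)
n_params = sum(p.numel() for p in lstm.parameters())
print(f"{n_params:,} parameters")  # roughly 84 million
```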
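Conclusion 3 (domain randomization) is easy to sketch: resample the simulator's unknown physical parameters every episode so the policy learns to be robust to all of them. The parameter names and ranges below are made-up placeholders, not the actual Learning Dexterity settings:

```python
import random

def sample_sim_params():
    """Draw a fresh set of 'unknowns' for each training episode."""
    return {
        "friction":    random.uniform(0.5, 1.5),    # surface friction
        "object_mass": random.uniform(0.01, 0.10),  # kilograms
        "motor_delay": random.uniform(0.0, 0.05),   # seconds of latency
    }

# Each episode sees different physics, so the policy cannot overfit
# to any single simulator configuration.
for episode in range(3):
    params = sample_sim_params()
    print(f"episode {episode}: train under {params}")
```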
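And for conclusion 4, the pattern in the figure can be sketched as one shared, frozen pretrained encoder feeding several small task-specific heads. The dimensions and the toy encoder below are illustrative stand-ins, not the actual GPT architecture:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained encoder (in the actual work, a transformer
# language model); frozen so every task reuses the same representation.
base = nn.Sequential(nn.Linear(300, 768), nn.Tanh())
for p in base.parameters():
    p.requires_grad = False

sentiment_head = nn.Linear(768, 2)   # task 1: classification
entailment_head = nn.Linear(768, 3)  # task 2: natural language inference

x = torch.randn(8, 300)              # a batch of (already pooled) inputs
features = base(x)                   # shared base layer
print(sentiment_head(features).shape, entailment_head(features).shape)
```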

Sutskever ends his talk by arguing that the only limitation of Deep Learning appears to be computation. He further emphasizes that hard-coding knowledge into a network can in fact be detrimental. Sutskever argues that within 5 to 10 years:

While highly uncertain, near term AGI should be considered as a possibility.

You might be thinking that this is an overly optimistic assessment of the current situation. However, this isn't Elon Musk talking (who predicts it within 5 years); we are speaking here about someone who is knee-deep in the weeds of Deep Learning research (this is the guy who in 2016 had a $1.6M salary at a non-profit firm). Yet Sutskever argues that there are no real major stumbling blocks toward AGI other than the need for more compute. He points out that the availability of computing is increasing at an unexpectedly fast exponential pace: from AlexNet in 2012 to today, there has been a 300,000x increase in compute!

If the fundamental limitation is only computational resources, then AGI is happening much sooner than we all think! Nobody could have predicted a 3.5-month doubling period, which exceeds Moore's Law by 32x (i.e., in two years computational resources will have grown at least 64-fold instead of merely doubling; see the arithmetic below). Most are simply unaware of how much compute will now be easily available. In fact, with the recent crash in cryptocurrency prices, there is now a major glut of spare GPU cycles.
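Here is the arithmetic behind that claim, as a quick sanity check:

```python
# With a 3.5-month doubling period, how much does compute grow in two years,
# versus the single doubling Moore's Law would give over that span?
doubling_months = 3.5
full_doublings = 24 // doubling_months   # 6 full doublings fit in 24 months
print(2 ** full_doublings)               # 64x growth in two years
print(2 ** full_doublings / 2)           # 32x beyond one Moore's-Law doubling
```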

However, let's explore again the fundamental limitations of Deep Learning, to get a better grasp of whether this conclusion has a good probability of being true. Gary Marcus has made a good living arguing about the incompleteness of neural networks. I wrote about his arguments in "The Boogeyman Argument that Deep Learning will be Stopped by a Wall".

There are indeed plenty of arguments about the current limitations of Deep Learning. However, are these limitations fundamental, such that they cannot be overcome by more brute force computation alone?

The current fundamental limitation of Deep Learning can be distilled into the reality that these systems are unable to generate high-level abstractions of the world. It’s the kind of cognition that conjures up religion or science as models of reality. I’ve discussed this previously in my proposed “Capability Maturity Model”:

At present, the state of the art is at level 2 (i.e., Dual Process systems). We don't have the kind of interventional or causal reasoning capabilities found at the next level. It is not as if this is unexplored territory; there is significant research in this area (see: Intuitive Relational Reasoning).

I have come to the opinion that there is something very different between how Deep Learning captures the regularities of data and the way human brains capture experience. It may be entirely possible that, collectively, we are on the wrong path with the kinds of representations used in deep learning: representations that require high-dimensional continuous latent spaces. So even though I claim that we are at least at level two of the roadmap, there may be a possibility that prior levels will need to be revisited.

Another compelling argument for why predicting AGI within 5 to 10 years is within the realm of possibility is known as Moravec's paradox. Moravec's paradox is the observation, made by many AI researchers, that high-level reasoning requires less computation than low-level unconscious cognition. This is an empirical observation that goes against the notion that greater computational capability leads to more intelligent systems. Moravec writes:

we should expect skills that appear effortless to be difficult to reverse-engineer, but skills that require effort may not necessarily be difficult to engineer at all.

Higher-level cognition does not require a lot of compute! Just look at the number of units required by OpenAI Five to perform its strategy and tactics. Just look at the size of the prefrontal cortex (PFC), where higher-level cognition resides:

It is only a minor fraction of the entire human brain. In fact, the more primitive cerebellum is much denser, with approximately 100 billion neurons, as compared to the entire neocortex with 26 billion neurons. The physical computational units required for high-level intelligence are a fraction of what is needed for human-complete cognition.

Moravec's paradox is just conjecture. It does suggest, however, that the limitation may eventually not even be computational resources. There is plenty of research suggesting that reducing the dimensionality of the internal latent space means less compute is required for the sequential reasoning components (see: OpenAI Five). In terms of today's compute resources, we may have already overcome the compute obstacle!

There is also an additional trend visible in Stanford's end-to-end Deep Learning benchmark (DAWNBench). The training time for ImageNet as of September 2018 is now down to 18 minutes. DAWNBench began collecting benchmarks just last year; in October 2017 the same task required 13 days of training. In one year, that is an improvement of roughly 1,000-fold (see the arithmetic below)! What is happening here is the double whammy of not only faster computers but much faster algorithms.

The bulk of the improvement cannot be accounted for by raw compute power alone. Both systems used Amazon's p3.16xlarge (8-GPU Tesla V100) instances; the record-breaking system used four times as many instances. This implies that the main bottleneck may just be algorithmic.
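A back-of-the-envelope split of that speedup into hardware scale-out versus algorithmic gains, using the figures quoted above:

```python
# DAWNBench ImageNet training: October 2017 baseline vs. September 2018 record.
baseline_minutes = 13 * 24 * 60      # 13 days
record_minutes = 18
total_speedup = baseline_minutes / record_minutes
hardware_factor = 4                  # record run used 4x as many instances
print(f"total speedup: ~{total_speedup:.0f}x")                  # ~1040x
print(f"algorithmic share: ~{total_speedup / hardware_factor:.0f}x")  # ~260x
```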

Stephen Merity writes about the considerable time and sanity he burned seeking out greater computation instead of working on fundamentals. He writes in "The compute and data moats are dead":

I don’t want others to lose months of their time or neurons of their sanity like I have done in the past. It’s not necessary. It’s not productive. It’s rarely what you need to achieve your aims or to help forge progress in our field.

The true fundamental limitation may just be what evolution has already revealed. All cognition is due to nature's invention of universal information-processing machines. Discovering useful kinds of these machines requires plain luck: there is fundamentally no algorithm that can reveal whether an arbitrary Turing machine halts, so the invention of higher forms of cognition can simply be due to blind luck! This is not a new revelation; it has been known ever since people began experimenting with artificial life in the form of simple cellular automata. It does not require an insurmountable feat of genius to discover new kinds of self-replicating automata. It just, unfortunately, requires the same kind of method by which one can become a billionaire overnight (see: the $1.6 billion Mega Millions lottery): plain, unadulterated luck. Invention usually happens in unexpected places, and thus AGI could be discovered in someone's basement.
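As a reminder of how little machinery such a universal machine needs, here is a complete elementary cellular automaton. This is Rule 110, which is known to be Turing-complete, so its long-run behavior is in general undecidable. A minimal sketch:

```python
# Rule 110: each cell's next state is a function of its 3-cell neighborhood,
# looked up as a bit of the number 110. Eight table entries suffice for
# Turing-complete (and hence, in general, unpredictable) behavior.
RULE = 110
cells = [0] * 40 + [1]  # single live cell at the right edge (wraps around)

for _ in range(20):
    print("".join("#" if c else "." for c in cells))
    cells = [
        (RULE >> (4 * cells[i - 1] + 2 * cells[i] + cells[(i + 1) % len(cells)])) & 1
        for i in range(len(cells))
    ]
```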

Explore Deep Learning: Artificial Intuition: The Improbable Deep Learning Revolution
