Deep Learning is amazing. But why is Deep Learning so successful? Is Deep Learning just old-school Neural Networks on modern hardware? Is it just that we have so much data now that the methods work better? Is Deep Learning just really good at finding features? Researchers are working hard to sort this out.

Recently it has been shown that [1]

Unsupervised Deep Learning implements the Kadanoff Real Space Variational Renormalization Group (1975)

This means the success of Deep Learning is intimately related to some very deep and subtle ideas from Theoretical Physics. In this post we examine this.

Unsupervised Deep Learning: AutoEncoder Flow Map

An AutoEncoder is an Unsupervised Deep Learning algorithm that learns how to represent a complex image or other data structure. There are several kinds of AutoEncoders; we care about so-called Neural Encoders, those using Deep Learning techniques to reconstruct the data.

The simplest Neural Encoder is a Restricted Boltzmann Machine (RBM). An RBM is a non-linear, recursive, lossy function that maps the data from the visible nodes v into the hidden nodes h.

The RBM is learned by selecting the optimal parameters that minimize some measure of the reconstruction error (see Training RBMs, below).
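To make the visible-to-hidden map and the reconstruction error concrete, here is a minimal NumPy sketch. It assumes binary units with sigmoid activations, and the names W, b, c, encode, and decode are ours, chosen for illustration; the weights are random and untrained, so this only shows the shape of the computation, not a learned model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(v, W, c):
    """Probability that each hidden unit is on, given the visible units v."""
    return sigmoid(v @ W + c)

def decode(h, W, b):
    """Probability that each visible unit is on, given the hidden units h."""
    return sigmoid(h @ W.T + b)

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 3

# Hypothetical, untrained parameters (assumption: small random weights, zero biases)
W = 0.1 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)

v = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
h_prob = encode(v, W, c)         # visible -> hidden (lossy: 6 numbers become 3)
v_recon = decode(h_prob, W, b)   # hidden -> visible reconstruction

# One possible measure of reconstruction error: mean squared error
reconstruction_error = np.mean((v - v_recon) ** 2)
```

Training would adjust W, b, and c to drive this error (or a related objective) down.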

RBMs and other Deep Learning algos are formulated using classical Statistical Mechanics. And that is where it gets interesting!

Multi Scale Feature Learning

In machine learning (ML), we map (visible) data into (hidden) features

The hidden units discover features at a coarser grain level of scale

With RBMs, when features are complex, we may stack them into a Deep Belief Network (DBN), so that we can learn at different levels of scale. This leads to multi-scale features in each layer.

Deep Belief Networks are a Theory of Unsupervised MultiScale Feature Learning

Fixed Points and Flow Maps

We call this mapping from data to features a flow map.

If we apply the flow map to the data repeatedly, (we hope) it converges to a fixed point.

Notice that we usually expect to apply the same map each time; however, for a computational theory we may need more flexibility.
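As a toy illustration of iterating a map to its fixed point, here is a sketch that repeatedly applies a scalar map until successive iterates stop changing. The helper iterate_to_fixed_point is hypothetical, and cos(x) is just a convenient contraction; a real flow map acts on high dimensional data, not on a single number.

```python
import math

def iterate_to_fixed_point(flow_map, x0, tol=1e-12, max_steps=1000):
    """Apply flow_map repeatedly until successive iterates differ by less than tol."""
    x = x0
    for step in range(1, max_steps + 1):
        x_next = flow_map(x)
        if abs(x_next - x) < tol:
            return x_next, step
        x = x_next
    raise RuntimeError("flow map did not converge")

# Assumption: cos is used only because it contracts toward a unique fixed point
x_star, n_steps = iterate_to_fixed_point(math.cos, x0=1.0)

# At a fixed point, applying the map again changes nothing (up to tolerance)
residual = abs(math.cos(x_star) - x_star)
```

The same loop works for any map that contracts toward a fixed point; if the map is not contractive, the loop raises instead of returning a misleading answer.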

Example: Linear Flow Map

The simplest example of a flow map is the simple linear map x → Cx, where C is a non-negative, low rank matrix.

We have seen this before: this leads to a Convex form of NonNegative Matrix Factorization (NMF).

Convex NMF applies when we can specify the feature space and where the data naturally clusters. Here, there are a few instances that are archetypes that define the convex hull of the data.

Amazingly, many clustering problems are provably convex–but that’s a story for another post.

Example: Manifold Learning

Near a fixed point, we commonly approximate the flow map by a linear operator

This lets us capture the structure of the true data manifold, and is usually described by the low lying eigen-spectra of this linear operator.

In the same spirit, in Semi-Supervised and Unsupervised Manifold Learning, we model the data using a Laplacian operator, usually parameterized by a single scale parameter.

These methods include Spectral Clustering, Manifold Regularization, Laplacian SVM, etc. Note that manifold learning methods, like the Manifold Tangent Classifier, employ Contractive Auto Encoders and use several scale parameters to capture the local structure of the data manifold.
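Here is a small sketch of the single-scale Laplacian idea, in the style of Spectral Clustering. We assume a Gaussian (RBF) affinity with scale parameter sigma and the unnormalized Laplacian L = D - W; both are common choices, not the only ones, and the six data points are made up for illustration.

```python
import numpy as np

def graph_laplacian(points, sigma):
    """Unnormalized graph Laplacian L = D - W with a Gaussian affinity of scale sigma."""
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)           # no self-loops
    D = np.diag(W.sum(axis=1))
    return D - W

# Toy data (assumption): two small clusters on a line
points = np.array([[0.0], [0.1], [0.2], [3.0], [3.1], [3.2]])
L = graph_laplacian(points, sigma=1.0)

# eigh returns eigenvalues in ascending order; the low lying part carries the structure
eigenvalues, eigenvectors = np.linalg.eigh(L)

# The eigenvector of the second smallest eigenvalue (the Fiedler vector) splits the clusters
fiedler = eigenvectors[:, 1]
labels = (fiedler > 0).astype(int)
```

The smallest eigenvalue is zero (the constant vector), and the sign pattern of the next eigenvector recovers the two groups. Changing sigma changes which scale of structure the low lying spectrum sees.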

The Renormalization Group

In chemistry and physics, we frequently encounter problems that require a multi-scale description. We need this for critical points and phase transitions, for natural crashes like earthquakes and avalanches, for polymers and other macromolecules, for strongly correlated electronic systems, for quantum field theory, and, now, for Deep Learning.

A unifying idea across these systems is the Renormalization Group (RG) Theory.

Renormalization Group Theory is both a conceptual framework on how to think about physics on multiple scales as well as a technical & computational problem solving tool.

Ken Wilson won the 1982 Nobel Prize in Physics for the development and application of his Momentum Space RG theory to phase transitions.

We used RG theory to model the recent BitCoin crash as a phase transition.

Wilson invented modern multi-scale modeling; the so-called Wilson Basis was an early form of Wavelets. Wilson was also a big advocate of using supercomputers for solving problems. Being a Nobel Laureate, he had great success promoting scientific computing. It was thanks to him I had access to a Cray Y-MP when I was in high school because he was a professor at my undergrad, The Ohio State University.

Here is the idea. Consider a feature map which transforms the data to a different, more coarse grain scale

The RG theory requires that the Free Energy is rescaled, to reflect that

the Free Energy is both Size-Extensive and Scale Invariant near a Critical Point

This is not obvious, but it is essential both for a conceptual understanding of complex, multi-scale phenomena and for obtaining highly accurate numerical calculations. In fact, being size extensive and/or size consistent is absolutely necessary for highly accurate quantum chemistry calculations of strongly correlated systems. So it is pretty amazing, but perhaps not surprising, that this is necessary for large scale deep learning calculations also!

The Fundamental Renormalization Group Equation (RGE)

If we (can) apply the same map repeatedly, we obtain an RG recursion relation, which is the starting point for most analytic work in theoretical physics. It is usually difficult to obtain an exact solution to the RGE (although it is illuminating when possible [20]).

Many RG formulations approximate the exact RGE and/or include only relevant variables. To describe a multiscale system, it is essential to distinguish between these relevant and irrelevant variables.

Example: Linear Rescaling

Let’s say the feature map is a simple linear rescaling

We can obtain a very elegant, approximate RG solution where F(x) obeys a complex (or log-periodic) power law.
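To see where the complex exponent comes from, here is a short worked sketch. It assumes the recursion takes the form commonly used in the discrete scale invariance literature, with a rescaling factor \lambda, a free energy rescaling factor \mu, and a regular part g(x); these symbols are our assumptions, since the original equation is not shown here.

```latex
\begin{align}
  x &\to \lambda x, \qquad F(x) = g(x) + \frac{1}{\mu}\,F(\lambda x) \\
  % Drop the regular part g and try a power law F(x) = x^{\alpha}:
  x^{\alpha} &= \frac{1}{\mu}\,\lambda^{\alpha} x^{\alpha}
    \;\;\Longrightarrow\;\; \lambda^{\alpha} = \mu = \mu\, e^{2\pi i n},
    \quad n \in \mathbb{Z} \\
  \alpha_n &= \frac{\ln \mu}{\ln \lambda} + \frac{2\pi i\, n}{\ln \lambda} \\
  % Keeping n = 0 and n = \pm 1 gives a power law with log-periodic corrections:
  F(x) &\approx x^{\alpha_0}\left[ A + B \cos\!\left( \frac{2\pi \ln x}{\ln \lambda} + \phi \right) \right],
    \qquad \alpha_0 = \frac{\ln \mu}{\ln \lambda}
\end{align}
```

The imaginary part of the exponent is what produces oscillations that are periodic in ln x, hence the name log-periodic.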

This behavior is thought to characterize Per Bak style Self-Organized Criticality (SOC), which appears in many natural systems, and perhaps even in the brain itself. This leads to the argument that perhaps Deep Learning and Real Learning work so well because they operate like a system just near a phase transition, as in the Sand Pile Model, at a state between order and chaos.

The Kadanoff Variational Renormalization Group (1975)

Leo Kadanoff, now at the University of Chicago, invented some of the early ideas in Renormalization Group. He is most famous for the Real Space formulation of RG, sometimes called the Block Spin approach. He also developed an alternative approach, called the Variational Renormalization Group (VRG, 1975), which is, remarkably, what Unsupervised RBMs are implementing!

Let’s consider a traditional Neural Network–a Hopfield Associative Memory (HAM). This is also known as an Ising model or a Spin Glass in statistical physics.

A HAM consists of only visible units; it stores memories explicitly and directly in them.

We specify the Energy — called the Hamiltonian — for the nodes. Note that all the nodes are visible. We write

The Hopfield model has only single and pair-wise interactions.

A general Hamiltonian might have many-body, multi-scale interactions:

The Partition Function is given as

And the Free Energy is
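For reference, here is one standard way to write these quantities. This is a sketch in the notation of [1], with the temperature absorbed into the couplings; the symbols b_i, J_{ij}, and K are our labels for the biases and couplings and may differ from the original equations.

```latex
\begin{align}
  % Hopfield / Ising form: only single and pair-wise interactions
  H(v) &= -\sum_i b_i v_i \;-\; \sum_{i<j} J_{ij}\, v_i v_j \\
  % General form: many-body interactions of every order
  H(v) &= -\sum_i K_i v_i \;-\; \sum_{ij} K_{ij}\, v_i v_j
          \;-\; \sum_{ijk} K_{ijk}\, v_i v_j v_k \;-\; \cdots \\
  % Partition function: trace (sum) over all configurations of the visible units
  Z &= \mathrm{Tr}_v\, e^{-H(v)} = \sum_{\{v\}} e^{-H(v)} \\
  % Free energy
  F &= -\ln Z
\end{align}
```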

The idea was to mimic how our neurons were thought to store memories–although perhaps our neurons do not even do this.

Either way, Hopfield Neural Networks have many problems; most notably they may learn spurious patterns that never appeared in the training set. So they are pretty bad memories.
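A tiny example makes the spurious pattern problem concrete. The sketch below assumes ±1 units, the Hebbian (outer product) storage rule with zero self-coupling, and no bias terms; the helper names are ours. It stores two patterns and then checks that the negation of a stored pattern, which was never in the training set, is also a stable state.

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian storage rule: sum of outer products, with no self-coupling."""
    n_units = patterns.shape[1]
    W = patterns.T @ patterns / n_units
    np.fill_diagonal(W, 0.0)
    return W

def is_stable(W, state):
    """A state is stable if one synchronous update leaves it unchanged."""
    return np.array_equal(np.sign(W @ state), state)

# Two orthogonal patterns to memorize (made up for illustration)
patterns = np.array([
    [1,  1, 1,  1, -1, -1, -1, -1],
    [1, -1, 1, -1,  1, -1,  1, -1],
])
W = hebbian_weights(patterns)

stored_ok = all(is_stable(W, p) for p in patterns)

# The inverse of a stored pattern was never trained on, yet the network "remembers" it
spurious = -patterns[0]
spurious_is_stable = is_stable(W, spurious)
```

Inverted patterns are the simplest spurious states; with more stored patterns, mixtures of patterns can become stable too.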

Hinton created the modern RBM to overcome the problems of the Hopfield model. He used hidden units to represent the features in the data–not to memorize the data examples directly.

An RBM is specified by an Energy function for both the visible and hidden units

This also defines the joint probability of simultaneously observing a configuration of hidden and visible spins

which is learned variationally, by minimizing the reconstruction error…or the cross entropy (KL divergence), plus some regularization (Dropout), using Greedy layer-wise unsupervised training, with the Contrastive Divergence (CD or PCD) algo, …

The specific details of an RBM Energy are not addressed by these general concepts; these details do not affect these arguments, although clearly they matter in practice!

It turns out that

Introducing Hidden Units in a Neural Network is a Scale Renormalization.

When changing scale, we obtain an Effective Hamiltonian that acts on the new feature space (i.e. the hidden units)

or, in operator form

This Effective Hamiltonian is not specified explicitly, but we know it can take the general form (of a spin funnel, actually)

The RG transform preserves the Free Energy (when properly rescaled):

where

Critical Trajectories and Renormalized Manifolds

The RG theory provides a way to iteratively update, or renormalize, the system Hamiltonian. Each time we add a layer of hidden units (h1, h2, …), we have

We imagine that the flow map is attracted to a Critical Trajectory which naturally leads the algorithm to the fixed point. At each step, when we apply another RG transform, we obtain a new, Renormalized Manifold, each one closer to the optimal data manifold.

Conceptually, the RG flow map is most useful when applied to critical phenomena–physical systems and/or simple models that undergo a phase transition. And, as importantly, the small changes in the data should ‘wash away’ as noise and not affect the macroscopic / critical phenomena. Many systems–but not all–display this.

Where Hopfield Nets fail to be useful here, RBMs and Deep Learning systems shine.

We now show that these RG transformations are achieved by stacking RBMs and solving the RBM inference problem!

Kadanoff’s Variational Renormalization Group

As in many physics problems, we break the modeling problem into two parts: one we know how to solve, and one we need to guess.

We know the Hamiltonian at the most fine grained level of scale.

We seek the correlation that couples to the next level of scale.

The joint Hamiltonian, or Energy function, is then given by

The Correlation V(v,h) is defined so that the partition function is not changed

This gives us

(Sometimes the Correlation V is called a Transfer Operator T, where V(v,h)=-T(v,h) )

We may now define a renormalized effective Hamiltonian that acts only on the hidden nodes

so that we may write

Because the partition function does not change, the Exact RGE preserves the Free Energy (up to a scale change).
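Collecting the steps of this section in one place, the chain of definitions can be sketched as follows. We follow the conventions of [1], with V(v,h) = -T(v,h) and the temperature absorbed into the Hamiltonian; the superscript RG on the effective Hamiltonian is our label.

```latex
\begin{align}
  % Joint Hamiltonian: the known fine-grained part plus the unknown correlation
  H(v,h) &= H(v) + V(v,h) \\
  % The correlation is normalized so that tracing out h changes nothing
  \mathrm{Tr}_h\, e^{-V(v,h)} &= 1 \\
  % Hence the partition function is unchanged
  Z = \mathrm{Tr}_v\, e^{-H(v)} &= \mathrm{Tr}_v \mathrm{Tr}_h\, e^{-H(v) - V(v,h)} \\
  % Renormalized effective Hamiltonian acting only on the hidden nodes
  e^{-H^{RG}(h)} &\equiv \mathrm{Tr}_v\, e^{-H(v) - V(v,h)} \\
  % Same partition function, hence the same free energy
  Z = \mathrm{Tr}_h\, e^{-H^{RG}(h)}, &\qquad
  F^{RG} = -\ln \mathrm{Tr}_h\, e^{-H^{RG}(h)} = -\ln Z = F
\end{align}
```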

We generally cannot solve the exact RGE, but we can try to minimize this Free Energy difference.
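For a system small enough to enumerate, we can check the exact statement numerically. The sketch below assumes 4 visible and 2 hidden ±1 spins, a random pair-wise visible Hamiltonian, and a random bilinear coupling T(v,h); all of these are arbitrary choices for illustration. We normalize the correlation V so that tracing out h gives 1, build the effective Hamiltonian on the hidden spins, and compare partition functions.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 4, 2
visible_states = np.array(list(itertools.product([-1, 1], repeat=n_visible)))
hidden_states = np.array(list(itertools.product([-1, 1], repeat=n_hidden)))

# Fine-grained Hamiltonian H(v): random pair-wise couplings (assumption)
J = np.triu(rng.standard_normal((n_visible, n_visible)), k=1)
H_v = np.array([-(v @ J @ v) for v in visible_states])            # shape (16,)

# Unnormalized coupling T(v, h): random bilinear form (assumption)
coupling = rng.standard_normal((n_visible, n_hidden))
T_vh = visible_states @ coupling @ hidden_states.T                # shape (16, 4)

# Correlation V = -T + log sum_h exp(T), so that sum_h exp(-V(v, h)) = 1 for every v
V_vh = -T_vh + np.log(np.exp(T_vh).sum(axis=1, keepdims=True))

# Effective Hamiltonian on the hidden spins: exp(-H_rg(h)) = sum_v exp(-H(v) - V(v, h))
H_rg = -np.log(np.exp(-H_v[:, None] - V_vh).sum(axis=0))          # shape (4,)

Z_visible = np.exp(-H_v).sum()
Z_hidden = np.exp(-H_rg).sum()
F_visible, F_hidden = -np.log(Z_visible), -np.log(Z_hidden)
```

Here Z_visible and Z_hidden agree to floating point precision, so the Free Energy is preserved. In a real problem we cannot enumerate the states, which is why a variational approximation is needed.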

What Kadanoff showed, way back in 1975, is that we can accurately approximate the Exact Renormalization Group Equation by finding a lower bound using this formalism

Deep learning appears to be a real-space variational RG technique, specifically applicable to very complex, inhomogeneous systems where the detailed scale transformations have to be learned from the data.

RBMs expressed using Variational RG

We will now show how to express RBMs using the VRG formalism and provide some intuition

In an RBM, we simply want to learn the Energy function directly; we don’t specify the Hamiltonian for the visible or hidden units explicitly, like we would in physics. The RBM Energy is just

We identify the Hamiltonian for the hidden units as the Renormalized Effective Hamiltonian from the VRG theory
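Written out, the identification can be sketched as follows. We assume the usual RBM energy with visible biases b_i, hidden biases c_j, and weights W_{ij}, and follow the mapping given in [1]; the superscripts RBM and RG are our labels.

```latex
\begin{align}
  % RBM energy and joint probability
  E(v,h) &= -\sum_i b_i v_i \;-\; \sum_j c_j h_j \;-\; \sum_{ij} v_i W_{ij} h_j,
  \qquad p(v,h) = \frac{e^{-E(v,h)}}{Z} \\
  % Choose the correlation (transfer operator) in terms of the RBM energy
  V(v,h) &= E(v,h) - H(v)
  \quad\Longleftrightarrow\quad T(v,h) = -E(v,h) + H(v) \\
  % Then the renormalized Hamiltonian is the RBM Hamiltonian for the hidden units
  e^{-H^{RG}(h)} &= \mathrm{Tr}_v\, e^{-H(v) - V(v,h)}
                  = \mathrm{Tr}_v\, e^{-E(v,h)}
                  \equiv e^{-H^{RBM}(h)}
\end{align}
```

With this choice, the normalization condition on V holds exactly only when the RBM marginal over the visible units reproduces the Hamiltonian H(v) of the data, which is the sense in which training approximates the exact RG transformation.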

RBM Hamiltonians / Marginal Probabilities

To obtain RBM Hamiltonians for just the visible or hidden nodes, we need to integrate out the other nodes; that is, we need to find the marginal probabilities.
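For binary {0, 1} hidden units, integrating out the hidden nodes can be done in closed form. The sketch below assumes the standard RBM energy with random, untrained parameters W, b, and c (our names), and checks the closed form against a brute-force sum over all hidden configurations.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n_visible, n_hidden = 3, 2

# Hypothetical parameters, random for illustration only
W = rng.standard_normal((n_visible, n_hidden))
b = rng.standard_normal(n_visible)
c = rng.standard_normal(n_hidden)

def energy(v, h):
    """Joint RBM energy E(v, h)."""
    return -(b @ v) - (c @ h) - (v @ W @ h)

def visible_hamiltonian(v):
    """Closed form of -log sum_h exp(-E(v, h)) for binary {0, 1} hidden units."""
    return -(b @ v) - np.sum(np.logaddexp(0.0, c + v @ W))

visible_states = [np.array(v) for v in itertools.product([0, 1], repeat=n_visible)]
hidden_states = [np.array(h) for h in itertools.product([0, 1], repeat=n_hidden)]

# Brute force: explicitly sum over every hidden configuration
brute_force = np.array([
    -np.log(sum(np.exp(-energy(v, h)) for h in hidden_states))
    for v in visible_states
])
closed_form = np.array([visible_hamiltonian(v) for v in visible_states])

# Marginal probability of each visible configuration
Z = sum(np.exp(-energy(v, h)) for v in visible_states for h in hidden_states)
p_visible = np.exp(-closed_form) / Z
```

The quantity returned by visible_hamiltonian is what the RBM literature usually calls the free energy of a visible vector. The same trick, with the roles of v and h swapped, gives the Hamiltonian for the hidden nodes.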

Training RBMs

To train an RBM, we apply Contrastive Divergence (CD), or, perhaps today, Persistent Contrastive Divergence (PCD). We can kind of think of this as slowly approximating

In practice, however, RBM training minimizes the associated Free Energy difference … or something akin to this…to avoid overfitting.
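To give a feel for what a CD update looks like, here is a minimal sketch of one-step Contrastive Divergence (CD-1) for a binary RBM. It is an illustration under our own assumptions (full-batch updates, a fixed learning rate, no momentum, weight decay, or other regularization, and a made-up 4-pixel dataset), not a production training loop.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, rng, learning_rate=0.1):
    """One CD-1 step on a batch v0 of shape (batch, n_visible)."""
    # Positive phase: hidden probabilities driven by the data, then sampled states
    h0_prob = sigmoid(v0 @ W + c)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    # Negative phase: one reconstruction step (probabilities, not samples)
    v1_prob = sigmoid(h0 @ W.T + b)
    h1_prob = sigmoid(v1_prob @ W + c)
    # Approximate gradient: data statistics minus reconstruction statistics
    batch = v0.shape[0]
    W = W + learning_rate * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / batch
    b = b + learning_rate * (v0 - v1_prob).mean(axis=0)
    c = c + learning_rate * (h0_prob - h1_prob).mean(axis=0)
    return W, b, c

rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0],
                 [1, 1, 0, 0],
                 [0, 0, 1, 1],
                 [0, 0, 1, 1]], dtype=float)

W = 0.01 * rng.standard_normal((4, 2))
b = np.zeros(4)
c = np.zeros(2)
for _ in range(200):
    W, b, c = cd1_update(data, W, b, c, rng)

# Deterministic reconstruction of the training data after training
reconstruction = sigmoid(sigmoid(data @ W + c) @ W.T + b)
```

PCD differs mainly in that the negative phase continues a persistent chain instead of restarting from the data at every update.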

In “A Practical Guide to Training Restricted Boltzmann Machines” [6], Hinton explains how to train an RBM (circa 2010). Section 6 addresses “Monitoring the overfitting”:

“it is possible to directly monitor the overfitting by comparing the free energies of training data and held out validation data…If the model is not overfitting at all, the average free energy should be about the same on training and validation data”

Other Objective Functions

Modern variants of Real Space VRG are not “‘forced’ to minimize the global free energy” and have attempted other approaches such as Tensor-SVD Renormalization. Likewise, some RBM / DBM approaches may minimize a different objective.

In some methods, we minimize the KL Divergence; this has a very natural analog in VRG language [1].

Why Deep Learning Works: Lessons from Theoretical Physics

The Renormalization Group Theory provides new insights as to why Deep Learning works so amazingly well. It is not, however, a complete theory. Rather, it is a framework for beginning to understand what is an incredibly powerful, modern, applied tool. Enjoy!

References

[1] An exact mapping between the Variational Renormalization Group and Deep Learning, 2014

[2] Variational Approximations for Renormalization Group Transformations, 1976

[3] A Common Logic to Seeing Cats and Cosmos

[4] Why Does Unsupervised Deep Learning Work? A Perspective from Group Theory, 2015

[5] A Fundamental Theory to Model the Mind

[6] A Practical Guide to Training Restricted Boltzmann Machines, 2010

[7] On the importance of initialization and momentum in deep learning, 2013

[8] Dropout: A Simple Way to Prevent Neural Networks from Overfitting, 2014

[9] Ken Wilson, A Scientific Appreciation

[10] http://www-math.unice.fr/~patras/CargeseConference/ACQFT09_JZinnJustin.pdf

[11] Training Products of Experts by Minimizing Contrastive Divergence, 2002

[12] Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient, 2008

[13] http://www.nonlin-processes-geophys.net/3/102/1996/npg-3-102-1996.pdf

[14] The Renormalization Group and Critical Phenomena, Ken Wilson Nobel Prize Lecture

[15] Scaling, universality, and renormalization: Three pillars of modern critical phenomena

[16] The Potential Energy of an Autoencoder, 2014

[17] Contractive Auto-Encoders: Explicit Invariance During Feature Extraction, 2011

[18] Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, 2010

[19] Quora: What is Renormalization group theory?

[20] Renormalization group and critical localization, 1977