NIPS 2016 Review, Day 2

A Lab41 Perspective

Why good morning again, fellow machine learners. It’s another day at NIPS, and what a marathon. The sessions ran from 9am to 9pm last night, and I was there for most of it! (Check out my NIPS 2016 Review, Day 1 for the low-down on yesterday’s action.) Ok, let’s get crackin’.

NIPS Main Conference in Barcelona: La Sagrada Familia

Overall, today’s talks were more polished and easier to follow, though that’s partly because I was jet-lagged yesterday. You’ll see a bit more coverage today than yesterday. The posters were still difficult to navigate, so I just took note of which ones had the most people around them. (Generally, those were the ones with big company names behind the authors.) I’m not going to go into detail on those here, mostly because I could never get close enough to snap a photo of the poster.

Also, on a more fun note, I talked with someone from Google’s DeepMind. Apparently, they’ve got some 150 people here. That’s about half the group, with the other half being support staff!

Keynote

Today’s first invited talk: machine learning in particle physics. The speaker is Kyle Cranmer, a physicist at NYU who works on ATLAS, one of the experiments at CERN’s Large Hadron Collider, and who did some work on the Higgs boson. His group’s work has a lot of great innovations with PGMs, and the talk showed what they’ve been doing with likelihood-free estimation.

(As an aside, his analogy of stuff Physics should concentrate on with ML was originally a “pie chart” but is now a “cake chart” a la Yann LeCun. LoL.)

The basic idea is that they’ve got a lot of parameters in particle physics. That means your likelihood function is nuts. So, then you try to do inference on it. You then realize that doing that is nuts, too. The example he gave was the following forward model (I guess, maybe for finding the Higgs boson):

Some equation in quantum field theory gives a prediction for high-energy collisions. They simulate the interactions and then do feature extraction on the simulated data as if it came from real collisions. Finally, they model all of this with probabilistic graphical models (PGMs). However, evaluating the likelihood means integrating over all of the micro-physics, and Monte Carlo integration over that is impossible, so the PGM is intractable. So, you can’t really use likelihood methods. Enter likelihood-free estimation.
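To make “likelihood-free” a bit more concrete, here is a minimal sketch of approximate Bayesian computation (ABC) by rejection sampling. To be clear, this is not the method from the talk; it’s just one of the simplest likelihood-free techniques I know of. The simulator, the flat prior, the tolerance, and every name in the code are assumptions made up for illustration. The point is that you only ever run the forward model; you never evaluate a likelihood.

```python
import random


def simulate(theta, n_events, rng):
    """Hypothetical forward model: n_events noisy measurements centered on theta."""
    return [rng.gauss(theta, 1.0) for _ in range(n_events)]


def summarize(events):
    """Summary statistic (the 'feature extraction' step): here, just the mean."""
    return sum(events) / len(events)


def abc_rejection(observed_summary, n_draws=5000, n_events=50, tolerance=0.2, seed=0):
    """Keep the prior draws whose simulated summary lands near the observed one."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # assumed flat prior over the parameter
        simulated_summary = summarize(simulate(theta, n_events, rng))
        if abs(simulated_summary - observed_summary) < tolerance:
            accepted.append(theta)
    return accepted


# Pretend the real experiment produced a summary statistic of 1.5.
posterior_samples = abc_rejection(observed_summary=1.5)
posterior_mean = sum(posterior_samples) / len(posterior_samples)
```

The accepted values of theta act as approximate posterior samples. Shrinking the tolerance makes the approximation tighter but throws away more simulations, which is exactly the cost that hurts when each simulation is a full collider event.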

They’ve had discussions with David Blei, and their machine learning contingent is quite strong; Cranmer’s team has more machine learning staff than physicists. His modeling efforts have yielded a pretty generalizable chart shown below with CARL being his invention:

Boston Dynamics: Invited Talk

One of the more entertaining talks since there were so many videos. Here’s a whole bunch of robots that they’ve built.

They’ve built a fairly robust humanoid robot (The Next Generation Atlas). From my coworker, Brad:

They 3D printed the frame with fluid channels built in so that they could remove hoses and connectors.

Another newer robot is called Spot. It was a crowd favorite, mostly because of what the engineers did to it: knocking it over, closing doors on it, hitting stuff it was trying to pick up, and throwing a banana on the ground for it to trip over. The engineers were the Lucy to its Charlie Brown.

It’s interesting to note that they are doing no learning whatsoever. At a conference where learning plays such a prominent role, they were especially cognizant of that, and I think they’ll be looking for AI experts in the near future.

Learning by Poking — There was a similar talk in the afternoon session. This robot actually learns to manipulate objects by poking them. The experimental setup is an arm hanging over an object, and you want the arm to move that object to another location.

The training setup: an initial image and a goal image. The robot makes random pokes and gets rewarded if, after a poke, the object ends up closer to where it appears in the goal image. It memorizes where it poked the object and what happened to the object’s orientation and position afterwards. They tried to make it harder by adding other random objects as distractions from the objective. In many cases, complex movements are required to get the object to the goal. Also, the objects it pokes aren’t your typical blocks; they’re coffee cups (that roll) and other stuff that’s randomly shaped.

The (Second) Best Paper Award

Here, I’d like to point out that just because it’s listed second, I don’t think it’s any less “best” than the “best” paper award. It’s a very rigorous proof answering some questions about why non-convex optimization works well for training machine learning models, specifically for the matrix completion problem.

Matrix Completion has no Spurious Local Minimum

I didn’t know this, and it appears to be recent work (2015), but there’s been some stuff on building a conjectural unified theory saying that, for many of these problems, all local minima are (approximately) global minima. This paper establishes that property for the matrix completion problem, implying that stochastic gradient descent (SGD) converges to a global minimum. There’s a sampling element, so the result holds with high probability, but the upshot is that popular optimization techniques started from random initialization end up at the same minimum value, and do so in polynomial time. The proof was glossed over a bit quickly, so it’ll be worth looking at the paper.
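For a feel of what the non-convex objective looks like, here is a toy sketch of my own, not the paper’s setup: recover a rank-1 symmetric matrix M = z zᵀ from a subset of its entries by running plain gradient descent on the factor x. The vector z, the hidden entries, the step size, and all function names are assumptions for illustration, and I’ve left out the regularizer and the sampling conditions that the actual analysis relies on.

```python
import random


def loss(x, observed):
    """Half the squared error between x x^T and M on the observed entries only."""
    return 0.5 * sum((x[i] * x[j] - m) ** 2 for (i, j), m in observed.items())


def gradient(x, observed):
    """Gradient of the loss above with respect to the factor x."""
    grad = [0.0] * len(x)
    for (i, j), m in observed.items():
        residual = x[i] * x[j] - m
        grad[i] += residual * x[j]
        grad[j] += residual * x[i]
    return grad


def gradient_descent(x, observed, steps=5000, lr=0.02):
    """Plain (non-stochastic) gradient descent from the starting point x."""
    for _ in range(steps):
        g = gradient(x, observed)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical ground truth: M = z z^T, with two symmetric pairs of entries hidden.
z = [1.0, -0.5, 0.8, 0.3, -1.2]
n = len(z)
hidden = {(0, 1), (1, 0), (2, 4), (4, 2)}
observed = {(i, j): z[i] * z[j]
            for i in range(n) for j in range(n) if (i, j) not in hidden}

rng = random.Random(0)
x0 = [rng.uniform(-0.5, 0.5) for _ in range(n)]  # random initialization
x_hat = gradient_descent(x0, observed)
```

The objective is non-convex in x (both x = z and x = -z fit the data equally well, with a saddle point at zero in between). If the run ends in a global minimum, x_hat matches z up to sign, and a hidden entry such as x_hat[0] * x_hat[1] can be compared against z[0] * z[1] as a quick check.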

Other Optimization Talks

There were two other talks relating to global and local minima. One was titled Without-Replacement Sampling for SGD; the other was Deep Learning without Poor Local Minima. Because the latter was in the same room, I opted to stay.

Deep Learning without Poor Local Minima — Looks like global minima are all the rage now (as well they should be). There were some strong statements in this work, but the idea is that while random initialization gives you all sorts of rando-weights, if you train with similar optimization functions, you’re going to get similar performance.

The proof begins with the fact that there are apparently seven strong assumptions used to ensure that every local minimum is a global minimum (the loss itself is still non-convex) for linear neural networks. He essentially shows that under assumptions 1–4 there are no bad local minima, a question that I guess goes back to a 1989 paper. He then moves on to nonlinear neural networks (with ReLU activation) and, with even fewer assumptions, asserts the same. I would encourage you to read this Stack Overflow post for some background. I originally saw this on arXiv, and there’s a good thread on Hacker News.

Interesting Session Talks

Following the trend of neural network image and video synthesis, there’s been some interesting stuff at this conference. At ICML, we saw Scott Reed’s image from text work. Today, just a half year later, more of Scott Reed. Some other works without the need for GANs (congrats, Katie!) show some impressive predictive video.

Modeling Future Frames from an Image — Exactly what the title says. The demo was of a lady doing exercise, and it trains using the motion information from similar video sequences. They delivered it very well, and you can catch a lot of this at their website and on the video:

Generative Adversarial What-Where Networks — Another GAN from Scott Reed. He’s at all the big conferences. It’s similar to his Generating Interpretable Images talk. But now you can put these birds and stuff anywhere you want to. You just tell it some keypoints, and then off it goes. You can take a look at his code at: https://github.com/reedscot/nips2016. Conclusions: location conditioning is useful for image synthesis. It adds an additional layer of control to get more interpretability. Works well for birds, not so well for humans and faces.

Weight Normalization — This is yet another normalization scheme for optimizing deep neural networks, though the results he showed at the conference were more about accuracy than computational performance. Their argument is that batch normalization adds noise to your gradient updates. While noise is probably good when you’re training CNNs on images because it adds a bit of regularization (e.g., it can take care of invariances and stuff that doesn’t matter), it’s not so useful when you want to do reinforcement learning. Instead of normalizing over batches, he normalizes the weights directly, the contributions being weight normalization plus data-dependent initialization. He showed this on reinforcement learning with DQN. Looks like the scores got better, sometimes by 30%. Their code is at https://github.com/openai/weightnorm. For Keras, it’s two lines of code.
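As I understand the reparameterization, each weight vector w is written as a direction v and a separate scale g, so that w = g · v / ‖v‖, and you do gradient descent on g and v instead of on w. Here’s a small sketch of that in plain Python. It is not the API of the repository above; the function names and the example numbers are mine, and the data-dependent initialization is left out.

```python
import math


def weight_norm(v, g):
    """Reparameterize a weight vector as w = g * v / ||v||."""
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return [g * vi / norm_v for vi in v]


def weight_norm_grads(v, g, grad_w):
    """Map a gradient with respect to w back onto the parameters g and v."""
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    # d(loss)/dg is the component of grad_w along the direction of v.
    grad_g = sum(gw * vi for gw, vi in zip(grad_w, v)) / norm_v
    # d(loss)/dv is grad_w scaled by g / ||v||, with the part along v removed.
    grad_v = [g / norm_v * gw - g * grad_g / norm_v ** 2 * vi
              for gw, vi in zip(grad_w, v)]
    return grad_g, grad_v


v = [3.0, 4.0]   # direction parameter (arbitrary example values)
g = 2.0          # scale parameter
w = weight_norm(v, g)
grad_g, grad_v = weight_norm_grads(v, g, grad_w=[0.5, -1.0])
```

Two things worth noticing: the length of w is always exactly g no matter how v is scaled, and the gradient with respect to v is perpendicular to v. Nothing here depends on the other examples in a minibatch, which is the contrast with batch normalization.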

Supervised Word Mover’s Distance — Kilian Weinberger is a co-author on this work, which does document comparisons. It’s based on their ICML 2015 paper, titled From Word Embeddings To Document Distances, but this one is supervised. The idea is essentially to use Earth Mover’s Distance on a bag of word vectors. He calls it Word Mover’s Distance, and my colleagues and I laughed at the fact that Matt Kusner, the graduate student, had WMDs. It’s a pretty interesting metric: you take the L2 distances between the word vectors of one document and those of the second, and find the cheapest way to move the words of the first document onto the second. Their contribution is to learn a matrix that takes care of known similar words and apply it to this distance metric.
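Here’s a tiny sketch of the unsupervised distance, under a simplifying assumption of mine: both documents have the same number of words and every word carries equal weight. In that special case the earth mover’s problem reduces to finding the best one-to-one matching, which for a handful of words you can brute-force over permutations. The 2-D “embeddings” are invented for illustration, and the learned matrix from the supervised version is not included.

```python
import itertools
import math

# Toy 2-D word vectors, made up for this example (real embeddings have hundreds of dims).
EMBEDDINGS = {
    "obama": (1.0, 0.0), "president": (1.0, 0.5),
    "speaks": (0.0, 1.0), "greets": (0.0, 1.5),
    "media": (3.0, 3.0), "press": (3.0, 3.5),
    "illinois": (5.0, 0.0), "chicago": (5.0, 0.5),
}


def euclidean(a, b):
    """L2 distance between two word vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def word_movers_distance(doc_a, doc_b, embeddings):
    """Exact WMD for equal-length documents with uniform word weights."""
    if len(doc_a) != len(doc_b):
        raise ValueError("this sketch only handles documents of equal length")
    n = len(doc_a)
    # Try every one-to-one matching of words and keep the cheapest total travel.
    best_total = min(
        sum(euclidean(embeddings[a], embeddings[doc_b[j]])
            for a, j in zip(doc_a, perm))
        for perm in itertools.permutations(range(n))
    )
    return best_total / n  # each word carries 1/n of the document's mass


doc_a = ["obama", "speaks", "media", "illinois"]
doc_b = ["president", "greets", "press", "chicago"]
distance = word_movers_distance(doc_a, doc_b, EMBEDDINGS)
```

With these toy vectors each word’s nearest neighbor in the other document sits 0.5 away, so the distance comes out to 0.5 even though the two documents share no words. Real documents have different lengths and word frequencies, which needs a proper transport solver rather than brute force.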

The Chinese Voting Process — Though the title is a play on the Chinese Restaurant Process, being from Korea, I would’ve expected the author to know full well that the Chinese don’t vote. His talk deals with bias in up-voting and helpfulness ratings on product reviews, Amazon and the like, and Stack Overflow. There were a few good examples: apparently presentation bias (the tendency for people to conform to others’ opinions) plays strongly on Stack Overflow. In another example, he searched Amazon with the bold query “nips”. Luckily, he came back with some chocolate cookies, but there are some interesting implications for how voting can be biased and reinforced due to position. His method takes care of this through a generative process called the Chinese Voting Process. This process essentially models (temporally) how people would vote based on the reviews already existing in the corpus. It’s similar to the restaurant process, but the idea is that certain reviews build momentum and trickle up to the top, not by any merit of the review itself, but because of the biases inherent in behavior.
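To illustrate the momentum effect, here is a toy rich-get-richer simulation. I want to stress that this is not the authors’ Chinese Voting Process, whose details I didn’t catch; it’s a hypothetical stand-in where every review is equally good, and a voter’s choice depends only on how many votes a review already has and how high it’s displayed. The decay factor and all names are assumptions.

```python
import random


def simulate_votes(n_reviews=5, n_voters=1000, position_decay=0.5, seed=0):
    """Simulate voters who favor already-popular, higher-placed reviews."""
    rng = random.Random(seed)
    votes = [0] * n_reviews
    for _ in range(n_voters):
        # Reviews are displayed from most-voted to least-voted.
        ranking = sorted(range(n_reviews), key=lambda r: -votes[r])
        weights = [0.0] * n_reviews
        for position, review in enumerate(ranking):
            # Existing votes attract more votes; lower positions get seen less.
            weights[review] = (votes[review] + 1) * position_decay ** position
        chosen = rng.choices(range(n_reviews), weights=weights)[0]
        votes[chosen] += 1
    return votes


votes = simulate_votes()
```

Since the reviews are identical in quality by construction, any lopsidedness in the final counts comes purely from early luck plus position, which is the kind of bias the talk was about modeling and correcting.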

And That’s It for Today

We’ve undoubtedly missed something! Be sure to get our take on Day 1 too. If you attended, please let us know via e-mail (kni@iqt.org) what other fun things were at NIPS. Tune in tomorrow for the final blog post on NIPS 2016. It’s a blast out here in Barcelona, and we love sharing what we’re seeing!