NIPS 2013 reading list

As usual, here is my NIPS 2013 reading list. It’s incomplete, and only has papers I could find online as of a couple of weeks ago.

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction; R. Johnson, T. Zhang

Another paper proposing a linear convergence rate for stochastic gradient descent. It builds on the earlier stochastic average gradient (SAG) approach, which works by storing the gradient of every training example as well as their sum, and then, in each iteration, sampling a training example and updating the sum of gradients by how much the gradient with respect to that example has changed since it was last visited.

The contribution of this paper is a proof that, instead of keeping the gradient for every example and a running sum of all gradients, you can keep around an old weight vector and the sum of gradients computed at that weight vector, efficiently “correct” this gradient sum with respect to one training example at a time, and still get the same linear convergence rate.

A cool thing is that this lets you implement gradient updates for sparse feature vectors efficiently, since you can keep a multiplier on the full-gradient term you keep around, as the information that varies from sample to sample is sparse.
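
Here is a minimal sketch of the update as I understand it (not the authors’ code), using a least-squares loss for concreteness; the step size eta and inner-loop length m are placeholders of mine:

```python
import numpy as np

def svrg_least_squares(X, y, eta=0.1, epochs=20, m=None):
    """Sketch of an SVRG-style update for 0.5 * ||X w - y||^2 / n."""
    n, d = X.shape
    m = m or n                                      # inner-loop length (a free parameter)
    grad = lambda w, i: (X[i] @ w - y[i]) * X[i]    # gradient for a single example
    w = np.zeros(d)
    for _ in range(epochs):
        w_old = w.copy()                            # the "old weight vector"
        mu = X.T @ (X @ w_old - y) / n              # full gradient at that snapshot
        for _ in range(m):
            i = np.random.randint(n)
            # corrected stochastic gradient: unbiased, and its variance shrinks
            # as w and w_old approach the optimum
            g = grad(w, i) - grad(w_old, i) + mu
            w -= eta * g
    return w
```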

An Approximate, Efficient LP Solver for LP Rounding; S. Sridhar, S. Wright, C. Re, J. Liu, V. Bittorf, C. Zhang

This paper has two intertwined contributions: first, it shows how to approximately solve many of the LPs that show up in machine learning with coordinate descent on a quadratically-smoothed version of the problem (which looks like ADMM); second, it shows that approximate solutions to LP relaxations work almost as well as exact solutions when rounding to approximate ILPs.

The results are compelling, and I’m curious to try this quadratic approximation in some LPs and see what happens.
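
As a toy illustration of the rounding side (using scipy’s exact LP solver instead of the paper’s approximate one, and a made-up graph), here is the classic vertex cover relaxation with 1/2-threshold rounding:

```python
import numpy as np
from scipy.optimize import linprog

# Vertex cover LP relaxation: minimize sum(x) s.t. x_u + x_v >= 1 per edge, 0 <= x <= 1.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]     # a small example graph on 4 nodes
n = 4
A = np.zeros((len(edges), n))
for k, (u, v) in enumerate(edges):
    A[k, u] = A[k, v] = -1.0                          # encode x_u + x_v >= 1 as -x_u - x_v <= -1
res = linprog(c=np.ones(n), A_ub=A, b_ub=-np.ones(len(edges)), bounds=[(0, 1)] * n)
cover = res.x >= 0.5                                  # rounding: take every vertex at least 1/2
print(res.x, cover)
```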

A simple example of Dirichlet process mixture inconsistency for the number of components; J. Miller, M. Harrison

I think this paper makes a very helpful point. In many places and casual conversations one hears the dismissive remark that using a Dirichlet process prior is the way to do inference when one doesn’t know the actual number of components. This is clearly not always the case, as the Dirichlet process makes some quite restrictive assumptions about the sizes of the produced clusters (a few big clusters and many small clusters, essentially required for exchangeable models), but this is the first work I know of that shows it actually also misrepresents the number of clusters (a behavior anyone who has tried fitting a DP mixture model to random data has observed).

Better Approximation and Faster Algorithm Using the Proximal Average; Y. Yu

This paper shows the neat idea that if you have many nonsmooth regularizers in an optimization problem you can still do something FISTA-like: instead of computing the expensive prox operator of all the regularizers together, simply do an accelerated gradient step on the smooth part and average the prox operators of the individual regularizers. While this won’t give you sparsity, it seems to be efficient, and it’s very easy to implement.
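
Roughly what I take the update to be (dropping the acceleration for brevity), sketched for a smooth loss plus an L1 term and a box constraint, both of which have cheap prox operators; the step size and weights are placeholders of mine:

```python
import numpy as np

def prox_l1(v, t):                          # prox of t * ||.||_1: soft-thresholding
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_box(v, lo=-1.0, hi=1.0):           # prox of the indicator of [lo, hi]: projection
    return np.clip(v, lo, hi)

def proximal_average_step(w, grad_smooth, eta=0.1, lam=0.01):
    """One step: gradient on the smooth part, then average the prox operators
    of the individual regularizers instead of computing the prox of their sum."""
    v = w - eta * grad_smooth(w)
    return 0.5 * (prox_l1(v, eta * lam) + prox_box(v))
```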

Distributed Representations of Words and Phrases and their Compositionality; T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean

This is the second paper from the Google team working on word embeddings. As with the earlier work, the results are pretty impressive. It still strikes me as counterintuitive that these local bag-of-words or skip-gram models sometimes end up capturing semantics better than more complex models that take things such as word order into account. The contributions in this one are using phrases, exploring negative sampling instead of the tree-structured softmax, subsampling frequent words, and more nice eye candy of the form A - B + C = D, which is always fun to stare at. To be fair, though, some human selection has to be done, because intuitive semantics is not as commutative as these operations. For example, “Berlin - Germany + France = Paris” looks good, but “Berlin - Paris + France = Germany” really doesn’t.
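
The A - B + C = D game is just a nearest-neighbor search around the offset vector; here is a small numpy sketch, assuming you already have an embeddings matrix and a vocab list (both placeholders of mine):

```python
import numpy as np

def analogy(a, b, c, embeddings, vocab):
    """Return the word whose vector is closest (by cosine) to  a - b + c."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = embeddings[idx[a]] - embeddings[idx[b]] + embeddings[idx[c]]
    sims = embeddings @ target
    sims /= np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target) + 1e-12
    sims[[idx[a], idx[b], idx[c]]] = -np.inf     # don't return one of the inputs
    return vocab[int(np.argmax(sims))]

# analogy("berlin", "germany", "france", embeddings, vocab)  ->  hopefully "paris"
```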

Dropout Training as Adaptive Regularization; S. Wager, S. Wang, P. Liang

This paper investigates what regularizer is actually implied by dropout on linear models. It turns out to be something suspiciously close to the regularizers from AdaGrad and confidence-weighted learning algorithms, in that it depends on the covariance matrix of the data (which is intuitive, since randomly deleting features is going to have an effect proportional to how correlated the features are). They present reasonably good results from training models with the exact regularizer, as well as with a variant of AdaGrad which is closer to dropout.
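
For concreteness, explicit feature dropout during SGD on logistic regression looks like the sketch below; the paper’s point is that, in expectation, this noise acts like an adaptive, data-dependent regularizer. The dropout rate and step size are placeholders of mine:

```python
import numpy as np

def sgd_dropout_logreg(X, y, p_drop=0.5, eta=0.1, epochs=5, seed=0):
    """Logistic regression (labels in {0, 1}) trained with input-feature dropout."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            mask = rng.random(d) >= p_drop
            x = X[i] * mask / (1.0 - p_drop)       # inverted-dropout scaling keeps E[x] fixed
            p = 1.0 / (1.0 + np.exp(-x @ w))
            w -= eta * (p - y[i]) * x              # gradient of the logistic loss
    return w
```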

Estimation, Optimization, and Parallelism when Data is Sparse; J. Duchi, M. Jordan, B. McMahan

This paper analyses learning in the (common for me) case of very high-dimensional, very sparse feature vectors. As has been observed empirically, AdaGrad does well here, and this paper proves matching lower and upper bounds on performance which are achieved by AdaGrad, which is nice. Moreover, it proposes, and proves good properties about, a parallel implementation of dual averaging, which is essentially what I’ve been using for the past year in factorie, with good results.
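
Part of why AdaGrad is such a natural fit here is that its per-coordinate step sizes only need to be touched at the nonzero features of each example; a sketch of that update (not the paper’s parallel dual-averaging variant), with made-up names:

```python
import numpy as np

def adagrad_sparse_update(w, G, nonzero_idx, grad_vals, eta=0.1, eps=1e-8):
    """One AdaGrad step touching only the coordinates where the example is nonzero.

    w, G        : dense parameters and accumulated squared gradients
    nonzero_idx : indices of the example's nonzero features
    grad_vals   : gradient values at those indices
    """
    grad_vals = np.asarray(grad_vals, dtype=float)
    G[nonzero_idx] += grad_vals ** 2
    w[nonzero_idx] -= eta * grad_vals / (np.sqrt(G[nonzero_idx]) + eps)
    return w, G
```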

Generalized Denoising Auto-Encoders as Generative Models; Y. Bengio, L. Yao, G. Alain, P. Vincent

This is further work in the line of research interpreting autoencoders as generative models. Interestingly enough, the training algorithms (like Algorithm 1 in this paper) look a lot like training algorithms for RBMs, but hopefully with more stability and fewer issues with things like Gaussian units. Looking at samples from the distributions defined by autoencoders is really interesting.
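
The sampling procedure implied by the generative interpretation is just a Markov chain that alternates corruption and reconstruction; a schematic version, where corrupt and reconstruct stand in for a trained denoising autoencoder (both are placeholders, not the paper’s code):

```python
import numpy as np

def dae_sample_chain(x0, corrupt, reconstruct, steps=1000, rng=None):
    """Alternate corrupting and reconstructing to draw (approximate) samples
    from the distribution a trained denoising autoencoder defines."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(steps):
        x_tilde = corrupt(x, rng)                          # e.g. add noise or mask coordinates
        x = np.asarray(reconstruct(x_tilde), dtype=float)  # ideally a sample from P(X | x_tilde)
        samples.append(x.copy())
    return np.array(samples)

# e.g. corrupt = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
```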

Learning word embeddings efficiently with noise-contrastive estimation; A. Mnih, K. Kavukcuoglu

This is very similar to the Google paper I mentioned above. It shares one contribution, noting that noise-contrastive estimation is faster than a tree of classifiers (and doesn’t create the problem of coming up with a good tree structure). I’m puzzled by the fact that the numbers seem a bit worse than the Google numbers.

Projecting Ising Model Parameters for Fast Mixing; J. Domke, X. Liu

The idea is that since some model structures (those with low coupling) are known to make sampling efficient, one can take a model which doesn’t fit this template and somehow project its parameters onto one that does, to improve the efficiency of sampling. This paper explores many ways of doing so, and the approach based on the reverse KL divergence (minimizing the KL between the distributions induced by the approximate and the true parameters), which is similar to a mean-field approximation, works well (though it barely outperforms loopy BP, as usual).
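
A crude stand-in for what such a projection can look like (not the paper’s actual projections): if I remember right, one sufficient condition for fast mixing of Gibbs sampling is that the spectral norm of the matrix of absolute coupling strengths is below 1, so the simplest possible “projection” is just shrinking the couplings until that holds:

```python
import numpy as np

def shrink_couplings(J, target=0.99):
    """Rescale an Ising coupling matrix J so that the spectral norm of |J|
    falls below `target`, a (crude) sufficient condition for fast mixing."""
    norm = np.linalg.norm(np.abs(J), 2)      # largest singular value of |J|
    return J if norm <= target else J * (target / norm)
```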

Reasoning With Neural Tensor Networks for Knowledge Base Completion; R. Socher, D. Chen, C. Manning, A. Ng

This paper is another one attacking the problem of reasoning in knowledge bases as tensor completion. Here the focus is on learning relations: each entity gets a vector, and each relation (in WordNet or Freebase) gets a neural network which takes two entities and produces a score. The model is trained to minimize training loss and performs reasonably well on test data. One clever idea is to represent entities with the average representations of the words which compose them (coming from any neural language model, like Collobert & Weston’s); then knowledge is automatically shared across entities which share words, which can be good.
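
The per-relation scoring function, as I understand it, is a bilinear tensor term plus a standard layer; a small numpy sketch with made-up dimensions (each relation has its own W, V, b, u):

```python
import numpy as np

def ntn_score(e1, e2, W, V, b, u):
    """Neural Tensor Network score for an (e1, relation, e2) triple.

    e1, e2 : entity vectors, shape (d,)
    W      : relation tensor, shape (k, d, d);  V: (k, 2d);  b, u: (k,)
    """
    bilinear = np.einsum('i,kij,j->k', e1, W, e2)    # e1^T W[slice] e2 for each slice
    hidden = np.tanh(bilinear + V @ np.concatenate([e1, e2]) + b)
    return float(u @ hidden)
```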

Streaming Variational Bayes; T. Broderick, N. Boyd, A. Wibisono, A. Wilson, M. Jordan

The idea of this paper is to take the ideas behind stochastic variational inference algorithms, which approximate the posterior given the prior and all observations by taking stochastic natural gradient steps, and use these techniques in a streaming setting, where you want to approximate the posterior given the previous posterior and one new observation. The technique is parallelizable, and seems to work pretty well.
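
The core trick, stripped of the variational machinery, is that yesterday’s posterior becomes today’s prior; in a conjugate toy model (Beta-Bernoulli, my choice, not the paper’s) that streaming update is exact:

```python
def streaming_beta_bernoulli(batches, alpha=1.0, beta=1.0):
    """Update a Beta posterior over a coin's bias one mini-batch at a time;
    the posterior after each batch is the prior for the next one."""
    for batch in batches:
        heads = sum(batch)
        alpha += heads                    # observed heads
        beta += len(batch) - heads        # observed tails
        yield alpha, beta

# e.g. list(streaming_beta_bernoulli([[1, 1, 0], [0, 1]]))  ->  [(3.0, 2.0), (4.0, 3.0)]
```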

Variance Reduction for Stochastic Gradient Optimization; C. Wang, X. Chen, A. Smola, E. Xing

This is another paper with a variance-reduction strategy for improving stochastic gradient descent. The idea is to use control variates: you define another quantity which is also an expectation over the data, but one you can compute exactly, and whose per-sample fluctuations are correlated with those of your stochastic gradients. Then, instead of using the raw stochastic gradient, you subtract from it the deviation of the control variate from its known expectation (suitably scaled). This should reduce the negative effect of having too much variance in your stochastic gradients.
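
A generic version of the trick, for estimating a plain mean rather than the paper’s gradient construction: if h is correlated with f and E[h] is known, then f(x) - a (h(x) - E[h]) has the same expectation but lower variance, with the best a being Cov(f, h) / Var(h). A toy numpy sketch:

```python
import numpy as np

def control_variate_mean(f_vals, h_vals, h_mean):
    """Estimate mean(f) with reduced variance using a control variate h with known mean."""
    a = np.cov(f_vals, h_vals)[0, 1] / np.var(h_vals)    # near-optimal coefficient
    return np.mean(f_vals - a * (h_vals - h_mean))

# Toy check: estimate E[exp(U)], U ~ Uniform(0, 1), with h(U) = U and E[h] = 0.5.
rng = np.random.default_rng(0)
u = rng.random(1000)
print(np.exp(u).mean(), control_variate_mean(np.exp(u), u, 0.5))
```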

Deriving these control variates seems a bit like black magic to me, but that’s because I’m not familiar with them.

They test the technique on logistic regression and stochastic variational inference, and the results look ok for logistic regression but quite impressive for variational inference.