The number of scientific papers published annually has topped three million and the number continues to rise. In the field of machine learning it’s estimated that more than 100 papers are uploaded to leading online repository arXiv each and every day. That’s an awful lot of research to look at.

Synced surveyed last week’s crop of machine learning papers and identified seven that we believe may be of special interest to our readers. They include the recently announced EMNLP 2019 Best Paper; a new state-of-the-art model on multiple cross-language comprehension benchmarks proposed by Facebook; a paper published in Nature Communications that introduces the Eighty Five Percent Rule for optimal learning and applies it to the Perceptron; and more.

Paper One

EMNLP 2019 Best Paper Award: Specializing Word Embeddings (for Parsing) by Information Bottleneck

Authors: Xiang Lisa Li and Jason Eisner from Johns Hopkins University.

Abstract: Pre-trained word embeddings like ELMo and BERT contain rich syntactic and semantic information, resulting in state-of-the-art performance on various tasks. We propose a very fast variational information bottleneck (VIB) method to nonlinearly compress these embeddings, keeping only the information that helps a discriminative parser. We compress each word embedding to either a discrete tag or a continuous vector. In the discrete version, our automatically compressed tags form an alternative tag set: we show experimentally that our tags capture most of the information in traditional POS tag annotations, but our tag sequences can be parsed more accurately at the same level of tag granularity. In the continuous version, we show experimentally that moderately compressing the word embeddings by our method yields a more accurate parser in 8 of 9 languages, unlike simple dimensionality reduction.

Paper Two

Paper: Loss Landscape Sightseeing with Multi-Point Optimization

Authors: Ivan Skorokhodov and Mikhail Burtsev from the Neural Networks and Deep Learning Lab at Moscow Institute of Physics and Technology.

Project Link: https://github.com/universome/loss-patterns

Abstract: We present multi-point optimization: an optimization technique that allows to train several models simultaneously without the need to keep the parameters of each one individually. The proposed method is used for a thorough empirical analysis of the loss landscape of neural networks. By extensive experiments on FashionMNIST and CIFAR10 datasets we demonstrate two things: 1) loss surface is surprisingly diverse and intricate in terms of landscape patterns it contains, and 2) adding batch normalization makes it more smooth. Source code to reproduce all the reported results is available on GitHub.

Paper Three

Paper: Unsupervised Cross-lingual Representation Learning at Scale

Authors: Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov from Facebook AI.

Abstract: This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model. We also present a detailed empirical evaluation of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make XLM-R code, data, and models publicly available.

Paper Four

Paper: Understanding the Role of Momentum in Stochastic Gradient Methods

Authors: Igor Gitman, Hunter Lang, Pengchuan Zhang, and Lin Xiao from Microsoft Research AI.

Abstract: The use of momentum in stochastic gradient methods has become a widespread practice in machine learning. Different variants of momentum, including heavy-ball momentum, Nesterov’s accelerated gradient (NAG), and quasi-hyperbolic momentum (QHM), have demonstrated success on various tasks. Despite these empirical successes, there is a lack of clear understanding of how the momentum parameters affect convergence and various performance measures of different algorithms. In this paper, we use the general formulation of QHM to give a unified analysis of several popular algorithms, covering their asymptotic convergence conditions, stability regions, and properties of their stationary distributions. In addition, by combining the results on convergence rates and stationary distributions, we obtain sometimes counter-intuitive practical guidelines for setting the learning rate and momentum parameters.
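
For readers who have not met QHM before, the sketch below shows the update rule as it is commonly written in the QHM literature; the abstract itself does not spell it out, so treat this as our illustration and not the authors' code. The function names, the scalar setting, and the toy quadratic objective are all assumptions made for the example.

```python
def qhm_step(theta, grad, buf, lr=0.1, beta=0.9, nu=0.7):
    """One quasi-hyperbolic momentum update on a scalar parameter.

    buf is an exponential moving average of past gradients; the step
    direction mixes the raw gradient (weight 1 - nu) with buf (weight nu).
    """
    buf = beta * buf + (1.0 - beta) * grad
    theta = theta - lr * ((1.0 - nu) * grad + nu * buf)
    return theta, buf


def minimize_quadratic(nu, beta=0.9, lr=0.1, steps=200, theta0=5.0):
    """Run QHM on the toy objective f(theta) = 0.5 * theta**2."""
    theta, buf = theta0, 0.0
    for _ in range(steps):
        grad = theta  # derivative of 0.5 * theta**2
        theta, buf = qhm_step(theta, grad, buf, lr, beta, nu)
    return theta


if __name__ == "__main__":
    for nu in (0.0, 0.7, 1.0):
        print(nu, minimize_quadratic(nu))
```

With nu = 0 the update reduces to plain SGD, and with nu = 1 it becomes a normalized form of heavy-ball momentum. As we understand it, setting nu equal to beta corresponds to NAG, which is what lets a single formulation cover several popular algorithms.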

Paper Five

Paper: The Visual Task Adaptation Benchmark

Authors: Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen and other researchers from Google Research, Brain Team.

Abstract: Representation learning promises to unlock deep learning for the long tail of vision tasks without expansive labelled datasets. Yet, the absence of a unified yardstick to evaluate general visual representations hinders progress. Many sub-fields promise representations, but each has different evaluation protocols that are either too constrained (linear classification), limited in scope (ImageNet, CIFAR, Pascal-VOC), or only loosely related to representation quality (generation). We present the Visual Task Adaptation Benchmark (VTAB): a diverse, realistic, and challenging benchmark to evaluate representations. VTAB embodies one principle: good representations adapt to unseen tasks with few examples. We run a large VTAB study of popular algorithms, answering questions like: How effective are ImageNet representations on non-standard datasets? Are generative models competitive? Is self-supervision useful if one already has labels?

Paper Six

Paper: The Eighty Five Percent Rule for optimal learning

Authors: Robert C. Wilson from University of Arizona, Amitai Shenhav from Brown University, Mark Straccia from University of California, Los Angeles, and Jonathan D. Cohen from Princeton University.

Project Link: https://github.com/bobUA/EightyFivePercentRule

Abstract: Researchers and educators have long wrestled with the question of how best to teach their clients be they humans, non-human animals or machines. Here, we examine the role of a single variable, the difficulty of training, on the rate of learning. In many situations we find that there is a sweet spot in which training is neither too easy nor too hard, and where learning progresses most quickly. We derive conditions for this sweet spot for a broad class of learning algorithms in the context of binary classification tasks. For all of these stochastic gradient-descent based learning algorithms, we find that the optimal error rate for training is around 15.87% or, conversely, that the optimal training accuracy is about 85%. We demonstrate the efficacy of this ‘Eighty Five Percent Rule’ for artificial neural networks used in AI and biologically plausible neural networks thought to describe animal learning.
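
The oddly precise 15.87% figure is easier to remember once you notice that it equals the standard normal CDF evaluated at -1. The short check below reproduces the number under that reading; it assumes the Gaussian noise setting, as we understand the paper, and the function names are ours.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def optimal_error_rate():
    """Error rate at the 'sweet spot', taken here to be Phi(-1)."""
    return normal_cdf(-1.0)


if __name__ == "__main__":
    err = optimal_error_rate()
    print(f"optimal error rate: {err:.4f}")         # 0.1587
    print(f"optimal accuracy:   {1.0 - err:.4f}")   # 0.8413
```

Strictly speaking the matching accuracy is about 84.1%, which the rule's name rounds to eighty five.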

Paper Seven

Paper: Confident Learning: Estimating Uncertainty in Dataset Labels

Authors: Curtis G. Northcutt from MIT, Lu Jiang from Google, and Isaac L. Chuang from MIT.

Project Link: https://pypi.org/project/cleanlab/

Abstract: Learning exists in the context of data, yet notions of confidence typically focus on model predictions, not label quality. Confident learning (CL) has emerged as an approach for characterizing, identifying, and learning with noisy labels in datasets, based on the principles of pruning noisy data, counting to estimate noise, and ranking examples to train with confidence. Here, we generalize CL, building on the assumption of a classification noise process, to directly estimate the joint distribution between noisy (given) labels and uncorrupted (unknown) labels. This generalized CL, open-sourced as cleanlab, is provably consistent under reasonable conditions, and experimentally performant on ImageNet and CIFAR, outperforming recent approaches, e.g. MentorNet, by 30% or more, when label noise is non-uniform. cleanlab also quantifies ontological class overlap, and can increase model accuracy (e.g. ResNet) by providing clean data for training.
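
To give a feel for the "counting" principle, here is a deliberately simplified sketch of how confident examples might be tallied into a matrix of given labels versus likely true labels. It is not the cleanlab API and omits the calibration and pruning steps of the full method; the per-class threshold (mean self-confidence), the function names, and the toy data are our assumptions.

```python
def class_thresholds(labels, probs, n_classes):
    """Per-class threshold: mean predicted probability of class j
    over the examples whose given label is j."""
    sums = [0.0] * n_classes
    counts = [0] * n_classes
    for y, p in zip(labels, probs):
        sums[y] += p[y]
        counts[y] += 1
    return [s / c if c else 1.0 for s, c in zip(sums, counts)]


def confident_joint(labels, probs, n_classes):
    """Count examples by (given label, confidently predicted label).

    Also returns the indices that land off the diagonal, which are
    candidates for label errors.
    """
    thresholds = class_thresholds(labels, probs, n_classes)
    joint = [[0] * n_classes for _ in range(n_classes)]
    suspects = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        # Classes whose predicted probability clears that class's threshold.
        candidates = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not candidates:
            continue  # not confident about any class; leave uncounted
        guess = max(candidates, key=lambda j: p[j])
        joint[y][guess] += 1
        if guess != y:
            suspects.append(i)
    return joint, suspects


if __name__ == "__main__":
    # Toy data: six examples, two classes, hypothetical model outputs.
    labels = [0, 0, 0, 1, 1, 1]
    probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8],
             [0.1, 0.9], [0.3, 0.7], [0.85, 0.15]]
    joint, suspects = confident_joint(labels, probs, 2)
    print(joint)     # [[2, 1], [1, 2]]
    print(suspects)  # [2, 5]
```

In this toy run, examples 2 and 5 are flagged because the model is confident about a class that differs from the given label. In practice the predicted probabilities would be computed out of sample, for instance via cross-validation.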