Authors: Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Nouri, Norman Casagrande, Edward Lockhart, Sander Dieleman, Aaron van den Oord, Koray Kavukcuoglu

Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating desired samples. Efficient sampling for this class of models at the cost of little to no loss in quality has however remained an elusive task. With a focus on text-to-speech synthesis, we show that compact recurrent architectures, a remarkably high degree of weight sparsification and a novel reordering of the variables greatly reduce sampling latency while maintaining high audio fidelity. We first describe a compact single-layer recurrent neural network, the WaveRNN, with a novel dual softmax layer that matches the quality of the state-of-the-art WaveNet model. Persistent GPU kernels for the WaveRNN are able to synthesize 24kHz 16-bit audio 4 times faster than real time. We then apply a weight sparsification technique to the model. We show that, given a constant number of weights, large sparse networks perform better than small dense networks. Using a large Sparse WaveRNN, we demonstrate the feasibility of real-time synthesis of high-fidelity audio on a low-power mobile phone CPU. We use a large Sparse WaveRNN to demonstrate the first instance of real-time synthesis of high-fidelity audio on low-resource mobile phone CPU. Finally, we introduce a novel reordering of the variables in the factorization of the joint distribution. The reordering makes it possible to trade vacuous dependencies on samples from the distant future for the ability to generate in batches. The Batch WaveRNN produces up to 16 samples per step maintaining high quality and enables audio synthesis that is up to 40 times faster than real time.

Presentations:

11:00 – 11:20 AM @ Victoria (Oral)



06:15 – 09:00 PM @ Hall B #105 (Poster)



Authors: Arthur Guez*, Theophane Weber*, Ioannis Antonoglou, Karen Simonyan, Oriol Vinyals, Daan Wierstra, Remi Munos, David Silver

Planning problems are among the most important and well-studied problems in artificial intelligence. They are most typically solved by tree search algorithms that simulate ahead into the future, evaluate future states, and back-up those evaluations to the root of a search tree. Among these algorithms, Monte-Carlo tree search (MCTS) is one of the most general, powerful and widely used. A typical implementation of MCTS uses cleverly designed rules, optimised to the particular characteristics of the domain. These rules control where the simulation traverses, what to evaluate in the states that are reached, and how to back-up those evaluations. In this paper we instead learn where, what and how to search. Our architecture, which we call an MCTSnet, incorporates simulation-based search inside a neural network, by expanding, evaluating and backing-up a vector embedding. The parameters of the network are trained end-to-end using gradient-based optimisation. When applied to small searches in the well-known planning problem Sokoban, the learned search algorithm significantly outperformed MCTS baselines.

Presentations:

11:20 – 11:30 AM @ Victoria (Oral)



06:15 – 09:00 PM @ Hall B #92 (Poster)

Authors: Gellert Weisz, Andras Gyorgy, and Csaba Szepesvari

We consider the problem of configuring general-purpose solvers to run efficiently on problem instances drawn from an unknown distribution. The goal of the configurator is to find a configuration that runs fast on average on most instances, and do so with the least amount of total work. It can run a chosen solver on a random instance until the solver finishes or a timeout is reached. We propose LEAPSANDBOUNDS, an algorithm that tests configurations on randomly selected problem instances for longer and longer time. We prove that the capped expected runtime of the configuration returned by LEAPSANDBOUNDS is close to the optimal expected runtime, while our algorithm’s running time is near-optimal. Our results show that LEAPSANDBOUNDS is more efficient than the recent algorithm of Kleinberg et al. (2017), which, to our knowledge, is the only other algorithm configuration method that claims to have non-trivial theoretical guarantees. Experimental results on configuring a public SAT solver on a public benchmark also stand witness to the superiority of our method.

Presentations:

11:30 – 11:40 AM @ A6 (Oral)



06:15 – 09:00 PM @ Hall B #165 (Poster)



Authors: Will Dabney*, Georg Ostrovski*, David Silver, Remi Munos

In this work, we build on recent advances in distributional reinforcement learning to give a generally applicable, flexible, and state-of-the-art distributional variant of DQN. We achieve this by using quantile regression to approximate the full quantile function for the state-action return distribution. By reparameterizing a distribution over the sample space, this yields an implicitly defined return distribution and gives rise to a large class of risk-sensitive policies. We demonstrate improved performance on the 57 Atari 2600 games in the ALE, and use our algorithms implicitly defined distributions to study the effects of risk-sensitive policies in Atari games.

Presentations:

11:40 – 11:50 AM @ A1 (Oral)



06:15 – 09:00 PM @ Hall B #3 (Poster)



Authors: Alvaro Sanchez, Nicolas Heess, Jost Tobias Springenberg, Josh Merel, Martin Riedmiller, Raia Hadsell, Peter Battaglia

Understanding and interacting with everyday physical scenes requires rich knowledge about the structure of the world, represented either implicitly in a value or policy function, or explicitly in a transition model. Here we introduce a new class of learnable models—based on graph networks—which implement an inductive bias for object- and relation-centric representations of complex, dynamical systems. Our results show that as a forward model, our approach supports accurate predictions, and surprisingly strong and efficient generalization, across eight distinct physical systems which we varied parametrically and structurally. We also found that our inference model can perform system identification from real and simulated data. Our models are also differentiable, and support online planning via gradient based trajectory optimization, as well as offline policy optimization. Our framework offers new opportunities for harnessing and exploiting rich knowledge about the world, and takes a key step toward building machines with more human-like representations of the world.

Presentations:

11:50 AM – 12:00 PM @ Victoria (Oral)



06:15 – 09:00 PM @ Hall B #84 (Poster)



Authors: Mehrdad Farajtabar, Yinlam Chow, and Mohammad Ghavamzadeh



We study the problem of off-policy evaluation (OPE) in reinforcement learning (RL), where the goal is to estimate the performance of a policy from the data generated by another policy(ies). In particular, we focus on the doubly robust (DR) estimators that consist of an importance sampling (IS) component and a performance model, and utilize the low (or zero) bias of IS and low variance of the model at the same time. Although the accuracy of the model has a huge impact on the overall performance of DR, most of the work on using the DR estimators in OPE has been focused on improving the IS part, and not much on how to learn the model. In this paper, we propose alternative DR estimators, called more robust doubly robust (MRDR), that learn the model parameter by minimizing the variance of the DR estimator. We first present a formulation for learning the DR model in RL. We then derive formulas for the variance of the DR estimator in both contextual bandits and RL, such that their gradients w.r.t. the model parameters can be estimated from the samples, and propose methods to efficiently minimize the variance. We prove that the MRDR estimators are strongly consistent and asymptotically optimal. Finally, we evaluate MRDR in bandits and RL benchmark problems, and compare its performance with the existing methods.

Presentations:

11:50 AM – 12:00 PM @ A1 (Oral)



06:15 – 09:00 PM @ Hall B #62 (Poster)



Authors: Marta Garnelo, Dan Rosenbaum, Christopher Maddison, Tiago Ramalho, David Saxton, Murray Shanahan, Yee Whye Teh, Danilo Rezende, S. M. Ali Eslami

Deep neural networks excel at function approximation, yet they are typically trained from scratch for each new function. On the other hand, Bayesian methods, such as Gaussian Processes (GPs), exploit prior knowledge to quickly infer the shape of a new function at test time. Yet GPs are computationally expensive, and it can be hard to design appropriate priors. In this paper we propose a family of neural models, Conditional Neural Processes (CNPs), that combine the benefits of both. CNPs are inspired by the flexibility of stochastic processes such as GPs, but are structured as neural networks and trained via gradient descent. CNPs make accurate predictions after observing only a handful of training data points, yet scale to complex functions and large datasets. We demonstrate the performance and versatility of the approach on a range of canonical machine learning tasks, including regression, classification



and image completion.

Presentations:

02:10 – 02:20 PM @ Victoria (Oral)



06:15 – 09:00 PM @ Hall B #130 (Poster)

Authors: Marco Fraccaro, Danilo Jimenez Rezende, Yori Zwols, Alexander Pritzel, S. M. Ali Eslami, Fabio Viola

In model-based reinforcement learning, generative and temporal models of environments can be leveraged to boost agent performance, either by tuning the agent’s representations during training or via use as part of an explicit planning mechanism. However, their application in practice has been limited to simplistic environments, due to the difficulty of training such models in larger, potentially partially-observed and 3D environments. In this work we introduce a novel action-conditioned generative model of such challenging environments. The model features a non-parametric spatial memory system in which we store learned, disentangled representations of the environment. Low-dimensional spatial updates are computed using a state-space model that makes use of knowledge on the prior dynamics of the moving agent, and high-dimensional visual observations are modelled with a Variational Auto-Encoder. The result is a scalable architecture capable of performing coherent predictions over hundreds of time steps across a range of partially observed 2D and 3D environments.

Presentations:

02:30 – 02:50 PM @ A7 (Oral)



06:15 – 09:00 PM @ Hall B #101 (Poster)



Authors: Hyunjik Kim, Andriy Mnih

We define and address the problem of unsupervised learning of disentangled representations on data generated from independent factors of variation. We propose FactorVAE, a method that disentangles by encouraging the distribution of rep-resentations to be factorial and hence independent across the dimensions. We show that it improves upon β-VAE by providing a better trade-off between disentanglement and reconstruction quality and being more robust to the number of training iterations. Moreover, we highlight the problems of a commonly used disentanglement metric and introduce a new metric that does not suffer from them.

Presentations:

02:50 – 03:00 PM @ A7 (Oral)



06:15 – 09:00 PM @ Hall B #90 (Poster)



Authors: Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tobias Springenberg

We propose Scheduled Auxiliary Control (SAC), a new learning paradigm in the context of Reinforcement Learning (RL) . SAC enables learning of complex behaviors – from scratch – in the presence of multiple sparse reward signals. To achieve this the agent is equipped with a set of general auxiliary tasks, that it attempts to learn simultaneously via off-policy RL. The key idea behind our method is that active (learned) scheduling and execution of auxiliary policies allows the agent to efficiently explore its environment – enabling it

to excel at sparse reward RL. Our experiments in several challenging robotic manipulation settings demonstrate the power of our approach.

Read more on the DeepMind blog.

Presentations:

04:20 – 04:40 PM @ A1 (Oral)



06:15 – 09:00 PM @ Hall B #41 (Poster)



Authors: Jack W Rae, Chris Dyer, Peter Dayan, Timothy P Lillicrap

Neural networks trained with backpropagation often struggle to identify classes that have been observed a small number of times. In applications where most class labels are rare, such as language modelling, this can become a performance bottleneck. One potential remedy is to augment the network with a fast-learning non-parametric model which attends over recent activations. We explore a simplified architecture where we treat a subset of the model parameters as fast memory stores. This can help retain information over longer time intervals than a traditional memory, and does not require additional space or compute. In the case of image classification, we display faster binding of novel classes on an Omniglot image curriculum task. We also show improved performance for word-based language models on news reports (GigaWord), books (Project Gutenberg) and Wikipedia articles (WikiText-103) — the latter achieving state-of-the-art perplexity.

Presentations:

04:40 – 04:50 PM @ Victoria (Oral)



06:15 – 09:00 PM @ Hall B #121 (Poster)



Authors: Suman Ravuri, Shakir Mohamed, Mihaela Rosca, and Oriol Vinyals

We propose a method of moments (MoM) algorithm for training large-scale implicit generative models. Moment estimation in this setting encounters two problem: it is often difficult to define the millions of moments needed to learn the model parameters, and it is hard to determine which properties are useful when specifying moments. To address the first issue, we introduce a moment network, and define the moments as the gradient of the network’s output with respect to its parameters and the network’s hidden units. To tackle the second problem, we use asymptotic theory to highlight desiderata for moments – namely they should minimize the asymptotic variance of estimated model parameters – and introduce an objective to learn better moments. The sequence of objectives created by this Method of Learned Moments (MoLM) can train high-quality neural image samplers. On CIFAR-10, we demonstrate that MoLM-trained generators achieve significantly higher Inception Scores and lower Frechet Inception Distances than those trained with gradient penalty regularized adversarial objectives. These generators also achieve nearly perfect Multi-Scale Structural Similarity Scores on CelebA, and can create high-quality samples of resolutions up to 128×128.

Presentations:

04:40 – 04:50 PM @ A7 (Oral)



06:15 – 09:00 PM @ Hall B #112 (Poster)



Authors: David Held, Xinyang Geng, Carlos Florensa, Pieter Abbeel

Reinforcement learning is a powerful technique to train an agent to perform a task. However, an agent that is trained using reinforcement learning is only capable of achieving the single task that is specified via its reward function. Such an approach does not scale well to settings in which an agent needs to perform a diverse set of tasks, such as navigating to varying positions in a room or moving objects to varying locations. Instead, we propose a method that allows an agent to automatically discover the range of tasks that it is capable of performing. We use a generator network to propose tasks for the agent to try to achieve, specified as goal states. The generator network is optimized using adversarial training to produce tasks that are always at the appropriate level of difficulty for the agent. Our method thus automatically produces a curriculum of tasks for the agent to learn. We show that, by using this framework, an agent can efficiently and automatically learn to perform a wide set of tasks without requiring any prior knowledge of its environment. Our method can also learn to achieve tasks with sparse rewards, which traditionally pose significant challenges.

Presentations:

04:40 – 04:50 PM @ A1 (Oral)



06:15 – 09:00 PM @ Hall B #135 (Poster)



Authors: Neil C. Rabinowitz, Frank Perbet, H. Francis Song, Chiyuan Zhang, S. M. Ali Eslami, Matthew Botvinick

Theory of mind (ToM) broadly refers to humans’ ability to represent the mental states of others, including their desires, beliefs, and intentions. We propose to train a machine to build such models too. We design a Theory of Mind neural network – a ToMnet – which uses meta-learning to build models of the agents it encounters. The ToMnet learns a strong prior model for agents’ future behaviour, and, using only a small number of behavioural observations, can bootstrap to richer predictions about agents’ characteristics and mental states. We apply the ToMnet to agents behaving in simple gridworld environments, showing that it learns to model random, algorithmic, and deep RL agents from varied populations, and that it passes classic ToM tasks such as the “Sally-Anne” test of recognising that others can hold false beliefs about the world.

Presentations:

05:00 – 05:20 PM @ A3 (Oral)



06:15 – 09:00 PM @ Hall B #208 (Poster)



Authors: Ofir Nachum, Yinlam Chow, and Mohammad Ghavamzadeh

We study the sparse entropy-regularized reinforcement learning (ERL) problem in which the entropy term is a special form of the Tsallis entropy. The optimal policy of this formulation is sparse, i.e., at each state, it has non-zero probability for only a small number of actions. This addresses the main drawback of the standard Shannon entropy-regularized RL (soft ERL) formulation, in which the optimal policy is {\em softmax}, and thus, may assign a non-negligible probability mass to non-optimal actions. This problem is aggravated as the number of actions is increased. In this paper, we follow the work of Nachum et al. (2017) in the soft ERL setting, and propose a class of novel path consistency learning (PCL) algorithms, called sparse PCL, for the sparse ERL problem that can work with both on-policy and off-policy data. We first derive a sparse consistency equation that specifies a relationship between the optimal value function and policy of the sparse ERL along any system trajectory. Crucially, a weak form of the converse is also true, and we quantify the sub-optimality of a policy which satisfies sparse consistency, and show that as we increase the number of actions, this sub-optimality is better than that of the soft ERL optimal policy. We then use this result to derive the sparse PCL algorithms. We empirically compare sparse PCL with its soft counterpart, and show its advantage, especially in problems with a large number of actions.

Presentations:

05:20 – 05:40 PM @ A1 (Oral)



06:15 – 09:00 PM @ Hall B #172 (Poster)



Authors: Andre Barreto, Diana Borsa, John Quan, Tom Schaul, David Silver, Matteo Hessel, Daniel Makowitz, Augustin Zidek, Remi Munos

The ability to transfer skills across tasks has the potential to scale up reinforcement learning (RL) agents to environments currently out of reach. Recently, a framework based on two ideas, successor features (SFs) and generalised policy improvement (GPI), has been introduced as a principled way of transferring skills. In this paper we investigate the feasibility of combining SF&GPI with the representation power of deep learning. Since in deep RL we are interested in learning all the components of SF&GPI concurrently, the existing inter-dependencies between them can lead to instabilities. In this work we propose a solution for this problem that makes it possible to use SF & GPI online, at scale. In order to empirically verify this claim, we apply the proposed method to a complex 3D environment that requires hundreds of millions of transitions to be solved. We show that the transfer promoted by SF&GPI leads to reasonable policies on unseen tasks almost instantaneously. We also show how to build on the transferred policies to learn policies that are specialised to the new tasks, which can then be added to the agent’s set of skills to be used in the future.

Presentations:

05:20 – 05:40 PM @ A3 (Oral)



06:15 – 09:00 PM @ Hall B #163 (Poster)



Authors: Samuel Ritter, Jane Wang, Sid Jayakumar, Zeb Kurth-Nelson, Charles Blundell, Razvan Pascanu, Matt Botvinick

Meta-learning agents have demonstrated the ability to rapidly explore and exploit new tasks sampled from the task distribution on which they were trained. However, when these agents encounter situations that they explored in the distant past, they are not able to remember the results of their past exploration. Thus, instead of immediately exploiting previously discovered solutions, they must again explore from scratch. In this work, we argue that the necessity to remember the results of past exploration is ubiquitous in naturalistic environments. We propose a formalism for modeling this kind of recurring environment structure, then develop a meta-learning architecture for solving such environments. This architecture melds the standard LSTM working memory with a differentiable neural episodic memory. We explore the capabilities of this episodic LSTM in four recurrent-state stochastic process environments: 1.) episodic contextual bandits, 2.) compositional contextual bandits, 3.) episodic two-step task, and 4.) contextual water-maze navigation.

Presentations:

05:40 – 05:50 PM @ A3 (Oral)



06:15 – 09:00 PM @ Hall B #209 (Poster)



Authors: Jonathan Uesato, Brendan O'Donoghue, Aaron van den Oord, and Pushmeet Kohli.

This paper investigates recently proposed approaches for defending against adversarial examples and evaluating adversarial robustness. The existence of adversarial examples in trained neural networks reflects the fact that expected risk alone does not capture the model’s performance against worst-case inputs. We motivate the use of advKarol Hausmaersarial risk as an objective, although it can not easily be computed exactly. We then frame commonly used attacks and evaluation metrics as defining a tractable surrogate objective to the true adversarial risk. This suggests that models may be obscured to adversaries, by optimizing this surrogate rather than the true adversarial risk. We demonstrate that this is a significant problem in practice by repurposing gradient-free optimization techniques into adversarial attacks, which we use to decrease the accuracy of several recently proposed defenses to near zero. Our hope is that

our formulations and results will help researchers to develop more powerful defenses.

Presentations: