Toggle Abstracts

ICML 2017 Videos

Tutorial {daterange} @ Cockle Bay Distributed Deep Learning with MxNet Gluon In Tutorials Session A URL » Video »

We present MxNet Gluon, an easy to use tool for designing a wide range of networks from image processing (LeNet, inception, etc.) to advanced NLP (TreeLSTM). It combines the convenience of imperative frameworks (PyTorch, Torch, Chainer) with efficient symbolic execution (TensorFlow, CNTK). The tutorial covers the following issues: basic distributed linear algebra with NDArray, automatic differentiation of code, and designing networks from scratch (and using Gluon). Subsequently we cover convenience and efficiency features such as automagic shape inference, deferred initialization and lazy evaluation, and hybridization of compute graphs. We then discuss structured architectures such as TreeLSTMs, which are key for natural language processing. We conclude by showing how to perform parallel and distributed training on multiple GPUs and multiple machines. For Jupyter notebooks and details see http://gluon.mxnet.io and https://github.com/zackchase/mxnet-the-straight-dope

Tutorial {daterange} @ Parkside 1 Interpretable Machine Learning In Tutorials Session B URL » Video »

As machine learning systems become ubiquitous, there has been a surge of interest in interpretable machine learning: systems that provide explanation for their outputs. These explanations are often used to qualitatively assess other criteria such as safety or non-discrimination. However, despite the interest in interpretability, there is little consensus on what interpretable machine learning is and how it should be measured. In this talk, we first suggest a definitions of interpretability and describe when interpretability is needed (and when it is not). Then we will review related work, all the way back from classical AI systems to recent efforts for interpretability in deep learning. Finally, we will talk about a taxonomy for rigorous evaluation, and recommendations for researchers. We will end with discussing open questions and concrete problems for new researchers.

Tutorial {daterange} @ Parkside 2 Machine Learning for Autonomous Vehicles In Tutorials Session C Video »

The tutorial will cover core machine learning topics for self-driving cars. The objectives are (1) to call to arms of researchers and practitioners to tackle the pressing challenges of autonomous driving; (2) equip participants with enough background to attend the companion workshop on ML for autonomous vehicles. Machine learning holds the key to solve autonomous driving. Despite recent advances, major problems are far from solved both in terms of fundamental research and engineering challenges.

Tutorial {daterange} @ Cockle Bay Recent Advances in Stochastic Convex and Non-Convex Optimization In Tutorials Session A URL » Video »

In this tutorial, we will provide an accessible and extensive overview on recent advances to optimization methods based on stochastic gradient descent (SGD), for both convex and non-convex tasks. In particular, this tutorial shall try to answer the following questions with theoretical support. How can we properly use momentum to speed up SGD? What is the maximum parallel speedup can we achieve for SGD? When should we use dual or primal-dual approach to replace SGD? What is the difference between coordinate descent (e.g. SDCA) and SGD? How is variance reduction affecting the performance of SGD? Why does the second-order information help us improve the convergence of SGD?

Tutorial {daterange} @ Parkside 1 Deep Reinforcement Learning, Decision Making, and Control In Tutorials Session B URL » Video »

Deep learning methods, which combine high-capacity neural network models with simple and scalable training algorithms, have made a tremendous impact across a range of supervised learning domains, including computer vision, speech recognition, and natural language processing. This success has been enabled by the ability of deep networks to capture complex, high-dimensional functions and learn flexible distributed representations. Can this capability be brought to bear on real-world decision making and control problems, where the machine must not only classify complex sensory patterns, but choose actions and reason about their long-term consequences? Decision making and control problems lack the close supervision present in more classic deep learning applications, and present a number of challenges that necessitate new algorithmic developments. In this tutorial, we will cover the foundational theory of reinforcement and optimal control as it relates to deep reinforcement learning, discuss a number of recent results on extending deep learning into decision making and control, including model-based algorithms, imitation learning, and inverse reinforcement learning, and explore the frontiers and limitations of current deep reinforcement learning algorithms

Tutorial {daterange} @ Parkside 2 Deep Learning for Health Care Applications: Challenges and Solutions In Tutorials Session C URL » Video »

It is widely believed that deep learning and artificial intelligence techniques will fundamentally change health care industries. Even though recent development in deep learning has achieved successes in many applications, such as computer vision, natural language processing, speech recognition and so on, health care applications pose many significantly different challenges to existing deep learning models. Examples include but not are limited to interpretations for prediction, heterogeneity in data, missing value, multi-rate multiresolution data, big and small data, and privacy issues. In this tutorial, we will discuss a series of problems in health care that can benefit from deep learning models, the challenges as well as recent advances in addressing those. We will also include data sets and demos of working systems.

Tutorial {daterange} @ Cockle Bay Real World Interactive Learning In Tutorials Session A URL » Video »

This is a tutorial about real-world use of interactive and online learning. We focus on systems for practical applications ranging from recommendation tasks and ad-display, to clinical trials and adaptive decision making in computer systems. There is quite a bit of foundational theory and algorithms from the field of machine learning yet practical use is fraught with several challenges. Success in interactive learning requires a complete learning system which handles exploration, data-flow, logging and real-time updating supporting the core algorithm. Each potential application also comes with multiple design choices and often do not fit the setting in theory as-is. We cover both foundational principles which have proved practically essential as well as recipes for success from practical experience. After the tutorial, participants should have both a firm understanding of the foundations and the practical ability to deploy and start using such a system in an hour.

Tutorial {daterange} @ Parkside 1 Sequence-To-Sequence Modeling with Neural Networks In Tutorials Session B Video »

Sequence-To-Sequence (Seq2Seq) learning was introduced in 2014, and has since been extensively studied and extended to a large variety of domains. Seq2Seq yields state-of-the-art performance on several applications such as machine translation, image captioning, speech generation, or summarization. In this tutorial, we will survey the basics of this framework, its applications, main algorithmic techniques and future research directions.

Tutorial {daterange} @ Parkside 2 Robustness Meets Algorithms (and Vice-Versa) In Tutorials Session C URL » Video »

In every corner of machine learning and statistics, there is a need for estimators that work not just in an idealized model but even when their assumptions are violated. It turns out that being provably robust and being efficiently computable are often at odds with each other. In even the most basic settings such as robustly computing the mean and covariance, until recently the only known estimators were either hard to compute or could only tolerate a negligible fraction of errors in high-dimensional applications. In this tutorial, we will survey the exciting recent progress in algorithmic robust statistics. We will give the first provably robust and efficiently computable estimators for several fundamental questions that were thought to be hard, and explain the main insights behind them. We will give practical applications to exploratory data analysis. Finally, we raise some philosophical questions about robustness. It is standard to compare algorithms (especially those with provable guarantees) in terms of their running time and sample complexity. But what frameworks can be used to explore their robustness?

Invited Talk {daterange} @ Darling Harbour Theatre Causal Learning In Invited Talk - Bernhard Schölkopf Video »

In machine learning, we use data to automatically find dependences in the world, with the goal of predicting future observations. Most machine learning methods build on statistics, but one can also try to go beyond this, assaying causal structures underlying statistical dependences. Can such causal knowledge help prediction in machine learning tasks? We argue that this is indeed the case, due to the fact that causal models are more robust to changes that occur in real world datasets. We discuss implications of causality for machine learning tasks, and argue that many of the hard issues benefit from the causal viewpoint. This includes domain adaptation, semi-supervised learning, transfer, life-long learning, and fairness, as well as an application to the removal of systematic errors in astronomical problems.

Talk {daterange} @ Parkside 1 PixelCNN Models with Auxiliary Variables for Natural Image Modeling In Deep generative models 1 PDF » Summary/Notes » Video »

We study probabilistic models of natural images and extend the autoregressive family of PixelCNN models by incorporating auxiliary variables. Subsequently, we describe two new generative image models that exploit different image transformations as auxiliary variables: a quantized grayscale view of the image or a multi-resolution image pyramid. The proposed models tackle two known shortcomings of existing PixelCNN models: 1) their tendency to focus on low-level image details, while largely ignoring high-level image information, such as object shapes, and 2) their computationally costly procedure for image sampling. We experimentally demonstrate benefits of our models, in particular showing that they produce much more realistically looking image samples than previous state-of-the-art probabilistic models.

Talk {daterange} @ Parkside 2 Tight Bounds for Approximate Carathéodory and Beyond In Continuous optimization 1 PDF » Summary/Notes » Video »

We present a deterministic nearly-linear time algorithm for approximating any point inside a convex polytope with a sparse convex combination of the polytope's vertices. Our result provides a constructive proof for the Approximate Carathéodory Problem, which states that any point inside a polytope contained in the $\ell_p$ ball of radius $D$ can be approximated to within $\epsilon$ in $\ell_p$ norm by a convex combination of $O\left(D^2 p/\epsilon^2\right)$ vertices of the polytope for $p \geq 2$. While for the particular case of $p=2$, this can be achieved by the well-known Perceptron algorithm, we follow a more principled approach which generalizes to arbitrary $p\geq 2$; furthermore, this naturally extends to domains with more complicated geometry, as it is the case for providing an approximate Birkhoff-von Neumann decomposition. Secondly, we show that the sparsity bound is tight for $\ell_p$ norms, using an argument based on anti-concentration for the binomial distribution, thus resolving an open question posed by Barman. Experimentally, we verify that our deterministic optimization-based algorithms achieve in practice much better sparsity than previously known sampling-based algorithms. We also show how to apply our techniques to SVM training and rounding fractional points in matroid and flow polytopes.

Talk {daterange} @ C4.5 Robust Adversarial Reinforcement Learning In Reinforcement learning 1 PDF » Summary/Notes » Video »

Deep neural networks coupled with fast simulation and improved computational speeds have led to recent successes in the field of reinforcement learning (RL). However, most current RL-based approaches fail to generalize since: (a) the gap between simulation and real world is so large that policy-learning approaches fail to transfer; (b) even if policy learning is done in real world, the data scarcity leads to failed generalization from training to test scenarios (e.g., due to different friction or object masses). Inspired from H-infinity control methods, we note that both modeling errors and differences in training and test scenarios can just be viewed as extra forces/disturbances in the system. This paper proposes the idea of robust adversarial reinforcement learning (RARL), where we train an agent to operate in the presence of a destabilizing adversary that applies disturbance forces to the system. The jointly trained adversary is reinforced -- that is, it learns an optimal destabilization policy. We formulate the policy learning as a zero-sum, minimax objective function. Extensive experiments in multiple environments (InvertedPendulum, HalfCheetah, Swimmer, Hopper, Walker2d and Ant) conclusively demonstrate that our method (a) improves training stability; (b) is robust to differences in training/test conditions; and c) outperform the baseline even in the absence of the adversary.

Talk {daterange} @ C4.9& C4.10 Robust Probabilistic Modeling with Bayesian Data Reweighting In Probabilistic learning 1 PDF » Summary/Notes » Video »

Probabilistic models analyze data by relying on a set of assumptions. Data that exhibit deviations from these assumptions can undermine inference and prediction quality. Robust models offer protection against mismatch between a model's assumptions and reality. We propose a way to systematically detect and mitigate mismatch of a large class of probabilistic models. The idea is to raise the likelihood of each observation to a weight and then to infer both the latent variables and the weights from data. Inferring the weights allows a model to identify observations that match its assumptions and down-weight others. This enables robust inference and improves predictive accuracy. We study four different forms of mismatch with reality, ranging from missing latent groups to structure misspecification. A Poisson factorization analysis of the Movielens 1M dataset shows the benefits of this approach in a practical scenario.

Talk {daterange} @ C4.1 Multi-objective Bandits: Optimizing the Generalized Gini Index In Online learning 1 PDF » Summary/Notes » Video »

We study the multi-armed bandit (MAB) problem where the agent receives a vectorial feedback that encodes many possibly competing objectives to be optimized. The goal of the agent is to find a policy, which can optimize these objectives simultaneously in a fair way. This multi-objective online optimization problem is formalized by using the Generalized Gini Index (GGI) aggregation function. We propose an online gradient descent algorithm which exploits the convexity of the GGI aggregation function, and controls the exploration in a careful way achieving a distribution-free regret $\tilde{\bigO} (T^{-1/2} )$ with high probability. We test our algorithm on synthetic data as well as on an electric battery control problem where the goal is to trade off the use of the different cells of a battery in order to balance their respective degradation rates.

Talk {daterange} @ C4.4 Communication-efficient Algorithms for Distributed Stochastic Principal Component Analysis In Latent feature models PDF » Summary/Notes » Video »

We study the fundamental problem of Principal Component Analysis in a statistical distributed setting in which each machine out of m stores a sample of n points sampled i.i.d. from a single unknown distribution. We study algorithms for estimating the leading principal component of the population covariance matrix that are both communication-efficient and achieve estimation error of the order of the centralized ERM solution that uses all mn samples. On the negative side, we show that in contrast to results obtained for distributed estimation under convexity assumptions, for the PCA objective, simply averaging the local ERM solutions cannot guarantee error that is consistent with the centralized ERM. We show that this unfortunate phenomena can be remedied by performing a simple correction step which correlates between the individual solutions, and provides an estimator that is consistent with the centralized ERM for sufficiently-large n. We also introduce an iterative distributed algorithm that is applicable in any regime of n, which is based on distributed matrix-vector products. The algorithm gives significant acceleration in terms of communication rounds over previous distributed algorithms, in a wide regime of parameters.

Talk {daterange} @ C4.8 The loss surface of deep and wide neural networks In Deep learning theory 1 PDF » Summary/Notes » Video »

While the optimization problem behind deep neural networks is highly non-convex, it is frequently observed in practice that training deep networks seems possible without getting stuck in suboptimal points. It has been argued that this is the case as all local minima are close to being globally optimal. We show that this is (almost) true, in fact almost all local minima are globally optimal, for a fully connected network with squared loss and analytic activation function given that the number of hidden units of one layer of the network is larger than the number of training points and the network structure from this layer on is pyramidal.

Talk {daterange} @ C4.6 & C4.7 Enumerating Distinct Decision Trees In Supervised learning 1 PDF » Summary/Notes » Video »

The search space for the feature selection problem in decision tree learning is the lattice of subsets of the available features. We provide an exact enumeration procedure of the subsets that lead to all and only the distinct decision trees. The procedure can be adopted to prune the search space of complete and heuristics search methods in wrapper models for feature selection. Based on this, we design a computational optimization of the sequential backward elimination heuristics with a performance improvement of up to 100X.

Talk {daterange} @ Darling Harbour Theatre Understanding Synthetic Gradients and Decoupled Neural Interfaces In Deep learning 1: backprop PDF » Summary/Notes » Video »

When training neural networks, the use of Synthetic Gradients (SG) allows layers or modules to be trained without update locking - without waiting for a true error gradient to be backpropagated -resulting in Decoupled Neural Interfaces (DNIs). This unlocked ability of being able to update parts of a neural network asynchronously and with only local information was demonstrated to work empirically in Jaderberg et al (2016). However, there has been very little demonstration of what changes DNIs and SGs impose from a functional, representational, and learning dynamics point of view. In this paper, we study DNIs through the use of synthetic gradients on feed-forward networks to better understand their behaviour and elucidate their effect on optimisation. We show that the incorporation of SGs does not affect the representational strength of the learning system for a neural network, and prove the convergence of the learning system for linear and deep linear models. On practical problems we investigate the mechanism by which synthetic gradient estimators approximate the true loss, and, surprisingly, how that leads to drastically different layer-wise representations. Finally, we also expose the relationship of using synthetic gradients to other error approximation techniques and find a unifying language for discussion and comparison.

Talk {daterange} @ Parkside 1 Parallel Multiscale Autoregressive Density Estimation In Deep generative models 1 PDF » Summary/Notes » Video »

PixelCNN achieves state-of-the-art results in density estimation for natural images. Although training is fast, inference is costly, requiring one network evaluation per pixel; O(N) for N pixels. This can be sped up by caching activations, but still involves generating each pixel sequentially. In this work, we propose a parallelized PixelCNN that allows more efficient inference by modeling certain pixel groups as conditionally independent. Our new PixelCNN model achieves competitive density estimation and orders of magnitude speedup - O(log N) sampling instead of O(N) - enabling the practical generation of 512x512 images. We evaluate the model on class-conditional image generation, text-to-image synthesis, and action-conditional video generation, showing that our model achieves the best results among non-pixel-autoregressive density models that allow efficient sampling.

Talk {daterange} @ Parkside 2 Oracle Complexity of Second-Order Methods for Finite-Sum Problems In Continuous optimization 1 PDF » Summary/Notes » Video »

Finite-sum optimization problems are ubiquitous in machine learning, and are commonly solved using first-order methods which rely on gradient computations. Recently, there has been growing interest in \emph{second-order} methods, which rely on both gradients and Hessians. In principle, second-order methods can require much fewer iterations than first-order methods, and hold the promise for more efficient algorithms. Although computing and manipulating Hessians is prohibitive for high-dimensional problems in general, the Hessians of individual functions in finite-sum problems can often be efficiently computed, e.g. because they possess a low-rank structure. Can second-order information indeed be used to solve such problems more efficiently? In this paper, we provide evidence that the answer -- perhaps surprisingly -- is negative, at least in terms of worst-case guarantees. However, we also discuss what additional assumptions and algorithmic approaches might potentially circumvent this negative result.

Talk {daterange} @ C4.5 Minimax Regret Bounds for Reinforcement Learning In Reinforcement learning 1 PDF » Summary/Notes » Video »

We consider the problem of provably optimal exploration in reinforcement learning for finite horizon MDPs. We show that an optimistic modification to value iteration achieves a regret bound of $\tilde {O}( \sqrt{HSAT} + H^2S^2A+H\sqrt{T})$ where $H$ is the time horizon, $S$ the number of states, $A$ the number of actions and $T$ the number of time-steps. This result improves over the best previous known bound $\tilde {O}(HS \sqrt{AT})$ achieved by the UCRL2 algorithm. The key significance of our new results is that when $T\geq H^3S^3A$ and $SA\geq H$, it leads to a regret of $\tilde{O}(\sqrt{HSAT})$ that matches the established lower bound of $\Omega(\sqrt{HSAT})$ up to a logarithmic factor. Our analysis contain two key insights. We use careful application of concentration inequalities to the optimal value function as a whole, rather than to the transitions probabilities (to improve scaling in $S$), and we use "exploration bonuses" built from Bernstein's inequality, together with using a recursive -Bellman-type- Law of Total Variance (to improve scaling in $H$).

Talk {daterange} @ C4.9& C4.10 Post-Inference Prior Swapping In Probabilistic learning 1 PDF » Summary/Notes » Video »

While Bayesian methods are praised for their ability to incorporate useful prior knowledge, in practice, convenient priors that allow for computationally cheap or tractable inference are commonly used. In this paper, we investigate the following question: for a given model, is it possible to compute an inference result with any convenient false prior, and afterwards, given any target prior of interest, quickly transform this result into the target posterior? A potential solution is to use importance sampling (IS). However, we demonstrate that IS will fail for many choices of the target prior, depending on its parametric form and similarity to the false prior. Instead, we propose prior swapping, a method that leverages the pre-inferred false posterior to efficiently generate accurate posterior samples under arbitrary target priors. Prior swapping lets us apply less-costly inference algorithms to certain models, and incorporate new or updated prior information “post-inference”. We give theoretical guarantees about our method, and demonstrate it empirically on a number of models and priors.

Talk {daterange} @ C4.1 Online Learning with Local Permutations and Delayed Feedback In Online learning 1 PDF » Summary/Notes » Video »

We propose an Online Learning with Local Permutations (OLLP) setting, in which the learner is allowed to slightly permute the \emph{order} of the loss functions generated by an adversary. On one hand, this models natural situations where the exact order of the learner's responses is not crucial, and on the other hand, might allow better learning and regret performance, by mitigating highly adversarial loss sequences. Also, with random permutations, this can be seen as a setting interpolating between adversarial and stochastic losses. In this paper, we consider the applicability of this setting to convex online learning with delayed feedback, in which the feedback on the prediction made in round $t$ arrives with some delay $\tau$. With such delayed feedback, the best possible regret bound is well-known to be $O(\sqrt{\tau T})$. We prove that by being able to permute losses by a distance of at most $M$ (for $M\geq \tau$), the regret can be improved to $O(\sqrt{T}(1+\sqrt{\tau^2/M}))$, using a Mirror-Descent based algorithm which can be applied for both Euclidean and non-Euclidean geometries. We also prove a lower bound, showing that for $M<\tau/3$, it is impossible to improve the standard $O(\sqrt{\tau T})$ regret bound by more than constant factors. Finally, we provide some experiments validating the performance of our algorithm.

Talk {daterange} @ C4.4 SPLICE: Fully Tractable Hierarchical Extension of ICA with Pooling In Latent feature models PDF » Summary/Notes » Video »

We present a novel probabilistic framework for a hierarchical extension of independent component analysis (ICA), with a particular motivation in neuroscientific data analysis and modeling. The framework incorporates a general subspace pooling with linear ICA-like layers stacked recursively. Unlike related previous models, our generative model is fully tractable: both the likelihood and the posterior estimates of latent variables can readily be computed with analytically simple formulae. The model is particularly simple in the case of complex-valued data since the pooling can be reduced to taking the modulus of complex numbers. Experiments on electroencephalography (EEG) and natural images demonstrate the validity of the method.

Talk {daterange} @ C4.8 Neural Taylor Approximations: Convergence and Exploration in Rectifier Networks In Deep learning theory 1 PDF » Summary/Notes » Video »

Modern convolutional networks, incorporating rectifiers and max-pooling, are neither smooth nor convex; standard guarantees therefore do not apply. Nevertheless, methods from convex optimization such as gradient descent and Adam are widely used as building blocks for deep learning algorithms. This paper provides the first convergence guarantee applicable to modern convnets, which furthermore matches a lower bound for convex nonsmooth functions. The key technical tool is the neural Taylor approximation -- a straightforward application of Taylor expansions to neural networks -- and the associated Taylor loss. Experiments on a range of optimizers, layers, and tasks provide evidence that the analysis accurately captures the dynamics of neural optimization. The second half of the paper applies the Taylor approximation to isolate the main difficulty in training rectifier nets -- that gradients are shattered -- and investigates the hypothesis that, by exploring the space of activation configurations more thoroughly, adaptive optimizers such as RMSProp and Adam are able to converge to better solutions.

Talk {daterange} @ C4.6 & C4.7 Simultaneous Learning of Trees and Representations for Extreme Classification and Density Estimation In Supervised learning 1 PDF » Summary/Notes » Video »

We consider multi-class classification where the predictor has a hierarchical structure that allows for a very large number of labels both at train and test time. The predictive power of such models can heavily depend on the structure of the tree, and although past work showed how to learn the tree structure, it expected that the feature vectors remained static. We provide a novel algorithm to simultaneously perform representation learning for the input data and learning of the hierarchical predictor. Our approach optimizes an objective function which favors balanced and easily-separable multi-way node partitions. We theoretically analyze this objective, showing that it gives rise to a boosting style property and a bound on classification error. We next show how to extend the algorithm to conditional density estimation. We empirically validate both variants of the algorithm on text classification and language modeling, respectively, and show that they compare favorably to common baselines in terms of accuracy and running time.

Talk {daterange} @ Darling Harbour Theatre meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting In Deep learning 1: backprop PDF » Summary/Notes » Video »

We propose a simple yet effective technique for neural network learning. The forward propagation is computed as usual. In back propagation, only a small subset of the full gradient is computed to update the model parameters. The gradient vectors are sparsified in such a way that only the top-$k$ elements (in terms of magnitude) are kept. As a result, only $k$ rows or columns (depending on the layout) of the weight matrix are modified, leading to a linear reduction ($k$ divided by the vector dimension) in the computational cost. Surprisingly, experimental results demonstrate that we can update only 1--4\% of the weights at each back propagation pass. This does not result in a larger number of training iterations. More interestingly, the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given.

Talk {daterange} @ Parkside 2 Global optimization of Lipschitz functions In Continuous optimization 1 PDF » Summary/Notes » Video »

The goal of the paper is to design sequential strategies which lead to efficient optimization of an unknown function under the only assumption that it has a finite Lipschitz constant. We first identify sufficient conditions for the consistency of generic sequential algorithms and formulate the expected minimax rate for their performance. We introduce and analyze a first algorithm called LIPO which assumes the Lipschitz constant to be known. Consistency, minimax rates for LIPO are proved, as well as fast rates under an additional H\"older like condition. An adaptive version of LIPO is also introduced for the more realistic setup where Lipschitz constant is unknown and has to be estimated along with the optimization. Similar theoretical guarantees are shown to hold for the adaptive LIPO algorithm and a numerical assessment is provided at the end of the paper to illustrate the potential of this strategy with respect to state-of-the-art methods over typical benchmark problems for global optimization.

Talk {daterange} @ C4.5 Fairness in Reinforcement Learning In Reinforcement learning 1 PDF » Summary/Notes » Video »

We initiate the study of fairness in reinforcement learning, where the actions of a learning algorithm may affect its environment and future rewards. Our fairness constraint requires that an algorithm never prefers one action over another if the long-term (discounted) reward of choosing the latter action is higher. Our first result is negative: despite the fact that fairness is consistent with the optimal policy, any learning algorithm satisfying fairness must take time exponential in the number of states to achieve non-trivial approximation to the optimal policy. We then provide a provably fair polynomial time algorithm under an approximate notion of fairness, thus establishing an exponential gap between exact and approximate fairness.

Talk {daterange} @ C4.9& C4.10 Evaluating Bayesian Models with Posterior Dispersion Indices In Probabilistic learning 1 PDF » Summary/Notes » Video »

Probabilistic modeling is cyclical: we specify a model, infer its posterior, and evaluate its performance. Evaluation drives the cycle, as we revise our model based on how it performs. This requires a metric. Traditionally, predictive accuracy prevails. Yet, predictive accuracy does not tell the whole story. We propose to evaluate a model through posterior dispersion. The idea is to analyze how each datapoint fares in relation to posterior uncertainty around the hidden structure. This highlights datapoints the model struggles to explain and provides complimentary insight to datapoints with low predictive accuracy. We present a family of posterior dispersion indices (PDI) that capture this idea. We show how a PDI identifies patterns of model mismatch in three real data examples: voting preferences, supermarket shopping, and population genetics.

Talk {daterange} @ C4.1 Model-Independent Online Learning for Influence Maximization In Online learning 1 PDF » Summary/Notes » Video »

We consider \emph{influence maximization} (IM) in social networks, which is the problem of maximizing the number of users that become aware of a product by selecting a set of ``seed'' users to expose the product to. While prior work assumes a known model of information diffusion, we propose a novel parametrization that not only makes our framework agnostic to the underlying diffusion model, but also statistically efficient to learn from data. We give a corresponding monotone, submodular surrogate function, and show that it is a good approximation to the original IM objective. We also consider the case of a new marketer looking to exploit an existing social network, while simultaneously learning the factors governing information propagation. For this, we propose a pairwise-influence semi-bandit feedback model and develop a LinUCB-based bandit algorithm. Our model-independent analysis shows that our regret bound has a better (as compared to previous work) dependence on the size of the network. Experimental evaluation suggests that our framework is robust to the underlying diffusion model and can efficiently learn a near-optimal solution.

Talk {daterange} @ C4.4 Latent Feature Lasso In Latent feature models PDF » Summary/Notes » Video »

The latent feature model (LFM), proposed in \cite{griffiths2005infinite}, but possibly with earlier origins, is a generalization of a mixture model, where each instance is generated not from a single latent class but from a combination of \emph{latent features}. Thus, each instance has an associated latent binary feature incidence vector indicating the presence or absence of a feature. Due to its combinatorial nature, inference of LFMs is considerably intractable, and accordingly, most of the attention has focused on nonparametric LFMs, with priors such as the Indian Buffet Process (IBP) on infinite binary matrices. Recent efforts to tackle this complexity either still have computational complexity that is exponential, or sample complexity that is high-order polynomial w.r.t. the number of latent features. In this paper, we address this outstanding problem of tractable estimation of LFMs via a novel atomic-norm regularization, which gives an algorithm with polynomial run-time and sample complexity without impractical assumptions on the data distribution.

Talk {daterange} @ C4.8 Sharp Minima Can Generalize For Deep Nets In Deep learning theory 1 PDF » Summary/Notes » Video »

Despite their overwhelming capacity to overfit, deep learning architectures tend to generalize relatively well to unseen data, allowing them to be deployed in practice. However, explaining why this is the case is still an open area of research. One standing hypothesis that is gaining popularity, e.g. \citet{hochreiter1997flat, keskar2016large}, is that the flatness of minima of the loss function found by stochastic gradient based methods results in good generalization. This paper argues that most notions of flatness are problematic for deep models and can not be directly applied to explain generalization. Specifically, when focusing on deep networks with rectifier units, we can exploit the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit to build equivalent models corresponding to arbitrarily sharper minima. Or, depending on the definition of flatness, it is the same for any given minimum. Furthermore, if we allow to reparametrize a function, the geometry of its parameters can change drastically without affecting its generalization properties.

Talk {daterange} @ C4.6 & C4.7 Resource-efficient Machine Learning in 2 KB RAM for the Internet of Things In Supervised learning 1 PDF » Summary/Notes » Video »

This paper develops a novel tree-based algorithm, called Bonsai, for efficient prediction on IoT devices – such as those based on the Arduino Uno board having an 8 bit ATmega328P microcontroller operating at 16 MHz with no native floating point support, 2 KB RAM and 32 KB read-only flash. Bonsai maintains prediction accuracy while minimizing model size and prediction costs by: (a) developing a tree model which learns a single, shallow, sparse tree with powerful nodes; (b) sparsely projecting all data into a low-dimensional space in which the tree is learnt; and (c) jointly learning all tree and projection parameters. Experimental results on multiple benchmark datasets demonstrate that Bonsai can make predictions in milliseconds even on slow microcontrollers, can fit in KB of memory, has lower battery consumption than all other algorithms while achieving prediction accuracies that can be as much as 30% higher than state-of-the-art methods for resource-efficient machine learning. Bonsai is also shown to generalize to other resource constrained settings beyond IoT by generating significantly better search results as compared to Bing’s L3 ranker when the model size is restricted to 300 bytes. Bonsai’s code can be downloaded from (http://www.manikvarma.org/code/Bonsai/download.html).

Talk {daterange} @ Darling Harbour Theatre Learning Important Features Through Propagating Activation Differences In Deep learning 1: backprop PDF » Summary/Notes » Video »

The purported "black box" nature of neural networks is a barrier to adoption in applications where interpretability is essential. Here we present DeepLIFT (Deep Learning Important FeaTures), a method for decomposing the output prediction of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input. DeepLIFT compares the activation of each neuron to its `reference activation' and assigns contribution scores according to the difference. By optionally giving separate consideration to positive and negative contributions, DeepLIFT can also reveal dependencies which are missed by other approaches. Scores can be computed efficiently in a single backward pass. We apply DeepLIFT to models trained on MNIST and simulated genomic data, and show significant advantages over gradient-based methods. Video tutorial: \url{http://goo.gl/qKb7pL}, code: \url{http://goo.gl/RM8jvH}.

Talk {daterange} @ Parkside 1 Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks In Deep generative models 1 PDF » Summary/Notes » Video »

Variational Autoencoders (VAEs) are expressive latent variable models that can be used to learn complex probability distributions from training data. However, the quality of the resulting model crucially relies on the expressiveness of the inference model. We introduce Adversarial Variational Bayes (AVB), a technique for training Variational Autoencoders with arbitrarily expressive inference models. We achieve this by introducing an auxiliary discriminative network that allows to rephrase the maximum-likelihood-problem as a two-player game, hence establishing a principled connection between VAEs and Generative Adversarial Networks (GANs). We show that in the nonparametric limit our method yields an exact maximum-likelihood assignment for the parameters of the generative model, as well as the exact posterior distribution over the latent variables given an observation. Contrary to competing approaches which combine VAEs with GANs, our approach has a clear theoretical justification, retains most advantages of standard Variational Autoencoders and is easy to implement.

Talk {daterange} @ Parkside 2 Strong NP-Hardness for Sparse Optimization with Concave Penalty Functions In Continuous optimization 1 PDF » Summary/Notes » Video »

Consider the regularized sparse minimization problem, which involves empirical sums of loss functions for $n$ data points (each of dimension $d$) and a nonconvex sparsity penalty. We prove that finding an $\mathcal{O}(n^{c_1}d^{c_2})$-optimal solution to the regularized sparse optimization problem is strongly NP-hard for any $c_1, c_2\in [0,1)$ such that $c_1+c_2<1$. The result applies to a broad class of loss functions and sparse penalty functions. It suggests that one cannot even approximately solve the sparse optimization problem in polynomial time, unless P $=$ NP.

Talk {daterange} @ C4.5 Boosted Fitted Q-Iteration In Reinforcement learning 1 PDF » Summary/Notes » Video »

This paper is about the study of B-FQI, an Approximated Value Iteration (AVI) algorithm that exploits a boosting procedure to estimate the action-value function in reinforcement learning problems. B-FQI is an iterative off-line algorithm that, given a dataset of transitions, builds an approximation of the optimal action-value function by summing the approximations of the Bellman residuals across all iterations. The advantage of such approach w.r.t. to other AVI methods is twofold: (1) while keeping the same function space at each iteration, B-FQI can represent more complex functions by considering an additive model; (2) since the Bellman residual decreases as the optimal value function is approached, regression problems become easier as iterations proceed. We study B-FQI both theoretically, providing also a finite-sample error upper bound for it, and empirically, by comparing its performance to the one of FQI in different domains and using different regression techniques.

Talk {daterange} @ C4.9& C4.10 Automatic Discovery of the Statistical Types of Variables in a Dataset In Probabilistic learning 1 PDF » Summary/Notes » Video »

A common practice in statistics and machine learning is to assume that the statistical data types (e.g., ordinal, categorical or real-valued) of variables, and usually also the likelihood model, is known. However, as the availability of real-world data increases, this assumption becomes too restrictive. Data are often heterogeneous, complex, and improperly or incompletely documented. Surprisingly, despite their practical importance, there is still a lack of tools to automatically discover the statistical types of, as well as appropriate likelihood (noise) models for, the variables in a dataset. In this paper, we fill this gap by proposing a Bayesian method, which accurately discovers the statistical data types in both synthetic and real data.

Talk {daterange} @ C4.1 Online Learning to Rank in Stochastic Click Models In Online learning 1 PDF » Summary/Notes » Video »

Online learning to rank is a core problem in information retrieval and machine learning. Many provably efficient algorithms have been recently proposed for this problem in specific click models. The click model is a model of how the user interacts with a list of documents. Though these results are significant, their impact on practice is limited, because all proposed algorithms are designed for specific click models and lack convergence guarantees in other models. In this work, we propose BatchRank, the first online learning to rank algorithm for a broad class of click models. The class encompasses two most fundamental click models, the cascade and position-based models. We derive a gap-dependent upper bound on the T-step regret of BatchRank and evaluate it on a range of web search queries. We observe that BatchRank outperforms ranked bandits and is more robust than CascadeKL-UCB, an existing algorithm for the cascade model.

Talk {daterange} @ C4.4 Online Partial Least Square Optimization: Dropping Convexity for Better Efficiency and Scalability In Latent feature models PDF » Summary/Notes » Video »

Multiview representation learning is popular for latent factor analysis. Many existing approaches formulate the multiview representation learning as convex optimization problems, where global optima can be obtained by certain algorithms in polynomial time. However, many evidences have corroborated that heuristic nonconvex approaches also have good empirical computational performance and convergence to the global optima, although there is a lack of theoretical justification. Such a gap between theory and practice motivates us to study a nonconvex formulation for multiview representation learning, which can be efficiently solved by a simple stochastic gradient descent method. By analyzing the dynamics of the algorithm based on diffusion processes, we establish a global rate of convergence to the global optima. Numerical experiments are provided to support our theory.

Talk {daterange} @ C4.8 Geometry of Neural Network Loss Surfaces via Random Matrix Theory In Deep learning theory 1 PDF » Summary/Notes » Video »

Understanding the geometry of neural network loss surfaces is important for the development of improved optimization algorithms and for building a theoretical understanding of why deep learning works. In this paper, we study the geometry in terms of the distribution of eigenvalues of the Hessian matrix at critical points of varying energy. We introduce an analytical framework and a set of tools from random matrix theory that allow us to compute an approximation of this distribution under a set of simplifying assumptions. The shape of the spectrum depends strongly on the energy and another key parameter, $\phi$, which measures the ratio of parameters to data points. Our analysis predicts and numerical simulations support that for critical points of small index, the number of negative eigenvalues scales like the 3/2 power of the energy. We leave as an open problem an explanation for our observation that, in the context of a certain memorization task, the energy of minimizers is well-approximated by the function $1/2(1-\phi)^2$.

Talk {daterange} @ C4.6 & C4.7 Multi-Class Optimal Margin Distribution Machine In Supervised learning 1 PDF » Summary/Notes » Video »

Recent studies disclose that maximizing the minimum margin like support vector machines does not necessarily lead to better generalization performances, and instead, it is crucial to optimize the margin distribution. Although it has been shown that for binary classification, characterizing the margin distribution by the first- and second-order statistics can achieve superior performance. It still remains open for multi-class classification, and due to the complexity of margin for multi-class classification, optimizing its distribution by mean and variance can also be difficult. In this paper, we propose mcODM (multi-class Optimal margin Distribution Machine), which can solve this problem efficiently. We also give a theoretical analysis for our method, which verifies the significance of margin distribution for multi-class classification. Empirical study further shows that mcODM always outperforms all four versions of multi-class SVMs on all experimental data sets.

Talk {daterange} @ Darling Harbour Theatre Evaluating the Variance of Likelihood-Ratio Gradient Estimators In Deep learning 1: backprop PDF » Summary/Notes » Video »

The likelihood-ratio method is often used to estimate gradients of stochastic computations, for which baselines are required to reduce the estimation variance. Many types of baselines have been proposed, although their degree of optimality is not well understood. In this study, we establish a novel framework of gradient estimation that includes most of the common gradient estimators as special cases. The framework gives a natural derivation of the optimal estimator that can be interpreted as a special case of the likelihood-ratio method so that we can evaluate the optimal degree of practical techniques with it. It bridges the likelihood-ratio method and the reparameterization trick while still supporting discrete variables. It is derived from the exchange property of the differentiation and integration. To be more specific, it is derived by the reparameterization trick and local marginalization analogous to the local expectation gradient. We evaluate various baselines and the optimal estimator for variational learning and show that the performance of the modern estimators is close to the optimal estimator.

Talk {daterange} @ Parkside 1 Learning Texture Manifolds with the Periodic Spatial GAN In Deep generative models 1 PDF » Summary/Notes » Video »

This paper introduces a novel approach to texture synthesis based on generative adversarial networks (GAN) (Goodfellow et al., 2014), and call this technique Periodic Spatial GAN (PSGAN). The PSGAN has several novel abilities which surpass the current state of the art in texture synthesis. First, we can learn multiple textures, periodic or non-periodic, from datasets of one or more complex large images. Second, we show that the image generation with PSGANs has properties of a texture manifold: we can smoothly interpolate between samples in the structured noise space and generate novel samples, which lie perceptually between the textures of the original dataset. We make multiple experiments which show that PSGANs can flexibly handle diverse texture and image data sources, and the method is highly scalable and can generate output images of arbitrary large size.

Talk {daterange} @ Parkside 2 Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence In Continuous optimization 1 PDF » Summary/Notes » Video »

In this paper, a new theory is developed for first-order stochastic convex optimization, showing that the global convergence rate is sufficiently quantified by a local growth rate of the objective function in a neighborhood of the optimal solutions. In particular, if the objective function $F(\w)$ in the $\epsilon$-sublevel set grows as fast as $\|\w - \w_*\|_2^{1/\theta}$, where $\w_*$ represents the closest optimal solution to $\w$ and $\theta\in(0,1]$ quantifies the local growth rate, the iteration complexity of first-order stochastic optimization for achieving an $\epsilon$-optimal solution can be $\widetilde O(1/\epsilon^{2(1-\theta)})$, which is {\it optimal at most} up to a logarithmic factor. This result is fundamentally better in contrast with the previous works that either assume a global growth condition in the entire domain or achieve a local faster convergence under the local faster growth condition. To achieve the faster global convergence, we develop two different {\bf accelerated stochastic subgradient} methods by iteratively solving the original problem approximately in a local region around a historical solution with the size of the local region gradually decreasing as the solution approaches the optimal set. Besides the theoretical improvements, this work also include new contributions towards making the proposed algorithms practical: (i) we present practical variants of accelerated stochastic subgradient methods that can run without the knowledge of multiplicative growth constant and even the growth rate $\theta$; (ii) we consider a broad family of problems in machine learning to demonstrate that the proposed algorithms enjoy faster convergence than traditional stochastic subgradient method. For example, when applied to the $\ell_1$ regularized empirical polyhedral loss minimization (e.g., hinge loss, absolute loss), the proposed stochastic methods have a logarithmic iteration complexity.

Talk {daterange} @ C4.5 Why is Posterior Sampling Better than Optimism for Reinforcement Learning? In Reinforcement learning 1 PDF » Summary/Notes » Video »

Computational results demonstrate that posterior sampling for reinforcement learning (PSRL) dramatically outperforms existing algorithms driven by optimism, such as UCRL2. We provide insight into the extent of this performance boost and the phenomenon that drives it. We leverage this insight to establish an $\tilde{O}(H\sqrt{SAT})$ Bayesian regret bound for PSRL in finite-horizon episodic Markov decision processes. This improves upon the best previous Bayesian regret bound of $\tilde{O}(H S \sqrt{AT})$ for any reinforcement learning algorithm. Our theoretical results are supported by extensive empirical evaluation.

Talk {daterange} @ C4.9& C4.10 Bayesian Models of Data Streams with Hierarchical Power Priors In Probabilistic learning 1 PDF » Summary/Notes » Video »

Making inferences from data streams is a pervasive problem in many modern data analysis applications. But it requires to address the problem of continuous model updating, and adapt to changes or drifts in the underlying data generating distribution. In this paper, we approach these problems from a Bayesian perspective covering general conjugate exponential models. Our proposal makes use of non-conjugate hierarchical priors to explicitly model temporal changes of the model parameters. We also derive a novel variational inference scheme which overcomes the use of non-conjugate priors while maintaining the computational efficiency of variational methods over conjugate models. The approach is validated on three real data sets over three latent variable models.

Talk {daterange} @ C4.1 The Sample Complexity of Online One-Class Collaborative Filtering In Online learning 1 PDF » Summary/Notes » Video »

We consider the online one-class collaborative filtering (CF) problem that consist of recommending items to users over time in an online fashion based on positive ratings only. This problem arises when users respond only occasionally to a recommendation with a positive rating, and never with a negative one. We study the impact of the probability of a user responding to a recommendation, p_f, on the sample complexity, and ask whether receiving positive and negative ratings, instead of positive ratings only, improves the sample complexity. Both questions arise in the design of recommender systems. We introduce a simple probabilistic user model, and analyze the performance of an online user-based CF algorithm. We prove that after an initial cold start phase, where recommendations are invested in exploring the user's preferences, this algorithm makes---up to a fraction of the recommendations required for updating the user's preferences---perfect recommendations. The number of ratings required for the cold start phase is nearly proportional to 1/p_f, and that for updating the user's preferences is essentially independent of p_f. As a consequence we find that, receiving positive and negative ratings instead of only positive ones improves the number of ratings required for initial exploration by a factor of 1/p_f, which can be significant.

Talk {daterange} @ C4.8 The Shattered Gradients Problem: If resnets are the answer, then what is the question? In Deep learning theory 1 PDF » Summary/Notes » Video »

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. Although, the problem has largely been overcome via carefully constructed initializations and batch normalization, architectures incorporating skip-connections such as highway and resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise whereas, in contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new ``looks linear'' (LL) initialization that prevents shattering, with preliminary experiments showing the new initialization allows to train very deep networks without the addition of skip-connections.

Talk {daterange} @ C4.6 & C4.7 Kernelized Support Tensor Machines In Supervised learning 1 PDF » Summary/Notes » Video »

In the context of supervised tensor learning, preserving the structural information and exploiting the discriminative nonlinear relationships of tensor data are crucial for improving the performance of learning tasks. Based on tensor factorization theory and kernel methods, we propose a novel Kernelized Support Tensor Machine (KSTM) which integrates kernelized tensor factorization with maximum-margin criterion. Specifically, the kernelized factorization technique is introduced to approximate the tensor data in kernel space such that the complex nonlinear relationships within tensor data can be explored. Further, dual structural preserving kernels are devised to learn the nonlinear boundary between tensor data. As a result of joint optimization, the kernels obtained in KSTM exhibit better generalization power to discriminative analysis. The experimental results on real-world neuroimaging datasets show the superiority of KSTM over the state-of-the-art techniques.

Talk {daterange} @ Darling Harbour Theatre Equivariance Through Parameter-Sharing In Deep learning 2: invariances PDF » Summary/Notes » Video »

We propose to study equivariance in deep neural networks through parameter symmetries. In particular, given a group G that acts discretely on the input and output of a standard neural network layer, we show that its equivariance is linked to the symmetry group of network parameters. We then propose two parameter-sharing scheme to induce the desirable symmetry on the parameters of the neural network. Under some conditions on the action of G, our procedure for tying the parameters achieves G-equivariance and guarantees sensitivity to all other permutation groups outside of G.

Talk {daterange} @ Parkside 1 Generalization and Equilibrium in Generative Adversarial Nets (GANs) In Deep generative models 2 PDF » Summary/Notes » Video »

It is shown that training of generative adversarial network (GAN) may not have good generalization properties; e.g., training may appear successful but the trained distribution may be far from target distribution in standard metrics. However, generalization does occur for a weaker metric called neural net distance. It is also shown that an approximate pure equilibrium exists in the discriminator/generator game for a natural training objective (Wasserstein) when generator capacity and training set sizes are moderate. This existence of equilibrium inspires MIX+GAN protocol, which can be combined with any existing GAN training, and empirically shown to improve some of them.

Talk {daterange} @ Parkside 2 GSOS: Gauss-Seidel Operator Splitting Algorithm for Multi-Term Nonsmooth Convex Composite Optimization In Continuous optimization 2 PDF » Summary/Notes » Video »

In this paper, we propose a fast {\bf{G}}auss-{\bf{S}}eidel {\bf{O}}perator {\bf{S}}plitting (GSOS) algorithm for addressing multi-term nonsmooth convex composite optimization, which has wide applications in machine learning, signal processing and statistics. The proposed GSOS algorithm inherits the advantage of the Gauss-Seidel technique to accelerate the optimization procedure, and leverages the operator splitting technique to reduce the computational complexity. In addition, we develop a new technique to establish the global convergence of the GSOS algorithm. To be specific, we first reformulate the iterations of GSOS as a two-step iterations algorithm by employing the tool of operator optimization theory. Subsequently, we establish the convergence of GSOS based on the two-step iterations algorithm reformulation. At last, we apply the proposed GSOS algorithm to solve overlapping group Lasso and graph-guided fused Lasso problems. Numerical experiments show that our proposed GSOS algorithm is superior to the state-of-the-art algorithms in terms of both efficiency and effectiveness.

Talk {daterange} @ C4.5 Constrained Policy Optimization In Reinforcement learning 2 PDF » Summary/Notes » Video »

For many applications of reinforcement learning it can be more convenient to specify both a reward function and constraints, rather than trying to design behavior through the reward function. For example, systems that physically interact with or around humans should satisfy safety constraints. Recent advances in policy search algorithms (Mnih et al., 2016, Schulman et al., 2015, Lillicrap et al., 2016, Levine et al., 2016) have enabled new capabilities in high-dimensional control, but do not consider the constrained setting. We propose Constrained Policy Optimization (CPO), the first general-purpose policy search algorithm for constrained reinforcement learning with guarantees for near-constraint satisfaction at each iteration. Our method allows us to train neural network policies for high-dimensional control while making guarantees about policy behavior all throughout training. Our guarantees are based on a new theoretical result, which is of independent interest: we prove a bound relating the expected returns of two policies to an average divergence between them. We demonstrate the effectiveness of our approach on simulated robot locomotion tasks where the agent must satisfy constraints motivated by safety.

Talk {daterange} @ C4.9& C4.10 Ordinal Graphical Models: A Tale of Two Approaches In Probabilistic learning 2 PDF » Summary/Notes » Video »

Undirected graphical models or Markov random fields (MRFs) are widely used for modeling multivariate probability distributions. Much of the work on MRFs has focused on continuous variables, and nominal variables (that is, unordered categorical variables). However, data from many real world applications involve ordered categorical variables also known as ordinal variables, e.g., movie ratings on Netflix which can be ordered from 1 to 5 stars. With respect to univariate ordinal distributions, as we detail in the paper, there are two main categories of distributions; while there have been efforts to extend these to multivariate ordinal distributions, the resulting distributions are typically very complex, with either a large number of parameters, or with non-convex likelihoods. While there have been some work on tractable approximations, these do not come with strong statistical guarantees, and moreover are relatively computationally expensive. In this paper, we theoretically investigate two classes of graphical models for ordinal data, corresponding to the two main categories of univariate ordinal distributions. In contrast to previous work, our theoretical developments allow us to provide correspondingly two classes of estimators that are not only computationally efficient but also have strong statistical guarantees.

Talk {daterange} @ C4.4 Coresets for Vector Summarization with Applications to Network Graphs In Matrix factorization 1 PDF » Summary/Notes » Video »

We provide a deterministic data summarization algorithm that approximates the mean $\bar{p}=\frac{1}{n}\sum_{p\in P} p$ of a set $P$ of $n$ vectors in $\REAL^d$, by a weighted mean $\tilde{p}$ of a \emph{subset} of $O(1/\eps)$ vectors, i.e., independent of both $n$ and $d$. We prove that the squared Euclidean distance between $\bar{p}$ and $\tilde{p}$ is at most $\eps$ multiplied by the variance of $P$. We use this algorithm to maintain an approximated sum of vectors from an unbounded stream, using memory that is independent of $d$, and logarithmic in the $n$ vectors seen so far. Our main application is to extract and represent in a compact way friend groups and activity summaries of users from underlying data exchanges. For example, in the case of mobile networks, we can use GPS traces to identify meetings; in the case of social networks, we can use information exchange to identify friend groups. Our algorithm provably identifies the {\it Heavy Hitter} entries in a proximity (adjacency) matrix. The Heavy Hitters can be used to extract and represent in a compact way friend groups and activity summaries of users from underlying data exchanges. We evaluate the algorithm on several large data sets.

Talk {daterange} @ C4.8 Recovery Guarantees for One-hidden-layer Neural Networks In Deep learning theory 2 PDF » Summary/Notes » Video »

In this paper, we consider regression problems with one-hidden-layer neural networks (1NNs). We distill some properties of activation functions that lead to \emph{local strong convexity} in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective and most popular nonlinear activation functions satisfy the distilled properties, including rectified linear units (\ReLU s), leaky \ReLU s, squared \ReLU s and sigmoids. For activation functions that are also smooth, we show \emph{local linear convergence} guarantees of gradient descent under a resampling rule. For homogeneous activations, we show tensor methods are able to initialize the parameters to fall into the local strong convexity region. As a result, tensor initialization followed by gradient descent is guaranteed to recover the ground truth with sample complexity $ d \cdot \log(1/\epsilon) \cdot \poly(k,\lambda )$ and computational complexity $n\cdot d \cdot \poly(k,\lambda) $ for smooth homogeneous activations with high probability, where $d$ is the dimension of the input, $k$ ($k\leq d$) is the number of hidden nodes, $\lambda$ is a conditioning property of the ground-truth parameter matrix between the input layer and the hidden layer, $\epsilon$ is the targeted precision and $n$ is the number of samples. To the best of our knowledge, this is the first work that provides recovery guarantees for 1NNs with both sample complexity and computational complexity \emph{linear} in the input dimension and \emph{logarithmic} in the precision.

Talk {daterange} @ C4.6 & C4.7 Dual Supervised Learning In Supervised learning 2 PDF » Summary/Notes » Video »

Many supervised learning tasks are emerged in dual forms, e.g., English-to-French translation vs. French-to-English translation, speech recognition vs. text to speech, and image classification vs. image generation. Two dual tasks have intrinsic connections with each other due to the probabilistic correlation between their models. This connection is, however, not effectively utilized today, since people usually train the models of two dual tasks separately and independently. In this work, we propose training the models of two dual tasks simultaneously, and explicitly exploiting the probabilistic correlation between them to regularize the training process. For ease of reference, we call the proposed approach dual supervised learning. We demonstrate that dual supervised learning can improve the practical performances of both tasks, for various applications including machine translation, image processing, and sentiment analysis.

Talk {daterange} @ Darling Harbour Theatre Warped Convolutions: Efficient Invariance to Spatial Transformations In Deep learning 2: invariances PDF » Summary/Notes » Video »

Convolutional Neural Networks (CNNs) are extremely efficient, since they exploit the inherent translation-invariance of natural images. However, translation is just one of a myriad of useful spatial transformations. Can the same efficiency be attained when considering other spatial invariances? Such generalized convolutions have been considered in the past, but at a high computational cost. We present a construction that is simple and exact, yet has the same computational complexity that standard convolutions enjoy. It consists of a constant image warp followed by a simple convolution, which are standard blocks in deep learning toolboxes. With a carefully crafted warp, the resulting architecture can be made equivariant to a wide range of two-parameter spatial transformations. We show encouraging results in realistic scenarios, including the estimation of vehicle poses in the Google Earth dataset (rotation and scale), and face poses in Annotated Facial Landmarks in the Wild (3D rotations under perspective).

Talk {daterange} @ Parkside 2 Breaking Locality Accelerates Block Gauss-Seidel In Continuous optimization 2 PDF » Summary/Notes » Video »

Recent work by Nesterov and Stich (2016) showed that momentum can be used to accelerate the rate of convergence for block Gauss-Seidel in the setting where a fixed partitioning of the coordinates is chosen ahead of time. We show that this setting is too restrictive, constructing instances where breaking locality by running non-accelerated Gauss-Seidel with randomly sampled coordinates substantially outperforms accelerated Gauss-Seidel with any fixed partitioning. Motivated by this finding, we analyze the accelerated block Gauss-Seidel algorithm in the random coordinate sampling setting. Our analysis captures the benefit of acceleration with a new data-dependent parameter which is well behaved when the matrix sub-blocks are well-conditioned. Empirically, we show that accelerated Gauss-Seidel with random coordinate sampling provides speedups for large scale machine learning tasks when compared to non-accelerated Gauss-Seidel and the classical conjugate-gradient algorithm.

Talk {daterange} @ C4.5 Reinforcement Learning with Deep Energy-Based Policies In Reinforcement learning 2 PDF » Summary/Notes » Video »

We propose a method for learning expressive energy-based policies for continuous states and actions, which has been feasible only in tabular domains before. We apply our method to learning maximum entropy policies, resulting into a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution. We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution. The benefits of the proposed algorithm include improved exploration and compositionality that allows transferring skills between tasks, which we confirm in simulated experiments with swimming and walking robots. We also draw a connection to actor-critic methods, which can be viewed performing approximate inference on the corresponding energy-based model.

Talk {daterange} @ C4.9& C4.10 Scalable Bayesian Rule Lists In Probabilistic learning 2 PDF » Summary/Notes » Video »

We present an algorithm for building probabilistic rule lists that is two orders of magnitude faster than previous work. Rule list algorithms are competitors for decision tree algorithms. They are associative classifiers, in that they are built from pre-mined association rules. They have a logical structure that is a sequence of IF-THEN rules, identical to a decision list or one-sided decision tree. Instead of using greedy splitting and pruning like decision tree algorithms, we aim to fully optimize over rule lists, striking a practical balance between accuracy, interpretability, and computational speed. The algorithm presented here uses a mixture of theoretical bounds (tight enough to have practical implications as a screening or bounding procedure), computational reuse, and highly tuned language libraries to achieve computational efficiency. Currently, for many practical problems, this method achieves better accuracy and sparsity than decision trees. In many cases, the computational time is practical and often less than that of decision trees.

Talk {daterange} @ C4.1 Identify the Nash Equilibrium in Static Games with Random Payoffs In Online learning 2 PDF » Summary/Notes » Video »

We study the problem on how to learn the pure Nash Equilibrium of a two-player zero-sum static game with random payoffs under unknown distributions via efficient payoff queries. We introduce a multi-armed bandit model to this problem due to its ability to find the best arm efficiently among random arms and propose two algorithms for this problem---LUCB-G based on the confidence bounds and a racing algorithm based on successive action elimination. We provide an analysis on the sample complexity lower bound when the Nash Equilibrium exists.

Talk {daterange} @ C4.4 Partitioned Tensor Factorizations for Learning Mixed Membership Models In Matrix factorization 1 PDF » Summary/Notes » Video »

We present an efficient algorithm for learning mixed membership models when the number of variables p is much larger than the number of hidden components k. This algorithm reduces the computational complexity of state-of-the-art tensor methods, which require decomposing an O(p^3) tensor, to factorizing O (p/k) sub-tensors each of size O(k^3). In addition, we address the issue of negative entries in the empirical method of moments based estimators. We provide sufficient conditions under which our approach has provable guarantees. Our approach obtains competitive empirical results on both simulated and real data.

Talk {daterange} @ C4.8 Failures of Gradient-Based Deep Learning In Deep learning theory 2 PDF » Summary/Notes » Video »

In recent years, Deep Learning has become the go-to solution for a broad range of applications, often outperforming state-of-the-art. However, it is important, for both theoreticians and practitioners, to gain a deeper understanding of the difficulties and limitations associated with common approaches and algorithms. We describe four types of simple problems, for which the gradient-based algorithms commonly used in deep learning either fail or suffer from significant difficulties. We illustrate the failures through practical experiments, and provide theoretical insights explaining their source, and how they might be remedied.

Talk {daterange} @ C4.6 & C4.7 Learning Infinite Layer Networks without the Kernel Trick In Supervised learning 2 PDF » Summary/Notes » Video »

Infinite Layer Networks (ILN) have been proposed as an architecture that mimics neural networks while enjoying some of the advantages of kernel methods. ILN are networks that integrate over infinitely many nodes within a single hidden layer. It has been demonstrated by several authors that the problem of learning ILN can be reduced to the kernel trick, implying that whenever a certain integral can be computed analytically they are efficiently learnable. In this work we give an online algorithm for ILN, which avoids the kernel trick assumption. More generally and of independent interest, we show that kernel methods in general can be exploited even when the kernel cannot be efficiently computed but can only be estimated via sampling. We provide a regret analysis for our algorithm, showing that it matches the sample complexity of methods which have access to kernel values. Thus, our method is the first to demonstrate that the kernel trick is not necessary, as such, and random features suffice to obtain comparable performance.

Talk {daterange} @ Darling Harbour Theatre Graph-based Isometry Invariant Representation Learning In Deep learning 2: invariances PDF » Summary/Notes » Video »

Learning transformation invariant representations of visual data is an important problem in computer vision. Deep convolutional networks have demonstrated remarkable results for image and video classification tasks. However, they have achieved only limited success in the classification of images that undergo geometric transformations. In this work we present a novel Transformation Invariant Graph-based Network (TIGraNet), which learns graph-based features that are inherently invariant to isometric transformations such as rotation and translation of input images. In particular, images are represented as signals on graphs, which permits to replace classical convolution and pooling layers in deep networks with graph spectral convolution and dynamic graph pooling layers that together contribute to invariance to isometric transformation. Our experiments show high performance on rotated and translated images from the test set compared to classical architectures that are very sensitive to transformations in the data. The inherent invariance properties of our framework provide key advantages, such as increased resiliency to data variability and sustained performance with limited training sets.

Talk {daterange} @ Parkside 1 Conditional Image Synthesis with Auxiliary Classifier GANs In Deep generative models 2 PDF » Summary/Notes » Video »

In this paper we introduce new methods for the improved training of generative adversarial networks (GANs) for image synthesis. We construct a variant of GANs employing label conditioning that results in 128 × 128 resolution image samples exhibiting global coherence. We expand on previous work for image quality assessment to provide two new analyses for assessing the discriminability and diversity of samples from class-conditional image synthesis models. These analyses demonstrate that high resolution samples provide class information not present in low resolution samples. Across 1000 ImageNet classes, 128 × 128 samples are more than twice as discriminable as artificially resized 32×32 samples. In addition, 84.7% of the classes have samples exhibiting diversity comparable to real ImageNet data.

Talk {daterange} @ Parkside 2 Stochastic DCA for the Large-sum of Non-convex Functions Problem and its Application to Group Variable Selection in Classification In Continuous optimization 2 PDF » Summary/Notes » Video »

In this paper, we present a stochastic version of DCA (Difference of Convex functions Algorithm) to solve a class of optimization problems whose objective function is a large sum of non-convex functions and a regularization term. We consider the $\ell_{2,0}$ regularization to deal with the group variables selection. By exploiting the special structure of the problem, we propose an efficient DC decomposition for which the corresponding stochastic DCA scheme is very inexpensive: it only requires the projection of points onto balls that is explicitly computed. As an application, we applied our algorithm for the group variables selection in multiclass logistic regression. Numerical experiments on several benchmark datasets and synthetic datasets illustrate the efficiency of our algorithm and its superiority over well-known methods, with respect to classification accuracy, sparsity of solution as well as running time.

Talk {daterange} @ C4.5 Prediction and Control with Temporal Segment Models In Reinforcement learning 2 PDF » Summary/Notes » Video »

We introduce a method for learning the dynamics of complex nonlinear systems based on deep generative models over temporal segments of states and actions. Unlike dynamics models that operate over individual discrete timesteps, we learn the distribution over future state trajectories conditioned on past state, past action, and planned future action trajectories, as well as a latent prior over action trajectories. Our approach is based on convolutional autoregressive models and variational autoencoders. It makes stable and accurate predictions over long horizons for complex, stochastic systems, effectively expressing uncertainty and modeling the effects of collisions, sensory noise, and action delays. The learned dynamics model and action prior can be used for end-to-end, fully differentiable trajectory optimization and model-based policy optimization, which we use to evaluate the performance and sample-efficiency of our method.

Talk {daterange} @ C4.9& C4.10 Learning Determinantal Point Processes with Moments and Cycles In Probabilistic learning 2 PDF » Summary/Notes » Video »

Determinantal Point Processes (DPPs) are a family of probabilistic models that have a repulsive behavior, and lend themselves naturally to many tasks in machine learning where returning a diverse set of objects is important. While there are fast algorithms for sampling, marginalization and conditioning, much less is known about learning the parameters of a DPP. Our contribution is twofold: (i) we establish the optimal sample complexity achievable in this problem and show that it is governed by a natural parameter, which we call the cycle sparsity; (ii) we propose a provably fast combinatorial algorithm that implements the method of moments efficiently and achieves optimal sample complexity. Finally, we give experimental results that confirm our theoretical findings.

Talk {daterange} @ C4.1 Follow the Compressed Leader: Faster Online Learning of Eigenvectors and Faster MMWU In Online learning 2 PDF » Summary/Notes » Video »

The online problem of computing the top eigenvector is fundamental to machine learning. The famous matrix-multiplicative-weight-update (MMWU) framework solves this online problem and gives optimal regret. However, since MMWU runs very slow due to the computation of matrix exponentials, researchers proposed the follow-the-perturbed-leader (FTPL) framework which is faster, but a factor $\sqrt{d}$ worse than the optimal regret for dimension-$d$ matrices. We propose a \emph{follow-the-compressed-leader} framework which, not only matches the optimal regret of MMWU (up to polylog factors), but runs no slower than FTPL. Our main idea is to ``compress'' the MMWU strategy to dimension 3 in the adversarial setting, or dimension 1 in the stochastic setting. This resolves an open question regarding how to obtain both (nearly) optimal and efficient algorithms for the online eigenvector problem.

Talk {daterange} @ C4.4 On Mixed Memberships and Symmetric Nonnegative Matrix Factorizations In Matrix factorization 1 PDF » Summary/Notes » Video »

The problem of finding overlapping communities in networks has gained much attention recently. Optimization-based approaches use non-negative matrix factorization (NMF) or variants, but the global optimum cannot be provably attained in general. Model-based approaches, such as the popular mixed-membership stochastic blockmodel or MMSB (Airoldi et al., 2008), use parameters for each node to specify the overlapping communities, but standard inference techniques cannot guarantee consistency. We link the two approaches, by (a) establishing sufficient conditions for the symmetric NMF optimization to have a unique solution under MMSB, and (b) proposing a computationally efficient algorithm called GeoNMF that is provably optimal and hence consistent for a broad parameter regime. We demonstrate its accuracy on both simulated and real-world datasets.

Talk {daterange} @ C4.8 Analytical Guarantees on Numerical Precision of Deep Neural Networks In Deep learning theory 2 PDF » Summary/Notes » Video »

The acclaimed successes of neural networks often overshadow their tremendous complexity. We focus on numerical precision - a key parameter defining the complexity of neural networks. First, we present theoretical bounds on the accuracy in presence of limited precision. Interestingly, these bounds can be computed via the back-propagation algorithm. Hence, by combining our theoretical analysis and the back-propagation algorithm, we are able to readily determine the minimum precision needed to preserve accuracy without having to resort to time-consuming fixed-point simulations. We provide numerical evidence showing how our approach allows us to maintain high accuracy but with lower complexity than state-of-the-art binary networks.

Talk {daterange} @ C4.6 & C4.7 Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees In Supervised learning 2 PDF » Summary/Notes » Video »

Random Fourier features is one of the most popular techniques for scaling up kernel methods, such as kernel ridge regression. However, despite impressive empirical results, the statistical properties of random Fourier features are still not well understood. In this paper we take steps toward filling this gap. Specifically, we approach random Fourier features from a spectral matrix approximation point of view, give tight bounds on the number of Fourier features required to achieve a spectral approximation, and show how spectral matrix approximation bounds imply statistical guarantees for kernel ridge regression.

Talk {daterange} @ Darling Harbour Theatre Deriving Neural Architectures from Sequence and Graph Kernels In Deep learning 2: invariances PDF » Summary/Notes » Video »

The design of neural architectures for structured objects is typically guided by experimental insights rather than a formal process. In this work, we appeal to kernels over combinatorial structures, such as sequences and graphs, to derive appropriate neural operations. We introduce a class of deep recurrent neural operations and formally characterize their associated kernel spaces. Our recurrent modules compare the input to virtual reference objects (cf. filters in CNN) via the kernels. Similar to traditional neural operations, these reference objects are parameterized and directly optimized in end-to-end training. We empirically evaluate the proposed class of neural architectures on standard applications such as language modeling and molecular graph regression, achieving state-of-the-art results across these applications.

Talk {daterange} @ Parkside 1 Learning to Discover Cross-Domain Relations with Generative Adversarial Networks In Deep generative models 2 PDF » Summary/Notes » Video »

While humans easily recognize relations between data from different domains without any supervision, learning to automatically discover them is in general very challenging and needs many ground-truth pairs that illustrate the relations. To avoid costly pairing, we address the task of discovering cross-domain relations given unpaired data. We propose a method based on a generative adversarial network that learns to discover relations between different domains (DiscoGAN). Using the discovered relations, our proposed network successfully transfers style from one domain to another while preserving key attributes such as orientation and face identity.

Talk {daterange} @ Parkside 2 Gradient Projection Iterative Sketch for Large-Scale Constrained Least-Squares In Continuous optimization 2 PDF » Summary/Notes » Video »

We propose a randomized first order optimization algorithm Gradient Projection Iterative Sketch (GPIS) and an accelerated variant for efficiently solving large scale constrained Least Squares (LS). We provide the first theoretical convergence analysis for both algorithms. An efficient implementation using a tailored line-search scheme is also proposed. We demonstrate our methods' computational efficiency compared to the classical accelerated gradient method, and the variance-reduced stochastic gradient methods through numerical experiments in various large synthetic/real data sets.

Talk {daterange} @ C4.5 An Alternative Softmax Operator for Reinforcement Learning In Reinforcement learning 2 PDF » Summary/Notes » Video »

A softmax operator applied to a set of values acts somewhat like the maximization function and somewhat like an average. In sequential decision making, softmax is often used in settings where it is necessary to maximize utility but also to hedge against problems that arise from putting all of one’s weight behind a single maximum utility decision. The Boltzmann softmax operator is the most commonly used softmax operator in this setting, but we show that this operator is prone to misbehavior. In this work, we study a differentiable softmax operator that, among other properties, is a non-expansion ensuring a convergent behavior in learning and planning. We introduce a variant of SARSA algorithm that, by utilizing the new operator, computes a Boltzmann policy with a state-dependent temperature parameter. We show that the algorithm is convergent and that it performs favorably in practice.

Talk {daterange} @ C4.9& C4.10 Deep Bayesian Active Learning with Image Data In Probabilistic learning 2 PDF » Summary/Notes » Video »

Even though active learning forms an important pillar of machine learning, deep learning tools are not prevalent within it. Deep learning poses several difficulties when used in an active learning setting. First, active learning (AL) methods generally rely on being able to learn and update models from small amounts of data. Recent advances in deep learning, on the other hand, are notorious for their dependence on large amounts of data. Second, many AL acquisition functions rely on model uncertainty, yet deep learning methods rarely represent such model uncertainty. In this paper we combine recent advances in Bayesian deep learning into the active learning framework in a practical way. We develop an active learning framework for high dimensional data, a task which has been extremely challenging so far, with very sparse existing literature. Taking advantage of specialised models such as Bayesian convolutional neural networks, we demonstrate our active learning techniques with image data, obtaining a significant improvement on existing active learning approaches. We demonstrate this on both the MNIST dataset, as well as for skin cancer diagnosis from lesion images (ISIC2016 task).

Talk {daterange} @ C4.1 On Kernelized Multi-armed Bandits In Online learning 2 PDF » Summary/Notes » Video »

We consider the stochastic bandit problem with a continuous set of arms, with the expected reward function over the arms assumed to be fixed but unknown. We provide two new Gaussian process-based algorithms for continuous bandit optimization -- Improved GP-UCB (IGP-UCB) and GP-Thomson sampling (GP-TS), and derive corresponding regret bounds. Specifically, the bounds hold when the expected reward function belongs to the reproducing kernel Hilbert space (RKHS) that naturally corresponds to a Gaussian process kernel used as input by the algorithms. Along the way, we derive a new self-normalized concentration inequality for vector-valued martingales of arbitrary, possibly infinite, dimension. Finally, experimental evaluation and comparisons to existing algorithms on synthetic and real-world environments are carried out that highlight the favourable gains of the proposed strategies in many cases.

Talk {daterange} @ C4.4 Nonnegative Matrix Factorization for Time Series Recovery From a Few Temporal Aggregates In Matrix factorization 1 PDF » Summary/Notes » Video »

Motivated by electricity consumption reconstitution, we propose a new matrix recovery method using nonnegative matrix factorization (NMF). The task tackled here is to reconstitute electricity consumption time series at a fine temporal scale from measures that are temporal aggregates of individual consumption. Contrary to existing NMF algorithms, the proposed method uses temporal aggregates as input data, instead of matrix entries. Furthermore, the proposed method is extended to take into account individual autocorrelation to provide better estimation, using a recent convex relaxation of quadratically constrained quadratic programs. Extensive experiments on synthetic and real-world electricity consumption datasets illustrate the effectiveness of the proposed method.

Talk {daterange} @ C4.8 Follow the Moving Leader in Deep Learning In Deep learning theory 2 PDF » Summary/Notes » Video »

Deep networks are highly nonlinear and difficult to optimize. During training, the parameter iterate may move from one local basin to another, or the data distribution may even change. Inspired by the close connection between stochastic optimization and online learning, we propose a variant of the {\em follow the regularized leader} (FTRL) algorithm called {\em follow the moving leader} (FTML). Unlike the FTRL family of algorithms, the recent samples are weighted more heavily in each iteration and so FTML can adapt more quickly to changes. We show that FTML enjoys the nice properties of RMSprop and Adam, while avoiding their pitfalls. Experimental results on a number of deep learning models and tasks demonstrate that FTML converges quickly, and outperforms other state-of-the-art optimizers.

Talk {daterange} @ C4.6 & C4.7 Logarithmic Time One-Against-Some In Supervised learning 2 PDF » Summary/Notes » Video »

We create a new online reduction of multiclass classification to binary classification for which training and prediction time scale logarithmically with the number of classes. We show that several simple techniques give rise to an algorithm which is superior to previous logarithmic time classification approaches while competing with one-against-all in space. The core construction is based on using a tree to select a small subset of labels with high recall, which are then scored using a one-against-some structure with high precision.

Talk {daterange} @ Parkside 1 Wasserstein Generative Adversarial Networks In Deep generative models 2 PDF » Summary/Notes » Video »

We introduce a new algorithm named WGAN, an alternative to traditional GAN training. In this new model, we show that we can improve the stability of learning, get rid of problems like mode collapse, and provide meaningful learning curves useful for debugging and hyperparameter searches. Furthermore, we show that the corresponding optimization problem is sound, and provide extensive theoretical work highlighting the deep connections to different distances between distributions.

Talk {daterange} @ Parkside 2 Connected Subgraph Detection with Mirror Descent on SDPs In Continuous optimization 2 PDF » Summary/Notes » Video »

We propose a novel, computationally efficient mirror-descent based optimization framework for subgraph detection in graph-structured data. Our aim is to discover anomalous patterns present in a connected subgraph of a given graph. This problem arises in many applications such as detection of network intrusions, community detection, detection of anomalous events in surveillance videos or disease outbreaks. Since optimization over connected subgraphs is a combinatorial and computationally difficult problem, we propose a convex relaxation that offers a principled approach to incorporating connectivity and conductance constraints on candidate subgraphs. We develop a novel efficient algorithm to solve the relaxed problem, establish convergence guarantees and demonstrate its feasibility and performance with experiments on real and very large simulated networks.

Talk {daterange} @ C4.5 Fake News Mitigation via Point Process Based Intervention In Reinforcement learning 2 PDF » Summary/Notes » Video »

We propose the first multistage intervention framework that tackles fake news in social networks by combining reinforcement learning with a point process network activity model. The spread of fake news and mitigation events within the network is modeled by a multivariate Hawkes process with additional exogenous control terms. By choosing a feature representation of states, defining mitigation actions and constructing reward functions to measure the effectiveness of mitigation activities, we map the problem of fake news mitigation into the reinforcement learning framework. We develop a policy iteration method unique to the multivariate networked point process, with the goal of optimizing the actions for maximal reward under budget constraints. Our method shows promising performance in real-time intervention experiments on a Twitter network to mitigate a surrogate fake news campaign, and outperforms alternatives on synthetic datasets.

Talk {daterange} @ C4.9& C4.10 Bayesian Boolean Matrix Factorisation In Probabilistic learning 2 PDF » Summary/Notes » Video » Video »

Boolean matrix factorisation aims to decompose a binary data matrix into an approximate Boolean product of two low rank, binary matrices: one containing meaningful patterns, the other quantifying how the observations can be expressed as a combination of these patterns. We introduce the OrMachine, a probabilistic generative model for Boolean matrix factorisation and derive a Metropolised Gibbs sampler that facilitates efficient parallel posterior inference. On real world and simulated data, our method outperforms all currently existing approaches for Boolean matrix factorisation and completion. This is the first method to provide full posterior inference for Boolean Matrix factorisation which is relevant in applications, e.g. for controlling false positive rates in collaborative filtering and, crucially, improves the interpretability of the inferred patterns. The proposed algorithm scales to large datasets as we demonstrate by analysing single cell gene expression data in 1.3 million mouse brain cells across 11 thousand genes on commodity hardware.

Talk {daterange} @ C4.1 Second-Order Kernel Online Convex Optimization with Adaptive Sketching In Online learning 2 PDF » Summary/Notes » Video »

Kernel online convex optimization (KOCO) is a framework combining the expressiveness of non-parametric kernel models with the regret guarantees of online learning. First-order KOCO methods such as functional gradient descent require only $O(t)$ time and space per iteration, and, when the only information on the losses is their convexity, achieve a minimax optimal $O(\sqrt{T})$ regret. Nonetheless, many common losses in kernel problems, such as squared loss, logistic loss, and squared hinge loss posses stronger curvature that can be exploited. In this case, second-order KOCO methods achieve $O(\log(\Det(K)))$ regret, which we show scales as $O(deff \log T)$, where $deff$ is the effective dimension of the problem and is usually much smaller than $O(\sqrt{T})$. The main drawback of second-order methods is their much higher $O(t^2)$ space and time complexity. In this paper, we introduce kernel online Newton step (KONS), a new second-order KOCO method that also achieves $O(deff\log T)$ regret. To address the computational complexity of second-order methods, we introduce a new matrix sketching algorithm for the kernel matrix~$K$, and show that for a chosen parameter $\gamma \leq 1$ our Sketched-KONS reduces the space and time complexity by a factor of $\gamma^2$ to $O(t^2\gamma^2)$ space and time per iteration, while incurring only $1/\gamma$ times more regret.

Talk {daterange} @ C4.4 Frame-based Data Factorizations In Matrix factorization 1 PDF » Summary/Notes » Video »

Archetypal Analysis is the method of choice to compute interpretable matrix factorizations. Every data point is represented as a convex combination of factors, i.e., points on the boundary of the convex hull of the data. This renders computation inefficient. In this paper, we show that the set of vertices of a convex hull, the so-called frame, can be efficiently computed by a quadratic program. We provide theoretical and empirical results for our proposed approach and make 