Accepted papers

Printed proceedings are available for purchase.

All of the talk videos are available online.

Conversational Speech Transcription Using Context-Dependent Deep Neural Networks Dong Yu, Frank Seide, Gang Li – Invited applications paper Abstract: Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, combine the classic artificial-neural-network HMMs with traditional context-dependent acoustic modeling and deep-belief-network pre-training. CD-DNN-HMMs greatly outperform conventional CD-GMM (Gaussian mixture model) HMMs: The word error rate is reduced by up to one third on the difficult benchmarking task of speaker-independent single-pass transcription of telephone conversations. discussion+video, ICML version (pdf) (bib)

Data-driven Web Design Ranjitha Kumar, Jerry Talton, Salman Ahmad, Scott Klemmer – Invited applications paper Abstract: This short paper summarizes challenges and opportunities of applying machine learning methods to Web design problems, and describes how structured prediction, deep learning, and probabilistic program induction can enable useful interactions for designers. We intend for these techniques to foster new work in data-driven Web design. discussion+video, ICML version (pdf) (bib)

Learning the Central Events and Participants in Unlabeled Text Nathanael Chambers, Dan Jurafsky – Invited applications paper Abstract: The majority of information on the Internet is expressed in written text. Understanding and extracting this information is crucial to building intelligent systems that can organize this knowledge. Today, most algorithms focus on learning atomic facts and relations. For instance, we can reliably extract facts like 'Annapolis is a City' by observing redundant word patterns across a corpus. However, these facts do not capture richer knowledge like the way detonating a bomb is related to destroying a building, or that the perpetrator who was convicted must have been arrested. A structured model of these events and entities is needed for a deeper understanding of language. This talk describes unsupervised approaches to learning such rich knowledge. discussion+video, ICML version (pdf) (bib)

Exemplar-SVMs for Visual Object Detection, Label Transfer and Image Retrieval Tomasz Malisiewicz, Abhinav Shrivastava, Abhinav Gupta, Alexei Efros – Invited applications paper Abstract: Today's state-of-the-art visual object detection systems are based on three key components: 1) sophisticated features (to encode various visual invariances), 2) a powerful classifier (to build a discriminative object class model), and 3) lots of data (to use in large-scale hard-negative mining). While conventional wisdom tends to attribute the success of such methods to the ability of the classifier to generalize across the positive class instances, here we report on empirical findings suggesting that this might not necessarily be the case. We have experimented with a very simple idea: to learn a separate classifier for each positive object instance in the dataset. In this setup, no generalization across the positive instances is possible by definition, and yet, surprisingly, we did not observe any drastic drop in performance compared to the standard, category-based approaches. discussion+video, ICML version (pdf) (bib)
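The per-exemplar idea above is small enough to sketch. Below is a hypothetical toy (2-D points, with a margin-perceptron trainer standing in for the paper's hard-negative-mined linear SVMs): each positive exemplar gets its own classifier against the shared negatives, and a test point's detection score is the maximum over all per-exemplar classifiers.

```python
# Toy sketch of the exemplar-classifier idea (not the authors' full
# HOG / hard-negative-mining pipeline).

def train_exemplar(exemplar, negatives, lr=0.1, epochs=200):
    """Margin-perceptron training of (w, b) for ONE positive exemplar."""
    w = [0.0] * len(exemplar)
    b = 0.0
    for _ in range(epochs):
        # push the single positive past margin +1
        if sum(wi * xi for wi, xi in zip(w, exemplar)) + b <= 1.0:
            w = [wi + lr * xi for wi, xi in zip(w, exemplar)]
            b += lr
        # push every negative past margin -1
        for neg in negatives:
            if sum(wi * xi for wi, xi in zip(w, neg)) + b >= -1.0:
                w = [wi - lr * xi for wi, xi in zip(w, neg)]
                b -= lr
    return w, b

def ensemble_score(x, classifiers):
    """Detection score: the max over all per-exemplar classifiers."""
    return max(sum(wi * xi for wi, xi in zip(w, x)) + b
               for w, b in classifiers)

positives = [[2.0, 2.0], [2.5, 1.5]]                  # positive exemplars
negatives = [[-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]]  # shared negatives
classifiers = [train_exemplar(p, negatives) for p in positives]

print(ensemble_score([2.2, 1.8], classifiers))    # near the positives
print(ensemble_score([-1.5, -1.0], classifiers))  # near the negatives
```

No generalization across positives happens inside any single classifier; the ensemble's max is what covers the category.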

Capturing topical content with frequency and exclusivity Jonathan Bischof, Edoardo Airoldi – Accepted Abstract: Recent work in text analysis commonly describes topics in terms of their most frequent words, but the exclusivity of words to topics is equally important for communicating their content. We introduce Hierarchical Poisson Convolution (HPC), a model which infers regularized estimates of the differential use of words across topics as well as word frequency within topics. HPC uses known hierarchical structure on human labeled topics to make focused comparisons of differential usage within each branch of the tree. We develop a parallelized Hamiltonian Monte Carlo sampler that allows for fast and scalable computation. discussion+video, ICML version (pdf) (bib), more on ArXiv

TrueLabel + Confusions: A Spectrum of Probabilistic Models in Analyzing Multiple Ratings Chao Liu, Yi-Min Wang – Accepted Abstract: This paper revisits the problem of analyzing multiple ratings given by different judges. Different from previous work that focuses on distilling the true labels from noisy crowdsourcing ratings, we emphasize gaining diagnostic insights into our in-house well-trained judges. We generalize the well-known Dawid-Skene model (Dawid & Skene, 1979) to a spectrum of probabilistic models under the same “TrueLabel + Confusion” paradigm, and show that our proposed hierarchical Bayesian model, called HybridConfusion, consistently outperforms Dawid-Skene on both synthetic and real-world data sets. discussion+video, ICML version (pdf) (bib), more on ArXiv
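For readers unfamiliar with the baseline being generalized, here is a minimal Dawid-Skene EM sketch on hypothetical binary ratings (the baseline, not the paper's HybridConfusion model): alternately re-estimate each judge's confusion matrix and the posterior over true labels.

```python
# Minimal Dawid-Skene EM on toy binary ratings (hypothetical data).
ratings = [  # ratings[item][judge] in {0, 1}
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
]
n_items, n_judges, K = len(ratings), len(ratings[0]), 2

# Initialize the label posterior from each item's vote fractions.
post = [[sum(1 for r in row if r == k) / n_judges for k in range(K)]
        for row in ratings]

for _ in range(20):
    # M-step: class prior and one confusion matrix per judge.
    prior = [sum(p[k] for p in post) / n_items for k in range(K)]
    conf = [[[0.0] * K for _ in range(K)] for _ in range(n_judges)]
    for j in range(n_judges):
        for k in range(K):
            mass = sum(post[i][k] for i in range(n_items))
            for l in range(K):
                hits = sum(post[i][k] for i in range(n_items)
                           if ratings[i][j] == l)
                conf[j][k][l] = (hits + 1e-6) / (mass + K * 1e-6)
    # E-step: posterior proportional to prior times judge likelihoods.
    for i in range(n_items):
        lik = []
        for k in range(K):
            p = prior[k]
            for j in range(n_judges):
                p *= conf[j][k][ratings[i][j]]
            lik.append(p)
        z = sum(lik)
        post[i] = [p / z for p in lik]

labels = [max(range(K), key=lambda k: post[i][k]) for i in range(n_items)]
print(labels)
```

The learned confusion matrices are exactly the per-judge diagnostics the paper's "TrueLabel + Confusion" paradigm builds on.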

Robust Multiple Manifold Structure Learning Dian Gong, Xuemei Zhao, Gerard Medioni – Accepted Abstract: We present a robust multiple manifold structure learning (RMMSL) scheme to robustly estimate data structures under the multiple low intrinsic dimensional manifolds assumption. In the local learning stage, RMMSL efficiently estimates local tangent space by weighted low-rank matrix factorization. In the global learning stage, we propose a robust manifold clustering method based on local structure learning results. The proposed clustering method is designed to get the flattest manifold clusters by introducing a novel curved-level similarity function. Our approach is evaluated and compared to state-of-the-art methods on synthetic data, handwritten digit images, human motion capture data and motorbike videos. We demonstrate the effectiveness of the proposed approach, which yields higher clustering accuracy, and produces promising results for challenging tasks of human motion segmentation and motion flow learning from videos. discussion+video, ICML version (pdf) (bib), more on ArXiv

Two Manifold Problems with Applications to Nonlinear System Identification Byron Boots, Geoff Gordon – Accepted Abstract: Recently, there has been much interest in spectral approaches to learning manifolds—so-called kernel eigenmap methods. These methods have had some successes, but their applicability is limited because they are not robust to noise. To address this limitation, we look at two-manifold problems, in which we simultaneously reconstruct two related manifolds, each representing a different view of the same data. By solving these interconnected learning problems together, two-manifold algorithms are able to succeed where a non-integrated approach would fail: each view allows us to suppress noise in the other, reducing bias. We propose a class of algorithms for two-manifold problems, based on spectral decomposition of cross-covariance operators in Hilbert space, and discuss when two-manifold problems are useful. Finally, we demonstrate that solving a two-manifold problem can aid in learning a nonlinear dynamical system from limited data. discussion+video, ICML version (pdf) (bib), more on ArXiv

On the Difficulty of Nearest Neighbor Search Junfeng He, Sanjiv Kumar, Shih-Fu Chang – Accepted Abstract: Fast approximate nearest neighbor search in large databases is becoming popular. Several powerful learning-based formulations have been proposed recently. However, not much attention has been paid to a more fundamental question: how difficult is (approximate) nearest neighbor search in a given data set? More broadly, which data properties affect the nearest neighbor search and how? This paper introduces the first concrete measure called Relative Contrast that can be used to evaluate the influence of several crucial data characteristics such as dimensionality, sparsity, and database size simultaneously in arbitrary normed metric spaces. To further justify why relative contrast is an important and effective measure, we present a theoretical analysis to prove how relative contrast determines/affects the performance/complexity of Locality Sensitive Hashing, a popular hashing based approximate nearest neighbor search method. Finally, relative contrast also provides an explanation for a family of heuristic hashing algorithms based on PCA with good practical performance. discussion+video, ICML version (pdf) (bib), more on ArXiv
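Relative contrast itself is cheap to estimate. The sketch below (a toy estimate on uniform random data, not the paper's analysis) takes the ratio of a query's mean distance to its nearest-neighbor distance, and shows it shrinking toward 1 as dimensionality grows — precisely the regime where nearest neighbor search gets hard.

```python
# Toy estimate of relative contrast on uniform random data.
import math, random

def relative_contrast(queries, data):
    """Average (mean distance / nearest distance) over the queries."""
    ratios = []
    for q in queries:
        dists = [math.dist(q, x) for x in data]
        ratios.append(sum(dists) / len(dists) / min(dists))
    return sum(ratios) / len(ratios)

random.seed(0)
def sample(dim, n):
    return [[random.random() for _ in range(dim)] for _ in range(n)]

rc_2d = relative_contrast(sample(2, 20), sample(2, 500))
rc_100d = relative_contrast(sample(100, 20), sample(100, 500))
print(rc_2d, rc_100d)  # contrast shrinks toward 1 as dimension grows
```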

Learning Force Control Policies for Compliant Robotic Manipulation Mrinal Kalakrishnan, Ludovic Righetti, Peter Pastor, Stefan Schaal – Invited applications paper Abstract: Developing robots capable of fine manipulation skills is of major importance in order to build truly assistive robots. These robots need to be compliant in their actuation and control in order to operate safely in human environments. Manipulation tasks imply complex contact interactions with the external world, and involve reasoning about the forces and torques to be applied. Planning under contact conditions is usually impractical due to computational complexity, and a lack of precise dynamics models of the environment. We present an approach to acquiring manipulation skills on compliant robots through reinforcement learning. The initial position control policy for manipulation is initialized through kinesthetic demonstration. This policy is augmented with a force/torque profile to be controlled in combination with the position trajectories. The Policy Improvement with Path Integrals (PI^2) algorithm is used to learn these force/torque profiles by optimizing a cost function that measures task success. We introduce a policy representation that ensures trajectory smoothness during exploration and learning. Our approach is demonstrated on the Barrett WAM robot arm equipped with a 6-DOF force/torque sensor on two different manipulation tasks: opening a door with a lever door handle, and picking up a pen off the table. We show that the learnt force control policies allow successful, robust execution of the tasks. discussion+video, ICML version (pdf) (bib)

Estimation of Simultaneously Sparse and Low Rank Matrices Pierre-André Savalle, Emile Richard, Nicolas Vayatis – Accepted Abstract: The paper introduces a penalized matrix estimation procedure aiming at solutions which are sparse and low-rank at the same time. Such structures arise in the context of social networks or protein interactions where underlying graphs have adjacency matrices which are block-diagonal in the appropriate basis. We introduce a convex mixed penalty which involves ℓ_1-norm and trace norm simultaneously. We obtain an oracle inequality which indicates how the two effects interact according to the nature of the target matrix. We bound generalization error in the link prediction problem. We also develop proximal descent strategies to solve the optimization problem efficiently and evaluate performance on synthetic and real data sets. discussion+video, ICML version (pdf) (bib), more on ArXiv

Online Structured Prediction via Coactive Learning Pannaga Shivaswamy, Thorsten Joachims – Accepted Abstract: We propose Coactive Learning as a model of interaction between a learning system and a human user, where both have the common goal of providing results of maximum utility to the user. At each step, the system (e.g. search engine) receives a context (e.g. query) and predicts an object (e.g. ranking). The user responds by correcting the system if necessary, providing a slightly improved – but not necessarily optimal – object as feedback. We argue that such feedback can be inferred from observable user behavior, specifically clicks in web search. Evaluating predictions by their cardinal utility to the user, we propose efficient learning algorithms that have O(1/sqrt{T}) average regret, even though the learning algorithm never observes cardinal utility values. We demonstrate the applicability of our model and learning algorithms on a movie recommendation task, as well as ranking for web search. discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)
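The learning rule at the heart of this model can be sketched as a preference perceptron on hypothetical toy utilities (not the paper's experiments): predict the best object under the current weights, receive a slightly better — not necessarily optimal — object from the user, and move the weights toward the feedback's features.

```python
# Preference-perceptron sketch of coactive learning (toy utilities).
true_w = [3.0, 1.0]   # user's hidden utility weights (unknown to system)
objects = [[0.0, 1.0], [0.7, 0.7], [0.9, 0.2], [1.0, 0.0]]

def utility(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

w = [0.0, 0.0]        # the system's learned weights
for _ in range(50):
    pred = max(objects, key=lambda x: utility(w, x))        # system's pick
    better = [x for x in objects
              if utility(true_w, x) > utility(true_w, pred)]
    if not better:                                          # already optimal
        break
    fb = min(better, key=lambda x: utility(true_w, x))      # slightly better
    w = [wi + fi - pi for wi, fi, pi in zip(w, fb, pred)]   # perceptron step

best = max(objects, key=lambda x: utility(true_w, x))
print(max(objects, key=lambda x: utility(w, x)) == best)
```

Note that the learner only ever sees the improved object, never the user's cardinal utilities — the setting the paper's regret bounds address.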

Using CCA to improve CCA: A new spectral method for estimating vector models of words Paramveer Dhillon, Jordan Rodu, Dean Foster, Lyle Ungar – Accepted Abstract: Unlabeled data is often used to learn representations which can be used to supplement baseline features in a supervised learner. For example, for text applications where the words lie in a very high dimensional space (the size of the vocabulary), one can learn a low rank “dictionary” by an eigen-decomposition of the word co-occurrence matrix (e.g. using PCA or CCA). In this paper, we present a new spectral method based on CCA to learn an eigenword dictionary. Our improved procedure computes two set of CCAs, the first one between the left and right contexts of the given word and the second one between the projections resulting from this CCA and the word itself. We prove theoretically that this two-step procedure has lower sample complexity than the simple single step procedure and also illustrate the empirical efficacy of our approach and the richness of representations learned by our Two Step CCA (TSCCA) procedure on the tasks of POS tagging and sentiment classification. discussion+video, ICML version (pdf) (bib), more on ArXiv

A Discrete Optimization Approach for Supervised Ranking with an Application to Reverse-Engineering Quality Ratings Allison Chang, Cynthia Rudin, Dimitris Bertsimas, Michael Cavaretta, Robert Thomas, Gloria Chou – Not for proceedings Abstract: We present a new methodology based on mixed integer optimization (MIO) for supervised ranking tasks. Other methods for supervised ranking approximate ranking quality measures by convex functions in order to accommodate extremely large problems, at the expense of exact solutions. As our MIO approach provides exact modeling for ranking problems, our solutions are benchmarks for the other non-exact methods. We report computational results that demonstrate significant advantages for MIO methods over current state-of-the-art. We also use our technique for a new application: reverse-engineering quality rankings. A good or bad product quality rating can make or break an organization, and in order to invest wisely in product development, organizations are starting to use intelligent approaches to reverse-engineer the rating models. We present experiments on data from a major quality rating company, and provide new methods for evaluating the solution. In addition, we provide an approach to use the reverse-engineered model to achieve a top ranked product in a cost-effective way. discussion+video pdf

Bounded Planning in Passive POMDPs Roy Fox, Naftali Tishby – Accepted Abstract: In Passive POMDPs actions do not affect the world state, but still incur costs. When the agent is bounded by information-processing constraints, it can only keep an approximation of the belief. We present a variational principle for the problem of maintaining the information which is most useful for minimizing the cost, and introduce an efficient and simple algorithm for finding an optimum. discussion+video, ICML version (pdf) (bib), more on ArXiv

Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss Shai Ben-David, David Loker, Nathan Srebro, Karthik Sridharan – Accepted Abstract: We carefully study how well minimizing convex surrogate loss functions corresponds to minimizing the misclassification error rate for the problem of binary classification with linear predictors. In particular, we show that among all convex surrogate losses, the hinge loss gives essentially the best possible bound for the misclassification error rate of the resulting linear predictor in terms of the best possible margin error rate. We also provide lower bounds for specific convex surrogates that show how different commonly used losses qualitatively differ from each other. discussion+video, ICML version (pdf) (bib), more on ArXiv
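The basic surrogate property is easy to check numerically: for any margin z = y·⟨w, x⟩, the hinge loss max(0, 1 − z) upper-bounds the 0-1 loss, which is what makes minimizing it meaningful for the error rate. (The paper's contribution is the much finer claim that, among convex losses, this bound is essentially the best achievable.)

```python
# The hinge loss upper-bounds the 0-1 misclassification loss.
def hinge(z):
    return max(0.0, 1.0 - z)

def zero_one(z):
    return 1.0 if z <= 0 else 0.0

margins = [-2.0, -0.5, 0.0, 0.3, 1.0, 2.5]
print([(z, zero_one(z), hinge(z)) for z in margins])
```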

Bayesian Efficient Multiple Kernel Learning Mehmet Gönen – Accepted Abstract: Multiple kernel learning algorithms are proposed to combine kernels in order to obtain a better similarity measure or to integrate feature representations coming from different data sources. Most of the previous research on such methods is focused on the computational efficiency issue. However, it is still not feasible to combine many kernels using existing Bayesian approaches due to their high time complexity. We propose a fully conjugate Bayesian formulation and derive a deterministic variational approximation, which allows us to combine hundreds or thousands of kernels very efficiently. We briefly explain how the proposed method can be extended for multiclass learning and semi-supervised learning. Experiments with large numbers of kernels on benchmark data sets show that our inference method is quite fast, requiring less than a minute. On one bioinformatics and three image recognition data sets, our method outperforms previously reported results with better generalization performance. discussion+video, ICML version (pdf) (bib), more on ArXiv

Bayesian Nonexhaustive Learning for Online Discovery and Modeling of Emerging Classes Murat Dundar, Ferit Akova, Alan Qi, Bartek Rajwa – Accepted Abstract: In this study we present a framework for online inference in the presence of a nonexhaustively defined set of classes that incorporates supervised classification with class discovery and modeling. A Dirichlet process prior (DPP) model defined over class distributions ensures that both known and unknown class distributions originate according to a common base distribution. In an attempt to automatically discover potentially interesting class formations, the prior model is coupled with a suitably chosen data model, and sequential Monte Carlo sampling is used to perform online inference. Our approach takes into account the rapidly accumulating nature of samples representing emerging classes, which is most evident in the biodetection application considered in this study, where a new class of pathogen may suddenly appear, and the rapid increase in the number of samples originating from this class indicates the onset of an outbreak. discussion+video, ICML version (pdf) (bib), more on ArXiv

Exact Soft Confidence-Weighted Learning Steven C.H. Hoi, Jialei Wang, Peilin Zhao – Accepted Abstract: In this paper, we propose a new Soft Confidence-Weighted (SCW) online learning scheme, which enables the conventional confidence-weighted learning method to handle non-separable cases. Unlike the previous confidence-weighted learning algorithms, the proposed soft confidence-weighted learning method enjoys all four salient properties: (i) large margin training, (ii) confidence weighting, (iii) the capability to handle non-separable data, and (iv) an adaptive margin. Our experimental results show that SCW performs significantly better than the original CW algorithm. When compared with the state-of-the-art AROW algorithm, we found that SCW in general achieves better or at least comparable predictive performance, while enjoying considerably better efficiency (i.e., requiring fewer updates and less runtime). discussion+video, ICML version (pdf) (bib), more on ArXiv

Distributed Tree Kernels Fabio Massimo Zanzotto, Lorenzo Dell'Arciprete – Accepted Abstract: In this paper, we propose the distributed tree kernels (DTK) as a novel method to reduce time and space complexity of tree kernels. Using a linear complexity algorithm to compute vectors for trees, we embed feature spaces of tree fragments in low-dimensional spaces where the kernel computation is directly done with dot product. We show that DTKs are faster, correlate with tree kernels, and obtain a statistically similar performance in two natural language processing tasks. discussion+video, ICML version (pdf) (bib), more on ArXiv

Multiple Kernel Learning from Noisy Labels by Stochastic Programming Tianbao Yang, Mehrdad Mahdavi, Rong Jin, Lijun Zhang, Yang Zhou – Accepted Abstract: We study the problem of multiple kernel learning from noisy labels. This is in contrast to most of the previous studies on multiple kernel learning that mainly focus on developing efficient algorithms and assume perfectly labeled training examples. Directly applying the existing multiple kernel learning algorithms to noisily labeled examples often leads to suboptimal performance due to the incorrect class assignments. We address this challenge by casting multiple kernel learning from noisy labels into a stochastic programming problem, and presenting a minimax formulation. We develop an efficient algorithm for solving the related convex-concave optimization problem with a fast convergence rate of O(1/T) where T is the number of iterations. Empirical studies on UCI data sets verify both the effectiveness of the proposed framework and the efficiency of the proposed optimization algorithm. discussion+video, ICML version (pdf) (bib), more on ArXiv

Improved Nystrom Low-rank Decomposition with Priors Kai Zhang, Liang Lan, Jun Liu, Andreas Rauber – Accepted Abstract: Low-rank matrix decomposition has gained great popularity recently in scaling up kernel methods to large amounts of data. However, some limitations can prevent these methods from working effectively in certain domains. For example, many existing approaches are intrinsically unsupervised and do not incorporate side information (e.g., class labels) to produce task-specific decompositions; also, they typically work “transductively”, and the factorization does not generalize to new samples, causing considerable inconvenience when the data are updated. To solve these problems, in this paper we propose an “inductive”-flavored method for low-rank kernel decomposition with priors or side information. We achieve this by generalizing the Nyström method in a novel way. On the one hand, our approach employs a highly flexible, nonparametric structure that allows us to generalize the low-rank factors to arbitrarily new samples; on the other hand, it has linear time and space complexities, which can be orders of magnitude faster than existing approaches and renders great efficiency in learning a low-rank kernel decomposition. Empirical results demonstrate the efficacy and efficiency of the proposed method. discussion+video, ICML version (pdf) (bib), more on ArXiv

Active Learning for Matching Problems Laurent Charlin, Rich Zemel, Craig Boutilier – Accepted Abstract: Effective learning of user preferences is critical to easing user burden in various types of matching problems. Equally important is active query selection to further reduce the amount of preference information users must provide. We address the problem of active learning of user preferences for matching problems, introducing a novel method for determining probabilistic matchings, and developing several new active learning strategies that are sensitive to the specific matching objective. Experiments with real-world data sets spanning diverse domains demonstrate that matching-sensitive active learning outperforms standard techniques. discussion+video, ICML version (pdf) (bib), more on ArXiv

Ensemble Methods for Convex Regression with Applications to Geometric Programming Based Circuit Design Lauren Hannah, David Dunson – Accepted Abstract: Convex regression is a promising area for bridging statistical estimation and deterministic convex optimization. We develop a new piecewise linear convex regression method that uses the Convex Adaptive Partitioning (CAP) estimator in an ensemble setting, Ensemble Convex Adaptive Partitioning (E-CAP). The ensembles alleviate some problems associated with convex piecewise linear estimators, such as instability when used to approximate constraints or objective functions for optimization, while maintaining desirable properties, such as consistency and O(n log(n)^2) computational complexity. We empirically demonstrate that E-CAP outperforms existing convex regression methods both when used for prediction and optimization. We then apply E-CAP to device modeling and constraint approximation for geometric programming based circuit design. discussion+video, ICML version (pdf) (bib), more on ArXiv

Groupwise Constrained Reconstruction for Subspace Clustering Ruijiang Li, Bin Li, Cheng Jin, Xiangyang Xue – Accepted Abstract: Recently proposed subspace clustering methods first compute a self-reconstruction matrix for the dataset, then convert it to an affinity matrix, which is input to a spectral clustering method to obtain the final clustering result. Their success largely rests on the subspace independence assumption, which, however, does not always hold for applications with an increasing number of clusters, such as face clustering. In this paper, we propose a novel reconstruction-based subspace clustering method that does not make the subspace independence assumption. In our model, certain properties of the reconstruction matrix are explicitly characterized using the latent cluster indicators, and the affinity matrix input to the spectral clustering is built from the posterior of the cluster indicators. Evaluations on both synthetic and real-world datasets show that our method can outperform the state of the art. discussion+video, ICML version (pdf) (bib), more on ArXiv

Stability of matrix factorization for collaborative filtering Yu-Xiang Wang, Huan Xu – Accepted Abstract: We study the stability, vis-à-vis adversarial noise, of matrix factorization algorithms for matrix completion. In particular, our results include: (I) we bound the gap between the solution matrix of the factorization method and the ground truth in terms of root mean square error; (II) we treat the matrix factorization as a subspace fitting problem and analyze the difference between the solution subspace and the ground truth; (III) we analyze the prediction error of individual users based on the subspace stability. We apply these results to the problem of collaborative filtering under manipulator attack, which leads to useful insights and guidelines for collaborative filtering system design. discussion+video, ICML version (pdf) (bib), more on ArXiv
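The setting being analyzed can be sketched in a few lines (toy rank-1 data and plain SGD, not the paper's algorithm or bounds): factor a partially observed matrix as u·vᵀ using only the observed entries, the completion formulation underlying collaborative filtering.

```python
# Toy rank-1 matrix completion by SGD on the observed entries only.
M = [[2.0, 4.0, 6.0],
     [1.0, 2.0, 3.0],
     [3.0, 6.0, 9.0]]                                # exactly rank 1
observed = [(0, 0), (0, 2), (1, 1), (2, 0), (2, 2)]  # known entries

u = [1.0, 1.0, 1.0]                                  # factors to learn
v = [1.0, 1.0, 1.0]

def rmse(entries):
    err = [(M[i][j] - u[i] * v[j]) ** 2 for i, j in entries]
    return (sum(err) / len(err)) ** 0.5

before = rmse(observed)
lr = 0.02
for _ in range(500):                 # SGD passes over the observed entries
    for i, j in observed:
        r = M[i][j] - u[i] * v[j]    # residual on one entry
        u[i], v[j] = u[i] + lr * r * v[j], v[j] + lr * r * u[i]
after = rmse(observed)
print(before, after)                 # the training RMSE shrinks
```

The paper's question is what happens to such a solution when some observed entries are adversarially perturbed — e.g., by manipulators injecting fake ratings.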

Adaptive Regularization for Similarity Measures Koby Crammer, Gal Chechik – Accepted Abstract: Algorithms for learning distributions over weight-vectors, such as AROW, were recently shown empirically to achieve state-of-the-art performance on various problems, with strong theoretical guarantees. Extending these algorithms to matrix models poses a challenge, since the number of free parameters in the covariance of the distribution scales as n^4 with the dimension n of the matrix. We describe, analyze and experiment with two new algorithms for learning distributions over matrix models. Our first algorithm maintains a diagonal covariance over the parameters and is able to handle large covariance matrices. The second algorithm factors the covariance, capturing some inter-feature correlations while keeping the number of parameters linear in the size of the original matrix. We analyze the diagonal algorithm in the mistake bound model and show the superior precision of our approach over other algorithms in two tasks: retrieving similar images, and ranking similar documents. The second algorithm is shown to attain a faster convergence rate. discussion+video, ICML version (pdf) (bib), more on ArXiv

Linear Off-Policy Actor-Critic Thomas Degris, Martha White, Richard Sutton – Accepted Abstract: This paper presents the first off-policy actor-critic reinforcement learning algorithm with a per-time-step complexity that scales linearly with the number of learned parameters. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental, linear time and space complexity algorithm that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems. discussion+video, ICML version (pdf) (bib), more on ArXiv (includes revisions)

Modeling Latent Variable Uncertainty for Loss-based Learning M. Pawan Kumar, Ben Packer, Daphne Koller – Accepted Abstract: We consider the problem of parameter estimation using weakly supervised datasets, where a training sample consists of the input and a partially specified annotation (called the output). In addition, the missing information in the annotation is modeled using latent variables. Traditional methods, such as expectation-maximization, overburden a single distribution with two separate tasks: (i) modeling the uncertainty in the latent variables during training; and (ii) making accurate predictions for the output and the latent variables during testing. We propose a novel framework that separates the demands of the two tasks using two distributions: (i) a conditional distribution to model the uncertainty of the latent variables for a given input-output pair; and (ii) a delta distribution to predict the output and the latent variables for a given input. During learning, we encourage agreement between the two distributions by minimizing a loss-based dissimilarity coefficient. Our approach generalizes latent SVM in two important ways: (i) it models the uncertainty over latent variables instead of relying on a pointwise estimate; and (ii) it allows the use of loss functions that depend on latent variables, which greatly increases its applicability. We demonstrate the efficacy of our approach on two challenging problems—object detection and action detection—using publicly available datasets. discussion+video, ICML version (pdf) (bib), more on ArXiv

Dimensionality Reduction by Local Discriminative Gaussians Nathan Parrish, Maya Gupta – Accepted Abstract: We present local discriminative Gaussian (LDG) dimensionality reduction, a supervised linear dimensionality reduction technique that acts locally to each training point in order to find a mapping where similar data can be discriminated from dissimilar data. We focus on the classification setting; however, our algorithm can be applied whenever training data is accompanied with similarity or dissimilarity constraints. Our experiments show that LDG is superior to other state-of-the-art linear dimensionality reduction techniques when the number of features in the original data is large. We also adapt LDG to the transfer learning setting, and show that it achieves good performance when the test data distribution differs from that of the training data. discussion+video, ICML version (pdf) (bib), more on ArXiv

Learning to Label Aerial Images from Noisy Data Volodymyr Mnih, Geoffrey Hinton – Accepted Abstract: When training a system to label images, the amount of labeled training data tends to be a limiting factor. We consider the task of learning to label aerial images from existing maps. These provide abundant labels, but the labels are often incomplete and sometimes poorly registered. We propose two robust loss functions for dealing with these kinds of label noise and use the loss functions to train a deep neural network on two challenging aerial image datasets. The robust loss functions lead to big improvements in performance and our best system substantially outperforms the best published results on the task we consider. discussion+video, ICML version (pdf) (bib)

The Most Persistent Soft-Clique in a Set of Sampled Graphs Novi Quadrianto, Chao Chen, Christoph Lampert – Accepted Abstract: When searching for characteristic subpatterns in potentially noisy graph data, it appears self-evident that having multiple observations would be better than having just one. However, it turns out that the inconsistencies introduced when different graph instances have different edge sets pose a serious challenge. In this work we address this challenge for the problem of finding maximum weighted cliques. We introduce the concept of the most persistent soft-clique. This is a subset of vertices that 1) is almost fully or at least densely connected, 2) occurs in all or almost all graph instances, and 3) has the maximum weight. We present a measure of clique-ness that essentially counts the number of edges missing to make a subset of vertices into a clique. With this measure, we show that the problem of finding the most persistent soft-clique can be cast either as: a) a max-min two-person game optimization problem, or b) a min-min soft margin optimization problem. Both formulations lead to the same solution when using a partial Lagrangian method to solve the optimization problems. By experiments on synthetic data and on real social network data, we show that the proposed method is able to reliably find soft cliques in graph data, even if it is distorted by random noise or unreliable observations. discussion+video, ICML version (pdf) (bib), more on ArXiv
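
The clique-ness measure the abstract mentions — counting how many edges are missing before a vertex subset becomes a clique — can be sketched in a few lines (the function name and edge-list representation are illustrative, not the authors' code):

```python
from itertools import combinations

def clique_deficiency(edges, subset):
    """Number of edges missing to make `subset` a clique: a sketch of the
    clique-ness deficiency described in the abstract (0 means a true clique)."""
    edge_set = {frozenset(e) for e in edges}
    return sum(1 for pair in combinations(subset, 2)
               if frozenset(pair) not in edge_set)
```

A triangle has deficiency 0; dropping one of its edges raises the deficiency to 1.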

Learning Efficient Structured Sparse Models Alex Bronstein, Pablo Sprechmann, Guillermo Sapiro – Accepted Abstract: We present a comprehensive framework for structured sparse coding and modeling extending the recent ideas of using learnable fast regressors to approximate exact sparse codes. For this purpose, we develop a novel block-coordinate proximal splitting method for the iterative solution of hierarchical sparse coding problems, and show an efficient feed-forward architecture derived from its iteration. This architecture faithfully approximates the exact structured sparse codes with a fraction of the complexity of the standard optimization methods. We also show that by using different training objective functions, learnable sparse encoders are no longer restricted to be mere approximants of the exact sparse code for a pre-given dictionary, as in earlier formulations, but can rather be used as full-featured sparse encoders or even modelers. A simple implementation shows several orders of magnitude speedup compared to the state-of-the-art with minimal performance degradation, making the proposed framework suitable for real-time and large-scale applications. discussion+video, ICML version (pdf) (bib), more on ArXiv

PAC Subset Selection in Stochastic Multi-armed Bandits Shivaram Kalyanakrishnan, Ambuj Tewari, Peter Auer, Peter Stone – Accepted Abstract: We consider the problem of selecting, from among the arms of a stochastic n-armed bandit, a subset of size m of those arms with the highest expected rewards, based on efficiently sampling the arms. This “subset selection” problem finds application in a variety of areas. Kalyanakrishnan & Stone (2010) frame this problem under a PAC setting (denoting it “Explore-m”) and analyze corresponding sampling algorithms both formally and experimentally. Whereas their formal analysis is restricted to the worst-case sample complexity of algorithms, in this paper, we design and analyze an algorithm (“LUCB”) with improved expected sample complexity. Interestingly, LUCB bears a close resemblance to the well-known UCB algorithm for regret minimization. We obtain a sample complexity bound for LUCB that matches the best existing bound for single-arm selection (that is, when m = 1). We also provide a lower bound on the worst-case sample complexity of PAC algorithms for Explore-m. discussion+video, ICML version (pdf) (bib)
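
A rough sketch of the LUCB idea — repeatedly sample the weakest member of the current top-m set and the strongest challenger outside it until their confidence bounds separate — might look like the following (the exploration rate and stopping rule here are simplified assumptions, not the paper's exact constants):

```python
import math, random

def lucb(pull, n, m, delta=0.05, eps=0.1, max_pulls=200000):
    """Simplified LUCB sketch: return indices of (approximately) the m best
    of n arms. `pull(i)` draws a reward in [0, 1] from arm i."""
    counts = [1] * n
    sums = [pull(i) for i in range(n)]          # one initial pull per arm
    t = n
    while t < max_pulls:
        means = [s / c for s, c in zip(sums, counts)]
        beta = lambda c: math.sqrt(math.log(5 * n * t * t / (4 * delta)) / (2 * c))
        order = sorted(range(n), key=lambda i: -means[i])
        top, rest = order[:m], order[m:]
        h = min(top, key=lambda i: means[i] - beta(counts[i]))   # weakest LCB in top
        l = max(rest, key=lambda i: means[i] + beta(counts[i]))  # strongest UCB outside
        if (means[l] + beta(counts[l])) - (means[h] - beta(counts[h])) <= eps:
            return set(top)
        for i in (h, l):                          # sample the two contested arms
            sums[i] += pull(i); counts[i] += 1; t += 1
    return set(sorted(range(n), key=lambda i: -sums[i] / counts[i])[:m])
```

With well-separated Bernoulli arms, the contested pair absorbs most of the pulls, which is the source of the improved expected sample complexity.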

Nonparametric variational inference Samuel Gershman, Matt Hoffman, David Blei – Accepted Abstract: Variational methods are widely used for approximate posterior inference. However, their use is typically limited to families of distributions that enjoy particular conjugacy properties. To circumvent this limitation, we propose a family of variational approximations inspired by nonparametric kernel density estimation. The locations of these kernels and their bandwidth are treated as variational parameters and optimized to improve an approximate lower bound on the marginal likelihood of the data. Using multiple kernels allows the approximation to capture multiple modes of the posterior, unlike most other variational approximations. We demonstrate the efficacy of the nonparametric approximation with a hierarchical logistic regression model and a nonlinear matrix factorization model. We obtain predictive performance as good as or better than more specialized variational methods and sample-based approximations. The method is easy to apply to more general graphical models for which standard variational methods are difficult to derive. discussion+video, ICML version (pdf) (bib), more on ArXiv

The Convexity and Design of Composite Multiclass Losses Mark Reid, Robert Williamson, Peng Sun – Accepted Abstract: We consider composite loss functions for multiclass prediction comprising a proper (i.e., Fisher-consistent) loss over probability distributions and an inverse link function. We establish conditions for their (strong) convexity and explore their implications. We also show how the separation of concerns afforded by using this composite representation allows for the design of families of losses with the same Bayes risk. discussion+video, ICML version (pdf) (bib), more on ArXiv

Finding Botnets Using Minimal Graph Clusterings Peter Haider, Tobias Scheffer – Accepted Abstract: We study the problem of identifying botnets and the IP addresses which they comprise, based on the observation of a fraction of the global email spam traffic. Observed mailing campaigns constitute evidence for joint botnet membership; they are represented by cliques in the graph of all messages. No evidence against an association of nodes is ever available. We reduce the problem of identifying botnets to a problem of finding a minimal clustering of the graph of messages. We directly model the distribution of clusterings given the input graph; this avoids potential errors caused by distributional assumptions of a generative model. We report on a case study in which we evaluate the model by its ability to predict the spam campaign that a given IP address is going to participate in. discussion+video, ICML version (pdf) (bib), more on ArXiv

Learning the Experts for Online Sequence Prediction Elad Eban, Aharon Birnbaum, Shai Shalev-Shwartz, Amir Globerson – Accepted Abstract: Online sequence prediction is the problem of predicting the next element of a sequence given previous elements. This problem has been extensively studied in the context of individual sequence prediction, where no prior assumptions are made on the origin of the sequence. Individual sequence prediction algorithms work quite well for long sequences, where the algorithm has enough time to learn the temporal structure of the sequence. However, they might give poor predictions for short sequences. A possible remedy is to rely on the general model of prediction with expert advice, where the learner has access to a set of r experts, each of which makes its own predictions on the sequence. It is well known that it is possible to predict almost as well as the best expert if the sequence length is on the order of log(r). However, without firm prior knowledge of the problem, it is not clear how to choose a small set of good experts. In this paper we describe and analyze a new algorithm that learns a good set of experts using a training set of previously observed sequences. We demonstrate the merits of our approach by experimenting with the task of click prediction on the web. discussion+video, ICML version (pdf) (bib), more on ArXiv

Efficient Active Algorithms for Hierarchical Clustering Akshay Krishnamurthy, Sivaraman Balakrishnan, Min Xu, Aarti Singh – Accepted Abstract: Advances in sensing technologies and the growth of the internet have resulted in an explosion in the size of modern datasets, while storage and processing power continue to lag behind. This motivates the need for algorithms that are efficient, both in terms of the number of measurements needed and running time. To combat the challenges associated with large datasets, we propose a general framework for active hierarchical clustering that repeatedly runs an off-the-shelf clustering algorithm on small subsets of the data and comes with guarantees on performance, measurement complexity and runtime complexity. We instantiate this framework with a simple spectral clustering algorithm and provide concrete results on its performance, showing that, under some assumptions, this algorithm recovers all clusters of size Omega(log n) using O(n log^2 n) similarities and runs in O(n log^3 n) time for a dataset of n objects. Through extensive experimentation we also demonstrate that this framework is practically alluring. discussion+video, ICML version (pdf) (bib), more on ArXiv

Copula Mixture Model for Dependency-seeking Clustering Melanie Rey, Volker Roth – Accepted Abstract: We introduce a copula mixture model to perform dependency-seeking clustering when co-occurring samples from different data sources are available. The model takes advantage of the great flexibility offered by the copulas framework to extend mixtures of Canonical Correlation Analysis to multivariate data with arbitrary continuous marginal densities. We formulate our model as a non-parametric Bayesian mixture, while providing efficient MCMC inference. Experiments on synthetic and real data demonstrate that the increased flexibility of the copula mixture significantly improves the clustering and the interpretability of the results. discussion+video, ICML version (pdf) (bib), more on ArXiv

The Landmark Selection Method for Multiple Output Prediction Krishnakumar Balasubramanian, Guy Lebanon – Accepted Abstract: Conditional modeling x → y is a central problem in machine learning. A substantial research effort is devoted to such modeling when x is high dimensional. We consider, instead, the case of a high dimensional y, where x is either low dimensional or high dimensional. Our approach is based on selecting a small subset y_L of the dimensions of y, and then modeling (i) x → y_L and (ii) y_L → y. Composing these two models, we obtain a conditional model x → y that possesses convenient statistical properties. Classification and regression experiments on multiple datasets show that this model outperforms the one vs. all approach as well as several sophisticated multiple output prediction methods. discussion+video, ICML version (pdf) (bib), more on ArXiv
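
The two-stage composition the abstract describes can be illustrated with plain least squares (the linear maps and synthetic data below are illustrative assumptions; the paper's actual estimators and landmark-selection procedure may differ):

```python
import numpy as np

def fit_landmark_model(X, Y, landmarks):
    """Landmark-style two-stage regression sketch: learn x -> y_L with
    least squares, then y_L -> y, and compose the two linear maps."""
    YL = Y[:, landmarks]
    A, *_ = np.linalg.lstsq(X, YL, rcond=None)   # stage (i): x -> y_L
    B, *_ = np.linalg.lstsq(YL, Y, rcond=None)   # stage (ii): y_L -> y
    return lambda Xnew: (Xnew @ A) @ B
```

When the remaining outputs really are functions of the landmark outputs, the composition recovers the full output exactly; otherwise it trades a little bias for a much smaller model.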

Subgraph Matching Kernels for Attributed Graphs Nils Kriege, Petra Mutzel – Accepted Abstract: We propose graph kernels based on subgraph matchings, i.e. structure-preserving bijections between subgraphs. While recently proposed kernels based on common subgraphs (Wale et al., 2008; Shervashidze et al., 2009) in general cannot be applied to attributed graphs, our approach makes it possible to rate mappings of subgraphs by a flexible scoring scheme comparing vertex and edge attributes by kernels. We show that subgraph matching kernels generalize several known kernels. To compute the kernel we propose a graph-theoretical algorithm inspired by a classical relation between common subgraphs of two graphs and cliques in their product graph observed by Levi (1973). Encouraging experimental results on a classification task of real-world graphs are presented. discussion+video, ICML version (pdf) (bib), more on ArXiv

Adaptive Canonical Correlation Analysis Based On Matrix Manifolds Florian Yger, Maxime Berar, Gilles Gasso, Alain Rakotomamonjy – Accepted Abstract: In this paper, we formulate the Canonical Correlation Analysis (CCA) problem on matrix manifolds. This framework provides a natural way for dealing with matrix constraints and tools for building efficient algorithms even in an adaptive setting. Finally, an adaptive CCA algorithm is proposed and applied to a change detection problem in EEG signals. discussion+video, ICML version (pdf) (bib), more on ArXiv

Batch Active Learning via Coordinated Matching Javad Azimi, Alan Fern, Xiaoli Zhang-Fern, Glencora Borradaile, Brent Heeringa – Accepted Abstract: Most prior work on active learning of classifiers has focused on sequentially selecting one unlabeled example at a time to be labeled in order to reduce the overall labeling effort. In many scenarios, however, it is desirable to label an entire batch of examples at once, for example, when labels can be acquired in parallel. This motivates us to study batch active learning, which iteratively selects batches of k>1 examples to be labeled. We propose a novel batch active learning method that leverages the availability of high-quality and efficient sequential active-learning policies by attempting to approximate their behavior when applied for k steps. Specifically, our algorithm first uses Monte-Carlo simulation to estimate the distribution of unlabeled examples selected by a sequential policy over k-step executions. The algorithm then attempts to select a set of k examples that best matches this distribution, leading to a combinatorial optimization problem that we term “bounded coordinated matching”. While we show this problem is NP-hard in general, we give an efficient greedy solution, which inherits approximation bounds from supermodular minimization theory. Our experimental results on eight benchmark datasets show that the proposed approach is highly effective. discussion+video, ICML version (pdf) (bib), more on ArXiv

Hybrid Batch Bayesian Optimization Javad Azimi, Ali Jalali, Xiaoli Zhang-Fern – Accepted Abstract: Bayesian Optimization (BO) aims at optimizing an unknown function that is neither convex nor concave and is costly to evaluate. We are interested in application scenarios where concurrent function evaluations are possible. Under such a setting, BO could choose to either sequentially evaluate the function, one input at a time and wait for the output of the function before making the next selection, or evaluate the function at a batch of multiple inputs at once. These two different settings are commonly referred to as the sequential and batch settings of Bayesian Optimization. In general, the sequential setting leads to better optimization performance as each function evaluation is selected with more information, whereas the batch setting has an advantage in terms of the total experimental time (the number of iterations). In this work, our goal is to combine the strength of both settings. Specifically, we systematically analyze Bayesian optimization using Gaussian process as the posterior estimator and provide a hybrid algorithm that, based on the current state, dynamically switches between a sequential policy and a batch policy with variable batch sizes. We provide theoretical justification for our algorithm and present experimental results on eight benchmark BO problems. The results show that our method achieves substantial speedup (up to 78%) compared to a pure sequential policy, without suffering any significant performance loss. discussion+video, ICML version (pdf) (bib)

Efficient and Practical Stochastic Subgradient Descent for Nuclear Norm Regularization Haim Avron, Satyen Kale, Shiva Kasiviswanathan, Vikas Sindhwani – Accepted Abstract: We describe novel subgradient methods for a broad class of matrix optimization problems involving nuclear norm regularization. Unlike existing approaches, our method executes very cheap iterations by combining low-rank stochastic subgradients with efficient incremental SVD updates, made possible by highly optimized and parallelizable dense linear algebra operations on small matrices. Our practical algorithms always maintain a low-rank factorization of iterates that can be conveniently held in memory and efficiently multiplied to generate predictions in matrix completion settings. Empirical comparisons confirm that our approach is highly competitive with several recently proposed state-of-the-art solvers for such problems. discussion+video, ICML version (pdf) (bib), more on ArXiv
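
As background for the nuclear-norm setting, the proximal step that such solvers revolve around — soft-thresholding the singular values — can be written directly (this is the generic textbook operation, not the authors' low-rank incremental-SVD implementation):

```python
import numpy as np

def svt(M, tau):
    """Singular value soft-thresholding: the proximal operator of
    tau * (nuclear norm), evaluated via a full SVD."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt
```

Thresholding zeroes out small singular values, which is what makes the iterates low-rank and cheap to store as factors.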

Gap Filling in the Plant Kingdom—Trait Prediction Using Hierarchical Probabilistic Matrix Factorization Hanhuai Shan, Jens Kattge, Peter Reich, Arindam Banerjee, Franziska Schrodt, Markus Reichstein – Accepted Abstract: Plant traits are a key to understand and predict the adaptation of ecosystems to environmental changes, which motivates the TRY project aiming at constructing a global database for plant traits and becoming a standard resource for the ecological community. Despite its unprecedented coverage, a large percentage of missing data substantially constrains joint trait analysis. Meanwhile, the trait data are characterized by the hierarchical phylogenetic structure of the plant kingdom. While factorization based matrix completion techniques have been widely used to address the missing data problem, traditional matrix factorization methods are unable to leverage the phylogenetic structure. We propose hierarchical probabilistic matrix factorization (HPMF), which effectively uses hierarchical phylogenetic information for trait prediction. We demonstrate HPMF's high accuracy, effectiveness of incorporating hierarchical structure and ability to capture trait correlation through experiments. discussion+video, ICML version (pdf) (bib), more on ArXiv

Sparse Support Vector Infinite Push Alain Rakotomamonjy – Accepted Abstract: In this paper, we address the problem of embedded feature selection for ranking on top of the list problems. We pose this problem as a regularized empirical risk minimization with p-norm push loss function (p=∞) and sparsity inducing regularizers. We tackle this challenging optimization problem with an alternating direction method of multipliers algorithm built upon proximal operators of the loss function and the regularizer. Our main technical contribution is thus to provide a numerical scheme for computing the infinite push loss function proximal operator. Experimental results on toy, DNA microarray and BCI problems show how our novel algorithm compares favorably to competitors for ranking on top while using fewer variables in the scoring function. discussion+video, ICML version (pdf) (bib), more on ArXiv

A Dantzig Selector Approach to Temporal Difference Learning Matthieu Geist, Bruno Scherrer, Alessandro Lazaric, Mohammad Ghavamzadeh – Accepted Abstract: LSTD is one of the most popular reinforcement learning algorithms for value function approximation. Whenever the number of features is larger than the number of samples, LSTD must be paired with some form of regularization. In particular, L1-regularization methods tend to perform feature selection by promoting sparsity and thus they are particularly suited to high-dimensional problems. Nonetheless, since LSTD is not a simple regression algorithm but solves a fixed-point problem, the integration with L1-regularization is not straightforward and it might come with some drawbacks (see e.g., the P-matrix assumption for LASSO-TD). In this paper we introduce a novel algorithm obtained by integrating LSTD with the Dantzig Selector. In particular, we investigate the performance of the algorithm and its relationship with existing regularized approaches, showing how it overcomes some of the drawbacks of existing solutions. discussion+video, ICML version (pdf) (bib), more on ArXiv

Scaling Up Coordinate Descent Algorithms for Large ℓ_1 Regularization Problems Chad Scherrer, Mahantesh Halappanavar, Ambuj Tewari, David Haglin – Accepted Abstract: We present a generic framework for parallel coordinate descent (CD) algorithms that has as special cases the original sequential algorithms of Cyclic CD and Stochastic CD, as well as the recent parallel Shotgun algorithm of Bradley et al. We introduce two novel parallel algorithms that are also special cases—Thread-Greedy CD and Coloring-Based CD—and give performance measurements for an OpenMP implementation of these. discussion+video, ICML version (pdf) (bib), more on ArXiv
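
The sequential baseline that the parallel variants build on — cyclic coordinate descent with soft-thresholding for the lasso — can be sketched as follows (a textbook version for illustration, not the paper's OpenMP code):

```python
import numpy as np

def lasso_cd(X, y, lam, n_iters=100):
    """Cyclic coordinate descent sketch for
    min_w 0.5 * ||y - Xw||^2 + lam * ||w||_1."""
    n, d = X.shape
    w = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    r = y - X @ w                        # maintained residual
    for _ in range(n_iters):
        for j in range(d):
            r += X[:, j] * w[j]          # remove coordinate j from the residual
            rho = X[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * w[j]          # put the updated coordinate back
    return w
```

The parallel schemes in the paper differ in which coordinates are updated concurrently and by whom, but each individual update is this same soft-thresholding step.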

Cross-Domain Multitask Learning with Latent Probit Models Shaobo Han, Xuejun Liao, Lawrence Carin – Accepted Abstract: Learning multiple tasks across heterogeneous domains is a challenging problem since the feature space may not be the same for different tasks. We assume the data in multiple tasks are generated from a latent common domain via sparse domain transforms and propose a latent probit model (LPM) to jointly learn the domain transforms, and the shared probit classifier in the common domain. To learn meaningful task relatedness and avoid over-fitting in classification, we introduce sparsity in the domain transforms matrices, as well as in the common classifier. We derive theoretical bounds for the estimation error of the classifier in terms of the sparsity of domain transforms. An expectation-maximization algorithm is derived for learning the LPM. The effectiveness of the approach is demonstrated on several real datasets. discussion+video, ICML version (pdf) (bib), more on ArXiv

Structured Learning from Partial Annotations Xinghua Lou, Fred Hamprecht – Accepted Abstract: Structured learning is appropriate when predicting structured outputs such as trees, graphs, or sequences. Most prior work requires the training set to consist of complete trees, graphs or sequences. Specifying such detailed ground truth can be tedious or infeasible for large outputs. Our main contribution is a large margin formulation that makes structured learning from only partially annotated data possible. The resulting optimization problem is non-convex, yet can be efficiently solved by the concave-convex procedure (CCCP) with novel speedup strategies. We apply our method to a challenging tracking-by-assignment problem of a variable number of divisible objects. On this benchmark, using only 25% of a full annotation we achieve a performance comparable to a model learned with a full annotation. Finally, we offer a unifying perspective on previous work using the hinge, ramp, or max loss for structured learning, followed by an empirical comparison of their practical performance. discussion+video, ICML version (pdf) (bib), more on ArXiv

Maximum Margin Output Coding Yi Zhang, Jeff Schneider – Accepted Abstract: In this paper we study output coding for multi-label prediction. For a multi-label output coding to be discriminative, it is important that codewords for different label vectors are significantly different from each other. In the meantime, unlike in traditional coding theory, codewords in output coding are to be predicted from the input, so it is also critical to have a predictable label encoding. To find output codes that are both discriminative and predictable, we first propose a max-margin formulation that naturally captures these two properties. We then convert it to a metric learning formulation, but with an exponentially large number of constraints as commonly encountered in structured prediction problems. Without a label structure for tractable inference, we use overgenerating (i.e., relaxation) techniques combined with the cutting plane method for optimization. In our empirical study, the proposed output coding scheme outperforms a variety of existing multi-label prediction methods for image, text and music classification. discussion+video, ICML version (pdf) (bib), more on ArXiv

Sequential Nonparametric Regression Haijie Gu, John Lafferty – Accepted Abstract: We present algorithms for nonparametric regression in settings where the data are obtained sequentially. While traditional estimators select bandwidths that depend upon the sample size, for sequential data the effective sample size is dynamically changing. We propose a linear time algorithm that adjusts the bandwidth for each new data point, and show that the estimator achieves the optimal minimax rate of convergence. We also propose the use of online expert mixing algorithms to adapt to unknown smoothness of the regression function. We provide simulations that confirm the theoretical results, and demonstrate the effectiveness of the methods. discussion+video, ICML version (pdf) (bib), more on ArXiv
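
A toy version of the core idea — a kernel regressor whose bandwidth shrinks as the effective sample size grows — might look like this (the t^(-1/5) schedule is the classical minimax rate for twice-differentiable one-dimensional regression; the Gaussian kernel and the constant c are illustrative choices, not the paper's estimator):

```python
import math

class OnlineKernelRegressor:
    """Sketch of sequential kernel regression with a dynamically
    shrinking bandwidth h_t = c * t^(-1/5)."""
    def __init__(self, c=1.0):
        self.xs, self.ys, self.c = [], [], c

    def update(self, x, y):
        self.xs.append(x); self.ys.append(y)

    def predict(self, x):
        if not self.xs:
            return 0.0
        h = self.c * len(self.xs) ** (-0.2)      # bandwidth adapts to sample size
        w = [math.exp(-((x - xi) / h) ** 2 / 2) for xi in self.xs]
        s = sum(w)
        return sum(wi * yi for wi, yi in zip(w, self.ys)) / s if s else 0.0
```

Recomputing the bandwidth at prediction time is what distinguishes this from a fixed-sample estimator; the paper additionally shows how to do this in linear time and how to adapt to unknown smoothness via expert mixing.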

An Infinite Latent Attribute Model for Network Data Konstantina Palla, David A. Knowles, Zoubin Ghahramani – Accepted Abstract: Latent variable models for network data extract a summary of the relational structure underlying an observed network. The simplest possible models subdivide nodes of the network into clusters; the probability of a link between any two nodes then depends only on their cluster assignment. Currently available models can be classified by whether clusters are disjoint or are allowed to overlap. These models can explain a “flat” clustering structure. Hierarchical Bayesian models provide a natural approach to capture more complex dependencies. We propose a model in which objects are characterised by a latent feature vector. Each feature is itself partitioned into disjoint groups (subclusters), corresponding to a second layer of hierarchy. In experimental comparisons, the model achieves significantly improved predictive performance on social and biological link prediction tasks. The results indicate that models with a single layer hierarchy over-simplify real networks. discussion+video, ICML version (pdf) (bib), more on ArXiv

On Local Regret Michael Bowling, Martin Zinkevich – Accepted Abstract: Online learning typically aims to perform nearly as well as the best hypothesis in hindsight. For some hypothesis classes, though, even finding the best hypothesis offline is challenging. In such offline cases, local search techniques are often employed and only local optimality is guaranteed. For online decision-making with such hypothesis classes, we introduce local regret, a generalization of regret that aims to perform nearly as well as only nearby hypotheses. We then present a general algorithm that can minimize local regret for arbitrary locality graphs. We also show that certain forms of structure in the graph can be exploited to drastically simplify learning. These algorithms are then demonstrated on a diverse set of online problems (some previously unexplored): online disjunct learning, online Max-SAT, and online decision tree learning. discussion+video, ICML version (pdf) (bib)

Smoothness and Structure Learning by Proxy Benjamin Yackley, Terran Lane – Accepted Abstract: As data sets grow in size, the ability of learning methods to find structure in them is increasingly hampered by the time needed to search the large spaces of possibilities and generate a score for each that takes all of the observed data into account. For instance, Bayesian networks, the model chosen in this paper, have a super-exponentially large search space for a fixed number of variables. One possible method to alleviate this problem is to use a proxy, such as a Gaussian Process regressor, in place of the true scoring function, training it on a selection of sampled networks. We prove here that the use of such a proxy is well-founded, as we can bound the smoothness of a commonly-used scoring function for Bayesian network structure learning. We show here that, compared to an identical search strategy using the network’s exact scores, our proxy-based search is able to get equivalent or better scores on a number of data sets in a fraction of the time. discussion+video, ICML version (pdf) (bib), more on ArXiv

A fast and simple algorithm for training neural probabilistic language models Andriy Mnih, Yee Whye Teh – Accepted Abstract: Neural probabilistic language models (NPLMs) have recently superseded smoothed n-gram models as the best-performing model class for language modelling. Unfortunately, the adoption of NPLMs is held back by their notoriously long training times, which can be measured in weeks even for moderately-sized datasets. These are a consequence of the models being explicitly normalized, which leads to having to consider all words in the vocabulary when computing the log-likelihood gradients. We propose a fast and simple algorithm for training NPLMs based on noise-contrastive estimation, a newly introduced procedure for estimating unnormalized continuous distributions. We investigate the behaviour of the algorithm on the Penn Treebank corpus and show that it reduces the training times by more than an order of magnitude without affecting the quality of the resulting models. The algorithm is also more efficient and much more stable than importance sampling because it requires far fewer noise samples to perform well. We demonstrate the scalability of the proposed approach by training several neural language models on a 47M-word corpus with an 80K-word vocabulary, obtaining state-of-the-art results in the Microsoft Research Sentence Completion Challenge. discussion+video, ICML version (pdf) (bib)

Incorporating Causal Prior Knowledge as Path-Constraints in Bayesian Networks and Maximal Ancestral Graphs Giorgos Borboudakis, Ioannis Tsamardinos – Accepted Abstract: We consider the incorporation of causal knowledge about the presence or absence of (possibly indirect) causal relations into a causal model. Such causal relations correspond to directed paths in a causal model. This type of knowledge naturally arises from experimental data, among others. Specifically, we consider the formalisms of Causal Bayesian Networks and Maximal Ancestral Graphs and their Markov equivalence classes: Partially Directed Acyclic Graphs and Partially Oriented Ancestral Graphs. We introduce sound and complete procedures which are able to incorporate causal prior knowledge in such models. In simulated experiments, we show that often considering even a few causal facts leads to a significant number of new inferences. In a case study, we also show how to use real experimental data to infer causal knowledge and incorporate it into a real biological causal network. discussion+video, ICML version (pdf) (bib), more on ArXiv

High-Dimensional Covariance Decomposition into Sparse Markov and Independence Domains Majid Janzamin, Animashree Anandkumar – Accepted Abstract: In this paper, we present a novel framework incorporating a combination of sparse models in different domains. We posit the observed data as generated from a linear combination of a sparse Gaussian Markov model (with a sparse precision matrix) and a sparse Gaussian independence model (with a sparse covariance matrix). We provide efficient methods for decomposition of the data into two domains, viz. the Markov and independence domains. We characterize a set of sufficient conditions for identifiability and model consistency. Our decomposition method is based on a simple modification of the popular ℓ_1-penalized maximum-likelihood estimator (ℓ_1-MLE), and is easily implementable. We establish that our estimator is consistent in both the domains, i.e., it successfully recovers the supports of both Markov and independence models, when the number of samples n scales as n = Ω(d^2 log p), where p is the number of variables and d is the maximum node degree in the Markov model. Our conditions for recovery are comparable to those of ℓ_1-MLE for consistent estimation of a sparse Markov model, and thus, we guarantee successful high-dimensional estimation of a richer class of models under comparable conditions. discussion+video, ICML version (pdf) (bib)

Latent Collaborative Retrieval Jason Weston, Chong Wang, Ron Weiss, Adam Berenzweig – Accepted Abstract: Retrieval tasks typically require a ranking of items given a query. Collaborative filtering tasks, on the other hand, learn models comparing users with items. In this paper we study the joint problem of recommending items to a user with respect to a given query, which is a surprisingly common task. This setup differs from the standard collaborative filtering one in that we are given a query × user × item tensor for training instead of the more traditional user × item matrix. Compared to document retrieval we do have a query, but we may or may not have content features (we will consider both cases) and we can also take account of the user’s profile. We introduce a factorized model for this new task that optimizes the top ranked items returned for the given query and user. We report empirical results where it outperforms several baselines. discussion+video, ICML version (pdf) (bib), more on ArXiv

Lightning Does Not Strike Twice: Robust MDPs with Coupled Uncertainty Shie Mannor, Ofir Mebel, Huan Xu – Accepted Abstract: We consider Markov decision processes under parameter uncertainty. Previous studies all restrict to the case that uncertainties among different states are uncoupled, which leads to conservative solutions. In contrast, we introduce an intuitive concept, termed 'Lightning Does not Strike Twice,' to model coupled uncertain parameters. Specifically, we require that the system can deviate from its nominal parameters only a bounded number of times. We give probabilistic guarantees indicating that this model represents real life situations and devise tractable algorithms for computing optimal control policies using this concept. discussion+video, ICML version (pdf) (bib), more on ArXiv

On causal and anticausal learning Bernhard Schoelkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, Joris Mooij – Accepted Abstract: We consider the problem of function estimation in the case where an underlying causal model can be identified. This has implications for popular scenarios such as covariate shift, concept drift, transfer learning and semi-supervised learning. We argue that causal knowledge may facilitate some approaches for a given problem, and rule out others. In particular, we formulate a hypothesis for when semi-supervised learning can help, and corroborate it with empirical results. discussion+video, ICML version (pdf) (bib)

Compact Hyperplane Hashing with Bilinear Functions Wei Liu, Jun Wang, Yadong Mu, Sanjiv Kumar, Shih-Fu Chang – Accepted Abstract: Hyperplane hashing aims to rapidly search for the points nearest to a hyperplane, and has shown practical impact in scaling up active learning with SVMs. Unfortunately, the existing randomized methods need long hash codes and many hash tables to achieve reasonable search accuracy. Thus, they suffer from reduced search speed and large memory overhead. To address this, this paper proposes a novel hyperplane hashing technique which yields compact hash codes. The key idea is the bilinear form of the proposed hash functions, which leads to a higher collision probability than the existing hyperplane hash functions when using random projections. To further increase the performance, we propose a learning-based framework in which bilinear functions are directly learned from the data. This yields compact yet discriminative codes, and also increases the search performance over the random-projection-based solutions. Large-scale active learning experiments carried out on two datasets with up to one million samples demonstrate the overall superiority of the proposed approach. discussion+video, ICML version (pdf) (bib), more on ArXiv

Continuous Inverse Optimal Control with Locally Optimal Examples Sergey Levine, Vladlen Koltun – Accepted Abstract: Inverse optimal control, also known as inverse reinforcement learning, is the problem of recovering an unknown reward function in a Markov decision process from expert demonstrations of the optimal policy. We introduce a probabilistic inverse optimal control algorithm that scales gracefully with task dimensionality, and is suitable for large, continuous domains where even computing a full policy is impractical. By using a local approximation of the reward function, our method can also drop the assumption that the demonstrations are globally optimal, requiring only local optimality. This allows it to learn from examples that are unsuitable for prior methods. discussion+video, ICML version (pdf) (bib), more on ArXiv

Convex Multitask Learning with Flexible Task Clusters Wenliang Zhong, James Kwok – Accepted Abstract: Traditionally, multitask learning (MTL) assumes that all the tasks are related. This can lead to negative transfer when tasks are indeed incoherent. Recently, a number of approaches have been proposed that alleviate this problem by discovering the underlying task clusters or relationships. However, they are limited to modeling these relationships at the task level, which may be restrictive in some applications. In this paper, we propose a novel MTL formulation that captures task relationships at the feature level. Depending on the interactions among tasks and features, the proposed method constructs different task clusters for different features, without even needing to pre-specify the number of clusters. Computationally, the proposed formulation is strongly convex, and can be efficiently solved by accelerated proximal methods. Experiments are performed on a number of synthetic and real-world data sets. Under various degrees of task relatedness, the accuracy of the proposed method is consistently among the best. Moreover, the feature-specific task clusters obtained agree with the known/plausible task structures of the data. discussion+video, ICML version (pdf) (bib), more on ArXiv

A Hierarchical Dirichlet Process Model with Multiple Levels of Clustering for Human EEG Seizure Modeling Drausin Wulsin, Shane Jensen, Brian Litt – Accepted Abstract: Driven by the multi-level structure of human intracranial electroencephalogram (iEEG) recordings of epileptic seizures, we introduce a new variant of a hierarchical Dirichlet Process—the multi-level clustering hierarchical Dirichlet Process (MLC-HDP)—that simultaneously clusters datasets on multiple levels. Our seizure dataset contains brain activity recorded in typically more than a hundred individual channels for each seizure of each patient. The MLC-HDP model clusters over channel types, seizure types, and patient types simultaneously. We describe this model and its implementation in detail. We also present the results of a simulation study comparing the MLC-HDP to a similar model, the Nested Dirichlet Process, and finally demonstrate the MLC-HDP's use in modeling seizures across multiple patients. We find the MLC-HDP's clustering to be comparable to independent human physician clusterings. To our knowledge, the MLC-HDP model is the first in the epilepsy literature capable of clustering seizures within and between patients. discussion+video, ICML version (pdf) (bib), more on ArXiv

Lévy Measure Decompositions for the Beta and Gamma Processes Yingjian Wang, Lawrence Carin – Accepted Abstract: We develop new representations for the Lévy measures of the beta and gamma processes. These representations are manifested in terms of an infinite sum of well-behaved (proper) beta and gamma distributions. Further, we demonstrate how these infinite sums may be truncated in practice, and explicitly characterize truncation errors. We also perform an analysis of the characteristics of posterior distributions, based on the proposed decompositions. The decompositions provide new insights into the beta and gamma processes, and we demonstrate how the proposed representation unifies some properties of the two. This paper is meant to provide a rigorous foundation for and new perspectives on Lévy processes, as these are of increasing importance in machine learning. discussion+video, ICML version (pdf) (bib), more on ArXiv

Building high-level features using large scale unsupervised learning Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Greg Corrado, Kai Chen, Jeff Dean, Andrew Ng – Accepted Abstract: We consider the challenge of building feature detectors for high-level concepts from only unlabeled data. For example, we would like to understand if it is possible to learn a face detector using only unlabeled images downloaded from the Internet. To answer this question, we trained a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (10 million images, each 200×200 pixels). Contrary to what appears to be a widely held negative belief, our experimental results reveal that it is possible to obtain a face detector using only unlabeled data. Control experiments show that the feature detector is robust not only to translation but also to scaling and 3D rotation. Also, via recognition and visualization, we find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. discussion+video, ICML version (pdf) (bib)

Near-Optimal BRL using Optimistic Local Transitions Mauricio Araya, Olivier Buffet, Vincent Thomas – Accepted Abstract: Model-based Bayesian Reinforcement Learning (BRL) allows a sound formalization of the problem of acting optimally while facing an unknown environment, i.e., avoiding the exploration-exploitation dilemma. However, algorithms explicitly addressing BRL suffer from such a combinatorial explosion that a large body of work relies on heuristic algorithms. This paper introduces BOLT, a simple and (almost) deterministic heuristic algorithm for BRL which is optimistic about the transition function. We analyze BOLT's sample complexity, and show that under certain parameters, the algorithm is near-optimal in the Bayesian sense with high probability. Then, experimental results highlight the key differences of this method compared to previous work. discussion+video, ICML version (pdf) (bib), more on ArXiv

A Unified Robust Classification Model Akiko Takeda, Hiroyuki Mitsugi, Takafumi Kanamori – Accepted Abstract: A wide variety of machine learning algorithms, such as the support vector machine (SVM), minimax probability machine (MPM), and Fisher discriminant analysis (FDA), exist for binary classification. The purpose of this paper is to provide a unified classification model that includes the above models through a robust optimization approach. This unified model has several benefits. One is that the extensions and improvements intended for SVM become applicable to MPM and FDA, and vice versa. Another benefit is to provide theoretical results for the above learning methods at once by dealing with the unified model. We give a statistical interpretation of the unified classification model and propose a non-convex optimization algorithm that can be applied to non-convex variants of existing learning methods. discussion+video, ICML version (pdf) (bib), more on ArXiv

Manifold Relevance Determination Andreas Damianou, Carl Ek, Michalis Titsias, Neil Lawrence – Accepted Abstract: In this paper we present a fully Bayesian latent variable model which exploits conditional nonlinear (in)dependence structures to learn an efficient latent representation. The model is capable of learning from extremely high-dimensional data, such as directly modelling high-resolution images. The latent representation is factorized to represent shared and private information from multiple views of the data. Bayesian techniques allow us to automatically estimate the dimensionality of the latent spaces. We demonstrate the model by predicting human pose in an ambiguous setting. Our Bayesian representation allows us to perform disambiguation in a principled manner by including priors which incorporate the dynamics structure of the data. We demonstrate the ability of the model to capture structure underlying extremely high dimensional spaces by learning a low-dimensional representation of a set of facial images under different illumination conditions. The model automatically creates a correctly factorized representation, in which the lighting variance is represented in a separate latent space from the variance associated with different faces. We show that the model is capable of generating morphed faces and images from novel light directions. discussion+video, ICML version (pdf) (bib), more on ArXiv

Residual Components Analysis Alfredo Kalaitzis, Neil Lawrence – Accepted Abstract: Probabilistic principal component analysis (PPCA) seeks a low dimensional representation of a data set in the presence of independent spherical Gaussian noise, Σ = σ^2I. The maximum likelihood solution for the model is an eigenvalue problem on the sample covariance matrix. In this paper we consider the situation where the data variance is already partially explained by other factors, e.g. conditional dependencies between the covariates, or temporal correlations leaving some residual variance. We decompose the residual variance into its components through a generalised eigenvalue problem, which we call residual component analysis (RCA). We explore a range of new algorithms that arise from the framework, including one that factorises the covariance of a Gaussian density into a low-rank and a sparse-inverse component. We illustrate the ideas on the recovery of a protein-signaling network, a gene expression time-series data set and the recovery of the human skeleton from motion capture 3-D cloud data. discussion+video, ICML version (pdf) (bib), more on ArXiv
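The PPCA maximum-likelihood solution that RCA generalizes is itself a small eigenvalue computation. A minimal NumPy sketch of that classical solution (Tipping & Bishop), not of RCA's generalized eigenvalue problem; the function name `ppca` and its interface are ours:

```python
import numpy as np

def ppca(X, q):
    """ML probabilistic PCA via eigendecomposition of the sample covariance.

    X: (n, d) data matrix; q < d latent dimensions.
    Returns the loading matrix W and the noise variance sigma^2."""
    Xc = X - X.mean(axis=0)
    S = Xc.T @ Xc / len(X)                  # sample covariance
    vals, vecs = np.linalg.eigh(S)          # eigh returns ascending order
    vals, vecs = vals[::-1], vecs[:, ::-1]  # flip to descending
    sigma2 = vals[q:].mean()                # noise = mean of discarded variance
    W = vecs[:, :q] * np.sqrt(np.maximum(vals[:q] - sigma2, 0.0))
    return W, sigma2
```

RCA replaces the plain eigenvalue problem here by a generalized one, whitening out the variance already explained by other factors before extracting residual components.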

Clustering to Maximize the Ratio of Split to Diameter Jiabing Wang, Jiaye Chen – Accepted Abstract: Given a weighted and complete graph G = (V, E), V denotes the set of n objects to be clustered, and the weight d(u, v) associated with an edge (u, v) belonging to E denotes the dissimilarity between objects u and v. The diameter of a cluster is the maximum dissimilarity between pairs of objects in the cluster, and the split of a cluster is the minimum dissimilarity between objects within the cluster and objects outside the cluster. In this paper, we propose a new criterion for measuring the goodness of clusters: the ratio of the minimum split to the maximum diameter, and the objective is to maximize the ratio. For the number of clusters k = 2, we present an exact algorithm. For k >= 3, we prove that the problem is NP-hard and present a factor-2 approximation algorithm under the precondition that the weights associated with edges of G satisfy the triangle inequality. The worst-case runtime of both algorithms is O(n^3). We compare the proposed algorithms with the Normalized Cut by applying them to image segmentation. The experimental results on both natural and synthetic images demonstrate the effectiveness of the proposed algorithms. discussion+video, ICML version (pdf) (bib), more on ArXiv
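Scoring a given clustering under this criterion is straightforward from a dissimilarity matrix; a hypothetical helper illustrating the objective (the paper's contribution is the algorithms that optimize it, not this evaluation step):

```python
import numpy as np

def split_diameter_ratio(D, labels):
    """Ratio of minimum split to maximum diameter for a fixed clustering.

    D: (n, n) symmetric dissimilarity matrix; labels: cluster id per object."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]        # same-cluster pairs
    off_diag = ~np.eye(len(labels), dtype=bool)      # exclude d(u, u)
    diameter = D[same & off_diag].max()              # worst within-cluster dissimilarity
    split = D[~same].min()                           # best between-cluster dissimilarity
    return split / diameter
```

Maximizing this ratio rewards clusterings that are simultaneously tight (small diameters) and well separated (large splits).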

A Graphical Model Formulation of Collaborative Filtering Neighbourhood Methods with Fast Maximum Entropy Training Aaron Defazio, Tiberio Caetano – Accepted Abstract: Item neighbourhood methods for collaborative filtering learn a weighted graph over the set of items, where each item is connected to those it is most similar to. The prediction of a user's rating on an item is then given by the ratings of neighbouring items, weighted by their similarity. This paper presents a new neighbourhood approach which we call item fields, whereby an undirected graphical model is formed over the item graph. The resulting prediction rule is a simple generalization of the classical approaches, which takes into account non-local information in the graph, allowing its best results to be obtained when using drastically fewer edges than other neighbourhood approaches. A fast approximate maximum entropy training method based on the Bethe approximation is presented which utilizes a novel decomposition into tractable sub-problems. When using precomputed sufficient statistics on the Movielens dataset, our method outperforms maximum likelihood approaches by two orders of magnitude. discussion+video, ICML version (pdf) (bib)

On-Line Portfolio Selection with Moving Average Reversion Bin Li, Steven C.H. Hoi – Accepted Abstract: On-line portfolio selection has attracted increasing interest in the machine learning and AI communities recently. Empirical evidence shows that a stock's high and low prices are temporary and stock price relatives are likely to follow the mean reversion phenomenon. While the existing mean reversion strategies are shown to achieve good empirical performance on many real datasets, they often make the single-period mean reversion assumption, which is not always satisfied in some real datasets, leading to poor performance when the assumption does not hold. To overcome the limitation, this article proposes a multiple-period mean reversion, or so-called "Moving Average Reversion" (MAR), and a new on-line portfolio selection strategy named "On-Line Moving Average Reversion" (OLMAR), which exploits MAR by applying powerful online learning techniques. From our empirical results, we found that OLMAR can overcome the drawback of existing mean reversion algorithms and achieve significantly better results, especially on the datasets where the existing mean reversion algorithms failed. In addition to superior trading performance, OLMAR also runs extremely fast, further supporting its practical applicability to a wide range of applications. discussion+video, ICML version (pdf) (bib), more on ArXiv
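A sketch of one OLMAR-style step as described above: predict next price relatives from a moving average, then take a passive-aggressive update toward satisfying the reversion constraint and project back onto the simplex. Function names and the default `window` and `eps` values are illustrative, not the paper's tuned settings:

```python
import numpy as np

def simplex_projection(v):
    # Euclidean projection onto the probability simplex (standard sort-based method).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + theta, 0.0)

def olmar_update(prices, b, window=5, eps=10.0):
    """One OLMAR step.

    prices: (t, m) price history; b: current portfolio weights on the simplex."""
    ma = prices[-window:].mean(axis=0)
    x_tilde = ma / prices[-1]                 # predicted price relatives (MAR)
    x_bar = x_tilde.mean()
    denom = np.sum((x_tilde - x_bar) ** 2)
    # Passive-aggressive step size: move only if b . x_tilde < eps.
    lam = 0.0 if denom == 0 else max(0.0, (eps - b @ x_tilde) / denom)
    return simplex_projection(b + lam * (x_tilde - x_bar))
```

On a falling asset the moving average sits above the current price, so the predicted relative is large and the update shifts weight toward it, which is exactly the mean-reversion bet.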

Improved Information Gain Estimates for Decision Tree Induction Sebastian Nowozin – Accepted Abstract: Ensembles of classification and regression trees remain popular machine learning methods because they define flexible non-parametric models that predict well and are computationally efficient both during training and testing. During induction of decision trees one aims to find predicates that are maximally informative about the prediction target. To select good predicates most approaches estimate an information-theoretic scoring function, the information gain, both for classification and regression problems. We point out that the common estimation procedures are biased and show that by replacing them with improved estimators of the discrete and the differential entropy we can obtain better decision trees. In effect our modifications yield improved predictive performance and are simple to implement in any decision tree code. discussion+video, ICML version (pdf) (bib), more on ArXiv
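The bias the paper targets already shows up in the standard plug-in estimate. A sketch of naive information gain alongside one classical correction (Miller-Madow), which here merely stands in for the improved discrete and differential entropy estimators the paper actually develops:

```python
import numpy as np
from collections import Counter

def plugin_entropy(labels):
    # Naive maximum-likelihood ("plug-in") entropy estimate, in nats.
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def miller_madow_entropy(labels):
    # Bias-corrected estimate: plug-in + (K - 1) / (2 n), K observed classes.
    k = len(set(labels))
    n = len(labels)
    return plugin_entropy(labels) + (k - 1) / (2 * n)

def information_gain(labels, left, right, entropy=plugin_entropy):
    # Gain of a candidate split: parent entropy minus weighted child entropies.
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))
```

Because the plug-in estimator underestimates entropy more severely on the small child nodes than on the parent, naive gain is systematically optimistic, which is the effect better estimators correct.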

Influence Maximization in Continuous Time Diffusion Networks Manuel Gomez Rodriguez, Bernhard Schölkopf – Accepted Abstract: The problem of finding the optimal set of source nodes in a diffusion network that maximizes the spread of information, influence, and diseases in a limited amount of time depends dramatically on the underlying temporal dynamics of the network. However, this still remains largely unexplored to date. To this end, given a network and its temporal dynamics, we first describe how continuous time Markov chains allow us to analytically compute the average total number of nodes reached by a diffusion process starting in a set of source nodes. We then show that selecting the set of most influential source nodes in the continuous time influence maximization problem is NP-hard and develop an efficient approximation algorithm with provable near-optimal performance. Experiments on synthetic and real diffusion networks show that our algorithm outperforms other state of the art algorithms by at least 20% and is robust across different network topologies. discussion+video, ICML version (pdf) (bib)
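The approximation algorithm rests on greedy selection for a monotone submodular objective; a generic sketch, where the `spread` oracle stands in for the paper's continuous-time estimate of the expected number of nodes reached:

```python
def greedy_max(candidates, k, spread):
    """Greedy selection of k sources maximizing a monotone submodular spread(S).

    This generic scheme is what underlies (1 - 1/e)-style guarantees; the
    paper's contribution includes the efficient continuous-time spread oracle."""
    selected = []
    for _ in range(k):
        best = max((c for c in candidates if c not in selected),
                   key=lambda c: spread(selected + [c]))
        selected.append(best)
    return selected
```

Each round adds the node with the largest marginal gain; submodularity of the spread function is what makes this simple loop provably near-optimal.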

On the Size of the Online Kernel Sparsification Dictionary Yi Sun, Faustino Gomez, Juergen Schmidhuber – Accepted Abstract: We analyze the size of the dictionary constructed from online kernel sparsification, using a novel formula that expresses the expected determinant of the kernel Gram matrix in terms of the eigenvalues of the covariance operator. Using this formula, we are able to connect the cardinality of the dictionary with the eigen-decay of the covariance operator. In particular, we show that for bounded kernels, the size of the dictionary always grows sub-linearly in the number of data points, and, as a consequence, the kernel linear regressor constructed from the resulting dictionary is consistent. discussion+video, ICML version (pdf) (bib), more on ArXiv
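Online kernel sparsification of the kind analyzed here typically grows the dictionary with an approximate-linear-dependence (ALD) test in the style of Engel et al.: a point is added only if it is not nearly spanned by the current dictionary in feature space. A small sketch (the threshold `nu` and the tiny ridge term are illustrative choices):

```python
import numpy as np

def build_dictionary(X, kernel, nu=1e-3):
    """Online dictionary construction via the ALD residual test."""
    dictionary = []
    for x in X:
        if not dictionary:
            dictionary.append(x)
            continue
        K = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
        k_vec = np.array([kernel(a, x) for a in dictionary])
        # Best approximation coefficients of phi(x) in span of the dictionary.
        alpha = np.linalg.solve(K + 1e-10 * np.eye(len(K)), k_vec)
        residual = kernel(x, x) - k_vec @ alpha
        if residual > nu:                   # not well approximated: add it
            dictionary.append(x)
    return dictionary
```

The paper's result says that for bounded kernels the number of points passing this test grows sub-linearly in the stream length, with the rate tied to the eigen-decay of the covariance operator.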

Multi-level Lasso for Sparse Multi-task Regression Aurelie Lozano, Grzegorz Swirszcz – Accepted Abstract: We present a flexible formulation for variable selection in multi-task regression to allow for discrepancies in the estimated sparsity patterns across the multiple tasks, while leveraging the common structure among them. Our approach is based on an intuitive decomposition of the regression coefficients into a product between a component that is common to all tasks and another component that captures task-specificity. This decomposition yields the Multi-level Lasso objective that can be solved efficiently via alternating optimization. The analysis of the “orthonormal design” case reveals some interesting insights on the nature of the shrinkage performed by our method, compared to that of related work. Theoretical guarantees are provided on the consistency of Multi-level Lasso. Simulations and empirical study of micro-array data further demonstrate the value of our framework. discussion+video, ICML version (pdf) (bib)

Fast Computation of Subpath Kernel for Trees Daisuke Kimura, Hisashi Kashima – Accepted Abstract: The kernel method is a potential approach to analyzing structured data such as sequences, trees, and graphs; however, unordered trees have not been investigated extensively. Kimura et al. (2011) proposed a kernel function for unordered trees on the basis of their subpaths, which are vertical substructures of trees responsible for hierarchical information in them. Their kernel exhibits practically good performance in terms of accuracy and speed; however, linear-time computation is not guaranteed theoretically, unlike the case of the other unordered tree kernel proposed by Vishwanathan and Smola (2003). In this paper, we propose a theoretically guaranteed linear-time kernel computation algorithm that is practically fast, and we present an efficient prediction algorithm whose running time depends only on the size of the input tree. Experimental results show that the proposed algorithms are quite efficient in practice. discussion+video, ICML version (pdf) (bib), more on ArXiv

Total Variation and Euler's Elastica for Supervised Learning Tong Lin, Hanlin Xue, Ling Wang, Hongbin Zha – Accepted Abstract: In recent years, total variation (TV) and Euler's elastica (EE) have been successfully applied to image processing tasks such as denoising and inpainting. This paper investigates how to extend TV and EE to the supervised learning settings on high dimensional data. The supervised learning problem can be formulated as an energy functional minimization under Tikhonov regularization scheme, where the energy is composed of a squared loss and a total variation smoothing (or Euler's elastica smoothing). Its solution via variational principles leads to an Euler-Lagrange PDE. However, the PDE is always high-dimensional and cannot be directly solved by common methods. Instead, radial basis functions are utilized to approximate the target function, reducing the problem to finding the linear coefficients of basis functions. We apply the proposed methods to supervised learning tasks (including binary classification, multi-class classification, and regression) on benchmark data sets. Extensive experiments have demonstrated promising results of the proposed methods. discussion+video, ICML version (pdf) (bib), more on ArXiv

Learning the Dependence Graph of Time Series with Latent Factors Ali Jalali, Sujay Sanghavi – Accepted Abstract: This paper considers the problem of learning, from samples, the dependency structure of a system of linear stochastic differential equations, when some of the variables are latent. We observe the time evolution of some variables, and never observe other variables; from this, we would like to find the dependency structure of the observed variables – separating out the spurious interactions caused by the latent variables' time series. We develop a new convex optimization based method to do so in the case when the number of latent variables is smaller than the number of observed ones. For the case when the dependency structure between the observed variables is sparse, we theoretically establish a high-dimensional scaling result for structure recovery. We verify our theoretical result with both synthetic and real data (from the stock market). discussion+video, ICML version (pdf) (bib)

A Generalized Loop Correction Method for Approximate Inference in Graphical Models Siamak Ravanbakhsh, Chun-Nam Yu, Russell Greiner – Accepted Abstract: Belief Propagation (BP) is one of the most popular methods for inference in probabilistic graphical models. BP is guaranteed to return the correct answer for tree structures, but can be incorrect or non-convergent for loopy graphical models. Recently, several new approximate inference algorithms based on cavity distribution have been proposed. These methods can account for the effect of loops by incorporating the dependency between BP messages. Alternatively, region-based approximations (that lead to methods such as Generalized Belief Propagation) improve upon BP by considering interactions within small clusters of variables, thus taking small loops within these clusters into account. This paper introduces an approach, Generalized Loop Correction (GLC), that benefits from both of these types of loop correction. We show how GLC relates to these two families of inference methods, then provide empirical evidence that GLC works effectively in general, and can be significantly more accurate than both correction schemes. discussion+video, ICML version (pdf) (bib)

Consistent Covariance Selection From Data With Missing Values Mladen Kolar, Eric Xing – Accepted Abstract: Data sets with missing values arise in many practical problems and domains. However, correct statistical analysis of these data sets is difficult. A popular likelihood approach to statistical inference from partially observed data is the expectation maximization (EM) algorithm, which leads to non-convex optimization and estimates that are difficult to analyze theoretically. We study a simple two-step procedure for covariance selection, which is tractable in high dimensions and does not require imputation of the missing values. We provide rates of convergence for this estimator in the spectral norm, Frobenius norm and element-wise ℓ_∞ norm. Simulation studies show that this estimator compares favorably with the EM algorithm. Our results have important practical consequences as they show that standard tools for covariance selection can be used when data contains missing values, without resorting to the iterative EM algorithm that can be slow to converge in practice for large problems. discussion+video, ICML version (pdf) (bib)
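The flavour of an imputation-free plug-in step can be sketched by estimating each covariance entry from the observations where both coordinates are present. This captures only the spirit of the first stage; the paper's estimator, its scaling corrections, and its theoretical analysis differ in detail:

```python
import numpy as np

def pairwise_covariance(X):
    """Entry-wise covariance estimate from data with NaNs, no imputation.

    Each S[i, j] uses all rows where both column i and column j are observed."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    S = np.empty((p, p))
    for i in range(p):
        for j in range(p):
            mask = ~np.isnan(X[:, i]) & ~np.isnan(X[:, j])
            xi, xj = X[mask, i], X[mask, j]
            S[i, j] = np.mean((xi - xi.mean()) * (xj - xj.mean()))
    return S
```

The resulting matrix can then be handed to a standard covariance-selection tool (e.g. an ℓ_1-penalized estimator), which is exactly the "standard tools still work" message of the abstract.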

Is margin preserved after random projection? Qinfeng Shi, Chunhua Shen, Rhys Hill, Anton van den Hengel – Accepted Abstract: Random projections have been applied in many machine learning algorithms. However, whether margin is preserved after random projection is non-trivial and not well studied. In this paper we analyse margin distortion after random projection, and give the conditions of margin preservation for binary classification problems. We also extend our analysis to margin for multiclass problems, and provide theoretical bounds on multiclass margin on the projected data. discussion+video, ICML version (pdf) (bib), more on ArXiv
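The quantities under study are simple to compute empirically; a sketch of a normalized margin and a Gaussian random projection, with which one can observe the (non-)preservation the paper analyzes (function names are ours):

```python
import numpy as np

def margin(w, X, y):
    # Normalized margin: min_i y_i <w, x_i> / (||w|| ||x_i||).
    scores = y * (X @ w)
    return np.min(scores / (np.linalg.norm(w) * np.linalg.norm(X, axis=1)))

def project(X, k, rng):
    # Gaussian random projection to k dimensions, scaled so squared
    # norms are preserved in expectation.
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R
```

Comparing `margin` before projection against the best margin achievable on `project(X, k, rng)` over many draws gives an empirical view of the distortion the paper bounds theoretically.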

A Bayesian Approach to Approximate Joint Diagonalization of Square Matrices Mingjun Zhong, Mark Girolami – Accepted Abstract: We present a fully Bayesian approach to the simultaneous approximate diagonalization of several square matrices which are not necessarily symmetric. A Gibbs sampler is derived for simulating the common eigenvectors and the eigenvalues of these matrices. Several datasets are used to demonstrate the performance of the proposed Gibbs sampler, and we provide comparisons to several other joint diagonalization algorithms, showing that the Gibbs sampler achieves state-of-the-art performance. As a byproduct, the output of the Gibbs sampler can be used to estimate the log marginal likelihood via the Bayesian information criterion (BIC), which correctly locates the number of common eigenvectors. We then apply the Gibbs sampler to the blind source separation problem, common principal component analysis, and common spatial pattern analysis. discussion+video, ICML version (pdf) (bib), more on ArXiv

Predicting accurate probabilities with a ranking loss Aditya Menon, Xiaoqian Jiang, Shankar Vembu, Charles Elkan, Lucila Ohno-Machado – Accepted Abstract: In many real-world applications of machine learning classifiers, it is essential to predict the probability of an example belonging to a particular class. This paper proposes a simple technique for predicting probabilities based on optimizing a ranking loss, followed by isotonic regression. This semi-parametric technique offers both good ranking and regression performance, and models a richer set of probability distributions than statistical workhorses such as logistic regression. We provide experimental results that show the effectiveness of this technique on real-world applications of probability prediction. discussion+video, ICML version (pdf) (bib), more on ArXiv
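The second stage of the pipeline, isotonic regression, is the classical pool-adjacent-violators algorithm (PAVA). A self-contained sketch that fits a monotone map from ranking scores to probabilities; this illustrates the standard technique and does not reproduce the paper's full pipeline:

```python
import numpy as np

def pava(y, w=None):
    """Pool Adjacent Violators: weighted isotonic (non-decreasing) fit."""
    y = list(map(float, y))
    w = [1.0] * len(y) if w is None else list(map(float, w))
    blocks = []  # each block: [mean value, total weight, count]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge adjacent blocks while they violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2, c2 = blocks.pop()
            v1, w1, c1 = blocks.pop()
            tot = w1 + w2
            blocks.append([(v1 * w1 + v2 * w2) / tot, tot, c1 + c2])
    out = []
    for v, _, c in blocks:
        out.extend([v] * c)
    return np.array(out)

def calibrate(scores, labels):
    # Sort by ranking score, then fit a monotone map from score to P(y = 1).
    order = np.argsort(scores)
    probs = pava(np.asarray(labels, dtype=float)[order])
    return order, probs
```

Because the isotonic fit only needs the ordering of the scores, any model trained purely with a ranking loss can be calibrated this way after the fact.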

Learning with Augmented Features for Heterogeneous Domain Adaptation Lixin Duan, Dong Xu, Ivor Tsang – Accepted Abstract: We propose a new learning method for heterogeneous domain adaptation (HDA), in which the data from the source domain and the target domain are represented by heterogeneous features with different dimensions. Using two different projection matrices, we first transform the data from the two domains into a common subspace in order to measure the similarity between the data from the two domains. We then propose two new feature mapping functions to augment the transformed data with their original features and zeros. The existing learning methods (e.g., SVM and SVR) can be readily incorporated with our newly proposed augmented feature representations to effectively utilize the data from both domains for HDA. Using the hinge loss function in SVM as an example, we introduce the detailed objective function in our method, called Heterogeneous Feature Augmentation (HFA), for the linear case, and also describe its kernelization in order to efficiently cope with very high-dimensional data. Moreover, we also develop an alternating optimization algorithm to effectively solve the nontrivial optimization problem in our HFA method. Comprehensive experiments on two benchmark datasets clearly demonstrate that our HFA outperforms the existing HDA methods. discussion+video, ICML version (pdf) (bib), more on ArXiv
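The augmented representations have a simple structure: each domain keeps its original features in its own slot, pads the other domain's slot with zeros, and prepends the projection into the common subspace, so cross-domain inner products flow only through the learned projections. A sketch under our reading of the abstract (function names and the exact layout are our assumptions):

```python
import numpy as np

def augment_source(x, P, d_t):
    # [P x ; x ; 0_{d_t}] : common part, original source features, zero pad.
    return np.concatenate([P @ x, x, np.zeros(d_t)])

def augment_target(z, Q, d_s):
    # [Q z ; 0_{d_s} ; z] : common part, zero pad, original target features.
    return np.concatenate([Q @ z, np.zeros(d_s), z])
```

With this layout, the inner product between an augmented source point and an augmented target point reduces to (Px)·(Qz), which is why a standard SVM on the augmented features implicitly compares the two domains in the common subspace.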

Dirichlet Process with Mixed Random Measures: A Nonparametric Topic Model for Labeled Data Dongwoo Kim, Suin Kim, Alice Oh – Accepted Abstract: We describe a nonparametric topic model for labeled data. The model uses a mixture of random measures (MRM) as a base distribution of the Dirichlet process (DP) of the HDP framework, so we call it the DP-MRM. To model labeled data, we define a DP distributed random measure for each label, and the resulting model generates an unbounded number of topics for each label. We apply DP-MRM on single-labeled and multi-labeled corpora of documents and compare the performance on label prediction with LDA-SVM and Labeled-LDA. We further enhance the model by incorporating ddCRP and modeling multi-labeled images for image segmentation and object labeling, comparing the performance with nCuts and rddCRP. discussion+video, ICML version (pdf) (bib), more on ArXiv

Evaluating Bayesian and L1 Approaches for Sparse Unsupervised Learning Shakir Mohamed, Katherine Heller, Zoubin Ghahramani – Accepted Abstract: The use of L_1 regularisation for sparse learning has generated immense research interest, with many successful applications in diverse areas such as signal acquisition, image coding, genomics and collaborative filtering. While existing work highlights the many advantages of L_1 methods, in this paper we find that L_1 regularisation often dramatically under-performs in terms of predictive performance when compared with other methods for inferring sparsity. We focus on unsupervised latent variable models, and develop L_1 minimising factor models, Bayesian variants of “L_1”, and Bayesian models with a stronger L_0-like sparsity induced through spike-and-slab distributions. These spike-and-slab Bayesian factor models encourage sparsity while accounting for uncertainty in a principled manner, and avoid unnecessary shrinkage of non-zero values. We demonstrate on a number of data sets that in practice spike-and-slab Bayesian methods outperform L_1 minimisation, even on a computational budget. We thus highlight the need to re-assess the wide use of L_1 methods in sparsity-reliant applications, particularly when we care about generalising to previously unseen data, and provide an alternative that, over many varying conditions, provides improved generalisation performance. discussion+video, ICML version (pdf) (bib)

Collaborative Topic Regression with Social Matrix Factorization for Recommendation Systems Sanjay Purushotham, Yan Liu – Accepted Abstract: Social network websites, such as Facebook, YouTube, Lastfm, etc., have become a popular platform for users to connect with each other and share content or opinions. They provide rich information for studying the influence of a user's social circle on their decision process. In this paper, we are interested in examining the effectiveness of social network information in predicting users' ratings of items. We propose a novel hierarchical Bayesian model which jointly incorporates topic modeling and probabilistic matrix factorization of social networks. A major advantage of our model is that it automatically infers useful latent topics and social information, as well as their importance to collaborative filtering, from the training data. Empirical experiments on two large-scale datasets show that our algorithm provides a more effective recommendation system than state-of-the-art approaches. Our results also reveal the interesting insight that social circles have more influence on people's decisions about the usefulness of information (e.g., bookmarking preference on Delicious) than on personal taste (e.g., music preference on Lastfm). discussion+video, ICML version (pdf) (bib), more on ArXiv
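A minimal sketch of the matrix-factorization-with-social-network idea: one gradient step that fits observed ratings while pulling each user's latent vector toward the average of their friends'. This is a common simplification of social matrix factorization, not the paper's full hierarchical Bayesian model; all names and hyperparameters are illustrative.

```python
import numpy as np

def social_mf_step(U, V, R, mask, A, lr=0.01, lam=0.1, beta=0.5):
    """One gradient-descent step on squared rating error plus an L2
    penalty and a social term 0.5*beta*||(I - A) U||^2 that pulls
    each user toward the friend-average A @ U (A row-stochastic)."""
    E = mask * (U @ V.T - R)             # error on observed ratings only
    L = np.eye(A.shape[0]) - A           # deviation-from-friends operator
    gU = E @ V + lam * U + beta * (L.T @ (L @ U))
    gV = E.T @ U + lam * V
    return U - lr * gU, V - lr * gV

def social_mf_loss(U, V, R, mask, A, lam=0.1, beta=0.5):
    L = np.eye(A.shape[0]) - A
    return (0.5 * np.sum((mask * (U @ V.T - R)) ** 2)
            + 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))
            + 0.5 * beta * np.sum((L @ U) ** 2))

# Tiny synthetic instance: 4 users, 5 items, 2 latent factors.
rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(4, 2))
V = 0.1 * rng.normal(size=(5, 2))
R = rng.normal(size=(4, 5))
mask = (rng.random((4, 5)) < 0.6).astype(float)   # observed entries
A = np.full((4, 4), 0.25)                          # everyone friends equally
U2, V2 = social_mf_step(U, V, R, mask, A)
```

A single step already lowers the regularized objective on this toy instance; the full model additionally ties item factors to document topics.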

LPQP for MAP: Putting LP Solvers to Better Use Patrick Pletscher, Sharon Wulff – Accepted Abstract: MAP inference for general energy functions remains a challenging problem. While most efforts are channeled towards improving the linear programming (LP) based relaxation, this work is motivated by the quadratic programming (QP) relaxation. We propose a novel MAP relaxation that penalizes the Kullback-Leibler divergence between the LP pairwise auxiliary variables and the QP-equivalent terms given by the product of the unaries. We develop two efficient algorithms based on variants of this relaxation. The algorithms minimize the non-convex objective using belief propagation and dual decomposition as building blocks. Experiments on synthetic and real-world data show that the solutions returned by our algorithms substantially improve over the LP relaxation. discussion+video, ICML version (pdf) (bib), more on ArXiv
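The penalty the abstract describes, a KL divergence between a pairwise (LP) variable and the product of its unary marginals (the QP form), is, for consistent marginals, just the mutual information of the pair. A hedged sketch of that single term (the full relaxation combines many such terms with the energy; this function is illustrative):

```python
import math

def kl_pairwise_vs_product(mu_ij, mu_i, mu_j):
    """KL( mu_ij || mu_i x mu_j ) for one edge: mu_ij is a matrix of
    joint probabilities over (x_i, x_j); mu_i, mu_j are the unary
    marginals. Zero iff the pairwise variable factorizes as the
    product of unaries, i.e. the LP and QP terms agree."""
    kl = 0.0
    for a, row in enumerate(mu_ij):
        for b, p in enumerate(row):
            if p > 0:  # 0 * log 0 contributes nothing
                kl += p * math.log(p / (mu_i[a] * mu_j[b]))
    return kl
```

An independent pair incurs zero penalty, while a perfectly correlated uniform binary pair pays log 2, so minimizing the penalized objective pushes the LP solution toward product-form (QP-consistent) pairwise variables.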

Clustering by Low-Rank Doubly Stochastic Matrix Decomposition Zhirong Yang, Erkki Oja – Accepted Abstract: Clustering analysis by nonnegative low-rank approximations has achieved remarkable progress in the past decade. However, most approximation approaches in this direction are still restricted to matr