A preprint of the blog post is now available at https://arxiv.org/abs/1806.06003. Many thanks for everyone’s interest & input, and I hope this format will prove useful as well.

Foreword

The post coincides topically with last year’s first annual Conference on Robot Learning as well as the workshop on Challenges in Robot Learning at NIPS 2017, the latter of which we had the pleasure of co-organising together with colleagues from Oxford, DeepMind, and MIT.

The events, as well as this post, cover current challenges and potentials of learning across various tasks of relevance in robotics and automation. In this context, similar to the long-term discussion on how much innate structure is optimal for artificial intelligence, there is the more short-term question of how to merge traditional programming and learning (not sure if I prefer the branding as differentiable programming or software 2.0) for narrower applications in efficient, robust and safe automation. The question about structure as a beneficial or limiting aspect arguably becomes easier to answer in the context of near-term robotic applications, as we can simply acknowledge our ignorance (the missing knowledge about what will work best in the future) and focus on the present to benchmark and combine the most efficient and effective directions.

Existing solutions to many tasks in mobile robotics, such as localisation, mapping, or planning, focus on prior knowledge about the structure of our tasks and environments. This may include geometry or kinematic and dynamic models, which therefore have been built into traditional programs. However, recent successes and the flexibility of fairly unconstrained, learned models shift the focus of new academic and industrial projects. Successes in image recognition (ImageNet) as well as triumphs in reinforcement learning (Atari, Go, Chess) inspire like-minded research.

As the post has become a bit of a long read, I suggest reading it like a paper: intro, discussion & conclusions and then - only if you did not fall asleep after all - the rest. Similar to scientific papers, some paragraphs will require basic familiarity with the field. However, a quick web search should be enough to clarify most unexplained terminology. Additionally, to keep this engaging, I have added some of my favourite recent videos highlighting interesting research for each section. Finally, this is a high-level review with more details to be found in the respective references, which represent just a small subset of available work in each field, chosen based on personal interest as well as shameless self-promotion of our work. In general, please do apply the large-bag-of-salt principle. Also, there is a shorter version available.

The Broad Question

Recently, discussions have come up about the potential relevance of reinforcement learning for deployable mobile robots. When hearing these questions, it is easy to reject them as professional FOMO - their fear of missing out on a hyped technology. However, it is worth taking a moment to look at what underlies these thoughts.

Deep learning has mainly made its mark in the perception pipeline of autonomous systems, including pedestrian / car / cyclist / traffic sign detection, semantic segmentation, and other related tasks. While these perception systems rely heavily on learning, the modules for localisation, reasoning, and planning often continue to be the domain of carefully crafted rules and programs exploiting geometric priors and our own intuitions. Their design requires expert knowledge and repeated iteration between testing - in simulation as well as on the real platform - and refinement of hundreds if not thousands of heuristics.

While for example in the early DARPA challenges, robotic systems nearly completely relied on these priors and intuitions, this paradigm is starting to shift ([1]). Given the success in perception tasks, the natural question is: ‘what else can I learn from data?’. Discussions about using (reinforcement) learning naturally arise in the context of reducing manual rule design efforts and instead automatically learning decision patterns. The overall question now focusses on the general application of learning in further parts of our pipeline; with RL representing one of the potentially more ‘high risk, high reward’ scenarios.

Media: Waymo / Google; Oxford Robotics Institute / Oxbotica; Tesla; NVIDIA; Drive.ai; Toyota Research Institute

While ML has been able to improve our efficiency in addressing various tasks, a quick reminder is in order: there is this great saying regarding the non-existence of free lunch, and it only becomes more relevant after life in grad school. Independent of all its advantages, machine learning is no magic tool; its successful application commonly requires detailed domain knowledge and systems engineering, and demands significant time for data collection / curation, experimental setup and safety arrangements.

Learning for Autonomous Systems

Autonomous systems are generally modularised for the same reasons as any large software system: reusability, ease of testing, separation of responsibilities, interpretability, etc. Robots / autonomous systems are treated in this article as a collection of these modules, including: perception, localisation, mapping, tracking, prediction, planning, and control.

The following paragraphs survey a subset of work in each field, exemplifying the state of the art of learning-based methods for these modules, followed by additional directions which are relevant across the whole pipeline: uncertainty and introspection as well as representations. The final sections represent a more personal take on challenges and potentials. More resources on software systems and computer vision for autonomous platforms can be found under [2], [3], [4] and generally in the mentioned references.

Perception

Current perception modules represent one of the principal success stories of deep learning in autonomous systems. Image classification, object detection, depth estimation, semantic segmentation, and activity recognition are all principally dominated by deep learning [5], [6], [7] (a detailed survey of recent work can be found under [8]).

While classification benchmarks have long been pillars of computer vision research, the ImageNet benchmark [9] presents a cornerstone for the acceleration of progress in machine learning and in particular deep learning. A good share of models was originally developed specifically for this benchmark. The ImageNet dataset and benchmark have been major driving forces for research on deep learning, which triumphed in all recent competitions.

The detection of traffic participants including pedestrians, cyclists and other vehicles [10], [11] relies predominantly on deep learning approaches for image [12], [13], [14] as well as LIDAR data [15], [16]. LIDAR can be understood as essentially a light-based radar, with the sensor’s output being a long sequence of distance measurements. Notably, the natural structure of LIDAR data differs significantly from image data, which makes models designed for images a poor fit. One essential challenge is that the same pointcloud can be represented by many different sequences, so applied models have to be permutation invariant.

Most early approaches are able to prevail by building on manually designed grids with predefined feature extractors for each cell [16], [17]. Replacing this kind of manual feature design, a more recent direction is the combination of low-level feature learning based on recurrent modules and high-level grid-based representations via convolutions for end-to-end training [18]. Further work relies on max-pooling as a symmetric function over point-wise descriptors (treating the data as a set rather than as a sequence) and extensions to address local features at varying contextual scales [19], [20].
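
To make the idea of a symmetric function concrete, here is a minimal sketch of max-pooling over point-wise descriptors. The descriptor function is a hand-written, hypothetical stand-in for what would be a learned network in practice; the only point is that the pooled feature does not depend on the order in which points arrive.

```python
import random


def pointwise_descriptor(point):
    """Hypothetical per-point feature map (a learned MLP in real systems)."""
    x, y, z = point
    return [x + y + z, x * y, z * z, abs(x - z)]


def global_feature(points):
    """Max-pool every descriptor dimension over the whole set of points."""
    descriptors = [pointwise_descriptor(p) for p in points]
    return [max(column) for column in zip(*descriptors)]


cloud = [(0.1, 2.0, -0.5), (1.5, -0.3, 0.7), (-1.0, 0.4, 2.2)]

# The same pointcloud in a different sequence order.
shuffled = cloud[:]
random.Random(0).shuffle(shuffled)

# Max is a symmetric function, so both orderings give the same feature.
assert global_feature(cloud) == global_feature(shuffled)
```

Any other symmetric reduction (sum, mean) would give the same invariance; max is simply the choice named in the text.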

Similarly, pixel-wise semantic and instance segmentation [22], [23], [6], [5] as well as image-based depth / disparity estimation (mono and stereo) [24], [5], [25], [26], [27], [28] remain in the domain of deep learning based approaches. Interestingly however, ideas from geometric computer vision are making their way back into current research for the latter direction, e.g. in the form of reprojection losses [25]. Multi-task training with shared encoder segments has demonstrated additional improvements [29], [21]. Lastly, in general scene understanding, deep learning has been beneficial for tasks such as the prediction of road attributes [30].

This paragraph is reasonably high-level with little discussion as this space of tasks is dominated by learning (removing the need to argue for its relevance) and innovation often focuses on architecture or loss function design, which for now exceed the scope of this post. Lastly, a significant share of intriguing work in these fields takes place in product-oriented, more applied research, which - sadly but for obvious reasons - is less published.

Localisation and Mapping

Let me start by quickly conveying some condensed and highly simplified intuition about the computations underlying current solutions to relative localisation, absolute localisation as well as the full SLAM (simultaneous localisation and mapping) problem.

In relative localisation, we commonly determine matching features in consecutive sensor measurements and, based on the changes in their coordinates, we compute our change in pose; the latter being accurately described via geometric rules (projection, triangulation, etc). Absolute localisation (localisation against a map) additionally involves a feature-matching process of current perception against locations in our map (e.g. via bag-of-words) to determine the coarse location and potentially re-localise. For SLAM, we additionally build a map. So while we determine our own location from the estimated positions of features, we additionally need to estimate the position of new features with respect to the map in order to update and enhance it. Additional refinement of the map for this joint optimisation problem is commonly formulated via iterative filtering and bundle adjustment techniques. The overall problem has many unmentioned challenges in the efficient, robust realisation of these sub-tasks as well as in the complexity of real-world data including sensor noise, occlusions and dynamic environments.
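
As a heavily simplified illustration of the relative localisation step, the sketch below estimates the 2D rotation and translation that map matched feature coordinates from the previous measurement onto the current one. It assumes perfect correspondences, a planar world and no outliers; the function names are made up for this example, and real pipelines work with projections in 3D and robust estimators.

```python
import math


def transform_point(theta, tx, ty, point):
    """Rotate a 2D point by theta and translate it by (tx, ty)."""
    x, y = point
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y + tx, s * x + c * y + ty)


def estimate_rigid_transform_2d(prev_pts, curr_pts):
    """Least-squares rotation and translation mapping prev_pts onto curr_pts."""
    n = len(prev_pts)
    pcx = sum(p[0] for p in prev_pts) / n
    pcy = sum(p[1] for p in prev_pts) / n
    qcx = sum(q[0] for q in curr_pts) / n
    qcy = sum(q[1] for q in curr_pts) / n

    # Accumulate dot and cross products of the centred correspondences.
    s_dot, s_cross = 0.0, 0.0
    for (px, py), (qx, qy) in zip(prev_pts, curr_pts):
        ax, ay = px - pcx, py - pcy
        bx, by = qx - qcx, qy - qcy
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx

    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation aligns the rotated centroid with the current centroid.
    tx = qcx - (c * pcx - s * pcy)
    ty = qcy - (s * pcx + c * pcy)
    return theta, tx, ty


# Synthetic matched features: the same points seen before and after a motion.
prev_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
curr_features = [transform_point(0.3, 1.0, -2.0, p) for p in prev_features]

theta, tx, ty = estimate_rigid_transform_2d(prev_features, curr_features)
```

The estimated transform describes how the features moved in the sensor frame; the change in pose of the sensor itself is the inverse of that transform.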

Applications in localisation and mapping have provided challenging benchmarks for learning-based approaches. Geometric methods, e.g. for visual odometry, continue to outperform end-to-end learning [5] (end-to-end indicating here the learning of the complete image-to-pose odometry pipeline, as opposed to learning modules such as interest point descriptors [31]). Geometric methods have the benefit of incorporating our prior knowledge about exact geometric rules (e.g. regarding homographies and projections), which learning-based methods will at best learn to approximate. While we are able to formulate exact equations, e.g. for homography estimation, there are various tasks which are solved more heuristically. The sub-optimal compression of available information, as in the context of feature descriptors, provides an opportunity for learning to minimise the loss of relevant information.

However, the actual gap between distinctly geometric or learned approaches for localisation is decreasing in practice due to the consolidation of both directions. Recent work combines the flexibility of learned sub-systems with prior knowledge about task-dependent computations, incorporating prior intuition from geometric CV [32], [33], [34], [35]. One example is given by the integration of auxiliary training losses to address the common drift problem of relative pose estimation [36]. In addition to predicting accurate relative transforms, this objective can be applied to the integrated motion over multiple steps [32] to reduce accumulated drift. Furthermore, learning-based approaches provide the benefit of being independent of knowledge about (intrinsic camera) calibration, as distorted images can be used directly [37].

Absolute localisation - relative to a map instead of relative to our last position - commonly relies on such a map populated with features to localise against. Generally, in the context of current deep learning, most approaches utilise no explicit constraint on the type of computation [37], [38]; there are however notable exceptions [39]. Recent work aims at harnessing geometric prior knowledge to obtain more informative training objectives [40] (with a helpful survey under [41]).

Media: Direct Sparse Odometry [42]; Experience-Based Navigation [43]; RatSLAM [44]

Early work on neural approaches to the full SLAM problem [45] is given by Milford and Wyeth [44], taking inspiration from computational models of the hippocampus of rodents (-> RatSLAM). More recently, Tateno and colleagues [46] apply learning to solve a sub-problem of SLAM: using a convolutional neural network as depth estimator (see section above) to overcome shortcomings of monocular SLAM regarding absolute scale. Another approach to neural SLAM is taken in [31] by addressing another task within the SLAM system and providing a fast deep learning based point tracking system (bonus points for calling all modules MagicX). An extension of their work (and combination with homography estimation [47]) extends past synthetic data and outperforms various learned and non-learned point detector/descriptor baselines on a range of tasks [48]. A final approach to neural SLAM is given by Zhang et al. [39], who try to embed procedures mimicking those of traditional SLAM into the soft attention-based addressing of external memory architectures, in which the external memory acts as an internal representation of the environment. In essence, the authors aim at providing a network structure which inherently encourages learning SLAM-like procedures. However, the evaluation focuses on limited toy examples in simulation (looking forward to follow-up work).

To enable loop closures, place recognition addresses the recognition of previously visited locations based on their appearance and is a relevant part of the SLAM pipeline. This represents a general localisation sub-problem well-suited for learning-based approaches [49]. Commonly, it is targeted by comparing the feature representations of potential candidates [50], [49] (the latter being a differentiable adaptation of the VLAD image descriptor [51]). Finally, appearance change in our environment continues to be one of the most challenging aspects for learning as well as geometric approaches [52]. Work on obtaining weather or lighting invariant representations to address this challenge is summarised in a later paragraph.
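
The comparison of feature representations can be sketched as a nearest-neighbour lookup. The example below is an illustration only: the three-dimensional descriptors, the place names and the threshold are made up, and cosine similarity is one of several plausible choices of similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_match(query, database, threshold=0.9):
    """Return (place_id, similarity) of the closest stored place,
    or None if nothing is similar enough to propose a loop closure."""
    best_id, best_sim = None, -1.0
    for place_id, descriptor in database.items():
        sim = cosine_similarity(query, descriptor)
        if sim > best_sim:
            best_id, best_sim = place_id, sim
    if best_sim < threshold:
        return None
    return best_id, best_sim


# Hypothetical descriptors of previously visited places.
visited = {"kitchen": [1.0, 0.0, 0.0], "corridor": [0.0, 1.0, 0.0]}

match = best_match([0.9, 0.1, 0.0], visited)      # close to "kitchen"
no_match = best_match([1.0, 1.0, 1.0], visited)   # ambiguous, rejected
```

The threshold trades off missed loop closures against false ones, and the latter tend to be far more damaging for the resulting map.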

Predominantly, current combined methods for localisation focus on employing learning for sub-tasks which are only heuristically solved in traditional CV, as well as on embedding geometrically inspired structure and computations into the learned models.

Tracking and Prediction

In essence, most object tracking pipelines can be divided into two steps: prediction of all tracks based on a prediction model and creation as well as update of the tracks based on current measurements; the latter depending on accurate assignments between current measurements and existing tracks.

Focusing on multi-object scenarios, one of the principal challenges of tracking lies in the data association problem: knowing which of the current detections corresponds to which established track. The most common methods for high-frequency tracking pipelines rely on simple distance-based associations between predicted position and current detections. However, in cluttered environments and the context of occlusions, additional information such as appearance is required to enable accurate tracking. Learning-based approaches have been successfully applied here e.g. to metric learning for appearance-based entity re-identification for pedestrians [53]. By applying objectives such as contrastive [54], triplet [55] and magnet loss [56] these methods learn a metric space where different instances of the same type reside closer together. Further methods train models for direct appearance-based tracking [57], [58] (more details follow later).
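
As a small illustration of such an objective, the sketch below implements a triplet loss on hand-picked, hypothetical embeddings. In a real re-identification system the embeddings would come from a trained network and the loss would be averaged over many sampled triplets; the margin value here is arbitrary.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalise triplets where the positive is not closer to the anchor
    than the negative by at least the margin."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2D embeddings: two crops of the same pedestrian
# and one crop of a different pedestrian.
anchor = [0.0, 0.0]
same_person = [0.1, 0.0]
other_far = [3.0, 0.0]    # well separated: no penalty
other_near = [0.5, 0.0]   # too close: penalised

loss_easy = triplet_loss(anchor, same_person, other_far)
loss_hard = triplet_loss(anchor, same_person, other_near)
```

Minimising this loss pulls observations of the same entity together and pushes other entities away, which is what makes a simple distance threshold usable for association afterwards.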

To associate existing tracks with new detections or provide position updates at potentially higher frequency than the received detections, we need to predict future positions. A classic, the (extended/unscented) Kalman filter [59], [60], is actually sufficient in most common situations. The simple motion models often underlying these methods (e.g. constant velocity), though, become unreliable for long-term predictions and in cluttered environments. For accurate prediction of future trajectories, more information such as interactions with static scenery as well as other agents (cyclists, cars, pedestrians) needs to be considered.
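
For reference, a minimal linear Kalman filter with a constant-velocity motion model might look as follows. This is a one-dimensional sketch with position-only measurements; the class name, the diagonal process noise and all noise values are assumptions chosen for the example rather than tuned settings.

```python
class ConstantVelocityKF:
    """1D Kalman filter with state [position, velocity]."""

    def __init__(self, pos_var=10.0, vel_var=10.0, process_var=0.01, meas_var=0.1):
        self.p, self.v = 0.0, 0.0                   # state estimate
        self.P = [[pos_var, 0.0], [0.0, vel_var]]   # state covariance
        self.q = process_var                        # assumed diagonal process noise
        self.r = meas_var                           # measurement noise variance

    def predict(self, dt):
        """Propagate the state with x' = F x, F = [[1, dt], [0, 1]]."""
        self.p += dt * self.v
        (a, b), (c, d) = self.P
        # P' = F P F^T + Q, written out for the 2x2 case.
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        innovation = z - self.p
        s = a + self.r                    # innovation variance
        k_pos, k_vel = a / s, c / s       # Kalman gain
        self.p += k_pos * innovation
        self.v += k_vel * innovation
        # P = (I - K H) P, written out for the 2x2 case.
        self.P = [
            [(1 - k_pos) * a, (1 - k_pos) * b],
            [c - k_vel * a, d - k_vel * b],
        ]


# Track an object moving at 2 units per step from position measurements only.
kf = ConstantVelocityKF()
for t in range(1, 51):
    kf.predict(dt=1.0)
    kf.update(z=2.0 * t)
```

Calling `predict` several times without `update` gives the higher-frequency position estimates mentioned above, with the covariance growing at each step to reflect the missing measurements.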

Early work to represent these interactions with social force models [61] applies potential fields for modelling repulsive and attracting forces. Reciprocal Velocity Obstacles (RVO) were introduced as a computationally efficient extension with applications not only directly for prediction but also integrated into motion planning [62], [63], though the approach requires additional knowledge about the interacting entities. Recent work on learned predictive models employs deep neural networks to integrate information about static and dynamic environments [64], [65], [66]. In addition to flexibly utilising large quantities of raw sensor measurements, these approaches have been shown to be able to partially address the drift of trajectories by directly predicting multi-step sequences [66]. Notably, the KF itself has become a target for learning. Backprop KF [67] provides a fully differentiable architecture for state estimation which is evaluated on the KITTI visual odometry benchmark [5].

Media: Fully Convolutional Siamese Networks for Object Tracking [58]; Predict Actions to Act Predictably [68]; Deep Tracking (precursor to [69])

Given the perspective of robotics, we’re often interested not just in the prediction of tracks for known & detected objects but in the complete prediction of future states, including aspects we do not explicitly handle in the detection module. Addressing this challenge, a different angle to tracking is given by approaches like ‘Deep Tracking’ [69] which predict complete future sensor observations (e.g. LIDAR and camera) [70], [71], [72], [73], [74]. These methods bypass the data association problem as well as the general detection challenge and can provide redundancy for the prediction of future observations, such as occupancy grids. However, learning generative models for the prediction of complete sensor measurements has so far proven particularly challenging.

In general, the prediction of future motion, in particular other agents’ reactions, has great benefits for the following modules including motion planning [75], [68].

Planning and Control

Planning and control are the final components of our pipeline and the connecting modules to determine commands for actuation. A principal question for these modules is the type and source of supervision. While a significant share of currently deployed solutions builds on hand-crafted rules, learning provides a relevant alternative to prevent repeated hyperparameter and heuristic tuning for different environments and scenarios. One source of supervision can be reinforcement learning, which - while representing a multi-faceted topic of its own - still needs to overcome many real-world challenges and is simply too intricate to cover as just a side aspect of this post. This section mostly focuses on Learning from Demonstration, which provides supervision based on demonstrations of a task from human experts and other potential authorities.

Behavioural Cloning (BC) aims at directly mimicking expert behaviour to solve a task; essentially supervised optimisation of regression or classification models. BC can be integrated into existing pipelines to build on more abstract representations but most commonly has been investigated in the scenario of end-to-end learning based on raw inputs. These methods have been empirically demonstrated in constrained scenarios, e.g. for lane keeping [76], [77], [78]. Given its independence from the source of demonstration data, BC is not restricted to imitating human experts and can be applied with automatically generated trajectories [79].

However, this application of naively trained supervised models in a non-iid scenario comes with additional challenges: performed actions affect future input data, so small errors lead to the input distribution diverging from the training data, which commonly focuses on states along optimal trajectories, a phenomenon known as covariate shift. Model performance degrades, potentially shifting the input data even further from our training data distribution, causing a compounding of errors [80]. To prevent this result, we need to learn how to recover from suboptimal states, which are usually not part of given expert demonstrations. [76] addresses the problem in the context of lane keeping by synthetically generating off lane-centre states with additional cameras on the sides of the vehicle with corrected steering manoeuvres. One of the most common approaches is presented by DAgger [80] and extensions, which collect additional expert supervision during application of the model - however this also leads to increased efforts for providing supervision [81], [82], [83]. Finally, this type of end-to-end modelling is limited in terms of interpretability, and can require larger amounts of training data than modular or abstracted approaches [84].
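
The following toy sketch illustrates both the problem and the data-aggregation idea behind DAgger. Everything in it is hypothetical and chosen for brevity: a one-dimensional lateral offset as state, an expert that steers halfway back to the lane centre, and a nearest-neighbour learner. It also simplifies the published algorithm by always rolling out the learner alone instead of mixing in expert actions.

```python
def expert_action(x):
    """Hypothetical expert: steer halfway back to the lane centre (x = 0)."""
    return -0.5 * x


def step(x, action):
    """Toy dynamics: the action directly shifts the lateral offset."""
    return x + action


def make_policy(dataset):
    """1-nearest-neighbour learner: copy the action of the closest labelled state."""
    def policy(x):
        _, action = min(dataset, key=lambda pair: abs(pair[0] - x))
        return action
    return policy


def rollout(policy, x0, horizon):
    """Apply the policy for `horizon` steps and return all visited states."""
    states = [x0]
    for _ in range(horizon):
        states.append(step(states[-1], policy(states[-1])))
    return states


def dagger(x0_expert, x0_test, horizon=10, iterations=5):
    """Return the final |offset| of the learner before and after each iteration."""
    # Iteration 0 is plain behavioural cloning on one expert trajectory.
    expert_states = rollout(expert_action, x0_expert, horizon)
    dataset = [(x, expert_action(x)) for x in expert_states]

    final_offsets = []
    for _ in range(iterations):
        visited = rollout(make_policy(dataset), x0_test, horizon)
        final_offsets.append(abs(visited[-1]))
        # Query the expert on the states the learner actually visited
        # and aggregate the new labels into the training set.
        dataset.extend((x, expert_action(x)) for x in visited)

    last = rollout(make_policy(dataset), x0_test, horizon)
    final_offsets.append(abs(last[-1]))
    return final_offsets


# The expert only ever demonstrates small offsets; the learner starts far off-centre.
offsets = dagger(x0_expert=0.2, x0_test=2.0)
```

The first entry of `offsets` corresponds to pure behavioural cloning, which recovers only slowly because it has never seen large offsets; later entries benefit from expert labels on exactly those states, at the price of additional expert queries.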

Inverse Reinforcement Learning (IRL) presents another popular approach to address the problem of covariate shift - by blending supervised learning with reinforcement learning (RL) or planning to learn robust models. IRL aims to infer expert preferences by optimising a reward function that generates agent behaviour similar to the expert demonstrations - instead of an imitating policy. In the context of probabilistic models, it can be understood as optimising a model that maximises the probability of the expert’s trajectories [85] (part of this model is the RL agent or planning approach).
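
One common way to write down this probabilistic view is the maximum-entropy formulation, sketched here under the assumption of deterministic dynamics and a parametrised trajectory reward $R_\theta(\tau)$; the symbols are chosen for this illustration and the source does not commit to a specific model.

```latex
% Assumed model: trajectories become exponentially more likely with higher reward.
p(\tau \mid \theta) = \frac{\exp\big(R_\theta(\tau)\big)}{Z(\theta)}, \qquad
Z(\theta) = \sum_{\tau'} \exp\big(R_\theta(\tau')\big)

% Objective: maximise the log-likelihood of the N expert trajectories \tau_i.
\theta^\ast = \arg\max_\theta \; \frac{1}{N} \sum_{i=1}^{N} \log p(\tau_i \mid \theta)

% For a linear reward R_\theta(\tau) = \theta^\top f(\tau), the gradient of the objective is
\nabla_\theta = \frac{1}{N} \sum_{i=1}^{N} f(\tau_i)
  \; - \; \mathbb{E}_{\tau \sim p(\cdot \mid \theta)}\big[ f(\tau) \big]
```

The second term of the gradient is the expected feature count under the current reward, which is where the RL agent or planner enters: it has to be solved or sampled in every update.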

While BC only learns accurate behaviour for expert-visited states, IRL extends to states visited by the RL or planning step and learns corrective behaviour when diverging from the original trajectories. Furthermore, recent work directly utilises human domain knowledge [86] to define behaviour for states not sufficiently encountered by either. However, IRL in comparison to BC relies on an accurate systems model to simulate behaviour or the possibility to sample on the real system. If neither is possible we can turn, in the context of mobile robotics, to another approach based on supervised learning: training supervised segmentation models for traversable terrain [87], [88].

Though the IRL problem is underconstrained, as many reward functions are able to describe the same optimal behaviour, various approaches have introduced simple assumptions to address the degeneracy and derive efficient, practical solutions [85], [89], [90]. Based on these, impressive successes of IRL include learning artistic flying manoeuvres for an RC helicopter [91] and predicting future motion for traffic participants such as cars and pedestrians [92], [93], [94].

A major benefit of IRL-based methods lies in the integration into existing systems. By applying methods to learn cost (negated reward) functions for existing motion planning systems [95], [86], [96], we can directly integrate learned models into deployable systems, which can straightforwardly be tested and benchmarked against existing, hand-crafted cost functions.

Media: Cost Function Learning via IRL [86]; Terrain Classification [87]; IRL for Flying [91]

When treating the robot control problem as part of a multi-agent scenario, we aim to optimise our actions not just for internal goals but also for interaction and the internal goals of other agents. This approach gains relevance when the other agents are humans, as in most robot applications. Research on interpretability of models has a long-standing history and has recently gained increased attention based on the massive complexity and - more importantly - real world relevance of deep learning [99], [100], [101]. The key aspect for control design is the interpretability of robot behaviour, urging us to act predictably when directly interacting with humans [102]. When planning motions in the direct vicinity of humans, we benefit from providing non-verbal cues and human-like behaviour, enabling others to infer our driving style [103] and intentions [68], [104] to ensure convenient and stress-free interaction. Broader surveys on legible behaviour for human-robot interaction can be found in [105], [106].

Safety, Uncertainty and Introspection

The following sections present aspects which are independent of the modular pipeline structure addressed before and cover more general, cross-module concerns and potentials of learning from data in robotics.

Notably, relevant performance metrics in active, sequential decision making such as robotics and autonomous platforms differ from the metrics commonly benchmarked for machine learning (such as classification accuracy, precision, recall, etc.). First, not all mistakes are equal: making a mistake confidently can be much more harmful than demonstrating uncertainty about the situation (e.g. the class of a pointcloud - pedestrian versus paper-bag). False negatives and false positives have massively different relevance for our modelling decisions. Second, we are able to act conservatively in the context of uncertainty and additionally probe the environment by collecting more data instead of having to make a confident choice. Given the safety requirements for wide-scale application of autonomous vehicles [107], this approach is highly compelling.

To determine the necessity of conservative behaviour, we depend on knowledge about the model’s uncertainty for its predictions, which is commonly investigated as model introspection. [108] investigates SVMs, GPs and a number of other popular classification models (pre deep learning) and empirically supports the intuition that better introspection leads to improved decision making in the context of tasks such as autonomous driving or semantic map generation. Furthermore, the authors indicate that commonly used detection metrics of precision and recall can be insufficient for describing model performance in safety-critical applications.

Furthermore, commonly used pseudo-probabilities for introspecting model predictions, such as detection scores or softmax output, often do not suffice and supplemental uncertainty measures are required. Bayesian uncertainty modelling for deep learning attracted interest long before the current AI summer [109], [110], [111], [112], but has recently received more of the spotlight thanks to the application of deep learning in safety-critical environments. [113] investigates the use of aleatoric and epistemic uncertainty metrics in deep learning, which respectively describe noise inherent in the observations and model uncertainty. Notably, while we cannot easily affect observation noise on the software side, we can reduce model uncertainty by collecting more data. Both metrics are suited for different purposes. While aleatoric uncertainty becomes highly relevant in large scale applications where epistemic uncertainty can be neglected, epistemic uncertainty can be employed to determine covariate shift between training and application data. In addition, this detection of novel data can be addressed via generative models [114] (please check the related work section for a broader survey).
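
To give a feel for the distinction, here is a sketch of one common entropy-based decomposition for classification. It assumes we can draw several class-probability vectors for the same input, e.g. from an ensemble or repeated stochastic forward passes; this is an illustration and not necessarily the formulation used in [113].

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(sampled_probs):
    """Split predictive uncertainty into aleatoric and epistemic parts.

    sampled_probs: list of class-probability vectors, one per sampled model.
    """
    n = len(sampled_probs)
    mean_probs = [sum(column) / n for column in zip(*sampled_probs)]
    total = entropy(mean_probs)                            # entropy of the averaged prediction
    aleatoric = sum(entropy(p) for p in sampled_probs) / n  # average per-model entropy
    epistemic = total - aleatoric                          # disagreement between models
    return total, aleatoric, epistemic


# All sampled models agree that the input is ambiguous: observation noise.
noisy_input = [[0.5, 0.5], [0.5, 0.5]]
# Each sampled model is confident but they disagree: model uncertainty.
novel_input = [[1.0, 0.0], [0.0, 1.0]]

total_a, aleatoric_a, epistemic_a = decompose_uncertainty(noisy_input)
total_b, aleatoric_b, epistemic_b = decompose_uncertainty(novel_input)
```

Both inputs yield the same averaged prediction of 50/50, so a single softmax-style score cannot tell them apart; only the second case is something more training data could resolve.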

Media: Synthesizing Robust Adversarial Examples [115]; Bayesian SegNet (precursor work to [113]); Microsoft CarSim [116]

While the previously mentioned work aims at investigating and extending the capability of the model itself, a separate direction of research aims at redundancy and parallel streams of information to examine model predictions. Explicit introspection tools have been introduced for predicting the performance of perception modules [117], the whole pipeline from perception to planning [118] and control [119].

Finally, saliency detection presents another essential approach to investigate model predictions, identifying input sections of high relevance. These visualisations are essentially obtained by determining the minimal input changes required to change model predictions [120], [121], [122], [123].

In order to accurately understand strengths and weaknesses of our system and improve on the latter, we depend on thorough testing and refinement of our systems. Training and testing systems in simulation enables us to repeat and vary edge cases. In essence, it enables us to generate multiple orders of magnitude more driving data [124], [125], [126] at different granularities. Various datasets and simulators are openly available for research [127], [116], [128], [129], [130], [131], [132] with a more comprehensive list given here. While current simulators become increasingly accurate, the reality gap still persists and emphasises the potential of related work on transfer learning and domain adaptation.

A last bit on safety: Just like every piece of software, learning-based approaches have their vulnerabilities and can be fooled. New types of adversarial attacks [133] - ways to mess with the input data to fool the model - as well as methods for defence have gained increased attention in the past years. Most impressively, recent work presents more general approaches and has shown adversarial examples with invariance with respect to 3D viewpoint [115] and attacked model [134]. Furthermore, [135] demonstrates the possibility of attacks on the training data set. Interesting work on defence against adversaries includes input transformations [136] and different encodings [137] (these references do not include anything past late 2017, not for lack of interest but lack of time - though there have been many recent improvements).

Knowledge Representation and Efficient Models

Applications on mobile platforms and embedded systems have increased the demand for computationally efficient systems with reduced memory footprint aiming at on-chip rather than off-chip placement. The situation has led to reductions of over an order of magnitude in parameters and FLOPS, and a corresponding increase in possible frame-rate, compared to previous state-of-the-art models with only limited reduction in accuracy [138], [139], [140]. Notably, basic building blocks such as convolutions have been adapted via the introduction of asymmetric [141] and separable convolutions [142] to increase parameter efficiency.
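
A quick back-of-the-envelope calculation shows where the parameter savings come from. The sketch assumes that 'separable' refers to depthwise separable convolutions (a per-channel spatial filter followed by a 1x1 pointwise convolution), ignores biases, and uses layer sizes picked purely for illustration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights of a standard convolution: one k x k filter per input/output channel pair."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights of a depthwise k x k convolution plus a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Illustrative layer: 3x3 kernel, 128 input channels, 256 output channels.
standard = standard_conv_params(3, 128, 256)           # 294912
separable = depthwise_separable_params(3, 128, 256)    # 33920
reduction = standard / separable                       # roughly 8.7x fewer weights
```

For 3x3 kernels and wide layers the reduction approaches a factor of nine per layer, in line with the order-of-magnitude savings mentioned above.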

Work on model compression based on pruning, trained quantization and Huffman coding was able to reduce the memory footprint of existing architectures [143], [144]. Furthermore, instead of directly compressing the model, knowledge distillation enables the training of smaller models via mimicking the predictions of large state-of-the-art architectures without their computational footprint [145], [146]. The underlying intuition is that the logits extracted by the more powerful network include more information than the hard one-hot encodings, in particular about relations between classes. Recently, it has additionally been shown that distillation between networks of the same architecture can increase performance [147].
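
The role of the teacher's logits can be illustrated with a temperature-softened softmax and a cross-entropy between teacher and student. This is a minimal sketch of the general idea; the logits and the temperature are made-up values, and practical setups usually combine this term with the ordinary loss on the hard labels.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; higher temperatures give softer distributions."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student predictions."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


# Hypothetical logits for the classes (car, truck, pedestrian).
teacher = [5.0, 2.0, 0.1]

hard_label = [1.0, 0.0, 0.0]          # one-hot: says nothing about class relations
soft_target = softmax(teacher, 2.0)   # keeps 'truck is more car-like than pedestrian'

loss_matching = distillation_loss([5.0, 2.0, 0.1], teacher)
loss_mismatch = distillation_loss([0.1, 2.0, 5.0], teacher)
```

The soft target assigns visibly more mass to the second class than to the third, which is exactly the relational information a one-hot label throws away.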

Finally, a role for learning can be found in reducing computation and time requirements [79]. In this context, one perspective on AlphaGoZero [148] covers the aspect of learning to imitate more expensive computations (here: Monte Carlo Tree Search), which enables the final, trained program to run faster and with lower computational requirements. While the version of AlphaGo that bested Lee Sedol had an estimated power consumption of approximately 1 MW (roughly 50,000 times the power required by a human brain), AlphaGoZero (which beat the previous version 100-0) uses an order of magnitude less compute.

AlphaGo Power Consumption (source: businessinsider.com) Learning to Prune in CNNs [144]

Transfer, Multimodal, and Representation Learning

Robotic platforms perceive their environment through a multitude of different sensors. Learning can help to combine and analyse this flood of information. Integrated training with different sensing modalities enables us to capture joint distributions and to deploy with restricted access to our sensor setup [149], [150], [151], and it increases robustness to sensor failure [152].

An additional result of these high-throughput sensor setups is the generation of massive amounts of unlabelled data, exceeding our capability for manual dense annotation. Nonetheless, we can benefit from this overwhelming amount of data by splitting the problem in two: first, unsupervised learning of a representation that reduces the amount of supervised data we require, and second, supervised learning of the final mapping. By reducing the requirements for human annotation, representation learning (unsupervised learning) has significant potential, though it has not yet found the same commercial success as supervised approaches. One of our challenges is that ‘we [often] don’t know what’s a good representation’ [153]. Essentially, we’re only able to benchmark in the context of surrogate tasks such as reconstruction accuracy or the performance of subsequent classifier modules.

Various approaches aim at finding relevant representations based on such proxy objectives; the aspect common to most approaches is the prediction or verification of structure - spatial and temporal. Predicting spatial structure includes the relative location of image patches [154], the order of shuffled image patches [155], image inpainting [156], or employing various foundational [supervised] proxy 3D tasks for learning a generic 3D representation [157]. The recent increase in compute capacity enables the extension of these ideas from the spatial to the temporal domain, thereby facilitating the use of temporal consistency and structure to learn representations from videos (e.g. [158]). Examples operate by verifying the temporal order of sequences [159], predicting future frame representations [160], or by predicting low-level motion-based clustering [23]. For much better formed views on representation learning I’d point to this survey [161], this post on predictive versus representation learning [162], or this post on recent work and extensions to find training signals for RL [163].
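To illustrate how such a proxy objective produces labels for free, here is a minimal sketch of sampling training pairs for a relative patch location task: a centre patch, one of its eight neighbouring patches, and the neighbour index as the label. Patch size, the absence of gaps or jitter between patches, and all function names are my own simplifying assumptions, not details from [154].

```python
import numpy as np

# (row, col) offsets of the eight neighbours of a centre patch; the list index is the label.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def sample_patch_pair(image, patch_size, rng):
    """Return (centre_patch, neighbour_patch, label) sampled from an image array.

    The image needs to be at least 3 * patch_size pixels in height and width.
    """
    h, w = image.shape[:2]
    # Pick the centre patch so that all eight neighbours lie inside the image.
    row = int(rng.integers(patch_size, h - 2 * patch_size + 1))
    col = int(rng.integers(patch_size, w - 2 * patch_size + 1))
    label = int(rng.integers(len(NEIGHBOUR_OFFSETS)))
    d_row, d_col = NEIGHBOUR_OFFSETS[label]
    n_row, n_col = row + d_row * patch_size, col + d_col * patch_size
    centre = image[row:row + patch_size, col:col + patch_size]
    neighbour = image[n_row:n_row + patch_size, n_col:n_col + patch_size]
    return centre, neighbour, label


# Hypothetical stand-in for a camera image.
rng = np.random.default_rng(0)
image = np.arange(96 * 96).reshape(96, 96)
centre, neighbour, label = sample_patch_pair(image, patch_size=16, rng=rng)
```

A classifier trained to predict the label from the two patches never sees a human annotation; the supervision comes entirely from the spatial structure of the image.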

High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs [164] Deep Multispectral Semantic Scene Understanding [151]

Deep Photo Style Transfer [165]





One particularly relevant type of representation learning in this context is the field of unsupervised domain adaptation, which aims to induce domain-invariant embeddings so as to increase the performance for a task in domains without annotated data. The metric for evaluation is clearly defined in this case by the supervised task. Similar to various approaches emphasised in this post, the benefit of domain-invariant representations lies in addressing covariate shift, the encounter of data outside our supervised training distribution during deployment of the model. Essentially, while acquiring widespread training and validation data is most beneficial [167], the effort is often impractical due to the expense and the challenge of considering all potential conditions. Recent approaches aim at empowering domain adaptation methods via deep models [168], [169], [170], [171], [172], and extend further to adaptation in continually changing environments [173]. In addition to domain adaptation in the feature space of a model, transforming images between different domains has become a promising direction [170], [174], [175], [164], [176]. Of course, fine-tuning of pretrained models [177], [178] from different domains or different tasks remains nearly [179] always helpful if small amounts of supervised data are available for the application domain.


Challenges and Potentials

After this short review of current work on learning in different modules as well as at their intersections, this section concludes with promising directions and potentials as well as some challenges ahead.

Bottom-Up

The principal strength of learning lies in applications in direct, reactive perception problems including classification, detection, segmentation, and related tasks. State-of-the-art models provide high accuracy with comparably little encoded structure (other than e.g. convolutions).

While there is no strong foundation for believing that localisation tasks are inherently much harder (or easier) to learn than e.g. object detection, geometric approaches simply provide a much stronger baseline in this field [5], [180]. Traditionally, CV addresses localisation in a highly structured way by utilising geometric knowledge about the mathematical rules underlying particular sub-problems (triangulation, homography estimation, etc.). Instead of aiming to formalise rules for complete mappings from images to relative or absolute poses, odometry or SLAM systems extract and match features across sequences or between the current perception and the constructed map (while constructing said map), utilising prior knowledge about geometric properties, and improving pose graphs on the back-end, e.g. via filtering or bundle adjustment [45]. However, the first generation of learning-based methods addressed the problem with little to no structure, learning end-to-end mappings based on pre-existing, annotated datasets (e.g. [37]). Given infinite data covering the complete distribution of interest and a flexible enough model, this approach can perform really well, but the reality is often more constrained.

By combining both directions, the strengths of geometric methods can contribute substantially towards more robust and reliable approaches. Various aspects of the task can be solved to perfect accuracy given geometric knowledge, while other aspects are affected by real-world noise and can benefit from learning from data. Components that are likely to be the focus of improvement via machine learning in future work include feature extraction, e.g. for loop closure detection and re-localisation, or better point descriptors for sparse SLAM methods. Finally, deep learning can greatly improve the quality of map semantics - i.e. going beyond poses or pointclouds to a more complete understanding of the types and functionalities of objects or regions in the map. One particularly relevant direction lies in improved robustness (for example through better handling of dynamic objects and environmental changes [181]). Learning will additionally be beneficial when the assumptions underlying traditional approaches are invalid or when we require faster methods [40]. Back at ICCV 2015, the question of whether deep learning would replace geometric CV for SLAM might have been received with significant scepticism [182] (and might still be); however, most research has actually not tried to replace geometry but instead to enhance and augment it, only learning the parts of the system where we cannot provide exact prior structure.

Commonly, the accuracy of perception and localisation systems can represent a bottleneck for the overall performance and safety of a system, as they provide the foundation for the rest of the pipeline, which emphasises the relevance of even minute improvements. Among the direct consumers of their output are the tracking and prediction modules. Traditional tracking approaches (e.g. EKF, UKF) often provide reasonable solutions in robotics. These methods are straightforward to implement and fully suffice as long as tracking itself does not become the bottleneck. Applications in more complex, cluttered environments, where data association becomes more challenging, represent a prime example of the benefits of learning for appearance-based tracking [57], [183], [184], [58]. Furthermore, densely populated scenarios often lead to more intricate interactions. Learned, elaborate interactive motion models enable us to predict motion with higher accuracy in these environments and can be incorporated into existing trackers.
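To give a sense of how little code such a traditional tracker needs, here is a minimal sketch of a plain linear Kalman filter with a constant-velocity motion model in one dimension. It is the linear special case rather than an EKF or UKF, and the noise variances and initialisation are assumptions picked for illustration.

```python
import numpy as np


def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman filter."""
    # Predict the state and its covariance one time step ahead.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the measurement z.
    innovation = z - H @ x_pred
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new


def track_1d(measurements, dt=1.0, process_var=1e-3, meas_var=0.25):
    """Track [position, velocity] from position-only measurements."""
    F = np.array([[1.0, dt], [0.0, 1.0]])   # constant-velocity motion model
    H = np.array([[1.0, 0.0]])              # only the position is observed
    Q = process_var * np.eye(2)
    R = np.array([[meas_var]])
    x = np.array([measurements[0], 0.0])    # start at the first measurement, zero velocity
    P = np.eye(2)
    estimates = []
    for z in measurements:
        x, P = kalman_step(x, P, np.array([z]), F, H, Q, R)
        estimates.append(x.copy())
    return np.array(estimates)


# Hypothetical object moving at 2 units per time step.
estimates = track_1d([2.0 * t for t in range(50)])
```

Learned components would typically enter such a tracker through the measurement side (e.g. appearance-based data association) or the motion model, while the filter equations themselves stay fixed.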

Traditionally, motion planning, at least for deployed platforms, has been one of the fields more resistant to learning-based approaches (in particular in safety-critical applications). Planning approaches represent structured procedures for reasoning [185], [186], utilising knowledge about e.g. kinematic and dynamic constraints, as well as the geometric extent of platform and obstacles. As with localisation, parts of the planning computation are accurately modelled via known structure and equations (e.g. collision checking). In this context, learning focuses on more intuitive aspects, e.g. improving prediction in interactive scenarios - such as highway lane merging - which requires predicting the reaction of other cars to potential actions. In essence, the more interactive and intuitive parts of driving, which are less governed by strict, easy-to-define rules, present opportunities for learning from data, where these forms of common sense and intuition are too complex for manual rules [187]. Recent work on imitation learning for driving, for example, outsources high-level planning and takes additional commands as input [188] to focus learning on what is more straightforward to learn. This approach does not learn to plan but essentially learns a reactive controller (based on raw images) and presents another example of merging learning with existing systems.

Route planning, as a topological, high-level planning process, is commonly addressed via graph search (A* and friends). While there has been research on learning these kinds of programs as an end-to-end approach in limited scenarios [189], [190], given current applications, the existing algorithms do not represent a bottleneck. However, it can be expected that the costs associated with edges and nodes of the route graph are well-suited for estimation from data.
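The split between a fixed search algorithm and data-driven costs can be sketched as follows: the A* routine stays untouched, while the edge costs are plain numbers that could just as well come from a learned estimator. The toy graph, its costs and the function name are hypothetical; with the default zero heuristic the search reduces to Dijkstra's algorithm.

```python
import heapq


def a_star(graph, start, goal, heuristic=lambda node: 0.0):
    """Search a graph given as {node: {neighbour: cost}}.

    Returns (total_cost, path), or (inf, []) if the goal is unreachable.
    The heuristic must never overestimate the remaining cost.
    """
    frontier = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best_cost.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route to this node was found
        for neighbour, edge_cost in graph.get(node, {}).items():
            new_cost = cost + edge_cost
            if new_cost < best_cost.get(neighbour, float("inf")):
                best_cost[neighbour] = new_cost
                priority = new_cost + heuristic(neighbour)
                heapq.heappush(frontier, (priority, new_cost, neighbour, path + [neighbour]))
    return float("inf"), []


# Hypothetical road graph; the costs stand in for e.g. estimated travel times.
roads = {
    "A": {"B": 4.0, "C": 2.0},
    "B": {"D": 5.0},
    "C": {"B": 1.0, "D": 8.0},
    "D": {},
}
cost, path = a_star(roads, "A", "D")

# The same search with an updated cost estimate for the edge C -> B.
congested = {**roads, "C": {"B": 5.0, "D": 8.0}}
cost_congested, path_congested = a_star(congested, "A", "D")
```

Changing a single estimated cost changes the returned route without any change to the planner itself, which is what makes the costs a natural place to plug in learning.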

Focusing on the incorporation of learning for planning and control into existing modular software pipelines, another principal application lies in the characterisation of traversability and obstacles as well as the prediction of the reactions of dynamic obstacles (known e.g. as pedestrians). Hand-crafted cost functions for different kinds of terrains are commonly designed to help bridge the gap between perception and action and reduce the complexity of our environment representation to focus on the aspects we care about. Learning cost functions for driving like human experts is addressed via a sub-field of learning from demonstration [96], [86]. Furthermore, similar techniques can be applied to learn the prediction of reactive behaviours - interweaving planning and prediction models for dynamic environments [68]. Finally, in addition to manually defined cost functions, planning and reasoning systems commonly include many other heuristics, parameters determined during deployment to work in tested scenarios. Here, learning can play a major role in determining general rules on how to turn these knobs to adapt to new environments or different user preferences.

Conditional Imitation Learning [188] Cognitive Mapping and Planning [97]

In the context of safety-critical applications, learning is suitable for the generation of parallel systems for redundancy and additional checks, modules that replicate functionality and enable the second-guessing of decisions. Optimally, the application of multiple redundant modules is not restricted to verification but can culminate in a framework for learning from disagreement, where disagreements between modules are not only detected but feed back into the optimisation of the overall system, such as through adaptation of a module’s uncertainty for future predictions. A recent example of this type of framework is given by Pei et al. [191]. The authors devise an approach for sets of networks, retraining the modules that disagree with majority decisions. The underlying assumption, that the majority will always be right, is critical for the success of the approach. Their training process aims at balancing two objectives: maximising the number of active neurons and triggering as many conflicts between the modules as possible. This objective is interestingly similar to basic ideas from software testing that aim to maximise code coverage. Variations of the idea that adapt uncertainties and take module uncertainty into account for a weighted majority represent promising further directions. Furthermore, it will be generally beneficial to address the transfer of various other concepts and paradigms which have proven successful and indispensable in software engineering.
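The bookkeeping behind such redundancy checks is simple, as the following sketch shows: a majority vote that also reports which modules dissent, plus a confidence-weighted variant in the spirit of the weighted majority mentioned above. This is not the method of Pei et al.; module names, labels and confidence values are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """predictions: {module_name: label}. Returns (winning_label, dissenting_modules).

    Ties are resolved in favour of the label that appears first in the dict.
    """
    counts = Counter(predictions.values())
    winner, _ = counts.most_common(1)[0]
    dissenters = sorted(name for name, label in predictions.items() if label != winner)
    return winner, dissenters


def weighted_vote(predictions, confidences):
    """Sum each module's confidence per label and return the highest-scoring label."""
    scores = {}
    for name, label in predictions.items():
        scores[label] = scores.get(label, 0.0) + confidences[name]
    return max(scores, key=scores.get)


# Hypothetical outputs of three redundant perception modules.
predictions = {"net_a": "pedestrian", "net_b": "pedestrian", "net_c": "cyclist"}
confidences = {"net_a": 0.4, "net_b": 0.3, "net_c": 0.9}

winner, dissenters = majority_vote(predictions)
weighted_winner = weighted_vote(predictions, confidences)
```

In this example the plain majority picks "pedestrian" and flags net_c for retraining, whereas the weighted vote sides with the single confident module, which illustrates why the assumption that the majority is right deserves scrutiny.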


Top-Down

Machine learning (in particular deep learning) has the capability to extract rules from massive amounts of data and the benefit of high flexibility: merging of arbitrary objectives, Lego-like capabilities for the reuse and combination of models. On the other hand, we have accurate mathematical formulations and programs to solve specific sub-problems, e.g. for localisation and planning. Modern deep learning eases the integration of fixed and learned modules, enabling us to stand on the shoulders of giants from both fields and build on known solutions.

While potentially trained as independent modules, the overall trajectory goes towards combinations which can be optimised as a complete system via deterministic gradients as well as - if required - various stochastic gradient estimators (REINFORCE (or likelihood-ratio) trick [192], evolution strategies [193], [194], continuous relaxations such as Gumbel-Softmax or Concrete distribution and extensions [195], [196], [197]).
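As a small illustration of the first of these estimators, the sketch below applies the likelihood-ratio (REINFORCE) trick to a categorical distribution parameterised by logits and compares the sampled estimate with the exact gradient, which is available in closed form for this toy case. The reward values, logits and sample count are hypothetical, and the snippet does not cover evolution strategies or the continuous relaxations.

```python
import numpy as np


def softmax(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()


def exact_gradient(theta, f):
    """Gradient of E_{x ~ Cat(softmax(theta))}[f(x)] with respect to theta."""
    p = softmax(theta)
    return p * (f - p @ f)


def reinforce_gradient(theta, f, n_samples, rng):
    """Monte Carlo estimate of the same gradient via the likelihood-ratio trick."""
    p = softmax(theta)
    samples = rng.choice(len(p), size=n_samples, p=p)
    # For a softmax-parameterised categorical, grad log p(x) = one_hot(x) - p.
    score = np.eye(len(p))[samples] - p
    return (f[samples][:, None] * score).mean(axis=0)


# Hypothetical logits and per-outcome rewards.
theta = np.array([0.5, -0.2, 0.1])
f = np.array([1.0, 3.0, 2.0])
rng = np.random.default_rng(0)

exact = exact_gradient(theta, f)
estimate = reinforce_gradient(theta, f, n_samples=200_000, rng=rng)
```

The estimator only needs samples and their rewards, never the gradient of f itself, which is why it remains applicable when a module in the pipeline is not differentiable; the price is variance, hence the large sample count here.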

Combinations of both approaches can provide significant advantages via redundant systems and often complementary properties. In essence, we aim to take the best of both worlds when merging two systems; similarly to how automation via ML aims to adapt and enhance job responsibilities [198] by addressing tasks complementary to human strengths. Ongoing directions include optimising input data or correcting the output of traditional programs. Examples for input improvement include learning image enhancement networks for traditional visual odometry methods [199]; output refinement includes pose correction updates for visual localisation [200] and refining dense reconstructions [201] as well as hand-crafted cost maps for motion planning [86].

Similarly, pure learning-based approaches benefit from incorporating prior knowledge about the underlying structure: implicit and explicit translation invariance [202], [203], objectness [204], temporal structure [205], planning procedures such as value iteration [98], further geometric properties [206], [207], structure that encourages SLAM-like computations [39] and access to SLAM - location and map - information for reinforcement learning [208]. Notably, the incorporation of structure can, under specific circumstances, even help when the incorporated models are inaccurate [209].

The combination of prior geometric knowledge and the flexibility of learning enables the reformulation of geometric properties to create self-supervised objectives. Examples are given by [213], [214], which utilise predictions for depth, segmentation masks and poses to differentiably warp frames in time to match image sections. In addition to training with externally supervised labels, these approaches enable self-supervised learning, e.g. via reprojection photometric errors. Lastly, the use of data augmentation represents the application of prior knowledge, structuring our wanted invariances via randomising the relevant aspects in our training data. A related survey on limits and potentials of (deep) learning in robotics can be found in arXiv:1804.06557.
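A reprojection photometric error of the kind mentioned above can be sketched with a pinhole camera model: every target pixel is back-projected with its depth, moved into the source camera frame, projected again and compared with the source image at that location. The version below uses nearest-neighbour sampling and plain NumPy, so it is not differentiable as written; the intrinsics, image size and function name are assumptions for illustration and do not reproduce the implementations in [213], [214].

```python
import numpy as np


def photometric_error(target, source, depth, K, T_target_to_source):
    """Mean absolute intensity difference after warping via depth and relative pose.

    target, source: (h, w) grayscale images; depth: (h, w) depth of the target pixels;
    K: 3x3 camera intrinsics; T_target_to_source: 4x4 rigid transform.
    """
    h, w = target.shape
    v, u = np.mgrid[0:h, 0:w]
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])   # 3 x N homogeneous pixels
    points = (np.linalg.inv(K) @ pixels) * depth.ravel()        # back-project into 3D
    points_h = np.vstack([points, np.ones(h * w)])
    points_src = (T_target_to_source @ points_h)[:3]            # move into the source frame
    proj = K @ points_src
    z = proj[2]
    in_front = z > 1e-6
    z_safe = np.where(in_front, z, 1.0)
    u_src = np.rint(proj[0] / z_safe).astype(int)
    v_src = np.rint(proj[1] / z_safe).astype(int)
    valid = in_front & (u_src >= 0) & (u_src < w) & (v_src >= 0) & (v_src < h)
    if not valid.any():
        return float("nan")
    warped = source[v_src[valid], u_src[valid]]
    return float(np.abs(warped - target.ravel()[valid]).mean())


# Hypothetical camera and scene: a fronto-parallel plane at depth 10.
K = np.array([[100.0, 0.0, 16.0], [0.0, 100.0, 12.0], [0.0, 0.0, 1.0]])
rng = np.random.default_rng(0)
target = rng.random((24, 32))
depth = np.full((24, 32), 10.0)

# A translation of 0.2 along x shifts the image by fx * 0.2 / 10 = 2 pixels.
T = np.eye(4)
T[0, 3] = 0.2
source = np.roll(target, 2, axis=1)

error_correct_pose = photometric_error(target, source, depth, K, T)
error_wrong_pose = photometric_error(target, source, depth, K, np.eye(4))
```

With the correct depth and pose the warped source matches the target and the error vanishes, whereas a wrong pose leaves a residual; in a learned system this residual is the training signal, and no manual labels are involved.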

When moving from manually defining features and computations to designing the most efficient structure to learn into, one question arises naturally: why not learn everything (the architecture [215], [216], the optimiser [217], [218], or complete programs [219])? However, the required investments in data hygiene and annotation for many applications with potential for real-world impact often render it more efficient, in terms of human effort, to port our prior knowledge into algorithmic structure. Leslie Kaelbling formulated this well during a panel session at CoRL2017: ‘What structure can we build in that does not obstruct learning?’. The point is twofold, concerning both the model and the optimisation procedure. Whether structure is a necessary good or a necessary evil might be up for discussion [220], [210], [211], [212], but for now, practically, it is necessary, as are the advantages of learning.

Building autonomous platforms, like addressing any other sufficiently complex and versatile software problem, results in a significant effort for systems engineering and iterative testing and refinement. The relative emphasis on learning or traditional programming blocks narrows down to the required effort and efficiency when creating reliable, safe and generalisable systems with either approach as well as the potential benefits of combination.

SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning [214] Driven to Distraction - deep learning for filtering dynamic objects [181] Depth and Motion Network for Learning Monocular Stereo [26]


Acknowledgements

Over the last few years, discussion and collaborations directly and indirectly helped in shaping this post; with many colleagues and friends at the Oxford Robotics Institute, DeepMind and at BAIR. With respect to this post, I’m particularly thankful to Alex Bewley and Ankur Handa for thorough reading and feedback on earlier drafts.

I had the privilege of participating in a small percentage of the referenced work, none of which could have been done without an amazing group of collaborators including Ingmar Posner, Dushyant Rao, Alex Bewley, Dominic Zeng Wang, Peter Ondruska, Dave Held, Carlos Florensa, Pieter Abbeel, Kyriacos Shiarlis, Sasha Salter and Shimon Whiteson.

Even though not involved in chats about this post or previous publications, many others have been involved indirectly in shaping ideas: Julie Dequaire, Corina Gurau, Jeff Hawke, Martin Engelcke, Adam Kosiorek, Fabian Fuchs, Oliver Groth, Abishek Gupta, Martin Riedmiller, Raia Hadsell, Jonas Buchli, Larry Zitnick, Anca Dragan, Nathan Benaich, Shubho Sengupta and many others. However, it is impossible to provide a complete list here. (Obvious disclaimer: the final version of this post does not represent the opinion of any previous or current employer or colleague.)

[1]B. Salesky, “A Decade after DARPA: Our View on the State of the Art in Self-Driving Cars (Bryan Salesky,ArgoAI).” https://medium.com/self-driven/a-decade-after-darpa-our-view-on-the-state-of-the-art-in-self-driving-cars-3e8698e6afe8 , 2017. [2]Baidu, “BAIDU Apollo.” http://apollo.auto/ , 2017. [3]Udacity, “Udacity Selfdriving Car Project.” https://github.com/udacity/self-driving-car , 2017. [4]J. Janai, F. Güney, A. Behl, and A. Geiger, “Computer Vision for Autonomous Vehicles: Problems, Datasets and State-of-the-Art,” ArXiv e-prints, 2017. [5]A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets Robotics: The KITTI Dataset,” International Journal of Robotics Research (IJRR), 2013. [6]M. Cordts et al., “The Cityscapes Dataset for Semantic Urban Scene Understanding,” Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [7]R. Benenson, “CV Benchmarks - Last update 2016.” http://rodrigob.github.io/are_we_there_yet/build/ , 2016. [8]“A year in computer vision.” http://www.themtank.org/a-year-in-computer-vision , 2018. [9]O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015. [10]A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012. [11]M. Cordts et al., “The cityscapes dataset for semantic urban scene understanding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3213–3223. [12]J. Ren et al., “Accurate Single Stage Detector Using Recurrent Rolling Convolution,” arXiv preprint arXiv:1704.05776, 2017. [13]Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection,” ArXiv e-prints, 2016. [14]J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger,” ArXiv e-prints, 2016. [15]X. 
Chen, H. Ma, J. Wan, B. Li, and T. Xia, “Multi-View 3D Object Detection Network for Autonomous Driving,” ArXiv e-prints, 2016. [16]M. Engelcke, D. Rao, D. Zeng Wang, C. Hay Tong, and I. Posner, “Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks,” ArXiv e-prints, 2016. [17]T. Hackel, N. Savinov, L. Ladicky, J. D. Wegner, K. Schindler, and M. Pollefeys, “SEMANTIC3D.NET: a New Large-Scale Point Cloud Classification Benchmark,” ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, 2017. [18]C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,” ArXiv e-prints, 2016. [19]C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation.” [20]C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in Neural Information Processing Systems, 2017, pp. 5105–5114. [21]K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” ArXiv e-prints, 2017. [22]A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez, “A Review on Deep Learning Techniques Applied to Semantic Segmentation,” ArXiv e-prints, 2017. [23]D. Pathak, R. Girshick, P. Dollár, T. Darrell, and B. Hariharan, “Learning Features by Watching Objects Move,” ArXiv e-prints, 2016. [24]D. Scharstein, R. Szeliski, and H. Hirschmüller, “Middlebury Stereo Vision.” http://vision.middlebury.edu/stereo/ , 2001. [25]C. Godard, O. Mac Aodha, and G. J. Brostow, “Unsupervised Monocular Depth Estimation with Left-Right Consistency,” in CVPR, 2017. [26]B. Ummenhofer et al., “DeMoN: Depth and Motion Network for Learning Monocular Stereo,” ArXiv e-prints, 2016. [27]Y. Kuznietsov, J. Stückler, and B. Leibe, “Semi-Supervised Deep Learning for Monocular Depth Map Prediction,” ArXiv e-prints, 2017. [28]R. Garg, G. Carneiro, and I. 
Reid, “Unsupervised CNN for single view depth estimation: Geometry to the rescue,” in European Conference on Computer Vision, 2016, pp. 740–756. [29]D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and L. Van Gool, “Fast Scene Understanding for Autonomous Driving,” ArXiv e-prints, 2017. [30]A. Seff and J. Xiao, “Learning from Maps: Visual Common Sense for Autonomous Driving,” ArXiv e-prints, 2016. [31]D. DeTone, T. Malisiewicz, and A. Rabinovich, “Toward Geometric Deep SLAM,” ArXiv e-prints, 2017. [32]S. Pillai and J. J. Leonard, “Towards Visual Ego-motion Learning in Robots,” ArXiv e-prints, 2017. [33]S. Wang, R. Clark, H. Wen, and N. Trigoni, “DeepVO: Towards End-to-End Visual Odometry with Deep Recurrent Convolutional Neural Networks,” ArXiv e-prints, 2017. [34]M. Turan, Y. Almalioglu, H. Araujo, E. Konukoglu, and M. Sitti, “Deep EndoVO: A Recurrent Convolutional Neural Network (RCNN) based Visual Odometry Approach for Endoscopic Capsule Robots,” ArXiv e-prints, 2017. [35]L. Carlone and S. Karaman, “Attention and Anticipation in Fast Visual-Inertial Navigation,” ArXiv e-prints, 2016. [36]D. Nistér, O. Naroditsky, and J. Bergen, “Visual odometry,” in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, 2004, vol. 1, pp. I–I. [37]A. Kendall, M. Grimes, and R. Cipolla, “Posenet: A convolutional network for real-time 6-dof camera relocalization,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2938–2946. [38]E. Brachmann, F. Michel, A. Krull, M. Ying Yang, S. Gumhold, and carsten Rother, “Uncertainty-Driven 6D Pose Estimation of Objects and Scenes From a Single RGB Image,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [39]J. Zhang, L. Tai, J. Boedecker, W. Burgard, and M. Liu, “Neural SLAM,” ArXiv e-prints, 2017. [40]A. Kendall and R. 
Cipolla, “Geometric loss functions for camera pose regression with deep learning,” arXiv preprint arXiv:1704.00390, 2017. [41]A. Kendall, “Reprojection Losses: Deep Learning Surpassing Classical Geometry in Computer Vision?” https://alexgkendall.com/computer_vision/Reprojection_losses_geometry_computer_vision/, 2017. [42]J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. [43]W. Churchill and P. Newman, “Experience-based navigation for long-term localisation,” The International Journal of Robotics Research, vol. 32, no. 14, pp. 1645–1661, 2013. [44]M. J. Milford, G. F. Wyeth, and D. Prasser, “RatSLAM: a hippocampal model for simultaneous localization and mapping,” in Robotics and Automation, 2004. Proceedings. ICRA’04. 2004 IEEE International Conference on, 2004, vol. 1, pp. 403–408. [45]cometlabs, “SLAM Systems Overview.” https://blog.cometlabs.io/teaching-robots-presence-what-you-need-to-know-about-slam-9bf0ca037553 , 2017. [46]K. Tateno, F. Tombari, I. Laina, and N. Navab, “CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction,” arXiv preprint arXiv:1704.03489, 2017. [47]D. DeTone, T. Malisiewicz, and A. Rabinovich, “Deep Image Homography Estimation,” ArXiv e-prints, 2016. [48]D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperPoint: Self-Supervised Interest Point Detection and Description,” ArXiv e-prints, 2017. [49]R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “NetVLAD: CNN architecture for weakly supervised place recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5297–5307. [50]N. Sunderhauf et al., “Place recognition with convnet landmarks: Viewpoint-robust, condition-robust, training-free,” Proceedings of Robotics: Science and Systems XII, 2015. [51]H. Jégou, M. Douze, C. Schmid, and P. 
Pérez, “Aggregating local descriptors into a compact image representation,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, 2010, pp. 3304–3311. [52]S. Lowry et al., “Visual place recognition: A survey,” IEEE Transactions on Robotics, vol. 32, no. 1, pp. 1–19, 2016. [53]S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person Re-Identification by Local Maximal Occurrence Representation and Metric Learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. [54]R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in Computer vision and pattern recognition, 2006 IEEE computer society conference on, 2006, vol. 2, pp. 1735–1742. [55]E. Hoffer and N. Ailon, “Deep metric learning using triplet network,” in International Workshop on Similarity-Based Pattern Recognition, 2015, pp. 84–92. [56]O. Rippel, M. Paluri, P. Dollar, and L. Bourdev, “Metric learning with adaptive density discrimination,” arXiv preprint arXiv:1511.05939, 2015. [57]M. et al Kristan, “The visual object tracking vot2016 challenge results,” 2016. [58]L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “Fully-convolutional siamese networks for object tracking,” in European Conference on Computer Vision, 2016, pp. 850–865. [59]S. J. Julier and J. K. Uhlmann, “A new extension of the Kalman filter to nonlinear systems,” in Int. symp. aerospace/defense sensing, simul. and controls, 1997, vol. 3, no. 26, pp. 182–193. [60]E. A. Wan and R. Van Der Merwe, “The unscented Kalman filter for nonlinear estimation,” in Adaptive Systems for Signal Processing, Communications, and Control Symposium 2000. AS-SPCC. The IEEE 2000, 2000, pp. 153–158. [61]D. Helbing and P. Molnar, “Social force model for pedestrian dynamics,” Physical review E, vol. 51, no. 5, p. 4282, 1995. [62]J. Van den Berg, M. Lin, and D. 
Manocha, “Reciprocal velocity obstacles for real-time multi-agent navigation,” in Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, 2008, pp. 1928–1935. [63]J. Alonso-Mora, A. Breitenmoser, M. Rufli, P. Beardsley, and R. Siegwart, “Optimal Reciprocal Collision Avoidance for Multiple Non-Holonomic Robots,” in Distributed Autonomous Robotic Systems: The 10th International Symposium, A. Martinoli, F. Mondada, N. Correll, G. Mermoud, M. Egerstedt, M. A. Hsieh, L. E. Parker, and K. Støy, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 203–216. [64]A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social lstm: Human trajectory prediction in crowded spaces,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 961–971. [65]T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Soft + Hardwired Attention: An LSTM Framework for Human Trajectory Prediction and Abnormal Event Detection,” ArXiv e-prints, 2017. [66]M. Pfeiffer, G. Paolo, H. Sommer, J. Nieto, R. Siegwart, and C. Cadena, “A Data-driven Model for Interaction-aware Pedestrian Motion Prediction in Object Cluttered Environments,” ArXiv e-prints, Sep. 2017. [67]T. Haarnoja, A. Ajay, S. Levine, and P. Abbeel, “Backprop KF: Learning Discriminative Deterministic State Estimators,” ArXiv e-prints, May 2016. [68]M. Pfeiffer, U. Schwesinger, H. Sommer, E. Galceran, and R. Siegwart, “ Predicting Actions to Act Predictably: Cooperative Partial Motion Planning with Maximum Entropy Models ,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2016. [69]J. Dequaire, P. Ondrúška, D. Rao, D. Wang, and I. Posner, “Deep tracking in the wild: End-to-end tracking using recurrent neural networks,” The International Journal of Robotics Research, p. 0278364917710543, 2017. [70]C. Finn, I. Goodfellow, and S. 
Levine, “Unsupervised learning for physical interaction through video prediction,” in Advances in Neural Information Processing Systems, 2016, pp. 64–72. [71]M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi-scale video prediction beyond mean square error,” arXiv preprint arXiv:1511.05440, 2015. [72]N. Kalchbrenner et al., “Video pixel networks,” arXiv preprint arXiv:1610.00527, 2016. [73]C. Vondrick and A. Torralba, “Generating the future with adversarial transformers,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [74]S. Hoermann, M. Bach, and K. Dietmayer, “Dynamic Occupancy Grid Prediction for Urban Autonomous Driving: A Deep Learning Approach with Fully Automatic Labeling,” ArXiv e-prints, 2017. [75]E. Schmerling, K. Leung, W. Vollprecht, and M. Pavone, “Multimodal Probabilistic Model-Based Planning for Human-Robot Interaction,” ArXiv e-prints, 2017. [76]M. Bojarski et al., “End to end learning for self-driving cars,” arXiv preprint arXiv:1604.07316, 2016. [77]D. A. Pomerleau, “Neural network based autonomous navigation,” in Vision and Navigation, Springer, 1990, pp. 83–93. [78]U. Muller, J. Ben, E. Cosatto, B. Flepp, and Y. L. Cun, “Off-road obstacle avoidance through end-to-end learning,” in Advances in neural information processing systems, 2006, pp. 739–746. [79]M. Pfeiffer, M. Schaeuble, J. Nieto, R. Siegwart, and C. Cadena, “From Perception to Decision: A Data-driven Approach to End-to-end Motion Planning for Autonomous Ground Robots,” ArXiv e-prints, 2016. [80]S. Ross, G. J. Gordon, and J. A. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” in Proceedings of the International Conference on Artifical Intelligence and Statistics, 2010. [81]M. 
Laskey et al., “Shiv: Reducing supervisor burden in dagger using support vectors for efficient learning from demonstrations in high dimensional state spaces,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on, 2016, pp. 462–469. [82]G. Kahn, T. Zhang, S. Levine, and P. Abbeel, “Plato: Policy learning using adaptive trajectory optimization,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on, 2017, pp. 3342–3349. [83]M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg, “Dart: Noise injection for robust imitation learning,” in Conference on Robot Learning, 2017, pp. 143–156. [84]S. Shalev-Shwartz and A. Shashua, “On the Sample Complexity of End-to-end Training vs. Semantic Abstraction Training,” ArXiv e-prints, 2016. [85]B. D. Ziebart, A. L. Maas, J. A. Bagnell, and A. K. Dey, “Maximum Entropy Inverse Reinforcement Learning.,” in AAAI, 2008, pp. 1433–1438. [86]M. Wulfmeier, D. Rao, D. Z. Wang, P. Ondruska, and I. Posner, “Large-scale cost function learning for path planning using deep inverse reinforcement learning,” The International Journal of Robotics Research, vol. 36, no. 10, pp. 1073–1087, 2017. [87]D. Barnes, W. Maddern, and I. Posner, “Find Your Own Way: Weakly-Supervised Segmentation of Path Proposals for Urban Autonomy,” ArXiv e-prints, 2016. [88]S. Thrun, M. Montemerlo, and A. Aron, “Probabilistic Terrain Analysis For High-Speed Desert Driving.,” in Robotics: Science and Systems, 2006, pp. 16–19. [89]N. D. Ratliff, J. A. Bagnell, and M. A. Zinkevich, “Maximum margin planning,” in Proceedings of the 23rd international conference on Machine learning, 2006, pp. 729–736. [90]J. Choi and K.-E. Kim, “MAP inference for Bayesian inverse reinforcement learning,” in Advances in Neural Information Processing Systems, 2011, pp. 1989–1997. [91]P. Abbeel, A. Coates, and A. Y. Ng, “Autonomous Helicopter Aerobatics through Apprenticeship Learning,” The International Journal of Robotics Research, vol. 29, no. 13, pp. 
1608–1639, 2010. [92]H. Kretzschmar, M. Spies, C. Sprunk, and W. Burgard, “Socially compliant mobile robot navigation via inverse reinforcement learning,” The International Journal of Robotics Research, p. 0278364915619772, 2016. [93]B. D. Ziebart et al., “Planning-based Prediction for Pedestrians,” in Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, Piscataway, NJ, USA, 2009, pp. 3931–3936. [94]Q. P. Nguyen, B. K. H. Low, and P. Jaillet, “Inverse Reinforcement Learning with Locally Consistent Reward Functions,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 1738–1746. [95]N. D. Ratliff, D. Silver, and J. A. Bagnell, “Learning to search: Functional gradient techniques for imitation learning,” Autonomous Robots, vol. 27, no. 1, pp. 25–53, 2009. [96]K. Shiarlis, J. Messias, and S. Whiteson, “Rapidly exploring learning trees,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on, 2017, pp. 1541–1548. [97]S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik, “Cognitive Mapping and Planning for Visual Navigation,” ArXiv e-prints, 2017. [98]A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, “Value Iteration Networks,” ArXiv e-prints, 2016. [99]P. Hall, W. Phan, and S. Ambati, “Ideas on interpreting machine learning.” https://www.oreilly.com/ideas/ideas-on-interpreting-machine-learning , 2017. [100]Z. C. Lipton, “The Mythos of Model Interpretability,” ArXiv e-prints, 2016. [101]DARPA, “DARPA Explainable Artificial Intelligence.” https://www.cc.gatech.edu/ alanwags/DLAI2016/(Gunning)%20IJCAI-16%20DLAI%20WS.pdf , 2017. [102]S. S. Shwartz, “The need for human-like driving (section of video).” https://youtu.be/FovLsAFiIJU?t=1m31s , 2016. [103]D. Sadigh, S. Sastry, S. A. Seshia, and A. D. 
Dragan, “Planning for Autonomous Cars that Leverage Effects on Human Actions,” in Robotics: Science and Systems, 2016. [104]S. H. Huang, D. Held, P. Abbeel, and A. D. Dragan, “Enabling Robots to Communicate their Objectives,” ArXiv e-prints, 2017. [105]A. D. Dragan, “Legible Robot Motion Planning,” 2015. [106]A. D. Dragan, “Robot Planning with Mathematical Models of Human State and Action,” ArXiv e-prints, 2017. [107]Waymo, “Waymo Safety Report.” https://storage.googleapis.com/sdc-prod/v1/safety-report/waymo-safety-report-2017-10.pdf?utm_source=The+Comet+Newsletter , 2017. [108]H. Grimmett, R. Triebel, R. Paul, and I. Posner, “Introspective classification for robot perception,” The International Journal of Robotics Research, vol. 35, no. 7, pp. 743–762, 2016. [109]C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural networks,” arXiv preprint arXiv:1505.05424, 2015. [110]D. J. C. MacKay, “A practical Bayesian framework for backpropagation networks,” Neural computation, vol. 4, no. 3, pp. 448–472, 1992. [111]R. M. Neal, Bayesian learning for neural networks, vol. 118. Springer Science & Business Media, 2012. [112]Y. Gal, “What My Deep Model Doesn’t Know...” http://mlg.eng.cam.ac.uk/yarin/website/blog_3d801aa532c1ce.html , 2016. [113]A. Kendall and Y. Gal, “What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?,” ArXiv e-prints, 2017. [114]W. Wang, A. Wang, A. Tamar, X. Chen, and P. Abbeel, “Safer Classification by Synthesis,” ArXiv e-prints, 2017. [115]A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok, “Synthesizing Robust Adversarial Examples,” ArXiv e-prints, 2017. [116]Microsoft, “Microsoft Carsim.” https://www.microsoft.com/en-us/research/blog/autonomous-car-research/ , 2017. [117]C. Gurău, D. Rao, C. H. Tong, and I. Posner, “Learn from experience: probabilistic prediction of perception performance to avoid failure,” The International Journal of Robotics Research, p. 0278364917730603. 
[118]S. Daftry, S. Zeng, J. A. Bagnell, and M. Hebert, “Introspective Perception: Learning to Predict Failures in Vision Systems,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2016), 2016. [119]L. Fridman, B. Jenik, and B. Reimer, “Arguing Machines: Perception-Control System Redundancy and Edge Case Discovery in Real-World Autonomous Driving,” ArXiv e-prints, 2017. [120]R. Fong and A. Vedaldi, “Interpretable Explanations of Black Boxes by Meaningful Perturbation,” ArXiv e-prints, 2017. [121]P. Dabkowski and Y. Gal, “Real Time Image Saliency for Black Box Classifiers,” ArXiv e-prints, 2017. [122]R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization,” ArXiv e-prints, 2016. [123]Z. Wang, T. Schaul, M. Hessel, H. van Hasselt, M. Lanctot, and N. de Freitas, “Dueling Network Architectures for Deep Reinforcement Learning,” ArXiv e-prints, Nov. 2015. [124]Waymo, “Inside Waymo’s Secret Testing and Simulation Facilities.” https://www.theatlantic.com/technology/archive/2017/08/inside-waymos-secret-testing-and-simulation-facilities/537648/ , 2017. [125]N. Y. Times, “NYT - Virtual Reality Driverless Cars.” https://www.nytimes.com/2017/10/29/business/virtual-reality-driverless-cars.html , 2017. [126]Uber, “Uber ATG Datavisualisation.” https://eng.uber.com/atg-dataviz/ , 2017. [127]A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “CARLA: An open urban driving simulator,” in Conference on Robot Learning, 2017, pp. 1–16. [128]S. R. Richter, Z. Hayder, and V. Koltun, “Playing for Benchmarks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2213–2222. [129]G. Ros, L. Sellart, J. Materzynska, D. Vazquez, and A. Lopez, “The SYNTHIA Dataset: A Large Collection of Synthetic Images for Semantic Segmentation of Urban Scenes,” in CVPR, 2016. 
[130]“UnrealCV: Virtual Worlds for Computer Vision,” ACM Multimedia Open Source Software Competition, 2017. [131]A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, “Virtual Worlds as Proxy for Multi-Object Tracking Analysis,” ArXiv e-prints, 2016. [132]A. Shafaei, J. J. Little, and M. Schmidt, “Play and Learn: Using Video Games to Train Computer Vision Models,” ArXiv e-prints, 2016. [133]xix.ai, “Adversarial Attacks.” https://blog.xix.ai/how-adversarial-attacks-work-87495b81da2d , 2017. [134]S.-M. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard, “Universal adversarial perturbations,” arXiv preprint arXiv:1610.08401, 2016. [135]P. W. Koh and P. Liang, “Understanding Black-box Predictions via Influence Functions,” ArXiv e-prints, 2017. [136]C. Guo, M. Rana, M. Cisse, and L. van der Maaten, “Countering Adversarial Images using Input Transformations,” ArXiv e-prints, 2017. [137]Anonymous, “Thermometer Encoding: One Hot Way To Resist Adversarial Examples,” International Conference on Learning Representations, 2018. [138]G. Huang, S. Liu, L. van der Maaten, and K. Q. Weinberger, “CondenseNet: An Efficient DenseNet using Learned Group Convolutions,” ArXiv e-prints, 2017. [139]A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep neural network architecture for real-time semantic segmentation,” arXiv preprint arXiv:1606.02147, 2016. [140]A. G. Howard et al., “MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications,” ArXiv e-prints, 2017. [141]C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” ArXiv e-prints, 2015. [142]F. Chollet, “Xception: Deep Learning with Depthwise Separable Convolutions,” arXiv preprint arXiv:1610.02357, 2016. [143]S. Han, H. Mao, and W. J. Dally, “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding,” ArXiv e-prints, 2015. [144]Q. Huang, K. Zhou, S. You, and U. 
Neumann, “Learning to Prune Filters in Convolutional Neural Networks,” ArXiv e-prints, 2018. [145]C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 535–541. [146]G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015. [147]T. Furlanello, Z. Lipton, L. Itti, and A. Anandkumar, “Born Again Neural Networks,” in NIPS Workshop on Meta Learning, 2017. [148]D. Silver et al., “Mastering the game of go without human knowledge,” Nature, vol. 550, no. 7676, p. 354, 2017. [149]J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal deep learning,” in Proceedings of the 28th international conference on machine learning (ICML-11), 2011, pp. 689–696. [150]D. Rao, M. D. Deuge, N. Nourani-Vatani, S. B. Williams, and O. Pizarro, “Multimodal learning and inference from visual and remotely sensed data,” The International Journal of Robotics Research, vol. 36, no. 1, pp. 24–43, 2017. [151]A. Valada, G. L. Oliveira, T. Brox, and W. Burgard, “Deep Multispectral Semantic Scene Understanding of Forested Environments using Multimodal Fusion,” in International Symposium on Experimental Robotics, 2016, pp. 465–477. [152]G.-H. Liu, A. Siravuru, S. Prabhakar, M. Veloso, and G. Kantor, “Learning End-to-end Multimodal Sensor Policies for Autonomous Navigation,” ArXiv e-prints, 2017. [153]Y. Bengio, “Yoshua Bengio Interview.” https://www.youtube.com/watch?v=pnTLZQhFpaE , 2017. [154]C. Doersch, A. Gupta, and A. A. Efros, “Unsupervised visual representation learning by context prediction,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1422–1430. [155]M. Noroozi and P. Favaro, “Unsupervised learning of visual representations by solving jigsaw puzzles,” in European Conference on Computer Vision, 2016, pp. 69–84. [156]D. Pathak, P. Krahenbuhl, J. 
Donahue, T. Darrell, and A. A. Efros, “Context encoders: Feature learning by inpainting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2536–2544. [157]A. R. Zamir, T. Wekel, P. Agrawal, C. Wei, J. Malik, and S. Savarese, “Generic 3d representation via pose estimation and matching,” in European Conference on Computer Vision, 2016, pp. 535–553. [158]syntropy.ai, “Dimensional Reduction via Sequential Data.” https://medium.com/syntropy-ai/dimensional-reduction-via-sequential-data-798d4c3510d9 , 2017. [159]I. Misra, C. L. Zitnick, and M. Hebert, “Shuffle and Learn: Unsupervised Learning using Temporal Order Verification,” ArXiv e-prints, 2016. [160]C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual representations from unlabeled video,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 98–106. [161]Y. Bengio, A. Courville, and P. Vincent, “Representation Learning: A Review and New Perspectives,” ArXiv e-prints, 2012. [162]R. Grosse, “Predictive learning vs. representation learning.” https://hips.seas.harvard.edu/blog/2013/02/04/predictive-learning-vs-representation-learning/ , 2013. [163]G. Patrini, “In search of the missing signals.” http://giorgiopatrini.org/posts/2017/09/06/in-search-of-the-missing-signals/ , 2017. [164]T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, “High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs,” ArXiv e-prints, 2017. [165]F. Luan, S. Paris, E. Shechtman, and K. Bala, “Deep Photo Style Transfer,” ArXiv e-prints, 2017. [166]L. Theis, A. van den Oord, and M. Bethge, “A note on the evaluation of generative models,” ArXiv e-prints, 2015. [167]Intel, “Intel Mobileye plans for autonomous fleet.” https://newsroom.intel.com/news/intel-mobileye-integration-plans-build-fleet-autonomous-test-cars/ , 2017. [168]Y. 
Ganin et al., “Domain-Adversarial Training of Neural Networks,” Journal of Machine Learning Research, vol. 17, pp. 1–35, 2016. [169]M. Wulfmeier, A. Bewley, and I. Posner, “Addressing Appearance Change in Outdoor Robotics with Adversarial Domain Adaptation,” in Proceedings of the IEEE International Conference on Intelligent Robots and Systems, 2017. [170]K. Bousmalis et al., “Using Simulation and Domain Adaptation to Improve Efficiency of Deep Robotic Grasping,” ArXiv e-prints, 2017. [171]E. Tzeng et al., “Towards adapting deep visuomotor representations from simulated to real environments,” arXiv preprint arXiv:1511.07111, 2015. [172]G. Csurka, “Domain adaptation for visual applications: A comprehensive survey,” arXiv preprint arXiv:1702.05374, 2017. [173]M. Wulfmeier, A. Bewley, and I. Posner, “Incremental Adversarial Domain Adaptation for Continually Changing Environments,” ArXiv e-prints, 2017. [174]J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks,” ArXiv e-prints, 2017. [175]Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation,” ArXiv e-prints, 2017. [176]X. Huang and S. Belongie, “Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization,” arXiv preprint arXiv:1703.06868, 2017. [177]J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in deep neural networks?,” in Advances in neural information processing systems, 2014, pp. 3320–3328. [178]J. Donahue et al., “DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition,” ArXiv e-prints, 2013. [179]M. Wulfmeier, I. Posner, and P. Abbeel, “Mutual Alignment Transfer Learning,” ArXiv e-prints, 2017. [180]J. Zhu, “Image Gradient-based Joint Direct Visual Odometry for Stereo Camera,” in Int. Jt. Conf. Artif. Intell, 2017, pp. 4558–4564. [181]D. Barnes, W. Maddern, G. Pascoe, and I. 
Posner, “Driven to Distraction: Self-Supervised Distractor Learning for Robust Monocular Visual Odometry in Urban Environments,” ArXiv e-prints, 2017. [182]T. Malisiewicz, “The Future of Real-Time SLAM and Deep Learning vs SLAM.” http://www.computervisionblog.com/2016/01/why-slam-matters-future-of-real-time.html , 2016. [183]M. Kristan et al., “The visual object tracking vot2015 challenge results,” in Proceedings of the IEEE international conference on computer vision workshops, 2015, pp. 1–23. [184]J. Valmadre, L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “End-to-end representation learning for Correlation Filter based tracking,” ArXiv e-prints, 2017. [185]S. M. LaValle, “Rapidly-exploring random trees: A new tool for path planning,” 1998. [186]M. Pivtoraiko and A. Kelly, “Efficient constrained path planning via search in state lattices,” in International Symposium on Artificial Intelligence, Robotics, and Automation in Space, 2005, pp. 1–7. [187]E. Davis and G. Marcus, “Commonsense reasoning and commonsense knowledge in artificial intelligence,” Communications of the ACM, vol. 58, no. 9, pp. 92–103, 2015. [188]F. Codevilla, M. Müller, A. Dosovitskiy, A. López, and V. Koltun, “End-to-end Driving via Conditional Imitation Learning,” ArXiv e-prints, 2017. [189]A. Graves et al., “Hybrid computing using a neural network with dynamic external memory,” Nature, vol. 538, p. 471, 2016. [190]A. Neelakantan, Q. V. Le, and I. Sutskever, “Neural Programmer: Inducing Latent Programs with Gradient Descent,” ArXiv e-prints, 2015. [191]K. Pei, Y. Cao, J. Yang, and S. Jana, “DeepXplore: Automated Whitebox Testing of Deep Learning Systems,” arXiv preprint arXiv:1705.06640, 2017. [192]R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992. [193]T. 
Back, Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford university press, 1996. [194]T. Salimans, J. Ho, X. Chen, and I. Sutskever, “Evolution strategies as a scalable alternative to reinforcement learning,” arXiv preprint arXiv:1703.03864, 2017. [195]C. J. Maddison, A. Mnih, and Y. W. Teh, “The concrete distribution: A continuous relaxation of discrete random variables,” arXiv preprint arXiv:1611.00712, 2016. [196]E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” arXiv preprint arXiv:1611.01144, 2016. [197]G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein, “REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models,” in Advances in Neural Information Processing Systems, 2017, pp. 2624–2633. [198]McKinsey, “Human + machine: A new era of automation in manufacturing.” https://www.mckinsey.com/business-functions/operations/our-insights/human-plus-machine-a-new-era-of-automation-in-manufacturing , 2017. [199]R. Gomez-Ojeda, Z. Zhang, J. Gonzalez-Jimenez, and D. Scaramuzza, “Learning-based Image Enhancement for Visual Odometry in Challenging HDR Environments,” ArXiv e-prints, 2017. [200]V. Peretroukhin and J. Kelly, “DPC-Net: Deep Pose Correction for Visual Localization,” ArXiv e-prints, 2017. [201]M. Tanner, S. Saftescu, A. Bewley, and P. Newman, “Meshed Up: Learnt Error Correction in 3D Reconstructions,” ArXiv e-prints, 2018. [202]Y. LeCun and others, “Generalization and network design strategies,” Connectionism in perspective, pp. 143–155, 1989. [203]I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal, Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016. [204]A. Byravan and D. Fox, “SE3-Nets: Learning Rigid Body Motion using Deep Neural Networks,” ArXiv e-prints, 2016. [205]S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. 
[206]A. Handa, M. Bloesch, V. Patraucean, S. Stent, J. McCormac, and A. Davison, “gvnn: Neural Network Library for Geometric Computer Vision,” ArXiv e-prints, 2016. [207]M. Jaderberg, K. Simonyan, A. Zisserman, and others, “Spatial transformer networks,” in Advances in Neural Information Processing Systems, 2015, pp. 2017–2025. [208]S. Bhatti, A. Desmaison, O. Miksik, N. Nardelli, N. Siddharth, and P. H. S. Torr, “Playing Doom with SLAM-Augmented Deep Reinforcement Learning,” ArXiv e-prints, 2016. [209]T. Weber et al., “Imagination-Augmented Agents for Deep Reinforcement Learning,” ArXiv e-prints, 2017. [210]Y. LeCun and G. Marcus, “Debate: ‘Does AI Need More Innate Machinery?’ (Yann LeCun, Gary Marcus).” https://www.youtube.com/watch?v=vdWPQ6iAkT4 , 2017. [211]D. George et al., “A generative vision model that trains with high data efficiency and breaks text-based CAPTCHAs,” Science, 2017. [212]S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic Routing Between Capsules,” ArXiv e-prints, 2017. [213]S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki, “SfM-Net: Learning of Structure and Motion from Video,” ArXiv e-prints, 2017. [214]A. Byravan, F. Leeb, F. Meier, and D. Fox, “SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning and Control,” CoRR, vol. abs/1710.00489, 2017. [215]E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized Evolution for Image Classifier Architecture Search,” ArXiv e-prints, 2018. [216]A. Brock, T. Lim, J. M. Ritchie, and N. Weston, “SMASH: One-Shot Model Architecture Search through HyperNetworks,” ArXiv e-prints, 2017. [217]J. X. Wang et al., “Learning to reinforcement learn,” ArXiv e-prints, 2016. [218]M. Andrychowicz et al., “Learning to learn by gradient descent by gradient descent,” ArXiv e-prints, 2016. [219]N. Kant, “Recent Advances in Neural Program Synthesis,” ArXiv e-prints, 2018. 
[220]LeCun and Manning, “Deep Learning, Structure and Innate Priors - A Discussion between Yann LeCun and Christopher Manning.” http://www.abigailsee.com/2018/02/21/deep-learning-structure-and-innate-priors.html , 2018.
