Standard computer vision systems assume access to intelligently captured inputs (e.g., photos from a human photographer), yet autonomously capturing good observations is a major challenge in itself. We address the problem of learning to look around: How can an agent learn to acquire informative visual observations? We propose a reinforcement learning solution, where the agent is rewarded for reducing its uncertainty about the unobserved portions of its environment. Specifically, the agent is trained to select a short sequence of glimpses, after which it must infer the appearance of its full environment. To address the challenge of sparse rewards, we further introduce sidekick policy learning, which exploits the asymmetry in observability between training and test time. The proposed methods learned observation policies that not only performed the completion task for which they were trained but also generalized to exhibit useful “look-around” behavior for a range of active perception tasks.

INTRODUCTION

Visual recognition has witnessed dramatic successes in recent years. Fueled by benchmarks composed of carefully curated web photos and videos, the focus has been on inferring semantic labels from human-captured images—whether classifying scenes, detecting objects, or recognizing activities (1–3). However, visual perception requires making not only inferences from observations but also decisions about what to observe. Methods that use human-captured images implicitly assume properties in their inputs, such as canonical poses of objects, no motion blur, or ideal lighting conditions. As a result, they gloss over important hurdles for robotic agents acting in the real world.

For an agent, individual views of an environment offer only a small fraction of all relevant information. For instance, an agent with a view of a television screen in front of it may not know whether it is in a living room or a bedroom. An agent observing a mug from the side may have to move to see it from above to know what is inside.

An agent ought to be able to enter a new environment or pick up a new object and intelligently (non-exhaustively) “look around.” The ability to actively explore would be valuable in both task-driven scenarios (e.g., a drone searches for signs of a particular activity) and scenarios where the task itself unfolds simultaneously with the agent’s exploratory actions (e.g., a search-and-rescue robot enters a burning building and dynamically decides its mission). For example, consider a service robot that is moving around in an open environment without specific goals, waiting for future tasks like delivering a package from one person to another or picking up coffee from the kitchen. It needs to efficiently and constantly gather information so that it is well prepared to perform future tasks with minimal delays. Similarly, consider a search-and-rescue scenario, where a robot is deployed in a hostile environment, such as a burning building or earthquake collapse, where time is of the essence. The robot has to adapt to such new unseen environments and rapidly gather information that other robots and humans can use to effectively respond to situations that dynamically unfold over time (humans caught under debris, locations of fires, and presence of hazardous materials). Having a robot that knows how to explore intelligently can be critical in such scenarios, reducing risks for people while providing an effective response.

Any such scenario brings forth the question of how to collect visual information to benefit perception. A naïve strategy would be to gain full information by making every possible observation—that is, looking around in all directions or systematically examining all sides of an object. However, observing all aspects is often inconvenient if not intractable. Fortunately, in practice, not all views are equally informative. The natural visual world contains regularities, suggesting that not every view needs to be sampled for accurate perception. For instance, humans rarely need to fully observe an object to understand its three-dimensional (3D) shape (4–6), and one can often understand the primary contents of a room without literally scanning it (7). In short, given a set of past observations, some new views are more informative than others (Fig. 1).

Fig. 1 Looking around efficiently is a complex task requiring the ability to reason about regularities in the visual world using cues like context and geometry. Top: An agent that has observed limited portions of its environment can reasonably predict some unobserved portions (e.g., water near the ship) but is much more uncertain about other portions. Where should it look next? Bottom: An agent inspecting a 3D object. Having seen a top view and a side view, how must it rotate the mug now to get maximum new information? Critically, we aim to learn policies that are not specific to a given object or scene, nor to a specific perception task. Instead, the look-around policies ought to benefit the agent exploring new, unseen environments and performing tasks unspecified when learning the look-around behavior.

This fact leads us to investigate the question of how to effectively look around: How can a learning system make intelligent decisions about how to acquire new exploratory visual observations? We propose a solution based on “active observation completion”: An agent must actively observe a small fraction of its environment so that it can predict the pixelwise appearances of unseen portions of the environment.

Our problem setting relates to but is distinct from previous work in active perception, intrinsic motivation, and view synthesis. Although there is interesting recent headway in active object recognition (8–11) and intelligent search mechanisms for detection (12–14), such systems are supervised and task specific—limited to accelerating a predefined recognition task. In reinforcement learning (RL), intrinsic motivation methods define generic rewards, such as novelty or coverage (15–17), that encourage exploration for navigation agents, but they do not self-supervise policy learning in an observed visual environment, nor do they examine transfer beyond navigation tasks. View synthesis approaches use limited views of the environment along with geometric properties to generate unseen views (18–22). Whereas these methods assume individual human-captured images, our problem requires actively selecting the input views themselves. Our primary goal is not to synthesize unseen views but rather to use novel view inference as a means to elicit intelligent exploration policies that transfer well to other tasks.

In the following, we first formally define the learning task, overview our approach, and present results. Then, after the results, we discuss limitations of the current approach and key future directions, followed by Materials and Methods—an overview of the specific deep networks and policy learning approaches we developed. This article expands upon our two previous conference papers (23, 24).

Active observation completion

Our goal is to learn a policy for controlling an agent’s camera motions such that it can explore novel environments and objects efficiently. To this end, we formulate an unsupervised learning objective based on active observation completion. The main idea is to favor sequences of camera motions that make the unseen parts of the agent’s surroundings easier to predict. The output is a look-around policy equipped to gather new images in new environments. As we will demonstrate in results, it prepares the agent to perform intelligent exploration for a wide range of perception tasks, such as recognition, light source localization, and pose estimation.
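At a high level, an active observation completion episode alternates between updating an internal belief from the current view, choosing an exploratory motion, and finally decoding a prediction of the full environment. The sketch below illustrates this loop with a hypothetical interface: `env`, `agent`, and `decoder` are assumed names, not components defined in the paper.

```python
# Minimal sketch of one active observation completion episode.
# Assumed (hypothetical) interface: `env` serves views of a scene X,
# `agent` holds the policy pi plus a recurrent belief state, and
# `decoder` produces the predicted viewgrid V_hat from the belief.
def run_episode(env, agent, decoder, budget_T):
    view = env.reset()                 # first view, from an unknown viewpoint
    belief = agent.init_belief()
    for _ in range(budget_T - 1):      # T - 1 actively chosen camera motions
        belief = agent.update(belief, view)
        action = agent.policy(belief)  # exploratory motion a_t from policy pi
        view = env.step(action)        # viewpoint changes according to a_t
    belief = agent.update(belief, view)
    return decoder(belief)             # predicted viewgrid for the whole scene
```

The loop consumes the budget of T glimpses and is scored only on the quality of the final reconstruction, which is what makes the objective label free.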

Problem formulation

The problem setting is formally stated as follows. The agent starts by looking at a novel environment (or object) X from some unknown viewpoint (25). It has a time budget T to explore the environment. The learning objective is to minimize the error in the agent’s pixelwise reconstruction of the full—mostly unobserved—environment using only the sequence of views selected within that budget. To do this, the agent must maintain an internal representation of how the environment would look conditioned on the views it has seen so far.

We represent the entire environment as a “viewgrid” containing views from a discrete set of viewpoints. To construct it, we evenly sample N elevations from −90° to 90° and M azimuths from 0° to 360° and form all MN possible (elevation, azimuth) pairings. The viewgrid is then denoted by V(X) = {x(X, θ^(i)) ∣ 1 ≤ i ≤ MN}, where x(X, θ^(i)) is the 2D view of X from viewpoint θ^(i), the ith pairing. More generally, θ^(i) could capture both camera angles and position; however, to best exploit existing datasets, we limited our experiments to camera rotations alone with no translation movements.

The agent expends its time budget T in discrete increments by selecting T − 1 camera motions in sequence, each of which yields an actively chosen “glimpse.” At each time step, the agent gets an image observation x_t from the current viewpoint. It then makes an exploratory motion a_t based on its policy π. When the agent executes action a_t ∈ A, the viewpoint changes according to θ_{t+1} = θ_t + a_t. For each camera motion a_t executed by the agent, a reward r_t is provided by the environment. Using the view x_t, the agent updates its internal representation of the environment, denoted V̂(X).
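The viewgrid construction and the viewpoint update rule θ_{t+1} = θ_t + a_t can be sketched concretely as follows. The grid sizes and the wraparound/clamping conventions here are illustrative assumptions (azimuth is cyclic, elevation is physically bounded), not the paper's exact implementation details.

```python
import numpy as np

# Illustrative viewgrid: N elevations in [-90, 90] and M azimuths in
# [0, 360) define all MN candidate viewpoints theta^(i).
# N_ELEV and M_AZIM are assumed values for the sketch.
N_ELEV, M_AZIM = 4, 6

elevations = np.linspace(-90, 90, N_ELEV)        # evenly sampled elevations
azimuths = np.arange(M_AZIM) * (360 / M_AZIM)    # evenly sampled azimuths

# All MN (elevation, azimuth) pairings theta^(i)
viewgrid_angles = [(e, a) for e in elevations for a in azimuths]

def apply_action(theta, action):
    """Viewpoint update theta_{t+1} = theta_t + a_t.

    Assumed conventions: azimuth wraps around the full circle, while
    elevation is clamped to the physical range [-90, 90].
    """
    elev, azim = theta
    d_elev, d_azim = action
    new_elev = float(np.clip(elev + d_elev, -90, 90))
    new_azim = (azim + d_azim) % 360
    return (new_elev, new_azim)
```

For example, an agent at (80°, 330°) taking the motion (+30°, +60°) ends at the clamped elevation 90° and the wrapped azimuth 30°.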
Because camera motions are restricted to be near the current camera angle and candidate viewpoints partially overlap, the discrete action space promotes efficiency without neglecting the physical realities of the problem [following (8, 9, 23, 26)]. During training, the full viewgrids of the environments are available to the agent as supervision. During testing, the system must predict the complete viewgrid, having seen only a few views within it.

We explored our idea in two settings (Fig. 1). In the first, the agent scans a scene through its limited field-of-view camera; the goal is to select efficient camera motions so that after a few glimpses, it can model unobserved portions of the scene well. In the second, the agent manipulates a 3D object to inspect it; the goal is to select efficient manipulations so that after only a small number of actions, it has a full model of the object’s 3D shape. In both cases, the system must learn to leverage visual regularities (shape primitives, context, etc.) that suggest the likely contents of unseen views, focusing on portions that are hard to “hallucinate” (i.e., predict pixelwise).

Posing the active view acquisition problem in terms of observation completion has two key advantages: generality and low-cost (label-free) training data. The objective is general in the sense that pixelwise reconstruction places no assumptions on the future task for which the glimpses will be used. The training data are low cost, because no manual annotations are required; the agent learns its look-around policy by exploring any visual scene or object. This assumes that capturing images is much more cost-effective than manually annotating images.
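The self-supervised signal described above can be made concrete with a short sketch: the predicted viewgrid V̂(X) is scored pixelwise against the true viewgrid V(X) available at training time, and a step can be rewarded by how much it reduces the remaining reconstruction error. The mean-squared-error form and the error-reduction reward are assumed here for illustration; they are one natural instantiation, not necessarily the paper's exact reward.

```python
import numpy as np

def viewgrid_error(v_hat, v_true):
    """Mean squared pixelwise error over all MN views of the viewgrid.

    v_hat, v_true: arrays of shape (MN, H, W) holding the predicted and
    true views; the true viewgrid is available only during training.
    """
    return float(np.mean((v_hat - v_true) ** 2))

def step_reward(err_prev, err_curr):
    """Assumed per-step reward r_t: the reduction in reconstruction
    error achieved by the latest glimpse."""
    return err_prev - err_curr
```

Under this formulation, a glimpse that reveals a hard-to-hallucinate region lowers the error sharply and earns a large reward, while a redundant glimpse earns almost none.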