Humans are able to seamlessly integrate tactile and visual stimuli with their intuitions to explore and execute complex manipulation skills. They not only see but also feel their actions. Most current robotic learning methodologies exploit recent progress in computer vision and deep learning to acquire data-hungry pixel-to-action policies. These methodologies do not exploit the intuitive latent structure in physics or in tactile signatures. Tactile reasoning is omnipresent in the animal kingdom, yet it is underdeveloped in robotic manipulation. Tactile stimuli are acquired only through intrusive interaction, and interpreting the data stream together with visual stimuli is challenging. Here, we propose a methodology to emulate hierarchical reasoning and multisensory fusion in a robot that learns to play Jenga, a complex game that requires physical interaction to be played effectively. The game mechanics were formulated as a generative process using a temporal hierarchical Bayesian model, with representations for both behavioral archetypes and noisy block states. This model captured descriptive latent structures, and the robot learned probabilistic models of these relationships in the force and visual domains through a short exploration phase. Once learned, the robot used this representation to infer block behavior patterns and states as it played the game. Using its inferred beliefs, the robot adjusted its behavior with respect to both its current actions and its game strategy, similar to the way humans play the game. We evaluated the performance of the approach against three standard baselines and demonstrated its fidelity on a real-world implementation of the game.

INTRODUCTION

Humans, even young children, learn complex tasks in the physical world by engaging richly with the physics of the world (1). This physical engagement starts with our perception and extends into how we learn, perform, and plan actions. We seamlessly integrate touch and sight in coming to understand object properties and relations, allowing us to intuit structure in the physical world and to build physics representations that are central to our ability to efficiently plan and execute manipulation skills (2).

While learning contact-rich manipulation skills, we face two important challenges: active perception and hybrid behavior. In the former, we ask how temporal tactile and visual information, gathered by probing our environment through touch, can be used to learn about the world. In the latter, we ask how to effectively infer and learn multimodal behavior in order to control touch. These two challenges are central to mastering physical interactions. Jenga is a quintessential example of a contact-rich task in which we must interact with the tower, combining touch and sight, to learn and infer block mechanics and multimodal behavior.

Current learning methodologies struggle with these challenges and have not exploited physics nearly as richly as we believe humans do. Most robotic learning systems still use purely visual data, without a sense of touch; this fundamentally limits how quickly and flexibly a robot can learn about the world. Learning algorithms that build on model-free reinforcement learning (RL) methods have little to no ability to exploit knowledge about the physics of objects and actions. Even methods using model-based RL or imitation learning have mostly used generic statistical models that do not explicitly represent the knowledge about physical objects, contacts, or forces that humans have from a very early age. As a consequence, these systems require far more training data than humans do to learn new models or new tasks, and they generalize much less broadly and less robustly.

Our proposed approach draws from the notion of an "intuitive physics engine" in the brain that may underlie our abilities to integrate multiple sensory channels, plan complex actions (7–9), and learn abstract latent structure (10) through physical interaction, even from an early age. Humans learn to play Jenga through physics-based integration of sight and touch: Vision provides information about the location of the tower and the current block arrangement but not about block interactions. These interactions depend on minute geometric differences between blocks that are imperceptible to the human eye. Humans gain information by touching the blocks and combining tactile and visual senses to make inferences about their interactions. Coarse high-level abstractions such as "will a block move" play a central role in our decision-making and are possible precisely because we have rich physics-based representations. We emulated this hierarchical learning and inference in the robotic system depicted in Fig. 1A using the artificial intelligence architecture schematically shown in Fig. 1B. To learn the mechanics of the game, the robot builds its physics-based representation from the data it collects during a brief exploration phase. We show that the robot builds purposeful abstractions that yield sample-efficient learning of a physics model of the game, which it leverages to reason, infer, and act while playing.

In this work, we propose a hierarchical learning approach to acquiring manipulation skills. In particular, we pose a top-down, bottom-up (3–6) learning approach that first builds abstractions in the joint space of touch and vision, which are then used to learn rich physics models. We used Jenga as a platform to compare and evaluate our approach. We developed a simulation environment in which we compared the performance of our approach to three other state-of-the-art learning paradigms. We further show the efficacy of the approach on an experimental implementation of the game.

RESULTS

In this study, we demonstrate the efficacy and sample efficiency of a hierarchical learning approach to manipulation on the challenging game of Jenga. To this end, we first outline the task and evaluation metric. Next, we present quantitative results for our method and three competitive state-of-the-art alternative approaches in simulation. We then demonstrate the fidelity of our approach on a real-world implementation of the game. We end the section with an analysis of the physics and abstractions learned using the proposed approach.

Evaluation metric

Jenga is a particularly interesting instance of physical manipulation games, where mastering contact interactions is key to game play. Block mechanics are not discernible from perception alone; rather, intrusive interaction, together with tactile and visual feedback, is required to infer and reason about the underlying latent structure, i.e., the different types of block mechanics. The complex, partially observable, and multimodal mechanics of the blocks in Jenga pose a daunting challenge to robots learning to play the game. These challenges are not exclusive to this game and exist in many manipulation skills, such as assembly and tool use. Hence, progress in effective learning of these skills is important and necessitates rich models and policies. In this study, we evaluated the robot's ability to play the game by counting the number of successful consecutive block extractions in randomly generated towers. This metric evaluates a model and/or policy's ability to account for the complex and time-varying mechanics of the tower as the game progresses. Both this metric and our study emphasize physics modeling; neither explicitly evaluates the adversarial nature of the game.

Task specifications

In this subsection, we specify the sensing modalities, actions, and rules by which the robot is allowed to play the game in both the simulated and real environments:

1) Sensing. The robot has access to its own pose, the pose of the blocks, and the forces applied to it at every time step. The simulated robot observes these states directly, whereas the experimental robot has access to noisy estimates.

2) Action primitives. The robot uses two "primitive" actions, push and extract/place. Using the push primitive, the robot first selects a block and moves to a collision-free configuration in the plane. The robot then selects a contact location and heading and pushes for a distance of 1 mm, repeating this step until either the robot chooses to retract or a maximum distance of 45 mm is reached. The extract/place primitive searches for a collision-free grasp of the block and places the block on top of the tower at a random unoccupied slot. The extract/place primitives are parametric and computed per call; hence, they are not learned. (A minimal sketch of the push primitive appears after this list.)

3) Base exploration policy. The robot has access to a base exploration policy for data collection. This policy randomizes the push primitive by first selecting a block at random and then executing a sequence of randomized contact locations and headings.

4) Termination criteria. A run, defined as an attempt at a new tower, is terminated when one of the following conditions is met: (i) all blocks have been explored, (ii) a block is dropped outside the tower, or (iii) the tower has toppled.

5) Tower and robot specifications. The simulated tower is composed of the same number and a similar distribution of movable versus immobile blocks as the real tower; this distribution arises from slight perturbations to the weight distribution caused by small tolerances in block height. The relative dimensions of the tower and the end effector are consistent across both environments.

The robot's nominal execution loop is to select a block at random and attempt the push primitive. During the push primitive, the robot either chooses push poses and headings or retracts. If the block is extracted beyond three-fourths of its length, the extract/place primitive is invoked. During a run, the nominal execution loop continues until a termination criterion is met. A key challenge is that movable and immobile pieces are indistinguishable before contact; consequently, the robot needs to control the hybrid/multimodal interaction to extract blocks without damaging the tower. If damage compounds, the tower loses integrity and the termination criteria are met earlier. Hence, this problem is a challenging example that motivates the need for abstract reasoning over fused tactile and visual information with a rich representation of physics.
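To make the control flow of the push primitive and the base exploration policy concrete, the following is a minimal Python sketch under stated assumptions: the robot and block interfaces (robot.observe, robot.push, block.extraction_mm, and so on) are hypothetical placeholders for illustration, not the authors' implementation.

```python
import random

PUSH_STEP_MM = 1.0        # incremental push distance per step
MAX_PUSH_MM = 45.0        # maximum total push distance
EXTRACT_FRACTION = 0.75   # three-fourths of block length triggers extract/place

def base_exploration_policy(state):
    """Base exploration policy: randomized contact location and heading.
    Bounds are illustrative; this policy never chooses to retract."""
    return {"contact_offset": random.uniform(-0.5, 0.5),  # normalized face position
            "heading_rad": random.uniform(-0.2, 0.2)}

def push_primitive(robot, block, policy):
    """Push a block in 1-mm increments until the policy retracts, the
    45-mm limit is reached, or the block clears 3/4 of its length."""
    depth_mm = 0.0
    while depth_mm < MAX_PUSH_MM:
        state = robot.observe(block)   # block poses + wrist force/torque
        action = policy(state)         # contact + heading, or None to retract
        if action is None:
            robot.retract()
            return "retracted"
        robot.push(block, action, PUSH_STEP_MM)
        depth_mm += PUSH_STEP_MM
        if block.extraction_mm >= EXTRACT_FRACTION * block.length_mm:
            return "invoke_extract_place"  # hand off to extract/place primitive
    robot.retract()
    return "depth_limit_reached"
```

The nominal execution loop then simply draws blocks at random and calls push_primitive until one of the termination criteria fires.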

Simulation

Figure 2A depicts our Jenga setup in the MuJoCo (11) simulation environment. We used the simulation environment to compare the performance of our proposed approach [hierarchical model abstractions (HMAs)] to several standard baselines (Fig. 2B). Specifically, we chose a feed-forward neural network (NN) as a representative nonhierarchical model-based approach, a mixture of regressions (MOR) model as a generic hierarchical model-based approach, and the proximal policy optimization (PPO) (12) implementation of RL as a model-free approach. All models have access to the same set of states, actions, and model predictive controller (MPC) (details in Materials and Methods). Figure 2C depicts the schematics of our proposed approach and the MOR model. The MOR model makes use of latent variables l_t, where t denotes time. Our HMA model uses abstractions denoted by c_t. The states and noisy observations are denoted by s_t and z_t, respectively.

Fig. 2. Jenga setup in simulation and the baseline comparisons. (A) The simulation setup is designed to emulate the real-world implementation. (B) Learning curves of the different approaches with confidence intervals evaluated over 10 attempts. Solid lines denote median performances; shadings denote one standard deviation. (C) Visual depiction of the structure of the MOR model and the proposed approach (HMA).

Figure 2B shows the number of blocks successfully extracted in sequence as a function of the number of samples used to learn either a model (HMA, MOR, or NN) or a policy (RL). A sample is an incremental slice of the push trajectory. For the model-based approaches, the samples were collected during an exploration phase in which the robot interacted with the tower and collected states and actions. The exploration phase followed the robot's nominal execution loop using the exploration policy. Once data collection was complete, a model was trained and its fidelity was evaluated in a model predictive control framework per number of samples over an unseen set of test towers. For the purpose of evaluation, the test towers were uniform for all models. For reference, a complete push was composed of 45 steps, with a sample measured at each step. A single sample took about 2 s to collect experimentally and 0.3 s in simulation. The robot could interact with a total of 45 blocks in a new tower. We found empirically that about 47% of pieces move in a real-world random tower and emulated this ratio in the simulation environment. Consequently, the robot can extract, on average, 21 blocks from an unperturbed tower of 3 by 18 (the last 3 layers are prohibited by the rules). This value provides a reasonable goal against which performance can be evaluated.

The proposed approach reached the expected number of successful consecutive extractions within 100 samples. The MOR model was next to achieve the maximum score, requiring an order of magnitude more samples. The feed-forward NN saturated in performance, falling short of the expected maximum extractions. Upon closer inspection, we found that this model was unable to reliably predict the multimodal interactions and behaved either too conservatively or too recklessly. The RL algorithm was the slowest to converge. Per run, all approaches were presented with new towers at random, which explains, in part, why the RL algorithm required so many samples: The space of possible tower configurations is very large.
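The paper defers controller details to Materials and Methods; as a hedged illustration of how a learned one-step model plugs into an MPC loop, the sketch below uses random-shooting MPC. The horizon, candidate count, action bounds, and cost terms (step_cost, model.predict) are illustrative assumptions, not the authors' controller.

```python
import numpy as np

def mpc_select_action(model, state, horizon=5, n_candidates=64, rng=None):
    """Random-shooting MPC: sample candidate action sequences, roll each
    out through the learned one-step model, and execute the first action
    of the lowest-cost sequence."""
    rng = rng or np.random.default_rng()
    best_cost, best_first_action = np.inf, None
    for _ in range(n_candidates):
        # Each action is a (contact offset, heading) pair within the
        # push primitive's limits (illustrative bounds).
        actions = rng.uniform(low=[-0.5, -0.2], high=[0.5, 0.2],
                              size=(horizon, 2))
        s, cost = state, 0.0
        for a in actions:
            s = model.predict(s, a)  # learned transition s_{t+1} = f(s_t, a_t)
            cost += step_cost(s)
        if cost < best_cost:
            best_cost, best_first_action = cost, actions[0]
    return best_first_action

def step_cost(s):
    # Illustrative cost: penalize force applied to the tower, reward
    # extraction progress (assumed state fields).
    return s["force_normal"] ** 2 - s["extraction_depth"]
```

Because all baselines share this controller, differences in Fig. 2B reflect the fidelity of the learned models (or, for PPO, the learned policy) rather than the control machinery.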

Experiments

In our experimental setup, the robot had access to an Intel RealSense D415 camera (RGB) and an ATI Gamma six-axis force/torque sensor mounted at the wrist (Fig. 1A). These two modalities provided noisy approximations of the current pose of the pieces in the tower and the forces applied to the robot (details in Materials and Methods). We used the robot's forward kinematics to estimate the pose of the gripper. To estimate the pose of the fingertips, we computed the deflection of the fingers with respect to the gripper from the measured force applied to the finger and known compliance parameters (a sketch of this estimate follows below).

We used the experimental setup to demonstrate the fidelity of the proposed approach. The failure criteria in the experimental setting were expanded to include tower rotations exceeding 15° or displacements exceeding 10 mm. These criteria were imposed because the vision system's predictions degrade beyond these values. The exploration strategy was also modified to include a hand-coded supervisory algorithm that attempted to mitigate damage to the tower using measurements of poses and forces but was tuned to allow mistakes.

Table 1 shows the robot's performance before and after exploration. Here, a successful push is one in which the robot pushes a block to a desired end goal without dropping it outside the tower or causing excessive damage. A successful extraction is one in which the robot pulls the block free after a push without damaging the tower, and a successful placement is placing the block on top of the tower without damage.

Table 1. Summary statistics for exploration and learned physics. A comparison of the performances of the robot using the exploration strategy and the learned model.

The robot showed an appreciable improvement in block extraction, from 56 to 88%. The most noticeable gain was in side-block extraction, where the success rate nearly doubled, from 42.6 to 78.3%. The results suggest that middle-block extraction is considerably easier than side-block extraction because of the constrained motion and the favorable weight distribution of the tower. The robot was able to displace 42.7% of the blocks, close to the empirical average for a random tower. The two main failure modes for block extraction were (i) excessive forces applied to the tower (characteristic of failing to identify block behaviors) and (ii) poorly controlled extraction of blocks (characteristic of poor predictive ability). The first failure mode often resulted in either a tower perturbation large enough that the vision system was no longer reliable or outright tower collapse. The second failure mode often led to blocks dropping outside the tower or ending in configurations that were difficult to grasp or occluded from the camera.
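The fingertip pose estimate amounts to modeling each finger as a linear spring: the measured force divided by the known stiffness gives the elastic deflection, which is added to the kinematic fingertip pose. The sketch below illustrates this calculation; the stiffness values and the per-axis spring model are assumptions for illustration, not the calibrated compliance parameters of the hardware.

```python
import numpy as np

# Assumed per-axis finger stiffness (N/m); the real compliance
# parameters were identified on the hardware.
FINGER_STIFFNESS = np.array([800.0, 800.0, 1200.0])

def fingertip_position(gripper_pos, nominal_tip_offset, measured_force):
    """Fingertip position = gripper pose from forward kinematics
    + nominal fingertip offset + elastic deflection under load."""
    deflection = measured_force / FINGER_STIFFNESS  # Hooke's law, per axis
    return gripper_pos + nominal_tip_offset + deflection

# Example: a 2 N lateral load deflects the fingertip by 2/800 m = 2.5 mm.
tip = fingertip_position(np.zeros(3),
                         np.array([0.0, 0.0, 0.10]),
                         np.array([2.0, 0.0, 0.0]))
```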

Model learning

In this study, we represented the physics of the tower by using a hierarchical probabilistic model (Fig. 2C) with a structural composition similar to that of dynamic Bayesian networks. We used a top-down, bottom-up learning approach to learn abstractions and physics for the tower. This methodology was inspired by models of cognition, in particular "concept learning," and the intuitive abstractions that humans develop to facilitate complex manipulation skills (see Discussion).

In top-down learning, the objective is to build abstractions from the physics of the tower. This approach is an instance of latent variable learning, where the variable identifies the type of macrobehavior. Specifically, in top-down learning, abstractions are acquired before learning detailed motion models and explicitly encode temporal macrobehaviors of blocks. Figure 3 shows a low-dimensional representation of the recovered abstractions through clustering in the relevant feature space (details in Materials and Methods). Because the clusters have no labels, we intuited their semantic meanings by inspecting trajectory traces in the force and visual domains. The green cluster denotes traces where the robot was not in contact with any block, in particular at the beginning of the trajectory, where measured forces are negligible and blocks do not move. The gray cluster denotes blocks that resisted motion and were stuck, exhibiting large resistive forces and little to no translation. The blue cluster denotes blocks that moved easily (large displacements) and exhibited negligible resistance. The yellow cluster denotes blocks that moved but offered meaningful resistance to the robot.

Fig. 3. Concepts learned from exploration data. Means and covariances of the four clusters are projected onto the space of "normal force (N)," "block rotation (rad)," and "block extraction/depth (dm)." The four clusters carry intuitive semantic meanings, and we refer to them as follows: green, "no block"; gray, "no move"; blue, "small resistance"; and yellow, "hard move."

Bottom-up learning refers to learning explicit state-transition models, factored by the abstractions, using sensory data. We used a probabilistic model, here a Bayesian neural network (BNN), to model the conditional distribution of future states given current states and actions in the joint force and visual domains (see Materials and Methods). The BNN was trained on the data collected during exploration. (A sketch of this two-stage pipeline appears at the end of this subsection.) Figure 4 depicts two examples of the physics learned by these models.

Figure 4A depicts the analytical friction cone between the fingertip and block, overlaid with the predicted normal and tangential forces given their current measured values. This cone was computed assuming Coulomb friction and a rigid point contact between the fingertip and block. Under these assumptions, any force transferable between the finger and block must lie on the boundary or in the interior of the cone. The models' predictions are in good agreement with the cone, implying that they have learned some latent representation of friction. We note that the friction cone is invariant to the abstractions, and the predictions of the models reflect this fact well, implying coherency across models and abstractions.

Fig. 4. Learned intuitive physics. (A) Overlay of the analytical friction cone and predicted forces given the current measurements. The friction coefficient between the finger material (PLA) and wood is between 0.35 and 0.5; here, we use 0.42 as an approximation. (B) Normal force applied to the tower as a function of the height of the tower. Each box plot depicts the minimum, maximum, median, and standard deviation of the force measures.

Figure 4B shows the resistive force of blocks as a function of tower height. The model captures the intuitive tendency of block extraction resistance to decrease with tower height (because of the decrease in effective remaining weight on top of each block). The abstractions facilitate this by factoring the state space between macro-block behaviors and efficiently differentiating movable from immobile blocks.
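As a summary of the two-stage pipeline above, the sketch below recovers four abstractions by fitting a Gaussian mixture to the exploration features of Fig. 3 (top-down) and then fits one transition model per abstraction (bottom-up). The use of scikit-learn, the feature layout, and the generic regressor interface are assumptions for illustration; the paper's transition models are BNNs, and its clustering details are given in Materials and Methods. A Coulomb friction check of the kind behind Fig. 4A is included.

```python
# Hedged sketch of top-down abstraction learning + bottom-up transition
# learning. Hypothetical data layout: features = [normal force (N),
# block rotation (rad), extraction depth (dm)].
import numpy as np
from sklearn.mixture import GaussianMixture

def learn_abstractions(features, n_abstractions=4, seed=0):
    """Top-down step: cluster exploration data; the component
    assignments play the role of the abstractions c_t."""
    gmm = GaussianMixture(n_components=n_abstractions,
                          covariance_type="full", random_state=seed)
    return gmm.fit(features)

def factor_transitions(gmm, states, actions, next_states, make_model):
    """Bottom-up step: fit one probabilistic regressor per abstraction,
    approximating p(s_{t+1} | s_t, a_t, c_t). `make_model` is any
    sklearn-style constructor (the paper uses a BNN here)."""
    labels = gmm.predict(states[:, :3])  # assume first 3 dims are the features
    models = {}
    for c in range(gmm.n_components):
        mask = labels == c
        models[c] = make_model().fit(
            np.hstack([states[mask], actions[mask]]), next_states[mask])
    return models

def in_friction_cone(f_normal, f_tangential, mu=0.42):
    """Coulomb friction check behind Fig. 4A: a transferable contact
    force must satisfy |f_t| <= mu * f_n (mu = 0.42 for PLA on wood)."""
    return abs(f_tangential) <= mu * f_normal
```

At play time, the posterior responsibilities gmm.predict_proba(new_features) would serve as the belief over c_t, letting the robot weight the per-abstraction predictions when inferring block behavior and choosing its actions.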