Legged robots pose one of the greatest challenges in robotics. Dynamic and agile maneuvers of animals cannot be imitated by existing methods that are crafted by humans. A compelling alternative is reinforcement learning, which requires minimal craftsmanship and promotes the natural evolution of a control policy. However, so far, reinforcement learning research for legged robots has mainly been limited to simulation, and only a few comparably simple examples have been deployed on real systems. The primary reason is that training with real robots, particularly with dynamically balancing systems, is complicated and expensive. In the present work, we introduce a method for training a neural network policy in simulation and transferring it to a state-of-the-art legged system, thereby leveraging fast, automated, and cost-effective data generation schemes. The approach is applied to the ANYmal robot, a sophisticated medium-dog–sized quadrupedal system. Using policies trained in simulation, the quadrupedal machine achieves locomotion skills that go beyond what had been achieved with prior methods: ANYmal is capable of precisely and energy-efficiently following high-level body velocity commands, running faster than before, and recovering from falling even in complex configurations.

INTRODUCTION

Legged robotic systems are attractive alternatives to tracked/wheeled robots for applications with rough terrain and complex cluttered environments. The freedom to choose contact points with the environment enables them to overcome obstacles comparable to their leg length. With such capabilities, legged robots may one day rescue people in forests and mountains, climb stairs to carry payloads in construction sites, inspect unstructured underground tunnels, and explore other planets. Legged systems have the potential to perform any physical activity humans and animals are capable of.

A variety of legged systems are being developed in the effort to take us closer to this vision of the future. Boston Dynamics introduced a series of robots equipped with hydraulic actuators (1, 2). These have advantages in operation because they are powered by conventional fuel with high energy density. However, systems of this type cannot be scaled down (usually >40 kg) and generate smoke and noise, limiting them to outdoor environments. Another family of legged systems is equipped with electric actuators, which are better suited to indoor environments. Massachusetts Institute of Technology’s (MIT) Cheetah (3) is one of the most promising legged systems of this kind. It is a fast, efficient, and powerful quadrupedal robot designed with advanced actuation technology. However, it is a research platform optimized mainly for speed and has not been thoroughly evaluated with respect to battery life, turning capability, mechanical robustness, and outdoor applicability. Boston Dynamics’ newly introduced robot, SpotMini, is also driven by electric actuators and is designed for both indoor and outdoor applications. Although details have not been disclosed, public demonstrations and media releases (4) are convincing evidence of its applicability to real-world operation. The platform used in this work, ANYmal (5), is another promising quadrupedal robot powered by electric actuators. Its bioinspired actuator design makes it robust against impact while allowing accurate torque measurement at the joints. However, the complicated actuator design increases cost and compromises the power output of the robot.

Designing control algorithms for these hardware platforms remains exceptionally challenging. From the control perspective, these robots are high-dimensional and nonsmooth systems with many physical constraints. The contact points change over time, depending on the maneuver being executed, and therefore cannot be prespecified. Analytical models of the robots are often inaccurate and cause uncertainties in the dynamics. A complex sensor suite and multiple layers of software introduce noise and delays into information transfer. Conventional control theories are often insufficient to deal with these problems effectively. Specialized control methods developed to tackle this complex problem typically require a lengthy design process and arduous parameter tuning.

The most popular approach to controlling physical legged systems is modular controller design. This method breaks the control problem down into smaller submodules that are largely decoupled and therefore easier to manage. Each module is based on template dynamics (6) or heuristics and generates reference values for the next module. For example, some popular approaches (7–10) use a template-dynamics-based control module that approximates the robot as a point mass with a massless limb to compute the next foothold position. Given the foothold positions, the next module computes a parameterized trajectory for the foot to follow. The last module tracks the trajectory with a simple proportional-integral-derivative (PID) controller. Because the outputs of these modules are physical quantities, such as body height or foot trajectory, each module can be individually hand-tuned. Approaches of this type have achieved impressive results. Kalakrishnan et al. (11) demonstrated robust locomotion over challenging terrain with a quadrupedal robot: To date, this remains the state of the art for rough-terrain locomotion. Recently, Bellicoso et al. (12) demonstrated dynamic gaits, smooth transitions between them, and agile outdoor locomotion with a similar controller design. Yet, despite their attractive properties, modular designs have limitations. First, limited detail in the modeling constrains the model’s accuracy. This inherent drawback is typically mitigated by limiting the operational state domain of each module to a small region where the approximations are valid. In practice, such constraints lead to substantial compromises in performance, such as slow acceleration, fixed upright pose of the body, and limited velocity of the limbs. Second, the design of modular controllers is extremely laborious. Highly trained engineers spend months to develop a controller and to arduously hand-tune the control parameters per module for every new robot or even for every new maneuver. 
For example, running and climbing controllers of this kind can have markedly different architectures and are designed and tuned separately.
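As a concrete illustration of the lowest layer of such a modular stack, the sketch below implements a generic PID tracker of the kind that might follow a reference joint trajectory. The gains, the toy plant in the usage example, and the interface are illustrative placeholders, not taken from any of the cited controllers.

```python
class PID:
    """Minimal PID controller tracking a scalar reference signal."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, reference, measurement):
        # Standard PID law: proportional, integral, and derivative terms
        # on the tracking error. Gains must be hand-tuned per module.
        error = reference - measurement
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Usage on a toy first-order plant (a pure integrator): the tracked
# state converges to the reference. Gains here are hypothetical.
pid = PID(kp=2.0, ki=0.5, kd=0.05, dt=0.01)
q = 0.0
for _ in range(2000):
    q += 0.01 * pid.step(1.0, q)
```

The need to hand-tune such gains for every module and maneuver is precisely the labor cost discussed above.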

More recently, trajectory optimization approaches have been proposed to mitigate the aforementioned problems. In these methods, the controller is separated into two modules: planning and tracking. The planning module uses rigid-body dynamics and numerical optimization to compute an optimal path that the robot should follow to reach the desired goal. The tracking module is then used to follow the path. In general, trajectory optimization for a complex rigid-body model with many unspecified contact points is beyond the capabilities of current optimization techniques. Therefore, in practice, a series of approximations are used to reduce complexity. Some methods approximate the contact dynamics to be smooth (13, 14), making the dynamics differentiable. Notably, Neunert et al. (13) demonstrated that such methods can be used to control a physical quadrupedal robot. Other methods (15) prespecify the contact timings and solve for sections of trajectories where the dynamics remain smooth. A few methods aim to solve this problem with little to no approximation (16, 17). These methods can discover a gait pattern (i.e., contact sequence) with hard contact models and have demonstrated automatic motion generation for two-dimensional (2D) robotic systems, but, like any other trajectory optimization approach, the possible contact points are specified a priori. Although more automated than modular designs, the existing optimization methods perform worse than state-of-the-art modular controllers. The primary issue is that numerical trajectory optimization remains challenging, requires tuning, and in many cases, can produce suboptimal solutions. Besides, optimization has to be performed at execution time on the robot, making these methods computationally expensive. This problem is often solved by reducing precision or running the optimization on a powerful external machine, but both solutions introduce their own limitations. 
Furthermore, the system still consists of two independent modules that do not adapt to each other’s performance characteristics. This necessitates hand-tuning of the tracker; yet, accurately tracking fast motion by an underactuated system with many unexpected contacts is nearly impossible.

Data-driven methods, such as reinforcement learning (RL), promise to overcome the limitations of prior model-based approaches by learning effective controllers directly from experience. The idea of RL is to collect data by trial and error and automatically tune the controller to optimize the given cost (or reward) function representing the task. This process is fully automated and can optimize the controller end to end, from sensor readings to low-level control signals, thereby allowing for highly agile and efficient controllers. On the downside, RL typically requires prohibitively long interaction with the system to learn complex skills, often weeks or months of real-time execution (18). Moreover, over the course of training, the controller may exhibit sudden and chaotic behavior, leading to logistical complications and safety concerns. Direct application of learning methods to physical legged systems is therefore complicated and has only been demonstrated on relatively simple and stable platforms (19) or in a limited context (20).

Because of the difficulties of training on physical systems, most advanced applications of RL to legged locomotion are restricted to simulation. Recent innovations in RL make it possible to train locomotion policies for complex legged models. Levine and Koltun (21) combined learning and trajectory optimization to train a locomotion controller for a simulated 2D walker. Schulman et al. (22) trained a locomotion policy for a similar 2D walker with an actor-critic method. More recent work obtained full 3D locomotion policies (23–26). In these papers, animated characters achieve remarkable motor skills in simulation.

Given the achievements of RL in simulated environments, a natural question is whether these learned policies can be deployed on physical systems. Unfortunately, such simulation-to-reality transfer is hindered by the reality gap—the discrepancy between simulation and the real system in terms of dynamics and perception. There are two general approaches to bridging the reality gap. The first is to improve simulation fidelity either analytically or in a data-driven way; the latter is also known as system identification (27–32). The second approach is to accept the imperfections of simulation and aim to make the controller robust to variations in system properties, thereby allowing for better transfer. This robustness can be achieved by randomizing various aspects of the simulation: using a stochastic policy (33), randomizing the dynamics (34–37), adding noise to the observations (38), and perturbing the system with random disturbances. Both approaches lead to improved transfer; however, the former is cumbersome and often impossible, and the latter can compromise the performance of the policy. Therefore, in practice, both are typically used in conjunction. For instance, the recent work of Tan et al. (35) demonstrated successful sim-to-real transfer of locomotion policies on a quadrupedal system called Minitaur via the use of an accurate analytical actuator model and dynamic randomization. Although it achieved impressive results, the method of Tan et al. (35) crucially depended on accurate analytical modeling of the actuators, which is possible for direct-drive actuators (as used in Minitaur) but not for more complex actuators, such as servomotors, series elastic actuators (SEAs), and hydraulic cylinders, which are commonly used in larger legged systems.
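The randomization ideas above can be sketched in a few lines: at the start of each training episode, the simulator's dynamics parameters are perturbed, and Gaussian noise is added to each observation. The parameter ranges, noise level, and delay model below are hypothetical examples, not the settings used in any of the cited works.

```python
import random

def randomize_dynamics(nominal_mass=30.0, nominal_friction=0.8):
    """Sample perturbed dynamics parameters for one training episode.

    Nominal values and perturbation ranges are illustrative placeholders.
    """
    return {
        "mass": nominal_mass * random.uniform(0.9, 1.1),          # +/-10% mass
        "friction": nominal_friction * random.uniform(0.7, 1.3),  # +/-30% friction
        "control_delay_steps": random.randint(0, 3),              # random latency
    }

def noisy_observation(state, sigma=0.01):
    """Corrupt each observed state component with Gaussian noise."""
    return [s + random.gauss(0.0, sigma) for s in state]
```

A policy trained under such perturbations cannot exploit any single parameter setting and therefore tends to transfer better, at some cost in peak performance, as noted above.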

In this work, we developed a practical methodology for autonomously learning and transferring agile and dynamic motor skills for complex and large legged systems, such as the ANYmal robot (5). Compared with the robot used in (35), ANYmal has a much larger leg length relative to footprint, making it more dynamic, less statically stable, and therefore more difficult to control. In addition, it features 12 SEAs, which are difficult to control and for which sufficiently accurate analytical models do not exist. Gehring et al. (39) attempted analytical modeling of an SEA, but, as we will show, their model is insufficiently accurate for training a high-performance locomotion controller.

Our approach is summarized in Fig. 1. Our key insight on the simulation side is that efficiency and realism can be achieved by combining classical models representing well-known articulated system and contact dynamics with learning methods that can handle complex actuation (Fig. 1, steps 1 and 2). The rigid links of ANYmal, connected through high-quality ball bearings, closely resemble an idealized multibody system that can be modeled with well-known physical principles (40). However, this analytical model does not include the set of mechanisms that map the actuator commands to the generalized forces acting on the rigid-body system: the actuator dynamics, the delays in control signals introduced by multiple hardware and software layers, the low-level controller dynamics, and compliance/damping at the joints. Because these mechanisms are nearly impossible to model accurately, we learned the corresponding mapping in an end-to-end manner—from commanded actions to the resulting torques—with a deep network. We learned this “actuator net” on the physical system via self-supervised learning and used it in the simulation loop to model each of the 12 joints of ANYmal. Crucially, the full hybrid simulator, including a rigid-body simulation and the actuator nets, runs at nearly 500,000 time steps per second, which allows the simulation to run roughly a thousand times faster than real time. About half of the run time was used to evaluate the actuator nets, and the remaining computations were efficiently performed via our in-house simulator, which exploits the fast contact solver of Hwangbo et al. (41), efficient recursive algorithms for computing dynamic properties of articulated systems (composite rigid-body algorithm and recursive Newton-Euler algorithm) (40), and a fast collision-detection library (42). 
Thanks to efficient software implementations, we did not need any special computing hardware, such as powerful servers with multiple central processing units (CPUs) and graphics processing units (GPUs), for training. All training sessions presented in this paper were done on a personal computer with one CPU and one GPU, and none lasted more than 11 hours.
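To make the actuator-net idea concrete, the sketch below shows a small MLP mapping short histories of joint position error and velocity to a torque, one plausible parameterization of the commanded-action-to-torque mapping described above. The layer sizes, softsign nonlinearity, and random placeholder weights are assumptions for illustration; a real actuator net would be trained on data logged from the physical robot and evaluated once per joint per simulation step.

```python
import numpy as np

def softsign(x):
    """A smooth, bounded nonlinearity: x / (1 + |x|)."""
    return x / (1.0 + np.abs(x))

class ActuatorNet:
    """Toy actuator net: (position-error, velocity) histories -> torque.

    Weights are random placeholders, not a trained model.
    """

    def __init__(self, in_dim=6, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((hidden, in_dim)) * 0.1
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, hidden)) * 0.1
        self.b2 = np.zeros(hidden)
        self.W3 = rng.standard_normal((1, hidden)) * 0.1
        self.b3 = np.zeros(1)

    def torque(self, pos_err_hist, vel_hist):
        # Concatenate 3-step histories into one input vector,
        # then run a two-hidden-layer forward pass.
        x = np.concatenate([pos_err_hist, vel_hist])
        h = softsign(self.W1 @ x + self.b1)
        h = softsign(self.W2 @ h + self.b2)
        return float((self.W3 @ h + self.b3)[0])
```

In the hybrid simulator, the rigid-body engine supplies the joint states, one such network per joint supplies the torques, and the two are stepped together in a loop.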

Fig. 1 Creating a control policy. In the first step, we identify the physical parameters of the robot and estimate uncertainties in the identification. In the second step, we train an actuator net that models complex actuator/software dynamics. In the third step, we train a control policy using the models produced in the first two steps. In the fourth step, we deploy the trained policy directly on the physical system.

We used the hybrid simulator for training controllers via RL (Fig. 1, step 3). The controller was represented by a multilayer perceptron (MLP) that took as input the history of the robot’s states and produced as output the joint position targets. Specifying different reward functions for RL yielded controllers for different tasks of interest.
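As an illustration of how a task can be specified through the reward alone, the sketch below scores tracking of a commanded body velocity while penalizing torque use. The terms and the weights `k_v` and `k_tau` are hypothetical and do not reproduce the actual cost functions used in this work.

```python
import numpy as np

def reward(v_cmd, v_body, torques, k_v=1.0, k_tau=1e-4):
    """Illustrative reward: velocity-command tracking minus effort penalty.

    v_cmd, v_body: commanded and measured body velocities.
    torques: applied joint torques (penalized quadratically).
    """
    tracking = -k_v * np.sum((np.asarray(v_cmd) - np.asarray(v_body)) ** 2)
    effort = -k_tau * np.sum(np.asarray(torques) ** 2)
    return float(tracking + effort)
```

Swapping this function for one rewarding, say, an upright base orientation would yield a different controller from the same training code, which is the point made above.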

The trained controller was then directly deployed on the physical system (Fig. 1, step 4). Unlike the existing model-based control approaches, our proposed method is computationally efficient at run time. Inference of the simple network used in this work took 25 μs on a single CPU thread, which corresponds to about 0.1% of the available onboard computational resources on the robot used in the experiments. This is in contrast to model-based control approaches that often require an external computer to operate at sufficient frequency (13, 15). Also, by simply swapping the network parameter set, the learned controller manifested vastly different behaviors. Although these behaviors were trained separately, they share the same code base: Only the high-level task description changed depending on the behavior. In contrast, most of the existing controllers are task-specific and have to be developed nearly from scratch for every new maneuver.
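The behavior-switching scheme, a single forward-pass code path with swappable parameter sets, can be sketched as follows. The task names, network shapes, and random weights are illustrative placeholders.

```python
import numpy as np

def mlp_forward(params, x):
    """One shared forward-pass routine; behavior is fixed by `params`."""
    h = np.tanh(params["W1"] @ x + params["b1"])
    return params["W2"] @ h + params["b2"]

def make_params(seed, in_dim=4, hidden=8, out_dim=2):
    """Stand-in for loading a trained parameter set from disk."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((hidden, in_dim)),
        "b1": np.zeros(hidden),
        "W2": rng.standard_normal((out_dim, hidden)),
        "b2": np.zeros(out_dim),
    }

# Two "behaviors" sharing the same code base; only the parameters differ.
policies = {"command_following": make_params(0), "recovery": make_params(1)}
```

Switching behaviors at run time is then a dictionary lookup rather than a controller redesign.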

We applied the presented methodology to learning several complex motor skills that were deployed on the physical quadruped. First, the controller enabled the ANYmal robot to follow base velocity commands more accurately and energy-efficiently than the best previously existing controller running on the same hardware. Second, the controller made the robot run faster, breaking the previous speed record of ANYmal by 25%. The controller could operate at the limits of the hardware and push performance to the maximum. Third, we learned a controller for dynamic recovery from a fall. This maneuver is exceptionally challenging for existing methods because it involves multiple unspecified internal and external contacts. It requires fine coordination of actions across all limbs and must use momentum to dynamically flip the robot. To the best of our knowledge, such a recovery skill has not been achieved on a quadruped of comparable complexity.