Making a robot learn how to move, part 2 – reinforcement learning in the real, wild world

Tackling the bottlenecks of the physical world.

This is the second part of a series of posts. You can find the Intro and Part 1, about Evolutionary Algorithms, here. You can find the code for the described algorithms and experiments in this GitHub repository.

In recent years, Reinforcement Learning has had a renaissance. Various breakthroughs and remarkable results have gained the attention of the whole scientific community, and even of pop culture: from AlphaGo to DQN applied to Atari, to the very recent OpenAI DOTA 2 bot.

New algorithms and architectures have been released at an astonishing speed, beating state-of-the-art results and solving new tasks. These algorithms have performed very well even in simulated physical worlds, with humanoid and animal-like models learning to walk, jump, and avoid obstacles, as DeepMind recently showed.

A robot dealing with the physical world.

But the majority of these results have one thing in common: they require large amounts of computational power and time, and they are obtained in virtual or simulated worlds. Hundreds of thousands of episodes are needed to reach good results, and this is often not a problem thanks to the recent availability of high-performance computers. But what happens when reinforcement learning meets the time costs and problems of the real world? Will these algorithms work if they are run, for example, for a hundred episodes? This, in part, is what we tried to analyze in the field of robotics.

[…] we worked with the main focus of obtaining good results in a very short time.

As my previous posts explain, the task that a colleague of mine and I tried to solve was to find ways to make a robotic manipulator follow a given trajectory as precisely as possible, without knowing its physical parameters. We decided to implement intelligent control algorithms and compare them with classic feedback controllers, first of all a common PD. All the algorithms run in parallel with a PD feedback controller, trying to improve its performance and create a more intelligent controller.

Since the robot doesn’t know its dynamical parameters, it doesn’t know which torques to apply to its joints in order to achieve a desired motion. The motion error, thus, is not differentiable with respect to, for example, the parameters of a function approximator like a neural network, as I discussed in Part 1. This kind of problem, where an agent has to try actions and learn by sensing their unknown effects on an environment, is the common setting for reinforcement learning, so we tried to implement some RL algorithms among the other techniques discussed in other posts.

To keep this post shorter and more readable, I will not introduce Reinforcement Learning theory from scratch, but there are really plenty of resources to learn about it, on Medium too.

We implemented two main algorithms. One is Q-Learning, a discrete-action algorithm, which we used to tune the gains of the PD controller in real time while the robot was moving. The second is Actor-Critic, another very popular algorithm used to find a policy, this time over continuous actions, that controls the torques on the joints.

All the details of the implementation are in the GitHub repository.

Q-Learning for dynamic PD tuning.

Moving a robotic manipulator means finding the right torques to apply to the joints, and these torques take continuous values in a certain range. Q-Learning, on the other hand, is an algorithm that learns the Q value of discrete actions in a state, Q(s,a), in order to perform the action that is expected to give the highest reward (more precisely, the highest total return following the policy). Its common versions can’t deal with large action spaces, let alone continuous ones. But we were interested in the idea of dynamically tuning the gains of a PD controller as the robot moved. The PD controller outputs torques based on the error between the desired and actual position and velocity. While this simple linear controller can give remarkable results with fixed P and D gains, which are usually tuned with some trial and error, the idea was to make the algorithm tune them in real time based on the motion of the robot, ideally making the controller more versatile. The classic PD is widely used, given its simplicity and its good performance, but it remains a linear feedback controller that doesn’t exploit any information about the trajectory or other data; it only uses the error values from the last time step.
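
For reference, this is the feedback law everything builds on; a minimal sketch, assuming element-wise gains per joint (the gain values and array sizes are illustrative, not those of our setup):

```python
import numpy as np

def pd_torque(q_des, q, qd_des, qd, kp, kd):
    """Classic PD feedback law: torques from position and velocity errors."""
    e = q_des - q        # position error per joint
    e_dot = qd_des - qd  # velocity error per joint
    return kp * e + kd * e_dot

# Example on a 2-joint arm with illustrative gains
tau = pd_torque(q_des=np.array([0.5, 1.0]), q=np.array([0.45, 0.9]),
                qd_des=np.zeros(2), qd=np.array([0.05, 0.10]),
                kp=np.array([50.0, 50.0]), kd=np.array([5.0, 5.0]))
```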

The main architecture was the following: the actions were simply to increase or decrease the P and D gains independently by a fixed amount. The intuition was to use, as the state, the current state of the robot along with the evolution of the errors over the last steps. To approximate the Q function, we used neural networks: the inputs were the values of the state, as described above, while the outputs were the Q values of the various actions. The total reward was computed only over a short range of 10 steps, so the main goal of the algorithm was to obtain the highest reward in that range by choosing the update for the gains.

The reward that we implemented was not the negative error in position and velocity (higher error -> lower reward), but the difference between the error in the previous step range and the current one. This was done to partially decorrelate the performance across the various step ranges: if the algorithm performed badly during learning in a particular step range, the error will obviously be large in the next steps too, since the robot moves continuously, so even a good action chosen in those next steps could be penalized without a valid reason. Thus, by considering only relative improvements between short subsequent windows, we can actually reward good actions.
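
The core of the gain-tuning loop can be sketched as follows; this is a simplified illustration, where the action set, step size, and window-based reward are placeholders and the Q-network is assumed to be provided elsewhere (the actual implementation is in the repository):

```python
import numpy as np

# Discrete actions: change Kp and/or Kd by a fixed step (step size is illustrative)
DELTA = 0.5
ACTIONS = [(+DELTA, 0.0), (-DELTA, 0.0), (0.0, +DELTA), (0.0, -DELTA), (0.0, 0.0)]

def choose_action(q_values, epsilon=0.1):
    """Epsilon-greedy selection over the discrete gain updates.
    q_values: array of Q(s, a) estimates, one per action, e.g. from a neural network."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(ACTIONS))
    return int(np.argmax(q_values))

def window_reward(prev_window_errors, curr_window_errors):
    """Reward over a short window of steps: the improvement with respect to the
    previous window, rather than the raw (negative) tracking error."""
    return float(np.sum(prev_window_errors) - np.sum(curr_window_errors))

def apply_action(kp, kd, action_index):
    """Update the PD gains according to the chosen discrete action."""
    dkp, dkd = ACTIONS[action_index]
    return kp + dkp, kd + dkd
```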

As I wrote in the introduction, we worked with the main focus of obtaining good results in a very short time, something around 100 episodes/full trajectory runs. Otherwise, the algorithm becomes infeasible for the physical world and is outperformed by other control algorithms, like Model Predictive Control or Iterative Learning.

We performed a large number of experiments, tweaking architectures and hyperparameters, but the final result was often that the algorithm constantly increased the gains. This makes sense, since it’s well known in control theory that higher gains often lead to better performance, but high gains imply high power consumption and sometimes exceeding the motor limits. We tried to make the algorithm learn less obvious ways of tuning the gains in real time, ones that obtained better performance with lower energy costs, but this particular environment, with the given trajectories, didn’t really produce those results; it did, however, confirm some known empirical concepts in control theory, learned by the algorithm without any prior knowledge.

Actor-Critic for Deterministic Policy Learning

The second type of algorithm that we implemented was Actor-Critic, used to learn a deterministic action policy (we then experimented with a stochastic policy as well). The idea is straightforward: inspired by other papers, we decided to develop an actor-critic architecture for each joint independently, to learn a deterministic policy that allowed the robot to follow the trajectory more precisely, compensating for the weak points of the PD controller. A fairly recent master’s thesis showed interesting results on a 6R robot using a similar architecture, with better overall performance than classic learning algorithms in control theory.

Implementing a separate actor-critic policy for each joint makes the algorithm more scalable, but it can also make it harder to learn the complex dynamical couplings between the joints with this decentralized architecture.

The overall architecture was the following: to extract features from the robot state, we developed a 2D array of Gaussian Radial Basis Functions (RBFs). The number of Gaussians, as well as their means and variances, was chosen empirically. The input to this array, for each joint, was the position and velocity error. The outputs of the RBFs were then fed into a fully connected layer (i.e., a weighted sum) to generate the torque to apply to the joint. These weights were the parameters tuned by the algorithm.
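
A minimal sketch of one such per-joint actor, assuming a square grid of RBF centers over the (position error, velocity error) plane; the grid size, ranges, and width below are illustrative, not the values used in the experiments:

```python
import numpy as np

class RBFActor:
    """Per-joint actor: a 2D grid of Gaussian RBFs over (position error, velocity error),
    followed by a linear layer (weighted sum) producing the torque."""
    def __init__(self, e_range=(-1.0, 1.0), edot_range=(-2.0, 2.0), n=5, sigma=0.3):
        e_centers = np.linspace(*e_range, n)
        edot_centers = np.linspace(*edot_range, n)
        grid = np.meshgrid(e_centers, edot_centers)
        self.centers = np.stack([g.ravel() for g in grid], axis=1)  # (n*n, 2)
        self.sigma = sigma
        self.w = np.zeros(len(self.centers))  # weights tuned by the learning algorithm

    def features(self, e, e_dot):
        """Gaussian RBF activations for the current error pair."""
        d2 = np.sum((self.centers - np.array([e, e_dot])) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.sigma ** 2))

    def torque(self, e, e_dot):
        """Weighted sum of the RBF features: the torque added to the PD output."""
        return float(self.w @ self.features(e, e_dot))
```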

As an exploratory signal, added to the actor output in order to try different actions, we used a sum of sinusoids at different frequencies instead of the classic white Gaussian noise, since a physical robot cannot actually generate white-noise torques, due to the high harmonics (the «white» refers to having all frequencies in the spectrum, while a physical system is always a low-pass system).
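
Such an exploration term can be generated, for example, like this (the specific frequencies, amplitudes, and phases here are just placeholders, not our tuned values):

```python
import numpy as np

def exploration_signal(t, freqs=(0.5, 1.3, 2.7), amps=(0.2, 0.1, 0.05), phases=(0.0, 1.0, 2.0)):
    """Smooth exploratory torque: a sum of sinusoids at a few low frequencies,
    used in place of white Gaussian noise."""
    return sum(a * np.sin(2.0 * np.pi * f * t + p)
               for f, a, p in zip(freqs, amps, phases))
```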

Our experiments showed how this kind of algorithm can be very unstable, as is known in the RL literature. An apparent convergence of the error towards low values can then blow up in the very last steps. Many recent papers, like the one introducing the famous DDPG algorithm, suggest tricks to make learning more stable, such as using two networks for both the Actor and the Critic, updated at different rates, to compute the targets, and experience replay, a very popular technique in recent RL literature that decorrelates experiences by storing them in a data structure and then sampling them in random order to update the policy.
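
Two of those tricks are easy to sketch in isolation; this is a generic illustration of the ideas from the DDPG-style literature, not our exact implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay: store transitions and sample them in random order,
    which decorrelates consecutive experiences before each update."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def soft_update(target_params, online_params, tau=0.001):
    """DDPG-style target-network update: the target slowly tracks the online network."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```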

Look, the error is approaching zero…

…except for exploding at the next step.

While, as a test, our algorithm was able to solve famous Gym environments, like Mountain Car, the learning phase on the robot was quite unstable, and the algorithm could not improve the performance of the robot in a short time, except for robots with 1 or 2 joints. We tried many experiments with different architectures and hyperparameters, but we couldn’t replicate the results of the work we analyzed. One of the main reasons, we suspect, is that the robot used in that case was controlled in velocity and had a built-in low-level controller for acceleration, thus handling many inertia-related problems by itself, while we controlled the robot directly in acceleration without any other low-level controller.

Interesting comparison of the PD torques (red) and the Actor torques (green) that improved performance. Note how the algorithm learns to be in phase with the PD, but also to give negative peaks in the first oscillations to avoid overshooting.

All the experiments confirmed how the physical world can be challenging for algorithms that perform well in simulations.

We didn’t explore these problems further at the time, since another method, which I will present in the next post, gave remarkable results in a very short learning time, but I’m interested in exploring different techniques for this task in the future.

All the experiments confirmed how the physical world can be challenging for algorithms that perform well in simulations. The time delays introduced by executing actual trajectories on a robot are a bottleneck that is often forgotten when working with simulated physical worlds. There is active research in these fields, which often leads to research on «learning to learn», i.e. methods for robots to learn new tasks and abilities from few-shot examples, like the recent OpenAI blog post and paper.

In the next post I will show a different technique, based on feedback-driven dynamical model learning with basis function neural networks, that gave remarkable results.

You can find the code of all the experiments in this GitHub repository.