UC Berkeley researchers have published a paper demonstrating how deep reinforcement learning can be used to control dexterous robot hands for complicated tasks. The paper, Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations, proposes a low-cost, high-efficiency control method that uses demonstration and simulation techniques to accelerate the learning process.

Why Dexterous Hands?

The most common robot hands today are simple parallel jaw grippers used as manipulators on well-structured product manufacturing lines. However, there are also dexterous robot manipulators: multi-fingered hands capable of performing a wide range of humanlike actions such as moving objects, opening doors, typing, painting, etc.

Dexterous robot hands with delicate sensing and actuation are, however, difficult to control and very expensive — the high-end Allegro Hand, for example, costs about US$15,000. Deep reinforcement learning (Deep RL) offers the possibility of automating complicated control tasks using cheaper hardware. The Berkeley researchers demonstrated their accelerated learning techniques on two separate hardware platforms: the aforementioned state-of-the-art Allegro Hand, and a custom-built three-fingered Dynamixel Claw, which costs under US$2,500.

Dynamixel Claw (left) and Allegro Hand.

Model-free Reinforcement Learning In the Real World

Deep RL algorithms learn through trial and error, using a pre-specified reward function to provide guided feedback. In their experiments, the researchers used a simple valve rotation task: the hand must open the valve by rotating it 180 degrees.

Illustration of valve rotation task

The reward function is simply the negative distance between the current and desired valve orientation; the dexterous hand must figure out how to rotate the valve correctly on its own. Using this weak reward signal and a truncated natural policy gradient method, the researchers showed that reinforcement learning can actually accomplish the task in the real world on both hardware platforms.
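The reward described above can be sketched as a few lines of Python. This is a minimal illustration, not the paper's implementation; the function name, the angle representation, and the 180-degree (pi radian) target are assumptions made for the example.

```python
import numpy as np

def valve_reward(current_angle, target_angle=np.pi):
    """Hypothetical reward: negative distance between the current and
    desired valve orientation (the goal is a 180-degree rotation)."""
    return -abs(target_angle - current_angle)

# The reward rises toward 0 as the valve approaches the goal orientation,
# giving the policy a weak but directional learning signal.
```

Note that this signal says nothing about how to grasp or turn the valve; the policy must discover the finger motions on its own.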

Final trained valve rotation policy

Accelerating Learning with Human Demonstrations

The researchers employed the Demonstration Augmented Policy Gradient (DAPG) method, which incorporates human demonstrations into the reinforcement learning process. The method accelerates Deep RL in two ways:

Provides a good initialization for the policy via behavior cloning.

Provides an auxiliary learning signal throughout the learning process to guide exploration using a trajectory tracking auxiliary reward.

Using this demonstration-augmented approach, the researchers were able to significantly reduce task training time.

Accelerating Learning with Simulation

Large amounts of simulated hand-motion data can also accelerate the learning process. To make simulated data more representative of the real world, simulator parameters are randomized during training so that the learned policy becomes robust to the visual and physical discrepancies between simulation and reality. The researchers demonstrated this method's potential for accelerating learning as well. However, constructing an accurate simulator is expensive and time-consuming, as it must be done manually.
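The randomization idea above can be sketched as follows. The parameter names and ranges here are hypothetical, invented for illustration; a real simulator would expose its own physics and rendering parameters.

```python
import random

def sample_sim_params(rng):
    """Draw a fresh set of hypothetical simulator parameters for one
    training episode. Randomizing them forces the policy to succeed
    across many plausible physical and visual variations, improving
    transfer to the real robot."""
    return {
        "valve_friction": rng.uniform(0.5, 1.5),
        "joint_damping":  rng.uniform(0.8, 1.2),
        "valve_mass":     rng.uniform(0.05, 0.2),   # kg
        "camera_offset":  rng.uniform(-0.01, 0.01), # m
    }

rng = random.Random(0)
episodes = [sample_sim_params(rng) for _ in range(3)]
```

A policy trained only under one fixed parameter set tends to overfit to the simulator; resampling every episode is what makes the resulting policy transferable.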

Valve rotation policy transferred from simulation using randomization

Accelerating Learning with Learned Models

The researchers realized that learned dynamics models from their previous work — such as Optimal Control with Learned Local Models and Learning Dexterous Manipulation Models — could also accelerate real-world reinforcement learning. However, the performance of those methods is limited by the quality of the model that can be learned. The researchers believe that in practice, model-free algorithms will still deliver the best asymptotic performance.
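The core idea of a learned dynamics model can be shown in miniature. This sketch uses a linear model fit by least squares purely for illustration; the cited papers learn far richer local or neural models, and all names and dimensions here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical transition data collected on the robot:
# states S, actions A, and resulting next states S_next.
S = rng.normal(size=(200, 3))
A = rng.normal(size=(200, 2))
X = np.hstack([S, A])
S_next = X @ rng.normal(size=(5, 3))  # ground-truth dynamics (linear, for the demo)

# Fit a dynamics model s' ~ [s, a] @ M by least squares. Planning or
# policy optimization can then query this model instead of the physical
# hardware, which is where the speedup comes from.
M, *_ = np.linalg.lstsq(X, S_next, rcond=None)

def predict(state, action):
    return np.concatenate([state, action]) @ M
```

The trade-off the article notes follows directly: every rollout against `predict` is cheap, but the policy can only be as good as the model's fidelity to the real dynamics.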

Challenges

Although the research team demonstrated that Deep RL can help boost performance in real-world manipulator applications, the technique presents several challenges of its own:

Because learning requires a large number of exploratory actions, the hands heat up quickly and must be paused periodically to avoid damage.

Since the hands must attempt the task many times, the team had to build an automatic reset mechanism. A promising direction for removing this requirement is to automatically learn reset policies.

Reinforcement learning methods require a reward signal, and this reward must still be designed manually. Some of the team's recent work has looked at automating reward specification.

Further information on UC Berkeley’s dexterous manipulation research can be found in these papers: