In this section, we first introduce the control problem by describing the governing dynamics. Next, we describe our learning and control algorithm (software). Finally, we provide insight into the design of our physical system (hardware).

System dynamics

Equation (1) defines the relationship between the joint kinematics and the applied torques of the limb18 (forward model):

$$\ddot q = - I\left( q \right)^{ - 1}\left( {C\left( {q,\dot q} \right) + B\dot q} \right) + I\left( q \right)^{ - 1}T$$ (1)

where \(q \in {\cal R}^{2 \times 1}\), \(\dot q \in {\cal R}^{2 \times 1}\) and \(\ddot q \in {\cal R}^{2 \times 1}\) are the joint angle vector and its first and second derivatives, respectively, \(I(q) \in {\cal R}^{2 \times 2}\) is the inertia matrix, \(C(q,\dot q) \in {\cal R}^{2 \times 1}\) is the vector of Coriolis and centripetal forces, \(B \in {\cal R}^{2 \times 2}\) is the joint friction matrix and \(T \in {\cal R}^{2 \times 1}\) is the applied joint torque vector. The musculotendon forces (here, cables pulled by the motors) are related to the applied joint torque vector as described in equation (2):

$$T = M(q)F_0a$$ (2)

where \(M(q) \in {\cal R}^{2 \times 3}\) is the moment arm matrix, \(F_0\) is a 3 × 3 diagonal matrix containing the maximal force values that can be exerted by each actuator and \(a \in {\cal R}^{3 \times 1}\) is the vector of normalized actuation values for each actuator42,43. Please note that this is an under-determined system (three input force values generate two torques), where there is redundancy in the production of net joint torques at each instant. However, because the system is driven by tendons that can pull but not push (and not by torque motors coupled directly to the joints, as is common in robotics), joint rotations also depend on the ability of the controller to pay out and reel in tendon as needed; otherwise the movement can be disrupted or the system can become uncontrollable, respectively (this is why we use back-drivable brushless d.c. motors and maintain tension in the tendons at all times). As such, tendon-driven systems present the challenge of being simultaneously under- and over-determined42,43. The presence of constant tension in the tendons and friction in the joints (which can be heard in Supplementary Video 1) helps stabilize the system but also adds a deadband for the control of subtle movements.
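For concreteness, the following minimal sketch evaluates equations (1) and (2) for a single time step; the inertia, Coriolis, friction, moment arm and maximal force terms are illustrative placeholders rather than identified parameters of our limb (G2P itself never uses such an analytical model).

```python
import numpy as np

def forward_dynamics(q, q_dot, a, I_fn, C_fn, B, M_fn, F0):
    """One evaluation of equations (1) and (2): motor activations -> joint accelerations.
    All model terms (I_fn, C_fn, B, M_fn, F0) are illustrative placeholders."""
    T = M_fn(q) @ F0 @ a                       # equation (2): joint torques from tendon forces
    I_inv = np.linalg.inv(I_fn(q))             # inverse of the 2x2 inertia matrix
    q_ddot = -I_inv @ (C_fn(q, q_dot) + B @ q_dot) + I_inv @ T   # equation (1)
    return q_ddot

# Example with arbitrary (made-up) constants for a planar two-joint, three-tendon limb
I_fn = lambda q: np.array([[0.05, 0.01], [0.01, 0.02]])           # inertia (kg m^2)
C_fn = lambda q, q_dot: np.zeros(2)                               # Coriolis/centripetal terms
B = 0.01 * np.eye(2)                                              # joint friction
M_fn = lambda q: np.array([[0.02, -0.02, 0.015],                  # moment arms (m)
                           [0.0,   0.015, -0.015]])
F0 = np.diag([40.0, 40.0, 40.0])                                  # maximal tendon forces (N)

q_ddot = forward_dynamics(np.zeros(2), np.zeros(2), np.array([0.5, 0.3, 0.2]),
                          I_fn, C_fn, B, M_fn, F0)
```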

The goal of the inverse map is to find the vector of actuation values (a) for any given set of desired kinematics \((q,\dot{q},\ddot{q})\) without using any explicit model, relying only on the babbling and task-specific data. The mapping performed by the ANN used in the lower-level control of this study is described in equation (3):

$$a = {\rm{ANN}}(q,\dot{q} ,\ddot{q} )$$ (3)

Finally, the higher-level controller (in the RL task) is in charge of exploring the kinematic space and converging to desired kinematic trajectories that yield high reward. Although these equations are effective for describing and controlling such systems, we designed G2P's lower-level control on the premise that only the joint dynamics were observable (while not being used in real time) and that the only controllable element is a. As a consequence, our system does not have any direct a priori conception of the model structure or the constants that drive the dynamics; the lower-level control must infer those relationships using training data from babbling and refine them after each attempt using only task-specific input–output data (without being provided with a desired or error signal while refining the map after each attempt).

Learning and control algorithm

Learning and control in this first implementation of the G2P algorithm take place at two levels: (1) inverse mapping and refinements (the lower-level control) and (2) the reward-based RL algorithm (the higher-level control). The lower level is responsible for creating an inverse map that converts kinematics into viable control sequences (motor commands). The higher-level control is responsible for reward-driven exploration (RL) of the parametrized kinematics space; the resulting kinematics are passed to the lower-level control and ultimately run on the system.

Inverse mapping and refinements

The lower-level control relies on two phases. As the system is provided with no prior information on its dynamics, topology or structure, it first explores its dynamics in a general sense by sending random control sequences to the motors, which we call motor babbling. After 5 min of motor babbling, the system creates the initial inverse map from the babbling data and then further refines this map using data collected from task-specific explorations, which we refer to as task-specific adaptation. This transition from motor babbling to adaptation to a particular task is the reason we refer to this algorithm as ‘general to particular’ or G2P.

Motor babbling

During this phase, the system tries random control sequences and collects the resulting limb kinematics. A multi-layer perceptron ANN is trained on this input–output set to generate an inverse map from the desired system outputs (here, the system kinematics: joint angles, angular velocities and angular accelerations) to the system inputs (here, motor activation levels). Although sparse and not tailored to any subsequent task of interest, data from these random inputs and outputs suffice for the ANN to create an approximate general map of the system’s dynamics.

Random activation values for the babbling

The motor activation values (control sequences) for motor babbling were generated using two pseudo-random number generators (uniformly distributed). The first random number generator defines the probability of the activation level moving from one command level to another. This probability was set to \(1/f_{\mathrm{s}}\) (where \(f_{\mathrm{s}}\) is the command sampling frequency), so the activation value for each actuator changes at an average rate of 1 Hz. The second number defines the activation level of the next step, sampled from a range of 15% (to prevent tendons from going slack) to 100% activation. The resulting command signals were stair-step transitions in activation for each motor. Three command signals were created (using different initial random seeds), which drove the three motors during motor babbling. It is important to note that these stair-step random activations are designed to explore the general dynamics of the system and are not tailored to any task performed during this study (Fig. 6).
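The babbling commands can be reproduced with the following minimal sketch; the ~78 Hz command rate and the specific seeds are assumptions for illustration.

```python
import numpy as np

def babble_channel(duration_s, fs, a_min=0.15, a_max=1.0, seed=0):
    """Stair-step random activation for one motor: at each sample, with
    probability 1/fs the level jumps to a value drawn uniformly from
    [a_min, a_max], so it changes about once per second on average."""
    rng = np.random.default_rng(seed)
    n_samples = int(duration_s * fs)
    a = np.empty(n_samples)
    level = rng.uniform(a_min, a_max)
    for t in range(n_samples):
        if rng.uniform() < 1.0 / fs:            # first generator: decide whether to switch
            level = rng.uniform(a_min, a_max)   # second generator: pick the next level
        a[t] = level
    return a

# Three command signals (one per motor), each from a different seed; ~5 min at ~78 Hz
fs = 78
activations = np.column_stack([babble_channel(300, fs, seed=s) for s in (1, 2, 3)])
```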

Structure of the ANN

The ANN representing the inverse map from 6D limb kinematics to 3D motor control sequences (equation (3)) has three layers (input, hidden and output) with 6, 15 and 3 nodes, respectively. The transfer function for all nodes was the hyperbolic tangent sigmoid function (with a scaling for the output layer to keep it in the range of the outputs). The performance function was the mean squared error (m.s.e.). The Levenberg–Marquardt backpropagation technique was used to train the ANN, and weights and biases were initialized according to the Nguyen–Widrow initialization algorithm. Generation and training of the ANNs were carried out using MATLAB’s Neural Network Toolbox (MathWorks; see the documentation of MATLAB’s Deep Learning Toolbox, formerly known as the Neural Network Toolbox, for more details).
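A rough open-source equivalent of this inverse map is sketched below using scikit-learn’s MLPRegressor; note that it substitutes L-BFGS for Levenberg–Marquardt training and an identity (rather than scaled tanh) output layer, so it only approximates the MATLAB setup described above.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Inverse map: 6D kinematics (q, q_dot, q_ddot for two joints) -> 3 motor activations.
inverse_map = MLPRegressor(hidden_layer_sizes=(15,), activation='tanh',
                           solver='lbfgs', max_iter=2000)

kinematics = np.random.rand(1000, 6)     # placeholder babbling kinematics
activations = np.random.rand(1000, 3)    # placeholder babbling motor commands
inverse_map.fit(kinematics, activations)           # train ANN_0 on babbling data
a = inverse_map.predict(kinematics[:1])            # equation (3): a = ANN(q, q_dot, q_ddot)
```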

Task-based refinements

Motor babbling yields sample observations distributed across a wide range of dynamics, but still represents a sparse sampling of the range of state-dependent dynamic responses of the double pendulum (Fig. 6). As a result, this initial inverse map (ANN\(_0\); Fig. 2) can be further refined when provided with more task-specific data.

The higher-level control initiates the exploration phase using ANN\(_0\). However, with each exploration, the system is exposed to new, task-specific data, which are appended to the database and incorporated into the refined ANN\(_K\) map (Fig. 2). This refinement is achieved by using the current weights as the initial weights of the refined ANN and training it on the cumulative data after each attempt. A validation set is used to prevent over-fitting to the training data; training of a refinement stops if performance on the validation set deteriorates for six consecutive training epochs (the default setting of the toolbox used). The data used to train the ANN were randomly divided into training, test and validation sets with 70%, 15% and 15% ratios, respectively. It is important to note that refinements can update the map’s validity only up to a point; if major changes to the physical system occur (such as changing the tendon routings or the structure of the system), the network would probably need to be retrained on new babbling data. This could be performed manually, or a threshold on the feedforward error could be set to trigger rebabbling. However, we found that motor babbling performed strictly while the limb was suspended in the air nevertheless worked well for producing locomotion, which requires intermittent contact with the treadmill, and there was no need to rebabble in this study unless a motor, tendon cable or link was replaced.
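The refinement loop can be sketched as follows (a hypothetical approximation: scikit-learn’s warm start and early stopping stand in for the toolbox’s weight reuse and validation-based stopping).

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Warm-started ANN: each call to fit() continues from the current weights, and a
# held-out validation fraction stops training when it no longer improves.
ann = MLPRegressor(hidden_layer_sizes=(15,), activation='tanh', solver='adam',
                   warm_start=True, early_stopping=True,
                   validation_fraction=0.15, n_iter_no_change=6, max_iter=500)

X_all = np.random.rand(1000, 6)   # placeholder babbling kinematics
Y_all = np.random.rand(1000, 3)   # placeholder babbling activations
ann.fit(X_all, Y_all)             # initial map ANN_0

def refine(ann, X_all, Y_all, X_new, Y_new):
    """Append the latest attempt's data and refine the map (ANN_k -> ANN_k+1)."""
    X_all = np.vstack([X_all, X_new])
    Y_all = np.vstack([Y_all, Y_new])
    ann.fit(X_all, Y_all)         # warm_start=True continues from the current weights
    return ann, X_all, Y_all
```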

The reinforcement learning algorithm for the treadmill task

A two-phase reinforcement learning approach is used to systematically explore candidate kinematics, parametrized by a 10D feature vector, ultimately converging to a feature vector that yields high reward. Similar to the ideas used in refs. 56 and 57, we simplified the search by parametrizing the task as a 10-element feature vector that defines cyclical trajectories, avoiding the need for the RL agent to explore all possible time-varying sequences of motor activations (and their resulting kinematics). The goal of the policy-search RL here is to converge to a parameter vector that yields high reward (treadmill movement). The use of a lower-level control to learn the inverse map enabled us to use a policy-based, model-free RL with only 10 parameters (the feature vector). The system starts with an exploration phase (uniformly random parameter search) and, once the reward passes a certain threshold, the policy changes to a stochastic search based on a multivariate Gaussian distribution centred on the feature vector that yielded the highest reward so far (see Methods).

Please note that the ANN in the lower-level control only creates an inverse dynamical map from the joint kinematics to the motor activation values (and has no information about the treadmill reward). The RL agent perceives this inverse map simply as a part of the environment. Therefore, this method should not be confused with model-based RL algorithms, where the agent utilizes a model to find actions whose predicted reward is maximal.

Creating cyclic trajectories using feature vectors

At each step of the reinforcement algorithm, the policy must produce a candidate set of kinematics. We defined 10 equally distributed spokes (each 36° apart; see Fig. 1c) in the angle–angle space. Setting the length (distance from the centre) of each spoke then defines an arbitrary smooth, closed path in that space, which specifies the angle changes. The positioning of the spokes and centre is defined by the range of the babbling data. These 10 spoke lengths form the 10D feature vector. By interpolating these 10 locations, we obtain an angle–angle trajectory and differentiate those points (equally spaced in the time domain) to obtain the associated angular velocities and accelerations, which fully describe the joint kinematics in the time domain. Using the inverse map (lower-level control), these 6D target limb kinematics (\(q,\dot q,\ddot{q}\)) are mapped into the associated control sequences. The produced control sequences (motor activation values) are then replicated 20 times and fed to the motors to produce 20 back-to-back repetitions of the cyclical movement. Repeating the task 20 times allows us to smooth the effect of unexpected physical dynamics of the task (for example, system noise, unequal friction values over the treadmill band, nonlinearities of the system and so on), which might otherwise lead to fluctuations in reward. The features were bounded in the [0.1–1] range for the treadmill task and in the [0.2–0.8] range during the free cyclical movement experiments to provide more focused task-specific trajectories.
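As an illustration, the following sketch converts a 10D feature vector into 6D joint kinematics; the periodic cubic spline, cycle duration and sampling rate are assumptions, since the text above only specifies that the 10 points are interpolated and differentiated.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def features_to_kinematics(features, center, ranges, cycle_s=1.0, fs=78):
    """Map a 10D feature vector (spoke lengths) to a smooth, closed angle-angle
    trajectory and its time derivatives. 'center' and 'ranges' would come from
    the babbling data; cycle_s and fs are illustrative timing assumptions."""
    spokes = np.linspace(0, 2 * np.pi, len(features), endpoint=False)   # 36 deg apart
    pts = center + features[:, None] * ranges * np.column_stack([np.cos(spokes),
                                                                  np.sin(spokes)])
    pts = np.vstack([pts, pts[:1]])                       # close the loop
    t_knots = np.linspace(0, cycle_s, len(pts))
    spline = CubicSpline(t_knots, pts, bc_type='periodic')   # smooth closed path
    t = np.arange(0, cycle_s, 1.0 / fs)
    q = spline(t)                    # joint angles over one cycle
    q_dot = spline(t, 1)             # angular velocities
    q_ddot = spline(t, 2)            # angular accelerations
    return np.hstack([q, q_dot, q_ddot])     # 6D kinematics for the inverse map

kin = features_to_kinematics(np.random.uniform(0.1, 1, 10),
                             center=np.array([0.0, 0.0]),
                             ranges=np.array([1.0, 1.0]))
```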

Exploration phase

Exploring random attempts across the 10D feature vector space (uniform at random in [0.1–1]; equation (4)) will eventually produce solutions that yield a treadmill reward. Exploration continues until either the reward exceeds a predefined threshold or a maximal number of runs is surpassed (a failure).

Exploitation phase

Once the reward passes the threshold, the system selects a new feature vector from a 10D Gaussian distribution, with each dimension centred on the threshold-passing solution. Much like a Markov process, with each successful attempt the 10D distribution is re-centred on the values of the feature vector that yielded the best reward thus far. The standard deviation of these Gaussian distributions is inversely related to the reward (the distribution shrinks as the system receives more reward). The minimal standard deviation is bounded at 0.03. This mechanism helps the system converge to behaviours with higher reward and explore their vicinity in feature space (forming high-reward habits) within a reasonable time span, but without any guarantee of finding global optima. This is analogous to vertebrate learning behaviour, which can form efficient functional habits that may not be optimal. The governing equations used by the higher-level control to generate the next feature vector are described in equation (4):

$$\bar F = \left\{ {\begin{array}{*{20}{l}} {u(f_{\mathrm{m}},f_{\mathrm{M}})} & {R_{\mathrm{b}} < {\mathrm{reward}}\;{\mathrm{threshold}}} \\ {\max \left( {\min \left( {{\cal N}\left( {F_{\mathrm{b}},\Sigma (R_{\mathrm{b}})} \right),f_{\mathrm{M}}} \right),f_{\mathrm{m}}} \right)} & {{\mathrm{otherwise}}} \end{array}} \right.$$ (4)

where u and \({\cal N}\) are uniform and Gaussian distributions, respectively, \(\bar F\) is the feature vector of the next attempt, \(f_{\mathrm{m}}\) and \(f_{\mathrm{M}}\) are the minimum and maximum bounds for each feature in the feature vector, respectively (0.1 and 1 in this test), \(R\) is the reward, \(R_{\mathrm{b}}\) is the highest reward so far, \(F_{\mathrm{b}}\) is the feature vector that yielded \(R_{\mathrm{b}}\) and \(\Sigma (R_{\mathrm{b}})\) is described as

$$\Sigma (R_{\mathrm{b}}) = \sigma (R_{\mathrm{b}})I_{10}$$ (5)

where \(I_{10}\) is the 10 × 10 identity matrix and \(\sigma\) is defined as

$$\sigma \left( {R_{\mathrm{b}}} \right) = (b - R_{\mathrm{b}})/a$$ (6)

where a and b are scaling and bias constants, respectively. Here, we empirically selected values of a and b of 600 and 9,000, respectively (Table 1). Note that these values only change the standard deviation of the feature distribution, which affects the exploration–exploitation trade-off; we observed that the performance of the system is not very sensitive to these values (that is, the system will find an acceptable solution as long as reasonable values are set for them).
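A minimal sketch of this policy update (equations (4)–(6)) is given below; the reward threshold value used in the example is arbitrary.

```python
import numpy as np

F_MIN, F_MAX = 0.1, 1.0          # feature bounds (treadmill task)
A_SCALE, B_BIAS = 600.0, 9000.0  # empirically chosen constants a and b from equation (6)
SIGMA_MIN = 0.03                 # lower bound on the standard deviation

def next_feature_vector(best_features, best_reward, reward_threshold, rng):
    """Equation (4): uniform exploration until the reward threshold is passed,
    then Gaussian exploitation centred on the best feature vector so far."""
    if best_reward < reward_threshold:
        return rng.uniform(F_MIN, F_MAX, size=10)                # exploration phase
    sigma = max((B_BIAS - best_reward) / A_SCALE, SIGMA_MIN)     # equations (5) and (6)
    candidate = rng.normal(best_features, sigma)                 # isotropic 10D Gaussian
    return np.clip(candidate, F_MIN, F_MAX)                      # max(min(., f_M), f_m)

rng = np.random.default_rng(0)
features = next_feature_vector(np.full(10, 0.5), best_reward=2000,
                               reward_threshold=1500, rng=rng)   # illustrative values
```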

Table 1: Pseudocode for the RL algorithm

Between every attempt, the ANN’s weights are refined with the accumulated data set (from motor babbling and task-specific trajectories), regardless of the reward or reinforcement phase. This reflects the goal for our system to learn from every experience.

Simulations

We first prototyped our methods in simulation using a double-pendulum model of a tendon-driven limb (equations and code for the double pendulum simulation are adapted, with modifications, from ref. 58). Similar to the physical system, our method proved to be efficient in the simulation and yielded comparable results (Supplementary Figs. 2 and 3). These simulations were kept isolated from the physical implementation, and the results were never used as seeds for the physical implementation. It is important to note that, similar to any other modelling attempt, these simulations are simplified representations of the real physics. In addition, some values of the system are very challenging (if not impossible) to measure (for example, the moment arm value function), which is another reason why we think model-free approaches are absolutely necessary in this field. The simulations in this study are mainly designed to test the feasibility of the algorithm before testing it on the real system and are meant to only reflect the general structure of the system, so the parameters of these simulations are not fine-tuned to accurately mimic the physical system.

Physical system

We designed and built a planar robotic tendon-driven limb with two joints (a proximal joint at a fixed height and a distal joint) driven by three tendons, each actuated by a brushless d.c. motor. A passive hinged foot allowed natural contact with the ground. We used brushless d.c. motors because they have low mechanical resistance and are back-drivable. The motor assembly and proximal joint were housed in a carriage that could be lowered or raised to a set elevation so the foot could either reach the treadmill or hang freely in the air (Fig. 3).

We used the minimum number of tendons required to have full control of both joints (a minimum of n + 1 tendons are required, where n is the number of joints)42. Further considerations and part details can be found in the Supplementary Information.

Feasible wrench set and design validation

The feasible force set of a tendon-driven limb is defined by all possible output force vectors it can create. Equation (7) describes the static output wrench for a tendon-driven system42:

$$w = J(q)^{ - {\mathrm{T}}}M(q)F_0a$$ (7)

where w represents the wrench (forces and torques) output and J(q)−T is the Jacobian inverse transpose of the limb, which transforms net joint torques into endpoint wrenches.

By evaluating equation (7) for all binary combinations of the elements of a, the resulting wrenches define the feasible force set. It is important to preserve the physical capability of the tendon routing through the many iterations of limb design, so at each design phase we computed these sets for different positions throughout the limb’s propulsive stroke. Joint moment arms and tendon routings were simulated, and the limb was ultimately built, to have adequate endpoint torques and forces in all directions, which is important for versatility42. Many other effective designs (different tendon routings, different link lengths and so on) or design optimization techniques could be used and their performance in the tasks performed here evaluated; however, that is outside the scope of the current study.
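A sketch of this evaluation is shown below; the Jacobian, moment arm and maximal force matrices are made-up values standing in for a candidate design at one posture.

```python
import itertools
import numpy as np

def feasible_wrench_vertices(J, M, F0):
    """Evaluate equation (7), w = J^{-T} M(q) F0 a, at every binary activation
    vector a; the feasible force set is the convex hull of these wrenches."""
    J_inv_T = np.linalg.inv(J).T
    vertices = []
    for a in itertools.product([0, 1], repeat=F0.shape[0]):
        w = J_inv_T @ M @ F0 @ np.array(a, dtype=float)
        vertices.append(w)
    return np.array(vertices)

# Illustrative (made-up) posture-specific matrices for a two-joint, three-tendon limb
J = np.array([[0.25, 0.10], [0.05, 0.20]])                  # endpoint Jacobian (m)
M = np.array([[0.02, -0.02, 0.015], [0.0, 0.015, -0.015]])  # moment arms (m)
F0 = np.diag([40.0, 40.0, 40.0])                            # maximal tendon forces (N)
print(feasible_wrench_vertices(J, M, F0))
```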

Mechanical considerations

The carriage was attached to a wooden support structure via linear bearings and slide rails to adjust its vertical position. A clamp prevented sliding once the vertical position was set. Sandpaper was glued to the footpad and in strips across the treadmill to improve traction (Fig. 3a,b).

Data acquisition

The control system had to provide research-grade accuracy and consistent sampling to enable an effective hardware test of the G2P algorithm. A Raspberry Pi (Raspberry Pi Foundation) served as a dedicated control-loop operator: it issued commands to the motors, sensed the angles of the proximal and distal joints and recorded the treadmill band displacement (Fig. 3a,b). Furthermore, the electrical power consumption of each motor was measured at 500 Hz using current-sensing resistors in parallel with the motor drivers; we calculated the watt-hours over each intersample interval and report the amortized mean power (in watts) for the entire attempt. All commands were sent, and data received, as csv files via WiFi communication with the Raspberry Pi.
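The power computation can be sketched as follows (hypothetical post-processing; the supply voltage, the trapezoidal integration and the helper name are assumptions rather than the exact implementation).

```python
import numpy as np

def amortized_power(current_a, voltage_v, t_s):
    """Integrate instantaneous power over each intersample interval of the 500 Hz
    current samples and report the mean power (watts) for the whole attempt."""
    p = current_a * voltage_v                       # instantaneous power per sample (W)
    dt = np.diff(t_s)                               # intersample intervals (s)
    energy_j = np.sum(0.5 * (p[:-1] + p[1:]) * dt)  # trapezoidal energy (J)
    return energy_j / (t_s[-1] - t_s[0])            # amortized mean power (W)

t = np.arange(0, 10, 0.002)                         # ~500 Hz timestamps over a 10 s attempt
power = amortized_power(np.abs(np.random.randn(t.size)), 12.0, t)   # 12 V is illustrative
```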

Running the system

The limb was placed in a consistent starting posture before activations were run to minimize variance in the initial conditions of the physical system. To aid development, a live-streaming video feed was designed for real-time visualization on any computer on the network (Supplementary Video 1). A computer sent a control sequence to the Raspberry Pi and, after it was successfully run, received (1) the paired input-to-output data in csv format for iterative analysis or training, (2) the net distance (in mm) covered over the course of the entire action and (3) the amortized power the system consumed during the trial. Once the data had been collected, to calculate the kinematics used to train the inverse map, samples were first interpolated using their corresponding time labels to compensate for the non-uniform sampling rate of 78 ± 5 Hz. Prescribed activation trajectories were also served at this rate. The data acquisition pipeline was implemented in Python 3.6.
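A minimal sketch of this resampling step is given below; the numerical differentiation and the helper name are illustrative assumptions.

```python
import numpy as np

def resample_kinematics(t_raw, q_raw, fs=78):
    """Interpolate joint angles recorded with a non-uniform ~78 Hz sample rate onto a
    uniform time grid, then differentiate numerically to obtain velocities and
    accelerations for training the inverse map."""
    t_uniform = np.arange(t_raw[0], t_raw[-1], 1.0 / fs)
    q = np.column_stack([np.interp(t_uniform, t_raw, q_raw[:, j])
                         for j in range(q_raw.shape[1])])
    q_dot = np.gradient(q, t_uniform, axis=0)        # angular velocities
    q_ddot = np.gradient(q_dot, t_uniform, axis=0)   # angular accelerations
    return t_uniform, np.hstack([q, q_dot, q_ddot])

t_raw = np.cumsum(np.random.normal(1 / 78, 0.001, 780))   # jittered timestamps (~10 s)
t_u, kin = resample_kinematics(t_raw, np.random.rand(780, 2))
```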