Now, let’s understand the Mix & Match Architecture in Steps →

Policy Mixing

Policy mixing is done by explicit mixing of the constituent policies, for the sake of variance reduction: rather than sampling which policy acts at a given time, the control policy uses the probability mass function

πmm(a|s) = (1 − α) · π1(a|s) + α · π2(a|s)   (for K = 2),

so the mixture is computed in expectation instead of introducing extra sampling noise.
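As a minimal sketch of explicit mixing for K = 2 (the function name and toy action distributions here are illustrative, not from the paper):

```python
import numpy as np

def mix_policies(pi1_probs, pi2_probs, alpha):
    """Explicitly mix two action distributions: (1 - alpha) * pi_1 + alpha * pi_2.

    Using the weighted sum directly, instead of sampling which policy to
    follow, reduces the variance of the resulting gradient estimates.
    """
    return (1.0 - alpha) * np.asarray(pi1_probs) + alpha * np.asarray(pi2_probs)

# Early in training alpha is near 0, so pi_mm mostly follows the simple policy.
pi_mm = mix_policies([0.7, 0.2, 0.1], [0.2, 0.3, 0.5], alpha=0.1)
print(pi_mm, pi_mm.sum())  # still a valid distribution: probabilities sum to 1
```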

Knowledge Transfer

For simplicity, we consider the case of K = 2. Consider the problem of ensuring that the final policy (π2) matches the simpler policy (π1), while having access to samples from the control policy (πmm).


For simplicity, we define the M&M loss over trajectories directly, with states (s ∈ S) sampled from the control policy:

ℓmm(θ) = E_{s∼πmm}[(1 − α) · KL(π1(·|s) ‖ π2(·|s))]

The (1 − α) term is introduced so that the distillation cost disappears once we switch fully to π2 (i.e. when α = 1).
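A sketch of this loss for a single sampled state (function and argument names are mine; eps is a numerical-stability assumption):

```python
import numpy as np

def mm_distill_loss(pi1_probs, pi2_probs, alpha, eps=1e-8):
    """(1 - alpha) * KL(pi_1 || pi_2) for one state s sampled from pi_mm.

    The (1 - alpha) factor makes the distillation cost vanish as alpha -> 1,
    i.e. once the mixture has fully switched over to pi_2.
    """
    pi1 = np.asarray(pi1_probs)
    pi2 = np.asarray(pi2_probs)
    kl = np.sum(pi1 * (np.log(pi1 + eps) - np.log(pi2 + eps)))
    return (1.0 - alpha) * kl
```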

Adjusting α (alpha) through training

α is the mixing coefficient used in the probability mass function equation above (the first equation).


An important component of the proposed method is how to set the value of α through time. For simplicity, let us again consider the case of K = 2, where one needs just a single α (as c now comes from a Bernoulli distribution), which we treat as a function of time t.

Online hyperparameter tuning → Since α changes through time, one cannot use typical hyperparameter tuning techniques: the space of possible value sequences is exponential in the number of timesteps (α = (α(1), · · ·, α(T)) ∈ Δ^T_{k−1}, where Δ_k denotes the k-dimensional simplex). For instance, even a coarse grid of just 10 candidate values per timestep over T = 1000 timesteps yields 10^1000 possible schedules.

To solve this issue we use Population Based Training (PBT).

Population Based Training and M&M

Population Based Training (PBT) trains a population of agents in parallel in order to optimise hyperparameters through time; the agents periodically query each other to check how well they are doing relative to the others. Badly performing agents copy the weights (neural network parameters) of stronger agents and perform local modifications of their hyperparameters.

This ability of PBT to modify hyperparameters throughout a single training run makes it possible to discover powerful adaptive strategies, e.g. auto-tuned learning-rate annealing schedules.

This way, poorly performing agents are used to explore the hyperparameter space.

So, we need to define two functions →

eval → measures how strong the current agent is.

explore → defines how to perturb the hyperparameters.

Note: keep in mind that the PBT agents are the Mix & Match agents, each of which is already a mixture of constituent agents.
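Schematically, one PBT round over a population of Mix & Match agents might look as follows (the attribute names and the pairwise comparison rule are illustrative assumptions, not the paper's exact procedure):

```python
import copy
import random

def pbt_step(population, eval_fn, explore_fn):
    """One round of Population Based Training.

    Each agent is compared against a randomly chosen peer; if it is doing
    worse, it copies the peer's weights (exploit) and locally perturbs the
    hyperparameter alpha it inherited (explore).
    """
    scores = [eval_fn(agent) for agent in population]
    for i, agent in enumerate(population):
        j = random.choice([k for k in range(len(population)) if k != i])
        if scores[i] < scores[j]:
            agent.weights = copy.deepcopy(population[j].weights)  # exploit
            agent.alpha = explore_fn(population[j].alpha)         # explore
```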

Now, we use one of the two schemes below, depending on the characteristics of the problem we are interested in.

1. If the model is expected to gain performance by switching from the simple to the more complex model, then:

a) Provide eval with the performance (i.e. the reward over k episodes) of the mixed policy.

b) For the explore function for α, we randomly add or subtract a fixed value (truncating between 0 and 1). Thus, once there is a significant benefit to switching to the more complex model, PBT will do it automatically. (Both schemes' eval and explore functions are sketched after this list.)

2. Often we want to switch from an unconstrained architecture to some specific, heavily constrained one (where there may not be an obvious benefit in performance from switching).

When training a multitask policy from constituent single-task policies, we can make eval an independent evaluation job that only looks at the performance of the agent with αK = 1 (i.e. the pure final policy).

This way we directly optimise for the final performance of the model of interest, but at the cost of additional evaluations needed for PBT.
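A minimal sketch of plausible eval and explore functions for both schemes (run_episode, the step size 0.05, and k = 10 are hypothetical placeholders, not values from the paper):

```python
import random

def run_episode(agent, env):
    """Hypothetical helper: roll out one episode and return its total reward.
    Stubbed with a random value so the sketch is self-contained."""
    return random.random()

# Scheme 1: score the mixed policy itself.
def eval_mixed(agent, env, k=10):
    """eval -> average reward of the mixed policy over k episodes."""
    return sum(run_episode(agent, env) for _ in range(k)) / k

def explore_alpha(alpha, step=0.05):
    """explore -> randomly add or subtract a fixed value, truncated to [0, 1]."""
    return min(1.0, max(0.0, alpha + random.choice([-step, step])))

# Scheme 2: an independent evaluation job that scores the agent with
# alpha pinned to 1, i.e. the pure final (most complex) policy.
def eval_final(agent, env, k=10):
    saved_alpha = agent.alpha
    agent.alpha = 1.0
    try:
        return sum(run_episode(agent, env) for _ in range(k)) / k
    finally:
        agent.alpha = saved_alpha
```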