Each of those nine boxes is the PathNet at a different iteration. In this case, PathNet was trained on two different games using a Advantage Actor-critic or A3C. Although Pong and Alien seem very different at first, we observe a positive transfer learning using PathNet (take a look at the score graph).

How does it train

First of all, we need to define the modules. Let L be the number of layers and N be the maximum number of modules per layer (the paper indicates that N is typically 3 or 4). The last layer is dense and not shared between the different tasks. Using A3c, this last layer represents the value function and policy evaluation.

After defining those modules, P genotypes (=pathways) are generated in the network. Due to the asynchronous nature of A3c, multiple workers are spawned to evaluate each genotype. After T episodes, a worker selects a couple of other pathways to compare to, if any of those pathways have a better fitness, it adopts it and continues training with that new pathway. If not, the worker continues evaluating the fitness of its pathway.

Pathways are trained using stochastic gradient descent with back-propagation throughout a single path at a time. This guarantees a manageable training time.

Transfer learning

After learning a task, the network fixes all parameters on the optimal path. All other parameters must be reinitialized, otherwise PathNet will perform poorly on a new task.

Using A3C, the optimal path of the previous task is not modified by the back-propagation pass on the PathNet for a new task. This can be viewed as a safety net to not erase previous knowledge

Results