The ideas in this summary are taken from the TreeQN and ATreeC paper.

This work explores online planning in complex environments where the transition model is not known, so planning requires a transition model learned from data. To address these challenges, two architectures are proposed: TreeQN and ATreeC. TreeQN constructs a tree on the fly by recursively applying a learned transition model in latent space, then backs up reward and value estimates to produce Q-value estimates. ATreeC is structured similarly, except that the backed-up Q-estimates are passed through a softmax layer to parameterize a stochastic policy.
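The recursive expand-and-back-up idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the module parameters are random stand-ins for learned networks, the sizes (`STATE_DIM`, `N_ACTIONS`, `DEPTH`) are arbitrary, and a simple max backup is used in place of the paper's mixed backup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes, not the paper's hyperparameters.
STATE_DIM, N_ACTIONS, DEPTH, GAMMA = 8, 3, 2, 0.99

# Random stand-ins for the learned modules.
W_trans = rng.standard_normal((N_ACTIONS, STATE_DIM, STATE_DIM)) * 0.1
w_value = rng.standard_normal(STATE_DIM) * 0.1
w_reward = rng.standard_normal((N_ACTIONS, STATE_DIM)) * 0.1

def transition(z, a):
    """Predict the next latent state for action a (a learned model in the paper)."""
    return z + W_trans[a] @ z

def reward(z, a):
    return float(w_reward[a] @ z)

def value(z):
    return float(w_value @ z)

def q_tree(z, depth):
    """Recursively expand the latent tree and back up Q-estimates."""
    q = np.empty(N_ACTIONS)
    for a in range(N_ACTIONS):
        z_next = transition(z, a)
        if depth == 1:
            backed_up = value(z_next)                    # leaf: bootstrap with the value net
        else:
            backed_up = q_tree(z_next, depth - 1).max()  # simplified max backup
        q[a] = reward(z, a) + GAMMA * backed_up
    return q

z0 = rng.standard_normal(STATE_DIM)
q_values = q_tree(z0, DEPTH)  # one Q-estimate per action at the root
```

Because the whole tree is built from differentiable modules, the Q-estimates at the root can be trained end to end.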

Implicit vs Explicit models

In reinforcement learning, model-based agents are generally grounded explicitly in the environment. By this I mean the loss contains a term forcing the transition model to reconstruct the original frame (latent-space transition models would need an additional decoder for this).

Implicit transition models, by contrast, impose no constraint that the original frame be reconstructed. The benefit of dropping the reconstruction term is that the transition module can focus on predicting whichever future latent states help the agent receive maximal reward.

A simple example is a frame containing a bullet travelling towards the agent. The bullet occupies very few pixels but matters greatly for the agent's survival. An explicit model (with the frame-reconstruction constraint) would spend capacity learning to recreate the background and other features that take up most of the frame, whereas an implicit model (no reconstruction constraint) would focus only on capturing the features that affect reward, such as the small bullet.
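The difference between the two training objectives can be made concrete. A minimal sketch, assuming a simple MSE form for both terms and an illustrative weight `beta` (neither is taken from the paper):

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def explicit_loss(q_pred, q_target, frame_recon, frame_true, beta=1.0):
    """Explicit model: the Q-loss plus a frame-reconstruction term.
    Gradients push the latent state to encode everything in the frame,
    so reconstruction error on the large background dominates."""
    return mse(q_pred, q_target) + beta * mse(frame_recon, frame_true)

def implicit_loss(q_pred, q_target):
    """Implicit model: only the reward-relevant objective.
    The latent state is free to ignore reward-irrelevant pixels."""
    return mse(q_pred, q_target)
```

In the bullet example, a few mispredicted bullet pixels contribute almost nothing to the reconstruction term, so the explicit objective barely penalizes losing the one feature that determines survival.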

In both TreeQN and ATreeC, the implicit model was found to achieve better performance.

TreeQN

TreeQN planning

Mean squared error between the target Q-value and the predicted Q-value was used as the loss. Auxiliary losses were placed on the subtree reward estimates. Auxiliary losses on the subtree latent-state predictions from the transition model and on the subtree value estimates were found to give suboptimal results. Each auxiliary loss is scaled by a constant.

A small 2-layer neural network was used to approximate the reward given a latent state s_t and action a_t. A single-layer neural network approximates the value v_t of a latent state s_t. A 2-layer neural network predicts the delta between the current latent state s_t and s_{t+1}. The incoming latent state s_t is rescaled by its L2 norm before being sent to the transition model; this prevents the state representation from growing or shrinking as we go deeper in the tree.
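These three modules can be sketched as plain numpy functions. The hidden width, latent dimension, and action count below are made up for illustration, and the parameters are random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(1)
D, A = 8, 3  # latent dimension and number of actions (illustrative sizes)

def mlp_params(sizes):
    """Random weights/biases for a stack of linear layers."""
    return [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
            for n, m in zip(sizes, sizes[1:])]

def mlp(params, x):
    for i, (W, b) in enumerate(params):
        x = W @ x + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU between layers
    return x

reward_net = mlp_params([D, 16, A])                     # 2 layers: one reward per action
value_net = mlp_params([D, 1])                          # 1 layer: scalar value
trans_net = [mlp_params([D, 16, D]) for _ in range(A)]  # 2-layer delta, per action

def transition(z, a):
    z = z / np.linalg.norm(z)        # rescale by the L2 norm before the transition model
    return z + mlp(trans_net[a], z)  # predict the delta and add it residually

z = rng.standard_normal(D)
z_next = transition(z, 0)
r_all = mlp(reward_net, z)   # reward estimates for every action
v = mlp(value_net, z)[0]     # scalar value estimate
```

Predicting the delta rather than s_{t+1} directly keeps the transition close to identity at initialization, which, combined with the norm rescaling, keeps representations stable across tree depth.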

Tree backup is performed by mixing the Q-estimates at each layer of the tree via the function b(x), combined with TD(λ), both defined below.
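Pending the exact definitions, one plausible form of this backup can be sketched as follows. Here I assume b(x) is a softmax-weighted average of the child Q-estimates and that the TD(λ)-style mix interpolates between the next state's value estimate and the mixed subtree backup; the constants `GAMMA` and `LAM` are illustrative:

```python
import numpy as np

GAMMA, LAM = 0.99, 0.8  # discount and mixing coefficient (illustrative values)

def b(x):
    """Softmax-weighted mix of the child Q-estimates (one plausible form of b)."""
    w = np.exp(x - x.max())
    w /= w.sum()
    return float(w @ x)

def backup(r, v_next, q_next):
    """Back up one tree level: blend the value estimate at the next state
    with the mixed Q-estimates from the subtree below it, TD(lambda) style."""
    return r + GAMMA * ((1 - LAM) * v_next + LAM * b(q_next))

q = backup(r=1.0, v_next=0.5, q_next=np.array([0.2, 0.9, 0.4]))
```

The soft mixing keeps the backup differentiable, unlike a hard max, which is what allows the whole tree to be trained end to end.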