Why this paper?

The first successful, scalable combination of reinforcement learning and deep learning. The result outperforms preceding approaches on Atari games. It uses only pixel data, game scores, and the number of available actions, with the same architecture across different games.

Why reinforcement learning? It is how animals and humans seem to make decisions in their environments, as evidenced by parallels between neural recordings and temporal difference RL algorithms.

What about previous approaches?

Earlier approaches relied on handcrafted features. When a nonlinear function approximator is used for Q, learning is unstable. Stable neural network approaches did exist, such as neural fitted Q-iteration, but they are slow and do not scale to large networks.

What are the outcomes?

Tested the method against the best-performing approaches at the time and a professional game tester. Used 49 different Atari games, since they cover a wide variety of tasks (again, the goal is not to learn to play games but to do well across environments, which is why the variety matters). The method achieved more than 75% of the human score in more than half of the games and outperformed previous approaches by a margin in almost all of them. Some noticeable things: temporal strategies were learned (Breakout), although the agent still seems to struggle to learn long temporal strategies (Montezuma’s Revenge).
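The "percentage of human score" figure is a normalized score: the agent's score is placed on a scale where random play is 0% and the human tester is 100%. A minimal sketch of that normalization follows; the function name and the example scores are made up for illustration and are not results from the paper.

```python
def human_normalized_score(agent: float, random_play: float, human: float) -> float:
    """Agent score as a percentage of the gap between random play and the human tester."""
    if human == random_play:
        raise ValueError("human and random scores must differ")
    return 100.0 * (agent - random_play) / (human - random_play)


# Made-up scores purely for illustration (not taken from the paper).
example = human_normalized_score(agent=80.0, random_play=20.0, human=100.0)
print(example)
```

A value above 100 means the agent scored higher than the human tester on that game; a negative value means it did worse than random play.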

How?

Store the agent’s previous experience in a dataset: D_t = {e_1, …, e_t}, where e_t = (s_t, a_t, r_t, s_{t+1}). Update the Q-function estimates using randomly drawn samples (or minibatches of samples) of experience, minimizing a loss function.

As a result the method worked, as indicated by two indices of learning: predicted Q-values (meaning that the algorithm believes it can get a better average total reward) and score per episode (meaning that the algorithm actually gets a better average total reward).
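The storage and sampling step can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the `terminal` flag, the class and function names, and the squared-error form of the loss are assumptions of this sketch, and `next_q_values` stands in for the output of the separate target network.

```python
import random
from collections import deque
from typing import List, NamedTuple, Sequence


class Transition(NamedTuple):
    """One experience e_t = (s_t, a_t, r_t, s_{t+1}); `terminal` is an added assumption."""
    state: tuple
    action: int
    reward: float
    next_state: tuple
    terminal: bool


class ReplayMemory:
    """Fixed-size store that drops the oldest transition once it is full."""

    def __init__(self, capacity: int) -> None:
        self.buffer = deque(maxlen=capacity)

    def store(self, transition: Transition) -> None:
        self.buffer.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        # Uniform sampling without replacement breaks the correlation
        # between consecutive transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self) -> int:
        return len(self.buffer)


def td_target(reward: float, next_q_values: Sequence[float],
              terminal: bool, gamma: float = 0.99) -> float:
    """One-step Q-learning target: r + gamma * max_a' Q_target(s', a')."""
    if terminal:
        return reward
    return reward + gamma * max(next_q_values)


def squared_td_error(q_value: float, target: float) -> float:
    """Per-sample loss between the current estimate Q(s, a) and the target."""
    return (target - q_value) ** 2
```

In a full agent, the gradient of this error with respect to the network weights would be averaged over the sampled minibatch; that part is omitted here.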

Technical Stuff

Demonstrated in experiments that each component of the method is important: the replay memory, the separate target Q network, and the deep convolutional network architecture. By examining the final layers of the network for given input pixels, found that similar game states (perceptually and in terms of reward) were close together in the t-SNE representation. Additionally, those representations make sense not only for states that the agent led itself into, but also for states that human players got into.

Technical Training How?

Compress the input frames from 210 x 160 with a 128-color palette to 84 x 84. To remove flickering, first take the maximum of each pixel's color value over the current and the previous frame (some objects appear only in odd frames and others only in even frames), then extract the luminance channel and rescale. Use the m most recent frames as input to Algorithm 1 (m = 4, but the method is robust to changes, e.g. m = 3 or m = 5).

There are |a| outputs, each representing the Q-value Q(s, a) for the input state s. Advantage: a single forward pass gives Q(s, a) for all a. The network has 3 convolutional layers and 2 fully connected layers; each hidden layer is followed by a rectifier nonlinearity, and the output layer is linear.

Used the most recent 1 million frames as replay memory, equivalent to about 4.6 hours of play. This means that as the agent gets better it forgets the bad transitions (or the other way around: if it gets worse, it remembers the bad ones).

Used a frame-skipping technique: the agent selects an action on every k-th frame and that action is repeated on the k-1 skipped frames; k = 4.

A different network is trained for each game, but the architecture and hyperparameters are the same. One caveat: rewards during training were clipped at -1 for negative scores and at 1 for positive ones. “Clipping the rewards in this manner limits the scale of the error derivatives and makes it easier to use the same learning rate across multiple games.”
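A few of these training steps are simple enough to sketch directly. The code below is a simplified illustration, not the paper's code: frames are plain lists of rows with one intensity per pixel (the real frames carry color values and are later downscaled, which is skipped here), and all names are made up for this sketch. The 60 Hz frame rate used for the memory arithmetic is an assumption consistent with the 4.6 hour figure.

```python
from collections import deque
from typing import List

Frame = List[List[int]]


def max_over_frames(current: Frame, previous: Frame) -> Frame:
    """Pixel-wise maximum of two equally sized frames, to remove flickering."""
    return [
        [max(c, p) for c, p in zip(cur_row, prev_row)]
        for cur_row, prev_row in zip(current, previous)
    ]


def clip_reward(reward: float) -> float:
    """Clamp a reward to [-1, 1]; for integer game scores this gives -1, 0 or 1."""
    return max(-1.0, min(1.0, float(reward)))


class FrameStack:
    """Keeps the m most recent preprocessed frames as the network input."""

    def __init__(self, m: int = 4) -> None:
        self.frames = deque(maxlen=m)

    def push(self, frame: Frame) -> None:
        self.frames.append(frame)

    def state(self) -> List[Frame]:
        return list(self.frames)


# Replay memory length in hours, assuming the emulator runs at 60 frames per second.
REPLAY_FRAMES = 1_000_000
FRAME_RATE_HZ = 60
REPLAY_HOURS = REPLAY_FRAMES / FRAME_RATE_HZ / 3600
```

With these assumptions `REPLAY_HOURS` comes out at roughly 4.6, matching the figure quoted above.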

Hyperparameters: m, k (frame skipping), network (number of filters, strides, filter sizes, etc.). Results for other methods were already available for those 49 games.
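To make the network hyperparameters concrete, the sketch below traces how an 84 x 84 input shrinks through the three convolutional layers. The layer settings (32 filters of 8 x 8 with stride 4, 64 of 4 x 4 with stride 2, 64 of 3 x 3 with stride 1) are quoted from the paper's Methods section rather than from these notes, and the absence of padding is an assumption of this sketch.

```python
from typing import List, Tuple

# (number of filters, kernel size, stride) for each convolutional layer.
CONV_LAYERS: List[Tuple[int, int, int]] = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]


def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1


def feature_map_shapes(input_size: int = 84) -> List[Tuple[int, int, int]]:
    """Return (channels, height, width) after each convolutional layer."""
    shapes = []
    size = input_size
    for filters, kernel, stride in CONV_LAYERS:
        size = conv_output_size(size, kernel, stride)
        shapes.append((filters, size, size))
    return shapes


shapes = feature_map_shapes()
channels, height, width = shapes[-1]
flattened_features = channels * height * width  # input size of the first fully connected layer
print(shapes, flattened_features)
```

Under these assumptions the spatial size goes 84, 20, 9, 7, so the first fully connected layer sees 64 x 7 x 7 = 3136 features.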

Technical Evaluation How?

Played each game 30 times, for up to 5 min each time, with different initial random conditions. The random-agent baseline chose random actions at a frequency of 10 Hz, which is about the fastest a human can press the fire button. Also tried 60 Hz, which changed the performance measure very little.
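The random baseline can be sketched as an agent that picks a new random action at a fixed rate and holds it in between. The 60 Hz frame rate and the repeat-last-action behavior on intervening frames are assumptions of this sketch, and the function names are made up for illustration.

```python
import random
from typing import List

FRAME_RATE_HZ = 60  # assumed emulator frame rate


def frames_per_decision(decision_rate_hz: int, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Number of frames between two action choices."""
    return frame_rate_hz // decision_rate_hz


def random_agent_actions(num_frames: int, num_actions: int,
                         decision_rate_hz: int, rng: random.Random) -> List[int]:
    """Action taken on each frame by a random agent deciding at the given rate."""
    period = frames_per_decision(decision_rate_hz)
    actions = []
    current = 0
    for frame in range(num_frames):
        if frame % period == 0:
            # Time to decide: draw a fresh action uniformly at random.
            current = rng.randrange(num_actions)
        actions.append(current)  # otherwise keep repeating the last action
    return actions
```

At 10 Hz the agent decides on every sixth frame; at 60 Hz it decides on every frame.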

Link to paper: Human-level control through deep reinforcement learning (Nature)