Each unit has 4 data structures for input and output. Let be the FF input to unit at iteration , a vector of real numbers in the interval . In higher layers is a mass function, but in the lowest layer any values in this range can be provided. Let be the FF output, a matrix containing a normalized likelihood function over possible classifications of the input within . For the FB pass, let be a matrix of equal dimension to containing a probability mass function over predicted future classification-states in unit . Similarly, let be the FB output from , a vector of equal size to the input vector containing a prediction of future input to unit . To form a hierarchy, FF outputs from multiple lower units are concatenated and presented to higher unit[s]. Conversion from matrix to vector is not important as the classifiers (see below) assume all input dimensions are independent.

Messages between units (U) in different layers are relayed via “reward correlator” components (RC). FF messages (blue arrows) represent classifications of the current state of the agent in the world; these are correlated with objective internal measures of agent state (reward). The same reward value is provided to every RC; the hierarchy is tasked with modelling the separate external causes of changes in reward. FB messages are “predictions” of future agent-world state (red arrows). Biased messages are produced by RC components, making the hierarchy more likely to “predict” states in which it performs actions correlated with high reward. Sensor data is concatenated with motor output to form the interface to the MPF hierarchy. The FB output of an MPF unit is of the same form as its FF input. Different data may be presented to each unit at the bottom of the hierarchy. Sensor inputs and motor outputs may be mixed within one unit or interfaced to different units.

We operate our hierarchy iteratively. Each iteration includes a feed-forward (FF) pass of every unit, followed by a feed-back (FB) pass of every unit. The units are traversed in such an order that all units at the lowest level are FF prior to any unit at the next higher level (breadth-first or level-order traversal). On the FB pass, the units at the highest level are FB prior to any “lower” units, until the lowest level is reached. Each iteration therefore consists of a FF and a FB pass of the entire hierarchy ( figure 2 ). Although synchronous operation of the hierarchy is not biologically realistic, it should not affect the results of the algorithm described below.

The different problems of discrete and continuous outputs (action spaces) are discussed in the Reinforcement Learning literature. Many RL algorithms (such as Q-learning and SARSA) cannot handle continuous action spaces. However, approximately optimal continuous outputs can be learnt by methods such as CACLA (Continuous Actor-Critic Learning Automaton) [26] . The RL literature does include Monte-Carlo methods to explore the space of possible actions (policies) [28] , similar to our approach for discrete outputs.

If continuous motor outputs are desired, values in can be used without further processing. Discrete outputs are more problematic because learning within the MPF unit (within a SOM in this paper) will cause a feedback loop, pulling motor outputs towards intermediate values. Instead, discrete outputs can be produced by sampling from a multinomial of possible actions with probabilities . should represent the action actually chosen from . Therefore let if action k was chosen, and otherwise.

We want the MPF to generate behaviour directly. If then . Given the behaviour of the MPF, will be a prediction of and will be a prediction/suggestion of motor commands at ; i.e. when trained, .

However, in an iterative artificial adaptive agent , meaning that the state is comprised of current sensor values and consequent actions taken. The agent must learn which action to choose given that it is in a particular world/self state, so we must store this combination together. Imagine the Markov-Graph of this model; we are encoding the current vertex and outbound edges, rather than the current vertex and inbound edges (learning how we got into a nasty situation is not as directly useful as learning how to get out of it!). This is similar to the state-action pairing seen in SARSA.

Since in MPF the FF and FB data structures are of equal size, agent sensor values and motor commands must be present in both and . If is a vector of current values from the agent's sensors, and is a vector of values corresponding to motor commands, then . In other words, the input/output state to the MPF's lowest level is a concatenation of sensor values and motor commands.

Many reinforcement learning algorithms - such as Q-learning [3] and SARSA [33] - model the effect of [state,action] pairs on reward. The state contains both external and internal measurements from the agent in its world. Actions are generated by the agent. The expected reward of performing actions when in state is the “Quality” of the pair, typically denoted .

In the FB pass we wish to modify message passed from higher unit[s] to unit . Since is a probability mass function, we wish to increase the value (mass) of matrix elements associated with increase in reward, and reduce elements associated with decreases in reward. This can be done with the following formulas, in which is a normalizing factor ensuring constant mass and is a global scalar parameter determining the maximum influence of adaptive bias . Note that matrix is a nonlinear function of the correlation of unit states with reward, to ensure that weak correlations are rapidly tested and either strengthened or depleted. is a scalar constant corresponding to the uniform mass value i.e. if are the dimensions of . is the modified mass function: (5) (6) (7)

This formulation arises because we only want to change the correlation for active elements in and the influence of on any element should depend on the probability that SOM model represents the state that caused . ensures that the correlation never changes too quickly, forgetting historic values. If events happen comparatively quickly compared to the rate of iterating the hierarchy, a delay of at least 1 iteration should be applied to the correlating formula as shown above, although should be relayed without delay to higher units. In more sophisticated implementations, integrals of over time should be correlated with reward.

For every unit , if is a correlation matrix of equal dimension to and is a scalar learning-rate parameter (gradually decreased over time), then we define a temporary matrix to correlate: (3) (4)

The FF pass through the hierarchy should classify the current state as accurately as possible. The purpose of the FB pass is to generate predictions, and as a result, behaviour. We choose to modify messages between units in the FB pass, causing the MPF to preferentially “predict” states where its output causes actions associated with higher reward. More specifically, in the FF pass we correlate matrix with scalar and in the FB pass we modify matrix . Since there is a feedback loop within each MPF unit (detailed below), an alternative arrangement would be to correlate with and modify prior to relaying it to higher unit[s].

In the FF pass, delayed lower unit output is correlated with reward . The FF message is then relayed, unaltered. Correlations are stored in matrix . In the FB pass, higher unit messages are modified to bias them towards states correlated with high reward. The modified message is then relayed to lower units.

In this paper we suggest that correlation of activity patterns with reward values could occur between layers of the hierarchy. We posit a “reward correlator” component that relays messages between units in different layers, i.e. matrices and are inputs and outputs of a reward correlator above in FF and FB passes respectively ( figure 3 ).

Since we have defined that data inside the MPF hierarchy includes representations both of [sensed world-state] and [agent motor-actions], it should be possible to correlate activity patterns within the hierarchy with the reward values that result from the agent taking specific actions in specific situations. While it is necessary that concepts with appropriate abstractions and invariances exist somewhere in the hierarchy, it is not desirable to have to define where, before learning. We also wish to preserve the homogeneity of the MPF, therefore it must be possible to add the adaptive components throughout the hierarchy without negative effects.

In this paper we rely on the existence of an arbitrarily deep hierarchy with increasing temporal pooling, to avoid the need to consider discounted future rewards. We assume that for any event with delayed reward there will exist a level in the hierarchy that remains constant for the duration of the event-reward interval. For example, a state in the hierarchy corresponding to a high-level plan such as “walk the dog” could be active for long enough for all relevant rewards to be integrated, despite the existence of other transient plans during this period. This is unlikely to be an ideal approach and in future work we will investigate the use of discounted future reward.

This differs significantly from conventional reinforcement learning where a “discount” factor allows future reward signals to be considered when evaluating state-action pairs [4] . Rewards further in the future have less influence, and are therefore said to be “discounted”. Many RL algorithms iteratively propagate discounted rewards backwards in time towards the events that caused them.

Since the agent should be highly motivated to improve a bad situation, it is more useful to maximize the first derivative of reward. We also wish to measure changes in reward over a period of time, because the delays between actions and their consequences are varying and unknown. However, it is more likely that recent actions are responsible for changes in reward. Over a few iterations, this can be approximated simply as an exponentially-weighted moving average: (1) (2) is the maximum possible absolute derivative of reward per iteration, if known. The parameter is determines the influence of historic reward signals. If consequences of actions may take some time to be reflected in , the value of should instead be computed over a window of time. The period should be increased for units at higher levels in the hierarchy, whose state changes very slowly.

The adaptive-MPF system uses reinforcement learning rather than supervised learning because we do not wish to provide “correct” outputs for every conceivable situation. Instead, we wish to measure impacts of external causes on properties of the agent, such as pain or hunger. In this paper we will use the simplest possible reward function, providing a single varying scalar value such that . We use the term “reward” to imply that this function should be maximized. We will expect the MPF to build a model of the world that is capable of understanding the causes of changes in reward. By combining all possible definitions of things good and bad within a single scalar, it becomes much harder for the MPF to learn the separate causes of high and low reward.

SOM-MPF Implementation

Thus far we have described several additions to the Memory-Prediction Framework to enable it to be used as a complete control system for an adaptive intelligent agent. In summary, we have reformulated the input/output data to include sensors and actuators, and are modifying FB messages between hierarchy layers, by correlating delayed FF messages with smoothed reward signals (figure 2).

MPF Unit Structure. Although the above modifications should be compatible with various implementations of MPF (including HTM), we will describe our implementation of the MPF unit. Each unit performs spatial pooling, sequence learning plus prediction, and temporal pooling. We will discuss each of these components (figure 4). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 4. Internal structure of a SOM-MPF unit. In the FF pass, SOM and RSOM components perform spatial and temporal pooling (compression) by classification of in terms of the finite sets of models and . FF SOM classification output is biased by previous prediction resulting in . is the FF input to the RSOM temporal pooler. RSOM FF output is used as unit FF output . Between the poolers, an internal loop estimates unit state in terms of SOM models, using both and FB predictions from higher units via the RSOM. Internal predictor output is combined with RSOM FB output to give the bias matrix . Predictions from higher units indicate the current sequence, as in HTM; predictions within the unit allow position within sequences to be tracked also. In the FB pass, roulette selections from and the combined PMF are used to reverse the RSOM and SOM transformations, giving unit FB output . https://doi.org/10.1371/journal.pone.0029264.g004 Our MPF unit is based on the Kohonen Self-Organising Map (SOM) [34], and unlike some HTM solutions, is capable of online learning. The SOM is a biologically-inspired artificial neural network used for unsupervised classification and dimensionality reduction. Others have previously used SOMs to build MPF-like hierarchies, such as Miller and Lommel's Hierarchical Quilted SOM (HQSOM) [35]. Pinto [36] extended this to a complete MPF implementation. Other SOM variants could equally be used. The chief innovation in Miller and Lommel [35] is use of a “Recurrent”-SOM (RSOM) that can perform temporal pooling (clustering) by allowing current classification to be affected by previous classifications. Therefore, a SOM-RSOM pair can perform both spatial (SOM) and temporal (RSOM) pooling, as described in the MPF.

Feed-Forward Pass: Spatial and Temporal Pooling. The SOM consists of two matrices and . is an matrix of models of the input vector such that given elements in , the dimensions of are . and are parameters that determine the number of models the SOM will contain. In this case the SOM has a 2-d topology, which is usually sufficient but cannot optimally represent all data. has size and represents the likelihood of observing the SOM model given the evidence . Each SOM model represents a possible configuration of and the models in the SOM learn to maximize their coverage of the input space observed in over time. Since the SOM has been thoroughly discussed in many works, the reader should consult e.g. [35] or [36] for detailed SOM weight update equations. For our purposes we define the likelihood function as the inverse of normalized sum of squared error, giving matrices and : (8) (9) These equations produce a very smooth result, with significant responses from many models within the SOM. This is desirable because we wish to bias the FF classification result using a matrix that was produced in the previous FB pass. is a probability mass function representing a (biased) prediction of , the spatial pooler classification: (10) is a normalizing constant such that becomes a probability mass function over the classification-states represented by the spatial pooling SOM models. The superscript ‘s’ indicates that this is the FF output of the spatial pooler in unit . According to MPF, the FF output of the spatial pooler (SOM) should be the FF input to the temporal pooler (RSOM). Since the RSOM and SOM treat all input dimensions independently, we can rearrange the SOM output matrix to become a vector of elements. However, as discussed in [35], the RSOM input should be highly orthogonal. This can be achieved by setting the maximum value in to 1 and others to zero. For other details of the RSOM, see [35]. The FF output of the unit would typically be the FF output of the temporal pooling RSOM: (11)but in hierarchy layers where a lot of spatial compression is required (e.g. in the visual cortex) the temporal pooler can be omitted. In this case the unit FF output is taken from the spatial pooling SOM: (12) In [37] it is noted that in higher layers of the hierarchy, there is little or no advantage to further spatial pooling. This is believed to be represented in biology by the absence of neocortical layer 4 [37]. To reproduce this effect, in these units the spatial pooling SOM can be omitted. The classification process in the RSOM is similar to the SOM but functions as a “leaky-integrator” so that classification outcome changes slowly. A matrix of equal dimension to is needed: (13) (14) It is not necessary to bias the RSOM classification result, either using a prediction or for adaptive purposes. This asymmetry is because in the FF pass, active RSOM sequences become spatial patterns in a higher unit, where they can be predicted. In the FB pass, adaptive selection between RSOM sequences translates into preference for sequences containing better spatial patterns. Hence: (15) is therefore a normalized likelihood function if an RSOM is used, or a probability mass function otherwise.

Prediction and Sequence Learning. The SOM-MPF units used in our experiments include either first-order or variable-order Markov prediction. The MPF framework does not require a prediction feature, as temporal pooling generates predictions of proximate future and past states in the FB pass. However, our prediction module predicts only future states, which reduces uncertainty within the system. It also allows units to track position within sequences. For prediction and sequence learning, an MPF unit should do three things: identify the set of observed temporal sequences, classify the current temporal sequence, and predict future sequences. Both first-order and variable order variants of HTM have been developed. The benefit of variable-order prediction can easily be illustrated: A 2nd (or higher) -order model can distinguish between in sequences and , whereas a 1st order model cannot. In [17], Hawkins et al use a Variable-order Markov Model (VMM) to implement the temporal pooling stage of HTM. However, they note that even with a VMM, the hierarchy must be used to distinguish between longer intersecting sequences. (The hierarchy allows assembly of longer sequences from shorter ones). The difference is flexibility and efficiency; a VMM-hierarchy can distinguish longer sequences using fewer layers. In our SOM-MPF implementation we present a first-order Markov Model to predict future classification outcomes. Later, we also show how using the biologically-inspired technique described in [17], we can adjust the first-order model to behave as if it were a variable-order model.

First-Order Prediction. The input to the prediction module is and the output is a matrix of the same size. Both are probability mass functions. is a prediction of . is generated from a matrix of size (i.e. each model in the SOM is treated independently and regardless of SOM topology). is updated using and and approximates the conditional probability of SOM model being active at given that model is active at time . The sum of each column of is normalized to 1. (16) (17)where is the learning rate (typically 0.99 initially and reduced to around 0.01 over time) and: (18) (19) Equations 17, 18 and 19 increment the conditional probabilities in if is observed to decrease while is increasing (a transition between and ). Since is a probability mass function, a reduction in mass at is interpreted as the exiting of state . Similarly, an increase in mass at represents entering state . These equations are best understood as approximating transition probabilities by computing the relative frequency of transitions between states. The relative frequency of an event becomes closer to the probability of an event as the number of trials increases. However, in this case the approximation is biased towards recent events by . Since the underlying system is continually changing (due to SOM learning), frequency-based approximation biased towards recent data is simple and effective. A first-order prediction can be obtained from and by: (20) Matrices and are treated as vectors in equation 20. is a normalizing constant giving a total mass of 1: (21)

Feedback Pass. In MPF, the purpose of the FB pass is to generate a prediction of the next FF pass. MPF proposes that FF classification be combined with FB prediction, yielding more accurate models of the world given noisy, indeterminate or insufficient data. Feedback also improves state estimation between sibling units, via higher units. In this paper, the purpose of the FB pass is also to generate adaptive behaviour. The FB pass through the unit starts with , a prediction from higher units in the hierarchy. (For the highest layer can be a uniform distribution). is a probability mass function over the set of models in the RSOM. “Roulette” selection is used to select an element from (i.e. the probability of the selection of any element is proportional to the value of ). The corresponding weights from this RSOM model are then copied to a matrix . The use of Roulette selection for the selection of a FB model from the (R)SOMs is unique to this paper and has some useful properties. If there are multiple modes in the FB pass will test them individually, until one fits. When a mode in accurately predicts reality, it will rapidly be reinforced by the feedback loops within the unit and hierarchy, and the mass of the other modes will decrease. More importantly, there is no guarantee that interpolating between the models in the SOM generates viable patterns, therefore a weighted-sum of the models in is not effective given high variance modes or a multimodal case (in practice, multi-modal distributions are quite common). Using clustering techniques to find a single mode in is more expensive and in our experiments gave no noticeable improvement. Over time, the series of selections from can be interpreted as a probability mass function because the normalized likelihood functions represented by the SOM models are conditioned on the distributions in . The kurtosis of the distribution in balances the conflicting demands of exploration and exploitation; if the distribution is flat chosen actions will be more random (i.e. exploratory). We wish to generate all behaviour within the MPF hierarchy and not require any external module to help control the agent. However, the MPF must explore the gamut of possible action-sequences, motor outputs etc. and learn their consequences. This objective is achieved both by using Roulette selection of individual SOM models, and by adding random noise to the models in the FB pass. Let represent the roulette-selected model from the RSOM. To add noise: (22) The magnitude of the noise is scaled by , a parameter that should initially be 1 and decreased over time to a low value ( ). All results are clamped to unit range. The schedule for reducing noise magnitude should consider the location of the unit within the hierarchy; higher units inputs' are not well defined until lower units have learned. is a uniformly distributed random value from the interval , as produced by most software random number generators. is a mass function of the same random variables as . They are both predictions of the outcome of the next FF classification from the SOM. We combine them using the element-wise product and add a small uniform mass to every element, giving us the bias matrix : (23) will be used in iteration . The uniform mass serves to introduce some plasticity and uncertainty into the system even when it has modelled predictable data very accurately. It also prevents numerical instability when predictions from higher layers do not agree with predictions within the unit, or when the final bias does not agree with observed reality. There is a fundamental conflict between the objectives of accurate prediction and adaptive bias; by definition, adaptive bias disrupts - damages - the prediction process. It is also important that biased classification in the FF pass does not become locked into an internal loop, ignoring observations from below. The final step in the FB pass is to transform into so that the message can be passed down the hierarchy. This is achieved by using “Roulette” selection to pick a SOM model from and adding noise: (24)