In this section, we review the model and inversion scheme of the previous section in light of established procedures for supervised and self-supervised learning. This section considers HDMs from the pragmatic point of view of statistics and machine learning, where the data are empirical and arrive as discrete data sequences. In the next section, we revisit these models and their inversion from the point of view of the brain, where the data are sensory and continuous. This section aims to establish the generality of HDMs by showing that many well-known approaches to data can be cast as inverting an HDM under simplifying assumptions. It recapitulates the unifying perspective of Roweis and Ghahramani [20] with a special focus on hierarchical models and the triple estimation problems DEM can solve. We start with supervised learning and then move to unsupervised schemes. Supervised schemes are called for when causes are known but the parameters are not. Conversely, the parameters may be known and we may want to estimate causes or hidden states. This leads to a distinction between identification of a model's parameters and estimation of its states. When neither the states nor the parameters are known, the learning is unsupervised. We will consider models in which the parameters are unknown, the states are unknown or both are unknown. Within each class, we will start with static models and then consider dynamic models.

All the schemes described in this paper are available in Matlab code as academic freeware (http://www.fil.ion.ucl.ac.uk/spm). The simulation figures in this paper can be reproduced via a graphical user interface in the DEM toolbox.

In the identification of nonlinear dynamic systems, one tries to characterise the architecture that transforms known inputs into measured outputs. This transformation is generally modelled as a generalised convolution [23] . When the inputs are known, deterministic quantities, the following m = 1 dynamic model applies (Equation 37). Here η and y play the role of inputs (priors) and outputs (responses) respectively. Note that there is no state-noise; i.e., Σ w = 0 because the states are known. In this context, the hidden states become a deterministic nonlinear convolution of the causes [23] . This means there is no conditional uncertainty about the states (given the parameters) and the D -step reduces to integrating the state-equation to produce deterministic outputs. The E -Step updates the conditional parameters, based on the resulting prediction error, and the M -Step estimates the precision of the observation error. The ensuing scheme is described in detail in [24] , where it is applied to nonlinear hemodynamic models of fMRI time-series. This is an EM scheme that has been used widely to invert deterministic dynamic causal models of biological time-series. In part, the motivation to develop DEM was to generalise EM to handle state-noise or random fluctuations in hidden states. The extension of EM schemes into generalised coordinates had not yet been fully explored and represents a potentially interesting way of harnessing serial correlations in observation noise to optimise the estimates of a system's parameters. This extension is trivial to implement with DEM by specifying very high precisions on the causes and state-noise.

It is interesting to note that transposing the general linear model is equivalent to switching the roles of the causes and parameters; θ (1)T ↔ η. Under this transposition, one could replace the D -step with the E -step. This gives exactly the same results because the two updates are formally identical for static models (Equation 36). The exponential term disappears because the update is integrated until convergence; i.e., Δt = ∞. At this point, generalised motion is zero and an embedding order of n = 0⇒D = 0 is sufficient. This is a useful perspective because it suggests that static models can be regarded as models of steady-state or equilibrium responses, for systems with fixed-point attractors.

If we have flat priors on the parameters, Π θ = 0, the conditional moments in Equation (35) become maximum likelihood ( ML ) estimators. Finally, under i.i.d. (independently and identically distributed) assumptions about the errors, the dependency on the hyperparameters disappears (because the precisions cancel) and we obtain ordinary least squares ( OLS ) estimates; µ θ = η − y T , where η − = (ηη T ) −1 η is the generalised inverse.
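As a concrete check of this special case, the OLS estimator µ θ = η − y T can be computed numerically. The dimensions, noise level and variable names below are our own illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: 3 known causes, 2 response channels, 100 samples
p, q, T = 3, 2, 100
eta = rng.standard_normal((p, T))                 # known causes (design), p x T
theta_true = rng.standard_normal((q, p))          # true first-level parameters
y = theta_true @ eta + 0.01 * rng.standard_normal((q, T))  # response with i.i.d. error

eta_gen = np.linalg.inv(eta @ eta.T) @ eta        # generalised inverse, (eta eta^T)^-1 eta
mu_theta = eta_gen @ y.T                          # OLS estimate of theta^T
```

With flat priors and i.i.d. errors this reproduces the classical least-squares solution, recovering the true parameters to within the noise level.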

Consider the linear model, with a response that has been elicited using known causes, y = θ (1) η+z (1) . If we start with an initial estimate of the parameters, θ (1) = 0, the E -step reduces to Equation 35. These are the standard results for the conditional expectation and covariance of a general linear model, under parametric (i.e., Gaussian error) assumptions. From this perspective, the known causes η T play the role of explanatory variables that are referred to collectively in classical statistics as a design matrix. This can be seen more easily by considering the transpose of the linear model in Equation 34; y T = η T θ (1)T +z (1)T . In this form, the causes are referred to as explanatory or independent variables and the data as response or dependent variables. A significant association between these two sets of variables is usually established by testing the null hypothesis that θ (1) = 0. This proceeds either by comparing the evidence for models with (full or alternate) and models without (reduced or null) the appropriate explanatory variables, or by using the conditional density of the parameters under the full model.

In nonlinear optimisation, we want to identify the parameters of a static, nonlinear function that maps known causes to responses. This is a trivial case of the static model above that obtains when the hierarchical order reduces to m = 1 (Equation 34). The conditional estimates of θ (1) optimise the mapping g (1) : η→y for any specified form of generating function. Because there are no dynamics, the generalised motion of the response is zero, rendering the D -step and generalised coordinates redundant. Therefore, identification or inversion of these models reduces to conventional expectation-maximisation ( EM ), in which the parameters and hyperparameters are optimised recursively, through a coordinate ascent on the variational energy implicit in the E - and M -steps. Expectation-maximisation itself has some ubiquitous special cases, when applied to simple linear models:

Usually, supervised learning entails learning the parameters of static nonlinear generative models with known causes. This corresponds to an HDM with infinitely precise priors at the last level and any number of subordinate levels (with no hidden states) (Equation 33). One could regard this model as a neural network with m hidden layers. From the neural network perspective, the objective is to optimise the parameters of a nonlinear mapping from data y to the desired output η, using back-propagation of errors or related approaches [21] . This mapping corresponds to inversion of the generative model that maps causes to data; g (i) : η→y. This inverse problem is solved by DEM . However, unlike back-propagation of errors or universal approximation in neural networks [22] , DEM is not simply a nonlinear function approximation device. This is because the network connections parameterise a generative model as opposed to its inverse; h: y→η (i.e., a recognition model). This means that the parameters specify how states cause data and can therefore be used to generate data. Furthermore, unlike many neural network or PDP (parallel distributed processing) schemes, DEM enables Bayesian inference through an explicit parameterisation of the conditional densities of the parameters.

In these models the causes are known and enter as priors η with infinite precision; Σ v = 0. Furthermore, if the model is static or, more generally when g x = 0, we can ignore hidden states and dispense with the D -step.

In terms of establishing the generality of the HDM, it is sufficient to note that Bayesian filters simply estimate the conditional density on the hidden states of an HDM. As intimated in the introduction , their underlying state-space models assume that z t and w t are serially independent, to induce a Markov property over sequential observations. This pragmatic but questionable assumption means the generalised motion of the random terms has zero precision and there is no point in representing generalised states. We have presented a fairly thorough comparative evaluation of DEM and extended Kalman filtering (and particle filtering) in [2] . DEM is consistently more accurate because it harvests empirical priors in generalised coordinates of motion. Furthermore, DEM can be used for inference on both the hidden states and the random fluctuations driving them, because it uses an explicit conditional density q(x̃,ṽ) over both.

Deconvolution under HDMs is related to Bayesian approaches to inference on states using Bayesian belief update procedures (i.e., incremental or recursive Bayesian filters). The conventional approach to online Bayesian tracking of nonlinear or non-Gaussian systems employs extended Kalman filtering [30] or sequential Monte Carlo methods such as particle filtering. These Bayesian filters try to find the posterior densities of the hidden states in a recursive and computationally expedient fashion, assuming that the parameters and hyperparameters of the system are known. The extended Kalman filter is a generalisation of the Kalman filter in which the linear operators, of the state-space equations, are replaced by their partial derivatives evaluated at the current conditional mean. See also Wang and Titterington [31] for a careful analysis of variational Bayes for continuous linear dynamical systems and [32] for a review of the statistical literature on continuous nonlinear dynamical systems. These treatments belong to the standard class of schemes that assume Wiener or diffusion processes for state-noise and, unlike HDM, do not consider generalised motion.
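To ground this comparison, a minimal discrete-time Kalman filter for a linear state-space model can be sketched as follows. The system matrices and noise covariances are invented for illustration, and the exogenous input term is omitted:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear state-space model: x_t = A x_{t-1} + w_t, y_t = C x_t + z_t
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)                   # state-noise covariance
R = np.array([[0.1]])                  # observation-noise covariance

T, x = 500, np.zeros(2)
xs, ys = [], []
for _ in range(T):
    x = A @ x + rng.multivariate_normal(np.zeros(2), Q)
    xs.append(x.copy())
    ys.append((C @ x)[0] + rng.normal(0.0, np.sqrt(R[0, 0])))

# Recursive predict/correct updates on the conditional density of the states
m, P = np.zeros(2), np.eye(2)
err_filt = err_raw = 0.0
for xt, yt in zip(xs, ys):
    m, P = A @ m, A @ P @ A.T + Q                    # predict
    K = P @ C.T @ np.linalg.inv(C @ P @ C.T + R)     # Kalman gain
    m = m + K @ (yt - C @ m)                         # correct with prediction error
    P = (np.eye(2) - K @ C) @ P
    err_filt += (m[0] - xt[0]) ** 2                  # filtered error on the first state
    err_raw += (yt - xt[0]) ** 2                     # raw observation error
```

Even this textbook filter improves on the raw observations; the point of contrast with DEM is that it discards generalised motion and assumes serially independent noise.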

State-space models have the following form in discrete time and rest on a vector autoregressive ( VAR ) formulation (Equation 43), where w t is a standard noise term. These models are parameterised by a system matrix A, an input matrix B, and an observation matrix g x . State-space models are special cases of linear HDMs, where the system-noise can be treated as a cause with random fluctuations (Equation 44). Notice that we have had to suppress state-noise in the HDM to make a simple state-space model. These models are adopted by conventional approaches for inference on hidden states in dynamic models:

In deconvolution problems, the objective is to estimate the inputs to a dynamic system given its response and parameters (Equation 42). This model is similar to Equation 37 but now we have random fluctuations on the unknown states. Estimation of the states proceeds in the D -Step. Recall that the E -Step is redundant because the parameters are known. When Σ (1) is known, the M -Step is also unnecessary and DEM reduces to deconvolution. This is related to Bayesian deconvolution or filtering under state-space models:

The model in Equation 41 is also referred to as a Gaussian process model [27] – [29] . The basic idea behind Gaussian process modelling is to replace priors on the parameters of the mapping g(v): v→y with a prior on the space of mappings; p(g(v)). The simplest is a Gaussian process prior ( GPP ), specified by a Gaussian covariance function of the response, Σ(y|λ). The form of this GPP is furnished by the hierarchical structure of the HDM.

When there are many more causes than observations, a common device is to eliminate the causes in Equation 40 by recursive substitution, to give a model that generates sample covariances and is formulated in terms of covariance components (i.e., hyperparameters) (Equation 41). Inversion then reduces to iterating the M -step. The causes can then be recovered from the hyperparameters using Equation 39 and the matrix inversion lemma. This can be useful when inverting ill-posed linear models (e.g., the electromagnetic inversion problem; [25] ). Furthermore, by using shrinkage hyperpriors one gets a behaviour known as automatic relevance determination ( ARD ), where irrelevant components are essentially switched off [26] . This leads to sparse models of the data that are optimised automatically.
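The recovery of causes from the covariance formulation can be illustrated with an under-determined linear model. The sizes and noise variance below are assumptions for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Ill-posed linear model: 16 causes, only 4 observations (cf. electromagnetic inversion)
n_y, n_v = 4, 16
theta = rng.standard_normal((n_y, n_v))
v = rng.standard_normal(n_v)
y = theta @ v + 0.1 * rng.standard_normal(n_y)   # noise std 0.1 -> variance 0.01

# Eliminate the causes: marginal covariance of the data under unit priors on v
sigma2 = 0.01                                    # (assumed known) observation variance
Sigma_y = theta @ theta.T + sigma2 * np.eye(n_y)

# Conditional causes via the matrix-inversion-lemma form: mu_v = theta^T Sigma_y^-1 y
mu_v = theta.T @ np.linalg.solve(Sigma_y, y)
```

The conditional causes reproduce the data almost exactly, even though the problem is under-determined; the unit priors select the minimum-norm solution.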

The inversion was cross-validated with expectation maximization (EM), where the M-step corresponds to restricted maximum likelihood (ReML). This example used a simple two-level model that embodies empirical shrinkage priors on the first-level parameters. These models are also known as parametric empirical Bayes (PEB) models (left). Causes were sampled from the unit normal density to generate a response, which was used to recover the causes, given the parameters. Slight differences in the hyperparameter estimates (upper right), due to a different hyperparameterisation, have little effect on the conditional means of the unknown causes (lower right), which are almost indistinguishable.

When the model above is linear, we have the ubiquitous hierarchical linear observation model used in Parametric Empirical Bayes ( PEB ; [8] ) and mixed-effects analysis of covariance ( ANCOVA ) analyses (Equation 40). Here the D -Step converges after a single iteration because the linearity of this model renders the Laplace assumption exact. In this context, the M -Step becomes a classical restricted maximum likelihood ( ReML ) estimation of the hierarchical covariance components, Σ (i)z . It is interesting to note that the ReML objective function and the variational energy are formally identical under this model [15] , [18] . Figure 3 shows a comparative evaluation of ReML and DEM using the same data. The estimates are similar but not identical. This is because DEM hyperparameterises the covariance as a linear mixture of precisions, whereas the ReML scheme used a linear mixture of covariance components.
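The effect of empirical shrinkage priors in a two-level model can be demonstrated with a scalar example. The variances and the moment-matched hyperparameter estimate are our own illustrative choices, not the ReML scheme itself:

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-level model: theta_i ~ N(0, tau2) at the second level, y_i = theta_i + z_i
n, tau2, s2 = 200, 1.0, 0.5
theta = rng.normal(0.0, np.sqrt(tau2), n)
y = theta + rng.normal(0.0, np.sqrt(s2), n)

# Empirical Bayes: estimate the second-level variance from the data (M-step analogue)
tau2_hat = max(y.var() - s2, 1e-6)
shrink = tau2_hat / (tau2_hat + s2)
theta_post = shrink * y                    # conditional means under shrinkage priors

mse_shrunk = np.mean((theta_post - theta) ** 2)
mse_raw = np.mean((y - theta) ** 2)
```

Shrinking the first-level estimates towards their empirical prior mean reduces their mean-squared error relative to the raw observations, which is the essential benefit of hierarchical (empirical Bayes) modelling.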

In static systems, the problem reduces to estimating the causes or inputs after they are passed through some linear or nonlinear mapping to generate observed responses. For simple nonlinear estimation, in the absence of prior expectations about the causes, we have the nonlinear hierarchical model (Equation 38). This is the same as Equation 33 but with unknown causes. Here, the D -Step performs a nonlinear optimisation of the states to estimate their most likely values and the M -Step estimates the variance components at each level. As mentioned above, for static systems, Δt = ∞ and n = 0. This renders it a classical Gauss-Newton scheme for nonlinear model estimation (Equation 39). Empirical priors are embedded in the scheme through the hierarchical construction of the prediction errors, ε, and their precision, Π, in the usual way; see Equation 11 and [15] for more details.
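A scalar version of such a Gauss-Newton scheme, with known parameters w and a hypothetical mapping g(v) = w·tanh(v) (chosen purely for illustration), might look like:

```python
import numpy as np

rng = np.random.default_rng(4)

# Known parameters w; unknown scalar cause v; hypothetical mapping g(v) = w * tanh(v)
w = rng.standard_normal(20)
v_true = 0.8
y = w * np.tanh(v_true) + 0.01 * rng.standard_normal(20)

v = 0.0                                   # initial estimate of the cause
for _ in range(20):
    r = y - w * np.tanh(v)                # prediction error
    J = w * (1.0 - np.tanh(v) ** 2)       # gradient of the mapping, dg/dv
    v += (J @ r) / (J @ J)                # Gauss-Newton update
```

Each update projects the prediction error onto the local gradient of the mapping; for a well-behaved mapping the iteration converges to the most likely cause in a few steps.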

In these models, the parameters are known and enter as priors η θ with infinite precision, Σ θ = 0. This renders the E -Step redundant. We will review estimation under static models and then consider Bayesian deconvolution and filtering with dynamic models. Static models imply the generalised motion of causal states is zero and therefore it is sufficient to represent conditional uncertainty on their amplitude; i.e., n = 0⇒D = 0. As noted above the D -step for static models is integrated until convergence to a fixed point, which entails setting Δt = ∞; see [15] . Note that making n = 0 renders the roughness parameter irrelevant because this only affects the precision of generalised motion.

In summary, we have seen that endowing dynamical models with a hierarchical architecture provides a general framework that covers many models used for estimation, identification and unsupervised learning. A hierarchical structure, in conjunction with nonlinearities, can emulate non-Gaussian behaviours, even when random effects are Gaussian. In a dynamic context, the level at which the random effects enter controls whether the system is deterministic or stochastic and nonlinearities determine whether their effects are additive or multiplicative. DEM was devised to find the conditional moments of the unknown quantities in these nonlinear, hierarchical and dynamic models. As such it emulates procedures as diverse as independent components analysis and Bayesian filtering, using a single scheme. In the final section, we show that a DEM -like scheme might be implemented in the brain. If this is true, the brain could, in principle, employ any of the models considered in this section to make inferences about the sensory data it harvests.

This ontology is one of many that could be constructed and is based on the fact that hierarchical dynamic models have several attributes that can be combined to create an infinite number of models; some of which are shown in the figure. These attributes include: (i) the number of levels or depth; (ii) for each level, linear or nonlinear output functions; (iii) with or without random fluctuations; (iv) static or dynamic; (v) for dynamic levels, linear or nonlinear equations of motion; (vi) with or without state-noise; and, finally, (vii) with or without generalised coordinates.

This section has tried to show that the HDM encompasses many standard static and dynamic observation models. It is further evident that many of these models could be extended easily within the hierarchical framework. Figure 7 illustrates this by providing an ontology of models that rests on the various constraints under which HDMs are specified. This partial list suggests that only a proportion of potential models have been covered in this section.

Each row corresponds to a level, with causes on the left and hidden states on the right. In this case, the model has just two levels. The first (upper left) panel shows the predicted response and the error on this response (their sum corresponds to the observed data). For the hidden states (upper right) and causes (lower left) the conditional mode is depicted by a coloured line and the 90% conditional confidence intervals by the grey area. These are sometimes referred to as “tubes”. Finally, the grey lines depict the true values used to generate the response. Here, we estimated the hyperparameters, parameters and the states. This is an example of triple estimation, where we are trying to infer the states of the system as well as the parameters governing its causal architecture. The hyperparameters correspond to the precision of random fluctuations in the response and the hidden states. The free parameters correspond to a single parameter from the state equation and one from the observer equation that govern the dynamics of the hidden states and response, respectively. It can be seen that the true value of the causal state lies within the 90% confidence interval and that we could infer with substantial confidence that the cause was non-zero, when it occurs. Similarly, the true parameter values lie within fairly tight confidence intervals (red bars in the lower right).

Figure 6 summarises the results after convergence of DEM (about sixteen iterations using an embedding order of n = 6, with a roughness hyperparameter, γ = 4). Each row corresponds to a level in the model, with causes on the left and hidden states on the right. The first (upper left) panel shows the predicted response and the error on this response. For the hidden states (upper right) and causes (lower left) the conditional mode is depicted by a coloured line and the 90% conditional confidence intervals by the grey area. It can be seen that there is a pleasing correspondence between the conditional mean and veridical states (grey lines). Furthermore, the true values lie largely within the 90% confidence intervals; similarly for the parameters. This example illustrates the recovery of states, parameters and hyperparameters from observed time-series, given just the form of a model.

In this model, a simple Gaussian ‘bump’ function acts as a cause to perturb two coupled hidden states. Their dynamics are then projected to four response variables, whose time-courses are cartooned on the left. This figure also summarises the architecture of the implicit inversion scheme (right), in which precision-weighted prediction errors drive the conditional modes to optimise variational action. Critically, the prediction errors propagate their effects up the hierarchy (c.f., Bayesian belief propagation or message passing), whereas the predictions are passed down the hierarchy. This sort of scheme can be implemented easily in neural networks (see last section and [5] for a neurobiological treatment). This generative model uses a single cause v (1) , two dynamic states and four outputs y 1 ,…,y 4 . The lines denote the dependencies of the variables on each other, summarised by the equations (in this example both the equations were simple linear mappings). This is effectively a linear convolution model, mapping one cause to four outputs, which form the inputs to the recognition model (solid arrow). The inputs to the four data or sensory channels are also shown as an image in the insert.

In this model, causes or inputs perturb the hidden states, which decay exponentially to produce an output that is a linear mixture of hidden states. Our example used a single input, two hidden states and four outputs. To generate data, we used a deterministic Gaussian bump function input v (1) = exp(−¼(t−12) 2 ) and the following parameters (Equation 50). During inversion, the cause is unknown and was subject to mildly informative (zero mean and unit precision) shrinkage priors. We also treated two of the parameters as unknown; one parameter from the observation function (the first) and one from the state equation (the second). These parameters had true values of 0.125 and −0.5, respectively, and uninformative shrinkage priors. The priors on the hyperparameters, sometimes referred to as hyperpriors, were similarly uninformative. These Gaussian hyperpriors effectively place lognormal hyperpriors on the precisions (strictly speaking, this invalidates the assumption of a linear hyperparameterisation but the effects are numerically small), because the precisions scale as exp(λ z ) and exp(λ w ). Figure 5 shows a schematic of the generative model and the implicit recognition scheme based on prediction errors. This scheme can be regarded as a message passing scheme that is considered in more depth in the next section.
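The generative side of such a linear convolution model is simple to simulate. The matrices below are stand-ins chosen for stability, not the values in Table 1 or Equation 50:

```python
import numpy as np

# A Gaussian bump cause drives two coupled hidden states,
# which are mixed linearly into four outputs.
T = 32
t = np.arange(T)
v = np.exp(-0.25 * (t - 12.0) ** 2)        # deterministic Gaussian bump input

F = np.array([[-0.5, 0.0],
              [0.5, -0.25]])               # state equation: decay plus coupling
B = np.array([1.0, 0.0])                   # how the cause enters the states
C = 0.125 * np.array([[1.0, 1.0],
                      [1.0, -1.0],
                      [0.5, 1.0],
                      [1.0, 0.5]])         # observer equation: four outputs

x = np.zeros(2)
X, Y = [], []
for k in range(T):                         # forward Euler with unit time step
    x = x + (F @ x + B * v[k])
    X.append(x.copy())
    Y.append(C @ x)
X, Y = np.array(X), np.array(Y)
```

After the bump has passed, the hidden states decay back towards zero, so the response is a transient smeared version of the cause; inversion has to undo this convolution.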

Blind deconvolution tries to estimate the causes of an observed response without knowing the parameters of the dynamical system producing it. This represents the least constrained problem we consider and calls upon the same HDM used for system identification. An empirical example of triple estimation of states, parameters and hyperparameters can be found in [2] . This example uses functional magnetic resonance imaging time-series from a brain region to estimate not only the underlying neuronal and hemodynamic states causing signals but the parameters coupling experimental manipulations to neuronal activity. See Friston et al. [2] for further examples, ranging from the simple convolution model considered next, through to systems showing autonomous dynamics and deterministic chaos. Here we conclude with a simple m = 2 linear convolution model (Equation 42), as specified in Table 1 .

In the same way that factor analysis generalises PCA by allowing for observation error, ICA can be extended to form sparse-coding models of the sort proposed by Olshausen and Field [37] by allowing observation error (Equation 49). This is exactly the same as the ICA model but with the addition of observation error. By choosing g (2) to create heavy-tailed (supra-Gaussian) second-level causes, sparse encoding is assured in the sense that the causes will have small values on most occasions and large values on only a few. Note the M -Step comes into play again for these models. All the models considered so far are for static data. We now turn to BSS in dynamic systems.
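The qualitative signature of supra-Gaussian causes, small values on most occasions and large values on only a few, can be illustrated with a hypothetical g (2) (here a cubed Gaussian, normalised to unit variance; an assumption for the sketch):

```python
import numpy as np

rng = np.random.default_rng(5)

u = rng.standard_normal(100_000)        # Gaussian cause from the level above
v = u ** 3 / np.sqrt(15.0)              # hypothetical g(2): supra-Gaussian, unit variance

med_u, med_v = np.median(np.abs(u)), np.median(np.abs(v))
tail_u, tail_v = np.mean(np.abs(u) > 3), np.mean(np.abs(v) > 3)
```

Relative to a Gaussian of the same variance, the transformed cause has a much smaller median amplitude but heavier tails, which is exactly the sparsity that these models exploit.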

Independent component analysis ( ICA ) decomposes the observed response into a linear mixture of non-Gaussian causes [36] . Non-Gaussian causal states are implemented simply in m = 2 hierarchical models with a nonlinear transformation at higher levels. ICA corresponds to Equation 48, where, as for PCA , Σ v = I. The nonlinear function g (2) transforms a Gaussian cause, specified by the priors at the third level, into a non-Gaussian cause and plays the role of a probability integral transform. Note that there are no hyperparameters to estimate and consequently there is no M -Step. It is interesting to examine the relationship between nonlinear PCA and ICA ; the key difference is that the nonlinearity is in the first level in PCA , as opposed to the second in ICA . Usually, in ICA the probability integral transform is pre-specified to render the second-level causes supra-Gaussian. From the point of view of a HDM this corresponds to specifying precise priors on the second-level parameters. However, DEM can fit unknown distributions by providing conditional estimates of both the mixing matrix θ (1) and the probability integral transform implicit in g(v (2) ,θ (2) ).
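The probability integral transform can be sketched explicitly: map a Gaussian cause through its own CDF and then through the inverse CDF of a heavy-tailed density (here a unit Laplace, an assumption for illustration):

```python
import math

import numpy as np

rng = np.random.default_rng(6)

u = rng.standard_normal(50_000)                                  # Gaussian cause
p = 0.5 * (1.0 + np.vectorize(math.erf)(u / math.sqrt(2.0)))     # Gaussian CDF
v = np.where(p < 0.5, np.log(2.0 * p), -np.log(2.0 * (1.0 - p))) # Laplace inverse CDF

def excess_kurtosis(x):
    # Fourth standardised moment minus 3 (zero for a Gaussian)
    x = x - x.mean()
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0
```

The input is (approximately) mesokurtic while the transformed cause is strongly supra-Gaussian, which is what the second-level nonlinearity supplies in ICA.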

Parameters and causes were sampled from the unit normal density to generate a response, which was then used for their estimation. The aim was to recover the causes without knowing the parameters, which is effected with reasonable accuracy (upper). The conditional estimates of the causes and parameters are shown in lower panels, along with the increase in free-energy or log-evidence, with the number of DEM iterations (lower left). Note that there is an arbitrary affine mapping between the conditional means of the causes and their true values, which we estimated, post hoc to show the correspondence in the upper panel.

The model for factor analysis is exactly the same as for PCA but allowing for observation error (Equation 47). When the covariance of the observation error is spherical; e.g., Σ (1)z = λ (1)z I, this is also known as a probabilistic PCA model [35] . The critical distinction, from the point of view of the HDM, is that the M -Step is now required to estimate the error variance. See Figure 4 for a simple example of factor analysis using DEM . Nonlinear variants of factor analysis obtain by analogy with Equation 46.

The Principal Components Analysis ( PCA ) model assumes that uncorrelated causes are mixed linearly to form a static observation. This is an m = 1 model with no observation noise; i.e., Σ (1)z = 0 (Equation 45), where priors on v (1) = z (2) render them orthonormal; Σ v = I. There is no M -Step here because there are no hyperparameters to estimate. The D -Step estimates the causes under the unitary shrinkage priors on their amplitude and the E -Step updates the parameters to account for the data. Clearly, there are more efficient ways of inverting this model than using DEM ; for example, using the eigenvectors of the sample covariance of the data. However, our point is that PCA is a special case of an HDM and that any optimal solution will optimise variational action or energy. Nonlinear PCA is exactly the same but allowing for a nonlinear generating function (Equation 46). See [34] for an example of nonlinear PCA with a bilinear model applied to neuroimaging data to disclose interactions among modes of brain activity.
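The eigenvector shortcut mentioned above is easy to demonstrate; the dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Noiseless m = 1 model y = theta v: 2 orthonormal-prior causes, 5 channels
n, k, T = 5, 2, 1000
theta = rng.standard_normal((n, k))      # mixing matrix (unknown in BSS)
v = rng.standard_normal((k, T))          # uncorrelated causes
y = theta @ v

S = (y @ y.T) / T                        # sample covariance of the data
evals, evecs = np.linalg.eigh(S)         # eigenvalues in ascending order
pcs = evecs[:, ::-1][:, :k]              # top-k principal directions

# The principal subspace coincides with the column space of theta
resid = theta - pcs @ (pcs.T @ theta)
```

Because the model is noiseless, all but the top k eigenvalues vanish and the principal directions span exactly the mixing matrix's column space; any scheme that optimises the same objective must recover this subspace.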

In all the examples below, both the parameters and states are unknown. This entails a dual or triple estimation problem, depending on whether the hyperparameters are known. We will start with simple static models and work towards more complicated dynamic variants. See [33] for a comprehensive review of unsupervised learning for many of the models in this section. This class of models is often discussed under the rhetoric of blind source separation (BSS), because the inversion is blind to the parameters that control the mapping from sources or causes to observed signals.

Neuronal Implementation

In this final section, we revisit DEM and show that it can be formulated as a relatively simple neuronal network that bears many similarities to real networks in the brain. We have made the analogy between the DEM and perception in previous communications; here we focus on the nature of recognition in generalised coordinates. In brief, deconvolution of hidden states and causes from sensory data (D-step) may correspond to perceptual inference; optimising the parameters of the model (E-step) may correspond to perceptual learning through changes in synaptic efficacy and optimising the precision hyperparameters (M-step) may correspond to encoding perceptual salience and uncertainty, through neuromodulatory mechanisms.

Hierarchical models in the brain. A key architectural principle of the brain is its hierarchical organisation [38]–[41]. This has been established most thoroughly in the visual system, where lower (primary) areas receive sensory input and higher areas adopt a multimodal or associational role. The neurobiological notion of a hierarchy rests upon the distinction between forward and backward connections [42]–[45]. This distinction is based upon the specificity of the cortical layers that are the predominant sources and terminations of extrinsic connections (extrinsic connections couple remote cortical regions, whereas intrinsic connections are confined to the cortical sheet). Forward connections arise largely in superficial pyramidal cells, in supra-granular layers, and terminate on spiny stellate cells of layer four in higher cortical areas [40],[46]. Conversely, backward connections arise largely from deep pyramidal cells in infra-granular layers and target cells in the infra- and supra-granular layers of lower cortical areas. Intrinsic connections mediate lateral interactions between neurons that are a few millimetres away. There is a key functional asymmetry between forward and backward connections that renders backward connections more modulatory or nonlinear in their effects on neuronal responses (e.g., [44]; see also Hupe et al. [47]). This is consistent with the deployment of voltage-sensitive NMDA receptors in the supra-granular layers that are targeted by backward connections [48]. Typically, the synaptic dynamics of backward connections have slower time constants. This has led to the notion that forward connections are driving and elicit an obligatory response in higher levels, whereas backward connections have both driving and modulatory effects and operate over larger spatial and temporal scales. The hierarchical structure of the brain speaks to hierarchical models of sensory input.
We now consider how this functional architecture can be understood under the inversion of HDMs by the brain. We first consider inference on states or perception.

Perceptual inference. If we assume that the activity of neurons encodes the conditional mode of states, then the D-step specifies the neuronal dynamics entailed by perception or recognising states of the world from sensory data. Furthermore, if we ignore mean-field terms; i.e., discount the effects of conditional uncertainty about the parameters when optimising the states, Equation 23 prescribes very simple recognition dynamics (Equation 51), where ξ(t) is the prediction error multiplied by its precision, which we have re-parameterised in terms of a covariance component. Here, the matrix Λ can be thought of as lateral connections among error-units. Equation 51 is an ordinary differential equation that describes how neuronal states self-organise when exposed to sensory input. The form of Equation 51 is quite revealing: it suggests two distinct populations of neurons; state-units, whose activity encodes the conditional mode, and error-units, encoding ξ(t), with one error-unit for each state. Furthermore, the activities of error-units are a function of the states and the dynamics of state-units are a function of prediction error. This means the two populations pass messages to each other and to themselves. The messages passed among the states mediate empirical priors on their motion, while the lateral connections among the error-units, −Λξ, weight prediction errors in proportion to their precision.
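A minimal numerical sketch of these recognition dynamics, for a single static level with a linear mapping (all quantities below are assumptions for illustration, and the lateral precision matrix is reduced to a scalar), is:

```python
import numpy as np

rng = np.random.default_rng(8)

# Generative mapping y = W mu; state-units mu, error-units xi
W = rng.standard_normal((8, 3))
mu_true = np.array([1.0, -0.5, 0.25])
y = W @ mu_true                          # noiseless sensory input
Pi = 4.0                                 # precision on the sensory prediction error

mu = np.zeros(3)
dt = 0.005
for _ in range(3000):
    xi = Pi * (y - W @ mu)               # error-units: precision-weighted prediction error
    xi_p = -mu                           # prior prediction error (shrinkage to zero)
    mu = mu + dt * (W.T @ xi + xi_p)     # state-units: driven by the weighted errors

# Fixed point of the dynamics: the conditional mean under the shrinkage prior
mu_star = np.linalg.solve(Pi * W.T @ W + np.eye(3), Pi * W.T @ y)
```

The gradient flow on prediction error converges to the conditional mean, illustrating how two reciprocally coupled populations can implement inference without ever computing the posterior explicitly.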

Hierarchical message passing. If we unpack these equations we can see the hierarchical nature of this message passing (see Figure 8; Equation 52)

\dot{\tilde{\mu}}_v^{(i)} = D\tilde{\mu}_v^{(i)} - \frac{\partial\tilde{\varepsilon}^{(i)}}{\partial\tilde{\mu}_v^{(i)}}^{T}\xi^{(i)} - \xi_v^{(i+1)}
\dot{\tilde{\mu}}_x^{(i)} = D\tilde{\mu}_x^{(i)} - \frac{\partial\tilde{\varepsilon}^{(i)}}{\partial\tilde{\mu}_x^{(i)}}^{T}\xi^{(i)}
\xi_v^{(i)} = \Pi_v^{(i)}(\tilde{\mu}_v^{(i-1)} - \tilde{g}^{(i)}(\tilde{\mu}_x^{(i)},\tilde{\mu}_v^{(i)}))
\xi_x^{(i)} = \Pi_x^{(i)}(D\tilde{\mu}_x^{(i)} - \tilde{f}^{(i)}(\tilde{\mu}_x^{(i)},\tilde{\mu}_v^{(i)}))

This shows that error-units receive messages from the states in the same level and the level above, whereas states are driven by error-units in the same level and the level below. Critically, inference requires only the prediction error from the lower level, ξ(i), and the level in question, ξ(i+1). These constitute bottom-up and lateral messages that drive conditional means towards a better prediction, to explain away the prediction error in the level below. The top-down and lateral predictions correspond to g̃(i) and f̃(i). This is the essence of recurrent message passing between hierarchical levels to optimise free-energy or suppress prediction error; i.e., recognition dynamics.
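Unpacked over two levels, the same scheme yields the message passing described above. The sketch below is again a toy, static, linear hierarchy (the matrices W1 and W2 and the precisions are invented for illustration); note that each conditional expectation is touched only by the prediction errors of its own level and the level below:

```python
import numpy as np

# Hypothetical two-level model: y = W1 v1 + noise, v1 = W2 v2 + noise,
# with a unit-precision, zero-mean prior on v2. All numbers are illustrative.
W1 = np.array([[1.0, 0.3], [0.2, 1.0]])
W2 = np.array([[0.7, 0.4], [0.1, 0.9]])
Pi1, Pi2, Pi3 = 4.0 * np.eye(2), 2.0 * np.eye(2), np.eye(2)
y = np.array([1.0, -0.8])

mu1, mu2 = np.zeros(2), np.zeros(2)
for _ in range(20000):
    xi1 = Pi1 @ (y - W1 @ mu1)      # error at the sensory level
    xi2 = Pi2 @ (mu1 - W2 @ mu2)    # error on the first-level causes
    xi3 = Pi3 @ (-mu2)              # error on the prior (mean zero)
    # bottom-up error drives each state; same-level error explains it away
    mu1 += 0.02 * (W1.T @ xi1 - xi2)
    mu2 += 0.02 * (W2.T @ xi2 + xi3)
```

mu1 is updated using only xi1 (the level below) and xi2 (its own level); mu2 uses only xi2 and xi3, so the only messages passed between levels are prediction errors (forward) and predictions (backward).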


Figure 8. Schematic detailing the neuronal architectures that encode an ensemble density on the states and parameters of one level in a hierarchical model. This schematic shows the speculative cells of origin of forward driving connections that convey prediction error from a lower area to a higher area and the backward connections that are used to construct predictions. These predictions try to explain away input from lower areas by suppressing prediction error. In this scheme, the sources of forward connections are the superficial pyramidal cell population and the sources of backward connections are the deep pyramidal cell population. The differential equations relate to the optimisation scheme detailed in the main text and their constituent terms are placed alongside the corresponding connections. The state-units and their efferents are in black and the error-units in red, with causes on the left and hidden states on the right. For simplicity, we have assumed the output of each level is a function of, and only of, the hidden states. This induces a hierarchy over levels and, within each level, a hierarchical relationship between states, where hidden states predict causes. https://doi.org/10.1371/journal.pcbi.1000211.g008

The connections from error to state-units have a simple form that depends on the gradients of the model's functions; from Equation 12, these connection strengths are the transposed derivatives of g̃(i) and f̃(i) with respect to the states (Equation 53). They pass prediction errors forward to state-units in the higher level and laterally to state-units at the same level. The reciprocal influences of the states on the error-units are mediated by backward connections and lateral interactions. In summary, all connections between error and state-units are reciprocal, where the only connections that link levels are forward connections conveying prediction error to state-units and reciprocal backward connections that mediate predictions (see Figure 8).
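For a linear output function this reciprocity can be made explicit in a trivial sketch (W is an invented weight matrix): the influence of the state-units on the error-units is the gradient of the prediction error, −W, while the reciprocal influence of the error-units on the state-units is the transposed gradient, Wᵀ — equal in strength and opposite in sign:

```python
import numpy as np

# Toy linear generative mapping g(v) = W v; W is illustrative.
W = np.array([[1.0, 0.5],
              [0.2, 1.0]])

# error-units compute eps = y - W mu, so their sensitivity to the states is:
states_to_errors = -W      # d(eps)/d(mu): backward/lateral influence
# state-units are driven by W.T @ xi, so their sensitivity to the errors is:
errors_to_states = W.T     # d(mu_dot)/d(xi): forward influence

# every connection is reciprocated, with equal strength and opposite sign
assert np.array_equal(errors_to_states, -states_to_errors.T)
```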
We can identify error-units with superficial pyramidal cells, because the only messages that pass up the hierarchy are prediction errors and superficial pyramidal cells originate forward connections in the brain. This is useful because it is these cells that are primarily responsible for electroencephalographic (EEG) signals that can be measured non-invasively. Similarly, the only messages that are passed down the hierarchy are the predictions from state-units that are necessary to form prediction errors in lower levels. The sources of extrinsic backward connections are largely the deep pyramidal cells, and one might deduce that these encode the expected causes of sensory states (see [49] and Figure 9). Critically, the motion of each state-unit is a linear mixture of bottom-up prediction error; see Equation 52. This is exactly what is observed physiologically, in that bottom-up driving inputs elicit obligatory responses that do not depend on other bottom-up inputs. The prediction error itself is formed by predictions conveyed by backward and lateral connections. These influences embody the nonlinearities implicit in g̃(i) and f̃(i). Again, this is entirely consistent with the nonlinear or modulatory characteristics of backward connections.


Figure 9. Schematic detailing the neuronal architectures that encode an ensemble density on the states and parameters of hierarchical models. This schematic shows how the neuronal populations of the previous figure may be deployed hierarchically within three cortical areas (or macro-columns). Within each area the cells are shown in relation to the laminar structure of the cortex, which includes supra-granular (SG), granular (L4) and infra-granular (IG) layers. https://doi.org/10.1371/journal.pcbi.1000211.g009

Encoding generalised motion. Equation 51 is cast in terms of generalised states. This suggests that the brain has an explicit representation of generalised motion; in other words, there are separable neuronal codes for different orders of motion. This is perfectly consistent with empirical evidence for distinct populations of neurons encoding elemental visual features and their motion (e.g., motion-sensitive area V5; [39]). The analysis in this paper suggests that acceleration and higher-order motion are also encoded, each order providing constraints on a lower order through the term Dμ̃ in Equation 51. Here, D represents a fixed connectivity matrix that mediates these temporal constraints. Notice that \dot{\tilde{\mu}} = D\tilde{\mu} only when the prediction error is zero. This means it is perfectly possible to represent the motion of a state that is inconsistent with the state of motion. The motion after-effect is a nice example of this, where a motion percept coexists with no change in the perceived location of visual stimuli. The encoding of generalised motion may mean that we represent paths or trajectories of sensory dynamics over short periods of time and that there is no perceptual instant (c.f., the remembered present; [50]). One could speculate that the encoding of different orders of motion may involve rate codes in distinct neuronal populations or multiplexed temporal codes in the same populations (e.g., in different frequency bands). See [51] for a neurobiologically realistic treatment of temporal dynamics in decision-making during motion perception and [52] for a discussion of synchrony and attentive learning in laminar thalamocortical circuits. When dealing with empirical data-sequences one has to contend with sparse and discrete sampling. Analogue systems, like the brain, can sample generalised motion directly. When sampling sensory data, one can easily imagine how receptors generate generalised sensory input ỹ.
Indeed, it would be surprising to find any sensory system that did not respond to a high-order derivative of changing sensory fields (e.g., acoustic edge detection; offset units in the visual system, etc.; [53]). Note that sampling high-order derivatives is formally equivalent to high-pass filtering sensory data. A simple consequence of encoding generalised motion is, in electrophysiological terms, the emergence of spatiotemporal receptive fields that reflect selectivity for particular sensory trajectories.
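The role of D can be illustrated concretely (a minimal sketch; the particular values of the generalised coordinates are invented): D simply shifts each order of motion down by one, so that the represented velocity constrains the change in position, acceleration the change in velocity, and so on:

```python
import numpy as np

n = 4                                 # orders of generalised motion
D = np.diag(np.ones(n - 1), k=1)      # shift matrix: a derivative operator
mu_tilde = np.array([1.0, 2.0, -0.5, 0.3])  # position, velocity, accel., jerk

# D maps each order onto the one below; the highest order maps to zero
motion = D @ mu_tilde
```

Only when the prediction error vanishes does the path of the conditional mode agree with this internal extrapolation; a percept of motion (the second element) can therefore coexist with an unchanging represented position (the first element), as in the motion after-effect.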

Perceptual learning and plasticity. The conditional expectations of the parameters, µθ, control the construction of prediction error through backward and lateral connections. This suggests that they are encoded in the strength of extrinsic and intrinsic connections. If we define effective connectivity as the rate of change of a unit's response with respect to its inputs, Equation 51 suggests an interesting antisymmetry in the effective connectivity between the state and error-units. The effective connectivity from the states to the error-units is the gradient of the prediction error with respect to the states, ∂ε̃/∂μ̃. This is simply the negative transpose of the effective connectivity, −∂ε̃T/∂μ̃, that mediates recognition dynamics. In other words, the effective connection from any state to any error-unit has the same strength (but opposite sign) as the reciprocal connection from the error to the state-unit. This means we would expect to see connections reciprocated in the brain, which is generally the case [39],[40]. Furthermore, we would not expect to see positive feedback loops; c.f., [54]. We now consider the synaptic efficacies underlying effective connectivity. If synaptic efficacy encodes the parameter estimates, we can cast parameter optimisation as changing synaptic connections. These changes have a relatively simple form that is recognisable as associative plasticity. To show this, we will make the simplifying but plausible assumption that the brain's generative model is based on nonlinear activation functions, a(·), of linear mixtures of states; e.g., g = a(θg u) and f = a(θf u) (Equation 54). Under this assumption, θg and θf correspond to matrices of synaptic strengths or weights and a can be understood as a neuronal activation function that models nonlinear summation of presynaptic inputs over the dendritic tree [55]. This means that the synaptic connection to the ith error-unit from the jth state-unit depends on only one parameter, θij, which changes according to Equation 29 (Equation 55). This suggests that plasticity comprises an associative term and a decay term mediating priors on the parameters.
The dynamics of the associative term are given by Equation 21 (exploiting the Kronecker form of Equation 22). The integral of this associative term is simply the covariance between presynaptic input and postsynaptic prediction error, summed over orders of motion. In short, it mediates associative or Hebbian plasticity. The product of pre- and postsynaptic signals is modulated by an activity-dependent term, a′, which is the gradient of the activation function at its current level of input (and is constant for linear models). Critically, updating the conditional estimates of the parameters, through changes in synaptic efficacy, uses only local information that is available at each error-unit. Furthermore, the same information is available at the synaptic terminal of the reciprocal connection, where the ith error-unit delivers presynaptic inputs to the jth state-unit. In principle, this enables reciprocal connections to change in tandem. Finally, because plasticity is governed by two coupled ordinary differential equations (Equation 55), connection strengths should change more slowly than the neuronal activity they mediate. These theoretical predictions are entirely consistent with empirical and computational characterisations of plasticity [56],[57].
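A discrete-time caricature of this plasticity is sketched below. This is an illustration only, not the scheme in the main text: the "true" weights, learning rate and decay are invented, and the second-order dynamics of Equation 55 are collapsed into a single stochastic update. Each weight change is the product of presynaptic input and postsynaptic prediction error, gated by the activation gradient a′, with a decay term implementing shrinkage priors:

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.tanh                                   # activation function
a_prime = lambda x: 1.0 - np.tanh(x) ** 2     # its gradient

theta_true = np.array([[0.5, -0.3, 0.2],
                       [0.1,  0.4, -0.5]])    # hypothetical "true" weights
theta = np.zeros_like(theta_true)             # learned synaptic efficacies
lr, decay = 0.05, 1e-4

for _ in range(20000):
    u = rng.normal(size=3)                    # presynaptic states
    pre = theta @ u                           # summed dendritic input
    xi = a(theta_true @ u) - a(pre)           # postsynaptic prediction error
    # Hebbian term (pre x post, gated by a') minus a decay mediating priors
    theta += lr * (np.outer(a_prime(pre) * xi, u) - decay * theta)
```

The update for each θij uses only the locally available pair (ξi, uj), which is why the same information can drive the reciprocal connection in tandem.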

Perceptual salience and uncertainty. Equation 51 shows that the influence of prediction error is scaled by its precision or covariance, which is a function of µλ. This means that the relative influence of bottom-up, lateral and top-down effects is modulated by the conditional expectation of the hyperparameters. This selective modulation of afferents mirrors the gain-control mechanisms invoked for attention; e.g., [58],[59]. Furthermore, it enacts the sorts of mechanisms implicated in biased-competition models of spatial and object-based attention mediating visual search [60],[61]. Equation 51 formulates this bias or gain-control in terms of lateral connections among error-units. This means hyperparameter optimisation would be realised, in the brain, as neuromodulation or plasticity of lateral interactions among error-units. If we assume that the covariance is a linear mixture of covariance components, Ri, among non-overlapping subsets of error-units, then Σ(λ) = ∑i λi Ri (Equation 56). Under this hyperparameterisation, each µλi modulates a subset of connections to encode a partition of the covariance. Because each set of connections is a function of only one hyperparameter, their plasticity is prescribed simply by Equation 31 (Equation 57). The quantities µλi might correspond to specialised (e.g., noradrenergic or cholinergic) systems in the brain that broadcast their effects to the ith subset of error-units to modulate their responsiveness to each other. The activities of these units change relatively slowly, in proportion to an associative term and a decay term that mediates hyperpriors. The associative term is basically the difference between the sample covariance of precision-weighted prediction errors and the precision expected under the current value of µλi. As above, changes in µλ occur more slowly than the fast dynamics of the states, because they are driven by µ′λ, which accumulates energy gradients to optimise variational action.
One could think of µλ as the synaptic efficacy of lateral or intrinsic connections that depend upon classical neuromodulatory inputs and other slower synaptic dynamics (e.g., after-hyperpolarisation potentials and molecular signalling). The physiological aspects of these dynamics provide an interesting substrate for attentional mechanisms in the brain (see Schroeder et al. [62] for a review) and are not unrelated to the ideas in [63]. These authors posit a role for acetylcholine (an ascending modulatory neurotransmitter) in mediating expected uncertainty. This is entirely consistent with the dynamics of µλ, which are driven by the amplitude of prediction errors encoding the relative precision of sensory signals and empirical priors. Modulatory neurotransmitters have, characteristically, much slower time constants, in terms of their synaptic effects, than the glutamatergic neurotransmission employed by extrinsic cortico-cortical connections.
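The hyperparameter update can be caricatured for a single log-precision (an illustrative sketch; the noise level, learning rate and iteration count are invented). The associative term is the difference between the precision expected under the current estimate and the sample precision of the errors, and it vanishes when the two agree:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_true = 0.5     # hypothetical true standard deviation of the errors
lam = 0.0            # conditional log-precision estimate
lr = 0.01            # slow learning rate: hyperparameters change slowly

for _ in range(50000):
    e = rng.normal(scale=sigma_true)         # a sampled prediction error
    # gradient of the Gaussian log-likelihood w.r.t. the log-precision lam:
    # expected precision (1) minus the precision-weighted squared error
    lam += lr * 0.5 * (1.0 - np.exp(lam) * e ** 2)

# exp(lam) should approximate the true precision 1 / sigma_true**2 = 4
```

Because the learning rate is small, the estimate integrates the amplitude of prediction errors over many samples, changing far more slowly than the fast state dynamics it modulates.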

The mean-field partition. The mean-field approximation q(ϑ) = q(u(t))q(θ)q(λ) enables inference about perceptual states, causal regularities and context, without representing the joint distribution explicitly; c.f., [64]. However, the optimisation of one set of sufficient statistics is a function of the others. This has a fundamental implication for optimisation in the brain (see Figure 10). For example, ‘activity-dependent plasticity’ and ‘functional segregation’ speak to reciprocal influences between changes in states and connections, in that changes in connections depend upon activity and changes in activity depend upon connections. Things get more interesting when we consider three sets, because quantities encoding precision must be affected by, and affect, neuronal activity and plasticity. This places strong constraints on the neurobiological candidates for these hyperparameters. Happily, the ascending neuromodulatory neurotransmitter systems, such as dopaminergic and cholinergic projections, have exactly the right characteristics: they are driven by activity in presynaptic connections and can affect activity through classical neuromodulatory effects at the post-synaptic membrane [65], while also enabling potentiation of connection strengths [66],[67]. Furthermore, it is exactly these systems that have been implicated in value-learning [68]–[70], attention and the encoding of uncertainty [63],[71].


Figure 10. The ensemble density and its mean-field partition. q(ϑ) is the ensemble density and is encoded in terms of the sufficient statistics of its marginals. These statistics or variational parameters (e.g., mean or expectation) change to extremise free-energy, to render the ensemble density an approximate conditional density on the causes of sensory input. The mean-field partition corresponds to a factorisation over the sets comprising the partition. Here, we have used three sets (neural activity, modulation and connectivity). Critically, the optimisation of the parameters of any one set depends on the parameters of the other sets. In this figure, we have focused on the means or expectations µi of the marginal densities, q(ϑi) = N(ϑi: µi,Ci). https://doi.org/10.1371/journal.pcbi.1000211.g010