We show that Langevin Markov chain Monte Carlo inference in an energy-based model with latent variables has the property that the early steps of inference, starting from a stationary point, correspond to propagating error gradients into internal layers, similar to backpropagation. The backpropagated error is with respect to output units that have received an outside driving force pushing them away from the stationary point. Backpropagated error gradients correspond to temporal derivatives with respect to the activation of hidden units. These lead to a weight update proportional to the product of the presynaptic firing rate and the temporal rate of change of the postsynaptic firing rate. Simulations and a theoretical argument suggest that this rate-based update rule is consistent with those associated with spike-timing-dependent plasticity. The ideas presented in this article could be an element of a theory for explaining how brains perform credit assignment in deep hierarchies as efficiently as backpropagation does, with neural computation corresponding to both approximate inference in continuous-valued latent variables and error backpropagation, at the same time.

© 2017 Massachusetts Institute of Technology

1 Introduction

It has been hypothesized numerous times (Hinton & Sejnowski, 1986; Friston & Stephan, 2007; Berkes, Orban, Lengyel, & Fiser, 2011) that, given a state of sensory information (current and past inputs), neurons are collectively performing inference, that is, moving toward configurations that better "explain" the observed sensory data and so minimize the corresponding "surprise" and maximize a form of coherence between sensory observations and internal representations. We can think of the configuration of internal neurons (hidden units or latent variables) as an "explanation" for the observed sensory data. Under this hypothesis, when an unexpected signal arrives at visible neurons, the other neurons will change their state (from one that is near a stochastic equilibrium) so as to better reflect that change in input.1 If this error signal is slowly driving visible neurons toward the externally observed value, this creates a perturbation for the whole network. In this article, we consider what happens early in this process, when the perturbation due to the incorrect prediction of visible neurons is propagated toward inner areas of the brain (hidden units and hidden layers). As one of the two main contributions of this work, we show how the propagation of this perturbation is mathematically equivalent to the propagation of activation gradients by the backpropagation algorithm in a deep neural network. This result assumes that latent variables are continuous valued (unlike in the vanilla Boltzmann machine), making the system an energy-based model similar to the continuous Hopfield net (Hopfield, 1984), defined by an energy function over both continuous visible and hidden units.
It also assumes that the neurons are leaky integrators with injected noise. As a consequence, neural computation corresponds to gradually going down the energy while adding noise, which corresponds to running a Langevin Markov chain Monte Carlo (MCMC) as the inference mechanism. By applying the chain rule, the backpropagated activation gradients can be used to estimate the error gradient with respect to the weights of the incoming connections. This leads to a weight update performing gradient descent on the output prediction error, which is similar to a temporal form of contrastive Hebbian learning. The update is proportional to the product of the presynaptic firing rate and the temporal rate of change of the postsynaptic firing rate.

This brings us to the second main contribution of this work. Biological experiments (Markram & Sakmann, 1995; Gerstner, Kempter, van Hemmen, & Wagner, 1996) suggest that synaptic change in several types of neurons and stimulation contexts depends on the relative timing of presynaptic and postsynaptic spikes. Learning rules that mimic this observation are called spike-timing-dependent plasticity (STDP) rules. We show that our proposed weight update, which depends only on firing rates and their temporal derivatives, is consistent with the STDP observations, according to a series of simulations. This result provides an interesting machine learning interpretation for STDP, since artificial neural networks are generally cast in terms of firing rates. This work is only a stepping stone. Several points still need to be elucidated before a complete theory of learning, inference, and credit assignment is elaborated that is both biologically plausible and makes sense from a machine learning point of view for whole networks (with global optimization of the whole network and not being limited to learning of visible neurons that receive a target). See section 6 for a discussion.
We start by explaining the connection between leaky integrator-based neural computation and Langevin MCMC in energy-based models in section 2. Based on this, we show in section 3 that, starting from a fixed point, early inference in such an energy-based model performs backpropagation of prediction error gradients. In section 4, we analyze the resulting weight update and show that it mimics STDP. Section 5 discusses related work; open questions and directions of future research are given in section 6. We conclude in section 7.

2 Neural Computation Does Inference: Going Down the Energy

We consider the hypothesis that a central interpretation of neural computation (what neurons do, on a short timescale at which weights can be considered fixed) is that it is performing iterative inference. Iterative inference means that the hidden units of the network are gradually changed toward configurations that are more probable given the sensory input and according to the current model of the world associated with the parameters of the model. In other words, they are approximately moving toward configurations more probable under some probability distribution $p$ associated with the model and eventually sampling from $p$. Before making the connection between Boltzmann machines or energy-based models and backpropagation, let us see in more mathematical detail how neural computation could be interpreted as inference.

2.1 Leaky Integrator Neurons

For this purpose, consider the classical leaky integrator neural equation. Let $s_t$ represent the state of the system at time $t$, a vector with one element per unit (e.g., representing a single neuron), where $s_{t,i}$ is a real-valued quantity associated with the $i$th unit, corresponding to a time-integrated voltage potential. Let us denote $v_t$ for the state of the visible units (i.e., a subset of the elements of $s_t$ representing the externally driven input) and $h_t$ for the state of the hidden units (i.e., the remaining elements of $s_t$, such that $s_t = (v_t, h_t)$). Furthermore, let $f$ be the function that computes a new complete state given its previous state, $s_{t+1} = f(s_t)$, with $f = (f_v, f_h)$ denoting the parts of $f$ that, respectively, output the new states of the visible and hidden units.
Here, we assume that the input units are clamped to an external signal and that the time evolution of the unclamped hidden units follows a leaky integrator equation,

$$s_{t+1,i} = s_{t,i} + \epsilon\,(R_i(\tilde{s}_t) - s_{t,i}), \tag{2.1}$$

where $R_i(\tilde{s}_t)$ represents the network-generated pressure on neuron $i$ (i.e., $R_i(\tilde{s}_t)$ is what the rest of the network asks neuron $i$ to move toward) and the corrupted state $\tilde{s}_t$ results from synaptic noise and spiking effects. Here, we roughly model this corruption by simple additive noise,

$$\tilde{s}_t = s_t + \eta_t, \tag{2.2}$$

and we see that equation 2.1 corresponds to the discretization of a differential equation,

$$\tau \frac{\partial s_i}{\partial t} = R_i(\tilde{s}) - s_i, \tag{2.3}$$

which brings $s_i$ exponentially fast (with time constant $\tau$) toward the target value $R_i(\tilde{s})$, along with the random walk movements brought by the noise $\eta$. We assume that $R_i(\tilde{s})$ is a weighted sum of input signals coming from the neurons connected into neuron $i$ (although the derivation below does not depend on the specific form of $R_i$, only on the assumption that it corresponds to an energy gradient). Note that standard biological neuron models assume that $R_i$ is a weighted sum of the input spikes (or of the input firing rates).

2.2 Machine Learning Interpretation: Leaky Integration as Langevin MCMC

Why can equation 2.1 be seen as performing inference, moving toward states more probable under $p$?
Note, first, that $R(\tilde{s}_t)$ in equation 2.1 represents a guess for a new configuration, with $R(\tilde{s}_t) - \tilde{s}_t$ being a noisy direction of movement and $R(s_t) - s_t$ a noise-free direction. Consider the following choice of driving function, which makes that direction of change for the neural state point down an energy gradient,

$$R(\tilde{s}) = \tilde{s} - \frac{\partial E(\tilde{s})}{\partial \tilde{s}}, \tag{2.4}$$

for some energy-based probability distribution $p(s) \propto e^{-E(s)}$ associated with the model, with energy function $E$. This is an assumption that can easily be fulfilled, for example, if $R_i$ is a linear combination of some input rates $\rho(s_j)$, as we will show below. Interestingly, equation 2.4 also emerges in recent work on unsupervised learning using denoising autoencoders and denoising score matching, where $R(\tilde{s}) - \tilde{s}$ corresponds to the reconstruction error (Vincent, 2011; Alain & Bengio, 2013). In that context, $R$ corresponds to the reconstruction function of the autoencoder, and $p$ is the probability density function that it estimates, with the associated energy function $E$.

With this interpretation, the leaky integration neural computation of equation 2.1 seems to follow a Langevin MCMC (Andrieu, de Freitas, Doucet, & Jordan, 2003), since

$$s_{t+1} = s_t + \epsilon\,(R(\tilde{s}_t) - s_t) = s_t + \epsilon\,(R(\tilde{s}_t) - \tilde{s}_t + \eta_t) = s_t + \epsilon\,\eta_t - \epsilon\,\frac{\partial E(\tilde{s}_t)}{\partial \tilde{s}_t}, \tag{2.5}$$

where for the last step we used equations 2.2 and 2.4, and we see that we are going down the energy gradient of $E$. Hence, from the point of view of the noisy states $\tilde{s}$, we see that the update equation corresponds to

$$\tilde{s}_{t+1} = \tilde{s}_t - \epsilon\,\frac{\partial E(\tilde{s}_t)}{\partial \tilde{s}_t} + \eta_{t+1} - (1-\epsilon)\,\eta_t, \tag{2.6}$$

which we recognize as going down the gradient of the energy with learning rate $\epsilon$ and adding the noise $\eta_{t+1} - (1-\epsilon)\,\eta_t$.
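As a quick illustrative check of this interpretation, the sketch below (all constants hypothetical, not taken from the article) runs the noisy update of equation 2.6 on a one-dimensional toy energy and verifies that the chain ends up sampling roughly from the corresponding Boltzmann distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Langevin chain in the spirit of eq. 2.6: move down the gradient of an
# energy E with step eps and inject Gaussian noise.  For the hypothetical
# one-dimensional energy E(s) = s^2 / 2 (not the paper's network energy),
# p(s) is proportional to exp(-E(s)), i.e., a standard normal, so the chain
# should end up sampling roughly from N(0, 1) (up to O(eps) discretization bias).
def grad_E(s):
    return s  # dE/ds for E(s) = s^2 / 2

eps = 0.1
s = 3.0                                      # start far from the low-energy region
noises = np.sqrt(2 * eps) * rng.standard_normal(200_000)
samples = []
for t, noise in enumerate(noises):
    s = s - eps * grad_E(s) + noise          # eq. 2.6-style update
    if t > 1_000:                            # discard burn-in
        samples.append(s)
samples = np.array(samples)

print(f"mean ~ {samples.mean():.3f}, var ~ {samples.var():.3f}")
```

The chain forgets its initialization and its histogram matches the model's distribution, which is what "eventually sampling from $p$" means in section 2.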
2.3 A Possible Energy Function

To fix ideas and illustrate the possibility of a driving function $R$ corresponding to the gradient of an energy function, we propose the following energy function, closely related to that of the Hopfield net (but with latent variables) or of the Boltzmann machine (but with continuous nonlinearities inserted):

$$E(s) = \frac{1}{2}\sum_i s_i^2 - \frac{1}{2}\sum_{i \neq j} W_{i,j}\,\rho(s_i)\,\rho(s_j) - \sum_i b_i\,\rho(s_i), \tag{2.7}$$

where $W_{i,j}$ is the weight between unit $i$ and unit $j$ and $\rho$ is the neural nonlinearity, some kind of monotonic bounded function that outputs a value between 0 and 1 corresponding to a firing rate, as, for example, the piecewise-linear saturating function

$$\rho(s_i) = \max(0, \min(1, s_i - \beta)), \tag{2.8}$$

with thresholds $\beta$ and $\delta = \beta + 1$ such that $\beta < 0$ and $\delta > 0$, so that $\rho$ has unit slope on $(\beta, \delta)$. Note that from a biological perspective, the nonlinear transformation $\rho$ can be seen as modeling the monotonically increasing dependency of the actual firing rate (or, equivalently, the probability of firing) on the integrated activity $s_i$ (the expected membrane potential, averaged over the random effects of both pre- and postsynaptic spikes). With this energy function, the driving function would be

$$R_i(s) = \rho'(s_i)\left(b_i + \sum_{j \neq i} W_{i,j}\,\rho(s_j)\right). \tag{2.9}$$

To obtain this, we have assumed that $W_{i,j} = W_{j,i}$. Otherwise, we would get that

$$R_i(s) = \rho'(s_i)\left(b_i + \sum_{j \neq i} \frac{W_{i,j} + W_{j,i}}{2}\,\rho(s_j)\right), \tag{2.10}$$

which automatically symmetrizes the weight matrix when following a weight update like the one we propose below. This formula for the driving input $R_i$ is similar to the usual weighted sum of firing rates, except for the novel factor $\rho'(s_i)$, which is 1 if $\beta < s_i < \delta$ and 0 otherwise.
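To make the gradient claim concrete, the following sketch checks numerically that a driving function of the form $R_i(s) = \rho'(s_i)(b_i + \sum_j W_{i,j}\rho(s_j))$ equals $s_i - \partial E/\partial s_i$ for a Hopfield-style energy $E(s) = \frac{1}{2}\|s\|^2 - \frac{1}{2}\sum_{i\neq j}W_{i,j}\rho(s_i)\rho(s_j) - \sum_i b_i\rho(s_i)$. The hard-saturating $\rho$ with $\beta = -0.5$, the network size, and the random weights are illustrative assumptions, not the authors' settings:

```python
import numpy as np

rng = np.random.default_rng(2)

# Numerical check that the driving function is the state minus the energy
# gradient: R_i(s) == s_i - dE/ds_i (so R - s points down the energy).
beta = -0.5                                   # saturation threshold (illustrative)
def rho(s):   return np.clip(s - beta, 0.0, 1.0)
def rho_p(s): return ((s > beta) & (s < beta + 1.0)).astype(float)

n = 5
W = rng.standard_normal((n, n))
W = (W + W.T) / 2                             # symmetric weights, as assumed above
np.fill_diagonal(W, 0.0)                      # no self-connections
b = rng.standard_normal(n)

def E(s):                                     # Hopfield-style energy
    return 0.5 * s @ s - 0.5 * rho(s) @ W @ rho(s) - b @ rho(s)

s = rng.uniform(-0.4, 0.4, n)                 # stay inside the linear region of rho
R = rho_p(s) * (b + W @ rho(s))               # driving function

# Central finite-difference gradient of E.
h = 1e-6
gradE = np.array([(E(s + h * np.eye(n)[i]) - E(s - h * np.eye(n)[i])) / (2 * h)
                  for i in range(n)])

print(np.max(np.abs(R - (s - gradE))))        # should be ~0
```

Any asymmetric $W$ breaks the identity, which is why the symmetry assumption (or the automatic symmetrization mentioned above) matters.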
This would suggest that when neuron $i$ is saturated ($\rho'(s_i) = 0$, being either shut off or firing at the maximal rate), the external inputs have no impact on its state. The only term that remains in the neural update equation, equation 2.1, is the one that drives the state toward 0, bringing it out of the saturation region and back into a regime where the neuron is sensitive to the outside feedback, so long as 0 is not a saturated value (which is guaranteed by the condition that $\beta < 0 < \delta$). This idea is developed further below.

2.3.1 Fixed-Point Behavior

In particular, it is interesting to note what happens around a fixed point of the state dynamics. Inspection of equation 2.5 reveals that for a fixed point $s_{t+1} = s_t = s$, it must be the case that, in average over the noise,

$$\mathbb{E}_\eta\!\left[\frac{\partial E(\tilde{s})}{\partial \tilde{s}}\right] = 0, \tag{2.11}$$

which means that

$$s_i = \rho'(s_i)\left(b_i + \sum_{j \neq i} W_{i,j}\,\rho(s_j)\right) = R_i(s). \tag{2.12}$$

Let us consider the hypothesis that unit $i$ is saturated: $\rho'(s_i) = 0$. In this case, $R_i(s) = 0$, and the neural update becomes $s_{t+1,i} = (1-\epsilon)\,s_{t,i}$, which converges toward $s_i = 0$ with time constant $\tau$. Thus, a saturated $s_i \neq 0$ could not have been a fixed point. Therefore, assuming that 0 is not a saturated value (i.e., that the origin corresponds to a region where the derivative is significantly nonzero, $\rho'(0) \neq 0$), we get that if $\rho'(s_i) = 0$, the network cannot be at a fixed point. Equivalently, if the whole network has converged to a fixed point, all neurons must be in a sensitive regime, which makes them respond to their inputs.
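The saturation argument can be made concrete with a two-line simulation (the step size and starting state are arbitrary choices): while $\rho'(s_i) = 0$, the driving term vanishes and the leak alone pulls the state back toward 0, so a saturated unit cannot sit still:

```python
import numpy as np

# Sketch of the fixed-point argument: for a saturated unit (rho'(s_i) = 0),
# the driving input R_i vanishes, so the noise-free update of eq. 2.1
# reduces to s <- s + eps * (0 - s) = (1 - eps) * s, which decays toward 0.
# Hence a saturated state s_i != 0 cannot be a fixed point.  (In the full
# dynamics, R_i turns back on once s re-enters the sensitive region.)
eps = 0.1
s = 2.5                        # deep in the saturated region, rho' = 0 there
history = [s]
for _ in range(200):
    s = (1 - eps) * s          # update with R_i = 0
    history.append(s)

print(f"s after 200 steps: {s:.2e}")   # pulled back toward 0
```

The decay is geometric with ratio $(1-\epsilon)$, matching the exponential approach with time constant $\tau$ noted after equation 2.3.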

3 Link to Backpropagation

We are now ready to present the first important result of this article: a link between neural computation as inference in an energy-based model and backpropagation of prediction error gradients.

3.1 Propagation of Perturbations

Consider what happens when a network such as we have described is at equilibrium, that is, when the state is at a fixed point. At that point, according to equation 2.11, the average gradient of the energy over the noise is 0, and weight updates are also 0 on average. To make the link to supervised backpropagation simpler, let us consider two kinds of visible units: input units $x$ and output units $y$, such that $v = (x, y)$ and $s = (x, y, h)$. Suppose that we start by letting the network settle to a fixed point with $x$ clamped to the observed input values. Then we obtain an output $y_0$ at the fixed point $s_0$, where $y_0 = f_y(s_0)$ and $h_0 = f_h(s_0)$. Equivalently, we have that

$$s_0 = f(s_0). \tag{3.1}$$

That is, the free units (hidden and output) have settled to a value that is in agreement with the clamped input units. Now suppose that a target value $y^*$ is observed and gradually drives the output units from their fixed-point value $y_0$ toward $y^*$. This happens because the output units are also leaky integrator neurons, meaning that their state gradually changes based on the input they receive, in the direction of the driving signal. Let us denote

$$\Delta y = y_1 - y_0 = \epsilon\,(y^* - y_0) \tag{3.2}$$

that initial change of $y$ in the direction of the driving signal $y^*$ when going from time step 0 to time step 1 (following the neural update equation, equation 2.1).
Let us consider as a training objective the squared prediction error between the target $y^*$ and the output $y$; driving the outputs toward the target pushes the global state away from the equilibrium where it was sitting and into a region where $\frac{\partial E(s)}{\partial s}$ is nonzero. The objective is given by

$$J = \frac{1}{2}\,\|y^* - y\|^2. \tag{3.3}$$

Since, at equilibrium, $y_0 = f_y(s_0)$, this corresponds to the mismatch between the prediction $f_y(s_0)$ and the target value $y^*$ driving the output units, that is,

$$J = \frac{1}{2}\,\|y^* - f_y(s_0)\|^2. \tag{3.4}$$

Note that

$$\Delta y = \epsilon\,(y^* - y_0) = -\epsilon\,\left.\frac{\partial J}{\partial y}\right|_{y = y_0}, \tag{3.5}$$

which shows that updating $y$ by $\Delta y$ exactly corresponds to performing a step of gradient descent on $y$ with respect to the objective $J$, where $\epsilon$ can be seen as a learning rate. Of course, our real interest is in changing the weights, especially the weights of lower layers. So how would the rest of the network react to this external perturbation? Would hidden neurons also move in the direction of reducing the output prediction error? Let us first consider the top hidden layer $h^1$, which consists of units directly connected to the output units $y$. Starting from the fixed point, we find that the change $\Delta y$ with respect to $y_0$ would propagate to $h^1$ via the update of the neural state, which, in the noise-free condition, is given by

$$\Delta h^1 = f_{h^1}(x, y_0 + \Delta y, h_0) - f_{h^1}(x, y_0, h_0). \tag{3.6}$$

With $\Delta y$ small (arising out of our assumption that the visible units only gradually move toward their target), we can approximate the above by taking the Taylor expansion of $f_{h^1}$ at the fixed point $s_0$, which is given by

$$f_{h^1}(x, y_0 + \Delta y, h_0) = f_{h^1}(s_0) + \frac{\partial f_{h^1}(s_0)}{\partial y}\,\Delta y + O(\|\Delta y\|^2), \tag{3.7}$$

exploiting $f_{h^1}(s_0) = h^1_0$ at the fixed point. Insertion into equation 3.6 yields

$$\Delta h^1 = \frac{\partial f_{h^1}(s_0)}{\partial y}\,\Delta y + O(\|\Delta y\|^2). \tag{3.8}$$

Hence, using equation 3.5, we have that

$$\Delta h^1 = -\epsilon\,\frac{\partial f_{h^1}(s_0)}{\partial y}\,\frac{\partial J}{\partial y} + O(\epsilon^2). \tag{3.9}$$

Note that with the assumed layer-wise neural network structure (i.e., having no connections between neurons of the same layer), the second-order term of the expansion (and the higher-order terms, respectively) is zero if the neural nonlinearity $\rho$ fulfills $\rho'' = 0$.
This is the case for the nonlinearity given in equation 2.8, since it is piecewise linear (and thus all the higher derivatives of $\rho$ are zero). Therefore, the higher-order terms are omitted in the following. In order to obtain backpropagation, what we would like to get is

$$\Delta h^1 \propto -\frac{\partial J}{\partial h^1} = -\left(\frac{\partial f_y(s_0)}{\partial h^1}\right)^{\!\top}\frac{\partial J}{\partial y}, \tag{3.10}$$

as resulting from equation 3.4 and the application of the chain rule. This would exactly fit equation 3.9 if we can show that $\frac{\partial f_{h^1}(s_0)}{\partial y} = \left(\frac{\partial f_y(s_0)}{\partial h^1}\right)^{\!\top}$, which is done in the following. Note that $f$ is a first derivative of a function related to the energy function, that is,

$$f_i(s) = \frac{\partial \Phi(s)}{\partial s_i}, \tag{3.11}$$

where

$$\Phi(s) = \frac{1}{2}\,\|s\|^2 - \epsilon\,E(s). \tag{3.12}$$

Thus, we get $\frac{\partial f_{h^1}}{\partial y} = \frac{\partial^2 \Phi}{\partial y\,\partial h^1}$ and $\frac{\partial f_y}{\partial h^1} = \frac{\partial^2 \Phi}{\partial h^1\,\partial y}$, and it becomes clear that $\frac{\partial f_{h^1}}{\partial y}$ and $\frac{\partial f_y}{\partial h^1}$ are cross-derivatives of $\Phi$. As we know that cross-derivatives are symmetric, we have

$$\frac{\partial f_{h^1}(s_0)}{\partial y} = \left(\frac{\partial f_y(s_0)}{\partial h^1}\right)^{\!\top}. \tag{3.13}$$

We are now ready to rewrite $\Delta h^1$, equation 3.9, in a form that equates it with a backpropagated gradient:

$$\Delta h^1 = -\epsilon\,\frac{\partial f_{h^1}(s_0)}{\partial y}\,\frac{\partial J}{\partial y} = -\epsilon\left(\frac{\partial f_y(s_0)}{\partial h^1}\right)^{\!\top}\frac{\partial J}{\partial y} = -\epsilon\,\frac{\partial J}{\partial h^1}.$$

Similarly, the perturbation $\Delta h^1$ will be transmitted at the next time step to the units $h^2$ that are directly connected to $h^1$ (but not to $y$). We have

$$\Delta h^2 = f_{h^2}(x, y_0 + \Delta y, h^1_0 + \Delta h^1, h^{\geq 2}_0) - f_{h^2}(s_0), \tag{3.14}$$

where we use $h^{\geq 2}$ to denote all hidden layers except the first one. Here, the Taylor expansion yields

$$\Delta h^2 = \frac{\partial f_{h^2}(s_0)}{\partial y}\,\Delta y + \frac{\partial f_{h^2}(s_0)}{\partial h^1}\,\Delta h^1, \tag{3.15}$$

where the first term on the right-hand side is equal to zero assuming the proposed network structure, since with no connections between $h^2$ and $y$, we obtain $\frac{\partial f_{h^2}}{\partial y} = 0$. Following the same line of arguments as before, we get $\Delta h^2 \propto -\frac{\partial J}{\partial h^2}$. Analogously, the update of layer $h^k$ in the $k$th time step is

$$\Delta h^k = \frac{\partial f_{h^k}(s_0)}{\partial h^{k-1}}\,\Delta h^{k-1} \propto -\frac{\partial J}{\partial h^k}. \tag{3.16}$$

Thus, we have shown so far that neural computation as inference in an energy-based model performs backpropagation of (weighted) prediction error gradients.
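The symmetry step (equation 3.13) is easy to verify numerically. The sketch below (sizes, thresholds, and step size are illustrative assumptions) builds the noise-free update $f(s) = s + \epsilon(R(s) - s)$ for a small symmetric-weight network, estimates its Jacobian by finite differences, and checks that the cross-blocks are transposes of each other:

```python
import numpy as np

rng = np.random.default_rng(3)

# With symmetric weights, f is the gradient of a scalar potential, so its
# Jacobian is symmetric; in particular the block d f_h / d y must equal the
# transpose of d f_y / d h.  Units 0-2 play the role of "y", units 3-5 of "h".
beta = -0.5
def rho(s):   return np.clip(s - beta, 0.0, 1.0)
def rho_p(s): return ((s > beta) & (s < beta + 1.0)).astype(float)

n, eps = 6, 0.1
W = rng.standard_normal((n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.standard_normal(n)

def f(s):                                    # one noise-free update step
    R = rho_p(s) * (b + W @ rho(s))
    return s + eps * (R - s)

s0 = rng.uniform(-0.4, 0.4, n)               # inside the sensitive region of rho
h = 1e-6
Jac = np.empty((n, n))
for j in range(n):                           # finite-difference Jacobian of f
    e = np.zeros(n); e[j] = h
    Jac[:, j] = (f(s0 + e) - f(s0 - e)) / (2 * h)

dfh_dy = Jac[3:, :3]                         # d f_h / d y block
dfy_dh = Jac[:3, 3:]                         # d f_y / d h block
print(np.max(np.abs(dfh_dy - dfy_dh.T)))     # should be ~0
```

Breaking the weight symmetry destroys this property, which is exactly the role of the symmetry assumption (or automatic symmetrization) from section 2.3.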
3.2 Stochastic Gradient Descent Weight Update

The above result is consistent with and inspired by the idea previously proposed by Hinton (2007) that temporal change can encode backpropagated gradients. What would it take for the change $\Delta h^k$ at layer $h^k$ to turn into a stochastic gradient descent (SGD) weight update with respect to the prediction error $J$? Let us assume that $W_{i,j}$ is the weight between neuron $i$ in layer $h^k$ and neuron $j$ in layer $h^{k-1}$. Let us further recall that $\Delta h^k_i$, as described by equation 3.16, is proportional to $-\frac{\partial J}{\partial h^k_i}$.
If we regard $J$ as a function of $W_{i,j}$, given by inserting equation 2.9 into equation 2.1 and applying the chain rule, we would get the following weight update:

$$\Delta W_{i,j} \propto -\frac{\partial J}{\partial W_{i,j}} = -\frac{\partial J}{\partial s_i}\,\frac{\partial s_i}{\partial W_{i,j}} \propto -\frac{\partial J}{\partial s_i}\,\rho(s_{t,j}). \tag{3.17}$$

Since $-\frac{\partial J}{\partial s_i}$ is proportional to the temporal change $\dot{s}_i \approx s_{t,i} - s_{t-1,i}$, we would obtain the following weight update:

$$\Delta W_{i,j} \propto \dot{s}_i\,\rho(s_{t,j}). \tag{3.18}$$

Interestingly, it turns out that such a learning rule allows simulating the relationship between spike timing and synaptic change according to STDP, as shown by a theoretical argument and simulations in the following section. This analysis is valid only in the early stages of the propagation of the perturbation. (See Scellier & Bengio, 2016, for a mathematical framework allowing one to correctly extend the gradient computation to the case where the perturbation on the output units fully propagates to the lower layers until the latter reach a fixed point; it requires propagating not just the initial perturbation but waiting until the network reaches a modified fixed point or equilibrium, yet it yields the same learning rule.) With the above STDP-compatible learning rule, the change in weights due to the initial perturbation would be approximately proportional to the backpropagation update. However, note that the multiplicative factor $\epsilon^k$ for units at layer $h^k$ makes the initial changes much slower for the lower layers. This is because the leaky integrator neurons have not had time to integrate the information yet, so practically, it will take on the order of the time constant $\tau$ (from equation 2.3) times $k$ for the change in $h^k$ to become significant, unless we adjust the per layer learning rates accordingly.
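A minimal sketch of the rule in equation 3.18, with made-up trajectories (the rising postsynaptic state stands in for a unit being pushed up by backpropagated pressure; the learning rate and rates are illustrative): the presynaptic rate multiplies the discrete temporal derivative of the postsynaptic state, so rising activity potentiates the weight and falling activity would depress it:

```python
import numpy as np

# Rate-based update in the style of eq. 3.18: dW_ij is proportional to the
# presynaptic firing rate rho(s_j) times the temporal rate of change of the
# postsynaptic state, approximated in discrete time by s_{t,i} - s_{t-1,i}.
alpha = 0.01                           # learning rate (illustrative)
T = 100
t = np.arange(T)
s_post = np.tanh(t / 20.0)             # postsynaptic activity rising over time
rho_pre = np.full(T, 0.8)              # constant presynaptic firing rate

W = 0.0
for k in range(1, T):
    ds = s_post[k] - s_post[k - 1]     # discrete temporal derivative
    W += alpha * rho_pre[k] * ds       # eq. 3.18-style update

# Rising postsynaptic activity with an active presynaptic unit => potentiation.
print(f"net weight change: {W:+.4f}")
```

Because the increments telescope, the net change over the window is proportional to the total rise of the postsynaptic state times the (constant) presynaptic rate, which previews the averaging argument of section 4.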
Interestingly, as Scellier and Bengio (2016) show, the factor $\epsilon^k$ disappears if we go beyond the initial perturbation propagation and allow the network to settle after nudging the output units by $\Delta y$. This allows one to show that the resulting weight update actually corresponds to stochastic gradient descent on the prediction error, provided the weights are also constrained to remain symmetric: $W_{i,j} = W_{j,i}$.

4 Link to Spike-Timing-Dependent Plasticity

We now present the second main contribution of this work: we analyze the weight update rule given by equation 3.18 in more detail and establish a connection to STDP. With the opposite sign, such a weight update rule was proposed by Hinton and McClelland (1988) well before the STDP behavior was reported. A similar rule was studied by Xie and Seung (2000), who also drew a link to STDP. They started by assuming a particular pattern relating spike timing and weight change and then showed that the resulting weight change could be approximated by the product of the presynaptic firing rate and the temporal rate of change of the postsynaptic firing rate. Instead, we go in the other direction, showing that if the weight change is proportional to the product of the presynaptic firing rate (or simply the presence of a spike) and the temporal rate of change of the postsynaptic activity, then we recover a relationship between spike timing and weight change that has the precise characteristics of the one observed experimentally by biologists. We present both an easy-to-understand theoretical justification for this result (see section 4.1), as well as simulations that confirm it (see section 4.2).

4.1 Spike-Timing-Dependent Plasticity

Spike-timing-dependent plasticity is a central subject of research in synaptic plasticity. However, much more research is needed to solidify the links between STDP and a machine learning interpretation of it at the scale of a whole network, that is, at the scale of a network with "hidden layers" which need to receive a useful training signal. (See Markram, Gerstner, & Sjöström, 2012, for a recent review of the neuroscience research on STDP.)
We present here a simple view of STDP that has been observed at least in some common neuron types and preparations: there is a weight change if there is a presynaptic spike in the temporal vicinity of a postsynaptic spike. This change is positive if the postsynaptic spike happens just after the presynaptic spike (and larger if the timing difference is small) and negative if it happens just before (and, again, larger if the timing difference is small, on the order of a few milliseconds). This is illustrated in Figure 1 from Bi and Poo (2001) based on biological data. The amount of change decays to zero as the temporal difference between the two spikes increases beyond a few tens of milliseconds. We are thus interested in this temporal window around a presynaptic spike during which a postsynaptic spike, before or after the presynaptic spike, induces a change of the weight. Keep in mind that the above pattern of spike-timing dependence is only one aspect of synaptic plasticity. (See Feldman, 2012, for an overview of aspects of synaptic plasticity that go beyond spike timing, including firing rate, as one would expect from this work, but also synaptic cooperativity—nearby synapses on the same dendritic subtree—and depolarization—due to multiple consecutive pairings or spatial integration across nearby locations on the dendrite, as well as the effect of the synapse’s distance to the soma.) 4.1.1 Rate-Based Aspect of STDP A hypothesis inspired by Xie and Seung (2000) and Hinton (2007) is that the STDP weight change can be associated with the temporal rate of change of postsynaptic activity, as a proxy for its association with post minus presynaptic spike times. Both of the above contributions focus on this rate aspect of neural activity, and this is also the case of the synaptic update rule, equation 3.18, studied here. 
Since stochastic gradient descent cares only about the average change, instead of equation 3.18 we could also write

$$\Delta W_{i,j} \propto \dot{s}_i\,S_j, \tag{4.1}$$

where $S_j$ is the binary indicator of a spike from neuron $j$, which occurs with an average rate of $\rho(s_j)$. This works, for example, if we assume that the input spikes are randomly drawn from a Poisson process with a rate proportional to $\rho(s_j)$. This assumption ignores spike synchrony effects (which we will do in the rest of this article and which we leave for future work to investigate).

4.1.2 Why It Works

Why does the proposed rate-based weight update mimic the STDP behavior observed during biological experiments? Consider a rising postsynaptic activity level, $\dot{s}_i > 0$, as in Figure 2, and a presynaptic spike occurring somewhere during that rise. Let us assume a Poisson process for the postsynaptic spike as well, with a rate proportional to $\rho(s_i)$. According to this assumption, the rising postsynaptic activity makes postsynaptic spikes more likely in a fixed time window following the presynaptic spike than in a window of the same length preceding it. Therefore, if only one spike happens in the whole time interval, it is more likely to be after the presynaptic spike, yielding a positive spike timing difference and, thus, a positive weight change. The situation would be symmetrically reversed if the activity level was decreasing (i.e., $\dot{s}_i < 0$), which would make negative weight changes more likely. Furthermore, the stronger the slope is, the greater this effect will be.
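This argument can be checked with a small Monte Carlo experiment (the window length and rate values are arbitrary choices, not from the article): draw Bernoulli postsynaptic spikes with a rising rate, keep the trials with exactly one postsynaptic spike, and count on which side of a presynaptic spike placed at the window's center it falls:

```python
import numpy as np

rng = np.random.default_rng(4)

# With a rising postsynaptic firing probability, a single postsynaptic spike
# in the window is more likely to land AFTER the (centered) presynaptic
# spike than before it, yielding a positive spike-timing difference.
T = 40                                  # 20 steps before and 20 after the pre spike
rate = np.linspace(0.01, 0.05, T)       # rising postsynaptic firing probability
after = before = 0
for _ in range(50_000):
    spikes = rng.random(T) < rate       # Bernoulli approximation of a Poisson draw
    if spikes.sum() == 1:               # keep trials with exactly one post spike
        idx = int(np.argmax(spikes))
        if idx >= T // 2:
            after += 1
        else:
            before += 1

print(f"post spike after pre: {after}, before pre: {before}")
```

The "after" count dominates roughly in proportion to the ratio of the total rate in the two half-windows, mirroring the asymmetry of the biological STDP curve.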
A larger slope also reduces the expected relative spike timing between the presynaptic spike and stochastically occurring postsynaptic spikes, which for a strong slope possibly occur just after or just before the presynaptic spike. When spikes occur on both sides and with about the same time difference, the effects on $\Delta W_{i,j}$ cancel each other.

4.2 Simulation Results

To validate the hypothesis that equation 3.18 (or equation 4.1, respectively) yields a relationship between spike timing difference and weight change that is consistent with the biological observations shown in Figure 1, we ran the simulations presented in the following sections.

4.2.1 Method

We simulate random variations in the presynaptic neural firing rate as well as random variations in the postsynaptic firing rate induced by an externally driven voltage. By exploring many configurations of variations and levels at pre- and postsynaptic sides, we hope to cover the possible natural variations. The Poisson process underlying the spike generation (as described in section 4.1) is approximated by performing a discrete-time simulation with a Bernoulli draw of the binary decision spike-versus-no-spike within the time interval of interest. That is, we generate and record pre- and postsynaptic spikes sampled according to a Bernoulli draw at each discrete time step $t$ with probability proportional to $\rho(s_{t,j})$ and $\rho(s_{t,i})$, respectively, and record $s_{t,i}$ as well, in order to implement either equation 3.18 or a classical nearest-neighbor STDP update rule (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Morrison, Diesmann, & Gerstner, 2008; Van Rossum, Bi, & Turrigiano, 2000).2 For every presynaptic spike, we consider a window of 20 time steps before and after. If we have one or more postsynaptic spikes in both the left and right windows, no postsynaptic spike at all, or a single postsynaptic spike that coincides with the presynaptic spike, the weight is not updated.
Otherwise, we measure the time difference between the closest postsynaptic spike (the nearest neighbor) and the presynaptic spike and compute the weight change using the current variable values. To compute appropriate averages, 500 random sequences of rates are generated, each of length 160 time steps, and 1,000 randomly sampled spike trains are generated according to these rates. To measure the effect of the weight changes, we measure the average squared rate of change $\dot{s}_i^2$ (or rather its discrete-time approximation $(s_{i,t+1} - s_{i,t})^2$) in two conditions: with weight changes (according to equation 3.18) and without.

4.2.2 Results

An example spike sequence, the integrated postsynaptic activity, and the corresponding weight values are illustrated in Figure 3. Figure 4 shows the results of these simulations, comparing the weight change obtained at various spike timing differences for equation 3.18 (left) and for the nearest-neighbor STDP rule (right); both match the biological data well (see Figure 1). Figure 5 shows that the two update rules are strongly correlated, in the sense that, for a given amount of weight change induced by one rule, we observe on average a linearly proportional weight change by the other rule.
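A stripped-down version of this protocol can be sketched as follows. This is a simplified variant of our own making, not the article's code: it omits the skip conditions above, uses an exponential nearest-neighbor STDP kernel, piecewise-linear random rates, and fewer sequences and spike trains; every parameter is illustrative.

```python
import math
import random

random.seed(1)

def ramp_rates(n, lo=0.02, hi=0.3, seg=40):
    """Piecewise-linear random firing-rate trajectory in [lo, hi]
    (an illustrative stand-in for the random rate variations in the text)."""
    levels = [random.uniform(lo, hi) for _ in range(n // seg + 2)]
    rates = []
    for t in range(n):
        k, f = divmod(t, seg)
        rates.append(levels[k] + (levels[k + 1] - levels[k]) * f / seg)
    return rates

def sequence_pair(n=160, window=20, tau=5.0, n_trains=20):
    """Average (rate-based update, nearest-neighbor STDP update) over several
    Bernoulli spike trains drawn from one random pair of rate trajectories."""
    rho_pre, rho_post = ramp_rates(n), ramp_rates(n)
    dw_rate = dw_stdp = 0.0
    for _ in range(n_trains):
        pre = [random.random() < p for p in rho_pre]
        post_times = [t for t in range(n) if random.random() < rho_post[t]]
        # Rate-based rule: presynaptic spike times change of postsynaptic rate.
        dw_rate += sum(rho_post[t] - rho_post[t - 1] for t in range(1, n) if pre[t])
        # Simplified nearest-neighbor STDP: signed exponential of timing difference.
        for t in (t for t in range(n) if pre[t]):
            near = [u for u in post_times if 0 < abs(u - t) <= window]
            if not near:
                continue
            d = min(near, key=lambda u: abs(u - t)) - t
            dw_stdp += math.copysign(math.exp(-abs(d) / tau), d)
    return dw_rate / n_trains, dw_stdp / n_trains

def pearson(pairs):
    """Sample Pearson correlation between the two updates across sequences."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sum((x - mx) * (y - my) for x, y in pairs) / (sx * sy)
```

Correlating the two averaged updates across many random rate trajectories yields a clearly positive correlation, qualitatively in line with the proportionality reported for Figure 5.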

5 Related Work

An important inspiration for this work is the idea proposed by Hinton (2007) that brains can implement backpropagation by using temporal derivatives to represent activation gradients, together with the suggestion that combining this assumption with STDP would approximately yield SGD on the synaptic weights. The idea of neural computation corresponding to a form of stochastic relaxation toward lower-energy configurations is, of course, very old, going back at least to the Boltzmann machine (Hinton & Sejnowski, 1986) and its Gibbs sampling procedure. (For more recent work in this direction, see also Berkes et al., 2011.) What differs here from the Boltzmann machine is, first, that we consider the state space to be continuous (associated with the expected voltage potential, integrating out the random effects due to spikes) rather than discrete, and, second, that we consider very small steps going down the gradient of the energy, which is more like a Langevin MCMC, rather than allowing each neuron to stochastically jump with higher probability to its optimal state given its neighbors' configuration, as Gibbs sampling does. The closest work to the STDP-related weight update proposed here is probably that of Xie and Seung (2000), which we have already mentioned. Notable additions brought by our analysis include demonstrating that the relationship of spike timing to weight change is a consequence of equation 3.18, rather than the other way around (and a small difference in the use of $\dot{s}_i$ instead of $\dot{\rho}(s_i)$). The proposed rule can be contrasted with theoretical synaptic learning rules that are based on the Hebbian product of pre- and postsynaptic activity, such as the BCM rule (Bienenstock, Cooper, & Munro, 1982; Intrator & Cooper, 1992).
The main difference is that the rule studied here involves the temporal derivative of the postsynaptic activity rather than its actual level. There are, of course, many other theoretical interpretations of STDP; many references can be found in Markram et al. (2012). But more work is needed to explore the connection of STDP to an unsupervised learning objective that could be used to train not just a single-layer network (like PCA and traditional Hebbian updates) but also a deep unsupervised model. Many approaches (Fiete & Seung, 2006; Rezende & Gerstner, 2014) rely on variants of the REINFORCE algorithm (Williams, 1992) to estimate the gradient of a global objective function, essentially by correlating stochastic variations at each neuron with changes in the global objective. Although this principle is simple, it is not clear that it will scale to very large networks, because the variance of the estimator grows linearly with the number of neurons, which makes it tempting to explore other avenues.
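The variance issue can be illustrated with a toy perturbation-based estimator of our own construction (the objective f(w) = Σᵢ wᵢ and all parameters are arbitrary choices for illustration). Each unit's gradient is estimated by correlating a shared, global objective change with that unit's own perturbation; the estimate is unbiased, but every other unit's noise leaks into it, so its variance grows roughly linearly with the number of perturbed units.

```python
import random

random.seed(2)

def reinforce_grad_var(n, sigma=0.1, n_samples=20000):
    """Empirical variance of one coordinate of a REINFORCE-style gradient
    estimate for the toy objective f(w) = sum(w), obtained by correlating
    the global objective change with per-unit Gaussian perturbations."""
    estimates = []
    for _ in range(n_samples):
        eps = [random.gauss(0.0, 1.0) for _ in range(n)]
        delta_f = sigma * sum(eps)  # f(w + sigma*eps) - f(w) for f(w) = sum(w)
        estimates.append(delta_f * eps[0] / sigma)  # estimate of df/dw_0 (= 1)
    m = sum(estimates) / n_samples
    return sum((g - m) ** 2 for g in estimates) / n_samples
```

For this objective the variance of the estimate is exactly n + 1, so going from 10 to 100 units inflates it roughly tenfold even though the true gradient coordinate stays 1.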

6 Open Questions and Future Work

This work is only a stepping-stone on the way to a complete theory of learning that is both biologically plausible and makes sense from a machine learning perspective. Many open questions remain to be answered, especially on the side of biological plausibility. If the energy function proposed in equation 2.7 is closer to the biological truth than the energy defined for continuous Hopfield networks (Hopfield, 1984), then the average soma voltage potential should be insensitive to changes in the input firing rates when the neuron is saturated (completely turned off or maximally turned on), and synaptic weight changes should also vanish under this condition, as seen by inspection of equation 3.18. It would clearly be interesting to test these predictions in actual biological experiments. Another open question is how to reconcile the need for symmetric weights when we introduce an energy function with the fact that $W_{ij}$ and $W_{ji}$ are stored at two physically different places in biological neurons. An encouraging observation is that earlier work on autoencoders empirically showed that even when the forward and backward weights are not tied, they tend to converge to symmetric values; in the linear case, minimization of the reconstruction error automatically yields symmetric weights (Vincent, Larochelle, Lajoie, Bengio, & Manzagol, 2010). Another encouraging piece of evidence, also linked to autoencoders, is the theoretical result of Arora, Liang, and Ma (2015) showing that the symmetric solution minimizes the autoencoder reconstruction error between two successive layers of rectifying (ReLU) units. In addition, Lillicrap et al.
(2014) show that even when the feedback weights do not match the feedforward weights, backpropagation still works, because the feedforward weights end up aligning with the feedback weights through feedback alignment. This suggests that the strict weight-symmetry condition can be relaxed without hurting the learning procedure. While the work presented here focuses on supervised learning, much remains to be done to obtain a complete probabilistic theory of unsupervised learning that is consistent with STDP. We believe that we have put interesting ingredients in place and that some of the ideas here could be helpful for building such a theory. Many open questions also remain in the field of neural plasticity. STDP describes only one view of synaptic plasticity, and plasticity may differ across neurons and preparations. A particularly interesting type of result to consider in the future concerns the relationships among more than two spikes induced statistically by the proposed STDP rule. It would be interesting to compare the statistics of triplets or quadruplets of spike timings and the associated weight changes under the proposed rule against the corresponding biological observations (Froemke & Dan, 2002). Future work should also attempt to reconcile the theory with neuroscience experiments showing more complex relationships between spikes and weight change, involving factors other than timing, such as firing rate, of course, but also synaptic cooperativity and depolarization (Feldman, 2012). In addition to these factors, some synaptic changes do not conform to the STDP pattern, in either sign (anti-Hebbian STDP) or nature (e.g., incorporating a general Hebbian element, as in the BCM rule).
In this regard, although this work suggests that some spike timing behavior may be explained purely in terms of the rate trajectories, it remains an open question whether (and where) the precise timing of spikes is essential to explain network-level learning behavior from a machine learning perspective. It is interesting to note in this regard how the BCM rule was adapted to account for the relative timing of spike triplets (Gjorgjieva, Clopath, Audet, & Pfister, 2011). In artificial neural networks, communication between neurons usually corresponds to real-valued messages. While this works very well for artificial neurons, it does not fit their biological counterparts. However, connections can be drawn between these real-valued messages and spiking (i.e., binary) messages. A real value $p \in [0, 1]$ can be sent with binary messages by sending the value one a fraction $p$ of the time and the value zero the rest of the time, so that the message has the right expected value. Recent work has in fact shown that stochastically quantizing the activations of neural networks still allows them to be trained (Hubara, Courbariaux, Soudry, El-Yaniv, & Bengio, 2016). As Hinton (2016) explained, sending noisy binary messages drawn from a Poisson process is similar to dropout, a powerful regularizer that allows one to train networks with more parameters than the number of available examples.
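The unbiased binary-encoding idea fits in a few lines of code (a trivial sketch of our own, not tied to any particular model in the article): each binary "spike" carries no information about p on its own, but the empirical mean of many spikes recovers it.

```python
import random

random.seed(4)

def stochastic_bits(p, n):
    """Encode a real value p in [0, 1] as n binary 'spikes' whose
    empirical mean converges to p (unbiased stochastic quantization)."""
    return [1 if random.random() < p else 0 for _ in range(n)]

bits = stochastic_bits(0.3, 100000)
estimate = sum(bits) / len(bits)  # close to 0.3 for large n
```

By the law of large numbers the standard error of the estimate shrinks as 1/√n, which is the sense in which a spike train can stand in for a real-valued message.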

7 Conclusion

This work presents a piece of a theory that aims at closing the gap between learning in artificial and biological neural networks. We started by showing that neural updates based on leaky integration correspond to Langevin MCMC steps in an energy-based model with continuous hidden variables. Neural computation can thus be seen as performing probabilistic inference, in the sense of moving toward states that have higher probability under the energy-based probability distribution associated with the model. Starting from a stationary point and letting the visible units receive an outside driving force pushing them toward some target value, the early steps of neural computation correspond to propagating error gradients into internal layers, similar to backpropagation in supervised artificial neural networks. The propagated error gradients can be turned into a weight update that performs stochastic gradient descent, minimizing the difference between the state of the visible units and the target value. Scellier and Bengio (2016) have in fact shown through experiments that it is possible to train deep supervised neural networks using this approach. They have also extended the theoretical framework presented here to prove that after the output units have been nudged toward a better value (according to the prediction error) and the dynamics have settled very near a fixed point, applying a contrastive Hebbian update similar to the one presented here corresponds to gradient descent on the prediction error. The resulting weight update is proportional to the rate of change of the postsynaptic activity times the presynaptic activity.
Interestingly, as we have shown through simulations and a qualitative argument, it yields a behavior similar to the STDP observations in the brain that depend on spike timing differences. On top of underlining the biological plausibility of the analyzed learning process, this result provides an interesting machine learning interpretation for STDP, since artificial neural networks are generally cast in terms of firing rates.
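As a minimal illustration of the Langevin-MCMC view, the sketch below (a toy of our own: the quadratic energy E(s) = s²/2, step size, and duration are arbitrary choices) runs discretized Langevin dynamics, s ← s − ε ∂E/∂s + √(2ε) ξ, and checks that the resulting samples match the associated Gibbs distribution exp(−E), here a standard normal.

```python
import random

random.seed(5)

def langevin_samples(n_steps=200000, eps=0.01):
    """Discretized Langevin dynamics for the toy energy E(s) = s^2 / 2:
    s <- s - eps * dE/ds + sqrt(2 * eps) * gaussian noise.
    The Gibbs distribution exp(-E) is a standard normal."""
    s, out = 0.0, []
    for _ in range(n_steps):
        s += -eps * s + (2 * eps) ** 0.5 * random.gauss(0.0, 1.0)
        out.append(s)
    return out

samples = langevin_samples()
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
```

Each step only nudges the state down the energy gradient plus noise, which is the small-step, leaky-integration regime the article contrasts with the large jumps of Gibbs sampling; the stationary mean and variance nevertheless come out near 0 and 1.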

Acknowledgments We thank Benjamin Scellier, Dong-Hyun Lee, Jyri Kivinen, Jorg Bornschein, Roland Memisevic, and Tim Lillicrap for feedback and discussions, as well as NSERC, CIFAR, Samsung, and Canada Research Chairs for funding.

References