The feedforward network

The network consists of an input layer with N input units and an output unit functioning as a leaky integrated and fire (LIF) neuron (see Output production). The input units are connected to the output unit via N synaptic weights, \({w}_{m}\) (Fig. 2), or via K = N/5 dendritic strengths, \({J}_{i}\) (Fig. 3). In the synaptic scenario, \(\{{w}_{m}\}\) are the tunable parameters (Fig. 2), whereas for the dendritic scenario, \(\{{J}_{i}\}\) are the tunable parameters while \(\{{w}_{m}\}\) are time-independent (Fig. 3).

The supervised learning algorithm

The scenario of supervised learning by a biological perceptron is examined using a teacher and a student. The mission of the student is to imitate the responses, i.e. the outputs, of the teacher, where both have the same architecture. For each input the teacher produces an output. The timings and the amplitudes as well as the resulting teacher’s outputs for each input unit, are provided to the student. Those input/output relations constitute the entire information provided to the student for each input. The algorithm is composed of 3 parts: output production, weights adaptation and learning.

Output production: An identical asynchronous input, example, is given to the teacher and the student, each produces its output according to their weights and decaying input summation, OT and OS, respectively (see Output production – Leaky integrate and fire neuron).

Weight adaptation: For each input unit the teacher preforms weights adaptation next to its output production, following its input/output (see Adaptation). The student preforms the same adaptation as the teacher, unless otherwise stated (see Student’s adaptation).

Learning: The student preforms learning steps, unless otherwise stated, on weights with conflicting outputs with the teacher, i.e. O m T ≠ O m S for the mth input unit.

Inputs generation

Each input is composed of N/2 randomly stimulated input units. For each stimulated unit a random delay and a stimulation amplitude are chosen from given distributions. The delays are randomly chosen from a uniform distribution with a resolution of 0.01 ms, such that the average time-lag between two consecutive stimulations is 5 ms. Stimulation amplitudes were randomly chosen from a uniform distribution in the range [0.8,1.2]. Note that the reported results are qualitatively robust to the scenario where all non-zero amplitudes equal 1. In the dendritic scenario, the five \({w}_{m}\) connected to the same dendrite were stimulated sequentially in a random order and with an average time-lag of 5 ms between consecutive stimulations.

Output production – Leaky integrate and fire neuron

In the synaptic adaptation scenario, the voltage of the output unit is described by the leaky integrate and fire (LIF) model

$$\begin{array}{c}\frac{dV}{dt}=-\,\frac{V-{V}_{st}}{{\rm T}}+\mathop{\sum }\limits_{m=1}^{N}{w}_{m}\delta (t-{\tau }_{m})\end{array}$$ (3)

where V(t) is the scaled voltage, Τ = 20 ms is the membrane time constant and V st = 0 stands for the scaled stable (resting) membrane potential. \({w}_{m}\) and τ m stand for the mth weight and delay, respectively. A spike occurs when the voltage crosses the threshold, V(t) ≥ 1 and at that time the output unit produces an output of 1, otherwise the output is 0. After a spike occurs, the voltage is set to V st . For simplicity, we scale the equation such that V th = 1, V st = 0, consequently, V ≥ 1 is above threshold and V < 1 is below threshold. Nevertheless, results remain the same for both the scaled and unscaled equations, e.g. V st = −70 mV and V th = −54 mV. The initial voltage was set to V (t = 0) = 0.

In the dendritic adaptation scenario, the voltage of each dendritic terminal is described by

$$\begin{array}{c}\frac{d{V}_{i}}{dt}=-\,\frac{{V}_{i}-{V}_{st}}{{\rm T}}+{J}_{i}\cdot \mathop{\sum }\limits_{m=\frac{N}{K}(i-1)+1}^{\frac{N}{K}\cdot i}{w}_{m}\delta (t-{\tau }_{m})\end{array}$$ (4)

where V i (t) and \({J}_{i}\) stand for the voltage and the strength of the ith dendrite, respectively. The rest of the parameters are identical to the synaptic adaptation scenario.

Adaptation

The adaptation for the synaptic scenario, is done according to

$$\begin{array}{c}{w}_{m}^{+}={w}_{m}\cdot (1+\delta (t))\end{array}$$ (5)

where t is the time-lag between a sub-threshold stimulation to \({w}_{m}\) (stimulation that didn’t evoke spike, output 0) and an evoked spike. Similarly, the dendritic adaptation is given by

$$\begin{array}{c}{J}_{i}^{+}={J}_{i}\cdot (1+\delta (t))\end{array}$$ (6)

where t now is the time-lag between a sub-threshold stimulation at \({J}_{i}\) and an evoked spike from another dendrite. For both scenarios

$$\begin{array}{c}\delta (t)=A\cdot \exp (-\frac{t}{15})\cdot sign(t)\end{array}$$ (7)

representing the strengthening/weakening of a weight conditioned to a prior/later evoked spike at a time delay t, respectively, where a cutoff time window of 50 ms is enforced (Fig. S1). For simplicity, a step function,

$$\begin{array}{c}\delta (t)=\pm \,A\end{array}$$ (8)

was used for all time delay t, unless otherwise stated. However, all results are robust to adaptation in the form of either exponential decay or a step function.

Student’s adaptation

In order to perform the same adaptation as the teacher, the required information is the teacher’s temporal input/output relations. Note that although the student performs the same adaptation steps as the teacher, it does not necessarily ensure tracking of the parameters of the teacher, since the changes in the weights are relative to the current value of the weights of the student.

Learning

Learning steps were performed on the student’s weights with conflicting output with the teacher. This learning rule is based on a gradient descent dynamics, which minimizes a cost function

$${C}_{m}=({V}^{T}-{V}^{S})\cdot ({O}_{m}^{T}-{O}_{m}^{S})$$

that measures the deviation of the student voltage form the teacher voltage in case of an error (unmatched spike timings between the teacher and the student). A spike is considered as V=1. The change in the weights \({w}_{m}\) is proportional to the negative derivative of the cost function relative to the weight.

$$\begin{array}{ccc}{\rm{\Delta }}{w}_{m}\propto -\,\frac{d{C}_{m}}{d{w}_{m}} & = & -\frac{d{C}_{m}}{d{V}^{S}}\frac{d{V}^{S}}{d{w}_{m}}({O}_{m}^{T}-{O}_{m}^{S})=\frac{d{V}^{S}}{d{w}_{m}}({O}_{m}^{T}-{O}_{m}^{S})\\ & = & {x}_{m}\exp (\frac{t-{\tau }_{m}}{T})({O}_{m}^{T}-{O}_{m}^{S}).\end{array}$$

For simplicity, the weighted exponential prefactor is neglected, but qualitative results remain similar for both cases. Consequently, the learning step for the synaptic scenario is similar to the traditional perceptron learning algorithm

$$\begin{array}{c}{w}_{m}^{+}={w}_{m}+\lambda \cdot ({O}_{m}^{T}-{O}_{m}^{S})\cdot {x}_{m}\end{array}$$ (9)

and similarly for the dendritic scenario

$$\begin{array}{c}{J}_{i}^{+}={J}_{i}+\lambda \cdot ({O}_{m}^{T}-{O}_{m}^{S})\cdot {x}_{m}.\end{array}$$ (10)

λ denotes the learning step and O m T and O m S are the outputs of the teacher and the student at the mth input unit in the ith dendrite, respectively, and \({x}_{m}\) denotes the stimulation amplitude of the mth input unit.

Calculating the generalization error

The generalization error is estimated every 2000 (1000) inputs in the synaptic (dendritic) scenario. The estimation consists of up to 250,000 inputs presented to the teacher and the student. The generalization error is defined as

$$\begin{array}{c}{\varepsilon }_{g}=\frac{total\,no.\,of\,mismatch\,firing\,}{total\,no.\,of\,stimulations}.\end{array}$$ (11)

For each measured ε g (p) the generalization error is measured at least three times, and the average is presented with the largest error bar among all measured ε g (p).

Figure 2: \(\{{w}_{m}\}\) were chosen from a uniform distribution in the range [0.1, 0.9] and then were normalized to a mean equals to 0.5. Adaptation was following Eq. (8) with A = 0.003, and learning was following Eq. (9) with λ = 1/1000. \({w}_{m}\) was bounded from above by 1.5 and from below by 10−4.

In panel B, the STD of the generalization error was in the order of the size of the circles, and therefore not shown in the graph.

In panel C, two types of firing fractions are presented. The first consists of the total number of spikes normalized by N. The second indicates the number of spikes induced by input summation normalized by N, i.e. evoked spikes by weights obeying \({w}_{m}\cdot {x}_{m} < 1\).

In panel D, the normalized histogram of the synaptic weights at p = 6 * 104 is plotted using 75 bins.

In panel E, the error of a perceptron was calculated according to

$$\begin{array}{c}{\varepsilon }_{g}=\frac{1}{\pi }{\cos }^{-1}(R)\end{array}$$ (12)

where R is the normalized overlap between the teacher’s and the student’s weights

$$R=\frac{({w}^{S}-1)\cdot ({w}^{T}-1)}{||{w}^{S}-1||\cdot ||{w}^{T}-1||}.$$ (13)

Figure 3: \(\{{w}_{m}\}\) were chosen from a uniform distribution in the range [0.1, 0.9] and then were normalized to a mean equals to 0.5. \(\{{J}_{i}\}\) were chosen from a uniform distribution in the range [0.5, 1.5].

In panel B, stimulations with low amplitudes (0.01) were given to the N/2 unstimulated input units, resulting in non-frozen \({J}_{i}\). In addition, adaptation was performed according to the spike timings of the two prior and two later stimulated dendrites, instead of using a cutoff in time. This modification was introduced to overcome the order of the delays in our setup, i.e. delays of w belonging to dendrite i are greater than delays of w belonging to dendrite i − 1. Consequently, using a cutoff might break the symmetry in the adaptation steps of J (number and strength) before and after the dendrite generating a spike. The adaptation follows Eq. (8) with A = 0.003 and the learning follow Eq. (9) with λ = 1/N. \({J}_{i}\) was bounded from below by 0.1 and from above by 2. The first p where \({J}^{T}\) and \({J}^{S}\) were identical is denoted on the x-axis, ε g = 0.

In panel C, the normalized histogram of 75 bins is presented for the dynamics excluding the transient of the first 10,000 inputs. Similarly, the transient was excluded in the normalized histogram presented in the inset.

In panels D, E adaptation follows Eq. (7) with A = 0.05, and learning follows Eq. (9) with λ = 1/N. \({J}_{i}\) was bounded from below by 0.1 and from above by 2.5. The first p where the non-frozen \({J}^{T}\) and \({J}^{S}\) were identical is denoted on the x-axis, ε g = 0.

In panel F, the error of a perceptron was calculated similarly to the synaptic case, Eqs (12) and (13), with \({J}^{T}\) and \({J}^{S}\) instead of \({w}^{T}\) and \({w}^{S}\), respectively.

Figure 4: The network was composed of an input layer which was fully connected to a hidden layer, each consisted of N units, and a single output unit. All hidden units and the output unit functioned as LIF neurons (see Output production). The fixed delays in the first and the second layer were chosen randomly from a uniform distribution in the range [0, 5N/2] with a resolution of 0.001 ms, where the teacher and the student had the same architecture. Initial weights for both layers were chosen from a uniform distribution in the range [0.1,0.9] and were bounded from above by 1.5 and from below by 10−4. An asynchronous input was given to N/2 randomly chosen input units. For each stimulated unit a random delay was chosen from a uniform distribution with a resolution of 0.001 ms, such that the average time-lag between two consecutive stimulations was 5 ms. Stimulation amplitudes were set to 1. The delays were chosen such that stimulation routes from an input unit to the output unit were non-degenerated, i.e. no more than one stimulation arrived simultaneously to a unit. In the first step of the dynamics the hidden layer produced its outputs. Second, those outputs were used as the input to the output unit, which then produced its output. Finally, weights adaptation was performed using Eq. (7) with A = 0.001 and learning following Eq. (9) with λ = 0.002. For the first layer of weights the teacher and the student preformed weights adaptation according to their inputs and outputs of the hidden units. For the second layer of weights, the student performed adaptation according to the teacher’s spike timings and its own sub-threshold stimulation timings. Learning steps were performed on the student’s weights of the first and the second layers with conflicting output with the teacher. Learning was based on the distinct stimulation routes, i.e. for each spike in the teacher’s output, the student knew which hidden unit evoked the spike. The error was calculated every 2000 inputs following Eq. (11), using the upper bound of the total number of stimulations, N2/2. For N = 100 all the delays were chosen with a resolution of 0.0001 ms.

Figure S2: Parameters are the same as in Fig. 2, with A = 0.001 (Eq. 8) and λ = 50/N (Eq. 9).

Figure S6: Parameters are the same as in Fig. 3E, with λ = 1/1000 (Eq. 10).

Figure S7: Data extracted from Fig. 3E. The normalized overlap of \({J}_{i}\) was divided to the overlap of frozen and non-frozen dendrites. A frozen dendrite was defined such that the variance of its strengths during the last 500 inputs was less than 10−3. For each group the normalized overlap was calculated following Eq. (13) with \({J}^{T}\) and \({J}^{S}\) instead of \({w}^{T}\) and \({w}^{S}\), respectively.

Figure S8: For the synaptic scenario, data was extracted from simulation using the same parameters as in Fig. 2. After each input the following term was calculated for each stimulated weight

$$\begin{array}{c}({w}^{T}(p-1)-{w}^{S}(p-1))({w}_{after\,learning}^{S}-{w}_{befor\,learning}^{S})\end{array}$$ (14)

measuring whether the learning step decreases the gap between wT(p − 1) and wS(p − 1). In case the term, Eq. (14), was positive (negative) the step was classified as attractive (repulsive). For the dendritic scenario, data was extracted from simulation using the same parameters as in Fig. 3B, and the attractive/repulsive steps were calculated using Eq. (14) with \({J}^{T}\) and \({J}^{S}\) instead of \({w}^{T}\) and \({w}^{S}\), respectively.