In-vitro experiments

The experimental methods are similar to our previous studies6,7 and only the modifications are presented. All procedures were in accordance with the National Institutes of Health Guide for the Care and Use of Laboratory Animals and Bar-Ilan University Guidelines for the Use and Care of Laboratory Animals in Research and were approved and supervised by the Bar-Ilan University Animal Care and Use Committee.

Experimental protocol

The details of the experimental protocol of Figure 1 are as follows. The neuronal response latency (NRL) was used to accurately adjust the time-lag between evoked spikes or EPSPs originating from consecutive intra- and extracellular stimulations, such that it was in the range of 1–4 ms. We note that an above-threshold extracellular stimulation given shortly, e.g. 2 ms, after an above-threshold intracellular stimulation does not result in an evoked spike, and can be used to enhance adaptation. The thresholds and the NRL were rechecked at the end of each experiment to ensure their stability.

Statistical analysis

The presented results were reproduced tens of times on many cultures.

Simulations of biological neural networks

The simulation methods are similar to our previous studies8, and only the parameters used and the modifications are presented.

Input generation

Each input was composed of N/2 randomly stimulated input units. For each stimulated unit a random delay and a stimulation amplitude were chosen from given distributions. The delays were randomly chosen from a uniform distribution with a resolution of 1 (0.001) ms, such that the average time-lag between two consecutive stimulations was 2 (10) ms for the synaptic (dendritic) scenario. Stimulation amplitudes were randomly chosen from a uniform distribution in the range [0.8, 1.2]. Note that the reported results are qualitatively robust to the scenario where all the non-zero amplitudes equal 1. In the dendritic scenario, the five \({W}_{m}\) connected to the same dendrite were stimulated sequentially in a random order and with an average time-lag of 10 ms between consecutive stimulations.
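As an illustration, the input-generation procedure above can be sketched as follows (a minimal sketch in Python/NumPy; the function name and the span of the delay distribution are our assumptions, chosen so that the average time-lag between consecutive stimulations matches the stated value, and the defaults correspond to the synaptic scenario):

```python
import numpy as np

rng = np.random.default_rng()

def generate_input(N, avg_lag_ms=2.0, resolution_ms=1.0):
    """One input pattern: N/2 randomly chosen stimulated units, each with a
    random delay and an amplitude drawn uniformly from [0.8, 1.2].
    The dendritic scenario would use avg_lag_ms=10.0, resolution_ms=0.001."""
    stimulated = rng.choice(N, size=N // 2, replace=False)
    # Assumption: delays are uniform on [0, span], with span chosen so that
    # the average gap between consecutive (sorted) stimulations is avg_lag_ms.
    span = avg_lag_ms * (N // 2)
    delays = rng.uniform(0.0, span, size=N // 2)
    delays = np.round(delays / resolution_ms) * resolution_ms  # stated resolution
    amplitudes = rng.uniform(0.8, 1.2, size=N // 2)
    order = np.argsort(delays)
    return stimulated[order], delays[order], amplitudes[order]
```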

Calculating the generalization error

The estimation consisted of up to 20,000 inputs presented to the teacher and the student, where each input generated about 30 (200) evoked spikes in the synaptic (dendritic) scenario. The generalization error is defined as

$${\varepsilon }_{g}=\frac{\text{total no. of mismatched firings}}{\text{total no. of stimulations}}.$$

The details of the simulations in Figure 2 are as follows. Panel B: \(\{{W}_{m}\}\) were chosen from a uniform distribution in the range [0.1, 0.2]. The adaptation and learning steps were A = 0.05 and \(\eta\) = 1/1000, respectively. \({W}_{m}\) was bounded from above by 1.5 and from below by \({10}^{-4}\). The fixed learning rate was compared to the accelerating method using an adaptive learning step:

$${\eta }^{t+1}={\eta }^{t}\cdot {e}^{-\tau }+B\cdot sign({O}^{T}-{O}^{S})$$

with \(\tau \) = 0.1 and B = 0.01; \(\eta \) was initialized to 1/1000.

Panel C: \(\{{W}_{m}\}\) were chosen from a uniform distribution in the range [0.1, 0.9] and then normalized to a mean of 0.5. \(\{{J}_{i}\}\) were chosen from a uniform distribution in the range [0.5, 1.5]. Stimulations with low amplitudes (0.01) were given to the N/2 unstimulated input units, resulting in non-frozen \({J}_{i}\). The adaptation and learning steps were A = 0.05 and \(\eta\) = 1/1000, respectively. \({J}_{i}\) was bounded from below by 0.1 and from above by 3. The fixed learning rate was compared to the accelerating method using an adaptive learning step:

$${\eta }^{t+1}={\eta }^{t}\cdot {e}^{-\tau }+B\cdot sign({O}^{T}-{O}^{S})$$

with \(\tau \) = 0.1 and B = 0.01; \(\eta \) was initialized to 1/1000.
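A minimal sketch of this adaptive learning-step rule (Python/NumPy; variable names are ours, for illustration):

```python
import numpy as np

eta = 1.0 / 1000.0  # initial learning step, as above

def adaptive_eta(eta, o_teacher, o_student, tau=0.1, B=0.01):
    """eta^{t+1} = eta^t * exp(-tau) + B * sign(O^T - O^S): eta decays
    toward zero while teacher and student agree (sign = 0) and is
    driven up by persistent mismatches."""
    return eta * np.exp(-tau) + B * np.sign(o_teacher - o_student)
```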

Simulations of neural networks

Architecture

The feedforward neural network contains 784 input units, 30 hidden units, and 10 output units in a fully connected architecture. Each unit in the hidden and output layers receives an additional input from a bias unit. The weights from the input layer to the hidden layer, \({W}^{1}\), and from the hidden layer to the output layer, \({W}^{2}\), were randomly chosen from a Gaussian distribution with zero average and a standard deviation of 1. The weights were then normalized such that the input weights to each hidden unit have an average of 0 and an STD of 1. The initial value of each bias was set to 1. We trained the network on the MNIST handwritten digits dataset using gradient descent. The inputs, examples from the training dataset, contain 784 pixel values in the range [0, 255]. We normalized the inputs such that their average and STD equal 0 and 1, respectively.
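A minimal sketch of this initialization (Python/NumPy; the array shapes, the per-column normalization of \({W}^{1}\), and per-example input normalization are our reading of the description above):

```python
import numpy as np

rng = np.random.default_rng()
n_in, n_hid, n_out = 784, 30, 10

# Gaussian weights with zero average and unit standard deviation.
W1 = rng.normal(0.0, 1.0, size=(n_in, n_hid))
W2 = rng.normal(0.0, 1.0, size=(n_hid, n_out))

# Normalize so the input weights to each hidden unit have mean 0 and STD 1.
W1 = (W1 - W1.mean(axis=0)) / W1.std(axis=0)

# All biases are initialized to 1.
b1, b2 = np.ones(n_hid), np.ones(n_out)

def normalize(x):
    """Normalize one raw MNIST example (784 pixels in [0, 255]) to
    mean 0 and STD 1 (per-example normalization is our assumption)."""
    return (x - x.mean()) / x.std()
```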

Forward propagation

The output of a single unit in the hidden layer, \({a}_{j}^{1}\), was calculated as:

$${z}_{j}^{1}=\sum _{i}({W}_{ij}^{1}\cdot {X}_{i})+{b}_{j}^{1}$$

$${a}_{j}^{1}=\frac{1}{1+{e}^{-{z}_{j}^{1}}}$$

where \({W}_{ij}^{1}\) is the weight from the ith input to the jth hidden unit, \({X}_{i}\) is the ith input, and \({b}_{j}^{1}\) is the bias of the jth hidden unit.

For the output layer, the output of a single unit, \({a}_{j}^{2}\), was calculated as:

$${z}_{j}^{2}=\sum _{i}({W}_{ij}^{2}\cdot {a}_{i}^{1})+{b}_{j}^{2}$$

$${a}_{j}^{2}=\frac{1}{1+{e}^{-{z}_{j}^{2}}}$$

where \({W}_{ij}^{2}\) is the weight from the ith hidden unit to the jth output unit, \({a}_{i}^{1}\) is the output of the ith unit in the hidden layer, and \({b}_{j}^{2}\) is the bias of the jth output unit.
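Together, the forward pass reads as the following sketch (same shapes and conventions as the initialization sketch above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass for one example x of shape (784,), with W1 of shape
    (784, 30) and W2 of shape (30, 10)."""
    z1 = W1.T @ x + b1   # weighted inputs to the hidden layer
    a1 = sigmoid(z1)     # hidden activations
    z2 = W2.T @ a1 + b2  # weighted inputs to the output layer
    a2 = sigmoid(z2)     # the 10 output activations
    return a1, a2
```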

Back propagation

We used two different cost functions; the first was the cross entropy:

$$C=-\frac{1}{N}\sum _{n}[y\cdot \log (a)+(1-y)\cdot \log (1-a)]$$

and the second was the mean square error (MSE):

$$C=\frac{1}{2N}\sum _{n}{(y-a)}^{2}$$

where y are the desired labels and \(a\) stands for the 10 current outputs of the output layer. The summation is over all N training examples.
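Both cost functions in a short sketch (the clipping guard in the cross entropy is an added numerical safeguard, not part of the definition):

```python
import numpy as np

def cross_entropy(y, a, eps=1e-12):
    """C = -(1/N) sum_n [y log(a) + (1-y) log(1-a)], summed over the 10
    outputs and averaged over the N examples; y and a have shape (N, 10)."""
    a = np.clip(a, eps, 1.0 - eps)  # numerical guard against log(0)
    return -np.mean(np.sum(y * np.log(a) + (1.0 - y) * np.log(1.0 - a), axis=1))

def mse(y, a):
    """C = (1/2N) sum_n (y - a)^2."""
    return np.sum((y - a) ** 2) / (2.0 * y.shape[0])
```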

The backpropagation method computes the gradient of the chosen cost function with respect to each weight. The weights and biases were updated according to three different methods:

(1) Momentum. The weights update:

$${W}^{t+1}=(1-\alpha )\cdot {W}^{t}+{V}^{t+1}$$

$${V}^{t+1}=\mu \cdot {V}^{t}-{\eta }_{0}\cdot {\nabla }_{{W}^{t}}C$$

where t is the discrete time-step, W are the weights, \(\alpha\) is a regularization constant, \({\eta }_{0}\) is the fixed learning rate, and \({\nabla }_{{W}^{t}}C\) is the gradient of the cost function with respect to each weight at time t. V was initialized as \(-{\eta }_{0}\cdot {\nabla }_{W}{C}_{first}\), where \({\nabla }_{W}{C}_{first}\) is the first computed gradient. The biases update:

$${V}_{b}^{t+1}=\mu \cdot {V}_{b}^{t}-{\eta }_{0}\cdot {\nabla }_{{b}^{t}}C$$

$${b}^{t+1}={b}^{t}+{V}_{b}^{t+1}$$

where \({\nabla }_{b}C\) is the gradient of the cost function with respect to each bias b, and \({V}_{b}\) was initialized as \(-{\eta }_{0}\cdot {\nabla }_{b}{C}_{first}\), where \({\nabla }_{b}{C}_{first}\) is the first computed bias gradient.

(2) Acceleration. The weights update:

$${W}^{t+1}=(1-\alpha )\cdot {W}^{t}-|{\eta }^{t+1}|\cdot {\nabla }_{{W}^{t}}C$$

$${\eta }^{t+1}={\eta }^{t}\cdot {e}^{-\tau }+{A}_{1/2}\cdot \tanh ({\beta }_{1/2}\cdot {\nabla }_{{W}^{t}}C)$$

where \(\eta\) is defined separately for each weight; \({A}_{1}\) and \({\beta }_{1}\) are constants representing the amplitude and the gain, respectively, between the input and the hidden layers, and \({A}_{2}\) and \({\beta }_{2}\) represent the same between the hidden and the output layers. \(\eta\) was initialized as \({A}_{1/2}\cdot \tanh ({\beta }_{1/2}\cdot {\nabla }_{W}{C}_{first})\), where \({\nabla }_{W}{C}_{first}\) is the first computed gradient. The biases update:

$${b}^{t+1}={b}^{t}-|{\eta }_{b}^{t+1}|\cdot {\nabla }_{{b}^{t}}C$$

$${\eta }_{b}^{t+1}={\eta }_{b}^{t}\cdot {e}^{-\tau }+{A}_{1/2}\cdot \tanh ({\beta }_{1/2}\cdot {\nabla }_{{b}^{t}}C)$$

(3) Advanced acceleration. The weights update:

$${W}^{t+1}=(1-\alpha )\cdot {W}^{t}+{V}^{t+1}$$

$${V}^{t+1}=\mu \cdot {V}^{t}-{\eta }^{t+1}\cdot {\nabla }_{{W}^{t}}C$$

$${\eta }^{t+1}={\eta }^{t}\cdot {e}^{-\tau }+{A}_{1/2}\cdot \tanh ({\beta }_{1/2}\cdot {\nabla }_{{W}^{t}}C)$$

and the biases update:

$${b}^{t+1}={b}^{t}+{V}_{b}^{t+1}$$

$${V}_{b}^{t+1}=\mu \cdot {V}_{b}^{t}-{\eta }_{b}^{t+1}\cdot {\nabla }_{{b}^{t}}C$$

$${\eta }_{b}^{t+1}={\eta }_{b}^{t}\cdot {e}^{-\tau }+{A}_{1/2}\cdot \tanh ({\beta }_{1/2}\cdot {\nabla }_{{b}^{t}}C)$$
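A minimal sketch of the three weight-update rules (Python/NumPy; `A` and `beta` stand for the layer-specific \({A}_{1/2}\) and \({\beta }_{1/2}\), and the bias updates follow the same form without the \((1-\alpha)\) decay):

```python
import numpy as np

def momentum_step(W, V, grad, mu, eta0, alpha):
    """(1) Momentum: V <- mu*V - eta0*grad; W <- (1-alpha)*W + V."""
    V = mu * V - eta0 * grad
    W = (1.0 - alpha) * W + V
    return W, V

def acceleration_step(W, eta, grad, A, beta, tau, alpha):
    """(2) Acceleration: a per-weight learning rate that decays by exp(-tau)
    and is driven by a saturated (tanh) function of the gradient; the weight
    then moves against the gradient with step |eta|."""
    eta = eta * np.exp(-tau) + A * np.tanh(beta * grad)
    W = (1.0 - alpha) * W - np.abs(eta) * grad
    return W, eta

def advanced_acceleration_step(W, V, eta, grad, A, beta, tau, alpha, mu):
    """(3) Advanced acceleration: momentum combined with the adaptive
    per-weight learning rate of method (2)."""
    eta = eta * np.exp(-tau) + A * np.tanh(beta * grad)
    V = mu * V - eta * grad
    W = (1.0 - alpha) * W + V
    return W, V, eta
```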

Testing the network

The network classification accuracy was tested on the MNIST test dataset, containing 10,000 inputs. The test inputs were also normalized, each to an average of 0 and an STD of 1.

Optimization

For each update method, the parameters were chosen to maximize the test accuracy. For optimization, we first scanned a grid of the adjustable parameters, followed by fine tuning with higher resolution for each parameter. The optimization was performed over 3 parameters for the momentum method (\(\mu ,{\eta }_{0},\alpha \)), 6 parameters for the acceleration method (\({A}_{1},{A}_{2},{\beta }_{1},{\beta }_{2},\tau ,\alpha \)), and 7 parameters for the advanced acceleration method (\({A}_{1},{A}_{2},{\beta }_{1},{\beta }_{2},\tau ,\alpha ,\mu \)).
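As an illustration only, the coarse grid stage could look like the following sketch; the actual grids and ranges are not specified in the text, and `test_accuracy` is a hypothetical callable that trains the network with the given parameters and returns its test-set accuracy:

```python
import itertools

def grid_search(test_accuracy):
    """Coarse grid stage for the momentum method's three parameters;
    the grids below are illustrative, not the ones used here."""
    grid = {
        "mu": [0.5, 0.6, 0.7, 0.8, 0.9],
        "eta0": [0.01, 0.02, 0.05, 0.1, 0.5],
        "alpha": [0.0, 0.005, 0.01, 0.1],
    }
    best_params, best_acc = None, -1.0
    for mu, eta0, alpha in itertools.product(*grid.values()):
        acc = test_accuracy(mu=mu, eta0=eta0, alpha=alpha)
        if acc > best_acc:
            best_params, best_acc = {"mu": mu, "eta0": eta0, "alpha": alpha}, acc
    # Fine tuning would then rescan with higher resolution around best_params.
    return best_params, best_acc
```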

The details of the simulations in Figure 3 are as follows. Panel B: The feedforward neural network was presented with 300 examples, with an equal number of appearances of each digit. The cross entropy cost function was used, with the following parameters for each method: momentum with \(\mu =0.9,{\eta }_{0}=0.02,\alpha =0\); acceleration with \({A}_{1}=0.1,{A}_{2}=0.27,{\beta }_{1}=5000,{\beta }_{2}=900,\tau =0.598,\alpha =0.003\); and advanced acceleration with \({A}_{1}=0.07,{A}_{2}=0.07,{\beta }_{1}=\infty ,{\beta }_{2}=\infty ,\tau =0.29,\alpha =0.005,\mu =0.5\). Results are presented as the average of 100 different runs, and typical error bars are presented for the last point.

Panel C: The feedforward neural network was presented with 6,000 examples randomly taken from the 60,000 examples in the training dataset, presented to the network in 30 mini-batches of size 200. The cross entropy cost function was used, with the following parameters for each method: momentum with \(\mu =0.68,{\eta }_{0}=0.65,\alpha =0.08\); acceleration with \({A}_{1}=3,{A}_{2}=0.7,{\beta }_{1}=1500,{\beta }_{2}=1000,\tau =0.4,\alpha =0.065\); and advanced acceleration with \({A}_{1}=0.5,{A}_{2}=0.5,{\beta }_{1}=\infty ,{\beta }_{2}=\infty ,\tau =0.1,\alpha =0.1,\mu =0.55\). Results are presented as the average of 100 different runs, and typical error bars are presented for the last point.

Panel D: The feedforward neural network was presented with 1,200 examples randomly taken from the 60,000 examples in the training dataset, presented to the network in 24 mini-batches of size 50. The cross entropy cost function was used, with the following parameters for each method: momentum with \(\mu =0.75,{\eta }_{0}=0.6,\alpha =0.11\); acceleration with \({A}_{1}=1.15,{A}_{2}=0.6,{\beta }_{1}=4500,{\beta }_{2}=3500,\tau =0.1,\alpha =0.055\); and advanced acceleration with \({A}_{1}=0.55,{A}_{2}=0.55,{\beta }_{1}=\infty ,{\beta }_{2}=\infty ,\tau =0.1,\alpha =0.1,\mu =0.55\). Results are presented as the average of 100 different runs, and typical error bars are presented for the last point.

Panel E: The feedforward neural network was presented with 60 examples, with an equal number of appearances of each digit. The 60 examples were presented to the network 5 times. The cross entropy cost function was used, with the following parameters for each method: momentum with \(\mu =0.87,{\eta }_{0}=0.035,\alpha =0.005\); acceleration with \({A}_{1}=0.11,{A}_{2}=0.26,{\beta }_{1}=2000,{\beta }_{2}=2500,\tau =0.45,\alpha =0.008\); and advanced acceleration with \({A}_{1}=0.035,{A}_{2}=0.02,{\beta }_{1}=4500,{\beta }_{2}=5,\tau =0.0619,\alpha =0.01,\mu =0.6\). Results are presented as the average of 100 different runs, and typical error bars are presented for the last point.

Panel F: The feedforward neural network was presented with 300 examples, with an equal number of appearances of each digit. The mean-square-error cost function was used, with the following parameters for each method: momentum with \(\mu =0.6,{\eta }_{0}=0.35,\alpha =0.005\); acceleration with \({A}_{1}=0.95,{A}_{2}=0.25,{\beta }_{1}=5000,{\beta }_{2}=40,\tau =0.15,\alpha =0.008\); and advanced acceleration with \({A}_{1}=0.06,{A}_{2}=0.09,{\beta }_{1}=2100,{\beta }_{2}=1,\tau =0.015,\alpha =0.005,\mu =0.8\). Results are presented as the average of 100 different runs, and typical error bars are presented for the last point.

The feedforward neural network was presented with 300 examples, with an equal number of appearances of each digit. The examples comprised 5 subsets of 60 balanced examples, where in every subset each label appears exactly 6 times. The cross entropy cost function was used, with the following parameters for each method: momentum with \(\mu =0.87,{\eta }_{0}=0.035,\alpha =0.005\); acceleration with \({A}_{1}=0.11,{A}_{2}=0.26,{\beta }_{1}=2000,{\beta }_{2}=2500,\tau =0.45,\alpha =0.008\); and advanced acceleration with \({A}_{1}=0.035,{A}_{2}=0.02,{\beta }_{1}=4500,{\beta }_{2}=5,\tau =0.0619,\alpha =0.01,\mu =0.6\). Results are presented as the average of 100 different runs, and typical error bars are presented for the last point.

The feedforward neural network was presented with 300 examples, with an equal number of appearances of each digit. The examples comprised 30 subsets of 10 balanced examples, where in every subset each label appears exactly once. Results are also presented for the advanced acceleration method, where for each subset of 10 examples the labels were presented in a fixed order. The cross entropy cost function was used, with the following parameters for each method: momentum with \(\mu =0.87,{\eta }_{0}=0.035,\alpha =0.005\); acceleration with \({A}_{1}=0.11,{A}_{2}=0.26,{\beta }_{1}=2000,{\beta }_{2}=2500,\tau =0.45,\alpha =0.008\); and advanced acceleration with \({A}_{1}=0.035,{A}_{2}=0.02,{\beta }_{1}=4500,{\beta }_{2}=5,\tau =0.0619,\alpha =0.01,\mu =0.6\). Results are presented as the average of 100 different runs, and typical error bars are presented for the last point.