The clipped ReLU nonlinearity is similar to the ReLU except that it is also bounded above by a maximum value: f(x) = clip(x, r_min, r_max), where r_min = 0 and r_max = 100.

a SI increased significantly with σ₀. Linear regression slope: 0.55 ± 0.28, R² = 0.01 (two-sided Wald test, n = 280 experimental conditions, p = 0.049). In a–c, solid black lines are the linear fits and shaded regions are 95% confidence intervals for the linear regression.

b SI decreased significantly with λ₀. Linear regression slope: −3.87 ± 0.66, R² = 0.11 (two-sided Wald test, n = 280 experimental conditions, p < 0.001). Note that this result differs from the corresponding result in the case of ReLU networks, where λ₀ did not have a significant effect on the SI (Fig. 2c).

c SI decreased significantly with ρ. Linear regression slope: −418 ± 64, R² = 0.13 (two-sided Wald test, n = 280 experimental conditions, p < 0.001).

d SI as a function of task. Overall, the ordering of the tasks by SI was similar to that obtained with the ReLU nonlinearity (Fig. 3a). However, training was substantially more difficult with the clipped ReLU nonlinearity than with the ReLU nonlinearity: across all tasks and conditions, ReLU networks had a training success rate (defined as reaching within 50% of the optimal performance) of ~60%, whereas clipped ReLU networks had a training success rate of only ~9.3%. In particular, we were unable to successfully train any networks on the CD task and very few on the 2AFC task. As a consequence, some of the differences between tasks were not significant in the clipped ReLU case. Error bars represent mean ± standard error across different hyperparameter settings. Exact sample sizes for the derived statistics shown in d are reported in Supplementary Table 1.

e, f Recurrent connection weight profiles (as in Fig. 6a–c) in conditions where SI > 4.8 and in conditions where SI < 3, respectively. The weights were smaller in magnitude in f because most of the low-SI networks were trained under strong regularization. Solid lines represent mean weights and shaded regions represent standard deviations of weights; both means and standard deviations are averages over multiple networks.
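As a minimal sketch of the nonlinearity defined above (assuming a NumPy-based implementation; the name clipped_relu is ours, and the default arguments simply mirror the values in the caption rather than any released code):

```python
import numpy as np

def clipped_relu(x, r_min=0.0, r_max=100.0):
    """Clipped ReLU: f(x) = clip(x, r_min, r_max).

    With r_min = 0 this matches the standard ReLU for x < r_max,
    but saturates at r_max for larger inputs.
    """
    return np.clip(x, r_min, r_max)
```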
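The statistics reported in a–c (slope ± standard error, R², and a two-sided Wald test p-value for the slope) are consistent with an ordinary least-squares fit. A hedged sketch using scipy.stats.linregress, whose p-value is exactly such a two-sided Wald test on the slope; the arrays sigma0 and si below are placeholders for the per-condition hyperparameter values and sequentiality indices, not the actual data:

```python
import numpy as np
from scipy import stats

# Placeholder data: one (σ₀, SI) pair per experimental condition (n = 280).
rng = np.random.default_rng(0)
sigma0 = rng.uniform(0.0, 1.0, size=280)
si = 0.55 * sigma0 + rng.normal(0.0, 1.0, size=280)

res = stats.linregress(sigma0, si)  # two-sided Wald test on the slope
print(f"slope = {res.slope:.2f} ± {res.stderr:.2f}, "
      f"R² = {res.rvalue**2:.2f}, p = {res.pvalue:.3g}")
```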
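For e, f, the averaging described in the last sentence might look like the following sketch (the array layout of weights is an assumption for illustration, not the authors' code): for each network, take the mean and standard deviation of the weight profiles across units, then average both curves over networks.

```python
import numpy as np

# Assumed layout: weights[k] is an (n_units, n_offsets) array for network k,
# holding each unit's recurrent weight profile (as in Fig. 6a–c).
weights = [np.random.randn(50, 21) for _ in range(10)]  # placeholder networks

per_net_mean = np.stack([w.mean(axis=0) for w in weights])  # (n_nets, n_offsets)
per_net_std = np.stack([w.std(axis=0) for w in weights])

mean_profile = per_net_mean.mean(axis=0)  # solid line
std_profile = per_net_std.mean(axis=0)    # half-width of the shaded region
```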