Tailored ensembles of neural networks optimize sensitivity to stimulus statistics

The dynamic range of stimulus processing in living organisms is much larger than a single neural network can explain. For a generic, tunable spiking network we derive that while the dynamic range is maximal at criticality, the interval of discriminable intensities is very similar for any network tuning due to coalescence. Compensating coalescence enables adaptation of discriminable intervals. Thus, we can tailor an ensemble of networks optimized to the distribution of stimulus intensities, e.g., extending the dynamic range arbitrarily. We discuss potential applications in machine learning.

Living organisms are constantly exposed to sensory stimuli with intensities that cover multiple orders of magnitude [1][2][3]. The organisms' ability to cope with this variability of stimulus intensities determines how well they will thrive. Hence, evolution favored nervous systems that developed the capability to process this variability. Stimulus intensity is typically encoded in neural firing rates, with stronger intensities eliciting higher firing rates. Such reliable, monotonous encoding of stimulus intensity has for example been found in the mammalian auditory system [4][5][6][7], with indication that a small number of neurons is sufficient to discriminate small changes in sound level intensity [5] and that neurons adjust their response to the complex statistics of the sound level distribution [7].
The capability to process a broad distribution of stimulus intensities is typically quantified by the dynamic range. The dynamic range of a neural network is defined as the log-ratio between the strongest and weakest stimulus intensities that are reliably encoded by the neural firing rate. Experimentally, the range of reliably encoded stimulus intensities was measured to extend from 40 dB to 50 dB in cat primary auditory nerve fibers [8]. This clearly does not cover the full range of hearingranging for humans from approximately 0 dB to about 120 dB sound pressure level -resulting in the so called 'dynamic range problem' [3,9]. It was argued that the dynamic range problem can be overcome by adaptation of neural response to the presented stimulus statistics [7]. However, as the future stimulus intensity cannot be perfectly predicted, maximizing the dynamic range remains a powerful strategy.
The dynamic range of a recurrent neural network, driven by external stimuli, is maximal at the critical point of the non-driven network [10]. Thereby criticality fosters flexible information processing, and is thus a versatile candidate state for neural networks and brain function [11,12]. If already at criticality, the dynamic range was shown to depend on topology [13,14], with homogeneous yet random networks reaching a higher dynamic range than those with a heterogeneous topology. The theoretical results are supported by experiments on cul-tured cortex slices [15], showing maximal dynamic ranges in those networks with power-law avalanche-size distributions (signature of criticality). However, the dynamic range of O(10 dB) observed in cultured neural networks is more than ten time smaller than the perceptual dynamic range of humans, and thus recurrent networks alone probably cannot solve the dynamic range problem.
In this letter, we show that processing capability might not be optimized by a single network, but instead by an ensemble of specialized neural networks that is tuned to stimulus statistics. First, we note that for an optimal encoding of stimulus statistics it is not sufficient to maximize the dynamic range; more importantly, the interval of stimulus intensities that are reliably discriminated by the network -the discriminable interval -also needs to cover the relevant intensities. Second, we show analytically that for a single neural network near criticality, this discriminable interval cannot be changed, i.e., it cannot be tuned to the relevant stimulus intensities. Third, we derive how to construct neural networks with adaptive synaptic weights such that the discriminable interval gradually changes when changing the distance to criticality, thereby providing a mechanism to tune the network to specific stimulus intensities. Last, we demonstrate that a tailored ensemble of such networks can optimize the ensemble response to stimulus statistics; if needed with a drastically increased dynamic range that emerges synergistically in the ensemble.
No matter what type of system one analyzes, the dynamic range and the discriminable interval are response measures defined in terms of the range of stimulus intensities that can be reliably discriminated in the systems' output ( Fig. 1 A). For a neural network, we treat the stimulus intensity as an external Poisson input with rate h per neuron, resulting in a network spike rate A(h). For zero input rate, the network typically produces its minimal, baseline rate A min , whereas for very strong input rates the network rate typically saturates. Hence, the network rate covers the interval [A min , A max ]. On this interval, the discriminable interval is defined as the 10 th to 90 th percentiles,  discriminable input rates [h(A 0.1 ), h(A 0.9 )], the dynamic range is defined as ∆ = 10 log 10 [h(A 0.9 )/h(A 0.1 )] in decibel (dB). These response measures characterize the range of input rates that can be discriminated in contrast to response measures that characterize how well a system can detect changes in the input rate [16].
We analyze neural network responses using a branching network, because it shows a maximal dynamic range at criticality [10,17,18]. For analytical tractability, we consider a fully connected network of N binary neurons without refractory period [19]. After each time step ∆t, each neuron i can be either silent (s i = 0) or excited to spike (s i = 1). It can be excited by (1) an external Poisson input with rate h, such that the transition probability is λ(h) = 1 − exp(−h∆t); or by (2) a presynaptic neuron j (which was excited in the previous time step) with transition probability w ij = w = m N , where the branching parameter m is the control parameter. If no external or internal input reaches an excited neuron, it returns to a silent state in the next time step. The network activity at each time step, A t = i s i t , is thus determined by both internal (m) and external (h) activation.
For the branching network, we can analytically derive the neuron spike rate as a function of external input rate. In the following, we sketch the main steps and refer to Ref. [19] for a detailed derivation. Given a network activity A t at time t, the probability for any neuron to be excited in the next time step is given by . The network activity A t+1 is then binomially distributed with expectation value A t+1 |A t = N p(A t ). Demanding stationary activity, A = A t = A t+1 |A t , and neglecting fluctuations in a mean-field approximation, Using the Lambert-W function defined as W (z)e W (z) = z [20], this is solved by to obtain the solution for the expected neuron spike rate which turns out to be system-size independent for sufficiently large N . Figure 1 B shows that Eq. (2) accurately describes our numerical results for N = 10 4 . From the neuron rate, we readily obtain mean-field solutions for the dynamic range and discriminable interval.
which allows us to compute h(a 0.1 ) and h(a 0.9 ). In correspondence with previous results, we recover that the dynamic range is maximal at criticality ( Fig. 1 C). Importantly, it does not diverge for = 1 − m → 0 [13], which can be seen clearly from the figure inset. Moreover, the discriminable interval barely changes for small ( Fig. 1 B). Thus for cortical networks with m ≈ 0.98 [21,22], the discriminable intervals of networks with different m strongly overlap. Strongly overlapping discriminable intervals are also observed in cultured cortex slices [15]. We can derive the bounds of the discriminable interval h(a x ) by expanding ln(1 − a) = ∞ n=1 a n /n and rewriting Eq. (3) as h bn (m, a) = (a + ∞ n=2 a n /n) /∆t with = 1 − m. For sufficiently smaller than a, which holds in the vicinity of the critical point, the bounds are barely distinguishable (see also Fig. S3). This result is valid for sufficiently large system sizes, and limits the dynamic range even in the infinite system limit. In the following, we will propose a framework that allows to shift the discriminable interval.
Branching networks are a finite-size network embedding of space-less branching processes [23], which approximate spike propagation in neuronal tissue [10,17,21,[24][25][26][27]. The branching process evolves in discrete time steps ∆t and if A t neurons are excited, one finds on average A t+1 |A t = mA t excited neurons in the next time step. Considering in addition a network-wide external input at rate hN , one expects A t+1 |A t = mA t +N h∆t. For stationary activity, A = A t+1 |A t = mA + N h∆t, such that The inverse is straightforward, h bp (m, a) = (1 − m)a/∆t, and leads to a dynamic range that is independent of m (see Supplemental Material S3). In contrast, the discriminable interval is highly dependent on m with bounds h bp (m, a x ) = (1 − m)x/∆t. These response measures differ drastically from those of the branching network. The response measures of the branching process differ drastically from those of the branching network, because of coalescence (the simultaneous activation of the same neuron from multiple sources) in the branching network [19]. Coalescence alters the critical non-equilibrium phase transition from subcritical-supercritical in the branching process to absorbing-active in the branching network. We can, however, map the network activity of a branching network to a branching process with an effective branching parameter m eff revealing that coalescence reduces the effective branching parameter m eff (A t ) < m = wN .
Having identified coalescence as the main difference, we can compensate it to construct network dynamics in agreement with the dynamics of a corresponding branching process. This can be achieved by inverting Eq. (5) for a target branching parameter m eff (w cc , A t ) = m, obtaining adaptive (coalescence-compensating) synaptic weights w cc (A t ) (see Supplemental Material S1). We assume that the network may at most compensate coalescence with internal sources (setting λ(h) = 0) such that because compensating coalescence with external input would require knowledge of the external input rate h (a logical contradiction as one of the network's tasks is to infer this rate). For finite networks in close vicinity of the critical point, we need to introduce a finite-network cutoff at w cc (m, N ) = ln(N )/N , which avoids an additional absorbing boundary at A = N (see Supplemental Material S1). Compensating internal coalescence with adaptive weights [Eq. (6)] implies such that for small input h we approximate the target branching process statistics. The resulting coalescence-compensating model shows much improved power-law avalanche-size distributions at criticality (Fig. 2). In the branching network the powerlaw is cut off at s ≈ N , for which typical avalanches (Fig. 2, inset) with bell-shape A(t, T ) = T F(t/T ) [28] imply the avalanche peak value to scale as A peak ∼ T ∼ √ N (see Supplemental Material S2). In the coalescencecompensating network, this maximum is now extended to the non-absorbing boundary T ∼ N , such that the power-law characteristics extend until s ≈ N 2 . This means that for the convergence-compensating network the avalanche-size distribution covers twice as many orders of magnitude as for the branching network. Similarly, the power-law avalanche-duration distribution is extended from √ N to N (see Supplemental Material S2). The coalescence-compensating network retains maximal dynamic range at criticality but recovers the dependence of discriminable interval on the branching parameter m. To show this we compute the neuron rate for the coalescence-compensating network. For stationary activ- where adaptive synaptic weights [Eq. (7)] compensate internal coalescence and the external input is limited by the finite network size with excitation probability λ(h) for a Poisson input. We thus obtain for which the inverse is again straightforward h cc (m, a) = − log [1 − (1 − m)a/(1 − ma)] /∆t. The dynamic range remains maximal at criticality (Fig. 1 C), while the discriminable intervals remains depended on m (Fig. 1 D). Figure 1 D further shows that our mean-field solution well describes our numerical results (N = 10 2 ) and that log-spaced choices of = 1 − m yield a homogeneous overlap of discriminable intervals.
Our results show that an ensemble of coalescencecompensating networks can be tailored to optimize sensitivity to stimulus statistics. For example, a bimodal distribution of stimulus intensities can be well processed by at least two networks with disjoint discriminable in-  tervals (Fig. 3), consistent with the heterogeneous singleneuron responses observed in auditory midbrain when presented with bimodal sound level intensities [7]. Processing a bimodal distribution of stimulus intensities may be also relevant for higher cortical areas, receiving input from areas with up-and-down states [29][30][31][32], or when reacting to complex behavior such as bimodal escape sequence duration of drosophila [33]. If, however, stimulus statistics cannot be anticipated, surprises are best dealt with by assuming a uniform broad distribution of stimulus intensities. In this case, the best strategy is to maximize the dynamic range by an ensemble of networks with sufficiently overlapping discriminable intervals ( Fig. 1 D  and Fig. 3).
The optimized sensitivity of the ensemble to stimulus statistics can be exploited in machine-learning applications. A potential application could be a tailored ensemble of coalescence-compensating recurrent networks in reservoir computing devices [34][35][36][37][38][39]. Stacking these reservoirs will perform optimal separation of stimulus intensities within the union of discriminable intervals of participating networks.
Our results suggest an alternative strategy to solve the discrepancy between the dynamic range of a single network compared to the large range of sensory stimulus intensities (dynamic range problem). First, networks probably normalize the input statistics, as observed in the early stages of the visual pathway [40]. Second, a single network could implement an adaptable discriminable interval by tuning its distance to criticality (m). Supporting this possibility, rapid adjustment of neuronal sensitivity to the stimulus intensity was demonstrated in a recent study of the Drosophila hearing system [41]. We hypothesize that the brain combines all strategies for maximized robustness.
JZ and VP received financial support from the German Ministry of Education and Research (BMBF) via the Bernstein Center for Computational Neuroscience (BCCN) Göttingen under Grant No. 01GQ1005B. JW was financially supported by Gertrud-Reemtsma-Stiftung. AL received funding from a Sofja Kovalevskaja Award from the Alexander von Humboldt Foundation, endowed by the Federal Ministry of Education and Research.
Internal and external coalescence reduce the effective branching parameter for static connection weights w = m/N [19]. Thereby, macroscopic branching parametersm, estimated from the network rate, differ from the model branching parameter m. For a detailed discussion, we refer to Ref. [19]. We will now exploit the analytical insight on coalescence to construct microscopic dynamics that compensate for internal coalescence. The basic idea is simple: adjust the microscopic dynamics (w) such that the effective branching parameter matches the desired macroscopic branching parameter.
To compensate for coalescence, we adjust the weights w such that m eff (w, A t ) = m for all A t , and thereby tune the model parameter equal to the macroscopic branching parameter m =m. Inserting Eq. (5), we obtain activitydependent weights which would compensate for internal and external coalescence. We now assume that the network has a mechanism to communicate the current activity A t to each neuron, but that it cannot have information about the external input rate h. As a result, we neglect the factor (1 − λ(h)) and obtain the adaptive (coalescencecompensating) weights which compensate only for internal coalescence. Inserted as weight in Eq. (5), the remaining external coalescence leads to m eff (w cc (m, A t ), A t ) = m(1 − λ(h)), cf. Eq. (7), which determines the N → ∞ limit of linear-regression estimates [19]. The adaptive (activity-dependent) weight, Eq. (S2), bridges the gap to the macroscopic description (Fig. S1). The conditional expectation value for the coalescencecompensating network with adaptive weights now coincides with that of a branching process (dashed line), different from the conditional expectation value of the branching network subject to coalescence (solid line), see Ref. [19].
We note that Eq. (S2) is well defined only for m < 1, because for m = 1 the adaptive weights diverge at  Figure S1. Universal scaling function for effective spreading of network activity as discussed in Ref. [19] (m = 1 in the separation-of-timescale regime). For the branching network (solid line), the scaling function is non-linear, causing a bias in linear estimates of the branching parameter through network rate. However, for coalescence-compensating networks (data points), the conditional expectation value is again linear and is described by the scaling function of a branching process (dashed line).
A t = N . This is explained by the fact that for critical-like dynamics (m = 1) the compensation of coalescence induces an avalanche-size distribution with a perfect power law behavior. A perfect power law, however, includes avalanches of all sizes. For a finite network with coalescence compensation this mathematically results in a second absorbing state at full network activity.
We need to avoid the absorbing state at full network activity for m → 1, because the microscopic dynamics would not allow to ever leave this state again. For this, we will derive a cutoff weight w cc (m, N ) = w † cc such that the probability to transition away from A t = N is nonzero.
To derive the cutoff weight, we need to calculate the probability that all neurons are activated, if in the previous time step already all neurons were active. Recall for the probability to activate a single neuron given that A t neurons are active. Now, we restrict our discussion to the divergent case mA t = N in Eq. (S2) (defining for m = 1 w † cc = w cc (1, N )) and further neglect the external input rate (h → 0), because the divergence only occurs for m → 1, where the external input rate has to be small for persistent activity. Then, the probability to activate all neurons in the network, given that all neurons where activated already in the previous time step is (S3) The probability to transition away from A t = N is given by the probability to not activate at least one neuron, i.e., by 1 − p all . We now ask, how to choose w † cc such that once the network is fully active, there is a non-vanishing probability to transition away from A t = N , i.e., that p all < 1. For this we solve Eq. (S3) for w † cc and obtain where we have introduced α = ln(p all ) < 0. Next, we aim for an asymptotic expansion around N → ∞. For large N we can expand the exponential function e α/N = 1 + α/N + (α/N ) 2 /2 + . . . such that Restricting ourselves to the leading order we get, Ensuring that lim N →∞ 1 N ln(−α/N ) = 0 according to l'Hospital rule and expanding e x 1 + x, we obtain Equation (S7) allows to set the probability to transition away from the state of full network activity. We chose such that the state of full network activity is guaranteed to be not absorbing. Our explicit choice for w † cc has several motivations: First, it is the simplest choice; Second, for m = 1 Eq. (S2) yields w cc (m = 1, cc ≈ w cc (m = 1, N − 1); and last, w † cc highlights the interpretation in terms of a random Erdős-Rényi network, namely that in the limit N → ∞ weights with probability ln(N )/N set us at the transition between disconnected graphs (p all = 0) and fully connected graphs (p all = 1).

S2. Adaptive weights extend the range of critical branching statistics
Above, we argued that compensating for internal coalescence will extend the range over which a finite network will reproduce statistics of a true branching process. In the following, we will show that this is indeed the case even for critical dynamics (m = 1).
We here complement Fig. 2 (N = 100) in the main text by large networks with N = 10 4 neurons and critical dynamics (m = 1) in the separation-of-timescale regime (Fig. S2). Compared to a branching network of the same size, the avalanche-size distribution and avalancheduration distributions in the coalescence-compensating model show a drastically extended power law. In particular, it appears that the power-law regime is extended from avalanche sizes s ∼ N to s ∼ N 2 and from avalanche duration d ∼ √ N to d ∼ N . The results are preserved in the driven regime with small external input rate h 1 (not shown).
We can understand the extended range of critical branching statistics by considering the universal shape of avalanches. It was shown that the average time development A(t, d) of neural avalanches with the same duration d collapses on a universal shape F(x), if properly rescaled as A(t, d) = dF(t/d) for our model [28]. It can be anticipated that the power-law characteristics in finitesize branching networks extends up to s max = O(N ). Considering any reasonable parabolic avalanche shape F(x) between rectangular and triangular, the area always scales as s ∼ d A peak where the peak itself scales as A peak ∼ d (according to the universal avalanche shape where F peak = const) such that s ∼ d 2 . With a maximal avalanche size s max = O(N ), we thus expect a maximal avalanche duration d max = O( √ N ). This expectation nicely agrees with our numerical observation ( Fig. S2 and main Fig. 2). In particular, this means that -due to coalescence -typical avalanches of duration d max have a maximum number of simultaneously activated neurons that scales as A peak max ∼ d max ∼ √ N . Compensating coalescence now shifts the potential maximum number of simultaneously activated neurons from A peak max ∼ √ N to A peak max ∼ N . Because of the universal avalanche shape, the maximum duration consequently scales as d max ∼ A peak max ∼ N and the maximal size then scales as s max ∼ d max A peak max ∼ N 2 .
S3. Adaptive-weight networks have dynamic range and discriminable intervals with properties from both branching process and branching network If compensating for coalescence extends the range of avalanche statistics, one could conclude, that the resulting coalescence-compensating model is "more critical" than the branching network. Intuitively, one may expect that this leads to more optimal information processing and other benefits that come with operating with critical dynamics. For example, one could expect that the dynamic range, which is maximal for critical-like dynamics [10], is even larger in coalescence-compensating networks with critical dynamics. In the following, we will show that on the level of a single network, the dynamic range is not improved when we compensate for coalescence.
To analytically calculate the dynamic range of the branching network, we start with the mean-field approximation Eq. For the coalescence-compensating network, we first need to calculate the neuron rate as response to external input rate. Assuming stationary activity A ≈ A t+1 |A t , we can use Eq. (7) to write down the meanfield approximation A cc = m[1 − λ(h)]A cc + N λ(h). (S14) Solving this for the neuron rate, we obtain a cc (m, h) = λ(h) 1 − m(1 − λ(h)) . (S15) This rate is only finite and non-negative for m < 1. In this range, a min = 0 and a max = 1 always, such that a x = x. Calculating the inverse of Eq. (S15), we find the dynamic range for coalescence-compensating networks (m < 1) (S17) For coalescence-compensating networks the interval of discriminable input rates is a function of the branching parameter (Fig. S3).
The results for the coalescence-compensating network are consistent with a modified branching process. Consider a branching process with upper-bound population activity A t ≤ N and external input per time  Figure S3. Influence of compensating coalescence on the discriminable interval. For the branching network, the discriminable interval barely changes especially for branching parameters close to critical-like dynamics (m = 1). For the coalescence-compensating network, the discriminable interval becomes a function of the branching parameter.
step N h∆t. The stationary activity is A bp (m, h) = N h∆t/(1 − m) [23]. The discriminating activity is defined as A x = xN such that the inverse of the activity yields h bp (A x ) = (1 − m)x/∆t, (S18) and the dynamic range is independent of the branching parameter ∆ bp (m) = 10 log 10 (0.9/0.1) ≈ 9.5. (S19) Importantly, the discriminable interval [h bp (A 0.1 ), h bp (A 0.9 )] is highly depending on m.