Efficiency of local learning rules in threshold-linear associative networks

We show that associative networks of threshold linear units endowed with Hebbian learning can operate closer to the Gardner optimal storage capacity than their binary counterparts and even surpass this bound. This is largely achieved through a sparsification of the retrieved patterns, which we analyze for theoretical and empirical distributions of activity. As reaching the optimal capacity via non-local learning rules like back-propagation requires slow and neurally implausible training procedures, our results indicate that one-shot self-organized Hebbian learning can be just as efficient.


INTRODUCTION
Local learning rules, those that change synaptic weights depending solely on pre- and post-synaptic activation, are generally considered to be more biologically plausible than non-local ones. They can be implemented through extensively studied processes in the synapses [1] and they allow neural networks to self-organise into content-addressable memory devices [2-4]. But how effective are local learning rules? Quite ineffective, has been the received wisdom since the '80s, when back-propagation algorithms came to the fore. However, this received wisdom is based on analysing networks of binary units [3,5-7], while neurons in the brain are not binary.
A better, but still mathematically simple description of neuronal input-current-to-impulse-frequency transduction is via a threshold-linear (TL) transfer function [8-10], which is also the activation function predominantly adopted in recent deep learning applications [11-14] (often called, in that context, the Rectified Linear Unit, or ReLU). Therefore, one may ask whether the results from the '80s highlighting the contrast between the effective, iterative procedures used in machine learning and the self-organized, one-shot, perhaps computationally ineffective Hebbian learning are valid beyond binary units [15].

The Hopfield model [3], which includes a simple Hebbian [2] prescription for structuring all the connection weights in one go, had been analysed and found to be able to retrieve only up to p_max ≈ 0.14N activity patterns, in a network of N binary units [5], with C = N − 1 input connections per unit. In contrast, Elizabeth Gardner showed [7] that the optimal capacity such a network can attain is p_max = 2C, about 14 times higher. This optimal capacity can be approached with iterative procedures based on the notion of a desired output for each unit, and of progressively reducing the difference between current and desired output, like backpropagation in multi-layer neural networks. This consolidated the impression that unsupervised, Hebbian plasticity may well be of biological interest, but is rather inefficient and unsuitable for performance-driven machine learning applications. The negative characterization was not redeemed by the finding that in sparsely connected nets, those where C ≪ N, the pattern capacity α_c = p_max/C can be closer to the Gardner bound (a factor ≈ 3 away) [6], and can approach it when the coding is sparse, i.e., when the fraction f of units active in each pattern satisfies f ≪ 1 [16].

What about TL units? Are they more efficient in the unsupervised learning of memory patterns? Here we derive and evaluate a closed set of equations for the optimal pattern capacity à la Gardner in networks of TL units, as a function of the fraction f of active units, and test our results by learning those weights with a TL perceptron. We show that, while for binary stored patterns this errorless capacity is larger than the Hebbian capacity no matter how sparse the code, the same does not hold in general for non-binary stored patterns: for other distributions the Hebbian capacity can even surpass the Gardner bound. This perhaps surprising violation of the bound arises because the Gardner calculation imposes infinite output precision [17], while Hebbian learning exploits its loose precision to sparsify the retrieved pattern. In other words, with TL units, Hebbian capacity can get much closer to the optimal capacity, or even surpass it, by permitting errors and retrieving a sparser version of the stored pattern. Experimentally observed firing activity distributions from the inferior-temporal cortex [18], taken as patterns to be stored, would be sparsified by about 50% by Hebbian learning, and would reach about 50% − 80% of the Gardner capacity.

MODEL DESCRIPTION
We consider a network of N units and p patterns of activity, {η_i^µ}, i = 1,..,N, µ = 1,..,p, each representing one memory stored in the connection weights via some procedure. Each η_i^µ is drawn independently for each unit i and each memory µ from a common distribution Pr(η). We denote the activity of each unit i by v_i and assume that it is determined by the activity of the C units feeding into it, through a TL activation function

v_i = g \Big[ \sum_j J_{ij} v_j - \vartheta \Big]_+ ,    (1)

where [x]_+ = x for x > 0 and 0 otherwise, and both the gain g and the threshold ϑ are fixed parameters, set for the whole network. In a fully connected, recurrent network, C = N − 1, but we intend to consider also more general cases of diluted connectivity. The storage capacity α_c ≡ p_max/C is then the maximal number of memories that the network can store and individually retrieve, per input connection. The synaptic weights J_ij are taken to satisfy, for all i, the spherical normalization condition

\sum_{j \neq i} J_{ij}^2 = C .    (2)

We are interested in finding the set of J_ij that satisfy Eq. (2), such that the patterns {η_i^µ} are self-consistent solutions of Eq. (1), namely that for all i and µ we have

η_i^µ = g \Big[ \sum_{j \neq i} J_{ij} η_j^µ - \vartheta \Big]_+ .
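As a concrete illustration of Eqs. (1) and (2), the minimal sketch below implements the TL dynamics and the self-consistency check in Python. The pattern distribution and the Hebbian-style placeholder weights are our own illustrative assumptions, not the learning rules analyzed in this paper.

```python
import numpy as np

def tl(h, g=1.0, theta=0.0):
    """Threshold-linear transfer function of Eq. (1): g * [h - theta]_+."""
    return g * np.maximum(h - theta, 0.0)

rng = np.random.default_rng(0)
N, p, g, theta = 1000, 50, 1.0, 0.0
f = 0.3  # fraction of active units (illustrative choice)

# Patterns eta[mu, i], i.i.d. from a common Pr(eta); here an exponential
# distribution with an atom at zero, purely for illustration.
eta = rng.exponential(1.0, size=(p, N)) * (rng.random((p, N)) < f)

# Placeholder weights (any learning procedure would go here), with zero
# diagonal and each row rescaled to the spherical constraint of Eq. (2):
# sum_{j != i} J_ij^2 = C.
J = eta.T @ eta / N
np.fill_diagonal(J, 0.0)
C = N - 1
J *= np.sqrt(C / (J**2).sum(axis=1, keepdims=True))

# Pattern mu is a self-consistent solution if, for all i,
# eta_i = g*[sum_j J_ij eta_j - theta]_+ ; with placeholder weights
# the error is generally nonzero.
v = tl(J @ eta[0], g, theta)
print("max self-consistency error:", np.abs(v - eta[0]).max())
```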

REPLICA ANALYSIS
Adapting the procedure introduced by Elizabeth Gardner [7] for binary units to our network, we evaluate the fractional volume of the space of the interactions J_ij which satisfy Eqs. (2) and (1); using the replica trick we obtain the standard order parameters m_a = (1/√C) \sum_j J_{ij}^a and q_ab = (1/C) \sum_j J_{ij}^a J_{ij}^b, corresponding, respectively, to the average of the weights within each replica and to their overlap between replicas (Supplemental Material at [URL], Sect. A). The replica-symmetric ansatz simplifies these to q_ab ≡ q and m_a ≡ m. We focus on increasing the number p of stored memories, in the C → ∞ limit, up to the point where the volume of compatible weights shrinks to a single point, i.e., there is a unique solution, and the maximal storage capacity has been reached. For this purpose we take the limit q → 1, corresponding to the case where all the replicated weights are equal, implying that only one configuration exists which satisfies the equations. Adding a further memory pattern would make it impossible, in general, to satisfy all the equations.
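For reference, the standard replica identity underlying this computation, together with the replica-symmetric ansatz just mentioned, reads (textbook material, not specific to this model):

```latex
\langle \log V \rangle_{\eta}
  = \lim_{n \to 0} \frac{\langle V^n \rangle_{\eta} - 1}{n},
\qquad
q_{ab} \equiv q \;\; (a \neq b), \quad m_a \equiv m .
```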
We have derived a system of two equations, Eq. (3), for the maximal storage capacity α_c = p_max/C of a network of threshold-linear units, in which we have introduced averages over Pr(η): x is the normalized difference between the threshold and the mean input, while f = Pr(η > 0) is the fraction of active units. The two equations yield x (hence implicitly setting an optimal value for ϑ) and α_c. Note that both equations can be understood as averages over units, of the actual input and of the squared input respectively, which determine the amount of quenched noise and hence the storage capacity.
The storage capacity then depends on the proportion f of active units, but also on the gain g and on the moments d_1 and d_3 of the distribution Pr(η). As can be seen in Fig. 1a, at fixed g, the storage capacity increases as more and more units remain below threshold, ceasing to contribute to the quenched noise. In fact, the storage capacity diverges for f → 0; see Supplemental Material at [URL], Sect. B for the full derivation. At fixed f, there is an initially fast increase with g, followed by a plateau for larger values of g. One can show that α_c → g²/(g² + 1) as f → 1, i.e., when all the units in the memory patterns are above threshold; hence α_c < 1 for any finite g. At first sight this may seem absurd: a linear system of N² independent equations in N² variables always has a solution, which would lead to a storage capacity of (at least) one. Similar to what was already noted in [17], however, the inverse solution does not generally satisfy the spherical constraint in Eq. (2); it does, in our case, in the limit g → ∞, which can also be understood as the reason why the capacity is highest when g is very large. In practice, Fig. 1 indicates that over a broad range of f values the storage capacity approaches its g → ∞ limit already for moderate values of the gain, while the dependence on d_1 and d_3 is only noticeable for small g, as can be seen by comparing Figs. 1c and 1d. In the g → ∞ limit, Eq. (3) reduces to a universal bound, α_c^G(f), which depends on the distribution of the patterns only through f (Supplemental Material, Sect. A).

COMPARISON WITH A HEBBIAN RULE: THEORETICAL ANALYSIS
As described in the Introduction, when used in a fully connected Hopfield network of binary units, without sparse coding, Hebbian learning could only reach ∼ 1/14 of the Gardner bound. With very sparse connectivity, however, the same network can get to 1/π of the bound, even if it gets there through a second-order phase transition, where the retrieved pattern has vanishing overlap with the stored one [6]. With TL units, the appropriate comparison to the capacity à la Gardner is therefore that of a Hebbian network with extremely diluted connectivity, as in this limit the weights J_ij and J_ji are effectively independent and noise reverberation through loops is negligible. The capacity of such a sparsely connected TL network was evaluated analytically in [19]. Whereas in the g → ∞ limit the Gardner capacity depends on Pr(η) only via f, for Hebbian networks it does depend on the distribution, and most importantly on a, the sparsity, whose relation to f depends on the distribution [19]. One can see in Fig. 2a that when attention is restricted to binary patterns, the Gardner capacity seems to provide an upper bound to the capacity reached with Hebbian learning; more structured distributions of activity, however, dispel this false impression: the quaternary example already shows higher capacity for sufficiently sparse patterns. The bound, in fact, applies only to perfect, errorless retrieval, whereas Hebbian learning creates attractors which are, up to the Hebbian capacity limit, correlated with but not identical to the stored patterns, similarly to what occurs with binary units [5]; in particular, we notice that with TL units and Hebbian learning, in order to get close to the capacity limit, the threshold has to be set so as to produce sparser patterns at retrieval, in which only the units with the strongest inputs remain active. The sparsity a_r = ⟨v_i^µ⟩² / ⟨(v_i^µ)²⟩ of the retrieved memory can be calculated [19]; see Supplemental Material at [URL], Sect. D for an outline. Fig. 2b shows the ratio of the sparsity of the retrieved pattern produced by Hebbian learning, denoted a_r^H, to that of the stored pattern, a, vs. f. As one can see, except for binary patterns at low f, the retrieved patterns, at the storage capacity, are always sparser than the stored ones. The largest sparsification happens for quaternary patterns, for which the Hebbian capacity overtakes the bound on errorless retrieval at low f. Sparser patterns emerge because, to get close to α_c^H, ϑ has to be such as to inactivate most of the units with intermediate activity levels in the stored pattern. Of course, the perspective is different if α_c^H is considered as a function of a_r instead of a, in which case the Gardner capacity remains unchanged, as it implies retrieval with a_r = a, and lies above α_c^H for each of the 3 sample distributions; see Fig. 1 of the Supplemental Material at [URL].
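As a small illustration of the sparsity measure used here, a = ⟨v⟩²/⟨v²⟩, the sketch below computes it from samples; for a binary pattern it reduces to the active fraction f, while for the exponential case it gives a = f/2, consistent with the f = 2a relation used in Sect. E of the Supplemental Material. Distribution parameters are illustrative.

```python
import numpy as np

def sparsity(v):
    """Sparsity a = <v>^2 / <v^2>; equals the active fraction f for binary v."""
    v = np.asarray(v, dtype=float)
    return v.mean()**2 / (v**2).mean()

rng = np.random.default_rng(1)
f, n = 0.2, 100_000
binary = (rng.random(n) < f).astype(float)             # expect a ~ f
expo = rng.exponential(1.0, n) * (rng.random(n) < f)   # expect a ~ f/2
print(sparsity(binary), sparsity(expo))                # ~0.2, ~0.1
```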

COMPARISON WITH A HEBBIAN RULE: EXPERIMENTAL DATA
Having established that the Hebbian capacity of TL networks can surpass the Gardner bound on errorless retrieval for some distributions, we ask what would happen with the distribution of firing rates naturally occurring in the brain. Here we consider published distributions of single units in infero-temporal visual cortex, recorded while the animals were watching short naturalistic movies [18]. Such distributions can be taken as examples of patterns elicited by the visual stimulus, to be stored with Hebbian learning, given appropriate conditions, and later retrieved through attractor dynamics, triggered by a partial cue [20-24]. How many such patterns can be stored, and with what accompanying sparsification? Figs. 3a and 3b show the analysis of two sample distributions (the top-right and top-left cells in Fig. 2 of [18]). The observed distributions, in blue, with the "Gardner" label, are those we assume could be stored, and which could be retrieved exactly as they are with a suitable training procedure bound by the Gardner capacity. In orange, instead, we plot the distribution that would be retrieved following Hebbian learning operating at its capacity; see Supplemental Material at [URL], Sect. F, for the estimation of the retrieved distribution. Note that the absolute scale of the retrieved firing rate is arbitrary; what is fixed is only the shape of the distribution, which is sparser (as is clear already from the higher bar at zero). The pattern in Fig. 3a, which has a < 0.5, could also be fitted with a one-parameter exponential shape having f = 2a (see Supplemental Material at [URL], Sect. E). In that panel we also report the corresponding capacity values. Fig. 3c shows both α_c^G(f) and α_c^{Hexp}(f); on top of these curves, and in the inset, we indicate as diamonds the values calculated for the 9 empirical distributions present in [18], and as circles the fitted values for those which could be fitted to an exponential.
There are three conclusions that we can draw from these data. First, the Hebbian capacity computed from the empirical distributions is about 80% of that of the exponential fit, when available. Second, for distributions like those of these neurons, the capacity achieved by Hebbian learning is about 50% − 80% of the Gardner capacity for errorless retrieval, depending on the neuron and on whether we take its discrete distribution "as is" or fit it with a continuous exponential. Third, Hebbian learning leads to retrieved patterns which are 2 − 3 times sparser than the stored patterns, again depending on the particular distribution and on whether we take the empirical distributions or their exponential fit. The empirical distributions achieve a lower capacity than that of their exponential fit, which leads to further sparsification at retrieval. This is illustrated in Fig. 3d, which shows the ratio of the sparsity of patterns retrieved after Hebbian storage to that of the originally stored pattern, vs. f.

DISCUSSION
The general notion of attractor neural networks (ANN) has been instrumental in conceptualizing the storage of long-term memories, including those experimentally accessible, such as spatially selective memories in rodents. For example, the activity of place cells [25,26] and grid cells [27] has been analyzed in terms of putative low-dimensional attractor manifolds [28,29]. In detail, however, the applicability of advanced mathematical results [30,31] has been challenged by several cortically implausible assumptions incorporated in the early models. In particular, an understanding of the effectiveness of Hebbian learning has been hampered by the fact that a natural benchmark, the Gardner bound on the storage capacity, had initially been derived only for binary units. The bound for binary units, as described in the Introduction, was found to be far above the Hebbian capacity, but then no binary or quasi-binary pattern of activity has ever been observed in the cerebral cortex. A few studies have considered non-binary units: TL networks have been shown to be less susceptible to spin-glass effects [32] and to mix-ups of memory states [33], but in the framework of calculations à la Gardner they have focused on issues other than associative networks storing sparse representations. For instance, in [17] the authors carried out a replica analysis for a generic gain function, but then focused their study on the tanh activation function and on activity patterns that were not neurally motivated. Clopath and Brunel [34] considered monotonically increasing activation functions under the constraint of non-negative weights, as a model of cerebellar Purkinje cells; more recently, it has been shown with a replica analysis [35] that TL units can largely tolerate perturbations in the exact values of the weights and of the inputs.
Here, we report the analytical derivation of the Gardner capacity for networks of TL units, validate the result with TL perceptron training, and compare it with the performance of networks with Hebbian weights. We argue that the comparison has to be framed in terms of the difference between stored and retrieved patterns, which becomes more salient the higher the storage load of the Hebbian network. It remains to be assessed whether a stability parameter, comparable to the κ used in the original Gardner calculation [7], could be considered also for TL units. Further understanding would also derive from a comparison of the maximal information content per synapse when patterns are stored via Hebbian or iterative learning, as previously performed for binary units [36].
For typical cortical distributions of activity in visual cortex, Hebbian one-shot local learning already utilizes 50% − 80% of the capacity available for errorless retrieval, while leading to markedly sparser retrieved activity. In the extreme case in which only the most active cells remain active, those retrieved memories cannot be regarded as the full pattern, with its entire information content, but rather as a pointer, effective perhaps to address the full memory elsewhere, as posited in index theories of two-stage memory retrieval [37].

A. Derivation of the storage capacity
We start by considering a single threshold-linear unit whose activity is denoted by u. The neuron receives C inputs v_j, j = 1, ..., C, through synaptic weights J_j. The activity of the neuron is determined through the threshold-linear activation function

u = g \Big[ \sum_{j=1}^{C} J_j v_j - \vartheta \Big]_+ .

We assume that we have p patterns of activity over the inputs, indexed by µ = 1, ..., p, which we denote by ξ_j^µ. With each input pattern µ we also associate a desired output activity of the neuron, denoted η^µ. We are interested in finding how many patterns can be stored in the synaptic weights, such that the input activity elicits the desired output activity, assuming that the synaptic weights satisfy the spherical constraint

\sum_{j=1}^{C} J_j^2 = C .

This task essentially boils down to calculating the expectation, over the distribution of η and ξ, of the logarithm of the fractional volume V of the interaction space,

V = \frac{ \int \prod_j dJ_j \, \delta\big( \sum_j J_j^2 - C \big) \prod_{µ:\, η^µ > 0} \delta\big( η^µ/g + \vartheta - \sum_j J_j ξ_j^µ \big) \prod_{µ:\, η^µ = 0} \Theta\big( \vartheta - \sum_j J_j ξ_j^µ \big) }{ \int \prod_j dJ_j \, \delta\big( \sum_j J_j^2 - C \big) } ,

in which patterns with η^µ > 0 impose an equality on the input, while patterns with η^µ = 0 only require the input to remain below threshold. To calculate ⟨log V⟩_{η,ξ} we use the replica trick; the initial task is to compute the replicated average ⟨V^n⟩_ξ. We first compute the numerator. To compute the averages over ξ in the numerator, we write the delta function through its integral representation, Eq. (5), and the average of the Heaviside function through an analogous representation, Eq. (6). We then use these identities to compute the quantity M(q_ab, m_a) appearing in the numerator of Eq. (4), for independently drawn ξ, as in Eq. (7). In order to compute the average of the delta functions in Eq. (7), we use a Gaussian approximation, as in Eq. (9); in going from the second to the third line of Eq. (9), we have used the fact that ⟨ξ_j^µ ξ_k^µ⟩ = ⟨ξ_j^µ⟩⟨ξ_k^µ⟩ for j ≠ k. Expanding the second exponential in the second line of Eq. (5), we can write, in the large-C limit, Eq. (10), in which we have assumed symmetric replicas and defined the input moments d^{inp}_{1,2,3}. Similarly, using the identity in Eq. (6), we obtain Eq. (12).
Using Eqs. (10) and (12), the quantity M(q_ab, m_a) defined through Eq. (7) can be written as in Eq. (13). We now insert Eq. (13) back into Eq. (4), enforce the definitions of m and q in Eq. (11) using the corresponding integral identities, and impose the normalization in Eq. (4), such that the numerator of Eq. (4) can be written as

\int \prod_{j,a} dJ_j^a \; \exp\Big( -\sum_{a,j} \frac{E_a}{2} (J_j^a)^2 + \sum_{a,j} \hat m_a J_j^a + \sum_{a<b} \hat q_{ab} J_j^a J_j^b \Big) .
Defining the function A as in Eq. (17), we can write Eq. (18), and then compute A using the saddle-point approximation, by maximizing the argument of the exponential, that is, maximising the expression in Eq. (19). In order to carry out this extremisation we assume a replica-symmetric ansatz; with these assumptions we obtain Eq. (21). In Eq. (21), W and M are calculated using the n → 0 limits of the expressions in Eqs. (13) and (17), as follows. For W, we use the Gaussian trick, combined with the replica-symmetric expression for W, to get Eq. (24). Using a^n ≈ 1 + n log a and log(1 + a) ≈ a, we are left with Gaussian integrals over J and t, which can be performed using the identity, valid for general parameters a > 0 and b,

\int_{-\infty}^{+\infty} dJ \; e^{-\frac{a}{2} J^2 + b J} = \sqrt{\frac{2\pi}{a}} \; e^{\frac{b^2}{2a}} .

Integrating over J in Eq. (24), and then over t, finally leads to the expression for W in Eq. (26). Computing M is a bit more tricky,
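This Gaussian identity follows in one step by completing the square in the exponent; a worked line, for reference:

```latex
-\frac{a}{2}J^2 + bJ
  = -\frac{a}{2}\Big(J - \frac{b}{a}\Big)^2 + \frac{b^2}{2a},
\qquad
\int_{-\infty}^{+\infty} e^{-\frac{a}{2}\left(J - b/a\right)^2}\,\mathrm{d}J
  = \sqrt{\frac{2\pi}{a}} .
```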
as one has to compute I_1(q, m, η^µ) and I_2(q, m). Using the Gaussian trick in Eq. (22) and assuming replica symmetry, we rewrite Eq. (10) as Eq. (28), with Dt = \frac{dt}{\sqrt{2\pi}} e^{-t^2/2}. In a very similar way we can write Eq. (12) as Eq. (29). We define P(η^µ > 0) = f and rewrite Eq. (13) accordingly. Simplifying Eqs. (28) and (29) for the sake of visualization, one can use again a^n ≈ 1 + n log a and log(1 + a) ≈ a, valid for n → 0, to write M(q, m) as in Eq. (30). Turning back to the original notation, we can further develop the two terms composing this approximation, where in the last passage we make a simple change of variables; this allows us to rewrite Eq. (30) as Eq. (36). Now we can evaluate the derivatives of G = G(q, \hat q, m, \hat m, E), given by Eq. (21), and set them to zero to find its maximum, with W(\hat m, \hat q, E) given by Eq. (26) and M(q, m) given by Eq. (36).
Setting the first three derivatives to zero, which involves only the second and third terms of Eq. (21), and assuming C\hat q \gg \hat m^2 and |C(1 - 2\hat q)| \gg \hat m^2 as C → ∞, we obtain the relations for \hat m, \hat q and E. The derivative in q requires, in addition, an integration by parts of the term multiplied by (1 − f), which leads to the simplified solution in which α ≡ p/C appears as the storage load. As explained in the main text, we take the limit q → 1, in which α becomes the critical storage capacity α_c. This limit enables us to further simplify the above equations, where in the second approximation we have Taylor-expanded H(x) around x = 0. The application of the limit q → 1 with the above approximations, and the introduction of the variable x (the normalized difference between the threshold and the mean input), leads to the final set of equations for the critical storage capacity, where d^{out}_{1,2,3} are defined in the same way as d^{inp}_{1,2,3}, except that the averages are now over the output distribution of η.

Starting from the calculation reported above for the threshold-linear perceptron, it is straightforward to calculate the optimal capacity of a network of threshold-linear units. Considering the network defined through Eq. (1) of the main text, the total volume V_T we need to calculate can be written as the product of the individual volumes of the connection weights towards each unit, V_T = \prod_{i}^{N} V_i, so that ⟨log V_T⟩_η = N ⟨log V_i⟩_η, and we are essentially dealing with individual perceptrons like the one we just studied, with input and output moments now drawn from the same distribution, d^{inp}_{1,2,3} = d^{out}_{1,2,3}. As explained in the main text, we evaluate the maximal storage capacity in the g → ∞ limit, which is effectively reached already for moderate values of g. In this limit, Eq. (3) of the main text reduces to the universal α_c^G bound for errorless retrieval, which depends on the distribution of the patterns only through f.

B. Derivation of the limits
From Eq. (3) of the main text it is possible to evaluate the two limits of very sparse and non-sparse coding. First, a simple substitution at f = 1 leads to

α_c = \frac{g^2}{g^2 + 1} .

The case f → 0 is a bit trickier. We first rearrange the first equation of Eq. (3) as Eq. (47). As f goes to zero, for the left-hand side to remain equal to the right-hand side, we must have x → ∞. We therefore use the asymptotic expansion of the Gaussian tail H(x) to write the right-hand side of Eq. (47) as Eq. (48). We find a solution to Eq. (48) through the following iterative procedure. We first solve the leading term, Eq. (49). We then insert x from Eq. (49) into

e^{-x^2/2} = \sqrt{2\pi} \, f x^3

to obtain the logarithmic correction, Eq. (50), where in the last passage we have used the Taylor expansion \sqrt{1 - y} = 1 - \frac{y}{2} + O(y^2) around y = 0, since the correction term vanishes as f → 0.
We have tested numerically that the above expression, Eq. (50), for x is indeed a solution of Eq. (47) for f → 0.
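A minimal numerical check of this iterative scheme, solving the leading-order relation e^{−x²/2} = √(2π) f x³ quoted above by fixed-point iteration (our own sketch; the f values and iteration count are illustrative):

```python
import numpy as np

def x_of_f(f, n_iter=100):
    """Fixed-point iteration for exp(-x^2/2) = sqrt(2*pi) * f * x**3."""
    x = np.sqrt(2.0 * np.log(1.0 / f))  # leading-order seed, Eq. (49)
    for _ in range(n_iter):
        # rearranged: x = sqrt(-2 * ln(sqrt(2*pi) * f * x^3))
        x = np.sqrt(-2.0 * np.log(np.sqrt(2.0 * np.pi) * f * x**3))
    return x

for f in (1e-3, 1e-5, 1e-7):
    x = x_of_f(f)
    residual = np.exp(-x**2 / 2) - np.sqrt(2 * np.pi) * f * x**3
    print(f"f = {f:.0e}:  x = {x:.4f},  residual = {residual:.2e}")
```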
We now proceed to evaluate α_c, applying the same Taylor expansion as before. To summarise, in the limit f → 0 we obtain x and α_c to leading order; substituting x into α_c leads to Eq. (5) presented in the main text.

C. Training algorithm
For the purpose of assessing whether the Gardner capacity for errorless retrieval can be reached with explicit training, we can decompose a network of, say, N + 1 = 10001 units into N + 1 independent threshold-linear perceptrons. A threshold-linear perceptron is just a one-layer feedforward neural network with N inputs and one output, the activity of which is given by the threshold-linear activation function

[h]_+ = max(0, h) .

The network is trained with p patterns. One can then think of the input as a matrix ξ of dimension N × p and of the output as a vector η of dimension 1 × p.
The aim of the algorithm is to tune the weights such that all p patterns are memorized. We start from an initial connectivity vector J_0 of dimension 1 × N and estimate the output \tilde η, where g is the gain parameter. We then compare the estimated output \tilde η with the desired output η through the loss function L. The TL perceptron algorithm can be seen as a stripped-down version of backpropagation for a one-layer network: the weights J are modified by gradient descent to minimize the loss over steps k = 1, ..., k_MAX, where k_MAX is the number of steps needed for gradient descent to reach the minimum, dL(J_k)/dJ_k = 0. If at the minimum L(J_{k_MAX}) = 0, then at least one set of weights exists for errorless retrieval at that value of p. The storage capacity α_c = p_max/N is evaluated by estimating p_max as the highest p for which L(J_{k_MAX}) = 0 can be reached. Initializing the weights around zero facilitates reaching the minimum. The chain rule that in general implements gradient descent in backpropagation reduces, in this case, to an update involving Θ(\tilde η), the Heaviside step function applied elementwise to \tilde η, and a learning rate γ, which we vary in order to facilitate reaching the minimum.
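A compact sketch of this training loop. The quadratic loss L = ½ Σ_µ (η̂^µ − η^µ)², the absence of an explicit threshold term, and all numerical values below are our own illustrative assumptions; as noted above, the learning rate may need tuning.

```python
import numpy as np

def train_tl_perceptron(xi, eta, g=1.0, gamma=5e-4, k_max=200_000, tol=1e-10):
    """Gradient descent for one threshold-linear perceptron, eta_hat = g*[J @ xi]_+.

    xi:  (N, p) matrix of input patterns; eta: (p,) desired outputs.
    Returns the trained weights and the final loss (loss ~ 0 => errorless storage).
    """
    N, p = xi.shape
    rng = np.random.default_rng(0)
    J = rng.normal(0.0, 1e-3, size=N)        # initialize near zero
    loss = np.inf
    for _ in range(k_max):
        h = J @ xi                            # (p,) pre-activations
        eta_hat = g * np.maximum(h, 0.0)
        err = eta_hat - eta
        loss = 0.5 * np.dot(err, err)
        if loss < tol:
            break
        # dL/dJ: the Heaviside factor (h > 0) comes from differentiating [.]_+
        J -= gamma * (xi @ (err * g * (h > 0)))
    return J, loss

# Usage: check whether p patterns can be stored without error.
rng = np.random.default_rng(1)
N, p, f = 100, 30, 0.5
xi = rng.exponential(1.0, (N, p)) * (rng.random((N, p)) < f)
eta = rng.exponential(1.0, p) * (rng.random(p) < f)
J, loss = train_tl_perceptron(xi, eta)
print("final loss:", loss)
```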

D. Hebbian capacity and sparsity of the retrieved pattern
From the calculation reported in [1], it can be shown that, for a network of threshold-linear units described in Eq. (1) of the main text in which p patterns are stored through Hebbian learning, the storage capacity α_c can be found as the value of α for which the values v_c and w_c solving the equation exist at a single point of the (w, v) plane. The auxiliary variables v and w, defined as in [2] and dependent on the threshold ϑ, quantify the signal-to-noise ratios of, respectively, the specific signal of the pattern to be retrieved and the background, both relative to the noise due to the memory loading [1]. Estimating v_c and w_c translates into optimizing the threshold ϑ so that it maximizes the storage capacity.
In [1] the above expressions are reported assuming a = ⟨η⟩ = ⟨η²⟩; the equations used here do not make this assumption, and can be derived easily from the calculation reported in [1].
Following [1], the average activity and the average square activity in the patterns retrieved with Hebbian weights are calculated considering that the field, i.e. the input received by a cell with activity η in the memory, is normally distributed around a mean field proportional to x. If we call z a random variable normally distributed with mean zero and variance one, x is the mean field, properly normalized. With the threshold-linear transfer function, the output is g(x + z) for x + z > 0, and 0 otherwise, the latter occurring with probability Φ(−x), where Φ is the standard normal cumulative distribution function. Therefore the average activity and the average square activity are

⟨V⟩ = g ⟨ x Φ(x) + φ(x) ⟩ ,  ⟨V²⟩ = g² ⟨ (1 + x²) Φ(x) + x φ(x) ⟩ ,

with φ the standard normal density and the outer averages taken over the distribution of η; the sparsity of the retrieved memory is thus a_r^H = ⟨V⟩² / ⟨V²⟩. As reported in the main text, we have compared capacity values for binary, ternary, quaternary and exponential distributions. All these distributions are normalized such that ⟨η⟩ = \int_0^∞ dη Pr(η) η = a and ⟨η²⟩ = \int_0^∞ dη Pr(η) η² = a, so that a coincides with the sparsity ⟨η⟩²/⟨η²⟩ of the network. The fraction of active units is related to a as f = a, 9a/5, 9a/4 and 2a, respectively.
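A minimal numerical sketch of these moment formulas (the closed forms of the z-averages are standard Gaussian integrals); the values of w and v below are placeholders for illustration, not the saddle-point values w_c, v_c:

```python
import numpy as np
from scipy.stats import norm

def retrieved_moments(eta, w, v, g=1.0):
    """Moments of V = g*[x + z]_+, x = w + v*eta/<eta>, z ~ N(0,1), averaged over eta.

    Closed forms of the z-averages:
      E[(x+z)_+]   = x*Phi(x) + phi(x)
      E[(x+z)_+^2] = (1 + x**2)*Phi(x) + x*phi(x)
    """
    x = w + v * eta / eta.mean()
    m1 = g * (x * norm.cdf(x) + norm.pdf(x)).mean()
    m2 = g**2 * ((1 + x**2) * norm.cdf(x) + x * norm.pdf(x)).mean()
    return m1, m2

# Placeholder parameter values, for illustration only:
rng = np.random.default_rng(2)
f, n = 0.4, 1_000_000
eta = rng.exponential(1.0, n) * (rng.random(n) < f)
m1, m2 = retrieved_moments(eta, w=-1.0, v=1.5)
print("retrieved sparsity a_r =", m1**2 / m2)
```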
As a supplement to Fig. 2 of the main text, reproduced here in the 3 separate panels of the upper row of Fig. 1, we show a comparison between the Hebbian capacity and the Gardner one when plotted as a function of the output sparsity (in the bottom row of Fig. 1). The Gardner storage capacity is now, in each of these 3 cases, above the Hebbian capacity, when both are taken as functions of the output sparsity instead of the input one.

Fig. 1: Comparison between the Hebbian and Gardner storage capacities for 3 discrete distributions. The upper row takes as sparsity parameter that of the input pattern, the lower row that of the retrieved pattern. The Gardner capacity is that given by Eq. (3) of the main text.

E. Analytical derivation for the exponential distribution

In order to facilitate the comparison, we extend the analytical calculations in [1] and evaluate explicitly the expressions of Eqs. (57) and (58) for the exponential distribution. In general, for A_2 we write

F. Comparison with real data
In the real activity distributions we use, each neuron emits, in time bins of fixed duration (we use 100 ms), 0, ..., n, ..., n_max spikes, with relative frequency c_n, such that \sum_{n=0}^{n_max} c_n = 1. These values are taken from Fig. 2 of [3] and correspond to the histograms in blue in Fig. 2 below (and in Fig. 3 of the main text); they are assumed to be the distributions of the patterns to be stored. If the weights are those described by the Gardner calculation, these patterns can be retrieved as they are, and their distribution remains the same. If they are stored with Hebbian weights close to the maximal Hebbian capacity, however, the retrieved distributions look different, and they can be derived as follows.

The firing rate V of a neuron retrieving a stored pattern η is assumed proportional to w + vη/⟨η⟩ + z [1], where the parameters w and v are appropriately rescaled signal-to-noise ratios (general and pattern-specific), and the normally distributed random variable z, of zero mean and unitary variance, is taken to describe all other non-constant (noise) terms besides η itself. Averaging over z one can write, as in Eq. (62), that at the maximal capacity the mean rate is

V(η) = \tilde g \big[ x Φ(x) + φ(x) \big] ,

where x ≡ w + vη/⟨η⟩, and at the saddle point the parameters w and v take the values w_c and v_c that maximize the capacity, as explained in [1]. This implies setting an optimal value for the threshold ϑ, which in the analysis is absorbed into the parameter w, and which determines the sparsity of the retrieved distribution. The gain \tilde g remains, however, a free parameter that affects neither sparsity nor capacity; it is a rescaled version of the original gain g in the hypothetical TL transfer function. In other words, the maximal Hebbian capacity determines the shape of the retrieved activity distribution, but not its scale (e.g., in spikes per second).
To produce a histogram detailing the frequency with which the neuron would produce n spikes at retrieval, e.g. again in bins of 100 ms, one has to set this undetermined scale. We set it arbitrarily, with the rough requirement that the frequency of producing n_max spikes at retrieval be below what it is in the observed distribution, taken to describe storage, and negligible for n_max + 1 spikes. Having set the scale \tilde g, the frequency with which the neuron emits n spikes at retrieval, with 0 < n < n_max, is the probability that n − 1/2 < V < n + 1/2; that is, it is a sum over contributions from each stored level η,

P(n) = \sum_η c_η \Big[ Φ\Big( \frac{n + 1/2}{\tilde g} - x_η \Big) - Φ\Big( \frac{n - 1/2}{\tilde g} - x_η \Big) \Big] ,

with appropriate expressions for the two extreme bins. These are the distributions shown in Fig. 3 of the main text, and in Fig. 2 below. We took \tilde g = 1/2, as this value satisfies the a priori requirements and allows keeping the same number of bins in the retrieved memory as in the stored one (the coefficients sum up to one, to a very good approximation). Supplementary to Fig. 3 of the main text, we report in Fig. 2 the same analysis for all 9 single cells reported (using 100 ms bins) in [3].
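A sketch of this binning step, following the formula above: given observed spike-count frequencies c_n (the stored distribution) and the rescaled mean field x_η = w + vη/⟨η⟩, it integrates the Gaussian noise over each bin. The c_n values and the w, v parameters below are placeholders, not the data of [3] or the actual saddle-point values; g̃ = 1/2 follows the text.

```python
import numpy as np
from scipy.stats import norm

def retrieved_histogram(c, w, v, g_tilde=0.5):
    """Retrieved spike-count frequencies from stored ones c[n].

    V = g_tilde * [x + z]_+ with x = w + v*eta/<eta>, z ~ N(0,1);
    P(n) = sum_eta c_eta * Pr(n - 1/2 < V < n + 1/2).
    """
    n_max = len(c) - 1
    counts = np.arange(n_max + 1)
    mean_eta = (counts * c).sum()
    x = w + v * counts / mean_eta            # one x per stored spike count
    p = np.zeros(n_max + 1)
    for n in range(1, n_max + 1):            # top bin kept finite in this sketch
        lo, hi = (n - 0.5) / g_tilde, (n + 0.5) / g_tilde
        p[n] = (c * (norm.cdf(hi - x) - norm.cdf(lo - x))).sum()
    p[0] = 1.0 - p[1:].sum()                 # V < 1/2, including the atom at V = 0
    return p

# Placeholder stored distribution (not the data of Fig. 2 of [3]):
c = np.array([0.55, 0.20, 0.12, 0.08, 0.05])
print(retrieved_histogram(c, w=-1.0, v=1.5))
```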

In each panel we write the capacity à la Gardner and the Hebbian one (calculated without fitting an exponential) for the 9 empirical distributions, as well as the sparsity of the original distribution and the sparsity of the one that would be retrieved with Hebbian weights. For simplicity of visualization we also show the storage capacity values against each other, calculated à la Gardner and à la Hebb (again, without fitting an exponential), as a single scatterplot for the 9 distributions, in Fig. 3.