Statistical Mechanics of High-Dimensional Inference

To model modern large-scale datasets, we need efficient algorithms to infer a set of $P$ unknown model parameters from $N$ noisy measurements. What are fundamental limits on the accuracy of parameter inference, given finite signal-to-noise ratios, limited measurements, prior information, and computational tractability requirements? How can we combine prior information with measurements to achieve these limits? Classical statistics gives incisive answers to these questions as the measurement density $\alpha = \frac{N}{P}\rightarrow \infty$. However, these classical results are not relevant to modern high-dimensional inference problems, which instead occur at finite $\alpha$. We formulate and analyze high-dimensional inference as a problem in the statistical physics of quenched disorder. Our analysis uncovers fundamental limits on the accuracy of inference in high dimensions, and reveals that widely cherished inference algorithms like maximum likelihood (ML) and maximum a posteriori (MAP) inference cannot achieve these limits. We further find optimal, computationally tractable algorithms that can achieve these limits. Intriguingly, in high dimensions, these optimal algorithms become computationally simpler than MAP and ML, while still outperforming them. For example, such optimal algorithms can lead to as much as a 20% reduction in the amount of data needed to achieve the same performance relative to MAP. Moreover, our analysis reveals simple relations between optimal high dimensional inference and low dimensional scalar Bayesian inference, insights into the nature of generalization and predictive power in high dimensions, information theoretic limits on compressed sensing, phase transitions in quadratic inference, and connections to central mathematical objects in convex optimization theory and random matrix theory.


I. INTRODUCTION
Remarkable advances in measurement technologies have thrust us squarely into the modern age of "big data," which holds the potential to revolutionize a variety of fields spanning the sciences, engineering, and humanities, including neuroscience [1,2], systems biology [3], health care [4], economics [5], social science [6], and history [7]. However, the advent of large scale data sets presents severe statistical challenges that must be solved if we are to gain conceptual insights from such data.
A fundamental origin of the difficulty in analyzing many large scale data sets lies in their high dimensionality. For example, in classically designed experiments, we often measure a small number P of variables, chosen carefully ahead of time to test a specific hypothesis, and we take a large number N of measurements. Thus the measurement density α = N/P is extremely large, and such data sets are low dimensional: they consist of a large number N of points in a low, P dimensional space (Fig. 1A). Much of the edifice of classical statistics operates within this low-dimensional, high measurement density limit. Indeed, as reviewed below, as α → ∞, classical statistical theory gives us fundamental limits on the accuracy with which we can infer statistical models of such data, as well as the optimal statistical inference procedures to follow in order to achieve these limits.
In contrast to this classical scenario, our technological capacity for high-throughput measurements has led to a dramatic cultural shift in modern experimental design across many fields. We now often simultaneously measure many variables at once, in advance of choosing any specific hypothesis to test. However, we may have limited time or resources to conduct such experiments, so we can only make a limited number of such simultaneous measurements. For example, through multielectrode recordings, we can simultaneously measure the activity of P = 1000 neurons in mammalian circuits, but only for N = O(100) trials of any given trial type. Through microarrays, we can simultaneously measure the expression levels of P = O(6000) genes in yeast, but again in a limited number of N = O(100) experimental conditions. Thus while both N and P are large, the measurement density α is finite. Such datasets are high dimensional, in that they consist of a small number of points in a high dimensional space (Fig. 1B), and it can be extremely challenging to detect regularities in such data [8]. Moreover, classical statistical theory gives no prescriptions for how to optimally analyze such data.

FIG. 1. A cartoon view of low (A) versus high (B) dimensional data. In the latter scenario, a finite measurement density, or ratio between data points and dimensions, leads to errors in inference.

arXiv:1601.04650v2 [stat.ML] 22 Feb 2016
In this work, we extend classical statistical theory to the modern age of high dimensional data, obtaining fundamental generalizations of statistical theorems dating back to the 1940s [9,10]. We do so by interpreting the problem of high dimensional statistical inference within the framework of statistical physics. In particular, we focus on one of the most ubiquitous statistical inference procedures: regression, which attempts to find a linear relationship between a cloud of data points and another variable of interest. By exploiting the methods of statistical mechanics, we obtain fundamental limits on the accuracy of high dimensional inference, as well as the optimal procedures to follow to achieve these limits. Our results reveal surprisingly simple connections between optimal high dimensional inference and low dimensional scalar Bayesian estimation, as well as quantitative insights into how the predictive power, or generalization capability, of an inference algorithm is related to its accuracy in separating signal from noise. Moreover, a variety of topics, including random matrix theory, compressed sensing, and fundamental objects in convex optimization theory, such as proximal mappings and Moreau envelopes, emerge naturally through our analysis. We give an intuitive summary of our results in the discussion section.

A. Statistical inference framework
To more concretely introduce this work, we give a precise definition of the problems we are solving. Formally, let $\mathbf{s}^0$ be an unknown $P$ dimensional vector governing the linear response of a system's scalar output $y$ to a $P$ dimensional input $\mathbf{x}$ through the relation $y = \mathbf{x}\cdot\mathbf{s}^0 + \epsilon$, where $\epsilon$ denotes noise originating either from unobserved inputs or imperfect measurements. For example, in sensory neuroscience, $y$ could reflect a linear approximation of the response of a single neuron to a sensory stimulus $\mathbf{x}$, so that $\mathbf{s}^0$ is the neuron's receptive field. Alternatively, in genetic networks, $y$ could reflect the linear response of one gene to the expression levels $\mathbf{x}$ of a set of $P$ genes. Suppose we perform $N$ measurements, indexed by $\mu = 1, \ldots, N$, in which we probe the system with an input $\mathbf{x}^\mu$ and record the resulting output $y^\mu$. This yields a set of noisy measurements constraining the linear response vector $\mathbf{s}^0$ through the $N$ equations $y^\mu = \mathbf{x}^\mu\cdot\mathbf{s}^0 + \epsilon^\mu$.
We assume the noise $\epsilon^\mu$ and the signal components $s^0_i$ are each drawn i.i.d. from a zero mean noise density $P_\epsilon(\epsilon)$ and a prior distribution $P_s(s)$, respectively. For convenience below, we define signal and noise energies in terms of the minus log probability of their respective distributions: $E_\epsilon = -\log P_\epsilon$ and $E_s = -\log P_s$. We further assume the experimental design of inputs is random: input components $x^\mu_i$ are drawn i.i.d. from a zero mean Gaussian with variance $\frac{1}{P}$, yielding inputs of expected norm 1. In many systems identification applications, including for example in sensory neuroscience, this random design would correspond to a white-noise stimulus. Now, given knowledge of the $N$ input-output pairs $\{\mathbf{x}^\mu, y^\mu\}$, the noise density $P_\epsilon$, and the prior information encoded in $P_s$, we would like to infer, in a computationally tractable manner, an estimate $\hat{\mathbf{s}}$ of the true response vector $\mathbf{s}^0$. A critical parameter governing inference performance is the ratio of the number of measurements $N$ to the dimensionality $P$ of the unknown model parameter $\mathbf{s}^0$, i.e., the measurement density $\alpha = N/P$.

The performance of any inference procedure can be characterized in several ways. Most simply, we would like to achieve a small per-component mean square error, $q_s = \frac{1}{P}\sum_{i=1}^P (\hat s_i - s^0_i)^2$, in inferring the true parameters, or signal $\mathbf{s}^0$. Alternatively, it is useful to note that any inference procedure yielding an estimate $\hat{\mathbf{s}}$ implicitly decomposes the measurement vector $\mathbf{y}$ into the sum of a signal component $X\hat{\mathbf{s}}$ and a noise estimate $\hat{\boldsymbol{\epsilon}} = \mathbf{y} - X\hat{\mathbf{s}}$. Thus an inference procedure corresponds to a particular separation of measurements into estimated signal and noise, $\mathbf{y} = X\hat{\mathbf{s}} + \hat{\boldsymbol{\epsilon}}$, which will generically differ from the true decomposition, $\mathbf{y} = X\mathbf{s}^0 + \boldsymbol{\epsilon}$. While $q_s$ reflects the error in estimating signal, $q_\epsilon = \frac{1}{N}\sum_{\mu=1}^N(\hat\epsilon_\mu - \epsilon_\mu)^2$ reflects the error in estimating noise. Finally, one of the main performance measures of an inference procedure is its ability to generalize, or make predictions about the measurement outcome $y$ in response to a new randomly chosen input $\mathbf{x}$ not present in the training set $\{\mathbf{x}^\mu\}$. Given an estimate $\hat{\mathbf{s}}$, it can be used to make the prediction $\hat y = \mathbf{x}\cdot\hat{\mathbf{s}}$, and the average performance of this prediction is captured by the generalization error $E_{\rm gen} = \langle\langle (y - \hat y)^2\rangle\rangle$. Here the double average $\langle\langle\cdot\rangle\rangle$ denotes an average over both the training data $\{\mathbf{x}^\mu, y^\mu\}$, on which $\hat{\mathbf{s}}$ depends, and the held out testing data $\{\mathbf{x}, y\}$, which is necessarily independent of $\hat{\mathbf{s}}$. An alternate measure of performance is the average error of $\hat{\mathbf{s}}$ in simply predicting the training data, $E_{\rm train} = \frac{1}{N}\sum_{\mu=1}^N (y^\mu - \mathbf{x}^\mu\cdot\hat{\mathbf{s}})^2$. In general, $E_{\rm train} < E_{\rm gen}$, since through the process of inference the learned parameters $\hat{\mathbf{s}}$ can acquire subtle correlations with the particular realization of training inputs $\{\mathbf{x}^\mu\}$ and noise $\{\epsilon^\mu\}$ so as to reduce $E_{\rm train}$. Situations where $E_{\rm train} \ll E_{\rm gen}$ correspond to inference procedures that overfit to the training data, and do not exhibit predictive power by generalizing to new data.

Now what inference procedures can achieve good performance in a computationally tractable manner? Regularized M-estimation (see [11] for reviews) yields a large family of computationally tractable estimation procedures in which $\hat{\mathbf{s}}$ is computed through the minimization
$$\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \left[\, \sum_{\mu=1}^N \rho\!\left(y^\mu - \mathbf{x}^\mu\cdot\mathbf{s}\right) + \sum_{i=1}^P \sigma(s_i) \right]. \qquad (1)$$
Here $\mathbf{s}$ is a candidate response vector, $\rho$ is a loss function that penalizes deviations between actual measurements $y^\mu$ and expected measurements $\mathbf{x}^\mu\cdot\mathbf{s}$ under the candidate $\mathbf{s}$, and $\sigma(s)$ is a regularization function that exploits prior information about $\mathbf{s}^0$.
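To make the M-estimation problem (1) concrete, here is a minimal numerical sketch. All problem sizes, the noise level, and the regularization weight gamma are hypothetical choices made for illustration; we use a quadratic loss and quadratic regularizer so the answer can be cross-checked against a closed form.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
P, alpha = 200, 4.0                  # dimension and measurement density alpha = N/P
N = int(alpha * P)
gamma = 0.25                         # regularization weight (hypothetical choice)

s0 = rng.normal(size=P)                               # true signal s^0
X = rng.normal(scale=1/np.sqrt(P), size=(N, P))       # random design, rows of expected norm 1
y = X @ s0 + rng.normal(scale=0.5, size=N)            # noisy measurements

# Objective of Eq. (1) with rho(x) = x^2/2 and sigma(s) = gamma s^2/2
def objective(s):
    r = y - X @ s
    return 0.5 * r @ r + 0.5 * gamma * s @ s

def gradient(s):
    return -X.T @ (y - X @ s) + gamma * s

s_hat = minimize(objective, np.zeros(P), jac=gradient, method="L-BFGS-B").x

# For this quadratic special case the minimizer is also available in closed form
s_closed = np.linalg.solve(X.T @ X + gamma * np.eye(P), X.T @ y)

q_s = np.mean((s_hat - s0)**2)       # per-component error q_s
```

For general convex $\rho$ and $\sigma$ no closed form exists, but the same generic minimization applies; this is what makes the M-estimation family computationally tractable.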
In the absence of such prior information, a widely used procedure is maximum likelihood (ML) inference,
$$\hat{\mathbf{s}}^{\rm ML} = \arg\max_{\mathbf{s}}\, \log P(\{y^\mu\} \,|\, \{\mathbf{x}^\mu\}, \mathbf{s}). \qquad (2)$$
ML corresponds to noise energy minimization through the choice $\rho = E_\epsilon$ and $\sigma = 0$ in (1). Amongst all unbiased estimation procedures (in which $\langle\hat{\mathbf{s}}\rangle = \mathbf{s}^0$, where $\langle\cdot\rangle$ denotes an average over noise realizations), this energy minimization is optimal, but only in the low dimensional limit. Thus, amongst unbiased procedures, ML achieves the minimum mean squared error (MMSE) when $\alpha \to \infty$, but not at finite $\alpha$. Recent work [12-14] uses non-statistical-mechanics based methods to find the optimal $\rho$ at finite $\alpha$, but leaves open the fundamental question of how to optimally exploit prior information by choosing a nonzero $\sigma$.
With prior knowledge, the Bayesian posterior mean achieves the MMSE estimate,
$$\hat{\mathbf{s}}^{\rm MMSE} = \langle\mathbf{s}\rangle = \int d\mathbf{s}\; \mathbf{s}\, P(\mathbf{s} \,|\, \{y^\mu\}, \{\mathbf{x}^\mu\}). \qquad (3)$$
However, while no inference procedure can outperform high dimensional Bayesian inference of the posterior mean, this procedure is not an M-estimator, and it is often computationally intractable due to the $P$ dimensional integral. A widely used, more computationally tractable surrogate is maximum a posteriori (MAP) inference,
$$\hat{\mathbf{s}}^{\rm MAP} = \arg\max_{\mathbf{s}}\, P(\mathbf{s} \,|\, \{y^\mu\}, \{\mathbf{x}^\mu\}), \qquad (4)$$
which corresponds to noise and signal energy minimization through the choice $\rho = E_\epsilon$ and $\sigma = E_s$ in (1). MAP inference, by potentially introducing a nonzero bias (so that $\langle\hat{\mathbf{s}}\rangle \neq \mathbf{s}^0$), can outperform ML at finite $\alpha$, but is not in general optimal. However, the exploitation of prior information through a judicious, even if suboptimal, choice of $\sigma$ can dramatically reduce estimation error. For example, the seminal advance of compressed sensing (CS) [15-17] uses $\rho(x) = \frac{1}{2}x^2$ and $\sigma(s) \propto |s|$. This choice can lead to accurate inference of a sparse $\mathbf{s}^0$ even when $\alpha < 1$, where sparsity means that $P_s(s)$ assigns a small probability to nonzero values.
Despite the important and successful special cases of MAP inference and CS, there exists no general method to choose the best $\rho$ and $\sigma$ for inference. The central questions we address in this work are: (1) Given an estimation problem defined by the triplet of measurement density, noise, and prior $(\alpha, E_\epsilon, E_s)$, and an estimation procedure defined by the loss and regularization pair $(\rho, \sigma)$, what is the typical error $q_s$ achieved for random inputs $\mathbf{x}^\mu$ and noise $\epsilon^\mu$? (2) What is the minimal achievable estimation error $q_{\rm opt}$ over all possible choices of convex procedures $(\rho, \sigma)$? (3) Which procedure $(\rho_{\rm opt}, \sigma_{\rm opt})$ achieves the minimal error $q_{\rm opt}$? (4) Are there simple universal relations between $q_s$ and $q_\epsilon$, which measure the ability of an inference procedure to accurately separate signal and noise, and $E_{\rm train}$ and $E_{\rm gen}$, which capture the predictive power of an inference procedure? Our discussion section gives a summary of the answers we find to these questions.

A. Review and formulation of classical scalar inference
Before considering the finite $\alpha$ regime, it is useful to review classical statistics in the $\alpha \to \infty$ limit, in the context of scalar estimation, where $P = 1$. In particular, we formulate these results in a suggestive manner that will aid in understanding the novel phenomena that emerge in modern, high dimensional statistical inference, derived below. Here, for simplicity, we choose the scalar measurements $x^\mu = 1\;\forall\,\mu$ in (1). Thus we must estimate the scalar $s^0$ from $\alpha = N$ noisy measurements, $y^\mu = s^0 + \epsilon^\mu$. With no regularization ($\sigma = 0$), for large $N$, $\hat s$ in (1) will be close to $s^0$, so simply Taylor expanding $\rho$ about $s^0$ yields the asymptotic error (see [11], [18], appendix A.1)
$$q_s = \frac{1}{N}\,\frac{\left\langle \rho'(\epsilon)^2\right\rangle_{P_\epsilon}}{\left\langle \rho''(\epsilon)\right\rangle^2_{P_\epsilon}}. \qquad (5)$$
The Cramér-Rao (CR) bound is a fundamental information theoretic lower bound, at any $N$, on the error of any unbiased estimator $\hat s(\{y^\mu\})$ (obeying $\langle \hat s - s^0\rangle = 0$):
$$q_s \;\geq\; \frac{1}{N\, J[\epsilon]}, \qquad (6)$$
where $J[\epsilon]$ is the Fisher information from a single measurement $y$,
$$J[\epsilon] = \int d\epsilon\; P_\epsilon(\epsilon)\left(\frac{d}{d\epsilon}\log P_\epsilon(\epsilon)\right)^2. \qquad (7)$$
The Fisher information measures the susceptibility of the output $y$ to small changes in the parameter $s^0$. The higher this susceptibility, the lower the achievable error in (6). For finite $N$, it is not clear there exists a loss function $\rho$ whose performance saturates the CR bound. However, a central result in classical statistics states that as $N \to \infty$, the choice $\rho = E_\epsilon$ saturates (6), as can be seen by substituting $\rho = E_\epsilon$ in (5) ([18], appendix A.2).
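A hedged numerical illustration of (5)-(7), with arbitrary sample sizes: for Laplacian noise of unit scale, $E_\epsilon = |\epsilon|$ and $J[\epsilon] = 1$, so the ML estimator (the sample median) approaches the CR bound $1/(NJ)$, while the sample mean, which is ML only under a mismatched Gaussian noise model, does not.

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials, s0 = 400, 4000, 1.3

# Laplacian noise with scale b = 1: E_eps = |eps|, Fisher information J = 1/b^2 = 1
y = s0 + rng.laplace(scale=1.0, size=(trials, N))

# ML for Laplacian noise minimizes sum_mu |y_mu - s|, i.e. takes the sample median;
# the sample mean is ML only for Gaussian noise, and its error is Var/N = 2/N here
mse_median = np.mean((np.median(y, axis=1) - s0)**2)
mse_mean = np.mean((np.mean(y, axis=1) - s0)**2)

cr_bound = 1.0 / N       # Cramer-Rao bound 1/(N J) with J = 1
```

Running this, `mse_median` sits close to `cr_bound` while `mse_mean` is roughly twice as large, a direct finite-sample view of the saturation statement above.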
With knowledge of the true signal distribution $P_s(s^0)$, the posterior mean $\langle s \,|\, \{y^\mu\}\rangle = \int ds\; s\, P(s \,|\, \{y^\mu\})$ achieves the minimal possible error $q_s$, amongst all inference procedures, biased or not, at any finite $N$. We compute this minimal $q_s$, in the limit of large $N$, via a saddle point approximation to this Bayesian integral, yielding a mean field theory (MFT) for low dimensional Bayesian inference ([18], appendix A.3), in which the $N$ measurements $y^\mu$ of $s^0$, corrupted by non-Gaussian noise $\epsilon^\mu$, can be replaced by a single measurement $y = s^0 + \sqrt{q_d}\, z$, corrupted by an effective Gaussian noise of variance
$$q_d = \frac{1}{N\, J[\epsilon]}. \qquad (8)$$
Here $z$ is a zero mean unit variance Gaussian variable. In our MFT, $q_s$ is the MMSE error $q^{\rm MMSE}_s$ of this equivalent single measurement, Gaussian noise inference problem:
$$q_s = q^{\rm MMSE}_s(q_d) \equiv \left\langle \left( \langle s \,|\, y\rangle - s^0 \right)^2 \right\rangle_{s^0,\, z}. \qquad (9)$$
We further prove a general lower bound on the asymptotic error and demonstrate that this bound is tight when the signal and noise are Gaussian ([18], appendix A.3). Thus, the classical theory of unbiased statistical inference as the measurement density $\alpha \to \infty$ reveals that ML achieves information theoretic limits on error (6). Moreover, our novel asymptotic analysis of Bayesian inference as $\alpha \to \infty$ (Eqs. 8-10) reveals the extent to which biased procedures that optimally exploit prior information can circumvent such limits. Our work below constitutes a fundamental extension of these results to modern high dimensional problems at finite measurement density.

B. Statistical mechanics framework
To understand the properties of the solution $\hat{\mathbf{s}}$ to (1), we define an energy function $E(\mathbf{s}) = \sum_{\mu=1}^N \rho(y^\mu - \mathbf{x}^\mu\cdot\mathbf{s}) + \sum_{i=1}^P \sigma(s_i)$, yielding a Gibbs distribution $P_G(\mathbf{s}) = \frac{1}{Z}e^{-\beta E(\mathbf{s})}$ that freezes onto the solution of (1) in the zero temperature $\beta \to \infty$ limit. In this statistical mechanics system, $\mathbf{x}^\mu$, $\epsilon^\mu$ and $\mathbf{s}^0$ play the role of quenched disorder, while the components of the candidate parameters $\mathbf{s}$ comprise thermal degrees of freedom. For large $N$ and $P$, we expect self-averaging to occur: the properties of $P_G$ for any typical realization of the disorder coincide with the properties of $P_G$ averaged over the disorder. Therefore we compute the average free energy $-\beta \langle F\rangle \equiv \langle \ln Z\rangle_{\mathbf{x}^\mu,\, \epsilon^\mu,\, \mathbf{s}^0}$ using the replica method [19]. We employ the replica symmetric (RS) approximation, which is effective for convex $\rho$ and $\sigma$. Interestingly, our calculation ([18], section 2.1) goes through without assuming a quadratic loss, as in previous replica analyses of compressed sensing [20,21]. For a review of statistical mechanics methods applied to high dimensional inference in diverse settings, see [8].
Central objects in optimization theory emerge naturally from our replica analysis, and the resulting mean field theory (MFT) is most naturally described in terms of them. First is the proximal map $x \to P_\lambda[f](x)$, where
$$P_\lambda[f](x) = \arg\min_y \left[ \frac{(x-y)^2}{2\lambda} + f(y) \right]. \qquad (11)$$
This mapping is a proximal descent step that maps $x$ to a new point that minimizes $f$, while remaining proximal to $x$, as determined by a scale $\lambda$. The proximal map is closely related to the Moreau envelope of $f$, given by
$$M_\lambda[f](x) = \min_y \left[ \frac{(x-y)^2}{2\lambda} + f(y) \right]. \qquad (12)$$
$M_\lambda[f]$ is a minimum convolution of $f(x)$ with a quadratic $\frac{x^2}{2\lambda}$, yielding a lower bound on $f$ that is smoothed over a scale $\lambda$. See Fig. 2AB for an example. The proximal map and Moreau envelope are related through
$$P_\lambda[f](x) = x - \lambda\, M_\lambda[f]'(x), \qquad (13)$$
where the prime denotes differentiation w.r.t. $x$. Thus a proximal descent step on $f$ can be viewed as a gradient descent step on $M_\lambda[f]$ with step length $\lambda$. See [18], appendix C.1, and also [22] for a review of these topics.
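The definitions (11)-(13) can be checked numerically by brute force. A sketch for the example $f(x) = |x|$, whose proximal map is the well-known soft-thresholding operation; the search grid and the values of $\lambda$ and $x$ are arbitrary choices.

```python
import numpy as np

grid = np.linspace(-10.0, 10.0, 200001)   # brute-force search grid for the minimizer

def prox(f, x, lam):
    """Proximal map P_lam[f](x): argmin over y of (x - y)^2 / (2 lam) + f(y)."""
    return grid[np.argmin((x - grid)**2 / (2*lam) + f(grid))]

def moreau(f, x, lam):
    """Moreau envelope M_lam[f](x): the attained minimum value of the same objective."""
    return np.min((x - grid)**2 / (2*lam) + f(grid))

f, lam, x = np.abs, 0.7, 2.5

p = prox(f, x, lam)   # for f = |.| and x > lam, soft thresholding gives x - lam

# Verify Eq. (13): P_lam[f](x) = x - lam * d/dx M_lam[f](x), via a central difference
h = 1e-3
dM = (moreau(f, x + h, lam) - moreau(f, x - h, lam)) / (2*h)
```

Here the Moreau envelope of $|x|$ is the Huber function, so for $x > \lambda$ its derivative is 1 and the relation (13) reduces to the soft-threshold answer $x - \lambda$.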
Our replica analysis yields a pair of zero temperature MFT distributions $P_{\rm MF}(s^0, \hat s)$ and $P_{\rm MF}(\epsilon, \hat\epsilon)$. The first describes the joint distribution of a single component pair $(s^0_i, \hat s_i)$ in (1), while the second describes the joint distribution of a noise component $\epsilon^\mu$ and its estimate $\hat\epsilon^\mu \equiv y^\mu - \mathbf{x}^\mu\cdot\hat{\mathbf{s}}$. The MFT distributions can be described in terms of a pair of coupled scalar noise and signal estimation problems, depending on a set of RS order parameters $(q_s, q_d, \lambda_\rho, \lambda_\sigma)$. Here $q_s$ and $q_d$ reflect the variance of additive Gaussian noise that corrupts the noise $\epsilon$ and signal $s^0$, respectively, yielding the measured variables
$$\epsilon + \sqrt{q_s}\, z_\epsilon \qquad \text{and} \qquad s^0 + \sqrt{q_d}\, z_s, \qquad (15)$$
where $z_\epsilon$ and $z_s$ are independent zero mean unit variance Gaussians. From these measurements, estimates $\hat\epsilon$ and $\hat s$ of the original noise $\epsilon$ and signal $s^0$ are obtained through proximal descent steps on the loss $\rho$ and regularization $\sigma$:
$$\hat\epsilon = P_{\lambda_\rho}[\rho]\!\left(\epsilon + \sqrt{q_s}\, z_\epsilon\right), \qquad \hat s = P_{\lambda_\sigma}[\sigma]\!\left(s^0 + \sqrt{q_d}\, z_s\right), \qquad (16)$$
where $\lambda_\rho$ and $\lambda_\sigma$ reflect scale parameters. The joint MFT distributions are then obtained by integrating out $z_\epsilon$ and $z_s$. These MFT equations can be thought of as defining a pair of scalar estimation problems, one for the noise, and one for the signal (see Fig. 3AB for a schematic).

FIG. 3. A low-dimensional scalar MFT for high dimensional inference. (A) and (B) are schematic descriptions of Eqns. (15) and (16). They describe a pair of scalar statistical estimation problems, one for a noise variable $\epsilon$, drawn from $P_\epsilon$ in (A), and the other for a signal variable $s^0$, drawn from $P_s$ in (B). Each variable is corrupted by additive Gaussian noise, and from these noise corrupted measurements, the original variables are estimated through proximal descent steps, yielding a noise estimate $\hat\epsilon$ in (A) and a signal estimate $\hat s$ in (B). The MFT distributions $P_{\rm MF}(\epsilon, \hat\epsilon)$ and $P_{\rm MF}(s^0, \hat s)$ are obtained by integrating out $z_\epsilon$ and $z_s$ in (A) and (B) respectively. These joint MF distributions describe the joint distribution of pairs of single components $(\epsilon^\mu, \hat\epsilon^\mu)$ and $(s^0_i, \hat s_i)$ in (1), after integrating out all other elements of the quenched disorder in the training data and true signal.
The order parameters obey self-consistency conditions, Eqs. (17) and (18) ([18], sec. 2.1), that couple the performance of these two scalar estimation problems and determine $(q_s, q_d, \lambda_\rho, \lambda_\sigma)$; in them, $\langle\cdot\rangle$ denotes averages over the quenched disorder in (15). The pair of MF distributions determine various measures of inference performance in (1). In particular, $q_s = \langle(\hat s - s^0)^2\rangle$ predicts the typical per-component error of the learned model parameters, or signal $\hat{\mathbf{s}}$, while $q_\epsilon = \langle(\hat\epsilon - \epsilon)^2\rangle$ predicts the typical per-component error of the estimated noise. The model's prediction, or generalization, error $E_{\rm gen} = \langle(y - \mathbf{x}\cdot\hat{\mathbf{s}})^2\rangle$ on a new example $(\mathbf{x}, y)$ not present in the training set $\{\mathbf{x}^\mu, y^\mu\}$ can be obtained by substituting $y = \mathbf{x}\cdot\mathbf{s}^0 + \epsilon$ into $E_{\rm gen}$. This yields the MFT prediction for the generalization error, $E_{\rm gen} = \langle(\epsilon + \sqrt{q_s}\, z_\epsilon)^2\rangle = q_s + \langle\epsilon^2\rangle$. In contrast, the MFT prediction for the training error is simply $E_{\rm train} = \langle\hat\epsilon^2\rangle$. Because the proximal map is contractive, with Jacobian less than 1 [22], the MFT predicts, as expected, that $E_{\rm train} < E_{\rm gen}$. The reason for the reduced $E_{\rm train}$ is the subtle correlations that the learned parameters $\hat{\mathbf{s}}$ can acquire with the particular realization of training inputs $\{\mathbf{x}^\mu\}$ and noise $\{\epsilon^\mu\}$ through the optimization in (1). Remarkably, these subtle correlations are captured in the MFT simply through a proximal descent step in (16) on the cost $\rho$. This step contracts the variable $\epsilon + \sqrt{q_s}\, z_\epsilon$ controlling $E_{\rm gen}$ towards the minimum of $\rho$ at the origin, leading to a smaller $E_{\rm train}$. We explore many more consequences of this MFT below.
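The qualitative prediction $E_{\rm train} < E_{\rm gen}$ is easy to check directly. A sketch for the unregularized quadratic case, with arbitrary problem sizes and unit noise variance, comparing training error against error on a held-out test set:

```python
import numpy as np

rng = np.random.default_rng(2)
P, alpha = 300, 3.0
N = int(alpha * P)

s0 = rng.normal(size=P)
X = rng.normal(scale=1/np.sqrt(P), size=(N, P))
y = X @ s0 + rng.normal(size=N)                 # noise variance <eps^2> = 1

s_hat = np.linalg.lstsq(X, y, rcond=None)[0]    # quadratic loss, sigma = 0

E_train = np.mean((y - X @ s_hat)**2)

# held-out test set drawn from the same random design
Xt = rng.normal(scale=1/np.sqrt(P), size=(N, P))
yt = Xt @ s0 + rng.normal(size=N)
E_gen = np.mean((yt - Xt @ s_hat)**2)
# E_train underestimates E_gen because s_hat is correlated with the training noise
```

For these parameters the gap is substantial: least squares absorbs part of the training noise into $\hat{\mathbf{s}}$, deflating $E_{\rm train}$ while inflating $E_{\rm gen}$.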

C. Inference without prior information
If we cannot exploit prior information, we simply choose $\sigma = 0$, which yields $\hat s = s^0 + \sqrt{q_d}\, z_s$ in (16), so that the RHS of (17) and (18) reduce to $q_s = q_d$ and $\lambda_\rho = \lambda_\sigma$. Then, replacing $q_d$ with $q_s$ on the LHS of (17), and comparing to (5), we see that the high dimensional inference error is analogous to the low dimensional one, with the number of measurements $N$ replaced by the measurement density $\alpha$, the cost $\rho(\cdot)$ replaced by its Moreau envelope $M_{\lambda_\rho}[\rho](\cdot)$, and the noise further corrupted by additive Gaussian noise of variance $q_s$, with $q_s$ and $\lambda_\rho$ determined self-consistently through (17)-(18).
As a simple example, consider the ubiquitous case of quadratic cost: $\rho(x) = \frac{1}{2}x^2$. Then the proximal map in (16) is simply linear shrinkage to the origin, $\hat\epsilon(\epsilon_{q_s}) = \frac{1}{1+\lambda_\rho}\,\epsilon_{q_s}$, where $\epsilon_{q_s} \equiv \epsilon + \sqrt{q_s}\, z_\epsilon$, and (17) and (18) are readily solved: $q_s = \frac{\langle\epsilon^2\rangle}{\alpha-1}$, so that $E_{\rm gen} = \frac{\alpha}{\alpha-1}\langle\epsilon^2\rangle$ and $E_{\rm train} = \frac{\alpha-1}{\alpha}\langle\epsilon^2\rangle$. Thus as the measurement density approaches 1 from above, the error in the inferred parameters $\hat{\mathbf{s}}$ and $E_{\rm gen}$ diverge, while $E_{\rm train}$ vanishes, indicating severe overfitting.

Now, in the space of all convex costs $\rho$, for a given density $\alpha$ and noise energy $E_\epsilon$, what is the minimum possible estimation error $q_{\rm opt}$? By performing a functional minimization of $q_s$ over $\rho$ subject to the constraints (17) and (18) (see [18], secs. 4.1 and 5.1 for details), we find that $q_{\rm opt}$ is the minimal solution to
$$q \;\geq\; \frac{1}{\alpha\, J[\epsilon_q]} \;\geq\; \frac{1}{\alpha}\left(\frac{1}{J[\epsilon]} + q\right), \qquad (19)$$
where $\epsilon_q \equiv \epsilon + \sqrt{q}\, z$ denotes the noise further corrupted by additive Gaussian noise of variance $q$, and the second inequality follows from the convolutional Fisher inequality ([18], appendix B.2). This result is the high dimensional analog of the Cramér-Rao bound in (6). By the data processing inequality for Fisher information, $J[\epsilon_{q_{\rm opt}}] < J[\epsilon]$, indicating higher error in the high dimensional setting (19) than in the low dimensional setting (6). Thus the price paid for even optimal high-dimensional inference at finite measurement density, relative to ML inference at infinite density, is increased error due to the presence of additional Gaussian noise with dimensionality dependent variance $q_s$.

Now can this minimal error $q_{\rm opt}$ be achieved, and if so, which cost function $\rho_{\rm opt}$ achieves it? Constrained functional optimization over $\rho$ yields the functional equation $M_{q_{\rm opt}}[\rho_{\rm opt}](x) = E_{\epsilon_{q_{\rm opt}}}(x)$, where $E_{\epsilon_q} \equiv -\log P_{\epsilon_q}$ is the energy of the smoothed noise $\epsilon_q$ (see [18], sec. 5.1 for details); inverting this Moreau envelope relation (see [18], appendix B.2) yields the optimal loss $\rho_{\rm opt}$, Eq. (20). The validity of this equation under the RS assumption requires that $\rho_{\rm opt}$ be convex. Convexity of the noise energy $E_\epsilon$ is sufficient to guarantee the convexity of $\rho_{\rm opt}$, and so for this class of noise, (20) yields the optimal inference procedure.

FIG. 4. The shape of the optimal loss function in (20) for high dimensional inference as a function of the error or smoothing parameter $q$. As $\alpha$ varies from high to low measurement density, $q$ varies from low to high values, and the optimal loss function varies from the ML loss to a quadratic. Intermediate versions of the optimal loss behave like a smoothed version of the ML loss, with increased smoothing as measurement density decreases (or dimensionality increases).
In the classical $\alpha \to \infty$ limit, we expect $q_{\rm opt}$ to be small; indeed, to leading order in $\frac{1}{\alpha}$, (19) has the solution $q_{\rm opt} = \frac{1}{\alpha J[\epsilon]}$, and (20) reduces to $\rho_{\rm opt} = E_\epsilon$, recovering the optimality of ML and its performance (6) at infinite measurement density. In the high dimensional $\alpha \to 1$ limit, $q_{\rm opt}$ diverges, so that the smoothed noise $\epsilon_{q_{\rm opt}} \equiv \epsilon + \sqrt{q_{\rm opt}}\, z$ approaches a Gaussian with variance $\langle\epsilon^2\rangle + q_{\rm opt}$, yielding in (20) $\rho_{\rm opt}(x) = \frac{x^2}{2}$. Thus, remarkably, at low measurement density, simple quadratic minimization, independent of the noise distribution, becomes an optimal inference procedure. As the measurement density decreases, $\rho_{\rm opt}$ interpolates between $E_\epsilon$ and a quadratic; in essence, $\rho_{\rm opt}$ at finite density $\alpha$ is a smoothed version of the ML choice $\rho = E_\epsilon$, where the amount of smoothing increases as the density decreases (or the dimensionality increases). See Fig. 4 for an example of a family of optimal inference procedures, and their performance advantage relative to ML, for Laplacian noise ($E_\epsilon = |\epsilon|$).
These results are consistent with and provide a new statistical mechanics based derivation of results in [12-14], and they illustrate the severity of overfitting in the face of limited data.
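The key ingredient of the optimal loss is the energy $E_{\epsilon_q} = -\log P_{\epsilon_q}$ of the Gaussian-smoothed noise, whose Moreau envelope relation defines $\rho_{\rm opt}$. A numerical sketch for Laplacian noise (the integration grid and the values of $q$ are arbitrary choices) showing that this smoothed energy interpolates between an absolute-value, ML-like shape at small $q$ and a quadratic at large $q$:

```python
import numpy as np

grid = np.linspace(-40.0, 40.0, 8001)
dx = grid[1] - grid[0]

def smoothed_energy(x, q):
    """E_{eps_q}(x) = -log[(P_eps * N(0, q))(x)] for Laplacian P_eps with scale 1."""
    p_eps = 0.5 * np.exp(-np.abs(grid))
    gauss = np.exp(-(x - grid)**2 / (2*q)) / np.sqrt(2*np.pi*q)
    return -np.log(np.sum(p_eps * gauss) * dx)

def ratio(q):
    """[E(2) - E(0)] / [E(1) - E(0)]: ~2 for an absolute-value loss, ~4 for a quadratic."""
    e0 = smoothed_energy(0.0, q)
    return (smoothed_energy(2.0, q) - e0) / (smoothed_energy(1.0, q) - e0)

r_small, r_large = ratio(0.01), ratio(100.0)
# r_small is close to 2 (|x|-like), r_large is close to 4 (quadratic-like)
```

This tracks the interpolation described above: as the smoothing variance $q$ grows with decreasing measurement density, the effective noise energy, and with it the optimal loss, loses the sharp kink of $|\epsilon|$ and approaches a pure quadratic.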

D. Inference with prior information
We next explore how we can combat overfitting by optimally exploiting prior information about the distribution of the model parameters, or signal s 0 .

Optimal quadratic inference: a high SNR phase transition
To understand the MFT for regularized inference, it is useful to start with the oft-used quadratic loss and regularization: $\rho(x) = \frac{1}{2}x^2$ and $\sigma(x) = \frac{1}{2}\gamma x^2$. In this case, the proximal maps in (16) become linear and the RS equations (17) and (18) are readily solved ([18], sec. 3.1). It is useful to express the results in terms of the fraction of unexplained variance $\tilde q_s = \frac{q_s}{\langle s^2\rangle}$ and the ${\rm SNR} = \langle s^2\rangle / \langle\epsilon^2\rangle$. For quadratic inference, $\tilde q_s$ depends on the signal and noise distributions only through the SNR. We find that in the strong regularization limit, $\gamma \to \infty$, $\tilde q_s \to 1$, as the regularization pins the estimate $\hat{\mathbf{s}}$ to the origin, while in the weak regularization limit, $\gamma \to 0$, $\tilde q_s \to \frac{1}{{\rm SNR}(\alpha-1)}$, recovering the unregularized case. There is an optimal intermediate value of the regularization weight, $\gamma = \frac{1}{\rm SNR}$, leading to the highest fraction of variance explained. Thus optimal quadratic inference obeys the principle that high-quality data, as measured by high SNR, requires weaker regularization. For this optimal $\gamma$, $\tilde q_s$ arises as the solution to the set of simultaneous equations (21) ([18], sec. 3.1). We denote the solution to these equations by $\tilde q_s = \tilde q^{\rm Quad}_s(\alpha, {\rm SNR})$. This function is simply the fraction of unexplained variance of optimal quadratic inference at a given measurement density and SNR, and an explicit expression, Eq. (22), in terms of $\phi = \frac{1}{\rm SNR}$, is given in [18], sec. 3.2. This expression simplifies in several limits. At high ${\rm SNR} \gg 1$,
$$\tilde q^{\rm Quad}_s \;\sim\; \begin{cases} 1 - \alpha, & \alpha < 1, \\ {\rm SNR}^{-1/2}, & \alpha = 1, \\ \frac{1}{\alpha-1}\,{\rm SNR}^{-1}, & \alpha > 1. \end{cases} \qquad (23)$$
Thus, as a function of measurement density, the high SNR behavior of quadratic inference exhibits a phase transition at the critical density $\alpha_c = 1$. Below this density, in the undersampled regime, performance asymptotes to a finite error, independent of SNR. Above this density, in the oversampled regime, inference error decays with SNR as ${\rm SNR}^{-1}$. Surprisingly, at the critical density, the decay with SNR is slower, and exhibits a universal decay exponent of $-\frac{1}{2}$, independent of the signal and noise distributions. This exponent, and its universality, is verified numerically in Fig. 5A. Moreover, as $\alpha \to 1$, $\tilde q^{\rm Quad}_s$ remains $O(1)$ at any finite SNR, unlike in the unregularized case. Indeed, for $\alpha \ll 1$, $\tilde q^{\rm Quad}_s = 1 - \alpha\frac{\rm SNR}{{\rm SNR}+1}$. Thus quadratic regularization can tame the divergence of unregularized inference at low measurement density.
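A quick empirical check of the principle $\gamma_{\rm opt} = 1/{\rm SNR}$; the problem sizes and SNR below are arbitrary choices. We sweep the quadratic regularization weight and locate the minimum of the per-component error:

```python
import numpy as np

rng = np.random.default_rng(3)
P, alpha, snr = 250, 2.0, 4.0
N = int(alpha * P)

s0 = rng.normal(size=P)                                  # <s^2> = 1
X = rng.normal(scale=1/np.sqrt(P), size=(N, P))
y = X @ s0 + rng.normal(scale=np.sqrt(1/snr), size=N)    # <eps^2> = 1/SNR

def ridge_mse(gamma):
    """Per-component error of quadratic inference with regularization weight gamma."""
    s_hat = np.linalg.solve(X.T @ X + gamma * np.eye(P), X.T @ y)
    return np.mean((s_hat - s0)**2)

gammas = np.logspace(-2, 2, 41)
errors = [ridge_mse(g) for g in gammas]
g_best = gammas[int(np.argmin(errors))]    # should lie near 1/SNR = 0.25
```

On a single realization the empirical minimum fluctuates around $1/{\rm SNR}$, and both the under- and over-regularized extremes are visibly worse, in line with the limits quoted above.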
The phase transition behavior of optimal quadratic inference can be understood from the perspective of random matrix theory (RMT). In the special case of (1) with $\rho(x) = \frac{1}{2}x^2$ and $\sigma(x) = \frac{1}{2}\frac{1}{\rm SNR}x^2$, the optimal estimate $\hat{\mathbf{s}}$ has the analytic solution
$$\hat{\mathbf{s}} = \left(X^TX + \tfrac{1}{\rm SNR}\, I\right)^{-1} X^T \mathbf{y}, \qquad (24)$$
where $X$ is an $N$ by $P$ measurement matrix whose $N$ rows are the $N$ measurement vectors $\mathbf{x}^\mu$ (see [18], sec. 3.5 for more details). This analytic solution for $\hat{\mathbf{s}}$ enables a direct average over the noise $\boldsymbol{\epsilon}$ and true signal $\mathbf{s}^0$ in $\mathbf{y}$ to yield
$$\tilde q^{\rm Quad}_s = \left\langle \frac{1}{P}\,{\rm Tr}\left[\left(I + {\rm SNR}\; X^TX\right)^{-1}\right] \right\rangle_X. \qquad (25)$$
This expression can be reduced to an average over the eigenvalue distribution of the random measurement correlation matrix $X^TX$, which has the well known Marcenko-Pastur (MP) form [23]:
$$\rho_{\rm MP}(\lambda) = (1-\alpha)\, 1_{\alpha<1}\, \delta(\lambda) + \frac{\sqrt{(\lambda_+ - \lambda)(\lambda - \lambda_-)}}{2\pi\lambda}\, 1_{\lambda\in[\lambda_-,\lambda_+]}, \qquad (26)$$
where the nonzero support of the continuous part of the density is restricted to the range $\lambda \in [\lambda_-, \lambda_+]$, with $\lambda_\pm = (\sqrt\alpha \pm 1)^2$. Also, $1_{\alpha<1}$ is 1 when $\alpha < 1$ and 0 otherwise. Thus at measurement densities $\alpha < 1$, the MP distribution has an additional delta function at the origin with weight $1-\alpha$, reflecting the fact that the $P \times P$ measurement correlation matrix $X^TX$ is not full rank when $N < P$. In terms of $\rho_{\rm MP}(\lambda)$, (25) reduces to
$$\tilde q^{\rm Quad}_s = \int d\lambda\; \rho_{\rm MP}(\lambda)\, \Delta(\lambda), \qquad (27)$$
where $\Delta(\lambda) = (1 + \lambda\cdot{\rm SNR})^{-1}$. Direct calculation reveals that expression (27) for $\tilde q^{\rm Quad}_s(\alpha, {\rm SNR})$, derived via random matrix theory, is consistent with the expression (22), derived via our theory of high dimensional statistical inference.
The expression (27) for $\tilde q^{\rm Quad}_s$ is controlled by the behavior of the MP density near the origin: for $\alpha \neq 1$ the continuous part of the density has a gap there (Fig. 5B). Only when $\alpha = 1$ does the gap in the MP density vanish. In this case, near the origin, the density diverges as $\lambda^{-1/2}$ (see Fig. 5B, middle). At high SNR, because $\Delta(\lambda)$ induces an effective cut-off at $\frac{1}{\rm SNR}$, the integral in (27) at $\alpha = 1$ can be approximated as
$$\tilde q^{\rm Quad}_s \;\approx\; \int_0^{\lambda_+} \frac{d\lambda}{\pi\sqrt\lambda}\, \frac{1}{1 + \lambda\cdot{\rm SNR}} \;\approx\; {\rm SNR}^{-1/2}.$$
Thus the origin of the phase transition in (23) at the critical value $\alpha = 1$ arises from the vanishing of a gap in the MP distribution. Moreover, the universal decay exponent at the critical value of $\alpha = 1$ is related to the power law behavior of the MP density near the origin at $\alpha = 1$. Remarkably, this highly nontrivial behavior is captured simply through the outcome of our replica analysis for optimal quadratic inference, encapsulated in the pair of equations in (21).
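A numerical sketch verifying the RMT route (25)-(27); the values of $\alpha$, SNR, and $P$ are arbitrary choices. We compare the empirical trace in (25) for one random design against the integral of $\Delta(\lambda)$ over the MP density:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, snr, P = 2.0, 10.0, 400
N = int(alpha * P)

# Empirical side of (25): (1/P) Tr (I + SNR X^T X)^{-1} via the eigenvalues of X^T X
X = rng.normal(scale=1/np.sqrt(P), size=(N, P))
evals = np.linalg.eigvalsh(X.T @ X)
q_empirical = np.mean(1.0 / (1.0 + snr * evals))

# Theory side of (27): integrate Delta(lam) against the continuous MP density (26)
lam_minus, lam_plus = (np.sqrt(alpha) - 1)**2, (np.sqrt(alpha) + 1)**2
lam = np.linspace(lam_minus, lam_plus, 200001)[1:-1]
dlam = lam[1] - lam[0]
rho_mp = np.sqrt((lam_plus - lam) * (lam - lam_minus)) / (2*np.pi*lam)
q_theory = np.sum(rho_mp / (1.0 + snr*lam)) * dlam
# (for alpha < 1, Eq. (26) adds a delta function of weight 1 - alpha at the origin)
```

For $\alpha > 1$ the continuous density integrates to 1, and the empirical and theoretical values of $\tilde q^{\rm Quad}_s$ agree to within finite-size fluctuations of order $1/P$.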

The worst signal and noise distributions are Gaussian
We note that this optimal quadratic inference procedure is optimal amongst all possible inference procedures if and only if the signal and noise are Gaussian, since in that case it is equivalent to the Bayesian MMSE inference procedure. Moreover, we note that Gaussian signal and noise are in some sense the worst type of signal and noise distributions in the space of all inference problems with a given SNR. To see this, consider a non-Gaussian signal and noise with a given SNR. The performance of optimal quadratic inference for this non-Gaussian signal and noise depends on the pair of distributions only through their SNR, and is equivalent to the performance of optimal quadratic inference for Gaussian signal and noise at the same SNR. However, in the non-Gaussian case a non-quadratic inference algorithm could potentially outperform the quadratic one, whereas in the Gaussian case it cannot, since quadratic inference is already optimal there. Thus, in the space of inference problems of a given SNR, the worst-case performance of optimal inference occurs when both the signal and noise are Gaussian.

FIG. 5. A high SNR phase transition in optimal quadratic inference. (A) At large SNR, the MSE of optimal quadratic inference exhibits three distinct scaling regimes for $\alpha < 1$, $\alpha = 1$, and $\alpha > 1$ (see eq. (23)), independent of the signal and noise distributions. For example, when $\alpha = 0.9 < 1$, $q_s^{\rm Quad}$ approaches a constant, whereas when $\alpha = 1$ or $\alpha = 1.1 > 1$, $q_s^{\rm Quad}$ approaches 0 as ${\rm SNR}^{-1/2}$ or ${\rm SNR}^{-1}$ respectively. The theoretical curves (blue) match numerical experiments (error bars) for finite sized problems ($N$ and $P$ vary with $N = \alpha P$ and $\sqrt{NP} = 300$), where the error bars reflect the standard deviation across 80 trials with both signal and noise either Gaussian (black) or Laplacian (red) distributed. (B) The behavior of the MP density (black) in (26). For $\alpha \neq 1$ the nonzero continuous part of the density exhibits a gap at the origin, whereas for $\alpha = 1$ the gap vanishes and the distribution diverges at the origin. For $\alpha < 1$ there is an additional $\delta$-function at the origin (green bar) with weight $1 - \alpha$ (red dot). The blue curve shows the function $\Delta(\lambda) = (1 + \lambda \cdot {\rm SNR})^{-1}$ appearing in the integral for $q_s^{\rm Quad}$ in (27), for the value ${\rm SNR} = 100$.
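The universality claim above, that quadratic inference performance depends on the signal and noise distributions only through their SNR, can be checked numerically. The following is a minimal sketch (not from the original text): it runs ridge (quadratic) inference on variance-matched Gaussian and Laplacian signal/noise pairs; the design-matrix normalization and the choice of penalty $\lambda = \sigma_\epsilon^2/\sigma_s^2$ are our assumptions.

```python
import numpy as np

def ridge_mse(signal_draw, noise_draw, alpha=3.0, P=200, trials=20, seed=0):
    """Average per-component MSE of ridge (quadratic) inference."""
    rng = np.random.default_rng(seed)
    N = int(alpha * P)
    var_s, var_e = 1.0, 0.5          # matched second moments => same SNR
    lam = var_e / var_s              # quadratic regularization strength
    mses = []
    for _ in range(trials):
        s0 = signal_draw(rng, P)
        X = rng.standard_normal((N, P))
        y = X @ s0 + noise_draw(rng, N)
        # ridge estimate: argmin ||y - X s||^2 / (2 var_e) + ||s||^2 / (2 var_s)
        s_hat = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
        mses.append(np.mean((s_hat - s0) ** 2))
    return float(np.mean(mses))

# Gaussian vs. Laplacian signal/noise with identical variances (identical SNR)
gauss_s = lambda rng, n: rng.standard_normal(n)                 # var = 1.0
lapl_s = lambda rng, n: rng.laplace(scale=np.sqrt(0.5), size=n) # var = 1.0
gauss_e = lambda rng, n: np.sqrt(0.5) * rng.standard_normal(n)  # var = 0.5
lapl_e = lambda rng, n: rng.laplace(scale=0.5, size=n)          # var = 0.5

mse_gauss = ridge_mse(gauss_s, gauss_e)
mse_lapl = ridge_mse(lapl_s, lapl_e)
```

With matched variances the two MSEs agree up to finite-size fluctuations, even though the underlying distributions differ.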

Optimal inference with non-Gaussian signal and noise
What is the optimal (non-quadratic) inference procedure in the face of non-Gaussian signal and noise? We address this by performing a functional minimization of $q_s$ over both $\rho$ and $\sigma$, subject to the constraints (17) and (18), which yields the optimal pair ($\rho^{\rm opt}$, $\sigma^{\rm opt}$) in (28)-(29) ([18], sec. 5.2), where $q_s^{\rm opt}$ and $q_d^{\rm opt}$ satisfy the self-consistency conditions (30), and the function $q_s^{\rm MMSE}$ is defined in (9). Again, the validity of (28)-(29) under the RS assumption requires convexity of $\rho^{\rm opt}$ and $\sigma^{\rm opt}$. Convexity of the signal and noise energies $E_s$ and $E_\epsilon$ is sufficient to guarantee convexity of $\rho^{\rm opt}$ and $\sigma^{\rm opt}$, and so for this class of signal and noise, with log-concave distributions, (28)-(29) yields an optimal inference procedure. However, by judicious applications of the Cauchy-Schwarz inequality, we prove ([18], sec. 4.1) that even for non-convex $E_s$ and $E_\epsilon$, the inference error $q_s$ for any convex procedure ($\rho$, $\sigma$) must exceed $q_s^{\rm opt}$ in (30). This result yields a fundamental limit on the performance of any convex inference procedure of the form (1) in high dimensions.
Intriguingly, by comparing the optimal achievable high dimensional M-estimation performance $q_s^{\rm opt}$ in (30) to the asymptotic performance of low dimensional scalar Bayesian inference in (8) and (9), we find a striking parallel. In particular, $q_s^{\rm opt}$ corresponds to the low dimensional asymptotic MMSE in a scalar estimation problem where the effective number of measurements is $N = \alpha$ and the noise is further corrupted by additional Gaussian noise of variance $q_s^{\rm opt}$ ($\epsilon \to \epsilon + \sqrt{q_s^{\rm opt}}\, z$). The correction to the low dimensional scalar asymptotics (9), valid only at large $N$, in the high dimensional regime at finite measurement density $\alpha$, is obtained by self-consistently solving for $q_s^{\rm opt}$ in (30). In essence, at finite measurement density there is an irreducible error $q_s^{\rm opt}$ in estimating the signal. This error contributes to the effective Gaussian noise $q_d^{\rm opt}$ in the scalar MFT estimation problem for the signal, shown in Fig. 3B, where the proximal map becomes the Bayesian posterior mean map in the optimal case. On the other hand, this irreducible extra Gaussian noise is absent in low dimensions (compare the LHS of (30) to (8)). This irreducible error $q_s^{\rm opt}$ can be found by self-consistently solving for it in the RHS of (30). Finally, as a simple point, we note that direct calculation reveals that (30) reduces to (21) when the signal and noise are both Gaussian distributed, as expected, since optimal quadratic inference is the best procedure for Gaussian signal and noise.
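For Gaussian signal and noise the scalar MMSE is available in closed form (the posterior variance), so the self-consistent structure described above can be sketched concretely. The following is our illustrative reconstruction of the Gaussian special case, with effective measurement density $\alpha$ and effective noise variance $\sigma_\epsilon^2 + q$; the iteration scheme and parameter names are our choices.

```python
import numpy as np

def q_opt_gaussian(alpha, var_s=1.0, var_e=0.0, iters=200):
    """Self-consistent inference error q for Gaussian signal and noise.

    Iterates q <- q_MMSE(N = alpha, noise variance var_e + q), where the
    scalar Gaussian MMSE is the posterior variance (1/var_s + N/noise_var)^-1.
    """
    q = var_s  # start from the prior variance (the zero-data error)
    for _ in range(iters):
        q = 1.0 / (1.0 / var_s + alpha / (var_e + q))
    return q
```

In the noiseless undersampled limit this fixed point gives $q = (1 - \alpha)\,\sigma_s^2$, reproducing the Marchenko-Pastur scaling $q_s^{\rm Quad} \approx 1 - \alpha$ discussed for (27); at large $\alpha$ it decays as $\sigma_\epsilon^2/\alpha$, recovering the classical asymptotics.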
Furthermore, using the fact that the equalities in (30) become inequalities for non-optimal procedures ([18], section 4.2), we can derive a high dimensional analogue of (10), and prove a lower bound on the inference error $q_s$ for any convex ($\rho$, $\sigma$), given in (31). This result reflects a fundamental generalization of the high-dimensional CR bound (19) that includes information about the signal distribution $P_s$, which can be optimally exploited by a regularizer $\sigma$.
By the data processing inequality for Fisher information, this high dimensional lower bound is larger than the low-dimensional one (10) under the replacement $\alpha \to N$. Thus, as in the unregularized case (19), the price paid for even optimal high-dimensional regularized inference at finite measurement density, relative to scalar Bayesian inference at asymptotically infinite density, is increased error due to the presence of additional Gaussian noise with dimensionality dependent variance $q_s^{\rm opt}$.
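The direction of this comparison follows from Stam's convolution inequality for Fisher information; a sketch of the step being invoked, in our notation, with $z$ a standard Gaussian:

```latex
% Stam's convolution inequality for Fisher information: for independent X, Y,
%   1/J[X+Y] >= 1/J[X] + 1/J[Y].
% Applying it to the corrupted noise \epsilon_q = \epsilon + \sqrt{q}\, z,
% where J[\sqrt{q}\, z] = 1/q, gives
\frac{1}{J[\epsilon_q]} \;\ge\; \frac{1}{J[\epsilon]} + q
\qquad\Longrightarrow\qquad
J[\epsilon_q] \;\le\; \frac{J[\epsilon]}{1 + q\, J[\epsilon]} \;\le\; J[\epsilon].
```

Since Cramér-Rao-type lower bounds scale inversely with Fisher information, the reduced Fisher information of the corrupted noise makes the high dimensional bound larger than its low dimensional counterpart.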

Optimal high dimensional inference smoothly interpolates between MAP and quadratic inference
The optimal inference procedure (28)-(29) is a smoothed version of MAP inference (see Fig. 4C for an example of smoothing), where the MAP choices $\rho = E_\epsilon$ and $\sigma = E_s$ are smoothed over scales $q_s^{\rm opt}$ and $q_d^{\rm opt}$ respectively to obtain $\rho^{\rm opt}$ and $\sigma^{\rm opt}$. As $\alpha \to \infty$, both $q_s^{\rm opt}$ and $q_d^{\rm opt}$ approach 0 at the same rate, implying $\rho^{\rm opt} \to E_\epsilon$ and $\sigma^{\rm opt} \to E_s$. Thus at high measurement density, MAP inference is the optimal M-estimator. This conclusion is intuitively reasonable because at high measurement densities, the mode of the posterior distribution over the signal, returned by the MAP estimate, is typically close to the mean of the posterior distribution, which is the optimal MMSE estimate amongst all inference procedures.
Alternatively, as $\alpha \to 0$, $q_s^{\rm opt} \to \langle s^2 \rangle$ from below, while $q_d^{\rm opt}$ diverges as $1/\alpha$. The divergence of $q_d^{\rm opt}$ implies that $\sigma^{\rm opt}$ in (29) approaches a quadratic. Thus, remarkably, at low measurement density, simple quadratic regularization, independent of the signal distribution, becomes an optimal inference procedure. Furthermore, in the low density plus high SNR limit, where $\langle \epsilon^2 \rangle \ll \langle s^2 \rangle$, $\rho^{\rm opt}$ also approaches a quadratic. Thus overall, optimal high dimensional inference at high SNR interpolates between MAP and quadratic inference as the measurement density decreases. In Figure 6 we demonstrate, for Laplacian signal and noise, that optimal inference outperforms both MAP and quadratic inference at all $\alpha$, approaching the former at large $\alpha$ and the latter at small $\alpha$.

A relation between optimal high dimensional inference of signal, and low-dimensional Bayesian inference of noise
There is an interesting connection between optimal high dimensional inference and low-dimensional scalar Bayesian inference. Indeed, when $\rho$ and $\sigma$ take their optimal forms in (28) and (29), the proximal descent steps in (16), used to estimate noise and signal in the pair of coupled estimation problems comprising the MFT (shown schematically in Fig. 3AB), become optimal Bayesian estimators. In particular, for optimal $\rho$ and $\sigma$, (16) becomes the pair of posterior mean estimates in (32) ([18], section 5.2). In essence, computation of the proximal map becomes computation of the posterior mean, which is the optimal, MMSE method for estimating signal and noise in the MFT scalar estimation problems. This gives an intuitive explanation for the form of $\rho^{\rm opt}$ and $\sigma^{\rm opt}$ in (28) and (29): these are exactly the forms of loss and regularization required for the proximal descent estimates in (16) to become optimal posterior mean estimates in (32).
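For a Gaussian signal this correspondence can be checked directly: the posterior-mean map for $s$ observed through Gaussian noise of variance $q$ is linear shrinkage, which coincides with the proximal map of a suitably scaled quadratic regularizer. A minimal numerical check (the quadrature grid and parameter values are our choices):

```python
import numpy as np

def posterior_mean(x, var_s, q):
    """MMSE estimate of s ~ N(0, var_s) from x = s + sqrt(q) * z, via quadrature."""
    grid = np.linspace(-20, 20, 20001)
    log_post = -grid ** 2 / (2 * var_s) - (x - grid) ** 2 / (2 * q)
    w = np.exp(log_post - log_post.max())  # subtract max for numerical stability
    return np.sum(grid * w) / np.sum(w)

def prox_quadratic(x, var_s, q):
    """Proximal map of sigma(s) = s^2 / (2 var_s) with step q:
    argmin_s [ (x - s)^2 / (2q) + s^2 / (2 var_s) ] = x * var_s / (var_s + q)."""
    return x * var_s / (var_s + q)

# For a Gaussian signal the two maps agree
for x in (-2.0, 0.3, 1.7):
    assert abs(posterior_mean(x, 1.0, 0.5) - prox_quadratic(x, 1.0, 0.5)) < 1e-6
```

For non-Gaussian signals the posterior mean map is nonlinear, which is exactly why the optimal loss and regularizer differ from their MAP counterparts at finite $\alpha$.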

A relation between signal-noise separation, and predictive power
Furthermore, there is an interesting connection between our ability to optimally estimate noise and signal, and the training and test error. In particular, just as our error $q_s^{\rm opt}$ in estimating the signal is given by (30) and (9), our error in estimating the noise is given by $q_\epsilon^{\rm opt} = \langle (\hat\epsilon - \epsilon)^2 \rangle$, with $\hat\epsilon$ given in (32), yielding (33). In terms of these quantities, the generalization and training errors of the optimal M-estimator have very simple forms (34) ([18], section 5.2).
This leads to an intuitively appealing result: inability to estimate the signal leads directly to increased generalization error, while inability to estimate the noise leads to decreased training error.
The reason for this latter effect is that if the optimal inference procedure cannot accurately separate signal from noise to correctly estimate the noise, then it mistakenly identifies noise in the training data as signal, and this noise is incorporated into the parameter estimate $\hat{\mathbf{s}}$. Thus $\hat{\mathbf{s}}$ acquires correlations with the particular realization of noise in the training set so as to reduce training error. However, this reduced training error comes at the expense of increased generalization error, due again to mistaking noise for signal. The predicted decrease of training error and increase of generalization error for the optimal inference procedure as measurement density decreases is demonstrated in Fig. 6. Interestingly, this figure also demonstrates that training error need not decrease at low measurement density for suboptimal algorithms, like MAP. Thus, in summary, the ability to correctly separate signal from noise to extract a model of the measurements $y^\mu$ in (1) is intimately related to the predictive power of the extracted model $\hat{\mathbf{s}}$ in (1). Inability to estimate noise reduces training error, while inability to estimate signal increases generalization error. The combination is a hallmark of overfitting the learned model parameters to the training data, and thereby incurring a loss of predictive power on new, held-out data.
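This overfitting signature, training error below generalization error because the estimate absorbs the particular noise realization in the training set, is easy to exhibit in simulation. A minimal sketch with ridge inference (the model sizes, noise level, and penalty are our choices, not values from the paper):

```python
import numpy as np

def train_test_errors(alpha, P=200, var_e=0.5, lam=0.5, seed=1):
    """Per-measurement training and generalization errors of ridge inference."""
    rng = np.random.default_rng(seed)
    N = int(alpha * P)
    s0 = rng.standard_normal(P)
    X = rng.standard_normal((N, P))
    y = X @ s0 + np.sqrt(var_e) * rng.standard_normal(N)
    s_hat = np.linalg.solve(X.T @ X + lam * np.eye(P), X.T @ y)
    # training error: residuals on the data used for fitting
    e_train = np.mean((y - X @ s_hat) ** 2)
    # generalization error: fresh measurements, whose noise cannot be
    # correlated with s_hat estimated from the training data
    X_new = rng.standard_normal((N, P))
    y_new = X_new @ s0 + np.sqrt(var_e) * rng.standard_normal(N)
    e_test = np.mean((y_new - X_new @ s_hat) ** 2)
    return e_train, e_test

tr, te = train_test_errors(alpha=1.2)  # low measurement density: strong overfitting
```

At measurement density near 1, the training residuals shrink well below the true noise level while the generalization error rises well above it, mirroring the signal-noise separation argument above.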

E. Inference without noise
Motivated by compressed sensing, there has been a great deal of interest in understanding when and how we can perfectly infer the signal, so that $q_s = 0$, in the undersampled measurement regime $\alpha < 1$. This can only be done in the absence of noise ($\epsilon = 0$), but what properties must the signal distribution satisfy to guarantee such remarkable performance? In this special case of no noise, the corrupted noise $\epsilon_{q_s} = \epsilon + \sqrt{q_s}\, z$ simply becomes a Gaussian variable with variance $q_s$, with Fisher information $J[\epsilon_{q_s}] = 1/q_s$. Using this, and a relation between MMSE and Fisher information ([18], appendix B.4), the optimality equations in (30) become (35). Partially eliminating $q_d^{\rm opt}$ yields (36), where the inequality arises through an application of the convolutional Fisher inequality and then fully eliminating $q_d^{\rm opt}$. Given that, for any signal and noise distribution, we have proven that no convex inference procedure can achieve an error smaller than $q_s^{\rm opt}$, (36) yields a general, sufficient, information theoretic condition for perfect recovery of the signal in the noiseless undersampled regime: the Fisher information of the signal distribution must diverge. This condition holds, for example, for sparse signal distributions that place finite probability mass at the origin. More generally, (36) yields a simple lower bound on noiseless, undersampled inference error in terms of the measurement density and the signal Fisher information. Moreover, in situations where the signal energy is convex, (29) remains the optimal inference procedure, while $\rho^{\rm opt}$ is replaced with a hard constraint enforcing optimization only over candidate signals $\hat{\mathbf{s}}$ satisfying the noiseless measurement constraints $y^\mu = \mathbf{x}^\mu \cdot \hat{\mathbf{s}}$.
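The sufficient condition is illustrated by sparse signals: a distribution with an atom at the origin has diverging Fisher information, and in the noiseless undersampled regime the signal can then be recovered exactly by a convex procedure. A minimal check using $\ell_1$ minimization (basis pursuit) as the convex procedure, cast as a linear program; the problem sizes and sparsity level are our choices:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
P, N, k = 60, 40, 6                   # undersampled: alpha = N/P = 2/3
s0 = np.zeros(P)
support = rng.choice(P, size=k, replace=False)
s0[support] = rng.standard_normal(k)  # sparse signal: probability mass at the origin
X = rng.standard_normal((N, P))
y = X @ s0                            # noiseless measurements

# Basis pursuit: min ||s||_1  s.t.  X s = y, as an LP in s = s_plus - s_minus
c = np.ones(2 * P)
A_eq = np.hstack([X, -X])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * P), method="highs")
s_hat = res.x[:P] - res.x[P:]
```

At this sparsity level the LP returns the true signal to solver precision, despite having fewer measurements than unknowns; with a dense (e.g. Gaussian) signal, exact recovery at $\alpha < 1$ would be impossible.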

III. DISCUSSION
In summary, our theoretical analyses, verified by simulations, yield a fundamental extension of time-honored results in low-dimensional classical statistics to the modern regime of high dimensional inference, relevant in the current age of big data. In particular, we characterize the performance of any possible convex inference procedure for arbitrary signal and noise distributions (Eqs. 17-18); we find fundamental information theoretic lower bounds on the error achievable by any convex procedure for arbitrary signal and noise (Eq. 31); and we find the inference procedure that optimally exploits information about the signal and noise distributions when their energies are convex (Eqs. 28-29). Moreover, we find a simple information theoretic condition for successful compressed sensing (Eq. 36), or perfect inference without full measurement. These results generalize classical statistical results, based on Fisher information and the Cramer-Rao bound, that were discovered over 60 years ago.
Moreover, our analysis uncovers several interesting surprises about the nature of optimal high dimensional inference. In particular, we find that the optimal high dimensional inference procedure is a smoothed version of ML in the unregularized case, and a smoothed version of MAP in the regularized case, where the amount of smoothing increases as the measurement density decreases, or equivalently as the dimensionality increases. At low measurement densities and high dimensions, the optimal smoothed loss and regularization functions become simple quadratics (in the regularized case this is provably true strictly at high SNR, but empirically, replacing the optimal loss with a quadratic loss incurs very little performance decrement even at moderate SNR (Fig. 6A)). This observation reveals a fortuitous interplay between problem difficulty and algorithmic simplicity: at low measurement density, precisely when inference becomes statistically difficult, the optimal algorithm becomes computationally simple. Finally, we uncover phase transitions in the behavior of this simple quadratic inference algorithm, with a universal critical exponent in the decay of inference error with SNR at a critical measurement density (Eq. 23).
Also, our analyses reveal several conceptual insights into the nature of overfitting and generalization in optimal high dimensional inference through novel connections to scalar Bayesian inference in one dimension. This connection arises from the structure of the mean field theory of general high dimensional inference, which can be expressed in terms of two coupled scalar estimation problems for the noise and signal respectively (Fig. 3). In the optimal case, these scalar inference procedures based on proximal descent steps (Eq. 16) become Bayesian inference procedures (Eq. 32). In particular, any inference algorithm implicitly decomposes the given measurements $y^\mu = \mathbf{x}^\mu \cdot \mathbf{s}^0 + \epsilon^\mu$ into a superposition of estimated signal and estimated noise: $y^\mu = \mathbf{x}^\mu \cdot \hat{\mathbf{s}} + \hat\epsilon^\mu$. The scalar Bayesian inference problems yield a MFT prediction for the error in estimating the signal (average per-component $L_2$ discrepancy between $\mathbf{s}^0$ and $\hat{\mathbf{s}}$) and the noise (average per-component $L_2$ discrepancy between $\epsilon^\mu$ and $\hat\epsilon^\mu$). Errors in inference arise because the noise $\epsilon^\mu$ seeps into the estimated signal $\hat{\mathbf{s}}$. This inability to accurately separate signal and noise, even by the optimal inference algorithm, leads to divergent effects on the training and generalization error. The former decreases as the estimated signal $\hat{\mathbf{s}}$ acquires spurious correlations with the true noise $\epsilon^\mu$ to explain the measurement outcomes $y^\mu$. The latter increases because the noise in a held-out, previously unseen measurement outcome cannot possibly be correlated with the signal $\hat{\mathbf{s}}$ estimated from previously seen training data. Indeed, for the optimal inference algorithm, we find exceedingly simple quantitative relationships between the inference errors of noise and signal and the high dimensional training and generalization errors (Eq. 34). This yields both quantitative and conceptual insight into the nature of overfitting in high dimensions, whereby training error can be far less than generalization error.
Overall, our results illustrate the power of statistical mechanics based methods to generalize classical statistics to the new regime of high dimensional data analysis. We hope that these results will provide both firm theoretical guidance and practical algorithmic advantages, in terms of both statistical and computational efficiency, to many fields spanning science, engineering, and the humanities as they all attempt to navigate the brave new world of big data.

FIG. 4. Unregularized inference for Laplacian noise, $E_\epsilon = |\epsilon|$. A comparison of the generalization error (A) and training error (B) of the optimal unregularized M-estimator (20) (black) with the ML (red) and quadratic (blue) loss functions. Solid curves reflect theoretically derived predictions of performance. Error bars reflect performance obtained through numerical optimization of (1) using standard convex optimization solvers for finite size problems ($N$ and $P$ vary, with $N = \alpha P$ and $\sqrt{NP} = 250$). The width of the error bars reflects the standard deviation of performance across 100 different realizations of the quenched disorder. (C) The shape of the optimal loss function in (20) for high dimensional inference as a function of the error or smoothing parameter $q$. As $\alpha$ varies from high to low measurement density, $q$ varies from low to high values, and the optimal loss function varies from the ML loss to a quadratic. Intermediate versions of the optimal loss behave like a smoothed version of the ML loss, with increased smoothing as measurement density decreases (or dimensionality increases).

The expression for $q_s^{\rm Quad}$ in (27) can now be used to elucidate the nature of the phase transition in Fig. 5A. At high SNR, the function $\Delta(\lambda)$ remains $O(1)$ only in a narrow regime of width $O(1/{\rm SNR})$ near the origin. However, when $\alpha < 1$, the left edge $\lambda_-$ of the nonzero part of the MP density remains separated from the origin. Due to this eigenvalue density gap, the dominant contribution to the integral in (27) arises from the $\delta$-function at the origin, yielding $q_s^{\rm Quad} \approx 1 - \alpha$ when $\alpha < 1$ (see Fig. 5B, top). When $\alpha > 1$, the $\delta$-function is absent and the dominant contribution arises from the nonzero part of the MP density. This density has support over a range that is $O(\alpha)$, yielding $q_s^{\rm Quad} = O\!\left(\frac{1}{\alpha\,{\rm SNR}}\right)$ (see Fig. 5B, bottom).
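This eigenvalue-gap argument can be checked numerically by building the empirical spectrum of $X^\top X / N$ and averaging the integrand $\Delta(\lambda)$ over it. A finite-size sketch (the sizes, SNR value, and normalization convention are our choices):

```python
import numpy as np

def q_quad(alpha, P=400, snr=1e4, seed=0):
    """High-SNR quadratic inference error from the empirical MP spectrum of X^T X / N."""
    rng = np.random.default_rng(seed)
    N = int(alpha * P)
    X = rng.standard_normal((N, P))
    # P eigenvalues; P - N of them vanish when alpha < 1 (the delta function at 0)
    lams = np.linalg.eigvalsh(X.T @ X / N)
    # average of Delta(lambda) = 1 / (1 + lambda * SNR) over the empirical density
    return float(np.mean(1.0 / (1.0 + lams * snr)))

# alpha < 1: the delta function at the origin dominates, giving q ~ 1 - alpha;
# alpha > 1: the continuous spectrum is gapped away from 0, giving q = O(1/SNR)
```

With $\alpha = 0.5$ the result sits near $1 - \alpha = 0.5$ regardless of SNR, while with $\alpha = 2$ it collapses toward zero at high SNR, exhibiting the two scaling regimes described above.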

FIG. 6. Regularized inference for Laplacian noise and signal, $E_\epsilon = |\epsilon|$, $E_s = |s^0|$. (A) The normalized MSE, or fraction of unexplained variance $q_s$. (B) The training error. Each plot shows the respective performance of three different inference procedures: our optimal inference (28)-(29) (black), MAP inference (red), and optimal quadratic inference (blue). The theoretical predictions (solid curves) match numerical simulations (error bars), which reflect the standard deviation calculated over 20 trials using a convex optimization solver for randomly generated, finite sized data (with $N$ and $P$ varying while $N = \alpha P$ and $\sqrt{NP} = 250$). Note that optimal inference can significantly outperform common but suboptimal methods. For example, to achieve a fraction of unexplained variance of 0.4, optimal inference requires a measurement density of $\alpha \approx 1.7$, while quadratic and MAP inference require $\alpha \approx 2.1$ and $\alpha \approx 2.2$ respectively. This reflects a reduction of approximately 20 percent in the amount of required data.