Rapid and deterministic estimation of probability densities using scale-free field theories

The question of how best to estimate a continuous probability density from finite data is an intriguing open problem at the interface of statistics and physics. Previous work has argued that this problem can be addressed in a natural way using methods from statistical field theory. Here I describe new results that allow this field-theoretic approach to be rapidly and deterministically computed in low dimensions, making it practical for use in day-to-day data analysis. Importantly, this approach does not impose a privileged length scale for smoothness of the inferred probability density, but rather learns a natural length scale from the data due to the tradeoff between goodness-of-fit and an Occam factor. Open source software implementing this method in one and two dimensions is provided.

Suppose we are given N data points, x_1, x_2, ..., x_N, each of which is a D-dimensional vector drawn from a smooth probability density Q_true(x). How might we estimate Q_true from these data? This classic statistics problem is known as "density estimation" [1] and is routinely encountered in nearly all fields of science. Ideally, one would first specify a Bayesian prior p(Q) that weights each density Q(x) according to some sensible measure of smoothness. One would then compute a Bayesian posterior p(Q|data) identifying which densities are most consistent with both the data and the prior. However, a practical implementation of this straightforward approach has yet to be developed, even in low dimensions. This paper discusses one such strategy, the main theoretical aspects of which were worked out by Bialek et al. in 1996 [2]. One first assumes a specific smoothness length scale ℓ. A prior p(Q|ℓ) that strongly penalizes fluctuations in Q below this length scale is then formulated in terms of a scalar field theory. The maximum a posteriori (MAP) density Q_ℓ, which maximizes p(Q|ℓ, data) and serves as an estimate of Q_true, is then computed as the solution to a nonlinear differential equation. This approach has been implemented and further elaborated by others [3-10]; a connection to previous literature on "maximum penalized likelihood" [1] should also be noted.
Left lingering is the question of how to choose the length scale ℓ. Bialek et al. argued, however, that the data themselves will typically select a natural length scale due to the balancing effects of goodness-of-fit (i.e., the posterior probability of Q_ℓ) and an Occam factor (reflecting the entropy of model space [11]). Specifically, if one adopts a "scale-free" prior p(Q), defined as a linear combination of scale-dependent priors p(Q|ℓ), then the posterior distribution over length scales, p(ℓ|data), will become sharply peaked in the large data limit. This important insight was confirmed computationally by Nemenman and Bialek [3] and provides a compelling alternative to cross-validation, the standard method of selecting length scales in statistical smoothing problems [1].
However, computing p(ℓ|data) requires first computing Q_ℓ at every relevant length scale, i.e., solving an infinite compendium of nonlinear differential equations. Nemenman and Bialek [3] approached this problem by computing Q_ℓ at a finite, pre-selected set of length scales. Although this strategy yielded important results, it also has significant limitations. First, it is unclear how to choose the set of length scales needed to estimate Q_true to a specified accuracy. Second, as was noted [3], this strategy is very computationally demanding. Indeed, no implementation of this approach has since been developed for general use, and performance comparisons to more standard density estimation methods have yet to be reported.
Here I describe a rapid and deterministic homotopy method for computing Q_ℓ to a specified accuracy at all relevant length scales. This makes low-dimensional density estimation using scale-free field-theoretic priors practical for use in day-to-day data analysis. The open source "Density Estimation using Field Theory" (DEFT) software package, available at github.com/jbkinney/13_deft, provides a Python implementation of this algorithm for 1D and 2D problems. Simulation tests show favorable performance relative to standard Gaussian mixture model (GMM) and kernel density estimation (KDE) approaches [1].
Following [2,3] we begin by defining p(Q) as a linear combination of scale-dependent priors p(Q|ℓ):

p(Q) = ∫ dℓ p(ℓ) p(Q|ℓ).  (1)

Adopting the Jeffreys prior p(ℓ) ∼ ℓ^{−1} renders p(Q) covariant under a rescaling of x [11]. Our ultimate goal will be to compute the resulting posterior,

p(Q|data) = ∫ dℓ p(ℓ|data) p(Q|ℓ, data).  (2)

As in [3], we limit our attention to a D-dimensional cube having volume V = L^D. We further assume periodic boundary conditions on Q, and impose G^D grid points (G in each dimension) at which Q will be computed. To guarantee that each density is positive and normalized, we define Q in terms of a real scalar field φ as

Q_φ(x) = e^{−φ(x)} / ∫_V d^D x' e^{−φ(x')}.  (3)

Each Q corresponds to multiple different φ, but there is a one-to-one correspondence with fields φ^nc that have no constant Fourier component. Using this fact, we adopt the standard path integral measure Dφ^nc as the measure on Q-space, and define the prior p(Q|ℓ) in terms of a field theory on φ^nc. In this paper we consider the specific class of priors discussed by [2], i.e.,

p(Q|ℓ) = (1/Z_0) exp[ −∫_V d^D x (ℓ^{2α−D}/2) φ Δ φ ].  (4)
Here we take α to be a positive integer and define the differential operator Δ = (−∇²)^α. Z_0 is the corresponding normalization factor. This prior effectively constrains the α-order derivatives of φ^nc, strongly damping Fourier modes that have wavelength much less than ℓ. Applying Bayes's rule to this prior yields the following exact expression for the posterior [12]:

p(φ|ℓ, data) = e^{−S_N[φ]} / ∫ Dφ^nc e^{−S_N[φ]},  (5)

where

S_N[φ] = ∫_V d^D x [ (ℓ^{2α−D}/2) φ Δ φ + N R φ + (N/V) e^{−φ} ]  (6)

is the "action" described by [2] and explored in later work [3,8-10]. Here,

R(x) = (1/N) Σ_n δ(x − x_n)

is the raw data density, with the sum running over the N data points. It should be noted that [2] used Q(x) = const × e^{−φ(x)} in place of Eq. 3, and enforced normalization using a delta function factor in the prior p(Q|ℓ); Eq. 6 was then derived using a large N saddle point approximation. The alternative formulation in Eqs. 3 and 4 renders the action in Eq. 6 exact.
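To make the grid representation concrete, here is a minimal 1D sketch (illustrative only, not code from the DEFT package) of the two ingredients just defined: the operator Δ = (−∇²)^α assembled as a sparse matrix on a periodic grid using a second-order finite-difference Laplacian, and the discretized action S_N[φ], written here with the kinetic coefficient expressed as N e^{−t} for a reparametrized length scale t (a convention assumed only for this sketch) and with the N R φ term represented by histogram counts. All function names are hypothetical.

```python
import numpy as np
import scipy.sparse as sp

def delta_operator(G, L, alpha):
    """Sparse matrix for Delta = (-Laplacian)^alpha on a periodic 1D grid."""
    h = L / G  # grid spacing
    lap = sp.diags(
        [-np.ones(G - 1), 2.0 * np.ones(G), -np.ones(G - 1)], [-1, 0, 1]
    ).tolil()
    lap[0, G - 1] = -1.0  # periodic wrap-around terms
    lap[G - 1, 0] = -1.0
    neg_lap = lap.tocsc() / h**2  # finite-difference matrix for -d^2/dx^2
    delta = sp.eye(G, format="csc")
    for _ in range(alpha):  # apply (-Laplacian) alpha times
        delta = delta @ neg_lap
    return delta

def action_SN(phi, delta, counts, t, V):
    """Discretized action S_N[phi]: kinetic term + data term + e^{-phi} term.

    counts: histogram counts of the data at each grid point (sum = N);
    t: log length-scale parameter, with kinetic coefficient N * exp(-t).
    """
    G = len(phi)
    h = V / G  # grid cell volume
    N = counts.sum()
    kinetic = 0.5 * N * np.exp(-t) * h * (phi @ (delta @ phi))
    data_term = counts @ phi  # discretizes the integral of N * R * phi
    exp_term = (N / V) * h * np.exp(-phi).sum()
    return kinetic + data_term + exp_term
```

Note that the bandwidth of the Δ matrix grows with α, the sparsity loss mentioned below. As a sanity check on the action, a constant field φ = c has vanishing kinetic term and gives S_N = N(c + e^{−c}), which is minimized at c = 0, consistent with Q being normalized at the MAP point.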
The MAP density Q_ℓ corresponds to the classical path φ_ℓ, i.e., the field that minimizes S_N. Setting δS_N/δφ = 0 gives the nonlinear differential equation

e^{−t} Δ φ_ℓ = Q_ℓ − R,  (7)

where Q_ℓ(x) = e^{−φ_ℓ(x)}/V is the probability density corresponding to φ_ℓ and t ≡ ln(N ℓ^{D−2α}) reparametrizes the length scale. The central finding of this paper is that, instead of computationally solving Eq. 7 at select length scales ℓ, we can compute φ_ℓ at all length scales of interest using a convex homotopy method [13]. First we differentiate Eq. 7 with respect to t, yielding

[e^{−t} Δ + Q_ℓ] (dφ_ℓ/dt) = Q_ℓ − R.  (8)

If we know φ_ℓ at any specific length scale ℓ_i, we can determine φ_ℓ at any other length scale ℓ_f, and at all length scales in between, by integrating Eq. 8 from ℓ_i to ℓ_f. Because S_N[φ] is a strictly convex function of φ, each φ_ℓ so identified will uniquely minimize this action. Moreover, because Eq. 6 is exact, each corresponding Q_ℓ will fit the data optimally even when N is small. And since the matrix representation of e^{−t} Δ + Q_ℓ is sparse, dφ_ℓ/dt can be rapidly computed at each successive value of t using standard sparse matrix methods.
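The homotopy integration can be sketched with an off-the-shelf adaptive ODE solver. The code below is a 1D illustration, not the DEFT implementation; it assumes the concrete forms e^{−t} Δ φ_ℓ = Q_ℓ − R and [e^{−t} Δ + Q_ℓ] dφ_ℓ/dt = Q_ℓ − R (with t increasing as ℓ shrinks), and it starts from φ ≈ 0 at a sufficiently small t, where the field is weak.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spl
from scipy.integrate import solve_ivp

def integrate_homotopy(delta, R, t_i, t_f, V, phi_i=None):
    """Integrate dphi/dt across length scales (homotopy in t).

    delta: sparse matrix for the operator Delta; R: raw data density
    evaluated on the grid; V: box volume. Returns the ODE solution;
    sol.sol(t) interpolates phi at any intermediate t.
    """
    if phi_i is None:
        phi_i = np.zeros(len(R))  # weak-field limit: phi -> 0 as t -> -infinity

    def rhs(t, phi):
        Q = np.exp(-phi) / V  # density corresponding to the current field
        A = (np.exp(-t) * delta + sp.diags(Q)).tocsc()
        return spl.spsolve(A, Q - R)  # sparse linear solve at each step

    return solve_ivp(rhs, (t_i, t_f), phi_i, dense_output=True, rtol=1e-6)
```

A convenient property of this flow (under the assumed forms above) is that any deviation F = e^{−t} Δ φ − Q + R from the MAP condition obeys dF/dt = −F along trajectories, so an imperfect initial field is exponentially forgiven as the integration proceeds, and the solver's adaptive steps supply φ_ℓ at a near-continuum of length scales.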
To identify a length scale ℓ_i from which to initiate integration of Eq. 8, we look to the large length scale limit, where a weak-field approximation can be used to compute φ_{ℓ_i}. Linearizing Eq. 7 and solving for φ_ℓ gives, for |k| > 0,

φ̂_{ℓ,k} = −V R̂_k / (1 + e^{τ_k − t}),

where hats denote Fourier transforms, k ∈ Z^D indexes the Fourier modes of the volume V, and each τ_k = ln[(2π|k|)^{2α} L^{D−2α}] is a log eigenvalue of V Δ. To guarantee that this weak-field approximation is accurate for every Fourier mode, ℓ_i is chosen large enough that each component φ̂_{ℓ_i,k} is small.

Having computed φ_ℓ at every relevant length scale, a semiclassical approximation gives

p(ℓ|data) ∝ p(ℓ) e^{−S_N[φ_ℓ]} / √( det[Δ + e^t Q_ℓ] / det[Δ] ),  (9)

with the determinants restricted to nonconstant modes. The MAP probability, p(φ_ℓ|ℓ, data), is represented by the exponential term. This is the "goodness-of-fit", which steadily increases as ℓ gets smaller. This is multiplied by an Occam factor (the inverse rooted quantity), which steadily decreases as ℓ gets smaller due to the increasing entropy of model space. As discussed by [2,3], this tradeoff causes p(ℓ|data) to peak at a nontrivial, data-determined length scale ℓ*.

FIG. 3. (Color) Comparison of density estimation methods in one dimension. 10² densities Q_true(x) were generated randomly, each as the sum of five Gaussians. Data sets of various size N were then drawn from each Q_true, after which estimates Q* were computed using DEFT (G = 100, various α), KDE (using Scott's rule to set kernel bandwidth), and GMM (using the Bayesian information criterion to choose the number of components). Accuracy was quantified using the geodesic distance D_geo(Q_true, Q*) shown in Eq. 11. Box plots indicate the median, interquartile range, and 5%-95% quantile range of these D_geo values.

The length scale prior p(ℓ) must decay faster than ℓ^{−1} in the infrared in order for p(ℓ|data) to be normalizable. The need for such regularization reflects redundancy among the priors p(φ^nc|ℓ) for ℓ > ℓ_i that results from the volume V supporting only a limited number of long wavelength Fourier modes. Similar concerns hold in the ultraviolet due to our use of a grid. We therefore set
p(ℓ) = 0 for ℓ > ℓ_i and ℓ < ℓ_f. The posterior density p(Q|data) can be sampled by first choosing ℓ ∼ p(ℓ|data), then selecting φ ∼ p(φ|ℓ, data). This latter step simplifies when p(ℓ|data) is strongly peaked about a specific ℓ* because, in this case, the eigenvalues and eigenfunctions of Δ + e^t Q_ℓ do not depend strongly on which ℓ is selected and therefore need to be computed only once. p(φ|ℓ, data) can then be sampled by choosing

φ = φ_{ℓ*} + Σ_j η_j √( e^{t*}/(N λ_j) ) ψ_j,  (10)

where each ψ_j is an eigenfunction of the operator Δ + e^{t*} Q* satisfying ∫ d^D x ψ_j(x)² = 1, λ_j is the corresponding eigenvalue, and each η_j is a normally distributed random variable.

Fig. 1 illustrates key steps of the DEFT algorithm. First, the user specifies a data set {x_n}_{n=1}^N, a bounding box for the data, and the number of grid points to be used. A histogram of the data is then computed using bins that are centered on each grid point (Fig. 1a). Next, length scales ℓ_i and ℓ_f are chosen. Eq. 8 is then integrated to yield φ_ℓ at a set of length scales between ℓ_i and ℓ_f chosen automatically by the ODE solver to achieve the desired accuracy. Eq. 9 is then used to compute p(ℓ|data) at each of these length scales, after which ℓ* is identified. Finally, a specified number of densities are sampled from p(Q|data) using Eq. 10.
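The sampling step described above amounts to adding independent Gaussian mode amplitudes to the MAP field. The following sketch (1D, dense eigensolver for clarity, illustrative names; it assumes per-mode variance e^{t*}/(N λ_j), the convention used throughout this illustration) shows one way this could look:

```python
import numpy as np

def sample_posterior(delta, phi_star, t_star, N, V, n_samples, seed=None):
    """Draw fields from the semiclassical Gaussian posterior about phi_star.

    delta: dense matrix for Delta; phi_star: MAP field at the selected
    scale; t_star: log length-scale parameter there; N: number of data.
    """
    rng = np.random.default_rng(seed)
    G = len(phi_star)
    h = V / G  # grid cell volume
    Q_star = np.exp(-phi_star) / V
    A = delta + np.exp(t_star) * np.diag(Q_star)
    lam, psi = np.linalg.eigh(A)  # eigenpairs of Delta + e^{t*} Q*
    psi = psi / np.sqrt(h)  # normalize so that h * sum_x psi_j(x)^2 = 1
    sigma = np.sqrt(np.exp(t_star) / (N * lam))  # per-mode standard deviations
    eta = rng.standard_normal((n_samples, G))  # eta_j ~ N(0, 1)
    return phi_star + (eta * sigma) @ psi.T
```

Because the eigen-decomposition is computed once, drawing additional samples costs only a matrix multiply, which is the simplification noted above when p(ℓ|data) is sharply peaked.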
DEFT is not completely scale-free because both the box size L and grid spacing h are pre-specified by the user. In practice, however, Q* appears to be very insensitive to the specific values of L and h as long as the box amply contains the data and the grid is fine enough to resolve the structure of Q_true.

It is interesting to consider how the choice of α affects Q*. As Bialek et al. have discussed [2], this field-theoretic approach produces ultraviolet divergences in φ when α < D/2. Above this threshold, increasing α typically increases the smoothness of Q*, although not necessarily by much (see Fig. 2c). However, larger values of α may necessitate more data before the principal Fourier modes of Q_true appear in Q*. Increasing α also reduces the sparseness of the Δ matrix, thereby increasing the computational cost of the homotopy method.
To assess how well DEFT performs in comparison to more standard density estimation methods, a large number of data sets were simulated, after which the accuracy of Q* produced by various estimators was computed. Specifically, the "closeness" of Q_true to each estimate Q* was quantified using the natural geodesic distance [14],

D_geo(Q_true, Q*) = 2 arccos[ ∫ d^D x √( Q_true(x) Q*(x) ) ].  (11)

As shown in Fig. 3, DEFT performed substantially better when α = 2 or 3 than when α = 1. This likely reflects the smoothness of the simulated Q_true densities. DEFT outperformed the KDE method tested here and, for α = 2 or 3, performed as well as or better than GMM. This latter observation suggests nearly optimal performance by DEFT, since each simulated Q_true was indeed a mixture of Gaussians.
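On a grid, this geodesic distance reduces to a clipped arccosine of a discretized overlap integral. A minimal sketch, assuming densities sampled at uniform grid points with cell volume h (function name illustrative):

```python
import numpy as np

def geodesic_distance(Q1, Q2, h):
    """D_geo = 2 * arccos( integral of sqrt(Q1 * Q2) ) on a uniform grid.

    Q1, Q2: normalized densities at the grid points; h: grid cell volume.
    """
    overlap = np.sum(np.sqrt(Q1 * Q2)) * h  # Bhattacharyya-type overlap
    # clip guards against roundoff pushing the overlap slightly above 1
    return 2.0 * np.arccos(np.clip(overlap, -1.0, 1.0))
```

For normalized, nonnegative densities the overlap lies in [0, 1], so the distance ranges from 0 (identical densities) to π (densities with disjoint support).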
In two dimensions, DEFT shows a remarkable ability to discern structure from a limited amount of data (Fig. 4). As in 1D, larger values of α give a smoother Q * . However, DEFT requires substantially more computational power in 2D than in 1D due to the increase in the number of grid points and the decreased sparsity of the ∆ matrix. For instance, the computation shown in Fig. 1 took about 0.3 sec, while the DEFT computations shown in Fig. 4 took about 1-3 sec each [15].
Field-theoretic density estimation faces two significant challenges in higher dimensions. First, the computational approach described here is impractical for D ≳ 3 due to the enormous number of grid points that would be needed. It should be noted, however, that the 1D field theory discussed by Holy [4] allows Q to be computed without using a grid. It may be possible to extend this approach to higher dimensions.
The "curse of dimensionality" presents a more fundamental problem. As discussed by Bialek et al. [2], this manifests in the fact that increasing D requires a proportional increase in α, i.e., in one's notion of "smoothness." This likely indicates a fundamental problem with using Δ = (−∇²)^α to define high dimensional priors. Using a different operator for Δ, e.g., one with reduced rotational symmetry, might provide a way forward.
I thank Gurinder Atwal, Anne-Florence Bitbol, Daniel Ferrante, Daniel Jones, Bud Mishra, Swagatam Mukhopadhyay, and Bruce Stillman for helpful conversations. Support for this work was provided by the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory.