Density estimation on small datasets

How might a smooth probability distribution be estimated, with accurately quantified uncertainty, from a limited amount of sampled data? Here we describe a field-theoretic approach that addresses this problem remarkably well in one dimension, providing an exact nonparametric Bayesian posterior without relying on tunable parameters or large-data approximations. Strong non-Gaussian constraints, which require a non-perturbative treatment, are found to play a major role in reducing distribution uncertainty. A software implementation of this method is provided.

The need to estimate smooth probability distributions from a limited number of samples is ubiquitous in data analysis [1]. This "density estimation" problem also presents a fundamental conceptual challenge in statistical learning, important aspects of which remain unresolved. These outstanding problems are especially acute in the context of small datasets, where standard large-dataset approximations do not apply. Here we investigate the potential for Bayesian field theory, an area of statistical learning based on field-theoretic methods in physics [2][3][4][5], to estimate probability densities in this small data regime.
Density estimation requires answering two distinct questions. First, what is the best estimate for the underlying probability distribution? Second, what do other plausible distributions look like? Ideally, one would like to answer these questions by first considering all possible distributions (regardless of mathematical form), then identifying those that fit the data while satisfying a transparent notion of smoothness. Such an approach should not require one to manually identify values for critical parameters, specify boundary conditions, or make invalid mathematical approximations in the small data regime. However, the most common density estimation approaches, including kernel density estimation (KDE) [1] and Dirichlet process mixture modeling (DPMM) [6,7], do not satisfy these requirements.
Previous work has described a Bayesian field theory approach, called Density Estimation using Field Theory (DEFT) [8,9], for addressing the density estimation problem in low dimensions. DEFT satisfies all of the above criteria except for the last one: in [8,9], an appeal to the large data regime was used to justify a Laplace approximation (i.e., a saddle-point approximation) of the Bayesian posterior. This approximation facilitated the sampling of an ensemble of plausible densities, as well as the identification of an optimal smoothness lengthscale. Independent but closely related work [10] has also relied heavily on this approximation.
Here we investigate the performance of DEFT in the small data regime and find that the Laplace approximation advocated in prior work can be catastrophic. This is because non-Gaussian features of the DEFT posterior are * Email correspondence to jkinney@cshl.edu critical for suppressing "wisps" -large positive fluctuations that otherwise occur in posterior-sampled densities. We further find that these non-Gaussian effects cannot be addressed perturbatively using Feynman diagrams, as has been suggested in other Bayesian field theory contexts [4,5]. These results are not specific to DEFT, but rather reflect the fundamentally nonperturbative nature of the density estimation problem.
Happily, we find that importance resampling [7] can rapidly and effectively correct for the Laplace approximation. The resulting DEFT algorithm, which we have made available in robust and easy-to-use software, thus appears to satisfy all of the above requirements for an ideal density estimation method in one dimension. Tests of DEFT on simulated data show favorable performance relative to KDE and DPMM. We also illustrate the utility of DEFT on real data from the Large Hadron Collider [11] and World Health Organization (WHO) [12] .
We first recap the DEFT approach to density estimation [8,9]. Consider N data points {x i } N i=1 drawn from a smooth one-dimensional probability distribution Q true (x) that is confined to an x-interval of length L. From these data we wish to obtain a best estimate Q * of Q true , as well as an ensemble of plausible distributions with which to quantify the uncertainty in this estimate.
DEFT reparametrizes each candidate distribution Q in terms of a field φ via After adopting a Bayesian prior that constrains the αorder x-derivative of φ (denoted by ∂ α φ in what follows), and accounting for the likelihood of the data given φ, one obtains a posterior distribution on φ. We represent this posterior as p(Q data, is the "posterior action" described in [9]. In Eq. 2, is a smoothness lengthscale that has yet to be determined and  (d) 100 distributions sampled from the Laplace-approximated posterior p Lap (Q data), which accounts for uncertainty in as well as in Q. (e) 100 distributions generated using importance resampling of the Laplace ensemble. The differential entropies of the illustrated distributions are provided.
S [φ] is minimized at the maximum a posteriori (MAP) field φ . The MAP field φ is unique even in the absence of boundary conditions; see SI.2 for details. Although φ cannot be solved analytically, it is readily computed as the solution to a convex optimization problem after discretization of the x-domain at G equally-spaced grid points. In this discrete representation, R becomes a histogram with bin width h = L G. As long as h ≪ , the choice of G will not greatly affect φ . The optimal lengthscale * is identified by maximizing the Bayesian evidence, p(data ); see SI.3 for details. Q * = Q * is then used as our best density estimate. Fig. 1(a-c) illustrates this procedure on simulated data.
To characterize the uncertainty in the DEFT estimate Q * , we sample the Bayesian posterior p(Q data) = ∫ d p( data)p(Q data, ). Each sample is generated by first drawing from p( data), then drawing Q from p(Q data, ). Previous work [8] has suggested that this sampling task be performed using the Laplace approximation, i.e., approximating p(Q data, ) with a Gaussian that has the same mean and Hessian. The corresponding action, S Lap [φ], is thus quadratic in δφ = φ − φ . This Laplace approximation has the advantage that posterior samples Q can be rapidly and independently generated [8]. Fig. 1d shows multiple Qs sampled from the Laplace posterior p Lap (Q data) = ∫ d p( data)p Lap (Q data, ). Clearly something is very wrong. Although many of these Qs appear reasonable, others exhibit wisps that have substantial probability mass far removed from the data.
We hypothesized that wisps are an artifact of the Laplace approximation. To correct for potential inaccuracies of this approximation, we adopted an importance resampling approach [7]. For each sampled φ we computed a weight We then resampled the Laplace ensemble with replacement, selecting each φ (and thus Q) with a probability proportional to w [φ]. A mixture of such resampled ensembles across lengthscales was then used to generate an ensemble reflecting p(Q data); see SI.4 for details. Fig. 1e shows 100 distributions Q from this resampled posterior. Wisps no longer appear. Eliminating wisps is especially important when estimating values for summary statistics, such as distribution entropy. In entropy estimation, the goal is to discern a value for the quantity H true = H[Q true ] where H[Q] = − ∫ dx Q(x) log 2 Q(x). Using the DEFT posterior ensemble, we can estimate H true asĤ ± δH, wherê H = ⟨H⟩ and δH = ⟨H 2 ⟩ − ⟨H⟩ 2 , with ⟨⋅⟩ denoting a posterior average. Previous work expressed hope that the ensemble provided by the Laplace approximation might serve this purpose [8]. But in this case we see thatĤ is far less accurate than the point estimates H[R] or H[Q * ], and δH is enormous (Fig. 1d). Importance resampling fixes both problems: the resultingĤ is closer to H true than either point estimate, and δH is remarkably small (Fig. 1e).
We now turn to the problem of understanding how wisps arise. To this end we consider the variation in the action upon φ → φ + δφ. One finds that The first (kinetic) term on the right hand side of Eq. 4 imposes a smoothness constraint on δφ, while the second (potential) term keeps δφ confined to a potential well consistent with the data. See SI.5 for details. Note that V is convex, nonnegative, and vanishes when δφ = 0.
By analogy to equipartition, we define n eff , the effective number of degrees of freedom constrained by the data, as twice the value of the second term in Eq. 4 averaged over the posterior ensemble. Typical fluctuations δφ will therefore exhibit V (δφ) ∼ n eff 2. We now separately consider the "data rich" regime of the x domain, which we define by Q (x) ≫ n eff 2N L, and the "data poor" regime, corresponding to Q (x) ≪ n eff 2N L. In the data rich regime, fluctuations are small enough that V adheres well to its Laplace approximation, V ≈ N LQ δφ 2 2. Under this nearly symmetric potential, both positive fluctuations δφ + and negative fluctuations δφ − are constrained by By contrast, V is highly asymmetric in the data poor regime and produces highly asymmetric fluctuations. Positive fluctuations satisfy δφ + ∼ n eff 2N LQ , whereas negative fluctuations obey See SI.5 for more information.
The key point is that adopting is equivalent to assuming the Laplace approximation for V throughout the entire x-domain. Because δφ rich ≫ δφ − poor in data poor regions, the Laplace approximation greatly overestimates the size of downward fluctuations in φ . This results in the large upward fluctuations in Q that we identify as wisps. We note that wisps are especially prominent at the x-interval boundaries in Fig. 1 for two reasons: (i) Q is especially small here, making these regions very data poor, and (ii) the kinetic term in Eq. 4, which is all that suppresses wisps in data poor regions, is less effective at constraining δφ because data are present on only one side.
Feynman diagrams provide a general means of correcting for inaccuracies in Laplace approximations [13], and have been advocated in the context of some Bayesian field theory regression problems [4,5]. For density estimation, however, Feynman diagrams are ineffective if any region of the x interval is data poor. This is due to the action S [φ] being strongly coupled. For example, in the Bayesian evidence computations used to determine * , DEFT estimates the action . See SI.3 for details. At first, one might think it possible to correct for potential inaccuracies in this approximation using a series of vacuum diagrams (see SI.6), i.e., log Z However, as described in SI.8, the number of diagrams needed to obtain accurate results is prohibitive when data-poor regions of the x-interval are present. Fortunately, one can instead compute nonperturbative corrections to this log ratio using the importance resampling weights in Eq. 3 via See SI.7 for details. These results reflect a fundamental yet underappreciated aspect of density estimation: unless data are observed throughout the x-domain, the uncertainties in estimated probability densities require a nonperturbative treatment. Specifically, nonperturbative methods such as the Laplace approximation or Feynman diagrams can only be expected to work if Q true (x) ≳ 1 N L everywhere within the x domain. Very often, however, density estimation is applied to data like that in Fig. 1, which is localized far away from one or both x-interval boundaries. We argue that the analysis of such data will quite generally require a nonperturbative treatment.
To benchmark the performance of DEFT, we quantified its ability to estimate probability densities of known functional form. Specifically, we simulated datasets of varying size N from a variety of Q true distributions, then asked two questions. First, how accurately does Q * estimate Q true ? Second, how typical is Q true among the distributions in the Bayesian posterior? In both contexts, DEFT was compared to KDE and DPMM. See SI.9 for details on how KDE and DPMM were implemented. Fig.  2 shows the results of these performance tests for two different choices of Q true . Fig. S3 in SI provides analogous results for other Q true distributions.
To answer the first question, we compared the Kullback-Leibler divergence, D KL (Q true Q * ), achieved by each estimator on each dataset. Note that smaller values for these divergences indicate better method accuracy. As illustrated in Fig. 2b, DEFT usually performed comparably to KDE and DPMM at N = 10, and somewhat better at N = 100. DEFT appears to have a particular advantage over both KDE and DPMM on Q true distributions that bump up against one or both x-interval boundaries. Also unsurprising is that DEFT performs notably better with α = 2, 3, and 4 than with α = 1, since α = 1 yields non-smooth Q * distributions with cusps at each data point [8,14].
To answer the second question, we computed where D KL (Q true Q * ) falls within the distribution of divergences D KL (Q Q * ) observed for Q ∼ p(Q data). This location is naturally quantified by a p-value corresponding to the null hypothesis that Q true ∼ p(Q data). If Q true is typical of plausible Qs, these p-values should be uniformly distributed between 0 and 1. Alternatively, p-values clustered close to 0 indicate that posterior ensemble p(Q data) overestimates how much Q true diverges from Q * , whereas p-values clustered close to 1 indicate that p(Q data) underestimates this uncertainty. Fig. 2c shows our results for the two choices of Q true in Fig. 2a; results for other choices of Q true are shown in Fig. S3. In general, the p-values for DEFT (with α = 2, 3, and 4) were distributed with remarkable uniformity. DEFT with α = 1 tended to overestimate uncertainties, whereas KDE and DPMM tended to underestimate uncertainties.
Finally, we illustrate the capabilities of DEFT using data reported in the initial observation of the Higgs boson [11] (see Fig. S4 for an analysis of data from the WHO). Fig. 3a, which is a reconstruction of Fig. 4 of [11], shows a histogram of the invariant masses of N = 58 4-lepton events observed by the CMS Collaboration at the Large Hadron Collider. Such events are generated by the decays of the Higgs boson via H → ZZ → 4 , but they also arise from a variety of background decay processes. One of the challenges faced by the CMS Collaboration was determining whether these data exhibit a localized excess of events representing a possible Higgs resonance. Fig.  3b shows DEFT applied to these data using default parameters. Despite Higgs decays representing only ∼ 10% of the observed events, DEFT detects a prominent local maxima near the Higgs resonance at m H = 125 GeV. The confidence in this maxima can be quantified by sampling p(Q data): 81% of sampled Qs have exactly one local maximum between 110 GeV and 140 GeV (7% have no local maxima and 12% have multiple local maxima), and these maxima occurred at 127.1 GeV ± 3.7 GeV.
Here we have shown that DEFT can effectively address density estimation needs on small datasets in one dimension. DEFT provides point estimates comparable to KDE and DPMM, but does not suffer from the multiple drawbacks of these other methods. In particular, the only key parameter that the user must specify is a small positive integer α that defines the qualitative meaning of smoothness and which governs how DEFT relates to maximum entropy estimation (see [9]). In our experience, however, using α = 3 seems to work well nearly all of the time. Other parameters, such as the number of grid points G, reflect computational practicalities. These parameters can be chosen automatically and have little effect on the results as long as reasonable values are used.
DEFT thus addresses a major outstanding need, not just in statistical learning theory but also in day-to-day data analysis. To this end we have developed an open source Python package called SUFTware. SUFTware allows users to apply DEFT in one dimension to their own data, and in the future will include additional fieldtheory-based statistical methods. This implementation is sufficiently fast for routine use: the computations for Fig. 1  A derivation for Eq. 2 has already been reported in Ref. [9]. The derivation presented here, however, is more straight-forward. The action in Eq. 2 is given by where S 0 [φ] is the "prior action", corresponding to a Bayesian prior p(Q ) ∝ exp(−S 0 [φ]), while S data [φ], the "likelihood action", is related to likelihood via p(data Q) ∝ exp(−S data [φ]). DEFT uses a prior action of the form The parameter α reflects a fundamental choice in how one defines "smoothness", and is a lengthscale below which fluctuations in φ are strongly damped. The derivation of S data [φ] is as follows. Suppose we are given N data points drawn from a probability distribution Q true (x) that is confined to the interval [x min , x max ]. Label these data in order of increasing value as x 1 , x 2 , . . . , x N . Next, imagine these data as being produced by a stochastic process in time, with x being the time variable and r(x) being the instantaneous emission rate. The likelihood of the data is then given by where ∫ dx indicates integration over the entire x-domain and R( is the raw data density referred to in the main text. Next, we parametrize the emission rate r(x) using the field φ(x) via The probability density corresponding to this rate is and so our definition of φ here is consistent with the definition of φ in the main text. We therefore see that the likelihood density in Eq. S3 is given by p(data φ) ∝ exp(−S data [φ]) where the corresponding action (after dropping the constant term N log(L N )) is, Plugging Eq. S2 and Eq. S6 into Eq. S1 gives Eq. 2 of the main text. Note the origin of the two terms in the integrand in Eq. S6: the term linear in φ comes from the exact locations of the N data points, whereas the nonlinear term (which leads to such interesting behavior) comes from regions of the x domain in which no data is observed. We briefly discuss a subtle issue with the above derivation. The probability distribution Q(x) is invariant under additive shifts in the underlying field, i.e., φ(x) → φ(x) + c for any constant c. By contrast, the likelihood action S data [φ] is not invariant under such transformations. This difference is due Eq. S4 which, by specifying how the emission rate r(x) relates to φ(x), introduces an additional assumption about how φ should be constrained by data. But although this additional assumption alters p(φ data), it does not alter p(Q data). The more involved derivation of S [φ] provided in Ref. [9] demonstrates this fact explicitly.

SI.2. THE MAP FIELD φ
To solve for φ , the maximum a posteriori (MAP) field at lengthscale , we set δS δφ = 0. The resulting equation of motion is The operator ∆ α that appears here is the "bilateral Laplacian", which is described in Ref. [9]. Briefly, ∆ α is defined by the requirement that for any two fields ϕ and φ. This bilateral Laplacian is identical to the standard α-order Laplacian (−1) α ∂ 2α in the interior of the x-interval, but differs at the boundaries. Specifically, the standard α-order Laplacian requires the additional specification of α boundary conditions in order to be self-adjoint. By contrast, the bilateral Laplacian is self-adjoint without the specification of any boundary conditions. The equation of motion, Eq. S7, thus has a unique solution without the need to assume any boundary conditions on φ. See Ref. [9] for more information. By integrating Eq. S7 we find that ∫ dx e −φ (x) = L, due to ∫ dx R(x) = 1 and ∫ dx∆ α φ = ∫ dx(∂ α 1)(∂ α φ ) = 0. The MAP density Q thus has a simple form: Similarly, multiplying Eq. S7 on the left by x k for k = 1, . . . , α − 1 and integrating reveals that i.e., the first α − 1 moments of Q exactly match those of the data.
As described in Ref. [9], DEFT computes the map field φ for a set of lengthscales 0 , 1 , 2 , . . . , K , ranging from 0 = 0 to K = ∞. These lengthscales are chosen so that neighboring MAP densities, Q k and Q k+1 , are approximately equally spaced along this "MAP curve", as quantified by the geodesic distance D geo (Q k , Q k+1 ). We note that Q 0 is in fact the data histogram R, while Q ∞ is in fact the maximum entropy distribution consistent with the moment constraints in Eq. S10. See Ref. [9] for details.

SI.3. THE EVIDENCE p(data )
The DEFT algorithm computes the MAP field at lengthscales spanning = 0 to = ∞. The optimal lengthscale * is then computed by maximizing the Bayesian evidence p(data ). The key quantity needed for this procedure is the "evidence ratio," which is given by It can be shown that respectively denote the posterior partition function and the prior partition function. The prior partition function Z 0 can be computed analytically, although it has a divergence that must be regularized. By contrast, the posterior partition function Z can only be analytically computed in the Laplace approximation. We therefore instead use the quantity . The resulting evidence ratio in this approximation is found to be where η = N (L ) 2α , and "ker" and "row" respectively denote the kernel and row space of the bilateral Laplacian ∆ α . See Ref. [9] for details. It should be emphasized that, although the Laplace approximation can be grossly innacurate when sampling Q ∼ p(Q data), it does not strongly effect the evidence ratio E( ). This is because log E( ) typically varies over many orders of magnitude, whereas log(Z Z Lap ) varies with far less dramatically. This is demonstrated in Fig. S2 below. Nevertheless, the SUFTware implementation of DEFT includes an option to correct for this approximation using importance sampling, as described in the main text. In the data poor regime, T eff 2 ≫ 1 (right panel), resulting in highly asymmetric fluctuations; in particular, the magnitude of δφ − due to f is substantially less than would result from f Lap . Note also that the very large positive fluctuations δφ + in the data poor regime have little noticeable effect on Q , since they just push Q closer to zero.

SI.4. SAMPLING THE POSTERIOR p(φ, data)
The posterior probability p(φ, data) can be decomposed as This forms the basis for our posterior sampling procedure. First, we sample plausible s from p( data). Note that p( data) ∝ p(data ) p( ) by Bayes's Theorem. Assuming p( ) is uniform over the length of the MAP curve as quantified by geodesic distance (see Ref. [9]), p( data) becomes proportional to the evidence ratio E( ). We thus sample values of from the set { 0 , 1 , . . . , K } used to trace the MAP curve, each k being selected with probability proportional to E( k ). For each of these values, we then sample plausible φs from p(φ , data). Here we employ importance sampling. Specifically, we can rewrite the distribution p(φ , data) as follows where we have made use of Eq. S36 (derived below). Therefore, we first sample φs from the Laplace-approximated distribution p Lap (φ data, ), then correct for the non-Gaussian nature of the original distribution by resampling these φs using the importance weights w [φ].

SI.5. ORIGIN OF WISPS
To derive Eqs. 4 and 5, it suffices to note that because the EOM in Eq. S7 causes all first-order terms in δφ to cancel. Next, we express V (δφ) = N LQ f (δφ) where is just e −δφ with the 0th and 1st order terms subtracted out. This function is plotted in Fig. S1. The key to deriving the magnitude of fluctuations δφ in different regimes is the relationship ⟨V (δφ)⟩ ∼ n eff 2 , which we rephrase here as where is an effective temperature.
In the data rich regime, T eff 2 ≪ 1. Therefore, f (δφ) ≪ 1 for typical fluctuations δφ. As illustrated in Fig. S1 (left panel), the Laplace approximation works well in this regime. Setting and solving for δφ gives Eq. 6.
In the data poor regime, T eff 2 ≫ 1. As illustrated in Fig. S1 (right panel), f is highly asymmetric in this regime and so the positive and negative fluctuations, δφ + and δφ − , need to be treated separately. Specifically, Solving the latter condition for δφ − gives Eq. 7. Note in Fig. S1 (right panel) how the the Laplace approximation greatly overestimates the magnitude of negative fluctuations δφ − in the data poor regime.
Here we show how Feynman diagrams can be used to compute log(Z Z Lap ), thereby obtaining corrections to the Laplace approximation. Our exposition closely follows that sketched by Zinn-Justin [13]. However, because Feynman diagrams are rarely used in the context of statistical inference, we felt it worthwhile to make these calculations explicit.
Upon discretization of the x-interval using G grid points, the action in Eq. 2 becomes where i, j = 1, 2, . . . , G. In what follows we represent fluctuations in φ about from the MAP field φ using the rescaled fluctuation x = √ N (φ − φ ). The action can then be expanded in the following way: where the Laplace action is and The quantity log(Z Z Lap ) is conveniently given by the sum of connected vacuum diagrams. At O(N −1 ), the relevant diagrams contain only 3rd-order and 4th-order vertices. From the expansion in Eq. S24 we see that the values corresponding to these vertices are given by −B ijk √ N and −C ijkl N , respectively. We also need the propagator matrix P , which is given by the inverse of the Hessian A, i.e., P ij = (A −1 ) ij . We thus obtain log Z where the contribution from each diagram is Alternatively, the correction log(Z Z Lap ) can be computed using importance sampling involving the weights w in Eq. 3. To see how, we express the partition function Z as an average over the Laplace ensemble: where ⟨⋅⟩ Lap denotes the mean taken with respect to the Laplace posterior p Lap (φ data, ), and w denotes the importance sampling weights in Eq. 3. The quantity log(Z Z Lap ) can thus be computed using Eq. 9.

SI.8. FEYNMAN DIAGRAMS VS. IMPORTANCE SAMPLING
Perhaps disappointingly, Feynman diagrams generally do not work well in situations where wisps appear. This is because the posterior action in such cases is strongly coupled. To see this, consider an expansion of the potential V in Eq. 5 to m'th order in δφ: To produce accurate results, the potential V m must include enough terms to sufficiently approximate V when evaluated at δφ = −δφ − poor = −φ * + log(N n eff ). This would require m min = φ * − log(N n eff ) terms at the very least, since not until here do the (all positive) terms in this power series begin to decrease. Thus, the number of terms that would be needed cannot be fixed a priori, but rather must increase with φ * . This presents a major problem for Feynmandiagram-based expansions. Any diagram influenced by the the m min 'th term in Eq. S37 must contain an m min 'th order vertex. But m min can be quite large: for φ * in Fig. 1, finds m min > 100 near the boundaries of the x-interval. Evaluating Feynman diagrams up to such high order is not feasible.
This expectation is confirmed in Fig. S2, which compares the two ways of computing log(Z Z Lap ) for two different choices of Q true . The Feynman diagram approximation works well when Q true fills the entire x-interval, indicating that the action S [φ] is nearly quadratic and the corrections to Laplace approximation are small. However, when Q true vanishes in large regions of the x domain, the Feynman diagram approximation is a very bad. In this case, the action S [φ] is strongly coupled and a fundamentally non-perturbative approach is required to compute the corrections.
Although the non-quadratic nature of the posterior action can lead to a partition function Z differing from its Laplace-approximated value Z Lap by a large amount, we find that Laplace approximation generally works well nevertheless for identifying the optimal lengthscale. This is because log Z Lap typically varies by multiple orders of magnitude across different values of , thereby swamping potential inaccuracies in the Z ≈ Z Lap assumption. ) computed for 100 different datasets, generated as above, using either Feynman diagrams (Eq. 8) or importance weights (Eq. 9). These two quantities agree well in (c) but poorly in (d). Squared Pearson correlations, ρ 2 , are shown in the titles of (c,d).

SI.9. OTHER DENSITY ESTIMATION METHODS
Here we describe the Kernel density estimation (KDE) and Dirichlet process mixture modeling (DPMM) algorithms used for the computations shown in Fig. 2 and Fig. S3.

A. Kernel density estimation
KDE is arguably the most common approach to density estimation in one dimension. Given data {x i } N i=1 , the KDE density estimate is given by where K(z) is the kernel function and w is the "bandwidth". We used a Gaussian kernel, and chose the bandwidth w using cross-validation. Specifically, we considered 100 candidate bandwidths geometrically distributed between w min (the minimum spacing between data points) and w max (10 times the span of the data). We then chose the bandwidth w that maximized the jackknifed log likelihood where the subscript on Q * −i indicates the density Q * computed as Eq. S38 but using a dataset missing the datum x i . KDE does not provide an explicit posterior on Q. Therefore, to compute p-values for Fig. 2 and Fig. S3, we approximated posterior samples Q ∼ p(Q data) by applying KDE to bootstrap-resampled datasets.
B. Dirichlet process mixture modeling DPMM is arguably the most popular nonparametric Bayesian method for estimating probability densities. DPMMs have a hierarchical structure, in the sense that each data point is assumed to be drawn from one of a number of "clusters," with each cluster having a probability density defined by a kernel of pre-specified functional form.
In the computations for Fig. 2 and Fig. S3, we adopted the finite DPMM described in Refs. [6,7]. Densities were assumed to be of the form where H is the number of clusters, w h is the probability of cluster h, and m h is the set of parameters defining the density of cluster h. K m (z) was assumed to be a Gaussian density specified by m = (µ, σ 2 ), i.e., a mean and a variance. A normal-inverse-gamma distribution was used as the prior on m: p(µ, σ 2 ) = N (µ μ,κσ 2 ) Γ −1 (σ 2 α,β), (S42) whereκ = 1,α = 1,β =σ 2 ,σ  . Densities were estimated for 9 different global health indicators reported by the WHO in [12]. Each datum corresponds to a different country; N varies between panels because of missing data in [12]. Orange shows a histogram of each global health indicator computed using G = 100 grid points. The best DEFT estimate Q * is shown in dark blue, while 100 posterior-sampled densities Q ∼ p(Q data) are shown in light blue. As in Fig. 3, default DEFT parameters were used for all 9 of these datasets.