An importance sampling method for Feldman-Cousins confidence intervals

In various high-energy physics contexts, such as neutrino-oscillation experiments, several assumptions underlying the typical asymptotic confidence interval construction are violated, such that one has to resort to computationally expensive methods like the Feldman-Cousins method for obtaining confidence intervals with proper statistical coverage. By construction, the computation of intervals at high confidence levels requires fitting millions or billions of pseudo-experiments, while wasting most of the computational cost on overly precise intervals at low confidence levels. In this work, a simple importance sampling method is introduced which reuses pseudo-experiments produced for all tested parameter values in a single mixture distribution. This results in a significant error reduction on the estimated critical values, especially at high confidence levels, and simultaneously yields a correct interpolation of these critical values between the parameter values at which the pseudo-experiments were produced. The theoretically calculated performance is demonstrated numerically using a simple example from the analysis of neutrino oscillations. The relationship to similar techniques applied in statistical mechanics and $p$-value computations is discussed.


I. INTRODUCTION
An essential part of any experiment is the statistical analysis to extract information about the model parameters, such as physics constants, from the measurement outcome. As measurements inherently include statistical fluctuations, one often reports these constraints in the form of confidence intervals (or confidence regions in higher dimensions). These are intervals over the parameter space calculated from the observed data, which are constructed in such a way that for any true value of the parameters, at least a pre-defined percentage of the possible experimental outcomes would produce an interval that covers the true parameter value. The pre-defined percentage over possible experimental outcomes is called the confidence level (CL).
For the rest of this paper we shall use the following notation: x denotes the experimental outcome, which can be a vector of many observations within the single experiment. θ denotes the model parameters, which can contain one- or higher-dimensional continuous degrees of freedom, and may contain discrete degrees of freedom as well. p(x | θ) denotes the probability distribution function for the experimental outcomes given some model parameters. p(x | θ) seen as a function of θ for a given experimental outcome is called the likelihood function and denoted L(θ | x) := p(x | θ). The parameter value for which the likelihood is maximized is denoted θ̂(x) := arg max_θ L(θ | x), and the difference of the log-likelihood at some parameter value to the maximum likelihood is denoted as ∆χ²(θ | x) := −2 log[L(θ | x)/L(θ̂(x) | x)]. In many cases a useful theorem by Wilks [1] can be applied, which greatly simplifies the construction of such confidence intervals. The theorem says that in the asymptotic limit, ∆χ²(θ | x) evaluated at the true parameter value is distributed as a chi-squared distribution with k degrees of freedom, where k is the dimension of the parameter space θ, which has to be continuous. The theorem holds under suitable conditions which ensure that a maximum likelihood value can be found in the neighborhood of the true parameter value with a quadratic Taylor expansion of the likelihood. Given this asymptotic distribution, one can thus construct a confidence interval from all values of θ that satisfy ∆χ²(θ | x) ≤ ∆χ²_c, where the critical value ∆χ²_c is easily computed from the quantile function of the chi-squared distribution.
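As a concrete illustration, the Wilks-based critical value for a one-parameter model can be sketched with the Python standard library alone, exploiting the fact that for k = 1 the chi-squared (1 − α) quantile equals the squared standard-normal (1 − α/2) quantile; for general k one would use a dedicated statistics library.

```python
from statistics import NormalDist

def wilks_critical_value_1dof(cl: float) -> float:
    """Critical value dchi2_c under Wilks' theorem for k = 1 degree of freedom.

    The chi-squared(1) quantile at probability (1 - alpha) equals the square
    of the standard-normal quantile at (1 - alpha/2).
    """
    alpha = 1.0 - cl
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return z * z

# The confidence interval is then all theta with dchi2(theta | x) <= dchi2_c.
for cl in (0.6827, 0.90, 0.9973):
    print(f"CL = {cl:.4f}: dchi2_c = {wilks_critical_value_1dof(cl):.3f}")
```

For CL = 90% this reproduces the familiar ∆χ²_c ≈ 2.71 for one degree of freedom.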
Due to the necessary assumptions, confidence intervals based on Wilks' theorem are not suitable if the number of observations is small, or the parameter space is unsuitable because of physical boundaries (such as θ ≥ 0), discrete degrees of freedom, or periodicities that cannot be captured by the quadratic expansion. Neutrino oscillation experiments, for example, suffer from all of these deficiencies, for which we will present an example later. In this situation, one has to resort to actually producing ensembles of pseudo-experiments for selected parameter values to study the distribution of a suitable statistic to be used for the construction of the confidence interval.
A commonly used method is the Feldman-Cousins (FC) method [2], where for each pseudo-experiment x′ generated assuming a true value θ_t, the ∆χ²(θ_t | x′) value at the true parameter value is computed to obtain its distribution.

II. CRITICAL VALUES IN THE CONVENTIONAL FELDMAN-COUSINS METHOD
To prepare the notation, we briefly review the computation of critical values in the conventional Feldman-Cousins method. First, we make a choice of S points in the parameter space, which we denote θ_s with s going from 1 to S. At each θ_s, we now generate an ensemble of n_exp pseudo-experiments {x}_s by sampling from p(x | θ_s). While all pseudo-experiments are assumed to live in the same space, the s suffix on the curly brackets representing the ensemble keeps track of the distribution that generated the experiments. For each pseudo-experiment x ∈ {x}_s, we now compute ∆χ²(θ_s | x) and find the 1 − α quantile ∆χ²_c,s through any suitable estimator. For example, one may simply sort the ∆χ²(θ_s | x) values and take the ⌊α n_exp⌋-th largest value as ∆χ²_c,s, in which case we have

∆χ²_c,s = min{ y : Σ_{x ∈ {x}_s} I(∆χ²(θ_s | x) > y) ≤ ⌊α n_exp⌋ }.

Here, I(•) is the indicator function returning 1 if the logical statement in the parentheses is true, and 0 otherwise. ⌊•⌋ denotes the floor function.
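The sorting-based quantile estimator can be sketched as follows for a hypothetical toy model (a single Gaussian observation, an assumption for illustration, not from the paper), for which Wilks' theorem holds exactly and the true 90% critical value is ∆χ²_c ≈ 2.71:

```python
import random

def delta_chi2(theta: float, x: float) -> float:
    # Toy model (assumption): x ~ Normal(theta, 1), so the MLE is
    # theta_hat = x and dchi2(theta | x) = (x - theta)**2.
    return (x - theta) ** 2

def fc_critical_value(theta_s: float, n_exp: int, alpha: float,
                      rng: random.Random) -> float:
    # Generate the ensemble {x}_s at theta_s, compute dchi2 for each
    # pseudo-experiment, sort, and take the floor(alpha * n_exp)-th
    # largest value as the estimated critical value.
    dchi2s = sorted(
        (delta_chi2(theta_s, rng.gauss(theta_s, 1.0)) for _ in range(n_exp)),
        reverse=True,
    )
    return dchi2s[int(alpha * n_exp) - 1]

dchi2_c = fc_critical_value(0.0, 10_000, 0.10, random.Random(0))
print(f"estimated 90% critical value: {dchi2_c:.3f}")  # ~2.71 under Wilks
```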
Finally, the critical value function ∆χ²_c(θ) is obtained by some interpolation scheme. For example, one may set ∆χ²_c(θ_s) := ∆χ²_c,s and linearly interpolate for any θ values in between. To reduce the interpolation error, one typically has to adjust the choice of sampling parameter values {θ}_S either manually or automatically [3] in an iterative scheme.
The asymptotic variance on the critical values is proportional to the binomial error α(1 − α)/n_exp, so a high CL (α ≪ 1) generally means that one needs n_exp ≫ 1/α for reliable critical values. Since the whole process is repeated for all S points in the parameter space, the total number of generated (and fitted) pseudo-experiments is S × n_exp ≫ S/α.

III. THE MIXTURE FELDMAN-COUSINS METHOD

A. Definition
Our new method, which we shall refer to as the "mixture Feldman-Cousins" method, differs from the conventional method mainly in the reuse of all generated pseudo-experiments for the critical-value computation of each target parameter space point θ_t with an additional weight

w(x | θ_t) := p(x | θ_t) / [ (1/S) Σ_{s=1}^S p(x | θ_s) ].

The value in the denominator is the sampling probability distribution of x ∈ {x}_mix := ∪_{s=1}^S {x}_s, which is the mixture distribution of p(x | θ_s) over all θ_s values. Since the weights are based on the sampling probabilities, which are nothing but the likelihood function, they are computable using the same procedure that calculates the ∆χ²(θ | x) for each pseudo-experiment:

w(x | θ_t) = S / Σ_{s=1}^S exp( −[∆χ²(θ_s | x) − ∆χ²(θ_t | x)]/2 ).

Due to taking the difference of two ∆χ² values, the contribution from the minimum χ² at θ̂(x), as well as any θ-independent offsets (e.g. the n! factor in the Poisson likelihood), cancel and hence do not need to be known accurately. While in the conventional method one only needs to compute ∆χ²(θ_s | x) for the θ_s value at which the pseudo-experiment was generated, here we need it for all θ_s′ (including s′ ≠ s) and θ_t. Now we can define the critical value ∆χ²_c,t as the w-weighted 1 − α quantile of ∆χ²(θ_t | x) for x ∼ {x}_mix, for example

∆χ²_c,t = min{ y : Σ_{x ∈ {x}_mix} w(x | θ_t) I(∆χ²(θ_t | x) > y) ≤ α S n_exp },

where the min is meant to represent that we take the smallest ∆χ²_c,t that satisfies the inequality.
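A minimal sketch of the weight and weighted-quantile computation, for an assumed toy model with a single observation x ∼ Normal(θ, 1) so that ∆χ²(θ | x) = (x − θ)² exactly; the grid, ensemble size, and target value are illustrative choices only:

```python
import math
import random

rng = random.Random(1)

def dchi2(theta, x):
    # Toy model (assumption): x ~ Normal(theta, 1).
    return (x - theta) ** 2

theta_grid = [-2.0, -1.0, 0.0, 1.0, 2.0]   # the S sampling points {theta}_S
S, n_exp, alpha = len(theta_grid), 20_000, 0.10
mix = [rng.gauss(t, 1.0) for t in theta_grid for _ in range(n_exp)]  # {x}_mix

def mixture_weight(x, theta_t):
    # w(x | theta_t) = p(x | theta_t) / [(1/S) sum_s p(x | theta_s)],
    # computed from dchi2 differences only.
    yt = dchi2(theta_t, x)
    return S / sum(math.exp(-(dchi2(ts, x) - yt) / 2.0) for ts in theta_grid)

def mixture_critical_value(theta_t):
    pairs = sorted(((dchi2(theta_t, x), mixture_weight(x, theta_t)) for x in mix),
                   reverse=True)
    # Smallest y such that the summed weight of pseudo-experiments with
    # dchi2 > y drops to alpha * (S * n_exp); one could also normalize by
    # the realized sum of weights instead of its expectation S * n_exp.
    tail, total = 0.0, S * n_exp
    for y, w in pairs:
        tail += w
        if tail > alpha * total:
            return y
    return 0.0

# theta_t = 0.5 lies between grid points: the mixture-FC estimate interpolates.
q_c = mixture_critical_value(0.5)
print(f"interpolated critical value at theta_t = 0.5: {q_c:.3f}")  # ~2.71 under Wilks
```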

B. Derivation
In order to obtain more pseudo-experiments at large ∆χ² values, which would yield more precise high-CL critical values, we use an importance sampling approach: instead of directly sampling from the target distribution p(x | θ_t), we sample from a different distribution and weight the sampled toys by the ratio of probability distributions to calculate the relevant quantities under the target distribution (the critical values). The question therefore becomes: what is the ideal sampling distribution to generate the desired pseudo-experiments?
Note that it is important to find a sampling distribution that is as close as possible to the target distribution, apart from generating high-∆χ² pseudo-experiments with higher probability. In particular, if each experiment x consists of m measurements, the experiments are points in an m-dimensional space and there are m dimensions in which we can stretch or shrink the sampling distribution. Instead of thinking about estimating quantiles, let us think of estimating the probability density p(Y(x) | θ_t) using histograms for Y(x) := ∆χ²(θ_t | x). When using reweighting, in addition to the binomial variance n_exp π_b(1 − π_b) for the number of pseudo-experiments falling into a bin b, there will be an additional contribution due to the variance of weights: given the estimator

π̂_b = (1/n_exp) Σ_x w(x) I(Y(x) ∈ bin b)

and using I(•)² = I(•), we get

Var[π̂_b] = (1/n_exp) [ E[w | b]² π_b(1 − π_b) + π_b Var[w | b] ],

where π_b is the probability for a sampled pseudo-experiment to fall into bin b, E[w | b] and Var[w | b] are the mean and variance of the weights within the bin, the first term is the usual binomial error due to the number of pseudo-experiments falling into the bin, and the second is the additional term due to the variance of weights among pseudo-experiments falling into the bin.
We therefore want to increase the number (n_exp π_b) of pseudo-experiments falling into a high-Y(x) bin to reduce the binomial error, while at the same time keeping the weight variance within the bin as small as possible. This means the ideal case of zero variance would be for the weights to depend on x through Y(x) alone. Or equivalently, since the weights are the ratio of the target and sampling distributions, we want to use a sampling distribution that differs from the target distribution only by a factor that is a function of ∆χ²(θ_t | x).
The key idea is to think about the meaning of a high ∆χ²(θ_t | x) value. The likelihood L(θ | x) is the probability to sample the given pseudo-experiment x from θ. A high ∆χ²(θ_t | x) = −2 log[L(θ_t | x)/L(θ̂(x) | x)] means there exists a value θ̂(x) at which it is more likely to sample the given pseudo-experiment than at the "target" θ_t value. Thus by using pseudo-experiments generated at θ ≠ θ_t, we can more efficiently obtain ones with high ∆χ²(θ_t | x).
The naive choice of simply using pseudo-experiments generated at some θ′ (≠ θ_t) weighted by the ratio of sampling probabilities p(x | θ_t)/p(x | θ′) will however do worse than before. This is because θ̂(x) depends on the pseudo-experiment x, such that for some pseudo-experiments it may be preferable to sample x from p(x | θ_t) rather than from p(x | θ′), resulting in an exponentially large (often unbounded) variance of weights.
The solution is simple: by using a mixture distribution which includes θ_t, we can guarantee the weights to be bounded from above, because p_sample(x) = (1/S) Σ_s p(x | θ_s) ≥ p(x | θ_t)/S implies

w(x | θ_t) ≤ S.

C. Bounds on pseudo-experiment weights for a good grid

If we choose the grid {θ}_S dense and wide enough (a good grid), such that we may assume to have a good minimum θ̂_S(x) on {θ}_S in the sense of

∆χ²(θ̂_S(x) | x) ≤ ε + max(0, ∆χ²(θ_t | x) − ∆χ²_max)

for all x, we can put a much stricter bound on the weights than Eq. (10). Here, ε will be smaller for denser spacing of {θ}_S and ∆χ²_max will be larger for a wider range covered by {θ}_S. Since this additional condition can deal with the case of θ_t ∉ {θ}_S as well, let us define a symbol C which is 1 if θ_t ∈ {θ}_S and 0 otherwise. Note that to guarantee Eq. (11) under C = 0 one generally needs to have parameter values in {θ}_S that surround θ_t sufficiently well. For example, with a 1-dimensional continuous θ parameter, one needs min_s θ_s ≤ θ_t ≤ max_s θ_s.
First, we focus on the pseudo-experiments with ∆χ²(θ_t | x) ≤ ∆χ²_max, which are our primary interest, and note that

p(x | θ_s)/p(x | θ_t) = exp( [∆χ²(θ_t | x) − ∆χ²(θ_s | x)]/2 ).

The sum of probability ratios is now bounded from below by

(1/S) Σ_s p(x | θ_s)/p(x | θ_t) ≥ (1/S) [ C + exp( (∆χ²(θ_t | x) − ε)/2 ) ],

because {θ}_S includes both θ̂_S(x) and (if C = 1) θ_t. This means for any pseudo-experiment with ε < ∆χ²(θ_t | x) ≤ ∆χ²_max, it is more likely to be sampled in S n_exp samples from (1/S) Σ_{s=1}^S p(x | θ_s) than in n_exp samples from the target distribution p(x | θ_t). The sum of probability ratios is further bounded from above by

(1/S) Σ_s p(x | θ_s)/p(x | θ_t) ≤ (1/S) [ C + (S − C) exp( ∆χ²(θ_t | x)/2 ) ],

because ∆χ²(θ_s | x) ≥ 0. This means the weights are bounded by

w_min := S / [ C + (S − C) exp( ∆χ²(θ_t | x)/2 ) ] ≤ w(x | θ_t) ≤ S / [ C + exp( (∆χ²(θ_t | x) − ε)/2 ) ] =: w_max.

We see that the bounds depend on the pseudo-experiments through ∆χ²(θ_t | x) only, and also note that for sufficiently large ∆χ²(θ_t | x) the ratio of the upper bound w_max to the lower bound w_min converges to

w_max/w_min → (S − C) exp(ε/2),

which indicates a small relative variance of weights as long as the number of grid points S is not a very large number.
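The trivial bound of Eq. (10), and the refined good-grid bound, can be checked numerically. The sketch below uses the illustrative Gaussian toy model again; the values ε = 0.25 and ∆χ²_max = 4 are assumptions specific to this unit-spaced toy grid, not prescriptions from the text:

```python
import math
import random

rng = random.Random(2)
theta_grid = [-2.0, -1.0, 0.0, 1.0, 2.0]   # unit-spaced toy grid {theta}_S
S, n_exp = len(theta_grid), 5_000
theta_t = 0.0                               # on the grid, so C = 1

def dchi2(theta, x):
    # Toy model (assumption): x ~ Normal(theta, 1), so dchi2 = (x - theta)**2.
    return (x - theta) ** 2

mix = [rng.gauss(t, 1.0) for t in theta_grid for _ in range(n_exp)]

def weight(x):
    yt = dchi2(theta_t, x)
    return S / sum(math.exp(-(dchi2(ts, x) - yt) / 2.0) for ts in theta_grid)

ws = [weight(x) for x in mix]
assert max(ws) <= S + 1e-9   # trivial bound (Eq. 10): w <= S whenever C = 1

# Good-grid bound for this grid (assumed values): eps = 0.25, dchi2_max = 4.
eps, dchi2_max, C = 0.25, 4.0, 1
n_checked = n_violating = 0
for x, w in zip(mix, ws):
    y = dchi2(theta_t, x)
    if y <= dchi2_max:
        n_checked += 1
        if w > S / (C + math.exp((y - eps) / 2.0)) + 1e-9:
            n_violating += 1
print(f"checked {n_checked} pseudo-experiments, bound violations: {n_violating}")
```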
For pseudo-experiments with ∆χ²(θ_t | x) above the threshold ∆χ²_max, we have

∆χ²(θ̂_S(x) | x) ≤ ε + ∆χ²(θ_t | x) − ∆χ²_max

by Eq. (11), and hence an upper bound on the weights

w(x | θ_t) ≤ S / [ C + exp( (∆χ²_max − ε)/2 ) ] =: B.

D. Critical value estimator performance with a good grid

Since quantiles (the critical values) are just the inverse function of the cumulative distribution function (CDF), we can estimate the relative reduction of the quantile-estimation variance by the reduction of the CDF-estimation variance. The relationship for an observable y ∼ f(y) is given by Var[ŷ(P)] = f(y)⁻² Var[P̂(y)], where ŷ(P) is the quantile function estimator, P̂(y) the CDF estimator, and f(y) the probability distribution function.
Following Eq. (3), and using the shorthand notation Y(x) := ∆χ²(θ_t | x), the CDF estimator of the mixture-FC method reads

P̂(y | θ_t) = (1/(S n_exp)) Σ_{x ∈ {x}_mix} w(x | θ_t) I(Y(x) > y).

This is an unbiased estimator for the target CDF P(y | θ_t), where the variance from a single pseudo-experiment is

Var₁ = E_mix[ (w I(Y > y))² ] − E_mix[ w I(Y > y) ]²
     ≤ E_mix[ (w I(Y > y))² ]
     = E_mix[ w² I(Y > y) ]
     = E_t[ w I(Y > y) ]
     ≤ w_max(y) P(y | θ_t) + B P(y_max | θ_t),

where in going from the second to the third line we used I(•)² = I(•), in going to the fourth line the definition of the weights, and in going to the fifth line we used the upper bounds from Eq. (16) and Eq. (19) together with Y(x) ≥ y from the argument of the indicator function, writing y_max := ∆χ²_max. The variance of the CDF estimator is therefore

Var[P̂(y | θ_t)] = Var₁/(S n_exp) ≤ [ (w_max(y)/S) P(y | θ_t) + (B/S) P(y_max | θ_t) ] / n_exp,

where we note that the S factors in the first two terms were cancelled thanks to being able to reuse the pseudo-experiments generated at all S values for the CDF estimation of each θ_t value.
For reference, the variance on the CDF estimator in the conventional FC method (denoted in the following equations by "conv") is given by the binomial error

Var_conv[P̂(y | θ_t)] = P(y | θ_t)(1 − P(y | θ_t))/n_exp,

so the variance on the estimated critical values ŷ(P | θ_t) in the mixture-FC method is smaller by the factor

γ := Var[ŷ(P | θ_t)]/Var_conv[ŷ(P | θ_t)] ≤ (1/(1 − P)) [ w_max(y)/S + (B/S) P(y_max | θ_t)/P ],

where y is the true P-quantile satisfying P(y | θ_t) = P. The typical functional shape of the upper bound is shown in Fig. 1a. Let us first consider the case of P ≫ P(y_max | θ_t). For the small P ≤ 1/2 values one is typically interested in, the mixture-model method obtains more precise critical values than the conventional method (i.e. γ ≤ 1) for all y ≥ ε if θ_t ∈ {θ}_S (C = 1), or all y ≥ ε + 2 log 2 if θ_t ∉ {θ}_S (C = 0). As y increases, the relative variance first decreases linearly, and for y ≫ 2 + ε it starts to decrease exponentially as γ ∝ exp(−y/2). As y further increases toward ∆χ²_max, and P ≫ P(y_max | θ_t) fails to hold anymore, the (B/S) P(y_max | θ_t)/P term becomes dominant, which saturates to P(y_max | θ_t)/P = 1 for y ≥ ∆χ²_max. Hence the improvement flattens out to γ ≤ 1/(C + 1) for y ≥ ∆χ²_max, which is still at least as good as the conventional FC method. By choosing suitable parameter points {θ}_S and thus a suitable ∆χ²_max, critical values of the desired precision can be calculated. As the exponential reduction in variance cancels the typically exponential dependence of the CDF on the test statistic (exp(−y/2) in the case of a chi-squared distribution), the relative error on the estimated CDF becomes approximately flat over a wide range of test-statistic values (Fig. 1b), which is much more efficient than for the conventional FC method, where low-CL critical values become over-precise with more pseudo-experiments, while high-CL critical values still suffer from large errors.

E. Interpolation
While for the conventional FC method one can only compute the critical values at the parameter values θ_s where the pseudo-experiments were generated, in the mixture-FC method it is sufficient to guarantee that the target parameter value θ_t is sufficiently close to and surrounded by the sampling points {θ}_S such that condition Eq. (11) holds. Considering that for a typical setup the toys to be generated are the same as those used in the conventional FC method, this means that the mixture-FC method not only reduces the uncertainty on the critical values at the sampling points {θ}_S, but also allows interpolating the critical values between these points with similar performance.

F. Diagnostics and error estimation
As the mixture-FC method exploits the relationship of the ∆χ² statistic to the probability of sampling pseudo-experiments, it is essential that the calculation of ∆χ² matches the process used to generate the pseudo-experiments. It is for example not allowed to sample from a Poisson random number generator while using an approximation like Pearson's χ² for ∆χ². A simple diagnostic is to calculate the average weight across all pseudo-experiments in {x}_mix and check that this is equal to 1 up to statistical fluctuations. Since the same pseudo-experiments are used for all target parameter values θ_t (of which the weights are a function), the statistical fluctuations of these average weights will be correlated for different θ_t values.
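The average-weight diagnostic can be sketched as follows, again on an illustrative Gaussian toy model (an assumption, not the paper's analysis); the average should come out consistent with 1 for every target value θ_t:

```python
import math
import random

rng = random.Random(3)
theta_grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
S, n_exp = len(theta_grid), 5_000

def dchi2(theta, x):
    # Toy model (assumption): x ~ Normal(theta, 1).
    return (x - theta) ** 2

mix = [rng.gauss(t, 1.0) for t in theta_grid for _ in range(n_exp)]

means = {}
for theta_t in (0.0, 0.5, -1.3):
    ws = [S / sum(math.exp(-(dchi2(ts, x) - dchi2(theta_t, x)) / 2.0)
                  for ts in theta_grid)
          for x in mix]
    means[theta_t] = sum(ws) / len(ws)
    print(f"theta_t = {theta_t:+.1f}: average weight = {means[theta_t]:.3f}")
```

If the ∆χ² calculation and the generator disagree, these averages drift away from 1, which is exactly the mismatch this diagnostic is meant to catch.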
[Figure 1 caption fragment: ∆χ²_max = 35 and the true Y(x)-CDF is assumed to be chi-squared with 1 degree of freedom. In (a), the exponential growth factor for the green line depends on the assumed CDF, unlike the red line whose decay factor is given by Eq. (36).]

To estimate the error of the computed critical values, we recommend using resampling methods such as the nonparametric bootstrap [4] or jackknife [5] instead of simple methods like binomial errors, in order to capture not only the statistical fluctuations in the number of pseudo-experiments that fall into a range of ∆χ² values, but also the statistical fluctuations in their weights.
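A sketch of a nonparametric-bootstrap error estimate on a mixture-FC critical value, for the same illustrative Gaussian toy model (an assumption for demonstration): resampling whole pseudo-experiments together with their weights captures both the count and the weight fluctuations.

```python
import math
import random

rng = random.Random(4)
theta_grid = [-2.0, -1.0, 0.0, 1.0, 2.0]
S, n_exp, alpha = len(theta_grid), 4_000, 0.10
theta_t = 0.5

def dchi2(theta, x):
    # Toy model (assumption): x ~ Normal(theta, 1).
    return (x - theta) ** 2

mix = [rng.gauss(t, 1.0) for t in theta_grid for _ in range(n_exp)]
# Precompute (dchi2, weight) pairs once; the bootstrap only resamples them.
pairs = []
for x in mix:
    yt = dchi2(theta_t, x)
    w = S / sum(math.exp(-(dchi2(ts, x) - yt) / 2.0) for ts in theta_grid)
    pairs.append((yt, w))

def critical_value(sample):
    # Weighted (1 - alpha) quantile; normalize by the sample size since
    # the expected average weight is 1.
    total, tail = len(sample), 0.0
    for y, w in sorted(sample, reverse=True):
        tail += w
        if tail > alpha * total:
            return y
    return 0.0

est = critical_value(pairs)
boot = [critical_value([pairs[rng.randrange(len(pairs))]
                        for _ in range(len(pairs))])
        for _ in range(50)]
mean_b = sum(boot) / len(boot)
err = math.sqrt(sum((b - mean_b) ** 2 for b in boot) / (len(boot) - 1))
print(f"dchi2_c = {est:.3f} +/- {err:.3f}")
```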

IV. EXAMPLE WITH A SINGLE CYCLIC PARAMETER
We consider a simple example that uses a binned-Poisson model, inspired by the search for CP violation in a long-baseline neutrino oscillation experiment, here in particular the T2K experiment [6]. The model has a single angular parameter δ_CP called the "CP-violation phase", and the observations are event counts n_b for each bin with index b = 1, 2, …, 10 (Fig. 2). The main feature of this model is that one is mostly sensitive to sin δ_CP through the overall normalization of approximately 100 total observations (Σ_b n_b), and weakly sensitive to the cos δ_CP component through the "shape" of the observations as a function of b (meant to represent bins of increasing neutrino energy). Deviations from Wilks' theorem are caused by sin δ_CP having physical boundaries at ±1 (resulting in reduced critical values around sin δ_CP = ±1), the sign of cos δ_CP acting as an effectively discrete degree of freedom (resulting in increased critical values at some sin δ_CP values), as well as the Poisson nature of the observations. In an actual experiment, one would have further continuous and discrete physics parameters degenerate with δ_CP as well as various systematic uncertainties treated as nuisance parameters. For simplicity and clarity, however, we focus on δ_CP alone, which for continuity with the earlier sections will be referred to as θ = (δ_CP), and the observations as x = (n_1, …, n_10). We generate n_exp = 10,000 pseudo-experiments at each of S = 16 values of θ evenly distributed in the parameter range [−π, π]. We first focus on the target value of θ_t = −π/2.

Fig. 3a shows the distribution of ∆χ²(θ_t | x) obtained for pseudo-experiments x sampled from different θ_s values. In the conventional FC method, only those generated at θ_s = θ_t are used, which correspond to the black histogram, which falls off quickly for large ∆χ²(θ_t | x). In the mixture-FC method we further make use of the pseudo-experiments generated at all other θ_s values, of which θ_s = 0 and the other extreme of θ_s = π/2 are shown by the red and green histograms respectively. Clearly, the pseudo-experiments sampled from the shifted θ_s values have a significantly higher fraction of large ∆χ²(θ_t | x) values. At the same time, one can see one of the problems arising from using only the pseudo-experiments generated at θ_s = π/2: one would need to apply very large weights in the small-∆χ²(θ_t | x) region, where θ_s = π/2 has a very small sampling probability. The mixture of pseudo-experiments generated at all 16 θ values, however, shown by the blue histogram, is able to generate more pseudo-experiments for all ∆χ²(θ_t | x) values, with the difference in slope compared to the black target histogram showing the exponential increase in pseudo-experiments for larger ∆χ²(θ_t | x) values. This is even clearer to see in Fig. 3b, where the mixture distribution was reweighted using the assigned weights. Good agreement with the target distribution as simulated by the conventional FC method is seen, and the total number of unweighted pseudo-experiments in the mixture-FC method exceeds the theoretical lower bound.
We now check some of the diagnostics for the mixture-FC method. The distribution of importance-sampling weights w(x | θ_t) is shown in Fig. 4a; the weights are found to be mostly a function of ∆χ²(θ_t | x) with small additional variance. The weights are found to be well contained by the theoretical bounds from Eq. (16), which were drawn assuming a value ε = 0.3 obtained by looking at the ∆χ²(θ̂_S(x) | x) distributions in Fig. 4b. The average weight is found to be consistent with 1 (Fig. 5).
Next, we look at the critical values. Fig. 6 shows the critical values as a function of the (true/target) parameter value θ, using both the standard FC method (black error bars) and the mixture-FC method. Despite using the same set of pseudo-experiments, the critical values obtained with the mixture-FC method have significantly smaller uncertainties, especially at higher CL, and also provide access to details of the functional shape between the 16 sampling values of θ.
For the 1σ critical values (Fig. 6b) we see that, despite the relatively fine spacing of sampling values, the interpolation error, as indicated by the non-overlap of the red and gray error bands next to the θ = ±π/2 values, is larger than the size of the binomial error band in the conventional method. As these binomial error bands do not capture the interpolation error, their smallness can be misleading, which makes the interpolation feature of the mixture-FC method very useful.
[Figure 3 caption fragment: legend entries "Sampling θ = −π/2 (target)", "Sampling θ = 0", "Sampling θ = +π/2", "Mixture of 16 θ-values" and "Conventional FC", "Weighted mixture", "Unweighted mixture", "Theoretical lower limit"; the vertical axis shows the number of pseudo-experiments. The weighted-mixture histogram is also drawn with boxes representing the error from the number of pseudo-experiments in each bin and their weight variance, but these errors are smaller than the line width and not visible. The "theoretical lower limit" on the total number of pseudo-experiments in the mixture distribution is obtained by multiplying the lower bound in Eq. (14) to the red "weighted mixture" histogram assuming C = 1 and ε = 0.3.]

For the 2σ (Fig. 6c) and 3σ critical values (Fig. 6d) we see good consistency between the two methods, while also noting the significantly smaller errors in the mixture-FC calculation. For 3σ CL (Fig. 6d) we see the errors in the conventional method are already so large that some of the features of the critical values are not recognizable, such as the bumps at θ = 0, π and the asymmetry of critical values under a flip of the sign of sin δ_CP, caused by Poisson statistics.
For 4σ and higher CL the conventional FC method is unable to determine the critical values except for a lower limit.The mixture-FC method on the other hand still produces critical values with comparable relative error sizes to the lower CL critical values.
The estimated relative errors are plotted in Fig. 7 and are consistent with the typical shape from the theoretical arguments (Fig. 1b). To draw the upper bound from γ in Eq. (36), we conservatively assume ∆χ²_max = 32 based on Fig. 4b, i.e. we only assume ∆χ²(θ̂_S(x) | x) ≤ ε up to ∆χ²(θ_t | x) ≤ 32. In this example, the actual mixture-FC error estimated with the bootstrap is smaller than the theoretical upper limit from γ by about a factor of 2 for ∆χ²_t < 16. This can be interpreted as more than one sampling value θ_s contributing to the sampling of each pseudo-experiment, rather than only θ̂_S(x) contributing as assumed in the theoretical upper limit. For ∆χ²_t > 16 = ∆χ²_max/2 on the other hand, the theoretical upper limit starts to increase significantly, whereas the actual error estimated with the bootstrap only grows slowly. This can be interpreted as our choice of ∆χ²_max = 32 being overly conservative: with the present example, the chosen sampling grid {θ}_S appears to be effective up to significantly higher ∆χ² values. This is partly due to the convenient situation of having a parameter θ = (δ_CP) with a bounded parameter space.

[Figure 6/7 caption fragment: the panels compare "Conventional FC" and "Mixture model" critical values ∆χ²_c as a function of θ. For the standard FC the standard error from the binomial distribution is shown (black solid line), where the more precise CDF estimate from the mixture-FC method was used in computing these errors. For the mixture-FC method the bootstrap error estimate (red solid line) is well below the theoretical upper limit of Eq. (36) calculated assuming ε = 0.3 and ∆χ²_max = 32 (blue dashed line).]
V. DISCUSSION

A. Relation to techniques in statistical mechanics
The presented method is similar in spirit to the "Multiple Histogram Reweighting" ("multi-histogram") method [7] in statistical mechanics, where statistical ensembles are simulated for various parameter values and combined by reweighting to the desired parameter value. In the multi-histogram method, the ensembles are combined with an additional per-ensemble weight, which is adjusted to minimize the overall error on the variable to be estimated. A similar per-ensemble weighting could be applied in the presented mixture-FC method as well, where these additional weights would be allowed to depend on the target ∆χ²_t := ∆χ²(θ_t | x) value as well, in order to reduce the variance on the critical-value estimator as much as possible.
One difference to the multi-histogram method however, is that because we do not resort to Markov-Chain Monte-Carlo techniques to sample the pseudo-experiments, the sampling distribution of pseudo-experiments x at each parameter value θ is known exactly including the normalization constant.Hence the iterative procedure that is required at the end of the multi-histogram method to self-consistently determine these normalization constants (the free energies) is not necessary in the mixture-FC method.

B. Relation to the marginal distribution
The sampling distribution constructed as a mixture over several parameter values {θ}_S can be considered a marginal probability distribution with prior π(θ) = (1/S) Σ_{s=1}^S δ(θ − θ_s), where δ(•) is the Dirac delta function. Additional per-ensemble weights as discussed in the previous paragraph would correspond to an alternative prior π(θ | ∆χ²_t) = Σ_{s=1}^S r_s(∆χ²_t) δ(θ − θ_s), where r_s(∆χ²_t) can be optimized to reduce errors subject to the condition Σ_s r_s(∆χ²_t) = 1 for all ∆χ²_t. One can even generalize the discussion to continuous priors π(θ | ∆χ²_t), where in order to preserve the efficiency arguments, we would need to extend the single-point condition from Eq. (11) to a condition on a finite-size region of π(θ | ∆χ²_t). Unlike in the conventional FC method, where one needs a large number of pseudo-experiments at each target parameter value, it can be preferable in the mixture-FC method to generate fewer pseudo-experiments at each sampling value, but instead increase the number of considered sampling points S. If S n_exp is held fixed, this results in a reduction of the variance of critical values by reducing the variance in weights, bounded from above by exp(ε/2).
Given this relation to the marginal distribution, let us now consider the computation of a marginal test statistic

∆χ²_m(θ_t | x) := −2 log[ p(x | θ_t) / ∫ dθ π(θ) p(x | θ) ],

where in the denominator the profiling operation was replaced by a marginalization over θ with some prior over θ. We therefore have a simple likelihood-ratio test between p(x | θ_t) and p_m(x) := ∫ dθ π(θ) p(x | θ), and it now becomes evident that in order to efficiently generate pseudo-experiments with small p-values under the null hypothesis p(x | θ_t), one should simply generate the pseudo-experiments from the alternative hypothesis p_m(x), which is what is being done in the mixture-FC method.
In practice, it will be easier to use the discrete "prior" over {θ}_S as was discussed in the text, because unless the likelihood is Gaussian, the numerical integration required for marginalization usually increases the computational cost and complexity. This similarity to profiling/marginalization can nevertheless be exploited to motivate an ideal spacing of the {θ}_S values. Among the well-known objective priors, Jeffreys' prior [8] is known to be uniform in the parameterization in which the likelihood is Gaussian, if such a parameterization exists. Since profiling and marginalization with a uniform prior over a Gaussian likelihood produce equivalent results up to a constant offset, Jeffreys' prior can be considered a good candidate for choosing the {θ}_S values at which to generate pseudo-experiments. For example, in the CP-violation analysis that was discussed in the earlier section, it would be more suitable to choose a uniform spacing of parameter values not in δ_CP but in sin δ_CP, with equal probabilities for the sign of cos δ_CP, since the dominant constraint is due to the total number of events N ∼ Poisson(λ = A + B sin δ_CP) for some constants A and B, resulting in an approximately Gaussian likelihood over sin δ_CP.
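A sketch of such a spacing for the δ_CP example: grid points placed uniformly in sin δ_CP, with both signs of cos δ_CP represented. The construction below is an illustrative assumption, not code from the analysis:

```python
import math

def cp_grid(S: int) -> list[float]:
    """Grid over delta_CP in (-pi, pi): S values uniform in sin(delta_CP),
    each paired with its mirror point of opposite cos(delta_CP) sign."""
    grid = []
    for i in range(S):
        s = -1.0 + (2.0 * i + 1.0) / S         # bin midpoints uniform in sin
        delta = math.asin(s)                    # branch with cos(delta) > 0
        grid.append(delta)
        # Mirror point with the same sin(delta) but cos(delta) < 0:
        grid.append(math.copysign(math.pi, delta) - delta)
    return sorted(grid)

grid = cp_grid(8)
print([round(d, 3) for d in grid])
# Spacing in delta widens near +/- pi/2, where the likelihood (which is
# driven by sin(delta_CP)) varies only slowly with delta_CP.
```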

C. Nuisance parameters
Because the significant error reduction in the mixture-FC method exploits the specific relation of the ∆χ²(θ_t | x) statistic to the distribution that generates the pseudo-experiments, one cannot assume all features to directly translate to an analysis with nuisance parameters, or "systematic" parameters as they are often called in physics. Especially for the commonly used profile-FC [9] or posterior Highland-Cousins methods [10], where the space of nuisance parameters from which to generate the pseudo-experiments is significantly reduced based on constraints from the observed experimental data, it is possible to have situations where the straightforward application of the mixture-FC method does not yield the exponential reduction of errors on the estimated critical values given by Eq. (36). One should therefore not rely on these to estimate the number of required pseudo-experiments.
In a relatively general setting, when the target distribution is directly a part of the mixture distribution (so C = 1), one can show that even in the worst case, the variance on the critical values only increases very slightly compared to the conventional method, by a factor 1/(1 − P(y)) (see Appendix). This factor is negligible considering that for high CL we have P(y) ≪ 1. The weights are bounded from above by a similar limit, which is important for well-defined importance-sampling behavior. The naive application of the mixture-FC method to Feldman-Cousins confidence intervals is therefore still worth a try. In fact, certain situations may yield a near-exponential reduction of errors as in the case without nuisance parameters, but due to the lack of theoretical guarantees it is suggested to carefully study the distribution of weights and the reliability of bootstrap error estimates in this situation.
Because an exponential reduction of errors cannot be guaranteed in a setting with nuisance parameters, the ability to interpolate critical values becomes the more interesting feature in this setting. Here it is important that the pseudo-experiments generated at neighboring θs values (and suitable values of the nuisance parameters) sufficiently overlap in the space of pseudo-experiments. Otherwise, the mismatch between the pseudo-experiment generation and the statistical model behind the test statistic may quickly result in a large spread of weight values, which would make both the estimated critical values and their error estimates unreliable. This is because with nuisance parameters there are significantly more dimensions in which the pseudo-experiments can differ, even if they have similar values of the test statistic.
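One simple way to monitor such a spread of weight values is the effective sample size, ESS = (Σ w)² / Σ w², a standard importance sampling diagnostic rather than part of the mixture-FC method itself. Values far below the actual number of pseudo-experiments signal that the weighted estimates (and their bootstrap errors) should not be trusted:

```python
import numpy as np

def effective_sample_size(weights):
    """Kish effective sample size of a set of importance weights.
    Equals N for uniform weights and approaches 1 when a single
    pseudo-experiment dominates."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Uniform weights: no loss of effective statistics
assert effective_sample_size(np.ones(1000)) == 1000.0
```

Comparing the ESS to the nominal number of pseudo-experiments per θs value gives a quick check of whether the overlap requirement described above is satisfied.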
In one specific situation, however, all properties discussed in the earlier sections are directly applicable despite the presence of nuisance parameters. This is when using the prior Highland-Cousins method in conjunction with a marginal-∆χ² statistic, where it is essential to use the same prior distribution π(η) for the nuisance parameters η in both the pseudo-experiment generation and the test statistic. In this case the effect of the nuisance parameters is entirely absorbed into the probability model used to generate the pseudo-experiments, in the sense of p(x | θ) = ∫ dη π(η) p(x | θ, η), such that as far as the mixture-FC method is concerned, no nuisance parameters exist.
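This absorption of the nuisance parameters can be made concrete with a small sketch: drawing η from the prior and then x from p(x | θ, η) is exactly a draw from the marginal model p(x | θ). The Poisson counting model and log-normal prior below are illustrative choices, not the models used in this paper:

```python
import numpy as np

def sample_marginal(theta, n_exp, rng=np.random.default_rng(0)):
    """Draw n_exp pseudo-experiments from the marginal model
    p(x | theta) = integral d(eta) pi(eta) p(x | theta, eta),
    by ancestral sampling: eta ~ pi(eta), then x ~ p(x | theta, eta).
    Toy model: Poisson counts with a log-normal normalization prior."""
    eta = rng.lognormal(mean=0.0, sigma=0.1, size=n_exp)  # eta ~ pi(eta)
    lam = eta * (10.0 + 5.0 * np.sin(theta))              # toy expected counts
    return rng.poisson(lam)                               # x ~ p(x | theta, eta)

x = sample_marginal(theta=-np.pi / 2, n_exp=100_000)
```

As far as the mixture-FC machinery is concerned, `sample_marginal` defines a nuisance-free model p(x | θ), so all the error-reduction and interpolation properties discussed earlier carry over unchanged.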
More detailed discussions, with examples and possible modifications of the sampling distributions for the pseudo-experiments, will be given in a separate publication.

The mixture-FC method was shown to significantly reduce the errors on the critical values, with an exponential reduction for high confidence level critical values, at almost no additional computational cost. The method was further shown to enable accurate interpolation of the critical values between the parameter values at which the pseudo-experiments were generated. The theoretically calculated performance was confirmed using a simple example from the analysis of neutrino oscillations. While the exponential reduction of errors is currently only guaranteed for analyses without nuisance parameters, the general technique is applicable to any analysis making use of the Feldman-Cousins method.
E[ · ] denotes the expectation with x ∼ (1/S) Σ_{s=1}^{S} p(x | θs), and E[ · | θt] the expectation with x ∼ p(x | θt). Now defining

FIG. 1. (a) Example functional shape of the upper bound on the ratio of estimated critical value variances from Eq. (36). The red line indicates the error contribution from pseudo-experiments with y ≤ Y(x) < ∆χ²max (first term with A(y)), which is responsible for the exponential reduction of the total uncertainty until the contribution from pseudo-experiments with Y(x) ≥ ∆χ²max (second term with B(y), shown by the green line) takes over for very high CL critical values. (b) The relative error on the calculated CDF estimator P(y | θt), assuming nexp = 10,000 pseudo-experiments at each sampling value θs. A reference 10% error threshold is indicated by the dotted line. The example used for both plots is constructed assuming ε = 1, S = 10, C = 1, ∆χ²max = 35, and the true Y(x)-CDF is assumed to be chi-squared with 1 degree of freedom. In (a), the exponential growth factor of the green line depends on the assumed CDF, unlike the red line, whose decay factor is given by Eq. (36).

FIG. 2. Predicted number of events λb(δCP) for each bin index b, as used in the example.

FIG. 3. ∆χ²t ≡ ∆χ²(θt | x) distributions with target θt = −π/2 for (a) various sampling parameter values θs, and (b) a comparison of the estimated distributions at θt obtained using the standard FC and mixture-FC methods. In both plots, error bars indicate 1σ binomial confidence intervals. Vertical dashed lines indicate the 1, 2, 3, 4, 5σ confidence level critical values obtained by the mixture-FC method. (a) Error bars are omitted for bins with zero entries for clarity. (b) The red "weighted mixture" histogram is also drawn with boxes representing the error from the number of pseudo-experiments in each bin and their weight variance, but these errors are smaller than the line width and not visible. The "theoretical lower limit" on the total number of pseudo-experiments in the mixture distribution is obtained by multiplying the lower bound in Eq. (14) with the red "weighted mixture" histogram, assuming C = 1 and ε = 0.3.

FIG. 6. 1, 2, 3, 4, 5σ confidence level critical values (given as ∆χ²c) obtained from the same set of pseudo-experiments with the standard FC (black error bars) and the mixture-FC method (red error bands). In both cases the error bars/bands indicate the 1σ error on the critical values, obtained with binomial/bootstrap errors for the standard/mixture-FC method, respectively. Dashed lines indicate the critical values from Wilks' theorem, which are not valid here but are drawn for reference. (a) The black error bars for 4 and 5σ have been slightly offset to prevent overlap. (b, c, d) The gray error bands indicate the linear interpolation of the error bar end-points of the standard FC method.

FIG. 7. Estimated relative errors on the CDF estimator P(y | θt) with target θt = −π/2. For the standard FC method the standard error from the binomial distribution is shown (black solid line), where the more precise CDF estimate from the mixture-FC method was used in computing these errors. For the mixture-FC method the bootstrap error estimate (red solid line) is well below the theoretical upper limit of Eq. (36), calculated assuming ε = 0.3 and ∆χ²max = 32 (blue dashed line).