Quantum Randomness Generation by Probability Estimation with Classical Side Information

We develop a framework for certifying randomness from Bell-test trials based on directly estimating the probability of the measurement outcomes with adaptive test supermartingales. The number of trials need not be predetermined, and one can stop performing trials early, as soon as the desired amount of randomness is extractable. It can be used with arbitrary, partially known and time-dependent probabilities for the random settings choices. Furthermore, it is suitable for application to experimental configurations with low Bell violation per trial, such as current optical loophole-free Bell tests. It is possible to adapt to time-varying experimental parameters. We formulate the framework for the general situation where the trial probability distributions are constrained to a known set. Randomness expansion with logarithmic settings entropy is possible for many relevant configurations. We implement probability estimation numerically and apply it to a representative settings-conditional probability distribution of the outcomes from an atomic loophole-free Bell test [Rosenfeld et al., Phys. Rev. Lett. 119:010402 (2017), arXiv:1611.04604 (2016)] to illustrate trade-offs between the amount of randomness, error, settings entropy, unknown settings biases, and number of trials. We then show that probability estimation yields more randomness from the loophole-free Bell-test data analyzed in [Bierhorst et al., arXiv:1702.05178 (2017)] and tolerates adversarial settings probability biases.

devices and determining the measurement outcomes are space-like separated, preventing any communication between them. Furthermore, trials must be committed to in advance, so that it is not possible to postselect them on a success criterion. Because these conditions require large separation and/or fast devices as well as high-efficiency measurements, only recently has it become possible to perform successful Bell tests satisfying these criteria. Such Bell tests may be referred to as "loophole-free", and the list of successful loophole-free Bell tests includes ones based on heralded atom entanglement [7,8] and ones utilizing entangled photon-pairs with high-efficiency detectors [9,10].
Experimental certified randomness was first demonstrated in Ref. [1] (see also Refs. [11,12]) with pairs of ions located in separate traps. This demonstration claimed the presence of 42 bits of entropy with an error of 0.01 in their string of measurement outcomes, with respect to classical side information and restricting correlations to quantum-achievable ones. Extraction would have reduced the number of bits produced and increased the error by an amount that was not determined. Recently, we and our collaborators demonstrated end-to-end randomness extraction [13], producing 256 bits within 0.001 of uniform from one of the data sets from the loophole-free Bell test reported in Ref. [10]. These bits are certified with respect to classical side information and non-signaling assumptions, which in principle allow for super-quantum correlations. Extracting randomness from today's optical loophole-free Bell tests required the theoretical advances in Ref. [13] to deal with the fact that each trial demonstrates very little violation of Bell inequalities. Previous methods could not certify entropy without increasing the number of trials by orders of magnitude. A specific comparison is in Ref. [13].
A benefit of the theory developed in Ref. [13] is that it allows for an adaptive protocol that can track changes in the trial statistics during the protocol. This is helpful in current experiments, where we find that measurable drifts in parameters can wipe out a randomness certificate if not accounted for. The fact that the protocol can adapt is inherited from its use of the "probability-based ratio" protocol for obtaining p-value bounds against local realism (LR) in Bell tests [14,15]. Here we develop a different class of randomness generation protocols based on "probability estimation." Probability estimation involves obtaining high-confidence-level upper bounds on the actual probability of the measurement outcomes given the known constraints on the distributions. We show that randomness generation can be reduced to probability estimation. Since probability estimation is a statistical estimation problem, we then take advantage of the theory of test supermartingales [16] to bypass the framework of Bell inequalities and directly determine probability estimators expressed as products of probability estimation factors (PEFs). PEFs are functions of a trial's settings and outcomes and provide a way of multiplicatively accumulating probability estimates trial-by-trial. While relationships between PEFs and Bell inequalities exist, characteristic measures of quality for Bell inequalities, such as violation signal-to-noise, winning probability or statistical strength for rejecting LR, are not good measures of PEF performance. We develop tools for obtaining PEFs. In particular, we show that when the distributions of settings and outcomes are constrained to a convex polytope, PEFs can be effectively optimized with convex optimization over the polytope given one parameter, the "power" (see Def. 22). The optimization can explicitly take into account the number of trials and the error goal. 
In the limit where the power parameter goes to zero, an asymptotic rate is obtained that can be interpreted as the optimal rate for producing entropy for random bits. This generalizes and improves on min-entropy estimators for Bell configurations described in works such as Refs. [17][18][19], which are optimal for single-trial min-entropy estimation.
In a large class of situations including the standard Bell-test configurations, PEFs directly lead to exponential expansion of input randomness, as expected from previous works [20][21][22], which prove exponential expansion, albeit with more trials or worse settings entropy than PEFs, but secure against quantum side information. We prove that asymptotically, the settings entropy can be logarithmic in the output entropy. This is the best result so far for randomness expansion without using a cross-feeding protocol, which can accomplish infinite expansion [23,24]. To accomplish exponential expansion we use highly biased settings distributions. We point out that it is not necessary to have independent and identically distributed (i.i.d.) settings choices. In particular, if the settings are obtained by choosing, ahead of time, a random "test" trial among a block of 2^k trials, we can achieve good expansion while eliminating the need for decompressing uniformly random input bits into a stream of highly biased and independent ones.
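The block-based settings scheme just described is easy to state concretely. The sketch below is illustrative (the function name and fixed-setting choice are not from the paper): within each block of 2^k trials, one uniformly chosen "test" trial gets uniformly random (2, 2, 2) settings, and every other trial uses a fixed setting, so a block consumes only k + 2 input bits.

```python
import random

def block_settings(k, fixed=(0, 0), rng=random):
    """Settings for one block of 2**k trials: all trials use the fixed
    setting pair except one uniformly chosen test trial, whose settings
    (x, y) are uniformly random.  Costs k bits to pick the test trial
    plus 2 bits for its settings."""
    block = [fixed] * (2 ** k)
    test = rng.randrange(2 ** k)                         # k input bits
    block[test] = (rng.randrange(2), rng.randrange(2))   # 2 input bits
    return block

settings = block_settings(k=4)
assert len(settings) == 16
# At most one trial per block deviates from the fixed settings, so the
# settings entropy grows like O(log n) in the total number of trials n.
```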
As a demonstration of the power of probability estimation, we show how it would perform on a representative example for distributions achieved in loophole-free Bell experiments with atoms based on heralded entanglement [8]. We then apply it to the main data set from Ref. [10] analyzed in Ref. [13], showing that we could improve the amount of randomness extracted substantially, while ensuring that the certificates are valid even if the input randomness is biased, a problem that was noticed and accounted for in the report on the loophole-free Bell test for this experiment [10]. Finally, we reanalyze the data from the first demonstration of certified experimental randomness in a Bell test free of the detection loophole but subject to the locality loophole, which was based on ions [1]. We demonstrate significantly more randomness than reported in this reference. These examples demonstrate that probability estimation is a practical way for implementing entropy production in randomness generation.
Our framework is in the spirit of the entropy accumulation framework of Ref. [25], but takes advantage of the simplifications possible for randomness generation with respect to classical side information. In particular, the outside entities have no interaction with the protocol devices after the protocol starts, and the framework can be cast purely in terms of random variables without invoking quantum states. This avoids the complications of a full representation of the protocol in terms of quantum processes. With these simplifications, our framework applies to any situation with known constraints on past-conditional probability distributions and accumulates the probability estimates trial-by-trial. We interpret these estimates as smooth min-entropy estimates, but prefer to certify the extracted randomness directly.
In the entropy accumulation framework, the relevant estimators, called min-tradeoff functions, must be chosen before the protocol, and the final certificate is based on the sum of statistics derived from these functions. Finding suitable min-tradeoff functions is in general difficult. In the probability estimation framework, probability estimators can be adapted and accumulate multiplicatively. For relevant situations, PEFs are readily obtained and the tradeoff between randomness and error can be effectively optimized.
The analog of min-tradeoff functions in the probability estimation framework are entropy estimators. We show that logarithms of PEFs are proportional to entropy estimators, and essentially all entropy estimators are related to PEFs in this way. In this sense, there is no difficulty in finding entropy estimators. However, PEFs are more informative for applications, so except for illuminating asymptotic behavior, there is little to be gained by seeking entropy estimators directly.
A feature of entropy accumulation is optimality of asymptotic rates for min-tradeoff functions. Probability estimation also achieves optimal asymptotic rates. In both cases, the tradeoff between error and amount of randomness makes these asymptotic rates less relevant, which we demonstrate for probability estimation on the Bell-test examples mentioned above. Furthermore, the framework can be used with any randomness-generating device with randomized measurements to verify the behavior of the device subject to trusted physical constraints. The only requirement is that the constraints can be formulated as constraints on the probability distributions and are sufficiently strong to allow for randomness certification.

The remainder of the paper is structured as follows. We summarize the main results in Sect. I B. We lay out our notation and define the basic concepts required for the probability estimation framework in Sect. II. This section includes introductions to less familiar material on classical smooth min-entropies, test martingales and the construction of test martingales from Bell inequalities. In Sect. III, we define exact and soft probability estimation and show how randomness generation can be reduced to probability estimation. The measurement outcomes can be fed into appropriate randomness extractors, where the number of near-uniform random bits is naturally related to the probability estimate. We give three protocols that compose probability estimators with randomness extractors. The first is based on general relationships between probability estimation and smooth min-entropy and reprises techniques from Refs. [12,13,26]. The second relies on banked randomness to avoid the possibility of protocol failure. The third requires extractors that are well-behaved on uniform inputs to enable a direct analysis of the composition.
Although we do not demonstrate an end-to-end randomness-generation protocol including extraction here, our goal is to provide all the information needed for implementing such a protocol in future work, with all relevant constants given explicitly. Sect. IV shows how to perform probability estimation for a sequence of trials by means of implicit test supermartingales defined adaptively. The main tool is the PEF, used to successively accumulate probability estimates; the main results are theorems showing how PEFs can be "chained" to form probability estimators. PEFs are readily constructed for distribution constraints relevant in Bell-test configurations. We proceed to an exploration of basic PEF properties in Sect. V, where we find that there is a close relationship between the rates for PEFs and those for a class of functions called "entropy estimators", which are the analog of min-tradeoff functions in our framework. We establish that for error that is e^{-o(n)}, the achievable asymptotic rates are optimal. Next, in Sect. VI, we consider a family of PEFs constructed from Bell functions whose expectations bound maximum conditional probabilities in one trial. In Sect. VII we show that this family can be used for exponential expansion by means of highly biased settings choices. Given a constant error bound, the settings distribution can be interpreted as a random choice of a constant (on average) number of test trials with uniformly random settings, where the remaining trials have fixed settings. We note that this is a theoretical proof-of-principle of exponential expansion. In practice, we prefer to numerically optimize the PEFs with respect to the desired error and calibrated experimental distribution. The final section, Sect. VIII, explores the three examples mentioned above.

B. Summary of Main Results
This manuscript aims to establish the foundations of probability estimation and contains a large number of mathematical results based on mathematical concepts introduced here. In this section we summarize the main results without precise definitions, in more familiar terms and with less generality than the probability estimation framework established later.
The context for our work consists of experiments where a sequence of trials is performed. In each trial, settings are chosen according to a random variable (RV) Z and outcomes are obtained according to an RV C. The sequences of outcomes and settings obtained in the experiment are denoted by C and Z, where for n trials, C = (C_i)_{i=1}^{n} and Z = (Z_i)_{i=1}^{n}, with C_i and Z_i the i'th trial's outcomes and settings. The main example of such an experiment is the standard Bell test, where there are two physically separated stations. In a trial, the stations randomly select measurement settings X and Y (respectively) and the stations' devices produce outcomes A and B (respectively). In this case Z = XY and C = AB. The physical separation of the stations and the physics of the measurement devices constrain the distributions of ABXY to be non-signaling once loopholes are accounted for. Classical devices are further constrained by LR, which can be violated by quantum devices. This violation is associated with randomness that can be exploited for randomness generation [2], see the introduction above. Here we consider randomness generation where the generated bits are random relative to any external entity holding classical side information E.
The traditional approach to randomness generation is to first derive a bound on the smooth min-entropy of the outcomes conditional on settings and E from the statistics of the observed value cz of CZ, given constraints on the joint distribution of CZ and E. The smooth conditional min-entropy H^ε_{min,µ}(C|ZE) for the joint distribution µ is given by the negative logarithm of the maximum probability of C given Z and E, averaged over Z and E, up to an error bound ε, which is the smoothness parameter. It can be formally defined as the maximum λ ≥ 0 such that there exist µ′ and ν, where µ′ is within total variation distance ε of µ and µ′(cze) ≤ 2^{−λ} ν(ze) for all cze. (A more involved but equivalent definition based on maximum probabilities is given in Sect. II D, Def. 3.) The smooth conditional min-entropy characterizes the number of near-uniform random bits that can be extracted from the outcomes with a randomness extractor applied after obtaining the min-entropy bound. The distance from uniform of the random bits obtained is parametrized by an error bound that, in the simplest case, is the sum of the error bound ε of the smooth min-entropy and an error parameter of the extractor used to obtain the random bits. Here we reduce the problem of obtaining a smooth conditional min-entropy bound to that of estimating the conditional probability P(c|ze) for the observed values cz, independently of the value e of E.
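In symbols, with TV denoting total variation distance, the definition just stated can be written as:

```latex
H^{\epsilon}_{\min,\mu}(C|ZE) \;=\;
\max\Bigl\{\lambda \ge 0 \;:\; \exists\,\mu',\nu \ \text{such that}\
\mathrm{TV}(\mu',\mu)\le\epsilon
\ \text{and}\ \mu'(cze) \le 2^{-\lambda}\,\nu(ze)\ \text{for all } cze \Bigr\}.
```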
Constraints on the joint distribution of CZ and E are determined by a statistical model H consisting of the allowed joint probability distributions, which may enforce non-signaling conditional on E and other constraints such as that the conditional distributions are quantum-achievable by causally separated devices sharing an initial state. Given H, a level-ε (conditional) probability estimator for H and C|Z is a function U : cz → U(cz) ∈ [0, 1] such that for all µ ∈ H and all values e of E, the probability that CZ takes a value cz for which U(cz) ≥ P_µ(C = c|Z = z, E = e) is at least 1 − ε. The probability estimate U(cz) differs from a smooth min-entropy estimate in that the quantity being estimated depends on the data cz, while the smooth conditional min-entropy is a characteristic of the overall distribution µ. Our first result is that one can obtain a smooth min-entropy estimate from a probability estimate.
We establish this lemma for the larger class of soft probability estimators, which provide extensions that may be useful in some applications. In particular, softness enhances adaptability and enables use of trial information not determined by CZ. For simplicity, we do not consider softening in this overview.
Here is a sketch of one way to generate randomness from C using the lemma above. First determine a level-ε² probability estimator U, then run an experiment to obtain an instance cz of CZ. If U(cz) > p, the protocol fails. If not, apply a classical-proof extractor E to c with input min-entropy −log2(p/ε) to produce random bits. The number of random bits produced can be close to the input min-entropy. The definition and properties of extractors are summarized in Sect. II D. The parameters chosen ensure that if the probability of success is at least ε, then conditional on success, the random bits produced are uniform within TV distance ε + ε_x, where ε_x is the extractor error. In Sect. III C, we provide details for three protocols for randomness generation from probability estimators that take advantage of features of probability estimation to improve on the sketch just given.
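For concreteness, the parameter bookkeeping in this sketch can be spelled out numerically. All numbers below are hypothetical placeholders, not parameters from any experiment discussed later.

```python
from math import log2

# Hypothetical protocol parameters (illustration only)
eps = 1e-6       # the estimator is level-eps^2; success probability assumed >= eps
p = 2.0 ** -300  # acceptance threshold on the probability estimate U(cz)
eps_x = 1e-6     # error of the classical-proof extractor

U_cz = 2.0 ** -320            # hypothetical estimate computed from the data
if U_cz > p:
    raise SystemExit("protocol failed")

k_in = -log2(p / eps)         # input min-entropy promised to the extractor
err = eps + eps_x             # distance from uniform, conditional on success
print(f"extract up to ~{k_in:.1f} bits at error {err:.2e}")
```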
We previously developed a powerful method for constructing Bell functions that optimize the statistical strength for testing LR by multiplying "probability-based ratios" [14,15]. The method can be seen as an application of the theory of test supermartingales [16]. The definitions and basic properties of test supermartingales are given in Sect. II F. Here we show that this theory can be applied to the problem of constructing probability estimators. The basic tool is to construct probability estimation factors (PEFs) that are computed for each trial of an experiment. Let C be the trial model. A PEF with power β > 0 for C and C|Z is a non-negative function F : cz → F(cz) such that for all µ ∈ C we have E_µ(F(CZ) µ(C|Z)^β) ≤ 1. The fundamental theorem of PEFs is that they can be "chained" over trials to construct probability estimators. When chaining trials, the experimental model H is constructed from individual trial models C by requiring that each trial's probability distribution conditional on the past is in C. The trial models and PEFs may depend on past settings and outcomes. Conditioning on settings requires an additional conditional independence property as specified in Sect. IV A. A simplified version of the fundamental theorem for the case where all trial models and PEFs are the same is the following: Theorem. (Thm. 23) Let F be a PEF with power β for C and C|Z and define T(CZ) = ∏_{i=1}^{n} F(C_i Z_i). Then (1/(ε T(CZ)))^{1/β} is a level-ε probability estimator for H and C|Z. The proof is enabled by martingale theory and requires constructing a test supermartingale.
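The PEF condition and the chaining construction can be checked numerically on a toy model. In the sketch below, the trial distribution and the choice F(c, z) = µ(c|z)^(−β) are illustrative: this F saturates the PEF condition when the model contains only the single distribution µ, and it is not an optimized PEF for any larger model.

```python
import numpy as np

# Illustrative trial model (not from the paper): uniform binary settings Z
# and a known conditional distribution mu(c|z) for a binary outcome C.
p_z = np.array([0.5, 0.5])
mu = np.array([[0.9, 0.1],   # mu(c|z=0) for c = 0, 1
               [0.6, 0.4]])  # mu(c|z=1)
beta = 0.5                   # PEF "power"

# F(c, z) = mu(c|z)^(-beta) gives E_mu[F(C,Z) mu(C|Z)^beta]
#   = sum_{z,c} P(z) mu(c|z) * 1 = 1, so the PEF condition holds.
F = mu ** (-beta)
lhs = sum(p_z[z] * mu[z, c] * F[z, c] * mu[z, c] ** beta
          for z in range(2) for c in range(2))
assert abs(lhs - 1.0) < 1e-12

# Chaining over n i.i.d. trials: T = prod_i F(c_i, z_i), and
# U = (eps * T)^(-1/beta) should upper-bound the realized conditional
# probability prod_i mu(c_i|z_i) with probability >= 1 - eps.
rng = np.random.default_rng(0)
n, eps = 1000, 0.01
z_seq = rng.choice(2, size=n, p=p_z)
c_seq = np.array([rng.choice(2, p=mu[z]) for z in z_seq])
T = np.prod(F[z_seq, c_seq])
U = (eps * T) ** (-1.0 / beta)
realized = np.prod(mu[z_seq, c_seq])
assert U >= realized  # holds on every run for this particular choice of F
```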
From the fundamental theorem for PEFs, we can see that up to adjustments for probability of success and the error bound, the min-entropy per trial witnessed by a PEF F with power β at distribution µ is expected to be the "log-prob rate" E_µ(log(F(CZ)))/β. In a sequence of trials with devices behaving according to specifications, we can expect the trial distributions to be approximately i.i.d. with known distribution µ. In this case the RVs log(F(C_i Z_i))/β are also approximately i.i.d., and their sum is typically close to n E_µ(log(F(CZ)))/β. Since the sum determines the conditional min-entropy witnessed by the probability estimator obtained from F, a goal of PEF construction is to maximize the log-prob rate. We show that PEF construction and optimization reduces to the problem of maximizing a concave function over a convex domain. If the trial model is a convex polytope, the convex domain is defined by finitely many extreme points and PEF optimization has an effective implementation (Thm. 28). For the standard Bell-test configuration, convex polytopes that include the model and that have a manageable number of extreme points exist. We implement and apply PEF optimization in Sect. VIII.

Given a trial model and a distribution µ in the model, the maximum number of near-uniform random bits that can be produced per trial in an asymptotically long sequence of trials is given by the minimum conditional entropy of the outcomes given the settings and E. The minimum is over all distributions ν of CZE such that ν(CZ|e) ∈ C for all e and ν(CZ) = µ(CZ). This is a consequence of the asymptotic equipartition property [27]. We prove that PEFs can achieve the maximum rate in the asymptotic limit.
Theorem. (Thm. 43) For any trial model C and distribution µ ∈ C with minimum conditional entropy g, the supremum of the log-prob rates at µ of PEFs for C is g.
To prove optimality we define entropy estimators for model C as real-valued functions K(CZ) such that E_µ(K(CZ)) is a lower bound on the conditional entropy for all µ ∈ C. We then show that there are entropy estimators for which the entropy estimate E_µ(K(CZ)) approaches the minimum conditional entropy, and that for every entropy estimator, there are PEFs whose log-prob rates approach the entropy estimate.
It is desirable to minimize the settings entropy used for randomness generation. For this we consider maximum probability estimators for a model C, which are defined as functions G(CZ) such that E_µ(G(CZ)) ≥ max_{cz} µ(c|z) for all µ ∈ C. Non-trivial maximum probability estimators exist. For example, every Bell inequality for the standard two-settings, two-outcomes Bell-test configuration for two parties has associated maximum probability estimators G(CZ), and a distribution that violates the Bell inequality satisfies E_µ(G(CZ)) < 1. For every maximum probability estimator G(CZ), we construct a family of PEFs for which the log-prob rates at µ ∈ C approach −log(E_µ(G(CZ))). We analyze this family of PEFs to determine how log-prob rates depend on the number of trials and the power, and find that with this family it is possible to achieve exponential expansion of settings entropy. For this, we consider models C_{C|Z} of distributions of C conditional on Z. An unconditional model C is obtained by specifying a probability distribution for the settings.
Theorem. (Thm. 52) Let G(CZ) be a maximum probability estimator for C, where C is determined by the conditional model C_{C|Z} together with the uniform settings distribution. Suppose there is µ ∈ C such that E_µ(G(CZ)) < 1. Then for a constant error bound, there exists a family of PEFs and settings probability distributions determined by the number of trials n such that the settings entropy is O(log(n)) and the smooth conditional min-entropy of the outcomes is Ω(n).
The constants in the construction for exponential expansion are not of excessive size, but we consider the construction a proof of principle, not a practical proposal. Given that we can optimize PEFs directly for relevant configurations, it is preferable to use directly optimized PEFs for finite experiments and the best expansion. We determine expansion opportunities for the trial distribution observed in an atomic Bell test in Sect. VIII.
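To illustrate in the simplest terms what direct PEF optimization looks like, the sketch below uses a hypothetical trial model given by the segment between two extreme points and maximizes the log-prob rate over a one-parameter family of candidate PEFs by grid search. All distributions and parameters here are invented for illustration; the actual implementation in Sect. VIII solves the full convex program over the relevant Bell polytope.

```python
import numpy as np

# Hypothetical polytope trial model with two extreme points: conditional
# distributions sigma(c|z) for binary c and two settings z, uniform settings.
p_z = np.array([0.5, 0.5])
sig0 = np.array([[0.95, 0.05], [0.55, 0.45]])   # extreme point 0: sigma0(c|z)
sig1 = np.array([[0.70, 0.30], [0.80, 0.20]])   # extreme point 1
mu = 0.5 * sig0 + 0.5 * sig1                    # calibrated distribution in the model
beta = 0.2                                      # PEF power

def pef_constraint(F, sig):
    """E_sigma[F(C,Z) sigma(C|Z)^beta] for one trial distribution."""
    return np.sum(p_z[:, None] * sig * F * sig ** beta)

def log_prob_rate(F):
    """Scale F so the PEF condition holds on the whole model (the constraint
    is convex in sigma(c|z), so checking the extreme points suffices), then
    return the log-prob rate E_mu[log2 F]/beta in bits per trial."""
    scale = max(pef_constraint(F, sig0), pef_constraint(F, sig1))
    Fs = F / scale
    return np.sum(p_z[:, None] * mu * np.log2(Fs)) / beta

# Concave objective over a convex family: search candidate "shapes"
# F = q_t^(-beta) with q_t on the segment between the extreme points.
ts = np.linspace(0.0, 1.0, 201)
rates = [log_prob_rate(((1 - t) * sig0 + t * sig1) ** (-beta)) for t in ts]
best = ts[int(np.argmax(rates))]
print(f"best t = {best:.3f}, rate = {max(rates):.4f} bits/trial")
```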

A. Notation
Much of this work concerns stochastic sequences of random variables (RVs). RVs are functions on an underlying probability space. The range of an RV is called its value space. Here, all RVs have finite value spaces. We truncate sequences of RVs so that we only consider finitely many RVs at a time. With this we may assume that the underlying probability space is finite too. We use upper-case letters such as A, B, . . . , X, Y, . . . to denote RVs. The value space of an RV such as X is denoted by Rng(X). The cardinality of the value space of X is |Rng(X)|. Values of RVs are denoted by the corresponding lower-case letters. Thus x is a value of X, often thought of as the particular value realized in an experiment. In the same spirit, we use Ω to denote the universal RV defined as the identity function on the set underlying the probability space. Values of Ω are denoted by ω. When using symbols for values of RVs, they are implicitly assumed to be members of the range of the corresponding RV. In many cases, the value space is a set of letters or a set of strings of a given length. We use juxtaposition to denote concatenation of letters and strings. For a string s, |s| denotes its length. Unless stated otherwise, a string-valued RV S produces fixed-length strings. |S| denotes the length of the strings, S_i is the i'th letter of the string, and S_{≤i} is the length-i prefix of the string. By default, strings are binary, which implies, for example, |S| = log2(|Rng(S)|). Sequence RVs (stochastic sequences) are denoted by capital bold-face letters, with the corresponding lower-case bold-face letters for their values. For example, the sequence C has values c. Our conventions for indices are that we generically use N to denote a large upper bound on sequence lengths, n to denote the available length and i, j, k, l, m as running indices. By convention, A_{≤0} is the empty sequence of RVs. Its value is constant, independent of Ω.
When multiple stochastic sequences are in play, we refer to the collection of i'th RVs in the sequences as the data from the i'th trial. We typically imagine the trials as happening in time and being performed by an experimenter. We refer to the data from the trials preceding the upcoming one as the past. The past can also include initial conditions and any additional information that may have been obtained. These are normally implicit when referring to or conditioning on the past.
Probabilities are denoted by P(. . .). If there are multiple probability distributions involved, we disambiguate with a subscript such as in P ν (. . .) or simply ν(. . .), where ν is a probability distribution. We generally reserve the symbol µ for the global, implicit probability distribution, and may write µ(. . .) instead of P(. . .). Expectations are similarly denoted by E(. . .) or E µ (. . .). If φ is a logical expression involving RVs, then {φ} denotes the event where φ is true for the values realized by the RVs. For example, {f (X) > 0} is the event {ω : f (X(ω)) > 0} written in full set notation. The brackets {. . .} are omitted for events inside P(. . .) or E(. . .). As is conventional, commas separating logical expressions are interpreted as conjunction. When the capital/lower-case convention can be unambiguously interpreted, we abbreviate "X = x" by "x". For example, with this convention, P(x, y) = P(X = x, Y = y). Furthermore, we omit commas in the abbreviated notation, so P(xy) = P(x, y). RVs or functions of RVs appearing outside an event but inside P(. . .) or after the conditioner in E(. . . | . . .) result in an expression that is itself an RV. We can define these without complications because of our assumption that the event space is finite. Here are two examples. P(f (X)|Y ) is the RV whose value at ω is P(f (X) = f (X(ω))|Y = Y (ω)). This is a function of the RVs X and Y and can be described as the RV whose value is P(f (X) = f (x)|Y = y) whenever the values of X and Y are x and y, respectively. Similarly E(X|Y ) is the RV defined as a function of Y , with value E(X|Y = y) whenever Y has value y. Note that X plays a different role before the conditioners in E(. . .) than it does in P(. . .), as E(X|Y ) is not a function of X, but only of Y . We comment that conditional probabilities with conditioners having probability zero are not well-defined, but in most cases can be defined arbitrarily. 
Typically, they occur in a context where they are multiplied by the probability of the conditioner and thereby contribute zero regardless. An important context involves expectations, where we use the convention that when expanding an expectation over a finite set of values as a sum, zero-probability values are omitted. We do so without explicitly adding the constraints to the summation variables. We generally use conditional probabilities without explicitly checking for probability-zero conditioners, but it is necessary to monitor for well-definedness of the expressions obtained.
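The convention above that E(X|Y) is itself an RV, a function of Y alone, can be illustrated on a small finite probability space (all values below are illustrative).

```python
import numpy as np

# Finite probability space: omega ranges over {0, 1, 2, 3} with equal weight.
omega = np.arange(4)
X = np.array([1.0, 2.0, 3.0, 4.0])   # X(omega)
Y = np.array([0, 0, 1, 1])           # Y(omega)

# E(X|Y) evaluated at each omega: the average of X over the event
# {Y = Y(omega)}.  The result depends on omega only through Y(omega).
E_X_given_Y = np.array([X[Y == Y[w]].mean() for w in omega])
assert np.allclose(E_X_given_Y, [1.5, 1.5, 3.5, 3.5])
```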
To denote general probability distributions, usually on the joint value spaces of RVs, we use symbols such as µ, ν, ρ, with modifiers as necessary. As mentioned, we reserve the unmodified µ for the distinguished global distribution under consideration, if there is one. Other symbols typically refer to probability distributions defined on the joint range of some subset of the available RVs. The set of probability distributions on Rng(X) is denoted by S_X. We usually just say "distribution" instead of "probability distribution". The terms "distributions on Rng(X)" and "distributions of X" are synonymous. The support of a distribution ν of X is denoted by Supp(ν) = {x | ν(x) > 0}. When multiple RVs are involved we denote marginal and conditional distributions by expressions such as µ[X|Y = y] for the distribution of X on its value space conditional on {Y = y}. The probability of x for this distribution can be denoted by µ[X|Y = y](x), which is well-defined for P(Y = y) > 0 and therefore well-defined with probability one because of our finiteness assumptions. Note the use of square brackets to distinguish the distribution specification from the argument determining the probability at a point. If ν is a joint distribution of RVs, then we extend the conventions for arguments of P(. . .) to arguments of ν, as long as all the arguments are determined by the RVs for which ν is defined. For example, if ν is a joint distribution of X, Y, and Z, then ν(x|y) has the expected meaning, as does the RV ν(X|Y) in contexts requiring no other RVs. We denote the uniform distribution on Rng(X) by Unif_X, omitting the subscript if the value space is clear from context. If R and S are independent RVs with marginal distributions ν = µ[R] and ν′ = µ[S] on their ranges, then their joint distribution is denoted by µ[R, S] = ν ⊗ ν′.
In our work, probability distributions are constrained by a statistical model, which is defined as a set of distributions and denoted by letters such as H or C. The models for trials to be considered here are usually convex and closed. For a model C, we write Extr(C) for the set of extreme points of C and Cvx(C) for the convex closure of C defined as the smallest closed convex set containing C.
The total variation (TV) distance between ν and ν′ is defined as TV(ν, ν′) = Σ_x ⟦ν(x) > ν′(x)⟧ (ν(x) − ν′(x)) = (1/2) Σ_x |ν(x) − ν′(x)|, where ⟦φ⟧ for a logical expression φ denotes the {0, 1}-valued function evaluating to 1 iff φ is true. True to its name, the TV distance satisfies the triangle inequality. Here are three other useful properties. First, if ν and ν′ are joint distributions of X and Y and the marginals satisfy ν[Y] = ν′[Y], then the TV distance between ν and ν′ is the average of the TV distances of the Y-conditional distributions: TV(ν, ν′) = Σ_y ν(y) TV(ν[X|y], ν′[X|y]). Second, if for all y, ν[X|y] = ν′[X|y], then the TV distance between ν and ν′ is given by the TV distance between the marginals on Y: TV(ν, ν′) = TV(ν[Y], ν′[Y]). Third, the TV distance satisfies the data-processing inequality. That is, for any stochastic process E on Rng(X) and distributions ν and ν′ of X, TV(E(ν), E(ν′)) ≤ TV(ν, ν′). We use this property only for functions E, but for general forms of this result, see Ref. [28].
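The TV distance and its data-processing property are straightforward to check numerically (the distributions below are arbitrary examples).

```python
import numpy as np

def tv(p, q):
    """Total variation distance between two distributions on a finite set."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

nu  = np.array([0.5, 0.3, 0.2])
nup = np.array([0.4, 0.4, 0.2])
assert abs(tv(nu, nup) - 0.1) < 1e-12

# Data-processing: applying a function (here, merging the last two values
# of the sample space) cannot increase the TV distance.
def push_forward(p):
    return np.array([p[0], p[1] + p[2]])

assert tv(push_forward(nu), push_forward(nup)) <= tv(nu, nup) + 1e-12
```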
When constructing distributions close to a given one in TV distance, it is often convenient to work with subprobability distributions. A subprobability distribution of X is a subnormalized non-negative measure on Rng(X), which in our case is simply a non-negative function ν̃ on Rng(X) with weight w(ν̃) = ∑_x ν̃(x) ≤ 1. For expressions not involving conditionals, we use the same conventions for subprobability distributions as for probability distributions. When comparing subprobability distributions, ν̃ ≤ ν̃′ means that for all x, ν̃(x) ≤ ν̃′(x), and we say that ν̃′ dominates ν̃. Lemma 1. Let ν̃ be a subprobability distribution of X of weight w = 1 − ε. Let ν and ν′ be distributions of X satisfying ν̃ ≤ ν and ν̃ ≤ ν′. Then TV(ν, ν′) ≤ ε.
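As a concrete illustration, the TV distance and the domination bound of Lem. 1 can be checked numerically. The following is a minimal sketch (the function name and the example distributions are ours, not from the text), representing distributions as Python dicts:

```python
def tv(nu, nu2):
    """Total variation distance between two distributions given as dicts
    mapping points to probabilities: TV = (1/2) * sum_x |nu(x) - nu2(x)|."""
    keys = set(nu) | set(nu2)
    return 0.5 * sum(abs(nu.get(x, 0.0) - nu2.get(x, 0.0)) for x in keys)

# Lemma 1 sketch: if a subprobability distribution nu_tilde of weight
# 1 - eps is dominated by both nu and nu2, then TV(nu, nu2) <= eps.
nu_tilde = {'a': 0.5, 'b': 0.3}           # weight 0.8, so eps = 0.2
nu = {'a': 0.6, 'b': 0.4}                 # dominates nu_tilde
nu2 = {'a': 0.5, 'b': 0.3, 'c': 0.2}      # dominates nu_tilde
eps = 1.0 - sum(nu_tilde.values())
assert all(nu_tilde[x] <= nu[x] and nu_tilde[x] <= nu2[x] for x in nu_tilde)
assert tv(nu, nu2) <= eps + 1e-12
```

Here the bound is tight: the two distributions differ exactly by how they distribute the missing weight ε.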

B. Bell-Test Configurations
The standard example of a Bell-test configuration involves a pair of devices located at two stations, A and B. The stations engage in a sequence of trials. In a trial, each station chooses one of two settings and obtains a measurement outcome, either 0 or 1, from its device. A Bell test with this configuration is called a (2, 2, 2) Bell test, where the numbers indicate the number of stations, the number of settings available to each station and the number of outcomes at each setting. The test produces two stochastic sequences from each station. The sequence of settings choices is denoted by X for A and Y for B, and the sequence of measurement outcomes by A and B, respectively. In using this notation, we allow for arbitrary numbers of settings and measurement outcomes. For a (2, k, l) Bell-test configuration, |Rng(X_i)| = |Rng(Y_i)| = k and |Rng(A_i)| = |Rng(B_i)| = l. The main role of separating the configuration by station is to justify the assumptions, which can be generalized to more stations if desired. The assumptions constrain the models consistent with the configuration. In this work, once the constraints are determined, the separation into stations plays little role, but we continue to identify settings and outcomes RVs. We use Z and C to denote the corresponding stochastic sequences. For the case of two stations, Z_i = (X_i, Y_i) and C_i = (A_i, B_i). We also refer to RVs D and R. For D, we assume that D_i is determined by (that is, a function of) C_i. We use D to study protocols where the randomness is extracted from D instead of C. For example, this includes protocols where only A's measurement outcomes are used, in which case we set D = A. We assume that C_i is determined by R_i, and we use R to contain additional information accumulated during the experiment that can be used to adapt the protocol but is kept private. The main purpose of the RV Z is to contain data that may become public.
This applies to the settings if the source of the random settings choices is public or otherwise not sufficiently trusted. In situations where there is only one trial under consideration or we do not need the structure of the stochastic sequences, we use the non-boldface variants of the RVs, that is D, C, R, and Z.
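A minimal simulation may help fix the roles of the RVs. The sketch below (the device strategy and the function names are illustrative assumptions, not part of the framework) generates the sequences Z, C and D for a toy (2, 2, 2) configuration with a deterministic local-realistic device:

```python
import random

random.seed(0)

def lr_device(x, y):
    """Toy local-realistic device: outcomes are fixed functions of the
    local settings (a depends only on x, b only on y)."""
    return x, 1 - y  # an arbitrary deterministic strategy

# One toy (2, 2, 2) Bell-test run: Z collects the settings, C the outcomes,
# and D (here, station A's outcomes) is determined by C, as in the text.
n = 5
Z, C = [], []
for _ in range(n):
    x, y = random.randrange(2), random.randrange(2)  # uniform settings
    a, b = lr_device(x, y)
    Z.append((x, y))
    C.append((a, b))
D = [c[0] for c in C]   # D_i is a function of C_i
assert all(d in (0, 1) for d in D) and len(Z) == len(C) == n
```

A real analysis would treat these sequences as RVs jointly distributed with the side information E; the sketch only shows the data layout.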

C. Assumptions for Bell Tests
We are interested in limiting the information held by an external entity. This work is concerned with external entities holding classical side information, so the external entity E's state is characterized by an RV E. Physical entities and systems are denoted by capital sans-serif letters, a convention we already used to refer to stations. All our results are proven for finite event spaces. While this is reasonable with respect to the RVs representing information explicitly used by the protocols, it is an unverifiable restriction on E. For countable E one can use the observation that conditioning on large-probability finite subsets of E gives distributions that are arbitrarily close in TV distance to the unconditioned distribution.
Our theory is formulated in the untrusted device model [2], where E may have had the devices in its possession before the trials of the protocol of interest start. Once the protocol starts, E cannot receive information from the devices. Note that this does not preclude E sending information to the devices via one-way channels; see below. This simplification is possible for local randomness-generation protocols where the participating parties need no remote classical communication. In many other applications, such as quantum key distribution, the protocols involve both quantum and classical communication, in which case E can gain additional information at any time.
To ensure that E cannot receive information from the devices after the start of the protocol requires physical security to isolate the joint location of the devices and the stations. Formalizing this context requires a physical framework in which subsystems and interactions can be defined, with classical subsystems playing a special role. In the case of applications to Bell-test configurations, we motivate the constraints on the models with physical arguments, but we do not prove the constraints with respect to a formal physical framework. Let D be the pair of devices with which A and B implement the protocol. The absence of information passing from ABD to E is ensured if there are no physical interactions between the two sets of systems after the protocol starts. While Z is private, we can time-shift any one-way communication from E to ABD, or non-interacting dynamics of E, to before the beginning of the protocol. A communication from E to ABD can be described by adding the communication as a system C to E, which interacts with E, then becomes isolated from E, and later becomes part of the devices. This makes time-shifting possible, after which we can assume that any physical processes relevant to the protocol act only on ABD. Of course, we insist that after time-shifting, the state of E is classical. This justifies the use of a single RV E to determine the state of E. In this work, we make few assumptions on the physics of D and formulate constraints on distributions purely in terms of the RVs produced by the protocol, conditional on E.
The formal restriction that E holds only classical side information allows for quantum E provided that there is an additional quantum system H independent of E such that the joint state at the start of the protocol of the systems ABD, H and E forms a quantum Markov chain ABD ↔ H ↔ E [29]. This is a quantum generalization of stating that E and ABD are independent conditionally on H. The system H needs to have no interaction with E and ABD after the start of the protocol. For example, H can be part of a generic decohering environment. The Markov chain property implies that after an extension of E, without loss of generality, the information held by E can be assumed to be classical. Operationally, this situation can be enforced if we trust or ensure that the devices have no long-term quantum memory.
Two assumptions constraining the possible distributions are required for randomness generation with Bell tests. The settings constraints restrict the settings distributions, and the non-signaling constraints enforce the absence of communication between the stations during a trial. There are different ways in which the settings distributions can be constrained. The strongest constraint we use is that Z i+1 is uniform, independent of Z ≤i , C ≤i and E.
In general, we can consider weaker constraints on Z i+1 . For example, Z i+1 may have any distribution that is independent of E but determined in a known way by the past. Most generally, Z i+1 has a distribution belonging to a specified set conditional on E and the past. If we later allow Z to become public, or if Z is from a public source of randomness, then Z i+1 must be conditionally independent of R ≤i given Z ≤i and E. This constraint can be avoided if we trust that the data that normally contributes to Z remains private, in which case we can exploit that our framework is flexible in how the experimental data relates to the random variables: If normally Z contains settings choices, we can instead let Z be empty, add the settings choices into the RV C and extract randomness from all or some of C. With this assignment of RVs, conditional independence is not required. However, for non-trivial randomness generation, it is then desirable to extract strictly more random bits than were required for the settings choices. Otherwise the randomness obtained may just be due to the input randomness with no other contribution.
To state the non-signaling constraints requires explicitly referencing each station's RVs. The non-signaling constraints are assumed to be satisfied for trial i conditional on the past and E. With conditioning on these RVs implicit and for two stations, the non-signaling constraints consist of the identities µ[A_i|X_i = x, Y_i = y] = µ[A_i|X_i = x] and µ[B_i|X_i = x, Y_i = y] = µ[B_i|Y_i = y] for all x and y, which assert remote settings independence of measurement outcomes. The non-signaling constraints can be strengthened using the semidefinite-programming hierarchy for quantum distributions [30] when the devices are assumed to be quantum.
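The non-signaling identities are linear constraints on the settings-conditional distribution and are easy to check numerically. The sketch below (the function name and the PR-box example are our illustrative choices) verifies them for a (2, 2, 2) distribution given as a table p[(x, y)][(a, b)]:

```python
def no_signaling(p, tol=1e-9):
    """Check the non-signaling constraints for a (2, 2, 2) settings-conditional
    distribution p[(x, y)][(a, b)]: A's marginal must not depend on y,
    and B's marginal must not depend on x."""
    for x in (0, 1):
        for a in (0, 1):
            m = [sum(p[(x, y)][(a, b)] for b in (0, 1)) for y in (0, 1)]
            if abs(m[0] - m[1]) > tol:
                return False
    for y in (0, 1):
        for b in (0, 1):
            m = [sum(p[(x, y)][(a, b)] for a in (0, 1)) for x in (0, 1)]
            if abs(m[0] - m[1]) > tol:
                return False
    return True

# A PR-box example: outcomes satisfy a XOR b = x AND y with probability 1.
# It is non-signaling even though it is neither local-realistic nor quantum.
pr = {(x, y): {(a, b): (0.5 if (a ^ b) == (x & y) else 0.0)
               for a in (0, 1) for b in (0, 1)}
      for x in (0, 1) for y in (0, 1)}
assert no_signaling(pr)
```

Both single-station marginals of the PR box are uniform, which is why the check passes despite the maximal correlations.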
In any given implementation of randomness generation, the settings and non-signaling constraints must be justified by physical arguments. The settings constraints require a source of random bits that are sufficiently independent of external entities at the last point in time where they had interactions with the devices and the stations. For protocols such as those discussed here, where the output's randomness is assured conditionally on the settings, the settings randomness can become known after the last interaction, so it is possible to use public sources of randomness. However, if the same public string is used by many randomness generators, a scattershot attack may be possible. To justify the non-signaling constraints, this randomness must not leak to the stations' devices before it is used to apply the device settings. Thus, we need to trust that it is possible to generate bits that are, before the time of use, unpredictable by the stations' devices. For the non-signaling constraints, we also rely on physical isolation of the stations during a trial, preferably enforced by relativistic causality. It usually goes without saying that we trust the timing, recording and computing devices that monitor and analyze the data according to the protocol. However, given the complexity of modern electronics and the prevalence of reliability and security issues, caution is advised.

D. Min-Entropy and Randomness Extraction
To be able to extract some amount of randomness from a bit-string RV D, it suffices to establish a min-entropy bound for the distribution of D. The min-entropy of D is H_min(D) = −log(p_max(D)), where p_max(D) = max_d P(D = d). By default, logarithms are base e, so entropy is expressed in nits (or e-bits), which simplifies calculus with entropic quantities. We convert to bits to determine string lengths as needed. We switch to logarithms base 2 in Sect. VIII for quantitative comparisons. Since we usually work with bounds on min-entropies rather than exact values, we say that D has min-entropy −log(p) or D has max-prob p if p_max ≤ p, and add the adjective "exact" to refer to −log(p_max) or p_max. If D has min-entropy σ log(2), then it is possible to extract close to σ near-uniform bits from D given some uniform seed bits. The actual number of near-uniform bits that can be safely extracted depends on |D|, how many uniform seed bits are used, the maximum acceptable TV distance from uniform of the near-uniform output bits, and the extractor used. An extractor E has as input a bit string D, a string of uniform seed bits S independent of all other relevant RVs, and the following parameters: a lower bound σ_h log(2) on the min-entropy of D, the desired number σ of uniform bits, and the maximum acceptable TV distance from uniform ε_x of the output. We write E(D, S; σ_h, σ, ε_x) for the length-σ bit string produced by the extractor. This bit string is close to uniform provided the input satisfies extractor constraints that depend on the specific extractor used. Formally, a strong extractor has the property that if the parameters n = |D|, l = |S|, σ_h, σ, ε_x satisfy its extractor constraints, S is an independent and uniform bit string, and D has min-entropy at least σ_h log(2), then TV(µ[E(D, S; σ_h, σ, ε_x)S], Unif ⊗ µ[S]) ≤ ε_x. Extractors used in this work are strong by default.
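For a small worked example of these definitions, max-prob and min-entropy can be computed directly from a distribution table (a toy sketch with an illustrative distribution; the base-2 conversion is shown because string lengths are counted in bits):

```python
import math

def max_prob(dist):
    """Max-prob p_max(D) = max_d P(D = d) for a distribution given as a dict."""
    return max(dist.values())

def min_entropy_bits(dist):
    """Min-entropy in bits, -log2(p_max(D)); the text works in nits
    (base e) by default and converts to bits for string lengths."""
    return -math.log2(max_prob(dist))

d = {'00': 0.4, '01': 0.3, '10': 0.2, '11': 0.1}
assert max_prob(d) == 0.4
h = min_entropy_bits(d)   # -log2(0.4), roughly 1.32 bits
assert abs(h - (-math.log2(0.4))) < 1e-12
```

With roughly 1.32 bits of min-entropy, at most one near-uniform bit could be extracted from this toy distribution, and only up to the extractor's error and constraints.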
The statement that (n, σ_h, σ, ε_x) satisfies the extractor constraints means that there exists l so that the full list of parameters (n, l, σ_h, σ, ε_x) satisfies the extractor constraints. We always include σ_h ≤ n, 1 ≤ σ ≤ σ_h and 0 < ε_x ≤ 1 among the extractor constraints. Extractor constraints are partially determined by information-theoretic bounds, but constraints for practical extractors are typically stronger. See Ref. [31]. In previous work [13], we used an implementation of Trevisan's strong extractor based on the framework of Mauerer, Portmann and Scholz [32] that we called the TMPS extractor E_TMPS. The extractor constraints for E_TMPS are not optimized. Simplified constraints are 0 < ε_x ≤ 1, 2 ≤ σ ≤ σ_h ≤ n and σ ∈ ℕ, together with a constraint relating the seed length l to the other parameters; see Ref. [13] for the smaller expression for l used by the implementation. The TMPS extractor is designed to be secure against quantum side information. In general, strong extractors can be made classical-proof, that is, secure against classical side information, with a modification of the extractor constraints. Some strong extractors, such as Trevisan's, can also be made quantum-proof, that is, secure against quantum side information, with a further modification of the constraints. See Ref. [32] for explanations and references to the relevant results, specifically Lem. A.3, Thm. B.4, Lem. B.5 and Lem. B.8. All our protocols assume strong extractors and are secure against classical side information. But they do not require that the strong extractor be classical- or quantum-proof: In one case, the relevant constraint modification is implicit in our proof, and in the other cases, we take advantage of special properties of our techniques. The TMPS extractor constraints given above include the modifications required for being quantum-proof; we did not roll back these modifications. The TMPS extractor also satisfies that, conditional on the seed bits, the output is a linear hash of D.
For bits, a linear hash of D is a parity of D, computed as ∑_i h_i D_i mod 2, where the h_i ∈ {0, 1} form the parity vector. We call extractors with this property linear extractors.
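The linear-extractor property can be illustrated with a toy Toeplitz-style parity hash. This is a sketch only: it demonstrates that each output bit is a parity of D with a seed-derived parity vector, but it does not implement the TMPS extractor and does not enforce any extractor constraints:

```python
import random

def parity_hash(bits, h):
    """Linear hash of a bit string: the parity sum_i h_i * bits_i mod 2."""
    return sum(hi * bi for hi, bi in zip(h, bits)) % 2

def toeplitz_extract(bits, seed_bits, out_len):
    """Toy linear extractor: each output bit is a parity of the input,
    with parity vectors taken as sliding windows of the seed (a Toeplitz
    matrix built from the seed)."""
    n = len(bits)
    assert len(seed_bits) == n + out_len - 1
    return [parity_hash(bits, seed_bits[j:j + n]) for j in range(out_len)]

random.seed(1)
d = [random.randrange(2) for _ in range(16)]
s = [random.randrange(2) for _ in range(16 + 4 - 1)]
out = toeplitz_extract(d, s, 4)
assert len(out) == 4 and all(b in (0, 1) for b in out)
```

Conditional on the seed s, every output bit is a fixed parity of d, which is exactly the linearity property used later for composing with probability estimation.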
Given that the extracted bits are allowed a non-zero distance from uniform, it is not necessary to satisfy a strict min-entropy bound on the extractor input D. It suffices for the distribution of D to have some acceptably small TV distance ε_h from a distribution with min-entropy σ_h log(2). We say that D has ε_h-smooth min-entropy σ_h log(2) if its distribution is within TV distance ε_h of one with min-entropy σ_h log(2). For the situation considered so far in this section, the error bound (or smoothness) ε_h and the extractor error ε_x can be added when applying the extractor to D. In this work, we generally work directly with the maximum probability. We say that D has ε_h-smooth max-prob p if it has ε_h-smooth min-entropy −log(p).
We generally want to generate bits that are near-uniform conditional on E and often other variables such as Z. For our analyses, E is not particularly an issue because our results hold uniformly for all values of E, that is, conditionally on {E = e} for each e. However, this is not the case for Z. Definition 3. The distribution µ of DZE has ε-smooth ZE-conditional max-prob p if the following two conditions hold: 1) For each ze there exists a subprobability distribution µ̃_ze of D such that µ̃_ze ≤ µ[D|ze] and µ̃_ze ≤ p; 2) the total weight of these subprobability distributions satisfies ∑_ze w(µ̃_ze)µ(ze) ≥ 1 − ε. The minimum p for which µ has ε-smooth ZE-conditional max-prob p is denoted by P^{u,ε}_{max,µ}(D|ZE). (The superscript u alludes to the uniformity of the bound with respect to ZE.) The distribution µ of DZE has ε-smooth average ZE-conditional max-prob p if there exists a distribution ν of DZE with TV(ν, µ) ≤ ε and ∑_ze max_d(ν(d|ze))ν(ze) ≤ p. The minimum p for which µ has ε-smooth average ZE-conditional max-prob p is denoted by P^ε_{max,µ}(D|ZE). The quantity H^ε_{min,µ}(D|ZE) = −log(P^ε_{max,µ}(D|ZE)) is the ε-smooth ZE-conditional min-entropy.
We refer to the smoothness parameters as error bounds. Observe that the definitions are monotonic in the error bound. For example, if P^ε_{max,µ} ≤ p and ε′ ≥ ε, then P^{ε′}_{max,µ} ≤ p. The quantity ∑_ze max_d(ν(d|ze))ν(ze) in the definition of P^ε_{max,µ} can be recognized as the (average) maximum guessing probability of D given Z and E (with respect to ν), whose negative logarithm is the guessing entropy defined, for example, in Ref. [26]. The relationships established below reprise results from the references. We use them to prove soundness of the first two protocols for composing probability estimation with randomness extractors (Thms. 19 and 20) but bypass them for soundness of the third (Thm. 21).
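The average maximum guessing probability ∑_ze max_d ν(d|ze) ν(ze) can be computed directly for a toy joint distribution. The sketch below (the distribution is an illustrative choice of ours) evaluates it and the corresponding guessing entropy:

```python
import math

def avg_guessing_prob(joint):
    """Average ZE-conditional maximum guessing probability of D:
    sum_{z,e} nu(z,e) * max_d nu(d|z,e) = sum_{z,e} max_d nu(d,z,e),
    for a joint distribution given as a dict (d, z, e) -> probability."""
    best = {}
    for (d, z, e), p in joint.items():
        best[(z, e)] = max(best.get((z, e), 0.0), p)
    return sum(best.values())

# Toy joint distribution of (D, Z, E): D is correlated with E when z = 0
# and uniform when z = 1.
joint = {
    (0, 0, 0): 0.200, (1, 0, 0): 0.050,
    (0, 0, 1): 0.050, (1, 0, 1): 0.200,
    (0, 1, 0): 0.125, (1, 1, 0): 0.125,
    (0, 1, 1): 0.125, (1, 1, 1): 0.125,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12
p_guess = avg_guessing_prob(joint)       # 0.2 + 0.2 + 0.125 + 0.125 = 0.65
guessing_entropy = -math.log(p_guess)    # in nits, matching the text's base e
```

Even though the marginal of D is uniform here, the side information (Z, E) raises the guessing probability from 0.5 to 0.65.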
We focus on probabilities rather than entropies because in this work, we achieve good performance for finite data by direct estimates of probabilities of actual events, not entropies of distributions. While entropy estimates may be considered the ultimate goal for existing applications, they are not fundamental in our approach. The focus on probabilities helps us to avoid introducing logarithms unnecessarily.
A summary of the relationships between conditional min-entropies and randomness extraction is given in Ref. [34] for the quantum case and can be specialized to the classical case. When so specialized, the definition of smooth conditional min-entropy in, for example, Ref. [34] differs from the one above in that Ref. [34] uses one of the fidelity-related distances. One such distance reduces to the Hellinger distance h for probability distributions, for which h² ≤ TV ≤ √2 h. Uniform or average bounds on Z-conditional max-probs with respect to E = e can be lifted to the ZE-conditional max-probs, as formalized by the next lemma. Lemma 4. Suppose that for all e, P^{u,ε}_{max,µ[·|e]}(D|Z) ≤ p. Then P^{u,ε}_{max,µ}(D|ZE) ≤ p. Similarly, if for all e, P^ε_{max,µ[·|e]}(D|Z) ≤ p, then P^ε_{max,µ}(D|ZE) ≤ p. Proof. For the first claim, for each e, let µ̃_ze be subprobability distributions witnessing P^{u,ε}_{max,µ[·|e]}(D|Z) ≤ p; the same subprobability distributions witness the corresponding bound for µ conditional on ZE. For the second claim, for each e, let ν_e witness P^ε_{max,µ[·|e]}(D|Z) ≤ p, and let ν be the distribution of DZE with ν[DZ|e] = ν_e and ν[E] = µ[E]. Then TV(ν, µ) ≤ ε, since the TV distance is the average over e of the conditional TV distances. Furthermore, ∑_ze max_d(ν(d|ze))ν(ze) = ∑_e µ(e) ∑_z max_d(ν_e(d|z))ν_e(z) ≤ p, as required for the conclusion.
The main relationships between conditional and average conditional max-probs are determined by the next lemma.
Proof. Suppose that P^{u,ε}_{max,µ}(D|ZE) ≤ p. Let µ̃_ze be subprobability distributions witnessing this inequality as required by the definition. Since p ≥ 1/|Rng(D)|, each µ̃_ze can be extended to a distribution ν[D|ze] with µ̃_ze ≤ ν[D|ze] and max_d ν(d|ze) ≤ p; with ν[ZE] = µ[ZE], the bound TV(ν, µ) ≤ ε then follows by Lem. 1 and averaging over ze.
For the second part of the lemma, suppose that P^ε_{max,µ}(D|ZE) ≤ pδ and let ν be a distribution witnessing this inequality. By definition, TV(ν, µ) ≤ ε and ∑_ze max_d(ν(d|ze))ν(ze) ≤ pδ, where we used the identities ν′[D|ze] = ν[D|ze] and Eq. 3, and applied the data-processing inequality.
so µ̃_ze is as required by the definition.
We remark for emphasis that, as noted in the proof of the lemma, the construction of ν′ satisfies ν′[ZE] = ν[ZE], where ν is the witness of the assumption in the second part.
The smoothness parameter for smooth conditional max-prob composes well with strong extractors. See also Ref. [26], Prop. 1.
Lemma 6. Suppose that the distribution µ of DZE satisfies P^{u,ε_h}_{max,µ}(D|ZE) ≤ 2^{−σ_h}, and S is a uniform and independent seed string with respect to µ. If (n = |D|, l = |S|, σ_h, σ, ε_x) satisfies the extractor constraints, then TV(µ[E(D, S; σ_h, σ, ε_x)SZE], Unif ⊗ µ[SZE]) ≤ ε_h + ε_x. Proof. Let µ̃_ze be subprobability distributions witnessing P^{u,ε_h}_{max,µ}(D|ZE) ≤ 2^{−σ_h} as required by the definition, and define ν as we did at the beginning of the proof of the first part of Lem. 5. By the data-processing inequality, TV(µ[E(D, S)SZE], ν[E(D, S)SZE]) ≤ TV(µ, ν) ≤ ε_h. The result now follows by the triangle inequality.
By means of the third part of Lem. 5, we can also compose smooth average conditional max-prob with a strong extractor, again with S a uniform and independent seed string with respect to µ; the statement is analogous to that of Lem. 6, and the result follows from Lem. 6.

E. Randomness Generation Protocols
A randomness generation protocol P is parameterized by σ, the requested number of uniform bits, and ε, the protocol error bound, defined as the TV distance from uniform of these bits. Minimally, the output consists of a string of σ bits. In addition, the output can contain an additional string of other random bits that may have been used internally for seeding the extractor or choosing settings, in case these bits can be used elsewhere. Further, the protocol may output a flag indicating success or failure. If the probability of success can be less than 1, the protocol may require as input a minimum acceptable probability of success κ. We write P = (P_X, P_S, P_P), where P_X is the principal σ-bit output, P_S is the possibly empty but potentially reusable string of seed and other random bits, and P_P is the success flag, which is 0 ("fail") or 1 ("pass"). Here, we have suppressed the arguments σ and ε (or σ, ε and κ) of P, P_X, P_S and P_P, since they are clear from context. The outputs of P are treated as RVs, jointly distributed with E and any other RVs relevant to the situation. In this context, we constrain the joint distribution of the RVs according to a model H. When P_S is non-empty, we assume that the marginal distribution µ[P_S] is known to the user.
The property that a protocol output satisfies the request is called soundness. We formulate soundness so that the TV distance is conditional on success and absorbs the probability of success criterion in a sense to be explained.
Definition 8. The randomness generation protocol P is (σ, ε)-sound with respect to E and model H if for all µ ∈ H, |P_X| = σ and there is a distribution ν_E of E such that P(P_P = 1) TV(µ[P_X P_S E|P_P = 1], Unif ⊗ µ[P_S] ⊗ ν_E) ≤ ε (Eq. 15). We normally use E = ZE, but E = E is an option if the settings choices are private. The definition ensures that P is nearly indistinguishable, namely within TV distance ε, from an ideal protocol with the same probability of success, namely such a protocol for which P_X is perfectly uniform and independent of other variables conditional on success.
An alternative definition of soundness is to make the minimum acceptable probability of success κ explicit and require that either P(P_P = 1) < κ or the conditional TV distance in Eq. 15 is bounded by ε. We refer to a protocol satisfying this property as (σ, ε, κ)-sound. The two definitions are closely related: if P is (σ, εκ)-sound, then it is (σ, ε, κ)-sound; conversely, if P is (σ, ε, κ)-sound, then it is (σ, max(ε, κ))-sound. Proof. Suppose that P is (σ, εκ)-sound. Consider any µ ∈ H and let ν_E be the distribution of E in Eq. 15. If P(P_P = 1) ≥ κ, then the desired TV distance is at most εκ/κ = ε. Since µ ∈ H is arbitrary, P is (σ, ε, κ)-sound.
For the other direction, again consider any µ ∈ H. If P(P_P = 1) < κ, then the left-hand side of Eq. 15 is at most κ, because the TV distance is bounded by 1. If P(P_P = 1) ≥ κ, let ν_E be such that the TV distance in Eq. 15 is bounded by ε. Then the full left-hand side of this equation is bounded by P(P_P = 1)ε ≤ ε. The conclusion follows.
We prefer to use (σ, ε)-soundness, since this is often more informative than (σ, ε, κ)-soundness, allowing for flexibility in how the protocol is used without looking inside the soundness proof. However, in both cases, soundness proofs may contain additional information about the relationship between the probability of success and the error bound that is useful when implementing the protocol in a larger context. For example, the soundness established in Thm. 19 establishes a relationship between relevant error bounds that does not match either definition, but readily implies forms of either one.
In studies of specific randomness generation protocols, an important consideration is completeness, which requires that the actual probability of success of the protocol is non-trivially large, preferably exponentially close to 1 with respect to relevant resource parameters. Completeness is readily satisfied by our protocols for typical Bell-test configurations.
The actual probability of success should be distinguished from any relevant minimum acceptable probability of success in view of the discussion above. Experimental implementations so far demonstrate that success probabilities are acceptable, but not arbitrarily close to 1. As discussed in the introduction, theory shows that there are randomness generation protocols using quantum systems that can achieve high success probabilities. Here we emphasize protocols for which the success probability is 1 by allowing for injection of banked randomness when insufficient randomness is available from the allotted resources. Assuming completeness, we can also take advantage of the ability to stop the protocol only when enough randomness is generated, but since this requires care when extracting near-uniform bits from potentially long strings, we do not develop this approach further here.

F. Test Factors and Test Supermartingales
Definition 10. A test supermartingale [16] with respect to a stochastic sequence R and model H is a stochastic sequence T = (T_i)_{i=0}^{N} with the properties that T_0 = 1, T_i ≥ 0 for all i, T_i is determined by R_{≤i} and the governing distribution, and E(T_{i+1}|R_{≤i}) ≤ T_i for all distributions in H. Here R captures the relevant information that accumulates in a sequence of trials. It does not need to be accessible to the experimenter. The σ-algebras induced by R_{≤i} define the nested sequence of σ-algebras used in more general formulations. Between trials i and i + 1, the sequence R_{≤i} is called the past. In the definition, we allow for T_i to depend on the governing distribution µ. With this, for a given µ, T_i is a function of R_{≤i}. Below, when stating that RVs are determined, we implicitly include the possibility of dependence on µ without mention. The µ-dependence can arise through expressions such as E_µ(G|R_{≤i}) for some G, which is determined by R_{≤i} given µ. One way to formalize this is to consider µ-parameterized families of RVs. We do not make this explicit and simply allow for our RVs to be implicitly parameterized by µ. We note that the governing distribution in a given experiment or situation is fixed but usually unknown, with most of its features inaccessible. As a result, many RVs used in mathematical arguments cannot be observed even in principle. Nevertheless, they play important roles in establishing relationships between observed and inferred quantities.
We can define test supermartingales in terms of such sequences: Let F be a stochastic sequence satisfying the three conditions that F_{i+1} ≥ 0, F_{i+1} is determined by R_{≤(i+1)}, and E(F_{i+1}|R_{≤i}) ≤ 1 for all distributions in H. Then the stochastic sequence with members T_0 = 1 and T_{i+1} = T_i F_{i+1} is a test supermartingale, since E(T_{i+1}|R_{≤i}) = T_i E(F_{i+1}|R_{≤i}) ≤ T_i, where we pulled out the determined quantity T_i from the conditional expectation. In this work, we construct test supermartingales from sequences F with the above properties. We refer to any such sequence as a sequence of test factors, without necessarily making the associated test supermartingale explicit. We extend the terminology by calling an RV F a test factor with respect to H if F ≥ 0 and E(F) ≤ 1 for all distributions in H.
For an overview of test supermartingales and their properties, see Ref. [16]. The notion of test supermartingales and proofs of their basic properties are due to Ville [35] in the same work that introduced the notion of martingales. The name "test supermartingale" appears to have been introduced in Ref. [16]. Test supermartingales play an important theoretical role in proving many results in martingale theory, including that of proving tail bounds for large classes of martingales. They have been studied and applied to Bell tests [14,15,36].
The definition implies that for a test supermartingale T, for all n, E(T_n) ≤ 1. This follows inductively from E(T_{i+1}) = E(E(T_{i+1}|R_{≤i})) ≤ E(T_i) and T_0 = 1. An application of Markov's inequality shows that for all ε > 0, P(T_n ≥ 1/ε) ≤ ε. The bound extends to the running maximum T* = max_{1≤i≤n} T_i: for all ε > 0, P(T* ≥ 1/ε) ≤ ε. Proof. The theorem follows from Doob's maximal inequalities. The particular inequality we need is normally stated for a non-negative submartingale T′ in the form P(max_{1≤i≤n} T′_i ≥ λ) ≤ E(T′_n)/λ, where λ = 1/ε for our purposes. Statements and proofs of this inequality are readily found online. A textbook treatment is in Ref. [37] Ch. V, Cor. 22. To apply the above maximal inequality to the test supermartingale T, let T̄_{i+1} = E(T_{i+1}|R_{≤i}). Note that T̄_{i+1} is determined by R_{≤i} and, by the definition of supermartingales, T̄_{i+1} ≤ T_i. Define T′_0 = 1 and T′_i = ∏_{1≤j≤i} T_j/T̄_j. By pulling out determined quantities from the conditional expectation, we get E(T′_{i+1}|R_{≤i}) = T′_i E(T_{i+1}|R_{≤i})/T̄_{i+1} = T′_i. Hence T′ is a test martingale. In particular, it is a non-negative submartingale with E(T′_n) = 1. We claim that T′_i ≥ T_i. This holds for i = 0. For a proof by induction, compute T′_{i+1} = T′_i T_{i+1}/T̄_{i+1} ≥ T_i T_{i+1}/T̄_{i+1} ≥ T_{i+1}, where we used T̄_{i+1} ≤ T_i. This gives max_{1≤i≤n} T′_i ≥ T*, so the event {T* ≥ λ} implies the event in the maximal inequality, and monotonicity of probabilities implies the theorem.
One can produce a test supermartingale adaptively by determining the test factors F_{i+1} to be used at the next trial. If the i'th trial's data is R_i, including any incidental information obtained, F_{i+1} is expressed as a function of R_{≤i} and data from the (i+1)'th trial (a past-parameterized function of R_{i+1}), and constructed to satisfy F_{i+1} ≥ 0 and E(F_{i+1}|R_{≤i}) ≤ 1 for any distribution in the model H. Note that in between trials, we can effectively stop the experiment by assigning all future F_{i+1} = 1 conditional on the past. This is equivalent to constructing the stopped process relative to a stopping rule. This argument also shows that the stopped process is still a test supermartingale. Here we consider only bounded-length test supermartingales (with very large bound N if necessary), so we are not concerned with questions of convergence as N → ∞. But we note that since E(|T_i|) = E(T_i) ≤ 1 for all i, Doob's martingale convergence theorem applies, and lim_{i→∞} T_i exists almost surely and is integrable. Furthermore, this limit also has expectation at most 1. See, for example, Ref. [37] Ch. V, Thm. 28.
More generally, we use test supermartingales for estimating lower bounds on products of positive stochastic sequences G. Such lower bounds are associated with unbounded-above confidence intervals. We need the following definition: Let X, U and V be RVs with U ≤ V. The interval [U, V] is a confidence interval for X at level ε with respect to H if P(U ≤ X ≤ V) ≥ 1 − ε for all distributions in H; the probability P(U ≤ X ≤ V) is called the coverage probability.
As noted above, the RVs U, V and X may be µ-dependent. For textbook examples of confidence intervals, X is a parameter determined by µ, and U and V are calculated from an estimation error. We need the full generality of the definition above. The quantity ε in the definition is a significance level, which corresponds to a confidence level of 1 − ε.

Lemma 13. Let F and G be two stochastic sequences with F_{i+1} ≥ 0, G_{i+1} > 0, both determined by R_{≤(i+1)}, and E(F_{i+1}/G_{i+1}|R_{≤i}) ≤ 1 for all distributions in H. Set T_n = ∏_{1≤i≤n} F_i and U_n = ∏_{1≤i≤n} G_i. Then for every ε > 0, [T_n ε, ∞) is a confidence interval for U_n at level ε.
Proof. The assumptions imply that the F_{i+1}/G_{i+1} form a sequence of test factors with respect to H and generate the test supermartingale T/U, where division in this expression is term-by-term. Therefore, P(max_{1≤i≤n} T_i/U_i ≥ 1/ε) ≤ ε. On the complementary event, T_n/U_n < 1/ε, so U_n > T_n ε. It follows that [T_n ε, ∞) is a confidence interval for U_n at level ε.

G. From Bell Functions to Test Factors
Consider a specific trial subject to settings and non-signaling constraints, where the settings distribution is fixed and known. To simplify notation, we omit trial indices and conditioning on the past. Let ν = µ[Z] be the settings distribution. A Bell function B maps settings and outcomes to real values and satisfies E(B(CZ)) ≤ 0 for all local realistic (LR) distributions with the given settings distribution. The set of LR distributions given that the settings distribution is ν consists of mixtures of non-signaling distributions with outcomes determined by the settings. These are distributions such that C = AB is a function of Z = XY for which A does not depend on Y and B does not depend on X. See Sect. VIII, Eq. 158 for more detail. The expression E(B(CZ)) ≤ 0 is closely related to traditional Bell inequalities, which may be expressed in the conditional form ∑_z E(B′(Cz)|Z = z) ≤ 0. Provided that ν(z) = 0 implies B′(cz) = 0 for all c, any Bell inequality in this form can be converted to a Bell function by defining B(cz) = B′(cz)/ν(z). Conversely, a Bell function B for settings distribution ν determines a Bell inequality by defining B′(cz) = B(cz)ν(z). Bell inequalities apply to LR distributions independently of what the settings probabilities are, and these settings probabilities are then considered to be the free choice of the experimenter. We do not use this perspective here.
Let −l be a lower bound for the Bell function B with l > 0. Then F = (B + l)/l is a test factor with respect to LR distributions. Such test factors can provide an effective method for rejecting local realism (LR) at high significance levels by use of Eq. 17. As an example, consider the (2, 2, 2) Bell-test configuration with the uniform settings distribution, ν(z) = 1/4 for each z. The ranges of the settings X, Y and the outcomes A, B are {0, 1}. The following is a Bell function: B(abxy) = |a − b|(⟦x = 1 and y = 1⟧ − ⟦x = 0 or y = 0⟧). The inequality E(B(CZ)) ≤ 0 is equivalent to the well-known Clauser-Horne-Shimony-Holt (CHSH) inequality [38]. We give the function in this form for simplicity and because one way to verify that B is a Bell function is to note that d(a, b) = |a − b| satisfies the triangle inequality, so this Bell function belongs to the class of distance-based Bell functions. See Refs. [39,40]. Since for all arguments, exactly one of the expressions inside ⟦. . .⟧ is true, the minimum value of B is −1. Thus B + 1 is a test factor. More generally, 1 + λB is a test factor for any 0 ≤ λ ≤ 1 [15]. By optimizing these test factors, asymptotically optimal statistical strength (expected −log(p-value) per trial) for rejecting LR can be achieved [14,15]. While probability estimation uses probability estimation factors (PEFs) to assign numerical values to experimental outcomes, it does not require Bell functions or even Bell-test configurations. The connection of Bell functions to test factors serves as an instructive example and can yield PEFs useful for probability estimation, as witnessed implicitly by Ref. [13]. See also Sect. VI A. In general, optimal PEFs cannot be constructed from Bell functions. Therefore, in this work we prefer to directly optimize PEFs without referencing Bell functions; see Sects. IV and VI.
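A distance-based CHSH Bell function and its test factors can be checked by enumeration. The sketch below uses the explicit form B = |a − b| on settings (1, 1) and −|a − b| otherwise (our rendering of a distance-based CHSH Bell function with uniform settings; the ideal quantum win probability cos²(π/8) is the standard CHSH-game value) and verifies E(B) ≤ 0 for every deterministic LR strategy, with a positive value for the ideal quantum strategy:

```python
import math
from itertools import product

def bell_chsh(a, b, x, y):
    """Distance-based CHSH Bell function for uniform settings (nu(z) = 1/4):
    B = |a - b| on settings (1, 1) and -|a - b| otherwise, so min B = -1."""
    return abs(a - b) * (1 if (x, y) == (1, 1) else -1)

# E(B) <= 0 for all deterministic LR strategies a = f(x), b = g(y); the
# mixtures follow by linearity. Also check that B + 1 is a test factor.
for f in product((0, 1), repeat=2):      # f[x] is A's outcome at setting x
    for g in product((0, 1), repeat=2):  # g[y] is B's outcome at setting y
        e = sum(0.25 * bell_chsh(f[x], g[y], x, y)
                for x in (0, 1) for y in (0, 1))
        assert e <= 1e-12
        assert all(1 + bell_chsh(f[x], g[y], x, y) >= 0
                   for x in (0, 1) for y in (0, 1))

# Ideal quantum strategy: P(a != b | 1,1) = cos^2(pi/8) and
# P(a != b | other settings) = sin^2(pi/8), giving E(B) > 0.
e_q = 0.25 * (math.cos(math.pi / 8) ** 2 - 3 * math.sin(math.pi / 8) ** 2)
assert e_q > 0.1
```

The positive quantum expectation is what makes the associated test factors accumulate evidence against LR over many trials.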

A. Probability Estimation
Consider RVs C, Z and E whose joint distribution is constrained to be in the model H.
In an experiment, we observe Z and C, but not E. For this section, it is not necessary to structure the RVs as stochastic sequences, so we treat D, C, Z and R as single RVs, but we follow the conventions introduced in Sect. II B. In particular, we assume that D is determined by C, and C is determined by R. There is no loss of generality in this assumption; it simply allows us to omit C (or D) as arguments to functions and in conditioners when R is already present. The distinction between C and R only appears when we consider time-ordered trials, so for the remainder of this section, only R is used.
We focus on probability estimates with coverage probabilities that do not depend on E, formalized as follows.
Definition 14. Let ε ≥ 0. An ε-uniform probability estimator for D given Z, written ε-UPE(D:R|Z; E, H), is a function U of Rng(RZ) × [0, 1] such that it is monotone non-decreasing in the second argument (the confidence level) and, for each e, µ ∈ H and α ∈ [0, 1], P(µ(D|Ze) ≤ U(RZ, α)|e) ≥ α − ε. The level ε of a probability estimator relates to the smoothness parameter for smooth min-entropy via the relationships established below in Sect. III B. We also use the term error bound to refer to the level of a probability estimator. The first condition on U ensures that the confidence upper bounds that it provides are non-decreasing with confidence level, so that the corresponding confidence intervals are consistently nested. The second is the required minimum coverage probability for confidence regions at a given confidence level.
Our inclusion of the random variable E here and in the next definition is in a sense redundant: Uniformity of the estimator means that we could instead have considered the model H of distributions on Z and R consisting of distributions µ[RZ|e] over all µ ∈ H and all e. We refer to E explicitly because of the role played by external entities in this work.
For the most general results, we need a softening of the above definition. The softening adds smoothing and averaging similar to what is needed to define smooth average conditional min-entropy.
Definition 15. Let ε ≥ 0. A function F of Rng(RZ) is an ε-soft UPE(D:R|Z; E, H) if for all e and distributions µ in H the following hold: 1) There exist subprobability distributions μ̃_z(DR) such that μ̃_z(dr) ≤ µ(dr|ze) for all drz and ∑_z µ(z|e) ∑_{dr} μ̃_z(dr) ≥ 1 − ε. 2) There exists a non-negative function q(R|DZ) of DRZ such that ∑_r q(r|dz) = 1 for all dz, and μ̃_z(dr) ≤ F(rz)q(r|dz) for all drz. Note that in the definition of soft probability estimators, we have chosen to leave the dependence of μ̃_z and q on e implicit.
The notion of a soft UPE is weaker than that of a UPE; see the next lemma. Note that in this definition we explicitly consider the joint distribution of DR, even though we assume that D is determined by R. Considering the joint distribution helps simplify the notation for probabilities. However, we take this assumption into account by not explicitly including D as an argument to F. In this definition, for any given e, F estimates the probability of d conditional on z but requires r. One can interpret F(rz)q(r|dz) as an implicit and distribution-dependent estimate of the probability of dr given z, where q(r|dz) accounts for the probability of r conditional on dz.
Lemma 16. If F is an ε-UPE, then it is an ε-soft UPE.

B. From Probability Estimation to Min-Entropy
The next lemma shows that ε-soft UPEs with constant upper bounds certify smooth min-entropy.
Lemma 17. Suppose that F is an ε-soft UPE(D:R|Z; E, H) such that F ≤ p for some constant p. Then for each e and µ ∈ H, P^{u,ε}_{max,µ[DZ|e]}(D|Z) ≤ p, and D has ε-smooth Z-conditional min-entropy at least −log₂(p) given {E = e}.

Proof. Fix e and let μ̃_z(DR) and q(R|DZ) be as in the definition of soft UPEs. According to our conventions, μ̃_z(d) = ∑_r μ̃_z(dr) ≤ ∑_r F(rz)q(r|dz) ≤ p. Further, μ̃_z(d) ≤ µ(d|ze) for all dz, as can be seen by summing the defining inequality for μ̃_z over r. It follows that the μ̃_z[D] witness that P^{u,ε}_{max,µ[DZ|e]}(D|Z) ≤ p. For the last statement, apply Lem. 4.
A weaker relationship holds for general soft UPEs.
Lemma 18. Suppose that F is an ε-soft UPE(D:R|Z; E, H), let p > 0, and for each e write κ_e = µ(F ≤ p|e). Then, conditionally on {F ≤ p} and {E = e}, D has (ε/κ_e)-smooth average Z-conditional max-prob at most p/κ_e.

Proof. It suffices to construct a distribution ν close to µ[·|F ≤ p, e] whose average Z-conditional maximum probability of D is at most p/κ_e; one can then apply the definition of smooth average max-prob and Lem. 4 to complete the proof. For the remainder of the proof, e is fixed, so we simplify the notation by universally conditioning on {E = e} and omitting the explicit condition. Further, we omit e from suffixes; thus κ = κ_e from here on. Let χ(RZ) = ⟦F(RZ) ≤ p⟧. Then κ = E(χ). Let κ_z = µ(F ≤ p|z). We have ∑_z κ_z µ(z) = κ and κ_z = µ(z|F ≤ p)κ/µ(z).
Let μ̃_z and q witness that F is an ε-soft UPE(D:R|Z; E, H). Define ν̃(drz) = μ̃_z(dr)χ(rz)µ(z)/κ. The weight of ν̃ satisfies w(ν̃) = ∑_{drz} μ̃_z(dr)χ(rz)µ(z)/κ ≥ (κ − ε)/κ = 1 − ε/κ. Thus ν̃ is a subprobability distribution of weight at least 1 − ε/κ. We use it to construct the distribution ν witnessing the conclusion of the lemma. For each drz we bound ν̃(drz) = μ̃_z(dr)χ(rz)µ(z)/κ ≤ F(rz)q(r|dz)χ(rz)µ(z)/κ ≤ p q(r|dz)χ(rz)µ(z)/κ, from which we get ν̃(dz)/µ(z|F ≤ p) ≤ p/κ_z by summing over r. Define ν̃[D|z](d) = ν̃(dz)/µ(z|F ≤ p) and let w_z = w(ν̃[D|z]), where this extends the conditional probability notation to the subprobability distribution ν̃ with the understanding that the conditionals are with respect to µ given {F ≤ p}. Applying the first step of Eq. 29 and continuing from there bounds the weights w_z from below, and completing ν̃ to a distribution ν preserves the bound max_d ν(d|z) ≤ p/κ_z. For the average maximum probability of ν, we then get ∑_z µ(z|F ≤ p) max_d ν(d|z) ≤ ∑_z µ(z|F ≤ p) p/κ_z = p/κ, which together with the argument at the beginning of the proof establishes the lemma.
We remark that the witness ν constructed in the proof of the lemma satisfies ν[ZE] = µ[ZE|F ≤ p]. We refer to this ν in the next section.

C. Protocols
Our goal is to construct probability estimators from test supermartingales and use them in randomness generation by composition with an extractor. The required supermartingales are introduced in Sect. IV. Here we give three ways in which probability estimators can be composed with an extractor for randomness generation protocols. For the first, given a probability estimator, we can estimate smooth min-entropy and compose with an extractor by chaining Lem. 18 with Lem. 7. The second draws on banked randomness to avoid failure. The third requires a linear extractor. The first protocol is given in the following theorem: Theorem 19. Let F be an ε_h-soft UPE(D:R|Z; E, H), D a bitstring of length n, and let p, δ > 0 satisfy P(F ≤ pδ) = κ and pδ ≥ 1/2^n. Write σ_h = −log₂(p) and suppose that E is a strong extractor such that (n, l, σ_h, σ, ε_x) satisfies the extractor constraints. Let S be a length-l uniform and independent seed string. Abbreviate E = E(D, S; σ_h, σ, ε_x). Then P_X = E, P_S = S, P_P = ⟦F ≤ pδ⟧ defines a (σ, ε_x + 2ε_h + δ)-sound randomness generation protocol.
The remarks after Lem. 5 apply here as well. The protocol performance and soundness proof of Thm. 19 are closely related to the performance and proof of the Protocol Soundness Theorem in Ref. [13], which in turn are based on results of Refs. [12,26]. The parameters in these theorems implicitly compensate for the fact that we do not require the extractor to be classical-proof; see the comments on extractor constraints in Sect. II D.
The next protocol has no chance of failure but needs access to banked randomness. The banked randomness cannot be randomness from previous instances of the protocol involving the same devices. Let U be a soft UPE(D:R|Z; E) for H. Let σ ∈ N⁺ and ε > 0 be the requested number of bits and error bound. Assume we have access to a source of uniform bits S that is independent of all other relevant RVs. For banked randomness, we also have access to a second source of such random bits S_b. Let E be a strong extractor. We define a randomness generation protocol P(σ, ε; U) by the following steps: 0. Choose n, σ_h, ε_h > 0, ε_x > 0 and l so that (n + σ_h, l, σ_h, σ, ε_x) satisfies the extractor constraints and ε_h + ε_x = ε. 1. Perform trials to obtain values r and z of R and Z expressed as bit strings, with |D| = n. Let d be the value taken by D. (Recall that by our conventions, D is determined by R.) 2. Determine p = U(rz, 1 − ε_h) and let k = max(0, ⌈σ_h − log₂(1/p)⌉).

3. Obtain values s_{≤l} and (s_b)_{≤k} of the seed bits S_{≤l} and the banked bits (S_b)_{≤k}. 4. Set d′ = d (s_b)_{≤k} 0^{σ_h−k}, a bit string of length n + σ_h. 5. If the distribution of Z is known and independent of E, let z′ = z; otherwise let z′ be the empty string. 6. Output P_X = E(d′, s_{≤l}; σ_h, σ, ε_x), P_S = z′ s_{≤l} and P_P = 1.
Note that we have left the parameter choices made in the first step free, subject to the given constraints. They can be made to minimize the expected amount of banked randomness needed given information on device performance and resource bounds. It is important that the choices be made before obtaining r, z and the seed bits. That is, they are conditionally independent of these RVs given the pre-protocol past, where H is satisfied conditionally on this past. The length of D must be fixed before performing trials, but this does not preclude stopping trials early if an adaptive probability estimator is used. In this case D can be zero-filled to length n. The requirement for fixed length n is imposed by the extractor properties: we do not have an extractor that can take advantage of variable- and unbounded-length inputs.
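The padding arithmetic of step 2 can be sketched as follows; the ceiling in k is read from context, and the numerical inputs are hypothetical:

```python
import math

def banked_bits(p, sigma_h):
    """Number of banked random bits k = max(0, ceil(sigma_h - log2(1/p)))
    needed to pad the certified min-entropy up to sigma_h (protocol step 2)."""
    return max(0, math.ceil(sigma_h - math.log2(1 / p)))

# A strong probability estimate needs no banked bits at all.
assert banked_bits(2**-60, 50) == 0
# A weak estimate p = 2**-30 against sigma_h = 50 requires 20 banked bits.
assert banked_bits(2**-30, 50) == 20
# In the worst case (p = 1), all sigma_h bits come from the bank.
assert banked_bits(1.0, 8) == 8
```

The banked bits plus the zero padding 0^{σ_h−k} bring the extractor input d′ to the fixed length n + σ_h required by step 0.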
Theorem 20. The protocol P(σ, ; U) is a (σ, )-sound and complete randomness generation protocol with respect to H.
Proof. The protocol has zero probability of failure, so completeness is immediate. But note that for the protocol to be non-trivial, the expected number of banked random bits used should be small compared to σ.
Let K, D′ and Z be the RVs corresponding to the values k, d′ and z constructed in the protocol. Given the parameter choices that are made first, K is determined by R and Z. Define S_K = (S_b)_{≤K} 0^{σ_h−K}, D′ = D S_K and R′ = R S_K. Taking advantage of uniformity in e, we fix e and implicitly condition everything on {E = e}. Let μ̃_z and q be the functions witnessing that p = U(rz, 1 − ε_h) is an ε_h-soft UPE. Now let ν̃_z be the family of subprobability distributions on D′R′ defined by ν̃_z(d′r′) = μ̃_z(dr)µ(s_K|drz) = μ̃_z(dr)2^{−k}, where we used that K is determined by RZ.
It follows that the function d′r′z → 2^{−σ_h} is an ε_h-soft UPE(D′:R′|Z; E, H) with a constant upper bound, witnessed by ν̃_z and q′(r′|d′z) = q(r|dz). From Lem. 17, D′ has ε_h-smooth Z-conditional max-prob at most 2^{−σ_h} given {E = e} for each e. The theorem now follows from the composition lemma Lem. 6 after taking note of Lem. 4.
The third protocol is the same as the second except that if the estimated probability is too large in step 2, the protocol fails. In this case, banked random bits are not used. This gives a protocol Q(σ, ε; U) whose steps are: 0. Choose n, σ_h, ε_h > 0, ε_x > 0 and l so that (n, l, −log₂(2^{−σ_h} + 2^{−n}), σ, ε_x) satisfies the extractor constraints, ε_h + ε_x = ε and 2^{−n} ≤ 2^{−σ_h}. 1. Perform trials to obtain values r and z of R and Z expressed as bit strings, with |D| = n.
2. Determine p = U(rz, 1 − ε_h); if p > 2^{−σ_h}, return with Q_X and Q_S the empty strings and Q_P = 0 to indicate failure.
Theorem 21. The protocol Q(σ, ε; U) defined above when used with a linear strong extractor is a (σ, ε)-sound randomness generation protocol with respect to H.
Proof. Because the distribution of Z conditional on {Q P = 1} can differ significantly from its initial distribution in a way that is not known ahead of time, Z's randomness cannot be reused. Consequently, in this proof, Z and E always occur together, so we let Z stand for the joint RV ZE throughout.
The proof first replaces the distribution µ[DZSQ_P] by ν[DZSQ_P] so that ν[D|Z] ≤ 2^{−σ_h} + 2^{−n} and, conditionally on {Q_P = 1}, the distribution's change is small in a sense to be defined. Since the conclusion only depends on {Q_P = 1}, changes when {Q_P = 0} can be arbitrary. We therefore define TV_pass so that for global distributions ν and ν′ satisfying ν[Q_P] = ν′[Q_P] and for all RVs U, TV_pass(ν[UQ_P], ν′[UQ_P]) = (1/2)∑_u |ν(U = u, Q_P = 1) − ν′(U = u, Q_P = 1)|. Note that TV_pass is the same as the total variation distance between ν[M(UQ_P)Q_P] and ν′[M(UQ_P)Q_P] for any process M satisfying M(U, 1) = U and M(U, 0) = v for a fixed value v. Such processes forget U when Q_P = 0. Note that (σ, ε)-soundness is defined in terms of TV_pass. Write κ = P(Q_P = 1) and p = 2^{−σ_h}. We omit the subscript ≤l from S_{≤l}. To construct ν we refine the proof of Lem. 18. Write F(rz) = U(rz, 1 − ε_h), and let μ̃_z and q witness that F is an ε_h-soft UPE(D:R|Z; E, H). The event {Q_P = 1} is the same as {F ≤ p}. The distribution ν to be constructed satisfies ν[ZSQ_P] = µ[ZSQ_P]. We maintain that S is uniform and independent of the other RVs for both ν and µ. Thus it suffices to construct ν[DZQ_P] and define ν = ν[DZQ_P] ⊗ Unif_S. We start by defining ν[DZ|Q_P = 1] here to be ν[DZ] as constructed in the proof of Lem. 18. Consistent with the notation in that proof, we also use the notation ν_z = ν[D|z, Q_P = 1]. Note that ν[Z|Q_P = 1](z) = µ(z|Q_P = 1).
We then set ν[D|z, Q_P = 0] = Unif_{Rng(D)} and ν[ZQ_P] = µ[ZQ_P]. This ensures that ν[ZSQ_P] = µ[ZSQ_P] after adding in the uniform and independent RV S. The distribution ν satisfies the following additional property by construction: ν[D|z] = κ_z ν_z + (1 − κ_z)Unif_{Rng(D)} ≤ 2^{−σ_h} + 2^{−n}, in view of Eq. 31 and the preceding paragraph.
We can now apply the extractor guarantee conditionally on Z = z for each z, noting that the marginal distribution on Z is the same for both arguments. From here on we abbreviate E = E(D, S).
Before proving this inequality we use it to prove the theorem as follows. It remains to establish Eq. 41. We start by taking advantage of the linearity of the extractor. Define R_s = Rng(E(D, s)) and r_s = |R_s| ≤ |Rng(E(D, S))| = 2^σ. Since E is a linear extractor, Rng(D) can be treated as a vector space over a finite field. For x′ ∉ R_s, ν(E = x′, s|z, Q_P = b) = 0. For x′ ∈ R_s, we can move s into the conditioner in each expression and use independence and uniformity of S with respect to Z and Q_P to get ν(E = x′|zs) = ν(E = x′|zs, Q_P = 1)µ(Q_P = 1|z) + ν(E = x′|zs, Q_P = 0)µ(Q_P = 0|z). Abbreviate ν_zs = ν[E|zs] and ν_zsb = ν[E|zs, Q_P = b], which are conditional distributions of E. Then ν_zs0(x′) = ⟦x′ ∈ R_s⟧/r_s. Write κ_z = µ(Q_P = 1|z) = µ(Q_P = 1|zs) for the conditional passing probabilities. With these definitions, we can write ν_zs = κ_z ν_zs1 + (1 − κ_z)ν_zs0 and calculate the desired bound. Here, we used the fact that ∑_{x′} f(x′)⟦f(x′) > 0⟧ is monotone increasing in f, noting that 1/r_s ≥ 2^{−σ}.
Since ν(s|z) = ν(s|z, Q_P = 1) = Unif_S(s), we can apply Eq. 2 to both ν and µ. Applying the previous two displayed equations and the inequality before them, we conclude that Eq. 41 holds for all z.

A. Standard Models for Sequences of Trials
Each of Thms. 19, 20, and 21 reduces the problem of randomness generation protocols to that of probability estimation. We now consider the situation where CZ is a stochastic sequence of n trials, with the distributions of the i'th trial RVs C_iZ_i in a model C_i conditional on the past and on E. For the remainder of the paper, the conditioning on E applies universally, and relevant statements are uniform in the values of E. We therefore no longer mention E explicitly. For the most general treatment, we define R_i = C_iR′_i, where R′_i is additional information obtainable in a trial and R_0 = R′_0 is information available initially. Test supermartingales and test factors are with respect to R_{≤i}Z_{≤i}. According to our convention, D_i is a function of C_i. Other situations can be cast into this form by changing the definition of C_i, for example by extending C_i with other information, and modifying models accordingly.
Formally, we are considering models H(C) of distributions of RZ defined by a family of conditionally defined models C_{i+1|r_{≤i}z_{≤i}} of C_{i+1}Z_{i+1}. H(C) consists of the distributions µ with the following two properties. First, for all i and r_{≤i}z_{≤i}, µ[C_{i+1}Z_{i+1}|r_{≤i}z_{≤i}] ∈ C_{i+1|r_{≤i}z_{≤i}}. Second, µ satisfies that Z_{i+1} is independent of R_{≤i} conditionally on Z_{≤i} (and E). Models H(C) satisfying these conditions are called standard. The second condition is needed in order to be able to estimate Z-conditional probabilities of D and corresponds to the Markov-chain condition in the entropy accumulation framework [25]. In many cases, the sets C_{i+1|r_{≤i}z_{≤i}} do not depend on r_{≤i}, but we take advantage of dependence on z_{≤i}. In our applications, the sets capture the settings and non-signaling constraints. For simplicity, we write C_{i+1} = C_{i+1|r_{≤i}z_{≤i}}, leaving the conditional parameters implicit.
Normally, models for trials are convex and closed. If Z is non-trivial, the second condition on standard models prevents H(C) from being convex and closed for n > 1, but we note that our results generally extend to the convex closure of the model used.

B. UPE Constructions
Let F_{i+1} be non-negative functions of C_{i+1}Z_{i+1}, parameterized by R_{≤i} and Z_{≤i}. We call such functions "past-parameterized". Let T_0 = 1 and T_{i+1} = T_iF_{i+1}(C_{i+1}Z_{i+1}), so that T_i = ∏_{j=1}^{i} F_j. We choose the F_i so that they are probability estimation factors according to the following definition.
Definition 22. Let β > 0, and let C be any model, not necessarily convex. A probability estimation factor (PEF) with power β for C and D|Z is a non-negative RV F = F(CZ) such that for all ν ∈ C, E_ν(F · ν(D|Z)^β) ≤ 1.
As usual, if the parameters are clear from context, we may omit them. In particular, unless differently named RVs are involved, PEFs are always for D|Z.
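For a finite model, the defining inequality of Def. 22 can be checked directly. The sketch below does this for a hypothetical toy model (one distribution, one trivial setting); the outcome sets and candidate factors are arbitrary choices, not PEFs for any Bell-test configuration:

```python
import itertools

def is_pef(F, beta, model, D, cs, zs, tol=1e-9):
    """Check the PEF condition E_nu(F * nu(D|Z)^beta) <= 1 for every
    distribution nu in a finite model.  Each nu is a dict (c, z) -> prob,
    D maps outcomes c to the extracted value d."""
    for nu in model:
        e = 0.0
        for c, z in itertools.product(cs, zs):
            p_cz = nu.get((c, z), 0.0)
            if p_cz == 0.0:
                continue  # probability-zero values are omitted from the sum
            p_z = sum(nu.get((cc, z), 0.0) for cc in cs)
            p_dz = sum(nu.get((cc, z), 0.0) for cc in cs if D(cc) == D(c))
            e += p_cz * F(c, z) * (p_dz / p_z) ** beta
        if e > 1.0 + tol:
            return False
    return True

cs, zs = (0, 1), (0,)
uniform = {(c, 0): 0.5 for c in cs}
# F = 1 is always a PEF (its log-prob rate is 0).
assert is_pef(lambda c, z: 1.0, beta=0.5, model=[uniform], D=lambda c: c, cs=cs, zs=zs)
# A factor that overweights one outcome violates the condition:
# E = 0.5 * 5.0 * 0.5 = 1.25 > 1.
bad = lambda c, z: 5.0 if c == 0 else 0.0
assert not is_pef(bad, beta=1.0, model=[uniform], D=lambda c: c, cs=cs, zs=zs)
```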
In the case where R = C = D, we show that U β and U * β are UPEs. In the general case, U β is a soft UPE. We start with the special case. Note that β cannot be adapted during the trials. On the other hand, before the i'th trial, we can design the PEFs F i for the particular constraints relevant to the i'th trial.
Theorem 23. Suppose that R = C = D and that for each i, F_i is a past-parameterized PEF with power β for C_i. Then U_β and U*_β are ε-UPEs(C:C|Z; E, H(C)) for each ε > 0.

Proof. We first observe that P(C_{≤i}|Z_{≤i}) = ∏_{j=0}^{i−1} P(C_{j+1}|Z_{j+1}Z_{≤j}C_{≤j}). This follows by induction with the identity P(C_{≤j+1}|Z_{≤j+1}) = P(C_{j+1}|Z_{j+1}Z_{≤j}C_{≤j})P(C_{≤j}|Z_{≤j}), which holds by conditional independence of Z_{j+1}. We claim that F_{i+1}P(C_{i+1}|Z_{i+1}Z_{≤i}C_{≤i})^β is a test factor determined by C_{≤i+1}Z_{≤i+1}. To prove this claim, we compute, for all c_{≤i}z_{≤i}, E(F_{i+1}P(C_{i+1}|Z_{i+1}c_{≤i}z_{≤i})^β | c_{≤i}z_{≤i}) ≤ 1, where we invoked the assumption that F_{i+1} is a PEF with power β for C_{i+1}. By arbitrariness of c_{≤i}z_{≤i}, and because the factors are determined by C_{≤i+1}Z_{≤i+1}, the claim follows. The product of these test factors is T_iP(C_{≤i}|Z_{≤i})^β with T_i = ∏_{j=1}^{i} F_j. Thus, the sequence (T_iP(C_{≤i}|Z_{≤i})^β)_i is a test supermartingale, where the inverse of the second factor is monotone non-decreasing. We remark that as a consequence, T_n is a PEF with power β for H(C); that is, for R = C = D, chaining PEFs yields PEFs for standard models.
From Eqs. 20 and 21 with U_i = P(C_{≤i}|Z_{≤i})^{−β} and manipulating the inequalities inside P(·), we get the required coverage bounds. To conclude that U_β and U*_β are UPEs, it now suffices to refer to their defining identities Eqs. (51) and (52).
That F_{i+1} can be parameterized in terms of the past as F_{i+1} = F_{i+1}(C_{i+1}Z_{i+1}; C_{≤i}Z_{≤i}) allows for adapting the PEFs based on CZ, but no other information in R can be used. Since T_i is a function of C_{≤i}Z_{≤i}, this enables stopping when a target value of T_i is achieved.
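The early-stopping behavior can be sketched as a running product that halts once a target value is reached. The trial model and per-outcome factor values below are hypothetical toy choices (not certified PEFs for any physical model); the point is only the stopping logic:

```python
import random

def run_until_target(pef_value, sample_trial, target, max_trials, seed=0):
    """Accumulate the product T_i = prod_j F_j(c_j z_j) over simulated trials,
    stopping early as soon as T_i reaches the target value."""
    rng = random.Random(seed)
    T, n = 1.0, 0
    while n < max_trials and T < target:
        c, z = sample_trial(rng)
        T *= pef_value(c, z)
        n += 1
    return T, n

# Toy trial: outcome bit c (c = 0 with probability 0.9), trivial setting z = 0.
sample = lambda rng: (0 if rng.random() < 0.9 else 1, 0)
# Hypothetical factor values favoring the likely outcome.
F = lambda c, z: 1.2 if c == 0 else 0.4

T, n = run_until_target(F, sample, target=1e6, max_trials=10**5)
# The loop exits either at the target or at the trial budget.
assert T >= 1e6 or n == 10**5
```

Because T_i depends only on the trial results so far, stopping at a target preserves the supermartingale guarantees; the number of trials need not be fixed in advance.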
In order to use additional information available in R, we now treat the general case, where R and C need not coincide with D. While not necessary at this point, we do this with respect to a softening of the PEF properties.
Definition 24. Let β > 0, and let C be any model, not necessarily convex. A soft probability estimation factor (soft PEF) with power β for C and D|Z is a non-negative RV F = F(CZ) such that for all ν ∈ C, there exists a function q(C|DZ) ≥ 0 of CDZ with ∑_c q(c|dz) ≤ 1 for all dz and E_ν(F · (ν(C|Z)/q(C|DZ))^β) ≤ 1. (59) If ∑_c q(c|dz) < 1 for some dz, we can increase q to ensure ∑_c q(c|dz) = 1 without increasing the left-hand side of Eq. 59. Hence, without loss of generality, we can let ∑_c q(c|dz) = 1 in the above definition. We always set q(c|dz) = 0 for probability-zero values cdz. This does not cause problems with Eq. 59 given the convention that when expanding an expectation over a finite set of values, probability-zero values are omitted from the sum.
The direct calculation in the proof of the next lemma shows that PEFs are soft PEFs.
Lemma 25. If F is a PEF with power β for C, then it is a soft PEF with power β for C.
Proof. Let F be a PEF with power β for C and ν ∈ C. Define q(C|DZ) = ν(C|DZ). Then q satisfies the condition in the definition of soft PEFs. Since D is a function of C, we have ν(dcz) = ν(cz)⟦d = D(c)⟧, so that ν(c|z)/q(c|dz) = ν(d|z) whenever d = D(c) and ν(cz) > 0. From this identity we can deduce E_ν(F · (ν(C|Z)/q(C|DZ))^β) = E_ν(F · ν(D|Z)^β) ≤ 1. Since ν ∈ C is arbitrary, this verifies that every PEF F is a soft PEF.

Lemma 26. Let ε > 0 and let F be a soft PEF with power β for H and D|Z. Then V = (εF)^{−1/β} is an ε-soft UPE(D:C|Z; E, H).

Proof. Fix ν ∈ H and let q(C|DZ) be the function witnessing that F is a soft PEF for this ν. Define ν̃_z(dc) = ν(dc|z)⟦V(cz) ≥ ν(c|z)/q(c|dz)⟧. The Markov inequality and the definition of soft PEFs imply that P(V(CZ) < ν(C|Z)/q(C|DZ)) = P(εF · (ν(C|Z)/q(C|DZ))^β > 1) ≤ ε. Hence ∑_{dcz} ν̃_z(dc)ν(z) ≥ 1 − ε. On the support of ν̃_z we have ν(c|z) ≤ V(cz)q(c|dz), and the definition of ν̃_z(dc) together with ν(dc|z) = ν(c|z)⟦d = D(c)⟧ ensures that ν̃_z(dc) ≤ V(cz)q(c|dz), as required for soft UPEs. Since ν is arbitrary, the above verifies that V = (εF)^{−1/β} is an ε-soft UPE.

Theorem 27. Let the F_i be past-parameterized soft PEFs with power β for C_i and D_{i+1}|Z_{i+1}. Then for each ε > 0, U_β(RZ, 1 − ε) = (εT_n)^{−1/β} is an ε-soft UPE(D:R|Z; E, H(C)).

Proof. Below we first show that chaining soft PEFs yields soft PEFs; then, by applying Lem. 26, we prove the theorem. For this result, direct chaining of the probabilities fails. Instead, we decompose alternately according to C_i and R_i given C_i, as follows: P(R_{≤i+1}|Z_{≤i+1}) = P(R_{i+1}|C_{i+1}R_{≤i}Z_{≤i+1})P(C_{i+1}|R_{≤i}Z_{≤i+1})P(R_{≤i}|Z_{≤i}), where we simplified in the last step by applying the fact that Z_{i+1} is conditionally independent of R_{≤i} given Z_{≤i} and taking advantage of the assumption that C_i is determined by R_i. The identity can be expanded recursively, replacing the first term in the last expression obtained each time. We identify two products after the expansion, one collecting the factors P(C_{j+1}|R_{≤j}Z_{≤j+1}) and one collecting the factors P(R_{j+1}|C_{j+1}R_{≤j}Z_{≤j+1}), whose product is P(C_{≤i}R_{≤i}|Z_{≤i}) whenever P(C_{≤i}R_{≤i}|Z_{≤i}) is not zero. Again, we took advantage of our assumption that C_i is a function of R_i to omit unnecessary arguments. Let µ be an arbitrary distribution in H(C) and note that for all r_{≤i}z_{≤i}, µ[C_{i+1}Z_{i+1}|r_{≤i}z_{≤i}] ∈ C_{i+1}. According to the definition of soft PEFs, there exists q_{i+1}(C_{i+1}|D_{i+1}Z_{i+1}) with ∑_{c_{i+1}} q_{i+1}(c_{i+1}|d_{i+1}z_{i+1}) ≤ 1 for all d_{i+1}z_{i+1}, where the past parameters are implicit. We define the additional product p_{≤i+1} = p_{i+1}p_{≤i} for an incremental factor p_{i+1} built from q_{i+1} and the second product above. Later we find that the function q(R|DZ) needed for the final soft UPE construction according to Def. 15 is the i = n instance of p_{≤i}.
We show by induction that p_{≤i} satisfies the required normalization condition. The initial value is p_{≤0} = 1 (empty products evaluate to 1). For the induction step, for all d_{≤i+1}z_{≤i+1}, we sum p_{≤i+1} over the free arguments; the inner sum on the right-hand side evaluates to 1. Substituting back in Eq. 72 and applying the induction hypothesis gives the normalization condition.
We claim that F_{i+1}p_{i+1} times the corresponding probability factors is a test factor determined by R_{≤i+1}Z_{≤i+1}. To prove this claim, omitting the past parameters of p_{i+1}, for each r_{≤i}z_{≤i} we compute the conditional expectation and bound it by 1 by the definition of soft PEFs, where we used Eqs. 68 and 71 to get the second line. By arbitrariness of r_{≤i}z_{≤i}, the claim follows. Since ∏_{j=0}^{n−1} p_{j+1} = p_{≤n}, if we set the function q in the definition Def. 24 to p_{≤n}, we conclude that T_n(RZ) = ∏_{j=0}^{n−1} F_{j+1} is a soft PEF with power β for D|Z and H(C). Hence, chaining soft PEFs yields soft PEFs for standard models. Since U_β(RZ, 1 − ε) = (εT_n)^{−1/β}, to finish the proof of the theorem, we apply Lem. 26 with C there replaced by R here, D there with D here and Z there with Z here. The proof of Lem. 26 shows that the softness witness q(R|DZ) for U_β(RZ, 1 − ε) is also given by p_{≤n}.

C. Effectiveness of PEF Constraints
The constraints on PEFs F_i are linear, but it is not obvious that this set of linear constraints has a practical realization as a linear or semidefinite program. Thm. 28 below shows that when C = D, it suffices to use the extreme points Extr(C) of C, which implies that if C is a known polytope, then a finite number of constraints suffice. This includes the cases where C is characterized by typical settings and non-signaling constraints for (k, l, m) Bell-test configurations, and it generalizes to cases where the settings distributions are not fixed but constrained to a polytope of settings distributions; see Sect. VII A. For the case where C ≠ D, we show that soft PEFs for Extr(C) are soft PEFs for C. In particular, PEFs for Extr(C) yield soft PEFs for C. When C is a polytope this yields an effective construction for soft PEFs.
Theorem 28. Let C be a convex set of distributions of CZ. Then for β > 0 the family of inequalities E_ν(F · ν(C|Z)^β) ≤ 1, ν ∈ C, is implied by the subset with ν ∈ Extr(C).
Proof. For a given ν ∈ C, we can write the constraint on F as ∑_{cz} F(cz)ν(cz)^{1+β}/ν(z)^β ≤ 1. Thus, each ν determines a linear inequality for F. For distributions σ on CZ and ρ on Z with σ[Z] = ρ, define l_{σ|ρ}(cz) = σ(cz)^{1+β}/ρ(z)^β, so that the l_{ν|ν[Z]}(cz) are the coefficients of F in the inequality above. If ρ(z) = 0, then σ(cz) ≤ σ(z) = 0, so we set l_{σ|ρ}(cz) = 0, which is consistent since the power in the numerator is larger than that in the denominator. Since F is non-negative, it suffices to show that for each cz, the coefficients l_{σ|ρ}(cz) are jointly convex functions of σ and ρ. If ν = λν_1 + (1 − λ)ν_2 with 0 ≤ λ ≤ 1 and ν_i ∈ C, then l_{ν|ν[Z]}(cz) ≤ λl_{ν_1|ν_1[Z]}(cz) + (1 − λ)l_{ν_2|ν_2[Z]}(cz). Given that ∑_{cz} F(cz)l_{ν′|ν′[Z]}(cz) ≤ 1 for ν′ = ν_1 and ν′ = ν_2, we find that the inequality holds for ν as well. This observation extends to arbitrary finite convex combinations by reduction to the above case. Consequently the constraints associated with extremal distributions suffice.
Let r(x|y) = x^{1+β}y^{−β}, where we define r(0|0) = 0. The desired convexity expressed in Eq. 79 is implied by the fact that r(x|y) is jointly convex in its arguments for relevant values, which is Lem. 29 below.
Lemma 29. Define r(x|y) = x^{1+β}y^{−β} with β > 0. If (x, y) is expressed as a finite convex combination (x, y) = ∑_i λ_i(x_i, y_i) where x_i ≥ 0, y_i ≥ 0 and y_i = 0 implies x_i = 0, then r(x|y) ≤ ∑_i λ_i r(x_i|y_i).
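Lem. 29 can be spot-checked numerically; the sampled points, weights and value of β below are arbitrary:

```python
import random

def r(x, y, beta):
    # r(x|y) = x^(1+beta) * y^(-beta), with the convention r(0|0) = 0.
    if y == 0.0:
        return 0.0
    return x ** (1 + beta) * y ** (-beta)

rng = random.Random(42)
beta = 0.7
for _ in range(1000):
    # Three random points with y_i bounded away from 0, and random convex weights.
    pts = [(rng.random(), rng.random() + 0.1) for _ in range(3)]
    lams = [rng.random() for _ in range(3)]
    s = sum(lams)
    lams = [l / s for l in lams]
    x = sum(l * p[0] for l, p in zip(lams, pts))
    y = sum(l * p[1] for l, p in zip(lams, pts))
    # Joint convexity: r at the mixture is at most the mixture of the r values.
    assert r(x, y, beta) <= sum(l * r(*p, beta) for l, p in zip(lams, pts)) + 1e-9
```

The function r(x|y) = y·(x/y)^{1+β} is the perspective of the convex function t → t^{1+β}, which is the standard reason for its joint convexity.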
Corollary 30. If F (CZ) is a PEF with power β for C|Z and Extr(C), then F (CZ) is a PEF with power β for C|Z and the convex closure of C.
The same observation holds for soft PEFs.
Theorem 31. If F (CZ) is a soft PEF with power β for D|Z and Extr(C), then F (CZ) is a soft PEF with power β for D|Z and the convex closure of C.
Proof. Let ν_i ∈ Extr(C) and let ν = ∑_i λ_iν_i be a finite convex combination of the ν_i. Let q_i(C|DZ) witness that F is a soft PEF at ν_i. To apply Lem. 29, note that the definition of soft PEFs implies that q_i(c|dz) > 0 whenever ν_i(cz) > 0, as otherwise the defining inequality cannot be satisfied. Define q(c|dz) = ∑_i λ_iq_i(c|dz)ν_i(z)/ν(z) if ν(z) > 0 and q(c|dz) = 0 otherwise. From the joint convexity of x^{1+β}/y^β according to Lem. 29, for all cdz, ν(cz)^{1+β}/(q(c|dz)ν(z))^β ≤ ∑_i λ_iν_i(cz)^{1+β}/(q_i(c|dz)ν_i(z))^β. Multiplying by F(cz) and summing both sides over dcz gives the desired inequality.

Corollary 32. If F(CZ) is a PEF with power β for D|Z and Extr(C), then F(CZ) is a soft PEF with power β for D|Z and the convex closure of C.
Proof. By Lem. 25, F (CZ) is a soft PEF with power β for D|Z and Extr(C), so we can apply Thm. 31.
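The extreme-point reduction can be spot-checked numerically on a hypothetical two-vertex model with a trivial setting and C = D: if the PEF inequality holds at the vertices, random mixtures satisfy it too. The vertices and factor values below are arbitrary toy choices:

```python
import random

def pef_lhs(F, beta, nu, cs, zs):
    """Left-hand side E_nu(F * nu(C|Z)^beta) for C = D.
    nu is a dict (c, z) -> prob."""
    e = 0.0
    for (c, z), p_cz in nu.items():
        if p_cz:
            p_z = sum(nu.get((cc, z), 0.0) for cc in cs)
            e += p_cz * F(c, z) * (p_cz / p_z) ** beta
    return e

cs, zs, beta = (0, 1), (0,), 0.8
# Two deterministic extreme points of a toy polytope.
v1 = {(0, 0): 1.0, (1, 0): 0.0}
v2 = {(0, 0): 0.0, (1, 0): 1.0}
F = lambda c, z: 1.0 if c == 0 else 0.9

# The constraint holds at both vertices ...
assert pef_lhs(F, beta, v1, cs, zs) <= 1 + 1e-12
assert pef_lhs(F, beta, v2, cs, zs) <= 1 + 1e-12
# ... and therefore at random convex combinations (Thm. 28).
rng = random.Random(1)
for _ in range(100):
    lam = rng.random()
    mix = {k: lam * v1[k] + (1 - lam) * v2[k] for k in v1}
    assert pef_lhs(F, beta, mix, cs, zs) <= 1 + 1e-12
```

For a real (k, l, m) Bell-test polytope the same idea applies with the local-deterministic vertices, so PEF optimization needs only finitely many linear constraints.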

D. Log-Probs and PEF Optimization Strategy
Given the probability estimation framework, the number of random bits that can be extracted is determined by the negative logarithm of the final probability estimate after adjustments for error bounds and extractor constraints. We focus on the quantities obtained from Eq. 51, which reflect the final values of the PEF products.
Definition 33. Let the F_i be PEFs with power β > 0 for C_i. The log-prob of the results cz is ∑_i log(F_i(c_iz_i))/β. Given an error bound ε, the net log-prob is ∑_i log(F_i(c_iz_i))/β − log(1/ε)/β.
In the randomness generation protocols of Sect. III C with the UPEs of Sect. IV B, the net log-prob is effectively a "raw" entropy (in base e) available for extraction. For the protocol to succeed, the net log-prob must exceed σ_h log(2), where σ_h is the extractor parameter for the input min-entropy in bits. In the protocol with banked randomness, the net log-prob contributes almost fully to the input min-entropy provided to the extractor. In Sect. VIII, we further adjust the net log-prob by subtracting the entropy used for settings choices and call the resulting quantity the net entropy. This does not yet take into account the entropy needed for the extractor seed, nor the fact that the number of output bits is smaller than the input min-entropy according to the extractor constraints.
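The net log-prob of Def. 33 and the comparison against σ_h log(2) can be sketched as follows, with hypothetical per-trial PEF values:

```python
import math

def net_log_prob(pef_values, beta, eps):
    """Net log-prob (base e): sum_i log(F_i)/beta - log(1/eps)/beta."""
    return sum(math.log(f) for f in pef_values) / beta - math.log(1 / eps) / beta

# Hypothetical run: 10^5 trials, each contributing F_i = 1.001, with beta = 0.01.
vals = [1.001] * 100000
raw = net_log_prob(vals, beta=0.01, eps=1e-6)

# The run supports extraction at sigma_h input min-entropy bits only if the
# net log-prob exceeds sigma_h * log(2).
sigma_h = 10000
assert raw > sigma_h * math.log(2)
# The error-bound penalty log(1/eps)/beta strictly reduces the raw sum;
# it diverges as beta -> 0 for fixed eps.
assert raw < sum(math.log(f) for f in vals) / 0.01
```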
The log-prob and net log-prob are empirical quantities. For the average quantities with respect to a distribution we consider expected contributions per trial.

Definition 34.
If ν ∈ C is the probability distribution of CZ at a trial and F is a PEF with power β > 0 for C, then we define the log-prob rate of F at ν as O_ν(F; β) = E_ν(log(F(CZ)))/β. If the distribution of C_{i+1}Z_{i+1} conditional on the past is ν for each i and all c_{≤i}, z_{≤i}, then the trials are empirically i.i.d. with distribution ν. That is, they are i.i.d. from the point of view of the user, but not necessarily from the point of view of E. In this case, the expected log-prob after n trials is given by nO_ν(F; β). When using the term "rate", we implicitly assume a situation where we expect many trials with the same distribution. We omit the parameters β or ν when they are clear from context. The quantity O(F) could be called the trial's nominal min-entropy increase. We refer to it as "nominal" because the final min-entropy provided to the extractor has to be reduced by log(1/ε)/β (see Def. 33). This reduction diverges as β goes to zero.
The log-prob rate O(F ) is a concave function of F , where F is constrained to a convex set through the inequalities determined by C. The problem of maximizing O(F ) is therefore that of maximizing a concave function over a convex domain. For (k, l, m) Bell-test configurations with k, l, m small and typical settings and non-signaling constraints, it is feasible to solve this numerically. However, the complexity of the full problem grows very rapidly with k, l, m.

E. Schema for Probability Estimation
The considerations above lead to the following schematic probability estimation protocol. We are given the maximum number of available trials n, a probability estimation goal q and the error bound ε_h. In general, the protocol yields a soft UPE. The protocol can take advantage of experimentally relevant operational details to improve probability estimation by adapting PEFs while it is running. In view of Thm. 23 (if C_i = D_i) or Thm. 27 (if not), each trial's PEF can be adapted by taking advantage of the values of the R_j and Z_j from previous trials as well as any initial information that may be part of E. However, the values of the parameters n, q, ε_h and β in the protocol must be fully determined by initial information and cannot be adapted later.
Schema for Probability Estimation: Given: n, the maximum number of trials, q, the probability estimation goal, and ε_h, the error bound.
1. Set T 0 = 1. Choose β > 0 based on prior data from device commissioning or other earlier uses of the devices. 2. For each i = 0, . . . , n − 1: 2.1 After the i'th and before the i + 1'th trial: 2.1.1 Determine the set C i+1 constraining the distributions at the i + 1'th trial. 2.1.2 Estimate the distribution ν ∈ C i+1 of C i+1 Z i+1 from previous measurements (see the remark after the schema).

2.1.3
Based on the estimate ν and other information available in R ≤i and Z ≤i , determine a soft PEF F i+1 with power β for D i+1 |Z i+1 and C i+1 , optimized for log-prob rate at ν.

2.2 Obtain the results of the i + 1'th trial and update T_{i+1} = T_iF_{i+1}(c_{i+1}z_{i+1}), stopping early if the probability estimation goal is achieved.
In this schema, we have taken advantage of the ability to stop early while preserving the required coverage properties. According to Thm. 27, an implementation of the schema returns the value of an ε_h-soft UPE suitable for use in a randomness generation protocol with Thm. 19, Thm. 20, or Thm. 21.
We remark that there are different ways of estimating the distribution ν[C_{i+1}Z_{i+1}]. Our method works as long as the estimated distribution is determined by R_{≤i}Z_{≤i} and initial information. In practice, we estimate the distribution ν ∈ C_{i+1} by maximum likelihood using the past trial results C_{≤i}Z_{≤i}, where the likelihood is computed with the assumption that these trial results are i.i.d. (see Eq. 163 for details). We stress that the i.i.d. assumption is used only for performing the estimate. The probability estimation protocol and the derived randomness generation protocol are sound regardless of how we obtain the estimate.
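For an unconstrained model, the i.i.d. maximum-likelihood estimate reduces to the empirical frequencies of the past results. A minimal sketch (the projection onto the constrained set C_{i+1} used in practice is omitted here):

```python
from collections import Counter

def iid_mle(past_results):
    """Unconstrained i.i.d. maximum-likelihood estimate of nu[CZ]:
    the empirical frequencies of the past trial results (c, z)."""
    counts = Counter(past_results)
    n = len(past_results)
    return {cz: k / n for cz, k in counts.items()}

# Four hypothetical past trials.
est = iid_mle([(0, 0), (0, 0), (1, 0), (0, 1)])
assert est[(0, 0)] == 0.5
assert est[(1, 0)] == 0.25 and est[(0, 1)] == 0.25
assert abs(sum(est.values()) - 1.0) < 1e-12
```

Because the estimate is a function of the past results only, using it to choose the next PEF keeps the construction sound even if the i.i.d. assumption fails.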

V. SOME PEF PROPERTIES AND PEF OPTIMALITY
A. Dependence on Power

Lemma 35. If F is a PEF with power β for C, then for all 0 < γ ≤ 1, F^γ is a PEF with power γβ for C. Furthermore, given a distribution ρ ∈ C, the log-prob rate at ρ is the same for F and F^γ.
We refer to the transformation of PEFs in the lemma as "power reduction" by γ.
Proof. We have, for all ν ∈ C, E_ν(F^γ · ν(D|Z)^{γβ}) = E_ν((F · ν(D|Z)^β)^γ) ≤ E_ν(F · ν(D|Z)^β)^γ ≤ 1, by concavity of x → x^γ. Hence F^γ is a PEF with power γβ for C. Accordingly, O_ν(F^γ; γβ) = E_ν(log(F^γ))/(γβ) = E_ν(log(F))/β = O_ν(F; β). The lemma implies the following corollary: Corollary 36. The supremum of the log-prob rates at ρ for all PEFs with power β is non-increasing in β.
Hence, to determine the best log-prob rate without regard to the error bound, one can analyze PEFs in the limit where the power goes to 0.
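Power reduction (Lem. 35) can be checked on a toy one-distribution model with C = D and a trivial setting; the values of ν, F, β and γ below are arbitrary:

```python
import math

# Hypothetical single-trial model: nu over c in {0, 1}; D = C, so nu(D|Z) = nu(c).
nu = {0: 0.8, 1: 0.2}
beta, gamma = 1.0, 0.5
F = {0: 1.1, 1: 0.5}   # a PEF with power beta for this one-distribution model

def pef_lhs(Fv, b):
    # E_nu(F * nu(D|Z)^b) for the toy model.
    return sum(p * Fv[c] * p ** b for c, p in nu.items())

assert pef_lhs(F, beta) <= 1 + 1e-12
# Power reduction: F^gamma is a PEF with power gamma * beta.
Fg = {c: F[c] ** gamma for c in F}
assert pef_lhs(Fg, gamma * beta) <= 1 + 1e-12
# The log-prob rate is unchanged: E(log F^gamma)/(gamma*beta) = E(log F)/beta.
r1 = sum(p * math.log(F[c]) for c, p in nu.items()) / beta
r2 = sum(p * math.log(Fg[c]) for c, p in nu.items()) / (gamma * beta)
assert abs(r1 - r2) < 1e-12
```

Because the rate is preserved while the power shrinks, analyzing PEFs in the limit β → 0 loses nothing in terms of achievable log-prob rate, as stated above.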

B. PEFs and Entropy Estimation
Definition 37. An entropy estimator for D|Z and C is a real-valued function K of CZ with the property that for all ν ∈ C, E_ν(K(CZ)) ≤ E_ν(−log(ν(D|Z))). (91) The entropy estimate of K at ν is E_ν(K(CZ)).
Entropy estimators are the analog of affine min-tradeoff functions in the entropy accumulation framework of Ref. [25]. The following results establish a close relationship between entropy estimators and PEFs. First we show that the set of entropy estimators contains the log-β-roots of positive PEFs with power β > 0.
Theorem 38. Let F be a function of CZ such that F ≥ 0. If there exists β > 0 such that F^β is a PEF with power β for C, then for all ν ∈ C, E_ν(log(F(CZ))) ≤ E_ν(− log(ν(D|Z))).
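Thm. 38 can likewise be checked numerically. The sketch below (a hypothetical two-extreme-point model with illustrative numbers, not from the paper) builds a PEF F with power β and verifies that K = log(F)/β, the log of its β-root, has expectation below the conditional entropy at each extreme point; the left side is linear and the right side concave in ν, so the check extends to all of C.

```python
import math

CONDS = [[[0.9, 0.1], [0.2, 0.8]],   # hypothetical extremal p(c|z)
         [[0.5, 0.5], [0.5, 0.5]]]
CZ = [(c, z) for c in (0, 1) for z in (0, 1)]
beta = 0.1

def expect(cond, f):   # E_nu(f(C,Z)) with uniform settings
    return sum(0.5 * cond[z][c] * f(c, z) for (c, z) in CZ)

# PEF with power beta, scaled to be tight at the worst extreme point:
F0 = {(0, 0): 1.3, (1, 0): 0.7, (0, 1): 0.7, (1, 1): 1.3}
M = max(expect(cond, lambda c, z: F0[(c, z)] * cond[z][c] ** beta)
        for cond in CONDS)
F = {cz: F0[cz] / M for cz in CZ}

# K = log(F)/beta is the log of the beta-root of the PEF: F = (F^(1/beta))^beta.
def K(c, z):
    return math.log(F[(c, z)]) / beta

for cond in CONDS:
    lhs = expect(cond, K)                                   # entropy estimate
    rhs = expect(cond, lambda c, z: -math.log(cond[z][c]))  # E(-log nu(D|Z))
    assert lhs <= rhs + 1e-12
```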
According to Thm. 38, positive PEFs F β with power β define entropy estimators log(F ) whose entropy estimate at a distribution ρ ∈ C is the log-prob rate of the PEF. Our next goal is to relate log-prob rates for PEFs with power approaching 0 to entropy estimates. For this we need the following definition: Definition 39. The asymptotic gain rate at ρ of C is the supremum of the log-prob rates at ρ achievable by PEFs with positive power for C.
By the properties of PEFs, the asymptotic gain rate at ρ of C is the same as that of Cvx(C).
The asymptotic gain rate is defined in terms of a supremum over all PEFs, including those whose support is properly contained in Rng(CZ). The next lemma shows that the same supremum is obtained for positive PEFs.
Lemma 40. Let g be the asymptotic gain rate at ρ of C and define g⁺ = sup{O_ρ(F^β; β) : F > 0 and F^β is a PEF with power β > 0 for C}. (93) Then g = g⁺ and the supremum of the entropy estimates at ρ of entropy estimators for C is at least g.
Proof. It is clear that g ≥ g⁺. Both g and g⁺ are non-negative because we can always choose F = 1 (see Thm. 46 below). It suffices to consider the non-trivial case g > 0. Suppose that F contributes non-trivially to the supremum defining the asymptotic gain rate, namely F^β is a PEF with power β > 0 and log-prob rate g_F = O_ρ(F^β; β) > 0. We can choose F so that g_F is arbitrarily close to g. Consider G^β = (1 − δ)F^β + δ with δ > 0 sufficiently small. Then G^β is a positive PEF with power β (see Thm. 46), and since G^β ≥ (1 − δ)F^β, by monotonicity of the logarithm its log-prob rate satisfies g_G = E_ρ(log(G^β))/β ≥ g_F + log(1 − δ)/β. We have log(1 − δ)/β = O(δ). Consequently, g_G ≥ g_F − O(δ). Since δ is arbitrary, we can find positive G so that G^β is a PEF with power β and log-prob rate g_G at ρ bounded below by a quantity arbitrarily close to g_F. Since g_F is itself arbitrarily close to g, we have g = g⁺. The last statement of the lemma follows from Thm. 38 and the definitions of asymptotic gain rate and entropy estimators.
According to the lemma above, the supremum of the entropy estimates at ρ by entropy estimators is an upper bound for the asymptotic gain rate. The upper bound could be strict. But the next theorem implies that in fact, the supremum of the entropy estimates gives exactly the asymptotic gain rate. Theorem 41. Suppose K is an entropy estimator for C. Then the asymptotic gain rate at ρ is at least the entropy estimate of K at ρ given by σ = E ρ (K(CZ)).
Proof. To prove the theorem requires constructing families of PEFs with small powers whose log-prob rates approach the entropy estimate of K. Let k_max = max(K) and k_min = min(K). We may assume that k_max > 0 as otherwise the entropy estimate is not positive. For sufficiently small ε > 0, we determine γ > 0 such that G(CZ)^γ = (e^{−ε+K(CZ)})^γ is a PEF with power γ for C. We require that ε < 1/2. Consider ν ∈ C and define f(γ) = E_ν(G(CZ)^γ ν(D|Z)^γ), so that G^γ is a PEF with power γ provided that f(γ) ≤ 1 for all ν. We Taylor-expand f(γ) at γ = 0 for γ ≥ 0 with a second-order remainder.
The solutions x = 0 and log(x) = −a are minima with g(x) = 0. The remaining solution is obtained from log(x_0) = −(a + 2/(1 + γ)). The candidate maxima are at this solution and at the boundary x_1 = 1. If x_0 > 1, the solution at x_0 is irrelevant. The condition x_0 ≤ 1 is equivalent to −(a(1 + γ) + 2) ≤ 0, in which case g(x_0) ≤ 4. We have a² ≤ max((k_max − ε)², (k_min − ε)²), so let u = max(4, k_max², k_min² + |k_min| + 1/4), which is a loose upper bound on the maximum of g(x) for 0 < x ≤ 1. For the bound, we used that k_max > 0 and 0 < ε < 1/2. Returning to bounding f, we find that to ensure f(γ) ≤ 1 it suffices to satisfy an upper bound on γ; let w be the right-hand side of this inequality. To remove the dependence of w on γ while maintaining f(γ) ≤ 1, we reduce the upper bound on γ in a few steps. Since e^{−k_max γ} ≥ 1 − k_max γ, we obtain a reduced bound w′ ≤ w. If γ ≤ w′, then γ ≤ ε²/(u|Rng(C)|), so we can substitute this in the right-hand side to obtain w″. For ε < u|Rng(C)|/(4k_max), this simplifies further to w″ ≥ w‴ = ε/(u|Rng(C)|) by substituting the bound on ε for the second occurrence of ε in the expression for w″. The bound on ε is satisfied since ε < 1/2 is already assumed and, from u ≥ max(4, k_max²), u/(4k_max) ≥ 1/2. We now require γ ≤ ε/(u|Rng(C)|), which ensures that f(γ) ≤ 1. Since the bound on f(γ) is independent of ν ∈ C, G^γ is a PEF with power γ, and the log-prob rate of G^γ at ρ is E_ρ(log(G(CZ))) = E_ρ(K(CZ)) − ε = σ − ε, which approaches σ as ε goes to zero, as required to complete the proof of the theorem.

C. Error Bound Tradeoffs
So far, the results of this section have ignored the contribution of the error bound ε_h to the probability estimate, giving the appearance that arbitrarily small powers are optimal. For a given finite number of trials, or if ε_h is intended to decrease exponentially with output min-entropy, the optimal power for PEFs is bounded away from zero. This is because ε_h increases the probability estimate by a factor of ε_h^{−1/β}, which diverges as β goes to zero. We can analyze the situation for the case where ε_h = e^{−κσn}, where σ is the log-prob rate O_ρ(F; β) at a given distribution ρ ∈ C. Per trial, we get a net log-prob rate of σ − log(1/ε_h)/(βn) = σ(1 − κ/β). We can consider the family F^γ of PEFs with power γβ obtained by power reduction of F. The net log-prob rate at power γβ is σ(1 − κ/(γβ)) ≤ σ(1 − κ/β) for 0 < γ ≤ 1. Consequently, the net log-prob rate is never improved by power reduction of a given PEF. However, there are usually PEFs with lower powers and higher log-prob rates. There is therefore a tradeoff between the log-prob rate and the term κ/β that we expect to be optimized at some finite non-zero β. This effect is demonstrated by example in Sect. VIII, see Fig. 1. We remark that in the above definitions, we may consider κ = κ(n) as a function of n.
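The effect of the error-bound term on the net rate can be illustrated directly. A minimal sketch, with illustrative values of σ and κ (not from the paper): power reduction keeps σ but inflates the penalty κ/β, so it can only lower the net rate.

```python
# Net log-prob rate per trial for error bound eps_h = exp(-kappa*sigma*n):
# the penalty log(1/eps_h)/(beta*n) equals kappa*sigma/beta per trial.
def net_rate(sigma, kappa, beta):
    return sigma * (1.0 - kappa / beta)

sigma, kappa, beta = 0.2, 0.005, 0.05   # illustrative values
base = net_rate(sigma, kappa, beta)
for gamma in (0.8, 0.5, 0.2, 0.1):
    # Power reduction to gamma*beta never improves the net rate.
    assert net_rate(sigma, kappa, gamma * beta) <= base + 1e-15
```

Improving on the base rate requires switching to a different PEF with higher log-prob rate at the lower power, which is the tradeoff optimized in Sect. VIII.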

D. Optimality of PEFs
Consider the situation where the R_iZ_iE_i are i.i.d. with a distribution ν such that for all e_i, ν[C_iZ_i|E_i = e_i] is in C. Here we have structured the RV E as a sequence of RVs E_i. We call this the i.i.d. scenario with respect to ν, where we specify ν as a distribution of RZE without indices. We consider error bounds of the form e^{−o(n)}. Our results so far establish that if the asymptotic gain rate at ρ = ν[CZ] is g, we can certify smooth conditional min-entropy and extract near-uniform random bits at an asymptotic rate arbitrarily close to g. This is formalized by the next theorem.
Theorem 42. Let g be the asymptotic gain rate at ρ of C and ε_n = e^{−κ(n)n} with κ(n) = o(1). Assume the i.i.d. scenario. Then for any δ > 0, there exists a PEF G^β with power β for C such that with asymptotic probability one, the net log-prob after n trials with PEFs F_i = G(C_iZ_i)^β and error bound ε_n exceeds (g − δ)n.
Proof. By the definition of asymptotic gain rate, there exists a PEF G^β with power β such that E_ρ(log(G)) ≥ g − δ/3. We may assume g − δ/3 > 0. The RV S_n = Σ_{i=1}^n log(F_i)/β = Σ_{i=1}^n log(G(C_iZ_i)) is a sum of bounded i.i.d. RVs, so according to the weak law of large numbers, the probability that S_n ≥ (g − 2δ/3)n goes to 1 as n → ∞. The net log-prob is given by S_n − log(1/ε_n)/β = S_n − κ(n)n/β. Since κ(n) = o(1), for n sufficiently large, κ(n)/β ≤ δ/3. Thus with asymptotic probability 1, the net log-prob after n trials exceeds (g − δ)n.
We claim that the asymptotic gain rate g at ρ is equal to the minimum of the conditional entropies H(D|ZE; ν) over all distributions ν of CZE such that ν[CZ] = ρ and ν[CZ|E = e] ∈ C for all e, where the conditional entropy with base e is defined by H(D|ZE; ν) = E_ν(− log(ν(D|ZE))). According to the asymptotic equipartition property (AEP) [27] specialized to the case of finite classical-classical states, the infimum of the conditional entropies is the optimal randomness-generation rate. In this sense, our method is optimal. The claim is established by the next theorem.
Theorem 43. Let g be the asymptotic gain rate at ρ of C. Then g = min{H(D|ZE; ν) : ν a distribution of CZE with ν[CZ] = ρ and ν[CZ|E = e] ∈ C for all e}.
Proof. Suppose that ν is a distribution of CZE such that ν[CZ] = ρ. Define ρ_e = ν[CZ|E = e] ∈ C and λ_e = ν(e). Then ρ = Σ_e λ_e ρ_e. Consider an arbitrary entropy estimator K for C and write f(σ) = E_σ(K(CZ)) for its entropy estimate at σ. By definition, f(σ) ≤ H(D|Z; σ) for all σ ∈ C. Since f(σ) is linear in σ, we have f(ρ) = Σ_e λ_e f(ρ_e) ≤ Σ_e λ_e H(D|Z; ρ_e) = H(D|ZE; ν). For the second part of the theorem, since the asymptotic gain rate of C is the same as that of Cvx(C), without loss of generality assume that C is convex closed. For any distribution σ ∈ C define h_min(σ) = inf{Σ_e λ_e H(D|Z; σ_e) : σ = Σ_e λ_e σ_e, σ_e ∈ C, λ_e ≥ 0, Σ_e λ_e = 1}. We claim that h_min(ρ) is the supremum of entropy estimates at ρ for C and that the infimum in the definition of h_min(ρ) is achieved by a sum involving a bounded number of terms. The conditional entropy is concave in the joint distribution of its variables (see, for example, Ref. [43], Cor. 11.13, which is readily specialized to the classical case). It follows that if one of the σ_e contributing to the sum defining h_min(σ) is not extremal, we can replace it by a convex combination of extremal distributions without increasing the value of the sum. Thus, we only have to consider σ_e ∈ Extr(C) for defining h_min. It follows that h_min(σ) is the convex roof extension of the function σ → H(D|Z; σ) on Extr(C). Convex roof extensions are defined in Ref. [44]; see this reference for a proof that h_min is convex. In fact, the graph of h_min on C is the lower boundary of the convex closure of the set {(σ, H(D|Z; σ)) : σ ∈ Extr(C)}.
Specializing h_min(σ) to the case σ = ρ, since the dimension is finite we can apply Carathéodory's theorem and express h_min(ρ) as a finite convex combination Σ_e λ_e H(D|Z; ρ_e) with ρ_e ∈ Extr(C). The number of terms required is at most d + 2, where d is the dimension of C. Since h_min is convex and has a closed epigraph, for any ε > 0 there exists an affine function f on C such that f(ρ) ≥ h_min(ρ) − ε and the graph of f is below that of h_min (see Ref. [45], Sect. 3.3.2 and Exercise 3.28). We can extend f to all distributions σ of CZ and express f(σ) as an expectation f(σ) = E_σ(K(CZ)) for some real-valued function K of CZ. Relevant existence and extension theorems can be found in textbooks on convex analysis, topological vector spaces or operator theory; for example, see Ref. [46], Ch. 1. We now have that for all σ ∈ C, E_σ(K(CZ)) = f(σ) ≤ h_min(σ) ≤ H(D|Z; σ). It follows that K is an entropy estimator and that its entropy estimate at ρ is at least h_min(ρ) − ε. By arbitrariness of ε, h_min(ρ) is the supremum of entropy estimates at ρ for C, and hence g = h_min(ρ) by Lem. 40 and Thm. 41. From above, we can write h_min(ρ) = Σ_{e=1}^{d+2} λ_e H(D|Z; ρ_e), with λ_e ≥ 0, Σ_{e=1}^{d+2} λ_e = 1 and ρ_e ∈ Extr(C). We can set Rng(E) = {1, . . . , d + 2} and define ν on CZE by ν(cze) = ρ_e(cz)λ_e. Then H(D|ZE; ν) = Σ_e λ_e H(D|Z; ρ_e) = h_min(ρ) = g, which is what we aimed to prove.
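The convex-roof characterization can be made concrete on a small example. The sketch below (a hypothetical model with two affinely independent extreme points, so the decomposition of ρ is unique; numbers illustrative, not from the paper) computes h_min(ρ) and confirms that it lies below H(D|Z; ρ), as concavity of conditional entropy requires.

```python
import math

CONDS = [[[0.9, 0.1], [0.2, 0.8]],   # hypothetical extremal p(c|z)
         [[0.5, 0.5], [0.5, 0.5]]]

def cond_entropy(cond):   # H(D|Z; sigma) in nats, uniform settings
    return sum(-0.5 * p * math.log(p) for row in cond for p in row if p > 0)

lam = 0.4
rho = [[lam * CONDS[0][z][c] + (1 - lam) * CONDS[1][z][c]
        for c in (0, 1)] for z in (0, 1)]

# With two affinely independent extremes, the defining mixture is the only
# extremal decomposition of rho, so the convex roof is:
h_min = lam * cond_entropy(CONDS[0]) + (1 - lam) * cond_entropy(CONDS[1])

# h_min equals H(D|ZE; nu) for nu(cze) = rho_e(cz) * lambda_e, and concavity
# of conditional entropy puts it below the unconditioned H(D|Z; rho).
assert 0 < h_min <= cond_entropy(rho) + 1e-12
```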

VI. ADDITIONAL PEF CONSTRUCTIONS
The effectiveness of the PEF constraints, as discussed in Sect. IV C, provides a practical way to construct PEFs when the convex closure of the set C has a finite number of extreme points. We demonstrate this construction in Sect. VIII. In addition, we can construct PEFs by the strategies discussed in the following two subsections.

A. PEFs from Maximum Probability Estimators
A strategy for constructing PEFs is to determine non-negative functions F that satisfy the inequalities E_ν(F(CZ))(max_{dz} ν(d|z))^β ≤ 1 for all ν ∈ C. (117) Any such F is a PEF with power β for C, since ν(D|Z) ≤ max_{dz} ν(d|z) pointwise. The next theorem provides families of PEFs satisfying these inequalities. Since the expectations of such PEFs witness the maximum conditional probability max_{dz} ν(d|z) in a trial, the corresponding probability estimators may be referred to as maximum probability estimators.
Theorem 44. Suppose that B is a function of CZ such that for all ν ∈ C, 1 − E_ν(B(CZ)) ≥ max_{dz} ν(d|z). (118) Let α < 1, β > 0 and 0 < λ ≤ 1, and define F = (1 − α)^{−β}(1 + λβ(B(CZ) − α)/(1 − α)). (119) If F ≥ 0, then F is a PEF with power β for C.
A reasonable choice for α in the theorem is α = b̄ = E_ρ(B(CZ)) with ρ our best guess for the true distribution of the trial. The inequality in Eq. 118 and the expression for F suggest that we should maximize b̄ for the best results at ρ. To optimize the log-prob rate or the net log-prob rate, we can then vary β and α. If the condition F ≥ 0 is not satisfied, we can either reduce λ in the definition or replace B by γB and α by γα for an appropriate γ ∈ (0, 1). The latter replacement preserves the validity of Eq. 118. Thm. 49 shows that PEFs obtained by these methods for a given B witness an asymptotic gain rate of at least − log(1 − b̄), which justifies the goal of maximizing b̄ and is what we would hope for given the interpretation of the right-hand side of Eq. 118 as a worst-case conditional max-prob.
Functions B satisfying the conditions in the theorem with b̄ > 0 for a given non-LR distribution are readily constructed for a large class of Bell-test configurations. See the discussions after the proofs of this and the next theorem. The family of PEFs constructed accordingly contains PEFs with good log-prob rates, as witnessed by Thm. 49 below, which quantifies the performance of B in terms of b̄. The family can also be used as a tool for proving exponential randomness expansion, see Thm. 52.
Proof. We use the following general inequality: for x < 1, (1 − x)^{−β} ≥ (1 − α)^{−β}(1 + β(x − α)/(1 − α)), (120) since the right-hand side defines the tangent line of the graph of the convex function (1 − x)^{−β} at x = α. Provided F ≥ 0, given any ν ∈ C we can compute E_ν(F ν(D|Z)^β) ≤ E_ν(F)(max_{dz} ν(d|z))^β ≤ E_ν(F)(1 − E_ν(B(CZ)))^β, (121) where the last step applies Eq. 118. To continue, with Eq. 120 we bound E_ν(F) ≤ (1 − E_ν(B(CZ)))^{−β}. To apply Eq. 120, note that Eq. 118 implies E_ν(B(CZ)) < 1. Substituting in Eq. 121 gives E_ν(F ν(D|Z)^β) ≤ 1. Since ν is an arbitrary distribution in C, we conclude that if F ≥ 0, it is a PEF with power β and, in consideration of the first line of Eq. 121, it satisfies Eq. 117.
For (2, 2, 2) Bell tests with known settings probabilities, starting from a Bell function B_0 with positive expectation at ρ, one can determine m > 0 such that B = B_0/(2m) satisfies Eq. 118. Computationally, m can be found by checking the constraints of Eq. 118 at extremal ν[C|Z]. This construction was exploited in Ref. [13]. See the lemma in the proof of the "Entropy Production Theorem" in this reference, where setting B = (T − 1)/(2m) with respect to the notation there defines a function satisfying Eq. 118. This observation about non-trivial Bell functions is a consequence of the fact that positive expectations of such Bell functions witness the presence of a Popescu-Rohrlich (PR) box in the distribution, and such a box has maximum outcome probability 1/2 for each setting z. (See Sect. VIII for the definition of PR boxes.) For λ = 1, Thm. 44 can be generalized to a weighted form with minor modifications to the proof.
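For the (2, 2, 2) configuration, the construction can be sketched explicitly. The code below uses the CHSH Bell function shifted to have local-realistic bound 0 (an illustrative normalization, corresponding to a particular choice of m; the check below confirms it is sufficient) and verifies Eq. 118 at all 24 extreme points of the non-signaling polytope. Since Eq. 118 is a family of constraints linear in ν[C|Z], this verifies it on all of C. It then evaluates b̄ at the Tsirelson-bound quantum distribution.

```python
import math
from itertools import product

# Shifted CHSH Bell function with local-realistic bound 0 (illustrative):
def B(a, b, x, y):
    return (-1) ** (a ^ b ^ (x * y)) - 0.5

def E_B(cond):   # E_nu(B) for conditional cond[(a,b,x,y)], uniform settings
    return sum(0.25 * cond[(a, b, x, y)] * B(a, b, x, y)
               for a, b, x, y in product((0, 1), repeat=4))

def max_prob(cond):   # max_{dz} nu(d|z)
    return max(cond.values())

keys = list(product((0, 1), repeat=4))
extremes = []
# 16 local deterministic strategies a = f(x), b = g(y):
for f0, f1, g0, g1 in product((0, 1), repeat=4):
    f, g = (f0, f1), (g0, g1)
    extremes.append({(a, b, x, y): float(a == f[x] and b == g[y])
                     for a, b, x, y in keys})
# 8 PR boxes a xor b = x*y xor alpha*x xor beta*y xor gamma:
for al, be, ga in product((0, 1), repeat=3):
    extremes.append({(a, b, x, y):
                     0.5 * ((a ^ b) == ((x & y) ^ (al & x) ^ (be & y) ^ ga))
                     for a, b, x, y in keys})

# Eq. 118 at every extreme point of the non-signaling polytope:
for cond in extremes:
    assert 1.0 - E_B(cond) >= max_prob(cond) - 1e-12

# b-bar at the Tsirelson-bound quantum distribution:
quantum = {(a, b, x, y): (1 + (-1) ** (a ^ b ^ (x * y)) / math.sqrt(2)) / 4
           for a, b, x, y in keys}
bbar = E_B(quantum)
assert abs(bbar - (1 / math.sqrt(2) - 0.5)) < 1e-12
# Witnessed asymptotic gain rate: -log(1 - bbar), about 0.232 nats per trial.
```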
Theorem 45. Let γ be a positive function of DZ. Suppose that B is a function of CZ such that for all ν ∈ C, 1 − E_ν(B(CZ)) ≥ max_{dz} γ(dz)ν(d|z). (124) Let α < 1, β > 0 and define F = γ(DZ)^β(1 − α)^{−β}(1 + β(B(CZ) − α)/(1 − α)). (125) If F ≥ 0, then F is a PEF with power β for C.
Proof. It suffices to adjust the sequence of inequalities leading to Eq. 123: bound ν(D|Z)^β by (1 − E_ν(B(CZ)))^β γ(DZ)^{−β} using Eq. 124 in place of Eq. 118, so that the factor γ(DZ)^β in F cancels. Eq. 124 defines a set of convex constraints on B, namely 1 − E_ν(B(CZ)) ≥ γ(dz)ν(d|z) for all dz and ν ∈ C.
If we use B to construct PEFs according to Thm. 45, a reasonable goal is to maximize b̄ = E_ρ(B(CZ)) subject to these constraints, where ρ is an estimate of the true distribution. We remark that the asymptotic gain rate witnessed by PEFs constructed according to Thm. 45 from a given B is at least − log(1 − b̄) + E_ρ(log(γ(DZ))), see Thm. 49.
Given an optimal B, one can use Eq. 125 to define PEFs, choosing parameters to optimize the log-prob rate at ρ. If F as constructed does not satisfy F ≥ 0, we can replace B by γ′B and α by γ′α for an appropriate γ′ ∈ (0, 1). Alternatively, we can reduce the power β.
For general C, solving Eq. 128 may be difficult. But suppose that the measurement settings distribution ν[Z] is fixed, ν[Z] = ρ[Z] for all ν ∈ C, and the conditional distributions ν[C|Z] belong to a convex set C_{C|Z} determined by semidefinite constraints. Now C = {ν[C|Z]ρ[Z] : ν[C|Z] ∈ C_{C|Z}}, which is a special case of the sets free for Z defined in Sect. VII A. This is the standard situation for Bell configurations, in which case Eq. 128 is related to the optimization problems described in Refs. [17][18][19]. These references define convex programs that determine the maximum available min-entropy for one trial with distribution ρ. In fact, the program given in Eq. (8) of Ref. [18] is related to the dual of Eq. 128 when C_{C|Z} is the set of quantum realizable conditional probability distributions. To make the relationship explicit and show that with the given assumptions Eq. 128 is effectively solvable, define B̃ by B̃(cz) = B(cz)ρ(z). With explicit sums, Eq. 128 becomes a minimization of (1 − b̄). For all dz, Σ_{c′z′} σ(c′z′)ρ(c′|z′) ≥ −aγ(dz)ρ(d|z), and since (B̃_min, −1) ∈ D*, (1 − b̄) ≥ γ(dz)ρ(d|z). Hence (σ, a) · x ≥ −aγ(dz)ρ(d|z) + aγ(dz)ρ(d|z) = 0, in consideration of a ≥ 0. The dual of D* is D again, so x ∈ D. Because of probability normalization, D_1 is contained in several hyperplanes not containing the origin, so D is pointed. The intersection of any of these hyperplanes with D is D_1. Since the set D_1 is closed and bounded, it meets every extremal ray of D where it intersects these hyperplanes (Thm. 1.4.5 of Ref. [46]). Consequently, (ρ[C|Z], (1 − b̄)) is a convex combination of elements of D_1. Since we are in finite dimensions, we can apply Carathéodory's theorem and express ρ[C|Z] as a finite convex combination Σ_{dzk} λ_{dzk} ν_{dzk}(c′|z′) = ρ(c′|z′) satisfying Σ_{dzk} λ_{dzk} γ(dz)ν_{dzk}(d|z) = (1 − b̄), with ν_{dzk} ∈ C_{C|Z}, Σ_{dzk} λ_{dzk} = 1 and λ_{dzk} ≥ 0.
If we define λ_{dz} = Σ_k λ_{dzk} and ν_{dz} = Σ_k λ_{dzk} ν_{dzk}/λ_{dz} if λ_{dz} > 0 (otherwise, ν_{dz} can be any member of the set C_{C|Z}), then k can be eliminated in the convex combination. By construction, (1 − b̄) is the maximum value of Σ_{dz} λ_{dz} γ(dz)ν_{dz}(d|z) for any family ν_{dz} ∈ C_{C|Z} and λ_{dz} ≥ 0 satisfying that for all c′z′, Σ_{dz} λ_{dz} ν_{dz}(c′|z′) = ρ(c′|z′) and Σ_{dz} λ_{dz} = 1. We can write ν̃_{dz} = λ_{dz} ν_{dz} to absorb the coefficients λ_{dz}. With this, the value of the following problem is (1 − b̄):
The set [0, ∞)C_{C|Z} is the cone generated by C_{C|Z}. If C_{C|Z} is characterized by a semidefinite program, then so is [0, ∞)C_{C|Z} (by eliminating inhomogeneous constraints in the semidefinite program's standard form, see Ref. [45], Eq. (4.51)). Eq. 130 can then be cast as a semidefinite program also and solved effectively. By semidefinite-programming duality, this can then be used to obtain effective solutions of Eq. 128, provided the semidefinite program is formulated to satisfy strong duality. For related programs, Ref. [17] claims strong duality by exhibiting strictly feasible solutions. We conclude with remarks on the relationship between the optimization problem in Eq. 130 and that in Eq. (8) of Ref. [18] (referenced as "P8" below). To relate Eq. 130 to P8, set γ(dz) = 1 and let C_{C|Z} be the set of quantum achievable conditional probability distributions, for which there is a hierarchy of semidefinite-programming relaxations [30]. Then identify both c and d here with ab there (so c = d), z here with xy there, and ν_{dz}(c′|z′) here with ρ(z)P_{cz}(c′|z′) there, where P_{cz}(c′|z′) = Σ_{αβ: α_z β_z = c} P_{αβ}(c′|z′). The objective function of Eq. 130 now matches that of P8. But unless there is only one setting, the equality constraints in Eq. 130 are a proper subset of those of P8. Observe that the equality constraints of P8, when expressed in terms of the variables P_{cz}(c′|z′), require that for each z and c′z′, Σ_c P_{cz}(c′|z′) = ρ(c′|z′). For this, note that P8 includes the constraints Σ_{αβ} P_{αβ}(c′|z′) = ρ(c′|z′) for all c′z′. For any z, the left-hand side can be written as Σ_c Σ_{αβ: α_z β_z = c} P_{αβ}(c′|z′) = Σ_c P_{cz}(c′|z′). These identities imply that Σ_{cz} ρ(z)P_{cz}(c′|z′) = Σ_z ρ(z)ρ(c′|z′) = ρ(c′|z′), but are stronger when |Rng(Z)| > 1. Furthermore, according to P8, the P_{cz} must be expressed as sums of the P_{αβ} ∈ C_{C|Z} as specified above, and this implies additional constraints. As a result, the optimal value for P8 is in general smaller.

B. Convex Combination
Optimization over all PEFs with a given power can be highly demanding, because both the size of the range of CZ and the number of extreme points of C in a general (k, l, m) Bell-test configuration are large. Furthermore, the size of the range of CZ determines the dimension of the search space, so if this size is large and the amount of data available for making an estimate of the true distribution is limited, there is a risk of overfitting when optimizing the log-prob rate for the estimated distribution at a given power β. See Eq. (162) of Sect. VIII for the explicit formulation of the optimization problem. However, in many cases we have a small set of candidate PEFs expected to be helpful in a given situation, where the set was obtained in earlier studies of the experimental devices before running any protocols. Then, the optimization problem can be greatly simplified by means of the next theorem.
Theorem 46. Let (F_i)_{i=1}^r be a collection of PEFs with power β and define F_0 = 1. Then every weighted average F = Σ_{i=0}^r λ_i F_i with λ_i ≥ 0 and Σ_i λ_i = 1 is a PEF with power β.
Proof. It suffices to observe that F_0 = 1 is a PEF for all powers, and the set of PEFs with power β is convex closed since it is defined by a family of linear constraints.
Optimizing PEFs with respect to the coefficients of weighted averages is efficient and less susceptible to overfitting issues. If ρ is estimated by empirical frequencies from past trials, one can evaluate the objective function directly on the past data, in which case the technique is not limited to configurations with computationally manageable |Rng(CZ)|. This strategy was proposed for optimizing test factors for rejecting LR in Ref. [15] and used in Refs. [36,40].
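A minimal sketch of this strategy, using a hypothetical two-extreme-point model with illustrative numbers (not from the paper): the search is over mixing weights only, a one-dimensional problem here regardless of |Rng(CZ)|.

```python
import math

CONDS = [[[0.9, 0.1], [0.2, 0.8]],   # hypothetical extremal p(c|z)
         [[0.5, 0.5], [0.5, 0.5]]]
CZ = [(c, z) for c in (0, 1) for z in (0, 1)]
beta = 0.1

def pef_lhs(F, cond):
    return sum(0.5 * cond[z][c] * F[(c, z)] * cond[z][c] ** beta
               for (c, z) in CZ)

F0 = {cz: 1.0 for cz in CZ}           # trivial PEF, valid for every power
raw = {(0, 0): 1.3, (1, 0): 0.7, (0, 1): 0.7, (1, 1): 1.3}
M = max(pef_lhs(raw, cond) for cond in CONDS)
F1 = {cz: raw[cz] / M for cz in CZ}   # candidate PEF with power beta

rho = CONDS[0]   # estimated distribution
def rate(F):
    return sum(0.5 * rho[z][c] * math.log(F[(c, z)]) for (c, z) in CZ) / beta

def mix(lam):
    return {cz: lam * F1[cz] + (1 - lam) * F0[cz] for cz in CZ}

# Optimize only over the mixing weight; Thm. 46 guarantees validity:
lam_best = max((i / 100 for i in range(101)), key=lambda lam: rate(mix(lam)))
for lam in (0.0, 0.5, lam_best, 1.0):
    assert all(pef_lhs(mix(lam), cond) <= 1 + 1e-12 for cond in CONDS)
assert rate(mix(lam_best)) >= rate(F1) - 1e-12
```

With more candidate PEFs the search is over a simplex of the corresponding dimension, still far smaller than the space of all functions of CZ.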

VII. REDUCING SETTINGS ENTROPY
A. Sets of Distributions that are Free for Z
In our applications, the sets of distributions C are determined by constraints on the Z-conditional distributions of C and separate constraints on the marginal distributions of Z. We first formalize this class of sets. As defined in Sect. II, for any RV X, S_X = {ρ : ρ is a distribution of X} and Cvx(X) is the convex closure of X.
In the case of Bell tests, C_{C|Z} consists of the set of non-signaling distributions, possibly satisfying additional quantum constraints. If Z is a settings choice RV with known distribution ν, then C_Z = {ν}. For ρ_{C|Z} = (ρ_{C|z})_z ∈ S_{C|Z}, we define ρ_{C|Z} ⊗ ν by (ρ_{C|Z} ⊗ ν)(cz) = ρ_{C|z}(c)ν(z). With this, C in Eq. 131 can be written as Cvx(C_{C|Z} ⊗ C_Z). If C_Z = {ν}, then C_{C|Z} ⊗ C_Z = C_{C|Z} ⊗ ν is already convex closed, so we can omit the convex-closure operation.
The extreme points of C_{C|Z} ⊗ ν consist of the set of ρ_{C|Z} ⊗ ν with ρ_{C|Z} extremal in C_{C|Z}. In general, we have:
Lemma 48. Extr(Cvx(C_{C|Z} ⊗ C_Z)) is contained in the topological closure of Extr(C_{C|Z}) ⊗ Extr(C_Z).
Proof. C_{C|Z} and C_Z are bounded closed sets with finite dimension, hence compact. The operation ⊗ is continuous and therefore maps bounded closed sets to bounded closed sets. So C_{C|Z} ⊗ C_Z is bounded and closed, as is the closure of Extr(C_{C|Z}) ⊗ Extr(C_Z), which is contained in C_{C|Z} ⊗ C_Z. By bilinearity of ⊗, we have C_{C|Z} ⊗ C_Z ⊆ Cvx(Extr(C_{C|Z}) ⊗ Extr(C_Z)). Since convex closure is idempotent, Cvx(C_{C|Z} ⊗ C_Z) ⊆ Cvx(Extr(C_{C|Z}) ⊗ Extr(C_Z)). We conclude that these two convex sets are identical. Every bounded closed set contains the extreme points of its convex closure (Thm. 1.4.5 of Ref. [46]). Accordingly, Extr(Cvx(C_{C|Z} ⊗ C_Z)) is contained in the closure of Extr(C_{C|Z}) ⊗ Extr(C_Z).
We note that the members of Extr(C_{C|Z}) ⊗ Extr(C_Z) are extremal. This can be seen as follows: Let the member ρ_{C|Z} ⊗ ν of Extr(C_{C|Z}) ⊗ Extr(C_Z) be a finite convex combination of members of C_{C|Z} ⊗ C_Z, written as ρ_{C|Z} ⊗ ν = Σ_i λ_i ρ_{C|Z;i} ⊗ ν_i. By marginalizing to Z and by extremality of ν, it follows that ν_i = ν for all i. Given this, and extremality of ρ_{C|Z}, we can also conclude that ρ_{C|Z;i} = ρ_{C|Z}. Thus the inclusion in Lem. 48 becomes an equality if we topologically close the left-hand side. In particular, if C_{C|Z} and C_Z are polytopes, then the right-hand side is finite and therefore Cvx(C_{C|Z} ⊗ C_Z) is also a polytope with the expected extreme points, a fact that we exploit for finding PEFs by our methods.
We fix the convex set C_{C|Z} ⊆ S_{C|Z} for the remainder of this section. For brevity, instead of referring to properties "for C = Cvx(C_{C|Z} ⊗ C_Z)", we just say "for C_Z", provided C_{C|Z} is clear from context.

B. Gain Rates for Biased Settings
For randomness expansion, it is desirable to minimize the entropy of µ[Z], since this is the main contributor to the number of random bits required as input to the protocol. The other contributor is the seed required for the extractor, but good extractors already use relatively little seed. We expect that reducing the entropy of µ[Z] may reduce the asymptotic gain rate. However, in the case where we have uniform bounds on the maximum probability as required for Thm. 45, we can show that the reduction is limited.
Theorem 49. Let ρ_{C|Z} ∈ C_{C|Z}. Let γ be a positive function of DZ. Let B be a function of CZ satisfying Eq. 124 for C_Z = {Unif_Z}. Assume that b̄ = E_{ρ_{C|Z} ⊗ Unif_Z}(B) > 0. Let ν be a distribution of Z with p_min = min_z ν(z) > 0. Then the asymptotic gain rate at ρ_{C|Z} ⊗ ν for C_Z = {ν} is at least − log(1 − b̄) + E_{ρ_{C|Z} ⊗ ν}(log(γ(DZ))).
Proof. By our assumptions on B, for all σ_{C|Z} ∈ C_{C|Z}, 1 − E_{σ_{C|Z} ⊗ Unif_Z}(B(CZ)) ≥ max_{dz} γ(dz)P_{σ_{C|Z} ⊗ Unif_Z}(d|z), which is Eq. 124. Since P_{σ_{C|Z} ⊗ ν}(d|z) = P_{σ_{C|Z} ⊗ Unif_Z}(d|z), we can apply Eq. 132 to conclude that B′ defined by B′(CZ) = B(CZ)q/ν(Z), with q = 1/|Rng(Z)|, satisfies Eq. 124 for C_Z = {ν}; moreover E_{ρ_{C|Z} ⊗ ν}(B′) = b̄. According to Thm. 45, the function F_β obtained from B′ with α = b̄ is a PEF with power β for C_Z = {ν}, provided that F_β ≥ 0. Note that Eq. 124 implies b̄ < 1.
Since B′ ≥ −wq/p_min with w = − min(0, min(B)), for sufficiently small β the condition F_β ≥ 0 is satisfied. Specifically, we require that β < 1/a, where we define a = (wq/p_min + b̄)/(1 − b̄). The asymptotic gain rate g from the theorem statement satisfies g ≥ lim_{β→0⁺} E_{ρ_{C|Z} ⊗ ν}(log(F_β))/β. (134) The expression inside the limit is − log(1 − b̄) + E_{ρ_{C|Z} ⊗ ν}(log(γ(DZ))) + E_{ρ_{C|Z} ⊗ ν}(log(1 + βX))/β, (135) where X = (B′ − b̄)/(1 − b̄) satisfies X ≥ −a and E_{ρ_{C|Z} ⊗ ν}(X) = 0. In general, given βa < 1 and x ≥ −a, we can approximate log(1 + βx) from Taylor expansion around x = 0 with a second-order remainder to get log(1 + βx) ≥ βx − O(β²x²). (136) Substituting X for x in the inequality of Eq. 136 and applying expectations to both sides gives E_{ρ_{C|Z} ⊗ ν}(log(1 + βX))/β ≥ −O₋(β)E(X²), where E(X²) is the variance of X and the minus sign on the order notation emphasizes that the lower bound is negative. The expression in Eq. 135 is therefore bounded below by − log(1 − b̄) + E_{ρ_{C|Z} ⊗ ν}(log(γ(DZ))) − O(β), which goes to − log(1 − b̄) + E_{ρ_{C|Z} ⊗ ν}(log(γ(DZ))) as β → 0⁺. In view of Eq. 134, the theorem follows.

C. Spot-Checking Settings Distributions
To approach the asymptotic gain rate in Thm. 49 requires small powers, which negatively impact the net log-prob rate. We analyze the effect on the net log-prob rate in the case where µ[Z] is a mixture of a deterministic distribution δ_{z₀} and Unif_Z. This corresponds to using the same setting for most trials and randomly choosing test trials with uniform settings distribution. This is referred to as a spot-checking strategy [22]. Later we consider a protocol where one random trial out of a block of 2^k trials has Z uniformly distributed.
To simplify the analysis, we take advantage of the fact that for configurations such as those of Bell tests, we can hide the choice of whether or not to apply a test trial from the devices. This corresponds to appending a test bit T to Z, where T = 1 indicates a test trial and T = 0 indicates a fixed one, with Z = z₀. The set C_{C|ZT} is obtained from C_{C|Z} by constraining µ[C|zt] to be independent of t for each z and requiring (µ[C|zt])_z ∈ C_{C|Z}. For any ν_{C|Z} ∈ C_{C|Z} there is a corresponding ν̄_{C|ZT} ∈ C_{C|ZT} defined by ν̄_{C|zt}(c) = ν_{C|z}(c) for all c, z and t. The map ν_{C|Z} → ν̄_{C|ZT} is a bijection.
Let q = 1/|Rng(Z)| and ρ_{C|Z} ∈ C_{C|Z}. Let ν_r be the probability distribution of ZT defined by ν_r(z1) = rq and ν_r(z0) = (1 − r)δ_{z,z₀} for some value z₀ of Z. Since we are analyzing the case where r is small, we assume r < 1/2. The entropy of the distribution ν_r is given by S(ν_r) = H(r) + r log(1/q), where H(r) = −r log(r) − (1 − r) log(1 − r). Let B be a function of CZ satisfying Eq. 118 for C_Z = {Unif_Z}, and set b̄ = E_{ρ_{C|Z} ⊗ Unif_Z}(B(CZ)). From Eq. 118, it follows that b̄ < 1. Define B_r(CZT) by B_r(czt) = tB(cz)/r, (138) and let F_{r,β} be the PEF obtained from B_r by the tangent-line construction of Thm. 44 with α = b̄. (139) Setting the function B_r to zero when T = 0 is convenient. Because F_{r,β} is related to the tangent line of (1 − x)^{−β} at x = b̄ and B_r = 0 corresponds to x = 0, the expected value of F_{r,β} is slightly below 1 for non-test trials.
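The key property of B_r is that its expectation does not depend on r. A minimal numeric sketch (illustrative values of B and σ_{C|Z}, not from the paper) checks this and the claim that the PEF value on non-test trials is slightly below 1.

```python
import math

# Illustrative B(c,z) and sigma(c|z) on bits (not from the paper):
Bval = {(0, 0): 0.4, (1, 0): -0.6, (0, 1): 0.5, (1, 1): -0.2}
cond = [[0.8, 0.2], [0.3, 0.7]]   # sigma(c|z), rows indexed by z
q = 0.5                           # uniform settings, |Rng(Z)| = 2

bbar = sum(q * cond[z][c] * Bval[(c, z)] for c in (0, 1) for z in (0, 1))

def expect_Br(r):
    # B_r(czt) = t*B(cz)/r: test trials (T=1, prob r*q per setting)
    # contribute B/r, non-test trials contribute 0.
    return sum(r * q * cond[z][c] * (Bval[(c, z)] / r)
               for c in (0, 1) for z in (0, 1))

for r in (0.5, 0.1, 0.01, 0.001):
    assert abs(expect_Br(r) - bbar) < 1e-12   # expectation independent of r

# On non-test trials B_r = 0, so the PEF takes the value of the tangent line
# of (1-x)^(-beta) at x = bbar, evaluated at x = 0: strictly below 1.
beta = 0.01
f_nontest = (1 - bbar) ** (-beta) * (1 - beta * bbar / (1 - bbar))
assert 0 < f_nontest < 1.0
```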
Theorem 50. There exist constants d > 0 and d′ ≥ 0 independent of r such that for β ≤ dr, F_{r,β} as defined in Eq. 139 is a PEF with power β for C_{C|ZT} ⊗ ν_r, and its log-prob rate g_{r,β} at ρ̄_{C|ZT} ⊗ ν_r satisfies g_{r,β} ≥ − log(1 − b̄) − d′β/r.
In practical situations, we anticipate using numerical optimization to determine the PEFs, which we expect to improve on the bounds in the theorem. However, we believe that the constants d and d′ obtained in the proof are reasonable. They are given by d = (1 − b̄)/(2w + b̄) and d′ = 2c, with c the r-independent bound on rv_r obtained in the proof and w = − min(0, min(B)).
Proof. The proof is a refinement of that of Thm. 49. For any σ_{C|Z} ∈ S_{C|Z}, consider the expectation of B_r with respect to σ̄_{C|ZT} ⊗ ν_r: E_{σ̄_{C|ZT} ⊗ ν_r}(B_r(CZT)) = Σ_{cz} rq σ_{C|z}(c)B(cz)/r = E_{σ_{C|Z} ⊗ Unif_Z}(B(CZ)). (141)
Since P_{σ̄_{C|ZT} ⊗ ν_r}(d|zt) = P_{σ_{C|Z} ⊗ Unif_Z}(d|z) and B satisfies Eq. 118 for C_Z = {Unif_Z}, so does B_r but for C_{ZT} = {ν_r}. Thus Thm. 44 applies, and we have that F_{r,β} as defined in Eq. 139 is a PEF with power β for C_{C|ZT} ⊗ ν_r, provided β is small enough. From Eq. 138 and the fact that min(B) ≥ −w, we have that B_r ≥ −w/r. From the proof of Thm. 49, we can replace wq/p_min by w/r in the expression for a there to see that it suffices to satisfy β < 1/a = (1 − b̄)/(w/r + b̄) in order to make sure F_{r,β} ≥ 0. The upper bound can be estimated as 1/a = r(1 − b̄)/(w + rb̄) ≥ r(1 − b̄)/(w + b̄/2), given our assumption that r < 1/2 and b̄ > 0. With foresight, we set d = (1 − b̄)/(2w + b̄) < 1/(2ra). With β ≤ dr, this implies F_{r,β} > 0 and βa < 1/2.
Next we lower bound E_{ρ̄_{C|ZT} ⊗ ν_r}(log(F_{r,β}))/β by the same strategy that we used in the proof of Thm. 49. The result is g_{r,β} ≥ − log(1 − b̄) − 2βv_r, (143) where here we define v_r = E_{ρ̄_{C|ZT} ⊗ ν_r}(X_r²) with X_r = (B_r − b̄)/(1 − b̄). Note that from Eq. 141 and the definition of b̄, E_{ρ̄_{C|ZT} ⊗ ν_r}(X_r) = 0. An explicit expression for v_r is obtained by evaluating the contributions of test and non-test trials separately. For the second term in the last line, we apply the identity E(U²) = E((U − E(U))²) + E(U)², relating the expectation of the square to the variance. Combining this with Eq. 144 gives an upper bound on rv_r that is independent of r. Write c for the right-hand side of this inequality. Substituting in Eq. 143, we get g_{r,β} ≥ − log(1 − b̄) − 2cβ/r, because our earlier choice for d implies βa < 1/2. We now set d′ = 2c to complete the proof of the theorem.
From the proof above, we can extract the variance of F_{r,β} at ρ̄_{C|ZT} ⊗ ν_r, from which we can obtain a bound on the variance of log(F_{r,β}). Since we refer to it below when discussing the statistical performance of exponential randomness expansion, we record the result here.

D. Exponential Randomness Expansion
For PEFs satisfying the conditions for Thm. 50, it is possible to achieve exponential expansion of the entropy of Z.
Theorem 52. Let F_{r,β} be the family of PEFs from Thm. 50. Consider n trials with model H(C_{C|ZT} ⊗ ν_r) and a fixed error bound ε_h. Consider ρ_{C|Z} ∈ C_{C|Z}, let g_{r,β} be the log-prob rate of F_{r,β} at ρ̄_{C|ZT} ⊗ ν_r, and assume that the trials are i.i.d. for CZT with this distribution. We can choose r = r_n and β = β_n as functions of n so that the expected net log-prob is g_net = n g_{r_n,β_n} − log(1/ε_h)/β_n ≥ n(− log(1 − b̄))/3 = e^{Ω(nS(ν_{r_n}))}, where nS(ν_{r_n}) is the total input entropy of ZT.
The performance in the theorem statement may be compared to that in Ref. [22], where the settings entropy for n trials is Ω(log³(n)) (see the end of Sect. 1.1 of the reference). With our terminology, this implies a net log-prob e^{O(S^{1/3})}, where S is the total input entropy, which is subexponential in the input entropy. It is also possible to match the soundness performance of the "one-shot" protocol of Ref. [21], Cor. 1.2, by modifying the proof below with r_n and β_n proportional to n^{−ω} with 0 < ω < 1 instead of ω = 1. However, the referenced results are valid for quantum side information.
The proof of Thm. 52 given below computes explicit constants for the orders given in the theorem statement. The constants depend on the PEF construction in the previous section, and the performance can always be improved by direct PEF optimization. As a result we do not expect the constants to be needed in practice.
The i.i.d. assumption is a completeness requirement that we expect to hold approximately if the devices perform as designed. Explicit constants, expressed in terms of those in Thm. 50, are given with Eq. 153 below. Notably, the expected number of test trials is independent of $n$ and proportional to $\log(1/\epsilon_h)$, which is what one would hope for in the absence of special strategies allowing feed-forward of generated randomness from independent devices, an example of which is the cross-feeding protocol of Ref. [24]. However, since non-test trials contribute negatively to the log-prob, this means that the full gain and the error bound are witnessed by a constant number of test trials. One should therefore expect a large instance-to-instance variation in the actual log-prob obtained, raising the question of whether the success probability, or the expected number of trials before reaching a log-prob goal, is sufficiently well behaved. We consider this question after the proof.
Proof. Let $d$ and $d'$ be the constants in Thm. 50. From this theorem, the expected net log-prob is bounded as in Eq. 151, with $\bar b$ as defined for Thm. 50. The input entropy per trial, $S(\nu_r)=H(r)+r\log(1/q)$, is bounded above by $-2r\log(r)$, provided we take $r\le q/e$. To see this, note that $r\log(1/q)\le r\log(1/(re))=-r\log(r)-r$, and $-(1-r)\log(1-r)\le r$ since the function $x\mapsto-(1-x)\log(1-x)$ is concave for $0\le x<1$, vanishes at $x=0$, and has slope $1$ there. If we choose $r_n=c'/n$ with $c'$ to be determined later, the expected number of test trials is $c'$ and the total input entropy satisfies $nS(\nu_{r_n})\le 2c'\log(n/c')$, or equivalently $n\ge c' e^{nS(\nu_{r_n})/(2c')}$. For sufficiently large $n$, $r_n\le q/e$ is satisfied. If we then choose $\beta_n=c r_n=cc'/n$ with $c\le d$ such that $g_{\rm net}$ grows linearly with $n$, we have accomplished our goal. Write $g_0=-\log(1-\bar b)$. Substituting the expressions for $\beta_n$ and $r_n$ in Eq. 151 then implies the theorem.
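The entropy bound used in this step can be checked numerically. The sketch below is our own illustration, not part of the paper: it evaluates $S(\nu_r)=H(r)+r\log(1/q)$ with natural logarithms and confirms $S(\nu_r)\le -2r\log(r)$ for several $r\le q/e$, taking $q=1/4$ as an example smallest settings probability.

```python
import math

def binary_entropy(r):
    """H(r) = -r log r - (1-r) log(1-r), natural logarithm."""
    if r in (0.0, 1.0):
        return 0.0
    return -r * math.log(r) - (1 - r) * math.log(1 - r)

def settings_entropy(r, q):
    """Input entropy per trial: S(nu_r) = H(r) + r log(1/q)."""
    return binary_entropy(r) + r * math.log(1 / q)

q = 0.25  # example: uniform choice among 4 settings on test trials
for r in [1e-6, 1e-4, 1e-2, q / math.e]:
    # The bound from the proof: S(nu_r) <= -2 r log r for r <= q/e
    assert settings_entropy(r, q) <= -2 * r * math.log(r)
```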
According to Lem. 51, the variance at $\hat\rho_{C|ZT}\nu_{r_n}$ of the quantity $n\log(F_{r_n,\beta_n})/\beta_n$, whose expectation is the expected log-prob for the parameters chosen in the proof above, is given by $v=n v_{r_n,\beta_n}/\beta_n^2\le 4\log(2)^2 n r_n\bar v/\beta_n^2$, where the last inequality uses Eq. 140. In view of Eq. 152, the expected log-prob is at least $2ng_0/3$, with up to $1/2$ of that taken away by the error-bound requirement. Here we are interested in the distribution of the net log-prob, which depends on the actual trial results. The amount subtracted from the log-prob to obtain the net log-prob for the error bound is independent of the trial results and given by $f=n\log(1/\epsilon_h)/(cc')$. By increasing $c'$ by a constant factor while holding $c$ fixed, we can reduce $f$ and still achieve exponential expansion, albeit with a smaller constant in the exponent. The first two terms on the right-hand side of Eq. 152 are independent of $c'$ and provide a lower bound on the expected log-prob. These considerations imply that if the variance is small enough compared to $(ng_0/6)^2$ (for example), then with correspondingly high probability the net log-prob exceeds $ng_0/6$. Small variance is implied by large $c'$, because according to Eq. 154, the variance of the log-prob is reduced if $c'$ is increased. To be specific, let $\kappa\ge\log(1/\epsilon_h)$. We can set $c'=3\kappa/(cg_0)$, from which $v\le 2\log(2)^2 n^2 g_0^2\min(3dd'/g_0,1)/(9\kappa)\le 2\log(2)^2 n^2 g_0^2/(9\kappa)$.
By Chebyshev's inequality, a conservative upper bound on the probability that $n\log(F_{r_n,\beta_n})/\beta_n<2ng_0/3-\sqrt{v}/\lambda$ is $\lambda^2$. Since at most $ng_0/3$ is subtracted to obtain the net log-prob, if we require that $\sqrt{v}/\lambda\le ng_0/6$, then with probability at least $1-\lambda^2$ we get a net log-prob of at least $ng_0/6$, which is sufficient for exponential expansion. From Eq. 156, we can achieve this if $2\log(2)^2 n^2 g_0^2/(9\kappa\lambda^2)\le n^2g_0^2/36$, that is, if $\kappa\ge 8\log(2)^2/\lambda^2$. In terms of $c'$, this means that $c'$ needs to be set to $\max(3\log(1/\epsilon_h),24\log(2)^2/\lambda^2)/(cg_0)$ for a probability of at least $1-\lambda^2$ of achieving a net log-prob of $ng_0/6$. Better bounds on these probabilities can be obtained by taking the i.i.d. assumptions into consideration; for example, one can apply Chernoff-Hoeffding bounds to improve the estimates. Alternatively, we can use the option of acquiring trials until a desired net log-prob is obtained, with a conservative upper bound on the maximum number of trials.
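The threshold on $\kappa$ can be verified by direct arithmetic. The following sketch (our own check, with a hypothetical value of $\lambda$) confirms that at $\kappa = 8\log(2)^2/\lambda^2$ the variance requirement is met with equality, the common factor $n^2 g_0^2$ cancelling on both sides.

```python
import math

lam = 0.1  # hypothetical confidence parameter lambda
kappa = 8 * math.log(2) ** 2 / lam ** 2  # threshold from the text
# Requirement: 2 log(2)^2 n^2 g0^2 / (9 kappa lambda^2) <= n^2 g0^2 / 36.
# After cancelling n^2 g0^2, check the remaining inequality:
lhs = 2 * math.log(2) ** 2 / (9 * kappa * lam ** 2)
assert lhs <= 1 / 36 + 1e-12  # holds with equality at the threshold
```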

E. Block-wise Spot-Checking
For randomness expansion without having to decompress uniform bits for generating biased distributions of $Z$, we suggest the following spot-checking strategy consisting of blocks of trials. Each block nominally contains $2^k$ trials. One of them is chosen uniformly at random as a test trial, for which we use the random variable $T_s$, now valued between $0$ and $2^k-1$ with a uniform distribution. The $l$'th trial in a block uses $Z=z_0$ (distribution $\delta_{z_0}$), unless $l=T_s+1$, in which case the distribution of $Z$ is ${\rm Unif}_Z$. For analyzing these trials, observe that the distribution of $Z$ at the $l$'th trial, conditional on not having encountered the test trial yet (that is, given that $l\le T_s+1$), is a mixture of $\delta_{z_0}$ and ${\rm Unif}_Z$, with the probability of the latter being $1/(2^k-l+1)$. Thus we can process the trial using an adaptive strategy, designing the PEF for the $l$'th trial according to Thm. 50, where the distribution of $T$ has probability $r_l=1/(2^k-l+1)$ of $T=1$. We call this the block-wise spot-checking strategy.
Here we must choose a single power for the PEFs, independent of $l$. Once the test trial is encountered, the remaining trials in the block do not need to be performed, as we can equivalently set the PEFs to be trivial from this point on. (Of course, if there is verifiable randomness even when the setting is fixed and known by the external entity and devices, we can continue.) The average number of trials per block is therefore $(2^k+1)/2$. Note that $r_{2^k-l+1}=1/l$ and the probability of reaching the $(2^k-l+1)$'th trial is $l/2^k$. When choosing the common power of the PEFs, the goal is to optimize the expected log-prob per block, taking into account both the way the settings probabilities change and the anticipated trial statistics.
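The bookkeeping in this strategy can be verified exactly with rational arithmetic. The sketch below (our own illustration) computes the per-trial conditional test probabilities $r_l = 1/(2^k-l+1)$, checks that each trial is the test trial with overall probability $2^{-k}$, and confirms that the average number of performed trials per block is $(2^k+1)/2$ when the block stops at the test trial.

```python
from fractions import Fraction

k = 4
n = 2 ** k  # nominal trials per block

reach = Fraction(1)          # probability of reaching trial l with no test trial yet
expected_trials = Fraction(0)
for l in range(1, n + 1):
    r_l = Fraction(1, n - l + 1)   # conditional test probability at trial l
    p_test_here = reach * r_l      # unconditional probability the test trial is trial l
    assert p_test_here == Fraction(1, n)
    expected_trials += l * p_test_here  # the block stops after the test trial
    reach *= 1 - r_l

assert reach == 0                       # a test trial always occurs within the block
assert expected_trials == Fraction(n + 1, 2)
```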

A. General Considerations
All our applications are for the (2, 2, 2) Bell-test configuration. In this configuration, $Z=XY$ and $D=C=AB$ with $X,Y,A,B\in\{0,1\}$. The distributions are free for $Z$. The set of conditional distributions consists of those satisfying the non-signaling constraints of Eq. 5. We denote this set by $\mathcal{N}_{C|Z}$. This set includes distributions not achievable by two-party quantum states, so additional quantum constraints can be considered. There is a hierarchy of semidefinite programs to generate such constraints due to Navascués, Pironio and Acín [30] (the NPA hierarchy). Because semidefinite programs define convex sets that are not polytopes and have a continuum of extreme points, we cannot make direct use of the NPA hierarchy. In the absence of a semidefinite hierarchy for the PEF constraints, we require the conditional distributions to form a polytope defined by extreme points. In general, we can use the NPA hierarchy to generate finitely many additional linear constraints to add to the non-signaling constraints. The set of extreme points can then be obtained by available linear programming tools and used for PEF optimization. To keep the number of extreme points low, it helps to add only constraints relevant to the actual distribution expected at a trial; we do not have a systematic way to identify relevant constraints. Below, we consider additional quantum constraints based on Tsirelson's bound [47]. We note that one can instead construct PEFs via Thm. 44, as discussed after that theorem. With this method one can take advantage of the NPA hierarchy directly, but one is restricted to PEFs satisfying Eq. 117.
The extreme points of $\mathcal{N}_{C|Z}$ are given by the deterministic LR distributions and variants of Popescu-Rohrlich (PR) boxes [48]. The deterministic LR distributions are parameterized by two functions $f_A: x\mapsto f_A(x)\in\{0,1\}$ and $f_B: y\mapsto f_B(y)\in\{0,1\}$ defining each station's settings-dependent outcome; this yields the 16 LR extreme points $\nu_{f_A,f_B}$. The PR boxes have distributions $\nu_g$ defined by functions $g: xy\mapsto\{0,1\}$ with $|\{xy\,|\,g(xy)=1\}|$ odd, with addition of values in $\{0,1\}$ modulo 2; the function $g$ indicates for which settings pairs the stations' outcomes are perfectly anti-correlated. There are 8 extreme points in this class. The simplest quantum constraint to add is Tsirelson's bound [47], which can be expressed in the form $\mathbb{E}(B(CZ))\le(\sqrt 2-1)/4$ for all distributions compatible with quantum mechanics, with $B(CZ)$ given by Eq. 22 and the settings distribution fixed to be uniform. There are 8 variants of Tsirelson's bound, expressed as $\mathbb{E}(B_g(CZ))\le(\sqrt 2-1)/4$ and corresponding to the 8 PR boxes labelled by $g$, with $B_g(CZ)$ as defined in the next paragraph. Let $\mathcal{Q}_{C|Z}$ be the set of conditional distributions obtained by adding the eight conditional forms of Eq. 160 to the constraints. For any PR box $\nu_g$, let $p(g)=0$ or $p(g)=1$ according to whether $|g^{-1}(1)|=1$ or $|g^{-1}(1)|=3$, and define $B_g$ in terms of $g$ and $p(g)$. This implies that the Bell function $B(CZ)$ in Eq. 22 can be expressed as $B(CZ)=B_{g:xy\mapsto x\times y}(CZ)$, with $x\times y$ the numeric product of $x$ and $y$. For any distribution $\rho\in\mathcal{N}_{C|Z}{\rm Unif}_Z$, the expectation $\mathbb{E}_\rho(B_g(CZ))$ bounds the contribution of the PR box $\nu_g$ to $\rho$ in the sense that if $\rho=(1-p)\rho_{\rm LR}+p\nu_g$, with $\rho_{\rm LR}$ an LR distribution, then $\mathbb{E}_\rho(B_g(CZ))\le p/4$. From Ref. [49], one determines that the bound is tight in the sense that if $\mathbb{E}_\rho(B_g(CZ))=p/4$, then there exists an LR distribution $\rho_{\rm LR}$ such that $\rho=(1-p)\rho_{\rm LR}+p\nu_g$.
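To make this structure concrete, the sketch below (our own illustration) enumerates the 16 deterministic LR points and the PR box for $g: xy\mapsto x\times y$, and evaluates a normalized Bell function under uniform settings. We take $B$ to be the CHSH winning indicator minus $3/4$; this normalization is an assumption on our part, chosen because it reproduces the bounds quoted in the text (LR value at most 0, PR box value $1/4$, Tsirelson bound $(\sqrt2-1)/4$), and the paper's Eq. 22 may use a different but equivalent form.

```python
from itertools import product

def lr_box(fa, fb):
    """Conditional distribution p(ab|xy) of a deterministic local strategy."""
    return {(a, b, x, y): 1.0 if (a, b) == (fa[x], fb[y]) else 0.0
            for a, b, x, y in product((0, 1), repeat=4)}

def pr_box():
    """PR box for g(xy) = x*y: uniform outcomes with a XOR b = x*y."""
    return {(a, b, x, y): 0.5 if (a ^ b) == x * y else 0.0
            for a, b, x, y in product((0, 1), repeat=4)}

def bell_value(box):
    """E(B) with B = [a XOR b = x*y] - 3/4, settings uniform (prob 1/4 each)."""
    return sum(0.25 * p * ((1.0 if (a ^ b) == x * y else 0.0) - 0.75)
               for (a, b, x, y), p in box.items())

lr_boxes = [lr_box(fa, fb) for fa in product((0, 1), repeat=2)
                           for fb in product((0, 1), repeat=2)]
assert len(lr_boxes) == 16
assert abs(max(bell_value(b) for b in lr_boxes) - 0.0) < 1e-12  # LR bound: 0
assert abs(bell_value(pr_box()) - 0.25) < 1e-12                 # PR box: 1/4
# Tsirelson's bound (sqrt(2)-1)/4 ~ 0.1036 lies strictly between these values.
```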
Here, the relevant LR distributions are convex combinations of the eight deterministic LR distributions for which $\mathbb{E}_{\nu_{f_A,f_B}}(B_g(CZ))=0$. We observe that these are extremal, and all non-signaling distributions can be expressed as convex combinations of extreme points involving at most one PR box. This makes it possible to explicitly determine the extreme points of $\mathcal{Q}_{C|Z}$. They are the convex combinations $(1-q)\nu_{f_A,f_B}+q\nu_g$ with $q=\sqrt 2-1$ that satisfy Tsirelson's bound with equality, lying on the one-dimensional line connecting one of the above 8 deterministic LR distributions to the corresponding PR box. We emphasize that probability estimation is not based on the Bell-inequality framework. However, any PEF $F$ for a Bell-test configuration has the property that $F-1$ is a Bell function and is therefore associated with a Bell inequality. This follows because the conditional probabilities for the extremal LR distributions are either 0 or 1: regardless of the power $\beta$, the PEF constraints from these extreme points are equivalent to the constraints for Bell functions after subtracting 1. One can construct PEFs from Bell functions by reversing this argument and rescaling the resulting factor to satisfy the constraints induced by non-LR extreme points. However, the relationship between measures of quality of Bell functions (such as violation signal-to-noise, winning probability, or statistical strength for rejecting LR) and those of related PEFs is not clear.

B. Applications and Methods
We illustrate probability estimation with three examples. In the first, we estimate the conditional probability distribution of the outcomes observed in the recent loophole-free Bell test with entangled atoms reported in Ref. [8]. In the second, we reanalyze the data used for extracting 256 random bits uniform within 0.001 in Ref. [13]. Finally, we reanalyze the data from the ion experiment of Ref. [1], which was the first to demonstrate min-entropy estimation from a Bell test. For the first example, we explore the log-prob rates and net log-prob rates over a range of choices of parameters for PEF power, error bound, unknown settings bias and test-trial probability. The log-prob rates and net log-prob rates are determined directly from the inferred probabilities of the outcomes, from which the expected net log-prob can be inferred under the assumption that the trials are i.i.d. For the other two examples, we explicitly apply the probability estimation schema and obtain the total log-prob for different scenarios. In each scenario, the parameter choices are made based on training data, following the protocol in Ref. [13]. We show the certifiable min-entropy for the reported data as a function of the error bound and compare to the min-entropies reported in Refs. [1,13].
For determining the best PEFs, we optimize the log-prob rate at the estimated distribution $\nu$ given the power $\beta$. Whether $\mathcal{N}_{C|Z}$ or $\mathcal{Q}_{C|Z}$ is considered, the free-for-$Z$ sets $\mathcal{C}$ obtained accordingly have a finite number of extreme points. We can therefore take advantage of the results in Sect. IV C to solve the optimization problem effectively. Given the list of extreme points $(\rho_i)_{i=1}^q$ of the set $\mathcal{C}$ and the estimated distribution $\nu$ in $\mathcal{C}$, the optimization problem can be stated as follows:

Maximize: $\sum_{cz}\log(F(cz))\,\nu(cz)$
Subject to: $F(cz)\ge 0$ for all $cz$, and $\sum_{cz}\rho_i(cz)F(cz)\rho_i(c|z)^\beta\le 1$ for each $i=1,\ldots,q$.
Since the objective function is concave in $F$ and the constraints are linear, the problem can be solved by any algorithm capable of optimizing non-linear functions with linear constraints on the arguments. In our implementation, we use sequential quadratic programming. Due to numerical imprecision, it is possible that the returned numerical solution does not satisfy the second constraint in Eq. 162, in which case the corresponding PEF is not valid. In this case, we can multiply the returned numerical solution by a positive factor smaller than 1, given by the reciprocal of the maximal left-hand side of the second constraint over the extreme points of $\mathcal{C}$. The re-scaled solution is then a valid PEF.
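The rescaling fix for numerical constraint violations can be sketched as follows. This is our own illustration with a toy candidate $F$, assuming the PEF constraint has the form $\sum_{cz}\rho_i(cz)F(cz)\rho_i(c|z)^\beta\le 1$ with uniform settings; for brevity we use only the 16 deterministic LR extreme points rather than the full extreme-point list.

```python
from itertools import product

beta = 0.05
def lr_conditional(fa, fb):
    """Deterministic LR conditionals p(ab|xy) in {0, 1}."""
    return {(a, b, x, y): 1.0 if (a, b) == (fa[x], fb[y]) else 0.0
            for a, b, x, y in product((0, 1), repeat=4)}

extreme_points = [lr_conditional(fa, fb)
                  for fa in product((0, 1), repeat=2)
                  for fb in product((0, 1), repeat=2)]

def constraint_lhs(F, cond):
    """sum_cz rho(cz) F(cz) rho(c|z)^beta with rho(cz) = cond(c|z)/4."""
    return sum(0.25 * p * F[key] * p ** beta for key, p in cond.items())

# A toy candidate F, imitating an optimizer output with slight violations
F = {key: 1.02 if (key[0] ^ key[1]) == key[2] * key[3] else 0.97
     for key in product((0, 1), repeat=4)}

worst = max(constraint_lhs(F, c) for c in extreme_points)
if worst > 1:  # rescale by the reciprocal of the maximal left-hand side
    F = {key: val / worst for key, val in F.items()}
assert all(constraint_lhs(F, c) <= 1 + 1e-12 for c in extreme_points)
```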
All examples involve inferring an experimental probability distribution, in the first example from the published data, and in the other two from an initial set of trials referred to as the training set. Due to finite statistics, the data's empirical frequencies are unlikely to satisfy the constraints used for probability estimation. It is necessary to ensure that the distribution for which log-prob rates and corresponding PEFs are calculated satisfies the constraints: otherwise the distribution is not consistent with the physical model underlying the randomness generation process, and unrealistically high predicted log-prob rates may be obtained. To ensure self-consistency, before solving the optimization in Eq. 162, we determine the closest constraints-satisfying conditional distribution. For this, we take as input the empirical frequencies $f(cz)$ and find the most likely distribution $\nu$ of $CZ$ with $(\nu[C|z])_z$ in $\mathcal{Q}_{C|Z}$ and $\nu(z)=f(z)$ for each $z$. The objective function of the resulting optimization problem is proportional to the logarithm of the ratio of the likelihood of the observed results with respect to the constrained distribution $\nu$ and that with respect to the empirical frequencies $f$. This problem involves maximizing a concave function over a convex domain and can be solved by widely available tools. Note that in all cases, we ensure that the solution satisfies Tsirelson's bounds. Once $\nu(C|Z)$ has been determined, we use $(\nu[C|z])_z\in\mathcal{C}_{C|Z}$ for the predictions in the first example, and for determining the best PEFs for the data to be analyzed in the other two examples. In each case, we use the model for the settings probability distribution relevant to the situation, not the empirical one $f(z)=\nu(z)$.
An issue that comes up when using physical random number sources to choose the settings is that these sources are almost, but not quite, uniform. Thus, allowances for possible trial-by-trial biases in the settings choices need to be made. We do this by modifying the distribution constraints. Instead of a fixed uniform settings distribution, for any bias $b$ we consider the set $\mathcal{B}_{V,b}$ of distributions $\nu[V]$ within bias $b$ of uniform. For ${\rm Rng}(V)=\{0,1\}$, the extreme points of $\mathcal{B}_{V,b}$ are the distributions $\nu$ of the form $\nu(v)=(1\pm(-1)^v b)/2$. For the (2, 2, 2) configuration, $Z=XY$ and we use $\mathcal{C}_{Z,b}={\rm Cvx}(\mathcal{B}_{X,b}\otimes\mathcal{B}_{Y,b})$ for the settings distribution. We then consider the free-for-$Z$ distributions given by $\mathcal{N}_{C|Z}\mathcal{C}_{Z,b}$ and $\mathcal{Q}_{C|Z}\mathcal{C}_{Z,b}$. Thus, we allow any settings distribution in the convex span of independent settings choices with limited bias. We do not consider the polytope of arbitrarily biased joint settings distributions given by $\mathcal{B}_{Z,b}$, but note that $\mathcal{C}_{Z,b}\subseteq\mathcal{B}_{Z,2b+b^2}$ and $\mathcal{B}_{Z,b^2}\subseteq\mathcal{C}_{Z,b}$. Another polytope that could be used is $\mathcal{B}'_{V,\epsilon}=\{\nu[V]:\nu(v)\ge\epsilon\text{ for all }v\}$; in this notation, $\mathcal{B}_{Z,b}\subseteq\mathcal{B}'_{Z,(1-b)/4}$. It is known that there are Bell inequalities that can be violated for certain quantum states and settings-associated measurements when the settings distribution is arbitrary in $\mathcal{B}'_{Z,\epsilon}$ for all $1/4\ge\epsilon>0$ [50].
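The containment $\mathcal{C}_{Z,b}\subseteq\mathcal{B}_{Z,2b+b^2}$ can be checked on extreme points. The sketch below is our own illustration, measuring joint bias as $|4\nu(z)-1|$ over the four settings pairs: products of marginals with bias $b$ never exceed joint bias $2b+b^2$, and the bound is attained.

```python
from itertools import product

b = 0.03
# Extreme points of B_{X,b} and B_{Y,b}: binary marginals (1 + b)/2, (1 - b)/2
marginals = [((1 + s * b) / 2, (1 - s * b) / 2) for s in (+1, -1)]

max_joint_bias = 0.0
for mx, my in product(marginals, repeat=2):
    for x, y in product((0, 1), repeat=2):
        nu = mx[x] * my[y]  # independent settings choice
        max_joint_bias = max(max_joint_bias, abs(4 * nu - 1))

# All extreme points of C_{Z,b} lie within joint bias 2b + b^2, attained here
assert max_joint_bias <= 2 * b + b ** 2 + 1e-12
assert abs(max_joint_bias - (2 * b + b ** 2)) < 1e-12
```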

C. Exploration of Parameters for a Representative Distribution of the Outcomes
We switch to logarithms base 2 for all log-probs and entropies in the explorations to follow. Thus all explicitly given numerical values for such quantities can be directly related to bits without conversion. For clarity, we indicate the base as a subscript throughout.
For the first example, we use outcome Table XII from the supplementary material of Ref. [8] to determine a representative distribution $\rho_{\rm atoms}$ for state-of-the-art loophole-free Bell tests involving entangled atoms. This experiment involved two neutral atoms in separate traps at a distance of 398 m. They were prepared in a heralded entangled state before choosing uniformly random measurement settings and determining the outcomes. The experiment was loophole-free and showed violation of LR at a significance level of $1.05\times10^{-9}$. The outcome table shows outcome counts for two different initial states labeled $\Psi^+$ and $\Psi^-$. We used the counts for the state $\Psi^-$, which has the better violation. We determined the optimal conditional distribution in $\mathcal{Q}_{C|Z}$ as explained in the previous section, which gives the settings-conditional distribution $\rho_{\rm atoms}$ shown in Table I.
We begin by exploring the relationship between power, $\log_2$-prob rate and net $\log_2$-prob rate. Fig. 1 shows the optimal $\log_2$-prob rates and net $\log_2$-prob rates as a function of the power $\beta$ for a few choices of error bound $\epsilon_h$. As expected, the $\log_2$-prob rates decrease as the power $\beta$ increases. The asymptotic gain rates can be inferred from the values approached as $\beta$ goes to zero; they are approximately 0.088 and 0.191 for $\mathcal{N}_{C|Z}$ and $\mathcal{Q}_{C|Z}$, respectively. For small $\beta$, the constraints $\mathcal{Q}_{C|Z}$ improve the $\log_2$-prob rates by around a factor of 2 over $\mathcal{N}_{C|Z}$. The net $\log_2$-prob rates show the trade-off between error bound and $\log_2$-prob rate, with clear maxima at $\beta>0$.
Another view of the trade-off is obtained from the expected net $\log_2$-prob after 27683 trials at different final error bounds, where we optimize the expected $\log_2$-prob with respect to the power $\beta$. This is shown in Fig. 2. The expected net $\log_2$-prob is essentially the amount of min-entropy that we expect to certify by probability estimation from the measurement outcomes of the 27683 trials in the experiment of Ref. [8]. The number of near-uniform bits that could have been extracted from this experiment's data is substantially larger than that shown in Fig. 8 for another Bell test using two ions.
We determined the robustness of the asymptotic gain rates and the $\log_2$-prob rates against biases in the settings random variables $X$ and $Y$, with the settings distribution constrained by $\mathcal{C}_{Z,b}$ as explained in the previous section. The bias is bounded but may vary within the bounds from trial to trial. Fig. 3 shows how the asymptotic gain rate depends on bias. While the maximum bias tolerated in this case is less than 0.05, based on the results of Ref. [50] we expect that there are quantum states and measurement settings for which all biases strictly below 1 can be tolerated in the ideal case; see also Ref. [51]. Fig. 4 shows the dependence of the $\log_2$-prob rate on the power $\beta$ at three representative biases.
Finally, we consider the challenge of producing more random bits than are consumed. This requires a strategy that minimizes the entropy used for settings. We assume i.i.d. trials, with settings distributed according to $\nu_r(xy)=(1-r)[xy{=}11]+r/4$, which uses the uniform settings distribution with probability $r$ and the default setting $11$ with probability $1-r$. Let $S(r)=-(3r/4)\log_2(r/4)-(1-3r/4)\log_2(1-3r/4)$ be the base-2 entropy of $\nu_r$, $\sigma(\beta)$ the $\log_2$-prob rate for a given PEF with power $\beta$ at $\rho_{\rm atoms}$, and $\epsilon_h=2^{-\kappa}$ the desired error bound, parameterized in terms of $\kappa$ and independent of $n$. If we optimistically assume that all the available min-entropy is extractable, then the expected net number of random bits after $n$ trials is given by $\sigma_{\rm net}(n,\kappa,\beta,r)=n\sigma(\beta)-\kappa/\beta-nS(r)$.
We refer to $\sigma_{\rm net}$ as the expected net entropy. At given values of $n$ and $\kappa$, we can maximize the expected net entropy over $\beta>0$ and $r>0$. Randomness expansion requires that $\sigma_{\rm net}\ge 0$.
Here we do not account for the seed requirements or the extractor constraints. Fig. 5 shows the expected net entropy as a function of $n$ at $\rho_{\rm atoms}$. Of interest here are the break-even points, where the expected net entropy becomes positive. Consider the break-even equation $n\sigma(\beta)-\kappa/\beta-nS(r)=0$, equivalently $\kappa=\beta(\sigma(\beta)-S(r))n$. To minimize the number of trials $n$ required for break-even, it suffices to maximize $\beta(\sigma(\beta)-S(r))$, independently of $\kappa$. (This independence motivates our choice of parametrization of $\epsilon_h$.) Given the maximizing values of $\beta$ and $r$, the critical value of $n$ at the break-even point is $n_c=\kappa/(\beta(\sigma(\beta)-S(r)))$.
One can see that $n_c$ scales linearly with $\kappa$. Following this strategy, we obtained the values of the parameters at the break-even points; they are shown in Table II.
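The break-even computation can be sketched as follows. This is our own illustration: the $\log_2$-prob rate function `sigma` below is a hypothetical stand-in, not the rate computed from $\rho_{\rm atoms}$, so the resulting $n_c$ is illustrative only.

```python
import math

def S(r):
    """Base-2 settings entropy of nu_r for the (2,2,2) configuration."""
    return (-(3 * r / 4) * math.log2(r / 4)
            - (1 - 3 * r / 4) * math.log2(1 - 3 * r / 4))

def sigma(beta):
    """Hypothetical log2-prob rate, decreasing in beta as in Fig. 1."""
    return 0.19 - 0.5 * beta

def rate(beta, r):
    """Quantity to maximize for the smallest break-even n."""
    return beta * (sigma(beta) - S(r))

# Grid search for the maximizing (beta, r); n_c = kappa / max rate
best = max((rate(b / 1000, r / 10000), b / 1000, r / 10000)
           for b in range(1, 300) for r in range(1, 200))
kappa = 10  # error bound 2**-10
n_c = kappa / best[0]
assert S(1) == 2.0  # uniform settings cost two bits per trial
assert best[0] > 0 and n_c > 0
```

As the text notes, $n_c$ scales linearly with $\kappa$ because the maximizing $\beta$ and $r$ do not depend on $\kappa$.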

D. Randomness from an Optical Loophole-Free Bell Test
Ref. [13] describes the generation of 256 bits of randomness within 0.001 of uniform from the optical loophole-free Bell test of Ref. [10], assuming only non-signaling constraints for certification. The data set used is labelled XOR 3 (pulses 3-9) and exhibits a p-value against LR of $2.03\times10^{-7}$ according to the analysis in Ref. [10]. It consists of about $1.82\times10^8$ trials with a low violation of LR per trial. The protocol of Ref. [13] was developed specifically to obtain randomness from Bell-test data with little violation per trial and implicitly involves probability estimation. Table III shows the closest distribution in $\mathcal{Q}_{C|Z}$ inferred from the training set consisting of the first $50\times10^6$ trials of XOR 3.
The analysis in Ref. [13] left open the question of how much min-entropy one can certify in XOR 3 given bias in the settings choices. As reported in Ref. [10], small biases were present in the experiment, due to fluctuations in temperature that affected the physical random sources used and to issues with the high-speed synchronization electronics. A plot of the net $\log_2$-prob achieved as a function of error bound at a few representative biases is shown in Fig. 6. We conclude that randomness certification from the XOR 3 data set is robust against biases substantially larger than the average ones inferred in Ref. [10]. The robustness is similar to that reported in Ref. [10] for the p-values against LR. To obtain the results in Fig. 6, we determined the optimal PEFs and powers according to the conditional distribution in Table III and then applied the PEFs obtained to the remaining trials (the analysis trials). Here, we did not adapt the PEFs while processing the analysis trials.
We also determined the asymptotic gain rates from the conditional distribution in Table III. The dependence of the asymptotic gain rate on the bias parameter $b$ is shown in Fig. 7. Notably, the robustness against bias is similar to that for $\rho_{\rm atoms}$ shown in Fig. 3. This may be because both experiments were designed for the same family of Bell inequalities. For higher Bell-test bias tolerance, different inequalities and measurement settings need to be used [50].

E. Reanalysis of "Random Numbers Certified by Bell's Theorem [1]"
Finally, we applied the procedure used in reanalyzing the XOR 3 data set in the previous section to the data from the experiment reported in Ref. [1]. This experiment involved two ions separated by about 1 m in two different ion traps. As a Bell test, the experiment closed the detection loophole but was subject to the locality loophole. The authors certified a min-entropy of 42 bits with respect to classical side information at an error bound of 0.01 with 3016 trials, assuming quantum constraints (implemented by the NPA hierarchy) for this certification. For the reanalysis, we set aside a training set consisting of the first 1000 trials to estimate a constraints-satisfying conditional distribution; the distribution obtained is given in Table IV. We optimized PEFs and their powers as described in the previous section and applied them to the remaining 2016 trials, the analysis set. The result is shown in Fig. 8. We expect that if it had been possible to optimize the PEFs based on the calibration experiments preceding the 3016 trials, then the $\log_2$-probs obtained would have been approximately 50 % larger, assuming that the trial statistics were sufficiently stable.

FIG. 6. Net $\log_2$-probs achieved in the XOR 3 data set from Ref. [10]. This is the net $\log_2$-prob achieved by probability estimation applied to the analysis set after determining the best PEF and power according to the conditional distribution inferred from the training set. The error bound is an input parameter. The curves show the achieved net $\log_2$-probs for non-signaling ($\mathcal{N}_{C|Z}$) and quantum ($\mathcal{Q}_{C|Z}$) constraints and three representative biases. Here, we performed probability estimation according to Eq. 52. The curve that is lowest on the right is the net $\log_2$-prob reported in Ref. [13], where only non-signaling constraints were exploited.

TABLE I. Inferred settings-conditional distribution $\rho_{\rm atoms}(ab|xy)$. Shown are the settings-conditional probabilities of the outcomes estimated from all trials with the prepared state $\Psi^-$ reported in Ref. [8]. This is the $\mathcal{Q}_{C|Z}$-satisfying distribution based on the outcomes from 27683 heralded trials. Each column is for one combination of outcomes $ab$, and the rows correspond to the settings combinations $xy$. Because the probabilities are settings-conditional, each row's probabilities add to 1, except for rounding error. We give the numbers at full numerical precision for traceability. They are used for determining PEFs, not to make a statement about the actual distribution of the outcomes in the experiment. Statistical error does not affect the validity of the PEFs obtained (see the paragraph after Eq. 162).