Unbiased estimation of sampling variance for Simpson’s diversity index

Quantification of measurement uncertainty is a cornerstone of scientific inquiry. Yet, for ecological measures of diversity obtaining reliable estimates of this uncertainty is a surprisingly challenging statistical problem. Here, a solution to this problem is presented in the form of an unbiased estimator for the sampling variance of Simpson’s diversity index. Numerical tests show that this estimator outperforms currently used methods by large margins in small samples. We illustrate uses of the estimator for quantifying diversity in marine ecology and immunology.

Quantification of measurement uncertainty is crucial for robust scientific inference, yet accurate estimates of this uncertainty remain elusive for ecological measures of diversity.Here, we address this longstanding challenge by deriving a closed-form unbiased estimator for the sampling variance of Simpson's diversity index.In numerical tests the estimator consistently outperforms existing approaches, particularly for applications in which species richness exceeds sample size.We apply the estimator to quantify biodiversity loss in marine ecosystems and to demonstrate ligand-dependent contributions of T cell receptor chains to specificity, illustrating its versatility across fields.The novel estimator provides researchers with a reliable method for comparing diversity between samples, essential for quantifying biodiversity trends and making informed conservation decisions.
Living systems are characterized by immense diversity across multiple scales from molecules to ecosystems [1][2][3][4][5].Quantitatively understanding how this diversity is produced and supports biological function has been a central question in the physics of living systems: On the ecosystem scale, statistical physics approaches have for instance identified conditions under which diverse interacting species can be stably maintained [1][2][3], while on the molecular and cellular scale, probabilistic modelling has shed light on how antibody and T cell receptor diversity in the adaptive immune system are generated [4][5][6][7].
To compare predictions from ecological theory to experimentally measured diversities requires quantification of measurement uncertainty; similarly comparing changes in biodiversity in response to habitat loss or climate change requires determining whether diversity differs more than expected by sampling chance alone [8][9][10][11].A number of estimators have been proposed to quantify sampling variance in diversity estimation [12][13][14][15][16].However, none of the current estimators are unbiased outside of asymptotically large samples and numerical tests suggests that their finite sample bias can be severe.This is a practically important limitation, as overestimation of sampling uncertainty diminishes the ability to detect true trends in biodiversity.Reversely, underestimating sampling variance can inflate the apparent significance of observed changes in diversity, potentially leading to spurious conclusions.
Here, we address this gap by deriving an unbiased estimator of the sampling variance of Simpson's index.We show the superior performance of the estimator compared to previous approaches on simulated and real data and apply it to establish ligand-dependent differences in T cell receptor diversity.Simpson's index is used widely from ecology [12,17] and microbiology [14,18] to economics [19] as a measure of species diversity.The index is defined as the probability of species coincidence of a random pair of individuals, where p i is the frequency of species i in the population i = 1, ..., S and S the number of distinct species.This index can be converted into an effective number of species D = 1/p C [17].Importantly, the unbiased estimator introduced here provides for the first time interval estimates for diversity that do not depend systematically on sample size.It thus addresses a recently highlighted gap in methods [10] to compare diversity estimates between samples of different sizes without rarefaction.
To illustrate the problem setting we reanalyze data from Albano et al. [11] (Fig. 1), who considered the following question: Has climate change led to a biodiversity loss in the Mediterranean sea? Fig. 1A shows the raw data, the number n i of counted molluscs belonging icance of the difference in diversity.Data: Non-invasive mollusc species found on shallow subtidal ground at 12m depth of the coast of Ashqelon [11].
to species i in a patch on the sea floor (top) and corresponding counts for empty mollusc shells found at the same site in death assemblages (bottom).A diversity index turns these species counts into scalar measures of current and past biodiversity (points in Fig. 1B).However, less than 500 molluscs were counted across 65 distinct species, so some sampling variability in the diversity estimates is expected.The method we introduce allows robust quantification of this sampling variance (error bars in Fig. 1B).Here addition of confidence intervals rules out sampling variability as the sole source of the differences supporting the conclusion that biodiversity has decreased significantly.

THE UNBIASED ESTIMATOR
In 1949 Simpson published a short letter in Nature that proposed the index that now bears his name [12].In the same publication Simpson also showed that provides an unbiased estimate of underlying population diversity from a finite sample of size N .Here n i with S i n i = N is the number of counts of the i-th species in the sample, which follows a multinomial distribution under the commonly used assumption that each individual from the population is sampled with equal probability [12,14,15].
Here we propose that the variance Var( pC ) of the point estimate (Eq.2) can be calculated without bias using the following closed-form estimator: where and .

BACKGROUND AND DERIVATION
For the following derivation, it is instructive to recall the insight leading to the unbiased point estimator given in Eq. 8. To estimate p C from a sample, it is tempting to simply replace the population frequencies p i in Eq. 1 with the sampled frequencies f i = n i /N .However, one is well advised to resist this temptation as such plugin estimators are known to be severely biased in small samples [12,13,15,20].Instead Eq. 2 should be used, which is the probability of coincidence when drawing pairs of items from the sample without replacement.This estimator of p C is unbiased, i.e. ⟨ pC ⟩ = p C , where ⟨.⟩ is an average over repeated samples of a fixed size.Evaluating the expectation requires calculating the factorial moment ⟨n i (n i − 1)⟩, which can be calculated most conveniently from the probability generating function G(z 1 , . . ., z S ) = ⟨ i z ni i ⟩.From the definition of G it follows that ∂ ∂z 2 i G(1, . . ., 1) = ⟨n i (n i − 1)⟩.For the multinomial distribution G(z 1 , . . ., z S ) = ( i p i z i ) N , and thus which can be plugged into Eq. 2 to complete the proof that pC is unbiased.We now turn to reviewing the state of the art for variance estimation for Simpson's index.We first recall the formula for the sampling variance of the point estimator was again already given by Simpson [12] (derived in the appendix for completeness), This estimator is a linear combination of the triplet coincidence probability, the square of the coincidence probability p 2 C and p C with sample size dependent parameters a, b and c given in Eq. 6.Based on this formula, Grundmann et al. [14] proposed estimating variance by plugging the empirical frequencies f i = n i /N into an asymptotic expansion of Eq. 8 for N → ∞, However, we will find that this plugin estimator is substantially biased for small N , similarly to plugin estimators of diversity indices themselves.We will also find substantial biases for the non-asymptotic plugin estimator In addition to these plugin estimators, another popular approach is due to Chao et al. [15].This method is widely used in the field due to its implementation in the R package iNEXT [21].Chao's method estimates variances by bootstrapping from a population constructed from the sample by coverage-reweighting observed species frequencies and by augmenting the sample with an estimated number of rare unseen species.Surprisingly, despite its widespread use, we will find that this estimator has the largest bias and variance of tested methods.
Can we generalize the coincidence counting approach underlying the unbiased point estimate to the problem of interval estimation?To derive a better estimator for the variance of Simpson's index we exploit the linearity of Eq. 8 and decompose the problem into the unbiased estimation of each of the three terms.For the first term, the analogy to pC suggests to estimate the triplet probability using Eq. 5. To show that this estimator is indeed unbiased requires calculating the third factorial moment ⟨n i (n i − 1)(n i − 2)⟩.This factorial moment can again be calculated by taking derivatives of the probability generating function G of the multinomial distribution, which can be plugged into the definition of pT to demonstrate its absence of bias.For the second term, we reexpress the squared coincidence probability as where we have used the variance decomposition formula, , and the unbiasedness of Simpson's estimator, ⟨ pC ⟩ = p C .Plugging this expression into Eq.8 and solving for Var( pC ) yields Finally, the third term can be estimated using Eq. 2.
Combining these results proves our central finding, the unbiasedness of the estimator proposed in Eq. 4.

BENCHMARKING ON SIMULATED DATA
To compare the empirical performance of the different estimators we turned to numerical experiments, applying estimators to samples from a range of ecologically relevant species abundance distributions.We simulated drawing samples of different sizes N from the population ranging from N = 10 to N = 10000.Repeated sampling at a given sample size allows evaluation of how empirical estimates deviate from the ground truth value, Var(p C ), computed using Eq. 8 from the species abundances.Given an estimator x of a parameter with true value x a natural measure of its quality is the mean squared error, which can be decomposed into a bias and variance term, where Bias(x) = ⟨x − x⟩ and Var(x) = ⟨(x − ⟨x⟩) 2 ⟩.
To make values more readily interpretable we normalize Bias and Variance by the true value x to the fractional values, Bias/x and Var/x 2 .We display bias (Fig. 2B) and variance (Fig. 2C) separately, to investigate any potential trade-offs between bias and variance [22].
In our first experiments, we drew relative species abundances from Dirichlet distributions, ρ(p i ) ∝ p α−1 i .These distributions arise in ecology as the steady-state of neutral birth, death and immigration dynamics in a Wright-Fisher diffusion limit [7,23].We sampled a species abundance distribution of support S = 1000 by sampling p i uniformly form the probability simplex, this is from a Dirichlet distribution with α = 1 (Fig. 2A).The numerical results demonstrate that the proposed estimator is not only unbiased (Fig. 2B), but also has lower variance than all other estimators (Fig. 2C).We also generated a more uniform and more peaked distribution of species abundances, corresponding respectively to high or low relative immigration rates, by using a Dirichlet parameter α = 4 (Fig. S1A) and α = 0.25 (Fig. S1D), respectively.We find that the unbiased estimator consistently performs best regardless of the choice of α (Fig. S1B-C,E-F).
Different sample size regimes, which we analzye in detail in the next section, predict the performance of the different estimators.For sample sizes N > S (green shading) all estimators have relatively low variance, but Chao's estimator and the full plugin estimator are substantially biased unless N ≫ S. All estimators are highly variable for very small sample sizes N < √ 2S (blue shading), which corresponds to the sample size at which the expected number of coincidences for a uniform distribution of S species is one.
To test the generality of these findings beyond neutral models, we repeated the numerical experiment with alternative species abundance distributions.In theoretical ecology, chaotic competition in generalized Lotka-Volterra models [3] and stochastic environmental fluctuations [24] have been shown to lead to the emergence of heavy-tailed species abundance distributions.Empirically, long-standing evidence shows that species abundance distributions in many complex ecosystems are well fit by lognormal [26] or power-law distributions [25].We thus tested our estimator on samples from species abundance distributions following a lognormal (Fig. S2) and power-law form (Fig. S3).The results again show that other methods have substantial bias in small samples that is removed by the proposed estimator.

INSIGHTS INTO SAMPLE SIZE SCALING
To gain intuition into how estimator performance depends on N we consider limits of the variance formula (Eq.8).For large N , the variance is asymptotically equal to This shows that in large samples the variance of the estimator scales with the familiar 1/N scaling of an arithmetic average.Note further that the expression in brack-ets can be interpreted as the variance of species probabilities, When this variance is estimated by plugging in empirical frequencies, there is an additional sampling variance contribution, which explains why plugin estimators are positively biased.
Conversely, when the number of species S is increased at fixed N the third term in Eq. 8 asymptotically dominates as p T ∼ 1/S 2 and p 2 C ∼ 1/S 2 while p C ∼ 1/S, thus Interestingly, the variance scales as 1/N 2 in this limit, which explains the sharp rise in estimator variances as N ∼ √ 2S.As coincidences are rare they occur roughly independently across the N (N − 1)/2 possible pairs and the distribution of the total number of coincidences n C is approximately Poissonian [28] with mean ⟨n C ⟩ = N (N − 1)p C /2.The variance of a Poisson distribution equals its mean, and as pC = 2nc N (N −1) we have A Poisson approximation thus recovers Eq. 19 identifying counting noise as the dominant source of variance in small samples.
The scaling analysis suggests that in terms of mean squared error it might be preferable to only estimate the Poisson term in the very smallest sample to reduce variance stemming from the estimation of the first two terms in Eq. 8. Benchmarking of the Poisson estimator, confirms this intuition: The Poisson estimator greatly reduces variance in small samples at the expense of moderate negative bias (Fig. S4).In practice, we propose using the maximum value of the Poisson and unbiased estimator to increase robustness in the smallest samples.Future work might more formally address the problem of combining the two estimators to minimize overall mean squared error using the statistical framework of shrinkage estimators [29].

COMPARISONS ON EMPIRICAL DATA
To demonstrate the practical importance of unbiased estimation we next compared estimators on empirical data.In a textbook dataset on how vegetation patterns depend on the management of dune meadows [27] we find that the unbiased estimator produces consistently lower estimates of variance than Chao's estimator (Fig. 3).In this dataset N varies from 15 to 48, which is of the same order of magnitude than the S = 30 species which are distinguished in this dataset.The wider interval estimates of Chao's method are compatible with the previously demonstrated bias of this methodod for N ∼ S. Overestimation of sampling variance decreases the percentage of pairs of sites with non-overlapping error bars from 44% for the unbiased method to 29% for Chao's method, demonstrating the potential gain in statistical power using the proposed estimator.We next compared estimator performance for the problem of quantifying T cell receptor (TCR) diversity.Stochastic genetic recombination creates hypervariable TCRs, which are the molecular basis for how our adaptive immune system responds to diverse pathogens [6] (Fig. 4A).TCR diversity can be quantified by considering receptors as "species" with associated probabilities corresponding to the likelihood of clonal lineages encoding the same receptor.The number S of receptors that can be created by recombination is immense with estimates as large as S ∼ 10 39 for the TCRβ chain [31].Therefore this problem illustrates a practical use case for diversity estimation methods in the N ≪ S regime.
To test the variance estimators we constructed a metarepertoire of 30 million T cell clonotypes by combining samples from multiple healthy donors from a cohort study [49] and then split this repertoire into nonoverlapping pools of different sizes.The unbiased estimator outperforms all other estimators in terms of bias (Fig. S5B) and variance (Fig. S5C).(Chao's estimator was excluded from this comparison due to its slow computational speed at tested sample sizes.)Our results show that using the unbiased estimator only >10000 sequences are needed to estimate Var( pC ) with a coefficient of variation ≪ 1.Such sampling depths are now readily obtainable via bulk [49] or single cell TCR sequencing [32] allowing this estimator to be applied to quantify differences in TCR diversity across individual samples.

APPLICATION: HOW MANY T CELL RECEPTORS BIND TO A GIVEN LIGAND?
Having established the good empirical performance of the unbiased estimator, we sought to exploit our statistical advance to provide a quantitative answer to an important open question in immunology (Fig. 4): How degenerate is the mapping between antigen receptors and their ligands?The diversity of antigen receptors binding to a ligand determines the breadth and polyclonality of the adaptive immune response and quantification of this diversity is thus of central interest in the field [30,33,34].Recent experimental advances in single cell sequencing of T cells sorted for specificity to multimerized ligands allow experimental probing of this diversity [30,35,36].By quantifying the diversity of ligand-specific TCRs at multiple levels, we demonstrate the statistical significance of ligand-dependent differences in the contribution of the two chains of the heterodimeric receptor to binding specificity.
The dataset we consider consists of the sequences of 415 αβTCRs specific to three ligands [30], which are important viral epitopes (see Fig. 4 caption).We label them A, B, C in the text for conciseness.Each experiment involved sorting T cells from multiple individuals, but for simplicity we will determine overall TCR diversity regardless of donor-origin, a limitation which can be relaxed as dataset sizes increase.Using our method we determined the diversity of the full receptor (Fig. 4B) and its parts (Fig. 4C-F) along with their associated sampling uncertainy.We find that the effective diversity of TCRs binding each ligand is on the order of a thousand receptors (Fig. 4B) albeit with a large associated sampling uncertainty.Interestingly, when zooming in on the diversity of the component parts of the heterodimeric receptor, we find strong statistical support for hypothesized differences [30,36] in α and β chain diversity among ligandspecific receptors (Fig. 4C-F).For instance, TCRs specific to ligand A have significantly more diverse α than β chains, while the reverse is true for ligand B. The diversity of V gene segments found in specific TCRs also varies significantly between ligands and is largest among TCRs specific to ligand C.
The unbiased interval estimates (blue) are equivalent to Chao's estimates (orange) for V gene diversity (Fig. 4E-F), but are substantially tighter for the diversity estimates on the full receptor level (Fig. 4B) and receptor chain level (Fig. 4C,D), again highlighting the upward bias of alternative estimators.Importantly, the quantification of TCR diversity at different levels leads to hypotheses about the structural basis of recognition for each ligand.For example, TCRs specific to ligand C are FIG.4: Quantifying the diversity of ligand-specific T cell receptors.(A) Schematic diagram of the interaction between TCRs and their ligands, peptides bound to major histocompatibility complex (MHC).The αβTCR is a heterodimer composed of an α and β chain, each containing variable loops that together determine TCR specificity to its ligand binding partners.The sequence of the variable loops are determined during genetic recombination which involves the choice of gene segments, called V genes, and additional diversification within the hypervariable complementary determining region 3. expected to make fewer contacts on average between the V-gene encoded CDR1 and CDR2 loops and the ligand.Similarly, diversity restriction among the two chains in ligand A and B might be reflective of how many contacts each chain is making with the ligand.These hypotheses will soon become testable as more structures of TCRs in complex with their cognate ligands are solved [33,37].

CONCLUSION AND DISCUSSION
This work introduced a method to estimate the sample variance of Simpson's diversity index without bias for arbitrary species abundance distributions.To our knowledge the estimator we have introduced here is the only provably unbiased variance estimator for any diversity metric.This unbiased estimator does not seem to be widely known despite its superior statistical properties compared to existing methods.Additionally the unbiased estimator has a closed form analytical expression and is thus fast to calculate even for large samples, in contrast to bootstrapping approaches.
As the estimator is unbiased it has the practically important property of producing estimates that do not vary systematically with sample size.This is an important advantage for practical applications in which sample sizes vary between ecological communities.Such variation has become an increasingly important concern in ecology, as the field has moved to apply techniques developed for field studies with well-controlled sampling effort to the assessment of microbiome [8,10] or immune repertoire [16,34,38] diversity from high-throughput sequencing experiments.By using the unbiased estimator introduced here ecologists can avoid the loss of information inherent in the common practice of subsampling larger samples down to the smallest sample size, known as rarefaction.Our method thus fills a previously identified gap in the ecological literature [10] to overcome the need for rarefaction by bias-corrected interval estimators.
An extension of our work could revisit methods for interval estimation for other diversity metrics such as Shannon entropy.For these metrics past work has focused on reducing bias in the point estimates themselves given the absence of an unbiased estimator [13,20].Our work might be generalized to address the variance estimation problem for these bias-corrected estimators for other diversity metrics.Another direction for future work is to compare the performance of the estimators on samples with overdispersion [39], which goes beyond the multinomial sampling assumption that underlies all tested estimators.
We note that the negative logarithm of Simpson's index, − log p C , is the Renyi entropy (of order 2) [9,40].The Renyi entropy in turn lower-bounds Shannon entropy − i p i log p i , a relation that has been exploited to estimate entropy rates of dynamical systems [13,41] and neural spike trains [42].Thus we expect that our estimator will also be of use outside of ecology in the many other areas that use the concept of entropy.Interestingly, the estimator we have introduced can determine sampling variances even when the total number of species S exceeds the sample size N .This shows that the surprising ability to infer entropies way before the distribution is fully sampled, known in statistical physics as Ma's square-root regime of entropy estimation [13,41,43] and in probability theory as the birthday paradox [44,45] generalizes from point to inverval estimation.
Application of the new estimator experimentally identified ligand-specific T cell receptors showed that their effective receptor diversity is on the order of ∼ 1000 and demonstrated ligand-dependent restriction of TCR chain diversity.The effective number of receptors is very small compared to the multiple trillions of αβ receptors that can be produced by recombination demonstrating the stringent selection of antigen-specific TCRs in these experiments [34].Knowing how many TCRs on average bind a given ligand is important in experimental design for TCR screens as it can help guide the breadth and depth of sampling strategies.Quantification of variability in the effective number of TCRs binding to different ligands using the method introduced in this paper could yield insights into the mechanistic basis of immunodominance hierarchies, and help quantify how much the effective diversity of specific TCRs depends on cutoffs on TCR avidity imposed by different experimental assays, prior exposure or age [33,36,46].Finally, quantification of ligand-specific TCRs diversity might help predict variation in performance of machine learning models for different ligands [47,48].
To aid adoption of the method we have made a reference implementation of the estimator available as an open source Python package [50].We are hopeful that our novel method for interval estimation of diversity will enable focusing of sampling efforts for monitoring biodiversity loss in our changing world.
To derive the variance of Simpson's estimator we need to calculate various (cross-)moments of the multinomial distribution.Taking derivatives of the probability generating function demonstrates that the factorial moments are equal to where x (n) = x(x − 1)...(x − n + 1) denotes the falling factorial.To calculate the raw moments of the distribution we make use of the moment generating function, M (t 1 , . . ., t S ) = G(e t1 , . . ., e t S ) = ⟨e S i tini ⟩, (22) which for the multinomial distribution is equal to M (t 1 , . . ., t S ) = ( S i p i e ti ) N .
Calculating the partial derivatives of the moment generating function at t 1 = • • • = t S = 0 we obtain The variance of pc can be expressed as The key calculation concerns the numerator of the first term, which is equal to i n (2) Expanding the first term, and evaluating the second average using Eq.21 yields Using the expression for the moments Eqs.25-27, and noting that j̸ =i p 2 j = j p 2 j − p 2 i = p C − p 2 i , we obtain 4N (3)   i p 3 i + 2N (2) Plugging this numerator into Eq.28 we obtain after some algebra, Var( pC ) = 4N (3)  i p 3 i − 2N (2) (2N − 3)p 2 C + 2N (2) p C (N (N − 1)) 2  , (32) the expression first published by Simpson [12].

FIG. 1 :
FIG. 1: Quantifying collapse of marine biodiversity in the Eastern Mediterranean sea.Illustration of the problem setting of uncertainty quantification for finite sample diversity estimates.(A) Distributions of sampled mollusc species currently alive (top) and found in surficial dead assemblages (bottom).(B) Simpson Diversity index with error bars calculated using the proposed method (Eq.4).Error bars show pC ± Var( pC ) and demonstrate statistical signif-

FIG. 2 :
FIG. 2: Benchmarking of variance estimators on simulated data.(A) Frequency-rank plot of species abundances.Probabilities of S = 1000 species were drawn from the steady-state Dirichlet distribution of a neutral model with immigration and stochastic drift (parameter α = 1).(B) Bias and (C) variance as a function of sample size N .Expectation values were calculated over 1000 repeated draws at each sample size from the population distribution.Bias and variance are expressed as fractions, i.e. divided by the true value or its square, respectively.The shaded areas differentiate sampling regimes: blue N < √ 2S, orange √ 2S < N < S, and green N > S.The number of bootstrap samples for Chao's method was set to 200 as recommended[15].

FIG. 3 :
FIG.3: Comparison of interval estimation on empirical data.The dataset presented in Jongman et al.[27] contains the abundances of 30 plant species at 20 different sampling sites on the Dutch island of Terschelling.Interval estimates for the biodiversity at each site were obtained from the unbiased estimator (blue) and the Chao estimator (orange) as pC ± Var method ( pC ).
FIG.4: Quantifying the diversity of ligand-specific T cell receptors.(A) Schematic diagram of the interaction between TCRs and their ligands, peptides bound to major histocompatibility complex (MHC).The αβTCR is a heterodimer composed of an α and β chain, each containing variable loops that together determine TCR specificity to its ligand binding partners.The sequence of the variable loops are determined during genetic recombination which involves the choice of gene segments, called V genes, and additional diversification within the hypervariable complementary determining region 3. (B-F) Interval estimates for the diversity of pMHC-specific receptors were obtained from the unbiased estimator (blue) and the Chao estimator (orange) for three viral epitopes.Diversity was assessed on the level of (B) the full receptor, (C) the α and (D) β chain, as well as (E) the Vα and (F) Vβ gene choice.Diversities are shown as effective number equivalents of Simpson's index, D = 1/ pC , with error bars calculated by error propagation Var(pC )/ pC 2 .Data: Dash et al.[30].Ligand A: Influenza virus peptide M158.Ligand B: Epstein-Barr virus peptide BMLF1280.Ligand C: Human cytomegalovirus peptide pp65495.The three peptides are presented by a common MHC, the Human Leukocyte Antigen (HLA) A*02:01.