Active pooling design in group testing based on Bayesian posterior prediction

For identifying infected patients in a population, group testing is an effective method to reduce the number of tests and correct test errors. In group testing, tests are performed on pools of specimens collected from patients, where the number of pools is lower than that of patients. The performance of group testing considerably depends on the design of pools and algorithms that are used for inferring the infected patients from the test outcomes. In this paper, an adaptive design method of pools based on the predictive distribution is proposed in the framework of Bayesian inference. The proposed method executed using a belief propagation algorithm results in more accurate identification of the infected patients, compared with the group testing performed on random pools determined in advance.

I. INTRODUCTION
Identification of infected patients from a large population using clinical tests, such as blood tests and polymerase chain reaction tests, requires significant operating costs. Group testing is one of the approaches to reduce such costs by performing tests on pools of specimens obtained from patients [1,2]. When the fraction of infected patients in a population is sufficiently small, the infected patients can be identified from tests on pools whose number is smaller than that of the patients. Originally, group testing was developed for blood testing by Dorfman and is now applied to various fields, such as quality control in product testing [3] and multiple access communication [4].
Group testing is roughly classified into non-adaptive and adaptive. In non-adaptive group testing, all pools are determined in advance and fixed during all tests. In adaptive group testing, pools are designed sequentially, depending on the previous test outcomes. Dorfman's original study considered the simplest adaptive procedure, the so-called two-stage testing; here, in the first round, tests are performed on pools designed in advance, and all patients belonging to the positive pool are individually tested in the subsequent stage. A generalization of the two-stage testing is known as a binary splitting method [5,6], where the positive pool in the previous stage is split into two subpools. Tests in the subsequent stage are performed on the subpools until the infected patients are identified. Further, the splitting of the positive pools into several subsets larger than two sometimes reduces the number of tests required for identifying the infected patients [7]. These splitting-based methods are effective when the number of infected patients is sufficiently small. However, the splitting-based methods exhibit a limitation in the correction of false negative results because patients in the negative pools are never tested again, even when the negative result is false.
In contrast to the splitting-based design, the active design of data sampling has been studied in statistics and machine learning under the names of experimental design [8,9], active learning [10,11], and Bayesian optimization [12,13]. In these approaches, the optimal method for selecting training data for efficient learning is developed on the basis of criteria that quantify the informativeness of the unknown data. The active design of data sampling improves the performance of algorithms in several fields, such as text classification [10], semisupervised learning [14], and support vector machines [15]. Active data sampling is particularly effective when the data are uncertain owing to a noisy generative process and the number of samples that can be collected is limited. In the context of group testing, active sampling of data corresponds to the active design of pools for the subsequent stage. Because the tests are noisy and the number of tests should be kept small, active sampling is expected to contribute significantly to group testing.
* ayaka@ism.ac.jp
In this paper, we propose an active pooling design method employing Bayesian inference for efficient identification of infected patients using group testing. Bayesian modeling can account for the nonzero false probabilities of the test and provides a measure to quantify the uncertainty, namely the posterior predictive distribution. We sequentially design pools based on the predictive distribution in adaptive group testing. The procedure is executed using a statistical-physics-based algorithm, belief propagation (BP) [16][17][18][19], which achieves a reasonable approximation of the estimates at a feasible computational cost [20]. We demonstrate that, compared with the approach that uses randomly generated pools, the proposed pooling method effectively corrects errors with a smaller number of tests.

II. MATHEMATICAL FORMULATION
Let us denote the true state of $N$ patients by $X^{(0)} \in \{0,1\}^N$, where $X^{(0)}_i = 1$ and $X^{(0)}_i = 0$ indicate that the $i$-th patient is infected and not infected, respectively. The pooling of the patients is determined by a matrix $F \in \{0,1\}^{M \times N}$, where $M\,(< N)$ is the number of pools, and $F_{\mu i} = 1$ and $F_{\mu i} = 0$ indicate that the $i$-th patient is and is not in the $\mu$-th pool, respectively. The true state of the $\mu$-th pool, denoted by $T(X^{(0)}, F_\mu)$, where $F_\mu$ is the $\mu$-th row vector of $F$, is given by

$$T(X^{(0)}, F_\mu) = \bigvee_{i=1}^{N} F_{\mu i} X^{(0)}_i,$$

where $\bigvee$ denotes the logical sum of $N$ components. Namely, when the $\mu$-th pool contains at least one infected patient, the state of the $\mu$-th pool is 1 (positive); otherwise, it is 0 (negative).
arXiv:2007.13323v2 [stat.ML] 19 Aug 2020
The test error is modeled by a function $C(\cdot)$ that returns 0 or 1 according to the probability conditioned on the input as

$$P(C(a) = 1 \mid a = 1) = p_{\rm TP}, \quad P(C(a) = 0 \mid a = 1) = 1 - p_{\rm TP}, \tag{1}$$

$$P(C(a) = 1 \mid a = 0) = p_{\rm FP}, \quad P(C(a) = 0 \mid a = 0) = 1 - p_{\rm FP}, \tag{2}$$

where $p_{\rm TP}$ and $p_{\rm FP}$ correspond to the true-positive (TP) and false-positive (FP) probabilities of the test, respectively [18,20]. We assume that the test errors are independent of each other; then, from the property of $C(\cdot)$, the generative model of the test results $Y \in \{0,1\}^M$ is given by

$$P_{\rm gen}(Y \mid X, F) = \prod_{\mu=1}^{M} \left\{ p_{\rm TP} T(X, F_\mu) + p_{\rm FP} \left(1 - T(X, F_\mu)\right) \right\}^{Y_\mu} \left\{ (1 - p_{\rm TP}) T(X, F_\mu) + (1 - p_{\rm FP}) \left(1 - T(X, F_\mu)\right) \right\}^{1 - Y_\mu}, \tag{3}$$

where $P_{\rm gen}$ is a Bernoulli distribution conditioned on $X$ and $F$. Our aim is to infer the true states of the patients $X^{(0)}$ from the observation $Y$. To this end, we use Bayes' theorem. We introduce the prior distribution of the patient states $P_{\rm pri}(X_i) \sim {\rm Bernoulli}(\rho)$, where $\rho \in [0, 1]$ is the assumed infection probability. Following Bayes' rule, the posterior distribution is given by $P_{\rm post}(X \mid Y) \propto P_{\rm gen}(Y \mid X, F) \prod_{i=1}^{N} P_{\rm pri}(X_i \mid \rho)$. The $i$-th patient's state is identified on the basis of the marginal distribution $P_{\rm post}(X_i \mid Y) = \sum_{X \setminus X_i} P_{\rm post}(X \mid Y)$, where $X \setminus X_i$ denotes the components of $X$ other than $X_i$. As the variable $X_i$ is binary, we can represent the marginal distribution using a Bernoulli probability $\theta_i$ as

$$P_{\rm post}(X_i \mid Y) = \theta_i(Y)^{X_i} \left(1 - \theta_i(Y)\right)^{1 - X_i}, \tag{4}$$

where $\theta_i(Y)$ corresponds to the infection probability estimated under the test result $Y$, namely, the probability that $X_i = 1$. We have to convert this probability to a binary value to identify the patients' states. The simplest estimate of $X^{(0)}_i$ is the maximum a posteriori (MAP) estimator given by

$$X^{\rm (MAP)}_i = I\left(\theta_i(Y) > 0.5\right), \tag{5}$$

where $I(a)$ is the indicator function whose value is 1 when $a$ is true, and 0 otherwise.
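The formulation above can be checked numerically on a very small instance. The following is a minimal sketch, not the method used in the paper: it simulates noisy pooled tests and computes the marginals $\theta_i$ exactly by summing over all $2^N$ states, which is only feasible for small $N$. The function names, the random pooling rule, and the parameter values are illustrative assumptions.

```python
import itertools
import numpy as np

def pool_states(X, F):
    """True pool states T(X, F_mu): logical OR over the patients in each pool."""
    return (F @ X > 0).astype(int)

def simulate_tests(X0, F, p_tp, p_fp, rng):
    """Noisy outcomes: positive w.p. p_TP for positive pools, p_FP for negative pools."""
    T = pool_states(X0, F)
    prob_positive = np.where(T == 1, p_tp, p_fp)
    return (rng.random(len(T)) < prob_positive).astype(int)

def exact_marginals(Y, F, p_tp, p_fp, rho):
    """theta_i = P_post(X_i = 1 | Y), by brute-force enumeration of {0,1}^N."""
    N = F.shape[1]
    weighted_sum = np.zeros(N)
    total = 0.0
    for bits in itertools.product([0, 1], repeat=N):
        X = np.array(bits)
        p1 = np.where(pool_states(X, F) == 1, p_tp, p_fp)   # P(Y_mu = 1 | X)
        likelihood = np.prod(np.where(Y == 1, p1, 1.0 - p1))
        prior = np.prod(np.where(X == 1, rho, 1.0 - rho))
        w = likelihood * prior
        total += w
        weighted_sum += w * X
    return weighted_sum / total

# Illustrative (assumed) parameters; the pooling matrix here is simply random.
rng = np.random.default_rng(0)
N, M, rho, p_tp, p_fp = 8, 5, 0.2, 0.9, 0.05
X0 = (rng.random(N) < rho).astype(int)
F = (rng.random((M, N)) < 0.4).astype(int)
Y = simulate_tests(X0, F, p_tp, p_fp, rng)
theta = exact_marginals(Y, F, p_tp, p_fp, rho)
x_map = (theta > 0.5).astype(int)   # MAP estimate with threshold 0.5
```

With no tests at all the marginals reduce to the prior $\rho$, and with a single patient tested alone and a positive result the sketch returns $\rho p_{\rm TP} / \{\rho p_{\rm TP} + (1-\rho) p_{\rm FP}\}$, which is a quick sanity check of the likelihood and prior.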

III. ADAPTIVE DESIGN OF POOLS
Here, we divide the $M$ tests into $M_{\rm ini}$ tests performed on pools fixed in advance (the initial stage) and $M_{\rm ada}$ tests performed sequentially on actively designed pools (the adaptive stage). Hence, $M = M_{\rm ini} + M_{\rm ada}$. We denote the index set of the patients in the $\nu$-th pool by $\pi(\nu)$, so that $F_{\nu i} = 1$ for $i \in \pi(\nu)$ and $F_{\nu i} = 0$ otherwise. We consider the determination of $\pi(\nu + 1)$ ($\nu \ge M_{\rm ini}$) among the possible pools, denoted by $\mathcal{P}$, based on the outcomes of the tests $1, \cdots, \nu$, denoted by $Y^{(\nu)} = \{Y_1, \cdots, Y_\nu\}$, which are performed on the pools $\pi(1), \cdots, \pi(\nu)$. The predictive distribution for the unknown result $Y \in \{0, 1\}$ of a test to be performed on a certain pool $\pi \in \mathcal{P}$ is defined as

$$P_{\rm pre}(Y \mid Y^{(\nu)}, \pi) = \sum_{X_\pi} P_{\rm gen}(Y \mid X_\pi) P_{\rm post}(X_\pi \mid Y^{(\nu)}), \tag{6}$$

where $X_\pi = \{X_i \mid i \in \pi\}$, $P_{\rm gen}(Y \mid X_\pi)$ is the generative model of a single test on the pool $\pi$, and $P_{\rm post}(X_\pi \mid Y^{(\nu)}) = \sum_{X \setminus X_\pi} P_{\rm post}(X \mid Y^{(\nu)})$. By setting $P_{\rm post}(X_\pi = 0 \mid Y^{(\nu)}) = q(\pi; Y^{(\nu)})$, which is the estimated probability under the given $Y^{(\nu)}$ that none of the patients in the pool $\pi$ are infected, the predictive distribution is expressed as

$$P_{\rm pre}(Y = 1 \mid Y^{(\nu)}, \pi) = p_{\rm FP}\, q(\pi; Y^{(\nu)}) + p_{\rm TP} \left(1 - q(\pi; Y^{(\nu)})\right), \quad P_{\rm pre}(Y = 0 \mid Y^{(\nu)}, \pi) = (1 - p_{\rm FP})\, q(\pi; Y^{(\nu)}) + (1 - p_{\rm TP}) \left(1 - q(\pi; Y^{(\nu)})\right). \tag{7}$$

The predictive distribution measures the adequacy of the posterior distribution in describing unknown data and is used as a modeling criterion in Bayesian inference [21]. We use the predictive distribution for the active design of pools. For an intuitive discussion, let us consider the case in which $P_{\rm pre}(Y = 1 \mid Y^{(\nu)}, \pi)$ and $P_{\rm pre}(Y = 0 \mid Y^{(\nu)}, \pi)$ are significantly different, say close to 1 and close to 0, respectively. This means that the current posterior distribution already predicts the outcome of the test on the pool $\pi$: the result $Y = 1$ is expected, and $Y = 0$ would be regarded as a test error. We do not consider such an 'explainable pool' in the subsequent stage because a test performed on it is not expected to modify the current posterior appreciably. 
Instead, we consider a pool $\pi$ that gives comparable $P_{\rm pre}(Y = 1 \mid Y^{(\nu)}, \pi)$ and $P_{\rm pre}(Y = 0 \mid Y^{(\nu)}, \pi)$. In this case, the posterior at step $\nu$ cannot predict the result of the test on the pool $\pi$, and hence, the test result is expected to correct the posterior. This strategy can be expressed as the maximization of the predictive entropy at step $\nu + 1$, defined as

$$S(\pi; Y^{(\nu)}) = -\sum_{Y \in \{0,1\}} P_{\rm pre}(Y \mid Y^{(\nu)}, \pi) \ln P_{\rm pre}(Y \mid Y^{(\nu)}, \pi), \tag{8}$$

where $P_{\rm pre}(Y = 1 \mid Y^{(\nu)}, \pi) = P_{\rm pre}(Y = 0 \mid Y^{(\nu)}, \pi) = 0.5$ gives the entropy maximum. Active design of data sampling based on entropy maximization is known as uncertainty sampling in active learning [22,23]. As shown in eq.(7), the predictive entropy depends on $\pi$ only through the single parameter $q(\pi; Y^{(\nu)})$. Regarding the predictive entropy as a function of $q \in [0, 1]$, its maximum is achieved at $q = q^*$ given by

$$q^* = \frac{2 p_{\rm TP} - 1}{2 (p_{\rm TP} - p_{\rm FP})}, \tag{9}$$

where $p_{\rm FP} < 0.5 \le p_{\rm TP}$ is assumed. We determine the $(\nu + 1)$-th pool as

$$\pi(\nu + 1) = \mathop{\rm arg\,min}_{\pi \in \mathcal{P}} \left| q(\pi; Y^{(\nu)}) - q^* \right|. \tag{10}$$

The remaining task is the calculation of $q(\pi; Y^{(\nu)})$ for the possible $\pi$ under the given test results $Y^{(\nu)}$. The mathematical form of $q(\pi; Y^{(\nu)})$ depends on the size of $\pi$, denoted by $|\pi|$. When $|\pi| = 1$ with $\pi = \{i\}$, we obtain

$$q(\pi; Y^{(\nu)}) = 1 - \theta_i(Y^{(\nu)}). \tag{11}$$

For larger pools, the correlation between the patients in the pool should be considered for the exact evaluation of $q(\pi; Y^{(\nu)})$. For example, when $\pi = \{i, j\}$ ($|\pi| = 2$), we obtain

$$q(\pi; Y^{(\nu)}) = \left(1 - \theta_i(Y^{(\nu)})\right) \left(1 - \theta_j(Y^{(\nu)})\right) + \chi_{ij}, \tag{12}$$

where

$$\chi_{ij} = E_{{\rm post}|Y^{(\nu)}}[X_i X_j] - \theta_i(Y^{(\nu)})\, \theta_j(Y^{(\nu)}) \tag{13}$$

is the susceptibility and $E_{{\rm post}|Y^{(\nu)}}[\cdot]$ denotes the average with respect to the posterior distribution $P_{\rm post}(X \mid Y^{(\nu)})$.

Next, we discuss the relationship between $q^*$, $p_{\rm TP}$, and $p_{\rm FP}$. From eq.(9), $q^* - 0.5 = (p_{\rm TP} + p_{\rm FP} - 1) / \{2 (p_{\rm TP} - p_{\rm FP})\}$; hence, if $p_{\rm TP} > 1 - p_{\rm FP}$, then $q^* > 0.5$. This indicates that pools with $q(\pi; Y^{(\nu)}) > 0.5$ are likely to be chosen when $p_{\rm TP} > 1 - p_{\rm FP}$. In other words, a pool tends to be chosen when the probability that no patient in it is infected is larger than the probability that at least one patient is infected. This can be understood as follows. Introducing the false-negative probability $p_{\rm FN} = 1 - p_{\rm TP}$, the condition $p_{\rm TP} > 1 - p_{\rm FP}$ is equivalent to $p_{\rm FN} < p_{\rm FP}$. This means that false test results are mainly contained in positive results, so the test outcome is biased toward $Y = 1$. A pool whose outcome is equally likely to be 0 or 1 must therefore be somewhat more likely to contain no infected patient, which corresponds to $q(\pi; Y^{(\nu)}) > 0.5$. Following the same logic, pools with $q(\pi; Y^{(\nu)}) < 0.5$ are likely to be chosen when $p_{\rm TP} < 1 - p_{\rm FP}$.
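The selection rule of this section can be illustrated with a few lines of code. This is a minimal sketch under the definitions above: $q$ is the posterior probability that a pool contains no infected patient, the predictive probability of a positive result is $p_{\rm FP} q + p_{\rm TP}(1 - q)$, and the closed form in `q_star` is what one obtains by solving that expression for $1/2$. The function names and the candidate values of $q$ are illustrative assumptions.

```python
import numpy as np

def predictive_positive(q, p_tp, p_fp):
    """P_pre(Y = 1) for a pool that is all-negative with posterior probability q."""
    return p_fp * q + p_tp * (1.0 - q)

def predictive_entropy(q, p_tp, p_fp):
    """Entropy of the binary predictive distribution, as a function of q."""
    p1 = np.clip(predictive_positive(q, p_tp, p_fp), 1e-12, 1.0 - 1e-12)
    return -(p1 * np.log(p1) + (1.0 - p1) * np.log(1.0 - p1))

def q_star(p_tp, p_fp):
    """Solution of p_fp*q + p_tp*(1-q) = 1/2; assumes p_fp < 0.5 <= p_tp."""
    return (p_tp - 0.5) / (p_tp - p_fp)

def select_pool(q_values, p_tp, p_fp):
    """Index of the candidate pool whose q is closest to q*."""
    q_values = np.asarray(q_values, dtype=float)
    return int(np.argmin(np.abs(q_values - q_star(p_tp, p_fp))))

# Hypothetical q values for three candidate pools.
candidates = [0.10, 0.45, 0.90]
best = select_pool(candidates, p_tp=0.9, p_fp=0.05)
entropies = predictive_entropy(np.array(candidates), 0.9, 0.05)
```

Because the predictive probability is linear in $q$ and the binary entropy is symmetric about $1/2$, choosing the pool closest to $q^*$ is the same as choosing the pool with the largest predictive entropy. For $p_{\rm TP} = 0.9$ and $p_{\rm FP} = 0.05$, the formula gives $q^* = 0.4/0.85 \approx 0.47$.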

IV. IMPLEMENTATION BY BELIEF PROPAGATION
The computation of the marginal distribution requires a sum over exponentially many states and is thus intractable. We approximately calculate the marginal distribution using the BP algorithm [17][18][19]. A comparison with the exact calculation at small system sizes shows that the BP algorithm achieves sufficient approximation accuracy when applied to group testing [20]. In this study, we use the BP algorithm because of its approximation accuracy and modest computational time. The BP algorithm for calculating the infection probability given by the posterior distribution is summarized in Appendix A. We denote the obtained estimate of $\theta_i$ and the corresponding MAP estimator as $\hat{\theta}_i$ and $X^{\rm (MAP)}_i = I(\hat{\theta}_i > 0.5)$, respectively. We measure the accuracy of the MAP estimator by the TP and FP rates given by

$${\rm TP} = \frac{\sum_{i=1}^{N} X^{(0)}_i X^{\rm (MAP)}_i}{\sum_{i=1}^{N} X^{(0)}_i}, \quad {\rm FP} = \frac{\sum_{i=1}^{N} \left(1 - X^{(0)}_i\right) X^{\rm (MAP)}_i}{\sum_{i=1}^{N} \left(1 - X^{(0)}_i\right)},$$

respectively. A TP value larger than $p_{\rm TP}$ and an FP value smaller than $p_{\rm FP}$ indicate that the BP-based identification performs better than testing each of the $N$ patients individually.
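As an illustration of the accuracy measures, the sketch below computes the TP and FP rates of a thresholded estimate, assuming the usual definitions: TP is the fraction of truly infected patients that are estimated as infected, and FP is the fraction of non-infected patients that are estimated as infected. The function name and the example values are hypothetical.

```python
import numpy as np

def tp_fp_rates(x_true, x_est):
    """Return (TP rate, FP rate) of a binary estimate against the true states."""
    x_true = np.asarray(x_true)
    x_est = np.asarray(x_est)
    infected = x_true == 1
    # Rates are undefined (nan) when the corresponding group is empty.
    tp = float(x_est[infected].mean()) if infected.any() else float("nan")
    fp = float(x_est[~infected].mean()) if (~infected).any() else float("nan")
    return tp, fp

# Hypothetical true states and estimated infection probabilities.
x_true = np.array([1, 1, 0, 0, 0])
theta_hat = np.array([0.9, 0.3, 0.6, 0.1, 0.2])
x_map = (theta_hat > 0.5).astype(int)   # MAP estimator with threshold 0.5
tp, fp = tp_fp_rates(x_true, x_map)
```

In this example one of the two infected patients is detected and one of the three non-infected patients is flagged, so the rates are 1/2 and 1/3.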
To apply the BP algorithm to an adaptive test, we need to obtain $q(\pi; Y^{(\nu)})$ for each $\nu$ ($> M_{\rm ini}$). Its exact computation requires multibody correlations between patients except when $|\pi| = 1$, whereas the BP algorithm returns only one-body information. In this study, we use the simplest approximation provided by the BP algorithm,

$$\hat{q}(\pi; Y^{(\nu)}) \equiv \prod_{i=1}^{|\pi|} \left(1 - \hat{\theta}_{\pi_i}\right),$$

where $\pi_i$ ($i = 1, \cdots, |\pi|$) is the $i$-th component of the pool $\pi$, to avoid the increase in computational time required for calculating multibody correlations. Further, to reduce the time required for computing $q(\pi; Y^{(\nu)})$ for all possible $\pi$, we focus on the subspaces of pools $\mathcal{P}_1 = \{\pi \mid |\pi| = 1, \pi \in \mathcal{P}\}$ and $\mathcal{P}_2 = \{\pi \mid |\pi| \le 2, \pi \in \mathcal{P}\}$; hence, $\mathcal{P}_1 \subset \mathcal{P}_2 \subset \mathcal{P}$. In principle, BP can approximately compute the correlations between patients by deriving conditional posterior expectations, which requires additional computations of the order of $O(N!/(N - |\pi|)!)$ according to the product rule of conditional joint distributions. As an example, we calculate the susceptibility using the BP algorithm and implement the active pooling design on the basis of $q(\pi; Y^{(\nu)})$ for the $|\pi| = 2$ case, as shown in Appendix B. Considering the susceptibility does not provide large improvements in the TP and FP rates in the problem setting studied herein. Hence, we use the one-body approximation $\hat{q}(\pi; Y^{(\nu)})$ throughout the study.
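The one-body approximation and the restriction to pools of size one or two can be sketched as follows. This is an illustrative example rather than the implementation used in the paper: `theta_hat` stands for whatever one-body estimates are available (from BP in the paper), the target value is obtained by solving $p_{\rm FP} q + p_{\rm TP}(1 - q) = 1/2$, and all function names are assumptions.

```python
import itertools
import numpy as np

def one_body_q(theta_hat, pool):
    """q_hat(pool) = prod_{i in pool} (1 - theta_hat_i); correlations are ignored."""
    return float(np.prod(1.0 - theta_hat[list(pool)]))

def candidate_pools(n, max_size):
    """All pools with 1 <= |pool| <= max_size (max_size=1: P_1, max_size=2: P_2)."""
    for size in range(1, max_size + 1):
        yield from itertools.combinations(range(n), size)

def choose_next_pool(theta_hat, p_tp, p_fp, max_size=2):
    """Pool whose one-body q_hat is closest to the entropy-maximizing value."""
    q_target = (p_tp - 0.5) / (p_tp - p_fp)   # assumes p_fp < 0.5 <= p_tp
    return min(candidate_pools(len(theta_hat), max_size),
               key=lambda pool: abs(one_body_q(theta_hat, pool) - q_target))

# Hypothetical one-body estimates for four patients.
theta_hat = np.array([0.05, 0.30, 0.35, 0.90])
pool_p1 = choose_next_pool(theta_hat, 0.9, 0.05, max_size=1)
pool_p2 = choose_next_pool(theta_hat, 0.9, 0.05, max_size=2)
```

Enumerating pairs costs $O(N^2)$ evaluations per adaptive step, which is one reason for stopping at $|\pi| \le 2$ in a sketch like this. In the example the best single patient has $\hat{q} = 0.65$, while pairing patients 1 and 2 gives $\hat{q} = 0.455$, much closer to the target of about 0.47.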
Algorithm 1: Group testing with active pooling design using the belief propagation (BP) algorithm.

In the initial stage, the posterior distribution is approximately calculated using the BP algorithm. For the subsequent adaptive stage, we actively choose $\pi(M_{\rm ini} + 1)$ among $\mathcal{P}_1$ or $\mathcal{P}_2$ based on the predictive entropy given by the posterior distribution of the initial stage. Next, we construct $F_{M_{\rm ini}+1}$ so that $F_{M_{\rm ini}+1, i} = 1$ for $i \in \pi(M_{\rm ini} + 1)$ and 0 otherwise. The test result is generated as $Y_{M_{\rm ini}+1} \sim P_{\rm gen}(Y \mid X^{(0)}, F_{M_{\rm ini}+1})$, and we obtain the posterior distribution under the constraint that $\sum_i X^{(0)}_i = N\rho$. Here, we assume that the correct parameters $\rho$, $p_{\rm TP}$, and $p_{\rm FP}$ are known in advance. For more general cases where the estimation of unknown parameters is required, we can construct their estimators by combining the BP algorithm with the expectation-maximization method or by introducing a hierarchical Bayes model [20].

Fig.1 shows the $\rho$-dependence of (a) TP and (b) FP at $N = 1000$ and $M = 400$ with $M_{\rm ini} = 300$ and $M_{\rm ada} = 100$. The error probabilities are set at $p_{\rm TP} = 0.9$ and $p_{\rm FP} = 0.05$, and the group size in the initial stage is $N_G = 10$. $\mathcal{P}_1$ and $\mathcal{P}_2$ in the figure denote the results of the active pooling in the spaces $\mathcal{P}_1$ and $\mathcal{P}_2$, respectively [24]. For comparison, the results of random pooling are shown, where the tests in the $M_{\rm ada}$ steps are performed on random pools generated by the same rule as the initial $M_{\rm ini}$ tests. Each data point represents the average over 100 realizations of $Y^{(M_{\rm ini})}$, $F^{(M_{\rm ini})}$, and $X^{(0)}$. Over the whole range of $\rho$ examined, TP under random pooling does not exceed $p_{\rm TP}$, which is indicated by the horizontal line in Fig.1 (a). The adaptive test improves TP and achieves TP $> p_{\rm TP}$ when $\rho < 0.02$ for the $\mathcal{P}_1$ case and $\rho < 0.04$ for the $\mathcal{P}_2$ case. As shown in Fig.1 (b), FP is smaller than $p_{\rm FP}$ even when the pooling is randomly determined, but the adaptive test can further decrease FP.
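The overall procedure (an initial stage on fixed pools, followed by pools chosen one at a time) can be sketched end to end on a toy problem. To keep the example self-contained, exact enumeration over $\{0,1\}^N$ is swapped in for the BP algorithm, so it only runs for small $N$; the one-body approximation over pools of size one or two is used for the selection. All names and parameter values are illustrative assumptions, not the paper's code.

```python
import itertools
import numpy as np

def marginals(Y, F, p_tp, p_fp, rho):
    """Exact theta_i by enumeration: a stand-in for BP, feasible for small N only."""
    N = F.shape[1]
    states = np.array(list(itertools.product([0, 1], repeat=N)))    # (2^N, N)
    T = (states @ F.T) > 0                                          # pool states
    p1 = np.where(T, p_tp, p_fp)                                    # P(Y_mu = 1 | X)
    lik = np.prod(np.where(Y == 1, p1, 1.0 - p1), axis=1)
    prior = np.prod(np.where(states == 1, rho, 1.0 - rho), axis=1)
    w = lik * prior
    return (w @ states) / w.sum()

def run_test(X0, pool_row, p_tp, p_fp, rng):
    """One noisy test on the pool encoded by the 0/1 row vector pool_row."""
    positive = (pool_row @ X0) > 0
    return int(rng.random() < (p_tp if positive else p_fp))

def adaptive_group_test(X0, F_ini, m_ada, p_tp, p_fp, rho, rng):
    N = len(X0)
    F = F_ini.copy()
    Y = np.array([run_test(X0, row, p_tp, p_fp, rng) for row in F])  # initial stage
    q_target = (p_tp - 0.5) / (p_tp - p_fp)     # solves P_pre(Y = 1) = 1/2
    pools = [c for k in (1, 2) for c in itertools.combinations(range(N), k)]
    for _ in range(m_ada):                      # adaptive stage
        theta = marginals(Y, F, p_tp, p_fp, rho)
        pool = min(pools, key=lambda c: abs(np.prod(1.0 - theta[list(c)]) - q_target))
        row = np.zeros(N, dtype=int)
        row[list(pool)] = 1
        F = np.vstack([F, row])
        Y = np.append(Y, run_test(X0, row, p_tp, p_fp, rng))
    return marginals(Y, F, p_tp, p_fp, rho), F, Y

rng = np.random.default_rng(1)
X0 = np.array([0, 1, 0, 0, 0, 0, 1, 0])
F_ini = (rng.random((4, 8)) < 0.4).astype(int)   # random initial pools (assumed rule)
theta, F, Y = adaptive_group_test(X0, F_ini, 3, 0.9, 0.05, 0.25, rng)
```

The posterior is recomputed from scratch after every adaptive test, which mirrors the extra inference cost of the adaptive approach relative to a single non-adaptive run.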
The performance of the adaptive test depends on the number of initial random tests $M_{\rm ini}$. Fig.2 shows the $M_{\rm ini}$-dependence of (a) TP and (b) FP at $N = 1000$, $\rho = 0.02$, $p_{\rm TP} = 0.9$, and $p_{\rm FP} = 0.05$. The pool size at the initial stage is $N_G = 10$. The figure presents the results for $M_{\rm ini} = 300$, 400, and 500. The horizontal dashed line in (a) represents the TP probability of the test, which is 0.9. As $M_{\rm ini}$ increases, a high TP close to 1 is obtained via the adaptive test. Moreover, for a large $M_{\rm ini}$, such as $M_{\rm ini} = 500$, the choice of pooling space does not significantly influence the performance in terms of TP and FP. Meanwhile, for a small $M_{\rm ini}$, TP depends on the pooling space, and more accurate identification is achieved by $\mathcal{P}_2$. The test results at the initial stage have large uncertainties when $M_{\rm ini}$ is small; hence, a larger pooling space is considered necessary for the effective sampling of uncertain pools.
As shown in Fig.2 (a), to achieve a high TP, the active pooling method requires a smaller number of tests than the random pooling method. For instance, the active pooling in $\mathcal{P}_2$ with an initial stage of $M_{\rm ini} = 300$ tests results in TP $> p_{\rm TP}$ after an adaptive stage of $M_{\rm ada} = 40$ tests, namely, at $M = 340$. Meanwhile, the random pooling achieves TP $> p_{\rm TP}$ only with almost $M = 500$ tests. With regard to the improvement of TP, the adaptive method thus helps identify infected patients effectively with a small number of tests.
The active pooling is more robust to test errors than the random pooling. Fig.3 shows (a) the $p_{\rm TP}$-dependence of TP for $p_{\rm FP} = 0.05$ and (b) the $p_{\rm FP}$-dependence of TP for $p_{\rm TP} = 0.95$ at $N = 1000$, $M_{\rm ini} = 300$, $M_{\rm ada} = 100$, and $\rho = 0.02$. The random tests in the initial stage are performed on pools of size $N_G = 10$. For the random pooling case, TP $> p_{\rm TP}$ is achieved only when $p_{\rm FP}$ is sufficiently small, such as $p_{\rm FP} < 0.02$. The adaptive test improves TP, and the parameter region where TP $> p_{\rm TP}$ is extended, in particular for the $\mathcal{P}_2$ case.
These results indicate the efficiency of the active pooling design based on the predictive distribution in group testing. However, a limitation of this approach is that its computational cost is higher than that of the non-adaptive approach. The infection probability must be re-estimated via the BP algorithm at each of the $M_{\rm ada}$ adaptive steps; hence, the computational cost of the adaptive approach is approximately $M_{\rm ada}$ times larger than that of the non-adaptive approach. In return, the adaptive approach achieves accurate estimation using a small number of tests. The trade-off between the reduction in the operating cost of the tests and the increase in the computation time of inference should be considered when applying the adaptive approach in practice.

V. SUMMARY AND DISCUSSION
In this paper, we proposed an active pooling design for adaptive group testing, where the pool for the subsequent stage is determined on the basis of the Bayesian posterior predictive distribution under the test outcomes of the previous stages. The proposed method was implemented using the BP algorithm, and the identification of infected patients using adaptive tests was demonstrated to be more accurate than that using randomly designed pools. In particular, the active pooling design reduced the number of tests required to achieve TP $> p_{\rm TP}$. Further, the proposed method is robust to test errors: TP $> p_{\rm TP}$ holds for smaller $p_{\rm TP}$ and larger $p_{\rm FP}$ than with randomly designed pools.
In the current study, we restricted the possible pooling space to $\mathcal{P}_1$ and $\mathcal{P}_2$. Mathematically, more uncertain pools can be considered by removing this restriction, and further improvement in the TP and FP rates is expected. However, the straightforward calculation of the predictive entropy for all possible $\pi$ is computationally intractable. Hence, some approximation will be required. An efficient sampling method for finding uncertain pools $\pi \in \mathcal{P}$, such as the Markov chain Monte Carlo method, should be developed.
We focused on the MAP estimator to convert the estimated infection probability, a variable in $[0, 1]$, into the state of a patient, a variable in $\{0, 1\}$, because of its simplicity; however, changing the decision threshold from 0.5 can improve the TP rate. For example, an estimate based on the confidence interval constructed by the bootstrap method yields a higher TP rate than the MAP estimator [20]; however, its computational cost is too high for it to be combined with the active pooling procedure. The receiver operating characteristic (ROC) analysis is a promising method for determining an appropriate decision threshold [25,26]. Along with the ROC analysis, the mathematical background of the active pooling proposed in this paper is expected to be established.

ACKNOWLEDGMENTS
This work was accomplished thanks to the author's pleasant discussions with Yukito Iba. Further, the author thanks Koji Hukushima, Yoshiyuki Kabashima, and Satoshi Takabe for their helpful comments and discussions. This research was partially supported by Grant-in-Aid for Scientific Research 19K20363 from the Japan Society for the Promotion of Science (JSPS) and JST PRESTO Grant Number JPMJPR19M2, Japan.

[27,28], where recursive update of tensors that give susceptibility is introduced on the basis of linear-response theory. In the current problem setting, the variables to be estimated obey the Bernoulli probability. Hence, we can compute the susceptibility in a simpler way.
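Let us denote the expectation of $X_j$ under the posterior conditional distribution $\sum_{X_{\setminus i,j}} P_{\rm post}(X_j, X_i = 1, X_{\setminus i,j} \mid Y)$ as $\theta^{|X_i=1}_j$. This expectation value is evaluated using the BP algorithm by fixing $\theta_{i \to \mu} = 1$ and $\theta_{\mu \to i} = 1$ for $\mu \in G(i)$. The conditional expectation value obtained using the BP algorithm is denoted as $\hat{\theta}^{|X_i=1}_j$. Thus, the susceptibility is given by $\hat{\chi}_{ij} = \hat{\theta}_i \hat{\theta}^{|X_i=1}_j - \hat{\theta}_i \hat{\theta}_j$. We can show that the symmetry $\hat{\theta}_i \hat{\theta}^{|X_i=1}_j = \hat{\theta}_j \hat{\theta}^{|X_j=1}_i$ holds.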
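The identity behind this construction can be verified on any joint distribution of two binary variables. The sketch below uses a hypothetical $2 \times 2$ joint posterior table (the numbers are made up for illustration) and checks that the susceptibility computed from the conditional expectation equals the covariance, and that adding it to the one-body product recovers the exact probability that both patients are uninfected.

```python
import numpy as np

# Hypothetical joint posterior over (X_i, X_j): rows index X_i, columns index X_j.
joint = np.array([[0.60, 0.10],
                  [0.05, 0.25]])

theta_i = joint[1, :].sum()                 # P(X_i = 1)
theta_j = joint[:, 1].sum()                 # P(X_j = 1)
theta_j_given_i = joint[1, 1] / theta_i     # E[X_j | X_i = 1]
theta_i_given_j = joint[1, 1] / theta_j     # E[X_i | X_j = 1]

chi_direct = joint[1, 1] - theta_i * theta_j               # E[X_i X_j] - theta_i theta_j
chi_conditional = theta_i * theta_j_given_i - theta_i * theta_j

# Probability that both patients are uninfected, from one-body terms plus chi.
q_pair = (1.0 - theta_i) * (1.0 - theta_j) + chi_direct
```

Here `q_pair` coincides with `joint[0, 0]`, and the products `theta_i * theta_j_given_i` and `theta_j * theta_i_given_j` are equal because both are the joint probability that $X_i = X_j = 1$; with approximate BP estimates this symmetry is a property to be checked rather than assumed.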
To check the accuracy of the susceptibility derived using the BP algorithm, we compute the exact posterior distribution by summing over all configurations in $\{0, 1\}^N$. Examples of the exact susceptibility $\chi$ and the approximated one $\hat{\chi}$ are shown in Fig.4(a) at $N = 20$, $M = 10$, $\rho = 0.1$, $p_{\rm TP} = 0.95$, and $p_{\rm FP} = 0.05$, where the $i$-dependence of $\chi_{i,20}$ is shown for two different realizations of $Y$, $F$, and $X^{(0)}$. Here, the pooling matrix is randomly generated with $N_G = 10$ and $N_O = 5$. The difference between $\chi$ and $\hat{\chi}$ is quantified by the mean squared difference $\sum_{i<j} (\hat{\chi}_{ij} - \chi_{ij})^2 / \{N(N-1)/2\}$, whose behavior is shown in Fig.4 (b) at $N = 20$ and different values of $\alpha \equiv M/N$, $p_{\rm TP}$, and $p_{\rm FP}$. For any parameter region, this difference is $O(10^{-3})$. Therefore, we consider that the BP algorithm provides a reasonable approximation of the susceptibility and expect that it is also applicable to larger $N$.
In Fig.5, (a) TP and (b) FP are shown for the cases in which the susceptibility is considered (denoted by '$\mathcal{P}_2$: with $\chi$') and not considered (denoted by '$\mathcal{P}_2$: without $\chi$'); in the former case, eq.(12) is used to determine the pool in the subsequent stage by substituting $\hat{\chi}$ calculated by BP for $\chi$. The parameters are $N = 200$, $\rho = 0.05$, $p_{\rm TP} = 0.9$, and $p_{\rm FP} = 0.1$. Each data point is averaged over 50 samples of $F$, $X^{(0)}$, and $Y$. The initial stage consists of $M_{\rm ini} = 80$ random tests with $N_G = 10$ and $N_O = 4$. The $\mathcal{P}_1$ case is compared with the random case with the same tests at the initial stage. When the susceptibility is considered, a slight improvement in TP is observed.