Exhaustive Model Selection in $b \to s \ell \ell$ Decays: Pitting Cross-Validation against AIC$_c$

In the light of recent data, we study new physics effects in the exclusive $b \to s \ell^+\ell^-$ decays from a model-independent perspective. Different combinations of the dimension-six effective operators, along with their respective Wilson coefficients, are chosen for the analysis. To find the operator or set of operators that best explains the available data in this channel, we simultaneously apply a popular model selection tool, cross-validation, and an information-theoretic one, the Akaike Information Criterion (AIC). One-, two-, and three-operator scenarios survive the test, and a left-handed quark current with a vector muon coupling is common to all of them; it is also the only surviving one-operator scenario. Best-fit values and correlations of the new Wilson coefficients are supplied for all the selected scenarios. We find that the angular observables play the dominant role in the model selection procedure. We also note that while a left-handed quark current with an axial-vector muon coupling is the only one-operator scenario able to explain the ratios $R_{K^{(*)}}$ ($R_{K^*}$ for $q^2\in [0.045, 1.1]~{\rm GeV}^2$ in particular), there are also a couple of two-operator scenarios that can simultaneously explain the measured $R_{K^{(*)}}$.


I. INTRODUCTION
Decays involving $b \to s$ transitions are suppressed in the standard model (SM) and are therefore potentially sensitive to new physics (NP) effects; whether the NP contributions appear at the tree or the loop level depends on the type of NP. The decays $B \to K^{(*)}\mu^+\mu^-$ have received a lot of attention, both experimental and theoretical, in the last couple of years. Several angular observables associated with these decays are potentially sensitive to NP effects and have been measured by LHCb [1,2]. A couple of them have shown discrepancies with their respective SM predictions [3-6]. However, these angular observables are not free from hadronic uncertainties, and it is quite possible that the observed discrepancies are due to poorly known hadronic effects; see, e.g., [7] for details. Furthermore, these modes offer theoretically clean observables like

$R_H = \dfrac{\mathcal{B}(B \to H \mu^+\mu^-)}{\mathcal{B}(B \to H e^+e^-)}\,,$

where $H$ is either a $K$ or a $K^*$ meson and $q^2$ is the dilepton squared mass. These ratios are useful tests of lepton flavor universality violation (LFUV), and within appropriately chosen ranges of $q^2$ they can be predicted very precisely in the SM; see [8,9] for details. The SM predictions are, respectively,

$R_K = 1.0004(8)\,, \qquad R_{K^*} = \begin{cases} 0.920 \pm 0.007\,, & q^2 \in [0.045, 1.1]~{\rm GeV}^2\,,\\ 0.996 \pm 0.002\,, & q^2 \in [1.1, 6]~{\rm GeV}^2\,. \end{cases}$
Belle has recently reported measurements of these ratios as well [12,13], including [13]

$R_K = 0.98^{+0.27}_{-0.23} \pm 0.06\,, \qquad q^2 \in [1, 6]~{\rm GeV}^2\,.$

These new measurements have larger uncertainties than those from LHCb, but the results are consistent with each other. Belle has also measured the separate ratios $R^0_{K^*}$ and $R^+_{K^*}$, but the associated uncertainties are quite large at the moment. On the whole, the deviation between data and the SM predictions stands at the level of 2.5 to 3$\sigma$. Future measurements of these ratios with sufficient statistical precision would have the potential to discover NP unambiguously.
The observed discrepancies can be explained in various NP models. Different types of new physics interactions (vector, scalar, etc.) with different Lorentz structures may contribute to these decays and explain the data. A lot of work has already been done, and it is a difficult task to quote all of it. In the present work, we are more interested in a model-independent analysis. A few related analyses are available in the literature, mainly focusing on one or two operators at a time [8, 14-31].
In this article, we perform a model-independent analysis of the NP affecting the $b \to s \ell^+\ell^-$ decay modes. The operator basis is exactly the same as that given in ref. [23]. We have considered all possible combinations of these operators and categorized them as independent 'models' (scenarios). Several models are capable of describing the observed data, and one is thus confronted with the problem of model selection.
In short, the problem of model selection is as follows: any model used to represent a certain set of observations will almost never be exact; chances are that some information will be lost by the choice of that particular model [32,33]. Choosing a simplistic model with too few parameters can involve making unrealistically simple assumptions, leading to high bias, poor prediction, and consequently missed opportunities for insight. While simplistic models are not flexible enough to describe the sample or the population well, a model with a larger number of parameters can fit the observed data very well. Does this make it a better model? With too many parameters, we face the possibility of just fitting the noise in the data and losing sight of the important trends. The general problem of model selection is thus the optimization of the number of parameters required to explain a given set of observations [34,35]. The motivation: there must be a happy medium somewhere.
The most generally applicable, powerful, and reliable method for model comparison (and also a computationally expensive one) is 'cross-validation' [36], which, in addition to testing the predictive power of a model, minimizes the bias and the variance together by minimizing the mean-squared error (MSE). On the other hand, penalized-likelihood information criteria, such as the Akaike Information Criterion (AIC) [37] and the Bayesian Information Criterion (BIC; more aptly named the Schwarz Information Criterion), are widely used for model selection.
AIC estimates the relative amount of information lost by a given model: the less information a model loses, the higher its quality. For a detailed discussion of various model selection procedures and their performance relative to cross-validation, see ref. [?]. In our earlier publications [38,39], we have used these criteria in the context of NP model selection in $b \to c \tau \nu_\tau$ decays.
Very recently, in a Bayesian analysis of $b \to s \ell^+\ell^-$ decays [27], an information criterion has been used for model selection. The authors employ a criterion closely related to the DIC (Deviance Information Criterion, in the definition of Gelman et al. [40]) and the BPIC (Bayesian Predictive Information Criterion [41]), which is not only well suited to samples from a Markov chain Monte Carlo but is also asymptotically equivalent [42] to a natural model-robust version of the AIC.
AIC is easy to calculate in a frequentist analysis, which is not the case for Bayesian analyses. The main difference between that analysis and ours is that they create the model hierarchy by defining $\Delta{\rm IC} = {\rm IC}_{\rm SM} - {\rm IC}_{\rm NP}$: the quality of a model is thus determined with respect to the SM, whereas in our case the best model is picked first and the hierarchy is defined with respect to it. Still, similar to our findings, they find that the case $\Delta C_9$ ($C^{\rm NP}_{9,\mu}$ in their paper) provides the optimal outcome for $B \to K^*$ transitions when only the 'Moments' data for the angular observables are considered, in addition to the new LFUV data.
In the present analysis, we use both AIC and cross-validation to pin down the best possible model(s), and show how one can use the two procedures in tandem to glean the most out of the data at hand.
The article is organized as follows: in section II, we discuss the present experimental and theoretical status of the observables used in this analysis, and in section III we examine the NP effects in the light of the current $R_{K^{(*)}}$ data alone. Section IV presents the detailed methodology of the statistical analysis, as well as the model selection. We present our results in section V, and we summarize in section VI.
II. EXPERIMENTAL AND THEORETICAL INPUTS

A. Experimental

• Binned data on the angular observables of $B^+ \to K^+ \mu^+\mu^-$ ($A_{FB}$ and $F_H$) have been taken from ref. [46] (CMS).
• The lepton flavor universality violating (LFUV) observables $R_{K^*}$, for both the low and the central bin, have been obtained from ref. [10] (LHCb). We also include the recent measurements of these observables (for the same bins) from Belle [12]. The old $R_K$ data from LHCb have been taken from ref. [48]; the updated result for the same observable has also been included [11].
• The experimental result for the branching ratio (BR) of $B_s \to \mu^+\mu^-$ has been taken from ref. [49], which is the average of the values measured by CMS, ATLAS, and LHCb. The decay constant is taken from refs. [50,51]: $f_{B_s} = 0.2284 \pm 0.0037~{\rm GeV}$.
A few words regarding the LHCb data on the angular observables, taken from ref. [1], are in order at this point. LHCb has provided the angular observables in bins of $q^2$ ($q = p_{\mu^+} + p_{\mu^-}$, $p_\mu$ being the muon four-momentum) from two separate analyses. The dataset more commonly used in the community is the one from the "Method of Moments", in which the angular observables are determined using a principal moment analysis of the angular distribution, without carrying out any angular fit to the data [52,53].
These moments are continuous functions of $q^2$. The statistical uncertainties of the angular moments are estimated using a bootstrapping technique [54], and confidence intervals are defined such that they include the 16th-84th percentiles of the bootstrap distribution of the observables. The other method, termed the "Maximum likelihood fit", involves an unbinned maximum likelihood fit to the invariant mass $m(K^+\pi^-\mu^+\mu^-)$ (with $K^{*0} \to K^+\pi^-$) and the three decay angles $\cos\theta_l$, $\cos\theta_K$, and $\phi$ in each $q^2$ bin, where:
• $\theta_l$ is the angle between the $\mu^+$ ($\mu^-$) and the direction opposite to that of the $B^0$ ($\bar{B}^0$) in the rest frame of the dimuon system,
• $\theta_K$ is the angle between the direction of the $K^+$ ($K^-$) and the $B^0$ ($\bar{B}^0$) in the rest frame of the $K^{*0}$ ($\bar{K}^{*0}$), where the $K^{*0}$ meson is reconstructed through the decay $K^{*0} \to K^+\pi^-$, and
• $\phi$ is the angle between the plane defined by the dimuon pair and the plane defined by the $K$ and the $\pi$ in the $B^0$ ($\bar{B}^0$) rest frame.
The bin sizes corresponding to the maximum likelihood analysis are larger than those for the method of moments. This is done because statistics are limited, and increasing the bin size renders the precision comparable to that of the method of moments. With a larger number of events in the future, an unbinned likelihood analysis will become the norm, but at the present level of precision the 'Moments' data are at least equally dependable, if not more so. To examine and point out any fundamental difference between these two datasets in the presence of NP, we use both sets as separate cases in our analysis. To the best of our knowledge, this is the first global $b \to s$ analysis that takes both of these datasets into account.
Apart from classifying the data according to whether the angular observables correspond to the "Likelihood" or the "Moments" method, we have also prepared separate datasets which we call:
• the "Old" dataset, containing the previous estimates of the LFUV ratios $R_K$ and $R_{K^*}$ from refs. [48] and [10], respectively, and
• the "New" dataset, where the old LHCb estimate of $R_K$ [48] is replaced by the new one [11], while both the previous [10] and the current [12] estimates of $R_{K^*}$ are included. We also include the most recent measurement of $R_K$ by Belle [13].
We should mention here that we have only used the "low bins" ($q^2 \le 6~{\rm GeV}^2$) of the experimental data referred to above. This is done to avoid the region around the $J/\psi$ resonance (the "broad charmonium" region), for which a trustworthy theoretical estimate is challenging. For the same reason, we do not include the $R_{K^*}$ data from the low-recoil region provided by the recent Belle measurement [12]. We take care of the systematic and statistical correlations in the data separately, as and when they have been reported.

B. Theoretical
The effective Hamiltonian and the operator basis for exclusive $b \to s\mu^+\mu^-$ decays are taken from refs. [22,23], written in terms of dimension-six operators $\mathcal{O}_i$ and their Wilson coefficients, with the CKM combinations $\lambda_i = V_{ib}V^*_{is}$. We consider NP effects in the electromagnetic dipole, semileptonic (axial-)vector, and (pseudo)scalar operators of this basis; in the conventions of ref. [23], for instance,

$\mathcal{O}_9^{(\prime)} = \dfrac{e^2}{16\pi^2}\left(\bar{s}\gamma_\mu P_{L(R)} b\right)\left(\bar{\mu}\gamma^\mu\mu\right), \qquad \mathcal{O}_{10}^{(\prime)} = \dfrac{e^2}{16\pi^2}\left(\bar{s}\gamma_\mu P_{L(R)} b\right)\left(\bar{\mu}\gamma^\mu\gamma_5\mu\right).$

The NP contributions to the operators $\mathcal{O}_{9,10}$ are denoted $\Delta C_{9,10}$. In these decays, when the final state contains a vector meson, one can construct various helicity amplitudes. These helicity amplitudes are used to form the angular coefficients relevant for defining the CP-symmetric and CP-asymmetric observables measured by the different experimental collaborations. Details of the various transversity amplitudes and the respective angular coefficients can be found in [23]. The two major ingredients of the helicity amplitudes are the Wilson coefficients (WCs) of the different operators and the form factors, which are defined as the hadronic matrix elements of various operators. We follow ref. [6] for the form factors in $B \to K^*$ and $B_s \to \phi$ decays.
For the $B \to K$ sector we closely follow the methodology of ref. [61]. This includes expressing the differential decay distribution as a polynomial in $\cos\theta$, where $\theta$ denotes the angle between the direction of motion of the parent $B$ meson and the positively charged lepton in the dilepton center-of-mass frame. The coefficients of this polynomial can then be expressed as combinations of the corresponding WCs and form factors. For the form factors, we use the results of ref. [22], where the authors perform a combined fit to the lattice computation of ref. [62] and the LCSR predictions at $q^2 = 0$ [63,64], using the parametrization and conventions of [62]. The method is described in detail in the appendix of ref. [65].
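In the parametrization commonly used by CMS and LHCb for $B^+ \to K^+\mu^+\mu^-$ (quoted here for orientation; ref. [61] should be consulted for the exact form used in our analysis), this polynomial reads

$$\frac{1}{\Gamma}\frac{{\rm d}\Gamma}{{\rm d}\cos\theta} = \frac{3}{4}\left(1-F_H\right)\left(1-\cos^2\theta\right) + \frac{1}{2}F_H + A_{FB}\cos\theta\,,$$

which also defines the observables $A_{FB}$ and $F_H$ entering the CMS dataset of sec. II A.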
We also take care of the correlations among these form factor elements as reported in these references, in order to propagate them to form the theoretical correlations and errors for the corresponding observables.
Since our aim is to perform a global model selection based on the plethora of available $b \to s$ data discussed in sec. II A, the operator with $C_7$ as its coefficient may well appear among the selected models as a plausible solution. Such an operator is also relevant for radiative decays like the inclusive and exclusive $b \to s\gamma$. For such scenarios, we have checked whether the parameter spaces allowed by the $b \to s\ell\ell$ data are also allowed by the inclusive $B \to X_s\gamma$ measurement, along with the branching ratios of the three exclusive radiative modes $B^+ \to K^{*+}\gamma$, $B^0 \to K^{*0}\gamma$, and the time-integrated BR($B_s \to \phi\gamma$). The definitions and formulae for these modes are taken from ref. [66]. We provide the experimental values and the SM estimates used in our analysis in Table I. The corresponding theoretical (for $B \to X_s\gamma$) and experimental references are provided therein. Our SM values are consistent with the estimates of ref. [66] within 1$\sigma$.

FIG. 1: Correlations between $R_K$ and $R_{K^*}$ for different single-operator NP scenarios. The arrows indicate increasing values of the WCs from $-2$ to $+2$. All the experimental data are considered within their 1$\sigma$ ranges.

III. NP AND CURRENT DATA ON $R_K$ AND $R_{K^*}$
Before a detailed discussion of model selection, in this section we examine the NP effects in $b \to s\mu\mu$ decays only in the light of the recently updated measurements of $R_K$ and $R_{K^*}$, focusing on the measurements of $R_{K^*}$ in the low $q^2$ bins, where there is some discrepancy between the data and the corresponding SM prediction. One needs to remember that the angular observables are not free from hadronic uncertainties; in this part of the study, we therefore do not consider any of the angular observables, nor do we carry out any fit to data. We simply check the dependence of $R^{\rm Low}_{K^*}$, $R^{\rm Central}_{K^*}$, and $R_K$ on the various WCs in one- and two-operator scenarios. We do not include $C_S$, $C_P$, $C'_S$, and $C'_P$, since the corresponding operators by themselves, or combinations including such operators, are unable to explain the observed data on $R^{\rm Low}_{K^*}$. These WCs also suffer from tight constraints from $B_s \to \mu\mu$ decays [67]. Also, the new electromagnetic dipole operator $\mathcal{O}'_7$ alone would not be able to simultaneously explain the observed data on $R_{K^{(*)}}$ and the branching fractions of the above-mentioned radiative decays. Hence, we have not considered the effects of this operator in this part of the analysis.
The results of our analysis are presented in figures 1 and 2. For single-operator scenarios, the correlations among all the above-mentioned observables are shown in figures 1a and 1b. Explaining the observed data on $R^{\rm Low}_{K^*}$ within its 1$\sigma$ range is difficult in the single-operator scenarios. Although the allowed region is tightly constrained, $\mathcal{O}_{10}$ (with WC $\Delta C_{10}$) is the only operator that can simultaneously explain all the data on $R_{K^{(*)}}$ except $R^{\rm Low}_{K^*}$ from LHCb. The required value of $\Delta C_{10}$ lies between 0.5 and 1.5, which is consistent with the measured value of Br($B_s \to \mu\mu$) within its 2$\sigma$ range; for details see figure 2k. However, there are several candidates among the two-operator scenarios that can explain all the data simultaneously. Among the various possible combinations, the most probable scenarios are those with the WCs $[\Delta C_9, C'_{10}]$, $[\Delta C_{10}, C'_9]$, $[\Delta C_{10}, C'_{10}]$, and $[\Delta C_{10}, \Delta C_9]$. The other possible scenario, $[\Delta C_9, C'_9]$, is less favored but allowed by the data. Also, the allowed values of $\Delta C_{10}$ and/or $C'_{10}$ can explain Br($B_s \to \mu\mu$) within its 1$\sigma$ range, wherever applicable; see figure 2k for details. In figure 2k, we show the variation of Br($B_s \to \mu\mu$) with respect to the parameters $[\Delta C_{10}, C'_{10}]$; for scenarios where only $\Delta C_{10}$ or $C'_{10}$ appears, one needs to look at the plot with either $C'_{10} = 0$ or $\Delta C_{10} = 0$, respectively. To conclude this section, we note that scenarios with three or more operators could also be relevant for explaining the present data on $R_{K^{(*)}}$. The take-home message is that simultaneous contributions from various operators are required for a simultaneous explanation of the $R_{K^{(*)}}$ data alone. The results of this section will be useful for a better understanding of the results of the following section.

FIG. 2: Correlations between $R_K$ and $R_{K^*}$ in different NP scenarios. The constraints on $\Delta C_{10}$ and $C'_{10}$ from the measured value of Br($B_s \to \mu\mu$) can be inferred from figure 2k.

FIG. 3: For the fit with 'New Data', indices of the competing scenarios with $\Delta{\rm AIC}_c \le 4$ in the MSE$_{\rm X\text{-}val}$ vs. $w_i^{\Delta{\rm AIC}_c}$ plane. We divide the plane into four regions, the bottom-right one being the best. Models in this region are chosen as the best ones by the two criteria combined, and their labels are colored blue. The Wilson coefficients contained in these models are shown in the box. For comparison, the indices of models picked up by the criterion $\Delta{\rm AIC}_c \le 2$ are framed. For details, see sec. V A.

IV. METHODOLOGY

A. Statistical Analysis

The methodology adopted in this paper for the model selection is as follows:

a. Define Models: Considering the NP Wilson coefficients real, we take all possible combinations (511 in total) of the coefficients, forming a predefined global set of different scenarios. Each scenario, with its specific combination of coefficients, thus constitutes a potential 'model' to explain the experimental results.
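As an illustration of this enumeration step, a minimal sketch follows; the coefficient labels here are placeholders, not the paper's notation (the actual nine-coefficient basis and the model indices are listed in the ancillary file "models.json" described at the end of the paper):

```python
from itertools import combinations

# Placeholder labels for the nine real NP Wilson coefficients; the actual
# basis and model indices are listed in the ancillary file "models.json".
wilson_coeffs = ["dC7", "dC9", "dC10", "C7p", "C9p", "C10p", "CS", "CP", "CSp"]

# Every non-empty subset of the coefficient set defines one 'model' (scenario):
# sum over k of C(9, k) = 2^9 - 1 = 511 scenarios in total.
models = [frozenset(c) for k in range(1, len(wilson_coeffs) + 1)
          for c in combinations(wilson_coeffs, k)]
assert len(models) == 511
```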
b. Numerical Optimization: Next, for each such 'model' $k$, we perform a frequentist statistical analysis, optimizing a $\chi^2$ statistic which is a function of the Wilson coefficients. Whenever applicable, statistical (systematic) covariance matrices $V^{\rm stat(syst)}$ are constructed by taking the separate correlations into account. Theoretical uncertainties are propagated separately and enter the $\chi^2$ through a 'theoretical' covariance matrix $V^{\rm th}$. The interplay of the SM uncertainties and the NP parameters enters the fit at a higher order and is neglected without any loss of generality. Following section II A, we perform four types of fit for each 'model' (a schematic of the resulting $\chi^2$ construction is sketched after the list): (a) New data with Likelihood data for angular observables, a total of 214 observables.
(b) New data with Moments data for angular observables, a total of 258 observables.
(c) Old data with Likelihood data for angular observables, a total of 211 observables.
(d) Old data with Moments data for angular observables, a total of 255 observables.
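Schematically, the $\chi^2$ described in step b can be assembled as below; the observable vector, the covariance pieces, and the theory(wc) map are placeholders (the actual fits were performed with a Mathematica package [68]):

```python
import numpy as np
from scipy.optimize import differential_evolution

def make_chi2(obs_exp, V_stat, V_syst, V_th, theory):
    """chi^2(wc) = r^T V^{-1} r, with r = obs_exp - theory(wc) and
    V = V_stat + V_syst + V_th (statistical, systematic, theoretical)."""
    V_inv = np.linalg.inv(V_stat + V_syst + V_th)
    def chi2(wc):
        r = obs_exp - theory(wc)  # 'theory' maps Wilson coefficients to observables
        return r @ V_inv @ r
    return chi2

# chi2 = make_chi2(...)
# differential_evolution(chi2, bounds=[(-2.0, 2.0)] * n_wc) mirrors the
# 'Differential Evolution' strategy chosen for the optimization.
```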
All fits are done in batch using Mathematica© in the form of a package [68]. The chosen optimization method is 'Differential Evolution', a stochastic, parallel, direct-search evolution strategy [69].
c. Post-process: In the post-processing of each fit, we quantify the fit quality using a p-value, and we find outliers by constructing a 'Pull' for each data point (related to Studentized residuals; for our purpose, the difference between the fitted and experimental results, normalized by the uncertainty of the data, including theory uncertainties [70,71]). We also check the normality of the 'Pull' distribution (i.e., its consistency with a Gaussian of $\mu = 0$ and $\sigma = 1$) to ensure the applicability of the $\chi^2$ as the fit statistic, using the "Cramér-von Mises" criterion [72]. Scenarios not satisfying the normality criterion are dropped from the analysis. With the remaining scenarios, we perform a model-selection procedure for each dataset; the methods used for this multi-model selection are elaborated in the following subsection.
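A minimal sketch of this post-processing gate, assuming SciPy's implementation of the Cramér-von Mises test (function and variable names are placeholders, not from the paper's Mathematica package):

```python
import numpy as np
from scipy.stats import cramervonmises

def pulls(obs_exp, obs_fit, sigma_tot):
    """'Pull' per data point: (experimental - fitted) value, normalized by the
    total uncertainty (experimental and theoretical, combined)."""
    return (obs_exp - obs_fit) / sigma_tot

def passes_normality(pull_values, alpha=0.05):
    """Cramer-von Mises test of the pull distribution against N(0, 1);
    scenarios failing this check are dropped from the model selection."""
    result = cramervonmises(pull_values, "norm", args=(0.0, 1.0))
    return result.pvalue > alpha
```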

B. Model Selection
Following the 'concept of parsimony' [75], we need to optimize the dimension (a measure of the degree of structure) of the model explaining our data. All model selection methods depend, to some extent, on the principle of parsimony [76]. In statistical terms, this is expressed as a bias-versus-variance trade-off: in general, bias decreases and variance increases as the model dimension increases.

1. Cross-Validation
As mentioned in the introduction, 'cross-validation' is the most generally applicable, powerful, reliable, and computationally expensive method for model comparison. The most straightforward, and the most expensive, flavor of cross-validation is "leave-one-out cross-validation" (LOOCV). In LOOCV, one of the data points is left out and the rest of the sample (the "training set") is used to optimize a particular model; that result is then used to find the predicted residual of the left-out data point. This is repeated for all data points, and a mean-squared error (MSE) is computed from all these residuals. The whole procedure is repeated for every model, and the models with the least MSE are the best ones.
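A minimal sketch of LOOCV, assuming the fitting machinery of sec. IV A is wrapped in the two placeholder callables below:

```python
import numpy as np

def loocv_mse(obs, fit_model, predict):
    """Leave-one-out cross-validation MSE for one model.
    fit_model(train_idx) -> best-fit Wilson coefficients from the training
    points only; predict(wc, i) -> model prediction for data point i."""
    n = len(obs)
    residuals = np.empty(n)
    for i in range(n):
        train = [j for j in range(n) if j != i]      # leave point i out
        wc_hat = fit_model(train)                    # refit on the training set
        residuals[i] = obs[i] - predict(wc_hat, i)   # predicted residual
    return np.mean(residuals**2)                     # least MSE = best model
```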

2. Criteria from Information Theory
In addition to the extreme computational cost demanded by cross-validation methods, especially LOOCV, their applicability to very small sample sizes is questionable [77,78]. For this reason, in our earlier works [38,39], we have shown the importance and use of the information-theoretic criterion 'AIC' [37] and its second-order variant 'AIC$_c$' [79]. It has been shown that minimizing AIC is asymptotically equivalent to cross-validation [80]. For a detailed discussion of AIC$_c$, we point the reader to those papers and the references therein. Here, let us reiterate the aspects of AIC$_c$ most important for model selection in the present work:

a. AIC$_c$: If the full reality or truth is denoted $f$ and an approximating model, in terms of a probability distribution, is $g$, then we can define a model selection criterion in terms of the $\chi^2_{\rm min}$ (equivalent to the maximum point of the empirical log-likelihood function) in the parameter space:

$${\rm AIC}_c = \chi^2_{\rm min} + 2K + \frac{2K(K+1)}{n-K-1}\,,$$

where $n$ is the number of data points and $K$ is the number of estimable parameters.

(One- and two-dimensional profile likelihoods in this analysis are depicted as 1$-$CL plots, closely following the PROB method of ref. [74]; quoted parameter ranges correspond to the 1$\sigma$ confidence level (CL) of the profile likelihood of the parameter in question.)
b. $w_i^{\Delta{\rm AIC}_c}$: The model 'closest' to the unknown reality generating the data should have the smallest AIC$_c$ among the considered models. Simple differences, $\Delta_i^{{\rm AIC}_c} = {\rm AIC}^i_c - {\rm AIC}^{\rm min}_c$, estimate the relative expected information loss between $f$ and $g_i$, allowing comparison and ranking of the candidate models in increasing order of $\Delta_i^{{\rm AIC}_c}$. Generally, the level of empirical support in favor of $g_i$ is considered substantial when $\Delta{\rm AIC}_c$ is between 0 and 2 ($\Delta{\rm AIC}_c \le 4$ is considered a conservative, loose bound). We can also quantify the weight of evidence in favor of model $i$ by defining a set of positive "Akaike weights",

$$w_i^{\Delta{\rm AIC}_c} = \frac{\exp\left(-\Delta_i^{{\rm AIC}_c}/2\right)}{\sum_r \exp\left(-\Delta_r^{{\rm AIC}_c}/2\right)}\,,$$

which add up to 1 [34]. As these depend on the entire model set, adding or dropping a model during an analysis requires recomputation for all models in the new set.
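In code, the two quantities above reduce to a few lines (a sketch; the $\chi^2_{\rm min}$ values and parameter counts come from the fits of sec. IV A):

```python
import numpy as np

def aicc(chi2_min, K, n):
    """AICc = chi2_min + 2K + 2K(K+1)/(n - K - 1), for n data points and
    K estimable parameters (second-order corrected AIC)."""
    return chi2_min + 2 * K + 2 * K * (K + 1) / (n - K - 1)

def akaike_weights(aicc_values):
    """Akaike weights from Delta_i = AICc_i - min(AICc); they add up to 1 and
    must be recomputed whenever a model is added to or dropped from the set."""
    delta = np.asarray(aicc_values) - np.min(aicc_values)
    w = np.exp(-delta / 2.0)
    return w / w.sum()
```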
In the present analysis, we have a unique opportunity not only to test the relative capabilities of the MSE from cross-validation and $w_i^{\Delta{\rm AIC}_c}$, but also to check the validity of the empirical rule of selecting models with $\Delta{\rm AIC}_c \le 2$. To that end, we first select a large number of competing models using the conservative limit $\Delta{\rm AIC}_c \le 4$, then distribute them in the plane of MSE$_{\rm X\text{-}val}$ vs. $w^{\Delta{\rm AIC}_c}$ and check how they cluster. Models with a low value of MSE$_{\rm X\text{-}val}$ and a high value of $w^{\Delta{\rm AIC}_c}$ are undoubtedly the best ones to explain the data.

FIG. 5: Comparison of the CP-averaged angular observables obtained in experiment, in the SM, and from our fit results considering all the available inputs.
V. RESULTS

A. Model Selection
As explained in section IV, we perform the fit for four different datasets. After applying the normality check (sec. IV A) to all 511 models (thus ensuring that only valid fits remain), we pick out the large set of models with $\Delta{\rm AIC}_c \le 4$. The list of models thus selected is shown in figures 3 and 4, which are based on the analysis of all the available datasets listed in sec. II A. Each point in these figures represents a 'selected' model (i.e., a model with $\Delta{\rm AIC}_c \le 4$); among these, the indices (point labels) of the models selected by $\Delta{\rm AIC}_c \le 2$ are framed. As is evident from the figures (and explained earlier in section IV B 1), the lower the MSE$_{\rm X\text{-}val}$, the better the model. One can clearly see three separate clusters along the MSE$_{\rm X\text{-}val}$ direction alone, and we can safely label the lowest one as the cluster of the best models (by MSE$_{\rm X\text{-}val}$) and discard the rest. Similarly, there are three clusters along the $w^{\Delta{\rm AIC}_c}$ direction as well, where the cluster with the largest $w^{\Delta{\rm AIC}_c}$ value contains only one model. All models with $\Delta{\rm AIC}_c \le 2$ lie in the two rightmost clusters. Following section IV B 2, we know that a model with a larger value of $w^{\Delta{\rm AIC}_c}$ is more likely to explain the data, so we put a commensurate bound on $w^{\Delta{\rm AIC}_c}$. We note that out of the various possible combinations, only a few are 'selected' by the combined criteria of MSE$_{\rm X\text{-}val}$ and AIC$_c$.

FIG. 6: Comparison of the optimized angular observables obtained in experiment, in the SM, and from our fit results considering all the available inputs.
Figures 3a and 3b respectively compare the selected models in the analyses of the moments and likelihood data on the angular observables, along with the new data on $R_{K^{(*)}}$. Similar comparisons with the old data on $R_{K^{(*)}}$ are shown in figures 4a and 4b, respectively. We note that, apart from the two-operator scenario $[\mathcal{O}_7, \mathcal{O}_9]$, the likelihood and moments data pick up very different combinations of operators after model selection. The likelihood data prefer scenarios with more operators than are required for the moments data. This could be because most of the observables determined from the unbinned maximum likelihood fit are consistent with their respective SM predictions: a large value of a single WC may lead to a discrepancy between the measured values and the corresponding SM predictions, so simultaneous contributions from different operators are preferable. The same observation can be made from the comparison of the quality of the fits with the moments and likelihood data, shown in tables II and III, respectively. The values of the reduced $\chi^2$ and the corresponding p-values, indicating the quality of the fit and the relative quality of the various models (scenarios) for a given dataset, are provided in each table. We note that, in general, the quality of the fits is better in the analysis with the moments data than in that with the likelihood data. It can be seen that the ranking of the models depends not only on the quality of the fit but also on the penalty function defined in eq. 12; this is one of the major advantages of AIC$_c$ over plain data-fitting. Results of a similar analysis after dropping $R_{K^{(*)}}$ are presented in figure 9 in Appendix B. We note that after dropping the LFUV observables from the respective fits, the multi-operator scenarios cannot survive the competition from scenarios with fewer operators.
As we know, the data from Belle have significantly larger uncertainties than those from LHCb. To check whether these data have any influence on the selected scenarios, we have performed an analysis where we drop only the $R_{K^{(*)}}$ data from Belle. We find that the selected scenarios are the same as in figure 3; the quality of the fits and the respective p-values do not change either. The reason is straightforward: the Belle data are consistent with LHCb, and the extracted $R_K$ and $R_{K^*}$ in the selected scenarios remain essentially unchanged.

There are a few other relevant observations here as well. For example, for both the 'Old' and 'New' datasets, the majority of the selected scenarios have two, three, or four operators, and all of them contain the operator corresponding to the WC $\Delta C_9$. Also, the only single-operator scenario selected by the 'New' dataset (in particular the 'Moments' dataset) has the WC $\Delta C_9$ (model 2), and model 2 is clearly the better option for all the datasets. For the 'New Moments' dataset there are, in addition, selected two-operator scenarios containing $\Delta C_9$.

We note that some models have low MSE scores but are rejected by AIC$_c$. Although leave-one-out cross-validation (LOO-CV) is asymptotically equivalent to AIC, there are differences between the two. Theoretical considerations aside, AIC is just the likelihood penalized by the number of parameters: it accounts for the uncertainty in the data ($-2\ln L$) and assumes that more parameters lead to a higher risk of overfitting ($2K$). Cross-validation (CV) just looks at the test-set performance of the model, with no further assumptions and no explicit measure of model complexity, unlike AIC. Clearly, AIC penalizes model complexity more than CV does. The accepted practice in the literature is that if one cares mostly about making predictions, and assumes the test set(s) to be reasonably similar to the validation sets, one should go for CV (only with a large number of data points) [?].
Following the reasoning in the paragraph above, it seems clear a priori that models with low MSE that are nevertheless not selected by AIC$_c$ may simply have more parameters. On another note, as the title suggests, this study clearly finds that under the simultaneous application of AIC and CV, the models cluster in a way that permits a tighter selection than would have been possible with either criterion alone. The fact that some of the models selected by CV are further discarded by AIC due to their relative complexity is therefore quite a non-trivial finding.
The best-fit values of the new WCs, with the corresponding uncertainties, for the models selected in figure 3 are given in tables II and III. We note that in all the fit results the allowed values of $\Delta C_9$ are negative, with magnitudes greater than one. The other WCs appearing alongside $\Delta C_9$ as probable solutions have magnitudes much smaller than unity. The selected models with $C_7$ among the WCs will impact the radiative decays; hence, along with the allowed values of the new WCs, we show in the relevant tables that in all such models the branching fractions of $B \to K^*\gamma$, $B_s \to \phi\gamma$, and $B \to X_s\gamma$ are consistent with their respective measurements within the 1$\sigma$ confidence interval. The fitted values of the selected WCs remain almost unchanged in the analysis after dropping $R_{K^{(*)}}$. The allowed parameter spaces of the respective WCs are similar in the analyses with the old dataset, given in the Appendix in tables IV and V. However, note that the fit qualities are relatively better in the analysis with the old data, which indicates a relatively poor NP description of the new LFUV observables together with the angular observables.
We compare our fit results for various angular observables with the respective measured values and SM predictions in figures 5 and 6 for a set of selected models. We see that the data in a few bins are inconsistent with the corresponding SM predictions, particularly in the dataset of the 'Moments' analysis. For the 'Likelihood' analysis, most of the measured values in the different bins are consistent with their SM predictions, albeit with exceptions. Assuming the observed discrepancies are due to NP, most of them can be resolved by our selected models. For the likelihood data, our analysis shows that the angular observables obtained from the selected models are fully consistent with both their SM predictions and the measured values. By contrast, the fitted angular observables of the selected models in the moments-data analysis are shifted from the respective SM predictions in a few bins; moreover, in some of those bins, the fit results shift away from the measured values as well. One can also notice the correlations among the various angular observables in the presence of different new operators in these figures: for example, $A_{FB}$ and $F_L$ are positively correlated, $F_L$ and $S_5$ are negatively correlated, etc. For some of the bins of $F_L$, the values predicted by the selected models are shifted from the respective measured values. Similar plots obtained in our analysis after dropping $R_{K^{(*)}}$ are provided in fig. 10 in the appendix.
In figure 7, the predicted values of different observables and their respective correlations are shown for a few of the models (only one- or two-operator scenarios) selected in our analysis. We note that $R^{\rm Central}_{K^*}$ and $R_K$ are correlated differently in different two-operator scenarios; however, all of them satisfy the current experimental bounds on these observables. Interestingly, the single-operator scenario with $\Delta C_9$ and the two-operator scenario with $[\Delta C_9, C_7]$ are unable to satisfy the current experimental bound on $R^{\rm Low}_{K^*}$: in these scenarios, once the experimental bound on $R_K$ is satisfied, it is hard to obtain a value of $R^{\rm Low}_{K^*}$ below 0.85. However, the scenarios with $[\Delta C_9, C'_9]$ and $[\Delta C_9, C'_{10}]$ as WCs can explain the observed data on $R_K$ and $R_{K^*}$. From the respective correlations, one can also see that $R^{\rm Low}_{K^*}$ and $R_K$ are negatively correlated, and values of $R^{\rm Low}_{K^*}$ lower than 0.88 prefer $R_K > 0.8$. In figures 7c, 7d, 7e, and 7f, we provide the predicted values of the branching fractions of the different radiative decays and their correlations with $R^{\rm Low}_{K^*}$ in the scenario with the WCs $[\Delta C_9, C_7]$. We note that there are no noticeable correlations between these branching fractions and $R^{\rm Low}_{K^*}$, or for that matter with $R_K$ and $R^{\rm Central}_{K^*}$.

The $q^2$ distributions and the zero crossings of the angular observables $A_{FB}$, $S_4$, and $S_5$ in the SM and in the selected models are shown in fig. 8. We have noted discrepancies between the $q^2$ distributions of the SM and the selected models for $A_{FB}$ and $S_5$, while $S_4(q^2)$ in the selected models is fully consistent with the SM. The $q^2$ distributions of these angular observables in the different selected models overlap with one another; hence it is hard to discriminate between the models from these distributions.
In figure 12 in the Appendix, we compare the values of $q^2$ at the zero crossing ($q^2_0$) of the above-mentioned observables between the SM, our selected models, and the measured values.
The uncertainties of the observables, in terms of the parameters, are obtained by two different techniques for figures 7 and 8. For figure 7, where each panel spans a pair of observables, we take from our fit results all the information (best-fit values, uncertainties, and correlations) on the NP parameters occurring in those observables, create a multivariate distribution out of it, and sample a large number of points ($\sim 5000$) from that distribution using a Monte Carlo. Each sampled point yields a set of values of the observables, which in turn lets us draw the 1$\sigma$ (39.35%) and 2$\sigma$ (63.21%) contours (for the two-observable plots) from these samples.
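A sketch of this sampling step, assuming a placeholder observables(wc) map from the NP parameters to a pair of observables such as $(R_K, R^{\rm Central}_{K^*})$:

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def observable_cloud(best_fit, cov, observables, n_samples=5000):
    """Sample the multivariate normal built from the best-fit NP parameters
    and their covariance, and evaluate the observables on each draw."""
    draws = rng.multivariate_normal(best_fit, cov, size=n_samples)
    return np.array([observables(wc) for wc in draws])

# The 1 sigma (39.35%) and 2 sigma contours of a two-observable panel are then
# drawn from the empirical density of the returned points.
```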
For figure 8, we deal with only one observable at a time, and at a specific value of $q^2$. Taking as uncertainties the 1$\sigma$ (68%) confidence levels of the marginal likelihoods around the central values (for SM/nuisance parameters) or best-fit values (for NP parameters), together with the corresponding correlations, we propagate the uncertainties to obtain the central values and uncertainties of the observables at a given $q^2$. Doing this for many points over the allowed $q^2$ range and interpolating between them gives the plots of the $q^2$ distributions.
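A sketch of this propagation at a fixed $q^2$, using a forward-difference gradient (the observable function and parameter vector are placeholders):

```python
import numpy as np

def propagate(obs, params, cov, q2, eps=1e-4):
    """First-order error propagation: sigma_O^2 = g^T V g, where g is the
    numerical gradient of obs(q2, params) with respect to the parameters."""
    params = np.asarray(params, dtype=float)
    central = obs(q2, params)
    grad = np.empty(len(params))
    for i in range(len(params)):
        shifted = params.copy()
        shifted[i] += eps
        grad[i] = (obs(q2, shifted) - central) / eps  # forward difference
    return central, np.sqrt(grad @ cov @ grad)

# Repeating this over a grid of q^2 points and interpolating yields the central
# curve and uncertainty band of the q^2 distribution.
```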
We should also mention here that the scenario $\Delta C_9 = -\Delta C_{10}$, which arises, for example, in some leptoquark models (among other model-dependent origins), does not pass the criterion $\Delta{\rm AIC}_c \le 4$; we hence do not display or discuss this scenario.

VI. SUMMARY
In this article, we have analyzed the semileptonic $b \to s$ decays in a model-independent framework, with the relevant dimension-six effective operators invariant under the strong and electromagnetic gauge groups. Our chosen set of operators does not include the four-quark, chromomagnetic, or tensor operators. All possible combinations of these effective operators have been considered, and using statistical tools, namely cross-validation and the small-sample-corrected Akaike Information Criterion (AIC$_c$), we have identified the combinations that best explain the available data. We have provided separate analyses for the angular-observable data obtained from the unbinned maximum likelihood fit and from the principal moments of the angular distribution in $B \to K^*\mu^+\mu^-$ decay.
Among all the possible combinations, a relatively small number of one-, two-, and three-operator scenarios satisfy the criteria for a selected 'best' model. All the selected scenarios contain, as an operator, a left-handed quark current with a vector muon coupling ($\mathcal{O}_9$); this is also the only one-operator scenario that survives the exclusion test in our search for the 'best' model(s). We have noted differences between the models selected in the analysis with the angular data from the likelihood fit and those from the principal-moments analysis. The $R_{K^{(*)}}$ data, along with the angular observables associated with $B \to K^*\mu^+\mu^-$ decays, have played an important role in this selection. In the analysis with the new data on $R_{K^{(*)}}$, scenarios with three, four, and five operators are selected. This could be due to the fact that the tension between the updated measured values of $R_K$ and $R_{K^*}$ and their respective SM predictions has been reduced in comparison with the old experimental measurements; therefore, in order to explain all the data simultaneously, simultaneous contributions from different operators are required. We have noticed changes in the selected scenarios when we drop $R_{K^{(*)}}$ from the list of inputs. We have performed the analysis with and without the 2019 updates on $R_{K^{(*)}}$ from Belle and compared them, noticing changes in the allowed parameter spaces of the Wilson coefficients of the selected scenarios.
We have compared our fit results for the angular observables with the corresponding SM predictions and the measured values in different bins for a few selected models. While our fit results are fully consistent with both the measured values and the respective SM predictions in the analysis of the likelihood data, in the analysis of the moments data there are discrepancies between our fitted results and the respective SM predictions and measured values in some bins. For some of the selected scenarios, we have studied the correlations between different observables, which show that the operator $\mathcal{O}_9$ and the combination of $\mathcal{O}_9$ and $\mathcal{O}_7$ (the flavor-changing electromagnetic dipole operator) cannot explain all the available data on $R_{K^{(*)}}$ simultaneously. In particular, they have difficulty explaining the observed results for $R_{K^*}$ in the low $q^2$ bin ($q^2 \in [0.045, 1.1]~{\rm GeV}^2$).
We have also studied the NP effects in $R_{K^{(*)}}$ alone, and noticed that the operator with a left-handed quark current and an axial-vector muon coupling ($\mathcal{O}_{10}$) is the only one-operator scenario that can explain the data. The parameter space of the corresponding Wilson coefficient $\Delta C_{10}$ allowed by $R_{K^{(*)}}$ is, however, tightly constrained by the measured values of Br($B_s \to \mu^+\mu^-$). There are also a few two-operator scenarios with the potential to explain the current observations; these are obtained from possible combinations of $\mathcal{O}_9$, $\mathcal{O}_{10}$, and operators with right-handed quark currents and vector or axial-vector muon couplings ($\mathcal{O}'_9$, $\mathcal{O}'_{10}$). In the two-operator scenarios, the allowed parameter spaces of $\Delta C_{10}$ and/or $C'_{10}$ can comfortably explain the observed data on Br($B_s \to \mu^+\mu^-$).
Note: The complete list of all the models used in this analysis is too long to include in this draft. We have added an ancillary file named "models.json" along with this draft (to be found within the arXiv source files); it contains all the combinations of WCs, relating them to their corresponding indices in our analysis.
A comparison of the selected models in figures 3 and 9 helps in understanding the impact of the new $R_{K^{(*)}}$ measurements by LHCb and Belle on the model selection. With more precise measurements of $R_K$ and $R_{K^*}$, the consistency between the data and the corresponding SM predictions may increase further; it is therefore important to gain insight into the probable NP effects in the angular observables. In figure 9, we provide the model-selection results corresponding to the "New-Moments" and "New-Likelihood" datasets after dropping the LFUV observables $R_K$ and $R_{K^*}$; the selected scenarios can be compared with those in figure 3. We note that for both the likelihood and moments data, under the given selection setup, the number of selected scenarios is reduced after we drop $R_K$ and $R_{K^*}$ from the inputs. A comparison between figures 9a and 3a indicates that the three-operator scenarios become less favorable; similarly, figures 9b and 3b show that the four- and five-operator scenarios are less favorable in the analysis without $R_K$ and $R_{K^*}$. As explained in the main text, the explanation of the new data prefers NP scenarios with more than two operators, i.e., three-, four-, or five-operator scenarios. In particular, we note that the fit qualities in general improve once we drop these LFUV observables from the fits, which is on par with our expectations. For comparison with the 'New' dataset, figure 10 lists the angular observables from the SM, experiment, and our fit results.

Figure 11 depicts the allowed parameter spaces of the most commonly occurring one- and two-parameter scenarios selected from the different types of fits and datasets. For the one-operator scenario ($\mathcal{O}_9$), the allowed parameter space of the corresponding WC $\Delta C_9$ is shown in figure 11a. For the two-operator scenarios, figures 11b, 11c, and 11d show the correlations between the WCs. We note that the allowed values of $C_7$ and $\Delta C_9$ have reasonably small ranges and are negatively correlated; the corresponding correlation is $-0.316$. Large (negative) values of $\Delta C_9$ prefer large positive values of $C_7$. The other two plots show the allowed parameter spaces and the correlations of $\Delta C_9$ with $C'_9$ and with $C'_{10}$, respectively. The fitted values of $C'_9$ and $C'_{10}$ have large uncertainties; guessing the exact correlations between them from the figures alone is therefore difficult. However, one can see that in the analysis with the new data, the correlation between $\Delta C_9$ and $C'_{10}$ ($+0.24$) is larger in magnitude than that between $\Delta C_9$ and $C'_9$, which is $-0.11$.
In figure 12, we compare the values of $q^2$ at the zero crossing ($q^2_0$) between the SM, our selected models, and the measured values for the above-mentioned observables. We note that in our selected models the $q^2_0$ for all three observables are in good agreement with the corresponding measured values.
FIG. 11: Allowed parameter space for $\Delta C_9$ in the one-operator scenario, and allowed NP parameter spaces and their respective correlations for some selected two-operator scenarios.