Semileptonic decays of heavy mesons with artificial neural networks

Experimental checks of the second-row unitarity of the Cabibbo-Kobayashi-Maskawa (CKM) matrix involve extractions of the matrix element $V_{cd}$, which may be obtained from semileptonic decay rates of $D$ to $\pi$. These decay rates are proportional to hadronic form factors which parameterize how the quark $c \to d$ transition is realized in $D \to \pi$ meson decays. The form factors cannot yet be analytically computed over the whole range of available momentum transfer $q^2$, but can be parameterized with varying degrees of model dependence. We propose using artificial neural networks trained on experimental pseudo-data to predict the shape of these form factors with a prescribed uncertainty. We comment on the parameters of several commonly used model parameterizations of semileptonic form factors. We extract shape parameters and use unitarity to bound the form factor at a given $q^2$, which then allows us to bound the CKM matrix element $|V_{cd}|$.


I. INTRODUCTION
Studies of exclusive semileptonic decays of heavy mesons play an important role in understanding the dynamics of the strong interaction. They may also provide additional constraints on physics beyond the standard model (SM) [1]. Such searches, recently performed in $B$ decays, yielded tantalizing results in measurements related by lepton universality requirements, i.e. by the requirement that couplings of leptons to gauge bosons be independent of the lepton flavor. It would be interesting to see whether similar anomalies exist in semileptonic decays of charmed particles, should higher-precision data become available [2-5].
An accurate theoretical description of such transitions is also needed for the extraction of the relevant Cabibbo-Kobayashi-Maskawa (CKM) matrix elements. In particular, decays of charmed $D^0$, $D^+$, or $D_s$ mesons provide the simplest way to determine the magnitudes of the quark mixing parameters $V_{cs}$ or $V_{cd}$ [5,6]. Extractions of these CKM matrix elements from experimentally measured semileptonic decay rates require knowledge of the matrix elements of quark currents that describe strong-interaction effects. This implies that an accurate description of semileptonic transitions is also needed to improve our understanding of quark hadronization mechanisms in Quantum Chromodynamics (QCD). The hadronic transition between two mesons in exclusive semileptonic decays makes them a suitable system for the theoretical analysis of matrix elements of flavor-changing currents, which are usually parameterized in terms of momentum-dependent form factors. In semileptonic decays of charmed mesons, the form factors that describe the hadronic part of the decay amplitudes are conventionally introduced as
$$\langle K(\pi)(p_{K(\pi)})|\bar q \gamma^\mu c|D(p_D)\rangle = F_+(q^2)\left[P^\mu - \frac{m_D^2 - m_{K(\pi)}^2}{q^2}\,q^\mu\right] + F_0(q^2)\,\frac{m_D^2 - m_{K(\pi)}^2}{q^2}\,q^\mu, \qquad (1)$$
where $P = p_D + p_{K(\pi)}$ and $q = p_D - p_{K(\pi)}$. Experimental studies of these form factors proceed through the analysis of the differential decay rate $d\Gamma/dq^2$. In the simplest cases, where the mass of the final-state lepton can be neglected, the differential decay rate can be written as
$$\frac{d\Gamma(D \to K(\pi)\ell\nu_\ell)}{dq^2} = \frac{G_F^2 |V_{cq}|^2}{24\pi^3}\, p_{K(\pi)}^3\, |F_+(q^2)|^2, \qquad (2)$$
where $p_{K(\pi)}$ is the magnitude of the $K(\pi)$ three-momentum in the $D$-meson rest frame.
As can be seen from Eq. (2), only a single form factor, $F_+(q^2)$, contributes.
Accurate calculations of the non-perturbative form factors $F_{+,0}(q^2)$ over the whole momentum range are very challenging. Aside from lattice QCD [7] and/or QCD sum rule (QCDSR) [8] calculations of matrix elements of hadronic currents in exclusive decays, we currently lack a complete non-perturbative description of hadronic form factors. While both lattice QCD and QCDSR computations of form factors are improving, at the moment they only provide model-independent predictions for $F_+(q^2)$ in limited regions of $q^2$.
Rather general arguments based on the analyticity of $F_+(q^2)$ have been used to place general constraints on the shapes of the form factors. A popular approach that rigorously employs the analyticity requirement involves the so-called $z$-expansion, in which a series expansion of the form factor around some point $q^2 = t_0$ is improved by making a conformal transformation to the parameter $z$ [9],
$$z(q^2, t_0) = \frac{\sqrt{t_+ - t_0} - \sqrt{t_+ - q^2}}{\sqrt{t_+ - t_0} + \sqrt{t_+ - q^2}}, \qquad (3)$$
which maps the interval $-\infty < q^2 < t_+$ onto the line segment $-1 < z < 1$. Here $t_0$ is a free parameter that corresponds to the value of $q^2$ that maps onto $z = 0$, and $t_\pm = (m_D \pm m_\pi)^2$.
The form factor can be expanded in $z$ as
$$F_+(q^2) = \frac{1}{\Phi(q^2, t_0)} \sum_{k=0}^{\infty} a_k(t_0)\, z(q^2, t_0)^k, \qquad (4)$$
where $\Phi(q^2, t_0)$ is an arbitrary function that is analytic everywhere except on the unitarity cut [9,10].
Note that $\Phi(q^2, t_0)$ is often written as $\Phi(q^2, t_0) = P(q^2)\phi(q^2, t_0)$, with $P(q^2) = z(q^2, m_V^2)$ if poles are present between $q^2 = 0$ and the beginning of the unitarity cut, as in $B \to \pi$ transitions, where $m_V = m_{B^*}$ [11,12]. The expansion in Eq. (4) converges rapidly, so only a few terms in the expansion are really needed (see, however, [13] for a discussion of possible shortcomings of this approach). Lattice QCD [7] or QCD sum rule [8] results can be used to constrain the coefficients $a_k$ and thereby provide a model-independent parameterization of the form factor.
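As an illustration, the conformal map to $z$ can be coded in a few lines; the $D$ and $\pi$ masses below are standard PDG values, used here purely for illustration, and the sign convention is the one with $z(t_+) = +1$:

```python
import math

def z_map(q2, t0, m_D=1.86966, m_pi=0.13957):
    """Conformal variable z(q^2, t0) for the D -> pi transition, in the sign
    convention with z(t_+) = +1; masses in GeV (PDG values, for illustration)."""
    t_plus = (m_D + m_pi) ** 2
    a = math.sqrt(t_plus - q2)   # sqrt(t_+ - q^2)
    b = math.sqrt(t_plus - t0)   # sqrt(t_+ - t_0)
    return (b - a) / (b + a)

# t0 = 0 maps q^2 = 0 onto z = 0; the physical region stays well inside |z| < 1
print(z_map(0.0, 0.0))   # 0.0
print(z_map(2.5, 0.0))   # ~0.24: |z| stays small even near the D -> pi endpoint
```

The smallness of $|z|$ over the whole semileptonic range is what makes the truncated expansion in Eq. (4) converge so quickly.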
Phenomenological parameterizations of the form factors are also often used [14]. The most common parameterization is a "single pole" shape,
$$F_+^{\rm pole}(q^2) = \frac{F_+(0)}{1 - \hat q^2}, \qquad (5)$$
where the pole refers to the lowest-mass vector resonance formed in the $t$-channel with the quantum numbers of the quark current. For example, in the decay $D \to \pi e\nu_e$ the dominant pole is the $D^*$, a vector state with $1^-$ quantum numbers. Here $F_+(0)$ is the value of the form factor at $q^2 = 0$, which has to be fixed either from lattice QCD or from other arguments, and $\hat q^2 = q^2/m_{D^*}^2$. While the physical masses of the states $D^*(2010)$ (for the $D \to \pi$ transition) or $D_s^*(2112)$ (for the $D \to K$ transition) could be used, the mass $m_{D^*}$ is often taken as a fit parameter. More complicated shapes, with more effective poles, are also available [1],
$$F_+(q^2) = F_+(0)\left[\frac{\alpha}{1 - \hat q^2} + \sum_{k=1}^{N} \frac{\rho_k}{1 - \hat q^2/\gamma_k}\right], \qquad (6)$$
where $\alpha$ determines the strength of the dominant pole, $\rho_k$ gives the strength of the $k$th term in the expansion, and $\gamma_k = m_{V_k}^2/m_V^2$, with $m_{V_k}$ representing the masses of the higher-mass states with vector quantum numbers. In principle, a form factor can be approximated to any desired accuracy by introducing a large number of effective poles. To keep the number of terms in this expansion manageable, a popular parameterization due to Becirevic and Kaidalov (BK) [15] is often used, representing the $N = 1$ truncation of the expansion in Eq. (6),
$$F_+^{\rm BK}(q^2) = \frac{F_+(0)}{(1 - \hat q^2)(1 - a_{\rm BK}\,\hat q^2)}, \qquad (7)$$
where $a_{\rm BK}$ is a fit parameter. As with the case of the single pole shape in Eq. (5), a good fit to the experimental distribution can be obtained if $m_V$ is regarded as a fit parameter as well. A further extension of the BK parameterization was proposed by Ball and Zwicky (BZ) [16],
$$F_+^{\rm BZ}(q^2) = F_+(0)\left[\frac{1}{1 - \hat q^2} + \frac{r_{\rm BZ}\,\hat q^2}{(1 - \hat q^2)(1 - a_{\rm BZ}\,\hat q^2)}\right], \qquad (8)$$
where $r_{\rm BZ}$ and $a_{\rm BZ}$ are the shape parameters.
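For concreteness, the three model shapes can be coded directly. The BZ form below is written in the variant commonly used in charm semileptonic analyses, and all parameter values are placeholders for illustration, not fit results:

```python
# Sketches of the single pole, BK, and BZ shapes; qhat2 = q^2 / m_V^2.
# Parameter values below are placeholders, not fit results.

def f_pole(q2, f0, m_V):
    """Simple pole: F_+(q^2) = F_+(0) / (1 - qhat2)."""
    return f0 / (1.0 - q2 / m_V ** 2)

def f_bk(q2, f0, a_bk, m_V):
    """Becirevic-Kaidalov: F_+(0) / [(1 - qhat2)(1 - a_BK qhat2)]."""
    x = q2 / m_V ** 2
    return f0 / ((1.0 - x) * (1.0 - a_bk * x))

def f_bz(q2, f0, r_bz, a_bz, m_V):
    """Ball-Zwicky: F_+(0) [1/(1 - qhat2) + r_BZ qhat2 / ((1 - qhat2)(1 - a_BZ qhat2))]."""
    x = q2 / m_V ** 2
    return f0 * (1.0 / (1.0 - x) + r_bz * x / ((1.0 - x) * (1.0 - a_bz * x)))

m_Dstar = 2.01026  # D*(2010) mass in GeV
# all three shapes reduce to F_+(0) at q^2 = 0
print(f_pole(0.0, 0.67, m_Dstar), f_bk(0.0, 0.67, 0.43, m_Dstar))
```

Note that all three functions share the normalization $F_+(0)$ and differ only in how fast they rise toward the pole region.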
All form factor parameterizations discussed above represent physically motivated ways to describe the hadronic input. Yet one might ask what uncertainty should be assigned to the choice of a particular shape of the fit function. In other words, we are interested in whether choosing a specific functional form for the form factor induces a bias in the interpretation of the results of an experimental analysis.
This question may be addressed in the framework of machine learning (ML); in particular, it can be investigated with the help of artificial neural networks (ANNs). It has been shown that ANNs can be used as unbiased estimators of data [17,18]. This fact has been exploited by the NNPDF collaboration to parameterize the nucleon's parton distribution functions [19-21], and in form factor analyses of nucleon data [22,23]. In this paper we build a statistical interpolating model based on ANNs that contains information on experimental uncertainties and correlations but does not introduce theoretical bias. Following [19,20], we employ an approach based on multilayer feed-forward neural networks trained using the back-propagation learning algorithm.

A. Basic facts about neural networks
With the recent explosion of interest in machine learning, artificial neural networks are now widely employed in experimental particle physics analyses. Their use in jet-finding algorithms and other applications is well known [24]. Roughly speaking, a neural network is a certain non-linear function that connects input and output data. This leads to the feature of ANNs that we explore in this paper: their ability to provide unbiased universal approximants to incomplete data [17,18].

FIG. 1: A sample structure of an artificial neural network with two hidden layers (input layer, hidden layers I and II, output layer).
An ANN is built to mimic the structure of human neurons and consists of a set of interconnected units (see Fig. 1) called neurons or nodes. The activation state of a neuron is determined as a function of the activation states of the $i$ neurons connected to it. Each pair of neurons is connected by a synapse characterized by a weight, which we call $\omega_i$. We also introduce a set of $\theta_i$, representing thresholds for each neuron to "fire." Each ANN contains several groups of neurons called layers. The first layer is called the input layer; it provides the input information that is to be approximated. In this paper the input information is the value of $q^2$ for each bin of the $q^2$ distribution of the CKM matrix element times the semileptonic form factor. We find it convenient in this work to use an input layer that contains two nodes, as we shall explain later. The final layer is the output layer; it gives the value of the form factor for each $q^2$, along with its uncertainty. Layers between the input and output are conventionally called hidden. In this work we employ an ANN with two hidden layers of 100 nodes each. The ANN is trained when optimal sets of weights and thresholds are determined such that the ANN reproduces the training data within a given uncertainty. This is achieved by minimizing the error function
$$E[\omega, \theta] = \frac{1}{2} \sum_{A=1}^{n_p} \left[y_A - o(q_A^2)\right]^2, \qquad (9)$$
where $n_p$ is the number of pseudo-data used to train the ANN, and $o(q_A^2)$ is the output given by the ANN's fit for a given input data point $q_A^2$. The target data point for our paper, $y_A$, is obtained from the magnitude of the CKM matrix element times the semileptonic form factor, $|V_{cd} F_+(q^2)|$; the differential distribution of Eq. (2) is proportional to its square.
The output $o(q_A^2)$ is obtained using forward propagation: the input is passed through the network of hidden nodes. The output of the first hidden layer with $n_1$ nodes is
$$\xi_i^{[1]} = g\left(h_i^{[1]}\right), \qquad h_i^{[1]} = \sum_j \omega_{ij}^{[1]}\, q_j^2 - \theta_i^{[1]}, \qquad (10)$$
where the response of each neuron is given by
$$g(h) = \frac{1}{1 + e^{-h}}, \qquad (11)$$
the sigmoid activation function, and the summation over the $q^2$ input data points is implied.
The $\xi^{[1]}$ is then used as the input for the second hidden layer with $n_2$ hidden nodes, and so on. The process is continued until the output layer of the ANN is reached. In general, we can construct the output of the $\ell$th hidden layer with $n_\ell$ nodes as
$$\xi_i^{[\ell]} = g\left(h_i^{[\ell]}\right), \qquad (12)$$
where
$$h_i^{[\ell]} = \sum_{j=1}^{n_{\ell-1}} \omega_{ij}^{[\ell]}\, \xi_j^{[\ell-1]} - \theta_i^{[\ell]}. \qquad (13)$$
In the training process the thresholds and weights need to be adjusted so that the output represents the training data with a set precision, i.e., the error function in Eq. (9) needs to be minimized. It is common to use the method of steepest descent for this purpose. Instead, we decided to use the non-linear conjugate gradient (NLCG) method [25,26] to minimize Eq. (9). In each iteration the $\omega_i$ and the $\theta_i$ are updated as
$$\omega_i \to \omega_i - \eta\,\frac{\partial E}{\partial \omega_i}, \qquad \theta_i \to \theta_i - \eta\,\frac{\partial E}{\partial \theta_i}, \qquad (14)$$
where $\eta$ is the learning rate at a given iteration. The NLCG method employed here does not require a pre-defined learning rate: the learning rate is initially determined using line-search algorithms [25], and then iteratively updated based on gradients that are in a direction conjugate to the original gradient used in the line-search algorithm. As it turns out, the NLCG method converges much faster than the steepest-descent method for the fits employed in this paper. For more details on the NLCG method, see Ref. [26]. The gradients of the error function are obtained using the method of back propagation [27], which can be thought of as a consecutive application of the chain rule. Applying the chain rule to the $L$th (output) layer, we find
$$\Delta_i^{[L]} = \left[o_i(q_A^2) - y_A\right] g'\left(h_i^{[L]}\right), \qquad (15)$$
where $g'(h^{[L]})$ is the derivative of the activation function with respect to $h^{[L]}$. The derivatives with respect to $\omega_i$ and $\theta_i$ for layer $L$ are given by
$$\frac{\partial E}{\partial \omega_{ij}^{[L]}} = \Delta_i^{[L]}\, \xi_j^{[L-1]}, \qquad \frac{\partial E}{\partial \theta_i^{[L]}} = -\Delta_i^{[L]}. \qquad (16)$$
The output of Eq. (15) is used to obtain the derivatives of the $(L-1)$th layer,
$$\Delta_j^{[L-1]} = \left(\sum_i \omega_{ij}^{[L]}\, \Delta_i^{[L]}\right) g'\left(h_j^{[L-1]}\right). \qquad (17)$$
The procedure is repeated for the hidden layers to find the derivatives of the error function with respect to $\omega_i$ and $\theta_i$ in each layer,
$$\frac{\partial E}{\partial \omega_{ij}^{[\ell]}} = \Delta_i^{[\ell]}\, \xi_j^{[\ell-1]}, \qquad \frac{\partial E}{\partial \theta_i^{[\ell]}} = -\Delta_i^{[\ell]}. \qquad (18)$$
Using these we can obtain the numerical gradient of the error function and find the corrections to the weights and thresholds.
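The forward and backward passes above can be sketched compactly. The toy network below (much smaller than the (2, 100, 100, 1) networks actually used, and only a sketch of the scheme) applies sigmoid activations throughout and checks one back-propagated derivative against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

class TinyANN:
    """Toy feed-forward network with sigmoid activations, illustrating the
    forward pass and back-propagated gradients described in the text."""

    def __init__(self, sizes=(2, 5, 5, 1)):
        # weights omega and thresholds theta for each layer
        self.W = [rng.normal(0.0, 1.0, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
        self.T = [rng.normal(0.0, 1.0, m) for m in sizes[1:]]

    def forward(self, x):
        """xi^[l] = g(h^[l]),  h^[l] = W^[l] xi^[l-1] - theta^[l]."""
        self.xi = [np.asarray(x, dtype=float)]
        for W, theta in zip(self.W, self.T):
            self.xi.append(sigmoid(W @ self.xi[-1] - theta))
        return self.xi[-1]

    def gradients(self, x, y):
        """Back propagation: consecutive application of the chain rule."""
        o = self.forward(x)
        delta = (o - np.asarray(y)) * o * (1.0 - o)    # output layer: (o - y) g'(h)
        gW, gT = [], []
        for l in range(len(self.W) - 1, -1, -1):
            gW.insert(0, np.outer(delta, self.xi[l]))  # dE/dW^[l]
            gT.insert(0, -delta)                       # dE/dtheta^[l], since h = W xi - theta
            if l > 0:
                xi = self.xi[l]
                delta = (self.W[l].T @ delta) * xi * (1.0 - xi)
        return gW, gT

# check one back-propagated weight derivative against a finite difference
net = TinyANN()
x, y = [0.3, -0.1], [0.5]
gW, gT = net.gradients(x, y)
eps = 1e-6
net.W[0][0, 0] += eps
e_plus = 0.5 * float((net.forward(x)[0] - y[0]) ** 2)
net.W[0][0, 0] -= 2 * eps
e_minus = 0.5 * float((net.forward(x)[0] - y[0]) ** 2)
net.W[0][0, 0] += eps
print(abs((e_plus - e_minus) / (2 * eps) - gW[0][0, 0]))  # tiny: gradients agree
```

The finite-difference check plays the role of the "numerical gradient" mentioned above and is a standard way to validate a back-propagation implementation.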

B. Neural network training
Training of the ANNs described in the previous section must be performed on either real or artificial data (pseudo-data). The pseudo-data are generated using as much experimental information as possible; they can be constructed from uncorrelated data, correlated data, normalized data, or some combination of all three. In this work we elected to follow [21] and generate pseudo-data from the BES III experimental data set of [28] employing Monte Carlo techniques. The experimental data set we used has both uncorrelated and correlated statistical and systematic uncertainties, with correlation matrices available. The artificial data are generated as
$$\frac{d\Gamma^{({\rm art}),k}}{dq_i^2} = \frac{d\Gamma^{({\rm exp})}}{dq_i^2} + r_{t,i}^{(k)}\,\sigma_{t,i} + \sum_j r_{{\rm sys},j}^{(k)}\,\sigma_{{\rm sys},ji} + \sum_m r_{{\rm stat},m}^{(k)}\,\sigma_{{\rm stat},mi}, \qquad (20)$$
where $i = 1, \ldots, N_{\rm data}$ labels the experimental data entries, whose number equals the number of $q^2$ bins. These entries are used to generate $k = 1, \ldots, N_{\rm rep}$ Monte Carlo "replicas," following the recipe of [21]. The first term on the right-hand side of Eq. (20) is the central value of the experimental data point for a given $q^2$ bin. The data points in the replicas are created from it using the remaining three terms on the right-hand side of Eq. (20), which provide the variation in the pseudo-data samples.
They represent the experimental uncertainties (total uncorrelated, correlated systematic, and correlated statistical, respectively) obtained from the experimental data. Each uncertainty term is multiplied by a Gaussian random number ($r_{t,i}^{(k)}$, $r_{{\rm sys},j}^{(k)}$, or $r_{{\rm stat},m}^{(k)}$) with a mean of zero and unit standard deviation, so that the spread of each term matches the corresponding bin uncertainty reported in Ref. [28]. The total uncorrelated uncertainty $\sigma_{t,i}$ is defined as the quadratic sum of the uncorrelated systematic and statistical uncertainties, $\bar\sigma_{{\rm sys},i}$ and $\bar\sigma_{{\rm stat},i}$, respectively,
$$\sigma_{t,i} = \sqrt{\bar\sigma_{{\rm sys},i}^2 + \bar\sigma_{{\rm stat},i}^2}. \qquad (21)$$
The correlation matrix elements ${\rm corr}(j,i)$ found in Ref. [28] are related to the correlated uncertainties $\sigma_{{\rm sys},ji}$ and $\sigma_{{\rm stat},mi}$ as $\sigma_{j,i} = \bar\sigma_i \bar\sigma_j\, {\rm corr}(j,i)$, where $\bar\sigma_i$ is the uncertainty in the $i$th bin of data. The $q^2$ values were randomly generated with a flat prior across the entire $q^2$ bin; every value of $d\Gamma^{({\rm art})}/dq^2$ has a different $q^2$ input generated for it.
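The replica recipe can be illustrated schematically. The sketch below draws one uncorrelated Gaussian shift per bin and one fully correlated Gaussian shift per systematic source, a simplified stand-in for the full correlated statistical/systematic treatment; all numbers are placeholders, not the BES III values:

```python
import random

def make_replica(central, sig_uncorr, sig_corr, rnd):
    """One Monte Carlo pseudo-data replica: central value per bin, plus an
    uncorrelated Gaussian shift per bin, plus one fully correlated Gaussian
    shift per systematic source (a simplified illustration of Eq. (20))."""
    shifts = [rnd.gauss(0.0, 1.0) for _ in sig_corr]  # one r per correlated source
    return [
        c + rnd.gauss(0.0, 1.0) * s_u
        + sum(r * row[i] for r, row in zip(shifts, sig_corr))
        for i, (c, s_u) in enumerate(zip(central, sig_uncorr))
    ]

rnd = random.Random(1)
central = [17.9, 17.0, 15.7]   # placeholder central values, not BES III data
sig_uncorr = [0.5, 0.5, 0.4]   # sigma_t per bin
sig_corr = [[0.2, 0.2, 0.2]]   # one correlated source spanning the bins
replicas = [make_replica(central, sig_uncorr, sig_corr, rnd) for _ in range(2000)]
means = [sum(rep[i] for rep in replicas) / len(replicas) for i in range(3)]
print(means)  # the replica averages reproduce the central values
```

By construction, averaging many replicas recovers the experimental central values, while their bin-by-bin covariance reproduces the assumed uncertainties and correlations.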
We take the pseudo-data we have generated and divide them into 100 batches (one batch per network). Each batch has an average $\bar q^2_\rho$ and a standard deviation $\sigma_\rho$ of its $q^2$ values, which are used to scale each value of $q^2$ that we have generated. Using the scaled $q^2$ data as a secondary input is recommended to improve the stability and performance of ANNs [29]. In particular, data standardization is a popular data-scaling choice, defined as
$$\hat q^2_{i\rho} = \frac{q^2_{i\rho} - \bar q^2_\rho}{\sigma_\rho},$$
where $\rho$ is the batch number and $i$ labels a single $q^2$ value in the batch. With these transformed data, each of our ANNs has the structure (2, 100, 100, 1), as two hidden layers with 100 nodes each provide the most efficient structure without compromising performance or accuracy. With a higher number of nodes the ANN's fit would be more accurate, but the training speed would also be reduced.
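The standardization step is a one-liner per batch; a minimal sketch:

```python
def standardize(q2_batch):
    """Standardize a batch of q^2 values: subtract the batch mean and divide by
    the batch standard deviation (the scaling applied before feeding the ANN)."""
    n = len(q2_batch)
    mean = sum(q2_batch) / n
    std = (sum((q - mean) ** 2 for q in q2_batch) / n) ** 0.5
    return [(q - mean) / std for q in q2_batch]

scaled = standardize([0.1, 0.5, 0.9, 1.3, 1.7])
print(scaled)  # zero mean and unit variance by construction
```

After this transformation every batch presents inputs of comparable magnitude to the network, which is what stabilizes the training.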
This data transformation, combined with the conjugate gradient method, reaches the minimum of the error function in about 100 iterations. In contrast, the steepest-descent method with a constant learning rate provides a comparable result only after 20000 iterations.
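The speed-up can be illustrated on a toy problem. The sketch below compares Fletcher-Reeves NLCG (with an exact line minimization standing in for the line search) against fixed-rate steepest descent on an ill-conditioned quadratic; it is a cartoon of the behavior, not our actual training code:

```python
# Toy comparison on E(x, y) = (x^2 + 100 y^2) / 2, an ill-conditioned quadratic.

def grad(p):
    return [p[0], 100.0 * p[1]]

def gnorm(g):
    return (g[0] ** 2 + g[1] ** 2) ** 0.5

def line_min(p, d):
    # exact line minimum along d for this quadratic (stands in for a line search)
    t = -(p[0] * d[0] + 100.0 * p[1] * d[1]) / (d[0] ** 2 + 100.0 * d[1] ** 2)
    return [p[0] + t * d[0], p[1] + t * d[1]]

def nlcg(p, iters):
    g = grad(p)
    d = [-g[0], -g[1]]
    for _ in range(iters):
        p = line_min(p, d)
        g_new = grad(p)
        beta = (g_new[0] ** 2 + g_new[1] ** 2) / (g[0] ** 2 + g[1] ** 2)  # Fletcher-Reeves
        d = [-g_new[0] + beta * d[0], -g_new[1] + beta * d[1]]
        g = g_new
    return p

def steepest(p, iters, eta=0.009):
    for _ in range(iters):
        g = grad(p)
        p = [p[0] - eta * g[0], p[1] - eta * g[1]]
    return p

p0 = [10.0, 1.0]
print(gnorm(grad(nlcg(p0, 2))), gnorm(grad(steepest(p0, 2))))
```

On a quadratic, conjugate gradients with exact line searches terminate in at most $n$ steps (here $n = 2$), while fixed-rate steepest descent crawls along the narrow valley; the same qualitative gap is behind the 100 versus 20000 iteration counts quoted above.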

III. FORM FACTOR PARAMETERIZATION WITH NEURAL NETWORKS
We generated $2\times 10^6$ data points for each of the 14 $q^2$ bins; divided among the 100 networks, this results in each ANN being trained on 280,000 unique pseudo-data points. After training all networks individually, we found the average ANN curve, with uncertainty, at every calculated $q^2$ value. The differential decay rate $d\Gamma/dq^2$ and the $|V_{cd} F_+(q^2)|$ curves are shown in Fig. 2 and Fig. 3, respectively. Further results of the ANN training and relevant graphs are available at the URL s.wayne.edu/HEPMachineLearning. We would like to compare our results with some common form factor models: the simple pole, the BK model (or modified pole), and the BZ model [15,16]. Since the ANN fits the product of $V_{cd}$ and $F_+(q^2)$, a direct comparison would be affected by the value of $V_{cd}$, which would have to be taken as an external parameter. With that in mind, we can compare $|V_{cd} F_+(0)|$ obtained from the model fits and from our ANN analysis of the semileptonic decay data.
Further insight into how well model-inspired parameterizations describe hadronic dynamics is possible if we expand the form factor around $q^2 = 0$,
$$F_+(q^2) = F_+(0)\left[1 + F_1\, q^2 + F_2\, q^4 + \ldots\right], \qquad (22)$$
and compare the coefficients $F_n$ of the higher-order terms for the models to our averaged ANN output. We looked at the ratios of the $n$th derivative of the form factor divided by the form factor at $q^2 = 0$,
$$F_n = \frac{1}{n!\, F_+(0)} \left. \frac{d^n F_+(q^2)}{d(q^2)^n} \right|_{q^2=0}. \qquad (23)$$
We note that the first and second coefficients in Eq. (23), which are independent of the value of $V_{cd}$, are quite sensitive to the quark hadronization dynamics. In particular, drawing parallels with the discussion of the charge radius of the proton [30], the slope of $F_+(q^2)$ at $q^2 = 0$, denoted $F_1$, encodes information about the effective size of the volume in which the quark transition takes place. We shall call the coefficient $F_1$ a transitional charge radius.
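The coefficients $F_1$ and $F_2$ of any given model can be estimated numerically by finite differences in $q^2$; for the simple pole shape they are $1/m_{D^*}^2$ and $1/m_{D^*}^4$ exactly, which provides a check (the normalization $F_+(0) = 1$ below is arbitrary, since the ratios are independent of it):

```python
def f_pole(q2, m=2.01026):
    # simple pole shape with F_+(0) = 1; the ratios F_n do not depend on F_+(0)
    return 1.0 / (1.0 - q2 / m ** 2)

def taylor_ratios(f, eps=1e-4):
    """Estimate F_1 = F'(0)/F(0) and F_2 = F''(0)/(2 F(0)) by central finite
    differences in q^2 (a generic numerical sketch, usable for any model)."""
    f0, fp, fm = f(0.0), f(eps), f(-eps)
    F1 = (fp - fm) / (2.0 * eps) / f0
    F2 = (fp - 2.0 * f0 + fm) / (eps ** 2) / 2.0 / f0
    return F1, F2

m = 2.01026  # D*(2010) mass in GeV
F1, F2 = taylor_ratios(f_pole)
print(F1, 1.0 / m ** 2)  # for the simple pole, F_1 = 1/m^2 exactly
print(F2, 1.0 / m ** 4)  # and F_2 = 1/m^4
```

The same routine applied to the averaged ANN output (or to the BK and BZ shapes) yields the model-versus-network comparison discussed below.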
In order to compare the $F_{1,2}$ of a particular model-inspired parameterization to our ANN fits, we need to determine the shape parameters of each form factor model. Other than the simple pole model, where we take the mass of the $D^*(2010)$ resonance as $m_{D^*}$, the parameters that need to be fit are $a_{\rm BK}$ for the BK model, and $a_{\rm BZ}$ and $r_{\rm BZ}$ for the BZ model. We obtain these shape parameters by fitting each model to the experimental data. Using this procedure we find $a_{\rm BK} = 0.431 \pm 0.117$ for the BK model, and $r_{\rm BZ} = 0.020 \pm 0.013$ and $a_{\rm BZ} = 1.41 \pm 0.15$ for the BZ model. We note that for each of these parameterizations the combination $|V_{cd} F_+(0)|$ is also treated as a fit parameter.
The resulting values of $|V_{cd} F_+(0)|$, $F_1$, and $F_2$ for the neural network parameterization and the model-inspired parameterizations can be found in Table I. As can be seen from the first column of Table I, the values of $|V_{cd} F_+(0)|$ are consistent throughout the popular form factor models and are roughly consistent with our ANN study. The agreement is much worse for the parameters $F_{1,2}$: the ANN fits are consistently larger for the transitional charge radius $F_1$ and only marginally consistent for the $F_2$ parameter. It is likely that this happens because of the rather rigid parameterizations of the model-inspired form factors, which artificially decrease the uncertainties associated with them. This is particularly true for the simple pole parameterization $F_+^{\rm pole}(q^2)$: its uncertainty on $F_{1,2}$ is unreasonably small because, once $|V_{cd} F_+(0)|$ is fixed, the only uncertainties that can cause a spread in $F_{1,2}$ are the experimental uncertainties in the value of $m_{D^*}$, which are rather small. We conclude that more effective poles may need to be taken into account if model-inspired form factors are used to parameterize future experimental data. Graphically, the model fits are compared to the artificial neural network fits and the experimental data points in Fig. 3.

IV. FORM FACTOR BOUNDS AND THEIR DERIVATIVES
We can use our ANN fits to obtain separate bounds on the CKM matrix element $V_{cd}$ if we combine our fits with model-independent bounds on the hadronic form factor imposed by analyticity and unitarity requirements [11]. In order to do so and place an upper bound on $|F_+(0)|$, we need to calculate moments of the heavy-light invariant amplitude $\Pi_+(q^2)$, which we denote by $\chi_+^{(n)}$. They are defined by the relation
$$\chi_+^{(n)} = \frac{1}{\pi} \int_{t_+}^{\infty} dt\, \frac{{\rm Im}\,\Pi_+(t)}{t^{n+1}}, \qquad (24)$$
where $n$ denotes a specific moment and $t_+ = (m_D + m_\pi)^2$. These moments can be computed in QCD. In addition, an inequality for the imaginary part of $\Pi_+(q^2)$, which holds for $t > t_+$, can be found via the unitarity sum of the $D\pi$ state spectral function in the isospin limit [11]. This result leads to an inequality with respect to the moment,
$$\frac{1}{\pi} \int_{t_+}^{\infty} dt\, \frac{\rho^{(n)}(t)\, |F_+(t)|^2}{t^{n+1}} \leq \chi_+^{(n)}, \qquad (27)$$
where $\rho^{(n)}(t)$ is a known kinematical function. The form factor is an analytic function in the cut complex $t$-plane, so we can apply standard techniques to derive the constraint on the form factor [11]. We can bring Eq. (27) to a canonical form by mapping it onto the interior of the unit disk using the transformation in Eq. (3). In this mapping $z(t_+) = 1$ and $z(\infty) = -1$.
In terms of $z$, the inequality reads
$$\frac{1}{2\pi i} \oint_{|z|=1} \frac{dz}{z}\, \left|g_+^{(n)}(z)\right|^2 \leq \chi_+^{(n)}. \qquad (28)$$
On one hand, the analytic function $g_+^{(n)}(z)$ can be written as
$$g_+^{(n)}(z) = \omega^{(n)}(z)\, F_+(z), \qquad (29)$$
where $\omega^{(n)}(z)$ is an outer function constructed from $\rho^{(n)}$ and the weight $1/t^{n+1}$. On the other hand, it can be expanded in a Taylor series,
$$g_+^{(n)}(z) = \sum_{j=0}^{\infty} g_{+,j}^{(n)}\, z^j. \qquad (31)$$
This expansion is convergent for $|z| < 1$. It follows from Eq. (28) that the coefficients must satisfy the inequality
$$\sum_{j=0}^{\infty} \left[g_{+,j}^{(n)}\right]^2 \leq \chi_+^{(n)}. \qquad (32)$$
Since every term on the left-hand side of this inequality is positive, the constraint remains valid when only a finite number of the $g$-coefficients are kept. Expanding Eq. (29) in a Taylor series around $z = 0$ and setting it equal to Eq. (31) makes each $g$-coefficient a function of $F_+(0)$, $F_{1,2}$, and the meson masses. Substituting $g_{+,j}^{(n)}(F_+(0), F_{1,2}, \ldots)$ into Eq. (32) and solving for $F_+(0)$ leads to a bound on the form factor (at $q^2 = 0$) in terms of $F_1$, $F_2$, and $\chi_+^{(n)}$,
$$|F_+(0)| \leq h^{(n)}\!\left(F_1, F_2, \chi_+^{(n)}\right), \qquad (33)$$
where $h^{(n)}$ is the function that results from solving the inequality for $F_+(0)$. The moments $\chi_+^{(n)}$ are computed in QCD in terms of the quark and gluon condensates, $\langle \bar u u \rangle$ and $\langle \alpha_s G^2 \rangle$, respectively. These parameter values have been taken from Table II, and the updated values for the moments can be found in Table III.
We have also found $F_1$ and $F_2$ for our averaged network by fitting the calculated data to a Taylor expansion around $q^2 = 0$. Using only the data in the range $0 \leq q^2 \leq 0.75$ GeV$^2$, we found the two coefficients to be $F_1 = (60.68 \pm 0.01) \times 10^{-2}$ GeV$^{-2}$ and $F_2 = (13.43 \pm 0.37) \times 10^{-2}$ GeV$^{-4}$, as shown in Table I. Using these two values together with $|V_{cd} F_+(0)| = (6.02 \pm 1.14) \times 10^{-2}$ in the inequality obtained from Eq. (33), we can find an upper bound on the form factor at $q^2 = 0$ for each moment that has been calculated. The results can be found in [34].
V. CONCLUSIONS
Accurate theoretical descriptions of the semileptonic form factor $F_+(q^2)$ are needed for accurate extraction of the CKM matrix element $V_{cd}$ and for studies of possible new-physics contributions. While lattice QCD and QCD sum rule calculations provide model-independent results for various portions of the available $q^2$ range, extrapolations of $F_+(q^2)$ are often needed to extend the predictions to other values of $q^2$, for which a particular shape of the $q^2$ dependence is often assumed. What systematic uncertainty does choosing a particular function to describe the $q^2$ dependence of the form factors bring to such an extrapolation? We performed a fit of the available experimental data to an artificial neural network, which was used in the capacity of a universal unbiased approximant. We trained a perceptron neural net with two hidden layers of one hundred nodes each. The results of the ANN training and relevant graphs are available at s.wayne.edu/HEPMachineLearning. While the simple ANNs employed in this paper do not provide spectacular extrapolation to $q^2 = 0$, the obtained results, displayed in Table I, can be used to test existing models of the $q^2$ dependence of the $F_+(q^2)$ form factor. Based on our fits, we conclude that more effective poles may need to be taken into account if model-inspired form factors are used to parameterize future experimental data. Finally, we used the resulting ANN fit to improve unitarity constraints on the form factor, which allowed for model-independent bounds on $V_{cd}$.
This work was supported in part by the U.S. Department of Energy under contract DE-SC0007983. We thank Gil Paz for reading the manuscript and for helpful comments. AAP thanks Roy Briere for useful and oftentimes sobering conversations. AAP thanks the Institute for Nuclear Theory at the University of Washington for its kind hospitality and stimulating research environment. This research was also supported in part by the INT's U.S. Department of Energy grant No. DE-FG02-00ER41132.