Nucleon axial form factor from a Bayesian neural-network analysis of neutrino-scattering data

The Bayesian approach for feed-forward neural networks has been applied to the extraction of the nucleon axial form factor from the neutrino-deuteron scattering data measured by the Argonne National Laboratory (ANL) bubble chamber experiment. This framework allows to perform a model-independent determination of the axial form factor from data.. When the low $0.05<Q^2<0.10$ GeV$^2$ data is included in the analysis, the resulting axial radius disagrees with available determinations. Furthermore, a large sensitivity to the corrections from the deuteron structure is obtained. In turn, when the low-$Q^2$ region is not taken into account, with or without deuteron corrections, no significant deviations from the dipole ansatz have been observed. A more accurate determination of the nucleon axial form factor requires new precise measurements of neutrino-induced quasielastic scattering on hydrogen and deuterium.

The Bayesian approach for feed-forward neural networks has been applied to the extraction of the nucleon axial form factor from the neutrino-deuteron scattering data measured by the Argonne National Laboratory (ANL) bubble chamber experiment. This framework allows to perform a modelindependent determination of the axial form factor from data.. When the low 0.05 < Q 2 < 0.10 GeV 2 data is included in the analysis, the resulting axial radius disagrees with available determinations. Furthermore, a large sensitivity to the corrections from the deuteron structure is obtained. In turn, when the low-Q 2 region is not taken into account, with or without deuteron corrections, no significant deviations from the dipole ansatz have been observed. A more accurate determination of the nucleon axial form factor requires new precise measurements of neutrino-induced quasielastic scattering on hydrogen and deuterium.

I. INTRODUCTION
A good understanding of neutrino interactions with matter are crucial to achieve the precision goals of oscillation experiments that aim at a precise determination of neutrino properties [1]. In current (T2K, NOvA) and future (DUNE, HyperK) oscillation experiments with few-GeV neutrinos, a realistic modeling of neutrino interactions with nuclei, and their uncertainties, in a broad kinematic range is required to distinguish signal from background and minimize systematic errors. A key ingredient of such models are the amplitudes and cross sections at the nucleon level.
In particular, a source of uncertainty arises from the nucleon axial form factor, F A . This fundamental nucleon property is a function of Q 2 , defined as minus the fourmomentum transfered to the nucleon, squared. The axial coupling g A = F A (Q 2 = 0) is known rather precisely from the neutron β-decay asymmetry [2]: although a more precise value can be obtained using recent measurements of the nucleon lifetime [3]. For the Q 2 dependence, the most common parametrization is the dipole ansatz in terms of a single parameter, the so-called axial mass M A . The dipole parametrization has been utilized to describe also the electric and magnetic form factors of the nucleon. In the Breit frame and for small momenta, this Q 2 dependence implies that the charge distribution is an exponentially decreasing function of the radial coordinate. Both the electric and magnetic form factors of the nucleon deviate from the dipole parametrization, for a review see [4]. It seems then natural to expect similar deviations for the axial one.
Empirical information about F A can be obtained from neutrino charged-current quasielastic (CCQE) scattering ν l n → l − p. Modern neutrino cross section measurements have been performed on heavy nuclear targets (mostly 12 C), where the determination of F A becomes unreliable due to the presence of not well constrained nuclear corrections and the difficulties in isolating the CCQE channel in a model-independent way. A detailed discussion of this problem can be found, for instance, in Sec. III of Ref. [5]. A more direct and, in principle, less model dependent determination of F A relies on bubble-chamber data on deuterium. Global analyses of the ANL [6][7][8], BNL [9,10], Fermilab [11] and CERN [12] data with updated vector from factors based on modern electron-scattering data have been performed by Bodek and collaborators. A reference value of M A = 1.016 ± 0.026 GeV with a small (2.5%) error has been obtained [13].
On the other hand, as pointed out in Ref. [14,15], anzatz (2) is not theoretically well founded. A new extraction of F A has been recently undertaken using a functional representation of the form factor based on conformal mapping (z-expansion) [16]. The function is only constrained by the analytic structure and asymptotic behavior dictated by QCD. The resulting form factor is consistent with the dipole one but with a much larger error, Fig. 7 of Ref. [16]. In particular, the axial radius obtained is r 2 A = 0.46 ± 0.22 fm 2 , which agrees with the dipole one r 2 A = 12/M 2 A but with a ∼ 20 times larger error. This result might have implications for oscillation studies and calls for a new measurement of neutrinonucleon cross sections, which is in any case desirable. The axial radius can also be extracted from muon capture by protons. A recent analysis [17] using the z-expansion obtains r 2 A = 0.43 ± 0.24 fm 2 , in agreement with the neutrino-scattering result.
A promising source of information about F A (Q 2 ) is lattice QCD. Although the experimental value of g A has been recurrently underpredicted in lattice QCD studies, the use of improved algorithms has recently lead to consistent results [18][19][20][21]. A global analysis of the low-Q 2 and light-quark mass dependence of the results of Refs. [19][20][21] using baryon chiral perturbation theory has found g A = 1.237 ± 0.074 and r 2 A = 0.263 ± 0.038 [22]. The central value of r 2 A is considerably lower than those from empirical determinations but within the (large) error bars of the z-expansion results.
The choice of a specific functional form of F A may bias the results of the analysis. Moreover, the choice of the number of parameters within a given parametrization is a delicate questions. Too few parameters may not give enough versatility. As the number of parameters increase, the χ 2 value of the fits can be reduced, but at some point the fit tends to reproduce statistical fluctuations of the experimental data [23]. A reduction of the model-dependence of the results can be obtained within the methods of neural networks. This approach has been used to obtain nucleon parton distribution functions from deep-inelastic scattering data by the NNPDF (neural network parton distribution function) collaboration [24].
In this paper we demonstrate that a modelindependent parametrization of F A can be extracted from a neural-network fit to neutrino-deuteron CCQE scattering data. In this context, Bayesian statistics has proved to be a very effective tool [25]. Its methods allow to make comparisons between different models and control the number of parameters in the fit. Bayesian methods are successfully used in different branches of physics, For instance, in hadron and nuclear physics they have been applied to the study of the resonance content of the p(γ, K + )Λ reaction [26] and to constrain the nuclear energy-density functional from nuclear-mass measurements [27]. We consider the Bayesian framework for neural networks formulated by MacKay [28]. It has been adapted to model electric and magnetic form factors [29]. It was also used in the investigation of the two-photon exchange phenomenon in elastic electron-proton scattering [30][31][32]. Furthermore, this approach has proved valuable to gain insight into the proton radius puzzle and, in particular, to study the model dependence in the extraction of the proton radius from the electron-scattering data [23,33].
In the present work we employ the Bayesian framework for neural networks to find the most favorable F A (Q 2 ) based on the neutrino-deuteron CCQE scattering data measured by the ANL experiment 1 . In section II the Bayesian approach for the feed-forward neural networks is introduced. Section III introduces the ANL data and theoretical framework which describes the neutrinodeuteron scattering. In Sec. IV the numerical results are 1 A global analysis including data from other experiments will be addressed in a subsequent study.
presented and discussed. In Sec. V the summary of the study is given. Appendix A contains the analytic form of some of the fits.

II. BAYESIAN FRAMEWORK FOR NEURAL NETWORKS
This section reviews the Bayesian approach formulated for the feed-forward neural networks and its adaptation to the problem of the extraction of the nucleon axial form factor from the CCQE neutrino-deuteron scattering data.

A. Neural networks
Our aim is to obtain a statistical model which has the ability to generate F A (Q 2 ) values together with uncertainties. In practice, to construct such a model, a number of probability densities must be estimated. This can be achieved within three general methods [34]: (i) nonparametric, (ii) parametric, (iii) semi-parametric. In the first approach no particular functional model is assumed and the probabilities are determined only by the data. However, if the size of the data is large the method requires introduction of many internal parameters. Additionally, this approach is computationally expensive. In the parametric method a specific functional form of the model is assumed. In this case it is relatively easy to find the optimal configuration of the model parameters. However, a particular choice of the parametrization limits the ability of the model for an accurate description of the data 2 . In this case the uncertainties for the model prediction are either overestimated or underestimated. The semi-parametric method takes the best features from both (i) and (ii) approaches. In this method, instead of a single specific functional model, a broad class of functions is considered. The optimal model is chosen among them. The neural-network approach is a realization of the semi-parametric method. In particular, the feed-forward neutral networks form a class of functions with unlimited adaptive abilities.

B. Multi-layer perceptron
In order to model the nucleon axial form factor a feedforward neural network in a multilayer perceptron (MLP) configuration is considered. The concept of MLP comes from neuroscience [35]. A given MLP is a nonlinear map from the input space of dimension n i to the output space of dimension n o N : The MLP map can be represented by a graph which consists of several layers of units: the input layer with n i units, one or more hidden layers with hidden units and the output layer which n o units. In the input and in every hidden layer there is an additional bias unit. The units from the consecutive layers are all connected but the bias unit is connected only to the following layer. As an example, the graphical representation of the MLP: N : R → R is given in Fig. 1. Every edge (connection line) in the graph represents one parameter of the function, called latter a weight. To every unit (blue circles in Fig. 1), a real singlevalued function called activation function, f , is associated; its argument is the weighted sum of the activation function values received from the connected units. In the feed-forward case, the i-th unit in the k-th layer is given in terms of the input from the k − 1-th layer by A graphical representation of the above function is given in Fig. 2. The weights w i,k u are real parameters. Their optimal values are established by the network training, for which we adopt the Bayesian framework explained below.
Notice that for the bias unit f i,k (x) = 1. Furthermore, it is assumed that in the output layer the activation functions are linear, f (x) = x. In order to simplify and speedup the performance of the numerical analyses MLPs with only one hidden layer of units are considered. In Fig. 1 there is an example of such an MLP with M = 3 hidden units.
Let us introduce the MLP N M : R → R, with a single hidden layer and M units, which has the following functional form: This function depends on W = 3M + 1 weights and Q 2 .
It has been proved and demonstrated [36][37][38][39][40][41][42] that if M is sufficiently large and activations functions in the hidden layer have a sigmoidal shape, then a map of the form (6) can approximate arbitrarily well any continuous function and its derivative. We take profit of this property to find the optimal fit of the axial form factor. In our numerical analysis the activation functions in the hidden layer are given by sigmoids Eventually, to model F A (Q 2 ) the MLP of the form (6) is adopted. In order to speed-up the numerical computations we re-scale the output of the MLP (6) by normalizing it to the dipole ansatz. Therefore, the axial form factor is represented as where F dipole A is given in Eq. (2) with M A = 1 GeV. In this way, the neural-network response gives the deviation of the axial form factor from the dipole parametrization. The value of g A is allowed to change within the PDG uncertainty, Eq. (1).
C. Bayesian approach and Occam's razor As described above, an MLP is a nonlinear map defined by some number of the adaptive parameters. The increase of the number of hidden units improves MLP's ability to reproduce the data. However, when the number of units (weights) is too large the model tends to over-fit the data and reproduce the statistical noise. As a result, its predictive power is lost. On the other hand if the network is too small then the data are under-fitted. This competition between two extreme cases is know in statistics as the bias-variance trade-off [43]. Certainly, the optimal model is a compromise between both extreme situations.
Bayesian statistics provides methods to face the biasvariance trade-off dilemma. Indeed, the Bayesian approach naturally embodies the Occam's razor [25,28,44,45] i.e. complex models, defined by a large number of parameters, are naturally penalized, while simple fits, with a small number of parameters, are preferred. Moreover, the Bayesian approach allows one to make comparisons between different statistical descriptions of the data and to indicate the model which is favored by the measurements. An example of such analysis can be found in Ref. [23], where a large number of different fits of the electric and magnetic form factors were obtained form electron-proton scattering data. For each model the value of the proton radius, r E p has been calculated. It turned out that, depending on the model, r E p ranges from 0.8 to 1.0 fm. A considerable fraction of the results agreed with the muonic-atom measurement, r E p = 0.84184(67) fm [46] but the Bayesian algorithm preferred a model with r E p = 0.899 ± 0.003 fm, which is in contradiction with the muonic-atom result.

D. Bayesian framework for MLP
We adopt the Bayesian framework for the feed-forward neural network formulated by MacKay [47,48]. The main steps of the approach are shortly reviewed below.
Let us consider the set of neural networks where M denotes the number of units in the hidden layer.
To each of the models N one associates a prior probability denoted P(N ). Our task is to obtain two posterior conditional probabilities: P(N | D) -the probability of the model N given a data set D and P({w j } | D, N )probability distribution of the model parameters, given D and model assumptions. The first probability density allows us to choose among many network types the one which is favorable by the data, while the second one is necessary to make model predictions.
If one assumes, at the beginning of the analysis, that all MLP configurations are equally suited for describing the data then the following relations between prior probabilities hold Then, in order to classify the models it is sufficient to compute the so-called evidence P(D | N ). Indeed, from the Bayes' theorem one gets where P(D) is the normalization constant.
On the other hand, the posterior probability for the weights of a given MLP reads where P(D | {w j }, N ) is the likelihood while the density P({w j } | N ) is the prior describing the initial assumptions about the weight parameters.
In the adopted approach it is assumed that the posterior distributions have Gaussian shape. Hence, to get the necessary information about (12) it is enough to obtain the configuration of weights {w j } M P at which the posterior distribution is at its maximum and the covariance matrix for the model. The latter is necessary to predict the uncertainties for the model predictions.
Notice that by integrating both sides of Eq. (12) one gets the evidence for the model, namely In order to calculate the posterior (12) we assume that the likelihood is given in terms of the χ 2 function where N L is the normalization constant. The χ 2 function for the present study is defined in Sec. III B (see Eq. 25). It is also assumed that the initial values of the weights are Gaussian distributed (see Ref. [34]) where α is a hyperparameter (regularizer) introduced to deal with the over-fitting problem and The regularizer α plays a crucial role in model optimization. Indeed, if α is large then the term (16) dominates in the posterior, Eq. (12), so it is very likely that the model under-fits the data. On the contrary, if α is too small, the likelihood dominates and the model tends to over-fit the data. It is clear that a proper estimate of the α parameter is crucial. In our approach we use the algorithm described in Ref. [34]: the optimal value α M P is established iteratively during the training of the network. Details about the algorithm implementation can be found in Ref. [29]. Notice that the optimal configuration of the model parameters ({w j } M P , α M P ) is close to the configuration at which the error function is at the minimum.
The evidence for the model is obtained from (12) and (13). We consider the approximation proposed in Ref. [48], for which the evidence has the following analytic form In the above expression, normalization factors common to all models are omitted; |A| denotes the determinant of the Hessian matrix: The parameter measures the effective number of weights, whose values are controlled by the data [34]. The λ i s are eigenvalues of the matrix ∇ i ∇ j χ 2 | w= w M P . The evidence contains two contributions: the Occam's factor (19), which is large for models with many parameters and the misfit, Eq. (18), which could be large if the model is too simple. Therefore the model which maximizes the evidence is the one which solves the biasvariance dilemma. As an illustration from the present study (details can be found in the following section), in Fig. 3 we plot the values of the error (17) and the evidence for MLP fits. The best model, with the highest evidence is not the one which has the smallest value of the error function E.

A. Theoretical framework
The neutrino-induced CCQE differential cross section, in terms of the standard Mandelstam variables s = (k + p) 2 , u = (p − k ) 2 and t = (k − k ) 2 = −Q 2 can be cast as [49] dσ to the corresponding electromagnetic proton and neutron form factors, which are extracted from electron scattering data. In the present study we have taken these electromagnetic form factors from Ref. [50,51]. With this simple choice we disregard deviations from the dipole shape because the accuracy of the neutrino-deuteron data is insufficient to be sensitive to them, particularly at the rather low Q 2 1 GeV 2 probed in the ANL experiment. Finally, PCAC and the pion-pole dominance of the pseudoscalar form factor, F P , allow to express it in terms of F A .
Deuterium-filled bubble chamber experiments actually measured ν µ + d → µ − + p + p. The cross section for this process differs from the one on free neutrons due to the momentum distribution of the neutron in the nucleus, Pauli principle, final-state interactions and mesonexchange currents. In the literature it has been commonly assumed that Eq. (23) can be corrected for these effects by a multiplicative function of Q 2 alone, R(Q 2 ) and such that R → 1 at large Q 2 . For the present study we adopt R(Q 2 ) from the calculation of Ref. [52] (solid line of Fig. 4).

B. Chi-square function for the ANL Experiment
In the ANL experiment, the interactions of muon neutrinos in a 12-foot bubble chamber filled with liquid deu-terium were studied [6][7][8]53]. The neutrino flux peaked at E ν ∼ 0.5 GeV and has fallen by an order of magnitude at E ν = 2 GeV [7,8]. For the statistical analysis we consider the Q 2 -distribution of CCQE events. Some of the originally published bins were combined together to have a number of events larger than five in every bin. Therefore, the number of bins is n ANL = 25, where bins from 1 to 23 have a width of 0.05 GeV 2 , while bins 24 and 25 have a width of 0.65 GeV 2 . The total number of measured two-and three-prong events adds to N ANL = 1792 [8]. One-prong events were not included in the ANL selection. To account for their loss, the region of Q 2 = 0.05 GeV 2 was excluded.
The predicted number of events in each bib is calculated similarly as in Ref. [16] where p(dN/dE ν )/σ(E ν , F A ) is the neutrino energy flux, given in terms of the experimental energy distribution of observed events dN/dE ν taken from Ref. [53]. As stated in the previous section, the likelihood [Eq. (14)] is built using the χ 2 function, which we cast as where χ 2 g A is introduced to constrain the value of the axial form factor at Q 2 = 0 g A and ∆g A are fixed by the present PDG central value and its error, respectively, Eq. (1). For χ 2 ANL we take where N i denotes the number of events in the bin. The last term takes into account the systematic uncertainty in the total number of events [54] inherited from the fluxnormalization uncertainty. Similarly as in the analysis of single pion production data [55] it is assumed that ∆p = 0.20 3 . At the beginning of the analysis p = 1 is set. Then during the training of the network p is iteratively updated. This algorithm is described in Ref. [30]. It is known that the low-Q 2 data are characterized by a lower efficiency (see for instance Fig. 1 of ref. [8]). Moreover, in this kinematic domain deuteron structure corrections must be carefully discussed. In order to study this problem we consider three variants of the ANL data: (i) χ 2 ANL → χ 2 BIN0 : all ANL bins included; (ii) χ 2 ANL → χ 2 BINk : where k = 1 or k = 2: ANL bins without the first k bins.
Additionally, for each data set we consider the cross section model both with and without (R(Q 2 ) = 1) deuteron corrections.

C. Numerical algorithm
We consider an MLP with M = 1 − 4 hidden units in a single hidden layer. For M > 4 the number of parameters in the fit starts to be comparable with number of bins. The numerical algorithm for getting the optimal fit is summarized by the following list of steps: (i) consider an MLP with a fixed number of hidden units M = 1; (ii) using the Bayesian learning algorithm ( [31]), perform the network training and find the optimal values for the weights and the regularizer α; (iii) calculate the evidence for each of the obtained MLP fits; (iv) repeat steps (i)-(iii) for various initial configurations of {w j }; (v) among all registered fits choose the best one according to the evidence; (vi) repeat steps (i)-(iv) for M = 2 − 4; (vii) among the best fits obtained for N 1−4 MLPs, choose the model with the highest evidence.
The optimal configuration of parameters is obtained using the Levenberg-Marquardt algorithm [56,57].

IV. NUMERICAL RESULTS
The analysis of the BIN0, BIN1 and BIN2 data sets has been independently performed. For each set, both cross section models, with and without deuteron corrections, have been studied. For the default analyses ∆g A has been taken from PDG as in Eq. 1 but the impact of a larger uncertainty, ∆g A /g A = 10% has been investigated and is discussed below. We have also performed analyses with normalization uncertainties smaller (∆p = 0.10) and larger (∆p = 0.30) than the default ∆p = 0.20 but it turned out that decreasing/increasing ∆p does not significantly affect the final results. All in all, about 17000 fits have been collected. Among them, for each type of analysis, the best model has been chosen according to the objective Bayesian criterion described in Sec. II.
In order to compare quantitatively different analyses one needs to take into account the relative data normalization P(D). This density is not evaluated within the adopted approach. Hence we can not quantitatively compare the results of e.g. BIN0 and BIN1 analyses. Nonetheless, for a given data set, quantitative comparisons between the results obtained within with the two versions of the cross section model can be made.
As described in Sec. III C for each type of analysis (data set plus cross section model), to find the optimal fit, MLPs with: M = 1, 2, 3 and 4, hidden units have been trained. The best model within each MLP type is the one with the maximal value of the evidence, Eqs. (18)(19). In order to illustrate the performance of the training algorithm, in Fig. 4 we present the dependence of the resulting axial-radius squared (r 2 A ) values on the evidence for the BIN1 data set. The best fit with the highest evidence, obtained with M = 4, gives r 2 A ≈ 0.464 fm 2 . Notice that all the best models within each MLP type reproduce well the ANL data. This is illustrated in Fig. 5 which presents the distribution of the ANL events and the best fits.  Our main results i.e. the best fits to BIN0, BIN1 and BIN2 data for the model with/without deuteron correction, with ∆g A from Eq. (1), are summarized in Table I. The corresponding F A (Q 2 ) functions with their error bands are shown in Fig. 6.
Both fits for the BIN0 data set, which contains all the data from the original ANL measurement, with and without deuteron correction, show a Q 2 behavior of F A with a rapid increase followed by a decrease after a local maximum. As a result r 2 A has a negative sign 4 , which is at odds with all available determinations. The height of the F A maximum is reduced once the deuteron correction is included in the analysis and it disappears when the first bin is removed from the ANL data (BIN1 data set). Hence, the presence of the local maximum of F A appears to be caused by low-Q 2 effects. There are several possible sources of this unexpected behavior of the fits to the BIN0 set, namely: (i) an improper description of the nuclear corrections; (ii) a low quality of the measurements at low-Q 2 due to low and not well understood efficiency; (iii) constrains coming from the uncertainty of g A ; (iv) because of the lack of very low-Q 2 data, the actual value of r 2 A might not be properly estimated: for instance, if F A has first a local minimum and then a local maximum 5 . In the later scenario, the ANL data (and the available bubble-chamber data in general) are not precise enough to reveal this behavior.
In the low-Q 2 kinematic domain, deuteron effects are sizable and may play a crucial role. On the other hand the inclusion of deuteron corrections in the analysis of the BIN1 and BIN2 data sets has a minor impact on the functional dependence of the final results i.e. there is small difference between F A (Q 2 ) obtained with and without deuteron correction as can be seen in Fig. 7. It is also interesting to highlight that the inclusion of the deuteron-structure corrections in the cross section model increases the value of the evidence for each type of the analysis, see Table. I.
In Fig. 8 we plot values of r 2 A against the evidence. It is clearly seen that the fits including deuteron corrections are favored by the ANL data. The impact of the sensitivity of the results on the deuteron structure revealed in the present study calls for a more accurate account of this ingredient of the cross section models, beyond the R(Q 2 ) function from Ref. [52] employed so far. Recent studies of CC νd scattering in the QE regime (without pions in the final state) include the nonrelativistic calculation of the inclusive cross section, incorporating two-body amplitudes, of Ref. [58]. For the kinematics of the ANL and other bubble-chamber experiments, it is important to employ a relativistic framework as in Ref. [59]. Furthermore, the consideration of the semi-inclusive rather than the inclusive cross section shall allow to take into account the detection threshold for outgoing protons, which in the ANL case is 100 MeV [7]. One should nonetheless bear in mind that even with the best model for the deuteron there is no guarantee that the low-Q 2 region is successfully described because of the difficulties in the measurement and with efficiency estimates at this kinematics.
The impact of ∆g A on the results can be easily investigated. Indeed if one increases the ∆g A uncertainty from ∆g A /g A ≈ 0.1%, as in Eq. (1), to 10%, then the local maximum of F A disappears. However, the fit uncertainty rapidly grows, from ∆F A /F A lower than 0.01% to ∆F A /F A ≈ 7% at Q 2 = 0. This analysis is shown in   In order to compare the Bayesian neural-network results with the traditional approach, we have performed a conventional analysis of the ANL data assuming the dipole parametrization for the axial form factor, Eq. 2. The best fit minimizes the χ 2 ANL function [Eq. 27] 6 . These results are summarized in Table II, while the comparison between dipole fits and neural-network analyses are displayed in Fig. 10. z-expansion results [16] even though in the latter case a different error function was utilized. Certainly, with a dipole fit to BIN0 data one can not obtain the local maximum of F A at low-Q 2 . On the other hand, the dipole fits to the BIN1 and the BIN2 data have very similar functional Q 2 -dependence as the best MLP fits. We find that BIN1 and BIN2 data sets do not show any evidence of departure of the axial form factor from the dipole shape. Furthermore, the uncertainties in the neural-network fits are systematically smaller than in the dipole χ 2 ones. This seems to be a consequence of the fact that the dipole model is biased by its simplicity while the MLP has a larger flexibility to reproduce the data. The Bayesian approach embodies the Occam's razor, hence, the predictions for the uncertainties within the optimal model should not be neither over-nor under-estimated [45].

V. SUMMARY
The first neural-network Bayesian analysis of the neutrino-deuteron scattering data has been performed. The reported study has been oriented to the extraction of the axial from factor from the ANL CCQE data, searching for deviations from the dipole form. With the full ANL data set, the analysis leads to an axial form factor which has positive slope at Q 2 = 0 and a local maximum at low Q 2 . The inclusion of deuteron correction reduces the peak in F A . Only after removing the lowest available Q 2 region (0.05 < Q 2 < 0.10 GeV 2 ) from the data, a value of the axial radius consistent with available determinations could be obtained. This suggests that correc-tions from the deuteron-structure play a crucial role at low Q 2 but it could also be the case that the experimental errors in this kinematic region were underestimated. Analyses without the low Q 2 data do not show any significant deviation from the dipole shape. Furthermore, our neural-network fits are characterized by smaller uncertainties than the dipole ones. New more precise measurements of neutrino cross section on hydrogen and deuterium are needed to unravel the axial structure of the nucleon. Techniques like the one applied in the present study shall prove valuable in such a scenario.
Appendix A: Best-fit results

BIN0 with deuteron corrections
The best-fit parametrization for for the BIN0 data set with deuteron correction included is The weights w 1−13 take the following values: Parameters w 1,3,5,7 are given in units of GeV −2 while the rest are dimensionless. As explained in Sec. II B, Eq. 8, to obtain F A (Q 2 ), function N (Q 2 , {w j }) given above should be multiply by the dipole, Eq. 2, with M A = 1 GeV.