Distinguishing $W'$ Signals at Hadron Colliders Using Neural Networks

We investigate a neural-network (NN)-based hypothesis test to distinguish different $W'$ and charged scalar resonances through the $\ell+\cancel{E}_T$ channel at hadron colliders. This is traditionally challenging due to a four-fold ambiguity at proton-proton colliders such as the Large Hadron Collider. Of the neural-network approaches we studied, we find a multi-class classifier based on a convolutional neural network (CNN) to be the best approach, where the CNN is trained on 2D histograms made from the transverse momentum $p_T$ and pseudorapidity $\eta$ of the $\ell$. The CNN performance is quite impressive: it can begin to distinguish between hypotheses when the signal-to-background ratio is above 10\%, with near-perfect performance for $S/B \gtrsim 60\%$. In addition, the performance is quite robust against variations in the signal, such as the overall signal strength and the decay width of the resonance. To provide context, we compare our method with traditional Bayesian hypothesis testing and discuss the pros and cons of each approach. Finally, by considering the next-to-leading-order (NLO) process with an additional jet, we demonstrate that one can generalize the CNN to multi-dimensional histograms by utilizing RGB colors to represent different variable pairs. The neural-network scheme presented in this paper is a powerful tool that could help investigate the properties of charged resonances and, more generally, can be applied to many other hypothesis-testing situations.


I. INTRODUCTION
Ever since the discovery of the $W$ boson through the $e\nu$ decay channel in 1983 at the SPS collider [1,2], the search for $W'$ and other charged boson resonances has continued. The latest analyses include the 13 TeV search in the di-jet channel conducted by ATLAS [3], and the 13 TeV searches with $\ell + j$ [4] and $\ell + \cancel{E}_T$ [5] final states conducted by CMS. So far, the mass limits have been pushed above the TeV level (see Ref. [6]), and thus future $W'$ signals are expected to occur at higher energies in high-energy hadron colliders. One such example is the CERN Large Hadron Collider (LHC), which is the main focus of our study. In this case, the leptonic search turns out to be a favorable choice, as it avoids the large QCD background. Some of the most important properties of a $W'$ to be identified would be its mass, decay width, and couplings to the Standard Model (SM) fermions; if we further include the study of charged scalar bosons, spin would also be important. However, determining the boson's couplings and spin in its center-of-mass (COM) frame at the LHC suffers from two ambiguities:
• Unknown initial state: To study the Lorentz structure of a charged-current interaction, the incident partons must be identified so as to define the forward direction (e.g. along the quark direction rather than the anti-quark direction). Due to the parton distribution functions (PDFs), the best one can do is make a reasonable guess based on the PDF properties [7].
• Missing longitudinal momentum: Since the collision frame of the incident partons is typically boosted, we need to identify the missing longitudinal momentum to correctly determine the COM angular distribution in $\cos\theta_{\rm COM}$. From kinematics, the longitudinal momentum can be solved for from a quadratic equation by assuming the mediating boson to be on-shell, but there is no event-by-event information that can be used to determine which of the two quadratic solutions is correct. This ambiguity has already been pointed out in several studies involving $\cancel{E}_T$, such as the reconstruction of $W \to e\nu$ at the SPS $p\bar{p}$ Collider [1] and top-pair production at the Tevatron [8].
Even though these ambiguities pose an obstacle to such studies, several analyses based on traditional approaches have still been conducted to reconstruct the properties of the $W'$, such as Refs. [9-13].
In this paper, we investigate deep-learning-based approaches to tackle the problem of determining the spin and interaction type of a heavy charged boson resonance through its leptonic decay channels. In particular, we consider $W'$ and $H$, generic spin-1 and spin-0 charged resonances, respectively. Over the past few years, neural networks have made enormous strides on a variety of challenging problems in different fields; some recent high-energy-physics applications include Refs. [14-23].
The above ambiguities make event-by-event reconstruction by a neural network challenging, but classification based on a collection of events can still have significant distinguishing power. Bosons with different leptonic couplings and spins will manifest distinctive kinematic features which become apparent as one accumulates events. Thus, instead of trying to reconstruct the spins and couplings directly, we can use a multi-class neural network classifier that takes measured lab quantities of a set of events as input. There are two straightforward ways to input this collection of events: either simply feed it in event-by-event as an array, or combine a number of events and form 2D histograms by choosing a certain pair of variables.
The latter would be similar to feeding in part of the probability density function on the chosen 2D kinematic plane. We have considered the following three NN models for this problem:
• Deep Neural Network (DNN): We constructed a simple DNN trained on the kinematic information from each individual event.
• Convolutional Neural Network (CNN): We constructed a simple CNN trained upon 2D histograms made from pairs of kinematic observables of a certain number of events.
• Transfer-Learning Network (TLN): As a more sophisticated CNN model, a TLN has a part called the base model, which is modified from a publicly available pre-trained model, and another part called the top model, which links the output of the base model to the target output layer. We choose the VGG-19 model [24] as the base model, and import the pre-trained weights from the ImageNet database [25]. The TLN serves as a comparison with our own CNN.
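For concreteness, the two input formats described above can be sketched as follows. This is a minimal NumPy illustration with toy $(p_T, \eta)$ values; the array names, event counts, and ranges here are our own illustrative choices, not values from the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for reconstructed events: one (pT, eta) pair per lepton.
# Real values would come from the detector simulation; these are placeholders.
pt = rng.uniform(300.0, 700.0, size=5000)   # GeV, above a 300 GeV cut
eta = rng.uniform(-2.5, 2.5, size=5000)

# DNN-style input: one row of kinematic variables per event.
dnn_input = np.column_stack([pt, eta])       # shape (5000, 2)

# CNN-style input: a single 40x40 2D histogram built from many events.
hist, _, _ = np.histogram2d(pt, eta, bins=40,
                            range=[[300.0, 700.0], [-2.5, 2.5]])
cnn_input = hist[..., np.newaxis]            # shape (40, 40, 1), one channel
```

The DNN thus sees events one at a time, while the CNN sees an "image" summarizing a whole collection of events.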
The first two methods mentioned above have already been proposed and used in Ref. [22] to distinguish the mono-jet and di-jet signatures of weakly interacting massive particles (WIMPs) from those of the SM and other dark-matter models. During our study, we found that the DNN could barely distinguish among the three classes, while the CNN outperformed the TLN. The first comparison shows that changing from individual-event identification to "global" feature identification increases the discrimination efficiency. The second comparison indicates that a sophisticated NN model is not needed for this problem, as the features within the base model of the TLN do not appear general enough to outperform our own fully trained CNN. Therefore, we only present the results of the CNN in this paper.
In our study, we investigate the application of this method to the classification of samples into the following three coupling classes:¹
• Vector/Axial (VA): This class corresponds to a $W'$ with vector-like (V) fermionic couplings, $\sim W'_\mu \bar{\psi}\gamma^\mu\psi$, or axial-vector-like (A) fermionic couplings, $\sim W'_\mu \bar{\psi}\gamma^\mu\gamma^5\psi$.
• Chiral (CH): This class corresponds to a $W'$ with left-handed (LH) fermionic couplings, $\sim W'_\mu \bar{\psi}\gamma^\mu(1-\gamma^5)\psi$, or right-handed (RH) fermionic couplings, $\sim W'_\mu \bar{\psi}\gamma^\mu(1+\gamma^5)\psi$.
• Scalar (SC): This class corresponds to an $H$ with Yukawa-like fermionic couplings.
For a pp collider, we will show that for the signal alone the $p_T$ and $\eta$ variables of the lepton cannot distinguish between the V and A hypotheses or between the LH and RH hypotheses. Interference between a $W'$ and the SM $W$ background could in principle break this degeneracy, yet such effects are found to be negligible for the TeV-mass bosons considered in this study. Thus, under our approximations the VA, CH, and SC hypotheses comprise three distinct signals.
For the sake of simplicity, we only focus on positively charged resonances, W + and H + , with masses of 1 TeV, and analyze the e + ν e final state, although this can be applied to higher masses, negatively-charged final states, and to the muon final states as well. Also, we assume that the coupling strengths and structure are universal to both the quark and the lepton sectors (even for H + ), and to all generations.
We also take into consideration the effects of different boson resonance widths, varying from truly narrow to sizeable for the 1 TeV resonance; explicitly, we considered widths of ∼100, 10, 1, and 0.1 GeV. However, it turns out that the training outcomes for the different widths are quite similar. We will focus on the samples of 10 GeV width, a choice that mimics the SM $W$ width-to-mass ratio $\Gamma_W/m_W \approx 1/40$, in most of our presentation below.

¹ These are the interactions familiar to us from the SM. The proposed method can be generalized to include other interactions, such as other linear combinations $\sim W'_\mu \bar{\psi}\gamma^\mu(a + b\gamma^5)\psi$. The discriminating power, of course, will depend upon how close the different coupling classes are.
Besides the leading-order (LO) process, we have also studied the next-to-leading-order (NLO) process, in which an extra jet is produced in the final state. To account for the extra information provided by the jet, we extend the 2D-histogram inputs of our CNN to include more variable pairs by using the three RGB colors, demonstrating that the CNN approach of Ref. [22] can be generalized to more dimensions. We formulate a few different input schemes for these NLO histograms, although there is no major performance difference among them. To understand these results, we also study the importance and contributions of the different variable pairs in these schemes. It is worth noting that for situations involving more kinematic variables like this, our results show that the CNN approach is more convenient than, and superior to, conventional methods such as Bayesian hypothesis or $\chi^2$ tests.
We prepare the samples assuming 14 TeV pp collisions, the expected COM energy of LHC Run III. Going beyond the signal-only hypothesis testing of Ref. [22], we also include the SM background from the $W$ boson. We investigate scenarios of different $S/B$ and $S/\sqrt{B}$ assuming an integrated luminosity of $L = 60$ fb$^{-1}$, a value comparable to the LHC Run II data collected in 2018. This value affects two things: the labelling of $S/\sqrt{B}$, which can be substituted by the $L$-independent $S/B$; and the number of events used to form individual histograms, whose effects are explored in the Appendix. As our results will demonstrate, the CNN can start distinguishing the signal hypotheses when $S/B = 0.1$ and achieves nearly perfect performance for $S/B \gtrsim 0.6$. Overall, our approach can in some sense be viewed as model independent, as it does not require specific model details beyond the masses, widths, and couplings of the heavy bosons. This will be justified in Sec. II.
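As a simple numerical illustration of how $S/B$ and $S/\sqrt{B}$ are related at fixed luminosity, consider the sketch below. The background cross section used here is a made-up placeholder, not the value obtained in our simulation:

```python
import math

# Hypothetical numbers for illustration only (not this paper's values):
# at fixed luminosity, S/sqrt(B) is determined by S/B and B.
lumi_pb = 60.0e3        # L = 60 fb^-1 expressed in pb^-1
sigma_b_pb = 0.05       # assumed SM background fiducial cross section [pb]
B = sigma_b_pb * lumi_pb

for s_over_b in (0.1, 0.3, 0.6):
    S = s_over_b * B
    significance = S / math.sqrt(B)
    print(f"S/B = {s_over_b:.1f} -> S = {S:.0f}, S/sqrt(B) = {significance:.1f}")
```

This makes explicit why, once $L$ is fixed, labelling scenarios by $S/\sqrt{B}$ or by the $L$-independent $S/B$ is interchangeable.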
In the Appendix, we further provide details of technical studies of the CNN performance when varying the event numbers per histogram and the corresponding total sample sizes, resolutions, and kinematic windows. In addition, we investigate the results of applying the wrong models to the testing samples. Finally, we compare the performances of binary classifiers to those of the original ternary classifiers by performing a projection on the testing scores of the latter, which demonstrates that our ternary classifier is as capable as the individual binary classifiers.

II. PARTON-LEVEL ANALYSIS OF GENERAL SINGLY-CHARGED BOSONS
Consider the parton-level processes $q\bar{q}' \to W'^+/H^+ \to e^+\nu_e$. The corresponding $p_T$ and $\eta$ differential cross sections of the $e^+$ involve the Cabibbo-Kobayashi-Maskawa (CKM) matrix element $V_{qq'}$ and the parton distribution functions (PDFs) $q(x, Q^2)$ and $\bar{q}'(y, Q^2)$.
The parton-level $p_T$ and $\eta$ differential cross sections for $H$ and $W'$ each take the form $d\sigma/dp_T = \frac{1}{2\pi}(\cdots)$, where $\sqrt{s} = 14$ TeV and the functions $F, G, H, I, J$ encode the coupling dependence. From these parton-level differential cross sections, one can tell $H$ and $W'$ apart from the $p_T$ distributions alone. However, $W'$ bosons of different coupling structures give identical $p_T$ distributions. On the other hand, the second term in the curly brackets of Eq. (4b) is proportional to $c_V^2 c_A^2$ and leads to distinct $\eta$ distributions for different $W'$ coupling scenarios. Thus, combining the parton-level $p_T$ and $\eta$ distributions, one should be able to readily distinguish among the three classes, but cannot distinguish between V and A, nor between LH and RH, from the shape of the distributions alone. After convolving with the PDFs, the distribution differences among the classes become less obvious, but they remain detectable with our technique.

III. SAMPLE GENERATION AND ANALYSIS
We prepare our parton-level samples using MadGraph5_aMC@NLO v2.7.0 [26], followed by parton shower and hadronization performed with Pythia 8.244 [27]. The samples are then passed to Delphes 3.4.2 [28] for detector simulation using the default CMS card. The events are reconstructed with FastJet 3.3.2 [29]. In particular, the final-state jets in the NLO processes are reconstructed using the anti-$k_T$ clustering algorithm [30] with cone radius $R = 0.4$.
The processes are simulated for 14 TeV LHC collisions with the NNPDF23_nlo_as_0119 [31] PDF set. The $W'$- and $H$-mediated processes are generated with the Wprime model and a general charged-Higgs model, respectively. The LO samples are generated with the generator-level cuts summarized in TABLE I. The selection cut is imposed to avoid the tail of the SM $W$ background while retaining a sufficient number of new-physics (NP) signal events below the Jacobian peak at $p_T = m_{W'}/2 = m_{H^+}/2 = 500$ GeV. This $p_T$ cut is a practical choice: it keeps the CNN training samples from being background dominated at the low end, which assists training, while allowing our $p_T$ binning to be sufficiently fine. However, in the Appendix we explore how the CNN performance depends on the $p_T$ cut, and show that there can be a trade-off between information loss (too high a cut) and $p_T$ resolution (too low a cut).
Basic cuts: $p_T > 10$ GeV, $|\eta| < 2.5$. Selection cut: $p_T > 300$ GeV.
We divide both $p^e_T$ and $\eta^e$ into 40 bins so as to satisfy the minimum input-dimension requirement of the VGG-19 model used in the TLN while our CNN remains trainable.
We show the corresponding $p^e_T$, $\eta^e$, and $p^e_T$ vs. $\eta^e$ distributions for $\Gamma_{\rm NP} \approx 10$ GeV in FIG. 1. As the boson width increases, the Jacobian peak of the $p^e_T$ distribution becomes broader, while the $\eta^e$ distribution remains identical.
As discussed in Sec. II, ideally the p T curves of the VA and CH classes should be identical in FIG. 1; however, there is a slight difference between the two due to numerical precision in the selected couplings. Since there is a much larger difference in the η distributions, we do not expect this difference to strongly affect the training or performance. The same issue will also occur in the NLO case.
The color scheme for FIG. 1(c), and for the remaining 2D histograms, is as follows: the coldest color (blue) denotes a zero entry, while the warmest color (red) denotes the maximum entry among all four classes. From FIG. 1(c), we see the Jacobian peaks at $p_T = 500$ GeV for all the NP classes, with the CH class possessing the longest tail toward low $p^e_T$, followed by the VA class and finally the SC class. Such differences in the $p^e_T$ tail and the spread in $\eta^e$ provide the kinematic information that can be used to distinguish among the three classes, even after including the background. Within the selected phase space, we compute the fiducial cross section for the SM class and the corresponding number of SM events, $B_{\rm LO}$. Each histogram fed into the CNN is made up of a total of $N_{\rm LO} = S_{\rm LO} + B_{\rm LO}$ events, where $S_{\rm LO}$ denotes the number of signal events; we vary the signal-to-background ratio $S_{\rm LO}/B_{\rm LO}$ in our considerations.
We first study the low-significance scenarios. Note that the resulting histograms are quite challenging to distinguish by eye at high accuracy, but doing so is a simple job for the CNN. In order to generate enough training histograms, we shuffle the sample events several times before grouping them into sets of $N_{\rm LO}$, and repeat this process until at least 15K histograms are produced for each significance scenario.
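The shuffle-and-group procedure above can be sketched as follows. The pool size, events per histogram, and number of shuffles are illustrative placeholders, not the values used in our production samples:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_histograms(events, n_per_hist, n_shuffles=3, bins=40,
                    ranges=((300.0, 700.0), (-2.5, 2.5))):
    """Shuffle the event pool several times and group consecutive chunks of
    n_per_hist events into 40x40 (pT, eta) histograms."""
    hists = []
    for _ in range(n_shuffles):
        shuffled = rng.permutation(events)          # permutes the event rows
        for i in range(0, len(shuffled) - n_per_hist + 1, n_per_hist):
            chunk = shuffled[i:i + n_per_hist]
            h, _, _ = np.histogram2d(chunk[:, 0], chunk[:, 1],
                                     bins=bins, range=ranges)
            hists.append(h)
    return np.array(hists)

# Toy pool of 10,000 (pT, eta) events; here N_LO = 500 events per histogram.
pool = np.column_stack([rng.uniform(300, 700, 10_000),
                        rng.uniform(-2.5, 2.5, 10_000)])
histograms = make_histograms(pool, n_per_hist=500)
```

Repeating the shuffling multiplies the number of distinct histograms that can be formed from a fixed event pool, at the cost of statistical correlations between histograms sharing events.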

B. NLO samples
The NLO samples are generated for the corresponding processes with an extra jet in the final state, which include three types of diagrams: initial-state radiation (ISR), $gq$ $t$-channel, and $gq$ $s$-channel.
The cuts imposed at the generator level are summarized in TABLE III. The $p_T$ selection cut is imposed for the same reason as in the LO case. However, the emission of the jet allows more of the background to pass this cut, so we place an additional $\cancel{E}_T$ selection cut on all NLO samples.

As in the LO case, we also study both the low-significance and high-significance scenarios.
Since the NLO process has a three-body final state, we now have 5 physical degrees of freedom. The straightforward observables are:
• $p^e_T$ and $p^j_T$: transverse momenta of the $e^+$ and $j$, respectively.
• $\eta^e$ and $\eta^j$: pseudorapidities of the $e^+$ and $j$, respectively.
• $\Delta\phi_{ej}$: azimuthal separation between the $e^+$ and $j$.
To form the required histograms while involving as much information as possible, we further consider three derived observables:
• $\cancel{E}_T$: the missing transverse energy.
• $\Delta\phi_{e\cancel{E}}$ and $\Delta\phi_{j\cancel{E}}$: azimuthal separations between the $e^+$ and $\cancel{E}_T$ and between $j$ and $\cancel{E}_T$, respectively.
The distributions of these kinematic observables are shown in FIG. 3. To choose the three pairs of variables for our RGB histograms, we propose three schemes:
• Physical Relation (Scheme 1): Intuitively, kinematic information measured from a single object should be highly correlated. Therefore, we first pair up $p^e_T$ with $\eta^e$ and $p^j_T$ with $\eta^j$. Then, reasoning that observables of the same mass dimension should be more closely related, we choose two of the three azimuthal separation variables, $\Delta\phi_{e\cancel{E}}$ and $\Delta\phi_{j\cancel{E}}$, to form the third pair.
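The RGB construction amounts to stacking three 2D histograms as the color channels of one image. The sketch below illustrates Scheme 1 with toy observables (the variable values and ranges are placeholders, not simulated physics):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical per-event observables (placeholders for the real samples).
pt_e, eta_e = rng.uniform(300, 700, n), rng.uniform(-2.5, 2.5, n)
pt_j, eta_j = rng.uniform(20, 700, n), rng.uniform(-2.5, 2.5, n)
dphi_eE, dphi_jE = rng.uniform(0, np.pi, n), rng.uniform(0, np.pi, n)

def hist2d(x, y, xr, yr, bins=40):
    h, _, _ = np.histogram2d(x, y, bins=bins, range=[xr, yr])
    return h

# Scheme 1: (pT^e, eta^e), (pT^j, eta^j), (dphi_eE, dphi_jE) as R, G, B.
rgb_image = np.stack([
    hist2d(pt_e, eta_e, (300, 700), (-2.5, 2.5)),
    hist2d(pt_j, eta_j, (20, 700), (-2.5, 2.5)),
    hist2d(dphi_eE, dphi_jE, (0, np.pi), (0, np.pi)),
], axis=-1)                                  # shape (40, 40, 3)
```

Schemes 2 and 3 differ only in which variable pairs fill the three channels, so the same stacking applies.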

IV. MODEL STRUCTURE AND TRAINING SPECIFICATIONS
In this section, we describe in detail the structure of our CNN model, which is constructed with the Keras [33] library along with TensorFlow [34] for backend implementation. We will also describe our training specifications, including the training parameters and strategies.

A. CNN structure
Our CNN is designed to read 40 × 40 2D histograms of kinematic variable pairs as input, and to classify each histogram into one of the three signal classes. For the LO samples, we input a single channel, $p^e_T$ vs. $\eta^e$; for NLO, we input three channels based on the three schemes described above. For Scheme 1, the channels are $p^e_T$ vs. $\eta^e$, $p^j_T$ vs. $\eta^j$, and $\Delta\phi_{e\cancel{E}}$ vs. $\Delta\phi_{j\cancel{E}}$; for Scheme 2, $p^e_T$ vs. $\cancel{E}_T$, $\eta^e$ vs. $\eta^j$, and $\Delta\phi_{e\cancel{E}}$ vs. $\Delta\phi_{j\cancel{E}}$; and for Scheme 3, $p^e_T$ vs. $\eta^e$, $p^e_T$ vs. $\cancel{E}_T$, and $p^e_T$ vs. $\Delta\phi_{ej}$. The CNN structure is specified in TABLE VI. In finalizing these parameters, we found that increasing the complexity of the NN model easily leads to over-fitting, and can even result in higher instability and worse training outcomes.
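A Keras sketch of a CNN of this kind is given below. TABLE VI is not reproduced here, so the layer counts and filter sizes are illustrative guesses; only the 40 × 40 input, the 2 × 2 stride-2 max pooling, the 128-node dense layer, and the three-class softmax output follow the text:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(n_channels):
    """Small CNN classifier for 40x40 histogram 'images'
    (1 channel at LO, 3 RGB channels at NLO)."""
    model = models.Sequential([
        layers.Input(shape=(40, 40, n_channels)),
        layers.Conv2D(16, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=2, strides=2),   # 2x2 kernel, stride 2
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),          # 128-node dense layer
        layers.Dense(3, activation="softmax"),         # VA / CH / SC scores
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn(n_channels=1)   # LO case; use n_channels=3 for NLO
```

Keeping the network this small reflects the observation above that added complexity led to over-fitting and instability.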

B. Training specifications
In all trainings, we split the dataset into training, validation, and testing sets in the proportion 0.64 : 0.16 : 0.20. We set the batch size to 128 and the maximum number of training epochs to 1000. To avoid over-training, we stop early if the validation loss has not improved by more than $2 \times 10^{-4}$ for over 100 epochs.
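These training specifications map directly onto standard Keras utilities. In the sketch below the dataset is a small random stand-in; only the split ratio, batch size, epoch cap, and early-stopping criterion come from the text:

```python
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(3)

# Toy stand-in for the labelled histogram dataset (random content,
# for illustration only).
x = rng.poisson(1.0, size=(1500, 40, 40, 1)).astype("float32")
y = tf.keras.utils.to_categorical(rng.integers(0, 3, size=1500), num_classes=3)

# Split into training/validation/testing sets in the 0.64 : 0.16 : 0.20 ratio.
n = len(x)
n_train, n_val = int(0.64 * n), int(0.16 * n)
x_train, y_train = x[:n_train], y[:n_train]
x_val, y_val = x[n_train:n_train + n_val], y[n_train:n_train + n_val]
x_test, y_test = x[n_train + n_val:], y[n_train + n_val:]

# Early stopping: halt once the validation loss has failed to improve by
# more than 2e-4 for 100 consecutive epochs (epochs capped at 1000).
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", min_delta=2e-4, patience=100)

# Usage with a compiled model (e.g. the CNN of Sec. IV A):
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=128, epochs=1000, callbacks=[early_stop])
```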
To evaluate the performance of our CNN, we determine receiver operating characteristic (ROC) curves using the one-against-all strategy: we consider the binary comparison between class $i$ and the combination of the other two classes, where $i$ is the target class to be tested. We then calculate the areas under the ROC curves (AUCs) as the measure of the CNN performance. (In TABLE VI, the max-pooling kernel dimension is 2 × 2 with a stride of 2 pixels, and the dense layer contains 128 nodes.)

Clearly, $p^e_T$ vs. $\eta^e$ plays the most important role in the class discrimination, while $p^j_T$ vs. $\eta^j$ and $\Delta\phi_{e\cancel{E}}$ vs. $\Delta\phi_{j\cancel{E}}$ barely contribute. This is physically understandable, as we expect the angular and coupling information of the leptonic decay to be preserved mostly in the $e^+$, which is a direct decay product of the new charged boson, rather than in $j$. Following $p^e_T$ vs. $\eta^e$ are $\eta^e$ vs. $\eta^j$, $p^e_T$ vs. $\Delta\phi_{ej}$, and $p^e_T$ vs. $\cancel{E}_T$, with the first two best at identifying the CH class and the latter two the SC class. In all cases, however, VA is the most difficult class to identify. Compared to FIG. 9, we see that combining different channels does lead to better overall performance, demonstrating that the multi-dimensional CNN can successfully utilize the additional information in these channels.
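The one-against-all AUCs used throughout can be computed with scikit-learn; the sketch below uses toy softmax scores and labels in place of real classifier outputs:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

# Toy ternary classifier outputs: softmax scores (rows sum to 1)
# and true class labels 0/1/2 standing in for VA/CH/SC.
labels = rng.integers(0, 3, size=1000)
scores = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=1000)

# One-against-all: class i versus the combination of the other two classes.
aucs = {i: roc_auc_score((labels == i).astype(int), scores[:, i])
        for i in range(3)}
```

For the random scores used here each AUC sits near 0.5; a trained classifier pushes these toward 1.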

VI. COMPARISON WITH BAYESIAN HYPOTHESIS TEST
Finally, to give context for our CNN approach, we compare the LO results with a standard hypothesis test, the Bayesian hypothesis (BH) test.² In the Bayesian approach, for a specific observed dataset $D$, the probability that it suggests a specific hypothesis $H_k$ is given by Bayes' theorem,
$$P(H_k|D) = \frac{P(D|H_k)\,P(H_k)}{\sum_{k'} P(D|H_{k'})\,P(H_{k'})},$$
where $P(H_k)$ denotes the prior probability that the hypothesis $H_k$ is correct, and $P(D|H_k)$ gives the conditional probability of obtaining the dataset $D$ given that $H_k$ is correct. In our study, we assume that all hypotheses ($k$ = VA, CH, and SC) are equally likely a priori, and hence $P(H_k) = 1/3$. We assume Poisson distributions for all individual bin counts, so the conditional probabilities are given by
$$P(D|H_k) = \prod_{m,n} f(h^D_{mn}, H^k_{mn}),$$
where $f(h^D_{mn}, H^k_{mn})$ denotes the Poisson probability of observing $h^D_{mn}$ counts at the pixel $(m, n)$ of the 2D histogram, given an expectation value of $H^k_{mn}$. From the definition in Eq. (12), we can see that there would be a problem if any $H^k_{mn} = 0$, because an observed count in such a pixel would carry an extremely high weight in determining the hypothesis. To overcome this problem, we first symmetrize $H^k$ across the $\eta^e$ axis and then perform locally non-uniform binning to account for the small numbers of events at the pseudorapidity edges $|\eta^e| \sim 2.5$.
The low- and high-significance results for both the CNN and BH tests are shown in FIG. 12. The plots indicate that the CNN approach produces the same or a better level of performance than the BH test for $S/B \geq 0.3$. We note that the binning strategy used in the BH test could be more difficult to set up in other cases. For example, for a higher-mass resonance, the $p_T$ range to be studied would be wider, with more chances of empty bins; either more events would need to be generated or the bins made coarser, otherwise the BH test cannot be applied properly. Another complication occurs if more kinematic variables are needed: as the dimension of the phase space increases, proper binning becomes more challenging. In fact, we encountered such an issue when turning to NLO. To compare with the NLO Scheme 3 CNN, we perform a BH test using
$$P(D|H_k) = \prod_{a}\prod_{m,n} f(h^{D,a}_{mn}, H^{k,a}_{mn}),$$
where $a$ runs over the three input channels.³
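The BH test above reduces to summing Poisson log-probabilities over histogram pixels. A minimal sketch with hypothetical expectation maps (randomly generated here, with a floor so no expectation is exactly zero, cf. the zero-bin discussion above):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(5)

# Hypothetical expected bin counts H^k_mn for the three hypotheses.
expectations = {k: rng.uniform(0.1, 5.0, size=(40, 40))
                for k in ("VA", "CH", "SC")}

# One observed 40x40 histogram, drawn here from the "CH" expectation.
observed = rng.poisson(expectations["CH"])

# log P(D|H_k): sum of the Poisson log-pmf over all pixels.
log_like = {k: poisson.logpmf(observed, mu).sum()
            for k, mu in expectations.items()}

# Posterior P(H_k|D) with equal priors 1/3, via Bayes' theorem
# (log-sum-exp trick for numerical stability).
m = max(log_like.values())
weights = {k: np.exp(v - m) for k, v in log_like.items()}
norm = sum(weights.values())
posterior = {k: w / norm for k, w in weights.items()}
```

With the observed histogram drawn from the CH expectation, the posterior concentrates on CH, illustrating how the test selects a hypothesis.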

VII. CONCLUSIONS
In this paper, we have investigated the ability of a neural network to distinguish different charged resonances in the $pp \to W'/H \to \ell\nu$ process at the LHC. We showed that the event-by-event ambiguities in the coupling differentiation problem can be tackled with neural-network classifiers, using a convolutional neural network architecture with binned histograms as input images. The predicted $p_T$ distributions allow discrimination between $H$ and $W'$, and, because of the boosted parton collision frame, $W'$ bosons with different couplings further manifest distinct $\eta$ distributions.
Extending previous signal-only analyses [22], we demonstrated that a simple CNN can start distinguishing the signals even at low signal-to-background ratio ($S/B$), saturating to nearly perfect performance at $S/B \gtrsim 0.6$. As our NLO schemes show, the 2D approach of Ref. [22] can also be generalized to higher dimensions, where we took into account the extra information of the jet by using RGB channels to represent different kinematic variable pairs.
These NLO performances were roughly as good as those in the LO case, even with fewer events at the same $S/B$. Performance differences resulting from different boson widths and the three pairing schemes were also investigated, and it was concluded that there were no major differences among the training results. Finally, we studied the importance of each individual variable pair in the NLO CNNs and found that they had different discriminating power for the three signal classes, with some variable pairs better suited to picking out certain classes.
Out of all the variable pairs, the CNN still relied most on the information of the charged lepton, although our results showed that the RGB color scheme successfully combined multiple channels to produce a better overall performance. As a final comparison, we also showed that this technique is as good as or better than the conventional Bayesian hypothesis-testing procedure, without having to worry about binning issues or how to generalize to higher dimensions.
Even though this study is based on the specific choice of a 1 TeV mass for the new charged resonance, it can readily be extended to other mass ranges that better meet the current experimental constraints. Moreover, more general studies can also be considered, such as including lepton-universality violation, non-universal couplings to the quark and lepton sectors, channels other than the LO and NLO processes presented here, etc. More generally, the technique we have explored can easily be applied to any hypothesis-testing scenario. For possible future explorations, inputs of more than 3 channels and higher-dimensional "super-images" would be interesting directions in which to extend the current approach.

Dependence on the $p_T$ selection cut

Defining $r_B$ for the background and $r_{S_c} \equiv S'_c/S_c$ for each signal class $c$ as the fractions of events surviving the varied cut, the mixing ratio between the NP and SM events should then be modified accordingly. In general, $r_{S_{\rm VA}}$, $r_{S_{\rm CH}}$, and $r_{S_{\rm SC}}$ are all different, which would lead to histograms with different numbers of events. Instead, we define $r_S \equiv \sum_c r_{S_c}/3$ and mix the new samples of all three classes accordingly, where the ratio $r_S/r_B \approx 0.040$ for the example of $p_{T,\min} = 100$ GeV. The same procedure is carried out for $p_{T,\min} = 150, \ldots, 500$ GeV, and the corresponding histograms are used for training and testing. It is clear that there exists a "sweet window" for the cut in the range of [150, 250] GeV; the performance deteriorates for $p^e_{T,\min}$ either below or above the window boundaries. To pin down whether the effect of $p_{T,\min}$ is due to resolution or training, we resize the histograms to dimension 60 × 60 and fix the $p_T$ bin width to 10 GeV. The bins outside the cut are then filled with zeros so as to retain a uniform input structure for our NNs. The results are given in FIG. 15. One sees that the overall bell-shaped trend persists, which suggests that the reduced performance at lower $p_{T,\min}$ is due to incomplete training rather than the resolution.
five or fewer bins, confirming that the $p_T$ resolution does play a role in the NN performance, but only once the binning is extremely coarse. As one increases the number of bins, the AUCs nearly saturate their maximum values well before $N_{\rm bin} = 40$. Consequently, we can infer that as long as the $\eta^e$ resolution remains sufficiently high, the $p^e_T$ resolution does not need to be maximized to obtain the optimal NN performance.

Consistency between binary and ternary classifiers
Even though we are dealing with a three-class problem, an alternative to training a ternary classifier to tag a specific sample set is to test it with multiple binary classifiers.
If the NNs are all properly trained, we should expect a consistency in their performances. Therefore, we compare the two methods in the following way.
After each individual testing sample is evaluated by a trained ternary NN classifier, it is assigned a three-component score array, $(P_1, P_2, P_3)$, denoting its "probabilities" of belonging to each of the three classes. Suppose we are comparing a ternary NN's performance with that of a binary NN for the discrimination between class $i$ and class $j$; we project the score components of the ternary classifier by defining
$$P'_i = \frac{P_i}{P_i + P_j}.$$
We then compare the projected AUCs with the AUCs given by the true binary classifier dedicated to classes $i$ and $j$. FIG. 17 shows the AUCs of the projected ternary (left) and binary (right) scores for VA vs. CH, CH vs. SC, and SC vs. VA at different $S/B$ ratios. As one can see, the projected ternary AUCs are consistent with the binary AUCs, implying that the ternary classifier we trained gives the same level of performance in binary classification as the dedicated binary classifiers. It is also interesting that as long as the CH class is involved, the AUCs are all above 0.8 for $S/B \geq 0.1$ and already reach 1.0 at $S/B = 0.3$, meaning that it is relatively easy to identify the CH class; by contrast, it is much harder for the NN to distinguish the VA class from the SC class.
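The projection amounts to renormalizing the two relevant score components, $P_i/(P_i+P_j)$. A minimal sketch with toy ternary scores:

```python
import numpy as np

rng = np.random.default_rng(6)

# Toy ternary scores (P1, P2, P3): each row sums to 1, as softmax outputs do.
P = rng.dirichlet(alpha=[1.0, 1.0, 1.0], size=1000)

def project(P, i, j):
    """Collapse ternary scores onto the i-vs-j comparison by renormalizing
    the two relevant components: P'_i = P_i / (P_i + P_j)."""
    return P[:, i] / (P[:, i] + P[:, j])

# Example: projected score for class 0 against class 2.
p_0_vs_2 = project(P, 0, 2)
```

The projected scores can then be fed to the same ROC/AUC machinery used for a dedicated binary classifier, enabling a direct comparison.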

Applying the wrong models
Another interesting question is what happens if the wrong models are applied to the testing sets. There are two ways of testing this in our analysis: wrong significance and wrong decay width. In the following, we show the two corresponding tests. The first is to use the models trained on LO samples of $\Gamma_{\rm NP} \approx 100$ GeV to test the samples of $\Gamma_{\rm NP} \approx 0.1$ GeV at a fixed $S/\sqrt{B}$, and vice versa, and likewise between samples of $\Gamma_{\rm NP} \approx 10$ GeV and $\Gamma_{\rm NP} \approx 1$ GeV. We then calculate the ratios of the "wrong" AUCs to the "correct" AUCs for different significances. To compare with FIG. 1(c), we show the $p^e_T$ vs. $\eta^e$ distributions for $\Gamma_{\rm NP} \approx 100, 1, 0.1$ GeV in FIG. 18. The training results are shown in FIG. 19. We can see that applying models of the wrong widths still yields some distinguishability, yet much worse than applying the correct models.

This indicates the importance of getting the right order of magnitude of $\Gamma_{\rm NP}$ before setting up the trainings, though it also shows that even an incorrectly trained NN still has an AUC within ∼20-30% of the correctly trained model. The second test is to use the models trained on LO samples of $S/\sqrt{B} = 3, 5, 8$ to test the samples of $S/\sqrt{B} = 1, 2, \ldots, 10$ at a fixed $\Gamma_{\rm NP} \approx 10$ GeV. We again calculate the ratios of the "wrong" AUCs to the "correct" AUCs for different significances and show them in FIG. 20. We observe that the wrong models still yield reasonable results in the vicinity of the trained significance level; some deviation from the correct significance is acceptable if one is satisfied with a performance loss within ∼5%.
These two comparisons indicate that, when applying our analysis across the parameter space of the signal hypotheses, even a coarse set of CNNs covering the allowed parameter space will still perform reasonably for a model with a decay width or significance different from the ones used to train the set, allowing a reduction of computing resources at the cost of a small drop in performance.