Boosted $W/Z$ Tagging with Jet Charge and Deep Learning

We demonstrate that the classification of boosted, hadronically-decaying weak gauge bosons can be significantly improved over traditional cut-based and BDT-based methods using deep learning and the jet charge variable. We construct binary taggers for $W^+$ vs. $W^-$ and $Z$ vs. $W$ discrimination, as well as an overall ternary classifier for $W^+$/$W^-$/$Z$ discrimination. Besides a simple convolutional neural network (CNN), we also explore a composite of two CNNs, with different numbers of layers in the jet $p_{T}$ and jet charge channels. We find that this novel structure boosts the performance, particularly when the $Z$ boson is taken as the signal. The methods presented here can enhance the physics potential in SM measurements and searches for new physics that are sensitive to the electric charge of weak gauge bosons.


Introduction
Boosted heavy resonances play a central role in the study of physics at the Large Hadron Collider (LHC). These include both Standard Model (SM) particles such as $W$'s, $Z$'s, tops, and Higgses, as well as hypothetical new physics (NP) particles such as $Z'$'s. The decay products of a boosted heavy resonance are typically collimated into a single "fat jet" with nontrivial internal substructure. A vast amount of effort has been devoted to the important problem of "tagging" (i.e., identifying and classifying) boosted resonances through the understanding of jet substructure. (For recent reviews and original references, see e.g. [1-3].) Recently, there has been enormous interest in the application of modern deep learning techniques to boosted resonance tagging. By enabling the use of high-dimensional, low-level inputs (such as jet constituents), deep learning automates the process of feature engineering. Many works have demonstrated the enormous potential of deep learning to construct extremely powerful taggers, vastly improving on previous methods.
So far, most of the attention has focused on distinguishing various boosted resonances from the QCD background in a binary classification task. Less attention has been paid to multi-class classification, i.e., a tagger that would categorize jets into a multitude of possibilities. (Notable exceptions include refs. [23, 24].) In this work, we examine an important multi-class classification task: distinguishing $W^+$, $W^-$, and $Z$ bosons. A $W/Z$ classifier that can also recognize charge could have many interesting applications. For instance, one could use such a classifier to measure charge asymmetries and same-sign diboson production at the LHC. There are also many potential applications to NP scenarios, such as the reconstruction of doubly charged Higgs bosons from their like-sign diboson decays in models with an extended scalar sector.
Since we are interested in distinguishing $W^+$ and $W^-$ bosons from each other, a key element in our work is the jet charge observable $Q_\kappa$. It was first introduced in ref. [25], and its theoretical potential was discussed further in ref. [26]. This observable has also been measured at the LHC [27, 28]. When used in conjunction with other quantities, such as the invariant mass $M$, it can help distinguish hadronically decaying $W$ bosons from $Z$ bosons [29]. Moreover, its performance as a charge tagger was assessed in ref. [30] for jets produced in semileptonic $t\bar{t}$, $W$+jets, and dijet processes. Most recently, the authors of ref. [31] incorporated jet charge into various machine learning quark/gluon taggers, including BDTs, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). They showed that including jet charge among the input channels improved quark/gluon discrimination as well as up- vs. down-quark discrimination.
In our study of $W^+$/$W^-$/$Z$ tagging, we compare a number of techniques, from simple cut-based methods, to BDTs, to deep learning methods based on CNNs and jet images. As in ref. [31], we include jet charge as one of the input channels and examine the gain in performance from this additional input. We go beyond refs. [15, 29, 31] and construct a ternary classifier that can discriminate among $W^+$, $W^-$, and $Z$, depending on the physics process of interest. We study the overall performance of our ternary tagger as well as its specialization to binary classification. For the latter, we compare its performance to specifically trained binary classifiers and show that the ternary classifier reproduces their performance; in this sense it is optimal. Overall, we demonstrate that deep learning with jet charge offers a significant boost in performance, around ∼30-40% improvement in background rejection rate at fixed signal efficiency.
In addition to a simple CNN, we also develop a novel composite algorithm consisting of two CNNs, one for each of the $Q_\kappa$ and $p_T$ channels, combined in a merge layer, which we refer to as CNN$^2$. This allows us to separately optimize the hyperparameters of the CNNs for the two input channels. We show that this new CNN$^2$ architecture further boosts the performance for most combinations.
The rest of the paper is organized as follows. In Sec. 2 we describe the jet samples and jet images used in this study, and review the definition of the jet charge variable. In Sec. 3, we describe the different taggers studied in this paper. These include cut-based and BDT taggers used as baselines for comparison, as well as two different taggers based on convolutional neural networks. We show results for the binary $W^-$/$W^+$ classification problem in Sec. 4 and compare our performance with the recent work in ref. [31]. In Sec. 5, we discuss the $Z$/$W^+$ discrimination problem, focusing on the benefit from including jet charge, and compare our performance with the ATLAS boson tagger of ref. [29]. In Sec. 6, we extend our results to the full ternary $Z$/$W^+$/$W^-$ classification problem and comment on the reduction from a three-class tagger to a two-class one. In Sec. 7, we attempt to shed some light on what the deep neural networks have learned. We summarize our findings and conclude in Sec. 8.

Jet samples and inputs
For this study, we use MadGraph5 v2.6.1 [32] at leading order to simulate events at the 13 TeV LHC, including VBF production of doubly charged Higgses $H^{\pm\pm}$ for the charged samples. The jet sample sizes used for training and testing, after the selections in table 1, are summarized in table 2; the entries in each sum correspond to the ($W^+$, $W^-$, $Z$) samples, respectively. In the training stage, the training set is further divided into two subsets: 9/10 serves as the actual training set, and 1/10 as the validation set. All ROC and SIC curves shown in this paper are evaluated on the testing sets described here.
The jet mass is computed from the jet constituents as
$$ M^2 = \Big( \sum_{i \in J} E_i \Big)^2 - \Big| \sum_{i \in J} \vec{p}_i \Big|^2 , \qquad (1) $$
where the sum runs over all constituents $i$ inside the jet $J$ (i.e., all tracks and calorimeter towers in the jet) with 4-momentum $(E_i, \vec{p}_i)$ and $p_T > 500$ MeV. Figure 1 shows the reconstructed jet mass after all selections in table 1. The broadening of the mass distributions originates from a combination of showering, hadronization, jet clustering, and detector effects.
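As a concrete illustration (not our actual analysis code), the jet mass can be assembled directly from the constituent four-momenta in a few lines:

```python
import numpy as np

def jet_mass(four_momenta):
    """Invariant mass from M^2 = (sum E)^2 - |sum p|^2,
    given constituent four-momenta as rows of (E, px, py, pz)."""
    p = np.asarray(four_momenta, dtype=float)
    E, px, py, pz = p.sum(axis=0)
    m2 = E**2 - (px**2 + py**2 + pz**2)
    return np.sqrt(max(m2, 0.0))  # guard against tiny negative rounding

# Two massless constituents, each E = 40, back-to-back along x:
# M^2 = 80^2 - 0, so M = 80
print(jet_mass([(40, 40, 0, 0), (40, -40, 0, 0)]))  # → 80.0
```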
It has been known for some time [25] that a useful observable for distinguishing jets initiated by particles of different charges is the jet charge,
$$ Q_\kappa = \frac{1}{(p_T^J)^\kappa} \sum_{i \in J} q_i \, (p_T^i)^\kappa , \qquad (2) $$
where $q_i$ is the integer charge of jet constituent $i$ in units of the proton charge, $p_T^J$ is the transverse momentum of the jet, and $\kappa$ is a free parameter. The $Q_\kappa$ observable is computed in this $p_T$-weighted scheme to minimize mis-measurements from low-$p_T$ particles. Figure 2 shows the $Q_\kappa$ distributions for jets coming from the $W^+$, $W^-$, and $Z$ samples, for several values of $\kappa$. The choice of $\kappa$, together with the $p_T$ range of the vector bosons, affects the tagging performance.
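As an illustration of the definition in eq. (2), the jet charge can be computed from a list of constituent charges and transverse momenta. This is a minimal sketch in which the jet $p_T$ is approximated by the scalar sum of the constituent $p_T$'s:

```python
import numpy as np

def jet_charge(q, pt, kappa):
    """pT-weighted jet charge Q_kappa, cf. eq. (2).

    q     : integer constituent charges (units of the proton charge)
    pt    : constituent transverse momenta
    kappa : weighting exponent
    Here the jet pT is approximated by the scalar sum of constituent pT's;
    in the analysis it would come from the clustered jet itself.
    """
    q = np.asarray(q, dtype=float)
    pt = np.asarray(pt, dtype=float)
    pt_jet = pt.sum()
    return np.sum(q * pt**kappa) / pt_jet**kappa

# A hard positive track dominating a soft negative one gives Q_kappa > 0
print(jet_charge(q=[+1, -1], pt=[100.0, 10.0], kappa=0.3))
```

Note how smaller $\kappa$ flattens the $p_T$ weighting, so soft particles contribute relatively more to the sum.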

Jet images
As described in the introduction, the deep learning based taggers studied in this work are built on jet images and convolutional neural networks. Our jet images are made from jets reconstructed in a $\Delta\eta = \Delta\phi = 1.6$ box, with 75 × 75 pixels. This choice yields a resolution consistent with that of the CMS ECAL (see table 3). The input variables, or channels, are $Q_\kappa$ and $p_T$ per pixel; accordingly, the sum in the $Q_\kappa$ definition in eq. (2) now runs over all jet constituents in each pixel.
To improve the performance of our taggers, we preprocess each image, following a procedure similar to that of ref. [16]: centralization, rotation, and flipping. In figures 3 and 4, we use $\phi$ and $\eta$ to denote the new coordinate system of the images after preprocessing. Note that we do not normalize the images during preprocessing, though normalization is another common preprocessing operation. Instead, we introduce a BatchNormalization layer [38] and observe comparable (or even better) performance.
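A simplified sketch of the preprocessing chain (integer-pixel centering and deterministic flips only; the rotation step of ref. [16] is omitted here):

```python
import numpy as np

def preprocess(image):
    """Simplified jet-image preprocessing sketch: shift the pT-weighted
    pixel centroid to the central pixel (integer shifts), then flip so
    that most of the intensity lies in a fixed half of the image."""
    img = np.asarray(image, dtype=float)
    n_eta, n_phi = img.shape
    total = img.sum()
    # pT-weighted centroid in pixel coordinates
    eta_idx, phi_idx = np.indices(img.shape)
    c_eta = int(round((eta_idx * img).sum() / total))
    c_phi = int(round((phi_idx * img).sum() / total))
    # move the centroid to the central pixel
    img = np.roll(img, n_eta // 2 - c_eta, axis=0)
    img = np.roll(img, n_phi // 2 - c_phi, axis=1)
    # flip so the brighter half always lies on the same side
    if img[: n_eta // 2].sum() < img[n_eta // 2 :].sum():
        img = np.flipud(img)
    if img[:, : n_phi // 2].sum() < img[:, n_phi // 2 :].sum():
        img = np.fliplr(img)
    return img
```

These transformations remove trivial translational and reflection degrees of freedom, so the network does not have to learn them from data.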
Figure 3 shows the average of the jet images in the $p_T$ channel after preprocessing. The left plot shows the $W^+$ average jet image, and the right plot the difference between the $Z$ and $W^+$ average jet images. The average $Z$ jet image is wider in the $\eta$-$\phi$ plane than the average $W$ jet image, as expected from the difference in their invariant masses.
Figure 4 shows the average of the jet images in the $Q_\kappa$ channel, for a fixed value $\kappa = 0.15$, after preprocessing. On average, the jet charge images are consistent with what we expect from the $W^+$, $W^-$, and $Z$-boson charges. In particular, the average $Z$-boson jet charge image is very close to zero, as the charges of the constituents tend to cancel out on average.

Methods
In the following, we investigate several classification tasks: an overall ternary problem and two binary ones ($W^-$ vs. $W^+$ and $Z$ vs. $W^+$). Throughout this paper, unless otherwise specified, we treat $W^+$ as the "background" in all binary classification tasks.
Here we describe the various taggers used in our work: baseline cut-based and BDT taggers that take high-level inputs ($M$, $Q_\kappa$), as well as taggers based on CNNs, which are trained on jet images formed from lower-level inputs.

Cut-based and BDT taggers
We first construct a cut-based tagger. We identify the optimal values of $\kappa$ for the different binary discrimination tasks, $W^-$ vs. $W^+$ and $Z$ vs. $W^+$. For the latter, we scan all possible simple 2D rectangular cuts in the jet charge and jet mass plane to find the optimal cut-based discriminator.
We also study a "single-$\kappa$ BDT," built out of $Q_\kappa$ with $\kappa$ held fixed, together with the jet mass $M$. Both observables are fed into a gradient BDT implemented with the sklearn [39] package, using the default parameters. We can also combine models of different $\kappa$ values to form another BDT tagger, dubbed the "multi-$\kappa$ BDT," similar to the "multi-$\kappa$" jet tagger constructed in ref. [31]. For this multi-$\kappa$ BDT, $M$ and $Q_\kappa$ with $\kappa = 0.2, 0.3, 0.4$ are the inputs. We find that the single-$\kappa$ BDT, at the optimal $\kappa$ value, generally has a performance comparable to the multi-$\kappa$ BDT. In the following sections, the single- and multi-$\kappa$ BDTs serve as benchmark models against which to compare our deep neural networks.
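A minimal sketch of such a single-$\kappa$ BDT with sklearn, using synthetic Gaussian stand-ins for the two observables rather than our simulated jet samples:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000

# Synthetic (M, Q_kappa) features loosely mimicking W+ (class 0) and Z
# (class 1); the widths and means here are illustrative only.
m_w, q_w = rng.normal(80, 8, n), rng.normal(+0.2, 0.15, n)
m_z, q_z = rng.normal(91, 8, n), rng.normal(0.0, 0.15, n)
X = np.column_stack([np.concatenate([m_w, m_z]),
                     np.concatenate([q_w, q_z])])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Gradient BDT with default parameters, as in the single-kappa baseline
clf = GradientBoostingClassifier().fit(X[::2], y[::2])  # even rows: train
acc = clf.score(X[1::2], y[1::2])                       # odd rows: test
print(round(acc, 3))
```

Swapping in $Q_\kappa$ columns at several $\kappa$ values would turn the same setup into the multi-$\kappa$ variant.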
We use these BDT taggers in both binary and ternary classifications. The prediction of the ternary single-$\kappa$ BDT classifier on the testing set is visualized in figure 5. The three blobs (red, green, and blue) correspond to the three classes of jets ($Z$, $W^-$, $W^+$). The single-$\kappa$ BDT nicely separates and distributes the jets as intuitively expected: to mark out the border among the three classes, the best choice is the Y-shaped cut in the two-dimensional plane.

CNN-based taggers
We now describe the architectures of the two deep CNN models based on jet images developed in this study. We feed the two-color jet images ($p_T$, $Q_\kappa$), as described in Sec. 2.2, into the CNNs. We use the Keras library with the TensorFlow backend for the implementation of the networks. Throughout the paper, the "CNN" label describes a network composed of 3 convolutional layers followed by 2 fully connected layers. Padding is activated to enable the network to go deeper. To prevent overfitting, regularizers are used as well as dropout layers. For training, we use Adam [40] as the optimizer. In figure 6, we show the architecture of the CNN, with detailed model configuration parameters summarized in table 3.
The second deep neural network we consider is a composite design consisting of two CNNs with asymmetrical depths, which we call "CNN$^2$." The machine learning in the $p_T$ and $Q_\kappa$ channels is done in parallel, i.e., we feed the $p_T$ images and the $Q_\kappa$ images into two separate CNNs that differ in structure. These are combined at the end in a merge layer.
The CNN$^2$ architecture is motivated by our finding that the depth of the CNN network is limited by the classification task between $W^+$ and $W^-$. If the network is made too deep, the $W^+$/$W^-$ tagger tends to overfit. This makes sense, since the $W^+$ and $W^-$ jet images are identical except in the sign of the $Q_\kappa$ channel; fewer convolutional layers are therefore needed to capture the difference between the two. On the other hand, a deeper network structure does help considerably in identifying the $Z$ boson as the signal. The $Z$ samples differ from the other two in the spatial distribution (substructure) of their constituents, so enhancing the resolving power of the CNN's pattern recognition naturally improves the $Z$ discrimination. There is thus a trade-off in the ternary classification problem, and the CNN$^2$ architecture is an attempt to have our cake and eat it too.
After investigating the model structure and the performance trends in the different classification problems, we have chosen the CNN$^2$ architecture detailed in figure 7 and table 3. The branch dealing with the $Q_\kappa$ images is the shallower one, while, based on the observed performance, we stack up to 8 convolutional layers in the other branch, which processes the $p_T$ images.
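Schematically, the two-branch design can be expressed with the Keras functional API as follows. This is a much-reduced sketch: the image size, filter counts, and layer numbers are placeholders, not the configuration of table 3:

```python
import numpy as np
from tensorflow.keras import layers, Model

def branch(x, n_conv):
    """A stack of n_conv convolutional layers followed by global pooling."""
    for _ in range(n_conv):
        x = layers.Conv2D(8, 3, padding="same", activation="relu")(x)
    return layers.GlobalAveragePooling2D()(x)

pt_in = layers.Input(shape=(25, 25, 1))   # pT image channel
qk_in = layers.Input(shape=(25, 25, 1))   # Q_kappa image channel

deep = branch(pt_in, n_conv=4)     # deeper branch: spatial substructure
shallow = branch(qk_in, n_conv=1)  # shallower branch: charge-sign information

merged = layers.Concatenate()([deep, shallow])
out = layers.Dense(3, activation="softmax")(merged)  # (W+, W-, Z) probabilities

model = Model(inputs=[pt_in, qk_in], outputs=out)
probs = model.predict(
    [np.zeros((1, 25, 25, 1)), np.zeros((1, 25, 25, 1))], verbose=0)
print(probs.shape)  # (1, 3)
```

The key design point is that each branch's depth can be tuned independently before the merge, which is what allows a shallow charge branch alongside a deep $p_T$ branch.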

$W^-$/$W^+$ binary classification
We begin with a study of $W^-$/$W^+$ classification. Since the signal and background differ only in their charges, the only useful quantity here is the jet charge $Q_\kappa$. An important aspect of the jet charge variable defined in eq. (2) is that it depends on a parameter $\kappa$, which specifies how the contributing charges are $p_T$-weighted. Nothing a priori tells us which value of $\kappa$ to use, and different tasks may prefer different values. In the following subsection we examine this issue for our various taggers.

Determining κ
In figure 8, we show the trend in performance when varying $\kappa$ for the $W^-$ vs. $W^+$ classification. We show three typical metrics for evaluating an algorithm's performance: the area under the ROC curve (AUC), the best accuracy (ACC), and the background rejection at a 50% signal efficiency working point ($1/\epsilon_b|_{\epsilon_s = 50\%}$, denoted by R50). The deep learning taggers as well as the cut-based reference tagger are plotted together for comparison. Note in particular that the performance of the single-$\kappa$ BDT, not shown in these plots, would trivially reduce to that of the cut-based tagger in the $W^-$/$W^+$ binary classification task, where the $M$ information provides no additional discriminative power. We find in figure 8 that the performance does depend on the choice of $\kappa$. It is also interesting that the deep learning taggers have a qualitatively different $\kappa$ dependence from the traditional cut-based tagger, while there is little difference between the two CNN models: the former are always better than the latter and have a smaller optimal $\kappa$.
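The three metrics can be computed directly from tagger scores and labels; a small sketch with sklearn on toy Gaussian score distributions (illustrative numbers, not our tagger outputs):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(+1, 1, 5000),   # signal scores
                         rng.normal(-1, 1, 5000)])  # background scores
labels = np.concatenate([np.ones(5000), np.zeros(5000)])

# AUC: area under the ROC curve
auc = roc_auc_score(labels, scores)
fpr, tpr, thr = roc_curve(labels, scores)

# ACC: best accuracy over all thresholds (balanced samples)
acc = np.max(0.5 * (tpr + (1 - fpr)))

# R50: background rejection 1/eps_b at the 50% signal-efficiency point
eps_b = np.interp(0.5, tpr, fpr)
r50 = 1.0 / eps_b
print(round(auc, 3), round(acc, 3), round(r50, 1))
```

The interpolation step works because `roc_curve` returns the true-positive rate in increasing order, so the 50%-efficiency working point can be read off directly.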
Based on these results, we fix the optimal $\kappa$ in each of the tagger definitions for the following sections: $\kappa = 0.3$ for the single-$\kappa$ BDT reference tagger, and $\kappa = 0.15$ for the CNN taggers.

Comparison of taggers
Having fixed the value of $\kappa$ in each tagger, we are now ready to compare the various tagging methods. Table 4 shows the performance metrics described above (AUC, ACC, and R50). In figure 9, the left plot shows the receiver operating characteristic (ROC) curves and the right plot the significance improvement characteristic (SIC) curves. While all the ROC curves appear close to one another, the SIC curves clearly reveal the benefit of deep learning: the improvement in background rejection rate is around 30-40% across a wide range of signal efficiencies. It is also worth noting that for the $W^-$/$W^+$ task there is no particular gain in using the CNN$^2$ structure; in fact, its performance is slightly worse. Although differing in the details, it is instructive to compare with the previous work on jet charge and deep learning [31], which focuses on up/down quark jet discrimination. Our performance gain from BDT to deep learning turns out to be quite comparable. For their 1000-GeV benchmark scenario, their CNN's background rejection rate at 50% signal efficiency grows by about 40% relative to their "$\kappa$ and $\lambda$ BDT" reference tagger, as given in table 1 of ref. [31]. Our result shows the same amount of enhancement, as seen in table 4.
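The SIC curve is a simple transform of the ROC curve, plotting the significance improvement $\epsilon_s/\sqrt{\epsilon_b}$ against the signal efficiency $\epsilon_s$. A toy sketch showing how a uniform 30% reduction in background efficiency translates into a constant SIC gain (the ROC shapes here are made up for illustration):

```python
import numpy as np

# Toy ROC curves: eps_b as a function of eps_s for two hypothetical taggers
eps_s = np.linspace(0.05, 0.95, 19)
eps_b_base = eps_s**2          # baseline tagger
eps_b_deep = 0.7 * eps_s**2    # 30% lower background at every eps_s

sic_base = eps_s / np.sqrt(eps_b_base)
sic_deep = eps_s / np.sqrt(eps_b_deep)

# A uniform 30% cut in eps_b lifts the SIC by 1/sqrt(0.7) ~ 1.195 everywhere
print(round(float(np.max(sic_deep / sic_base)), 3))  # → 1.195
```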

$Z$/$W^+$ binary classification
Next we turn to the $Z$ vs. $W^+$ binary classification task. ($Z$ vs. $W^-$ would obviously give the same result because of charge symmetry.) Here we are primarily interested in how much the jet charge observable adds to the discriminating power, compared to just the information computable from the four-vectors (e.g., the jet mass).

Determining κ
The performance of the various taggers as a function of $\kappa$ is shown in figure 10. Unlike the $W^-$/$W^+$ classification, the dependence on $\kappa$ is very mild here. We therefore keep using the same optimal $\kappa$ values as before for simplicity.

Comparison of taggers
Figure 11 shows the ROC curves (left plot) and the SIC curves (right plot) of the different taggers in the binary task of distinguishing $Z$ from $W^+$, and table 5 gives the three performance metrics. Compared to the $W^-$/$W^+$ classification task, the benefit from deep learning is much greater here, with the background rejection rate at 50% signal efficiency improved by a factor of as much as ∼2.85. This is perhaps unsurprising, since our cut-based and BDT methods do not include any jet substructure variables. As noted in figure 3, in the $W^-$/$W^+$ classification task the samples have identical average distributions in the $p_T$ images and differ only in pixel intensity in the $Q_\kappa$ channel. The $Z$ and $W^+$ events, however, have distinct spatial distributions of the jet constituents, in addition to the charge difference. Since the CNN naturally learns spatial differences, it shows a big improvement over methods that do not include any substructure information.
We also note that for the $Z$/$W^+$ task, the CNN$^2$ tagger, with its architecture tailored for better $Z$ identification, outperforms the CNN tagger by a sizable amount.
We now scrutinize the role of jet charge in $Z$ vs. $W^+$ discrimination. Figure 12 shows the performance of taggers with either one or two input channels: we extend the inputs from $M$ to ($M$, $Q_\kappa$) in the BDT case, and from $p_T$ to ($p_T$, $Q_\kappa$) for the CNNs. Comparing the solid curves to the dotted ones, we see that all three taggers improve significantly after adding the $Q_\kappa$ information.
The role of jet charge in boosted $Z$ vs. $W^+$ discrimination was previously studied by ATLAS [29]. However, the ATLAS study differs from ours in its details: it focuses on a different signal process, $W' \to WZ$; its jet samples are defined differently; and it constructs a likelihood tagger with $M$ and $Q_\kappa$ as the inputs. Nevertheless, it is instructive to compare our results to theirs. Across a wide range of working points, both the ATLAS tagger and our CNN tagger attain an additional 30% enhancement in the background rejection rate after incorporating the $Q_\kappa$ information. In the high signal efficiency region, our taggers seem to enjoy a larger gain from introducing the $Q_\kappa$ channel.

Ternary $W^+$/$W^-$/$Z$ classification

Finally, we turn to the ultimate task of a full classification of boosted weak gauge bosons ($W^+$/$W^-$/$Z$). We first quantify how well the full ternary classification performs, and then see how it reduces to the binary classifications described in the previous sections.

Performance of the ternary classification
For the CNN taggers, we continue to use the same values of $\kappa$ as before, as they are also optimal for the ternary classifier. We summarize and compare the performance of the ternary taggers according to two metrics: the overall accuracy, defined as (number of correct predictions)/(total number of instances), and a "one-against-all" metric that binarizes the task, i.e., singles out one class as the "signal" and treats all the others as the "background." Figures 13 and 14 plot the ROC and SIC curves obtained by treating $W^-$ and $Z$ as the signal, respectively. The corresponding metrics are given in table 6. Our results show that the CNN$^2$ generally performs better than the CNN. Thanks to its parallel structure, the network has sufficient depth for a sizable improvement in the $Z$-signal performance, while having comparable or better performance than the CNN in the $W^-$ (or $W^+$) discrimination.

Comparison with binary taggers
We expect that a multi-class classifier should be able to fully recover the binary classification performance after an appropriate projection. If the multi-class NN output approximates the class probability $P_i(x)$, where $x$ is a data point and $i = 1, \ldots, N$ is the class label, then the projection to binary classification between classes $i$ and $j$ is simply
$$ P_{ij}(x) = \frac{P_i(x)}{P_i(x) + P_j(x)} . \qquad (3) $$
In figure 15, we plot the results of this projection as solid curves. The dashed curves are the binary BDT and CNN results reproduced from figures 9 and 11 for comparison. In the left plot, the solid and dashed curves lie almost on top of each other. For the $Z$/$W^+$ projection, however, we observe that the ternary CNN outperforms the binary CNN after the projection in the low signal efficiency region.
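In code, this projection, eq. (3), is a one-liner applied to the multi-class output probabilities (sketched here with made-up ternary outputs):

```python
import numpy as np

def project_binary(probs, i, j):
    """Project multi-class probabilities onto classes i vs j, cf. eq. (3)."""
    probs = np.asarray(probs, dtype=float)
    return probs[:, i] / (probs[:, i] + probs[:, j])

# Hypothetical ternary outputs for three jets, classes ordered (W+, W-, Z)
p = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.1, 0.8]])
print(project_binary(p, i=1, j=0))  # W- score in the W-/W+ subtask
```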
What did the machine learn?
In this section we attempt to shed some light on how our CNN and CNN$^2$ taggers learn to classify $W^+$/$W^-$/$Z$ bosons, and on the differences between them. Although a complete understanding is not possible (they are still very much "black boxes"), we will find that, with the help of some visualization techniques, we can better understand what the machine has learned.

Figure 15: SIC curves of the ternary classification for $W^-$/$W^+$ discrimination (left) and for $Z$/$W^+$ discrimination (right), when projected to binary according to eq. (3). The dashed curves are the binary classifications, and the solid curves the projected ternary results.

Saliency maps
Here we use "saliency maps" [41] to compare the CNN and CNN$^2$ networks, in an attempt to understand why the latter outperforms the former. We use the Keras-vis toolkit [42] to compute the saliency maps. The class saliency is extracted by computing the pixel-wise derivative of the class probability $P_i(x)$ appearing in eq. (3),
$$ S_i(x) = \frac{\partial P_i(x)}{\partial x} , \qquad (4) $$
where the gradient is obtained by back-propagation. By mapping the gradients (4) across an image, we can identify the regions of the image on which the decision of the CNN (to be class $i$ or not) depends most sensitively. The saliency maps for nine $W^-$ jet images from the test sample are shown in figures 16 (the $p_T$ channel) and 17 (the $Q_\kappa$ channel) for the CNN and CNN$^2$ networks. The color in each pixel indicates the magnitude of the gradient.
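The object being computed in eq. (4) can be illustrated without a trained network: here a numerical finite-difference gradient of a toy pixel score stands in for the back-propagated gradient of the class probability (purely illustrative):

```python
import numpy as np

def saliency(score_fn, image, eps=1e-5):
    """Pixel-wise derivative of score_fn at `image`, cf. eq. (4).
    A real saliency map obtains this gradient by back-propagation;
    finite differences compute the same object, just more slowly."""
    grad = np.zeros_like(image)
    base = score_fn(image)
    for idx in np.ndindex(image.shape):
        bumped = image.copy()
        bumped[idx] += eps
        grad[idx] = (score_fn(bumped) - base) / eps
    return grad

# Toy "class score": a weighted sum of pixels, so the saliency map
# recovers exactly the weight map w
w = np.arange(9.0).reshape(3, 3)
score = lambda img: float((w * img).sum())
print(saliency(score, np.zeros((3, 3))))
```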
The difference in the saliency maps between the CNN and CNN$^2$ is very striking: the attention of the CNN$^2$ is generally concentrated in much smaller regions than that of the CNN. Evidently, the resolving power of the CNN$^2$ is much better. Given that the CNN$^2$ network goes much deeper than the CNN, this is perhaps expected: a deeper network structure with more convolutional layers should be more capable of capturing subtle features in the training data. Altogether, this could explain why the CNN$^2$ mostly outperforms the CNN.

Phase transition in CNN's learning
Another interesting difference between our networks lies in their learning curves. We observe that the learning curve of the CNN always shows a sudden jump in performance. This phase transition in learning arises because the CNN tends to first learn the characteristics of the $Z$ sample, and only later those of the $W^+$ (or $W^-$) sample. To investigate this behavior, we monitor the network performance at intermediate stages: during training, we set a checkpoint every three epochs, record the network's weights, and analyze how the discrimination ability evolves. At each checkpoint, we evaluate the network performance using one-against-all metrics on the testing jet samples. The results for the ACC metric are shown in figure 18. We see that the CNN develops its $Z$-discrimination ability at an early stage, but does not learn to discriminate between $W^-$ and $W^+$ until after the 24th epoch, around which a phase transition is seen in the learning curves.
Even though the CNN$^2$ shows a more "steady" learning process, the possibility of a phase transition cannot be ruled out. It is possible that the CNN$^2$ learns so fast that the performance in all classes saturates within one epoch, so that the phase transition is simply not manifest.

Conclusions
In this work, we apply modern deep learning techniques to build better taggers for boosted, hadronically-decaying weak gauge bosons. Throughout this paper we demonstrate and provide results for boosted weak bosons with $p_T$ ∼ 400 GeV. (We have also studied the scenario with even more boosted $W$ and $Z$ bosons, with $p_T$ ∼ 1 TeV, and found results similar to those presented here.) Going beyond previous works, we incorporate jet charge information in order to discriminate between positively and negatively charged $W$ bosons, and between $W$ and $Z$ bosons. We study all possible binary classification tasks as well as the full ternary classification problem. Taking BDT and cut-based taggers as our baselines for comparison, we construct a simple CNN tagger that takes jet images as the input, and show that it leads to significant gains in classification accuracy and background rejection.
In addition to the simple CNN tagger, we also construct a novel CNN structure consisting of two parallel CNNs, which we call CNN$^2$. The key feature of this structure is that it assigns different network depths to the $p_T$ and $Q_\kappa$ channels. This further improves the performance in nearly all the classification tasks.
We see various ways in which our work could be extended and improved. First, traditional CNNs may have drawbacks due to the fact that the receptive field of every neuron, i.e., the field of view that one unit can perceive, is fixed by the assigned kernel sizes and the depth of the network, whereas the complexity of the patterns or features to be detected in realistic problems generally differs. As the $W^-$/$W^+$/$Z$ discrimination problem in our analysis shows, the complexity/depth needed for dealing with $W^-$/$W^+$ and $W/Z$ is different. To optimize the performance as well as the computational cost, a "ResNet" architecture with "skip connections" [19, 43, 44] may be a desirable solution. It would be interesting to study this further, along with other architectures and jet representations such as point clouds and sequences.
Another direction in which it may be possible to improve on this work is to devise an architecture that takes the $p_T$ and charge information as totally separate channels, such that the network could learn the ideal combination of the two for measuring the charge of the boosted heavy resonance. The fact that our tagger performance depends on the value of $\kappa$ is a symptom that it is not yet truly learning the optimal combination of $p_T$ and charge information.
Comparing the performance gains due to deep learning seen in this work with those of recent related works in the literature [29, 31], we have found similar improvements over more conventional methods. We caution that these are merely rough comparisons, as the nature and details of those classification problems differ from ours. Nevertheless, this gives further evidence for the enormous potential of deep learning in the study of jet substructure and boosted resonance tagging.

Figure 1 :
Figure 1: Reconstructed jet mass of the $W$ and $Z$ samples.

Figure 2 :
Figure 2: $Q_\kappa$ distributions for the three samples under study. Representative $\kappa$ values are shown.

Figure 3 :
Figure 3: The left plot shows the average of $W^+$ jet images in the $p_T$ channel using the preprocessed testing set sample. The right plot shows the difference between the $Z$ and $W^+$ average jet images in the $p_T$ channel.

Figure 4 :
Figure 4: Average of jet images in the $Q_\kappa$ channel, with $\kappa = 0.15$, for the three jet samples in our testing sets, after preprocessing.

Figure 5 :
Figure 5: The left plot shows the true distributions for $W^+$/$W^-$/$Z$ in the ($Q_\kappa$, $M$) plane (for $\kappa = 0.3$). The T-shaped lines in the left plot mark the decision boundaries of the cut-based tagger. The right plot shows the output prediction of the ternary BDT classifier in the ($Q_\kappa$, $M$) plane, whose Y-shaped color boundaries match our intuition for the optimal border.

Figure 8 :
Figure 8: Summary of the performance as a function of $\kappa$ in the range [0.1, 0.6] for the binary classification task discriminating $W^-$ from $W^+$, for all taggers. The three metrics are the AUC (left), accuracy (middle), and background rejection (right).

Figure 9 :
Figure 9: ROC (left) and SIC (right) curves for the binary classification discriminating $W^-$ from $W^+$, for all taggers except the single-$\kappa$ BDT.

Figure 10 :
Figure 10: Same as figure 8, but for the binary classification task discriminating $Z$ from $W^+$, for all taggers.

Figure 12 :
Figure 12: ROC (left) and SIC (right) curves for the $Z$/$W^+$ binary classification using taggers with different numbers of input channels. The dotted lines use one channel only, either $M$ (for the reference cut-based and BDT taggers) or $p_T$ (for the CNN taggers), and the solid lines use two channels ($M$ + $Q_\kappa$ or $p_T$ + $Q_\kappa$), corresponding to the results in the previous section. The single-$\kappa$ BDT tagger is compared with the cut-based tagger because the BDT with only one input channel reduces to the cut-based tagger.

Figure 13 :
Figure 13: ROC (left) and SIC (right) curves for the ternary classification discriminating $W^-$ from ($W^+$, $Z$), for all the taggers.

Figure 14 :
Figure 14: ROC (left) and SIC (right) curves for the ternary classification discriminating $Z$ from the $W$'s, for all the taggers.

Figure 16 :
Figure 16: Saliency maps for the $p_T$ channel of the CNN (upper plots) and CNN$^2$ (lower plots) networks on nine $W^-$ jet images from the test sample (for which both networks give correct output predictions).

Figure 17 :
Figure 17: Same as figure 16, but for the $Q_\kappa$ channel.

Figure 18 :
Figure 18: Accuracy of a one-against-all metric, ACC, at different callback points for the CNN (left) and CNN$^2$ (right). A phase transition in ACC during the CNN training stage occurs around the 25th epoch.

Figure 19 :
Figure 19: SIC curves for the binary classifications discriminating $W^-$ from $W^+$ (left) and $Z$ from $W^+$ (right) using the CNN and CNN$^2$ taggers.

Figure 20 :
Figure 20: SIC curves for the ternary classifications discriminating $W^-$ from the rest (left) and $Z$ from the rest (right) using the CNN and CNN$^2$ taggers.

Table 2 :
Summary of the jet sample sizes used for training and testing, after the selections in table 1.

Table 3 :
Summary of the configurations of our CNN taggers.

Table 4 :
Performance metrics for all taggers, except the single-$\kappa$ BDT, in the $W^-$/$W^+$ binary classification task.

Table 5 :
Performance metrics for all taggers in the binary $Z$/$W^+$ classification task.
Figure 11: ROC (left) and SIC (right) curves for the binary classification discriminating $Z$ from $W^+$, for all taggers.

Table 6 :
Performance metrics for all taggers in the ternary classification task.

Table 7 :
Summary of the Herwig-jet sample sizes used for training and testing, after the selections in table 1. The entries in the sums correspond to the ($W^+$, $W^-$, $Z$) samples, respectively. In the training stage, the training set is also divided into two subsets as described in table 2.

The performance metrics are provided in tables 8 and 9. The results are slightly worse than those from the Pythia-jet dataset, on which all the taggers in this work are optimized. Nevertheless, the tagger performance indicated by the SIC curves roughly agrees with what is observed in the Pythia-jet dataset. This suggests that the tagging abilities of our CNN and CNN$^2$ taggers are largely independent of the showering and hadronization models employed in the analysis.

Table 8 :
Performance metrics for the CNN and CNN$^2$ taggers on the Herwig-jet sample in the $W^-$/$W^+$ and $Z$/$W^+$ binary classification tasks.

Table 9 :
Performance metrics for the CNN and CNN$^2$ taggers on the Herwig-jet sample in the ternary classification tasks.