Morphology for Jet Classification

We introduce a jet tagger based on a neural network analyzing the Minkowski Functionals (MFs) of pixelated jet images. The MFs are geometric measures of binary images, and they can be regarded as a generalization of the particle multiplicity, which is an important quantity in jet tagging. Their changes under dilation encode the geometric structures of the jet constituents that appear at various angular scales. We explicitly show that this analysis using the MFs together with mathematical morphology can be considered a constrained convolutional neural network (CNN). Conversely, a CNN can model the MFs in a certain limit, and we show their correlation in the example of tagging semi-visible jets emerging from the strong interaction of a hidden valley scenario. The MFs are independent of the IRC-safe observables commonly used in jet physics. We combine this morphological analysis with an IRC-safe relation network which models two-point energy correlations. While the resulting network uses constrained input parameters, it shows dark jet and top jet tagging performances comparable to those of the CNN. The architecture has significant computational advantages when the available data is limited. We show that its tagging performance is much better than that of the CNN when the number of training samples is small. We also qualitatively discuss their parton-shower model dependency. The results suggest that the MFs can be an efficient parameterization of the IRC-unsafe feature space of jets.

Jet classification relies on the substructure of jets from boosted massive particles [1,7-12]. The quantification of those features may be performed with jet shape variables, such as N-subjettiness [13] or energy correlation functions [14]. In particular, these variables are often described by a set of n-point energy correlators [15,16], which form a basis of jet substructure variables satisfying infrared and collinear (IRC) safety conditions.
On the other hand, counting variables, such as the number of charged tracks [17], are yet another type of discriminative variable in jet tagging, but there is some subtlety in predicting them from QCD because they are not IRC safe. Those IRC-unsafe features are often empirically modeled in event simulations, and the predicted distributions often deviate sizably from the experimental data. We have to use them carefully so that classification models are not biased toward a particular simulator.
Meanwhile, this feature engineering may be replaced with deep learning. For example, convolution-based networks [3,6] using (pixelated) particle distributions and recurrent neural networks [4,5] using predefined sequences of particles are known for their good jet tagging performance [18].
Those networks can represent a wide variety of functions, and they cover the high-dimensional phase space of inputs. However, some phase space of the training sample may be underrepresented by a finite number of samples, and the jet taggers based on them require high-quality samples to get the best performance. Because of that, it is often necessary to use dimensionality reductions, such as introducing bottlenecks in the middle of their architecture. But those reduction techniques may not respect the physical constraints of the system, and explaining the outputs in domain-specific languages is less straightforward. Intensive post-analysis is often required in order to get an insight from the trained networks.
In this regard, starting from physics-inspired inputs and network architectures [19][20][21][22][23] has advantages over a general functional model trained on primitive inputs in controllability and interpretability. For example, the energy flow network (EFN) [19,24] and the relation network (RN) [20,21,25,26] are known for their good tagging performance under IRC-safe constraints [18,21]. If those constrained models cover all the relevant features for solving the given problem, the models will have performance equal to that of the general-purpose models [20,21]. So far, the networks covering IRC-safe variables are well studied, but constrained models for IRC-unsafe variables are not available yet. We need architectures bridging between general models and IRC-unsafe variables.
Although deep learning models that systematically cover those IRC-unsafe variables are not available, there are several frameworks based on multiplicities in coarse-graining [27], dilation and Minkowski functionals [21], and Delaunay triangulation and its topology [28]. In this paper, we thoroughly reintroduce the approach in [21] in terms of mathematical morphology and integral geometry, build a constrained model for the IRC-unsafe variables, and show its analytic representation in the large network width limit. This paper is organized as follows. In Sec. II, we introduce the morphological analysis of jet images using Minkowski functionals (MFs), which are a generalization of counting variables via their abstract algebraic features. We point out that the MFs can be represented by a chain of convolutions of the jet images and 2 × 2 filters, and therefore, convolutional neural networks (CNNs) can utilize them.
Section III reviews the two IRC-safe energy correlator-based networks, which may provide complementary information to the MFs. In the case of jet image analysis, we show that the RN simplifies to a multilayer perceptron (MLP) taking a two-point energy correlation S_2(R), which is an energy-weighted count of pairs of jet constituents at a given angular scale. On the other hand, the EFN is an MLP taking the jet image itself, where the jet image is an energy flow with a finite angular resolution.
In Sec. IV, we introduce a modular architecture combining the morphological analysis and the RN (or EFN). We simply combine the outputs of each network using another MLP to get the final outputs. We then compare the RN (or EFN) augmented with the morphological analysis against the baseline CNN.
Section V is devoted to comparing the jet tagging performance of the combined setup using the RN or EFN with that of the CNN. We consider two benchmark scenarios: tagging semi-visible jets [29] and top jets. Using the semi-visible jet tagging example, we show that the CNN can learn the distinctive features of the MFs when the difference in the MF distributions between the signal and background is significant. Besides, our combined architectures and the CNN augmented by the MFs show better performance than the baseline CNN. This apparently contradicts the observation that the CNN can represent the MFs; the performance differences may originate from finite network size effects and regularization.
Section VI discusses the computational advantages of our constrained architecture compared to those of the CNN. We show that the constrained architecture has better generalization performance when the number of training samples is small. We also point out that our setup is faster and more memory-efficient because of its lower computational complexity. In short, the MFs can efficiently represent IRC-unsafe information about the jet constituents.
Existing event simulation tools such as Pythia [30] and Herwig [31,32] predict different soft particle distributions. Therefore, special care is needed to estimate the classification performance using simulated datasets. Section VII shows the generator dependence of jet constituent distributions in terms of MFs and describes the connection of qualitative features to the shower algorithms.

II. GENERALIZATION OF COUNTING VARIABLES IN JET PHYSICS
In order to generalize the counting variables, such as particle multiplicities, we need to introduce the mathematical concept called a valuation. The particle multiplicity, which is essentially the number of elements in a set, has the following characteristic property for the union and intersection of two sets of particles, A and B:
$$N(A \cup B) + N(A \cap B) = N(A) + N(B).$$
This abstract mathematical feature is called a valuation in measure theory. For example, the area of a region is a valuation. It is worth exploring the space of valuations to generalize the counting variables, and Minkowski functionals and Hadwiger's theorem are the important tools for that.
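As a small illustration (ours, not from the paper), the number of elements of a set, the prototype of a particle multiplicity, satisfies the valuation property for unions and intersections:

```python
def is_valuation(measure, A, B):
    """Check the valuation (inclusion-exclusion) identity:
    measure(A | B) + measure(A & B) == measure(A) + measure(B)."""
    return measure(A | B) + measure(A & B) == measure(A) + measure(B)

# Two toy "sets of particles", labeled by integer indices.
A = {0, 1, 2, 3}
B = {2, 3, 4}

# The particle multiplicity (set cardinality) is a valuation.
assert is_valuation(len, A, B)
```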

A. Minkowski Functionals and Hadwiger's theorem
The MFs of the jet constituents are the key characteristics for analyzing the space of valuations of jet substructures. Since we are going to analyze jet images on the pseudorapidity-polar coordinate plane, we will focus on discussing the MFs for two-dimensional Euclidean space R 2 . We also denote the coordinate vector as R = (η, φ).
For a closed and bounded set S in R 2 , there are the three MFs: area A, boundary length L, and Euler characteristic χ. They can be expressed as the integral of local features of S as follows, where κ is the curvature of the boundary ∂S. The integral representation of the Euler characteristic is the Gauss-Bonnet theorem. The MFs are useful measures because of its completeness. Hadwiger's theorem [33,34] states that these three functionals are complete basis for the translation and rotation invariant valuations of convex bodies where the convex body is a closed and bounded convex set with non-empty interior. Let F be a function that satisfies the following properties, • Valuation: for any two convex bodies B i and B j , • Invariance: for any translation or rotation g, the measure F is invariant, i.e, for any convex body B, • Continuity: for any convergent sequence of convex bodies, B i → B, FIG. 1. Binary jet images of ((a)) a dark jet and ((b)) a QCD jet. Black dots are the active pixels in P (0) without any filtering. Dark gray, gray, blue, and light blue pixels are pixels in P (i) − P (i−1) for i = 1, 2, 3, 4, respectively. Both of the binary images have A (0) = 30. The dark jet model is described in Sec. V.
Then for any such F, there exist three constants c_0, c_1, and c_2 such that
$$F(B) = c_0 \, \mathrm{MF}_0(B) + c_1 \, \mathrm{MF}_1(B) + c_2 \, \mathrm{MF}_2(B),$$
where MF_ν is (A, L, χ) for ν = (0, 1, 2), respectively. Hadwiger's theorem also holds in the geometry of the square lattice and pixelated images, but the context should be modified accordingly [35]. The geometry of the square lattice has a different distance function called the L1 distance, which is the sum of the absolute values of the differences in each component:
$$d_{L^1}(R_1, R_2) = |\eta_1 - \eta_2| + |\phi_1 - \phi_2|.$$
This distance is essentially identical to the length of the shortest path between two points on a square grid. The points within unit L1 distance from the origin are different from those in Euclidean geometry: they form a square whose vertices are at (0, 1), (0, −1), (1, 0), and (−1, 0). The statements of Hadwiger's theorem still hold under this geometry, but with two modifications. First, the invariance under translations and rotations is replaced by the isometries of the L1 space. Second, convexity is replaced by L1-convexity. A given set B is called L1-convex if and only if any two points R_1 and R_2 in B are connected by a path in B whose coordinates vary monotonically along the path. One clear example illustrating the difference between the two convexities is an L-shaped region: it is not convex but L1-convex. After these modifications, we may safely use the MFs for the pixelated image analysis.
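A minimal sketch (our own illustration) of the L1 distance on the integer grid; the points at unit L1 distance from the origin are exactly the four vertices quoted above:

```python
def l1_distance(r1, r2):
    """L1 (taxicab) distance: sum of per-component absolute differences."""
    return sum(abs(a - b) for a, b in zip(r1, r2))

# Integer grid points at unit L1 distance from the origin form the square
# with vertices (0, 1), (0, -1), (1, 0), (-1, 0), not a Euclidean circle.
unit_sphere = sorted(
    (x, y)
    for x in range(-2, 3)
    for y in range(-2, 3)
    if l1_distance((x, y), (0, 0)) == 1
)
```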

B. Morphological Analysis on Jet Images
The morphological analysis of jet images is then performed on the filtered distribution of jet constituents projected on the pseudorapidity-polar coordinates (η, φ). We consider superlevel sets of the jet image P^(0), i.e., the sets of pixels whose energy deposit p_T^(i,j) is higher than a threshold value p_T [36],
$$P^{(0)}[p_T] = \{ (i,j) \,:\, p_T^{(i,j)} > p_T \},$$
where (i, j) is the integer coordinate of the given pixel.¹ The resulting binary images on a two-dimensional integer grid are used for the morphological analysis. In the following discussion, we will omit the threshold argument [p_T] unless it is required explicitly. We then analyze the MFs of the images after dilation by a square, called a structuring element, to understand the geometric structure with the aid of mathematical morphology. The dilation is useful for probing geometric features that are visible at an angular resolution of the size of the square. For our pixelated image analysis, the structuring element B^(k) is a square with side length 2k + 1, and the dilated image P^(k) is defined as the Minkowski sum
$$P^{(k)} = P^{(0)} \oplus B^{(k)} = \{ R + R' \,:\, R \in P^{(0)}, \, R' \in B^{(k)} \}.$$
Sample binary images are shown in Fig. 1. The binary image P^(k) is analogous to a coarse-graining or smearing of the original binary image P^(0). We denote the three MFs of P^(k) as A^(k), L^(k), and χ^(k). In [21], we showed that the MFs A^(0) and A^(1) improve top jet vs. QCD jet classification.
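Dilation by a square of side 2k + 1 can be sketched in a few lines of NumPy (our own illustration; a library routine such as `scipy.ndimage.binary_dilation` would serve equally well):

```python
import numpy as np

def dilate(image, k):
    """Dilate a binary image by a square structuring element of side 2k + 1.

    A pixel of the dilated image P^(k) is active if any pixel of P^(0)
    lies within the (2k+1) x (2k+1) square neighborhood around it.
    """
    padded = np.pad(image.astype(bool), k)
    out = np.zeros_like(image, dtype=bool)
    n, m = image.shape
    for di in range(2 * k + 1):
        for dj in range(2 * k + 1):
            out |= padded[di:di + n, dj:dj + m]
    return out

# A single active pixel grows into a (2k+1)^2 square: 9 pixels for k = 1.
p0 = np.zeros((7, 7), dtype=bool)
p0[3, 3] = True
```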
We also note that dilation by a square is good enough for retrieving the topology of an underlying smooth body from which the point cloud is sampled. The topology of the dilated image is sensitive to the structuring element in general, especially when we are using a finite number of samples. Still, the square is connected and sufficiently round, so that dilation by the square is a good topology estimation process without any glitches [37].
We can get an intuitive idea of how the sequences of the MFs encode the geometric information of a given image by considering their limiting behavior. For a scale k much larger than the size of the image, A^(k) → (2k + 1)^2 because the details of the image are irrelevant to P^(k). In the different extreme case where P^(k) consists of N sufficiently isolated clusters, the asymptotic behavior changes to A^(k) → N(2k + 1)^2. Therefore, the sequence A^(k) is sensitive to the number of clusters of active pixels in the jet image.
The intermediate behavior of the MF sequences (A^(k), L^(k), χ^(k)) contains more details about the pixel distributions. When P^(k) is a convex body, the MFs of P^(k) and P^(k+1) satisfy the following recurrence relations [35],²
$$A^{(k+1)} = A^{(k)} + L^{(k)} + 4\chi^{(k)}, \qquad L^{(k+1)} = L^{(k)} + 8\chi^{(k)}, \qquad \chi^{(k+1)} = \chi^{(k)}.$$
A deviation from these relations signals that some change of shape or topology occurs at the given angular scale.

¹ The physical unit length of the grid is the hadronic calorimeter resolution ΔR = 0.1 of our analysis. The physical coordinates (η, φ) are obtained by multiplying those integer coordinates by ΔR.
For example, if a hole or dent is completely filled during the dilation, the above recurrence relations are violated. Therefore, the full sequences of the MFs contain useful information about the geometry of the binary image in general. The analysis is also a persistence analysis of geometric features of jet substructure, similar to [28]. The recurrence relations also explain the asymptotic behavior of A^(k). Suppose that the recurrence relations of the MFs hold beyond a given scale k_0. The solution for A^(k_0+k) in terms of the MFs of P^(k_0) is then
$$A^{(k_0+k)} = A^{(k_0)} + k \, L^{(k_0)} + 4 k^2 \chi^{(k_0)}.$$
For k ≫ k_0, the area A^(k_0+k) is approximately 4k^2 χ^(k_0), and the Euler characteristic χ^(k_0) can be interpreted as the number of clusters.
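As a quick numerical sanity check (our own illustration, assuming the closed form A^(k_0+k) = A^(k_0) + k L^(k_0) + 4k² χ^(k_0) implied by the recurrence relations): for a single isolated pixel, (A, L, χ) = (1, 4, 1), and dilation by a square of side 2k + 1 gives an exact area (2k + 1)², which the extrapolation reproduces for every k:

```python
# Isolated pixel: MFs (A, L, chi) = (1, 4, 1).
A0, L0, chi0 = 1, 4, 1

for k in range(20):
    exact = (2 * k + 1) ** 2                        # area of the dilated pixel
    extrapolated = A0 + k * L0 + 4 * k ** 2 * chi0  # assumed closed form
    assert exact == extrapolated                    # 4*k^2*chi dominates at large k
```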
We now compare the area A^(k) with the area A^(k)_ext extrapolated from the MFs of P^(k−1) in order to check whether the dilation preserves the geometric features. The difference ΔA^(k) is a useful measure for checking the geometric persistence,
$$\Delta A^{(k)} = A^{(k)} - A^{(k)}_{\mathrm{ext}}, \qquad A^{(k)}_{\mathrm{ext}} = A^{(k-1)} + L^{(k-1)} + 4\chi^{(k-1)}.$$
Figure 2 shows 2D histograms of (A^(k)_ext, ΔA^(k)) of the leading p_T jets of QCD dijet events with p_T,J ∈ [500, 600] GeV. Figure 2(a) for k = 2 shows that typical jets have a lot of activity at short scales, so that the condition ΔA^(k) = 0 can easily be violated for small k. A smeared image becomes more regular at large scales, so that many of the samples have ΔA^(k) = 0, as shown in Fig. 2(b). A similar behavior can be seen directly in the Euler characteristics. For small k, the jets occasionally have subclusters, i.e., the intrinsic topology χ^(k) varies a lot. Therefore, the extrapolation χ^(k)_ext = χ^(k−1) is also quite different from χ^(k), as shown in the 2D histogram of (χ^(k−1), χ^(k) − χ^(k−1)) in Fig. 2(c). For large k, since we are analyzing a single jet, we expect that most of the events have χ^(k−1) = χ^(k) = 1, as in Fig. 2(d). Note that χ^(k) − χ^(k−1) is positive for some events, indicating that there are holes at the scale k − 1 that are filled at the scale k. Note also that the MFs are aggregated features, and their statistical fluctuations are smaller than those of the primitive inputs. For example, the number of active pixels A^(0) has a relative fluctuation δA^(0)/A^(0) ∼ 1/√A^(0), while its pixel-by-pixel fluctuation is of order 1. As a result, the training of the RN with MFs is potentially more stable against fluctuations of the pixel energy deposits, while the CNN is more susceptible to them.
Neural networks trained on these MFs have access to useful geometric measures for solving the given task. The MFs do not use energy weighting, in contrast to other energy-weighted IRC-safe jet substructure observables, so that all the jet constituents are treated equally once they pass the p_T threshold.

C. Convolution Representation of Minkowski Functionals
The MFs are defined as integrals of local features in the continuum limit, as in Eq. 2, so that they can be written as a sum of local contributions from finite-sized patches. This leads to an interesting property of the MFs: they can be embedded in a CNN with finite-sized filters.
For example, the area of a two-dimensional region S can be written as the following double integral involving the indicator function K of a square with side length ℓ centered at (0, 0),
$$A = \int d^2R \, \mathbf{1}_S(R) = \int d^2R \int d^2R' \, \frac{1}{\ell^2} \, K(R - R') \, \mathbf{1}_S(R'),$$
where 1_S is the indicator function of S.
By swapping the order of integration, we obtain an expression in the form of a sum of local contributions of finite patches,
$$A = \int d^2R' \, \mathbf{1}_S(R') \int d^2R \, \frac{1}{\ell^2} \, K(R - R').$$
To discretize and evaluate this integral for a binary image on a square grid, the following marching squares algorithm [38] is fast and useful. The marching squares algorithm for the square lattice processes all the 2 × 2 subimages of a given binary image and collects their local features for calculating the MFs. The local features are summarized in Table I. Note that we do not include the boundary of the 2 × 2 subimages in this calculation.
• For the area A, a subimage contribution is 1/4 of the number of its active pixels because a pixel belongs to four subimages.
• For the boundary length L, the contribution is the local boundary length divided by 2, since every boundary segment belongs to two subimages.
• For the Euler characteristic χ, we only need to count the number of outward corners, N_out, and the number of inward corners, N_in. Since outward and inward corners have exterior angles π/2 and −π/2, respectively, the total curvature is proportional to the difference between N_out and N_in. The Euler characteristic is then
$$\chi = \frac{1}{4}\left( N_{\mathrm{out}} - N_{\mathrm{in}} \right).$$
Each corner is considered only once during the marching, so the local contributions are 1/4 for the outward corners and −1/4 for the inward corners.
Note that the Euler characteristic depends on the definition of connectivity between two diagonally neighboring pixels. We define pixels sharing the same vertex to be connected, and the corresponding subimages then have two inward corners.
For example, (A, L, χ) of an isolated pixel is the sum of the entries 1, 2, 4, and 8 of Table I, and the value is (1, 4, 1). This algorithm can be generalized to calculate the MFs of images on other types of lattices, such as hexagonal pixels,³ or to approximate the MFs of raw images without pixelation [41].
Since there are only 16 unique configurations of a 2 × 2 subimage, we may use the look-up table v_k(a), where a = 0, · · · , 15 and k ∈ {A, L, χ}, in Table I to parameterize the local contributions. The MFs are then the sums of look-up table values as follows,
$$\mathrm{MF}_k = \sum_{i,j} v_k\!\left( \sum_{n,m \in \{0,1\}} f_{nm} \, P^{(k)}_{i+n,\, j+m} \right),$$
where f_nm = ((1, 2), (4, 8)), and P^(k)_ij is 1 or 0 depending on whether the (i, j)-th pixel of P^(k) is active or not.
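The marching-squares evaluation of the MFs can be sketched as follows (our own reconstruction; the look-up tables below are derived from the corner- and edge-counting rules described in the text, since Table I itself is not reproduced here, with diagonally touching pixels treated as connected):

```python
import numpy as np

# Look-up tables indexed by a = 1*p00 + 2*p01 + 4*p10 + 8*p11 of a 2x2 subimage.
V_AREA = np.array([bin(a).count("1") for a in range(16)]) / 4.0  # active pixels / 4
V_LEN = np.array([0, 2, 2, 2, 2, 2, 4, 2,
                  2, 4, 2, 2, 2, 2, 2, 0]) / 2.0                 # boundary edges / 2
V_CHI = np.array([0, 1, 1, 0, 1, 0, -2, -1,
                  1, -2, 0, -1, 0, -1, -1, 0]) / 4.0             # (N_out - N_in) / 4

def minkowski_functionals(image):
    """Area, boundary length, and Euler characteristic of a binary image."""
    p = np.pad(np.asarray(image, dtype=int), 1)  # pad so border pixels see 4 windows
    a = p[:-1, :-1] + 2 * p[:-1, 1:] + 4 * p[1:, :-1] + 8 * p[1:, 1:]
    return V_AREA[a].sum(), V_LEN[a].sum(), V_CHI[a].sum()

# An isolated pixel gives (A, L, chi) = (1, 4, 1), as stated in the text.
single = np.zeros((5, 5))
single[2, 2] = 1
```

A 3 × 3 ring of pixels with a one-pixel hole gives (A, L, χ) = (8, 16, 0) under these rules: one connected component minus one hole.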
Note that all the steps for calculating the MFs in this section can be written in terms of convolutions. Let p_T^(i,j) be the energy deposit of the (i, j)-th pixel. The calculation of the MFs discussed in this section can be summarized as follows,
$$P^{(0)}[p_T] = \theta\!\left( p_T^{(i,j)} - p_T \right), \qquad P^{(k)} = \theta\!\left( P^{(0)} * B^{(k)} \right), \qquad \mathrm{MF}_k = \sum_{i,j} v_k\!\left[ \left( f * P^{(k)} \right)_{ij} \right],$$
where all the binary images in the above equations are considered as functions that give 1 for active pixels and 0 otherwise, * is the discrete convolution, and θ is the step function. Stacked convolution layers can simulate this algorithm, i.e., B^(k) and f can be considered as the weights of convolution layers, and the functions θ and v can be modeled by 1 × 1 convolutions [42]. Therefore, A^(k), L^(k), and χ^(k) are in principle covered by a CNN trained on jet images. One subtle point is that this closed expression contains a step function, which has a point of discontinuity. A CNN with a finite number of filters and smooth activation functions may have difficulty accessing this variable set since the network itself is a smooth function. A similar situation may occur for a CNN with L2 regularizers. We will show an example in which the tagging performance of the CNN is improved by adding the MFs to the inputs.

III. ENERGY CORRELATOR BASED NEURAL NETWORKS FOR JET SUBSTRUCTURE
The energy dependence of the MFs in Eq. 21 is nonlinear, while many theory-motivated jet substructure variables typically have a multilinear energy dependence; these types of variables are called IRC-safe energy correlators [15,16]. Since the counting variables complement those variables, we may use a neural network model representing the IRC-safe energy correlators and provide the MFs as additional inputs. In this section, we briefly review two examples: the IRC-safe relation network [20,21,43] and the energy flow network [19].

A. Relation Network

The relation network (RN) is mainly designed for capturing the common properties of relational reasoning. For example, if we use the momenta p_i of the i-th constituents of the jet as network inputs, we can build one of the simplest RN models with two scalar functions f and g as follows,
$$\mathrm{RN}(\{p_i\}) = f\Big( \sum_{i \in a} \sum_{j \in b} g(p_i, p_j) \Big),$$
where a and b are labels for subsets of jet constituents. If we impose the IRC-safe constraints [15,16], the function g should be bilinear in the constituent p_T's, and the coefficients Φ_ab should depend only on the relative angular distance R_ij between the jet constituents. The following is then the basic form of the IRC-safe RN for the jet substructure,
$$\mathrm{RN}(\{p_i\}) = f\Big( \sum_{i \in a} \sum_{j \in b} p_{T,i} \, p_{T,j} \, \Phi_{ab}(R_{ij}) \Big).$$
The summation in the above equation is a nested loop over the jet constituents. Nevertheless, this part can be simplified to a single summation as we describe below. We introduce the following two-point energy correlation S_2,ab that accumulates energy correlations at a given angular scale R,
$$S_{2,ab}(R) = \sum_{i \in a} \sum_{j \in b} p_{T,i} \, p_{T,j} \, \delta(R - R_{ij}).$$
By using S_2,ab, the nested summation in Eq. 23 can be replaced by a single integral as follows,
$$\sum_{i \in a} \sum_{j \in b} p_{T,i} \, p_{T,j} \, \Phi_{ab}(R_{ij}) = \int dR \; S_{2,ab}(R) \, \Phi_{ab}(R).$$
This model covers various jet substructure variables. For example, the two-point energy correlation functions EFP^n_2 [14,16] can be written as linear combinations of the S_2 as follows,
$$\mathrm{EFP}^n_2 = \sum_{i,j} p_{T,i} \, p_{T,j} \, R_{ij}^n = \int dR \; R^n \, S_2(R).$$
Therefore, this network covers all the information encoded in the EFP^n_2. For the practical use of this RN with IRC-safe constraints, we discretize the integral in Eq. 25 by binning the integrand with bin size ΔR. The discrete version of S_2,ab is then defined as follows,
$$S^{(k)}_{2,ab} = \int_{k \Delta R}^{(k+1) \Delta R} dR \; S_{2,ab}(R),$$
where k is the bin index. The integral in Eq. 25 can then be expressed as an inner product between S^(k)_2,ab and a weight vector Φ^(k)_ab. For our numerical study, we take the bin size ΔR = 0.1, which is the hadronic calorimeter resolution. The S_2's are directly calculated from the HCAL and ECAL outputs. If we use an MLP to model the function f of the RN in Eq. 23, we can embed Φ^(k)_ab into the first fully-connected layer: a fully-connected layer taking this inner product as its input is equivalent to a fully-connected layer that maps the S^(k)_2,ab's to the latent dimension. The RN is therefore modeled by an MLP taking S^(k)_2,ab, and its first layer can be regarded as a trainable two-point energy correlation.
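A direct implementation of the binned two-point correlation is straightforward (our own sketch; constituents are (p_T, η, φ) triplets, and the subset labels a and b simply select which lists are passed in):

```python
import numpy as np

def s2_binned(consts_a, consts_b, dR=0.1, n_bins=30):
    """Binned two-point energy correlation S2 between constituent sets a and b.

    Bin k accumulates pT_i * pT_j over pairs whose angular distance R_ij
    falls into [k*dR, (k+1)*dR).
    """
    s2 = np.zeros(n_bins)
    for pt_i, eta_i, phi_i in consts_a:
        for pt_j, eta_j, phi_j in consts_b:
            dphi = (phi_i - phi_j + np.pi) % (2 * np.pi) - np.pi  # wrap phi
            r = np.hypot(eta_i - eta_j, dphi)
            k = int(r / dR)
            if k < n_bins:
                s2[k] += pt_i * pt_j
    return s2
```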

B. Energy Flow Network
The energy flow network (EFN) [19] is also a graph neural network based on the energy correlators, but this network uses only pointwise features. It is based on the deep sets architecture [24], i.e.,
$$\mathrm{EFN}(\{p_i\}) = f\Big( \sum_i g(p_i) \Big).$$
As discussed before, the pointwise feature g(p_i) should be a linear function of the energy when the IRC-safe constraint is assumed, and we have the following model of the EFN,
$$\mathrm{EFN}(\{p_i\}) = f\Big( \sum_i p_{T,i} \, \Phi(R_i) \Big).$$
For the pixelated image analysis, the p_T-weighted sum over the jet constituents is replaced by the energy-weighted sum over all pixels,
$$\mathrm{EFN} = f\Big( \sum_{i,j} P^{ij}_T \, \Phi_{ij} \Big),$$
where P^ij_T is the energy deposit of the (i, j)-th pixel, and Φ_ij is the corresponding angular weight.
When we replace f with an MLP, the angular weights Φ_ij can be absorbed into the MLP. The product between the weights W of the first dense layer and Φ_ij can be considered as effective weights W_ij of an MLP taking P^(ij)_T as inputs, i.e., the dense layer can be rewritten as
$$\sum_{i,j} P^{(ij)}_T \, W_{ij} + B.$$
Therefore, an MLP for the pixelated image analysis models the EFN for the pixelated jet image.
Note that using standardized inputs does not change this conclusion, since the standardization is a linear transformation. Consider the following transformation of the inputs and parameters of the dense layer,
$$P^{(ij)}_T \to \frac{P^{(ij)}_T - \mu^{(ij)}}{\sigma^{(ij)}}, \qquad W_{ij} \to \sigma^{(ij)} W_{ij}, \qquad B \to B + \sum_{i,j} \mu^{(ij)} W_{ij},$$
where μ^(ij) and σ^(ij) are the mean and standard deviation of the inputs.⁴ Since the first dense layer, Σ_{i,j} P^(ij)_T W_ij + B, is invariant under this transformation, we may safely use the MLP for the standardized image to model the EFN.
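The invariance can be verified numerically in a few lines (our own illustration, with scalar mean and standard deviation for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)

P = rng.random((8, 8))            # toy "jet image": pixel energy deposits
W = rng.normal(size=(8, 8))       # first-dense-layer weights, one per pixel
B = 0.3                           # bias
mu, sigma = 0.5, 2.0              # standardization constants

raw = (P * W).sum() + B
# Transformed inputs/parameters: P -> (P - mu)/sigma, W -> sigma*W,
# B -> B + mu * sum(W); the layer output is unchanged.
transformed = (((P - mu) / sigma) * (sigma * W)).sum() + (B + mu * W.sum())
assert np.isclose(raw, transformed)
```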

IV. COMBINED NETWORK SETUP
In this section, we describe the network that combines the morphological analysis and the RN or EFN.

A. Network Inputs
For the morphological analysis, we use the MFs up to k = 6 and denote them as x_morph,
$$x_{\mathrm{morph}} = \left\{ A^{(k)}[p_T], \, L^{(k)}[p_T], \, \chi^{(k)}[p_T] \,:\, k = 0, \dots, 6 \right\}.$$
We use the following p_T thresholds: the default threshold of the detector simulation,⁵ 2, 4, and 8 GeV.
For the IRC-safe relation network, we use the two-point energy correlations S_2,ab of the following subsets of jet constituents:
• the trimmed jet J_trim [9], denoted by h,
• the complement set of J_trim, denoted by s,
• the leading p_T subjet J_1, denoted by 1,
• the complement set of J_1, denoted by c.
Using these subsets is effective in top tagging [21]. We use the following sets of binned two-point correlations as inputs of the RN,
$$x_{\mathrm{trim}} = \left\{ S^{(k)}_{2,hh}, \, S^{(k)}_{2,hs}, \, S^{(k)}_{2,ss} \right\}, \qquad x_{J_1} = \left\{ S^{(k)}_{2,11}, \, S^{(k)}_{2,1c}, \, S^{(k)}_{2,cc} \right\}.$$
In addition to those MFs and two-point energy correlations, we provide the p_T and mass of the jet, the trimmed jet, and the leading p_T subjet as additional inputs to give information on the jet kinematics, and we denote them as x_kin.

B. Network Architecture
We use the following setup to transform the given inputs to the desired outputs for the binary classification.
We first use MLPs to encode each of the primitive inputs x_morph, x_trim, and x_J1 into latent spaces of dimension 5. All the MLPs used in this section take the kinematic inputs x_kin as additional inputs. Those latent space features are mapped into the classifier output ŷ by another MLP, where logit(ŷ) is the inverse of the standard logistic function, log(ŷ) − log(1 − ŷ). For analyses using only a subset of the inputs, we take only the relevant latent space features. We denote this setup as RN+MF, and the pure RN setup without the morphological analysis as RN.
We will use this network for binary classification, trained by minimizing the binary cross-entropy loss function,
$$L = -\,\mathbb{E}\left[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,\right],$$
where y = 1 indicates the signal samples and y = 0 indicates the background samples. The prior for each class is 1/2. All the hidden layers' weights are L2-regularized with a weight decay coefficient of 0.001. The network is trained with the ADAM optimizer [44] with default parameters, and we apply a temporal exponential moving average to the trainable parameters after discarding the first 50 epochs. The ratio between the training, validation, and test datasets is 9:1:10. We stop training when the validation loss does not improve for 50 epochs. We repeat this procedure for different numbers of minibatches, 20, 50, 100, and 200, and choose the result with the largest validation AUC. All of these setups are implemented in Keras [45] with the TensorFlow backend [46]. Finally, all inputs are standardized, and we reweight events to make the p_T distribution flat in order to prevent the networks from learning the p_T,J distribution.
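The loss being minimized is the standard binary cross-entropy; a minimal NumPy version (ours, for reference) reads:

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-7):
    """Binary cross-entropy averaged over samples (y = 1 signal, y = 0 background)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # guard the logarithms
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```

An uninformative classifier with ŷ = 1/2 everywhere gives a loss of log 2.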
We also remark that in the limit of large MLP width and small bin size for the S_2 and MFs, this network setup corresponds to a smooth model built from scalar functions Φ and Ψ. Such a representation can help in discussing the relationship between the morphological analysis and other networks working on the momenta of jet constituents without pixelation, such as ParticleNet [6]. However, that discussion is beyond the scope of this paper.

C. Convolutional Neural Network and Energy Flow Network
We compare this RN+MF to the following CNN and EFN.
Our baseline CNN is trained on the preprocessed jet images, as described in [21].
2. Set the center of (η, φ) coordinate to be the leading p T subjet axis.
3. Rotate (η, φ) plane about the origin so that the subleading p T subjet is on the positive y axis.
4. If the third leading p T subjet exists and has negative x value, flip the x axis so that the third subjet is always on the right side of the image.
5. Pixelate the jet constituents to get the jet image.
The preprocessed jet image is a two-dimensional p_T-weighted histogram of jet constituents on the range [−1.5, 1.5] × [−1.5, 1.5] with bin size 0.1 × 0.1. We denote the set of energy deposits of the pixels as x_image = { P^(ij)_T }. The image input x_image is provided to the networks after standardization. In summary, the preprocessed images are aware of the location of the most energetic subjet and the relative positions of the two subleading p_T subjets. The CNN consists of six convolutional layers. The filter size is 3 × 3, and a pooling layer with pool size 2 × 2 is inserted after every three convolutional layers. After that, all the spatial dimensions are flattened, and a 1 × 1 convolution maps the intermediate outputs to a latent space with dimension 10.⁶ These latent space features are then concatenated with the kinematic inputs x_kin, and we use an MLP to transform them into the desired classifier output. The training setups are the same as for RN+MF, but we scan over the minibatch numbers 100, 200, and 500.
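Steps 2-5 of the preprocessing can be sketched as follows (our own illustration; the subjet axes `axis1`, `axis2`, `axis3` are assumed to be found beforehand by a subjet-finding step not shown here):

```python
import numpy as np

def preprocess(constituents, axis1, axis2, axis3=None, lim=1.5, dr=0.1):
    """Translate, rotate, flip, and pixelate jet constituents into an image.

    constituents: iterable of (pT, eta, phi); axis1..3: (eta, phi) subjet axes.
    """
    pts = np.asarray(constituents, dtype=float)
    xy = pts[:, 1:] - np.asarray(axis1)               # 2. center on leading subjet
    v = np.asarray(axis2) - np.asarray(axis1)         # 3. rotate subleading to +y
    theta = np.arctan2(v[1], v[0]) - np.pi / 2
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, s], [-s, c]])
    xy = xy @ rot.T
    if axis3 is not None:                             # 4. flip third subjet to x >= 0
        w = rot @ (np.asarray(axis3) - np.asarray(axis1))
        if w[0] < 0:
            xy[:, 0] *= -1
    n_bins = int(round(2 * lim / dr))                 # 5. pT-weighted histogram
    img, _, _ = np.histogram2d(xy[:, 0], xy[:, 1], bins=n_bins,
                               range=[[-lim, lim], [-lim, lim]],
                               weights=pts[:, 0])
    return img
```

The total p_T is preserved by the pixelation as long as all constituents fall inside the image range.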
Although a CNN can represent the MFs, we may also explicitly provide the MFs to the CNN. As discussed earlier, a CNN may have technical difficulty expressing the MFs during training because the MFs are not smooth functions of the jet image. We therefore additionally consider a CNN whose final MLP receives the morphological latent features h_morph as additional inputs. We denote this setup as CNN+MF.
We model the pointwise correlation of the EFN by an MLP with three hidden layers and 10 outputs. The first hidden layer has 50 (200) outputs, while the others have 200 outputs. The input image is concatenated with x_kin. The outputs are then provided to another MLP that converts them into the classifier output, similar to that of the CNN. Table II lists the combinations of inputs we study in this paper and the training costs for the classification problems discussed in Sec. VI. Some notable differences between the inputs of the CNNs and the RN+MFs are as follows.
The baseline CNN takes a large number of inputs since it takes the whole image. However, the detector hits are sparsely distributed over the image: the center of the image contains most of the information, while the outer region has only sparse soft activity. The CNN has to distill the useful information from this sparse dataset. On the other hand, the RN only takes the basis of the two-point energy correlators. The soft activities are collected into the bins of S_2, and the resulting number of inputs is only O(100).
The number of MF inputs is 3 × 7 for each binary image at the given energy thresholds. This is also a relatively small number compared to the dimension of the image inputs. We also note that as k increases, the change in the geometry of the dilated image P^(k) becomes more regular, and the MFs become increasingly determined by their previous values in the sequences. The cutoff on k may be fine-tuned further, but we use 7 values, which effectively smooths out geometric features below the angular scale 1.5. The latter terms in the sequences merely confirm the regularity of the dilation, and dropping some of them may not affect the performance significantly.

V. JET TAGGING PERFORMANCE COMPARISON
A. Semi-visible Jet Tagging

As a working example for our network, we consider a toy Hidden Valley model [49,50] whose signature is a semi-visible jet [29,51]. The hidden sector may include a fermion q_v charged under the secluded gauge group and a massive leptophobic gauge boson Z′ that mediates the interaction between the SM particles and the hidden sector. At a hadron collider, q_v may be produced through the process q q̄ → Z′ → q_v q̄_v. The secluded gauge interaction confines q_v and q̄_v, forming pions π_v and rho mesons ρ_v after the hidden sector parton shower and hadronization. We consider a scenario in which only the ρ_v leaves visible signatures via the decay ρ_v → q q̄, while the other mesons are not visible at the detectors. The resulting semi-visible jet, which we call a dark jet, contains many color-singlet quark pairs fragmenting into hadrons, together with missing particles. Therefore, the dark jets have different geometric and hard substructures compared to QCD jets.
For the simulation of the dark jet, we use Pythia 8 [30] and its Hidden Valley model implementation [50]. The selected jets have masses in the [30, 70] GeV window. The number of selected events is 6.0 × 10^5 for the dark jet samples and 1.9 × 10^6 for the QCD jet samples. Figure 3(a) shows the A^(k) distributions of dark jets and QCD jets. The leftmost curve is the A^(0) distribution, and the two samples are close to each other there. On the other hand, the average of A^(i) (i > 0) of the QCD jets is much larger, and the A^(i) distribution extends far beyond the endpoint of the dark jet A^(i) distribution. The RN+MF model can explicitly use this feature in the classification.
Given the apparent difference in the A^(i) distributions, the CNN is also capable of learning this phase-space region populated only by QCD jets. The classifier's reasoning appears in the dijet distribution in Fig. 3(b). The distributions are shown after applying a mild cut, at 90% signal dark jet efficiency, using the CNN. The cut significantly suppresses the events beyond the endpoint of the dark jet distribution.
We show the receiver operating characteristic (ROC) curves of the RN and CNN models, with and without the MFs, in Fig. 4; EFN results are explained in Sec. V C. The corresponding areas under the ROC curve (AUC) are listed in table III. Both RN and CNN models reject more than 90% of QCD jets in the phase space of large MFs without losing any dark jet events, as illustrated in Fig. 3. Even a simple classifier using only the MFs and kinematical variables rejects most of the QCD jet samples, as seen from the orange curve. This shows that the MFs describe the boundary of the phase space of the dark jet events quite efficiently. The models with MFs consistently outperform those without, as can be seen in table III. The AUC of RN+MF is slightly better than that of the CNN, and the AUC of CNN+MF is the best among the CNN and RN models.
The ROC curves show some crossovers in the region of small dark jet tagging efficiency, below ε_dark = 0.6, where the RN+MF rejection efficiency looks better than that of CNN+MF. However, the rejection rate there is so high that only a relatively small sample of O(1000) events populates this region. A slight difference in the rejection efficiency is therefore not statistically significant.
We can estimate the difference between the CNN and RN+MF models by calculating the correlation coefficient of the logit outputs logit(ŷ_CNN) and logit(ŷ_RN+MF) for the same testing event set. We list the values in table IV. Here ŷ is the output of each model, and its logit is logit(ŷ) = log(ŷ) − log(1 − ŷ). The correlation coefficient ρ between CNN and RN+MF is relatively small: ρ = 0.793 for the dark jet dataset. But once we give the MF information to the CNN model, the correlation improves to ρ = 0.893 between CNN+MF and RN+MF. The improvements in correlation and classification performance indicate that the CNN does not fully utilize the MFs unless they are explicitly provided as inputs.
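The logit correlation described above can be computed directly; a minimal sketch (the function name and the clipping safeguard are our additions, not part of the paper's setup):

```python
import numpy as np

def logit_correlation(y_a, y_b, eps=1e-7):
    """Pearson correlation of logit-transformed classifier outputs,
    logit(y) = log(y) - log(1 - y); outputs are clipped away from {0, 1}."""
    y_a = np.clip(np.asarray(y_a, dtype=float), eps, 1.0 - eps)
    y_b = np.clip(np.asarray(y_b, dtype=float), eps, 1.0 - eps)
    la = np.log(y_a) - np.log1p(-y_a)
    lb = np.log(y_b) - np.log1p(-y_b)
    return float(np.corrcoef(la, lb)[0, 1])
```

Working in logit space rather than with the raw outputs avoids the saturation of ŷ near 0 and 1, where most well-classified events pile up.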
The correlation coefficient between the outputs of the same model trained with different random number seeds is significantly larger than the correlation between different models. This indicates that the difference between the network outputs is primarily due to the systematic difference in the network architectures.

B. Top Jet Tagging
For the top jet study, we use the samples described in [21]. We use the events with p_T,J ∈ [500, 600] GeV and m_J ∈ [150, 200] GeV. The number of selected events is 9.5 × 10^5 for top jets and 3.5 × 10^5 for QCD jets. The ratio between training, validation, and test samples and the training method are the same as in the dark jet case.
We show the ROC curves in Fig. 5. The model MF, which uses only the MFs as inputs (without any IRC-safe correlators), performs better than the RN model. This indicates that the geometric and topological information is the primary information for top jet classification. As can be seen in table V, the model using IRC-safe variables together with the MFs is better than the one without MFs, as in the dark jet case. The MFs enhance the performance of the RN much more than in the dark jet tagging case.
The CNN+MF shows a similar tagging performance to the RN+MF, but the baseline CNN does not. As discussed earlier, the convolutional representation of the MFs involves a discontinuous step function, which is hard to model with convolutional layers with a finite number of filters and L2 regularizers. This CNN setup effectively penalizes discontinuous functions because they require large weights or a large number of filters with small weights.
The correlation coefficient ρ of the logits of the outputs between trainings of the same model with different random number seeds is 0.986 for RN+MF. On the other hand, the ρ of the CNN is 0.933. The difference shows that the training of the CNN model suffers from the local minimum problem more than that of RN+MF. In gradient-based training methods, easily classifiable samples dominate the early phase of the training. Different trainings may find different local minima that mainly describe the classification boundary for the dominant samples. In such cases, confusing events are underrepresented, and the training results will have some variance. This variance is larger for the more generic function model, and hence the CNNs have a smaller correlation coefficient than the RN+MFs.
The local minimum problem of the CNN can be relaxed by explicitly providing some components, such as the MFs. Adding the MFs to the CNN inputs improves the situation, and CNN+MF has a correlation coefficient of 0.979. Furthermore, the correlation between CNN+MF and RN+MF is 0.941, much higher than that between CNN and RN+MF. Namely, the two models are now strongly correlated with each other.
To visualize the fine difference between the RN+MF and CNN models, we compare the (A^(0), A^(2)) distributions of dijet samples conditioned on the classifier outputs. We select the dijet samples whose classifier outputs ŷ_CNN and ŷ_RN+MF are below the thresholds corresponding to 70% top jet selection efficiency of the CNN and RN+MF models, respectively.
By taking the ratio of the histograms of the MFs, we can visualize the difference in the classification boundaries of RN+MF and CNN. In Fig. 6, we consider the ratio I, where N is the density in a given bin of the histogram of the samples selected by the CNN or RN+MF, and ε = 0.1 is a regularization to avoid dividing by zero. Figure 6(a) shows the distribution of I in the (A^(0), A^(2)) plane, and Fig. 6(b) is the same plot for the MFs obtained from the pixels above an 8 GeV threshold. Because the RN+MF model rejects more dijet events, the ratios tend to be bigger than 1 for most of the bins. In the figure, the red bins represent I > 1, while the blue bins correspond to I < 1. In Fig. 6(a), the bins with large A^(0) and small A^(2) are red, indicating that the RN+MF improves the classification by selecting more samples in this region. In Fig. 6(b), the region with large A^(0) and large A^(2) tends to have larger values, but the red region is less prominent. This may indicate that the CNN utilizes the geometric features of the pixels with energy above 8 GeV but has difficulty fully utilizing the geometric information of soft energy deposits.
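The per-bin ratio can be sketched as follows. Since the explicit formula for I is not reproduced here, we assume the natural form I = (N_RN+MF + ε) / (N_CNN + ε) per (A^(0), A^(2)) bin with ε = 0.1; the function name and argument layout are illustrative.

```python
import numpy as np

def bin_ratio(a0_rn, a2_rn, a0_cnn, a2_cnn, bins, eps=0.1):
    """Assumed per-bin ratio I = (N_RN+MF + eps) / (N_CNN + eps)
    of 2D histograms in the (A0, A2) plane."""
    n_rn, _, _ = np.histogram2d(a0_rn, a2_rn, bins=bins)
    n_cnn, _, _ = np.histogram2d(a0_cnn, a2_cnn, bins=bins)
    return (n_rn + eps) / (n_cnn + eps)
```

With ε = 0.1, a bin populated by one selection but empty in the other still yields a finite, visibly colored ratio.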

C. Comment on EFN and EFN+MF
In addition to the CNN, we study classification using the EFN and EFN+MF models. The EFN model uses the same jet images as inputs, but the model itself is constrained to be IRC safe. Because of this constraint, the EFN cannot fully use the geometric information of the soft activities encoded in the MFs. As a result, the classification performance of the EFN is worse than that of the networks taking the MFs as inputs and of the CNN, which implicitly covers the MFs. Nevertheless, the EFN+MF performs nearly as well as the CNN+MF and RN+MF, so the EFN covers sufficiently useful IRC-safe information for both dark jet tagging and top jet tagging.
In dark jet tagging, the IRC-safe variables are the key information, and the EFN performs well in the classification, as illustrated in Fig. 4. Adding the MFs as extra inputs improves the performance slightly. At low signal efficiency, the EFN+MF model performs best among all the models in Fig. 4. As discussed already, because of the large background rejection in that region, the number of remaining samples is small, and we suspect that the difference is within the statistical fluctuations.
In top tagging, the geometric and topological information is important. The performance of the EFN alone is comparable to that of the RN, but it is significantly improved when the MFs are added as inputs. Our RN model uses the two-point correlation with respect to the leading p_T subjet, and the two-point correlation after removing the leading subjet, to capture the three-point correlation inside the top jet. The inputs of the EFN are also sensitive to this topological three-prong structure of the top jet because we preprocess the jet images so that the three subjets always appear at particular points on the jet image. The EFN+MF covers more geometric information than the EFN, and its performance is comparable to that of the CNN as a result. The improved performance mostly comes from the MFs, and the EFN+MF performs nearly as well as the CNN+MF and RN+MF.

A. Overcoming a Small Dataset
As discussed in the previous section, the RN+MF model has an advantage over the CNN model in training performance. Models with broader coverage, such as the CNN, are capable of modeling generic functions. The price of this high expressive power is often a high variance in the trained outputs and a high sensitivity to statistical noise. These errors may degrade the generalization performance of the network. In this respect, using a simpler model helps maintain the performance in some cases. Figure 7 shows the AUCs of RN+MF and CNN as functions of the number of training samples. We can see from the figure that the AUC of RN+MF is significantly larger than that of CNN for a small dataset, although the gap decreases as the size of the training dataset increases.
For the top jet classification, RN+MF achieves an AUC higher than 0.9 already at 1000 training samples, at most 4% smaller than the best AUC. Meanwhile, the CNN needs O[10,000] samples to achieve the same performance as RN+MF.
We find a similar behavior of the AUC curves in the dark jet classification. The curves for RN+MF and CNN meet at 4000 events, much earlier than in the top jet tagging case. This is because there are no dark jet samples at the tail of the MF distributions of QCD jets, as shown in Fig. 3. The training of the CNN can easily find this difference with a small number of samples, so the curves meet much earlier.
Since the CNN model has comparable performance to RN+MF, we may consider optimizing the learning steps to improve its performance when the dataset is small. For example, we may adjust the learning dynamics by replacing the cross-entropy loss L_CE with a focal loss L_FL [54]. The results are shown as dotted lines in Fig. 7. The focal loss penalizes the contribution from easily classifiable examples by extra factors (1 − ŷ)^2 and ŷ^2, and it helps training when the dataset is sparse. The jet image dataset is sparse, so we see an improvement at low statistics. However, there is no improvement for RN+MF since the MF and S_2 distributions are mostly dense and smooth. Note that training with the focal loss does not converge to the maximum likelihood estimation of the binary classifier, i.e., ŷ → p(y = 1|x) in the asymptotic limit. Therefore, the performance is generally lower than that obtained with the cross-entropy loss when enough data is available.
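In the binary case, the focal loss of Lin et al. [54] replaces the cross-entropy terms −log ŷ and −log(1 − ŷ) by −(1 − ŷ)^γ log ŷ and −ŷ^γ log(1 − ŷ) with γ = 2, as stated above. A minimal numpy sketch (function names illustrative):

```python
import numpy as np

def cross_entropy(y, p, eps=1e-7):
    """Standard binary cross-entropy, averaged over samples."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def focal_loss(y, p, gamma=2.0, eps=1e-7):
    """Binary focal loss: (1-p)^gamma and p^gamma factors down-weight
    easy examples relative to the cross-entropy."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * (1 - p) ** gamma * np.log(p)
                    + (1 - y) * p ** gamma * np.log(1 - p))
```

Since the modulating factors are at most 1, the focal loss never exceeds the cross-entropy on the same predictions; well-classified samples are suppressed most strongly, which is why the effect shows up mainly for the sparse image inputs.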

B. Less Computational Complexity and Training Time
Another advantage of the RN+MF is its low computational complexity. Networks with less computational complexity can be evaluated much faster and take less memory. Table III and table V show that the training time of RN+MF is about ten times shorter than that of the CNN. We also note that RN+MF takes about 300 MB of GPU memory during training with 200 mini-batches, while the CNN takes about 6000 MB of GPU memory in our setup.
We can estimate the computational complexity difference between the CNN and RN+MF from the complexities of the network evaluations and the input calculations. Because input calculations can be cached, the network evaluation complexity is the dominant factor during training. The evaluation complexity is proportional to the number of multiplications since the networks mostly consist of tensor multiplications. One of the most expensive layers of our CNN is a convolutional layer with 3 × 3 filters mapping images with 30 × 30 pixels and 16 channels to images of the same size. This layer has 3 × 3 × 16 × 16 × 30 × 30 ≈ 2.1 × 10^6 multiplications. Our CNN has two convolutional layers with this configuration, so these two layers use about 4,000,000 multiplications. Meanwhile, our RN+MF has only fully connected layers, and the most expensive one has 200 incoming features and 200 outgoing features, i.e., 200 × 200 = 40,000 multiplications. We use three dense layers for each of the MLPs of RN+MF, which has four MLPs. The number of multiplications is then at most 3 × 4 × 40,000 = 480,000.
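The counting above can be reproduced with a one-line helper (a sketch; the layer shapes are those quoted in the text):

```python
def conv2d_mults(h, w, k, c_in, c_out):
    """Multiplications of a k x k convolution producing a same-size
    (zero-padded) h x w output, c_in input and c_out output channels."""
    return h * w * k * k * c_in * c_out

per_conv_layer = conv2d_mults(30, 30, 3, 16, 16)  # 2,073,600 per layer
dense = 200 * 200                                  # 40,000 per dense layer
mlp_total = 3 * 4 * dense                          # 480,000 for all RN+MF MLPs
```

Two such convolutional layers thus cost about 4.1 million multiplications, roughly an order of magnitude more than all the dense layers of RN+MF combined.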
The estimated computational complexity is a factor of 10 less than that of the convolutional layers, and it qualitatively explains the difference in training time. It also explains the difference in GPU memory usage, since the backpropagation algorithm has to record the entire set of operations: the more operations involved, the more GPU memory is needed during training. On the other hand, the complexity of the input calculations only matters when the network inputs are not cached. The computational complexity of evaluating the inputs of RN+MF is as follows. The calculation of the MFs has two convolutions, with filter sizes (2k + 1) × (2k + 1) and 2 × 2, for the dilation and local feature identification, respectively. Those two convolutions have [(2k + 1)² + 2²] × (30 × 30) multiplications, (51) which is 4,500 for k = 0 and 155,700 for k = 6. Note that the complexity of the dilation, (2k + 1)² × (30 × 30), can be further reduced by using optimized algorithms. We may consider this number an upper bound on the complexity.
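The same count as a helper function (a sketch, assuming the [(2k + 1)² + 2²] × (30 × 30) form of Eq. (51)):

```python
def mf_input_mults(k, h=30, w=30):
    """Multiplications for computing the MFs of one dilated image P(k):
    a (2k+1) x (2k+1) dilation plus a 2 x 2 local-feature convolution."""
    return ((2 * k + 1) ** 2 + 2 * 2) * (h * w)
```

Even the largest dilation, k = 6, stays an order of magnitude below a single 3 × 3 convolutional layer of the CNN.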
The calculation complexity of the two-point correlation S_2,ab is a function of the number of jet constituents, N. The jet reclustering has N log N complexity [55], and the two-point correlation calculation has N² complexity in general. For N = 50, approximately the largest number of jet constituents in our samples according to Fig. 3, the total complexity is ≈ 2,700. The second, N², factor can be reduced to N²/2 if a and b of S_2,ab are the same.
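The estimate can be checked directly (a sketch; we take log base 2 for the reclustering term, which is a convention choice):

```python
import math

def s2_input_mults(n):
    """Reclustering (N log N) plus all-pairs correlation (N^2) cost."""
    return n * math.log2(n) + n * n
```

For n = 50 this gives roughly 2,800 operations, consistent with the ≈ 2,700 quoted above; the exact value depends on the logarithm base and on whether the N²/2 symmetric-pair reduction is applied.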
Those two complexities of evaluating the inputs of RN+MF, 155,700 and 2,700, are still much smaller than the complexity of the two convolutional layers. We conclude that the RN+MF setup is computationally more efficient than the CNN.

VII. PARTON SHOWER MODELING AND MINKOWSKI FUNCTIONALS
So far, we have discussed jets generated by PYTHIA8, but the predicted jet substructure in general depends on the simulator because of the different parton shower schemes. PYTHIA8 adopts p_T-ordered showering [56,57], while HERWIG7 adopts angular-ordered showering. The distributions of MFs with energy thresholds can capture the geometric differences between these two shower schemes, and the two simulated distributions may differ from each other. We briefly check the difference in the A^(k)[p_T] distributions and discuss its origin in terms of the shower schemes.
In Fig. 8, we show the following asymmetry ratio D of the distributions of two selected A^(k)[p_T],

D(i) = [f_P(i) − f_H(i)] / [f_P(i) + f_H(i)],
where N_P(i) and N_H(i) are the numbers of PYTHIA8 and HERWIG7 events in the i-th bin, and f_P(i) and f_H(i) are their fractions with respect to the total number of events, respectively. Here, the samples are the QCD jets of the top jet classification, with p_T,J ∈ [500, 600] GeV and m_J ∈ [150, 200] GeV.
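The asymmetry can be sketched directly from the binned counts. This is a minimal numpy version assuming the standard normalized-difference form, which reproduces the D = +1 (−1) endpoints for bins populated only by PYTHIA8 (HERWIG7) described in the text; bins empty in both samples are assigned D = 0.

```python
import numpy as np

def asymmetry_ratio(n_pythia, n_herwig):
    """D(i) = [f_P(i) - f_H(i)] / [f_P(i) + f_H(i)] per bin,
    where f is each sample's bin fraction."""
    f_p = np.asarray(n_pythia, dtype=float)
    f_h = np.asarray(n_herwig, dtype=float)
    f_p, f_h = f_p / f_p.sum(), f_h / f_h.sum()
    denom = f_p + f_h
    # avoid 0/0 in doubly empty bins
    return np.divide(f_p - f_h, denom, out=np.zeros_like(denom),
                     where=denom > 0)
```

Normalizing to fractions before taking the ratio makes D insensitive to the different total event counts of the two generators.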
In Fig. 8(a) and Fig. 8(b), we show the asymmetry ratio of (A^(0), A^(1)) without p_T filters. The darkest red bins have D = 1; no HERWIG7 events are observed there. The darkest blue bins correspond to D = −1 and contain no PYTHIA8 samples. The dark red pixels tend to lie in the large A^(0) region because PYTHIA8 predicts higher A^(0). For the same A^(0) value, PYTHIA8 predicts smaller values of A^(1) than HERWIG7. This means the jet constituents are more clustered in PYTHIA8. The trend is common for all k > 1 (see Fig. 8). The situation is different for A^(k) with a p_T filter. As illustrated in Fig. 8(c) and Fig. 8(d), the A^(k)[8 GeV] of PYTHIA8 tends to be higher than that of HERWIG7 for a given A^(0)[8 GeV]. This means that high-p_T pixels are more sparsely distributed in the PYTHIA8 samples.
Recall that PYTHIA8 adopts a transverse-momentum-ordered evolution scheme. A high-p_⊥ radiation in PYTHIA8 tends to be emitted at a larger angle. In HERWIG7, the first emission in the evolution is typically a large-angle soft radiation. The asymmetry D of the A^(k)[p_T] distributions is consistent with this expectation from the shower modeling: a HERWIG7 QCD jet emits soft particles at a large angle, while a PYTHIA8 QCD jet emits higher-p_T objects at a large angle.
For the best classification performance with less simulator bias at the application stage, the distributions of the inputs, especially the MFs, have to be tuned carefully to the real experimental data. The calibration of the MF distributions will also help reduce the simulator dependency in the predictions of more general models, such as the CNN, because the MFs are important features in the jet classifications, as shown in Sec. V.

VIII. SUMMARY
In this paper, we introduce a neural network covering the space of "valuations" of jet constituents. The valuations introduced in this paper can be considered a generalization of the particle multiplicity, which is a useful variable in quark versus gluon jet tagging but is not IRC safe in general. The space of IRC-unsafe variables is less explored than that of IRC-safe variables because of its theoretical difficulties. Nevertheless, Hadwiger's theorem in integral geometry tells us some structure of the valuation space, which is of interest to this paper. The dimension of the valuation space is finite, and its basis elements are called the Minkowski functionals (MFs). In two-dimensional Euclidean space, the MFs are the Euler characteristic, perimeter, and area. We utilized these geometric features to build a neural network covering the space of valuations, and the resulting network is a multilayer perceptron taking the MFs as inputs.
In the case of jet image analysis, we showed that the MFs of dilated jet images can be represented by a chain of convolutional layers. Therefore, convolutional neural networks (CNN) can in principle utilize this information. In the semi-visible jet tagging example, the CNN finds the phase-space region of the MFs in which only the QCD background is found without difficulty. However, the MFs are not smooth functions of jet images, and the CNN has trouble accessing that information when L2 regularization is involved. By explicitly adding the MFs as inputs to the CNN, we showed that its classification performance improves.
We further built a neural network architecture combining these valuations with the IRC-safe information. In particular, we considered energy-correlator-based networks: the relation network and the energy flow network. We combined the outputs of those IRC-safe neural networks with the network covering the IRC-unsafe MFs. This combined setup has a performance comparable to that of the CNN.
The combined model is constrained compared to the CNN, but its classification performance is similar; moreover, it has computational advantages. First, it has a smaller computational complexity than the CNN, so its evaluation is fast and less memory-demanding. Second, a constrained model generally requires fewer training samples to reach its best performance. This network is especially useful when data is expensive.
Since the MFs can be embedded in the CNN, they could potentially serve as variables for interpreting the CNN. Deep neural networks are highly expressive models, but their predictions are not explainable [58,59] in general. If we are aware of potentially important features, we may distill them [58,60] using interpretable models built from those features in order to gain insight. This also allows us to control the network predictions systematically by using domain-specific knowledge. We built a network based on the MFs, which have clear geometric interpretations, and this type of network, combined with interpretable IRC-safe neural networks [19,20], can be an answer to this problem in jet tagging.
For example, the distributions of IRC-unsafe variables, including the MFs, have to be appropriately tuned in order to reduce the simulation bias. Tuning the distributions of the jet constituents themselves for that purpose is nontrivial because parton shower simulations are approximations and do not fully cover the phase space of radiated particles. The representation of the valuation space using the MFs is significantly smaller in dimension and includes important counting variables that also have geometric meanings. Tuning the distribution of MFs by reweighting [21,61] can thus be a more feasible method for controlling the systematic errors in modeling the space of IRC-unsafe features.
Finally, although we limit our discussion to the pixelated image analysis, it would also be interesting to develop a continuum version of this morphological analysis in order to compare it with graph convolutional neural networks [6]. We leave these interesting possibilities for future studies.
• Conv2D: filter size: 3 × 3, 16 filters

where Conv2D is a convolutional layer and MaxPooling2D is a max pooling layer for two-dimensional pixelated images. Zero padding is used for calculating convolutions at the pixels near the boundary. We also showed that this configuration has a classification performance similar to the ResNet and ResNeXt within our setup and training samples [21].

Energy Flow Network
The energy flow network presented in this paper is essentially an MLP on jet images. However, since its 900 inputs are far more numerous than the MFs and S_2,ab, we first compress the inputs to 50 (or 200) latent dimensions.
• Concatenate inputs and x_kin.

Again, the first dense layer is essentially the model for the linear energy correlators.

Multilayer Perceptron Classifier and Logistic Regression
The selected network outputs are then combined into a binary classifier, i.e., an MLP followed by logistic regression.
• Concatenate all the inputs and x_kin.

The setup follows [21]. We also show the number of hidden outputs at the last dense layer of the CNN.

Appendix B: Comment on Smooth Activation Functions
In the previous paper [21], we compared the RN using A^(0) and A^(1) with the CNN using the ELU activation function and found comparable performance, but this is accidental. As shown in Fig. 9, the performance of the CNN with ReLU is better than that of the CNN with ELU [62] because ReLU is not a smooth function and can model the step function better. Nevertheless, the performance of RN+MF also improves after fully considering the MFs, and it is comparable to that of the CNN with ReLU activation, as shown in the main text.