Reconstruction of decays to merged photons using end-to-end deep learning with domain continuation in the CMS detector

A novel technique based on machine learning is introduced to reconstruct the decays of highly Lorentz-boosted particles. Using an end-to-end deep learning strategy, the technique bypasses existing rule-based particle reconstruction methods typically used in high energy physics analyses. It uses minimally processed detector data as input and directly outputs particle properties of interest. The new technique is demonstrated for the reconstruction of the invariant mass of particles decaying in the CMS detector. The decay of a hypothetical scalar particle $\mathcal{A}$ into two photons, $\mathcal{A}$ $\to$ $\gamma\gamma$, is chosen as a benchmark decay. Lorentz boosts $\gamma_\mathrm{L}$ = 60-600 are considered, ranging from regimes where both photons are resolved to those where the photons are closely merged as one object. A training method using domain continuation is introduced, enabling the invariant mass reconstruction of unresolved photon pairs in a novel way. The new technique is validated using $\pi^0$ $\to$ $\gamma \gamma$ decays in LHC collision data.


Introduction
Since the standard model (SM) of particle physics continues to show excellent agreement with measurements performed at particle colliders, such as the CERN LHC, the search for physics beyond the standard model (BSM) has led experiments, such as CMS, to look beyond simple decay topologies toward more exotic, potentially overlooked ones. An important class of such exotic signatures is composed of Lorentz-boosted particles that are sufficiently energetic to induce collimation of their decay products [1]. An important consideration in developing a robust search program for such phenomena is understanding whether existing particle reconstruction algorithms are sensitive to the decays of boosted particles.
Particle-flow (PF) algorithms [2,3] are widely used frameworks for the reconstruction of particle decays in the LHC detectors. These algorithms reconstruct and identify each individual particle in an event using an optimized combination of information from the various detector elements. Although PF algorithms have been successfully used in previous CMS analyses, they lean heavily on the assumption that particle decays are well resolved and well isolated, a feature not generally true for boosted particle decays, which often exhibit some degree of merging.
In its loosest sense, merging may refer to the collimation of particles that are otherwise individually resolved but are in overlapping ensembles. For instance, in the hadronic decay of a boosted high-mass resonance to jets, the jets overlap to form a single, large-radius jet. In such cases, there is an ambiguity in how to group the decays so as to consistently reconstruct the properties of the parent particle. Although such ensemble merging may be mitigated, the corresponding analyses require a fine-tuned strategy for clustering, often at the cost of losing reconstruction efficiency in some regions of phase space. Therefore, attempts to directly probe the mass of exotic resonances are rarely pursued in the boosted regime, unlike for the reconstruction of known SM resonances [4-7].
At higher boosts, a more experimentally challenging form of merging arises when the separation between the decay products approaches the detector resolution. Consider, for instance, decay products interacting with the CMS electromagnetic calorimeter (ECAL). At separations approaching the Molière radius of the calorimeter material, the particle showers of the decay products begin to overlap. This effect can cause the merged decay products to be misreconstructed as a single-particle candidate. Such shower merging, if within a few Molière radii, might still be discernible by current dedicated shower clustering tools, even in regimes inaccessible to PF. The 3×3 clustering algorithm, used for reconstructing low-energy $\pi^0 \to \gamma\gamma$ decays [8], is a notable example. However, at separations approaching the Molière radius, such clustering tools are unable to discern distinct clusters and begin to lose their sensitivity. Thus, they are difficult to adapt to exotic searches that typically probe a wide range of particle masses and Lorentz boosts.
A special case of shower merging occurs when the decay products are collimated to the point of depositing their energy within the same calorimeter cell, that is, at separations smaller than the Molière radius. In this limit of fully overlapping particle showers, the resulting shower pattern is nearly indistinguishable from that of a true, single-particle shower. The only distinction in such cases is the slightly greater spreading in the particle shower along the principal axis connecting the points where the decay products enter the calorimeter. Reconstructing such an instrumentally merged decay remains a challenge for existing experimental tools. Any tool for this purpose must discern subtle differences in the energy distribution of the fully merged particle shower.
As a concrete example of such a signature, we consider the exotic decay $\mathrm{H}/\mathrm{X} \to \mathcal{A}\mathcal{A}$ of the Higgs boson H with a mass of about 125 GeV, or some new resonance X, to a pair of hypothetical scalar particles $\mathcal{A}$. Such decays are well motivated in extended Higgs sectors or BSM models involving two Higgs doublets, an additional singlet, or axionlike particle production [9-11]. Depending on the mass and decay mode of the $\mathcal{A}$, one or more forms of shower or instrumental merging, as described above, may occur in its decay. In particular, for decays to diphotons $\mathcal{A} \to \gamma\gamma$ [12], since photons are massless, the particle $\mathcal{A}$ can be arbitrarily light and the resulting diphoton system correspondingly merged. For the mass of the $\mathcal{A}$ satisfying $m_\mathcal{A} \lesssim 0.4$ GeV, the $\mathrm{H} \to \mathcal{A}\mathcal{A} \to 4\gamma$ signal is increasingly dominated by events where both $\mathcal{A} \to \gamma\gamma$ decays experience shower or instrumental merging. When this occurs, each $\mathcal{A} \to \gamma\gamma$ decay is misreconstructed as a single photonlike cluster, burying the $\mathrm{H} \to \mathcal{A}\mathcal{A} \to 4\gamma$ signal in existing SM $\mathrm{H} \to \gamma\gamma$ events [13,14]. Moreover, in this $m_\mathcal{A}$ regime, the decay modes of the $\mathcal{A}$ into more massive particles become inaccessible. This feature further emphasizes the importance of the diphoton decay mode at low $m_\mathcal{A}$.
In this paper, we introduce a novel particle reconstruction strategy based on a modern machine learning (ML) approach also known as deep learning. We show that such a strategy addresses the above merging challenges associated with reconstructing boosted particle decays over a wide range of boosts. We describe the concept of end-to-end particle reconstruction using ML algorithms that bypass PF [15] and train directly on minimally processed detector data to reconstruct a particle property of interest. To benchmark our technique, we reconstruct the invariant mass of simulated $\mathcal{A} \to \gamma\gamma$ decays for masses in the range $m_\mathcal{A}$ = 0.1-1.0 GeV and for boost regimes ranging from those where the diphotons are barely resolved to those where they are instrumentally merged. For particle $\mathcal{A}$ boosts $\gamma_\mathrm{L} = E_\mathcal{A}/m_\mathcal{A}$, where $E_\mathcal{A}$ is the energy of the $\mathcal{A}$, these correspond to boosts of $\gamma_\mathrm{L}$ = 60-600. A single end-to-end ML mass regressor is used, where regressor refers to the output mass of the ML algorithm. For instrumentally merged diphotons, we further develop a novel ML technique called domain continuation. This technique involves extending the training domain to nonphysical values, to include the learning of masses below the detector resolution. Using both end-to-end particle reconstruction and domain continuation, we are able to reconstruct the invariant mass of instrumentally merged diphotons in the CMS detector, a first for a particle reconstruction method. The technique is validated using $\pi^0 \to \gamma\gamma$ decays in LHC collision data collected by the CMS experiment in 2017. We achieve a substantial sensitivity gain over existing benchmark algorithms, allowing previously inaccessible boost regimes to be probed.

The CMS detector
The central feature of the CMS apparatus is a superconducting solenoid of 6 m internal diameter, providing a magnetic field of 3.8 T. Within the solenoid volume are a silicon pixel and strip tracker, a lead tungstate crystal ECAL, and a brass and scintillator hadron calorimeter, each composed of a barrel and two endcap sections. Forward calorimeters extend the pseudorapidity ($\eta$) coverage provided by the barrel and endcap detectors. Muons are detected in gas-ionization chambers embedded in the steel flux-return yoke outside the solenoid.
The ECAL consists of 75 848 lead tungstate crystals, which provide a coverage of $|\eta| < 1.48$ in the barrel region (EB) and $1.48 < |\eta| < 3.00$ in the two endcap regions. The full azimuthal range $|\phi| < \pi$ is covered for all regions.
In the barrel section of the ECAL, which is the focus of this paper, an energy resolution of about 1% is achieved for photons in the tens of GeV energy range, for those that do not convert to an $\mathrm{e}^+\mathrm{e}^-$ pair before reaching the EB. The energy resolution for photons converting in the EB is about 1.3% up to $|\eta| = 1$, rising to about 2.5% at $|\eta| = 1.48$ [8]. Each EB crystal has dimensions of $2.18 \times 2.18 \times 23$ cm$^3$. This corresponds to an angular resolution of $\Delta\eta \times \Delta\phi = 0.0174 \times 0.0174$ in the direction of the collision. The crystal Molière radius is comparable to the crystal width,
The powerful feature-learning capabilities of modern ML algorithms provide an alternative. Using inputs that are as unfiltered and informationally rich as possible, such algorithms have been shown to outperform cut-based or even ML-based methods utilizing input variables modeled by hand [28-34]. This is expected if the input variables are difficult to model by hand or if their correlations are difficult to extract. Moreover, by encompassing the majority of the reconstruction workflow into the functionality of the ML algorithm, potential information loss due to stepwise optimizations is avoided. Such end-to-end ML algorithms are thus trained directly on minimally processed detector data with the objective of predicting the desired quantity of interest. This motivates an end-to-end ML-based particle reconstruction strategy for the challenges of shower merging. We accomplish this by transforming the detector data into high-fidelity images [15,35,36], and using these to train a convolutional neural network (CNN) that outputs the final parent particle property of interest. Similar applications in the literature have employed order-invariant graph-based ML models as well [37,38].
The end-to-end ML approach has specific advantages for boosted particle decays. The first is the gain in information granularity offered by detector data for features that cannot easily be reduced to a particle-level or shower-shape-type representation. Second, the choice of a CNN, or any similar hierarchical ML architecture like a graph-based network, allows detector features to be learned across several length scales. Features from the crystal level to the cluster level and beyond are learned in a complementary way. This is particularly advantageous for boosted particle decays that may exhibit merging at multiple scales. Third, by training on minimally processed data rather than heavily filtered or clustered data, the ML algorithm may learn to adapt to more varied, higher-dimensional changes in the data, potentially developing a greater robustness to evolving data-taking conditions. Finally, there is the computational simplicity of using a single ML algorithm versus a hierarchy of algorithms each with its own intermediate optimizations. Admittedly, transitioning to a fully end-to-end ML-based particle reconstruction framework would involve an extensive overhaul of the workflows employed by the LHC experiments such as CMS. Moreover, to what extent the entire particle reconstruction workflow can or should be reduced to a single ML algorithm remains an open question.
Although a comprehensive end-to-end particle reconstruction framework is a long-term undertaking, its benefits can already be demonstrated through targeted applications. In this paper, we test the potential of this technique by investigating a previously inaccessible boost regime, namely the merged $\mathcal{A} \to \gamma\gamma$ decay. We train an ML regression algorithm to reconstruct the generated mass $m_\mathcal{A}$ using only the electromagnetic shower pattern of the merged diphoton decay in the CMS ECAL. This enables us to exploit subtle differences in the energy distribution of the ECAL shower pattern [39] to analyze even extremely merged decays.
We parametrize the diphoton merging in terms of the Lorentz boost of the particle $\mathcal{A}$. Figure 1 illustrates the typical merging regimes at three different boosts. Samples of $\mathcal{A} \to \gamma\gamma$ decays misreconstructed by PF as single photons passing candidate selection criteria (discussed in Sec. 5) are used. The $\mathcal{A} \to \gamma\gamma$ samples are derived from simulated $\mathrm{H} \to \mathcal{A}\mathcal{A}$ events in which the true positions of the diphoton decays are known. The distribution of the opening angles between the higher-$p_\mathrm{T}$ (leading, $\gamma_1$) and lower-$p_\mathrm{T}$ (subleading, $\gamma_2$) photons from the simulated particle $\mathcal{A}$ decay is shown in Fig. 1 (left column). The angles are expressed in number of ECAL crystals in the $\eta$ direction, $\Delta\eta(\gamma_1, \gamma_2)^\mathrm{gen}$, versus the $\phi$ direction, $\Delta\phi(\gamma_1, \gamma_2)^\mathrm{gen}$. Typical ECAL energy deposition patterns from a single $\mathcal{A} \to \gamma\gamma$ decay are also displayed in Fig. 1 (right column). The diphoton particle showers are resolved in the ECAL for boosts of $\gamma_\mathrm{L} \lesssim 50$ (Fig. 1, upper row), they are shower merged for $50 \lesssim \gamma_\mathrm{L} \lesssim 250$ (Fig. 1, middle row), and instrumentally merged for $\gamma_\mathrm{L} \gtrsim 250$ (Fig. 1, lower row). Note that the same $\gamma_\mathrm{L}$ value can lead to considerably different merging, depending on the opening angle between the diphotons.
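The connection between boost and merging can be made concrete with a short calculation. The sketch below is an illustration only, not part of the analysis: it uses the standard two-body kinematic result that, for massless decay products, the minimum opening angle satisfies sin(θ_min/2) = 1/γ_L, and it converts that angle into crystal widths using the 0.0174 granularity quoted earlier. Treating the opening angle in radians as a distance in the η-ϕ plane is an approximation that holds near η = 0, and the function names are hypothetical.

```python
import math

CRYSTAL_WIDTH = 0.0174  # barrel crystal size in eta and phi (value quoted in the text)

def lorentz_boost(energy, mass):
    """Lorentz boost gamma_L = E_A / m_A, as defined in the text."""
    return energy / mass

def min_opening_angle(gamma_l):
    """Minimum diphoton opening angle in radians.

    For a decay to two massless particles the minimum occurs for the
    symmetric decay, where sin(theta_min / 2) = 1 / gamma_L.
    """
    return 2.0 * math.asin(1.0 / gamma_l)

def min_separation_in_crystals(gamma_l):
    """Minimum photon separation in crystal widths (approximation valid near eta = 0)."""
    return min_opening_angle(gamma_l) / CRYSTAL_WIDTH

for gamma_l in (60, 150, 600):
    print(f"gamma_L = {gamma_l}: {min_separation_in_crystals(gamma_l):.2f} crystals")
```

Under these assumptions the minimum separation drops from roughly two crystal widths at γ_L = 60 to well below one crystal width at γ_L = 600. Asymmetric decays open wider than this minimum, which is one reason the same boost can produce different degrees of merging.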
The barely resolved decays ($\gamma_\mathrm{L} \approx 50$) represent the challenge of ensemble merging. First, rule-based methods may fail to categorize the $\mathcal{A} \to \gamma\gamma$ shower pattern as either one or two photon objects, potentially causing the event to be discarded. Second, resolved $\mathcal{A} \to \gamma\gamma$ decays can resemble both a photon conversion to an $\mathrm{e}^+\mathrm{e}^-$ pair and low-energy deposits from pileup. For these latter two possibilities, track information, if present, can potentially be exploited to distinguish between energy deposits from $\mathcal{A} \to \gamma\gamma$ decays where the photons do not convert before reaching the ECAL and those arising from single photons converting before reaching the ECAL and from pileup, as is done in PF. For the shower-merged regime, not even dedicated shower clustering tools have the sensitivity to reconstruct the invariant mass of the $\mathcal{A} \to \gamma\gamma$ system. As we will show in this paper, only the end-to-end ML approach can effectively operate in all three regimes. The $\mathcal{A} \to \gamma\gamma$ decay thus provides a simple but comprehensive test case for all the merging regimes occurring in the decays of boosted particles.

Detector images
Since the photons from $\mathcal{A} \to \gamma\gamma$ decays deposit energy primarily in the ECAL, for simplicity, we use an image construction strategy that consists of only ECAL information. We take a 32×32 matrix of ECAL crystals around the most energetic (seed) crystal of the reconstructed photon candidate and create an image array. This corresponds to an angular cone of $\sqrt{(\Delta\eta)^2 + (\Delta\phi)^2} \approx 0.3$, and ensures the subleading photon in the $\mathcal{A} \to \gamma\gamma$ decay is fully contained in the image. Although this image size is generous, our results are not strongly sensitive to it. Each pixel in the image exactly corresponds to the energy deposited in a single ECAL crystal. These energy depositions represent the actual interaction of the incident photon with the detector material (Fig. 1, right column). This approach is distinct from those that use PF candidates, in which case the $\mathcal{A} \to \gamma\gamma$ decay, reconstructed as a single particle, would be displayed as a single image pixel, by construction. No rotation is performed on the images since electromagnetic showers are not rotationally symmetric. In addition to the $\eta$-$\phi$ symmetry being broken by the CMS magnetic field, general rotations of square pixels are destructive operations that distort the particle shower pattern.
Figure 1: The left column shows the normalized distribution of opening angles between the leading ($\gamma_1$) and subleading ($\gamma_2$) photons from the particle $\mathcal{A}$ decay, expressed by the number of crystals in the $\eta$ direction, $\Delta\eta(\gamma_1, \gamma_2)^\mathrm{gen}$, versus the $\phi$ direction, $\Delta\phi(\gamma_1, \gamma_2)^\mathrm{gen}$. Note that the distributions include contributions outside of the plotted ranges and thus may not sum to unity within the displayed ranges. The right column displays the ECAL energy shower pattern for a single $\mathcal{A} \to \gamma\gamma$ decay, plotted in relative ECAL crystal index coordinates and color-coded by energy. In all cases, only decays reconstructed as a single PF photon candidate passing selection criteria are used.
For simplicity, only photons depositing energy in the barrel section of the ECAL are used in this paper. For general particle decays, the ECAL images can be combined with additional subdetector images [35], or parsed into a multi-subdetector graph if one is using such an architecture. Including tracking images, even as a complement to the $\mathcal{A} \to \gamma\gamma$ ECAL images, could enable a better accounting of contributions from $\mathrm{e}^+\mathrm{e}^-$ conversions and pileup. However, they were not included in this study.
As noted in Sec. 2, because of the $\eta$-dependent material structure of the inner tracker, electromagnetic shower development varies significantly with $\eta$. Once the 32×32 crystal matrix of the particle shower is taken, the resulting image no longer contains explicit information about where in the ECAL the shower was located. We thus perform two modifications to recover this information during the training of our ML algorithm. The first is to split the ECAL images described above into a two-layer image that contains the transverse and longitudinal components of the crystal energy, which are defined as $E\sin\theta$ and $E|\cos\theta|$, respectively, where $E$ is the crystal energy and $\theta$ is the polar angle of the crystal energy deposit position. The second is to include the crystal seed coordinates.
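The image construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CMS implementation: the barrel is modeled as a regular 170 × 360 grid of crystals in (η, ϕ), the window wraps around in ϕ and is zero-padded beyond the barrel edge in η, each crystal row is assigned a single pseudorapidity through the hypothetical helper `row_eta`, and the seed is placed at index 16 of the 32-crystal window. The two layers use sin θ = 1/cosh η and |cos θ| = |tanh η|.

```python
import math

N_ETA, N_PHI = 170, 360   # assumed barrel grid: eta rows x phi columns
CRYSTAL_WIDTH = 0.0174    # crystal size in eta and phi (value quoted in the text)
WINDOW = 32               # image side length in crystals (value quoted in the text)

def row_eta(ieta):
    """Approximate pseudorapidity of crystal row ieta (hypothetical indexing)."""
    return (ieta - N_ETA / 2 + 0.5) * CRYSTAL_WIDTH

def crop_window(energies, seed_ieta, seed_iphi, size=WINDOW):
    """Cut a size x size crystal window around the seed crystal."""
    half = size // 2
    window = []
    for ieta in range(seed_ieta - half, seed_ieta + half):
        row = []
        for iphi in range(seed_iphi - half, seed_iphi + half):
            if 0 <= ieta < N_ETA:
                row.append(energies[ieta][iphi % N_PHI])  # phi is periodic
            else:
                row.append(0.0)  # zero-pad beyond the barrel edge in eta
        window.append(row)
    return window

def split_layers(window, seed_ieta, size=WINDOW):
    """Split a window into transverse (E sin theta) and longitudinal (E |cos theta|) layers."""
    half = size // 2
    transverse, longitudinal = [], []
    for k, row in enumerate(window):
        eta = row_eta(seed_ieta - half + k)
        sin_theta = 1.0 / math.cosh(eta)
        abs_cos_theta = abs(math.tanh(eta))
        transverse.append([e * sin_theta for e in row])
        longitudinal.append([e * abs_cos_theta for e in row])
    return transverse, longitudinal

# Toy example: two deposits straddling the phi wrap-around.
energies = [[0.0] * N_PHI for _ in range(N_ETA)]
energies[85][359] = 10.0
energies[85][0] = 5.0
window = crop_window(energies, seed_ieta=85, seed_iphi=359)
e_t, e_z = split_layers(window, seed_ieta=85)
```

In this toy example the two deposits end up in adjacent pixels of the window even though they sit on opposite sides of the ϕ boundary. The seed coordinates (here 85 and 359) would be kept alongside the image as the second piece of position information.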
The calibrated ECAL detector data exist at various levels of processing: from minimally processed to the filtered and clustered version more easily accessible at the analysis level. Since the clustered detector data are historically optimized for the needs of PF, it is worth revisiting this choice for the end-to-end ML technique. As discussed in Appendix A.1, training on minimally processed instead of clustered data significantly improves our results. We thus emphasize our earlier statement that using minimally processed detector data is critical in realizing the full potential of ML. We make exclusive use of minimally processed detector data in this paper. One caveat is that accessing such data is becoming logistically challenging because of the trend toward more compact CMS file formats, necessitated by the growing volume of LHC data.

Candidate selection and datasets
To address the most challenging diphoton particle decays, we require that the diphoton system be misreconstructed by the PF algorithm as a single photon candidate, labeled $\Gamma$. As a consequence of the PF ambiguities in categorizing low-boost $\mathcal{A} \to \gamma\gamma$ decays ($\gamma_\mathrm{L} \lesssim 50$) as either one- or two-photon objects, this requirement does not necessarily imply that the diphotons are merged. For diphotons that are barely resolved, if one of the photons is too low in energy ($p_\mathrm{T} < 10$ GeV), it is not reconstructed by PF as a separate photon even if it is sufficiently separated from the other photon. Accepting such candidates results in a range of boost regimes being selected.
Reconstructed photons are also required to pass shower shape and isolation criteria that would accept single photons with > 90% efficiency. These photon selection criteria are similar to those used by CMS in the SM $\mathrm{H} \to \gamma\gamma$ analysis [13], which emphasizes how a merged signal might be buried in such events. For each selected photon candidate $\Gamma$, a single detector image is constructed (described in Sec. 4), which is then passed to the ML algorithm for either training or inference.
For training, we use a sample of simulated $\mathcal{A} \to \gamma\gamma$ decays in which the particle $\mathcal{A}$ decays promptly. The sample includes pileup but otherwise no additional particles. An ensemble of continuously distributed masses $m_\mathcal{A}$ is used, such that the ML model described in Sec. 6 does not learn the discretization of the mass points. These samples are generated with $p_{\mathrm{T},\mathcal{A}}$ = 20-100 GeV, $m_\mathcal{A}$ = 0-1.6 GeV, and $|\eta_\mathcal{A}| < 1.4$. This corresponds to Lorentz boosts with $\gamma_\mathrm{L}$ in the approximate range 10-1000 and adequately covers the regimes of interest (discussed in Sec. 3).
After applying photon selection criteria, samples with closely merged photons ($\gamma_\mathrm{L} \gtrsim 150$) are preferentially selected over samples with resolved photons ($\gamma_\mathrm{L} \lesssim 50$), since the latter are more likely to fail single-photon shower-shape criteria. This tends to sculpt the phase space in ($p_{\mathrm{T},\mathcal{A}}$, $m_\mathcal{A}$), which can result in the mass regressor preferring to output masses in regions with higher populations. To prevent this from happening, the training samples are generated to ensure that the ($p_{\mathrm{T},\mathcal{A}}$, $m_\mathcal{A}$) phase space is uniformly distributed after the photon selection criteria have been applied. After applying these requirements, approximately 780,000 $\mathcal{A} \to \gamma\gamma$ decays are available for training, a sample of about 26,000 decays for validation, and another 26,000 for testing. In addition, as explained in Sec. 6.2, the training samples are augmented with true, single-photon samples with similar kinematic properties. After the selection requirements, there are 150,000, 26,000, and 26,000 simulated single photons for training, validation, and testing, respectively. The validation set is used to optimize the ML model hyperparameters, and the test set is used to assess the performance of the chosen model on a statistically independent sample.
To benchmark the ML technique, we use both simulated and actual CMS collision data. The benchmark studies are described further in Sec. 6.4. In addition to pileup, underlying events are included in the simulated $\mathcal{A} \to \gamma\gamma$ samples to make them more realistic. The simulated samples are obtained from $\mathrm{H} \to \mathcal{A}\mathcal{A} \to 4\gamma$ events generated at fixed masses of $m_\mathcal{A}$ = 0.1, 0.4, and 1.0 GeV, all with prompt $\mathcal{A}$ decays. The decays correspond to median boosts of $\langle\gamma_\mathrm{L}\rangle$ = 600, 150, and 60, respectively, covering a similar boost range as used for training. The collision data contain events enriched in $\pi^0 \to \gamma\gamma$ decays selected from the 2017 CMS data-taking period [40], corresponding to an integrated luminosity of 41.5 fb$^{-1}$. To enrich the data sample with $\pi^0 \to \gamma\gamma$ decays from hadronic jets, as explained in Sec. 8.2, only reconstructed photon candidates with $p_\mathrm{T}$ = 20-35 GeV are used. Shower-shape and isolation requirements, which are slightly tighter than for the $\mathcal{A} \to \gamma\gamma$ analysis described above, are applied.
To study the robustness of the ML technique as a function of various parameters, as described in Sec. 9, events enriched with electrons from $\mathrm{Z} \to \mathrm{e}^+\mathrm{e}^-$ decays are used. The electrons are obtained from 2017 data and simulation using the "tag-and-probe" method [41]. The selected electron is required to coincide with a photon candidate passing the photon selection criteria described earlier.
For all simulated samples, events are generated with corresponding 2017 data-taking pileup conditions and fully simulated CMS geometry and detector response [17].

Mass regression model and training
A ResNet CNN architecture [42] forms the basis of our mass regression model. Although other ML architectures exist, including those that use graph-based techniques, the emphasis in this paper is placed on the general reconstruction method rather than the optimization of the ML model.
The energy deposits in the ECAL detector images are first scaled by a sample-wide constant so that the sample-wide energy distribution is approximately within the [0, 1] interval. Each image associated with a reconstructed photon candidate $\Gamma$ is then passed to the ResNet model, which outputs to a global maximum pooling layer. A more detailed description of the ResNet model used is presented in Ref. [15]. The outputs are then concatenated with the crystal seed coordinates of the photon candidate. The seed coordinates are also scaled to lie within the [0, 1] interval. The concatenated outputs are then fully connected to a final output node that gives the predicted or regressed value of the scalar-particle mass $m_\mathcal{A}$ for that photon candidate. The above ML mass regressor contains 89k trainable parameters.
To train the mass regressor, the regressed mass $m_\Gamma$ is compared with the true (generated) $m_\mathcal{A}$ by calculating the absolute error loss function $|m_\Gamma - m_\mathcal{A}|$, averaged over the training batch. Other loss functions are equally performant. This loss function is then minimized using the Adam optimizer [43]. To facilitate convergence, the mass values are also scaled to lie within [0, 1] before they are passed to the optimizer. This procedure represents our basic training strategy.
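The target scaling and the batch-averaged absolute error can be written down compactly. The following is only a sketch of the quantities named above: a linear map of the mass onto [0, 1] is assumed (the text does not specify the functional form), the default range is the 0-1.6 GeV training range of the basic strategy, and the function names are hypothetical. The network and the optimizer step are omitted.

```python
M_LO, M_HI = 0.0, 1.6  # GeV, training mass range of the basic strategy (from the text)

def scale_mass(mass, lo=M_LO, hi=M_HI):
    """Map a mass in [lo, hi] linearly onto [0, 1] (assumed form of the scaling)."""
    return (mass - lo) / (hi - lo)

def unscale_mass(value, lo=M_LO, hi=M_HI):
    """Invert scale_mass to recover a mass in GeV from the network output."""
    return lo + value * (hi - lo)

def absolute_error_loss(predicted, target):
    """Absolute error |m_Gamma - m_A| averaged over the batch."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

# Toy batch with made-up numbers, in GeV.
true_masses = [0.1, 0.4, 1.0]
regressed_masses = [0.15, 0.38, 1.05]
loss = absolute_error_loss(
    [scale_mass(m) for m in regressed_masses],
    [scale_mass(m) for m in true_masses],
)
```

Because the map is linear, the loss in scaled units is simply the loss in GeV divided by the width of the mass range, so the scaling changes the numerical conditioning of the optimization but not the location of its minimum.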

Performance limitations when m A → 0
Relying solely on this basic training strategy has significant performance limitations. As shown in Fig. 2 (left), naively training the mass regressor as described above results in a nonlinear response near either boundary of the mass regression range. At the high-$m_\mathcal{A}$ boundary, this issue can be resolved by trivially extending the training mass range. Thus, if one wanted to obtain a usable regression range up to $m_\mathcal{A} \approx 1.2$ GeV, one would train with a sample including higher masses, as we have, up to $m_\mathcal{A} \approx 1.6$ GeV, depending on the mass resolution at the upper mass range. Predictions in the extended mass range are then discarded during the physics analysis. This approach cannot, however, be used in any obvious way for the low-$m_\mathcal{A}$ boundary, since it is constrained by the physical requirement $m_\mathcal{A} > 0$. The mass region $m_\Gamma \lesssim 200$ MeV, of considerable theoretical interest for the diphoton decay mode in BSM models (discussed in Sec. 1), would therefore be inaccessible. The use of the mass regressor as a tool for reconstructing $\pi^0 \to \gamma\gamma$ decays would also be lost. Moreover, significant biases in the regressed masses of true photons would arise. As illustrated in Fig. 2 (right), photons would be regressed as a peak around $m_\Gamma \approx 200$ MeV, reducing even further the usable range of the mass regressor when single-photon backgrounds are included.

Domain continuation to negative masses
Fundamentally, the above boundary problem arises because, when training the mass regressor, the physically observable $\mathcal{A} \to \gamma\gamma$ invariant mass distribution becomes underrepresented for samples with $m_\mathcal{A} < \sigma(m_\mathcal{A})$, where $\sigma(m_\mathcal{A})$ is the mass resolution. This issue is illustrated in Fig. 3 (left). For samples where $m_\mathcal{A} \approx \sigma(m_\mathcal{A})$, the full, physically observable mass distribution ($f_\mathrm{obs}$), visualized as a Gaussian distribution, is barely represented in the training set. As $m_\mathcal{A} \to 0$, shown in Fig. 3 (middle), only half of the mass distribution is now observable. For these underrepresented samples, the mass regressor defaults to the closest mass value whose observable distribution is well represented, $m_\mathcal{A} \approx \sigma(m_\mathcal{A})$, since this will appear most similar. This results in a gap above $m_\Gamma \approx 0$ and an accumulation of masses at $m_\Gamma \approx 200$ MeV. More generally, this boundary problem manifests itself when regressing a quantity $q$, with resolution $\sigma(q)$, over the range $(a, b)$, for samples with $q \lesssim a + \sigma(q)$ or $q \gtrsim b - \sigma(q)$. This effect only becomes negligible at either boundary in the limit $\sigma(q) \ll a, b$.
This motivates a solution for the low-$m_\mathcal{A}$ boundary problem by extending the regression range below $m_\mathcal{A} = 0$, into the nonphysical domain. This extended range is then populated with "topologically similar" samples. We thus augment the training set with samples artificially and randomly labeled with negative masses. During inference, we remove the nonphysical predictions $m_\Gamma < 0$. As a topologically similar sample, either a sample of decays with a fixed mass $m_\mathcal{A} \approx 0.001$ MeV or a sample of single photons can be used. In this paper, we use the latter, although we find either method works well. If we require that the "negative mass" samples have the same mass density as the "positive mass" samples in the training set (cf. Fig. 3, right), then only a single hyperparameter is needed, the minimum artificial mass value, min($m_\mathcal{A}$). This can be set by choosing the negative value with the smallest magnitude that closes the low-$m_\mathcal{A}$ gap (cf. Fig. 2, left) and linearizes the mass response in the physical domain, $m_\Gamma > 0$. We find a value of min($m_\mathcal{A}$) = −0.3 GeV to be sufficient. Other applications may seek to optimize both the minimum artificial mass value and the number density of the augmented samples. Note that having the augmented samples carry negative mass values is simply a consequence of the low-$m_\mathcal{A}$ boundary coinciding with $m_\Gamma = 0$. Had the boundary coincided with a positive mass, positive artificial mass values would be involved as well.
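The augmentation step can be illustrated with a short sketch. It assumes that "randomly labeled" means labels drawn uniformly between min($m_\mathcal{A}$) and 0, and that "same mass density" means the same number of samples per GeV as in the positive-mass training sample. The function names and the string placeholders standing in for photon images are hypothetical.

```python
import random

M_MAX = 1.6            # GeV, upper edge of the training mass range (from the text)
M_MIN_ARTIFICIAL = -0.3  # GeV, minimum artificial mass value (from the text)

def n_negative_samples(n_positive, m_max=M_MAX, m_min=M_MIN_ARTIFICIAL):
    """Number of negative-mass samples giving the same density (samples per GeV)."""
    density = n_positive / m_max
    return round(density * abs(m_min))

def label_with_negative_masses(photon_images, m_min=M_MIN_ARTIFICIAL, seed=0):
    """Pair each single-photon image with a random artificial mass in [m_min, 0]."""
    rng = random.Random(seed)
    return [(image, rng.uniform(m_min, 0.0)) for image in photon_images]

def remove_nonphysical(regressed_masses):
    """At inference, drop predictions in the nonphysical domain m_Gamma < 0."""
    return [m for m in regressed_masses if m >= 0.0]

# Toy usage with placeholder images.
augmented = label_with_negative_masses(["photon_0", "photon_1", "photon_2"])
kept = remove_nonphysical([-0.12, 0.05, 0.40])
```

With the roughly 780,000 positive-mass training decays quoted earlier, this density-matching rule gives about 146,000 negative-mass samples, which is of the same size as the 150,000 single-photon training samples quoted in the dataset description.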
The above procedure effectively tricks the mass regressor into seeing a full invariant mass distribution for all physical $\mathcal{A} \to \gamma\gamma$ decays, even when they reside below the detector mass resolution. As a result, the full low-$m_\mathcal{A}$ regime becomes accessible. In addition, this procedure provides a simple way for suppressing single-photon backgrounds by requiring $m_\Gamma > 0$. Since single photons will tend to be regressed with negative masses, removing samples with $m_\Gamma < 0$ reduces single-photon contributions in a mass-decorrelated way. The only trade-off is that low-$m_\mathcal{A}$ samples incur a selection efficiency to be regressed within $m_\Gamma > 0$. However, this is expected for most $\mathcal{A} \to \gamma\gamma$ merged cases that cannot be distinguished from true photons.
By analogy to complex analysis, we denote the above procedure as domain continuation. Similar procedures, however, also occur in the statistical tests used in high energy physics [44]. Our final training strategy implements domain continuation on top of the basic training strategy described at the beginning of this section.

Out-of-sample response
An important feature of ML-based regression algorithms is that their predictions are bounded by the regression range on which they were trained. This is true even when out-of-sample decays, or decays not represented in the training set, are given to the mass regressor. If this fact is not addressed, unexpected peaks and features in the regressed mass spectrum can potentially appear. Although hadronic jets are indeed out of sample, it is desirable not to reject them at the stage of the mass regressor in order to enable the reconstruction of, e.g., embedded $\pi^0 \to \gamma\gamma$ decays. If desired, these can instead be suppressed by modifying the photon selection criteria, for instance, as described in Sec. 8.2.
Furthermore, $\mathcal{A} \to \gamma\gamma$ decays from more massive $\mathcal{A}$ particles than those used during training are a potential issue. These will be regressed as a false mass peak near the upper $m_\mathcal{A}$ boundary. For this and other reasons stated earlier, when addressing the boundary problem at the upper mass range, we ignore predictions with $m_\Gamma > 1.2$ GeV (Fig. 3, right). Lastly, to suppress single photons, as noted above, we ignore predictions with $m_\Gamma < 0$. During inference, the use of the mass regressor is thus limited to samples regressed within the region of interest (ROI): $m_\mathcal{A}$-ROI $\in [0, 1.2]$ GeV. The impact of this method on the sample selection efficiency is estimated in Sec. 7.

Evaluation
In Sec. 7, we validate the ML training using a number of figures of merit. We primarily use the mean absolute error, MAE = $\langle |m_\mathcal{A} - m_\Gamma| \rangle$ over the test set, but also consider its normalized counterpart, the mean relative error, MRE = $\langle |m_\mathcal{A} - m_\Gamma| / m_\mathcal{A} \rangle$. The MAE (MRE) approximately corresponds to the mean absolute (relative) mass resolution. If the mass distribution at a fixed $m_\mathcal{A}$ is treated as a signal model, a lower MRE, similar to a lower relative mass resolution, implies a better signal significance assuming some fixed background contribution. The regression efficiency, or the number of samples regressed within $m_\mathcal{A}$-ROI $\in [0, 1.2]$ GeV, divided by the total number of samples considered, is also used as a figure of merit. Similarly, if the mass distribution at a fixed $m_\mathcal{A}$ is treated as a signal model, a higher regression efficiency implies a higher signal significance for a given background contribution.
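The three figures of merit follow directly from their definitions. The sketch below implements them as stated above, with hypothetical function names and made-up toy numbers. Whether the MAE and MRE are evaluated before or after the ROI requirement is not specified in this passage, so here they are computed over all supplied pairs.

```python
ROI = (0.0, 1.2)  # GeV, region of interest for the regressed mass (from the text)

def mean_absolute_error(true_masses, regressed_masses):
    """MAE = <|m_A - m_Gamma|>."""
    pairs = list(zip(true_masses, regressed_masses))
    return sum(abs(t - r) for t, r in pairs) / len(pairs)

def mean_relative_error(true_masses, regressed_masses):
    """MRE = <|m_A - m_Gamma| / m_A>; requires m_A > 0 for every sample."""
    pairs = list(zip(true_masses, regressed_masses))
    return sum(abs(t - r) / t for t, r in pairs) / len(pairs)

def regression_efficiency(regressed_masses, roi=ROI):
    """Fraction of samples whose regressed mass falls inside the ROI."""
    inside = sum(1 for m in regressed_masses if roi[0] <= m <= roi[1])
    return inside / len(regressed_masses)

# Toy numbers in GeV: one prediction below 0 and one above 1.2 GeV.
true_masses = [0.1, 0.4, 1.0, 1.0]
regressed_masses = [-0.05, 0.38, 1.05, 1.25]

mae = mean_absolute_error(true_masses, regressed_masses)
mre = mean_relative_error(true_masses, regressed_masses)
efficiency = regression_efficiency(regressed_masses)
```

In the toy example two of the four predictions fall outside the ROI, giving a regression efficiency of 0.5. The low-mass sample dominates the MRE because the same absolute error is divided by a much smaller $m_\mathcal{A}$.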
In Sec. 8, we benchmark the physics performance of the end-to-end ML mass regressor (end-to-end) by comparing its response with both a traditional neural network-based mass regressor (photon NN) and a shower cluster-based algorithm used for mass reconstruction (3×3 algorithm). The photon NN is trained on a mix of 11 shower-shape and isolation variables, identical to those used by CMS for multivariate photon tagging [13]. These variables are passed to a fully connected neural network of 128 nodes and three hidden layers that regresses the scalar-particle mass. For a fair comparison, it is also trained with domain continuation. The resulting performance was insensitive to small increases or decreases in network size. The 3×3 algorithm is similar to that used by CMS for the reconstruction of low-energy π 0 → γγ decays in the calibration of the ECAL [8]. It first identifies a local crystal energy maximum (seed). If the seed energy is above some energy threshold, then a 3×3 crystal matrix (cluster) around the seed is formed. If a pair of nearby clusters is found, the reconstructed mass is calculated as the invariant mass of the two clusters. If a pair of clusters cannot be found, a default mass value outside of the m A -ROI range is passed.
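The steps of the 3×3 algorithm described above can be sketched as follows. This is an illustrative toy, not the CMS implementation: the crystal pitch of 0.0174 in η and ϕ, the seed threshold, the placeholder value −1.0, the use of the seed position as the cluster position, and the choice of the two most energetic clusters as the pair are all assumptions made for the example.

```python
import math

CRYSTAL = 0.0174     # assumed barrel crystal pitch in eta and phi
DEFAULT_MASS = -1.0  # assumed placeholder value outside the ROI
NEIGHBOURS = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]


def find_seeds(grid, seed_thr):
    """Crystals above threshold that exceed all eight neighbours."""
    seeds = []
    for i in range(1, len(grid) - 1):          # grid edges are skipped
        for j in range(1, len(grid[0]) - 1):
            e = grid[i][j]
            if e <= seed_thr:
                continue
            if all(e > grid[i + di][j + dj]
                   for di, dj in NEIGHBOURS if (di, dj) != (0, 0)):
                seeds.append((i, j))
    return seeds


def cluster_energy(grid, seed):
    """Sum of the 3x3 crystal matrix centred on the seed."""
    i, j = seed
    return sum(grid[i + di][j + dj] for di, dj in NEIGHBOURS)


def mass_3x3(grid, seed_thr=1.0):
    """Invariant mass of the two leading 3x3 clusters, else a default."""
    clusters = sorted(((cluster_energy(grid, s), s)
                       for s in find_seeds(grid, seed_thr)), reverse=True)[:2]
    if len(clusters) < 2:
        return DEFAULT_MASS
    (e1, (i1, j1)), (e2, (i2, j2)) = clusters
    eta1, eta2 = i1 * CRYSTAL, i2 * CRYSTAL
    dphi = (j1 - j2) * CRYSTAL
    pt1, pt2 = e1 / math.cosh(eta1), e2 / math.cosh(eta2)
    # Massless two-body formula: m^2 = 2 pT1 pT2 (cosh(d_eta) - cos(d_phi))
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(dphi))
    return math.sqrt(max(m2, 0.0))


# Toy shower: two single-crystal deposits five crystals apart in phi.
grid = [[0.0] * 12 for _ in range(9)]
grid[4][3], grid[4][8] = 30.0, 20.0
resolved_mass = mass_3x3(grid)

# A single deposit has no partner cluster, so the default is returned.
merged = [[0.0] * 12 for _ in range(9)]
merged[4][5] = 50.0
merged_mass = mass_3x3(merged)
```

The second call shows the failure mode discussed in the text: once the two photons deposit their energy in a single cluster, no pair exists and no mass can be computed.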
Lastly, in Sec. 9, to assess the robustness and generalizability of the end-to-end mass regressor, we compare its mass response versus various kinematic and detector quantities of interest.We conclude with a comparison of the regressed mass spectrum in data versus simulation.

Training validation
To validate the training of the mass regressor and to characterize its performance, we use a test sample of A → γγ decays with a continuous uniform m A distribution. The regressed versus generated mass is shown in Fig. 4 (upper). We observe a linear and well-behaved mass response throughout the m A -ROI range. Since the regressed mass range spans more than an order of magnitude, some variation can be seen in the mass resolution, and thus in the shape of the regressed m Γ versus m A response. Notably, the mass regressor is able to probe the low-m A regime where it exhibits a gentle and gradual loss in resolution upon approaching the m Γ = 0 boundary. As discussed in Sec. 6, this behavior is the result of using domain continuation. This performance confirms the ability of the end-to-end ML technique to access the highest boost regimes, where shower and instrumental merging are present, yet maintain performance into the high-m A regime, where the particle showers become resolved.
The MAE and MRE as functions of the generated mass are displayed in Fig. 4 (lower left). The MAE varies from 0.14 (0.20) GeV for m A in the range 0.1 (1.2) GeV, corresponding to mean boosts of ⟨γ L ⟩ = 600 (50), respectively. In general, the absolute mass resolution worsens with increasing m A , as reflected in the MAE trend. However, the relative mass resolution tends to improve with mass, as is evident in the MRE values, converging to about 20% for m A = 1.2 GeV. If the A → γγ mass distribution is used as a signal model, then, for a fixed regression efficiency, a lower relative resolution implies better signal significance, given some fixed background contribution. Notably, as shown in Fig. 4 (lower left), for m A ≲ 0.3 GeV, the MAE starts worsening with decreasing mass. This can be attributed to the gradual deterioration of the mass regressor below the detector's mass resolution.
These figures of merit are achieved with a regression efficiency between 70 and 95%, as shown in Fig. 4 (lower right). The regression efficiency for a given sample as a function of m A is defined as the number of samples in a particular m A bin that have m Γ within m A -ROI, divided by the total number of samples in that bin. If the A → γγ mass distribution is used as a signal model, then, for a fixed mass resolution, a higher regression efficiency implies better signal significance, for some fixed background contribution. The efficiency is primarily driven by how much of the mass peak can fit within m A -ROI. Thus, it is highest at the midway point of m A -ROI and falls off to either side. The relatively poorer mass reconstruction at low m A causes the efficiency to fall off more steeply than it does at high m A . About 50% of single photons are rejected by the m A -ROI requirement, as seen by the hatched region in Fig. 4 (lower right). Photons with m Γ > 0 are primarily due to e + e − conversions.

Physics validation
To validate the physics performance of the end-to-end mass regressor, we compare it with two traditional reconstruction strategies: a photon NN-based mass regressor trained on shower-shape and isolation variables, and a 3×3 shower clustering algorithm (described in Sec. 6.4). Although the figures of merit from the previous section provide a high-level characterization of the reconstruction performance, the ultimate physical quantity of interest is the regressed mass spectrum itself, which is the focus of this section.

Simulation
The validation on simulated data compares the mass spectra for A → γγ decays for various fixed-mass values. As described in Sec. 5, we use A → γγ samples obtained from simulated H → AA → 4γ events with masses m A = 0.1, 0.4, and 1.0 GeV. In these events, the particle A energy is distributed around a median of E A ≈ m H /2 ≈ 60 GeV, corresponding to median boosts of ⟨γ L ⟩ ≈ 600, 150, and 60 for the respective m A masses. In addition to the photon selection criteria, event selection criteria similar to those used in the CMS H → γγ analysis are applied. The reconstructed mass spectra are shown in Fig. 5 for the different algorithms and m A mass values. For each mass value, representing a different median boost regime, the samples are further broken down by ranges of reconstructed p T, Γ , to highlight the stability of the mass spectrum with energy. These p T, Γ ranges are: (i) low-p T, Γ : 30 < p T, Γ < 55 GeV, (ii) mid-p T, Γ : 55 < p T, Γ < 70 GeV, (iii) high-p T, Γ : 70 < p T, Γ < 100 GeV, and (iv) ultra-p T, Γ : p T, Γ > 100 GeV. Some overlap in the boost values across different m A masses is to be expected. Note that the mass regressor has been trained on samples with p T, Γ < 100 GeV and that reconstructed candidates only in the range |η Γ | < 1.44 are used.
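The quoted median boosts follow from γ L = E A /m A , and the degree of merging can be gauged from the minimum opening angle of a two-photon decay, sin(θ min /2) = 1/γ L , reached for a symmetric decay. The short sketch below works through this arithmetic. The crystal pitch of 0.0174 is an assumed value for the ECAL barrel granularity, and comparing the angle with the pitch is only meaningful near η ≈ 0.

```python
import math

CRYSTAL = 0.0174  # assumed barrel crystal pitch in eta and phi


def boost(energy, mass):
    """Lorentz boost gamma_L = E / m (same units for both)."""
    return energy / mass


def min_opening_angle(gamma):
    """Minimum two-photon opening angle in radians (symmetric decay)."""
    return 2.0 * math.asin(1.0 / gamma)


for m_a in (0.1, 0.4, 1.0):
    g = boost(60.0, m_a)  # median E_A of about 60 GeV, as in the text
    theta = min_opening_angle(g)
    print(f"m_A = {m_a} GeV: gamma_L = {g:.0f}, "
          f"theta_min = {theta / CRYSTAL:.2f} crystal widths")
```

Under these assumptions the minimum separation is roughly 0.2, 0.8, and 1.9 crystal widths for m A = 0.1, 0.4, and 1.0 GeV, which matches the qualitative labels of instrumentally merged, shower merged, and barely resolved decays.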
For boosts ⟨γ L ⟩ ≈ 60 and m A = 1.0 GeV, only the end-to-end algorithm (Fig. 5, upper left) consistently reconstructs a mass peak for all p T, Γ ranges. The position of the mass peak remains stable, with the resolution improving in the high-p T, Γ category. The end-to-end regression performs best when the A → γγ decay products are moderately merged, and neither fully resolved nor fully merged. The mass peak in the ultra-p T, Γ category is well reconstructed despite being outside the trained phase space. This demonstrates that the phase space extrapolation is [...]
For m A = 1.0 GeV, the photon NN (Fig. 5, upper middle) has difficulty reconstructing the mass peak, except at the higher-p T, Γ categories. This can be understood in terms of the information content of the input variables on which the algorithm was trained. At higher p T, Γ , the two photons are more likely to be moderately merged so that their showers are contained within the 5×5 crystal block where the shower-shape variables are defined. At lower p T, Γ , the two photons are more often resolved so that the lower-energy photon shower falls outside the 5×5 crystal block. The photon NN must then rely on the isolation variables, which are defined much more coarsely over a cone of $\sqrt{(\Delta\eta)^2 + (\Delta\phi)^2} < 0.3$ about the seed crystal. Since these variables have much less discriminating power, this results in a steep falloff in reconstruction performance. To improve the performance, the photon NN could be augmented with the momentum components of the lower-energy photon, in instances where the PF algorithm is able to reconstruct it.
Lastly, for m A = 1.0 GeV, the 3×3 algorithm (Fig. 5, upper right) is the only one competitive with the end-to-end method for lower p T, Γ .As the photon clusters become resolved, the 3×3 method thus becomes an effective tool for mass reconstruction.However, as soon as the clusters begin to merge at higher p T, Γ , a sudden dropoff in reconstruction efficiency occurs since the 3×3 algorithm is unable to compute a mass for a single cluster.A spurious peak develops at m Γ ≈ 0.5 GeV for decays with sufficient showering prior to the ECAL.The 3×3 method is thus only useful for a limited range of low boosts.The comparatively weaker performance of the end-to-end method to the 3×3 one at low-p T, Γ is attributable to the underrepresentation of relevant boosts in the training sample.The optimization of the end-to-end technique for these lower boosts is beyond the scope of this paper.
For boosts ⟨γ L ⟩ ≈ 150 and m A = 0.4 GeV, the end-to-end method (Fig. 5, second row, left) is able to reconstruct the mass peak with full sensitivity across most of the p T, Γ ranges.Only at the highest p T, Γ range does the mass peak significantly degrade, although it is still reasonably well behaved.Training with higher p T, Γ could potentially improve this behavior.The photon NN performs its best in this regime (Fig. 5, second row, middle) because a majority of the photon showers fall within the 5×5 crystal block.However, the mass resolution is still significantly worse compared with the end-to-end method.The 3×3 algorithm (Fig. 5, second row, right) is barely able to reconstruct a mass peak for these boosts.We recall that, if the 3×3 algorithm is unable to find a pair of clusters, an invariant mass cannot be calculated, and a default mass value outside m A -ROI is instead passed.
For boosts ⟨γ L ⟩ ≈ 600 and m A = 0.1 GeV, the end-to-end method (Fig. 5, third row, left) reaches the limits of its sensitivity, although it is still usable.We attribute the performance of the end-to-end method to its ability to discern subtle differences in the energy distribution of the fully merged particle shower, which we expect to have more smearing along the principal axis connecting the A → γγ diphotons.Notably, even at these boosts, the position of the mass peak measured with the end-to-end method remains stable with p T, Γ .This is not the case for the photon NN (Fig. 5, third row, middle) whose peak becomes erratic and displaced with increasing p T, Γ .The 3×3 method is not able to calculate a mass at this level of merging for the same reasons stated earlier.
For reference, the regressed mass spectrum for single, isolated photons from H → γγ decays is shown in Fig. 5 (lower row). Both the end-to-end (left) and photon NN (middle) methods are able to regress to the m Γ ≈ 0 GeV boundary, with a smoothly falling distribution since they were trained with domain continuation (cf. Fig. 2, right). The remaining photons within m A -ROI come from photon conversions that acquire an effective mass because of nuclear interactions.
Summarizing the validation performance on simulated data, the end-to-end ML technique is the only one that is able to robustly and consistently probe boost regimes ranging from resolved to instrumentally merged showers.

Data
To validate the findings described above from simulated events, we perform a cross-check using γ + jet events from CMS data, where the jet is reconstructed as a photon candidate.The CMS dataset was acquired in 2017 at the LHC with proton-proton collisions at √ s = 13 TeV.In Sec.9.2, we also compare the results from this dataset with a similar one acquired in the 2018 LHC running.These two datasets correspond to integrated luminosities of 41.5 and 56.9 fb −1 , respectively.If the jet contains an energetic, collimated neutral-meson decay (p T, Γ ≳ 20 GeV), typically a π 0 → γγ or η → γγ, it will be misreconstructed as a single photon Γ and the event will pass a diphoton trigger.Since the energy of the jet will, in general, be shared among several constituent particles, the π 0 is more likely to be reconstructed as the lower-energy photon in the event.Thus, a data sample enriched in merged photons is obtained by selecting events passing a diphoton trigger and selecting the lower-energy reconstructed photon, which we additionally require to pass our photon selection criteria.The selected sample is then given to the m Γ regressor, whose output we study below.Further details about the trigger and photon selection criteria can be found in the H → γγ analysis [13].We emphasize that the mass regressor is being used to regress the m Γ of individual reconstructed photon candidates, which we assume to be merged photons, not the invariant mass of the reconstructed diphoton event itself.As before, only reconstructed candidates in the range |η Γ | < 1.44 are used.
An important caveat in regressing the m Γ of energetic photons (p T, Γ ≳ 20 GeV) within jets is the presence of other hadrons within the jet. At these energies, we emphasize that it is no longer the case that neutral-meson decays are well isolated in jets, a main point of distinction compared with the isolated A → γγ decays used to train the mass regressor. In general, the neutral-meson decay will be collimated with other hadrons, including, potentially, several merged π 0 decays. The effect of these additional hadrons is to smear and distort the resulting m Γ spectrum and introduce an energy dependence in the m Γ value. We therefore restrict our study to 20 < p T, Γ < 35 GeV and require tighter shower-shape criteria, to increase the contribution from well-isolated π 0 → γγ decays that more closely resemble the A → γγ decay. The low-p T threshold is dictated by the chosen diphoton trigger. Although these tighter criteria mitigate the stated effects, the impact of these effects remains visible.
Within the above restricted p T, Γ range, the π 0 is boosted to the approximate range γ L = 150-250, putting its invariant mass reconstruction out of reach of all but the end-to-end mass regressor.Also present is the higher-mass η, which, though produced with a much lower cross section, is boosted to only about the range γ L = 30-60, just within reach of the 3×3 algorithm.
As clearly seen in Fig. 6, the end-to-end method is able to reconstruct a prominent π 0 peak. Indeed, it is the only algorithm able to do so. The π 0 peak appears more prominent than the corresponding A peak at m A = 0.1 GeV (cf. Fig. 5, third row, left) because of the higher mass and lower boost of the π 0 in this case. The photon NN exhibits an erratic response, suggesting it does not have the data granularity needed to probe this regime. Likewise, the 3×3 method is unable to reconstruct the π 0 peak at all. It is, however, able to reconstruct the η peak, as expected. We attribute the weaker η peak in the end-to-end method to the aforementioned smearing effect of additional hadrons in the jet, as discussed further in Appendix A.2. The 3×3 method is less sensitive to this effect because of the restriction on the ECAL energy clusters being in the smaller 3×3 window.
Whether the sensitivity of the end-to-end method to the effects of jet hadronization represents an advantage or disadvantage depends on the application.For an analysis searching for isolated A → γγ decays [45], background processes from neutral mesons in jets will be smeared in mass, providing a distinct advantage for separating their mass spectra from that of true A → γγ decays peaking at similar masses.The optimization of the end-to-end technique for the mass regression of neutral mesons in jets is beyond the scope of this paper.
The unique capability of the end-to-end technique to reconstruct highly boosted particle decays thus opens the door to physics searches in boost regimes previously inaccessible to existing reconstruction algorithms.Additionally, because of the difficulty of obtaining low-energy π 0 decays (E ≈ 1 GeV) with increasing luminosity, the ability to reconstruct the more abundantly available high-energy (E ≈ 10 GeV) π 0 decays instead offers the possibility of improving the reach of existing CMS ECAL intercrystal calibration methods, which rely on such decays [8].

Figure 6: Reconstructed mass m Γ for end-to-end (red circles), photon NN (blue squares), and 3×3 (gray triangles) algorithms for hadronic jets from data enriched with π 0 → γγ decays.All distributions are normalized to the same number of events, including those outside m A -ROI.The statistical uncertainties in the distributions are negligible.

Robustness of the algorithm
To further assess the robustness and generalizability of the end-to-end ML-based mass regressor, we study how the regressed mass varies with respect to a number of key quantities of interest.Such studies are useful in revealing potential biases of the mass regressor technique to kinematic regions and detector conditions for which it was not trained.These mass dependence studies are performed on data using both π 0 → γγ events and electrons from events enriched with Z → e + e − decays.

Mass dependence on kinematic quantities
We first measure the dependence of the regressed mass on reconstructed kinematic quantities such as p T, Γ and η Γ .These studies have the caveat outlined in Sec.8.2 concerning the distortions in the regressed π 0 invariant mass distribution coming from jet hadronization.Figure 7 (left) shows a two-dimensional plot of the regressed mass versus p T, Γ for 20 < p T, Γ < 35 GeV.A clear band is observed that is independent of p T, Γ .We attribute this band to well-isolated π 0 → γγ decays, which are more prominent in this relatively low-p T, Γ range.This is consistent with our earlier results from simulated A → γγ decays, shown in Fig. 5, albeit for a narrower p T, Γ range.A broadening of the mass distribution for p T, Γ ≳ 30 GeV is also visible in Fig. 7 (left).However, this is likely an artifact of the discontinuous increase in the photon population of the selected sample, due to the turn-on of the leading-p T threshold for the diphoton trigger.
Next, we study the dependence of the regressed mass on the reconstructed η Γ .As discussed earlier, the number of radiation lengths traversed by an electromagnetic particle entering the ECAL barrel at η ≈ 1.4 is nearly fivefold that at η ≈ 0 because of differences in the underlying tracker material structure (described in Sec. 2).Thus, this study is a useful check on the sensitivity of the regressed mass to the level of electromagnetic shower development in the A → γγ energy deposits.As seen in Fig. 7 (right), the regressed mass has noticeably better resolution in the central region (|η Γ | ≲ 0.5) than in the forward regions (|η Γ | ≳ 1), as expected.The regressed mass distribution also shows good continuity with respect to η Γ .The dependence of the regressed mass on the same kinematic quantities for simulated H → AA → 4γ events is presented in Appendix A.3, with similar conclusions.

Figure 7: Regressed mass from π 0 → γγ data events vs. p T, Γ (left) and η Γ (right).For both plots, the regressed mass distributions are normalized in vertical slices of the accompanying kinematic quantity to highlight the intrinsic dependence on the quantity.The relative contribution over each vertical slice is given by the color scale to the right of each plot.
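The slice normalization described in the caption can be sketched in a few lines. The layout of the input (a list of columns, one per bin of the kinematic quantity, each holding the counts per mass bin) and the function name are assumptions made for illustration.

```python
def normalize_vertical_slices(counts):
    """Normalize each vertical slice (column) of a 2D histogram to unity.

    counts[ix][iy] holds the number of entries in bin ix of the kinematic
    quantity and bin iy of the regressed mass. Empty slices stay at zero.
    """
    normalized = []
    for column in counts:
        total = sum(column)
        normalized.append([c / total if total > 0 else 0.0 for c in column])
    return normalized


# Toy counts: three slices of the kinematic quantity, two mass bins each.
fractions = normalize_vertical_slices([[1, 3], [0, 0], [2, 2]])
```

Normalizing per slice removes the effect of the steeply varying event population across the horizontal axis, so that only the shape of the mass distribution in each slice is compared.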

Mass dependence on detector conditions
A critical question surrounding the end-to-end ML technique is how well it accommodates mismodeled detector conditions, or even conditions unseen in the training set; in particular, whether exposure to the full granularity of the detector data results in an algorithm that is overly dependent on detector conditions or the amount of pileup.
To address these questions using the same π 0 → γγ data, we study the regressed mass versus the mean number of interactions per bunch crossing.Importantly, because of challenges with the LHC during the data-taking period used in this study, the proton bunch scheme was significantly altered in the latter half of the year.This resulted in the pileup distribution being significantly skewed toward a higher number of interactions per bunch crossing.This effect was not fully modeled in the simulated data used to train the mass regressor.In spite of this, as shown in Fig. 8 (upper), we observe that both the peak and width of the regressed masses are stable versus the amount of pileup.For various slices of data in the pileup range from 15 to 50, we fit the regressed mass distribution in each slice to a Gaussian-plus-quadratic function in the vicinity of the mass peak.The variation in the position and width of the fitted mass peaks is less than 3% and 9%, respectively.
We next investigate the stability of the regressed mass with respect to changing detector conditions by comparing the regressed π 0 → γγ mass spectrum at different times. In Fig. 8 (lower left), the mass spectrum is plotted for the start, middle, and end of the 2017 data-taking period. During these time segments, the electronic noise in the ECAL barrel increased by 25%. Using the same fitting procedure described earlier, the positions and widths of the peak in the latter two time segments agree with those of the first segment to within 1% (middle) and 4% (end). These differences are similar to those measured in these datasets using established calibration techniques. As an additional check, in Fig. 8 (lower right), we show the invariant mass spectrum separately for the entire 2017 and 2018 data-taking periods. Between the ends of these two data-taking periods, the electronic noise in the ECAL barrel increased by about 10%. The positions and widths of the π 0 peak are consistent within uncertainties using a similar fitting procedure.
Robustness to changes in detector conditions is thus a principal strength of the end-to-end ML technique when using minimally processed detector data.However, such robustness is notably degraded when training on clustered data, as discussed in Appendix A.1.

Mass dependence on data versus simulation
Finally, we compare the dependence of the mass regressor for events from data versus simulation.If the relative mass resolution in data is worse than it is in simulation, the significance of an observed mass peak in data would be adversely affected.If there is a bias in the position of the mass peak in data versus simulation, an inaccurate mass measurement would be made unless the bias is corrected.
First, it is important to decouple data-versus-simulation mismodeling caused by detector-related effects from that attributable to the collimation of jets described in Sec. 8.2. An unbiased assessment of the mass regressor requires a measurement of the detector-related mismodeling alone. Because of the computational and logistical challenge of obtaining minimally processed simulated QCD events with energetic π 0 → γγ, we analyze instead the agreement between data and simulation using electrons from Z → e + e − events. Electrons from Z → e + e − decays are produced in abundance, and are reconstructed with good isolation and high purity. The tag-and-probe method is used to select the electrons for both data and simulation within the range |η e | < 1.44.
Although the electron is effectively massless in this energy regime, radiation is produced from its bending in the CMS magnetic field.This causes the energy in the seed crystal of the electron shower to be slightly smeared toward neighboring crystals in the shower [39], similar to the pattern seen in instrumentally merged A → γγ decays.As a result, the regressed electron mass spectrum displays a peak at m Γ ≈ 0.1 GeV that can be used to measure differences in the mass spectrum.
We parametrize the differences between data and simulation in terms of a Gaussian relative mass scale $s_\mathrm{scale}$ and smearing difference $s_\mathrm{smear}$. A scan is performed over different ($s_\mathrm{scale}$, $s_\mathrm{smear}$) hypotheses, in steps of ($\Delta s_\mathrm{scale} = 4 \times 10^{-3}$, $\Delta s_\mathrm{smear} = 0.4$ MeV). At each hypothesis, the value of $m_{\Gamma,i}$ for each electron candidate in the simulated sample is then smeared using the probability distribution $\mathcal{N}(s_\mathrm{scale} \times m_{\Gamma,i}, s_\mathrm{smear})$, where $\mathcal{N}(\mu, \sigma)$ is a Gaussian function parametrized by mean $\mu$ and standard deviation $\sigma$. The best fit mass scale and smearing between data and simulation are then defined as the ($s_\mathrm{scale}$, $s_\mathrm{smear}$) hypothesis for which the chi-square ($\chi^2$) test statistic between the mass distribution in data and the transformed simulated sample is at a minimum. The definition $\chi^2 = \sum_i (h_{\mathrm{data},i} - h_{\mathrm{MC},i})^2 / h_{\mathrm{MC},i}$ is used, where $h_{\mathrm{data},i}$ and $h_{\mathrm{MC},i}$ denote the normalized data and transformed simulation counts, respectively, at bin $i$ of $m_\Gamma$. We determine the best fit value for the relative mass scale to be 1.040 and the best fit value for the smearing difference to be less than $\Delta s_\mathrm{smear}$. The contours of the $\chi^2$ scan over the scale and smearing hypotheses are displayed in Fig. 9 (left). The resulting regressed mass distributions for electrons in data versus simulation under the best fit hypothesis are shown in Fig. 9 (right).
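The scan described above can be sketched as a grid search. This is a toy closure test under stated assumptions, not the analysis code: the function names, histogram range and binning, the toy Gaussian "simulation", and the choice to skip bins with no simulated entries are all hypothetical. The "data" are built by scaling the same toy sample by 1.04, so the scan is expected to return that scale with zero smearing.

```python
import random


def histogram(values, lo, hi, n_bins):
    """Normalized counts of values in [lo, hi) using equal-width bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def chi2(h_data, h_mc):
    """Sum over bins of (data - MC)^2 / MC; empty MC bins are skipped."""
    return sum((d - m) ** 2 / m for d, m in zip(h_data, h_mc) if m > 0)


def transform(sim, s_scale, s_smear, rng):
    """Scale each simulated mass and smear it with a Gaussian of width s_smear."""
    if s_smear == 0:
        return [s_scale * m for m in sim]
    return [rng.gauss(s_scale * m, s_smear) for m in sim]


def scan(data, sim, scales, smears, lo=0.0, hi=0.3, n_bins=60, seed=1):
    """Return (chi2, s_scale, s_smear) for the hypothesis with minimum chi2."""
    rng = random.Random(seed)
    h_data = histogram(data, lo, hi, n_bins)
    best = None
    for s_scale in scales:
        for s_smear in smears:
            h_mc = histogram(transform(sim, s_scale, s_smear, rng), lo, hi, n_bins)
            score = chi2(h_data, h_mc)
            if best is None or score < best[0]:
                best = (score, s_scale, s_smear)
    return best


# Toy closure test (masses in GeV); the peak width is an arbitrary choice.
toy_rng = random.Random(0)
sim = [toy_rng.gauss(0.10, 0.02) for _ in range(2000)]
data = [1.04 * m for m in sim]
scales = [round(0.98 + 0.004 * k, 3) for k in range(31)]  # steps of 4e-3
smears = [round(0.0004 * k, 4) for k in range(6)]         # steps of 0.4 MeV
score, s_scale, s_smear = scan(data, sim, scales, smears)
```

With real data the minimum would not be exactly zero, and the χ 2 surface around it, rather than the single best point, would be used to draw contours.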
After the difference in mass scale is taken into account, we find the core of the mass distribution to be well modeled in the simulation, as seen in Fig. 9 (right).Although there are some systematic differences in the high-side tail of the mass distribution between data and simulation, these deviations will be less relevant in an application where data are limited in the tails and statistical uncertainties are larger.Indeed, the significance of an observed mass peak is primarily driven by the core of the peak and not the tail contribution.Thus, the lack of any significant mass smearing implies the full mass resolution of the regressor is preserved in data.

Summary
A novel end-to-end particle reconstruction technique is introduced that is able to serve as a general strategy for reconstructing decays of boosted particles.The method involves the use of deep learning algorithms that do not rely on particle-flow objects, but are trained directly on minimally processed detector-level data to reconstruct particle properties of interest.Using simulated A → γγ decays in the CMS electromagnetic calorimeter, where A is a hypothetical scalar particle, the technique is used to reconstruct the diphoton invariant mass over a wide range of photon-merging scales, corresponding to Lorentz boosts γ L = 60-600.Furthermore, when domain continuation is incorporated in the training, the most challenging parts of the A → γγ phase space (γ L > 150) are made accessible.The resulting end-to-end mass regressor, in addition to being a highly sensitive tool, also has a robust response.Based on studies using simulated samples and collision data, a stable mass response is observed in various kinematic, beam, and detector conditions.
Studies are under way to employ the end-to-end mass regressor in π 0 → γγ reconstruction and in searches for new physics, such as H/X → AA → 4γ, where H is the Higgs boson and X is some new heavy resonance. Furthermore, although demonstrated for the specific case of mass reconstruction of a boosted particle decaying to photons, the end-to-end deep learning technique can be used for arbitrary decay modes by including additional subdetector information. The technique is not restricted to mass reconstruction; other particle properties, particularly those that are currently resolution constrained, such as the lifetime of particles in long-lived decays, stand to benefit significantly. It can potentially be used instead of particle-flow techniques to reconstruct the four-momenta of resolved decays. Sensitivity gains in these cases, however, are likely to be more modest, and must be balanced against the challenges of accessing the wider event content of the CMS datasets.
The technique of training via domain continuation can be exploited independently of the end-to-end method. Indeed, it is applied to the training of the photon neural network used as a benchmark. The application of this technique is not specific to high energy physics. It should be applicable in any machine learning regression task that seeks to regress a quantity near a boundary, physical or otherwise, that is close in scale to its resolution.
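As a generic illustration of how domain continuation could be set up for a regression task, the sketch below augments a labeled training set with boundary-like samples that are assigned random targets beyond the physical boundary. It is a sketch of the idea only: the function name, the sample format, and in particular the lower edge of −0.3 for the nonphysical targets are assumptions chosen for the example.

```python
import random


def continue_domain(physical_samples, boundary_samples, m_min=-0.3, seed=0):
    """Build a training set extended below the physical boundary at zero.

    physical_samples: list of (features, target) with target > 0.
    boundary_samples: list of features for topologically similar samples
        (e.g. single photons) that have no physical target of their own.
    m_min: assumed lower edge of the nonphysical target range.
    """
    rng = random.Random(seed)
    training = list(physical_samples)
    # Each boundary-like sample receives a random nonphysical target.
    training += [(x, rng.uniform(m_min, 0.0)) for x in boundary_samples]
    rng.shuffle(training)
    return training


# Toy inputs; the strings stand in for detector images or feature vectors.
diphotons = [("a_to_gg_1", 0.4), ("a_to_gg_2", 1.0)]
photons = ["photon_1", "photon_2", "photon_3"]
training_set = continue_domain(diphotons, photons)
```

At inference time, predictions that land in the continued (nonphysical) region are simply discarded, which is what allows the regressor to see a full target distribution right down to the boundary.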
When end-to-end particle reconstruction is combined with domain continuation, diphoton showers that are completely unresolved can now be reconstructed.This is a regime inaccessible to existing reconstruction techniques, and it is the first time a technique has been developed to achieve this important goal.

A Supplementary studies

A.1 Minimally processed versus clustered data
We attribute the robustness of the end-to-end ML mass regressor (discussed in Sec. 9) to the use of minimally processed (all) rather than clustered (clustered) detector data. As shown in Fig. A.1, the PF clustering algorithm filters out low-energy deposits and, under certain situations, may completely remove all the deposits associated with the lower-energy photon from the A → γγ decay. For example, in Fig. A.1 (left), showing the minimally processed data, the lower-energy photon is visible on the lower left of the core photon, at a distance of approximately 7 ECAL crystals. In the clustered data on the right plot, the deposits associated with the lower-energy photon have been dropped, along with other, isolated low-energy deposits.
[...] we see no evidence of a shift in the mass peak in either boost regime. This is not the case for the mass regressor trained on clustered data (lower row), which exhibits a shift when applied outside of its domain. This suggests it is desirable not to filter the detector data beforehand so that the mass regressor learns to suppress low-energy deposits. Without this opportunity, the mass regressor becomes more susceptible to variations in the data.
In addition, the loss of the lower-energy photon at resolved boosts (right column) leads to A → γγ decays being incorrectly reconstructed as photons, causing both a drop in reconstruction efficiency at the correct mass value (m Γ = 1 GeV) and a buildup of samples at the wrong mass (m Γ ≈ 0 GeV).Even if minimally processed data are presented to the mass regressor trained on clustered data (lower right, blue circles), the lost efficiency at the correct mass value is still not recovered.The use of minimally processed data is thus vital to maximizing the capabilities of the end-to-end ML-based technique.

A.2 Hadronization effects
Although the end-to-end method is able to clearly reconstruct π 0 candidates in hadronic jets, the η resonance in jets appears much less pronounced compared with that reconstructed by the 3×3 method (cf. Fig. 6). At the particle energies that we study, the weaker η resonance from the end-to-end method is due to the presence of additional particles collimated in the jet. As noted in Sec. 8.2, because the 3×3 method uses only a window of 3×3 crystals, it is less exposed to these effects. To illustrate this point, in [...] The ECAL energy pattern associated with the minimally processed detector data (all), as used by the end-to-end technique, is plotted on the left. The corresponding energy pattern, after applying the 3×3 clustering (3×3), is shown on the right. We have verified that, by applying the end-to-end method on candidates within a mass window of the η peak built using only 3×3 clusters, we are able to reconstruct an η mass peak of similar size as for the 3×3 algorithm. The effects of hadronization are also relevant for π 0 reconstruction in jets. However, the higher production rate of π 0 mesons permits the use of tighter photon identification criteria (cf. Sec. 8.2), in order to maximize the isolated π 0 -like component.

A.3 Mass dependence in simulated samples
To complement the mass dependence studies performed on data using π 0 → γγ decays and electrons, we present in Fig. A.4 similar studies for A → γγ decays from simulated H → AA → 4γ events passing our event and photon selection criteria. Two-dimensional plots are shown of the regressed mass versus the generated p T, A (left), η A (center), and amount of pileup (right) for barely resolved (upper), shower merged (middle), and instrumentally merged (lower) decay products. There is good stability in the regressed mass response throughout the explored phase space and the varying beam conditions.
Figure A.4: [...] or instrumentally merged (lower). In all plots, the regressed mass distribution is normalized in vertical slices of the quantity of interest. The relative contribution over each vertical slice is given by the color scale to the right of each plot.

Figure 1: Simulation results for the decay chain $\mathrm{H} \to \mathcal{A}\mathcal{A}$, $\mathcal{A} \to \gamma\gamma$ at various boosts: (upper plots) barely resolved, $m_\mathcal{A}$ = 1.0 GeV, $\gamma_\mathrm{L}$ = 50; (middle plots) shower merged, $m_\mathcal{A}$ = 0.4 GeV, $\gamma_\mathrm{L}$ = 150; and (lower plots) instrumentally merged, $m_\mathcal{A}$ = 0.1 GeV, $\gamma_\mathrm{L}$ = 625. The left column shows the normalized distribution of opening angles between the leading ($\gamma_1$) and subleading ($\gamma_2$) photons from the particle $\mathcal{A}$ decay, expressed by the number of crystals in the $\eta$ direction, $\Delta\eta(\gamma_1, \gamma_2)^\mathrm{gen}$, versus the $\phi$ direction, $\Delta\phi(\gamma_1, \gamma_2)^\mathrm{gen}$. Note that the distributions include contributions outside of the plotted ranges and thus may not sum to unity within the displayed ranges. The right column displays the ECAL energy shower pattern for a single $\mathcal{A} \to \gamma\gamma$ decay, plotted in relative ECAL crystal index coordinates and color-coded by energy. In all cases, only decays reconstructed as a single PF photon candidate passing selection criteria are used.

Figure 2: Left: the regressed mass $m_\Gamma$ vs. the generated $m_\mathcal{A}$ value for simulated $\mathcal{A} \to \gamma\gamma$ decays generated uniformly in ($p_\mathrm{T}$, $m_\mathcal{A}$) before domain continuation is implemented. The regressed $m_\Gamma$ is normalized in 0.025 GeV vertical slices of the generated $m_\mathcal{A}$. The color scale to the right of the plot gives the normalized number of events per vertical slice in 0.025 GeV bins of $m_\Gamma$. Right: the regressed $m_\Gamma$ distribution for simulated single-photon samples only, before domain continuation, resulting in a distinct peak in the low-$m_\Gamma$ region. The distribution is normalized to unity with the vertical bars on the points indicating the statistical uncertainty.
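
The normalization in vertical slices used in this and later two-dimensional plots can be sketched as follows. This is a minimal illustration under assumed conventions, not the analysis code: the function name and the nested-list layout (`counts[i][j]` for slice `i` of the horizontal quantity and bin `j` of the regressed mass) are hypothetical.

```python
def normalize_vertical_slices(counts):
    """Normalize a 2D table of event counts slice by slice.

    counts[i][j] is the number of events in vertical slice i (for example a
    0.025 GeV slice of generated mass) and bin j of the regressed mass.
    Every nonempty slice of the returned table sums to 1; empty slices stay 0.
    """
    normalized = []
    for column in counts:
        total = sum(column)
        if total > 0:
            normalized.append([c / total for c in column])
        else:
            normalized.append([0.0] * len(column))
    return normalized


example = normalize_vertical_slices([[1, 3], [0, 0], [2, 2]])
```

Normalizing each slice separately removes the dependence on how many events were generated at each value of the horizontal quantity, so that only the shape of the regressed mass response is compared across slices.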

Figure 3: Pictorial representation of the $m_\mathcal{A} \to 0$ boundary problem occurring when attempting to regress below the mass resolution. Left: the distribution of physically observable $\mathcal{A} \to \gamma\gamma$ invariant masses ($f_\mathrm{obs}$) vs. the generated $m_\mathcal{A}$. When $m_\mathcal{A} \approx \sigma(m_\mathcal{A})$, the left tail of the mass distribution becomes underrepresented in the training set. Middle: as $m_\mathcal{A} \to 0$, only half of the mass distribution is represented. The regressor subsequently defaults to the last full mass distribution at $m_\mathcal{A} \approx \sigma(m_\mathcal{A})$. Right: with domain continuation, the generated mass distribution of the original training samples ($\mathcal{A} \to \gamma\gamma$, red region) is augmented with topologically similar samples that are randomly assigned nonphysical masses ($\gamma$, blue region). This allows the regressor to see a full mass distribution over the entire region of interest (unhatched region). Predictions in the black hatched regions are discarded.
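
The augmentation pictured in the right panel can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the (image, mass) pair layout, the width of the nonphysical mass range (`m_min`), and the edges of the region of interest (`m_lo`, `m_hi`) are hypothetical choices made for the example and are not taken from the analysis.

```python
import random


def build_training_set(diphoton_samples, photon_samples, m_min=-0.3, seed=0):
    """Sketch of domain continuation for a mass regressor.

    diphoton_samples: list of (image, generated_mass) pairs, masses in GeV.
    photon_samples:   list of single-photon images without a physical target.
    m_min:            assumed lower edge (GeV) of the nonphysical mass range.
    """
    rng = random.Random(seed)
    training = list(diphoton_samples)
    for image in photon_samples:
        # Assign a random nonphysical (non-positive) target so that the
        # regressor sees a full mass distribution on both sides of m = 0.
        training.append((image, rng.uniform(m_min, 0.0)))
    rng.shuffle(training)
    return training


def in_region_of_interest(m_pred, m_lo=0.0, m_hi=1.2):
    """Keep predictions inside the (assumed) region of interest only."""
    return m_lo <= m_pred <= m_hi


training = build_training_set([("diphoton", 0.4)] * 3, ["photon"] * 5)
```

Predictions that fall outside the region of interest, such as a negative regressed mass, are dropped at evaluation time, which mirrors the discarded hatched regions in the figure.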

Figure 4: Mass regression performance for simulated $\mathcal{A} \to \gamma\gamma$ samples generated uniformly in ($p_\mathrm{T}$, $m_\mathcal{A}$), corresponding to mean boosts in the range $\langle\gamma_\mathrm{L}\rangle$ = 600-50 for $m_\mathcal{A}$ = 0.1-1.2 GeV. Upper: regressed $m_\Gamma$ vs. generated $m_\mathcal{A}$. The regressed $m_\Gamma$ is normalized in 0.025 GeV vertical slices of the generated $m_\mathcal{A}$. The color scale to the right of the plot gives the normalized number of events per vertical slice in 0.025 GeV bins of $m_\Gamma$. Lower left: the MAE (blue circles, use left scale) and MRE (red squares, use right scale) vs. the generated $m_\mathcal{A}$. For clarity, the MRE for $m_\mathcal{A}$ < 0.1 GeV is not shown since its value diverges as $m_\mathcal{A} \to 0$. Lower right: the $m_\mathcal{A}$ regression efficiency as a function of the generated $m_\mathcal{A}$. The hatched region shows the efficiency for single photons. The vertical bars on the points show the statistical uncertainty in the simulated sample.
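
As an illustration of the lower-left panel, the sketch below computes the two error measures in bins of generated mass. It assumes the usual definitions, MAE = mean of $|m_\Gamma - m_\mathcal{A}|$ and MRE = mean of $|m_\Gamma - m_\mathcal{A}|/m_\mathcal{A}$; if the main text defines them differently, those definitions take precedence. The function name and the bin width are hypothetical.

```python
import math


def mass_errors_by_bin(m_true, m_pred, bin_width=0.1):
    """Return {bin_index: (mae, mre)} in bins of generated mass (GeV).

    bin_index = floor(m_true / bin_width). The MRE is returned as None for
    bins containing non-positive generated masses, where it is undefined.
    """
    groups = {}
    for mt, mp in zip(m_true, m_pred):
        groups.setdefault(math.floor(mt / bin_width), []).append((mt, mp))
    result = {}
    for index, pairs in sorted(groups.items()):
        abs_errors = [abs(mp - mt) for mt, mp in pairs]
        mae = sum(abs_errors) / len(pairs)
        if all(mt > 0 for mt, _ in pairs):
            mre = sum(e / mt for e, (mt, _) in zip(abs_errors, pairs)) / len(pairs)
        else:
            mre = None
        result[index] = (mae, mre)
    return result


errors = mass_errors_by_bin([0.42, 0.45, 1.05], [0.40, 0.50, 1.00])
```

Because the relative error divides by the generated mass, it grows without bound as the mass approaches zero even when the absolute error stays constant, which is why the MRE is not shown at the lowest masses.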

Figure 5: Reconstructed mass spectra for end-to-end (left column), photon NN (middle column), and 3×3 algorithms (right column) for $\mathcal{A} \to \gamma\gamma$ decays with $m_\mathcal{A}$ = 1.0 GeV (upper row), $m_\mathcal{A}$ = 0.4 GeV (second row), $m_\mathcal{A}$ = 0.1 GeV (third row), and for isolated single photons (lower row). For each panel, the mass spectra are separated by reconstructed $p_{\mathrm{T},\Gamma}$ value into ranges of 30-55 GeV (red circles, low-$p_{\mathrm{T},\Gamma}$), 55-70 GeV (gray triangles, mid-$p_{\mathrm{T},\Gamma}$), 70-100 GeV (blue squares, high-$p_{\mathrm{T},\Gamma}$), and >100 GeV (green inverted triangles, ultra-$p_{\mathrm{T},\Gamma}$). The vertical bars on the points give the statistical uncertainties. All the mass spectra are normalized to unity, including samples outside $m_\mathcal{A}$-ROI. The vertical dotted line shows the input $m_\mathcal{A}$ value.

Figure 8: Upper: two-dimensional plot of the regressed mass for $\pi^0 \to \gamma\gamma$ data events vs. the amount of pileup (PU). The mass distribution is normalized in vertical slices of the amount of pileup. The relative contribution over each vertical slice is given by the color scale to the right of the plot. Lower left: the regressed mass distributions for the start (gray circles), middle (blue squares), and end (red triangles) of data taking during the year 2017. Lower right: the regressed mass distributions for the 2017 (gray circles) and 2018 (blue squares) data-taking periods. The lower two plots are normalized to unity and the vertical bars on the points show the statistical uncertainties. The lower panel for the lower left plot gives the ratio of distributions for the middle to the start (blue squares) and the end to the start (red triangles) of the 2017 data-taking period. The lower panel for the lower right plot gives the ratio (blue squares) for the 2018 and 2017 data-taking periods. The vertical bars on the points in both lower panels show the statistical uncertainties in the numerator quantity, and the gray bands give the similar uncertainty in the denominator quantity.

Figure 9: The agreement in the regressed $m_\Gamma$ spectrum between electrons in data versus simulation. Left: contours of 68% (solid line) and 95% (dotted line) confidence level (CL) in the $\chi^2$ test statistic as a function of the ($s_\mathrm{scale}$, $s_\mathrm{smear}$) hypothesis. The best fit point ($s_\mathrm{scale}$ = 1.040, $s_\mathrm{smear}$ = 0 MeV) is indicated by the red diamond. Right: the regressed mass distributions in data (points) and the best fit Monte Carlo (MC) simulation (blue region) for electrons from $\mathrm{Z} \to \mathrm{e}^+\mathrm{e}^-$ events. The difference between the simulated distribution under the null scale and smearing hypothesis versus the best fit hypothesis (Syst) is plotted as a green band. Each of the distributions is normalized to unity, including samples outside $m_\mathcal{A}$-ROI. Statistical uncertainties in the data distribution are negligible. The lower panel shows the ratio of the data to the simulation under the best fit hypothesis (points). The statistical uncertainties in the latter are plotted as a blue band. The ratio of the simulated distribution for the null to the best fit hypothesis is displayed as a green band.
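
A grid scan of the kind summarized in the left panel might look like the sketch below. It is illustrative only: the form of the variation (a multiplicative scale and an additive Gaussian smear applied to the simulated mass), the $\chi^2$ definition with approximate Poisson variances, the histogram range and binning, and all function names are assumptions made for the example, not a description of the procedure used in the analysis.

```python
import random


def normalized_hist(values, lo, hi, nbins):
    """Histogram normalized to all entries, including those outside [lo, hi)."""
    counts = [0] * nbins
    width = (hi - lo) / nbins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), nbins - 1)] += 1
    return [c / len(values) for c in counts]


def chi2(data_hist, sim_hist, n_data, n_sim):
    """Chi-square between two normalized histograms (assumed Poisson variances)."""
    total = 0.0
    for d, s in zip(data_hist, sim_hist):
        variance = d / n_data + s / n_sim
        if variance > 0:
            total += (d - s) ** 2 / variance
    return total


def scan_scale_smear(data, sim, scales, smears, lo=0.0, hi=1.2, nbins=24, seed=0):
    """Return (chi2, s_scale, s_smear) at the grid point with the smallest chi2.

    Masses and smears are in GeV here (a hypothetical unit choice).
    """
    data_hist = normalized_hist(data, lo, hi, nbins)
    best = None
    for s_scale in scales:
        for s_smear in smears:
            rng = random.Random(seed)  # identical random draws for every hypothesis
            varied = [s_scale * m + (rng.gauss(0.0, s_smear) if s_smear > 0 else 0.0)
                      for m in sim]
            value = chi2(data_hist, normalized_hist(varied, lo, hi, nbins),
                         len(data), len(sim))
            if best is None or value < best[0]:
                best = (value, s_scale, s_smear)
    return best


# Toy closure check: "data" built from the simulation with a 4% scale shift.
sim = [0.1 + 0.001 * i for i in range(500)]
data = [1.04 * m for m in sim]
best = scan_scale_smear(data, sim, [1.0, 1.04, 1.08], [0.0, 0.02])
```

In this toy closure check the scan recovers the injected scale with no smearing. Confidence-level contours such as those in the figure would then follow from the difference between the $\chi^2$ at each grid point and its minimum.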

Figure A.1: A typical $\mathcal{A} \to \gamma\gamma$ decay using minimally processed (left) and clustered (right) data. The energy distributions are plotted in relative ECAL crystal index coordinates of the pseudorapidity $\eta$ versus the azimuthal angle $\phi$ and color coded by energy.

The impact of using minimally processed versus clustered data is seen in Fig. A.2, which compares the effect of training on all (upper row) versus clustered data (lower row). The regressed mass spectra for shower-merged and barely resolved boosts are displayed in the left and right columns, respectively. For each scenario, we regress the mass of a sample constructed from all (blue circles) versus clustered (red squares) data to compare how well each mass regressor extrapolates to the other's domain. For the mass regressor trained on minimally processed data (upper row), despite the differences in input image (cf. Fig. A.1), we see no evidence of a shift in the mass peak in either boost regime. This is not the case for the mass regressor trained on clustered data (lower row), which exhibits a shift when applied outside of its domain. This suggests it is desirable not to filter the detector data beforehand so that the mass regressor learns to suppress low-energy deposits. Without this opportunity, the mass regressor becomes more susceptible to variations in the data.

Figure A.2: Regressed mass spectra for the mass regressor trained on minimally processed data (upper) versus clustered data (lower) at shower-merged boosts (left) and barely resolved boosts (right). For each scenario, the mass regressor is run on the same set of $\mathcal{A} \to \gamma\gamma$ decays, composed either of minimally processed (blue circles, all) or clustered (red squares, clustered) data. The vertical bars on the points give the statistical uncertainties and the vertical dotted line shows the input $m_\mathcal{A}$ value.

Figure A.3: A typical hadronic jet sample reconstructed by the 3×3 algorithm with a mass near the $\eta$ meson peak, using minimally processed (all, left) and 3×3 clustered (3×3, right) data. The energy distributions are plotted in relative ECAL crystal index coordinates of the azimuthal angle $\phi$ versus the pseudorapidity $\eta$ and color coded by energy.

Figure A.4: Regressed mass spectra vs. generated $p_{\mathrm{T},\mathcal{A}}$ (left), generated $\eta_\mathcal{A}$ (center), and amount of pileup (right), for $\mathcal{A} \to \gamma\gamma$ decays that are barely resolved (upper), shower merged (middle), or instrumentally merged (lower). In all plots, the regressed mass distribution is normalized in vertical slices of the quantity of interest. The relative contribution over each vertical slice is given by the color scale to the right of each plot.