Fast Point Cloud Generation with Diffusion Models in High Energy Physics

Many particle physics datasets like those generated at colliders are described by continuous coordinates (in contrast to grid points like in an image), respect a number of symmetries (like permutation invariance), and have a stochastic dimensionality. For this reason, standard deep generative models that produce images or at least a fixed set of features are limiting. We introduce a new neural network simulation based on a diffusion model that addresses these limitations named Fast Point Cloud Diffusion (FPCD). We show that our approach can reproduce the complex properties of hadronic jets from proton-proton collisions with competitive precision to other recently proposed models. Additionally, we use a procedure called progressive distillation to accelerate the generation time of our method, which is typically a significant challenge for diffusion models despite their state-of-the-art precision.


I. INTRODUCTION
Simulations are a critical component of nearly all inference tasks in particle physics. These simulations connect theory to experiment and must span a wide range of energy scales and encode the complex structure of high energy physics data. Physics-based simulations are excellent, but are only an approximation to nature. Additionally, some components of these simulations are computationally expensive and are a bottleneck for the high statistics datasets that are being collected now and in the near future. Classical fast approximations exist for some steps and in some cases, such as detector simulations for a particular experiment, but they are often not expressive enough to achieve high fidelity compared to a full simulation routine.
Deep neural network-based simulations (called deep generative models) are a promising alternative to classical fast simulations. Since the first deep generative model was applied to high energy physics [1], there have been a large number of proposals to use these tools for fast simulation and many other applications [2-4]. In this paper, we revisit the original problem of emulating parton shower Monte Carlo simulations. These simulations describe the formation of jets of hadrons that emerge from high energy quarks and gluons. Jets are ubiquitous at particle colliders and are the most complex objects reconstructed from hadronic final states. Together, these qualities make jets a standard benchmark for developing machine learning-based generative models.
Many deep generative models have been applied to the problem of emulating jet formation. The first approaches used images obtained by spatially discretizing the radiation pattern within jets [5,6]. Generative Adversarial Networks (GANs) [1,7] and autoencoders [8] were able to reproduce many aspects of the parton shower, but were fundamentally limited by their pixelization. While other applications of deep generative models naturally process image data (e.g. calorimeter simulations), jets are naturally represented as variable-sized point clouds, and information is lost when they are projected onto fixed-size grids with reduced spatial position information compared to the original detector granularity.
Point cloud generative models (PCGMs) offer a solution to the inherent challenges of pixelization. The first PCGM applied to jet formation was Ref. [34], which used a recurrent model to describe the probability density of a given jet. Recently, there has been a surge of interest in more general PCGMs that do not need to make any assumptions or approximations about the underlying generative process. This latest wave of methods began with a graph neural network-based GAN [35] and now includes a deep sets-based GAN [36] and a normalizing flow [37,38]. These models mark a significant step forward in the application of generative models to particle physics, but there is still significant room for improvement in both precision and robustness. For example, GANs solve a minimax problem and are thus difficult to train. Normalizing flows are more stable to train, but may have difficulties when generating low-level inputs (such as particle kinematic information) or a variable-length representation with complex topology, due to the invertible nature of their neural networks.
In the machine learning literature, the most precise generative neural networks are diffusion models (see e.g. Ref. [39]). These approaches circumvent the challenges with other models by solving a convex optimization problem, without the need for invertible transformations. This can be achieved by learning the score of the probability density, ∇ log p, instead of the probability density directly. The first diffusion model applied to particle physics was in the context of image-based calorimeter simulation [33], significantly extending the dimensionality of previous results. Our goal is to adapt diffusion models to the variable-length point cloud setting for parton showers and other phenomena in high energy physics while also reducing the generation time to be competitive with other fast generation methods. To this end, we introduce our algorithm, Fast Point Cloud Diffusion (FPCD), used to simulate point cloud data with varying length much faster than a standard diffusion implementation. Examples of generated point clouds using our proposed algorithm are shown in Fig. 1, where we compare the average energy deposition for top quark initiated jets generated by the full simulation or by the generative model. We accelerate the sampling time of the surrogate model using a method called progressive distillation [40], resulting in a generative model with high physics fidelity and fast sampling times.
While this paper was being finalized, the authors of Ref. [41] also proposed a diffusion-based PCGM for jet formation. The proposal in our paper differs from Ref. [41] in a few ways. First, our model does not condition on the jet mass, but rather utilizes a separate diffusion model to determine the jet kinematics. Next, it is much faster (via progressive distillation) and is conditioned on the particle type, thereby avoiding the training of a separate diffusion model for each type of jet. Finally, we also provide results for more particle types (including gluons and W and Z bosons in addition to light and top quarks) in two different datasets with varying number of particles to demonstrate that our model is capable of generating outputs of varying sizes.
This paper is organized as follows. Section II introduces score-based diffusion models and describes how they can be accelerated with progressive distillation. We then detail our implementation of the diffusion-based generative model for parton showers in Sec. III. Numerical results are presented in Sec. IV and the paper ends with conclusions and outlook in Sec. V.

II. SCORE-BASED GENERATIVE MODELS AND PROGRESSIVE DISTILLATION
The goal of a generative model is to generate new observations starting from a noise distribution. Diffusion models have become popular in recent years for their capacity to generate realistic data, often surpassing other state-of-the-art generative models. In score-based methods [42], a diffusion process is designed to slowly perturb the data through the addition of noise, while a neural network learns a time-dependent score function $\nabla_x \log p_{\mathrm{data}}$ for high-dimensional data $x \in \mathbb{R}^D$ described by the probability density $p_{\mathrm{data}}$. The score function is then used in a reverse-diffusion process that starts from noise and progressively denoises the observation. The diffusion model is described by latent variables $z = \{z_t \mid t \in [0, 1]\}$ with a time-dependent noise schedule $\alpha_t, \sigma_t$, such that the log signal-to-noise ratio $\log[\alpha_t^2/\sigma_t^2]$ decreases monotonically with time. During training, the network learns to denoise $z_t \sim q(z_t|x) = \mathcal{N}(z_t; \alpha_t x, \sigma_t^2 I)$ towards the unperturbed data $x \sim p_{\mathrm{data}}$, effectively learning an estimate $\hat{x}_\theta \approx x$ by updating the trainable parameters $\theta$. Following Ref. [40], we instead train a network to estimate a "velocity" parameter $v_t \equiv \alpha_t \epsilon - \sigma_t x$, with $\epsilon \sim \mathcal{N}(0, I)$, which is observed to yield accurate results while also simplifying the distillation method employed later. The loss function to be minimized during optimization is then defined as:
$$\mathcal{L}_\theta = \mathbb{E}_{\epsilon, t} \left\| v_t - \hat{v}_\theta(z_t, t) \right\|^2, \tag{1}$$
where $t$ is sampled uniformly over the considered interval.
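In this formulation, we can identify the estimate of the score function as:
$$\nabla_z \log \hat{p}_\theta(z_t) = -\left( z_t + \frac{\alpha_t}{\sigma_t} \hat{v}_\theta(z_t, t) \right). \tag{2}$$
In our implementation, we consider the variance-preserving setting of diffusion processes, where $\sigma_t^2 = 1 - \alpha_t^2$. For the time-dependence, we use a cosine schedule such that $\alpha_t = \cos(0.5\pi t)$.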
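The relations between the perturbed data, the velocity target, the data estimate, and the score can be checked numerically. The following is a minimal sketch, assuming the cosine variance-preserving schedule stated above; the function names and the toy data point are illustrative and not taken from the FPCD code.

```python
import math
import random

def alpha(t):
    """Signal coefficient of the cosine schedule, alpha_t = cos(0.5 * pi * t)."""
    return math.cos(0.5 * math.pi * t)

def sigma(t):
    """Noise coefficient; variance preserving, so sigma_t^2 = 1 - alpha_t^2."""
    return math.sin(0.5 * math.pi * t)

def perturb(x, eps, t):
    """Forward-process sample z_t = alpha_t * x + sigma_t * eps."""
    return [alpha(t) * xi + sigma(t) * ei for xi, ei in zip(x, eps)]

def velocity(x, eps, t):
    """Training target v_t = alpha_t * eps - sigma_t * x."""
    return [alpha(t) * ei - sigma(t) * xi for xi, ei in zip(x, eps)]

def x_from_v(z, v, t):
    """Data estimate implied by a velocity: x = alpha_t * z - sigma_t * v."""
    return [alpha(t) * zi - sigma(t) * vi for zi, vi in zip(z, v)]

def score_from_v(z, v, t):
    """Score estimate -(z + (alpha_t / sigma_t) * v), which equals -eps / sigma_t."""
    return [-(zi + alpha(t) / sigma(t) * vi) for zi, vi in zip(z, v)]

def velocity_loss(v_true, v_pred):
    """Squared error ||v - v_pred||^2 for a single (x, eps, t) draw."""
    return sum((a - b) ** 2 for a, b in zip(v_true, v_pred))

rng = random.Random(0)
x = [0.3, -1.2, 0.8]                    # toy data point (illustrative values)
eps = [rng.gauss(0.0, 1.0) for _ in x]  # eps ~ N(0, I)
t = 0.4
z = perturb(x, eps, t)
v = velocity(x, eps, t)
x_rec = x_from_v(z, v, t)               # recovers x when v is exact
score = score_from_v(z, v, t)           # equals -eps / sigma_t when v is exact
```

With an exact velocity, the data estimate reproduces the clean point and the score reduces to $-\epsilon/\sigma_t$, which is why a network trained on the velocity target can be used wherever a score or a denoised estimate is needed.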
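The generation of new samples is then carried out using the DDIM sampler proposed in Ref. [43], which uses an integration rule to solve the deterministic ordinary differential equation:
$$dz_t = \left[ f(z_t, t) - \frac{1}{2} g^2(t) \nabla_z \log \hat{p}_\theta(z_t) \right] dt, \tag{3}$$
with drift coefficient $f(z_t, t) = \frac{d \log \alpha_t}{dt} z_t$ and diffusion coefficient $g^2(t) = \frac{d\sigma_t^2}{dt} - 2 \frac{d \log \alpha_t}{dt} \sigma_t^2$. In the DDIM solver, the update rule for a step from time $t$ to an earlier time $s < t$ is then specified by:
$$z_s = \alpha_s \hat{x}_\theta(z_t) + \sigma_s \frac{z_t - \alpha_t \hat{x}_\theta(z_t)}{\sigma_t}. \tag{4}$$
In practice, solving Eq. 3 can be slow since the error introduced by the numerical integration is sensitive to the number of time steps chosen, often requiring hundreds to thousands of time steps and hence function evaluations of the trained model.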
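A deterministic DDIM sampling loop can be sketched as follows. This is an illustration under stated assumptions, not the FPCD implementation: the trained network is replaced by a hypothetical `oracle_velocity` that is exact for toy data concentrated on a single point `X0`, and the time grid is assumed to be uniform on [0, 1].

```python
import math
import random

def alpha(t):
    """Cosine schedule signal coefficient."""
    return math.cos(0.5 * math.pi * t)

def sigma(t):
    """Variance-preserving noise coefficient."""
    return math.sin(0.5 * math.pi * t)

X0 = [0.5, -2.0]  # toy target: all probability mass on one point (assumption)

def oracle_velocity(z, t):
    """Exact velocity for point-mass data at X0; stands in for the trained network."""
    a, s = alpha(t), sigma(t)
    return [a * (zi - a * xi) / s - s * xi for zi, xi in zip(z, X0)]

def ddim_step(z, t, s, model):
    """One deterministic DDIM update from time t to an earlier time s < t."""
    v = model(z, t)
    x_hat = [alpha(t) * zi - sigma(t) * vi for zi, vi in zip(z, v)]
    return [alpha(s) * xh + sigma(s) * (zi - alpha(t) * xh) / sigma(t)
            for zi, xh in zip(z, x_hat)]

def ddim_sample(model, dim, n_steps, rng):
    """Integrate from t = 1 (pure noise) down to t = 0 in n_steps uniform steps."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for i in range(n_steps, 0, -1):
        z = ddim_step(z, i / n_steps, (i - 1) / n_steps, model)
    return z

sample = ddim_sample(oracle_velocity, dim=2, n_steps=8, rng=random.Random(1))
```

Each step costs one model evaluation, so the sampling time scales with `n_steps`. With the exact toy model the result does not depend on the number of steps; with a trained network the discretization error grows as the number of steps is reduced, which is the motivation for distillation.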
To accelerate diffusion models, Ref. [40] introduced a technique called progressive distillation. Starting from a trained diffusion model, the goal of progressive distillation is to iteratively halve the number of time steps required to generate new samples. In this setting, the trained diffusion model ("teacher") is used to initialize a "student" model. During training, the student model learns to denoise data $z_t$ towards a target $\tilde{x}$, where $\tilde{x}$ no longer represents the clean data $x$, but is instead chosen so that a single student DDIM step matches two teacher DDIM steps. This process is then repeated multiple times, with the student at the end of each iteration becoming the new teacher. In this work, we train a diffusion model with the initial number of steps N fixed to 512. From there, we distill the model multiple times, reporting the results obtained with N = 512, N = 8, and N = 1.
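The student target can be obtained by requiring that one DDIM step from time t to t - 1/N lands on the point reached by two teacher steps of size 0.5/N, and solving the DDIM update for the data estimate. The sketch below works with scalars for brevity; `toy_teacher` is a hypothetical stand-in (the exact denoiser for standard-normal data), and all function names are illustrative.

```python
import math

def alpha(t):
    """Cosine schedule signal coefficient."""
    return math.cos(0.5 * math.pi * t)

def sigma(t):
    """Variance-preserving noise coefficient."""
    return math.sin(0.5 * math.pi * t)

def ddim_step(z, t, s, x_hat):
    """Deterministic DDIM update from time t to s < t given a data estimate."""
    return alpha(s) * x_hat + sigma(s) * (z - alpha(t) * x_hat) / sigma(t)

def toy_teacher(z, t):
    """Stand-in teacher: exact denoiser E[x | z_t] for standard-normal data."""
    return alpha(t) * z

def distillation_target(z_t, t, n_student, teacher):
    """Return (x_tilde, z_end): the student target and the two-step teacher result."""
    t_mid = t - 0.5 / n_student
    t_end = t - 1.0 / n_student
    z_mid = ddim_step(z_t, t, t_mid, teacher(z_t, t))
    z_end = ddim_step(z_mid, t_mid, t_end, teacher(z_mid, t_mid))
    ratio = sigma(t_end) / sigma(t)
    # Solve z_end = alpha_end * x + ratio * (z_t - alpha_t * x) for x.
    x_tilde = (z_end - ratio * z_t) / (alpha(t_end) - ratio * alpha(t))
    return x_tilde, z_end

def halving_rounds(n_start, n_end):
    """Number of distillation rounds needed to go from n_start to n_end steps."""
    rounds = 0
    while n_start > n_end:
        n_start //= 2
        rounds += 1
    return rounds

x_tilde, z_end = distillation_target(z_t=0.7, t=0.5, n_student=4, teacher=toy_teacher)
student_z = ddim_step(0.7, 0.5, 0.25, x_tilde)  # one student step with the target
```

By construction, `student_z` coincides with `z_end`. Since each round halves the step count, going from 512 steps to 8 takes six rounds and reaching a single step takes nine.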

III. POINT CLOUD DIFFUSION FOR COLLIDER DATA
We train a conditional diffusion model to generate particle jets conditioned on the initial particle type. We use the datasets introduced in Ref. [35], consisting of jets initiated by light quarks, gluons, top quarks, and W and Z bosons. The jets are generated with transverse momenta $p_T$ around 1 TeV and are clustered using the anti-$k_t$ algorithm [44] with a radius parameter of 0.8. Each jet stores at most 30 [45] or 150 [46] particles. For each jet, the four-momentum information ($p_{T,\mathrm{jet}}$, $\eta_{\mathrm{jet}}$, $\phi_{\mathrm{jet}}$, $m_{\mathrm{jet}}$) is provided, as well as the particle multiplicity. For each particle clustered inside a jet, the following set of relative kinematic quantities is provided:
$$\eta_{\mathrm{rel}} = \eta_{\mathrm{particle}} - \eta_{\mathrm{jet}}, \quad \phi_{\mathrm{rel}} = (\phi_{\mathrm{particle}} - \phi_{\mathrm{jet}}) \bmod 2\pi, \quad p_{T,\mathrm{rel}} = p_{T,\mathrm{particle}} / p_{T,\mathrm{jet}}. \tag{5}$$
Our goal is to develop a diffusion model that is conditioned on the particle's type and is able to generate both jet- and particle-level kinematic information. To accomplish this task, we train two diffusion models simultaneously. The first model learns the jet kinematic information, including particle multiplicity, while the second is conditioned on the jet kinematic distributions to generate particle information. Effectively, the loss function that is minimized during training is the sum of the velocity losses of the two models,
$$\mathcal{L}_\theta = \mathcal{L}_\theta^{\mathrm{jet}} + \mathcal{L}_\theta^{\mathrm{part}}, \tag{6}$$
with different models trained to generate jet information and particle information. During the generation step, we first sample the jet kinematic information together with the particle multiplicity, conditioned on the type of the jet we aim to generate. This information is then used as an input to generate the particle information for each jet.
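The relative particle coordinates can be computed from absolute particle and jet kinematics as in the short sketch below. It assumes the usual jet-relative definitions (pseudorapidity and azimuth differences, and the transverse momentum fraction); wrapping the azimuthal difference into [-π, π) is a convention chosen here for illustration, and the function name is hypothetical.

```python
import math

def relative_kinematics(pt, eta, phi, jet_pt, jet_eta, jet_phi):
    """Return (pt_rel, eta_rel, phi_rel) of a particle with respect to its jet."""
    pt_rel = pt / jet_pt
    eta_rel = eta - jet_eta
    # Wrap the azimuthal difference into [-pi, pi) so that particles on either
    # side of the phi = +/- pi boundary stay close to the jet axis.
    phi_rel = (phi - jet_phi + math.pi) % (2.0 * math.pi) - math.pi
    return pt_rel, eta_rel, phi_rel

# Illustrative values: a 50 GeV particle in a 1 TeV jet near the phi boundary.
pt_rel, eta_rel, phi_rel = relative_kinematics(
    pt=50.0, eta=0.3, phi=3.1, jet_pt=1000.0, jet_eta=0.1, jet_phi=-3.1
)
```

In this example the naive difference of 6.2 rad is mapped to about -0.083 rad, so the particle is correctly placed next to the jet axis rather than on the opposite side of the detector.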
The generated particle multiplicity determines the number of particles in each jet. Although it is feasible to make the output size match the sampled particle multiplicity exactly, we prefer to employ a masking plus zero-padding approach. This involves always sampling a fixed number of particles (either 30 or 150 depending on the dataset), but during generation we mask the input noise and keep only the desired number of particles.
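The masking plus zero-padding strategy can be illustrated with plain Python lists. This is a minimal sketch, assuming that the first `multiplicity` slots of a fixed-size set are kept and the remaining slots are zeroed; the helper names are hypothetical.

```python
import random

def masked_noise(multiplicity, max_particles, n_features, rng):
    """Return (noise, mask) where padded particle slots contain only zeros."""
    n_keep = max(0, min(multiplicity, max_particles))
    mask = [1.0 if i < n_keep else 0.0 for i in range(max_particles)]
    noise = [[m * rng.gauss(0.0, 1.0) for _ in range(n_features)] for m in mask]
    return noise, mask

def apply_mask(particles, mask):
    """Zero the padded slots again, e.g. after each denoising step."""
    return [[m * value for value in row] for row, m in zip(particles, mask)]

rng = random.Random(0)
# A jet with 3 sampled particles in a dataset that stores up to 5 (toy sizes).
noise, mask = masked_noise(multiplicity=3, max_particles=5, n_features=3, rng=rng)
remasked = apply_mask(noise, mask)
```

The network always sees arrays of the same shape, which keeps batching simple, while the mask guarantees that the number of non-zero particles equals the sampled multiplicity.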
Prior to training, the inputs to the diffusion model undergo a normalization process where all input features are standardized by adjusting their mean and standard deviation to zero and one, respectively.
The generative model designed to produce jet kinematic information is based on a fully-connected architecture incorporating multiple skip connections. Specifically, the model employs five ResNet [47] blocks, where each residual layer is connected to the output of a two-layer network through a skip connection. The activation function used is LeakyReLU [48] with a slope of α = 0.01, and all layer sizes are set to 512.
The particle diffusion model employs a DeepSets [49] architecture with Transformer layers [50] to increase the model's expressivity. The input sets are first mapped into a larger latent space using a fully-connected layer with a size of 64, applied independently to each particle in the set. The model then employs eight Transformer encoding blocks followed by a fully-connected layer with a size of 64 before the output layer. The activation function used is again LeakyReLU, and the outputs of the Transformer layers are added to the output of the last layer before the first Transformer block, which we observe to improve performance in our experiments.
Both diffusion models incorporate time information by feeding random Fourier features [51] through two fully connected layers with 32 and 64 nodes. The resulting embeddings are combined with additional conditional information, including jet type for the jet diffusion model and both jet type and jet kinematic information for the particle diffusion model. After passing through a fully connected layer of size 64, these embeddings are concatenated with the inputs of each diffusion model.
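A random Fourier feature embedding of the diffusion time can be sketched as below. The number of frequencies, their Gaussian scale, and the seed are assumptions made for illustration and are not values quoted in the text; the dense layers that follow the embedding are omitted.

```python
import math
import random

def make_fourier_embedding(n_frequencies, scale, seed=0):
    """Build a function t -> [sin(2*pi*f*t)..., cos(2*pi*f*t)...] with fixed random f."""
    rng = random.Random(seed)
    # Frequencies are drawn once and then kept fixed (not trained) in this sketch.
    freqs = [rng.gauss(0.0, scale) for _ in range(n_frequencies)]

    def embed(t):
        angles = [2.0 * math.pi * f * t for f in freqs]
        return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]

    return embed

embed_time = make_fourier_embedding(n_frequencies=16, scale=16.0)
features = embed_time(0.3)  # 32 bounded features describing t = 0.3
```

Mapping the scalar time to a vector of sinusoids with a range of frequencies gives the subsequent fully connected layers an input in which nearby time values are easier to distinguish than with the raw scalar.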
A visual description of both models is shown in Fig. 2.
The model is implemented using Keras [52] with a TensorFlow [53] backend. The model is trained for up to 250 epochs with a cosine learning rate schedule [54].

IV. RESULTS
The performance of the generative model is evaluated using physics-based metrics proposed in Ref. [35] as well as additional metrics designed specifically to assess the quality of the jet kinematic generation. These metrics include the 1-Wasserstein ($W_1$) distances that are calculated using only particle information, such as averaged particle relative momentum $W_1^{P}$, relative jet mass $W_1^{PM}$, and average of the first five energy flow polynomials $W_1^{PEFP}$ [59]. The 1-Wasserstein distances are also calculated for jet kinematic information, including jet transverse momentum $W_1^{JP}$, jet pseudorapidity $W_1^{J\eta}$, jet mass $W_1^{JM}$, and jet particle multiplicity $W_1^{JN}$. The evaluation also includes the Fréchet ParticleNet distance (FPND), coverage (Cov), and minimum matching distance (MMD), described in Ref. [35]. To calculate each metric, 50,000 generated examples for each jet category are compared against 50,000 validation samples that were not used during training. Uncertainties are estimated using bootstrapping with replacement following Ref. [35]. Additionally, results for distilled models with different numbers of total time steps are also provided.
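For one-dimensional distributions, the 1-Wasserstein distance between two samples of equal size reduces to the mean absolute difference of the sorted values. The sketch below illustrates this together with a simple bootstrap uncertainty; resampling both samples and quoting the standard deviation is an assumption made here for illustration and may differ in detail from the procedure of Ref. [35].

```python
import random

def wasserstein_1d(a, b):
    """W1 between two equal-size 1D samples via sorted-value differences."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

def bootstrap_w1(generated, reference, n_resamples, rng):
    """Mean and standard deviation of W1 over resamples drawn with replacement."""
    values = []
    for _ in range(n_resamples):
        g = rng.choices(generated, k=len(generated))
        r = rng.choices(reference, k=len(reference))
        values.append(wasserstein_1d(g, r))
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, variance ** 0.5

rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # toy "simulation"
generated = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # toy well-modelled sample
shifted = [value + 0.5 for value in reference]          # toy sample with a bias

w1_good = wasserstein_1d(generated, reference)
w1_shifted = wasserstein_1d(shifted, reference)
w1_mean, w1_std = bootstrap_w1(generated, reference, n_resamples=20, rng=random.Random(1))
```

A constant shift of 0.5 yields a distance of 0.5, while two independent samples from the same distribution give a small but non-zero value set by the finite sample size, which is why the metrics are quoted with uncertainties.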
Jet kinematic distributions are shown in Fig. 3, and the corresponding metrics are listed in Tab. II. We also present the per-particle distributions in Fig. 4, displaying simultaneously all particles inside a given jet. Results are also compared with the official implementations of EPiC-GAN and MP-GAN, where in the latter the MP-MP implementation is taken for the comparison. While other implementations using the same datasets exist [37,41], these models are not directly comparable to our method as they are conditioned on the jet kinematic information, whereas our method simultaneously models both the jet and particle kinematic distributions.
Similarly, we use the same physics-inspired metrics to evaluate FPCD on the dataset consisting of up to 150 particles per jet. In this case, the majority of the jets used during training need to be zero-padded, which demonstrates the capability of FPCD to learn how to generate jets with a varying number of particles. The comparison of the physics-inspired metrics is listed in Tab. III. Since the jet kinematic information is not affected by the maximum number of particles stored in each jet, we only report the $W_1^{JN}$ metric for each dataset in Tab. IV. Histograms for each of the distributions considered in this study are provided in Appendix A.
Finally, we also compare the generation time for FPCD in Tab. V. The original FPCD model is highly accurate but computationally expensive. However, we found that a distilled model with as few as 8 time steps can achieve similar results while significantly reducing the overall sampling time. Surprisingly, a distilled model with only a single time step during generation still retains high fidelity while further reducing the sampling time. In Appendix B, we compare the histograms for each of the kinematic distributions generated by the diffusion model and the distilled models in the dataset of top quark initiated jets.
FIG. 3. Generated jet kinematic information using FPCD compared to simulated events for particle jets consisting of light quarks (q), gluons (g), and top quarks (top).
EPiC-GAN is shown to be around one order of magnitude faster than FPCD even with a single diffusion step due to a different network architecture proposed by the authors. On the other hand, FPCD with a single time step is faster than MP-GAN, which also generates new jets through a single evaluation of the trained network. Nevertheless, all generative models are several orders of magnitude faster than the original physics simulation.

V. CONCLUSION AND OUTLOOK
In this work we introduced a fast point cloud diffusion (FPCD) model, providing flexibility, accuracy, and computational efficiency in jet generation. By simultaneously learning and generating multiple jet species, the two-part diffusion model generates both the jet kinematic information and particle information, conditioned on the jet kinematics and particle type to be generated.
Our model has achieved state-of-the-art performance in several physics-inspired metrics. We have demonstrated its capability to generate five different jet types with high fidelity using datasets with up to 30 or up to 150 particles per jet, showcasing the model's ability to generate jets with different particle multiplicities through a masking strategy.
Furthermore, the generation time was reduced by a factor of 450 using progressive distillation compared to the initial FPCD baseline, enabling high-fidelity generation with a single time step. This result motivates future research to further reduce the model complexity and accelerate sampling.
The investigation of different backbone network designs is another promising direction to reduce generation time while maintaining high fidelity. The EPiC-GAN network structure shows great potential, with lower computational costs in higher particle multiplicity regions.
Given the flexibility of our model, we envision possible future applications in fast event generation, hadronisation models conditioned on parton kinematics, and full event reconstruction conditioned on different particle types.

FIG. 1. Average top quark initiated jet in the full simulation, after generation with the diffusion model, and after distillation resulting in 8 or a single time step used during sampling.

FIG. 7. Comparison of generated jet kinematic information using different distillation steps for top quark initiated jets in the dataset consisting of 30 particles.
FIG. 2. Description of the network architectures used to train the jet and particle diffusion models. Numbers after layers represent the number of hidden nodes associated with the layer. See the text for more information.
The cosine learning rate schedule uses an initial learning rate of $16 \times 10^{-4}$. If the loss function, evaluated on a separate testing set representing 20% of the sample size, does not decrease for a fixed number of consecutive epochs, the training is stopped. During training, 16 NVIDIA A100 GPUs are used simultaneously, interfaced with the Horovod package [55] on the Perlmutter supercomputer [56]. The batch size on each GPU is set to 128. The hyperparameters used in the model architecture were optimized using the KerasTuner [57] package with the Hyperband [58] algorithm.

TABLE I. Comparison of the results obtained between different generative models in the task of particle property generation in the dataset consisting of 30 particles. Baseline FPCD uses 512 time steps during sampling. Distilled models are listed alongside the number of time steps used. Lower is better for all metrics except Cov. FPND metrics are not available for W and Z bosons, hence omitted.

TABLE II. Comparison of the results obtained between different generative models for the task of jet property generation in the dataset consisting of 30 particles. Baseline FPCD uses 512 time steps during sampling. Distilled models are listed alongside the number of time steps used. Lower is better for all metrics listed except Cov.

TABLE III. Comparison of the results obtained between different generative models in the task of particle property generation in the dataset consisting of 150 particles. Baseline FPCD uses 512 time steps during sampling. Distilled models are listed alongside the number of time steps used. Lower is better for all metrics except Cov.

TABLE V. Timing comparison for full jet generation with FPCD. A fixed batch size of 10,000 examples is used. The time reported is the sum of the time used to generate each jet and particle kinematic information on a single GPU. Full simulation time is taken from Ref. [35]. The EPiC-GAN work also provides a time comparison with a retrained MP-GAN. However, since we do not retrain any model used for comparison, we decided to mention only the official results released in the original publications.