PC-Droid: Faster diffusion and improved quality for particle cloud generation



I. INTRODUCTION
An incredible amount of computing resources is required to keep up with demands for simulated events at the intensity frontier of high energy physics (HEP). As such, focus has turned to fast surrogate models for event and detector simulation. Deep generative models and modern machine learning techniques have shown great promise in improving the fidelity and speed of fast simulation approaches. Recently, diffusion models have come to the fore in a wide range of disciplines, notably for image generation [26][27][28][29][30][31]. However, they have also demonstrated promise in applications in HEP for fast detector simulation [32][33][34], the generation of jets [35][36][37], unfolding [38], and anomaly detection [39].
The recent rise of diffusion models has also led to a rapid development of approaches [40], similar to the development observed in generative adversarial networks [41]. As such, the level of performance is continuously improving, with more stable training procedures and improved differential equation solvers for generation.
In this work we introduce PC-Droid (Particle Cloud Diffusion of reconstructed objects with improved denoising), a significant improvement over the PC-JeDi approach introduced in Ref. [35], demonstrating both a decreased inference time and higher-fidelity generation.
* matthew.leigh@unige.ch † debajyoti.sengupta@unige.ch ‡ john.raine@unige.ch
We use a more recently developed formulation for diffusion models [31] and combine it with improved diffusion sampling algorithms [42,43]. We compare and contrast different network architectures, balancing the overall performance against generation time. Furthermore, in an effort to greatly reduce the generation time, we study the application of consistency models [44] for jet generation.
With PC-Droid we achieve state-of-the-art performance as measured on standard metrics, as well as a new set of benchmarks. The repository¹ used in this work is publicly available.

II. METHOD
PC-Droid is a family of models which are trained to generate the constituent particles of jets either conditionally, given the desired kinematics of the jet, or unconditionally, with jet kinematics sampled from distributions learned from the training dataset.

Following the formulation of Ref. [31], the denoised estimate is parameterised as

D_θ(x; σ, y) = c_skip(σ) x + c_out(σ) F_θ(c_in(σ) x; σ, y),

where F_θ is the prediction from the network for the input point cloud x and conditional parameters y. As the variance is no longer preserved during the diffusion process, the scaling functions c_in(σ) and c_out(σ) are used to maintain the unit variance of the raw inputs and outputs of the network.
The skip connection allows the input x to bypass the network at low σ, where the input is already close to the target.
Finally, this new framework results in a change to the objective function of the network. Here, the loss is calculated from the distance between the denoised output and the true target x_0, whereas previously the loss was calculated from the difference between the predicted and true noise added to x. Another key development has been in the choice of integration solvers used during inference. Several state-of-the-art algorithms have been studied, including those provided by the k-diffusion library.² The most promising solvers are presented in this work.
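The scaling functions and skip connection above follow the preconditioning of Ref. [31]. A minimal sketch, assuming the data has been scaled to unit variance (σ_data = 1); the names c_skip, c_out and c_in follow that reference, not the PC-Droid code:

```python
# Sketch of the denoiser preconditioning of Ref. [31], assuming sigma_data = 1.
SIGMA_DATA = 1.0

def c_skip(sigma):
    # Weight of the skip connection: -> 1 as sigma -> 0, so x bypasses the network
    return SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)

def c_out(sigma):
    # Scales the raw network output back to the data scale: -> 0 as sigma -> 0
    return sigma * SIGMA_DATA / (sigma**2 + SIGMA_DATA**2) ** 0.5

def c_in(sigma):
    # Scales the noisy input so the network always sees unit-variance data
    return 1.0 / (sigma**2 + SIGMA_DATA**2) ** 0.5

def denoise(F, x, sigma, y):
    """Denoised estimate D(x; sigma, y) = c_skip*x + c_out*F(c_in*x; sigma, y).
    The training loss is then a (weighted) squared distance between D and x_0."""
    return c_skip(sigma) * x + c_out(sigma) * F(c_in(sigma) * x, sigma, y)
```

At σ = 0 the parameterisation reduces to the identity map on x, which is the property exploited by the consistency models discussed later.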
In PC-JeDi, dedicated models were trained for each jet type; in PC-Droid, a single conditional model is trained for all five jet types (PID): q (light quark), g (gluon), t (top quark), w (W boson), and z (Z boson).

B. Model training
A denoising transformer is trained for each of the JetNet datasets introduced in Ref. [45], comprising jets with up to 30 [46] and 150 [47]

C. Cross-attention encoder
In PC-JeDi only a self-attention transformer encoder (TE) architecture was studied which, although very expressive, is computationally expensive. The number of operations scales as O(N_const²) for N_const constituents. As diffusion models require many network passes during generation, this makes self-attention a suboptimal choice for fast generation. For the 30 constituent models this does not present a large problem; however, when moving to 150 constituents the impact is non-negligible. Therefore, in addition to the transformer model, we also introduce the cross-attention encoder (CAE) as a faster and more memory efficient permutation equivariant network.

² https://github.com/crowsonkb/k-diffusion/tree/v0.0.15
A schematic overview of the CAE-Block is shown in Fig. 2. In a CAE-Block, the input point cloud is used to update a group of global tokens using multi-headed cross-attention. The number of global tokens N_global is a hyperparameter. The global tokens are then further updated using a residual multi-layer perceptron (MLP), before being redistributed back to the point cloud using another cross-attention layer and a residual MLP update. This process of global pooling followed by distribution can be thought of as a transformer analogue to EPiC layers [48]. We train two models using CAE layers, with N_global = 1 or 16, on the 150 constituent dataset using the same training setup, model dimension, and number of layers as the baseline transformer for a fair comparison.
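The pooling and distribution steps of a CAE-Block can be sketched as below. This is a deliberately simplified illustration (single attention head, no layer-norm, one-layer MLPs, no σ or context conditioning), not the full architecture:

```python
import numpy as np

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(q_in, kv_in, wq, wk, wv):
    """Single-head cross-attention: rows of q_in attend to rows of kv_in."""
    q, k, v = q_in @ wq, kv_in @ wk, kv_in @ wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def cae_block(x, g, p):
    """One simplified CAE-Block.
    Pooling: the global tokens g attend to the point cloud x.
    Distribution: the points x attend to the updated global tokens."""
    g = g + cross_attention(g, x, *p["pool"])  # pool point cloud -> globals
    g = g + np.tanh(g @ p["mlp_g"])            # residual MLP update on globals
    x = x + cross_attention(x, g, *p["dist"])  # distribute globals -> points
    x = x + np.tanh(x @ p["mlp_x"])            # residual MLP update on points
    return x, g
```

Because the pooling attention sums over the points and all per-point updates are row-wise, the block is permutation equivariant in the point cloud, which is the property required of the generator.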

D. Consistency models
In this work we investigate the process of consistency distillation (CD) [44], whereby a teacher network is used to train a student network in order to solve the reverse diffusion process in fewer time steps, even enabling one-shot generation. The teacher network in this case is a standard diffusion model, which can already be used in conjunction with an integration method to solve the reverse diffusion ODE [49].
During training, the objective of the student network is to map all points sampled along the same ODE trajectory to the same output. To ensure that the student model does not collapse to a constant function, a boundary condition is enforced such that it approaches the identity map as σ approaches zero. We follow the approach in Ref. [44] and use skip connections to enforce this. During training, two adjacent points on the ODE are sampled using the teacher network and sampler, and are subsequently evaluated with the student. The training loss is the mean squared distance between these two outputs. To stabilise training, the student network is duplicated into an online and a target network, a process commonly found in deep reinforcement learning [50,51]. Gradients are only propagated through the online network, which always processes the ODE sample with the larger σ. After each iteration, the target network is synced with the online network using an exponentially decaying average of the parameters.
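The distillation objective and the target-network update described above can be sketched as follows; `online`, `target` and `teacher_step` are placeholder callables standing in for the actual networks and solver:

```python
import numpy as np

def cd_loss(online, target, teacher_step, x_hi, sigma_hi, sigma_lo):
    """One consistency-distillation loss evaluation (sketch).
    x_lo is the adjacent, lower-sigma point on the same ODE trajectory,
    obtained with one step of the teacher's solver. The online network sees
    the larger-sigma sample; the target network's output is treated as fixed
    (no gradients flow through it)."""
    x_lo = teacher_step(x_hi, sigma_hi, sigma_lo)
    return float(np.mean((online(x_hi, sigma_hi) - target(x_lo, sigma_lo)) ** 2))

def ema_sync(target_params, online_params, mu=0.999):
    """Update the target network as an exponentially decaying average
    of the online network's parameters."""
    return [mu * t + (1.0 - mu) * o for t, o in zip(target_params, online_params)]
```

The loss is minimised only when both networks map adjacent trajectory points to the same output, which, combined with the σ → 0 boundary condition, forces the student towards the trajectory endpoint.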
Although both are distillation methods, consistency models differ from progressive distillation models (PD) [52], such as the approach applied to jet generation in Ref. [36].

E. Conditional generation

Conditioning the constituent generation on the jet kinematics is preferential over an implicitly unconditional model, as it segments what the model needs to learn along a logical line. The kinematics of a jet are strongly correlated to and influential on its substructure, and therefore on the kinematics of the individual constituents. Providing this information to the diffusion model is observed to improve the overall performance.
Secondly, the jet kinematics can be far more easily learned and modelled with a dedicated normalizing flow than by embedding them in the same network.
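The resulting two-stage generation pipeline can be sketched as follows; both sampler callables are hypothetical placeholders for the trained flow and diffusion models:

```python
def generate_jets(sample_flow, sample_diffusion, n_jets, pid):
    """Two-stage generation (sketch): the normalizing flow provides the jet
    kinematics and constituent multiplicity, which then condition the
    diffusion model that generates the constituents. The two networks share
    no weights and can be trained independently."""
    jets = []
    for _ in range(n_jets):
        kinematics, n_const = sample_flow(pid)                   # stage 1: conditions
        jets.append(sample_diffusion(kinematics, n_const, pid))  # stage 2: constituents
    return jets
```

Replacing `sample_flow` with kinematics drawn from data (or any desired distribution) gives the conditional generation mode without retraining the diffusion model.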

F. Comparison to prior work
Diffusion models are a fast-growing family of generative models, and are becoming more actively studied for their use in high energy physics. Whilst this work was being undertaken, Ref. [36] introduced a fast diffusion model (FPCD) for jet particle cloud generation, exploiting progressive distillation.
Our approach is similar to FPCD in that we train on all jet types simultaneously, use a second model to generate the conditioning variables, and can generate jets with up to 150 constituents. However, our approach differs in several respects: we consider not only transformers but also introduce a faster, novel architecture; we focus on generation speed as well as performance; and we study the use of consistency distillation. We also use a different diffusion paradigm for training and inference. The additional conditioning variables are also modelled differently between the two approaches. In Ref. [36] a second diffusion model is jointly trained to generate the kinematics and N_const of the jets given the PID. However, generating the conditional information is an orthogonal task to the constituent generation, and we find we do not need to train both jointly, as there are no shared weights between the two networks. Using a normalizing flow is also significantly faster at inference than a second diffusion model, and well suited to the structured vector output. In addition, it is unlikely that a fast surrogate model would be used solely for the amplification of statistics from the training distribution, and with conditional generation any desired distribution over the jet kinematics can be achieved.
Other adversarial approaches such as MPGAN [45] and EPiC-GAN [48] rely on implicitly learning the correlations to the jet kinematics without a second network.
In EPiC-GAN a kernel-density estimator (KDE) is used to model the number of constituents, which is a conditional parameter of the generation and is correlated to the jet invariant mass and p_T. However, the KDE does not take into account correlations between the number of constituents and the kinematics of the jet.
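A minimal Gaussian-KDE resampler for the constituent multiplicity illustrates the point; this is a sketch, not the EPiC-GAN implementation, and the bandwidth value is an arbitrary choice:

```python
import numpy as np

def fit_kde_nconst(n_const_samples, bandwidth=1.5):
    """Gaussian KDE over the constituent multiplicity (sketch).
    Note the limitation discussed above: N_const is sampled on its own,
    so any correlation with the jet kinematics is ignored."""
    data = np.asarray(n_const_samples, dtype=float)

    def sample(rng, size):
        centres = rng.choice(data, size=size)                    # pick a kernel centre
        draws = centres + rng.normal(0.0, bandwidth, size=size)  # add kernel noise
        return np.clip(np.rint(draws), 1, None).astype(int)      # round to valid counts
    return sample
```

A conditional model (the normalizing flow used here, or the joint diffusion model of Ref. [36]) instead draws N_const together with the jet kinematics, preserving their correlation.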
Conditioning is also studied in JetFlow [56], where normalizing flows are used to generate jets.

Concurrent with this work, Ref. [57] studied the use of similar cross-attention layers for particle cloud generation. A distinguishing feature of the CAE layer is that it extends to multiple global tokens and utilises attention to redistribute them back to the point cloud, while in Ref. [57] distribution is performed by concatenation to the individual points.

III. RESULTS
A. JetNet30
To assess the quality of the generated jets, we compare distributions of several observables to the JetNet30 test set, which we label as Delphes. For a quantitative analysis, we use the metrics³ introduced in Ref. [35] and Ref. [45]. For each metric we establish an ideal limit by comparing the training and test sets, which corresponds to the natural variation in the Delphes samples.
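For the Wasserstein-based metrics, the 1D distance between two equal-size empirical samples reduces to the mean absolute difference of the sorted values. A sketch of the metric and of the ideal-limit estimate described above:

```python
import numpy as np

def w1(a, b):
    """1D Wasserstein-1 distance between two equal-size empirical samples,
    computed via the quantile coupling (sorted-value) formula."""
    a, b = np.sort(np.asarray(a, float)), np.sort(np.asarray(b, float))
    assert a.shape == b.shape  # sketch assumes equal-size samples
    return float(np.abs(a - b).mean())

def ideal_limit(train_obs, test_obs, metric=w1):
    """Floor of a metric, estimated from the natural variation between
    the training and test samples of the same (Delphes) distribution."""
    return metric(train_obs, test_obs)
```

The same pattern applies to any 1D observable (jet mass, τ_32, constituent p_T), with a generated sample compared against the Delphes test set.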
All PC-Droid samples used in this study are taken from the fully conditional generation regime.We observe very little difference in performance between sampling all conditions from the normalizing flow for each jet, and taking the jet kinematics from the training data.
We test a wide variety of integration solvers in the generation stage for PC-Droid. These include a fourth order linear multistep method (LMS) [58] and DPM-Solver-2 (DPM2) [42]. We study the trade-off between the quality of the generated jets and the number of neural function evaluations (NFE) in order to choose the best solver and a corresponding optimal step size. For this comparison we look at the FPND, W_1^M, and W_1^{τ32} metrics. The generation performance is summarised in Table I for gluon and top jets. It is observed that MPGAN and PC-JeDi already approach the ideal limit for several of the metrics introduced in Ref. [45], thus we focus on metrics where a performance gain can be achieved. Particularly in FPND, PC-Droid manages to significantly bridge that gap, and it surpasses all methods in all metrics.
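As an illustration of the solver family compared here, a second-order Heun integrator for the probability-flow ODE dx/dσ = (x − D(x, σ))/σ of Ref. [31] can be sketched as below; this is a simplified sketch, not the k-diffusion implementation, and `denoise_fn` is a placeholder for the trained denoiser:

```python
import numpy as np

def heun_sample(denoise_fn, x, sigmas):
    """Integrate the probability-flow ODE dx/dsigma = (x - D(x, sigma)) / sigma
    from sigmas[0] down to sigmas[-1] with Heun's second-order method.
    The NFE is roughly twice the number of steps (two denoiser calls per step)."""
    for s0, s1 in zip(sigmas[:-1], sigmas[1:]):
        d0 = (x - denoise_fn(x, s0)) / s0       # slope at the current sigma
        x_euler = x + (s1 - s0) * d0            # Euler predictor
        if s1 > 0:
            d1 = (x_euler - denoise_fn(x_euler, s1)) / s1
            x = x + (s1 - s0) * 0.5 * (d0 + d1)  # trapezoidal corrector
        else:
            x = x_euler                          # final step to sigma = 0 stays first order
    return x
```

Higher-order solvers such as LMS and DPM2 follow the same pattern but reuse slope evaluations from previous steps, which is what allows the quality-versus-NFE trade-off studied here.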

B. JetNet150
FIG. 3: Dependence of the FPND (left), W_1^{τ32} (middle), and W_1^M (right) on the NFE for top jets with up to 30 constituents. Samples are generated with PC-Droid using the Heun and HeunSDE methods [31], DPM2 [42], DPM2 with ancestral sampling (DPM2A), DPM++2M [43], and LMS [58].

Substructure distributions are shown in Fig. 6. Both models do very well in capturing the jet mass and D_2, but the CAEs show a slight drop in performance when looking at the subjettiness ratios. This can also be seen in their correlations, as shown in Fig. 7.
We provide a quantitative comparison between the performance of the PC-Droid models and EPiC-GAN in Table II. PC-Droid with the self-attention network displays superior performance in all metrics where there is notable separation between models. EPiC-GAN seems to offer superior performance in modelling the relative mass, but all generative models are in agreement with Delphes on this metric.
Comparing the performance of models in the context of jet tagging is also useful, as ideally a jet classifier has the same separation power on the generated and reference samples. The FPND score introduced in Ref. [45] attempts to capture this, where the jet classifier is a message passing neural network. Here, we present a simpler and more interpretable method using a 2D cut-based tagger, similar to those used by the ATLAS collaboration.

The conditional kinematics and the point cloud kinematics are obtained differently: the latter are recalculated from the preprocessed jet constituents. These different approaches result in slight discrepancies between the conditional and point cloud variables, even for the Delphes dataset, though the effect is minor. Furthermore, the original jets may have had more than 150 constituents when the kinematics were calculated, which is why the shift in mass and p_T is negative.
For the mass distribution in Fig. 9, PC-Droid shows a near identical spread compared to Delphes. For p_T there is slightly more variation, as the residual magnitudes are higher, but PC-Droid is within 0.4% of the target.
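A 2D cut-based tagger of the kind described above can be sketched as follows; the mass window and τ_32 threshold are illustrative values, not those used in this study:

```python
import numpy as np

def cut_tagger(mass, tau32, mass_window=(150.0, 200.0), tau32_max=0.65):
    """Simple 2D cut-based top tagger (sketch with illustrative thresholds):
    select jets inside a mass window and below a tau_32 threshold."""
    mass, tau32 = np.asarray(mass), np.asarray(tau32)
    return (mass > mass_window[0]) & (mass < mass_window[1]) & (tau32 < tau32_max)

def efficiency(selected):
    """Fraction of jets passing the selection."""
    return float(np.mean(selected))
```

Applying identical cuts to generated and Delphes jets and comparing the resulting efficiencies gives a directly interpretable check that the generator preserves the tagger's separation power.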

D. Consistency models
All diffusion models offer a trade-off between the number of iterations and the quality of the final sample. Consistency models extend this trade-off, as they are able to produce realistic samples with as few as one iteration. In order to improve the generation speed and study its impact on the performance, we train a consistency model using PC-Droid as the teacher. We use the same training configuration as the original paper [44] with one modification: instead of using the Heun integration solver to select adjacent points of the ODE, we find DPM2 leads to improved performance. For generation, we use both one-step (CD1) and five-step (CD5) sampling, and compare against all models. Figure 11 shows three representative metrics for all models as a function of the generation time for producing top jets with 150 constituents on the same hardware.⁴ All scores are calculated with respect to the ideal performance, defined by the Delphes dataset.

⁴ All times are calculated from the average of ten runs, each generating a batch of 512 jets using an NVIDIA® GeForce RTX 3080.
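Multistep consistency sampling can be sketched as below, where a single entry in `sigmas` corresponds to one-shot (CD1) generation; the re-noising schedule is simplified relative to Ref. [44], and `consistency_fn` is a placeholder for the trained student:

```python
import numpy as np

def consistency_sample(consistency_fn, rng, shape, sigmas):
    """Multistep consistency sampling (sketch). One-shot generation maps pure
    noise at sigmas[0] directly to data; each optional extra step re-noises
    the sample to a smaller sigma and maps it back, refining the output."""
    x = consistency_fn(rng.normal(size=shape) * sigmas[0], sigmas[0])
    for s in sigmas[1:]:
        x = consistency_fn(x + rng.normal(size=shape) * s, s)  # re-noise, then denoise
    return x
```

Each refinement step costs one extra network evaluation, which is the speed-versus-quality trade-off seen between CD1 and CD5 below.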
Our fastest model (CD1) is around three times slower than EPiC-GAN and produces a slight shift in the mass of the jets. However, its FPND score is an order of magnitude closer to the ideal score, with τ_32 comparable between the two. CD5 is a further five times slower than one-shot generation, but brings improvements to all metrics. CAE-1 further improves all metrics, particularly W_1^M. Increasing the number of global tokens to 16 results in a 50% longer generation time, but a much improved FPND and W_1^{τ32}. The full model is around 250 times slower than EPiC-GAN, but has near ideal performance in all but W_1^{τ32}. In comparison to the PD models in Ref. [36], our fastest consistency model is just under 30% faster than FPCD (PD1), with notably improved performance, particularly in FPND. When comparing multistep distillation approaches, CD5 is 2.5 times faster than the eight-step FPCD (PD8) model.

IV. CONCLUSION
In this work we have introduced an updated version of the PC-JeDi model for generating jets as particle clouds, called PC-Droid. An improved diffusion noise scheduler and training procedure, as well as more modern integration solvers, combine to yield state-of-the-art results across a wide range of metrics. We also consider more jet types and provide a clean and simple method to perform unconditional generation.
We study an additional network architecture to optimise the trade-off between speed and generation quality, and demonstrate the potential of consistency models for one-shot generation.Even with our fastest models we outperform other competing methods across several of the studied metrics.
Due to its success at jet generation, we expect the findings in this work to be similarly competitive at generating other physical point clouds. A natural application is the simulation of particle showers in calorimeters, which has already seen successful and encouraging performance from the application of diffusion models [32,33].
constituents. To allow comparisons to PC-JeDi, the optimiser, learning rate scheduler, and most of the hyperparameters are left unchanged. The only notable changes are the reduction in the number of transformer encoder layers from four to three, and the inclusion of an extra MLP to combine the cosine embedding of σ with the context vector. For the 150 constituent model we double the token dimension and the width of the MLP layers. As in PC-JeDi, each jet constituent is represented by its coordinates relative to the jet centre, ∆η and ∆ϕ, and its transverse momentum in the form log(p_T + 1).
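The constituent representation can be sketched as follows; the periodic wrapping of ∆ϕ into (−π, π] is our assumption for the sketch, not stated above:

```python
import math

def preprocess_constituent(eta, phi, pt, jet_eta, jet_phi):
    """Represent a constituent relative to the jet centre, as described above:
    (delta_eta, delta_phi, log(pt + 1))."""
    d_eta = eta - jet_eta
    # Wrap the azimuthal difference into (-pi, pi] (our assumption)
    d_phi = (phi - jet_phi + math.pi) % (2 * math.pi) - math.pi
    return d_eta, d_phi, math.log(pt + 1.0)
```

The logarithmic p_T transform compresses the long high-momentum tail so that, after standardisation, the inputs are closer to the unit-variance data the diffusion framework assumes.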
We construct the full CAE by stacking together multiple CAE-Blocks in sequence, and allowing the initial set of global tokens to be fully learnable. If N_global = 1, the distribution attention operation utilises a sigmoid instead of a softmax function. The number of attention operations for the CAE scales as O(N_const × N_global).

FIG. 2: A single cross-attention encoder block updating both an input point cloud x and global tokens g, using layer-norm (LN) and multi-headed attention (MHA). Converging arrows represent vector addition, and contextual information c is injected into the network by concatenating it to the inputs of the MLPs. The first attention operation effectively pools information from the point cloud into the global tokens. The second attention operation is inverted, and information from the updated global tokens is distributed back to the point cloud.
boundary condition means that the global minimum of the training loss is reached only when the student network learns to map each point of the ODE trajectory to its endpoint.At the start of training the student model is initialised with the same parameter values as the teacher model.
In JetFlow, normalizing flows are used to generate jets. Due to the nature of normalizing flows, in comparison to the diffusion and GAN based architectures, JetFlow is not permutation invariant. Furthermore, as normalizing flows have a fixed dimensionality, jets with fewer than 30 particles are zero-padded up to the maximum size, with small levels of noise added to the empty constituents during training. The number of constituents, along with the jet mass, are provided to the network as conditions during training and at inference. To generate new jets with JetFlow, the conditioning variables are sampled sequentially from two cumulative distribution functions (CDF), with a separate jet mass CDF for each value of N_const.

Results for PC-JeDi are obtained using the Euler-Maruyama sampler at 200 NFE [35]. From Fig. 3, we see most solvers saturate at around 100 NFE. While there is no clearly superior method, we observe that the LMS solver performs the best across most metrics, and henceforth we use the LMS solver at 100 NFE for all PC-Droid results. We also find that with most solvers PC-Droid outperforms MPGAN with as few as 20 steps. Accurately modelling individual constituents is a crucial requirement when it comes to jet generation. Figure 4 shows the p_T distributions of the leading, fifth leading, and twentieth leading constituents of the generated top and gluon jets as modelled by PC-Droid and PC-JeDi. PC-Droid demonstrates improved agreement with Delphes compared to PC-JeDi across all the constituents, especially in the tails. Next, we look at the substructure variable distributions of the generated jets and the correlations between them, which are crucial for jet tagging. We specifically look at top jets, as they have the most complex substructure and in our experience have been the hardest to model. In Fig. 5, we see that PC-Droid has much improved τ_32 modelling compared to PC-JeDi, and excellent agreement with the Delphes simulation for all other substructure variables.

FIG. 4: Comparison of the p_T distributions of the leading, fifth leading, and twentieth leading constituents of the generated top and gluon jets with up to 30 constituents.

FIG. 5:

FIG. 6: Comparison of the standard transformer based PC-Droid and the cross-attention encoder (CAE) variants using the generated mass and substructure marginals of top jets with up to 150 constituents.

FIG. 7:

FIG. 9: The correlation between the conditional and point cloud kinematics on top jets with up to 150 constituents. The y-axis shows the difference between the two variables.

FIG. 10:

FIG. 11: Performance as a ratio to Delphes as a function of the required generation time for a top jet with up to 150 constituents. The time for Delphes is taken from Ref. [45].

FIG. 12: Marginals of the conditioning variables generated with Flow-p⃗ (red) compared to Delphes (black) for top and gluon jets with up to 150 constituents.

FIG. 13: Comparison of the p_T distributions of the leading, fifth leading, and twentieth leading constituents of the generated top and gluon jets with up to 150 constituents.

TABLE I: Comparison of generative models on top and gluon jets with up to 30 constituents. Lower is better.

TABLE II: Comparison of generative models on top and gluon jets with up to 150 constituents. Lower is better. The FPND score is only sensitive to the leading 30 constituents.

TABLE III: Top jet selection efficiency and gluon

TABLE IV: Comparison of conditional and unconditional models on jets with up to 150 constituents. Lower is better for all metrics except Cov. The FPND score is only sensitive to the leading 30 constituents.

TABLE V: Comparison of all models on jets with up to 30 constituents. The upper limit of performance is calculated using the Delphes dataset.

TABLE VI: Comparison of all models on jets with up to 150 constituents. The upper limit of performance is calculated using the Delphes dataset.