A PARALLELIZED VLASOV-FOKKER-PLANCK-SOLVER FOR DESKTOP PCS

In order to simulate the dynamics of an electron bunch due to the self-interaction with its own coherent synchrotron radiation it is a well established method to numerically solve the Vlasov-Fokker-Planck equation. In this paper we present a modularly extensible program that uses OpenCL to massively parallelize the computation, allowing a standard desk-top PC to work with appropriate accuracy and yield reliable results within minutes. We provide numerical stability studies over a wide parameter range and compare our numerical ﬁndings to known results.


I. INTRODUCTION
At synchrotron light sources electron bunches of a length of a few millimeters are used to produce coherent synchrotron radiation (CSR) in the terahertz (THz) frequency range.Due to the coherent emission, the intensity scales with the number of emitting particles squared, instead of linearly as for incoherent emission.In storage rings the spatial compression is achieved by using magnet optics with a small momentum compaction factor α c .The compression leads to the micro-bunching instability.On the one hand, this instability limits the electron bunch charge that can be used in stable operation; on the other hand the emerging substructures emit coherent radiation also in wavelength smaller than the electron bunch length.
First observations of micro-bunching in a storage ring as well as of the increase of coherent emission in the spectral range of interest were made at the NSLS VUV ring [1,2].It then has been studied both experimentally, to map out the parameters governing the bursting behavior (e.g. at ANKA [3], BESSY II [4], DIAMOND [5], MLS [6], and SOLEIL [7]), and theoretically to predict thresholds [8][9][10][11] and to simulate the dynamics.To simulate the dynamics, it is possible to solve the Vlasov-Fokker-Planck equation for the longitudinal phase space density [12,13], or, using supercomputers, to do particle tracking with one million macro particles or more [14].
Recent advances in detector development and readout electronics facilitate fast mapping of the micro-bunching instability [15] over a wide range of physical parameters.To cover these settings in simulation as well, it calls for an ultra-fast simulation technique.Also, the simulation tool should be designed such that the influences of both the simulated physics and of numerical effects can be studied and separated.As we are interested in simulating an instability, in particular sources of numerical instabilities should be ruled out.In this paper we present Inovesa (Inovesa Numerical Optimized Vlasov-Equation Solver Application, available at [16]), a Vlasov-Fokker-Planck solver that runs on standard desktop PCs and yields robust results within minutes.
Section II summarizes the theoretical description of the problem.The actual implementation is described in section III, while section IV presents numerical studies that show the robustness of the implementation and compares results of simulation and measurements.

II. LONGITUDINAL PHASE-SPACE DYNAMICS
A. Vlasov-Fokker-Planck equation The phenomenon of micro-bunching happens in the longitudinal phase space, which is spanned by the position z relative to the synchronous particle and the energy E. Taking the particle density ψ(z, E, t) of electrons in a storage ring to be a smooth function, its evolution with time t can be described by the Vlasov-Fokker-Planck equation (VFPE).Following the notation of [10] it reads with the time given in multiples of synchrotron periods θ = f s t, the normalized coordinates q = z/σ z,0 , and p = (E − E 0 )/σ δ,0 , the Hamiltonian H, the reference particle's energy E 0 , and β = 1/(f s τ d ), where τ d is the longitudinal damping time.The quantities σ δ,0 and σ z,0 describe, respectively, energy spread and bunch length in the equilibrium state that exists for small bunch charges.
To solve this partial differential equation there exists a widely used formalism by Warnock and Ellison [17].It uses a grid to discretize ψ(q, p) and assumes that the collective force due to self-interaction with the bunch's own coherent synchrotron radiation is constant for small time steps.The perturbation due to the collective effects is described as a perturbation to the Hamiltonian H(q, p, t) = H e (q, p, t) where Q c is the charge involved in the perturbation, and V c is the potential due to the collective effect, which can be expressed in terms of an impedance Z c .It is then possible to use the homogeneous solution, which in the unperturbed case is represented by a rotation in phase space, and add the influence of diffusion and damping as a particular solution.To model the perturbation, the influence of V c is implemented as a 'kick' along the energy axis.

B. Micro-Bunching Instability
To calculate the effect of the perturbation term introduced in Eq. 2, one needs the electric field E(q, s) at the longitudinal position s.It is convenient to express it via a wake potential which directly gives the energy difference for the electrons (in eV).The wake potential can be obtained from the wake function W (q), which describes the field produced by one single particle.The wake potential V (q) then is obtained by convolving it with the charge density [17] V As in frequency space closed and smooth expressions exist for many commonly used impedances, we decided to work in frequency space.Then the wake potential can be deduced directly from the impedance Z(k) in every time step using where ˜ (k) is the Fourier transform of the bunch profile.This method allows to implement different impedance models for Inovesa in just a few lines of code.As we are mostly interested in CSR-driven dynamics, we currently implemented two cases.The first, the free space CSR impedance, describes the effect of coherent synchrotron radiation of particles traveling on a curved path in vacuum.A good approximation for the CSR impedance in a perfect circle is [18]  FIG. 1. Unshielded (free space) CSR impedance (FS) and CSR impedance shielded by parallel plates (PP), calculated for the case of an accelerator with frev = 8.582 GHz, and (for the shielded case) g = 32 mm.For high frequencies both impedances converge, as short wavelength are not effected by the shielding.For f → 0 Hz both impedances approach Z = 0 Ω.
where Z 0 is the vacuum impedance, and n = f /f rev is frequency expressed in multiples of the revolution frequency f rev .
The second implemented impedance approximates the shielding effect of the beam pipe by two parallel plates with distance g [18].It can be approximated with the Airy functions Ai and Bi [19]: Ai (u p )Ci (u p ) + u p Ai(u p )Ci(u p ), (7) where the prime marks the first derivative with respect to the argument, R is the beam path's radius, Ci := Ai − j Bi, and This more complex impedance has already proven to describe the micro-bunching instability threshold within the uncertainty of the measurements [15].For the limit g → ∞ it converges to the free space impedance.Figure 1 shows these two impedances, for the case of an accelerator with f rev = 8.582 GHz, and (for the shielded case) g = 32 mm.

A. Discretization
Following the approach of Warnock and Ellison [17], the VFPE is discretized on a grid to be solved numerically.For Inovesa, we define a grid starting from a minimum value that can be expressed for each dimension.
For q > q min and p > p min , the generalized coordinates q, p ∈ R become the grid coordinates x r , y r ∈ R ≥0 .With ∆p, the granularity of the grid in energy direction, the transformation between the coordinate systems can be expressed as y r (p) = (p − p min )/∆p, (9) and accordingly for x y (q).Function values are only stored at integer coordinates m, where m refers to either x or y.When a function value at an arbitrary noninteger coordinate m r is needed, Inovesa approximates it by interpolation using were {m} = m r − m denotes the fractional part of m r .The interpolation multiplies the vector of the function values at the N surrounding mesh points with a vector containing N interpolation coefficients where l ν,N ({m}) are the Lagrange basis polynomials [20] l ν,N ({m}) := We have implemented this for up to N = 4, which results in a cubic interpolation scheme.Interpolation using these polynomials sometimes overshoots the values of the neighboring grid cells.Section IV B will show that these numerical artifacts can even self-amplify and become a very dominant feature in the simulation.To avoid overshooting, one has to use clamped or saturated interpolation functions.A simple, clamped version of Eq. 10 reads Keeping in mind that not only q, p ∈ R is discretized to x, y ∈ N but also ψ(x, y) is discretized in the computer's memory, we analyzed the effect of this second discretization.To be able to change the accuracy in small steps, we used a fixed point representation [21] where the number of bits for the fractional part could be chosen freely.The result of this test will be shown in the convergence studies (section IV A).

B. Simulation steps
Each time step f : ψ t (q, p) → ψ t+∆t (q, p) is composed of a number of simulation steps that model rotation, kick, damping and diffusion f = f rot + f kick + f damp,dif f .Inovesa splits the direct implementation of each of these simulation steps into two sub-steps.The information on the actual coordinates (q, p) is used by the first half simulation step.We call its result a 'source map' (SM).Then, in the second half step only the grid coordinates x, y are used.Further we define z = N y × x + y with the number of grid cells in energy direction N y .Doing so, the function f : ), which depends only on z.
This method -by construction -produces the same results as the single-step implementation.Practically speaking, the source map expresses the information which data of the current simulation step contributes to a grid point for the next simulation step directly in terms of position in the computer's memory.For many functions -such as rotation -the SM will look the same for the whole runtime of the program.For that reason, it only has to be computed once during a simulation run and can be kept for multiple usages.
The source map formalism does not only allow to keep intermediate results, it also gives a handy interface to implement arbitrary functions on the phase space.Furthermore, the reduction of the problem's dimension also leads to a speedup of the calculations.Results of a benchmark of the computational performance for the particular case of rotation are shown in Fig. 2. In this case source mapping halves the runtime.Parallelization using OpenCL [22] allows further speedup.One advantage of OpenCL is that it can utilize not only multi-core CPUs but also graphic processors.In total, a non-optimized program takes days for a typical simulation run.Using the method described above, Inovesa can reduce this to 15 minutes when running on a customer grade graphics card.
Using the SM formalism, we implemented different ver- for 1000 cubically interpolated rotation steps using different implementations.In this test case, the grid has 512 points per axis, no optimizations (besides SM) were used.The first bar represents the computational time a direct implementation of the rotation takes.The second and third bar show the time the implementation using the source map formalism takerunning sequentially or parallel on the integrated graphics processor.
sions of the simulation steps necessary to solve Eq. 1.
For the rotation, we provide a direct implementation (as in [17], 'standard rotation').Additionally, we use a symplectic integrator [23] to implement the rotation.This method offers two advantages: First, since the method is symplectic, it is automatically area preserving -also for truncated power series.Using the map from an infinitesimal rotation given by a direct implementation, symplecticity is easily lost in the numeric treatment.Second, the symplectic method provides additional numerical stability also for the interpolation.The standard rotation algorithm requires a two-dimensional interpolation, which involves N × N data points, leading to interpolation coefficients l 2 ν,N ∝ {x} N −1 × {y} N −1 1.The symplectic map, in contrast, splits the rotation into a energy-dependent drift followed by a location dependent RF kick.Each requires a one-dimensional interpolation, involving N data points.The resulting coefficients are l 1 ν,N ∝ {m} N −1 l 2 ν,N , minimizing the vulnerability to numerical absorption.Since a point is rotated by moving on straight lines in perpendicular directions, in the following the symplectic approach will be referred to as 'Manhattan rotation'.
The damping and diffusion terms (right hand side of Eq. 1) need numerical differentiation.We found that the same type of artifacts that we found for the interpolation (see section IV B) can also occur because of the differentiation.This can be explained by the fact that the algorithm usually used for numerical differentiation [20] is equivalent to differentiating the quadratic interpolation polynomial P 3 (see Eq. 10) where the distance between the sampling points in our case is ∆x = 1.As a consequence, we target this by using the cubic interpolation polynomial P 4 , and obtain Aside from this improvement, we proceed analogously to [17].
To implement the perturbation via a kick, we just had to translate the wake potential (see Eq. 5) to the grid coordinate system using Eq. 9. Furthermore, our implementation uses the fact that both Z(k) and ˜ (k) are Hermitian.This means that optimized algorithms like the ones from Ref. [24,25] only need explicit function values for k ≥ 0 to perform the inverse Fourier transform.Using this symmetry also brings an improvement in both speed and memory usage by roughly a factor of two [24].IV.RESULTS

A. Convergence studies
In this section we compare the effect of different numerical settings, which ideally should not affect the physical result.We also compare different implementations of the rotation as described in section III B. For the sake of simplicity, we go to the unperturbed case (meaning H c = 0 in Eq. 2).So any starting distribution should exponentially converge to a Gaussian distribution with σ q = σ p = 1.
At first, we investigate the effect on the results of using different data types.To do so, we observe the evolution of a Gaussian distributed charge density with σ q = σ p = 2.8, when the damping time is set to five synchrotron periods τ = 5 T s .We start by reading the distribution from a 16-bit grayscale PNG.This brings initial Exponential damping of the RMS energy spread for different numbers of computation steps per synchrotron period.(Here the grid size is set to 256.)To emphasize the differences, the analytical, exponential damping (σ δ,t ) has been subtracted from the simulated values.Every color represents a different number of simulation steps per synchrotron period.Until T = 100 Ts, all simulations reproduce the exponential behavior.Manhattan rotation (dashed lines) reproduces the set values (solid gray line) better than standard rotation (other solid lines) independently of the number of steps.However, when the number of steps per synchrotron period becomes too large, the tiny step size leads to systematic drifts also for this more robust method.
quantization noise, but also provides well defined starting conditions for every data type: Initial rounding errors will be the same in the different runs.Figure 3 shows the different simulation results.We find that the results obtained using fixed point representations with a fractional part of at least 44 bits, and the tested floating point representations (binary32 and binary64 [26], often referred to as 'single precision' and 'double precision') show a common difference from the analytic result of about 10 −4 .The relative differences between the data types are less then 10 −6 and therefore can be neglected.As most libraries are developed focusing on floating point data types, we implemented the calculation of the wake potential only for those types.In the following, we default to binary32 because it is much faster than binary64.
To investigate the influence of the size of the time steps, we observe the evolution of a Gaussian distributed charge density with σ q = σ p = 1.45.For this we set the damping time to τ = 45 T s .The Number of simulation steps (∆T ) per T s is varied between 500 and 4000.All settings reproduce the exponential damping.However when σ q approaches 1 some of the runs start to diverge.As depicted in Fig. 4 for Manhattan rotation the error on the reconstructed values is significantly lower (usually 1%).Furthermore the Manhattan rotation is more robust against changing the step sizes.Note that the error increases when the number of steps per synchrotron period becomes too large.For standard rotation on the other hand there is no obvious optimum for the size of the time steps.
In another example to test the influence of the grid size we study the evolution of a Gaussian distributed charge density with σ q = σ p = 1.From the physics point of view it is expected that the distribution stays constant with time.However, as illustrated in Fig. 5, it is observed that σ p converges to different values or even diverges depending on the numerical parameters.For the standard rotation with the specific time step of ∆t = 1/1000 T s , there is no mesh size where the simulation converges.Manhattan rotation on the other hand shows initial jitter with a relative amplitude smaller than 1% and reliably converges afterwards.
It can be seen that different combinations of the relative time step θ, the damping factor β, and the grid size may converge to slightly different values -or even diverge.We found that Manhattan rotation is much more robust than the standard approach.Provided that θ and β do not become too small, the relative error can be reliably kept below 1% (see Tab. II).

Quadratic interpolation Cubic interpolation
FIG. 6. Evolution of the energy spread over time.For the case where cubic interpolation is used (solid black line), the energy spread stays at σ δ ≈ 1.01σ δ,0 .When using quadratic interpolation (dashed blue line), an increase in energy spread can be observed at T = 1 Ts.Looking at the bunch profiles at the points in time marked by the disks (see Fig. 7) reveals that this increase is a numerical artifact.

B. Numerical artifacts
Besides the numerical inaccuracies discussed above there are also numerical artifacts.For those we identify two main sources: interpolation and numerical differentiation.
As an example, we do two simulation runs with the same current distribution, one run using cubic interpolation, the second using quadratic interpolation.For the simulation run that uses cubic interpolation, the distribution stays in a relatively calm state with just little oscillation.Note that this is not equilibrium: As expected for the unshielded CSR case and ξ > 0.5, we have σ δ > σ δ,0 [9] and an oscillation with f ≈ 2f s .The complete set of simulation parameters is listed in Table III.If there was no influence of the different interpolation schemes, one would expect no difference between the two runs.However, as Fig. 6 depicts, this is not guaranteed.
The energy spread simulated using quadratic interpolation rapidly increases to a higher value after one synchrotron period (T = 1 T s ).On a longer time scale it TABLE III.Parameters for an example run to check for numerical artifacts.For the given set no artifacts were observed.Changing to quadratic interpolation however, triggered the occurrence of (non-physical) structures with a period length of two grid cells (see Fig. 7).T=0.5 T s T=1.0 T s T=1.5 T s T=2.0 T s FIG. 7. Bunch profiles of the simulation run using quadratic interpolation at selected points in time (cf.discs in Fig. 6).The bunch profiles computed during the initial increase of the energy spread show large ripples with a period length of two grid cells.The earlier and later profiles do not show such structures.This implies that the increase of energy spread is driven by a numerical instability.

Parameter
will damp down again, and after that a new numerical instability might rise.If the aim is to find out whether the simulated conditions are above the micro-bunching threshold these artifacts might not matter -below the threshold the initial, numerical modulation should not be amplified.However, we want to track the evolution of the charge distribution, and this numerical artifact might be interpreted as an unphysical slow bursting frequency.So we have to avoid numerical artifacts as they occur in the run using quadratic interpolation (depicted in Fig. 7): The period length of the ripples is exactly two grid cells, and they continue to exist even to a position where they create negative charge densities.In this particular case, the instability is clearly triggered by numerical artifacts (overshoots) of the interpolation.
Note that also higher order polynomials show overshoots and that also numerical differentiation can be expressed in terms of interpolation polynomials.In our tests, however, we did not find any case where a more complex differentiation method than the one described in Eq. 15 was needed to avoid artifacts.In contrast to the differentiation, there were rare cases where we observed interpolation artifacts even when using cubic interpolation polynomials.Those we had to suppress by clamping (Eq.14) -which restricts interpolation to the range of the neighboring values.

C. Comparison with measurements
For our comparison with measurements results, we scale the quantities that are a function of the revolution frequency by a factor of 2πR/C where R is the bending radius and C is the circumference of ANKA.This way measurements are comparable to the simulationswhich assume an isomagnetic ring with the same bending radius.The parameters used here are listed in Table IV.Note that this is just one set of possible parameters be-cause the magnet optics (and thus f s ) as well as the RF voltage V RF can be gradually changed at ANKA.
There are some effects we neglect for the simulation such as the frequency response of the used detector, and the coherent tune shift observed in the measurement.Also, contributions from other impedances than shielded CSR (e.g.geometric impedance) are not studied.
We do separate simulation runs for about 150 different currents between I = 1.3 mA and I = 0.5 mA.For the first one, we start with the highest current and a Gaussian charge distribution that is significantly broader than the expected distribution.To compensate for this, we allow some extra time for convergence.For the following simulation runs, we take the final charge distribution of the run before that has the (slightly) higher current as starting parameters.(Different approaches to create starting distributions are discussed in the Appendix A.) For each of these runs, the simulation time on a AMD Radeon R9 290 graphics card is a bit more then ten minutes, which makes a total simulation time of about 19 hours.
The spectrum of the emitted CSR is calculated using where (Z k ) is the real part of the impedance and |ρ(k, t)| 2 the form factor of the bunch profile.It is then integrated to obtain the power a detector would measure In analogy to what is done for the measured data, the resulting signal over time is Fourier transformed to obtain a spectrogram of the 'bursting' frequencies.Figure 8 shows this spectrogram of P (t).The general structure of the simulation and the measurement results agree very well: The instability threshold is marked by the occurrence of an isolated finger pointing down down to I ≈ 0.21 mA).For slightly higher currents, the finger broadens and fluctuations in the lower frequency range (f < 10 kHz) start.A third regime is observed at the highest currents.It shows parallel frequency lines that stay approximately constant with changing current.There are slight mismatches, e.g. in the threshold currents and in the frequencies, but most features are well reproduced by the simulation.This means that for ANKA not only the thresholds (see also [15]) but also the dynamics of the micro-bunching instability are governed by an impedance that can be approximated by the parallel plates CSR impedance.To explain the details, a more complex model will be needed.Two possibilities are to take into account higher orders of the momentum compaction factor α c or additional impedance contributions.

V. SUMMARY
We introduced Inovesa, a Vlasov-Fokker-Planck-solver that uses a runtime-optimized implementation of the computation steps.Utilizing OpenCL for parallelized computation, it can simulate the dynamics of the longitudinal phase space more than a 150 times faster than a non-optimized implementation -using a dedicated (consumer-grade) graphics card.Furthermore, we eliminated sources of numerical artifacts and have done numerical stability studies to show that relative errors can usually be kept clearly below 1%.
Using Inovesa we were able to simulate the dynamics of the longitudinal phase space of ANKA in the regime of the micro-bunching instability.To do so, we used an impedance model that assumes CSR of electrons moving on a circular path shielded by parallel plates.Considering the simplicity of the model, the numerical results show an excellent agreement to the measurements.FIG. 8. Example for a simulated (left) and a measured (right) bursting spectrogram.For the simulated spectrogram the axis are scaled with a factor of 2πR/C to correct the mismatch due to the isomagnetic approximation.There are small differences, e.g. in the threshold current and in the frequencies.However, keeping in mind the very simple model, the general structure of the spectrograms matches quite well: There is an isolated finger pointing down (at f ≈ 35 kHz); the fingertip (I ≈ 0.21 mA) marks the instability threshold.For slightly higher currents, fluctuations in the lower frequency range (f < 10 kHz) start and the finger broadens.For the highest currents displayed here, there is a regime showing parallel frequency lines that stay approximately constant with changing current.
One possibility is to start the simulation with a Gaussian distribution that is broader than the expected charge distribution.It will damp down until the physical state of the instability is reached.Although for currents well above the instability threshold any possible starting distribution will reach the same state, we chose these initial conditions because it shows good convergence.When alternatively using narrower distributions unphysical structures form and might persist for a long time.
In Fig. 9 the evolution of the RMS energy spread over 500 synchrotron periods is shown for three different start- ing distributions at I = 1.5 mA.When using the final distribution of a previous run with slightly higher current (here I = 1.54 mA), the simulation converges quasi instantaneously.As shown, the Haïssinski distribution is immediately blown up and becomes larger than the Gaussian distribution that has been set to be larger than the expected distribution.Also in the beginning (t < 300 T s ) the oscillation is systematically enlarged.

FIG. 2 .
FIG. 2. Computational time needed by an Intel Core i5 4258Ufor 1000 cubically interpolated rotation steps using different implementations.In this test case, the grid has 512 points per axis, no optimizations (besides SM) were used.The first bar represents the computational time a direct implementation of the rotation takes.The second and third bar show the time the implementation using the source map formalism takerunning sequentially or parallel on the integrated graphics processor.

FIG. 3 .
FIG.3.Exponential damping of the bunch length σz as a function of time for different data types.There is an initial jitter due to quantization noise in the test pattern that has been read in from a 16 bit PNG file and due to the fact that the initial distribution is not fully covered by the grid.Even with this problematic starting conditions, after T = 60 Ts, all data types have converged to a constant value.Note that all simulation curves are almost perfectly overlapping, there is no noticeable difference coming from the used data type.The values the simulations converge to are listed in Tab.I.
FIG. 4.Exponential damping of the RMS energy spread for different numbers of computation steps per synchrotron period.(Here the grid size is set to 256.)To emphasize the differences, the analytical, exponential damping (σ δ,t ) has been subtracted from the simulated values.Every color represents a different number of simulation steps per synchrotron period.Until T = 100 Ts, all simulations reproduce the exponential behavior.Manhattan rotation (dashed lines) reproduces the set values (solid gray line) better than standard rotation (other solid lines) independently of the number of steps.However, when the number of steps per synchrotron period becomes too large, the tiny step size leads to systematic drifts also for this more robust method.

FIG. 9 .
FIG. 9. Evolution of the RMS energy spread over a time of 500 synchrotron periods for I = 1.5 mA.The width of the line is caused by high frequency oscillations.For the starting distributions the Haïssinski distribution (equilibrium at I = 675 µA), a Gaussian distribution with σp = σq = 2.25, and the final distribution of the previous run ("adiabatic", I = 1.54 mA) were used.Note that the Haïssinski distribution is immediately blown up and becomes larger than the Gaussian distribution, and that the high frequency oscillation is systematically higher until T ≈ 300 Ts.

TABLE IV .
Parameters of an isomagnetic accelerator comparable to ANKA (t d , h, fs, and frev are scaled by 2πR/C = 0.316).