Optimizing a Superconducting Radiofrequency Gun Using Deep Reinforcement Learning

Superconducting photoelectron injectors are a promising technique for generating highly brilliant pulsed electron beams with high repetition rates and low emittances. Experiments such as ultra-fast electron diffraction, experiments at the Terahertz scale, and energy recovery linac applications require such properties. However, optimization of the beam properties is challenging due to the large number of possible machine parameter combinations. In this article, we show the successful automated optimization of beam properties utilizing an already existing simulation model. To reduce the required computation time, we replace the costly simulation with a faster neural-network approximation. For optimization, we propose a reinforcement learning approach that leverages the simple computation of the derivative of this approximation. We show that our approach outperforms common optimization methods in terms of the required number of function evaluations at a given minimum accuracy.

few of them are measurable, and not all of them are adjustable. The correct alignment of these parameters has a significant influence on the performance of the complete accelerator and the operability of the downstream experiments.
In Fig. 1 we provide a schematic view of the components of the electron gun and the locations of the input and output parameters. With the first viewscreen approximately one meter downstream of the SRF gun module, we can measure four parameters that describe the behavior of the extracted electron beam: the transverse (horizontal and vertical) position and the transverse beam size. Because these properties determine the quality of the resulting beam, we will minimize the horizontal and vertical beam size and center the horizontal and vertical beam position. We define this as our optimization task, which is addressed in this article.
To tackle this optimization task, we use a technique from the area of machine learning called reinforcement learning (RL) [2,3]. The basic idea is that an algorithm, which we call the agent, influences an environment by performing actions on it. The agent decides to take an action a; we call the rule base used to make this decision the policy. In our case, an action is a change of solenoid angles and positions, which has an impact on the electron beam. This action leads to changes in the environment. The environment is in our case a simulation approximation of the electron gun, because both measurements on the real machine and the original simulation are too slow for training our agent. We define a reward function R_l that indicates how well action a is suited to achieve our optimization goal, i.e., to optimize the quality of the beam. To choose the following action, the agent receives a state variable s which provides information about the state of the environment. Based on the state and reward, the agent decides which action to perform next [2]. We give a schematic view of the RL cycle in Fig. 2.
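The agent-environment cycle described above can be sketched in a few lines of Python. The quadratic reward and the random policy below are illustrative placeholders, not our actual surrogate model or trained agent:

```python
# Minimal sketch of the RL cycle: the agent observes a state, chooses an
# action via its policy, and receives a reward from the environment.
# Reward and policy here are toy stand-ins for illustration only.
import random

def environment_reward(state, action):
    # Placeholder reward: larger (closer to zero) when the action
    # compensates the state offset.
    return -(state + action) ** 2

def policy(state):
    # Placeholder policy: a random action from a bounded, safe interval.
    return random.uniform(-1.0, 1.0)

random.seed(0)
state = random.uniform(0.0, 1.0)
for step in range(5):
    action = policy(state)
    reward = environment_reward(state, action)
    # A learning agent would now update its policy based on (state, action, reward).
```

A real agent replaces the random policy with a learned one that maps states to reward-maximizing actions.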
The main advances provided by the method proposed in this article are: Fast inference requiring fewer reward evaluations. We compare different strategies for optimizing the beam properties to a particular level of accuracy. We expect our RL approach to outperform other local optimization algorithms in terms of the required reward evaluations, which we can equate with computational time. Once we have trained our RL agent, we can change the parameters and quickly execute the optimization. This procedure differs from the previously used local optimization algorithms, which require several hundred thousand simulation evaluations.
Compound solution for the optimization task. In this article, we propose a compound pipeline for solving the optimization task of an electron gun. This method includes the solution of the offset determination task, which we covered in [4]. Our proposed pipeline is a considerable step towards automated self-optimization of the radiofrequency photoinjector.
Explainability of decisions. Furthermore, we will analyze the learned policy, which gives a basic understanding of how the RL agent makes choices. This explainability makes the decisions more trustworthy. Moreover, the agent chooses actions from a limited interval of parameter values, which ensures the safe execution of these actions at all times.
In the next section, we will give an overview of the state of the art of RL. Furthermore, we will present already existing approaches to optimization tasks in electron gun and accelerator settings. In Section III we will describe our proposed RL agent, i.e., we will define the optimized reward function and how our policy is updated. We will compare our approach with several local optimization algorithms in Section IV and give a conclusion and outlook in Section V.

II. RELATED WORK
We first briefly introduce RL based on [3] and then show several applications of machine learning and RL in a synchrotron context. The interested reader can find further details on RL in [2].
RL problems are modeled as Markov decision processes. That means we assume that the probability of a transition from one state s to another state s′ depends only on s and not on the predecessors of s. One of the most basic variants of RL is Q-learning: it uses a table as the policy, which contains the states and their Q-values, i.e., the expected rewards for an action taken in a given state [2].
We calculate the expected discounted return G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1} to measure the performance of a policy. The value γ ∈ [0, 1) is a discount rate which weights rewards further in the future less strongly. The reward obtained during the transition from s_t to s_{t+1} is denoted by r_{t+1}.
The policy table is updated according to the Bellman equation: Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)], with learning rate α. More general RL methods use a stochastic policy function π : S × A → [0, 1] with π(s, a) = P(a|s); stochastic means here that the policy is represented as a distribution over actions.
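The tabular Q-learning update can be sketched in a few lines; the two-state toy problem, the learning rate value, and the single transition below are illustrative placeholders:

```python
# Tabular Q-learning sketch on a toy problem with 2 states and 2 actions.
# Q[s][a] estimates the expected discounted return of taking action a in state s.
alpha, gamma = 0.5, 0.9          # learning rate and discount rate
Q = {s: {a: 0.0 for a in (0, 1)} for s in (0, 1)}

def q_update(s, a, r, s_next):
    # Q-learning update derived from the Bellman equation: move Q[s][a]
    # towards the observed reward plus the discounted best next Q-value.
    best_next = max(Q[s_next].values())
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

# One illustrative transition: in state 0, action 1 yields reward 1.0
# and leads to state 1 (whose Q-values are still zero).
q_update(0, 1, 1.0, 1)
```

After this single update, Q[0][1] moves from 0.0 halfway towards the target 1.0.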
In settings with a discrete action space, we can estimate the Q-value for each state-action pair. However, in continuous action space settings, this is not possible. Policy gradient methods learn the policy directly and thus can also map an input to continuous action spaces. The target is to maximize the objective function J(θ) = E_{π_θ}[R_t], i.e., we search for the parameters θ that maximize the expected discounted reward. The policy π_θ can be represented by a neural network, and the parameters θ are the weights of this neural network.
Typically, we initialize the weights θ of neural networks randomly. The stochastic policy gradient theorem provides an estimate of the gradient along which the weights of the neural network must be updated to improve the reward obtained with the chosen actions. However, the stochastic policy gradient theorem depends on the unknown Q-value q_π(s, a). We can approximate it by using the actual reward r_t observed after that action; this approach is called the REINFORCE learning rule [5]. Another way to solve this problem is to train a second neural network to approximate the Q-value directly. This approach is called actor-critic: the actor learns a policy π_θ based only on the state, while the critic learns to evaluate the Q-value and feeds this information back to the actor. A prominent actor-critic algorithm for continuous action spaces is the deep deterministic policy gradient (DDPG) algorithm, which has been successfully applied to many continuous control problems.
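For reference, the stochastic policy gradient theorem and its REINFORCE approximation can be written as follows (standard textbook formulations, not reproduced from this article):

```latex
% Stochastic policy gradient theorem:
\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\, q_{\pi}(s, a)\right]
% REINFORCE replaces the unknown q_\pi(s, a) by the actually observed reward:
\nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, r_t\right]
```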
When an RL agent incorporates a neural network, the method is called Deep RL [8]. Deep RL approaches similar to the one used in this study have already been successfully applied to different application scenarios at BESSY II [9], e.g., booster current optimization, injection efficiency, and orbit correction. The method used in these scenarios is DDPG.
Various methods have been proposed for local optimization and will serve as references in this study. We will compare the Nelder-Mead simplex algorithm [10], Powell's method [11], and gradient descent [12] with our proposed RL approach.
Other approaches exist for optimizing a radio frequency photoinjector with objective functions similar to the one used in this article. As a first step, [13] shows the use of a convolutional autoencoder that compresses image data of the longitudinal phase space, a first step towards using images in a succeeding optimization algorithm. Another study optimizes an RF photoinjector using multiobjective Bayesian optimization [14]. The authors tune an electron gun's parameters efficiently and show that they can find solutions sufficiently near the Pareto front of the beam optimization problem.

Fig. 3. Overview of the pipeline: Simulations → Surrogate Model → Offset Finding and Optimization.

III. METHOD
We depict the general approach in Fig. 3. We first simulate the electron gun with randomly chosen parameters and create a database of the outcomes. The next step is to learn a surrogate model as a faster replacement for the simulation. We use this model for the following offset finding and optimization steps. The focus of this study is the optimization part.
The basic idea behind our approach is an extension of the technique we used in [4]. The target there was to find the offsets of the input parameters and output screen variables, where offset means the difference between the simulation and the real device. We used the ASTRA (A Space Charge Tracking Algorithm) simulation for physical modeling [15]. It is physically precise but computationally intense: the assessment of one combination of parameters requires, on average, about five minutes of calculation time per core on a current CPU.
For solving the offset optimization problem, we used a local optimization method. However, local optimization algorithms rely on thousands of simulation evaluations to solve the optimization problem, which makes direct optimization with the simulation computationally infeasible.
To overcome this issue, we trained a surrogate model, a neural network that replaces the simulation. The evaluation time of this surrogate model is on the scale of several hundred milliseconds. We generated the training data for this neural network with uniformly distributed input parameters in the ranges defined in Table I. For this surrogate model, we used 546689 samples created with ASTRA. We used feature scaling for the parameters and the simulation output. The neural network consists of five layers (input layer excluded). The number of neurons increases to 2002 in the first layer, after that decreasing logarithmically to 447, 100, and 20 neurons, and finally returning five outputs. The overall mean squared error obtained by the trained surrogate model is about 1.13×10^−5 [4]. It is important to note that one should only use the surrogate model within the ranges specified in Table I. The error can be substantial for parameters outside this range, since neural networks cannot extrapolate.
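The layer structure described above can be sketched as a plain forward pass. The input width (here 14) and the random weights are placeholders for illustration, so only the shapes reflect the described architecture; the hidden-layer activation (ReLU) and linear output are assumptions, since the article does not specify the surrogate's activation functions:

```python
import numpy as np

# Layer widths of the surrogate network as described in the text:
# input -> 2002 -> 447 -> 100 -> 20 -> 5 outputs.
rng = np.random.default_rng(0)
widths = [14, 2002, 447, 100, 20, 5]   # 14 is a placeholder input width
weights = [rng.normal(0, 0.1, (m, n)) for m, n in zip(widths[:-1], widths[1:])]
biases = [np.zeros(n) for n in widths[1:]]

def surrogate_forward(x):
    # ReLU on hidden layers, linear output layer (both assumptions).
    for W, b in zip(weights[:-1], biases[:-1]):
        x = np.maximum(x @ W + b, 0.0)
    return x @ weights[-1] + biases[-1]

out = surrogate_forward(rng.uniform(0, 1, 14))   # five output screen variables
```

With random weights the output values are meaningless; a trained model of this shape maps the normalized machine parameters to the five screen variables in milliseconds.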

TABLE I. Label, parameter, interval, and unit of the input parameters.

To find the offsets, we used the basinhopping algorithm [16]. Basinhopping takes a random step and then runs a local minimization algorithm. If the result improves, the step is accepted and a new step is executed; if it worsens, the algorithm discards the current step and randomly chooses another one [16].
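A minimal usage sketch of SciPy's `basinhopping` on a toy multi-minimum objective (a stand-in, not our actual offset objective):

```python
import numpy as np
from scipy.optimize import basinhopping

# Toy objective with several local minima; its global minimum lies near x = 2.
def objective(x):
    return (x[0] - 2.0) ** 2 + np.sin(5.0 * x[0])

np.random.seed(0)  # fix the RNG used for the random hops
result = basinhopping(
    objective,
    x0=[0.0],                                 # starting point
    niter=200,                                # number of random hops
    stepsize=1.0,                             # initial hop size
    minimizer_kwargs={"method": "L-BFGS-B"},  # local minimizer after each hop
)
```

Each hop perturbs the current point and re-runs the local minimizer; `result.x` and `result.fun` hold the best point and value found.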
For testing purposes, we assumed random offsets, which our algorithm needs to approximate. We assume that this level of accuracy is sufficient for the subsequent optimization steps. This assumption is based on the fact that the offsets can be checked independently of the input parameters via the output screen variables, as shown in [4].
We will now take a closer look at the beam optimization task. We give a brief overview of the interaction of all parameters and optimization targets in Fig. 4.
The parameters of the electron gun (Table I) are divided into three groups: • State parameters s: We can only observe but not change these parameters. They include the laser pulse length, the spot size on the cathode, and the horizontal and vertical laser position. Additional state parameters describe the gun peak and bias field: the gun peak field is the maximal amplitude of the accelerating electrical field, and since the photocathode is electrically isolated from the rest of the gun cavity, an additional voltage can be applied, described by the gun bias field. The field flatness characterizes the planarity of the cavity field, and another state parameter specifies the longitudinal cathode position.
• Action parameters a: Our agent can change these parameters. These are the solenoid horizontal and vertical position and the angles with respect to the x-and y-axis.
• Integral parameters t: Like the state parameters, these are not modifiable by the agent, but they are scanned over multiple equally spaced constant values. An automated software procedure can easily change them in the real device. However, to limit our action space, we assume our agent cannot modify them; instead, the machine scans over them in a defined set of values. The two parameters in this group are the solenoid's focal strength and the electron gun's emission phase. The emission phase is the arrival time of the laser pulse relative to the sine wave of the high-frequency field of the electron gun.
The action parameters modified here are all enclosed in the cryogenic encapsulation, which means they all have to be controlled by motors in the real device [17]. However, the electron gun of Sealab has not yet been commissioned. Therefore, we can only use simulated data for testing in this study.
Our optimization function is composed of five optimization criteria. The function l(x) := min(−|x|, −ε) limits the optimization of the components R_1, R_2, R_3 to a defined maximum accuracy level ε > 0. This limitation leads to smoother convergence and avoids over-focusing on one component of the reward function. In the term R_l we choose the minimum function instead of the sum, since the minimum leads to faster convergence due to larger gradients.
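The limiting function and the minimum-based reward can be written down directly. In this sketch l is applied to every component for simplicity, whereas the article applies it only to R_1, R_2, R_3; the component values are illustrative:

```python
# l(x) = min(-|x|, -eps): once a component is within accuracy eps of its
# target, its contribution is capped at -eps, so it cannot be over-optimized.
EPS = 5e-5  # maximum accuracy level epsilon from the article

def l(x, eps=EPS):
    return min(-abs(x), -eps)

def reward(components, eps=EPS):
    # Overall reward R_l: the minimum over the limited components, which
    # yields larger gradients (and faster convergence) than a sum.
    return min(l(c, eps) for c in components)

# A component at 0.3 dominates the reward; one at 1e-6 is capped at -eps.
r = reward([0.3, 1e-6])
```

Because the minimum singles out the worst component, the gradient always pushes on whichever criterion is currently furthest from its target.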
Typically in RL problems, we would define a feedback loop considering step-based operations. However, in this case, this is not necessary because our environment does not require step-based actions and does not have delayed rewards. Furthermore, the states do not get modified during a learning cycle.
Since we consider the one-step case, we define the optimal policy we are looking for as µ* = arg max_µ E_{s∼p_0}[R_l(s, µ(s))], where a policy µ is a function that maps a state to an action and the states s are drawn from a state distribution p_0.
According to the deterministic policy gradient theorem [18], the gradient of the objective can be written as ∇_θ J(µ_θ) = E_{s∼p_0}[∇_θ µ_θ(s) ∇_a R_l(s, a)|_{a=µ_θ(s)}]. We denote the policy with µ_θ since we use a neural network as a policy, with parameters θ (the weights of the connections between neurons) that we can learn through a training process. In our case we can calculate ∇_a R_l(s, a) because our surrogate model is a neural network that we can differentiate (a neural network is a composition of linearly combined nonlinear activation functions, which are differentiable at least almost everywhere).
Our configuration for the maximum accuracy level is chosen as ε = 5×10^−5. We chose this value because this level of accuracy is adequate for the application. The state distribution p_0 is a normal distribution with mean 0.5 and variance 0.2², truncated to the range [0, 1]. We chose the mean value 0.5 to obtain centered samples, since our data is normalized to the range [0, 1]. The variance 0.2² makes the resulting numbers large enough to produce relevant states, but not too large, so they are rarely out of range. The values are truncated so that they stay within [0, 1]; this also ensures that the agent cannot choose actions outside the allowed safe operating range. We do not require a Q-function because we only consider the one-time-step RL cycle and the surrogate model is differentiable. That allows us to determine the policy µ_θ with a policy gradient approach.
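Sampling from p_0 as described (normal with mean 0.5 and standard deviation 0.2, truncated to [0, 1]) can be sketched as follows; truncation is implemented here by clipping, an assumption consistent with values being "truncated so that they stay within a range of [0, 1]", and the state dimension is a placeholder:

```python
import numpy as np

def sample_states(n, dim, rng):
    # p_0: normal distribution with mean 0.5 and variance 0.2^2, values
    # clipped to the normalized parameter range [0, 1].
    samples = rng.normal(loc=0.5, scale=0.2, size=(n, dim))
    return np.clip(samples, 0.0, 1.0)

rng = np.random.default_rng(0)
states = sample_states(1000, 8, rng)   # 8 is a placeholder state dimension
```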
We use a multilayer perceptron neural network with three hidden layers of 1000, 400, and 200 nodes and four output nodes for the four actions. All layers are activated with the ReLU function except for the output layer, which uses tanh. We use the Adam optimizer (adaptive moment estimation) [19] with learning rate η = 10^−4. We chose parameters similar to a comparable setting in which the booster current parameters were optimized, as described in [9]. We performed a brief hyperparameter search with different numbers of layers and nodes; however, the hyperparameters similar to [9] achieved the largest rewards.
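The policy network described above can be sketched as a forward pass; the state dimension (here 10) and random weights are placeholders. The tanh output bounds each action to (−1, 1), which matches actions being restricted to a limited, safe interval:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hidden layers of 1000, 400, and 200 nodes and four outputs, as in the text;
# the input width (10) is a placeholder for the state dimension.
widths = [10, 1000, 400, 200, 4]
weights = [rng.normal(0, 0.05, (m, n)) for m, n in zip(widths[:-1], widths[1:])]

def policy_forward(s):
    # ReLU activations on the hidden layers, tanh on the output layer.
    for W in weights[:-1]:
        s = np.maximum(s @ W, 0.0)
    return np.tanh(s @ weights[-1])

actions = policy_forward(rng.uniform(0, 1, 10))   # four bounded action values
```

Training would update `weights` with Adam along the deterministic policy gradient; only the untrained forward pass is shown here.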
We train our agent for 700000 epochs, which means we perform the RL cycle, and thus reward evaluations, 700000 times. This number of repetitions is feasible in reasonable time because we have a fast surrogate model. It is worth noting that the learning process saturates after about 20000 epochs, improving only slightly after that.
After this initial training phase, we freeze the policy. We can then determine the chosen actions for arbitrary state and integral variables and calculate their reward with only a single surrogate model evaluation.

IV. RESULTS & DISCUSSION
In this section, we present the results of our method and discuss them in regard to the research questions proposed in Section I.

A. Fast inference requiring fewer reward evaluations
We compare the optimization performance of the trained policy with different optimization algorithms as baselines. The optimizers try to solve the optimization problem arg max_a R_l(s, a). The state s is sampled from p_0 as defined in Section III. We compare the optimal value R_l(s, a) after 1000 function evaluations with our RL approach R_l(s, µ(s)), assuming the policy of our RL approach has been fully trained. Because the local optimization algorithms are stochastic, meaning they rely on random variables, we repeat the experiments 800 times. This allows us to calculate a mean value and avoids getting particularly good or bad results merely by coincidence. For Powell's method and Nelder-Mead, we choose the default settings, i.e., the absolute error in inputs and outputs between iterations that is acceptable for convergence is set to 0.0001. As the gradient descent algorithm, we use stochastic gradient descent with a learning rate ν = 0.1. In the comparison plots in Fig. 5, the histograms show higher values in the color scale below the red line.
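The baseline setup can be sketched with SciPy's `minimize`; the toy quadratic below stands in for the negated reward −R_l at a fixed state, and the evaluation budget is enforced via `maxfev`:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for -R_l(s, a) with a fixed state s; the baselines minimize it.
def negative_reward(a):
    return float(np.sum((a - 0.3) ** 2))

a0 = np.full(4, 0.5)  # four action parameters, started mid-range

# Nelder-Mead and Powell with absolute tolerances of 1e-4 (the defaults
# described in the text), capped at 1000 function evaluations.
nm = minimize(negative_reward, a0, method="Nelder-Mead",
              options={"maxfev": 1000, "xatol": 1e-4, "fatol": 1e-4})
pw = minimize(negative_reward, a0, method="Powell",
              options={"maxfev": 1000, "xtol": 1e-4, "ftol": 1e-4})
```

`nm.nfev` and `pw.nfev` report how many reward evaluations each baseline used, which is the quantity compared against the single evaluation of the trained policy.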
This shows that the compared optimization methods reach a smaller reward given the same number of reward function evaluations. The results shown in Table II confirm these findings. Most of the compared optimizers fail to achieve an equal or higher reward than the RL approach: only Nelder-Mead and gradient descent achieve a similar or better reward than the RL policy in about half of the repetitions after 1000 reward evaluations, and Powell's method matches the reward achieved by the RL agent in only 57 repetitions. The best-performing stochastic optimization method is Nelder-Mead, which needs on average 230.53 reward evaluations, and thus surrogate model evaluations, to compete with the RL approach.
We will now compare the achieved rewards at different evaluation counts of the local optimization algorithms and our RL agent policy. By evaluation counts, we mean the number of evaluations of the reward function and thus of the simulation approximation. We can equate this number with computational cost, since evaluating the surrogate model takes the most time in the learning cycle of both the RL approach and the local optimization algorithms. We expect our RL policy to outperform the local optimization algorithms regarding the evaluation counts required for an adequate reward. In Fig. 6 we can see that even after 1000 evaluations, the RL policy has, on average, a larger reward than all compared local optimization algorithms. The reward of Powell's method increases much more slowly and converges at a lower level than the RL policy. Nelder-Mead and gradient descent perform almost equally; however, even after 200 evaluations, the RL policy achieves a larger reward.
To summarize, we can conclude that our RL agent requires fewer surrogate model evaluations compared to the local optimizers. This result means that our RL agent, once trained, provides considerably faster inference.

B. Compound solution for the optimization task
Together with the offset determination proposed in [4], we are now able to fully solve the beam optimization task: we first determine the offsets of the parameters and then apply the RL agent.
Our simulation method makes some assumptions, especially in the area of the electron source itself. We assume an exactly round beam at the beginning of the simulations and that the laser spot is homogeneous and stable in time. In reality, it can happen that the laser spot on the cathode is not homogeneously round.
Additionally, there are field errors in the electron gun, which we cannot estimate in all details yet. We cover these mainly in the variable of the field flatness. However, we do not consider higher asymmetries of the resonator (e.g., inner cell misalignment, coupler kicks, higher radiofrequency modes). But these are higher-order effects and therefore should not significantly impact the applicability of our method in the real world.
The differences between the simulation and the surrogate model are negligible within the trained parameter ranges. This allows us to safely use the surrogate model to replace the simulation in offset determination and beam optimization. In summary, based on our reasoning, our method solves the optimization task and is transferable to the real-world application.

C. Explainability of decisions
In the following, we will elaborate on how our RL agent chooses the actions according to a given state. We utilize the fact that we have trained a deterministic policy with a neural network. Since our neural network is differentiable (almost everywhere), we can extract the Jacobian matrix with the help of automatic differentiation methods. Because the Jacobian matrix contains the derivatives of the actions with respect to the different state values, we can see which state value has a high impact on which action. We depict the mean and standard deviation of the policy Jacobian matrices evaluated at 100000 states sampled from a normal distribution in Fig. 7 and Fig. 8. Figure 7 shows that the average change of the action solenoid horizontal position is high when the horizontal laser position changes.
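In our setting the Jacobian comes from automatic differentiation of the policy network; as a library-free illustration, the same quantity can be approximated by finite differences. The linear toy policy below is a placeholder chosen so that its Jacobian is known exactly:

```python
import numpy as np

def jacobian_fd(policy, s, h=1e-6):
    # Finite-difference approximation of J[i, j] = d action_i / d state_j.
    s = np.asarray(s, dtype=float)
    a0 = policy(s)
    J = np.zeros((a0.size, s.size))
    for j in range(s.size):
        step = np.zeros_like(s)
        step[j] = h
        J[:, j] = (policy(s + step) - a0) / h
    return J

# Linear toy policy a = W s, whose exact Jacobian is W.
W = np.array([[1.0, -2.0], [0.5, 3.0]])
J = jacobian_fd(lambda s: W @ s, np.array([0.2, 0.7]))
```

Large entries of J mark state values with a strong influence on the corresponding action, which is exactly what the averaged Jacobians in Fig. 7 visualize.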
This relation can be seen explicitly in Fig. 9, where we applied varied inputs to the policy of our RL approach. There is also a high correlation between the vertical laser position and the action parameter vertical solenoid position.

V. CONCLUSION
As shown in this article, we have successfully applied RL to optimize an SRF cavity module in the simulation environment. The used surrogate model as a fast approximation for the simulation is accurate and can safely replace the simulation within the defined parameter ranges. We have shown that the optimization accuracy of a pre-trained RL agent is comparable with several local optimization algorithms but achieves this performance by direct evaluation instead of several hundred iteration steps. After training the RL agent, the inference times are much lower since the optimization problem only needs to be evaluated once instead of many times as with the local optimizers. This result is a considerable step towards fast and automated commissioning and optimization of an SRF gun. Potentially, this procedure will replace the time-consuming manual alignment in the future. This approach allows a quicker and easier setup of energy recovery linacs and other high-current and high-repetition electron beam applications.
The next step will be to verify our results on the actual device once it is ready for commissioning. Our approach should be directly applicable since the simulation already models inaccuracies of the device, as discussed in Section IV. We can likely transfer this approach to other problems, for example, optimizing beamlines of synchrotron radiation facilities. Another idea is to use real feedback loops during operation for applying RL, after the initial training and optimization with the simulation approximation. That means instead of relying only on our simulation approximation, we could also measure the state and integral variables in the actual device and then perform further policy optimization. To realize this idea, we would need to extend our approach to step-based actions.
ACKNOWLEDGMENTS Support by the JointLab AIM-ED between Helmholtz-Zentrum für Materialien und Energie, Berlin and the University of Kassel is gratefully acknowledged.