Tuning Particle Accelerators with Safety Constraints using Bayesian Optimization

Tuning machine parameters of particle accelerators is a repetitive and time-consuming task that is challenging to automate. While many off-the-shelf optimization algorithms are available, in practice their use is limited because most methods do not account for safety-critical constraints in each iteration, such as loss signals or step-size limitations. One notable exception is safe Bayesian optimization, which is a data-driven tuning approach for global optimization with noisy feedback. We propose and evaluate a step-size limited variant of safe Bayesian optimization on two research facilities of the Paul Scherrer Institut (PSI): a) the Swiss Free Electron Laser (SwissFEL) and b) the High-Intensity Proton Accelerator (HIPA). We report promising experimental results on both machines, tuning up to 16 parameters subject to 224 constraints.


I. INTRODUCTION
Particle accelerators are complex machines consisting of many elements and are typically built according to an idealized design. However, in reality, there are always systematic and time-varying errors that will reduce the performance with respect to the design performance. These errors need to be corrected by parameter tuning to achieve the optimal performance. Therefore, empirical parameter tuning is a required and often reoccurring task for any particle accelerator. Common tuning objectives include the beam shape, beam trajectory, beam loss minimization, or a combination of multiple objectives.
In this paper, we present a method that allows for safe optimization of particle accelerators. Our approach follows recent work by Kirschner et al. [1] on safe Bayesian optimization. The main idea behind Bayesian optimization is to compute regression estimates of the target and loss functions using the data collected during optimization [2,3]. The estimates and associated statistical uncertainty are then used to systematically explore the parameter space, while guaranteeing that the query points are safe. Our main contribution is an experimental evaluation of safe Bayesian optimization on the High Intensity Proton Accelerator (HIPA) and the Swiss Free Electron Laser (SwissFEL). Further, we address practically relevant challenges in the context of tuning particle accelerators, including step-size constraints and user feedback for hyper-parameter tuning. Tuning of accelerators is a challenging task for several reasons. Typically, simulation models are not accurate enough to allow offline optimization, and the parameters have to be tuned empirically by evaluating different settings on the machine. Among many other considerations, the following points are of particular importance; including on the machines we used for experimentation. * jkirschn@ualberta.ca † jochem.snuverink@psi.ch Safety: Beam optimization is a delicate task because improper settings can cause beam losses that potentially damage the machine. While modern machines have safety measures implemented that are designed to prevent hardware damage by stopping the beam, it is desirable to avoid triggering such safety measures. On one hand, re-starting the machine is time-consuming and breaching the safety level is often considered off-limit. Further, an effective method should concentrate on sampling feasible points for fast convergence.
Step-Size Control: Classical optimization methods are often based on the principle of making small adjustments to an incumbent solution, such as a small step in the gradient direction. Global search methods are an attractive alternative since they do not require gradient estimates and make less restrictive assumptions such as convexity of the target functions. Popular choices include Nelder-Mead [4], CMA-ES [5] and Particle-Swarm optimization [6]. However, these methods do not impose a step-size constraint. Large changes to input parameters of particle accelerators can be problematic because oftentimes feedback systems are in place to stabilize the beam, which might not be able to follow abrupt changes. This limitation can be circumvented by slowly changing the machine parameters to the requested target values. Such a process, however, increases the measurement time on the machine and slows down the optimization.
User Feedback: Optimization methods are usually not designed to provide user feedback other than the progress on the target value. In particular, it is difficult to monitor the optimization trajectory and the user essentially has to trust the method to not induce an unwanted state on the machine.
As our main contribution and addressing the challenges outlined above, we demonstrate feasibility and ef-ficacy of safe Bayesian optimization for parameter tuning on particle accelerators. Our approach is based on the LineBO method of Kirschner et al. [1], and we develop a version of the same method with step-size control. The LineBO approach combines Bayesian optimization with a line-search technique, thereby allowing to scale the method to high-dimensional settings. We compare performance with the non-safe global search methods, CMA-ES [5] and Nelder-Mead [4]. Our claims are corroborated with experimental data on two machines: The High Intensity Proton Accelerator (HIPA, introduced in section I B) and the Swiss Free Electron Laser (Swiss-FEL, introduced in section I C).

A. A Brief Overview on Optimization Methods
For many accelerators, parameter tuning has been automated to achieve significant speedups compared to manual operator tuning [7][8][9]. Since only point evaluations of the objective function are available, one has to rely on so-called zero-order or black-box optimization methods. There are a large variety of optimization methods in use. Due to their simplicity, the Nelder-Mead (Simplex) [4] algorithm and random walk optimizers have become popular choices to assist operator tuning [8,10,11]. Further methods that have been successfully applied on accelerators include Extremum Seeking [12][13][14] and robust conjugate direction search (RCDS) [15,16]. The advantage of local descent methods is their simplicity, although convergence guarantees only apply under strong assumptions like convexity. Other gradient-based optimization algorithms are more difficult to deploy as additional samples are required to estimate the gradient, e.g. via finite differences.
Another class of zero-order optimization methods are evolutionary algorithms such as Particle-Swarm optimization [16] and genetic algorithms [17][18][19]. A popular choice due to its simplicity is CMA-ES [5], which is shown to achieve competitive performance on standardized benchmarks [e.g. 20], although we are not aware of an application on particle accelerators.
As an alternative, Bayesian optimization [3,21] has recently gained interest in the accelerator community [22][23][24][25]. Bayesian optimization is a framework for zero-order optimization that deals with observational noise in a principled way, allows the use of prior data or knowledge about the objective, and comes with theoretical convergence guarantees for some variants [21].
When tuning particle accelerators, there are often safety-relevant constraints, e.g., physical or safety limits that cannot be violated, of which the typical optimization algorithm is unaware. For example, at the Swiss Free Electron Laser the pulse energy should be maximized while keeping low undulator losses. At the High Intensity Proton Accelerator the situation is more rigid: It is important to keep the overall losses as low as possible while not exceeding any of the individual loss limits; violating the limit of a single loss monitor triggers the safety system, causes downtime, or possible damages to the machine.
It is natural to formulate this setup as a constrained optimization problem, where the goal is to optimize a target function subject to a feasibility condition. Established algorithms for the constrained setting include Sequential Quadratic Programming [SQP 26] and interior point methods [27]. Variants for constrained Bayesian optimization have been proposed in [e.g. [28][29][30]. Gradient descent methods can be applied in the constrained setting via a Lagrangian (primal-dual) formulation of the objective that penalize infeasible points [31]. We emphasize that the methods for constrained optimization ought to return a feasible final solution but evaluating infeasible parameters is tolerated during optimization; hence these methods do not address the safety requirement for tuning particle accelerators. This is opposed to safe optimization, where the complete optimization trajectory has to satisfy the feasibility condition. To achieve safe optimization that guarantees feasibility for all evaluation points with high probability, Sui et al. [32] proposed Safe Bayesian Optimization. We describe this method and extensions in detail in section II. A variant of Safe Bayesian Optimization was evaluated on SwissFEL in previous work [1].

B. HIPA
The high intensity proton accelerator (HIPA) provides the primary beams to PSI's versatile experimental facilities. HIPA generates a proton beam with 590 MeV kinetic energy and presently 1.3 MW average beam power [33]. This corresponds to a proton current of about 2.2 mA. The HIPA accelerator consists of a Cockcroft-Walton pre-accelerator and a chain of two isochronous cyclotrons, the Injector II and the Ring cyclotron. The beam is produced in continuous wave (CW) mode at a frequency of 50.6 MHz. The high intensity proton beam is used to produce pions and muons by interaction with two graphite targets. After collimation the remaining beam with roughly 1 MW is then used to produce neutrons in a spallation target. A pulsed source for ultracold neutrons (UCN) is also in operation [34].
In practice, the performance is limited by the beam losses at the extraction of the Ring cyclotron and in the high energy beamline downstream. In the Ring cyclotron longitudinal space charge leads to an increase in energy spread that transforms into transverse beam tails, which reduces the turn separation at extraction [33]. Furthermore, significant emittance growth occurs in the graphite targets and results in unavoidable collimation loss in specially shielded collimators. These losses are controlled by several RF and orbit feedback systems to keep the machine stable and by empirical operator tuning.

C. SwissFEL
The Swiss free electron laser (SwissFEL) is PSI's state of the art source for ultra-short X-ray pulses [35,36]. A photocathode is used to generate short electron pulses which are then brought up to 6 GeV by a series of linear accelerator modules and compressed down to a pulse length of only a few femtoseconds in two bunch compressor chicanes. Inside an undulator line, the electron pulses self-modulate and emit X-ray pulses with laser-like properties by a process called self-amplified spontaneous emission (SASE). These ultra-short X-ray pulses are then used for various user experiments (e.g. pump probe experiments, X-ray spectroscopy). At SwissFEL, there are two undulator lines -Aramis and Athos -that can be used to generate photons for different end stations where the user experiments take place.
The settings of the accelerator (e.g. the photon energy or the pulse compression) are often adjusted to fit requests of the users. Typically subsequent parameter optimization is necessary in order to maximize the photon pulse energy at the experiment. The pulse energy is measured with a gas detector [37] that provides readings on a shot-to-shot basis. When performing the optimization, various tuning parameters (∼ 100) can have a strong impact on the resulting beam quality, but also on the losses that can cause radiation damage to delicate equipment. Especially, in the undulator region, losses should be kept low while tuning on the SwissFEL output, to avoid longterm damage due to demagnetization of the permanent magnets inside the undulators.

II. BAYESIAN OPTIMIZATION WITH CONSTRAINTS
Bayesian optimization (BO) [2,3,21] is a flexible, datadriven approach for global optimization with noisy feedback. The main idea is to fit a statistical regression model on the target function, using the data collected during optimization, and possibly, any available prior data. The next evaluation point is then chosen to reduce the model uncertainty about the unknown optimal solution. The approach was shown highly effective in many real-world applications, including optimization of particle accelerators [1,23,38]. Variants for safety-critical tuning were developed by [32,39,40], where the algorithm learns a model of additional constraints online. The constraint models are used to estimate a feasible 'safe' set of parameters. The method then only evaluates such safe inputs and thereby ensures that no query point violates the safety constraints (with high probability).

A. Line Bayesian Optimization
In the following, we introduce a variant of safe Bayesian optimization developed in [1]. The main idea is to alternate between two phases (see Figure 1). First, data is collected within a small ball around the current incumbent solution. Then a line search is performed in the direction of the (predicted) largest increase. In both phases, the data points are chosen to maximize an upper-confidence bound (UCB) of the regression estimate, which is known to be an effective search surrogate [41]. This procedure is repeated until a satisfactory solution is found, or the search budget is exhausted. The intuition for these choices is twofold: First, by concentrating samples on a 1-dimensional domain or a small enough ball we obtain reliable function estimates with few samples in a local region of interest. Second, determining the acquisition points is computationally much simpler compared to standard Bayesian optimization, which requires to solve a non-convex optimization problem over the full input domain in each round.
To formally introduce the algorithm, we write the tuning objective (maximized) and l constraints as follows: corresponding to the set of d tuning parameters on an allowed range. The functions g 1 (x), . . . , g l (x) are constraints that need to be satisfied during operation, such as a loss limit. Note that the g 1 (x), . . . , g l (x) are assumed to be unknown initially (like the target f (x)), and the only way of accessing the constraint functions is by evaluating a parameter on the machine. Provided an input parameter x t ∈ X at step t, the machine interface (denoted MACHINE-EVAL(x t )) returns a set of noisy The plots show 1-dimensional slices of a synthetic objective (blue, maximized) and constraints (red). The left, middle and right column correspond to different GP models with lengthscales 0.1, 0.2 and 0.5, respectively, on the same objective. We produce such plots live during optimization to evaluate model fit and confidence bands relative to the measurements (crosses). The vertical dashed line indicates the best parameter prior to optimizing the 1-dimensional subspace, and the vertical solid line indicates the updated best parameter. Green and red intervals on the x-axis correspond to predicted safe and unsafe parameters. Note that the safe region is expanded as more data is collected (cf. top to bottom plot). When choosing hyperparameters such as the kernel lengthscale, system experts can visually inspect the predicted safe set, and make adjustments accordingly. A smaller lengthscale leads to a more conservative method that expands the safe set in smaller steps. Note that in the right column, the lengthscale is chosen too large. At the first evaluation the safeset is predicted incorrectly (see upper right plot), which consequently leads to an unsafe query point (orange circle, bottom plot). measurements, where t and ξ t,1 , . . . , ξ t,l is measurement noise. In practice, we average multiple evaluations to reduce the noise variance. Also note that the constraint limit is zero without loss of generality, because we can always redefine a We say that an evaluation procedure is safe if it guarantees that with very high probability, all evaluation points satisfy g i (x t ) ≤ 0 for all i = 1, . . . , l and time steps t ≥ 1.
The next step is to compute a regression model given the available data D t = {(x s , y s , z s,1 , . . . , z s,l )} t s=1 . In general, this is a user choice which we abstractly encapsulate in the REGRESSION(·) oracle function. Given the data D t , the regression oracle produces a model estimatem t . We assume that the model provides the following estimates for the target function and the constraints (h ∈ {f, g 1 , . . . , g l }): • MEAN h (m t , x): The mean estimate of the target function evaluated at x; typically a consistent estimator of h(x).
• UCB h (m t , x, δ): An upper confidence bound (UCB) on the true function, which takes a confidence parameter δ ∈ [0, 1] and satisfies: where the randomness is due to noisy evaluations.
• LCB h (m t , x, δ), a lower confidence bound (LCB) on the true function, defined analogously to UCB h .
, the width of the confidence band at x ∈ X .
The most common regression model used in Bayesian optimization is Gaussian process (GP) regression (or kernel least-squares) [42], which we also use in our experiments. A brief, illustrative introduction to GP-regression and the exact specification of our regression function is given in Appendix A.
Finally, Bayesian optimization acquires points that are targeted to reduce the model uncertainty of the true maximizer x * = arg max x∈X f (x). To account for the safety constraints, at any step t, the acquisition is restricted to the safe set with margin τ ≥ 0, The safe set is defined as the subset of the domain X where with probability at least 1 − δ, the model does not predict a constraint violation with margin τ . A larger safety margin makes the algorithm more conservative. Also note that the model predicts the average constraint level for a given input parameter. Therefore a positive safety margin is needed when the constraint measurements are noisy and the goal is to avoid constraint violations with high probability (i.e., controlling the tail of the constraint distribution). The candidate solutionx * i in iteration i is defined as the maximizer of the mean estimate on S τ t , Evaluating just the incumbent solution however does not induce enough exploration, and the algorithm would get stuck in local optima. Instead, the next sequential measurement is chosen to maximize the upper confidence bound score function, which is a well-behaved surrogate to reduce the uncertainty of the true maximizer: This choice leads to evaluation of points where the model predicts either a high reward or large uncertainty (or both). Upper confidence bound algorithms are widely studied in the literature [43] and convergences guarantees for these algorithms are known for the unconstrained case [41,44]. The constrained case requires additional care to ensure that model obtains sufficient data to expand the safe set [32]. The exact acquisition procedure that we use is specified in Algorithm 2. Since finding a maximizer of the acquisition score (1) is challenging in high-dimensional domains, we rely on the LineBO approach by Kirschner et al. [1]. Instead of searching the full domain, LineBO restricts the search space to an adaptively chosen sequence of one-dimensional subspaces. Let A(x, w) = {x + aw : a ∈ R} be an affine subspace for offset x ∈ R d and direction w ∈ R d . To determine direction and offset we first collect data points in a small ball B(x * i−1 , η) around the previous incumbent solutionx * i−1 to compute a new candidatex * i . The offset is then set to the new candidatex * i and the search direction is chosen as w =x * i −x * i−1 . For step-size control during the line-search phase, the algorithm constrains the next evaluation point x at any time to B(x * i , η). The complete approach is summarized in Algorithm 1. An illustration is given in Figure 1 and the line search phase is shown in Figure 2.

B. Variants
In our experiments, we evaluate the following variants of the proposed approach: A-LineBO-loc: This is the main (ascent & localized) variant shown in Algorithm 1. We use η = 0.1 relative to a domain that is normalized to [0, 1] d .

C. Choice of Hyperparameters
Algorithm 1 relies on several hyperparameters that are configured by the user. We first explain our choices in the following. Then we briefly discuss more general methods for adjusting the hyperparameters parameters.
For the number of evaluations per round, we set j ball = 2d and j line = 10. From our experience, this allows to estimate a reasonable search direction and make sufficient progress on each line (and the method can revisit the same search direction in the next iteration). When the signal is very noisy, a larger number of evaluations per iteration can be appropriate. On systems with a large number of hyperparameters (d > 50) but low effective dimensionality, it can be effective to choose search direction uniformly at random [1]. The step-size η and margin τ are provided by the user. The exact choice is usually based on system considerations. We use a maximum step size of 0.1 (i.e., 10% of the allowed input range) and τ = 0.1 (i.e., 10% below the critical value).
Parameters of the GP model encode regularity assumptions on the objective and constraints. Choosing suitable model parameters is essential for the effectiveness of Bayesian optimization. In a first step, we normalize input ranges to obtain a normalized domain X = [0, 1] d . Further, we normalize the observed objective values to [0, 1], and the range of feasible constraint values to [−1, 0]. The latter is important to ensure that the prior upperconfidence bound predictions of the constraints are positive. This corresponds to assuming that an input parameter is unsafe in the absence of data.
We then use a Matern52 kernel with lengthscale 0.2 and unit prior variance, which we found effective and robust choices across all our experiments. For the observation noise likelihood, we use a Gaussian distribution with an empirical variance estimate. We choose to keep all hyperparameters fixed during optimization, but adaptive choices are possible in principle.
The above choice of kernel and lengthscale is appropriate when a small change in the input parameters (e.g., < 10% relative to the input range) leads to a moderate change in the output, e.g., does not suddenly increase the constraint level to a critical value. This requires to choose suitable allowed ranges of tuning parameters, which we determined together with system experts.
To adjust these parameters on a new system, the practitioner can inspect the model predictions after the initial (safe) point has been evaluated. Most importantly, the predicted safe set for each parameter can be plotted and evaluated by a system expert. Based on the visual feedback, the lengthscale or input ranges of each parameter can be adjusted to meet the operators expectation on the safety of the method. The method acts more conservatively for larger safety-margin τ , smaller step-size η, and smaller lengthscale parameter used for the kernel estimation. The visual inspection procedure is illustrated in Figure 2.
We also produce the line plots online during the linesearch phase. We found this to be an extremely valuable tool to further calibrate hyperparameters and obtain feedback on the optimization progress. Specifically, too conservative parameter choices (e.g., a very small lengthscale) lead to slow expansion of the safe set and no/little extrapolation of the target values beyond the data points. On the other hand, if the regression estimate appears too smooth relative to the observed data points, or the algorithm evaluates points close or above the constraint limit, more conservative hyperparameters are advisable.
In applications where prior data or a physics simulation is available, an alternative approach is to estimate hyperparameters in advance, e.g., using a maximum likelihood or marginalized likelihood estimation [42]. For a successful approach to extract kernel hyperparameters from a physics model, see the work by Hanuka et al. [24]. Some previous work has also explored adaptive model selection, e.g. [45], although this is still largely an active research area today.
Lastly, a word of caution is unavoidable. Safety and effectiveness of the methods is subject to the modeling assumptions, and therefore, choice of hyperparameters. In the absence of any assumptions on the system, a principled choice is limited. The challenge is amplified for safe optimization, because no method can guarantee safety if the system behaves in a way not anticipated by the algorithm.

D. Remarks on Computation
GP regression requires to compute the inverse of the kernel matrix, which can be done in O(t 3 ) basic operations per round. With incremental updates, the complexity is reduced to O(t 2 ). In practice, this effectively limits the number of data points that can be handled without significantly increasing computation time to ∼ 1000 -2000. See Appendix A for further comments.
The SAFE-ACQUISITION subroutine requires solving non-convex optimization problems over continuous domains. When the domain is one-dimensional as in the line search, a fine uniform (equidistant) discretization of the domain easily produces a solution below the system accuracy / noise level, such that the approximation error translates to a negligible difference on the objective value (we used 300 points per line). Maximizing the acquisition score within in the ball B(x * , η) is computationally expensive in general. Nevertheless, since the domain is relatively small compared to the full domain, we found random search to be effective in our experiments. We compute the UCB score at 500 points sampled uniformly from B(x * , η) and choose the maximizer among these. Step-by-step evolution of the target function for the different optimization algorithms. We performed optimization at low beam intensity, using 5 quadrupoles strengths on a manually detuned initial point. The left plot shows the target values. The faint lines represent the measurement of the target function at each step. The solid lines depict the measurement of the target function at the best set of parameters predicted by the optimization models (updated only every 5 steps). The right plot shows the maximum constraint value at each step, where a value above 0 corresponds to a constraint violation. Note that NelderMead and A-LineBO (unsafe) caused interlocks at step 12 and 19 respectively. Note that the exact values of the corresponding constraint violations are unknown, because the machine protection system stops beam operation before the measurement is registered. CMA-ES and NelderMead caused constraint violations that did not trigger interlocks at steps 12 and 2-4 respectively.
An alternative is to maximize the UCB score using any standard optimization algorithm such as gradient descent with random restarts. Lastly, we note that in time-critical applications with a large number of constraints, a significant speedup can be obtained by using a joined covariance matrix for all constraints. Further, the most expensive step in computing the estimates is the inversion of the covariance matrix, which can be pre-computed while waiting for the machine to return the next evaluation.

E. Limitations
Enforcing a step-size sometimes limits the algorithm's ability to escape local optima. However, theoretical analysis suggests that such an approach at least finds a local optimum [46,47]. In our experiments we did not see any performance degradation from enforcing a step-size (see Section III). This indicates that, perhaps surprisingly, high-quality local solutions exist in the optimization surfaces that we encountered. On systems where local optima prevent the algorithm from progressing, one can use multiple starting points and drop the step-size constraint during the line-search phase. Another option is to replace the line search by a search over higher-dimensional subspaces.
Another challenge is model misspecification. If a parameter is incorrectly predicted as safe, the parameter can become infeasible under the model's predictions after the evaluation (c.f. Figure 2, right column). With a small step-size, this can lead to a situation where the predicted safe set is empty. A simple resolution is to back-track and restart the method from an earlier safe point. In such cases, it is further advisable to choose a smaller lengthscale, to prevent evaluation of infeasible points in the first place.

III. EXPERIMENTAL EVALUATION
We first present empirical results on HIPA, and then an additional experiment on SwissFEL below.

A. HIPA
The different versions of the BO algorithm with constraints have been tested in several dedicated experiments at HIPA. The target of these experiments has been to reduce the overall beam losses around the machine by minimizing a target signal defined as a weighted sum of about 60 beam loss monitors (or loss related monitors) spread across the machine. This weighted sum reflects that the beam loss monitors are not distributed equidistantly along the machine and that losses at lower energies cause less activation.
For each beam-related signal that could trigger a beam interruption (interlock ), a safety constraint function is defined as the difference between the signal and its beam interruption limit. If available, the warning limit of each signal is used and not the actual limit that would produce a beam interruption when crossed, which allows a safety margin in case the safety constraint are accidentally violated when operating near the limit. For optimization of HIPA, we used 224 of such signals, mostly current and temperature measurements in collimators and beam loss monitors. We normalized each of the safety signals steps respectively. Note that the exact constraint value of the constraint violation is not known as the machine protection system prevents the measurement. A-LineBO-loc successfully tunes the machine at high intensity (repeated twice). This experiment used an earlier configuration of the LineBO methods, where we used a single constraint defined as the maximum over the 224 (normalized) loss signals and no safety margin (τ = 0). We think that this configuration led to misspecified confidence intervals, and therefore an interlock of C-LineBO-loc. In such an event we recommend to use more conservative hyperparameters.
g loss i (x) to define constraint functions where g i,limit is the limit of constraint i. Therefore the constraint reads g i (x) ≤ 0, and the safe outcome range is [−1, 0] (assuming that g loss i (x) ≥ 0). For faster computation, it is possible to take the maximum over multiple constraints. However, this can lead to less smooth functions that are harder to learn by the GP model.
At HIPA, a beam current regulation and several beam orbit feedbacks are in continuous operation to ensure stable operation without drifts. Every time a new evaluation point is set, the algorithms wait for these feedbacks to stabilize before acquiring data. Larger steps typically result in more time required until the machine is stable and the measurement is taken.
In our experimental validations, the variables used to optimize the target function are the field strengths of quadrupoles in the 870 keV transport beam line after the Cockcroft-Walton pre-accelerator towards the first of the two cyclotrons, the Injector II, with an allowed empirical range of ±2 A. This range is equivalent to an integrated field gradient of about ±0.01 T, and is a typical range operators use for tuning. In the center of Injector II the beam is collimated from 10 mA to about 2 mA with several collimators [48]. By adjusting the quadrupole strengths upstream, the beam can be shaped and matching in the cyclotron can be improved [49], which will avoid beam tails and reduce beam losses throughout the accelerator downstream.
In order to protect the machine, the first experiments were performed at low intensity (100 µA), which also allows a larger margin in the constraint functions limits as the losses are in general lower at low intensity. To give some headroom to the automatic algorithms and to simulate a sub-optimal machine, the selected quadrupole settings are manually moved away from the default settings. These detuned settings are recorded to have a consistent initial point. In later sessions the algorithms were tested without detuning at full production current (about 1930 µA).
The left plot in Figure 3 shows the target evolution of the different algorithms at low beam intensity. The faint lines show the evaluation of the target function at every new parameter configuration (step). The solid line shows the measurement of the target function for the best set of parameters predicted by the algorithms after every 5 steps (evaluating the candidate solution more frequently slows down the overall optimization). The right plot shows the constraint closest to the limit at each evaluation point (that is, the maximum over the normalized constraint levels). Note that NelderMead and A-LineBO (unsafe) caused beam interruptions due to constraint violations. These constraint violations are not visible on the constraint plot because the machine projection system stops beam operation before the measurement is registered. Further, CMA-ES and Nelder-Mead show constraint violations that did not lead to beam interruptions. This can happen in our setup when the warning level of a loss monitor is reached, but not the beam interruption level. All safe methods and CMA-ES successfully optimize the loss to a final level of about 0.1, where A-LineBO-loc converges the fastest.
Our second experiment at full production intensity is shown in Figure 4, where we optimize 16 quadrupoles (all the quadrupoles in the 870 keV beam line). The target reduction is less steep because, in order to avoid putting the machine in a dangerous state, in this case the parameters were not manually detuned and the initial point was already fairly close to the optimum. Notice also the difference on the magnitude of the target function due to the higher intensity and therefore higher losses around the machine. The C-LineBO-loc and CMA-ES algorithms stopped early as they triggered beam interruptions in their last parameter settings. . This can be improved by using a joined covariance matrix for all constraints, see the remarks in II D. A-LineBO and C-LineBO are not step-size constrained, which leads to larger machine evaluation times. C-LineBO-loc has an explicit step-size limit which leads to shorter machine evaluation times. Note that A-LineBO (unsafe) and NelderMead caused interlocks after few steps (see Figure 3).
This experiment was performed with an earlier configuration of our method without safety margin (τ = 0). Moreover, we used a single constraint defined as the maximum over the 224 loss monitors. Consequently, an interlock was caused by C-LineBO-loc, either because of model-misspecification (e.g., the maximum over the constraints was non-smooth) and consequently invalid confidence intervals, or noise of a constraint monitor that was very close to its limit. In such cases we recommend more conservative hyper-parameters (bigger margin τ , smaller lengthscales of the GP models), as was used in the experiment shown in Figure 3. A-LineBO-loc completed the optimization successfully twice, without triggering any beam interruption. Figure 5 shows an analysis of the step sizes and their impact on the speed of the optimization of the different algorithms. The plots show data from the low intensity experiment. The C-LineBO algorithm has no limitations on the step sizes and therefore is allowed to navigate the whole safe region in single steps. Compared with the A-LineBO-loc algorithm with limits on its step size it can be seen on the left plot how the unrestricted algorithms takes much larger steps. The impact of these large steps can be seen on the right column plot, as due to the large steps taken by the unrestricted algorithm the optimization took about 1.5× as long to complete because it takes the machine feedbacks and regulation systems much longer to converge for these large steps. We remark that the high computation time for the safe BO methods is due to the use of 224 constraints. We expect that significant speedups are possible when using a joint kernel matrix for all constraints. Further, the most expensive step of inverting the covariance matrix can be pre-computed while waiting for the machine to return the next measurement.
A closer analysis (see Appendix B) shows that the step sizes taken by the CMA-ES diverge as the algorithm approaches the optimization solution, as the target function becomes close to invariant to the change of the parameters in this region and tries to spread the sampling points. This is an undesirable behavior with respect to potential constraint violations or unstable machine states caused by large steps.
In conclusion, the experiments at HIPA demonstrate that the BO optimization algorithms successfully reduce the losses around the machine in a safe and efficient way, both at low intensity and full intensity operational setting. Particularly, the A-LineBO-loc algorithm outperforms or matches CMA-ES and the loss level achieved by human operators, while staying in general below the limits that would trigger beam interruptions. The constrained step sizes of this method also allow for faster and more stable tuning.

B. SwissFEL
Previous work [22] established that LineBO outperforms NelderMead [4] for tuning the beam intensity on SwissFEL. Here, we provide an additional experiment on SwissFEL, including CMA-ES, A-LineBO, and the step-size constrained variant A-LineBO-loc. As mentioned in section I C the tuning signal is a shot-by-shot signal of the gas detector that gives a signal proportional to the amount of photons in a single pulse. As optimization variables, we use 10 (horizontal and vertical) beam position monitor target values for the trajectory feedback in the undulator section. We used as constraints 16 individual loss monitors that were combined into a single constraint. During optimization with the specified pa- Step Size (a.u.)

FIG. 6. (SwissFEL)
Step-by-step evolution of the Swiss-FEL single pulse intensity (top) and average step-size (T-bars: maximum) during optimization of 24 beam position monitor target values. The results demonstrate that the same methods can be used on different accelerators. We emphasize that, similar to the experiments on HIPA, the step-size constraints did not affect optimization performance while resulting in significantly smaller variation in the input space, and therefore a more stable machine. This can be seen in the top plot, where the faint lines depict the objective value observed at the evaluation points during optimization.
rameters, we did not observe any losses.
The top plot of Figure 6 shows the target evolution of the different algorithms. The faint lines show the evaluation of the target function at every new parameter configuration (step). The solid line shows again the measurement of target function for the best set of parameters predicted by the algorithms every 10 steps. While all three tested algorithms achieve a similar final target value, it can be seen that the target value often drops for A-LineBO. This is due to large steps in the parameter space, as shown in the bottom plot of Figure 6, where the average (normalized) step-size is plotted. It can also be seen that the A-LineBO-loc performs slightly better than CMA-ES in this respect. Having a low variability and a high average of the target value is especially important during parasitic tuning, i.e. tuning during user's experiments. For this reason the step-size constrained A-LineBO-loc is the preferred algorithm.

IV. CONCLUSIONS
While tuning particle accelerators remains a challenging task, we demonstrated the feasibility of safe Bayesian optimization for automated parameter tuning. Safe Bayesian optimization does not rely on any specific machine model, and accounts for an arbitrary number of safety constraints. By both optimizing and exploring the target and constraints functions, the algorithm continuously improves its estimate and uncertainty of these functions, and in this way uses all available data to evaluate promising but safe parameters. Multiple variants of safe Bayesian optimization have been described in detail, including a newly developed variant with step-size control. This is an important feature for tuning particle accelerators, since large parameter steps can lead to an unstable machine state as fast feedbacks might not be able to follow. These variants have been applied and compared with a non-safe search methods CMA-ES, NelderMead and Bayesian optimization without constraints. We presented experimental data for two accelerators, HIPA and SwissFEL, showing efficacy and reproducibility in several optimization runs. Overall, we find that the different variants of Bayesian optimization and CMA-ES are both effective tools for beam optimization. However, safe Bayesian optimization is shown to cause constraint violations much less likely, and consequently, avoid beam interruptions. In addition, the step-size controlled variants allow for a faster and more stable tuning, which is less disruptive for the users.
where 1 t ∈ R t×t is the identity matrix, K t is the t × t matrix containing (K t ) ij = k(x i , x j ) formed by pointwise evaluation of the kernel function, y is a vector of responses and lastly k t (x) i = k(x, x i ) interpolates the measurement x to other data points. A direct implementation requires the inversion of the kernel matrix K t with computational complexity O(t 3 ). Iterative updates using the Sherman-Morrison identity reduce the complexity to O(t 2 ) per round. If the computational burden is too large, methods relying on approximation tools for kernel spaces have been developed with near-linear dependence for low dimensional problems [50,51].
Frequentist confidence bandsf t (x) ± β 1/2 t,δ σ t (x) have been derived in the literature, which contain the true function with high probability in the realizable case [41,52]. The quantity σ t corresponds to the estimation uncertainty and can likewise be derived in closed form, where K t and k t (x) as above. The parameter β 1/2 t,δ is a confidence parameter that determines the coverage and can be chosen according to frequentist theoretical results or tuned as a hyper-parameter (we used β t,δ = 1 in our experiments).
Using the notation from section II, this leads to the following model estimates: The Bayesian view-point offers an alternative interpretation where the kernel estimator corresponds to the posterior mean of a Gaussian process regressor [53] which was exploited in deriving tighter predictive bands in [41]. b. Kernel Choice The choice of the kernel is crucial for obtaining a data efficient model. Depending on the kernel choice, we can represent a different class of functions. Failure to identify the right kernel for the application leads to misspecified confidence sets and hence potential failure of the method. Different choices for the kernel are demonstrated in Figure 7 for squared exponential k(x, y) = exp − (x−y) 2 γ 2 , Matérn kernel [42] and linear kernel k(x, y) = x y. In the case of squared exponential kernel (as well as Matérn), γ is often referred to as lengthscale or bandwidth, and determines the smoothness of the estimated functions.
c. Bayesian optimization Classical Bayesian optimization [2] usually revolves around a simple procedure depicted in Figure 2, where an utility function -in this Step Size (normalized) Step A-LINEBO 0 20 40 60 Step C-LINEBO Step Size (normalized) A-LINEBO (UNSAFE) 0 20 40 60 Step CMA-ES 0 20 40 60 Step NELDERMEAD FIG. 8. Each plot shows the step-size (measured in Euclidean norm) for the different methods. Note that without step-size constraints, LineBO takes large steps throughout (top middle/right). We also observed that CMA-ES was slowly increasing step sizes, likely due to invariant subspaces of the objective. Unsafe LineBO and NelderMead stopped early because of interlocks, therefore interpretation of the data is limited.
case the UCB -is maximized to determine the next evaluation point. Other acquisition functions are possible, see e.g. Shahriari et al. [21].