Multi-Objective Bayesian Optimization for Online Accelerator Tuning

,


I. INTRODUCTION
Accelerator optimization during operation (i.e."online tuning") is a tedious but often necessary part of any experimental facility's operation.Due to their large number of components and the impact of variable external factors, such as vibrations or temperature changes, accelerators must be continuously re-tuned and optimized to meet various beam quality objectives.This requires hours of dedicated beamline time where teams of operational experts diagnose issues with beam quality and make corrections, for even small to medium size facilities.This severely limits beam time that is available to experimenters (i.e."users") thus reducing the facilities' overall scientific output.With current technological advances in the fields of computer science and machine learning, it would be beneficial to have an automated or semi-automated algorithm take care of normal beamline tuning, reducing downtime while also allowing human experts to tackle more challenging operational and design problems.
As a response to this problem, a number of algorithms have been used to optimize current accelerator facilities.Gradient-based algorithms, such as robust conjugate direction search (RCDS) [1], have been used successfully in the past to optimize beam parameters.Heuristic methods such as the Nelder-Mead Simplex [2] algorithm can also be used to optimize black box problems, when * rroussel@uchicago.edufunctional derivative information is not easily accessible.More recently, BOBYQA has also been used to optimize accelerators [3][4][5] by fitting data to a second order model in a local trust region, which accounts for noisy observations.However, these methods can struggle to handle problems with many local minima and must be restarted many times to ensure a global solution is found.
Bayesian optimization [6,7] provides a framework for global optimization, while significantly reducing the number of physical observations needed to find solutions while also taking into account functional noise.In this method, physical observations are combined with a kernel, which describes the overall functional behaviour, to create what is commonly referred to as a Gaussian Process (GP) model, which predicts the value and uncertainty of a target function [8].Using this prediction, an optimizer can then choose input points that are likely to be ideal, before a physical measurement is made.Recently, this method was successfully used to efficiently optimize single objective problems at LCLS and SPEAR3, with a lower number of observations needed than Nelder-Mead Simplex and RCDS algorithms [9][10][11][12].
These algorithms have been used to optimize a single beam characteristic, while in reality, accelerator tuning generally seeks to simultaneously optimize multiple facets of the beam ("objectives") at a time.This presents an issue, as individual beam characteristics can often be optimized only at the expense of others.For example, it is difficult to simultaneously minimize both the bunch length and the transverse emittance of a low energy electron beam in a photo-injector due to space charge forces [13].Solving multi-objective problems during simulated beamline design has recently become a relatively simple task, given the development of evolutionary algorithms and the availability of large computing clusters, which can run a large number of particle physics simulations in parallel [13,14].
By contrast, solving multi-objective optimization problems during accelerator operations presents an extremely difficult challenge due to several factors.Most notably, accelerator operators can only evaluate or observe the objectives for a single set of input parameters at any time (referred to as "serialized observations").This makes the use of evolutionary algorithms practically infeasible, due to the number of observations needed to converge to a solution if used in a serialized manner.Furthermore, an online optimization algorithm must be able to keep track of constraining functions, as well as account for relative objective preferences specified by the operators.Finally, the optimization algorithm should take into account the costs of changing accelerator parameters during optimization.
In this paper, we use the recent development of Multi-Objective Bayesian Optimization (MOBO) [15] to extend online Bayesian optimization of accelerators to solving multiple objective problems using serialized observations.We also demonstrate how to extend this algorithm to solve specific practical challenges associated with online accelerator optimization.

II. ONLINE MULTI-OBJECTIVE OPTIMIZATION OF ACCELERATORS
We start with a brief explanation of techniques currently used to solve multi-objective problems.This serves to motivate use of the MOBO algorithm for online accelerator optimization.
A simplistic way of solving a multi-objective optimization problem is to explicitly weight each objective relative to one another a priori, and add up the weighted objective values [16,17].This optimization method results in a solution found only for a single set of weights (trade-offs), and must be repeated from scratch to explore different trade-offs between objectives.Mapping out the full set of optimal trade-offs in an accelerator is highly desirable, particularly at facilities which must accommodate a variety working points, or when operators wish to benchmark beam simulation results to experimental realities.In this case, repeating the optimization for a discrete set of weights is relatively inefficient, even when Bayesian optimization methods are used [18].
To achieve this goal, multi-objective optimization algorithms attempt to find a set of points, known as the Pareto front P, that optimally balances the trade-offs between multiple competing objectives simultaneously (see Fig. 1).The Pareto front is defined as the set of "nondominated" points in objective space with respect to a reference point r (which itself must be dominated by every other observed point).Points are non-dominated if they are as good as any other observed point for every objective and are better than any other point for at least one objective.The hypervolume metric H, [19] shown in Fig. 1, is often used to characterize the quality of the Pareto front, where a larger volume corresponds to a better solution set.Adding observations to the current data set which dominate over points in the current Pareto front leads to an expansion of the hypervolume, characterized by the hypervolume improvement H I (see Fig. 1.Algorithms generally stop once this metric converges to a maximum as new observations are added, signifying that a correct approximation to the true Pareto front has been reached. A popular set of techniques used to solve multiobjective problems are known as evolutionary algorithms.These algorithms, such as Non-dominated Sorting Genetic Algorithm II (NSGA-II) [20] or Multi-Objective Particle Swarm Optimization [21][22][23], are based on the generation of a large collection of candidate solutions, which are then observed via simulation or experiment, usually in a parallelized manner.The results from each observation are then sorted into non-dominated and dominated subsets.The non-dominated subset of candidate solutions is used to produce the next "generation" of candidate solutions using a stochastic heuristic, which are then re-evaluated.The process is repeated over a number of generations until the non-dominated set of observations converges to a stationary Pareto front or the hypervolume has converged to a maximum value.It has been shown that these methods are well suited for solving accelerator design optimization problems [13,14].
Recently, surrogate assisted evolutionary algorithms have been developed which combine evolutionary algorithms with surrogate models of the objective functions [24,25].A fast executing surrogate model (possibly a GP or a neural network) is used to predict if a candidate solution generated by the evolutionary algorithm will be Pareto optimal before an observation is made.This significantly speeds up convergence over basic evolutionary algorithms by eliminating candidate observations that are not predicted to improve the Pareto frontier.The surrogate model is retrained after observations are made, thus improving the model's accuracy as the optimization progresses.This further improves the convergence speed of the algorithm, as the surrogate model gains knowledge about the objective functions and can more adequately identify which candidates will be non-dominated.
However, these algorithms still are not ideal for online accelerator optimization.Evolutionary type algorithms rely on parallelized evaluation of the objective functions, which is inefficient when restricted to serialized evaluations, as is the case for online accelerator optimization.Furthermore, evolutionary algorithms use a binary classification metric of Pareto dominance to identify which candidate points to observe.As a result, this metric does not guarantee optimal expansion of the Pareto front, as it does not consider the relative hypervolume improvement of individuals in the non-dominated subset of candidates.This further reduces the observation efficiency of these algorithms, in a case where efficiency is paramount.
The Multi-Objective Bayesian Optimization (MOBO) algorithm [15] achieves maximum efficiency by using an explicit calculation of the hypervolume improvement as an acquisition function to expand the Pareto frontier.The hypervolume improvement H I is defined as the increase in Pareto front hypervolume by adding a new observation y (see Fig. 1).Each objective is modeled using a GP surrogate model which can be used to predict the hypervolume improvement as a function of input parameters.By maximizing the hypervolume improvement acquisition function, MOBO can determine a single point that maximally increases the Pareto front hypervolume at every step in a serialized manner, making it ideal for online accelerator optimization.

III. MULTI-OBJECTIVE BAYESIAN OPTIMIZATION
We begin the explanation of MOBO by first briefly going over single-objective Bayesian optimization.This will aid our understanding of different components inside , many of which will be applied directly to the multiobjective case.To maintain consistency with reference texts, we assume that single objective optimization aims to maximize the objective, while in the multi-objective case we wish to minimize each objective.Simply multiplying any objective and its corresponding observed val-ues by -1 allows us to switch the optimization goal from maximization to minimization or vice versa.

A. Single Objective Bayesian Optimization
The goal of our optimization strategy is to maximize the function f (x) using as few observations of f as possible inside the input domain x ∈ X .Bayesian optimization uses two components to achieve this.
The first component is the GP surrogate model.A "surrogate model" in this case refers to a computationally cheap-to-evaluate predictive model, that acts as a predictive stand-in for any computationally expensive or difficult to measure system, and can be either a local or a global model.The GP surrogate produces both the predicted mean µ(x) and the corresponding uncertainty σ(x) of a random function value at the point x: f (x) ∼ GP(µ(x), k(x, x )) where k(x, x ) is the covariance (kernel) function.The kernel function represents how we expect our function to change between two points in input space, for example, we can specify how rapidly f (x) changes as a function of the distance between two points x and x .Given a set of observations D N = {(x 1 , y 1 ), (x 2 , y 2 ), . . ., (x N , y N )} where y n is the observed value which is assumed to include normally distributed noise , y n = f (x n ) + , and we can then calculate the predicted mean and variance anywhere in input space.Details on creating and using a GP surrogate model can be found in Appendix A and in Ref. [8].
The second component of Bayesian optimization is an acquisition function α(x) which codifies how promising potential observation points are, based on mean and uncertainty predictions from the GP model.To find the global optimum efficiently we wish to search regions of input space that either take advantage of previously observed extremum points (exploitation) or have a large amount of uncertainty (exploration).We choose our acquisition function such that it is maximized at the point of most interest, one that properly balances the trade off between exploration and exploitation.We can then use a standard single-objective optimization algorithm to optimize the cheap-to-evaluate acquisition function to propose the next observation location, instead of directly optimizing the expensive-to-evaluate physical experiment.Two popular acquisition functions for Bayesian optimiza-tion are expected improvement (EI) and upper confidence bound (UCB) [26,27].
Expected improvement calculates the average improvement of a point over the best observed function value where Φ(•) and φ(•) are the probability distribution function and cumulative distribution function of a Gaussian distribution respectively.Upper confidence bound calculates the most optimistic improvement at a given point, weighted by a parameter β, which explicitly specifies the trade off between exploration and exploitation For β 1 UCB prioritizes exploitation; conversely for β 1 UCB prioritizes exploration.This parameter can be increased as the optimization progresses, to prevent the optimizer from getting stuck in a local optimum [27].Combining both the GP surrogate model and the acquisition function we can now perform Bayesian optimization using the method presented in Algorithm 1.

B. Incorporating Multiple Objectives
We now extend this methodology, following [15], to incorporate P objectives f = {f 1 , f 2 , . . ., f P }.We assume that the objectives share the same input domain x ∈ X and are all observed for each input point such that D now contains the set of N observations of each objective, {(x 1 , f 1 ), (x 2 , f 2 ), . . ., (x N , f N )}.Each objective is then modeled as an independent GP such that f p (x) ∼ GP p (µ p (x), k p (x, x )), as seen in Fig. 2(a).Each GP has its own independent kernel which is trained on corresponding observations by maximizing the marginal log likelihood.
In order to proceed with optimization we must construct a scalar acquisition function α : X → R that finds points which are likely to maximally increase the Pareto frontier hypervolume.We consider two acquisition functions that have been developed for this purpose.
The first multi-objective acquisition function, expected hypervolume improvement (EHVI) seen in Fig. 2(b), is analogous to single-objective expected improvement.This acquisition function calculates the average increase in hypervolume using the probability distribution of each objective function from the surrogate model.The EHVI acquisition function is formally defined as where P is the current set of Pareto optimal points, r is the reference point, H I (P, y, r) is the hypervolume improvement from an observed point y in objective space, and ξ µ,σ is the multivariate Gaussian probability distribution function with the GP predicted mean µ and standard deviation σ for each objective.The hypervolume improvement is defined by the exclusive hypervolume contribution to the current Pareto front by adding y to the Pareto set, as seen in the green region in Fig. 2(b).
The reference point r is chosen such that all expected observations f dominate the reference point.Any predicted points in objective space that do not satisfy this condition will not contribute to the hypervolume improvement, and are thus never chosen as observation candidates.It is also important to note for optimization purposes that the prior mean for each objective GP is set to the corresponding component of the reference point.This ensures that in regions of input space where observations have not been made, the mean of the GP model returns to the reference point.As a result, the acquisition function will never predict that unobserved regions of input space contribute to the hypervolume.
The integral in Eqn. 3 can become computationally expensive to calculate, as the objective space must be decomposed into cells for which the integral has an analytical form.This can become computationally expensive for high dimensional objective spaces with even a small number of observations, as a naive decomposition algorithm scales as O(N P ).Work in this field has produced efficient methods for objective space decomposition, which improves scaling to O(N log N ) for 2-3 dimensions and O(2 P −1 N P/2 ) scaling when P ≥ 4 [28].Regardless, this computational complexity results in a significant increase in computation time when used to maximize the acquisition function, which could be called many times depending on the optimization strategy.
The second multi-objective acquisition function, Upper Confidence Bound Hypervolume Improvement (UCB-HVI), is similar to the UCB acquisition function for single objective optimization.This acquisition function describes an optimistic view of the hypervolume improvement given the surrogate model prediction, The simplicity of this acquisition function reduces computation time relative to EHVI, especially in highdimensional objective spaces.This is due to the development of advanced hypervolume computation strategies, such as the Walking Fish Group (WFG) algorithm [29] or approximate hypervolume computation algorithms [30].These algorithms can be used to calculate the hypervolume improvement by projecting points from the Pareto front onto the sub-domain that is dominated by the test point.The use of these advanced algorithms allows UCB-HVI to be a much faster calculation than EHVI when the number of objectives is large (P > 3), while still achieving similar optimization performance.As a result we exclusively use the UCB-HVI acquisition function as a starting point to perform multi-objective optimization for the rest of the paper.
We now show how the MOBO tackles a simple 2objective optimization problem in 2D input space, x = (x 1 , x 2 ).The problem is stated as ( The analytical Pareto front for this problem in objective space lies on the line segment from (f 1 , f 2 ) = (0, 2 √ 2) to (2 √ 2, 0).In input space, Pareto optimal points lie on the line segment from (x 1 , x 2 ) = (−1, −1) to (1, 1).We start with a set of 5 random input points, drawn from a uniform distribution, which are used to initialize the GP surrogate model.We choose an isotropic radial basis function (RBF) for our kernel (see Eqn. A4) with a length-scale of λ = 1.0 and a variance of σ 2 f = 0.5 which roughly matches the functional form of our objectives.The UCB-HVI acquisition function with β = 0.01 is then used to determine the next point to sample.This value of β is chosen to heavily weight exploitation since our functions are unimodal.
Fig. 3 shows optimization results after 15 optimization iterations.We see that the GP prediction for each of the objectives (Fig. 3(c),(d)) near the observed points closely resembles the true value of each objective (Fig. 3(a),(b)).After only 15 observations the Pareto front found by the UCB-HVI algorithm closely matches the analytical one seen in Fig. 3(f).We also observe that extrema of the acquisition function Fig. 3(e) (i.e. the most likely points to expand the Pareto front hypervolume) are located near the analytical Pareto optimal region in input space.Each successive extremum is located in between each previ-ously observed point due to the increase in uncertainty in-between observations.This means that after the algorithm observes points along the entire Pareto front at a low resolution, the algorithm will continually attempt to increase the hypervolume by sampling intermediate points in-between previous observations.

C. Adding optimization preferences and constraints
One major advantage of the MOBO approach is the ability to specify how the optimizer searches the input space when considering preferential treatment of objectives and adding constraints.In the case of preferential treatment, we wish to specify that the optimizer searches in a given objective subspace (as seen in Fig. 2(d)), thus optimizing one objective or a set of objectives over another.To achieve this, we simply set the acquisition function to zero outside of the selected sub-domain [31].On the other hand, if we wish to specify an objective constraint, we require that an observed objective quantity g(x) satisfies g(x) ≤ h where h is a constant.In this case, we create a surrogate model for g(x) and use it to predict the probability that the constraint will be satisfied [32].We then multiply the acquisition function by this probability to bias the optimizer against choosing points in a region that will likely violate the constraint.
While at first glance it seems that these two methods would result in the same behaviour, i.e. limiting the region where the acquisition function is non-zero, they in fact produce different results.The addition of a preferential objective sub-domain results in a Pareto front that is only found within that sub-domain, which implies that all objectives are still minimized within the sub-domain (which comes at the expense of other objectives).On the other hand, a constraint loosens this re- quirement, allowing any objective value that satisfies the constraint, which in-turn can lead to better solutions for the other objectives.The subtle difference between these two methods allows more flexibility during optimization, suiting the different operation requirements for each accelerator.We now look at how to implement each of these methods in the MOBO framework.
The preferential algorithm specifies both a maximum and minimum reference point in objective space, as supposed to a single reference point in normal MOBO.It then calculates what has been coined as the truncated hypervolume improvement [31].If we specify the truncated domain T ∈ [A, B] defined by the minimum objective point A and the maximum objective point B the truncated expected hypervolume improvement (TEHVI) is thus given by α T EHV I (µ, σ, P, A, B) := R P ∈T H I (P, y, r) • ξ µ,σ (y)dy (6) where the Pareto set P is projected onto the truncated domain.In a similar fashion, the truncated version of the UCB-HVI is given by α T U CB−HV I (µ, σ, P, β, A, B) := H I (P, y, B) y ∈ T 0 otherwise (7) where y = µ − √ βσ.If we wish to specify an inequality constraint that needs to be satisfied, we create a GP surrogate model that predicts the probability of that constraint being satisfied and modify our acquisition function accordingly.We assume that we have another observed quantity g(x) that must satisfy g(x) ≤ h whose value is stored in a dataset D g and used in a GP to predict the probability distribution of g(x).The probability of a point x satisfying the constraint condition is then which is simply a univariate Gaussian cumulative distribution function.This probability can be adapted to suit a number of different type of constraints by modifying the limits of this integral.Now we can define a new constrained version of the acquisition function α(x) as α(x) = α(x)P g (x).
Our acquisition function will be reduced anywhere we predict the constraint has a high probability of being violated and remain unchanged where there is a high probability that the constraint is satisfied.Extra constraints can be easily added by multiplying the acquisition function by their respective probabilities of satisfying each additional constraint.An example of adding a constraint to the problem specified in Eq. 5 is shown in Fig. 4. In this case we add the constraint inequality g(x) ≤ 0.5 where g(x) = x 1 , to stand in opposition of minimizing the first objective.We see the predicted probability that a point in parameter space will satisfy the constraint in Fig 4(a).Even though only a few observed points violate the constraint, we can clearly see a predicted threshold boundary, consistent with the stated constraint inequality.Furthermore, the boundary is best defined in the region of most interest, namely in the region denoted by the red arrow in Fig. 4(a) where Pareto optimal points would lie.From Fig. 4(b) we see that as the optimization progresses the constraint function surrogate model accuracy is improved, and a smaller fraction of points which do not satisfy the constraint are sampled.This leads to a reduction in the number of iterations needed to converge to a solution, as the constraint reduces the effective input parameter domain of the optimization problem.

D. Proximal input space exploration
One aspect of accelerator optimization that is often overlooked when constructing optimization algorithms is the cost associated with exploring the input parameter space.Changes to input parameters (magnetic field strengths, cavity phases, etc.) often take time that scales proportionally to the magnitude of the change.As a result, it is undesirable or infeasible to make large changes in machine input parameters frequently.Thus it is desirable to modify the acquisition function so that each optimization step travels a small distance in parameter space during optimization (i.e."proximal exploration").
Achieving this is done by modifying the acquisition function to prioritize points in input space that are near the current or most recently observed parameter setting.We modify our normal acquisition function by multiplying a multivariate Gaussian distribution, centered at the most recently observed point in input space x 0 and has a precision matrix Λ, to produce a proximal UCB-HVI (P-UCB-HVI) acquisition function given by α(x, The precision matrix in this case specifies the cost associated with changing each input parameter, where larger elements of the matrix correspond to a higher cost associated with changing a given parameter.The matrix can be specified prior to optimization or trained from multiple optimization runs (Bayesian optimization has been used to optimize similar hyper-parameters for neural network regression [33]).The addition of this extra term decreases the acquisition function far away from the most recently observed point.With this modification we expect the MOBO algorithm to sample points along the Pareto front in input space, provided that the objectives are smoothly varying.This is almost always the case in accelerator optimization problems with continuously variable input parameters.Since the weighting function is non-zero throughout the input domain, large jumps to explore regions of parameter space with high uncertainty are still allowed, only if the acquisition function α(x) is large enough to overcome the travel distance penalty.This is in contrast to simply reducing the UCB-HVI parameter β to limit exploration, which prevents any such considerations.By including the extra multi-variate Gaussian term, we maintain the optimization algorithm's ability to escape local extrema to explore regions of unobserved input space, while significantly reducing the frequency of large jumps.We demonstrate how this new method effects optimization by tracking how the UCB-HVI and P-UCB-HVI acquisition functions explore the input space, while optimizing our test problem Eq. 5. We start with a random sample of 5 points and then use UCB-HVI as our unmodified acquisition function to perform MOBO with 25 iterations, the result of which is shown in Fig. 5(a).We then repeat the optimization with our modified acquisition function P-UCB-HVI.We specify Σ = 4I where I is the identity matrix, the result of which is shown in Fig. 5(b).The modified acquisition function reduces the average distance traveled in input space during each optimization step (L), when compared to UCB-HVI (see Fig. 5 insets).By tracking the order of observation, we see that while normal UCB-HVI seems to quasi-randomly explore the input space, P-UCB-HVI explores the Pareto optimal region in a disciplined manner.It first travels along the Pareto optimal space until it reaches the end, then it explores in the vicinity of the endpoint to verify it is indeed the end of the Pareto optimal region.Finally, it reverses direction and continues exploring the Pareto optimal space, jumping over regions that have already been explored.

IV. APPLICATION TO ACCELERATOR OPTIMIZATION
We now demonstrate the MOBO framework on a multi-objective accelerator optimization problem, namely, optimizing the parameters of the Argonne Wakefield Accelerator (AWA) photoinjector [34].The AWA photoinjector uses a Cesium-Telluride cathode in a copper, radio-frequency cavity, to produce electron beams with a wide variety of bunch charges for the use in wakefield accelerator physics experiments.We consider a case where we can vary a number of parameters inside the injector and the first linac section, seen in Table I and in Fig. 6.Our goal is to simultaneously minimize a collection of beam parameters at the exit of the linac section, also seen in Fig. 6.
This problem was chosen based on previous work done towards creating a surrogate model of the AWA photoinjector [25].In this previous work, the authors used the full 3D space charge, particle in cell (PIC) code OPAL [35] to simulate a large set of randomly generated input parameters and measure the corresponding beam parameters at the injector exit.They then created a neural network based surrogate model, trained on the simulation data set.The model can be rapidly queried to retrieve output beam parameters for a given input parameter set.The authors then showed that this surrogate model accurately reproduces results from the 3D PIC simulation.We use this surrogate model for testing our optimization algorithm as it reduces simulation time by several orders of magnitude.In our first experiment, we use MOBO to minimize all seven exit beam parameters as a function of all six input parameters.We wish to compare the convergence rate of MOBO with the convergence rates of standard and recently developed algorithms for solving multi-objective optimization problems.To begin with, all of the input and output values are normalized to the range [−1, 1] in order to account for the widely varying scaling of each parameter.We assume that the functional form of each objective is smooth,and thus we choose the standard radial basis function kernel (Eq.A4) with an anisotropic precision matrix Σ = diag(l) −2 where l is a vector that stores an independent length-scale for each input parameter.Initially, a randomly generated Latin-Hypercube distribution of 20 input parameter sets with corresponding objective observations is used to train each objective GP.Hyperparameter training is done by maximizing the log marginal likelihood of the GP model with the gradient based Adam optimization algorithm [36], with 5000 iterations and a learning rate of 0.001.
Once trained, we use the UCB-HVI acquisition function to perform multi-objective Bayesian optimization with 500 sequential observations.In this case, empirical testing found that β = 0.01 gave the fastest convergence, likely due to the unimodal nature of each objective function, which allows us to aggressively exploit the GP model for the global extremum without worrying about getting caught in local extrema.
Maximizing the acquisition function is done via particle swarm optimization, implemented using the PyGMO package [37] with 64 individuals over 10 generations.In order to account for new information gained from obser-vations during optimization, we retrain the GP kernel hyperparameters with the accumulated data set every 10 observations, again using the Adam algorithm with a learning rate of 0.001 but with 1000 steps.
We re-run this optimization procedure 10 times, each with a different set of 20 randomly generated initial points every time.After each observation, we calculate the exact hypervolume of the Pareto set in normalized objective space, referenced to the maximum possible value for each objective (in this case 1).The average and variance of the hypervolume as a function of observation number after ten optimization runs is shown in Fig. 7.
For comparison we run the same optimization test, but with previously used methods for multi-objective optimization, evaluated in serial, as would be the case during online optimization.The first, Non-dominated Sorting Genetic Algorithm II (NSGA-II) [20], is a popular genetic optimization algorithm, which has been previously used to solve multi-objective accelerator design problems [13].We conducted 10 optimization runs using the NSGA-II algorithm, with a population of 20 individuals, that matched the input parameter sets used in the MOBO optimization runs.We then evolved the population for 200 generations, with a crossover probability of 0.8 and mutation probability of 0.05.The hypervolume after the first 25 generations (500 function observations) is shown in Fig. 7(a).
The second algorithm, iterated neural network (I-NN) optimization [25], is a recently developed algorithm using surrogate neural network (NN) models to choose future observation locations.In this method, observations are used to train a NN surrogate model, which in turn is optimized by the NSGA-II algorithm to propose a new set of observations that are likely to be non-dominated.We plot the predicted hypervolume from the NN surrogate model after each batch of measurements in Fig. 7(a).
From this comparison, we clearly see that MOBO reaches convergence much faster than both the NSGA-II and I-NN algorithms.While not shown in Fig. 7(a) it took about 17,500 NSGA-II observations to reach the same hypervolume that MOBO reached after 500 observations, roughly a factor of 35 times slower.
In Fig. 7(b) we show the 2D projected Pareto front on the energy spread dE and horizontal beam emittance ε x objective space from each of these optimization algorithms after 200 observations each.We observe that the Pareto front generated by NSGA-II is far from optimal and contains few points.This is a direct result from NSGA-II's inefficient sampling behavior.The points that NSGA-II chooses to observe are frequently dominated by previous observations due to the randomized heuristic used to generate observation proposals.The I-NN algorithm improves over NSGA-II by including a neural network surrogate model that directs NSGA-II towards observing non-dominated points.Neither of these algorithms include a direct calculation of the hypervolume increase for each proposed observation point, and thus do not optimally increase the hypervolume at each ob-servation step.As a result the Pareto front generated by MOBO is larger and has a higher resolution than the Pareto fronts generated by either NSGA-II or I-NN.Generally, MOBO shows similar improvements in optimization speed when used in solving a variety of different optimisation problems with varying input and objective spaces [38,39].
In Fig. 7(c) we show the 2D projected Pareto fronts generated by MOBO after 100, 200 and 500 observations.Since the objective space is high dimensional, it takes a large number of observations (100-200) for the algorithm to build up a well-defined Pareto front.Once the front is loosely meshed, the acquisition function can increase the hypervolume in one of two ways, by either finding points in objective space in between prior observations, in order to improve the Pareto front resolution (Fig. 7(c)(i)), or finding points in objective space that dominate initial observations (Fig. 7(c)(ii)).We see that most of the points present in the Pareto front after 200 observations remain on the Pareto front after 500 observations, as the majority of new observations lie in between prior observations in objective space.We conclude that in this optimization run, the algorithm correctly predicts where the true Pareto front lies most of the time, leading to a gain in optimization efficiency.This is in contrast to the heuristic methods we compare MOBO with, where only a small fraction of the observed points are actually on the true Pareto front.

B. Constrained optimization
We now investigate the effect of preferential or constrained treatment of an objective on the accelerator optimization.First, we consider a case where we want to optimize the same objectives as the previous problem but wish to only find solutions where the energy spread satisfies dE < 0.52 MeV (dE < −0.25 in normalized coordinates).To judge how this modification effects optimization, we compare the observed points in the 2D energy spread dE and horizontal emittance ε x objective space, after 200 iterations in Fig. 8.We see in Fig. 8(a) projected observations when no constraints or preferences are added.It is important to note that a large majority of the observations plotted here are on the 7D Pareto front, even though only a small fraction make up the projected front in 2D space.When preferential treatment is added to the acquisition function (Fig. 8(b)), the algorithm observes almost no points that violate this preference.Furthermore, since the effective volume of the objective space is significantly reduced, the optimizer finds a higher quality 2D Pareto front in the same number of steps as the unconstrained case.
Second, we consider a case where we relax this preference, removing the energy spread minimization objective entirely and replace it with the inequality constraint dE < 0.52 MeV.The resulting distribution of observations appears significantly different in this case (Fig.

8(c))
. In this case, more observations are made that violate the constraint than in the preferential treatment, which is necessary to accurately model the constraining function near the boundary.Furthermore, the optimizer allows the energy spread to increase up to the constraint value, instead of attempting to minimize it, in order to better optimize the six remaining objectives.We see this effect in Fig. 8(d) where the Pareto front for a different set of objectives, σ x and ε x , is better when the energy spread preference is relaxed to a constraint.The 2D front then improves again when the constraint is completely removed, owing to the fact that dE is allowed to increase further when all constraints are removed.

C. Proximal optimization
Finally, we demonstrate the use of P-UCB-HVI on optimizing the AWA problem.We start with the same set of 10 initial sets of observations as in Section IV A with the same hyperparameter training schedule.However this time we run MOBO optimization using the P-UCB-HVI acquisition function, with an isotropic precision matrix (see Eq. 10) Σ = 0.25I is defined in normalized input space.
Results from these optimization runs are presented in Fig. 9.We observe that during optimization, when the UCB-HVI acquisition function is used, the solenoid strength parameter K 1 is wildly varied to increase the hypervolume as much as possible each step.However, when the localization term is added to the acquisition function, the frequency and amplitude of large jumps in parameter space are both decreased.While not shown here, change in behavior is mirrored in each of the other 5 input parameters.Furthermore, the use of P-UCB-HVI acquisition function over the generic UCB-HVI function only minimally reduces the overall speed at which the method maximizes the Pareto front hypervolume (Fig. 9c).

V. CONCLUSION
In this paper we have demonstrated that the MOBO framework can be used to solve online multi-objective optimization accelerator physics problems.This method efficiently finds the Pareto front in a serialized manner, which makes it viable for use in online accelerator optimization.The framework also allows the user to easily specify objective preferences and constrain the objective space through the use of GPs.Finally, we demonstrated that adding a localization term to the acquisition function effectively restricts the MOBO algorithm to prioritize moving through input space in a proximal, disciplined manner, which is especially important for practical use in accelerator facilities.
Our results also demonstrate the reasons why MOBO is ideal for online multi-objective accelerator optimization.As stated previously, optimization takes place after every observation step, as opposed to methods designed for parallel use which are not sample efficient when used in a serialized context.Second, observation points proposed by MOBO directly incorporate learned information about the objective function instead of using a heuristic method that is model independent to generate potential solutions.As a result, MOBO strategically proposes solutions which optimally increase Pareto front hypervolume and improve Pareto front resolution.While this comes at an increase in computational complexity, the extra computation time needed (estimated to be < 5 s for most problems) is small relative to the reduction in optimization time associated with faster convergence to the Pareto front.
The use case for practical online multi-objective optimization can be viewed as experimental beamline characterization.Accelerator working points are often predetermined through conducting multi-objective optimization on simulated versions of the beamline.However, simulations rarely capture the full behavior of the beam in the real accelerator and as a result any selected working point from simulation might not be ideal in reality.MOBO can be used to find the realistic ideal trade off between given objectives and the input parameters that are Pareto optimal during beamline commissioning or machine study.Once a trade-off has been selected by specifying an explicit weighting of the objectives and the correct input parameters have been determined, single objective optimization strategies can take over during regular operation to do local optimization in response to noise and/or temporal drift.Furthermore, the experimental multiobjective optimization process can be repeated periodically to monitor changes in machine performance over time and benchmark simulated Pareto fronts to experimental results.
While the goal of this work is to apply the MOBO framework towards online optimization of accelerators, this technique can also be applied to solve computational accelerator physics optimization problems.High fidelity simulations of particle accelerator physics (beamlines, cavities, magnets etc.) are also a resource intensive process, often requiring time on computational clusters which have limited availability.The recently developed q-Expected Hypervolume Improvement (qEHVI) [40] algorithm is a parallelized extension of EHVI.In contrast to EHVI, which proposes a single point at each optimization step, qEHVI proposes multiple q points that are likely to increase the Pareto front hypervolume for each optimization step.These points can be evaluated in a batched parallel process on a computing cluster, which significantly reduces overall optimization time while maintaining the sampling efficiency advantages of EHVI.Furthermore, multi-fidelity approaches to MOBO have also been proposed to reduce optimization time [41], by incorporating low-cost, approximate simulations as a temporary stand-in for expensive high fidelity simulations.The application of these methods to solving computational accelerator problems has the potential to dramatically reduce resource requirements for those in the field.

Figure 1 .
Figure 1.Cartoon of multi-objective optimization where each objective is to be minimized.Multi-objective optimization attempts to find a set of points known as the Pareto front P that dominate over a reference point r and any other observed points in objective space.The Pareto front hypervolume H (shown in blue) is the axis-aligned volume enclosed by the Pareto front and a reference point r.Making a new observation y, that dominates over points in the current Pareto front, leads to an increase in hypervolume (shown in green), referred to as the hypervolume improvement HI .

Figure 2 .
Figure 2. Cartoons showing hypervolume improvement metrics used for MOBO.Blue regions denote current Pareto front hypervolume defined by a reference point (upper right) and three observed points.Green regions denote hypervolume improvement by adding an observation.(a) Each objective is modeled by an independent GP that predicts the function value and an uncertainty, the next optimization point is chosen by maximizing the acquisition function.(b) Expected hypervolume improvement (EHVI) where the distribution of predicted objective values is given by a probability distribution (orange shading) centered at the black cross.(c) Upper confidence bound hypervolume improvement (UCB-HVI) where the hypervolume improvement is determined by an optimistic view of the predicted objective values.(d) Truncated UCB-HVI where we only consider hypervolume contributions within the un-shaded sub-space T .

Figure 3 .
Figure 3. Optimization results after 15 function observations using the MOBO algorithm on the problem defined in Eq. 5. Functions plotted in (a-e) are normalized to the range 0 (blue) to 1 (yellow).(a-b) Ground truth of f1, f2.Green line denotes the analytical location of Pareto optimal points.(cd) GP mean prediction for f1, f2 respectively.(e) UCB-HVI acquisition function with β = 0.01.(f) Plot of observation values in objective space and the analytical Pareto front (green line).(c-f) Colored dots denote observation locations in input and output space, colored by observation number (blue to pink).Orange crosses denote the locations of 5 random, initial observation points.

Figure 4 .
Figure 4. (Color) (a) Probability map of satisfying the constraint x1 ≤ 0.5 after 50 observations, selected by constrained UCB-HVI, while optimizing Eq. 5.The red arrow points to where the constraint probability is most accurate, which is where the greatest number of observations are located.(b) Average fraction of points satisfying the constraint over 25 randomly initialized constrained MOBO optimization runs.Shading denotes one sigma spread.

Figure 5 .
Figure 5. (Color) Optimization trajectories in input space after 25 observations of objective functions in Eq. 5 using (a) the unmodified UCB-HVI acquisition function and (b) the P-UCB-HVI acquisition function.Insets: Distribution of distances (L) traveled per step in input space.

Figure 6 .
Figure 6.Cartoon of the AWA photoinjector and first linac cavity.Input and output parameters used in optimization are labeled.Reproduced from [25].

Figure 7 .
Figure 7. (Color) (a) Average Pareto front hypervolume H of 10 multi-objective optimization runs of the AWA example using MOBO, NSGA-II and iterated neural network (I-NN) algorithms.Shading and error bars denote 1 sigma variance.(b) Projected hypervolume onto energy spread (dE) vs. transverse emittance (εx) sub-space after 200 observations for each optimization algorithm shown in (a).(c) Projected hypervolume onto dE vs. εx subspace after 100 (light blue), 200 (blue) and 500 (dark blue with orange outline) observations using the MOBO algorithm.Inset zoom (i) shows an increase in hypervolume due to increasing the Pareto front resolution, while inset zoom (ii) shows an increase in hypervolume due to finding new points that completely dominate old observations in projected space.

Figure 8 .
Figure 8. (Color) Plots showing energy spread (dE) and horizontal beam emittance (εx) observations after 300 observations taken by MOBO algorithms.(a) MOBO with no constraints.(b) MOBO with an optimization preference of dE < 0.52 MeV.(c) MOBO with an inequality constraint of dE < 0.52 MeV.(d) Projected Pareto fronts for σx vs. εx for each case above.The dotted lines in (b) and (c) denotes the preference/constraint limit.

Figure 9 .
Figure 9. (Color) Comparison between normal UCB-HVI and the proximal UCB-HVI acquisition functions when used to perform optimization of the AWA photoinjector.(a) Solenoid 1 strength parameter over 300 observations using UCB-HVI (left) and corresponding distribution of ∆K1 for each step (right).(b) Solenoid 1 strength parameter and travel distance distribution when P-UCB-HVI is used.(c) Average Pareto front hypervolume of 10 optimization runs for UCB-HVI and P-UCB-HVI with identical random initialization sets.Shading denotes one sigma variance.

Table I .
AWA Input Parameters