Statistically Characterising Robustness and Fidelity of Quantum Controls and Quantum Control Algorithms

Robustness of quantum operations or controls is important for building reliable quantum devices. The robustness-infidelity measure (RIM$_p$) is introduced to statistically quantify the robustness and fidelity of a controller as the p-th order Wasserstein distance between the fidelity distribution of the controller under any uncertainty and an ideal fidelity distribution. The RIM$_p$ is the p-th root of the p-th raw moment of the infidelity distribution. Using a metrization argument, we justify why RIM$_1$ (the average infidelity) suffices as a practical robustness measure. Based on the RIM$_p$, an algorithmic robustness-infidelity measure (ARIM) is developed to quantify the expected robustness and fidelity of controllers found by a control algorithm. The utility of the RIM and ARIM is demonstrated by considering the problem of robust control of spin-$\tfrac{1}{2}$ networks using energy landscape shaping subject to Hamiltonian uncertainty. The robustness and fidelity of individual control solutions as well as the expected robustness and fidelity of controllers found by different popular quantum control algorithms are characterized. For algorithm comparisons, stochastic and non-stochastic optimization objectives are considered, with the goal of effective RIM optimization in the latter. Although high fidelity and robustness are often conflicting objectives, some high-fidelity, robust controllers can usually be found, irrespective of the choice of the quantum control algorithm. However, for noisy optimization objectives, adaptive sequential decision-making approaches such as reinforcement learning have a cost advantage compared to standard control algorithms, and the infidelities they obtain are more consistent with the RIM values at low noise levels.

Standard quantum control methods for steering quantum devices mostly focus on finding controls that have high fidelity using mathematical models [20-22]. However, if the operation of quantum devices is subject to noise, high fidelity itself is insufficient to gauge the performance of a control scheme, and extra effort is required to systematically search for solutions that are both robust against noise and of high fidelity [23,24]. This requires a notion of robustness and, ideally, a single measure that can capture robustness and fidelity, enabling the identification and construction of more efficient methods to find controls that satisfy both properties.
In this paper, we introduce a general statistical diagnostic based on the Wasserstein distance of order p [25] to evaluate the robustness and fidelity of quantum control solutions and the algorithms used to find them. This is applicable to any quantum control problem where the fidelity is a random variable with a probability distribution over [0,1]. The Wasserstein distance between probability distributions is a measure of the minimal cost of probability mass transport between two distributions. In Sec. II, the p-th order Robustness-Infidelity Measure (RIM$_p$) is defined to quantify the robustness and fidelity of a quantum controller. It is the p-th order Wasserstein distance between the probability distribution for the fidelity induced by noise and the ideal distribution for a perfectly robust controller, described by a Dirac delta function at fidelity 1. We show that the RIM$_p$ is the p-th root of the p-th raw moment of the infidelity distribution, a non-parametric measure independent of any particular assumption on the distribution.
Using measure-theoretic norm scaling relations between RIMs of different order, we argue that RIM$_1$, the average infidelity, is sufficient as a practical measure of robustness and fidelity. As such, it coincides with the average infidelity already used in robust [26,27] and stochastic/adaptive quantum control settings [28,29]. More generally, the RIM$_p$ is related to the risk-tunable fidelity measure using a utility function, as introduced in Ref. [30].
The RIM has practical utility by allowing us to choose among similar, high-fidelity controllers, acting as a post-selector for robust controllers, agnostic of the algorithm used to find them. This may be computationally more efficient than optimizing the RIM directly, as we see in Sec. IV C. Moreover, it can also be adapted to compare the performance of control algorithms in finding not only high-fidelity but also robust controllers. To that end, we introduce an Algorithmic RIM (ARIM), averaging RIMs over multiple controllers, in Sec. II.
In Sec. IV we illustrate the RIM and draw useful insights for robust quantum control by generating controllers for static energy landscape control of the XX Heisenberg model, exploiting the degree of freedom afforded by the existence of multiple optima in quantum control [31]. We analyze their robustness properties and the performance of algorithms in finding effective controllers using four optimization algorithms representing different, commonly employed approaches: (1) L-BFGS: a second-order gradient-based optimization using an ordinary differential equation model of the quantum system to compute the fidelity under perfect conditions [32]; (2) Proximal Policy Optimization (PPO): a model-free reinforcement learning algorithm, having no prior knowledge of the system [33]; (3) Nelder-Mead: a derivative-free simplex-based heuristic search method [34]; and (4) Stable Noisy Optimization by Branch and Fit (SNOBFit): another derivative-free method that performs model-free learning by using regression to estimate gradients via a branch-and-fit method [35]. Here, (1) serves as a baseline for optimization over a noise-free fidelity objective functional under ideal conditions. (2) represents a machine learning approach with minimal knowledge. (3) and (4) are derivative-free methods to handle stochastic objective functionals. A detailed motivation for the choice of these algorithms is presented in Sec. III B. These choices are not exhaustive but serve as a diverse set of algorithms to which we apply the RIM and ARIM, illustrating their utility and giving some indication of the performance of common control algorithms for the specific robust control problem. The algorithms were implemented in Python. Specifically, we used the SciPy library for (1) and (3) [36], and (4) is obtained from Ref. [37]. Our code and data are available at Ref. [38].
Our experimental motivation is four-fold: (A) by comparing the robustness of controllers without regard to the optimization algorithm, we wish to answer whether high fidelity implies high robustness using the RIM of the individual controllers (Sec. IV A); (B) by conducting a distributional comparison of controllers, we wish to understand how likely it is that a given algorithm produces controllers in an ideal (no-noise) setting that are robust in noisy conditions (Sec. IV A 2); (C) we use the ARIM to study how training noise, of the same nature as the robustness noise model and applied during optimization, affects an algorithm's ability to find robust controllers (Sec. IV B). For a fair comparison, we conduct (B) and (C) with a fixed number of objective function calls allotted to each algorithm. (D) In Sec. IV C, we try to understand an algorithm's asymptotic ability to find robust controllers using the ARIM through optimizing the RIM by allowing unlimited objective function calls. We consider two settings in this scenario: stochastic and non-stochastic fidelity optimization. In the latter case we optimize over a fixed set of Hamiltonians sampled once according to a noise model, while in the former case the Hamiltonians are stochastically chosen at each objective function evaluation using the same noise model.
Our main numerical findings are:
• High-fidelity controllers are not always robust, but the non-robust controllers can be filtered out using the RIM.
• Using a consistency statistic, we show that PPO controller infidelities (RIM at no noise) are more correlated with RIM values at low noise levels compared to the other algorithms. More generally, a strong signal in the consistency statistic predicts RIM robustness while avoiding its explicit evaluation.
• For constrained objective function calls, there appear to be problem-dependent optimal levels of noise that produce more robust controllers for PPO, in contrast to L-BFGS, SNOBFit and Nelder-Mead.
• For the non-stochastic optimization objective, all algorithms obtain controllers that are robust with respect to the noise models included in the objective, provided there are no constraints on resources.
• However, if the optimization objective is stochastic, the ARIM improves asymptotically only for PPO. In either case, PPO requires fewer function calls compared to the other algorithms, which highlights the potential of adaptive sequential decision-making strategies like reinforcement learning for NISQ optimization problems, where not all uncertainty can be captured by non-stochastic objective functionals (e.g., shot noise).
Lastly, from the perspective of classical control [39], it is well known that accuracy conflicts with robustness through the formula S + T = I, where the sensitivity S is related to the tracking error and the complementary sensitivity T to the sensitivity of the tracking error to uncertainties. This restriction does not map directly to quantifying fidelity versus robustness in the quantum domain [24,40], although it is recovered in some cases [41]. One reason for the discrepancy is that the frequency-domain limitations of classical feedback control have limited applicability in quantifying performance in the time domain. The RIM combines the two figures of merit into one single measure: a small RIM means high fidelity and high robustness, while a large RIM means poor fidelity and poor robustness.

II. MEASURING ROBUSTNESS AND FIDELITY OF QUANTUM CONTROLS
A. The General Quantum Control Problem

The physical system we wish to control is represented by a Hamiltonian
$H(t, u) = H_0 + H_u(t)$,
where the time-independent drift Hamiltonian $H_0$ describes the natural dynamics of the system and the control Hamiltonian $H_u(t)$ describes the time-dependent control with the tunable, usually piecewise constant, control parameters $u$. The closed-system dynamics are governed by the Schrödinger equation, which can be written in terms of the unitary evolution operator from time $t_0$ to time $t_1$,
$U(t_0, t_1, u) = \mathcal{T} \exp\!\left( -\tfrac{i}{\hbar} \int_{t_0}^{t_1} H(t, u)\, dt \right)$,
where $\mathcal{T}$ denotes time ordering and $\hbar$ is the reduced Planck constant.
In general, the control problem is formulated as optimizing a fidelity $F$ over a set of admissible controls. A notion of fidelity that reflects most definitions used in practice is given by $F := |\langle G|K \rangle|^2$, which measures the similarity between normalized objects $G$ and $K$. If we wish to prepare a state $G = |\psi_f\rangle$ from an initial state $|\psi_0\rangle$ at time $t_0$, then $K = U(t_0, t_1, u)|\psi_0\rangle$ and the optimization problem is given by
$\max_{(u, t_1) \in X} F$,
where $X$ is the domain of allowed controls, here including the final time $t_1$. A variant, up to normalization, of the state fidelity is the Hilbert-Schmidt inner product $F = \operatorname{Tr}(W^\dagger V)$ between a desired unitary transformation $W$ and the gate achieved by the control, $V = U(t_0, t_1, u)$. In general, we assume the fidelity is bounded, and without loss of generality we assume it lies in $[0, 1]$, where $F = 1$ if and only if $G = e^{i\phi} K$ up to a global phase $\phi$ (and equivalently for the Hilbert-Schmidt inner product).
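Both fidelity notions above are straightforward to compute numerically. The following is a minimal sketch (the function names `state_fidelity` and `gate_fidelity` are our own, not from the paper's code) illustrating that both are invariant under a global phase $e^{i\phi}$:

```python
import numpy as np

def state_fidelity(g, k):
    """F = |<g|k>|^2 between normalized state vectors g and k."""
    return abs(np.vdot(g, k)) ** 2

def gate_fidelity(w, v):
    """Normalized Hilbert-Schmidt overlap |Tr(W^dag V)|^2 / N^2
    between a target unitary w and the achieved unitary v."""
    n = w.shape[0]
    return abs(np.trace(w.conj().T @ v)) ** 2 / n ** 2

# A global phase leaves both fidelities unchanged.
psi = np.array([1.0, 0.0])
f_state = state_fidelity(psi, np.exp(1j * 0.7) * psi)        # 1.0
f_gate = gate_fidelity(np.eye(2), np.exp(1j * 0.3) * np.eye(2))  # 1.0
```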

B. Robustness-Infidelity Measure
Uncertain dynamics turn the fidelity $F$ into a random variable with a probability distribution $P(F)$. Intuitively, we call a controller robust if this distribution has a low spread. While a low spread alone may indicate robustness, low fidelity means the controller does not realize the target operation well. So we also expect a fidelity close to 1. That means the perfect distribution under any uncertainties is $\delta_1$, the Dirac delta distribution at maximum fidelity 1. In particular, we consider the delta function $\delta_x$ to be defined by an indicator cumulative distribution function (CDF),
$C_{\delta_x}(y) = \mathbb{1}[y \geq x]$.
This permits the familiar delta function property for integration w.r.t. a test (rapidly decaying) function,
$\int f(y)\, \delta_x(y)\, dy = f(x)$.
Our goal is to define a distance between probability distributions that measures closeness between the ideal and the achieved probability distribution, in order to combine high fidelity and its robustness into a single measure. For this we take the Wasserstein or Earth mover's distance $W$ [25,42] due to the facts that: (1) it allows us to compare two probability distributions that do not share a common support, and, in particular, compare discrete and continuous distributions; (2) its easy geometric interpretation helps with its optimization; and (3) a simplification allows it to be calculated easily, as shown next.
The dual formulation of the p-th order Wasserstein distance [43] between two distributions $\mu, \nu$ is given by
$W_p^p(\mu, \nu) = \sup_{h, g} \left\{ \int h(x)\, d\mu(x) - \int g(y)\, d\nu(y) \right\}$,
where the supremum is over functions satisfying $h(x) - g(y) \leq \|x - y\|^p$. Even though this form seems abstract, for one-dimensional distributions we can analytically compute the optimal maps $h, g$:

Theorem 1 (Prop. 1 in [43]). The p-th Wasserstein distance $W_p(\mu, \nu)$ for one-dimensional probability distributions $\mu$ and $\nu$ with finite p-moments can be rewritten as
$W_p(\mu, \nu) = \left( \int_0^1 |Q_\mu(z) - Q_\nu(z)|^p\, dz \right)^{1/p}$,
where $Q_\mu(z) = \inf\{x \in \mathbb{R} : C_\mu(x) \geq z\}$ denotes the quantile function and $C_\mu$ is the cumulative distribution function of $\mu$, and likewise for $Q_\nu$.
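For empirical distributions with equally many samples, the quantile integral of Thm. 1 reduces to an average over paired order statistics. A minimal sketch (the function name `wasserstein_p` is ours, and the equal-sample-count assumption is a simplification for illustration):

```python
import numpy as np

def wasserstein_p(x, y, p=1):
    """p-th order Wasserstein distance between two 1-D empirical
    distributions with the same number of samples: the quantile
    functions are step functions, so the integral over z becomes
    a mean over sorted (order-statistic) pairs."""
    xs, ys = np.sort(np.asarray(x)), np.sort(np.asarray(y))
    return np.mean(np.abs(xs - ys) ** p) ** (1.0 / p)

# Distance between fidelity samples and the ideal delta_1,
# i.e. an empirical RIM_1 (average infidelity).
f = np.array([0.95, 0.97, 0.99, 1.0])
d = wasserstein_p(f, np.ones_like(f), p=1)  # 0.0225
```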
Remarkably, the optimal transport distance between one-dimensional distributions $\mu, \nu$ over all possible transportation plans can be computed in terms of their quantile functions $Q_\mu, Q_\nu$. From here, following Thm. 1, it is straightforward to define the p-th Robustness-Infidelity Measure,
$\mathrm{RIM}_p := W_p(P(F), \delta_1)$.
It can be written in terms of the raw moments (see App. A 1):
$\mathrm{RIM}_p = \left( \mathbb{E}_{f \sim P(F)}\!\left[(1 - f)^p\right] \right)^{1/p}$,
where $f$ is a fidelity sample drawn from the distribution $P(F)$ and $1 - f$ is the corresponding infidelity sample. We use the expectation operator defined as $\mathbb{E}_{f \sim P(F)}[(\cdot)] := \int (\cdot)\, P(F = f)\, df$. For $p = 1$, we recover the average infidelity,
$\mathrm{RIM}_1 = \mathbb{E}_{f \sim P(F)}[1 - f]$.
To compute the RIM$_p$, we estimate $P(F)$ using $n$ fidelity samples $f_1, f_2, \ldots, f_n$. Such samples may be obtained in practice via Monte Carlo simulation or physical experiments [44]. Hence, barring the computational or experimental expense of obtaining these samples, the RIM$_p$ is easy to compute. In case the dynamics of the system are certain, i.e., $P(F) = \delta_f$ for some constant fidelity value $f$, the RIM$_1$ is equal to the infidelity $1 - f$. Moreover, the RIM$_1$ is small if and only if the controller is robust (in the sense of the fidelity distribution having a low spread) and close to the maximum fidelity.
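The raw-moment form makes the sample estimate of the RIM$_p$ a one-liner. A minimal sketch (the function name `rim` is ours):

```python
import numpy as np

def rim(fidelities, p=1):
    """RIM_p estimated from fidelity samples: the p-th root of the
    p-th raw moment of the infidelity samples 1 - f."""
    infid = 1.0 - np.asarray(fidelities)
    return np.mean(infid ** p) ** (1.0 / p)

f = np.array([0.90, 0.95, 0.99, 1.00])
r1 = rim(f, p=1)   # average infidelity: 0.04
r2 = rim(f, p=2)   # weights outliers more heavily
```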

C. The Average Fidelity is Sufficient for Robustness Comparisons
We motivate why the RIM$_1$ is sufficient for comparing robustness and fidelity of controllers by making use of the fact that the RIMs of different orders computed on the estimated fidelity distribution are in agreement. We obtain the following bounds between the lower- and higher-order RIMs (see App. A 3):
$\mathrm{RIM}_{p'} \leq n^{1/p - 1/p'}\, \mathrm{RIM}_p$, (10a)
$\mathrm{RIM}_p \leq \mathrm{RIM}_{p'}$, (10b)
for $p < p'$, where $n$ is the number of samples used to estimate the RIM. Eq. (10b) is stronger and states that RIM$_p$ is less sensitive to outliers than RIM$_{p'}$, while Eq. (10a) states that for fixed $n$ and $p$, RIM$_{p'}$ growth is sublinear (the bounding factor behaves as $n^{-1/p'} \propto \exp(-\ln n / p')$). This can be made tighter by adding additional assumptions on the nature of $P(F)$, but these depend on the specific control problem. The upper bound becomes loose with increasing $n$, but highlights the constraining nature of the deviation of higher-order RIMs from RIM$_1$. This means that the higher-order RIMs do not capture additional useful robustness information for comparisons, with the base case in Eq. (10b) being decided by the RIM$_1$. The RIM$_1$ has low sensitivity to outliers (see App. A 3), which makes it easier to estimate than higher-order RIMs, which, like the worst-case fidelity, are harder to obtain accurately in practice (as more samples are required, which also explains the presence of $n$ in the inequality).
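The two bounds above can be checked numerically on an empirical infidelity distribution of $n$ samples. This sketch (our own illustration, with an arbitrary Beta-distributed infidelity model not taken from the paper) verifies $\mathrm{RIM}_p \leq \mathrm{RIM}_{p'} \leq n^{1/p - 1/p'} \mathrm{RIM}_p$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
# Arbitrary infidelity samples on [0, 1] (illustrative choice only).
infid = rng.beta(2.0, 30.0, size=n)

def rim_from_infidelities(x, p):
    """Sample estimate of RIM_p from infidelity samples x."""
    return np.mean(x ** p) ** (1.0 / p)

p, pp = 1, 2
rim_low = rim_from_infidelities(infid, p)    # RIM_p
rim_high = rim_from_infidelities(infid, pp)  # RIM_p'
factor = n ** (1.0 / p - 1.0 / pp)           # n^{1/p - 1/p'}
```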
The bound in Eq. (10b) gets tighter for large $n$ but also for decreasing infidelities, so in this regime the RIMs are in agreement. Another way to see this is to note that the Wasserstein distance provides a structure-preserving geodesic between any fidelity distribution and the ideal $\delta_1$: the distributions converge together with their RIMs of any order. So especially when approaching the ideal distribution $\delta_1$, i.e., when the RIM$_1$ is small for a high-fidelity, robust controller, there is strong agreement between RIMs of all orders. For example, the variance of the distributions decreases as
$\operatorname{Var}(F) \leq \mathrm{RIM}_2^2$.
However, the fact that outliers are more influential for higher RIM orders proves useful for optimization [30] where such behavior is sought after, while our goal here is robustness/fidelity comparison. For this goal, in general, outliers are obstructive as they would hide the general distributional trend. From now on we will refer to the RIM$_1$ without the subscript.

D. Perturbations
Next, we define the noise in the system as perturbations of its uncertain dynamics that give rise to $P(F)$. A perturbation to the full Hamiltonian in Eq. (1) can be expressed as $\tilde{H}(t, u) = H(t, u) + \gamma S \in \mathbb{C}^{n \times n}$, where $\gamma \in \mathbb{R}$ describes the strength of a perturbation and $S \in \mathbb{C}^{n \times n}$ its structure, usually normalized using some matrix norm.
To induce an uncertainty into the dynamics, we treat $\gamma$ and $S$ as random variables drawn from some probability distributions. This gives us a general way to represent any physically relevant uncertainties in Hamiltonian parameters.
The structure $S$ may be fixed, e.g., describing the uncertainty in some coupling parameter of the Hamiltonian, while $\gamma$ is drawn from a normal distribution. This would be consistent with a (linear) structured perturbation in classical robust control theory [45]. Instead, $S$ may also be drawn from a probability distribution, describing uncertainties across multiple Hamiltonian parameters. While this generalizes structured perturbations, note that they do remain linear w.r.t. the strength. If $S$ is sampled uniformly on the unit sphere, according to its normalization, we have an unstructured perturbation, with (uncertain) strength determined by $\gamma$. Conceptually, if $\gamma$ is drawn from a normal distribution with zero mean and standard deviation $\sigma$, then $\gamma S$ describes a "fuzzy" ball $B_\sigma$ around $H(t, u)$. In this paper, we consider unstructured perturbations that are less idealized, in some sense, than the structured perturbations usually considered in classical control [46], allowing the robustness results to be interpreted generically without the need to consider specific sources of uncertainties arising from specific quantum device designs. For simplicity, we write $P_\sigma(F)$ for a fidelity distribution obtained by unstructured perturbations drawn from $B_\sigma$.
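One simple way to draw an unstructured perturbation $\gamma S$ as described above is to sample a random Hermitian direction, normalize it, and scale by a normal strength. This is a minimal sketch under our own modeling assumptions (Gaussian Hermitian ensemble for the direction, Frobenius normalization); the function name is hypothetical:

```python
import numpy as np

def unstructured_perturbation(n, sigma, rng):
    """Draw gamma*S: S is a random Hermitian matrix normalized to
    unit Frobenius norm (a uniform direction on the unit sphere of
    Hermitian matrices), and gamma ~ N(0, sigma^2)."""
    a = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    s = (a + a.conj().T) / 2.0      # enforce Hermitian structure
    s /= np.linalg.norm(s)          # unit Frobenius norm
    gamma = rng.normal(0.0, sigma)
    return gamma * s

rng = np.random.default_rng(1)
dH = unstructured_perturbation(5, 0.1, rng)  # added to H(t, u)
```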
Our quantification of robustness depends on the choice of $\gamma S$ and the uncertainties in these quantities. Note that neither the choice of the noise model nor the magnitude of the noise level is restricted, as our approach is not perturbative around the optimum $(t_{\mathrm{opt}}, u_{\mathrm{opt}})$, which is how noise is usually modelled in the literature [13,18,47-50]. This approach becomes relevant when confidence in an analytical physical model is low or there are missing terms that cannot be analytically or perturbatively accounted for, e.g., complicated noise sources. This is also in accordance with modern robustness theory and the $\mu$ function in classical [51] and quantum [45,52,53] settings.
To further motivate the RIM, we study how it compares with other statistical measures of robustness. The RIM generally correlates with the worst-case or minimum sample fidelity, the variance or higher moments, and the yield function $Y(F_{\mathrm{Th}})$, which is the fraction of fidelities greater than a threshold fidelity $F_{\mathrm{Th}}$. Fig. 1 shows a scatter plot of RIM values versus $Y(0.95)$, $Y(0.98)$ and the worst-case fidelity for an example problem using Eq. (13) discussed in Sec. III A. The RIM has an advantage over $Y$ in that it does not depend on an arbitrary choice of $F_{\mathrm{Th}}$.
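The alternative statistics mentioned above are equally easy to estimate from the same fidelity samples. A minimal sketch (function name `yield_fraction` is ours):

```python
import numpy as np

def yield_fraction(fidelities, f_th):
    """Yield Y(F_Th): fraction of fidelity samples above the
    threshold f_th."""
    return np.mean(np.asarray(fidelities) > f_th)

f = np.array([0.90, 0.96, 0.97, 0.99])
y95 = yield_fraction(f, 0.95)  # 0.75
worst = f.min()                # worst-case sample fidelity: 0.90
```

Unlike the RIM, both statistics either depend on an arbitrary threshold (the yield) or on the extreme order statistic (the worst case), which is noisy to estimate from few samples.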

E. Measuring the Performance of Control Algorithms
We can also apply the previous arguments to derive a measure to compare the ability of control algorithms to find high-fidelity, robust controllers. Let $P(\mathrm{RIM})$ be a distribution of RIM values of controllers obtained by a particular algorithm for a particular control problem with specific uncertainties. This can be estimated by sampling $L$ controllers produced by the algorithm. The ideal of this distribution is $\delta_0$, so that we can define the Algorithmic Robustness-Infidelity Measure,
$\mathrm{ARIM} := W_1(P(\mathrm{RIM}), \delta_0) = \mathbb{E}[\mathrm{RIM}]$,
following the same argument as before. The ARIM is small if and only if the underlying RIM distribution $P(\mathrm{RIM})$ has higher density at or near $\mathrm{RIM} = 0$, i.e., is close to the ideal $\delta_0$.
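Estimating the ARIM therefore amounts to averaging per-controller RIM estimates. A minimal sketch (function names are ours; the two controllers and their fidelity samples are illustrative, not the paper's data):

```python
import numpy as np

def rim1(fidelities):
    """RIM_1 of one controller: average infidelity of its samples."""
    return np.mean(1.0 - np.asarray(fidelities))

def arim(per_controller_samples):
    """ARIM: W_1 distance of the RIM distribution to delta_0,
    i.e. the mean of the per-controller RIM values."""
    return np.mean([rim1(f) for f in per_controller_samples])

# L = 2 controllers found by some algorithm, each evaluated on
# its own bootstrapped fidelity samples under noise:
samples = [np.array([0.98, 0.99, 1.00]),   # robust controller
           np.array([0.90, 0.95, 0.94])]   # less robust controller
a = arim(samples)  # 0.04
```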

III. ROBUSTNESS FOR STATIC CONTROL PROBLEMS
We study the robustness of static control problems, where the controls are time-independent, instead of the usual time-dependent controls. Previous work has shown that particularly robust controls can be found for these systems [23,24,40,45]. While these systems are often not fully controllable, solutions for specific operations can be found via optimization [55]. The static approach is simpler in the sense of having fewer control parameters to optimize over, which reduces computational and experimental complexity. This makes the problem suitable to demonstrate the practical usage of the RIM in a concrete example and to explore the robustness properties of the control algorithms as well as the controllers they find.

A. Information Transfer in the Single Excitation Subspace of XX Spin Chains
We consider a network of $M$ spins represented by the quantum Heisenberg model given by the Hamiltonian
$H = \sum_{j} \left( J_x \sigma_j^x \sigma_{j+1}^x + J_y \sigma_j^y \sigma_{j+1}^y + J_z \sigma_j^z \sigma_{j+1}^z \right)$,
where $\sigma_j^a = I^{\otimes (j-1)} \otimes \sigma^a \otimes I^{\otimes (M-j)}$ and $\sigma^a$ are the usual Pauli matrices. We set $J_z = 0$ and $J_x = J_y = J$ for the XX model with uniform couplings. This model has been studied extensively, starting with Ref. [56] in 1961, and a more recent review of the system, as it relates to quantum communication, is provided in Ref. [57]. Conditions for perfect state transfer along XX chains were derived in Ref. [58] and applied to NMR systems [59]. Similar experiments have been carried out in photonic systems [60,61], and proposals for engineering similar systems with trapped ions [62] and cold atoms [63] exist.

FIG. 2. To illustrate the RIM robustness measure, two static controllers for an XX spin chain of length five for transferring an excitation from spin $|1\rangle$ to $|3\rangle$ are compared. The empirical approximations to the CDFs for the two controllers, $l = 1, 2$, were simulated using 100 bootstrapped perturbations with $\sigma = 0.1$, giving fidelity distributions $P_{0.1}(F_l)$ for the fidelity random variables $F_l$ ($\mathrm{RIM} = 0.548$ for controller 1 and $\mathrm{RIM} = 0.074$ for controller 2). The fidelity distribution $P_{0.1}(F_{\delta_1})$ for a perfectly robust controller, with $\mathrm{RIM} = 0$, is also shown. The ECDFs are estimated using 500 bootstrap repetitions. The 0.95 confidence bounds on their error are obtained using the Dvoretzky-Kiefer-Wolfowitz inequality [64]. Closeness to the perfectly robust controller can be interpreted as having a smaller area under the curve and is indicated by the RIM values.

The state space naturally decomposes into non-interacting excitation subspaces as the Hamiltonian commutes with the total excitation operator. Here we consider the first excitation subspace, the smallest space that enables transfer of one bit of information between the nodes in the network. Higher excitation subspaces may be needed for other applications, but for information transfer it is desirable to limit the space to the smallest one sufficient to achieve the task. This is a much smaller space, growing only as $O(M^2)$ as opposed to $O(\exp(2M))$. The Hamiltonian of the first excitation subspace is
$[H_{XX}]_{l,m} = J \left( \mathbb{1}_{l,m+1} + \mathbb{1}_{l+1,m} \right) + \Delta_l \mathbb{1}_{l,m}$,
where $\mathbb{1}_{l,m}$ is the Kronecker delta. The static controls are local energy biases $\Delta_l$ on spin $|l\rangle$, entering as a diagonal matrix. $H_{XX}$ allows for transfer of single-bit excitations from an initial spin state $|a\rangle$ to a final state $|b\rangle$. We define the fidelity as $F = |\langle b| U(t_0, t_1, H_{XX}) |a\rangle|^2$ and the infidelity as $I = 1 - F$. The solution to Eq. (3) is a final time $t_{\mathrm{opt}}$ and a single vector of $M$ biases $\Delta_{\mathrm{opt}}$. The perturbations are given by
$S = \sum_{k} \gamma_k^J \left( |k\rangle\langle k{+}1| + |k{+}1\rangle\langle k| \right) + \sum_{c} \gamma_c^C\, |c\rangle\langle c|$,
where $\gamma_k^J$ and $\gamma_c^C$ are the strengths of the perturbations on the couplings and controls, respectively. We draw these strengths from the same normal distribution $N(0, \sigma^2)$ with mean 0 and variance $\sigma^2$. A numerical example illustrating the RIM via the empirical CDF (ECDF) for two controllers is shown in Fig. 2.
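The single-excitation dynamics above are cheap to simulate: $H_{XX}$ is a tridiagonal $M \times M$ matrix, and the transfer fidelity is one matrix element of the propagator. A minimal sketch assuming $\hbar = 1$ and 0-based basis indices (so $|1\rangle$ is index 0; function names are ours):

```python
import numpy as np
from scipy.linalg import expm

def h_xx(deltas, J=1.0):
    """Single-excitation-subspace Hamiltonian of a uniform XX chain
    with local bias controls Delta_l on the diagonal."""
    m = len(deltas)
    h = np.diag(np.asarray(deltas, dtype=float))
    for l in range(m - 1):            # nearest-neighbour couplings
        h[l, l + 1] = h[l + 1, l] = J
    return h

def transfer_fidelity(deltas, a, b, t, J=1.0):
    """F = |<b| exp(-i H t) |a>|^2 for basis states a, b (0-based)."""
    u = expm(-1j * h_xx(deltas, J) * t)
    return abs(u[b, a]) ** 2

# Uniform two-spin chain: complete transfer |1> -> |2> at t = pi/2.
f = transfer_fidelity([0.0, 0.0], a=0, b=1, t=np.pi / 2)
```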
Depending on the hardware platform, it is possible to consider specific practically motivated correlated noise models with correlated structured perturbations or power-law decaying electric-field noise (1/s), e.g., in trapped-atom platforms [11,65]. We have chosen to implement the simplest option of equal-strength random perturbations on all non-zero entries of the Hamiltonian, which is also relevant in practical settings [59-63].

B. Algorithms for Static Control Problems
The ARIM compares algorithm performance in finding robust controllers. Here, we describe the algorithms used to find the controllers for the static control problems. In selecting algorithms, we aimed to (a) investigate the performance of algorithms commonly used in the quantum control and other communities, (b) consider algorithms that do and do not require gradient information, and (c) consider reinforcement learning, more recently also used in quantum control.
L-BFGS is a common optimization algorithm used in quantum control as part of GRAPE [66] and performed well in finding high-fidelity energy landscape controllers [55]. It has not been designed for noisy optimization, but there exist smoothing modifications that attempt to address this [67-69]. For individual controller comparisons, we use standard L-BFGS with an ordinary differential equation model to compute the fidelity without perturbations during optimization. This serves as a baseline to understand the performance of optimizing noiseless objective functionals compared to the noisy optimization performed by all other selected algorithms, and its impact on the robustness of the controllers found. We explored stochastic gradient descent methods (e.g., ADAM [70]) and also tested a recently proposed noisy version of L-BFGS that modifies the line search and lengthening procedure during the gradient update step [69], and found that our training noise scales were too large and washed away gradient information, rendering these algorithms unsuitable for our study.
Reinforcement learning has been successfully used for tackling quantum control in challenging noisy environments, resulting in similar or better performance compared to standard control methods. Promising results include the stabilization of a particle via feedback in an unstable potential [71], optimizing circuit-QED two-qubit unitary operators under physical realization constraints [72], and optimizing multi-qubit control landscapes suffering from control leakage and stochastic model errors [73], among many others.
Proximal Policy Optimization (PPO) is a policy gradient method in the class of reinforcement learning algorithms [74]. It uses a discounted reward signal (e.g., the fidelity) accumulated over multiple interactions with the optimization landscape using learned function approximators: the policy function, which does the interacting by performing control operations, and the action-value function, which predicts the quality of each action undertaken by the policy in terms of future payoffs in the reward signal. Both are estimated using neural networks in a control-problem-agnostic fashion. This allows the incorporation of perturbations during training, which specifically has advantages in finding robust controls for energy landscape problems [75]. Here, we use a control problem formulation of PPO for Eq. (13) as described in Ref. [75].
Nelder-Mead is a popular simplex-based direct search algorithm. Essentially, it iteratively updates a polytope whose vertices are function evaluations, moving it toward an optimum. It has successfully been used in noisy experimental settings [76] due to its non-reliance on gradient information [77-79], especially when obtaining such information is resource-intensive.
Stable Noisy Optimization by Branch and Fit (SNOBFit) has been chosen as it has been designed to filter out large-scale noise in objective functionals [35]. It fits local models using objective function evaluations and implements a branching and splitting algorithm to partition the parameter space into smaller boxes with one function evaluation per box. The latter is a non-local search scheme that orders promising sub-boxes by the number of bisections required to get from the base box to that box; sub-boxes requiring fewer bisections are worth exploring more. Like PPO, it does not rely on explicit gradient information and builds models of the optimization landscape. Thus, both algorithms should be able to cope with large amounts of noise in the form of controller and model uncertainties, environmental effects and singularities during optimization. SNOBFit, however, differs importantly from PPO in the assumption that those models are linear. Moreover, its non-local exploration of the optimization landscape is not random and thus has comparatively much less variance in performance (which may or may not be poor).

IV. NUMERICAL EXPERIMENTS
To explore the robustness of controllers and corresponding control algorithms (see experimental motivation in Sec. I), we perform a Monte Carlo robustness analysis using the RIM on numerical solutions to the spin chain information transfer problem in Sec. III A for chains of length M = 4, 5, 6, 7, 8, 9 with J = 1.
We look at transitions from the start of the chain $|1\rangle$ to the end $|M\rangle$ and from $|1\rangle$ to the middle $|\lceil M/2 \rceil\rangle$. The former transition is physically easy to control while the latter is more challenging [55], as transitions to the middle exhibit anti-core behavior [80].
We collect the best 100 solutions, ranked by their fidelity, obtained by each of the control algorithms. Each algorithm has a budget of $10^6$ fidelity function evaluations. The budget correlates with the run time of each algorithm. It is imposed to allow for a fair comparison of the algorithms' robustness performance under similar resources, while being agnostic to specific implementations and speed differences.
We initialize $\Delta, t$ with quasi-Monte Carlo samples from a Latin hypercube [81-83] to increase the convergence rate and decrease clustering of controllers. This permits coverage of the parameter domain with error decreasing roughly as $O(1/N)$ as opposed to $O(1/\sqrt{N})$ for random sampling, where $N$ is the number of initial values. Our constraints are $0 \leq t_f \leq 70$ and $-10 \leq \Delta \leq 10$. We use 100 bootstrap samples to estimate fidelity distributions throughout. The perturbation strengths $\gamma_j^J$ and $\gamma_c^C$ are scaled by $J$ and $\Delta$ respectively, as per Eq. (14). Note that for $\sigma = 0$, $P_0(F) = \delta_F$ is deterministic.
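Latin hypercube initialization of this kind is available off the shelf. A minimal sketch using SciPy's quasi-Monte Carlo module, with the constraint box stated above ($-10 \leq \Delta \leq 10$, $0 \leq t_f \leq 70$); the sample counts are illustrative:

```python
import numpy as np
from scipy.stats import qmc

M = 5  # number of bias controls Delta_1 ... Delta_M
sampler = qmc.LatinHypercube(d=M + 1, seed=0)  # M biases + final time
unit = sampler.random(n=100)                    # points in [0, 1]^{M+1}

# Rescale to the constraint box: -10 <= Delta <= 10, 0 <= t_f <= 70.
lower = [-10.0] * M + [0.0]
upper = [10.0] * M + [70.0]
starts = qmc.scale(unit, lower, upper)  # 100 initial (Delta, t_f) points
```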
The perturbation strengths are drawn from a normal distribution with standard deviation $\sigma_{\mathrm{train}}$ determining the strength of the noise added for the optimization; $\sigma_{\mathrm{sim}}$ is the noise level used in the simulations to assess the robustness of the controllers found. Implementation details are in App. B 1. The optimization objective is the noiseless $F$ for Sec. IV A, a stochastic $F$ with unstructured perturbation $S_\sigma$ for Sec. IV B, and the RIM for the non-stochastic problem and a stochastic $F$ with $S_\sigma$ for Sec. IV C.

A. Characterization of All Controllers Found with Constrained Resources

Ranking Individual Controllers
In this section, we address our motivating question (A), whether high fidelity implies high robustness for an individual controller. We also numerically demonstrate the non-linear and non-uniform deterioration of robustness with increasing noise, which implies a trade-off between higher fidelity at no noise and robustness at higher noise levels.
To this end, we employ the control algorithms to optimize an objective functional without noise, i.e., setting $\sigma_{\mathrm{train}} = 0$ (see Sec. II D), under the general optimization conditions outlined at the start of Sec. IV. We rank these controllers by their infidelity values and then compute the RIM values for various levels of simulation noise, $\sigma_{\mathrm{sim}} = 0.01, 0.02, \ldots, 0.1$. The main result, which applies also to all transitions (not explicitly shown here), is that the high-fidelity controllers do not, in general, preserve their ranks as $\sigma_{\mathrm{sim}}$ increases. For example, for SNOBFit (see Fig. 3(b)), the RIM for controllers 6, 8, 9, 11-13 grows much more rapidly than for controllers 24-33, indicated by rapid color changes from dark (low RIM) to light (high RIM) in the vertical direction. Interestingly, almost all controllers found by PPO have very low RIM across $\sigma_{\mathrm{sim}}$ values compared to the other control algorithms (the color remains dark for longer). This is, however, not reflective of PPO's general behavior on the extended sample of problems we examined (see App. B 5). It could be limited fundamentally by the existence of robust controllers and/or the resource budget for a particular problem (see Fig. 10 in App. B 4, showing results for other transitions).
We further evaluate the best performing individual controller. To this end, we seek the controller that best preserves its overall RIM rank average across the noise levels, computed using the reshuffled RIM ranks of each controller for all values of σ_sim. Likewise, we take the controller with the median RIM rank average across the noise levels as the average-performing controller. Most of the RIM rank sum distributions studied were symmetric, with the median close to the mean, so the rank order consistency of the average controller can be understood in terms of how the median controller performs. We compare the RIM values of the median and the best controller in Fig. 3(e) for all algorithms, as a function of σ_sim.
For all algorithms, the best and the average controllers have similar infidelities (initial RIM values) in Fig. 3(e). Their behavior as a function of σ_sim differs, however, and is generally non-linear. Thus the best controllers, despite being distinguishable from the others at σ_sim = 0, become indistinguishable at higher σ_sim, pointing to a trade-off between infidelity (at zero noise) and robustness that could be leveraged when selecting a controller to deploy on a noisy system. Moreover, the RIM curve of the best controller among all algorithms (here found by L-BFGS) suggests a fundamental limitation on the RIM for this problem: it is likely not possible to obtain lower curves, although this remains theoretically unresolved.
To quantify this, we reduce the RIM rank consistency of the top-k controllers across two perturbation strengths σ_sim^(i) and σ_sim^(j) to a prediction problem by asking the following: (Q) How well does the RIM rank of a controller, when ordered at strength σ_sim^(i), predict the RIM rank of the controller at strength σ_sim^(j)?
To answer this question, let us denote the controller RIM rank order at σ_sim^(i) by the vector r_{σ_sim^(i)}, and compute an ordinal (binned/categorical) version of the Kendall-tau-B statistic τ [84, 85], a measure of statistical dependence between r_{σ_sim^(i)} and r_{σ_sim^(j)}. The ordinals are constructed only for r_{σ_sim^(i)}, by binning with a discrepancy parameter α that specifies the fraction of the maximum RIM value difference contained within a single bin. The binned rank order r̃_{σ_sim^(i)}(α) minimizes the effect of small movements in either rank due to noise. Then τ is computed as

τ(σ_sim^(i), σ_sim^(j)) = ( Σ_{l<m} s⁺_{lm} − Σ_{l<m} s⁻_{lm} ) / √( (K − t_i)(K − t_j) ),

where s^±_{lm} are the l, m-th sign products of the rank order differences at the two σ_sim values, with +/− denoting the positive/negative pair contributions; K = k(k − 1)/2 is the total number of pairs compared; and t_i, t_j are the total numbers of tied pairs in each rank order. For complete positive/negative rank order correlation τ = ±1, and τ = 0 for zero rank order correlation. For our hypothesis test, we assumed a worst-case p-value of 10^−4 as an acceptance criterion on the numerical results that follow, and also that the controllers generating these rank orders are independent of each other. This constraint is satisfied by the i.i.d. noise model for a given set of unique controllers corresponding to different points in a static optimization landscape. Independence over the choice of controllers is not necessary, as all consistency comparisons are for this fixed choice of controllers.
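A minimal sketch of this binned consistency statistic, using SciPy's tie-aware Kendall-tau-B; the binning rule below is a hypothetical reconstruction of the α-based discretization described above:

```python
import numpy as np
from scipy.stats import kendalltau

def binned_tau(rim_i, rim_j, alpha=0.05):
    """Kendall-tau-B between controller RIM rankings at two noise levels.
    RIM values at the first level are binned so that differences smaller
    than alpha * (max RIM spread) count as ties (a reconstruction of the
    discretization described in the text, not the authors' exact code)."""
    rim_i = np.asarray(rim_i, dtype=float)
    rim_j = np.asarray(rim_j, dtype=float)
    width = alpha * (rim_i.max() - rim_i.min())
    binned = (np.floor((rim_i - rim_i.min()) / width)
              if width > 0 else np.zeros_like(rim_i))
    # scipy.stats.kendalltau defaults to the tau-b variant, which
    # accounts for ties introduced by the binning.
    return kendalltau(binned, rim_j)

# Perfectly preserved ranking across the two noise levels -> tau = 1.
rims_low = np.arange(10) / 10.0
tau, pvalue = binned_tau(rims_low, rims_low + 0.05, alpha=0.05)
```

With a strictly monotone relation and small α, τ approaches 1; a fully reversed ranking gives τ near −1.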
For our earlier spin chain example (M = 5 spins, transfer from |1⟩ to |3⟩), we focus on τ for σ_sim pairs sufficient to answer (Q). More specifically, we aim to understand how well the no-noise RIM ranks (i.e., the average-infidelity ranks) correlate with the general RIM ranks. This is shown in Fig. 4 for each optimization algorithm at α = 0.05. For σ_sim^(i) ≥ 0.03, the RIM rank order is most consistent for PPO, with τ ≳ 0.6, excluding the other algorithms. However, there is larger shuffling of the ranks of PPO controllers as σ_sim increases, with τ deteriorating until SNOBFit takes over. This may be due to the small numerical differences in RIM observed (see Fig. 3(c)), so that a stronger consistency for σ_sim ≤ 0.03 is captured.

FIG. 4. Consistency statistic τ_{0,j}, i.e., the correlation of the infidelity rank order with the general RIM rank orders. The τ_{0,j} values decline slowest for PPO until σ_sim^(j) = 0.04, after which SNOBFit takes over compared to the rest. This shows, for this case, that the PPO infidelity rank order correlates the most with the RIM rank order for σ_sim ≤ 0.03.

We highlight in App. B 2 that the reason why PPO infidelities correlate more with RIM values at higher σ_sim is that PPO optimizes a discounted RIM ( Σ_i γ^i RIM^(i) for 0 ≤ γ ≤ 1) as its reward function. We extend this analysis to σ_train > 0 in App. B 3 to further corroborate that the infidelity rank order for PPO correlates most with higher order RIMs.
The other algorithms typically show a sharper drop at the σ_sim^(j) = 0 → 0.01 step, where the infidelity rank order for L-BFGS and, to a lesser extent, Nelder-Mead is completely non-informative (due to very high fidelity values without noise) and is not consistent with the orders at larger σ_sim. This is most likely because the controllers found are the result of second-order, gradient-based, or similarly precise search methods for locating optima. Since PPO and SNOBFit are gradient-free, their controllers are more consistent in comparison for σ_sim ≥ 0.03. In this case, the infidelity rank order is more informative of the RIM rank order than, e.g., for L-BFGS, as fidelities are not fully maximized in the absence of a strong gradient direction. Note that a viable link between the consistency statistic and a generic gradient-based algorithm is hard to establish, so this does not preclude the existence of algorithms that are τ-wise better. Finally, note that τ should be thought of as a proxy for the reliability of an algorithm's capability to generate numerical control solutions whose infidelity values are consistent with, and predictive of, their RIM values at higher σ_sim. If a strong correlation is obtained, the latter's computation can be circumvented, or at least circumvented with increased confidence.
However, high RIM rank order consistency does not imply that the RIM values remain low at higher noise. Rather, it indicates how predictive the RIM of a controller is of the controller's relative robustness performance at a higher noise level. The non-parametric nature of τ removes information about the fidelity value range, so it should be viewed in conjunction with Fig. 3(a)-(e). If the correlation signal is strong, it could be used to sidestep the evaluation of the RIM at non-zero noise in favor of the infidelity, eliminating the need for expensive sampling.

B. Comparison of Control Algorithms with Constrained Resources
We address our motivating question (C): what is the effect of training noise on a control algorithm's ability to find robust controllers? The overall picture is complex in terms of algorithm rankings. We numerically confirm that there is a problem-dependent optimal noise level that best smooths the optimization landscape, allowing algorithms to find robust controllers more consistently.
We collect 100 controllers at training perturbations S_σ_train with training noise level σ_train ∈ {0, 0.01, ..., 0.05} for PPO, SNOBFit and Nelder-Mead. We do not consider any training noise for L-BFGS, since only the former algorithms are designed to perform optimization with noisy perturbations. This involves using a stochastic fidelity (objective) function call evaluated under the single structured perturbation S_σ_train (exactly analogous to S_σ_sim).
We select σ_sim ∈ {0, 0.01, ..., 0.1} to evaluate the RIM of the controllers found at different noise levels, with a budget of 10^6 objective function calls per run. Each run corresponds to 100 controllers found under this budget constraint. The ARIM is then used to quantify an algorithm's performance, w.r.t. robustness and fidelity, based on the 100 controllers found during its run.
We only show the representative end-to-middle and end-to-end transitions for the state-preparation problem with M = 5 at σ_train = 0, 0.02, 0.05 in Fig. 5. Results for other spin transitions and training noises are presented in App. B 5.
Recall from Eq. (11) that the ARIM is a measure of how far the distribution P(RIM) is from its ideal δ_0. The ARIM curves at different training noises in Fig. 5(a)-(f) increase at different rates with σ_sim, starting from similar base ARIM values at σ_sim = 0 for each algorithm. Note that, in the absence of training noise, the base ARIM value coincides with the average infidelity over the controllers.
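As a concrete sketch, assuming the same first-order Wasserstein construction as the RIM (so that the distance of P(RIM) from δ_0 reduces to a sample mean over an algorithm's controllers):

```python
import numpy as np

def arim(rim_values):
    """ARIM estimate over an algorithm's controllers.  Because RIM values
    are non-negative, the 1-Wasserstein distance of their empirical
    distribution P(RIM) from the ideal point mass at 0 reduces to the
    sample mean (a minimal sketch of the construction in Eq. (11))."""
    return float(np.mean(rim_values))

# Hypothetical RIM table: rows = 100 controllers, columns = noise levels
# sigma_sim = 0, 0.01, ..., 0.1; an ARIM curve is one value per column.
rng = np.random.default_rng(2)
rim_table = np.sort(rng.uniform(0.0, 0.2, size=(100, 11)), axis=1)
arim_curve = [arim(rim_table[:, j]) for j in range(rim_table.shape[1])]
# Sorting each row makes every controller's RIM non-decreasing in noise,
# so the resulting ARIM curve is non-decreasing as well.
assert all(a <= b for a, b in zip(arim_curve, arim_curve[1:]))
```

The base ARIM value is then `arim_curve[0]`, the average infidelity over controllers at σ_sim = 0.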
A spread in the ARIM curves indicates that the probabilistic distance of the RIM values from the ideal increases at different rates across controllers. The algorithm represented by the slowest-growing curve is therefore the best at finding robust controllers.
Overall, SNOBFit's and Nelder-Mead's ARIM curves at the various training noises perform similarly to L-BFGS across all problems. There are, however, distinctions in the region σ_sim ≤ 0.05, where the L-BFGS curves start at lower ARIM values and grow more quickly than the SNOBFit curves at the various noise levels. In the region σ_sim ≥ 0.05, the SNOBFit curves grow comparatively more slowly, possibly because the fidelity has already degraded so much that further deterioration is less likely across all 100 controllers. The Nelder-Mead curves behave similarly to the SNOBFit curves in that there is little variance across the σ_train levels, both when the overall performance is good and when it is poor.
Compared to the other algorithms, there is more variance across training noises in the PPO ARIM curves for a particular spin transfer problem, with some curves overlapping. The best performing ARIM curve is PPO at σ_train = 0.05 for the end-to-end transition shown in Fig. 5(f) (and for 6 of the 8 cases in App. B 5). This indicates that PPO is often capable of finding robust solutions, but the optimal value of the training noise varies across transition problems.
We also present an extended RIM analysis (as in Sec. IV A) of the controllers found for the same transition problem at the various training noises for the derivative-free approaches. The RIMs at σ_sim ∈ {0, ..., 0.1} are plotted in Fig. 5(g)-(i) for PPO, SNOBFit and Nelder-Mead at σ_train = 0.03. On an individual level, SNOBFit and Nelder-Mead controllers share more algorithmic robustness and fidelity characteristics with each other across σ_train than with the PPO controllers: they have high RIM variance within each per-σ_train distribution. This performance is also comparable to the L-BFGS controllers shown in Fig. 3(d). Individually, on the other hand, the controllers found by PPO differ significantly across σ_train; notably, the RIM and ARIM values stay uniformly very low for σ_train = 0, 0.03, and the controllers are generally distinctly more robust than the SNOBFit and Nelder-Mead controllers.
Finally, we suggest possible explanations for these behavioral differences between algorithms. Since SNOBFit constructs local quadratic models to estimate gradients, it effectively filters out the perturbations S_σ_train. This manifests in the controllers at one training noise reacting similarly, w.r.t. the RIM, to controllers at other training noises (including the case of no training noise), as well as to controllers found by L-BFGS. Nelder-Mead has fewer noise-adaptation mechanisms than PPO and SNOBFit for large noise perturbations, which can affect the quality of the estimated gradient direction; hence the growth of the ARIM w.r.t. simulation noise at higher training noise levels is unavoidable.
In contrast, PPO does not filter out the perturbations under S_σ_train and forms its policy gradient estimates from stochastic fidelity function evaluations, which likely differentiates it from SNOBFit. PPO also effectively estimates the fidelity landscape non-linearly using a fixed two-layer (100 × 100 dimensional) neural network, which may lead to generally better ARIM performance.

C. Comparison of Control Algorithms with Unconstrained Resources
We consider the behavior of the aforementioned control algorithms with an unconstrained number of objective function calls to address our motivating question (D): we seek to understand an algorithm's ability to find robust controllers via the ARIM, without the function call constraint. Furthermore, we wish to ascertain the effect of the training noise level σ_train on ARIM optimization.
We consider two objective function settings: (i) a stochastic objective, where each evaluation draws a new Hamiltonian according to the noise model, corresponding to one S_σ_train perturbation in F per evaluation; (ii) a non-stochastic objective, where the evaluation is over k perturbed but fixed Hamiltonians, pre-drawn from the noise model, so that the optimization objective is a deterministic RIM computed from the k fixed training perturbations. In this case, each optimization objective call is counted as k function calls, since it amounts to k different fidelity function evaluations. Furthermore, since we cannot compute the analytical gradient of either objective function, in order to use L-BFGS [32] in both settings we use a version of L-BFGS that approximates the gradients using forward differences.
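One way to realize the non-stochastic setting (ii) with a forward-difference L-BFGS is sketched below, using SciPy's L-BFGS-B with '2-point' finite-difference gradients. The fidelity model and perturbation draws here are toy stand-ins, not the spin-chain dynamics or the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def deterministic_rim_objective(controls, fixed_perturbations, fidelity_fn):
    """Non-stochastic objective (ii): a deterministic RIM_1 (average
    infidelity) over k pre-drawn, fixed perturbations."""
    return float(np.mean([1.0 - fidelity_fn(controls, s)
                          for s in fixed_perturbations]))

def toy_fidelity(controls, s):
    # Toy stand-in for the perturbed spin-chain fidelity.
    return np.exp(-np.sum((controls - s) ** 2))

rng = np.random.default_rng(1)
perturbations = rng.normal(0.0, 0.05, size=(100, 4))  # k = 100 fixed draws
x0 = np.zeros(4)
result = minimize(deterministic_rim_objective, x0,
                  args=(perturbations, toy_fidelity),
                  method="L-BFGS-B", jac="2-point")  # forward differences
```

Each call to `deterministic_rim_objective` costs k fidelity evaluations, which is why the function-call accounting multiplies by k.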
We fix the control problem to be the end-to-middle M = 5 transition and consider the change in the average ARIM over σ_sim ∈ {0, 0.01, ..., 0.1} for the top 100 controllers as a function of function calls. Three training noises σ_train = 0, 0.05, 0.1 are considered. The controller rankings are maintained w.r.t. the objective function and updated in steps of 10^6 function calls up to 4 × 10^7. For the stochastic setting (i), we maintain the controller ranking via the stochastic fidelity function evaluation. For the non-stochastic setting (ii), we maintain the ranking through the deterministic RIM obtained using k = 100 pre-drawn training perturbations {S^(i)_σ_train}_{i=1}^{100}. The choice of the hyperparameter k was obtained using cross-validation: for a particular training noise and control algorithm, we picked a k from {10, 100, 10000} to compute the RIM in the objective function, and compared it to a RIM computed using k = 10000 different training perturbations {S^(i)_σ_train}_{i=1}^{10000}, which form a large validation set, during the optimization run for all 100 controllers. We found no significant empirical difference in error between the objective function RIMs for k = 100 and k = 10000. Note that the variance of the RIM estimate decreases as O(1/k).
For the non-stochastic setting (ii), Fig. 6(a)-(c) (solid lines) shows that the average ARIM of all algorithms decreases asymptotically with the number of function calls. PPO attains the lowest final average ARIM values at 4 × 10^7 function calls for each σ_train, but its final ARIM is not markedly better than that of the other algorithms considered here. This setting is, however, quite expensive in terms of the total number of function calls.
For the stochastic setting (i) (dashed lines in Fig. 6(a)-(c)), increasing the training noise reduces the ability of the control algorithms to find robust controllers for all σ_train, and the average ARIM is not minimized to the same extent as in setting (ii). This makes sense, since the stochastic objective (i) is a noisy fidelity with a reduced focus on robustness. The average ARIM is no longer reliably improved by any control algorithm in setting (i).

FIG. 6. Asymptotic ARIM performance when the number of objective function calls is unconstrained, for the M = 5 end-to-middle spin transfer problem. The ARIMs are averaged over the σ_sim set {0, 0.01, ..., 0.1}. The stochastic objective setting (i) is shown with dashed lines and the deterministic RIM objective setting (ii) with solid lines; different algorithms correspond to different colors. An L-BFGS no-noise benchmark is shown with a dotted line. The target RIM is computed using 100 non-stochastic fidelity evaluations. Plots (a)-(c) correspond to training noise σ_train = 0, 0.05, 0.1, where the curves are for 100 controllers ranked by the corresponding objective function evaluation (i/ii) and are updated every 10^6 function calls. For setting (ii), all control algorithms asymptotically reduce the average ARIM, but this is not cost competitive with the stochastic setting (i), where PPO reaches its local minimum for all noise levels with fewer function calls. The training noise level can help the landscape exploration process; this positively affects PPO in (a), (c) and Nelder-Mead in (b). For setting (i), L-BFGS, then SNOBFit, then Nelder-Mead, then PPO is the most prone to performance deterioration w.r.t. σ_train, owing to their differing reliance on (estimated) gradient information. Shading indicates 95% confidence intervals, determined using bootstrap resampling.
Within setting (i), we note that PPO converges to the lowest average ARIM value of all the algorithms. This is theoretically justifiable, as PPO optimizes a discounted RIM by design (see App. B 2). Note also that the lowest average ARIM values obtained by PPO in all three settings are similar, even though they are not attained at the same number of function calls. This suggests that PPO's ARIM performance can be made independent of the training noise level, given an unconstrained number of objective function calls. However, this might be difficult to achieve completely since, even for PPO, there is a selection bias for low infidelity but not for low RIM; this manifests in the average ARIM slowly increasing with the number of function calls after the lowest value is reached. Furthermore, sharp transitions like those of the stochastic PPO curves in Fig. 6(b) are also typically reported in classical reinforcement learning contexts and are linked to sharp improvements in the reward obtained by the algorithm [87].
To compare these results against the more standard noiseless fidelity maximization benchmark, we also plot the average ARIM for L-BFGS with a noiseless fidelity objective and analytical gradient information. This version of L-BFGS accumulates sharp peaks in the fidelity landscape with more function calls, since it is gradient-based and effectively climbs to the sharpest peak. Hence its average ARIM quickly flatlines at a higher value than the other control algorithms in setting (i), with the exception of the forward-differencing L-BFGS.
Contrasting settings (i) and (ii) for a single control algorithm, the point at which non-stochastic optimization via setting (ii) gains an advantage is around 10^7 function calls for all algorithms, excluding L-BFGS with noise. Below 10^7 function calls, setting (i) has a clear advantage over (ii) for PPO and SNOBFit.

V. CONCLUSION
We have presented the robustness-infidelity measure (RIM_p), a statistical generalization of the infidelity in the robustness sense, defined w.r.t. perturbations S_σ of arbitrary noise level in the fidelity function. The RIM_p is the p-th order Wasserstein distance of the infidelity distribution induced by S_σ from an ideal distribution that is impervious to S_σ. We show that it is the p-th root of the p-th raw moment of the infidelity distribution and can be evaluated using perturbed fidelity function evaluations in physical experiments or Monte Carlo simulations. For p = 1, the measure RIM_1 reduces to the average infidelity. Using a metrization argument, we justify why RIM_1 is a practical robustness measure for quantum control problems, owing to the convergence of the RIM_p values for all p given highly robust, high fidelity controllers. The RIM_1 also has a convenient interpretation as the area under the curve of the cumulative distribution of the infidelity. We leave further analyses of the utility of the RIM_p, e.g., for the optimization of robustness, to future work.
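For reference, the stated identity between the Wasserstein distance and the raw moment can be written out, using the infidelity I = 1 − F ≥ 0 and the fact that the only coupling between a distribution and a point mass δ_0 is the product coupling:

```latex
\mathrm{RIM}_p
\;=\; W_p\!\left(P_I,\,\delta_0\right)
\;=\; \left( \int_0^1 |I - 0|^p \,\mathrm{d}P_I(I) \right)^{1/p}
\;=\; \left( \mathbb{E}\!\left[I^p\right] \right)^{1/p}.
```

Setting p = 1 recovers the average infidelity, consistent with the reduction stated above.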
The RIM is further generalized to define an algorithmic RIM (ARIM) to compare the performance of control algorithms in terms of their ability to find robust, high-fidelity controllers. Even though the RIM and ARIM are illustrated here for static controls, they can be computed in any situation that generates a fidelity distribution over [0, 1], including time-dependent controls and open quantum systems, enabling their use and further study for a wide range of practical quantum control problems.
We have used the RIM under model and controller noise to quantify the performance, in terms of robustness and fidelity, of individual controllers for excitation transfer in spin chains by energy landscape shaping. The controllers were obtained by four control algorithms (PPO, SNOBFit, Nelder-Mead, L-BFGS) at simulation noise scales of up to 10%. Using the RIM, we found that high-fidelity controllers can vary widely in their robustness to noise across all algorithms studied, although there are notable differences in algorithmic efficacy w.r.t. robustness, as indicated by the ARIM. We also demonstrated a consistency statistic that can be used to differentiate control algorithms by how correlated their controller infidelities are with the RIM_1. This provides a method to predict robustness via the RIM_1 without its explicit evaluation.
To compare the control algorithms, we studied their ARIM performance for multiple spin transfer problems. Under constrained function calls with a stochastic objective function (noisy infidelity), PPO performed better than SNOBFit and Nelder-Mead at certain, problem-specific training noise levels. SNOBFit's performance at different training noise levels was similar, regardless of whether it was good or bad, suggesting that it filters out the noise. Nelder-Mead exhibited similarly consistent behavior across training noise levels, with less than optimal performance for all but one problem. With unconstrained stochastic function calls, PPO showed excellent performance compared to the other algorithms, independent of the training noise level, since its reward accumulation strategy implicitly optimizes a discounted RIM.
In contrast, when optimizing the RIM_1 (average infidelity) over a fixed ensemble of perturbations, all algorithms were capable of asymptotically finding an optimum. However, this approach is expensive in terms of the number of function calls compared to the aforementioned stochastic optimization setting with a noisy fidelity objective. Our results also show that for stochastic settings, e.g., under shot noise, PPO (or reinforcement learning more generally) is a promising approach for obtaining robust controllers.
A limitation of this work is that it requires computing multiple controllers per control problem. In simulation, this further involves numerous time-consuming matrix exponential evaluations to generate a large number of samples per controller to approximate the RIM. More work is needed to elucidate the fundamental limitations of the optimization landscape. Nevertheless, our statistical robustness approach is a useful tool that can be applied in a wide range of quantum control scenarios where analytic approximations with small and/or uncorrelated noise are unsuitable.

Error Bound on the RIM_p and ARIM_p Estimators
Here we propose a probably approximately correct (PAC) error bound for the empirical estimate of RIM_p in Eq. (A5), based on an empirical estimate P̂(F) of its generating probability distribution P(F). With probability at least 1 − δ/2, the estimation error satisfies a bound of the form |RIM_p(P̂) − RIM_p(P)| ≤ C; in the derivation, the second and fourth lines follow from the reverse triangle inequality, and in the fifth line we rewrite the true distribution P(F) as E_{P̂∼D}[P̂(F)], which holds for any unbiased empirical estimator. We use McDiarmid's inequality to obtain the bounding constant C, using the fact that the probability distribution D generates a family of empirical distributional estimators P̂_j = (1/n) Σ_{i=1}^n δ_{f_i}, where n is the number of samples and the bounded differences occur only in the k-th coordinate. A similar bound can be derived for the ARIM estimator. This error bound is similar to the DKW (Dvoretzky-Kiefer-Wolfowitz) bound for the ECDF and would suffice to generate the 95% confidence intervals for Fig. 5 without the need for bootstrap resampling.
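For comparison, the DKW inequality (with Massart's tight constant), to which the text relates this bound, reads, for the ECDF F̂_n of n i.i.d. samples from F:

```latex
\Pr\!\left( \sup_{x} \left| \hat F_n(x) - F(x) \right| > \varepsilon \right)
\;\le\; 2\, e^{-2 n \varepsilon^2},
\qquad\text{so with probability at least } 1-\delta,\quad
\sup_x \left| \hat F_n(x) - F(x) \right| \;\le\; \sqrt{\tfrac{1}{2n}\ln\tfrac{2}{\delta}} .
```

This is a standard result stated for reference; the PAC bound above plays the analogous role for the RIM_p estimator.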

Relative Order of RIM_p
Using Lyapunov's inequality, which states that (E[I^p])^{1/p} ≤ (E[I^q])^{1/q} for any q ≥ p ≥ s > 0, it follows that RIM_q ≥ RIM_p ≥ RIM_s. The converse holds without the p-th roots: since the infidelity I lies in [0, 1], E[I^q] ≤ E[I^p]. We can also derive a lower bound on RIM_p: for some p′ ≥ p, a relation follows by applying Jensen's inequality, with the final step obtained by taking the minimum infidelity sample. Note that this result still depends on the data. Higher orders p and p′ of the RIM are related to each other in a concave sense, and as p, p′ → ∞ the RIMs become more equivalent. Conversely, near perfect fidelity all the RIMs converge to 0, but the presence of an outlier fidelity sample strongly governs how much discrepancy remains between a higher-order and a lower-order RIM; this discrepancy is still concavely dependent on p and p′. We arrive at equivalence relations for RIMs of different order by noting that there is a smallest positive finite measure m > 0 on the domain set on which the probability distribution P(f) is defined; this follows from the continuity of f and of P(f). If f already has an ideal distribution, this is trivially true. In practical settings, e.g., when using the ECDF, m ≥ 1/n. This implies that RIMs of different orders are equivalent (in the convergence sense) when the bound of Eq. (A12) holds. Moreover, motivated by convergence in the Wasserstein distance, for fixed n, RIM_1 effectively constrains any RIM_p with p > 1, since growth in p is sublinear. This implies that the RIMs converge as they tend to 0, as seen in Fig. 7. We also note that the higher-order RIMs greatly increase the measure's sensitivity to outliers, even though the growth of the RIM is sublinear in p. For most practical purposes, the first-order RIM should be sufficient for performance measurements, especially in the quantum technologies setting.

Details about the optimization objectives for the numerical results in Sec. IV are given in Table I. In every section, for every σ_sim, the RIM is evaluated using N = 100 Monte Carlo S_σ_sim perturbations to the fidelity function. The RIM itself is only optimized in Sec. IV C for the non-stochastic case (ii), where 100 S_σ_sim are sampled at the start and reused for every function call. Note, however, that we count these as 100 function calls, as they amount to 100 fidelity function evaluations. Also, for better performance in Sec. IV C for the stochastic case (i), instead of using the analytical form of the gradient of the fidelity F, we use finite differences to approximate the gradients ∇_∆ F (where ∆ denotes the controls).

PPO Optimizes a Discounted RIM_1
We follow the standard finite-horizon Markov Decision Process (MDP) formulation of the reinforcement learning setting, with states, actions and one-step transition rewards (s_t, a_t, r_t) sampled in trajectories τ = {(s_t, a_t, r_t) : t = 1, ..., T} stored in a buffer D. The proximal policy optimization (PPO) algorithm uses a clipped objective to update the policy π_θ parameters θ with first-order constraints that limit the policy's distributional divergence. The policy objective is

L^clip(θ) = E_τ [ min( ρ_t(θ) A^{π_θ_k}(s_t, a_t), clip(ρ_t(θ), 1 − ε, 1 + ε) A^{π_θ_k}(s_t, a_t) ) ],  ρ_t(θ) = π_θ(a_t|s_t) / π_θ_k(a_t|s_t),    (B1)

where π(·) is the policy probability distribution and the clip function truncates the ratio-weighted advantages to lie between (1 ± ε)A^{π_θ_k}. The advantage estimates are computed with the value function V_φ(s_t) = E_π[ Σ_{i=0}^{T−1} γ^i r_{t+i+1} | s = s_t ], where φ are the value function parameters. The value function is regressed onto discounted rewards sampled according to π(·); its optimization objective is the squared regression error (B3), which is minimized while the algorithm maximizes L^clip. In the case of flat rewards and λ = γ = 1, the advantage estimates reduce to the sampled returns minus the value prediction, and the value function can be written, as an expectation under the policy, as an average reward. The optimal value function is defined by V*(s_t) = max_π V_φ(s_t), which is attained when the policy is optimal, i.e., π_θ = π_θ* at θ = θ*. Near optimality, the advantages are approximately 0, as no advantage should be conferred over the optimal policy π_θ*, which has an optimal value function. Thus V_φ(s_t) → V_{φ*}(s_t) as A^{π_θ*} → 0, and the sampled rewards minus the rewards predicted by the value function go to 0 in Eq. (B1). The same argument applies with discounts γ, λ < 1, and hence the algorithm optimizes a discounted RIM_1 estimator as its value function. Most reinforcement learning algorithms effectively optimize the average or cumulative reward Ĵ ∝ Σ_i r_i due to the one-step heuristic application of the Bellman principle of optimality [88].
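The discounted RIM that the argument above attributes to PPO's value target can be written as a one-line reward computation (a sketch; the sign is flipped because PPO maximizes reward while the RIM is to be minimized):

```python
import numpy as np

def discounted_rim_reward(rim_steps, gamma=0.99):
    """Discounted sum  sum_i gamma^i * RIM^(i)  that, per the argument
    above, PPO's value target reduces to; negated because PPO maximizes
    reward while the RIM is to be minimized."""
    rim_steps = np.asarray(rim_steps, dtype=float)
    discounts = gamma ** np.arange(len(rim_steps))
    return float(-np.sum(discounts * rim_steps))

# With gamma = 1 this is minus the cumulative RIM_1 over the episode.
assert abs(discounted_rim_reward([0.1, 0.1], gamma=1.0) + 0.2) < 1e-12
```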

More Consistency Statistic Plots
This section expands on the discussion of the consistency statistic in Sec. IV A 2 of the main text. We plot the consistency statistic τ_{0,j} for all algorithms at α = 0.05 for the case M = 5, for the transition |1⟩ to |3⟩ in Fig. 8(a)-(f) ((a) is Fig. 4) and |1⟩ to |4⟩ in Fig. 9(a)-(f), for multiple training noise levels. Note that in each subplot the L-BFGS curve is always the same, at σ_train = 0. The controllers found by PPO at σ_train = 0.05 are less consistent at some noise levels, e.g., σ_sim ≥ 0.04, than the controllers found at σ_train = 0.04; the same holds for SNOBFit and Nelder-Mead. Moreover, the decline in the correlation values is smoothest for PPO in nearly all twelve instances shown in both figures. With more training noise, Nelder-Mead is sometimes closer in consistency to L-BFGS, e.g., in Fig. 8(a, b), but it produces more consistent controllers with increasing training noise, likely due to the diminishing returns of the gradient direction, which makes its behavior more like that of SNOBFit and PPO. For most PPO runs, the consistency statistic is highest for σ_sim ≤ 0.04, and thus the infidelity rank order is a good predictor of the RIM rank order at higher σ_sim, which was not observed for any of the other algorithms. Note also that this analysis does not reveal how high the RIM values of the controllers are (a drawback of the non-parametric test); these plots should be read as companions to the figures where the explicit values are shown.

More Individual Controller Plots
The results presented in Fig. 3 (M = 5, transition |1⟩ to |3⟩) are not reflective of PPO's general behavior on the extended sample of problems examined in Sec. IV B. Fig. 10 shows the case M = 5 with transition |1⟩ to |4⟩, where none of the controllers found are very robust. This is likely due either to unlucky sampling of the space of possible controllers or to their non-existence. Note that SNOBFit and PPO are similar in their RIM degradation, as observed in Fig. 10(e). We also provide further cases (M = 5, transition |1⟩ to |5⟩; M = 6, transitions |1⟩ to |4⟩ and |1⟩ to |6⟩) for the comparison of controllers under noisy training in Sec. IV B, to highlight the variation in controller quality across noise regimes and spin chain transitions observed in the main ARIM comparison of Fig. 16. Each individual subplot is the result of an independent run of each algorithm with a stochastic fidelity function evaluated under the unstructured perturbations, using the same approach as earlier with σ_sim. These are also plotted for a more distributional comparison as pairwise box plots in Fig. 15. For both Figs. 11 and 15 we also show L-BFGS results for comparison.

Full ARIM Comparisons
For the cases M = 6, 7, both types of transitions appear to be challenging for PPO, SNOBFit, and Nelder-Mead at most, if not all, training perturbation strengths, especially the end-to-end M = 7 transition (Fig. 16(d)), where PPO at σtrain = 0.05 is only marginally better than the other algorithm runs, excluding Nelder-Mead. A pertinent question is whether this genuinely reflects the landscape or whether, for PPO, our budget constraint of 10^6 target functional calls is insufficient for larger system sizes, as the control problem depends exponentially on the number of control degrees of freedom. The former hypothesis might hint at a fundamental limitation on the robustness achievable in this particular control landscape. The fact that most noisy Nelder-Mead curves for these problems cluster together suggests that noise could also help in reaching robust areas of the control landscape faster by regularizing or smoothing the landscape to an appropriate degree. We investigate asymptotic algorithm behavior with respect to the training noise in Sec. IV C to illustrate this and show that PPO performance converges for all noise levels given sufficiently many function calls. For all problems, PPO has higher variance with respect to σtrain than SNOBFit and Nelder-Mead. The latter pair's performance curves are more in line with the L-BFGS curve for σsim ≥ 0.05 and mostly worse for σsim ≤ 0.05. For most problems, the best performing (lowest) curve is PPO at σtrain = 0.05 (brown), except in (a), where it is PPO at σtrain = 0.02, and in (g), where it is Nelder-Mead at σtrain ≥ 0.04. 95% confidence intervals (shading) are computed using non-parametric bootstrap resampling [86] with 100 resamples.
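The shaded intervals in these figures come from a percentile bootstrap with 100 resamples. A minimal sketch of that procedure, assuming a sample of per-controller RIM values (the `bootstrap_ci` helper and the data are illustrative, not the paper's implementation):

```python
import numpy as np

def bootstrap_ci(values, stat=np.mean, n_resamples=100, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    Matches the paper's setup in spirit (100 resamples, 95% interval);
    the exact statistic resampled there may differ.
    """
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and recompute the statistic each time.
    reps = [stat(rng.choice(values, size=len(values), replace=True))
            for _ in range(n_resamples)]
    lo, hi = np.quantile(reps, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

# Synthetic RIM values for 100 controllers at one sigma_sim level:
rim_values = np.random.default_rng(1).uniform(0.01, 0.1, size=100)
lo, hi = bootstrap_ci(rim_values)
```

The non-parametric resampling avoids assuming any particular shape for the RIM distribution, which is useful since the distributions here are often heavy-tailed.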

FIG. 3. (a)-(d) 100 controllers found for the XX spin chain model, Eq. (13), using Nelder-Mead, SNOBFit, PPO, and L-BFGS for M = 5 and the spin transition from |1⟩ to |3⟩ with σtrain = 0. The controllers are ranked in increasing order of infidelity at σsim = 0 from left to right. Each column represents a single controller's RIM at σsim = 0, 0.01, 0.02, . . ., 0.1 from bottom to top on a log scale. Even if the infidelity or RIM at σsim = 0 is close to 0, some controllers' RIM values degrade faster than others and are hence less robust despite starting at very low infidelities. (e) RIM as a function of σsim for the average and best controller (i.e., the darkest over all σsim levels) out of the 100 shown in (a)-(d), in terms of how well they preserve their average RIM rank across all σsim. Each algorithm is indicated by a marker shape, and the solid and dash-dotted lines denote the best and average controller, respectively. All the best controllers have very high initial fidelities and are very similar across the different control algorithms, with Nelder-Mead being only moderately worse.

FIG. 4. RIM rank order consistency statistic τ for the 100 controllers found for the problem M = 5, |1⟩ to |3⟩, between the two levels: no simulation noise, σsim^(i) = 0, and σsim^(j) from {0.0, 0.01, . . ., 0.1}, for (a) Nelder-Mead, (b) SNOBFit, (c) PPO, and (d) L-BFGS without training noise. In other words, this is the correlation of the infidelity rank order with the general RIM ranks. The τ0,j values decline the slowest for PPO until σ

FIG. 5. ARIM as a function of σsim for M = 5, where (a)-(c) are the end-to-middle (|1⟩ to |3⟩) and (d)-(f) the end-to-end (|1⟩ to |5⟩) transitions (end denoted by O). The ARIM is computed from a distribution of RIM values for 100 controllers for each σsim for SNOBFit, Nelder-Mead, PPO, and L-BFGS. We identify each algorithm with a unique marker and/or color. Both PPO and SNOBFit are run multiple times at σtrain = 0, 0.02, 0.05. PPO has higher variance w.r.t. σtrain than SNOBFit and Nelder-Mead, whose performance curves are more in line with the L-BFGS curve for σsim ≥ 0.05 and mostly worse for σsim ≤ 0.05. 95% confidence intervals (shading) are computed using non-parametric bootstrap resampling [86] with 100 resamples. (g)-(i) show individual-controller comparisons for σtrain = 0.03, ranked by fidelity (leftmost is highest).
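The caption states that the ARIM is computed from the distribution of per-controller RIM values at each σsim. One plausible reading, mirroring the "p-th root of the p-th raw moment" construction of RIM_p itself, is sketched below; the function name `arim_p` is ours and the paper's exact ARIM definition may differ in normalization, so treat this as an illustration only.

```python
import numpy as np

def arim_p(rim_values, p=1):
    """ARIM_p sketch: the moment construction of RIM_p applied to the
    distribution of per-controller RIM values from one algorithm run.

    For p = 1 this is simply the mean RIM over the controllers, i.e.,
    the expected robustness-infidelity of a controller found by the
    algorithm.
    """
    r = np.asarray(rim_values, dtype=float)
    return float(np.mean(r ** p) ** (1.0 / p))

# Illustrative per-controller RIM values at one sigma_sim level:
rims = np.random.default_rng(2).uniform(0.01, 0.08, size=100)
arim1 = arim_p(rims, p=1)
arim2 = arim_p(rims, p=2)  # weights poorly performing controllers more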

Appendix B: Optimization Algorithms

1. Implementation Details of the Optimization Objectives

FIG. 15. Box plots of the RIM for the 100 controllers for M = 5, |1⟩ to |3⟩ shown in Fig. 11 in the main text, found by Nelder-Mead, SNOBFit, and PPO for various σtrain (a)-(f). For the case σtrain = 0 in (a), we also show L-BFGS box plots as a reference. On the distributional level, the PPO controllers are generally the most robust of the three w.r.t. the RIM, but there is high variance across σtrain compared to the SNOBFit and Nelder-Mead controllers. The median SNOBFit RIM value per σsim is higher than that of L-BFGS, so it has a longer left tail. The Nelder-Mead controllers have the most weight on their right tails and are comparatively the worst.

TABLE I. Implementation details for various optimization settings in the paper. For Sec. IV C, the asymptotic setting, (i) refers to the stochastic scenario and (ii) refers to the non-stochastic scenario where the RIM is optimized using the same fixed set of 100 perturbations {Sσtrain} per function call. Columns: Sec. | Obj. Function (OF) and args. | Train (OF) noise | Algorithm | Total OF Calls | Single Call Cost.