Optimal quantum control with poor statistics

Learning how to control a quantum system based on experimental data can help us to exceed the limitations imposed by theoretical modelling. Due to the intrinsic probabilistic nature of quantum mechanics, it is fundamentally necessary to repeat measurements on individual quantum systems many times in order to estimate the expectation value of an observable with good accuracy. Control algorithms requiring accurate data can thus imply an experimental effort that negates the benefits of avoiding theoretical modelling. We present a control algorithm based on Bayesian optimisation that finds optimal control solutions in the presence of large measurement shot noise and even in the limit of single-shot measurements. With the explicit example of the preparation of a GHZ state, we demonstrate in numerical simulations that this method is capable of finding excellent control solutions in terms of minimal experimental effort.


I. INTRODUCTION
Quantum control allows us to perform highly accurate experiments and to turn synthetic quantum systems into devices ranging from the most precise available clocks to qubit registers on which several elementary quantum gates can be executed. Optimization of the typically time-dependent control fields is traditionally performed based on a theoretical model. Given the increasingly complex quantum systems that we are gaining control over, it often becomes infeasible to find a sufficiently accurate model, let alone to simulate its dynamics. The design of control fields based on experimental observations, but without resort to theoretical modeling [1], has thus become more and more important.
While it is generally important that control algorithms converge quickly to a good solution, control based on experimental data additionally requires that such solutions can be found from a limited amount of data of limited accuracy. This is particularly problematic in modern experiments on quantum systems, since the vast majority are performed on individual systems rather than on ensembles, as was the case in the early days of quantum mechanics. Since estimating the expectation value of an elementary observable with good accuracy requires a large number of repetitions of the same experiment, there is a trade-off in the acquisition of data. More repetitions yield data with higher accuracy, which is clearly helpful in the search for a good control pulse. The repetitions, however, also imply additional effort, which may even make a search prohibitively expensive.
Our present goal is to devise a control algorithm that does not require accurate estimation of expectation values, and that can find good solutions of control problems even in the extreme case of poor statistics due to intrinsic shot noise.
Among the most popular methods to perform optimizations directly based on experimental outcomes, gradient-based routines [2][3][4][5][6] and Nelder-Mead [7][8][9][10][11] have the advantage of simplicity and fast convergence when the underlying optimization landscape is well-behaved. Their performance can however be limited in the presence of many local minima [12,13] or plateaus [14]. Evolutionary algorithms [12,15,16] and recently introduced reinforcement learning techniques [13,[17][18][19][20] may overcome this problem at the expense of a large number of iterations that require a large set of data. Bayesian optimization [21][22][23][24], on the other hand, has been shown to offer a good compromise between the ability to find high-quality solutions quickly based on limited data, and reliability in distinguishing local from global maxima [25,26]. Since it is also well suited to deal with noisy data, we deem it an excellent choice for our present purposes.

II. CONTROL ALGORITHM
Optimal control can most intuitively be understood in terms of a control landscape, i.e. the value that a figure of merit F adopts as a function of the parameters θ that specify a control pulse. Typical examples for such figures of merit, also referred to as target functionals, include state fidelities [27], gate fidelities [9,28,29], the expectation values of an entanglement witness, or more non-linear quantities like entanglement measures [17,27,30,31] or the Fisher information [10,32]. Tunable control parameters often include piecewise constant amplitudes of external electric or magnetic fields [2][3][4][5][6][33], but temporal shapes of laser or microwave pulses with a parametrization in terms of Fourier series are also objects of optimization [7,8,27,[34][35][36].
It is hardly ever possible to find the dependence of the figure of merit F on the control parameters θ analytically. In addition to the necessity of an underlying theoretical model, this would require constructing the solution of the Schrödinger equation as an analytic function of the control parameters, which is only feasible in exceptionally elementary situations. The goal of finding the maximum of F therefore typically requires evaluation of F or, in the present context, experimental assessment of F for many specific values θ_j of θ. Essentially all existing control algorithms rely on an iterative procedure, with the central step being the decision for which value of θ to assess the figure of merit F next. As detailed in the rest of this section, Bayesian optimization relies on building a probabilistic model of the landscape and using it actively when deciding which control pulse θ to probe next.

A. Surrogate model
Bayesian optimization is based on proxy or surrogate models f that approximate the actual landscape F. Given the limited information on F, one considers a typically continuous set of surrogate models that are consistent with the data obtained from the experiment. The central aim of Bayesian optimization is the construction of the predictive probability density, i.e. the conditional probability density for f(θ) given the set of observations D made on the experiment. This construction follows Bayes' rule

P(f|D) = P(D|f) P(f)/P(D), (2)

where P(f) denotes the prior density over f, describing initial assumptions before the acquisition of any data, and P(D|f) is the probability to obtain the acquired data for a given control landscape f. The probability P(D) to acquire the data D in a given run of the experiment can be understood as a normalization constant and can be obtained from the condition ∫ df P(f|D) = 1. For Bayesian optimization to work well, it is important to use a reasonable prior distribution over f that is sufficiently broad to contain the actual control landscape, but that does not assign high probability to landscapes with unnatural features such as excessively wild oscillations or close-to-discontinuous dependencies. In practice [21][22][23][24][25], the surrogate model f is typically taken to be a Gaussian process [37]. In addition to its suitability for expressing minimal assumptions on the underlying landscape, this choice is also very practical in an explicit implementation, as discussed in more detail in Sec. II E.

B. Measurement noise
In practice, one cannot assume that the exact value of the figure of merit can be determined experimentally, but the term P(D|f) permits Bayesian optimization to incorporate noise and uncertainty in the data acquisition. In a noiseless and deterministic scenario, any surrogate model f would result in a deterministic outcome, such that P(D|f) is an infinitely peaked distribution. Distributions with finite width, on the other hand, account for the fact that noise or finite detector resolution results in experimental observations that are not repeatable with infinite accuracy, or, as is the case in quantum systems, that there is an intrinsic probabilistic element in the actual experiment. Whereas an unavoidable residual level of experimental noise is often taken into account satisfactorily in terms of a Gaussian distribution for P(D|f), the extreme case of large measurement noise deviates strongly from this simplification, and its inclusion requires more thorough modeling.
In the presence of large shot noise, it is more appropriate to use surrogate models for the probabilities of measurement outcomes rather than for the figure of merit. Surrogate models approximating the dependence of such a probability on the control parameters θ can not be directly described in terms of Gaussian processes, since any Gaussian process will contain landscapes f with points exceeding the admissible range [0, 1] of a probability. It is therefore essential to introduce a squashing function π, such that f_π(θ) = π(f(θ)) is bounded by this interval. In the following we will use the cumulative distribution function of a standard normal distribution, π(x) = ∫_{−∞}^{x} dy exp(−y²/2)/√(2π), but essentially any monotonic function mapping the real axis to the interval [0, 1] could be used.
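As a concrete illustration, this choice of squashing function has a closed form in terms of the error function; the function name `squash` below is ours, not taken from the text:

```python
# Squashing function mapping an unconstrained surrogate value f(theta)
# to a valid probability in [0, 1], chosen as the standard-normal CDF.
import math

def squash(x: float) -> float:
    """pi(x) = int_{-inf}^{x} dy exp(-y^2/2)/sqrt(2*pi), via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Any other monotonic map onto [0, 1], e.g. the logistic function, would serve the same purpose.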
With the probability f_π(θ) to obtain a certain measurement outcome in a single shot, the probability to obtain this outcome in n out of N repetitions is given by the binomial distribution

P(n|f, θ) = C(N, n) f_π(θ)^n (1 − f_π(θ))^{N−n},

with the binomial coefficient C(N, n). Bayesian optimization involves several steps of acquiring data points n_j for different values θ_j of θ. The probability to observe a sequence n_1 to n_M in such a series with an underlying model f is given by

P(D|f) = ∏_{j=1}^{M} C(N, n_j) f_π(θ_j)^{n_j} (1 − f_π(θ_j))^{N−n_j}. (4)

This product of binomial distributions is the explicit form of the general object P(D|f) entering the Bayesian inference above in Eq. (2). With this modeling in hand, one can explicitly construct the predictive probability density P(f(θ)|D) for any level of measurement noise, including the extreme case of no repetitions, i.e. N = 1.
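In code, this likelihood is a short loop over the probed control values; the sketch below uses illustrative variable names and, for simplicity, a single shot count N shared by all probes:

```python
# log P(D|f): a sum of binomial log-likelihood terms, one per probed
# control value theta_j, with the surrogate squashed into [0, 1].
import math

def squash(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_likelihood(f_values, counts, N):
    """f_values[j]: unsquashed surrogate value f(theta_j);
    counts[j]: observed number n_j of successes out of N shots."""
    total = 0.0
    for f_j, n_j in zip(f_values, counts):
        p = squash(f_j)  # single-shot success probability f_pi(theta_j)
        total += (math.log(math.comb(N, n_j))
                  + n_j * math.log(p) + (N - n_j) * math.log(1.0 - p))
    return total
```

Working with the log-likelihood avoids underflow once the number M of probed points grows.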

C. Target functionals
In practice hardly any control target F is simply given in terms of the probability of a single measurement outcome, but any control target can be expressed as a function

F = F(p_1, . . . , p_K) (5)

of the probabilities p_k for the outcomes of a sufficiently large set of K different measurements.
In this case, one can employ an independent surrogate model f_k for each of the K required probabilities, such that each π(f_k) can be used to make predictions for the probability p_k. These K surrogate models are updated independently following Eq. (2) during the course of the iterative procedure. The surrogate model f of the control target F is then expressed in terms of the individual models f_k as

f(θ) = F(π(f_1(θ)), . . . , π(f_K(θ))),

with exactly the same functional dependence as in Eq. (5). With this, one can construct the overall predictive distribution P(f(θ)|D) based on the individual predictive probability distributions P(f_k(θ)|D).

D. Decision rules
With the probabilistic modeling for the control landscape defined, the remaining question is how to use it in order to decide which control pulse to use in the next step of the iterative optimization.
At any given level of information obtained from the experiment, the expected control landscape can be expressed as

⟨f(θ)⟩ = ∫ df(θ) f(θ) P(f(θ)|D).

Particularly in cases of limited information, the expected landscape ⟨f(θ)⟩ does not necessarily approximate the exact landscape F(θ) well for every value of θ, but there can be substantial uncertainty.
Since a strategy ignoring this uncertainty would likely miss the global maximum, it is essential to take it into account, e.g. in terms of the variance or higher-order statistical moments. A strategy based on both expectation and uncertainty allows one to seek values of θ for which large values of F can be expected, given the limited data obtained so far. In practice this can be realised in terms of the acquisition function

g(θ) = ⟨f(θ)⟩ + α Δf(θ), (9)

where ⟨f(θ)⟩ is the expected landscape, Δf(θ) is the standard deviation of the predictive distribution, and α is a real, non-negative scalar that balances the weight of the expected landscape against the confidence in this estimate. Using the value of θ that maximizes g(θ) for the next query to the experiment allows the algorithm to broadly explore the control landscape, thus minimizing the risk of getting trapped in a local extremum. The choice of the numerical value of α can be made depending on the practical requirements. A small value will result in rapid convergence to some reasonably good solution, which might not be the best solution available, whereas a large value reduces the risk of missing the best solution, but will typically result in slower convergence. In practice, one may also change the value as the optimization progresses, starting with a large value favoring exploration of the entire landscape, followed by a decrease to focus on the particularly promising domains.
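A minimal sketch of this decision rule, assuming the surrogate's predictive mean and standard deviation are available as callables (both names are placeholders for whatever the surrogate implementation provides):

```python
# Upper-confidence-bound style decision rule: pick the candidate theta
# maximizing g(theta) = mean(theta) + alpha * std(theta).

def next_query(candidates, mean, std, alpha=4.0):
    """Return the candidate maximizing the acquisition function."""
    return max(candidates, key=lambda t: mean(t) + alpha * std(t))
```

With alpha = 0 the rule is purely exploitative; large alpha favors poorly explored regions.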

E. Implementation
Sections II A, II B, II C and II D describe conceptually the basic structure of Bayesian optimization and its specificities in the presence of substantial measurement noise. In this final subsection, we will discuss the most relevant aspects of the explicit implementation of these concepts, but we invite any reader who is more interested in the performance of the present techniques than in the underlying details to jump ahead to Sec. III.

Gaussian Processes
A random process extends the concept of a vector of random variables to an infinite collection of random variables. Any function f can thus be understood as the continuum limit of a set of discrete random numbers f(θ_i). Since optimization landscapes are continuous functions of the control parameters, random processes are the appropriate mathematical structure for the present purposes. In particular, the description of potential landscapes in terms of random processes avoids the necessity of an explicit parametrization, while still ensuring that properties like continuity and differentiability of the control landscapes are guaranteed.
A random process such that any finite collection of variables [f(θ_1), . . . , f(θ_N)] follows a Gaussian distribution is referred to as a Gaussian process. While a Gaussian distribution is parametrized by a mean vector and a positive semi-definite covariance matrix, a Gaussian process is specified by a mean function μ(θ) = ⟨f(θ)⟩ and a positive semi-definite kernel function

k(θ, θ′) = ⟨(f(θ) − μ(θ))(f(θ′) − μ(θ′))⟩,

which defines the covariance between f(θ) and f(θ′) for arbitrary θ and θ′, where ⟨•⟩ denotes the average under this distribution. This covariance is typically taken to decay exponentially with the distance between the two arguments, which implies that knowledge of the landscape at some point θ allows one to estimate the landscape around this point with high confidence, whereas the ability to predict the landscape decreases with the distance from this point.
The present analysis is based on the Matérn function

k(θ, θ′) = V (1 + √5 x + 5x²/3) exp(−√5 x),

with x = |θ − θ′|/l, defined in terms of a correlation length l and a variance V. This choice ensures that all second derivatives of k are well-defined and finite, which effectively enforces that any f that occurs with non-vanishing probability is at least twice differentiable [37].
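Assuming the ν = 5/2 member of the Matérn family, which is the one consistent with the twice-differentiability of the samples just stated, the kernel can be coded directly:

```python
# Matern-5/2 kernel: k = V * (1 + s + s^2/3) * exp(-s), s = sqrt(5)*|a-b|/l.
# Samples of a GP with this kernel are (almost surely) twice differentiable.
import math

def matern52(theta1, theta2, V=1.0, l=1.0):
    """Covariance between f(theta1) and f(theta2)."""
    s = math.sqrt(5.0) * abs(theta1 - theta2) / l
    return V * (1.0 + s + s * s / 3.0) * math.exp(-s)
```

Note that s²/3 equals the 5x²/3 term of the usual parametrization, since s = √5 x.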
Without any prior knowledge about the optimization landscape, one would typically choose a vanishing mean, μ(θ) ≡ 0. Choosing suitable values for the variance V and correlation length l would require some knowledge of the scales over which the value of F changes. Since even a very limited amount of data can be used to estimate this, one can adapt the values of V and l during the course of the iteration. We will do so following standard practice [23,37], by maximizing the log marginal likelihood log P(D) appearing in Eq. (2).
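For illustration, under a Gaussian observation model the log marginal likelihood has the familiar closed form log P(D) = −½ yᵀK⁻¹y − ½ log|K| − (M/2) log 2π; with the binomial noise of Sec. II B it would instead be obtained via the Laplace approximation. A self-contained sketch with a hand-rolled Cholesky factorization (Matérn-5/2 kernel assumed):

```python
import math

def matern52(a, b, V=1.0, l=1.0):
    s = math.sqrt(5.0) * abs(a - b) / l
    return V * (1.0 + s + s * s / 3.0) * math.exp(-s)

def log_marginal_likelihood(thetas, y, V=1.0, l=1.0, noise=1e-6):
    """log P(D) = -1/2 y^T K^{-1} y - 1/2 log|K| - (M/2) log(2 pi),
    with K_ij = k(theta_i, theta_j) + noise * delta_ij."""
    M = len(thetas)
    K = [[matern52(a, b, V, l) + (noise if i == j else 0.0)
          for j, b in enumerate(thetas)] for i, a in enumerate(thetas)]
    # Cholesky factorization K = L L^T
    L = [[0.0] * M for _ in range(M)]
    for i in range(M):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(K[i][i] - s) if i == j else (K[i][j] - s) / L[j][j]
    # Solve L z = y: then y^T K^{-1} y = z^T z and log|K| = 2 sum_i log L_ii.
    z = [0.0] * M
    for i in range(M):
        z[i] = (y[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    quad = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(L[i][i]) for i in range(M))
    return -0.5 * quad - 0.5 * logdet - 0.5 * M * math.log(2.0 * math.pi)
```

Adapting V and l then amounts to maximizing this quantity over the two hyperparameters, e.g. by a coarse grid search followed by a local refinement.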

Bayesian inference in practice
With Bayes' rule (Eq. (2)), the incorporation of measurement noise (Eq. (4)), the decision rules discussed in Sec. II D and the explicit definition of the Gaussian process in Sec. II E 1, the Bayesian optimization is well-defined. In practice, however, the Bayesian analysis is a bit more subtle than the multiplication of probabilities that Eq. (2) suggests.
In any explicit implementation, one will always explore the optimization landscape on a discrete set of points θ j (with j = 1, . . . , M ), obtain a dataset of observations D, and aim at predicting the landscape at an arbitrary next point θ M +1 in order to decide where to probe next. Crucially, one would like to obtain the desired information P (f (θ M +1 )|D) at a stage of the iterative optimization algorithm at which the data D does not contain any experimental results with the control pulse θ M +1 yet.
It is possible to estimate the control landscape at the point θ_{M+1} based on the earlier experimental observations. Given the underlying random process, one can construct the conditional probability density for such a point of the control landscape given a finite set of values {f(θ_j)}, denoted with the short-hand notation f̄ = [f(θ_1), . . . , f(θ_M)]. Due to the noise in the data acquisition, the set of points f̄ is not known with certainty, but rather is given in terms of the posterior distribution, expressed according to Bayes' rule as

P(f̄|D) = P(D|f̄) P(f̄)/P(D).

With the probabilities P(f(θ_{M+1})|f̄) and P(f̄|D) one can recast the desired predictive probability as

P(f(θ_{M+1})|D) = ∫ d^M f P(f(θ_{M+1})|f̄) P(f̄|D) (14)

(where d^M f is a short-hand notation for df_1 . . . df_M) for the landscape at θ_{M+1} before any experimental data for this control pulse is available.
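The conditional density of the landscape at a new point given values at previously probed points is the standard Gaussian-process regression formula: mean k_*ᵀK⁻¹f and variance k(θ_*, θ_*) − k_*ᵀK⁻¹k_*. A sketch treating the previously probed values as given (Matérn-5/2 kernel assumed, a small jitter added for numerical stability; helper names are ours):

```python
import math

def matern52(a, b, V=1.0, l=1.0):
    s = math.sqrt(5.0) * abs(a - b) / l
    return V * (1.0 + s + s * s / 3.0) * math.exp(-s)

def solve(A, b):
    """Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fct = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fct * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def gp_predict(train_t, train_f, t_star, jitter=1e-9):
    """Predictive mean and variance of f(t_star), conditioned on the
    values train_f at the previously probed points train_t."""
    K = [[matern52(a, b) + (jitter if i == j else 0.0)
          for j, b in enumerate(train_t)] for i, a in enumerate(train_t)]
    k_star = [matern52(a, t_star) for a in train_t]
    alpha = solve(K, train_f)  # K^{-1} f
    w = solve(K, k_star)       # K^{-1} k_*
    mean = sum(ks * al for ks, al in zip(k_star, alpha))
    var = matern52(t_star, t_star) - sum(ks * wi for ks, wi in zip(k_star, w))
    return mean, var
```

Close to a probed point the variance collapses; far away the prediction reverts to the prior mean and variance.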
a. Integration with binomial noise model The explicit construction of P (f (θ M +1 )|D) requires the solution of an M -dimensional integral. Already after a few steps in the iterative control algorithm a numerical evaluation of such an M -dimensional integral is prohibitively expensive, and an analytic solution is essential.
Many problems of Bayesian optimization are based on entirely Gaussian models whose integrals have analytic solutions. In the present case, however, only the first factor P(f(θ_{M+1})|f̄) in Eq. (14) is Gaussian, because of the underlying Gaussian process; the second factor P(f̄|D) is non-Gaussian, because P(D|f̄) is given in terms of the binomial distribution in Eq. (4). In order to perform the required integration efficiently, we will employ the Laplace approximation [37] for P(D|f̄), due to its conceptual simplicity. It entails approximating P(D|f̄) by a Gaussian distribution, such that P(f(θ_{M+1})|D) in Eq. (14) becomes an analytically solvable M-dimensional integral over the product of two Gaussian distributions. It is thus possible to perform all integrations efficiently despite the detailed, non-Gaussian underlying noise model given in Eq. (4), and we will discuss the implications of the approximate integration in detail later on in Sec. III B.
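The essence of the Laplace approximation can be shown in one dimension: replace a non-Gaussian factor by the Gaussian matching its mode and curvature. The toy sketch below finds the mode on a grid and estimates the curvature by a finite difference; the actual use in the algorithm is over the M-dimensional vector of landscape values:

```python
# Laplace approximation in 1D: fit N(mu, s2) to exp(log_p) by matching
# mu = argmax log_p and s2 = -1 / log_p''(mu).

def laplace_approx(log_p, lo=-10.0, hi=10.0, n=100001, h=1e-4):
    """Return (mu, s2) of the Gaussian fitted at the mode of exp(log_p)."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    mu = max(xs, key=log_p)  # grid search for the mode
    # second derivative at the mode via central finite differences
    d2 = (log_p(mu + h) - 2.0 * log_p(mu) + log_p(mu - h)) / (h * h)
    return mu, -1.0 / d2
```

For a log-density that is exactly quadratic the fit is exact; for the binomial likelihood it degrades near probabilities 0 and 1, where the log-density becomes strongly asymmetric.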
b. Gaussian noise The need for the approximate integration results from the binomial, i.e. non-Gaussian, modeling of measurement noise. Since we will compare results obtained with this detailed modeling to more established noise models, a short comparison to typical modeling admitting exact integration is in order.
Standard approaches would estimate the probabilities p_k in terms of the observed frequencies p̂_k = n_k/N, and with them the figure of merit as F̂ = F(p̂_1, . . . , p̂_K). The fact that this estimate does not necessarily coincide with the exact value of F is then taken into account with a phenomenological Gaussian probability distribution

P(D|f) ∝ ∏_j exp(−(F̂_j − f(θ_j))²/(2σ²)),

which, in this case, is defined in terms of a single surrogate model f. In contrast to the binomial modeling above, there is the phenomenological parameter σ; its value is not determined by basic principles, but needs to be chosen in accordance with observations, similarly to the choice of values for the Gaussian process discussed above in Sec. II E 1.

III. POOR STATISTICS AT WORK
In order to demonstrate the strengths and limitations of the control algorithm developed in Sec. II, we will apply it to two specific problems in numerical simulations.
The problem of state preparation of a single qubit discussed in Sec. III A is intended to be of pedagogical nature, giving insight into the workings of the methodology. This is followed by the analysis of state preparation of a three-qubit GHZ state in Sec. III B, which aims at demonstrating the practical value of the present method for state-of-the-art control problems.

A. A single qubit to warm up
A simple toy model of pedagogical value is given by a single qubit |ψ(θ)⟩ parametrized by a scalar control parameter θ. The goal lies in maximizing the fidelity with respect to a given state |φ⟩, and we assume for simplicity that a projective measurement in a basis including the state |φ⟩ can be performed. In the present case, the target functional is thus directly the probability to project onto |φ⟩, i.e. F(θ) = |⟨φ|ψ(θ)⟩|².
Rather than defining a control landscape in terms of the dynamics induced by a specific Hamiltonian, we will consider the function

F(θ) = sin²(½ sin(3θ) + 9/10 + 3θ/2 + 9/20).

This landscape, depicted by a dashed red line in Fig. 1, reaches the maximum value of 1 and has two additional local maxima in the interval [0, 4] that will help to illustrate the ability of the algorithm to distinguish local from global extrema.

Fig. 1 depicts the progress in data acquisition and the estimates of the control landscape during the course of the optimization. The optimization starts after taking 30 projective measurements, for randomly chosen θ_j (j = 1, . . . , 30), as shown in Fig. 1 a). The measurement results, depicted by red dots and a green square, can only adopt the values 0 and 1, and each value is drawn at random following the actual control landscape. Based on those 30 observations, one obtains a rough estimate of the actual control landscape; the expected landscape ⟨f(θ)⟩ is depicted in blue, a 95% confidence interval is depicted by the grey contour, and the actual probability density is represented by shades of blue. As one can see, the expected control landscape captures the most salient features, like the approximate locations of the three maxima, but, given the very limited data, it fails to reproduce the actual landscape quantitatively. In particular, it fails to distinguish the local from the global maximum. Based on this model, the decision rule described in Sec. II D determines for which value of θ to take the next measurement. The acquisition function (Eq. (9)), with a value of α = 4, is depicted in red at the bottom and takes its maximal value for θ_31 ≈ 0.1 (red vertical line). This decision reflects the high uncertainty of the model in this region, where no experimental observations are available yet. Fig. 1 b) contains the additional data point (green square) resulting from the projective measurement with the value θ_31 of the control parameter.
The updated surrogate model in b) shows that the surrogate model in a) had over-estimated the value of F(θ_31). In addition to the better estimate of this value, the uncertainty around θ_31 has also slightly decreased from a) to b). As a result of these two effects, the next probe is taken at the value θ_32 = 2.3, i.e. in the vicinity of the maximum where now both the expected value and the uncertainty are high. Fig. 1 c) depicts the surrogate model after 100 queries to the experiment. As one can see, the framework suggested several probes in the vicinity of the three maxima, and managed to avoid queries that would not have resulted in substantial added value. Consequently, the surrogate model approximates the actual control landscape more accurately close to these maxima.
After these 100 iterations the algorithm has identified the value θ f = 1.86 (as indicated in green in Fig. 1 c)) as optimal, which is very close to the true optimal value θ o = 1.876. With the value F (θ f ) = 0.996 of the actual control landscape, the algorithm has thus found a solution with an infidelity of 0.004, after a number of measurements that would have only been enough to determine a single point of the control landscape with a resolution of 0.01.

B. Three qubits
With the flavor of the workings of quantum control based on poor statistics laid out in Sec. III A, we can now proceed to a more challenging control problem. We consider the gate sequence depicted in Fig. 2 for the preparation of a three-qubit GHZ state |Ψ⟩ = (|000⟩ + |111⟩)/√2. The gate sequence consists of five single-qubit gates R_x(θ_j) = exp(iθ_j σ_x) (j = 1, . . . , 5), one single-qubit gate R_y(θ_6) = exp(iθ_6 σ_y), and two C-NOT gates as shown in Fig. 2. The goal is to find suitable angles θ_j ∈ [0, 2π] such that the circuit maps the initial state |000⟩ into the GHZ state |Ψ⟩. Later on, we will also consider additional noisy unitaries N_ε, but for the moment, they are treated as identities.

FIG. 2. Gate sequence for the preparation of a three-qubit GHZ state in terms of two controlled-not gates and six single-qubit gates with adjustable parameters θ_j. The addition of noisy unitaries N_ε and readout error helps to demonstrate control in the presence of experimental noise in addition to the measurement shot noise.
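The action of such a parametrized circuit is easy to simulate with dense state vectors. The exact placement of the rotations in Fig. 2 is not reproduced here; the sketch below uses generic helpers (our own naming) and checks that a minimal sub-sequence, one R_y on the first qubit followed by the two C-NOTs, with all R_x angles at zero (R_x(0) = 1), already reaches the GHZ state:

```python
import math

def Ry(t):
    """R_y(theta) = exp(i*theta*sigma_y) = cos(theta)*1 + i*sin(theta)*sigma_y."""
    c, s = math.cos(t), math.sin(t)
    return [[c, s], [-s, c]]

def apply_1q(state, U, q, n=3):
    """Apply single-qubit gate U on qubit q (qubit 0 = leftmost bit)."""
    out = [0j] * (1 << n)
    shift = n - 1 - q
    for idx, amp in enumerate(state):
        if amp == 0:
            continue
        bit = (idx >> shift) & 1
        for new_bit in (0, 1):
            out[(idx & ~(1 << shift)) | (new_bit << shift)] += U[new_bit][bit] * amp
    return out

def apply_cnot(state, c, t, n=3):
    """Flip qubit t whenever qubit c is 1."""
    out = [0j] * (1 << n)
    for idx, amp in enumerate(state):
        cb = (idx >> (n - 1 - c)) & 1
        out[idx ^ (cb << (n - 1 - t))] += amp
    return out

def ghz_fidelity(state):
    """|<GHZ|psi>|^2 with |GHZ> = (|000> + |111>)/sqrt(2)."""
    return abs((state[0] + state[7]) / math.sqrt(2.0)) ** 2

psi = [0j] * 8
psi[0] = 1.0 + 0j                         # initial state |000>
psi = apply_1q(psi, Ry(-math.pi / 4), 0)  # (|000> + |100>)/sqrt(2)
psi = apply_cnot(psi, 0, 1)
psi = apply_cnot(psi, 1, 2)
```

In the optimization itself, of course, such closed-form angles are unknown; the algorithm only sees measurement outcomes sampled from states like `psi`.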

Control targets
The fidelity F of the state ρ resulting from the gate sequence is defined as F = ⟨Ψ|ρ|Ψ⟩. Unless one is able to perform a projective measurement in a basis including the state |Ψ⟩, however, one is bound to perform measurements on each individual qubit, as indicated by the three detectors in Fig. 2. In practice, one would therefore construct the fidelity in terms of local measurements.
Fidelities are often expressed in terms of expectation values of observables. In the case of the GHZ state, the projector onto |Ψ⟩ can be decomposed as

|Ψ⟩⟨Ψ| = (1/8) (1 + S_1 + S_2 + S_3 + S_4 − S_5 − S_6 − S_7),

where S_1 = σ_x ⊗ σ_x ⊗ σ_x, S_k with k = 2, 3 and 4 are the permutations of 1 ⊗ σ_z ⊗ σ_z, and S_k with k = 5, 6 and 7 are the permutations of σ_x ⊗ σ_y ⊗ σ_y. Since the framework developed so far allows us to estimate probabilities rather than expectation values, it is more appropriate to express this fidelity in terms of probabilities for measurement outcomes. Denoting by P_k = (S_k + 1)/2 the projector onto the subspace spanned by the eigenstates of the observable S_k with eigenvalue +1, the state fidelity for the GHZ state can be written as

F = (1/4) (Σ_{k=1}^{4} p_k − Σ_{k=5}^{7} p_k), (19)

with the probabilities p_k = tr(ρ P_k). In contrast to the simpler case discussed above in Sec. III A, the fidelity is thus not given by the probability of one single measurement outcome, but in terms of the seven probabilities p_k. It is thus necessary to employ seven surrogate models f_k as detailed in Sec. II C, and to estimate the state fidelity as

f = (1/4) (Σ_{k=1}^{4} π(f_k) − Σ_{k=5}^{7} π(f_k)),

consistently with Eq. (19).
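Once the seven probabilities are in hand, the estimator is a one-liner. Writing out the stabilizer decomposition of the GHZ projector gives F = (p_1 + p_2 + p_3 + p_4 − p_5 − p_6 − p_7)/4; the index convention below (first four entries for S_1 to S_4, last three for S_5 to S_7) follows the ordering of the S_k in the text:

```python
# GHZ fidelity from the seven outcome probabilities p_k = tr(rho * P_k).

def ghz_fidelity_from_probs(p):
    """p[0:4]: +1-subspace probabilities of XXX and the three ZZ-type
    stabilizers; p[4:7]: those of the three XYY-type operators."""
    return (sum(p[:4]) - sum(p[4:])) / 4.0
```

A useful sanity check is the maximally mixed state, where every p_k = 1/2 and the formula yields F = 1/8, as it must for tr(ρ|Ψ⟩⟨Ψ|) with ρ = 1/8.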

Optimization
In the following we will analyse the convergence towards high fidelities with Bayesian optimization for different numbers N of repetitions of the same measurement, ranging from single-shot measurements (N = 1) up to N = 10 000. We will compare optimizations based on the proper binomial modeling of measurement noise (as developed in Sec. II, and in particular in Sec. II E 2 a) to optimizations with the conventional modeling of noise presented in Sec. II E 2 b. For brevity we refer to these as binomial and Gaussian modeling, the latter being the commonly used one [25,26].
Every step in the search for the optimal control parameters consists of a projective measurement of each of the observables S_j in Eq. (19) with N repetitions. In practice only five distinct observables are required, since S_2, S_3 and S_4 can all be obtained from a single σ_z ⊗ σ_z ⊗ σ_z measurement. Any optimization starts with N_i = 50 initial steps, taken at random, and is followed by M optimization steps. The resulting total number N_q = 5 × N × (N_i + M) of queries to the quantum hardware effectively quantifies the experimental burden.
Any individual run of an optimization has a substantial probabilistic component. The performance of the control algorithm will therefore be characterized in the following in terms of statistics of the infidelity I = 1 − F , such as the median and quantiles, over 30 repetitions of the same control task. In all the results presented below, the value of α in the acquisition function (Eq. (9)), is chosen as α = 4 except for the last 10 steps where it is fixed to α = 0.
The main frame of Fig. 3 depicts the decrease in the infidelity I as a function of the number of queries N_q for the binomial and Gaussian modeling, indicated by solid and empty symbols respectively. In all cases the optimization includes up to M = 400 steps. Fig. 3 clearly verifies that binomial modeling outperforms Gaussian modeling in cases of few repetitions of the same measurement. For N = 1, 2 and 5, binomial modeling results in an improvement of more than an order of magnitude. For N = 10 (not shown here), this improvement is still close to an order of magnitude, but for N = 25 and N = 100 the improvement becomes smaller. For more frequent repetitions, one might have expected Gaussian modeling to become comparable to binomial modeling, but in the explicit implementation the Gaussian version actually outperforms the binomial. In particular, we observed stagnation of the binomial optimization traces around I = 10⁻³. This can be attributed to the Laplace approximation (Sec. II E 2). Close to the extreme values of 0 and 1 of the underlying probabilities, this approximation becomes less accurate [37], resulting in an over- or under-estimation of the actual value of the probabilities. Signatures of this effect can also be seen in Fig. 1(c), where the expected control landscape lies systematically below the true landscape in the vicinity of the global maximum.
Even though this effect can be reduced at the expense of higher computational effort [38], this does not seem necessary, since these stagnations occur in a regime in which the Gaussian modeling becomes appropriate and analytic solutions for integrals are available.
One can see in Fig. 3 -in particular, highlighted in the inset -that in all cases where there is data for different values of N for a given number of total measurements, the optimization with the fewest repetitions performs best. It thus seems to be always preferable to explore many points of the control landscape with few repetitions rather than trying to resolve the control landscape accurately on a few instances. Reducing the number of repetitions of the same measurement, however, also implies additional computational cost: the algorithm scales as O(M 3 ) [37] in terms of the number M of iterations, and at constant number N q ∝ M N (the number of initial steps N i is typically much smaller than M ) of measurements a reduction of N implies an increase in M . That is, whereas fundamentally optimizations with the fewest repetitions seem to perform best, it can become necessary in practice to increase the number N of repetitions.

Adaptive strategy
Given that working with few repetitions of the same measurement (i.e. small values of N) is beneficial for fast convergence, but that this benefit tails off at high fidelities, it seems favorable to increase N as the optimization converges. We found it most practical to start the optimization with few-shot measurements (e.g. N ≤ 10) in order to rapidly identify good domains in parameter space. As the search circles in on a good solution, one can restrict the exploration to a smaller domain; dropping the data outside this domain reduces the numerical effort, and the number of repetitions N can be increased as the search approaches a high-fidelity solution.
The black curve in Fig. 4 depicts the convergence of this strategy. The initial 150 steps of the optimization are performed with N = 5 repetitions and binomial modeling of the noise, corresponding to the first three data points of the curve. After those initial steps the parameter space is reduced around the 75 best parameters probed so far, and the number of repetitions is increased. The points at N q = 3 × 10 5 depict the median infidelity after an additional 500 steps in the optimization with Gaussian modeling and N = 100 repetitions. Markers at N q = 7.5 × 10 5 and at N q = 1.5 × 10 6 depict the median infidelity for optimizations with an increase to N = 250 and N = 500 repetitions instead of the increase to N = 100 after the initial 150 steps.
The two remaining points of the curve, at N_q = 1.5 × 10^8 and at 1.5 × 10^9, correspond to an additional round of optimization with Gaussian modeling, starting from the solutions obtained with N = 500. For this round, the parameter space is reduced once more, and N is further increased to the values of N = 50 000 and N = 500 000, respectively. As one can see, extremely low infidelities can be obtained with these accurate measurement statistics.
These explicit numbers are merely intended as an example; different choices can be made to balance the distribution of effort between experimental data acquisition and classical computation.
The remaining five data sets (shown in color) correspond to optimizations with a fixed number N of repetitions of the same measurement and Gaussian modeling. As one can see, the optimizations with increasing repetitions converge faster; at any given number N_q, the optimizations with binomial modeling and adaptive repetitions reach infidelities around one order of magnitude smaller than the optimizations with fixed repetitions. Alternatively, for a given infidelity in the interval I ∈ [10^−5, 10^−2], the framework requires between 3 and 10 times fewer queries to the experiment.
The explicit example shown in Fig. 4 starts the optimization with a minimal number of N = 5 repetitions. Decreasing this number improves the average speed of convergence, but also increases the fluctuations between different realizations, owing to the greater susceptibility to statistical noise. This is similar to what can be seen in Fig. 3 for the case N = 1, where the range between the 25% and 75% quantiles is much wider than for larger numbers of measurement repetitions.

Additional noise
Last but not least, we demonstrate that the present control algorithm is not only able to cope with measurement noise, but can also find good controls in the presence of additional noise sources. To this end, we consider the noisy unitaries N_ε = √(1 − ε²) 𝟙 + iε n·σ, where σ is the vector of the three Pauli matrices.
In each individual run of the gate sequence, ε is drawn from a Gaussian distribution with vanishing mean and width σ_N, and the unit vector n is drawn from a uniform distribution following the Haar measure. In addition, we consider a finite probability p_ro for each of the three detectors to yield the wrong result, i.e. '0' instead of '1' and vice versa.
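The noise model can be sketched as follows, assuming NumPy. The factor i in front of the Pauli term keeps N_ε unitary; the function names and the specific values of σ_N and p_ro are illustrative, not taken from the paper's simulations.

```python
# Sketch of the noise model: a random unitary perturbation per run, plus
# independent readout errors. sigma_n and p_ro are the noise strengths
# named in the text; all other names are hypothetical.
import numpy as np

PAULIS = [np.array([[0, 1], [1, 0]], dtype=complex),    # sigma_x
          np.array([[0, -1j], [1j, 0]], dtype=complex),  # sigma_y
          np.array([[1, 0], [0, -1]], dtype=complex)]    # sigma_z

def noisy_unitary(sigma_n, rng):
    """N_eps = sqrt(1 - eps^2) * 1 + i * eps * n.sigma, with eps ~ N(0, sigma_n)
    and n a uniformly random unit vector."""
    eps = np.clip(rng.normal(0.0, sigma_n), -1.0, 1.0)
    n = rng.normal(size=3)
    n /= np.linalg.norm(n)           # isotropic direction on the sphere
    return (np.sqrt(1 - eps**2) * np.eye(2, dtype=complex)
            + 1j * eps * sum(ni * si for ni, si in zip(n, PAULIS)))

def flip_readout(bit, p_ro, rng):
    """Each detector reports the wrong outcome with probability p_ro."""
    return bit ^ (rng.random() < p_ro)

rng = np.random.default_rng(0)
U = noisy_unitary(0.05, rng)
print(np.allclose(U @ U.conj().T, np.eye(2)))  # unitarity check: True
```

Since (n·σ)² = 𝟙 for a unit vector n, one has N_ε N_ε† = (1 − ε²)𝟙 + ε²𝟙 = 𝟙, so each sampled perturbation is indeed a unitary, and only the lack of correlations between runs makes the overall channel noisy.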
In the presence of this noise, with no correlations between the different noisy elements or between consecutive runs, it is fundamentally impossible to reach perfect fidelities. The goal therefore cannot be to reach unit fidelity in a noisy circuit, but to demonstrate the ability of Bayesian optimization to find good solutions with a noisy circuit. The infidelities in Fig. 5 therefore correspond to what is achieved with a noiseless circuit but a control solution that has been constructed in the presence of noise.
Similarly to the other figures, Fig. 5 depicts a decrease in the median (as well as in the 25% and 75% quantiles) infidelity as a function of the number of measurements taken, for different values of σ_N and p_ro, using the adaptive strategy described in the previous section with an initial number of N = 5 repetitions.
Even though the control algorithm is entirely agnostic to the nature of the additional noise, it makes rapid progress towards high fidelities. Most cases feature a rapid drop in infidelity after a few thousand measurements, with the cases σ_N = 5% and p_ro = 10% being the exception. Even in the presence of very substantial additional noise, however, the optimization converges well to low infidelities.

IV. CONCLUSIONS
The ability to find close-to-optimal solutions based on limited experimental data can advance technological development and precision experiments on a wide range of physical systems. It offers a very resource-efficient pathway towards the optimal use of currently existing quantum hardware with 10 to 100 qubits; the size of these systems makes theoretical modeling prohibitively expensive, but the limited accuracy of the individual qubits calls for well-designed control sequences that prevent the rapid accumulation of errors. Given the availability of cheap classical computation, it is essential to use these resources as much as possible to support the limited capabilities of near-term quantum hardware. The present framework contributes directly to this goal: it allows us to make optimal use of quantum systems based on limited experimental data, at the expense of the increased computational overhead of Bayesian optimization compared to other control algorithms. The exploitation of cheap classical resources can be pushed further in favor of efficiency with quantum mechanical resources by using more accurate estimates of optimization landscapes than done here.
The applicability of the proposed methodology is by no means limited to state preparation; it includes, for example, the optimization of single-qubit gates in the presence of uncharacterized noise, or the direct realization of few-qubit gates that avoids decomposition into more elementary gates in order to realize more complex quantum algorithms within limited coherence times. Whereas the optimization of gate fidelities will likely remain limited to systems comprising a few qubits, there are also direct applications with large qubit registers. One example is the variational quantum eigensolver [39,40], aimed at finding the quantum state that minimizes an energy functional; the present method allows one to perform the optimization over a wide range of states without the need to accurately estimate the energy expectation value during the search for the optimal state. With the identification of ground states being a promising route towards the solution of many classical problems, such as the traveling salesman problem, the present methodology holds the potential to substantially advance the practical value of quantum systems for real-life applications.
The improvement in performance shown here for the case of shot noise in projective measurements is also not necessarily limited to this specific type of noise; similar improvements should be expected for other types of noise, such as environmental noise or noise in the actual control fields, if their spectral properties or temporal correlations are taken into account appropriately. With probabilistic modeling and Bayesian optimization being active fields of research in mathematics and computer science, one should expect the applicability and performance of methodologies similar to those developed here to increase rapidly beyond what has been demonstrated.
Acknowledgment. We are indebted to Annie Pichery, Hongzheng Zhao, Jake Lishman and Ulrich Schneider for stimulating discussions. The project Theory-Blind Quantum Control TheBlinQC has received funding from the QuantERA ERA-NET Cofund in Quantum Technologies, implemented within the European Union's Horizon 2020 Programme, and from EPSRC under grant EP/R044082/1. This work was supported through a studentship in the Quantum Systems Engineering Skills and Training Hub at Imperial College London, funded by the EPSRC (EP/P510257/1).