Neural-Network Heuristics for Adaptive Bayesian Quantum Estimation

Quantum metrology promises unprecedented measurement precision but suffers in practice from the limited availability of resources such as the number of probes, their coherence time, or non-classical quantum states. The adaptive Bayesian approach to parameter estimation allows for an efficient use of resources thanks to adaptive experiment design. For its practical success fast numerical solutions for the Bayesian update and the adaptive experiment design are crucial. Here we show that neural networks can be trained to become fast and strong experiment-design heuristics using a combination of an evolutionary strategy and reinforcement learning. Neural-network heuristics are shown to outperform established heuristics for the technologically important example of frequency estimation of a qubit that suffers from dephasing. Our method of creating neural-network heuristics is very general and complements the well-studied sequential Monte-Carlo method for Bayesian updates to form a complete framework for adaptive Bayesian quantum estimation.

In quantum metrology we aim to design quantum experiments such that one or multiple parameters can be estimated from the measurement outcomes. Experiment design can involve the preparation of initial states, controlling the dynamics, or choosing measurements for readout. The estimation of parameters is a problem of statistical inference, and the most common approaches to tackle it are the frequentist and the Bayesian one.
In the frequentist approach, experiments are typically repeated several times, which allows one to estimate the parameters from the statistics of measurement outcomes using, for example, maximum likelihood estimation. The problem of experiment design is often addressed with the Cramér-Rao bound formalism [1,2] by maximizing the quantum Fisher information with respect to experiment designs [3].
The Bayesian approach, on the other hand, relies on updating the current knowledge about the parameters after each experiment using Bayes' law. Examples of Bayesian quantum estimation include state and process tomography [4-9], and phase and frequency estimation [10-12] with various experimental realizations [13-20]. The Bayesian approach is particularly suitable for adaptive experiment design: experiments can be optimized depending on the current knowledge about the parameters and the available resources. While adaptivity can enhance the precision and save time and other resources compared to non-adaptive (frequentist) approaches [21,22], it involves a computational challenge: the Bayesian update and the consecutive optimization of the experiment design are both analytically intractable (with rare exceptions under idealized conditions [10,23]). In view of the short time scale of quantum experiments, slow numerical computation of the Bayesian update and the experiment design can drastically increase the total time consumed. In order to approximate the Bayesian update efficiently, a framework based on a sequential Monte-Carlo (SMC) algorithm has been developed [6,7,24-26]. This framework, however, does not solve the problem of adaptive experiment design, which represents a second computational step.
In practice, one has to rely on so-called experiment-design heuristics, i.e., fast-to-evaluate functions which take available information as input and return an experiment design as output, see Fig. 1(a). If the output depends on the available information, we speak of an adaptive heuristic. So far, adaptive experiment-design heuristics for Bayesian estimation have been found mostly manually, typically motivated by analytic arguments derived for idealized conditions concerning the experimental model and the available resources [9-11,19,27]. In one case, a manually found heuristic [27] has been fine-tuned offline using a particle swarm algorithm [28]. Another approach is to optimize experiment designs between the experiments with respect to a restricted set of experiment designs in order to keep the problem numerically tractable [29]. Apart from Bayesian inference, particle swarm and differential evolution algorithms have been used to search a certain class of experiment-design heuristics (represented by binary decision trees) in order to optimize the scaling of the uncertainty in phase estimation with the number of entangled photons in an interferometer [19,30-32]. A general numerical framework for finding adaptive experiment-design heuristics for Bayesian quantum estimation has been missing so far.
We consider an approach to experiment-design heuristics which uses reinforcement learning (RL). Recently, RL has been used with great success to create programs that play chess and Go better than any other program or human [33]. RL has also been used in quantum physics [34,35] and, in particular, in quantum metrology, such as for calibrating quantum sensors [36], for the identification of light sources [37], and for improving the dynamics of quantum sensors [38,39]. Here, we propose to use neural networks (NNs) as experiment-design heuristics. We provide a general method based on a combination of an evolutionary strategy with RL for creating such NN experiment-design heuristics. This method builds upon and complements the SMC framework for approximate Bayesian updates [6,7,24-26]. It is general in the sense that (i) it can be easily adjusted to all kinds of estimation problems, (ii) it uses a very general ansatz for the experiment-design heuristic in the form of a NN, and (iii) it creates NN experiment-design heuristics that can take into account not only knowledge about the parameters to be estimated but also additional information, for instance, about available resources. Further, our method numerically approaches global optimization in the sense that it tries to find heuristics which are optimal for all future experiments, as opposed to local (greedy) strategies which always optimize only the next experiment. The trained neural networks represent ready-to-use experiment-design heuristics and we envisage their application in Bayesian quantum sensors in combination with the SMC framework for approximate Bayesian updates [6,7,24-26].

BAYES RISK
Let θ = (θ_1, …, θ_d) ∈ Θ be a vector of parameters we want to estimate, and we assume that Θ restricts each θ_j to a finite interval. The prior p(θ) is a probability distribution on Θ which represents our knowledge prior to the first measurement, and we imagine that prior to each Bayesian estimation θ is sampled from p(θ). After each experiment, our knowledge about θ is updated with Bayes' law (see Methods). Let p(θ|D_k, E_k) represent this updated knowledge about θ after the kth measurement, where D_k = (d_1, …, d_k) denotes the measurement outcomes from a sequence of k experiments E_k = (e_1, …, e_k). In the following, we omit the dependence on experiment designs for the sake of clarity, e.g., we write p(θ|D_k) instead of p(θ|D_k, E_k).
An experiment-design heuristic h is a function which maps available information, for instance about θ or available resources, to an experiment design for the next experiment, see Fig. 1(a). The idea is to consult the experiment-design heuristic prior to each experiment and design the experiment accordingly. Imagine that the available resources for one Bayesian estimation are such that we can make k experiments. Then, we aim to choose an experiment-design heuristic h which minimizes the expected traced covariance over p(θ|D_k),

r[h|p(θ)] = E_{D_k} { tr[Cov_{θ|D_k}(θ)] },   (1)

where Cov_{θ|D_k}(θ) denotes the covariance of θ over the posterior, and the notation E_{a|b}(c) denotes the expected value of c with respect to a ∈ A distributed as p(a|b), E_{a|b}(c) = ∫_A da p(a|b) c. If a takes discrete values, the integral becomes a sum, and if a is distributed according to a particle filter, it becomes a weighted sum over the particles (see Methods). In Methods we show that r[h|p(θ)] corresponds to the Bayes risk for the loss function L(θ̂_k(D_k), θ) = ||θ̂_k(D_k) − θ||²_2, with the Bayes estimator θ̂_k after k measurements given by θ̂_k(D_k) = E_{θ|D_k}(θ). We also discuss in Methods how to generalize Eq. (1) if other resources are available, and how the expected values used for the computation of Eq. (1) are approximated numerically.
For a given estimation problem and prior p(θ), the Bayes risk r[h|p(θ)] represents our figure of merit (smaller values are better) for experiment-design heuristics.

EXPERIMENT DESIGN AS A RL PROBLEM
Our idea is to train a NN to become an experiment-design heuristic. To this end, we simulate the experiment and the Bayesian update offline many times to generate training data [see Fig. 1(c)] and train the NN with it. Instead of simulating the experiment, measurement data from an actual experiment could be used to train the NN; note, however, that the calculation of the Bayesian update still depends on the model for the experiment. Either way, once the NN is trained, it provides a powerful adaptive heuristic that can be used as part of a real Bayesian quantum sensor.
Let us phrase the problem of experiment-design heuristics in the language of RL. The neural network represents the RL agent. RL is an iterative method. In each iteration training data are generated and the agent learns from the data. The learning phase is prescribed by the RL algorithm and generally consists of a combination of learning from experience (using the training data) and random exploration.
The RL framework for generating training data is depicted in Fig. 1(c) and can be understood in the context of Bayesian quantum estimation, cf. Fig. 1(a): One episode of training data corresponds to one Bayesian estimation and consists of a sequence of k simulated experiments including Bayesian updates and experiment designs chosen by the RL agent. The number of experiments k depends on the available resources. In the simplest case, the number of experiments is the limiting resource (referred to as experiment-limited in the following), i.e., each episode consists of the same number of experiments, and the Bayes risk is given by Eq. (1). More generally, the available resources set more complicated constraints such that k can vary between different episodes (see Methods for a generalization of the Bayes risk for such cases). If a resource is exhausted, the episode ends, the RL environment [see Fig. 1(c)] is reset to default values (the current posterior is reset to the prior p(θ) and a new true parameter θ is sampled from the prior), and another episode starts. Training data for one iteration typically consist of many episodes, e.g., ∼ 10^3 in our model study below.
Crucial for the success of RL are the observations and rewards for the agent, see Fig. 1(c). The observations may contain information about the current knowledge p(θ|D_k), about the past, such as prior actions, and about resources, such as the remaining time. The reward should reflect the goodness of the behavior (actions) of the agent (a larger reward is better) and is used by the RL algorithm to reinforce behavior which leads to larger rewards; the RL agent learns from its experience. The negative Bayes risk seems an obvious choice for a reward function. However, the computation of the Bayes risk is too time-consuming. Instead, we define the reward after the kth experiment as the difference in the traced covariance over the posterior after the (k−1)th and the kth experiment,

R_k = tr[Cov_{θ|D_{k−1}}(θ)] − tr[Cov_{θ|D_k}(θ)].   (2)

The idea behind this reward function is that (i) Eq. (2) is straightforward to compute in the Bayesian SMC framework, (ii) it reflects the difference in our uncertainty about θ before and after the current step, and (iii) the expected value of the rewards accumulated from the beginning of an episode with respect to possible measurement outcomes D_k yields the negative Bayes risk (up to a constant), see also Methods. For example, in the experiment-limited case with N experiments per episode we have

E_{D_N} [ ∑_{k=1}^{N} R_k ] = −r[h|p(θ)] + const,   (3)

where the constant is given by the uncertainty in the prior, const = tr[Cov_θ(θ)]. RL has the goal of maximizing the expected discounted reward, which equals the left-hand side of Eq. (3) because we set the discount factor to one. From Eq. (3) we thus see that RL indeed attempts to minimize the Bayes risk r[h|p(θ)].
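In the SMC framework discussed in Methods, both traces in Eq. (2) are weighted sums over particles, so the reward is cheap to evaluate. A minimal NumPy sketch (the helper names are ours, not the paper's code):

```python
import numpy as np

def traced_cov(weights, particles):
    """Trace of Cov_{theta|D}(theta) for a particle filter:
    weights has shape (n,), particles has shape (n, d)."""
    mean = weights @ particles
    centered = particles - mean
    return float(np.sum(weights[:, None] * centered ** 2))

def reward(weights_before, weights_after, particles):
    """Eq. (2): decrease of the traced posterior covariance in one experiment."""
    return traced_cov(weights_before, particles) - traced_cov(weights_after, particles)
```

A positive reward thus corresponds to a reduction of the posterior uncertainty by the experiment just performed.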
The training of the NNs consists of two steps, see Fig. 1 (b): We initialize the NN using imitation learning (pretraining) as implemented in [40]. The idea of imitation learning is to take advantage of an already existing (e.g., manually found) heuristic. The NN is trained to imitate the behavior from episodes created with the existing heuristic, cf. Fig. 1(c). This pretraining step is not strictly necessary but speeds up the training and makes RL more stable.
However, there might not always be a good heuristic available to imitate. For such cases and for the sake of comparison, we consider an evolutionary strategy [41] called the cross-entropy method (CEM) for continuous action spaces [42,43] [see Fig. 1(d) and Methods]. CEM is used to train a neural network from scratch starting from a randomly initialized neural network.
Once the NN is pretrained, we use RL as a second step. The RL algorithm we use is trust region policy optimization (TRPO) [44] as implemented in the Python package Stable Baselines [40] (see Methods for details on the training and the advantages of combining it with CEM). TRPO is an approximation of an iterative procedure for optimizing policies with guaranteed monotonic improvement [44].

MODEL STUDY
We demonstrate our method of creating NN heuristics with an example of high practical relevance for magnetic-field estimation with nitrogen-vacancy centers, with applications in single-spin magnetic resonance [17,18,20,45]. Let us consider a qubit which evolves under the Hamiltonian H(ω) = (ω/2) σ_z, and we want to estimate the frequency ω. The qubit is prepared in |+⟩ = (|0⟩ + |1⟩)/√2, evolves under H(ω) for a controllable time t, and is measured in the σ_x basis (assuming a strong projective measurement with outcomes labeled 0 and 1 corresponding to a measurement of |+⟩ and |−⟩). Let us further assume that the qubit suffers from an exponential decay of phase coherence, with characteristic time T_2. According to the Born rule, the likelihood of finding an outcome d ∈ {0, 1} with the σ_x measurement can be expressed as [7,10]

p(0|ω, t, T_2) = 1/2 + (1/2) e^{−t/T_2} cos(ωt)   (4)

for measuring d = 0, and p(1|ω, t, T_2) = 1 − p(0|ω, t, T_2) for measuring d = 1. Eq. (4) defines all relevant properties of the experiment. A single experiment design consists of specifying the evolution time t. We consider the following estimation problems: (i) the estimation of ω without decoherence (T_2 = ∞, see the top row of panels in Fig. 2), (ii) the estimation of ω with known T_2 relaxation (we consider this problem twice with different values for T_2, see the second and third rows of panels in Fig. 2), and (iii) the simultaneous estimation of ω and T_2^{−1}, i.e., θ = (ω, T_2^{−1}) (see the bottom row of panels in Fig. 2). In all cases we consider ω ∈ (0, 1) (making the problem dimensionless).
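Eq. (4) is cheap to evaluate numerically; the following sketch (NumPy, with our own function and argument names) implements the likelihood for both outcomes:

```python
import numpy as np

def likelihood(d, omega, t, T2=np.inf):
    """Born-rule likelihood of Eq. (4):
    p(0 | omega, t, T2) = 1/2 + (1/2) exp(-t/T2) cos(omega * t),
    p(1 | ...) = 1 - p(0 | ...)."""
    p0 = 0.5 + 0.5 * np.exp(-t / T2) * np.cos(omega * t)
    return p0 if d == 0 else 1.0 - p0
```

For t ≫ T_2 both outcomes approach probability 1/2, i.e., such experiments carry essentially no information about ω.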
Each estimation problem defines a RL environment [see Fig. 1(c)] which is either time-limited or experiment-limited. In the former case, the available time T per episode is limited, while in the latter case the number of experiments N per episode is fixed. The first case is relevant if time is the limiting resource, while the second case is relevant if measurements are expensive, for instance, if experiments involve probing sensitive substances such as biological tissue. In practice, there may be constraints on both T and N, which could easily be taken into account by creating a RL environment accordingly.
As an observation for the NN after the kth experiment, we choose the expected value E_{θ|D_k}(θ) and the covariance Cov_{θ|D_k}(θ) over the posterior (which generalizes the variance in the case of single-parameter estimation), the previous actions from the current episode (at most 30 actions), and the spent time or the number of experiments in the current episode (for the time-limited or experiment-limited case, respectively).
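As a concrete illustration, such an observation vector could be assembled from a particle filter as follows. This is a sketch with our own conventions; in particular, the zero-padding of the action history and the ordering of the entries are our assumptions, not taken from the paper:

```python
import numpy as np

def observation(weights, particles, past_actions, resource, max_actions=30):
    """Assemble an observation of the kind described above: posterior mean,
    flattened posterior covariance, the episode's previous actions
    (zero-padded to max_actions slots), and the consumed resource
    (spent time or number of experiments)."""
    mean = weights @ particles                        # shape (d,)
    centered = particles - mean
    cov = (weights[:, None] * centered).T @ centered  # shape (d, d)
    acts = np.zeros(max_actions)
    tail = np.asarray(past_actions[-max_actions:], dtype=float)
    acts[:len(tail)] = tail
    return np.concatenate([mean, cov.ravel(), acts, [resource]])
```

For single-parameter estimation (d = 1) this yields a fixed-length vector of 1 + 1 + 30 + 1 = 33 entries, which matches the fixed input-layer size a NN requires.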
Several heuristics have been developed for estimation problems (i) and (ii). As an example of a non-adaptive strategy, we consider a heuristic which chooses exponentially sparse times [10], t_k = (9/8)^k, denoted as the exp-sparse heuristic in the following. Further, we consider two adaptive heuristics: We define the first one as

t_k = { tr[Cov_{θ|D_{k−1}}(θ)] }^{−1/2}

and we will call it the σ^{−1} heuristic. This represents a generalization to multiparameter estimation of a heuristic which was derived for estimation problem (i) (ω estimation, T_2 → ∞) by Ferrie et al. [10]. For single-parameter estimation, the σ^{−1} heuristic chooses the times t_k as the inverse standard deviation of θ over the posterior; it is optimal in the greedy sense and only in the asymptotic limit N → ∞.
The second adaptive heuristic that we consider is the particle guess heuristic (PGH) [11]. It is based on the SMC framework, which uses a particle filter to represent probability distributions such as p(θ|D_k) [6] (see Methods). PGH chooses times as the inverse distance of two particles θ_1, θ_2 ∈ Θ sampled from p(θ|D_{k−1}),

t_k = ||θ_1 − θ_2||^{−1}.

In the case of single-parameter estimation, PGH is a proxy for the σ^{−1} heuristic, but it is faster to compute (given the particle filter) and introduces additional randomness (compared to the σ^{−1} heuristic).
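Given a particle filter, both adaptive baselines are a few lines each. The sketch below uses our own helper names and a fixed seed; the two PGH particles are drawn without replacement so that t_k is finite whenever the drawn particle positions are distinct:

```python
import numpy as np

rng = np.random.default_rng(0)

def pgh(weights, particles):
    """Particle guess heuristic: t_k = 1 / ||theta1 - theta2||, with
    theta1, theta2 drawn from the current particle filter."""
    i, j = rng.choice(len(particles), size=2, replace=False, p=weights)
    return 1.0 / np.linalg.norm(particles[i] - particles[j])

def sigma_inv(weights, particles):
    """sigma^{-1} heuristic: t_k = tr[Cov(theta)]^{-1/2} over the posterior."""
    mean = weights @ particles
    var = np.sum(weights[:, None] * (particles - mean) ** 2)
    return 1.0 / np.sqrt(var)
```

The randomness of PGH comes entirely from the two sampled particles, which is why it is a stochastic proxy for the deterministic σ^{−1} rule.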
Let us turn to the results depicted in Fig. 2. In all examples, we consider uniform priors for ω ∈ (0, 1) and, in the case of multiparameter estimation, also for T_2^{−1} ∈ (0.09, 0.11). For the multiparameter estimation examples, we consider, instead of one experiment, 100 independent and identical experiments in each step in order to facilitate the estimation of T_2^{−1}, and we also give the averaged outcome of these experiments as an additional observation to the RL agent.
We considered three different heuristics for pretraining TRPO: the σ^{−1} heuristic, PGH, and a CEM-trained NN, but, for the data in Fig. 2, plot only the results for the best of these three.

[Fig. 2 caption: We study frequency estimation without (top row) and with T_2 relaxation (2nd and 3rd rows, with different values of T_2 as stated in the plot titles) as well as the simultaneous estimation of the frequency ω and relaxation rate T_2^{−1} (bottom row). The Bayes risk is calculated numerically from 10^4 episodes, see Methods for details. TRPO has been pretrained with the heuristic which is specified in brackets in the legends. The lines are linear interpolants to guide the eye.]
In the presence of T_2 relaxation, times which exceed T_2 tend to yield no information, which explains why the Bayes risk saturates for the exp-sparse heuristic. The largest advantage of a NN heuristic is found for the example of time-limited ω estimation with T_2 = 10. Compared to PGH, we find an improvement in the Bayes risk by more than one order of magnitude. Generally, the performance of NN heuristics is remarkable given that (for the single-parameter estimation problems considered in Fig. 2) conventional heuristics such as PGH are used in experiments [17-20] and are considered to be the best practical choice with near-optimal performance [11,12]. In the Supplementary Material, we compare the distributions of experiment designs, i.e., the evolution times, for the different adaptive heuristics used in Fig. 2.
While the Bayes risk is used to compare the expected (average) performance of heuristics, it is an important advantage of the Bayesian approach over the frequentist one that it provides credible regions as a practical tool for comparing single runs of parameter estimation (episodes) [25,26]. Let us revisit the multiparameter problem discussed in the two bottom panels of Fig. 2. This time, we run only one Bayesian estimation of (ω, T_2^{−1}) with each heuristic and visualize their performance by plotting 95% credible regions [25,26], see Fig. 3. Note that since the Bayesian estimation is subject to fluctuations (such as stochastic measurement outcomes), the shape of credible regions fluctuates between different estimations. For the specific example shown, we see that TRPO provides the smallest uncertainties compared to the other heuristics, as judged by the area of the credible regions.

DISCUSSION AND CONCLUSION
The practical success of adaptive Bayesian estimation often depends on the run time of data processing (Bayesian update, choice of an adaptive experiment design). Using NNs as experiment-design heuristics introduces a computational overhead compared to the fastest existing experiment-design heuristics such as PGH. On the other hand, improvements in measurement precision achieved by the NN heuristics may easily outweigh the drawback of an increased run time. This trade-off has to be assessed depending on the estimation problem and its concrete realization. In our model study, the run time for one call to a NN heuristic is always shorter than the corresponding numerical Bayesian update (single-core computation in both cases). For more complicated modeling of the sensor or for estimation up to a larger precision, the run time of the Bayesian update increases even more, and we expect that the total run time will be dominated by the numerical Bayesian update. Moreover, the effective run time of the NN per experiment could be reduced by using smaller NNs or by sacrificing a part of the adaptivity, e.g., by calling the NN heuristic only every mth experiment, in which case the NN could return the designs for the next m experiments.
In conclusion, we proposed and demonstrated a machine learning method to create fast and strong experiment-design heuristics for Bayesian quantum estimation. The method uses imitation and reinforcement learning for training NNs to become experiment-design heuristics. In order to make the method independent of the availability of an expert heuristic for imitation learning, we show that expert heuristics can be found with an evolutionary strategy (the cross-entropy method). The big advantage of our method is its versatility and adaptivity. The properties of the estimation problem and the quantum experiments, and the availability of resources are taken into account during the training such that the trained NNs are tailored experiment-design heuristics. Similarly, we expect that issues such as uncertainty in the model or multimodal probability distributions are automatically taken into account by NN heuristics as a result of training. We provide the complete source code [46] used for this work in order to facilitate the application of the presented method in experiments and to related problems such as the detection of time-dependent signals [25,47] and adaptive Bayesian state tomography [8,9].

Bayes' law and the sequential Monte-Carlo algorithm
Bayes' law for updating our knowledge about θ ∈ Θ according to the measurement outcome d_k of the kth experiment is given by

p(θ|D_k) = p(d_k|θ) p(θ|D_{k−1}) / p(d_k),

where p(θ|D_k) is our updated knowledge (posterior), p(θ|D_{k−1}) is our knowledge prior to the kth experiment, p(d_k|θ) is the likelihood, and p(d_k) = E_{θ|D_{k−1}}[p(d_k|θ)] is the normalization. The exact solution for the Bayesian update is generally intractable. Instead, we use an inference algorithm based on the sequential Monte-Carlo algorithm [6,7,48]. The idea is to represent the probability distributions p(θ) and p(θ|D_k) by a discrete approximation ∑_{j=1}^{n} w_j δ(θ − θ_j) with n so-called particles with positive weights w_j and positions θ_j ∈ Θ. Then, for a Bayesian update, we only need to update the weights of each particle by calculating p(d_k|θ_j). This means that for the jth particle we have to simulate the experiment for θ = θ_j and use the Born rule to find p(d_k|θ_j). The expected value in the definition of p(d_k) reduces to a simple sum over the particles of the prior.
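For the single-parameter frequency model of Eq. (4), one such SMC update step amounts to a reweighting and renormalization; a minimal sketch (our own function name):

```python
import numpy as np

def smc_update(weights, particles, d, t, T2=np.inf):
    """One SMC Bayesian update for the dephasing-qubit model of Eq. (4):
    multiply each particle weight w_j by the likelihood p(d | omega_j, t, T2)
    and renormalize; p(d_k) appears implicitly as the normalization."""
    p0 = 0.5 + 0.5 * np.exp(-t / T2) * np.cos(particles * t)
    lik = p0 if d == 0 else 1.0 - p0
    new_weights = weights * lik
    return new_weights / new_weights.sum()
```

Here `particles` is a 1D array of candidate frequencies ω_j; the particle positions stay fixed until a resampling step is triggered.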
The particle locations need to be resampled if too many weights are close to zero, i.e., the particle filter of p(θ|D_k) is impoverished. We use QInfer's [26] implementation of the Liu-West resampling algorithm [24] (with default parameter a = 0.98 [26]) with n = 2 × 10^3 particles for RL environments without decoherence and for RL environments with multiparameter estimation. For the environments with ω estimation and finite T_2, we use n = 2 × 10^4 particles.

Bayes risk
Let us consider a sequence of k experiments designed with the experiment-design heuristic h. Let θ̂_k : D_k ↦ θ̂_k(D_k) be an estimator of θ after k experiments, and let L(θ̂_k(D_k), θ) be a loss function which quantifies the deviation of θ̂_k(D_k) from θ. For an experiment-design heuristic h, the risk s of θ̂_k (usually denoted by R, which denotes the reward in this work) is defined as

s[θ̂_k|h, θ] = E_{D_k|θ} [ L(θ̂_k(D_k), θ) ].   (6)

Note that the experiment-design heuristic h determines the experiment designs E_k, which influence the estimate θ̂_k(D_k, E_k). However, in order to simplify notation we write θ̂_k(D_k) instead of θ̂_k(D_k, E_k). For a definition of the expected value E_{D_k|θ}, see the main text after Eq. (1). Eq. (6) is the risk typically associated with a choice of an estimator θ̂_k and corresponds to the expected (with respect to measurement outcomes D_k) loss. The Bayes risk represents a way to additionally take into account prior knowledge about θ in the form of a probability distribution p(θ) (the prior) on Θ. Given a prior p(θ), the Bayes risk is defined as

r[θ̂_k|h, p(θ)] = E_θ { s[θ̂_k|h, θ] }.   (7)

A common choice for a loss function is the quadratic loss L(θ̂_k(D_k), θ) = ||θ̂_k(D_k) − θ||²_2. Then, the estimator which minimizes the Bayes risk, the Bayes estimator, is given by the expectation over the posterior, θ̂_k(D_k) = E_{θ|D_k}(θ). For the quadratic loss, the risk is also known as the mean squared error, and the Bayes estimator is also known as the minimum mean squared error estimator.

Derivation of Eq. (1) in the main text
Let us show that Eq. (7) is equivalent to the Bayes risk defined in Eq. (1) in the main text. We find from Eq. (7)

r[θ̂_k|h, p(θ)] = E_θ E_{D_k|θ} [ ||θ̂_k(D_k) − θ||²_2 ] = E_{D_k} E_{θ|D_k} [ ||θ̂_k(D_k) − θ||²_2 ],   (9)

where we used the definition of the expected values together with Bayes' law, p(D_k|θ) = p(θ|D_k) p(D_k) / p(θ). Next, we insert the Bayes estimator θ̂_k(D_k) = E_{θ|D_k}(θ) in Eq. (9) in order to write the Bayes risk solely as a function of the heuristic h, which corresponds to Eq. (1) in the main text:

r[h|p(θ)] = E_{D_k} { tr[Cov_{θ|D_k}(θ)] }.   (10)

The Bayes risk in the presence of limited resources

As discussed in the main text, the available resources for experiment designs may lead to episodes with different numbers of experiments. A Bayes risk which corresponds to the situation that the available resources are exhausted can easily be generalized from Eq. (1) in the main text as

r[h|p(θ)] = E_{D_end} { tr[Cov_{θ|D_end}(θ)] },

where D_end denotes all measurement outcomes for an episode. This means the expectation E_{D_end} must be taken with respect to all data sets D_k which are compatible with the available resources.

Numerical Approximation of the Bayes risk
Let us first consider how to approximate Eq. (10), which is relevant for the experiment-limited case considered in our model study. The Bayes risk in Eq. (10) involves expected values over p(θ|D_k) and p(D_k). The posterior p(θ|D_k) is represented in the SMC framework by a particle filter with weights w_j and particle locations θ_j ∈ Θ, which allows us to approximate the expected value of a function f(θ) as E_{θ|D_k}[f(θ)] ≈ ∑_j w_j f(θ_j). The same approximation is used to compute the reward. Note that we set

tr[Cov_θ(θ)] = const   (14)

to a constant numerical value in order to avoid fluctuations of the reward originating from numerical uncertainties in the particle filter representing the prior p(θ).
The expected value over p(D_k) is approximated as a sample mean: We sample 10^4 episodes with the RL environment. To each episode corresponds a series of measurement outcomes D_k, for which we can calculate tr[Cov_{θ|D_k}(θ)]. Note that the properties of the RL environment ensure that an episode with data D_k is sampled with probability p(D_k). Then, we can approximate the expected value over p(D_k) as the mean (1/10^4) ∑_{D_k} tr[Cov_{θ|D_k}(θ)], where the sum runs over all 10^4 sampled series of measurement outcomes D_k.
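Putting the pieces together, this sample-mean approximation can be sketched end to end for the simplest setting: single-parameter ω estimation, the σ^{−1} heuristic for the times, and no Liu-West resampling. All names and simplifications below are ours; the paper's implementation [46] is more complete:

```python
import numpy as np

rng = np.random.default_rng(1)

def run_episode(n_exp=20, n_part=500, T2=np.inf):
    """One simulated episode: true omega ~ U(0,1), uniform particle prior
    over (0,1), evolution times from the sigma^{-1} heuristic, SMC weight
    updates; returns the final traced covariance tr Cov_{theta|D_k}(theta)."""
    omega_true = rng.uniform(0.0, 1.0)
    particles = rng.uniform(0.0, 1.0, n_part)
    weights = np.full(n_part, 1.0 / n_part)
    for _ in range(n_exp):
        mean = weights @ particles
        var = weights @ (particles - mean) ** 2
        t = 1.0 / np.sqrt(max(var, 1e-12))          # sigma^{-1} heuristic (guarded)
        p0_true = 0.5 + 0.5 * np.exp(-t / T2) * np.cos(omega_true * t)
        d = 0 if rng.uniform() < p0_true else 1     # simulate the measurement
        p0 = 0.5 + 0.5 * np.exp(-t / T2) * np.cos(particles * t)
        lik = p0 if d == 0 else 1.0 - p0
        weights = weights * lik
        weights = weights / weights.sum()           # Bayesian update of weights
    mean = weights @ particles
    return float(weights @ (particles - mean) ** 2)

def bayes_risk(n_episodes=200, **kwargs):
    """Sample-mean approximation of the Bayes risk over many episodes."""
    return float(np.mean([run_episode(**kwargs) for _ in range(n_episodes)]))
```

The returned mean approximates r[h|p(θ)] for this toy setting; since a valid Bayesian update cannot increase the expected posterior variance, the estimate should lie below the prior variance of 1/12.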
The time-limited case we consider in our model study is an example where episodes can have different numbers of experiments. Moreover, we defined the time-limited case such that an episode ends when the agent has consumed more time than the given time limit. This means that, with the last experiment, the RL agent (as well as the other heuristics) can go beyond the time limit. For the comparison of heuristics, this is not relevant, mainly because all other experiment designs must lie within the time limit.
For the time-limited case, it would be insightful to have a time-resolved Bayes risk in analogy to the experiment-limited case, where for every number of experiments we can compute a Bayes risk. Fig. 4 illustrates how we approximate such a time-resolved Bayes risk (in fact, we average over 10^4 episodes and use 200 equally spaced times for interpolation). This approximation of the time-resolved Bayes risk is used for the time-limited cases shown in Fig. 2 in the main text.

The combination of CEM and TRPO
While CEM takes into account only the accumulated reward of full episodes, RL uses all the training data, consisting of actions with corresponding observations and rewards. This allows TRPO to take into account the performance of single actions in order to improve the overall performance (at the end of episodes). If we pretrain TRPO with a CEM-trained NN (as a known heuristic), we obtain a purely machine-learning-based method for finding strong NN heuristics. The advantage of this two-step procedure over using only one of the algorithms is that we can use CEM for exploring the policy space (for our RL environments, CEM has proven better than TRPO in this respect) and TRPO for optimizing the heuristic further. A further speed-up, and possibly an improvement of this method, could be achieved by passing the neural network directly from CEM to TRPO (avoiding imitation learning). However, in the current implementation this is not possible because the neural networks used for CEM and TRPO have different shapes and are not compatible.

Cross-entropy method
The input layer of the NN is defined by the observation. The output layer is determined by the number of actions (one action: time) and we choose 16 neurons in the hidden layer. The layers are fully connected. The hidden layer has the rectified linear unit (ReLU) as its activation function and the output layer has the softmax function as its activation function [53].
A schematic representation of the algorithm is given in Fig. 1(d) in the main text. The weights of the neural network form a vector x. A generation is sampled from a Gaussian distribution x_i ∼ N(µ, Σ) with mean µ and fixed covariance Σ = 1/2 (times the identity). Before the first iteration, we sample µ from N(0, Σ). A generation consists of K = 100 individuals (= NNs). By running one episode (interacting with the RL environment) with each NN, we determine the K_e = 10 fittest individuals (x_1, …, x_{K_e}) as those with the largest reward accumulated over an episode. This concludes one iteration; the next generation is sampled from the distribution N(µ_new, Σ) with µ_new = (1/K_e) ∑_{j=1}^{K_e} x_j. We run CEM for N = 1000 iterations. The final solution is given by the µ_new calculated in the last iteration. CEM is used to find 5 NNs for each RL environment, and we plot results in Fig. 2 only for the NN heuristic which achieves the smallest Bayes risk after the maximum time or the maximal number of measurements.
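One CEM iteration as described above can be sketched as follows. The toy quadratic fitness function, the seed, and the low-dimensional weight vector are ours for illustration; in the paper, the fitness is the reward accumulated over one episode of the RL environment with the NN defined by x:

```python
import numpy as np

rng = np.random.default_rng(2)

def cem_step(mu, sigma2, fitness, n_pop=100, n_elite=10):
    """One cross-entropy-method iteration: sample a generation of weight
    vectors x_i ~ N(mu, sigma2 * I), evaluate their fitness, keep the
    n_elite fittest, and return the elite mean as the new search mean."""
    dim = mu.shape[0]
    population = mu + np.sqrt(sigma2) * rng.standard_normal((n_pop, dim))
    scores = np.array([fitness(x) for x in population])
    elite = population[np.argsort(scores)[-n_elite:]]  # largest accumulated reward
    return elite.mean(axis=0)

# toy fitness, maximized at x = (1, 1, ..., 1)
fit = lambda x: -np.sum((x - 1.0) ** 2)
mu = np.zeros(5)
for _ in range(30):
    mu = cem_step(mu, 0.5, fit)
```

With a fixed sampling covariance, as described above, the mean drifts toward high-fitness regions but retains exploration noise, which is precisely the property exploited for policy-space exploration before handing over to TRPO.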

Trust region policy optimization
We use TRPO [44] with the default multilayer perceptron policy as implemented in Stable Baselines 2.9.0 [40]. Hyperparameters which differ from their default values are γ = 1, λ = 0.92, and vf_stepsize = 0.0044. These hyperparameters were found to yield good results during initial testing of the algorithm, without rigorous hyperparameter tuning. In particular, we do not discount rewards, setting the discount factor γ = 1, because even without discounting the rewards are defined such that they tend to decay exponentially with the number of experiments. Further, smaller values of λ, such as λ = 0.92, often work well together with large γ.
Pretraining is also implemented in Stable Baselines 2.9.0 [40] using the Adam optimizer [54]. Deviations from the default parameters [40] are the following: we use 10^4 (10^3) episodes, sampled from a known heuristic, as an expert dataset. Pretraining runs for 10^4 (10^3) epochs (the number of training iterations on the expert dataset) with a batch size for the expert dataset of 100 (10); bracketed values are used for the time-limited (ω, T_2^{-1}) estimation to speed up the computation.
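What such pretraining does can be sketched minimally: the learner is fitted to expert (observation, action) pairs by minimizing a mean squared error (behavioral cloning). The sketch below uses a hypothetical linear policy and plain mini-batch gradient descent in place of the actual NN policy and the Adam optimizer; the expert weights are an invented stand-in for a known heuristic.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical expert: a linear map from observations to experiment times
true_w = np.array([0.5, -0.3, 0.2])
obs = rng.normal(size=(1000, 3))   # expert dataset: observations ...
actions = obs @ true_w             # ... and the expert's chosen actions

# behavioral cloning: fit the learner to the expert dataset by
# mini-batch gradient descent on the mean squared error
w = np.zeros(3)
batch_size, lr, epochs = 100, 0.1, 200
for _ in range(epochs):
    idx = rng.permutation(len(obs))
    for start in range(0, len(obs), batch_size):
        b = idx[start:start + batch_size]
        grad = 2.0 * obs[b].T @ (obs[b] @ w - actions[b]) / batch_size
        w -= lr * grad
```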
Training runs for 500 iterations. However, we use an exit condition which can stop the training earlier: training stops if the policy entropy drops below 0.005 [40].
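The early-stopping logic can be sketched as follows, with step_fn and entropy_fn as stand-ins for one training iteration and for querying the current policy entropy (here a toy entropy that decays geometrically):

```python
def train_with_entropy_exit(step_fn, entropy_fn, max_iters=500, threshold=0.005):
    """Run training iterations, stopping early once the policy entropy
    drops below the threshold; returns the number of iterations run."""
    for i in range(max_iters):
        step_fn()
        if entropy_fn() < threshold:
            return i + 1
    return max_iters

# toy stand-ins: the "policy entropy" decays geometrically from 1.0
state = {"H": 1.0}
n = train_with_entropy_exit(lambda: state.update(H=state["H"] * 0.9),
                            lambda: state["H"])
```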
TRPO is pretrained once per heuristic (PGH, σ^{-1}, the best of five CEM-trained NNs) and then trained 5 times for each of the pretrainings; in Fig. 2 we plot the results only for the best NN heuristic, i.e., the one with the smallest Bayes risk at the end of the episodes.

Supplementary Note 1: Comparison of experiment designs from different heuristics
Here we provide plots of the distribution of experiment designs, i.e., experiment times, chosen by the four adaptive heuristics discussed in the main text: the particle-guess heuristic (PGH), the σ^{-1} heuristic, a neural network trained with the cross-entropy method (CEM), and a neural network trained with trust region policy optimization (TRPO). The plotted experiment designs correspond precisely to the data used to calculate the Bayes risk in the main text (see Fig. 2 in the main text). For each estimation problem and each heuristic, the data consist of 10^4 episodes. Note that the distributions of experiment designs shown in Supplementary Figs. 5 and 6 do not reveal the adaptivity of the experiment designs. Instead, they show the frequency of experiment designs for successive experiments, i.e., which times are chosen for the first experiment of each episode (Bayesian estimation), which for the second experiment, and so on. The diversity we find between different experiment-design heuristics is remarkable. The diversity between different environments also underpins that experiment-design heuristics should be tuned to the particular estimation problem.

Table I shows the limits on the number of experiments and the available time for all estimation problems. Note that, for numerical convenience, time-limited (experiment-limited) problems also have a limit on the number of measurements (on the available time). In particular, for time-limited problems a limit on the number of experiments prevents an RL agent from choosing very many small experiment times, which would lead to very long episodes, i.e., very long run times. Also from a physical perspective, very small experiment times are not sensible due to overheads for readout and preparation. Such overheads can be taken into account as a dead time of the sensor between experiments, which can easily be implemented as part of the RL environments.
In the case of experiment-limited environments, the time limit of 10^27 avoids issues with large numerical values; it is chosen large enough for the exp-sparse heuristic with 500 experiments, which chooses exponentially increasing times.
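As a sketch of this bookkeeping, a hypothetical helper that checks an episode against both limits, including an optional per-experiment dead time for readout and preparation overheads, might look like:

```python
def within_limits(times, time_limit, max_experiments, dead_time=0.0):
    """Check whether an episode's chosen experiment times respect both the
    time budget (including a fixed dead time per experiment) and the
    limit on the number of experiments."""
    total = sum(t + dead_time for t in times)
    return len(times) <= max_experiments and total <= time_limit

# ten experiments of time 1.0 with dead time 0.5 use a total time of 15.0
ok = within_limits([1.0] * 10, time_limit=20.0, max_experiments=10, dead_time=0.5)
```

The function name, signature, and limit values are illustrative, not part of the actual implementation.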