Active Learning of Quantum System Hamiltonians yields Query Advantage

Hamiltonian learning is an important procedure in quantum system identification, calibration, and the successful operation of quantum computers. Through queries to the quantum system, this procedure seeks to obtain the parameters of a given Hamiltonian model and a description of noise sources. Standard techniques for Hamiltonian learning require careful design of queries and $O(\epsilon^{-2})$ queries to achieve learning error $\epsilon$, due to the standard quantum limit. With the goal of efficiently and accurately estimating the Hamiltonian parameters within learning error $\epsilon$ through minimal queries, we introduce an active learner that is given an initial set of training examples and the ability to interactively query the quantum system to generate new training data. We formally specify and experimentally assess the performance of this Hamiltonian active learning (HAL) algorithm for learning the six parameters of a two-qubit cross-resonance Hamiltonian on four different superconducting IBM Quantum devices. Compared with standard techniques for the same problem and a specified learning error, HAL achieves up to a $99.8\%$ reduction in queries required, and a $99.1\%$ reduction over the comparable non-adaptive learning algorithm. Moreover, with access to prior information on a subset of Hamiltonian parameters and given the ability to select queries with linearly (or exponentially) longer system interaction times during learning, HAL can exceed the standard quantum limit and achieve Heisenberg (or super-Heisenberg) limited convergence rates during learning.


I. INTRODUCTION
Hamiltonian learning constitutes the problem of learning the Hamiltonian governing the dynamics of a quantum system given finite classical and quantum resources. This is a fundamental problem encountered in the identification of quantum systems [1][2][3], the operation of quantum information devices [4,5], and the validation of theoretical physical models, and it has implications for computational bounds on quantum algorithms [6][7][8]. In the calibration of quantum computers alone, it is a significant step in each of the following tasks: system characterization, learning device parameters and noise sources, gate design [9], and devising control strategies for implementing robust, high-fidelity quantum gates. Moreover, a quantum computer typically requires frequent recalibration to account for drift in parameters over time, requiring multiple iterations of some Hamiltonian learning routine.
The resource requirements for learning a generic many-body Hamiltonian rise exponentially with the system size [10]. Even for a fixed system size, however, the achievable learning error $\epsilon$ is fundamentally limited by the number of queries N made to the quantum system. In particular, through repeated system queries, N in general scales as $\epsilon^{-2}$ as a consequence of the central limit theorem. This is commonly referred to as the standard quantum limit (SQL) or shot-noise limited scaling. Using quantum resources, however, a number of approaches have shown a much better Heisenberg limited scaling of $N \sim \epsilon^{-1}$. The Heisenberg limit is known to be fundamental [11][12][13][14][15] under a wide range of assumptions [16][17][18][19], but it is typically only saturated with the help of quantum resources.
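The SQL scaling can be illustrated with a small classical simulation. The following sketch uses a toy single-parameter model of our own devising (the probability model, parameter values, and function names are illustrative, not taken from this work): repeated single-shot queries yield an estimation error shrinking as $N^{-1/2}$, i.e., $N \sim \epsilon^{-2}$ queries for error $\epsilon$.

```python
# Illustrative sketch (our toy model, not this work's CR Hamiltonian): estimate
# a single coupling strength omega from repeated single-shot measurements with
# outcome probability p(1) = sin^2(omega * t / 2), and observe the estimation
# error shrink as ~ N^(-1/2).
import numpy as np

rng = np.random.default_rng(0)
omega_true, t = 1.3, 1.0  # hypothetical truth and evolution time

def estimate_omega(n_shots):
    """Invert the empirical outcome frequency to an estimate of omega."""
    p1 = np.sin(omega_true * t / 2) ** 2
    p1_hat = np.clip(rng.binomial(1, p1, size=n_shots).mean(), 1e-9, 1 - 1e-9)
    return 2 * np.arcsin(np.sqrt(p1_hat)) / t

for n in [100, 10_000, 1_000_000]:
    errs = [abs(estimate_omega(n) - omega_true) for _ in range(200)]
    print(n, np.mean(errs))  # mean error drops roughly 10x per 100x more shots
```

Averaging the error over repeated trials makes the $N^{-1/2}$ trend visible despite shot-to-shot fluctuations.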
This has been achieved using entanglement [20], such as with NOON states [21,22], but can also be accomplished without entanglement under certain circumstances. For example, in the problem of phase estimation [23], one has to estimate the phase $\phi$ in a unitary operator of the form $U = \exp(-i\phi H)$, where H is a Hermitian operator and $\phi$ can also be interpreted as the strength of a coupling in a Hamiltonian. It has been shown that Heisenberg limited scaling can be achieved by multi-round protocols using both adaptive measurements [24,25] and predetermined non-adaptive measurement sequences [26,27]. In contrast, there has not been such a detailed study of learning a general many-body or multi-parameter Hamiltonian. This motivates understanding how classical resources alone might be employed to solve the Hamiltonian learning problem with a scaling that surpasses the SQL, and ideally achieves the Heisenberg limit.
Even if scaling beyond the SQL cannot be achieved, it is still desirable to minimize resource requirements. This may be accomplished by changing the estimation procedure used for Hamiltonian learning in combination with engineered experiments. The Fast Fourier Transform (FFT) and linear regression are some of the traditional estimation methods that still form the powerhouse of modern Hamiltonian learning strategies [28]. It has been shown that adopting alternative estimation methods such as Bayesian estimation [29], stochastic estimation [30], and neural-network based Hamiltonian reconstruction [4] can reduce resource requirements and improve scalability. We refer to the reduction in resource requirements achieved by replacing one Hamiltonian learning strategy with another as the query advantage. This obviously depends on the two strategies being compared; we call the strategy being replaced the baseline. The baseline and our proposed replacements will be discussed in detail later. This brings us to the primary question we tackle in this work: what is a common framework for Hamiltonian learning that can achieve a query advantage using only classical resources, even when scaling beyond the SQL is not achieved?
One effective framework for surpassing central limit theorem bounds when possible and/or achieving a query advantage is active learning, e.g., using optimal experiment design. In [31], this idea was explored for quantum state tomography, process tomography and Hamiltonian learning given a model, but in an offline manner. It has also been used to reduce experiment budgets and propose different control schemes [32]. Active learning of a Hamiltonian is more challenging than active learning for quantum state tomography [33], where one optimizes over different measurements, or process tomography [34], where one additionally optimizes over initial states, due to the additional control parameter of the system's time evolution. Active learning thus provides a general framework for making adaptive queries to the quantum system, comprising initial state, measurements and system time evolution, during Hamiltonian learning. In fact, [35] has shown that with adaptive feedback control, Heisenberg limited scaling can be reached in principle, but a recipe for this is only given for the estimation of a single Hamiltonian parameter and the procedure requires prior information on the parameter. This was later extended to multi-parameter Hamiltonians by [36]. A common ingredient of these works, and of the earlier mentioned multi-round protocols for phase estimation, is trading the cost of physical quantum resources for cost in time resources [20].
In this work, we introduce a Hamiltonian active learning (HAL) algorithm based on the criterion of Fisher information, which measures the information content of different queries and naturally appears in the bound on the error achieved by a Hamiltonian learner. The resulting variant of HAL is called HAL-FI. We also introduce another variant of HAL for the task of predicting queries to the Hamiltonian, which uses the criterion of the Fisher information ratio (FIR); we call the resulting active learning algorithm HAL-FIR. We demonstrate the performance of HAL-FI experimentally on IBM Quantum devices, which are based on the superconducting cross-resonance (CR) gate. Compared with passive learning, which scales at the SQL, we show that HAL-FI with a fixed query space also has an asymptotic SQL scaling but achieves a constant reduction of 99.1% in the number of queries required for a desired learning error when learning the two-qubit CR Hamiltonian on a 20-qubit IBM Quantum device. HAL-FI can achieve up to a 99.8% reduction in the number of queries required when compared to the current standard methods for Hamiltonian learning, which use estimation procedures based on the FFT and regression. We finally show that allowing queries with exponentially growing system evolution times during learning suffices to achieve Heisenberg limited scaling with HAL-FI when prior information is available. This is another example of trading physical quantum resources for time resources, as highlighted before.
The paper is organized as follows. In Sec. II, we formally describe the problem of Hamiltonian learning and the concept of an active learner. In Sec. III, we present the HAL algorithms HAL-FI and HAL-FIR. To illustrate the performance of HAL-FI, we consider the example of calibrating CR gates on IBM Quantum devices. In Sec. IV, we describe our experimental setup, provide a theoretical description of the Hamiltonian model of the CR gate and physical models of the different noise sources affecting the quantum devices, and give details of our experiments evaluating the performance of HAL-FI. Finally, in Sec. V, we quantify the query advantage that can be obtained using HAL-FI and specify the conditions under which a Heisenberg limited, or even super-Heisenberg limited, rate of convergence can be achieved. Specifically for CR gates, we show that HAL-FI can learn an accurate Hamiltonian using only a fraction of the queries required by currently used methodologies, a reduction of two to three orders of magnitude for particular learning tasks.

II. HAMILTONIAN LEARNING
In this section, we present a formal description of the problem of Hamiltonian learning for a general quantum system (Section II A), and in the presence of different noise sources (Section II B). Further, we introduce the concept of an active learner in this context (Section II C). Notation used for this work will be defined as introduced and is also summarized in Appendix A.

A. Problem Statement
This subsection of the paper describes the unknown Hamiltonian of interest, specifies our query setting and formally describes the different Hamiltonian learning tasks.

Unknown Hamiltonian
Let H be a partially or fully unknown system Hamiltonian over n qudits (e.g., d = 2 for a qubit, d = 3 for a qutrit, etc.). H is a linear operator acting on the system's Hilbert space, which is the n-fold tensor product of the single-qudit Hilbert spaces $\mathbb{C}^d$ and carries a well-defined inner product. H can thus be represented by a $d^n \times d^n$ matrix and is constrained to be Hermitian.
We assume that we have a model for the unknown Hamiltonian and that it belongs to the family of models $\mathcal{H} = \{H(\theta)\,|\,\theta \in \Theta\}$, parametrized by the real vector $\theta \in \Theta \subseteq \mathbb{R}^m$. We use H(·) to denote the model formulation and Θ the space of parameters. The model may be derived from first principles based on an understanding of the physics of the quantum system, or may be a complete blackbox such as a neural network. We further assume that the system Hamiltonian is time-independent and that any unitary operator implemented on the system can be approximated by the evolution of H for a sufficiently long evolution time.
Let the unknown Hamiltonian of the quantum system be $H^*$ and the true Hamiltonian parameters be $\theta^*$. We refer to the quantum system whose Hamiltonian we wish to learn as an oracle that we can query and which returns measurement outcomes upon querying. We denote the random variable of a query to the oracle by x, and the resulting oracle output, corresponding to a single shot of the qubits being read out in the standard computational basis, by the measurement outcome random variable y. The pair of a query and its corresponding output is called an example and is denoted by (x, y). The alphabet of the query x is referred to as the query space, which we denote by Q. The distribution from which queries are sampled from Q is referred to as the query distribution and denoted by q. Commonly, $y \in Y = \{0,1\}^{n_r}$, where $n_r$ is the number of qubits being read out. Our goal is then to learn the parameters θ of the Hamiltonian to within error $\epsilon$ from examples of the form $\{(x^{(i)}, y^{(i)})\}_{i=1}^{N}$ while minimizing the number of queries made to the oracle, i.e., the query complexity N.

Specifying a Query
We first describe what we mean by a query before specifying the different learning objectives. A query comprises three parts: a measurement observable M, an initial state preparation operator U, and control parameters t. A schematic of how the query is used in a quantum circuit to evolve a quantum system is shown in Figure 1. In Figure 2, we show the oracle of a quantum device receiving queries of the form x = (M, U, t) and returning the corresponding outputs, each a single shot of the qubits on the device being read out, denoted by a binary string of length $n_r$: $y \in \{0,1\}^{n_r}$. The input to the quantum circuit is the zero state $|0\rangle^{\otimes n}$. Application of the preparation operator U to $|0\rangle^{\otimes n}$ creates an initial state before interaction with the system Hamiltonian H for time t. Finally, a measurement operator M is applied to the evolved state before measuring in the usual computational basis, denoted by the meter. The output is a single shot of the qubits being read out.
a. Measurement Observables The measurement observables $M \in \mathcal{M}$ specify the basis set $\{|\psi_m\rangle\}$ used to generate the final measurement results. The permissible set of observables is typically constrained to simple operators acting on just one qubit at a time, because that is experimentally realistic. It would be unrealistic to allow arbitrary operators: specific operators could be constructed to measure individual parameters of H directly, making the problem too simple and the measurement operators too hard to realize in practice. One standard option for specifying M is as a complete set of orthogonal projectors $\{|\psi_m\rangle\langle\psi_m|\}$ where $\sum_m |\psi_m\rangle\langle\psi_m| = I$. This is known as a projective measurement. For qubit systems, one example of such a measurement basis set is the computational basis.
b. Initial State Preparation Operators Together with the measurement observable, an initial state $|\psi_0\rangle$ must also be specified in order to determine a measurement result from H. This initial state can in principle be any unit vector in the Hilbert space, but much like for measurement observables, it only makes sense to allow a subset of possible states to be specified. We consider $|\psi_0\rangle = |00\cdots 0\rangle$ and allow for realistic single-qubit unitary transforms. Thus, a common initial state specification is to provide a unitary operator $U = U_{n-1} \otimes U_{n-2} \otimes \cdots \otimes U_1 \otimes U_0$ as a tensor product of single-qubit unitary transforms acting on each of the n qubits in the system. This then determines the U element of x. We denote the set of all considered preparation operators as $\mathcal{U}$.

c. Control Parameters The control parameters are the last element of x, and are typically a set of classical numbers. For example, one canonical control parameter is the time t for which H should be applied. Other control parameters may modulate interactions between qubits. In this work, we consider only the Hamiltonian evolution time t as a control parameter and denote the set of all possible evolution times as $\mathcal{T}$.
After defining the set of measurement operators $\mathcal{M}$, the set of preparation operators $\mathcal{U}$, and the set of Hamiltonian evolution times $\mathcal{T}$, the resulting query space is given by $Q = \mathcal{M} \times \mathcal{U} \times \mathcal{T}$. We require Q to be complete, i.e., for each Hamiltonian parameter $\theta_i$ there must exist queries in Q that are informative about it, so that Hamiltonian learning can succeed. As we will see in our discussion of active learning in Section II C 2, completeness of Q is equivalent to the existence of a set of queries, or a query distribution, for which the resulting Fisher information matrix is full rank and invertible.
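The Cartesian-product structure of the query space can be sketched concretely. In the following minimal example, the basis labels, pulse names and evolution times are illustrative placeholders of our own choosing, not this work's experimental settings:

```python
# A minimal sketch of the query space Q = M x U x T and the uniform query
# distribution used by a passive learner. All set contents are hypothetical.
from itertools import product

M_set = ["Z", "X", "Y"]               # single-qubit measurement bases
U_set = ["I", "X90", "Y90"]           # single-qubit preparation operators
T_set = [0.1e-6, 0.2e-6, 0.4e-6]      # Hamiltonian evolution times (s)

Q = list(product(M_set, U_set, T_set))     # queries x = (M, U, t)
q_uniform = {x: 1.0 / len(Q) for x in Q}   # uniform query distribution over Q

assert len(Q) == len(M_set) * len(U_set) * len(T_set)
```

An active learner would replace `q_uniform` with a distribution that it re-weights as its parameter estimate improves.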
FIG. 2. Schematic of Hamiltonian learning with an active learner: The oracle constitutes the unknown Hamiltonian and noise sources, with the true parameter vector $\theta^*$ which is unknown and needs to be learned. Here, we show the device coupling map of the 20-qubit IBM Quantum device ibmq_boeblingen. Learning is carried out on training examples of the form (x, y), where x are queries inputted to the oracle specifying the measurement observable M, preparation operator U and system evolution time t, and y are the corresponding measurement outcomes outputted by the oracle. An illustrative quantum circuit picture of the oracle in the noiseless case is shown in Figure 1. The set of queries inputted to the oracle is denoted as X and the corresponding set of measurement outcomes outputted by the oracle as Y. An estimation procedure is run on the training examples (X, Y) to learn a model parameter estimate $\hat\theta$. The complete top row corresponds to how a passive learner operates and is also called open-loop Hamiltonian learning. We add a feedback loop to introduce an active learner, which uses the current estimate of the model parameters $\hat\theta$ to prescribe the distribution $\hat q$ from which queries to the oracle should be sampled next. This process is repeated during learning until an accurate estimate $\hat\theta$ is obtained. The Hamiltonian Active Learning (HAL) algorithm introduced in this work comprises the estimation procedure and active learner shown in this schematic. The active learner is described in Section II C and the HAL algorithm in Section III.

Learning Framework
Having fully described a query, we are now in a position to formalize the problem of Hamiltonian learning. While doing so, we draw parallels to and introduce language from machine learning and statistical learning theory.
We are given a query space $Q = \mathcal{M} \times \mathcal{U} \times \mathcal{T}$ constructed as discussed above and a label (or measurement outcome) space $Y = \{0,1\}^{n_r}$. The labels $y \in Y$ are not deterministic but probabilistic in nature. We consider the class of Hamiltonian models $\mathcal{H}$ which map Q to Y. In particular, we assume the availability of a model description $\mathcal{H} = \{H(\theta)\,|\,\theta \in \Theta\}$ parametrized by the vector $\theta \in \Theta \subseteq \mathbb{R}^m$, where Θ is the space of parameters. When no prior information is available to constrain the parameter space, we can consider $\Theta = \mathbb{R}^m$. We denote the true Hamiltonian as $H^*$ and assume that $H^* \in \mathcal{H}$, i.e., the model is realizable. We denote the parameter vector of the true Hamiltonian as $\theta^*$. Our goal is to learn an estimate of $\theta^*$, which we denote by $\hat\theta$. With regards to notation, we use θ whenever we make statements that are true for any parameter in Θ.
For the parameter vector θ, the label (or measurement outcome) y given query x = (M, U, t) is produced, in the absence of noise, with the conditional probability given by Born's rule:

$$p_{y|x}(y|x;\theta) = \sum_{z} \left| \langle y, z|\, M\, e^{-iH(\theta)t}\, U\, |0\rangle^{\otimes n} \right|^2, \quad (1)$$

where the summation is over the hidden measurement outcomes, denoted z, of the $(n - n_r)$ qubits that are not read out, and $|\cdot|$ denotes the absolute value. Given the dataset of N training examples $D = \{(x^{(i)}, y^{(i)})\}_{i \in [N]}$, the goal is to learn the parameter vector θ which characterizes the conditional probability (Eq. 1) mapping Q probabilistically to Y. This is a supervised learning task for which a suitable loss function is

$$\mathcal{L}(\theta; D) = L(\theta; D) + R(\theta), \quad (2)$$

where L is a suitable error function for assessing the chosen model on the training dataset, and R is a regularization function penalizing model complexity. The former is usually an $\ell_p$-norm. The latter is chosen to incorporate prior information, enforce conditions such as sparsity through the $\ell_1$-norm, and improve generalization to unseen data. The parameter estimate is determined by minimizing the loss function: $\hat\theta = \mathrm{argmin}_{\theta \in \Theta}\, \mathcal{L}(\theta; D)$. After obtaining an estimate $\hat\theta$, it is common to evaluate its performance against a testing dataset containing examples not seen during training. This testing dataset, which we denote as $D_{\mathrm{test}} = \{(x^{(i)}, y^{(i)})\}_{i \in [N_{\mathrm{test}}]}$, contains $N_{\mathrm{test}}$ queries and their corresponding outputs. The $N_{\mathrm{test}}$ queries are sampled from the query space Q according to a testing distribution which we denote by $p_{\mathrm{test}}$. This allows us to further distinguish between the problems of model inference and prediction against $p_{\mathrm{test}}$. In model inference, the goal is to learn a parameter estimate $\hat\theta$ without making special considerations for the testing distribution $p_{\mathrm{test}}$, which may change or may be unknown. We use the root mean squared error (RMSE) as L in Eq. 2:

$$\mathrm{RMSE}(\hat\theta; \xi) = \left( \sum_{i=1}^{m} \mathbb{E}\left[ \left( \frac{\hat\theta_i - \theta^*_i}{\xi_i} \right)^2 \right] \right)^{1/2}, \quad (3)$$

where $\theta^*$ is the parameter vector corresponding to the unknown Hamiltonian $H^*$, and ξ is the vector used to nondimensionalize and normalize θ.
This is done to account for the different relative magnitudes of the parameter components $\theta_i$ and to ensure that the RMSE is not dominated by the contributions of a few parameter components. These normalization factors ξ can be obtained during estimation or be available through prior information. Moreover, we restrict estimators to be unbiased; this is commonly desired, and a universal optimal estimator that minimizes the RMSE cannot generally be defined. Under this condition, L reduces to the square root of the variance. A suitable consistent estimator for reducing the variance is then the maximum likelihood estimator (MLE), which determines an estimate of the parameters $\hat\theta$ from the data D through

$$\hat\theta = \mathrm{argmin}_{\theta \in \Theta} \sum_{i=1}^{N} -\log p_{y|x}\left(y^{(i)}|x^{(i)}; \theta\right), \quad (4)$$

where $-\log p_{y|x}(y^{(i)}|x^{(i)}; \theta)$ is the negative log-likelihood of the measurement outcome $y^{(i)}$ given the query $x^{(i)}$ under model parameters θ. Throughout this work, we often denote the negative log-likelihood in shorthand by $L(y|x; \theta)$ for a general measurement outcome y and query x given model parameters θ.
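The Born-rule likelihood and the MLE can be sketched end-to-end for a toy model. The single-qubit Hamiltonian $H(\theta) = \theta X/2$ below is our own illustrative choice, not this work's CR model; with a single qubit read out there are no hidden outcomes z to sum over, and we use a simple grid search in place of a full optimization procedure.

```python
# Hedged sketch of the Born-rule likelihood and the MLE objective for a
# hypothetical single-qubit model H(theta) = theta * X / 2, queried at a few
# evolution times with a fixed measurement basis and trivial preparation.
import numpy as np
from scipy.linalg import expm

X = np.array([[0, 1], [1, 0]], dtype=complex)

def p1(t, theta):
    """Born-rule probability of outcome 1 for query x = (Z basis, identity prep, t)."""
    psi = expm(-1j * (theta / 2) * X * t) @ np.array([1, 0], dtype=complex)
    return abs(psi[1]) ** 2

rng = np.random.default_rng(1)
theta_true, times, shots = 2.0, [0.3, 0.7, 1.1], 500
counts = {t: rng.binomial(shots, p1(t, theta_true)) for t in times}

def nll(theta):
    """Negative log-likelihood of the observed counts, the MLE objective."""
    total = 0.0
    for t in times:
        p = np.clip(p1(t, theta), 1e-12, 1 - 1e-12)
        total += -counts[t] * np.log(p) - (shots - counts[t]) * np.log(1 - p)
    return total

grid = np.linspace(0.1, 4.0, 1000)
theta_hat = min(grid, key=nll)  # grid-search MLE, close to theta_true
```

Querying at several evolution times is what makes θ identifiable here; a single time would leave periodic ambiguities in the likelihood.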
In prediction against a testing distribution $p_{\mathrm{test}}$, the goal is to learn an estimate $\hat\theta$ that will allow us to predict well the likelihood of measurement outcomes given queries sampled from $p_{\mathrm{test}}$. In such a scenario, L is still the RMSE, but the performance of the estimate is assessed through the testing error, which we have chosen as the expectation, with respect to the testing distribution $p_{\mathrm{test}}$, of the difference in negative log-likelihood when using the estimate $\hat\theta$ and the truth $\theta^*$:

$$\mathbb{E}_{x \sim p_{\mathrm{test}}}\, \mathbb{E}_{y}\left[ L(y|x; \hat\theta) - L(y|x; \theta^*) \right].$$

This difference in negative log-likelihood is also often called the log-likelihood ratio [37]. It is then desired to learn $\hat\theta$ such that the testing error is lower than a given error $\epsilon$. The error here may need to be specified relative to how the negative log-likelihood scales, which depends on the Hamiltonian of interest and the queries themselves.
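When a single qubit is read out, each query yields a Bernoulli outcome, and if the expectation over outcomes is taken under the true model, the expected log-likelihood-ratio testing error reduces to a $p_{\mathrm{test}}$-weighted KL divergence. The following sketch evaluates that quantity; all probability values are made-up illustrative numbers.

```python
# Hedged sketch: the testing error as the expected difference in negative
# log-likelihood between estimate and truth, with the outcome expectation
# taken under the true model. For Bernoulli outcomes this is a
# p_test-weighted KL divergence. All numbers are illustrative.
import numpy as np

def bernoulli_kl(p, q):
    """E_{y~p}[log p(y) - log q(y)] = KL(p || q) for a single binary outcome."""
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p_test = np.array([0.5, 0.3, 0.2])      # testing distribution over three queries
p_truth = np.array([0.80, 0.35, 0.60])  # p(y=1|x; theta*) under the truth
p_model = np.array([0.78, 0.40, 0.55])  # p(y=1|x; theta_hat) under the estimate

testing_error = float(np.sum(p_test * bernoulli_kl(p_truth, p_model)))
```

The testing error is non-negative and vanishes exactly when the estimate reproduces the true outcome probabilities on the support of $p_{\mathrm{test}}$.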
We note that Eq. 4 is an optimization problem that may be solved in several ways. The different programs or algorithms to solve the optimization problem are commonly called optimization procedures or estimation procedures. Different procedures have different properties such as rate of convergence and optimality such as guarantee of a local or global minimum. In order to improve convergence or search direction, it is common for procedures to require first or second gradient information. The estimation procedure will need to be specified along with the formulated optimization problem at hand for a complete specification of the learning algorithm.
Given N training examples, we say that we have succeeded at model inference (prediction) if we are able to produce an estimate $\hat\theta$ such that the RMSE (testing error) is bounded by some error parameter $\epsilon$. We want to accomplish these learning tasks while minimizing the resource requirements, i.e., the number of queries N. This is formalized in the following respective problem statements.
Problem II.1 (Model Inference). Suppose we are given access to a quantum system with an unknown time-independent $2^n \times 2^n$ dimensional Hamiltonian $H^*$ over n qubits, parameterized by $\theta^*$ and the vector of normalization factors ξ. Suppose we fix the error parameter $\epsilon > 0$. What is then the number of queries $x \in Q$ (where each query is of the form $x \in \{(M, U, t)\,|\,M \in \mathcal{M}, U \in \mathcal{U}, t \in \mathcal{T}\}$) required to learn an estimate $\hat\theta$ of the true parameters $\theta^*$ such that the root mean squared error satisfies $\mathrm{RMSE}(\hat\theta; \xi) \leq \epsilon$?
Problem II.2 (Prediction Against Testing Distribution). Suppose we are given access to a quantum system with an unknown time-independent $2^n \times 2^n$ dimensional Hamiltonian $H^*$ over n qubits, parameterized by $\theta^*$. Suppose we fix the error parameter $\epsilon > 0$. What is then the number of queries $x \in Q$ (where each query is of the form $x \in \{(M, U, t)\,|\,M \in \mathcal{M}, U \in \mathcal{U}, t \in \mathcal{T}\}$) required to learn an estimate $\hat\theta$ of the true parameters $\theta^*$ such that the testing error satisfies $\mathbb{E}_{x \sim p_{\mathrm{test}}}\, \mathbb{E}_{y}[ L(y|x; \hat\theta) - L(y|x; \theta^*) ] \leq \epsilon$?

We distinguish between the two problems of model inference and prediction against $p_{\mathrm{test}}$ as the ideal training datasets for these two learning tasks may come from two different distributions. This will influence how the training data is chosen when we introduce an active learner into Hamiltonian learning, as we will discuss in Section III. The performance of the Hamiltonian learning algorithms in tackling Problem II.2 will also be useful for cross-validation and testing the robustness of these methods.

B. Learning in the Presence of Noise
Previously, in Section II A, we gave formal statements of the Hamiltonian learning problem in the absence of noise assuming an ideal oracle. However, the oracle is usually noisy due to the presence of different noise sources affecting the quantum system. In this section, we describe the problem of learning in the presence of these noise sources.
Common noise sources include readout noise, decoherence, and imperfect control of the quantum system. State preparation and measurement (SPAM) errors are already accounted for by considering these noise sources. Classical SPAM errors encountered are included in the readout noise and the errors in implementing state preparation or measurement operators fall under imperfect control.
a. Readout Noise The readout line of the qubit measurement outcomes y is also a classical communication channel and hence suffers from bit-flip errors. The readout noise can then be modeled as a classical bit-flip channel in which the true measurement outcome $y_i \in \{0,1\}$ of the ith qubit may be flipped. We denote the observed measurement outcome random variable by $\tilde y$, which can be interpreted as a noisy observation of y, and denote the conditional probability of observing $\tilde y$ given the hidden measurement outcome y as $p_{\tilde y|y}(\tilde y|y)$. Note that in general we expect the readout channel to be asymmetric, i.e., $p_{\tilde y|y}(1|0) \neq p_{\tilde y|y}(0|1)$, and this is accounted for in the readout noise model. The probability of observing a noisy measurement outcome $\tilde y_i$ of the ith qubit given query x is then

$$p_{\tilde y|x}(\tilde y_i|x; \theta) = \sum_{y_i \in \{0,1\}} p_{\tilde y|y}(\tilde y_i|y_i)\, p_{y|x}(y_i|x; \theta),$$

where $p_{y|x}(y|x; \theta)$ is given by Eq. 1.

b. Imperfect Control Strategies In general, a Hamiltonian can be decomposed as

$$H = H_d + \sum_{k} u_k(t) H_k,$$

where $H_d$ is the drift/free part of the Hamiltonian that is internal to the system and cannot be controlled externally.
The terms $H_k$ correspond to the parts of the Hamiltonian that can be controlled through the control functions $u_k(t)$.
Depending on the quantum architecture, the control function can be realized as microwave pulses for superconducting qubits, optical pulses for ion traps, etc.
Target operators $U_f$ are then obtained starting from the identity matrix I by implementing the controls $u_k(t)$ over the time duration $[0, T]$:

$$U_f = \mathcal{T} \exp\left(-i \int_0^T \Big(H_d + \sum_k u_k(t) H_k\Big)\, dt\right),$$

where $\mathcal{T}$ is the time-ordering operator. These target operators $U_f$ can be preparation operators, measurement operators or different gates, e.g., the Clifford gates. Imperfections in these pulses will lead to errors in $U_f$. For example, the strength of the qubit-qubit interactions is sensitive to variations in the pulse amplitudes. Strong driving can lead to leakage of states outside of the computational subspace. Moreover, bandwidth effects or dispersion can cause leading or trailing edge distortions in the pulse shapes that lead to errors in the implemented unitary operator.

c. Decoherence Let us consider the unitary operator $U(t) = e^{-iHt}$. On a real quantum system, the application of this unitary operator is accompanied by decoherence due to interactions with the environment. We model this as a depolarizing channel $\mathcal{E}$:

$$\mathcal{E}(\rho(t)) = (1 - p_d(t))\, \rho(t) + p_d(t)\, \frac{I}{2^n},$$

where $\rho(t) = U(t)\rho(0)U(t)^\dagger$ is the state obtained on application of the unitary $U(t)$ to the initial state $\rho(0)$, $p_d(t)$ is the probability of the state being depolarized, and $I/2^n$ is the maximally mixed state. We have assumed the case of complete depolarization here. To obtain a functional form of $p_d(t)$, we note that up to first order, depolarization events can be assumed to arrive under a Poisson process with rate µ. By a depolarization event, we refer to the occurrence of an error that completely randomizes the quantum state. As the time between Poisson events follows an exponential distribution, we can write

$$p_d(t) = 1 - e^{-\mu (t - t_0)},$$

where $t_0$ denotes the starting time of the experiment. Denoting the probability of measurement outcomes y given a query x under depolarization errors as $p_{\mathcal{E}}(y|x; \theta)$, we have

$$p_{\mathcal{E}}(y|x; \theta) = (1 - p_d(t))\, p_{y|x}(y|x; \theta) + \frac{p_d(t)}{2^{n_r}},$$

where $p_{y|x}(y|x; \theta)$ is the probability of measurement outcome y in the absence of depolarization, given by Eq. 1.
The rate µ can be related to the amplitude relaxation time $T_1$ and dephasing time $T_2$ of the qubits on the quantum system. We will describe this in Section IV C for the IBM Quantum devices considered for application of Hamiltonian learning. Let the collective set of parameters associated with the different noise models described so far be denoted by ζ. The measurement outcomes y from the oracle will then be a function of the queries x and the true parameters $(\theta^*, \zeta^*)$. In order to obtain an estimate $\hat\zeta$ in addition to $\hat\theta$, we solve the following modified optimization problem over the training examples:

$$(\hat\theta, \hat\zeta) = \mathrm{argmin}_{\theta, \zeta} \sum_{i=1}^{N} -\log p_{\tilde y|x}\left(\tilde y^{(i)}|x^{(i)}; \theta, \zeta\right).$$

However, this increases the computational cost of the estimation procedure due to the increase in the number of parameters and the corresponding search space. In practice, prior calibration data can be used to obtain an estimate of ζ, which is then used in Eq. 2 to obtain $\hat\theta$. Hence, the Hamiltonian learning Problems II.1 and II.2 can be restated assuming additional information of the noise model parameters $\hat\zeta$ when we have access to a noisy oracle. When describing our application of Hamiltonian learning, we will describe both the Hamiltonian of interest in Section IV A and the specific noise sources in Section IV C.
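At the level of outcome probabilities, the depolarization and readout-noise maps compose straightforwardly. The following sketch applies both maps to a single-qubit outcome probability; the rate, evolution time and flip probabilities are illustrative placeholders, not calibrated device values.

```python
# Hedged sketch of the two probability-level noise maps described above:
# depolarization with p_d(t) = 1 - exp(-mu * t) and an asymmetric readout
# bit-flip channel. All rates and flip probabilities are made-up placeholders.
import numpy as np

def depolarize(p_ideal, t, mu, n_r=1):
    """Mix the ideal outcome probability toward uniform with weight p_d(t)."""
    p_d = 1.0 - np.exp(-mu * t)
    return (1.0 - p_d) * p_ideal + p_d / 2 ** n_r

def readout_flip(p1, p01, p10):
    """Probability of observing 1, where p01 = p(1|0) and p10 = p(0|1) flips."""
    return (1.0 - p1) * p01 + p1 * (1.0 - p10)

p1_ideal = 0.8                                       # noiseless p(y=1|x; theta)
p1_depol = depolarize(p1_ideal, t=1.0e-6, mu=2.0e4)  # mild depolarization
p1_obs = readout_flip(p1_depol, p01=0.02, p10=0.05)  # asymmetric readout noise
```

Both maps are contractions toward less informative distributions, which is why long evolution times and noisy readout reduce the information a query carries about θ.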

C. Active Learning of Hamiltonians
This subsection introduces the concept of active learning and describes how an active learner can be used in the context of Hamiltonian learning. We begin with a quick overview of the different types of learners, and describe how active learning differs from passive learning and online learning. This is followed by an overview of the different active learning (AL) strategies that have been proposed in the literature, and a discussion of which AL strategy is best suited for Hamiltonian learning; strategies based on the Fisher information (FI) for Problem II.1 and the Fisher information ratio (FIR) for Problem II.2 are notably appropriate. In doing so, we establish criteria for evaluating the performance of an AL algorithm, and observe that AL algorithms are known to solve some learning problems faster than passive learners. Moreover, under some circumstances, AL algorithms can exceed the central limit theorem bounds.

Types of Learners
An overview of different learners often considered for learning tasks is given below. We will use the passive learner as a baseline and the active learner is the focus of our work. A description of an offline learner and online learner is given for completeness and to distinguish an active learner from them.
a. Offline Learner In offline learning, all of the training data is given to the learner at once and a model is learned. The training dataset may be made available to the learner directly, or collected by sampling queries from an arbitrary distribution and obtaining the oracle's outputs on those queries. This is the paradigm under which most machine learning tasks operate.
b. Online Learner In online learning, the training data is made available in a sequence, typically one example at a time by a referee. A query $x_t$ is made available to the learner at the t'th round, after which the learner constructs an estimate $\hat y_t$ of the output for this query. The learner provides $\hat y_t$ to the referee, and suffers a loss that depends on $\hat y_t$ and the actual output $y_t$. The learner is then provided with feedback by the referee, which the learner can use to update its model. The queries provided to the learner by the referee may be adversarial or adaptive to the learner's behavior; the learner has no control over the query distribution from which they arrive.
c. Passive Learner In the learning problems described in Sec. II A, we did not specify the distribution from which the queries $x$ are sampled. In open-loop Hamiltonian learning, the query distribution remains fixed during learning and all queries to the oracle are sampled from it. When no prior information is available, it is common to set the query distribution to the uniform distribution over the query space $Q$. We refer to this setting as passive learning throughout this paper. Combined with a specification of an estimation procedure, the passive learner will serve as a baseline for the active learner, which we introduce in the next section.
d. Active Learner In active learning (AL), the learner has access to the query space $Q$ and the ability to select queries, or decide the query distribution, during training using the current estimate $\hat{\theta}$ of the model parameters. This is accomplished by introducing feedback into the open-loop Hamiltonian learning approach shown earlier in Figure 2. Based on the current estimate $\hat{\theta}$ and the queries made so far, together with their respective outcomes, the active learner proposes a query distribution from which the next queries to the oracle are selected. These queries may be sent to the oracle sequentially, one at a time, or in batches. In Section II C 2, we discuss different criteria used for selecting queries or proposing query distributions.
Here, we distinguish a passive learner from an offline learner only in the number of rounds of training data collection. An offline learner is given access to a complete training dataset or is allowed to collect it through queries to an oracle in one round. On the other hand, a passive learner has continued access to the oracle and is allowed to collect training data until the learning task has been accomplished. The primary difference between the active learner and the online learner is that the former has control over the query distribution from which it will select queries to input to the oracle.

Query Criteria for Active Learner
Multiple criteria have been proposed for query selection; they are usually subdivided into the two categories of informativeness and representativeness. Criteria based on informativeness aim to select queries that will reduce the uncertainty of the statistical model, and include uncertainty sampling [38,39], query-by-committee [40,41], and margin [42]. The goal of representativeness [43,44], on the other hand, is to select queries that exploit the structure of the underlying distribution and are diverse. Combinations of the informativeness and representativeness criteria have also been explored [45,46].
Multiple query criteria used in active learning in practice are based on heuristics and empirical evidence [47]. Here, we choose the informativeness criteria of Fisher information (FI) and Fisher information ratio (FIR) as they have direct relationships with the different learning problems we introduced in Sec. II A. This allows us to provide guarantees on the performance of our AL strategy. Later, we discuss how we can ensure that representative queries are also selected.
We introduce some notation before discussing the query criteria for our AL strategy. Let us denote the Fisher information matrix of a particular query $x \in Q$ as $I_x(\theta)$, where the $(i,j)$th element of the matrix is given by
$$[I_x(\theta)]_{ij} = \mathbb{E}\left[\frac{\partial \log p_{y|x}(y|x;\theta)}{\partial \theta_i}\,\frac{\partial \log p_{y|x}(y|x;\theta)}{\partial \theta_j}\right],$$
and where the expectation is taken with respect to $p(y|x;\theta)$. The Fisher information matrix is equivalently written as $I_x(\theta) = \mathbb{E}[SS^T]$ where $S = \partial \log p_{y|x}(y|x;\theta)/\partial \theta$ is commonly called the score vector. Instead of selecting one query at a time, we often require the active learner to select a distribution $q$ over the query space $Q$, which we will call the query distribution. The Fisher information matrix corresponding to $q$ is then given by
$$I_q(\theta) = \sum_{x \in Q} q(x)\, I_x(\theta),$$
where the summation can be replaced by an integral in the case of a continuous query space. If the parameters describing the Hamiltonian model satisfy $\theta \in \mathbb{R}^m$, then $I_q \in \mathbb{R}^{m \times m}$. Let us now describe the different query criteria in the context of the different learning tasks.
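As an illustration of these definitions (our own toy code, not the paper's model), the Fisher information of a single query can be computed numerically from any likelihood; here we use a hypothetical single-parameter Rabi-type model $p(1|x;\theta) = \sin^2(\theta x)$ with the score obtained by finite differences:

```python
import numpy as np

def fisher_information(x, theta, prob, eps=1e-6):
    """I_x(theta) = E[S S^T] for a binary-outcome model, with the score
    S = d log p(y|x; theta) / d theta computed by central differences.
    `prob(y, x, theta)` stands in for p(y|x; theta)."""
    m = len(theta)
    info = np.zeros((m, m))
    for y in (0, 1):
        score = np.zeros(m)
        for i in range(m):
            tp, tm = theta.copy(), theta.copy()
            tp[i] += eps
            tm[i] -= eps
            score[i] = (np.log(prob(y, x, tp)) - np.log(prob(y, x, tm))) / (2 * eps)
        info += prob(y, x, theta) * np.outer(score, score)
    return info

# Hypothetical model: p(1 | x; theta) = sin^2(theta[0] * x)
def prob(y, x, theta):
    p1 = np.sin(theta[0] * x) ** 2
    return p1 if y == 1 else 1.0 - p1

info = fisher_information(x=1.0, theta=np.array([0.4]), prob=prob)
```

For this particular model the Fisher information works out to $4x^2$ for any $\theta$ away from the degenerate points, so a query with a longer evolution "time" $x$ is quadratically more informative, a preview of the role of evolution times later in the paper.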
a. RMSE of Parameters If the learning objective is to learn the parameters with small RMSE (Problem II.1), a natural query optimization strategy is obtained by noting the Cramér-Rao bound for unbiased estimators [48]:
$$\mathrm{Cov}(\hat{\theta}) \succeq \frac{1}{N}\, I_q^{-1}(\theta).$$
Combining the Cramér-Rao bound with the fact that, for an unbiased estimator such as MLE, the parameter estimate converges in probability $\hat{\theta} \to \theta$ and the distribution converges in law as $\sqrt{N}(\hat{\theta} - \theta) \to \mathcal{N}(0, I_q^{-1}(\theta))$ [49,50], we get that MLE is an asymptotically efficient estimator with efficiency equal to the Fisher information over the training examples [37]. An optimal query distribution can then be obtained through the following query optimization:
$$q^\star = \arg\min_{q \in P} \mathrm{Tr}\left(I_q^{-1}(\theta)\right), \quad (17)$$
where $P$ is the family of all valid probability distributions over the query space $Q$. We also note that for $q^\star$, $I_q$ must necessarily be invertible and hence full rank. This ensures that the queries inform us about all the Hamiltonian parameters of interest. Also, for an unbiased and consistent estimator, the Cramér-Rao bound is likely to be saturated in the limit of a large number of queries.
b. Testing accuracy against testing distribution When the learning objective is to minimize the expected log-likelihood error against a testing distribution $p_{test}$ (Problem II.2), we use an active learning strategy based on the Fisher information ratio (FIR), $\mathrm{Tr}(I_q^{-1}(\theta) I_{p_{test}}(\theta))$. The name FIR comes from the scalar case, where it can be viewed as a ratio of the Fisher information corresponding to the query distribution and that corresponding to the testing distribution. The use of FIR for this learning task can be motivated by an inequality [37] whose left-hand side is the expected variance of the asymptotic distribution of the log-likelihood ratio, which can be viewed as a testing error, and whose right-hand side involves the FIR. Minimizing the upper bound then allows us to control the testing error, which suggests using the following query distribution:
$$q^\star = \arg\min_{q \in P} \mathrm{Tr}\left(I_q^{-1}(\theta)\, I_{p_{test}}(\theta)\right). \quad (19)$$
We note that the Fisher information ratio is related to the Fisher information matrices of the query distribution $I_q$ and testing distribution $I_{p_{test}}$ through the inequality
$$\mathrm{Tr}\left(I_q^{-1}(\theta)\, I_{p_{test}}(\theta)\right) \le \mathrm{Tr}\left(I_q^{-1}(\theta)\right)\, \mathrm{Tr}\left(I_{p_{test}}(\theta)\right).$$
Thus, we recover the query optimization of Eq. 17 when the testing distribution is unknown.
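For a discrete query space, the FIR is directly computable from the per-query Fisher information matrices; the sketch below (our illustrative helper, with made-up diagonal matrices) evaluates $\mathrm{Tr}(I_q^{-1} I_{p_{test}})$:

```python
import numpy as np

def fisher_of_distribution(p, I_per_query):
    """I_p = sum_x p(x) I_x(theta) for a discrete distribution p over Q."""
    return sum(px * Ix for px, Ix in zip(p, I_per_query))

def fisher_information_ratio(q, p_test, I_per_query):
    """FIR = Tr(I_q^{-1} I_{p_test}); smaller means a better query distribution."""
    I_q = fisher_of_distribution(q, I_per_query)
    I_t = fisher_of_distribution(p_test, I_per_query)
    return float(np.trace(np.linalg.solve(I_q, I_t)))

# Two hypothetical queries, each mostly informative about one parameter
I_per_query = [np.diag([4.0, 1.0]), np.diag([1.0, 4.0])]
fir_uniform = fisher_information_ratio([0.5, 0.5], [1.0, 0.0], I_per_query)
fir_skewed = fisher_information_ratio([0.9, 0.1], [1.0, 0.0], I_per_query)
```

Here the test distribution concentrates on the first query, so skewing $q$ toward it lowers the FIR (from 2.0 to about 1.85 in this example), exactly the trade-off the optimization of Eq. 19 automates.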
In quantum tomography, such query criteria have been applied in optimal experiment design (OED) or adaptive quantum tomography. Fisher information has been used as a query criterion for offline OED in quantum state tomography [33]. Even earlier, an active learner based on Shannon entropy, i.e., maximum uncertainty sampling, was considered in [51]. An AL strategy based on Shannon information combined with Bayesian estimation was proposed in [52] for the selection of measurement operators during quantum state tomography. Fisher information was again used in [34], where OEDs were analyzed for a family of qubit channels over different design problems. In Hamiltonian learning, Fisher information has been used to comment on heuristic strategies for OED [53] and has been combined with Bayesian estimation to produce an active learner [54].

Active Learning Strategy
Implicit in the above descriptions of the query criteria is the idea of proposing a query distribution rather than selecting one query at a time during active learning. This demarcates sequential active learning, where one query is chosen at a time, from batch-mode active learning, where a batch of queries sampled from a query distribution is input to the oracle. Moreover, combining the FI/FIR query criteria with batch-mode active learning [55] ensures that representative queries are chosen as well.
In what follows, we describe the batch-mode active learning scheme that forms the basis of the AL algorithm for Hamiltonian learning that we discuss in Section III. Given a budget of $N$ queries, the training is divided into multiple rounds. We index each round of the training process by $i$ and denote the batch size by $N_b$. The number of queries made up to and including the $i$th round is denoted $N^{(i)}_{tot}$. In each round, a batch of queries is sampled from the optimal query distribution based on the current parameter estimate $\hat{\theta}$, and this estimate is then updated using the measurement outcomes of the queries. This is repeated until the entire budget has been expended. We denote the estimate in the $i$th round by $\hat{\theta}^{(i)}$ and the optimal query distribution in the $i$th round based on $\hat{\theta}^{(i)}$ by $q^{(i)}$. The query distribution at the very beginning of the training process, $q^{(0)}$, is determined from any available prior information on the parameters, or else set to the uniform distribution over the query space.
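The rounds just described can be sketched in a few lines; `oracle`, `estimate`, and `optimal_q` below are placeholders (our own naming) for the quantum system, the MLE, and the query optimization of Eq. 17 or Eq. 19, and the dummy run at the end only exercises the bookkeeping:

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_mode_al(oracle, query_space, estimate, optimal_q, n_init, n_batch, budget):
    """Skeleton of batch-mode active learning: an initial batch drawn from
    the uniform distribution, then alternating estimation and
    query-distribution updates until the budget is spent."""
    q = np.full(len(query_space), 1.0 / len(query_space))     # q^(0): uniform
    X = list(rng.choice(len(query_space), size=n_init, p=q))  # initial queries
    Y = [oracle(query_space[i]) for i in X]
    n_tot = n_init
    while n_tot < budget:
        theta_hat = estimate(X, Y)    # theta^(i) from all data so far
        q = optimal_q(theta_hat)      # q^(i): solve the query optimization
        batch = list(rng.choice(len(query_space), size=n_batch, p=q))
        X += batch
        Y += [oracle(query_space[i]) for i in batch]
        n_tot += n_batch
    return estimate(X, Y), n_tot

# Dummy components, just to exercise the control flow
theta_hat, n_used = batch_mode_al(
    oracle=lambda x: 1,
    query_space=[0.5, 1.0, 2.0],
    estimate=lambda X, Y: sum(Y) / len(Y),
    optimal_q=lambda th: np.array([0.2, 0.3, 0.5]),
    n_init=10, n_batch=5, budget=30)
```

With a budget of 30, an initial set of 10, and batches of 5, four rounds of query optimization are performed before the budget is exhausted.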
What should be the size $N^{(0)}_{tot}$ of the initial set of queries? Some suggestions are given in [56] based on a finite-sample analysis for logistic regression, but these do not suffice for the application considered in this paper. Qualitatively, one hopes that $N^{(0)}_{tot}$ is large enough that the parameter estimate $\hat{\theta}^{(0)}$ lies close to the true parameter value $\theta$ and in a convex basin of the asymptotic negative log-likelihood loss function. However, setting $N^{(0)}_{tot}$ to a very high value may not allow us to take advantage of the active learner and the savings it can provide.
Additionally, it may be advantageous to adaptively change the query space for exploration from one batch to the next. It is not necessary for the query space to remain static [47] during active learning; in fact, many prominent algorithms contain an element of adaptively changing the search space. In the sparse fast Fourier transform [57,58], the bins of frequencies are randomly chosen in each iteration of the algorithm. In the related machine learning setting of reinforcement learning, action spaces are changed to eliminate actions [59], and to generalize over time by parameterizing them [60] or embedding them in a continuous action space [61].
Adaptively changing the query space is particularly compelling for the application of active learning to the Hamiltonian learning problem, because evidence suggests that it may result in so-called Heisenberg-limited scaling of the number of queries as we noted in Section I. We will also make a case for why we expect the active learner which we will introduce for Hamiltonian learning in Section III to achieve Heisenberg-limited scaling when possible.
The computational cost of the batch-mode AL scheme is determined by the number of rounds of batches issued and the computational cost of solving the query optimization problems of Equations 17 and 19. Solving these directly can be challenging, but fortunately the query optimizations can be reformulated as semidefinite programs (SDPs) [37] under the assumptions of differentiability of the log-likelihood function and invertibility of the Fisher information matrix for query distributions in a compact space around the optimal query distribution and around the uniform distribution. When using the query criterion of Fisher information ratio (FIR) (Eq. 19), the optimization problem [37,56] is
$$\min_{q \in P,\ \alpha \in \mathbb{R}^m} \sum_{j=1}^{m} \alpha_j \quad \text{s.t.} \quad \begin{bmatrix} I_q(\theta) & e_j \\ e_j^T & \alpha_j \end{bmatrix} \succeq 0, \quad j = 1, \ldots, m, \quad (21)$$
where we have introduced $m$ auxiliary variables $\alpha_1, \ldots, \alpha_m$, and $e_j$ are the eigenvectors of $I_{p_{test}}(\theta)$. Recall that the parameter vector $\theta$ has $m$ components. To obtain the SDP for the query optimization when using the query criterion of Fisher information (FI) (Eq. 17), we replace $e_j$ by the eigenvectors of the identity matrix; in this case, the $e_j$ are $m$-dimensional canonical vectors with 1 in the $j$th component and zeros elsewhere. The computational cost of solving the above SDPs with a barrier interior-point method is $O(n_Q^2 m^3 + n_Q m^4 + m^5)$, where we have denoted by $n_Q = |Q|$ the size of the query space of interest.
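For intuition, on a small discrete query space the objective of Eq. 17 can also be minimized directly with a first-order method on the probability simplex; the sketch below (our simplification with made-up matrices, not the SDP solver used in the paper) runs exponentiated-gradient descent:

```python
import numpy as np

def optimize_query_distribution(I_per_query, steps=500, lr=0.05):
    """Minimize Tr(I_q^{-1}) over the simplex by exponentiated-gradient
    (mirror) descent; the objective and feasible set match Eq. 17."""
    n = len(I_per_query)
    q = np.full(n, 1.0 / n)
    for _ in range(steps):
        I_q_inv = np.linalg.inv(sum(qx * Ix for qx, Ix in zip(q, I_per_query)))
        # d/dq_x Tr(I_q^{-1}) = -Tr(I_q^{-1} I_x I_q^{-1})
        grad = np.array([-np.trace(I_q_inv @ Ix @ I_q_inv) for Ix in I_per_query])
        q = q * np.exp(-lr * grad)   # multiplicative update ...
        q /= q.sum()                 # ... then renormalize onto the simplex
    return q

# Second query is uniformly more informative, so it should dominate q*
I_per_query = [np.eye(2), 9.0 * np.eye(2)]
q_star = optimize_query_distribution(I_per_query)
```

This heuristic scales linearly in the number of candidate queries per step, but unlike the SDP reformulation it carries no convergence certificate, which is why the interior-point route is preferable in practice.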

Query Advantage
To compare the resource requirements of different Hamiltonian learning (HL) methods for accomplishing a learning task, we introduce the concept of query advantage. The query advantage (QA) of an HL method in achieving a learning error of $\epsilon$ is
$$\mathrm{QA}(\epsilon) = 1 - \frac{\text{Number of queries required by method}}{\text{Number of queries required by baseline}}. \quad (22)$$
As discussed before, QA measures the amount of query reduction obtained by selecting an HL method over a baseline strategy. In this work, we consider the passive learner equipped with an appropriately chosen estimation procedure as the baseline. We will specify the estimation procedure when discussing a QA result, or it will be clear from context. The benefits of quantifying QA for an HL method are twofold. Firstly, it allows us to comment on the performance boost obtained by using one HL method over another for accomplishing a learning task; moreover, we can identify learning tasks that can be achieved using one HL method but are unattainable by another. Secondly, it gives us a direct way to select, from a set of methods for a particular learning task, the method requiring minimal resources, by choosing the method with the highest QA.
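Since QA is reported as a query reduction (e.g., the 99.8% figures quoted in the abstract), the computation is one line; assuming the reading $\mathrm{QA} = 1 - N_{\text{method}}/N_{\text{baseline}}$ (function name ours):

```python
def query_advantage(n_method, n_baseline):
    """Query advantage as a fractional query reduction:
    QA = 1 - N_method / N_baseline (our reading of Eq. 22)."""
    return 1.0 - n_method / n_baseline

# e.g. a method needing 200 queries where the baseline needs 100,000
qa = query_advantage(200, 100_000)  # a 99.8% reduction
```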
As we will see in Section III, an active learning strategy is a framework for achieving query advantage and, when possible, faster rates of convergence during learning.

III. HAMILTONIAN ACTIVE LEARNING ALGORITHMS
We are now in a position to describe how to adapt probabilistic pool-based batch-mode active learning with the query criteria of FI and FIR to Hamiltonian learning. The resulting algorithms are collectively called Hamiltonian Active Learning (HAL) algorithms. We call HAL combined with the query criterion of Fisher information HAL-FI, and HAL combined with the Fisher information ratio HAL-FIR. We discuss how the resulting algorithms are expected to achieve query advantage over the baseline and, with access to prior information, Heisenberg-limited scaling in query complexity where possible.

A. Algorithm
The HAL algorithm is summarized in Algorithm 1. We assume that the unknown Hamiltonian is time-independent and that a model parameterized by $\theta$ for the oracle (i.e., Hamiltonian with noise sources) is available to us. Inputs to the HAL algorithm include the initial query distribution $q^{(0)}$, used for sampling the initial set of $N^{(0)}_{tot}$ queries, and the query optimization algorithm (QOA). When the query criterion is Fisher information (FI), the corresponding QOA is given by Algorithm 2, which solves Eq. 17. Similarly, when the query criterion is the Fisher information ratio (FIR), the corresponding QOA is given by Algorithm 3, which solves Eq. 19. Recall that the choice of query criterion, and hence QOA, depends on the learning task at hand: we consider FI for learning Hamiltonian parameters with low RMSE (Eq. 17) and FIR for minimizing the testing error (Eq. 19). Additional inputs include the maximum number of batches $i_{max}$ that will be issued during learning, the batch size $N_b$, and the total experimental budget available.
Using the notation for batch-mode active learning from Sec. II C, we denote the batch of queries sampled at the $i$th round from the query distribution $q^{(i)}$ by $X^{(i)}_q$, and the corresponding measurement outcomes from inputting these queries to the oracle by $Y^{(i)}_q$. The set of all queries made up to any round is denoted by $X^{(i)}$ and their corresponding measurement outcomes by $Y^{(i)}$. We note that $|X^{(0)}| = N^{(0)}_{tot}$, with $q^{(0)}$ set to the uniform distribution when prior information about the parameters is not available. The initial set of training examples is used to determine an initial value of the parameters, which in turn is used to determine an informative albeit suboptimal query distribution $q^{(1)}$ through the given QOA. Note that in the QOAs of Algorithms 2 and 3, we modify the query distribution $q^{(i)}$ obtained by solving Eq. 17 or Eq. 19, mixing it with the uniform distribution $p_U$ over the query space, i.e., $\mu q^{(i)} + (1 - \mu) p_U$, where $0 \le \mu \le 1$ is the mixing coefficient. This is done to encourage exploration and is analogous to epsilon-greedy policies in reinforcement learning [62]. The value of $\mu$ typically depends on the number of queries made so far, and we set it to $\mu = 1 - 1/|X^{(i)}|^{1/6}$, as is often used for such active learning algorithms [37,56].
Deviating from a vanilla batch-mode AL scheme, we require an additional input of the query spaces $\{Q^{(i)}\}_{i \in [i_{max}]}$ during training: we allow the query space to adaptively change from one batch to the next. How do we decide how the query space changes between batches? Recall that the query distribution in $I_q$ is a joint probability distribution over the measurement operators, preparation operators, and evolution times. Conditioned on a particular evolution time, we would not expect changing $M$ or $U$ to help in reducing the query complexity. The query space is therefore adaptively grown by growing $T$ linearly or exponentially with each batch. The output of the algorithm is an estimate $\hat{\theta}$ of the true Hamiltonian model parameters $\theta$, learned through active learning using $N$ queries. We use the unbiased maximum likelihood estimator in the HAL algorithm. HAL-FI produces an estimate $\hat{\theta}$ with the minimum RMSE possible using a budget of $N$ queries and batches of size $N_b$. Similarly, HAL-FIR produces an estimate $\hat{\theta}$ that performs best in predicting the outcomes of queries to the Hamiltonian against the testing distribution $p_{test}$.
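The exploration mixing step used in the QOAs is small enough to state directly as code (function name ours):

```python
import numpy as np

def mixed_query_distribution(q_opt, n_queries_so_far):
    """Mix the optimized distribution with the uniform one:
    mu * q + (1 - mu) * p_U with mu = 1 - 1/|X^(i)|^(1/6),
    so exploration decays as data accumulates."""
    mu = 1.0 - 1.0 / n_queries_so_far ** (1.0 / 6.0)
    p_u = np.full(len(q_opt), 1.0 / len(q_opt))
    return mu * np.asarray(q_opt) + (1.0 - mu) * p_u

# After 64 queries, mu = 1 - 1/64^(1/6) = 0.5: half optimized, half uniform
q_mixed = mixed_query_distribution([1.0, 0.0], 64)
```

With a single query made so far, $\mu = 0$ and the learner falls back to pure uniform exploration; the optimized distribution dominates only as the estimate becomes trustworthy.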

Algorithm 1 Hamiltonian Active Learning (HAL)
Input: Initial number of queries $N^{(0)}_{tot}$, batch size $N_b$, initial query distribution $q^{(0)}$, maximum number of batches $i_{max}$, adaptively growing query space $\{Q^{(i)}\}_{i \in [i_{max}]}$, oracle, query optimization algorithm (QOA)
Output: $\hat{\theta}$
1: Sample initial set of queries $X^{(0)}$ of size $N^{(0)}_{tot}$ from $q^{(0)}$
2: Obtain measurement outcomes $Y^{(0)}$ by issuing queries $X^{(0)}$ to oracle
3: Compute initial estimate $\hat{\theta}^{(0)}$ from $(X^{(0)}, Y^{(0)})$ by MLE
4: for $i = 1, \ldots, i_{max}$ do
5: Obtain query distribution $q^{(i)}$ from QOA given $\hat{\theta}^{(i-1)}$, $X^{(i-1)}$, and $Q^{(i)}$
6: Sample batch $X^{(i)}_q$ of size $N_b$ from $q^{(i)}$; set $X^{(i)} = X^{(i-1)} \cup X^{(i)}_q$
7: Update number of queries: $N^{(i)}_{tot} = N^{(i-1)}_{tot} + N_b$
8: Obtain measurement outcomes $Y^{(i)}_q$ by issuing queries $X^{(i)}_q$ to oracle; set $Y^{(i)} = Y^{(i-1)} \cup Y^{(i)}_q$
9: Update estimate $\hat{\theta}^{(i)}$ from $(X^{(i)}, Y^{(i)})$ by MLE
10: end for
11: return $\hat{\theta} = \hat{\theta}^{(i_{max})}$

Algorithm 2 Query Optimization based on Fisher Information (FI)
Input: Number of queries made so far $N^{(i)}_{tot}$, past queries $X^{(i)}$, current estimate $\hat{\theta}^{(i)}$, query space $Q^{(i)}$
Output: Query distribution $q$
1: Compute Fisher information matrices $I_x(\hat{\theta}^{(i)})$ for all $x \in Q^{(i)}$
2: Solve the SDP of Eq. 21, with $e_j$ the canonical basis vectors, to obtain the minimizer $q^{(i)}$ of Eq. 17
3: Obtain uniform distribution over query space: $p_U = 1/|Q^{(i)}|$
4: Set mixing coefficient: $\mu = 1 - 1/|X^{(i)}|^{1/6}$
5: return $q = \mu q^{(i)} + (1 - \mu) p_U$

Algorithm 3 Query Optimization based on Fisher Information Ratio (FIR)
Input: Number of queries made so far $N^{(i)}_{tot}$, past queries $X^{(i)}$, current estimate $\hat{\theta}^{(i)}$, query space $Q^{(i)}$, testing distribution $p_{test}$
Output: Query distribution $q$
1: Compute Fisher information matrices $I_x(\hat{\theta}^{(i)})$ for all $x \in Q^{(i)}$
2: Compute $I_{p_{test}}(\hat{\theta}^{(i)})$ and its eigenvectors $e_j$
3: Solve the SDP of Eq. 21 to obtain the minimizer $q^{(i)}$ of Eq. 19
4: Obtain uniform distribution over query space: $p_U = 1/|Q^{(i)}|$
5: Set mixing coefficient: $\mu = 1 - 1/|X^{(i)}|^{1/6}$
6: return $q = \mu q^{(i)} + (1 - \mu) p_U$

B. Comment on Heisenberg Limited Scaling
We claim that HAL-FI with an appropriately chosen adaptively growing query space can achieve Heisenberg-limited scaling in the Hamiltonian parameters where possible. If the query space is not rich enough, it will not be possible to determine a sequence of queries that achieves Heisenberg-limited scaling. Without an adaptively growing query space (i.e., with a fixed query space), the number of queries $N$ scales with the error parameter $\epsilon$ as $O(1/\epsilon^2)$, as dictated by the Cramér-Rao bound; in this case an active learner improves the query complexity only by a constant factor. If one instead chose to adaptively grow the query space during training without an active learning strategy, using for example a uniform distribution over the new query space (i.e., passive learning), Heisenberg-limited scaling would not be expected, as it becomes exponentially more unlikely to sample an informative query.
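To make the scaling gap concrete, inverting $\epsilon \sim N^{-1/2}$ (standard quantum limit) versus $\epsilon \sim N^{-1}$ (Heisenberg limit) gives rough query counts (constants suppressed; purely illustrative):

```python
import math

def queries_sql(eps):
    """Standard quantum limit: eps ~ N^(-1/2)  =>  N ~ 1/eps^2."""
    return math.ceil(1.0 / eps ** 2)

def queries_heisenberg(eps):
    """Heisenberg limit: eps ~ N^(-1)  =>  N ~ 1/eps."""
    return math.ceil(1.0 / eps)

# At a target error of 1e-3 the gap is already about three orders of magnitude
n_sql, n_hl = queries_sql(1e-3), queries_heisenberg(1e-3)
```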
Moreover, Heisenberg limited scaling is expected as long as the evolution time is below the decoherence time. Beyond the decoherence time, the oracle will start losing its quantum behavior.
If the chosen adaptively growing query space cannot be used to achieve Heisenberg-limited scaling for all parameters, it should still be possible to achieve Heisenberg-limited scaling for a subset of the Hamiltonian parameters, provided the goal is to learn only these parameters and we are given information about the others. This learning task often occurs in practice during recalibrations of quantum devices, when a subset of parameters known to fluctuate significantly with time must be relearned while information about the other parameters can be reused from previous calibrations. We provide empirical evidence for this claim in Section V, where we consider the application of HAL-FI to cross-resonance Hamiltonians. The results of HAL-FI for this particular application are summarized in Table I.
What distinguishes HAL-FI from other methods that have been shown to achieve Heisenberg-limited scaling, such as Floquet calibration [28], is that it does not require prior specification of the experiments and their order of implementation; these are decided by HAL-FI during learning. Moreover, HAL-FI utilizes single-shot outcomes from queries instead of requiring expectation values. In practice, this can result in multiple orders of magnitude reduction in the queries required. Finally, HAL-FI can achieve query advantage over passive learners for complete query spaces even when Heisenberg-limited scaling cannot be achieved.

C. Computational Cost and Extensions
A consequence of adaptively growing the query space over rounds during learning is that the SDP programs (Eq. 21) corresponding to the query optimizations of Eqs. 17 and 19, with computational cost $O(n_Q^2 m^3 + n_Q m^4 + m^5)$, become increasingly expensive to solve. If $n_Q$ grows exponentially, each instance of the query optimization problem becomes exponentially more expensive than the last. This can be circumvented by reducing the number of queries to optimize over using uncertainty filtering [63], thereby effectively reducing the size $n_Q$ of the query space over which the query optimization is carried out. Uncertainty filtering for our application of Hamiltonian learning to the cross-resonance Hamiltonian is discussed in Appendix C 2.
The HAL-FI and HAL-FIR algorithms can be generalized to different experimental setups or requirements. The HAL algorithm presented in Algorithm 1 uses the stopping criterion of a maximum number of batches of queries issued during learning, but other stopping criteria, such as the $\ell_2$ norm of the difference between consecutive parameter estimates, $||\hat{\theta}^{(i)} - \hat{\theta}^{(i-1)}||_2$, could also be used. We have also restricted our discussion to the maximum likelihood estimator, but the HAL algorithm can be combined with any unbiased consistent estimator.

TABLE I. Scaling of the number of queries $N$ with learning error for different choices of adaptivity in the query space (none, linearly growing, exponentially growing), for learning all Hamiltonian parameters and for recalibration with prior information on a subset of parameters.

IV. HAMILTONIAN LEARNING FOR A TWO-QUBIT SUPERCONDUCTING CR GATE: MODEL AND SETUP
To assess the performance of the HAL algorithms described in Section III over a passive learner and empirically verify our claims, we consider the application of learning cross-resonance Hamiltonians on superconducting IBM Quantum devices. In this section, we describe the model of the two-qubit cross-resonance Hamiltonian, and the set of queries that we can make to it. This is followed by a description of the different noise sources that affect the quantum system and how this modifies the MLE of Hamiltonian parameters. A description of the IBM Quantum devices employed for assessing the performance of the HAL algorithms can be found in Appendix B.

A. Cross-Resonance Hamiltonian
The cross resonance (CR) gate is a two-qubit entangling gate for superconducting qubits requiring only microwave control which allows for the use of fixed-frequency transmon qubits [64,65]. Using appropriate pulse sequences such as multi-pulse echos and cancellation tones [66], the CR gate can be transformed to a locally equivalent CNOT gate [67]. Combined with arbitrary single qubit gates, this then forms a complete set of gates for universal quantum computation.
The Hamiltonian of the cross-resonance (CR) gate has the following structure:
$$H_{CR} = \sum_{b \in \{X,Y,Z\}} \left( c_{Ib}\, \sigma_I \otimes \sigma_b + c_{Zb}\, \sigma_Z \otimes \sigma_b \right) + c_{ZI}\, \sigma_Z \otimes \sigma_I, \quad (25)$$
where $\{\sigma_I, \sigma_X, \sigma_Y, \sigma_Z\}$ are the single-qubit Paulis and $c_{ab} \in \mathbb{R}$ are real coefficients of the Pauli product terms $\sigma_a \otimes \sigma_b$. The above time-independent Hamiltonian description of the cross-resonance gate can be obtained through theoretical models based on effective block-diagonal Hamiltonian techniques [68]. In our experiments, which we describe in Section IV B, the CR gate is implemented without using an echo [66] to refocus the $\sigma_I \sigma_X$, $\sigma_Z \sigma_Z$, and $\sigma_Z \sigma_I$ terms.
However, we measure only the target qubit through our queries and thus effectively neglect the $\sigma_Z \sigma_I$ term. The target qubit is typically chosen to be qubit 1 in a (0, 1) qubit pair and is specified for the different quantum devices we consider in Appendix B 1. The effective removal of the $\sigma_Z \sigma_I$ term from Eq. 25 results in the following simplified CR Hamiltonian, which we consider throughout the rest of our work:
$$H = \sum_{b \in \{X,Y,Z\}} \left( J_{Ib}\, \sigma_I \otimes \sigma_b + J_{Zb}\, \sigma_Z \otimes \sigma_b \right), \quad (26)$$
where we have used the parameter vector $J = [J_{IX}, J_{IY}, J_{IZ}, J_{ZX}, J_{ZY}, J_{ZZ}]^T$ to denote the non-zero coefficients of the corresponding Pauli product terms and have omitted the subscript CR. Learning the unknown Hamiltonian of the two-qubit CR gate is then reduced to estimating the unknown parameter vector $J$.
Noting the block-diagonal structure of $H$, we can express it in the usual computational basis $(|00\rangle, |01\rangle, |10\rangle, |11\rangle)$ as a direct sum of two $2 \times 2$ blocks (Eq. 27), where the subscript $j \in \{0, 1\}$ refers to the two blocks. The two blocks have similar structure, with their elements differing in the sign of $J_{ZX}$, $J_{ZY}$, and $J_{ZZ}$. Using $a_j$ and $\beta_j$ for $j \in \{0, 1\}$ from Eq. 27, we define the parameters $\omega_j$, $\delta_j$, and $\phi_j$, where the subscript $j = 0$ denotes the preparation operator $U = \sigma_I \sigma_I$ and $j = 1$ denotes $U = \sigma_X \sigma_I$. We then arrive at an alternate parameterization $\Lambda$ (related to the spectral decomposition of $H$, Eq. 30), which turns out to be useful for simplifying expressions for the probability, likelihood, and Fisher information. The unitary operator $U(t) = e^{-iHt}$ in the usual computational basis is then expressed in terms of the parameter vector $\Lambda$ defined in Eq. 30. Moreover, the different components of $\Lambda$ can be bounded based on their definitions, and these bounds will help us later while solving the MLE problems. By construction, we have $-\frac{\pi}{2} \le \delta_{0,1} \le \frac{\pi}{2}$, and $\phi_{0,1}$ can be bounded within any interval of size $2\pi$, e.g., $-\pi \le \phi_{0,1} \le \pi$. Assuming $\delta t$ is the average time increment between distinct ordered values of evolution times $t \in T$, $\omega_{0,1}$ can be bounded based on the Nyquist sampling theorem as $0 \le \omega_{0,1} \le \frac{\pi}{\delta t}$. To obtain the physical meaning of the parameter vector $\Lambda$, we consider Rabi oscillations. For different measurement operators $M \in \mathcal{M}$ and preparation operators $U \in \mathcal{U}$, the corresponding Rabi oscillation is the difference in probability densities of the ground state and excited state of the target qubit as a function of time $t \in T$. A typical example of the Rabi oscillations obtained from querying the CR Hamiltonian on a noisy quantum system is shown in Figure 3. We remark that $\omega_{0,1}$ define the frequencies of the Rabi oscillations for the two different preparation operators we consider, while $\delta_{0,1}$ and $\phi_{0,1}$ define the offsets, amplitudes, and phase shifts of the Rabi oscillations.
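As a hedged numerical sketch (our own code; the coefficient values are arbitrary illustrative units), the simplified CR Hamiltonian can be assembled from the six $J$ coefficients and $U(t) = e^{-iHt}$ computed from its spectral decomposition, which makes the block-diagonal structure discussed above easy to verify:

```python
import numpy as np

# Single-qubit Paulis
sI = np.eye(2, dtype=complex)
sX = np.array([[0, 1], [1, 0]], dtype=complex)
sY = np.array([[0, -1j], [1j, 0]], dtype=complex)
sZ = np.array([[1, 0], [0, -1]], dtype=complex)

def cr_hamiltonian(J):
    """H = sum over b in {X, Y, Z} of J_Ib (I x b) + J_Zb (Z x b),
    with J = [J_IX, J_IY, J_IZ, J_ZX, J_ZY, J_ZZ]."""
    J_IX, J_IY, J_IZ, J_ZX, J_ZY, J_ZZ = J
    return (J_IX * np.kron(sI, sX) + J_IY * np.kron(sI, sY) + J_IZ * np.kron(sI, sZ)
            + J_ZX * np.kron(sZ, sX) + J_ZY * np.kron(sZ, sY) + J_ZZ * np.kron(sZ, sZ))

def evolve(H, t):
    """U(t) = exp(-iHt) via the spectral decomposition of the Hermitian H."""
    evals, V = np.linalg.eigh(H)
    return V @ np.diag(np.exp(-1j * evals * t)) @ V.conj().T

J = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # arbitrary illustrative values
H = cr_hamiltonian(J)
U = evolve(H, t=0.1)
```

Since every term acts diagonally on the control qubit, $H$ splits into the two $2 \times 2$ blocks whose elements differ in the signs of $J_{ZX}$, $J_{ZY}$, and $J_{ZZ}$, matching the block structure used above.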
In Figure 3, we can see the effects of different noise sources such as readout and depolarization which will be discussed in Section IV C.
FIG. 3. Rabi oscillations obtained experimentally through queries to a CR Hamiltonian on the IBM Quantum device (ibmq boeblingen) for different measurement operators (rows), preparation operators (markers) and evolution times t (x-axis). The set of measurement and preparation operators will be described in Section IV B 2. The experimental data is indicated by markers and the model fit to the data by lines. The model fit was generated by learning the CR Hamiltonian (Eq. 27) from experimental data using MLE and predicting values of Rabi oscillations for the query space using the learned Hamiltonian.

B. Experimental Setup
In this section, we give a quick overview of the different IBM Quantum devices that we employ for our application of Hamiltonian learning. This is followed by a description of the query space considered for our application and how the queries to the CR gate on the IBM Quantum devices are made in experiment.

Quantum Devices
We will present data and results for four different IBM Quantum devices which we call A, B, C, and D. All of these devices are based on superconducting architectures requiring only microwave control [64,65] and consist of fixed-frequency transmon qubits with shared quantum buses [69]. Device A is a two-qubit device. Device B is a four-qubit device on which we will query only one of the CR gates that can be implemented between two qubits. Device C is a five-qubit device with a bow-tie layout. Device D is the 20-qubit ibmq boeblingen which was accessed via the IBM Quantum cloud computing service. See Figure 13 in Appendix B 1 for the connectivity maps of all the devices. We give a summary of the properties of the pairs of qubits involved in the cross-resonance Hamiltonians we considered on these devices in Table VII in Appendix B 1.

Experimental Implementation and Query Space
Queries to the CR Hamiltonians between different qubit pairs on the IBM Quantum devices are made through appropriate pulse sequences. These pulse sequences are constructed and executed on the hardware using Qiskit-Pulse [70], a pulse programming module within Qiskit [71] that serves as a front-end implementation of the OpenPulse interface [72]. Each Qiskit-Pulse program consists of pulses, channels, and instructions. Below, we describe a Qiskit-Pulse program and how a query to a CR gate on an IBM Quantum device is specified.
A pulse is a time series of complex-valued amplitudes with maximum unit norm, which we denote by $a_k$, where $k \in [n-1]$ corresponds to the time stamp. The spacing between these time stamps is $dt$, the sampling period of the waveform generator. The output signal thus has an amplitude of
$$\mathrm{Re}\left[ a_k\, e^{i(2\pi f k\, dt + \gamma)} \right] \quad (32)$$
at time $k\, dt$, where $f$ and $\gamma$ are a modulation frequency and phase. A pulse is specified in Qiskit-Pulse by giving the individual amplitudes $a_k$ and the phase $\gamma$. Alternatively, one can use parametric pulse shapes offered by the library, such as Gaussian, GaussianSquare, etc. These pulses are then implemented on the hardware via channels, which label the signal lines used for transmitting and receiving signals between the control electronics and the hardware. In particular, the pulses are implemented on the PulseChannel and are used to control the system Hamiltonian to implement different gates.
To implement a query to a quantum device, we need an equivalent description of the quantum circuit shown in Fig. 1 in the form of a pulse schedule. We discuss this here in parallel with a description of the query space $Q$ considered for learning CR Hamiltonians. To prepare the initial state, we consider the set of preparation operators $\mathcal{U} = \{\sigma_I \otimes \sigma_I, \sigma_X \otimes \sigma_I\}$ applied to the pure state $|00\rangle$. The single-qubit gates are implemented as a sequence of Gaussian pulses of the appropriate amplitude and duration [70]. Assuming the first (left) qubit is the control and the second (right) qubit is the target, the effect of the two preparation operators is to place the control in $|0\rangle$ and $|1\rangle$, respectively. We evolve the initial state $|\psi(0)\rangle$ for time $t \in T$, which we will specify when discussing the results of our application of Hamiltonian learning in Section V. This is done by switching the CR interaction on for a time duration $t$, implemented in practice as a GaussianSquare pulse of duration $t$. A GaussianSquare pulse is a square pulse with truncated Gaussian-shaped rising and falling edges; we discuss later in Section IV C how using this pulse may introduce non-idealities in the system evolution. Finally, after obtaining the final state $|\psi(t)\rangle$, we apply the measurement operators in $\mathcal{M} = \{\sigma_I \otimes \exp(i\frac{\pi}{4}\sigma_Y), \sigma_I \otimes \exp(-i\frac{\pi}{4}\sigma_X), \sigma_I \otimes \sigma_I\}$ and measure only the second qubit, which we have chosen as the target qubit. The query space is then $Q = \mathcal{M} \times \mathcal{U} \times T$. An example of a pulse schedule highlighting the different parts of the query is shown in Fig. 4. Moreover, in our experimental setup, we obtain measurements of the single-shot signal (integrated cavity amplitude) $c$, which is a function of the measurement outcomes $y$ described earlier. In the following Section IV C, we will discuss how to model the different noise sources that affect our system and how they are determined through experiments.
FIG. 4. Pulse schedule for a query with $M = \sigma_I \otimes \exp(-i\frac{\pi}{4}\sigma_X)$, $U = \sigma_X \otimes \sigma_I$, and time duration $t = 6 \times 10^{-7}$ s. The x-axis corresponds to time normalized by $dt = 2.22 \times 10^{-10}$ s (Eq. 32). The different channels corresponding to each qubit (y-axis) are labeled by the type of channel (see plot legend) followed by the qubit number. Qubit 0 is set to be the control qubit and qubit 1 the target qubit. The envelopes of the different pulses are shown in each channel. The rotations on the drive or control channels indicate virtual Z gates. An equivalent representation of the quantum circuit is shown in Fig. 1.

C. Estimates of Noise and Nonidealities for Experimental System
Using the noise models presented in Section II B, we give the specific relevant models for readout noise, imperfect pulse shaping, and decoherence for the different IBM Quantum devices which we study. Let us introduce some notation for the following discussions. We use the index k for the queries. The kth query is given by x^(k) = (M^(k), U^(k), t^(k)), and the corresponding measurement outcome of the target qubit is y^(k). The measurement outcome y^(k) is not directly observed but inferred from the corresponding signal c^(k). We denote the inferred value as ŷ^(k).

Readout Noise
As discussed in Section II B, we assume the measurement noise model to be a bit-flip channel whose input is the unobserved measurement outcome y and whose output is the readout ỹ observed through the signal c. For this inference task, we use calibration data of single qubits initially prepared in states |0⟩ or |1⟩, and subsequently measured in the usual computational basis. In Figure 5(a), we plot different realizations {y^(k), c^(k)}, k = 1, …, 100, from IBM Quantum device D (ibmq boeblingen) that were used for training a binary classifier. The classifier provides us with the ability to predict y given c. Moreover, the misclassification errors p_ŷ|y(1|0) and p_ŷ|y(0|1) can be used to approximate the properties of the bit-flip channel, in particular the conditional probabilities of a bit flip, p_ỹ|y(1|0) and p_ỹ|y(0|1) respectively. As these are independent of the value of c here, we denote the conditional probabilities of a bit flip as r_0 = p_ỹ|y(1|0) and r_1 = p_ỹ|y(0|1). The MLE of the parameters incorporating this noise model is given by
θ̂ = argmax_θ Σ_k log [ (1 − r_{ŷ^(k)}) p_y|x(ŷ^(k)|x^(k); θ) + r_{1−ŷ^(k)} p_y|x(1 − ŷ^(k)|x^(k); θ) ],
where p_y|x(y^(k)|x^(k); θ) is given by Eq. 1. Alternately, instead of assigning a deterministic result ŷ^(k) for each c^(k), we can incorporate p_c|y directly into the log-likelihood function, which could yield a more accurate model. This could be done through our choice of binary classifier or by fitting a distribution to the training data. Noting that single-qubit measurement outcomes correspond to their energy levels, we fit Gaussian distributions to the training data and hence obtain a parametric form of p_c|y in Figure 5(b). The MLE is now
θ̂ = argmax_θ Σ_k log Σ_{y∈{0,1}} p_c|y(c^(k)|y) p_y|x(y|x^(k); θ).
A useful tool for calibration and diagnostics is Rabi oscillations, which we denote by p_rabi(x). Rabi oscillations are obtained from evaluating the difference in the population densities of the ground state and excited state of the target qubit, p_rabi(x) = p_y|x(0|x) − p_y|x(1|x).
We can compute Rabi oscillations in two different ways, either through binary classification or through fitting Gaussians. Using the misclassification errors from the binary classifier, we can write
p̂_rabi(x) = (2 p̂_ŷ|x(0|x) − 1 + r_0 − r_1) / (1 − r_0 − r_1),
where we have denoted the computed Rabi oscillations as p̂_rabi(x). While p_y|x(0|x) and p_y|x(1|x) are guaranteed to be valid probability distributions, the above estimation does not ensure that p̂_rabi(x) is bounded by −1 and 1. To obtain more accurate estimates of the Rabi oscillations, we can solve the following MLE problem using the estimated conditional distributions p_c|y from the Gaussian fits:
p̂_rabi(x) = argmax_{p∈[−1,1]} Σ_k 1{x^(k) = x} log [ ((1+p)/2) p_c|y(c^(k)|0) + ((1−p)/2) p_c|y(c^(k)|1) ].
The indicator function 1{x^(k) = x} ensures that the summation runs only over measurement outcomes of the given query x. The analytical expressions for p_rabi(x) are given by
M_X: p_rabi(x) = sin δ_j cos δ_j cos φ_j + cos δ_j √(1 − cos²δ_j cos²φ_j) cos(2ω_j t + α_j),
M_Y: p_rabi(x) = sin δ_j cos δ_j sin φ_j + cos δ_j √(1 − cos²δ_j sin²φ_j) cos(2ω_j t + γ_j),
M_Z: p_rabi(x) = 1 − 2 cos²δ_j sin²(ω_j t) = sin²δ_j + cos²δ_j cos(2ω_j t),
where we denote α_j = arg((− sin δ_j cos φ_j) + i(− sin φ_j)) and γ_j = arg((− sin δ_j sin φ_j) + i cos φ_j) in the above expressions. The subscript j = 0 denotes the preparation operator U = σ_I ⊗ σ_I and j = 1 denotes U = σ_X ⊗ σ_I. In Figure 6, we plot the Rabi oscillations for the different M ∈ M and U ∈ U over the time range t ∈ T using the above two methods. We observe that the Rabi oscillations are not bounded between −1 and +1 for the case of M_Y in Figure 6(a) when using the misclassification error to compensate for the readout noise. The estimated Rabi oscillations in Figure 6(b) do not suffer from the same issue.
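A minimal sketch of the misclassification-error correction (the inversion formula follows from the bit-flip model; the helper name is hypothetical), illustrating why the corrected estimate need not stay within [−1, 1]:

```python
def rabi_from_counts(n0, n1, r0, r1):
    """Hypothetical helper: readout-corrected Rabi value from shot counts,
    inverting the bit-flip channel q0 = (1 - r0) * p0 + r1 * (1 - p0)."""
    q0 = n0 / (n0 + n1)               # observed fraction of outcome 0
    p0 = (q0 - r1) / (1.0 - r0 - r1)  # inverted ground-state population
    return 2.0 * p0 - 1.0             # p_rabi = p0 - p1

# With finite shots, the inverted estimate can leave [-1, 1]:
est = rabi_from_counts(n0=199, n1=1, r0=0.02, r1=0.03)
assert est > 1.0                      # exceeds the physical bound
```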
As we will see later in Section IV D 2, Rabi oscillations will also be used in our estimation procedure for obtaining an initial guess for the parameter estimate θ̂ that is used as an input to the optimizer when solving the MLE problem. They can also be used as a quantitative tool for ascertaining how well the model fits the data. This will become apparent in the next few sections, where we discuss other noise models.

Imperfect Pulse Shaping
Another nonideality is introduced through the pulses used to control the Hamiltonian and implement different unitary operators. It is convenient to think of cross-resonance control pulses as rectangular pulses that modulate the sinusoidal control pulse. The modulated signal results in unitary operators of the form U(t_1) = exp(−iHt_1), which we would want to implement in a quantum circuit. However, in practice, rectangular pulses cause significant amounts of signal energy to be distributed above and below the frequency of the control sinusoid. This distribution of energy can potentially excite higher energy states of the superconducting transmon being used as a qubit (i.e., |2⟩ and above) as well as excite neighboring spectator transmon qubits. To minimize such effects, pulse shaping is employed to reduce this energy spread by smoothing the rising and falling edges of the pulse, which reduces the magnitudes of the frequency artifacts above and below the frequency of the control sinusoid.
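A quick numerical illustration of this effect (not device code; the lengths and spectral cutoff below are arbitrary choices): smoothing the edges of a rectangular envelope with Gaussian ramps sharply reduces the fraction of spectral energy far from the carrier.

```python
import numpy as np

n, L, edge = 4096, 1024, 64        # window, pulse, and ramp lengths (samples)
sigma = edge / 4.0                 # assumed Gaussian ramp width

# Rectangular envelope embedded in a longer window.
square = np.zeros(n)
square[:L] = 1.0

# Same pulse with Gaussian-shaped rising and falling edges.
ramp = np.exp(-0.5 * ((np.arange(edge) - edge) / sigma) ** 2)
smooth = square.copy()
smooth[:edge] = ramp               # rising edge
smooth[L - edge:L] = ramp[::-1]    # falling edge

def high_freq_fraction(env, cutoff=0.05):
    """Fraction of the envelope's spectral energy above a normalized cutoff."""
    spec = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(env.size)
    return spec[freqs > cutoff].sum() / spec.sum()

# Smoothed edges suppress the far-from-carrier energy of the envelope.
assert high_freq_fraction(smooth) < high_freq_fraction(square)
```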
Pulse shaping is accomplished by taking a Gaussian-shaped pulse, splitting it in half, and then inserting a rectangular pulse between the halves. This results in the GaussianSquare pulse, previously described in Section IV B 2. Thus, we actually implement operators of the form Ũ(t_1) = exp(−i T ∫_0^{t_1} H̃(t) dt), where T is the time-ordering operator and H̃ is the Hamiltonian at any particular time, given by H̃(t) = H(t, v(t)) with H the cross-resonance Hamiltonian and v(t) a function of the cross-resonance pulse amplitude. Up to a first-order approximation in t, we can however model this as (see Appendix B 2 for details)
Ũ(t) ≈ exp(−iH(t_expt + Δt_eff)).
Here Δt_r and Δt_f denote the time durations of the rising and falling edges of the shaped pulse, the central square portion has duration t − Δt_r − Δt_f where t is the total duration of the pulse, and t_expt = t − Δt_f − Δt_r is the evolution time reported in our experiments. It should be noted that this can change from one experimental setup to another. The value of Δt_eff can be interpreted as an effective total edge duration that takes the shapes of the rising and falling edges into account. This first-order pulse-shaping model introduces another model parameter, namely Δt_eff, that we need to estimate, which we can do directly in our MLE. While this can certainly be done while learning the Hamiltonian, one can also determine the dependence of Δt_eff on the different Hamiltonian parameters using prior calibration data. We determined that Δt_eff depends only on the parameters ω_{0,1} as Δt_eff(θ) = a/(ω + bω²). In Figure 7, we plot the dependence of Δt (short for Δt_eff) on ω for the IBM Quantum device D (ibmq boeblingen). The results for the other IBM Quantum devices A, B, and C are shown in Figure 18 (Appendix E). For the IBM Quantum device D (ibmq boeblingen), the values are a = 6.2774 ± 0.01502 and b = 1.5086 × 10⁻⁹ ± 0.6104 × 10⁻⁹ s. How we arrived at this dependence is discussed later in Section V.
Using this data-driven model allows us to reduce the number of parameters in the estimation and hence the associated computational cost of the optimization. In the following sections, we will denote this model by Δt_eff(θ).
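A minimal sketch of the data-driven edge-duration model, using the fitted constants reported above for device D (the function names are hypothetical):

```python
# Fitted constants for device D, as reported in the text.
a = 6.2774       # fit constant
b = 1.5086e-9    # s

def dt_eff(omega):
    """Effective total edge duration from the model dt_eff = a / (w + b w^2)."""
    return a / (omega + b * omega ** 2)

def t_model(t_expt, omega):
    """Effective evolution time entering the first-order pulse-shaping model."""
    return t_expt + dt_eff(omega)

# dt_eff decreases monotonically with omega for positive a, b.
assert dt_eff(2e7) < dt_eff(1e7)
```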

Decoherence
Recalling our discussion in Sec. II B, we model decoherence as a depolarization channel acting on the quantum state ρ(t) = exp(−iHt) ρ(0) exp(iHt) produced by Hamiltonian evolution for a time duration t. The resulting state (from Eq. 9) was given by
ρ̃(t) = (1 − p_d(t)) ρ(t) + p_d(t) I/4.
One approach to obtaining a description of p_d(t) is to assume the functional form 1 − exp(−(t − t_0)/µ), resulting from a Poisson process with rate µ as described in Sec. II B, and then estimate µ from the training examples after incorporating this into the MLE. Here, we describe a model for p_d(t) using the measured device properties of the T_1 and T_2 times of each qubit. We consider an independent noise model on each of the two qubits used for implementing a cross-resonance gate. Let us denote the amplitude damping and phase damping of the kth qubit as E_a,k and E_p,k respectively. They have the following Kraus operators:
E_a,k: K_0 = [[1, 0], [0, √(1 − Γ_a,k)]], K_1 = [[0, √Γ_a,k], [0, 0]], with Γ_a,k = 1 − exp(−t/T_1,k),
E_p,k: K_0 = [[1, 0], [0, √(1 − Γ_p,k)]], K_1 = [[0, 0], [0, √Γ_p,k]], with Γ_p,k = 1 − exp(−2t/T_φ,k),
where T_1,k is the T_1 time of the kth qubit and T_φ,k is the pure dephasing time of the kth qubit, related to its T_2 time as 1/T_φ,k = 1/T_2,k − 1/(2T_1,k). The overall noise operator acting on the two qubits is then given by [73]
E = (E_a,1 ∘ E_p,1) ⊗ (E_a,2 ∘ E_p,2),
where ∘ indicates taking a composition of the two noise operators. The probability p_d(t) can be based on the unitarity [74] of the noise operator E, which is a completely positive linear map quantifying the coherence of the noise operator: p_d(t) is obtained by matching the unitarity of the depolarizing channel to that of E. This can be further generalized to an n-qubit system (see Appendix G of [73] for details). Let the measurement outcome of Hamiltonian evolution for time t followed by this noisy depolarization channel be ỹ, with corresponding Rabi oscillation p̃_rabi. Note that the Rabi oscillations in this case are related to the noiseless case (from Eq. 41) as
p̃_rabi(x) = (1 − p_d(t)) p_rabi(x).
In Table II, we give the RMSE between Rabi oscillations obtained using different decoherence models, assuming we know the true Hamiltonian parameters, and Rabi oscillations inferred from data.
For the single-parameter model, we consider a prefactor of (1 − p_d(t)) = exp(−(t − t_0)/µ), with the single parameter µ, on the Rabi oscillations obtained using no decoherence model. This parameter µ is then estimated from the data and is found to be (7.75 ± 0.91) × 10⁻⁵ s. In the two-parameter model, µ is allowed to vary with the state preparation operator U being applied: for U = σ_I ⊗ σ_I, we have µ = (5.52 ± 0.82) × 10⁻⁵ s, and for U = σ_X ⊗ σ_I, µ = (2.51 ± 0.59) × 10⁻⁵ s. It is advantageous to use such models due to the low number of parameters involved, and when the T_1 or T_2 times of the different qubits are not available. However, if these times are available, one can use the two-qubit decoherence model or the two-parameter model, as we note from the values of RMSE in Table II. (The divergence in Table II is computed as D_KL(p_data || p_model), where p_data is the probability inferred from data and p_model is that predicted by the model.)
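The per-qubit channels above can be sketched with standard amplitude- and phase-damping Kraus operators built from T_1 and T_2 (a textbook construction, not the paper's code). The composition E_a ∘ E_p is the set of pairwise Kraus products, and trace preservation survives the composition:

```python
import numpy as np

def damping_kraus(t, T1, T2):
    """Kraus operators of amplitude damping composed with phase damping for
    one qubit, built from its T1 and T2 times (standard forms; requires
    T2 <= 2*T1 so the pure-dephasing time is positive)."""
    Tphi = 1.0 / (1.0 / T2 - 1.0 / (2.0 * T1))   # pure-dephasing time
    ga = 1.0 - np.exp(-t / T1)                   # amplitude-damping probability
    gp = 1.0 - np.exp(-2.0 * t / Tphi)           # phase-damping probability
    Ea = [np.array([[1.0, 0.0], [0.0, np.sqrt(1 - ga)]]),
          np.array([[0.0, np.sqrt(ga)], [0.0, 0.0]])]
    Ep = [np.array([[1.0, 0.0], [0.0, np.sqrt(1 - gp)]]),
          np.array([[0.0, 0.0], [0.0, np.sqrt(gp)]])]
    # Composition E_a ∘ E_p: all pairwise Kraus products.
    return [A @ B for A in Ea for B in Ep]

ks = damping_kraus(t=1e-6, T1=80e-6, T2=60e-6)
completeness = sum(K.conj().T @ K for K in ks)
assert np.allclose(completeness, np.eye(2))      # channel is trace-preserving
```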

D. Procedures for Experiments evaluating HAL Algorithm
Having described the cross-resonance Hamiltonian and the different noise sources or non-idealities on the IBM Quantum devices that we consider for our application, we are now in a position to describe the HAL algorithm for this particular application. We first discuss the details of the HAL-FI algorithm implemented for learning the CR Hamiltonian on the IBM Quantum devices. As the implementation of HAL-FIR for this application is very similar, it is omitted. We then describe our estimation procedure for solving the MLE problem to estimate the Hamiltonian parameters from the training data generated during learning.

Implementation of HAL-FI Algorithm for Learning CR Hamiltonians
Let us review the details of the implementation of the HAL-FI algorithm in the context of our experiments on learning the CR Hamiltonian. We consider the following inputs. The initial query space, which may be changed during the course of the HAL-FI algorithm during training, is Q^(0) = M × U × T^(0), with M and U as described in Section IV B 2. Here, we explicitly denote the superscript on T, which is initially set to T^(0) but may change during training if an adaptive query-space strategy is employed. We set T^(0) to be the 81 equispaced times in the interval [10⁻⁷, 6 × 10⁻⁷] s. We consider the initial number of queries as N in Line 9. The learning is continued until our query budget is expended or the desired learning error is achieved.
To determine the initial parameter estimate θ̂^(0) from the initial set of training examples (Line 4) and subsequent θ̂^(i) from (X^(i), Y^(i)) (Line 11), we solve the MLE (Eq. 4) for the CR gate incorporating the Hamiltonian model description and the presence of the different noise sources. The parameter estimate θ̂ is then given by Eq. 58 or Eq. 59, depending on how the readout noise is modeled. The imperfect pulse-shaping model Δt_eff(θ) was discussed in Section IV C 2, and p_d(t) is the probability of depolarization associated with the two-qubit decoherence model (Section IV C 3). The choice of MLE (Eq. 58 or Eq. 59) for each IBM Quantum device will be specified in Section V. The parameter estimates θ̂ obtained after solving the MLE problem are used by HAL-FI to construct the Fisher information matrices based on the model and to obtain the query distribution q^(i) by solving the SDP of Eq. 21 in Line 6. The expressions for the Fisher information matrices of the different queries (Section IV B 2), considering the CR Hamiltonian and noise models, are given in Appendix B. The query space Q^(i) used in Lines 6 and 7 depends on the querying strategy and hence on the learning scenario we consider.
As described in Section III A, we can define four different learning scenarios based on the presence of an active learner and how the query space is adaptively changed during learning. In passive learning, an active learner is not present and we set the query distribution to the uniform distribution over Q. In the case of active learning with a fixed query space, the query space remains fixed during training, i.e., Q^(i) = Q for all i, and the query distribution is determined by solving Eq. 21 using the current estimate of the parameters θ̂ over this fixed query space. When considering active learning with an adaptively growing query space, we consider two cases on the basis of how the query space is changed between batches during learning: (i) Q^(i) grows linearly, by linearly increasing T^(i) between batches, and (ii) Q^(i) grows exponentially, by doubling the allowed set of system interaction times T^(i). The query distribution is then determined by solving the corresponding SDP problem of Eq. 21 over the query space Q^(i) corresponding to the ith batch. These different learning scenarios for HAL-FI are summarized in Table III.

The MLE problem (Eq. 58 and Eq. 59) for the two different parameterizations J and Λ is nonlinear and non-convex. An example of the energy landscape of the log-likelihood loss function for the IBM Quantum device ibmq boeblingen is shown in Figure 17 of Appendix D. The presence of multiple local minima makes the MLE problem in general challenging to solve. To ensure that we converge to the global minimum and do not get stuck in a local minimum during estimation, we divide the estimation into multiple stages. In the first stage, we obtain an initial crude estimate which is refined over subsequent stages.
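The linear and exponential query-space growth strategies summarized in Table III can be sketched as follows (the exact growth rules are assumptions for illustration):

```python
import numpy as np

def grow_linear(T):
    """Extend the time grid by one more equispaced step (assumed rule)."""
    return np.append(T, 2 * T[-1] - T[-2])

def grow_exponential(T):
    """Double the maximum interaction time by appending a shifted copy of
    the grid (assumed rule)."""
    return np.concatenate([T, T + T[-1]])

T0 = np.linspace(1e-7, 6e-7, 81)     # initial time grid T^(0)
T_lin = grow_linear(T0)
T_exp = grow_exponential(T0)

assert T_lin.size == T0.size + 1
assert np.isclose(T_exp.max(), 2 * T0.max())   # maximum time doubled
```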
The estimation procedure summarized in Algorithm 4 is divided into two main stages: in the first stage, we use Rabi oscillations (Eq. 41) inferred from the experimental data to come up with an initial estimate for the MLE solve in the second stage. It proceeds as follows. Let us consider the parameterization Λ. This is more useful to work with, as the frequencies of oscillation in e^{−iHt} are determined by ω_0 and ω_1, unlike J where they are determined by all the parameters. This provides a way to create initial estimates of ω_{0,1} independent of the other parameters.
We compute the Rabi oscillations p̂_rabi for each query x from the experimental data by solving the optimization problem of Eq. 38. This serves as input to the initial estimation procedure. Initial estimates of the parameters ω_{0,1} are then obtained by applying a discrete (fast) Fourier transform to the Rabi oscillations. These initial estimates are refined by fitting nonlinear regression equations of the form A cos(ωt) + B sin(ωt) + C to the Rabi oscillations, where the fit minimizes the total L² error, the coefficients A, B, and C for each Rabi oscillation are estimated using linear least-squares regression, and a bracketed gradient-based search refines the estimates of ω_{0,1}. The corresponding optimization problem can be framed as minimizing the residual error
min E(A, ω_{0,1}) = Σ_{t∈T} (p̂_rabi(t) − A·Ω(ωt))²   (60)
where the coefficients A = (A, B, C) are known functions of δ_{0,1}, φ_{0,1} through the analytical forms of the Rabi oscillations for the query space considered (see Eq. 41) and Ω(ωt) is a vector of cosines and sines (fully described in Appendix D). We can then obtain initial estimates for δ_{0,1}, φ_{0,1} from the A, B, and C coefficients of the nonlinear regression equations for each Rabi oscillation. Finally, we obtain an initial estimate Λ̂_0 for Λ by fixing the values of ω_{0,1} and carrying out a gradient descent procedure using the same cost function. This is used as the initial condition for the MLE solve (Eq. 58). We denote the negative log-likelihood loss function that appears in the optimization problem of Eq. 58 by L. We solve the MLE problem using the stochastic gradient optimizer ADAM [75], which encourages escaping any local minima. The parameter estimate produced by ADAM is then refined using the second-order quasi-Newton method L-BFGS-B [76]. The computational complexity of our estimation procedure is dominated by ADAM.
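The first stage of this procedure, an FFT seed refined by least-squares fits of A cos(ωt) + B sin(ωt) + C, can be sketched on synthetic Rabi data. A grid search around the seed stands in for the bracketed gradient-based search; all names and the synthetic signal are illustrative.

```python
import numpy as np

# Synthetic Rabi signal (assumed model A cos(wt) + B sin(wt) + C plus noise).
rng = np.random.default_rng(0)
t = np.linspace(1e-7, 18e-7, 243)
w_true = 2 * np.pi * 4e6
rabi = 0.8 * np.cos(w_true * t) + 0.1 + 0.02 * rng.standard_normal(t.size)

# Stage 1: FFT-based seed for the oscillation frequency.
spec = np.abs(np.fft.rfft(rabi - rabi.mean()))
freqs = np.fft.rfftfreq(t.size, d=t[1] - t[0])
w0 = 2 * np.pi * freqs[np.argmax(spec)]

# Stage 2: refine w; A, B, C are solved by linear least squares at each w.
def residual(w):
    X = np.column_stack([np.cos(w * t), np.sin(w * t), np.ones_like(t)])
    coef, *_ = np.linalg.lstsq(X, rabi, rcond=None)
    return np.sum((X @ coef - rabi) ** 2)

# Scan +/- one FFT bin around the seed (stand-in for a bracketed search).
ws = w0 + 2 * np.pi * freqs[1] * np.linspace(-1, 1, 401)
w_hat = ws[np.argmin([residual(w) for w in ws])]

assert abs(w_hat - w_true) / w_true < 0.01   # refined well below FFT resolution
```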
This motivates us to directly use L-BFGS-B after initial estimation for later batches during learning. The full description, computational details, and extensions of the estimation procedure are given in Appendix D.

V. RESULTS

We now present the results of the experiments described in Section IV and show that they support the claims made in Section II A. In this section, we compare the active learner HAL introduced in Section III against the passive learner (Section II C 1 d) in learning the CR Hamiltonian through experiments on different IBM Quantum devices, and finally analyze the results of these experiments to affirmatively answer the learning problems proposed in Section II A.

Algorithm 4 Estimation Procedure for Hamiltonian Learning
In Section V A, we first describe the datasets used for Hamiltonian learning and the experimental protocol for evaluating the performance of the learners. In Section V B, we show results of the learners on these different datasets.
In Section V C, we describe how HAL-FI can be used to achieve Heisenberg (or super-Heisenberg) limited scaling and evaluate the query advantage of HAL-FI over the baseline considering different learning scenarios.

A. Data and Experiment Protocol
In this section, we describe the different kinds of datasets that were used in assessing the performance of the HAL-FI algorithm and how they were collected. We then summarize the parameters of the cross-resonance Hamiltonian and the noise sources discussed in Section IV C for the different IBM Quantum devices (Section IV B 1).

Datasets from IBM Quantum Devices
The different datasets that we use for Hamiltonian learning are a combination of experimental data collected from the IBM Quantum devices described in Section IV B 1 and that collected from a simulator which we will describe in Section V A 3.
Experimental data is collected from the different IBM Quantum devices according to the query space described in Section IV B 2. The set of evolution times T is set to the 81 equispaced times in the interval [10⁻⁷, 6 × 10⁻⁷] s. For IBM Quantum devices A, B, and C, there are 200 measurement outcomes (or shots) for each query x ∈ Q. For IBM Quantum device D (ibmq boeblingen), there are 512 measurement outcomes for each query. Recall from our discussion of the Hamiltonian learning framework in Section II A and of the HAL algorithm in Section III B that the outputs of our queries are not expectation values but rather single-shot readouts of the target qubit.
The experimental data is then collected and made available as an offline dataset that the active learner can query. Unlike deploying an active learner in real time, where a particular query can be made to the system an unlimited number of times, using experimental datasets imposes an additional constraint on the number of times a query can be made by the active learner, due to the limited number of measurement outcomes available for each query. In Appendix C, we discuss how we handle this constraint during query optimization.

Parameters of the CR Hamiltonian and Noise Sources for IBM Quantum Devices
Considering the entire collected experimental datasets for each IBM Quantum device (Section IV B 1) as training data, we compute the Hamiltonian parameters J/Λ, and those of the different noise sources, using the estimation procedure specified in Section IV D 2 for the MLE. We solve the MLE of Eq. 58 for IBM Quantum device D, which has very low readout noise, and use the MLE of Eq. 59 for the other IBM Quantum devices. A summary of the estimated parameters of the CR Hamiltonian and noise sources on IBM Quantum device D (ibmq boeblingen) under different drive configurations is given in Table IV. We summarize the estimated parameters for the other devices in Appendix E.

Datasets from Simulation
The experimental datasets in Section V A 1 contain noise sources other than those modeled, even if negligible, and may not span a long enough time range T for testing different learning scenarios. In order to understand the behavior of the HAL-FI and HAL-FIR algorithms when all the noise sources are known, we set up a simulator. The advantage of using a simulator over experimental data is that it allows us to assess the limits of the performance of the HAL-FI algorithm in the presence or absence of different noise sources, such as decoherence. Studies carried out on the simulator also allow us to test the robustness of the active learner in the presence or absence of these noise sources.
The simulator is an oracle which imitates the different quantum devices, but where all the different noise sources are known and perfectly modeled, and which we can query. The Hamiltonian of the simulator is set to that learned (Table IV caption: We give the Hamiltonian parameters in the parameterization J and the physically relevant frequency components in Λ. The readout noise is defined by the parameters r_0 and r_1, the conditional probabilities of a bit flip given measurement outcomes y = 0 and y = 1 respectively; see Section IV C 1.)
from the full set of training examples contained in an experimental dataset collected from a particular quantum device. Thus, we can have simulators for each of the IBM Quantum devices A, B, C, and D, under different drive configurations. All the modeled noise sources of readout noise, imperfect pulse-shaping, and decoherence as described in Section IV C are included. We apply HAL-FI in real-time on the simulator as there is no limitation on the number of times we select a particular query x ∈ Q.
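A minimal mock of such an oracle (all names and the stand-in model probability are illustrative): draw ideal single-shot outcomes from a model probability and pass them through the bit-flip readout channel with rates r_0, r_1.

```python
import numpy as np

rng = np.random.default_rng(1)

def oracle_shot(p0, r0, r1):
    """One single-shot readout: sample the ideal outcome from p(y=0|x) = p0,
    then flip it with probability r0 (if y=0) or r1 (if y=1)."""
    y = 0 if rng.random() < p0 else 1       # ideal measurement outcome
    flip_prob = r0 if y == 0 else r1        # bit-flip readout channel
    return y ^ int(rng.random() < flip_prob)

shots = [oracle_shot(p0=0.7, r0=0.02, r1=0.05) for _ in range(20000)]
# Expected P(readout = 1) = 0.3 * (1 - 0.05) + 0.7 * 0.02 = 0.299
assert abs(np.mean(shots) - 0.299) < 0.02
```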
In Figure 8, we compare Rabi oscillations from the two different oracles based on IBM Quantum device D (ibmq boeblingen) under drive configuration 2. A set of 46800 training examples is generated from each oracle assuming a uniform query distribution over the query space. This is used to learn Hamiltonian parameters in each case. The small difference between the model Rabi oscillations predicted from the Hamiltonian parameters learned on the two oracles further indicates that the simulator faithfully represents the collected experimental datasets. In the subplots (a)-(c), we plot Rabi oscillations (or their difference) for different measurement operators M (rows), preparation operators U (colors), and evolution times t (x-axis), corresponding to the query space described in Section IV B 2.

Protocol for Comparing Performance of Hamiltonian Learning Methods
Query complexity is used for comparing the performance of different learners on the simulator or on the oracle with access to an experimental dataset. In particular, our main goal is to extract the scaling of the query complexity with respect to the root mean square error (RMSE) of the Hamiltonian parameters J. We compute the empirical RMSE at each round of learning as
RMSE = √( (1/N_runs) Σ_{r=1}^{N_runs} Σ_i ((θ̂_i^{(r)} − θ̄_i)/ξ_i)² ),
where we approximate the expectation by an average over parameter estimates from N_runs runs, and the true parameter values θ by the mean θ̄ of these runs. The normalization factors ξ_i are selected to be 10⁶ s⁻¹ for all i, as the components of J with highest magnitude are expected to be of the order of 10^{6±1} [68] for these IBM Quantum devices under these drive configurations. We implement the following experimental protocol. For each of the quantum devices described in Section IV B 1, we compute the empirical RMSE of the learners from 200 independent runs on the simulator and 500 independent runs on the experimental dataset. This number of runs on each oracle was required to obtain accurate scalings of the trends in RMSE with number of queries, and to ensure the uncertainty (two standard deviations) of each scaling was at most 10%. In each run, we carry out the Hamiltonian learning algorithm for the different learning scenarios detailed in Section IV D 1 and summarized in Table III. Additionally, we track the testing error of the learner with the number of queries N. The trends so obtained are used to comment on the robustness of the estimation procedure used for the MLE (Section IV D 2) and on the benefits of using the active learner HAL-FIR for making predictions of queries to the Hamiltonian (Problem II.2) over a baseline. The testing error is likewise computed empirically on a testing dataset collected from the simulator or experimental dataset using p_test.
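One plausible reading of this empirical RMSE, with estimates compared to their mean (as a proxy for the true θ) and normalized by ξ_i = 10⁶ s⁻¹, can be sketched as:

```python
import numpy as np

def empirical_rmse(estimates, xi=1e6):
    """Empirical RMSE over independent runs (one plausible reading):
    estimates has shape (n_runs, n_params); the run mean stands in for
    the unknown true parameters; xi normalizes each component."""
    est = np.asarray(estimates)
    center = est.mean(axis=0)                       # proxy for true theta
    return np.sqrt(np.mean(np.sum(((est - center) / xi) ** 2, axis=1)))

# Synthetic runs: 6 parameters of order 1e6 with spread 1e4.
rng = np.random.default_rng(2)
est = 3e6 + 1e4 * rng.standard_normal((500, 6))
r = empirical_rmse(est)
assert 0 < r < 1                                   # normalized spread is small
```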

B. Performance of Hamiltonian Learning Methods
In this section, we assess the performance of the different Hamiltonian learning methods introduced so far using the protocol described in Section V A 4 in tackling the learning problems posed in Section II A. We report results of the different Hamiltonian learning scenarios (Table III): (i) baseline of passive learner with estimation based on FFT and regression, (ii) passive learner with an estimation procedure to solve the MLE problem, (iii) active learner in fixed query space, and (iv) active learner in an adaptively growing (linearly/exponentially) query space. We use the HAL-FI algorithm which was discussed in Section II C in the latter active learning approaches.
We present the scaling behavior of each algorithm during learning under different regimes. For each scenario, we show trends of learning error (RMSE) with number of queries. These trends indicate the performance of each learning algorithm, culminating in evidence of query advantage. For brevity of presentation, we focus on the results obtained from IBM Quantum Device D (ibmq boeblingen) under the drive configuration 2.

Scaling Behavior and Convergence Studies
In this section, we assess the trends of learning error with number of queries for our baseline, passive learner, and active learner (HAL-FI). We consider HAL-FI in a fixed query space and a linearly growing query space. Results of the exponentially growing query space are postponed to Section V C 1 where we comment on the achievability of Heisenberg limited scaling.
We focus on the Hamiltonian learning task of Problem II.1 and touch upon Problem II.2 to show that HAL is a general framework for tackling both problems when equipped with the appropriate query optimization. We show results on the simulator for two different choices of the time range T of the query space Q. For successful frequency detection using the FFT [77,78], the time range must be long enough to contain a single cycle of the sinusoid corresponding to the lowest frequency. We call this the minimum-frequency criterion (MFC). We thus show results on the simulator where T is such that the MFC for frequency estimation (using the FFT) is satisfied, and where the MFC is not necessarily satisfied. The minimum-frequency criterion is not to be confused with the Nyquist criterion, which concerns the required sampling rate. This is a buildup to our comparison on the experimental dataset, where T is such that the MFC is not necessarily satisfied. In each case, we plot the empirical learning error (RMSE) versus the number of queries made on a log-log scale, so that the slope s of the plotted lines can be directly interpreted as the scaling of learning error with the number of queries, ∼ N^s.
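A sketch of the MFC as we read it, requiring the time window to cover one full period 2π/ω_min of the slowest oscillation (this formalization is an assumption):

```python
import numpy as np

def satisfies_mfc(t_min, t_max, omega_min):
    """Minimum-frequency criterion: the window [t_min, t_max] must contain
    at least one full period of the slowest oscillation omega_min (rad/s)."""
    return (t_max - t_min) >= 2 * np.pi / omega_min

# An 81-point grid on [1e-7, 6e-7] s covers one period of omega = 2*pi/(5e-7),
# but not of the slower omega = 2*pi/(1e-6).
assert satisfies_mfc(1e-7, 6e-7, 2 * np.pi / 5e-7)
assert not satisfies_mfc(1e-7, 6e-7, 2 * np.pi / 1e-6)
```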
a. Simulator, minimum-frequency criterion satisfied We consider the query space as defined in Section IV B 2 with T set to the 243 equispaced times in the interval [10⁻⁷, 18 × 10⁻⁷] s. A comparison of different learners for Hamiltonian learning over this query space is shown in Figure 9(a). For both the baseline and the passive learner, we observe a scaling of ∼ 1/√N (or N ∼ ε⁻²) in RMSE with the number of queries. This is in agreement with the SQL. The approximately constant gap between the baseline and the passive learner corresponds to a constant query reduction when using an estimation procedure based on solving the MLE rather than one based on the FFT and regression. Thus, a query advantage can be obtained simply by changing estimation procedures for Hamiltonian learning.
We observe two different scalings for the HAL-FI algorithm, an initial scaling which is higher than SQL and similar to Heisenberg limited scaling, and a scaling of SQL in the asymptotic query regime. This shows that depending on the desired learning error, we can expect to see a higher rate of convergence. However, asymptotically the number of queries N required by the active learner HAL-FI is a constant fraction of that required by the passive learner.
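The slope s can be extracted from such log-log trends by a linear fit; a sketch on synthetic SQL-like data (s = −1/2, with illustrative multiplicative noise):

```python
import numpy as np

# Synthetic RMSE-vs-N trend following the standard quantum limit, s = -1/2.
rng = np.random.default_rng(3)
N = np.logspace(2, 5, 20)
rmse = 5.0 * N ** -0.5 * np.exp(0.02 * rng.standard_normal(N.size))

# Linear fit on the log-log scale: the slope estimates the scaling exponent.
s, _ = np.polyfit(np.log(N), np.log(rmse), 1)
assert abs(s - (-0.5)) < 0.05
```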
Additionally, we compare the baseline, passive learner, and active learner HAL-FIR for tackling Problem II.2 in Figure 9(b). The testing error is computed using Eq. 19 on a testing dataset of 10⁵ i.i.d. samples. The testing distribution p_test is taken to be known and set to the uniform distribution over the query space Q. It is assumed that HAL-FIR is given access to this testing distribution. We observe a scaling of ∼ 1/N for the baseline and the passive learner, which is expected as the log-likelihood loss function associated with θ̂ is divergence-free in the asymptotic case [37]. As in the case of HAL-FI for Hamiltonian learning, we observe a higher initial scaling for HAL-FIR and a scaling of ∼ 1/N in the asymptotic regime, consistent with that observed for the passive learner. As discussed in Section II A, it is not typical for the testing distribution in Problem II.2 to be known, and the result here can be viewed as a validation of the Hamiltonian model learned, and hence of the learners, on a set of queries sampled from the testing distribution.
We note sudden peaks in the uncertainty associated with the passive learner and HAL-FIR at larger numbers of samples. This might be indicative of traveling between multiple local minima within a larger convex hull when solving the MLE (Eq. 58). The trends in testing error are not severely impacted in expectation, indicating that these are rare events and that our learners (with their estimation procedure) are robust.
Having made a case for the robustness of the learners and estimation procedures used in this work on the simulator, considering a query space Q which satisfies the MFC, we are now in a position to compare the performance of the different learners when Q is not guaranteed or known to satisfy this criterion. In the results that follow, the observations made here serve as a basis for the behaviour to expect from the learners.
b. Simulator, minimum-frequency criterion not necessarily satisfied In practice, the range of system evolution times T corresponding to the query space Q cannot be known a priori to satisfy the MFC for frequency estimation using the FFT on Rabi-oscillation data. It is then crucial for learners equipped with estimation procedures to either: (i) succeed at Hamiltonian learning given this query space, or (ii) alert the user that a longer T is required upon failure to learn a Hamiltonian model. Here, we ensure the former by modifying the standard FFT routine as discussed earlier in Section IV D 2; further details are provided in Appendix D. Note that this solve is also carried out as the first step of our estimation procedure, to obtain initial conditions for the MLE solve.
A comparison of the different learners on the simulator, considering a T that does not satisfy the Minimum-Frequency criterion, is shown in Figure 10(a). We set T to be the set of 81 equispaced times in the interval $[10^{-7}, 6 \times 10^{-7}]$ s. As obtained earlier on the simulator with a longer T, we observe an SQL scaling of $\sim 1/\sqrt{N}$ in RMSE with the number of queries (i.e., $N \sim \epsilon^{-2}$) for both the baseline and passive learner. However, there is now a noticeably wider gap between the trends, corresponding to around an 80.6% reduction in queries when using the passive learner over the baseline.
For the HAL-FI algorithm, we consider the cases of a fixed query space and of a query space adaptively grown by linearly extending T. For both cases, we see an initial scaling which is steeper than SQL and similar to Heisenberg limited scaling, and SQL scaling in the asymptotic query regime. This behaviour is expected from our earlier observations on the simulator. We note that HAL-FI combined with a linearly growing query space does not show significant improvement over HAL-FI in the fixed query space. This is due to the low rate at which we adaptively grow the query space during learning and the fact that growing the query space is only advantageous for learning a subset of the Hamiltonian parameters. We discuss this further in Section V C 1.
c. Experimental Dataset We show a comparison of the performance of the different learners on the oracle with access to experimental data in Figure 10(b). As expected from our observations on the simulator, we observe SQL-like scalings of RMSE with N for both the baseline and passive learner. We observe around an 82.2% query reduction when using the passive learner over the baseline, similar to that previously observed on the simulator. This query reduction was computed by fixing the RMSE value at 0.2 and comparing the number of queries required by the passive learner to achieve this RMSE against the baseline. The trends themselves are remarkably similar to those obtained on the simulator, supporting the conclusion that the main noise sources affecting the quantum device were identified and that the simulator is a good representation of the real quantum hardware. For HAL-FI, we only show results for the fixed query space, as the results for the linearly growing query space are very similar, as was also observed on the simulator. We see an initial scaling of $\sim 1/N^{3/2}$ in RMSE with the number of queries (i.e., $N \sim \epsilon^{-2/3}$), which is steeper than both SQL and Heisenberg limited scaling, and SQL scaling in the asymptotic query regime. The performance of HAL-FI on the experimental data is surprisingly better than that on the simulator. This behaviour is consistent across different drive configurations on ibmq boeblingen and is not particular to this oracle. The initially accelerated learning, where we observe super-Heisenberg limited scaling in HAL-FI and lower values of RMSE achieved at smaller N than on the simulator, might be due to: (i) noise sources not included in our model that do not contribute significantly to the noise affecting the quantum device but encourage exploration by HAL-FI, and (ii) correlations between the samples collected in the experimental data (e.g., due to thermal fluctuations).
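The scaling exponents quoted above can be extracted from empirical RMSE-versus-N curves by a least-squares fit in log-log space. A minimal sketch (function name and synthetic data are illustrative):

```python
import numpy as np

def scaling_exponent(n_queries, rmse):
    """Least-squares slope of log(RMSE) vs log(N).

    An exponent near -1/2 indicates SQL (RMSE ~ 1/sqrt(N)), near -1
    Heisenberg-limited, and below -1 super-Heisenberg scaling."""
    logn = np.log(np.asarray(n_queries, dtype=float))
    logr = np.log(np.asarray(rmse, dtype=float))
    slope, _intercept = np.polyfit(logn, logr, 1)
    return slope

# Synthetic SQL-limited data: RMSE = 3 / sqrt(N)
N = np.array([100, 300, 1000, 3000, 10000])
slope = scaling_exponent(N, 3 / np.sqrt(N))
```

In practice one would fit the initial and asymptotic regimes separately, since (as observed above) the exponent changes with N.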
We have already seen benefits of using an active learner over a baseline strategy or a passive learner through the lower values of RMSE that can be achieved. This is analyzed in terms of query advantage in Section V C 2.

C. Analysis
In this section, we analyze the performance of the different learners from Section V B. We first comment on the achievability of Heisenberg limited scaling by HAL-FI with an adaptively growing query space, as claimed in Section II B. In the process, we consider a different learning scenario motivated by recalibrations of quantum devices. We then describe the query advantage of the active learner over the baseline strategy under different conditions.

1. Heisenberg limited scaling
In Section V B, we did not observe Heisenberg limited scaling for HAL-FI (even with an adaptively growing query space). This is because the query space is not rich enough to achieve Heisenberg limited scaling, i.e., there is no sequence of queries, even in the adaptively growing query space, that achieves Heisenberg limited scaling. We discuss when Heisenberg limited scaling is achievable for Hamiltonians based on the CR Hamiltonian in Appendix F. We show through another set of experiments in Appendix E that the behavior of the learners observed so far is expected.
It should, however, be possible to achieve Heisenberg limited scaling for a subset of Hamiltonian parameters given the query space (see Section IV B 2) when the task is to learn this subset and we are given access to information about the remaining Hamiltonian parameters. This is exactly the setting of a recalibration, where prior information about the Hamiltonian parameters is available from previous calibrations and the goal is to learn the subset of parameters which drift significantly with time. Motivated by this, we consider the following learning scenario on ibmq boeblingen under drive configuration 3 (see Table IV).
We have access to an estimate of the Hamiltonian parameters θ̂ from a previous calibration, during which Hamiltonian learning was run on a uniformly sampled set of queries from Q (of size N = 2430). The goal is then to learn the parameters $\omega_{0,1}$ using the different learners at our disposal. We plot comparisons of the different learners (using $N_b$ = 972) on this recalibration task in Figure 11, considering the oracles of the simulator and experimental data.
We observe SQL scaling for the baseline, the passive learner, and HAL-FI in the fixed query space. There is a nearly constant gap between the baseline strategy and HAL-FI in the fixed query space, indicating a constant query reduction in achieving a desired learning error. We observe a super-Heisenberg limited scaling of RMSE $\sim N^{-3/2}$ with the number of queries for HAL-FI with a linearly growing query space. This is also observed for HAL-FI with an exponentially growing query space in the low sample regime. The deterioration of the scaling of HAL-FI with an exponentially growing query space to the SQL is due to the maximum evolution time of the growing query space eventually exceeding $T_1$ and $T_2$. This is shown in Figure 11(b). In fact, the bend in the trend of RMSE versus N for HAL-FI with an exponentially growing query space occurs immediately after the maximum evolution time in T exceeds $T_1$. As T for HAL-FI with a linearly growing query space grows much more slowly, the effects of decoherence are not yet felt and a super-Heisenberg limited convergence rate in learning error is achieved.
Qualitatively, it is clear that much lower values of learning error can be achieved with a given budget of queries using HAL-FI with an adaptively growing query space over the baseline for recalibration. This is quantified in terms of query advantage in Section V C 2.

2. Query Advantage
We have so far compared the trends and obtained the scalings of RMSE with number of queries N for different learners for Hamiltonian learning (Section V B) and with prior information (Section V C 1). To quantify the benefits of using HAL-FI over a baseline strategy or a passive learner, we now show the corresponding query advantage (defined in Section II C 4), a performance metric that summarizes the reduction in resources required to achieve a desired learning error.
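Concretely, the query advantage at a fixed learning error can be computed from two empirical RMSE-versus-N curves by interpolating each curve in log-log space and comparing the query counts. A minimal sketch (function names and synthetic curves are illustrative, not the paper's data):

```python
import numpy as np

def queries_to_reach(rmse_curve, n_curve, target_rmse):
    """Interpolate (in log-log space) the number of queries a learner
    needs to reach a target RMSE, given its empirical RMSE-vs-N curve."""
    order = np.argsort(rmse_curve)          # np.interp needs increasing x
    logn = np.interp(np.log(target_rmse),
                     np.log(np.asarray(rmse_curve)[order]),
                     np.log(np.asarray(n_curve)[order]))
    return np.exp(logn)

def query_advantage(baseline, active, target_rmse):
    """QA = 1 - N_active / N_baseline at a fixed learning error."""
    n_b = queries_to_reach(*baseline, target_rmse)
    n_a = queries_to_reach(*active, target_rmse)
    return 1.0 - n_a / n_b

# SQL-limited baseline (RMSE ~ N^-1/2) vs a learner with a 10x offset:
N = np.logspace(2, 6, 50)
qa = query_advantage((1 / np.sqrt(N), N), (1 / np.sqrt(10 * N), N), 1e-2)
```

For these synthetic power-law curves the active learner needs one tenth of the queries at any fixed RMSE, so the computed query advantage is 0.9.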
We plot the query advantage of HAL-FI over the baseline strategy for Hamiltonian learning on the simulator and experimental data in Figure 12(a). When computing the query advantage of a learner over the baseline, we consider that they see the same oracle i.e., the simulator or experimental data. We observe a query advantage of around 80% in HAL-FI on the simulator and experimental data over the baseline, for high values of RMSE. The initial accelerated learning observed in Figure 10 for HAL-FI on the simulator and experimental data translates to an accelerated query advantage for high values of RMSE. The trend in query advantage for HAL-FI on the simulator flattens to an asymptotic value of at least 95.1% when the query space is fixed during learning and at least 96.3% when the query space is grown linearly. The corresponding value for HAL-FI on the experimental data in the fixed query space is 99.8% for low values of RMSE. While not shown, the query advantage of HAL-FI on the experimental data over a passive learner is at least 99.1% for low values of RMSE.
Similarly, for the learning scenario of recalibration, we plot the query advantage of HAL-FI over the baseline strategy for Hamiltonian learning with prior information in Figure 12(b). The trends shown correspond to the simulator, as similar behavior is observed for the experimental data. The query advantage of HAL-FI in a fixed query space over the baseline is constant with RMSE, at around 99.6%, i.e., HAL-FI requires only 0.4% of the queries required by the baseline. For HAL-FI with a linearly growing query space, the query advantage for low values of reported RMSE is around $(100 - 3.2 \times 10^{-3})$%. Further, we note that the query advantage QA of HAL-FI in a linearly growing query space scales with the RMSE $\epsilon$ as QA $\sim (1 - O(\epsilon^{-1/6}))$ until decoherence starts affecting the scaling. For HAL-FI with an exponentially growing query space, the query advantage flattens to around $(100 - 5.4 \times 10^{-4})$% for low values of RMSE. Under this learning setting, at low values of RMSE around $10^{-3}$, HAL-FI has a query space with evolution times far exceeding $T_1$ or $T_2$, and thus this reported query advantage is expected to persist at even lower values of RMSE.
From the analysis of query advantage above, we observe that a two-orders-of-magnitude reduction in queries can be obtained over a baseline strategy by adopting HAL-FI. Moreover, during recalibrations, we observe a three-orders-of-magnitude reduction when HAL-FI is used with a linearly growing query space, and a five-orders-of-magnitude reduction when HAL-FI is used with an exponentially growing query space.
Adopting HAL-FI not only achieves a query reduction but also saves wall-clock time in calibrating a quantum device. We again consider the query space of Section IV B 2, where a query on average takes 2400 ns to run (accounting for the duration of the measurement pulses). The repetition rate of current IBM Quantum devices for executing circuits is 10 kHz. We can then reduce the duration of Hamiltonian learning of all CR Hamiltonians of directly connected qubit pairs on the 20-qubit IBM Quantum device ibmq boeblingen, to reach an RMSE of $5 \times 10^{-2}$, from around 10 minutes to 5 seconds by using HAL-FI instead of the baseline strategy. These timings only account for time on the quantum hardware and not additional latencies in the classical electronics interfacing with the hardware.
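This wall-clock estimate amounts to dividing the number of queries by the repetition rate: at 10 kHz, each circuit execution occupies ~100 µs of hardware time, which dominates the ~2.4 µs query duration itself. A back-of-envelope sketch (the query counts below are illustrative placeholders, not the paper's figures):

```python
REP_RATE_HZ = 10_000   # stated repetition rate of the device

def wall_clock_seconds(n_queries, rep_rate_hz=REP_RATE_HZ):
    """Hardware time for n_queries circuit executions, assuming the
    repetition rate (not the 2.4 us pulse itself) is the bottleneck."""
    return n_queries / rep_rate_hz

# e.g. a 6e6-query baseline run vs a 5e4-query HAL-FI run (illustrative):
baseline_s = wall_clock_seconds(6_000_000)   # 600 s, i.e. ~10 minutes
hal_fi_s = wall_clock_seconds(50_000)        # 5 s
```

The ratio of the two durations equals the query-reduction factor, since the per-query time is the same for both learners.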

VI. CONCLUSION
In this paper, we proposed the active learning algorithms HAL-FI for Hamiltonian learning and HAL-FIR for prediction of queries to a Hamiltonian sampled from a testing distribution. The performance of HAL-FI/HAL-FIR was compared against different learners for learning a CR Hamiltonian of the 20-qubit IBM Quantum device ibmq boeblingen, on a simulator and on an oracle with access to experimental data. We showed that HAL-FI can achieve a query advantage of around 99.8% compared to a baseline strategy, and 99.1% over a passive learner, for low values of learning error on experimental data. During recalibration, when learning the Hamiltonian with access to information from previous calibrations, we observed that HAL-FI can achieve query advantages of 99.5% over a baseline strategy. Further, we showed that a Heisenberg limited rate of convergence is achieved where possible when an active learner is used in conjunction with an adaptive query space, up until the evolution time of queries exceeds the qubit $T_1$ or $T_2$ and decoherence starts to deteriorate the information content available in queries.
Overall, using the active learner HAL-FI can yield a reduction in resources of up to two orders of magnitude during calibration and five orders of magnitude during recalibration, for low values of learning error. This improvement in query complexity has multiple practical consequences beyond accelerating Hamiltonian learning during calibration and recalibration of quantum computers.
Another important calibration step is determining controls to implement desired single- and multi-qubit quantum gates. This often relies on building a Hamiltonian model of the true environment, i.e., the quantum computer, which can be accomplished by HAL to ensure minimal queries are used. Gates, once implemented, are characterized through quantum process tomography, which is not query efficient but could be accelerated with an active learner. For the specific application of learning the CR Hamiltonian, more noise sources can be included as they become relevant, e.g., leakage errors, which become pronounced under strong driving. It would also be interesting to see if one can achieve asymptotic Heisenberg limited scaling even in the presence of noise such as decoherence by using appropriate quantum error correction protocols. It should be noted that Hamiltonian learning, and for that matter quantum process tomography, both suffer from an exponential scaling in the size of the quantum device n, and using an active learner only ensures a better scaling in $\epsilon$, i.e., a query advantage. One could obtain a better scaling in n if additional structural information were available, such as the Hamiltonian being a k-local Hamiltonian, which can be learned with a query complexity that scales as O(poly(n)) [79].
There are other calibration steps where Hamiltonian learning may not be required but the concept of an active learner can be introduced. This would be particularly advantageous where a variety of experiments can be carried out but it is not clear which of them are more informative for the learning task. For example, a query efficient method is desired for learning cross-talk on a superconducting quantum device [80,81].
Additionally, there is much room for improvement and extension of the algorithm itself. Currently, expert knowledge is required to specify a complete query space to HAL, to ensure that all the Hamiltonian parameters can be learned, possibly with Heisenberg limited scaling. It is desirable to remove this expert and replace them with a method to synthesize queries. The current active learning strategy is also based on the query criterion of Fisher information, but this could be modified to incorporate the cost of different queries and to increase exploration. Moreover, a more general query criterion could possibly be learned through reinforcement learning, as illustrated in recent work on classical applications [82].

Appendix B: Details of IBM Quantum Devices and the CR Hamiltonian Model

In this section of the Appendix, we give details of the different IBM Quantum devices employed for assessing the performance of the HAL algorithms (HAL-FI and HAL-FIR) proposed in Section III for learning cross-resonance (CR) Hamiltonians (Eq. 27). In Section IV C 2, we stated a model for the noise source of imperfect pulse-shaping; here, we describe how this model was obtained. This is then followed by the relevant analytical expressions of the likelihood and Fisher information, considering the query space of Section IV B 2.

TABLE VI. Notation specific to HAL Algorithms (HAL-FI/HAL-FIR)

Notation | Definition
$N^{(i)}_{tot}$ | Number of queries at the ith round
$q^{(i)}$ | Query distribution at the ith round
$Q^{(i)}$ | Query space at the ith round
$N_b$ | Batch size
$i_{max}$ | Maximum number of iterations
$\hat{\theta}^{(i)}$ | Estimate of the Hamiltonian model parameters at the ith round
$X^{(i)}$ | Set of all queries made up till the ith round (inclusive)
$Y^{(i)}$ | Set of all measurement outcomes up till the ith round (inclusive)
$X_q$ | Set of queries obtained from sampling Q according to query distribution q
$Y_q$ | Set of measurement outcomes obtained from the quantum system after making the queries in $X_q$

Description of IBM Quantum Devices
We consider CR Hamiltonians on the four different IBM Quantum devices described in Section IV B 1. The connectivity maps of these devices are shown in Figure 13. We consider CR gates on particular qubit pairs on each device, summarized in Table VII. In Table VII, we describe the properties of each qubit involved in the CR gate, including its $T_1$ and $T_2$ times and the average infidelity of single-qubit gates.

Modeling Pulse Shapes
We now describe how the imperfect pulse-shaping model stated in Section IV C 2 was obtained. We consider cross-resonance control pulses (also called GaussianSquare) whose time-varying amplitudes are rectangular-shaped envelopes with tapered rising and falling edges, where the tapering is designed to minimize the signal energy that falls above and below the frequency of the sinusoid that is being modulated by the pulse envelope. The resulting unitary operators thus have the form $\tilde{U}(t) = \exp(-i\,\mathcal{T}\int_0^t \tilde{H}(t')\,dt')$, where $\mathcal{T}$ is the time-ordering operator, $\tilde{H}$ is the Hamiltonian at any particular time, given by $\tilde{H}(t') = H(t', v(t'))$, $H$ is the cross-resonance Hamiltonian, and $v(t')$ is the time-varying pulse envelope.
Let $\Delta t_r$ and $\Delta t_f$ be, respectively, the durations of the rising and falling edges of the shaped pulse envelope. The central portion of $v(t')$ is then a rectangular function, $v(t') = V_{max}$ for $\Delta t_r \le t' \le t - \Delta t_f$, where $t$ is the total duration of the pulse and $V_{max}$ is the amplitude of this central rectangular portion. We thus have $\tilde{U}(t) = \tilde{U}_f \exp(-i H_{max} t_{expt}) \tilde{U}_r$, where $\tilde{U}_r$ and $\tilde{U}_f$ are the rotations generated by the rising and falling edges, $t_{expt} = t - \Delta t_f - \Delta t_r$, and $H_{max} = H(t', V_{max})$, which is constant assuming that any signal distortions introduced by the control electronics and/or along the signal path to the quantum device are negligible.
FIG. 14. Hamiltonian parameters J as a function of the amplitude of the control pulse or drive, as reported in Sheldon et al. [66].

The above equation decomposes $\tilde{U}(t)$ into the time evolution of the Hamiltonian corresponding to the central rectangular-pulse portion of the control pulse, with pre- and post-rotations determined by the tapered rising and falling edges of the pulse. In general, there is a nonlinear relationship between the shapes of these edges and the resulting pre- and post-rotations. However, based on the results reported in [66] and shown in Figure 14, the cross-resonance Hamiltonian parameters tend to vary fairly linearly with the overall pulse amplitude. The pre- and post-rotations can thus be approximated by assuming a first-order model for the time-varying Hamiltonian parameters, $J(t') = (v(t')/V_{max})\, J_{max}$, where $J_{max}$ is the vector of parameters for the Hamiltonian $H_{max}$ of the central rectangular portion of the pulse envelope. The time-varying Hamiltonian, and hence the overall unitary operator $\tilde{U}(t)$, is then approximated accordingly.

Likelihood Function and Fisher Information Matrix for the CR Hamiltonian
In this section, we give the expressions for the likelihood function $p_{y|x}(y|x;\theta)$ and the Fisher information (FI) matrix. We consider the experimental setup described in Section IV and the query space Q described in Section IV B 2.
Recall from Section II C 2 that the FI matrix of a query x is given by
$$I_x(\theta) = \mathbb{E}_y\!\left[\nabla_\theta \log p(y|x;\theta)\, \nabla_\theta \log p(y|x;\theta)^{\mathsf{T}}\right],$$
where $\log p(y|x;\theta)$ is the log-likelihood of the measurement outcome y given the query x. In most practical cases, the Fisher information matrix must be computed empirically in a Monte Carlo fashion. However, here we have available a model of the CR Hamiltonian and models of the different noise sources affecting the quantum system. We can thus evaluate the FI matrix analytically for different queries, in the presence or absence of noise.
a. In Absence of Noise

a. Likelihood In the noiseless case, the likelihood function of the different measurement outcomes $y \in \{0,1\}$ given a query $x = (M, U, t) \in Q$ follows from the Born rule. Evaluating it for the different queries in the query space described in Section IV B 2, we obtain
$$p_{y|x}(0|x;\theta) = \begin{cases} \frac{1}{2}\big[(\cos(\omega_j t) + \sin(\phi_j)\cos(\delta_j)\sin(\omega_j t))^2 + (\sin(\delta_j)\sin(\omega_j t) + \cos(\phi_j)\cos(\delta_j)\sin(\omega_j t))^2\big], & x = (M_X, U_j, t) \\ \frac{1}{2}\big[(\cos(\omega_j t) - \cos(\phi_j)\cos(\delta_j)\sin(\omega_j t))^2 + (\sin(\delta_j)\sin(\omega_j t) + \sin(\phi_j)\cos(\delta_j)\sin(\omega_j t))^2\big], & x = (M_Y, U_j, t) \end{cases}$$
where we have used the index $j \in \{0,1\}$ to refer to the different preparation operators $U_0 = \sigma_I \otimes \sigma_I$ and $U_1 = \sigma_X \otimes \sigma_I$, and $M_X$, $M_Y$, $M_Z$ denote the measurement operators of the query space.

b. Fisher Information Matrix Noting that the measurement outcome $y \in \{0,1\}$, we have
$$I_x(\theta) = \sum_{y\in\{0,1\}} \frac{\nabla_\theta\, p_{y|x}(y|x;\theta)\, \nabla_\theta\, p_{y|x}(y|x;\theta)^{\mathsf{T}}}{p_{y|x}(y|x;\theta)} = \frac{\nabla_\theta\, p_{y|x}(0|x;\theta)\, \nabla_\theta\, p_{y|x}(0|x;\theta)^{\mathsf{T}}}{p_{y|x}(0|x;\theta)\,\big(1 - p_{y|x}(0|x;\theta)\big)},$$
where in the second step we have used the fact that $p_{y|x}(1|x;\theta) = 1 - p_{y|x}(0|x;\theta)$. The FI matrix elements can equivalently be expressed through the Rabi oscillation $p_{rabi}(x;\theta)$ of a query. The FI matrices $I_x(\theta)$ for the different queries $x = (M, U, t)$ (Section IV B 2) depend on the parameterization of choice. We denote the FI matrix under the parameterization Λ as $I_x(\Lambda)$; the Fisher information matrix $I_x(J)$ for the parameterization J is related to the former through the Jacobian of Λ with respect to J.
Note that $I_x(\Lambda)$ is rank-1 for each query and takes the block form $\begin{pmatrix} I_0 & 0 \\ 0 & 0 \end{pmatrix}$ for $U = \sigma_I \otimes \sigma_I$ and $\begin{pmatrix} 0 & 0 \\ 0 & I_1 \end{pmatrix}$ for $U = \sigma_X \otimes \sigma_I$. This indicates that queries involving $U = \sigma_I \otimes \sigma_I$ are informative about the Hamiltonian parameters $(\omega_0, \delta_0, \phi_0)$, and those involving $U = \sigma_X \otimes \sigma_I$ are informative about $(\omega_1, \delta_1, \phi_1)$.
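As a numerical sanity check of these expressions, the following sketch evaluates the $M_X$ branch of the noiseless likelihood above for a single preparation and builds the binary-outcome FI matrix by finite differences. The parameter values are arbitrary and the helper names are ours:

```python
import numpy as np

def p0(theta, t):
    """Noiseless p(y=0 | x=(M_X, U_j, t)) for one preparation j,
    using the M_X branch of the likelihood; theta = (omega, delta, phi)."""
    w, d, f = theta
    c, s = np.cos(w * t), np.sin(w * t)
    return 0.5 * ((c + np.sin(f) * np.cos(d) * s) ** 2
                  + (np.sin(d) * s + np.cos(f) * np.cos(d) * s) ** 2)

def fisher_information(theta, t, eps=1e-6):
    """Binary-outcome FI: grad p0 grad p0^T / (p0 (1 - p0)),
    with the gradient taken by central finite differences."""
    theta = np.asarray(theta, dtype=float)
    grad = np.array([(p0(theta + eps * e, t) - p0(theta - eps * e, t))
                     / (2 * eps) for e in np.eye(3)])
    p = p0(theta, t)
    return np.outer(grad, grad) / (p * (1 - p))

I = fisher_information([2 * np.pi * 1.5, 0.3, 0.7], t=0.4)
```

Since the FI matrix is an outer product of the single gradient vector, it is rank-1, consistent with the block structure noted above.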

b. In Presence of Noise Sources and Nonidealities
In Section IV C 1, we modeled the effect of different noise sources and nonidealities on the quantum system. In particular, we discussed the effect of imperfections in control in Section IV C 2, the effect of decoherence in Section IV C 3, and how the observed measurement outcome is subject to readout noise in Section IV C 1. We consider the two-qubit decoherence model from Section IV C 3 and the readout noise model of a bit-flip channel based on binary classification from Section IV C 1. We denote the noisy observed measurement outcome as ỹ and the hidden measurement outcome before the bit-flip channel as y.
a. Likelihood The likelihood function in the presence of the noise sources of readout noise, imperfect pulse-shaping, and decoherence is then given by
$$p_{\tilde{y}|x}(\tilde{y}|x;\theta) = p_{\tilde{y}|y}(\tilde{y}|y) \sum_{z \in \{0,1\}} (1 - p_d(t))\, p_{yz|x}\big(yz \mid (M, U, t + \Delta t_{\mathrm{eff}}(\theta));\, \theta\big),$$
with the probability of the two-qubit string $yz$ defined through the noiseless probability $p_{y|x}(y|x)$ of Eq. 1. In the expressions above, we have used the tuple representation of the query x; $p_d(t)$ is the depolarization probability associated with the two-qubit decoherence model discussed in Section IV C 3, and $(r_0, r_1)$ are the readout noise parameters introduced in Section IV C 1. b. Fisher Information Matrix The FI matrix for a given query x takes the same form as in the noiseless case, with $p_{y|x}$ replaced by the noisy likelihood $p_{\tilde{y}|x}$. Note that special attention must be given when taking the derivative with respect to $\omega_{0,1}$, as $\Delta t_{\mathrm{eff}}(\theta)$ (Section IV C 2) depends only on these two components of Λ and appears in the effective evolution time t with a prefactor of $\omega_{0,1}$ in the likelihood (see Eq. B12).

Appendix C: Computational Details of Query Optimization
In Section V A 1, we pointed out that the number of shots available for each query in the experimental datasets collected from the IBM Quantum devices is limited. In this section of the Appendix, we describe how the query optimizations for HAL-FI (Eq. 17) and HAL-FIR (Eq. 19) are solved under per-query shot constraints. We then describe how the computational cost of query optimization can be reduced through uncertainty filtering of the query space.

Different Query Optimizations and Strategies for Handling Query Constraints
We describe how constraints can be handled for the query optimization (Eq. 17) in HAL-FI (Algorithm 2); the same approach can also be used for HAL-FIR. Consider the ith round of active learning, with Hamiltonian parameter estimate $\hat{\theta}^{(i)}$. Let us denote the number of shots available for each query $x \in Q$ in the ith round as $N_{\mathrm{shots}}(x)$, the total number of shots already made against query x by the ith round (inclusive) as $N^{(i)}_{tot}(x)$, and the total number of shots made over all queries inputted to the oracle by the ith round (inclusive) as $N^{(i)}_{tot}$. We can frame the query optimization problem under shot constraints in two different ways, motivated by the asymptotically optimal query distribution associated with HAL-FI:
$$q^\star = \arg\min_{q \in \mathcal{P}(Q)} \mathrm{Tr}\!\left[I_q(\theta^\star)^{-1}\right],$$
where $\mathcal{P}(Q)$ is the family of all probability distributions over the specified query space Q. As discussed in Section II C, we cannot solve for this distribution in practice, as it requires access to the true parameters $\theta^\star$, and hence we solve for a sub-optimal query distribution $q^{(i)}(\hat{\theta}^{(i)})$ given the current parameter estimate. During active learning, we can solve for $q^{(i)}(\hat{\theta}^{(i)})$ either by viewing it as the query distribution over all the queries that have been inputted to the oracle (i.e., including those from previous rounds) or as the query distribution associated with the current ith batch of queries to be issued. These two viewpoints lead to two different approaches for handling the shot constraints and for sampling the queries to be inputted to the oracle from the query distribution. First, consider the query optimization over all the queries that have been sampled in earlier batches during active learning, with the corresponding sampling of the queries in the current batch.
The query distribution $q^{(i)}$ is then the distribution over all the queries made so far, and $q_b^{(i)}$ is the query distribution for the batch to be issued. Further, the Fisher information matrix associated with $q^{(i)}$ is guaranteed to be invertible, but that associated with $q_b^{(i)}$ may be non-invertible. This suggests that the outcomes of queries made in this round of the active learning procedure may not be informative about all the Hamiltonian parameters.
In order to ensure that the query distribution for each batch is informative about all the Hamiltonian parameters, one may alternatively solve a second query optimization problem in which $N_b^{(i)}$, the size of the batch of queries being issued to the oracle in the ith round, enters directly, and the queries of the issued batch are sampled from $q^{(i)}$, which is now the query distribution for each batch. This query optimization can be viewed as a greedy approach to query selection. The Fisher information matrix associated with $q^{(i)}$ is then guaranteed to be invertible and hence informative about all the Hamiltonian parameters. We thus solve the query optimization considering shot constraints on each query according to Eq. C6 for our application.
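One standard way to approximate such a Fisher-information query distribution is the Frank-Wolfe algorithm for A-optimal experiment design; the sketch below minimizes $\mathrm{Tr}(I_q^{-1})$ over the probability simplex. It is an illustration of the idea with our own function names and a toy query space, not the paper's exact solver:

```python
import numpy as np

def a_optimal_design(fims, n_iter=500):
    """Frank-Wolfe sketch for q = argmin_{q in P(Q)} Tr(I_q^{-1}),
    with I_q = sum_x q(x) I_x; `fims` is the list of per-query FI matrices."""
    n = len(fims)
    q = np.full(n, 1.0 / n)                      # start from uniform
    for k in range(n_iter):
        I_q_inv = np.linalg.inv(sum(w * F for w, F in zip(q, fims)))
        # vertex giving the steepest decrease of Tr(I_q^{-1}):
        scores = [np.trace(I_q_inv @ F @ I_q_inv) for F in fims]
        best = int(np.argmax(scores))
        gamma = 2.0 / (k + 2)                    # standard FW step size
        q = (1 - gamma) * q
        q[best] += gamma
    return q

# Toy check: with rank-1 FI matrices e_i e_i^T (plus a tiny ridge for
# invertibility), the A-optimal design is the uniform distribution.
fims = [np.outer(e, e) + 1e-9 * np.eye(3) for e in np.eye(3)]
q = a_optimal_design(fims)
```

In the shot-constrained setting described above, the optimization would additionally restrict support to queries with shots remaining, which the sketch omits.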
As we sample queries randomly according to $q^{(i)}$, rather than deterministically in proportion to $q^{(i)}$, the resulting $X_q$ might not satisfy the query constraints exactly, and an additional pruning step is required. Moreover, to reduce the computational cost of the query optimization solve, we consider a subset of queries $Q^{(i)}_{\mathrm{filtered}} \subset Q^{(i)}$ for which shots are still available.
In Algorithm 5, we summarize the steps taken to handle query constraints during query optimization for HAL-FI. It takes as inputs the query space $Q^{(i)}$ of the current ith round of active learning, the number of shots available for each query, and the size of the batch of queries to be issued. The inputted query space is first filtered by retaining only those queries for which shots are available. The resulting filtered query space is denoted by $Q^{(i)}_{\mathrm{filtered}}$, and we then compute the query distribution $q^{(i)}$ according to Eq. C6. We then sample the queries of the batch $X_q$ from this distribution, after mixing and with the incorporation of a pruning step. Note that lines 4 and 5 were also used in the original query optimization algorithms of HAL-FI (Algorithm 2) and HAL-FIR (Algorithm 3) to encourage exploration. An illustration of the operation of the query optimization and the handling of query constraints is shown in Figure 15 for a particular run of the HAL-FI learner.

Algorithm 5 Handling query constraints during query optimization in HAL-FI
Input: Current query space $Q^{(i)}$, number of shots available for each query $N_{\mathrm{shots}}(x)$, and the size of the batch of queries to be issued.

In the illustration of Figure 15, the number of shots available for each query is $N_{\mathrm{shots}}(x) = 512\ \forall x$, and half the total number of shots available in this dataset is exhausted by the end of learning. Note that the different parameter values considered for HAL-FI there are for stress testing the query-constraint handling procedure and are not tuned for the HAL-FI algorithm.

Uncertainty Filtering of Query Space
Uncertainty filtering becomes a crucial step during query optimization when we consider HAL-FI with an adaptively growing query space. We noted in Section II C 1 d that the computational cost of query optimization (Eq. 17) scales as $O(n_Q^2 m^3 + n_Q m^4 + m^5)$, where $n_Q = |Q|$ is the number of queries in the query space and m is the length of the parameter vector θ. This computational cost is alleviated by filtering the query space based on the entropy S(x). The entropy of the different queries can be computed from the model probability expressions available to us (see Section B), given the current Hamiltonian parameter estimate θ̂. The filtered query space based on entropy is determined as
$$Q_{\mathrm{filtered}} = \{x \in Q : S(x) \ge \tau \max_{x' \in Q} S(x')\},$$
where we set the threshold τ = 0.95, i.e., we only consider queries with entropy at least 0.95 times the highest entropy. This value of τ was chosen to ensure that at most half of the queries are retained after uncertainty filtering of the query space. An illustration of uncertainty filtering of a query space is given in Figure 16.

Appendix D: Details of the Estimation Procedure

In this section of the Appendix, we discuss in detail the estimation procedure of Section IV D 2 used for solving the MLE problem of Eqs. 58, 59 for learning cross-resonance Hamiltonians (Eq. 27). This estimation procedure is used in conjunction with both the passive learner and HAL-FI/HAL-FIR. We consider the experimental setup discussed in Section IV. We also discuss how this estimation procedure can be improved or extended to other Hamiltonians. Let us recall the notation introduced in Section II C 1 d and Section IV. The different rounds of active learning are indexed by $i \in [i_{max}]$. In each round of active learning, we use an estimation procedure divided into multiple steps; we index each of these fractional steps by k.
The Hamiltonian parameter estimate at the kth fractional step in the ith round of active learning will then be denoted by $\hat{\theta}^{(i,k)}$, with the parameter estimate at the end of the estimation procedure in the ith round denoted simply by $\hat{\theta}^{(i)}$. The training examples available at the ith round are given by $(X^{(i)}, Y^{(i)})$.
We divide the estimation procedure into two main parts: (i) initial estimation and (ii) maximum-likelihood estimation (MLE).

Initial Estimation
As mentioned in Section IV D 2, the first step of our estimation procedure involves frequency estimation, followed by a nonlinear regression solve combined with a gradient descent procedure on the Rabi oscillations inferred from data (Eq. 38). We will denote the inferred Rabi oscillation by $\hat{p}_{rabi}(t)$.
Frequency estimation is typically carried out through the Fourier transform, implemented as the Fast Fourier Transform (FFT) [83]. We note that this in general assumes a periodic input signal, which translates into a limited frequency resolution of $\nu/N_t$, where $\nu$ is the sampling frequency and $N_t$ is the number of samples. In order to successfully estimate a frequency component of period $T_p$, we must set T to have a duration of at least $T_p$, by the minimum-frequency criterion (defined in Section V B). The standard FFT then allows us to compute frequency amplitudes at fixed intervals in the spectrum through
$$F(k) = \sum_{n=0}^{N_t-1} \hat{p}_{rabi}(n)\, e^{-i 2\pi k n / N_t},$$
where we have denoted the Fourier coefficients by F(k) and $i = \sqrt{-1}$ is the imaginary unit. However, the frequencies of the Rabi oscillations are not known a priori, and hence T may not be adequately set. We would like the estimation procedure to succeed in practice regardless of the choice of T. We now detail how the standard FFT may be modified in such a case. a. Modification for T not satisfying the minimum-frequency criterion: We can sample frequency amplitudes at the intermediate half steps of $\nu/2N_t$ by multiplying $\hat{p}_{rabi}(n)$ by a rotating complex exponential and then computing a second FFT:
$$G(k) = \sum_{n=0}^{N_t-1} \hat{p}_{rabi}(n)\, e^{-i \pi n / N_t}\, e^{-i 2\pi k n / N_t},$$
where G(0) is the Fourier coefficient at $\nu/2N_t$, G(1) is the Fourier coefficient at $3\nu/2N_t$, etc. However, this only improves the resolution from $\nu/N_t$ to $\nu/2N_t$. One can more generally solve the normal equations arising from regression against the model Rabi oscillations, in terms of the discrete Fourier transform $F(\omega) = \sum_t \hat{p}_{rabi}(t)\, e^{-i\omega t}$. b. Extension to other multi-parameter Hamiltonian models For general multi-parameter Hamiltonians, the initial estimation procedure must be modified to ensure that all the frequency components in a Rabi oscillation corresponding to (M, U) can be faithfully extracted. The model to be used then assumes that the Fourier series has up to K modes, and the corresponding normal equations can be set up.
This has been employed in Bayesian spectral analysis [84] and non-stationary time-series estimation [85,86].
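The two refinements above can be sketched as follows (a minimal illustration, not the paper's implementation; the function names and the grid width in the regression step are our own):

```python
import numpy as np

def half_step_spectrum(x):
    """Interleave the standard FFT bins (spacing nu/N_t) with half-step
    bins obtained by modulating the signal with a rotating exponential
    before a second FFT, doubling the frequency resolution."""
    n = np.arange(len(x))
    F = np.fft.fft(x)                                     # bins at k / N_t
    G = np.fft.fft(x * np.exp(-1j * np.pi * n / len(x)))  # bins at (k + 1/2) / N_t
    freqs = np.concatenate([n, n + 0.5]) / len(x)
    amps = np.concatenate([F, G])
    order = np.argsort(freqs)
    return freqs[order], amps[order]

def refine_frequency(x, f_coarse, width, n_grid=2001):
    """Regression refinement: for candidate frequencies near the coarse
    estimate, fit a*cos + b*sin by the normal equations (linear least
    squares) and keep the frequency with the smallest residual."""
    t = np.arange(len(x))
    best_f, best_rss = f_coarse, np.inf
    for f in np.linspace(f_coarse - width, f_coarse + width, n_grid):
        A = np.column_stack([np.cos(2 * np.pi * f * t),
                             np.sin(2 * np.pi * f * t)])
        coef, *_ = np.linalg.lstsq(A, x, rcond=None)
        rss = np.sum((x - A @ coef) ** 2)
        if rss < best_rss:
            best_f, best_rss = f, rss
    return best_f
```

The half-step spectrum halves the coarse grid spacing; the regression step then localizes the frequency well below the FFT bin width, as in Bayesian spectral analysis.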

Maximum Likelihood Estimation
The initial estimation procedure described above produces the estimate θ̂^(i,0), which is used as the initial guess for the MLE. The expectation is that the initial estimation produces an estimate lying in the same convex basin as the global minimum of the MLE. This allows a more localized search to be carried out, and a stochastic gradient descent procedure should allow us to jump out of any smaller local minima present within this basin.
We solve the MLE (Eq. 58) through a combination of SGD applied to different parameterizations and the quasi-Newton method for further refinement. Adding the latter step saves us the computationally expensive hyperparameter tuning that would otherwise be required.
1. MLE solve in the Λ parameterization with learning rate η_i^Λ, using the input θ̂^(i,0). This returns the output θ̂^(i,1).
2. MLE solve in the J parameterization with learning rate η_i^J, using the input θ̂^(i,1). This returns the output θ̂^(i,2).
3. Refinement of θ̂^(i,2) with the quasi-Newton method noted above. This returns the final estimate θ̂^(i).
We set the learning rate η_i for ADAM [75] in the ith round of active learning according to the number of queries already made. We take η_i ∝ 1/√N_i, where N_i is the number of training examples, i.e., the learning rate is reduced inversely with the square root of the number of training examples. This ensures a more localized search as learning progresses. We set η_0 = 10^−3. We found that carrying out step 2 after step 1 gave more accurate estimates of θ̂ than carrying out step 1 alone. Moreover, after the first few rounds of HAL, we can skip steps 1 and 2 and carry out step 3 directly, using an initial condition of θ̂^(i,0) from the initial estimation or θ̂^(i−1) from the previous round.
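A minimal sketch of this schedule (the loss and its gradient below are illustrative stand-ins, not the cross-resonance likelihood):

```python
import numpy as np

def adam_step(theta, grad, state, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One ADAM update; `state` holds the first/second moment estimates."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    mhat, vhat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    theta = theta - lr * mhat / (np.sqrt(vhat) + eps)
    return theta, (m, v, t)

def fit_round(loss_grad, theta0, n_examples, eta0=1e-3, n_steps=200):
    # Learning rate shrinks as 1/sqrt(number of training examples),
    # giving a progressively more localized search in later rounds.
    lr = eta0 / np.sqrt(n_examples)
    theta = np.asarray(theta0, dtype=float)
    state = (np.zeros_like(theta), np.zeros_like(theta), 0)
    for _ in range(n_steps):
        theta, state = adam_step(theta, loss_grad(theta), state, lr)
    return theta
```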

Energy Landscapes of Negative Log-Likelihood Loss for Cross-Resonance Hamiltonian
We ascertain the efficacy of the estimation procedure by visualizing the energy landscapes of the negative log-likelihood loss function (Eq. 58) and inspecting the location of θ̂ in the landscape. In Figure 17, we plot the energy landscape obtained from an experimental dataset for the two different parameterizations J and Λ. The energy landscapes indicate the nonlinear and non-convex nature of the MLE of Eq. 58. They also indicate why solving the MLE in the Λ parameterization using ADAM is carried out before solving the MLE in the J parameterization: the slices along components of Λ display a more convex-like nature than those along components of J. It should also be noted that there is a global minimum present in the energy landscapes, which we are able to identify using our estimation procedure.

FIG. 17. Slices of energy landscapes of the negative log-likelihood loss function along the different parameter components, considering experimental data collected from IBM Quantum device D ibmq boeblingen under drive configuration 4, (a) using parameterization J, and (b) using parameterization Λ. In each slice along θi (x-axis), we fix the values of the other components to those obtained through estimation and evaluate the negative log-likelihood loss (y-axis) as θi is varied. The Hamiltonian parameter estimate θ̂i obtained through our estimation procedure is indicated by a dashed red line.
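The slicing used for these plots can be sketched as follows (here `nll` is a stand-in for the negative log-likelihood of Eq. 58; the helper name is our own):

```python
import numpy as np

def nll_slice(nll, theta_hat, i, values):
    """Evaluate the loss along component i, holding the other
    components fixed at their estimated values."""
    losses = []
    for v in values:
        theta = np.array(theta_hat, dtype=float)
        theta[i] = v
        losses.append(nll(theta))
    return np.array(losses)
```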

Incorporating uncertainty from shot noise
The inferred Rabi oscillations p̂_rabi(t) used for estimation are sensitive to the number of shots made for a given query x = (M, U, t). This variability in the Rabi oscillations leads to variability in the estimates θ̂^(i,0) produced, and it is particularly high during the initial rounds of HAL, when only a few shots of each query are present in the set of training examples. In order to accurately account for this variability, and hence include the uncertainty in our estimates of the Hamiltonian parameters, we consider the following procedure. Treat the initial estimation of Algorithm 4 as a procedure applied to a particular realization of the Rabi oscillations. We construct n_rep realizations of the Rabi oscillations by considering the observed Rabi oscillation data and sampling according to the binomial distribution associated with the number of shots for each query. For each of these realizations, we obtain frequency estimates of ω_{0,1} through the initial estimation described above. We then fit log-normal distributions to the frequency estimates across realizations, and continue with the initial estimation procedure and MLE for each realization.
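A sketch of this resampling step (assuming the inferred oscillation is given as a per-time success probability; the names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_rabi(p_hat, n_shots, n_rep, estimate_freq):
    """Resample Rabi oscillations from the per-query binomial
    distribution and re-run the frequency estimator on each realization;
    the spread of the returned estimates quantifies shot-noise uncertainty."""
    freqs = []
    for _ in range(n_rep):
        counts = rng.binomial(n_shots, p_hat)  # one shot-noise realization
        p_rep = counts / n_shots
        freqs.append(estimate_freq(p_rep))
    return np.array(freqs)
```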
Other ways of incorporating uncertainty in the Hamiltonian parameter estimates during the estimation procedure would be to adopt a Bayesian learning framework or stochastic process regression (e.g., Gaussian process regression). This is left for future work.

Expected trends of learning error
In Section V B, we assessed the performance of the HAL-FI and HAL-FIR algorithms in different learning scenarios on IBM Quantum device D ibmq boeblingen under drive configuration 2, where the query distribution was learned in real time. Here, we provide support that the trends observed in Figure 10 and Figure 11 are as expected.
To determine the behavior of the learners in an idealized setting, we consider the case where we have access to the optimal query distribution during learning. In Figure 19, we show the trend of RMSE for HAL-FI with a fixed query space, assuming access to the query distribution q(θ) during training and that the Cramér-Rao bound is saturated. We follow the same protocol from Section V A 4 as in our earlier experiments. For the baseline strategy and a passive learner, the query distribution corresponds to a uniform distribution over the query space. As expected, using a passive learner does not change the scaling in either the finite query or the asymptotic query regime. A scaling of RMSE ∼ 1/√N, or N ∼ ε^−2, is observed, in line with the standard quantum limit (SQL). For HAL-FI, we observe an initial scaling of RMSE with number of queries that is steeper than SQL, but this reduces to SQL in the asymptotic query regime. Thus, our results from Section V B are in agreement with what we observe here. Asymptotically, we expect a constant savings in the number of queries or resources required when employing an active learner with a fixed query space compared to a passive learner.

FIG. 19. Scaling of RMSE with number of queries for Hamiltonian learning considering access to the asymptotic optimal query distribution q(θ) of HAL-FI and analysis of the Cramér-Rao bounds. We consider the Hamiltonian parameters θ as determined from the experimental dataset of IBM Quantum device D ibmq boeblingen under drive configuration 2. We plot the trend of RMSE with number of queries for HAL-FI against the passive learner, which uses the uniform distribution over Q.
Likewise, for Hamiltonian learning with prior information, we can compute the RMSE with number of queries for HAL-FI with an adaptively growing query space and access to q^(i)(θ) for each ith batch during learning. In Figure 20, we show the trend of RMSE with number of queries for HAL-FI with a linearly growing query space and an exponentially growing query space. We observe that the passive learner exhibits SQL scaling in the asymptotic query regime, while HAL-FI with the linearly growing query space achieves super-Heisenberg scaling. This supports the trend of RMSE achieved during the experiments in Section V C 1.
As noted in Section V C 1, HAL-FI with an exponentially growing query space also achieves Heisenberg-limited scaling until the evolution times included in the query space reach the magnitude of the decoherence times T1 and T2. At that point, HAL-FI avoids selecting longer evolution times, as the information gained from these measurement outcomes tends to zero. Thus, we expect the Heisenberg limit to be achieved in the finite query setting.
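This saturation can be illustrated numerically. The sketch below assumes a simple exponential decoherence envelope on the Rabi signal, so that the per-query Fisher information 4t² is damped as 4t² e^{−2t/T2}; this envelope and the T2 value are illustrative assumptions, not the calibrated noise model of the paper:

```python
import numpy as np

T2 = 50e-6  # illustrative decoherence time, seconds

def damped_fisher(t):
    """Per-query Fisher information ~ 4 t^2, damped by an assumed
    exponential decoherence envelope exp(-2 t / T2)."""
    return 4 * t ** 2 * np.exp(-2 * t / T2)

# Information grows as t^2 at short times, peaks near t = T2, and
# decays to zero beyond, so longer evolution times stop being selected.
t = np.linspace(1e-9, 10 * T2, 200001)
t_best = t[np.argmax(damped_fisher(t))]
```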
FIG. 20. Hamiltonian learning with access to prior information on a subset of parameters during recalibration: scaling of RMSE with number of queries. We assume access to the asymptotic optimal query distribution q(θ) of HAL-FI and analysis of the Cramér-Rao bounds. We consider the Hamiltonian parameters θ as determined from the experimental dataset of IBM Quantum device D ibmq boeblingen under drive configuration 3.

Sparse Query Distributions
Another consequence of using HAL-FI is the sparsity of the asymptotic query distribution during learning. This is confirmed by visualizing the optimal query distribution of HAL-FI with the fixed query space (Section IV B 2) for IBM Quantum device ibmq boeblingen under drive configuration 2 in Figure 21. It is interesting to note that this sparsity was achieved even though it was not incorporated into the learning problem; it can be explained by realizing that the most informative queries are in fact sparse over the query space. It should be noted, however, that this is typically not the query distribution that HAL-FI has access to during learning, as the true parameters θ are not available and the query distribution obtained through optimization (Eq. 17) is modified by mixing with the uniform distribution (see Section III A).

FIG. 21. Asymptotic optimal query distribution q(θ) for HAL-FI with a fixed query space (Section IV B 2) on ibmq boeblingen under drive configuration 2 (Table IV). We consider the different noise sources of readout noise, imperfect pulse-shaping, and decoherence. The y-axis indicates the different combinations of measurement operators and preparation operators available for each query in Q. The preparation operators are denoted U0 = σI ⊗ σI and U1 = σX ⊗ σI. The x-axis corresponds to T, which is set to 81 equispaced evolution times in [10^−7, 6 × 10^−7] s. The query distribution is color-coded according to the colormap on the right.
a. Single Interaction System

This may be obtained by considering the parameter set J = (0, 0, 0, J_ZX, 0, 0)^T and R = [0, 0, 0, 1, 0, 0]. The values of the parameters in the alternate parameterization are Λ = (ω_0, δ_0, φ_0, ω_1, δ_1, φ_1)^T = (J_ZX, 0, 0, J_ZX, 0, π), where we have assumed J_ZX > 0. The Rabi oscillations for the different queries in this case are as follows:

(M_X, U_j, t) : p_rabi(x) = 0,   (F3)
(M_Y, U_j, t) : p_rabi(x) = sin(2ω_j t),   (F4)
(M_Z, U_j, t) : p_rabi(x) = cos(2ω_j t),   (F5)

where the index j ∈ {0, 1} corresponds to the preparation operators U_0 = σ_I ⊗ σ_I and U_1 = σ_X ⊗ σ_I. We note that the measurement operator M_X is not informative about the frequency ω_0 = J_ZX for this system, as can also be seen from the corresponding Fisher information. In order to learn the parameter of interest ω_0, it is enough to consider one of the queries in {M_Y, M_Z} × {σ_I ⊗ σ_I, σ_X ⊗ σ_I} and a suitable time range T. Let us select the query (M_Z, σ_I ⊗ σ_I, t) and suppose our query distribution over the time range is based on the zeros of the Rabi oscillations. Given a time range T, we consider values of t_k = π/(4ω_0) + kπ/(2ω_0), where k ∈ N. Queries with these system evolution times have the maximum entropy for the considered (M, U).
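The maximum-entropy claim can be checked numerically (a sketch; the outcome probability p = (1 + p_rabi)/2 and the frequency value below are illustrative): the binary entropy of the measurement outcome is maximized exactly at the zero crossings t_k, where p = 1/2.

```python
import numpy as np

def binary_entropy(p):
    """Shannon entropy of a Bernoulli outcome, in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

omega0 = 2 * np.pi * 1.5e6  # illustrative frequency, rad/s
t = np.linspace(0, 2e-6, 20001)
p = (1 + np.cos(2 * omega0 * t)) / 2  # outcome probability for (M_Z, U_0, t)

# Zero crossings of the Rabi oscillation: t_k = pi/(4 w0) + k pi/(2 w0)
tk = np.pi / (4 * omega0) + np.arange(3) * np.pi / (2 * omega0)
```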
Note that the Fisher information of a query in this case is given by I_x(ω_0) = 4t². Through the Cramér-Rao bound, we then have

Var(ω̂_0) ≥ 1 / (Σ_{k=1}^{N} 4 t_k²)

over the fixed space of zero crossings, where we have set the learning error to be achieved as ε = √Var(ω̂_0). We can thus expect to achieve a scaling of ε ∼ N^−3/2 when using linearly spaced zero crossings. For exponentially spaced zero crossings of the Rabi oscillations, the variance approaches zero at an increasing rate.
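The two scalings can be checked numerically from the Cramér-Rao bound alone (a sketch; the evolution-time values are illustrative): a fixed evolution time gives the SQL slope of −1/2 in log RMSE versus log N, while linearly growing t_k gives −3/2.

```python
import numpy as np

def crb_rmse(ts):
    """Cramér-Rao RMSE floor for queries with Fisher information 4 t^2."""
    return 1.0 / np.sqrt(np.sum(4.0 * ts ** 2))

Ns = np.array([100, 1000, 10000])
rmse_fixed = [crb_rmse(np.full(N, 1.0)) for N in Ns]              # fixed t: SQL
rmse_linear = [crb_rmse(np.arange(1, N + 1) * 0.1) for N in Ns]   # t_k grows linearly

# Fit the log-log slope of RMSE versus number of queries.
slope = lambda r: np.polyfit(np.log(Ns), np.log(r), 1)[0]
```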

b. Two Interaction Systems
Consider the Hamiltonian containing the two interaction terms J_IX and J_ZX, i.e., J = (J_IX, 0, 0, J_ZX, 0, 0)^T. The reduced set of parameters is then Λ_R = (ω_0, ω_1) = (|J_IX + J_ZX|, |J_IX − J_ZX|). As in the earlier single interaction example, we can choose the queries that contain the zero crossings of the Rabi oscillations. The Rabi oscillations are given by

(M_X, U_j, t) : p_rabi(x) = 0,   (F8)
(M_Y, U_j, t) : p_rabi(x) = sin(2ω_j t),   (F9)
(M_Z, U_j, t) : p_rabi(x) = cos(2ω_j t),   (F10)

where j ∈ {0, 1} indexes the preparation operators U_0 = σ_I ⊗ σ_I and U_1 = σ_X ⊗ σ_I. A complete set of queries to estimate (ω_0, ω_1) with Heisenberg-limited scaling would then be {(M_Z, U_j, t) : t = π/(4ω_j) + kπ/(2ω_j), k ∈ N, j ∈ {0, 1}}. The set of evolution times chosen here also corresponds to those with maximum entropy. The Fisher information of a query made through either set of measurement or preparation operators in the query space is given by I_x(ω_0, ω_1) = 4t². Thus, by the argument of the previous section, we can also learn the parameters of the unitary here with Heisenberg-limited scaling.

Examples of non-HLS scaling during Hamiltonian learning
So far, we have given examples of Hamiltonians, obtained through simplification of the CR Hamiltonian, that can be learned with HLS. We now give examples of Hamiltonians that cannot be learned with HLS using the query space described in Section IV B 2.
If we were to consider the zero crossings of the Rabi oscillations as in the previous example, the queries and their corresponding Fisher information matrices are of the following form:

M_X : t_k(M_X) = π/(2ω_j) + kπ/(2ω_j),   I_j = [ 4t_k² sin²(φ_j)  0 ; 0  0 ],   (F24)

where k ∈ N. It should be noted that the evolution times t_k(M) being selected are a function of the measurement operator involved in the query, which is made explicit through the argument M. As Λ_R is an over-parameterization, let us look at the query Fisher information matrix I_q(J_R) for the above set of queries, of which only the upper-triangular part of the symmetric matrix need be specified. It can be shown that for these queries, I_q(J_R) is rank deficient and thus non-invertible. This was foreshadowed by the fact that I_q(Λ) was informative in ω_0 and ω_1 but not in the phases φ_{0,1}. Hence, it is more appropriate to consider

R = [ 1 0 0 0 0 0 ; 0 0 0 1 0 0 ]

for this set of queries. It can be verified, through an analysis similar to that of the single interaction system, that we can achieve a scaling of ε ∼ O(N^−3/2) and hence improve over the SQL.
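The rank deficiency is easy to verify numerically (a sketch; the parameter values are illustrative, and each per-query 2x2 Fisher matrix has the single nonzero entry of Eq. F24, so every query is blind to the phase direction):

```python
import numpy as np

phi, omega = 0.7, 1.0  # illustrative parameter values
ts = np.pi / (2 * omega) + np.arange(10) * np.pi / (2 * omega)

# Each query contributes a Fisher matrix informative about the frequency
# only; summing them never restores information about the phase.
I_q = sum(np.diag([4 * t ** 2 * np.sin(phi) ** 2, 0.0]) for t in ts)

rank = np.linalg.matrix_rank(I_q)  # 1 < 2: the total matrix is non-invertible
```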
If we wish to learn J_IY as well, it is necessary to introduce other queries such that the Fisher information matrix is non-zero for the corresponding parameter of interest. Let us start by changing our learning task to the simpler challenge of learning the parameters (ω_0, φ_0). In this case, it is enough to consider only queries of the form (M_X, σ_I ⊗ σ_I, t), where the time range t ∈ T needs to be determined. The query Fisher information matrix in (φ_0, ω_0) takes the form

I_x(Λ_R) = 1/(1 − sin²(φ_0) sin²(2ω_0 t)) [ cos²(φ_0) sin²(2ω_0 t)   (1/2) t sin(2φ_0) sin(4ω_0 t) ; (1/2) t sin(2φ_0) sin(4ω_0 t)   4t² sin²(φ_0) cos²(2ω_0 t) ],   (F28)

and the variance of the parameter φ_0 is bounded as

Var(φ̂_0) ≥ (1/N) Σ_{k=1}^{N} [1 − sin²(φ_0) sin²(2ω_0 t_k)] / [cos²(φ_0) sin²(2ω_0 t_k)],   (F29)

where the term inside the sum on the right-hand side is fixed for any periodic or equispaced set of evolution times t_k, and thus HLS cannot be achieved using such a set of queries. One key to achieving Heisenberg-limited scaling is to introduce an explicit dependence on the system evolution time t into the corresponding Fisher information; we note that this is not the case for the set of measurement operators considered here.
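The obstruction can be seen numerically. The sketch below assumes the Bernoulli outcome probability p = (1 + sin φ_0 sin 2ω_0 t)/2 for this query family (parameter values are illustrative): the φ_0 diagonal entry of the Fisher information is bounded and periodic in t, while the ω_0 entry grows as t², so longer evolution times help the frequency but never the phase.

```python
import numpy as np

phi0, omega0 = 0.6, 1.0  # illustrative values

def fisher_diag(t):
    """Diagonal Fisher entries for (phi_0, omega_0) at evolution time t,
    for the Bernoulli model p = (1 + sin(phi_0) sin(2 omega_0 t)) / 2."""
    s, c = np.sin(2 * omega0 * t), np.cos(2 * omega0 * t)
    denom = 1 - np.sin(phi0) ** 2 * s ** 2
    I_phi = np.cos(phi0) ** 2 * s ** 2 / denom          # bounded, periodic in t
    I_omega = 4 * t ** 2 * np.sin(phi0) ** 2 * c ** 2 / denom  # grows as t^2
    return I_phi, I_omega

t_short = np.linspace(0.01, np.pi, 2000)
t_long = t_short + 100 * np.pi  # same phases of the oscillation, much later times
```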