Machine learning of noise-resilient quantum circuits

Noise mitigation and reduction will be crucial for obtaining useful answers from near-term quantum computers. In this work, we present a general framework based on machine learning for reducing the impact of quantum hardware noise on quantum circuits. Our method, called noise-aware circuit learning (NACL), applies to circuits designed to compute a unitary transformation, prepare a set of quantum states, or estimate an observable of a many-qubit state. Given a task and a device model that captures information about the noise and connectivity of qubits in a device, NACL outputs an optimized circuit to accomplish this task in the presence of noise. It does so by minimizing a task-specific cost function over circuit depths and circuit structures. To demonstrate NACL, we construct circuits resilient to a fine-grained noise model derived from gate set tomography on a superconducting-circuit quantum device, for applications including quantum state overlap, quantum Fourier transform, and W-state preparation.


I. INTRODUCTION
Recent years have seen a surge in quantum computer hardware development, and we now have several quantum computing platforms with tens of qubits that can be controlled and coupled with fidelities that enable execution of quantum circuits of limited depth. This has led to intense interest in formulating quantum algorithms that can be reliably executed on such devices. The challenge, however, is that naive compilations of nearly all nontrivial quantum algorithms require circuit depths that are out of reach for near-term hardware. Motivated by this challenge, in this work we study how machine learning (ML) can be applied to formulate noise-aware quantum circuits that can be executed on near-term quantum hardware to produce reliable results.
Our method is called noise-aware circuit learning (NACL). Given a suitable description of a computational task and a device model that captures the noise and constraints of a device, it outputs a native circuit that performs the task with the greatest robustness to noise. NACL has several broad applications, as illustrated in Fig. 1. The task can be the compilation of a specified unitary transformation (Fig. 1(a)), the preparation of a target state from a specified input state (Fig. 1(b)), or the extraction of an observable from a many-qubit state (Fig. 1(c)). In each case, NACL returns a circuit that is significantly more resilient under the given noise model. However, as we detail below, the formulation of the machine learning problem is different in each application. Perhaps the most familiar version of NACL is that depicted in Fig. 1(a), where a specified unitary matrix is to be implemented by a circuit composed of native gates, which is usually called compilation. In this context, NACL results in noise-aware circuit compilations.
FIG. 1. Applications of NACL. (a) In compiling, the goal is to approximate an input unitary matrix $U$ by a noise-resilient circuit that is compatible with the device constraints. (b) In state preparation, one inputs a set of $N$ input and output states $\{|x_i\rangle, |y_i\rangle\}$, where $N$ could be as small as one, and the output is a noise-resilient circuit that approximately prepares the $|y_i\rangle$ states from the $|x_i\rangle$ states. (c) In observable extraction, one inputs a set of input states and classical outputs that typically correspond to local observable expectation values, $\{|x_i\rangle, y_i\}$, and the output is a noise-resilient circuit that approximately computes the outputs from any input state $|\psi\rangle$ that might or might not be in the input set.
Previous work on circuit optimization for noise mitigation has largely considered the task of compilation, under restricted models of errors or imperfections. In fact, most work focuses on reducing overall circuit error by reducing the number of two-qubit gates (which tend to be noisier than single-qubit gates), avoiding faulty qubits, reducing the number of SWAP gates required in architectures with restricted connectivity, or reducing the amount of qubit idle time and/or overall circuit depth [1-8]. These strategies incorporate very little information about errors present in a particular hardware platform. More recent work on error-aware compilation by Murali et al. [9] goes beyond this and includes basic calibration information (e.g., qubit $T_2$ times, CNOT gate error rates) to compile circuits using more reliable qubits and gates.
In this work we extend this direction even further and demonstrate that one can use fine-grained error model information to increase the reliability of the outputs of quantum circuits. Incorporating detailed noise models into one's circuit optimization, as we do here, is particularly compelling at present with the advent of advanced characterization techniques like gate-set tomography [10,11]. These techniques produce fine-grained details (e.g., estimates of process matrices representing the action of imperfect quantum gates) describing the actual evolution of qubits in near-term hardware. We will demonstrate that such experimentally derived noise models can be used to go beyond naive circuit compilations for several example quantum algorithms.
NACL has several additional strengths relative to existing approaches in the literature. Crucially, NACL takes a task-oriented approach to quantum circuit discovery, which implies that one does not need a starting point or example quantum circuit that already accomplishes the task. Note that traditional compilers do require such a quantum circuit to start from. Furthermore, because NACL does not start from a template circuit, the optimization is less susceptible to bias. In contrast, standard literature methods that tweak a given quantum circuit inherently bias their optimization towards solutions that look like that starting point. This means that NACL has the potential to discover more novel solutions that otherwise would not be obvious to the human mind. In addition, we will see that NACL naturally balances the trade-off between circuit depth, which leads to more expressivity, and circuit noise, which makes outputs less accurate.
In what follows, we first present our theoretical framework (Sec. II). We then discuss a device model with experimentally determined noise parameters (Sec. III). Next, we present our implementations of NACL with this noisy device, for examples from the three different application classes shown in Fig. 1 (Secs. IV-VI). Finally, we conclude with a discussion in Sec. VII.
FIG. 2. Schematic diagram of NACL. Our approach takes a task and a device model as an input. The task is defined via examples in a training set and a cost function, $C$. That information is sufficient to find a noise-aware circuit that approximates a specified task. This is done via optimization over a set of parameters $(L, k, \theta)$ that describe a quantum circuit. The algorithm returns parameters $(L_{\mathrm{opt}}, k_{\mathrm{opt}}, \theta_{\mathrm{opt}})$, which represent an optimized quantum circuit that minimizes the cost function $C$. See text for details.

A. Overview
A schematic diagram of the steps of NACL is shown in Figure 2. There are two inputs to NACL: (1) a task, and (2) a device model. The output of NACL is an optimized quantum circuit that accomplishes the inputted task in the presence of the inputted device model. NACL may not output a globally optimal solution (this depends on details of the cost function landscape and optimization method used), but even local optima are improvements over circuit compilations that are not noise-aware.
Note that circuit depth is not an input to NACL. This is because NACL optimizes over circuit depths, and aims to find the depth that achieves the most noise resilience. In addition, an ansatz for the circuit is not an input, because NACL attempts to optimize over many ansätze. Hence, the structure of the circuit, as well as its depth, are optimized by NACL. This feature of NACL is in the spirit of task-oriented programming, where the user only needs to specify the task, and not the details of the circuit. NACL adapts the circuit structure to optimize a cost function that depends on the type of task specified. As shown in Fig. 1, there are three categories of tasks.
In what follows we provide more details on how NACL works. Sections II B and II C discuss the device model and noise specification, and Section II D defines the NACL cost function for each application. Finally, Section II E summarizes the optimization methods used by NACL.

B. Parameterized circuit
For a given quantum hardware platform, we denote the native gate set or gate alphabet as $A = \{A_j(\theta)\}$. Each gate $A_j$ is either a one- or two-qubit gate and may also have an internal continuous parameter $\theta$. As an example, the IBM Q 5-qubit computer "Ourense" has the native gate alphabet
$$A = \{\mathrm{CNOT}_{jk},\ Z_j(\theta),\ X_j(\pi/2)\}, \tag{1}$$
where $\mathrm{CNOT}_{jk}$ is a CNOT between qubits $j$ and $k$, $Z_j(\theta)$ is a rotation of angle $\theta$ about the z-axis of qubit $j$, and $X_j(\pi/2)$ is a rotation of angle $\pi/2$ about the x-axis of qubit $j$ (also called a pulse gate). Such a gate set is supplemented by state preparation and measurement quantum operations. These are typically fixed in most quantum computing architectures (e.g., prepare all qubits in the ground state and measure in the computational basis), so there is no opportunity for optimizing over them. We therefore do not consider them part of the learnable set.
We consider a generic gate sequence that defines a circuit
$$G_\alpha = A_{k_L}(\theta_L) \cdots A_{k_2}(\theta_2)\, A_{k_1}(\theta_1), \tag{2}$$
where $L$ is the number of gates, $k = (k_1, \ldots, k_L)$ is the vector of indices describing which gates are utilized in the gate sequence, $\theta = (\theta_1, \ldots, \theta_L)$ is the vector of continuous parameters associated with these gates, and $\alpha = (L, k, \theta)$ is the set of all these parameters. All parameters in $\alpha = (L, k, \theta)$ are optimized over in NACL.
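As a minimal illustration of the parameterization $\alpha = (L, k, \theta)$, the following sketch encodes a one-qubit gate sequence and multiplies out the corresponding unitary. The reduced one-qubit alphabet, the index assignment, and all function names are assumptions made for illustration; this is not the implementation used in this work.

```python
import cmath
import math

def z_rot(theta):
    """Z(theta) = exp(-i*theta*Z/2) as a 2x2 matrix (nested lists)."""
    return [[cmath.exp(-0.5j * theta), 0], [0, cmath.exp(0.5j * theta)]]

def x_pulse(_theta=None):
    """X(pi/2) = exp(-i*pi*X/4); it has no free parameter, so theta is ignored."""
    s = 1 / math.sqrt(2)
    return [[s, -1j * s], [-1j * s, s]]

# Hypothetical one-qubit alphabet: gate index k -> gate constructor A_k(theta).
ALPHABET = {0: z_rot, 1: x_pulse}

def matmul(a, b):
    """Product of two 2x2 matrices."""
    return [[sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]

def circuit_unitary(k, theta):
    """Multiply out G_alpha = A_{k_L}(theta_L) ... A_{k_1}(theta_1).

    Gate 1 acts first, so each later gate multiplies from the left.
    """
    g = [[1, 0], [0, 1]]
    for index, angle in zip(k, theta):
        g = matmul(ALPHABET[index](angle), g)
    return g

# alpha = (L, k, theta) with L = 2: two consecutive X(pi/2) pulses.
k = (1, 1)
theta = (None, None)
G = circuit_unitary(k, theta)
```

Two consecutive pulses compose to an X rotation by $\pi$, so `G` is proportional to the Pauli X matrix up to floating-point error.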

C. Device model
An input to NACL is a device model, which captures the constraints of a device (e.g., limited connectivity) and also represents the noise in the device. We assume the device constraints and connectivity are captured by the specification of a native gate alphabet for the device, e.g., Eq. (1). Only gates that are available are listed in this specification.
The salient characteristics of noise are captured by (i) process matrices for each element of the device's native gate alphabet, and (ii) for state preparation and measurement (SPAM) noise, by quantum-classical channels that represent noisy state preparation or measurement POVM elements. The assumption of a fixed process matrix for each gate in the alphabet restricts this treatment to Markovian noise. This can be relaxed by generalizing to time-dependent process matrices for each elementary gate, but we do not do this here for simplicity, and also because characterization tools capable of producing such non-Markovian representations of quantum computer operations are still in early stages of development [27]. Similarly, in this treatment we mostly ignore the effects of crosstalk, and assume that the process matrix describing a gate operates only on the qubits the ideal gate is defined on. Properly incorporating crosstalk into the noise models that NACL considers requires advances in characterization methods [28] that we discuss later.
Given this paradigm for representing noisy quantum operations, each gate in the alphabet A has an associated process matrix that accounts for the local noise occurring during that gate. Note that even the identity gate may have a non-trivial process matrix, for example due to relaxation during idling.
Mathematically speaking, the noise model provides a map from a parameterized circuit $G_\alpha$ to a parameterized quantum channel $\mathcal{E}_\alpha$:
$$G_\alpha \to \mathcal{E}_\alpha.$$
Here, $\mathcal{E}_\alpha$ is a completely positive trace-preserving (CPTP) map that represents the action of $G_\alpha$ in a noisy environment. Specifically, when the noise model is given in the form of process matrices for gates, one can do the following. Let $A = \{A_j(\theta)\}$ denote the gate alphabet associated with the noiseless gates. In the presence of noise, this gate alphabet becomes a set of quantum channels, $\bar{A} = \{\bar{A}_j(\theta)\}$, where $\bar{A}_j(\theta)$ now denotes a quantum channel. Now suppose that $G_\alpha$ is given by $G_\alpha = A_{k_L}(\theta_L) \cdots A_{k_2}(\theta_2) A_{k_1}(\theta_1)$. Then the simplest way to incorporate the noise model would be to replace each $A_{k_i}$ with $\bar{A}_{k_i}$, i.e., to transform $G_\alpha$ into a sequence of quantum channels:
$$\mathcal{E}_\alpha = \bar{A}_{k_L}(\theta_L) \circ \cdots \circ \bar{A}_{k_2}(\theta_2) \circ \bar{A}_{k_1}(\theta_1).$$
It is important to note, however, that this formula for $\mathcal{E}_\alpha$ only accounts for the non-trivial gates that were in the original circuit $G_\alpha$. In practice, identity (idle) operations are also noisy due to, e.g., thermal relaxation. Care must therefore be taken with respect to identity gates, and we discuss this next.
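The gate-by-gate replacement described above can be sketched for a single qubit as follows. Here each noisy gate is modeled, purely as an assumption for illustration, by the ideal unitary followed by depolarizing noise of strength `p`; the device model in this work instead uses process matrices estimated by gate-set tomography. All names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]

def unitary_channel(u):
    """Ideal gate as a channel: rho -> U rho U^dagger."""
    return lambda rho: matmul(matmul(u, rho), dagger(u))

def depolarizing(p):
    """Toy noise (an assumption): rho -> (1 - p) rho + p Tr(rho) I / 2."""
    def channel(rho):
        tr = rho[0][0] + rho[1][1]
        return [[(1 - p) * rho[i][j] + (p * tr / 2 if i == j else 0)
                 for j in range(2)] for i in range(2)]
    return channel

def noisy_gate(u, p):
    """Stand-in for a noisy gate channel: ideal unitary, then depolarizing noise."""
    ideal, noise = unitary_channel(u), depolarizing(p)
    return lambda rho: noise(ideal(rho))

def compose(channels):
    """Sequence of channels applied in circuit order (first gate first)."""
    def e_alpha(rho):
        for channel in channels:
            rho = channel(rho)
        return rho
    return e_alpha

s = 1 / math.sqrt(2)
X_PULSE = [[s, -1j * s], [-1j * s, s]]        # ideal X(pi/2)
E = compose([noisy_gate(X_PULSE, 0.1)] * 2)   # two noisy pulses in sequence
rho_out = E([[1, 0], [0, 0]])                 # input state |0><0|
```

In this toy model the ideal circuit maps $|0\rangle$ to $|1\rangle$, while the noisy channel leaves a population of $(1 + 0.9^2)/2 = 0.905$ in $|1\rangle$.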

Parallelization
The object we are optimizing over, the circuit in Eq. (2), needs to be modified in the presence of imperfect idle operations. In this case, the sensible thing to do is to perform as many gates in parallel as possible, but the description of a circuit as a sequence of gates, as in Eq. (2), is incomplete because it does not capture which gates can be performed in parallel. In other words, in the presence of imperfect idle operations we cannot simply think of G α as a linear sequence of gates; we have to map G α to a two-dimensional circuit diagram, in space and time.
Abstractly, we can re-write
$$G_\alpha \to G^{\mathrm{Par}}_\alpha = U_{\alpha,D} \cdots U_{\alpha,2}\, U_{\alpha,1}. \tag{5}$$
Here, each $U_{\alpha,j}$ represents a layer of gates that can be parallelized, and $D$ is the number of layers. Specifically, we take the circuit proposed in $G_\alpha$ and compress it using simple circuit rules to minimize idling of qubits. For example, an $X(\pi/2)$ rotation that occurs on the target qubit after a CNOT can be moved to before the CNOT because their actions on the target qubit commute. In this manner, each gate in $G_\alpha$ is moved to as early a time as possible without changing the unitary being implemented by $G_\alpha$. This naturally defines the circuit layers and subsequently $G^{\mathrm{Par}}_\alpha$. Even though the reordering does not change the overall unitary, whenever we write $G_\alpha$ in the form in Eq. (5) we denote it as $G^{\mathrm{Par}}_\alpha$. An important aspect of the optimization in NACL is to numerically find the parallelized representation, $G^{\mathrm{Par}}_\alpha$, that yields the minimum error in the cost functions detailed below.
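Once $G_\alpha$ is rewritten in the form of $G^{\mathrm{Par}}_\alpha$, we can then account for noise by replacing each gate in $G^{\mathrm{Par}}_\alpha$ by the quantum channel that represents its noisy implementation. For example, suppose a circuit layer in $G^{\mathrm{Par}}_\alpha$ on a 5-qubit processor is a tensor product of a $Z(\theta)$ rotation, a CNOT, an identity $I$, and an $X(\pi/2)$ rotation (written $X$), with a superscript on each gate indicating which qubit(s) it operates on. This layer would be replaced by the tensor product of $\bar{Z}(\theta)$, $\overline{\mathrm{CNOT}}$, $\bar{I}$, and $\bar{X}$ with the same superscripts, where the quantities with bars above them are the quantum channels representing those gates. Then the overall noisy circuit corresponding to $G^{\mathrm{Par}}_\alpha$ is written as
$$\mathcal{E}^{\mathrm{Par}}_\alpha = \bar{U}_{\alpha,D} \circ \cdots \circ \bar{U}_{\alpha,2} \circ \bar{U}_{\alpha,1},$$
where $\bar{U}_{\alpha,j}$ is the noisy channel for the $j$-th layer and $D$ is the number of layers. It is important to note that NACL uses $\mathcal{E}^{\mathrm{Par}}_\alpha$ rather than $\mathcal{E}_\alpha$ as the overall noisy channel associated with $G_\alpha$.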
Note that the procedure we have outlined for parallelizing and incorporating the noise model is valid because our noise models do not account for crosstalk effects. If crosstalk is significant, then this strategy of maximizing parallelization might not be optimal, since performing many gates in parallel may lead to more noise. Moreover, in the presence of significant crosstalk, capturing processor noise using quantum channels for each of its gates is probably insufficient. Instead, one would need to characterize each possible layer (there are exponentially many of these), since the operation on a qubit due to application of a gate could depend on what is done to any other qubit in the computer at the same time. We discuss how to extend NACL in the presence of crosstalk in Sec. VII.
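The layering step can be sketched with a simple as-soon-as-possible scheduler that places each gate in the earliest layer in which all of its qubits are free and then fills the remaining slots with explicit idles. This is a simplified stand-in under stated assumptions: it uses no commutation rules or circuit identities, it treats every gate as occupying one time step, and the gate names and data layout are hypothetical.

```python
def parallelize(gates, n_qubits):
    """Group a time-ordered gate list into layers and insert explicit idles.

    gates: list of (name, qubits) tuples, e.g. ("CNOT", (1, 2)).
    Returns a list of layers; each layer maps a qubit tuple to a gate name,
    with "I" marking qubits that idle during that layer.
    """
    next_free = [0] * n_qubits          # earliest layer in which each qubit is free
    layers = []
    for name, qubits in gates:
        t = max(next_free[q] for q in qubits)
        while len(layers) <= t:
            layers.append({})
        layers[t][tuple(qubits)] = name
        for q in qubits:
            next_free[q] = t + 1
    for layer in layers:                # fill every unused slot with an idle
        busy = {q for qubits in layer for q in qubits}
        for q in range(n_qubits):
            if q not in busy:
                layer[(q,)] = "I"
    return layers

def count_idles(layers):
    """Number of idle slots, each of which would carry idle noise."""
    return sum(1 for layer in layers for name in layer.values() if name == "I")

sequence = [("P", (0,)), ("CNOT", (1, 2)), ("P", (1,)), ("P", (0,))]
layers = parallelize(sequence, n_qubits=3)
```

For this four-gate sequence the scheduler produces two layers, with qubit 2 idling in the second one. Each idle slot would then be replaced by the noisy idle channel when building the overall noisy circuit.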

D. Cost functions
In this subsection, we construct the cost functions that are minimized by NACL in each of the application classes outlined in Fig. 1.

Preliminaries
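We first define some relevant quantities. Let $F(\rho, \sigma) = \big(\mathrm{Tr}\sqrt{\sqrt{\rho}\,\sigma\sqrt{\rho}}\big)^2$ be the fidelity between two states $\rho$ and $\sigma$. For a given pure input state $|\psi\rangle$, we can denote the fidelity of the output states under quantum channels $\mathcal{E}$ and $\mathcal{F}$ as
$$F_\psi(\mathcal{E}, \mathcal{F}) = F\big(\mathcal{E}(|\psi\rangle\langle\psi|),\ \mathcal{F}(|\psi\rangle\langle\psi|)\big).$$
We will be interested in the case where $\mathcal{F}$ corresponds to a unitary process $\mathcal{U}(\cdot) = U(\cdot)U^\dagger$, in which case we have
$$F_\psi(\mathcal{E}, \mathcal{U}) = \langle\psi|U^\dagger\, \mathcal{E}(|\psi\rangle\langle\psi|)\, U|\psi\rangle.$$
Furthermore, we can define the average process fidelity as
$$\bar{F}(\mathcal{E}, \mathcal{U}) = \int d\psi\, F_\psi(\mathcal{E}, \mathcal{U}),$$
with the integral taken over the Haar measure.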

Observable extraction
The first class of applications involves estimating an observable given one, or a set of, input states. An example of this is computing the overlap of two quantum states (discussed in Section IV). In this application, the output of the circuit is a classical number (the observable expectation, which in practice is estimated by many executions of the circuit) denoted $f(x)$, and the input, denoted $|x\rangle$, is a quantum state (or classical data encoded in a quantum state). Hence, we want to construct a circuit that computes the function
$$|x\rangle \to f(x).$$
We classically generate a training data set of the form
$$\{|x^{(i)}\rangle,\ f(x^{(i)})\}_{i=1}^{N}.$$
In general, the amount of training data required could scale exponentially in the problem size (i.e., number of qubits), since the data must be general enough to cover the space of possible inputs.
Recall that the parameters $\alpha$ define a circuit $G_\alpha$, which in turn defines a noisy quantum channel $\mathcal{E}^{\mathrm{Par}}_\alpha$. For this quantum channel, let $y^{(i)}_\alpha$ denote the output of the circuit (i.e., the expectation of the observable of interest) when the input is $|x^{(i)}\rangle$. The cost function, Eq. (14), then quantifies the discrepancy between the desired output $f(x^{(i)})$ and the true output $y^{(i)}_\alpha$, averaged over all training data points.
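A minimal sketch of such an averaged discrepancy is given below. The squared-error form is an assumption chosen for illustration, as are the function name and the sample numbers; the text above only fixes that the cost is a discrepancy between $f(x^{(i)})$ and $y^{(i)}_\alpha$ averaged over the training set.

```python
def observable_cost(targets, outputs):
    """Mean discrepancy between desired outputs f(x_i) and circuit outputs y_i.

    ASSUMPTION: a squared-error discrepancy is used purely for illustration.
    """
    if len(targets) != len(outputs) or not targets:
        raise ValueError("need two equally sized, non-empty sequences")
    return sum((f - y) ** 2 for f, y in zip(targets, outputs)) / len(targets)

# Hypothetical training targets and noisy circuit outputs.
exact = [0.2, 0.5, 0.9]
estimated = [0.1, 0.5, 0.7]
cost = observable_cost(exact, estimated)
```

Because the cost is an average, training samples contribute equally regardless of the size of the target value, which is relevant when the training targets cluster around particular values.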

State Preparation
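A second class of applications outlined in Fig. 1 is state preparation. Here, the input is a quantum state or more generally a set of quantum states $\{|x^{(i)}\rangle\}_{i=1}^{N}$. The task is then to construct a circuit $U$ that prepares the output states $\{|y^{(i)}\rangle\}_{i=1}^{N}$ from these input states. In other words, one wishes to learn a unitary $U$ that accomplishes the desired state preparation task on the training data,
$$U|x^{(i)}\rangle = |y^{(i)}\rangle, \quad i = 1, \ldots, N.$$
Note that this is an under-constrained problem since in the state preparation application $N \ll 2^n$, where $U$ is an $n$-qubit unitary. In this case, we use the following cost function:
$$C(\alpha) = 1 - \frac{1}{N}\sum_{i=1}^{N} F\big(\mathcal{U}(|x^{(i)}\rangle\langle x^{(i)}|),\ \mathcal{E}^{\mathrm{Par}}_\alpha(|x^{(i)}\rangle\langle x^{(i)}|)\big), \tag{15}$$
where $\mathcal{U}(\cdot) \equiv U(\cdot)U^\dagger$, so that $\mathcal{U}(|x^{(i)}\rangle\langle x^{(i)}|) = |y^{(i)}\rangle\langle y^{(i)}|$. This is the infidelity between the state prepared by $\mathcal{E}^{\mathrm{Par}}_\alpha$ and the target state $|y^{(i)}\rangle$, averaged over the training data points. A typical scenario is when there is a single input and output state ($N = 1$), as we will consider below in Section V.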
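The averaged infidelity can be sketched as follows, using the fact that the fidelity between a pure target $|y\rangle$ and a density matrix $\rho$ reduces to $\langle y|\rho|y\rangle$. The toy noisy channel (an identity circuit followed by depolarizing noise) and all names are assumptions for illustration only.

```python
def outer(psi):
    """|psi><psi| as a nested-list density matrix."""
    return [[a * complex(b).conjugate() for b in psi] for a in psi]

def overlap(psi, rho):
    """<psi|rho|psi>: fidelity between the pure state psi and the state rho."""
    d = len(psi)
    value = sum(complex(psi[i]).conjugate() * rho[i][j] * psi[j]
                for i in range(d) for j in range(d))
    return value.real

def state_prep_cost(noisy_channel, pairs):
    """Infidelity with the targets, averaged over (input, target) training pairs."""
    total = sum(overlap(y, noisy_channel(outer(x))) for x, y in pairs)
    return 1.0 - total / len(pairs)

def depolarized_identity(p):
    """Toy noisy circuit (an assumption): do nothing, then depolarize with strength p."""
    def channel(rho):
        d = len(rho)
        tr = sum(rho[i][i] for i in range(d))
        return [[(1 - p) * rho[i][j] + (p * tr / d if i == j else 0)
                 for j in range(d)] for i in range(d)]
    return channel

# Trivial task: the targets equal the inputs, so all infidelity comes from noise.
pairs = [([1, 0], [1, 0]), ([0, 1], [0, 1])]
cost = state_prep_cost(depolarized_identity(0.2), pairs)
```

For this one-qubit example each output state has fidelity $1 - p/2$ with its target, so the cost equals $p/2$.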

Compilation
Finally we consider the application of compiling a target unitary, U , into a set of native gates. The action of U on all possible input quantum states must be reproduced. This is a more challenging task than constructing a state-preparation circuit, since one must consider the action on all states rather than just on one state or a small set of states.
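Let $\mathcal{U}(\cdot) \equiv U(\cdot)U^\dagger$ denote the quantum channel associated with $U$. Then we define the cost function for compiling as
$$C(\alpha) = 1 - \bar{F}(\mathcal{E}^{\mathrm{Par}}_\alpha, \mathcal{U}).$$
Note that this is analogous to Eq. (15) with the discrete average replaced by a continuous average (i.e., an integral with the Haar measure). The average, $\bar{F}$, can be computed in various ways. Most elegantly, the average process fidelity is related to the entanglement fidelity $F_e$ via [29][30][31]
$$\bar{F}(\mathcal{E}, \mathcal{U}) = \frac{d\, F_e(\mathcal{U}^\dagger \circ \mathcal{E}) + 1}{d + 1},$$
where $F_e(\mathcal{E}) = \langle\phi|(\mathcal{I} \otimes \mathcal{E})(|\phi\rangle\langle\phi|)|\phi\rangle = F\big(|\phi\rangle\langle\phi|, (\mathcal{I} \otimes \mathcal{E})(|\phi\rangle\langle\phi|)\big)$, with $|\phi\rangle = \sum_j |j\rangle|j\rangle/\sqrt{d}$ being a maximally entangled state, and $d = 2^n$ being the Hilbert-space dimension. Therefore, we can compute the compilation cost function by computing $F\big(|\phi\rangle\langle\phi|, (\mathcal{I} \otimes \mathcal{U}^\dagger \circ \mathcal{E}^{\mathrm{Par}}_\alpha)(|\phi\rangle\langle\phi|)\big)$. From the machine learning perspective, the training data set in this case just consists of a pair of states $\{|\phi\rangle, (\mathbb{1} \otimes U)|\phi\rangle\}$. However, this approach requires a computation in a doubled space of dimension $2^{2n}$.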
Alternative approaches to computing $\bar{F}$ that trade this greater memory complexity for greater time complexity (but can be easily parallelized) are (i) to approximate the Haar average with a sample average over a set of states that form a 2-design, or (ii) to use Nielsen's formula in terms of Pauli operators $P_j$,
$$\bar{F}(\mathcal{E}, \mathcal{U}) = \frac{\sum_j \mathrm{Tr}\big(U P_j^\dagger U^\dagger\, \mathcal{E}(P_j)\big) + d^2}{d^2 (d + 1)}.$$
From the machine learning perspective, for (i), the training data set corresponds to the sampled 2-design and the action of the ideal channel on these, $\{|\phi_i\rangle, U|\phi_i\rangle\}$, and for (ii) the training data set corresponds to the Pauli operators and the action of the ideal channel on these, $\{P_j, U P_j U^\dagger\}$.
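The Pauli-operator route can be sketched for a single qubit ($d = 2$) as follows. The noisy channel here is again an assumed toy model (the ideal unitary followed by depolarizing noise of strength `p`), extended linearly so that it can act on Pauli operators; the function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][m] * b[m][j] for m in range(2)) for j in range(2)]
            for i in range(2)]

def dagger(a):
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]

def trace(a):
    return a[0][0] + a[1][1]

PAULIS = [
    [[1, 0], [0, 1]],        # I
    [[0, 1], [1, 0]],        # X
    [[0, -1j], [1j, 0]],     # Y
    [[1, 0], [0, -1]],       # Z
]

def noisy_unitary(u, p):
    """Toy channel (an assumption): M -> (1 - p) U M U^dag + p Tr(M) I / 2."""
    def channel(m):
        out = matmul(matmul(u, m), dagger(u))
        tr = trace(out)
        return [[(1 - p) * out[i][j] + (p * tr / 2 if i == j else 0)
                 for j in range(2)] for i in range(2)]
    return channel

def average_fidelity(channel, u):
    """Average fidelity between `channel` and the unitary u via the Pauli sum."""
    d = 2
    total = 0
    for pauli in PAULIS:
        ideal = matmul(matmul(u, dagger(pauli)), dagger(u))   # U P^dag U^dag
        total += trace(matmul(ideal, channel(pauli)))
    return ((total + d * d) / (d * d * (d + 1))).real

s = 1 / math.sqrt(2)
X_PULSE = [[s, -1j * s], [-1j * s, s]]
fbar = average_fidelity(noisy_unitary(X_PULSE, 0.1), X_PULSE)
compile_cost = 1 - fbar
```

For this depolarizing model the Pauli sum evaluates to $8 - 6p$, giving $\bar{F} = 1 - p/2$. This agrees with the entanglement-fidelity relation, since $F_e = 1 - 3p/4$ for the same channel.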

E. Optimization Methods
In this section we describe the techniques used to find optimal values of the parameters $\alpha_{\mathrm{opt}} = (L_{\mathrm{opt}}, k_{\mathrm{opt}}, \theta_{\mathrm{opt}})$ for a given task and device model (see Fig. 2). The methods are general and could be applied to any cost function. In particular, they are applicable to the cost functions associated with the applications discussed in Section II D.
The space in which the optimization takes place is large and has a complicated form. In our method we are optimizing over circuits composed of gates taken from a particular alphabet. The circuit is described by two kinds of parameters, discrete and continuous. The discrete parameters k define the circuit's layout. That is, they specify what type of gate is acting on a given qubit, at a given time during the evaluation of the circuit. The continuous parameters θ span all gates that contain a variational parameter. In the example of an alphabet derived from the IBM Q Ourense device in Eq. (1), only Z rotations contain a continuous parameter.
The optimization is an iterative procedure in which every iteration is organized in two nested loops. In the inner one, the optimizer deals with continuous parameters with a fixed circuit layout $k$. Changes to the structure of the circuit are introduced in the outer loop. The optimization over continuous parameters $\theta$ is straightforward. Once the structure parameters $k$ are fixed, the cost function depends on at most $L$ continuous parameters $\theta_i$. We use off-the-shelf, unconstrained methods (the cost function is invariant under $\theta_i \to \theta_i + 2\pi$) to find a minimum of the cost function $C_k = C_k(\theta)$.
When the minimum $c$ of $C_k = C_k(\theta)$ is found, the optimizer switches to the outer loop and makes a change in the structural parameters $k$. In this part of the procedure the optimizer tests small, random updates to the structure of the circuit. Those updates include gate shuffling, gate removal, and inserting new gates in the form of resolutions of identity (1-qubit and 2-qubit ones). This way, the number of gates $L$ in the circuit is variable and reaches an optimal (noise-dependent) value during the optimization; see below for a more detailed discussion. After new structural parameters $k'$ are identified, the optimizer enters the inner loop and varies the continuous parameters $\theta$ to find a new minimum $c'$ of the cost function $C_{k'} = C_{k'}(\theta)$. Finally, the optimizer decides whether or not the old circuit structure $k$ should be replaced by the new one $k'$. Here we follow the simulated annealing approach and accept the change if $c' < c$. If $c' > c$, the change is rejected with a probability that grows with $c' - c$ (equivalently, it is accepted with a probability that decays exponentially in $c' - c$).
The above describes one iteration of the optimization algorithm. The iterations are repeated until convergence of the cost function is observed. The optimization is also restarted multiple times to detect possible local minima.
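The two nested loops can be sketched on a toy one-qubit problem: prepare $|1\rangle$ from $|0\rangle$ using noiseless $Z(\theta)$ rotations and noisy $X(\pi/2)$ pulses, with the state tracked as a Bloch vector. Everything in this sketch is an assumption made for illustration: the depolarizing strength, the fixed temperature, the random local search used in place of an off-the-shelf continuous optimizer, and structure updates that insert or remove single gates rather than resolutions of identity.

```python
import math
import random

P_NOISE = 0.02  # hypothetical depolarizing strength attached to each pulse

def final_bloch_z(structure, thetas):
    """Simulate a one-qubit circuit on |0> via its Bloch vector; return z."""
    x, y, z = 0.0, 0.0, 1.0
    for gate, th in zip(structure, thetas):
        if gate == "Z":      # noiseless virtual rotation about the z-axis
            x, y = (x * math.cos(th) - y * math.sin(th),
                    x * math.sin(th) + y * math.cos(th))
        else:                # "P": X(pi/2) pulse followed by depolarizing noise
            y, z = -z, y
            x, y, z = ((1 - P_NOISE) * c for c in (x, y, z))
    return z

def cost(structure, thetas):
    """Infidelity with the target state |1>, which equals (1 + z) / 2."""
    return (1 + final_bloch_z(structure, thetas)) / 2

def inner_loop(structure, thetas, rng, sweeps=20):
    """Inner loop: crude random local search over the Z angles, layout fixed."""
    best, c, step = list(thetas), cost(structure, thetas), 1.0
    for _ in range(sweeps):
        for i, gate in enumerate(structure):
            if gate != "Z":
                continue
            trial = list(best)
            trial[i] += rng.uniform(-step, step)
            c_trial = cost(structure, trial)
            if c_trial < c:
                best, c = trial, c_trial
        step *= 0.9
    return best, c

def propose(structure, thetas, rng):
    """Outer-loop move: remove one random gate or insert one random gate."""
    s, t = list(structure), list(thetas)
    if s and rng.random() < 0.5:
        i = rng.randrange(len(s))
        del s[i], t[i]
    else:
        i = rng.randrange(len(s) + 1)
        s.insert(i, rng.choice(["Z", "P"]))
        t.insert(i, 0.0)
    return s, t

def optimize(rng, iterations=200, temperature=0.05):
    structure, thetas = [], []
    thetas, c = inner_loop(structure, thetas, rng)
    best = (c, list(structure), list(thetas))
    for _ in range(iterations):
        s_new, t_new = propose(structure, thetas, rng)
        t_new, c_new = inner_loop(s_new, t_new, rng)
        # Accept improvements; accept worse layouts with Metropolis probability.
        if c_new < c or rng.random() < math.exp(-(c_new - c) / temperature):
            structure, thetas, c = s_new, t_new, c_new
            if c < best[0]:
                best = (c, list(structure), list(thetas))
    return best

best_cost, best_structure, best_thetas = optimize(random.Random(0))
```

In this toy model every additional pulse shrinks the Bloch vector, so longer circuits are penalized by the noise itself rather than by an explicit depth penalty. The two-pulse circuit, for instance, has cost $(1 - 0.98^2)/2 = 0.0198$.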
Finally, let us mention an important feature of the optimization approach. As stated above, random structure updates done in the outer loop involve identity insertion and gate removal. Because the cost function is evaluated in the presence of noise, this procedure can sometimes lead to a larger value of the cost function (this is not possible with a noiseless simulator). As a result, the optimization algorithm automatically finds the optimal length $L$ of the circuit for a specified error model. Other machine learning approaches that are not noise-aware must be artificially biased towards short circuits. In contrast, our approach automatically finds a balance between deep, expressive but noisy circuits and shallow, less noisy ones.

III. NOISE MODEL
We demonstrate NACL in the following sections using a fine-grained noise model derived from one- and two-qubit gate-set tomography (GST) [10,11,32] experiments run on the five-qubit IBM Q Ourense superconducting qubit device. We emphasize that we are not claiming to capture the full behavior of this device; this cannot be done with just one- and two-qubit GST, and we need to make some assumptions about device behavior. The most important physical effects we are ignoring in this noise model are: (i) non-uniformity across the device, since we use one-qubit GST results on qubit 0 and two-qubit GST on the qubit pair 0-1 to infer process matrices for all qubits on the device, and (ii) crosstalk effects, since we do not characterize spectator qubits.
One-qubit GST on qubit 0 of the Ourense device yields estimated one-qubit process matrices representing channels associated with the principal native gates on the device, $X(\pi/2)$ (or the "pulse" gate), and $I$, the single-qubit idle operation. The other single-qubit gate used in this device is $Z(\theta)$, but this is performed virtually in software (through a phase shift of future single-qubit gates) and so we assume it takes no time and is implemented perfectly. We also use the process matrices estimated by single-qubit GST for $|0\rangle$ state preparation and single-qubit measurement POVM elements for representing these operations. Then two-qubit GST on qubits 0 and 1 is used to extract a process matrix for the CNOT gate. All the estimated process matrices and their figures of merit are presented in Appendix B.
We assume the layout and connectivity of the qubits is the same as for the IBM Q Ourense device, and these are outlined in Fig. 3. This connectivity and the process matrices described above together define our device model.
Note that we only performed GST on qubits 0 and 1 for simplicity, and assume that the resulting process matrices describe the same gates on other qubits also. This assumption could easily be relaxed at the expense of more GST experiments on all the qubits in the device.

IV. IMPLEMENTATION FOR OBSERVABLE EXTRACTION
The observable extraction application we focus on is state overlap estimation, where the task is to estimate the overlap between two input states $\rho$ and $\sigma$, i.e., estimate $\mathrm{Tr}(\rho\sigma)$. The standard way to achieve this is to apply a controlled swap operation conditioned on an ancilla qubit, and then measure an expectation of an observable on the ancilla. We consider the case where $\rho$ and $\sigma$ are single-qubit states, and decompose the textbook SWAP-based circuit for overlap estimation into a standard gate set in Fig. 4.
FIG. 5. The circuit of Fig. 4, when decomposed into the native gates in our device model. P denotes the pulse gate, or $X(\pi/2)$ rotation, and I is an idle timestep. The vertical lines denote $Z(\theta)$ rotations that are done virtually and therefore take no time. This notation helps visualize which gates can be performed in parallel. Values of $\theta_n$ are shown in Appendix A.
For evaluation under the noise model, we first compile the textbook circuit in Fig. 4 into the native gate set composed of CNOT, X(π/2) and Z(θ) rotations. Given the connectivity of the device, Fig. 3, we map the input qubits to qubits 2 and 3, and the ancilla qubit to qubit 1. This is the most favorable mapping since in this case the minimal number (2) of CNOTs in Fig. 4 needs to be decomposed to account for the lack of device connectivity. There are other mappings that result in similar requirements for CNOT decomposition. We iterated over all of them and selected the decomposition that gives the smallest error (as measured by the value of the cost function evaluated in the presence of noise).
The decomposed circuit is shown in Fig. 5. In this figure we show identity gates, or periods where a qubit is idle, in red. This circuit has been compressed and made as parallel as possible (using simplifications afforded by simple commutation relations and circuit identities), but the remaining idle periods cannot be compressed away. For simplicity, we assume that $X(\pi/2)$ rotations (denoted P in the figure) take the same amount of time as a CNOT.
Next, we consider ML-based circuit implementations that do not consider noise. Using techniques developed in [6], which attempt to find exact implementations that consist of as few gates as possible, we perform training without the noise model (but with the connectivity restrictions of the device). The training dataset consists of 15 pairs of randomly generated single-qubit states and their computed overlap. The resulting circuit for overlap estimation and its compiled version are shown in Fig. 6. In the absence of the noise model there is no penalty for the circuit to contain identity gates, and so the resulting circuit contains many of them.
Finally, we apply NACL to this problem and formulate the cost function using circuit simulation with the noise model described in Sec. III. The training dataset again consists of 15 pairs of randomly generated single-qubit states and their computed overlap. The algorithm works directly with the native gate set, and so no subsequent decomposition is necessary. The circuit found by NACL is shown in Fig. 7. Two features of the NACL circuit immediately stand out. First, since we have taken into account the noise associated with idling qubits, the circuit contains very few idles. Second, NACL makes interesting use of $Z(\theta)$ gates: these are error free, take no time, and also increase the expressiveness of a circuit. Consequently, NACL seems to maximize their use, especially compared to the noise-unaware ML circuit in Fig. 6, which does not distinguish $Z(\theta)$ gates from other gates and therefore does not use them more frequently. This liberal use of $Z(\theta)$ gates most likely also leads to the shorter-depth circuit. It should be stressed that these features are not built into the algorithm but result from the optimization and represent the best balance found between the number of gates and the noise induced by their action.
In the following, we compare the performance of the three circuits described above. We generated a validation dataset of 1000 pairs of new random one-qubit mixed states $\{\rho_j, \sigma_j\}$ and applied the three circuits to estimate the overlap between each pair (the circuits are simulated under the noise model). For simplicity, we label the textbook circuit (Fig. 5) $A_1$, the noise-unaware, standard ML circuit (Fig. 6) $A_2$, and the result of NACL (Fig. 7) $A_3$. Fig. 8(a) compares the errors of all three circuits, defined as the absolute value of the difference between the exact overlap $\mathrm{Tr}(\rho_j \sigma_j)$ and its estimate computed with the given circuit, $\big|\mathrm{Tr}(\rho_j \sigma_j) - \langle\sigma_z\rangle_{A_i}\big|$, where $\langle\sigma_z\rangle_{A_i}$ is the expectation value of the $\sigma_z$ operator on the measured qubit at the end of circuit $A_i$. The data is sorted such that the error of $A_1$ is increasing with sample index $j$. Fig. 8(a) shows that the noise-aware ML-generated circuit gives the best overlap estimate for most of the state pairs. The inset in Fig. 8(a) shows the difference between the error of the textbook circuit and both ML circuits (these sets of data points are both independently ordered according to decreasing error difference). The ML circuit is better than the textbook one if the value shown in the inset is positive. We can see that this is indeed the case for over 90% of cases, with the NACL circuit also outperforming the regular ML circuit in these cases. For further analysis, we look at the same data in Fig. 8(b), but this time with the errors plotted against the exact overlap of the 1000 samples in the validation dataset. This figure shows that the error of $A_1$ generally decreases with the exact overlap. In addition, the error of $A_3$ (NACL) shows non-monotonic behavior with exact overlap, achieving its minimum around an exact overlap of 0.5 and increasing for larger and smaller overlaps. This behavior of the NACL error can be explained by the specifics of the training method and the type of cost function that was used.
NACL is trying to minimize average error (see Eq. (14)), and examining a histogram of overlaps in the training sample (inset in Fig. 8 (b)) we see that these overlaps are concentrated between 0.4 and 0.5. Therefore, NACL optimizes the average-case cost function by performing best on input state pairs that have overlaps around this value. An interesting observation is that there can be a correlation between the structure of a circuit and the overlaps it can best estimate. Finally, we can explain why the textbook circuit outperforms NACL in regions of low exact overlap as a combination of two factors: (i) as mentioned above, NACL minimizes average error, and the contribution to this from training samples with small overlap is small; hence it sacrifices performance on small overlap states to get better performance on states with larger overlap; (ii) the other factor that results in the textbook circuit performing well for small exact overlap samples is accidental; namely, that the overlap is estimated by measuring σ z on the ancilla, and this quantity tends to zero with circuit length (since the stochastic noise in the gates dampens this polarization). The output of A 1 is small due to noise, and thus is accidentally close to the correct answer for small overlap states.
We note that the uneven behavior of NACL with the exact overlap of the input states can be easily modified by (i) modifying the training dataset to have uniformly distributed overlaps, and (ii) modifying the cost function to be a worst-case measure of performance instead of an average-case one, and/or a function of relative error as opposed to absolute error with respect to the exact overlap.
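The cost-function variants mentioned in (ii) can be written down compactly. The following is a minimal sketch with a hypothetical function name and signature; it is not claimed to reproduce the exact form of Eq. (14), only the distinction between average-case and worst-case aggregation and between absolute and relative error.

```python
import numpy as np

def overlap_cost(exact, estimate, worst_case=False, relative=False, eps=1e-12):
    """Aggregate overlap-estimation error over a dataset.

    worst_case=False -> mean error (average-case cost)
    worst_case=True  -> maximum error (worst-case cost)
    relative=True    -> errors are divided by the exact overlap
    eps guards against division by zero for vanishing overlaps (an assumption).
    """
    exact = np.asarray(exact, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    err = np.abs(exact - estimate)
    if relative:
        err = err / np.maximum(np.abs(exact), eps)
    return float(err.max() if worst_case else err.mean())

# Two state pairs: one with small overlap (0.1), one with overlap 0.5.
exact = [0.1, 0.5]
estimate = [0.2, 0.5]
avg_abs = overlap_cost(exact, estimate)
worst_rel = overlap_cost(exact, estimate, worst_case=True, relative=True)
```

With relative error, the small-overlap pair dominates the cost even though its absolute error is modest, which is the kind of reweighting that would push the optimizer toward more uniform performance.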

V. IMPLEMENTATION FOR STATE PREPARATION
For the state preparation application, we will focus on preparing W-states of n qubits:
$$|W_n\rangle = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} |i\rangle,$$
where $|i\rangle$ is the state in which qubit $i$ is in $|1\rangle$ and all other qubits are in $|0\rangle$. W-states are multipartite entangled states that are robust against loss and can be used for multipartite cryptographic protocols and for teleportation [33]. As far as we are aware, the circuits generated in Cruz et al. [34] are the most efficient circuits for W-state generation, and we will use these circuits as our base-case "textbook" circuits to compare against. In the following we will study the preparation of W-states for n = 4, 5.
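As a concrete reference for the target state and the fidelity used as the figure of merit in this section, the sketch below builds the W-state vector and evaluates the fidelity of a density matrix with it. The function names and the bit-ordering convention are illustrative assumptions.

```python
import numpy as np

def w_state(n):
    """State vector of |W_n>: equal superposition of all basis states
    with exactly one qubit in |1> (bit ordering is an arbitrary choice here)."""
    psi = np.zeros(2 ** n)
    for i in range(n):
        psi[1 << i] = 1.0
    return psi / np.sqrt(n)

def fidelity_with_pure(rho, psi):
    """Fidelity <psi| rho |psi> between a density matrix and a pure target state."""
    return float(np.real(np.conj(psi) @ rho @ psi))

w4 = w_state(4)
ideal = np.outer(w4, w4.conj())          # noiseless output
mixed = np.eye(16) / 16                  # maximally mixed 4-qubit state

f_ideal = fidelity_with_pure(ideal, w4)  # 1 for a perfect preparation
f_mixed = fidelity_with_pure(mixed, w4)  # 1/16 for a fully depolarized output
```

The maximally mixed state gives the fidelity floor 1/16 for four qubits, a useful scale when reading the fidelities reported below.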

A. 4 qubit W-state preparation
The textbook circuit for preparing $|W_4\rangle$ is shown in Fig. 9(a). It was obtained by following the general procedure given in [34]. This circuit will be applied to the first four qubits in the device shown in Fig. 3. The performance of the textbook circuit and the NACL circuit will depend on the subset of qubits on which we are preparing the state. However, in realistic situations, one will not be given that freedom since the state preparation is usually only one step in a larger quantum circuit, which imposes constraints on the choice of qubits. We select qubits 1-4 to show how NACL can optimize circuits on devices with restricted connectivity. The one-qubit gate, depicted as G(p) in Fig. 9(a), is defined as follows: Note that this is a slightly different definition than the one given in Ref. [34]. The above definition of G(p) leads to the same state that is prepared with the circuit shown in Fig. 9(a) but allows for a more efficient decomposition of controlled-G(p) into CNOTs and one-qubit gates. The circuit shown in Fig. 9(a) must be compiled into the native gate set of the device model. The W state is invariant under permutation of qubits, and so one can relabel the qubits in the circuit shown in Fig. 9(a) if this is advantageous for compilation. To find the optimal compilation of the textbook circuit we checked all possible permutations of qubits. All permutations lead to a compilation in which at least two CNOTs are not compatible with the device connectivity and need to be decomposed further. We evaluated each permutation by simulating the corresponding compiled circuit under the noise model and computing the fidelity of the output with the exact $|W_4\rangle$ state. The permutation that gives the highest fidelity is simply [1,2,3,4] (there are, however, other permutations that lead to the same fidelity), and the corresponding compiled circuit is shown in Fig. 9(b). We found that this textbook circuit produces $|W_4\rangle$ with fidelity 0.671 under the noise model.
FIG. 9. (b) Compilation of the textbook circuit shown in (a) into the first four qubits of the device model in Fig. 3. The notation is the same as in Fig. 5. The values of angles θj are given in Appendix A.
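The exhaustive check over qubit relabelings described above can be sketched as follows. This is a simplified stand-in, not the procedure used in this work: instead of simulating each compiled circuit under the noise model and ranking by fidelity, it only counts how many CNOTs fall on pairs of qubits that are not directly coupled. The coupling map (a line 1-2-3-4) and the CNOT list are hypothetical placeholders, not the connectivity of Fig. 3 or the circuit of Fig. 9(a).

```python
from itertools import permutations

def count_incompatible(cnots, placement, coupling):
    """Number of CNOTs whose (relabeled) qubits are not directly coupled.
    placement[k] is the device qubit assigned to logical qubit k."""
    return sum(
        frozenset((placement[c], placement[t])) not in coupling
        for c, t in cnots
    )

def rank_placements(cnots, device_qubits, coupling):
    """Score every assignment of logical to device qubits, best first."""
    scored = [
        (count_incompatible(cnots, placement, coupling), placement)
        for placement in permutations(device_qubits)
    ]
    return sorted(scored)

# Hypothetical device: four qubits on a line.
coupling = {frozenset(edge) for edge in [(1, 2), (2, 3), (3, 4)]}
# Hypothetical circuit: logical qubit 0 controls CNOTs onto qubits 1, 2, 3.
cnots = [(0, 1), (0, 2), (0, 3)]

ranking = rank_placements(cnots, (1, 2, 3, 4), coupling)
best_count, best_placement = ranking[0]
```

In this toy case no placement avoids all violations, because the control qubit needs three neighbors while every qubit on the line has at most two. A fidelity-based ranking, as used in the text, additionally accounts for the noise on each specific qubit and gate rather than treating all violations as equal.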
The circuit produced by NACL for preparing $|W_4\rangle$ is shown in Fig. 10. Since the task here is to prepare one state from one other state, the training dataset and validation dataset are the same, and consist of just one pair $\{|0\rangle^{\otimes 4}, |W_4\rangle\}$; the first element is the input state and the second is the ideal output state. This NACL circuit outputs a state under the noise model with a fidelity of 0.8894 to the exact state. This is a reduction in error (as measured by 1 − F, where F is fidelity) by a factor of 3 as compared with the best known textbook circuit.
FIG. 10. Circuit that prepares $|W_4\rangle$ found by NACL. The notation is the same as in Fig. 5. Angles θj are specified in Appendix A.
Careful inspection of the circuit in Fig. 10 reveals an interesting feature. In certain circumstances, it is more beneficial (from the point of view of minimizing the cost function; infidelity in this case) to have a long sequence of gates that is not compiled into an equivalent transformation with a shorter sequence. An example is the final 13 gates (including Z(θ) gates in this count) applied to qubit 1. It is possible to implement the resulting transformation with a shorter sequence of gates, but doing so would mean that the qubit sits idle for the remaining time while the operations on the other qubits complete. Apparently this incurs a greater cost than the longer sequence (the pulse gates are fairly high-quality gates for this device and, in fact, have a smaller infidelity than the idle operations; see Appendix B). We thus observe a feature that resembles dynamical decoupling or a dynamically corrected gate for this final transformation of qubit 1. We have reasonable confidence that this feature is not a numerical artifact or local optimum because we also independently optimized just that subcircuit (i.e., we kept the rest of the circuit fixed and optimized just the last six clock cycles of qubit 1 under the same cost function that evaluates the error on the 4-qubit output state), and could not find a better sequence. Note that this feature is "emergent". Dynamical gate correction techniques were not coded in the search algorithm and yet NACL effectively used them in the optimized solution. In a way, those techniques were "discovered" via cost optimization. We also point out that this feature of preferring longer sequences to idles is not general: one cannot replace every sequence of idles with a sequence of pulses and Z(θ) rotations and lower the error. For example, qubit 3 sits idle over five clock cycles and this achieves the minimum cost function even when we attempt to re-optimize just that sub-sequence of gates.
This feature demonstrates the ability of NACL to find circuit implementations that optimize performance in highly non-trivial ways that incorporate an interplay between the computational task (encoded in the cost function) and the device model.
FIG. 11. (a) Circuit for preparing $|W_5\rangle$ obtained by following the construction given in [34]. The first controlled G(p) gate can be simplified, as the first qubit is initialized in $|1\rangle$. This allows for a shorter compilation. (b) Its best compiled version achieved by a proper permutation of qubits. The notation is the same as in Fig. 5. Angles θj are given in Appendix A.

B. 5 qubit W-state preparation
We also study the preparation of $|W_5\rangle$ since this task requires the use of all qubits on the device in Fig. 3. Again, we follow the prescription in Cruz et al. [34] to arrive at the best textbook circuit for preparing $|W_5\rangle$, shown in Fig. 11. The compilation of this circuit onto the device under study is not trivial since we can arbitrarily permute the qubits. Every permutation will result in a potentially different decomposition of CNOTs, given the constrained connectivity of the device. We checked all 120 qubit permutations and found that the circuit compilation shown in Fig. 11(b) gives the smallest value of the cost function when evaluated under the noise model. This optimal permutation was found to be [4,3,5,2,1]. Under this permutation, only one CNOT (the second gate from the left in Fig. 11(a)) needs to be decomposed due to the lack of connectivity. The circuit in Fig. 11(b) achieves a fidelity of 0.675.
NACL found the circuit presented in Fig. 12 for $|W_5\rangle$ state preparation. Again, NACL finds a circuit that is much more compact than the textbook one. It uses fewer CNOTs, requires less idling of qubits, and uses the error-free Z(θ) gates liberally. The circuit produces an output state with fidelity F = 0.837 with the ideal $|W_5\rangle$ state. That is, the error (as measured by 1 − F) is reduced by a factor of 2 as compared to the textbook circuit.
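The quoted error-reduction factors follow directly from the reported fidelities. The short check below uses only the numbers stated in this section; the helper name is illustrative.

```python
def error_reduction_factor(f_textbook, f_nacl):
    """Ratio of infidelities (1 - F) of the textbook and NACL circuits."""
    return (1.0 - f_textbook) / (1.0 - f_nacl)

w4 = error_reduction_factor(0.671, 0.8894)  # 4-qubit W state
w5 = error_reduction_factor(0.675, 0.837)   # 5-qubit W state
print(round(w4, 2), round(w5, 2))           # prints 2.97 1.99
```

These ratios round to the factors of 3 and 2 stated for $|W_4\rangle$ and $|W_5\rangle$, respectively.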
FIG. 12. The circuit that approximates preparation of |W5 found by NACL. The notation is the same as in Fig. 5. Angles θj are given in Appendix A.
FIG. 13. (a) A textbook circuit for performing QFT on three qubits. (b) A compilation of the circuit in (a) into the native gate set in the device model we are simulating. The compilation has to take into account that qubits 1 and 3 are not directly connected. Angles θj are specified in Appendix A.

VI. IMPLEMENTATION FOR CIRCUIT COMPILATION
For the circuit compilation application we consider the problem of compiling the quantum Fourier transform (QFT), which is a paradigmatic building block that is used in many quantum algorithms [35]. In the following we will consider implementing a three-qubit QFT.
A textbook circuit for implementing a QFT on three qubits is shown in Fig. 13(a). We will consider implementing this on qubits 1, 2 and 3 in the device shown in Fig. 3. We first need to decompose the controlled Z(θ) rotations. Every controlled Z(θ) is decomposed using two CNOTs [36]. This decomposition leads to two CNOTs between qubits 1 and 3. Since these qubits are not directly connected, these CNOTs need to be decomposed further. The result of this compilation procedure is shown in Fig. 13(b). This compilation leads to a very sparse circuit with many (incompressible) idle gates, which has negative impact on the quality of the final result.
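The two-CNOT decomposition of a controlled Z(θ) rotation can be verified numerically. The sketch below assumes the convention Z(θ) = exp(−iθσ_z/2) and one standard form of the decomposition; the convention and the exact gate ordering used in Ref. [36] and in Fig. 13(b) may differ by local phases or by additional one-qubit rotations on the control.

```python
import numpy as np

def rz(theta):
    """Z(theta) = exp(-i theta sigma_z / 2); this convention is an assumption."""
    return np.diag([np.exp(-0.5j * theta), np.exp(0.5j * theta)])

I2 = np.eye(2, dtype=complex)
# CNOT with the first qubit as control and the second as target.
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

def controlled_rz(theta):
    """Target unitary: apply Z(theta) to the second qubit iff the first is |1>."""
    zeros = np.zeros((2, 2), dtype=complex)
    return np.block([[I2, zeros], [zeros, rz(theta)]])

def two_cnot_decomposition(theta):
    """Circuit order: Z(theta/2) on target, CNOT, Z(-theta/2) on target, CNOT.
    Matrices are multiplied right to left, so the first gate is rightmost."""
    return CNOT @ np.kron(I2, rz(-theta / 2)) @ CNOT @ np.kron(I2, rz(theta / 2))

theta = 0.7
matches = np.allclose(two_cnot_decomposition(theta), controlled_rz(theta))
```

When the control is |0⟩ the two Z rotations cancel; when it is |1⟩ the CNOTs conjugate Z(−θ/2) into Z(θ/2), so the rotations add up to Z(θ). Each such CNOT between unconnected qubits must then be expanded further, which is what makes the compiled textbook circuit sparse.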
The circuit constructed via NACL is shown in Fig. 14. We used NACL with the cost function defined in Eq. (16) with the average process fidelity computed via Eq. (17). The circuit has shorter depth than the compiled textbook circuit, and does not contain a single idle gate (as compared with 18 for the textbook circuit). It also contains more error-free Z(θ) rotations, enhancing the expressiveness of the circuit.
FIG. 14. Circuit performing QFT found by NACL. The notation is the same as in Fig. 5. Angles θj are given in Appendix A.
To compare the performance of the two compiled circuits for QFT, we select 1000 random pure states $|\Psi_j\rangle$ and evaluate each circuit on those states. The error metric we use is the infidelity between the ideal QFT output and the circuit output, $1 - \mathrm{Tr}(\rho_j |\Psi_j^{\mathrm{ex}}\rangle\langle\Psi_j^{\mathrm{ex}}|)$, where $\rho_j$ is the output of the noisy circuit applied to $|\Psi_j\rangle$ and $|\Psi_j^{\mathrm{ex}}\rangle$ is the result of the exact evaluation of QFT on $|\Psi_j\rangle$. Our results are summarized in Fig. 15. For easier comparison, the states $|\Psi_j\rangle$ were ordered such that the error of the textbook circuit (represented by the blue line) increases with the state index $j$. The NACL-generated circuit performed better than the textbook one on all considered states. Since the validation dataset is composed of random pure input states, the average infidelity (over these input states) is related to the entanglement infidelity of the channel defined by the noisy circuit (see Eq. (17)), which is an input-state-independent measure of the quality of a channel (or circuit implementation). We use this relation to validate our error metric defined over randomly sampled input states. In Fig. 15 the dotted lines show $1 - (d F_e(\mathcal{U}^\dagger \circ \mathcal{E}) + 1)/(d + 1)$, where $d = 8$, $\mathcal{U}$ is the channel corresponding to the ideal circuit implementation, $\mathcal{E}$ is the channel corresponding to the noisy circuit implementation, and $F_e$ is the entanglement fidelity defined in Sec. II D. These lines correspond well to the sample averages of our infidelity error metric. We find that NACL reduced the average infidelity from 0.289 to 0.124, that is, by 57%. Another observation is that the performance of the textbook circuit varies more significantly with input state than that of the NACL-generated circuit.
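The relation between the sampled average fidelity and the entanglement fidelity can be checked on a small example. The sketch below is illustrative only: it uses a Kraus representation and a toy noise channel (the ideal three-qubit QFT followed, with a hypothetical probability p, by a Z error on the first qubit) in place of the GST-derived process matrices, and the function names are assumptions.

```python
import numpy as np

def qft_matrix(n_qubits):
    """Ideal QFT unitary on n_qubits (dimension d = 2**n_qubits)."""
    d = 2 ** n_qubits
    idx = np.arange(d)
    return np.exp(2j * np.pi * np.outer(idx, idx) / d) / np.sqrt(d)

def entanglement_fidelity(u_ideal, kraus_ops):
    """F_e of the channel U^dagger composed with E, with E given by Kraus operators."""
    d = u_ideal.shape[0]
    return sum(abs(np.trace(u_ideal.conj().T @ k)) ** 2 for k in kraus_ops) / d ** 2

def average_fidelity_from_fe(f_e, d):
    """Average fidelity over pure input states: (d F_e + 1) / (d + 1)."""
    return (d * f_e + 1) / (d + 1)

def sampled_average_fidelity(u_ideal, kraus_ops, n_samples, rng):
    """Monte Carlo average of <psi_ex| E(psi) |psi_ex> over Haar-random inputs."""
    d = u_ideal.shape[0]
    total = 0.0
    for _ in range(n_samples):
        psi = rng.normal(size=d) + 1j * rng.normal(size=d)
        psi /= np.linalg.norm(psi)
        target = u_ideal @ psi
        total += sum(abs(np.vdot(target, k @ psi)) ** 2 for k in kraus_ops)
    return total / n_samples

# Toy noisy QFT: ideal circuit, then a Z error on the first qubit with probability p.
p = 0.1
U = qft_matrix(3)
Z1 = np.kron(np.diag([1.0, -1.0]), np.eye(4))
kraus = [np.sqrt(1 - p) * U, np.sqrt(p) * Z1 @ U]

f_e = entanglement_fidelity(U, kraus)
f_avg_formula = average_fidelity_from_fe(f_e, d=8)
f_avg_sampled = sampled_average_fidelity(U, kraus, 2000, np.random.default_rng(0))
```

For this toy channel the entanglement fidelity is 1 − p, and the sampled average agrees with the closed-form value up to statistical fluctuations, which is the same consistency check that the dotted lines in Fig. 15 provide for the simulated circuits.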

VII. DISCUSSION AND CONCLUSIONS
We have introduced the framework of noise-aware circuit learning (NACL), whereby the circuit implementation of a quantum algorithm is formulated by machine learning and optimization based on a cost function that captures the goal of the algorithm and a device model that captures the connectivity and noise in the device that executes the circuit. We have shown that this framework can be applied to all of the common tasks in quantum computing, namely observable (or mean-value) extraction, state preparation, and circuit compilation, and demonstrated through examples the types of performance improvements that can be obtained through NACL. For the examples considered here, NACL produces reductions in error rates (suitably defined for the different tasks) by factors of 2 to 3, when compared to textbook circuits for the same tasks.
In general, NACL produces shorter-depth circuits that minimize the impact of stochastic noise sources. However, as demonstrated through the examples considered here, NACL can automatically derive known noise-suppression concepts such as dynamical decoupling and apply these in contexts where they are useful (as defined by the cost function). It also naturally outputs circuits that incorporate commonsense strategies such as minimizing the number of noisy idle gates and maximizing the use of ideal gates, such as error-free Z(θ) rotations. NACL can incorporate much more fine-grained information about the device than other circuit compilation techniques; e.g., in the demonstrations presented here we have used process matrices derived from gate set tomography of real hardware to approximately model noise on this device. Such process matrices can capture effects ignored by effective noise models, such as coherent noise and non-unital processes such as relaxation.
We note that we have also executed NACL with an error model derived from trapped-ion physics (see Appendix C for details), to validate that the technique can be used with a variety of noise model specifications. The results are very similar to those presented above, although there are some simplifications due to an assumption of full connectivity in the device (which is realistic for small trapped-ion platforms).
The noise models currently compatible with NACL do not include crosstalk effects. Although these can be incorporated for small devices using the approach outlined in this paper, incorporating crosstalk in a scalable manner is complicated. The heart of the issue is how to model crosstalk in a scalable manner [28]. In the presence of crosstalk, the natural description of operations on a quantum computer is not in terms of gates, but in terms of layers, which capture what is done to each qubit in the device in a given clock cycle. This is because the precise operation performed on a qubit could, in principle, depend on what is performed on any other qubit in the device. Therefore, the first extension of NACL required to capture crosstalk is to optimize circuits in terms of layers as opposed to gate sequences. Moreover, one also has to consider whether it is realistic to develop quantum channels representing the noisy implementation of any circuit layer. Firstly, there is an exponential (in n, the number of qubits) number of possible layers to characterize, and secondly, one needs to perform n-qubit process tomography in order to get quantum channels for each layer. This last task is infeasible for large n, and therefore one has to develop more approximate techniques to describe noisy implementations of layers. One approach around these issues is to patch together quantum channels derived from one-, two-, and three-qubit tomography to get an approximate description of a circuit layer, similar to what is demonstrated in Govia et al. [37]. This would model a physically important subclass of crosstalk errors [28]. Future work will look at incorporating these more complex noise effects into the NACL circuit learning framework.
An important issue to consider is how to scale NACL to develop noise-resilient circuits for larger devices. The complexity of circuit simulation under a noise model and the complexity of optimization over the circuit parameters increase exponentially with number of qubits. This means that NACL can be used as-is to optimize circuits for small modular elements (operating on 10-20 qubits) of a larger application; e.g., magic state distillation circuits. However, we can also outline a strategy for extending NACL beyond this use-case. The strategy applies when one is already given a circuit compilation for a computational task. Perhaps this is a compilation derived using theoretical decompositions or some other efficient method. Then one can sample a subcircuit from this circuit. This subcircuit defines an ideal unitary and one can use NACL to find best approximations to this unitary under the given device model. This sampling can be repeated for multiple subcircuits. However, note that this strategy does not guarantee any optimality properties for the circuit derived from combining these individually optimized subcircuits. Studying the potential of this strategy for scaling up the NACL framework is left as future work.
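The subcircuit-sampling strategy outlined above can be sketched in a few lines. The representation below is an assumption made for illustration: a circuit is a list of layers, each given as a full-register unitary matrix, and a subcircuit is a randomly chosen contiguous window of layers. The NACL optimization that would then search for a noise-resilient approximation to the resulting unitary is not shown.

```python
import numpy as np

def sample_subcircuit(layers, length, rng):
    """Pick a random contiguous window of `length` layers; returns (start, window)."""
    start = int(rng.integers(0, len(layers) - length + 1))
    return start, layers[start:start + length]

def ideal_unitary(layers):
    """Unitary implemented by applying the layers in list order."""
    dim = layers[0].shape[0]
    u = np.eye(dim, dtype=complex)
    for layer in layers:
        u = layer @ u   # later layers act after earlier ones
    return u

# Toy single-qubit "circuit" with three layers: X, then H, then Z.
X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
Z = np.diag([1.0, -1.0]).astype(complex)
circuit = [X, H, Z]

rng = np.random.default_rng(0)
start, window = sample_subcircuit(circuit, length=2, rng=rng)
target = ideal_unitary(window)   # ideal unitary handed to the optimizer
```

Here `target` plays the role of the specified unitary in the compilation task; repeating the sampling yields a set of small targets that can each be optimized independently, with the caveat stated above that the recombined circuit carries no optimality guarantee.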
Related to scalability is the connection between NACL and variational quantum algorithms (VQAs). An alternative to evaluating the NACL cost functions in Sec. II D by simulating a parameterized quantum circuit on a classical computer is to evaluate them by executing the parameterized circuits on quantum hardware directly, in the spirit of VQAs. In addition to the obvious advantage of scalability, this hardware-enabled approach has the advantage of capturing the noise exactly (and does not require any noise modeling). However, for certain applications (e.g., compiling and state preparation [19,20]), the NACL cost functions require comparing against the ideal target circuit outputs. In a VQA setting, any preparation of the targets would also be noisy, and therefore one cannot exactly evaluate the required cost functions. Whether it is possible to sufficiently approximate the cost functions with noisy hardware is an open problem [38], and if this were possible, it would make hardware-enabled NACL realistic.
Modern optimization and machine learning methods will be critical for deriving computational use from near-term quantum devices. Motivated by this, we have developed the NACL framework as a way to utilize detailed noise characterization information to build noise-resilient circuits for near-term quantum computing applications, and we outlined promising directions for extending this framework. Our NACL method can be combined with (and hence is complementary to) other approaches to error mitigation that have been recently proposed [39][40][41][42][43]. Hence, NACL is a novel primitive that will play an important role in the quest for quantum advantage.

ACKNOWLEDGMENTS
The authors would like to thank Tim Proctor and Andrew Baczewski for useful comments on a draft of this work.
Research presented in this article was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20180628ECR for the noise-free machine learning approach and project number 20190065DR for the machine learning approach in the presence of noise. PJC also acknowledges support from the LANL ASC Beyond Moore's Law project. This work was also supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, under the Quantum Computing Application Teams (QCAT) program.
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525. This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

In Table I we list the angles θn that define the Z(θ) gates in all the circuits presented in the main text.

TABLE I. Angles θn of the Z(θ) gates. Columns: n, followed by θn in Fig. 5, Fig. 6(b), Fig. 7, Fig. 9(c), Fig. 10, Fig. 11(b), Fig. 12, Fig. 13(b), and Fig. 14.

Appendix B: Noise model process matrices
In this Appendix we list the process matrices and SPAM elements derived from GST experiments that define our error model for the 5-qubit device we demonstrate NACL on. These process matrices are completely-positive trace-preserving estimates of the corresponding operations. (We note that, in order to estimate these process matrices, GST required that we also estimate the process matrix corresponding to the Y(π/2) operation. We omit that estimate here as our device model does not include the Y(π/2) gate in the native gate set.) All process matrices are given in the Pauli basis (i.e., they are "Pauli transfer matrices") while the SPAM operations are given in the "standard" representation. Because of throughput constraints only "short" GST circuits (i.e., circuits for linear-inversion GST [10]) were used; each circuit was repeated 1024 times. Here, $P_0$ and $P_1$ are the imperfect POVM effects for projections onto the $|0\rangle$ and $|1\rangle$ states, respectively, and $\rho_0$ is the density matrix for the single-qubit imperfect state preparation. Finally, we also list below various error metrics for these noisy operators (as compared to ideal operators).