Pulse-efficient circuit transpilation for quantum applications on cross-resonance-based hardware

We show a pulse-efficient circuit transpilation framework for noisy quantum hardware. This is achieved by scaling cross-resonance pulses and exposing each pulse as a gate to remove redundant single-qubit operations with the transpiler. Crucially, no additional calibration is needed to yield better results than a CNOT-based transpilation. This pulse-efficient circuit transpilation therefore enables a better usage of the finite coherence time without requiring knowledge of pulse-level details from the user. As a demonstration, we realize a continuous family of cross-resonance-based gates for SU(4) by leveraging Cartan's decomposition. We measure the benefits of a pulse-efficient circuit transpilation with process tomography and observe up to a 50% error reduction in the fidelity of $R_{ZZ}(\theta)$ and arbitrary SU(4) gates on IBM Quantum devices. We apply this framework to quantum applications by running circuits of the Quantum Approximate Optimization Algorithm applied to MAXCUT. For an 11-qubit non-hardware-native graph, our methodology reduces the overall schedule duration by up to 52% and errors by up to 38%.


I. INTRODUCTION
Quantum computers have the potential to impact a broad range of disciplines such as quantum chemistry [1], finance [2,3], optimization [4,5], and machine learning [6,7]. The performance of noisy quantum computers has been improving as measured by metrics such as the Quantum Volume [8,9] or the coherence of superconducting transmon-based devices [10-12], which has exceeded 100 µs [13,14]. To overcome limitations set by the noise, several error mitigation techniques such as readout error mitigation [15,16] and Richardson extrapolation [17,18] have been developed. Gate families with continuous parameters further improve results [19-21] as they require less coherence time than circuits in which the CNOT is the only two-qubit gate. Aggregating instructions and optimizing the corresponding pulses, using e.g. gradient ascent algorithms such as GRAPE [22], reduces the duration of the pulse schedules [23]. However, such pulses require calibration to overcome model errors [24,25], which typically needs closed-loop optimization [26,27] and sophisticated readout methods [28,29]. This may therefore be difficult to scale as calibration is time-consuming and becomes increasingly hard as the control pulses become more complex. Some of these limitations may be overcome with novel control methods [30].
Since calibrating a two-qubit gate is time-consuming, IBM Quantum [14] backends only expose a calibrated CNOT gate built from echoed cross-resonance pulses [31,32] with rotary tones [33]. Quantum circuit users must therefore transpile their circuits to CNOT gates, which often makes poor use of the limited coherence time. With the help of Qiskit Pulse [34,35] users may extend the set of two-qubit gates [36-38]. Such gates can in turn generate other multi-qubit gates more effectively than when the CNOT gate is the only two-qubit gate available [37]. However, creating these gates comes at the expense of additional calibration, which is often impractical on a queue-based quantum computer. Furthermore, only a limited number of users can access these benefits due to the need for an intimate familiarity with quantum control. In Ref. [39] the authors show a pulse-scaling methodology to create the control pulses for the continuous gate set $R_{ZX}(\theta)$, which they leverage to create $R_{YX}(\theta)$ gates and manually assemble into pulse schedules. Crucially, the scaled pulses improved gate fidelity without the need for any extra calibration.
Here, we extend the methodology of Ref. [39] to arbitrary SU(4) gates and show how to make pulse-efficient circuit transpilation available to general users without having to manipulate pulse schedules. In Sec. II we review the pulse-scaling methodology of Ref. [39] and carefully benchmark the performance of $R_{ZZ}$ gates. Next, in Sec. III, we leverage this pulse-efficient gate generation to create arbitrary SU(4) gates, which we benchmark with quantum process tomography [40,41]. In Sec. IV we show how pulse-efficient gates can be included in automated circuit transpiler passes. Finally, in Sec. V we demonstrate the advantage of our pulse-efficient transpilation by applying it to the Quantum Approximate Optimization Algorithm (QAOA) [4].

II. SCALING HARDWARE-NATIVE CROSS-RESONANCE GATES
We consider an all-microwave fixed-frequency transmon architecture that implements the echoed cross-resonance gate [32]. A two-qubit system in which a control qubit is driven at the frequency of a target qubit evolves under the time-dependent cross-resonance Hamiltonian $H_{cr}(t)$. The time-independent approximation of $H_{cr}(t)$ is
\[ \bar{H}_{cr} = \frac{1}{2}\left(Z \otimes B + I \otimes C\right), \]
where
\[ B = \omega_{ZI} I + \omega_{ZX} X + \omega_{ZY} Y + \omega_{ZZ} Z, \qquad C = \omega_{IX} X + \omega_{IY} Y + \omega_{IZ} Z. \]
Here, $X$, $Y$, and $Z$ are Pauli matrices, $I$ is the identity, and $\omega_{ij}$ are drive strengths. An echo sequence [32] and rotary tones [33] isolate the $ZX$ interaction, which ideally results in the unitary $R_{ZX}(\theta) = \exp\{-i\theta ZX/2\}$. The rotation angle is $\theta = t_{cr}\,\omega_{ZX}(\bar{A})$, where $t_{cr}$ is the duration of the cross-resonance drive. The drive strength $\omega_{ZX}$ has a non-linear dependency on the average drive amplitude $\bar{A}$, as shown by a third-order approximation of the cross-resonance Hamiltonian [35,42].
IBM Quantum systems expose to their users a calibrated CNOT gate built from $R_{ZX}(\pi/2)$ rotations implemented by the echoed cross-resonance gate. The pulse sequence of $R_{ZX}(\pi/2)$ on the control qubit is $CR(\pi/4)\,X\,CR(-\pi/4)\,X$. Here, $CR(\pm\pi/4)$ are flat-top pulses of amplitude $A^*$, width $w^*$, and Gaussian flanks with standard deviation $\sigma$, truncated after $n_\sigma$ times $\sigma$. Their area is
\[ \alpha^* = |A^*|\left[w^* + \sqrt{2\pi}\,\sigma\,\mathrm{erf}(n_\sigma)\right], \]
where the star superscript refers to the parameter values of the calibrated pulses in the CNOT gate. During each CR pulse, rotary tones are applied to the target qubit to help reduce the magnitude of the undesired $\omega_{IY}$ interaction. We can create $R_{ZX}(\theta)$ rotations by scaling the area of the CR and rotary pulses following $\alpha(\theta) = 2\theta\alpha^*/\pi$, as done in Ref. [39]. To create a target area $\alpha(\theta)$ we first scale $w$ to minimize the effect of the non-linearity between the drive strength $\omega_{ZX}(\bar{A})$ and the pulse amplitude. When $\alpha(\theta) < |A^*|\sigma\sqrt{2\pi}\,\mathrm{erf}(n_\sigma)$ we set $w = 0$ and scale the pulse amplitude such that $|A(\theta)| = \alpha(\theta)/[\sigma\sqrt{2\pi}\,\mathrm{erf}(n_\sigma)]$.
We investigate the effect of the pulse-scaling methodology with quantum process tomography by carefully benchmarking scaled $R_{ZZ}(\theta)$ gates, see Fig. 1(a), with respect to the double-CNOT decomposition, see Fig. 1(b). We measure the process fidelity $F[U_{meas}, R_{ZZ}(\theta)]$ between the target gate $R_{ZZ}(\theta)$ and the measured gate $U_{meas}$. To determine $U_{meas}$ we prepare each qubit in $|0\rangle$, $|1\rangle$, $(|0\rangle + |1\rangle)/\sqrt{2}$, and $(|0\rangle + i|1\rangle)/\sqrt{2}$ and measure in the $X$, $Y$, and $Z$ bases. Two-qubit process tomography therefore requires a total of 148 circuits for each angle of interest, which includes four circuits needed to mitigate readout errors [15,16]. The scaled pulses consistently have a better fidelity than the double-CNOT benchmark, as demonstrated by the data gathered on ibmq_mumbai with qubits one and two, see Fig. 1(c). Appendix B shows key device parameters and additional data taken on other IBM Quantum devices, which illustrate the reliability of the methodology. The relative error reduction of the measured gate fidelity correlates well with the relative error reduction of the coherence-limited average gate fidelity [33,43,44], see Fig. 1(d) and details in Appendix C. We therefore attribute the error reduction to the shorter schedules as they use less coherence time.
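The count of 148 tomography circuits can be reproduced with a short enumeration. The sketch below is illustrative only: the state and basis labels are ours, and it assumes that the four readout-mitigation circuits are one calibration circuit per two-qubit computational basis state.

```python
from itertools import product

# Single-qubit input states and measurement bases listed in the text.
PREP_STATES = ["0", "1", "+", "+i"]
MEAS_BASES = ["X", "Y", "Z"]


def tomography_settings(n_qubits):
    """Return every (preparation, measurement) combination for n qubits."""
    preps = product(PREP_STATES, repeat=n_qubits)
    meas = product(MEAS_BASES, repeat=n_qubits)
    return list(product(preps, meas))


n_qubits = 2
settings = tomography_settings(n_qubits)

# Assumption: readout mitigation uses one circuit per computational basis state.
n_readout_cal = 2 ** n_qubits

total = len(settings) + n_readout_cal
print(len(settings), n_readout_cal, total)  # 144 4 148
```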
In addition to the gate fidelity, we compare the deviation $\Delta\theta$ from the target angle of both implementations of the $R_{ZZ}(\theta)$ rotation. The deviation $\Delta\theta$ is the difference between the target rotation angle $\theta$ and the angle $\theta_{max}$ which satisfies [45]
\[ \theta_{max} = \arg\max_{\theta'} F[U_{meas}, R_{ZZ}(\theta')]. \]
The angle deviation of the implementation with two CNOT gates does not depend on the desired target angle, see Fig. 1(e). However, the scaled gate has two competing non-linearities: an expected non-linearity from the amplitude scaling and an unexpected one from scaling the width. As the width is scaled down, the angle deviation increases from ∼10 mrad to ∼35 mrad. Once the amplitude scaling begins, a non-linearity arises which reduces the deviation angle of the scaled gates. At $\alpha(\theta) \approx |A^*|\sigma\sqrt{2\pi}\,\mathrm{erf}(n_\sigma)/2$ the angle deviation of the scaled gates once again matches the deviation of the benchmark within the measured standard deviation.
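The pulse-scaling rule described in this section can be sketched in a few lines. This is a minimal illustration, not the Qiskit implementation: the function name and arguments are hypothetical, the pulse area is taken to be $|A|[w + \sigma\sqrt{2\pi}\,\mathrm{erf}(n_\sigma)]$, and only non-negative angles are handled.

```python
import math


def scale_cr_pulse(theta, amp_star, width_star, sigma, n_sigma):
    """Return (amplitude, width) of a CR pulse scaled to the angle theta.

    amp_star and width_star are the calibrated amplitude and flat-top width
    of the CR pulses in the CNOT gate (hypothetical argument names).
    """
    if theta < 0:
        raise ValueError("this sketch only handles theta >= 0")

    # Area contributed by the two truncated Gaussian flanks at unit amplitude.
    flank_area = sigma * math.sqrt(2 * math.pi) * math.erf(n_sigma)
    # Area of the calibrated pulse and target area alpha(theta) = 2 theta alpha*/pi.
    area_star = abs(amp_star) * (width_star + flank_area)
    target_area = 2 * theta * area_star / math.pi

    if target_area >= abs(amp_star) * flank_area:
        # Scale the width first and keep the calibrated amplitude.
        width = target_area / abs(amp_star) - flank_area
        return amp_star, width

    # The width is exhausted: set it to zero and scale the amplitude instead.
    scale = target_area / (abs(amp_star) * flank_area)
    return amp_star * scale, 0.0


# Example with made-up calibration values (arbitrary units).
amp_cal, width_cal = scale_cr_pulse(math.pi / 2, 0.5, 400.0, 64.0, 2.0)
amp_small, width_small = scale_cr_pulse(0.1, 0.5, 400.0, 64.0, 2.0)
```

At $\theta = \pi/2$ the function returns the calibrated parameters, while for small angles the width is zero and only the amplitude shrinks, mirroring the two regimes discussed above.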

III. CREATING ARBITRARY SU(4) GATES
We now generalize the results from Sec. II. Cartan's decomposition of an arbitrary two-qubit gate $U \in SU(4)$ is $U = k_1 A k_2$, which we refer to as Cartan's KAK decomposition [46]. Here $k_1$ and $k_2$ are local operations, i.e. $k_{1,2} \in SU(2) \otimes SU(2)$, and $A = e^{i k^T \cdot \Sigma/2}$ is the non-local operation with $\Sigma^T = (XX, YY, ZZ)$ [47-49], see Fig. 2(a). The non-local term is defined by the three angles $k^T = (\alpha, \beta, \gamma)$. Geometrically, the KAK decomposition is represented in a tetrahedron known as the Weyl chamber in three-dimensional space, see Fig. 3. Every point $(\alpha, \beta, \gamma)$ in the Weyl chamber (except in the base) defines a continuous set of two-qubit gates equivalent up to single-qubit rotations [47]. For instance, the point $(\frac{\pi}{2}, 0, 0)$, labeled as $C$ in Fig. 3, corresponds to the local equivalence class of the CNOT gate, and the point $(\frac{\pi}{2}, \frac{\pi}{2}, \frac{\pi}{2})$, labeled as $A_3$, represents the SWAP gate.
Since the rotations generated by $XX$, $YY$, and $ZZ$ are locally equivalent to rotations generated by $ZX$ we decompose the non-local $e^{i k^T \cdot \Sigma/2}$ term into a circuit with three $R_{ZX}$ rotations, see Fig. 2(c). We shorten the total duration of the circuit by exposing the echo in the cross-resonance gate, see Fig. 2(d), to the transpiler. This ensures that at most one single-qubit pulse is needed on each qubit between each non-echoed cross-resonance $R_{ZX}$ gate. By scaling the cross-resonance pulses we create the $R_{ZX}$ gates for arbitrary angles and therefore generalize the methods of Sec. II to arbitrary gates in SU(4).
We generate $R_{ZX}$-based circuits as shown in Fig. 2(e) for $(\alpha, \beta, \gamma)$ angles chosen at random from the Weyl chamber and measure their fidelity using process tomography with readout error mitigation. Each $R_{ZX}$-based circuit is benchmarked against its equivalent three-CNOT decomposition presented in Ref. [50] and shown in Fig. 2(b). The experiments are run on ibmq_dublin and ibmq_mumbai with 2048 shots for each circuit, which we measure three times to gain statistics. The pulse-efficient $R_{ZX}$-based decomposition of the circuits results in a significant fidelity increase for almost all angles, see Fig. 4.
A subset of the data is also shown in the Weyl chamber in Fig. 3. The correlation between the relative error reduction and the relative schedule duration indicates that the gains in fidelity come from a better usage of the finite coherence time, as the scaled cross-resonance pulses achieve the same unitary in less time. Remarkably, these results were achieved without recalibrating any pulses.
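The local equivalence used in this section can be checked numerically. The sketch below is an illustration with NumPy, not the circuit of Fig. 2: it builds the non-local term $e^{i(\alpha XX + \beta YY + \gamma ZZ)/2}$ from three $R_{ZX}$ rotations dressed with single-qubit Clifford gates and compares it with a direct matrix exponential. The particular single-qubit gates are one possible choice made for this example.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
H = (X + Z) / np.sqrt(2)
S = np.diag([1, 1j])


def rzx(theta):
    """R_ZX(theta) = exp(-i theta ZX / 2), in closed form since (ZX)^2 = 1."""
    return np.cos(theta / 2) * np.eye(4) - 1j * np.sin(theta / 2) * np.kron(Z, X)


def dress(local, gate):
    """Conjugate a two-qubit gate with a local (tensor-product) unitary."""
    return local @ gate @ local.conj().T


def nonlocal_from_rzx(alpha, beta, gamma):
    """Non-local KAK term from three R_ZX rotations and single-qubit gates."""
    to_xx = np.kron(H, I2)     # H Z H = X on the first qubit
    to_yy = np.kron(S @ H, S)  # (SH) Z (SH)^dag = Y and S X S^dag = Y
    to_zz = np.kron(I2, H)     # H X H = Z on the second qubit
    # exp(i a PP / 2) equals a rotation by the angle -a.
    return (dress(to_xx, rzx(-alpha))
            @ dress(to_yy, rzx(-beta))
            @ dress(to_zz, rzx(-gamma)))


def nonlocal_direct(alpha, beta, gamma):
    """exp(i (alpha XX + beta YY + gamma ZZ) / 2) via an eigendecomposition."""
    gen = (alpha * np.kron(X, X) + beta * np.kron(Y, Y) + gamma * np.kron(Z, Z)) / 2
    vals, vecs = np.linalg.eigh(gen)
    return vecs @ np.diag(np.exp(1j * vals)) @ vecs.conj().T


a, b, c = 0.3, 0.2, 0.1
print(np.allclose(nonlocal_from_rzx(a, b, c), nonlocal_direct(a, b, c)))  # True
```

The three factors commute because $XX$, $YY$, and $ZZ$ commute, so their order in the product does not matter.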

IV. PULSE-EFFICIENT TRANSPILER PASSES
The quantum circuits of an algorithm are typically expressed using generic gates such as the CNOT or controlled-phase gate and then transpiled to the hardware on which they are run [51]. Quantum algorithms can benefit from the continuous family of gates presented in Sec. II and III if the underlying quantum circuit is either directly built from, or transpiled to, the hardware-native $R_{ZX}(\theta)$ gate. We now show how to transpile quantum circuits to an $R_{ZX}(\theta)$-based circuit with template substitution [52].
A template is a quantum circuit made of $|T|$ gates acting on $n_T$ qubits that compose to the identity, $U_1 \ldots U_{|T|} = \mathbb{1}$, see e.g. Fig. 5(b) and (c). In a template substitution transpilation pass we identify a subset of the gates in the template, $U_a \ldots U_b$, that match those in a given quantum circuit. Next, if the cost of the matched gates is higher than the cost of the unmatched gates in the template we replace $U_{match} = U_a \ldots U_b$ with $U_{a-1}^\dagger \ldots U_1^\dagger U_{|T|}^\dagger \ldots U_{b+1}^\dagger$. As cost we use a heuristic that sums the cost of each gate, defined as an integer weight which is higher for two-qubit gates; details are provided in Appendix A. The complexity of the template matching algorithm on a circuit with $|C|$ gates and $n_C$ qubits is exponential in the template length [52]. We therefore create short templates where the inverse of the intended match, i.e. $U_{match}^\dagger$, is specified as a single gate with rules to further decompose it into $R_{ZX}$ and single-qubit gates in a subsequent transpilation pass. In these decompositions we expose the echoed cross-resonance implementation of $R_{ZX}$ to the transpiler by writing $R_{ZX}(\theta) = X R_{ZX}(-\theta/2) X R_{ZX}(\theta/2)$. This allows the transpiler to further simplify the single-qubit gates that would otherwise be hidden in the schedules of the two-qubit gates, as exemplified in the circuit in Fig. 5(e). Finally, once the $R_{ZX}(\theta)$ gates are introduced into the quantum circuit we run a third transpilation pass to attach pulse schedules to each $R_{ZX}(\theta)$ gate, built from the backend's calibrated CNOT gates following the procedure in Sec. II. The attached schedules consist of the scaled cross-resonance pulse and rotary tone without any echo. Details on the Qiskit implementation are given in Appendix A.
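The echo identity exposed to the transpiler can be verified numerically. The following NumPy sketch is a sanity check written for this purpose, assuming that $X$ acts on the control qubit (the first tensor factor) and that the product is read as a matrix product.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
ZX = np.kron(Z, X)


def rzx(theta):
    """R_ZX(theta) = exp(-i theta ZX / 2), in closed form since (ZX)^2 = 1."""
    return np.cos(theta / 2) * np.eye(4) - 1j * np.sin(theta / 2) * ZX


def echoed_rzx(theta):
    """X R_ZX(-theta/2) X R_ZX(theta/2) with X on the control qubit."""
    x_ctrl = np.kron(X, I2)
    # X Z X = -Z, so the X pair flips the sign of the enclosed rotation angle.
    return x_ctrl @ rzx(-theta / 2) @ x_ctrl @ rzx(theta / 2)


theta = 0.7
print(np.allclose(echoed_rzx(theta), rzx(theta)))  # True
```

Because the identity holds exactly, with no residual phase, the two half-angle rotations and the two $X$ gates can be treated as ordinary circuit instructions that the transpiler is free to merge with neighbouring single-qubit gates.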

V. IMPROVING QAOA WITH CARTAN'S DECOMPOSITION
We use the QAOA [4,53,54], applied to MAXCUT, to demonstrate gains of a pulse-efficient circuit transpilation on noisy hardware. QAOA maps a quadratic binary optimization problem with $n$ decision variables to a cost function Hamiltonian $\hat{H}_C = \sum_{i,j} \alpha_{i,j} Z_i Z_j$, where $\alpha_{i,j} \in \mathbb{R}$ are problem dependent and $Z_i$ are Pauli $Z$ operators. The ground state of $\hat{H}_C$ encodes the solution to the problem. Next, a classical solver minimizes the energy $\langle\psi(\beta, \gamma)|\hat{H}_C|\psi(\beta, \gamma)\rangle$ of a trial state $|\psi(\beta, \gamma)\rangle$ created by applying $p$ layers of the operator $\exp(-i\beta_k \sum_{i=0}^{n-1} X_i)\exp(-i\gamma_k \hat{H}_C)$, where $k = 1, \ldots, p$, to the equal superposition of all states.
Implementing the operator $\exp(-i\gamma_k \hat{H}_C)$ requires applying the $R_{ZZ}(\theta) = \exp(-i\theta ZZ/2)$ gate on pairs of qubits. However, to overcome the limited connectivity of superconducting qubit chips [55], several $R_{ZZ}(\theta)$ gates are followed or preceded by a SWAP, resulting in the unitary operator
\[ \mathrm{SWAP}(\theta) = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & e^{i\theta} & 0 \\ 0 & e^{i\theta} & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \]
up to a global phase. When mapped to the KAK decomposition, $\mathrm{SWAP}(\theta)$ corresponds to $k^T = (\eta\pi/2, \eta\pi/2, \theta + \eta\pi/2)$ where $\eta = -1$ if $\theta > 0$ and $1$ otherwise. This allows us to reduce the total cross-resonance duration using the methodology presented in Sec. III. We perform a depth-one QAOA circuit for an eleven-node graph, shown in Fig. 6(b), built from CNOT gates. We map the decision variables zero to ten to qubits 7, 10, 12, 15, 18, 13, 8, 11, 14, 16, 19 on ibmq_mumbai, respectively. Since the graph is non-hardware-native, eight SWAP gates are needed to implement the circuits. In QAOA the optimal values of $(\beta, \gamma)$ are found with a classical optimizer [56]. Here, we instead scan $\beta$ over ±2 rad and $\gamma$ over ±1 rad, as we submit jobs through the queue of the cloud-based IBM Quantum computers. For each $(\beta, \gamma)$ pair we run the circuits with the noiseless QASM simulator in Qiskit, see Fig. 6(a), and twice on the hardware. The first hardware run is done using a CNOT decomposition with the Qiskit transpiler on optimization level three, see Fig. 6(c) for results. The second run is done with the pulse-efficient circuit transpilation, see Fig. 6(e) for results. Here, we first perform the template substitution with the $R_{ZZ}(\theta)$ and $\mathrm{SWAP}(\theta)$ templates shown in Fig. 5(b) and (c); see Appendix A for further details. A second transpilation pass then exposes the $R_{ZX}(\theta)$ gates, to which we attach pulse schedules in a third transpilation pass following Sections II-IV.
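As a sanity check of the combined gate, the NumPy sketch below assumes that $\mathrm{SWAP}(\theta)$ denotes a SWAP applied together with $R_{ZZ}(\theta)$, as the text describes, and verifies that the product is a SWAP-like matrix with phases $e^{i\theta}$ on the exchanged states, up to the global phase $e^{-i\theta/2}$. The function names are ours.

```python
import numpy as np

SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)
ZZ = np.diag([1, -1, -1, 1]).astype(complex)


def rzz(theta):
    """R_ZZ(theta) = exp(-i theta ZZ / 2), in closed form since (ZZ)^2 = 1."""
    return np.cos(theta / 2) * np.eye(4) - 1j * np.sin(theta / 2) * ZZ


def phase_swap(theta):
    """SWAP-like matrix with a phase exp(i theta) on the exchanged states."""
    e = np.exp(1j * theta)
    return np.array([[1, 0, 0, 0],
                     [0, 0, e, 0],
                     [0, e, 0, 0],
                     [0, 0, 0, 1]], dtype=complex)


theta = 0.9
product = SWAP @ rzz(theta)

# The order of SWAP and R_ZZ does not matter because the two gates commute.
print(np.allclose(product, rzz(theta) @ SWAP))  # True
# Removing the global phase exp(-i theta / 2) yields the phase-swap matrix.
print(np.allclose(np.exp(1j * theta / 2) * product, phase_swap(theta)))  # True
```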
In each case we measure 4096 shots. The pulse-efficient circuits produce less noisy average cut values, compare Fig. 6(c) with (e), and have a lower absolute deviation from the noiseless simulation than the circuits transpiled to CNOT gates, compare Fig. 6(d) with (f). The maximum error in the cut value averaged over the sampled bit-strings is reduced by 38%, from 3.65 to 2.26. We attribute the increased quality of the results to the decrease in total cross-resonance time and the fact that the pulse-efficient transpilation keeps the number of single-qubit pulses to a minimum. In total, we observe a reduction in total schedule duration ranging from 42% to 52% depending on $\gamma$ when using the pulse-efficient transpilation methodology, see Fig. 7. Since the schedule duration of $R_{ZZ}(\gamma\alpha_{i,j})$ and $\mathrm{SWAP}(\gamma\alpha_{i,j})$ decreases and increases as $\gamma$ decreases, respectively, we observe a non-monotonic reduction in the schedule duration of the QAOA circuit as a function of $\gamma$.

VI. DISCUSSION AND CONCLUSION
The results in Sec. II and III showed that by scaling cross-resonance gates we can automatically create a continuous family of gates which implements SU(4). These scaled gates typically have shorter pulse schedules and higher fidelities than the digital CNOT implementation. This fidelity is limited by coherence, imperfections in the initial calibration, and non-linear effects. Crucially, the resulting gate-tailored pulse schedules do not require additional calibration and can therefore be automatically generated by the transpiler. Transpilation passes, as discussed in Sec. IV, can be leveraged to identify and attach the scaled pulse schedules to the gates in a quantum circuit. Furthermore, exposing the echo in the cross-resonance gate to the transpiler allows further simplifications of the single-qubit gates. We used this pulse-efficient transpilation methodology to reduce errors in an eleven-qubit depth-one QAOA.
Scaled gates are particularly appealing for Trotter-based applications, as shown in Ref. [39], and could therefore benefit quantum simulations [57]. Future work may also include scaling direct cross-resonance gates [9] and benchmarking their impact on Quantum Volume [8]. Methods to interpolate pulse parameters based on a set of reference $R_{ZX}(\theta)$ gates, calibrated at a few reference angles $\theta$, might also improve the gate fidelity and help deal with non-linearities between the rotation angle $\theta$ and pulse parameters. For variational algorithms, such as the variational quantum eigensolver, the scaled SU(4) gates may allow for better results due to the shorter schedules while still being robust to some unitary errors such as angle errors [58,59].
We believe that the methods presented in our work will help users of noisy quantum hardware to reap the benefits of pulse-level control without having to know its intricacies.This can improve the quality of a broad class of quantum applications running on noisy quantum hardware.
We extended the Qiskit implementation of Ref. [52] to parametric templates. To avoid a symbolic description of the unitary matrix of each gate we first match gates by qubits and name. This is however not sufficient to create a valid match since, for example, the parametric template in Fig. 8(a) produces two tentative matches on the circuit in Fig. 8(b). We therefore form a system of equations based on the tentative match. If this system of equations accepts a solution the match is valid. For example, the tentative match in Fig. 8(b), indicated by the dashed blue box, results in a system of equations which accepts the solution $\theta = 2$ and is therefore valid. However, the second tentative match, highlighted by the dotted purple box, results in a system of equations which has no solution and is therefore not valid.
We achieve a pulse-efficient circuit transpilation with Qiskit by using three transpilation steps shown in Fig. 9. First, the TemplateOptimization transpilation pass is applied with the SWAP and rzz templates as shown in Fig. 5. Second, the substituted gates are decomposed into $R_{ZX}$ and single-qubit gates with the echo exposed to the transpiler, as described in Sec. IV. Third, the RZXCalibrationBuilderNoEcho class scales the pulses of the cross-resonance gates and attaches them to the $R_{ZX}(\theta)$ gates in the circuit. Figure 10 exemplifies the result of the first transpilation pass applied to the QAOA circuit in Sec. V.
The template optimization pass requires a cost dictionary to determine if it is favourable to replace the matched gates $U_{match} = U_a \ldots U_b$ from a template with the Hermitian conjugate of the remaining part of the template, $U_{a-1}^\dagger \ldots U_1^\dagger U_{|T|}^\dagger \ldots U_{b+1}^\dagger$. The cost dictionary has gates as keys and their cost as value. The cost of $U_a \ldots U_b$ is the sum of the costs of each individual gate $U_a$ to $U_b$. We used the cost dictionary {'sx': 1, 'x': 1, 'rz': 0, 'cx': 2, 'rzz': 0, 'swap': 6, 'phase_swap': 0}, which assigns a zero cost to the rzz and phase_swap gates that correspond to the pulse-efficient implementation of $R_{ZZ}(\theta)$ and $\mathrm{SWAP}(\theta)$. Single-qubit gates have unit cost except for rz, which is implemented with virtual Z-rotations. The CNOT gate, i.e. cx, and the standard SWAP gate, i.e. swap, have costs two and six, respectively. This cost dictionary ensures that the template substitution will include $R_{ZZ}(\theta)$ and $\mathrm{SWAP}(\theta)$ in the case of a match. Future work could improve this heuristic cost dictionary either by using the fidelity of the gates (if this metric is available) or the duration of the underlying pulse schedules as cost.
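The substitution rule based on the cost dictionary can be illustrated with plain Python. This is a sketch of the heuristic as described in the text, not the code of the Qiskit pass; the helper names and the example gate lists are hypothetical.

```python
# Cost dictionary quoted in the text: gate name -> integer cost.
COST = {'sx': 1, 'x': 1, 'rz': 0, 'cx': 2, 'rzz': 0, 'swap': 6, 'phase_swap': 0}


def gate_cost(gate_names, cost=COST):
    """Sum the individual gate costs of a sequence of gate names."""
    return sum(cost[name] for name in gate_names)


def should_substitute(matched, remainder, cost=COST):
    """True if the matched gates cost more than the rest of the template.

    Assumption: a gate and its inverse have the same cost, so the cost of
    the remainder stands in for the cost of its Hermitian conjugate.
    """
    return gate_cost(matched, cost) > gate_cost(remainder, cost)


# Hypothetical example: a CNOT-RZ-CNOT pattern matched against an rzz template.
print(should_substitute(['cx', 'rz', 'cx'], ['rzz']))               # True
# Hypothetical example: the same pattern followed by a standard SWAP.
print(should_substitute(['cx', 'rz', 'cx', 'swap'], ['phase_swap']))  # True
# Single-qubit gates alone never justify inserting a CNOT pair.
print(should_substitute(['sx', 'rz'], ['cx', 'cx']))                # False
```

Since rzz and phase_swap have zero cost, any match containing at least one cx or swap gate is replaced, which is the behaviour the cost dictionary is designed to guarantee.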
Since the qubit coherence times as well as the CNOT gate duration and error mainly limit the fidelity of the scaled cross-resonance gates, we list their values for the qubits and devices we experimented with in Tab. I. To illustrate that scaling imperfect cross-resonance gates improves the gate fidelity we measured the process fidelity on several IBM Quantum devices and qubit pairs. In almost all measurements the scaled gates have a higher fidelity than the double-CNOT benchmark and the relative error reduction increases as the schedule duration decreases, see Fig. 11. Fig. 12 shows additional quantum process tomography results for Cartan-decomposed circuits chosen at random in the Weyl chamber. The experiments were performed on ibmq_dublin, see Fig. 12(a), and ibmq_paris using different qubit pairs, see Fig. 12(b)-(d). For almost all angles the relative error reduction is positive, which demonstrates the advantage of a hardware-native, scaled cross-resonance-gate-based circuit implementation.
The derivation of this limit is discussed in more detail in Appendix G of Ref. [33]. Here, $t$ is the gate duration while $T_{1,a}$ and $T_{2,a}$ are the $T_1$ and $T_2$ times for qubit $a$. Since the process fidelity and the average gate fidelity are linearly related [43,44] we compare the relative error reduction in the measured process fidelity with the theoretical relative error reduction in $E$.

Figure 4. Gate error reduction of the pulse-efficient SU(4) gates relative to the three-CNOT benchmark for random angles in the Weyl chamber measured on ibmq_dublin, qubits one and two (light blue circles), and ibmq_mumbai, qubits 19 and 16 (dark blue triangles). The x-axis is the duration of the pulse-efficient SU(4) gates relative to the three-CNOT benchmark. The angles of three gates are indicated in parentheses as examples.

Figure 5. Pulse-efficient transpilation example. (a) Circuit of the cost operator for a QAOA circuit implemented on three qubits connected in a line. (b) and (c) Templates of the $R_{ZZ}$ and phase-swap gates, respectively. Here, $R_{ZZ}(\theta)$ and $\mathrm{SWAP}(\theta)$ hold the rules with which to decompose them into the hardware-native $R_{ZX}$ gates. (d) Circuit resulting from the template matching of (b) and (c) performed on circuit (a). (e) Circuit resulting from a transpilation of (d) which uses the decomposition rules of $R_{ZZ}(\theta)$ and $\mathrm{SWAP}(\theta)$ into $R_{ZX}$. To shorten the circuit figure we replaced $R_Z(n\pi/2)\sqrt{X}R_Z(m\pi/2)$ with $U_{n,m}$ and $c = 0.215$.

Figure 6. Depth-one QAOA energy landscape. (a) Noiseless simulation of the cut value, averaged over all 4096 bit-strings sampled from $|\psi(\beta, \gamma)\rangle$, obtained using the QASM simulator for the weighted graph shown in (b). The maximum cut, with value 28, is indicated by the color of the nodes in (b). Figures (c) and (e) show hardware results obtained by transpiling to CNOT gates and by using the $R_{ZX}$ pulse-efficient methodology, respectively. Figures (d) and (f) share the same color scale and show the absolute deviation from the ideal averaged cut values in figures (c) and (e), respectively.

Figure 7. QAOA schedule durations. (a) Duration of the scheduled quantum circuits transpiled to CNOTs with optimization level three (blue circles) and with the pulse-efficient methodology (orange triangles). In both cases we removed the final measurements from the quantum circuits. (b) Length of the pulse-efficient schedules relative to the CNOT-based schedules.

Figure 10. Eleven-qubit QAOA circuit with $\gamma = 1$ and $\beta = -2$ after the template substitution. The final measurement instructions have been omitted.

Table I. Summary of the properties of the CNOT gates and coherence times for the qubits used to benchmark the performance of the scaled cross-resonance gates.