Co-Design quantum simulation of nanoscale NMR

Quantum computers have the potential to efficiently simulate the dynamics of nanoscale NMR systems. In this work we demonstrate that a noisy intermediate-scale quantum computer can be used to simulate and predict nanoscale NMR resonances. In order to minimize the required gate fidelities, we propose a superconducting application-specific Co-Design quantum processor that reduces the number of SWAP gates by over 90 % for chips with more than 20 qubits. The processor consists of transmon qubits capacitively coupled via tunable couplers to a central co-planar waveguide resonator with a quantum circuit refrigerator (QCR) for fast resonator reset. The QCR implements the non-unitary quantum operations required to simulate nuclear hyperpolarization scenarios.

Quantum computers have the potential to efficiently simulate the dynamics of nanoscale NMR systems. In this work we demonstrate that a noisy intermediate-scale quantum computer can be used to simulate and predict nanoscale NMR resonances. In order to minimize the required gate fidelities, we propose a superconducting application-specific Co-Design quantum processor that reduces the number of SWAP gates by over 90% for chips with more than 20 qubits. The processor consists of transmon qubits capacitively coupled via tunable couplers to a central co-planar waveguide resonator with a quantum circuit refrigerator (QCR) for fast resonator reset. The QCR implements the nonunitary quantum operations required to simulate nuclear hyperpolarization scenarios.

I. INTRODUCTION
Computer simulations are the backbone of scientific research and technological development. Quantum computers promise in the long term to enable simulations of systems that are intractable to even the largest supercomputers [1,2]. Currently, scientists have access to so-called noisy intermediate-scale quantum (NISQ) computers [3], that present limited qubit counts without error correction. While applications of error-corrected quantum computers are well established, use cases where NISQ devices might achieve quantum advantage are still elusive [4]. In the search for these early applications, the problem must fit the hardware, and the hardware must enable implementation with minimal overheads.
Application-Specific Integrated Chips (ASICs) are highly specialized processors optimized for specific problems when execution speed, power efficiency, or miniaturization is of utmost importance [5]. A prominent example where computational speed and energy efficiency are optimised through the use of ASICs is training of artifical neural networks using tensor processing units [6,7]. Building a general-purpose quantum computer capable of rivaling the most powerful classical computers has proven to be a difficult task, so it is likely that the first devices reaching useful quantum advantage will use quantum ASICs, also called Co-Design quantum computers.
A good example of a problem with suitable structure for simulation by quantum computers is nanoscale nuclear magnetic resonance (NMR) [8]. The problem can be described by a number of mutually interacting spins, which natively map to the qubits of a quantum com-puter, thereby circumventing the overheads in mapping the problem to qubits, such as in the case of fermions [9].
In general, fast and reliable quantum simulations of interacting spin systems would improve the interpretability of solid-state NMR and electron spin resonance (ESR) spectra, where advanced numerical techniques present very limited performance [10]. This shows the potential of quantum computers with a moderate number of qubits to shed light on the dynamics of these important systems. A Co-Design quantum computer that minimizes algorithm implementation overheads could be the first method to access these simulations. Note that, other NMR problems, such as zero-field NMR [11] and Hamiltonian learning [12], have already attracted research on how quantum computers can be used to tackle them and methods based on Bayesian computation [13] and generative models [14] have been developed for computing NMR spectra as well.
NMR techniques have a profound impact in research areas such as material science, chemistry, biology, and medicine [15]. Recently they have approached the nanoscale through solid-state quantum sensors such as the nitrogen vacancy (NV) center in diamond [16]. This is a particularly powerful quantum device, as it enables detection and control of nearby nuclear spins with nanoscale resolution [17]. Applications of the device are, e.g., the precise determination of the structure and dynamics of nuclear ensembles such as proteins [18], finding inter-label distances (via, e.g., Bayesian analysis of the NV response) in electronically labelled biomolecules [19], and the exploration of bespoke microwave (MW) sequences that efficiently transfer NV center polarization to the nuclear environment. Hyperpolarization (i.e. polarization beyond that of a thermal state in a magnetic field) of nuclear spins in diamond presents the potential to develop new and safer contrast agents for magnetic resonance imaging. This problem, which we aim to address through simulation by a quantum computer, could lead to improved detection of different malformations in tissues -such as heart or brain-without the need to deliver ionizing radiation, in contrast to other techniques [20].
This manuscript describes a Co-Design process for a quantum chip able to efficiently simulate nanoscale NMR scenarios. It is structured in three main parts, each of which is a crucial step in the Co-Design process: 1. Identifying the problem (Sec. II), which here is simulating a nanoscale NMR system for hyperpolarizing nuclear spins. 2. Choosing an algorithm for the nanoscale NMR problem and showing that a star-topology chip implements it with minimal overhead (Sec. III), and 3. Co-Designing the corresponding quantum chip using a central resonator bus (Sec. IV). The sections are followed by results and discussions (Sec. V) and an outlook (Sec. VI).

II. NANOSCALE NMR: HYPERPOLARIZATION
Let us consider a system consisting of M nitrogenvacancy (NV) centers and N carbon-13 isotopes in the presence of a driving field and an external magnetic field B Z . NV centers and nuclei are all effectively described as spin-1/2 systems. The representation of such a system for M = 1, N = 2 is shown in Fig. 1. For simplicity, we consider the NV centers aligned with the external magnetic field, leading to the following Hamiltonian: In Eq. (1) we find the spin operators in the joint Hilbert space C 2 (M +N ) of NV centers and nuclei: where (σ µ ) 2×2 , µ ∈ {x, y, z} is the corresponding 2 × 2 Pauli matrix on the j th NV center and the k th nucleus respectively, and 1 is the 2×2 identity matrix. Accordingly, σ ± j = σ x j ±iσ y j 2 I ± k = I x k ± iI y k are the j th NV center (k th nucleus) ladder operators. The term δ j is the detuning of the j th NV center with respect to the microwave drive H dr . The hyperfine coupling vector A jk represents the coupling between the j th NV center and the k th nucleus, A jk is the modified Larmor frequency of the k th nucleus with the 13 C gyromagnetic ratio γ c ≈ (2π) × 10.7 MHz/T, g k k is the coupling between the k th and k th nuclei, and h j j is the coupling between the j th and j th NV centers.
Note that, Eq. (1) is expressed in a rotating frame with respect to the free NV Hamiltonian, while H dr represents an external driving tuned near resonance with a certain NV energy transition. The derivation of Eq. (1) can be found in Appendix A.
In order to hyperpolarize a diamond sample at room temperature, the NV centers are first optically polarized employing laser light, and then their state is transferred to the surrounding nuclei with the aid of a tailored microwave radiation scheme. The initial state of the nuclei in a room-temperature sample is well described by a fully-mixed state due to the small energy splitting of the nuclear spins. By re-initializing the NV centers and repeating this procedure, the polarization transferred into the sample can be amplified. In this paper we will consider the quantum simulation of the polarization transfer mechanism and study two different driving schemes acting on the NV centers in a room-temperature diamond.
The first driving scheme is a continuous driving whose Hamiltonian in the rotating frame mentioned earlier is H dr = Ω 2 σ φ , where σ φ = e −iφ |1 0| + e iφ |0 1| = e −iφ σ − + e iφ σ + , φ a phase, Ω the Rabi frequency and the kets |1 and |0 are the eigenvectors of the operator σ z with eigenvalues ±1 respectively. The set {|0 , |1 } is called the computational basis of the state space of a two level system, and will be our standard choice for a basis, |0 ≡ (1, 0) t and |1 ≡ (0, 1) t . NV-nucleus polarization transfer is achieved when the Rabi frequency matches the modified nuclear Larmor frequency (i.e. when Ω = | ω c |), leading to the Hartmann-Hahn double-resonance condition [21]. For a single NV center and nucleus, the Hamil-tonian in Eq. (1) reduces, in an interaction picture, to where |± = |0 ± |1 , which shows a polarization transfer mechanism with the effective transfer rate A ⊥ 4 (a detailed derivation can be found in Appendix B).
The second type of driving we consider is a pulseddriving scheme, H dr = Ω(t) 2 σ φ , where Ω(t) is a train of π-pulses, such as the Carr-Purcell-Meiboom-Gill sequence [22,23] or the XY8 sequence [24,25]. We consider pulses with a negligible width compared to the time spacing τ between the π-pulses. If τ is selected such that τ = nπ | ω c | (n being an arbitrary integer number) and the pulses are evenly spaced one finds that, in an interaction picture, for a single nucleus and NV center, the Hamiltonian reduces to H I = αA ⊥ σ z I x , where α is a factor that depends on the integer n (see Appendix B). A phase imprinted on the pulse sequence through a time delay turns the interaction into H I = αA ⊥ σ z I y . By combining both sequences with the appropriate rotations over the NV center, the polarization transfer interaction is achieved (see Appendix B and Ref. [26] for more details).
Regarding common error sources, NV centers located at different positions in the diamond lattice experience stress conditions that lead to local energy deviations from the zero-field splitting. The corresponding term in Eq. (1) is the detuning δ j . Another common type of imperfection appears due to unavoidable fluctuations of the Rabi frequency of the driving. This fluctuation can be modelled as an Ornstein-Uhlenbeck (OU) process [27], which has been shown to be an accurate description for NV centers [28]. It is a Gaussian process of the following form [29]: where ∆t is the time step, τ the correlation time, c the diffusion constant of the process and N (t) a temporally uncorrelated normally distributed random variable. It is a dimensionless term, which yields an effective Rabi frequency of (1 + X) Ω. Neither of the system error types lead to considerable overheads in a simulation on a quantum computer. Finally, 13 C nuclear spin decay is not a relevant error source on the time scale of the protocol, since it is of the order of seconds [30], while the hyperpolarization process operates in the order of microseconds.

III. CO-DESIGN ALGORITHM
In this section we provide an in-depth description of our Co-Design algorithm, starting with the choice of a simulation technique, followed by a short listing of hardware assumptions related to the allowed qubit operations (gates and resets), as well as the noise and errors present in the physical NMR system and in the quantum com-puter. Subsequently, the algorithm components are introduced. We end the section with a discussion on layout and gate-level optimization. The high-level structure of the simulation protocol is shown in Fig. 2a.

A. Simulation technique
The best established digital quantum simulation technique is based on decomposing the time-evolution operator into single-qubit and two-qubit gates through the Lie-Trotter-Suzuki formula [31], known as Trotterization. To simulate our problem on a quantum computer, we base our strategy on regular Trotterization [2] but we also explore the randomized Trotterization method qDRIFT [32] in Appendix C. Other, more NISQ-specific, simulation techniques such as the variational quantum simulator [33], the quantum assisted simulator [34], numerical quantum circuit synthesis [35], and a plethora of other quantum algorithms [4] can also be used as simulation methods.
One advantage of Trotterization over some of these NISQ methods is that it closely follows the real time evolution for each time step. This is particularly important for pulsed-driving schemes, where the free evolution in between different pulses always starts with a different initial state. Variational and quantum assisted methods would then require that each interpulse evolution is solved independently, making them impractical for the problem.
A second advantage of Trotterization is that its complexity and precision are straightforward to analyze. The Trotterization procedure can also be expanded to higher orders, and symmetrized expansions converge more rapidly and reduce the error with respect to the continuum time limit [36].

Native gates
The hardware for the quantum simulation plays a major role in choosing the optimal quantum algorithm and its specific implementation. In our case, we consider a quantum computer based on superconducting qubits with the following native single-qubit gate set: where X, Y , and Z are Pauli operators on the superconducting transmon qubits. The R xy (φ, θ) can physically be implemented through a microwave drive [37]. The gate R z (θ) on the other hand does not need to be implemented directly, but can be performed virtually by tuning the phase of the subsequent gates applied on the qubit [38]. This reduces the number of single-qubit gates (SQGs) that need to be implemented. The native two-qubit gate (TQG) that arises from the superconducting system Hamiltonian shown in Sec. IV and Appendix G, is a continuously-parameterized controlled-Z (CZ) interaction [39], which can be transformed through local virtual R z -rotations into the form of a ZZ-interaction: Even though the ZZ-interaction and the controlled-Z interactions appear different, their physical implementation is identical since they are related through virtual R z -rotations which come at no additional cost. Sec. IV goes into more depth on the two-qubit-gate implementation on our Co-Design quantum chip.

Qubit reset
In the hyperpolarization process the state of the NV needs to be re-initialized after each cycle. It is therefore necessary to be able to reset the state of the qubit representing the NV center in the quantum computer. A qubit reset operation can be defined by two Kraus operators: On superconducting hardware this can be realized through connecting a quantum circuit refrigerator (QCR) to each circuit element that needs to be reset [40][41][42][43]. Different reset schemes are discussed in Sec. IV B.

Noise and errors
In this paper we show that the simulation can tolerate the noise of the quantum processing unit (QPU), and that the simulation does not require large overheads to implement imperfections present in the nanoscale-NMR system, as discussed in Sec. II. We will refer by system imperfections to effects in the nanoscale NMR system only, while the QPU is affected by noise, referring to the effect of the environment on the qubits, and errors, referring to inaccuracies of gates.
In our simulation of the algorithm, we use the most common noise models for superconducting transmon qubits [37], namely an amplitude damping channel modelled by the Kraus operators: with p(t) = 1 − exp (−t/T 1 ) and T 1 = 60 µs, and a pure dephasing channel represented by the Kraus operators: with p(t) = 1 − exp(−Γ(t)) and Γ(t) given by the expres- where β is the inverse temperature of the environment. We chose the spectral function I(ω) to be of the type 1/f [37], and T 2 = 60 µs. Additionally, each gate operation is assumed to be calibrated up to a two-qubit-gate (TQG) error ε TQG ∈ [10 −4 , 10 −2 ], with the induced effective noise modelled by a depolarizing channel defined for singlequbit gates by the Kraus operators: and for two-qubit gates by an analogous expression with the tensor products of two Pauli matrices and the coefficients √ 1 − p for the identity and p/15 for the other operators.
Single-qubit-gate (SQG) errors ε SQG are assumed to be one order of magnitude lower than TQG errors.

C. Algorithm components
Our simulation of the nanoscale NMR problem follows the general structure shown in Fig. 2a. It starts by initializing the states of all qubits, according to whether they represent a nucleus or a NV center, then evolving them using Trotter steps, followed by reset and re-initialization of the qubits representing NV centers. The cycle of time evolution and re-initialization is then repeated as many times as the protocol calls for. Finally the qubits are measured, and the polarization of the NV centers and nuclei are extracted as the expectation values of the qubit representing each element. Fig. 2a shows the circuits for the case of continuous driving, while the details of pulsed driving schemes are shown in Fig. 10a in Appendix B. In the following, we go through these steps in more detail for the case of a single NV center.

Initial state preparation
To enable the polarization transfer, it is necessary to prepare the NV center in a specific initial state that depends on the driving scheme. For the continuous-driving scheme it is the |+ or |− state, and for the pulseddriving scheme it is one of the two computational basis states, |0 or |1 . Initial state preparation Simulation protocol Measurement parameters are the various coupling strengths of the simulated system, and the X rnd gates refer to X-gates applied with a 50% probability to prepare an effective fully-mixed state. The initial state preparation can also be performed using the alternative random-phase approximation-inspired method. NV init is an initial-state preparation using single-qubit gates to the state required by the driving scheme. Details of the circuit components can be found in Appendix D along with a figure representing the pulsed driving case.
For a diamond at room temperature, the initial state of the nuclear spins is well described by a fully-mixed state ρ mixed = 1 ⊗N 2 N , where 1 ⊗N is the 2 N ×2 N identity matrix. The state can be approximated by running the algorithm several times, each time with a different initial state obtained by applying X gates randomly on the qubits representing nuclei. A faster alternative to this sampling is the random-phase-approximation-inspired method, described in [44], and introduced into quantum computing in [45]. In this method, the qubits are all prepared in an equal superposition by applying Hadamard gates, and then the phases are randomized through the application of random phase gates. The method effectively reduces the prefactor in the scaling of the sampling error [45].

Time evolution
We choose to implement the time evolution generated by the Hamiltonian in Eq. (1) through Trotterization. For that, the Hamiltonian is rewritten in terms of qubit Pauli operators and arranged into non-commuting terms for an optimal Trotter splitting. The resulting circuit, which performs one Trotter step of the evolution in the continuous driving case, is depicted in Fig. 2b. It consists of a set of initial single-qubit gates, including the ones corresponding to the driving and the detuning of the NV center, followed by three two-qubit gates per nucleus. There are three types of interaction terms, of the form XZ, Y Z and ZZ, when no internuclear interactions are considered. With interactions there are a total of five interaction terms. Our native gate set only includes one type of two-qubit interaction as explained in section III B 1.
Therefore, some SQGs need to be applied in order to convert the interaction terms into the right form, as discussed in Appendix D.
Under specific circumstances, some TQGs can be removed by rotating the Hamiltonian into a more suitable basis as shown in Appendix E.
The dynamics of the system is known to produce an exchange of polarization between the NV center and the nuclei. This exchange is oscillatory, and therefore choosing a proper stopping time is important in order to achieve an effective polarization transfer from the NV center to the nuclei. In practice, a sub-optimal transfer time can suffice, and the protocol is then repeated several times by resetting the NV center to its initial state and let-ting the system evolve under the drive again. Due to the re-initializations the full evolution of the system is nonunitary and a net gain of polarization of the system is enabled.
This structure is represented in the quantum circuit in Fig. 2a by the repeated Trotter evolution, followed by reset operations on the qubit representing the NV center, and a single-qubit gate to prepare the initial state of the driving protocol.

D. Layout optimization
When implementing a quantum algorithm on a superconducting QPU, the planar qubit connectivity forces us to solve the qubit-routing problem by introducing additional SWAP gates to connect distant qubits. In this subsection, we study the advantages of an optimized chip topology, a star topology, over a square-grid array of qubits in terms of reducing the number of SWAP gates that must be inserted to run the algorithm in Fig. 2 on the device. Different topologies will imply different counts of SWAPs added on top of the gates arising from the algorithm itself, as shown in Fig. 3. On a NISQ device, this implies different computational precision for the same gate error magnitudes. We choose the SWAP count as our metric to compare different topologies, as commonly gates have fidelities limited by calibration. The errors could be due to crosstalk, leakage, or filtering causing disturbances to the control signals. Under this scenario we want to minimize the gate count. On the other hand, for a highly tuned up device whose gates are limited by qubit coherence times, it would be optimal to minimize the circuit depth instead of the TQG count.
Assuming the gate errors are independent, the total error will be bounded by: where N TQG is the number of two-qubit gates, N SQG the number of single-qubit gates, and ε SQG is the SQG error. Consequently, reducing the gate count, especially N TQG , has an exponential effect on the precision of the computations, underlining the effect of minimizing the SWAP gate overhead. As SWAP gates are not native to the hardware, but must be compiled out of three CZ gates, their effective error rate is also much higher than those of native gates.

Square grid
A common choice in superconducting quantum chips is the square grid of qubits. It has high connectivity and is suitable for performing the surface code error correction when scaled to large enough qubit counts with fast measurement and feedback [46]. The qubit routing problem on a square grid can be tackled using various numerical approaches [47][48][49][50]. However, these methods are inefficient. In our case, a tailored SWAP routing method, shown in Fig. 3a, has been chosen and developed in Appendix F that can be shown to be well suited from two perspectives. First, a comparison against the cited numerical approaches (shown in Appendix F) reveals that our routing method is better in terms of number of gates. Second, it is completely deterministic and does not rely on expensive numerical optimization methods. It can also be shown not to be far from optimal: on a square grid each qubit has at most 4 nearest neighbors, implying that any SWAP operation provides at most 3 new neighbors. For an all-to-all (ATA) interacting Hamiltonian there are n 2 2 interactions, to leading order, for a simulation performed on n qubits (corresponding to N nuclei and one NV center). This implies a lower bound of at least n 2 6 SWAPs for any SWAP pattern on the square grid topology. Our SWAP pattern with n 2 2 SWAPs, discussed in Appendix F, is thus not far from optimal.

Star architecture
A star topology allows to implement the simulation of the simplified case without internuclear interactions directly, without any SWAP gates. With internuclear interactions considered, we still find a reduction in SWAP gates as compared to the square grid topology, as shown in Fig. 3b. This reduction comes from the SWAP routing we implement, that consists of making the qubit 0 in Fig. 3b interact with all the external qubits and then swap its state with that of qubit 1 and repeat this process until all interactions have been performed. This allows us to use only n − 1 SWAP gates. The percentage of SWAP gates that can be saved can be observed in Fig. 4.
However, this improvement in the number of gates comes with a price to pay in the depth of the algorithm. We can only do one TQG at a time in the star chip and we have 3 2 n(n − 1) TQGs from simulating the physical interactions and 3(n − 2) TQGs from the SWAPs. This yields a depth for the TQGs of 3 2 n 2 + 3 2 n − 6 in a star chip, while for a square grid it is 6n. Such depth increase comes from the reduction in parallelization, since all gates now act via the central qubit. On the other hand, less parallelization reduces the types of possible crosstalk errors. Adding connections between external qubits reduces the depth of the circuit, since the main cause of circuit depth is the fact that the interaction of two external qubits needs to be done exclusively by the central qubit. Further studies are required to see if the addition of more external layers to this topology (such as in a spiderweb) can lead to better compromises between depth and gate count, especially for simulating systems with clusters of strongly interacting nuclei.

E. Gate-level optimization
The two-qubit interactions that appear in the algorithm are the XZ, Y Z and ZZ interactions, as shown in section III C 2 and Fig. 2b. When compiling the algorithm into the native gates of the device, all these interactions must be implemented in terms of some available gate set. We study in Table I the overhead introduced by decomposing these interactions into different examples of native TQGs of superconducting devices; namely, the parametrizable and fixed-phase U ZZ gate, the fixedphase controlled-Z gate CZ, and the CNOT gate. The CNOT gate is usually performed by making use of the cross-resonance gate [37,51], which introduces an U XZ interaction, making it equivalent to the U ZZ for the purpose of this algorithm. We assume that the SQGs that  can be implemented are the R xy and the R z gates. These numbers can be further reduced if the first and last SQGs introduced by this compilation are combined with the adjacent SQGs in the algorithm. The conclusion is that fixed-angle gates will double the number of TQGs that need to be physically performed. In Ref. [52], the improvements coming from the reduction of the gate count are compared to the new errors introduced by the interpolation of the calibrated phases. For two instances of a Quantum Approximate Optimization Algorithm (QAOA) [53], it is shown that the performance is better when using parametrized TQGs.
The gate sequences for some of the gate decompositions are shown in Fig. 5

IV. CO-DESIGN HARDWARE
A star-architecture chip has fundamental scaling issues using a transmon as the central qubit as the number of neighbors grows. Every neighbor added to the center qubit would decrease its charging energy E c . To keep the qubit frequency constant and anharmonicity in the transmon regime, the ratio of the qubit's Josephson energy to its charging energy, E j /E c , must remain unaffected. Therefore we cannot afford to change its charging energy. This leads to a trade-off between the number of coupled qubits and their coupling strength to the central element.
The spirit of Co-Design calls for replacing the central transmon with another object that enables this scaling in size. A resonator has no Josephson energy E j , so the E j /E c ratio is not altered by adding more capacitive couplings to the resonator. Only small corrections to its frequency are introduced by adding coupled qubits. As a distributed element, a co-planar waveguide resonator also has physically more space for couplings than a central transmon qubit. By elongating the resonator and choosing the mode with the target frequency, the number of qubits coupled to it can further be increased. These properties make a resonator a favourable component in the center of the chip.
In the device in Fig. 6a the qubits are capacitively coupled to the resonator via tunable couplers [39,54,55] in the proximity of a voltage maximum of a standing wave in the resonator. As the resonator is elongated, we must use higher harmonic excitations of the resonator to keep the frequency around the operational frequency of the qubits. Tunable couplers avoid the frequency crowding issues related to direct coupling [56,57], and the linear resonator has higher connectivity in the center than ring resonator structures with quasi-all-to-all connectivities [58].
A linear resonator cannot in general be used as a qubit, since a microwave drive on it will not only populate the {|0 , |1 } subspace, but also higher excited states. However, the effective interactions mediated via the tuneable coupler in Fig. 6a are of the type a † aZ and (a + a † )X + (a − a † )Y where a and a † [37] are the resonator creation and annihilation operators. These types of interactions conserve excitation number, so when at most one excitation is in the qubit-resonator system, the resonator cannot be populated beyond its first excited state through interaction with a qubit mediated a tuneable coupler.
CZ and iSWAP gates between the resonator and a qubit can be performed using the two interactions, and the theory is developed more fully in Sec. IV A. Then, a resonator together with an external qubit can be used as an effective central qubit in the following way: The theoretically most straightforward protocol would be to perform a SWAP gate from the qubit to the resonator. The iSWAP, on the other hand, is a native gate that can directly be implemented on the hardware in Fig. 6b. The iSWAP gate between the resonator and the qubit is represented by the unitary operator: Since the CZ gates performing the computation following the iSWAP are diagonal in the computational basis, the phase introduced by the iSWAP is uninvolved in the gate. This enables substituting the SWAP gate by an iSWAP gate in the protocol to further minimize the gate count.

A. Gate theory and simulations
Here we demonstrate that in our star architecture CZ and iSWAP-type gates between any of the qubits and the {|0 , |1 } subspace of a chosen resonator mode can be implemented. The operational principles of these gates are very similar to those between two qubits coupled with a tunable coupler [39,54,55,59]. The main limitation of our architecture (where one transmon is replaced by a resonator) is that iSWAP operations can only be performed in the zero-and single-excitation subspace of the two-qubit computational basis.

Conditional-Z gate
The CZ operation between the resonator and the qubit is described by the unitary operator: This gate is equivalent to the U ZZ (φ) gate in Eq. 5 up to two R z -rotations. To operate a CZ gate, we initialize the resonator-coupler-qubit set up shown in Fig. 6b at the idling configuration with zero effecting coupling between the qubit and resonator. Note that the coupler is also a transmon that shows a higher sensitivity to the magnetic flux than regular qubits. We next apply a flux pulse that lowers the coupler frequency, turning on the effective coupling between the resonator and the qubit. Depending on the flux pulse shape, the state collects conditional phase φ and possibly experiences population oscillations between computational and non-computational states, as a function of the time spent at the gate-operation frequency. We optimize the pulse amplitude and duration such that after the flux pulse the CZ gate fidelity is maximized. Details of the gate theory can be found in Appendix G and the considered device parameters in Table II. In Fig. 7a, we operate our CZ gate by tuning the coupler frequency using a flattop Gaussian shaped flux pulse. The width of our Gaussian filter was fixed at 3 ns. Applying such a flux pulse to coupler results in a coupler frequency shift by ω shift c from the idling configuration. Then by appropriately tuning ω shift c and the gate time τ , one locates the optimal pulse configuration that minimizes the CZ(π) gate error ε CZ = 1 − tr where σ is the target density matrix obtained after propagating some initial state |Ψ Ψ| with the ideal unitary of Eq. 12 and ρ the final density matrix obtained after propagating |Ψ Ψ| with the Lindbladian corresponding to our system defined in Eq. (G1). For our device parameters, the maximal decoherence limited CZ gate error averaged over a number of random initial states is 1.6 × 10 −3 . Note that the system parameters in Table II were chosen such that they allow for the possibility to find a good idling configuration, where the residual CZ interaction vanishes before the gate operation. In our simulations, we have included environmental noise, such as amplitude damping and pure dephasing and treated them using a Lindblad master equation solver in QuTiP [60,61].

iSWAP gate
Just as the CZ gate, the iSWAP gate can be natively realized in superconducting quantum computing architecture [37]. With our device, we can perform highfidelity iSWAP gates between zero-and single-excitation computational states. The two-photon state |1 r ⊗ |1 , where |1 r denotes the first excited state of the resonator, must be excluded because it resonantly interacts with the state |2 r ⊗ |0 inducing a population exchange between  Table II. the states. Hence the resulting operation in this subspace does not match the action of the targeted iSWAP operation. The capacitive coupling between the elements of the electrical circuit shown in Fig. 6b gives rise to an effective XY -interaction between the qubit and resonator under the rotating wave approximation. Such an interaction conserves excitation number. With only the qubit or resonator (or neither) initially populated, we stay within the single excitation subspace of the joint system, thereby minimizing leakage of quantum population into the higher excited states of the resonator. The XYinteraction can be turned on by first tuning the qubit in resonance with the resonator, and then applying a fluxpulse to the coupler to turn on the coupling, similar to the CZ gate operation. Fig. 7b shows iSWAP gate error landscape for the same device parameters (given in Table II). The optimal average iSWAP gate error ε iSWAP obtained for our device is 1.7 × 10 −3 . This result is obtained by averaging over a number of random initial states within the zero-and one-excitation manifolds.
The results of our two-qubit-gate simulations demonstrate that our star architecture supports operating gates with similar fidelities as regular transmon qubits coupled together. The increased local connectivity of the device reduces the need for SWAP gates to simulate the nanoscale NMR problem (and others with a similar structure) and consequently in the end improves simulation fidelities.

B. Reset
The hyperpolarization protocol described in Sec. II needs regular re-initializations of the state of the NV center. The Co-Design hardware for simulating the protocol must therefore support this operation within qubit lifetimes. This is a hardware challenge, but one with solutions in sight. In particular, the quantum circuit refrigerator (QCR) has been used to perform the reset in tens of nanoseconds [40][41][42][43], which is a similar timescale to gate operations. The advantage of using a QCR for the reset is the possibility to reset the central resonator directly, without the need transfer the resonator population back to the central qubit using an iSWAP gate. Alternatively, a fast reset is possible through applying a flux drive to a qubit to SWAP its state with its measurement line [62]. This scheme has the advantage of not requiring any additional hardware not already present on the chip, but comes with a small cost in the circuit depth, as the state of the resonator must be transported using an iSWAP gate into the designated central qubit and be re-initialized there. The reset timescale is also somewhat longer than when using a QCR.

V. RESULTS AND DISCUSSION
In this section we discuss the two main results of the paper: namely the predicted performance of our proposed quantum algorithm on a regular noisy QPU, as well as the performance increase obtained with our proposed Co-Design QPU. To this aim, we will focus on the polarizations of the NV center and nuclear spins, that are relevant quantities of the problem and straightforward to measure in a quantum computer.
In Fig. 8 we compare the frequency response of the polarization transfer process on two different simulated devices: a QPU with realistic noise parameters, and an ideal noiseless QPU. We consider one NV center, two interacting nuclei and different driving frequencies for both continuous and pulsed driving schemes. In the simulation we ignore errors in the preparation of the fullymixed state of the qubits representing the nuclei. The blue curves show the remaining polarization in the NV center after one cycle of initialization and time evolution, while the red and the green curves correspond to the nuclear polarizations at the end of the cycle. For each nucleus there appears a resonance frequency in the system, for which the polarization transfer is optimal for said nucleus, depicted in the figure by the peaks of the curves.
Both simulations include the effects of the most common imperfections in nanoscale NMR systems, i.e. energy detunings and Rabi frequency fluctuations discussed in Sec. II. The simulation of the quantum algorithm additionally includes noise and gate errors present in the QPU. It is notable that the noise affects the height and shape of the peaks more than their location.
The system imperfections include a detuning of 120 kHz of the NV center from the zero-field splitting that shifts the peaks in Fig. 8 (left) to frequencies lower than their predicted Larmor frequencies (dotted vertical black lines). Fig. 8 (right) shows how the pulsed-driving scheme XY8 [24,25] acts as a robust dynamical decoupling sequence, eliminating such frequency shifts both in the ideal and noisy simulations.
Regarding the QPU noise and errors, the amplitude damping channel causes an overall shift down of all polarizations at all driving frequencies. Dephasing noise and gate errors (as modelled by depolarizing noise) cause the curves in Fig. 8 to flatten and lose contrast. While we have discussed how the product of gate errors is minimized by reducing the SWAP overhead through Co-Design hardware, the loss of contrast can also be addressed through error mitigation techniques such as zero-noise extrapolation [63][64][65]. Dephasing can also be reduced through dynamical decoupling techniques [37], thus extending the system coherence and increasing the effective T 2 time. The simulations presented in Fig. 8 include the decoherence times and gate fidelities that can be achieved with the hardware in Sec. IV. This implies an overestimation of the actual errors in the simulation, since the gate fidelities already include some decoherence.
To quantify the advantage of our Co-Design processor, Fig. 9 shows how the reduction in TQGs improves our ability to extract relevant information from the simulation. The figure compares the star-architecture chip to qubits connected on a square grid simulating a six qubit system with one NV center and five non-interacting nuclei. On the two chips we use SWAP patterns according to the schemes discussed in Sec. III D.
First, Fig. 9a shows the average height-to-width ratioξ of the nuclear polarization peaks obtained with star and square grid topologies with respect to an ideal error-free simulation. It serves as an indicator of how much the QPU noise degrades the simulation for each case. The ratioξ is computed by fitting a Gaussian function on each peak, and computing: where h is the height and σ the variance of the fitted Gaussian function, averaged over the five nuclei. FIG. 9. Performance gain from Co-Design: a comparison between a Co-Design star-architecture against a square grid, taking as reference an ideal simulation without QPU noise. The comparison highlights the negative effect that the SWAP gates on the square grid have on extracting relevant information from the simulation. We consider two quantities: in subplot (a) the ratioξ between the height and the width of the polarization peaks, and in subplot (b) the estimation error of the polarization peak center∆ peak , where the bars denote an average over five nuclei for each noise level. The simulations were performed with the same parameters as in Fig. 8, except for the number of qubits, which has been increased from 2 to 5.
The curves for both topologies must coincide atξ = 0 for a maximal-error device, and atξ =ξ ideal for an errorfree quantum computer, since for a maximal-error device the output is pure noise and for an error-free quantum computer the number of SWAPs is irrelevant to the precision. For NISQ devices in between these limits, a performance difference between the architectures is observed. For systems with more nuclei and NV centers, the differences between topologies start to appear at lower errors, since the number of total operations grows. This shows how the QPU topology is of great importance for the computational precision of NISQ devices, while for faulttolerant quantum computers the precision is unaffected by the topology.
Second, Fig. 9b shows the average relative error in the central frequency of the NMR peaks: where ω noisy and ω ideal are the peak-center frequencies extracted from the Gaussian fittings for the noisy and ideal cases, respectively. The peak centers correspond to driving frequencies that efficiently transfer polarization to different parts of the diamond lattice.
With the quantum simulation we can individually identify the nuclear resonance peaks by directly measuring the polarization of each qubit. This could enable exploration of how the polarization diffuses in the lattice with single-nucleus precision. In contrast, in a standard nanoscale NMR experiment, one typically only has only access to the excitation loss of the NV ( and thus only to the average transmitted polarization). This demonstrates the advantage of simulating the system on a quantum computer, as a it provides access to the relevant microscopic details of the dynamics that are otherwise inaccessible.
The figures demonstrate that the Co-Design chip is able to detect the resonance frequencies and predict the peak heights better at all considered noise levels. The power of Co-Design is particularly evident in Fig. 9b, where the square grid is shown to require two orders of magnitude lower noise levels to reach the same accuracy as the Co-Design chip.

VI. CONCLUSIONS AND OUTLOOK
We have presented a quantum algorithm to simulate a nanoscale NMR problem, namely a hyperpolarization protocol. We have simulated the proposed quantum algorithm with typical noise processes of a NISQ superconducting quantum computer with state-of-the-art parameters. We find that, despite considering a noisy QPU, our protocol still allows to identify the positions of the nuclear resonances (corresponding to the maximal polarizations) in the frequency domain, as well as the behavior in the vicinity of such resonant frequencies, thus enabling the exploration of optimized protocols and driving parameters to hyperpolarize the nuclear ensemble.
Moreover, we have shown that a specific Co-Design architecture adapted to the problem provides an advantage over general-purpose designs in the NISQ era, thanks to the reduction in two-qubit-gate count. Consequently, the adapted design reduces the necessary gate fidelities to solve practical problems in nanoscale NMR. This application-specific QPU consists of a central resonator, representing an NV center, coupled to a number of qubits representing the nuclei. The design can be scaled to more NV centers and a potentially large number of qubits around them. This is an example of a shortcut to quantum advantage. Adapting more NISQ-friendly algorithm alternatives, such as those listed in [4], adapted to the problem and to the Co-Design hardware can provide further shortcuts.
Our work opens interesting directions for further investigation, since a quantum processor able to efficiently simulate nanoscale-NMR scenarios with a large number of nuclear spins would have a great impact on NMRbased applications. Fast and reliable quantum simulations of interacting spin systems would improve the interpretability of zero-and low-field NMR where spin-spin interactions become dominant [11], and nanoscale-NMR systems where a quantum sensor is strongly coupled via dipole-dipole interactions to nuclear or electron spin clusters. A possible application of the latter is the estimation of inter-label distances (via, e.g., Bayesian analysis of the NV center response) in electronically labelled biomolecules [19]. In this case, the numerical analysis of systems beyond two-electron spin labels in realistic conditions, including protein motion and decoherence channels, is already numerically challenging.
The Hamiltonian in Eq. (1) can be derived from first principles. Let us first assume a model including only two 13 C nuclei and one NV center ( Fig. 1) with dipole-dipole interactions. For simplicity we also consider the NVs to be aligned with the external magnetic field. In that case, the Hamiltonian of the system reads: where S j is the j-th spin component of the NV center, I j k the j-th spin component of nucleus k, D is the zero-field splitting of the NV center, γ e and γ c are the gyromagnetic factors of the NV center and the nuclei respectively, B z is the external magnetic field, which is aligned with the symmetry axis of the NV center r k is the relative position vector between the NV center and nucleus k and r 1,2 is the relative position vector between both nuclei.
We go into an interaction picture with respect to H 0 = DS 2 z − γ e B z S z . The NV-nuclei interaction term reads: where we split the expression in commuting and non-commuting operators. The non-commuting operators pick a fastrotating phase and can be neglected through the rotating-wave approximation. By performing an interaction-picture transformation with respect to H 0 = −γ c B z (I z 1 + I z 2 ) = ω (I z 1 + I z 2 ), the nucleus-nucleus interaction term reads: with I ± k = I x k ± iI y k . Applying again the rotating-wave approximation and undoing the interaction picture we finally arrive at:

Pulsed sequence
Now we consider the pulsed case, represented by the driving term H dr = Ω(t) 2 σ φ where Ω(t) is a train of π-pulses. The Hamiltonian is already expressed in the interaction picture from Eq. (B4). From there, we further move into a rotating frame with respect to the driving term. The corresponding unitary transformation is U 0 = (−iσ φ ) k for the time interval between pulses k and k + 1. This leads to: where F (t) is the so-called filter function, with value +1 when k is even, and −1 when k is odd, representing the sign of the operator σ z , flipped by the action of each pulse. It is necessary to apply two different patterns of pulses. The "symmetric case", meaning an evenly-distributed sequence of pulses for which the filter function is even and can be expanded in Fourier series of cosines as: with f n = 0 when n is even and f n = − 4 πn when n is odd, if the pulses are distributed such that the interpulse spacing is constant. We choose the resonance condition T = 2πn | ω c | , where n is the harmonic number. This is the same resonance condition that we introduced in section II, but here it is formulated with the period T that appears in the Fourier expansion, instead of with the interpulse spacing τ = T 2 from before. Going to an interaction picture with respect to − ω c · I and repeating the procedure we used above in the Hartmann-Hahn case, we get: where α = fn 4 . With the second pattern of pulses, called the "asymmetric case", we apply an oddly-distributed sequence of pulses for which the filter function is odd and can be expanded in a Fourier series of sines. Note that this sequence of pulses is identical to the even sequence but shifted by a π/2 phase. An analogous derivation gives: with β = gm 4 and g m coming from the Fourier expansion of sines, analogously to f n . Combining these two patterns one can generate an effective Hamiltonian of the form: which can be transformed with simple rotations on the qubit representing the NV into: and this is equivalent to an interaction-exchange flip-flop Hamiltonian, similar to the one for the continuous-driving case (B6). A more detailed description of this whole process, including the expressions of the Fourier coefficients f n and g m can be found in reference [18]. In order to visualize the structure of the pulsed-driving case, we have included in Fig. 10 the circuit implementing all these terms on a quantum chip for the case of one NV center and two nuclei.

Pulse
T.e. . The term T.e. stands for Trotterized evolution and represents half of the free evolution of the system in between pulses, and can be devided into one or more single Trotter steps. The asymmetric and symmetric sequences of pulses are the ones discussed in Appendix B 2. The schematic drawings below the circuit in the form of square waves depict the modulation of the filter function (eq. (B7)) under the two different pulse patterns. We choose the pulses to be either X or Y gates acting on the qubit representing the NV, following the pattern XYXYYXYX, which can be repeated N blocks times for a stronger signal amplification. However, in our simulations a single block was enough to see clear patterns of polarization transfer, such as the ones in the right plot of Fig. 8. (b) Corresponding gate sequence of one Trotter step, as in Fig. 2, but without the H dr inside, because in this case the driving is applied through the sequence of pulses.

Appendix C: Randomized Trotter techniques
As explained in section III, we chose Trotter expansion. Besides this, we can consider other simulation approaches such as the variational quantum simulator [33], the quantum assisted simulator [34], numerical quantum circuit synthesis [35], or a plethora of other quantum simulation algorithms aimed at NISQ devices [4].
In addition, other approaches like randomized Trotter have been recently shown to provide some advantage compared to standard Trotter expansions [66]. We also propose to use one randomized approach, qDRIFT [32], that consists of the following: instead of splitting the whole evolution operator e −it f j hj Hj into simpler terms as done in full Trotterization, the method applies a random selection of such terms to the quantum circuit. This random selection is based on the probability distribution given by the weight of each term h j H j . For a certain evolution time, this set of gates can approximate the whole evolution operator by statistically drifting the state of the circuit towards the deterministic final state.
The error bound for this method is given as [32]: where λ = j h j and N terms is the number of individual two-qubit evolution operators that are implemented. These evolution operators have the form e −iτ Hj , being τ a constant related to the relative weight hj λ that the term H j has in the Hamiltonian.
The advantage of qDRIFT compared to Trotterization is particularly apparent when dealing with Hamiltonians with a large number of terms with small coefficients, simulated for short times. While in the standard Trotter case, every term has to be simulated for each step no matter how small its effect is, in qDRIFT this is not required. A more thorough analysis of errors in qDRIFT and gate counts can be found in [67]. This method is particularly suitable to our problem, since the range of coefficients in the Hamiltonian of a real diamond is large due to the length scales involved.
In this case, with qDRIFT the terms with smaller coefficients do not add a significant amount of gates as they would in conventional Trotterization approaches.
We note that other adapted protocols such as SparSto [68] can further enhance the simulation of this type of systems. SparSto represents a compromise between Trotterization and qDRIFT, generally guaranteeing an equal or better performance than both of them. We will not go into detail on this method since Trotterization and qDRIFT are enough to illustrate the main ideas behind this work.

Appendix D: Hamiltonian decomposition for Trotterized time evolution
In order to simulate the dynamics generated by the Hamiltonian in Eq. (1) on a quantum computer using Trotterization, we first need to express it in a suitable way. To begin with, we split the Hamiltonian into two parts: which can be expressed in terms of qubit Pauli operators: Since in the rotating frame with the drive the Hamiltonian is time independent, the time-evolution operator is simply given by: where t f is the time for which the simulation runs. The time-evolution operator is split into s discrete steps through Trotter decomposition: The evolution operator associated with single-qubit gates in each Trotter step of equation (D5) needs to be rewritten in terms of our native gate set. It is always possible to decompose any single-qubit unitary exactly, up to a global phase, into a sequence of three single-qubit rotations such as, for example, a rotation about the y-axis in between two rotations about the z-axis: where the angles β, γ, and δ need to be determined from the specific entries of the unitary in question to simulate the evolution of the p th qubit: From now on, we will concentrate on the case of a single NV center, which will be encoded in qubit 0. Then, the evolution operator associated to single-qubit gates for the NV center will be: Matching the entries of the matrices corresponding to the unitaries on equations (D7) and (D8) we get a system of equations for the angles β, γ, and δ for each Trotter step s.
There are 3 (5) types of interaction terms of the form XZ, Y Z, ZZ, · · · in H TQG without (with) internuclear interactions. Due to the native TQG being of only ZZ interaction type (see Eq. (5)), local rotations need to be introduced for simulating the rest of the TQG terms. These are R σi→σj k , which have the effect of converting the Pauli operator σ i into the Pauli operator σ j for qubit k.
After the Trotterization introduced in equation (D5), the term H TQG corresponding to TQG contains some elements which do not commute with each other, and some of them which do commute with each other. We choose to split all terms in order to express the time-evolution operator in terms of the native gates that we assumed in section III B 1. Only the elements that do not commute with each other contribute to the total Trotter error, which remains of the same order: (D9) Finally, we observe that the operators Z k Z 0 (and the rest of the TQG terms) commute with each other, so the exponentials can be further split without Trotterizing: The time-evolution operator implementing the continuous sinusoidal driving σ φ is: The quantum algorithm for simulating the system under a pulsed-driving scheme is somewhat more involved than the continuous-driving case, due to the two different time-dependent processes involved in the Trotter decomposition: the free dynamics of the spins and the sequence of pulses. The most crucial point to be aware of is the interplay between Trotter steps and interpulse spacing. The number of interpulse evolutions, i.e. number of pulses minus one, bounds from below the minimum number of Trotter steps for the simulation. Clearly, at least one Trotter step is needed for each interpulse evolution.
Taking this interplay into account, the most straightforward setup is to choose a frequency which will determine the spacing of the pulse sequence, and to identify each interpulse evolution with a single Trotter step. If the achieved precision is not high enough, more Trotter steps can be added for each interpulse evolution. Each π-pulse itself is simply implemented as an X-or Y -gate on the qubit representing the NV center. The OU-distributed Rabi frequency fluctuations present in nanoscale NMR systems are then simulated by over-and under-rotations of the X-and Y -gates.
can consider the effect of this rotation on only one qubit representing an arbitrary nucleus. We will exemplify this procedure using nucleus 1. If we want to obtain the mean value of σ z acting on the nucleus: where U (0, t f ) represents the evolution operator from t = 0 to t = t f . The density matrix ρ(0) contains the state of the NV center (which is in the |+ state at t = 0) and nucleus 1, i.e. ρ(0) = |+ +| NV ⊗ 11 2 . Our intention is to obtain an expression of this mean value in terms of the rotated evolution operators and later, we will find the appropriate rotation to be perfomed. Then, taking into account that the trace is invariant under a rotation R = 1 NV ⊗ R 1 we get: This can be expressed as: The density matrix of the nucleus is the identity. Thus, any rotation on nuclei qubits leaves the density matrix unaffected, leading to: Then we need to rotate the system previous to the measurement. By using the invariance of the trace under cyclic permutations we get: which is equivalent to introducing a counter-rotation in the circuit before measurement. Now let us focus on the specific rotation we have to implement. Since the constants multiplying the Pauli matrices in the Hamiltonian are A1 2 and ω c 1 = A1 2 − γ c B z e z (for nucleus 1), we can rotate the basis to obtain a representation in which the vectors have only z-component for A 1 and thus, XZ and Y Z terms are removed. The vectors before and after the needed rotation can be seen in Fig. 11.
To compute the new vectors (and thus the new coefficients for the gates of our algorithm), we can use Rodrigues' rotation formula to rotate a vector v an angle θ around a unitary axisk: v rot = v cos θ + (k × v) sin θ +k(k · v)(1 − cos θ), being in our case, θ = arccos (A z 1 /| A 1 |) andk = (cos(φ), sin(φ), 0), with φ = − π 2 + φ xy = − π 2 + arctan (A y 1 /A x 1 ). For implementing the counter-rotation of this in the quantum circuit, we use: The pattern is seen in Fig. 3a denoted by the intense blue arrows. With internuclear interactions we need to perform a swap pattern that enables all-to-all interactions. The so-called odd-even mapping in Fig 3a is an efficient one [69] represented by green arrows in Fig. 3a. This consists of swapping first all the even qubits with their right neighbors and then swapping all the odd qubits with their right neighbors. This way, we will obtain all-to-all interactions with 1 2 (n − 1)(n − 2) SWAP gates and a total TQG depth of 6n. A summary of the TQG counts is shown in Table III.
To motivate the creation of a chip with a star topology and the use of an alternative linearized SWAP routing for a square grid instead of standard numerical approaches, a comparison between all the cases is provided in Fig. 12. A reduction in the number of SWAPs can be noticed for both the linear chain approach and the star-topology chip against standard numerical approaches for a square grid.