Generative machine learning with tensor networks: benchmarks on near-term quantum computers

Noisy, intermediate-scale quantum (NISQ) computing devices have become an industrial reality in the last few years, and cloud-based interfaces to these devices are enabling exploration of near-term quantum computing on a range of problems. As NISQ devices are too noisy for many of the algorithms with a known quantum advantage, discovering impactful applications for near-term devices is the subject of intense research interest. We explore quantum-assisted machine learning (QAML) on NISQ devices through the perspective of tensor networks (TNs), which offer a robust platform for designing resource-efficient and expressive machine learning models to be dispatched on quantum devices. In particular, we lay out a framework for designing and optimizing TN-based QAML models using classical techniques, and then compiling these models to be run on quantum hardware, with demonstrations for generative matrix product state (MPS) models. We put forth a generalized canonical form for MPS models that aids in compilation to quantum devices, and demonstrate greedy heuristics for compiling with a given topology and gate set that outperform known generic methods in terms of the number of entangling gates, e.g., CNOTs, in some cases by an order of magnitude. We present an exactly solvable benchmark problem for assessing the performance of MPS QAML models, and also present an application for the canonical MNIST handwritten digit dataset. The impacts of hardware topology and day-to-day experimental noise fluctuations on model performance are explored by analyzing both raw experimental counts and statistical divergences of inferred distributions. We also present parametric studies of depolarization and readout noise impacts on model performance using hardware simulators.


I. INTRODUCTION
In recent years, gate-based quantum computing has emerged as a relatively mature technology, with many platforms offering cloud-based interfaces to machines with a few to dozens of qubits [1][2][3][4][5], as well as classical emulators of quantum devices of this class [6]. Today's quantum computing resources remain a long way from the millions of qubits [7] required to perform canonical quantum computing tasks such as integer factorization with error correction [8,9], and present devices are either engineered with a specific demonstration goal or designed for general-purpose research-scale exploration [10]. With the advent of noisy, intermediate-scale quantum (NISQ) devices [11], whose hardware noise and limited qubit connectivity and gate sets pose challenges for demonstrating scalable universal quantum computation, we are faced with a different form of quantum application discovery, in which algorithms must be robust to noise, tolerant of limited qubit connectivity and gate sets, and highly resource-efficient.
Machine learning (ML) has been put forward as a possible application area for NISQ devices, with a range of recent proposals [12][13][14]. ML may prove promising for NISQ applications because well-performing ML algorithms feature robustness against noise, quantum circuits can be designed for ML applications that are highly qubit-efficient [15], and quantum models can be designed whose expressibility increases exponentially with qubit depth [15,16]. The most impactful near-term ML application likely lies in quantum-assisted machine learning (QAML), in which a quantum circuit's parameters are classically optimized based on measurement outcomes that may not be efficiently classically simulable [17]; this also includes kernel-based learning schemes with a quantum kernel [18]. Tensor networks (TNs) provide a robust means of designing such parameterized quantum circuits that are quantum-resource efficient and can be implemented and optimized on classical or quantum hardware. TN-based QAML algorithms hence leverage the significant research effort into optimization strategies for TNs [19][20][21], and also enable detailed benchmarking and design of QAML models classically, with a smooth transition to classically intractable models.
In this work, we explore the applicability of QAML with TN architectures on NISQ hardware and hardware simulators, examining the effects of present-day and near-term hardware noise, qubit connectivity, and restrictions on gate sets. We focus on fully generative unsupervised learning tasks, which have been identified as a promising avenue for QAML [13], and on the most resource-efficient matrix product state (MPS) TN topology. We present a framework for QAML, outlined in Fig. 1, that includes translation of classical data into quantum states, optimization of an MPS model using classical techniques, the conversion of this classically-trained model into a sequence of isometric operations to be performed on quantum resources, and the optimization and compilation of these isometric operations into native operations for a given hardware topology and allowed gate set. In particular, we develop several novel techniques for the compilation stage aimed at TN models for QAML on NISQ devices, such as the permutation of auxiliary quantum degrees of freedom in the TN to optimize mapping to hardware resources and heuristics for the translation of isometries into native operations using as few entangling operations (e.g., CNOTs) as possible.
The tools developed herein enable the robust design and performance assessment of QAML models on NISQ devices in the regime where classical simulations are still possible, and will inform architectures and noise levels for scaling to the classically intractable regime. Even in the classically intractable regime in which the model must be optimized using a quantum device in a hybrid quantum/classical loop [22,23], our techniques provide a means of obtaining an approximate, classically trained "preconditioner" for the quantum models that can help avoid local minima and reduce optimization time. We present exemplar results for our workflow for synthetic data that can be described by an exactly solvable two-qubit MPS QAML model, as well as on features extracted from the canonical MNIST handwritten digit dataset [24].
The remainder of this work is organized as follows: Sec. II discusses QAML with tensor networks (TNs) broadly, including embedding of classical data into quantum states, classical training of a TN model, and the conversion of TN models into resource-efficient sequential preparation schemes; Sec. III discusses our approach for compiling TN-based QAML models for running on quantum hardware, including the utilization of ambiguity in the TN representation and greedy compilation heuristics for minimizing model gate depth; Sec. IV presents an exactly solvable two-qubit QAML model and assesses the performance of our QAML workflow on quantum hardware; in Sec. V we give an example application to generative modeling of features extracted from the MNIST dataset and analyze the performance of our models as a function of hardware noise using a quantum hardware simulator; finally, in Sec. VI we conclude and give an outlook. Details of the MNIST model studied in Sec. V are given in Appendix A.

II. QUANTUM-ASSISTED MACHINE LEARNING WITH TENSOR NETWORKS
Fig. 1 broadly outlines the QAML workflow explored in the present work. We begin with a collection of classical data vectors in a training set T = {x_j}_{j=1}^{N_T}, where each element x_j is an N-length vector. The first step in our QAML workflow is to define a mapping of classical data vectors to vectors in a quantum Hilbert space. Here, the only restriction we will place on the encoding of classical data in quantum states is that each classical data vector is encoded in an unentangled product state. This is useful for several reasons. For one, unentangled states are the simplest to prepare experimentally with high fidelity, and they also enable us to use qubit-efficient sequential preparation schemes. From a learning perspective, encoding individual data vectors in product states ensures that any entanglement that results in a quantum model comes from correlations in an ensemble of data and not from a priori assumptions about pre-existing correlations for individual data vectors [25]. For encoding of an N-dimensional classical data vector x into an ensemble of N qubits, a convenient parameterization is

$|\Phi(\mathbf{x})\rangle = \bigotimes_{j=0}^{N-1} \sum_{i_j \in \{0,1\}} \phi_j^{i_j}(x_j) |i_j\rangle$,  (1)

that is, in terms of local maps φ_j(x) mapping a single data element into a superposition of qubit states. In order that the full map Φ(x) maps each data instance into a normalized vector in Hilbert space, we require that

$\sum_i |\phi_j^{i}(x)|^2 = 1$.  (2)

When encoding data for use in generative applications it is also useful for the maps to have the orthonormality property

$\int dx\, \phi^{i}(x)^* \phi^{i'}(x) = \delta_{i,i'}$,  (3)

which ensures that a wavefunction encoding the data remains normalized when integrated over the data domain. That is, maps satisfying Eq. (3) map the data into an orthonormal basis for the Hilbert space.
The simplest case occurs when the data is discrete, and so can be formulated as vectors x with x_j ∈ {0, 1}. We map each element to a qubit as [26,27]

$\phi^{i}(x) = \delta_{i,x}$.  (6)
This map clearly satisfies the properties Eqs. (2) and (3) above, and so is suitable for either generative or discriminative applications. In the case in which the data is continuous, x ∈ R^N, we have freedom to choose how to encode it in Hilbert space. The phase-like encoding

$\phi^0(x) = \cos\left(\frac{\pi}{2}\frac{x - x_{\min}}{x_{\max} - x_{\min}}\right), \quad \phi^1(x) = \sin\left(\frac{\pi}{2}\frac{x - x_{\min}}{x_{\max} - x_{\min}}\right)$,  (7)

has been used in Ref. [25] to encode data for quantum-inspired ML applications. Eq. (7) satisfies Eq. (2) but not Eq. (3). A related map that satisfies both conditions is [25]

$\phi^0(x) = e^{3\pi i x/2} \cos\left(\frac{\pi}{2}\frac{x - x_{\min}}{x_{\max} - x_{\min}}\right), \quad \phi^1(x) = e^{-3\pi i x/2} \sin\left(\frac{\pi}{2}\frac{x - x_{\min}}{x_{\max} - x_{\min}}\right)$.
In the present work, we will focus on the case of binary data, and so utilize the map Eq. (6).
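As an illustration of this encoding, the following sketch (a minimal NumPy example; the function name is ours, not from any library) builds the product state of Eq. (1) for the binary map, where each bit x_j is sent to the computational basis state |x_j⟩:

```python
import numpy as np

def encode_binary(x):
    """Map a binary data vector to an unentangled product state.

    Each element x_j in {0, 1} becomes the computational basis state
    |x_j> via phi^i(x_j) = delta_{i, x_j}; the full embedding is the
    Kronecker product of the single-qubit states.
    """
    state = np.array([1.0])
    for xj in x:
        qubit = np.zeros(2)
        qubit[xj] = 1.0          # delta_{i, x_j}
        state = np.kron(state, qubit)
    return state

# A 3-bit vector maps to a single basis state of the 8-dimensional
# Hilbert space; the state is normalized and unentangled by construction.
psi = encode_binary([1, 0, 1])
```

Because each factor is a basis state, the full embedding of a binary vector is itself a single computational basis state, which is why this map is well suited to generative modeling of bit strings.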

A. Tensor networks and sequential preparation
The next step in our QAML workflow outlined in Fig. 1 is to learn a quantum model for the collection of quantum states {|Φ(x_j)⟩}_{j=1}^{N_T} resulting from applying the encoding map from the previous section to the training data. Here, we define a quantum model as a collection of operations applied to quantum resources to produce a state that encodes the properties of the ensemble {|Φ(x_j)⟩}. In what follows, we specialize to the case of tensor network (TN) models, which provide a convenient parameterization of the structure of quantum operations and resources. Generally speaking, TNs represent the high-rank tensor describing a quantum wavefunction in a specified basis as a contraction over low-rank tensors, and hence define families of low-rank approximations whose computational power can be expressed in terms of the maximum dimension of any contracted index χ, known as the bond dimension.
A wide variety of TN topologies have been considered which are able to efficiently capture certain classes of quantum states [19][20][21]; in the present work we focus on matrix product states (MPSs). MPSs use a one-dimensional TN topology, as shown using the Penrose graphical notation for tensors [19] in Fig. 1(b), and form the basis for the enormously successful density matrix renormalization group (DMRG) algorithm in quantum condensed matter physics [28]. MPSs have several properties that make them attractive for QAML. For one, they are undoubtedly the most well-understood and mature of all tensor networks, which has led to robust optimization strategies that are widely used in the quantum many-body community. In addition, MPSs are highly quantum resource efficient, in the sense that their associated wavefunctions can be sequentially prepared, and so qubits can be reused in deployment on quantum hardware. In fact, it can be shown that every state that can be sequentially prepared can be written as an MPS [29][30][31].
In recent years, TNs have found applications outside of the condensed matter and quantum information domains. The mathematical analysis community has proposed TN methods for data analysis, e.g., large-scale principal component analysis [32,33]. In this community, MPSs are referred to as tensor trains [34]. Using TN methods to design quantum-inspired ML models was first proposed by Stoudenmire and Schwab [25], who put forth a scheme using an MPS network as a linear classifier in a Hilbert space whose dimension is exponentially large in the length of the raw data vector. Since then, many other proposals for quantum-assisted or quantum-inspired TN ML models have appeared in the literature [35][36][37][38][39][40][41][42][43][44][45][46][47][48][49][50][51], including generative modeling of binary data using MPSs in Ref. [52]. In the majority of approaches, DMRG-inspired algorithms for optimization have been employed. However, the authors of Ref. [53] recently demonstrated an alternate strategy where a TN was implemented as a neural network using standard deep learning software, and the tensors of the TN were optimized using backpropagation strategies ubiquitous in classical ML. While this strategy has shown good performance, it has also been shown to be suboptimal with respect to the DMRG-like approach [54]. Nonetheless, the use of deep learning "preconditioners" and the intersection of QAML and neural networks remains intriguing [55][56][57].
The fact that MPSs define a sequential preparation scheme means that MPSs define highly resource-efficient schemes for learning [15] and quantum simulation [58]. In particular, the qubit resource requirements for an MPS model are logarithmic in the bond dimension χ, which encapsulates the expressivity of the model, and are independent of the length of the input data vector N. In order to illustrate how this property comes about, consider that we have a register of N qubits with states |j_i⟩, i = 0, …, N−1, j_i = 0, 1, in which we want to encode data, and a χ-level ancilla |α⟩, α = 0, …, χ−1, that can be used to entangle the qubits. Starting at the "right" end of the system, we can initialize the (N−1)st qubit using an operator L̂_{N−1} defined as

$\hat{L}_{N-1} |00\rangle = \sum_{\alpha, j_{N-1}} L^{[N-1] j_{N-1}}_{\alpha} |\alpha\, j_{N-1}\rangle$,  (9)

in which the coefficients L^{[N−1]} satisfy the isometry condition

$\sum_{\alpha, j_{N-1}} \left| L^{[N-1] j_{N-1}}_{\alpha} \right|^2 = 1$.  (10)

Clearly, if we start our qubit and ancilla system in the state |00⟩, this operation transforms it into the (entangled) state Σ_{α, j_{N−1}} L^{[N−1] j_{N−1}}_α |α j_{N−1}⟩, and the isometry condition ensures that this state is normalized. Moving to the next qubit, we now entangle it with the ancilla using the operator

$\hat{L}_{i} |\alpha 0\rangle = \sum_{\alpha', j_i} L^{[i] j_i}_{\alpha' \alpha} |\alpha'\, j_i\rangle$,  (11)

which is subject to the isometry condition

$\sum_{j_i} L^{[i] j_i \dagger} L^{[i] j_i} = I_\chi$,  (12)

with I_χ the χ × χ identity matrix. This operation now puts the system in the state Σ L^{[N−2] j_{N−2}}_{α'α} L^{[N−1] j_{N−1}}_{α} |α' j_{N−2} j_{N−1}⟩. We follow this same logic for all subsequent qubits, defining isometric operators that entangle them to the rest of the system using the ancilla, until we reach qubit 0, which is attached using the isometric operator

$\hat{L}_{0} |\alpha 0\rangle = \sum_{j_0} L^{[0] j_0}_{\alpha} |0\, j_0\rangle$.

This operator puts the full system into the state |0⟩ ⊗ Σ_{j_0 … j_{N−1}} L^{[0] j_0} L^{[1] j_1} ⋯ L^{[N−1] j_{N−1}} |j_0 … j_{N−1}⟩. Hence, in the last step, the qubit states decouple from the ancilla. The qubit state takes the form of an MPS

$\psi^{j_0 \ldots j_{N-1}} = L^{[0] j_0} L^{[1] j_1} \cdots L^{[N-1] j_{N-1}}$,

with the additional constraint that each of the MPS matrices L satisfies the left-orthogonal condition Eq. (12). The above procedure can readily be read in reverse; given a general MPS QAML model ψ^{j_0…j_{N−1}} with bond dimension χ, we can convert it into a sequential qubit preparation scheme with a χ-dimensional ancilla by putting the MPS in left-canonical form. This transformation to left-canonical form can be done without loss of generality using a well-known procedure involving an orthogonal decomposition, e.g., the singular value or QR decomposition [19]. Thus, the tensors appearing in an MPS, which could result from a classical training optimization, can be formally (i.e., modulo compilation into native quantum operations for a given hardware architecture) translated into operations for deployment on quantum resources.
The above prescription assumed the presence of a register of N qubits, but due to the sequential nature of the preparation this is unnecessary, and a single "physical" qubit together with the χ-level ancilla suffices, provided we are not measuring any multi-qubit properties of the state. As an example, we will consider drawing a sample from an MPS wavefunction generative model with the binary map Eq. (6). In this application, we first couple the qubit and ancilla as in Eq. (9), starting from both in the fiducial state |0⟩. We then measure the qubit in the computational basis, record its outcome as x_{N−1}, and then return it to the fiducial |0⟩ state while leaving the ancilla unmeasured. We note that the ability to re-initialize a single qubit independent of the others is not universally available in present-day hardware, but has been demonstrated in, e.g., trapped ion platforms [59]. We then re-entangle the ancilla and qubit using the operator L̂_{N−2} defined in Eq. (11), measure the qubit and record the outcome as x_{N−2}, and again return the qubit to the |0⟩ state. This procedure is repeated with the other operations L̂_j until a complete set of N measurements x is made, which constitutes a data sample. This procedure is denoted graphically in Fig. 1(d). Clearly, this only requires a single "physical" or "data" qubit (i.e., the one that is sampled) independent of the input data size N, and the construction of the χ-level ancilla requires only log₂ χ qubits. We stress that the scheme above is formal in the sense that it produces isometries acting on quantum resources without reference to their actual physical representation or other hardware constraints such as limited coherence time, connectivity, gate sets, etc. The translation of these formal isometries into operations to be dispatched on a given target hardware is detailed in Sec. III.
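The sequential sampling procedure described above can be emulated classically for a left-canonical MPS. The sketch below is our own illustrative code; the tensor index convention, each tensor shaped as (left bond, physical, right bond), is an assumption. The ancilla state is carried along while the single physical qubit is repeatedly coupled, "measured," and re-initialized:

```python
import numpy as np

def sample_sequential(tensors, rng):
    """Draw one sample from a left-canonical MPS Born machine.

    tensors[i] has shape (chi_left, 2, chi_right); the preparation runs
    from the rightmost site to the leftmost, reusing a single physical
    qubit while the ancilla (bond) state is carried along.
    """
    a = np.array([1.0])              # ancilla starts in |0> (trivial right bond)
    sample = []
    for L in reversed(tensors):      # right-to-left sequential preparation
        v = np.einsum('ajb,b->aj', L, a)     # couple ancilla and qubit
        p = np.sum(np.abs(v) ** 2, axis=0)   # Born probabilities for the qubit
        j = rng.choice(2, p=p / p.sum())     # "measure" the physical qubit
        sample.append(j)
        a = v[:, j] / np.sqrt(p[j])          # collapse; qubit returns to |0>
    return sample[::-1]              # reorder as x_0 ... x_{N-1}
```

For example, feeding in the two tensors of a left-canonical MPS for the GHZ-like state (|00⟩ + |11⟩)/√2 yields only the samples 00 and 11, each with probability 1/2.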

B. Generative MPS models and classical training procedure
We now further specialize to generative models, in which a collection of quantum data vectors are encoded into a wavefunction such that the probability distribution evaluated at data vector x is

$P(\mathbf{x}) = \frac{|\langle \Phi(\mathbf{x}) | \psi \rangle|^2}{Z}$.

Here, Z = ⟨ψ|ψ⟩ = ∫ dx ⟨ψ|Φ(x)⟩⟨Φ(x)|ψ⟩ is a normalization factor, and we assume the property Eq. (3) holds for the Hilbert space encoding map. As this corresponds to Born's rule for measurement outcomes, the resulting structure is referred to as a Born machine [52,60]. In order to discuss data representation using Born machines, we define the average log-likelihood of the data in the training set T as

$\mathcal{L} = \frac{1}{N_T} \sum_{\mathbf{x} \in \mathcal{T}} \ln P(\mathbf{x})$.

The minimization of the negative log-likelihood with respect to the parameters in our Born machine is equivalent to maximizing the probability that the data is generated by the Born machine. We will parameterize the wavefunction |ψ⟩ to be trained as an MPS and assume that the data is encoded in terms of an orthonormal map as in Eq. (3), resulting in

$P(\mathbf{x}) = \frac{|\langle \Phi(\mathbf{x}) | \psi \rangle|^2}{Z}$,

where the normalization factor (partition function) is

$Z = \langle \psi | \psi \rangle$,

now evaluated by contracting the MPS with itself. We will optimize the Born machine by a DMRG-style procedure using gradient descent, where the gradient is taken with respect to the tensors of the MPS. Namely, we will consider the gradient with respect to a group of s neighboring tensors Θ = A^{i_l} ⋯ A^{i_{l+s−1}}, with s typically being one or two, noting that the gradient of an object with respect to a tensor is a tensor whose elements are the partial derivatives with respect to the individual tensor elements. We take the gradient with respect to the conjugates of the tensors A^{i_j}, formally considering these conjugates independent of the tensors themselves. This gradient may be written as

$\frac{\partial \mathcal{L}}{\partial \Theta^*} = \frac{1}{N_T} \sum_{\mathbf{x} \in \mathcal{T}} \frac{1}{\langle \psi | \Phi(\mathbf{x}) \rangle} \frac{\partial \langle \psi | \Phi(\mathbf{x}) \rangle}{\partial \Theta^*} - \frac{1}{Z} \frac{\partial Z}{\partial \Theta^*}$.

With this gradient in hand, we update the local block of tensors as

$\Theta \to \Theta + \eta \frac{\partial \mathcal{L}}{\partial \Theta^*}$,

in which η is a learning rate (note that this is equivalent to minimizing the negative log-likelihood). For the single-site algorithm (s = 1), this update does not change the bond dimension or canonical form of the MPS. For the two-site algorithm (s = 2), we can now split the updated tensor Θ into its component MPS tensors as

$\Theta \to A^{i_l} A^{i_{l+1}}$  (24)

using, e.g., the SVD. Hence, the addition of the gradient can increase the bond dimension, and thus the representational power, adaptively based on the data. The bond dimension can also be set implicitly by requiring that the L₂-norm of the tensor Θ is represented with a bounded relative error ε. The above update affects only a small group of the tensors with all others held fixed. We now shift the orthogonality center to a neighboring tensor and perform the same local optimization procedure. For the two-site case, the shift of orthogonality center can be accomplished simultaneously with the splitting of the tensor Θ in Eq. (24). In the one-site case, the orthogonality center is moved to the next tensor in the optimization cycle using either the SVD or the QR decomposition. A complete optimization cycle, or "sweep," occurs when we have updated all tensors twice, moving in a back-and-forth motion over the MPS. The sweeping process is converged once the negative log-likelihood no longer decreases substantially. Example convergence behavior will be given later in Sec. V.
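To make the negative log-likelihood minimization concrete, the toy sketch below performs gradient descent directly on a real-valued state vector rather than on local MPS tensors; it is a simplified stand-in for the DMRG-style update, not the sweeping algorithm itself, and all names are ours:

```python
import numpy as np

def nll(psi, data_idx):
    """Negative log-likelihood of a real Born machine p(x) = psi(x)^2 / Z."""
    Z = np.sum(psi ** 2)
    return np.log(Z) - np.mean(np.log(psi[data_idx] ** 2))

def train_step(psi, data_idx, eta=0.05):
    """One gradient-descent step on the full state vector (a toy stand-in
    for the DMRG-style local tensor updates described in the text)."""
    Z = np.sum(psi ** 2)
    counts = np.bincount(data_idx, minlength=psi.size) / len(data_idx)
    grad = 2 * psi / Z - 2 * counts / psi   # d(NLL)/d(psi), psi assumed positive
    return psi - eta * grad
```

Each step moves amplitude toward frequently observed basis states, and the negative log-likelihood decreases as p(x) = ψ(x)²/Z approaches the empirical distribution.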

III. COMPILATION OF MPS MODELS FOR QUANTUM HARDWARE
In this section, we address how to take an MPS model resulting from the classical optimization procedure outlined in Sec. II B and convert it into a sequence of operations to be performed on a quantum device. We will refer to this operation as quantum compilation. Many modern NISQ software ecosystems, for example Qiskit [4] and Forest [1], have routines for compiling quantum instructions, usually required to be supplied in the form of an abstract quantum circuit model. These compilers typically perform multiple passes through the abstract circuit to map virtual qubits from the abstract model onto the hardware qubits of the device, to route operations between virtual qubits through the hardware connectivity, e.g., by placing SWAP gates, and to optimize the circuit to minimize some property, such as the entangling gate count. We note that quantum compilation remains an active area of research, and currently available generic methods for quantum compilation tend to produce "deep" circuits with significant numbers of entangling gates.
Several properties of our particular quantum computing use case, compiling isometries encoding TN models for QAML, distinguish it from traditional quantum computing use cases. For one, our isometries are defined on the Hilbert space of a physical qubit and a formal χ-level ancilla, and so may not uniquely describe an isometric operation on a set of virtual qubits, e.g., when χ is not a power of 2. Further, since the ancilla degrees of freedom are never directly measured, there is no preferred basis or state ordering for these states. Both of these properties give freedom that can be utilized to simplify compilation. In addition, the isometries are the result of an optimization procedure that has a finite tolerance (see Sec. II B), and so do not need to be compiled exactly to meet some fine-tuned property. That is to say, model predictions are not more accurate when using a compiled unitary that matches the isometry better than the optimization tolerance. For NISQ devices in particular, fine-tuning of isometry properties through the introduction of additional entangling gates may in fact produce worse results due to the increased noise in the circuit compared to a shallower representation. These properties have motivated us to pursue optimizations of the tensor network structure as well as a set of greedy compilation heuristics, inspired by Ref. [61], that we outline in what follows.
The key objects that we want to optimize in this section are the isometries L̂_i defined by the elements of the MPS in left-canonical form; see Sec. II A. Given that the binary encoding map used in this work, Eq. (6), is real-valued, all MPS tensors are real-valued, and this extends to the isometries. We will display the isometries using plots of their matrix representations in a fixed basis, as in Fig. 2. In this and similar plots, the basis ordering is defined with the physical qubit (i.e., the qubit that begins in the |0⟩ state and is read out after each isometric operation) as the least significant qubit, such that an isometry acting on a χ-dimensional ancilla α ∈ {0, …, χ−1} and a physical qubit q ∈ {0, 1} has state indices

$\mathrm{index}(|\alpha q\rangle) = 2\alpha + q$.

For isometries that have their ancilla states decomposed into qubits, we order those qubits a_i ∈ {0, 1} such that significance increases with label index i, i.e.,

$\mathrm{index}(|a_{n_{\mathrm{anc}}} \ldots a_1 q\rangle) = \sum_{i=1}^{n_{\mathrm{anc}}} 2^i a_i + q$.

The isometry in Fig. 2 acts on a physical qubit and a χ = 7-dimensional ancilla, transforming the state |0 0⟩ into a superposition of |1 1⟩ and |6 0⟩, the state |1 0⟩ into |0 0⟩, and so on (states labeled as |α q⟩). We note that the isometry in Fig. 2 is undefined when acting on states with |q⟩ = |1⟩, in accordance with the sequential preparation scheme, but takes arbitrary ancilla states as inputs.
Because of the isometry property, we only need to account for the nonzero elements of the operation when matching to a unitary, and so do not need to distinguish between zero elements and undefined elements. As a first step in compilation, we want to "clean" the isometries from the classical model in order to remove noise at the level of the classical optimization tolerance; otherwise, we will expend effort attempting to compile this noise into quantum operations that will not improve the fidelity of the calculation. This amounts to implementing a filter on the MPS to remove elements below some tolerance level ε, which can be accomplished by using MPS compression to find the MPS |φ⟩ with specified resources (e.g., restricted bond dimension χ) that is closest in the L₂-norm to a target MPS |ψ⟩ that has higher resource requirements (χ' > χ). While this is optimally done variationally [19], a simple and practical method for performing this operation is to use local SVD compression, in which the MPS tensor of the orthogonality center A^{[i]} is decomposed by the SVD as

$A^{[i] j}_{\alpha\beta} = \sum_{\gamma} U_{(\alpha j) \gamma} S_{\gamma} V^{\dagger}_{\gamma \beta} \quad \text{or} \quad A^{[i] j}_{\alpha\beta} = \sum_{\gamma} U_{\alpha \gamma} S_{\gamma} V^{\dagger}_{\gamma (j \beta)}$,  (27)

where the first expression is for a right-moving update and the second for a left-moving update. We can truncate the bond dimension by keeping only the χ largest singular values, or determine the new bond dimension implicitly through a singular value cutoff ε as

$\chi = \min \Big\{ \chi' : \sum_{\gamma > \chi'} S^2_{\gamma} \le \varepsilon^2 \sum_{\gamma} S^2_{\gamma} \Big\}$.

When the MPS tensor is the orthogonality center, this condition is equivalent to an L₂-norm optimization of the full wavefunction. Replacing A^{[i]} by the truncated U for a right-moving update or by V† for a left-moving update and contracting the truncated SV† or US into the neighboring tensor completes the local optimization. Sweeping the optimization across all tensors completes the filtering step. Since the optimization only deals with the parameters of a single MPS tensor at a time, it is not guaranteed to be globally optimal, but this simple procedure works well in practice. As a side benefit, ending the optimization by applying the update Eq. (27) and replacing the MPS tensor A^{[i]} with U for each tensor places the MPS in left-canonical form, from which the isometries for sequential preparation can be constructed from the tensor elements.

[Fig. 2 caption: Example isometry for optimization. An example isometry acting on a single physical qubit in the state |0⟩ and a χ = 7-level ancilla, taken from the MNIST example in Sec. V. The isometry has been cleaned to remove small numerical values resulting from classical optimization, but no further optimization has been applied.]
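A minimal sketch of the local SVD compression step, assuming real tensors and our own index convention (left bond, physical, right bond), might look as follows; the implicit bond-dimension rule keeps the smallest χ whose discarded singular-value weight is below ε²:

```python
import numpy as np

def truncate_site(A, eps):
    """Right-moving SVD truncation, as in Eq. (27), of the orthogonality-
    center tensor A with shape (chi_l, d, chi_r).  Returns the truncated
    left-canonical tensor U and the matrix S V^dagger to contract into
    the right neighbor."""
    chi_l, d, chi_r = A.shape
    U, S, Vh = np.linalg.svd(A.reshape(chi_l * d, chi_r), full_matrices=False)
    w = S ** 2 / np.sum(S ** 2)
    # tail[k] = relative weight of the discarded singular values S[k:]
    tail = np.concatenate([np.cumsum(w[::-1])[::-1], [0.0]])
    chi = max(1, int(np.argmax(tail <= eps ** 2)))  # smallest acceptable chi
    return U[:, :chi].reshape(chi_l, d, chi), S[:chi, None] * Vh[:chi]
```

The returned U is left-canonical by construction, so sweeping this update from left to right also leaves the MPS in left-canonical form, as noted above.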
A. Ancilla permutation and the diagonal gauge

The conversion of an MPS into left-canonical form uses the gauge freedom inherent in MPSs, namely that any invertible matrix X and its inverse can be placed between any two tensors of the MPS, i.e.,

$A^{[i] j} \to A^{[i] j} X, \qquad A^{[i+1] j'} \to X^{-1} A^{[i+1] j'}$,

such that each of the tensors in the left-canonical MPS satisfies the isometry constraint

$\sum_{j} L^{[i] j \dagger} L^{[i] j} = I$,  (32)

without changing the overall quantum state. However, we note that the constraint Eq. (32) still allows for the insertion of any unitary matrix and its inverse on either the left or right bond basis of an MPS tensor L^{[i]j} without changing the state or the isometry conditions. This freedom stems from the fact that the bond degrees of freedom are only used to mediate correlations between the physical degrees of freedom and are not directly measured, and so have no preferred basis for representation. We can attempt to exploit this freedom to produce MPS models that are more amenable to compilation on a given target hardware. We note that, just as with the ordinary gauge freedom of MPSs, a change of gauge affects two neighboring MPS tensors at a time, and so an operation that may benefit one tensor also affects its neighbors and so on down the network. Thus, the optimal choice of gauge requires a global optimization across all tensors.
To utilize the ambiguity in the basis representation of the ancilla states, we have devised a simple procedure that we have found to aid in compiling isometries for QAML models. The heuristic guiding our scheme is to ensure that operations are as "diagonal" as possible, in the sense that qubits preferentially remain in their same state rather than being swapped or mixed with other ancilla qubits. Operationally, in order to work only within the ancilla basis where we have freedom of representation, we define a matrix of overlaps

$M_{\alpha\beta} = \sum_{j} L^{[i] j}_{\alpha\beta}$,

which "integrates out" the physical qubit from the isometry used for sequential preparation, and so acts only in the ancilla space. A diagonal M is desired, as this would perfectly preserve the individual ancilla basis states and so reduce the number of quantum operations required. Recalling that we are only changing either the left or right basis of M at a time, one possible option to increase its diagonal dominance through transformation of either the left or right basis is to use the polar decomposition M → UP or M → PU, with U unitary and P Hermitian and positive semidefinite. Using the unitary factor U to transform the basis of L would transform M into P; however, this transformation does not preserve sparsity in L, and we have found that it often leads to more complex operators in practice. Instead, we use the values of U from the polar decomposition to define a permutation of the ancilla basis states as, e.g.,

$P_{\alpha\beta} = \delta_{\alpha,\, \arg\max_{\alpha'} |U_{\alpha' \beta}|}$.  (34)

This operation does preserve sparsity, and results in more diagonal operations in the ancilla degrees of freedom. An example of the isometries for a QAML model with and without this permutation procedure is shown in the right and left panels of Fig. 3, respectively. We see that the permutation of the basis states does result in a more diagonal isometry operator, as desired.
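A small sketch of this permutation heuristic, with the polar decomposition obtained from the SVD (M = W S V† gives the unitary factor U = W V†), is given below; the function name and the argmax tie-breaking are our own choices:

```python
import numpy as np

def diagonalizing_permutation(M):
    """Permutation of the ancilla basis suggested by the unitary factor
    of the polar decomposition M = U P (U obtained here via the SVD).
    When several elements of a column of |U| tie in magnitude, the
    argmax below is ambiguous, mirroring the ambiguity in the text."""
    W, _, Vh = np.linalg.svd(M)
    U = W @ Vh                           # unitary polar factor
    perm = np.argmax(np.abs(U), axis=0)  # dominant row for each column
    P = np.zeros_like(M)
    P[perm, np.arange(M.shape[1])] = 1.0
    return P
```

When several elements of a column of |U| have the same magnitude, the argmax is ambiguous; the cost-function procedure described next is one way to break such ties.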
The permutation operation Eq. (34) is ambiguous whenever multiple elements of a column of U have the same absolute value. Recalling that our sequential MPS preparation scheme requires that the ancilla start and end in the vacuum state, we see that this occurs for tensors near the extremal values of the representation, when an ancilla qubit is first utilized or an ancilla qubit is decoupled from the remaining qubits. In such cases, we use the following alternate procedure to decide between permutations. First, we enumerate all basis permutations resulting from these ambiguities for a given tensor L^{[i] j_i} and construct their associated isometries L̂^{(ζ)}, in which ζ indexes permutations. To decide between these permutations, we again would like to make this operator as "diagonal" as possible, in the sense of minimizing the number of qubit operations being applied. We construct a simple cost function as follows: for each state indexed by the ancilla state α and the physical qubit q as above, we convert the state index into its binary representation b, which effectively maps the ancilla state onto a collection of log₂ χ qubits. As an example, the states of a four-dimensional ancilla and a single physical qubit give the representations

$|\alpha q\rangle = |a_2 a_1 q\rangle, \qquad \mathbf{b} = (a_2, a_1, q)$,

e.g., |α = 2, q = 1⟩ → b = (1, 0, 1). We now calculate a distance between two basis states (α, j) and (α', j') with respective binary representations b and b' as

$D[(\alpha, j), (\alpha', j')] = \Big( \sum_{i} |b_i - b'_i| \Big)^2$.

The term in parentheses counts the number of individual qubit "flips" required to convert one of the states into the other, and the square strongly penalizes multi-qubit coordinated flips. We then use the cost function

$C(\zeta) = \sum_{m, n} D_{m n} \bar{L}^{(\zeta)}_{m n}$,

in which D is the matrix with the D[•, •] as elements and L̄^{(ζ)} is the matrix of absolute values of L̂^{(ζ)}, to choose between the L̂^{(ζ)}.
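The bit-flip distance and the resulting permutation cost can be sketched as follows (illustrative code of our own; the quadratic penalty follows the definition above):

```python
import numpy as np

def flip_distance(n_bits):
    """D[m, n] = (number of bit flips between basis states m and n)^2."""
    dim = 2 ** n_bits
    bits = np.array([[(m >> k) & 1 for k in range(n_bits)] for m in range(dim)])
    ham = np.abs(bits[:, None, :] - bits[None, :, :]).sum(axis=2)  # Hamming distance
    return ham.astype(float) ** 2

def permutation_cost(L_abs, D):
    """C(zeta) = sum_{mn} D[m, n] * |L^(zeta)[m, n]| for ranking candidates."""
    return float(np.sum(D * L_abs))
```

A perfectly diagonal candidate has zero cost, since no basis state needs any qubit flips, while candidates that move amplitude between distant bit strings are penalized quadratically in the number of flips.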
As with the usual transformation of MPS gauge to mixed canonical form [19], there is a "right-moving" update that permutes the right bond basis of a tensor A^{[i]} and the left bond basis of A^{[i+1]}, and a "left-moving" update that permutes the left bond basis of A^{[i]} and the right bond basis of A^{[i−1]}. When applied to all tensors, we say that the MPS is in the diagonal gauge, as it is the gauge which enforces the isometries for state preparation to be as diagonal as possible (according to our particular cost functions). We stress that the MPS is still in left-canonical form, and so the sequential preparation scheme still holds; the diagonal gauge merely uses the unitary freedom remaining in the left-canonical form to further optimize the state preparation procedure while maintaining sparsity. There is a single tensor, at a certain location k, that is not optimized in the transformation to the diagonal gauge; we call it the diagonality center, analogous to the orthogonality center of mixed canonical form. While the location of the diagonality center can again be used as an optimization parameter, we have found it convenient to set the diagonality center to an isometry that is initially an identity matrix. Such an isometry can always be introduced by padding the classical data vectors with a zero at location k. The reason for our choice is that the permutation to diagonal gauge will transform this identity isometry into a permutation matrix, which is likely to be easier to compile with high fidelity than a general, non-sparse isometry. Specific techniques for compilation will be presented in a later section.
In addition to the permutation ambiguity, there is also a sign ambiguity on each of the bond states of the isometry. We again use diagonal dominance in fixing this sign ambiguity, by reversing the sign of a column (row) if the element with magnitude above a certain threshold closest to the diagonal is negative during a right-moving (left-moving) update of the diagonal gauge, with the sign also being absorbed into the tensor to the right (left) of the one being optimized. Following transformation to diagonal gauge, we fix the signs of all elements of the diagonality center (chosen, as above, to be a permutation operator) to be positive by absorbing any negative signs into the nearest tensor that has elements of mixed sign in the chosen bond direction.

B. Greedy compilation heuristics
Following the fixing of gauge outlined in the last subsection, we now want to transform the isometries L̂_i into operations to be performed on quantum hardware. The target hardware will have a collection of qubits laid out with a given topology and an allowed gate set of single-qubit rotations and entangling gates between pairs of qubits. Generally speaking, two-qubit gates are subject to higher degrees of noise than single-qubit gates, and so higher-fidelity operations will be obtained by using as few two-qubit gates as possible. As an example, the error map and qubit/gate topology for the IBMQ-X2 machine is shown in Fig. 4. For this device, the single-qubit gates are the parameterized Û3 rotations of Eq. (42) [4], and the two-qubit gates are controlled-NOT (CNOT) gates, which are allowed only between qubits designated with a solid line in Fig. 4. As shown in the figure, the average error of the CNOT gates at the time of this measurement was ∼2.6%, while the error of the single-qubit gates was ∼0.15%. Hence, a goal in compiling our isometries is to use as few gates as possible, and especially to minimize the number of two-qubit gates.
In our compilation heuristic, we enumerate possible unitaries by constructing a tree of potential circuit structures with continuous parameters to be optimized. The root node of our tree is comprised of a single-qubit gate (such as the Û3 gate in Eq. (42)) for each qubit. Each node in the tree has a child node corresponding to the placement of an entangling gate in one of its allowed positions, followed by single-qubit gates on the qubits acted on by the entangling gate. Any circuit that can be constructed using the allowed entangling gates and single-qubit rotations corresponds to a node in this tree, as proved in Ref. [61]. In order to select between nodes in this tree, we define a cost function (Eq. (43)) measuring the deviation of a candidate unitary Û from the target isometry over the set S of indices for which the elements of the matrix representation of the isometry exceed some tolerance, |L_{i,j}| > δ. Because of the isometry property of L̂ and the unitarity of the candidate gates Û, we can optimize only over the elements in S, which reduces the computational complexity of the cost function. Our optimization will select a particular unitary Û as being acceptable when the cost function drops below a specified tolerance ε.
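A minimal sketch of this restricted cost, assuming a simple squared-deviation form C(Û) = Σ_{(i,j)∈S} |L_ij − Û_ij|² (the precise form of Eq. (43) may differ):

```python
import numpy as np

def compilation_cost(L, U, delta=1e-12):
    """Cost restricted to the support S of the target isometry:
    sum over indices with |L_ij| > delta of |L_ij - U_ij|^2 (assumed form)."""
    S = np.abs(L) > delta
    return np.sum(np.abs(L[S] - U[S]) ** 2)
```

Restricting to S means entries of Û acting outside the span of the isometry are left free, which is what allows shallower circuits than a full unitary synthesis would require.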
The optimization procedure begins by optimizing the root node (single-qubit gates) over its parameters and checking whether the resulting cost function identifies an acceptable gate. If no acceptable gate is found, a queue of gates is formed corresponding to adding an entangling gate and a pair of single-qubit gates to the root node in all allowed locations as outlined above, and these gates are optimized and their cost functions recorded. If no gate from this queue is acceptable, a priority queue is formed by sorting the gates from this set according to their cost functions and then appending entangling gates and single-qubit rotations as above. In order to avoid an exponential growth of the number of search candidates, we limit the number of gates forming the starting point of the priority queue (i.e., before appending new entangling gates and single-qubit rotations) to a fixed number. This number is used as a convergence parameter, and can vary between optimization cycles; we find that it is useful to allow more gates in early optimization cycles, where the operations involve fewer parameters and so optimization is fast, and then to decrease the number of kept gates as the circuits become deeper. Also, we note that it may be useful to add a gate-dependent heuristic function h to the cost function when sorting gates to add to the priority queue, as advocated in Ref. [61]. This can be used to account for, e.g., hardware-dependent noise [62]; we will return to our choice of this function shortly.
Here, we briefly note details of our implementation of the above procedure, along with some problem-specific optimizations. Our subroutine for the cost function takes as input a vector of parameters θ, constructs a matrix representation of the parameterized gate sequence, in which θ_i is the vector of parameters used by gate i, and then evaluates the cost function Eq. (43). This enables us to obtain analytic gradients of the cost function, also expressed as elements of products of matrices. We optimize the cost function using the BFGS method, and allow for multiple batches of input parameters with random variations added to avoid local minima. Additionally, as noted above, all of the isometries that result from the use of a real-valued quantum embedding map will be real, and so we can restrict our attention to real-valued gates. Hence, in our implementation, we parameterize single-qubit gates as y-rotations, which relate to the gates in Eq. (42) as R̂y(θ) = Û3(θ, 0, 0), and use CNOTs for the entangling gates. While we have made the above gate choices for use in this paper, we stress that our methods apply to any other choice of single-qubit and entangling gates. While there is no guarantee that there are not operations with fewer entangling gates that could be found using complex-valued gates, we find that the reduction in the number of parameters when using real gates significantly improves the optimization time.
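The real-valued parameterization and BFGS optimization with random restarts can be sketched as follows; the single-layer Ry/CNOT ansatz and the function names are illustrative, not the paper's full tree search:

```python
import numpy as np
from scipy.optimize import minimize

def ry(t):
    """Real-valued y-rotation, Ry(t) = U3(t, 0, 0)."""
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

CNOT = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)

def ansatz(params):
    """One search-tree node: Ry on each qubit, a CNOT, then Ry on each qubit."""
    a, b, c, d = params
    return np.kron(ry(c), ry(d)) @ CNOT @ np.kron(ry(a), ry(b))

def cost(params, target, support):
    """Squared deviation from the target on its support (boolean mask)."""
    U = ansatz(params)
    return np.sum((U[support] - target[support]) ** 2)

def compile_greedy_leaf(target, support, n_restarts=20, seed=0):
    """BFGS from several random starting points; keep the best result."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        res = minimize(cost, rng.uniform(0, 2 * np.pi, 4),
                       args=(target, support), method='BFGS')
        if best is None or res.fun < best.fun:
            best = res
    return best
```

With random restarts, the optimizer reliably recovers targets that are exactly representable by the ansatz; in the full heuristic, failures instead trigger expansion of the search tree with further entangling gates.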
The final optimization we have included is to introduce longer gate sequence "motifs" into the optimization alongside the native entangling gates. In particular, the two motifs we have utilized in our work are a two-qubit rotation gate Ŝ, which is allowed between any two qubits that have CNOT connectivity, and a version of the Ŝ gate, which we call F̂, that is controlled on a third qubit. We find that the former gate can be compiled into two CNOTs using the ansatz sequence shown in Eq. (47), and the latter gate, with control on qubit c and the operation Ŝ applied to qubits q_1 and q_2, can be constructed using Eq. (48). Hence, Ŝ gates require 2 CNOTs for compilation and F̂ gates require 8 CNOTs for compilation. Both gates were identified from experiments with the greedy optimization procedure outlined above using only CNOTs, and their direct inclusion into the optimization enables more rapid convergence. As these gates require multiple entangling gates, it is useful to introduce a heuristic penalty function h into the cost function for ordering the next priority queue, to ensure that they are not chosen over shorter gates with a similar cost function. The choice of this penalty function will be problem-specific, and finding ways to optimize it in a data-driven fashion for problems of interest is an intriguing area for further research.
We also note that the use of multi-qubit controlled gates is penalized through the choice of the cost function Eq. (41) used to choose the permutation to diagonal gauge; the cost of 4 or 9 for a gate requiring two or three bit flips is in rough accordance with the number of CNOTs required for Ŝ and F̂, respectively.
An example application of this procedure to the isometry shown in the right panel of Fig. 3 is given in Fig. 5. Here, we give cost function penalties of 0.6 and 0.2 for F̂ and Ŝ gates, respectively, use a cost function tolerance of 5 × 10^−4, and keep the 4 lowest-cost gates to generate the priority queue from the first optimization and the 2 lowest-cost gates on subsequent optimizations. The successive rows show the optimized gates resulting from adding a single entangling gate to the ansatz resulting from the last round of optimization, starting with a single-qubit rotation on each qubit (top center). The green lines show the gates which are kept to form the new priority queue. Here and throughout, the quantum circuits are ordered with the physical (i.e., readout) qubit on the top line and the ancilla qubits in increasing order on lower lines. Following an optimization in which Ŝ or F̂ gates may be used, the "raw" circuit containing these parameterized gates is then compiled into CNOTs using Eqs. (47) and (48), products of single-qubit rotations are collected together, and then optimization passes are run to determine if single-qubit gates with rotations smaller than a certain threshold can be removed without affecting the cost function. We note that no cost function penalty is applied when an Ŝ or F̂ gate brings the cost function below its desired tolerance, as in the last step of the optimization shown in Fig. 5; the penalty is only used for ordering the priority queue when no gates meet the cost function tolerance.
Several "generic" methods for the compilation of isometries exist, as reviewed in, e.g., Ref. [63]. These algorithms also underlie the implementation in Qiskit [4]. In the generic approach, the matrix representation of the isometry is decomposed, e.g., a single column at a time or by the cosine-sine decomposition, and the resulting decompositions are expressed in terms of multi-qubit controlled operations, which are themselves decomposed into a target gate set using known representations. These approaches are constructive, and so will find decompositions of any isometry in principle, but they are not designed to find the most efficient representation by some metric, e.g., the number of entangling gates. Further, as noted above, the use of such generic algorithms requires an "isometric completion" in the case that the bond dimension χ is not a power of 2, and may expend additional resources in exactly compiling noise in the isometries. Special-purpose methods have also been developed for compiling permutation gates in Ref. [64], and these have been shown to outperform the generic algorithms in some cases. This method uses reversible logic synthesis to map the permutation into a reversible circuit comprised of single-target gates, and these single-target gates are then compiled into networks of CNOTs, Hadamard gates, and R̂z(θ) = |0⟩⟨0| + e^{iθ}|1⟩⟨1| rotations.
In order to compare our methods with the generic, constructive method for compiling isometries, we again consider the isometry in Figs. 3 and 5. As noted above, in order to utilize the generic methods we have to map this isometry into a complete isometry over a set of qubits, which requires us to define the action of the isometry on the state in which the ancilla qubits are all in the state |1⟩, which was left unconstrained by the optimization procedure. For simplicity, we use the "isometric completion" in which the operator takes this state to itself without modifying the state of the physical qubit. Using the iso method of the QuantumCircuit class from Qiskit [4], implementing the generic methods of Ref. [63], on the unconstrained ibmq_qasm_simulator hardware topology produces a gate representation with 122 CNOTs at optimization_level 0, and 120 CNOTs at optimization_level 3. The greedy compilation procedure presented in this work achieves a representation with a cost function error of 5.6 × 10^−10 with an order of magnitude fewer entangling gates for this particular isometry. An explicit circuit representation is given in Fig. 35(d) of the appendix. As a point of comparison for the specialized methods for permutation gates studied in Ref. [64], we consider the isometry shown in Fig. 6(a). This is indeed a permutation on the space acted upon, and so can be represented by a family of "unitary completions." We take the straightforward choice of unitary completion in which we leave the ancilla qubits unchanged by the permutation, as shown in Fig. 6(c). The result of applying our greedy compilation procedure is given in matrix form in panel (b), and in quantum circuit form in panel (d). This gate requires 7 CNOTs, and has a cost function error of ∼2 × 10^−15. The results of applying the methods of Ref.
[64] are shown in panels (c) and (e); the gate here requires 14 CNOT operators. Generally speaking, we find that our greedy compilation procedure finds comparable or better gates for isometries corresponding to near-diagonal permutations than the methods of Ref. [64] with the straightforward unitary completion given above. However, it is also worth noting that our procedure is designed for isometries, and so generally does not produce permutation operators on the entire space at the end of optimization. That is to say, the optimized gate is a permutation in the space spanned by the isometry, but the full unitary is not a permutation; see, e.g., Fig. 34 of the appendix. It is also worth noting that for complex, highly non-diagonal permutations, as can occur for the diagonality center when transforming to the diagonal gauge, the methods of Ref. [64] can produce more efficient representations.

IV. EXACTLY SOLVABLE BENCHMARK MODEL
As an exactly solvable benchmark, we consider an MPS Born machine encoding the probability distribution of classical discrete data vectors x, x_i ∈ {0, 1} ∀i. The simplest nontrivial situation is when the data vectors consist of all zeros except for a single 1, closely related to the canonical bars and stripes (BAS) dataset. Let us denote the probability that the 1 resides at location i as p_i, with Σ_i p_i = 1. It can be shown that this data can be represented exactly by a bond dimension-2 MPS Born machine whose tensors, given in Eqs. (49)-(51), contain a set of arbitrary phases {φ_j}. The presence of a large number of arbitrary phases is a generic feature of TN models for generative applications: since the square of the wavefunction is used to generate classical data samples, the phase structure of the wavefunction is generally underconstrained. This in turn implies that TN models can have some flexibility over the particular gate set used to entangle the physical qubits to the ancillae without affecting the sampling outcomes. The exactly solvable model encapsulated by Eqs. (49)-(51) is a useful benchmark both because it is the simplest nontrivial example of a sequentially preparable QAML model, involving a single ancilla qubit, and because it can be exactly solved for any classical data vector length and probability vector p. An example dataset for p = (1/5, 1/20, 1/20, 1/4, 1/5, 1/4) is given in Fig. 7.
The construction in Eqs. (49)-(51) is reminiscent of the well-known MPS representation of the W state [31]. In order to convert this generic MPS into a sequential qubit preparation scheme, we should place the MPS into left-canonical form. Since the bond dimension is known, we can do so in terms of the QR decomposition. For simplicity of exposition, we will take all phases φ_j = 0, though we will relax this condition shortly. Performing the QR decomposition on the first
tensor, we find its left-canonical form. Reshaping the second tensor and decomposing, we find the next left-canonical tensor, and this procedure generalizes to all remaining tensors, through to the last. We now take these left-canonical tensors and reshape them to correspond to isometries acting on a single physical qubit |i⟩_q and an ancilla qubit |α⟩_a. We start from the Nth tensor, where both the qubit and ancilla are in the state |0⟩. Following application of its isometry, the physical qubit can be measured in the computational basis and its outcome (classically) stored, and then the physical qubit is returned to the state |0⟩_q. We then repeat this procedure of acting with isometries, measuring the physical qubit, classically recording its output, and returning the physical qubit to |0⟩_q, using the isometries L̂_j = |0_a 0_q⟩⟨0_a 0_q| + … (Eq. (62)). We note that Eq. (62) also holds for the final site, j = 0, and produces an unentangled ancilla in the state |0⟩_a. With these operators in hand, we can re-insert the arbitrary phases on the elements resulting in the state |1⟩_q, yielding the isometries of Eq. (64). We note that there is a "natural" unitary completion of the operators in Eq. (64) given by Û_j = |0_a 0_q⟩⟨0_a 0_q| + |1_a 1_q⟩⟨1_a 1_q| + … (Eq. (65)), in which the state |1_a 1_q⟩, which is never populated under ideal operation, is left unchanged, and the action on the (also ideally unpopulated) state |0_a 1_q⟩ is determined by orthogonality. Written in the basis representation {|0_a 0_q⟩, |0_a 1_q⟩, |1_a 0_q⟩, |1_a 1_q⟩}, we find Eq. (66), in which cos θ_j = √(Σ_{i<j} p_i / Σ_{i≤j} p_i) and so sin θ_j = √(p_j / Σ_{i≤j} p_i). This gate has a natural interpretation as a rotation within the subspace of a single quantum of excitation shared between the qubit and ancilla, with the rotation angle set by the classical data vector probabilities (for p_j → 0, θ_j → 0 and Eq. (66) becomes the identity). An analogous unitary completion for the isometry L̂_{N−1} is given in Eq. (67). From a gate-based perspective, the operators in Eq.
(66) with φ = −π/2 are described by the Fermionic Simulation, or fSim(θ, ϕ), gate [65], with ϕ = 0 and θ = θ_j; this gate has recently been demonstrated in gmon qubits [66]. Alternatively, they are a one-parameter generalization of the iSWAP gate [67]. We note that the unitary completion Û_j at φ_j = 0 is given by Ŝ(θ_j, θ_j) in the notation of Eq. (46), and so for the gate set employed by the IBMQ processors, the shortest decomposition for Û_j is given by Eq. (47). While in some alternative hardware platforms, such as those employing tunable qubits [68], iSWAP gates can be implemented natively, partial iSWAPs still require decomposition. We also note that the operation Eq. (66) is generated by an effective Hamiltonian applied for "unit time" when φ_j = π/2. This gate is readily achieved in trapped-ion-based quantum computers using an equally weighted combination of XX and YY Mølmer-Sørensen gates [69], as well as in a variety of other platforms implementing XY effective spin-spin interactions [70]. It is also interesting that the "data angle" θ_j has a natural interpretation as an ersatz "evolution time" in this perspective. Before moving on from this exactly solvable example, we would like to point out how the freedom in representation of the bond basis manifests itself here. Namely, the predictions of the model of Eqs. (49)-(51) are unchanged if we reverse the roles of the |0⟩_a and |1⟩_a ancilla states in all but the first and last steps of preparation (using the unitary freedom exploited in the transformation to diagonal gauge discussed in Sec. III A). In this case, we have modified isometries whose natural unitary completions are described by Ŝ(−θ_j, θ_j) at φ_j = 0, and which are generated by the corresponding effective Hamiltonian at φ = π/2.
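As a numerical check on this construction, the data angles θ_j computed from p telescope back to the target probabilities under the sequential scheme; a minimal sketch (the function names are ours):

```python
import numpy as np

def data_angles(p):
    """theta_j with cos(theta_j) = sqrt(sum_{i<j} p_i / sum_{i<=j} p_i)
    and sin(theta_j) = sqrt(p_j / sum_{i<=j} p_i)."""
    csum = np.cumsum(p)
    prev = csum - p  # sum_{i<j} p_i
    return np.arctan2(np.sqrt(p), np.sqrt(prev))

def outcome_probs(theta):
    """Probability that the single '1' lands at site k under sequential preparation:
    prod_{j>k} cos^2(theta_j) * sin^2(theta_k), which telescopes to p_k."""
    N = len(theta)
    probs = np.empty(N)
    for k in range(N):
        probs[k] = np.prod(np.cos(theta[k + 1:]) ** 2) * np.sin(theta[k]) ** 2
    return probs
```

Running this with the example p of Fig. 7 reproduces the input probabilities, confirming the telescoping identity cos²θ_j = Σ_{i<j} p_i / Σ_{i≤j} p_i.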

A. Training and compilation
In this section, we detail the application of the methods outlined in this paper to the exactly solvable benchmark of Sec. IV using the probabilities p = (8/31, 18/31, 5/31). As a point of comparison, we consider a "hand-compiled" version of the unitary-completed isometries, Eqs. (65) and (67). Taking φ = 0, we can compile these gates using the circuits shown in Fig. 8. However, we additionally note that, with the assumption that the physical qubit starts in the state |0⟩_q, the first CNOT in Fig. 8(a) is the identity and so can be neglected, leading to a circuit with three CNOTs.

B. Performance of benchmark on cloud-based hardware
In this section, we present results for the exactly solvable benchmark model running on cloud-based NISQ hardware, using IBM devices as an example. We note that the current IBM hardware does not allow measurement and reinitialization during an experimental run, and so our sequential preparation schemes cannot be directly implemented on these devices. However, we can still test our generative models by implementing the gates Û_j of the sequential preparation scheme on a register of (N + 1) qubits prepared in the |0⋯0⟩ state, coupling each physical qubit to the same ancilla in order from (N − 1) down to 0. This procedure is limited in practice by the number of available qubits and their connectivity to a single ancilla qubit. However, for devices with a cross-shaped topology, such as the IBMQ-X2, we can readily couple up to 4 qubits to a central ancilla qubit, and for devices with a T-shaped topology, such as the Vigo, we can couple up to 3 qubits to a single ancilla.
To apply our methods to this benchmark case, we train a χ = 2 Born machine using the single-site gradient descent described in Sec. II B and compile it into gates using the procedures of Sec. III, with the diagonality center at site 1 and a greedy optimization tolerance of 5 × 10^−4. Following the usual parlance of MPSs from condensed matter physics, we will refer to the physical indices of the MPS tensors as sites. The results of this procedure are shown in Fig. 9, with panels (a)-(c) for site 0, panels (d)-(f) for site 1, and panels (g)-(i) for site 2. The final cost functions for sites 0, 1, and 2 are 6.7 × 10^−9, 7.3 × 10^−10, and 2.0 × 10^−9, respectively. We see that the obtained quantum circuits are substantially different from those obtained by hand-compilation of the "natural" unitary completion, but are still of very high fidelity in the space spanned by the isometry. In addition, the gates for sites 0 and 1 are shallower than the hand-compiled gate, which may be anticipated based on known optimality results for two-qubit gates [71].
Utilizing this approach for the exactly solvable Born machine model with the probability vector given in Sec. IV A, we find the circuits shown in Fig. 10. Here, the physical qubits (those that are sampled to obtain output classical data vectors) are assigned to be qubits 0, 1, and 3, and the ancilla is qubit 2. The upper panel is for the hand-compiled circuits from Fig. 8, and the lower panel is the circuit from Fig. 9 using the workflow put forth in this work. The dashed vertical lines demarcate the circuits corresponding to the individual sites of the Born machine, but are inessential, and neighboring single-qubit rotations can be joined for increased efficiency.
As metrics for assessing the performance of our QAML models, we utilize both the raw experimental counts used to infer measurement probability distributions and a convex version of the Kullback-Leibler (KL) divergence between the ideal (p_T) and estimated (p_N) distributions, as implemented in SciPy [72], D(p_T ∥ p_N) = Σ_i [p_{T,i} ln(p_{T,i}/p_{N,i}) − p_{T,i} + p_{N,i}] (75). The noise levels of NISQ devices fluctuate over time, and so to account for these statistical variations we implemented a jackknife procedure [73] for the mean and variance, including bias correction, utilizing 25 experimental runs per day of 2^13 = 8192 shots each across 5 days. We further refine each experimental run using the measurement noise filter implemented in Qiskit [4], which produces a measurement noise correction map from a collection of calibration measurements performed immediately before the experimental shots.
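A sketch of these two metrics, using SciPy's elementwise convex KL (`scipy.special.kl_div`) and a delete-one jackknife applied to per-run statistics; the exact estimator details of Refs. [72, 73] may differ:

```python
import numpy as np
from scipy.special import kl_div

def convex_kl(p_true, p_est):
    """Convex KL: sum_i [p_T ln(p_T/p_N) - p_T + p_N], via scipy.special.kl_div."""
    return float(np.sum(kl_div(p_true, p_est)))

def jackknife(samples):
    """Delete-one jackknife mean (with bias correction) and variance for
    a statistic evaluated once per experimental run (e.g., per-run KL values)."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    loo = np.array([np.mean(np.delete(samples, i)) for i in range(n)])  # leave-one-out means
    mean_loo = loo.mean()
    bias_corrected = n * samples.mean() - (n - 1) * mean_loo
    var = (n - 1) / n * np.sum((loo - mean_loo) ** 2)
    return bias_corrected, var
```

The convex form remains finite and well behaved when individual estimated probabilities are small, which matters for finite-shot experimental estimates.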
The results of our jackknife analysis of the experimental measurement counts per state are shown in Fig. 11.
FIG. 10. Comparison of hand-compiled and auto-compiled circuits for the exactly solvable test case. The exactly solvable benchmark with three physical qubits (0, 1, and 3) and a single ancilla qubit (2), implemented as a quantum circuit using the hand-compiled circuits in Fig. 8 (upper panel) or the auto-compiled gates in Fig. 9 (lower panel).
Here, panels (a) and (c) are the results for the hand-compiled model circuit in Fig. 10, run on the IBMQ-X2 device and the
IBMQ-Vigo device. In all panels, the rightmost green bar represents the ideal counts given by the model probability vector p = (8/31, 18/31, 5/31), the center blue bars are the raw experimental measurements without noise calibration applied, and the leftmost orange bars are the experimental measurements with the noise calibration applied. The black lines centered on the tops of the bars indicate the 1σ confidence intervals from the jackknife procedure. As noted above, qubits 0, 1, and 3 map to the probabilities p_0, p_1, and p_2, respectively, and qubit 2 is the ancilla. Clearly, the application of the measurement noise filter improves the fidelity of the results. Also, generally speaking, the results for the Vigo device (lower panels) are closer to the ideal results than those for the X2 (upper panels). The largest-probability state resulting from errors is the state |0000⟩, with no "hot" physical bits, followed by |1100⟩, with the two highest-probability physical bits "hot." We note that the outcomes involving the ancilla qubit in the |1⟩ state can be removed in postselection, by virtue of the fact that the sequential preparation scheme should end with the ancilla in the |0⟩ state (see Sec. II A), but this results in only small corrections for the present case. Finally, we see that the auto-compiled results using the approach of Sec. III (right panels) are generally closer to the ideal results than the hand-compiled circuits (left panels), though this is not true for each state individually.
In Fig. 12 we display the KL divergence between the ideal, noiseless probabilities of measuring each individual quantum state and the measurement probabilities estimated from 25 experiments of 2^13 shots, without (filled symbols) and with (empty symbols) the measurement noise calibration filter applied. The x axis denotes consecutive experimental days, and the horizontal lines indicate the KL divergence resulting from the distributions averaged over all days. Clearly, the application of the measurement noise filter improves the estimation of probabilities, as indicated by a lower KL divergence with respect to the ideal results. In addition, the auto-compiled circuits (squares) show a lower KL divergence than the hand-compiled circuits, likely due to their shallower circuits. Finally, we find that the Vigo results in panel (b) have lower KL divergence than the X2 results in panel (a), indicating an overall lower noise level on these days, in spite of the day-to-day fluctuations in the KL divergence being comparable in magnitude between the two machines.

V. EXAMPLE USING THE MNIST DATASET
The exactly solvable benchmark presented in Sec. IV provided data that was simple and well-structured enough that it could be exactly memorized using a single qubit to mediate bitstring correlations. In this section, we consider a QAML benchmark that is again a generative MPS Born machine, analogous to Sec. IV, but using data from the MNIST handwritten digit dataset [24], a canonical ML test case. The MNIST dataset consists of greyscale images, each consisting of 28×28 pixels, of the numbers 0 through 9. We process the data by twice passing it through a filter that returns the max value from every contiguous 2×2 pixel block, resulting in images of size 7×7. While this is not necessary, and produces less raw data available for learning, it reduces the number of isometries to compile in the sequential preparation scheme, allowing for a more detailed case-by-case analysis and reducing the overall gate depth for the ancilla in the sequential preparation scheme. Our next step in processing this data is to binarize the greyscale images, such that we can use the binary qubit encoding for simplicity. Examples of the processed data are shown in Fig. 13.
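The preprocessing described above (two rounds of 2×2 max pooling followed by thresholding) can be sketched with NumPy; the threshold value is an assumption:

```python
import numpy as np

def max_pool2(img):
    """2x2 max pooling with stride 2: (h, w) -> (h//2, w//2)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def preprocess(img28, threshold=0.5):
    """28x28 greyscale in [0, 1] -> two pooling passes -> 7x7 -> binarized."""
    small = max_pool2(max_pool2(img28))
    return (small > threshold).astype(np.uint8)
```

Each resulting image maps directly onto a length-49 binary vector for the qubit encoding.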
In this work, we explore this dataset in the small-data regime, where MPS models of modest bond dimension can memorize all patterns, analogous to the benchmark in Sec. IV, but where the bitstrings exhibit more complex correlations that require more quantum resources. The bitstrings are specified by mapping the 7×7 pixel arrays into binary vectors of length 49. As an example of the classical optimization procedure, an MPS Born machine with χ = 8 (three ancilla qubits) converges to a negative log-likelihood of 2.563 following roughly 400 iterations of single-site gradient descent with a learning rate of η = 10^−4 (see Sec. II B) on the N_T = 10 item dataset shown in Fig. 13. Using χ = 16 (four ancilla qubits), we reach the theoretical minimum value of the negative log-likelihood, log N_T, after roughly 200 iterations. The convergence behavior of these log-likelihoods is shown in Fig. 14, together with samples drawn from the model before and after optimization. Because the χ = 8 model does not reach the theoretical minimum, data elements not seen in the training data are present in the samples. In contrast, the χ = 16 model reaches the theoretical minimum, and so only produces data samples from the training data.
Following the classical optimization of the MPS tensors, we clean the MPS to remove small numerical values left over from the classical optimization procedure, place it into left-canonical form to describe a sequential preparation scheme, and then fix the remaining permutation ambiguity in the bond degrees of freedom using the transformation to diagonal gauge described in Sec. III A. For the case in which the diagonality center is at site 35, we find the isometries collected in Figs. 19-67 of the appendix. We compile these isometries using our greedy compilation procedure with a cost function (Eq. (43)) tolerance of ε = 5 × 10^−4, except for the diagonality center, which we compile using the methods of Ref. [64] following the straightforward unitary completion procedure defined in Sec. III B; the results are again shown in Figs. 19-67 of the appendix. In these figures, the "raw" circuits may include non-native parameterized gates such as Ŝ, as well as single-qubit rotations with rotation angles near zero, while the "expanded and cleaned" circuits compile the non-native gates into single-qubit gates and CNOTs using Eqs. (47) and (48), collect adjacent single-qubit rotations, and then remove small single-qubit rotations with additional optimization passes. We note that ε = 5 × 10^−4 translates into roughly √ε ∼ 2% error in the elements of the compiled unitary, which is comparable to the entangling gate error rates of current cloud-based machines.
To investigate the fidelity of the compiled model, we implemented the sequential preparation procedure, in which a single data qubit is coupled to three ancilla qubits using the isometries in Figs. 19-67, on the IBM qasm hardware simulator. As described in Sec. II A, the isometries are applied from site 48 down to site 0, with data qubit measurement in the z basis and reinitialization in the |0⟩ state between the applications of successive isometries. The training digits from Fig. 13 are clearly recognized as the elements with highest probability, save for the digit 5, whose highest-probability data sample involves confusion with the digit 1. We recall that deviations from the ideal training digits and their ideal occurrence probabilities of 10% are a result of the restriction of the model to χ = 8 as well as the finite optimization tolerance ε.

Comparing with Fig. 14(c), which shows samples taken from the classically trained model, it appears that the restriction to χ = 8 has a greater influence on the probabilities and sample variations than the finite compilation tolerance.
To investigate the effects of hardware noise, we use a simple model of depolarization noise in which we take ξ = 1 − F to be the average gate error, with F the average gate fidelity. With this, the depolarizing channel is parameterized by the depolarization probability p = 2^{N_q} ξ/(2^{N_q} − 1), with N_q the number of qubits, and its Kraus representation is given by operators proportional to the set of N_q-qubit products of Pauli operators, excluding the N_q-qubit identity matrix. In our model, we assign the same error ξ_2 to all CNOT gates and an error ξ_1 = ξ_2 × 10^−2 to all single-qubit gates, in accordance with typical IBM hardware characteristics. In addition to the depolarizing error, we also include an uncorrelated-qubit (tensor product) readout noise model parameterized by a single flip probability ζ [74, 75], in which P(a|b) denotes the probability of obtaining measurement outcome a from a preparation of the quantum state |b⟩, assumed identical across all qubits for simplicity. We again characterize the difference in predictions between the ideal and noisy outcomes using the convex KL divergence defined in Eq. (75). As before, we estimate the probabilities from ensembles of N_s = 2^13 shots calculated using a hardware simulator for both the noiseless "truth" model and the noisy models. This divergence is shown in Fig. 16 as a function of the error parameterizations ξ_2 and ζ. The CNOT error rate ξ_2 parameterizes both the two-qubit and single-qubit depolarization errors, and ζ parameterizes the readout error. The divergence shows a rapid rise driven by the appearance of data samples not present in the noiseless model. The samples with greatest occurrence drawn from the model evaluated at the CNOT error rate ξ_2 ∼ 0.01, indicated by the dashed vertical line in Fig. 16, and ζ = 0 are shown in Fig.
17.We can recognize many of the digits from the training set in this model, but they occur with significantly lower probabilities due to the appearance of additional noise-driven patterns.Because of the sequential preparation, we can expect that the bits near the end of the bitstring (i.e. for sites near 48) are produced at higher fidelity than those of lower site indices because of errors present in manipulating the ancilla qubits.We see that this is the case in Fig. 18, which displays the convex KL divergence as functions of the CNOT error rate ξ 2 and the number of bits sampled in the bitstring at zero measurement  75) between a noiseless and noisy simulation is shown as a function of the CNOT error rate ξ2 and the sampled bitstring length at zero readout error (ζ = 0).The probability distributions of both the noiseless and noisy models are estimated from 2 13 shots using a hardware simulator.
error ζ = 0. Non-monotonic behavior is due to the significant differences in complexity of the gate sequences to produce individual bits, see Figs. 19-67 of the appendix.A rise in error is seen as the number of bits increases as the total gate depth increases, peaking around ten bits.After this point the KL divergence levels off as additional bits may require shallower gate sequences, bringing the overall agreement between the noisy and true probability distribution closer.
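The noise model above can be sketched classically. The following minimal numpy illustration builds the depolarizing Kraus operators from the average gate error ξ, applies a symmetric single-qubit readout confusion matrix with flip probability ζ, and compares distributions with a symmetrized ("convex") KL divergence against the mixture m = (p+q)/2; the exact form of Eq. (75) may differ in normalization, so this is an assumption, not the paper's definition:

```python
import numpy as np

def depolarizing_kraus(xi, n_qubits=1):
    """Kraus operators for an N_q-qubit depolarizing channel whose
    average gate error is xi, with p = 2^Nq * xi / (2^Nq - 1)."""
    d = 2 ** n_qubits
    p = d * xi / (d - 1)
    paulis_1q = [np.eye(2, dtype=complex),
                 np.array([[0, 1], [1, 0]], dtype=complex),
                 np.array([[0, -1j], [1j, 0]]),
                 np.diag([1.0, -1.0]).astype(complex)]
    # All N_q-fold tensor products of Pauli operators (ops[0] is identity)
    ops = [np.eye(1, dtype=complex)]
    for _ in range(n_qubits):
        ops = [np.kron(a, s) for a in ops for s in paulis_1q]
    kraus = [np.sqrt(1 - p * (d**2 - 1) / d**2) * ops[0]]
    kraus += [np.sqrt(p) / d * P for P in ops[1:]]
    return kraus

def apply_channel(kraus, rho):
    return sum(K @ rho @ K.conj().T for K in kraus)

def readout_confusion(zeta):
    """Single-qubit readout matrix P(a|b), symmetric flip probability zeta."""
    return np.array([[1 - zeta, zeta], [zeta, 1 - zeta]])

def convex_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence of p and q against their mixture."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * (kl(p, m) + kl(q, m))

# Example: |0><0| through the depolarizing channel, then noisy readout
rho0 = np.diag([1.0, 0.0]).astype(complex)
rho_noisy = apply_channel(depolarizing_kraus(xi=0.01), rho0)
probs_ideal = np.real(np.diag(rho0))
probs_noisy = readout_confusion(0.02) @ np.real(np.diag(rho_noisy))
```

Trace preservation (Σ K†K = I) follows from the d² − 1 non-identity Pauli terms each carrying weight p/d²; the channel reduces to ρ → (1 − p)ρ + p I/d, consistent with the p quoted in the text.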

VI. CONCLUSIONS AND OUTLOOK
We have presented a complete workflow for generative quantum-assisted machine learning (QAML) using Born machines with a matrix product state (MPS) tensor network (TN) structure. In our workflow, classical data is encoded into quantum states using an embedding map, the ensemble of quantum states is learned as a TN Born machine using a classical DMRG-like procedure with gradient descent of the negative log-likelihood, and the model is compiled into operations for target quantum hardware to obtain data samples as measurement outcomes. Using MPS-based models enables highly resource-efficient sequential preparation schemes requiring O(1) qubits for a classical data vector of length N and O(log χ) qubits for bond dimension χ, which encapsulates the model expressivity. We presented several optimizations in the compilation stage of the workflow, such as the introduction of the diagonal gauge of the MPS model, which utilizes inherent freedom in the model representation to reduce the complexity of the compiled model, as well as greedy heuristics for finding shallow gate sequences matching a target isometry to a specified tolerance given hardware topology and allowed-gate constraints. We presented an exactly solvable benchmark model requiring two qubits and assessed its performance on currently available quantum hardware. We also presented an example application modeling features extracted from the MNIST dataset, parametrically including depolarizing and readout hardware noise using a hardware simulator.
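The sequential preparation scheme has a direct classical analogue: sampling a bitstring from an MPS Born machine site by site, carrying only the bond ("ancilla") state between sites. A minimal numpy sketch with random MPS tensors (illustrative only, not the trained model or the hardware procedure):

```python
import numpy as np

def random_mps(n_sites, chi, d=2, seed=7):
    """Random complex MPS tensors A[i] of shape (chi_left, d, chi_right)."""
    rng = np.random.default_rng(seed)
    shapes = [(1 if i == 0 else chi, d, 1 if i == n_sites - 1 else chi)
              for i in range(n_sites)]
    return [rng.normal(size=s) + 1j * rng.normal(size=s) for s in shapes]

def sample_bitstring(mps, rng):
    """Draw one bitstring from the Born distribution |psi(x)|^2 by
    conditional sampling, site by site; only the bond state conditioned
    on the bits sampled so far is carried forward."""
    n = len(mps)
    # Right environments E[i]: contraction of sites i..n-1 (for marginals)
    E = [None] * (n + 1)
    E[n] = np.ones((1, 1), dtype=complex)
    for i in range(n - 1, -1, -1):
        E[i] = np.einsum('lsr,ru,msu->lm', mps[i], E[i + 1], mps[i].conj())
    L = np.ones(1, dtype=complex)  # bond ("ancilla") state
    bits = []
    for i in range(n):
        w = [L @ mps[i][:, s, :] for s in (0, 1)]  # candidate bond states
        p = np.array([np.real(w[s] @ E[i + 1] @ w[s].conj()) for s in (0, 1)])
        p = np.clip(p, 0.0, None)
        p /= p.sum()
        s = rng.choice(2, p=p)
        bits.append(int(s))
        L = w[s] / np.linalg.norm(w[s])  # renormalize for numerical stability
    return bits

rng = np.random.default_rng(0)
sample = sample_bitstring(random_mps(6, chi=4), rng)
```

Only a χ-dimensional bond vector is retained between sites, mirroring the O(log χ) ancilla-qubit cost of the quantum sequential preparation.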
Our results lay the groundwork for utilizing TN models in practical QAML applications, and leave several avenues for future research. First, the QAML demonstrations given in this work consist of overfit models, and so do not constitute "true" machine learning models, which should be able to appropriately generalize from data. This is a result of either using data with very simple structure, as in our exactly solvable model, or using a very small sample of training data, as in our MNIST application. Small sample sizes were used in the present work to enable detailed analysis of model performance with limited quantum resources. In future work, the generalization power of TN-based QAML models on NISQ hardware will be explored moving towards the large-data regime. We also note that other studies have indicated that TN models with current training strategies generally have a tendency towards overfitting [76]. Second, we have focused on the application of MPSs to generative modeling, but other TN structures, such as tree tensor networks [35,77], may also be useful for QAML, as well as for other tasks such as feature extraction and classification. The procedures outlined in this paper can be readily adapted to compiling the isometries appearing in models for other TNs and other applications. Finally, the procedure outlined in this paper, wherein a model is trained classically before being compiled to a quantum device, cannot by itself yield a quantum advantage, as it requires the model to be both classically and quantumly simulable. However, our procedures will be useful in designing and analyzing TN-inspired model structures for scaling towards the classically intractable regime, and can also serve as "preconditioners," where a model trained using optimal classical strategies is augmented with additional quantum resources and then trained directly on the quantum device or in a hybrid quantum/classical optimization loop, potentially avoiding local minima and speeding up optimization.
FIG. 2. Example isometry for optimization. An example isometry acting on a single physical qubit in the state |0⟩ and a χ = 7-level ancilla, taken from the MNIST example in Sec. V. The isometry has been cleaned to remove small numerical values resulting from classical optimization, but no further optimization has been applied.

FIG. 3 .
FIG. 3. Application of diagonal gauge. Example isometries for an operation with a χ = 7-dimensional ancilla before (left panel; same isometry as Fig. 2) and after (right panel) applying the diagonal gauge transformation Eq. (34) to the right ancilla basis states.

FIG. 4 .
FIG. 4. Exemplar NISQ hardware architecture. The qubit layout (circles), CNOT coupling topology (lines), and error sources of the 5-qubit IBMQ-X2 device [4] as an exemplar NISQ machine. We note that these error rates are a snapshot, and are subject to fluctuations.


FIG. 5 .
FIG. 5. Example of greedy compilation procedure. Example gates, represented as circuits and matrix plots, resulting from applying the greedy compilation procedure to the isometry shown in the upper left (same isometry as in the right panel of Fig. 3). The starting ansatz is a single-qubit rotation on each qubit, given in the top center of the figure. The next row down shows the gates resulting from adding a single entangling gate to this ansatz, ordered left to right by their cost functions C. A constant penalty of 0.6 is added to the cost function for use of an F gate in ordering the priority queue, resulting in the given ordering. The gates indicated by green lines denote those passed to the next level of optimization. The procedure terminates in the gate shown at the bottom of the figure with the given cost-function tolerance of 5 × 10⁻⁴.
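The structure of the greedy search can be illustrated with a deliberately simplified toy: grow a two-qubit circuit by appending one entangling gate at a time, re-optimize only the newest layer of Ry rotations over a coarse angle grid, and keep the cheapest continuation until the cost falls below tolerance. This is a sketch of the idea only; the paper's compiler additionally uses a priority queue, gate-dependent penalties, richer gate sets, and isometry (not just unitary) targets:

```python
import numpy as np
from itertools import product

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

CNOT01 = np.array([[1, 0, 0, 0], [0, 1, 0, 0],
                   [0, 0, 0, 1], [0, 0, 1, 0]], dtype=float)
CNOT10 = np.array([[1, 0, 0, 0], [0, 0, 0, 1],
                   [0, 0, 1, 0], [0, 1, 0, 0]], dtype=float)

def cost(U, V):
    """Phase-insensitive distance between two-qubit operators."""
    return 1.0 - abs(np.trace(U.conj().T @ V)) / V.shape[0]

def best_ry_layer(prefix, target, grid):
    """Cap the circuit `prefix` with the Ry x Ry layer best matching
    `target`, searching both angles over a coarse grid."""
    best_c, best_U = np.inf, None
    for a, b in product(grid, repeat=2):
        U = np.kron(ry(a), ry(b)) @ prefix
        c = cost(U, target)
        if c < best_c:
            best_c, best_U = c, U
    return best_c, best_U

def greedy_compile(target, tol=1e-2, max_entanglers=4):
    """Greedily append entangling gates: at each step try each allowed
    CNOT orientation followed by a fresh Ry layer, keep the cheapest."""
    grid = np.linspace(-np.pi, np.pi, 33)
    c, U = best_ry_layer(np.eye(4), target, grid)
    n_ent = 0
    while c > tol and n_ent < max_entanglers:
        candidates = [best_ry_layer(G @ U, target, grid)
                      for G in (CNOT01, CNOT10)]
        c, U = min(candidates, key=lambda t: t[0])
        n_ent += 1
    return U, c, n_ent
```

A target that is itself a product of Ry rotations compiles with zero entangling gates, while a bare CNOT requires one, mirroring how the number of entanglers grows only as needed to meet the tolerance.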

FIG. 9 .
FIG. 9. Exactly solvable Born machine benchmark isometries. Isometries and optimized gates for the three-site exactly solvable Born machine benchmark. Panels (a), (d), and (g) are the isometries output from the classically trained model; panels (b), (e), and (h) are matrix plots of the unitaries output by our greedy compilation procedure; and panels (c), (f), and (i) are circuit representations of the optimized unitaries.
FIG. 10. Comparison of hand-compiled and auto-compiled circuits for exactly solvable test case. The exactly solvable benchmark with three physical qubits (0, 1, and 3) and a single ancilla qubit (2) implemented as a quantum circuit using the hand-compiled circuits of Fig. 8 (upper panel) or the auto-compiled gates of Fig. 9 (lower panel).

FIG. 11 .
FIG. 11. Comparison of ideal, measured, and noise-corrected measurement outcomes on quantum hardware. The results of the hand-compiled benchmark model shown in Fig. 10(a) are jackknifed over several days of independent experimental runs on the IBMQ-X2 [panel (a)] and IBMQ-Vigo [panel (c)] hardware. Similarly, the jackknifed results for the auto-compiled benchmark model shown in Fig. 10(b) are shown for the IBMQ-X2 [panel (b)] and IBMQ-Vigo [panel (d)] hardware. The rightmost green bars are the noiseless expectations, the center blue bars are the raw measurements, and the left orange bars have a measurement filter applied. Black lines indicate 1σ confidence intervals.
FIG. 12. Convex KL divergence between ideal and measured QAML outcomes with time. The convex KL divergence Eq. (75) between the ideal, noiseless measurement probabilities and the measurement probabilities inferred from 25 experiments of 2¹³ shots is shown as a function of experimental run day for the (a) IBMQ-X2 and (b) IBMQ-Vigo devices. Filled symbols use the raw experimental counts (center blue bars in Fig. 11) and empty symbols use the counts with the measurement noise filter applied (left orange bars in Fig. 11). Lines indicate the KL divergences computed using all measurements from all days.

FIG. 14 .
FIG. 14. Convergence of classical TN optimization. The convergence behavior of the negative log-likelihood is shown for χ = 8 (red solid lines) and χ = 16 (blue dashed lines) in panel (a). Both results use single-site gradient descent with a learning rate of η = 10⁻⁴. Panel (b) displays sample data drawn from the initial, random χ = 16 MPS model before optimization; panel (c) shows samples drawn from the χ = 8 MPS model after optimization; and panel (d) shows samples drawn from the χ = 16 MPS model after optimization. Because of the small size of the dataset, the χ = 16 model is able to reach the theoretical minimum and so memorize the full dataset.

FIG. 15 .
FIG. 15. Example data sampled from MNIST model on simulated noiseless quantum hardware. The data samples with probability ≥ 1% obtained from the sequential preparation procedure defined by the isometries in Figs. 19-67 are displayed together with their probability of occurrence estimated from 2¹³ shots.

FIG. 16 .
FIG. 16. Convex KL divergence between sampled data without and with hardware noise. The convex KL divergence Eq. (75) between a noiseless and noisy simulation is shown as a function of the CNOT error rate ξ2 and the readout error ζ. The probability distributions of both the noiseless and noisy models are estimated from 2¹³ shots using a hardware simulator. The vertical dashed line indicates a 1% CNOT error rate, a rough measure of the current state of the art for NISQ devices.

FIG. 17. Example data sampled from MNIST model on simulated noisy quantum hardware. The highest-probability data samples from the sequential preparation procedure defined by the isometries in Figs. 19-67 on a simulated machine with depolarization error rate ξ2 = 0.01 (dashed line in Fig. 16) are displayed together with their probability of occurrence estimated from 2¹³ shots.

FIG. 18. Convex KL divergence as a function of sampled bitstring length. The convex KL divergence Eq. (75) between a noiseless and noisy simulation is shown as a function of the CNOT error rate ξ2 and the sampled bitstring length at zero readout error (ζ = 0). The probability distributions of both the noiseless and noisy models are estimated from 2¹³ shots using a hardware simulator.