Trading classical and quantum computational resources

We propose examples of a hybrid quantum-classical simulation where a classical computer assisted by a small quantum processor can efficiently simulate a larger quantum system. First we consider sparse quantum circuits such that each qubit participates in O(1) two-qubit gates. It is shown that any sparse circuit on n+k qubits can be simulated by sparse circuits on n qubits and a classical processing that takes time $2^{O(k)} poly(n)$. Secondly, we study Pauli-based computation (PBC) where allowed operations are non-destructive eigenvalue measurements of n-qubit Pauli operators. The computation begins by initializing each qubit in the so-called magic state. This model is known to be equivalent to the universal quantum computer. We show that any PBC on n+k qubits can be simulated by PBCs on n qubits and a classical processing that takes time $2^{O(k)} poly(n)$. Finally, we propose a purely classical algorithm that can simulate a PBC on n qubits in a time $2^{c n} poly(n)$ where $c\approx 0.94$. This improves upon the brute-force simulation method which takes time $2^n poly(n)$. Our algorithm exploits the fact that n-fold tensor products of magic states admit a low-rank decomposition into n-qubit stabilizer states.


I. INTRODUCTION
Quantum computers promise a substantial speedup over classical ones for certain number-theoretic problems and the simulation of quantum systems [1][2][3].Experimental efforts to build a quantum computer remain in their infancy though, limited to proof-ofprinciple experiments on a handful of qubits.In contrast, the design of classical computers is a mature field offering billions of operations per second in offthe-shelf machines and petaflops in leading supercomputers.To prove their worth, quantum computers will have to offer computational solutions that rival the performance of classical supercomputers, a daunting task to be sure.
Here we study hybrid quantum-classical computation, wherein a small quantum processor is combined with a large-scale classical computer to jointly solve a computational task.To motivate this problem, imagine that a client can access a quantum computer with 100 qubits and essentially perfect quantum gates.Such a computer lies in the regime where it is likely to outperform any classical machine (since it would be nearly impossible to emulate classically).Imagine further that the client wants to implement a quantum algorithm on 101 qubits, but it is impossible to expand the hardware to accommodate one extra qubit.Does the client have any advantage at all from the access to a quantum computer in this scenario?Can one divide a quantum algorithm into subroutines that require less qubits than the entire algorithm?Can one implement each subroutine separately and combine their outputs on a classical computer?These are the main questions addressed in the present paper.Put differently, we ask how to add one virtual qubit to an existing quantum machine at the cost of an increased classical and quantum running times, but without modifying the machine hardware.More generally, one may ask what is the cost of adding k virtual qubits to an existing quantum computer of n qubits and how to characterize the tradeoff between quantum and classical resources in these settings.
As one may expect, the cost of adding virtual qubits varies for different computational models.Although the circuit-based model of a quantum computer is the most natural and well-studied, several alternative models have been proposed, such as the measurementbased [4] and the adiabatic [5] quantum computing, as well as the model DQC1 where most of the qubits are initialized in the maximally mixed state [6].Our goal is to identify quantum computing models which enable efficient addition of virtual qubits.Below we describe two examples of such models.
We begin with the model based on sparse quantum circuits.Recall that a quantum circuit on n qubits is a collection of gates, drawn from some fixed (usually universal) gate set, with n input qubits and n output qubits.Below we assume that the gate set includes only one-qubit and two-qubit gates.Let us say that a circuit is d-sparse if each qubit participates in at most d two-qubit gates.We shall be interested in the regime when d is a constant independent of n or when d grows very slowly, say d ∼ log (n).This regime covers interesting quantum algorithms that can be described by low-depth circuits [7] since any depth-d quantum circuit must be d-sparse (although the converse is generally not true).It is believed that a constant-depth quantum computation cannot be efficiently simulated by classical means only [8,9].It is also likely that early applications of quantum computers will be based on relatively low-depth circuits because they impose less stringent requirements on the qubit coherence times.
Define a d-sparse quantum computation, or d-SQC, as a sequence of the following steps: (i) initialization of n qubits in the |0 state, (ii) action of a d-sparse quantum circuit, (iii) measurement of each qubit in the 0, 1 basis, and (iv) classical processing of the measurement outcomes that returns a single output bit b out .We require that the final classical processing takes time at most poly(n).A classical or quantum algorithm is said to simulate a d-SQC if it computes probability of the output b out = 1 with a small additive error.Our first result is the following theorem, which quantifies the cost of adding k virtual qubits to a d-SQC on n qubits.Theorem 1. Suppose n ≥ kd + 1.Then any d-sparse quantum computation on n+k qubits can be simulated by a (d + 3)-sparse quantum computation on n qubits repeated 2 O(kd) times and a classical processing which takes time 2 O(kd) poly(n).
The above result is most useful when both k and d are small, for example, k = O(1) and d = O(log n).In this case both quantum and classical running time of the simulation scale as poly(n).On the other hand, we expect that a direct simulation of a d-SQC on a classical computer takes a super-polynomial time (see the discussion above).Hence the theorem provides an example when a hybrid quantum-classical simulation is more efficient than a classical simulation alone.
The proof of the theorem exploits the fact that any d-sparse quantum circuit U acting on a bipartite system AB with |A| ≈ k and |B| ≈ n can be decomposed into a linear combination of 2 O(kd) tensor product terms V α ⊗ W α , where V α and W α are d-sparse circuits acting on A and B respectively.We show that the task of simulating U can be reduced to simulating the smaller circuits W α , as well as computing certain interference terms that involve pairs of circuits W α , W β .We show that the interference terms can be estimated by a simple SWAP test which can be realized by a (d + 3)-sparse computation on n qubits.
Our second model is called Pauli-based computation (PBC).We begin with a formal definition of the model.Let P n be the set of all hermitian Pauli operators on n qubits, that is, n-fold tensor products of single-qubit Pauli operators I, X, Y, Z with the overall phase factor ±1.A PBC on n qubits is defined as a sequence of elementary steps labeled by integers t = 1, . . ., n where at each step t one performs a nondestructive eigenvalue measurement of some Pauli operator P t ∈ P n .Let σ t be the measured eigenvalue of P t .Note that σ t = ±1 since any element of P n squares to one.We allow the choice of P t to be adaptive, that is, P t may depend on all previously measured eigenvalues σ 1 , . . ., σ t−1 .The latter have to be stored in a classical memory.The computation begins by initializing each qubit in the so-called magic state Once all Pauli operators P 1 , . . ., P n have been measured, the final quantum state is discarded and one is left with a list of measured eigenvalues σ 1 , . . ., σ n .The outcome of a PBC is a single classical bit b out obtained by performing a classical processing of the measured eigenvalues.All classical processing must take time at most poly(n).We shall prove that the computational power of a PBC does not change if one additionally requires that all Pauli operators P 1 , . . ., P n pairwise commute (for all measurement outcomes).A classical or quantum algorithm is said to simulate a PBC if it computes probability of the output b out = 1 with a small additive error.An example of a PBC is shown at Fig. 1.
The PBC model naturally appears in fault-tolerant quantum computing schemes based on error correcting codes of stabilizer type [10].Such codes enable a simple fault-tolerant implementation of non-destructive Pauli measurements on encoded qubits, for example using the Steane method [11].Furthermore, topological quantum codes such as the surface code enable a direct measurement of certain logical Pauli operators by measuring a properly chosen subset of physical qubits [12].Several fault-tolerant protocols for preparing encoded magic states such as |H have been developed [13][14][15][16][17]. PBCs implicitly appeared in the previous work on quantum fault-tolerance.Our analysis closely follows the work by Campbell and Brown [18] who showed that a certain class of magic state distillation protocols can be implemented by PBCs.
Let us now state our results.First, we claim that a PBC has the same computational power as the standard circuit-based quantum computing model.Theorem 2. Any quantum computation in the circuit-based model with n qubits and poly(n) gates drawn from the Clifford+T set can be simulated by a PBC on m qubits, where m is the number of T gates, and poly(n) classical processing.
Recall that the Clifford+T gate set consists of single-qubit gates , and the two-qubit CNOT gate.This gate set is known to be universal for quantum computing.Secondly, we show that PBCs enable efficient addition of virtual qubits.
Theorem 3. A PBC on n + k qubits can be simulated by a PBC on n qubits repeated 2 O(k) times and a classical processing which takes time 2 O(k) poly(n).
Both theorems follow from the fact that a generalized PBC that incorporates unitary Clifford gates, ancillary stabilizer states (such as |0 or |+ ), and has poly(n) measurements can be efficiently simulated by the standard PBC defined above.To prove Theorem 2 we convert a given quantum circuit on n qubits with m T -gates into a generalized PBC on n + m qubits initialized in the |0 ⊗n ⊗ |H ⊗m state.Each T -gate of the circuit is converted into a simple gadget that includes adaptive Pauli measurements and consumes one copy of the |H state.Simulating such generalized PBC by the standard PBC on m qubits proves Theorem 2.
To prove Theorem 3 we represent k copies of the magic state |H as a linear combination of k-qubit stabilizer states φ α such that |H H| ⊗k = α c α |φ α φ α | for some real coefficients c α .The number of terms in this sum is 2 O(k) .We carry out the simulation independently for each α using a generalized PBC on k +n qubits initialized in the state |φ α ⊗ |H ⊗n and combine the outcomes on a classical computer.Finally, we simulate the generalized PBCs by the standard PBCs on n qubits.
Perhaps more surprisingly, we prove that PBCs can be simulated on a classical computer alone more efficiently than one could expect naively.Let us first describe a brute-force simulation method based on the matrix-vector multiplication.Let φ t be the nqubit state obtained after measuring the Pauli operators P 1 , . . ., P t .One can store φ t in a classical memory as a complex vector of size 2 n .Each step of a PBC involves a transformation φ t → φ t+1 where φ t+1 = (1/2)(I +σ t P t )φ t .Since P t is a Pauli operator, the matrix of P t in the standard basis is a permutation matrix modulo phase factors.Thus, for a fixed vector φ t , one can compute φ t+1 for both choices of σ t in time O(2 n ).Furthermore, one can compute the norm of φ t+1 in time O(2 n ) and thus determine the probability of each measurement outcome σ t .By flipping a classical coin one can generate a random variable σ t = ±1 with the desired probability distribution.Since any PBC has at most n steps, the overall cost of the classical simulation is O(n2 n ).Below we show that this brute force simulation method is not optimal.
Our simulation algorithm exploits the fact that tensor products of magic states admit a low-rank decomposition into stabilizer states.Recall that an n-qubit state φ is called a stabilizer state if |φ = U |0 ⊗n for some n-qubit Clifford operator U -a product of the elementary gates H, S, and the CNOT.
Suppose ψ is an arbitrary n-qubit state.Define a stabilizer rank of ψ as the smallest integer χ such that ψ can be written as |ψ = χ α=1 c α |φ α , where c α are complex coefficients and φ α are n-qubit stabilizer states.The stabilizer rank of ψ will be denoted χ(ψ).By definition, 1 ≤ χ(ψ) ≤ 2 n for any n-qubit state ψ and χ(ψ) = 1 iff ψ is a stabilizer state.For example, the magic state |H has stabilizer rank χ(H) = 2, since |H is not a stabilizer state itself, but it can be written as a linear combination of two stabilizer states |0 and |1 .Furthermore, using the identity one can easily check that χ(H ⊗2 ) = 2.More generally, let χ n be the stabilizer rank of |H ⊗n .Note that χ n+m ≤ χ n χ m since a tensor product of two stabilizer states is a stabilizer state.In particular, χ n ≤ (χ 2 ) n/2 = 2 n/2 .The probability to observe measurement outcomes σ 1 , . . ., σ t in a PBC implemented up to a step t can be written as where φ α are n-qubit stabilizer states, c α are complex coefficients, and Π = t a=1 (I + σ a P a )/2 is the projector describing the partially implemented PBC.We will use a version of the Gottesman-Knill theorem [19] to show that each term φ α |Π|φ β can be computed on a classical computer in time n 3 .Since the number of terms is χ 2 n and the number of steps is at most n, we would be able to simulate a PBC on n qubits classically in time (χ n ) 2 n 4 .Improving upon the bruteforce simulation method thus requires an upper bound χ n ≤ 2 βn for some β < 1/2.We establish such an upper bound with β = log 2 (7)/6 ≈ 0.468 by showing that χ 6 ≤ 7 which implies χ n ≤ (χ 6 ) n/6 ≤ 7 n/6 .We expect that the scaling in Theorem 4 can be improved by computing χ n for larger values of n.In Appendix B we describe a heuristic algorithm for computing lowrank decompositions of |H ⊗n into stabilizer states which yields the following upper bounds: We believe that these upper bounds are tight.A lower bound χ n ≥ Ω(n 1/2 ) is proved in Appendix C.

II. DISCUSSION AND PREVIOUS WORK
Classical algorithms for simulation of quantum circuits based on the stabilizer formalism have a long history.Notably, Aaronson and Gottesman [19] studied adaptive quantum circuits that contain only a few non-Clifford gates.Assuming that a circuit contains at most m non-Clifford gates and that all n qubits are initially prepared in some stabilizer state, Ref. [19] showed how to simulate such a circuit classically in time 2 4m poly(n).To enable a comparison with our results, assume that all unitary gates belong to the Clifford+T set.By Theorem 2, a quantum circuit as above can be transformed into a PBC on m qubits, where m is the number of T -gates.Thus Theorems 2,4 provide a classical simulation algorithm with a running time 2 0.94m poly(n) which improves upon [19].In addition, Ref. [19] studied adaptive quantum circuits composed only of Clifford gates and Pauli measurements with more general initial states.Assuming that the initial n-qubit state can be written as a tensor product of some b-qubit states, a quantum circuit as above can be simulated classically in time 2 2b+2d poly(n), where d is the total number of measurements [19].Methods for decomposing arbitrary states into a linear combination of stabilizer states aimed at simulation of quantum circuits were pioneered by Garcia, Markov, and Cross [20,21] who studied decompositions into pairwise orthogonal stabilizer states (named stabilizer frames).The latter are more restrictive than the general decompositions analyzed in the present paper.Furthermore, Refs.[20,21] have not studied stabilizer decompositions of magic states.
The simulation algorithm of Theorem 4 is conceptually close to the matrix multiplication algorithms based on tensor decompositions [22,23].In this case the analogue of a stabilizer state is a product state and the analogue of a magic state is a tripartite entangled state that contains EPR-type states shared between each pair of parties, see [24] for details.
Efficient classical algorithms for simulation of quantum circuits in which the initial state can be described by a discrete Wigner function taking non-negative values were investigated by Veitch et al [25] and by Howard [26] et al.As was pointed out by Pashayan, Wallman, and Bartlett [27], such methods can be combined with Monte Carlo sampling techniques to enable classical simulation of general quantum circuits with the running time scaling exponentially with the quantity related to the negativity of the Wigner function.To enable a comparison between Theorem 4 and the results of [27] one can employ a discrete Wigner function representation of stabilizer states and Clifford operations on qubits developed by Delfosse et al [28].The latter is applicable only to states with real amplitudes and to Clifford operations that do not mix X-type and Z-type Pauli operators (CSS-preserving operations).A preliminary analysis shows that combining the results of Refs.[27,28] yields a classical algorithm for simulating a restricted class of PBC on n qubits in time M 2n poly(n) ≈ 2 0.543n poly(n), where M = 2 −1 + 2 −1/2 ≈ 1.207 is the so-called mana of the magic state |H H|, see [27,28] for details.The restriction is that all Pauli operators to be measured are either X-type or Z-type, and the measurements cannot be adaptive.Such restricted PBCs are not known to be universal for quantum computation.
Our method of simulating sparse quantum circuits has connections to ideas of tensor network representations of quantum circuits developed by Markov and Shi [29].Indeed, our proof of Theorem 1 can be interpreted as a particular method of expressing the acceptance probability of a quantum computation in terms of a contraction of tensors associated with the quantum circuit.The individual entries of the tensors are then estimated separately with a smaller quantum computer and then added together.
Let us now discuss some open problems and possible generalizations of our work.A natural question is whether the scaling in Theorem 4 can be improved if |H is replaced by some other magic state.By definition, any magic state is Clifford-equivalent to one of the states |H and |R , where |R is the +1 eigenvector of an operator (X + Y + Z)/ √ 3, see Ref. [13] for details.The numerics suggests that |H ⊗n and |R ⊗n have the same stabilizer rank for n ≤ 6.We conjecture that this remains true for all n.Moreover, we pose the following conjecture which, if true, highlights a new optimality property of magic states in terms of their stabilizer rank.
Conjecture 1.Let χ n be the stabilizer rank of |H ⊗n and φ be an arbitrary single-qubit state.Then Less formally, the conjecture says that magic states have the smallest possible stabilizer rank among all non-stabilizer single-qubit states.
It is also of great interest to understand the asymptotic scaling of the stabilizer rank χ n .Assuming that a universal quantum computation cannot be simulated classically in polynomial time, one infers that χ n must grow super-polynomially in the limit n → ∞.However, we were unable to derive such a lower bound directly without using any assumptions.The fact that amplitudes of any stabilizer state in the standard basis take only O(1) different values implies a weaker lower bound χ n ≥ Ω(n 1/2 ), see Appendix C. We conjecture that in fact χ n ≥ 2 Ω(n) .Note that if this conjecture is false, that is, χ n ≤ 2 o(n) , then constantdepth circuits in the Clifford+T basis can be simulated classically in a sub-exponential time, which appears unlikely.Indeed, since such a circuit contains at most m = O(n) T -gates, where n is the number of qubits, Theorems 2,4 would provide a simulation algorithm with a running time χ 2 m •poly(n) = 2 o(n) poly(n).(Here we ignore the complexity of finding the optimal stabilizer decomposition since it has to be done only once for each n.) Finally, one may explore generalizations of the stabilizer rank to approximate decompositions into stabilizer states.It should be pointed out that the simulation algorithm of Theorem 4 would require approximate stabilizer decompositions with a precision at least 2 −Ω(n) since the probability of a particular measurement outcome σ 1 , . . ., σ t can be exponentially small in n.It is not clear whether such approximate decompositions would have a rank substantially smaller than the exact ones.
In the rest of the paper we prove the theorems stated in the introduction.From the technical perspective, Theorems 1,2,3 follow easily from the definitions and from the previously known results.On the other hand, Theorem 4 and the notion of a stabilizer rank appear to be new.We analyze sparse quantum circuits in Section III.A classical algorithm for simulation of PBCs and the stabilizer rank of magic states are discussed in Section IV.Theorems 2,3 are proved in Section V. Appendix A proves a technical lemma needed to compute inner products between stabilizer states.Appendix B describes a numerical method of computing low-rank stabilizer decompositions.Appendix C proves a lower bound on the stabilizer rank of magic states.

III. SPARSE QUANTUM CIRCUITS
In this section we prove Theorem 1.All quantum circuits considered below are defined with respect to some fixed basis of gates G.We assume that any gate in G acts on at most two qubits.Furthermore, we assume that G contains all single-qubit Pauli gates X, Y, Z, their controlled versions, the Hadamard gate, and the π/2 phase shift S = |0 0|+i|1 1|.For example, G could be the Clifford+T basis.Let Σ n ≡ {0, 1} n be the set of n-bit binary strings.Lemma 1.Let U be a d-sparse quantum circuit on k + n qubits.Partition the set of qubits as AB, where where V α and W α are d-sparse quantum circuit acting on A and B respectively, and c α are some complex coefficients such that χ α=1 |c α | 2 = 1.Proof.Since U is a d-sparse circuit, it contains at most kd two-qubit gates that couple some qubit of A and some qubit of B. Let G 1 , . . ., G m be the list of all such gates, where m ≤ kd.Any two-qubit gate G[i, j] acting on qubits i ∈ A and j ∈ B can be expanded in the Pauli basis as , where P α ∈ {I, X, Y, Z} are Pauli operators and c α are some complex coefficients such that α |c α | 2 = 1.Applying the above decomposition to each gate G 1 , . . ., G m and, if necessary, appending dummy identity gates to make m = kd, one arrives at Eq. ( 1).Note that replacing a two-qubit gate in U by a tensor product of two single-qubit Pauli gates cannot increase the sparsity of the circuit.Thus each term V α ⊗W α is a tensor product of two d-sparse circuits.
The classical post-processing step can be described by a poly(n) classical circuit f : Σ n+k → {0, 1}.By definition of the SQC model, the final output of a computation is a single random bit b out = f (x), where x ∈ Σ n+k is the bit string obtained by measuring each qubit of a state U |0 n+k in the 0, 1 basis.Let π(U ) be the probability of the output b out = 1, that is, Let us first show how to estimate the quantity π(U ) with a small additive error using dk-sparse circuits on n + 1 qubits.Substituting Eq. ( 1) into the definition of π(U ) one gets where We claim that each coefficient c α (y) can be computed exactly in time O(kd • 2 k ).Indeed, we can merge consecutive single-qubit gates of V α such that each qubit is acted upon by at most d two-qubit gates and at most d + 1 single-qubit gates.Thus we can assume that the total number of gates in V α is O(kd).One can compute the quantity y|V α |0 k classically in time O(kd • 2 k ) by performing matrix-vector multiplication for each gate of V α .Furthermore, it is clear from the proof of Lemma 1 that each coefficient c α can be computed in time O(kd).
Consider some fixed triple (y, α, β) that appears in the sum Eq. ( 3).Define a controlled-W operator Define a quantum circuit R acting on n+1 qubits that consists of the following steps: (i) initialize n+1 qubits in the |0 state, (ii) apply H gate to the first qubit, (iii) apply Λ(W ) with the first qubit acting as the control one, (iv) apply H gate to the first qubit, (v) measure each qubit in the 0, 1-basis.The construction of R, illustrated at Fig. 2, is very similar to the standard SWAP test, except that we finally measure each qubit.Let b, z be the measurement outcomes, where b = 0, 1 and z ∈ Σ n , see Fig. with probability one.By Hoeffding's inequality, one can estimate E(ξ) with a small additive error by generating c2 2k χ 2 samples of ξ for some constant c = O(1).Generating each sample of ξ requires 2 k χ 2 samples of the σ-variables.Thus one can estimate π(U ) by repeated applications of dk-sparse circuits on n + 1 qubits with the number of repetitions scaling as Recall that the dk-sparse circuits R constructed above have a very special pattern of sparsity.Namely, all qubits except for one participate in at most d twoqubit gates, whereas one remaining qubit participates in at most kd two-qubit gates.We can distribute the sparsity more evenly among all n + 1 qubits by performing a swap gate that changes position of the control qubit after each application of a control gate (this is possible only if n is sufficiently large, specifically, if n ≥ kd + 1).After this modification one obtains an equivalent circuit which is (d + 3)-sparse.
Finally, we can apply exactly the same arguments as above if the subsets A and B in Lemma 1 have size |A| = k + 1 and |B| = n − 1.This frees up one extra qubit that can play the role of the control one in the above construction.Now we can estimate π(U ) by repeated applications of (d+3)-sparse circuits on n qubits with the number of repetitions scaling as c2 16(k+1)d+3(k+1) = 2 O(kd) .This completes the proof of Theorem 1.

IV. STABILIZER RANK AND CLASSICAL SIMULATION OF PBC
In this section we prove Theorem 4. We begin with an algorithm for computing a quantity ψ|Π|φ , where ψ, φ are n-qubit stabilizer states and Π is a projector onto the codespace of some stabilizer code.We note that several previous works addressed the problem of computing the inner product ψ|φ between stabilizer states ψ, φ.In particular, Aaronson and Gottesman [19] showed that the magnitude | ψ|φ |can be computed in time O(n 3 ).Furthermore, Garcia, Markov, and Cross [21] used canonical form of Clifford circuits to compute both the magnitude and the phase of ψ|φ in time O(n 3 ).Below we describe a technically different (and somewhat simpler) algorithm which is more suited for computing the quantity ψ|Π|φ as above.
Since the proof is rather straightforward, we postpone it until Appendix A. It was shown by Dehaene and De Moor [30] and by Van den Nest [31] that any stabilizer state ψ of n-qubit can be written (up to a global phase and a normalization) as for some degree-two polynomial f : F k 2 → Z 8 , some k × n binary matrix Ψ, and some vector z ∈ F n 2 .Here we treat u and z as row vectors.In the rest of this section we take Eq. ( 5) as our definition of a stabilizer state.
Let G ⊂ P n be an abelian group with t independent generators P 1 , P 2 , . . ., P t ∈ G. Define a projector Π onto the G-invariant subspace, Π = 2 −t P ∈G P.
Lemma 3. The action of Π in the computational basis can be represented as for some degree-two polynomial g : F t 2 → Z 8 and some binary matrices A, B of size t × n.
Proof.Given a binary vector f ∈ F n 2 , let X(f ) ∈ P n be the Pauli operator that applies X to each qubit in the support of f .Define Z(f ) in a similar fashion.Let e k ∈ F t 2 be the basis vector which has a single '1' at the position k.The k-th generator of G can be written as P k = i c k X(e k A)Z(e k B) for some c k ∈ Z 4 and some binary matrices A, B of size t × n.In other words, the k-th row of A (of B) specifies the X-part (the Z-part) of P k .Choose any vector y ∈ F t 2 .Then where Clearly, g : F t 2 → Z 8 is a degree-two polynomial.Thus Consider now a pair of n-qubit stabilizer states ψ, φ, where ψ is defined in Eq. ( 5) and Here h : F m 2 → Z 8 is a degree-two polynomial, Φ is a binary matrix of size m × n, and z ∈ F n 2 is some vector.Using Lemma 3 and Eqs.(5,6) one gets Clearly the non-zero terms are those with z + uΨ = z + vΦ + yA.We can enforce this equality by introducing an extra variable x ∈ F n 2 such that Then with Note that F (u, v, x, y) is a degree-two polynomial in k +m+n+t variables.By Lemma 2, one can compute the sum F in time O(k + m + t + n) 3 = O(n 3 ).Also, Lemma 2 and Eq. ( 7) implies that ψ|Π|φ takes values 2 q/2 ω j for some integer q and j ∈ Z 8 .Consider now a PBC on n qubits as defined in Section V. Let t be some fixed time step.Recall that a sequence of measurement outcomes σ 1 , . . ., σ t is observed with the probability Below we shall construct an algorithm that takes as input a step t, a sequence of outcomes σ 1 , . . ., σ t and returns Pr(σ 1 , . . ., σ t ).It allows us to compute Pr(σ 1 ) and get a sample of σ 1 by flipping a coin with a properly chosen bias.By calling the algorithm twice one can also compute conditional probabilities Pr(σ t |σ 1 , . . ., σ t−1 ) = Pr(σ 1 , . . ., σ t ) Pr(σ 1 , . . ., σ t−1 ) .
Thus, for fixed variables σ 1 , . . ., σ t−1 one can get a sample of σ t by computing the conditional probability Pr(σ t |σ 1 , . . ., σ t−1 ) and flipping a coin with a properly chosen bias.The ability to sample the outcomes σ 1 , . . ., σ n from the distribution Pr(σ 1 , . . ., σ n ) is equivalent to simulating the PBC classically.Hence it suffices to construct an algorithm that computes Pr(σ 1 , . . ., σ t ).Suppose we are given some integers k, χ = O(1) and a decomposition where φ a are k-qubit stabilizer states and c a are complex coefficients.Suppose also that n = mk for some integer m.Taking the m-fold tensor power of Eq. ( 8) one gets where a = (α 1 , . . ., α m ), c a = c α1 The discussion above implies that each term φ a |Π|φ b can be computed exactly in time O(n 3 ).Assuming that arithmetic operations with complex numbers have a unit cost (see Remark 1 below), the probability Pr(σ 1 , . . ., σ t ) can be computed in time Let us now show an explicit decomposition Eq. ( 8) with k = 6 and χ = 7.This gives an algorithm for computing Pr(σ 1 , . . ., σ t ) with a running time O(7 n/3 n 3 ) which is enough to prove Theorem 4. It will be more convenient to normalize the magic state such that where This completes the proof of Theorem 4. The numerical method used to find the above decomposition is discussed in Appendix B. We conjecture that k = 6 is the smallest integer such that χ 2 k < 2 k , see Section I. Accordingly, k = 6 is likely to be the smallest integer for which the the above simulation strategy outperforms the brute-force simulation algorithm.
Remark 1: Let us point out that all coefficients in Eq. ( 11) belong to the ring Z[ √ 2] = {p + √ 2q : p, q ∈ Z} known the ring of quadratic integers with a base two.Hence the coefficients c a in Eq. ( 9) also belong to . Using Eq. ( 7) and Lemma 2 we conclude that each term in Eq. ( 10) has a form 2 −q ηω j for some integer 0 , and some j ∈ Z 8 .Multiplying Eq. ( 10) by a suitable power of two we can assume that each term in Eq. ( 10) has a form α + iβ where α, β ∈ Z[ √ 2] (of course we can ignore the imaginary part iβ since Pr(σ 1 , . . ., σ t ) is a real number).Thus computing the sum in Eq. ( 10) only requires arithmetic operations in the ring Z[ √ 2].Remark 2: One can notice that the first five terms in Eq. ( 11) are stabilizer states symmetric under all permutations of qubits.On the other hand, the states φ and φ break the permutation symmetry.Interestingly, we found that the state |H ⊗6 does not belong to the subspace spanned by symmetric stabilizer states of six qubits.Thus any stabilizer decomposition of |H ⊗6 must use at least two non-symmetric states.On the other hand, one can check that |H ⊗n belongs to the subspace spanned by symmetric stabilizer states for n ≤ 5.The best decompositions that we were able to find for n ≤ 5 are formed by symmetric stabilizer states, see Appendix B.

V. ADDING VIRTUAL QUBITS TO A PBC
In this section we prove Theorems 2,3.We begin with Theorem 2. Recall that we consider a quantum circuit U on n qubits in the Clifford+T basis which contains m T -gates.We assume that all qubits are initialized in the |0 state.Each qubit is finally measured in the 0, 1 basis.Let us first define a more general version of PBC called PBC * where some subset of qubits can be initialized in the |0 state.Apart from that, definitions of PBC and PBC * are the same.First we will show that U can be efficiently simulated by PBC * on n + m qubits with the initial state |0 ⊗n ⊗ |H ⊗m .Indeed, replace each T -gate of U by the gadget shown on Fig. 4.This gadget uses one ancillary qubit prepared in the magic state |T ∼ |0 + e iπ/4 |1 .The latter is equivalent to |H modulo Clifford gates, |T = e iπ/8 HS † |H .Let |ψ = α|0 +β|1 be the input state for the gadget.Let σ 1 and σ 2 be the measured eigenvalues of ZZ and IX operators, see Fig. 4. One can check that the gadget outputs a state ψ σ1,σ2 where Furthermore, all four measurement outcomes are equally likely.Applying a correcting Clifford operator I, Z, S, ZS for the measurement outcomes ++, +−, −+, −− respectively, one gets the desired T gate.Let U be the circuit obtained from U by replacing each T gate with the gadget as above.The final measurement of n qubits in the 0, 1 basis is equivalent to a non-destructive eigenvalue measurement of Z 1 , . . ., Z n after which the final state is discarded.This allows one to commute all Clifford gates of U towards the end of the circuit by properly updating which Pauli operator one has to be measured at each step.Once a Clifford gate reaches the end of the circuit, it serves no purpose and can be discarded.We conclude that U can be simulated by a PBC * on n+m qubits.Let P 1 , . . ., P r ∈ P n+m be the Pauli operators that have to be measured.We can assume that all Pauli operators P 1 , . . ., P r pairwise commute.Indeed, suppose this is not the case and let t be the first time step when P t anti-commutes with P s for some s < t.Let φ be the state reached just before the measurement of P t .Note that P s φ = ±φ and thus (I + σ t P t )φ = (σ t P t ± P s )φ.One can easily check that an operator V ≡ (σ t P t ± P s )/ √ 2 is a Clifford unitary operator whenever P t and P s anticommute.This shows that both outcomes σ t have the same probability and the measurement of P t has the same effect as drawing σ t from the uniform distribution and applying the Clifford unitary V defined above.Such a unitary V can be commuted towards the end of the circuit and discarded.Hence we can assume that all operators P 1 , . . ., P r pairwise commute.Furthermore, one can append the sequence P 1 , . . ., P r at the beginning with dummy Pauli measurements of Z i for all qubits i initialized in the |0 state.Applying the above argument again one can modify the sequence P 1 , . . ., P r such that all P t commute with the dummy measurements, that is, any operator P t acts trivially on the qubits initialized in the |0 state.Therefore such qubits serve no purpose and can be discarded.We have shown that the original circuit U can be simulated by a PBC on m qubits with r steps and pairwise commuting Pauli operators P 1 , . . ., P r .Furthermore, since the number of independent pairwise commuting Pauli operators on m qubits is at most m, we can assume that r ≤ m, that is, the PBC has the standard form.This completes the proof of Theorem 2.
Let us now prove Theorem 3. Let Q be a fixed PBC on n + k qubits and let p(Q) be the probability that the final outcome of Q is b out = 1.Our goal is to approximate p(Q) on a classical computer assisted by a PBC on n qubits.Suppose one can find a decomposition for some k-qubit stabilizer states φ i and some real coefficients α i .By linearity, one has where Q i is a PBC-type computation obtained from Q by initializing the first k qubits in the state φ i rather than |H ⊗k .We note that any stabilizer state φ i can be represented as |φ i = U i |0 ⊗k for some Clifford unitary U i .Commuting U i towards the end of Q i and properly updating which Pauli operator has to be measured at each step we can assume that |φ i = |0 ⊗k for all i.As we have already showed above, such computation Q i is equivalent to a PBC on n qubits.Let b i be the output bit of The above shows that ξ is an unbiased estimator of p(Q) and one can generate a sample of ξ by repeating a PBC on n qubits χ times.Since all variables b i are independent, the variance of ξ is bounded as Using the Monte Carlo method one can estimate p(Q) with a constant precision by generating M = min {1, O(σ 2 )} independent samples of ξ.Thus the overall cost of adding k virtual qubits is It remains to choose a decomposition in Eq. ( 12).One can decompose each copy of |H H| as a linear combination of stabilizer states using the identity where and then take the tensor product decomposition.Thus χ = 3 k and C ∼ χ = 2 O(k) .This completes the proof of Theorem 3.

APPENDIX A
In this section we prove Lemma 2. Since the constant term f ∅ contributes a multiplicative factor ω m to f , we can assume wlog that f ∅ = 0. Define coefficients g 1 , . . ., g n ∈ Z 2 such that Let S ⊆ [n] be the set of indexes a such that f a = 1, 3 (mod 4).A simple algebra shows that Let us split the sum over y into two terms corresponding to y n = 0, 1.We get Using the definition of h(y) one gets where L (z) is a linear Boolean function and H is a symmetric binary matrix with zero diagonal.Importantly, the matrix H does not depend on .It is wellknown that any matrix H as above can be transformed into a block-diagonal form with all non-zero blocks being 0 1 1 0 by a transformation H → V T HV , where V is an invertible binary matrix [32].The number of non-zero blocks in V T HV is r, where 2r is the rank of H (which is always even).Moreover, the matrix V can be computed in time O(n 3 ) using the standard linear algebra methods [32].Performing a change of variable z → V z and defining new linear functions for some coefficients u( , a) = 0, 1 and v( , a) = 0, 1 determined by L .A direct inspection shows that S ,a takes values 0 and ±2.We conclude that S takes values 0 and ±2 n−1−r .This leaves only nine possible combinations for f = S 0 + iS 1 , Namely, f = 0 (if both S 0 and S 1 are zero), or f = 2 n−1−r ω 2m for some m ∈ Z 4 (if exactly one of S 0 and S 1 is nonzero), or f = 2 n−1−r+1/2 ω 2m+1 for some m ∈ Z 4 (if both S 0 and S 1 are non-zero).This is equivalent to the statement of Lemma 2. The case when S = ∅ is completely analogous.
check that this move maps stabilizer states to stabilizer states.If the move increases the value of the objective function F , we accept the new state φ a , that is, φ a is replaced by φ a .Otherwise, the new state φ a is accepted with a probability p acc = exp [−β(F − F )], where F and F are the values of the objective function before and after the move.If (I + P )φ a = 0, the move is rejected right away.The walk is stopped as long as we observe a tuple of states with F = 1.We start with relatively small values β = β in and gradually increase β using the geometric sequence until it reaches the final value β = β f .This corresponds to the simulated annealing method.For each value of β the random walk was repeated for M 1 steps.In practice we used values β in = 1, β f = 4000, and M = 1000.The number of annealing steps was chosen as 100.Since we worked with relatively small values of n, the stabilizer states φ j were represented by vectors of size 2 n .Since our target state |φ = |H ⊗n has real amplitudes in the computational basis, one can easily show that the optimal decomposition Eq. ( 16) can be chosen such that all stabilizer states φ a have real amplitudes as well (the real part of a stabilizer state is either zero or proportional to a stabilizer state).Accordingly, we restricted the random walk to the subset of S χ n corresponding to real stabilizer states.Clearly, a move φ j → φ j = c(I + P )φ j maps real states to real states if P contains even number of Y 's.The move was accepted only if this condition is satisfied.
The best decompositions of |H ⊗n found using this method are shown below.Here we use the notations of Section IV, so that Here K = i<j Λ(Z) i,j applies cnotrolled-Z to each pair of qubits.The stabilizer decomposition of |H ⊗6 is shown in Eq. (11).By definition, the number of terms χ in these decompositions gives an upper bound on the stabilizer rank χ n .We conjecture that all above decompositions and the one in Eq. ( 11) are optimal in the sense that χ = χ n .

APPENDIX C
Let χ n be the stabilizer rank of |H ⊗n .Here we prove a lower bound χ n = Ω(n 1/2 ).
Let φ be a pure n-qubit state.Define the T -count of φ denoted τ (φ) as the minimum integer τ such that φ can be prepared starting from the all-zeros state by a quantum circuit composed of Clifford gates, T -gates, and (postselective) eigenvalue measurements of Pauli operators, such that the number of T -gates is at most τ .We claim that Indeed, as was shown in Section V, the T -gate can be realized by a gadget that consumes one copy of the magic state |H and performs (postselective) Pauli measurements.Thus we can prepare φ starting from τ (φ) copies of the magic state |H by a sequence of (postselective) Pauli measurement and Clifford operations.Since the latter do not increase stabilizer rank, we can write φ as a linear combination of χ τ (φ) stabilizer states.This is equivalent to Eq. ( 17).
We shall now choose a state φ will a relatively small T -count and a large stabilizer rank.Define Lemma 4. The state φ n has 2 n distinct amplitudes in the computational basis.
We postpone the proof of the lemma until the end of the section.Let us first show that φ n has a large stabilizer rank.Indeed, any stabilizer state has C = O(1) distinct amplitudes in the computational basis.Thus any linear combination of χ stabilizer states has at most C χ distinct amplitudes.Applying this to φ n one gets C χ(φn) ≥ 2 n , that is, χ(φ n ) = Ω(n).
Let us now show that φ n has a small T -count.First we claim that the state θ k has T -count O(k).Indeed, we can first prepare a state |+ ⊗(k+1) ⊗ |0 and then apply multiple control CNOT gate Λ k+1 (X) such that the last qubit is the target one.This creates a state Measuring the first k + 1 qubits in the X-basis and postselecting the outcome + leaves the last qubit in a state (2 k+1 − 1)|0 + |1 , which coincides with θ k modulo a bit-flip.One can easily check that the multiple control CNOT gate Λ k+1 (X) can be implemented using O(k) Toffoli gates.Furthermore, the Toffoli gate can be implemented using seven T -gates [33,34] This is equivalent to Eq. ( 19).Now let s(K) and s(M ) be the sum of all elements in K and M respectively.Assume wlog that s(K) ≥ s(M ).Then Eq. (18) implies 2 s(K)−s(M ) = m∈M (1 Here the last inequality follows from Eq. (19).Thus s(K) = s(M ) and k∈K Let k 1 and m 1 be the smallest elements of K and M respectively.Assume wlog that m 1 ≥ k 1 .Let us show that in fact m 1 = k 1 .Indeed, otherwise m 1 ≥ k 1 + 1.Then Eq. ( 20) implies ξ ≤ 1 − 2 −k1 and Here the last inequality follows from Eq. (19).Thus (1 − 2 −m1 ) 2 < 1 − 2 −k1 which implies m 1 < k 1 + 1 leading to a contradiction.We conclude that k 1 = m 1 .Thus we can cancel the factor (1 − 2 −k1 ) in both parts of Eq. ( 20) and use induction in the number of elements in the largest of the sets K, M to show that K = M .

FIG. 1 .
FIG. 1. Example of a PBC on n = 3 qubits.Each step t involves an eigenvalue measurement of a Pauli operator Pt on n qubits with an outcome σt = ±1.A choice of Pt may depend on the outcomes of all previous measurements.A PBC on n qubits can be described by a binary tree T of height n such that internal nodes of T are labeled by n-qubit Pauli operators and leaves of T are labeled by 0 and 1.The latter represent the final output bit bout.We require that label of any node of T can be computed classically in time poly(n).
the set of all n-bit strings and B n,k ⊂ B n be the subset of strings with the Hamming weight exactly k.Let B n = E n ∪ O n , where E n and O n are the subsets of even-weight and odd-weight strings respectively.Given a set of bit strings S, we shall write |S = x∈S |x for the uniform superposition of all strings in S. For example, |B n,0 = |0 ⊗n , |B n,n = |1 ⊗n , and |H ⊗n = n k=0 t k |B n,k .Define also a state |K n = x∈Bn (−1) |x|(|x|−1)/2 |x = i<j Λ(Z) i,j |B n Note that |B n,0 , |B n,n , |E n , |O n , and |K n are stabilizer states as defined by Eq. (5).Define also a pair of graphs G = (V , E ) and G = (V , E ) with six vertices shown on Fig. 3.The desired stabilizer decomposition of |H ⊗6 is

FIG. 3 .
FIG. 3. Graphs G and G used in the definition of stabilizer states φ and φ , see Eq. (11).

= y∈F n 2 i
where g(x) = n a=1 g a x a + 1≤a<b≤n f a,b x a x b .Let us first assume that S = ∅.Without loss of generality S n (otherwise permute the variables).Define a new summation variable y ∈ F n 2 such that y a = x a for a = 1, . . ., n − 1 and y n = a∈S x a .Note thatx a = y a if a = 1, . . ., n − 1, y n + a∈S\n y a if a = n.Using the identity i a∈S xa = i a∈S xa (mod 2) • (−1) a<b∈S xax b one arrives atf yn • (−1) h(y) with h(y) = a / ∈S g a y a + a∈S\n (g a + g n )y a + g n y n + 1≤a<b≤n−1f a,b y a y b + n−1 a=1 f a,n y a y n + n−1 a=1 b∈S\n f a,n y a y b .
• • • c αm , and φ a= φ α1 ⊗ • • • ⊗ φ αm .Note that φ a are stabilizer states and for a given index a one can compute the standard form of φ a as defined in Eq. (5) in time O(n).