General-purpose quantum circuit simulator with Projected Entangled-Pair States and the quantum supremacy frontier

Recent advances on quantum computing hardware have pushed quantum computing to the verge of quantum supremacy. Random quantum circuits are outstanding candidates to demonstrate quantum supremacy, which could be implemented on a quantum device that supports nearest-neighbour gate operations on a two-dimensional configuration. Here we show that using the Projected Entangled-Pair States algorithm, a tool to study two-dimensional strongly interacting many-body quantum systems, we can realize an effective general-purpose simulator of quantum algorithms. This technique allows to quantify precisely the memory usage and the time requirements of random quantum circuits, thus showing the frontier of quantum supremacy. With this approach we can compute the full wave-function of the system, from which single amplitudes can be sampled with unit fidelity. Applying this general quantum circuit simulator we measured amplitudes for a $7\times 7$ lattice of qubits with depth $1+40+1$ and double-precision numbers in 31 minutes using less than $93$ TB memory on the Tianhe-2 supercomputer.

Quantum computers offer the promise of efficiently solving certain problems that are intractable for classical computers, most famously factorizing large numbers [1][2][3].With the rapid progress of various quantum systems towards Noisy Intermediate-Scale Quantum computing devices [4][5][6][7][8][9][10][11], we are now on the verge of quantum supremacy [12], i.e. demonstrating that a quantum computer has the ability to do a computation that no classical computers can tackle, an important milestone in the field of computer science.Various candidates have been suggested to demonstrate quantum supremacy, such as BosonSampling [13,14], the instantaneous quantum polynomial protocol [15,16] and random quantum circuits (RQCs) [3,17] which demand less physical resources and are easier to implement compared to, for instance, factorization.
A central aspect for all these near-term supremacy proof-of-principle computations is to produce a quantum state using as fewer number of qubits as well as quantum gate operations as possible, which would nevertheless be highly entangled and hence difficult to obtain and/or characterize by a classical computer, for instance by sampling from it in the computational basis.In the meanwhile, it is important to find effective ways to simulate accurately quantum algorithms on classical computers, which could be used as a benchmarking baseline and to validate near term quantum devices.In the field of quantum many-body physics, tensor network states are often used to efficiently represent quantum states with a sizeable amount of entanglement [18,19].The storage required by these tensor network states is closely related to the amount of entanglement of the quantum state.Recently, matrix product states (which are one-dimensional tensor networks) have been applied to simulate quantum circuits [20].However, the performance of matrix product states is much less effective if the underlying quantum system is essentially two-dimensional.In this work, we present an efficient and generic quantum circuit simulator based on the Projected Entangled-Pair States (PEPS) [19,[21][22][23][24][25][26][27][28], a type of tensor-network quantum states representation designed for two-dimensional lattices.Our PEPS-based simulator is a general-purpose quantum circuit simulator for arbitrary quantum circuits: it stores the full quantum state and it can be readily used to compute single amplitudes, observables, and also perform sequences of quantum measurements.
While the quantum circuit simulator we present can tackle generic circuits, in the following we focus on RQCs.They consist of a series of single and two-qubit gates which are applied to different qubits in a particular order.A group of commuting gates, which can be applied simultaneously, constitutes one layer of the circuit, and the more groups of operations that do not commute, the deeper the circuit is.More precisely, for the depth of a circuit we will use the notation (1 + d + 1) where the 1 s indicate the Hadamard gates applied to each site at the be-ginning and at the end of the calculations, while d is the number of non-commuting layers including controlled-Z (CZ) gates and single qubit gates applied to different sites.RQCs are the standard benchmark for quantum supremacy as put forth by [3].The general complexity of quantum supremacy experiments is studied in [29].For RQCs, it was previously shown in [30] that the complexity scales exponentially with min(O(dL h ), O(N )).
RQCs have thus stimulated the search for efficient classical algorithms which would show where exactly the limits of classical simulations are [17,[30][31][32][33][34][35][36][37][38].State of the art algorithms can be mainly divided into two categories.i) State-vector approach which stores the quantum state as a vector and evolves it directly.For example, in [31] a 45-qubit simulation is reported based on this approach.However, this approach is limited by the number of qubits due to the exponential growth of the Hilbert space.ii) Tensor-based approach, which represents the quantum states as tensors and specifies the input and output states as rank-1 Kronecker projectors.This approach is less sensitive to the number of qubits and has been pursued more actively.For instance, a full amplitude simulation of a 7 × 7 circuit to depth (1 + 39 + 1) was implemented in 4.2 hours on Sunway TaihuLight supercomputer [33], which however exploits the weakness in the original design of RQCs in [3].Recently, it was proposed to trade circuit fidelity for computational efficiency so as to match the fidelity of a given quantum computer [36,37], and practically compute around 1 million amplitudes of a 7 × 7 circuit to depth (1 + 40 + 1) with 0.5% circuit fidelity in 2.44 hours on Summit supercomputer [38].Our approach differs from the above approaches in that we use PEPS as the data structure to represent the quantum states.Quantum gate operations as well as quantum projections are adapted accordingly to this new data structure.
Quantum Circuit Simulator Based on PEPS.In the following we consider a two-dimensional rectangular lattice of size L v ×L h , where L v and L h are, respectively, the sizes in the vertical and horizontal directions.We use N = L v L h to denote the total number of qubits.The quantum state on such a lattice can be represented as a PEPS [21,23,24] where A σn n is a rank-5 tensor with elements [A σn n ] l,r,u,d at site n, with σ = 0, 1 corresponding to the physical dimension, and l, r, u, d corresponding to the left, right, up and down auxiliary dimensions, see Fig. 1(a).The function F in Eq. (1) indicates the sum over the common auxiliary indices.The bond dimension χ is defined as the maximum size of the four auxiliary dimensions, and it characterizes the size of the PEPS.In the language of PEPS, a single-qubit gate operation U τn σn on site n only operates locally on the n-th tensor A σn n (shown in Fig. 1(b)), which can be written as As we can see from Eq. ( 3), the size of the local tensor is not affected by a single-qubit gate operation.For a two-qubit gate acting on a horizontally nearest-neighbour pair of qubits (n, m) (shown in Fig. 1(c)), denoted as O τn,τm σn,σm , we first use a by singular value decomposition (SVD) to factorize it into a product of two local tensors where the singular values have been absorbed into U .The size of the auxiliary dimension s is denoted as χ o , which, for any two-qubit controlled gate, is χ o = 2.The two local tensors U and V are then applied on the two qubits n and m separately, as single-qubit gate operations Here we haved used the indices r = (r, s), l = (s, l), which bundles the two tensor dimensions into one.As a result, χ increases by a factor of χ o .To keep χ in a affordable size, one would usually use a subsequent singular value decomposition to compress the resulting tensors by throwing away singular values below a suitably chosen threshold.However, we point out that for RQCs we cannot perform such a compression because the distribution of the singular values after the two-qubit gate operation is almost flat, making it impossible for compression (this is also an indication that this problem has large entanglement across the whole circuit).Calculating a single amplitude of the final state |ψ is done by projecting |ψ onto a separable PEPS which encodes one spin configuration | τ , and then contracting the resulting tensor network, which can be written as where the rank-4 tensor ] l,r,u,d .This calculations are depicted in Fig. 1(d).To this end, we also note that with our method it is also straightforward to simulate sequences of quantum measurements.Concretely, to measure an N -th qubit system, we can first compute the probability that a qubit is in state |0 or |1 .Then, we use another copy of the wavefunction (which is stored as PEPS), project the measured qubit in the relevant state, measure another qubit and so forth.In between different measurements more gates can be applied too, all seamlessly because we can effectively and efficiently compute and store the wavefunction of the system.
Application to random quantum circuits and complexity analysis.In the following, we apply our PEPS simulator to study the two-dimensional RQCs of [39,40].The simulation of this circuit is divided into two parts: (i) circuit evolution and (ii) computing the overlap with randomly selected spin configurations, namely calculating the amplitudes.To quantify the size of the bond dimension required by the tensors, we realize that a single-qubit operation does not affect the size of the tensor it operates on, while a nearest-neighbour two-qubit controlled operation increases the sizes of the two tensors it operates on by a factor of 2 as shown previously [41].This results in where . . . is the ceiling function.The equality in Eq. ( 8) is reached if the depth d can be divided by 8 (each nearest-neighbour pair of sites will be acted on by a CZ gate in every 8 depths).As can be seen from Eqs. (3,5,6), the cost of each gate operation on PEPS scales as O(χ 4 ), which is relatively cheap.As a result, circuit evolution can be performed very efficiently.In fact, we can simulate the exact evolution of a 12 × 12 lattice to a depth (1 + 40 + 1) within minutes on a personal laptop.
In contrast, a well-known result about PEPS is that exactly computing the overlap as in Eq. ( S1) is an exponentially hard problem [42].While there exist approximate algorithms to evaluate Eq. (S1) which scale polynomially with χ [22,25,26], they are inadequate for RQCs due to the large entanglement of the states produced.In the following we ignore both the space and time complexity of circuit evolution and only focus on calculating one amplitude, since the cost of the former stage is negligible compared to the latter.
We have developed different strategies to evaluate Eq. (S1) efficiently, depending on the shape and size of the lattice.A generic strategy which works for any rectangular lattice has space and time complexities (assuming L v ≥ L h ) given by For square lattices, specialized tensor contraction strategies can be used to further reduce the complexity or for better parallelization (see [40] for details of these strategies).We highlight here that Eqs.(S7,S8) are more accurate estimates for space and time complexities compared to the results of [30], and the exact value will depend on the details of the particular implementation on the hardware.However, these numbers can work as a theoretical approximate benchmarking baseline for achieving quantum supremacy.
To give more precise numbers, using Eqs.(S7,S8) we can evaluate that simulating a 8 × 8 lattice to a depth (1 + 40 + 1) (same space complexity of a 10 × 10 circuit to a depth (1 + 32 + 1)) would require 32 TB of memory, while simulating a 8 × l (with l > 8) lattice to a depth (1 + 40 + 1) would require about 0.5 PB memory.However, simulating a 9 × 9 lattice with a depth (1 + 40 + 1) would require 16 PB (petabytes) memory and simulating a 12 × 12 lattice to a depth (1 + 32 + 1) would require 8 PB memory, which are currently out of reach.Our circuit simulator can straightforwardly be extended to other types of two-dimensional lattices including Google Bristlecone QPU architecture.By applying a complexity analysis to this architecture, we find that it only requires less than a manageable 0.6 PB of memory to simulate an RQC with 72 qubits at depth (1 + 32 + 1) (for details of this analysis see [40]).
To demonstrate the performance of our method, we have implemented small scale simulations on a personal computer, which takes less than 1 hour to compute one Our PEPS-based method can be readily scaled up onto a massive parallel computing platform.We implemented the large scale tensor contractions based on an opensource software package Cyclops Tensor Framework [43].The massive parallel benchmarking was executed on the Tianhe-2 supercomputer [44].We have simulated a 7 × 7 circuit with depth (1 + 40 + 1) and a 10 × 10 circuit with depth (1+26+1).The simulation of the 7×7×(1+40+1) circuit was done on 4096 nodes (22%) of Tianhe-2, taking 31 minutes and 92.51 TB memory in total [40].Our large-scale simulation results are listed in TABLE I.
Conclusions.In this work we have adapted the Projected Entangled-Pair States representation of quantum states from many-body quantum physics to build a general-purpose quantum circuit simulator.This simulator can be used to store effectively highly entangled wavefunctions, and it is readily adaptable to compute expectation values or simulate sequential quantum measurements.With this circuit simulator, we have computed an accurate estimate for the space and time complexity analysis of a standard random quantum circuit [39].Based on this analysis, we point out that simulating an 8 × l circuit to a depth (1 + 40 + 1) or a Bristlecone-72 circuit to a depth (1 + 32 + 1) are within reach of current supercomputing platforms.
We have implemented numerical experiments on a personal computer with a 8×8 circuit to a depth (1+25+1), and on Tianhe-2 supercomputer with a 10 × 10 circuit to a depth (1 + 26 + 1), as well as a 7 × 7 circuit to a depth (1 + 40 + 1).Currently we compute the amplitudes exactly, however we could also investigate the trade-off between fidelity and speed, so as to be able to sample many trajectories.For instance, we could reduce the memory requirement of our method by using the 'cut' technique in [38], namely mapping a large ten- sor contraction into summations over many smaller tensor contractions by unraveling several for-loops.More importantly, PEPS-based techniques which are currently used in quantum many-body physics can be transferred to the study of quantum circuits, for example for contractions and the evaluation of expectation values [45].These investigations, which could be particularly useful for circuits in which the wavefunction can be effectively compressed, are left for future works, together with the plan to include the effects of noise or errors in order to characterize more closely the actual behavior of a noisy intermediate-scale quantum computer.
We gratefully acknowledge the help from China Greatwall Technology and National Supercomputing Center in Guangzhou.We thank Sergio Boixo  For a L v × L h qubit lattice, the Random Quantum Circuit (RQC) defined by [3] is described as follows: 1. Apply a Hadamard gate to each qubit to initialize the qubits to a symmetric superposition.
2. Apply controlled-phase (CZ) gates alternating between eight configurations similar to Fig. S1 to entangle neighbouring qubits.
3. Apply a randomly chosen gate (T, X 1/2 or Y 1/2 ) to each qubit on which the CZ gates has not just been applied, according to the rules in [3].
4. Repeat steps 2 and 3 to add layers of depth to the circuit.
5. Apply a final Hadamard gate to each qubit.
It has been proven that this random quantum circuit satisfies both average-case hardness and anti-concentration condition [17], and hence it cannot be efficiently simulated on a classical computer.

ALGORITHM FOR EXACT COMPUTATION OF THE OVERLAP
Depending on the shape of the lattice, we have developed three different strategies to evaluate the contraction of the tensor netwrok, which are shown in Fig. S2.
In the following we first show a generic way to evaluate the overlap in the main text exactly, namely the equation  Assuming L h ≤ L v , we first contract all the tensors on the first row to get a rank-L h tensor where the bottom legs where we have used the fact that for E 21 one has the size dim(l (2,1) ) = 1 and u (2,1) = d (1,1) .The resulting tensor G 1 is a rank-(L h + 1) tensor.Then we contract G 1 with the second tensor in the second row E (2,2) and get where we have used the fact that for E (2,2) one has l (2,2) = r (2,1) and u (2,2) = d (1,2) , and the resulting tensor G 2 is again a rank-(L h + 1) tensor.We can repeat this procedure and move on to the right until we have contracted all the tensors on the second row and get where we have used the fact dim(r (2,L h ) ) = 1 and redefined G L h and F 2 .Noticing that F 2 has the same structure as F 1 , therefore we repeat this procedure until we have reached the last row and get F L , which is a scalar since all the indexes dim(d (Lv,n) ) = 1 for 1 ≤ n ≤ L h .Thus we get From this analysis it appears that the largest tensor involved in this procedure is rank-(L h +1).Moreover, for L h > L v , instead of moving from top down, it is straightforward to slightly modify the algorithm to move from left to right, and the largest tensor involved would become rank-(L v + 1).Therefore the memory required scales exponentially with the exponent min(L h + 1, L v + 1).this circuit to a depth (1 + 32 + 1) with our circuit simulator scales as 2 32/8×11+1 = 2 45 , which corresponds to less than 0.6 PB memory.

MASSIVE PARALLEL BENCHMARKING ON SUPERCOMPUTER
We have implemented our large scale tensor contraction algorithms based on an open-source software package Cyclops Tensor Framework [43], with MPI and OpenMP as the parallel interfaces.The massive parallel benchmarking was then executed on Tianhe-2 supercomputer.According to the features of the supercomputer platform and the results of the scaling test, we chose to use one MPI process with 24 OpenMP threads on each node.Each normal node contains two 12-core CPUs, and is equipped with 64GB (128 GB on each fat node) memory.The maximum number of nodes used reaches 4,096 (98,304 compute cores in total), which is less than 1/4 of the whole system, and since we only use CPUs, the peak performance we use is ∼ 1.73 PFlops.All our calculations are done with double-precision numbers.Our results are listed in Table .1 of the main manuscript.
The numerical simulation with the largest number of qubits is a 10 × 10 circuit with d = (1 + 26 + 1), which is done on 1,024 normal nodes and takes 6 minutes to measure one amplitude, using the partitioning strategy as in Fig. S2(c).The numerical simulation with the largest depth is a 7 × 7 circuit with d = (1 + 40 + 1), which is done on 4, 096 fat nodes and takes 31 minutes.On each fat node 23.13 GB memory is used, and thus this simulation takes 92.51 TB memory in total (detailed data can be found in supplementary information).To pursue efficiency, parts of the data is duplicated on several computing nodes to reduce the cost of data communication, leading to a larger memory usage than theoretical prediction 16 TB.Here we note that recently in [38] the authors compute, with 0.5%

FIG. 1 :
FIG. 1: (a) PEPS on a 5 × 5 lattice, each qubit of the lattice is represented with a rank-5 tensor [A σn n ] l,r,u,d , where σn = 0, 1 labels the physical dimension and l, r, u, d label the auxiliary dimensions which connect [A σn n ] l,r,u,d to the tensors on the neighbouring sites.(b) single-qubit gate operation on the PEPS.(c) Two-qubit gate operation on PEPS.(d) Overlapping of two PEPSs by contraction of all the physical dimensions of the two PEPSs and all the auxiliary dimensions inside each PEPS.

FIG. 2 :
FIG.2:The blue circles show the log transformed probabilities from calculating 10000 amplitudes, while the red line is the log transformed Porter-Thomas distribution.The circuit size is 8 × 8 with a depth (1 + 25 + 1).
FIG. S1: Layout of the CZ gates for the rank-2 random quantum circuit.
FIG.S2: Contracting strategies for different lattices.(a) Generic contracting scheme for lattices with Lv ≥ L h .One first contracts the tensors on each horizontal line from top down.For the case with Lv < L h , one can contract the tensors on each vertical lines for left to right.In this strategy, the largest stored tensor is of rank min(L h , Lv) + 1.(b) For a square lattice with an even number of qubits, namely L h = Lv = 2m, the tensor network is divided into four square sub-lattices with sizes m × m.Each sub-lattice is contracted to get a single larger tensor, and then they are contracted to get the probability amplitude.In this strategy, the largest stored tensor has √ N indices.(c) For a square lattice with an odd number of qubits, namely L h = Lv = 2m + 1, the tensor network is divided into four sub-lattices, with sizes (m + 1) × m, (m + 1) × (m + 1), m × m and m × (m + 1).The subsequent contraction follows (b).In this strategy, the largest stored tensor is rank-√ N + 1 .The strategies (b) and (c) allow easier parallelism.

1 =Fd ( 1 , 1 ) 1 ×
are written explicitly to indicate that they are not contracted in this step.Note also the notation for the position with two numbers instead of one, i.e. (n, m) indicates the qubit on the n−th row and m−th column.Next we contract F 1 with the first tensor in the second row E (2,1) and get G r (2,1) ,d (2,1) ,d (1,2) ,...,d (1,L h ) ,...,d (1,L h ) E (2,1) r (2,1) d (1,1) d (2,1), (S3) FIG. S3: The space and time complexity of RQCs based on the PEPS quantum circuit simulator for a circuit with L h × Lv qubits.(a) Scaling of time complexity C t with Lv when Lv > L h .(b) Scaling of C t with Lv when Lv = L h .(c) Scaling of space complexity C s with Lv when Lv > L h .(d) Scaling of C s with Lv when Lv = L h .The grey dotted lines in (c) and (d) represent the current memory limit of supercomputers (2.17PB for Tianhe-2 and 2.67 PB for summit).

TABLE I :
Large-scale simulation with PEPS based circuit simulator.The column denoted by "Node usage" indicates the number of cores used divided by the total available on Tianhe-2, and the corresponding percentage."Qubits" and "Depth" describe the circuit analyzed while "Elapsed time" shows the time required to compute one amplitude.
and Giacomo Nannicini for helpful discussions.C. G. acknowledges support from National Natural Science Foundation of China under Grants No. 11504430 and No. 11805279.H.-L. H. acknowledges support from the Open Research Fund from State Key Laboratory of High Performance Computing of China (Grant No. 201901-01), National Natural Science Foundation of China under Grants No. 11905294, and China Postdoctoral Science Foundation.D.P. acknowledges support from the Singapore Ministry of Education, Singapore Academic Research Fund Tier-II (project MOE2016-T2-1-065).J.W. acknowledges support from National Natural Science Foundation of China under Grants No. 61632021. )