Architectures for quantum simulation showing a quantum speedup

One of the main aims in the field of quantum simulation is to achieve a quantum speedup, often referred to as "quantum computational supremacy", referring to the experimental realization of a quantum device that computationally outperforms classical computers. In this work, we show that one can devise versatile and feasible schemes of two-dimensional dynamical quantum simulators showing such a quantum speedup, building on intermediate problems involving non-adaptive measurement-based quantum computation. In each of the schemes, an initial product state is prepared, potentially involving an element of randomness as in disordered models, followed by a short-time evolution under a basic translationally invariant Hamiltonian with simple nearest-neighbor interactions and a mere sampling measurement in a fixed basis. The correctness of the final state preparation in each scheme is fully efficiently certifiable. We discuss experimental necessities and possible physical architectures, inspired by platforms of cold atoms in optical lattices and a number of others, as well as specific assumptions that enter the complexity-theoretic arguments. This work shows that benchmark settings exhibiting a quantum speedup may require little control in contrast to universal quantum computing. Thus, our proposal puts a convincing experimental demonstration of a quantum speedup within reach in the near term.


I. INTRODUCTION
Quantum devices promise to efficiently solve computational problems for which no efficient classical algorithm exists. The anticipated device of a universal quantum computer would solve problems for which no efficient classical algorithm is known, such as integer factorization [1] and simulating many-body Hamiltonian dynamics [2]. However, the experimental realization of such a machine requires fault-tolerant protection of universal dynamics against arbitrary errors [3][4][5], which incurs a prohibitive qubit overhead [6,7] beyond the reach of available quantum devices. This does not mean, however, that the demonstration of a computational quantum advantage is infeasible with current technology.
Indeed, in recent years, it has become a major milestone in quantum information processing to identify and build a simple (perhaps non-universal) quantum device that offers a large (exponential or superpolynomial) computational speedup compared to classical supercomputers, no matter what. The demonstration of such an advantage based on solid complexity-theoretic arguments is often referred to as "quantum computational supremacy" [8]. This important near-term goal still constitutes a significant challenge, as technological advances seem to be required to achieve it, as well as significant efforts in theoretical computer science, physics, and the numerical study of quantum many-body systems: after all, intermediate problems have to be identified with the potential to act as vehicles in the demonstration of a quantum advantage, in the presence of realistic errors.
There already is evidence that existing dynamical quantum simulators [9,10] have the ability to outperform classical supercomputers. Specifically, the experiments of Refs. [11][12][13][14] using ultracold atoms strongly suggest such a feature: They probe situations in which for short times [11] or in one spatial dimension [12,13], the system can be classically simulated in a perfectly efficient fashion using tensor network methods, even equipped with rigorous error bounds. However, for long times [11] or in higher spatial dimensions [12,13] such a classical simulation is no longer feasible with state-of-the-art simulation tools. Still, taking the role of devil's advocate, one may argue that this could be a consequence of a lack of imagination, as there could, in principle, be a simple classical description capturing the observed phenomena. Hence, a complexity-theoretic demonstration of a quantum advantage of quantum simulators outperforming classical machines is highly desirable [15]. Not all physically meaningful quantum simulations need to be underpinned by such an argument, but it goes without saying that the field of quantum simulation would be seriously challenged if such a rigorous demonstration were out of reach.
Several settings for achieving a quantum speedup have been proposed [16][17][18][19][20] based on quantum processes that are classically hard to simulate probabilistically unless the Polynomial Hierarchy (PH) collapses. These processes remain hard to simulate up to realistic (additive) errors assuming further plausible complexity-theoretic conjectures. The proof techniques used build upon earlier proposals giving rise to such a collapse [21,22]. At the same time, however, these proposals still come with substantial experimental challenges.
This work constitutes a significant step towards identifying physically realistic settings that show a quantum speedup by laying out a versatile and feasible family of architectures based on quenched local many-body dynamics. We remain close to what one commonly conceives as a dynamical quantum simulator [9][10][11]23], set up to probe exciting physics of interacting quantum systems. Indeed, it is our aim to remain as close as possible to experimentally accessible or at least realistic prescriptions, closely reminiscent of dynamical quantum simulators, while at the same time not compromising the rigorous complexity-theoretic argument.
Our specific contributions are as follows. We focus on schemes in which random initial states are prepared on 2D square lattices of suitable periodicity, followed by quenched, constant-time dynamics under a local nearest-neighbor (NN), translation-invariant (TI) Hamiltonian. These are prescriptions that are close to those that can be routinely implemented with cold atoms in optical lattices [9,24-26]. Since the evolution time is short, decoherence will be comparably small. In a last step, all qubits are measured in a fixed identical basis, producing an outcome distribution that is hard to classically sample from within constant $\ell_1$-norm error, requiring no postselection. Technically, our results implement sampling over new families of NNTI two-local constant-depth [21] IQP circuits [17,22,27]. We build upon and develop a type of setting [28] in which resource states for measurement-based quantum computation (MBQC) [29] are prepared, but subsequently measured non-adaptively. We lay out the complexity-theoretic assumptions made, detail how they are analogous to those in Refs. [16][17][18][19][20], and present results on anti-concentration.
By doing so, we arrive at surprisingly flexible and simple NNTI quantum simulation schemes on square lattices, requiring different kinds of translational invariance in the preparation. Interestingly, and possibly counterintuitively, our schemes share the feature that the correctness of the final state before the readout step can be efficiently and rigorously certified. This is further achieved via simple protocols that involve on-site measurements and a number of samples of the resource state that scales quadratically in the system size. The possibility of certification is unique to our approach. In fact, from the quadratically many samples of the prepared state, one can directly and rigorously draw inferences about the very quantity that is used in the complexity-theoretic argument. We believe that the possibility of such certification is crucial when it comes to unambiguously arguing that a quantum device has the potential to show a true quantum speedup.
Based on our analysis, we predict that short-time certifiable quantum-simulation experiments on square lattices as small as 50 × 50 qubits should be intractable for state-of-the-art classical computers [18,30]. It is important to stress that this assessment includes the rigorous certification part, and no hidden or unknown costs have to be added to this. Our proposed experiments are particularly suited to qubits arranged in two-dimensional lattices, e.g., cold atoms in optical lattices [9,24-26] with qubits encoded in hyperfine levels of atoms. When assessing feasible quantum devices, it is crucial to emphasize that system sizes of the kind discussed here are not larger but generally smaller than what is feasible in present-day architectures [9,11,13,26].

II. BASIC SETUP OF THE QUANTUM SIMULATION SCHEMES
We present a new family of simple physical architectures that, with high evidence, cannot be efficiently simulated by classical computers (cf. Theorem 1 below). All share the basic feature that they are based on the constant-time evolution (quench) of an NNTI Hamiltonian on a square lattice. Each architecture involves three steps:

Q1 Preparations. Arrange $N := \mu m n$ qubits side-by-side on an $n$-row, $m$-column square lattice $\mathcal{L}$, with vertices $V$ and edges $E$, initialized in a product state $|\psi_\beta\rangle$ [Eq. (1)] with on-site rotation angles $\beta_i \in \{0, \theta\}$ for fixed $\theta \in \{\pi/4, \pi/8\}$, where $\beta$ is chosen uniformly or randomly with probability $p_\beta$ (e.g., as a ground state of a disordered model). We consider standard square primitive cells. In one scheme, we allow each vertex to be equipped with an additional qubit, named a "dangling-bond qubit". For this scheme $\mu = 2$; otherwise $\mu = 1$.

Q2 Couplings. Let the system evolve for constant time $\tau = 1$ under an NNTI Ising Hamiltonian $H$ [Eq. (2)]. This amounts to what is usually referred to as a quench. Local fields $\{h_i\}_i$ and couplings $\{J_{i,j}\}_{i,j}$ are set to implement a unitary $U := e^{iH}$, giving rise to a final ensemble of states $|\Psi_\beta\rangle = U|\psi_\beta\rangle$.

Q3 Measurement. Measure primitive-cell qubits in the $X$ basis and (if present) dangling-bond qubits in the $Z$ basis. Since the latter can be traded for a measurement in the $X$ basis by a uniform basis rotation, one can equally well measure all qubits in the same basis.

Figure 1. Architectures I-III. Colors illustrate the rotation angle of the initial state (1): $\beta_i = 0$ (blue), $\beta_i = \pi/4$ (yellow), and $\beta_i = \pi/8$ (crimson). Solid lines between qubits represent Ising-type interactions (2) with coupling constants $J_{i,j} = \pi/4$ (gray) and $J_{i,j} = \pi/8$ (black). $X$ and $Z$ label the basis in which the respective qubits are to be measured.

As we will discuss later, all individual steps have been realized with present technology. Note also that Q2 amounts to a constant-depth quantum circuit [31].
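The three steps Q1-Q3 can be illustrated with a brute-force state-vector simulation on a very small lattice. The following is only a minimal sketch for architecture I, under two assumptions that are not spelled out in this section: the single-qubit input states are normalized as $(|0\rangle + e^{i\beta_i}|1\rangle)/\sqrt{2}$, and the quench is replaced directly by the controlled-$Z$ gates it is stated to implement. The function name and its parameters are hypothetical, and the exponential memory cost restricts it to a handful of qubits.

```python
import numpy as np

def simulate_architecture_I(n_rows, n_cols, theta=np.pi / 4, shots=5, seed=0):
    """Toy state-vector simulation of steps Q1-Q3 (architecture I, tiny lattices only)."""
    rng = np.random.default_rng(seed)
    n = n_rows * n_cols
    dim = 2 ** n

    # Q1: disordered product state, each beta_i drawn uniformly from {0, theta}.
    # Assumed normalization: (|0> + e^{i beta_i}|1>) / sqrt(2).
    beta = rng.choice([0.0, theta], size=n)
    state = np.ones(1, dtype=complex)
    for b in beta:
        state = np.kron(state, np.array([1.0, np.exp(1j * b)]) / np.sqrt(2))

    # Q2: the quench is modeled as a CZ gate on every lattice edge,
    # i.e. a sign flip on basis states where both endpoint qubits are 1.
    idx = np.arange(dim)
    bit = lambda q: (idx >> (n - 1 - q)) & 1  # value of qubit q in each basis state
    site = lambda r, c: r * n_cols + c
    for r in range(n_rows):
        for c in range(n_cols):
            if c + 1 < n_cols:
                state = state * (1 - 2 * (bit(site(r, c)) & bit(site(r, c + 1))))
            if r + 1 < n_rows:
                state = state * (1 - 2 * (bit(site(r, c)) & bit(site(r + 1, c))))

    # Q3: X-basis measurement on all qubits = Hadamard on each qubit, then Z readout.
    had = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
    psi = state.reshape([2] * n)
    for q in range(n):
        psi = np.moveaxis(np.tensordot(had, psi, axes=([1], [q])), 0, q)
    probs = np.abs(psi.reshape(dim)) ** 2
    probs = probs / probs.sum()  # guard against floating-point drift
    samples = rng.choice(dim, size=shots, p=probs)
    return beta, probs, [format(s, f"0{n}b") for s in samples]

beta, probs, samples = simulate_architecture_I(2, 3, shots=4)
print(samples)
```

Each returned bit string is one single-shot readout; the hardness claims concern lattice sizes far beyond what such a direct simulation can reach.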

A. Physical desiderata and concrete schemes
Before presenting concrete schemes, we lay out desiderata that we associate with feasible ones. We require the implementation of each step Q1-Q3 to be as simple as possible. For preparations, couplings, and measurements, we desire the periodicity, as measured by the 2D periods $(k_x, k_y)$ of the TI symmetry along the $x$ and $y$ axes, to be small. Couplings should further be simple and should not scale with the system size. Last, we want the final measurement to be translationally invariant.
We now present three concrete quantum architectures of the form Q1-Q3 that live up to the above desiderata. We label them I-III and illustrate them in Fig. 1:

I A disordered (DO) product state is prepared on a square lattice, followed by a quench with an Ising Hamiltonian with couplings $J_{i,j} = h_i = \pi/4$ (which implements controlled-$Z$ ($CZ$) gates on edges) and a final measurement in the $X$ basis.

II The initial state is TI with period 1 in one lattice direction and uniformly random in the other (TI$_{(1,\infty)}$); couplings and measurements are picked as in I.

III Qubits are prepared on a dangling-bond square lattice. The initial state is TI with period 1 in all directions (TI$_{(1,1)}$). We pick $J_{i,j}$, $h_i$ as in I-II on bright edges and $J_{i,j} = h_i = \pi/16$ on dark ones; the latter implement controlled-$e^{-i\pi/8 Z}$ ($CT$) gates on dangling bonds. Measurements are in the $Z$ basis for dangling qubits and in the $X$ basis elsewhere.

(Cf. Appendix A for full Hamiltonian specifications.) The resources needed in each architecture are summarized in Table I. It is worth noting that, in all architectures I-III, the state prepared after Q2 is a resource for postselected measurement-based quantum computation, with postselection on the outcomes of the measurements in Q3 (section VI B), but as such does not amount to universal quantum computation. In fact, architectures I-III require neither adaptive measurements (which are key in MBQC [29]) nor physical postselection: our result states that, if three plausible complexity-theoretic conjectures hold, a single-shot readout cannot be classically simulated.

Table I. Resource requirements of our hard-to-simulate quantum architectures. (Only one row of the table body survives: previous work, Ref. [19], 7-fold brickwork graph.) Architectures which employ simpler (more ordered) initial states require lattices of higher periodicity and finer controlled-rotations. The degree of symmetry of the preparation, evolution, and measurement steps is quantified by the 2D periods indicated by vector subscripts $(a, b)$. We compare our results to the simplest previously known quantum simulation architecture on a planar graph showing a quantum speedup. The underlying complexity-theoretic assumptions needed for these speedups are compared in section III. Above, $X_\theta := e^{-i\frac{\theta}{2}Z} X e^{i\frac{\theta}{2}Z}$.
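The statement that couplings $J_{i,j} = h_i = \pi/4$ implement $CZ$ gates on edges can be checked numerically for a single edge. The sketch below assumes a per-edge convention $H = J\,Z\otimes Z - h\,(Z\otimes 1 + 1\otimes Z)$; this sign and bookkeeping convention is an assumption made here for illustration (the full specification is in Appendix A), and the helper names are hypothetical.

```python
import numpy as np

Z = np.diag([1.0, -1.0])
I2 = np.eye(2)

def ising_quench_unitary(J, h):
    """exp(iH) for the assumed two-qubit H = J Z(x)Z - h (Z(x)1 + 1(x)Z).

    H is diagonal in the computational basis, so we exponentiate entrywise.
    """
    H_diag = J * np.diag(np.kron(Z, Z)) - h * (
        np.diag(np.kron(Z, I2)) + np.diag(np.kron(I2, Z))
    )
    return np.diag(np.exp(1j * H_diag))

def equal_up_to_phase(U, V, tol=1e-12):
    """True if U = e^{i phi} V for some global phase phi."""
    r, c = np.unravel_index(np.argmax(np.abs(V)), V.shape)
    phase = U[r, c] / V[r, c]
    return bool(abs(abs(phase) - 1) < tol and np.allclose(U, phase * V, atol=tol))

CZ = np.diag([1.0, 1.0, 1.0, -1.0])
U = ising_quench_unitary(np.pi / 4, np.pi / 4)
print(equal_up_to_phase(U, CZ))  # True: U = e^{-i pi/4} CZ under this convention
```

Under the same convention, halving the couplings to $\pi/8$ yields a controlled-phase gate that is no longer equal to $CZ$, which illustrates why weaker couplings correspond to finer controlled rotations.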

III. MAIN RESULT
We now turn to stating that, while the above three architectures I-III are physically feasible, the output distributions of their measurements cannot be efficiently simulated on classical computers, based on plausible assumptions and standard complexity-theoretic arguments. As in previous works [16,17,20,21,32-34], Theorem 1 relies on plausible complexity-theoretic conjectures. The first, originally adopted in Ref. [21], is a widely believed statement about the structure of an infinite tower of complexity classes known as "the Polynomial Hierarchy" (PH), the levels of which recursively endow the classes P, NP, and coNP with oracles to previous levels.
The claim generalizes the familiar P $\neq$ NP conjecture in that P = NP would imply a complete collapse of PH to its 0-th level. Furthermore, if two levels $k$, $k+1$ coincide, then all classes above level $k$ collapse to it. The available evidence for P $\neq$ NP makes Conjecture 1 plausible, for it would be surprising to find a collapse of PH to some level $k$ but not a full one [35] (cf. Ref. [36] for further discussion). Similarly to the Riemann hypothesis in number theory, many theorems in complexity theory have been proven relative to Conjecture 1, probably most notably the Karp-Lipton theorem NP $\not\subset$ P/poly [37].
We highlight that, assuming Conjecture 1 only, a classical computer would still not be able to sample from our experiments either exactly or within any constant relative error (cf. section VI D). However, such a level of accuracy is physically unrealistic, for it cannot be achieved by a quantum computer. A goal of this work is to understand how likely it is for architectures I-III to remain classically intractable under realistic errors.
Our second conjecture, adopted from Ref. [17], is a qubit analog of the "permanent-of-Gaussians" conjecture [16]. It states that partition functions of (unstructured) random Ising models should be equally hard to approximate in the average and worst case. Now, let $(a, b) := (a_1, \ldots, a_{N_X}, b_1, \ldots, b_{N_Z})$ be the outcomes of the $X$ and $Z$ measurements in our architectures, with $b_i = 0$ for I-II. In appendix B we show that the probability of these outcomes is determined by $|Z^{(\alpha,\vartheta)}|^2$ [Eq. (3)], where $Z^{(\alpha,\vartheta)} := \mathrm{tr}(e^{iH^{(\alpha,\vartheta)}})$ is the partition function of a random Ising model $H^{(\alpha,\vartheta)}$ [Eq. (4)] on an $n \times m$ square lattice $\mathcal{L}_{\rm sq}$, where $\theta \in \{\pi/4, \pi/8\}$ is chosen as in step (Q1), and $\alpha$ (resp. $\vartheta$) is random and DO- (resp. either DO- or TI$_{(1,\infty)}$-) distributed.
We complement Conjecture 2 with the following lemma.
Lemma 2 (#P-hardness). Let $H^{(\alpha,\beta)}$ be the Ising model (4) on the $n \times m$ square lattice with either (i) DO-distributed $\vartheta$ and $\theta \in \{0, \pi/4\}$; or (ii) TI$_{(1,\infty)}$-distributed $\vartheta$ and $\theta \in \{0, \pi/8\}$. Then, for $m \in O(n^2)$, approximating $|Z^{(\alpha,\beta)}|^2$ with relative error $1/4 + o(1)$ is #P-hard.

Thus, accepting Conjecture 2 implies that approximating $|Z^{(\alpha,\beta)}|^2$ for these models is as hard on average as any problem in #P [38]. The proof (section VI C) applies MBQC methods [39] to show that I-III are computationally equivalent to an encoded $n$-qubit 1D nearest-neighbor circuit comprising random gates of the form (5), where $c_i$ (resp. $d_i$) is DO- (resp. DO- or TI$_{(1,\infty)}$-) distributed and $H$ is the Hadamard gate. Post-selecting such circuits, we can implement two known universal schemes of quantum computation [40,41]. We then exploit that universal quantum-circuit amplitudes are #P-hard to approximate. As a remark, we discuss that the bound $m \in O(n^2)$ in Lemma 2 might not be optimal. In fact, we believe the result should still hold for $m \in O(n)$ (possibly for a different constant error) based on two pieces of evidence.
(i) On the one hand, our anti-concentration numerics (appendix C) indicate that $O(n)$-depth universal random circuits of gates of form (5), whose output probabilities are in one-to-one correspondence via (6) with the instances of $|Z^{(\alpha,\beta)}|^2$, are Porter-Thomas distributed: the latter is a signature of quantum chaos, and of our quantum circuits being approximately Haar-random [18,42-46]. Hence, our numerics suggest that our $n \times n$-qubit lattices efficiently encode chaotic approximately-Haar-random $n$-qubit unitaries.
(ii) On the other hand, we analytically show that #P-hardness arises for $m \in O(n)$ and slightly different choices of input states (resp. dangling bonds) in architectures I-II (resp. III) (cf. appendix D).
Last, we claim that random circuits of gates of form (5) anticoncentrate.
Conjecture 3 (Anti-concentration). Let $C$ be an $n$-qubit, $O(n)$-depth random circuit of gates of form (5); then the anti-concentration bound of Eq. (6) holds for a uniformly random choice of $x = (x_1, \ldots, x_n)$.
In section VI D, we show that Eq. (6) is a sufficient condition for the output distribution of architectures I-III to display anti-concentration. Analogous numerically supported conjectures have been made in Refs. [16,18,47]. Here, we ran exact simulations of random circuits with up to 20 logical qubits to test Conjecture 3 (appendix C), and observed, first, that the anti-concentration ratio of Eq. (6) quickly converges to $1/e$ with the system size; and, second, that measurement outcomes are Porter-Thomas [42] (i.e., exponentially) distributed, which is a signature of chaotic Haar-random unitary processes [18,43-46].
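The value $1/e$ is what one expects for Porter-Thomas statistics: if output probabilities are exponentially distributed with mean $2^{-n}$, the fraction of outcomes with probability at least $2^{-n}$ is $e^{-1} \approx 0.368$. The sketch below illustrates this with a Haar-random state vector used as a stand-in for $C|0\rangle$; it does not implement the gate set (5) or the numerics of appendix C, and the function name is hypothetical.

```python
import numpy as np

def anticoncentration_fraction(n_qubits, seed=0):
    """Fraction of outcomes x with probability >= 2^-n for a Haar-random state."""
    rng = np.random.default_rng(seed)
    dim = 2 ** n_qubits
    # A normalized complex Gaussian vector is a Haar-random pure state
    # (assumption: stand-in for the output state of a chaotic random circuit).
    amp = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    probs = np.abs(amp) ** 2
    probs /= probs.sum()
    return float(np.mean(probs >= 1.0 / dim))

for n in (8, 12, 16):
    print(n, anticoncentration_fraction(n), "vs 1/e =", np.exp(-1))
```

For growing $n$ the printed fraction fluctuates ever more tightly around $1/e$, which is the behavior reported above for the anti-concentration ratio.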
Previously, Refs. [17,20,48] argued that anti-concentration of measurement outcomes on an $n \times n$ lattice should require $\Omega(n)$ physical depth on 2D NN layouts in order not to induce a violation of the counting exponential time hypothesis [49,50]. This contrasts with the constant-depth nature of our proposal. To clarify this discrepancy, we note that our numerical evidence for anti-concentration is for logical $n$-qubit 1D circuits (5) of depth $O(n)$ (Fig. 6), which implies anti-concentration of the corresponding constant-depth evolution on a lattice of size $n \times O(n)$. Because this encoding introduces a linear overhead factor $O(n)$, there is no contradiction with Refs. [20,48]. More critically, the observed signatures of anti-concentration rule out a potential efficient classical simulation of our schemes via sparse-sampling methods [20,51].
We end this section with a remark: Closest to our work is the approach of Ref. [19], which is also an NNTI non-adaptive MBQC proposal, albeit with larger resource requirements (see Table I for a comparison). What is more, it requires a stronger hardness assumption with regard to the required level of approximation, in that it introduces a variation of Conjecture 2 with a less natural inverse-exponential additive error (cf. Appendix F). Furthermore, Refs. [18,20,47] gave non-TI schemes based on time-dependent NN random circuits acting on square lattices: the latter approaches require fewer qubits, but also circuits of polynomial depth. In our scheme and in that of Ref. [19], circuit depth is traded for ancillas and kept constant. For both, efficient certification protocols also exist and can be used to determine whether the experiment has actually worked, as discussed below.

IV. EFFICIENT CERTIFICATION OF THE FINAL RESOURCE STATES
It is key to all schemes proposed that the correctness of the final resource-state preparation in the quantum simulation can be efficiently and rigorously certified. Since the prepared state is the ground state of a gapped and frustration-free parent Hamiltonian $H_{\rm parent} = \sum_i h_i$, Ref. [52] gives a scheme involving local measurements only that certifies the closeness of the prepared state $\rho$ to the anticipated state $|\Psi_\beta\rangle\langle\Psi_\beta|$ immediately before measurement in terms of an upper bound on the trace distance $\| |\Psi_\beta\rangle\langle\Psi_\beta| - \rho \|_1$ [52] (see also Ref. [53]). This directly yields an upper bound on the $\ell_1$-norm distance between the respective measurement outcome distributions.
The key idea of the protocol of Ref. [52] is to estimate the energy $E_\rho = \mathrm{tr}[\rho H_{\rm parent}]$, which yields a fidelity witness. This can be done, for example, by measuring the local Hamiltonian terms $h_i$, which are five- (six-) body observables for architectures I-II (III). Ref. [52] showed that this approach requires $O(N^2 \log N)$ samples of the state preparation $\rho$ to estimate a single term $h_i$ with polynomial accuracy. Hence, the full certification protocol requires $O(N^3 \log N)$ independent preparations of $\rho$ and five- (six-) body measurements. Though this scaling is efficient, its supra-cubic time scaling and the complexity of the local measurements could render it impractical for near-term experiments with thousands of atoms.

Figure 2. Certification protocol. We illustrate how our scheme works for architecture I. Qubits with thick (thin) borders denote odd (even) sites in $V_{\rm odd}$ ($V_{\rm even}$). The figure illustrates a pattern of on-site measurements for one execution of the certification protocol that measures the energy of $H_{\rm odd} = \sum_{i \in V_{\rm odd}} h_i$ for the configuration of initial states in Fig. 1. On-site measurements are of type $Z$, $X$, and $X_{\pi/4}$. Three Hamiltonian terms whose joint measurement can be simulated from these single-qubit measurements are singled out by the depicted dashed diamonds.
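For orientation, the standard argument behind an energy-based fidelity witness can be written out as follows. This is a textbook-style sketch, assuming a unique ground state $|\Psi_\beta\rangle$ of $H_{\rm parent}$ with ground-state energy $E_0$ and spectral gap $\Delta$, and writing $F := \langle\Psi_\beta|\rho|\Psi_\beta\rangle$; the precise normalization and constants used in Ref. [52] may differ.

```latex
% Operator inequality implied by a unique ground state and a gap \Delta:
H_{\rm parent} \;\geq\; E_0\,|\Psi_\beta\rangle\langle\Psi_\beta|
  + (E_0+\Delta)\,\bigl(\mathbb{1}-|\Psi_\beta\rangle\langle\Psi_\beta|\bigr)
% Taking the expectation value in \rho:
\;\Longrightarrow\;
E_\rho = \mathrm{tr}[\rho H_{\rm parent}] \;\geq\; E_0 + \Delta\,(1-F)
\;\Longrightarrow\;
F \;\geq\; 1-\frac{E_\rho-E_0}{\Delta}.
% Fuchs-van de Graaf inequality for a pure target state:
\bigl\|\,|\Psi_\beta\rangle\langle\Psi_\beta|-\rho\,\bigr\|_1
  \;\leq\; 2\sqrt{1-F}
  \;\leq\; 2\sqrt{\frac{E_\rho-E_0}{\Delta}}.
```

An estimate of $E_\rho$ that lies close enough to $E_0$, relative to the gap, therefore bounds the trace distance and hence the $\ell_1$ distance between the outcome distributions.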
We now introduce three optimizations to the protocol of Ref. [52], analyzed in Appendix E (Lemmas 8-12). First, for any non-degenerate gapped local Hamiltonian with known ground-state energy $E_0 \in O(1)$ and known gap $\Delta \in \Omega(1)$, we show that the sample complexity of the protocol can be reduced to $O(N^2)$ by exploiting parallel measurement sequences of commuting Hamiltonian terms. Second, we show how to implement this more resource-economical protocol using on-site measurements only by expanding the latter terms in a local product basis. Third, we introduce a few setting-dependent optimizations for the architectures I-III (Lemma 10), tailored to their underlying square lattice geometry, the explicit tensor product structure of the parent Hamiltonian of their pre-measurement states (Appendix E 1), and their translation-invariant symmetry. We emphasize that, as in the sampling measurement step Q3 of our architectures, the optimized certification protocol relies on on-site measurements only.
In the three aforementioned cases (Appendix E), the certification measurement pattern inherits the initial symmetry of the preparation step (Table I), i.e., DO for architecture I, TI$_{(1,\infty)}$ for architecture II, and TI$_{(\sqrt{2},\sqrt{2})}$ for architecture III. For architectures I-II, our setting resembles a certification protocol for preparing a family of hypergraph states given in Ref. [54], though states and measurements therein are asymmetric.
The above certification measurement is a slightly more difficult prescription than the experiments as such, yet as simple as one could hope, since we need to measure in additional, albeit single-qubit, bases. However, it is key to see that the correctness of the final-state preparation of an experiment can be certified even in the absence of a known classical algorithm for sampling its output distribution. This is also in contrast to other similar schemes, where no efficient rigorous scheme for certification of the final state before measurement is known [16-18, 20, 47].
Note that the certification protocol of Ref. [52] is stated in terms of noise-free measurements. Nevertheless, for certain noise models it can be shown that rigorous certification is still possible [19]. Moreover, it is not an unreasonable assumption that local measurements can be benchmarked to very high precision. Most importantly, the protocol readily accepts imperfect measurements: The imperfect measurements can be seen as perfect measurements, preceded by a quantum channel reflecting the noise. This quantum channel noise can equally well be seen as acting on the quantum state, reducing the trace-norm closeness to the anticipated state. Hence, certifying $\epsilon$-closeness of the state preparation under the assumption of ideal measurements is equivalent to certifying $\epsilon'$-closeness, with $\epsilon' < \epsilon$, of the state preparation using measurements preceded by a noise channel with norm at most $\epsilon - \epsilon'$. Setting up a detection scheme that lives up to the required error bounds (which scale inversely with the system size for each on-site measurement) is demanding, but not unrealistic.
Last, we highlight that the property of the final-state preparation being certifiable is rooted in the fact that the states prepared are ground states of gapped frustration-free local Hamiltonian models. At the same time they are injective projected entangled pair states (PEPS) of constant bond dimension [55,56]. The protocols discussed here can hence be seen as PEPS sampling protocols that generate samples from local measurements on PEPS.
Certification protocol. Let us now outline the precise certification protocol (analyzed in Appendix E), including the required quantum measurements and the post-processing of the measurement outcomes. We do so in three steps: first, we find the parent Hamiltonians for the state preparations $|\Psi_\beta\rangle$ in architectures I-III. Second, we show how on-site measurements are sufficient to obtain a rigorous certificate. Finally, we comment on the reduced sampling complexity $O(N^2)$ of the protocol. We refer the reader to Appendix E (Lemmas 8-12) for proofs of the results described below.
Observing that the resource states $|\Psi_\beta\rangle$ are stabilizer states, all we need to do is find the appropriate stabilizers. The sum of the stabilizers is then a parent Hamiltonian of $|\Psi_\beta\rangle$. For architectures I and II this yields Eq. (7) (cf. App. E 1), where $X_{\beta_i,i} := e^{-i\frac{\beta_i}{2}Z_i} X_i e^{i\frac{\beta_i}{2}Z_i}$ is a rotated Pauli-$X$ operator acting on site $i$ and $\beta_i$ is distributed as described in Q1. Hence, the Hamiltonian consists of $N$ terms that are 5-local except at the boundary of the lattice, where their locality is reduced to 4- or 3-local. In the specific case of architecture III, a dangling-bond qubit is attached to each qubit via a $CT$ interaction. This yields a two-body term that replaces the $X_{\beta_i,i}$ term in Eq. (7) (see App. E 1).
The stabilizers $h_i = X_{\beta_i,i} \prod_{j:(i,j)\in E} Z_j$ in architectures I-II can be measured using on-site measurements in a demolition fashion, by first measuring their tensor components and then multiplying their outcomes using classical post-processing. (This procedure is reminiscent of the syndrome measurement of subsystem codes [57], and its correctness can easily be seen by decomposing each stabilizer into an eigenbasis.) These act on distinct sites and can therefore be measured simultaneously. For the specific case of architecture III, the local Hamiltonian terms have on-site $Z$ factors as well as a two-body component $CT\, X\, CT^\dagger$. However, the purpose of measuring these stabilizers in the certification protocol of Ref. [52] (see Appendix E) is to estimate the average energy of the relevant parent Hamiltonian. To this end, we can, w.l.o.g., expand $CT\, X\, CT^\dagger$ as a sum of product operators (Eq. E3), measure the on-site factors appearing in this sum, and infer the target expected value using efficient classical post-processing. Hence, for our three architectures, on-site measurements suffice.
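That the operators $h_i = X_{\beta_i,i} \prod_{j:(i,j)\in E} Z_j$ stabilize the pre-measurement state can be verified directly on a small lattice. The sketch below assumes the normalization $(|0\rangle + e^{i\beta_i}|1\rangle)/\sqrt{2}$ for the inputs, models the quench by $CZ$ gates on the edges (architectures I-II), and uses the definition $X_\theta := e^{-i\frac{\theta}{2}Z} X e^{i\frac{\theta}{2}Z}$ from Table I; the function names are hypothetical.

```python
import numpy as np
from functools import reduce

def kron_all(ops):
    """Tensor product of a list of vectors or matrices (first entry = qubit 0)."""
    return reduce(np.kron, ops)

def rotated_x(beta):
    """X_beta = e^{-i beta Z/2} X e^{i beta Z/2} as an explicit 2x2 matrix."""
    return np.array([[0, np.exp(-1j * beta)], [np.exp(1j * beta), 0]])

def stabilizer_expectations(betas, edges):
    """Return <Psi|h_i|Psi> for every site i of a small graph with CZ edges."""
    n = len(betas)
    Z, I2 = np.diag([1.0, -1.0]), np.eye(2)
    # Product input state (assumed normalization), then CZ on every edge.
    psi = kron_all([np.array([1, np.exp(1j * b)]) / np.sqrt(2) for b in betas])
    idx = np.arange(2 ** n)
    for (i, j) in edges:
        bi = (idx >> (n - 1 - i)) & 1
        bj = (idx >> (n - 1 - j)) & 1
        psi = psi * (1 - 2 * (bi & bj))
    values = []
    for i in range(n):
        nbrs = {b for a, b in edges if a == i} | {a for a, b in edges if b == i}
        ops = [rotated_x(betas[i]) if q == i else (Z if q in nbrs else I2)
               for q in range(n)]
        values.append(np.vdot(psi, kron_all(ops) @ psi).real)
    return values

# 2 x 2 lattice, sites numbered row by row.
edges = [(0, 1), (2, 3), (0, 2), (1, 3)]
print(stabilizer_expectations([0.0, np.pi / 4, np.pi / 4, 0.0], edges))
```

Every expectation value equals 1 up to rounding, so the state is a common $+1$ eigenstate of all $h_i$. Since each $h_i$ is a product of single-site operators, its value in a given run is the product of the corresponding on-site $\pm 1$ outcomes.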
Finally, the sampling complexity can be reduced from supra-cubic to quadratic, $O(N^2)$, by simultaneously measuring commuting stabilizers (directly, or reducing them to on-site measurements as outlined above) on the same state preparation, following a specific pattern (Fig. 2). Precisely, we can define a lattice 2-coloring $V = V_{\rm odd} \cup V_{\rm even}$ and simultaneously measure $Z$ on all sites $i \in V_{\rm odd}$ and $X_{\beta_j,j}$ on every site $j \in V_{\rm even}$ (or vice versa). Since our Hamiltonian is commuting, each measurement round allows us to sample from the output distribution of $H_{\rm odd} := \sum_{i \in V_{\rm odd}} h_i$ ($H_{\rm even} := \sum_{i \in V_{\rm even}} h_i$), as shown in Fig. 2. Hence, roughly $N/2$ terms of the form $h_i$ can now be measured in parallel. A simple application of Hoeffding's bound then shows that we can estimate the expected energy of $H_{\rm odd}$ ($H_{\rm even}$) using $O(N^2)$ samples. Our proof concludes by noting that $\mathrm{tr}(H\rho) = \mathrm{tr}(H_{\rm odd}\rho) + \mathrm{tr}(H_{\rm even}\rho)$.
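The quadratic scaling can be made concrete with a back-of-the-envelope Hoeffding estimate. The sketch below assumes that each measurement round returns one value of a sum of `num_terms` outcomes, each equal to $\pm 1$, and asks how many rounds are needed to estimate its expectation to absolute accuracy `eps` with failure probability at most `delta`. The function and parameter names are hypothetical, and the constants are those of the plain Hoeffding inequality rather than of Appendix E.

```python
import math

def hoeffding_samples(num_terms, eps, delta):
    """Number of rounds m with P(|empirical mean - expectation| >= eps) <= delta.

    Each round yields a sum of `num_terms` +/-1 outcomes, so a single sample
    lies in [-num_terms, +num_terms]; Hoeffding's inequality then gives
    m >= width^2 * ln(2/delta) / (2 * eps^2).
    """
    width = 2 * num_terms
    return math.ceil(width ** 2 * math.log(2 / delta) / (2 * eps ** 2))

# H_odd contains roughly N/2 terms, so width = N and m grows like N^2.
for N in (100, 200, 400):
    print(N, hoeffding_samples(N // 2, eps=1.0, delta=0.05))
```

Doubling $N$ at fixed `eps` and `delta` quadruples the required number of rounds, consistent with the $O(N^2)$ sample complexity quoted above.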

V. CONCEIVABLE PHYSICAL ARCHITECTURES
We now turn to discussing that the above assumptions are plausible in several physical architectures close to what is available with present technology. What we are considering are large-scale quantum lattice architectures on square lattices $\mathcal{L}$ with a quantum degree of freedom per lattice site. On the level of physical implementation, the most advanced family of such architectures and the most plausible for the anticipated system sizes is that provided by cold atoms in optical lattices [9]. In an optical lattice architecture, internal degrees of freedom are available with hyperfine levels. Also, the encoding in spatial degrees of freedom within double wells is in principle conceivable. Large-scale translationally invariant controlled-$Z$ interactions, precisely of the type required for the preparation of cluster and graph states [39,58], are feasible via controlled collisions [25,59]. Actually, the discussion of controlled collisions [59] and hence the theoretical underpinning of such quenched dynamics triggered work on cluster states and measurement-based quantum computing and predates this development. Other interactions to nearest neighbors are also conceivable. Interactions such as spin-changing collisions for $^{87}$Rb atoms have been experimentally observed [60]. Controlled-$T$ gates require a more sophisticated interaction Hamiltonian. The dangling bonds seem realizable making use of optical superlattices [11,61]. Single sites, specifically of the sampling type considered here, can be addressed in optical lattice architectures via several methods. E.g., quantum-gas microscopes allow for single-site resolved imaging [62,63], even though the type of single-site addressing required here remains a significant challenge. Optical superlattices [11,61] allow for the addressing of entire rows of sites in the same fashion. Entire rows that contain a fixed particle number, with neighboring ones left empty, can already be routinely prepared [64]. Also, disordered initial states can be prepared [12,14,65].
But other architectures are also conceivable. These include in particular large arrays of semiconductor quantum dots allowing for single-site addressing (a setting that has already been employed to simulate the Mott-Hubbard model in the atomic limit [66]), or polaritons or exciton-polariton systems in arrays of micro-cavities [67]. In this type of architecture, the addressing of entire rows is also particularly feasible. Superconducting architectures also promise to allow for large-scale array structures of the type anticipated here [68][69][70]. Trapped ions can also serve as feasible architectures [71]. None of the physical architectures realizes all elements required to the necessary precision, but at the same time, the prescriptions presented here are comparably close to what can be done.

VI. PROOF OF HARDNESS RESULT
In this section we prove Lemma 2 and Theorem 1, and develop the main techniques of the paper. The section is organized as follows:
• In section VI A, we use MBQC techniques to develop mappings that allow us to recast architectures I-III as (computationally equivalent) MBQCs on 2D cluster states (as introduced in [29]).
• In section VI B, we show that enhancing architectures I-III with the ability (or an oracle) to post-select the outcomes of random variables makes them as powerful as a post-selected universal quantum computer (as defined in [72]).
• In section VI C, we prove Lemma 2 using earlier findings and a new parallelization technique to implement the two-local "dense" IQP circuits of Ref. [17] in linear depth on a 1D nearest-neighbor architecture.
• In section VI D, we give the proof of our main result, Theorem 1. The proof makes use of Lemma 2 and Stockmeyer's Theorem [73]. The latter is applied in an analogous way as in the Boson-sampling and IQP-circuit settings [16,17] to show that if an efficient classical algorithm can approximately sample from the output distribution of architectures I-III, then an FBPP$^{\rm NP}$ algorithm can approximate a large fraction of the amplitudes in (3), if the latter are also sufficiently anti-concentrated. By Conjectures 2-3, the latter algorithm can solve any problem in P$^{\#{\rm P}}$, which contains PH via Toda's theorem [74]. This implies a collapse of the Polynomial Hierarchy to its 3rd level.

A. Mapping architectures I-III to cluster state MBQCs
We show that any architecture I-III can be mapped via a bijection to a computationally equivalent sequence of X-Y-plane single-qubit measurements on the 2D cluster state [29]. Below, $T := \mathrm{diag}(1, e^{i\pi/4})$ and $\sqrt{T} := \mathrm{diag}(1, e^{i\pi/8})$. First, note that (via teleportation) the effect of measuring a dangling qubit (if present) is equivalent to generating a uniformly random classical bit $b \in \{0, 1\}$ and, subsequently, implementing the gate $T^b$ on its neighbor; we can thus replace all dangling-bond qubits by introducing a uniformly random measurement of $X$ or $X_{-\pi/4} = T^\dagger X T \propto X - Y$ on every primitive qubit. Furthermore, we can re-write the input $|\psi_\beta\rangle$ in Q1 as in Eq. (8), where $b$ is a random bit-string defined via $b_i := \beta_i/\theta$, with $\beta, \theta$ as in Q1. Since the $\sqrt{T}$ gates in (8) commute with the Hamiltonian (2) and their effect is unobserved by Z measurements, they can be propagated out of the experiment by measuring bi instead of X on every primitive qubit $i \in V$. Combining these facts, we obtain the following mappings:
(C1) Architectures I and III are computationally equivalent to a quantum circuit that prepares a 2D cluster state on their underlying primitive square lattice and measures $\{X, X_{-\pi/4}\}$ randomly on each vertex.
(C2) Architecture II is computationally equivalent to an analogous circuit of random $\{X, X_{-\pi/8}\}$ single-qubit measurements, which chooses the latter measurements to be identical along the columns of the 2D cluster state.
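The identity $T^\dagger X T \propto X - Y$ used above can be checked numerically. The following is a minimal sketch using plain 2x2 complex matrices; the helper names `matmul` and `dagger` are introduced here for illustration only.

```python
import cmath
import math


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]  # T = diag(1, e^{i pi/4})

# Left-hand side: T^dagger X T.
lhs = matmul(matmul(dagger(T), X), T)
# Right-hand side: (X - Y) / sqrt(2), i.e. the normalized X - Y observable.
rhs = [[(X[i][j] - Y[i][j]) / math.sqrt(2) for j in range(2)] for i in range(2)]

max_dev = max(abs(lhs[i][j] - rhs[i][j]) for i in range(2) for j in range(2))
print("largest entrywise deviation:", max_dev)
```

The proportionality constant is $1/\sqrt{2}$, so the measured observable is the normalized combination $(X - Y)/\sqrt{2}$.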

B. Universality of architectures I-III for postselected measurement-based quantum computation
For each of our architectures I-III, we prove that the ensemble {p β , |Ψ β } β is a universal resource for postselected MBQC w.r.t. the measurements in step Q3.Precisely, this means that if the ability to post-select the outcomes of the experiment's random variables (the qubit outcomes in step Q3 and the random vector β) is provided as an oracle [21,72], then it is possible to implement any poly-size quantum circuit [75] with arbitrarily high-fidelity in a subregion of the lattice using (at most) polynomially-many qubits.Our proof is constructive and shows how to simulate universal circuits of Clifford+T gates [76] via postselection.
Below, we call a quantum circuit 1D homogeneous if it consists of 1D nearest-neighbor gates and all operations applied in parallel are identical modulo translation. The latter do not need to be translation-invariant: e.g., an arbitrary S-size 1D nearest-neighbor circuit can be serialized to be 1D homogeneous in depth O(S). In Fig. 5 below, we give an example of an S-size IQP circuit that can be implemented in depth $O(\sqrt{S})$ (by bringing single-qubit gates to the end). 1D homogeneous circuits, as defined here, can be regarded as examples of quantum cellular automata [77].
We highlight that the complexity of the simulation in Lemma 3 scales with the depth of the input circuit (not the size), allowing us to parallelize concurrent nearest-neighbor gates. To prove this result, we assume basic knowledge of MBQC on cluster states [29,39]. Additionally, we make use of two technical lemmas.
Lemma 4 (Efficient preparation via MBQC). Let V be an n-qubit D-depth 1D homogeneous Clifford+T circuit. Then, the state vector $|\psi\rangle := (V|0\rangle^{\otimes n})\,|0\rangle^{\otimes (3n-2)}$ can be efficiently prepared exactly via an MBQC of single-qubit $\{X, X_{\pm\pi/8}\}$ measurements on a $(4n-2) \times O(Dn)$-qubit 2D cluster state, even if measurements are constrained to act "quasi-periodically" as follows: for every column, each of its qubits is measured in either the X basis, or in one of the $X_{\pm\pi/8}$ bases (where the sign can be picked freely on distinct sites).
Lemma 5 (On-site efficient preparation via MBQC). Let V be an n-qubit D-depth 1D homogeneous Clifford+T circuit. Then, the state vector $|\psi\rangle = V|0\rangle^{\otimes n}$ can be efficiently prepared exactly via an MBQC of single-qubit $\{X, X_{\pm\pi/4}\}$ measurements on an $n \times O(Dn)$-qubit 2D cluster state.
Lemma 4 is an MBQC implementation of a 1D quantum-computation scheme given in Ref. [40]. Lemma 5 follows from Lemma 3 in Ref. [41] by using that commuting-gate measurement patterns can be applied simultaneously in MBQC.
Proof of Lemma 4. We first show how to implement a universal set of gates that can be converted to the Clifford+T gateset.We begin by picking a translationally-invariant gate set with the desired property [40] Above, gates act on a one-dimensional chain of M := 4n − 2; H is the Hadamard gate; CZ i,i+1 is the CZ gate on qubits i, i + 1; E is a global entangling gate; and E M +1 implements a "mirror" permutation i → i := M + 1 − i of the qubits.
The computation is encoded on n logical qubits with physical positions The remaining qubits are kept in the state |0 .Ref. [40] shows how to implement generators for the Clifford+T gate-set using the following sequences of (9) operations: Sequence (S1) implements a e ∓i π 8 Zi gate; (S2), a e ∓i π 8 Xi gate; and (S3), a logical e ∓i π 8 X [i] X [i+1] gate [40].We will now show that any of the above gate sequences can be implemented directly on an MBQC on an (4n − 2) × O(n)-qubit 2D cluster state with quasi-periodic {X, X ±π/8 } measurements.W.l.o.g., we make E (resp.non-entangling unitaries) act on even steps (resp.odd ones) by introducing identity gates when necessary.We now reorder operations in the creation and measurement of the cluster state as indicated in Fig. 3: therein, balls denote qubits prepared in |+ ; steps In MBQC, Y all gates can be treated as byproduct Pauli operators and do not need to be enacted [39].Further, performing a periodic measurement of X in Fig. 3.(3) implements the E gate in (9).In turn, a quasi-periodic X ±π/8 measurement, where observables' signs are chosen adaptively to counteract random byproduct operators, can be used to implement E followed by a U Z (±π/8) gate.Similarly, and last, we can implement EU X (−π/8) by delaying the measurement to the next step and propagating a U Z (−π/8) backwards: this works because U Z and U X never occur in subsequent odd steps in sequences (S1)-(S2)-(S3).We thus have an MBQC simulation of the translation-invariant computation in Ref. [40].
Last, we show that an n-qubit D-depth homogeneous circuit of $e^{\mp i\pi Z_{[i]}/8}$, $e^{\mp i\pi X_{[i]}/8}$, $e^{\mp i\pi X_{[i]}X_{[i+1]}/8}$ gates can be implemented on a $(4n-2) \times O(Dn)$-qubit cluster-state MBQC using the above protocol. Here, we invoke that measurement patterns on disjoint regions of a cluster-state MBQC can be simultaneously applied for commuting logical gates (hence, also concurrent ones): the latter fact is easily verified in the MBQC's logical-circuit picture [39,78].
Proof of Lemma 5. Lemma 3 in Ref. [41] shows that, by performing an $X_{\pm\pi/4}$ measurement on a boundary qubit of an $n \times (n+2)$ cluster state, one can selectively implement any logical gate of the form $e^{\mp i\pi Z_i/8}$, $e^{\mp i\pi X_i/8}$, $e^{\mp i\pi Z_j X_{j+1}/8}$, $e^{\mp i\pi X_j Z_{j+1}/8}$, $1 \le i \le n$, $1 \le j \le n-1$, on an n-qubit 1D chain. As in the proof of Lemma 4, measurement patterns associated to commuting gates can be implemented simultaneously. The result of Ref. [41] thus yields an exact $n \times (n+2)$ cluster-state MBQC implementation of any circuit of form for any n-qubit 1D commuting circuit C of $e^{-i\frac{\pi}{8} Z_j X_{j+1}}$, $e^{-i\frac{\pi}{8} X_j Z_{j+1}}$ gates. The remainder of the proof follows that of Lemma 4.
We now prove the main claim of this section.
(D2) Postselected random $\{X, X_{-\pi/8}\}$ measurements, chosen identically on columns, on a $(n+r) \times O(Dn)$-qubit cluster state.
Statement (D1) (resp. (D2)) covers architectures I and III (resp. architecture II). To prove (D1)-(D2), we show that if an MBQC scheme on a cluster state is universal w.r.t. a family of X-Y-plane measurements $\{X_{\theta_i}\}_i$, $X_{\theta_i} = e^{-i\frac{\theta_i}{2}Z} X e^{i\frac{\theta_i}{2}Z}$, then the reduced negative-angle subfamily $\{X_{-|\theta_i|}\}_i$ is universal for postMBQC; in combination with Lemmas 4-5, the claims follow. Recall that any non-final measurement in cluster-state MBQC [29,39] produces a uniformly random outcome $s \in \{0, 1\}$ (cf. section VI C, Eq. (13) for an explicit formula), whose effect in the logical circuit is to introduce a random byproduct Pauli operator $X^s$ on its associated qubit line. If this is not accounted for (e.g., by adapting the measurement basis) and an $X_\theta$ measurement is subsequently performed, the latter effectively implements an $X_{(-1)^s\theta}$ measurement: this can be seen by propagating $X^s$ forward in the circuit using conjugation relationships, and it is illustrated in Fig. 4.

FIG. 4. (i) Measurement of $X_{\theta_i}$, $X_{\theta_j}$ on an edge of a 1D cluster state: the outcomes $s_i, s_j \in \{0, 1\}$ are uniformly random. (ii) The associated logical circuit: the byproduct operator $X^{s_i}$ can be propagated forward in the circuit by substituting $\theta_j$ with $(-1)^{s_i}\theta_j$.
The argument extends to the full cluster by induction, choosing the 1st qubit to be measured in the X basis (this fixes the input of the logical circuit and does not change its universality properties). The 2D cluster-state case is analogous [29,39].
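The sign flip caused by an uncorrected byproduct operator follows from the conjugation relation $X\,X_\theta\,X = X_{-\theta}$. A minimal numerical sketch is given below; the function names `x_theta` and `effective_observable` are hypothetical helpers, and $X_\theta$ is built directly from the definition $X_\theta = e^{-i\theta Z/2} X e^{i\theta Z/2}$.

```python
import cmath
import math

X = [[0, 1], [1, 0]]


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def x_theta(theta):
    """Observable X_theta = exp(-i theta Z / 2) X exp(+i theta Z / 2)."""
    rz = [[cmath.exp(-0.5j * theta), 0], [0, cmath.exp(0.5j * theta)]]
    rz_dag = [[cmath.exp(0.5j * theta), 0], [0, cmath.exp(-0.5j * theta)]]
    return matmul(matmul(rz, X), rz_dag)


def effective_observable(theta, s):
    """Observable seen by the logical state when a byproduct X^s precedes X_theta."""
    obs = x_theta(theta)
    if s == 1:
        obs = matmul(matmul(X, obs), X)  # conjugation by the byproduct X
    return obs


def matrices_close(a, b, tol=1e-12):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


for theta in (math.pi / 8, math.pi / 4):
    for s in (0, 1):
        ok = matrices_close(effective_observable(theta, s),
                            x_theta((-1) ** s * theta))
        print(f"theta={theta:.4f}, s={s}: matches X_((-1)^s theta) -> {ok}")
```

Postselecting on $s$ therefore fixes the sign of every measurement angle, which is the mechanism behind the negative-angle subfamily being sufficient.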

C. #P-hardness of approximating output probabilities (proof of Lemma 2)
In this section we prove Lemma 2. Our proof below shows that the ability to approximate the given Ising partition function can be used to approximate the output probabilities of the "dense" 2-local long-range IQP circuits of Ref. [17]. The proof further exploits a new technique (Lemma 6) to implement $O(n^2)$-size long-range IQP circuits in $O(n)$ depth in a 1D nearest-neighbor architecture, which is asymptotically optimal. We regard Lemma 6 as being of independent interest, since the latter dense IQP circuits were argued in Ref. [17] to exhibit a quantum speedup but, to the best of our knowledge, linear-depth 1D implementations for them were not previously known. On the other hand, it has recently been shown that a "sparse" subfamily of the latter IQP circuits can be implemented in depth $O(n \log n)$ in a 2D nearest-neighbor architecture [20].

A 1D linear-depth implementation of dense IQP circuits
We first derive our intermediate result for IQP circuits. For any positive n, we let C be any "dense" random n-qubit IQP circuit whose gates are uniformly chosen from the set $\{e^{i\theta_{i,j} X_i X_j}, e^{i\theta_i X_i} : i, j \in \{1, \dots, n\}\}$, which contains arbitrary long-range interactions in a fully connected architecture.
Proof. It is easy to see that $n^2$-size 2-local quantum circuits require $\Omega(n)$ depth to be implemented. Our proof gives a matching upper bound for the given IQP circuits.
Recall that IQP gates can be performed in any order (as they commute). Hence, by reordering gates and redefining the $\theta_{i,j}$ angles, any given IQP circuit C can be put in a normal form that contains at most one single-qubit gate per qubit and one two-qubit gate per pair of qubits. Our approach now is to introduce additional layers of nearest-neighbor SWAP gates following layers of two-qubit gates (Fig. 5). To illustrate the algorithm, we regard qubits as "particles" moving up or down the line by the action of the SWAP gates. At a given step t, we apply a two-qubit IQP gate followed by a SWAP to all pairs of form $(2i-1, 2i)$ for $1 \le i \le n/2$ when t is even (resp. $(2i, 2i+1)$ for $1 \le i \le (n-1)/2$ when t is odd). By iterating this process n times, the qubit initially in the ith position in the line (with arbitrary i) travels to the $(n-i+1)$th position, meeting every other qubit exactly once along the way due to the Intermediate Value Theorem; each 2-qubit gate of C is implemented in one of these crossings. Furthermore, each qubit spends one step without meeting any qubit when it reaches the line's boundary; at these points, single-qubit gates can be implemented.
FIG. 5. Linear-depth implementation of dense IQP circuits (11). We illustrate our algorithm for 4 qubits. Yellow (resp. blue) blocks implement the two-qubit (resp. one-qubit) gates in (11). The circuit contains 4 single-qubit gates (resp. 6 two-qubit ones), which coincides with the number of vertices (resp. edges) of the complete graph $K_4$.
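The gate-plus-SWAP schedule described above can be simulated classically to confirm its two key properties: after n steps the qubit order is reversed, and every pair of qubits has been adjacent exactly once. The sketch below tracks qubit labels only (no quantum state); the function names are illustrative and steps are counted from t = 1.

```python
from itertools import combinations


def swap_network_meetings(n):
    """Simulate n steps of the gate+SWAP schedule on a line of n labelled qubits.

    Returns the final ordering and, per step, the pairs that met (i.e. the pairs
    on which a two-qubit IQP gate could be applied before the SWAP).
    """
    line = list(range(1, n + 1))
    meetings = []
    for t in range(1, n + 1):
        # Even t: sites (2i-1, 2i); odd t: sites (2i, 2i+1), in 1-based labels.
        start = 0 if t % 2 == 0 else 1  # 0-based index of the first left site
        layer = []
        for left in range(start, n - 1, 2):
            layer.append(frozenset((line[left], line[left + 1])))
            line[left], line[left + 1] = line[left + 1], line[left]
        meetings.append(layer)
    return line, meetings


def check_swap_network(n):
    """True if the order is reversed and each pair met exactly once."""
    line, meetings = swap_network_meetings(n)
    met = [pair for layer in meetings for pair in layer]
    all_pairs = {frozenset(p) for p in combinations(range(1, n + 1), 2)}
    return (line == list(range(n, 0, -1))
            and len(met) == len(all_pairs)
            and set(met) == all_pairs)


for n in range(2, 9):
    print(n, check_swap_network(n))
```

For n = 4 the schedule produces 6 meetings in 4 steps, matching the 6 edges of $K_4$ mentioned in the caption of Fig. 5.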
Above, we used that p β is uniformly supported over Γ (by design) as well as q(y|β) = 1/2 N −n , which follows from standard properties of X-teleportation circuits [78,79].Note that prob(a, b|β) in (3) and q(x, y|β) as above are identical probability distributions up to a relabeling (x, y) = (a, b) of the random variables.Thus, if q( (a, b)|β) approximates q( (a, b)|β) up to relative error b+β) | 2 with the same error.Hence, the proof reduces to showing that approximating q(x, y|β) for architectures I-III is #P-hard for the given error and m ∈ O(n 2 ).
Next, recall that the output probabilities $a_z = |\langle z_1, \dots, z_k|C|0\rangle^{\otimes k}|^2$ of an arbitrary k-qubit dense IQP circuit C as in (11) are #P-hard to approximate up to relative error $1/4 + o(1)$ [80,81]. Via Lemmas 3 and 6, the latter can further be implemented in our architectures using lattices with $n \times O(n^2)$ qubits for some $n := k + r$ with $r \in O(k)$; to apply Lemma 3, we can either decompose C exactly as a 1D homogeneous Clifford+T circuit [82], or use the gadgets in the proofs of Lemmas 4-5 to directly implement the IQP gates. Now, let $|\psi_{y,\beta}\rangle$ denote the state vector of the n right-most primitive qubits after observing $y, \beta$. It follows from our discussion that $|\psi_{y,\beta}\rangle = C|0\rangle^{\otimes k}|0\rangle^{\otimes r}$ for some efficiently computable value of $y, \beta$.

D. Hardness argument (proof of Theorem 1)
Finally, we prove Theorem 1. Similarly to Refs. [16,17], we apply Stockmeyer's Theorem [73] to relate the problem of approximately sampling from output distributions of quantum circuits to that of approximating individual output probabilities. Our proof is by contradiction: assuming that the worst-case #P-hardness of estimating the partition functions of the Ising models (4) extends to average-case (Conjecture 2) and that the output probabilities of architectures I-III are sufficiently anti-concentrated (Conjecture 3), we show that the existence of a classical algorithm for sampling from the latter within constant $\ell_1$-norm error implies that an $\mathrm{FBPP}^{\mathrm{NP}}$ algorithm can solve #P-hard problems; this leads to a collapse of the polynomial hierarchy to its third level, in contradiction with Conjecture 1.
Let $\Gamma := \{\beta \in \{0, \theta\}^{mn} : p_\beta \neq 0\}$ be the set of allowed $\beta$ configurations in step Q1; $x \in \{0, 1\}^n$ (resp. $y \in \{0, 1\}^{N-n}$, $N = \mu m n$) be the measurement outcomes of the n right-most primitive qubits (resp. remaining ones) after step Q3; and $q(x, y, \beta)$ be the final total probability of observing the values $x, y, \beta$. As a preliminary, we prove that if Conjecture 3 holds, then the probability distribution $q(x, y, \beta)$ associated to random-input states and measurement outcomes of architectures I-III is anti-concentrated. We first note that $q(x|y, \beta)$ in (13) coincides with the output distribution of some n-qubit $O(m)$-depth circuit $C_{y,\beta}$ of gates of form (5): this is easily seen from mappings (C1)-(C2) and standard properties of X-teleportation [39,78,79] (cf. also the next section and Fig. 6). For arbitrary $y \in \{0, 1\}^{N-n}$, $\beta \in \Gamma$, let us now define $\gamma_{y,\beta}$ as the fraction of $C_{y,\beta}$'s output probabilities larger than $1/2^n$, and $\alpha$ as the fraction of $C_{y,\beta}$ circuits that fulfill (6). Conjecture 3 states that $\gamma_{y,\beta} \ge 1/e$ for $m \in O(n)$. Consequently, for $n \times O(n)$ lattices, (13) implies the anti-concentration inequality (16). Furthermore, since $q(y, \beta)$ is uniformly distributed over its support, it also follows from (13) that Eq. (17) holds for a uniformly random $(x, y) \in \{0, 1\}^N$, $\beta \in \Gamma$. Eq. (17) tells us that the robustness of the anti-concentration inequality (16) can be tested by computing the average value $E_{y,\beta}(\gamma_{y,\beta})$ of $\gamma_{y,\beta}$, or by estimating the fraction $\alpha$. As is discussed further in appendix C, $\gamma_{y,\beta} \times e$ and $\alpha$ are expected to converge to one for universal 1D nearest-neighbor circuits as n grows asymptotically [44-46,83] in the regime $m \in O(n)$ [84-86]. In appendix C, Fig. 7, we present numerical evidence that $E_{y,\beta}(\gamma_{y,\beta}) \to 1/e$, $\alpha \to 1$ in the asymptotic limit and a tight agreement for $n \ge 9$; more strongly, we also find that $\gamma_{y,\beta}$ is nearly $1/e$ for almost every uniformly sampled instance for $n \ge 9$ and that $q(x|y, \beta)$ is Porter-Thomas distributed, which is a signature of Haar-random chaotic unitary processes [18,42-46].
We are now ready to prove Theorem 1. For any architecture I-III, we let a denote an element of $\{0, 1\}^N \times \Gamma$, and assume that the output distribution $p_c(a)$ of a classical BPP algorithm fulfills the $\ell_1$-norm bound (18) for a constant $\varepsilon \ge 0$. By Stockmeyer's Theorem [73] and the triangle inequality, there exists an $\mathrm{FBPP}^{\mathrm{NP}}$ algorithm that computes an estimate $\tilde{p}_c(a)$ of $p_c(a)$ with bounded error, where we used $\log|\Gamma| \in O(N)$ to remove dependencies on $|\Gamma|$. From Markov's inequality and (18), we get a tail bound on the deviation of $p_c(a)$ from $q(a)$ for any constant $0 < \delta < 1$, where $a \in \{0, 1\}^N \times \Gamma$ is picked uniformly at random. Hence, the estimate approximates $q(a)$, Eq. (20), with probability at least $1 - \delta$ over the choice of a.
We now claim that Eqs. (20) and (16) simultaneously hold for a single $q(a)$ with probability $(1/e) \cdot (1 - \delta)$, since a classical description of architectures I-III does not reveal to a classical sampler which output probabilities are #P-hard to approximate: hence, the sampler cannot adversarially corrupt them. This is manifestly seen at the encoded random circuit level, due to the presence of random byproduct operators of form $\prod_i X_i^{y_i}$ (with random $y_i$), which obfuscate the location of the #P-hard probabilities from the sampler [17,48]. Hence, setting $\varepsilon = \gamma/8$, $\delta = \gamma/2$, $\gamma = 1/e$, we obtain that both hold with probability at least $\gamma(1 - \gamma/2) > 0.3$ over the choice of a. Setting $\varepsilon = 1/22 < \gamma/8$, the above procedure yields an approximation $\tilde{p}_c(a)$ of $q(a)$ up to relative error $1/4 + o(1)$. Using (3) and (13), we obtain an $\mathrm{FBPP}^{\mathrm{NP}}$ algorithm that approximates $|Z_{\alpha,\beta}|^2$ with relative error $1/4 + o(1)$ for at least a 0.3 fraction of the instances. This yields a contradiction.
As final remarks, note that the above argument is robust to small finite-size variations to the threshold $\gamma = 1/e$ in Conjecture 3, Eq. (6), since the constants $\varepsilon = \gamma/8$, $\delta = \gamma/2$, and $\gamma(1 - \gamma/2)$ have only linear and quadratic dependencies on $\gamma$. Also, notice that Conjectures 2 and 3 enter the above argument in order to allow for a constant additive error $\varepsilon$, which is key for a real-life demonstration of a quantum speedup. Additive errors give rise to demanding, but not unrealistic prescriptions. However, in the ideal case where one assumes no sampling errors, or multiplicative sampling errors, our result holds even without these conjectures via the arguments in Refs. [21,22].
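The numerical constants quoted in this step are easy to check by direct arithmetic. The short sketch below only evaluates the expressions stated in the text for $\gamma = 1/e$; the variable names are illustrative.

```python
import math

gamma = 1 / math.e            # anti-concentration threshold from Conjecture 3
delta = gamma / 2             # failure probability used in the Markov step
epsilon_bound = gamma / 8     # upper limit on the additive sampling error
joint_probability = gamma * (1 - delta)  # both conditions hold simultaneously

print(f"gamma             = {gamma:.5f}")
print(f"gamma / 8         = {epsilon_bound:.5f}")
print(f"1 / 22            = {1 / 22:.5f}")
print(f"gamma(1 - gamma/2) = {joint_probability:.5f}")
```

The values confirm that $1/22$ lies just below $\gamma/8$ and that $\gamma(1-\gamma/2)$ sits marginally above 0.3, so the stated fraction of instances leaves very little slack.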

VII. CONCLUSION
In this work, we have established feasible and simple schemes for quantum simulation that exhibit a superpolynomial quantum speedup with high evidence, in a complexity-theoretic sense. As such, this work is expected to significantly contribute to bringing notions of quantum devices outperforming classical supercomputers closer to reality. This work can be seen as an invitation towards a number of further exciting research directions: While the schemes presented may not quite yet constitute experimentally realizable blueprints, it should be clear that steps already experimentally taken are very similar to those discussed. It hence seems interesting to explore settings for cold atoms or trapped ions in detail, requiring little local control and allowing for comparably short coherence times. What is more, further complexity-theoretic results on intermediate problems seem needed to fully capture the potential of quantum devices outperforming classical computers without being universal quantum computers. It is the hope that the present work can contribute to motivating such further work, guiding experiments in the near future.
Our results are summarized in Fig. 7: therein, one can see that both for circuits associated to $n \times n$ and $n \times n^2$ lattices, this fraction quickly approaches a constant $\gamma = 1/e$ with rapidly decreasing variance with respect to the choice of circuits. We can conclude that, with very high probability, in a realization of the proposed experiment the amplitude of the final state of the computation anti-concentrates. As discussed in Refs. [17,20,48], it might seem a priori counter-intuitive that constant-depth nearest-neighbor architectures anti-concentrate. However, the above connections between our architectures and random circuits provide key insights into why this behavior is actually natural. As shown in section VI B, the random logical circuits of gates of form (5) encoded in our architectures are universal for quantum computation. Universal random quantum circuits of increasing depth are known to approximate the Haar measure under various settings [44-46, 83, 86]. For 1D nearest-neighbor layouts, the latter are expected to reach a chaotic Porter-Thomas-distributed regime [42-44] in depth $D \in O(n)$ [84-86] (cf. [18] for further discussion). As an additional piece of supporting evidence for anti-concentration, we numerically confirmed that our output probabilities are close to being Porter-Thomas-distributed in $\ell_1$-norm (Fig. 8). Furthermore, our numerics are in agreement with prior numerical works on MBQC settings [46] and other gate sets in 2D layouts [18,47].
Circuit families.For the sake of completeness, we spell out the logical circuits that are effectively implemented in architectures I-II.The latter are derived via mappings (C1)-(C2) and X-teleportation properties [78,79].Examples for 4 × 2 lattices are depicted in Fig. 6.The logical circuit family corresponding to I and III (resp.to II) is denoted F DO (resp.F col ).Let us now label primitive-lattice sites by row-column coordinates [i, j].The circuits are generated inductively, starting from the left column j = 1.Measurements are ordered from left to right.The computation begins on the |+ ⊗n state and proceeds as follows: 1. Apply the gate exp (iβ [i,j] Z [i,j] ) to qubit [i, j], with β [i,j] chosen as in step Q1 for I-II; for III, we let where s [i,j] is the outcome after measuring the dangling-neighbor of [i,j] gate to every qubit [i, j], where a [i,j] is the outcome of the measurement at site [i, j]. 3. Apply CZ on all neighboring qubits.4. Apply a Hadamard gate to each qubit.5.If j = m, measure in the standard basis and terminate; otherwise increase j := j + 1.
a. Convergence to the chaotic regime. It is an interesting detail that the value of $\gamma = 1/e$, which we observe above, is a signature of the exponential distribution (also known as Porter-Thomas distribution) that is known to emerge in chaotic quantum systems for large system sizes [18, 42-44, 46, 47]. This distribution is given by $P_{\rm PT}(p) = 2^n e^{-2^n p}$ and thus anti-concentrates in precisely the fashion observed here. Note that the same behavior was observed in previous work investigating random MBQC settings [46], as well as in recent work [18], which investigated random universal circuits on a 2D architecture. Notably, the finite and universal gate sets considered in these works are very similar to the ones considered here. Likewise, convergence to the exponential distribution was observed in Ref. [47] for approximately Haar-random two-qubit unitaries in a 2D setup.
In Fig. 8 we show the total variation distance between the empirical distributions of output probabilities of the random circuits generated in our numerical experiments and the discretized Porter-Thomas distribution. We can see that as the number of qubits increases, the output distributions of random circuits approach the Porter-Thomas distribution.
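The link between the Porter-Thomas distribution and the value $\gamma = 1/e$ can be illustrated with a Haar-random state, whose output probabilities are approximately exponentially distributed. The sketch below is not the numerical experiment of this appendix: it replaces the architecture's random circuits by a normalized complex Gaussian vector (an assumption made here for illustration), and the function names are hypothetical.

```python
import math
import random


def haar_random_probabilities(n_qubits, seed=0):
    """Output probabilities of a Haar-random n-qubit state.

    A normalized vector of i.i.d. complex Gaussians is Haar distributed.
    """
    rng = random.Random(seed)
    dim = 2 ** n_qubits
    amps = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(dim)]
    norm = sum(abs(a) ** 2 for a in amps)
    return [abs(a) ** 2 / norm for a in amps]


def fraction_above_uniform(probs):
    """Fraction of outcomes whose probability exceeds the uniform value 1/dim."""
    dim = len(probs)
    return sum(p > 1 / dim for p in probs) / dim


probs = haar_random_probabilities(12)
gamma_est = fraction_above_uniform(probs)
print(f"estimated fraction: {gamma_est:.4f}, 1/e = {1 / math.e:.4f}")
```

For an exponential density with mean $2^{-n}$, the weight above $2^{-n}$ is exactly $e^{-1}$, which is why the estimated fraction concentrates around $1/e$ as the dimension grows.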
To calculate the total variation distance to the exponential distribution (C1), we discretized the interval [0, 1] into m bins each of which contains probability weight 1/m.In other words, the discretization (p 0 , p 1 , . . ., p m ) is defined by given p 0 = 0, p m =  ..,m .The total variation distance between P and the exponential distribution is then given by Since the number of samples we obtain in each run is given by 2 n we choose the number of bins m depending on n.Specifically, we choose m = min{ 2 n /5 , 100} to allow fair comparison for small n.The full experiment can now be mapped to non-adaptive MBQC analogue to (C1) with two differences: first, the MBQC acts on a graph state vector |G [89], instead of a cluster state, whose underlying graph G is derived from the 2D lattice by deleting |0 -state vertices (the output probabilities of the computation can be mapped to an Ising model on G using the tools of appendices A-B); second, the remaining vertices on even columns are measured on the Y basis.We now pick n = 2k − 1 and study the logical-circuit of the MBQC in two scenarios: (i) All even-row qubits are initialized in |0 .The MBQC acts on a graph state vector |G that is the product of k disconnected 1D cluster states.Modulo byproduct operators, the local measurements drive a random logical k-qubit circuit of singlequbit gates {R ai i (θ) := H i e i(aiθ)Zi , a i ∈ {0, 1}} and depth m − 1 (information flows on odd rows).For architecture II, the latter circuit inherits a TI (2,∞) -symmetry from the TI (4,∞) -one of the input state vector |ψ β .
(ii) Even-column even-row qubits are initialized in |0 ; even-column odd-row ones are left unspecified.The MBQC acts on a cluster-state with "holes" |G as in Refs.[90,91].Information flows again on odd-rows.If an even-column odd-row qubit i is prepared in | − i and measured in the X basis (or, equivalently, in |+ and measured in the Y basis), we obtain a reduced graph-state vector |G whose graph G is obtained from G by contracting the edges incident to i [90,91].Thus, | − i state vectors between the odd qubit lines let us implement logical entangling gates of form where b i = 1 if the qubit between lines 2i, (2i − 1) is in | − i and zero otherwise; and c i indicates whether we measure X or X −θ on the (2i − 1)th line.Postselecting b i gives us the ability to implement non-translation-invariant 2-qubit entangling gates between qubit lines at will.Again, for architecture II, the gate E b,c (θ) inherits a TI (2,∞) symmetry.
Combining the above facts, it follows that we can simulate arbitrary k-qubit nearest-neighbor Clifford+T circuits in the modified architecture I via postselection, using lattices with (2k − 1) × (2k − 1) qubits.The latter can efficiently implement the #P-hard IQP circuits of Lemma 6 with a constant-overhead factor.
In the modified architecture II, observation (i) and postselection of byproduct operators (Fig. 4) yields TI (2,∞) -symmetric circuits of {H i , e ±i π 16 Zi , e ±i π 16 Xi } gates.In combination with the gadgets in the proof of Lemma 4, this lets us implement k-where J i,j = π/4, and E 2 the dangling bonds where J i,j = π/16.We find the corresponding parent Hamiltonian to be where the two-body terms evaluate to The Hamiltonian terms of H III are 6-local except at the boundary where its locality is reduced.

Energy estimation of G-local Hamiltonians
An N-qubit non-degenerate gapped local Hamiltonian $H = \sum_{i\in V} h_i$ with $\tau$-body interactions on a (simple, connected) interaction graph $G = (V, E)$ is called "G-local" if each qubit is located at a vertex $i \in V$ and each term $h_i$ is supported on the neighborhood $\partial(i)$ of i. Though we will consider arbitrary interaction graphs G in our analysis, we will be particularly interested in constant-degree ones in our applications. The key ingredient of our result below is a subroutine for estimating average energies of G-local Hamiltonians using parallelized measurement circuits with time complexity dominated by the chromatic number $\chi(G^{(2)})$ of the next-neighbor interaction graph $G^{(2)} = (V^{(2)}, E^{(2)})$ of G. The latter has the same vertices $V^{(2)} = V$ as G and edges between all pairs of neighbors and next-neighbors of G ($(v_1, v_2) \in E^{(2)}$ iff $v_1, v_2 \in V$ have graph distance $d(v_1, v_2) \le 2$). The existence of such parallel circuits relies on the existence of certain decompositions of local Hamiltonians into commuting terms, named $(\kappa, \alpha, \tau)$ ("cat") decompositions below. In short, a Hamiltonian is $(\kappa, \alpha, \tau)$ if it can be decomposed as a small sum of Hermitian operators that admit parallel measurement circuits that use single-shot $\tau$-body operators. A sufficient condition for Def. 7(c) to hold is that, for all $i \in [\kappa]$, all terms in $\{h^{(i)}_j\}_j$ are $\tau$-local and have non-overlapping support. Then, one can simply measure all $h^{(i)}_j$ in parallel, obtain the outcomes $\{e^{(i)}_j\}_j$, and sample the output distribution of $H_i$ by classically computing the sum $e_i := \sum_j e^{(i)}_j$. The fact that Hamiltonians of this form admit parallel measurement circuits is formalized by the following lemma, which generalizes Lemma 1 in Ref. [52].
Lemma 8 (Estimation of the energy).Let H be an N -qubit Hamiltonian with a given (κ, α, τ )-decomposition (E5).Let P i,µ be the µth e i,µ -eigenprojector of H i , for i ∈ [κ], and X (j) i be the random variable that takes the value e i,µ with probability tr(ρ (j) p P i,µ ), modeling a measurement of H i on the j th copy of ρ p .Moreover, let and since all measurements are independent The latter identity holds whenever m ≥ m opt with The next result states that any G-local Hamiltonian admits a (κ, α, τ ) decomposition.
The bounds in Lemma 9 are not necessarily tight for commuting Hamiltonians. For instance, the fully connected Ising Hamiltonian $H = -\sum_{i,j} J_{i,j} Z_i Z_j - \mu \sum_k h_k Z_k$ admits a $(1, 1, 1)$ decomposition, though $\chi(G^{(2)}) = N$, because it can be measured directly via single-shot on-site $Z_i$ measurements and classical post-processing (as it is a polynomial of the latter commuting observables). For the Hamiltonians in architectures I-III, we also find much tighter bounds.
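The classical post-processing mentioned for the fully connected Ising Hamiltonian amounts to evaluating a polynomial on one shot of on-site $Z$ outcomes. The following sketch assumes outcomes encoded as $\pm 1$ and couplings stored per unordered pair; the function name and data layout are illustrative choices, not taken from the source.

```python
def ising_energy(z, couplings, fields, mu=1.0):
    """Energy of H = -sum_{i<j} J_ij Z_i Z_j - mu * sum_k h_k Z_k for one shot.

    z         : list of on-site Z outcomes, each +1 or -1
    couplings : dict mapping a pair (i, j) to J_ij
    fields    : list of local fields h_k
    """
    pair_term = sum(J * z[i] * z[j] for (i, j), J in couplings.items())
    onsite_term = sum(h * zk for h, zk in zip(fields, z))
    return -pair_term - mu * onsite_term


# Three fully connected spins with unit couplings and fields h_k = 0.5.
couplings = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0}
fields = [0.5, 0.5, 0.5]

print(ising_energy([1, 1, 1], couplings, fields))
print(ising_energy([1, -1, 1], couplings, fields))
```

A single round of parallel on-site measurements thus yields one sample of the energy, independently of how many coupling terms the Hamiltonian contains.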
Lemma 10 (Local decompositions in the architectures). Let $H_a$ be the N-qubit Hamiltonian in architecture $a \in \{\mathrm{I}, \mathrm{II}, \mathrm{III}\}$. Then,
• $H_{\rm I}$ admits a $(2, 5/9, 1)$ decomposition of form (E5) where every $H_i$ has on-site terms and DO symmetry.
• $H_{\rm II}$ admits a $(2, 5/9, 1)$ decomposition of form (E5) where every $H_i$ has on-site terms and TI (1,∞) symmetry.
• $H_{\rm III}$ admits a two-body $(2, 5/9, 2)$ and an on-site $(32, 5/9, 1)$ decomposition where every 2) symmetry.
Proof of Lemma 9. We prove the existence of the (κ, α, τ ) decomposition by constructing a partition V = κ i=1 V i such that where the terms in {h i ∈ H i } are τ -body by construction and have non-overlapping support for all i ∈ [1, κ], (hence, can be simultaneously measured).First, χ G (2) is the minimal number of classes in any vertex coloring of G (2) (i.e., a vertex partition where no pair of adjacent vertices falls in the same class).Further, two vertices v 1 , v 2 ∈ V (2) are adjacent in G (2) iff they are neighbors or next-neighbors in G. Letting V = χ(G (2) ) i=1 V i , be a minimal vertex coloring of G (2) , we obtain a decomposition of form (E8) with κ ≤ χ G (2) .Last, picking α : The existence of the (κ4 deg(G) , α, 1) decomposition now follows by expanding each term h i in every H i in (E8) in a product basis where A = {A(µ)} 4 µ=1 is some Hermitian single-qubit operator basis (e.g., the standard Pauli matrices).By picking a fixed ordering of every set ∂(j), this lets us write each H i as a sum of 4 τ Hermitian operators {H i,x , x = (x 1 , . . ., x τ ), x i = [1, 4]} x , H i,x := j∈Vi α j,x k∈∂(j) A(x k ) k , the energy distribution of which can be sampled from via on-site A(x i ) measurements.
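The coloring argument in this proof can be sketched for a square lattice. The code below builds the next-neighbor graph $G^{(2)}$, colors it greedily, and checks that terms in the same color class have disjoint supports. It assumes that $\partial(i)$ consists of site i together with its lattice neighbors, and it uses a greedy coloring, which gives a valid but not necessarily minimal partition; all function names are illustrative.

```python
from itertools import product


def square_lattice_neighbors(rows, cols):
    """Nearest-neighbor sets of an open-boundary rows x cols square lattice."""
    nbrs = {}
    for r, c in product(range(rows), range(cols)):
        cand = [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]
        nbrs[(r, c)] = {(a, b) for a, b in cand if 0 <= a < rows and 0 <= b < cols}
    return nbrs


def distance_two_graph(nbrs):
    """Graph G^(2): vertices adjacent iff their distance in G is 1 or 2."""
    g2 = {}
    for v, ns in nbrs.items():
        reach = set(ns)
        for u in ns:
            reach |= nbrs[u]
        reach.discard(v)
        g2[v] = reach
    return g2


def greedy_coloring(graph):
    """Assign each vertex the smallest color unused by its colored neighbors."""
    color = {}
    for v in sorted(graph):
        used = {color[u] for u in graph[v] if u in color}
        color[v] = next(c for c in range(len(graph)) if c not in used)
    return color


def classes_have_disjoint_support(nbrs, color):
    """Check that same-colored terms h_i act on non-overlapping sets of qubits."""
    support = {v: nbrs[v] | {v} for v in nbrs}
    return all(not (support[u] & support[v])
               for u in nbrs for v in nbrs
               if u != v and color[u] == color[v])


nbrs = square_lattice_neighbors(4, 5)
coloring = greedy_coloring(distance_two_graph(nbrs))
print("number of classes:", len(set(coloring.values())))
print("disjoint supports:", classes_have_disjoint_support(nbrs, coloring))
```

Each color class can then be measured in a single parallel round, so the number of rounds is bounded by the number of colors used.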
Proof of Lemma 10. For $H_{\rm I,II}$, we first split the Hamiltonian terms in (E1) into two groups ("even" and "odd") using a bi-coloring of the N-site square lattice, and set $H = H_{\rm even} + H_{\rm odd}$. The terms $\{h_i\}_i$ of $H_{\rm even}$ ($H_{\rm odd}$) are products of $X_{\tau_i,i}$, $Z_j$ on-site factors. Because we use a 2-coloring, the on-site factor list associated to two distinct $h_i$, $h_j$ terms contains at most two overlapping on-site pairs of form $(Z_k, Z_k)$, $(Z_k, Z_k)$. Hence, overlapping terms are identical and can be measured jointly. This allows us to measure $H_{\rm even}$ ($H_{\rm odd}$) via a parallel measurement of all on-site factors and classical post-processing. This yields a $(\kappa, \alpha, \tau)$ decomposition with $\kappa = 2$ and $\tau = 1$. Further, the largest component $V_{\max}$ of a square lattice 2-coloring has $N/2$ vertices for even N, and $(N+1)/2$ otherwise. Hence, we can pick $\alpha = |V_{\max}|/N \le \frac{1}{2}(1 + 1/N) \le 5/9$, where we use that the smallest odd value of N is 9. The same approach leads to a $(2, 5/9, 2)$ decomposition for $H_{\rm III}$, Eq. (E2), where $H_{\rm even}$, $H_{\rm odd}$ are sums of products of on-site terms supported on the primitive lattice, and two-body terms acting on dangling bonds. This leads to a $(32, 5/9, 1)$ one by expanding the 2-body terms on a local basis, as in the proof of Lemma 9. Finally, all new Hamiltonians inherit the symmetry of their corresponding parent Hamiltonian by construction.
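The bound $\alpha \le 5/9$ can be checked by counting the two color classes of a checkerboard coloring. The sketch below uses exact rational arithmetic and assumes rectangular lattices with both side lengths at least 2 (so that the smallest odd N is 9, as stated in the proof); the function name is illustrative.

```python
from fractions import Fraction


def largest_class_fraction(rows, cols):
    """|V_max| / N for the checkerboard 2-coloring of a rows x cols lattice."""
    n_sites = rows * cols
    even = sum((r + c) % 2 == 0 for r in range(rows) for c in range(cols))
    return Fraction(max(even, n_sites - even), n_sites)


for rows, cols in [(2, 2), (3, 3), (3, 5), (4, 4), (5, 5)]:
    print(rows, cols, largest_class_fraction(rows, cols))
```

The 3 x 3 lattice saturates the bound with exactly 5/9, and the fraction approaches 1/2 for larger lattices.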

Certification protocol
Finally, we describe a quadratic-time weak-membership certification protocol for ground states of non-degenerate gapped $G$-local $\tau$-body Hamiltonians with constant $\tau$, and, in general, any $(\kappa, \alpha, \tau)$-measurable Hamiltonians (Def. 7). By virtue of Lemma 10, the protocol can be applied to efficiently certify the final state preparation of our proposed quantum architectures I-III. We describe the protocol for the latter class since we know any $G$-local Hamiltonian is of that form (Lemmas 9, 10). The protocol is simply a parallelized version of the one in Ref. [52].
Definition 11 (Weak-membership quantum state certification [92]). Let $F_T > 0$ be a threshold fidelity and $0 < p_{\mathrm{err}} < 1$ be a maximal failure probability. A test which takes as input a classical description of $\rho_0$ and copies of a preparation of $\rho_p$, and outputs "reject" or "accept", is a weak-membership certification test if with high probability $p_{\mathrm{succ}} \geq 1 - p_{\mathrm{err}}$ it rejects every $\rho_p$ for which $F(\rho_p, \rho_0) \leq F_T$, and accepts every $\rho_p$ for which $F(\rho_p, \rho_0) \geq F_T + \delta$ for some fidelity gap $\delta > 0$.
Protocol 1 (Certification of $(\kappa, \alpha, \tau)$-decomposable Hamiltonians). The protocol receives a description of a non-degenerate gapped Hamiltonian $H$ that admits a $(\kappa, \alpha, \tau)$-decomposition of form (E5), which is given to us, and performs the following steps:
1. Arthur chooses a threshold fidelity $F_T < 1$, a maximal failure probability $0 < p_{\mathrm{err}} < 1$ and an error $\epsilon \leq (1 - F_T)/2$.
2. Arthur asks Merlin to prepare a sufficient number of copies of the ground state $\rho_0$ of $H$.
5. If $F^*_{\min} < F_T + \epsilon$ he rejects, otherwise he accepts.

Lemma 12 (Weak-membership certification). Let $H$ be an $N$-qubit non-degenerate gapped $(\kappa, \alpha, \tau)$-decomposable Hamiltonian with known ground state energy $E_0$, gap $\Delta$, interaction strength $J = \max_\lambda \|h_\lambda\|$, and let $E_0$, $\Delta^{-1}$, $J$ be upper bounded by a constant. Then, Protocol 1 is a weak-membership certification test, in the sense of [52], with fidelity gap

In combination with Lemma 10, it follows that one can efficiently certify the final state preparation of our architectures I-III, if the latter are at least $1/N$ close to the target state in fidelity, since $\|H\| \sim N$ and $\Delta$ is larger than a constant by construction of the parent Hamiltonians $H_{\text{I-III}}$.
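The decision rule at the end of Protocol 1 can be sketched in a few lines. This is only an illustration under stated assumptions: the energy estimate $E^*$ is taken as given (the sample size $m$ of expression (E6) is not modeled), and the fidelity lower bound is evaluated as $1 - (E^* - E_0)/\Delta$, which reduces to the expression $1 - \langle H \rangle_{\rho_p}/\Delta$ of the protocol when the ground state energy is $E_0 = 0$. The function names `fidelity_lower_bound` and `arthur_decides` are hypothetical.

```python
def fidelity_lower_bound(energy, gap, e0=0.0):
    """Estimate of F_min = 1 - (<H> - E_0) / gap; e0 = 0 is an assumption."""
    return 1.0 - (energy - e0) / gap


def arthur_decides(energy_estimate, gap, f_threshold, eps, e0=0.0):
    """Accept/reject rule: reject iff the estimated bound falls below F_T + eps."""
    if not 0 < eps <= (1 - f_threshold) / 2:
        raise ValueError("eps must satisfy 0 < eps <= (1 - F_T) / 2")
    f_star = fidelity_lower_bound(energy_estimate, gap, e0)
    return "reject" if f_star < f_threshold + eps else "accept"


# Hypothetical numbers: unit gap, F_T = 0.5, eps = 0.25.
print(arthur_decides(0.0, 1.0, 0.5, 0.25))  # energy at the ground value
print(arthur_decides(0.5, 1.0, 0.5, 0.25))  # energy half a gap above it
```

The rule only ever compares a lower bound on the fidelity with the threshold, so a state can be rejected even though its true fidelity exceeds $F_T$; this is the reason the definition of weak membership includes a fidelity gap.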
Proof of Lemma 12. The proof is identical to that of Proposition 1 in Ref. [52] if we substitute Protocol 1 (Lemma 1) therein with our Protocol 1 (our Lemma 8) in this appendix. We refer the reader to Ref. [52] for details.

being infinite. Second, in Conjecture 2, we do not need the problem of approximating Ising partition functions to be #P-hard on average. Instead, it is enough that this problem is not in the complexity class $\mathsf{BPP}^{\mathsf{NP}}$, which is contained in the 3rd level of the Polynomial Hierarchy (and would be in the 2nd, if the widely believed conjecture P = BPP [35] holds). Note how this would be in striking contrast with our hardness result 2, since the latter says that an oracle to solve the worst-case version of the same problem would allow us to solve all problems in all levels of the Polynomial Hierarchy. Third, if this weaker form of Conjecture 2 holds, then Conjecture 1 is obviously not needed in the proof of Theorem 1. We have also mentioned in the main text that stating Conjecture 2 in terms of relative errors (which is the approach followed here and in [16, 18, 20, 22]) is somewhat more natural than stating it in terms of additive ones, as in Ref. [19]. To illustrate the difference, note that there exist quantum algorithms for approximating (normalized) Ising partition functions up to polynomially small additive errors [94, 95], while the latter are #P-hard to approximate up to polynomially small and even constant relative ones.

Theorem 1 (Hardness of classical simulation). If Conjectures 1-3 below are true then a classical computer cannot sample from the outcome distribution of any architecture I-III up to error 1/22 in $\ell_1$ norm in time $O(\mathrm{poly}(n, m))$.

Figure 7. Fraction of output probabilities larger than $1/2^n$ of random circuits drawn from the families $F_{\mathrm{DO}}$ (l.h.s.) and $F_{\mathrm{col}}$ (r.h.s.) for both linear (top) and quadratic (bottom) circuit depth in the number $n$ of qubits the circuit acts upon, i.e., lattices of size $n \times n$ (top) and $n \times n^2$ (bottom). For each $n$ we draw 100 i.i.d. realizations $(\beta, y)$, and thus of the circuit $C_{\beta,y}$, and plot the resulting distribution in the form of a box plot. The red dashed line shows the value of $1/e$, which is precisely the value to be expected if the output probabilities are Porter-Thomas distributed.
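The reference value $1/e$ in Figure 7 follows from the Porter-Thomas density $P(p) = 2^n e^{-2^n p}$, for which $\Pr[p > 1/2^n] = \int_{1/2^n}^{\infty} 2^n e^{-2^n p}\,dp = e^{-1}$. The sketch below checks this numerically by drawing samples from an exponential distribution with mean $1/2^n$; it is not the simulation used for the figure, and the function name `fraction_above_uniform`, the sample size, and the seed are choices made here for illustration.

```python
import math
import random


def fraction_above_uniform(n_qubits, n_samples, seed=0):
    """Fraction of Porter-Thomas distributed samples exceeding 1 / 2**n_qubits."""
    rng = random.Random(seed)
    dim = 2 ** n_qubits
    # expovariate(dim) draws from an exponential distribution with mean 1 / dim.
    hits = sum(rng.expovariate(dim) > 1.0 / dim for _ in range(n_samples))
    return hits / n_samples


estimate = fraction_above_uniform(n_qubits=10, n_samples=200_000)
print(estimate, math.exp(-1))
```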

Figure 8. Total variation distance to the Porter-Thomas distribution of the empirical distribution of output probabilities of random circuits from the families $F_{\mathrm{DO}}$ (l.h.s.) and $F_{\mathrm{col}}$ (r.h.s.) for both linear (top) and quadratic (bottom) circuit depth in the number $n$ of qubits the circuit acts on, i.e., lattices of size $n \times n$ (top) and $n \times n^2$ (bottom). For each $n$ we draw 100 i.i.d. realizations $(\beta, y)$, and thus of the circuit $C_{\beta,y}$, and plot the resulting distribution in the form of a box plot.
i.e., those with maximum vertex degree $\deg(G)$ upper bounded by a constant, independently of the number of qubits. Because, w.l.o.g., we can pick $\tau = \deg(G)$, the latter graphs model physical systems with geometrically constrained connectivity, which are ubiquitous both in condensed matter physics and quantum information processing. Examples of such graphs are lattices of constant geometric dimension $D$ and fixed-size primitive cells. In particular, the Hamiltonians of the main text are $L$-local and have $\deg(L) = 4$ ($\deg(L) = 5$) for architectures I-II (III).
(E5) where (a) $\max_i |V_i| \leq \alpha N$, (b) the terms $h^{(i)}_j$ are Hermitian, and (c) the energy distribution of $H_i$ can be sampled from via parallel measurements of $\tau$-body observables and efficient classical post-processing.

3. Arthur performs $m$ energy measurements for each Hamiltonian term $H_i$ on distinct copies of the state $\rho_p$ to determine an estimate $E^*$ of the expectation value $\sum_i \mathrm{tr}[\rho_p H_i]$, with $m$ given by expression (E6). Each $H_i$ is measured by a single-shot circuit of $\tau$-local observables and classical post-processing.
4. From the estimate $E^*$ he obtains an estimate $F^*_{\min}$ of the lower bound $F_{\min} = 1 - \langle H \rangle_{\rho_p}/\Delta$ [52] on the fidelity $F = F(\rho_p, \rho_0)$ such that $F^*_{\min} \in [F_{\min} - \epsilon, F_{\min} + \epsilon]$ with probability at least $1 - p_{\mathrm{err}}$.
5. If $F^*$
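j) i be the estimate of $\langle H_i \rangle$ on $\rho_p$ by a finite-sample average of $m$ measurement outcomes, and $\langle H \rangle^*_{\rho_p} = \sum_{i=1}^{\kappa} \langle H_i \rangle^*_{\rho_p}$ the resulting estimate of $\langle H \rangle_{\rho_p}$. Last, let $J = \max_\lambda \|h_\lambda\|$. Then, for any $p_{\mathrm{err}} \in [1/2, 1)$ and $\epsilon > 0$ it holds that $P\left(|\langle H \rangle^*_{\rho_p} - \langle H \rangle_{\rho_p}| \leq \epsilon\right) \geq p_{\mathrm{err}}$,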