Analyzing Prospects for Quantum Advantage in Topological Data Analysis

Lloyd et al. [1] were first to demonstrate the promise of quantum algorithms for computing Betti numbers, a way to characterize topological features of data sets. Here, we propose, analyze, and optimize an improved quantum algorithm for topological data analysis (TDA) with reduced scaling, including a method for preparing Dicke states based on inequality testing, a more efficient amplitude estimation algorithm using Kaiser windows, and an optimal implementation of eigenvalue projectors based on Chebyshev polynomials. We compile our approach to a fault-tolerant gate set and estimate constant factors in the Toffoli complexity. Our analysis reveals that super-quadratic quantum speedups are only possible for this problem when targeting a multiplicative error approximation and the Betti number grows asymptotically. Further, we propose a dequantization of the quantum TDA algorithm that shows that having exponentially large dimension and Betti number are necessary, but insufficient conditions, for super-polynomial advantage. We then introduce and analyze specific problem examples which have parameters in the regime where super-polynomial advantages may be achieved, and argue that quantum circuits with tens of billions of Toffoli gates can solve seemingly classically intractable instances.

I. INTRODUCTION
An important outstanding challenge in quantum computing is to find quantum algorithms that provide a significant speedup for practical problems. One area of great interest is quantum machine learning [2]. Early proposals included, for example, principal component analysis [3], and were often based on quantum solution of linear equations [4]. However, it has proven possible to dequantize many of these proposals, indicating that there is at most a polynomial speedup [5,6]. Analyses of the cost that take into account the overhead of quantum error-correction indicate that more than a quadratic speedup would be needed to provide a useful quantum advantage [7,8].
An algorithm for topological data analysis proposed by Lloyd et al. [1] turned out not to be directly "dequantizable" using the same techniques, raising the question of whether a greater speedup was possible. A simple analysis by Gunn et al. [9] contradicted some of the scaling results originally reported by Lloyd et al., and indicated that under certain assumptions there would still only be a quadratic speedup for these algorithms (our analysis agrees with that of Gunn et al.). Here we give a far more careful analysis of the complexity, and examine applications which provide better-than-quadratic speedups.
An important goal in data analysis is to extract features of a data set and use them to cluster or classify the data. This data set would be represented as a set of points in some metric space, such as ℝ^n with the Euclidean distance function. One approach for the analysis is to convert the point cloud into a graph where the vertices are the given data points and the edges are determined by whether or not pairs of points lie within a chosen distance ϵ. This approach can capture features such as connectivity but ignores potential higher dimensional features, especially if the data points are sampled from some underlying high-dimensional manifold. Topological data analysis (TDA) attempts to extract such higher dimensional global topological features of an underlying data set by applying techniques from the field of algebraic topology, in particular what is known as simplicial homology.
A simplex is a point, line segment, triangle, or higher-dimensional equivalent, and a simplicial complex is a collection of simplices. One can form a simplicial complex from the data set at a distance scale ϵ by promoting each set of points that are pairwise within distance 2ϵ to a simplex. The Betti number β_k is the number of k-dimensional holes of the complex. One can determine the Betti number for a chosen range of ϵ. Betti numbers which persist over an appreciable range of the values of ϵ are indicative of intrinsic topological features of the data set, as opposed to artifacts that appear at a particular scale and disappear shortly thereafter. The study of such features is referred to as persistent homology.
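To make these objects concrete, here is a minimal classical sketch (our illustration; a brute-force computation, not one of the optimized classical algorithms discussed later) that builds the clique complex of a small graph and computes Betti numbers from the ranks of boundary matrices:

```python
import numpy as np

def clique_complex(n, edges, max_size):
    """Enumerate all cliques of the graph as sorted vertex tuples, up to max_size vertices."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = {1: [(v,) for v in range(n)]}
    for s in range(2, max_size + 1):
        cliques[s] = [c + (v,)
                      for c in cliques[s - 1]
                      for v in range(c[-1] + 1, n)
                      if all(v in adj[u] for u in c)]
    return cliques

def boundary_matrix(faces, simplices):
    """Matrix of the boundary map: each simplex maps to its faces with alternating signs."""
    index = {f: i for i, f in enumerate(faces)}
    D = np.zeros((len(faces), len(simplices)))
    for j, s in enumerate(simplices):
        for i in range(len(s)):
            D[index[s[:i] + s[i + 1:]], j] = (-1) ** i
    return D

def betti(n, edges, d):
    """d-th Betti number of the clique complex: dim ker(boundary_d) - rank(boundary_{d+1})."""
    cl = clique_complex(n, edges, d + 2)
    c_d, c_d1 = cl[d + 1], cl[d + 2]
    rank_d = np.linalg.matrix_rank(boundary_matrix(cl[d], c_d)) if d > 0 and c_d else 0
    rank_d1 = np.linalg.matrix_rank(boundary_matrix(c_d, c_d1)) if c_d1 else 0
    return len(c_d) - rank_d - rank_d1

# The 4-cycle (a square): one connected component and one 1-dimensional hole
square = [(0, 1), (1, 2), (2, 3), (0, 3)]
```

For the square this gives β_0 = 1 and β_1 = 1; for the octahedron graph (all pairs of 6 vertices connected except 3 antipodal pairs), whose clique complex triangulates the sphere, it gives β_1 = 0 and β_2 = 1.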
The classical complexities of algorithms for estimating Betti numbers are typically exponential in k. That means the computation can be intractable even for a moderate amount of data. That is an important feature for the promise of quantum algorithms, because even fully error-corrected quantum computers with millions of physical qubits are expected to be very limited in data storage. The most promising applications of quantum computers are therefore those involving a limited amount of classical data that needs to be fed into the quantum algorithm as part of the problem specification.
Recent work on quantum TDA algorithms introduced more efficient fermionic representations of the Dirac operator [10] and employed the quantum singular value transformation to implement the kernel projector [11,12]. Some of these techniques have led to significant asymptotic improvements over the original approach, but it is unclear whether they are useful for reducing the fault-tolerant implementation cost for solving problems in practice. Indeed, to the best of our knowledge, no study has been done on the fault-tolerant implementation of quantum TDA algorithms for solving any instance of problems of practical interest.
In this work, we give a new algorithm for estimating Betti numbers on a quantum computer. We significantly reduce the cost of fault-tolerant implementation as compared to prior work, and we estimate the constant factors that are needed to give realistic estimates of gate counts. Specifically, we develop a new method to prepare the initial Dicke states, introduce improved amplitude estimation using Kaiser windows, directly construct the quantum walk operator from the block encoding, optimally project onto the kernel of the boundary map, and then use overlap estimation to estimate the kernel dimension of the block-encoded operator, leading to a quadratic improvement in precision over classical sampling. We also provide the concrete constant factors in the complexity of our algorithm and estimate its fault-tolerant cost, going beyond the asymptotic analyses of all existing work on quantum topological data analysis. Finally, we show that it is possible to construct specific data sets which have parameters in a regime where quantum TDA would appear to have a significant speedup. In particular, we give examples of a very specific family of problem instances exhibiting the required parameters for the quantum TDA to have a superpolynomial speedup over the naive general classical algorithm, and a more general family of instances that exhibits the required parameters for a quartic speedup. Here, we are comparing to well-studied classical approaches with complexity approximately linear in the possible number of cliques.
We provide a more detailed explanation of the technical background needed to understand Betti numbers in Section II. We then provide the improved algorithm and the analysis of its complexity in Section III. We use this result to analyse the regimes where large quantum speedups may be expected in Section IV. In particular, we consider cases where the Betti number would be large (implying a large quantum speedup) in Section IV A, and novel competing classical algorithms in Section IV D. We then conclude in Section V.

II. TECHNICAL BACKGROUND
Here we give the more detailed background that is needed to understand the standard approaches for this problem and our contribution. For the technical definitions of the simplicial complex and Betti number, see Appendix A.

A. Overview of the TDA algorithm and its implementation
In order to analyse the Betti numbers, the points and lines between the points are represented by a graph G. Then a simplex is represented by a clique in the graph (groups of vertices that are all connected by edges). The n vertices of the graph G are represented by n qubits. That is, |1⟩ |0⟩ · · · |0⟩ would represent the first vertex, and |0⟩ |0⟩ · · · |1⟩ |0⟩ would represent vertex n − 1. Note that this is a distinct representation from that often used to analyse sparse Hamiltonians, where each computational basis state represents a distinct vertex (so n qubits would represent 2^n vertices). In the representation here, a computational basis state with more than one |1⟩ would represent a clique of the graph (a set of vertices with edges connecting every pair of vertices). For example, |1⟩ |1⟩ |1⟩ |0⟩ · · · |0⟩ would represent a clique of the first three vertices.
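As a small illustration of this encoding (with the vertices 1-indexed, as above), one can map vertex subsets to basis-state labels:

```python
def basis_state(vertices, n):
    """Computational basis state, written as a bit string, for a subset of vertices.
    Vertex 1 sits on the leftmost qubit."""
    return ''.join('1' if v in vertices else '0' for v in range(1, n + 1))

# A single vertex is a Hamming-weight-1 state; a triangle of vertices {1, 2, 3}
# is the Hamming-weight-3 state |111000⟩ for n = 6.
```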
The entire Hilbert space can then be subdivided into subspaces of different Hamming weights. Using H_k to denote the space spanned by computational basis states with Hamming weight k, we have H = H_0 ⊕ H_1 ⊕ · · · ⊕ H_n, where dim(H_k) = C(n, k) is the binomial coefficient. This space includes all states of the various Hamming weights. One can also restrict to only states which represent vertices of cliques of the graph G. Denoting by H_k^G the space spanned by basis states of all k-cliques of G, we have H_k^G ⊆ H_k. We also use Cl_k(G) to denote the set of bit strings which correspond to k-cliques of G.
We define boundary maps ∂_k : H_{k+1} → H_k by their actions on the basis states |x⟩ ∈ H_{k+1} as

∂_k |x⟩ = Σ_{i=1}^{k+1} (−1)^{i−1} |x\(i)⟩ ,

where x\(i) means the ith 1 in the bit string x is set to 0. We also define ∂_k^G : H_{k+1}^G → H_k^G as the restriction of ∂_k to H_{k+1}^G. That is, it gives zero for any x not representing a (k+1)-clique of G. By definition, both im(∂_{k+1}^G) and ker(∂_k^G) are subspaces of H_k^G. But in fact we have im(∂_{k+1}^G) ⊆ ker(∂_k^G) ⊆ H_k^G, which follows from ∂_k^G ∂_{k+1}^G = 0: applying the boundary map twice clears two ones from the string, and each resulting term appears twice with opposite signs, so all terms cancel. Since im(∂_{k+1}^G) is a subspace of ker(∂_k^G), one can define the quotient space

H_k(G) := ker(∂_k^G) / im(∂_{k+1}^G) .

This space is called the kth homology group, and its dimension β_k^G := dim(H_k(G)) = dim(ker(∂_k^G)) − dim(im(∂_{k+1}^G)) is the kth Betti number. In practice, Betti numbers β_k^G can be used to extract features of the shape of the data modeled by the graph G, and their estimation is the main problem in the topological data analysis we will consider here. In this work we will be estimating the (k−1)th Betti number β_{k−1}^G, so we can simplify our discussion by considering Hamming weight k.
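The action of the boundary map, and the cancellation giving ∂∂ = 0, can be checked in a few lines (our illustration; the alternating-sign convention shown is the standard one, and any consistent sign convention gives the same kernel and homology):

```python
def boundary(state):
    """Apply the boundary map to a superposition stored as {bit string: amplitude}.
    The i-th 1 of x (counting from the left, i = 1, 2, ...) is cleared with sign (-1)^(i-1)."""
    out = {}
    for x, amp in state.items():
        ones = [j for j, b in enumerate(x) if b == '1']
        for i, j in enumerate(ones):
            y = x[:j] + '0' + x[j + 1:]
            out[y] = out.get(y, 0) + (-1) ** i * amp
    return out

# One application turns |111000⟩ into |011000⟩ − |101000⟩ + |110000⟩;
# a second application cancels every term in pairs.
once = boundary({'111000': 1.0})
twice = boundary(once)
```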
To describe our quantum algorithm and its circuit implementation for estimating Betti numbers, we will introduce the Dirac operator B_G. Specifically, for any graph G and a fixed value of k, we define

B_G = ⎡ 0              ∂_{k−1}^G    0       ⎤
      ⎢ (∂_{k−1}^G)†   0            ∂_k^G   ⎥
      ⎣ 0              (∂_k^G)†     0       ⎦ ,

where the blocks indicate the subspaces H_{k−1}^G, H_k^G and H_{k+1}^G. Since ∂_{k−1}^G ∂_k^G gives zero, squaring B_G yields a block-diagonal operator whose middle block corresponds to the combinatorial Laplacian [13]

∆_{k−1}^G = (∂_{k−1}^G)† ∂_{k−1}^G + ∂_k^G (∂_k^G)† .

It is known that [13][14][15] dim(ker(∆_{k−1}^G)) = β_{k−1}^G, which provides a convenient way of computing Betti numbers. Since B_G is Hermitian, the kernels of B_G and B_G² are identical. Therefore, to estimate the Betti number corresponding to a particular graph and a fixed value of k, it suffices to construct the Dirac operator and compute the dimension of its kernel on the subspace H_k^G.

It can be difficult in general to perform topological data analysis on a classical computer due to the high-dimensional nature of the problem, with the dimension increasing exponentially in k. However, the Dirac operator could be efficiently simulated on a quantum computer, in the sense of solving the Schrödinger equation with the Dirac operator as the Hamiltonian. That indicates exponential speedups are possible, though there are a number of other stages needed for the quantum algorithm. Previous work has provided several approaches for estimating Betti numbers on quantum computers. The stages of these approaches include preparation of a uniformly mixed state, construction of the projector onto the kernel subspace, and estimation of the overlap.
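The identity dim(ker(∆)) = β can be verified numerically on a toy instance (our sketch, using the 4-cycle, where there are no triangles and the Laplacian on edge states reduces to (∂_1)†∂_1):

```python
import numpy as np

# The 4-cycle: 4 vertices, 4 edges, no triangles, one 1-dimensional hole
n_vertices = 4
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]

# The boundary map sends an edge (u, v) with u < v to |v⟩ − |u⟩
# (clearing the first 1 gives +, clearing the second gives −)
D1 = np.zeros((n_vertices, len(edges)))
for j, (u, v) in enumerate(edges):
    D1[u, j], D1[v, j] = -1.0, 1.0

# With no triangles the combinatorial Laplacian on edge states is D1† D1
Delta = D1.T @ D1
evals = np.linalg.eigvalsh(Delta)
beta_1 = int(np.sum(evals < 1e-9))   # nullity of the Laplacian = first Betti number
```

The zero eigenvector is the uniform cycle around the square, and the nullity is 1, matching β_1 of a circle.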
The original approach of Lloyd et al. [1] applied amplitude amplification and estimation to prepare the desired initial state in H G k , starting from a uniform mixture of all Hamming weight k basis states in H k . Their approach actually produces a superposition over all values of k, so the success probability of obtaining a specific k can be quite low; this issue was addressed by later work such as [9,15]. To construct the projector onto the kernel, they implement Hamiltonian simulation and perform quantum phase estimation on the resulting operator. The Betti number is then estimated as the frequency of zero eigenvalues in the measurement. That is, the algorithm can be summarised as below.
1. For i = 1, . . . , m, repeat:
(a) Prepare the mixed state ρ_k^G = (1/|Cl_k(G)|) Σ_{x ∈ Cl_k(G)} |x⟩⟨x|.
(b) Apply quantum phase estimation to the unitary e^{iB_G t}.
(c) Measure the eigenvalue register to obtain an approximation λ_i.
2. Output the frequency of zero eigenvalues, |{i : λ_i = 0}|/m, as an estimate of the normalized Betti number β_{k−1}^G/|Cl_k(G)|.

In this work, we give a new algorithm for estimating Betti numbers on a quantum computer. We provide a number of improvements which significantly reduce the cost of fault-tolerant implementation. Specifically, we do the following.
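The measurement statistics of this sampling procedure can be mimicked classically on a toy instance (our sketch; for the maximally mixed state over clique states, each eigenvector of the Laplacian is sampled with equal probability, so the zero-eigenvalue frequency estimates the normalized Betti number):

```python
import numpy as np

# The 4-cycle again: 4 edge states, Laplacian with a 1-dimensional kernel
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
D1 = np.zeros((4, 4))
for j, (u, v) in enumerate(edges):
    D1[u, j], D1[v, j] = -1.0, 1.0
evals = np.linalg.eigvalsh(D1.T @ D1)

# Each phase-estimation round samples a uniformly random eigenvalue of the Laplacian
rng = np.random.default_rng(0)
m = 4000
samples = rng.choice(evals, size=m)
zero_fraction = np.mean(np.abs(samples) < 1e-9)   # estimates beta / |Cl_k(G)| = 1/4
betti_estimate = len(edges) * zero_fraction       # rescale to estimate beta itself
```

Note the m = 4000 samples only give the Betti number to accuracy ~1/√m of |Cl_k(G)|; this classical-sampling scaling is what the amplitude estimation of Section III F improves quadratically.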
• Develop new methods to prepare a mixture of fixed Hamming-weight states with garbage information that have significantly lower fault-tolerant cost.
• Introduce improved amplitude estimation using Kaiser windows, which we use to determine the number of steps of amplitude amplification needed.
• Directly construct the quantum walk operator from block encoding without an additional step of quantum simulation.
• Project onto the kernel of the boundary map by implementing a Chebyshev polynomial to optimally filter the zero eigenvalues. This is more efficient than previous approaches that implement the phase estimation or rectangular window functions for filtering.
• Use the overlap estimation to estimate the kernel dimension of the block-encoded operator, leading to a quadratic improvement in precision over the classical sampling approach used by previous work.
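The idea behind the Chebyshev-based filter in the fourth item can be illustrated numerically (a simplified toy spectrum on [0, 1] with the kernel at 0 and spectral gap g; the actual construction in Section III E acts on walk-operator eigenvalues, but the filtering principle is the same):

```python
import numpy as np
from numpy.polynomial import chebyshev

def T(d, x):
    """Chebyshev polynomial of the first kind, T_d(x)."""
    return chebyshev.chebval(x, [0] * d + [1])

g, d = 0.1, 40
scale = lambda lam: (2 * lam - (1 + g)) / (1 - g)   # maps [g, 1] onto [-1, 1]
filt = lambda lam: T(d, scale(lam)) / T(d, scale(0.0))

# filt(0) = 1 exactly, while eigenvalue 0 maps to a point outside [-1, 1] where
# T_d grows like cosh(d·arccosh(|x|)); so on [g, 1], where |T_d| <= 1, the filter
# is exponentially small in d·sqrt(g).
```

This extremal-growth property of Chebyshev polynomials is what makes the filter optimal compared with, e.g., rectangular window functions.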
We also provide the concrete constant factors in the complexity of our algorithm and estimate its fault-tolerant cost for solving example problems, going beyond the asymptotic analyses of all prior work on quantum topological data analysis. Ultimately, the performance of our algorithm (as well as other algorithms from previous work) will depend on several important problem parameters. First, the desired state on which we perform the kernel projector is a uniform mixture of all the |Cl_k(G)| basis states in H_k^G, whereas we start with a uniform mixture of all C(n, k) basis states in H_k. Their ratio C(n, k)/|Cl_k(G)| will determine the number of amplification steps required in the state preparation. There is potential to improve the efficiency of preparation of the cliques via an improved clique-finding algorithm.
Second, we need to implement a spectral projector that distinguishes the zero eigenvalue from the remaining nonzero eigenvalues of the Dirac operator, and the cost of implementing such a projector will depend on the spectral gap of the Dirac operator. Third, the output of the quantum TDA algorithm will not be the actual Betti number but instead the normalized quantity β_{k−1}^G/|Cl_k(G)|. In order to estimate the Betti number to some additive precision, we need to increase the complexity by a factor that depends on |Cl_k(G)|, with the result that the complexity would roughly scale as C(n, k). An alternative scenario is that a fixed relative error is required; that is, a fixed ratio of the uncertainty in the Betti number to the Betti number itself. Then the complexity would roughly scale as C(n, k)/β_{k−1}^G, as we show in Section III. This means that significant speedups can be provided in cases where the Betti number β_{k−1}^G is large, and we provide examples of such graphs in Section IV.
Our overall complexity may be summarised as in the following lemma.
Lemma 1 (Total complexity). The complexity of estimating the Betti number β_{k−1}^G of a graph G with n vertices to relative error r may be approximated as follows, for two different methods with probability of failure δ = δ_1 + δ_2, where r = r_1 + r_2 + r_3, |Cl_k(G)| is the number of k-cliques, and it is assumed we are given a classical database of edges of the graph. In the case where we are instead given a database of missing edges, 6|E| is replaced with 4|E^C| in the above expressions.
See Section III F for the explanation of this total complexity.

B. Complexity classes of TDA
Linear-algebraic applications of quantum computing have led to numerous suggestions of how various types of machine learning subroutines could be implemented on a quantum computer with superpolynomial speed-ups over their classical counterparts. Many of these methods were in the end shown to only suffice for at most polynomial speed-ups, due to the randomized "dequantizations" of Tang and others [5,6,16]. The algorithm of Lloyd et al. [1], however, turned out not to be directly "dequantizable" using similar techniques, raising the question of whether more robust complexity-theoretic quantum-classical separations can be proven. The current landscape on this topic is somewhat involved.
In general, we have a number of discrepancies between the computational problems in ordinary TDA applications and the computational problems for which we have certain complexity-theoretic insights. In ordinary TDA applications one is typically concerned with the computation of the exact count of zero eigenvalues of combinatorial Laplacians. By the result of [17], which shows that deciding whether a combinatorial Laplacian has a trivial or non-trivial kernel (i.e., Betti number zero or non-zero) is QMA_1-hard, this problem is likely beyond what is efficient even for quantum computers in the worst case. This observation is in line with classical bodies of work showing that exact computation of Betti numbers is NP-hard [18], and that it can even be PSPACE-hard for more involved topological spaces (e.g., so-called algebraic varieties) [19].
From the perspective of the types of problems quantum algorithms may be efficient for, one could attempt a few relaxations of the problem. First, it may be fruitful to relax the TDA problem with respect to the quantity estimated. Specifically, instead of counting the number of exactly zero eigenvalues, one could count the number of "small" eigenvalues (i.e., those below a threshold). This relaxation is convenient from a quantum algorithmic perspective, but it is also still useful from a data analysis perspective, since Cheeger's inequality demonstrates that the magnitudes of the small non-zero eigenvalues of the graph Laplacian characterise the connectedness of the graph [20], and similar results hold for combinatorial Laplacians [21]. In folklore it is conjectured that for difficult cases the magnitude of the smallest non-zero eigenvalue of combinatorial Laplacians very often scales inverse polynomially [14], in which case the number of "small" eigenvalues coincides with the number of zero eigenvalues if the threshold is chosen appropriately. While the problem of counting small eigenvalues is more suitable for solution on a quantum computer, it could turn out to still be QMA_1-hard if the TDA matrices have a sufficiently large spectral gap. Specifically, if the TDA matrices are sufficiently gapped, then one could count the number of zero eigenvalues (which is QMA_1-hard [17]) by counting the number of eigenvalues below the spectral gap (i.e., the number of "small" eigenvalues).
A related (yet different) problem for which complexity-theoretic results are known is that of estimating normalized Betti numbers to within additive inverse-polynomial precision. That is, the number of zero eigenvalues divided by the total number of eigenvalues, which here would be β_{k−1}^G/|Cl_k(G)| (if the TDA matrix is sufficiently gapped). This quantity is natural from a quantum computational complexity perspective (though not from an applications perspective), since a quantum algorithm naturally estimates probabilities (so normalized quantities in this case), and since additive errors allow for a direct relationship to definitions of complexity classes like DQC1.
Specifically, in [15] it was shown that the generalization of this problem, namely estimating the ratio when allowing a range of small eigenvalues rather than strictly zero eigenvalues, for arbitrary Hermitian operators (i.e., the so-called low-lying spectral density), is DQC1-hard. This result was built upon in [10], where it was shown that the problem remains DQC1-hard when restricting the input to combinatorial Laplacians of general chain complexes. It is unknown whether the hardness persists when further restricting to combinatorial Laplacians of clique complexes, and the closest result to this is the QMA_1-hardness result of [17] for the problem of exact counting. This normalized quantity is not typically studied, and indeed there are concerns that Betti numbers may fail to be large enough to be detectable when normalized (see also Appendix F).
As discussed above, estimates of the normalized Betti number with additive error are more natural from a quantum computational complexity perspective. However, from the perspective of applications, we typically work with (unnormalized) Betti numbers (and perhaps their estimates). In this case, the rescaling from normalized Betti numbers to Betti numbers causes, in general, an exponential blow-up of additive errors, and leads to algorithms which always have exponential run-times (for constant error). At the same time, in many applications we only require small additive errors when the quantities in question are themselves small. For these reasons we focus here on estimation to within a given relative error; that is, the error in the Betti number divided by the Betti number. That is immune to rescaling and can lead to efficient algorithms in cases where the Betti numbers are large.
Note that the problem of estimating the low-lying spectral density up to a certain relative error is also DQC1-hard. The reason is that the relative error must always be at least as large as the error in the normalised quantity, and estimating the normalised quantity to additive precision ϵ is DQC1-hard for ϵ = 1/poly(n) [15]. It is unknown whether the hardness result holds for Betti numbers, because they are found by restricting to combinatorial Laplacians, rather than arbitrary Hermitian operators. Generally, the larger the Betti number the more efficient the quantum algorithm will be, which in certain cases results in a polynomial quantum runtime. Examples of cases where the Betti numbers are large are discussed in more detail in Section IV.

III. OPTIMIZATION AND ANALYSIS OF QUANTUM TOPOLOGICAL DATA ANALYSIS
In this section we describe our algorithm in detail. Subsections III A, III B, and III C are for preparing a state similar to ρ_k^G in prior work; that is, a mixture of states |x⟩ that correspond to k-cliques of the graph G. The general principle is to first prepare a Dicke state, which is explained in Subsection III A. That is an equal superposition of all states with k ones, of which only a subset will be k-cliques. Therefore, Subsection III B then describes how to efficiently detect which of those states are k-cliques.
In order to provide a further speedup we then use amplitude amplification, as described in Subsection III C.
The key difficulty there is that the number of steps of amplitude amplification depends on the amplitude. We therefore use amplitude estimation separately from the amplitude amplification. This provides a significant advantage over fixed-point amplitude amplification [22], which incurs a logarithmic overhead, because we only need to perform the estimation once but perform the amplification many times. Moreover, our amplitude estimation technique using Kaiser windows improves over standard amplitude estimation, and can be used in far more general applications.
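The signal-processing fact underlying the Kaiser-window improvement can be illustrated directly (our sketch, not the quantum circuit itself; the window's spectral sidelobes control the tails of the estimator's error distribution, and the Kaiser window trades a slightly wider main lobe for dramatically smaller sidelobes than the rectangular window implicit in standard phase estimation):

```python
import numpy as np

def peak_sidelobe_db(window, pad=8192):
    """Peak sidelobe level of a window's spectrum, in dB relative to the main lobe."""
    spec = np.abs(np.fft.rfft(window, pad))
    spec /= spec.max()
    i = 1
    while i + 1 < len(spec) and spec[i + 1] < spec[i]:
        i += 1                      # walk down the main lobe to its first null
    return 20 * np.log10(spec[i:].max())

m = 64
rect_db = peak_sidelobe_db(np.ones(m))          # rectangular window: about -13 dB
kaiser_db = peak_sidelobe_db(np.kaiser(m, 10))  # Kaiser, shape parameter 10: far lower
```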
Then in Subsection III D we describe how to block encode the operator B_G in order to provide a quantum walk operator that has eigenvalues related to those of B_G, as in [23,24]. In particular, we need to find eigenvalues ±1 of this walk operator, which correspond to eigenvalue 0 of B_G. This provides a significant advantage over prior work that was based on simulating Hamiltonian time evolution under B_G, because we avoid the overheads inherent in simulating Hamiltonian evolution.
Next, in Subsection III E we show how to use an optimal filter to find eigenstates of B_G with eigenvalue 0 (or ±1 for the walk operator). This method improves over prior work which used phase estimation; phase estimation has an overhead because it provides more information (an estimate of the eigenvalue rather than just a distinction between zero and nonzero eigenvalues).
Finally, in Subsection III F we use amplitude estimation again to estimate the proportion of zero eigenvalues, then provide the overall complexity. The use of amplitude estimation here provides a square-root speedup over work based on classical sampling.

A. Generating Dicke states with garbage
In this section, we consider preparing an n-qubit uniform superposition of Hamming weight k basis states (which is allowed to be entangled with garbage states). Such a state is known in the literature as a Dicke state. In [25] it was shown how to prepare a Dicke state with O(nk) gates, although these gates included rotations, so there would be a logarithmic factor in the complexity when counting non-Clifford gates.
Because the preparation here allows an entangled state to be prepared between the superposition for the Dicke state and ancilla states, it is possible to prepare the state more efficiently. One approach is to apply a quantum sort to n registers, then use it to apply an inverse sort to the n qubits with the first k set in the state |1⟩. This is similar to the approach used for symmetrising states for chemistry in [24]. Another approach is to use inequality testing to obtain k successes. Both those approaches give a factor of log n in the number of qubits required, which is costly when n is large. Throughout we use "log" for base 2, and "ln" for natural logs.
We provide two schemes here. In our first scheme we prepare n registers with approximately log n qubits in equal superposition. This is similar to the first step in [24] where a sort was used. Here we instead find a threshold such that k registers are less than or equal to this threshold. This approach is explained in detail in Appendix B 1.
Our second scheme is based on preparing separate superposition states for the positions of each of the individual ones. The target state is obtained by adding all those ones into the target state, but has a higher amplitude for failure arising from ones in the same locations. We provide the details in Appendix B 2. The complexities of these two schemes are as in the following lemma.
Lemma 2 (Dicke preparation). The Dicke state with k ones in n qubits may be prepared with probability of success using (n_seed + 1)(n/2)(n_seed + 2) + ⌈log n⌉ Toffolis, where n_seed is a number of seed qubits for some constant c. Alternatively it may be prepared with probability of success k! C(n, k)/n^k, with a Toffoli complexity that is reduced when n is a power of 2. For this preparation, the state may be entangled with an ancilla system.
Although the first scheme has better asymptotic complexity of O(n), we find that for realistic parameters its complexity is considerably larger. The lower probability of success of the second scheme results in a larger factor in the complexity, so the approach that is optimal will depend on the parameters.
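The success probability of the second scheme is easy to check numerically (our sketch; success corresponds to the k independently chosen positions of the ones all being distinct, which happens with probability k! C(n, k)/n^k = n(n−1)···(n−k+1)/n^k):

```python
from math import comb, factorial
import numpy as np

n, k = 32, 4
p_exact = factorial(k) * comb(n, k) / n**k   # probability that k uniform draws collide nowhere

# Monte Carlo check: draw k positions independently and test for distinctness
rng = np.random.default_rng(7)
draws = rng.integers(0, n, size=(100_000, k))
p_sim = np.mean([len(set(row)) == k for row in draws])
```

For n = 32 and k = 4 the success probability is about 0.82, so only a modest amount of repetition (or amplitude amplification) is needed; the probability degrades as k²/n grows.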
In comparison, prior work in Refs. [9,15] used a procedure based on an integer enumeration of all basis states for the Dicke state. Lloyd et al. [1] used a method with a superposition over values of k that is not directly comparable. The complexity in Ref. [9] does not appear correct (the complexity in Ref. [15] just cites that result). The method it uses is to first compute a Pascal's triangle of binomial coefficients up to C(n, k) with complexity O(n²k).
To convert a natural number l to a Hamming-weight-k string, it then starts by finding the largest value of x such that C(x, k) < l. It is said that the value of x can be found using O(k) gates via a binary search using the Pascal's triangle as a lookup table. The complexity is given as the number of steps in the binary search, which is not correct. The reason is that l is given in quantum superposition, so x needs to be searched for in superposition, and finding the appropriate entry in the lookup table (to perform the inequality test for the binary search) has complexity scaling with the size of the lookup table. The value of k is fixed, so not the entire lookup table is needed, but there are n entries needed, and each has size O(k log n). This gives a complexity of O(kn log n), which then needs to be performed O(log n) times in the binary search, giving complexity O(kn log² n). That complexity needs to be multiplied by the k steps of the algorithm to give overall complexity O(k²n log² n) for the conversion.
The complexity of converting in the opposite direction, from a Hamming-weight-k string to a natural number, is given correctly in Ref. [9] as O(nk). The complexity of the Pascal's triangle is somewhat less than that given in Ref. [9]. It can be calculated classically and entered into quantum registers with O(nk log n) Clifford gates, so zero Toffoli complexity. The complexity of preparing an equal superposition over natural numbers is O(k log n). That is omitted in Ref. [9], which is reasonable because it is smaller than the other complexities.
So in Ref. [9], the leading-order complexity of the Dicke state preparation is O(k²n log² n), which is a factor of k² larger than the O(n log² n) complexity of our first approach, and a factor of O(k) larger than the complexity of Dicke state preparation in Ref. [25]. In the example we give in Section IV A, k = 16, so the factor of k² is 256, and our first approach has about two orders of magnitude improvement in the complexity of this step as compared to Ref. [9]. For that example there is about another factor of 4 improvement from using our second approach.
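For reference, the integer-to-bit-string conversion underlying the Pascal's-triangle method is the standard combinatorial number system, which can be sketched classically as follows (our illustration; the cost analysis above concerns implementing this map in quantum superposition):

```python
from math import comb

def unrank(l, n, k):
    """The l-th (0-indexed) n-bit string with k ones, ordered by numeric value
    (the combinatorial number system)."""
    bits = []
    for pos in range(n - 1, -1, -1):
        if k > 0 and l >= comb(pos, k):
            bits.append('1')          # bit at position pos belongs in the string
            l -= comb(pos, k)
            k -= 1
        else:
            bits.append('0')
    return ''.join(bits)

def rank(x):
    """Inverse map: Hamming-weight-k string -> natural number in [0, C(n, k))."""
    l, k = 0, x.count('1')
    for i, b in enumerate(x):
        if b == '1':
            l += comb(len(x) - 1 - i, k)
            k -= 1
    return l
```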

B. Detecting the cliques
In the previous section, we have discussed the preparation of the n-qubit Dicke state with Hamming weight k (and an additional garbage register), i.e., a state of the form

(1/√C(n, k)) Σ_{x : |x| = k} |x⟩ |garb_x⟩ .

Here, the first register holds all n-qubit strings with Hamming weight k, representing subsets of k vertices in an n-vertex graph G. We now describe a quantum circuit that detects whether a given string x represents a k-clique in the underlying graph, with the promise that x has Hamming weight k. Specifically, our goal is to implement the mapping

|x⟩ |0⟩ |0⟩ ↦ |x⟩ |f(x)⟩ |garb_x⟩ ,

where the second register has value f(x) = 1 if x represents a k-clique in G and 0 otherwise. The third register contains some garbage information garb_x that can depend on x and need not be uncomputed.

Our implementation of the clique detection is related to the approach of [26]. Specifically, we introduce a register of ⌊log C(k, 2)⌋ + 1 ≤ 2 log k qubits to represent the integers 0, . . . , C(k, 2). This register will be used to count the number of edges in the subgraph induced by the k vertices denoted by x. For the graph, we assume that it is given by a classical database, so we need to run through this classical database rather than assuming any oracular access to the graph. Let us assume that we have a listing of all edges in the graph; that is, for each edge, we have a listing of the two nodes. In order to implement this classical data, for each edge in the list we use a Toffoli with the qubits representing those two nodes as controls, and an ancilla as target.
In the case where both qubits are in the state 1, the ancilla qubit will be flipped.
The complexity is then given by a number of Toffolis equal to the number of edges, which we denote |E|. We aim to sum all the bits output by these Toffolis. Provided that we are restricted to Hamming weight k, if x represents a k-clique then every pair of ones in x will result in a 1, so the sum will yield \binom{k}{2}. Summing the bits in the obvious way would yield a complexity scaling as 2|E| log k Toffolis, because each addition requires multiple Toffolis. An improved method is given in [27], which would take no more than |E| Toffolis, but the same number of ancillas would be required, which would typically be a prohibitively large cost. An alternative way of summing bits is given in [7], where multiple groups of bits are summed, and their sums are then summed. The overall complexity is no more than 2|E| Toffolis, and only a logarithmic number of ancillas is used. The costs of the three main parts of the algorithm are as follows.
1. There is cost |E| Toffolis for checking the edges of the graph. The resulting qubits can be erased with measurements and phase corrections, with zero Toffoli cost.
2. The complexity of the efficient bit sum approach from [7] is 2|E|.
3. There is complexity no more than 2 log k Toffolis to check that the output register is equal to \binom{k}{2}.
Therefore, the total cost of clique detection is no more than 3|E| + 2 log k Toffolis. In many cases we will need to reflect on the result of this test. In that case the 2 log k cost is not doubled, because we can replace the equality test with a controlled phase. Therefore the cost in that case is 6|E| + 2 log k. If we were to retain the qubits resulting from the edge checking and use the sum from [27], the cost would be 2|E| + 2 log k, though with a large ancilla cost. Later, when we consider the block encoding of the Hamiltonian, we will need to allow a wider range of Hamming weights, k − 1, k and k + 1, in the case where we are block encoding the Hamiltonian projected onto this subspace. First we can sum the ones in the string x, which has Toffoli complexity n. We can check if the sum is equal to k − 1 with ⌈log n⌉ Toffolis, then check if it is k or k + 1 with further Toffolis using the unary iteration procedure. The number of Toffolis needed depends on the value of k, and some values will require about another ⌈log n⌉ Toffolis. For each case we can use CNOTs to output a success flag on an ancilla qubit.
We can also output the value of \binom{k-1}{2}, \binom{k}{2}, or \binom{k+1}{2} in another register. In this case we would need to apply an equality test between the result in our sum register and the result in this register, which again has a Toffoli complexity no larger than 2 log k. There are also n Toffolis needed to sum the ones in x and no more than 3⌈log n⌉ Toffolis to check the number of ones.
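As a classical sanity check of this edge-counting logic (an illustrative sketch, not the quantum circuit), the following counts the edges induced by a weight-k bitstring and compares against \binom{k}{2}:

```python
def is_clique(x, edges, k):
    """Classical analogue of the edge-counting clique test.

    x: tuple of n bits with Hamming weight k (the chosen vertex subset).
    edges: set of frozensets {u, v} giving the graph's edge list.
    Returns True iff the k chosen vertices induce k(k-1)/2 edges.
    """
    chosen = {v for v, bit in enumerate(x) if bit}
    assert len(chosen) == k, "promise: Hamming weight k"
    # One "Toffoli" per listed edge: it fires when both endpoints are chosen.
    induced = sum(1 for e in edges if e <= chosen)
    return induced == k * (k - 1) // 2

# Example: a triangle {0, 1, 2} plus a pendant vertex 3.
edges = {frozenset(e) for e in [(0, 1), (0, 2), (1, 2), (2, 3)]}
```

For the triangle the count reaches \binom{3}{2} = 3, while any subset containing the pendant vertex falls short.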
Much of the complexity of the algorithm is due to the use of amplitude amplification to find the cliques. There has been much work on quantum algorithms for clique finding, but these algorithms are typically posed in terms of calls to an oracle for the graph, with a possibly large complexity for additional gates. That is, the complexity in terms of oracle calls is no more than O(n²) to find all the edges, but there can then be a very large amount of postprocessing to find the cliques.
When there are n vertices there cannot be any more than n(n − 1)/2 edges, but in practice we would only use the above algorithm for |E| less than half this. The reason is that for larger numbers of edges, it is more efficient to use a database of missing edges. If we use such a database, we can then iterate through all pairs of nodes in the database and use a Toffoli controlled on the corresponding qubits. If we find any cases where this gives one, it means that there is a missing edge and the state does not represent a clique. We therefore need to perform an OR on all the resulting qubits.
Similarly to the case above using the list of edges, one could perform an approach using addition and obtain a complexity with |E| replaced by |E_C|, where E_C is the set of missing edges. But, since we only need to find a single one among all the results, we can instead use an approach for a multiply-controlled Toffoli with a limited number of ancillas. If one is willing to use about √|E_C| ancillae, then the cost is 2|E_C|, with the same cost for erasure.
In particular, consider grouping the list of missing edges into sets of size approximately √|E_C|. For each group we can perform the Toffolis with the qubits representing the corresponding nodes as controls, then perform a multiply-controlled Toffoli with those result qubits as controls (with the appropriate bit flips to give an OR). The multiply-controlled Toffoli has Toffoli cost one less than the number of qubits in the group. So, for example, if √|E_C| is an integer, then the cost is √|E_C| − 1. The qubits giving the results of the Toffolis for the missing edges can be erased using Clifford gates similarly as before.
We do this for each of the approximately √|E_C| groups. Then we perform a multiply-controlled Toffoli on the results for all these groups. Because there are about √|E_C| groups, this is again the Toffoli and ancilla cost. For example, if √|E_C| is an integer, then there would be Toffoli cost (√|E_C| − 1)√|E_C| for the √|E_C| groups, followed by √|E_C| − 1 for the multiply-controlled Toffoli on the results, for cost |E_C| − 1. That is combined with the |E_C| cost of the Toffolis for the individual missing edges, for total cost 2|E_C| − 1. Given the improved efficiency of this approach, it would be preferable to using the list of edges for |E| above about n²/5. We can therefore summarise the complexity of the clique checking as follows.
Lemma 3 (Clique checking). It can be checked that an n-qubit state corresponds to a clique of graph G using 3|E| + 2 log k Toffolis given a classical database of edges E, or 2|E C | given a classical database of missing edges E C . The costs of reflecting on the result of the clique check are 6|E| + 2 log k or 4|E C | Toffolis.
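The Toffoli-count arithmetic behind the missing-edge variant can be checked with a short script (assuming, as in the text's example case, that √|E_C| is an integer):

```python
import math

def missing_edge_toffoli_cost(num_missing):
    """Toffoli count for the grouped OR over missing-edge checks.

    Groups of size sqrt(num_missing): one Toffoli per missing edge, a
    multiply-controlled Toffoli per group, then one over the group results.
    Assumes sqrt(num_missing) is an integer.
    """
    g = math.isqrt(num_missing)
    assert g * g == num_missing, "example assumes a perfect square"
    per_edge = num_missing           # one Toffoli per missing edge
    per_group_mct = (g - 1) * g      # (sqrt - 1) Toffolis in each of sqrt groups
    final_mct = g - 1                # OR over the sqrt group results
    return per_edge + per_group_mct + final_mct
```

The total is (√|E_C| − 1)√|E_C| + (√|E_C| − 1) + |E_C| = 2|E_C| − 1, matching the text.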
The complexity in Ref. [1] is given as O(k²) in terms of calls to an oracle for the distances between the points. Similar complexities are given in Refs. [9,15] but are not explained, so they are likely using the same assumption as Ref. [1]. That complexity is not directly comparable to the result here, because we are instead assuming that we are given an explicit listing of edges (or missing edges).
In order to provide a comparison between that approach and ours, we can consider a slight modification of it where a database of locations of points is given. In that case, one can use a quantum sort on the Dicke state to also sort the locations of the nodes into the first k data locations. That sort has complexity O(n log n log(1/ϵ)), given that the locations are specified to accuracy ϵ (relative to the range of positions). Then the distances can be checked with complexity O(k² log²(1/ϵ)). The factor of the squared logarithm here comes from the complexity of performing the squaring operations used to determine the (squared) distances.
We consider an example of a graph in Section IV A with n = 256, k = 16 and |E| = 30720. In that example, |E C | = 1920, so it is more efficient to use the list of missing edges in our approach. In comparison, if one were to attempt to use the database of locations, then just the sort would have higher complexity than 2|E C |. If edges were determined from positions given in a three-dimensional space, then an accuracy of the components of the locations of only 4 bits would result in complexity larger than our costing by an order of magnitude. Although it is difficult to compare our approach to Refs. [1,9,15] due to the different model, we can expect an actual implementation of that type of approach to have at least an order of magnitude larger complexity.
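The bookkeeping for this example is easy to verify; the costs below are the leading-order counts from Lemma 3 (ignoring the 2 log k term):

```python
# Parameters of the Section IV A example.
n, k = 256, 16
E, E_C = 30720, 1920

total_pairs = n * (n - 1) // 2      # n(n-1)/2 possible edges
edge_list_cost = 3 * E              # clique check from the edge list
missing_edge_cost = 2 * E_C         # clique check from the missing-edge list
```

Note |E| = 30720 exceeds the n²/5 ≈ 13107 crossover, consistent with the missing-edge list being preferable here.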

C. Amplifying the initial state
We aim to amplify the initial state so that we obtain the state with an equal superposition over cliques. The strategy is to estimate the amplitude just once at the start, then apply the appropriate number of steps of amplitude amplification each time we prepare the state used to estimate the size of the null eigenspace. The complexity of estimating the amplitude is given in the following Lemma.
Lemma 4 (Quantum amplitude estimation). Let U be a unitary and let 0 < a < 1 be such that

U|0⟩ = a|ψ_g⟩ + √(1 − a²)|ψ_b⟩

for orthogonal normalised states |ψ_g⟩ (flagged as success) and |ψ_b⟩. There exists a quantum algorithm which estimates a to within error ϵ with probability of error less than δ, using approximately ln(1/δ)/(2ϵ) applications of U and U†.

The proof of this Lemma is given in Appendix D. To see the value of ϵ needed, note that the probability of success will be reduced to approximately sin²((1 ± ϵ/a)π/2) if we incorrectly choose the number of iterates in the amplitude amplification due to imprecision in estimating the amplitude. That translates to a probability of failure of the amplitude amplification of approximately (ϵπ/2a)². For our application, the amplitude is approximately

a ≈ √(1 − 1/2c) √(|Cl_k(G)|/\binom{n}{k}) ,

where the first factor comes from failure of the Dicke state preparation, and the second from the clique checking. For simplicity, in the following expressions for complexity we will omit the factor involving 1 − 1/2c, which is close to 1. The amplitude estimation is needed because the number of cliques |Cl_k(G)| will typically be unknown. Inaccuracy in the amplitude estimation translates to a probability of failure of the amplitude amplification due to using an incorrect number of steps. In practice the "failure" of the amplitude amplification is not a major problem, because it can be absorbed into the uncertainty in the estimation of the Betti number. That is, in the next step instead of estimating the Betti number relative to |Cl_k(G)|, we will be estimating it relative to a value that may be increased by about a factor of 1/[1 − (ϵπ/2a)²] (using the approximation of the sin function). If we want (ϵπ/2a)² to be no more than a relative error r, then we should choose

ϵ = (2a/π)√r .

That means that the cost would be approximately

(π/4) √(\binom{n}{k}/|Cl_k(G)|) ln(1/δ)/√r

steps. In comparison, the number of steps of the amplitude amplification is approximately (π/4)√(\binom{n}{k}/|Cl_k(G)|). That is, the amplitude estimation is more costly by a factor of ln(1/δ)/√r.
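As a rough numerical illustration of these parameter choices (assuming an estimation cost of about ln(1/δ)/(2ϵ) steps, and with all constants indicative only):

```python
import math

def ae_vs_aa_steps(a, r, delta):
    """Illustrative step counts for one-off amplitude estimation versus
    a single round of amplitude amplification.

    a:     initial amplitude of the clique superposition.
    r:     target relative error from amplification failure.
    delta: allowed failure probability of the estimation.
    """
    eps = (2 * a / math.pi) * math.sqrt(r)      # makes (eps*pi/(2a))^2 = r
    ae_steps = math.log(1 / delta) / (2 * eps)  # one-off amplitude estimation
    aa_steps = math.pi / (4 * a)                # amplification steps per use
    return ae_steps, aa_steps

ae, aa = ae_vs_aa_steps(a=1e-3, r=0.01, delta=0.05)
```

The ratio ae/aa reduces algebraically to ln(1/δ)/√r, the overhead factor quoted above.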
The cost of the Dicke state preparation from Lemma 2 will be doubled in amplitude estimation and amplification, because we must account for the need to unprepare the Dicke state. We also need to reflect on the result of the clique check, with complexity 6|E| + 2 log k or 4|E_C| as described in Lemma 3. The total complexity for each step of the amplitude estimation and amplification is the sum of these two complexities.
This approach of separating the estimation and amplification provides a significant improvement over the obvious approach of using fixed-point amplitude amplification [22] to provide amplification with an unknown overlap. That requires a logarithmic factor in the complexity similar to amplitude estimation. In contrast, here we only have that logarithmic factor in the cost once in the initial amplitude estimation, then in the remainder of the algorithm we eliminate the logarithmic factor by just performing amplitude amplification with the initially estimated amplitude.
It is somewhat difficult to compare our approach to that in Refs. [1,9,15]. Reference [1] just invokes the "multi-solution version of Grover's algorithm", which is not sufficiently specific to give a complexity, because there are multiple approaches. Reference [9] cites the version of Grover's algorithm from [28], and Ref. [15] just mentions Grover's algorithm and uses the complexity from [9]. The problem with citing [28] is that doing so does not specify exactly which approach is intended.
One approach for searching with an unknown number of solutions given in that work is to just use the approach of [29], which would give a factor of 9/π in the complexity, but that approach would not be compatible with a later amplitude estimation used in Ref. [9]. That is because the approach of [29] relies on a sequence of measurements to obtain success of the search. The measurements would prevent the later amplitude estimation (for the number of zero eigenvalues) being used.
Reference [28] also mentions the approach of performing amplitude estimation, followed by Grover's algorithm for a known number of solutions. Our proposal here is to divide between using amplitude estimation once, followed by amplitude amplification based on the estimation many times within the rest of the algorithm. That gives a significant improvement over using both at every step (which would be the obvious interpretation of just citing [28]).
Moreover, we provide a significant improvement in the efficiency of amplitude estimation over that in [28]. See Theorem 6 of that work for their result in terms of the error in the squared amplitude. Translating that to the error in the amplitude, the number of steps needed is approximately π/ϵ to obtain 1 − δ = 8/π². Repetitions would be needed to obtain a desired δ, which would typically be smaller. If, for example, δ = 1/20 and the number of repetitions is 5, then our approach gives about an order of magnitude improvement.

D. Block-encoding the sparse Hamiltonian
Having constructed the sparse oracles in the previous section, we now implement a quantum circuit that block-encodes the sparse Hamiltonian. Block encoding is a generalisation of a linear combination of unitaries, where an operator B is given by ⟨0|U|0⟩ = B/λ for a unitary operator U acting on an ancilla system as well as the target system, with |0⟩ denoting the initial state of the ancilla. Together with a reflection on the ancilla system, it can then be used to construct what was dubbed a "qubitised" or "qubiterate" operator. These principles were introduced in Ref. [23]. We use a similar principle as in [10,11], except here we are implementing the Dirac operator B_G rather than the combinatorial Laplacian. In [10] it is shown that the Dirac operator for all Hamming weights and unrestricted by the cliques can be written as

B = Σ_{j=1}^n (a_j + a_j†) ,

where a_j and a_j† are fermionic annihilation and creation operators on qubit j. Using the usual Jordan-Wigner representation, this gives the Hamiltonian

B = Σ_{j=1}^n Z_1 ⋯ Z_{j−1} X_j ,

where the subscripts indicate the qubits that these operators act on (starting the numbering from 1). This is the core of the implementation of the complete Hamiltonian, and can easily be implemented by first preparing an equal superposition state over n basis states, then applying the controlled string of Pauli operators as in Figure 9 of [30].
To understand the reason that the Pauli string encodes the matrix, note that ∂ k will remove a one from some location in the bit string x of Hamming weight k + 1 and apply a sign according to the number of ones prior to that location. That can be achieved by applying an X in that location, and applying Z gates on all qubits prior to that location. We need a superposition of applying the X in all locations where there are ones. Moreover, we also want to apply ∂ † k+1 to a bit string of Hamming weight k + 1. This involves flipping a zero to a one (which can be done with an X gate) and applying a sign according to the number of ones prior to that position. This can again be done using a string of Z gates. Now we want a superposition of performing X gates at all locations where there are ones, and X gates where there are zeros, which can be implemented by the above sum of Pauli strings.
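This correspondence can be checked concretely by building the sum of Pauli strings as a matrix for small n (a classical check; qubit 1 is taken as the most significant bit):

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def dirac_operator(n):
    """B = sum_j Z_1 ... Z_{j-1} X_j, the Jordan-Wigner form of sum_j (a_j + a_j^dag)."""
    B = np.zeros((2**n, 2**n))
    for j in range(n):
        term = np.array([[1.0]])
        for op in [Z] * j + [X] + [I2] * (n - j - 1):
            term = np.kron(term, op)
        B += term
    return B

B = dirac_operator(3)
```

Each nonzero entry connects bit strings differing in a single bit, with sign (−1) raised to the number of ones before that bit: e.g. ⟨110|B|111⟩ = +1 (two ones precede the flip) while ⟨101|B|111⟩ = −1 (one one precedes it).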
Here we aim to block encode the matrix B_G defined in Eq. (6). The difference of this from the unrestricted case B in [11] is that it only acts on states with Hamming weight k − 1, k, k + 1, and gives zero otherwise. Similarly, it only gives output states with Hamming weight in this range. Moreover, B_G is restricted to the clique subspace. That means it must give zero if the input state is not a clique, and must also not give any output states that are not cliques. Next we provide a general method of constructing a qubiterate operator in cases where tests on the system state are required. The block encoding with the tests can be described as

(⟨0| ⊗ P) V (|0⟩ ⊗ P) = B_G/λ ,

where P is a projection on the system that tests the Hamming weight and cliques. We are adopting notation similar to Eq. (3) in [24], but replacing the identity with P to indicate that a projection is needed on the target system. We will assume V is Hermitian; if it is not, we can construct a Hermitian operator by block encoding it as V → V ⊗ |1⟩⟨0| + V† ⊗ |0⟩⟨1| [31]. Similarly, we are writing B_G for the operator we aim to block encode, but this reasoning applies for a more general Hamiltonian H.
If |k⟩ is an eigenstate of B_G with energy E_k satisfying P|k⟩ = |k⟩, then by definition we must have

V|0⟩|k⟩ = (E_k/λ)|0⟩|k⟩ + √(1 − (E_k/λ)²) |0_k^⊥⟩ ,

where |0_k^⊥⟩ is defined as a state such that

(⟨0| ⊗ P)|0_k^⊥⟩ = 0 .

Then we can define the qubiterate as

W ≡ i R V , with R = 2|0⟩⟨0| ⊗ P − 1 .

This is similar to that in [24], except we have included the projection P in the reflection operation. That is, we are applying the tests as part of the reflection, instead of applying them in the operation V.
Then we obtain

W|0⟩|k⟩ = i[(E_k/λ)|0⟩|k⟩ − √(1 − (E_k/λ)²)|0_k^⊥⟩] .

It is also found that

W|0_k^⊥⟩ = i[√(1 − (E_k/λ)²)|0⟩|k⟩ + (E_k/λ)|0_k^⊥⟩] .

Here we have corrected a minor error from [24] where there was an i appearing on the second term. See Appendix E for the derivation. Then it is easy to see that

(1/√2)(|0⟩|k⟩ ∓ i|0_k^⊥⟩)

are eigenstates of W with eigenvalues ±e^{±i arcsin(E_k/λ)}. This is the usual relation for the eigenvalues of the qubitised operators, showing that this approach for constructing the walk operator works. For our implementation here, V is just the controlled string of Pauli operators together with preparation of an equal superposition state. The reflection on the target system expressed by the projector can be implemented by computing an ancilla qubit flagging that the projection is satisfied (we have the appropriate Hamming weight range and cliques), reflecting on that qubit and the control qubits, then uncomputing the test. In some cases this can give a significant reduction in complexity over performing the test before and after V. If the ancilla qubits used to compute the tests are retained, then they can be erased with Clifford gates and measurements. For the application here that would be too costly in terms of ancilla qubits, so we incur the Toffoli cost of the test again in erasing the ancillas.
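The eigenvalue relation can be checked numerically on the two-dimensional invariant subspace spanned by |0⟩|k⟩ and |0_k^⊥⟩. The phase convention W = iRV used here is one choice (an assumption of this sketch) that yields eigenvalues ±e^{±i arcsin(E_k/λ)}:

```python
import numpy as np

def qubiterate_eigs(s):
    """Eigenvalues of the qubiterate on the 2D invariant subspace.

    s = E_k / lambda. In the basis {|0>|k>, |0_k^perp>}, the Hermitian
    block encoding V acts as [[s, c], [c, -s]] with c = sqrt(1 - s^2), and
    the reflection R = 2|0><0| x P - 1 acts as diag(1, -1). The factor of i
    in W = i R V is a phase convention assumed for this check.
    """
    c = np.sqrt(1 - s**2)
    V = np.array([[s, c], [c, -s]])   # Hermitian, V @ V = I
    R = np.diag([1.0, -1.0])          # reflection about |0>|k>
    W = 1j * R @ V
    return np.linalg.eigvals(W)

eigs = qubiterate_eigs(0.3)
expected = {np.exp(1j * np.arcsin(0.3)), -np.exp(-1j * np.arcsin(0.3))}
```

For s = 0 this reproduces the eigenvalue pair ±1, i.e. phases 0 and π, used by the filtering step below.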
For the complexity of the implementation we have the following costs.
1. Preparing an equal superposition state over n basis states, which can be performed with complexity 4⌈log n⌉ + 1 Toffolis [7], or just with Hadamards if n is a power of 2. This cost is incurred twice.
2. The controlled string of Pauli operators can be applied with Toffoli complexity n − 1 using the method in Figure 9 of [30].
3. The Hamming weight can be computed with no more than n Toffolis and n ancilla qubits [27]. In that case we would not need to double the complexity for the reflection because the sum can be uncomputed with Cliffords. We could use 2n Toffolis with a logarithmic number of qubits [7], but in that case we would need to double the complexity for the uncomputation cost.
4. The complexity of outputting qubits with \binom{k-1}{2}, \binom{k}{2}, or \binom{k+1}{2} is no more than 3⌈log n⌉. At the same time we can use the QROM to output a qubit which flags whether the Hamming weight is outside the range. These qubits can be erased with Cliffords by retaining a logarithmic number of ancilla qubits.
5. As described above the cost of the reflection on the clique test is no more than 6|E| + 2 log k given a database of edges, or 4|E C | given a database of missing edges, as in Lemma 3.
6. Note that there is a reflection on the result of two tests, but this would correspond to a controlled-Z which is a Clifford gate.
The overall complexity is therefore as in the following lemma.
Lemma 5 (Block encoding complexity). The Toffoli complexity of block encoding B_G/λ with the operator B_G as defined in Eq. (6) is

6|E| + 5n + 11⌈log n⌉ + 2 log k + O(1)

when given a database of edges E, or

4|E_C| + 5n + 11⌈log n⌉ + 2 log k + O(1)

when given a database of missing edges E_C. The value of λ for this block encoding is approximately n.
These complexities come from adding the Toffoli complexities in the list above. The value of λ is obtained by noting that we use a linear combination of n Pauli strings. This value will be increased very slightly because of imperfect preparation of an equal superposition state in the method of [7]. That increase is normally less than one part in 1000, so will be ignored here.
In comparison, the approaches in Refs. [9,15] are not very specific about the approach. They give a factor of n 2 for an n-sparse operator, coming from the general procedure for decomposing an unstructured n-sparse operator into 1-sparse operators from [32]. The implementation of the operator also requires checking cliques, which is not addressed in Refs. [9,15].
Ignoring those issues of how the operator is applied, the major difference of the proposal here is that we use a block encoding to construct a walk operator instead of simulating evolution under the Hamiltonian. References [9,15] invoke the results in Refs. [32,33] for the Hamiltonian evolution for unit time. In practice that would need to be adjusted to a time 1/λ in the simulation in order to prevent wraparound of the eigenvalues (which would cause nonzero eigenvalues to be measured as zero). For these short times the complexity of the Hamiltonian evolution is multiplied by a logarithmic factor.
If we use the estimate of the complexity from Ref. [33] given in Ref. [34], then we may expect the complexity to be larger than the complexity for the block encoding we have given here by a factor of 6 for the example in Section IV A. There will be more significant factors in other examples with smaller gaps than the example in Section IV A.

E. Projection-based overlap estimation
In order to estimate the number of zero eigenvalues of the Hamiltonian, we project onto the zero eigenspace, then perform amplitude estimation. The projection can be approximated using a Chebyshev polynomial approach. First, recall that in the qubitisation the zero eigenvalue of the Hamiltonian is mapped to eigenvalues ±1 of the qubitised operator. For the filter function on the phase ϕ of the eigenvalues of the walk operator, one can take

w̃(ϕ) = ϵ T_ℓ(β cos(ϕ)) (41)

for ϕ taking discrete values πk/ℓ for k from −ℓ to ℓ, and where β = cosh((1/ℓ) cosh⁻¹(1/ϵ)). Taking the discrete Fourier transform of these values gives the window w_j such that

w̃(ϕ) = Σ_j w_j e^{ijϕ} . (42)

Note that w̃(ϕ) is a function of cos ϕ, so w_j = w_{−j}. Moreover, we have values of j separated by 2. If ℓ is even, then we have even powers of cos ϕ, and therefore only even j. This means that it can be regarded as a polynomial in e^{2iϕ}. We can select between the qubitised walk step and its inverse by controlling on the reflection, so implementing the linear combination of unitaries may be performed with cost ℓ.
The peak of w̃(ϕ) will be at 0 and π, which is what is needed because the qubitised operator produces duplicate eigenvalues at phases of 0 and π. The width of the filter can be found by noting that the peak is where the argument of the Chebyshev polynomial equals β, and the width is where the argument is 1, so β cos(ϕ) = 1. This gives us

ϕ = arccos(1/β) ≈ ln(2/ϵ)/ℓ . (43)

The gap in the Hamiltonian is λ_min, which translates to a gap in the qubitised operator of arcsin(λ_min/λ). Because the width of the peak should be equal to the gap, we can replace ϕ with arcsin(λ_min/λ), and solving for ℓ gives

ℓ = ln(2/ϵ)/arcsin(λ_min/λ) ≈ (λ/λ_min) ln(2/ϵ) . (44)

The complexity of the filter on the walk operator may therefore be given as in the following lemma.
Lemma 6 (Eigenvalue filtering). The complexity of filtering out nonzero eigenvalues of B_G by a factor of ϵ is

(n/λ_min) ln(2/ϵ) (45)

calls to the block encoding of B_G, given that the gap from eigenvalue 0 is at least λ_min.
This lemma is obtained by using λ = n for the block encoding of B_G in Eq. (44). To determine the appropriate value of ϵ to take, note that ϵ gives the factor multiplying the amplitudes of states with eigenvalues outside the gap. The state starts with equal weighting on all eigenvalues, so ideally the amplitude after filtering would be √(β_{k−1}^G/|Cl_k(G)|). If the state amplitudes outside the gap are multiplied by ϵ, then the error in the squared amplitude can be at most ϵ². This corresponds to an error in β_{k−1}^G of ϵ²|Cl_k(G)|, or a relative error of ϵ²|Cl_k(G)|/β_{k−1}^G. In comparison, the approaches in Refs. [1,9,15] are based on phase estimation. They just give the scaling without specifying the method, which is needed to know the constant factors. The best algorithm for phase estimation would correspond to the method we have given in Appendix D. That would have asymptotic complexity similar to the filtering approach given here. To compare the complexities, we first need to note that the accuracy of the phase estimation should be half the gap. This is because if the phase estimate has error half the gap, then a zero eigenvalue could give an estimate of λ_min/2, which could equally correspond to an eigenvalue of λ_min. When this factor of 2 is accounted for, the phase estimation approach has similar complexity to the filter. The phase estimation has a somewhat larger complexity in the nonasymptotic regime, though by only about 60%.
That is, there is a moderate improvement over phase estimation even if one were to use the optimal phase estimation introduced in Appendix D. Because Refs. [1,9,15] did not use that method of phase estimation we would provide a larger improvement over those works, though the size of the improvement is ambiguous because they do not specify the method of phase estimation.
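The behaviour of the window in Eq. (41) can be checked numerically. The sketch below evaluates w̃(ϕ) = ϵT_ℓ(β cos ϕ) via the cos/cosh continuation of the Chebyshev polynomial (an implementation choice for the check, not the quantum circuit):

```python
import numpy as np

def filter_profile(phi, ell, eps):
    """w~(phi) = eps * T_ell(beta * cos(phi)), beta = cosh(acosh(1/eps)/ell)."""
    beta = np.cosh(np.arccosh(1 / eps) / ell)
    x = beta * np.cos(phi)
    # T_ell(x) = cos(ell*acos(x)) for |x| <= 1, sign(x)^ell*cosh(ell*acosh|x|) otherwise.
    out = np.where(
        np.abs(x) <= 1,
        np.cos(ell * np.arccos(np.clip(x, -1, 1))),
        np.sign(x) ** ell * np.cosh(ell * np.arccosh(np.clip(np.abs(x), 1, None))),
    )
    return eps * out

ell, eps = 40, 1e-3          # even ell, as required above
peak = filter_profile(0.0, ell, eps)        # eps * T_ell(beta) = 1 by design
stop = filter_profile(np.pi / 2, ell, eps)  # |beta cos(phi)| <= 1: bounded by eps
```

The peak value 1 at ϕ = 0 and π follows from T_ℓ(β) = cosh(cosh⁻¹(1/ϵ)) = 1/ϵ, and |w̃| ≤ ϵ wherever |β cos ϕ| ≤ 1, i.e. outside the width set by Eq. (43).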

F. Total complexity of algorithm
The Toffoli costs of the algorithm are as follows. In the following we will present the complexity when using the database of edges, then explain the modification for a database of missing edges.
1. The preparation of the Dicke state has a leading-order complexity of n log² n Toffolis, or approximately 2kn for the second of the two schemes presented in Appendix B. Here we are including a factor of 2 for inversion.
2. The cost of checking cliques is given by 6|E|+2 log k where we take into account the need to uncompute the result.
3. The cost of amplitude estimation is a number of iterations of steps 1 and 2 given by approximately (π/4)√(\binom{n}{k}/|Cl_k(G)|) ln(1/δ)/√r.
4. The cost of amplitude amplification of the cliques is given by approximately (π/4)√(\binom{n}{k}/|Cl_k(G)|) iterations of steps 1 and 2.
5. The walk step for the qubitisation needs 6|E| + 5n + 11 log n + 2 log k + O(1) Toffolis.
6. For the filtering there are (n/λ_min) ln(2/ϵ) calls to the block encoding, with the cost given in item 5 above.
7. Lastly, we need to perform amplitude estimation on the entire procedure, using ≈ ln(1/δ)/(2ϵ) calls to the amplitude amplification in item 4 and the filtering in item 6.
In comparison, Ref. [9] invoked the amplitude estimation scheme of Ref. [28], which we improve over by about an order of magnitude. Reference [15] just uses classical sampling, which is quadratically more costly. To give the leading-order complexities, the combined cost of steps 1 and 2 is 6|E| + n log² n + O(n log n).
To distinguish the ϵ, δ and r (relative error) needed in different steps we will use subscripts. The cost of amplitude estimation is then approximately

(π/4)√(\binom{n}{k}/|Cl_k(G)|) (ln(1/δ₁)/√r₁) (6|E| + n log² n) . (49)

This is expected to be a trivial cost in the overall algorithm, because the amplitude amplification is performed many more times. For the remainder of the algorithm, we have an initial cost of

(π/4)√(\binom{n}{k}/|Cl_k(G)|) (6|E| + n log² n)

for the amplitude amplification for the initial state. Then there is a cost for the block encoding of

6|E| + 5n + 11 log n + 2 log k + O(1)

for each step. Multiplying by the number of steps needed for filtering, there is a cost

(n/λ_min) ln(2/ϵ₃) (6|E| + 5n + 11 log n + 2 log k) .

To determine the appropriate value of ϵ₃ to take, note that we are measuring a kernel of size β_{k−1}^G as compared to an overall dimension of |Cl_k(G)| ≫ β_{k−1}^G. The relative accuracy in the estimation of β_{k−1}^G will therefore be about ϵ₃²|Cl_k(G)|/β_{k−1}^G, as explained above at the end of Section III E. If we aim for relative accuracy r₃, then we have complexity

(n/λ_min) ln(2√(|Cl_k(G)|/(r₃ β_{k−1}^G))) (6|E| + 5n + 11 log n + 2 log k) .

Lastly, the amplitude estimation on the entire procedure needs a number of repetitions

≈ ln(1/δ₂)/(2ϵ₂) .

But this amplitude estimation is on a number of steps corresponding to a reflection requiring both the forward and reverse calculations. That introduces a further factor of 2, so we should use

≈ ln(1/δ₂)/ϵ₂ .

Next, ϵ₂ corresponds to an accuracy of estimating the amplitude √(β_{k−1}^G/|Cl_k(G)|) for the ratio β_{k−1}^G/|Cl_k(G)|. If we want a relative accuracy r₂, then the error propagation formula gives

ϵ₂ = ∆β_{k−1}^G / (2√(β_{k−1}^G |Cl_k(G)|)) = (r₂/2)√(β_{k−1}^G/|Cl_k(G)|) ,

where ∆β_{k−1}^G = r₂ β_{k−1}^G is the uncertainty in β_{k−1}^G. In terms of r₂, the number of repetitions becomes

(2/r₂)√(|Cl_k(G)|/β_{k−1}^G) ln(1/δ₂) .

Applying this to the complexity required for each step, we get a complexity of approximately

(2 ln(1/δ₂)/r₂)√(|Cl_k(G)|/β_{k−1}^G) [ (π/4)√(\binom{n}{k}/|Cl_k(G)|)(6|E| + n log² n) + (n/λ_min) ln(2√(|Cl_k(G)|/(r₃ β_{k−1}^G)))(6|E| + 5n + 11 log n + 2 log k) ] .

This is the expression given as Eq. (12) in Lemma 1. If we use the second Dicke preparation scheme, then n log² n would be replaced with 2kn, and \binom{n}{k} replaced with n^k/k!, which gives the expression in Eq. (13) of Lemma 1. For this complexity the factor of 6|E| at the beginning is for the complexity from the database of edges; for the database of missing edges this factor would be replaced with 4|E_C|.
Comparing this to the amplitude estimation cost in (49), the primary difference is the factor of √(|Cl_k(G)|/β_{k−1}^G) here. There is another difference in that the amplitude estimation cost has the factor 1/√r₁ rather than 1/r₂, so the scaling in terms of relative error is improved. In cases where the number of cliques |Cl_k(G)| is much larger than the Betti number β_{k−1}^G, the amplitude estimation cost is trivial. We will have a total probability of failure δ = δ₁ + δ₂ due to the two amplitude estimations, and a total relative error r₁ + r₂ + r₃. In order to reduce the complexity, we can use the fact that the cost of the initial amplitude estimation is much smaller, so we can take δ₁ and r₁ smaller without much impact on the overall complexity. The r₃ appears inside a logarithm, so it can be taken to be smaller than r₂.
To give the scaling of the complexity in a simpler way, we can simply ignore the amplitude estimation complexity, replace δ₂ with δ, and replace both r₂ and r₃ with r (since r₃ can be taken as, for example, r/20 without much impact on the overall complexity). We will also omit terms of complexity kn or n as compared to |E|. That then gives

T(G, k, r, δ) ≈ (2 ln(1/δ)/r)√(|Cl_k(G)|/β_{k−1}^G) [ (π/4)√(\binom{n}{k}/|Cl_k(G)|) + (n/λ_min) ln(2√(|Cl_k(G)|/(r β_{k−1}^G))) ] 6|E| ,

where T(G, k, r, δ) gives the required number of Toffoli gates to estimate, with precision parameters r, δ, the (k − 1)-th order Betti number of a graph G with n nodes, |E| edges and a Laplacian with gap λ_min. This expression for the complexity is in terms of the relative accuracy r. Alternatively, if we aimed for a given absolute accuracy α, then α = rβ_{k−1}^G, and so the expression for the complexity becomes

T ≈ (2 ln(1/δ)/α)√(β_{k−1}^G |Cl_k(G)|) [ (π/4)√(\binom{n}{k}/|Cl_k(G)|) + (n/λ_min) ln(2√(|Cl_k(G)|/α)) ] 6|E| .

We now discuss the complexity of just the first term in the square brackets, which corresponds to the state preparation cost rather than the filtering cost. That cost will be dominant if the gap is large, though it must be emphasised that the gap will be small in many cases. This first term for the cost gives

(3π ln(1/δ)/r)√(\binom{n}{k}/β_{k−1}^G) |E| .

If we are aiming for a given absolute accuracy α in β_{k−1}^G, then the complexity would be

(3π ln(1/δ)/α)√(\binom{n}{k} β_{k−1}^G) |E| .

The complexity is now larger for large Betti number β_{k−1}^G. The reason for this is that the amplitude estimation is estimating the square root of β_{k−1}^G. The square root has a small derivative for large values of β_{k−1}^G, making it more difficult to estimate the Betti number with small absolute error. Again, note that the expressions in this paragraph are only for the state preparation cost, without the filtering cost.
To compare to the complexity of classical approaches, an exact diagonalisation approach would tend to scale as \binom{n}{k}², whereas approximate schemes scale as \binom{n}{k}. Thus the quantum algorithm would give approximately a square-root speedup over these classical algorithms if β_{k−1}^G is of order a constant and one is targeting a fixed relative error estimate. On the other hand, for graphs with large β_{k−1}^G, a speedup that is greater than a square root can be obtained for fixed relative error estimates.
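This comparison can be made concrete with a crude scaling sketch (dropping the |E|, gap, and precision factors, which are not what drives the speedup):

```python
from math import comb

def scaling_comparison(n, k, betti=1):
    """Crude scaling comparison: classical listing lower bound ~ C(n, k),
    exact diagonalisation ~ C(n, k)^2, quantum ~ sqrt(C(n, k) / betti)."""
    classical = comb(n, k)
    quantum = (classical / betti) ** 0.5
    return classical, quantum

c, q = scaling_comparison(256, 16)
```

With betti = 1 the quantum scale is exactly the square root of the classical listing scale; a larger betti reduces the quantum side further, which is the source of the super-square-root speedups discussed next.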

IV. REGIMES FOR QUANTUM SPEED-UP
In this section, we ask if there exist regimes where our quantum algorithm offers a significant speedup over the best classical algorithms. The aim is to compute to relative error the (k − 1)-th Betti number of the clique complex of a graph G. Say G has n nodes and |E| edges, r is the desired multiplicative error, and λ_min is the spectral gap of the combinatorial Laplacian ∆_{k−1}^G. To simplify the arguments, we will represent the quantum complexity of this problem as

T_q ∼ (n|E|/(r λ_min)) √(\binom{n}{k}/β_{k−1}^G) . (62)

Comparing to Eq. (60), this will asymptotically upper bound both terms up to log factors. For a rough estimate of the cost of computing the Betti number classically, one could use |Cl_k(G)| (i.e., the number of k-cliques) or \binom{n}{k}. The reason is that classical algorithms typically start by constructing a list of k-cliques, and afterwards compute the nullity of the combinatorial Laplacian or boundary operator. The cost of this second step (i.e., estimating the nullity of the combinatorial Laplacian or boundary operator) scales at best linearly in the size of the matrix, |Cl_k(G)| [35]. On the other hand, the first step (i.e., listing all k-cliques) can be done using a brute force search at cost \binom{n}{k}. There are more efficient algorithms for listing cliques, though the complexity tends to be dependent on the properties of the graph. However, |Cl_k(G)| always lower bounds the cost of listing all the k-cliques. Therefore, |Cl_k(G)| and \binom{n}{k} can be considered to be lower and upper bounds on the scaling of the classical complexity, respectively. In conclusion, the best classical algorithms for this problem have scaling lower bounded by

T_c ≳ |Cl_k(G)| .

Recall Cl_k(G) is the set of cliques of size k, which form the (k − 1)-simplices of the simplicial complex. Classical algorithms have extra factors in the complexity, such as a 1/r² dependence on the required precision, that introduce orders of magnitude over this lower bound.
Another category of classical algorithms to compare to are those tailored for the specific regime where quantum algorithms are most efficient. Notable examples include the algorithm developed in Section IV D, and the algorithm of Apers et al. [36]. We defer their comparison to Section IV D, where we will highlight regimes where the examples introduced in Section IV A continue to exhibit a superpolynomial speedup.
The quantum algorithm will offer a speedup on instances where $\beta_{k-1}$ is large and where $\lambda_{\min}$ is not too small. We can remove the dependence on $\lambda_{\min}$ if we instead focus on computing an approximate Betti number, in the following sense.
The same quantum algorithm computes $B^{\delta}_{k-1}$ to relative error with the same cost, with $\lambda_{\min}$ replaced by $\delta$.

A. A family of graphs with large Betti numbers and large spectral gaps

In this section, we will construct a family of graphs with all the necessary parameters to enable a large quantum speedup. Our objective here is to demonstrate the existence of instances that fulfill all the prerequisites for the quantum algorithm to achieve a superpolynomial quantum speedup.
Let $K(m,k)$ be the complete $k$-partite graph, where each partition contains $m$ vertices. That is, $K(m,k)$ consists of $k$ clusters, each with $m$ vertices; there are no edges within clusters, but all edges between clusters are included. Note $K(m,1)$ is a collection of $m$ points with no edges. $K(m,k)$ gives a useful example of a clique complex with a high Betti number [37]. It also has a Laplacian with a large spectral gap. We prove these facts in Appendix F using techniques from simplicial homology. A further useful fact is that the number of $k$-cliques is $|\mathrm{Cl}_k(K(m,k))| = m^k$, since a $k$-clique contains exactly one vertex from each cluster. Standard classical approaches need to at least store a vector of this length, so we can take the classical complexity to be $T_c \sim m^k$.

As a first approximation for the quantum cost, we use the formula in (62) and consider just the square-root factor $\sqrt{\binom{n}{k}/\beta_{k-1}}$ and $|E|$. In fact, for this example the large number of edges means that it is better to use the list of missing edges, giving complexity proportional to $|E^C|$. Bearing in mind that $n = mk$, we apply Stirling's approximation to $\binom{n}{k}$, in which the exponentials cancel. Proposition 1 gives $\beta_{k-1} = (m-1)^k = (m-1)^{n/m}$. Therefore, for constant $m$, there is a polynomial speedup by a $2\ln m$ root (ignoring the $n^2$ factor). Alternatively, taking $k$ constant, the above formulae give a polynomial speedup by a $k/2$ root. To obtain a superpolynomial speedup, $m$ can be taken to increase close to linearly in $n$, while $k$ also increases with $n$. Close to the best result is obtained for $k = c\ln^2 n$ with some constant $c$. Then the logs of the complexities are approximately $\ln T_c \sim c\ln^3 n$ and $\ln T_q \sim 2\ln n + (c/2)\ln^2 n$.
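For small cases, Proposition 1 can be sanity-checked directly. For $k = 2$ the clique complex of $K(m,2)$ is the bipartite graph itself (it contains no triangles), so $\beta_1$ follows from the Euler characteristic, $\beta_0 - \beta_1 = |V| - |E|$. The following sketch (ours, not part of the paper's resource estimates) confirms $\beta_1 = (m-1)^2$:

```python
# Sanity check of Proposition 1 for k = 2 (illustrative, not the paper's code).
# For k = 2 the clique complex of K(m, 2) is the complete bipartite graph
# itself (triangle-free), so beta_1 = |E| - |V| + beta_0 by Euler's formula.

def betti_1_of_K_m_2(m: int) -> int:
    """beta_1 of the clique complex of the complete bipartite graph K(m, 2)."""
    vertices = 2 * m   # two clusters of m vertices each
    edges = m * m      # all edges between the clusters
    beta_0 = 1         # K(m, 2) is connected for m >= 1
    return edges - vertices + beta_0

for m in range(1, 6):
    # Matches beta_{k-1} = (m - 1)^k with k = 2.
    assert betti_1_of_K_m_2(m) == (m - 1) ** 2
```

For $m = 2$, $K(2,2)$ is a 4-cycle with a single hole, consistent with $\beta_1 = 1$.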
That implies a speedup by a $2\ln n$ root, which is superpolynomial. This is still not an exponential speedup, but as far as this graph family is concerned it is the best speedup that could be obtained from this type of approach. This is because, with $k$ constant, the quantum complexity ignoring the $|E^C|$ factor is $O(1)$: the Betti number already scales in the same way as the number of simplices, but the overhead from $|E^C|$ means that the speedup is not exponential.
Next we provide numerical results for the Toffoli complexity as a function of n and k. For these calculations we have made a number of adjustments to our simplified expressions in order to provide more accurate results.
In particular we compute the integral of the Kaiser window rather than using the asymptotic expression, as well as including the Dicke state preparation cost and the initial amplitude estimation cost for the number of steps needed for the state preparation. We are also using the second Dicke preparation scheme from Appendix B which provides a smaller complexity for this example.
The results are given in Fig. 2 as a function of $n$ for a range of values of $k$. It can be seen that the cost of the quantum algorithm for a given $k$ scales approximately as $n^2$, with the cost scaling primarily coming from the number of edges in the graph. The two classical cost measures, the number of $k$-cliques and $\binom{n}{k}$, have similar scaling to each other, which is considerably worse than that of the quantum algorithm, and much worse for larger values of $k$, as expected from the analysis above.
For the example of $n = 256$, $k = 16$, the quantum cost is approximately 6.8 billion Toffolis, which is comparable to gate counts for classically intractable instances of quantum chemistry [38]. In contrast, the number of cliques is about $2\times 10^{19}$, and $\binom{n}{k}\approx 10^{25}$. These numbers are sufficiently large that the problem should be classically intractable for any method that scales as $|\mathrm{Cl}_k(G)|$. For example, just storing the vector would be beyond the storage capacity of supercomputers. Potentially, more advanced classical algorithms that do not need to store the vector could be tractable.
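The quoted magnitudes are easy to reproduce. The snippet below (an illustrative check; we assume the instance is $K(16,16)$ from the family above, so the number of $k$-cliques is $m^k$ with $m = 16$) recomputes both numbers:

```python
# Reproducing the quoted magnitudes for n = 256, k = 16 (illustrative check).
# Assumption: the instance is K(16, 16), so |Cl_k| = m^k with m = n/k = 16.
import math

n, k, m = 256, 16, 16
cliques = m ** k              # one vertex from each of the k clusters
n_choose_k = math.comb(n, k)  # brute-force clique-listing search space

print(f"|Cl_k|  = {cliques:.1e}")     # about 2e19, as quoted
print(f"C(n, k) = {n_choose_k:.1e}")  # about 1e25, as quoted
```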
To compare to the scheme as presented in Refs. [9,15]: we improve by about two orders of magnitude for this example by using a more efficient Dicke state preparation scheme. We have a further order of magnitude improvement in complexity by using optimal quantum amplitude estimation in the final step. That gives at least three orders of magnitude improvement, which is the difference between a quantum computer running for a day versus years. The total improvement is unclear because some parts of the algorithm were not specified in Refs. [9,15].
We obtain about another order of magnitude improvement by separately performing an amplitude estimation to avoid needing to repeatedly perform it in the initial state preparation. The question of how this would be performed was not addressed in Refs. [9,15]. There is a more modest improvement from using the optimal filter as compared to optimal phase estimation. However, the optimal phase estimation is a procedure introduced here, and there would be a larger improvement over less efficient phase estimation; the type of phase estimation was not addressed in prior work. We also provide an improved clique-checking procedure, but the model of the graph considered in Refs. [9,15] is different, making a direct comparison of complexities impossible.

B. Erdős-Rényi graphs
The family of graphs in Section IV A is specifically constructed to have high Betti number and large spectral gap. One might wonder what speedups are generically possible. To shed light on this question, we examine the Erdős-Rényi family of random graphs.
The Erdős-Rényi random graph $G(n,p)$ has $n$ vertices, and each of the $\binom{n}{2}$ possible edges is present i.i.d. with probability $p$. In [39], it is established that, in an appropriate regime, the Betti numbers of the resulting clique complex satisfy $\beta_k \sim \binom{n}{k+1} p^{\binom{k+1}{2}}$ almost surely.
Taking $p = n^{-1/(k+1/2)}$ gives $\beta_k \sim \binom{n}{k+1} n^{-k/2}$ almost surely. Ignoring the factor $n|E|/\lambda_{\min}$, our quantum algorithm can compute the $k$th Betti number in time scaling as $T_q \sim n^{k/4+2}$ for constant $k$. For large $k$, where the $+2$ coming from $|E^C|$ is negligible, this is approximately a quartic speedup.
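The quartic limit can be seen by simple exponent bookkeeping, sketched below under the stated scalings (ignoring constants and the $n|E|/\lambda_{\min}$ factor):

```python
# Exponent bookkeeping for G(n, p) with p = n^(-1/(k + 1/2)) (illustrative).
# Classical cost ~ binom(n, k+1) ~ n^(k+1); quantum cost ~ n^(k/4 + 2),
# so the ratio of exponents (the speedup power) approaches 4 for large k.

def exponents(k: int):
    classical = k + 1    # log_n of binom(n, k+1), up to constants
    quantum = k / 4 + 2  # log_n of T_q quoted in the text
    return classical, quantum

for k in (8, 40, 400):
    c_exp, q_exp = exponents(k)
    print(k, c_exp / q_exp)  # ratio tends toward 4 as k grows
```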

C. Rips complexes
One of the main applications of topological data analysis is to Rips complexes induced by finite-dimensional data in $\mathbb{R}^d$. This is another shortcoming of the graph family from Section IV A: those graphs are defined abstractly, rather than being induced from finite-dimensional data. But are such speedups possible for Rips complexes? Unfortunately, there are results which prevent these large speedups. It is shown in [40] that, for any fixed $k$ and $d$, the maximum of $\beta_k$ over Rips complexes built on $n$-point sets $S \subset \mathbb{R}^d$ grows only polynomially in $n$.

In [41], the author studies a setting where $n$ data points are drawn from a fixed underlying probability measure on $\mathbb{R}^d$. This is arguably the setting of interest in topological data analysis. They show that the Betti numbers of the derived Rips complexes have three 'phases' depending on the scale $\epsilon$. (Recall that we include an edge if two points are within distance $\epsilon$.) For small $\epsilon = o(n^{-1/d})$, called the subcritical phase, the Betti numbers vanish asymptotically; intuitively the complex is highly disconnected, since we are below the percolation threshold. There is a critical phase $\epsilon \sim n^{-1/d}$, where the Betti numbers scale linearly, $\beta_k \sim n$. Then for large $\epsilon = \omega(n^{-1/d})$, in the supercritical phase, the Betti number grows sublinearly, $\beta_k = o(n)$. Thus in all regimes, the Betti number grows at most linearly in the number of points. This is of course far from the $\binom{n}{k}$ scaling needed for superpolynomial speedup.
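The scale-dependence is easiest to see for $\beta_0$, the number of connected components, which can be computed with union-find on the $\epsilon$-neighborhood graph. A minimal sketch (our toy example, with fixed points on a line rather than samples from a measure):

```python
# Toy illustration of the scale-dependence of a Rips complex (ours):
# beta_0 (number of connected components) of the epsilon-neighborhood graph,
# computed with union-find for points on the real line.

def beta_0(points, eps):
    parent = list(range(len(points)))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if abs(points[i] - points[j]) <= eps:  # Rips edge condition
                pi, pj = find(i), find(j)
                if pi != pj:
                    parent[pi] = pj
    return len({find(i) for i in range(len(points))})

pts = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
assert beta_0(pts, 0.01) == 6  # small eps: fully disconnected
assert beta_0(pts, 0.15) == 3  # intermediate eps: three clusters
assert beta_0(pts, 6.0) == 1   # large eps: a single component
```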
However, it is possible to construct a Rips complex with large Betti number and large spectral gap, even in R 2 [40]. We describe such a Rips complex here.
We prove its properties in Appendix F using techniques from simplicial homology. Our quantum algorithm can compute the $k$th Betti number of this complex in time scaling as $T_q \sim n^{3+k/4}$ for constant $k$.

D. Randomized classical algorithms for Betti number estimation
While the previous discussion shows there are cases where quantum algorithms can provide superpolynomial advantages with respect to deterministic classical algorithms for TDA, the question of how randomized classical algorithms perform in this setting is comparatively understudied. There are works studying generalizations of random-walk operators corresponding to higher-order Laplacians of simplicial complexes [42][43][44]. These are specific to the combinatorial Laplacian context, and while they could lead to more efficient classical approaches, no such result is known at present.
Here we show, perhaps surprisingly, that there exists a randomized classical algorithm which can compute normalized Betti numbers in the clique dense case using a polynomial number of operations under appropriate assumptions. This shows that the sufficient conditions needed for quantum algorithms to provide an advantage are more subtle than anticipated and that simply having a high-dimensional vector space does not necessarily guarantee a super-polynomial speedup.
The main idea of our algorithm is to use imaginary-time evolution to create a projector onto the kernel of the combinatorial Laplacian. More specifically, we focus on the Dirac operator $B_G$ and simulate the imaginary-time dynamics of its square using path-integral Monte Carlo. We further simplify this approach by taking $\bar{B}_G$ to be an operator analogous to $B_G$, except that we now use an energy penalty, proportional to $(\mathbb{1}-P)$, to penalize any configuration that is not a clique or is of the wrong parity. Here $P$, as before, is the projector onto the states of appropriate Hamming weight and configurations that correspond to cliques, and $\gamma_{\min}$ is an upper bound on the spectral gap of $B_G^2$, which coincides with the second smallest eigenvalue of the combinatorial Laplacian. Further, it is easy to see that if a vector is in the kernel of $B_G^2$ it is also in the kernel of $B_G$.

Following the same reasoning as before, as $B_G$ is Hermitian so is $\bar{B}_G$, and thus it has a complete set of orthonormal eigenvectors. This implies that any unit vector $|\psi\rangle$ supported on $\mathcal{H}^G_k$ can be decomposed as
$$|\psi\rangle = \cos(\theta)\,|\psi_g\rangle + \sin(\theta)\,|\psi_b\rangle,$$
where $|\psi_g\rangle$ is the normalized projection of $|\psi\rangle$ onto the kernel of $\bar{B}_G^2$ and $|\psi_b\rangle$ lies in its orthogonal complement. If we then pick $t = \ln(1/\epsilon)/\gamma_{\min}$, where $\gamma_{\min}$ is the smallest non-zero eigenvalue of $\bar{B}_G^2$, the expectation value $\langle\psi|e^{-\bar{B}_G^2 t}|\psi\rangle$ will be at most $\cos^2(\theta) + O(\epsilon)$. If $|\psi\rangle$ is chosen such that it is a column of a Haar-random unitary over the constrained parity subspace $\mathcal{H}^G_k$, the Haar expectation of $\cos^2(\theta)$ is the normalized Betti number. Thus performing imaginary-time evolution and a Haar expectation value will give the required normalized Betti number.

The remaining question centers on whether the imaginary-time evolution can be performed on a classical computer in polynomial time. First, let us consider a decomposition of the Hamiltonian of the form $\bar{B}_G = \sum_p c_p H_p$, where each $H_p$ is one-sparse, Hermitian, and unitary. Hence the eigenvalues of each are $\lambda_{p_i,\nu_i} = \pm c_{p_i}$, where $\nu_i$ is an index of the eigenvalue and $p_i$ is the index of the term [45,46].
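The projection property underlying this idea is easy to verify numerically. The sketch below (ours; a dense toy model, not the path-integral method itself) builds a Hermitian $B$ with a known kernel and checks that $\langle\psi|e^{-B^2 t}|\psi\rangle$ approaches $\cos^2(\theta)$, the squared overlap of $|\psi\rangle$ with the kernel:

```python
# Numerical sketch (not the paper's implementation): imaginary-time evolution
# e^{-B^2 t} projects onto ker B^2, so <psi| e^{-B^2 t} |psi> -> cos^2(theta),
# the squared overlap of |psi> with the kernel.
import numpy as np

rng = np.random.default_rng(0)
d = 6
diag = np.array([0.0, 0.0, 1.0, 1.5, 2.0, 3.0])   # eigenvalues of B; 2D kernel
U = np.linalg.qr(rng.standard_normal((d, d)))[0]  # random orthogonal eigenbasis
B = U @ np.diag(diag) @ U.T                       # Hermitian toy "Dirac operator"

psi = rng.standard_normal(d)
psi /= np.linalg.norm(psi)

t = 20.0  # t ~ ln(1/eps)/gamma_min, with gamma_min = 1.0 here
evolved = U @ np.diag(np.exp(-diag**2 * t)) @ U.T @ psi
expectation = psi @ evolved

kernel_overlap = np.sum((U[:, :2].T @ psi) ** 2)  # cos^2(theta)
assert abs(expectation - kernel_overlap) < 1e-6
```

Averaging `kernel_overlap` over Haar-random `psi` supported on the full space would give $\dim\ker B^2 / d$, i.e., the normalized kernel dimension.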
The Jordan-Wigner decomposition of $B$ provides such a decomposition, and the projector $P$ can always be written as a sum of a reflection over computational basis states and an identity, which provides an efficient decomposition into one-sparse, Hermitian, and unitary terms. With this decomposition in hand, we focus on a path-integral Monte Carlo simulation of $\exp(-\bar{B}_G^2 t)$. The path-integral expansion works by first breaking up $e^{-\bar{B}_G^2 t}$ into $r$ timeslices, Trotterizing over the matrices in our one-sparse Hermitian decomposition of $\bar{B}_G^2$ (which is efficient to determine [45,46]), and then expanding each one-sparse Hermitian matrix in its eigenbasis. Since one-sparse Hermitian matrices can be efficiently diagonalized, this process is classically efficient. (Hermitian one-sparse matrices can be decomposed as a direct sum of 1- and 2-dimensional matrices, which are trivial to diagonalize.)

Let $\Gamma$ denote a particular path of eigenvectors in the path-integral representation, $W(\Gamma)$ be the product of overlaps between the eigenvectors, and $\lambda_{p_i,\Gamma_i}$ be the eigenvalue corresponding to the eigenvector that appears in the $i$th step of the path $\Gamma$. Finally, let $\Pr(\Gamma)$ be a probability distribution from which the paths are drawn, which can be chosen to reduce the variance (as is standard in importance sampling). We show in Appendix G that taking the Haar expectation of the result expresses $\mathbb{E}_{\rm Haar}(\cos^2(\theta))$ as a sum over paths weighted by $W(\Gamma)$ and the eigenvalues $\lambda_{p_i,\Gamma_i}$. We then average over a finite ensemble of these random paths, drawn from an appropriate probability distribution, to estimate the expectation value. We propose, for general purposes, a Metropolis-Hastings based algorithm for selecting appropriate paths in the decomposition that are unlikely to have zero values of $W$. More specifically, the algorithm works by drawing an initial eigenstate of the first term in the one-sparse decomposition of the Dirac operator $\bar{B}_G$ uniformly. Then a path is drawn by transitioning randomly to one of the two possible connected eigenstates.
As the terms are Hermitian by assumption, all eigenvalues at each step in the path integral are the same up to a sign. This means that there are only two choices when constructing a path: either we traverse the positive eigenvalue or the negative one. Thus each path can be described using $O(\log(d_{k-1})\,rD)$ bits. A path that has non-zero overlaps between the neighboring eigenstates can then be selected in $O(\log(d_{k-1})\,rD)$ time. This is used as an initial guess that is improved using Metropolis-Hastings, wherein the probability of transitioning between two randomly chosen paths $\Gamma^{(a)}$ and $\Gamma^{(b)}$ is given by the standard Metropolis acceptance rule. The equilibrium of this chain is a thermal distribution over paths, normalized over the set $S_\Gamma$ of all paths. The number of such updates needed to achieve this distribution (within fixed error) scales as $O(1/\gamma_M)$, where $\gamma_M$ is the gap of the Markov chain. We show in Appendix G that, provided the gap is large, this distribution can be efficiently sampled from and forms a good choice for the importance distribution over paths, minimizing the variance. We ultimately find that the number of arithmetic operations needed to estimate the ratio of the kernel dimension to the size of the set of all $k$-simplices within additive error $\epsilon$, assuming that the sample variance in the estimates yielded by Algorithm 1 is $\sigma^2$, is polynomial in $\sigma$, $D$, $\kappa$, $\gamma_M^{-1}$, $1/\epsilon$, $|E|$, and $|\mathrm{Cl}_k(G)|$, where $\kappa$ is the ratio of the largest eigenvalue to the smallest non-zero eigenvalue, i.e., the condition number of the combinatorial Laplacian, $|E|$ is the size of the edge set of the input graph, and $|\mathrm{Cl}_k(G)|$ is the number of $k$-cliques in the input graph (which is not equal to $d_{k-1}$ in general!). This shows that even in cases where the dimension is exponentially large, we can use path integration to estimate the ratio $\dim\ker\Delta_k / d_{k-1}$ using a number of operations that scales polynomially with the number of vertices $n$, provided that $\sigma$, $D$, $\kappa$, $\gamma_M^{-1}$, and $|\mathrm{Cl}_k(G)|^{-1}$ are at most $\mathrm{poly}(n)$.
We also show that σ can be polynomially large in some cases in Appendix G. Quantum algorithms for TDA were thought to outperform classical counterparts in the clique dense case [15], but this algorithm serves as a counter-example. Another key point behind this dequantization result is that while an exponentially large dimension is a necessary condition for an exponential speedup for quantum TDA, it is not a sufficient condition. This implies that further work is needed in order to understand when, and even if, quantum algorithms can provide truly exponential advantages relative to all classical randomized algorithms for TDA.
Apers et al. [36] subsequently developed another classical path-integral Monte Carlo algorithm which can also be efficient in some of the regimes where the quantum algorithm for Betti number estimation works best. It is therefore natural to investigate how it fares compared to our classical and quantum algorithms. They study additive error estimation of the normalized Betti number, so we focus on comparing the runtime of their algorithm for this case.
For general simplicial complexes, their runtime depends exponentially on $\gamma^{-1}$ of $\Delta_k$, which is an upper bound on the condition number $\kappa$. They thus require constant $\gamma$ and $\epsilon$ for general simplicial complexes. By contrast, our classical algorithm above has a runtime depending on fixed polynomials in $\gamma^{-1}$ and $\epsilon^{-1}$, and can thus tolerate $\mathrm{poly}(n)$ scaling for both, provided $\sigma^2$ also depends polynomially on $\kappa$ and $1/\epsilon$. For clique complexes, their best-case runtime is polynomial if $\gamma \in \Omega(1)$ and $\epsilon = 1/\mathrm{poly}(n)$, or $\gamma \in \Omega(1/\log n)$ and $\epsilon \in \Omega(1)$. Analogous conclusions hold for our classical algorithm in these regimes if we are once again given suitable promises on the polynomial dependence of $\sigma^2$ on those parameters. Thus, given the difficulty in analyzing the dependence of $\sigma^2$ on $\kappa$ and $\epsilon^{-1}$ in general, our algorithms cannot be easily compared. In the worst-case bound for $\sigma^2$ given in (G37), we would require $D\kappa$ to be poly-logarithmic in $n$ and $\kappa/\epsilon$ at worst a constant, and thus we also cannot tolerate inverse-polynomial error scaling or spectral gap.

Algorithm 1: Classical randomized algorithm for Betti number computation.
The algorithm of Apers et al. can efficiently compute a constant additive error estimate of the normalized Betti number but cannot efficiently compute an inverse polynomial additive error estimate for the graphs K(n/k, k) discussed in this work. This yields an immediate exponential separation with the quantum algorithm since the latter is able to efficiently compute such an inverse polynomial additive error estimate. Moreover, even stronger separations are possible. As noted above, the runtime of the algorithm of Apers et al. depends exponentially on the inverse of the (normalized) spectral gap of the combinatorial Laplacian, unlike in the case of the quantum algorithm whose runtime only depends polynomially on this parameter.
For the family K(n/k, k), this (normalized) spectral gap is large and suitable for the algorithm of Apers et al. There are nevertheless many graphs that do have a (normalized) spectral gap which incurs an additional exponential scaling for the case of the classical algorithm but not for the quantum one. For example, by adding a single edge to each cluster in K(n/k, k) the (normalized) spectral gap of the combinatorial Laplacian becomes much smaller, which causes the algorithm of Apers et al. to no longer be able to efficiently compute even a constant additive precision estimate of the normalized Betti number. On the other hand, the quantum algorithm can still efficiently compute an inverse polynomial additive error estimate as can our classical algorithm given similar guarantees on the polynomial scaling of σ 2 with 1/γ. In Appendix F 1, we detail some of these modified graph examples with smaller spectral gaps.

V. CONCLUSION
In order to provide applications where quantum computers can practically outperform classical computers on hardware anticipated in the near future, it is necessary to develop algorithms with a greater than square-root speedup in the complexity [8]. This is because the large overheads involved in implementing quantum gates in an error-corrected code mean that there is a huge slowdown in the gate frequency compared to classical computers. When the Betti number is of order 1, the quantum algorithm for estimating Betti numbers gives only a square-root speedup over classical approaches that scale approximately linearly in $\binom{n}{k}$. On the other hand, when the Betti number is large, the quantum complexity of estimating the Betti number to given relative error (error as a ratio of the Betti number) will be small.
There exist classes of graphs with very large Betti numbers. We introduce graph classes which exhibit Betti numbers in the regime where the speedup is superpolynomial; specifically, it is approximately a $2\ln n$ root. The magnitude of the speedup is limited by the need to input the data into the quantum algorithm, which introduces a factor of $|E|$ into the complexity. These graphs are specially constructed to have large Betti number, but we show that there exist far more general classes of graphs whose parameters give a quartic speedup over naive classical algorithms, showing that speedups beyond quadratic are possible far more generally.
We have also provided a host of new techniques for quantum Betti number estimation that reduce the complexity. These include Kaiser-window amplitude estimation, improved Dicke state preparation, and improved eigenstate filtering. These techniques substantially reduce the complexity, though the main scaling as $\binom{n}{k}$ remains. In particular, we have major improvements arising from improved Dicke state preparation as well as improved amplitude estimation, which together give about three orders of magnitude improvement. We have further improvements arising from our clique-checking procedure, the separation of amplitude estimation and amplification in the initial state preparation, and the use of filtering instead of phase estimation, which may give a further order of magnitude improvement depending on how prior work is interpreted. Moreover, our methods enable accurate estimation of the complexity of the quantum algorithm, including all constant factors.
Based on that, we estimate that tens of billions of Toffolis would be sufficient to estimate a Betti number that should be classically intractable. This number of Toffoli gates is reasonable for early generations of fully fault-tolerant quantum computers. While the exact threshold for quantum advantage depends on constant factors in the classical algorithms, it seems likely that this application will fall somewhere in between quantum chemistry applications and Shor's algorithm in terms of the resources required for quantum advantage. The standard classical approaches would be expected to be intractable because they would require an extremely large storage. We have also presented an alternative approach for classical estimation that could be more efficient, because it does not require large storage and instead requires Monte-Carlo sampling. That classical approach may be tractable, but it is difficult to evaluate its complexity because it depends on the gap of a Markov chain which is unknown.
There is scope for further improvement of the quantum algorithm for Betti numbers by implementing a more efficient method of clique finding. We have currently applied just amplitude amplification for clique finding, but there are more efficient classical methods for clique finding that could potentially be adapted for the quantum algorithm. That is nontrivial because these methods often require large storage, which would not be practical in the quantum algorithm where we need to minimize the number of ancilla qubits.
We present some background material from singular homology needed in topological data analysis, broadly following the treatments in [47][48][49].
Let v 0 , . . . , v k be k + 1 distinct points in R n . The set {v 0 , . . . , v k } is said to be affinely independent if the set {v 1 − v 0 , . . . , v k − v 0 } is linearly independent. In other words, we consider the given set of points to be affinely independent if when we take one of the points to be the "origin" (say v 0 WLOG) and draw vectors from this point to the others, the collection of the resulting vectors is linearly independent.
If $\{v_0,\ldots,v_k\}$ is affinely independent, the $k$-simplex spanned by them is the set
$$\left\{\sum_{i=0}^{k} t_i v_i \;:\; t_i \ge 0,\ \sum_{i=0}^{k} t_i = 1\right\}.$$
Equivalently, a simplex is just the convex hull of its affinely independent set of vertices. The points $v_i$ are the vertices of the simplex and the integer $k$ is the dimension of the simplex. Figure 4 shows some examples of simplices. Note that it follows from the definitions given that $k \le n$, since no set of $n+2$ or more points in $\mathbb{R}^n$ can be affinely independent. This is because no collection of $n+1$ or more vectors in an $n$-dimensional vector space can be linearly independent. Such a collection of points therefore cannot determine any simplices of dimension $n+1$ or higher.

Let $\sigma$ be a $k$-simplex. A simplex spanned by a non-empty subset of the vertices of $\sigma$ is a face of $\sigma$. For example, the 0-dimensional faces of $\sigma$ are its vertices, and its 1-dimensional faces are its edges, which are spanned by pairs of vertices. Faces of $\sigma$ that are not equal to $\sigma$ are called proper faces. The $(k-1)$-dimensional faces of $\sigma$ are called its boundary faces, and their union is its boundary.

Definition 2 (Simplicial Complex). A simplicial complex $S$ is a finite collection of simplices satisfying the following conditions:
1. If $\sigma \in S$, every face of $\sigma$ is in $S$.
2. If $\sigma_1, \sigma_2 \in S$, then $\sigma_1 \cap \sigma_2 = \emptyset$ or $\sigma_1 \cap \sigma_2$ is a face of both $\sigma_1$ and $\sigma_2$.

The first condition says that a simplicial complex must contain all the faces of each of its simplices. The second condition says that any two simplices in a simplicial complex either do not intersect or intersect at a common face of both. We define the dimension of a simplicial complex to be the maximum of the dimensions of all simplices in the complex. It can be shown that a simplicial complex is completely determined by its vertices and information about which sets of vertices span which simplices. This provides the motivation for the following definition.
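The definition of affine independence translates directly into a rank computation on the difference vectors. A small illustration (ours):

```python
# Hedged illustration of affine independence: points {v_0, ..., v_k} are
# affinely independent iff {v_1 - v_0, ..., v_k - v_0} is linearly independent.
import numpy as np

def affinely_independent(points) -> bool:
    pts = np.asarray(points, dtype=float)
    diffs = pts[1:] - pts[0]  # vectors from v_0 to the other points
    return np.linalg.matrix_rank(diffs) == len(diffs)

# A triangle in R^2 spans a 2-simplex; three collinear points do not.
assert affinely_independent([[0, 0], [1, 0], [0, 1]])
assert not affinely_independent([[0, 0], [1, 1], [2, 2]])
# Any 4 points in R^2 fail, reflecting k <= n as noted above.
assert not affinely_independent([[0, 0], [1, 0], [0, 1], [1, 1]])
```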

Definition 3 (Abstract Simplicial Complex
). An abstract simplicial complex is a collection C of nonempty finite sets such that if s ∈ C, then every non-empty subset of s is also in C.
This general notion of a simplicial complex is particularly useful when we wish to construct one "abstractly", i.e. without reference to a particular embedding into Euclidean space.
We will mainly be concerned throughout this paper with computing topological invariants of a certain kind of simplicial complex constructed from graphs, called clique complexes. The graphs serving as inputs to the quantum algorithms in this paper are not necessarily induced by any finite-dimensional data and are instead abstract graphs, i.e., abstract simplicial complexes in the sense of Definition 3 in the preceding section.

Definition 4 (Graphs).
A graph G is a pair of objects G = (V, E), where V is a set of elements referred to as the "vertices" of G and E is a set consisting of pairs of vertices thought of as "edges" connecting the pairs of vertices.
Given an undirected graph (i.e. one in which the edges are not assumed to have direction), a clique C of a graph is a subset of V such that every pair of distinct vertices in C is connected by an edge. C is called a k-clique if |C| = k.
We now define the notion of a clique complex.

Definition 5 (Clique Complex). The clique complex of a graph $G$ is the abstract simplicial complex formed by associating a $k$-simplex to every $(k+1)$-clique in $G$.
The invariants we are most interested in are the Betti numbers of a clique complex associated to a graph, which give the number of holes of a given dimension in that clique complex. We now show how to determine the Betti numbers of an arbitrary simplicial complex via simplicial homology.
Let $K$ be a simplicial complex and consider the set of $k$-simplices in $K$. In what follows, we would like to make sense of taking "linear combinations" of the $k$-simplices in $K$ with coefficients in some field $R$ (we will only need $R = \mathbb{R}$ or $R = \mathbb{C}$ for our purposes). A $k$-chain is a formal sum of $k$-simplices, $\sum_i c_i \sigma_i$, where $\sigma_i \in K$ and $c_i \in R$. The set of all $k$-chains is denoted by $C_k(K)$ and is a vector space over $R$. The $k$-simplices form a basis for $C_k(K)$, so the dimension of $C_k(K)$ equals the number of $k$-simplices in $K$.
Definition 6 (Boundary Map). Let $\sigma = [v_0,\ldots,v_k]$ be a $k$-simplex. The boundary map on $k$-simplices is the map that acts as
$$\partial_k\,[v_0,\ldots,v_k] = \sum_{i=0}^{k} (-1)^i\, [v_0,\ldots,\hat{v}_i,\ldots,v_k],$$
where $\hat{v}_i$ denotes that the vertex $v_i$ has been removed.
The boundary map acts on $k$-chains $\sigma \in C_k(K)$ and gives a $(k-1)$-chain $\partial_k \sigma$ that can be interpreted as the boundary of $\sigma$.
A $k$-cycle is a $k$-chain $c \in C_k(K)$ such that $\partial_k c = 0$. Therefore, the $k$-cycles are precisely the kernel of the boundary map and form a subspace of $C_k(K)$ denoted by $Z_k(K) = \ker \partial_k$. A $k$-chain $c$ is a $k$-boundary if there exists a $(k+1)$-chain $\sigma \in C_{k+1}(K)$ such that $c = \partial_{k+1}(\sigma)$. Equivalently, the $k$-boundaries are precisely the image of the boundary map and form a subspace denoted by $B_k(K) = \operatorname{Im} \partial_{k+1}$. Figure 6 shows an example of a 1-boundary and a 1-cycle. It turns out that there is a relationship between the two subspaces, implied by the following fundamental result in homology theory: $\partial_k \circ \partial_{k+1} = 0$.
The above proposition essentially says that the boundary of a boundary is 0. It implies that the image of $\partial_{k+1}$ is contained in the kernel of $\partial_k$. This allows us to define the homology groups as follows.

Definition 7 (Homology Groups). The $k$-th singular homology group $H_k$ of a simplicial complex is the quotient vector space
$$H_k = Z_k(K)/B_k(K) = \ker\partial_k / \operatorname{Im}\partial_{k+1}.$$

Definition 8 (Betti Numbers). The $i$-th Betti number $\beta_i$ is the dimension of the $i$-th homology group.
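The fundamental lemma $\partial_k \circ \partial_{k+1} = 0$ can be checked concretely for a filled triangle $[v_0, v_1, v_2]$ using the signed boundary map of Definition 6 (a toy check, ours):

```python
# Check that the boundary of a boundary is zero for a filled triangle,
# using the signed boundary map of Definition 6.
import numpy as np

# Bases: C_2 = {[0,1,2]}, C_1 = {[0,1], [0,2], [1,2]}, C_0 = {[0], [1], [2]}.
# d2 [0,1,2] = [1,2] - [0,2] + [0,1]
d2 = np.array([[ 1],
               [-1],
               [ 1]])          # rows ordered [0,1], [0,2], [1,2]
# d1 [a,b] = [b] - [a]
d1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]])  # columns ordered [0,1], [0,2], [1,2]

assert not np.any(d1 @ d2)     # d1 . d2 = 0, so Im d2 is contained in ker d1
```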
The i-th homology group is generated by cycles that are not the boundaries of any simplex. In other words, these are simplices that enclose a "void" or "hole" (see the right image of Figure 6 above). It is worth noting that β 0 , the 0 th Betti number, represents the number of connected components the simplicial complex has.
The problem of computing the Betti numbers of a simplicial complex therefore reduces to the problem of computing the rank of the boundary map. A common and simple classical approach to doing this at the computational level is as follows.
Let $K$ be a simplicial complex, and assume for simplicity we are working over the field $\mathbb{Z}_2$. Label the $p$-simplices in $C_p(K)$ by $x_1,\ldots,x_{n_p}$ and the $(p-1)$-simplices in $C_{p-1}(K)$ by $y_1,\ldots,y_{n_{p-1}}$. These simplices form bases for $C_p(K)$ and $C_{p-1}(K)$, as mentioned previously. We can then represent the action of the boundary map on basis elements as $\partial_p x_j = \sum_{i=1}^{n_{p-1}} [\partial_p]_{ij}\, y_i$. Then, for any $p$-chain $c = \sum_{j=1}^{n_p} a_j x_j$, the action of $\partial_p$ is obtained by applying the matrix $[\partial_p]$ to the coefficient vector $(a_1,\ldots,a_{n_p})$. Thus the boundary map on $C_p(K)$ can be represented as an $n_{p-1}\times n_p$ sparse matrix with entries in $\mathbb{Z}_2$. The columns of this matrix span $\operatorname{Im}\partial_p = B_{p-1}$, so $\operatorname{rank}\partial_p = \dim B_{p-1} = b_{p-1}$. The boundary matrix can be brought into Smith normal form via a generalization of Gaussian elimination which applies to any principal ideal domain (this includes fields). This results in a matrix with a number of 1's on the diagonal equal to the rank of the boundary matrix. Gaussian elimination for $\partial_p$ takes $O(n_{p-1}\, n_p \min(n_p, n_{p-1}))$ time and requires $O((n_{p-1}+n_p)^2)$ memory [50,51]. When working over arbitrary fields, the same procedure holds, with the exception that the matrix elements of the boundary operator can be $\pm 1$ or 0.
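The procedure above can be sketched in a few lines: Gaussian elimination over $\mathbb{Z}_2$ gives the rank of each boundary matrix, and $\beta_p = n_p - \operatorname{rank}\partial_p - \operatorname{rank}\partial_{p+1}$. A minimal illustration for the hollow triangle (ours, not optimized):

```python
# Classical Betti computation over Z_2: Gaussian elimination gives the rank
# of the boundary matrix, and beta_p = n_p - rank d_p - rank d_{p+1}.

def rank_gf2(rows):
    """Rank of a 0/1 matrix (list of row lists) over the field Z_2."""
    rows = [r[:] for r in rows]
    rank = 0
    ncols = len(rows[0]) if rows else 0
    for col in range(ncols):
        pivot = next((i for i in range(rank, len(rows)) if rows[i][col]), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        for i in range(len(rows)):
            if i != rank and rows[i][col]:
                rows[i] = [a ^ b for a, b in zip(rows[i], rows[rank])]
        rank += 1
    return rank

# Hollow triangle: vertices {0,1,2}, edges {01, 02, 12}, no 2-simplex.
d1 = [[1, 1, 0],   # vertex 0 is a face of edges 01 and 02
      [1, 0, 1],   # vertex 1 is a face of edges 01 and 12
      [0, 1, 1]]   # vertex 2 is a face of edges 02 and 12
r1 = rank_gf2(d1)                  # rank of d_1 over Z_2
beta_0 = 3 - 0 - r1                # n_0 - rank d_0 - rank d_1
beta_1 = (3 - r1) - 0              # (n_1 - rank d_1) - rank d_2
assert (beta_0, beta_1) == (1, 1)  # one component, one hole
```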
In the worst-case scenario, however, the number of $p$-simplices grows exponentially with the number of points in the complex, so this procedure becomes intractable for large data sets. This motivates the problem of finding efficient classical and quantum algorithms for extracting Betti numbers of simplicial complexes.

5. Perform an inequality test of each of the $n$ registers of $n_{\rm seed}$ bits against the threshold $b_1\ldots b_{n_{\rm seed}}$. The cost is $n_{\rm seed}\, n$ Toffolis.
6. Sum the results of those inequality tests, with cost $n$.
7. Check whether the result is equal to $k$, with cost $\log n$.
The logic of this procedure is that whenever there are at least k registers greater than or equal to the threshold, we increase the threshold, which is given by the bits b_1 b_2 . . ., and otherwise we reduce it. This corresponds to the inequality test ≥ k in part 4(c), and the threshold is adjusted by choosing b_j. At the end, provided we have success, meaning the sum of the ones is equal to k, we have k ones in a superposition of permutations corresponding to a Dicke state. The ones are obtained by the inequality tests in step 5, and we check that there are exactly k of them in steps 6 and 7. The prepared state is entangled with the ancilla registers, but this is suitable for our application. Summing all the costs in this procedure gives a total Toffoli cost of

$$(n_{\text{seed}} + 1)\left(\frac{n}{2}(n_{\text{seed}} + 2) + \lceil \log n \rceil\right).$$
To derive the probability of success, note that failure occurs when it is not possible to provide a threshold such that exactly k numbers are greater than or equal to it. That will be true if the kth smallest and (k+1)th smallest numbers are equal. In other words, if we were to sort our numbers and compare the values in register k and register k+1, there would be failure if these numbers were equal. (Note that we do not perform this sort in practice.) The success probability for a given k may be given in the form of a sum as

$$P_k = \binom{n}{k} \sum_{\ell=1}^{cn} \frac{\left[\ell^k - (\ell-1)^k\right](cn-\ell)^{n-k}}{(cn)^n}.$$

This expression is obtained by considering values up to ℓ in registers 1 to k, and greater than ℓ for registers k+1 to n. The number of combinations of k registers with a maximum of ℓ is ℓ^k − (ℓ−1)^k, and the number of combinations of n−k registers with values above ℓ is (cn−ℓ)^{n−k}. The factor of $\binom{n}{k}$ accounts for the number of ways to choose the k registers with maximum value ℓ. Summing over ℓ then gives the number of combinations of values for which there is a success when sorting the numbers, and we divide by the total number (cn)^n to give the probability.
This sum can be approximated by an integral to give

$$\binom{n}{k}\sum_{\ell=1}^{cn}\frac{\left[\ell^k-(\ell-1)^k\right](cn-\ell)^{n-k}}{(cn)^n} \approx \binom{n}{k}\int_0^{cn}\frac{\left[x^k-(x-1)^k\right](cn-x)^{n-k}}{(cn)^n}\,dx \approx 1-\frac{1}{2c}.$$

The integrals are evaluated by repeated integration by parts, which cancels the factor of $\binom{n}{k}$. This shows that the exact expression is well approximated by the upper bound 1 − 1/2c. See Appendix C for analysis of the error in this approximation, showing that it is of order O(c^{−(min(k,n−k)+1)}). Moreover, we show that 1 − 1/2c is an exact lower bound on the success probability.
In practice, we find that moderate constant values of c are suitable for minimising the complexity, and we will take c = 8 in the examples below. That only increases the cost by about 3%, while needing only another 3 qubits to store the numbers. The result of taking c constant is that the Toffoli complexity is approximately (n/2) log² n. In comparison, in the sorting approach from [24] the complexity is approximately 2 n_seed n log n, where n log n is the number of steps in the sorting network. Moreover, that approach fails when any of the numbers in the sorted registers have equal values, which requires taking 2^{n_seed} ≳ n² to provide a reasonable probability of success. That results in a preparation complexity of approximately 4n log² n, or 8 times what we have provided here.
As a simple example of the threshold procedure, consider n = 4 and c = 4, so we have 4 registers of 4 qubits. Starting from an example basis state of these registers, say we aim for k = 2. The steps are as follows.
• Sum the first (most significant) bits, giving 1. This is equivalent to checking how many registers are at least 1000. The sum is 1, which is less than k, so b_1 = 0.
• Perform an inequality test on the first two bits of each register with 01. We get a total of 3, which is greater than k, so b_2 = 1.
• Perform an inequality test of the first three bits with 011. We get 3 again, so b_3 = 1.
• Perform an inequality test with 0111. We get a total of 2, which is equal to k, so we get b_4 = 1.
• Perform an inequality test with 0111 again. Now test equality with k, which succeeds, so we have overall success.
As an alternative to the above example, consider a basis state whose registers contain the values 0010, 0110, 0110, 1110. Then in the second-last step the total would be 1, so we would have b_4 = 0, and in the last step we would test inequality with 0110. That would give three registers greater than or equal to 0110, so we would fail the test that the number is equal to k. Indeed, sorting these numbers gives 0010, 0110, 0110, 1110, where the second and third numbers are equal; this is the condition for failure discussed above.
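The threshold search in this example can be replayed classically. The sketch below is ours, with illustrative register values chosen to reproduce the counts quoted above (the actual basis state in the text is not specified).

```python
def threshold_dicke_flags(values, k, nbits):
    """Classically replay the threshold search: build the threshold bits
    b_1 ... b_nbits from the most significant down, keeping a bit at 1
    whenever at least k registers are >= the candidate threshold; success
    means exactly k registers meet the final threshold."""
    thr = 0
    for j in range(nbits - 1, -1, -1):
        cand = thr | (1 << j)
        if sum(v >= cand for v in values) >= k:
            thr = cand
    flags = [int(v >= thr) for v in values]   # the n "one" qubits
    return flags, sum(flags) == k

# Values chosen to match the counts in the worked example (hypothetical).
ok_vals = [0b1110, 0b0111, 0b0110, 0b0010]
bad_vals = [0b1110, 0b0110, 0b0110, 0b0010]   # 2nd and 3rd smallest equal
print(threshold_dicke_flags(ok_vals, 2, 4))   # success: exactly two flags set
print(threshold_dicke_flags(bad_vals, 2, 4))  # failure case from the text
```

On success the flag pattern is a permutation with exactly k ones, matching the Dicke-state construction; the failing values reproduce the equal-registers failure condition.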

Second scheme
Here we provide an alternative scheme for Dicke state preparation that can be more efficient, but has a lower amplitude for success. It is useful in the case of large n but small k (smaller than √n), where $\binom{n}{k} \sim n^k/k!$. The steps of this scheme are as follows.

1. First prepare k registers in equal superposition states over n values. The complexity of preparing an equal superposition over n values (with high probability of success) is 4⌈log n⌉ + 1 Toffolis [7], so this gives complexity k(4⌈log n⌉ + 1). In the case where n is a power of 2, this preparation can be performed with Hadamards alone.
2. For each of the k registers, apply unary iteration as in [30] with the n qubits for the Dicke state as the target. This is used to flip the corresponding qubit. The complexity of each unary iteration is n − 2, giving total complexity k(n − 2). There will be approximately log n temporary ancillas used in the unary iteration that are reset to zero.
3. Now we sum the bits in the string of n qubits. A method of summing bits is given in [7], where multiple groups of bits are summed, and their sums are summed. The overall complexity is no more than 2n Toffolis, and only a logarithmic number of ancillas is used. The ancillas used that need to be kept for later stages of the calculation can be given as ⌈log n⌉, and there are 2⌈log n⌉ temporary ancillas used.
4. The sum of the bits is compared to k. This has complexity ⌈log k⌉ because we are guaranteed that the number of ones is at most k.
As before, this scheme prepares the Dicke state entangled with an ancilla. The overall complexity is then

$$(k + 2)n + k(4\lceil\log n\rceil - 1) + \lceil\log k\rceil$$

Toffolis, with roughly k⌈log n⌉ ancillas for the index registers and a logarithmic number of working ancillas as well. In comparison, the scheme above uses a number of qubits scaling as n log n, so is much larger when n ≫ k. In the case where n is a power of 2, the initial preparation of the equal superposition can be performed with Hadamards alone, and the Toffoli complexity reduces to (k + 2)n − 2k + ⌈log k⌉. There is a probability of failure for the preparation, because it is possible for the ones to overlap, which will be detected in the final stage where the sum of bits is compared to k. This is an example of the birthday problem, and the probability of success (that all k indices are distinct) will be at least 1/2 when k is not larger than approximately √n. It is possible to perform amplitude amplification on this step, but it is simpler to combine this step with the clique finding. The net result on the complexity is that wherever we have $\binom{n}{k}$ in the original costs, it is replaced with $n^k/k!$.
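The success probability of this scheme is the standard birthday-problem expression (our reconstruction of the probability referred to in the text): the k uniformly drawn indices must all be distinct, giving $\prod_{j=1}^{k-1}(1-j/n) = \binom{n}{k} k!/n^k$, which is consistent with $\binom{n}{k}$ being replaced by $n^k/k!$ in the costs.

```python
from fractions import Fraction
from math import comb, factorial

def second_scheme_success(n, k):
    """Birthday-problem success probability: k independent uniform indices
    over n values are all distinct, so the k flipped qubits do not overlap."""
    p = Fraction(1)
    for j in range(k):
        p *= Fraction(n - j, n)
    return p

n = 256
for k in (2, 8, 16):
    p = second_scheme_success(n, k)
    assert p == Fraction(comb(n, k) * factorial(k), n ** k)  # C(n,k) k!/n^k
    print(k, float(p))
```

For n = 256 the probability stays above 1/2 all the way up to k = 16 = √n, matching the rule of thumb in the text.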

Appendix C: Accuracy of Dicke preparation success probability
Here we analyse the accuracy of the approximation of the sums by integrals in Eq. (B3). The sums correspond to the integrals discretised in steps of 1/cn, so the approximation can be expected to converge for large c. Even for moderate c the approximation of the sums by integrals is highly accurate, because the integrands are zero at the bounds of the integrals, and the first non-zero derivatives at the bounds are of order min(k, n − k). That means the error term in the Euler-Maclaurin formula will correspond to the derivative of order min(k, n − k).
In particular, since the summands have nonzero derivatives only up to order n, the first sum can be given by the Euler-Maclaurin formula

$$\sum_{\ell=0}^{cn} f(\ell) = \int_0^{cn} f(x)\,dx + \frac{f(0)+f(cn)}{2} + \sum_{j} \frac{B_{2j}}{(2j)!}\left[f^{(2j-1)}(cn) - f^{(2j-1)}(0)\right],$$

where f(ℓ) is the summand, the B_{2j} are Bernoulli numbers, and the upper bound on the sum over j is chosen so that only derivatives up to order n are included. Now the summand is zero at the bounds, f^{(2j−1)}(0) is nonzero only for 2j − 1 ≥ k, and f^{(2j−1)}(cn) is nonzero only for 2j − 1 ≥ n − k. When 2j − 1 ≥ k, the only nonzero term in the derivative at ℓ = 0 is the one with exactly k derivatives acting on (ℓ/cn)^k and 2j − 1 − k derivatives acting on (1 − ℓ/cn)^{n−k}, because we need exactly k derivatives of (ℓ/cn)^k for it to be nonzero at ℓ = 0; its value then follows from the general Leibniz rule. Similarly, the only nonzero term in the derivative at ℓ = cn has n − k derivatives acting on (1 − ℓ/cn)^{n−k} and 2j − 1 − n + k derivatives acting on (ℓ/cn)^k, with the value again given by the general Leibniz rule. These derivative values give the corrections to the first sum. For the evaluation of the second sum we have exactly the same reasoning, except that the values of the derivatives at the bounds are slightly different: the boundary contribution arises at ℓ = 1 instead of ℓ = 0, while the factor at ℓ = cn is the same. Using these factors and combining the expressions for the two sums then gives

$$\binom{n}{k}\sum_{\ell}\frac{\left[\ell^k-(\ell-1)^k\right](cn-\ell)^{n-k}}{(cn)^n} = 1 - \frac{1}{2c} + \sum_j\left[D(n,k,c,j) + D(n,n-k,c,j)\right],$$

where D(n, k, c, j) denotes the correction coefficient obtained from the boundary derivatives above. We will show that |D(n, k, c, j)| decreases with j, so the dominant terms come from the smallest j. Moreover, we will show that the largest values come from taking j = 1 with k = 1 or n − k = 1, in which case the correction is of order 1/c². That means the largest value the correction can take is of higher order than the failure probability.
To show that |D(n, k, c, j)| decreases with j, first note that

$$|B_{2j}| = \frac{2\,(2j)!\,\zeta(2j)}{(2\pi)^{2j}},$$

so the monotonic decreasing property of the Riemann zeta function ζ(2j) allows us to upper bound the ratio of |D(n, k, c, j + 1)| to |D(n, k, c, j)| by a quantity of order 1/(2πc)². Here we have used the fact that the factor in square brackets for D(n, k, c, j) is monotonically decreasing in j. Note that in the sums we use both D(n, k, c, j) and D(n, n − k, c, j), and the same reasoning holds for both. This shows that the summands D(n, k, c, j) and D(n, n − k, c, j) rapidly decrease with j, so the value for the smallest j is dominant.

Now let us consider the size of D(n, k, c, j) for the smallest j in the sum. For k = 1 the sum starts from j = 1, where we use B_2 = 1/6, and for k = 2 the sum starts from j = 2, where we use B_4 = −1/30. Thus we find that the first term in the sum is O(c^{−(k+1)}) for k = 1 or 2. Next we consider the value as we increase k in steps of 2. Let us put j_k := ⌈(k+1)/2⌉ for the first j in the sum. Regardless of whether k is even or odd we find (2j_k − 1 − k)! = 1. For the case of odd k we have 2j_k = k + 1, which shows that the sum is decreasing when we consider steps of 2 in k. For k even we have 2j_k = k + 2, so

$$\frac{|D(n, k+2, c, j_{k+2})|}{|D(n, k, c, j_k)|} < \frac{k+3}{(2\pi c)^2 (k+1)},$$

which shows the sum is still decreasing. In either case, increasing the value of k by 2 reduces the value by at least a factor of c², so we find that this first term in the sum is O(c^{−(k+1)}) in general. Moreover, because we have shown that successive terms in the sum are smaller by factors of 1/(2πc)², the remainder of the sum beyond the first term is O(c^{−(k+2)}). We have two sums, one with k and one with n − k, so the total of the two sums over j is O(c^{−(min(k,n−k)+1)}). These results can be used to show a lower bound of 1 − 1/2c for the success probability. We will take n ≥ 3, since in the trivial case n = 2 we find the probability is exactly 1 − 1/2c.
First, we use the bound in Eq. (C17), where in the fifth line we have used c ≥ 1. Next, we lower bound the probability of success by upper bounding the sums over j by the cases with k = 1 or n − k = 1, and using a factor of (1 − 1/(2πc)²)^{−1} to account for the sum in which each successive term is smaller than the previous by at least a factor of 1/(2πc)². Combining these bounds with the expression for the success probability, and using the bound from Eq. (C17) in the last line, shows that the probability of success is lower bounded by 1 − 1/2c for n ≥ 3. Including the case n = 2 as well, this shows that the probability of success is lower bounded by 1 − 1/2c in general.
For the example where n = 256 and c = 8, the integral approximation gives an estimated success probability of 0.94001551223575. We show the error in the integral approximation in Fig. 7 for this example. As expected from our analysis of the Euler-Maclaurin formula, the integral approximation becomes more accurate as min(k, n − k) is increased. For k near n/2 the approximation is accurate to well over 200 decimal places. For k = 16 the integral approximation is accurate to about 30 decimal places. For k = 2 the error in the approximation is still less than 10^{−6}, though it increases to 0.0012 for k = 1 or k = n − 1.
We can also prove more simply that 1 − 1/2c lower bounds the probability of success when averaging over k. Summing Eq. (B2) over k = 1 to n − 1, and dividing by n − 1 for the number of values of k, gives a lower bound of 1 − 1/2c on the average success probability.
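The exact success-probability sum and the 1 − 1/(2c) lower bound can be checked numerically against brute-force enumeration for small instances; the sketch below is ours, taking register values uniform on {1, ..., cn}.

```python
from itertools import product
from math import comb

def p_success_formula(n, k, c):
    """Sum from the text: probability that the k-th and (k+1)-th smallest of
    n iid uniform values on {1,...,cn} differ (threshold search succeeds)."""
    cn = c * n
    total = sum((l ** k - (l - 1) ** k) * (cn - l) ** (n - k)
                for l in range(1, cn + 1))
    return comb(n, k) * total / cn ** n

def p_success_bruteforce(n, k, c):
    """Direct enumeration over all (cn)^n tuples of register values."""
    cn = c * n
    good = sum(1 for vals in product(range(1, cn + 1), repeat=n)
               if sorted(vals)[k - 1] != sorted(vals)[k])
    return good / cn ** n

for n, k, c in [(3, 1, 2), (3, 2, 2), (4, 2, 2)]:
    pf = p_success_formula(n, k, c)
    assert abs(pf - p_success_bruteforce(n, k, c)) < 1e-12
    assert pf >= 1 - 1 / (2 * c)    # the 1 - 1/(2c) lower bound
    print(n, k, c, round(pf, 4))
```

For n = 3, k = 1, c = 2 the formula gives 165/216 ≈ 0.764, above the bound 1 − 1/(2c) = 0.75.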

Appendix D: Proof of complexity of amplitude estimation
For the complexity of amplitude estimation, the standard approach is to use phase estimation on the Grover iterate of amplitude amplification. If there is an initial amplitude of a, then the phase of each step of amplitude amplification is 2 arcsin a. The original proposal was to use control registers for the phase estimation in an equal superposition, but that is a poor choice for phase estimation: here we would like a small probability of error beyond a given confidence interval, and for that purpose it is better to use a Kaiser window.
When applying the phase measurement, we would start with a control state given by a Kaiser window (omitting normalisation), with amplitudes proportional to I_0(πα√(1 − (2j/N)²)), where I_0 is a zeroth-order modified Bessel function of the first kind. If we call W the operator combining U and the reflection on the flag qubit, then we would control between applications of W and W†. The inverse quantum Fourier transform then corresponds to an inner product with a phase state, and this inner product gives the Fourier transform of the window, which is proportional to

$$\frac{\sin\!\left(\sqrt{(N\theta)^2 - (\pi\alpha)^2}\right)}{\sqrt{(N\theta)^2 - (\pi\alpha)^2}},$$

with θ the deviation of the estimated phase from the true phase. This needs to be squared to give the probability distribution for the error in the phase measurement. The distribution has its first zero at θ = (π/N)√(1 + α²), so to estimate the probability in the wings of the distribution we should integrate past that point. We also have the difficulty that we are not given the exact normalisation of the probability distribution. To approximate the normalisation, we can approximate the centre of the distribution by a Gaussian, found by taking the Taylor series of the log of the distribution about zero in the error ∆θ = θ̂ − θ. Taking the integral over ∆θ then gives a normalisation proportional to sinh²(πα), which is found asymptotically using the properties of Bessel functions together with sinh²(πα) ≈ e^{2πα}/4 and coth(πα) ≈ 1. Now, for the integral over the tails we can upper bound the probability by replacing the sin with 1, and using arcsinh(α) ≈ ln(2α) we obtain an asymptotic expression for the unnormalised tail weight. Dividing by the asymptotic expression for the normalisation then gives the asymptotic probability of error outside the confidence interval, which decreases exponentially in α. This tells us that, if we want the probability of error outside the range to be no more than δ, we should take α = Θ(ln(1/δ)), with a higher-order ln ln correction from the polynomial factors. The size of the confidence interval is (π/N)√(1 + α²), so if that needs to be ϵ, we should take

$$N \approx \frac{\pi}{\epsilon}\sqrt{1 + \alpha^2}.$$

The higher-order ln ln term for α is larger than the correction term for approximating √(1 + α²) by α.
The number of calls to U or U † is N , giving the complexity stated.
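As an illustration (ours, not the fault-tolerant construction in the text) of why a Kaiser window suppresses the tails relative to an equal-superposition control register, the numerics below compare the phase-measurement outcome distributions for a worst-case phase midway between grid points. The discrete window convention (peak at the register centre) and the parameters N and α are illustrative choices.

```python
import cmath, math

def bessel_i0(x):
    """Series expansion of the modified Bessel function I_0."""
    term = total = 1.0
    k = 0
    while term > 1e-16 * total:
        k += 1
        term *= (x * x / 4.0) / (k * k)
        total += term
    return total

def outcome_probs(window, theta0):
    """Phase-estimation outcome probabilities for control amplitudes `window`
    when the true phase is theta0 (inverse QFT of the phased window)."""
    N = len(window)
    norm2 = sum(w * w for w in window) * N
    return [abs(sum(w * cmath.exp(1j * j * (theta0 - 2 * math.pi * m / N))
                    for j, w in enumerate(window))) ** 2 / norm2
            for m in range(N)]

N, alpha, m0 = 64, 2.0, 17
theta0 = 2 * math.pi * (m0 + 0.5) / N          # worst case: midway between bins
kaiser = [bessel_i0(math.pi * alpha * math.sqrt(max(0.0, 1 - (2*j/(N-1) - 1)**2)))
          for j in range(N)]
uniform = [1.0] * N

def tail(probs, width=3):
    """Probability mass further than `width` bins from the true phase."""
    return sum(p for m, p in enumerate(probs)
               if min(abs(m - m0), abs(m - m0 - 1)) > width)

pk, pu = outcome_probs(kaiser, theta0), outcome_probs(uniform, theta0)
print(tail(pk) < tail(pu))                     # Kaiser suppresses the tails
```

The uniform window's sinc-squared tails decay only quadratically, while the Kaiser window's tails are exponentially suppressed in α, consistent with the analysis above.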
Similarly, we obtain the corresponding expression for |χ_k^⊥⟩. Applying W† to the expression for W|χ⟩|k⟩ then yields Eq. (37) as required. We have corrected the extra factors of i included in [24]. Note that the rest of the reasoning in [24] is correct.

Lemma 7 (Künneth formula)
We would also like to relate the Laplacian of X * Y to the Laplacians of X and Y .
Proof. Consider first the Laplacian ∆_k^{S_i} of R_1(S_i). Say it has smallest eigenvalue c. Then by induction using Corollary 1, the smallest eigenvalue of ∆_k^S is also c.

Perturbations of K(m, k) and smaller spectral gaps
In Section IV D, we introduced a novel classical algorithm that, in conjunction with Apers et al.'s algorithm [52], significantly enhances classical runtimes in the domain where quantum algorithms achieve polynomial runtimes. Moreover, these classical algorithms can potentially operate in polynomial time, contingent on the required precision for estimating the Betti number, provided the spectral gap of the combinatorial Laplacian is sufficiently large. In this section, we study perturbations of the graph K(m, k) (defined in the previous section) that lead to a smaller spectral gap. These perturbations render the classical algorithms inefficient, regardless of the precision needed for the Betti number estimation, though they do not impair the efficiency of the quantum algorithm. Additionally, these perturbations show that the speed-ups are robust, in the sense that a much larger class of graphs has a guaranteed speed-up.

We will denote our perturbed family of graphs K′(m, k). Recall that K(m, k) consisted of k clusters, each with m vertices. There were no edges within the clusters, and all edges between vertices in different clusters were included. Let G′ be the graph consisting of m vertices and a single edge; thus G′ consists of a single edge together with m − 2 completely disconnected vertices. Let K′(m, k) consist of k clusters, where each cluster is a copy of G′, and again include all edges between clusters. In other words, we obtain an instance of K′ by adding a single edge to each of the clusters, which were originally independent sets. The spectrum of ∆_0 is thus 0 with multiplicity m − 2, 2 with multiplicity 1, and m with multiplicity 1.
Corollary 2. The (k−1)th combinatorial Laplacian ∆_{k−1} of the clique complex of K′(n/k, k) has a normalized spectral gap that is inverse polynomial in n.

In short, when k = c · log(n) the graph K′(m, k) has an inverse polynomial spectral gap, as opposed to the inverse logarithmic spectral gap in the case of K(m, k). This key distinction amplifies the disparities in runtime between the quantum and classical algorithms. Notably, the quantum algorithm (still) attains a superpolynomial speedup over its classical counterparts, even in scenarios where only a constant level of additive precision for the normalized Betti number is required.
Next, we demonstrate the potential for even broader generalization of these perturbations. Consider the m-vertex graph G′′ that is a disjoint union of m/2 edges (i.e., there are no isolated vertices), and let K′′(m, k) be the k-fold join of G′′.

Lemma 11. The spectrum of the (k−1)th combinatorial Laplacian of the clique complex of K′′(m, k) consists of the eigenvalues λ_{i,j,ℓ} = 0 · i + 2 · j + m · ℓ with i + j + ℓ = k, with multiplicities determined by the multiplicities of the per-cluster eigenvalues.

Proof. A quick computation reveals that the 0th combinatorial Laplacian of G′′ has spectrum spec(∆_0(Cl(G′′))) = {0 (with multiplicity m/2 − 1), 2 (with multiplicity m/2), m (with multiplicity 1)}.
Following this observation, the lemma follows from the application of Corollary 1.
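The base case of Lemma 11 can be checked numerically under the assumption (ours) that ∆_0(Cl(G′′)) equals the graph Laplacian of the m/2 disjoint edges plus the all-ones matrix (the rank-one contribution from including the empty simplex); with that convention the quoted multiplicities follow by exhibiting explicit eigenvectors.

```python
def matvec(M, v):
    return [sum(r[j] * v[j] for j in range(len(v))) for r in M]

def delta0_disjoint_edges(m):
    """Candidate Delta_0 for Cl(G''): graph Laplacian of m/2 disjoint edges
    plus the all-ones matrix. This convention is our assumption, chosen to
    match the spectrum quoted in Lemma 11."""
    M = [[1] * m for _ in range(m)]          # all-ones part
    for e in range(m // 2):
        a, b = 2 * e, 2 * e + 1
        M[a][a] += 1; M[b][b] += 1           # vertex degrees
        M[a][b] -= 1; M[b][a] -= 1           # minus adjacency
    return M

def is_eigvec(M, v, lam):
    return matvec(M, v) == [lam * x for x in v]

m = 8
M = delta0_disjoint_edges(m)
ones = [1] * m                               # eigenvalue m, multiplicity 1
diffs = []                                   # eigenvalue 2, multiplicity m/2
for e in range(m // 2):
    v = [0] * m; v[2 * e], v[2 * e + 1] = 1, -1
    diffs.append(v)
kerns = []                                   # eigenvalue 0, multiplicity m/2 - 1
for e in range(1, m // 2):
    v = [0] * m
    v[0] = v[1] = 1; v[2 * e] = v[2 * e + 1] = -1
    kerns.append(v)
print(is_eigvec(M, ones, m), all(is_eigvec(M, v, 2) for v in diffs),
      all(is_eigvec(M, v, 0) for v in kerns))   # True True True
```

The 1 + m/2 + (m/2 − 1) = m independent eigenvectors account for the whole spectrum.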
Corollary 3. The (k−1)th combinatorial Laplacian ∆_{k−1} of the clique complex of K′′(n/k, k) has a normalized spectral gap that is inverse polynomial in n.

Again, the graph K′′(m, k) has an inverse polynomial spectral gap, as opposed to an inverse logarithmic one. In particular, the quantum algorithm (still) attains a superpolynomial speedup over its classical counterparts, even in scenarios where only a constant level of additive precision for the normalized Betti number is required. In conclusion, adding a single edge or m/2 edges to each cluster within K(m, k) does not change the superpolynomial quantum speedup that it exhibits.
In order to set up the relevant path integrals, we must first employ a Trotter decomposition. This allows us to represent the exponential in terms of exponentials of one-sparse matrices, which can then be simulated through randomization. As each H_p is Hermitian, it has a complete set of eigenvectors. Since H_p is also one-sparse, and such matrices can be written as a direct sum of irreducible one- and two-dimensional blocks, we can parameterize the eigenvectors to respect the structure of each two-dimensional block. Note that each eigenvector |λ_{p,ν}⟩ is such that ⟨p|λ_{p,ν}⟩ is nonzero for only two different computational basis vectors.
Note that the first term on the RHS of (G1) contains 2rD terms. If we introduce a vector of indices p = {1, . . . , r, 1, . . . , r, . . . , r} with 2rD entries denoted p_i, we can express the Haar-averaged (E_H) expectation of the exponential of the boundary operator as a product over these 2rD factors. Next we set up our path integrals by selecting sets of 2rD indices that correspond to the eigenstates that we transition to in the path integral. We denote such a path via the vector Γ, where Γ_j corresponds to the index of the jth eigenstate in the path. Using this notation we can insert resolutions of the identity of the form $\sum_{\Gamma_j} |\lambda_{p_j,\Gamma_j}\rangle\langle\lambda_{p_j,\Gamma_j}|$, consisting of the eigenvectors of each U_{p_j}, in between each of the 2rD terms. Defining ρ = |ψ⟩⟨ψ| then gives

$$E_H\, E_{\Gamma\sim\Pr}\!\left[\frac{1}{\Pr(\Gamma)} \exp\!\Big(-\sum_{i=1}^{2rD}\lambda_{p_i,\Gamma_i}\, \frac{t}{2r}\Big)\, W(\Gamma)\,\delta_{\Gamma_1,\Gamma_{2rD}}\, \langle\lambda_{p_1,\Gamma_1}|\,\rho\,|\lambda_{p_1,\Gamma_1}\rangle\right],$$

where we have defined for convenience the quantity

$$W(\Gamma) = \langle\lambda_{p_1,\Gamma_1}|\lambda_{p_2,\Gamma_2}\rangle \cdots \langle\lambda_{p_{2rD-1},\Gamma_{2rD-1}}|\lambda_{p_{2rD},\Gamma_{2rD}}\rangle,$$

with Γ = [Γ_1, . . . , Γ_{2rD}]. In the fourth line, we divided and multiplied by a probability Pr(Γ) to express the sum as an average, which allows us to use importance sampling to minimize the variance via a judicious choice of Pr(Γ). In the last line, we used the fact that p_{2rD} = p_1 for the symmetric Trotter formula.
If we wish to estimate this value by sampling, the primary driver of the complexity will be the estimation of the expectation value through the sample mean, which is the optimal unbiased estimator of the population mean. The number of samples scales with the variance of the quantity that one averages over, and the variance over Γ of the above Haar expectation is, up to subtraction of the squared mean,

$$\sum_{\Gamma} \frac{1}{\Pr(\Gamma)}\, \exp\!\Big(-\sum_{i}\lambda_{p_i,\Gamma_i}\, \frac{t}{r}\Big)\, |W(\Gamma)|^2\, \delta_{\Gamma_1,\Gamma_{2rD}}.$$
There are many probability distributions that we could choose to sample from to minimize the variance in (G6). The most straightforward distribution to choose, and the appropriate one to pick in the limit of short t, is a uniform distribution. However, in practice the eigenvalues in the sum may have wildly varying sizes, and so the importance of the different paths can swing substantially. A more natural choice is to weight the probability of drawing each path by the magnitude of its contribution, where S_Γ is the set of valid paths with 2rD vertices such that each edge corresponds to a pair of connected eigenvectors for the one-sparse matrices used in the decomposition of B_G². The central challenge in employing this formula is to estimate the value of the sums over the values of Γ_i. As the H_j used in the derivation are Hermitian and unitary, we have that λ_{p_i,Γ_i} = ±c_{p_i}.
Furthermore, the one-sparse decomposition is chosen in such a way that the matrix elements are all off-diagonal, with the exception of any diagonal matrix that appears in the decomposition. This can be seen explicitly using the discussion of the Jordan-Wigner representation of the Dirac operator. This means that each eigenvector couples to at most two eigenvectors. At most one term is diagonal in the standard Trotter decomposition of the Dirac operator [45]. A simple combinatorial argument therefore leads to the conclusion that the total number of valid paths is at most d_{k−1} 2^{2r(D−1)−1}. Given this choice, the normalization constant (which is analogous to a partition function) can be expressed (assuming that the diagonal element is always p_i = D) as

$$\prod_i \left(\cosh(c_{p_i} t/r)\,\delta_{p_i \neq D} + \delta_{p_i = D}\, e^{-\lambda_{p_i,\Gamma_i} t/r}/2\right).$$

This can be computed using O(poly(rD)) arithmetic operations and so does not ruin the efficiency of the algorithm. Note that if the sum over the W(Γ) terms were considered instead, the result would be computationally difficult to compute, as these terms generate correlations that would prevent us from performing an independent sum for each of the factors.
The variance σ² over the values of Γ chosen in the path integrals for the expression for the Haar average is then precisely equal to the gap in the Cauchy-Schwarz inequality in the latter sum. This suggests that in certain cases where the Cauchy-Schwarz inequality is tight, the variance may be extremely small, given that we have the ability to sample from the distribution Pr(Γ).
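A toy illustration (ours, not the algorithm itself) of this variance-reduction principle: when the sampling weights are proportional to the terms themselves, so that f/Pr is nearly constant (the tight Cauchy-Schwarz case), the importance-sampled estimator of a sum has far smaller variance than uniform sampling.

```python
import random, statistics

def is_estimate(fvals, probs, rng, nsamp):
    """Importance-sampled Monte Carlo estimate of sum_x f(x):
    draw x ~ probs and average f(x) / probs[x]."""
    xs = rng.choices(range(len(fvals)), weights=probs, k=nsamp)
    return statistics.fmean(fvals[x] / probs[x] for x in xs)

fvals = [0.001, 2.5, 0.002, 4.0, 0.004]        # wildly varying term sizes
exact = sum(fvals)
uniform = [1 / len(fvals)] * len(fvals)
weighted = [f / exact for f in fvals]          # Pr proportional to the terms
rng = random.Random(3)
err_uni = statistics.pstdev(is_estimate(fvals, uniform, rng, 50) - exact
                            for _ in range(400))
err_imp = statistics.pstdev(is_estimate(fvals, weighted, rng, 50) - exact
                            for _ in range(400))
print(err_imp < err_uni)                        # True: importance sampling wins
```

With all terms positive and weights exactly proportional to them, each sample returns the exact sum, so the importance-sampled variance collapses to rounding error.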

Metropolis-Hastings algorithm
Outside of specific cases such as graphs, it is difficult in general to sample from the probability distribution Pr(Γ) needed to employ the above variance reduction strategy. It is therefore necessary to provide a general method to obtain these samples if we wish to address the problem more generally. One way to sample from the distribution Pr(Γ) is to use the Metropolis-Hastings algorithm. The idea behind the algorithm is to design a Markov chain whose stationary distribution equals our choice of Pr(Γ).
We first start with a connected, undirected graph on the set of all possible states (Γ_1, Γ_2, . . . , Γ_{2rD−1}), which represent the "paths" involved in our Trotter decomposition. Each vertex of the graph then represents one possible collection of values for Γ_1, . . . , Γ_{2rD−1}.
At each vertex a, we select a neighbor with probability 1/(2rD − 1). Since the degree may be less than 2rD − 1 at a given vertex, the walk may remain at that vertex, as there is a non-zero probability of no edge being selected. When a neighboring vertex b ∈ N(a) is selected, with N(a) the neighbors of a, we use the following rules: if the probability p_b of the proposed state is at least as great as the probability p_a := Pr(Γ)_a of the current state, we transition to b; if p_b < p_a, then we transition to b with probability p_b/p_a, and otherwise remain at a with probability 1 − p_b/p_a.

To bound the Trotter error, consider a product formula in which the operator H is decomposed into a sum of Γ terms, and let Υ be the number of "stages" of the formula. For the symmetric Trotter-Suzuki formula, Υ = 2 · 5^{q−1} for a formula of order 2q, where q = 1 (order p = 2) for our case. Here Γ = 2rD, and the short-time multiplicative error parameter α can be upper bounded accordingly. Since this is the short-time multiplicative error bound for simulating the operator for time t/r, we want to bound the resulting error when simulating for large t. To this end, note that if we approximate an operator A by an operator B up to multiplicative error m, then B = A(I + mC), where C is an operator with ∥C∥ ≤ 1 and m is a constant. Repeating this r times gives

$$\|B^r - A^r\| \le \|A\|^r \left[(1 + m)^r - 1\right] \le \|A\|^r\, \frac{mre}{1 - mre}$$

for mre < 1. Therefore the long-time multiplicative error is bounded by mre/(1 − mre), and we would like this to be less than some desired error ϵ_T > 0. This implies that we must have mre ≤ ϵ_T/(1 + ϵ_T) ≤ ϵ_T. In our context, A = e^{tH/r}, B is the approximation to A given by the Trotter formula, and m is the short-time multiplicative Trotter error bound cited above.
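The Metropolis-Hastings acceptance rule described above can be sketched generically. This toy version (ours) walks on a 4-cycle with stand-in weights for Pr(Γ) and checks that the empirical distribution approaches the target; the lazy proposal (empty slots meaning "stay put") mirrors the degree-deficit rule in the text.

```python
import random
from collections import Counter

def mh_step(a, weights, neighbors, max_degree, rng):
    """One step of the walk: pick one of max_degree slots uniformly; an empty
    slot means no edge was selected and we stay. Otherwise accept the move
    a -> b with probability min(1, p_b / p_a)."""
    nbrs = neighbors(a)
    slot = rng.randrange(max_degree)
    if slot >= len(nbrs):
        return a                                   # no edge selected
    b = nbrs[slot]
    if weights[b] >= weights[a] or rng.random() < weights[b] / weights[a]:
        return b
    return a

# Toy check: 4-cycle with stand-in (unnormalised) weights for Pr(Gamma).
weights = [1.0, 2.0, 3.0, 2.0]
neighbors = lambda v: [(v - 1) % 4, (v + 1) % 4]
rng = random.Random(11)
samples = []
for _ in range(5000):
    a = 0
    for _ in range(150):
        a = mh_step(a, weights, neighbors, 3, rng)
    samples.append(a)
counts = Counter(samples)
z = sum(weights)
print({v: round(counts[v] / 5000, 2) for v in range(4)})
```

Because the lazy uniform proposal is symmetric, detailed balance holds and the stationary distribution is proportional to the weights, as the empirical frequencies confirm.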
Using the bound on m and substituting in the parameters relevant for our situation, if r ≥ 4t ln 2 Σ_ℓ ∥H_ℓ∥ then mre ≤ 2eαt³/r² ≤ ϵ_T, so it suffices to take r ∈ Θ(t√(4etα/ϵ_T)). The systematic error in the estimate of the expectation value from the Trotter-Suzuki formula and the finite-length Markov chain can then be bounded, and the total number of operations needed to draw a single sample from the distribution with bias at most ϵ_TM follows from (G16).

The diagonal entries of ∆_k can be obtained by counting the total number of k-simplices contained in the boundaries of all (k+1)-simplices and dividing by the actual number of k-simplices in an n-simplex to account for over-counting. This quantity, the up-degree of a k-simplex, is precisely

$$\frac{(k+2)\binom{n+1}{k+2}}{\binom{n+1}{k+1}} = n - k,$$

and we thus get that the diagonal entries of ∆_k are all equal to n + 1 when k > 0 (note that this saturates the bound on the up-degree of a k-simplex in any simplicial complex given earlier, with N = n + 1). Now fix a particular k-simplex σ in the n-simplex. We want to count the number of other k-simplices that are not lower adjacent to σ. σ has k + 1 simplices of dimension k − 1 in its boundary. From the above result, each of these (k−1)-simplices has up-degree n − (k − 1) = n − k + 1. Then the number of potential lower adjacent k-simplices is (k+1)(n−k+1). But by the uniqueness of common upper simplices, we have counted σ itself k + 1 times, once for each of the (k−1)-simplices in the boundary of σ. Thus the number of other lower adjacent simplices is (k+1)(n−k+1) − (k+1) = (k+1)(n−k), and the number of other k-simplices not lower adjacent to σ is $\binom{n+1}{k+1}$ − (k+1)(n−k) − 1. On the other hand, the number of other k-simplices that are upper adjacent to σ is given by multiplying the up-degree n − k of σ, which gives the number of (k+1)-simplices that are upper adjacent to σ, by the number of k-simplices in each (k+1)-simplex, which is k + 2. But this overcounts σ by n − k, so the count is (n−k)(k+2) − (n−k) = (n−k)(k+1).
The sum of this and $\binom{n+1}{k+1}$ − (k+1)(n−k) − 1 is clearly $\binom{n+1}{k+1}$ − 1, so the only non-zero entries in ∆_k are the diagonal ones when k > 0.
When k = 0, Theorem 3.3.4 in [55] shows the diagonal entries of ∆_0 are the degrees of the vertices in our n-simplex, which are precisely n. It also implies all the off-diagonal entries are 0, since every pair of vertices is upper adjacent in an n-simplex. Thus in either the k = 0 or k > 0 case, ∆_k is proportional to the identity. We can then set D = r = 1 in (G10), and the number of valid paths summed over is precisely d_k. By reasoning analogous to the completely disconnected case,

$$\sum_{\Gamma\in S_\Gamma} W(\Gamma) = \sum_{\Gamma\in S_\Gamma} |W(\Gamma)|^2 = |S_\Gamma| = d_k,$$

with the corresponding expressions evaluated at λ = n + 1 when k > 0 and λ = n when k = 0.
Analogous arguments on needing a promise on the least upper bound on the degree of the vertices hold as in the case of the completely disconnected graph above.
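The adjacency counts used above are easy to verify directly for small cases; the sketch below (ours) enumerates the k-simplices of the full n-simplex.

```python
from itertools import combinations
from math import comb

def adjacency_counts(n, k):
    """For a fixed k-simplex sigma in the full n-simplex, count the other
    k-simplices sharing a (k-1)-face (lower adjacent). In the full simplex
    every (k+2)-subset of vertices is present, so any two k-simplices that
    share k vertices are also upper adjacent: the two counts coincide."""
    sigma = frozenset(range(k + 1))
    lower = sum(1 for tau in combinations(range(n + 1), k + 1)
                if frozenset(tau) != sigma and len(sigma & frozenset(tau)) == k)
    total_others = comb(n + 1, k + 1) - 1
    return lower, total_others - lower

n, k = 7, 2
lower, not_lower = adjacency_counts(n, k)
print(lower, not_lower)   # 15 40
```

The enumeration reproduces (k+1)(n−k) lower (equivalently upper) adjacent simplices and C(n+1, k+1) − (k+1)(n−k) − 1 non-adjacent ones.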
More can be said of κ in restricted cases. For instance, it is known that orientable simplicial complexes of dimension d ≤ 2 have κ ∈ O(n_k²), where n_k denotes the number of k-simplices and 0 ≤ k ≤ 2, and it is further conjectured that κ ∈ O(n_k^{2/d}) in most cases [14]. This latter bound is quite favourable for high-dimensional simplices, i.e. when d ∼ n and with the approximation that n_k ∼ 2^n. Unfortunately, it is difficult to give tight bounds on γ_M and σ² in (G38) for arbitrary simplicial complexes. Though the Perron-Frobenius theorem applied to stochastic matrices implies 0 < γ_M ≤ 1 [57], its scaling with the other parameters of relevance is unknown.
While we do not ultimately expect this classical TDA method to be efficient generically, the above discussion shows that there exist sufficient conditions under which it can be used to extract Betti numbers in polynomial time, without suffering from the generic exponential dependence on the number of data points characteristic of other classical TDA algorithms. Furthermore, it creates uncertainty about the necessary and sufficient conditions for an exponential advantage in quantum TDA, since the cases where this algorithm has an exponential advantage relative to the deterministic classical algorithm are analogous to those for the quantum algorithm. This means that identifying clear cases where exponential advantage is likely remains an important open problem in the domain of quantum algorithms for TDA.