Quantum Gram-Schmidt Processes and Their Application to Efficient State Read-out for Quantum Algorithms

Many quantum algorithms that claim a speed-up over their classical counterparts only generate quantum states as solutions instead of their final classical description. The additional step of decoding quantum states into classical vectors normally destroys the quantum advantage, because all existing tomographic methods require runtime that is polynomial with respect to the state dimension. In this work, we present an efficient read-out protocol that yields the classical vector form of the generated state, thereby achieving an end-to-end advantage for those quantum algorithms. Our protocol applies when the output state lies in the row space of the input matrix, of rank $r$, that is stored in the quantum random access memory. Decoding the state in $\ell^2$ norm with $\epsilon$ error requires $\poly(r,1/\epsilon)$ copies of the output state and $\poly(r, \kappa^r,1/\epsilon)$ queries to the input oracles, where $\kappa$ is the condition number of the input matrix. With our read-out protocol, we completely characterise the end-to-end resources for quantum linear equation solvers and quantum singular value decomposition. One of our technical tools is an efficient quantum algorithm for performing the Gram-Schmidt orthonormalization procedure, which we believe will be of independent interest.

Despite the claimed quantum speed-up, most quantum machine learning (QML) algorithms suffer from both the input and the read-out problems. Specifically, the input problem concerns efficient state preparation, namely, encoding classical data, potentially of very large size, into quantum states. A few techniques [9,16,18,19] have been proposed to address this problem, and among them, the quantum random access memory (QRAM) oracle model [18] has become, arguably, the most popular method in the domain of machine learning applications. It has led to quantum algorithms for tasks such as the linear system solver [9,20,21], the singular value decomposition [10], support-vector machines [12,22,23], supervised and unsupervised learning [13,15], neural networks [24,25], and other machine learning tasks [26][27][28]. Generally, for a data matrix $A \in \mathbb{R}^{m\times d}$, the corresponding QRAM oracle could be prepared by using $O(\mathrm{polylog}(md))$ quantum operations with $O(md)$ physical resources [18] stored in a binary tree data structure [29]. Although the QRAM oracle is criticized for requiring large physical resources, recent works [30,31] have shown that a practical implementation of the QRAM oracle is possible.
On the other hand, the read-out problem addresses the recovery of a classical description from the output quantum state that contains the classical solution. In order to preserve the quantum advantage of the underlying quantum algorithm, the output state needs to be decoded efficiently. For some quantum algorithms, such as the quantum recommendation system [27], the read-out issue is relatively mild because the classical solution can be obtained by only a few measurements on the output state. In general, however, most machine learning problems demand classical solutions in vector form, for example, solutions to linear systems. Hence, the read-out problem of these quantum algorithms could be critical. Protocols for efficiently decoding the output quantum states into classical vectors nevertheless remain little explored [32].
The task of recovering an unknown quantum state from measurements, also known as Quantum State Tomography (QST), is one of the fundamental problems in quantum information science. QST has attracted significant interest from both theoretical [33][34][35][36][37][38] and experimental [39][40][41][42][43][44][45] perspectives in recent years. The best general tomography method [36] could reconstruct a $d\times d$ density matrix $\rho$ for the unknown state with rank $r$ by using $n = O(rd\epsilon^{-2})$ copies of the state, which implies $O(d\epsilon^{-2})$ copy complexity for the pure state case $\rho = |v\rangle\langle v|$. We remark that most QML algorithms that output a $d$-dimensional state as the solution claim a time complexity polylogarithmic in $d$. Thus, directly using state tomography methods for state read-out in QML is computationally expensive and would offset the gained quantum speed-up. Since the required number $n$ is proven optimal for both cases [36], any further improvement on $n$ could be achieved only by assuming special prior knowledge on the state $\rho$. For example, QST via local measurements provides efficient estimation for states which can be determined by local reduced density matrices [38] or states with a low-rank tensor decomposition [37]. However, the output states generated by QML algorithms normally do not have these structures.
arXiv:2004.06421v2 [quant-ph] 30 May 2022
In contrast with the assumptions in the QST scenarios, the output states generated by most QML algorithms do have an inherent relationship between the solution vector and the input data, commonly represented as a matrix. Specifically, the solution vector normally lies in the row space of the input data matrix. Notable examples that satisfy this condition include: (1) the quantum SVD algorithm, which produces the singular values $\sigma_i$ and the corresponding singular vectors $|u_i\rangle$ and $|v_i\rangle$ of the matrix $A = \sum_i \sigma_i u_i v_i^T$; and (2) the quantum linear system solver for the linear system $Ax = b$, whose solution state $|x\rangle \propto A^{-1}b$ lies in the row space of $A$. Most machine learning problems can be reduced to these two categories [32]. Hence, finding efficient read-out protocols for them that go beyond the standard QST limit is highly desirable in the field of QML.
In this work, we design an efficient state read-out protocol for QML algorithms that involve a rank-$r$ input matrix $A \in \mathbb{R}^{m\times d}$ stored in the quantum random access memory (QRAM) and whose output state $|v\rangle$ lies in the row space of $A$. Instead of obtaining coefficients $\{v_i\}$ by measuring the state $|v\rangle = \sum_{i=1}^{n} v_i|i\rangle$ in the standard orthonormal basis $\{|i\rangle\}$, our key technical contribution is an efficient method to obtain the classical description $x_i$ in the complete basis spanned by the rows $\{A_{g(i)}\}_{i=1}^{r}$ of $A$, i.e. $|v\rangle = \sum_{i=1}^{r} x_i|A_{g(i)}\rangle$, where $g(i)$ denotes the indices of rows selected as the basis. Our state read-out protocol requires $O(\mathrm{poly}(r))$ copies of the output state and $\tilde O(\mathrm{poly}(r, \kappa^r))$ queries to input oracles, where $r$ is the rank of the input matrix and $\kappa = \sigma_{\max}(A)/\sigma_{\min}(A)$ is its condition number. We remark that the low-rank matrix assumption is common in machine learning models [46][47][48]. Compared to previous QST methods, which require at least $O(d\epsilon^{-2})$ copies of pure states, our protocol is much more efficient given $r \ll n$ and a small condition number, and more importantly, the complexity does not depend on the system dimension. Finally, combining our read-out protocol with the quantum SVD or the quantum linear system solver yields an end-to-end complexity of $\tilde O(\mathrm{poly}(r, \kappa^r, \log(md)))$ queries to input oracles.
As part of the read-out protocol, we develop a quantum generalization of the Gram-Schmidt orthonormalization process. Our quantum Gram-Schmidt Process (QGSP) algorithm can construct a complete basis, by sampling a set of rows $\{A_{g(i)}\}_{i=1}^{r}$ of the input $A$, with $O(\mathrm{poly}(r, \kappa^r))$ queries to QRAM oracles. Since vector orthonormalization is a crucial procedure in linear algebra as well as machine learning [49][50][51], an efficient quantum algorithm will be of independent interest. There are some related works on the construction of orthogonal states [52][53][54][55]. However, these results deviate from the standard Gram-Schmidt process and their applications are limited. Ref. [52] is only applicable to single-qubit systems, while Refs. [53,54] only generate a state that is orthogonal to the input state, and their complexity depends on the system dimension. Ref. [55] constructs orthogonal states from original states by lifting the dimension of the original Hilbert space, and cannot select a complete basis as the standard Gram-Schmidt process does. In contrast, our proposed QGSP algorithm avoids all these restrictions and can be proven to be efficient.
Specifically, we have the following result for QGSP.
Theorem 1 (Informal). By using $O(r^{27}\kappa^{14r})$ queries to QRAM oracles of the matrix $A$, we could find a group of linearly independent rows $\{A_{g(i)}\}_{i=1}^{r}$, where $r$ and $\kappa$ are the rank and the condition number of $A$, respectively.
Main Result. The main result for our state read-out protocol is as follows.
Theorem 2. For a $d$-dimensional state $|v\rangle$ that lies in the row space of a matrix $A \in \mathbb{R}^{m\times d}$ with rank $r$ and condition number $\kappa$, the classical form of $|v\rangle$ could be obtained by using $O(r^{4}\epsilon^{-2})$ queries to the state $|v\rangle$ and $O(r^{27}\kappa^{14r} + r^{18}\kappa^{8r}\epsilon^{-2})$ queries to QRAM oracles of $A$, such that the $\ell_2$ norm error is bounded by $\epsilon$.
Further discussion of the applications of our main result is deferred to Section III. We first formally define the framework of the state read-out protocol.

II. STATE READ-OUT FRAMEWORK
In this section, we explain our protocol in detail. Since $A \in \mathbb{R}^{m\times d}$ is of rank $r$, we can identify a set of $r$ linearly independent vectors $\{|A_{g(i)}\rangle\}_{i=1}^{r}$ selected from the rows of $A$ so that the output state can be rewritten as $|v\rangle = \sum_{i=1}^{r} x_i|A_{g(i)}\rangle$. Our goal is accomplished if we can determine $\{x_i\}_{i=1}^{r}$ efficiently. Accordingly, our algorithm consists of two major parts: a subroutine to sample a set of $r$ linearly independent rows $\{|A_{g(i)}\rangle\}_{i=1}^{r}$ from the rows of $A$, and a subroutine to calculate $\{x_i\}$. They are introduced in the following two subsections, respectively.
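The decomposition above can be checked classically on a toy instance. The following is a minimal NumPy sketch, not the quantum protocol: the helper name `row_basis_coordinates`, the toy sizes, and the choice of the first three rows as the basis are all illustrative assumptions.

```python
import numpy as np

def row_basis_coordinates(A, rows, v):
    """Solve v = sum_i x_i |A_g(i)> for x, with |A_g(i)> the normalized selected rows."""
    S = A[rows] / np.linalg.norm(A[rows], axis=1, keepdims=True)
    x, *_ = np.linalg.lstsq(S.T, v, rcond=None)
    return x, S

rng = np.random.default_rng(0)
r, m, d = 3, 8, 50
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, d))  # rank-r toy matrix
v = rng.standard_normal(m) @ A                                 # a vector in the row space of A
v /= np.linalg.norm(v)

x, S = row_basis_coordinates(A, [0, 1, 2], v)   # hypothetical basis choice g = (0, 1, 2)
residual = np.linalg.norm(S.T @ x - v)          # close to zero when v is in the row space
```

Only the $r$ numbers in `x` plus the indices of the selected rows are needed to describe the $d$-dimensional vector, which is the reason the read-out cost can be independent of $d$.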

A. Complete Basis Sampling
We begin with the first subroutine. The Quantum Gram-Schmidt Process (QGSP) in Algorithm 1 generates a complete row basis by performing a quantum version of adaptive sampling. The advantage of adaptive sampling is that rows with a larger component orthogonal to the row space of the previously sampled row submatrix are sampled with a larger probability. This ensures that the complete basis is nonsingular and improves the accuracy of the coefficient estimates in the second subroutine. Now we analyze the QGSP in detail. We utilize QRAM oracles $V_A$ and $U_A$ to encode the matrix $A$ in the amplitudes of quantum states: where $A_{ij}$, $A_i$, and $\|A\|_F$ denote the $(i, j)$-th element, the $i$-th row, and the Frobenius norm of $A$, respectively.
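For orientation, QRAM oracles of this kind are commonly written in the form below. This is a sketch under assumptions: the register layout and the exact normalization are inferred from the symbols $A_{ij}$, $A_i$ and $\|A\|_F$ named in the text, not copied from the authors' statement.

```latex
\begin{align*}
V_A &: |0\rangle^{\otimes \log m} \;\mapsto\; \frac{1}{\|A\|_F}\sum_{i=1}^{m} \|A_i\|\,|i\rangle, \\
U_A &: |i\rangle|0\rangle^{\otimes \log d} \;\mapsto\; |i\rangle|A_i\rangle
      = \frac{1}{\|A_i\|}\sum_{j=1}^{d} A_{ij}\,|i\rangle|j\rangle, \qquad \forall i \in [m].
\end{align*}
```

Under this form, applying $V_A$ and then $U_A$ would prepare $\frac{1}{\|A\|_F}\sum_i \|A_i\|\,|i\rangle|A_i\rangle$, so measuring the first register samples row $i$ with probability $\|A_i\|^2/\|A\|_F^2$.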
In the first iteration of the QGSP, an index $g(1)$ is sampled from the set $[m] := \{1, 2, \cdots, m\}$ with the probability $\Pr^{(1)}$. Let $|t_1\rangle := |A_{g(1)}\rangle$ be the first basis vector. The remaining basis vectors are generated inductively. Assume a set of orthogonal states $\{|t_i\rangle\}_{i=1}^{\ell-1}$ has been generated in the previous $\ell-1$ iterations. To proceed to the $\ell$-th iteration, we perform the quantum circuit illustrated in Fig. 1, which first creates the state with the help of input oracles $U_A$ and $V_A$. Then a Hadamard gate is applied to the third register, followed by a sequence of controlled $R_i$ gates, where the unitary $R_i = I - 2|t_i\rangle\langle t_i|$. Next, another Hadamard gate is applied to the third register, and the quantum state evolves into: After all unitary operations, we measure the third register and post-select on result 0 with the success probability: and the post-selected state (without the third register) is We need roughly $1/P_\ell$ copies of $|\phi_1^{(\ell)}\rangle$ to generate the state $|\phi_2^{(\ell)}\rangle$. Finally, we measure the first register for a new basis index $g(\ell)$ and a new orthogonal state $|t_\ell\rangle$: where $Z_\ell$ is the normalizing constant. Specifically, denote the probability of the outcome $g(\ell)$ being $j \in [m]$ by $\Pr^{(\ell)}(j)$, and let $S_I = \{g(i)\}_{i=1}^{\ell-1}$. We have where $\pi_{S_I}(A_j)$ denotes the projection of the row $A_j$ on the row space of the submatrix $A(S_I, \cdot) \in \mathbb{R}^{(\ell-1)\times n}$. In other words, the new index is sampled with probability proportional to the norm of the part of the row $A_{g(\ell)}$ that is orthogonal to the current basis set $S_I$. After $r$ iterations, we obtain an index set $\{g(i)\}_{i=1}^{r}$ whose rows form a linearly independent basis. We remark that the orthonormal states $\{|t_i\rangle\}_{i=1}^{r}$ are generated from $\{|A_{g(i)}\rangle\}_{i=1}^{r}$ by performing Gram-Schmidt orthogonalization. Thus, an orthonormal basis is also generated by Algorithm 1.
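The sampling rule can be mimicked classically to see how it avoids dependent rows. The sketch below is an illustration only: it assumes the measurement probability is proportional to the squared norm of the orthogonal residual (as the Born rule would give for residual amplitudes), and the function name `adaptive_row_sampling` is hypothetical.

```python
import numpy as np

def adaptive_row_sampling(A, r, rng):
    """Pick r rows; each pick has probability proportional to the squared norm
    of the row's component orthogonal to the span of the rows picked so far."""
    T = np.zeros((0, A.shape[1]))              # orthonormal vectors t_1, ..., t_{l-1}
    picked = []
    for _ in range(r):
        residual = A - (A @ T.T) @ T           # A_j - pi_S(A_j) for every row j
        weights = np.sum(residual ** 2, axis=1)
        j = int(rng.choice(A.shape[0], p=weights / weights.sum()))
        picked.append(j)
        T = np.vstack([T, residual[j] / np.linalg.norm(residual[j])])  # new t_l
    return picked, T

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 40))  # rank-3 toy matrix
picked, T = adaptive_row_sampling(A, 3, rng)
```

Rows already in the span have (numerically) zero weight, so a linearly dependent selection is essentially never drawn, and the rows of `T` are the Gram-Schmidt vectors of the picked rows.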
The technical difficulty of constructing the circuit in Fig. 1 comes from the efficient implementation of the controlled version of the reflection $R_\ell = I - 2|t_\ell\rangle\langle t_\ell|$, since we do not have additional quantum memory to store the states $\{|t_\ell\rangle\}$ generated during the algorithm. To overcome this problem, we note that the state $|t_\ell\rangle$ lies in $\mathrm{span}\{|A_{g(i)}\rangle\}_{i=1}^{\ell}$, so that $|t_\ell\rangle = \sum_{i=1}^{\ell} z_{i\ell}|A_{g(i)}\rangle$ for some coefficients $\{z_{i\ell}\}_{i=1}^{\ell}$. Hence we could generate $|t_\ell\rangle$ by the linear combination of unitaries (LCU) method [56] with post-selection. Let $C_\ell$ be the Gram matrix of $\{|A_{g(i)}\rangle\}_{i=1}^{\ell}$, and let $C_{\ell-1}$ be the submatrix of $C_\ell$ obtained by deleting the last row and column. The following lemma shows that the coefficient vector $z_\ell = (z_{1\ell}, \cdots, z_{\ell\ell})^T$ has a compact expression that only depends on the Gram matrices. The proof is provided in Appendix A.
We remark that each element of the matrix $C_\ell$, i.e., the inner product between quantum states in $\{|A_{g(i)}\rangle\}_{i=1}^{\ell}$, is unknown and needs to be estimated in practice. The error on the elements of $C_\ell$ influences the accuracy of the coefficients $z_\ell$, and consequently impacts the overall complexity of the state read-out protocol. Let $|\tilde t_\ell\rangle = \sum_{i=1}^{\ell} \tilde z_{i\ell}|A_{g(i)}\rangle$, where $\{\tilde z_{i\ell}\}$ are the coefficients calculated following Lemma 1 with the noisy Gram matrix $\tilde C_\ell$. Denote by $\sigma_{\min}(C_\ell)$ the least singular value of $C_\ell$. We have the following Lemma 2 to bound $\||t_\ell\rangle - |\tilde t_\ell\rangle\|$, whose proof is given in Appendix B.
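The dependence of the coefficients on the Gram matrix alone can be illustrated numerically. This is a classical sketch under stated assumptions: the closed form used, $z_\ell \propto C_\ell^{-1} e_\ell$ with the scale fixed by normalization, is derived here from the two requirements on $|t_\ell\rangle$ (unit norm, orthogonal to the earlier basis states) and agrees with the relation $z_\ell = Z_\ell C_\ell^{-1} e_\ell$ quoted in Appendix A; the function name is hypothetical.

```python
import numpy as np

def gram_schmidt_coefficients(S):
    """S: (l, d) array of unit-norm, linearly independent rows s_1, ..., s_l.
    Returns z with t_l = sum_i z[i] * s_i, where t_l has unit norm and is
    orthogonal to s_1, ..., s_{l-1}. Only the Gram matrix of S is used."""
    C = S @ S.T                               # Gram matrix C_l
    e_last = np.zeros(len(S))
    e_last[-1] = 1.0
    y = np.linalg.solve(C, e_last)            # C_l^{-1} e_l
    return y / np.sqrt(y[-1])                 # rescale so that z^T C_l z = 1

rng = np.random.default_rng(1)
S = rng.standard_normal((4, 10))
S /= np.linalg.norm(S, axis=1, keepdims=True)
z = gram_schmidt_coefficients(S)
t = z @ S                                     # the new Gram-Schmidt vector t_l
C = S @ S.T
```

In this sketch the last coefficient satisfies `z[-1]**2 == det(C_{l-1}) / det(C_l)`, which is the determinant ratio mentioned in the proof of Lemma 1.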

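Lemma 2. If each element in $\tilde C_\ell$ deviates from that in $C_\ell$ by at most $\epsilon_C \le \frac{\sigma_{\min}^2(C_\ell)}{80\,\ell^{5/2}}\epsilon_R$, then for any $\epsilon_R \in (0, 1)$, the $\ell_2$ norm of the error between $t_\ell$ and $\tilde t_\ell$ is bounded as
Lemmas 1 and 2 complete the preconditions for generating the state $|t_\ell\rangle$ through the LCU method. Then, given copies of $|t_\ell\rangle\langle t_\ell|$, we can implement the controlled version of the gate $R_\ell = I - 2|t_\ell\rangle\langle t_\ell| = e^{-i\pi|t_\ell\rangle\langle t_\ell|}$ with the help of the Hamiltonian simulation developed in Quantum PCA [57], as explained in Lemma 3.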
Lemma 3. Given Eq. (11) in Lemma 2, the state $|t_\ell\rangle$ could be prepared using
The proof is provided in Appendix C. As a natural corollary, the Gram-Schmidt orthonormal basis $\{|t_\ell\rangle\}_{\ell=1}^{r}$ could be provided using $O(r^2\sigma_{\min}^{-1/2}(C_r))$ queries to the oracle $U_A$.
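The identity $I - 2|t\rangle\langle t| = e^{-i\pi|t\rangle\langle t|}$, which lets the reflection be treated as Hamiltonian evolution under the projector for time $\pi$, can be verified numerically. The snippet below is a small NumPy check on a random real unit vector; the dimension and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
t = rng.standard_normal(6)
t /= np.linalg.norm(t)

P = np.outer(t, t)                             # projector |t><t|
R = np.eye(6) - 2 * P                          # reflection R = I - 2|t><t|

w, V = np.linalg.eigh(P)                       # P = V diag(w) V^T with w in {0, 1}
exp_P = (V * np.exp(-1j * np.pi * w)) @ V.T    # e^{-i pi P} via the eigendecomposition
gap = np.linalg.norm(exp_P - R)                # should vanish up to rounding
```

The check works because the projector has eigenvalues 0 and 1 only, so the exponential multiplies the $|t\rangle$ direction by $e^{-i\pi} = -1$ and leaves the orthogonal complement untouched.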
Notice that the complexity of implementing $C(R_\ell)$ depends on the least singular value of the Gram matrix $C_\ell$, which is largely affected by the choice of the sampled basis $\{|A_{g(i)}\rangle\}_{i=1}^{\ell}$. A too small $\sigma_{\min}(C_\ell)$ significantly increases the number of queries to the oracles. However, a basis with a small least singular value tends to be sampled with a smaller probability; e.g., the probability of sampling a linearly dependent basis is 0 by Eq. (10). Through further analysis, we prove that the expectation of $\sigma_{\min}(C_\ell)$ under the distribution formed by Eq. (9) is lower bounded as: This statement also holds approximately if we take into account the error of implementing each $R_i$ for $i \in [\ell-1]$, as provided in Lemma 4.
Lemma 4. Given that each gate $R_i$ in Algorithm 1 is implemented with error bounded by $\epsilon_R = \frac{1}{3r^5\kappa^{2r}}$, where $r$ and $\kappa$ are the rank and the condition number of $A$, respectively, we have where the distribution $\tilde P$ follows from Eq. (9) using noisy gates $\tilde R_i$.
The proof is technical and lengthy, so we defer it to Appendix D.
As a result, we could run Algorithm 1 several times to generate a basis with a bounded least singular value. The conclusion is summarized in Theorem 3, whose proof is given in Appendix E.
Theorem 3. By using $O(r^{27}\kappa^{14r})$ queries to input oracles $V_A$ (1) and $U_A$ (2), we could find a group of linearly independent states $\{|A_{g(i)}\rangle\}_{i=1}^{r}$, such that the least singular value of the Gram matrix $C_r$ formed by $\{|A_{g(i)}\rangle\}_{i=1}^{r}$ is greater than $\frac{1}{2r^2\kappa^{2r-2}}$, where $r$ and $\kappa$ are the rank and the condition number of $A$, respectively.
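The acceptance criterion of Theorem 3 is easy to evaluate classically for a candidate basis. The sketch below is illustrative only: the function name, the rank cut-off `1e-10`, and the toy input are assumptions, and the quantum algorithm estimates these quantities rather than computing them exactly.

```python
import numpy as np

def basis_is_well_conditioned(A, rows):
    """Check sigma_min(C_r) > 1 / (2 r^2 kappa^(2r-2)) for the Gram matrix C_r
    of the normalized selected rows; r and kappa are computed from A itself."""
    s = np.linalg.svd(A, compute_uv=False)
    s = s[s > 1e-10 * s[0]]                    # numerically nonzero singular values
    r, kappa = len(s), s[0] / s[-1]
    S = A[rows] / np.linalg.norm(A[rows], axis=1, keepdims=True)
    sigma_min = np.linalg.svd(S @ S.T, compute_uv=False)[-1]
    threshold = 1.0 / (2 * r ** 2 * kappa ** (2 * r - 2))
    return bool(sigma_min > threshold), sigma_min

A = np.eye(3, 5)                               # three orthonormal rows, kappa = 1
accepted, sigma_min = basis_is_well_conditioned(A, [0, 1, 2])
```

A rejected basis would simply trigger another run of the sampling procedure, which is the repetition the theorem accounts for.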

B. Coefficient Calculation
Next we focus on the second subroutine. Once the row basis has been selected, which we now denote as $\{s_i\}_{i=1}^{r}$ for simplicity, the read-out problem reduces to obtaining the coordinates $\{x_i\}_{i=1}^{r}$ such that $|v\rangle = \sum_{i=1}^{r} x_i|s_i\rangle$. The idea of Algorithm 2 is fairly natural. Since the QGSP algorithm generates orthonormal states $\{|t_i\rangle\}_{i=1}^{r}$, we could first calculate the coordinates of the state $|v\rangle$ under the basis $\{|t_i\rangle\}_{i=1}^{r}$: $|v\rangle = \sum_{i=1}^{r} a_i|t_i\rangle$, and then transfer the orthonormal basis to the row basis $\{s_i\}_{i=1}^{r}$: where $Z = [z_{ij}]_{r\times r}$ is the transformation matrix. The coordinates $\{x_i\}_{i=1}^{r}$ are given as $x = Za$. The crucial part of Algorithm 2 is to calculate the coefficients $a_i = \langle v|t_i\rangle$, $\forall i \in [r]$. However, overlap estimation techniques based on the Hadamard Test [58] could not be directly employed for estimating the state overlap, since they require the unitaries that generate the states. This requirement excludes most quantum algorithms that rely on post-selection to yield the solution state, e.g., the quantum linear system solver. Another choice is the SWAP test [59], which only requires copies of the states. However, directly using the quantum SWAP test could only yield an estimate of the value $|\langle v|t_i\rangle|^2$, while $\mathrm{sign}(a_i)$ remains unknown. To overcome this difficulty, we could assume that the state $|v\rangle$ has a positive overlap with one of the basis states, say $|t_k\rangle$, and take the value $a_i = \langle t_k|v\rangle\langle v|t_i\rangle / |\langle t_k|v\rangle|$ as the state overlap. This assumption is equivalent to multiplying $|v\rangle$ by a global phase factor of $e^{i0} = 1$ or $e^{i\pi} = -1$, and does not affect the extraction of the classical description.
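The change of basis and the sign convention can be traced through a classical toy computation. In this sketch the exact overlaps stand in for the estimates that the quantum circuits would provide, a QR factorization plays the role of the Gram-Schmidt process, and `Z` is stored so that column $j$ holds the coefficients of $t_j$ in the row basis (the convention under which $x = Za$); the function name `read_out` is hypothetical.

```python
import numpy as np

def read_out(S, v):
    """S: (r, d) normalized basis rows s_i; v: unit vector in their span.
    Returns x with sum_i x[i] * s_i equal to v up to a global sign."""
    Q, Rm = np.linalg.qr(S.T)                  # columns of Q act as t_1, ..., t_r
    Z = np.linalg.inv(Rm)                      # t_j = sum_i Z[i, j] * s_i
    overlaps_sq = (Q.T @ v) ** 2               # |<t_i|v>|^2, as a SWAP test would estimate
    k = int(np.argmax(overlaps_sq))            # reference index with the largest overlap
    products = (Q[:, k] @ v) * (Q.T @ v)       # <t_k|v><v|t_i> for every i
    a = products / np.sqrt(products[k])        # enforces the convention a_k > 0
    return Z @ a

rng = np.random.default_rng(3)
S = rng.standard_normal((3, 30))
S /= np.linalg.norm(S, axis=1, keepdims=True)
v = rng.standard_normal(3) @ S
v /= np.linalg.norm(v)

x = read_out(S, v)
err = min(np.linalg.norm(S.T @ x - v), np.linalg.norm(S.T @ x + v))
```

Choosing the reference index with the largest overlap keeps the division by $\sqrt{\langle t_k|v\rangle^2}$ well conditioned once the exact values are replaced by noisy estimates.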
We construct a variant of the SWAP Test, illustrated in Fig. 2, for estimating the product $\langle t_k|v\rangle\langle v|t_i\rangle$. It is easy to see that the probability of the measurement outcomes '00' and '11' yields this value: Similar to the SWAP Test, the proposed quantum circuit provides an $\epsilon$-error estimate of the value $\langle t_k|v\rangle\langle v|t_i\rangle$ with $\tilde O(\epsilon^{-2})$ measurements. Notice that a larger $|\langle t_k|v\rangle|$ is preferred in order to obtain more accurate estimates of $a_i$ in Eq. (15) from the estimates of $\langle t_k|v\rangle\langle v|t_i\rangle$ in Eq. (16). Thus, we set $k := \mathrm{argmax}_{i\in[r]} |\langle t_i|v\rangle|^2$ by using the SWAP Test, before estimating $\{a_i\}_{i=1}^{r}$ by running the circuit in Fig. 2.
The difficulty of implementing the quantum circuit in Fig. 2 is to efficiently prepare the state $(|t_k\rangle|0\rangle + |t_i\rangle|1\rangle)/\sqrt{2}$. We apply the linear combination of unitaries (LCU) method again, such that $(|t_k\rangle|0\rangle + |t_i\rangle|1\rangle)/\sqrt{2}$ could be prepared with query complexity $O(r\sigma_{\min}^{-1/2}(C_r))$. See Appendix F for details. By using this circuit along with the SWAP Test, we could approximately calculate the coordinates $\{x_i\}_{i=1}^{r}$. The error and time complexity of Algorithm 2 are provided in Theorem 4, with the proof given in Appendix F.
queries to input oracles. Thus, our state read-out protocol only requires $O(\mathrm{poly}(r)\epsilon^{-2})$ copies of the unknown quantum state. The required state copy complexity is independent of the dimension of the state, which makes our algorithm more efficient than previous QST methods [36] in the low-rank case, since the latter need at least $O(d\epsilon^{-2})$ copies. We remark that the combination of Theorem 3 and Theorem 4 yields the main result in Theorem 2.
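The $\epsilon^{-2}$ measurement scaling is the usual shot-noise behaviour and can be seen with the standard SWAP test. The sketch below simulates only the standard test, using the textbook outcome probability $P(0) = (1 + |\langle a|b\rangle|^2)/2$; it does not reproduce the variant circuit of Fig. 2, and the function name and shot count are illustrative.

```python
import numpy as np

def swap_test_estimate(a, b, shots, rng):
    """Estimate |<a|b>|^2 from simulated SWAP-test outcomes,
    using P(ancilla = 0) = (1 + |<a|b>|^2) / 2."""
    p0 = 0.5 * (1.0 + abs(np.vdot(a, b)) ** 2)
    zeros = rng.binomial(shots, p0)            # number of '0' outcomes
    return 2.0 * zeros / shots - 1.0

rng = np.random.default_rng(7)
a = rng.standard_normal(16)
a /= np.linalg.norm(a)
b = rng.standard_normal(16)
b /= np.linalg.norm(b)

exact = abs(np.vdot(a, b)) ** 2
estimate = swap_test_estimate(a, b, shots=200_000, rng=rng)
```

The standard deviation of the estimate is at most $1/\sqrt{\text{shots}}$, so reaching error $\epsilon$ takes on the order of $\epsilon^{-2}$ repetitions, and the estimate carries no sign information, which is why the variant circuit is needed.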

III. APPLICATIONS
As introduced above, our read-out protocol applies when the output state of the quantum algorithm lies in the row space of the input matrix. We remark that this assumption is naturally satisfied by many proposed quantum algorithms in the field of machine learning and linear algebra. In this section, we discuss the end-to-end versions of two existing quantum algorithms, the quantum singular value decomposition (SVD) algorithm and the quantum linear system solver, when our state read-out protocol is employed to generate classical solutions.

A. Quantum singular value decomposition
We begin with the quantum singular value decomposition protocol. For a given rank-$r$ matrix $A$ with singular value decomposition $A = \sum_{i} \sigma_i u_i v_i^T$, we have $v_i = A^T u_i/\sigma_i$, so any singular vector $v_i$ lies in the row space of $A$. Given QRAM oracles of the matrix $A$, quantum SVD allows one to perform the operation $\sum_j \beta_j|v_j\rangle \to \sum_j \beta_j|v_j\rangle|\bar\sigma_j\rangle$ with complexity $O(\mathrm{polylog}(md)\|A\|_F\epsilon^{-1})$ such that $\bar\sigma_j \in \sigma_j \pm \epsilon$ with high probability. Consider the state $|0\rangle|0\rangle$ as the input to the quantum SVD algorithm to generate the state Then the measurement on the eigenvalue register could collapse the state to different eigenstates $|u_i\rangle|v_i\rangle$ with probability Thus, any target state $|v_i\rangle$ could be prepared with complexity where $\Delta_\sigma$ is the eigen gap of the matrix $A$. Using this result along with Theorem 2, we could derive the end-to-end complexity for SVD as follows.
Corollary 1. The classical form of any eigenstate $|v_i\rangle$ of $A$ could be obtained by using $O\big(\kappa^{14r}\,\mathrm{poly}(r, \log(md))\,\frac{\|A\|_F}{\Delta_\sigma\epsilon^{2}}\big)$ queries to the input oracle of $A$, such that the $\ell_2$ norm error is bounded by $\epsilon$.
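That right singular vectors lie in the row space can be confirmed on a small example, since $A^T u_i = \sigma_i v_i$. The snippet below is a classical NumPy check on a random rank-3 matrix; the sizes and the rank cut-off are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((7, 3)) @ rng.standard_normal((3, 20))   # rank-3 toy matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
r = int(np.sum(s > 1e-10 * s[0]))              # numerical rank

# Each right singular vector is an explicit combination of the rows of A:
# v_i = A^T u_i / sigma_i, with combination weights u_i / sigma_i.
V_from_rows = (A.T @ U[:, :r]) / s[:r]
max_dev = np.max(np.abs(V_from_rows - Vt[:r].T))
```

The weights $u_i/\sigma_i$ grow as $\sigma_i$ shrinks, which gives some intuition for why the condition number enters the read-out cost.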
B. Quantum linear system solver
There has been increasing interest in quantum machine learning [12,13,60] and linear algebra [23,28] algorithms following the quantum linear system solver proposed by Harrow et al. [9]. The first quantum linear system solver was proposed for the sparse case using Hamiltonian simulation, and several other linear system solvers [20,61] have subsequently been proposed for the general case. Here we consider the quantum solver [20], which encodes the input matrix $A \in \mathbb{R}^{d\times d}$ into the QRAM model.
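For a linear system $Ax = b$ with $A = \sum_{i} \sigma_i u_i v_i^T$, the solution could be written as $x = A^{+}b = \sum_{i} \sigma_i^{-1}(u_i^T b)\, v_i$, which means $x$ also lies in the row space $\mathrm{span}\{A_i\}_{i=1}^{n}$ by using the previous conclusion about eigenvectors.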
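The same row-space property can be checked for the pseudo-inverse solution. The following classical sketch uses a random rank-deficient square matrix; the explicit `rcond` cut-off is an illustrative choice made so that numerically zero singular values are discarded.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((12, 3)) @ rng.standard_normal((3, 12))  # rank-3, 12 x 12
b = rng.standard_normal(12)

A_pinv = np.linalg.pinv(A, rcond=1e-10)        # Moore-Penrose pseudo-inverse A^+
x = A_pinv @ b                                 # solution x = A^+ b
row_projector = A_pinv @ A                     # orthogonal projector onto the row space
off_row_space = np.linalg.norm(x - row_projector @ x)
```

Since $A^{+}A A^{+} = A^{+}$, projecting $x$ onto the row space leaves it unchanged, so `off_row_space` vanishes up to rounding.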
For the linear system $Ax = b$, the solution state $|x\rangle = |A^{+}b\rangle$ could be prepared in time $O(\kappa^2\,\mathrm{polylog}(d)\,\|A\|_F\,\epsilon^{-1})$ with the $\ell_2$ norm error bounded by $\epsilon$, where $\kappa$ is the condition number of $A$. Then we could derive the end-to-end complexity for the quantum linear system solver as follows.
Corollary 2. The classical form of the solution state $|A^{+}b\rangle$ for the linear system $Ax = b$ could be obtained by using $O\big(\kappa^{14r}\,\mathrm{poly}(r, \log d)\,\frac{\|A\|_F}{\epsilon^{3}}\big)$ queries to input oracles of $A$, such that the $\ell_2$ norm error is bounded by $\epsilon$.

IV. CONCLUSION AND DISCUSSION
In this work, we developed an efficient state read-out framework for quantum algorithms that involve a low-rank input matrix and whose output state $|v\rangle$ lies in the row space of the input matrix. The proposed framework takes $\tilde O(\mathrm{poly}(r)\epsilon^{-2})$ copies of the output state and $\tilde O(\mathrm{poly}(r, \kappa^r)\epsilon^{-2})$ queries to input oracles to provide an error-bounded classical description. Thus, our protocol preserves the quantum speed-up at the state read-out step of these quantum algorithms when the rank $r$ and the condition number $\kappa$ are small relative to the system dimension $d$. We analyzed the feasibility of our framework for quantum algorithms including the quantum SVD and the QRAM-based linear system solver in the low-rank case.
Recently, several quantum-inspired classical algorithms [62][63][64][65] have been developed as challenges to quantum advantage on machine learning tasks. Since QRAM oracles are employed in this work, we would like to emphasize the difference between these classical algorithms and the proposed read-out protocol. Note that the state read-out is a "pure quantum" task which aims to generate the classical form of the unknown quantum state. However, the quantum-inspired algorithms are developed for solving certain linear algebra problems if certain data structure and query access are allowed.
Finally, we believe that the proposed results on decoding pure states could be extended to the mixed-state case. A quick outline of the procedure is as follows. We could first employ quantum PCA [57] to perform the eigendecomposition, and then decode the eigenstates using our protocol. Another future direction is to improve our read-out framework such that the complexity is polynomial in both the rank and the condition number.

Appendix A: Proof of Lemma 1
Proof. Denote $s_i := A_{g(i)}$ for simplicity of notation. Consider the state: where $Z_\ell$ has another formulation obtained by multiplying $\langle t_\ell|$ on both sides The restriction that $|t_\ell\rangle$ is normalized and is orthogonal to the states $|s_1\rangle, |s_2\rangle, \cdots, |s_{\ell-1}\rangle$ could yield: Rewrite Equations (A2) and (A3) in the vector form: Equation (A4) could be written as: where the third equation derives from $z_\ell = Z_\ell C_\ell^{-1} e_\ell$ by Equation (A5), and the last equation is derived by noticing that the $(\ell, \ell)$-th element of $C_\ell^{-1}$ is $|C_{\ell-1}|/|C_\ell|$. Thus, we obtain Finally, solving (A5) is trivial

Appendix B: Proof of Lemma 2
Proof. We denote by $\|\cdot\|$ the $\ell_2$ norm for vectors and the spectral norm for matrices. First notice that where $C_\ell$ in Eq. (B2) is the Gram matrix of $\{|A_{g(i)}\rangle\}_{i=1}^{\ell}$, and $\Delta z_\ell = \tilde z_\ell - z_\ell$ in Eq. (B3). Since $\|C_\ell\| \le \mathrm{Tr}[C_\ell] = \ell$, we can obtain the desired result; namely, if the following claim is true: To prove Eq. (B6), let us introduce some more notation. Denote by $\tilde C_\ell$ and $\tilde C_{\ell-1}$ the perturbed Gram matrices where Eq. (B12) follows from the triangle inequality.
Since each element in $\tilde C_\ell$ deviates from that in $C_\ell$ by at most $\epsilon_C \le \frac{\sigma_{\min}^2(C_\ell)}{80\,\ell^{5/2}}\epsilon_R$, we could obtain and Eq. (B15) follows from Weyl's inequality To finish the proof of Eq. (B6), we only need to bound If Eq. (B19) were true, we could further bound $\Delta z_\ell$ from Eq. (B18) as follows: where $\ell > 1$, $\epsilon_C \le \frac{\sigma_{\min}^2(C_\ell)}{80\,\ell^{5/2}}\epsilon_R$ and The last part of this section is to prove Eq. (B19). To further analyze this term, we utilize the bound on the determinant of the perturbed matrix [66, page 113]: We can obtain where the second inequality follows by noticing that the function $f(x) = \frac{x}{1-x}$ is monotonically increasing, together with the property that the range of singular values of the submatrix is contained in that of the original matrix: Consequently, we have the bound on the term $|\Delta Z_\ell|$: where Eq. (B28) is derived by employing the following equivalent form of Eqs. (B23) and (B24): Since $\max(A, B) \le A + B$ for $A, B \ge 0$, Eq. (B28) yields Eq. (B31) is derived by using Eqs. (B13), (B22) and $\|C_\ell^{-1}\| = \sigma_{\min}^{-1}(C_\ell)$. The last equation holds because which is obtained by using the bound on $\epsilon_C$ and

Appendix C: Proof of Lemma 3
Proof. The main idea is to first derive the error analysis of $|t_\ell\rangle$ and $R_\ell$, followed by the development of the LCU protocol. Denote $s_i := A_{g(i)}$ for simplicity of notation. We begin from the assumption that where $\tilde t_\ell = \sum_{i=1}^{\ell} \tilde z_{i\ell}\, s_i/\|s_i\|$. Then the $\ell_2$ norm of the error of the state $|t_\ell\rangle$ is bounded as follows.
Eqs. (C2-C6) are derived by using $\|t_\ell\| = 1$, the triangle inequality, and Eq. (C1). We could further provide the spectral norm of the error of the gate $R_\ell$: Eq. (C7) follows from the definition of $R_\ell$. Eq. (C9) is derived by using the triangle inequality. Eq. (C11) is derived by using Eq. (C6). Now we provide a framework to implement the operations $C(R_\ell)$ using coefficients $\{z_{j\ell}\}_{j=1}^{\ell}$. We could first prepare the pure state $\tilde\rho_\ell = |\tilde t_\ell\rangle\langle\tilde t_\ell|$ by the linear combination of unitaries method as follows. Firstly, initialize the state $|0\rangle^{\otimes \log m}|0\rangle^{\otimes \log n}|0\rangle$. Then, we apply Hadamard operations on the last $\log \ell$ qubits in the first register to create the state: Next, we employ the operation to swap states $|i\rangle$ and $|g(i)\rangle$, $\forall i \in [\ell]$, to yield the state: The unitary $U_{\mathrm{index}}$ could be implemented by $O(\ell)$ operations. Then we employ the oracle $U_A$ on the first and the second register, followed by the unitary $U_{\mathrm{index}}^{\dagger}$, to yield: $|i\rangle\langle i| \otimes I$ on the third register, conditioned on the first register $|i\rangle$, to obtain: Finally, we employ Hadamard operations on the last $\log \ell$ qubits in the first register, to obtain the state The measurement on the first and the third registers of the final state could yield the state $|\tilde t_\ell\rangle$ with success probability $\|\tilde t_\ell\|^2/(\ell^2\tilde z^2)$, so we could prepare the state $|\tilde t_\ell\rangle$ with $O(\ell\tilde z/\|\tilde t_\ell\|)$ queries to $U_A$ by using the amplitude amplification method [67]. Note that the operation $\tilde R_\ell = I - 2|\tilde t_\ell\rangle\langle\tilde t_\ell|$ can be viewed as the unitary with Hamiltonian $\tilde\rho_\ell = |\tilde t_\ell\rangle\langle\tilde t_\ell|$: Therefore, by using the Hamiltonian simulation method developed in Quantum PCA [57], the controlled version of $\tilde R_\ell$ could be performed with error $\epsilon_R/5$, consuming $O(5\pi^2/\epsilon_R) = O(1/\epsilon_R)$ copies of $\tilde\rho_\ell$. Taking the complexity of generating the state $|\tilde t_\ell\rangle$ into account, we could implement the operation $C(\tilde R_\ell)$ with the error bounded by $\epsilon_R/5$, by using $O(\ell\max_i|\tilde z_{i\ell}|/(\|\tilde t_\ell\|\epsilon_R))$ queries to $U_A$. We remark that the $\ell_2$ norm of the vector $\tilde z_\ell$ is bounded as which yields: So the query complexity for implementing $C(\tilde R_\ell)$ could be bounded as $O(\ell\sigma_{\min}^{-1/2}(C_\ell)\epsilon_R^{-1})$.
By considering the distance between $R_\ell$ and $\tilde R_\ell$ in Eq. (C11), we could then implement the controlled version of the gate $R_\ell$ with error bounded by $\epsilon_R$. This proves Lemma 3.
Appendix D: Proof of Lemma 4
In this section, we prove Lemma 4. Before detailing the main technical procedures, we first provide some useful theoretical bounds in Lemma 5 and Lemma 6.

Proof. Denote the singular value decomposition
Since the state $|t_i\rangle$ is a linear sum of the rows $\{A_j\}_{j=1}^{m}$, while each row is a linear sum of singular vectors: we can further write: Rewrite Eq. (6) as: where Eq. (D4) comes from Eq. (D1) and Eq. (D2). Expanding the square term in Eq. (D4) yields: are orthogonal with each other. We can add $w_\ell, \cdots, w_r$ such that $\{w_i\}_{i=1}^{r}$ forms an orthonormal basis in the $r$-dimensional space. Denote the matrix $W = (w_1, w_2, \cdots, w_r)$. Since $W^T W = I$, we have: Hence by using Eqs. (D7-D9) and $\|A\|_F^2 = \sum_{i=1}^{r}\sigma_i^2$, we could obtain the lower and upper bounds for $P_\ell$ as follows.
Lemma 6. Denote $P$ to be the distribution of the adaptive sampling following from Eq. (10): where $s \in [m]$ denotes the index of the row $s$ in the matrix $A \in \mathbb{R}^{m\times d}$. Then Proof. By the Cauchy-Schwarz inequality, we have: If the following inequality were true, then we could reach the conclusion of this lemma: To prove Eq. (D14), we first rewrite it as follows: In Eq. (D18), we rewrite $P(s_1, \cdots, s_\ell)$ with Eq. (D12) and where, in Eq. (D19), we denote Eq. (D20) is derived from $Z_\ell^2 = \big\||s_\ell\rangle - \sum_{i=1}^{\ell-1}|t_i\rangle\langle t_i|s_\ell\rangle\big\|^2$, and Eq. (D21) is due to Eq. (A6).
Continuing from Eq. (D18), it holds where Eq. (D23) uses with $C_\ell^{(i)} \in \mathbb{R}^{(\ell-1)\times(\ell-1)}$ being the principal submatrix of $C_\ell$ obtained by removing the $i$-th row and column, and Eq. (D24) follows by rearranging the order of summation.
Proof. The main idea is that, if the following statement holds true for any $0 \le j \le \ell-1$: then we could provide a lower bound on the expectation of $\sigma_{\min}(C_\ell)$ under the distribution $\tilde P$ inductively. Specifically, we could obtain . . . to obtain the last inequality. To prove Eq. (D44), we need a lower bound on the distribution $\widetilde{\Pr}^{(j+1)}$, which could be derived as follows.

Appendix E: Proof of Theorem 3
Proof. We sketch the main idea of the proof first. We could implement Algorithm 1 for N times to guarantee sampling out one basis which satisfies the conditions Let T QGSP be the query complexity of oracles U A and V A to implement Algorithm 1 once. Thus, the overall query complexity is To begin with, consider the first iteration of Algorithm 1. The Gram matrix of the sampled basis has the dimension 1 × 1 with one element 1. Thus, the condition cond (1) always holds. We proceed to the general cases inductively. Suppose that a basis with ( − 1) rows, which satisfies the condition cond ( −1) in Eq. (E1), has been obtained. Next, we move on to the -th iteration of Algorithm 1. We accept the newly sampled row as part of the basis, if the condition cond ( ) holds, and proceed to the + 1-th iteration. If the condition is violated, we stop the procedure and repeat Algorithm 1 from the first iteration. Thus, the conditions in Eq. (E1) would hold during the procedure, with the cost that Algorithm 1 needs to be run N number of times in order to guarantee one basis obtained with high probability. Now we analyze the complexity of the procedure in detail. Notice that T QGSP consists of three parts: the cost of oracles U A and V A for encoding all rows of the input matrix A, the cost of Hadamard Test for calculating coefficients {z } r−1 =1 , and the cost of implementing gates {C(R )} r−1 =1 . Based on Lemma 2 and Lemma 3, the latter two complexities depend on the error in the implementation of R . In the following proof, we provide explicit upper bounds of N and T QGSP , by setting to be the error bound of each element in C r .
Firstly we demonstrate that the sampling in each iteration of Algorithm 1 obeys the distribution in Eq. (13), i.e., the error of each gate C(R j ) is bounded as Based on Lemma 2, the error of t j induced by noisy coefficients is bounded by Then, based on Lemma 3, we could implement the gate C(R j ) with an error R by using queries to the oracle U A . Since the condition cond (j) (E1) holds, we obtain Eq. (E10). Eq. (E11) follows from the definition of R in Eq. (E4).
Finally, we analyze the query complexity $T_{\mathrm{QGSP}}$. Based on Lemma 1, the coefficients $\{z_\ell\}_{\ell=1}^{r-1}$ are calculated using the estimate of $C_r$. Denote by $T_C$ the query complexity of the oracle $U_A$ required to estimate each element in $C_r$ via the Hadamard Test. We have

where Eq. (E21) is derived by using Eq. (E3). Recall that in each iteration $\ell = 1, \cdots, r$ of Algorithm 1, we perform the operations $U_A, V_A, R_1, R_2, \cdots, R_{\ell-1}$ for $1/P$ times. Taking the complexity of estimating $C_r$ into account, we have

Eq. (E24) is obtained by using Eq. (E22) and Eq. (E11). Eq. (E26) is derived by using Eq. (D62). Since Algorithm 1 needs to be run $N \le O(r^2 \kappa^{2r-2})$ times, Theorem 3 follows.
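To make the cost of estimating Gram-matrix entries by a Hadamard Test more concrete, the following sketch classically simulates only the measurement statistics of such a test. It assumes real, unit-norm rows and the standard relation that the ancilla reads 0 with probability $(1 + \langle a|b\rangle)/2$; the shot count comes from a Hoeffding bound and is not the bound $T_C$ derived in this appendix. All function names are illustrative.

```python
import math
import random


def shots_for_accuracy(eps, delta):
    """Hoeffding bound: shots so the estimate is within eps with prob >= 1 - delta."""
    return math.ceil(2.0 * math.log(2.0 / delta) / eps ** 2)


def hadamard_test_estimate(a, b, shots, rng):
    """Estimate the real overlap <a|b> of two unit vectors from simulated shots."""
    overlap = sum(x * y for x, y in zip(a, b))
    p0 = 0.5 * (1.0 + overlap)  # probability that the ancilla reads 0
    zeros = sum(1 for _ in range(shots) if rng.random() < p0)
    return 2.0 * zeros / shots - 1.0


def estimate_gram(rows, eps, delta, seed=0):
    """Estimate every off-diagonal Gram entry to accuracy eps (per entry)."""
    rng = random.Random(seed)
    shots = shots_for_accuracy(eps, delta)
    r = len(rows)
    gram = [[1.0] * r for _ in range(r)]  # unit-norm rows: diagonal is 1
    for i in range(r):
        for j in range(i + 1, r):
            est = hadamard_test_estimate(rows[i], rows[j], shots, rng)
            gram[i][j] = gram[j][i] = est
    return gram
```

The number of shots grows as $1/\epsilon^2$ per entry, so the total sampling cost for an $r \times r$ Gram matrix in this toy model scales as $r^2/\epsilon^2$ up to the logarithmic confidence factor.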
Appendix F: Proof of Theorem 4

We will first demonstrate that the proposed quantum circuit in Fig. 2 is similar to the SWAP test, and provides an $\epsilon$-error estimate of the value $a_i = \langle t_k|v\rangle\langle v|t_i\rangle$, $\forall i \in [r]$, with $O(1/\epsilon^2)$ measurements.
Firstly, after all unitary operations, the state in Fig. 2 before the measurements is: $|0\rangle\big(|v\rangle|t_k\rangle + |v\rangle|t_i\rangle + |t_k\rangle|v\rangle + |t_i\rangle|v\rangle\big)|0\rangle$

Measuring the first and the last register could result in outcomes 00 and 11 with probability:

We remark that the statistics of outcomes 00 and 11 imply the value $a_i a_k$.
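Once estimates of the products $a_i a_k$ are available for a fixed index $k$, the individual amplitudes can be recovered classically up to a global sign. The sketch below assumes real amplitudes $a_i = \langle t_i|v\rangle$, an orthonormal basis $\{t_i\}$, and a pivot index $k$ with $a_k \neq 0$; how the pivot is chosen and the function names are assumptions made for illustration, not part of the protocol as stated here.

```python
import math


def amplitudes_from_products(products, k):
    """Recover a_i from products[i] = a_i * a_k, fixing the sign so that a_k > 0."""
    if products[k] <= 0.0:
        raise ValueError("pivot product a_k * a_k must be positive")
    a_k = math.sqrt(products[k])
    return [p / a_k for p in products]


def assemble_vector(amplitudes, basis):
    """Form the classical vector sum_i a_i * t_i from the basis vectors t_i."""
    dim = len(basis[0])
    return [sum(a * t[j] for a, t in zip(amplitudes, basis)) for j in range(dim)]
```

For instance, with `basis = [[1, 0, 0], [0, 1, 0]]`, true amplitudes `(0.6, -0.8)` and pivot `k = 1`, the products are `[-0.48, 0.64]` and the recovered vector equals the true one up to an overall sign, which is unobservable for a quantum state.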
The efficiency of the quantum circuit in Fig. 2 depends on the efficiency of preparing the input state $(|t_k\rangle|0\rangle + |t_i\rangle|1\rangle)/\sqrt{2}$. Lemma 7 below proves that it can be prepared with query complexity $O(r\,\sigma_{\min}^{-1/2}(C_r))$.

Lemma 7. Given perturbed coefficients provided in Lemma 2 for both indices $k$ and $\ell$, the state $\frac{1}{\sqrt{2}}(|0\rangle|t_k\rangle + |1\rangle|t_\ell\rangle)$

Then, we provide the error analysis. Specifically, given the coefficients $\tilde z_k$ and $\tilde z_\ell$, we prepare the state

$\frac{1}{\sqrt{\|t_\ell\|^2 + \|t_k\|^2}}\big(\|t_k\|\,|0\rangle|t_k\rangle + \|t_\ell\|\,|1\rangle|t_\ell\rangle\big)$ (F2)

by the LCU method as follows. Since the indices $k$ and $\ell$ are symmetric here, we assume $\ell \ge k$ for convenience. Firstly, we initialize the state $\frac{|0\rangle + |1\rangle}{\sqrt{2}}|0\rangle^{\otimes \log m}|0\rangle^{\otimes \log n}|0\rangle$. Then, we apply Hadamard operations on the last $\log \ell$ qubits in the second register to create the state:

Next, we employ the operation $U_{\mathrm{index}}$ defined in (C12) to create the state:

Then we employ the oracle $U_A$ on the first and the second register, followed by the unitary $U_{\mathrm{index}}^\dagger$, to yield:

Denote $\tilde z \equiv \max(\max_j |z_{j\ell}|, \max_j |z_{jk}|)$. Next, we perform the controlled rotation

Now we begin the proof of Theorem 4, which provides the error analysis of Algorithm 2 for reading out the state $|v\rangle$.
Proof. We first study the error in the read-out procedure and then provide the time analysis. Specifically, notice that the state $\frac{1}{\sqrt{2}}(|0\rangle|t_k\rangle + |1\rangle|t_i\rangle)$ generated by Lemma 7 is perturbed due to the noisy coefficients $z_k$ and $z_i$. Thus, the read-out error consists of two parts: the error in generating $\frac{1}{\sqrt{2}}(|0\rangle|t_k\rangle + |1\rangle|t_i\rangle)$, and the error induced by the statistical noise during the measurement in Fig. 2.

Firstly, we analyze the measurement distribution of Fig. 2 when it uses the perturbed input state $\frac{1}{\sqrt{2}}(|0\rangle|t_k\rangle + |1\rangle|t_i\rangle)$. Denote by $\tilde z_j$ and $\tilde t_j$ the perturbed forms of $z_j$ and $t_j$, respectively, $\forall j \in [r]$. In this proof, we assume the $\ell_2$ norm of the error of each $t_j$ is bounded by $\epsilon_3 = \frac{1}{14 r^{3/2}}$. The final state in Fig. 2 is:

where we denote

Measuring the first and the last register could result in outcomes 00 and 11 with probability:

Thus, the perturbed statistics of outcomes 00 and 11 are:

(F7)

where we denote $\tilde a_i = f_i \langle t_i|v\rangle$.
Next, we analyze the error induced by the statistical noise. Notice that each $\tilde a_i \tilde a_k$ in Eq. (F7) is estimated via the SWAP Test. We assume the statistical error of each value in $\{\tilde a_i \tilde a_k\}_{i=1}^{r}$ is bounded by $\epsilon_2 = \frac{1}{14 r^{3/2}}$, and denote $\widehat{(\tilde a_i \tilde a_k)}$ as the approximated value of $\tilde a_i \tilde a_k$. Then, in parallel to the exact form

we use the expression $\tilde v$ as the perturbed description of the vector $v$, where

Thus, the $\ell_2$ norm of the error on the vector description of the read-out state can be bounded as follows.
where Eq. (F11) is obtained by using Eqs. (F8-F9). Eq. (F12) follows from the triangle inequality. Eq. (F13) holds due to $\|\tilde t_i - t_i\| \le \epsilon_3$ and $t_i^T t_j = \delta_{ij}$. Eq. (F14) is derived by using the definition in Eq. (F10) and

Since the term $\epsilon_3 = \frac{1}{14 r^{3/2}}$ is provided, we notice that Eq. (F15) holds if the following statement is true for any $i \in [r]$:

So we only need to bound the terms $|\tilde a_i - a_i|$ and $|\hat{\tilde a}_i - \tilde a_i|$ to derive the upper bound on $\|\tilde v - v\|$. These bounds are obtained in Eqs. (F18-F23) and Eqs. (F24-F30), respectively, as follows.