Linear Regression by Quantum Amplitude Estimation and its Extension to Convex Optimization

Linear regression is a basic and widely-used methodology in data analysis. It is known that some quantum algorithms efficiently perform least squares linear regression on an exponentially large data set. However, if we want to obtain the values of the regression coefficients as classical data, the complexity of the existing quantum algorithms can be larger than that of the classical method. This is because it depends strongly on the error tolerance $\epsilon$: the best among the existing proposals is $O(\epsilon^{-2})$. In this paper, we propose a new quantum algorithm for linear regression, which has complexity $O(\epsilon^{-1})$ and keeps the logarithmic dependence on the number of data points $N_D$. In this method, we overcome the bottleneck parts of the calculation, which take the form of sums over data points and therefore have complexity proportional to $N_D$, using quantum amplitude estimation, and perform the other parts classically. Additionally, we generalize our method to a class of convex optimization problems.


I. INTRODUCTION
Following the rapid advance of quantum computing technology, many quantum algorithms have been proposed and their applications to a wide range of practical problems have been studied in recent research. One prominent example is linear regression. Linear regression, which is based on the least squares method in many cases, is a basic and ubiquitous methodology in many fields of natural and social science. There are some quantum algorithms for linear regression [1-6] with complexity depending on the number of data points $N_D$ as $O(\mathrm{polylog}(N_D))$. This represents an exponential speedup compared with the naive classical method explained in Sec. II B, whose complexity is proportional to $N_D$.
However, despite this, the existing quantum methods are not necessarily more beneficial than classical ones when we want to obtain the values of the optimized regression coefficients as classical data and $N_D$ is mid-sized, say $O(10^3-10^5)$. Roughly speaking, in the existing methods such as the first proposed one [1], which is based on the Harrow-Hassidim-Lloyd (HHL) algorithm [10] for solving systems of linear equations, one creates quantum states in which the values of the coefficients are encoded and reads them out in classical form. Therefore, it is inevitable that the estimated coefficients are accompanied by errors, and high-accuracy estimation leads to large complexity. As far as we know, the existing method with the best complexity with respect to the error tolerance $\epsilon$ is that in [3]: the complexity of estimating the coefficients with additive error at most $\epsilon$ is given by (1), where $d$ is the number of the coefficients (or the explanatory variables) and $\kappa$ is the condition number of the design matrix (defined in Sec. II B). On the other hand, a naive classical method has complexity $O(d^2 N_D)$. Therefore, assuming the prefactors in these expressions of the complexities are comparable, under an ordinary situation $\epsilon \sim 10^{-3}$, $d \sim 10$, $\kappa \sim 1$, the minimum $N_D$ for which the quantum method is advantageous over the classical one is $N_D \sim 10^6$. Although in some problems $N_D$ is of such an order or much larger, cases where $N_D \sim 10^3$ or $10^4$ are also ubiquitous, and in such cases the existing quantum methods are inferior to classical ones. Such mid-sized regression problems are often time-consuming and need to be sped up, although we might naively think that they can be solved by classical computers in a short time. For example, if such regressions are repeated in the whole of the calculation flow, the total computational time can be long. One such case is the least squares Monte Carlo (LSM) [11]. LSM is a methodology used in pricing financial derivatives² with an early exercise option, that is, the contract term stating that either of the two parties in the contract can terminate it before the final maturity.³ In Monte Carlo pricing, we generate many (say, $O(10^4-10^5)$) paths of the random time evolution of the underlying assets, and we estimate the price as the expected cash flow. In the case of early-exercisable products, we have to determine the optimal exercise time for each path. In LSM, we approximate the continuation value of the derivative at each exercise date by linear regression, using certain functions of the underlying asset prices as explanatory variables, whose number is typically $d \sim 10$. Since regression is done many times in pricing one contract and banks have numerous contracts, such regressions have huge complexity in total and are meaningful targets of quantum speedup, even though each of them is mid-sized.
² Financial derivatives, or simply derivatives, are financial contracts in which two parties exchange cash flows whose amounts are determined by the prices of some assets. For textbooks on derivatives and their pricing, we refer to [12, 13].
³ For the details of LSM, see [11].

Based on the above motivation, in this paper we present a new quantum algorithm for linear regression, focusing on reducing the order of the inverse of $\epsilon$ in the expression of the complexity. In our method, unlike the existing methods, we do not perform all of the calculations on a quantum computer. Instead, we use a quantum computer only to perform a bottleneck part of the naive classical method. That is, we perform some intermediate calculations, which take the form of sums over $N_D$ terms and therefore naively have complexity proportional to $N_D$, by a quantum algorithm. Then, we classically solve the $d$-dimensional system of linear equations whose coefficients and constant terms are the outputs of the preceding quantum computation, and we obtain the regression coefficients. A classical computer can perform this step in negligible time for $d \sim 10$, without introducing any error. Also note that we naturally obtain the regression coefficients as classical data.
To speed up the bottleneck sums, we use quantum amplitude estimation (QAE) [14-20], which is also used to speed up Monte Carlo integration [15, 21]. Then, as discussed in detail in Sec. III, we can perform the sums with complexity scaling as $O(\epsilon^{-1})$ up to factors in $d$ and $\kappa$, and the total complexity of the whole of our method is almost equal to this. Compared with the naive classical method, whose complexity is $O(d^2 N_D)$, our method accomplishes a speedup with respect to $N_D$. In addition, compared with the existing quantum methods, our method improves the dependency of the complexity on $\epsilon$ from $\epsilon^{-2}$ in (1) to $\epsilon^{-1}$. Unfortunately, in our method, the dependencies on $d$ and $\kappa$ are worse than those of (1) and of the naive classical method. Therefore, our method is more suitable for situations where $d$ and $\kappa$ are small and high accuracy is required, which includes typical mid-sized problems such as LSM.
In addition to linear regression, we generalize our method to a class of convex optimization problems. As we see in Sec. IV, linear regression can be considered as an optimization problem for a sum of quadratic functions, solved by Newton's method. Inspired by this, we consider a convex optimization problem whose objective function is written as a sum of many terms, similarly to linear regression, and we present a quantum algorithm for Newton's method in which the calculation of the gradient and the Hessian is sped up by QAE. We also estimate the complexity of obtaining a solution with a desired error level, under some mathematical assumptions which are usually made in the convergence analysis of Newton's method.
The remaining parts of this paper are organized as follows. Sec. II is a preliminary section. In Sec. II A, we present some notations and definitions of quantities necessary for the later discussion. In Sec. II B, we briefly review linear regression and its classical method. In Sec. III, we explain our method in detail. In Sec. IV, we discuss the extension of our method to some convex optimization problems. Sec. V summarizes the paper.

A. Notations and Definitions
In this subsection, we present some notations and definitions concerning linear algebra for later use. For details, refer to textbooks on linear algebra, e.g., [22].
For $\vec{v} = (v_1, ..., v_n)^T \in \mathbb{R}^n$, we define the Euclidean norm $\|\vec{v}\| := \sqrt{\sum_{i=1}^n v_i^2}$ and the max norm $\|\vec{v}\|_\infty := \max_{1 \le i \le n} |v_i|$. We also define some types of norms for a matrix $A = (a_{ij})_{1\le i\le m, 1\le j\le n} \in \mathbb{R}^{m\times n}$. The first one is the spectral norm, which is defined as $\|A\| := \max_{\vec{v}\ne \vec 0} \|A\vec{v}\| / \|\vec{v}\|$ and is equal to $\sigma_{\max}(A)$, the maximum singular value of $A$. Secondly, we define the Frobenius norm as $\|A\|_F := \sqrt{\sum_{i,j} a_{ij}^2}$, which satisfies $\|A\| \le \|A\|_F$. Furthermore, we introduce the condition number, which measures the sensitivity of a linear operator to errors in the input. For a full-rank matrix $A \in \mathbb{R}^{m\times n}$, the condition number is defined as $\kappa(A) := \sigma_{\max}(A)/\sigma_{\min}(A)$, that is, the ratio of the maximum singular value $\sigma_{\max}$ of $A$ to the minimum one $\sigma_{\min}$. If $A$ is invertible, $\kappa(A) = \|A\|\,\|A^{-1}\|$.

B. Linear regression
We here define the problem of linear regression. Assume that we have $N_D$ data points $D := \{(\vec{x}_i, y_i)\}_{i=1,...,N_D}$, each of which consists of a vector of $d$ explanatory variables $\vec{x}_i = (x_i^{(1)}, ..., x_i^{(d)})^T \in \mathbb{R}^d$ and an objective variable $y_i \in \mathbb{R}$. Linear regression attempts to fit a linear combination of the explanatory variables to the objective variable, that is, to find $\vec{a}$ such that

$\vec{y} \simeq X\vec{a}$.   (8)

Here, $a_1, ..., a_d \in \mathbb{R}$ are model parameters called regression coefficients and $\vec{a} := (a_1, ..., a_d)^T \in \mathbb{R}^d$. Besides, $\vec{y} := (y_1, ..., y_{N_D})^T$ and

$X := (\vec{x}_1, ..., \vec{x}_{N_D})^T \in \mathbb{R}^{N_D \times d}$   (9)

is called the design matrix. In this paper, as in many cases, we determine $\vec{a}$ by the least squares method, that is,

$\vec{a} = \mathrm{argmin}_{\vec{a}' \in \mathbb{R}^d} \|\vec{y} - X\vec{a}'\|^2$.   (10)

As is well known (again, see [22] for example), the solution of (10) is given by

$\vec{a} = (X^T X)^{-1} X^T \vec{y}$,   (11)

where we assume that $X$ is full-rank, as stated again in Sec. III B as Assumption 2.
Here, let us introduce some symbols. We define the $d \times d$ matrix

$W := \frac{1}{N_D} X^T X$,   (12)

whose $(i, j)$-element is

$w_{ij} = \frac{1}{N_D} \sum_{k=1}^{N_D} x_k^{(i)} x_k^{(j)}$.   (13)

$W$ is invertible since we assumed that $X$ is full-rank. We also define the $d$-dimensional vector

$\vec{z} := \frac{1}{N_D} X^T \vec{y}$,   (14)

whose $i$-th element is

$z_i = \frac{1}{N_D} \sum_{k=1}^{N_D} x_k^{(i)} y_k$.   (15)

In (12) and (14), the prefactor $1/N_D$ is just for later convenience. Using these, (11) becomes

$\vec{a} = W^{-1} \vec{z}$.   (16)

We here call the following classical computation the naive classical method: calculate the $w_{ij}$'s and $z_i$'s by simply repeating multiplications and additions $N_D$ times, literally following the definitions (13) and (15), and then solve the $d$-dimensional system of linear equations (16) to find $\vec{a}$. Let us discuss the complexity of this method. Since we are considering the situation $d \ll N_D$, we focus only on the dominant contributions with respect to $N_D$. Calculating one $w_{ij}$ or $z_i$ has complexity $O(N_D)$. Since the numbers of $w_{ij}$'s and $z_i$'s are $O(d^2)$ and $O(d)$, respectively, and their total is $O(d^2)$, the total complexity of calculating all of the $w_{ij}$'s and $z_i$'s is $O(d^2 N_D)$. Then, solving (16) takes negligible complexity, since it does not depend on $N_D$; for example, even with an elementary method such as row reduction, the complexity is $O(d^3)$. In summary, the complexity of the naive classical method is $O(d^2 N_D)$, which dominantly comes from the calculation of the $w_{ij}$'s and $z_i$'s.
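The naive classical method above can be sketched in a few lines of Python; all data here are synthetic and purely for illustration:

```python
import numpy as np

# Toy data: N_D = 1000 points, d = 3 explanatory variables in [0, 1].
rng = np.random.default_rng(0)
N_D, d = 1000, 3
X = rng.uniform(0.0, 1.0, size=(N_D, d))            # design matrix
a_true = np.array([0.5, -0.2, 0.3])
y = X @ a_true + 0.01 * rng.standard_normal(N_D)    # objective variable

# Step 1: O(d^2 N_D) accumulation of W = X^T X / N_D and z = X^T y / N_D.
W = (X.T @ X) / N_D
z = (X.T @ y) / N_D

# Step 2: solve the d-dimensional system W a = z (O(d^3), no N_D dependence).
a = np.linalg.solve(W, z)
print(np.round(a, 2))
```

The $O(d^2 N_D)$ cost sits entirely in step 1; step 2 touches only $d \times d$ quantities.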

III. LINEAR REGRESSION BY QUANTUM AMPLITUDE ESTIMATION
A. Quantum Amplitude Estimation

In this section, we present our method for linear regression. Before this, let us briefly review the outline of QAE.
QAE is an algorithm to estimate the probability amplitude of a marked state in a superposition. Consider a system consisting of some qubits. We set the system to the initial state in which all qubits are set to $|0\rangle$, and write such a state as $|\vec{0}\rangle$ for simplicity. Then, we assume that there exists a unitary transformation $A$ on the system such that

$A|\vec{0}\rangle = \sqrt{a}\,|\Psi\rangle + \sqrt{1-a}\,|\Psi^\perp\rangle$,   (17)

where $|\Psi\rangle$ is the 'marked state', $|\Psi^\perp\rangle$ is a state orthogonal to $|\Psi\rangle$, and $0 < a < 1$. Typically, $|\Psi\rangle$ and $|\Psi^\perp\rangle$ are states in which a specific qubit takes $|1\rangle$ and $|0\rangle$, respectively. In addition to $A$, we use the following unitary operators $S_0$ and $S_\Psi$, which are defined as

$S_0 := I - 2|\vec{0}\rangle\langle\vec{0}|$   (18)

and

$S_\Psi := I - 2|\Psi\rangle\langle\Psi|$.   (19)

$S_0$ can be constructed using a multi-controlled Toffoli gate, and $S_\Psi$ is simply a controlled-Z gate if $|\Psi\rangle$ is defined as a state in which a specific qubit is $|1\rangle$. Then, defining

$Q := A S_0 A^{-1} S_\Psi$,   (20)

we can construct a quantum algorithm (see [14] for details) which makes a number of uses of $Q$ given by

$O\!\left(\frac{1}{\epsilon}\right)$   (21)

(and therefore $O(1/\epsilon)$ uses of $A$) and outputs an estimate of $a$ with an $\epsilon$-additive error.
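The rotation picture behind QAE can be illustrated classically: in the two-dimensional subspace spanned by $|\Psi\rangle$ and $|\Psi^\perp\rangle$, $Q$ acts as a rotation by $2\theta$ with $a = \sin^2\theta$, so after $j$ applications of $Q$ the marked state is measured with probability $\sin^2((2j+1)\theta)$. The sketch below simulates such measurements and recovers $a$ by maximum likelihood, in the spirit of the maximum-likelihood variant of QAE rather than the canonical phase-estimation algorithm of [14]; the schedule, shot count, and grid are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
a_true = 0.3
theta = np.arcsin(np.sqrt(a_true))          # a = sin^2(theta)

# After j Grover iterations, P(marked) = sin^2((2j+1)*theta).
schedule = [0, 1, 2, 4, 8, 16, 32]          # powers of Q used
shots = 100
hits = [rng.binomial(shots, np.sin((2 * j + 1) * theta) ** 2)
        for j in schedule]

# Maximum-likelihood estimate of a over a fine grid.
grid = np.linspace(1e-4, 1 - 1e-4, 20000)
th = np.arcsin(np.sqrt(grid))
loglik = np.zeros_like(grid)
for j, h in zip(schedule, hits):
    p = np.clip(np.sin((2 * j + 1) * th) ** 2, 1e-12, 1 - 1e-12)
    loglik += h * np.log(p) + (shots - h) * np.log(1 - p)
a_hat = float(grid[np.argmax(loglik)])
print(round(a_hat, 3))
```

The larger Grover powers sharpen the likelihood, which is the source of the $O(1/\epsilon)$ scaling.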
We make a comment here on the success probability. In the QAE algorithm of [14], the success probability, that is, the probability that the algorithm outputs an estimate with the desired additive error, is not 1 but lower-bounded by $8/\pi^2$. However, we can enhance the success probability to an arbitrary level $1-\gamma$, where $\gamma \in (0, 1)$, by repeating QAE $O(\log(\gamma^{-1}))$ times. That is, taking the median of the results of the $O(\log(\gamma^{-1}))$ runs of QAE, we obtain an estimate with additive error $\epsilon$ with probability at least $1-\gamma$ [21, 23]. Considering this point, we can write the number of calls to $A$ in repeating QAE with an $\epsilon$-additive error and a success probability larger than $1-\gamma$ as

$O\!\left(\frac{\log(\gamma^{-1})}{\epsilon}\right)$.   (22)

If we set $1-\gamma$ to some fixed value, say 99%, (22) reduces to (21).
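The median trick can be illustrated with a mock QAE run, a toy stand-in (not a quantum simulation) that succeeds with probability $8/\pi^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
a_true, eps = 0.3, 0.01

def one_qae_run():
    # A single QAE run: eps-accurate with probability >= 8/pi^2,
    # otherwise an arbitrary value in [0, 1] (worst case).
    if rng.random() < 8 / np.pi**2:
        return a_true + rng.uniform(-eps, eps)
    return rng.uniform(0.0, 1.0)

# Median of O(log(1/gamma)) runs: accurate unless more than half of
# the runs fail to the same side, which is exponentially unlikely.
runs = 25
estimate = float(np.median([one_qae_run() for _ in range(runs)]))
print(abs(estimate - a_true) <= eps)
```

With 25 runs, the failure probability of the median is already far below 1%.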

B. Assumptions
Next, we present some assumptions which are necessary for our method. The first one is as follows.

Assumption 1. We have access to unitary oracles $P_x$ and $P_y$ which act as

$P_x |i\rangle |k\rangle |0\rangle = |i\rangle |k\rangle |x_k^{(i)}\rangle$   (23)

and

$P_y |k\rangle |0\rangle = |k\rangle |y_k\rangle$.   (24)

Here and hereafter, for a number $x$, the ket $|x\rangle$ corresponds to a computational basis state on a quantum register whose bit string is the binary representation of $x$. Equations (23) and (24) mean that $P_x$ and $P_y$ output the elements of $X$ and $\vec{y}$, respectively, for the specified indices. Previous papers [1-6, 9, 24] also assume such oracles. We can construct $P_x$ and $P_y$ if quantum random access memories (QRAMs) [25] are available.
The second assumption is a restatement of what we have already assumed in Sec. II B.

Assumption 2. The design matrix $X$ is full-rank.

Because of this, $\kappa(X)$, the condition number of $X$, can be defined, and $W$ defined as (13) is invertible. Hereafter, we simply write $\kappa(X)$ as $\kappa$.
The third assumption is as follows.

Assumption 3. For every $i \in \{1, ..., d\}$ and $k \in \{1, ..., N_D\}$, $0 \le x_k^{(i)} \le 1$ and $0 \le y_k \le 1$.

That is, we assume that the explanatory variables and the objective variable are bounded by 0 and 1. Besides, we make the fourth assumption as follows.

Assumption 4. There is a positive number $c$ (say, $\frac{1}{2}$), which is independent of $N_D$, $\epsilon$, $\kappa$ and $d$, such that $w_{ii} > c$ for every $i \in \{1, ..., d\}$.

Since $w_{ii} \le 1$ is immediately derived from Assumption 3, Assumption 4 means that $c < w_{ii} \le 1$.
Although Assumptions 3 and 4 may seem too strong, they should be assumed for successful regression, either classical or quantum, for the following reasons. First, we note that Assumption 3 is satisfied if we know bounds for $x_k^{(i)}$ and $y_k$ in advance and can rescale them. That is, for $i \in \{1, ..., d\}$, although in general the $x_k^{(i)}$'s are not in $[0, 1]$, we can redefine them as

$x_k^{(i)} \to \frac{x_k^{(i)} - L_i}{U_i - L_i}$,

with $L_i, U_i$ such that $L_i \le x_k^{(i)} \le U_i$ for all $k \in \{1, ..., N_D\}$, which leads to $0 \le x_k^{(i)} \le 1$. Similarly, redefining $y_k \to (y_k - L_y)/(U_y - L_y)$ with $L_y, U_y$ such that

$\forall k \in \{1, ..., N_D\},\ L_y \le y_k \le U_y$   (30)

leads to $0 \le y_k \le 1$. Assumption 3 is then satisfied. Besides, if we know bounds which are not too far from the typical scale of the original $x_k^{(i)}$'s for most $k$'s, Assumption 4 is satisfied. In summary, Assumptions 3 and 4 are naturally satisfied if we know the typical scales of the $x_k^{(i)}$'s and $y_k$'s. Practically, we must know the typical scales, since we have to address the problem of outliers. Data sets often contain points whose explanatory and/or objective variables are much larger than those of the others, for various reasons (for example, misrecording). Such points are called outliers. It is widely known that outliers lead to inaccurate regression, and so we have to address them. Typically, we omit them from the data points used for regression, or replace the values of the explanatory and/or objective variables out of some range with the upper or lower bound of the range. For such preprocessing, we have to know the typical scales of the variables. In fact, the previous paper [3] makes assumptions similar to Assumptions 3 and 4: mentioning the necessity of preprocessing outliers, it assumes that the design matrix and the objective variable vector do not contain extraordinarily large elements.
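The clipping-and-rescaling preprocessing described above might look as follows; the bounds `L` and `U` are hypothetical, standing in for domain knowledge about the typical scale:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(5.0, 1.0, size=1000)     # a raw explanatory variable
x[::100] = 1e6                          # inject outliers (e.g. misrecords)

# Known typical range of the variable (illustrative assumption).
L, U = 2.0, 8.0
x_clipped = np.clip(x, L, U)            # replace out-of-range values
x_scaled = (x_clipped - L) / (U - L)    # rescale into [0, 1]
print(x_scaled.min() >= 0.0 and x_scaled.max() <= 1.0)
```

After this step the variable satisfies Assumption 3, and, since its typical values are not pushed toward 0, the diagonal averages $w_{ii}$ stay bounded away from 0 as Assumption 4 requires.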

C. Details of our method
We now explain our method in detail. First, we present a lemma on the error in the solution of a system of linear equations whose coefficients and constant terms contain errors.

Lemma 1. Let Assumptions 2 to 4 be satisfied. For given symmetric $\hat{W} \in \mathbb{R}^{d\times d}$ and $\hat{\vec{z}} \in \mathbb{R}^d$, consider the system of linear equations

$\hat{W} \hat{\vec{a}} = \hat{\vec{z}}$,   (31)

where $\hat{\vec{a}} \in \mathbb{R}^d$. For a given $\epsilon > 0$, if each element of $\delta W := \hat{W} - W$ and $\delta \vec{z} := \hat{\vec{z}} - \vec{z}$ has an absolute value smaller than an $\epsilon'$ satisfying condition (32), then $\hat{\vec{a}}$ is uniquely determined by solving Eq. (31), i.e., $\hat{\vec{a}} = \hat{W}^{-1} \hat{\vec{z}}$, and it is an $O(\epsilon)$-additive approximation of $\vec{a}$, that is,

$\|\hat{\vec{a}} - \vec{a}\|_\infty = O(\epsilon)$.   (33)

The proof is given in Appendix A 1. This lemma means that, if we want the solution of the system of linear equations with an $O(\epsilon)$-additive error, it is sufficient to calculate the coefficients and constant terms with an additive error $\epsilon'$ satisfying (32).
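The content of Lemma 1 can be checked numerically: elementwise perturbations of $W$ and $\vec{z}$ of size $\epsilon'$ leave the solution of the system within a small additive error. The tolerance below, $\epsilon' \sim \epsilon/(d\kappa^2)$, is a heuristic stand-in for condition (32), not the paper's exact bound:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 5
X = rng.uniform(0, 1, size=(200, d))
W = X.T @ X / 200
z = X.T @ rng.uniform(0, 1, 200) / 200
a = np.linalg.solve(W, z)               # exact solution of W a = z

kappa = np.linalg.cond(X)               # condition number of X
eps = 1e-3
eps_p = eps / (d * kappa**2)            # heuristic per-element tolerance

dW = rng.uniform(-eps_p, eps_p, size=(d, d))
dW = (dW + dW.T) / 2                    # keep the perturbed matrix symmetric
dz = rng.uniform(-eps_p, eps_p, size=d)
a_hat = np.linalg.solve(W + dW, z + dz)
print(round(float(np.max(np.abs(a_hat - a))), 6))
```

The printed maximum deviation stays well below $\epsilon$ for this instance.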
We therefore propose the following method for linear regression: we first estimate $W$ in (13) and $\vec{z}$ in (15) by a quantum method with $\epsilon'$-additive errors, and then solve (16) by some classical method. Since classical methods basically introduce no additional error, we obtain a solution with an $O(\epsilon)$-additive error.
Then, we state a theorem on the complexity of our method, presenting the concrete procedure in the proof.
Theorem 1. Suppose that we are given $\epsilon > 0$ and accesses to the oracles $P_x$ and $P_y$, and that Assumptions 1 to 4 are satisfied. Then, there is a quantum algorithm which makes numbers of uses of $P_x$ and $P_y$ given by (34) and (35), respectively, and, with a probability larger than 99%, outputs an $O(\epsilon)$-additive approximation of $\vec{a}$ defined as (16).
Proof. The outline of the algorithm is as follows.
1. Estimate the elements of $W$ and $\vec{z}$ in (16) using a quantum algorithm based on QAE. Let the matrix and the vector consisting of the estimation results be $\hat{W}$ and $\hat{\vec{z}}$.

2. Solve $\hat{W} \vec{a} = \hat{\vec{z}}$ by some classical solver for systems of linear equations (for example, row reduction). Let the solution be $\hat{\vec{a}}$. This is the output of our algorithm.
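The two-step structure can be mocked end to end in Python, with bounded additive noise standing in for the QAE estimates of step 1:

```python
import numpy as np

rng = np.random.default_rng(5)
N_D, d, eps_p = 5000, 4, 1e-5

X = rng.uniform(0, 1, size=(N_D, d))
y = np.clip(X @ np.array([0.4, 0.1, -0.3, 0.2]) + 0.3, 0, 1)

W = X.T @ X / N_D
z = X.T @ y / N_D

# Step 1 (quantum in the paper): eps'-accurate estimates of W and z.
W_hat = W + rng.uniform(-eps_p, eps_p, size=(d, d))
W_hat = (W_hat + W_hat.T) / 2       # Lemma 1 assumes a symmetric W_hat
z_hat = z + rng.uniform(-eps_p, eps_p, size=d)

# Step 2 (classical): solve the d x d system; no N_D dependence.
a_hat = np.linalg.solve(W_hat, z_hat)
a_exact = np.linalg.solve(W, z)
print(round(float(np.max(np.abs(a_hat - a_exact))), 6))
```

Only step 1 touches the $N_D$ data points; step 2 is a negligible $d$-dimensional solve.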
Since we do not use the oracles $P_x$, $P_y$ in the second step, we focus on step 1. As we explained in Sec. III A, we can obtain an estimate $\hat{w}_{ij}$ of $w_{ij}$ by QAE if we construct the following operator $A_{ij}$. $A_{ij}$ transforms $|\vec{0}\rangle$, the state in which all qubits are 0, into a state of the form

$\sqrt{w_{ij}}\,|\psi\rangle + \sqrt{1 - w_{ij}}\,|\psi^\perp\rangle$,   (36)

where $|\psi\rangle$ and $|\psi^\perp\rangle$ are some orthogonal states. Such an operator is constructed as follows.
(i) Prepare quantum registers $R_1, ..., R_5$, which have enough qubits, and a single-qubit register $R_6$. Set $R_1$, $R_2$ and the others to $|i\rangle$, $|j\rangle$ and $|0\rangle$, respectively.
(ii) Create the equal superposition $\frac{1}{\sqrt{N_D}}\sum_{k=1}^{N_D} |k\rangle$ on $R_3$.

(iii) Apply $P_x$ to the block of $R_1$, $R_3$ and $R_4$, which outputs $x_k^{(i)}$ on $R_4$. Similarly, apply $P_x$ to the block of $R_2$, $R_3$ and $R_5$, which outputs $x_k^{(j)}$ on $R_5$.
(iv) Using $x_k^{(i)}$ on $R_4$ and $x_k^{(j)}$ on $R_5$, transform $R_6$ into $\sqrt{x_k^{(i)} x_k^{(j)}}\,|1\rangle + \sqrt{1 - x_k^{(i)} x_k^{(j)}}\,|0\rangle$ by some arithmetic circuits and controlled rotations⁵. Then, the resultant state is of the form (36).
Through the steps (i) to (iv), the state is transformed as follows:

$|i\rangle|j\rangle|0\rangle|0\rangle|0\rangle|0\rangle \to \frac{1}{\sqrt{N_D}}\sum_{k=1}^{N_D} |i\rangle|j\rangle|k\rangle|0\rangle|0\rangle|0\rangle \to \frac{1}{\sqrt{N_D}}\sum_{k=1}^{N_D} |i\rangle|j\rangle|k\rangle|x_k^{(i)}\rangle|x_k^{(j)}\rangle|0\rangle \to \frac{1}{\sqrt{N_D}}\sum_{k=1}^{N_D} |i\rangle|j\rangle|k\rangle|x_k^{(i)}\rangle|x_k^{(j)}\rangle\left(\sqrt{x_k^{(i)} x_k^{(j)}}\,|1\rangle + \sqrt{1 - x_k^{(i)} x_k^{(j)}}\,|0\rangle\right)$.

Since the probability of observing $|1\rangle$ on $R_6$ in the last state is $\frac{1}{N_D}\sum_{k=1}^{N_D} x_k^{(i)} x_k^{(j)} = w_{ij}$, this state is indeed of the form (36).
In order to estimate $w_{ij}$ with an additive error $\epsilon'$, QAE makes $O(1/\epsilon')$ uses of $A_{ij}$. In the current case, $A_{ij}$ contains $O(1)$ uses of $P_x$ (precisely, two uses). In total, we use $P_x$ $O(1/\epsilon')$ times in the whole process of the QAE.
We can obtain an estimate $\hat{z}_i$ of $z_i$ in a similar way, replacing one of the two $P_x$'s with $P_y$ and $x_k^{(j)}$ with $y_k$. As shown in Lemma 1, in order to obtain an $\epsilon$-additive approximation of $\vec{a}$, it is sufficient to take $\epsilon'$ satisfying (32). Therefore, the number of calls to $P_x$, $P_y$ in one QAE for $w_{ij}$ or $z_i$ is given by (38), where we omit $c$ since it is independent of $N_D$, $\kappa$, $d$ (Assumption 4) and, of course, $\epsilon$.
Then, let us recall the issue of the success probability mentioned in Sec. III A. Since the numbers of $w_{ij}$'s and $z_i$'s are $d(d+1)/2$ and $d$, respectively, in order to successfully estimate all of them with a probability larger than $0.99 = 1 - 0.01$, it suffices to make the success probability of each estimation larger than $1 - \frac{0.01}{d^2}$. Therefore, it is sufficient to repeat QAE $O\left(\log\frac{d^2}{0.01}\right) = O(\log d)$ times in estimating each of the $w_{ij}$'s and $z_i$'s. Combining this point with (38), and noting the numbers of $w_{ij}$'s and $z_i$'s, we see that the total numbers of calls to $P_x$ and $P_y$ in the whole algorithm are (34) and (35), respectively.

⁵ More concretely, we can perform the following calculation:

$|x\rangle|x'\rangle|0\rangle|0\rangle|0\rangle \to |x\rangle|x'\rangle|xx'\rangle|0\rangle|0\rangle \to |x\rangle|x'\rangle|xx'\rangle|\sqrt{xx'}\rangle|0\rangle \to |x\rangle|x'\rangle|xx'\rangle|\sqrt{xx'}\rangle\left(\sqrt{xx'}\,|1\rangle + \sqrt{1-xx'}\,|0\rangle\right)$,   (37)

where the first, second, and last kets correspond to $R_4$, $R_5$, and $R_6$, respectively, and the third and fourth kets correspond to two ancillary registers, which are not displayed in (36). We here use the multiplication circuit (e.g. [26]) at the first arrow, the circuit for the square root (e.g. [27]) at the second arrow, and the rotation gate on $R_6$ controlled by the second ancillary register at the last arrow.

D. Time complexity
Here, let us make some comments on the overall time complexity. Although we saw that we can reduce the query complexity, that is, the number of calls to $P_x$ and $P_y$, exponentially with respect to $N_D$ compared to the naive classical method, the time for one such query might depend on $N_D$. Fortunately, in the following examples, the time complexity of one query is at most logarithmic in $N_D$, which means that our method provides a speedup also in time complexity.
We first consider LSM, where the explanatory and objective variables are not given as exogenous data, but are produced by a simulation. Concretely, $P_x$ is the oracle that generates sample paths of the time evolution of asset prices, $P_y$ is the oracle that calculates the cash flows on the paths, and $N_D$ is the number of the paths. Typically, we generate asset price paths based on some stochastic differential equation, using random numbers. The implementation of such a calculation as a quantum circuit has already been considered [28-30]. In such a circuit, we create and use random numbers with $O(\log(N_D))$ digits in order to generate $N_D$ different paths. The gate count for creating such numbers and for arithmetic operations on them is $O(\mathrm{polylog}(N_D))$, and so is the time complexity. We can summarize this situation as follows: we aim to approximate some complex function by linear regression, and quantum circuits which calculate the explanatory and objective variables with $O(\mathrm{polylog}(N_D))$ time complexity are available. We expect that this situation is not unique to LSM.
Next, let us consider the case where we use a QRAM to implement the oracles. According to [25], once we are given a QRAM containing the $N_D$ data points, we can load them into a register with $O(\mathrm{polylog}(N_D))$ time complexity per query. Thus, the cost of querying the oracle is negligible. However, preparing such a QRAM also takes time, which must be taken into account. A reasonable assumption is that we are initially given the data points as classical data and load them into the QRAM, which takes $O(N_D)$ time complexity. Even if we take the QRAM initialization time into account, in practical uses of linear regression, our quantum method can provide a reduction of the time complexity of the entire process of data analysis. For example, consider the situation where we explore an optimal combination of preprocessings, such as one-hot encoding and logarithmic transformation, by trial and error. Such transformations, by which we aim to enhance the explanatory power of the variables, are ubiquitous in data analysis, and we often repeat regression on one data set applying various patterns of transformation and compare the results. In this situation, the costliest part of the total calculation is, for the classical method, the series of classical regressions over the many trials, but, for our method, a single QRAM initialization at the very beginning, which will be less time-consuming. Therefore, we conclude that the proposed method can be faster overall than the naive classical method in such situations.

IV. EXTENSION TO A CLASS OF CONVEX OPTIMIZATION
A. Linear regression as optimization by Newton's method

In this section, we extend our method for linear regression to more general optimization problems. Before this, we first present another interpretation of (11), the formula for the solution of linear regression. We regard linear regression as the optimization problem for the function

$F(\vec{a}) := \frac{1}{2N_D}\left\|\vec{y} - X\vec{a}\right\|^2$.   (39)

This can be rewritten as

$F(\vec{a}) = \frac{1}{N_D}\sum_{k=1}^{N_D} f(\vec{a}; \vec{x}_k, y_k)$,   (40)

where

$f(\vec{a}; \vec{x}, y) := \frac{1}{2}\left(y - \vec{x} \cdot \vec{a}\right)^2$.   (41)

The first and second derivatives of $F$ are

$\vec{g}_F(\vec{a}) = W\vec{a} - \vec{z}$   (42)

and

$H_F(\vec{a}) = W$,   (43)

respectively. This means that $W = \frac{1}{N_D} X^T X$ and $-\vec{z} = -\frac{1}{N_D} X^T \vec{y}$ are the Hessian matrix and the gradient vector of $F$ at $\vec{a} = \vec{0}$, respectively. Besides, the updating formula in Newton's method is

$\vec{a}_{n+1} = \vec{a}_n - \left(H_F(\vec{a}_n)\right)^{-1} \vec{g}_F(\vec{a}_n)$,   (44)

where $H_F$ and $\vec{g}_F$ are the Hessian and the gradient of the function $F$, respectively, and $\vec{a}_n$ is the optimization variable after the $n$-th update. Then, we can interpret (11) as a one-time update in Newton's method from the initial point $\vec{a}_0 = \vec{0}$. Note that Newton's method gives the exact solution by only one update from any initial point if the objective function is quadratic.
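This interpretation is easy to verify numerically: one Newton update from $\vec{a}_0 = \vec{0}$ reproduces the least-squares solution. Here $F$ is taken as $\frac{1}{2N_D}\|\vec{y}-X\vec{a}\|^2$, so that its Hessian and gradient at $\vec{a}=\vec{0}$ match $W$ and $-\vec{z}$:

```python
import numpy as np

rng = np.random.default_rng(6)
N_D, d = 500, 3
X = rng.uniform(0, 1, size=(N_D, d))
y = rng.uniform(0, 1, size=N_D)

W = X.T @ X / N_D                  # Hessian of F (constant in a)
z = X.T @ y / N_D                  # minus the gradient of F at a = 0

a0 = np.zeros(d)
grad_at_a0 = W @ a0 - z            # gradient of F at a0
a1 = a0 - np.linalg.solve(W, grad_at_a0)   # one Newton update

a_ols, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
print(np.allclose(a1, a_ols))
```

Because the objective is quadratic, the single update lands exactly (up to floating-point error) on the minimizer.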
In summary, we can consider our method as optimization of the objective function (39) by Newton's method, where the calculation of the gradient and the Hessian, which is time-consuming in the classical method, is done by the QAE-based method.
B. Extension of our method: the QAE-based Newton's method

On the basis of the above discussion, it is now straightforward to extend our method to a more general class of optimization problems. That is, keeping the updating formula (44), we can perform Newton's method based on the gradient and the Hessian estimated by QAE.
Concretely, we consider an optimization problem in which the objective function can be written as a sum of the values of some function with different inputs:

$F(\vec{a}) = \frac{1}{N_D}\sum_{k=1}^{N_D} f(\vec{a}, \vec{c}_k)$.   (45)

Here, $f$ is a real-valued twice-differentiable function, which is shared by all the terms. Its inputs are the optimization variable $\vec{a} \in \mathbb{R}^d$ and some parameters $\vec{c}_k$, which differ from term to term. Some conditions on $F$ necessary for the convergence analysis are given in Sec. IV C. It is obvious that the objective function (39) falls into the form (45). For an objective function like (45), the gradient $\vec{g}_F(\vec{a}) = (g_{F,1}(\vec{a}), ..., g_{F,d}(\vec{a}))^T$ and the Hessian $H_F(\vec{a}) = (h_{F,ij}(\vec{a}))_{1\le i,j\le d}$ are given as

$g_{F,i}(\vec{a}) = \frac{1}{N_D}\sum_{k=1}^{N_D} f_i(\vec{a}, \vec{c}_k)$   (46)

and

$h_{F,ij}(\vec{a}) = \frac{1}{N_D}\sum_{k=1}^{N_D} f_{ij}(\vec{a}, \vec{c}_k)$,   (47)

where we simply write $\frac{\partial}{\partial a_i} f$ and $\frac{\partial^2}{\partial a_i \partial a_j} f$ as $f_i$ and $f_{ij}$, respectively.
Then, we estimate the gradient and the Hessian as follows. We assume the availability of the following oracles:

• $P_c$, which outputs $\vec{c}_k$ for a given $k$:

$P_c |k\rangle |0\rangle = |k\rangle |\vec{c}_k\rangle$.   (48)

This can be constructed by QRAM.

• $P_i$ for $i = 1, ..., d$, which outputs $f_i(\vec{a}, \vec{c}_k)$ for given $\vec{a}$ and $\vec{c}_k$:

$P_i |\vec{a}\rangle |\vec{c}_k\rangle |0\rangle = |\vec{a}\rangle |\vec{c}_k\rangle |f_i(\vec{a}, \vec{c}_k)\rangle$.   (49)

• $P_{ij}$ for $i, j = 1, ..., d$, which outputs $f_{ij}(\vec{a}, \vec{c}_k)$ for given $\vec{a}$ and $\vec{c}_k$:

$P_{ij} |\vec{a}\rangle |\vec{c}_k\rangle |0\rangle = |\vec{a}\rangle |\vec{c}_k\rangle |f_{ij}(\vec{a}, \vec{c}_k)\rangle$.   (50)

Then, preparing appropriate registers, we can perform the computation (51), in which an equal superposition over $k$ is created at the first arrow, $P_c$ and $P_i$ are used at the second and third arrows, respectively, and a controlled rotation encodes $f_i(\vec{a}, \vec{c}_k)$ into the amplitude at the last arrow. We then obtain an estimate of $g_{F,i}(\vec{a})$ by estimating the probability that the last qubit takes $|1\rangle$ by QAE. We can also estimate $h_{F,ij}(\vec{a})$ similarly, replacing $P_i$ with $P_{ij}$ and $f_i(\vec{a}, \vec{c}_k)$ with $f_{ij}(\vec{a}, \vec{c}_k)$. Again, in order to estimate one $g_{F,i}(\vec{a})$ or $h_{F,ij}(\vec{a})$ with an $\epsilon$-additive error, the number of calls to $P_i$ or $P_{ij}$ is at most $O(1/\epsilon)$, which means an exponential speedup with respect to $N_D$ compared with the classical iterative calculation.
Using the gradient and the Hessian estimated as above, we update $\vec{a}_n$ to $\vec{a}_{n+1}$ similarly to (44), that is,

$\vec{a}_{n+1} = \vec{a}_n - \left(\hat{H}_F(\vec{a}_n)\right)^{-1} \hat{\vec{g}}_F(\vec{a}_n)$,   (52)

where $\hat{\vec{g}}_F(\vec{a}_n)$ and $\hat{H}_F(\vec{a}_n)$ are the estimated gradient and Hessian, respectively. After sufficiently many iterations, we obtain an approximate solution of the optimization problem. Hereafter, we call this method the QAE-based Newton's method.
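A minimal sketch of the QAE-based Newton iteration, with bounded additive noise standing in for the QAE errors in the estimated gradient and Hessian; the $\ell_2$-regularized logistic objective below is an illustrative strongly convex instance of (45), not an example from the paper:

```python
import numpy as np

rng = np.random.default_rng(7)
N_D, d, mu = 2000, 3, 0.1
C = rng.normal(size=(N_D, d))                 # parameters c_k
t = (C @ np.array([1.0, -0.5, 0.3]) > 0).astype(float)

def grad_hess(a):
    # Exact gradient (46) and Hessian (47) of the regularized
    # logistic loss F(a) = (1/N_D) sum_k loss_k + (mu/2)||a||^2.
    p = 1.0 / (1.0 + np.exp(-C @ a))          # sigmoid(c_k . a)
    g = C.T @ (p - t) / N_D + mu * a
    H = (C * (p * (1 - p))[:, None]).T @ C / N_D + mu * np.eye(d)
    return g, H

eps_p = 1e-5                                  # mock QAE accuracy
a = np.zeros(d)
for _ in range(10):                           # Newton updates (52)
    g, H = grad_hess(a)
    g_hat = g + rng.uniform(-eps_p, eps_p, d)
    H_hat = H + rng.uniform(-eps_p, eps_p, (d, d))
    H_hat = (H_hat + H_hat.T) / 2             # keep estimate symmetric
    a = a - np.linalg.solve(H_hat, g_hat)

g_final, _ = grad_hess(a)
print(round(float(np.linalg.norm(g_final)), 6))
```

The small estimation noise leaves the iterate close to the true minimizer, where the exact gradient is nearly zero.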
Note that $\hat{H}_F(\vec{a}_n)$ must be invertible so that the update (52) is well-defined. Hereafter, we consider the situation where the original Hessian $H_F(\vec{a}_n)$ is positive-definite and therefore invertible. In order to preserve this property in the estimate, we have to obtain $\hat{H}_F(\vec{a}_n)$ accurately enough.

C. Convergence analysis of the QAE-based Newton's method
Let us estimate the complexity of the QAE-based Newton's method to obtain an approximate solution with an $\epsilon$-additive error. For a mathematically rigorous discussion, we first make some assumptions on the objective function $F : \mathbb{R}^d \to \mathbb{R}$.

Assumption 5. $F$ is twice-differentiable.

This is a restatement, since we assumed that $f$ in (45) is twice-differentiable.

Assumption 6. $F$ is $\mu$-strongly convex, that is, there exists a positive number $\mu$ such that

$F(\vec{b}) \ge F(\vec{a}) + \vec{g}_F(\vec{a}) \cdot (\vec{b} - \vec{a}) + \frac{\mu}{2}\|\vec{b} - \vec{a}\|^2$   (53)

for any $\vec{a}, \vec{b} \in \mathbb{R}^d$.

This assumption means that the eigenvalues of $H_F(\vec{a})$ are greater than or equal to $\mu$ for any $\vec{a} \in \mathbb{R}^d$, that is,

$H_F(\vec{a}) \succeq \mu I_d$,   (54)

where $I_d$ is the $d \times d$ identity matrix and, for $A, B \in \mathbb{R}^{d\times d}$, $A \succeq B$ means that $A - B$ is positive-semidefinite. This immediately leads to

$\|(H_F(\vec{a}))^{-1}\| \le \frac{1}{\mu}$.   (55)

Assumption 7. $F$ has an $M$-Lipschitz Hessian, that is,

$\|H_F(\vec{a}) - H_F(\vec{b})\| \le M\|\vec{a} - \vec{b}\|$   (56)

for any $\vec{a}, \vec{b} \in \mathbb{R}^d$.

This assumption leads to the following inequality:

$\|\vec{g}_F(\vec{b}) - \vec{g}_F(\vec{a}) - H_F(\vec{a})(\vec{b} - \vec{a})\| \le \frac{M}{2}\|\vec{b} - \vec{a}\|^2$.   (57)

Assumptions 5 to 7 are usually made in the discussion of the convergence properties of the ordinary Newton's method, that is, the case where the gradient and the Hessian can be exactly computed in the classical way (see, for example, [31]). We do not make any additional assumptions for the QAE-based Newton's method.
Next, let us define some quantities. Since QAE introduces errors in the estimated Hessian and gradient, we have to consider Newton's method in which the update difference contains some error. For $\vec{a} \in \mathbb{R}^d$, we define $\Delta_F(\vec{a})$, the size of the error in the update difference, as

$\Delta_F(\vec{a}) := \left\|\left(\hat{H}_F(\vec{a})\right)^{-1}\hat{\vec{g}}_F(\vec{a}) - \left(H_F(\vec{a})\right)^{-1}\vec{g}_F(\vec{a})\right\|$.   (58)

Besides, we write the minimum⁶ that we search for as $\vec{a}^\star := \mathrm{argmin}_{\vec{a}\in\mathbb{R}^d} F(\vec{a})$, the optimization variable after the $n$-th update as $\vec{a}_n$, and the difference between $\vec{a}^\star$ and $\vec{a}_n$ as $\delta_n := \|\vec{a}_n - \vec{a}^\star\|$. Then, we can show the following lemma, which will be used repeatedly.
Lemma 2. Let Assumptions 5 to 7 be satisfied. Then, for any non-negative number $\epsilon$ and any $\vec{a} \in \mathbb{R}^d$ satisfying condition (59), the following inequality holds:

$\delta' \le \frac{M}{2\mu}\delta^2 + \epsilon$,   (60)

where $\delta := \|\vec{a} - \vec{a}^\star\|$, $\delta' := \|\vec{a}' - \vec{a}^\star\|$, and $\vec{a}' = \vec{a} - (\hat{H}_F(\vec{a}))^{-1}\hat{\vec{g}}_F(\vec{a})$. Furthermore, when

$\epsilon \le \frac{\mu}{2M}$   (61)

is satisfied, the statements (62) and (63) hold, where

$\delta_\pm := \frac{\mu}{M}\left(1 \pm \sqrt{1 - \frac{2M\epsilon}{\mu}}\right)$.   (64)

The proof is given in Appendix A 2. Here, let us comment on what Lemma 2 implies. Inequality (60) indicates that, even when the update differences in Newton's method contain errors of size at most $\epsilon$, the distance $\delta$ between the optimization variable $\vec{a}$ and the optimal point $\vec{a}^\star$ converges quadratically, like in Newton's method with no error, as long as $\epsilon \ll \frac{M}{2\mu}\delta^2$. More strictly, while $\delta_- < \delta$, $\delta$ decreases at every update, as shown in (63). On the other hand, after $\delta$ reaches $\delta_-$, $\epsilon$ is not negligible in (60), and therefore $\delta$ does not necessarily decrease. Nevertheless, $\delta$ does not exceed $\delta_-$ once it goes below it. Since $\delta_- < 2\epsilon$, we can make $\vec{a}$ converge with the desired accuracy if we can suppress $\epsilon$.

⁶ Since we are considering convex optimization, as stated in Assumption 6, there is only one global minimum.
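The behavior described above can be illustrated by iterating the worst-case recurrence suggested by Lemma 2, $\delta' = \frac{M}{2\mu}\delta^2 + \epsilon$: the error falls quadratically until it reaches the floor $\delta_- < 2\epsilon$ and then stays below it. The constants are illustrative, and the recurrence is the bound discussed in the text, not the algorithm itself:

```python
import numpy as np

mu, M, eps = 1.0, 0.5, 1e-4
assert eps <= mu / (2 * M)                  # condition (61)

# Fixed points of x = (M/(2 mu)) x^2 + eps.
disc = np.sqrt(1 - 2 * M * eps / mu)
delta_minus = (mu / M) * (1 - disc)
delta_plus = (mu / M) * (1 + disc)

delta = 0.9 * delta_plus                    # start inside the basin
history = [delta]
for _ in range(10):
    delta = (M / (2 * mu)) * delta**2 + eps # worst-case update
    history.append(delta)

print(history[-1] <= 2 * eps, delta_minus <= 2 * eps)
```

The iterates drop from $O(1)$ to the $O(\epsilon)$ floor in a handful of steps, reflecting the doubly logarithmic iteration counts typical of Newton-type analyses.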
Using Lemma 2, we obtain the following lemma, which shows how many updates are sufficient for $\delta$ to reach $2\epsilon$ in Newton's method with erroneous update differences.

Lemma 3. Let Assumptions 5 to 7 be satisfied. Suppose that we repeatedly update $\vec{a} \in \mathbb{R}^d$ by (52) from some initial point $\vec{a}_0$, where we write the result of $n$ updates as $\vec{a}_n$ and define $\delta_n := \|\vec{a}_n - \vec{a}^\star\|$ for $n = 0, 1, 2, ...$. Then, for any positive number $\epsilon$ satisfying (61) and any $\vec{a}_0$ such that

$\delta_0 < \delta_+$,   (65)

if (59) holds for $\vec{a} = \vec{a}_n$, $n = 0, 1, 2, ...$, there exists a non-negative integer $n_{it}$, given by (67), such that

$\delta_n \le 2\epsilon$   (66)

for any $n \ge n_{it}$.

The proof is given in Appendix A 3. Next, we consider how accurately we should estimate the gradient and the Hessian in order to suppress the error in the update difference to the order of $\epsilon$.

Lemma 4. Let Assumptions 5 to 7 be satisfied. Besides, suppose that we are given a positive integer $n$, a positive number $\epsilon$ satisfying (61), and $\vec{a} \in \mathbb{R}^d$ satisfying $\delta < \frac{2\mu}{M}$, where $\delta := \|\vec{a} - \vec{a}^\star\|$. Then, in order for $\Delta_F(\vec{a})$ defined as (58) to satisfy $\Delta_F(\vec{a}) \le \epsilon$, it is sufficient that each component of the estimated gradient $\hat{\vec{g}}_F(\vec{a})$ and Hessian $\hat{H}_F(\vec{a})$ has an additive error at most $\epsilon'_g$ and $\epsilon'_H$ satisfying (68) and (69), respectively, where $\bar{\delta} := \max\{\delta, \delta_-\}$.
Lemma 5. Let Assumptions 5 to 7 be satisfied. Besides, suppose that we are given a positive number $\epsilon$ satisfying (61) and an initial point $\vec{a}_0 \in \mathbb{R}^d$ satisfying $\delta_0 := \|\vec{a}_0 - \vec{a}^\star\| < \delta_+$. Furthermore, let $\vec{a}_n \in \mathbb{R}^d$ be the vector given by $n$ updates by (52) from the initial point $\vec{a}_0$. Then, $\Delta_F(\vec{a}_n)$ defined as (58) satisfies

$\Delta_F(\vec{a}_n) \le \epsilon$   (70)

for any positive integer $n$, if each component of $\hat{\vec{g}}_F(\vec{a})$ and $\hat{H}_F(\vec{a})$ has an additive error at most $\epsilon'_g$ and $\epsilon'_H$ satisfying (68) and (71), respectively, where $\bar{\delta}_0 := \max\{\delta_0, \delta_-\}$.
The proofs of Lemmas 4 and 5 are given in Appendix A 4 and A 5, respectively.
Combining Lemma 3 and Lemma 5, we immediately obtain the following theorem.
Theorem 2. Let Assumptions 5 to 7 be satisfied. Besides, suppose that we are given a positive number $\epsilon$ satisfying (61) and $\vec{a}_0$ satisfying (65). Then, using the QAE-based Newton's method based on the updating formula (52), we obtain a $2\epsilon$-additive approximation of $\vec{a}^\star$ by $n_{it}$ updates, where $n_{it}$ is given by (67), with a success probability higher than 99%. In the process, the total number of calls to $P_i$, $i = 1, ..., d$, is given by (72), that to $P_{ij}$, $i, j = 1, ..., d$, by (73), and that to $P_c$ by (74).

Proof. First, let us consider the situation where all the estimation processes in the QAE-based Newton's method are successful; more concretely, all the estimations have errors smaller than the desired level. Correspondingly, suppose that, for $\epsilon'_g$ satisfying (68) and $\epsilon'_H$ satisfying (71), we can calculate each component of $\hat{\vec{g}}_F$ and $\hat{H}_F$ with an $\epsilon'_g$-additive error and an $\epsilon'_H$-additive error, respectively, at each update in the QAE-based Newton's method. Because of Lemma 5, this means that (70) is satisfied at each update. Then, from Lemma 3, we obtain the optimization variable $\vec{a}_{n_{it}}$ satisfying (66), which is therefore a $2\epsilon$-approximation of $\vec{a}^\star$, after $n_{it}$ updates.
Then, let us evaluate the total numbers of calls to P_i and P_{i,j} in the above process, recalling that the success probability of QAE is not 1. We calculate the elements of ĝ_F and Ĥ_F, whose number is O(d²) in total, at each of the n_it updates. Therefore, the total number of such calculations is O(n_it d²). To make the probability that all calculations are done with the required errors larger than 99%, it suffices to enhance the success probability of each calculation to 1 − 0.01/(n_it d²). This can be achieved by repeating QAE O(log(n_it d²)) times in each calculation. Besides, in one trial among the repeated QAEs, O(1/ǫ′_g) calls to P_i are sufficient at each of the n_it updates, and therefore the total call number of P_i becomes (72). Similarly, that of P_{i,j} becomes (73).
Since one call to P_i or P_{i,j} accompanies one call to P_c, we obtain (74).
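The amplification step in this proof, repeating each estimation and taking a representative value such as the median, can be illustrated classically. The mock estimator below, its 0.8 success probability, and all numbers are illustrative placeholders, not an actual QAE circuit:

```python
import numpy as np

# Boosting an estimator that is eps-accurate only with constant
# probability (as a single QAE run is) by repetition and median-taking,
# so that many estimations jointly succeed with high probability.

rng = np.random.default_rng(1)
true_value, eps = 0.7, 0.01

def single_run():
    # mock estimator: eps-accurate with probability ~0.8,
    # otherwise an uncontrolled outlier in [0, 1)
    if rng.random() < 0.8:
        return true_value + rng.uniform(-eps, eps)
    return rng.random()

def boosted(n_rep):
    # the median fails only if a majority of runs fail, whose
    # probability decays exponentially in n_rep
    return np.median([single_run() for _ in range(n_rep)])

hits = sum(abs(boosted(41) - true_value) <= eps for _ in range(100))
print(hits)
```

With 41 repetitions the boosted failure probability is already tiny, which is why O(log(n_it d²)) repetitions suffice to push the joint success probability of all O(n_it d²) estimations above 99%.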

D. Newton's method based on classical Monte Carlo
An important feature of the QAE-based Newton's method is that we estimate the components of the Hessian and the gradient, which are represented as sums of many similar terms, by QAE, or, more concretely, by a procedure similar to the quantum algorithm for Monte Carlo integration based on QAE [15,21]. It is natural to compare the QAE-based Newton's method with the classical Monte Carlo (CMC)-based Newton's method, which is just Newton's method whose summation parts for calculating the Hessian and the gradient are replaced with CMC. If we can use oracles similar to P_c, P_i, and P_{i,j} in Sec. IV on a classical computer, we can estimate the components of the Hessian and the gradient also by CMC, that is, we can take the average of the values of sampled terms in the sum. Note that the classical counterpart of a QRAM, which can be used for P_x, P_y, and P_c, is just a RAM.
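A minimal classical sketch of this CMC estimation, with RAM access playing the role of the oracle; the quadratic objective and all names are illustrative assumptions:

```python
import numpy as np

# CMC estimation of one gradient component: average sampled terms of
# the sum, with array (RAM) access playing the role of the oracle P_i.
# Objective: F(a) = (1/N_D) sum_k (x_k . a - y_k)^2 (illustrative).

rng = np.random.default_rng(2)
N_D, d = 10**5, 3
X = rng.normal(size=(N_D, d))
y = X @ np.array([1.0, -2.0, 0.5])
a = np.zeros(d)  # current optimization variable

def P_i(k, i):
    # k-th term of the sum defining the i-th gradient component
    return 2 * (X[k] @ a - y[k]) * X[k, i]

def cmc_gradient_component(i, n_samp):
    ks = rng.integers(0, N_D, size=n_samp)  # uniformly sampled points
    return np.mean([P_i(k, i) for k in ks])

exact = 2 * X.T @ (X @ a - y) / N_D
est = cmc_gradient_component(0, n_samp=20000)
print(abs(est - exact[0]))  # error shrinks like 1/sqrt(n_samp)
```

The query count per component is just the sample number, independent of N_D, which is the point of the comparison below.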
Let us estimate the complexity of this method. In classical Monte Carlo, the estimation error is proportional to N_samp^{-1/2}, where N_samp is the sample number. More precisely, asymptotically, the deviation of the estimated value from the true value is less than C/√N_samp with probability 1 − γ, where C is a constant determined by γ and the variance of the sampled values. It is therefore sufficient to take N_samp = O((C/ǫ)²) samples in order to obtain an estimation with error less than ǫ with probability higher than 1 − γ. Besides, note that the number of calls to each oracle is proportional to the sample number. Then, by a discussion similar to that on Theorem 2, we see that we obtain a 2ǫ-additive approximation of a⋆ by the CMC-based Newton's method with probability higher than 99%. In the process, the iteration number n_it is given as (67), the total numbers of calls to P_i, i = 1, ..., d, and to P_{i,j}, i, j = 1, ..., d, are the CMC counterparts of (72) and (73), and that to P_c is (74). Therefore, as the QAE-based Newton's method does, the CMC-based Newton's method removes the dependency of the query complexity on N_D, even though it is classical. Note that the complexity of the QAE-based Newton's method is smaller than that of the CMC-based one, due to the so-called quadratic quantum speedup of Monte Carlo integration.
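The contrast between the two query counts can be made concrete; here the constant C and the O-constant of QAE are both treated as given and set to 1 for illustration:

```python
import math

# Query-count scaling: CMC needs N_samp with C / sqrt(N_samp) <= eps,
# i.e. O((C/eps)^2) samples, while a QAE-based estimation needs only
# O(1/eps) oracle queries (the quadratic quantum speedup).
# Constants are illustrative.

def cmc_samples(eps, C=1.0):
    # smallest N_samp with C / sqrt(N_samp) <= eps
    return math.ceil((C / eps) ** 2)

def qae_queries(eps, c=1.0):
    # O(1/eps)-type query count, up to the constant c
    return math.ceil(c / eps)

for eps in (1e-2, 1e-3, 1e-4):
    print(eps, cmc_samples(eps), qae_queries(eps))
```

At ǫ ∼ 10⁻³ the gap between the two counts is already three orders of magnitude, which is the regime discussed in the introduction.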

E. Related previous works
As a final comment, we briefly refer to some previous works [24,32], which used ideas similar to ours in the sense that the derivatives of the objective function are calculated by QAE. These papers considered not Newton's method but gradient descent and its application to optimization problems with objective functions consisting of many similar terms, a class which is less general than (45) but includes linear regression. They then aimed at speeding up the gradient calculation by QAE, similarly to us. Compared with Newton's method, gradient descent has both pros and cons. On the pro side, it requires not the Hessian but only the gradient, and it reaches the minimum from any initial point in convex optimization. On the con side, it usually requires more iterations than Newton's method, and there is the issue of choosing an appropriate step size.
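These pros and cons can be seen on a toy strongly convex quadratic; the matrix, step size, and tolerance below are illustrative choices, not taken from [24,32]:

```python
import numpy as np

# Toy comparison on F(a) = a.A.a/2 - b.a: Newton needs the Hessian but
# converges in one step; gradient descent needs only gradients but many
# iterations and a step size tied to the largest curvature L.

A = np.diag([1.0, 10.0])        # Hessian, condition number 10
b = np.array([1.0, 1.0])
a_star = np.linalg.solve(A, b)  # exact minimizer

def grad(a):
    return A @ a - b

# gradient descent with step size 1/L, L = 10
a, n_gd = np.zeros(2), 0
while np.linalg.norm(a - a_star) > 1e-8:
    a = a - 0.1 * grad(a)
    n_gd += 1

# a single Newton step from the same initial point
a_newton = np.zeros(2) - np.linalg.solve(A, grad(np.zeros(2)))

print(n_gd, np.linalg.norm(a_newton - a_star))
```

Gradient descent takes over a hundred iterations here while the single Newton step lands on the minimizer, at the price of forming and inverting the Hessian.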

V. SUMMARY
In this paper, we proposed a quantum algorithm for linear regression, or, more concretely, for estimation of the regression coefficients as classical data. Existing algorithms such as [1,3] create a quantum state encoding the coefficients in its amplitudes by the HHL algorithm [10] or its modifications and then read out the coefficients. On the other hand, in our method, we estimate the elements of the matrix W and the vector z, which are determined by the design matrix, the objective variable vector y = (y_1, ..., y_{N_D})^T, and the number of data points N_D, and then we find the coefficients by the classical computation a = W^{-1} z. Since, as shown in (13) and (15), these elements take the form of sums of x^(i)_k x^(j)_k or x^(i)_k y_k over data points, we can estimate them by QAE [14][15][16][17][18][19][20], assuming availability of the oracles P_x and P_y, which output x^(i)_k and y_k, respectively, for specified i and k. The query complexity of our method is given as (34) and (35), which means exponential speedup with respect to N_D compared with the naive classical method, and improvement with respect to the tolerance error ǫ compared with the previous quantum methods such as [1,3].
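An end-to-end classical sketch of this scheme, where the QAE estimation of each element of W and z is emulated by adding an additive error of size ǫ to the exact value; the normalization by N_D, the noise model, and all names are illustrative assumptions:

```python
import numpy as np

# Estimate the entries of W and z (sums of x^(i)_k x^(j)_k and
# x^(i)_k y_k over data points, here normalized by N_D), emulate QAE
# by an additive eps-level error, then recover a = W^{-1} z classically.

rng = np.random.default_rng(3)
N_D, d, eps = 10**4, 3, 1e-4
X = rng.normal(size=(N_D, d))   # design matrix, columns x^(i)
a_true = np.array([0.3, -1.2, 2.0])
y = X @ a_true                  # objective variable vector

W = X.T @ X / N_D + rng.uniform(-eps, eps, (d, d))  # QAE-style estimates
z = X.T @ y / N_D + rng.uniform(-eps, eps, d)

a_hat = np.linalg.solve(W, z)   # classical post-processing
print(np.linalg.norm(a_hat - a_true))  # error at the eps level
```

Only the d(d+1)/2 + d element estimations touch the data; the final solve is a d-dimensional classical computation, which is why the overall complexity keeps only a logarithmic dependence on N_D.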
Finally, we extended our method to more general optimization problems, that is, convex optimization with an objective function consisting of many similar terms like (45). In light of the interpretation of linear regression as Newton's method, we proposed the QAE-based Newton's method, in which the gradient and the Hessian are estimated by QAE. Introducing the effects of estimation errors into the ordinary discussion on the convergence of Newton's method, we derived the convergence property and the query complexity of the QAE-based Newton's method. Even if there are estimation errors, the method shows the well-known quadratic convergence and reaches the solution in a small number of iterations.
Having obtained the improved dependence of the complexity on ǫ, we expect that we can apply our method to mid-sized but many-times-repeated regression problems like LSM. Generally speaking, our method can be better than the previous quantum methods and the naive classical method when d, κ ≪ 1/ǫ ≪ N_D. On the other hand, since the complexity of our method depends on d and κ more strongly than the previous quantum methods, they will be better when 1/ǫ ≪ d or 1/ǫ ≪ κ. Since the naive classical method does not induce any error, unlike the quantum ones, it will be better when N_D ≪ 1/ǫ. In future work, we will consider implementation of LSM on quantum computers using our method for linear regression, as a concrete and practical use case of quantum computing in the financial industry.
Then, let us consider the following three cases of the initial deviation.

calls to P_{i,j}) in order to get an element of ĝ_F (resp. Ĥ_F) with an ǫ′_g-additive (resp. ǫ′_H-additive) error. In summary, we calculate the d components of ĝ_F by O(log(n_it d²))-times repeated QAEs with O(1/ǫ′_g) calls to