Solving the Bethe-Salpeter equation with exponential convergence

The Bethe-Salpeter equation plays a crucial role in understanding the physics of correlated fermions, relating to optical excitations in solids as well as resonances in high-energy physics. Yet, it is notoriously difficult to control numerically, typically requiring an effort that scales polynomially with energy scales and accuracy. This puts many interesting systems out of computational reach. Using the intermediate representation and sparse modeling for two-particle objects on the Matsubara axis, we develop an algorithm that solves the Bethe-Salpeter equation in O(L^8) time with O(L^4) memory, where L grows only logarithmically with inverse temperature, bandwidth, and desired accuracy. This opens the door for computations in hitherto inaccessible regimes. We benchmark the method on the Hubbard atom and on the multi-orbital weak-coupling limit, where we observe the expected exponential convergence to the analytical results. We then showcase the method for a realistic impurity problem.


I. INTRODUCTION
The rapid increase in available computational resources has propelled material calculations into a new era. Computational methods based on density functional theory are widely used for material design. One of the remaining grand challenges is computing the dynamical response that determines the functionalities of materials, especially those with strong electronic correlations. In theory, excitons, magnons, and other composite excitations can only be described by two-particle correlation functions (optical conductivity, susceptibilities). At the core of calculating these lie equations at the two-particle level, in particular the Bethe-Salpeter equation (BSE) [1,2].
Solving the BSE is notoriously difficult since it scales unfavorably in the number of orbitals and frequencies.
To reduce the computational effort, most ab initio calculations at the two-particle level are performed within the static approximation, where only the zero-frequency component is taken [3,4,27-30], or with a reduced frequency dependence [6-8]. Fully dynamical calculations [10-14,31,32] are quickly stopped by the curse of dimensionality at only a few orbitals or high temperatures. This renders many interesting regimes, such as low-temperature phases of materials, inaccessible.

* M. Wallerberger and H. Shinaoka contributed equally to this work.

FIG. 1. Solving the BSE using sparse modeling with a cutoff L (this work) compared with using a dense frequency box of linear size N: (a) nominal truncation error and runtime/memory scaling, where N_orb is the number of orbitals, βω_max is the bandwidth in units of temperature, and γ is the power in inverse frequencies to which the asymptotics is known analytically; (b) actual scaling of the error for the Hubbard atom (cf. Sec. VI A), where black and red lines are the average (p = 2) and maximum (p = ∞) deviation from the exact result, respectively.
Fully dynamical calculations mostly use a dense box of Matsubara frequencies. Solving the BSE then requires O(N 3 ) memory and, even worse, O(N 4 ) computational time, where N is the number of frequencies along each side of the box (see Fig. 1). The naïve truncation error decays as O(1/N ). These exponents are problematic since N is proportional to the bandwidth, ω max , and to the inverse temperature, β = 1/T , with a large prefactor due to substantial finite-size effects of the box. Chemical accuracy is then usually far out of reach and error control becomes difficult.
In this paper we offer a substantial step towards numerically low-cost evaluation of the BSE. Based on the previously developed concept of sparse modeling of the two-particle correlation functions [33,34], we directly solve the BSE in the sparse representation (intermediate representation, IR [35]). As illustrated in Fig. 1, the polynomial scaling of the maximal (red curves) and average (black curves) relative error with the linear system size (linear frequency box size N ) is replaced by faster than exponential decay with the size of the IR representation L. Moreover, L grows only logarithmically with βω max [36]. With this method we achieve not simply an improvement in scaling, but a paradigm shift: from brute force high performance computing to data compression and systematic error control. The concept of sparse modeling is already widely adopted for ab initio calculations at the one-particle level [37][38][39][40][41][42][43]. Here, we pave the way to taking dynamical ab initio calculations with sparse modeling to the two-particle level.
In order to keep the paper self-contained and introduce the necessary concepts and notation, we review sparse modeling in Sec. II. In Sec. III, we will then show that these ideas carry over to diagrammatic equations such as the Bethe-Salpeter equation, which we rewrite in the compressed form. In Sec. IV, we then develop a method to also compress all intermediate results (summation over inner propagators) needed in solving the BSE (sparse convolution). The final step needed to obtain the solution is to solve the resulting least-squares problem, which we present in Sec. V. We then benchmark our method on two challenging systems: in Sec. VI, we compare sparse modeling to the analytic results for the Hubbard atom as well as for the weak-coupling limit of a multi-orbital impurity. In Sec. VII, we then show the results for a realistic impurity problem. We conclude with a summary and outlook in Sec. VIII.

II. REVIEW: SPARSE MODELING
We review sparse modeling before its use in solving the BSE. The concept of sparse modeling is based on a generic feature of imaginary frequency data: in transitioning numerical data from real to imaginary (Matsubara) frequencies, we lose information as features of correlation functions get smeared out. That information loss makes it difficult to reconstruct the original real-frequency signal, but at the same time allows imaginary frequency data to be compressed extremely efficiently [35,36,44].
The essence of this approach is twofold: (i) the truncated intermediate representation (IR) provides an efficient basis for representing Matsubara Green's functions and (ii) the basis coefficients can be inferred from Matsubara data at special sampling frequencies in a quick and stable fashion. In the following, we review the use of the IR in representing one-particle and two-particle response functions, focusing specifically on the Green's function. For an extended review on the IR, we refer the reader to Refs. 45 and 46.

A. One-particle response functions

A one-particle response function can be written as [47]:

    Ĝ^α(iω) = ∫_0^β dτ e^{iωτ} ⟨T_τ A(τ) B(0)⟩,    (2)

where τ denotes imaginary (Euclidean) time, T_τ indicates time ordering, α ∈ {B, F} denotes fermionic or bosonic statistics, A and B are operators, and iω is a fermionic (bosonic) Matsubara frequency for α = F (α = B). Equation (2) and the associated spectral function ρ(ω) in real frequency ω are connected through analytic continuation [47]:

    Ĝ^α(iω) = ∫ dω′ K^α(iω, ω′) ρ(ω′),  K^F(iω, ω) = 1/(iω − ω),  K^B(iω, ω) = ω/(iω − ω),    (3)

where we require the spectral function to be of the form:

    ρ(ω) = Σ_i ρ_i δ(ω − ω_i),  |ω_i| ≤ ω_max.

(Note that the bosonic kernel has been regularized using an additional factor ω, consistent with analytic continuation practice.) The kernel in a Fredholm integral equation of the first kind, K^α, admits a singular value expansion [48]:

    K^α(iω, ω) = Σ_{l=0}^∞ Û^α_l(iω) S^α_l V^α_l(ω),    (4)

where S^α_l are the singular values in strictly decreasing order, S_0 > S_1 > … > 0, Û^α_l are the left singular functions, which form an orthonormal set on the Matsubara frequencies, and V^α_l are the right singular functions, which form an orthonormal set on the real frequencies [49]. The singular functions Û^α_l and V^α_l are the so-called IR basis functions [35].
We can use the left singular functions of the kernel as a representation for the Green's function [35]:

    Ĝ^α(iω) = Σ_{l=0}^{L−1} G^α_l Û^α_l(iω) + ε_L,    (5)

where G^α_l = S^α_l Σ_i ρ_i V^α_l(ω_i) is a basis coefficient and ε_L is an error term associated with truncating the series. One can prove that the S_l drop faster than any power of l [50], and there is numerical evidence for faster-than-exponential decay [36]. One empirically finds that, for a fixed relative cutoff S_L/S_0, the basis size grows only logarithmically with the energy cutoff [36], L = O(log(βω_max)), and also that the right singular functions are bounded, i.e., there exists a V_max such that |V_l(ω)| < V_max. This implies that the truncated representation (5) converges faster than exponentially, ε_L = O(S_L), and is substantially more compact than a polynomial representation.

In order to efficiently extract the basis coefficients G_l from Ĝ(iω), we exploit the fact that the roots of the singular functions are similar in structure to the roots of orthogonal polynomials [51]. Hence, we choose a set of sampling frequencies close to the sign changes of Û_L(iω):

    W^α = {iω : iω close to a sign change of Û^α_L(iω)},    (6)

and turn Eq. (5) into an ordinary least squares fit [37]:

    G_l = argmin_{G_l} Σ_{iω ∈ W^α} |Ĝ(iω) − Σ_{l=0}^{L−1} Û^α_l(iω) G_l|².    (7)

An example of sampling frequencies is shown in Fig. 4. One empirically observes that the |W| × L fitting matrix U_nl = Û^α_l(iω^α_n) is well-conditioned provided that W was chosen as in Eq. (6) [37]. This in turn implies the fitting error is consistent with the overall truncation error ε_L. Similar rules can be found for imaginary time.

B. Two-particle response functions
We now turn to the two-particle Green's function:

    G(τ1, τ2, τ3, τ4) = ⟨T_τ A(τ1) B(τ2) C(τ3) D(τ4)⟩,    (8)

where A, …, D are now fermionic operators and, consequently, the Fourier transform Ĝ(iν1, …, iν4) depends on four fermionic Matsubara frequencies iν1 to iν4.
The Lehmann representation of Eq. (8) can be cast in the following form [33]:

    Ĝ(iν1, …, iν4) = Σ_{r=1}^{12} ∫ dω1 dω2 dω3 K^F(iν^(r), ω1) K^F(iν′^(r), ω2) K^B(iω^(r), ω3) ρ^(r)(ω1, ω2, ω3),    (9)

where iν^(r), iν′^(r) are fermionic Matsubara frequencies and iω^(r) is a bosonic Matsubara frequency, obtained from (iν1, …, iν4) through the frequency translation tensor T_r defined in Table I, and ω1, ω2, ω3 are real frequencies. This translation is necessary in order to have a spectral function of the form:

    ρ^(r)(ω1, ω2, ω3) = Σ_i ρ^(r)_i δ(ω1 − ω_{1i}) δ(ω2 − ω_{2i}) δ(ω3 − ω_{3i}),    (10)

because the two-particle Green's function cannot be made compact in any single frequency convention due to permutations induced by time ordering in Eq. (8). Therefore, we require the sum over r = 1, …, 12 different representations in Eq. (9), unlike Eq. (3), where only one representation is sufficient.
Entering Eq. (9) is a product of the one-particle kernels K^F and K^B from Eq. (3). However, K^B must be augmented in order to ensure proper decay of the expansion coefficients, cf. Eq. (11), where S̄^B_0, S̄^B_1 are arbitrary prefactors (we shift the remaining singular values by S^B_l → S̄^B_{l+2}). A δ-function at iω = 0 is not included in the unaugmented kernel (3), as it cannot be resolved by it; however, such terms are indeed present in the two-particle Green's function.
Since the one-particle kernels can be truncated, Eq. (9) implies that there also exists a truncated, compact representation for the two-particle Green's function analogous to Eq. (5):

    Ĝ(iν1, …, iν4) = Σ_{r=1}^{12} Σ_{ll′m} Û^F_l(iν^(r)) Û^F_{l′}(iν′^(r)) Û̄^B_m(iω^(r)) G_{r,ll′m} + ε_L,    (12)

where Û^F_l(iν) is defined in Eq. (4) and Û̄^B_m are the singular functions of the bosonic kernel, augmented by the additional contributions in Eq. (11). Since G_{r,ll′m} is then given by projection of Eq. (10):

    G_{r,ll′m} = S^F_l S^F_{l′} S̄^B_m Σ_i ρ^(r)_i V^F_l(ω_{1i}) V^F_{l′}(ω_{2i}) V̄^B_m(ω_{3i}),    (13)

one again has faster than exponential convergence in Eq. (12), ε_L = O(S_L). Storing G in the IR basis thus only requires 12L³ numbers, where L is O(log βω_max ε⁻¹) and ε is the desired accuracy.
Extracting G_{r,ll′m} from Ĝ requires us to construct a fitting problem. We choose the set of sampling frequencies as:

    W = ∪_{r=1}^{12} T_r^{-1}(W^F × W^F × W^B),    (14)

i.e., the outer product of the one-particle sampling frequencies W^α from Eq. (6) according to the one-particle kernels in Eq. (9), transformed from the "native" frequencies (iν, iν′, iω) of each representation to the all-fermionic convention (iν1, …, iν4). We show an example of sampling frequencies in Fig. 7(a). This choice ensures that the sampling points required to reconstruct each G_{r,ll′m} are present in W. However, since the IR basis is overcomplete, every representation projects onto the same set of frequencies, and thus the least squares problem:

    G_{r,ll′m} = argmin_G [ Σ_{(iν1,…,iν4) ∈ W} |Ĝ(iν1, …, iν4) − Σ_{r,ll′m} E^{ν1ν2ν3ν4}_{r,ll′m} G_{r,ll′m}|² + λ Σ_{r,ll′m} |Γ^{Tikh.}_{rll′m} G_{r,ll′m}|² ],    (15)

where E is the transformation tensor from the IR to the sampling frequencies from Eq. (12), is ill-conditioned for λ = 0. A way to mitigate this problem is Tikhonov regularization, where one adds a term to the cost function penalizing large fitting parameters G_{r,ll′m} (λ > 0): one can, e.g., use the a priori knowledge of the decay of basis coefficients from Eq. (13) and enforce it by choosing (Γ^{Tikh.}_{rll′m})^{-1} = S^F_l S^F_{l′} S̄^B_m (for a more detailed discussion see Sec. IV of Ref. 33).
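The effect of the Tikhonov term can be illustrated on a toy ill-conditioned fit. This sketch does not use the actual tensor E of Eq. (15); the duplicated columns merely mimic the overlap between representations, and all sizes, decay rates, and the regularization strength are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overcomplete design matrix: duplicated columns make the ordinary
# least squares problem rank deficient, mimicking the overlap of the
# different representations r in Eq. (15).
E = rng.normal(size=(50, 10))
E = np.hstack([E, E[:, :5]])              # columns 10..14 duplicate columns 0..4
g_true = np.zeros(15); g_true[0] = 1.0
y = E @ g_true

# A priori coefficient decay, standing in for S^F_l S^F_l' S^B_m
S = 0.5 ** np.arange(15)
lam = 1e-8

# Tikhonov regularization: minimize |y - E g|^2 + lam |g / S|^2,
# solved as an augmented ordinary least squares problem.
E_aug = np.vstack([E, np.sqrt(lam) * np.diag(1.0 / S)])
y_aug = np.concatenate([y, np.zeros(15)])
g, *_ = np.linalg.lstsq(E_aug, y_aug, rcond=None)

# The regularized solution reproduces the data and resolves the
# duplicate-column ambiguity in favor of the slowly decaying coefficient.
assert np.linalg.norm(E @ g - y) < 1e-3
assert abs(g[0] - 1.0) < 1e-3
```

The penalty removes the null-space ambiguity of the overcomplete fit while perturbing the well-determined directions only at the level of λ.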
As for the number of sampling frequencies, one finds L 3 ≤ |W| ≤ 12L 3 , as sampling frequencies coming from multiple representations may coincide, and one typically has |W| ≈ 8L 3 .

III. DIAGRAMMATIC EQUATIONS
We will now discuss how to extend sparse modeling in order to solve two-particle diagrammatic equations. Diagrammatic equations are an algebraic way to sum up whole classes of diagrams, usually by invoking a topological argument. They are the bread and butter of the diagrammatic technique and at the heart of renormalization methods and embedding techniques. As in Sec. II, we will start with a one-particle example (the Dyson equation) and then go to the two-particle case.

A. Self-energy and vertex basis
For simplicity, we again start with the one-particle case, where we will introduce the tools that we later use for the two-particle case.
Let Ĝ0 be the non-interacting one-particle Green's function, i.e., Eq. (2) with ⟨…⟩ replaced by the average ⟨…⟩0 over the non-interacting system. It is related to the full Green's function via:

    Ĝ(iν) = Ĝ0(iν) + Ĝ0(iν) M̂(iν) Ĝ0(iν),    (16)

where M̂ is the full one-particle vertex encoding all interactions in the system, and iν is the fermionic Matsubara frequency [52]. (We restrict ourselves to fermionic systems for simplicity.) Diagrammatically, Eq. (16) is shown in Fig. 2(a). The full vertex M̂ is related to its irreducible counterpart, the self-energy Σ̂, via the Dyson equation:

    M̂(iν) = Σ̂(iν) + Σ̂(iν) Ĝ0(iν) M̂(iν),    (17)

which is shown diagrammatically in Fig. 2(b). Neither vertex, M̂ nor Σ̂, can be modelled by the same basis as the Green's function, because they contain the Hartree-Fock term, which is a constant in frequency. To model it, we need to augment the fermionic kernel in Eq. (3) by a constant basis function, cf. Eq. (18), where S̄^F_0 is an arbitrary non-zero constant. As the remaining terms in Σ̂ and M̂ behave like a scaled Green's function, one has truncated expansions analogous to Eq. (5):

    Σ̂(iν) = Σ_{l=0}^{L̄−1} Σ_l Û̄^F_l(iν) + ε_L,  M̂(iν) = Σ_{l=0}^{L̄−1} M_l Û̄^F_l(iν) + ε_L,    (19), (20)

where Σ_l and M_l are again basis coefficients and ε_L is an error term associated with truncating the series, which is guaranteed to drop quickly.

FIG. 3. Basis functions of the augmented kernels (18) and (30). Symmetry forces these functions to be either purely real (a,c) or purely imaginary (b,d) and we always plot the respective non-zero part. The constant term Û0 is scaled by a factor 1/2.
In principle, S̄^F_l and Û̄^F_l are the singular values and left singular functions, respectively, of the augmented kernel (18). However, that kernel is not compact, which means one is unable to compute its singular value expansion numerically. Instead, one chooses Û̄^F_0(iν) = 1 and shifts the remaining terms of the unaugmented kernel by one, i.e., S̄^F_{l+1} = S^F_l and Û̄^F_{l+1} = Û^F_l. We plot the lowest-order basis functions in Fig. 3(a,b).
Fitting Σ_l from Matsubara data for the self-energy requires us to choose a set of fitting frequencies. Since the associated basis functions for l = 1, …, L̄ − 1 are identical to the underlying left singular functions, the sampling frequencies W^F for any order L ≥ L̄ − 1 allow stable and compact fitting of these terms. The Hartree-Fock term M_0, on the other hand, is given by the limit iν → ∞, and corrections to this asymptotic constant only decay as 1/iν. Fitting Σ_0 from Σ̂(iν) is thus a somewhat delicate procedure: on the one hand, we would like to have an extra sampling frequency at large iν; yet on the other hand, one often has unfavorable scaling of the uncertainty in Σ̂(iν) with increasing frequency [53], which discourages us from doing so. We empirically observe that the distribution of W^F for order L extends to higher frequencies as L is increased. Therefore, a reasonable choice is to use W^F for order L̄ as the augmented sampling set W̄^F for order L̄.
Let us demonstrate the fitting for a model self-energy consisting of the constant Hartree-Fock term and a single pole:

    Σ̂(iν) = Σ_0 + 1/(iν − ε_0),    (21)

where Σ_0 = 1 and ε_0 = 0.5 at β = 10 (ω_max = 1 and S̄^F_L/S̄^F_0 ≈ 10⁻¹⁵). The result is shown in Fig. 4. The augmented basis fits the self-energy accurately, including the Hartree-Fock term Σ_0, from low to high frequencies. As seen in Fig. 4(c), the expansion coefficients Σ_l decay faster than exponentially [54].
We are now in a position to tackle a diagrammatic equation for the full one-particle vertex M̂. We note that by introducing the prefactor (1 − Σ̂Ĝ0), Eq. (17) becomes a linear equation for M̂. We can then insert Eq. (20) to arrive at a least squares problem for the Dyson equation:

    M_l = argmin_{M_l} Σ_{iν ∈ W̄^F} |Σ̂(iν) − (1 − Σ̂(iν)Ĝ0(iν)) Σ_{l=0}^{L̄−1} Û̄^F_l(iν) M_l|².    (22)

Let us note that solving the Dyson equation through the least squares problem (22) is obviously not optimal: since Eq. (17) is diagonal in frequency, one can first solve the equation on the sampling frequencies and then fit M_l from M̂(iν) in a second step [37]. However, in the two-particle case, this ceases to be an option, since it involves convolutions over frequencies.
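The structure of the least squares Dyson equation can be sketched with a discrete stand-in basis built from an SVD of the kernel, augmented by a constant column for the Hartree-Fock term. Ĝ0, Σ̂, the grids, and the cutoff below are all toy assumptions; this is an illustration of the fitting structure, not the paper's actual implementation.

```python
import numpy as np

beta = 10.0
n = np.arange(-200, 200)
iw = 1j * (2 * n + 1) * np.pi / beta      # fermionic Matsubara grid (truncated)
w = np.linspace(-1.0, 1.0, 401)           # real-frequency grid, omega_max = 1

# Discrete stand-in for the fermionic IR basis: SVD of the kernel
K = 1.0 / (iw[:, None] - w[None, :])
U, S, _ = np.linalg.svd(K, full_matrices=False)
L = int(np.sum(S / S[0] > 1e-12))

# Augmented basis: prepend the constant function (Hartree-Fock term)
U_aug = np.hstack([np.ones((len(iw), 1)), U[:, :L]])

# Toy ingredients for the Dyson equation
G0 = 1.0 / (iw - 0.1)                     # noninteracting Green's function
Sigma = 0.2 + 0.3 / (iw - 0.5)            # model self-energy

# Least squares form of the Dyson equation: minimize over M_l the
# misfit |(1 - Sigma*G0) * sum_l U_l(i nu) M_l - Sigma(i nu)|^2
D = (1.0 - Sigma * G0)[:, None] * U_aug
Ml, *_ = np.linalg.lstsq(D, Sigma, rcond=None)
M = U_aug @ Ml

# Agrees with the frequency-diagonal solution M = Sigma/(1 - G0*Sigma)
M_exact = Sigma / (1.0 - G0 * Sigma)
assert np.max(np.abs(M - M_exact)) < 1e-6
```

The fitted coefficients reproduce the pointwise solution because the vertex is itself compactly representable in the augmented basis.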

B. Bethe-Salpeter equation
After having developed the necessary tools for the sparse modeling of one-particle vertices and the rewriting of diagrammatic equations as fitting problems for that case, we are ready to tackle the two-particle case.
The two-particle analogue of Eq. (16) relates the two-particle Green's function to the full two-particle vertex F̂, cf. Eq. (23) [55]. Diagrammatically, Eq. (23) is shown in Fig. 5(a). There are now three different notions of two-particle reducibility in F̂: with respect to severing frequencies 1,2 from 3,4 (particle-hole channel), frequencies 1,4 from 3,2 (particle-hole transverse channel), and frequencies 1,3 from 2,4 (particle-particle channel). Consequently, there is an irreducible vertex and a corresponding diagrammatic equation for each of these channels [15,21]. Without loss of generality, we will restrict ourselves to the particle-hole channel for now. The Bethe-Salpeter equation, which relates the full vertex F̂ to the irreducible vertex Γ̂, is given in Eq. (24) [cf. Fig. 5(b)]. Solving the BSE in this convention is cumbersome, as it requires a sum over two inner frequencies, one of which is fixed by conservation of energy. This can be eliminated by switching into the "natural" frequency convention for the particle-hole channel, Eq. (25), where iω is a bosonic transfer frequency and iν, iν′ are fermionic frequencies; from now on, the three-argument version of any quantity indicates the particle-hole convention [56]. Equation (24) then reads:

    F(iω, iν, iν′) = Γ(iω, iν, iν′) + (1/β) Σ_{iν″} Γ(iω, iν, iν″) G(iν″) G(iν″ + iω) F(iω, iν″, iν′).    (26)

This equation can be rewritten into the following form:

    F = Γ + A_Γ F,    (27)

where we have defined a "Bethe-Salpeter operator" A_Γ:

    [A_Γ X](iω, iν, iν′) = (1/β) Σ_{iν″} Γ(iω, iν, iν″) G(iν″) G(iν″ + iω) X(iω, iν″, iν′).    (28)

FIG. 5. Sparse modeling for the Bethe-Salpeter equation: (a) diagrammatic representation of the two-particle Green's function (23), where F is the full two-particle vertex; (b) BSE in the particle-hole channel (26), where Γ is the particle-hole irreducible two-particle vertex; (c) tensor network for the BSE in natural frequencies (27), where A_Γ is defined in Eq. (28); (d) sparse modeling and quadrature for the BSE (33), where w is the tensor of summation weights, T_ph changes from the fermionic to the particle-hole frequency convention, E is the expansion tensor (cf. Fig. 8), and F_IR are the IR basis coefficients of the full vertex. A dot (•) only gives a contribution if all connected indices have the same value.

1. Sparse modeling of the two-particle vertex
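For orientation, at a fixed transfer frequency the particle-hole BSE is linear in F and can, in a brute-force dense-box treatment (the approach whose scaling this paper improves upon), be solved as a matrix equation in (iν, iν′). The following toy sketch uses an invented Γ and a one-pole Green's function; it only illustrates the fixed-point structure F = Γ + A_Γ F.

```python
import numpy as np

rng = np.random.default_rng(2)
beta, nmax = 10.0, 32
nu = (2 * np.arange(-nmax, nmax) + 1) * np.pi / beta   # fermionic box
iom = 2j * np.pi / beta                                # one bosonic transfer frequency

G = lambda z: 1.0 / (z - 0.3)                          # toy one-pole propagator
# Diagonal bubble G(i nu'') G(i nu'' + i omega)
D = np.diag(G(1j * nu) * G(1j * nu + iom))

# Invented irreducible vertex on the frequency box (small enough that
# the geometric series converges)
Gam = rng.normal(size=(64, 64)) * 0.05

# BSE as a linear solve: (1 - Gam D / beta) F = Gam
F = np.linalg.solve(np.eye(64) - Gam @ D / beta, Gam)

# F satisfies the fixed-point form F = Gam + A_Gam F
assert np.allclose(F, Gam + Gam @ D @ F / beta)
```

This dense solve costs O(N³) per transfer frequency and its truncation error decays only as O(1/N), which is precisely what the sparse representation avoids.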
Analogous to the expansion of the two-particle Green's function (12), we expand the two-particle vertex in an overcomplete basis:

    F̂(iν1, …, iν4) = Σ_{r=1}^{12} Σ_{ll′m} Û̄^F_l(iν^(r)) Û̄^F_{l′}(iν′^(r)) Û̄^B_m(iω^(r)) F_{r,ll′m} + ε_L,    (29)

and similarly for the irreducible vertex Γ.
Like the self-energy, F and Γ have an asymptotically constant background. Therefore, as in Eq. (20), the fermionic kernel has to be augmented using Eq. (18). At the same time, the bosonic kernel K̄^B for the two-particle Green's function has to be further augmented using a bosonic analogue of Eq. (18). This defines the augmented bosonic kernel of Eq. (30) (we again shift the remaining singular values by S^B_l → S̄^B_{l+3}). We plot the corresponding one-particle basis functions in Fig. 3(c,d).

FIG. 6. Four-point vertex basis functions Û̄^F_l Û̄^F_{l′} Û̄^B_m for selected r, l, l′, m, in Matsubara frequencies and projected back into the particle-hole convention (iω, iν, iν′). We plot over a fermionic frequency box (ν, ν′) for two given bosonic frequencies: ω = 0 (left side) and ω = 10π/β (right side). Columns show different expansion orders (l, l′, m), while the rows show different representations r (the first number, indicated in bold, is the representation plotted; other numbers given denote structurally similar terms). Symmetry dictates that the functions must be purely real (Re) or imaginary (Im), and we plot only the respective part.
In addition to the background, the vertex has a rich asymptotic structure, consisting of "lines" and "planes" running horizontally, vertically and diagonally through the frequency domain and extending to infinity [11]. These structures are captured by the augmented basis functions when combined with the frequency translations in the different representations r. For F , we can prove this by applying the same arguments as for Σ (cf. Sec. III A), but for the dependence on each of the outer frequencies iν 1 , . . . , iν 4 separately. F differs from the connected two-particle propagator (23) by removal of four one-particle Green's function lines (one for each outer frequency). Similar to Σ, the dependence of F on each of the frequencies then amounts to a constant term plus a scaled one-particle Green's function, translated through the frequency translation tensor T r .
For the irreducible vertex Γ, one can show that the asymptotic part has a similar fundamental form as for F away from the divergence lines (cf. Sec. VI A). We thus conjecture that the expansion (29) remains compact also for Γ, and offer numerical evidence in Secs. VI and VII.
Let us illustrate the effect of the augmented two-particle basis by plotting the low-order basis functions in Fig. 6 in Matsubara frequencies in the particle-hole convention, cf. Eq. (25). The results for ω = 0 are given on the left side. We see that, e.g., F_{r,110} translates to the horizontal line structure for r = 4 and to the vertical line for r = 11. As expected, these structures shift away from the center for ω = 10π/β (right side). Methods based on the known high-frequency asymptotics of the vertex [10,57-59] improve upon a naive frequency-box truncation, but require additional knowledge of a set of two- and three-point correlation functions, together with a (usually uncontrolled) method of connecting the asymptotic and non-asymptotic regions.
Storing enough frequencies to reliably perform the fermionic sum quickly exceeds the available memory. Fortunately, using sparse modeling (Sec. II B) we can solve this: analogous to Eq. (22), we rewrite Eq. (27) as a least squares problem for the IR coefficients F_{r,ll′m}:

    F_{r,ll′m} = argmin Σ_{(iω,iν,iν′) ∈ W̄} |F(iω, iν, iν′) − Γ(iω, iν, iν′) − [A_Γ F](iω, iν, iν′)|²,    (31)

where F is evaluated from its IR expansion (29). Due to the augmented kernels entering Eq. (29), W̄ is the set of sampling frequencies generated from Eq. (14), but with W^F (W^B) replaced by W̄^F (W̄^B) and transformed into the particle-hole convention. Equation (31) solves the storage problem, but it still requires a sum over an infinite set of frequencies. A naive truncation of that sum to the innermost N frequencies will only converge as O(1/N). In Sec. IV, we will improve on this by developing an algorithm which replaces the full infinite sum by a weighted finite sum:

    (1/β) Σ_{iν″} (…) → Σ_{iν″ ∈ W_{ωνν′}} w_{ωνν′}(iν″) (…),    (32)

where W_{ωνν′} is the set of quadrature frequencies for the outer frequencies iω, iν, iν′ and w_{ωνν′}(iν″) are the corresponding integration weights. Substituting Eq. (32) into Eq. (31) yields the final fitting problem, Eq. (33), where we have omitted a usually necessary regularization term for brevity, cf. Eq. (15). Diagrammatically, Eq. (33) is shown in Fig. 5(d). By using this equation, we have both the truncation error and the quadrature error under control. Therefore, we can expect exponentially fast convergence of F with respect to both computation time and memory. We will discuss details of the fitting in Sec. V. Let us note that the other direction of the Bethe-Salpeter equation (26), i.e., computing Γ from F, can be done by introducing the analogous operator of Eq. (34) and switching F and Γ in Eqs. (27) to (33).

IV. SPARSE CONVOLUTION
Ultimately, we want to perform convolutions of vertices in a specific channel, cf. Eq. (27), which requires an (infinite) sum over a fermionic Matsubara frequency. As outlined in Sec. III B, truncating this sum to the innermost N frequencies will only converge as O(1/N ), which is why we are seeking to replace it with a weighted sum over a finite set of frequencies instead, cf. Eq. (32).

A. Simpler case: the Lindhard bubble
To motivate our sparse convolution scheme, let us first consider a simpler problem, which we will again use to develop the necessary tools. Let A and B be one-particle Green's functions and let C be defined as:

    C(iω) = (1/β) Σ_{iν} A(iν) B(iν + iω),    (35)

where iω is a bosonic or fermionic Matsubara frequency and, correspondingly, B and C are bosonic or fermionic Green's functions. The sum converges without a convergence factor, as A(iν)B(iν + iω) decays as O(1/ν²). Using residue calculus, it is straightforward to show that the product in Eq. (35) can be decomposed as follows:

    A(iν) B(iν + iω) = A′_ω(iν) + B′_ω(iν + iω),    (36)

where A′ and B′ are families of auxiliary Green's functions in iν and iν + iω, respectively, and both families are parameterized by iω. (For completeness, we show this relation in Appendix A.) Equation (36) again admits an overcomplete representation of the integrand in terms of two sets of coefficients, A_{ω,l} and B_{ω,l}:

    A(iν) B(iν + iω) = Σ_l A_{ω,l} Û^F_l(iν) + Σ_l B_{ω,l} Û^α_l(iν + iω) + ε_L.    (37)

Since each constituent can be represented compactly with the IR basis, there exists a compact representation for the product, and the error drops superexponentially, ε_L ∼ S_L. Using Eq. (37), we can compute the Matsubara sum:

    C(iω) = Σ_l [A_{ω,l} U^F_l(0⁻) + B_{ω,l} U^α_l(0⁻)],    (38)

where the U^α_l(0⁻) are the Fourier transforms of the l-th bosonic or fermionic basis functions Û_l, evaluated at τ = 0⁻. We proceed in a manner similar to the overcomplete representation of the two-particle function, but for each value of iω separately. We first generate a set W_ω of fermionic sampling frequencies for A(iν)B(iν + iω): expanding A in iν corresponds to the standard set W_A = W^F of fermionic sampling frequencies (6), and expanding B in iν + iω corresponds to a shifted set W_B of sampling frequencies. The full set is then just the union of both individual sets:

    W_ω = W_A ∪ W_B.    (39)

Using the frequency set (39), we can turn Eq. (37) into a least squares problem:

    (A_ω, B_ω) = argmin Σ_{iν_i ∈ W_ω} |[AB]_{ω,i} − Σ_l A_{ω,l} Û^F_l(iν_i) − Σ_l B_{ω,l} Û^α_l(iν_i + iω)|²,    (40)

where the data vector is [AB]_{ω,i} = A(iν_i)B(iν_i + iω), iν_i runs over frequencies in W_ω, the design matrix is given block-wise as [Û^α_ω]_{il} = Û^α_l(iν_i + iω), and the fitting vector is just the IR coefficients of A′ and B′, stacked vertically. The Matsubara sum (38) is then given by:

    C(iω) = Σ_{iν_i ∈ W_ω} w_ω(iν_i) A(iν_i) B(iν_i + iω),    (41)

i.e., the full sum is replaced by a weighted sum over a small subset of frequencies. The vector of integration weights w is determined by the solution of the following least squares problem:

    w_ω = argmin_w |Û^T w − u|²,    (42)

where w_{ω,i} = w_ω(iν_i), the evaluation vector is given block-wise as [u^α]_l = U^α_l(0⁻), and Û^T denotes the transpose of the design matrix in Eq. (40). If Eq. (42) is underdetermined, we take its least-norm solution.
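The logic of Eqs. (39)-(42), a quadrature made exact on a compact basis by a least-norm solve, can be sketched with a pole basis, for which the exact symmetric Matsubara sums are known analytically, (1/β) Σ_ν 1/(iν − ω) = f(ω) − 1/2 with the Fermi function f. The pole grid, the sampling-point selection, and β below are illustrative assumptions, not the IR construction itself.

```python
import numpy as np

beta = 5.0

# Sparse, symmetric set of sampling fermionic Matsubara frequencies
# (roughly log-spaced indices, positive and negative)
idx = np.unique(np.round(np.geomspace(1, 1000, 20)).astype(int))
ns = np.concatenate([-idx, idx - 1])           # pairs +/-(2n-1) pi/beta
nus = (2 * ns + 1) * np.pi / beta

# Pole basis phi_j(i nu) = 1/(i nu - w_j) with exact Matsubara sums
wj = np.linspace(-1.0, 1.0, 21)
Phi = 1.0 / (1j * nus[None, :] - wj[:, None])
u = 1.0 / (np.exp(beta * wj) + 1.0) - 0.5      # exact sums (1/beta) sum_nu phi_j

# Least-norm weights: the weighted finite sum is then exact on the
# span of the basis, in the spirit of Eq. (42)
w, *_ = np.linalg.lstsq(Phi, u.astype(complex), rcond=None)

# Test on a bubble-type sum C = (1/beta) sum_nu A(i nu) B(i nu)
ea, eb = 0.4, -0.3                             # poles lying on the wj grid
g = 1.0 / ((1j * nus - ea) * (1j * nus - eb))
val = np.real(w @ g)

f = lambda e: 1.0 / (np.exp(beta * e) + 1.0)
exact = (f(ea) - f(eb)) / (ea - eb)
assert abs(val - exact) < 1e-6
```

Roughly forty weighted frequencies reproduce the infinite Matsubara sum to high accuracy, whereas a naive truncation to the same number of innermost frequencies would only be accurate to O(1/N).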
B. Full two-particle convolution

Now we turn to the case of Eq. (27), multiplying two two-particle functions. For simplicity, we focus on the particle-hole channel. Similar relations can be inferred for the transverse channel and for the particle-particle channel.
By transforming Eq. (9) into the particle-hole convention through Eq. (25) and focusing on the dependence on iν′, one obtains a decomposition, Eq. (43), where A^(1) to A^(4) are a family of auxiliary objects. With iω and iν held fixed, A^(1) and A^(2) have the structure of a fermionic Green's function, while A^(3) and A^(4) are bosonic Green's functions. Similarly, for the dependence on the other fermionic frequency, one finds an analogous decomposition, Eq. (44), where X^(1), …, X^(6), with ω, ν, ν′ held fixed, are again Green's functions.
This means we can generate an overcomplete representation consisting of six terms, Eq. (45), and the set of sampling frequencies becomes the corresponding union, Eq. (46), where the "mixing" of fermionic and bosonic models in Eq. (45) is reflected in the presence of both fermionic and bosonic sampling frequencies, shifted by a bosonic and a fermionic shift frequency, iω_s and iν_s, respectively, to create a grid of fermionic frequencies [60]. This fixes the quadrature frequencies in Eq. (32). What remains to be determined are the weights. Analogous to Eq. (42), w_{ωνν′} is given by the solution to the least squares problem:

    w_{ωνν′} = argmin_w |Û^T w − u|²,    (47)

where again the evaluation vector is given block-wise by [u^α]_l = U^α_l(0⁻) and the design matrix is formed using blocks of [Û^α_ω]_{il} = Û^α_l(iν_i + iω). Figure 7(a) shows the sampling frequencies W̄ for a vertex function for β = 1 and ω_max = 10 with a cutoff of 10⁻⁵. We define quadrature points for the left and right objects as W_L ≡ {(iν, iν″, iω) : iν″ ∈ W_{ωνν′} ∧ (iν, iν′, iω) ∈ W̄} and W_R ≡ {(iν″, iν′, iω) : iν″ ∈ W_{ωνν′} ∧ (iν, iν′, iω) ∈ W̄}, respectively. We plot W_L and W_R in Figs. 7(b) and 7(c), respectively. The sampling frequencies and quadrature points are distributed sparsely, especially at high frequencies.
With both quadrature points and weights specified, let us discuss the computational cost and scaling. From Eq. (46), we have that for each choice of "outer" frequencies, 2L ≤ |W_{ωνν′}| ≤ 6L, with values for typical outer frequencies close to |W_{ωνν′}| ≈ 6L. The size of the design matrix in Eq. (47) is 6L × |W_{ωνν′}|, thus the weights require O(L³) time to compute for each outer frequency.
In solving the BSE (33), the quadrature has to be computed for each of the sampling frequencies in W̄ (14). Since one has |W̄| ≈ 8L³ (cf. Sec. II B), this implies that in total we have to store ≈ 6L × 8L³ ≈ 48L⁴ integration weights, i.e., O(L⁴) numbers, and computing them requires O(L⁶) time.

V. BASIS EXPANSION AND FITTING
We will now discuss the solution of the least squares problem (15). For this, it is useful to first "flatten" the tensor E into matrix form. We thus impose an ordering on the sampling frequency set (14) and on the set of basis coefficients, cf. Eqs. (48) and (49), where V is an index into the sampling point grid and R is a flat index into (r, ll′m). Correspondingly, we define Ĝ_V := Ĝ(iν_{1V}, …, iν_{4V}) and G_R := G_{r_R, l_R, l′_R, m_R}. With these definitions, we arrive at the matrix form of Eq. (15):

    G_R = argmin_G Σ_V |Ĝ_V − Σ_R E_{VR} G_R|²,    (50)

where E_{VR} is the flattened version of E^{ν1ν2ν3ν4}_{r,ll′m} (we will discuss its explicit form shortly).
After constructing the matrix E_{VR}, Eq. (50) can be fed directly to an ordinary least squares solver. However, for large L, the cost can be prohibitive: from Eq. (49), one has |R| = 12L³. Constructing E thus requires storing ≈ 100L⁶ numbers, and solving the least squares problem requires O(L⁹) flops. Even though one has L = O(log βω_max ε⁻¹), this will only be viable for small values of L. For larger L, we would like to construct EG and E†Ĝ on the fly and use an iterative least squares solver. We start with the explicit form of E by combining Eqs. (12) and (50):

    E_{VR} = Û̄^F_{l_R}(iν^{(r_R)}_V) Û̄^F_{l′_R}(iν′^{(r_R)}_V) Û̄^B_{m_R}(iω^{(r_R)}_V).    (51)

The tensor network representation of Eq. (51) is given in Fig. 8. Exploiting that internal structure, one can compute EG and E†Ĝ at a cost of O(L⁵) with negligible memory overhead. We discuss this algorithm in Appendix B.
We can now solve Eq. (50) efficiently using a conjugate-gradient-type solver based on Lanczos bidiagonalization, as these solvers only require us to compute the matrix-vector products EG and E†Ĝ for given G and Ĝ instead of creating the full E. These solvers come with a number of guarantees, in particular exponential convergence with the number of matrix-vector products, with the convergence rate depending on how E is conditioned [61]. Apart from pathological cases, they also guarantee success after a number of iterations equal to the size of the matrix, which implies a worst-case runtime scaling of O(L⁸) for the fitting procedure. In practice, we choose the LSMR solver [62] and find that for cutoffs not too low, it converges in relatively few iterations, typically around 100.
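The matrix-free strategy can be sketched with SciPy's LSMR and a factorized operator. Here a Kronecker-product toy stands in for the tensor-network structure of Fig. 8 (it is not the actual E of Eq. (51)), and all sizes are invented; the point is that the solver only ever calls matvec and rmatvec.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsmr

rng = np.random.default_rng(1)

# Factorized "design matrix" E = kron-structured from two small factors:
# applying it costs far less than forming the full dense matrix,
# mimicking the tensor network evaluation of EG and E^dagger G
A = rng.normal(size=(12, 8))
B = rng.normal(size=(12, 8))

def matvec(g):
    G = g.reshape(8, 8)
    return (B @ G @ A.T).ravel()        # E g without forming the big matrix

def rmatvec(y):
    Y = y.reshape(12, 12)
    return (B.T @ Y @ A).ravel()        # E^dagger y (adjoint of matvec)

E = LinearOperator((144, 64), matvec=matvec, rmatvec=rmatvec)

g_true = rng.normal(size=64)
y = E.matvec(g_true)

# Iterative least squares solve using only matrix-vector products
sol = lsmr(E, y, atol=1e-12, btol=1e-12, maxiter=500)[0]
assert np.linalg.norm(E.matvec(sol) - y) < 1e-8 * np.linalg.norm(y)
```

Because the system is consistent and the operator is well conditioned, LSMR converges to the exact coefficients in far fewer iterations than the problem dimension.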
Although L scales only logarithmically with βω_max, the power of L in the scaling may become problematic in calculations with large L, e.g., at low T with a small cutoff ε_L. We may be able to improve on this scaling using low-rank approximations of the IR basis vectors (tensor network representations) [34].
Let us mention one complication in solving Eq. (50): examining Eq. (13), we see that G_{r,ll′m} must decay as S_{ll′m} = S^F_l S^F_{l′} S̄^B_m. This implies that for a given cutoff ε_L ∼ S_L/S_0, we are including terms whose coefficients are dampened below the level of ε_L. We illustrate this in Fig. 9, where we plot the isosurfaces of S_{ll′m} for different error levels. Basis coefficients outside of the isosurface cannot be reliably fitted from an empirical Ĝ with errors at the same level, and including them may thus lead to overfitting.
One can, however, remedy this by restricting R to those coefficients which are not dampened below the tolerance. Since the one-particle singular values S^α_l decay faster than exp(−cl) but slower than exp(−cl^2) for all real constants c, we have 2L^3 < |R| < 6L^3, and typically |R| ≈ 4L^3. In addition to mitigating overfitting, we have thus also reduced the number of coefficients needed for modeling G by a factor of three. Since in practice |W| > 6L^3, we have also turned Eq. (50) from a formally underdetermined into an overdetermined system.
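The restriction of R can be sketched as follows; the singular values here are a synthetic stand-in with purely exponential decay (the real IR values decay slightly faster), and ε_L ∼ S_L/S_0 is the cutoff:

```python
import numpy as np

# Synthetic one-particle singular values (illustrative decay rate 0.5).
L = 30
S_F = np.exp(-0.5 * np.arange(L))   # fermionic singular values
S_B = np.exp(-0.5 * np.arange(L))   # bosonic singular values
eps_L = S_F[L - 1] / S_F[0]         # cutoff eps_L ~ S_L / S_0

# Keep only those (l, l', m) whose combined damping S_l S_l' S_m
# stays above the tolerance relative to the largest entry.
S = S_F[:, None, None] * S_F[None, :, None] * S_B[None, None, :]
R = np.argwhere(S / S[0, 0, 0] >= eps_L)

# For exponential decay this keeps roughly the simplex l + l' + m <= L,
# a small fraction of the full L^3 box.
print(len(R), L**3)
```

This illustrates the counting argument: the retained simplex holds far fewer coefficients than the full L^3 block, which is the source of the factor-of-three reduction quoted above.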

VI. NUMERICAL BENCHMARKS
We now benchmark the method on physical examples and provide numerical evidence for the exponential convergence. One of the simplest test cases is the Anderson impurity model. Its Hamiltonian reads: where c_a annihilates an electron on the impurity spinorbital a, a = 1, ..., N_orb, f_k annihilates an electron on the bath spin-orbital k, k = 1, ..., N_bath, E_ab parametrizes the impurity levels, U_abcd is the (antisymmetrized) two-body interaction strength, V_ka are the hybridization strengths, and E_k are the bath level energies. For N_orb + N_bath small enough, one can compute the full vertex F (23) to arbitrary precision using exact diagonalization. However, even with the full vertex known exactly, only one or two digits of accuracy in the irreducible vertices Γ are achievable with existing methods before the computational resources are exhausted.
For a convergence analysis across several orders of magnitude, we thus resort to limiting cases of the Anderson impurity model for which analytical results are available: the atomic limit (Sec. VI A) and the weak-coupling limit (Sec. VI B).

A. Hubbard atom
We first consider the Hubbard atom, which is the zero-hybridization limit, V → 0, of the half-filled single impurity Anderson model (53). Its Hamiltonian reads: where c_σ annihilates an electron of spin σ and U is the strength of the electron-electron interaction. The spectral function consists of two δ peaks at ω = ±U/2. Despite the simplicity of Eq. (54), the irreducible and reducible vertices of the Hubbard atom have highly nontrivial structures in the Matsubara frequency domain: there are sharp, "δ-like" planes running horizontally, vertically, and diagonally through the frequency box, structures which do not decay asymptotically. Moreover, the proximity of a family of poles on the imaginary frequency axis [63,64] as well as the channel coupling of the dominant spin susceptibility and the exponentially suppressed charge susceptibility [65] cause the irreducible vertex to vary across several orders of magnitude. As a result, solving the BSE for the atomic limit presents a formidable challenge to solvers which truncate the Matsubara frequency domain to a finite frequency box.
Fortunately, analytical results for F and Γ of the Hubbard atom were derived in Ref. 66. We are thus able to perform an absolute convergence analysis of our sparse modeling approach to solving the BSE: we construct A_Γ in Eq. (28) using the analytical expressions for Γ, use Eq. (33) to solve the BSE, and finally compare the resulting F to the analytical expression F_ex. In the following we (arbitrarily) choose β = 1.55, U = 2.3 [67] (for results at other values of β, see Appendix C). We use the IR basis for βω_max = 10. Figure 10 shows the fitting error in F for different choices of the IR basis truncation L, which corresponds to different cutoffs for the singular values. The fitting was performed using the LSMR iterative solver (cf. Sec. V) and the system was regularized by imposing an accuracy goal of ‖F − F̂‖_2 = ε_L‖F̂‖_2 (thin black line). The black plusses indicate the least squares deviation ‖F − F_ex‖_2, while red crosses indicate the maximum deviation ‖F − F_ex‖_∞. Both values are normalized by ‖F‖_∞, the largest value of F. We see that the "training" error indicated by dotted lines, i.e., the error on the sampling frequencies W, closely tracks the desired singular value cutoff. This shows that the IR representation (12) is able to fully capture features of the vertex across multiple orders of magnitude without any underfitting.
To check our fitting, we construct a set of validation frequencies (55): i.e., a dense frequency box of 30 fermionic and 29 bosonic frequencies centered around the origin, with the sampling frequencies removed. Fig. 10 shows that the validation error (solid lines) follows the training error (dotted lines) closely, both for the maximum and the average deviation, which implies there is no significant overfitting and the basis has predictive power at the accuracy level specified by the training (for results obtained for data with statistical noise, see Appendix D). Let us finally direct our attention to the error scaling with the truncation L of the IR basis, shown as the top axis. We see that by doubling L, we gain more than four orders of magnitude in precision.
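The construction of such a validation set can be sketched in a few lines; the sampling set below is a tiny hypothetical stand-in for the true sparse set W:

```python
# Dense box of 30 fermionic and 29 bosonic Matsubara indices around
# the origin (odd integers index fermionic, even integers bosonic
# frequencies nu_n = (2n+1) pi/beta and omega_m = 2m pi/beta).
fermionic = list(range(-29, 30, 2))     # 30 odd indices: -29 ... 29
bosonic = list(range(-28, 29, 2))       # 29 even indices: -28 ... 28
box = {(nu, nup, w) for nu in fermionic
       for nup in fermionic for w in bosonic}

# Placeholder sampling points standing in for the sparse set W.
sampling_W = {(1, 1, 0), (-1, 3, 2), (29, -29, 28)}
validation = box - sampling_W

print(len(box), len(validation))        # 26100 points minus the sampled ones
```

The validation error is then simply the deviation between the fitted model and the reference, evaluated on `validation` instead of on the sampling points used for training.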
Next, we compare the error scaling of sparse modeling with the conventional (dense) approach: in the latter, one constructs the operator A_Γ (28) on a box of N × N fermionic frequencies centered around the origin and then solves Eq. (27) by matrix inversion. Fig. 1, shown in the Introduction, compares the validation error of sparse modeling with cutoff L and of the conventional approach with box size N. Let us note that this is not a fair comparison in terms of computational time; rather, the point is the scaling of the error: we see that the error of the dense approach improves only as 1/N, which, together with the storage of N^2 frequencies it requires, makes it difficult to gain precision once N becomes sufficiently large. Sparse modeling, on the other hand, converges quickly with the cutoff.

B. Multi-orbital weak coupling
We now consider the opposite limit of the Anderson impurity model (53): the limit of weak coupling. There, it is reasonable to approximate the irreducible vertex by the bare vertex U: where we now consider more than one orbital, and thus Γ acquires spinorbital indices a, b, c, d. Similarly, the one-particle Green's function can be approximated by its noninteracting counterpart: where E, E′, and V are defined in Eq. (53) and we have introduced the hybridization function ∆(iν).
The Bethe-Salpeter equation (27) in this multi-orbital case then takes the form: By iterating Eq. (58) with F_0 = Γ = U and using residue calculus (cf. Appendix A), one can show that F is given as the solution of the following algebraic equation (59), where we have introduced Eq. (60); f(x) is the Fermi function, and γ_i and g_i are the eigenvalues and eigenvectors, respectively, of the one-body matrix formed blockwise by (E, V, V†, E′). (For γ_i = γ_j, the corresponding term in Eq. (60) has to be understood in the limit γ_i → γ_j.) We note that in this approximation F has no dependence on the fermionic frequencies. By combining each pair of spinorbital indices into a single index, Eq. (59) can be transformed into a system of linear equations and solved exactly. It thus provides an ideal benchmark for solving Eq. (58) with our sparse modeling approach. Fig. 11 shows the fitting error in F for different choices of the IR basis truncation L, which corresponds to different cutoffs for the singular values. We considered three impurity spinorbitals (N_orb = 3) and four bath sites (N_bath = 4): U_abba = 0.3 for a ≠ b and 0 otherwise, {γ_i} = {−0.25, −0.1, 0.1, 0.25}, g_ia = cos(i + 3a/2 + 1/10) for a = 1, 2, 3 and i = 1, 2, 3, 4. We used β = 1.55. Similarly to Fig. 10, Fig. 11 shows the least squares deviation ‖F − F_ex‖_2 and the maximum deviation ‖F − F_ex‖_∞ for the same validation frequencies. One can see that there is neither significant overfitting nor underfitting. This shows that the present method works for multiorbital systems.

[Fig. 12 caption: comparison between the maximal relative deviation ‖F − F_ex‖_∞/‖F‖_∞ of sparse modeling of the Bethe-Salpeter equation and the dense (box) calculation, for the impurity model with constant hybridization function ∆ = π/5 at U/∆ = 1.59 and two values of the inverse temperature, β = 1 (βω_max = 10) and β = 10 (βω_max = 100).]
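The step of combining pairs of spinorbital indices into a single index and solving the resulting linear system can be sketched as follows. The bubble matrix X below is a random, purely illustrative stand-in for the actual object entering Eq. (59):

```python
import numpy as np

# With a frequency-independent Gamma = U, the BSE becomes algebraic in
# the combined pair index (ab).  Flattening F_{ab,cd} -> F[(ab),(cd)]
# turns  F = U + U X F  into the linear system  (1 - U X) F = U.
n_orb = 3
n = n_orb**2                          # one combined index per orbital pair
rng = np.random.default_rng(1)
U = 0.3 * rng.standard_normal((n, n))   # illustrative bare vertex
X = 0.1 * rng.standard_normal((n, n))   # illustrative bubble

F = np.linalg.solve(np.eye(n) - U @ X, U)

print(np.allclose(F, U + U @ X @ F))    # the BSE fixed point is satisfied
```

Because the system is solved exactly in the combined index, the result is limited only by linear-algebra round-off, which is what makes this limit a clean benchmark for the sparse solver.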

VII. RESULTS FOR THE ANDERSON IMPURITY MODEL
In this section, we apply the method to solve the BSE for a fully fledged Anderson impurity model (53), where analytical expressions for the vertices are not known. As already mentioned in Sec. VI, although this model can be solved exactly, it is not possible to construct benchmark two-particle vertex functions of arbitrary accuracy. The vertices can, however, be obtained numerically to a precision of several digits from the parquet equations method [21,68].
In the following we will also use the parquet approximation (PA). The PA is not exact but gives excellent results in the weak-coupling regime [69,70]. It has the advantage that the two-particle vertices carry no statistical errors. In contrast to the weak-coupling model used for benchmarking in Sec. VI, both the irreducible vertex Γ and the full vertex F depend on two fermionic and one bosonic frequency and have a nontrivial structure arising from channel mixing in the parquet equations. The Anderson impurity model is characterized by: (i) the impurity-bath hybridization function ∆_ab(ν) [cf. Eq. (57)], which we choose to be spinorbital and energy independent with ∆ = π/5 [71]; we restrict ourselves to two spinorbitals (spins) on the impurity and two spins in the bath; (ii) the interaction U on the impurity between electrons with different spins, here U/∆ = 1.59, corresponding to weak coupling; (iii) the impurity filling n (here n = 1, i.e., half filling). For these parameters the estimated Kondo temperature is T_K ≈ 0.36, but due to the small value of U/∆ we are far from the vertex divergences present in this model [72,73]. We consider two temperatures, T = 1 and T = 0.1.
In Fig. 12 we show a comparison between the maximal relative deviation ‖F − F_ex‖_∞/‖F‖_∞ of dense (with linear frequency-box size N) and sparse (with IR basis size L) evaluations of the BSE for two different inverse temperatures (red: β = 1; blue: β = 10). The input irreducible vertex Γ and the benchmark vertex F_ex were obtained from a PA solution on a frequency box of linear size N = 1024 [74]. The precision of this benchmark calculation, as estimated from the box-size convergence, is 10^-5 for β = 1 and 10^-3 for β = 10, which limits our comparison to only a couple of orders of magnitude for the lower temperature. It is nevertheless already visible that in this case, too, the error drops quickly with L compared to the O(1/N) scaling of the dense calculation. We show here only the larger, maximal relative deviation on the validation set (55). The average relative deviation ‖F − F_ex‖_2 is, similarly to the atomic limit, smaller and has the same scaling behavior as the maximal one.
Comparing the results for the two different inverse temperatures, we see that reaching the same precision for an order of magnitude lower temperature requires only twice the IR basis size L, whereas N needs to be at least an order of magnitude larger.

VIII. CONCLUSIONS AND OUTLOOK
We proposed an efficient method for solving the Bethe-Salpeter equation based on sparse modeling and the intermediate representation (IR). Our algorithm is based on a sparse convolution method, which allows us to perform the summation over the frequencies of the internal propagators needed in solving the Bethe-Salpeter equation. All intermediate objects, such as vertices, are stored in compressed form. We numerically demonstrated the exponential convergence of the algorithm with respect to the basis size for the Hubbard atom, the weak-coupling limit of a multi-orbital impurity, and a realistic impurity problem. In the present study, we focused on the particle-hole channel; however, the proposed method can be straightforwardly applied to the particle-particle channel as well.
The natural connection of the IR basis to the analytic continuation kernel charts a course to obtaining two-particle response functions on the real-frequency axis from Matsubara data, though we expect that challenges of regularization and bias similar to those in the one-particle case will have to be overcome first. Ultimately, this may allow the interpretation of experimental spectroscopy, optical conductivity, and neutron scattering data for models and parameter regimes where direct calculation in real frequencies is infeasible.
Diagrammatic calculations based on more numerically involved equations, such as the parquet equations [19-21], as well as calculations involving higher-order vertices, may also become possible by combining the IR basis for frequencies with the form-factor basis for momenta [75,76].
It was recently shown that irreducible vertices diverge along specific lines in parameter space [63,64]. This causes numerical instabilities in solving diagrammatic equations near the divergence lines. Such divergences are characterized by emergent poles on the imaginary frequency axis, which the original IR basis does not take into account. Further augmentation of the IR basis may provide a controllable way to handle the effects of vertex divergences numerically.
The unit-tested implementation, with which the data in this manuscript has been generated, is available from the authors upon request. We expect to release it as an open-source package in the near future.
For completeness, this appendix derives Eq. (36) from Eq. (35). We begin by restating Eq. (35): let A and B be one-particle Green's functions and let C be defined as in Eq. (A1). We expand A and B, respectively, into a set of poles a_i and b_j with expansion coefficients A_i and B_j. The sum (A2) can then be performed explicitly via residue calculus, yielding the Lindhard bubble (A3), where f(x) is the Fermi function. We note that for a bosonic Matsubara frequency, f(x + iω) = f(x), while for a fermionic Matsubara frequency, f(x + iν) = −n_B(x), where n_B(x) is the Bose function. By expanding f(x) in its Matsubara sum, we find Eq. (A4).

Let us briefly comment on the inclusion of Eq. (A3): this may seem like a detour, as a partial fraction decomposition of Eq. (A2) yields Eq. (A4) for each iν. However, when splitting the terms into two sums, there is an ambiguity in the convergence factor exp(iν0^±). This ambiguity must be spurious, as the series (A1) is convergent, yet a convergence factor of exp(iν0^+) gives a spurious overall sign in Eq. (A3). The proper way to split up the sum is via residue calculus, which fixes the convergence factor to exp(iν0^−).

[Fig. 13: pseudocode for the function apply, which computes G_n through Eq. (B1) from the L × L × L complex coefficient tensor ρ_{ll'm}; common input is the tuple I = ((ν_1, ν'_1, ω_1), ..., (ν_N, ν'_N, ω_N)) of sampling frequencies in strict lexicographical order, with I_0 := (−∞, 0, 0), and the auxiliary data are A, an L × L tensor, and B, an L vector, which cache partial contractions between consecutive points. The adjoint version proceeds as apply, but with the matrices replaced by their adjoints and the steps done in reverse order.]
Finally, comparing coefficients in Eqs. (A1) and (A4) yields Eq. (36), which we restate here for convenience: where A and B are now auxiliary Green's functions.
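The residue-calculus building block of the Lindhard bubble can be spot-checked numerically against a brute-force truncated Matsubara sum; β and the pole positions a, b below are arbitrary test values:

```python
import numpy as np

# Check that, for poles a, b and a bosonic frequency i*omega_m,
#   (1/beta) sum_nu 1/(i nu - a) * 1/(i nu + i omega - b)
#     = (f(a) - f(b)) / (i omega + a - b),
# with f the Fermi function, by summing a large but finite number of
# fermionic Matsubara frequencies nu_n = (2n+1) pi / beta.
beta, a, b, m = 2.0, 0.3, -0.7, 2
w = 2j * np.pi * m / beta
f = lambda x: 1.0 / (np.exp(beta * x) + 1.0)

n = np.arange(-200_000, 200_000)
nu = 1j * (2 * n + 1) * np.pi / beta
lhs = np.sum(1.0 / ((nu - a) * (nu + w - b))) / beta

rhs = (f(a) - f(b)) / (w + a - b)
print(abs(lhs - rhs))   # small truncation error from the finite sum
```

The remaining discrepancy is the 1/N tail of the truncated sum; the closed form on the right is exact, which is precisely why the pole expansion makes the frequency summations in the BSE tractable.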

Appendix B: Fast on-the-fly expansion
In order to solve the fitting problem (50) using a sparse least squares solver, we have to efficiently apply E, defined in Eq. (51), to an arbitrary IR basis vector, as well as E† to a sampling-frequency vector.
The core part of applying E is the construction of the following intermediate object in each channel: i.e., the application of the transformation matrices Û followed by the projection onto those frequencies which, after translation by T_r, end up in the sampling frequency set. We note that this structure (cf. Fig. 8) is in principle well suited for on-the-fly application, as the Û tensors can be applied separately one after the other, after which we simply select elements. The problem is that the internal frequencies iν, iν′, iω contain not only the L one-particle sampling frequencies in W_α, but also all shifts of these frequencies due to T_r. In total, one has about 4L^2 unique frequencies for iν, and similarly for iν′ and iω. Evaluating Eq. (B1) from right to left, we would construct an intermediate tensor of size O(L^6) before selecting O(L^3) elements from it. This puts the total cost at O(L^9) time.
We can improve upon this by discarding the block structure and simply computing G_n for each n separately. This involves contracting ρ_{ll'm} along each axis with the three vectors U^F_l(iν_n), U^F_{l'}(iν′_n), and U^B_m(iω_n), at a cost of O(L^3), O(L^2), and O(L), respectively. Since this has to be done for each sampling frequency, the total cost is O(L^6).
There is still room for improvement: if we order the sampling frequencies lexicographically, we can reuse partial contraction results from one sampling point to the next. Since one observes only O(L^2) unique iν_n, this brings the total cost down to O(L^5), while incurring a memory overhead of O(L^2) for storing the partial results. We list this as function apply in the algorithm in Fig. 13.
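The caching idea can be sketched as follows; the U matrices and the frequency indices below are random stand-ins for the actual IR transformation tensors and sampling frequencies:

```python
import numpy as np

# Cached on-the-fly "apply": for lexicographically sorted sampling
# points,  G_n = sum_{k l m} U^F_k(nu_n) U^F_l(nu'_n) U^B_m(w_n) rho_{klm},
# reusing the expensive nu_n contraction whenever consecutive points
# share the same nu_n.
rng = np.random.default_rng(2)
L, N, nfreq = 8, 60, 32
rho = rng.standard_normal((L, L, L))
UF = rng.standard_normal((nfreq, L))    # rows: frequencies, cols: basis index
UB = rng.standard_normal((nfreq, L))
points = sorted(map(tuple, rng.integers(0, nfreq, size=(N, 3))))

G = np.empty(N)
A, last = None, None
for n, (i_nu, i_nup, i_w) in enumerate(points):
    if i_nu != last:                    # O(L^3) step, once per unique nu_n
        A = np.einsum('k,klm->lm', UF[i_nu], rho)
        last = i_nu
    G[n] = UF[i_nup] @ A @ UB[i_w]      # remaining O(L^2) + O(L) steps

# Reference: direct O(L^3) contraction for every sampling point.
G_ref = np.array([np.einsum('k,l,m,klm->', UF[i], UF[j], UB[k], rho)
                  for i, j, k in points])
print(np.allclose(G, G_ref))            # True
```

Sorting is what makes the cache effective: repeated iν_n values become adjacent, so the O(L^3) contraction is amortized over all points sharing that frequency.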
The core of the reverse direction, i.e., the application of E† to a sampling-frequency vector, is the construction of the following object in each channel: where U* denotes the complex conjugate. A similar idea applies to Eq. (B2) as to Eq. (B1): we perform the outer product for a sequence of vectors and cache the intermediate results from frequency to frequency. This again yields an O(L^5) cost and an O(L^2) auxiliary memory requirement.
Appendix C: Divergences of the irreducible vertex

In Refs. 63 and 64 it has been shown that the irreducible vertex Γ diverges for certain values of temperature and interaction. These divergences are also present in the Hubbard atom for a series of T/U values (the ratio T/U is the only energy scale in the atomic limit). The results presented in Sec. VI A were obtained for the irreducible vertex in the density channel at T/U = 0.28, which is very close to, but slightly above, the first divergence point at T/U = √3/(2π) ≈ 0.276. In Fig. 14 we show the scaling of the relative average error ‖F − F_ex‖_2/‖F‖_∞ on the validation set of Matsubara frequencies (cf. Fig. 10) for several values of the inverse temperature β = 1/T (keeping U = 2.3 constant) that lie in the vicinity of vertex divergences [64]: slightly above and slightly below the first, second, and third divergence points. We observe that the proximity to a divergence does not change the exponential convergence of the sparse solver. It does, however, influence the overall magnitude of the error.

Appendix D: Stability against noise
In practical calculations, the propagators and vertex functions may sometimes only be known from a stochastic method: e.g., F(iω; iν, iν′) can be obtained from continuous-time quantum Monte Carlo [78], provided the sign problem is not too severe, while Γ(iω; iν, iν′) is available from bold diagrammatic Monte Carlo [79], provided that the diagrammatic series is convergent and converges to a physical solution. Vertices obtained in this way contain significant statistical noise.
Before we discuss the two-particle case, let us briefly review the effect of noise on fitting one-particle propagators and vertices. There, the fitting problem translates to an ordinary least squares problem, min_G ‖AG − Ĝ‖_2.
If we now add white noise to Ĝ, Ĝ → Ĝ + δ_G, the relative error on G in the fitting problem is bounded by Eq. (D1), where S_l are the singular values and κ[A] is the condition number of the fitting matrix A. This means that as we increase the basis size L, we expect the error to drop like the singular values until it reaches κ times the noise level, at which point it flattens out; κ is thus a measure of "noise amplification". Since κ ∼ 1 for Σ and G, we do not observe any noise amplification there [37]. For the two-particle vertices and propagators the picture is similar, with the complications that (a) the fitting matrix in Eq. (33) is generated from a quadrature rule, which involves another fitting problem (47), and (b) overcompleteness causes the condition number κ to deteriorate. This makes it considerably more complicated to prove the stability of our method against noise in all possible cases.
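The one-particle bound can be illustrated with a small synthetic least squares fit; the fitting matrix below, with exponentially decaying singular values, is a random stand-in for the actual one-particle A:

```python
import numpy as np

# For an ordinary least squares fit min_G ||A G - Ghat||_2, white noise
# of relative size delta on Ghat perturbs G by at most kappa[A] * delta
# in relative terms (A has full column rank here).
rng = np.random.default_rng(3)
n_smpl, n_basis = 400, 30
A = rng.standard_normal((n_smpl, n_basis)) * np.exp(-0.3 * np.arange(n_basis))

G_true = rng.standard_normal(n_basis)
ghat = A @ G_true

delta = 1e-6                            # relative noise level
noise = rng.standard_normal(n_smpl)
noise *= delta * np.linalg.norm(ghat) / np.linalg.norm(noise)

G_fit = np.linalg.lstsq(A, ghat + noise, rcond=None)[0]
rel_err = np.linalg.norm(G_fit - G_true) / np.linalg.norm(G_true)
kappa = np.linalg.cond(A)

print(rel_err <= kappa * delta)         # the bound holds
```

In practice the error sits well below the worst-case bound, since random noise rarely aligns with the smallest singular direction; it is the flattening at the (amplified) noise level that matters for the two-particle case discussed next.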
In order to investigate the stability of the present method against noise in a challenging benchmark, we construct noisy input by adding Gaussian noise to the exact irreducible vertex of the Hubbard atom: Γ(iω; iν, iν′) → Γ(iω; iν, iν′) + δ · r(iω; iν, iν′) ‖Γ‖_∞, where r are independent identically distributed Gaussian random variables of mean zero and unit variance, and δ (> 0) denotes the noise level. Figure 15 shows the computed results for the sparse modeling of the BSE for the Hubbard atom at U = 2.3 and β = 1.8 for different noise levels. The rest of the parameters are the same as those in Fig. 10. One can see that the relative error of the BSE on the Matsubara axis, ‖F − F_ex‖_2/‖F‖_∞, vanishes exponentially until it hits the (amplified) noise level in the input Γ(iω; iν, iν′), following the form of Eq. (D1). While the noise is thus somewhat amplified, the method appears to be fundamentally numerically stable.
Since F comes with approximately white noise when generated from, e.g., continuous-time quantum Monte Carlo in the interaction expansion, auxiliary-field expansion, and also in the hybridization expansion when using symmetric improved estimators [53], we expect the picture to remain qualitatively similar there.
While a thorough performance analysis of solving the BSE for noisy input data is beyond the scope of this paper, the numerical results presented suggest that the exponential scaling of the error is robust against noise.

[Fig. 15 caption: Gaussian noise in the input data Γ, with noise level δ = 10^-4, 10^-5, and 10^-6 for panels (a), (b), and (c), respectively; the remaining calculation conditions are the same as in Fig. 10; the horizontal dotted lines denote the respective noise levels.]