Sublinear scaling in non-Markovian open quantum systems simulations

While several numerical techniques are available for predicting the dynamics of non-Markovian open quantum systems, most struggle with simulations for very long memory and propagation times, e.g., due to superlinear scaling with the number of time steps $n$. Here, we introduce a numerically exact algorithm to calculate process tensors -- compact representations of environmental influences -- which provides a scaling advantage over previous algorithms by leveraging self-similarity of the tensor networks that represent Gaussian environments. Based on a divide-and-conquer strategy, our approach requires only $\mathcal{O}(n\log n)$ singular value decompositions for environments with infinite memory. Where the memory can be truncated after $n_c$ time steps, a scaling $\mathcal{O}(n_c\log n_c)$ is found, which is independent of $n$. This improved scaling is enabled by identifying process tensors with repeatable blocks. To demonstrate the power and utility of our approach we provide three examples. (1) We calculate the fluorescence spectra of a quantum dot under both strong driving and strong dot-phonon couplings, a task requiring simulations over millions of time steps, which we are able to perform in minutes. (2) We efficiently find process tensors describing superradiance of multiple emitters. (3) We explore the limits of our algorithm by considering coherence decay with a very strongly coupled environment. The algorithm we present here not only significantly extends the scope of numerically exact techniques to open quantum systems with long memory times, but also has fundamental implications for simulation complexity.


I. INTRODUCTION
A common challenge in quantum technology is the ubiquity of dephasing and dissipation caused by interactions between quantum systems and their surrounding environment [1].Understanding environmental influences is, thus, crucial for mitigating them [2][3][4][5][6], or even using them to enhance the functionality of quantum devices, as in phonon-assisted state preparation [7,8] or environment-assisted quantum transport [9][10][11].A standard approach for modeling the dynamics of open quantum systems is the Lindblad master equation [1,12].This can be derived using perturbation theory and the Born-Markov approximation, which assumes that the environment is static and memoryless and couples only weakly to the system.However, this is often an oversimplification and more sophisticated methods are required to accurately model the dynamics in many cases, where the structure of the environment matters or coupling is strong.Examples include: solid-state quantum dots (QDs) coupled to local phonon environments [4,[13][14][15][16], charge or excitation transport in biomolecules [10,17,18], ultrafast spin dynamics [19,20], spontaneous emission in photonic structures [21,22], and superconducting qubits in quantum computers [2].
Beyond the restrictive weak-coupling Born-Markov approximation, the regime of non-Markovian dynamics [23] arises.Here a quantum system triggers a dynamical evolution of the environment, and the environment's response feeds back to the system at a later point in time, constituting a time-non-local memory.Several strategies have been discussed for modeling non-Markovian dynamics [23].One general strategy is to solve the Schrödinger equation for the total closed system composed of the system of interest and its environment.Because environments typically consist of a large number-often a continuum-of degrees of freedom, this becomes numerically intractable unless the model is somehow reduced to a finite number of effective degrees of freedom.How to choose the relevant degrees of freedom differs from method to method: Mean-field, cumulant, or cluster expansions [24,25] are based on the assumption that higher-order correlations are negligible, and thus one may consider only degrees of freedom within the subspace of product states, or states with low-order correlations.Other methods based on techniques from many-body quantum theory employ efficient representations of the total system such as multilayer multiconfiguration time-dependent Hartree wave functions [26] or matrix product states [18,27,28].In the reaction coordinate mapping [29], parts of the environment are explicitly taken into account by extending the system of interest.A similar concept is to replace the actual environment by a small set of auxiliary oscillators [30].
A second general strategy for simulating open quantum systems is to keep the description confined to the degrees of freedom of the system, but to account for environmental effects via equations of motion that are non-local in time.For example, the Nakajima-Zwanzig formalism [1,31] explicitly includes memory effects of the form of an integral over past times.The Feynman-Vernon path in-tegral formalism [32] encapsulates environmental effects in an influence functional, which acts as a trajectorydependent weight in a sum over all possible system trajectories.This formalism is the starting point for several practical methods for open quantum systems simulation.For Gaussian environments with exponentially decaying bath correlations, repeated time-differentiation of the Feynman-Vernon path sum gives rise to a set of Hierarchical Equations of Motion (HEOM) [33,34].Alternatively, the path sum can be expressed as a stochastic average with bath correlations encoded in the noise statistics [35], from which one can derive non-Markovian stochastic Schrödinger equations [36][37][38].The combination of both ideas leads to equations for a Hierarchy of Pure States (HOPS) [39].These concepts can be further generalized, e.g., to stochastic master equations for non-Gaussian environments [40] and to open quantum systems simulations where the environment is continuously measured [41].
Here, we focus on the question of how to predict non-Markovian dynamics for both long propagation times and long memory times.While some specific features of long time dynamics-such as the nature of the steady state-may sometimes be more accessible straightforwardly, the problem of modeling non-Markovian dynamics over long times is important but challenging.For example, one may want to find the general time evolution, or multi-time correlations of a system, but in all the methods described above, it is generally hard to track such non-Markovian dynamics over long times.Even though strategies of explicitly representing environment degrees of freedom at first glance seem to scale linearly with the total number of time steps n, i.e.O(n), they tend to become inaccurate with increasing propagation time.This is due to the increase with time of the number of degrees of freedom of the environment that can be excited and thus entangled with the system.For example, consider the discretization of a continuum of phonon modes by energy intervals of width h∆ω.Propagating over a time t e , energy-time uncertainty implies that a discretization of ∆ω < ∼ 1/t e is required to preclude resolving a difference between the continuum and the discretized model.The need for a finer discretization results in a scaling of O(n 2 ) unless a more sophisticated, e.g., adaptive, discretization is used, or the resolution of individual sampling points is blurred by some form of line broadening [27].In approaches where the environment is mapped to a chain of coupled oscillators [18,27], this statement is related to the observation that the Lieb-Robinson theorem restricts environment excitation to a light cone [42], which increases in size with increasing propagation time t e .Similarly, stochastic sampling requires more trajectories to reach the same accuracy when the propagation time t e is increased.Hence, most numerical methods for open quantum systems show superlinear scaling with the number of time steps n.
For direct applications of the Feynman-Vernon influence functional approach, one finds exponential scaling with n.However, a reformulation into the iterative scheme QUAPI [43,44] yields truly linear scaling when the memory time of the environment is finite and so the memory can be truncated after n c time steps.Such an approach however retains exponential scaling with the memory time n c .It could be expected that O(n) is the natural lower bound for the scaling, in particular for general open quantum systems with time dependent driving, because, at the very least, O(n) observables have to be obtained in order to extract the full time evolution.In practice, however, it may happen that the numerically most demanding parts of the simulation can be precalculated in sublinear time, leaving a linear-in-time scaling with very small prefactor for the evaluation of the full time evolution which is irrelevant for most practical applications.An example is the small matrix decomposition of the path integral (SMatPI) approach [45], where, for a fixed time-independent Hamiltonian, an effective propagator in Liouville space including the effects of the environment is obtained, which can be used to propagate a system over time at a numerical cost comparable to a Lindblad master equation.Yet, for applications involving time-dependent Hamiltonians, e.g., to simulate the response of open quantum systems to strong driving with shaped laser pulses [46,47], the effective propagator would have to be recalculated for each time step.
In this paper, we introduce an algorithm based on the process tensor (PT) framework that demonstrates sublinear scaling of the numerically demanding step.A PT captures environment influences equivalent to the the Feynman-Vernon influence functional in a numerically exact way and can be efficiently represented in the form of a matrix product operator (MPO) [48,49].Once obtained, an open quantum system with an arbitrary time-dependent system Hamiltonian can be propagated in time by simple matrix multiplications on a vector space given by the product of the system Liouville space and the inner dimension of the PT-MPO, which we denote as χ.The bottleneck in MPO techniques is the MPO compression, i.e., the reduction of the inner dimension χ, using rank-reducing operations, often achieved by truncated singular value decomposition (SVD).We show that the self-similarity of timeindependent Gaussian environments-which include electromagnetic environments, typical vibrational baths, and the paradigmatic spin-boson model-can be exploited to devise a divide-and-conquer scheme which reduces the number of SVDs from O(n 2 ) in the algorithm introduced by Jørgensen and Pollock [49] to O(n log n).If the memory time is further limited to n c time steps, we arrive at a theoretical scaling O(n c log n c ), constant in n.The actual scaling can be larger than this, as the bond dimension χ, and thus the time required for SVD evaluation, can depend on the propagation time.However, we will see from numerical results that sublinear scaling with n is indeed seen in the examples we study.
To test the performance of our algorithm in practice and to demonstrate a sample of new applications it en-ables, we provide several examples: First, we investigate the fluorescence spectra of semiconductor QDs coupled to a bath of acoustic phonons with strong driving.While being of considerable interest for experiments [50,51], numerically calculating QD fluorescence spectra is a challenging multi-scale problem, because the width of the zero-phonon line is determined by radiative lifetimes of the order of nanoseconds, while typical phonon memory times are of the order of picoseconds, and strong driving forces us to use small time steps on the femtosecond scale.The sublinear scaling of our algorithm enables us to obtain numerically exact spectra from simulations involving a million time steps within minutes on a conventional laptop computer.Second, we use the capability of our algorithm to deal with very many time steps to study superradiance of multiple emitters without making any rotating wave approximation.This allows us to describe the breakdown of superradiance in the presence of disorder and dephasing due to interactions with phonon environments.Finally, we discuss coherence decay for a system with strong coupling to an environment with a strongly peaked spectral density.This example illuminates where the limitations of our algorithm arise.
The article is structured as follows: In the theory section II, we introduce and describe our algorithm, where, in subsection II A, we first summarize the PT formalism, on which our algorithm is based.For comparison with the commonly-used sequential algorithm by Jørgensen and Pollock [49], and to introduce quantities also used in our approach, we revise the PT calculation scheme for Gaussian environments in Ref. [49] in subsection II B. Then, we introduce our divide-and-conquer scheme in subsection II C and introduce periodic PTs in subsection II D. The examples of fluorescence spectra, multi-emitter superradiance, and coherence decay are discussed in subsections III A, III B, and III C, respectively.Our results are summarized in section IV.

A. Process Tensors
An open quantum system is defined by dividing the total system into a system of interest S and an environment E. Correspondingly, the total Hamiltonian is decomposed into H = H S + H E , where H S is an arbitrary, possibly time-dependent, system Hamiltonian and H E is the environment Hamiltonian, which also includes the interaction with the system of interest.In this paper, we present an algorithm applicable to environments with Gaussian correlations, focusing on a generalized spinboson model consisting of a few-level system coupled linearly to a bath of harmonic oscillators, described by where Ô is a Hermitian operator acting on the system Hilbert space.The environment Hamiltonian is conveniently characterized by the spectral density Here, we choose to work in the basis where the coupling is diagonal Ô = u λ u |u⟩⟨u| with a set of basis states |u⟩ of the system Hilbert space.
Throughout this article, we use a compact Liouville space notation and assume a uniform time grid t j = t 0 + j∆t with time step ∆t.For example, the reduced system density matrix ρ(t) = Tr E [ρ(t)] at time t j is denoted by ραj = ρsj,rj (t j ) = ⟨s j |ρ(t)|r j ⟩, where s j and r j are system Hilbert space indices, which are combined into the Liouville space index α j = (s j , r j ).The time argument t j is implied in the index α j .In this notation, the reduced density matrix ραn at a final time t n can be expressed as [32]: where describes the free evolution of the system during one time step ∆t under the free system Liouvillian, e.g., L S [ρ] = i h H S , ρ and F αn,...,α1 is the Feynman-Vernon influence functional [32], which exactly captures all environment influences in the limit ∆t → 0.
For the generalized spin-boson model, the influence functional is given by [49,52] where the factors b i−j αi,αj are related to the bath correlation function via with α j = (s j , r j ) and λ s,r as defined above, and [53] where The memory of the environment is finite if the bath correlation function C(t) vanishes after n c time steps, i.e.C(t ≥ n c ∆t) ≈ 0, which implies η l≥nc ≈ 0 and b l≥nc αi,αj ≈ 1.In this case, it is sufficient to consider at most n c terms in the second product in the right hand side of Eq. ( 3).Performing the Feynman-Vernon summation in Eq. ( 2) is notoriously difficult.For a system Hilbert space of dimension D, the sum involves D 2n terms and, thus, scales (a) FIG. 1.The tensor network for a PT with n = 8 time steps and memory cutoff nc = 6 using the sequential algorithm by Jørgensen and Pollock [49] (a and b) as well as our divide-and-conquer scheme (c and d), respectively.(a) and (c) show the subdivision of the overall tensor network into different blocks (black rectangles) to be contracted from bottom to top.In the sequential algorithm, this is done row by row (b).The divide-and-conquer algorithm is based on the observation that the last block in (c), formed by the topmost n/2 lines, exactly matches the part of the already contracted network shaded in red.This enables a contraction of multiple rows at a time as depicted in (d).The black semicircle seen here represents a "closure" that describe tracing out dangling bonds; see text for further discussion.
exponentially with the total number of time steps n.This issue can be addressed [49,52] by representing the influence functional in a more convenient form using matrix product operators (MPOs) [54,55].It is then referred to as the process tensor matrix product operator (PT-MPO) where the monolithic tensor F αn,...,α1 is decomposed into a set of smaller elements Q α l d l ,d l−1 , which are regarded as matrices with respect to the inner bond indices d l .The latter serve the purpose of conveying time-non-local information, i.e. memory, over multiple time steps.The maximal dimension χ of the inner bonds d l strongly depends on the complexity of the environment [48].
which can be summed sequentially, one time step at a time, with the complexity of n matrix multiplications of dimension D 2 χ.The main challenge for simulating the open quantum system is thus to bring the influence functional in Eq. (3) into the form of the PT-MPO in Eq. (7).

B. Sequential algorithm
Jørgensen and Pollock [49] devised the algorithm that is currently most commonly used to obtain PTs in MPO form for Gaussian environments.To distinguish it from our divide-and-conquer scheme, we refer to the approach in Ref. [49] as the sequential algorithm.The common starting point of both algorithms is the graphical representation of the double product on the right hand side of Eq. (3) in the form of a triangular tensor network (introduced in Ref. [52]), depicted in Fig. 1(a) and (c) for n = 8 time steps and memory cutoff n c = 6.The latter limits the maximum length of the rows.
Each node represents a factor c α l β l+1 ,β l = δ β l+1 ,β l b l α l+j ,β l , where system Liouville space indices α l are represented by upward facing links, while β l+1 and β l are links to the left and right, respectively.Left or right dangling indices are traced out.Note that, in the notation of Ref. [49], the upward facing links are meant to pass by those nodes further up the tensor network, and then connect to the same outer index α l .The more conventional interpretation of tensor networks, where a bottom node is instead connected to its neighbor to the top, is regained by adding a fourth leg with index αl at the bottom of each node (except those on the bottom row) via another Kronecker delta, i.e. cα l , αl β l+1 ,β l = c α l β l+1 ,β l δ α l , αl .In the following we however retain the notation of Ref. [49], labeling only three indices on such tensors.
In the sequential algorithm, the PT is written as a product of rows where C α k ,...,αj is the j-th row in the tensor network, which, as described above, may be reduced to a product of at most n c terms if memory truncation is employed.The PT F α k ,...,α1 is constructed by multiplying row after row from bottom to top using the recursion visualized in Fig. 1(b), Fα k ,...,α1 (k:j+1) =C α k ,...,αj Fα k ,...,α1 with initial value F α k ,...,α1 . The final PT is identified as F αn,...,α1 = Fαn,...,α1 (n:n) . Crucially, the iteration in Eq. ( 10) retains the MPO form Fα k ,...,α1 However, the dimension of inner indices d l = (d ′ l , β l ) is expanded to the product of the inner dimension of the previous PT d ′ l with that of a single row β l .If χ ′ denotes the typical inner dimension of the MPO F α k ,...,α1 and the inner dimension of a single line corresponds to D 2 with system Hilbert space dimension D, the inner dimension of the product is now increased to χ = χ ′ D 2 .To keep the inner dimensions tractable, the MPO is compressed by sweeping across it and applying truncated singular value decompositions (SVDs) to every element.In the forward direction (increasing time step indices), one calculates the SVD with non-negative singular values σ s and matrices U and V with orthogonal columns.Keeping only terms s with significant values σ s ≥ ϵσ 0 , where ϵ is a given truncation threshold and σ 0 is the largest singular value, one replaces which passes on the singular values-indicators for the importance for the represented degree of freedom-to the next element in the PT while simultaneously reducing the inner dimension to the number of significant singular values.A forward sweep is followed by an analogous backward sweep.While this scheme to compress the PT-MPO was introduced in Ref. [49], we find empirically that changing the order of sweeps, namely performing the backward sweep before the forward sweep, leads to smaller inner PT dimensions χ and hence to shorter computation times.For all simulations presented here, we therefore consistently compress PT-MPOs in this reversed order.The combination and backward sweep is diagrammatically represented in Fig. 2(a).Overall, the calculation of a PT-MPO using the sequential algorithm requires O(n 2 ) SVDs without memory truncation, or O(nn c ) SVDs with memory truncation.Exact algorithms to perform SVDs of N × M matrices with N > M generally require O(N M 2 ) floating point operations.Denoting by χ ′ the typical inner bond dimension of the PT-MPO after the previous iteration step, the corresponding matrix dimensions for a single SVD as in Eq. ( 12) are M = D 2 χ ′ and N = D 4 χ ′ , respectively, where a factor D 2 in the long matrix dimension N stems from the outer bonds of the PT-MPO while the remaining factors originate from combined inner indices.Consequently, the number of floating point operations required per SVD is O(χ ′ 3 D 8 ).

C. Divide-and-conquer algorithm
Guided by the visual representation of the PT contraction in Fig. 1(c), we suggest an alternative algorithm based on the observation that the triangular tensor network contains a high degree of self-similarity: The first and second rows are identical up to a shift.Rows three and four together form a shifted and truncated replica of the combination of rows one and two.Similarly, rows five to eight are the same as the part of rows one to four shaded in red in Fig. 1(c).
Concretely, the tensor network can be contracted in blocks of rows, with sizes progressing in powers of two, 0 , which is shown as a large triangle with three external legs.Updated blocks of the PT-MPO are formed by compressing the Kronecker product of the corresponding matrices V (1) † and V (2) † using these projectors (dotted boxes).These steps are repeated until the last time step is reached.
by iterating: where the block and can be written as Here, q d k−2 m , represented as a black semicircle in Fig. 1(d), define objects we call "closures" that describe tracing out dangling bonds.A similar object has been referred to as a "cap tensor" in other work [56].Such objects are trivial before MPO compression, but after compression become non-trivial.However, the closures can be calculated iteratively from PT-MPOs by using trace preservation [57].For the generalized spin-boson model considered here, one starts with q dn=1 = 1 and iterates , where x is a fixed but arbitrary system Hilbert space index (e.g., x = 1).As can be seen in Eq. ( 4), setting α i = (x, x) results in b (i−j) αi=(x,x),αj = 1.This effectively terminates the double product in Eq. ( 3) at an ealier time step.The proposed algorithm follows the divide-andconquer paradigm because the total number of rows of the tensor network is divided into an upper and lower block, where only the lower block has to be calculated explicitly.The lower block is then again divided into two halves, and so on.Like other divide-and-conquer schemes, such as the famous fast Fourier transformation algorithm [58,59], the numerical complexity is reduced from O(n 2 ) to O(n log n) operations.In our case these operations are SVDs.
Despite the favourable scaling with n, the divide-andconquer algorithm faces some challenges.These arise because of the size of matrices we must combine.In the sequential algorithm, each step combines a MPO with potentially large inner dimension χ ′ and a single line of relatively small inner dimension of D 2 .In contrast, in our divide-and-conquer algorithm, two MPOs of similar inner dimensions χ ′ are combined to χ = χ ′ 2 .Because of the scaling of SVD routines with the third power of the combined inner dimension together with a factor D 2 from the outer bond, the direct application of SVDs would lead to a typical complexity O(χ ′ 6 D 2 ) per SVD, which is prohibitively demanding.Clearly, a different way to combine two MPOs is needed.
Here, we address this issue by preselecting relevant degrees of freedom: Say, we wish to update an MPO with matrices f by multiplying it with an MPO with matrices gα l . We first perform SVDs on the in- and gα l , respectively.Using this, the matrices of the combined MPO can be formally written as Note again that, to keep the notation similar to the sequential algorithm in Ref. [49], Eq. ( 16) is considered an element-wise product for each value of α l .In the established notation for more general tensor networks, Eq. ( 16) would be considered a contraction over outer indices after including a Kronecker delta f To reduce the inner dimension, we keep only combinations of indices (s, t) for which the product of singular values σ exceeds a value determined by a given threshold ϵ select .In practice, the new matrix is directly set to for the subset of (s, t) obeying the condition above, while the remaining terms are passed on to the next matrices of the original MPOs: t .
The combined forward sweep, selection, and combination process is depicted in Fig. 2(b).It is followed by a backward sweep to further reduce bond dimensions.Over a wide range of different examples, we find that the selection reduces the combined inner dimensions from χ ′ 2 to λχ ′ with an empirical factor λ between 2 and 12.The resulting number of floating point operations per SVD O(λ 3 χ ′ 3 D 2 ) is therefore nominally comparable to that in the sequential algorithm, O(χ ′ 3 D 8 ).However, the massive reduction of inner bonds by our selection strategy comes at the cost of not guaranteeing optimal low-rank approximation.This can result in larger bond dimensions χ ′ and therefore in higher numerical demands per SVD compared to the sequential algorithm.As discussed in more detail in Appendix A, this issue can be ameliorated by choosing different thresholds for the selection and backward sweep compared to the forward sweep, which we parameterize by the ratios r s = ϵ select /ϵ forward and r b = ϵ backward /ϵ forward , while at the same time we obtain similar accuracies as for the sequential algorithm if we identify ϵ forward (the largest of our thresholds) with the nominal threshold ϵ of the sequential algorithm.
Finally, it is noteworthy that our divide-and-conquer algorithm can be implemented in place, i.e. in such a way that only a single copy of the full PT-MPO has to be stored, despite the combination of MPO matrices at different positions in the MPO chain.This reduces the memory footprint of the algorithm.Furthermore, as the line sweeps access the individual MPO matrices in a predictable order, storing the PT-MPO on hard disk and preloading blocks of matrices can be efficiently implemented, with only a small overhead of about 10% in overall computation time.This enables the calculation of PTs with many more time steps n than would be possible if required to keep the full MPO in memory.
It should though be noted that, as discussed above, the time required for each SVD scales as O(χ 3 ).Since χ, in general, depends on the complexity of the environment as well as on the total propagation time in the spirit of the Lieb-Robinson theorem [42], the actual scaling of the computation time with n may be larger than the scaling O(n log n) of the number of SVDs in the divideand-conquer algorithm.Benchmarking the numerical run times, as is done in section III, is therefore essential to assess the overall scaling of the algorithm for typical scenarios in open quantum systems.

D. Finite nc and Periodic Process Tensors
A useful feature of the sequential algorithm is that it profits significantly from memory truncation.If the memory of the environment becomes negligible after n c time steps, the number of overlapping columns in subsequent rows in the tensor network is limited to n c − 1, as can be seen, e.g., in the three bottom rows of Fig. 1(a).Only in the overlap region do MPO compression sweeps have to be performed, reducing the number of SVDs to O(nn c ).This suggests that it is also worthwhile to investigate how memory truncation can be incorporated into the divide-and-conquer algorithm.
To this end, consider an intermediate step in the divide-and-conquer algorithm where a block of 2n c rows is formed by two blocks of n c rows, as depicted in Fig. 3(b) for n c = 4.In principle, following the same rationale as in the sequential algorithm, sweeping only over the finite overlap of n c − 1 columns would produce an algorithm scaling as O n c log n .Remarkably, for n c ≪ n, this means that obtaining the total PT involves fewer than n SVDs, which implies that many MPO matrices of the final PT must be exact copies of others.
In fact, one can identify a structure in PTs that can be repeated indefinitely.Such an observation is analogous to that made in introducing repeating tensors that represent the state of spatially-infinite systems with translational invariance, in algorithms such as infinite density-matrix renormalisation group (iDMRG) [60], infinite time-evolving block decimation (iTEBD) [61], and infinite projected entangled pair states (iPEPS) [62].We now consider how this can be constructed for the periodic PT.Consider again the extension of the tensor network from n c to 2n c rows.For this step, the iteration Eq. ( 14) becomes as shown in Fig. 3(c).This has explicit elements The first 2nc rows of the tensor network can be subdivided into 4 blocks, where blocks of the same colour are identical.The combination of a blue block stacked on top of a green block forms an orange block, which can be used as a fundamental unit of a periodic PT and can then be repeated indefinitely, as depicted in (a).This is due to the fact that the left and right interfaces of the orange block, marked by the light blue arrow in (c), seamlessly link together, even after MPO compression of the first nc rows.
where we define f α l 2nc−1:nc dn c ,dn c −1 = δ dn c ,dn c −1 to artificially extend the overlap from n c − 1 to n c columns.This is done to achieve a particular partitioning, where the first and last cases in Eq. ( 21) correspond to the areas shaded in blue and green, respectively, in Fig. 3(b,c).If shifted and put together, they exactly reproduce the original block F α2n c −1,...,α1 2nc−1:nc consisting of the bottom n c rows.
The role of the middle section in Eq. ( 21), visualized as orange box in Fig. 3(b,c), becomes clear when the tensor network is further extended by another n c rows, as in Fig. 3(a).The overlap region with the new block of rows again contains n c columns, but these are exact copies of the (orange) middle section of the MPO, and so do not have to be calculated anew.Note that the fact that multiple central (orange) blocks fit seamlessly together is non-trivial, as the inner bonds between the subsequent matrices in the MPO have been strongly modified by MPO compression, so their relation to the bonds in the original tensor network is no longer obvious.In particular, inner bonds at different positions within the MPO generally have different dimensions.Here, however, we have constructed the central (orange) blocks in such a way that the left and right bonds, as shown by the links marked by the light blue arrow in Fig. 3(c), correspond exactly to the bonds between the first (blue) and last (green) section in Eq. ( 21).This allows us to repeat the orange block indefinitely and we arrive at the periodic PT F αn,...,α1 = dn,...d1 where again the matrices f 2nc−1:nc describing the bottom n c lines of the tensor network.In practice, it is useful to compress the periodic part of the PT-MPO once more in a final step using our preselection scheme for combining MPOs with large inner dimensions, described in Eq. ( 16).In doing this, care must be taken not to modify the left and right interfaces.
To summarize this section, if the memory of a Gaussian environment is finite, a periodic PT can be obtained using only O(n c log n c ) rank-reducing SVDs.Remarkably, this is constant in the overall propagation time t e = n∆t and provides again a nominal scaling advantage over the sequential algorithm with O(nn c ).A further advantage of the periodic PT is that the memory requirements for storing it is O(n c ), much smaller than that for storing a full PT O(n).
In the following sections we will showcase the power of our divide-and-conquer scheme with and without memory truncation on a series of applications.

III. APPLICATIONS
A. Fluorescence Spectra of Quantum Dots

Model and context
Self-assembled III-V semiconductor quantum dots (QDs) are key elements for photonic quantum technologies [63].Due to their strong interaction with light, they can be used as bright sources of pure single photons [8,64], entangled photon pairs [65,66], and other non-classical multi-photon states [67,68].However, electronic excitations in QDs also strongly interact with longitudinal acoustic phonons.The resulting non-Markovian effects are significant, and so typically cannot be fully described by weak-coupling master equations [69].While polaron master equations accurately account for phonon effects in the limit of weak driving [16], these break down under strong-driving conditions [69].Therefore, numerically exact path integral techniques based on QUAPI [43,44] have been applied [8,70] to investigate the dynamics of driven QDs.
While path integral techniques have allowed calculations of some features of quantum dots, numerically exact calculation of the spectra of strongly driven QDs is difficult.Multi-time correlations (as needed to obtain emission spectra) can be calculated with such approaches [71], but reaching convergence remains a challenge.This is due to a separation of timescales (see Appendix B for a convergence study): Because of the long radiative lifetimes in QDs, the dynamics has to be propagated to a few nanoseconds in order to avoid artifacts in the Fouriertransformed spectrum.Typical phonon memory times span several picoseconds.When considering driving with high laser intensities, or strongly off-resonant driving [72], the resulting oscillatory dynamics makes it necessary to choose small time steps ∆t of the order of a few femtoseconds.Taken together, these separate timescales yield a problem where both the number of timesteps n, and the memory cutoff n c must be large to produce accurate results.Moreover, to analyze phonon sidebands, spectra are often presented on a logarithmic scale, which makes numerical errors quite prominent.One should therefore use very small MPO compression thresholds in PT simulations, resulting in sizeable inner dimensions χ.Nevertheless, for the reasons explained above, this challenging problem can be addressed using our divide-and-conquer scheme, in particular when combined with the use of a periodic PT.
In Fig. 4, we present the fluorescence spectra of a QD driven with Rabi frequency Ω and coupled to a bath of phonons, showing how the Mollow triplet [73] is affected by this bath.The phonon-free system evolution with driving and radiative decay with rate κ is given by the Lindblad master equation with system Hamiltonian H S = h 2 Ωσ x and Lindbladian Here, the QD is modelled as a two-level system with ground and exciton states |g⟩ and |e⟩, respectively, and we use the conventional notation for operators σ + = |e⟩⟨g|, σ − = |g⟩⟨e|, and σ x = σ + + σ − .The spectrum is obtained as the Fourier transform of the two-time corre-lations: where the coherent scattering contribution (elastic peak) has been subtracted.For comparison, the spectra for simulations without phonons-as would be relevant for a driven atomare shown as light blue lines in Fig. 4(c-e) for different Rabi frequencies Ω and fixed radiative decay rate κ = (0.5 ns) −1 .In this case one sees the characteristic Mollow triplet [73] with a central peak at the two-level transition frequency and two side peaks at frequencies ±Ω.The ratio between the heights of the central peak and each side peak is 3:1, while the ratio between the integrated areas is 2:1 [74]

Simulation results and discussion
In solid-state QDs, emission spectra are strongly affected by interactions with phonons.The QD-phonon interaction is of the form of a spin-boson model, corresponding to Eq. ( 1) with a coupling operator Ô = |e⟩⟨e|.The phonon spectral density takes the superohmic form: For a GaAs-based QD, we take parameters given in Ref. [75] corresponding to an electron and hole radii a e = 4 nm and a h = a e /1.15, respectively, and we show the resulting spectral density in Fig. 4 Previous works have addressed QD fluorescence spectra in the regime of weak Rabi driving using polaron master equations [15,76], or path-integral calculations [71].
There, two effects have been identified: First, compared to the phonon-free spectra, simulations with phonons reveal a line broadening, which also reduces the heights of the peaks.Second, the side peaks are shifted towards the center due to phonon renormalization of the transition dipole.These effects can be seen in Fig. 4(c), where we show the weak-driving case of hΩ = 0.05 meV.
Importantly, the ability of our divide-and-conquer algorithm to treat problems with many timesteps enables us to also investigate the regime of stronger driving, beyond the range of previous work.In Fig. 4(d), we see that on increasing the driving strength to hΩ = 0.5 meV, three changes occur: the spectral lines are broadened, an asymmetric background arises, and there is a notable change in the relative height of the three Mollow peaks with the low-energy side peak now dominating the emission.We discuss each of these features in turn.The broadening seen is consistent with behavior known in the weakcoupling limit [16], where the linewidth is proportional to the spectral density J(ω) evaluated at the Rabi frequency ω = Ω.Since J(Ω) is significant at hΩ = 0.5 meV-see Fig. 4(b)-the broadening here becomes large.Regarding the asymmetric background, this feature is similar to that seen in the phonon side band in an undriven QD [77].Finally, the relative peak heights can be understood as the result of fast thermalization in the dressedstate basis.Thermalization predominantly occupies the lower dressed state |−⟩, because the energetic splitting between |+⟩ and |−⟩ is larger than the thermal energy hΩ > k B T ≈ 0.34 meV.Hence, emission from |+⟩ is quenched.Moreover, the direct transition |−⟩ → |−⟩ emitting photons at the frequency of the central peak also competes with phonon-assisted photoemission processes, where an energy hΩ is efficiently absorbed into the phonon bath.The photon with the remaining energy contributes instead to the lower energy side peak.
Considering even larger driving with hΩ = 5 meV, as depicted in Fig. 4(e), we find a restoration of the characteristic Mollow triplet with 3:1 ratio between the heights of central and side peaks just like in the phononfree simulations.This is due to dynamical decoupling from phonons [5,78,79], which occurs because the spectral density J(ω) becomes small for frequencies ω > ∼ 5 meV/h, see Fig 4(b).

Scaling of computation time
We next discuss the computational cost of these calculations for the various different algorithms presented above.This is shown in Fig. 5 where we show the computation time, defined as the total elapsed time from start to end of the program on a conventional laptop computer with Intel Core i5-8265U processor.
For the sequential algorithm by Jørgensen and Pollock [49], we initially see the expected superlinear increase in computation time up to about n ∼ 1 000 time steps, with a scaling compatible with the O(n 2 ) behavior Results are shown for the sequential algorithm, the divideand-conquer scheme with different ratios r b = rs between compression thresholds for backward and forward sweeps, and the periodic PT approach starting from a divide-and-conquer calculation with r b = rs = 0.2.For reference, linear and quadratic scaling with n are depicted as dashed dark gray and dotted light gray lines, respectively.
of the number of SVDs in the algorithm (cf.dotted and dashed lines indicating slopes corresponding to O(n 2 ) and O(n) scaling, respectively).For larger n, where n > n c = 2 048, the scaling switches to linear, which again matches the expectation of O(nn c ) SVDs .The sequential algorithm data is limited to n < ∼ 10 000 timesteps due to the computation time required beyond this.
The divide-and-conquer scheme similarly shows an initial superlinear scaling for n ≪ n c , but is approximately one order of magnitude faster than the sequential algorithm after n ∼ 100 time steps.After the kink at n ∼ n c , this algorithm scales more slowly, with a slope on doublelogarithmic scale that indicates sublinear behavior.
As discussed in the theory section, it can be beneficial to use different compression thresholds for selection, backward, and forward sweeps.The computation time for simulations with threshold ratio r b = r s = 0.2 is also shown in the Figure , and is seen to be a factor of three smaller than that with r b = r s = 1.We have checked that this change corresponds to a reduction of the maximal inner bond dimension of the final PT from χ = 158 to χ = 133.When studying accuracy vs the forward compression threshold ϵ, we find similar accuracies for both values of threshold ratios.These calculations allow us to reach n ∼ 1 000 000 timesteps; the ultimate limitation in this case is the storage of the full PT-MPO, which requires over 400 Gigabytes, rather than the computation time.
Considering the periodic PT algorithm, we show results from switching to periodic PTs after calculating an intermediate PT of n c rows using the divide-and-conquer scheme with r b = r s = 0.2.As anticipated, in this case the computation time remains constant until about n ∼ 1 000 000 time steps, because no more numerical resources have to be spent on PT calculation.At extremely large n ∼ 10 7 , the time becomes dominated by the propagation of the open quantum system in Eq. ( 8), where PT-MPO is contracted with the set of system propagators M α l ,α l−1 via the outer indices, and by the extraction of the observables.Even though this only involves matrix multiplications, which are much less demanding than SVDs, they have to be applied at each time step, and so linear scaling arises for extremely large n.The prefactor of this linear scaling is however far smaller than that of the other algorithms, as is clear from the vertical offset on the double-logarithmic graph.Storing the periodic PT requires less than 2 Gigabytes, which elimitates the need for writing to and reading from hard disk.
Finally, we want to point out that there are occasions where PT-MPOs can be employed with genuine sublinear scaling with respect to n.For example, if one is interested in the stationary states of an open quantum system with time-independent or periodic system Hamiltonian, one can contract the periodic PT with the corresponding system propagators.Thereby, one obtains an effective propagator for the system expanded by the inner dimension of the PT-MPO at the periodic cut, which can be diagonalized.The stationary state(s) are then given by the eigenvector(s) of the effective propagator corresponding to eigenvalue(s) with value 1.In addition, multi-time correlations can be calculated directly from considering the other eigenvalues and eigenvectors.

Model and context
Superradiance is a dramatic consequence of collective quantum behavior [80,81], where N emitters act as one, resulting in spontaneous emission rates that scale superextensively with N .It occurs when the inter-emitter distances are much smaller than the wavelength of the emitted light, making the emitters spatially indistinguishable [see Fig. 6(a)], and when the energies of emitters are the same, making them spectrally indistinguishable.In this case, if all emitters are prepared in their excited state at t = 0, the emitted intensity features a characteristic burst at short times and a slowly decaying tail on long times, in distinction to the exponential decay that occurs for independent or distinguishable emitters [81].Concomitant with superradiance is the existence of subradiant states, which are optically inactive and so can be used to store excitations or quantum information over long periods of time.Due to recent advances in fabrication, it has become possible to explore collective effects of a few spectrally indistinguishable solid-state QDs, including coupling multiple QDs into the same photonic waveguide [82][83][84][85].As in the previous section, consideration of semiconductor QDs raises the issue of phonon effects, and a key theoretical question is thus modeling  of imperfections and environmental effects in real-world solid-state devices [13].
The standard model to consider for superradiance is N two-level systems coupled to the electromagnetic field via electric dipole interaction.We can thus write the full Hamiltonian (i.e.H S + H E ) as [80,81] where ω j is the transition frequency of emitter j, ω k is the frequency of photon mode k, g j,k are the light-matter coupling constants, a † k and a k create and destroy a photon in mode k, and σ + j and σ − j describe the excitation and deexcitation of the j-th emitter.Throughout this example, we assume g j,k = g k is real and independent of j, which implies that all emitters couple to the photon environment with the same phase, as is crucial for superradiance [80,81].
The Hamiltonian written above is a multimode generalization of the Dicke model, without any rotating wave approximation.As such, the time evolution of this model will involve high frequencies, associated with the characteristic frequencies ω j , ω k .In many cases, it is preferable to make a rotating wave approximation (RWA), yielding the Tavis-Cummings model, which would then allow one to shift to a frame in which these high frequencies are eliminated.However, doing so changes the system-environment coupling in a way that precludes the direct application of the PT-MPO formalism discussed above: After the RWA, the system-environment coupling no longer takes the simple product form in Eq. ( 1), and instead involves two different system operators coupling to the environment.While approaches do exist to construct tensor network representations for the evolution of such problems [86], they involve tensors with larger internal dimensions.However, our divide-andconquer scheme enables us to resolve very many time steps.As such, we are able to calculate a PT-MPO for the photonic environment with the original electric dipole coupling in Eq. ( 27) without making the RWA.In appendix C, we discuss an alternative perspective, starting from the Tavis-Cummings model (with the RWA), and re-introducing counter-rotating terms as an approximation, controlled by adding a common carrier frequency, ω R , to the system frequencies ω j , ω k .

Comparison of numerical and analytic results
Starting from Eq. ( 27), we directly apply our algorithm to the environment Hamiltonian H E = H − H S with system Hamiltonian H S = j ω j σ + j σ − j .The form of the light-matter coupling is advantageous, as degeneracies allow us to treat relatively large N exactly.Specifically j σ x j has a spectrum with only N + 1 different eigenvalues −N, −N − 2, . . ., N .This allows us to construct PTs with only (N + 1) 2 instead of 2 2N different outer indices α [52,70].As a result, the PT-MPO calculation is so efficient that it is the final contraction-describing the propagation of the N -emitter system in a 2 2N dimensional space-rather than the PT calculation that currently limits the number of emitters N in our numer-ically exact simulations.
The spectral density used for the PT-MPO calculation, shown in the inset to Fig. 6(c), is chosen as: This expression has a wide flat region around ω ≈ ω R with a constant value κ/(2π), so that in the Markov limit a single emitter would spontaneously decay with rate κ.The edges of this box-shaped spectral density are rounded using logistic functions centered at ω = 0 and ω = 2ω R , respectively, with transition widths w = ω R /10.We find such rounding produces PT-MPOs with smaller inner dimensions than for a sharp cutoff.Throughout this discussion, we choose ω R = 1000κ, time steps ∆t = 1/(8192κ), total number of time steps n = t e /∆t = 32 768, and MPO compression threshold ϵ = 10 −9 .The divide-and-conquer algorithm is applied without memory truncation.The emitted intensity where n j (t) denotes the occupation of the j-th emitter.The results of such calculations for identical emitters, with ω j = ω R , are shown by solid lines in Fig. 6(c).
For comparison to these simulations, Fig. 6 also shows analytic results.These are found using the standard approach of considering transitions between the ladder of Dicke states, as depicted in Fig. 6 x/y/z j , specifically the total angular momentum Ĵ2 |J, M ⟩ = J(J + 1)|J, M ⟩ and its z-component Ĵz |J, M ⟩ = M |J, M ⟩.Starting from the fully excited state, |J = N/2, M = N/2⟩, the dynamics is restricted to the states with J = N/2 and M = −J, . . ., J. The light-matter interaction involves the collective lowering operator Ĵ− which gives rise to optical transitions between states |J, M ⟩ to |J, M − 1⟩.The rates Γ J,M for these transitions can be found using Fermi's golden rule, which yields Γ J,M = κ(J + M )(J − M + 1) [81].The sequential photon emission processes can be solved analytically; the solutions are listed explicitly in Appendix D and are shown by points in Fig. 6(c).
We see in Fig. 6(c) that the emitted intensities from N = 1, . . ., 5 indistinguishable (ω j = ω R for all j) quantum emitters obtained from the PT-MPO simulations very closely match the analytical results.As noted above, the PT-MPO calculations are for the Dicke model (without RWA), while the analytic results are for the Tavis-Cummings model, with the RWA in the light-matter coupling; this confirms the RWA would be valid for these parameters.While not clearly visible in the figure, there are discrepancies between the two calculations at early times, t < ∼ 1/ω R .These differences are due to the difference of models with and without the RWA: there are transient effects of turning on the matter-light coupling including the counter-rotating terms at t = 0. We have checked that, as expected, such effects can be reduced by increasing ω R (not shown).

Effects of distinguishability and decoherence
While the results above demonstrate the ability of the PT-MPO approach to recover superradiant dynamics for the Dicke model, the fact that all results were recoverable from analytic calculations suggests such an approach was not needed.However the PT-MPO approach includes the full description of environment influences and so can be used in situations where no analytic solutions are available.Here we use this to present results on how superradiance is destroyed by spectral distinguishability and decoherence.
Figure 6(d) shows the breakdown of superradiance for N = 3 quantum emitters when they become spectrally distinguishable.This is implemented by changing the system Hamiltonian by setting different values for the two-level transition frequencies ω We note that this can be done with a single calculation of the PT-MPO, and just changing the system propagator with which it is then contracted.Indeed, with increasing frequency detuning between emitters, ∆, the superradiant intensity burst is suppressed, and replaced by oscillations around an exponentially decaying intensity.At sufficiently large ∆ one recovers monoexponential decay with rate κ as is expected for independent emitters.
The PT-MPO approach also allows us to investigate how superradiance is affected when the emitters are additionally coupled to other environments, such as phonons.These can affect the emission dynamics by dephasing inter-emitter coherences and by accumulating which-path information, i.e., which of the emitters is excited and which is not.In Fig. 6(e), we show the corresponding intensity for N = 2 spectrally indistinguishable semiconductor QDs (ω j = ω R ) , with each QD coupled to a bath of longitudinal acoustic phonons.For this bath we use the same spectral density and parameters as in Fig. 4; we thus present results in physics units, and so require a specific choice of κ = 1/64 ps −1 .We find that superradiant emission is already slightly suppressed due to QD-phonon interactions for baths at initial temperature T = 0 K; similar behavior persists up to temperatures of T = 4 K. Signatures of superradiance start to significantly decrease from about T = 10 K and almost vanish at liquid Nitrogen temperature T = 77 K. C. Coherence decay with a strongly-peaked spectral density

Motivation and model
To determine the practical limits of what our divideand-conquer approach can achieve, we consider strong coupling to a bath with strongly peaked spectral density.Such environments can be addressed by approximate methods such as the reaction coordinate approach [87,88], or by specialized numerically exact techniques like Hierarchical Equations Of Motion (HEOM) [33].However, such environments have been challenging for PT-MPO-based approaches due to long memory times and correspondingly large inner PT-MPO dimensions.As such, they can serve as a critical test for new PT-MPO algorithms.
To explore the behavior of such models, we consider the free decay dynamics of the spin-boson model.That is, we consider a two-level system, with vanishing system Hamiltonian, H S = 0, and system-environment coupling operator Ô = |e⟩⟨e|.We then consider the decay of initially prepared coherences, i.e. the time evolution of ⟨σ x (t)⟩ for initial state ρ(0) = 1 2 (|g⟩ + |e⟩)(⟨g| + ⟨e|).This setup is particularly well suited for studying the convergence as the coherences react much more sensitively to the environment than occupations.Furthermore, for H S = 0 the model reduces to the independent boson model, for which analytical solutions are available for comparison.Moreover, as [H S , H E ] = 0, the Trotter error, i.e. the error due to the finiteness of the time step ∆t, vanishes.
We take the spectral density to have the form: defined by interaction strength η, central frequency Ω, and damping γ.This form of spectral density is commonly used to describe quantum dissipative tunneling [18,87], such as in charge or excitation transfer in biological or chemical molecular systems.It can be derived from a hierarchical model, where the two-level system couples to a single nuclear coordinate with frequency Ω which in turn is damped by an Ohmic bath [87].In this interpretation, η is set by the Ohmic damping strength, and γ = η/(2M ) where M is the nuclear mass.Throughout this example, we set h = 1 and use dimensionless parameters Ω = 10 and γ = 1, while we vary the coupling strength η.J QDT (ω) is depicted in the inset of Fig. 7(a).The initial bath temperature is set to T = 0.

Numerical results and computation time
Figure 7(a,b) shows the time-evolution of the coherence, calculated using the sequential and the divideand-conquer algorithm, for η = 0.01, ∆t = 1/32 and η = 0.1, ∆t = 1/128, respectively.For the case H S = 0 that we consider, exact results can be obtained by polaron transformation [16] where coth(βω/2) → 1 for zero temperature T → 0. These results are shown for comparison in Figure 7.For the weaker coupling, we see both methods are capable of recovering the exact result when the compression threshold ϵ is small enough.For the stronger coupling strength convergence is not reached, as discussed further below.
Of particular interest here is how the computation time for the construction of the PT-MPOs scales as a function of the compression threshold ϵ.This is presented in Fig. 7(c) for baths with coupling strengths η = 0.01, η = 0.04, and η = 0.1 and time steps ∆t = 1/32.For weak coupling η = 0.01, we find the divide-andconquer scheme to be about one order of magnitude faster than the sequential algorithm for all relevant compression thresholds.As shown in Fig. 7(a) for this coupling, both methods reproduce exact results for small thresholds ϵ = 10 −9 , while they incur comparable numerical errors for large thresholds ϵ = 10 −6 .
Increasing the coupling to η = 0.04, the computation time required for the divide-and-conquer scheme is still shorter than for the sequential algorithm when the threshold ϵ remains large enough.However, the divideand-conquer computation time increases faster with decreasing threshold than the time for the sequential algorithm.This increase eliminates the computational advantage over the sequential approach at about ϵ < ∼ 10 −9 .This increase in computation time corresponds to an increase in the maximal inner dimension of the PT-MPO at intermediate steps of the algorithm, as is shown in Fig. 7(d).From this figure we may note that the bond dimension resulting from the divide-and-conquer calculation is notably larger than for the sequential algorithm at the same cutoff.This suggests some inefficiency in how the truncation is applied in the divide-and-conquer algorithm, and so it may be possible to improve performance further with more sophisticated approaches to truncation [89,90].
Increasing the coupling further to η = 0.1, we again see a crossover in computation times on reducing ϵ, but in this case this occurs at about ϵ = 3 × 10 −6 .Once again, this increased computation time is associated with a larger maximal inner dimension.Moreover for this coupling strength, Fig. 7(b) shows that the convergence behavior of the divide-and-conquer algorithm can be less regular than that of the sequential algorithm: we see large errors at late time.This is due to the fact that the preselection of products of singular values does not provide the locally optimal low-rank approximation, as discussed in detail in appendix A. There, we also demonstrate that this can be combatted by using a finer threshold for the selection process, characterized by the ratio r s = ϵ select /ϵ forward .Setting r s = 0.1 in simulations in Fig. 7(b), we see the results of the divide-and-conquer approach match those of the sequential algorithm at ϵ = 10 −6 .We note however that neither algorithm has converged to the exact result for such a value of ϵ.
Moreover we find that the choice of r s = 0.1 can lead to an order-of-magnitude speedup of the divide-andconquer algorithm for simulations with strong systemenvironment coupling η = 0.1 depicted in Fig. 7(c), rendering the computation time of the divide-and-conquer scheme close to that of the sequential algorithm, even for the most challenging parameter regime with small truncation thresholds.
The identification of the main drawback of our divideand-conquer algorithm suggests that even more challenging open quantum systems with strong systemenvironment coupling and very long memory times can be tackled by developing and implementing higher-quality low-rank approximations for the combination of PT-MPO blocks with large inner dimensions.

IV. SUMMARY
We have introduced an approach for numerically exact simulations of non-Markovian open quantum systems with a scaling advantage over established techniques.Most techniques that account for non-Markovian effects scale quadratically O(n 2 ) with the number of times steps n, or linearly O(nn c ) if the memory can be truncated after n c time steps.In contrast, we arrive at a method that calculates the PT-MPO-a compact representation of the Feynman-Vernon influence functional-in O(n log n) or O(n c log n c ) rank reducing operations for situations without and with memory truncation, respectively.Once the PT-MPO is obtained, the system dynamics are simulated with linear complexity O(n) with a small prefactor.We achieve this enhancement by exploiting the high degree of self-similarity in the tensor network representing the influence functional, which enables two novel developments: a divide-and-conquer strategy to contract PT-MPOs with a large number of time steps, and the construction of periodic PTs with blocks that can be repeated indefinitely.To demonstrate that the theoretical scaling advantage can be realized in practice, we have applied our approach to several examples of challenging multiscale problems of technological interest, and compared it with the established sequential PT-MPO calculation scheme introduced by Jørgensen and Pollock [49].
For calculations of the fluorescence specturm of a semiconductor QD strongly coupled to a superohmic phonon bath, we indeed find a good agreement between the theoretically expected scaling behavior.We observe sublinear scaling over a wide range of practically relevant propagation times with n = 10 3 to n = 10 6 time steps.The combination of divide-and-conquer with periodic PTs enables well converged calculations of the resonance fluorescence spectra of driven QDs, which requires simulations of quantum dynamics over millions of time steps.Our method produces these spectra within minutes, whereas the current standard PT algorithm of Ref. [49] would require weeks to months to achieve the same results.
In the example of superradiant emitters, we demonstrate several additional aspects enabled by our algorithm: First, the possibility of resolving many time steps facilitates the treatment of problems in quantum optics beyond the RWA.The PT-MPO approach allows combination of multiple environments, which facilitates investigations of the breakdown of superradiance with additional couplings to vibrational environments.Such numerically exact studies including effects of local environments are relevant for current efforts by several research groups to realize cooperative emission [82,91] in spectrally tunable semiconductor QDs [83][84][85], as well as in organic systems [92].The ability of the divide-andconquer scheme to explicitly model both optical decay and phonon dephasing promises tremendous acceleration of investigations of a whole class of challenging scenarios, for example, where photonic environments as well as phonon baths are strongly structured and non-additive cross-interactions between different environments are important [93,94].
While our first two examples clearly show the significant practical utility of our novel approach, a detailed analysis of performance hints at a potential disadvantage of the divide-and-conquer algorithm.It requires the combination PTs with large inner dimensions.We have shown how to significantly reduce the cost associated with this by preselection degrees of freedom based on SVDs of the individual PTs.This combination strategy however does not lead to a locally optimal rank reduction like a SVD of the combined PT.This can result in the appearance of a larger number singular values above the compression threshold than are found for the PT-MPO in the sequential algorithm of Ref. [49].Consequently, the inner dimension χ of the PT may be increased, and thereby the computation time needed to perform each SVD.This limitation of the divide-and-conquer approach is illustrated in our final example: free coherence decay with a model spectral density consisting of a narrow peak.This choice is motivated by the expectation of large inner dimensions χ, which are controlled by the bath coupling strength in such a case.Indeed, we find that on increasing the coupling strength, there is a cross-over from a regime where the divide-and-conquer algorithm outperforms the sequential algorithm to a regime where the sequential algorithm performs better.Simple strategies to mitigate the extra singular values appearing in the divide-and-conquer scheme, e.g., by using different compression thresholds at different steps in the algorithm, are promising.
Such numerical experiments support a picture that the optimal strategy for PT-MPO calculations is a balance between reducing the number of SVDs (or alternative rank-reducing operations) and reducing the numerical effort of each individual operation.The divideand-conquer algorithms provides a scaling advantage for the number of operations, but at the cost of increasing the effort of each operation.It should however be noted that the increased inner PT dimension of the divide-andconquer algorithm is not due to the divide-and-conquer strategy per se: it is a consequence of the suboptimal compression after joining two PT-MPOs with large inner dimensions.It would thus be worthwhile for future work to explore other strategies to combine large PT-MPOs, e.g., alternate criteria for selecting singular values, or other methods for low-rank matrix approximations [89,90].Alternatively, one may also construct approaches that interpolate between the sequential and the divide-and-conquer approach.For example, merging fixed-size blocks of the tensor network instead of single lines in the sequential algorithm, or by sweeping and compressing PT-MPOs multiple times after combining into blocks, and allowing a different compression threshold for each sweep.We also note that periodic PTs can be readily calculated starting from blocks obtained with the sequential algorithm.This could lead to sizeable reduction of computation time when propagating open quantum systems coupled to environments with finite memory times but very strong system-environment couplings.
Finally, it is noteworthy that, even though we have implemented the divide-and-conquer approach only for bosonic environments, as defined in Eq. ( 1), in principle it can be extended to the more general class of Gaussian environments [86,93], which includes, e.g., fermionic environments of impurity problems [95,96].Moreover, PTs describing environments coupled to small quantum systems can be reused for simulations for which the system of interest is coupled to another subsystem.Extending the system Hilbert space is accommodated simply by modifying the outer bonds of the PT-MPO [57,70].Thus, the PT-MPO calculated in our first example can be readily employed to obtain spectra of QDs embedded in microcavities, which is another current topic of interest [97].A variant of this approach was used in Ref. [13] to investigate cooperative emission from two QDs, each additionally strongly coupled to a local phonon environment, providing numerically exact predictions including possible cross-interactions between different environments.Furthermore, Ref. [56] delivers a proof of principle for scaling up PT-MPO-based numerically exact approaches to small networks of open quantum systems.Given this wider context, progress in methods for constructing PT-MPOs has immediate implications for a large class of topical problems in open quantum systems.
After submission of our manuscript, a related preprint [98] by Link et al. showed how periodic PTs can be also be calculated using iTEBD methods.as in Fig. 4, threshold ϵ forward = ϵ = 10 −12 , time step ∆t = 0.01 ps, and total propagation time t e = 20.48ps.
First of all, we find that reducing only the backward sweep threshold ratio r b already results in faster PT-MPO construction with optimal ratio around r b ∼ 0.7.The computation time can be further reduced by simultaneously decreasing the threshold used for preselection via r s where less than half the computation time is required for the choice r s = r b = 0.2 compared to equal thresholds r s = r b = 1.In contrast, here, decreasing only r s does not lead to a better overall performance.This suggests that, in the example studied here, the preselection is already nearly optimal in selecting relevant degrees of freedom, while the MPO after selection and combination deviates considerably from its canonical form.The former can be explained by the fact that what dominates the computation time are the last few iterations of the divideand-conquer algorithm, where the inner dimensions are largest.In this example, this involves the combination of elements that are shifted by a number of time steps of the order of the memory time n c , where the new block overlaps with the long tail of the MPO with elements that are nearly independent of the outer index α l , as [b αi,αj l ] → 1 for l → n c , irrespective of the value of α i and α j .In this case, it can be expected that the difference between Eq. (A10) and Eq.(A11) becomes less relevant.Moreover, we attribute the observation that decreasing r s along with r b is more efficient than the decrease of r b alone to consistency: When ϵ select < ϵ backward , terms are eliminated in a not locally optimal way during the preselection that would otherwise allow the canonicalization during the backward sweep to result in a more compact and accurate form.
A somewhat different picture unfolds for the example of the spin-boson model with strongly peaked spectral density with coupling strength η = 0.01 and total propagation time t e = 8 discussed in section III C, for which we plot the computation time for varying threshold ratios r b and r s in Fig. 8b.There, decreasing only r b increases the computation time, while decreasing r s is found to be more efficient.The main difference to the situation in Fig. 8a is that in the parameter regime considered in Fig. 8b the final time t e is still well within the memory time of the environment, and the preselection combines MPO matrices whose elements depend significantly on the outer indices.As such, optimal compression of Eq. (A10) does not necessarly compress Eq. (A11) optimally, and the preselection based on products of singular values is not a perfect proxy for the SVD of the product of the MPO matrices.As small r s increases the inner bond dimension after the selection process, it appears that truncating sooner rather than later, i.e. already during the backward sweep by keeping r b = 1, is the best strategy to minimize the overall computation time in this situation.
To test the hypothesis that the truncation strategy should be chosen differently depending on whether the total propagation time exceeds the physical memory time of the environment, we show in Fig. 8c the computation times for simulations as in Fig. 8b but with larger total propagation time t e = 64.In this case, we find indeed that the optimal strategy is to reduce both threshold ratios r s and r b , as the situation is more comparable to the one in Fig. 8a.

Appendix B: Convergence of fluorescence spectra
To demonstrate that calculating fluorescence spectra of quantum dots in the strong driving limit indeed puts strong requirements on the convergence parameters, we compare in Fig. 9 the spectra obtained for hΩ = 5 meV in Fig. 4e with simulations where a single convergence parameter is varied.
The impact of a too coarse time discretization can be observed in Fig. 9a and b, where panel a shows the spectra on logarithmic scale while panel b shows the left Mollow peak enlarged and on a linear scale.The overall memory time n c ∆t = 20.48ps is kept fixed, i.e. n c is varied along with the time steps ∆t.For larger time steps ∆t = 320 fs, one already finds many features of the spectra well reproduced on the logarithmic scale.Note, however, that sizable deviations are still found on the linear scale in panel b for ∆t = 160, suggesting that our choice for ∆t = 10 fs in the main text is indeed of the required order of magnitude for convergence.In additional calculations (not shown), we kept this time step ∆t = 10 fs while varying the memory cut-off n c .We found the algorithm to lead to fastest results for n c = 2 048 memory time steps, i.e., no gain is made by truncating the memory earlier.
In Fig. 9c and d, we show that reducing the overall number of time steps from n = 2 21 to n = 2 19 results in oscillatory artefacts in the spectra, as the total propagation time t e = n∆t is too short for all itinerant excitations in the two-time correlation function to decay.One can also see on the linear scale in panel d that the low-energy peak is not sampled sufficiently.
Finally, the impact of the compression threshold is analyzed in Fig. 9e and f.This seems less severe, as it mainly leads to small blips on top of the otherwise well reproduced spectra.Note, however, that we find deviations of the two-time correlation functions in the time domain by < ∼ 10 −4 when comparing simulations with ϵ = 10 −12 to those with ϵ = 10 −11 (not shown).This indicates that such small thresholds are required for well converged dynamics, yet small errors in temporal data may be averaged out by the Fourier transform.
In any case, the chosen convergence parameters ∆t = 10 fs, n c = 2 11 , and n = 2 21 used for the simulations in Fig. 4 are indeed not too far from the minimal requirement for well converged spectra.In this appendix we discuss how one may consider Eq. ( 27) as a "reverse rotating wave approximation" of the standard model of superradiance, that uses the Tavis-Cummings model.In this picture we consider using the reverse RWA to convert a system-environment coupling that is not of the product form required to match Eq. ( 1) into such a form.In such a picture, we consider ω R as a convergence parameter that determines the quality of the approximation.
We start by summarising the standard RWA, starting from Eq. (27).Such an approximation is valid if the typical value of the emitter frequencies, ω j , is much larger than the couplings g k .The RWA is achieved by first applying a time-dependent unitary to change to a rotating frame, Here ω R is a reference frequency, typically chosen to be similar to the emitter frequencies ω j .with The RWA is completed by neglecting the counterrotating terms, i.e. the terms oscillating with frequency ±2ω R , which is justified if the reference frequency ω R is much larger than any other frequency in the system.The impact of these terms on the dynamics on a coarsegrained time scale T is of the order Eq.(C5) has the advantage that only comparatively slow oscillations with frequencies of the order ωj and g k have to be resolved; fast oscillations with frequencies ω R are eliminated.One may also note that after this process, the value ω R does not appear: as long as the RWA is valid, models with different values of ω R will all map to the same approximate model.We may now consider this process in reverse.If one requires a solution of the multi-mode Tavis-Cummings model in Eq. (C5), a reverse rotating wave approximation can be applied by introducing a reference frequency ω R , which now takes to role of an additional convergence parameter.As seen in Eq. (C4), the error introduced by the reverse RWA can be systematically reduced by increasing the frequency ω R .Our divide-and-conquer algorithm is key to dealing with the small time steps ∆t needed to resolve the fast oscillations ω R ∆t ≪ 1.As noted above, the value of ω R does not enter the calculation as long as it is large enough.This means that one may also use the above arguments to model systems where the physical ω R is too large even for the divide-and-conquer approach.A RWA can be made to eliminate the original common frequency, and then the reverse RWA made to recover a model of the form of Eq. 1, allowing the use of the PT-MPO approaches described in this paper.

FIG. 2 .
FIG. 2. (a): MPO combination and compression by backward sweep used in the sequential algorithm [see Fig. 1(b)].The units enclosed by the dashed lines are combined into a single matrix.A truncated SVD U P † P ΣV † is performed, where P denotes the projection onto the space spanned by singular vectors with singular values σ ≥ ϵσ0, which is shown as a triangle reducing thick connection (with large dimensions) to thin connection (with smaller dimensions).The dotted lines indicate units which are combined to form the nodes in the updated PT-MPO.The part P ΣV † is passed onto the nodes corresponding to the previous time step, before another SVD is performed.This process is repeated until the first time step is reached.(b): MPO combination and index preselection for the divide-and-conquer method [see Fig.1(d)].A forward sweep with truncated SVDs U ΣP † P V † is performed on the last matrix of the bottom PT-MPO block before it overlaps with matrices of the top PT-MPO block.The corresponding matrices U Σ are passed onto the next MPO matrix.SVDs are performed at the top and bottom blocks independently.Singular values from both SVDs are used to identify the projectors onto the subspaces spanned by singular vectors for which σ (1) σ(2) ≥ ϵσ (1) 0 σ

4 FIG. 3 .
FIG. 3. (a)A tensor network with many time steps, n ≫ nc, is decomposed into blocks of length nc.The parts of the tensor network highlighted in orange are identical.(b) The first 2nc rows of the tensor network can be subdivided into 4 blocks, where blocks of the same colour are identical.The combination of a blue block stacked on top of a green block forms an orange block, which can be used as a fundamental unit of a periodic PT and can then be repeated indefinitely, as depicted in (a).This is due to the fact that the left and right interfaces of the orange block, marked by the light blue arrow in (c), seamlessly link together, even after MPO compression of the first nc rows.

T = 4 KFIG. 4 .
FIG. 4. (a) Cartoon of fluorescence experiment: The external laser drive dresses ground and excited state of the QD.In the absence of a phonon environment, photon emission can be understood in terms of optical transitions between laser-dressed states.Black arrows indicate transitions contributing to photons at the central frequency of the two-level system, whereas red and blue arrows depict transitions detuned by −Ω and Ω from the central frequency, respectively.(b) Superohmic phonon spectral density of a QD with electron radius ae = 4 nm.(c)-(e) Fluorescence spectra of driven QDs with different driving strengths Ω, shown on a logarithmic intensity scale.Insets depict the same spectra on a linear scale.The simulations were performed over n = 2 21 time steps of size ∆t = 0.01 ps using periodic PTs with memory cut-off at nc = 2 11 time steps.
(b).The corresponding polaron shift or reorganization energy ∞ 0 dωJ(ω)/ω ≈ 0.072 meV is absorbed into the definition of the excited state energy.The remaining physical and convergence parameters for our simulations using divide-and-conquer combined with periodic PTs are the initial phonon temperature T = 4 K, time discretization ∆t = 0.01 ps, memory time t mem = n c ∆t with n c = 2 048, total propagation time t e = n∆t with n = 2 21 ≈ 2 × 10 6 , and MPO compression threshold ϵ = 10 −12 .

FIG. 5 .
FIG.5.Computation time as a function of the total number of simulation time steps n, on a double-logarithmic scale.Results are shown for the sequential algorithm, the divideand-conquer scheme with different ratios r b = rs between compression thresholds for backward and forward sweeps, and the periodic PT approach starting from a divide-and-conquer calculation with r b = rs = 0.2.For reference, linear and quadratic scaling with n are depicted as dashed dark gray and dotted light gray lines, respectively.

77 KFIG. 6 .
FIG. 6.(a) Cartoon showing N = 5 two-level quantum emitters confined to a region much smaller than the wavelength of the emitted light.If the emitters are also spectrally indistinguishable, they effectively couple to the electromagnetic field as a single emitter, resulting in superradiant emission.(b) Superradiant transitions between states in the Dicke ladder.(c) Emitted intensity vs time for N = 1 . . . 5 superradiant identical emitters.Points show analytic results of rate equations (see text), lines represent divide-and-conquer PT simulations without memory truncation.The inset shows the spectral density used to model radiative decay at rate κ.This is flat around the central frequency ωR with smooth edges.(d) Breakdown of superradiance for N = 3 emitters upon increasing spectral distinguishability.The transition frequencies of the emitters are detuned by −∆/2, 0, and ∆/2, respectively, from the central frequency.(e) Effect of phonon baths on superradiance for N = 2 degenerate quantum dots, for a variety of temperatures T .
(b), and considering the result in the RWA (i.e.Tavis-Cummings model).The validity of the RWA is controlled by the size of ω R compared to other parameters.The Dicke states |J, M ⟩ are eigenstates of the collective angular momentum operator Ĵx/y/z = N j=1 σ

6 6 J
FIG. 7. (a) Free decay of coherences in a two-level system with spectral density JQDT (ω) [cf.Eq. (29)] for parameters η = 0.01, Ω = 10, γ = 1.JQDT (ω) is shown in the inset.Exact results for the time evolution of coherences are shown by the solid black line, and results of the different PT-MPO algorithms are indicated by points.Although an internal time step ∆t = 1/32 is used, data points are only shown at times tj = (1/4)j for clarity.(b) Time evolution for η = 0.1 and ∆t = 1/128, where the PT-MPO results are not converged.The divide-and-conquer algorithm with equal thresholds ϵ for forward and backward sweeps (shown in this case as thin green solid line) converges less regularly than the sequential algorithm; this can be ameliorated by choosing a non-unity threshold ratio rs = 0.1 in the preselection stage of the divide-and-conquer block combination.(c) Computation times of the sequential and the divide-and-conquer algorithms for different coupling strengths η as a function of the truncation threshold ϵ.(d) Maximal inner PT dimension encountered at intermediate steps in the respective algorithm.

FIG. 9 .
FIG. 9. Convergence of fluorescence spectra with respect to time step ∆t (a,b), total number of time steps n (c,d), and truncation threshold ϵ (e,f).Left panels (a,c,e) use logarithmic scale, right panels (b,d,f) show a zoom into the left peak on a linear scale.The remaining parameters are the same as for Fig. 4e.

)
With the definition of relative frequencies ωj = ω j −ω R and ωk = ω k − ω R , the RWA turns the multi-mode Dicke model of Eq. (27) into the multi-mode Tavis-