Density Matrix Renormalization Group with Tensor Processing Units

Google's Tensor Processing Units (TPUs) are integrated circuits specifically built to accelerate and scale up machine learning workloads. They can perform fast distributed matrix multiplications and therefore be repurposed for other computationally intensive tasks. In this work we demonstrate the use of TPUs for accelerating and scaling up the density matrix renormalization group (DMRG), a powerful numerical approach to compute the ground state of a local quantum many-body Hamiltonian. The cost of DMRG scales with system size $N$ as $O(ND^3)$, where the so-called bond dimension $D$ regulates how expressive the underlying matrix product state (MPS) variational ansatz is. We consider lattice models in two spatial dimensions, with square lattices of size $10\times 10$ (free fermions) and $20\times 20$ (transverse field Ising model), for which the required MPS bond dimension is known to scale at least as $\exp(\sqrt{N})$. Using half of a TPU v3 pod (namely $1,\!024$ TPU v3 cores) we reached an unprecedentedly large bond dimension $D = 2^{16} = 65,\!536$, for which optimizing a single MPS tensor took about 2 minutes.

DMRG is an optimization method over a matrix product state (MPS) [50][51][52][53][54], which is a powerful variational ansatz in the form of a one-dimensional tensor network. For a system of size $N$ (where $N$ denotes e.g. the number of sites in a lattice model or the number of electronic orbitals in a molecule) the MPS is made of $N$ tensors and the computational cost of DMRG scales as $O(ND^3)$. Here, the so-called bond dimension $D$ determines how much quantum entanglement the MPS is capable of accounting for, and may depend on the system size, that is, $D = D(N)$. For instance, to accurately represent a generic wavefunction, the bond dimension must grow as $D \sim \exp(N)$, making the MPS optimization as expensive as a direct, brute-force ground state computation. Fortunately, most ground states of local Hamiltonians in $d$ spatial dimensions contain restricted amounts of entanglement according to the so-called area law (with possible logarithmic corrections). Both the remarkable success of DMRG for one-dimensional systems and its challenges in higher dimensions can be understood as a direct consequence of this area law.
Such applications of DMRG in $d > 1$ dimensions, while exceedingly difficult and computationally expensive, are also tremendously important in order to account for exotic quantum effects that other, less expensive methods fail to capture. Recent years have seen a growing effort to understand how modern computer architectures, in particular HPC clusters, can be used to speed up such computations [26,55-59]. In this work we show that Google's Tensor Processing Units (TPUs) [60,61], originally developed for machine learning workloads but more recently applied to other computational tasks [62-70], can be leveraged to perform, within hours, large-scale DMRG calculations of 2d quantum systems that would otherwise take many months to finish on conventional shared-memory hardware with up to a few dozen CPU cores. We demonstrate the approach on 2d square lattice models of sizes $10 \times 10$ and $20 \times 20$. To the best of our knowledge, the largest bond dimension employed, $D = 2^{16} = 65{,}536$, sets a new record. (This was achieved using only $1{,}024$ TPU v3 cores, namely half of a TPU v3 pod, and without exploiting internal symmetries in the MPS representation; there is therefore significant room for further increasing $D$, see the discussion section.) These results herald a new age for DMRG and, more generally, tensor network methods, with the potential to transform the computational landscape in all research areas where such techniques are applied, from condensed matter to quantum chemistry, materials science and machine learning.
The paper is organized as follows: in Sect. II we review some relevant aspects of the DMRG algorithm, the MPS ansatz and the entanglement area law; in Sect. III we briefly describe TPUs; in Sect. IV we introduce the strategy used to distribute DMRG on TPUs, including data distribution, a necessary out-of-core approach, and distributed tensor contractions; in Sect. V we present benchmark results using two models on a square lattice: free fermions and the transverse field Ising model; we conclude the paper with a summary and discussion in Sect. VI. We also include Appendices A–C with additional technical details of our DMRG implementation.

II. DENSITY MATRIX RENORMALIZATION GROUP
In this section we present a brief review of the Density Matrix Renormalization Group (DMRG) algorithm [1] and the matrix product state (MPS) ansatz [50], as well as other relevant background material.
We consider a lattice system made of $N$ sites, with each site described by a vector space of finite dimension $q$ and orthonormal basis $\{|i\rangle\}$, $i = 1, 2, \ldots, q$. For instance, with $q = 2$, each site is represented by a two-dimensional vector space with orthonormal basis $\{|1\rangle, |2\rangle\}$, corresponding e.g. to the empty/occupied fermionic states $\{|0\rangle, |1\rangle\}$ if each site represents a spinless fermionic degree of freedom, or to the spin-up/spin-down states $\{|\!\uparrow\rangle, |\!\downarrow\rangle\}$ if each site represents a spin-$\frac{1}{2}$ quantum spin degree of freedom, as in the two examples used later on in this paper. The many-body wavefunction of the lattice system then reads
$$|\psi\rangle = \sum_{i_1, i_2, \ldots, i_N} \psi_{i_1 i_2 \cdots i_N} \, |i_1 i_2 \cdots i_N\rangle, \qquad (1)$$
where $\psi_{i_1 i_2 \cdots i_N}$ denotes $q^N$ (possibly complex) amplitudes and $|i_1 i_2 \cdots i_N\rangle$ stands for the product basis $|i_1\rangle \otimes |i_2\rangle \otimes \cdots \otimes |i_N\rangle$ of the $q^N$-dimensional vector space of the $N$ sites. Similarly, the local many-body Hamiltonian expressed in the same basis reads
$$H = \sum_{i_1, \ldots, i_N} \sum_{j_1, \ldots, j_N} H_{i_1 \cdots i_N, j_1 \cdots j_N} \, |i_1 \cdots i_N\rangle \langle j_1 \cdots j_N|, \qquad (2)$$
although a more natural, efficient expression is as a sum of local terms. Our goal is to compute an accurate approximation to the ground state $|\psi_{\rm GS}\rangle$ of the lattice Hamiltonian $H$, without explicitly storing the wavefunction amplitudes in (1), which would incur a computational cost exponential in the system size $N$.

A. Matrix Product Decompositions
For that purpose, one can use the Density Matrix Renormalization Group (DMRG) algorithm [1], which is a variational method in the space of matrix product states (MPSs) [50]. The MPS ansatz consists of a collection of $N$ order-3 tensors
$$\{M_1, M_2, \ldots, M_N\}. \qquad (3)$$
Each tensor $M_k$ has (possibly complex) components $[M_k]^{i_k}_{\alpha_{k-1}\alpha_k}$, where each index $\alpha_k$ takes $D_k$ different values (that is, $\alpha_k = 1, 2, \ldots, D_k$). In other words, for each value of $i_k \in \{1, 2, \ldots, q\}$, tensor $M_k$ defines a matrix $[M_k]^{i_k}$ of size $D_{k-1} \times D_k$ with matrix elements labeled by indices $\alpha_{k-1}$ and $\alpha_k$. Following common practice, we will refer to the index $i_k$ labeling the local basis of states as a physical index, and to the indices $\alpha_{k-1}$ and $\alpha_k$ as bond indices. The bond indices $\alpha_0$ and $\alpha_N$ will be chosen to have dimensions $D_0 = D_N = 1$, so that for fixed values of the physical indices $i_1$ and $i_N$, $[M_1]^{i_1}$ and $[M_N]^{i_N}$ are not matrices but vectors (of dimensions $D_1$ and $D_{N-1}$ and components $[M_1]^{i_1}_{1\alpha_1}$ and $[M_N]^{i_N}_{\alpha_{N-1}1}$, respectively). Given the above $N$ tensors, the MPS ansatz assumes that the $q^N$ wavefunction amplitudes in (1) can be written as
$$\psi_{i_1 i_2 \cdots i_N} = \sum_{\{\alpha\}} [M_1]^{i_1}_{1\alpha_1} [M_2]^{i_2}_{\alpha_1\alpha_2} \cdots [M_N]^{i_N}_{\alpha_{N-1}1} \qquad (4)$$
$$= [M_1]^{i_1} \cdot [M_2]^{i_2} \cdots [M_N]^{i_N}, \qquad (5)$$
where in the first line we have written explicitly both all the indices of each tensor and the sum over the matrix indices $\{\alpha\} = \alpha_1, \alpha_2, \ldots, \alpha_{N-1}$, whereas in the second line we regarded each $[M_k]^{i_k}$ as a matrix (except for $[M_1]^{i_1}$ and $[M_N]^{i_N}$, which are vectors) and used the matrix product symbol '·' to represent matrix-matrix multiplication (respectively, matrix-vector multiplication). Notice that the name 'matrix product state' of this variational ansatz comes from the fact that it expresses the wavefunction amplitudes as the matrix product (5). Intuitively, the MPS ansatz assumes that the state $|\psi\rangle$ has a restricted amount of entanglement.
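As a concrete illustration, the matrix product in Eq. (5) can be evaluated with a few lines of NumPy. The function `mps_amplitude` below is a hypothetical helper (not part of the authors' TPU code), written as a minimal sketch that assumes each tensor is stored with index order (physical, left bond, right bond):

```python
import numpy as np

def mps_amplitude(tensors, config):
    """Evaluate one amplitude psi_{i_1 ... i_N} as the matrix product of Eq. (5).

    tensors[k] has shape (q, D_k, D_{k+1}) with D_0 = D_N = 1;
    config lists the physical indices (i_1, ..., i_N), zero-based.
    """
    vec = tensors[0][config[0]]          # the row vector [M_1]^{i_1}, shape (1, D_1)
    for M, i in zip(tensors[1:], config[1:]):
        vec = vec @ M[i]                 # multiply by the matrix [M_k]^{i_k}
    return vec[0, 0]

# Tiny check: a bond-dimension-1 MPS encoding the product state |0 0 0 0>.
q, N = 2, 4
product_state = []
for _ in range(N):
    M = np.zeros((q, 1, 1))
    M[0] = 1.0                           # only the first physical value carries weight
    product_state.append(M)
```

Contracting left to right costs $O(D^2)$ per site. DMRG of course never enumerates individual amplitudes out of the $q^N$ total; the snippet only makes the index bookkeeping of Eqs. (4)-(5) explicit.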
Indeed, the so-called bond dimension $D_k$ limits how much entanglement the ansatz can represent between two parts of the system, namely between a part containing the first $k$ sites (from site 1 to site $k$) and another part containing the rest of the sites (from site $k+1$ to site $N$). In particular, the larger the bond dimension $D_k$ is, the more entanglement the MPS can account for between these two parts, and thus the more capable the ansatz is of representing entangled many-body wavefunctions. On the other hand, tensor $M_k$ contains $q \times D_{k-1} \times D_k$ variational parameters, so the cost of storing the MPS grows with the bond dimensions. Quite often, the bond dimension $D_k$ is chosen according to the rule
$$D_k = \begin{cases} q^k & k < k_1, \\ D & k_1 \le k \le k_2, \\ q^{N-k} & k > k_2, \end{cases} \qquad (6)$$
namely such that it grows exponentially with $k$ for small $k$ until it reaches some maximum allowed bond dimension $D$, and similarly with $N - k$ for large $k < N$. Given a choice of maximum bond dimension $D$, $k_1$ above is simply the smallest site index such that $D < q^{k_1}$, whereas $k_2$ is the largest site index such that $D < q^{N-k_2}$. In many applications the above prescription implies that most of the MPS tensors have size $q \times D^2$. The memory required to store the MPS then scales with the bond dimension $D$ and system size $N$ as $O(ND^2)$.
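The prescription above is equivalent to $D_k = \min(q^k, q^{N-k}, D)$, which makes the bond dimensions and the resulting parameter count easy to tabulate. The helper names below (`bond_dims`, `mps_params`) are illustrative, not from the paper:

```python
def bond_dims(N, q, D):
    # D_k = min(q^k, q^(N-k), D) for k = 0, 1, ..., N; note D_0 = D_N = 1.
    return [min(q**k, q**(N - k), D) for k in range(N + 1)]

def mps_params(N, q, D):
    # Total number of variational parameters: sum over sites of q * D_{k-1} * D_k.
    dims = bond_dims(N, q, D)
    return sum(q * dims[k] * dims[k + 1] for k in range(N))
```

For example, `bond_dims(4, 2, 4)` returns `[1, 2, 4, 2, 1]`. For $N = 100$, $q = 2$ and $D = 2^{16}$ the count is roughly $NqD^2 \approx 8.6 \times 10^{11}$ parameters, a few terabytes at single precision for the MPS alone, which illustrates the $O(ND^2)$ memory scaling that motivates the out-of-core approach of Sect. IV.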
The MPS ansatz comes with so-called gauge freedom [71-73], in that the amplitudes $\psi_{i_1 i_2 \cdots i_N}$ are invariant under the simultaneous change
$$[M_k]^{i_k} \to [M_k]^{i_k} Q_k, \qquad [M_{k+1}]^{i_{k+1}} \to Q_k^{-1} [M_{k+1}]^{i_{k+1}},$$
where $Q_k$ is any invertible $D_k \times D_k$ matrix and $Q_k^{-1}$ is its inverse. Indeed, the double replacement is easily seen to leave the matrix product (5) unchanged, so that the wavefunction amplitudes are not altered. Using this gauge freedom, we can bring the MPS (3) into the so-called central gauge [71,72] with respect to site $n$,
$$\psi_{i_1 \cdots i_N} = [A_1]^{i_1} \cdots [A_{n-1}]^{i_{n-1}} [C_n]^{i_n} [B_{n+1}]^{i_{n+1}} \cdots [B_N]^{i_N}, \qquad (7)$$
where letters $A$ and $B$ are used to denote MPS tensors that satisfy the following orthogonality constraints:
$$\sum_{i_k} \left([A_k]^{i_k}\right)^\dagger [A_k]^{i_k} = \mathbb{1}, \qquad (8)$$
$$\sum_{i_k} [B_k]^{i_k} \left([B_k]^{i_k}\right)^\dagger = \mathbb{1}. \qquad (9)$$
The central tensor $C_n$ above does not satisfy either of the two relations. The central form is important in order to both simplify, and provide numerical stability to, the MPS optimization procedure, as briefly reviewed below.
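A minimal NumPy sketch of how such gauge transformations are applied in practice: successive QR decompositions bring the leftmost tensors into left-orthogonal $A$ form, pushing the non-orthogonal factor toward the center. `shift_center_right` is an illustrative helper, not the authors' distributed routine:

```python
import numpy as np

def shift_center_right(tensors, n):
    """Bring the first n tensors (0-based indices 0..n-1) into left-orthogonal
    'A' form via QR, absorbing each R factor into the next tensor as a gauge
    change (M_k -> Q, M_{k+1} -> R . M_{k+1}), which leaves the state invariant."""
    tensors = [t.copy() for t in tensors]
    for k in range(n):
        q_dim, Dl, Dr = tensors[k].shape
        # Group (i_k, alpha_{k-1}) into rows and QR-decompose.
        Q, R = np.linalg.qr(tensors[k].reshape(q_dim * Dl, Dr))
        tensors[k] = Q.reshape(q_dim, Dl, Q.shape[1])
        # Push the gauge factor R to the right-hand neighbor.
        tensors[k + 1] = np.einsum('ab,ibc->iac', R, tensors[k + 1])
    return tensors
```

After the call, each updated tensor satisfies the left orthogonality constraint (the columns of each `Q` are orthonormal), while the overall state is unchanged, since each step only inserts an identity $Q Q^{-1}$-style factor exactly as in the gauge transformation above.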
The system's Hamiltonian $H$ in (2) can similarly be expressed in matrix product operator (MPO) [74-78] form, given in terms of a sequence of $N$ order-4 tensors
$$\{H_1, H_2, \ldots, H_N\}, \qquad (10)$$
where each tensor $H_k$ has components $[H_k]^{i_k j_k}_{m_{k-1} m_k}$. For fixed values of the physical indices $i_k$ and $j_k$, we can think of $[H_k]^{i_k j_k}$ as a $D'_{k-1} \times D'_k$ matrix, where $D'_k$ denotes the MPO bond dimension. We again refer to $i_k$ and $j_k$ as physical indices and to $m_{k-1}$ and $m_k$ as MPO bond indices. The Hamiltonian coefficients in (2) can then be written as
$$H_{i_1 \cdots i_N, j_1 \cdots j_N} = \sum_{\{m\}} [H_1]^{i_1 j_1}_{1 m_1} [H_2]^{i_2 j_2}_{m_1 m_2} \cdots [H_N]^{i_N j_N}_{m_{N-1} 1} \qquad (11)$$
$$= [H_1]^{i_1 j_1} \cdot [H_2]^{i_2 j_2} \cdots [H_N]^{i_N j_N}. \qquad (12)$$
Importantly, in this case the MPO representation is not used as a variational ansatz for the Hamiltonian $H$, but as a convenient way of exactly representing it.
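To make the MPO construction concrete, the snippet below builds the standard bond-dimension-3 MPO for a 1d transverse field Ising chain, $H = -\sum_k Z_k Z_{k+1} - B \sum_k X_k$, and contracts it back into a dense matrix for a tiny chain. The 2d MPOs used later in the paper are analogous but larger (bond dimension 22 in the Sect. V benchmarks); `tfi_mpo` and `mpo_to_matrix` are illustrative helpers:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0., 1.], [1., 0.]])
Z = np.array([[1., 0.], [0., -1.]])

def tfi_mpo(N, B):
    """Bond-dimension-3 MPO for the 1d chain H = -sum Z_k Z_{k+1} - B sum X_k.
    Tensor index order: (m_{k-1}, m_k, i_k, j_k)."""
    W = np.zeros((3, 3, 2, 2))
    W[0, 0] = I2          # "finished" channel propagates the identity
    W[1, 0] = Z           # completes a Z.Z pair started one site earlier
    W[2, 0] = -B * X      # on-site field term
    W[2, 1] = -Z          # starts a -Z.Z pair
    W[2, 2] = I2          # "nothing placed yet" channel
    first = W[2:3, :]     # left boundary: keep only the bottom row
    last = W[:, 0:1]      # right boundary: keep only the first column
    return [first] + [W] * (N - 2) + [last]

def mpo_to_matrix(mpo):
    """Contract an MPO into a dense 2^N x 2^N matrix (small N only)."""
    T = mpo[0]                                   # shape (1, m, i, j)
    for Wk in mpo[1:]:
        # contract the shared MPO bond, then merge the physical indices
        T = np.einsum('ambc,mndf->anbdcf', T, Wk)
        a, n, b, d_, c, f = T.shape
        T = T.reshape(a, n, b * d_, c * f)
    return T[0, 0]
```

Walking the three MPO channels from "nothing placed" to "finished" reproduces every local term exactly once, which is why the matrix product of MPO tensors reproduces the Hamiltonian coefficients exactly.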

B. Variational Energy Optimization
The exact ground state $|\psi_{\rm GS}\rangle$ of Hamiltonian $H$ is the state $|\psi\rangle$ that minimizes the expectation value of the energy, $E(|\psi\rangle) \equiv \langle\psi|H|\psi\rangle / \langle\psi|\psi\rangle$. Accordingly, an MPS approximation to the ground state $|\psi_{\rm GS}\rangle$ is obtained by minimizing the energy
$$E(|\psi_{\rm MPS}\rangle) \equiv \frac{\langle\psi_{\rm MPS}|H|\psi_{\rm MPS}\rangle}{\langle\psi_{\rm MPS}|\psi_{\rm MPS}\rangle} \qquad (13)$$
over the set of states $|\psi_{\rm MPS}\rangle$ that can be written as an MPS (for some fixed choice of bond dimensions $\{D_k\}$),
$$|\psi_{\rm MPS}^{\rm opt}\rangle = \underset{|\psi_{\rm MPS}\rangle}{\operatorname{arg\,min}}\; E(|\psi_{\rm MPS}\rangle). \qquad (14)$$
In the following we outline the main steps of the DMRG algorithm, which aims at obtaining this optimal MPS, and refer the reader to the literature [1,72] for more details. Starting from some initial state $|\psi_{\rm MPS}\rangle$ given by a set of MPS tensors $\{M_k\}$ in (3), the DMRG algorithm attempts to minimize the energy $E(|\psi_{\rm MPS}\rangle)$ in (13) by iteratively optimizing one MPS tensor at a time. More concretely, we first optimize the tensor on site $n = 1$, then the tensor on site $n = 2$, and so on, all the way to the tensor on site $n = N$, in what is known as a forward sweep. That is followed by a backward sweep, progressing from the last site to the first one. For a given site $n$, the optimization proceeds as follows. First, the MPS is written in the central gauge for that site, namely as in Eq. (7). Then the central tensor $C_n$ is replaced with a new tensor $C'_n$, chosen so as to optimally lower $E(|\psi_{\rm MPS}\rangle)$ while keeping the rest of the MPS tensors constant. This is accomplished by diagonalizing a linear operator using a Krylov-space method such as the Lanczos method (see Fig. 1 and text below for more details).
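The local update relies on a Krylov-space eigensolver driven purely by a matrix-vector product. The function below is a textbook Lanczos tridiagonalization used as a stand-in for the effective-Hamiltonian matvec; it omits reorthogonalization, so it is a sketch rather than a production solver, and `lanczos_ground_state` is a hypothetical name:

```python
import numpy as np

def lanczos_ground_state(matvec, v0, krylov_dim=10):
    """Build a Krylov tridiagonalization of a Hermitian operator (given only
    as a matvec) and diagonalize it to approximate the lowest eigenpair,
    as used for the local C_n -> C'_n update in DMRG."""
    vecs = [v0 / np.linalg.norm(v0)]
    alphas, betas = [], []
    for k in range(krylov_dim):
        w = matvec(vecs[-1])
        alpha = np.vdot(vecs[-1], w).real
        alphas.append(alpha)
        w = w - alpha * vecs[-1]
        if k > 0:
            w = w - betas[-1] * vecs[-2]      # three-term recurrence
        beta = np.linalg.norm(w)
        if beta < 1e-12:                      # invariant subspace found
            break
        betas.append(beta)
        vecs.append(w / beta)
    # Diagonalize the small tridiagonal matrix T.
    m = len(alphas)
    T = np.diag(alphas) + np.diag(betas[:m - 1], 1) + np.diag(betas[:m - 1], -1)
    evals, evecs = np.linalg.eigh(T)
    ground = sum(c * v for c, v in zip(evecs[:, 0], vecs))
    return evals[0], ground
```

In the paper's setting `matvec` is the application of the effective Hamiltonian to the central tensor (the distributed contraction of Sect. IV C), and the Krylov dimension is small (10 in the benchmarks of Sect. V).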
The DMRG algorithm proceeds with forward and backward sweeps until either the energy $E(|\psi_{\rm MPS}\rangle)$ has converged to some desired accuracy or a maximum number of sweeps has been reached. Although ideally one would like to obtain the optimal MPS ground state approximation in Eq. (14), in practice one must settle for a reasonably converged MPS ground state approximation, also denoted $|\psi_{\rm MPS}\rangle$ in the rest of this paper.

C. Computational Cost and Area Law
For simplicity, from now on we assume a uniform MPS bond dimension $D$. Optimizing a central tensor $C_n \to C'_n$ (e.g. using Lanczos), as well as shifting the central gauge (7) from site $n$ to site $n+1$ (which can be accomplished using a polar, QR or singular value decomposition [79]), both have a computational cost $O(D^3)$, leading to a total cost per optimization sweep that scales with bond dimension $D$ and system size $N$ as $O(ND^3)$. It is important to emphasize the linear scaling in $N$ at fixed $D$, as opposed to the exponential scaling incurred by a brute-force ground-state computation using the $q^N$ amplitudes in (1). However, in order to achieve a desired accuracy with an MPS computation, the bond dimension $D$ may need to be adjusted as a function of the system size $N$, that is, in general $D = D(N)$, in which case the scaling of computational resources may no longer be linear in $N$.
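For instance, the polar option mentioned above factors the reshaped tensor as $M = UP$, with $U$ the closest isometry to $M$ and $P$ positive semi-definite; $U$ becomes the orthogonal tensor while $P$ is absorbed into the neighboring site. Below is an SVD-based sketch (the TPU implementation may instead favor matrix-multiplication-friendly iterations; `polar_split` is a hypothetical helper):

```python
import numpy as np

def polar_split(C):
    """Split the central tensor C (shape (q, D_left, D_right)) as M = U P,
    where U is isometric and P is positive semi-definite; U (reshaped back)
    is the new left-orthogonal tensor and P moves to the next site."""
    q, Dl, Dr = C.shape
    M = C.reshape(q * Dl, Dr)
    u, s, vh = np.linalg.svd(M, full_matrices=False)
    U = u @ vh                            # isometric polar factor
    P = vh.conj().T @ (s[:, None] * vh)   # positive factor, so that M = U P
    return U.reshape(q, Dl, Dr), P
```

Like QR, this costs $O(D^3)$ per shift, consistent with the sweep cost quoted above.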
In order to gain further insight into computational costs, a useful rule of thumb is that the bond dimension $D$ must grow exponentially with the entanglement entropy $S$ [80-82] of the target ground state $|\psi_{\rm GS}\rangle$ for half of the system, that is, $D \sim \exp(S)$. For a generic (non-local) Hamiltonian $H$, the half-system entropy of the ground state is expected to grow linearly in the system size, $S \sim N$, a scaling known as the entanglement volume law. In this case the bond dimension must grow exponentially with system size, $D \sim \exp(N)$, and an MPS representation offers no computational advantage over a brute-force computation. Luckily, ground states of local Hamiltonians often have a more forgiving scaling of entanglement entropy $S$ with system size $N$, known as the entanglement area law [80,82-89] (sometimes with a logarithmic correction), which justifies the use of an MPS representation and the DMRG algorithm.
Specifically, in a $d$-dimensional cubic lattice made of $N = L^d$ sites, the half-system entropy of a state obeying the area law scales as
$$S \sim L^{d-1} = N^{(d-1)/d}, \qquad (15)$$
or, in the presence of a logarithmic correction, as
$$S \sim N^{(d-1)/d} \log(N). \qquad (16)$$
The above rule of thumb then indicates that the required MPS bond dimension should scale, respectively, as
$$D \sim \exp\!\left(N^{(d-1)/d}\right) \quad \text{(area law)}, \qquad (17)$$
$$D \sim \exp\!\left(N^{(d-1)/d} \log(N)\right) \quad \text{(area law with logarithmic correction)}. \qquad (18)$$
In $d = 1$ dimensions, the ground state typically obeys an area law if the local Hamiltonian has a finite energy gap. In this case $S \sim N^0$ suggests that a constant bond dimension $D$, independent of the system size $N$, may suffice to accurately approximate the ground state $|\psi_{\rm GS}\rangle$ with an MPS, resulting in an overall computational cost linear in $N$. For gapless Hamiltonians in $d = 1$ dimensions, the ground state often exhibits a logarithmic correction to the area law, $S \sim \log(N)$, resulting in a polynomial bond dimension $D \sim {\rm poly}(N)$ and thus a computational cost that grows as some power of the system size $N$. We conclude that DMRG is efficient for ground states of $d = 1$ systems.
In $d = 2, 3$ dimensions, ground states of gapped and gapless local Hamiltonians often obey the entanglement area law, with some gapless systems (e.g. systems with a Fermi surface of dimension $d - 1$) also displaying logarithmic corrections. For such systems, the above rule of thumb indicates that DMRG has a cost that scales at least as $\exp(\sqrt{N})$ and $\exp(N^{2/3})$ in $d = 2$ and $d = 3$ dimensions, respectively. Notice that, in spite of this exponential scaling of computational costs (in a fractional power of $N$), DMRG still offers a significant advantage over the $\exp(N)$ scaling of a brute-force computation.
In conclusion, DMRG is efficient in $d = 1$ dimensions, where it is firmly established as the method of choice to compute ground states. On the other hand, DMRG scales exponentially, as $\exp(\sqrt{N})$ and $\exp(N^{2/3})$, in $d = 2$ and $d = 3$ dimensions. In spite of this unfavorable scaling, DMRG is actively used in restricted $d > 1$ geometries such as thin two-dimensional strips and cylinders [90-102] and small three-dimensional molecules [29,103-110]. In such cases, the massive computational cost of running DMRG constitutes the main roadblock to studying larger systems. As we show in this work using Tensor Processing Units, specialized hardware originally developed to accelerate and scale up machine learning workloads can be repurposed to also accelerate and scale up DMRG computations, significantly increasing the bond dimension $D$ that can be afforded. This allows us to use DMRG to more accurately address larger systems in $d > 1$ dimensions.

III. TPU CORES, BOARDS AND PODS
Tensor Processing Units (TPUs) are application-specific integrated circuits (ASICs) developed by Google specifically for large-scale machine learning applications [60,61]. In recent times, however, a growing number of papers have demonstrated their applicability to accelerating and scaling up other computationally intensive tasks, including large-scale dense linear algebra operations [64], the simulation of quantum circuits [67,68], brute-force ground state computation and dynamics simulation in quantum many-body systems [62,63,66], and quantum chemistry electronic structure computations using density functional theory [65,69,70]. In this work we focus on third-generation TPUs, denoted v3 in the following. [After completion of our work, fourth-generation TPUs, with increased compute power, were made available. The results presented in this work can be straightforwardly generalized to TPU v4.] In the third generation, eight TPU v3 cores form a TPU board, and up to 256 TPU v3 boards can be connected into a TPU pod (with 2048 TPU v3 cores).
A single TPU v3 core is equipped with two matrix-multiply units (MXUs) and 16 GB of on-chip, high-bandwidth memory (HBM). An MXU is a systolic array that natively multiplies matrices of size $128 \times 128$, using multiplication of floating-point numbers in half precision (specifically, in the brain float 16 format, or bf16) and accumulation in single precision (fp32). Using six passes through the MXU, a single TPU core can nevertheless also deliver over 10 TFLOPS of single-precision (fp32) matrix-matrix multiplication.
At the next level we find a TPU board, which is the smallest available configuration, with eight TPU v3 cores and one controlling host CPU machine. The eight cores are arranged into a 2d torus and, importantly, each core is connected to its neighbors through a fast Inter-Core Interconnect (ICI) communication link (with 656 GB/s bandwidth [61]). A TPU board has a total of 128 GB of HBM and can yield up to about 80 TFLOPS of single-precision matrix-matrix multiplication [64].
Finally, up to 256 TPU boards (that is, up to 2048 TPU cores) can be joined into a TPU v3 pod, where the cores are again arranged on a 2d torus and directly connected to their nearest neighbors with ICI links, for a total of 32 TB of HBM and nearly 20 PFLOPS of single-precision matrix-matrix multiplication [64]. One can also use a slice of a pod containing an intermediate number of TPU cores. For instance, in Fig. 4 we provide performance results and estimates for slices with 32, 128, 512 and 2048 cores. The largest TPU configuration we used in this work consisted of half a pod (that is, 1024 cores). TPUs can be programmed using XLA [111], an optimizing graph compiler that translates roughly C-like commands called HLOs into roughly assembly-like equivalents called LLOs. The HLOs themselves may be written directly, but are usually instead "traced" from any of several higher-level languages. For the DMRG work presented in this paper, we wrote the code with Jax [112], a NumPy-like interface to XLA, following the single instruction multiple data (SIMD) paradigm.

IV. DMRG ON TPUS
The performance of our large-scale implementation of DMRG on multi-core TPU configurations is based on three main points: (i) individual MPS tensors (and other auxiliary tensors) are distributed through the available TPU cores; (ii) an out-of-core approach is adopted in order to more efficiently use the 16 GB of high bandwidth memory on each TPU core; (iii) tensor contractions are accelerated through parallelization.

A. Data distribution
The largest data objects in a DMRG simulation are (a) the $N$ order-3 MPS tensors $M_n$ that contain the variational parameters of the ansatz and (b) two sets of $N$ auxiliary tensors $L_n$ and $R_n$, called left and right environment Hamiltonian tensors [1,72], which, given a choice of central site $n$, represent a contraction of MPS and MPO tensors over all sites $k < n$ and all sites $k > n$, respectively. In components, we can write these as order-3 tensors $[L_n]^{m}_{\alpha\beta}$ and $[R_n]^{m}_{\alpha\beta}$, where $m$ is an MPO bond index and $\alpha, \beta$ are MPS bond indices. Greek letters are used to denote "large" indices, such as the MPS bond indices, with size $D$ that in our implementation could potentially be scaled up to $D \sim 10^5$, whereas Roman letters are used to denote "small" indices, namely the physical index $i$, taking $q$ values (with $q = 2$ in the examples below), and the MPO bond index, of dimension $D'$, which in those examples grows up to $D' \sim 100$.
Let $T$ denote any of these order-3 tensors, with components $[T]^{i}_{\alpha\beta}$. In our implementation, we regard tensor $T$ as a collection $\{T^{i=1}, T^{i=2}, \ldots\}$ of large matrices, where each matrix $T^i$ has components $[T^i]_{\alpha\beta}$ given by $[T]^{i}_{\alpha\beta}$. Each matrix $T^i$ is then distributed across all available TPU cores in a checkerboard fashion, as shown in Fig. 1 for the case of eight TPU cores. Each matrix panel is stored in the high-bandwidth memory of the corresponding TPU core. Since in SIMD code each matrix panel is expected to have the same size, matrix dimensions are chosen such that they can be evenly divided by the grid shape of the TPU cluster. The motivation to distribute data in this way will become clear below, where tensor contractions are reduced to sequences of distributed matrix-matrix multiplications.
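In pseudocode, the checkerboard layout amounts to slicing each matrix into equally sized panels over the core grid; here a plain dictionary stands in for the cores (the real code relies on Jax's SIMD sharding, and `distribute_checkerboard` is an illustrative helper):

```python
import numpy as np

def distribute_checkerboard(mat, grid_rows, grid_cols):
    """Split a matrix into equal panels over a (grid_rows x grid_cols) grid of
    cores; both dimensions must divide evenly, as required by SIMD code."""
    rows, cols = mat.shape
    assert rows % grid_rows == 0 and cols % grid_cols == 0
    pr, pc = rows // grid_rows, cols // grid_cols
    # Panel (r, c) would live in the HBM of the core at grid position (r, c).
    return {(r, c): mat[r * pr:(r + 1) * pr, c * pc:(c + 1) * pc]
            for r in range(grid_rows) for c in range(grid_cols)}
```

For instance, an eight-core board arranged as a $2 \times 4$ grid would hold eight equal panels of each matrix $T^i$, one per core.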
B. Out-of-core approach

The memory required to store the MPS tensors and the left and right environment Hamiltonian tensors scales as $O(NqD^2)$ and $O(ND'D^2)$, respectively, where $N$ is the number of sites, $D$ the MPS bond dimension, $q$ the physical index dimension and $D'$ the MPO bond dimension. For large DMRG computations, these memory requirements can quickly become prohibitive when compared to the total available HBM on TPUs. For example, for a system of two-level quantum degrees of freedom (local dimension $q = 2$) on a square lattice made of a $10 \times 10$ grid (number of sites $N = 100$), with a local Hamiltonian consisting only of a nearest-neighbor fermionic hopping term (MPO bond dimension $D' = 22$) and with an MPS bond dimension $D = 2^{16} = 65{,}536$, the minimally required memory using single precision (4 bytes per fp32) grows to about 32 TB, which already exhausts the 32 TB of HBM available on an entire TPU v3 pod.
On the other hand, at any given time during a DMRG optimization sweep only a small number of these tensors are actually required on the TPUs. We have therefore adopted an out-of-core approach, where the bulk of the data is stored on hard drives (which are readily and cheaply available). For one optimization step of the DMRG algorithm, the necessary data is read from the hard drive into the HBM of the TPUs, where the relevant optimization step is executed; the result is then written back to the hard drive. To minimize the idling time of the TPUs, we use three simultaneous threads, overlapping the DMRG optimization on the TPUs with the reading and writing of data from and to the disk.
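This overlap of compute and disk I/O can be sketched with standard Python threads and bounded queues. The code below is a simplified stand-in for the actual scheme, with illustrative names throughout:

```python
import queue
import threading

def pipeline(steps, load, compute, store, depth=2):
    """Out-of-core sketch: a loader thread reads tensors ahead of time, the
    main loop computes, and a writer thread stores results, so that the
    accelerator rarely idles waiting for disk."""
    to_compute = queue.Queue(maxsize=depth)   # bounded: limits HBM footprint
    to_store = queue.Queue(maxsize=depth)

    def loader():
        for n in steps:
            to_compute.put((n, load(n)))
        to_compute.put(None)                  # sentinel: no more work

    def writer():
        while True:
            item = to_store.get()
            if item is None:
                break
            store(*item)

    t_load = threading.Thread(target=loader)
    t_store = threading.Thread(target=writer)
    t_load.start(); t_store.start()
    while True:                               # "TPU" thread: compute only
        item = to_compute.get()
        if item is None:
            break
        n, data = item
        to_store.put((n, compute(data)))
    to_store.put(None)
    t_load.join(); t_store.join()
```

The bounded queue depth plays the role of the limited HBM: only a couple of tensors are ever resident at once, while reads and writes for neighboring sites proceed in the background.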

C. Distributed tensor contractions
To illustrate how tensor contractions are performed in a distributed way, let us consider the contraction of the tensor network shown in the right panel of Fig. 1, which is the bottleneck operation in the DMRG algorithm. Given a site $n$, the tensor network contains the central MPS tensor $C_n$ for that site, as well as the left and right environment Hamiltonian tensors $L_n$ and $R_n$ and the MPO tensor $H_n$. Conceptually, we can think of the contraction of this tensor network as a "matrix-vector" multiplication, if we regard the central MPS tensor $C_n$ as a "vector" (an effective wavefunction on site $n$) and the remaining three tensors as a "matrix" (an effective Hamiltonian $H_{\rm eff}$ on site $n$). This effective "matrix-vector" multiplication needs to be performed several times as part of the Lanczos tridiagonalization (which computes the ground state of the effective Hamiltonian $H_{\rm eff}$ on site $n$ as part of a single DMRG optimization step).
In order to proceed, we first preprocess the MPO tensor $H_n$, collecting its non-zero entries into a list $v$ of weights together with their corresponding index values. Notice that for each non-zero value in $v$, we must perform two matrix-matrix multiplications involving matrices from tensors $L_n$, $C_n$ and $R_n$. As explained above, each of these matrices has been suitably distributed among the TPU cores. We therefore multiply them using the scalable universal matrix multiplication algorithm (SUMMA) [113], following the TPU implementation discussed in [64]. Each iteration of the loop produces a different distributed matrix, which is weighted by the corresponding weight $v_k$ and added to one of the $q$ matrices that constitute the final order-3 tensor holding the result of the tensor network contraction.
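On a single core, the loop over non-zero MPO entries looks as follows. Shapes and names are illustrative; in the distributed code each `@` below becomes a SUMMA call over the core grid:

```python
import numpy as np

def apply_heff_sparse(L, W, R, C):
    """Apply the effective Hamiltonian to the central tensor C by looping over
    the non-zero entries of the MPO tensor W (single-core sketch).

    Index conventions: L[m, a', a], R[m', b', b], W[m, m', i, j], C[i, a, b],
    with Roman indices small and Greek (here a, b) indices of size D.
    """
    out = np.zeros_like(C)
    for m, mp, i, j in zip(*np.nonzero(W)):
        v = W[m, mp, i, j]
        # two large matrix-matrix multiplications per non-zero MPO entry
        out[j] += v * (L[m] @ C[i] @ R[mp].T)
    return out
```

Each pass through the loop contributes one weighted distributed matrix to one of the $q$ output slices, exactly mirroring the accumulation described above.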
This approach is particularly appealing for highly sparse MPO tensors, as one typically finds when the MPO encodes the local Hamiltonian of a lattice model. For dense MPO tensors, as appear e.g. in some quantum chemistry applications of DMRG, a vectorized approach can be more efficient.
Another important step in the DMRG algorithm is tensor orthogonalization [72], which is traditionally implemented using a QR decomposition or a singular value decomposition. In this work we chose to perform orthogonalization using instead a polar decomposition (which, in a so-called two-site DMRG approach, can also be used for optimal tensor truncation). Further implementation details can be found in the Appendices.

V. BENCHMARK RESULTS

In order to benchmark our distributed implementation of DMRG on TPUs, we computed an MPS approximation $|\psi_{\rm MPS}\rangle$ to the ground state $|\psi_{\rm GS}\rangle$ of two different 2d square lattice models. The first one is a model of free spinless fermions on a lattice of size $10 \times 10$, which can also be solved efficiently using the free-fermion formalism, so that we have the exact solution to compare against. Its ground state displays a logarithmic correction to the area law, making this model extremely challenging from a DMRG perspective. The second model is the transverse field Ising model on a lattice of size $20 \times 20$, for which we do not have an exact solution, but other techniques can be used. The ground state of this model obeys an area law, which makes it less computationally demanding for DMRG and allows us to consider a larger lattice.
The two models analyzed in this section are already well understood. We have chosen them mostly for two reasons. On the one hand, they are challenging from a DMRG perspective and, as such, can be used to meaningfully illustrate the use of the very large bond dimensions made available by TPUs. On the other hand, such models are often also used to benchmark other methods, including quantum Monte Carlo [114] and numerical linked-cluster expansions [115], as well as other tensor network algorithms such as those based on a tree tensor network (TTN) [116], the multi-scale entanglement renormalization ansatz (MERA) [117,118] and projected entangled pair states (PEPS) [119]. Benchmarking DMRG on the same models (although for different system sizes) enables useful comparisons.

A. Free fermion model
We first consider, on a $10 \times 10$ square lattice with $N = 100$ sites, the nearest-neighbor hopping Hamiltonian
$$H_{\rm SF} = -\sum_{\langle i,j \rangle} \left( \hat{c}_i^\dagger \hat{c}_j + \hat{c}_j^\dagger \hat{c}_i \right) - \mu \sum_i \hat{c}_i^\dagger \hat{c}_i, \qquad (19)$$
with $\hat{c}_i$ (anti-commuting) fermionic annihilation operators and $\mu$ the chemical potential. This model describes a system of non-interacting (free) electrons that can hop from each site to its nearest neighbors, where the value of $\mu$ can be tuned to determine the number of electrons in the ground state (e.g. $N/2 = 50$ particles for $\mu = 0$). Using the free-fermion formalism, the quadratic Hamiltonian (19) can be numerically diagonalized with a computational cost that scales just as $O(N^3)$, instead of the generic $O(\exp(N))$ of a brute-force diagonalization. This is in contrast with the interacting case (e.g. if we added quartic terms to the above Hamiltonian), where the free-fermion formalism can no longer be used. Here, the ground-state energy from the $O(N^3)$ diagonalization will be used to assess the accuracy of the DMRG result.
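The exact reference energy can be reproduced in a few lines: build the $N \times N$ one-particle hopping matrix of Hamiltonian (19) and fill all negative-energy modes. This is a sketch assuming unit hopping amplitude and open boundary conditions; `free_fermion_ground_energy` is a hypothetical helper:

```python
import numpy as np

def free_fermion_ground_energy(Lx, Ly, mu=0.0):
    """Ground energy of the quadratic hopping Hamiltonian on an Lx x Ly
    open-boundary square lattice: diagonalize the one-particle hopping
    matrix (cost O(N^3)) and sum the negative single-particle energies."""
    N = Lx * Ly
    h = np.zeros((N, N))
    for x in range(Lx):
        for y in range(Ly):
            s = x * Ly + y
            if x + 1 < Lx:                     # horizontal bond
                h[s, s + Ly] = h[s + Ly, s] = -1.0
            if y + 1 < Ly:                     # vertical bond
                h[s, s + 1] = h[s + 1, s] = -1.0
    h -= mu * np.eye(N)
    eps = np.linalg.eigvalsh(h)
    return eps[eps < 0].sum()                  # fill all modes below zero energy
```

Calling `free_fermion_ground_energy(10, 10)` yields the kind of $O(N^3)$ reference value against which the DMRG energies of Fig. 2 can be compared.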
It is important to emphasize that, despite the lack of interactions, computing the ground state of Hamiltonian (19) remains a formidable challenge from the perspective of the DMRG algorithm. Indeed, for sufficiently small values of $|\mu|$ this Hamiltonian describes a system with a one-dimensional Fermi surface, which results in a large number of gapless excitations. As such, its ground state $|\psi_{\rm GS}\rangle$ displays a logarithmic correction to the area law [80,86], implying that an accurate MPS approximation $|\psi_{\rm MPS}\rangle$ requires a bond dimension $D$ expected to scale faster than $O(\exp(\sqrt{N}))$, see Eqs. (16) and (18) for $d = 2$. This is the strongest scaling of ground state entanglement (and bond dimension $D$) observed to naturally occur in condensed matter systems in $d = 2$ dimensions. Thus, as far as DMRG is concerned, this non-interacting (free) lattice model is no easier than a strongly interacting, strongly correlated one. Fig. 2 shows the evolution of the relative error of the DMRG ground-state energy as a function of the bond dimension $D$. At $D = 2^{16}$ the approximation achieves a relative accuracy of better than one part in a million. Note that the simulations were carried out in single precision, limiting the maximum achievable accuracy to about $10^{-7}$. To the best of our knowledge, these are the largest DMRG computations (in terms of bond dimension) to date. Results for $D = 2^{16} = 65{,}536$ were obtained within roughly 23 hours on a slice of a TPU v3 pod made of $1{,}024$ cores. We used seven sweeps and a Krylov dimension of 10 for the sparse diagonalization required to optimize each tensor.
A remark regarding internal symmetries in the MPS representation is in order. Hamiltonian $H_{\rm SF}$ commutes with the particle number operator $\hat{N} = \sum_i \hat{c}_i^\dagger \hat{c}_i$, indicating particle number conservation, an internal U(1) symmetry generated by $\hat{N}$. Thus its ground state $|\psi_{\rm GS}\rangle$ also has a well-defined particle number $N_{\rm GS}$, $\hat{N} |\psi_{\rm GS}\rangle = N_{\rm GS} |\psi_{\rm GS}\rangle$. This can be exploited in DMRG [73,120-123]: by working with MPS tensors that are symmetric under the U(1) symmetry group, we can ensure that the MPS representation has exactly the correct number $N_{\rm GS}$ of particles, $\hat{N} |\psi_{\rm MPS}\rangle = N_{\rm GS} |\psi_{\rm MPS}\rangle$. In addition, this endows each MPS tensor with a block-sparse structure that significantly reduces both the number of variational parameters to be optimized and the required computational cost.
Our current distributed implementation of DMRG on TPUs does not enforce or exploit the above model's internal U(1) symmetry. Our goal here is to benchmark the performance of DMRG in a way that is representative of a more general 2d lattice model, where such an internal symmetry may not be present (see e.g. our next example). We nevertheless foresee no obstruction to incorporating particle number conservation into our current implementation.

B. Transverse field Ising model
As a second benchmark, we have also considered the transverse field Ising model,
$$H = -\sum_{\langle i,j \rangle} \hat\sigma^z_i\, \hat\sigma^z_j - B \sum_i \hat\sigma^x_i, \qquad (20)$$
on a 20 × 20 square lattice with $N = 400$ sites. Here $\hat\sigma^x_i$ and $\hat\sigma^z_i$ are Pauli matrices and $B$ is the magnetic field strength. This model represents a system of spin-$1/2$ quantum spin degrees of freedom with ferromagnetic interaction $\hat\sigma^z_i \hat\sigma^z_j$ between nearest-neighbor spins, subject to an external transverse magnetic field. The model is invariant under spin-flip transformations (internal $Z_2$ symmetry) generated by the unitary operator $U = \prod_i \hat\sigma^x_i$, which we again do not enforce or exploit in our DMRG implementation. The model has a quantum critical point at $B_c \approx 3.04$, and thus we expect the ground state $|\psi_{\rm GS}\rangle$ at or near this value of the magnetic field to be robustly entangled. Since there is no Fermi surface, the ground state entanglement entropy scales as an area law without logarithmic corrections, as previously confirmed using other methods [114,115,117,118]. Accordingly, the bond dimension $D$ required for an accurate approximation $|\psi_{\rm MPS}\rangle$ scales as $O(\exp(\sqrt{N}))$, see Eqs. (15) and (17) for $d = 2$. This is still a very challenging computation for DMRG, but the milder scaling of the entanglement entropy (compared to the free fermion model above) allows us to consider a larger lattice. Fig. 3 shows the DMRG approximation $E(|\psi_{\rm MPS}\rangle)$ to the (unknown) ground state energy $E(|\psi_{\rm GS}\rangle)$ of the transverse field Ising Hamiltonian (20) on a 20 × 20 square lattice, at a near-critical magnetic field strength of $B = 3.0$, with the Hamiltonian exactly encoded in an MPO of bond dimension 22. The plot shows how the converged DMRG energy per site appears to saturate to a constant as we increase the bond dimension $D$, in clear analogy with Fig. 2 for the spinless fermion model, where such saturation was to the correct value of the ground state energy. While this model cannot be solved analytically, results from numerical studies using e.g. Monte Carlo or tensor network methods [119] are available, albeit not at the exact same system size. At bond dimension $D = 2^{15} = 32{,}768$ our simulations already reached the maximum accuracy achievable within single-precision arithmetic.
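The MPO encoding of the Hamiltonian can be illustrated in the familiar 1d case, where the transverse field Ising Hamiltonian admits an exact MPO of bond dimension 3 (the bond dimension 22 quoted above arises from snaking such a construction through the 2d lattice). The sketch below assumes the sign convention $H = -J\sum \hat\sigma^z_i\hat\sigma^z_{i+1} - B\sum\hat\sigma^x_i$ and verifies the MPO against a dense Kronecker-product construction on a small chain; it is an illustration, not the paper's distributed implementation.

```python
import numpy as np

# Pauli matrices and identity
I2 = np.eye(2)
X = np.array([[0.0, 1.0], [1.0, 0.0]])
Z = np.array([[1.0, 0.0], [0.0, -1.0]])

def tfim_mpo_tensor(J, B):
    """Bulk MPO tensor for H = -J sum Z_i Z_{i+1} - B sum X_i on a 1d chain.
    Index order: (left bond, right bond, physical out, physical in)."""
    W = np.zeros((3, 3, 2, 2))
    W[0, 0] = I2
    W[1, 0] = Z
    W[2, 0] = -B * X
    W[2, 1] = -J * Z
    W[2, 2] = I2
    return W

def mpo_to_dense(tensors):
    """Contract an MPO (list of (Dl, Dr, d, d) tensors) into a dense matrix."""
    res = tensors[0]
    for W in tensors[1:]:
        res = np.einsum('lrij,rsab->lsiajb', res, W)
        l, s, i, a, j, b = res.shape
        res = res.reshape(l, s, i * a, j * b)
    return res[0, 0]

def tfim_dense(N, J, B):
    """Same Hamiltonian built directly from Kronecker products."""
    def op(site_ops):
        out = np.array([[1.0]])
        for k in range(N):
            out = np.kron(out, site_ops.get(k, I2))
        return out
    H = np.zeros((2**N, 2**N))
    for k in range(N - 1):
        H += -J * op({k: Z, k + 1: Z})
    for k in range(N):
        H += -B * op({k: X})
    return H

N, J, B = 4, 1.0, 3.0
W = tfim_mpo_tensor(J, B)
# Boundary tensors: top row on the left edge, first column on the right edge.
mpo = [W[2:3]] + [W] * (N - 2) + [W[:, 0:1]]
H_mpo = mpo_to_dense(mpo)
H_exact = tfim_dense(N, J, B)
mpo_matches = np.allclose(H_mpo, H_exact)
```

The same bookkeeping, with longer-range terms tracked across one lattice row, yields the bond-dimension-22 MPO used for the 20 × 20 lattice.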

C. Scaling of runtimes
Finally, Fig. 4 shows measured and estimated runtimes for the update of a single MPS tensor, at fixed MPO bond dimension 22 and as a function of the MPS bond dimension $D$. These times include the time to perform the Lanczos tridiagonalization, the orthogonalization of the optimized tensor, and the update of one left or right block Hamiltonian.
At fixed TPU configuration (fixed color in the plot), the runtimes are seen to scale with the MPS bond dimension $D$ as $D^3$. If, on the other hand, we quadruple the number of TPU cores every time we double the bond dimension (see data points for $D = 2^{15}, 2^{16}, 2^{17}$ and $2^{18}$ on 32, 128, 512 and 2048 cores, respectively), then the runtimes grow only approximately linearly in $D$. This type of scaling is key to reaching unprecedentedly large bond dimensions with affordable run times.
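The arithmetic behind this trade-off can be made explicit with an idealized cost model, $t \propto D^3 / n_{\rm cores}$. This deliberately ignores communication and memory-bandwidth overheads, so it is only a back-of-the-envelope sketch of the scaling argument, not a model fitted to the measured runtimes in Fig. 4:

```python
# Idealized DMRG cost model: time per MPS tensor update ~ D^3 / n_cores.
# Communication overheads are ignored (an assumption), so this only
# illustrates the scaling argument, not actual TPU runtimes.

def update_time(D, n_cores, c=1.0):
    return c * D**3 / n_cores

# Fixed hardware: doubling D costs 8x more time (cubic scaling).
fixed_ratio = update_time(2**16, 32) / update_time(2**15, 32)

# Quadrupling the cores with every doubling of D: time only doubles,
# i.e. the runtime grows linearly in D.
scaled_ratio = update_time(2**16, 128) / update_time(2**15, 32)
```

Doubling $D$ multiplies the work by $2^3 = 8$; supplying $4\times$ the cores cancels all but a factor of 2, which is the linear-in-$D$ growth seen in the plot.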

VI. SUMMARY AND DISCUSSION
We have presented an implementation of the DMRG algorithm on Google's Tensor Processing Units (TPUs). Our implementation leverages the distributed accelerated hardware and high-bandwidth memory of a TPU cluster to perform DMRG simulations at unprecedented scale, speed and accuracy. We benchmarked the implementation on two problems that are notoriously difficult from a DMRG perspective, namely a system of spinless fermions on a 10 × 10 square lattice known to display a logarithmic correction to the area law, and the (near-) critical transverse field Ising model on a 20 × 20 lattice.
We performed simulations with bond dimensions of up to $D = 2^{16}$, to the best of our knowledge the largest bond dimension simulated so far. We obtained converged results at these bond dimensions in less than a day. We estimate that such simulations would take months on standard shared-memory hardware with up to a few dozen CPU cores, even using highly efficient, compiled code. Our results show that compute clusters of hardware accelerators can be leveraged very efficiently for tensor network computations. For our demonstration we used TPUs, but we emphasize that similar results could be obtained with a cluster of tightly connected Graphics Processing Units (GPUs).
There are several obvious ways to further improve the performance of our DMRG implementation on TPUs. On the one hand, at an algorithmic level, we already mentioned that one can exploit the internal symmetries of a model (e.g. the internal U(1) symmetry corresponding to particle-number conservation in the 2d free fermion model (19)). Incorporating this U(1) symmetry into our implementation of DMRG would lead to a substantial reduction of variational parameters and run times at the same (effective) bond dimension. At fixed TPU configuration, we expect this would allow us to further increase the maximal bond dimension $D$ by perhaps up to $\sim 5$–$10\times$. On the other hand, the largest TPU configuration used in this work comprised 1024 cores, or half a TPU v3 pod. Using a full TPU v3 pod with 2048 cores would result in a roughly $2\times$ faster computation at fixed bond dimension $D$; alternatively, it would allow us to increase the largest $D$ by a factor of $\sqrt{2}$. While we were conducting our simulations, Google announced the fourth generation of TPUs, which are now available. A TPU v4 pod (with 8192 TPU v4 cores) would allow for an additional $2\times$ increase of the maximal bond dimension $D$ at comparable runtimes. Similarly, a superpod of NVIDIA DGX nodes (each containing eight A100 or H100 GPUs) could be utilized to reach even larger bond dimensions.
It is worth pointing out that MPS algorithms similar to DMRG also form the basis for more sophisticated tensor network approaches, e.g. projected entangled pair states for 2d quantum lattice systems [12]. The availability of fast, large-scale MPS algorithms hence directly impacts not only DMRG but the field of tensor network algorithms as a whole.

VII. ACKNOWLEDGMENTS
The authors would like to thank Örs Legeza.

Appendix B: Tensor truncations via the polar decomposition

Our DMRG implementation does not require tensor truncations, but here we explain how to implement them for completeness.
Consider a matrix $M$ and its singular value decomposition (SVD), $M = W \cdot S \cdot V$, where $W$ and $V$ are unitary (or isometric) matrices and $S$ is a diagonal matrix with the singular values $s_\alpha$ of $M$ on its diagonal, that is $[S]_{\alpha\alpha} = s_\alpha$, organized in decreasing order $s_1 \geq s_2 \geq \cdots \geq s_m \geq 0$. In components, the SVD reads
$$[M]_{ab} = \sum_{\alpha=1}^{m} [W]_{a\alpha}\, s_\alpha\, [V]_{\alpha b}.$$
Before proceeding further, we note that in tensor network algorithms the matrix $M$ to be truncated will often result from reshaping a higher-order tensor (e.g. an order-4 tensor assigned to two adjacent sites of a lattice in a two-site DMRG update). However, the specific origin of $M$ is not important in the discussion below.
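These conventions can be checked directly against a standard SVD routine; the short sketch below uses NumPy, whose `svd` returns exactly the factors and decreasing ordering described above:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 4))

# numpy returns M = W @ diag(s) @ V with the singular values s sorted in
# decreasing order, matching [S]_{alpha alpha} = s_alpha in the text.
W, s, V = np.linalg.svd(M, full_matrices=False)

svd_reconstructs = np.allclose(W @ np.diag(s) @ V, M)
singulars_decreasing = bool(np.all(s[:-1] >= s[1:]))
factors_isometric = (np.allclose(W.T @ W, np.eye(4))
                     and np.allclose(V @ V.T, np.eye(4)))
```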
A $\delta$-truncated singular value decomposition of $M$ corresponds to the SVD of another matrix $\tilde M$ obtained by keeping only the singular values $s_\alpha$ of $M$ that are larger than $\delta$. Suppose that there are $m'$ (with $m' \leq m$) such singular values. Then $\tilde M$ is defined through
$$\tilde M \equiv \tilde W \cdot \tilde S \cdot \tilde V,$$
where $\tilde W$ and $\tilde V$ are obtained from $W$ and $V$ by keeping only their first $m'$ columns and rows, respectively, and the truncated singular value matrix $\tilde S$, of size $m' \times m'$, is the diagonal matrix containing the $m'$ largest singular values $s_\alpha$ organized in decreasing order. Matrix $\tilde M = \tilde W \cdot \tilde S \cdot \tilde V$ can then be seen to be the best rank-$m'$ approximation to $M$, in that the difference matrix $\Delta \equiv M - \tilde M$ has the smallest possible norm $||\Delta|| = \sqrt{{\rm tr}\,(\Delta \cdot \Delta^\dagger)}$. Our goal is to produce two matrices $F$ and $G$, with $m'$ columns and rows, respectively, such that their product equates $\tilde M$, that is
$$\tilde M = F \cdot G. \qquad {\rm (B7)}$$
In a tensor network algorithm, the pair $F$, $G$ corresponds to adjacent tensors where the bond index connecting them has been truncated (e.g. two adjacent MPS tensors during a two-site update in DMRG). An obvious way to obtain $F$ and $G$ is from the SVD of $M$, by choosing e.g. $F = \tilde W$ and $G = \tilde S \cdot \tilde V$. However, here we are interested in obtaining $F$ and $G$ without resorting to an SVD of matrix $M$. Remarkably, this task can be achieved with the polar decomposition which, as described in the previous appendix, can be implemented using a small set of simple matrix operations: matrix-matrix multiplications and additions, as well as matrix transposition and complex conjugation. Next we describe how.
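As a reference point for the polar-decomposition construction below, the "obvious" SVD-based choice $F = \tilde W$, $G = \tilde S \cdot \tilde V$ can be sketched as follows. The test matrix is built with hand-picked singular values (an assumption for the demonstration) so that the $\delta$-truncation keeps exactly three of them:

```python
import numpy as np

rng = np.random.default_rng(1)
# Build a test matrix with known singular values from random orthogonal factors.
Q1, _ = np.linalg.qr(rng.standard_normal((6, 6)))
Q2, _ = np.linalg.qr(rng.standard_normal((6, 6)))
s_true = np.array([3.0, 2.5, 1.5, 0.7, 0.3, 0.05])
M = Q1 @ np.diag(s_true) @ Q2

def truncated_factors(M, delta):
    """Reference delta-truncated SVD: keep the m' singular values s_alpha > delta
    and return F = W~ (m x m') and G = S~ @ V~ (m' x m), so that M~ = F @ G."""
    W, s, V = np.linalg.svd(M, full_matrices=False)
    mp = int(np.sum(s > delta))          # m', the number of kept singular values
    return W[:, :mp], np.diag(s[:mp]) @ V[:mp], s, mp

F, G, s, mp = truncated_factors(M, delta=1.0)
Mt = F @ G                               # best rank-m' approximation M~
err = np.linalg.norm(M - Mt)             # ||Delta|| in the Frobenius norm
expected_err = np.sqrt(np.sum(s[mp:] ** 2))   # sqrt of the discarded weight
```

The truncation error equals the root of the summed squares of the discarded singular values, which is the precise sense in which $\tilde M$ is the best rank-$m'$ approximation.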
As a first step, we use the polar decomposition to obtain the isometric and positive semi-definite factors $U$ and $H$ of matrix $M$ in Eq. (A6), $M = U \cdot H$. By construction, $H$ has the singular values $s_\alpha$ of $M$ as its eigenvalues. As a second step, we compute the polar decomposition of $H - \mathbb{1}\delta$. Let $U'$ and $H'$ be the resulting unitary and positive semi-definite factors,
$$H - \mathbb{1}\delta = U' \cdot H'. \qquad {\rm (B8)}$$
In general, the polar decomposition $Z = X \cdot |Z|$ of a Hermitian matrix $Z$ with (real) eigenvalues $z_\alpha$ is given in terms of a unitary matrix $X$ and a positive semi-definite matrix $|Z|$ with very simple structure: both $X$ and $|Z|$ have the same eigenvectors as $Z$; moreover, for the $\alpha$th common eigenvector, $|Z|$ has as eigenvalue the absolute value $|z_\alpha|$ of the corresponding eigenvalue $z_\alpha$ of $Z$, whereas $X$ has as eigenvalue $\sigma_\alpha = \pm 1$, with the sign such that $z_\alpha = \sigma_\alpha |z_\alpha|$. In other words, the unitary factor $U'$ in (B8) must have $m'$ eigenvalues $+1$ (for the $m'$ eigenvectors of $H - \mathbb{1}\delta$ with positive eigenvalues $s_\alpha - \delta > 0$), while the rest of its eigenvalues must be $-1$ (for the eigenvectors of $H - \mathbb{1}\delta$ with negative eigenvalues $s_\alpha - \delta < 0$). In particular, we can use $U'$ to define two projectors $P_+$ and $P_-$ onto the positive and negative subspaces of $H - \mathbb{1}\delta$ (equivalently, the subspaces of $H$ with $s_\alpha > \delta$ and with $s_\alpha < \delta$),
$$P_\pm \equiv \frac{1}{2}\left(\mathbb{1} \pm U'\right), \qquad {\rm (B9)}$$
and use them in turn to define projections $H_{>\delta}$ and $H_{<\delta}$ of matrix $H$ onto its $s_\alpha > \delta$ subspace and $s_\alpha < \delta$ subspace,
$$H_{>\delta} \equiv P_+ \cdot H \cdot P_+, \qquad H_{<\delta} \equiv P_- \cdot H \cdot P_-, \qquad {\rm (B10)}$$
such that $H = H_{>\delta} + H_{<\delta}$. We can thus write
$$M = U \cdot H_{>\delta} + U \cdot H_{<\delta},$$
where the first term $U \cdot H_{>\delta} = U \cdot P_+ \cdot H$ corresponds to the largest $m'$ singular values $s_\alpha$ of $M$. In other words, we have obtained the best rank-$m'$ approximation $\tilde M$ to $M$,
$$\tilde M = U \cdot P_+ \cdot H.$$
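The two polar-decomposition steps can be verified numerically. In the sketch below the polar factors are computed from dense SVD/eigendecomposition routines purely for illustration; on TPUs one would instead use the iterative, multiplication-only schemes of the previous appendix. The test matrix again has hand-picked singular values (an assumption) so that $\delta = 1$ keeps exactly three of them:

```python
import numpy as np

rng = np.random.default_rng(2)
Q1, _ = np.linalg.qr(rng.standard_normal((5, 5)))
Q2, _ = np.linalg.qr(rng.standard_normal((5, 5)))
s_true = np.array([3.0, 2.0, 1.5, 0.4, 0.1])
M = Q1 @ np.diag(s_true) @ Q2
delta = 1.0

# Step 1: polar decomposition M = U @ H (computed via SVD for illustration).
W, s, V = np.linalg.svd(M)
U = W @ V                                # isometric factor
H = V.conj().T @ np.diag(s) @ V          # positive semi-definite factor

# Step 2: polar decomposition of H - delta*1. Its unitary factor U' is the
# matrix sign of H - delta*1: eigenvalue +1 where s_alpha > delta, else -1.
w, Q = np.linalg.eigh(H - delta * np.eye(5))
Uprime = Q @ np.diag(np.sign(w)) @ Q.conj().T

# Projectors onto the s_alpha > delta and s_alpha < delta subspaces (B9).
P_plus = 0.5 * (np.eye(5) + Uprime)
P_minus = 0.5 * (np.eye(5) - Uprime)

# Best rank-m' approximation M~ = U @ P_+ @ H, compared against the
# conventional truncated SVD.
Mt_polar = U @ P_plus @ H
mp = int(np.sum(s > delta))
Mt_svd = W[:, :mp] @ np.diag(s[:mp]) @ V[:mp]
polar_matches_svd = np.allclose(Mt_polar, Mt_svd)
```

Note that $P_+$ commutes with $H$ (they share eigenvectors), which is why $P_+ \cdot H \cdot P_+ = P_+ \cdot H$ and the truncated matrix takes the simple form $U \cdot P_+ \cdot H$.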
However, we have not yet reduced the number of columns of $U$. For that purpose, we must find an isometry $C_+$ with $m'$ columns such that
$$P_+ = C_+ \cdot C_+^\dagger.$$
That is, we need to find an orthonormal basis for the $m'$-dimensional column space of the rank-$m'$ projector $P_+$. We achieve this using a slight modification of the standard, QR-based subspace iteration method that avoids the use of a QR decomposition (see Appendix C). Then we have $\tilde M = U \cdot P_+ \cdot H = U \cdot C_+ \cdot C_+^\dagger \cdot H = \tilde U \cdot \tilde H$, where matrices $\tilde U \equiv U \cdot C_+$ and $\tilde H \equiv C_+^\dagger \cdot H$ have $m'$ columns and rows, respectively, and therefore qualify as matrices $F$ and $G$ in the truncated decomposition (B7).
A similar approach based on the McWeeny iteration can be used to truncate to a fixed number of singular values, instead of truncating singular values below a certain threshold δ [125].

Appendix C: subspace iteration
Consider an $m \times m$ Hermitian matrix $P$ that is a rank-$m'$ projector, namely such that $P \cdot P = P$ and ${\rm tr}(P) = m'$, where we also assume that $P$ is rank deficient, meaning $m' < m$. Our goal is to find an isometric matrix $C$ of shape $m \times m'$ such that we can write $P$ as the product
$$P = C \cdot C^\dagger.$$
For that purpose we can use Alg. 3. It is a specialization (for a rank-deficient projector $P$) of the subspace iteration method, which can more generally be used to compute the $n$ dominant eigenvectors of a matrix. Specifically, we modified the standard subspace iteration method in two ways: (1) since $P^2 = P$, a single iteration is sufficient (so we skip looping over steps 4 and 5); (2) we use a polar decomposition (easier to implement in a distributed TPU setting) instead of the usual QR decomposition.
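A minimal sketch of this idea (not the paper's Alg. 3 verbatim): apply $P$ once to a random $m \times m'$ block, then orthogonalize the result with a polar decomposition. For clarity the polar step below uses an SVD, standing in for the multiplication-only polar iteration a TPU implementation would employ:

```python
import numpy as np

def polar_orthogonalize(Y):
    """Isometric factor of the polar decomposition Y = C @ H'.
    Computed via SVD for clarity; on TPUs an iterative,
    multiplication-only polar decomposition would be used instead."""
    u, _, vh = np.linalg.svd(Y, full_matrices=False)
    return u @ vh

def projector_basis(P, mp, rng):
    """Single subspace-iteration step specialized to a rank-m' projector P:
    since P @ P = P, one application of P to a random m x m' block already
    spans the column space of P; polar-orthogonalizing it gives the isometry C."""
    m = P.shape[0]
    Y = P @ rng.standard_normal((m, mp))
    return polar_orthogonalize(Y)

# Build a random rank-3 projector on a 6-dimensional space.
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
P = Q[:, :3] @ Q[:, :3].T                # rank-m' projector with m' = 3

C = projector_basis(P, 3, rng)
recovers_projector = np.allclose(C @ C.T, P)
is_isometry = np.allclose(C.T @ C, np.eye(3))
```

Because $P^2 = P$, the columns of $P \cdot G$ already lie exactly in the target subspace for any generic random block $G$, which is why no further iterations are needed.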