Differentiable Programming Tensor Networks

Differentiable programming is a fresh programming paradigm which composes parameterized algorithmic components and trains them using automatic differentiation (AD). The concept emerged from deep learning but is not limited to training neural networks. We present the theory and practice of programming tensor network algorithms in a fully differentiable way. By formulating a tensor network algorithm as a computation graph, one can compute higher order derivatives of the program accurately and efficiently using AD. We present essential techniques for differentiating through tensor network contractions, including stable AD for tensor decompositions and efficient backpropagation through fixed point iterations. As a demonstration, we compute the specific heat of the Ising model directly by taking the second order derivative of the free energy obtained in a tensor renormalization group calculation. Next, we perform gradient based variational optimization of infinite projected entangled pair states for the quantum antiferromagnetic Heisenberg model and obtain state-of-the-art variational energy and magnetization with moderate effort. Differentiable programming removes laborious human efforts in deriving and implementing analytical gradients for tensor network programs, which opens the door to more innovations in tensor network algorithms and applications.

One of the central problems relevant to many research directions is the optimization of tensor networks in a general setting. Despite highly successful optimization schemes for one dimensional matrix product states [23][24][25][26][27][28], optimizing tensor networks in two or higher dimensions remains challenging. The hardness is partly due to the high computational cost of tensor contractions, and partly due to the lack of an efficient optimization scheme in higher dimensional cases.
The difficulty is particularly pressing when optimizing tensor network states for infinite, translationally invariant quantum systems. In these cases, the same tensor affects the variational energy in multiple ways, which results in a highly nonlinear optimization problem. Optimization schemes based on approximate imaginary time projection [29][30][31][32][33] struggle to deal with the nonlocal dependence of the objective function. References [34,35] applied gradient based optimization and showed that it significantly improves the results. However, it is cumbersome and error prone to derive the gradients of tensor network states analytically, which involves multiple infinite series of tensors even for a simple physical Hamiltonian. This fact has limited the broad adoption of gradient based optimization of tensor network states for more complex systems. Alternative approaches, such as computing the gradient by numerical differentiation, are limited to highly symmetric cases with only a handful of variational parameters [36,37], while deriving the gradient manually using the chain rule is only manageable for relatively simple tensor network structures and contraction schemes [38].
Automatic differentiation provides an elegant and efficient solution to these problems. In this paper, we present differentiable programming techniques which compute (higher order) derivatives of tensor network programs exactly. This progress opens the door to gradient based (and even Hessian based) optimization of tensor network states in a general setting. Moreover, computing (higher order) gradients of the output of a tensor network algorithm offers a straightforward approach to physical quantities such as the specific heat and magnetic susceptibilities. The differentiable programming approach is agnostic to the detailed lattice geometry, Hamiltonian, and tensor network contraction scheme. Therefore, it is general enough to support a wide class of tensor network applications.
We will focus on applications which involve two dimensional infinite tensor networks, where the differentiable programming techniques offer significant benefits compared to conventional approaches. We apply reverse mode automatic differentiation to the computation graph of contracting infinite tensor networks. We show that after solving major technical challenges, such as numerically stable differentiation through the singular value decomposition (SVD) and memory efficient implementation of fixed point iterations, one can obtain state-of-the-art results in the variational optimization of tensor network states.
The organization of this paper is as follows. In Sec. II we introduce automatic differentiation in the context of tensor network algorithms and formulate tensor contractions as computation graphs. In Sec. III we present the key techniques for stable and scalable differentiable programming of tensor network algorithms. In Sec. IV we demonstrate the ability of the approach with applications to the classical Ising model and the quantum Heisenberg model on the infinite square lattice. Finally, we give an outlook on future research directions opened by this work in Sec. V. Our code implementation is publicly available at [39].

II. GENERAL THEORY
We first review the core idea of automatic differentiation and then explain its application to tensor network algorithms with different contraction schemes.

A. Automatic differentiation
Automatic differentiation (AD) mechanically computes derivatives of functions expressed in terms of computer programs [40]. Unlike numerical differentiation or symbolic differentiation, AD computes the value of the derivatives to machine precision without generating a symbolic expression of the gradient function. Moreover, the performance of AD has a general theoretical guarantee: its cost does not exceed the algorithmic complexity of the original program [41,42]. AD is the computational engine of modern deep learning applications [43,44]. Beyond deep learning, AD also finds applications in quantum optimal control [45] and quantum chemistry calculations [46].
Central to AD is the concept of the computation graph. A computation graph is a directed acyclic graph (DAG) composed of elementary computation steps. The nodes of the graph represent data, which can be scalars, vectors, matrices, or tensors. The graph connectivity indicates the flow of data through the computation. The simplest computation graph is a chain, shown in Fig. 1(a). Starting from, say, vector valued input parameters $\theta$, one computes a series of intermediate results until reaching the final output $L$, which we assume to be a scalar. The so called forward evaluation simply traverses the chain graph in sequential order $\theta \to T_1 \to \cdots \to T_n \to L$.
To compute the gradient of the objective function with respect to the input parameters, one can exploit the chain rule

$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial T_n}\frac{\partial T_n}{\partial T_{n-1}}\cdots\frac{\partial T_2}{\partial T_1}\frac{\partial T_1}{\partial \theta}. \tag{1}$$

Since we consider the case where the input dimension is larger than the output dimension, it is more efficient to evaluate the gradient in Eq. (1) by multiplying the terms from left to right using a series of vector-Jacobian products. In terms of the computation graph shown in Fig. 1(a), one traverses the DAG backward and propagates the gradient signal from the output back to the input. Computing the derivative this way is called reverse mode AD. This approach, commonly referred to as the backpropagation algorithm [43], is arguably the most successful method for training deep neural networks. It is instructive to introduce the adjoint variable $\bar{T} = \partial L/\partial T$ to denote the gradient of the final output $L$ with respect to the variable $T$. One sees that the reverse mode AD propagates the adjoint from $\bar{T}_n = \bar{L}\,\partial L/\partial T_n$ with $\bar{L} = 1$ all the way back to $\bar{T}_i = \bar{T}_{i+1}\,\partial T_{i+1}/\partial T_i$ for $i = n-1, \ldots, 1$, and finally computes $\bar{\theta} = \bar{T}_1\,\partial T_1/\partial \theta$. In each step, one propagates the adjoint backward via a local vector-Jacobian product.
The adjoint backpropagation picture generalizes to more complex computation graphs. In general, the DAG contains both fan-out and fan-in, which correspond to reusing data from previous steps in downstream computations. Therefore, a data node, such as $T_1$ in Fig. 1(b), can affect the final output via several different paths. In the backward pass one thus needs to accumulate all contributions from the child nodes to obtain the adjoint of a node,

$$\bar{T} = \sum_{T' \in \mathrm{child}(T)} \bar{T'}\,\frac{\partial T'}{\partial T}. \tag{2}$$

The reverse mode AD algorithm can be understood as a message passing process on the DAG. After a topological sort of the computation graph defined by the forward pass, one visits the graph backward starting from the output node with adjoint $\bar{L} = 1$. Each node collects information from its child nodes to compute its own adjoint, then passes it on to its parent nodes. Thus, one can compute the gradient with respect to all parameters in one forward and one backward pass. Typically, one caches the necessary information during the forward pass for efficient evaluation of the vector-Jacobian products in the backward pass.
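As a minimal illustration of adjoint accumulation at a fan-out node, consider the following PyTorch snippet (a toy graph of our own construction, not taken from the reference implementation [39]):

```python
import torch

# Toy DAG with fan-out: theta -> T1 -> {T2a, T2b} -> L.
theta = torch.randn(4, requires_grad=True)
T1 = torch.tanh(theta)        # T1 is reused by two downstream nodes (fan-out)
T2a = T1.sum() ** 2           # first path
T2b = (T1 ** 3).sum()         # second path
L = T2a + T2b                 # scalar output
L.backward()                  # reverse mode AD: the adjoints arriving from both
print(theta.grad)             # children of T1 accumulate, as in Eq. (2)
```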
The building blocks of a differentiable program are called primitives. The primitives can be elementary operations such as addition, multiplication, and math functions [47]. Each primitive has an associated backward function in addition to its ordinary forward function. The backward function backpropagates the adjoints according to the vector-Jacobian product Eq. (2). Note that one does not need to explicitly instantiate or store the full Jacobian matrices. Moreover, one can group many elementary computation steps together into a single primitive. For example, linear algebra operations such as matrix multiplication can be regarded as primitives. In this way, the forward pass of such a customized primitive can be treated as a black box. Designing the differentiable program in such a modular way allows one to control the level of granularity of AD. There are several advantages to crafting customized primitives for domain specific problems. First, it reduces the computation and memory cost. Second, in some cases grouping several steps together is numerically more stable. Third, one can wrap calls to external libraries into primitives without the need to track each individual computation step.
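As a concrete sketch, a customized primitive with its own backward function can be defined in PyTorch as follows (the class name and the choice of matrix multiplication are our own illustration):

```python
import torch

class MatMul(torch.autograd.Function):
    """Matrix multiplication wrapped as a primitive with a customized backward."""

    @staticmethod
    def forward(ctx, A, B):
        ctx.save_for_backward(A, B)   # cache inputs for the backward pass
        return A @ B                  # the forward pass could be a black box

    @staticmethod
    def backward(ctx, gC):            # gC is the adjoint of the output C = AB
        A, B = ctx.saved_tensors
        # local vector-Jacobian products: adjoint of A is gC B^T, of B is A^T gC
        return gC @ B.t(), A.t() @ gC

A = torch.randn(3, 4, requires_grad=True)
B = torch.randn(4, 2, requires_grad=True)
MatMul.apply(A, B).sum().backward()   # gradients flow through the primitive
```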
Modern machine learning frameworks such as TensorFlow [48] and PyTorch [49] support AD either by explicitly constructing the computation graphs or tracking the program execution order at run time. Besides math functions, one can also differentiate through control flows, loops and function calls. Building on these infrastructures, differentiable programming is emerging as a new programming paradigm that emphasizes assembling differentiable components and learning the whole program via end-to-end optimization [44]. Letting machines deal with mechanical AD has greatly reduced laborious human efforts and reallocated human creativity to design more sophisticated deep learning models.
Finally, we note that it is also possible to evaluate Eq. (1) from right to left, which corresponds to the forward mode AD. The operation of the forward mode AD is akin to the perturbation theory. One can compute the objective function and the gradient in a single forward pass without storing any intermediate results. However, the forward mode AD is not favorable for computation graphs whose input dimension is much larger than the output dimension [44]. Therefore, the majority of deep learning work employs the reverse mode AD. For the same reason, we focus on the reverse mode AD for tensor network programs.

B. Computation graphs of tensor network contractions
A tensor network program maps input tensors to an output, which we assume to be a scalar. Depending on the context, the input tensors may represent a classical partition function or a quantum wavefunction, and the output can be various physical quantities of interest. It is conceptually straightforward to apply AD to a tensor network program by expressing the computation, and in particular the tensor network contraction, as a computation graph.
At this point, it is worth distinguishing between exact and approximate contraction schemes for tensor networks. Exact approaches, such as the ones used for counting problems and for simulating quantum computing [50,51], in general exhibit exponential scaling with the problem size. It is straightforward to apply AD to these cases since they only involve tensor index permutations and contractions.
We will focus on the less trivial cases of differentiating through approximate tensor network contractions, which typically involve truncated tensor factorizations or variational approximations. These include important tensor network applications which show great advantages over other methods [1][2][3]. In particular, we are interested in contracting infinite tensor networks, where the fundamental data structure is the bulk tensor. The contraction schemes loosely fall into three categories: the ones based on coarse graining transformations [52][53][54][55][56][57][58], the ones based on the corner transfer matrix [59][60][61], and the ones based on matrix product states [2,24,62]. Since the last two contraction schemes are closely related [2], in the following we focus on AD for the tensor renormalization group (Sec. II B 1) and the corner transfer matrix renormalization group (Sec. II B 2), respectively.

Tensor renormalization group
Tensor renormalization group (TRG) contracts the tensor network by factorizing and blocking the bulk tensors iteratively [52]. Figure 2(a) shows one step of the TRG iteration as a computation graph, which includes:

1. Split the bulk tensor in two ways using SVD, where the singular values and vectors are truncated to a prescribed bond dimension χ.
2. Assemble the four 3-leg tensors generated in the last step into a new 4-leg tensor.

After this contraction, we obtain a new bulk tensor for the next iteration. The TRG method grows the lattice size exponentially fast, so one quickly reaches the thermodynamic limit after a few tens of iterations. Note that for numerical stability one needs to rescale the tensor elements after each iteration. The computational cost of the TRG method scales as $O(\chi^6)$ and the memory cost scales as $O(\chi^4)$. After unrolling the iterations, the computation graph of the TRG method is similar to the simple chain graph shown in Fig. 1(a). Within each iteration step, the basic operations are tensor index permutations, truncated SVD, and tensor contractions. Since each of these operations is differentiable, one can backpropagate through the TRG procedure to compute the derivative of a downstream objective function with respect to the input tensor.
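The following PyTorch snippet sketches one TRG step; the leg ordering and the einsum contraction pattern are illustrative conventions of ours, not the reference implementation [39]:

```python
import torch

def trg_step(T, chi):
    """One TRG coarse-graining step for a 4-leg bulk tensor T with legs (u, l, d, r)."""
    D = T.shape[0]

    def split(M):
        # truncated SVD of a (D*D, D*D) matrix into two 3-leg tensors
        U, S, Vh = torch.linalg.svd(M, full_matrices=False)
        k = min(chi, S.shape[0])
        s = torch.sqrt(S[:k])
        return (U[:, :k] * s).reshape(D, D, k), (s[:, None] * Vh[:k]).t().reshape(D, D, k)

    # split the bulk tensor in two ways
    S1, S3 = split(T.reshape(D * D, D * D))                       # group (u,l) | (d,r)
    S2, S4 = split(T.permute(1, 2, 3, 0).reshape(D * D, D * D))   # group (l,d) | (r,u)
    # assemble the four 3-leg tensors on a plaquette; the four truncated
    # legs become the (u, l, d, r) legs of the coarse-grained tensor
    T = torch.einsum('xyu,yzl,zwd,wxr->uldr', S1, S2, S3, S4)
    return T / T.abs().max()   # rescale for numerical stability
```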

Corner transfer matrix renormalization group
The computation graph of the corner transfer matrix renormalization group (CTMRG) [59] has a more interesting topology. The goal of a CTMRG calculation is to obtain converged corner and edge tensors which represent the environment degrees of freedom of the bulk tensor.
In cases where the bulk tensor has the full symmetry of the square lattice, one CTMRG iteration, shown in Fig. 2(b), consists of the following steps:

1. Contract the bulk tensor with the corner and edge tensors to form an enlarged 4-leg corner tensor.
2. Perform a truncated SVD of the 4-leg tensor, keeping singular vectors up to the cutoff bond dimension χ, and use the truncated singular matrix as an isometric projector.
3. Apply the isometry to the 4-leg tensor from the first step to obtain a new corner tensor.
4. Apply the same isometry to obtain a new edge tensor for the next step.

One iterates this procedure until convergence. Note that the same bulk tensor with bond dimension d appears in every step of the CTMRG iteration. For this reason, the converged environment tensors depend on the bulk tensor in a complicated way.
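A sketch of one such iteration in PyTorch, under assumed index conventions for the corner C, edge E, and bulk tensor T (again our own illustration rather than the reference implementation [39]):

```python
import torch

def ctmrg_step(T, C, E, chi):
    """One CTMRG step for a fully symmetric bulk tensor T of shape (d,d,d,d),
    corner C of shape (x,x), and edge E of shape (x,d,x)."""
    d, x = T.shape[0], C.shape[0]
    # 1. enlarged corner: contract C with two edge tensors and the bulk tensor
    M = torch.einsum('iua,ab,bvj,upvq->ipjq', E, C, E, T).reshape(x * d, x * d)
    M = 0.5 * (M + M.t())                      # symmetrize
    # 2. truncated SVD gives the isometric projector
    U, S, _ = torch.linalg.svd(M)
    k = min(chi, S.shape[0])
    P = U[:, :k]
    # 3. project the enlarged corner to obtain the new corner tensor
    C_new = P.t() @ M @ P
    # 4. absorb one bulk tensor into the edge and project with the same isometry
    P3 = P.reshape(x, d, k)
    E_new = torch.einsum('xui,xwy,wupv,yvj->ipj', P3, E, T, P3)
    return C_new / C_new.abs().max(), E_new / E_new.abs().max()
```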
Unlike the TRG method [52], the CTMRG approach grows the system size linearly, so one may need more iterations to reach convergence. On the other hand, the computational cost $O(d^3\chi^3)$ and memory cost $O(d^2\chi^2)$ of CTMRG are smaller than those of TRG in terms of the cutoff bond dimension.

III. TECHNICAL INGREDIENTS
To compute gradients of a tensor network program using reverse mode AD, one needs to trace the composition of the primitive functions and propagate the adjoint information backward on the computation graph. Thankfully, modern differentiable programming frameworks [48,49,63,64] have taken care of tracing and backpropagation for their basic data structure, differentiable tensors, automatically.
What one needs to focus on is identifying suitable primitives of tensor network programs and defining their vector-Jacobian products. The key components of tensor network algorithms are matrix and tensor algebra, and there are established results on backpropagation through these operations [65][66][67]. First of all, it is straightforward to wrap all BLAS routines as primitives with customized backward functions. Next, although less trivial, one can also derive backward rules for many LAPACK routines such as the eigensolver, SVD, and QR factorization [65]. By treating these linear algebra operations as primitives, one can rely on efficient implementations of matrix libraries in a differentiable program.
There are, however, a few practical obstacles to a stable and scalable implementation of differentiable tensor network programs. First, the backward functions of the eigensolver and the SVD may face numerical instabilities in the presence of degenerate eigenvalues or singular values. Second, reverse mode AD may incur a large memory consumption, which prevents one from reaching the same bond dimension as an ordinary tensor network program. We present solutions to these problems below.

A. Stable backward through linear algebra operations
We present several key results on matrix derivatives of linear algebra operations that are relevant to tensor network algorithms. Recalling the modular nature of reverse mode AD, one only needs to specify the local backward function to integrate these components into a differentiable program. We comment on their connections to the physics literature and pay special attention to numerically stable implementations [39]. For more information, one can refer to [65][66][67].

Symmetric eigensolver
The forward pass reads $A = UDU^T$, where the diagonal matrix $D$ contains the eigenvalues and each column of the orthogonal matrix $U$ is a corresponding eigenvector. This is a fan-out in the computation graph from $A$ to $U$ and $D$.
In the backward pass, given the adjoints $\bar{U}$ and $\bar{D}$, we have [65,67]

$$\bar{A} = U\left[\bar{D} + F\circ(U^T\bar{U})\right]U^T, \tag{3}$$

where $F_{ij} = (D_j - D_i)^{-1}$ if $i \neq j$ and zero otherwise. The symbol $\circ$ denotes the element-wise Hadamard product. One can readily check that the gradient is also a symmetric matrix.
Equation (3) can be regarded as "reverse" perturbation theory. When the downstream calculation does not depend on the eigenvectors, i.e. $\bar{U} = 0$, the backward equation is related to the celebrated Hellmann-Feynman theorem [68], which connects the perturbation of the Hamiltonian to that of its eigenvalues. Reference [69] applied this special case of Eq. (3) to inverse Hamiltonian design based on energy spectra.
The appearance of the eigenvalue difference in the denominator of $F$ is reminiscent of first order nondegenerate perturbation theory. Reference [46], concerned about the stability of the backward pass through the eigensolver, turned to the less efficient forward mode AD for variational Hartree-Fock calculations of quantum chemistry problems. We found that by using a Lorentzian broadening $1/x \to x/(x^2 + \varepsilon)$ with $\varepsilon = 10^{-12}$, one can stabilize the calculation at the cost of introducing a small error in the gradient; see also [67].
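A minimal sketch of Eq. (3) with the Lorentzian broadening, written as standalone functions (the function names are ours; the reference implementation [39] may differ in details):

```python
import torch

def safe_inverse(x, eps=1e-12):
    """Lorentzian-broadened inverse: 1/x -> x / (x^2 + eps)."""
    return x / (x ** 2 + eps)

def eigh_backward(U, D, gU, gD):
    """Adjoint of the symmetric eigendecomposition A = U diag(D) U^T,
    given the adjoints gU, gD of the eigenvectors and eigenvalues."""
    F = safe_inverse(D[None, :] - D[:, None])   # F_ij ~ 1/(D_j - D_i)
    F.fill_diagonal_(0.0)                        # zero for i = j
    gA = U @ (torch.diag(gD) + F * (U.t() @ gU)) @ U.t()
    return 0.5 * (gA + gA.t())                   # symmetrize, since A is symmetric
```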

Singular value decomposition
A ubiquitous operation in tensor network algorithms is the matrix SVD, which is used for the canonicalization and factorization of tensor networks [1,2,25]. The forward pass reads $A = UDV^T$, where $A$ is of size $(m, n)$, $U$ and $V^T$ are of sizes $(m, k)$ and $(k, n)$ respectively, and $k = \min(m, n)$. In the reverse mode AD, given the adjoints $\bar{U}$, $\bar{D}$ and $\bar{V}$, one obtains [66]

$$\bar{A} = \frac{1}{2} U\left[F_+\circ(U^T\bar{U}-\bar{U}^T U) + F_-\circ(V^T\bar{V}-\bar{V}^T V)\right]V^T + U\bar{D}V^T + (I - UU^T)\bar{U}D^{-1}V^T + UD^{-1}\bar{V}^T(I - VV^T), \tag{4}$$

where $[F_\pm]_{ij} = \frac{1}{D_j - D_i} \pm \frac{1}{D_j + D_i}$ for $i \neq j$ and zero otherwise. To prevent numerical issues in the case of degenerate singular values, we use the same Lorentzian broadening as in Sec. III A 1 for the $1/(D_j - D_i)$ term, which works well in our experience. In practice, for variational tensor network calculations starting from random tensors, the chance of encountering exactly degenerate singular values is small. And even if this happens, applying the broadening is a reasonable remedy. For the cases of degeneracy due to intrinsic reasons [54,56,70,71], one will still obtain the correct gradient as long as the end-to-end gradient is well defined. Lastly, inverting the vanishing singular values in Eq. (4) is not a concern, since the corresponding subspace is usually truncated.
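The corresponding backward function can be sketched as follows; for brevity we assume a square, full-rank $A$, so the last two terms of Eq. (4) vanish (again our own illustrative code, not the reference implementation [39]):

```python
import torch

def safe_inverse(x, eps=1e-12):
    return x / (x ** 2 + eps)   # Lorentzian broadening, as in Sec. III A 1

def svd_backward(U, S, V, gU, gS, gV):
    """Adjoint of A = U diag(S) V^T for square full-rank A, following Eq. (4)."""
    k = S.shape[0]
    off = 1.0 - torch.eye(k)    # mask that zeroes the diagonal i = j
    Fp = (safe_inverse(S[None, :] - S[:, None]) + 1.0 / (S[None, :] + S[:, None])) * off
    Fm = (safe_inverse(S[None, :] - S[:, None]) - 1.0 / (S[None, :] + S[:, None])) * off
    J = Fp * (U.t() @ gU - gU.t() @ U)   # antisymmetric combination from U
    K = Fm * (V.t() @ gV - gV.t() @ V)   # antisymmetric combination from V
    return U @ (0.5 * (J + K) + torch.diag(gS)) @ V.t()
```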

QR factorization
QR factorization is often used for the canonicalization of tensor networks [1,2,25]. In the forward pass, one factorizes $A = QR$, where $Q^TQ = I$ and $R$ is an upper triangular matrix [72]. Depending on the dimensions $(m, n)$ of the matrix $A$, there are two cases for the backward function.
For $m \geq n$, $R$ is an $n\times n$ matrix. The backward pass reads [67]

$$\bar{A} = \left[\bar{Q} + Q\,\mathrm{copyltu}(M)\right]R^{-T}, \tag{5}$$

where $M = R\bar{R}^T - \bar{Q}^TQ$, and the copyltu function generates a symmetric matrix by copying the lower triangle of the input matrix to its upper triangle, $[\mathrm{copyltu}(M)]_{ij} = M_{\max(i,j),\min(i,j)}$. The multiplication by $R^{-T}$ can be handled by solving a linear system with a triangular coefficient matrix.

For the case of $m < n$, $Q$ is of size $(m, m)$ and $R$ is an $m\times n$ matrix. We denote $A = (X, Y)$ and $R = (U, V)$, where $X$ and $U$ are full rank square matrices of size $(m, m)$. The decomposition can be separated into two steps: first, $X = QU$ uniquely determines $Q$ and $U$; then one calculates $V = Q^TY$. Applying the chain rule, the backward rule gives

$$\bar{A} = \left(\left[\bar{Q} + Y\bar{V}^T + Q\,\mathrm{copyltu}(M)\right]U^{-T},\; Q\bar{V}\right), \tag{6}$$

where $M = U\bar{U}^T - (\bar{Q} + Y\bar{V}^T)^TQ$.
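A sketch of the $m \geq n$ case in PyTorch (our own illustration; the triangular solve implements the multiplication by $R^{-T}$):

```python
import torch

def copyltu(M):
    """Symmetrize M by copying its lower triangle onto the upper triangle."""
    L = torch.tril(M)
    return L + L.t() - torch.diag(torch.diagonal(M))

def qr_backward(Q, R, gQ, gR):
    """Adjoint of A = QR for a tall matrix A (m >= n), following Eq. (5)."""
    M = R @ gR.t() - gQ.t() @ Q
    B = gQ + Q @ copyltu(M)
    # multiply by R^{-T} via a triangular solve:  gA R^T = B  <=>  R gA^T = B^T
    return torch.linalg.solve_triangular(R, B.t(), upper=True).t()
```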

B. Memory efficient reverse mode AD with checkpointing function
A straightforward implementation of the reverse mode AD for tensor networks has a large memory overhead. This is because one needs to store all intermediate results in the forward pass for evaluating the vector-Jacobian products in the backward pass. The number of stored variables is related to the level of granularity in the AD implementation. In any case, the memory consumption of reverse mode AD will be proportional to the depth of the computation graph. This is particularly worrying for tensor networks with large bond dimensions and a large number of renormalization iterations.
The solution to the memory issue of reverse mode AD is a well known technique called checkpointing [47,73]. The idea is to trade computation time for memory usage. Taking the chain computation graph of Eq. (1) as an example, one stores the tensors only every few steps in the forward pass. In the backward pass, one recomputes the intermediate tensors whenever needed by running the corresponding small segment of the computation graph forward. In this way, one can greatly reduce the memory usage at no more than twice the computational effort.
At a deeper level, checkpointing amounts to defining customized primitives which encapsulate large parts of the computation graph. These primitives have their own special backward rules, which locally rerun the forward pass and then backpropagate the adjoints. Therefore, in the forward pass one does not need to cache the internal states of these checkpointed primitives.
Checkpointing is a general strategy that applies to computation graphs of any topology. When applied to tensor network algorithms, it is natural to regard the renormalization steps shown in Fig. 2 as checkpointed primitives. In this way, one avoids storing the large intermediate tensors within each step of the forward pass.
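In PyTorch this amounts to wrapping each renormalization step with the checkpoint utility; for example (with a toy stand-in step, purely for illustration):

```python
import torch
from torch.utils.checkpoint import checkpoint

def rg_step(T):
    # stand-in for one renormalization step (e.g. a TRG or CTMRG iteration)
    T = T @ T
    return T / T.abs().max()

T = torch.randn(50, 50, requires_grad=True)
for _ in range(30):
    # internal tensors of rg_step are not cached; they are recomputed
    # from the checkpointed inputs during the backward pass
    T = checkpoint(rg_step, T, use_reentrant=False)
T.norm().backward()
```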

C. Backward through fixed point iteration
Fixed point iteration is a recurring pattern in tensor network algorithms. For example, one iterates the function $T_{i+1} = f(T_i, \theta)$ until reaching a converged tensor $T^*$ and uses it in downstream calculations. To compute the gradient with respect to the parameters $\theta$, one can certainly unroll the iteration into a deep computation graph and directly apply reverse mode AD. However, this approach has the drawback of consuming a large amount of memory if many iterations are needed to reach the fixed point.
One can solve this problem by using the implicit function theorem on the fixed point equation [74]. Taking the derivative on both sides of $T^* = f(T^*, \theta)$, we have

$$\frac{\partial T^*}{\partial \theta} = \left[1 - \frac{\partial f(T^*, \theta)}{\partial T^*}\right]^{-1}\frac{\partial f(T^*, \theta)}{\partial \theta} = \sum_{n=0}^{\infty}\left[\frac{\partial f(T^*, \theta)}{\partial T^*}\right]^{n}\frac{\partial f(T^*, \theta)}{\partial \theta}. \tag{7}$$

The second equality expands the matrix inverse in the square bracket as a geometric series. Therefore, to backpropagate through a fixed point iteration, the basic operation is just the vector-Jacobian product involving the single-step iteration function. One iterates in the backward function to accumulate the gradient $\bar{\theta}$ until convergence. The geometric series shows the same convergence rate towards the fixed point as the forward iteration [74]. Since the backward iteration is now independent of the forward steps, one avoids caching a long chain of intermediate results from the forward iteration.

Many tensor network contraction schemes, including the CTMRG method reviewed in Sec. II B 2, fall into the framework of fixed point iterations. Thus, one can use Eq. (7) to backpropagate through the CTMRG iteration, where the iteration is over the RG step shown in Fig. 2(b). Similar to the checkpointing technique of Sec. III B, Eq. (7) also reduces the memory usage of reverse mode AD, since one does not need to cache anything during the forward evaluation. Moreover, since the downstream objective function is independent of how the fixed point tensor is obtained, one can also exploit accelerated convergence schemes in the forward process [75]. When applying this approach, one should take care to fix the redundant gauge degrees of freedom of the tensors (such as an overall sign), so that the tensors indeed converge to a fixed point. We note that computing gradients of tensor network contractions in this way is similar to the analytical gradients derived in Refs. [34,35], which also contain a summation of a geometric series [76].
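The backward accumulation of the geometric series in Eq. (7) can be sketched with PyTorch's low-level autograd API (our own illustration; f is any differentiable single-step iteration function):

```python
import torch

def fixed_point_backward(f, Tstar, theta, gT, tol=1e-10, maxiter=500):
    """Accumulate the adjoint of theta at the fixed point T* = f(T*, theta):
    gtheta = gT . sum_n [df/dT*]^n df/dtheta, following Eq. (7)."""
    Tstar = Tstar.detach().requires_grad_(True)
    theta = theta.detach().requires_grad_(True)
    fT = f(Tstar, theta)                 # build the graph of one iteration step
    v, gtheta = gT, torch.zeros_like(theta)
    for _ in range(maxiter):
        # vector-Jacobian product of the current series term with df/dtheta
        gth, = torch.autograd.grad(fT, theta, grad_outputs=v, retain_graph=True)
        gtheta = gtheta + gth
        # propagate v through df/dT* to obtain the next term of the series
        v, = torch.autograd.grad(fT, Tstar, grad_outputs=v, retain_graph=True)
        if v.norm() < tol:               # converges at the forward iteration's rate
            break
    return gtheta
```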

D. Higher order derivatives
Since the gradient itself can be expressed as a computation graph, one can compute second order derivatives by applying AD to that graph again. In this way, one can in principle compute arbitrary higher order derivatives of a program using AD [47]. Deep learning frameworks [48,49,63] support computing higher order derivatives out of the box.
The ability to compute higher order derivatives enables Hessian-based optimization of tensor network states, such as Newton's method [77]. However, computing and inverting the full Hessian matrix explicitly could be prohibitively expensive, and is unnecessary. One can efficiently compute the Hessian-vector product $\sum_j \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}\, x_j = \frac{\partial}{\partial \theta_i}\left(\sum_j \frac{\partial L}{\partial \theta_j}\, x_j\right)$ without constructing the Hessian matrix explicitly [78]. This is sufficient for the iterative linear equation solvers used in Newton's method.
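In PyTorch, the Hessian-vector product reduces to two applications of autograd (shown here with a toy objective of our own choosing):

```python
import torch

theta = torch.randn(10, requires_grad=True)
x = torch.randn(10)                    # the vector to be multiplied by the Hessian

L = (theta ** 4).sum()                 # a toy scalar objective
g, = torch.autograd.grad(L, theta, create_graph=True)   # gradient, still differentiable
Hx, = torch.autograd.grad(g @ x, theta)                  # Hessian-vector product
```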

IV. APPLICATIONS
We present two applications to show the versatility of differentiable programming of tensor networks for statistical physics and quantum many-body problems.

Figure 3. Energy density and specific heat of the 2D Ising model, computed via reverse mode AD of the free energy after 30 TRG iteration steps with cutoff bond dimension χ = 30. Solid lines are exact solutions [79].

A. Higher order derivative of the free energy
Consider the Ising model on the square lattice at inverse temperature β. Its partition function can be expressed as a two dimensional tensor network with bond dimension $D = 2$,

$$Z = \mathrm{tTr}\prod_{i} T_{u_i l_i d_i r_i}, \tag{8}$$

where tTr denotes the tensor trace which contracts all shared bonds of the network. The bulk tensor is [80]

$$T_{uldr} = \frac{1}{4}\left[1 + (-1)^{u+l+d+r}\right]\sqrt{\lambda_u\lambda_l\lambda_d\lambda_r}, \tag{9}$$

where $\lambda_u = e^{\beta} + (-1)^u e^{-\beta}$ and $u, l, d, r \in \{0, 1\}$. We contract the infinite tensor network using the TRG approach discussed in Sec. II B 1, with a cutoff bond dimension χ = 30 and 30 TRG steps. Finally, we obtain the partition function Eq. (8), and hence the free energy, by tracing out the bulk tensor.

Next, we compute physical observables such as the energy density and the specific heat by directly taking derivatives of the free energy using AD, as shown in Fig. 3. One notices that the energy density shows a kink and the specific heat exhibits a peak around the critical temperature $\beta_c = \ln(1 + \sqrt{2})/2 \approx 0.44068679$. Unlike numerical differentiation, these results are free from finite difference errors [55,81]. Accurate computation of higher order derivatives of a tensor network algorithm will be useful for investigating thermal and quantum phase transitions. We note that it is in principle possible to obtain the specific heat by directly computing the energy variance [35,82], which, however, involves a cumbersome summation of geometric series expressed in terms of tensor networks.
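The pattern of taking higher order derivatives through a differentiable contraction can be illustrated with the bulk tensor of Eq. (9) and, as a stand-in for the infinite TRG contraction, an exact contraction of a tiny 2×2 periodic lattice (a toy example of ours, not the production TRG code [39]):

```python
import torch

def bulk_tensor(beta):
    """Ising bulk tensor of Eq. (9) with bond dimension D = 2."""
    lam = torch.stack([2 * torch.cosh(beta), 2 * torch.sinh(beta)])
    s = torch.sqrt(lam)
    idx = torch.arange(2)
    parity = (1 + (-1) ** (idx[:, None, None, None] + idx[None, :, None, None]
                           + idx[None, None, :, None] + idx[None, None, None, :])) / 2
    return parity * torch.einsum('u,l,d,r->uldr', s, s, s, s) / 2

beta = torch.tensor(0.4, requires_grad=True)
T = bulk_tensor(beta)
# exact contraction of a 2x2 torus stands in for the infinite TRG contraction
lnZ = torch.log(torch.einsum('pqrs,tsvq,rwpx,vxtw->', T, T, T, T))
dlnZ, = torch.autograd.grad(lnZ, beta, create_graph=True)  # first derivative
d2lnZ, = torch.autograd.grad(dlnZ, beta)                   # second derivative
energy = -dlnZ / 4                                         # energy per site
C = beta.detach() ** 2 * d2lnZ / 4                         # specific heat per site
```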
There are alternative ways to compute the specific heat with AD. For example, one can directly compute the energy as an expectation value evaluated with an impurity tensor, and then take a first order derivative to obtain the specific heat. Or, one can use forward mode AD, since there is only one input parameter β to be differentiated. We have purposely chosen the present approach to show the power of differentiable programming with reverse mode AD. Backpropagating through the whole TRG procedure, and in particular the SVD, allows one to compute physical observables using higher order derivatives. It is remarkable that this works at all, given the many degenerate singular values due to the Z₂ symmetry of the Ising model [46]. To obtain correct physical results, it is crucial to implement the SVD backward function in a numerically stable way, as explained in Sec. III A 2.

Figure 4. (a) Relative error of the iPEPS variational energy compared with Refs. [34,35]. The accuracy is measured relative to the extrapolated quantum Monte Carlo (QMC) result [83]. (b) A comparison of the staggered magnetization, where the dashed line is the extrapolated QMC result [83]. The simple and full update reference data are also from Ref. [34].

B. Gradient based optimization of iPEPS
We consider a variational study of the square lattice antiferromagnetic Heisenberg model with the Hamiltonian

$$H = \sum_{\langle ij\rangle}\mathbf{S}_i\cdot\mathbf{S}_j. \tag{10}$$

We take an infinite projected entangled pair state (iPEPS) as the variational ansatz. The variational parameters are the elements of the iPEPS tensor

$$A^s_{uldr}, \tag{11}$$

where $s$ denotes the physical index, and the remaining indices $u, l, d, r$ label the virtual degrees of freedom with bond dimension $D$. We initialize the tensor elements with random Gaussian variables. The overlap of the iPEPS forms a tensor network whose bulk tensor is the double layer tensor with bond dimension $d = D^2$,

$$T_{(uu')(ll')(dd')(rr')} = \sum_s A^s_{uldr}\left(A^s_{u'l'd'r'}\right)^{*}. \tag{12}$$

To contract the infinite tensor network formed by this bulk tensor, we use the CTMRG method reviewed in Sec. II B 2. We initialize the corner and edge tensors by partially tracing out legs of the bulk tensor, then perform the CTMRG iteration until the corner and edge tensors converge. After the contraction, we can evaluate the expected energy ⟨ψ|H|ψ⟩/⟨ψ|ψ⟩. Due to the translational invariance of the problem, it is sufficient to consider the expected energy on a single bond, Eq. (13), in which the Hamiltonian operator (the black rectangle) acts on the two physical indices of the bond. We have performed a basis rotation of the Hamiltonian so that the ground state has a single site unit cell. We use cutoff bond dimensions χ = 30, 50, 80, 100, 144, 160 for D = 2, 3, . . . , 7, respectively. Since the expected energy decreases with the cutoff [36,84], the approximate CTMRG contraction gives an upper bound to the variational energy.

The expected energy, Eq. (13), has both an explicit and an implicit dependence on the variational parameters of Eq. (11), the latter via the corner and edge tensors. We compute the gradient of Eq. (13) with respect to the single layer tensor of Eq. (11) using AD, which automatically resolves the intricate structure of the computation graph in Fig. 2(b). The gradient computation takes a time comparable to the forward evaluation of the expected energy. We then optimize the iPEPS using the quasi-Newton L-BFGS algorithm [77] with the automatically computed gradient, and quickly reach an optimum after a few hundred function and gradient evaluations.

Figure 4(a) shows the relative error in energy compared to the extrapolated quantum Monte Carlo (QMC) result [83] for various bond dimensions. The accuracy of the ground state energy is comparable to the state-of-the-art results [34,35], which were shown to be more accurate than imaginary time projection based simple and full update algorithms [29][30][31][32][33] [85]. Our optimization scheme is essentially equivalent to the ones of [34,35], except that we employ reverse mode AD for the gradient computation. Note that both [35] and our ansatz contain only half of the variational parameters of the one in [34], so the energy results are slightly higher than Ref. [34] at D = 2, 3. However, for larger bond dimensions D = 4, 5, 6, 7, our calculation reaches the lowest reported variational energy for the infinite square lattice Heisenberg model. Figure 4(b) shows the staggered magnetization measured on the optimized state, which approaches the extrapolated QMC result at larger bond dimensions.
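At a high level, the optimization loop looks as follows, where `expected_energy` is a hypothetical placeholder for the full differentiable CTMRG contraction and bond-energy evaluation described above (the actual implementation is in [39]; here a smooth toy function stands in so the snippet runs end to end):

```python
import torch

def expected_energy(A):
    # stand-in for the differentiable CTMRG contraction and Eq. (13)
    return (A ** 2).sum() - torch.log1p((A ** 4).sum())

D = 3                                                  # virtual bond dimension
A = torch.randn(2, D, D, D, D, requires_grad=True)     # physical leg s and u, l, d, r
optimizer = torch.optim.LBFGS([A], max_iter=20)

def closure():
    optimizer.zero_grad()
    E = expected_energy(A / A.norm())   # AD through the whole contraction
    E.backward()
    return E

for _ in range(10):                     # a few hundred evaluations in total
    optimizer.step(closure)
```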
To obtain results for bond dimensions D > 4, we need to employ either the checkpointing technique of Sec. III B or the fixed point iteration of Sec. III C to keep the memory budget low enough to fit into a single GPU card with 12 GB of memory. It is rather encouraging that with moderate effort one can reach state-of-the-art performance in variationally optimizing iPEPS [34,35]. This success is also a nontrivial demonstration that one can indeed stabilize reverse mode AD for the linear algebra operations appearing in scientific computation [46].
We note that the present approach applies equally well to finite systems, problems with larger unit cells, more complex Hamiltonians [86][87][88][89], and more sophisticated contraction schemes with improved efficiency [84], which promises to deliver new physical results for quantum many-body problems.

V. DISCUSSIONS
Computing gradients via automatic differentiation significantly boosts the power of existing tensor network algorithms. Researchers can focus on the core tensor network contraction algorithms without worrying about tedious gradient calculations. The computational complexity of the AD pass is of the same order as the forward contraction of the tensor network.
In this paper, we have focused on high level applications of AD that differentiate through whole contraction algorithms for optimizing tensor networks and computing physical observables. The same techniques also apply to lower level tasks, such as finding optimal truncation bases or variational transformations of tensor networks [58]. Moreover, besides optimizing the expected energy of quantum problems, the approach is also relevant to the variational contraction of tensor networks [2,90]. We expect differentiable programming techniques to become an integral part of the standard tensor network toolbox.
A bonus of implementing tensor network programs with deep learning frameworks [48,49] is that one readily enjoys GPU acceleration. The calculations in this work were done with a single Nvidia P100 GPU card. Pushing this line of research further, we envision that it will be rewarding to deploy tensor network algorithms on emerging specialized hardware.
Finally, it is useful to comment on the difference between AD for tensor networks and for neural networks. Typical neural network architectures do not involve sophisticated linear algebra operations. However, with the development of tensorized neural networks [91] and the application of various tensor networks to machine learning problems [10][11][12][13][14],[92], the boundary between the two classes of networks is blurring. The results presented in this paper will thus also be relevant to tensor network machine learning applications as they move toward more sophisticated contraction schemes.