Neural tensor contractions and the expressive power of deep neural quantum states

We establish a direct connection between general tensor networks and deep feed-forward artificial neural networks. The core of our results is the construction of neural-network layers that efficiently perform tensor contractions and that use commonly adopted non-linear activation functions. The resulting deep networks feature a number of edges that closely matches the contraction complexity of the tensor networks to be approximated. In the context of many-body quantum states, this result establishes that neural-network states have strictly the same or higher expressive power than practically usable variational tensor networks. As an example, we show that all matrix product states can be efficiently written as neural-network states with a number of edges polynomial in the bond dimension and depth logarithmic in the system size. The opposite, however, does not hold true: our results imply that there exist quantum states that are not efficiently expressible in terms of matrix product states or PEPS, but that are efficiently expressible with neural-network states.

Introduction - Many fundamental problems in science can be formulated in terms of finding explicit representations of complex high-dimensional functions, ranging from time-dependent vector fields to normalized probability densities. In recent years, Machine Learning (ML) techniques based on deep learning [1] and artificial neural networks have established themselves as the leading numerical approach to approximating high-dimensional functions emerging in industrial applications, for example in image-recognition tasks. Owing to this success, ML methods have also been recognized as a prime computational tool to attack functional approximation problems in physics [2]. In the quantum world, one of the main theoretical challenges in describing interacting, many-body systems stems from the complexity of finding explicit representations of many-particle quantum wave functions. In this context, ML techniques based on neural-network representations of quantum states, dubbed NQS, have been introduced [3] and subsequently used in a variety of variational applications.
Several theoretical properties of NQS have been established to date. General representation theorems for neural networks [4] guarantee that sufficiently large NQS can describe arbitrary quantum states. Moreover, exact representations of many-body ground states of local Hamiltonians can be analytically found in terms of deep Boltzmann Machines [5]. Neither representation result, however, bounds the size of the corresponding NQS networks, which, in the worst case, can be exponentially large in the number of physical degrees of freedom [6].
Despite the worst-case exponential bound on NQS, examples of physically relevant quantum states that can be efficiently represented are numerous. These encompass both analytical and numerical results. On the analytical side, for example, exact and compact NQS representations of several correlated topological phases of matter are known [7-10]. On the numerical side, suitable learning algorithms have shown competitive results in finding ab-initio approximate descriptions of many systems of interest in physics [11-16] and chemistry [17-19].
An alternative paradigm for describing many-body quantum states is given by tensor network states (TNS), a representation intrinsically rooted in the notion of locality in quantum systems. These representations constitute both a key theoretical language to analyze many-body phenomena and a powerful numerical tool for simulations [20-24]. While generic TNS are widely believed to be general enough to compactly describe most physical quantum states, only a restricted subset of them is amenable to numerical calculations. A determining factor in the applicability of TNS as variational quantum states is the complexity of computing physical quantities with these representations, which is in turn related to the complexity of contracting TNS. Efficiently contractible TNS most notably encompass matrix product states (MPS) [20], a very powerful representation of low-entangled states in one dimension. Higher-dimensional TNS can in general be contracted only approximately, and rigorous complexity results have been established. For example, computing expectation values of physical quantities over planar tensor networks in two dimensions, the Projected Entangled Pair States (PEPS) [25], is a problem known to be #P-hard [26, 27].
Given the distinctive features of NQS and TNS, several works have studied possible connections between the two representations. For example, the volume-law entanglement capacity of neural networks has been established in several works [28-30]. Also, mappings between the two classes of states have been realized, including between general fully-connected NQS and MPS with exponentially large bond dimension [29]. An approach mapping MPS onto non-standard neural networks has also been introduced [31]. Despite this important theoretical progress, however, a direct mapping between generic, efficiently contractible TNS and standard NQS has not been established to date. This situation, for example, leaves open the possibility that TNS offer a general representational advantage over NQS representations [32, 33], and that there might exist compact, contractible TNS that cannot be expressed by means of compact NQS.

FIG. 1. We demonstrate a mapping from any tensor network with an efficient contraction algorithm to a compact neural network. In this figure we illustrate our coarse-grained construction of a neural-network $\epsilon$-approximation of a Matrix Product State over $N$ sites, each of $d$ degrees of freedom, and bond dimension $\chi$. The resulting neural network is of depth $\tilde{O}(\ln N + 1/\epsilon)$ and uses only $\tilde{O}(N(d+\chi)\chi^2 + 1/\epsilon)$ edges.
In this work, we establish a direct mapping between TNS in arbitrary dimension and NQS. By directly constructing neural-network layers that perform tensor contractions, we show that efficiently contractible TNS can be expressed in terms of polynomially-sized neural networks. Our result, in conjunction with previously established results on the entanglement capacity of NQS, then demonstrates that NQS constitute a very flexible classical representation of quantum states, and that the TNS commonly used in variational applications are strictly a subset of NQS.
Preliminaries - We consider in the following a pure quantum system constituted by N discrete degrees of freedom s ≡ (s₁, . . . , s_N) (e.g. spins, occupation numbers, etc.), such that the wave-function amplitudes ⟨s|Ψ⟩ = Ψ(s) fully specify its state. Following the approach introduced in [3], we represent log(Ψ(s)) as g₁(s) + i·g₂(s), where g₁ and g₂ are the two outputs of a feed-forward neural network, parametrized by a possibly large number of network connections. Given an arbitrary set of quantum numbers s, the output computation of the corresponding NQS can generally be described as the two roots of a directed acyclic graph (V, E), where the value of each node v ∈ V is recursively defined as $v(s) = \sigma\big(\sum_{(u,v)\in E} W_{u,v}\, u(s) + b_v\big)$, where $\{W_{u,v} \in \mathbb{R}\}_{(u,v)\in E}$ and $\{b_v \in \mathbb{R}\}_{v\in V}$ are the parameters of the network, and σ : R → R is some non-linear function known as the activation function, e.g., ReLU(x) = max(x, 0) or softplus(x) = log(exp(x) + 1) [34, 35]. The root nodes of the network can optionally use the identity instead of a non-linear activation function. The depth of a neural network is defined as the maximal distance between an input node and the roots. Alternatively, a state Ψ(s) can also be viewed as a complex tensor $A_{s_1,\ldots,s_N}$ that is in turn represented in terms of tensor factorization schemes. Most forms of tensor factorizations are conveniently described graphically via Tensor Networks, undirected graphs whose nodes are tensors and whose edges specify contractions between connected tensors. See App. A for a brief introduction to tensor networks and Penrose diagrams.
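As a concrete illustration of the feed-forward NQS definition above, the following minimal numpy sketch (layer sizes, parameter values, and function names are ours and purely illustrative, not taken from the paper) evaluates the two real root outputs g₁(s) and g₂(s) of a small fully-connected network with softplus hidden units and identity roots, and returns the amplitude Ψ(s) = exp(g₁(s) + i·g₂(s)).

```python
import numpy as np

def softplus(x):
    # numerically stable softplus(x) = log(1 + exp(x))
    return np.logaddexp(0.0, x)

def nqs_amplitude(s, params):
    """Evaluate Psi(s) = exp(g1(s) + i*g2(s)) for a small feed-forward NQS.

    `params` is a list of (W, b) pairs; hidden nodes use softplus,
    the two root (output) nodes use the identity activation.
    """
    h = np.asarray(s, dtype=float)
    for W, b in params[:-1]:
        h = softplus(W @ h + b)      # hidden layers: v(s) = sigma(W u(s) + b)
    W_out, b_out = params[-1]
    g1, g2 = W_out @ h + b_out       # two real roots: log-modulus and phase
    return np.exp(g1 + 1j * g2)

# toy usage: N = 4 spins, one hidden layer of 8 neurons
rng = np.random.default_rng(0)
params = [(rng.normal(size=(8, 4)), rng.normal(size=8)),
          (rng.normal(size=(2, 8)), rng.normal(size=2))]
print(nqs_amplitude([1, -1, 1, 1], params))
```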
In the next section we will present our main results on the efficiency of approximating TN by NN. To properly discuss the complexity of computing a TN, we have to take into account its contraction order. In contrast to NN, which can be computed in O(|E|) time, the complexity of contracting a TN depends on the contraction order. While finding the optimal contraction order for an arbitrary TN is known to be NP-complete, efficient algorithms exist for many common TN forms, e.g. Matrix Product States. Two such contraction schemes, the sequential and parallel contractions, are depicted in Figure 2.
While the contraction order exactly determines the computational complexity, it is not sufficient for describing some structural properties, e.g. depth and number of neurons, of the NN approximating it. For that, we have to describe more accurately the computational operations carried out.

FIG. 2. (left) Sequential contraction scheme for Matrix Product States: at step 1, we map indices d₁, . . . , d₈ to their corresponding matrices (or vectors at the boundaries), an O(dχ²)-time operation. In each of the following steps, we contract a boundary vector with its neighboring matrix node, an O(χ²)-time operation, amounting to a total of O(Ndχ²) for the entire contraction. (middle) Parallel contraction scheme [a] for Matrix Product States: following step 1 as in the sequential contraction, we contract pairs of neighboring nodes in parallel, each an O(χ³)-time operation, amounting to a total of O(N(d+χ)χ²) for the entire contraction. (right) Illustration of a simple contraction scheme, in this case matrix-vector multiplication, as an arithmetic circuit.
[a] When parallelizing across sites, the effective run-time in practice depends mostly on the number of steps, i.e., log₂ N, and on χ² rather than χ³, because each matrix multiplication can itself be parallelized across the coordinates of the output matrix, resulting in O(χ² log N).
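To make the two contraction schemes of Fig. 2 concrete, the following numpy sketch (shapes, random tensors, and function names are illustrative, not taken from the paper) evaluates an MPS amplitude Ψ(s) both sequentially, by sweeping a boundary vector through the chain, and in parallel, by pairwise matrix products over O(log₂ N) rounds; both orderings give the same value.

```python
import numpy as np

def mps_amplitude_sequential(tensors, s):
    """Sequential contraction: O(N * chi^2) after the O(N * d * chi^2) index selection."""
    mats = [A[si] for A, si in zip(tensors, s)]   # step 1: select matrices A[s_i]
    v = mats[0]                                   # left boundary vector, shape (chi,)
    for M in mats[1:-1]:
        v = v @ M                                 # boundary-vector x matrix, O(chi^2)
    return v @ mats[-1]                           # close with the right boundary vector

def mps_amplitude_parallel(tensors, s):
    """Parallel contraction: pair up neighbours, O(log2 N) rounds of O(chi^3) products."""
    mats = [A[si] for A, si in zip(tensors, s)]
    while len(mats) > 1:
        pairs = [mats[i] @ mats[i + 1] for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:                         # odd element carried to the next round
            pairs.append(mats[-1])
        mats = pairs
    return mats[0]

# toy MPS: N = 8 sites, local dimension d = 2, bond dimension chi = 3
N, d, chi = 8, 2, 3
rng = np.random.default_rng(1)
tensors = ([rng.normal(size=(d, chi))] +
           [rng.normal(size=(d, chi, chi)) for _ in range(N - 2)] +
           [rng.normal(size=(d, chi))])
s = rng.integers(d, size=N)
print(np.isclose(mps_amplitude_sequential(tensors, s),
                 mps_amplitude_parallel(tensors, s)))
```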
Given a contraction order, the value of Ψ(s) can alternatively be described in the form of an arithmetic circuit (AC), i.e., a computational graph comprising product and weighted-sum nodes. Specifically, the value of a product node v is given by $v(s) = \prod_{(u,v)\in E} u(s)$, and the value of a weighted-sum node by $v(s) = \sum_{(u,v)\in E} W_{u,v}\, u(s)$, where $\{W_{u,v}\}$ are the parameters of the circuit, corresponding to the tensor nodes in the tensor network. Input to the arithmetic circuit is represented by leaf input nodes, where for every $s_i$ and possible value $k$ there is an indicator node $v_{i,k} = \mathbb{1}[s_i = k]$. The depth of the circuit is defined in the same way as for neural networks. See Figure 2 for an illustration of a simple TN to AC conversion.
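As a minimal example of the TN-to-AC conversion (mirroring the matrix-vector multiplication panel of Fig. 2), the sketch below (all names and shapes are illustrative) encodes the physical index through indicator leaves 1[s_i = k] and evaluates the contraction using only product and weighted-sum nodes.

```python
import numpy as np

def indicator_leaves(s_i, d):
    # leaf input nodes v_{i,k} = 1[s_i == k]
    return np.array([1.0 if s_i == k else 0.0 for k in range(d)])

def matrix_vector_as_ac(A, v, s_i):
    """Contract the tensor A[s_i, :, :] with the vector v using only
    weighted-sum nodes (sums over the indicator leaves and the bond index)
    and product nodes (leaf * vector-entry), with the tensor entries as weights."""
    d, chi, _ = A.shape
    leaves = indicator_leaves(s_i, d)
    out = np.zeros(chi)
    for a in range(chi):                               # one weighted-sum node per output index
        acc = 0.0
        for k in range(d):
            for b in range(chi):
                acc += A[k, a, b] * leaves[k] * v[b]   # product node, weighted by A[k, a, b]
        out[a] = acc
    return out

rng = np.random.default_rng(2)
A, v = rng.normal(size=(2, 3, 3)), rng.normal(size=3)
print(np.allclose(matrix_vector_as_ac(A, v, s_i=1), A[1] @ v))
```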
Main Results - Here we present our main results. First, NN are at least as efficient as TN at representing any quantum state that can be modeled as a TN. Second, there exist states that can be modeled efficiently by NN, but whose TN representations require exponential resources. The main outcome of our work is the representability diagram in Fig. 3, summarizing the expressive power of NN and TN as variational quantum states. As explained in the preliminaries, when assessing the expressive efficiency of TN, we do so with respect to a given contraction scheme that gives rise to an explicit computation in the form of an AC, composed of product and weighted-sum operations. With that in mind, our fundamental question is whether AC can be efficiently approximated by NN. While the weighted-sum operation is common to both NN and AC, the product operation is not trivially simulated by NN and has been the topic of several works [36-38]. The best approximation result, by Yarotsky [37], demonstrates a construction of a NN with width and depth at most $O(\log(M/\epsilon))$ such that $\max_{x,y\in[-M,M]} |\mathrm{NN}(x, y) - x \cdot y| < \epsilon$. While this rate of approximation is sufficient for many purposes, it is less relevant for quantum state representation. Consider for example an arbitrary N-qubit system: due to normalization, at least half of its wave-function amplitudes are, in modulus, less than $2^{-N/2}$, which entails $\epsilon < 2^{-N/2}$ for a meaningful approximation, and thus requires O(poly(N)) width and depth.
As opposed to prior approaches, we consider the approximation of the log-value of the AC. We assume the magnitude of the AC's output is strictly positive for all inputs and greater than some fixed value, $f_{\min}$, such that the log-value is well-defined. $f_{\min}$ can be extremely small, on the order of $10^{-10^{10}}$, without having a meaningful impact on our results, and so this assumption has little effect in practice. Furthermore, to simplify the presentation of our proofs, we assume the absolute values of both the real and imaginary parts to be strictly positive, though this last assumption could be relaxed. In this setting, we are able to prove that AC can be simulated to almost arbitrary precision by NN with a very modest overhead relative to the runtime of the original AC, as formally described by the following theorem:

Theorem 1 Let $f : X \to \mathbb{C}$ be a complex-valued function given by an arithmetic circuit comprising n nodes and m edges, of depth l, and using complex parameters.
Then, for any $\epsilon > 0$ there exists a function $g : X \to \mathbb{R}^2$ described by a neural network comprising $O(n + m + c)$ nodes and $O(m + c)$ edges, of depth $O(l \log(m) + c)$, using softplus activation functions and real parameters, such that $\max_{x\in X} |g_1(x) + i \cdot g_2(x) - \log(f(x))| < \epsilon$, where $c(\epsilon, m, W_{\max}, f_{\min}) \equiv O\!\left(\ln^2(m)\,\ln\frac{W_{\max}}{f_{\min}} + \frac{1}{\epsilon}\ln\frac{1}{\epsilon}\right)$.
The proof of Theorem 1, which is given in full in App. B, is based on two steps. First, we show that AC with non-negative parameters and inputs can be exactly reconstructed with NN with real parameters and softplus activation functions. Let $o_1 = \log(x_1)$, $o_2 = \log(x_2)$ for $x_1, x_2 \geq 0$. Working in log-space, multiplication becomes summation, i.e., $\log(x_1 \cdot x_2) = o_1 + o_2$, making input-input multiplication trivial for NN, unlike before. For every input-parameter multiplication, i.e., a sum-node edge in the AC graph, we add an auxiliary neuron with a single input. The AC's parameters are stored in the bias terms of these auxiliary neurons, adding m nodes to the NN but with negligible effect on runtime (number of edges). For summation, softplus activations arise naturally, since in log-space $\log(x_1 + x_2) = o_1 + \mathrm{softplus}(o_2 - o_1)$; a log-space summation of n inputs can then be decomposed as a binary tree, which gives the log(m) correction to the depth of the network. With both log-space NN analogs in place, a non-negative AC can be exactly reproduced with the same asymptotic time complexity. For the second step, we reduce the general complex case to the non-negative case. A real number $x \in \mathbb{R}$ can be represented redundantly by two non-negative numbers $x^+, x^- \geq 0$ with $x = x^+ - x^-$. Addition and multiplication can be applied directly on this representation (see App. B for the explicit identities). Thus, a real AC can be expressed as the difference of two non-negative AC, and a complex AC by representing the real and imaginary parts in this fashion. Finally, to compute the logarithm of this redundant complex representation, i.e., the log-magnitude and phase, we employ various univariate approximation schemes. Since these two operations are smooth and used only at the end of the network, they result in the additive term $c(\epsilon, m, W_{\max}, f_{\min})$, which is merely logarithmic in the number of edges of the AC, and doubly logarithmic with respect to the magnitudes of the weights and the WF amplitudes. Owing to these weak dependencies on the target AC, the construction allows for an approximation with practically arbitrary precision.
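The two log-space primitives used in the first step can be checked numerically; the snippet below (illustrative only) verifies that multiplication becomes a plain sum of log-values and that binary addition is realized by a single softplus, log(x₁ + x₂) = o₁ + softplus(o₂ − o₁).

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)       # log(1 + exp(x)), numerically stable

x1, x2 = 3.7, 0.042
o1, o2 = np.log(x1), np.log(x2)

# multiplication in log-space is just a sum with unit weights
assert np.isclose(o1 + o2, np.log(x1 * x2))

# binary addition in log-space is one softplus on the difference of the inputs
assert np.isclose(o1 + softplus(o2 - o1), np.log(x1 + x2))
print("log-space product and sum identities verified")
```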
The immediate implication of Theorem 1 is that NQS can simulate TNS at least as efficiently as their TN representation, as given by the following corollary:

Corollary 2 Any TNS whose contraction scheme is described by an arithmetic circuit with m edges and depth l can be approximated to precision $\epsilon$ by an NQS with $O(m + c)$ edges and depth $O(l \log(m) + c)$.

In turn, this result also allows us to use previously established rigorous results on MPS to directly quantify the expressive power of NQS on special classes of quantum systems. For example, Hastings famously established an area law for the entanglement of gapped ground states of one-dimensional systems [39], which directly translates into an efficient approximation by MPS [39-42]. Our result in Corollary 2, in connection with the bound established in [39], implies the following:

Corollary 3 Gapped ground states of local one-dimensional Hamiltonians can be approximated to precision $\epsilon$ by NQS of size $O(\mathrm{poly}(N, 1/\epsilon))$.
While the connection we have established is strictly inclusive, we show that the inverse does not hold, i.e., that there exist NQS that cannot be efficiently reproduced by widely adopted classes of variational TNS:

Corollary 4 There exist quantum states that can be represented by neural networks with a number of parameters and a runtime polynomial in the number of sites, but that MPS, MERA, and PEPS tensor networks cannot represent without using an exponential number of parameters.
Recently, a theoretical model for representing quantum states by convolutional arithmetic circuits was proposed and analyzed in terms of its entanglement-entropy scaling [13]. The authors proved that this model cannot be efficiently described in the form of tensor networks and that it can represent certain volume-law states with an AC of polynomial size. With Theorem 1, we can map this theoretical model to a conventional NN. Since some of the most common forms of TNS can capture at most a logarithmic correction to area-law scaling, this entails the existence of quantum states that can be represented efficiently by an NQS, but not with these TNS. Overall, we have then established the representability diagram of Fig. 3.
Discussion - In this work we have introduced a general mapping between tensor networks and deep artificial neural networks. This mapping directly connects two of the most important classes of parametric representations of high-dimensional functions, and allows us to establish a representability diagram of modern variational many-body quantum states. We expect that our mapping will be especially useful for establishing further rigorous representation results on neural-network-based quantum states, using the well-developed theory of tensor-network representations. On the other hand, the kinds of neural-network architectures and connectivity patterns resulting from our mapping might also inspire new practical applications building on successful tensor-network ideas. Along the same lines, our mapping can also help clarify in which circumstances gradient-based optimization strategies, ubiquitous in machine learning, are to be preferred over the alternated optimization strategies commonly and successfully adopted for tensor networks.

Non-negative Case
For the first step, we assume an AC with non-negative inputs and parameters. The inputs and AC parameters are transformed to their log-values, where we extend the real line with ±∞ and represent log(0) = −∞. For most practical purposes, −∞ can be substituted with a large but finite negative constant.
In our NN construction, we freely use the identity instead of a softplus activation function when it is more convenient. We can do so because the identity operation can be recovered from the weighted sum of just two neurons with softplus activations, e.g., $\mathrm{softplus}(x) - \mathrm{softplus}(-x) = x$. This workaround can at most double the number of neurons and edges in our construction, and thus does not affect our asymptotic bounds. Every product node with k in-edges in the AC is replaced by a neuron with k in-edges, whose weights are set to 1 and bias to 0, representing multiplication in log-space, i.e., $\log(\prod_{i=1}^{k} x_i) = \sum_{i=1}^{k} o_i$, where $o_1, \ldots, o_k$ are the log-values of the connected nodes. Every weighted-sum node with k in-edges and parameterized by $w \in \mathbb{R}^k_{\geq 0}$ is replaced by the following NN sub-graph of O(k) nodes and O(k) edges. Every input-parameter multiplication term, i.e., $w_i \cdot x_i$, is represented by a single neuron with a single in-edge, with its weight set to 1 and its bias set to the log-value of the parameter, resulting in $p_i \equiv \log(w_i \cdot x_i) = \log(w_i) + o_i$. Without loss of generality, assume $k = 2^t$ for some $t \in \mathbb{N}$, so that we can decompose $\sum_{i=1}^{k} p_i$ (in log-space) as a complete binary tree of depth t, with 2k − 1 nodes and 2k − 2 in-edges in total. Each node in the tree represents a binary addition, which can be realized with 2 neurons, one with a softplus activation and one with the identity, via $\log(e^{p_1} + e^{p_2}) = p_1 + \mathrm{softplus}(p_2 - p_1)$. Applying the above transformations to a non-negative AC with n nodes, m edges, and depth l results in a NN of depth $O(l \log(m))$ with O(n + m) nodes and O(m) edges, concluding the proof of the first step.
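A minimal sketch of the building blocks described above (function names and test values are ours, illustrative only): the identity realized as softplus(x) − softplus(−x), the log-space binary addition, and the binary-tree reduction of a k-term weighted sum, checked against the direct computation.

```python
import numpy as np

def softplus(x):
    return np.logaddexp(0.0, x)

def identity_via_softplus(x):
    # weighted sum of two softplus neurons: softplus(x) - softplus(-x) = x
    return softplus(x) - softplus(-x)

def log_add(p1, p2):
    # one softplus neuron + one identity neuron: log(exp(p1) + exp(p2))
    return p1 + softplus(p2 - p1)

def log_sum_tree(log_terms):
    """Reduce k = 2^t log-values p_i to log(sum_i exp(p_i)) with a depth-t binary tree."""
    level = list(log_terms)
    while len(level) > 1:
        level = [log_add(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

rng = np.random.default_rng(3)
w, x = rng.uniform(0.1, 2.0, size=8), rng.uniform(0.1, 2.0, size=8)
p = np.log(w) + np.log(x)                     # bias log(w_i) added to the log-input o_i
assert np.isclose(identity_via_softplus(1.234), 1.234)
assert np.isclose(log_sum_tree(p), np.log(np.sum(w * x)))
print("weighted-sum node reproduced exactly in log-space")
```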

Complex Case
For the second step, we begin by transforming a complex AC into four distinct non-negative AC graphs, representing the following four "parts" of a complex number: positive real, negative real, positive imaginary, and negative imaginary.
Every real number $x \in \mathbb{R}$ can be represented in the redundant form $x = x^+ - x^-$, where $x^+, x^- \in \mathbb{R}_{\geq 0}$. Multiplication and addition can be performed directly within that representation using the identities $x \cdot y = (x^+ y^+ + x^- y^-) - (x^+ y^- + x^- y^+)$ and $x + y = (x^+ + y^+) - (x^- + y^-)$. Similarly, a complex number $z \in \mathbb{C}$ can be represented with four components, $z = z_{\rm re,+} - z_{\rm re,-} + i \cdot (z_{\rm im,+} - z_{\rm im,-})$, where $z_{\rm re,+}, z_{\rm re,-}, z_{\rm im,+}, z_{\rm im,-} \in \mathbb{R}_{\geq 0}$.
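The redundant representation and its propagation rules can be sketched as follows (an illustrative check with our own helper names, not the formal construction; the split via max(x, 0) is just one valid choice of decomposition):

```python
import numpy as np

def split(x):
    # redundant representation x = x_plus - x_minus with x_plus, x_minus >= 0
    return (max(x, 0.0), max(-x, 0.0))

def add(a, b):
    # (x + y)^+ = x^+ + y^+,  (x + y)^- = x^- + y^-
    return (a[0] + b[0], a[1] + b[1])

def mul(a, b):
    # (x * y)^+ = x^+ y^+ + x^- y^-,  (x * y)^- = x^+ y^- + x^- y^+
    return (a[0] * b[0] + a[1] * b[1], a[0] * b[1] + a[1] * b[0])

def value(a):
    return a[0] - a[1]

x, y = -1.5, 2.25
assert np.isclose(value(add(split(x), split(y))), x + y)
assert np.isclose(value(mul(split(x), split(y))), x * y)
# a complex number z uses two such pairs, one for Re(z) and one for Im(z)
print("redundant-representation identities verified")
```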
Given a complex AC with m edges, n nodes, and of depth l, we can use the above redundant representation for its inputs, parameters, and intermediate computations. Propagating the operations through the complex AC graph with the above identities results in four non-negative AC, each with O(m) edges, O(n) nodes, and of depth O(l), denoting the four components of the complex AC's output, i.e., $\mathrm{AC}(z) = \mathrm{AC}(\hat{z})_{\rm re,+} - \mathrm{AC}(\hat{z})_{\rm re,-} + i \cdot (\mathrm{AC}(\hat{z})_{\rm im,+} - \mathrm{AC}(\hat{z})_{\rm im,-})$, where $\hat{z} = (z_{\rm re,+}, z_{\rm re,-}, z_{\rm im,+}, z_{\rm im,-})$. The logarithm of each of these non-negative AC can be represented with a NN according to the first step.
What remains is to convert the redundant representation to a log-polar form, i.e., $\log(z) = \log(|z|) + i \cdot \arg(z)$, per the desired output described in Theorem 1. We employ various approximation techniques to simulate this operation. In the following we denote the log-values of the components of the redundant representation by $o_{\rm re,+} = \ln z_{\rm re,+}$, $o_{\rm re,-} = \ln z_{\rm re,-}$, $o_{\rm im,+} = \ln z_{\rm im,+}$, and $o_{\rm im,-} = \ln z_{\rm im,-}$.

a. Estimating ln |z|

In the rest of this sub-section we focus on the approximation of $\ln|z_{\rm re}|$; the same methods apply to $\ln|z_{\rm im}|$. We begin by defining $o_{\rm re,max} = \max(o_{\rm re,+}, o_{\rm re,-})$ and $o_{\rm re,min} = \min(o_{\rm re,+}, o_{\rm re,-})$, and similarly for the imaginary part. Recall that $\max(x, y) = y + \max(x - y, 0)$ and $\min(x, y) = y - \max(y - x, 0)$, and so both can be approximated to arbitrary precision with softplus networks. With that, we can write $\ln|z_{\rm re}| = \ln\big(\max(z_{\rm re,+}, z_{\rm re,-}) - \min(z_{\rm re,+}, z_{\rm re,-})\big) = \ln\big(e^{o_{\rm re,max}} - e^{o_{\rm re,min}}\big) = o_{\rm re,min} + \mathrm{softplus}^{-1}(o_{\rm re,max} - o_{\rm re,min})$, where $\mathrm{softplus}^{-1}$ is the inverse of the softplus function. To approximate the inverse, we employ two strategies: (i) for large values, $\mathrm{softplus}^{-1}(x) \approx x$ to high precision, and (ii) for smaller values, we estimate the inverse using root-finding algorithms, specifically the bisection method. Let $\epsilon > 0$, and $x = o_{\rm re,max} - o_{\rm re,min}$. For $x > x_{\rm large} \equiv -\ln(1 - \exp(-\epsilon))$ it holds that $x - \mathrm{softplus}^{-1}(x) < \epsilon$. To realize the bisection method, we first set the initial search range for $y^* = \mathrm{softplus}^{-1}(x)$. $y^*_{\max}$ can be set to $x_{\rm large}$ because $\mathrm{softplus}^{-1}(x) \leq x$. For $y^*_{\min}$ we can bound the minimal value of $x$ as follows: $x = o_{\rm re,max} - o_{\rm re,min} = \ln\frac{z_{\rm re,max}}{z_{\rm re,min}} = \ln\frac{|z_{\rm re}| + z_{\rm re,min}}{z_{\rm re,min}} = \ln\!\left(\frac{|z_{\rm re}|}{z_{\rm re,min}} + 1\right) \geq \ln\!\left(\frac{f_{\min}}{z_{\rm re,min}} + 1\right)$.
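A sketch of the inverse-softplus strategy just described (the tolerance, the lower bracket, and the iteration count are illustrative choices): above x_large the identity softplus⁻¹(x) ≈ x is used directly, and below it the root of softplus(y) = x is bracketed and refined by bisection.

```python
import numpy as np

def softplus(y):
    return np.logaddexp(0.0, y)

def inv_softplus(x, eps=1e-8, y_min=-50.0):
    """Approximate softplus^{-1}(x) = log(exp(x) - 1) for x > softplus(y_min).

    For large x the identity softplus^{-1}(x) ~ x holds to within eps; otherwise
    the root of softplus(y) = x is bracketed in [y_min, x] and refined by bisection
    (y_min plays the role of the bound derived from f_min and W_max in the text).
    """
    x_large = -np.log(1.0 - np.exp(-eps))   # x - softplus^{-1}(x) < eps beyond this point
    if x >= x_large:
        return x
    lo, hi = y_min, x                        # softplus^{-1}(x) <= x gives the upper end
    for _ in range(200):                     # O(log(range / precision)) iterations suffice
        mid = 0.5 * (lo + hi)
        if softplus(mid) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for x in (1e-4, 0.1, 2.0, 40.0):
    exact = np.log(np.expm1(x))
    print(x, inv_softplus(x), exact)
```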
Next, we upper bound the value of $z_{\rm re,min}$ by finding an upper bound on the value of a generic non-negative AC with m edges. First, we replace every non-zero weight with the maximal weight in the graph. Then, we can bound every weighted sum by $v(s) = \sum_{(u,v)\in E} W_{u,v}\, u(s) \leq |\{(u, v) \in E\}| \cdot \big(\max_{e\in E} W_e\big) \cdot \max_{(u,v)\in E} u(s)$. Finally, we can prove by induction along the topological order of the graph that the output of every sub-graph with m edges is upper bounded by $\big(m \max_{(v,u)\in E} |W_{v,u}|\big)^m$.
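For completeness, a sketch of the induction step behind this bound (our own derivation, assuming indicator inputs bounded by 1, writing $W \equiv \max(1, \max_e W_e)$, and adopting the convention $0^0 = 1$ for the base case of a leaf with zero edges):

```latex
% Consider a node v whose children u_1, ..., u_k head sub-circuits with
% m_1, ..., m_k edges, so that the sub-circuit of v has m' = k + \sum_i m_i edges.
\begin{align*}
\text{weighted sum:}\quad
  v(s) &\le k\, W \max_i u_i(s)
        \le (m' W) \, (m_{\max} W)^{m_{\max}}
        \le (m' W)^{m'},\\
\text{product:}\quad
  v(s) &= \prod_{i=1}^{k} u_i(s)
        \le \prod_{i=1}^{k} (m_i W)^{m_i}
        \le (m' W)^{\sum_i m_i}
        \le (m' W)^{m'},
\end{align*}
% using the inductive hypothesis u_i(s) <= (m_i W)^{m_i}, m_max = max_i m_i <= m' - 1,
% and m' W >= 1; leaves satisfy the bound trivially since their value is at most 1.
```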
In total, approximating $\log|z|$ up to $\epsilon$ requires $O\!\left(\ln^2(m)\,\ln\frac{W_{\max}}{f_{\min}}\right)$ additional nodes, edges, and depth on top of the base NN used to approximate the four non-negative AC.

b. Estimating arg z
In this sub-section, we describe the estimation of arg z by softplus networks, building on the approximations of $\ln|z_{\rm re}|$ and $\ln|z_{\rm im}|$ described in the previous sub-section. arg z can be computed by a case analysis over the signs of the real and imaginary parts, whose only non-trivial ingredient is the function $t(x) \equiv \arctan(\exp(x))$ evaluated at (plus or minus) the difference of the log-magnitudes, the remaining cases reducing to constant shifts or to the zero function. Since $\arctan(x) \leq x$ for any $x > 0$, we have $t(x) \leq \exp(x)$, and so for $x_{\min} = -\ln(1/\epsilon)$ it holds that $|t(x)| < \epsilon$ for all $x \leq x_{\min}$. Thus, a piecewise-linear function with $O\!\left(\frac{1}{\epsilon}\ln\frac{1}{\epsilon}\right)$ segments can approximate $t(x)$ up to a maximal difference of $\epsilon$. Finally, a piecewise-linear function with k segments can be realized with a ReLU network of O(k) nodes and edges, and of constant depth.
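A sketch of this last step (the grid, interval, and tolerance are illustrative): a piecewise-linear interpolant of t(x) = arctan(exp(x)) on [ln ε, ln(1/ε)], written in the standard ReLU basis so that it corresponds to a constant-depth ReLU network with one neuron per segment.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def t(x):
    return np.arctan(np.exp(x))

def build_pwl_relu(f, x_min, x_max, k):
    """Return coefficients such that f_hat(x) = c0 + sum_j a_j * relu(x - b_j)
    interpolates f at k + 1 equally spaced knots; one ReLU neuron per segment."""
    knots = np.linspace(x_min, x_max, k + 1)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)
    a = np.diff(slopes, prepend=0.0)          # slope changes introduced at each knot
    return vals[0], a, knots[:-1]

def eval_pwl(x, c0, a, b):
    return c0 + np.sum(a * relu(x - b))

eps = 1e-3
x_min, x_max = np.log(eps), -np.log(eps)      # |t| < eps below x_min, pi/2 - t < eps above x_max
c0, a, b = build_pwl_relu(t, x_min, x_max, k=2000)
xs = np.linspace(x_min, x_max, 10001)
err = max(abs(eval_pwl(x, c0, a, b) - t(x)) for x in xs)
print("max deviation on the test grid:", err)
```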