A Neural Network Perturbation Theory Based on the Born Series

Deep learning has become an attractive approach to various data-based problems of theoretical physics in the past decade. Its protagonists, the deep neural networks (DNNs), are capable of making accurate predictions for data of arbitrarily high complexity. A well-known issue most DNNs share is their lack of interpretability. In order to explain their behavior and extract the physical laws they have discovered during training, a suitable interpretation method therefore has to be applied post hoc. Due to its simplicity and ubiquity in quantum physics, we present a rather general interpretation method in the context of two-body scattering: We find a one-to-one correspondence between the nth-order Born approximation and the nth-order Taylor approximation of deep multilayer perceptrons (MLPs) that predict S-wave scattering lengths a_0 for discretized, attractive potentials of finite range. This defines a perturbation theory for MLPs, similarly to how the Born approximations define a perturbation theory for a_0. In the case of shallow potentials, lower-order approximations, which can be argued to be local interpretations of the respective MLPs, reliably reproduce a_0. As deep MLPs are highly nested functions, the computation of higher-order partial derivatives, which is essential for a Taylor approximation, is a laborious endeavour. By introducing quantities we refer to as propagators and vertices, which depend on the MLP's weights and biases, we establish a graph-theoretical approach to partial derivatives and local interpretability. Similar to the Feynman rules in quantum field theories, we find rules that systematically assign diagrams consisting of propagators and vertices to the corresponding order of the MLP perturbation theory.


I. INTRODUCTION
Machine Learning (ML) is a highly active field of research that provides a wide range of tools to tackle various data-based problems. As such, it also receives growing attention in the theoretical physics literature, such as in Refs. [1-15]. Many data-based problems involve modeling an input-target distribution from a data set, which is referred to as supervised learning. After a successful training procedure, the ML algorithm is capable of correctly predicting targets even when given previously unknown inputs, i.e. it generalizes what it has learned to new data. Nowadays, neural networks (NNs) are a popular choice in the context of supervised learning. Motivated by Refs. [11] and [19], we develop a rather general interpretation method in the context of two-body scattering. Here, two-body scattering appears to be a suitable field of application due to its simplicity and its ubiquity in quantum physics. Given an ensemble of deep MLPs that predict S-wave scattering lengths for shallow, attractive potentials of finite range, the idea is to generate an interpretable proxy by considering the ensemble's Taylor approximation. This approach itself is not new. However, while local interpretation methods like the deep Taylor decomposition in Ref. [19] or LIME in Ref. [20] merely use first-order derivatives or a linear regression of adjacent synthetic samples, respectively, higher-order derivatives have not yet been considered in the literature. This makes it all the more compelling for us to show how the latter can be systematically determined in the case of MLPs.
Especially for deep MLPs, which, as such, are highly nested non-linear functions, the analytical computation of Taylor coefficients of arbitrarily high order is anything but trivial. Although using numerical instead of analytical derivatives seemingly simplifies this problem, numerical derivatives are inaccurate due to truncation and round-off errors and do not reveal the contribution of the individual weights and biases to a particular order. Similar to backpropagation in gradient-descent techniques, where the first-order derivatives of the loss function with respect to an internal parameter can be represented as a matrix product, we want to bypass the naive and inefficient use of the chain and product rules and understand arbitrary derivatives of an MLP in terms of tensor products. We observe two distinct classes of quantities, which we refer to as propagators and vertices, that each depend on the weights, biases and chosen activation functions and naturally appear in such a tensor formulation. The naming is intentional, as we discover several similarities between the Taylor expansion of MLPs and perturbation theory in quantum field theories. We find a one-to-one correspondence between the nth-order Born approximation and the nth-order Taylor approximation of the trained MLPs. In the same way as Born approximations define a perturbation theory for scattering lengths, we argue that the above Taylor approximations define a perturbation theory for MLPs. Analogously to Feynman rules, there are underlying rules that specify which combinations of vertices and propagators, i.e. which diagrams, are allowed and contribute to a certain Taylor coefficient. One major difference is, however, that loops are not allowed, in contrast to quantum field theories. In a graph-theoretical context, we can show these diagrams to be oriented and rooted trees, i.e. arborescences.
Having established such a perturbation theory for MLPs, based on the Born series, we use it to derive the first- and second-order Taylor coefficients of our ensemble. While the first one indeed reproduces the first-order Born term, the latter deviates strongly from the actual second-order Born term and has a very large variance. This is neither right nor wrong, as the contribution of the second order is much smaller than that of the first order; it merely reveals how the ensemble operates on extremely shallow potentials, namely by relying primarily on the first order. Nonetheless, we also want to recover the second-order Born term using MLP perturbation theory. Therefore, we generate data sets with scattering lengths from which the first-order Born term has been removed. In these reduced scattering lengths the second order is, therefore, the leading order. When training an additional ensemble of MLPs to predict these, we observe a much better agreement between its second-order Taylor coefficients and the second-order Born term.
The manuscript is organized as follows: In Sect. II, we develop the perturbation theory for MLPs. Sect. III discusses the Born series in quantum two-body scattering as a Taylor series in the space of discretized potentials. Sect. IV contains the details of the NN used and its training. The first-order Born term is evaluated and discussed in Sect. V, followed by the investigation of the second-order Born term in Sect. VI. We end with a discussion and outlook in Sect. VII. Various technicalities are relegated to the appendix.

II. PERTURBATION THEORY FOR MULTILAYER PERCEPTRONS
Consider an NN, which we denote as Y, that produces the output Y(x_0) ∈ R^{H_L} for a given input x_0 ∈ R^{H_0}. If and only if Y is interpretable in the vicinity of x_0, it is possible to predict how the output will change for a small perturbation δx of the input without actually processing the perturbed input x = x_0 + δx with Y. This is equivalent to discovering an underlying formalism that the neural network has learned during training and applying exactly that set of rules to x_0 + δx. Obviously, local interpretability can be established by a Taylor expansion of each component of Y around a given expansion point x_0,

Y_n(x) = Σ_{N=0}^{∞} (1/N!) Σ_{k_1,...,k_N} ∂^N Y_n/(∂x_{k_1} ⋯ ∂x_{k_N})|_{x=x_0} (x − x_0)_{k_1} ⋯ (x − x_0)_{k_N} .   (1)

The mentioned formalism can be parametrized by the Taylor coefficients ∂^N Y_n/(∂x_{k_1} ⋯ ∂x_{k_N})|_{x=x_0}. Finally, since higher-order terms can be neglected for small perturbations, this naturally defines a perturbation theory for the neural network Y. As many feedforward architectures like CNNs, RBFNs and ResNets have an MLP representation, the following description of such an NN perturbation theory is accordingly given with respect to MLPs.
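To make this concrete, the first-order truncation of Eq. (1) can be set up directly with automatic differentiation. The following is a minimal sketch in PyTorch; the helper name first_order_proxy is ours and not part of the framework developed below, which instead derives the coefficients analytically:

```python
import torch

def first_order_proxy(Y, x0):
    """Local surrogate of a network Y around x0, i.e. the N = 1 truncation
    of the Taylor series: Y(x) ~ Y(x0) + J(x0) @ (x - x0)."""
    x0 = x0.detach()
    y0 = Y(x0).detach()
    J = torch.autograd.functional.jacobian(Y, x0)  # shape (H_L, H_0)
    return lambda x: y0 + J @ (x - x0)
```

Evaluating the returned closure for a perturbed input predicts the change of the output without processing the perturbation with Y itself, which is exactly the notion of local interpretability used here.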
An MLP with L layers is the prototype example of a layered architecture and can be understood as a non-linear function Y : R^{H_0} → R^{H_L}. In MLPs, exclusively linear layers are used in combination with non-linear activation functions a^{(l,n)} : R → R, where a^{(l,n)} is applied to the n-th neuron of the l-th layer. This can be formulated recursively, with the recursive step

y^{(l)}_n = a^{(l,n)} ( Σ_m w^{(l)}_{nm} y^{(l−1)}_m + b^{(l)}_n )   (2)

together with the weights w^{(l)}_{nm} and biases b^{(l)}_n. The recursion in Eq. (2) terminates for l = 1 upon reaching its base y^{(0)}_m = x_m. For deep architectures, that is for L ≫ 1, Y is a strongly nested function, such that computing derivatives becomes an extremely difficult task because of a hardly manageable number of chain- and product-rule applications. In fact, there is another field of machine learning in which it is well known how to efficiently compute first-order partial derivatives of a strongly nested function: within gradient-descent-based training algorithms it is necessary to compute the gradient of a loss function, which is an error function of the network Y and therefore nested to the same extent. Here, the first-order partial derivatives of the loss function with respect to any internal parameter can be expressed as a matrix product. This is the famous backpropagation, which significantly speeds up training steps by avoiding naive applications of the chain and product rules.
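For reference, the recursion of Eq. (2) reads as follows in code. This is a minimal NumPy sketch of our own with per-neuron activations a^{(l,n)}, as the general setup allows (in practice one activation per layer is the common case):

```python
import numpy as np

def mlp_forward(x, weights, biases, activations):
    """Evaluate Y(x) via Eq. (2): y^(l)_n = a^(l,n)(sum_m w^(l)_nm y^(l-1)_m + b^(l)_n),
    with base case y^(0) = x. `activations[l][n]` is the callable a^(l,n)."""
    y = np.asarray(x, dtype=float)                          # y^(0) = x
    for w, b, a in zip(weights, biases, activations):
        z = w @ y + b                                       # linear layer
        y = np.array([a_n(z_n) for a_n, z_n in zip(a, z)])  # per-neuron activation
    return y
```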
In order to derive Taylor coefficients of Y of any order and for an arbitrary number L of layers in terms of the weights and biases, we desire a systematic description in the spirit of backpropagation. Let us therefore first define what we refer to as the nm-th matrix element of the l-th layer propagator D^{(l,p)}_{nm} of order p. Since the last layer is usually activated via the identity, a^{(L,n)} = id, and has no bias, that is b^{(L)}_n = 0, this redefinition entirely describes outputs in terms of propagators and reduces the search for Taylor coefficients to computing partial derivatives of propagator matrix elements. Applying the chain rule throughout all layers yields the first-order derivatives given in Eq. (6) (see App. A on p. 23). In Eq. (6) we make two observations: First, taking a derivative increases the order of the propagator by one. Second, an additional factor ∆^{(l,1)}_{mk} is introduced, which impacts higher-order derivatives of propagators. Defining the tensor elements ∆^{(l,p)}_{mk_1...k_p} in Eq. (7), we can express the N-th derivative of the propagator as the superposition of Eq. (8) by successively applying the rule stated in Eq. (6) and by absorbing the remaining derivatives via Eq. (7) (see App. A on p. 24). Each summand in Eq. (8) depends on a higher-order propagator. Here, the second sum runs over the set Π^c_N of all partitions of the number N with length c, and the set of all partitions is simply given by the union Π_N = ∪_{c=1}^{N} Π^c_N. Lastly, the third sum runs over a subset of the permutation group S_N, respecting the structure of the given partition. Tab. I contains all partitions, the respective permutation subgroups and the resulting propagator derivatives for N = 1, 2, 3, 4.
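The partition sets Π^c_N and Π_N can be enumerated mechanically. A small sketch (our own helpers, with partitions ordered towards lower summands as in the text):

```python
def partitions(N, c):
    """All partitions of N with exactly c summands, ordered towards lower
    summands (non-increasing), i.e. the set Pi^c_N."""
    if c == 1:
        return [(N,)]
    result = []
    for first in range(N - c + 1, 0, -1):        # largest summand first
        for rest in partitions(N - first, c - 1):
            if rest[0] <= first:                 # keep non-increasing order
                result.append((first,) + rest)
    return result

def all_partitions(N):
    """Pi_N as the union of all Pi^c_N for c = 1, ..., N."""
    return [p for c in range(1, N + 1) for p in partitions(N, c)]

print(all_partitions(4))  # [(4,), (3, 1), (2, 2), (2, 1, 1), (1, 1, 1, 1)]
```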
What remains is to find an expression for the tensors ∆^{(l,p)}_{mk_1...k_p} in terms of propagators, such that we can completely determine the partial derivatives of the propagators in Eq. (8). We therefore introduce the mk-th matrix element of a vertex of order p in the l-th layer, acting as a weighted sum, in Eq. (11). If l − 1 ≤ p, the vertex becomes saturated, that is, it becomes a constant and is equal to any higher-order vertex in the same layer. Because of this and due to Eq. (6), vertices display the behavior given in Eq. (12) when exposed to a partial derivative: vertices of order p in the l-th layer only commute with partial derivatives if they are saturated. This is embodied by the proportionality of the commutator to the step function with Θ(0) = 0. Note that we have expressed ∆^{(l,1)}_{mk} in Eq. (13) in terms of a single vertex, for which we choose the graphical representation of a single vertex. Applying Eq. (12) to Eq. (13) yields ∆^{(l,2)}_{mk_1k_2}, cf. Eq. (14). This term depends on a first-order vertex that sums over a second-order propagator and a zeroth-order vertex, which suggests the graphical representation of two vertices connected via a propagator with two arrowheads, directed from the first to the second vertex. Applying Eq. (12) once more to Eq. (14) yields Eq. (15), a superposition of three different terms to each of which we can assign a different graph. Note that the second term only contains one propagator of third order instead of two second-order propagators as in both other terms. In order to connect three vertices, we decide to graphically represent a propagator of third order via one incoming edge and two outgoing edges, each displayed with one arrowhead. It would not be of much use to continue with successively deriving higher-order tensors as above. The general idea of how the ∆^{(l,N)}_{mk_1...k_N} are structured and how the individual terms can be translated into graphs should be clear:
• ∆^{(l,N)}_{mk_1...k_N} can be represented by a sum of weakly connected and directed tree-level graphs consisting of N vertices and between one and N − 1 propagators. The weak connectivity expresses that there would be a path between each pair u, v of vertices if each propagator were understood as an undirected edge. Each of these graphs is oriented away from the k_1-st vertex, which is designated as the root. In fact, the root is connected to each other vertex by exactly one path, meaning that every other vertex has exactly one incoming propagator, while the root has none. Graphs with these properties are also referred to as arborescences, see e.g. Ref. [21].
• Propagators of order n are represented by one incoming edge and n − 1 outgoing edges. The number of arrowheads is equal to the order of that propagator.
• A term of the given structure indicates that it is the b-th propagator originating in the k_{x_1}-st vertex that establishes a connection to the k_{x_2}-nd, ..., k_{x_{n−1}}-th and k_{x_n}-th vertices. This implies that this propagator is of order n. If no propagator originates in a vertex, that vertex is called a leaf of the given arborescence.
• Derivatives of saturated vertices vanish, as seen in Eq. (12). Thus, whether a certain arborescence contributes or not depends on the layer l for which propagators and vertices are considered. This is embodied by multiplying a factor Θ(l − α) to each arborescence: the given arborescence only contributes if l overshoots its saturation threshold, which we denote by α. The appearance of internal vertices and of propagators of order three or higher decreases α.
In order to pursue a more systematic approach, we want to understand an arborescence in terms of an adjacency matrix A ∈ N_0^{N×N} containing the information about which of the N vertices are connected by which propagators. At each vertex, we therefore enumerate the outgoing propagators starting with 1. If A_ij = 0, there is no connection from the i-th vertex to the j-th vertex. Otherwise, it is the A_ij-th propagator originating in the i-th vertex that establishes this connection. Due to orientation, each allowed adjacency matrix is an upper triangular matrix with a vanishing main diagonal. Since all vertices but the first one have exactly one incoming propagator, there is exactly one non-zero entry in each but the first column of A. Denoting the set of such triangular matrices over K by T_N(K), the set A_N of all allowed adjacency matrices for N vertices is the corresponding subset of T_N(N_0) given in Eq. (16). Counting the appearances of A_ij > 0 in the i-th line of a given adjacency matrix A ∈ A_N determines the order of the respective propagator: if the entry A_ij appears n − 1 times in the i-th line, the corresponding propagator is of order n. Note that the appearance of a propagator of order n decreases the saturation threshold α(A) by n − 2. Another source of smaller α(A) are internal vertices: α(A) is decreased by 1 for each internal vertex or, in terms of adjacency matrices, for each non-zero line but the first one. This behavior is entirely described by the expression for α(A) in Eq. (17). If l reaches or undershoots α(A), the corresponding arborescence is saturated and does not contribute. Finally, Eq. (18) defines the function β_j(A) that determines the line in which the entry in the j-th column is non-zero. Since there is no antecedent propagator to the root of an arborescence, Eq. (18) vanishes for j = 1.
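The defining properties of A_N translate directly into a membership test. The following sketch (our own helpers; the convention that the propagator labels at each vertex run through consecutive integers starting at 1 is our reading of the enumeration described above) checks a candidate matrix and enumerates A_N for small N by brute force:

```python
import numpy as np
from itertools import product

def is_allowed(A):
    """Membership test for A_N: strictly upper triangular over N_0, exactly
    one non-zero entry in every column but the first, and the outgoing
    propagators of each vertex labeled 1, 2, ... without gaps."""
    N = A.shape[0]
    if np.any(np.tril(A) != 0):                    # orientation: upper triangular
        return False
    if any(np.count_nonzero(A[:, j]) != 1 for j in range(1, N)):
        return False                               # one incoming propagator each
    for i in range(N):
        labels = A[i][A[i] > 0]
        if labels.size and set(labels) != set(range(1, labels.max() + 1)):
            return False                           # consecutive labels per vertex
    return True

def enumerate_A(N):
    """Brute-force enumeration of A_N (feasible for the small N of Tab. II)."""
    idx = np.triu_indices(N, k=1)
    mats = []
    for entries in product(range(N), repeat=len(idx[0])):  # labels < N suffice
        A = np.zeros((N, N), dtype=int)
        A[idx] = entries
        if is_allowed(A):
            mats.append(A)
    return mats
```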

TABLE II. All arborescences and corresponding adjacency matrices A ∈ A_N as well as saturation thresholds α(A) for N = 1, 2, 3, 4 vertices. In the way arborescences are represented here, it is always the vertex k_1 that serves as the root. From here, we can deduce that the maximum saturation threshold among all arborescences with N vertices is N − 1.
Using Eq. (18) and the previous observations, we write ∆^{(l,N)}_{mk_1...k_N} in the closed form of Eq. (19). Eq. (19) may appear very intimidating at first sight, but its individual terms are easy to interpret and to recognize in the summands of Eqs. (13), (14) and (15):
• The expression max_r(A_cr) corresponds to the number of all propagators originating in the c-th vertex. Thus, all these propagators are collected by the corresponding product over b. The order of the b-th propagator originating in the c-th vertex equals 1 plus the total number of appearances of the entry b in the c-th line of A. If the c-th vertex is a leaf of the arborescence, that is, if it is external such that there are no outgoing propagators and max_r(A_cr) = 0, the product can be neglected.
• The c-th vertex of the arborescence is denoted by the expression B_c(A). For each of the max_r(A_cr) propagators that originate in the c-th vertex, there is a summation index q. It is the β_c(A)-th vertex whose A_{β_c(A)c}-th propagator leads to the c-th vertex; this is the reason why we sum over the corresponding indices q.
• All N vertices of the entire arborescence are collected by the product over c.
Using Eq. (19) and taking the corresponding saturation thresholds given in Eq. (17) into account, we can represent ∆^{(l,N)}_{mk_1...k_N} as the sum over all adjacency matrices A ∈ A_N given in Eq. (20) (see App. A on p. 28). Finally, Eq. (20) can be inserted into Eq. (8), which is required for computing the Taylor coefficients of Y as shown in Eq. (5). As we approach the end, we want to motivate two more notations. At first, considering Eq. (8), it is important to note that none of the indices k_1, . . . , k_N is shared among several factors ∆^{(l,π_i)}. Therefore, each summand of ∂^N D^{(l,p)}_{nm}/(∂x_{k_1} . . . ∂x_{k_N}) consists of arborescences that each contain different vertices. For simplicity, let us assume we are given a product of two arborescences covering the vertices k_{i_1}, . . . , k_{i_x} and k_{j_1}, . . . , k_{j_y}, respectively, with I ∩ J = ∅ for I = {i_1, . . . , i_x} and J = {j_1, . . . , j_y}. Then we write the product as one disconnected graph, with each former arborescence being a connected component and, as such, a subgraph, cf. Eq. (21). Eq. (21) allows expressing Eq. (8), and therefore ∂^N Y/(∂x_{k_1} . . . ∂x_{k_N}), as a sum of graphs whose connected components are arborescences. If and only if such a graph has exactly one connected component, it is an arborescence itself, like the ones in Tab. II. The saturation threshold of a disconnected graph is simply the maximum saturation threshold among all its connected components. The Taylor coefficients up to third order then turn out as given in Eqs. (22)-(24). Although graphs whose connected components are arborescences are in general not invariant under vertex permutations, invariance is recovered by summing over all vertices, as is the case in the Taylor series. This collapses graphs that are related via vertex permutations (e.g. graphs 4, 5 and 6 in Eq. (24)) onto one graph and correspondingly introduces symmetry factors to the naive Taylor series.
The following short-hand notation naturally agrees with vertex permutation symmetry and proves useful for the Taylor series: in Eq. (25), each external leg stands for the contraction of the graph with some component of the displacement (x − x_0) from the expansion point x_0. Since L > 0 by construction, assigning its natural value to the empty graph, that is A ∈ ∅, and using the short-hand notation from Eq. (25) finally yields the interesting series representation of Eq. (26). This reveals the following structure of the component Y_n: graphs that contain up to N vertices contribute to the N-th Taylor approximation of Y_n, and a graph with c connected components is weighted with an additional propagator factor for each component. The deeper Y, that is, the larger L, the more graphs contribute to a given order of the Taylor expansion due to overshooting the corresponding saturation threshold.
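As an end-to-end consistency check of the formalism, the graph-based coefficients can always be compared against automatic differentiation. A sketch (our naming) for the first two orders of a scalar-output MLP in PyTorch:

```python
import torch
from torch.autograd.functional import jacobian, hessian

def taylor_coefficients(Y, x0):
    """First- and second-order Taylor coefficients of a scalar-output
    network Y at the expansion point x0, i.e. the sums of all one- and
    two-vertex graphs of the perturbation theory."""
    f = lambda x: Y(x).squeeze()
    return jacobian(f, x0), hessian(f, x0)
```

This is exactly the kind of numerical cross-check performed in Sects. V and VI below.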

III. THE BORN SERIES AS A TAYLOR SERIES
In the following, we apply the above-established framework of a neural network perturbation theory to low-energy two-body scattering as an example. All relevant observables can be parameterized by the S-wave scattering length a_0, which conversely renders a_0 a meaningful and distinguished target and the corresponding potential V a suitable input in a supervised-learning scenario. By analyzing patterns in their first weight matrix, it was shown in Ref. [11] that, under similar training conditions, MLPs develop a quantum perturbation theory during training and approximate scattering lengths in terms of Born approximations. Unfortunately, the arguments given there strongly rely on the fact that L2-regularization and ReLU activations are used. Considering the Taylor expansion of a_0 instead provides an analytical interpretation that is independent of the particular MLP architecture.
The representability as a Born series is carried over from the T-matrix to the S-wave scattering length. Assuming that potentials V vanish beyond a finite range ρ, the Born series for a_0 can then be written as

a_0 = Σ_{n=1}^{∞} a^{(n)} ,   (27)

where the n-th Born term a^{(n)} is of n-th order in the potential and depends on the reduced mass µ of the two-body system. For sufficiently shallow potentials, Eq. (27) can be treated perturbatively by neglecting higher-order summands, which finally results in Born approximations of a_0. Correspondingly, the first two orders are given (with ħ = 1) by

a^{(1)} = 2µ ∫_0^ρ dr r² V(r)   (28)

and

a^{(2)} = −(2µ)² ∫_0^ρ dr ∫_0^ρ dr′ r V(r) min(r, r′) r′ V(r′) .   (29)

An MLP can only process data in a finite-dimensional vector space. When training an MLP A_i to predict dimensionless scattering lengths a_0/ρ based on dimensionless input potentials, we therefore need to discretize potentials first. A dimensionless, discretized potential can be understood as a vector U ∈ R^d, with d the degree of discretization and the component U_n = −2µρ² V(n/d) corresponding to the n-th potential step. It is obvious that d must be chosen sufficiently large for the discretization error to become negligible. For simplicity, we only consider potentials with U_n ≥ 0, that is, attractive potentials. In terms of discretized potentials, the Born terms in Eqs. (28) and (29) reduce to the sums

a^{(1)}/ρ = −(1/d³) Σ_{n=0}^{d−1} n² U_n   (30)

and

a^{(2)}/ρ = −(1/d⁵) Σ_{k_1=0}^{d−1} Σ_{k_2=0}^{d−1} k_1 k_2 min(k_1, k_2) U_{k_1} U_{k_2} .   (31)

Eqs. (30) and (31) strongly suggest understanding the Born series in Eq. (27) for discretized potentials as a Taylor series with respect to the expansion point U_0 = 0. Vice versa, the more precise the predictions of A_i become, the closer we expect its leading-order Taylor coefficients to approach the leading-order Taylor coefficients of a_0/ρ. For the Taylor coefficients we could then ideally, that is, if the loss vanished, observe the equalities

∂A_i/∂U_{k_1}|_{U=0} = −k_1²/d³   (33)

and

∂²A_i/(∂U_{k_1}∂U_{k_2})|_{U=0} = −k_1 k_2 (k_1 + k_2 − |k_1 − k_2|)/d⁵ .   (34)
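In code, the discretized Born terms are straightforward to evaluate. A NumPy sketch under the same conventions (d = 64, U_n ≥ 0); note that the kernel k_1 k_2 min(k_1, k_2) of Eq. (31) equals k_1 k_2 (k_1 + k_2 − |k_1 − k_2|)/2:

```python
import numpy as np

d = 64
n = np.arange(d)

def born1(U):
    """First-order Born term, Eq. (30): a(1)/rho = -(1/d^3) sum_n n^2 U_n."""
    return -np.sum(n**2 * U) / d**3

def born2(U):
    """Second-order Born term, Eq. (31), with kernel k1 * k2 * min(k1, k2)."""
    K1, K2 = np.meshgrid(n, n, indexing="ij")
    kernel = K1 * K2 * np.minimum(K1, K2)
    return -(U @ kernel @ U) / d**5
```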

IV. NETWORK AND TRAINING DETAILS
In order to enhance precision even further and to reduce statistical noise, we consider an ensemble A of N = 20 MLPs A_i instead of working with a single MLP. Each member A_i consists of one output layer and nine linear 64 × 64 layers, each activated via the GELU activation function [22]. GELU is smooth at the origin and bypasses the vanishing-gradients problem, which makes it particularly interesting for such deeper architectures. Finally, the output of the ensemble A is simply given as the mean of all individual member outputs,

A(U) = (1/N) Σ_{i=1}^{N} A_i(U) .   (35)

Due to linearity, we observe the same relation between the respective derivatives of all A_i and A,

∂^N A/(∂U_{k_1} ⋯ ∂U_{k_N}) = (1/N) Σ_{i=1}^{N} ∂^N A_i/(∂U_{k_1} ⋯ ∂U_{k_N}) .   (36)

We decide to discretize potentials into d = 64 steps. Discrete potentials are generated as Gaussian random walks. In fact, we generate two training sets T_1, T_2 with sizes |T_{1,2}| = 4 × 10⁴ and corresponding test sets t_1, t_2 with sizes |t_{1,2}| = 4 × 10³. Through downsampling, each of the four target distributions is uniform in and limited to the narrow interval [−0.01, 0] containing the origin. While the targets in T_1 and t_1 are simply the S-wave scattering lengths a_0 that belong to the given potentials, the targets in T_2 and t_2 are given by a_0 + (1/d³) Σ_{n=0}^{d−1} n² U_n, which are scattering lengths from which the first Born term has been subtracted. Here, we derive scattering lengths for discretized potentials using the transfer matrix method, see Ref. [23]. While the first-order Born term dominates the targets in T_1 and t_1, it is the second-order Born term that is of leading order within the targets of T_2 and t_2.
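The data generation can be sketched as follows (our own implementation, not the authors' code). Potentials are Gaussian random walks, and for a piecewise-constant potential the zero-energy radial solution u'' = −U u can be propagated exactly through each step, which is the essence of the transfer matrix method of Ref. [23]; a_0/ρ is then read off the asymptotic behavior u ∝ (x − a_0/ρ). The exact random-walk recipe is not spelled out in the text, so the choice below is an assumption:

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)

def random_potential(scale=0.01):
    """Attractive potential U_n >= 0 as the absolute value of a Gaussian
    random walk (one simple choice; assumed, not specified in the text)."""
    return np.abs(np.cumsum(rng.normal(0.0, scale, d)))

def scattering_length(U):
    """a0/rho from the zero-energy radial equation u'' = -U u, u(0) = 0,
    evolved exactly through each constant step (transfer-matrix style)."""
    h = 1.0 / d
    u, du = 0.0, 1.0
    for Un in U:
        k = np.sqrt(Un)
        if k * h > 1e-8:
            c, s = np.cos(k * h), np.sin(k * h)
            u, du = c * u + (s / k) * du, -k * s * u + c * du
        else:
            u += h * du                      # free propagation for U_n ~ 0
    return 1.0 - u / du                      # u ~ (x - a0/rho) for x > 1

U = random_potential()
a0 = scattering_length(U)                    # target for T_1, t_1
target_T2 = a0 + np.sum(np.arange(d)**2 * U) / d**3   # first Born term removed
```

Expanding scattering_length to first order in U reproduces exactly −(1/d³) Σ n² U_n, i.e. Eq. (30), which serves as a useful sanity check.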
Using PyTorch [24], two ensembles A and B as described in Eq. (35) are trained by training their members one after another, one using the data sets (T_1, t_1) and the other one with respect to (T_2, t_2). The target range is scaled to the interval [0, 1] by multiplying the targets by a factor of −10². Accordingly, predictions and derivatives must be rescaled by a factor of −10⁻² afterwards. Weights in both ensembles are initialized using He initialization [25]. We use mini-batch learning with batch size 10 and train over 20 epochs with an exponentially decaying learning-rate schedule η_t = 10⁻³ e^{−t/2} starting at t = 0, while using the RMSprop optimizer, see Ref. [26], and abandoning L2-regularization. During training we minimize the mean-squared-error loss (MSELoss). Another useful performance measure is the mean-absolute-percentage-error (MAPE)

L_{A,t} = (100% / |t|) Σ_{(U,a) ∈ t} |1 − A(U)/a| ,   (37)

where (U, a) is a pair of a discrete potential and the corresponding target a from the test set t. Finally, when finishing an epoch, the state of the MLP A_i is indeed kept as the starting point for the subsequent epoch, but only saved to a file if its loss is smaller than all previous losses for that A_i. After the two training procedures, we observe the two ensembles to have the satisfactorily low losses L_{A,t_1} = 3.034 × 10⁻² % and L_{B,t_2} = 5.009 × 10⁻¹ %, which indicates that A and B have learned to accurately predict a_0 and a_0 + (1/d³) Σ_{n=0}^{d−1} n² U_n, respectively. This was to be expected, as the used MLP architecture is very deep in relation to the simplicity of the problem. However, it is not the performance on the data sets itself, but the quality of the Taylor coefficients, that is linked to Born approximations; it is therefore as important to interpretability and is the actual focus of the following analysis.
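A minimal training sketch of a single ensemble member under the stated hyperparameters (our code; details such as checkpointing on the best test loss are paraphrased from the text):

```python
import math
import torch
from torch import nn

def make_member():
    """Nine 64x64 GELU layers plus a linear output layer (L = 10)."""
    layers = []
    for _ in range(9):
        layers += [nn.Linear(64, 64), nn.GELU()]
    layers.append(nn.Linear(64, 1))
    mlp = nn.Sequential(*layers)
    for m in mlp:
        if isinstance(m, nn.Linear):
            nn.init.kaiming_normal_(m.weight)   # He initialization
            nn.init.zeros_(m.bias)
    return mlp

def train_member(mlp, loader, epochs=20):
    opt = torch.optim.RMSprop(mlp.parameters(), lr=1e-3)  # no L2 regularization
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=math.exp(-0.5))
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for U, a in loader:                     # mini-batches of size 10
            opt.zero_grad()
            loss = loss_fn(mlp(U), -100.0 * a)  # targets scaled into [0, 1]
            loss.backward()
            opt.step()
        sched.step()                            # eta_t = 1e-3 * exp(-t/2)
        # keep the state for the next epoch; checkpoint only on a new best loss
```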

V. FIRST-ORDER BORN TERM
From the ensemble A, we expect its analytical first-order Taylor coefficients to agree with the numerical derivatives and with the theoretically expected coefficients −k_1²/d³ of Eq. (33). The superscript [A_i] in the analytical coefficients points out that the respective propagators and vertices are computed for the weights and biases of the member A_i. For the numerical derivatives we use the step size s = 10⁻². As we are given an ensemble, we estimate errors of ensemble quantities via the standard deviation of the corresponding member-quantity distribution. To give an example, we observe the mean µ = −2.503 × 10⁻⁷ and standard deviation σ = 5.602 × 10⁻⁷ for the distribution {A_1(0), . . . , A_N(0)}, such that we may estimate A(0) = −2.503(5602) × 10⁻⁷.
Since there is no constant term in the Born series, such an axis intercept, which is several orders of magnitude smaller than the average scattering lengths in T_1 and t_1, was to be expected. The deviation from the theoretically expected intercept A(0) = 0 is less than 1σ.
FIG. 1. a) The analytically derived first-order Taylor coefficients (blue) and the expected coefficients −k_1²/d³ (red); the analytical Taylor coefficients are computed via Eq. (33) and averaged according to Eq. (36). b) The difference (black) between the analytical and the numerical Taylor coefficients.

Now we can move to the first Born term. In Fig. 1 the analytically and numerically computed Taylor coefficients are compared to the theoretical Taylor coefficients −k_1²/d³. As can be seen, the three curves agree very well with each other. For one, this implies that the ensemble applies a very good approximation of the first Born approximation to input data; for another, the difference between the analytical and numerical coefficients, which is three orders of magnitude smaller than the actual coefficients, verifies that Eq. (20) is a valid tool for computing derivatives of MLPs. Thus, the ensemble has learned to apply the first-order Born approximation to data.
Quantitatively, the quality of the analytical Taylor coefficients of A can be estimated by fitting a model α^{[A_i]} k_1² to the first-order Taylor coefficients of each member A_i. Due to d = 64, we expect values close to α = −1/d³ = −3.815 × 10⁻⁶. Considering the distribution of all α^{[A_i]} finally yields an ensemble fit parameter α_A that deviates less than 2σ from the theoretical value α = −3.815 × 10⁻⁶. Moreover, its error ∆α_A = 3.0 × 10⁻⁸ is two orders of magnitude smaller, which indicates that the Taylor coefficients indeed display a clear quadratic behavior.
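Such a one-parameter fit has a closed-form least-squares solution; a short sketch (the helper name is ours):

```python
import numpy as np

def fit_alpha(coeffs, d=64):
    """Least-squares fit of the model alpha * k^2 to the first-order Taylor
    coefficients; ideally alpha is close to -1/d^3."""
    k = np.arange(d)
    return np.sum(k**2 * coeffs) / np.sum(k**4)
```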

VI. SECOND-ORDER BORN TERM
For the same ensemble A, we could naively try to reproduce the second-order Born term in the same fashion as presented above. This would involve fitting a model β k_1 k_2 (k_1 + k_2 − |k_1 − k_2|) to the second-order Taylor coefficients, which are analytically computed using Eq. (34) for each member A_i and later averaged over the whole ensemble. However, the expected fit parameter β = −1/d⁵ = −9.313 × 10⁻¹⁰ is four orders of magnitude smaller than α. As the contribution of the second-order Born term is consequently much smaller than the contribution of the first order, it is not guaranteed that the A_i accurately reproduce the second order. This can also be understood from the fact that the loss reduction benefits considerably more from adapting to the first-order Born term during training, while the overall performance does not significantly suffer from a badly approximated second-order Born term. Indeed, we find the fit parameter β_A = −1.76(1134) × 10⁻⁹, which has an extremely large variance and thus fails at reproducing the second Born term. Therefore, it is much more reliable to consider the second ensemble B, which has been trained under the same conditions with respect to the data sets T_2 and t_2. The leading term in its targets a_0 + (1/d³) Σ_{n=0}^{d−1} n² U_n is now the second-order Born term. At first, we observe B(0) = 6.082(4825) × 10⁻⁶ and α_B = −9.173(692) × 10⁻⁸.
Again, the axis intercept is negligibly small. In contrast to the previous ensemble A, the fit parameter of the first-order Taylor coefficients is two orders of magnitude smaller than the α known from construction, which is further evidence that the second-order Born term is indeed of leading order in B. The second-order Taylor coefficients are computed analytically; due to L = 10, both two-vertex arborescences overshoot their saturation threshold and therefore contribute to the coefficients. These analytical coefficients can be compared to the numerical derivatives and to the theoretically expected coefficients −k_1 k_2 (k_1 + k_2 − |k_1 − k_2|)/d⁵; an overview is given in Fig. 2. The shapes of the graphs in Figs. 2a) and 2b) slightly deviate from that in Fig. 2c), especially due to the distinctive diagonal that does not appear in theory. Its origin presumably lies in the absence of a proper weight regularization: due to overfitting, there could be residual oscillations near the origin that are not completely averaged out by ensembling, which implies a slightly larger curvature and, thus, diagonal elements in the Hessian. Apart from that, all three plots are sufficiently similar for us to confidently claim that B has approximated the second-order Born term well, which is also reflected in the low loss of B. We measure the quality of the analytical Taylor coefficients by fitting to them the model β k_1 k_2 (k_1 + k_2 − |k_1 − k_2|) already mentioned above. Considering the distribution {β^{[B_i]}}, we obtain the ensemble fit parameter β_B. Unfortunately, its error is of the same order of magnitude as its value, for which the noisy artifacts and the fact that the behavior of the ensemble's Taylor coefficients is not purely k_1 k_2 (k_1 + k_2 − |k_1 − k_2|), as can both be seen in Fig. 2a), are responsible. These again originate in training and architecture details and could probably be alleviated by a suitable hyperparameter optimization. Nonetheless, β_B deviates less than 2σ from the theoretical parameter β = −9.313 × 10⁻¹⁰, which is additional evidence that B approximates the second-order Born term and applies it to new input potentials.

FIG. 2. a) Analytical, b) numerical and c) theoretical second-order Taylor coefficients. Especially for the numerical coefficients we observe noisy artifacts that might originate in choosing a relatively large step size of s = 0.01 or in a missing regularization. The shapes of the graphs in a) and b) also slightly deviate from that in c), especially due to the distinctive diagonal. Nonetheless, all three graphs display the same behavior for the most part. Thus, the ensemble B has learned to predict targets by computing the second-order Born term of the input data. In addition, this provides further validation of Eq. (20).

Finally, we can combine the first-order Taylor coefficients of A and the second-order Taylor coefficients of B into an effective, machine-learned second-order Born approximation. The comparison of that effective Born approximation with the predictions of A and the true scattering lengths, displayed in Fig. 3, shows that its range of validity, which we estimate to be approximately [−0.2, 0], is one order of magnitude larger than the original training range. It appears that the second-order Born approximation does not suffice beyond that regime, such that the third-order Born term has to be included in order to predict scattering lengths for deeper potentials.

VII. DISCUSSION AND OUTLOOK
As soon as an NN can be represented as an arbitrarily deep MLP, we have proven that partial derivatives of any order can be derived analytically via Eqs. (8) and (20). Instead of expressing these directly in terms of the MLP's weights and biases, we use the notion of propagators and vertices, which is motivated by Feynman diagrams in quantum field theories and makes the underlying combinatorics more manageable. Given an expansion point x_0 in input space, the partial derivatives serve as coefficients of a Taylor series. Considering small perturbations around x_0, higher-order terms can be neglected, which defines a perturbation theory and leaves us with a Taylor approximation. Thereby, changes in the MLP's output can be easily predicted without explicit evaluations of that MLP, which is an important hallmark of local interpretability.
We train two ensembles A and B, each consisting of N = 20 MLPs with L = 10, to predict S-wave scattering lengths and scattering lengths from which the first Born approximation has been subtracted, respectively, for given discretized potentials. We easily derive the individual first-order Taylor coefficients of A by applying the above framework and find them to behave similarly to −k_1²/d³. This indicates that A approximates scattering lengths, even beyond the training regime, by the first-order Born approximation. However, the second-order Taylor coefficients of A do not reliably reproduce the second-order Born term, since the contribution of the second order is several orders of magnitude smaller than that of the first order.
Errors occurring here would, therefore, hardly affect the MLP's loss. Instead, we consider the ensemble B, as the leading contribution in its predictions can be shown to accurately approximate the second-order Born term. For both ensembles, the similarity of the analytical, numerical and theoretical Taylor coefficients not only implies that the respective ensembles apply Born approximations, but also serves as a practical verification of the established MLP perturbation theory.
At this point one could of course argue that it would have been much more convenient to simply train an NN Y : R⁶⁴ → R that is just the sum of one linear layer l and one bilinear layer B, that is

Y(U) = l(U) + B(U, U) .

Using this architecture instead of deep MLPs would not only reduce the computational effort significantly, but would also have imposed some desired properties like Y(0) = 0 and the simultaneous learning of the first- and second-order Born terms. Note that Y in this case is a second-order Taylor approximation by itself, which allows reading off Taylor coefficients directly instead of first deriving them, as performed in our analysis. However, such an NN is not a universal approximator and will therefore fail in reproducing more negative scattering lengths a_0 ≲ −0.2 for deeper potentials. This is because the third-order Born term can be shown to be no longer negligible in this regime, but cannot be approximated by Y due to behaving like O(U³). Therefore, using such an intrinsically interpretable architecture may indeed simplify the analysis, but must be well justified for the particular case.
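In PyTorch, such an intrinsically interpretable model takes only a few lines; a sketch under the stated assumptions, with bias-free layers so that Y(0) = 0:

```python
import torch
from torch import nn

class LinearPlusBilinear(nn.Module):
    """Y(U) = l(U) + B(U, U): the network is its own second-order Taylor
    approximation, so the first- and second-order Born terms can be read
    off directly from the weights of l and B."""
    def __init__(self, d=64):
        super().__init__()
        self.l = nn.Linear(d, 1, bias=False)        # first-order part
        self.B = nn.Bilinear(d, d, 1, bias=False)   # second-order part

    def forward(self, U):
        return self.l(U) + self.B(U, U)
```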
The presented approach of establishing local interpretability by considering Taylor approximations of NNs in the vicinity of given expansion points is a typical example of a proxy method according to Ref. [18]. As such, it merely provides a post-hoc interpretation of the network's predictions, based on approximations and thus on deviations from the actual predictions. In this case, prediction and interpretation therefore have to be understood as two independent instances. In recent years there have been many efforts to close the gap between prediction and interpretation by ad-hoc interpretation methods. These involve, for example, training NNs whose architectures are either intrinsically interpretable or can be brought into an interpretable representation, see Ref. [18]. At the cost of a prediction-interpretation trade-off, the advantage of ad-hoc methods is that the resulting interpretations are completely faithful to the NN's prediction, in contrast to the mentioned post-hoc methods. Nevertheless, the local post-hoc interpretations of A and B presented here are still eminently insightful, especially since they explain the observations made in Ref. [11] regarding the development of a perturbation theory for S-wave scattering lengths in an analytical and, up to the requirement of being an MLP, architecture-independent manner.

APPENDIX

A. Proof of Eq. (6)

Theorem 1. The first-order partial derivative of the nm-th matrix element of the l-th layer propagator D^{(l,p)}_{nm} of order p is given by Eq. (6), where we have introduced the matrix elements ∆^{(l,1)}_{mk}.
Proof. First of all, it is easy to see that the derivative with respect to the k-th component x_k of the input is proportional to a propagator of higher order p + 1. Due to the chain rule, a derivative of the previous layer's output appears; by inserting the recursive step from Eq. (2), this dependency can be shifted to the previous layer. In the same manner, we can apply the chain rule successively to all antecedent layers until the base y^{(0)}_{q_0} = x_{q_0} is reached. Thereby, each layer provides a matrix multiplication with a first-order propagator. Rearranging those propagators and sums finally yields Eq. (6).
B. Proof of Eq. (8)

Lemma 1. With the tensor elements ∆^{(l,p)}_{mk_1...k_p} and the partition sets as defined in Eqs. (7), (9) and (10), the relation of Eq. (44) holds between sums over the partition subsets Π^c_{N+1}, Π^{c−1}_N and Π^c_N.

Proof. Due to the bijectivity of permutations, each summand on the left-hand side of Eq. (44) is explicitly either linear in or independent of ∆^{(l,1)}_{mk_{N+1}}. This allows us to make an ansatz that is linear in ∆^{(l,1)}_{mk_{N+1}}, with slope f^{(l,c)}_{mk_1...k_N} and axis intercept g^{(l,c)}_{mk_1...k_N}. Due to the differentiation, only summands with π_i = 1 and σ(Σ_{j=1}^{i} π_j) = N + 1 contribute to f^{(l,c)}_{mk_1...k_N}. This requires the considered partition (π_i)_{i=1}^{c} ∈ Π^c_{N+1} to contain a summand 1. As we have defined partitions to be ordered towards lower summands, this condition can easily be formulated as π_c = 1, which allows us to reduce the sum over Π^c_{N+1} to a sum over Π^{c−1}_N. Note that the index k_{N+1} does not appear in any of the remaining factors of a contributing summand. Therefore, we can eliminate π_i from (π_i)_{i=1}^{c−1} ⊕ (1) together with the sum over i and write the second sum as a sum over the corresponding permutation subset.
It remains to derive the axis intercept g^{(l,c)}_{mk_1...k_N}. Here, summands for which π_i = 1 and σ(Σ_{j=1}^{i} π_j) = N + 1 are both fulfilled are the only ones that do not contribute. In contrast to the previous case, π_c = 1 is therefore not required. For each summand there is exactly one i = i_{N+1}[σ] with σ(Σ_{j=1}^{i} π_j) = N + 1. In order for such a term to contribute, it is necessary that π_{i_{N+1}[σ]} > 1, which implies Θ(π_{i_{N+1}[σ]} − 1) = 1. Separating the corresponding factor from the product yields an expression for g^{(l,c)}_{mk_1...k_N}. In analogy to the previous analysis of the slope f^{(l,c)}_{mk_1...k_N}, we again want to express the sum over Π^c_{N+1} in terms of simpler partitions. As π_c = 1 is no longer required, we are able to approach this by a sum over Π^c_N instead. This is possible since for all (π_{i′})_{i′=1}^{c} ∈ Π^c_{N+1} and i = i_{N+1}[σ] ∈ {1, . . . , c} there is exactly one partition (β_{i′})_{i′=1}^{c} ∈ Π^c_N with π_{i′} = β_{i′} + δ_{i′i}. Note that this naturally covers all contributing partitions due to π_i = β_i + 1 > 1, which allows us to eliminate the factor Θ(π_{i_{N+1}[σ]} − 1). All contributing permutations must not map values other than Σ_{j=1}^{i} β_j + 1 onto N + 1, such that we can reduce the sum over S((β_{i′} + δ_{i′i})_{i′=1}^{c}) to a sum over S((β_{i′})_{i′=1}^{c}), which also no longer depends on the summation index i. Finally, eliminating the factor δ_{N+1, σ(Σ_{j=1}^{i} β_j + 1)} yields the desired expression for g^{(l,c)}_{mk_1...k_N}.
Theorem 2 (Eq. (8)). The N-th derivative of the propagator D^{(l,p)}_{nm} is given by Eq. (8).
Proof. We prove Eq. (8) by complete induction. We first convince ourselves of its validity in the base case N = 1, which indeed corresponds to Eq. (6), the defining equation for ∆^{(l,1)}_{mk_1}. Subsequently, the inductive step involves evaluating the derivative ∂^{N+1} D^{(l,p)}_{nm}/(∂x_{k_1} . . . ∂x_{k_{N+1}}). By applying the product rule, we either encounter propagator derivatives or derivatives of tensor elements, which we split into two distinct sums. Up to the N-th summand in the first sum and the first summand in the second sum, all other summands can be combined into one sum from c = 2 to c = N. The two external summands can be derived explicitly; the permutation subgroup in each case consists of just the identity. Using Lemma 1, the remaining sum can be expressed as a sum over Π^c_{N+1}. Combining all terms finally proves the inductive step.

C. Proof of Eq. (20)
Theorem 3 (Eq. (20)). The tensor elements ∆^{(l,N)}_{mk_1...k_N} can be expressed as the weighted sum, given in Eq. (20), of all N-vertex arborescences as defined in Eq. (19), with adjacency matrices A ∈ A_N. Weighting with factors Θ(l − α(A)) causes an arborescence to contribute only if the layer l, for which the tensor element is considered, overshoots the saturation threshold α(A) given in Eq. (17).
Proof. Since the base case (N = 1) has already been shown in Eq. (13), we directly start with the inductive step. The commutator with the vertex B_c(A) will later prove to be useful: for O = ∂/∂x_{k_{N+1}}, Eq. (12) yields [O, B_c(A)] ∝ Θ(l − max_r(A_cr) − 1). Due to the differentiation, a new vertex is introduced into each summand. However, note that the way this new vertex is connected to the given vertices differs among the terms: in the first summand there appears an additional propagator of second order establishing a connection to the (N+1)-th vertex, while in the remaining max_r(A_cr) summands, the order of the b-th propagator is raised by one, which also allows an additional connection to the new vertex. Let us define the set ν_c(A), which is a subset of A_{N+1}. Its |ν_c(A)| = max_r(A_cr) + 1 elements correspond to adjacency matrices of N-vertex arborescences that have been extended by an (N+1)-th vertex, which is connected to the c-th vertex. For the element with b = max_r(A_cr) + 1, this corresponds to establishing a connection via an additional propagator, which is consequently of second order. Otherwise, we have b ∈ {1, . . . , max_r(A_cr)}, which corresponds to raising the order of the b-th propagator in the c-th vertex and thereby allows a connection with the new vertex.
The observations made above can be formulated in the language of adjacency matrices: in terms of the (N+1) × (N+1) adjacency matrices of the set ν_c(A), the commutator in Eq. (49) can be written as a sum over A′ ∈ ν_c(A). Here we could express the (N+1)-th vertex as a term B_{N+1}(A′) with A′ ∈ ν_c(A), due to the (N+1)-th line containing only zeros, thus max_r(A′_{N+1,r}) = 0, and due to β_{N+1}(A′) = c. Each element of ν_c(A) is represented in this sum, which implies that each possible connection from the c-th vertex to the (N+1)-th vertex is established. Note that the first summand, which introduces a new propagator, is the only term that may alter the saturation threshold of the arborescence, namely in the case that max_r(A_cr) ≥ α(A). Taking this into account, we can write the derivative of the arborescence as a sum over ν_c(A). It is very insightful to analyze the resulting expression,
where we sum over the union ν(A) = ∪_{c=1}^{N} ν_c(A). As the introduction of a new vertex to a given arborescence only influences the corresponding adjacency matrix by appending a new line and column, but leaves the original adjacency matrix unaltered, the disjointness of the sets ν(A) is obvious. Nonetheless, it can easily be argued that A_{N+1} is the union of all ν(A) for A ∈ A_N. Therefore, it follows that both of the resulting sums must be identical, which proves the inductive step and thereby completes the proof.