Neural Network Perturbation Theory and its Application to the Born Series

Deep Learning using the eponymous deep neural networks (DNNs) has become an attractive approach to various data-based problems of theoretical physics in the past decade. There has been a clear trend towards deeper architectures containing increasingly powerful and involved layers. In contrast, Taylor coefficients of DNNs still appear mainly in interpretability studies, where they are computed at most to first order. However, especially in theoretical physics, numerous problems benefit from accessing higher orders as well. This gap motivates a general formulation of neural network (NN) Taylor expansions. Restricting our analysis to multilayer perceptrons (MLPs) and introducing quantities we refer to as propagators and vertices, both depending on the MLP's weights and biases, we establish a graph-theoretical approach. Similarly to Feynman rules in quantum field theories, we can systematically assign diagrams containing propagators and vertices to the corresponding partial derivative. Examining this approach for S-wave scattering lengths of shallow potentials, we observe NNs to adapt their derivatives mainly to the leading order of the target function's Taylor expansion. To circumvent this problem, we propose an iterative NN perturbation theory. During each iteration we eliminate the leading order, such that the next-to-leading order can be faithfully learned during the subsequent iteration. After performing two iterations, we find that the first- and second-order Born terms are correctly adapted during the respective iterations. Finally, we combine both results to find a proxy that acts as a machine-learned second-order Born approximation.


I. INTRODUCTION
Machine Learning (ML) is a highly active field of research that provides a wide range of tools to tackle various data-based problems. As such, it also receives growing attention in the theoretical physics literature, see e.g. Refs. [1-15]. Many data-based problems involve modeling an input-target distribution from a data set, which is referred to as supervised learning. After a successful training procedure, the ML algorithm is capable of correctly predicting targets even when given previously unseen inputs, i.e. it generalizes what it has learned to new data. Nowadays, neural networks (NNs) are a popular choice in the context of supervised learning. There is an overwhelming variety of NN architectures that are as diverse as the problems they are specially suited for. The most fundamental class of NNs is certainly given by multilayer perceptrons (MLPs). Many characteristic properties of state-of-the-art NNs, like the concept of a layered architecture or the use of non-linear activation functions, originate in much simpler MLP architectures.
The property that makes NNs perform so well in many different applications is that of being a universal approximator: As long as the architecture comprises an output layer and at least one hidden layer that is activated via a bounded, non-linear activation function, the NN can approximate any continuous map between inputs and targets arbitrarily precisely for a sufficiently large number of neurons in that hidden layer, as described by the universal approximation theorem (UAT), see Refs. [16, 17]. However, increasing the number of neurons in one layer is a rather inefficient way to improve the NN's performance. It is more promising to introduce additional non-linearly activated hidden layers instead, which eventually opens up the field of Deep Learning. Here, the term "deep" refers to a large number of such non-linearly activated layers. Its protagonists, the deep neural networks (DNNs), are known for their enormous predictive power and for demonstrating super-human performance on specific tasks. Last but not least, this makes them a promising approach to problems of theoretical physics, as shown e.g. in Refs. [1-3].
While there has clearly been a trend towards deeper and more complex architectures in the past decade, capable of approximating increasingly involved target functions, the interest in the analytical properties of NNs remains limited to interpretability studies. Notable methods used to gather post-hoc interpretations of NN predictions are the deep Taylor decomposition and LIME, see Refs. [18] and [19]. The former is used exclusively for image classifiers: Given a root point of the classifier, a heatmap can be constructed using the NN's first-order derivatives, assigning to each pixel a certain relevance value. This highlights those pixels that are substantially involved in the resulting decision. Similarly, LIME performs a linear regression on adjacent synthetic samples and thus provides a linear approximation, or in the language of Ref. [20] a linear proxy, of the target function in the vicinity of the root point. Both methods indeed facilitate powerful local post-hoc interpretations of the respective NN's behavior in feature space. However, they do not provide any information about second- or higher-order derivatives and are, therefore, blind to most of the NN's analytical structure. This contrasts greatly with the fact that a majority of problems in theoretical physics would certainly benefit from access not only to the NN's first-order derivatives, but ideally to its entire analytical structure. Obvious examples that come to mind are the post-hoc verification of equations of motion or, as the present study focuses on, the extraction of dominating terms from an underlying perturbation theory.
Closing the mentioned gap requires a general method to compute NN derivatives of arbitrary order. On the one hand, numerical differentiation is certainly not a difficult task, but it may be rendered inaccurate by truncation and round-off errors and does not reveal the contribution of the individual weights and biases to a particular order. On the other hand, a naive analytical differentiation of NN predictions does not share these weaknesses, but suffers from an unmanageable number of different contributions, especially for high-order derivatives and a large number of hidden layers. Restricting the analysis for simplicity to MLPs, we propose a graph-theoretical formalism to analytically compute partial derivatives of any order for arbitrarily many hidden layers, while keeping track of the combinatorics. Similarly to backpropagation in gradient-descent techniques, where the first-order derivatives of the loss function with respect to an internal parameter can be represented as a matrix product, we want to bypass the naive and inefficient use of the chain and product rules and understand arbitrary derivatives of an MLP in terms of tensor products. We observe two distinct classes of quantities, which we refer to as propagators and vertices, that each depend on the weights, biases and chosen activation functions and naturally appear in such a tensor formulation. Their naming is intentional, as we discover several similarities between the Taylor expansion of MLPs and perturbation theory in quantum field theories: Analogously to Feynman rules, we find underlying rules that specify which combinations of vertices and propagators, i.e. which diagrams, are allowed and contribute to a given Taylor coefficient. One major difference, however, is that loops are not allowed, in contrast to quantum field theories. In a graph-theoretical context, we can show that these diagrams are oriented and rooted trees, i.e. arborescences. The concept of explaining derivatives in terms of graphs is already known and a successful approach in the context of ordinary differential equations or, more precisely, the Butcher series, see Ref. [21]. In contrast to the Butcher series, however, we clearly set our focus on Taylor expansions and perturbation theory.
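The trade-off between numerical and analytical differentiation mentioned above can be illustrated on a toy model. The following minimal sketch (our own illustration, not part of the formalism developed below) compares exact chain-rule derivatives of a one-hidden-layer tanh MLP with central finite differences, whose accuracy degrades with the derivative order due to round-off amplification:

```python
import numpy as np

# Toy 1-16-1 MLP with tanh activation: y(x) = w2 . tanh(w1 * x + b1).
rng = np.random.default_rng(0)
w1, b1, w2 = rng.normal(size=16), rng.normal(size=16), rng.normal(size=16)

def mlp(x):
    return w2 @ np.tanh(w1 * x + b1)

# Exact derivatives via the chain rule (tanh' = 1 - tanh^2).
def d1(x):
    t = np.tanh(w1 * x + b1)
    return w2 @ ((1 - t**2) * w1)

def d2(x):
    t = np.tanh(w1 * x + b1)
    return w2 @ (-2 * t * (1 - t**2) * w1**2)

# Central finite differences: accurate at first order, noticeably less
# accurate already at second order due to round-off amplification by 1/h^2.
h = 1e-5
fd1 = (mlp(h) - mlp(-h)) / (2 * h)
fd2 = (mlp(h) - 2 * mlp(0.0) + mlp(-h)) / h**2
```

For this smooth toy network the first-order difference quotient agrees with the exact derivative to high precision, while the second-order one already carries a visibly larger round-off error; for an analytical method, both orders are exact.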
Due to its simplicity and ubiquity in quantum physics, two-body scattering appears to be an adequate field for studying the adaptation of MLPs to perturbation theories. In fact, the present study is strongly motivated by the weight inspections performed in Ref. [11]: Considering the first hidden layer of MLPs trained to predict S-wave scattering lengths for shallow, attractive potentials of finite range, it is shown there that the weights among all of its active neurons satisfy a quadratic pattern w_nm ∝ m². This can be proven to reproduce the first-order Born approximation. As soon as MLPs are trained on successively deeper potentials, additional structures emerge within their weights, which are later identified with the second-order Born term. This qualitatively indicates that during training MLPs adapt to the Born series and, thus, develop a quantum perturbation theory. Applying the proposed graph-theoretical formalism, we complement these findings by a quantitative investigation of the dominating analytical structure of MLP ensembles that predict S-wave scattering lengths. Since we observe these MLPs to mainly adapt their derivatives to the leading order, we develop an iterative scheme that can be understood as an NN perturbation theory to successively obtain the remaining terms of the target function's Taylor expansion: At each iteration, the idea is to eliminate the leading order from the current targets in the training and test sets, which generates new data sets on which a new auxiliary ensemble of MLPs can be trained in the next iteration. A downside of this approach is that each iteration requires running an additional training pipeline. However, the dominating contributions of the auxiliary ensembles are significantly more faithful to the corresponding terms of the target function's Taylor expansion than a differentiation of a single, naively trained ensemble could provide. The first- and second-order Taylor coefficients found this way can be identified one-to-one with the first- and second-order Born terms, respectively. These two results are then combined into a machine-learned second-order Born approximation that performs well for shallow potentials. Finally, these findings indicate that our NN perturbation theory naturally translates into a perturbation theory for scattering lengths.
The manuscript is organized as follows: At first, Sec. II introduces the S-wave scattering length as a functional, briefly presents the first two variational derivatives and relates them to the Born series in quantum two-body scattering. When approximated by NNs, their sampled form suggests understanding the Born series as a Taylor series in the space of sampled potentials. In Sec. III, we then propose a graphically motivated approach to compute partial derivatives of arbitrarily deep MLPs in terms of propagators and vertices. Sec. IV builds on these findings, develops the NN perturbation theory we finally want to apply to MLPs trained on S-wave scattering lengths, and sheds light on the training pipeline as well as on architecture details. The first-order Born term is evaluated and discussed in Sec. V, followed by the investigation of the second-order Born term after one iteration in Sec. VI. We end with a discussion and outlook in Sec. VII. Various technicalities are relegated to the appendix.

II. THE BORN SERIES AS A TAYLOR SERIES
Let Y : R^{H_0} → R^{H_L} be an analytical function. The local behavior of Y in the vicinity of an arbitrary point x_0 ∈ R^{H_0} can be described by neglecting higher-order terms in the multidimensional Taylor expansion, Eq. (1). In the following, we interpret the components x_k as samples of an underlying continuous function f. In this context, higher dimensions H_0 correspond to higher sampling rates. Note that x and f become totally equivalent in the limit of an infinitely fine sampling rate, that is H_0 → ∞. In this case the given function Y can rather be understood as a functional of f. Consequently, the above expression generalizes to an expansion around a given function, and in this limit it is no longer the partial derivatives but the variational derivatives of Y that encode its local behavior. Sampling the latter yields partial derivatives and again reproduces the multidimensional Taylor expansion in Eq. (1). An example we study thoroughly in Sec. V and Sec. VI is the functional that maps an attractive, dimensionless potential U = 2µρ²V with finite range ρ to the corresponding dimensionless S-wave scattering length in units of ρ, see Eq. (2). Here, the quantities µ, V, T and G_0 denote the reduced mass of the two-body system, the dimensionful potential and the dimensionless T-matrix and resolvent, respectively. Eq. (2) not only contains the expansion of a_0 around the force-free case U = 0: It also displays the classical representation of the S-wave scattering length as the Born series and, therefore, suggests treating a_0 perturbatively for shallow potentials. The two lowest-order variational derivatives, given in Eq. (3), can then be used to compute the first- and, respectively, second-order Born approximation of a_0. Consequently, their sampled versions together with Eqs. (1) and (3) suggest understanding the Born series in the space of sampled potentials as a Taylor series. Thereby, each sampled potential U corresponds to an H_0-dimensional vector with components U_k = U(r = k/H_0). It is obvious that the discretization error becomes negligibly small for sufficiently high sampling rates H_0.
Now Y could serve as a target function that we try to imitate by an NN. In the context of the above example this means that the NN can successfully predict S-wave scattering lengths for sampled potentials after completing a supervised training procedure. According to the UAT, these predictions can be arbitrarily precise, provided that the given architecture contains sufficiently many neurons or is sufficiently deep. Nonetheless, the UAT in no way guarantees that the NN also reproduces the analytical properties, i.e. the target function's partial derivatives at each order. A pathological but obvious example are NNs with Heaviside activations: Here the derivatives at any order can only take the values 0 or ±∞, which can be realized as a superposition of delta functions. At this point the following questions arise: What conditions must the given architecture satisfy such that loss minimization during training also causes the partial derivatives of the MLP at any given order to approximate the corresponding derivatives of the target function? Or asked differently: If we are given a raw data set and do not know the analytical representation of the target function, how can we be sure that its analytical structure is reproduced by the trained NN? What is the benefit of having the NN approximate the analytical structure of the target function? And how can MLP and target function derivatives be compared with each other analytically in the first place? Of course, in order to comply with the assumed analyticity of the target function, the activation functions themselves need to be analytical. As many pooling layers like Max-Pooling or rectifiers like ReLU are not everywhere differentiable, this already excludes a wide range of conventional architectures. Also, note that this analyticity criterion is just a necessary and not a sufficient condition for the NN to approximate the analytical structure of the target function. To give an example, GELU is an analytical rectifier that serves as activation function later in this work. While its lower-order derivatives are unproblematic, higher-order derivatives vanish almost everywhere and diverge for a small range of inputs, which renders them highly unstable. Also, there is no ad-hoc guarantee that the NN approximates all lower-order derivatives of the target function equally well: If the training set only covers a narrow range of inputs around the expansion point, the NN may tend to approximate only the leading order. While we will observe this issue for the numerically stable first- and second-order derivatives of the Born series in Sec. V and VI, a further investigation for higher-order derivatives is beyond the scope of the present study and, therefore, left for future studies.
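As a concrete numerical illustration of treating a_0 perturbatively for shallow potentials, consider an attractive square well of dimensionless depth U_0 and unit range. The sketch below is our own check, assuming the standard closed-form result a_0 = 1 − tan(√U_0)/√U_0 and a sign convention in which the first Born term reads −∫_0^1 dx x² U_0 = −U_0/3; neither formula is taken from the manuscript itself:

```python
import numpy as np

# Attractive square well: U(x) = -U0 for 0 <= x < 1 (dimensionless units).
def a0_exact(U0):
    """Exact S-wave scattering length in units of the range (assumed convention)."""
    k = np.sqrt(U0)
    return 1.0 - np.tan(k) / k

def a0_born1(U0, H0=32):
    """First Born term evaluated on the sampled potential, ~ sum_k x_k^2 U_k / H0."""
    x = (np.arange(H0) + 0.5) / H0   # midpoint sampling of [0, 1]
    U = -U0 * np.ones(H0)            # sampled potential U_k
    return np.sum(x**2 * U) / H0
```

For U_0 = 10⁻³ the exact value and the first Born term −U_0/3 agree to roughly one part in a thousand, consistent with the claim that shallow potentials are well described by the leading term of the Born series; the sampled version at H_0 = 32 introduces only a small additional discretization error.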

III. PARTIAL DERIVATIVES OF MULTILAYER PERCEPTRONS
An MLP with L layers is the prototype example of a layered architecture and can be understood as a non-linear function Y : R^{H_0} → R^{H_L} between real vector spaces. The term "layered" describes that the computation proceeds through intermediate representations with components y^(l)_n. In MLPs, exclusively linear layers are used in combination with non-linear activation functions a^(l,n) : R → R, where a^(l,n) is applied to the n-th neuron of the l-th layer. This can be formulated recursively, with the recursive step y^(l)_n = a^(l,n)(Σ_m w^(l)_nm y^(l−1)_m + b^(l)_n) together with the weights w^(l)_nm and biases b^(l)_n. The recursion in Eq. (5) terminates for l = 1 due to reaching its base case y^(0)_m = x_m. For deep architectures, that is for L ≫ 1, Y is a strongly nested function, such that computing derivatives becomes an extremely difficult task because of a hardly manageable amount of chain- and product-rule applications. In fact, there is another field of machine learning in which it is well known how to efficiently compute first-order partial derivatives of a strongly nested function: Within gradient-descent-based training algorithms it is necessary to compute the gradient of a loss function, that is, an error function of the network Y and therefore nested to the same extent. Here, the first-order partial derivatives of the loss function with respect to any internal parameter can be expressed by a matrix product. This is the famous backpropagation, which significantly speeds up training steps by avoiding the naive application of chain and product rules, see Ref. [22].
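The recursion above fits in a few lines of code. A minimal NumPy sketch (the actual experiments in this work use PyTorch), with an identity-activated, bias-free output layer as assumed later in the text:

```python
import numpy as np

# MLP recursion y^(l)_n = a^(l,n)(sum_m w^(l)_nm y^(l-1)_m + b^(l)_n),
# base case y^(0) = x, and identity activation on the output layer.
def mlp_forward(x, weights, biases, act=np.tanh):
    y = x                                        # base case y^(0) = x
    for l, (w, b) in enumerate(zip(weights, biases), start=1):
        z = w @ y + b
        y = z if l == len(weights) else act(z)   # last layer: a^(L,n) = id
    return y

rng = np.random.default_rng(1)
ws = [rng.normal(size=(8, 4)), rng.normal(size=(8, 8)), rng.normal(size=(1, 8))]
bs = [rng.normal(size=8), rng.normal(size=8), np.zeros(1)]
out = mlp_forward(rng.normal(size=4), ws, bs)    # a 4 -> 8 -> 8 -> 1 MLP
```

Each additional layer nests the function once more, which is precisely why naive symbolic differentiation becomes unmanageable for L ≫ 1.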
In order to derive Taylor coefficients of Y of any order and for an arbitrary number L of layers in terms of the weights and biases, we desire a systematic description in the spirit of backpropagation. Let us therefore first define the quantity we refer to as the nm-th matrix element of the l-th layer propagator of order p. Since the last layer is usually activated via the identity, a^(L,n) = id, and has no bias, that is b^(L)_n = 0, the outputs can be redefined accordingly. This redefinition entirely describes outputs in terms of propagators and reduces the search for Taylor coefficients to computing partial derivatives of propagator matrix elements.
In Eq. (9) we make two observations: First, a differentiation increases the order of the propagator by one. Second, an additional factor carrying the indices m and k is introduced, which impacts higher-order derivatives of propagators. We introduce the tensor elements ∆^(l,p)_mk1…kp, which are obviously invariant under permutations of the indices k_1, …, k_p. Now we can express the N-th derivative of the propagator as the superposition in Eq. (11) by successively applying the rule mentioned in Eq. (9) and by absorbing the remaining derivatives via Eq. (10) (see App. A on p. 23). Each summand in Eq. (11) includes a higher-order propagator. Here, the second sum runs over the set of all partitions of the number N with length c, yielding the propagator derivatives ∂^N D^(l,p)_nm/(∂x_k1 … ∂x_kN) via Eq. (11). Note that the index permutation symmetry of the ∆^(l,p)_mk1…kp has been used to combine ε_π summands. Thus, there are only N!/ε_π remaining summands for each partition π.
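The combinatorics of these partition sums can be checked programmatically. In the sketch below we enumerate the integer partitions of N and compute the standard number of ways to distribute N derivative indices onto blocks of the given sizes; we assume this count coincides with the N!/ε_π summand count above. Summing it over all partitions must then reproduce the Bell numbers, as for any Faà-di-Bruno-type formula:

```python
import math

# All integer partitions of N as non-increasing tuples,
# e.g. partitions(3) = [(3,), (2, 1), (1, 1, 1)].
def partitions(N, max_part=None):
    if max_part is None:
        max_part = N
    if N == 0:
        return [()]
    out = []
    for p in range(min(N, max_part), 0, -1):
        for rest in partitions(N - p, p):
            out.append((p,) + rest)
    return out

# Number of ways to distribute N derivative indices k_1, ..., k_N onto blocks
# of the sizes in pi -- assumed to correspond to N!/epsilon_pi in the text.
def multiplicity(pi):
    count = math.factorial(sum(pi))
    for p in pi:
        count //= math.factorial(p)              # indices within one block
    for j in set(pi):
        count //= math.factorial(pi.count(j))    # identical blocks of equal size
    return count

# Consistency check: summing over all partitions of N yields the Bell numbers.
bell = [sum(multiplicity(pi) for pi in partitions(N)) for N in range(1, 5)]  # [1, 2, 5, 15]
```

For N = 4, for instance, the five partitions contribute 1 + 4 + 3 + 6 + 1 = 15 summands in total, matching the Bell number B_4.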
The set Π_N of all partitions is, thereby, simply given by the union of all Π^c_N. Lastly, the third sum runs over the permutation group S_N. For a given partition π ∈ Π^c_N, the factor 1/ε_π takes the symmetry of the tensor elements, ∆^(l,p)_mk_σ(1)…k_σ(p) = ∆^(l,p)_mk1…kp under index permutations σ, into account. Tab. I contains all partitions, respective ε_π and resulting propagator derivatives for N = 1, 2, 3, 4. What remains is to find an expression for the tensors ∆^(l,p)_mk1…kp in terms of propagators, such that we can completely determine the partial derivatives of the propagators in Eq. (11). We therefore introduce the mk-th matrix element of a vertex of order p in the l-th layer, acting as a weighted sum over elements of arbitrary tensors f. If l − 1 ≤ p, the vertex becomes saturated, that is, it becomes a constant and is equal to any higher-order vertex in the same layer. Because of this and due to Eq. (9), vertices display the behavior given in Eq. (15) when exposed to a partial derivative. Obviously, vertices of order p in the l-th layer only commute with partial derivatives if they are saturated. This is embodied by the proportionality of the commutator to the step function with Θ(0) = 0. Note that we have expressed the derivative factor as a vertex of order zero in order to arrive at Eq. (16), for which we choose the graphical representation of a single vertex. Applying Eq. (15) to Eq. (16) yields Eq. (17). This term depends on a first-order vertex that sums over a second-order propagator and a zeroth-order vertex, which suggests the graphical representation of two vertices that are connected via a propagator with one arrow head, directed from the first to the second vertex. Note that the index permutation symmetry of the ∆^(l,p)_mk1…kp is inherited by the right-hand side of Eq. (17). Applying Eq. (15) once more to Eq. (17) yields Eq. (18), which is a superposition of three different terms, to each of which we can assign a different graph. Note that the second term only contains one propagator of third order instead of two second-order propagators as in both other terms. Given a vertex, we decide to enumerate outgoing propagators by the number of their arrow heads. If there are n outgoing edges with the same number of arrow heads, as is the case in the second term with n = 2, they represent the same propagator of order n + 1.
It would not be of much use to continue with successively deriving higher-order tensors as above. The general idea of how the ∆^(l,N)_mk1…kN are structured and how the individual terms can be translated into graphs should be clear: • ∆^(l,N)_mk1…kN can be represented by a sum of directed and rooted trees, that is, arborescences, see Ref. [23]. Each arborescence consists of N vertices and up to N − 1 propagators.
• At each vertex, outgoing propagators are counted by the number of arrow heads on the respective edges. Therefore, an edge with n arrow heads belongs to the n-th propagator originating at that vertex.
• If there are n edges with the same number of arrow heads originating in a given vertex, these represent a propagator of order n + 1 originating in that vertex. Consequently, this propagator establishes connections to n other vertices.
• A term of the corresponding product structure indicates that it is the b-th propagator originating in the k_x1-th vertex that establishes a connection to the k_x2-th, …, k_x(n−1)-th and k_xn-th vertex. This implies that this propagator is of order n. If no propagator originates in a vertex, that vertex is called a leaf of the given arborescence.
• Derivatives of saturated vertices vanish, as seen in Eq. (15). Thus, whether a certain arborescence contributes or not depends on the layer l for which propagators and vertices are considered. This is embodied by multiplying a factor Θ(l − α) to each arborescence. The given arborescence only contributes if l overshoots its saturation threshold, which we denote by α. The appearance of internal vertices and of propagators of order three or higher decreases α.
In order to pursue a more systematic approach, we want to understand an arborescence in terms of an adjacency matrix A ∈ N_0^{N×N} containing information about which of the N vertices are connected by which propagators. At this point, it is useful to recall the propagator numbering based on arrow heads, which has been introduced above: If A_ij = 0, there is no connection from the i-th to the j-th vertex. Otherwise, it is the A_ij-th propagator originating in the i-th vertex that establishes this connection. Alternatively, A_ij can be understood as the number of arrow heads on the edge directed from the i-th to the j-th vertex. Due to orientation, each allowed adjacency matrix is an upper triangular matrix with a vanishing main diagonal. Since all vertices but the first one have exactly one incoming propagator, there is exactly one non-zero entry in each but the first column of A. Denoting by T^K_N the set of such triangular matrices over K, the set A_N of all allowed adjacency matrices for N vertices is a subset of T^{N_0}_N. TABLE II. All arborescences and corresponding adjacency matrices A ∈ A_N as well as saturation thresholds α(A) for N = 1, 2, 3, 4 vertices. In the way arborescences are represented here, it is always the vertex k_1 that serves as root.
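The backbone of the set A_N, i.e. where the single non-zero entry of each but the first column sits, can be enumerated directly. The following sketch only fixes these positions and omits the assignment of the arrow-head labels A_ij that number the propagators per vertex; with that restriction, the admissible parent structures number (N − 1)!:

```python
import itertools
import numpy as np

# Enumerate backbones of allowed adjacency matrices: N x N, upper triangular,
# vanishing main diagonal, exactly one non-zero entry in each but the first
# column. Only the POSITION of that entry is fixed; assigning the arrow-head
# labels A_ij is omitted in this sketch.
def backbones(N):
    mats = []
    # column j (0-based, j >= 1) receives its unique incoming edge from row i < j
    for parents in itertools.product(*[range(j) for j in range(1, N)]):
        A = np.zeros((N, N), dtype=int)
        for j, i in enumerate(parents, start=1):
            A[i, j] = 1
        mats.append(A)
    return mats

counts = [len(backbones(N)) for N in range(1, 5)]  # (N-1)! = [1, 1, 2, 6]
```

Every matrix produced this way is strictly upper triangular, in line with the orientation constraint stated above; the full set A_N additionally distinguishes the propagator labels per vertex.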
From here, we can deduce that the maximum saturation threshold among all arborescences with N vertices is N − 1.
Counting the appearances of A_ij > 0 in the i-th line of a given adjacency matrix A ∈ A_N determines the order of the respective propagator: If A_ij appears n − 1 times in the i-th line, the corresponding propagator is of order n. Note that the appearance of a propagator of order n decreases the saturation threshold α(A) by n − 2. Another source that leads to smaller α(A) are internal vertices: α(A) is decreased by 1 for each internal vertex or, in terms of adjacency matrices, for each non-zero line but the first one. This behavior is entirely described by the simple expression for α(A) in Eq. (20). If l reaches or undershoots α(A), the corresponding arborescence is saturated and does not contribute to ∆^(l,N)_mk1…kN. Tab. II lists example arborescences and corresponding adjacency matrices A as well as saturation thresholds α(A). Note that there may be several allowed adjacency matrices for one arborescence due to index permutation symmetry. The only thing left for expressing ∆^(l,N)_mk1…kN solely in terms of propagators and biases is an analytical representation δ^(l,N)_mk1…kN(A) of an individual arborescence with N vertices for a given adjacency matrix A. For its formulation, we use the function in Eq. (21) that determines the line in which the entry in the c-th column is non-zero. That means it provides the unique vertex from which a propagator leads to the c-th vertex. Since there is no antecedent propagator to the root of an arborescence, Eq. (21) vanishes for c = 1. Using Eq. (21), suitable abbreviations, and the previous observations, we arrive at Eq. (22). Eq. (22) may appear very intimidating at first sight, but its individual terms are easy to interpret and to recognize in the summands of Eqs. (16), (17) and (18): • The expression max_r(A_cr) corresponds to the number of all propagators originating in the c-th vertex. Thus, all these propagators are collected by the corresponding product. The order of the b-th propagator originating in the c-th vertex is equal to 1 plus the total number n_bc(A) of appearances of the entry b in the c-th line of A. If the c-th vertex is a leaf of the arborescence, that is, if it is external such that there are no outgoing propagators and max_r(A_cr) = 0, the product can be neglected.
• The c-th vertex of the arborescence is denoted by the corresponding vertex factor in Eq. (22). For each of the max_r(A_cr) propagators that originate in the c-th vertex, there is a summation index. It is the β_c(A)-th vertex whose A_βc(A)c-th propagator leads to the c-th vertex. This is the reason why we sum over the (q_c(A), k_c)-th matrix element of the c-th vertex in the j_c(A)-th layer.
• All N vertices of the entire arborescence are collected by the product over these vertex factors.
Using Eq. (22) and taking the corresponding saturation thresholds given in Eq. (20) into account, we can represent ∆^(l,N)_mk1…kN as a sum over all adjacency matrices A ∈ A_N (see App. A on p. 25), Eq. (23). Finally, Eq. (23) can be inserted into Eq. (11), which is required for computing the Taylor coefficients of Y as shown in Eq. (8). Considering Eq. (11), it is important to note that none of the indices k_1, …, k_N is shared among several factors ∆^(l,π_i). Therefore, each summand of ∂^N D^(l,p)_nm/(∂x_k1 … ∂x_kN) consists of arborescences that each contain different vertices. The effective saturation threshold of a product of several ∆^(l,π_i) is, thereby, simply the maximal saturation threshold among all given factors. The Taylor coefficients up to third order then follow as Eqs. (24) to (26). Let I_j = {i^j_1, …, i^j_yj} ⊆ {1, …, N} with j ∈ {1, …, p} be pairwise disjoint index sets such that I_1 ∪ … ∪ I_p = {1, …, N} and I_j1 ∩ I_j2 = ∅ for j_1 ≠ j_2. This implies y_1 + … + y_p = N. Given these indices, the short-hand notation in Eq. (27) proves useful for the Taylor series, since it helps to combine p arborescences covering N vertices into one disconnected graph with p connected components: The resulting graph is only connected for p = 1. Using the Taylor coefficients from Eqs. (24) to (26) and the short-hand notation from Eq. (27) finally yields an interesting diagrammatic series representation of Y, Eq. (28), which is ordered by the number of connected components and in which each diagram is weighted by a combinatorial factor and a step function Θ(L − α) of its saturation threshold.
Graphs that contain up to N vertices contribute to the N-th Taylor approximation of Y. The deeper Y, that is, the larger L, the more graphs contribute to a given order of the Taylor expansion due to overshooting the corresponding saturation threshold.

IV. NN PERTURBATION THEORY APPLIED TO SCATTERING LENGTHS
Depending on the analytical structure of the target function, there may be a difference of several orders of magnitude between the Taylor coefficients of different orders. For example, if the first-order Taylor coefficients dominate and if all training samples are closely distributed around the expansion point, then a trained MLP basically applies a linear approximation to imitate the target function. As a supervisor, one does not gain any insights on higher-order terms in such cases, since these presumably are not faithful to the derivatives of the target function. As can be seen in Eq. (3), the first- and second-order Taylor coefficients of the sampled Born series vary by a factor 1/H_0² ≈ 10⁻³. The sampling rate H_0 must be sufficiently large such that the discretization error becomes negligible, which we assume to be the case for H_0 = 32, as used in the following analysis. For the specific application, this implies that we cannot expect to recover both the first- and the second-order Born term from a naively trained MLP or ensemble. Therefore, we propose an iterative scheme to gain information on Taylor coefficients of successively rising order. Given a training set T^(0)_1 and test set T^(0)_2 and assuming we do not train single MLPs but ensembles of several MLPs, the i-th iteration contains the following steps:

1. Initialize an ensemble Y^(i) of N_i MLPs. The output of the ensemble is simply the mean of the individual member outputs.

2. Train the ensemble Y^(i) on the training set T^(i−1)_1 and validate it later using the test set T^(i−1)_2.

3. Compute the leading, i-th-order term of the ensemble's Taylor expansion around the expansion point x_0, averaging over the N_i members. As this term is of leading order, the corresponding i-th-order derivatives and, therefore, Taylor coefficients can be assumed to be faithful to the analytical structure of the target function.

4. Eliminate this leading-order term from the targets to generate new training and test sets T^(i)_1 and T^(i)_2, respectively. If necessary, the targets must be normalized or standardized again.
At the cost of rerunning the training pipeline for each iteration anew, we expect this procedure to yield faithful first- and second-order Taylor coefficients in the case of S-wave scattering lengths. Note that such an iterative approach is anything but unnatural: it is really just the central idea of perturbation theory.
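The essence of the scheme can be demonstrated on a one-dimensional toy problem, with simple least-squares fits standing in for the MLP ensembles (a deliberate simplification of the actual pipeline). The leading linear order is fitted and subtracted, after which the next-to-leading quadratic order becomes cleanly learnable:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy target with a dominant linear and a small quadratic term near x0 = 0
# (coefficients chosen to mimic a shallow-well expansion -x/3 - 2x^2/15).
x = rng.uniform(-0.1, 0.1, size=2000)
y = -x / 3 - 2 * x**2 / 15

# Iteration 1: fit only the leading (linear) order through the expansion point.
c1 = np.sum(x * y) / np.sum(x * x)        # least-squares slope, no intercept

# Eliminate the leading order from the targets ...
y_res = y - c1 * x

# Iteration 2: ... so the auxiliary fit resolves the next-to-leading order.
c2 = np.sum(x**2 * y_res) / np.sum(x**4)
```

A direct quadratic fit of the raw targets would be dominated by the linear term; subtracting it first lets the second fit recover the small quadratic coefficient faithfully, which is exactly the role of the auxiliary ensembles above.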
At first we generate a training and a test set of shallow potentials without any bound states. The scattering lengths for sampled potentials are derived using the Transfer Matrix Method, see Ref. [24], and are uniformly distributed between the boundaries a_0 = −1 and a_0 = 0. The training and test sets contain |T^(0)_1| and |T^(0)_2| samples, respectively. Training and validation of the ensembles at each iteration are performed in PyTorch [25]. At first, we initialize an ensemble Y^(1) of N_1 = 10² MLPs, in which each but the output layer is activated via the GELU activation function [26]. GELU is smooth at the origin, in contrast to other rectifiers like ReLU. Being a rectifier, it bypasses the vanishing-gradients problem, which makes it particularly interesting for deeper architectures, see Ref. [27]. The weights and biases are initialized using He initialization [28]. Apart from the requirement of being a GELU-activated MLP, we allow various numbers of layers and of units per hidden layer, learning rates η^(i)_n and weight decays λ^(i)_n: The former two are uniformly distributed random integers in the intervals [3, 10] and [16, 256], respectively. The learning rates η^(i)_n are random floats and follow an epoch-dependent schedule with the weight decay λ^(i)_n; the epoch index ranges from 1 to 20. We decide to use the Mean Absolute Percentage Error (MAPE) as loss function, where one expression is used to evaluate the loss of a single member n during training, while the other corresponds to the MAPE of the entire ensemble. Processing these losses and computing corresponding weight updates with the Adam optimizer, see Refs. [29] and [30], we perform mini-batch learning with batch size B = 128.
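Parts of this pipeline can be sketched compactly. Below, the MAPE loss and the hyperparameter draws are reproduced in NumPy; the integer ranges for the layer and unit counts follow the text, while the log-uniform distributions and ranges of the learning rate and weight decay are our own assumptions, since the exact schedule is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(3)

# MAPE as used for members and for the whole ensemble (targets assumed non-zero).
def mape(pred, target):
    return 100.0 * np.mean(np.abs((pred - target) / target))

# One draw of member hyperparameters. Integer ranges follow the text; the
# log-uniform distributions of learning rate and weight decay are assumptions.
def draw_hyperparameters():
    return {
        "layers": int(rng.integers(3, 11)),           # uniform integer in [3, 10]
        "units": int(rng.integers(16, 257)),          # uniform integer in [16, 256]
        "lr": 10.0 ** rng.uniform(-4, -2),            # assumed range
        "weight_decay": 10.0 ** rng.uniform(-6, -3),  # assumed range
    }

hp = draw_hyperparameters()
err = mape(np.array([-0.99, -0.505]), np.array([-1.0, -0.5]))  # ~1.0 percent
```

Note that the MAPE is well defined here because all targets lie strictly between a_0 = −1 and a_0 = 0 and never vanish.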
When it comes to the Taylor decomposition of the ensembles Y^(i), it is convenient to choose the same expansion point, that is U = 0, as we have already seen for a_0 in Sec. II. We note that the scattering length a_0(U) = 0 vanishes in the case with no interaction. Therefore, we expect the dominating term of the Taylor series of Y^(1) not to be the first, constant term, but the second summand, containing the first-order derivatives of Y^(1). This motivates us to skip one iteration and to directly perform the subtraction in order to compute samples (U, a_0(U)) ∈ T^(1)_{1/2} of the successive data sets. We expect these new targets to have a vanishing constant and linear contribution and, therefore, to behave mainly like (1/2) U^T K U with the Hessian K ∈ R^{H_0 × H_0}. Since we are already satisfied with these first two orders, we stop after training the auxiliary ensemble Y^(2) on T^(1)_1, using the same training pipeline as before, and do not perform further iterations beyond that.
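The subtraction step defining the successive targets can be made concrete with a short autograd sketch. Assuming the ensemble prediction is the mean over its members, the new targets are a_0(U) minus the constant and linear Taylor terms of Y^(1) at U = 0; `ensemble_mean` and `next_targets` are hypothetical helper names.

```python
import torch

def ensemble_mean(models, U):
    # Mean prediction over all ensemble members at input U.
    return torch.stack([m(U) for m in models]).mean(dim=0)

def next_targets(models, U_batch, a0_batch):
    """Targets of T^(1): a0(U) - Y(0) - grad Y(0) . U for each sample."""
    U0 = torch.zeros(U_batch.shape[1], requires_grad=True)
    y0 = ensemble_mean(models, U0).squeeze()
    (grad,) = torch.autograd.grad(y0, U0)   # first-order Taylor coefficients
    linear = U_batch @ grad                 # linear Taylor term at U = 0
    return a0_batch - y0.detach() - linear.detach()
```

For a purely linear target function these new targets vanish identically, which is exactly the point: only the quadratic and higher contributions survive.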
Both ensembles perform sufficiently well on their respective test sets, which follows from their low MAPEs of L_{Y^(1),T^(0)_2} = 0.152% and L_{Y^(2),T^(1)_2} = 0.438%. Note that Y^(2), as well as its individual members, exhibits a slightly worse performance than the members of Y^(1), cf. Fig. 1. Since both ensembles draw their hyperparameters from the same probability distributions, the task of adapting to a dominating, quadratic relation between inputs and targets appears to be more challenging than learning a constant or linear relation.

FIG. 1. Distribution of the MAPEs L_{Y_n}/% among all members of the ensembles Y^(1) and Y^(2) with respect to the corresponding test sets T^(0)_2 and T^(1)_2. We see that members of the first ensemble perform slightly better, which also leads to a better MAPE L_{Y^(1),T^(0)_2} for the entire ensemble Y^(1). Since all hyperparameters are drawn from the same distributions for both ensembles, it appears that it is an easier task to learn a linear than a quadratic relation.

V. FIRST-ORDER BORN TERM
Given the first ensemble Y^(1), we first verify that its members have, indeed, adapted to a vanishing axis intercept. This is an important performance requirement, as the scattering length vanishes in the force-free case U = 0, which we choose as an expansion point for our proxy model. Deriving errors of ensemble-related quantities by computing the standard deviation of that quantity among all members, we find that Y^(1)(0) takes a small value compared to the range of all targets in T^(0)_{1/2}; since the corresponding error even has a slightly larger magnitude, we can confirm a vanishing axis intercept for the first ensemble. The next step is crucial, not only to this iteration but also to the success of the following one: We compute the first-order Taylor coefficients of the ensemble Y^(1) using Eq. (24) and the formalism provided in Sec. III. This reveals its dominating, linear contribution in the space of sampled potentials. Ideally, the ensemble would reproduce the linear contribution in Eq. (3) of the scattering length one-to-one, which would then imply a fixed relation for the Taylor coefficients. The superscript Y^(1)_n points out that the respective quantity is computed for the weights and biases of the ensemble member Y_n. In order to evaluate how well the left-hand and right-hand sides actually match, we fit the model α k_1^2 to the H_0 = 32 ensemble Taylor coefficients on the left-hand side. Considering the mean and standard deviation of the distribution of all fitting parameters among the members of Y^(1) yields the quoted value of α. The ensemble's Taylor coefficients are displayed together with the fitted curve and the values k_1^2/(H_0)^3 in Fig. 2. We note that the deviation of the fitting parameter α from the value 1/(H_0)^3 = 3.052 × 10^{-5} is just slightly larger than 1σ. This shows that the ensemble Y^(1) reproduces the first-order Born approximation sufficiently well and, thereby, predicts S-wave scattering lengths for shallow potentials.

FIG. 2. First-order Taylor coefficients of the ensemble Y^(1), values of the model α k_1^2 fitted over these coefficients, and first-order Taylor coefficients k_1^2/(H_0)^3 of the sampled Born series over the index k_1. As α deviates just slightly more than 1σ from 1/(H_0)^3, this ensemble, indeed, applies the first-order Born approximation to shallow potentials in order to predict S-wave scattering lengths.
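Both steps, extracting the first-order Taylor coefficients at U = 0 and fitting the one-parameter model α k^2, have simple numerical counterparts; the helper names below are our own, and plain autograd serves here only as a cross-check of the diagrammatic result.

```python
import numpy as np
import torch

H0 = 32

def taylor_gradient(model, dim=H0):
    """First-order Taylor coefficients dY/dU_k at the expansion point U = 0."""
    U0 = torch.zeros(dim, requires_grad=True)
    (grad,) = torch.autograd.grad(model(U0).squeeze(), U0)
    return grad.detach().numpy()

def fit_alpha(coeffs):
    """Closed-form least-squares solution of coeffs_k ≈ alpha * k^2, k = 1..H0."""
    k2 = np.arange(1, len(coeffs) + 1) ** 2
    return float(k2 @ coeffs / (k2 @ k2))

# Ideal Born coefficients k^2 / H0^3 are fitted exactly by alpha = 1 / H0^3.
ideal = np.arange(1, H0 + 1) ** 2 / H0 ** 3
```

The one-parameter fit is linear in α, so the closed-form projection above is equivalent to an ordinary least-squares fit.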

VI. SECOND-ORDER BORN TERM
Having identified and analyzed the linear contribution of the first ensemble Y^(1), it is time to move over to the successive data sets T^(1)_{1/2}, with their targets derived according to Eq. (31), and to train the auxiliary ensemble Y^(2). As argued in Sec. IV, it is the quadratic contribution, based on the Hessian, which dominates these new targets. Similar to the investigation of the first-order Taylor coefficients in Sec. V, we can specify an ideal case in which Y^(2) would reproduce the second-order Born term one-to-one, namely if its second-order Taylor coefficients satisfy the corresponding relation. In order to investigate how close the auxiliary ensemble actually approximates this ideal case, we fit a one-parameter model with fitting parameter β. If we observe the fitting parameter β to closely approach the value −1/(H_0)^5, the auxiliary ensemble has, indeed, adapted to the second-order Born term.
To justify our perturbative ansatz, we not only compute the Hessian of the auxiliary ensemble Y^(2), but also consider the Hessian of the ensemble Y^(1) from the previous iteration. The latter can be expected to be significantly less faithful to the second-order Born term, since the quadratic contribution to scattering lengths of shallow potentials is lower than that of the linear term from Sec. V and, thus, has not been prioritized during training. Both Hessians and the Hessian of the Born series, cf. Eq. (3), are shown in Fig. 3. For Y^(2) we find the fitting parameter β = (−2.25 ± 1.63) × 10^{-8}. Indeed, the deviation from −1/(H_0)^5 is less than 1σ. But at this point, we also notice the unfortunately large error, which may be explained by the slightly weaker performance of the auxiliary ensemble. Moreover, interestingly, we can observe a very distinct diagonal in both ensemble Hessians, which does not appear in the second-order Born term. Since weight decay and ensembling have a regularizing influence on the resulting predictions, we can exclude overfitting as the cause. We, therefore, conjecture that these diagonals are artifacts that might disappear when using other, more capable architectures than MLPs. Apart from that, Fig. 3b) shows that the auxiliary ensemble Y^(2) mostly reproduces the desired behavior. And a glimpse at Fig. 3a) lets us surmise that even the ensemble Y^(1) very roughly reproduces the shape of the second-order Born term. Nonetheless, by moving to the auxiliary ensemble, much of the noise attached to the coefficients in Fig. 3a) is heavily reduced and the desired shape of the Hessian displayed in Fig. 3c) is much better approximated. In conclusion, we consider the Hessian of the auxiliary ensemble to be suitable for constructing an S-wave scattering length proxy.
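The Hessians entering Fig. 3 can be obtained with plain autograd, e.g. via `torch.autograd.functional.hessian` evaluated at the expansion point U = 0; the wrapper below is a sketch under our own naming.

```python
import torch
from torch.autograd.functional import hessian

def taylor_hessian(model, dim):
    """Second-order Taylor coefficients d^2 Y / (dU_j dU_k) at U = 0."""
    return hessian(lambda U: model(U).squeeze(), torch.zeros(dim))
```

For an ensemble, one would average the member Hessians and take the standard deviation among members as the error estimate, in line with the error convention of Sec. V.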
In completing the second iteration, we can now use the gradient and the vanishing axis intercept of the first ensemble Y^(1) and the Hessian of the auxiliary ensemble Y^(2) to construct a much simpler proxy p_0, cf. Eq. (36). The proxy p_0 can be understood as a machine-learned second-order Born approximation. In order to examine its range of validity, we proceed as follows: At first we randomly generate two different potential shapes n_i ∈ R^{H_0}. For both of these shapes we generate a set of 100 equidistant potentials U_i = U_i n_i with magnitudes U_i ∈ [0, . . ., 5]. For each of these potentials U, the above machine-learned Born approximation p_0(U) and the true scattering length a_0(U) are evaluated and plotted in Fig. 4. We observe that the relative error between p_0(U) and a_0(U) is less than 3% for shallow potentials with U ≲ 1. Beyond that regime, additional higher-order terms must be introduced to the proxy to make better scattering length predictions.
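Assembling the proxy from the learned Taylor data reduces to a few lines; the intercept is set to zero as verified in Sec. V, and `make_proxy` is our own name for this assembly.

```python
import numpy as np

def make_proxy(g: np.ndarray, K: np.ndarray):
    """p0(U) = g . U + (1/2) U^T K U, a machine-learned 2nd-order Born term."""
    def p0(U: np.ndarray) -> float:
        return float(U @ g + 0.5 * U @ K @ U)
    return p0
```

Here g would be the ensemble-averaged gradient of Y^(1) at U = 0 and K the Hessian of the auxiliary ensemble Y^(2).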

VII. DISCUSSION AND OUTLOOK
In this manuscript we propose a neural network perturbation theory for MLPs. This allows us to construct much simpler proxies of the original target function. The key idea of this perturbation theory is the successive identification and elimination of the leading-order contribution to the ensemble's Taylor decomposition. This establishes a sequence of ensembles that each specialize in approximating consecutive orders of the target function's Taylor decomposition. Combining these accordingly can, thus, be viewed as the first step of a proxy method, see Ref. [20].
Especially when dealing with deep MLPs, the computation of higher-order Taylor coefficients can be a challenging task. Nonetheless, we manage to obtain an analytical expression for partial derivatives of any order for arbitrarily deep, analytically activated MLPs in terms of propagators and vertices. The underlying formalism is motivated by Feynman diagrams in quantum field theories, and the entailed graph-theoretical approach makes the underlying combinatorics significantly more systematic and manageable. Note that the graphical representation of the derivatives does not depend on the particular choice of an MLP: Indeed, the calculation of propagators, vertices and saturation thresholds themselves may be altered by varying weights, biases, activation functions, numbers of neurons and hidden layers. However, there is no way to uniquely infer more information about the architecture from the mere structure of the contributing arborescences.
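Although the diagrammatic rules are derived analytically, any individual coefficient can be cross-checked numerically by nesting autograd calls; the sketch below computes a third-order partial derivative at x = 0 under our own helper name and is not part of the formalism itself.

```python
import torch

def third_derivative(model, dim, i, j, k):
    """Numerical d^3 Y / (dx_i dx_j dx_k) at x = 0 via nested autograd."""
    x = torch.zeros(dim, requires_grad=True)
    y = model(x).squeeze()
    (g,) = torch.autograd.grad(y, x, create_graph=True)   # first order
    (h,) = torch.autograd.grad(g[i], x, create_graph=True)  # second order
    (t,) = torch.autograd.grad(h[j], x)                   # third order
    return t[k].item()
```

This makes it straightforward to validate the propagator-and-vertex expressions order by order on small test networks.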
We apply this graphical formalism and neural network perturbation theory to S-wave scattering lengths of shallow potentials. For this, we train two ensembles within the same number of iterations. Using the axis intercept and gradient of the first ensemble and the Hessian from the second, auxiliary ensemble yields a proxy of the Born series, which reproduces the second-order Born approximation for shallow potentials. At this point one could, of course, argue that it would have been much more convenient to simply train an NN Y: R^32 → R that is just the sum of one linear layer l and one bilinear layer B, that is Y(U) = l(U) + B(U, U). Using this architecture instead of deep MLPs would not only reduce the computational effort significantly, but would also have imposed some desired properties like Y(0) = 0 and simultaneously learning the first- and second-order Born terms. Note that Y in this case is a second-order Taylor approximation by itself, which allows one to directly read off Taylor coefficients instead of deriving them first, as performed in our analysis. However, such an NN is not a universal approximator, as it violates the premises of the UAT, and will therefore fail in reproducing scattering lengths for deeper potentials than in this analysis. This is because models of this architecture are unable to adapt to higher-order terms of the Born series. Therefore, using such an architecture may indeed simplify the analysis, but must be well justified for the particular case. Note that the obvious next step of this analysis would be the interpretation of the constructed proxy in the space of all sampled potentials. In doing so, we would just gather a post-hoc interpretation, based on approximations and thus deviations from actual scattering lengths. In this case, prediction and interpretation therefore have to be understood as two independent instances. In recent years there have been many efforts to close the gap between prediction and interpretation by ad-hoc interpretation methods. These exemplarily involve
training NNs whose architectures are either intrinsically interpretable or can be brought into an interpretable representation, see Ref. [20]. At the cost of a prediction-interpretation tradeoff, the advantage of ad-hoc methods is that the resulting interpretations are completely faithful to the NN's prediction, in contrast to the mentioned post-hoc methods.

Theorem 1. The first-order partial derivative of the nm-th matrix element of the l-th layer propagator D^(l,p)_{nm} of order p is given by Eq. (9), where we have introduced the matrix elements ∆^(l,1)_{mk}.

Proof. First of all, it is easy to see that the derivative with respect to the k-th component x_k of the input is proportional to a propagator of higher order p + 1. Due to the chain rule, a derivative with respect to the previous layer's output appears. By inserting the recursive step from Eq. (5), this dependency can be shifted to the previous layer. In the same manner, we can apply the chain rule successively to all antecedent layers until the base y^(0)_{q_0} = x_{q_0} is reached. Thereby, each layer provides a matrix multiplication with a first-order propagator: D^(l-2,1)_{q_{l-1} q_{l-2}} . . .
Rearranging those propagators and sums finally yields the claimed expression.

Theorem 2 (Eq. (11)). The N-th derivative of the propagator D^(l,p)_{nm} is given by Eq. (11).
Proof. We prove Eq. (11) by complete induction. First, we quickly convince ourselves of its validity in the base case N = 1 with the corresponding factor ε_π. This, indeed, corresponds to Eq. (9), which is also the defining equation for ∆^(l,1)_{mk_1}. Subsequently, the inductive step involves evaluating the derivative of the N-th order expression with respect to x_{k_{N+1}}.
By applying the product rule, we either encounter propagator derivatives or derivatives of tensor elements, which we split into two distinct sums.
Up to the N-th summand in the first sum and the first summand in the second sum, all other summands can be combined into one sum from c = 2 to c = N. Due to the differentiation, a new vertex is introduced to each summand. However, note that the way this new vertex is connected to the given vertices differs in both terms: In the first summand there now appears an additional propagator of second order establishing a connection to the (N+1)-th vertex, while in the remaining max_r(A_{cr}) summands, the order of the b-th propagator is raised by one, which also allows an additional connection to the new vertex. Let us define the set ν_c(A), which is a subset of A_{N+1}. Its |ν_c(A)| = max_r(A_{cr}) + 1 elements correspond to adjacency matrices of N-vertex arborescences that have been extended by an (N+1)-th vertex, which is connected to the c-th vertex. For the element with b = max_r(A_{cr}) + 1, this corresponds to establishing a connection via an additional propagator, which is consequently of second order. Else, we have b ∈ {1, . . ., max_r(A_{cr})}, which corresponds to raising the order of the b-th propagator in the c-th vertex and thereby allows being connected with the new vertex.
The observations made above can be formulated in the language of adjacency matrices: In terms of (N+1) × (N+1) adjacency matrices of the set ν_c(A), the commutator in Eq. (39) is given by a sum over the elements of ν_c(A).
Each element of ν_c(A) is represented in this sum, which implies that each possible connection from the c-th vertex to the (N+1)-th vertex is established. Note that the first summand, which introduces a new propagator, is the only term that may alter the saturation threshold of the arborescence, namely in the case that max_r(A_{cr}) ≥ α(A). Therefore, we write the derivative of the arborescence as in Eq. (43). It is very insightful to analyze Eq. (43): We already know that the i-th commutator [∂/∂x_{k_{N+1}}, B_i(A)] is a sum of max_r(A_{ir}) + 1 terms and corresponds to establishing a connection from the i-th vertex to the (N+1)-th vertex, either by introducing a new propagator or by raising the order of an already existing propagator by one. However, Eq. (43) is a sum of N terms, with the i-th summand containing the i-th commutator. This means that all allowed connections from all of the given N vertices to the (N+1)-th vertex are covered here.
As the i-th commutator leaves other vertices unaltered, we can write, for a single arborescence, using Eq. (42), a sum over the union of the sets ν_c(A). As the introduction of a new vertex to a given arborescence only influences the corresponding adjacency matrix by appending a new line and column, but leaves the original adjacency matrix unaltered, this union is disjoint.

TABLE I. All partitions π ∈ Π_N and corresponding factors ε_π for N = 1, 2, 3, 4 required for computing the propagator derivatives ∂^N D^(l,p).


FIG. 3. Second-order Taylor coefficients of the ensembles Y^(1) (Fig. a), Y^(2) (Fig. b) and the Hessian of the Born series (Fig. c). Both ensembles reproduce the basic behavior of the Hessian displayed in Fig. c). Since adapting to the second-order Born term has not been prioritized during the training of Y^(1), the resulting elements in Fig. a) are noisy and very large values appear in the bottom right corner. In contrast to Y^(1), the auxiliary ensemble Y^(2) has adapted to the second-order Born term much better. The diagonals appearing in both ensemble Hessians are presumably an artifact that could be eliminated using other, more capable architectures than MLPs.

FIG. 4. Scattering lengths a_0(U) and corresponding scattering length predictions p_0(U) by the proxy from Eq. (36) for 200 different, sampled potentials that take one of two randomly generated shapes. Solid (dashed) lines are related to potentials of the first (second) randomly generated shape. The relative error between p_0(U) and a_0(U) is less than 3% for shallow potentials with U ≲ 1. Introducing successively higher-order terms to the proxy would provide better predictions and would, consequently, increase the range of validity.

We can express the (N+1)-th vertex as a term B_{N+1}(A′) with A′ ∈ ν_c(A), due to the (N+1)-th line containing only zeros, thus max_r(A′_{N+1,r}) = 0, and due to β_{N+1}(A′) = c as well as A′_{β_{N+1}(A′),N+1} = b.