Scattering with Neural Operators

Recent advances in machine learning establish the ability of certain neural-network architectures called neural operators to approximate maps between function spaces. Motivated by a prospect of employing them in fundamental physics, we examine applications to scattering processes in quantum mechanics. We use an iterated variant of Fourier neural operators to learn the physics of Schr\"odinger operators, which map from the space of initial wave functions and potentials to the final wave functions. These deep operator learning ideas are put to test in two concrete problems: a neural operator predicting the time evolution of a wave packet scattering off a central potential in $1+1$ dimensions, and the double-slit experiment in $2+1$ dimensions. At inference, neural operators can become orders of magnitude more efficient compared to traditional finite-difference solvers.

Recent advances in machine learning demonstrated the ability of neural networks to approximate not only functions, but also non-linear operators, using various architectures collectively known as neural operators [1][2][3][4][5].In this work, we ask whether they can serve as a practical computational tool in fundamental physics. 1 This question is motivated in part by the exploding complexity of perturbative quantum field theory computations needed for precision predictions at the Large Hadron Collider [9,10], calling for rethinking how to represent and compute S-matrix elements efficiently.
This investigation has to start somewhere and here we consider arguably the simplest scattering operator in (d + 1)-dimensional quantum mechanics, which acts on the initial position-space wave function Ψ(⃗ x, 0) to produce the final one Ψ(⃗ x, T ) at some fixed time T : Here, (⃗ x, t) are the space-time coordinates, T is the timeordering operator, and Ĥ = − ℏ 2 2m ∇ 2 + gV (⃗ x) is the timeindependent Hamiltonian, where m is the particle mass, and g is a coupling constant.We emphasized the functional dependence on the potential V , which is precisely what makes S a non-linear operator, viewed as acting simultaneously on the space of V 's and initial conditions for Ψ.
We ask whether, instead of computing the time evolution using traditional methods, the Schrödinger operator 1 Neural networks, but not operators, have already had transformative impact on particle-physics phenomenology and beyond, see, e.g., [6][7][8].S can be represented as a neural operator N (defined more precisely below), so that: The universal approximation theorem for operators, originally due to Chen and Chen [11], actually guarantees an affirmative answer.Translated to our setup, it implies that for any ϵ > 0, there exists a complicated enough for every continuous V (⃗ x) and Ψ(⃗ x, 0), see [12][13][14] for details.We will make this statement more precise shortly.Nevertheless, it is not known how to find such an N ϵ constructively (it would be akin to looking for the next best-selling novel in the decimal expansion of π).Indeed, constructing decent N 's became practically viable only recently, by combining neural operators with deep learning ideas [1,4].In this work, we examine whether this framework allows us to machine learn the Schrödinger operator in a meaningful way.
Strategy Let us sketch the general idea first.The neural operator is trained on triples of data: where the input is a Gaussian noise and Ψ(⃗ x, T ) is computed exactly according to (2), see Fig. 1 (top).The learning process is entirely data-driven and has no physics input: N has to effectively "dream up" its own representation of the Schrödinger equation from scratch.In fact, the neural operator learns not just one instance of this equation, but instead a whole family parameterized by the interaction potential V .It would not be surprising if N could accurately solve problems drawn from the same distribution as it was trained on.Therefore, after training, or at inference, N is instead tasked with predicting the time evolution of a previously-unseen problem: for example, a wave packet approaching a double-slit potential, see Fig. 1 (bottom).In other words, we are probing the generalization or outof-distribution error : the ability of neural operators to extrapolate the information learned during training to new problems.
The prediction needs to be accurate enough to be able to iterate this process k times and study strongly-coupled scattering by evolving the wave function Ψ(⃗ x, kT ) for long times.As we will see in Tab.I, at inference, this way of encoding physics can become faster and more memory-efficient compared to traditional algorithms, such as finite-difference methods, because once the operator N is learned, it does not need to be recomputed for every new input V and Ψ.

ARCHITECTURE
We use a version of the Fourier neural operator (FNO) architecture [3] adapted to the problem at hand.Channels The data is organized into channels.We take the input I : D → R d+3 to consist of d+3 channels, where D ∋ ⃗ x is the spatial domain we take to be the unit torus D = T d (we impose periodic boundary conditions for simplicity).In practice, one samples points from a discretization of D, such as a uniform lattice of N d points.
Likewise, the output O : D → R 2 consists of 2 channels: Splitting Ψ's into real/imaginary parts and adding positional embeddings ⃗ x is used to increase numerical stability and facilitate the learning process.The architecture will be designed to be independent of the specific discretization of D, which in particular means that a neural operator trained on a coarse lattice can be used to make predictions on a finer one, as will be discussed below.
Neural operator A neural operator N is written as a composition of L+2 layers resembling standard deep neural networks [4]: Here, L and P are the lifting and projection layers respectively, which act as identities in D and whose role is to map the input data I to its hidden representations with h channels, H ℓ : D → R h for ℓ = 0, 1, . . ., L, and then back to the output O: For us, L and P are implemented as 1-and 2-layer perceptrons respectively, with the hidden dimension p in the second case.Typically, L = 4 and h, p ≫ d.The hidden-layer functions f ℓ+1 are defined through where the weights W ℓ ∈ R h×h , biases b ℓ ∈ R h , and skip connections s ℓ ∈ R act as identities in D. The activation function σ is the only non-linearity and is applied elementwise.We take it to be the Gaussian Error Linear Unit (GELU) [15].

Fourier layers
The key objects are the (linear) integral kernel operators K ℓ with ℓ = 1, 2, . . ., L, which take the general form for all ⃗ x ∈ D. Note this is the only layer acting nondiagonally in D.
Out of the previously-explored choices for parametrizing the convolution kernel K ℓ [2][3][4], we write (10) using Fourier transforms F and their inverses F −1 : This choice allows us to represent number of Fourier modes.The expression (11) can then be evaluated efficiently using fast Fourier transforms and tensor multiplication.
To summarize, the parameters of N are the entries of W ℓ , T ℓ , b ℓ , and s ℓ that we will optimize for, see Fig. 2. The hyperparameters are all the numbers describing the network properties, such as L, h, F , etc.In this case, the aforementioned universal approximation theorem states that one can always find large enough values of the hyperparameters such that there exist specific W ℓ , T ℓ , b ℓ , and s ℓ for which the L 2 -error is bounded by ϵ for any input in a compact subset of a Banach space [12][13][14].Note that the theorem would not be true if we did not include the non-linearity σ and the Fourier layers.
In practice, we truncate the number of modes to be half along each dimension, i.e., F = (N/2) d , and ensure reality of the output by imposing on the remaining ones.Since the tensors T ℓ are by far the largest part of the neural operator (contributing 2Lmh 2 real parameters), it pays off to instead represent them using Tucker decomposition with rank, or compression ratio, r [16].
We have tried adding a number of other features, including normalization layers, dropout regularization [17], and Sobolev training [18] without drastic improvements in the results.Nonetheless, we suspect they will be important in large-scale applications of neural operators.
To avoid confusion, we will add subscripts: )] to distinguish between the exact and inferred computations.We use the AdamW optimizer [38], which is an adaptive variant of stochastic gradient descent, with learning rate ν and weight decay w, together with a scheduler halving ν every time the training loss reaches a plateau.

Training dataset
The training set is constructed by drawing the first three entries of (5) independently from a Gaussian random field with spatial width µ = 0.1 (i.e., the power spectrum ∝ e −µ 2 |⃗ p| 2 /2 ), sampled over a uniform grid of N d points on T d .In particular, the potential V (⃗ x) is purely real.In order to probe unitarity, each sample is normalized such that ∥Ψ(⃗ x, 0)∥ = 1 and we similarly set ∥V (⃗ x)∥ = 1.We then create a timeline by iteratively solving for Ψ GT (⃗ x, kT ) up to k ⩽ M = 256.Hence, each timeline produces M −1 input-output pairs.For numerical stability, we normalize each output such that ∥Ψ GT (⃗ x, kT )∥ = 1 before evolving it to Ψ GT (⃗ x, (k+1)T ).We repeat this process K = 32 times, which results in the training dataset of size n train = K(M −1) = 8160.See Fig. 1

and 3 (top rows) for examples.
We emphasize that it is possible to achieve much better performance with a training set adapted to scattering problems, for example by including Ψ's resembling wave packets.Likewise, one can include probability conservation in the loss function, hard-code the fact N is supposed to be linear in Ψ, or make use of other physics-informed training strategies [39][40][41][42].Here, we purposely do not employ these steps, because our goal is to probe the outof-distribution error and find out whether N has learned quantum physics.All computations are performed on a single NVIDIA A100 GPU with 40GB memory.

WAVE PACKET SCATTERING
As the first application, we consider wave packet scattering in d = 1.We set ℏ = m = 1 and g = 3×10 3 .In those units, the time step is chosen to be T = 6.3×10 −5 , which is fine enough to produce a "movie", see [43].Much smaller T 's can cause numerical approximation errors and much larger ones prove more difficult to train.The spatial dimension is a unit circle T 1 ∋ x 1 and we discretize it with N = 256 points.
Testing dataset The testing dataset is prepared in exactly the same way as the training one, except it is out-of-distribution: we start with the specific wave packet Ψ(x 1 , 0) approaching a potential well V (x 1 ), as illustrated in Fig. 3 (bottom left).After evolving it through time t ⩽ M T , the testing dataset comprises of n test = M −1 = 255 samples.
Hyperparameters We performed a Bayesian search over the hyperparameter space, which revealed preference for relatively small networks that are less prone to overfitting.In the results below, we use h = 64, p = 512 hidden and projection channels, L = 4 Fourier layers, and Tucker rank r = 10 −2 .Learning rate is set to ν = 10 Example learning curves are shown in Fig. 4, where we also keep track of the testing loss as a benchmark.At first, it closely follows the training loss, but around epoch ≳ 500 the neural operator starts overtraining (memorizing instead of learning).We still find it beneficial to continue learning until both losses stabilize with the training loss converging to ∼ 3×10 −4 and testing to ∼ 5×10 −4 .The kinks in the losses occur in places where the learning rate ν gets halved.Example training and testing data samples are compared in Fig. 3.Note that in the latter case, the wave function has flat near-zero segments, which are very  non-generic from the perspective of the training data, but nevertheless correctly evolved.
Unitarity We found that unitarity can be treated as a proxy for the confidence of the neural operator about its predictions.In Fig. 5 (left), we display a histogram of ∥Ψ NO (x 1 , T )∥ over the whole training and testing datasets.It demonstrates the neural operator has learned unitarity with ∼ 10 −4 precision.
Long-term predictions Finally, we discuss the ability of the neural operator to make long-term scattering predictions by iterating N , where k ⩽ M = 256.For simplicity of exposition, we keep the input potential V constant, even though the same N k is also capable of treating time-dependent V 's.As with the finite-difference methods, we normalize ∥Ψ NO (⃗ x, kT )∥ = 1 after each iteration.Example timeline of ∥Ψ NO (⃗ x, t)∥ before normalization are shown for t ⩽ 256T in Fig. 5 (right) for examples of training and testing samples.They remain accurate to within ∼ 10 −4 of identity, indicating high confidence in reliable predictions.
A full timeline is presented in Fig. 6, which shows the wave packet approaching the central potential, interfering with it, and emerging on the other side with a time delay and a spreading effect.The error builds up over time, but stays within ∼ 0.02 even in the final time step t = 256T , showing strong generalization capabilities.It demonstrates that the iterated neural operator N k has effectively learned the physics of the Schrödinger operator at strong coupling in d = 1.

DOUBLE-SLIT EXPERIMENT
We next discuss neural operators in d = 2, with the outof-distribution test problem being a wave packet scattering off a double-slit potential, see Fig. 1 (bottom) for the initial conditions.
The hyperparameters used are the same as in the d = 1 case, except for the Tucker rank, which we take to be r = 10 −3 and the grid size, N = 64.The resulting number of parameters is ∼ 1.2×10 5 .Training and testing dataset sizes remain the same as in d = 1.
Learning As expected, the neural operator becomes more difficult to train in d = 2, as exemplified by the learning and unitarity curves in Fig. 4  Performance As before, we construct N k to study time evolution of the wave function.The results are illustrated in Fig. 7.After passing through the double-slit openings, the wave packet interferes with itself, creating a fringe pattern in the x 1 -direction.Due to periodic boundary conditions, it also develops interference in the x 2 -direction at late times.As in the d = 1 case, the numerical errors build up over time: in the first and last time-frames, t = T and t = 256T , the L 2 -error between the exact and inferred computations is ∼ 5×10 −4 and ∼ 0.14 respectively.
Zero-shot super-resolution Let us finish this discussion by emphasizing the distinction between a neural network acting on a very large but finite-dimensional Hilbert space (obtained by a specific discretization of D) and a neural operator, which acts on an infinitedimensional space (independent of the discretization).In particular, it means that the latter can be trained on a lower-resolution lattice and then applied to computing Ψ NO (⃗ x, T ) at higher resolution.In the literature, this procedure is called a zero-shot super-resolution [4].
To exemplify it, in Tab.I we collected indicative times for computing Ψ GT (⃗ x, T ) and Ψ NO (⃗ x, T ) from the testing set at various resolutions, together with the 2 -errors between the two techniques.Finite-difference methods scale badly at large lattice sizes because they involve computing and taking powers of an N d × N d matrix for every new potential V .On the other hand, neural operator trained on the original sizes (here, N = 256 in d = 1 and N = 64 in d = 2) can be applied to finer grids and new potentials with low overhead.

DISCUSSION
In this work, we explored the idea of using deep operator learning as a computational tool for mapping between function spaces appearing in fundamental physics.As an illustrative example, we considered neural operators predicting quantum-mechanical scattering of wave packets with potential barriers and wells.We envisage that neural operators have much broader applicability in quantum theory, beyond just time-evolution operators, both in numerical and symbolic manipulations.
Let us emphasize that this strategy depends on a reliable way of producing training samples.The prospect is that, once trained, a neural operator can perform the same computation approximately but much more efficiently.The biggest open question concerns out-of-distribution errors.For example, given that neural operators represent physics in unconventional ways, would cheaply-obtainable training data (e.g., coming from exactly-solvable systems) suffice to learn solutions to conventionally difficult problems?We leave a study of this provocative question until future work.

Figure 1 :
Figure 1: Example training (top) and testing (bottom) data with the input in the first two columns and the output in the third one.Only real parts are displayed for conciseness.

Figure 2 :
Figure 2: Diagram of the neural operator architecture used in this work.The Fourier layer (red) is introduced in (11).

1 Figure 3 :
Figure 3: Comparison between the exact and inferred wave functions for training (top) and testing (bottom) datasets in d = 1.From left to right: Potential V (x1) (yellow) and the initial wave function Ψ(x1, 0) (real part in red, imaginary part in blue); exact wave function ΨGT(x1, T ); neural operator output ΨNO(x1, T ); and the error ΨGT(x1, T ) − ΨNO(x1, T ) magnified 10 3 × for visibility.The difference between t = 0 and t = T is non-zero, but barely visible by eye due to the choice of a small time step; see Fig.6 for time evolution across t ⩽ 256T .
and 5 respectively.The generalization error is worse by around an order of magnitude compared to d = 1: the unitarity errors stay within ∼ 4×10 −3 of identity; the training and testing losses reach ∼ 3×10 −4 and ∼ 1.1×10 −3 respectively.Unitarity dips around the time when the wave function develops high-frequency modes due to encountering the potential barrier.

Table I :
Comparison between single time-step GPU times (in seconds) and memory usage (in GB) of the Crank-Nicholson (CN) method and neural operator (NO) for different dimensions d and lattice sizes N .The last column gives an average L 2 -error between the two methods across the testing dataset.Original training sizes for NO are underlined.Crosses indicate that a computation did not terminate due to memory shortage.