Heat flux for semi-local machine-learning potentials

The Green-Kubo (GK) method is a rigorous framework for heat transport simulations in materials. However, it requires an accurate description of the potential-energy surface and carefully converged statistics. Machine-learning potentials can achieve the accuracy of first-principles simulations while allowing to reach well beyond their simulation time and length scales at a fraction of the cost. In this paper, we explain how to apply the GK approach to the recent class of message-passing machine-learning potentials, which iteratively consider semi-local interactions beyond the initial interaction cutoff. We derive an adapted heat flux formulation that can be implemented using automatic differentiation without compromising computational efficiency. The approach is demonstrated and validated by calculating the thermal conductivity of zirconium dioxide across temperatures.

The thermal conductivity tensor κ describes the ability of a material to conduct heat when exposed to a temperature gradient.Its computational prediction is of great interest for the design of novel high-performance materials which are needed, for example, as thermal barrier coatings in engines [1], or thermoelectrics for waste heat recovery [2].Such materials often feature complex structure and strongly anharmonic potential-energy surfaces (PES) [3,4].This implies the need to evaluate κ with the Green-Kubo (GK) method [5][6][7][8][9].
In the GK approach, κ is expressed in terms of the integral of the autocorrelation function of the instantaneous heat flux J(t) as observed in equilibrium molecular dynamics (MD) simulations, where k B is the Boltzmann constant, V the simulation cell volume, and • T,p denotes an ensemble average at temperature T and pressure p.
High-accuracy MD simulations can be performed using density-functional theory (DFT) when the exchangecorrelation approximation is reliable [10].For the evaluation of Eq. (1), this approach [11,12] suffers from its numerical costs which limits the system sizes and time scales that can be treated, and therefore requires additional denoising and extrapolation approaches [12][13][14].The alternative, so far, was the use of semi-empirical force fields (FFs) [15].Here, the interatomic interactions are described by a physically-motivated analytical equation that includes free parameters which are fitted to experimental or ab initio results.This classical FF approach has been very successful, as it enables a proper consideration of the ensemble averages needed in Eq. (1).However, the restricted flexibility of FFs may limit their generality and ability to model novel materials.
A new, more general class of FFs is the family of machine-learning potentials (MLPs) which leverage techniques like neural networks (NNs) [16][17][18][19][20][21].MLPs offer, in principle, unrestricted flexibility, but are limited to the mechanisms and information that are provided by their training data.In local MLPs, linear scaling with system size is achieved by using the short-ranged nature of chemical bonding [22] to decompose the total energy into contributions that only depend on local atomic environments.However, a strict locality assumption limits the flexibility and therefore accuracy of such MLPs.Some FFs therefore include explicit long-range electrostatic and van der Waals interactions [23][24][25][26][27]. Semi-local MLPs [28][29][30][31][32][33][34][35][36][37][38][39] build up longer-range correlations iteratively from local ones through message-passing mechanisms, thereby preserving linear scaling with system size.They have recently emerged as an alternative to strictly local MLPs and have shown promising performance in benchmark settings and first applications [35,36,[40][41][42].
While local MLPs have been used to investigate thermal transport via GK [43][44][45][46][47][48][49][50], more recent semi-local MLPs have not yet been applied, partially because a heat flux formulation that incorporates message-passing mechanisms was lacking.In this work, we fill that gap and extend the GK approach to semi-local potentials.To this end, we derive a formulation of the heat flux that explicitly accounts for semi-local interactions, finding that the resulting thermal conductivity significantly differs from a purely local form.While the computation of this heat flux scales quadratically with system size, we show that an alternative yet equivalent form based on an extended auxiliary system can be introduced, leading to overall linear scaling and straightforward practical implementation of the approach via automatic differentiation (AD) [51,52].Using the SchNet message-passing neural network (MPNN) [29,30], we demonstrate the accuracy and feasibility of large-scale semi-local MLP thermal conductivity calculations for zirconia (ZrO 2 ), an oxide known for its strongly anharmonic PES [53,54].
For any potential function U ({R J }) that can be decomposed into atomic contributions U = I U I ({R J }), where {R J } denotes the set of all atomic positions R J , the heat flux is given by the classical equivalent of a formula by Hardy [55], which we re-derive in the Supp.Mat. to explicitly account for periodic boundary conditions.This yields the full 'Hardy' heat flux =: where ṘI is the velocity of an atom, and E I = U I + 1 2 m I ṘI is the total energy per atom.For atom-pair vectors, we adopt the convention R IJ = R J − R I .R sc indicate the atoms in the simulation cell, while R all enumerates the full, infinite, bulk system.The nomenclature for heat flux contributions and the relation of Eq. ( 2) to DFT formulations are further discussed in the Supp.Mat.As this work considers FFs and MLPs that explicitly define atomic potential energies U I , total atomic energies E I and consequently J conv can be computed in a straightforward manner.We therefore only discuss the more involved computation of J pot in the following, whereas the heat flux used for calculating κ is always equivalent to the full flux given by Eq. (2).
Evaluating J pot requires disentangling the contributions of every atom, including those in the bulk, to every atomic potential energy U I .This can be challenging for non-pairwise, many-body potentials, leading to the development of specialized expressions for different FFs [56][57][58][59][60][61], many of which were recently unified and shown to be equivalent to Eq. ( 2) by Fan et al. [59].Their work is based on the insight that translational invariance requires that the potential is computed only from atom-pair vectors R IJ , which provides a convenient basis to separate the computation of each U I into distinct sets of inputs.
Combined with the introduction of an interaction cutoff radius r c and atomic neighborhoods As each R IJ only contributes to one U I , derivatives of U naturally separate into atomic contributions ∂U I /∂R IJ = ∂U/∂R IJ .The resulting expression can be implemented efficiently with AD, as detailed in the Supp.Mat.To estimate the asymptotic scaling, a function proportional to N x has been fitted to the results for large N .Note that on this setup with limited memory, the truly asymptotic limit cannot be reached.
While being exact for local potentials, this formulation of the heat flux does not apply to the semi-local case.In such potentials, longer-range interactions are introduced without explicitly increasing the cutoff r c by building them up iteratively: Neighboring atoms are allowed to exchange information for a fixed number of iterations M [28].Neighboring environments up to an effective cutoff radius r eff c = M r c therefore become correlated; atomic potential energies U I acquire a dependence on atom-pair vectors outside of their immediate neighborhoods N (I), rendering Eq. ( 4) inapplicable.
To see this, we employ a description in terms of a graph G, where vertices V are identified with atoms I in the simulation cell, and connected via edges E labelled by atom-pair vectors R IJ if they lie within N (I).Semi-local MLPs then act on this graph by propagating information between vertices (see Supp.Mat.).Interactions outside of the simulation cell are therefore mapped back into it, and explicit replicas are not constructed.
Assuming that r eff c is chosen such that the minimum image convention (MIC) is applicable, J pot in Eq. ( 2) can be rewritten (see Supp.Mat.) as generalizing J local pot to semi-local MLPs.In the case of M = 1, this form reduces to Eq. (4).
Equation ( 5) reflects the standard construction of semilocal MLPs; a double sum over all atoms is required and its evaluation formally scales quadratically with system size.As shown in Fig. 1, a direct implementation of this form is therefore impractical.While force predictions for a semi-local MLP based on the SchNet architecture [29,30] remain below 100 ms for all system sizes investigated, the unoptimized calculation of the heat flux dominates the computational cost by several orders of magnitude at the system sizes required for the GK method.
If the analytical form of U I were known, a lower-scaling evaluation of the heat flux might be accessible by deriving and implementing analytical derivatives.Modern MLPs, however, typically rely on AD [51,52] for efficiently computing derivatives without requiring detailed information on the functional form of the MLP.
To take advantage of this, we now derive an adapted form of the heat flux that preserves the implicit treatment of interactions beyond local environments to retain the computational efficiency of semi-local MLPs, while explicitly attributing all contributions to U I to bulk positions for J pot in Eq. ( 2).This is achieved by constructing an extended simulation cell that explicitly includes all replicas that interact with atoms in the simulation cell, inspired by previous approaches which did not consider AD [62,63].The graph representation is then constructed without periodic boundary conditions, yielding 'unfolded' vertices R unf .The potential energy obtained by summing over the original simulation cell, U , remains unchanged in this construction.This allows to retain the small cutoffs needed for efficiency, while enabling AD to compute the required derivatives.
With this construction, Eq. ( 2) can be rewritten as introducing the energy barycenter B = I∈Rsc R I U I , where the positions R I are treated as pre-factors and not included in the partial derivative.The dot product is taken between denominator and velocity.Writing the heat flux as the derivative of a vector B and a scalar U , as opposed to a high-dimensional Jacobian, ensures that these derivatives can be readily computed with AD, incurring the same asymptotic computational cost as the calculation of U and B, which is proportional to |R unf |.Since the number of additional positions is proportional only to the surface area of the simulation cell and the number of interactions M , the overall asymptotic linear scaling is restored, with |R unf | ∝ N + N 2/3 (see Fig. 1).
To validate the approach, we benchmark the performance of a semi-local MLP, in particular the SchNet [29,30] MPNN architecture, for GK calculations on zirconia (ZrO 2 ) and compare to results obtained with sizeextrapolated ab initio GK [12], as well as GK with a local MLP [50], and experimental measurements [64][65][66].
Training and validation data were generated using ab initio MD in the N pT ensemble, with four different trajectories heating up an initially tetragonal simulation cell with 96 atoms to target temperatures 750 K, 1500 K, 2250 K and 3000 K.In total, 10 000 single-point calculations were performed using FHI-aims [67] and FHIvibes [68], using the PBEsol [69] functional and otherwise following the computational approach of Ref. [12].
On this data, we train a SchNet MPNN, implemented in SchNetPack [70], with cutoff radius r c = 5 Å.We choose Comparison of the integral of the heat flux autocorrelation function for different formulations of the heat flux.The efficient re-formulation of the heat flux J unfolded is equivalent to the full heat flux J semi-local , but not to J local , which neglects semi-local interactions.Results are given for an MPNN with M = 2 and zirconia at 300 K in the monoclinic phase (top) and 1400 K in the tetragonal phase (bottom) for a simulation cell with 768 atoms.Shaded regions indicate standard error across eleven trajectories.
an interaction depth M = 2 leading to an effective cutoff of 10 Å.In line with recent findings by others [36], we find this to be sufficient, as test set error does not significantly decrease for higher values of M or r c .Further details on the training procedure, choice of hyperparameters, and testing of the MLP can be found in the Supp.Mat.
We find that this simple approach yields a MLP capable of describing the dynamics in monoclinic and tetragonal zirconia up to temperatures of approximately 2000 K.In this temperature range, the anharmonic vibrational density of states matches that obtained from DFT.Beyond 2000 K, the oxygen atoms become more mobile and different types of dynamical events are observed, in particular exchange-type oxygen diffusion.This behavior is also present in the training data in line with recent literature [71], although slightly different diffusion events are observed given the smaller simulation cells and trajectory lengths.When diffusion increases at higher temperatures, the MLP becomes unstable.This might be due to the limited amount of training data for these processes, especially for thermodynamic conditions close to the tetragonal-tocubic phase transition.These observations suggest that an accurate description of defect formation is necessary to investigate zirconia above 2000 K, which is beyond the scope of the current work.
Figure 2 compares our efficient implementation with FIG. 3. Thermal conductivity across temperatures computed with an MPNN using M = 2 message-passing steps and experimentally determined lattice parameters [72,73], compared with another MLP without extrapolation [50], sizeextrapolated ab initio GK [12], and experimental measurements [64][65][66].Error bars are shown as given in the respective publications, the ones in the present work reflect the standard error across eleven trajectories.Letters 't' and 'm' indicate results for the tetragonal and monoclinic phase, respectively.
the full semi-local heat flux, as well as the purely local heat flux formulation.Due to the high computational cost of the unoptimized implementation, we use a small simulation cell with N = 768 atoms, and rely on the noise reduction scheme introduced in Ref. [14].The results confirm that our implementation J unfolded pot is equivalent to the semi-local heat flux J semi-local pot , while the local flux J local pot is not, underestimating the thermal conductivity by approximately 40 % due to missing interactions beyond M = 1.A similar effect has been observed when formulations applicable to pairwise additive potentials are used for many-body force fields [60,61].
Enabled by computationally efficient access to J for semi-local MLPs, we then predict the thermal conductivity of zirconia across temperatures.Since the focus of the present work is the heat flux, we do not treat the thermodynamics of zirconia with the MLP, but use experimentally determined lattice parameters [72,73] to account for lattice expansion.At 1400 K, both phases are investigated, as the monoclinic phase is sufficiently stable during the course of the simulations, which consist of eleven trajectories of 1 ns each, with a N = 1500 simulation cell.These settings yield fully size-and time-converged results (see Supp.Mat. for details).
The results presented in Fig. 3 are in good agreement with both experimental measurements in the monoclinic phase, and theoretical MLP predictions in the monoclinic and tetragonal phases.As this work uses similar lattice parameters and the same exchange-correlation functional as the work by Verdi et al. [50], the observed close agreement is to be expected.Remaining differences between the MLP results may be due to larger simulation cells used in the present work, enabled by the favorable scaling of computational cost due to the efficient heat flux implementation, and the semi-local nature of the employed MPNN.Compared to experiment, both MLPs are found to systematically underestimate κ by approximately 10 % to 20 %, which may be related to the intrinsic approximation of a finite-range MLP, or the underlying density functional approximation.
Larger differences are observed with the sizeextrapolated ab initio GK results reported by Carbogno et al. [12], which, however, were computed for the tetragonal phase at all temperatures.Additionally, due to the high computational cost of first-principles calculations, only three trajectories of 60 ps each were used, which is reflected in the larger statistical error.
We conclude that the adapted GK approach for semilocal MLPs introduced in this work can successfully and efficiently predict the thermal conductivity of zirconia across temperatures, using 10 000 first-principles calculations in total.Despite a moderate system size of 96 atoms for training, fully size-converged results were obtained without requiring additional extrapolation schemes.
In summary, we have demonstrated the feasibility of applying AD-based semi-local MLPs to the prediction of thermal conductivities with the GK method.For this, we investigated the impact of semi-local interactions on the heat flux, and derived an adapted heat flux J unfolded pot that can be efficiently implemented via AD.This heat flux has asymptotically linear runtime and requires no further restrictions on the form of the potential.Its formulation is independent of the body-order of the potential energy function, making no distinction between pair, angle-dependent, or many-body potentials.As it relies on explicitly constructing an extended simulation cell, it is applicable to semi-local MLPs with moderate effective interaction ranges.

DATA AND CODE AVAILABILITY
For practical simulations based on machine-learning potentials, it is not only necessary to predict the total energy U , but also its derivatives with respect to atomic positions R I , e.g., the forces and the stress tensor [1].Since it is impractical to manually implement the required derivatives for deep neural networks or other complex architectures, automatic differentiation (AD) is employed.
AD is a technique to automatically obtain derivatives of functions implemented as computer programs [2,3] by reducing them to a directed acyclic graph of elementary operations, evaluating the local derivatives for each operation, and then applying the chain rule by traversing the graph: Traversing the graph from the output backwards yields reverse-mode AD.Traversing from the inputs forwards yields forward-mode AD.
AD can efficiently obtain vector products of the Jacobian of a function over either the input dimensions (forward mode) or the output dimensions (reverse mode), with the same asymptotic runtime as computing the original function.Computing the explicit Jacobian, on the other hand, requires repeated evaluations of such products, incurring linear additional computational cost in either the input or output dimension.
Therefore, if the energy prediction scales linearly with system size, and only derivatives of the scalar total potential energy U are required, the forces and stress can be predicted in a linearly-scaling fashion as well.In contrast, explicitly computing all derivatives of the potential energy contributions, ∂U I /∂R J , would incur quadratic computational cost, which introduces difficulties when implementing the GK method.
Since AD relies on tracing the underlying computation, it can only produce derivatives with respect to quantities that were explicitly used in computing a result; for instance, derivatives with respect to pair vectors R IJ cannot be computed when J / ∈ N (I), i. e., for atoms that are not in the neighborhoood of I, since those atom-pair * Corresponding author: mail@marcel.science† Corresponding author: florian.knoop@liu.sevectors only appear implicitly through message passing, which is discussed in the next section.

II. SEMI-LOCAL MACHINE-LEARNING POTENTIALS
In this work, we consider machine-learning potentials (MLPs) with semi-local structure.In many cases, MLPs of this type are implemented as message-passing neural networks (MPNNs) [4], although constructions based on other forms of regression are also possible [5].
This type of model aims to combine the computational efficiency of local MLPs, in which potential energy contributions are dependent on atoms within a cutoff r c , typically on the order of 5 Å to 10 Å, with the ability to model correlations extending beyond these cutoffs.This is achieved by allowing M interactions between neighboring environments, building up longer-range correlations iteratively, without requiring the explicit construction of the extended neighborhoods, at the expense of an information bottleneck introduced by the restricted amount of information that can be communicated.One can alternatively view semi-local models as introducing a sparsification of interactions within r eff c = M r c .In order to clarify the construction of such models in practice, we now briefly introduce MPNNs.The (periodic) system at hand is represented by a graph G = (V, E), where the vertices V correspond to the atoms present in the simulation cell connected by edges E. Two vertices I and J are connected with an edge if |R IJ | ≤ r c , where r c is a user-defined cutoff radius.Since, by definition, I and J are atoms in the simulation cell, R IJ must be subjected to the minimum image convention (MIC) to account for periodicity, particular choice of unit cell; translational invariance is ensured by using relative positions.
MPNN potentials operate on G in three stages, illustrated in Fig. 1: initialization, message-passing, and readout.Each message-passing step is labeled by t = 0, . . ., M. During initialization (t = 0), every atom I is assigned a state vector s t=0 I , based on chemical species.During the next stage, these states are updated with messages m t I that can depend on the states of both the receiver atom I and its neighbors J 2 N (I), as well as the edges R IJ connecting them: Here, the message functions M t and update functions F t are differentiable functions implemented as neural networks.They are learned during training, and can differ over message-passing iterations.The message functions M t model pairwise interactions.Their aggregation into the total message m t+1 I via a sum ensures permutational invariance.The update functions F t describe how the interactions with all neighbors influence the new state of the receiver atom.Over multiple iterations, information is propagated through the graph.
We emphasize that this construction incorporates periodicity by messages wrapping around boundaries back into the simulation cell.Since the atom-pair vectors within N (I) are identical whether I is in the simulation cell or a replica, this is equivalent to messages propagating into neighboring replicas.However, as we will see in the following section, this implicit treatment of periodicity introduces difficulty with the heat flux, which is resolved by adapting G.
After M message-passing steps, the final stage is reached where a readout function R predicts the atomic potential energy contributions U I .This atom-independent readout function is learned during training and acts on each vector s M I representing the final state of atom I.The total potential energy U is then simply given by For M > 1, the energy contribution U I is a function not only of {R IJ |J 2 N (I)}, but of every edge within M hops on the graph: Information propagates beyond local neighborhoods of atoms, allowing for implicit long-range interactions.This distinguishes MPNN potentials from other MLPs, which typically act only on local features.In the present framework, such simpler, strictly local models are included when setting M = 1.
We also note that different strategies can be employed to deal with rotational invariance.Early MPNNs such as SchNet [6,7] only use interatomic distances as inputs, discarding angular information.Recent equivariant MPNNs [8][9][10][11][12] can make use of atom-pair vectors and can ensure that s t I and m t I transform appropriately under rotations.Message functions that take angular or higher-order information into account are also possible [13][14][15][16].In this work, we use an invariant MPNN.The presented derivations and conclusions are however fully general and applicable to equivariant MPNNs.

III.1. Derivation of Hardy Formula
Due to the difficulties discussed in the main text, deriving and implementing the heat flux for particular types of PES has been the focus of much previous work [17][18][19][20][21][22][23][24].This work aims to provide a unified perspective for the case of semi-local MLPs implemented using AD.
As a starting point for this task, we re-derive the classical equivalent of the heat flux introduced by Hardy [25], particular choice of unit cell; translational invariance is ensured by using relative positions.
MPNN potentials operate on G in three stages, illustrated in Fig. 1: initialization, message-passing, and readout.Each message-passing step is labeled by t = 0, . . ., M .During initialization (t = 0), every atom I is assigned a state vector s t=0 I , based on chemical species.During the next stage, these states are updated with messages m t I that can depend on the states of both the receiver atom I and its neighbors J ∈ N (I), as well as the edges R IJ connecting them: Here, the message functions M t and update functions F t are differentiable functions implemented as neural networks.They are learned during training, and can differ over message-passing iterations.The message functions M t model pairwise interactions.Their aggregation into the total message m t+1 I via a sum ensures permutational invariance.The update functions F t describe how the interactions with all neighbors influence the new state of the receiver atom.Over multiple iterations, information is propagated through the graph.
We emphasize that this construction incorporates periodicity by messages wrapping around boundaries back into the simulation cell.Since the atom-pair vectors within N (I) are identical whether I is in the simulation cell or a replica, this is equivalent to messages propagating into neighboring replicas.However, as we will see in the following section, this implicit treatment of periodicity introduces difficulty with the heat flux, which is resolved by adapting G.
After M message-passing steps, the final stage is reached where a readout function R predicts the atomic potential energy contributions U I .This atom-independent readout function is learned during training and acts on each vector s M I representing the final state of atom I.The total potential energy U is then simply given by For M > 1, the energy contribution U I is a function not only of {R IJ |J ∈ N (I)}, but of every edge within M hops on the graph: Information propagates beyond local neighborhoods of atoms, allowing for implicit long-range interactions.This distinguishes MPNN potentials from other MLPs, which typically act only on local features.In the present framework, such simpler, strictly local models are included when setting M = 1.
We also note that different strategies can be employed to deal with rotational invariance.Early MPNNs such as SchNet [6,7] only use interatomic distances as inputs, discarding angular information.Recent equivariant MPNNs [8][9][10][11][12] can make use of atom-pair vectors and can ensure that s t I and m t I transform appropriately under rotations.Message functions that take angular or higher-order information into account are also possible [13][14][15][16].In this work, we use an invariant MPNN.The presented derivations and conclusions are however fully general and applicable to equivariant MPNNs.

III.1. Derivation of Hardy Formula
Due to the difficulties discussed in the main text, deriving and implementing the heat flux for particular types of PES has been the focus of much previous work [17][18][19][20][21][22][23][24].This work aims to provide a unified perspective for the case of semi-local MLPs implemented using AD.
As a starting point for this task, we re-derive the classical equivalent of the heat flux introduced by Hardy [25], combining existing approaches with the aim of clarifying the introduction of periodic boundary conditions.Often, the Hardy formula is given without explicitly specifying the range of the involved sums, introducing ambiguity in periodic boundary conditions: Sums can run either over the simulation cell, or the bulk, and describe either 'checkerboard' or 'toroidal' boundary conditions, an ambiguity highlighted by Erpenbeck [26] in the context of GK calculations of shear coefficients.In the context of AD, which relies explicitly on the forward computation of U , the precise construction of a given potential gains central importance, and we therefore take care to be as explicit as possible.
In this derivation, we rely on the approach by Hardy [27] to provide a solution to the continuity equation in terms of bond functions, 1 and follow the ideas of Ito and Nakamura [30] to avoid the assumption of a pair potential.After solving the continuity equation for a finite system of N particles, we adopt periodic boundary conditions, recovering the Hardy formulation of the heat flux [20,25].
The heat flux required for the GK method is defined as the spatial integral of a current density which arises from a continuity equation for the energy density which in turn is defined as a scalar field that integrates to the total energy of the system We are studying the system under consideration with MD, treating it as a collection of N point particles contained in the volume V , whose state is fully characterised by a set of phase-space coordinates Γ = { Γ I : I = 1 . . .N } = { (R I , p I ) : I = 1 . . .N }, with atomic momenta p I .These coordinates evolve in time according to Newton's equations of motion where U is the potential energy, which we take to be composed of atomic contributions; U = N I=1 U I .This work considers potentials where this decomposition is available by design, as they approximate the BO PES in terms of many-body functions U I of the atomic coordinates.
In DFT, on the other hand, the potential energy part of the Hamiltonian has a strictly pairwise structure; manybody interactions then emerge implicitly from the electronic density, instead of being explicitly included in the energy expression.Carbogno et al. [22] exploit this pairwise structure, which extends to the Hellmann-Feynman forces, to obtain pairwise derivatives of the potential.They then obtain an exact equivalent of J pot , which purely relies on derivatives, and discard J conv .The meaning and naming of these terms is further discussed at the end of the present section.We additionally note that an energy density for DFT can be defined [31] and a particular gauge can be chosen to obtain a heat flux [21].
We now proceed with a derivation for potentials where atomic potential energies U I are explicitly available, and for the sake of generality assume that each U I can depend on the position of every atom J, such that U I = U I ({R J }).We will later specialize to particular types of potentials, restricting the dependence of U I to finite neighborhoods.
With this, time-dependence appears only implicitly through the time-dependent positions and velocities, allowing us to recast the total time derivative in terms of partial derivatives with respect to positions.We suppress time-dependence in the following derivation.Additionally, all sums will run from 1 to N until the spatial integral over j(r) is executed.
Following Hardy [27], we spatially localize the atomic total energy, given by the sum of potential energy U I and kinetic energy T I , E I = U I + T I with a function ∆(R I − r), which is peaked at R I , decays to 0 as |R I − r| increases, and is normalized to 1.
We also define a corresponding bond function, which localizes contributions along a line segment: One can then show that We then make an ansatz for e(r): The time-derivative of e(r) is The first term can be tackled by again splitting E I = U I + T I and resolving the time derivative using Eq. ( 8): Then, we re-arrange it in a pair-wise form and finally apply the identity from above to obtain The second term is easily resolved by the chain rule Comparing with the continuity equation, we find a direct solution to the continuity equation.This result can also be justified from a physical perspective.Any energy change at atom I must be balanced out by corresponding changes in the atoms J it interacts with, with an energy current flowing between them. 2 If we can divide up the change in E I into contributions that can be attributed to different J, and identify the corresponding terms in E J , we know the magnitude of the current between I and J.It is a natural assumption that the current flows directly between them, parallel to R IJ .The first term is one way to construct a vector field accordingly.The second term describes an alternative process to change local distribution of energy: Rather than energy flowing between atoms, the energy assigned to a given atom is 'dragged along' by the atom moving.Together, these two processes describe how the energy density can change, and therefore yield the energy current density, solving the continuity equation.
We are now in a position to compute the macroscopic heat flux J.However, we must first define the integration domain: Since the thermodynamic limit is inaccessible due to finite computational resources, practical simulations make use of periodic boundary conditions, replicating a simulation cell with N atoms periodically in space.This simulation cell must be large enough to capture the relevant dynamic processes for thermal transport.We then formally compute j for the resulting bulk, defining R sc as the atoms in the simulation cell and R all as the entire bulk.Extending the so-far unspecified sum over IJ explicitly over the bulk yields We now integrate over the simulation cell, obtaining where we have used the fact that all contributions where I / ∈ R sc can be computed equivalently for the replica of I in the simulation cell.This is a general form of the heat flux, the classical equivalent of the formulation derived by Hardy for periodic quantum systems [20,25].We note that the minimum image convention does not yet appear.
Equation ( 26) is typically split into two terms: The first one is identified as the 'potential' heat flux J pot , describing heat flux due to energy exchange between atoms, and the second one as 'convective' term J conv , accounting for changes in energy density due to atomic movements directly.Nomenclature changes across the literature.For instance, Fan et al. [20] refer to J conv as 'kinetic' term, while Carbogno et al. [22] refer to J pot as 'conductive' or 'virial' term.We do not adopt this terminology as J pot is not always composed of virials, i.e., contributions to the stress: In general semi-local MLPs, the derivative ∂U I /∂R J does not directly resolve into atom-pair vectors originating from I. In solids, J pot is also not precisely the heat flux that determines thermal conductivity.

III.2. Heat Flux in Solids
In solids, atomic positions are bounded over time.In other words, atomic positions R I (t) can be split into a fixed reference position R 0 I and a time-dependent displacement from that position u I (t): Comparison of the integral of the heat flux autocorrelation function for J, Jint, Jpot and Jconv, at 300 K in the monoclinic phase (top) and 1400 K in the tetragonal phase (bottom), using 'production' settings (N =1500, t=1 ns).No noise reduction is applied in this case.Shaded areas indicate standard error.

If we choose R 0
I such that it contains the information about which replica belongs to, 3 we obtain: ⇒ ṘIn (t) = ṘI (t) , where we introduce n as a an offset vector, indicating which integer multiples of lattice vectors are added to yield a given replica.In other words, the displacements and velocities are shared between all replicas of I. Substituting into Eq.( 26), we obtain: This defines a new split for J, into an 'interaction' term J int , which arises due to the exchange of energy between 3 For instance by choosing positions in the pristine supercell, or R 0 I = R(t = 0), or average positions over the simulation run.
atoms at sites I and J, and a remainder term due to the displacements, J disp .If u I (t) are bounded over time, then J disp is also bounded, and is therefore a non-diffusive heat flux, which does not contribute to κ [21,32,33].
As we consider a solid system in this work, only J int is reported throughout, as it allows for a more straightforward implementation and noise removal.Figure 2 compares J and J int , showing that they are indeed equivalent for zirconia at the given temperatures.We note that J conv does not disappear in isolation, and that J pot is generally not equivalent to J int .

III.3. Local Heat Flux
Let us briefly consider the heat flux for general local potentials where with N (I) containing positions, including replicas, up to a finite cutoff radius r c .Then Since potentials are expected to be translationally invariant, the dependence on R J can only enter through atom-pair vectors, and we therefore obtain the form derived by Fan et al. [20].If we further note that no interactions between environments occur, then every atom-pair vector can be uniquely assigned to one U I , and we can equivalently write A similar form was derived by Carbogno et al. [22] for DFT, using the pairwise nature of the derivatives of the Hamiltonian.However, in DFT, electronic degrees of freedom enter U .
In the present work, we consider local potentials as the M = 1 special case of semi-local models as introduced in Section II.In this setting, replicas are not explicitly constructed.Instead, the MIC is used on the atom-pair vectors providing the edges in the graph.The heat flux for the M = 1 case is therefore This form can be efficiently implemented with AD, as it only requires derivatives of the scalar total energy, which can be obtained in one backwards pass.

III.4. Semi-Local Heat Flux
To treat semi-local MLPs, we first note that by construction, Section II, interactions beyond the simulation cell are re-mapped into the cell and treated implicitly.Equation ( 26) cannot be immediately applied, since the replicas in R all are not explicitly constructed, and are therefore inaccessible to AD.The approach used for local potentials, where we used the fact that each edge can be uniquely assigned to a given U I is also not applicable, as M > 1 correlates neighboring environments.
A general solution to this difficulty is provided by the 'unfolded' heat flux presented in the following section, which changes the construction of semi-local MLPs to allow this attribution.A preliminary solution that respects the standard construction of such models is to restrict the effective cutoff so only one replica of any given atom lies within the effective interaction radius of any other.Then, the minimum image convention can be used while computing derivatives only with respect to positions within the simulation cell.However, as we will see, this solution is inefficient, leading to quadratic scaling of the computational cost.
We now transition to the notation introduced in Section II.Then, U I depends not only on N (I), but the set of neighborhoods within M hops on the graph.In general, it can depend on all neighborhoods, and we can therefore write the following, which highlights the dependence of the heat flux on interactions beyond the cutoff, This form requires the explicit computation of all elements of the Jacobian ∂U I /∂R JK , or equivalently ∂U I /∂R J , which in turn requires N -fold repeated evaluations with AD, leading to the quadratic scaling observed.

III.5. Unfolded Heat Flux
Conceptually, we now modify the construction of the MLP: Instead of constructing the input graph G based on positions in R sc only, we instead start from an extended simulation cell R unf that includes all positions up at most r eff c distant from the atoms in R sc , i.e., all positions that contribute to U .The atom-pair vectors that appear in the resulting graph are identical to the standard construction, and U is therefore unchanged.Similar ideas have been employed previously to define stress and heat flux under periodic boundary conditions [34,35] without the consideration of AD.The unfolded heat flux J unfolded Hardy can then be obtained starting from Eq. ( 26).We first write out the atom-pair vector R JI , In order to efficiently implement this expression with AD, we must rewrite it in the form of derivatives of scalar, or low-dimensional, quantities.
For the first term, this requires the ability to move the partial derivative outside the sum over I, which we enable by formally replacing the R I appearing as prefactors with auxiliary positions R aux I that are numerically identical, but not included in derivatives.
We can then introduce the potential energy barycenter the first term becomes The vector-vector product is taken between the denominator and the ṘJ , not with B.
In the second term, the sum over I can be executed directly, leaving only derivatives of the scalar energy U with respect to all positions in R unf .So: The first term requires one backwards pass through the computation of B for each of its three cartesian components, or, if forward-mode AD is available, a single forward pass.The second term can be simply computed from the gradient of U , which is obtained in a single backwards pass.Therefore, the asymptotic runtime scales with the number of atoms N unf in R unf .The number of additional atoms to be considered scales with the surface of the simulation cell, constituting a shell of width r eff c .Assuming constant density, V and N are proportional.Considering a cubic system for simplicity, the surface area of the simulation cell is proportional to L 2 , where L = V 1/3 .Therefore, the surface area, and the additional number of atoms, is proportional to N 2/3 .Asymptotically, this is dominated by the number of atoms N in the simulation cell itself, rendering this approach formally linearly scaling in N .However, for fixed N , the number of additional positions increases cubically with M , restricting this approach to moderate M .

IV. HEAT FLUX BENCHMARK
The timings of different implementations of the heat flux were taken for ten steps of molecular dynamics of different supercells of tetragonal zirconia, using the M = 2 SchNet with r c = 5 Å employed for all presented results.SchNet is implemented in SchNetPack [36], but used with a custom calculator.The reported runtimes exclude the calculation of neighborlists, which can be cached, but include the construction of the unfolded cell.Implementations and timing scripts are available at doi:10.5281/zenodo.7767432.

V. TRAINING AND MODEL SELECTION V.1. Data
Training and validation data was generated using molecular dynamics in the N pT ensemble, heating a tetragonal 96-atom supercell from 0K to a grid of target temperatures up to 3000K at intervals of 750K at ambient pressure, spanning the entire temperature range up to the melting point.In total, 10 000 single-point calculations were performed using FHI-aims [37], with the PBEsol [38] functional, 2 × 2 × 2 k-points, and light basis sets, with an additional basis function for O, following the approach of Ref. [22].The calculations were set up and run using FHIvibes [39].We use the first 2000 steps of each trajectory for training, and the remaining 500 steps for validation.
As test data, we obtained the calculations by Carbogno et al. [40] from the NOMAD repository [41], consisting of three trajectories of about 15 000 steps per temperature (300K to 2400K in intervals of 300K), using every tenth step of all trajectories.This data was not used during training.
For training, we use the potential energy, the forces on each atom, and the total stress tensor in a joint squared error loss function, with weights 0.001, 0.999 and 100.0 respectively, and the ADAM optimizer [42].Whenever a plateau of the loss on the validation set is encountered (with a patience of 20 epochs), the learning rate is reduced by a factor of 1/4 from a start of 10 −4 to a minimum of 10 −6 .If the loss has not improved for 200 epochs, the training is terminated.Training was performed on two Tesla Volta V100 32GB GPUs using a batch size of 100.
Hyperparameters other than cutoff radius and number of interactions (see next section) were selected based on successful termination of training and minimal error on the validation set, with a final selection based on the error on the test set.We discuss the selection of the main model parameters, the cutoff radius r c , which determines the amount of local information available, and the number of message passing steps M , which controls the range of interactions within the network, leaving choice of training data and method, as well as model architecture fixed.

V.3.1. Losses
Test set errors were evaluated for the NOMAD reference data (Fig. 3).Going from M = 1, i.e., no message passing, to M = 2 yields an approximately constant decrease in error.Additional message passing steps yield only marginal improvements.Predictive accuracy increases with cutoff radius, but saturates as the diameter of local environments exceeds the diameter of approximately 10 Å of the simulation cells used in training, approaching an all-to-all model.In this regime, all degrees of freedom are seen in the simulation cell, and message-passing cannot propagate additional information.However, M = 1 is equivalent to a non-linear pair potential, while M > 1 can model higherorder interactions, leading to lower errors.We therefore proceed with M = 2 in the following, which is sufficient to demonstrate the effect of message passing for heat flux predictions and minimises additional computational cost.

V.3.2. Vibrational Density of States
Since predictive accuracy on a fixed test set cannot fully predict model performance for practical applications, where larger regions of the potential energy surface are ex- FIG. 5. Unit cell volume, shown as rolling average over 10 ps, versus externally imposed temperature, compared to a similar MLP simulation [43] and experimental reference values [44].Vertical line indicates the estimated transition temperatures.plored, we evaluate dynamical properties as well.Figure 4 shows the vibrational density of states (VDOS) for different choices of r c , evaluated for trajectories of tetragonal zirconia at 300K, started from identical initial configurations.For r c = 5 Å and higher, the reference VDOS from fully ab initio MD is adequately reproduced.Further increasing the cutoff only leads to marginal improvements.
We therefore choose r c =5 Å and M = 2 to compute thermal conductivities.This model is used for all results presented in this work, and used across phases.

V.4. Testing and Limits
In this work, we employ a simple training strategy purely based on ab initio MD N pT simulations, as opposed to more involved schemes that rely on active learn-ing.This approach decouples the training data from the model, allowing the straightforward comparison of different model parameters, as shown in Figs. 3 and 4, but cannot be expected to yield a PES for the entire phase diagram of zirconia.In this section, we therefore probe the limits of our MLP to assess whether it is sufficient for the purposes of the present work.
As a stress test, we investigate the temperature dependence of the unit cell volume, probing the monoclinic to tetragonal phase transition of zirconia, which occurs at around 1480 K and is accompanied by a discontinuous change in volume [44].To this end, we perform a N pT simulation with Martyna-Tobias-Klein barostat [45] with τ = 5 ps and stochastic velocity rescaling thermostats [46] with τ = 5 ps for the positions and τ = 3 ps for the lattice, with a heating rate of 1 K/ps, a simulation timestep of 1 fs, and a simulation cell of 324 atoms.We compare with a similar simulation by Verdi et al. [43], which uses a heating rate of 0.5 K/ps.We note that this simple approach cannot be expected to produce a quantitative estimate of the transition temperature, which require thermodynamic integration [43].It can, however, indicate whether the transition is captured at all.
Figure 5 shows the result: Despite not being explicitly trained for this task, the model qualitatively captures the phase transition, at a similar temperature to another MLP.However, it over-estimates unit cell volume by approximately 1 %, and becomes unstable above 2000 K, likely due to insufficient training data around the tetragonal to cubic transition.During the GK workflow used in this work, we found that one out of eleven trajectories of 1 ns each becomes unstable at 1900 K; at 2000 K, no stable MD is possible with the given potential.Below 1900 K, we observe no instabilities in the potential in tetragonal and monoclinic cells, running hundreds of nanoseconds of MD at varying supercell sizes.Keeping these limitations in mind, we proceed to an investigation of the phonon band structure for the monoclinic and tetragonal phases.The results are displayed in Figs. 6 and 7, with the MLP adequately reproducing the DFT results.Supercell size convergence was checked, and we use a cell with N = 324 atoms for these figures.We can conclude that, while the model is limited in its ability to describe high-temperature behavior and lattice constants, it is adequate to model dynamics at lower temperatures, and therefore thermal conductivity for the temperature range investigated in the present work.

VI. GREEN-KUBO CONVERGENCE
This section is concerned with the determination of overall settings and parameters for the GK results presented in this work.Overall, we follow the approach presented by Knoop et al. [47].

VI.1. General Workflow
We use experimentally determined lattice parameters as presented by Verdi et al. [43], based on references [44,48], in a 12-atom primitive cell.From these primitive cells, simple n × n × n supercells are created.We proceed similarly for the tetragonal phase, with a 12-atom conventional cell in place of the 6-atom primitive cell.
The systems are then thermalised in the N V T ensemble for 200 ps at ∆t = 4 fs, and 11 decorrelated frames are extracted from the resulting trajectories, taking every 1000-th step from the end.
These frames serve as the initial time-steps of the ensemble of trajectories used for the GK method.We use a 4 fs timestep throughout, consistent with previous work.The duration of the trajectories used, as well as the timestep with which the heat flux is computed, is discussed in the following subsection.For consistency with the reference data, we report the spatial average of the thermal conductivity, κ = α κ αα /3.We use the noise reduction scheme introduced in Ref. [47], discussed further in Section VI.3, throughout.We use the heat flux for solids J int , which is, however, equivalent to the full J, as shown in Fig. 2.

VI.2. Size and Time Convergence
Formally, the Green-Kubo relation applies in a number of limits: infinite integration times, the bulk limit for system size, and the ensemble average over all realizations of the system at a given temperature and pressure [47].We investigate convergence with respect to simulation time t and supercell size N ; finite ensemble size is taken into account to compute error bars.
Figure 8 shows convergence with respect to simulation duration t and simulation cell size N at 300 K and 1400 K.At 300 K, for a given N , simulation times of at least 1 ns are required to obtain results that consistently lie within standard error of each other.Choosing t = 1 ns, we observe no systematic increase in κ above N = 1500, as all results lie within the standard error of that estimate, highlighted in Fig. 8.At 1400 K, anharmonic contributions dominate, leading to shorter phonon lifetimes, achieving convergence earlier.We therefore conclude that the parameters t = 1 ns and N = 1500 are sufficient to obtain a converged estimate for κ at all temperatures of interest.
For the comparison of heat flux formulations with high computational cost, we additionally require a reduced, 'light', choice of parameters that yields a reasonable estimate of the converged κ.We choose t = 0.5 ns and N = 768, which are used for the comparison of heat flux variations in Fig. 2 of the manuscript.

VI.3. Filter Frequency and Heat Flux Timestep
In addition to the number of atoms in the simulation cell N , and the simulation duration of the trajectories used, auxiliary parameters are present in the GK method as used in this work.Since the processes determining the thermal conductivity typically occur on longer timescales than the simulation timestep ∆t, the heat flux can be computed at some larger timestep n hf • ∆t to reduce computational cost.
Additionally, the heat flux is post-processed to reduce noise and enable the automatic determination of the cutoff for the integral in the GK relation.We use the noise reduction scheme presented in [47], consisting of two separate mechanisms: Firstly, a gauge term of the form I Ω I • ṘI (t) is subtracted from the heat flux at every step, with the 3 × 3 matrices Ω I representing an average or reference atomic virial.Since our heat flux formulation does not lend itself to the computation of per-atom virials, we use virials computed for the pristine supercell.For sufficient simulation times, we find that this approximation yields satisfactory results.Secondly, the integrated HFACF is filtered to reduce high-frequency noise, with a filter width t window = 1/ω window .
In this section, we discuss the impact of the spacing parameter n hf , and of the filter frequency ω window .We note that in principle, the impact of these settings depends on the choice of N and t.We have, however, checked that the results presented here are robust to changes in N and t.In general, higher N and t reduce noise in the HFACF and therefore allow larger ω window and higher ∆t.
Figure 9 shows the dependence of the integrated HFACF, as well as the resulting4 κ, on the choice of ω window .At 300 K, no strong dependence of κ on ω window is observed; results lie within the standard error.At 1400 K, high choices of ω window lead to misidentification of the plateau in the integrated HFACF.From 1 THz, further reducing ω window yields only minor changes.We therefore choose ω window = 1 THz.
Figure 10 shows the dependence on the spacing n hf .

Fit ∝ N x Data FIG. 1 .
FIG. 1. Computation time per timestep for different system sizes N for zirconia, evaluating a SchNet MPNN, for different heat flux formulations on a single Tesla Volta V100 32GB GPU.To estimate the asymptotic scaling, a function proportional to N x has been fitted to the results for large N .Note that on this setup with limited memory, the truly asymptotic limit cannot be reached.

6 FIG. 1 .
FIG. 1. Sketch of neural message passing with M = 2. Connections are only considered in one direction for simplicity.(1) Initialization: Each edge symbolises an interatomic distance and is associated with a color.Each node is an atom with an empty initial state.(2+3) Message-passing: Each node is updated based on the (empty) neighboring state, and its incoming edge.Afterwards, each node depends on the incoming edge.(4+5) Another message-passing iteration.Now, every state depends on next-to-nearest neighbors.(6) Readout: UI are predicted based on the final states.

FIG. 1 .
FIG. 1. Sketch of neural message passing with M = 2. Connections are only considered in one direction for simplicity.(1) Initialization: Each edge symbolises an interatomic distance and is associated with a color.Each node is an atom with an empty initial state.(2+3) Message-passing: Each node is updated based on the (empty) neighboring state, and its incoming edge.Afterwards, each node depends on the incoming edge.(4+5) Another message-passing iteration.Now, every state depends on next-to-nearest neighbors.(6) Readout: UI are predicted based on the final states.

FIG. 3 .
FIG.3.Absolute prediction errors (AE) for the forces on the test set, for models with different cutoffs rc and numbers of message passing steps M .A constant horizontal offset has been added to rc.Boxes, whiskers, bars, markers show interquartile range, total range, median, and mean, respectively.The relative AE is scaled by the standard deviation of forces over the test set.

FIG. 4 .
FIG. 4. Comparison of vibrational density of states for MPNNs (M = 2) with different cutoff radii compared to a baseline computed with FHI-aims.Vertical lines indicate peaks in the FHI-aims result.Constant vertical offsets are applied to distinguish curves.Results are averaged over three trajectories of 60ps (∆t = 4 fs), at 300 K, with matching initial configurations.Shaded areas indicate the minimum and maximum.

FIG. 8 .
FIG.8.Convergence of thermal conductivity with respect to simulation cell size N and simulation duration t at 300 K in the monoclinic phase (top) and 1400 K in the tetragonal phase (bottom).For each N , different t from 0.1 ns to 2 ns in steps of 0.1 ns are shown with a constant horizontal offset.Error bars indicate the standard error.The x-axis is scaled to indicate the effective length scale of the simulation cell, which is proportional to N 1/3 .The chosen 'production' settings, N =1500, t=1 ns, as well as the 'light' settings, N =768, t=0.5 ns are indicated.For the 'production' settings, the standard error is shown as a shaded area.

FIG. 9 .
FIG. 9. Comparison of the integral of the heat flux autocorrelation function for different choices of ω window , at 300 K in the monoclinic phase (top) and 1400 K in the tetragonal phase (bottom).Horizontal lines indicate chosen value for κ. 'Production' settings (N =1500, t=1 ns) are used.Shaded areas indicate standard error.Lower values of ω window correspond to more severe filtering.

FIG. 10 .
FIG. 10.Comparison of the of the heat flux autocorrelation function for different choices of n hf , at 300 K in the monoclinic phase (top) and 1400 K in the tetragonal phase (bottom).The upper edges of the labels indicating n hf are aligned with horizontal lines indicating the value for κ chosen by the first dip of the corresponding heat flux autocorrelation function.'Production' settings (N =1500, t=1 ns) are used.Shaded areas indicate standard error.