Learning physically consistent differential equation models from data using group sparsity

Suryanarayana Maddu,1,2,3,4 Bevan L. Cheeseman,1,2,3,* Christian L. Müller,5,6,7 and Ivo F. Sbalzarini 1,2,3,4,8,† 1Technische Universität Dresden, Faculty of Computer Science, 01069 Dresden, Germany 2Max Planck Institute of Molecular Cell Biology and Genetics, 01307 Dresden, Germany 3Center for Systems Biology Dresden, 01307 Dresden, Germany 4Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Germany 5Center for Computational Mathematics, Flatiron Institute, New York, New York 10010, USA 6Department of Statistics, LMU München, 80539 Munich, Germany 7Institute of Computational Biology, Helmholtz Zentrum München, 85764 Neuherberg, Germany 8Cluster of Excellence Physics of Life, TU Dresden, 01307 Dresden, Germany

While purely data-driven models, like reservoir computing, can be successful in predicting future behavior [5], such "black-box" models are often difficult for domain scientists to interpret. This raises the question of how interpretable mathematical models, such as ordinary differential equations (ODEs) or partial differential equations (PDEs), can be learned directly from data.
Given the feasibility of data-driven modeling and the historic success of first-principles modeling, it seems natural to try to combine the two. This requires methods to incorporate or enforce first-principles constraints, like conservation laws and symmetries, into the data-driven inference problem. First attempts in this direction used block-diagonal dictionaries with group sparsity to avoid model discrepancy [29,[34][35][36] and to infer PDEs with varying coefficients [27]. However, there are many more priors one may want to exploit when modeling complex systems, including information about symmetries in interactions, knowledge of conservation laws, dimensional similarities, or awareness of spatially and temporally varying latent variables. Such prior knowledge can come from first principles or from model assumptions or hypotheses. To date, there is no statistical inference framework available that would allow flexible inclusion of different types of priors as hard constraints in data-driven inference of differential equation models.
Here we present a statistical learning framework based on group sparsity to enforce a wide range of physical or modeling priors in the regression problem for robust inference of ODE and PDE models from modest amounts of noisy data. We present three representative examples from systems biology to demonstrate how information about conservation laws, latent variables, and symmetries can be encoded into grouped features of a sparse-regression formulation. Specifically, we present numerical experiments using a mass-conserving ODE model of Janus kinase-signal transducer and activator of transcription (JAK-STAT) signaling in cells, a mechanical transport model for membrane proteins, and λ-ω reaction-diffusion systems, respectively. We approximately solve the resulting nonconvex optimization problems using the group iterative hard thresholding (GIHT) algorithm presented here, in combination with stability selection for statistically consistent model identification [19]. We show that stability selection in combination with GIHT enables robust model inference from limited noisy data.

II. PROBLEM FORMULATION
We aim to learn the functional form of a governing ODE or PDE from data about the corresponding dynamics. We consider the following canonical form, where the left-hand side is a first derivative in time and the right-hand side is a nonlinear function N of the state, its derivatives, space x, and time t:

∂u/∂t = N(u, u_x, u_xx, …, x, t; θ(x, t)),   (1)

where u = (u_i) is the state variable of interest (e.g., velocity, concentration, or pressure) and θ(x, t) is the set of parameters of the equation, such as diffusion constants or viscosity. The dependence of θ on (x, t) allows for equations with varying coefficients in both space and time. Without loss of generality, N(·) can be written as a linear combination of potentially nonlinear terms. Common models like Navier-Stokes, advection, active mechanochemistry, and reaction-diffusion models are represented by this canonical form. Models requiring a different left-hand side (e.g., wave equations) can be expressed using suitably adjusted canonical forms.
The goal of equation inference [15,16] is to find a specific instance of this canonical differential equation from given data. Data are given as measured or simulated valuesû(x i , t j ) at discrete locations x i and time points t j . These data points may contain noise, e.g., from measurement uncertainties or numerical errors. The question then is which right-hand side N makes Eq. (1) describe the dynamics from which the data are sampled, without describing the noise, in a way that is statistically consistent and stable under data perturbation or different realizations of the noise.
We follow the standard approach to equation inference, constructing an overcomplete dictionary of possible right-hand-side terms and approximating their values from the data using discrete approximations of the derivatives [15,16] (e.g., finite differences, polynomial differentiation [16,27], or automatic differentiation [32]). For example, for a model with a single scalar state variable u ∈ R, a dictionary of p ∈ N potential terms numerically evaluated over n ∈ N data points is a matrix Θ ∈ R^(n×p). The canonical form of Eq. (1) then becomes

U_t = Θ ξ,  Θ = [1  u  u²  u³  ⋯  u_x  u_xx  u u_x  ⋯],   (2)

where U_t ∈ R^n collects the time derivatives of u at all data points and subscripts denote derivatives with respect to the subscripted variable. By default, we include in Θ all differential operators and polynomial nonlinearities up to and including order and degree 3. Each column of Θ contains the discrete approximations of one such term at all n data points. The vector ξ contains the unknown coefficients [ξ₀ ξ₁ ξ₂ ξ₃ ⋯ ξ_p] of the model. For systems of differential equations, the dictionary Θ ∈ R^(N×P) becomes block diagonal with p_b blocks Θ_b ∈ R^(n×p) [see, for example, Fig. 2(a)]. In this case, we distinguish the number p of potential terms in each block and the number P = p_b p of columns in the overall dictionary. Likewise, N = p_b n.
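As a minimal illustration of such a dictionary, the following sketch assembles Θ for a single scalar field sampled on a regular space-time grid. The function name, the use of plain central finite differences, and the particular column set are our own illustrative assumptions; the paper itself uses the more noise-robust polynomial differentiation.

```python
import numpy as np

def build_dictionary(u, dx, dt):
    """Sketch of a dictionary matrix Theta for one scalar field u(x, t),
    sampled on a regular grid (rows: space, columns: time). Derivatives are
    approximated with central finite differences on interior points only,
    so every stencil is defined."""
    u_t = (u[1:-1, 2:] - u[1:-1, :-2]) / (2 * dt)          # time derivative U_t
    u_x = (u[2:, 1:-1] - u[:-2, 1:-1]) / (2 * dx)          # first space derivative
    u_xx = (u[2:, 1:-1] - 2 * u[1:-1, 1:-1] + u[:-2, 1:-1]) / dx**2
    uc = u[1:-1, 1:-1]                                      # field on interior points

    # columns: 1, u, u^2, u^3, u_x, u_xx, u*u_x (flattened over all data points)
    cols = [np.ones_like(uc), uc, uc**2, uc**3, u_x, u_xx, uc * u_x]
    Theta = np.stack([c.ravel() for c in cols], axis=1)     # n x p dictionary
    return u_t.ravel(), Theta
```

The returned pair corresponds to U_t and Θ in Eq. (2); a larger dictionary simply appends more columns.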
In either case, the problem is to find a statistically consistent ξ* such that the model in Eq. (2) fits the data while being sparse, i.e., ‖ξ*‖₀ ≪ p. This trade-off between model simplicity and data fitting can be formulated as a regularized optimization problem

ξ̂^λ = argmin_ξ  h(ξ) + r(ξ),   (3)

where ξ̂^λ is the global minimizer, h(·) a smooth convex data-fitting metric (e.g., least-squares or Huber loss), and r(·) a regularization or penalty function with regularization constant λ ∈ R₊ that controls the trade-off between model simplicity and fitting accuracy. The superscript λ on the estimated coefficient vector ξ̂ indicates the dependence of the result on the regularization parameter. The data-fitting metric measures the distance (in some norm) between the model output for a given ξ and the data. The regularization function measures model complexity.
Here we use the following standard choices for the data-fitting and regularization functions:

ξ̂^λ = argmin_ξ  ½ ‖U_t − Θξ‖₂² + λ ‖ξ‖₀.   (4)

By choosing r(ξ) = λ‖ξ‖₀, we directly penalize the number of terms on the right-hand side of the model, hence favoring simpler models (Occam's razor) that are easier to interpret. Such sparsity-promoting regularization has been successful in applications of compressive sensing and signal processing. For the data-fitting function, we choose the standard least-squares metric h(ξ) = ½‖U_t − Θξ‖₂², leading to models that fit the data in the least-squares sense.

III. SOLUTION METHOD
Classic algorithms that efficiently compute locally optimal solutions to Eq. (4) include greedy optimization strategies [37], compressed sampling matching pursuit [38], subspace pursuit [39], and iterative hard thresholding (IHT) [40]. To avoid the nonconvexity of the objective function, a popular approach is to consider the convex relaxation of the problem in Eq. (4), replacing the ‖·‖₀ term with r(ξ) = λ‖ξ‖₁ [41]. However, while this formulation benefits from the availability of fast convex optimization algorithms, it does not provide good approximations when right-hand-side terms in Eq. (1) are correlated [42] and leads to biased estimates of model coefficients [43], thus reducing model selection performance [19]. Therefore, we directly consider the original nonconvex problem in Eq. (4) and provide an algorithm to approximately solve it while accounting for modeling priors and guaranteeing statistically stable and consistent models.

A. Group-sparse regression
In addition to statistical stability, we require the learned models to be consistent with prior knowledge about the physics of the process that generated the data. Examples of such priors one may want to impose are conservation laws, symmetries, and knowledge about latent variables. While the sparsity constraint is imposed as a soft constraint, these physical priors will be imposed as hard constraints on the model. They amount to restrictions on the structure of the coefficient vector ξ. We show here how the concept of group sparsity [42,44] can be used to impose modeling priors in a sparse-regression framework.
The concept of group sparsity assumes that prior knowledge about the underlying system can be encoded by partitioning model terms into m groups. During the inference process, group sparsity then imposes that coefficients within the same group can only enter or leave the statistical model jointly. We additionally leverage the block-diagonal structure of the dictionary matrix to allow for spatially or temporally varying coefficients and for joint sparse regression of multiple state variables.
Formally, given a partitioning of the coefficients ξ_k, k = 1, 2, …, P, into m groups g_j, j = 1, 2, …, m, we thus consider the optimization problem

ξ̂^λ = argmin_ξ  ½ ‖U_t − Θξ‖₂² + λ Σ_{j=1}^{m} 1(‖ξ_{g_j}‖₂ ≠ 0),   (5)

where Θ_{g_j} ∈ R^(N×p_{g_j}) is the submatrix of Θ ∈ R^(N×P) formed by all columns corresponding to the coefficients in group g_j ⊆ {1, …, P}, and ξ_{g_j} = {ξ_i : i ∈ g_j} is the coefficient vector ξ restricted to the index set g_j of size p_{g_j}, i.e., |g_j| = p_{g_j}. The indicator function 1(·) over the ‖·‖₂ norm encourages sparsity on the group level [42]. For groups comprising only a single element, this penalty reduces to the ‖·‖₀ norm. Here we restrict ourselves to nonoverlapping groups with g_i ∩ g_j = ∅ for all i ≠ j and Σ_{j=1}^{m} p_{g_j} = P. Extensions to overlapping groups are possible [45] and discussed in Sec. V. We solve the nonconvex problem in Eq. (5) using the GIHT algorithm, which generalizes the standard IHT algorithm as detailed in Appendix C.
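The GIHT iteration alternates a gradient step on the least-squares loss with the exact group-wise hard-thresholding proximal step of the group-ℓ0 penalty (a group survives if its ℓ2 norm after the gradient step exceeds √(2ηλ) for step size η). The sketch below is our simplified illustration of this algorithm class; it omits, e.g., the debiasing step of the full algorithm detailed in Appendix C.

```python
import numpy as np

def giht(Theta, ut, groups, lam, step=None, n_iter=500):
    """Minimal sketch of group iterative hard thresholding (GIHT) for
    0.5*||ut - Theta @ xi||_2^2 + lam * (number of active groups).
    `groups` is a list of index arrays partitioning the columns of Theta."""
    n, p = Theta.shape
    if step is None:
        # safe step size: inverse Lipschitz constant of the gradient
        step = 1.0 / np.linalg.norm(Theta, 2) ** 2
    xi = np.zeros(p)
    thresh = np.sqrt(2.0 * step * lam)    # exact prox threshold for group-l0
    for _ in range(n_iter):
        z = xi - step * Theta.T @ (Theta @ xi - ut)   # gradient step
        xi = np.zeros(p)
        for g in groups:
            if np.linalg.norm(z[g]) > thresh:         # keep group or zero it
                xi[g] = z[g]
    return xi
```

On a well-conditioned noiseless toy problem, this iteration recovers the active groups and their coefficients.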

B. Stability selection
Robust tuning of the regularization parameter λ is of fundamental importance for successful model discovery. Wrong choices of λ result in incorrect equation models being identified, even if correct model discovery would have been possible in principle given the data [25,46]. Common methods for tuning λ include the Akaike information criterion (AIC) [47], the (modified) Bayesian information criterion (BIC) [48], and cross validation. While AIC- or BIC-based model selection is useful for combinatorial best-subset selection in low dimensions, these criteria typically deteriorate in high dimensions since they rely on asymptotic arguments. Cross validation tends to include many false-positive coefficients in the data-limited regime [49].
In order to provide a robust model selection method for the data-limited high-dimensional case, we leverage the statistical principle of stability selection, which tunes λ so as to guarantee model stability under perturbation by random subsampling of the data [50]. We perform stability selection by generating B random subsamples I*_b, b = 1, …, B, of the data and using the GIHT algorithm to find the set Ŝ^λ[I*_b] ⊆ {1, …, P} of selected coefficients (or groups) for every data subsample I*_b and for different values of λ ranging over a regularization path Λ = [λ_max, λ_min] with λ_min = ε λ_max and λ_max = max_{j∈{1,…,m}} ‖Θ_{g_j}ᵀ U_t‖₂ / √(p_{g_j}), as computed for the group least absolute shrinkage and selection operator (LASSO) [27]. Typical values of ε range from 0.1 to 0.01, with the path discretized evenly on a logarithmic scale. The probability that group j overlaps with the coefficients selected for a given λ is approximated by [50]

Π̂^λ_{g_j} = (1/B) Σ_{b=1}^{B} 1(g_j ⊆ Ŝ^λ[I*_b]).   (6)

This is the importance measure [50] for group j. Plotting this importance measure as a function of λ ∈ Λ provides an interpretable way to assess the robustness of the estimation across levels of regularization in a so-called stability plot [50].
To select a final model, stability selection chooses the set of stable coefficients (or groups) Ŝ_stable = {j : Π̂^{λ_s}_{g_j} > π_th}. This means that we search for the components (groups) in the dictionary that consistently appear with probability greater than π_th when repeatedly solving the sparse-regression problem in Eq. (5) for different random subsets of the data. The threshold probability π_th controls the type I error of false positives according to [51]

E_fp = (1 / (2π_th − 1)) · q² / (P/p_g),   (7)

where E_fp is the upper bound on the expected number of false positives and q is the average number of selected variables (i.e., nonzero components of ξ) along the regularization path [50]. The group size p_g = p_{g_j}, j = 1, …, m, if all groups are of equal size; otherwise we set p_g = 1 [51]. For a fixed value of π_th, we use this relation to find a λ_s for which a given bound on the expected number of false positives, E_fp, is achieved. Equation (7) therefore provides an elegant way of determining the regularization constant based on the importance measure and the intuitively defined parameters E_fp and π_th. Throughout this work, we set π_th = 0.8 and E_fp = 1.
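The subsampling loop behind Eq. (6) can be sketched as follows. The function names and the plug-in solver interface `fit` (any group-sparse solver returning a coefficient vector, e.g., a GIHT-type routine) are our own illustrative choices, not the authors' implementation.

```python
import numpy as np

def stability_selection(Theta, ut, groups, lambdas, fit, B=50, frac=0.5, seed=0):
    """Sketch of stability selection: for B random subsamples of the rows and
    each lambda on the path, record which groups the solver selects, then
    return each group's maximum selection frequency over the path (the
    importance measure to be thresholded at pi_th)."""
    rng = np.random.default_rng(seed)
    n = Theta.shape[0]
    freq = np.zeros((len(lambdas), len(groups)))
    for _ in range(B):
        idx = rng.choice(n, size=int(frac * n), replace=False)  # random subsample
        for li, lam in enumerate(lambdas):
            xi = fit(Theta[idx], ut[idx], groups, lam)
            for gi, g in enumerate(groups):
                freq[li, gi] += np.linalg.norm(xi[g]) > 0       # group selected?
    freq /= B
    return freq.max(axis=0)   # importance per group

# stable model: groups whose importance exceeds pi_th, e.g., pi_th = 0.8
```

Any solver with the signature `fit(Theta, ut, groups, lam)` can be plugged in, which keeps the selection procedure independent of the regression algorithm.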
For the examples shown in this paper, we empirically find that regularization paths with ε = 0.1 are sufficient to find a solution to Eq. (7) for these parameters. Alternatively, one can determine π_th and λ_s by visual inspection of a stability plot, which usually shows a clear separation between two clusters of coefficients of different stability. Stability selection not only removes the necessity to manually tune λ, but also ensures robustness against data sampling and noise in the data. All of these properties are required for statistical consistency in the sense that the inferred models are guaranteed to become accurate with high probability for increasing data size [10].

IV. APPLICATIONS
We present three different modeling examples from systems biology that illustrate the utility of priors in data-driven modeling. Each example highlights a different type of prior knowledge to be enforced. In order to benchmark the accuracy and robustness of model inference, we need to know the ground-truth model. We therefore generate synthetic data by numerically solving known models and test how well we can recover those models purely from the data. To emulate noisy measurements from real-world experiments, we corrupt the simulated data u(x, t) with additive Gaussian noise, û = u + σ η(0, θ), where η is a vector of elementwise independent and identically distributed Gaussian random numbers with mean zero and the empirical variance θ = Var{u₁, …, u_N} of the simulated data. The constant σ defines the noise level. In line with previous works, we use polynomial differentiation [16,27] to approximate the spatial and temporal derivatives in the dictionary from the noisy input data û.
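For concreteness, the noise-corruption step can be sketched as follows (the function name is ours):

```python
import numpy as np

def corrupt(u, sigma, rng=None):
    """Corrupt clean data as u_hat = u + sigma * eta, where eta is i.i.d.
    Gaussian with mean zero and the empirical variance of the data, so that
    `sigma` is the relative noise level (0.1 for "10% noise")."""
    rng = np.random.default_rng(rng)
    theta = np.var(u)                        # empirical variance of the signal
    return u + sigma * rng.normal(0.0, np.sqrt(theta), size=u.shape)
```

Scaling the noise by the empirical signal variance makes noise levels comparable across state variables with different magnitudes.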

A. Enforcing mass conservation in the JAK-STAT reaction pathway for signal transduction
Signal transduction pathways are the engines of chemical information processing in living biological cells. Using methods from biochemistry and systems biology, the constituent molecules of many signaling pathways have been identified. However, identifying the topology of these chemical reaction networks remains challenging. It typically involves building mathematical models of hypothetical reaction networks and comparing their predictions with the data. A popular choice is to use ordinary differential equation models of the stoichiometry and chemical kinetics of the pathway. However, when discrepancies occur between the ODE model and the experimental data, it is difficult to decide whether the model structure is incorrect or whether the parameters of the model have been badly chosen [52]. Here data-driven modeling can help identify the stable structure of minimal ODE models that can explain the measurement data.
In this example, we consider the JAK-STAT pathway, which communicates chemical signals from outside a biological cell to the cell nucleus. It is implicated in a variety of biological processes from immunity to cell division, cell death, and tumor formation. Mathematical models based on biochemical knowledge of the JAK-STAT pathway have identified nucleocytoplasmic cycling as an essential component of the JAK-STAT mechanism, which has been experimentally verified [52,53]. We therefore consider the simplest ODE model with irreversible reactions that accounts for nucleocytoplasmic cycling in order to model information transfer from the cell membrane to the nucleus, as previously described [52]:

ẋ₁ = −k₁⁻ x₁ c(t) + 2 k₄⁺ x₄,
ẋ₂ = k₁⁺ x₁ c(t) − 2 k₂⁻ x₂²,
ẋ₃ = k₂⁺ x₂² − k₃⁻ x₃,
ẋ₄ = k₃⁺ x₃ − k₄⁻ x₄.   (8)

A schematic of the JAK-STAT pathway is shown in Fig. 1, illustrating the reaction cascade from outside the cell membrane to inside the cell nucleus. The functions x₁(t), x₂(t), x₃(t), and x₄(t) in the above ODE model are the time courses of the concentrations of monomeric STAT-5, phosphorylated STAT-5, cytoplasmic dimeric STAT-5, and STAT-5 in the nucleus, respectively; c(t) is the concentration of the activated EpoR receptor. The scalar constants k₁^±, k₂^±, k₃^±, and k₄^± are the kinetic reaction rates of phosphorylation, dimerization, nuclear transport, and nuclear export, respectively. Although each rate constant appears in two equations, we distinguish the different occurrences of the same rate constant by sign superscripts in order to make clear that they are independently learned from data by our regression algorithm.
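The following sketch integrates such a four-species irreversible nucleocytoplasmic-cycling model with a classical RK4 scheme and exposes the conserved total STAT-5 mass, x₁ + x₂ + 2x₃ + 2x₄ (dimers carry two STAT-5 molecules). The EpoR input c(t) and the rate values below are illustrative placeholders, not the fitted values from [52,53].

```python
import numpy as np

def jakstat_rhs(x, t, k1, k2, k3, k4, c):
    """Right-hand side of a four-species JAK-STAT model with irreversible
    reactions and EpoR input c(t); rates here are illustrative only."""
    x1, x2, x3, x4 = x
    return np.array([-k1 * x1 * c(t) + 2 * k4 * x4,
                      k1 * x1 * c(t) - 2 * k2 * x2**2,
                      k2 * x2**2     - k3 * x3,
                      k3 * x3        - k4 * x4])

def rk4(f, x0, ts, *args):
    """Classical fourth-order Runge-Kutta integration on the time grid ts."""
    xs = [np.asarray(x0, float)]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h, x = t1 - t0, xs[-1]
        k1 = f(x, t0, *args)
        k2 = f(x + 0.5 * h * k1, t0 + 0.5 * h, *args)
        k3 = f(x + 0.5 * h * k2, t0 + 0.5 * h, *args)
        k4 = f(x + h * k3, t1, *args)
        xs.append(x + h / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(xs)

ts = np.linspace(0.0, 20.0, 2001)
c = lambda t: np.exp(-0.5 * t)                 # hypothetical EpoR input
X = rk4(jakstat_rhs, [1.0, 0.0, 0.0, 0.0], ts, 0.5, 0.3, 0.2, 0.1, c)

# mass conservation: x1 + x2 + 2*x3 + 2*x4 stays constant over time
total = X[:, 0] + X[:, 1] + 2 * X[:, 2] + 2 * X[:, 3]
```

Since the conserved quantity is linear in the state, any Runge-Kutta scheme preserves it exactly up to roundoff, which makes it a convenient sanity check on simulated benchmark data.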
For sparse-regression model learning, a dictionary matrix Θ_b of all possible interactions between the molecules is generated [see Eq. (2)]. The left-hand side U_t is the time derivative of each concentration, i.e., ẋ₁, ẋ₂, ẋ₃, and ẋ₄, as approximated from the data. For this example, Θ_b contains p = 19 polynomial nonlinearities, leading to the block-diagonal overall dictionary structure with p_b = 4 [shown in Fig. 2(a)]. For model inference, we use the simulated concentration time courses shown in Fig. 2(b). They are obtained by numerically solving the model (8) with rate constants fitted to experimental data [52,53], e.g., k₄⁻ = k₄⁺ = 0.10658 (see Fig. 1 inset). The simulated data are corrupted with 10% additive Gaussian noise before inference. The noisy time-series data for the concentration of the activated EpoR receptor c(t) are taken directly from experimental measurements [53], both when generating the simulation data and for model inference. All units are relative to the experimental data.

[Fig. 1 caption: Schematic of the JAK-STAT pathway. Upon receptor recruitment, monomeric STAT-5 (concentration x₁) is tyrosine phosphorylated (x₂), dimerizes (x₃), and translocates to the nucleus (x₄), where it binds to the promoters of target genes, is dephosphorylated, and is exported again to the cytoplasm [52]. The inset shows an experimentally measured time course of EpoR activation (data from [53]).]
From the simulated data for x_i(t), i = 1, 2, 3, 4, and the experimentally measured c(t), we aim to infer back the model equations. The JAK-STAT pathway conserves mass, as evident from the ODE model (8). This can be used as a prior when inferring a model from data. We therefore perform group-sparse regression (see Sec. III A) using the groups

g₁ = {k₁⁻, k₁⁺},  g₂ = {k₂⁻, k₂⁺},  g₃ = {k₃⁻, k₃⁺},  g₄ = {k₄⁻, k₄⁺},   (9)

which tie the two occurrences of each rate constant together. This is graphically represented by the vertical lines in Fig. 2(a), with each group corresponding to one type of biochemical process in the model, as given in the legend (g₁, phosphorylation; g₂, dimerization; g₃, nuclear transport; g₄, nuclear export). We solve the resulting group-sparse-regression problem using the present GIHT algorithm. This leads to a mass-conserving model structure, but the fitted values of the rate constants may differ between the two signs, i.e., it can be that k₁⁻ ≠ k₁⁺, etc. Enforcing symmetry also in the coefficient values, and not only in the model structure, would require solving a constrained group-sparse-regression problem, which we do not consider here.
The results are shown in Fig. 3(a). In this benchmark setting, group sparsity helps identify the correct model terms (red curves) out of all terms in the dictionary. There exists a range of λ values (shaded gray) where stability selection with threshold π th = 0.8 (green dashed line) can identify the correct model, even at the 10% noise level considered here.
Without coefficient grouping, i.e., without imposing the mass-conservation prior, there is no value of λ for which the correct model is recovered, as shown in Fig. 3(b). To show consistency of the group-sparsity method, we also provide achievability plots in Figs. 3(c) and 3(d). They show that enforcing the mass conservation prior leads to consistent model selection over a wide range of data sample sizes N.
Using group sparsity in combination with stability selection, the correct model can be identified in 100% of cases (over 20 independent repetitions) when more than 200 data points per component are used (i.e., success probability 1), regardless of the noise level in the data (color, see the legend), as shown in Fig. 3(c). Sparse regression without priors suffers from inconsistency, at all noise levels and for all data sizes [ Fig. 3(d)]. The learned coefficients at different noise levels are shown in Fig. 8 in Appendix A.

B. Enforcing model equivalence in advection diffusion with spatially varying velocity
The development of organisms from their zygotic state involves a myriad of biochemical interactions coupled with the mechanical forces that shape the resulting tissue. In past decades, the role of mechanics, including forces and flows, has increasingly been investigated in developmental biology and morphogenesis. On the cell and tissue scale, many developmental processes involve both patterning and flows. Examples include polarity establishment [54], tissue folding [55], and cell sorting [56,57]. The spatiotemporal concentration fields of labeled proteins can be recorded in all of these processes using fluorescence microscopy [54,58]. This has led to quantitative measurements and predictive models of active mechanochemical self-organization in, e.g., cytoplasmic flow [59], endocytosis [60], and tissue patterning [61].
In this example, we consider the simplest case of transport by advection and diffusion of signaling molecules. In order to allow for latent processes, we consider spatially varying model coefficients. We thus construct groups that allow the advection velocity to be a function of space. In addition, we impose a prior that enforces model equivalence, i.e., learning structurally equivalent models for the different chemical species, albeit with different diffusion constants. For the concentration fields u(x, t) and v(x, t) of two chemicals, this reads

∂u/∂t = D_u ∂²u/∂x² − ∂(c(x) u)/∂x,
∂v/∂t = D_v ∂²v/∂x² − ∂(c(x) v)/∂x.   (10)

Here D_u = 0.25 and D_v = 0.50 are the ground-truth diffusion constants and the function c(x) = −3/2 + cos(2πx/L) is the spatially varying advection velocity field in the domain of length L = 2π. With added chemical reactions, this form of model has previously been successfully used to explain early patterning in the single-cell C. elegans zygote [54,58].
We use data from numerical simulations of the above model equations with 15% additive Gaussian noise (see Fig. 4) to show that both priors, model equivalence and spatial variability, are necessary to recover the ground-truth equations, including the spatially varying velocity field, from the data. We construct two block-diagonal dictionaries, for u and v, where each block represents the dictionary constructed at one spatial location. We use p_b = 10 blocks, corresponding to five randomly selected spatial data points for each of u and v. Each diagonal block Θ_b ∈ R^(n×p) uses n = 75 randomly chosen time points and p = 15 potential operators.
We use grouping to enforce that the structure of the model learned from the data must be the same for all spatial locations and that the models learned for u and v must be equivalent. Each group therefore ties a column in a block dictionary to all corresponding columns in the other blocks. This construction results in the following groupings to encode spatial variability:

g_l = {(b − 1)p + l : b = 1, …, p_b},  l = 1, …, p,   (11)

where g_l is group l and p is the number of columns in each dictionary block. The groups g_l^u and g_l^v, independently constructed for species u and v using Eq. (11), are further grouped to enforce model equivalence between the species via g_l = g_l^u ∪ g_l^v. The resulting stability and achievability plots are shown in Fig. 5 when using the noisy data from Fig. 4 for inference. Comparing Figs. 5(a) and 5(b), we see that the prior for model equivalence is necessary to recover the true model. The algorithm is unable to identify the diffusion process of species u when only using the grouping for the spatially varying coefficient [Fig. 5(b)]. Inference without any priors fails to recover the true model even for noise-free data [Fig. 5(c)]. The achievability plot in Fig. 5(d) demonstrates the consistency of our model selection algorithm with grouping over 20 independent realizations of the noise process and of the random subsampling of the data. We observe consistent model recovery with high success probability even at high noise levels, albeit with decreasing fidelity [Fig. 5(d)]. In contrast, previous studies on advection-diffusion model recovery with unknown velocity field were limited to 1% noise (σ = 0.01) [27].
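The index bookkeeping for these two groupings can be sketched as follows. The function names are ours, and the column ordering (all u blocks before all v blocks) is an assumption for illustration.

```python
import numpy as np

def spatial_groups(p, n_blocks):
    """Groups encoding spatial variability, cf. Eq. (11): column l of every
    diagonal block joins the same group, so a term is either active at all
    spatial locations (with location-dependent coefficients) or at none."""
    return [[b * p + l for b in range(n_blocks)] for l in range(p)]

def equivalence_groups(p, n_blocks_u, n_blocks_v):
    """Merge the per-species groups g_l^u and g_l^v into g_l = g_l^u U g_l^v,
    enforcing structurally identical models for u and v."""
    gu = spatial_groups(p, n_blocks_u)
    offset = p * n_blocks_u            # v columns start after all u columns
    gv = [[offset + j for j in g] for g in spatial_groups(p, n_blocks_v)]
    return [u_g + v_g for u_g, v_g in zip(gu, gv)]
```

For the setup in the text, `equivalence_groups(15, 5, 5)` yields p = 15 groups that partition all P = 150 dictionary columns without overlap.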
The estimated latent velocity fields and their gradients are shown in Fig. 9 and compared with ground truth for different noise levels. In Appendix B we show how these estimates can be further improved by postprocessing with additional smoothness priors.

C. Enforcing symmetry in reaction-diffusion kinetics
Reaction-diffusion models are widely used in systems biology to describe the dynamics of chemical reaction networks in a continuous space. Their popularity goes back to a seminal paper by Turing [62], proposing that reaction-diffusion mechanisms could be responsible for pattern formation in developing tissues. Since then, reaction-diffusion equations have been successful in modeling nonequilibrium pattern formation [63], dynamics of ecological [64] and biological systems [65], cell polarity [54,63], phase transitions [66], and chemical waves [67].
In this example, we consider the λ-ω reaction-diffusion system as a prototypical model of chemical waves [68], showing how it can be inferred from data when including symmetry priors. The model equations for the scalar concentration fields u(x, y, t) and v(x, y, t) of two chemical species in two dimensions are

∂u/∂t = D_u ∇²u + λ(r) u − ω(r) v,
∂v/∂t = D_v ∇²v + ω(r) u + λ(r) v.   (12)

Here we choose r = √(u² + v²), ω(r) = −r², λ(r) = 1 − r², and D_u = D_v = 0.1. This system is symmetric in the two species, i.e., swapping u ↔ ±v leaves the structure of the model unchanged. Such symmetries are common in biology and can be found in predator-prey models [69], models of fish scale patterning [70], and models of antagonistic protein interactions [54].
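For these particular choices of λ(r) and ω(r), the symmetry is even stronger than structural: the reaction kinetics are exactly equivariant under the rotation (u, v) → (−v, u), which can be checked numerically. The function below is our illustration of the reaction part only; diffusion is omitted since it is trivially symmetric.

```python
import numpy as np

def lam_omega_kinetics(u, v):
    """Reaction part of the lambda-omega model with lambda(r) = 1 - r^2 and
    omega(r) = -r^2; returns (du/dt, dv/dt) without the diffusion terms."""
    r2 = u**2 + v**2
    lam, omg = 1.0 - r2, -r2
    return lam * u - omg * v, omg * u + lam * v

# equivariance: applying the kinetics to the rotated state (-v, u) gives the
# rotated kinetics (-g, f), since r is invariant under the rotation
```

This kind of pointwise check is a cheap way to verify that a candidate model indeed possesses the symmetry one intends to impose as a prior.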
If known beforehand, such symmetries can be used as priors. Here we impose the symmetry prior by grouping each column of the dictionary of one species with the corresponding column for the other species, where "corresponding" means pertaining to the same operator upon the swap, i.e.,

g_l = {θ_l^u, θ_{σ(l)}^v},  l = 1, …, p,   (13)

where σ is the permutation of dictionary columns induced by the swap u ↔ ±v. The dictionary blocks Θ_b for each of the p_b = 2 species contain all polynomial nonlinearities up to degree 3 and all spatial derivatives up to second order, resulting in p = 18.

[Fig. 7 caption: Inferring reaction-diffusion models from noisy spatiotemporal data. Stability plots (a) with and (b) without symmetry priors for noise level σ = 0.1. Achievability plots (c) with and (d) without symmetry priors for different noise levels in the data (legends). We show the stability plots for λ_min = 0.01λ_max to illustrate the difference between (a) and (b), but still run our algorithm with λ_min = 0.1λ_max. Each point is averaged over 20 independent trials. The colored bands correspond to the Bernoulli standard deviation.]
We use data obtained by numerically simulating the above model with 10% pointwise Gaussian noise added to the data (see Fig. 6). The stability and achievability plots when using these data are shown in Fig. 7. Comparing Figs. 7(a) and 7(b), we observe that model inference without the symmetry prior fails, whereas it works robustly when the prior is included via group sparsity. This fact is substantiated by the achievability plots in Figs. 7(c) and 7(d) for model inference with and without the prior, respectively, for different noise levels σ in the data. Our group-sparse-regression formulation provides remarkable consistency for model recovery over a wide range of λ values even at high noise levels of 10%.

D. Computational cost
Given the block-diagonal dictionary matrix Θ ∈ R^(N×P) and the vector U_t ∈ R^(N×1) from data, the computational complexity of Algorithm 1 is O(NP) per GIHT iteration without the debiasing step, the same complexity as a matrix-vector multiplication. Our algorithm also includes a debiasing step, which likewise has complexity O(NP); however, debiasing has been reported to lead to faster convergence, amortizing its additional computational cost [71]. As an example, the advection-diffusion problem considered here with n = 75, p = 15, and p_b = 10 required less than 1.5 s of runtime to compute a regularization path with 15 different values of λ for one data subsample when implemented in PYTHON on a single 2.3-GHz x86_64 processor core. The total time for stability selection over 100 data subsamples was under 150 s. This can be further accelerated using multithreaded programming, since the different data subsamples can be evaluated independently in parallel.

V. CONCLUSION AND DISCUSSION
We have introduced a flexible and robust inference framework to learn physically consistent differential-equation models from modest amounts of noisy data. We used the concept of group sparsity to provide a flexible way of including modeling priors as hard constraints that render inference more robust. We combined this with the concept of stability selection for principled deduction of regularization parameters in cases where the true model is not known. To approximately solve the resulting nonconvex regression problem, we introduced the group iterative hard thresholding algorithm detailed in Appendix C.
We have benchmarked and demonstrated the use of this algorithm in examples of common mathematical models in biological physics. The examples covered ordinary differential equations and partial differential equations in one- and two-dimensional domains. They demonstrated how different types of priors can be imposed using the concept of group sparsity: conservation laws, model equivalence, spatially varying latent variables, and symmetries. The results have shown that including such priors enables correct model inference from data containing 10% or even 15% additive Gaussian noise. Without the priors, the correct model could not be recovered in any of the presented cases. The achievability plots furthermore confirmed that relatively little data (here a few hundred space-time points) is sufficient to reliably and reproducibly learn the correct model when group-sparsity priors are included. Without the priors, model inference was inconsistent in all cases.
Importantly, stability selection converts the problem of tuning the regularization parameter λ to the easier problem of thresholding the importance measure Π̂. We argue that this is easier to do, as it relates to an upper bound on the number of false positives one is willing to tolerate [50,51], providing interpretability. Further refining such results in the group-sparse case would be useful for applications that require reliability guarantees.
The concepts introduced here are independent of how the elements of the dictionary are constructed. Exploring more advanced dictionary constructions, such as integral formulations [31] or weak formulations [33], in conjunction with group sparsity and stability selection likely provides a promising future research direction.
In its current form, however, our framework has a number of limitations. First, we only considered nonoverlapping groups, restricting each column of the dictionary to be part of at most one group. This is a limiting assumption, as it is not uncommon in physics or biology to simultaneously use multiple overlapping priors. The more advanced concept of structured sparsity [72] could provide a way to include overlapping priors in future work. Second, we only showed how to include priors about the structure of a model. If one additionally wants to impose priors about coefficient values (e.g., values of diffusion constants and reaction rates), the framework would need to be extended to constrained group-sparse regression [73]. Third, although we have demonstrated robust data-driven inference of the model structure, estimates of the coefficient values can considerably deviate from the ground truth (see Appendix A). Refitting the coefficient estimates using additional smoothness regularization in a postprocessing step could help, as we hint at in Appendix B.
Especially at high noise levels, coefficient estimation errors likely stem from inaccurate approximations of the spatial derivatives, since the polynomial differentiation schemes used here are known to amplify noise. This issue could be addressed in the future by combining our framework with physics-informed neural networks [18] or with Gaussian processes [17] to more robustly estimate the coefficients of the recovered model once the model structure is fixed. Such hybrid methods, combining the reconstruction abilities of physics-constrained neural networks with the robustness and consistency of sparse inference methods, may be particularly powerful for recovering spatiotemporal latent variables, such as pressure or stresses in continuum mechanics models, that cannot be directly measured in experiments.

APPENDIX A: REGRESSION ESTIMATES OF THE COEFFICIENTS
The coefficients estimated by the GIHT algorithm from the noisy simulation data in the three application cases are shown in Figs. 8 (for the JAK-STAT example), 9 (for the advection velocity), and 10 (for the reaction-diffusion system). In all cases, the results are compared with ground-truth values for different noise levels.

APPENDIX B: USING SMOOTHNESS PRIORS TO IMPROVE COEFFICIENT REGRESSION
In the results presented in Appendix A, the estimated coefficients are the direct outcome of the GIHT algorithm, which jointly infers the equation structure and the values of the coefficients over the so-determined support. The estimation of the coefficient values can be further improved by imposing an additional smoothness prior on them. However, this prior introduces an extra regularization parameter, which also needs to be determined using stability selection, adding another dimension to the stability plots. Moreover, smoothness priors based on discrete total variation, and trend filtering based on discrete higher-order derivatives, impose constraints on how the data points have to be sampled in space and time. This hampers the application of stability selection, which relies on uniformly random subsamples of the data.
We propose to reconcile these two seemingly conflicting requirements of smoothness priors by first identifying the groups using GIHT and then solving the trend-filtering problem as a postprocessing step for each individual group in order to impose the smoothness priors. This uses GIHT with stability selection only to infer the structure of the model (i.e., the support of $\xi$), followed by a separate smoothness-constrained regression to determine the values of the nonzero coefficients by solving
$$\hat{\xi}_s = \arg\min_{\xi} \frac{1}{2}\Big\| U_t - \sum_{j=1}^{K} \Theta_{g_j}\,\xi_{g_j} \Big\|_2^2 + \lambda_f \sum_{j=1}^{K} \big\| D_j^{(k+1)} \xi_{g_j} \big\|_1. \tag{B1}$$
This yields $\hat{\xi}_s$, the smoothed estimates recovered from the trend-filtering problem. Here $K$ is the number of groups identified by stability selection using GIHT, $D_j^{(k+1)} \in \mathbb{R}^{(p_{g_j}-k-1)\times(p_{g_j}-k)}$ is a discrete smoothing filter based on the $(k+1)$th derivative, and $p_{g_j} = |g_j|$ is the size of the respective group. The $\|\cdot\|_1$ norm in the smoothness prior penalizes outliers and favors smooth reconstruction of the coefficients. For $k = 0$, this formulation reduces to the classic total-variation prior. The regularization constant $\lambda_f$ controls the degree of smoothness imposed on the coefficients by the filter.
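The recursive structure of these difference filters, where each order is obtained by first-differencing the previous one, matching the stated dimensions, can be sketched as follows; `diff_matrix` is a hypothetical helper name, not part of the paper's code.

```python
import numpy as np

def diff_matrix(p, order):
    """(order)th-order discrete difference filter of shape (p - order, p).

    Built recursively: the filter of order k + 1 is the first difference of
    the filter of order k; the final recursive factor has the dimensions
    (p - k - 1) x (p - k) stated for the smoothing filter in the text.
    """
    D = np.eye(p)
    for k in range(order):
        m = p - k
        D1 = np.zeros((m - 1, m))          # first-difference factor
        D1[np.arange(m - 1), np.arange(m - 1)] = -1.0
        D1[np.arange(m - 1), np.arange(1, m)] = 1.0
        D = D1 @ D
    return D
```

For `order = 2` each row carries the second-difference stencil (1, -2, 1), i.e., the discrete curvature penalty used below.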
As an example, one could regularize the Laplacian (i.e., the curvature) of the coefficients. The discrete filter with $k = 1$ is then, for all $j$, the second-order difference matrix with stencil $(1, -2, 1)$ in each row. We demonstrate this in the advection-diffusion example by regularizing smoothness in all recovered groups, including the velocity and its gradient field. We solve the optimization problem in Eq. (B1) using the alternating direction method of multipliers algorithm [74] with an exhaustive grid search to identify the smoothness regularization $\lambda_f$ that leads to the lowest mean-square error estimate.

FIG. 11. Reconstructed spatially varying velocity field and its gradient for the advection-diffusion example with an additional second-order smoothness prior. The plots show the estimates for the latent spatially varying velocity c(x) (left column) and its gradient ∂_x c(x) (right column) when imposing an additional smoothness prior over each group recovered by stability selection using GIHT. The rows correspond to data with different noise levels σ (shown in the legend). For σ = 0 (top row), we use a smoothness regularization λ_f = 1, and for σ = 0.05 and 0.1 (middle and bottom rows), we use λ_f = 20. Black solid lines are the ground truth.

The reconstructed velocity field and its gradient are shown in Fig. 11 for different levels of noise on the input data. Comparing with the profiles recovered directly by GIHT (Fig. 9), we observe that imposing smoothness priors in a separate postprocessing step significantly improves the reconstruction of the latent fields in this example.
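A minimal ADMM sketch for one group is given below, assuming the simplified denoising form $\min_x \tfrac{1}{2}\|x - c\|_2^2 + \lambda_f \|Dx\|_1$, where $c$ holds the pointwise coefficient estimates of a single group. This is a generic ADMM splitting for an $\ell_1$-penalized smoothness term, not the exact solver of [74]; all names are illustrative.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding operator, the proximal map of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def trend_filter(c, D, lam, rho=1.0, n_iter=500):
    """ADMM for min_x 0.5*||x - c||^2 + lam*||D x||_1 (denoising form)."""
    m, p = D.shape
    x = c.copy()
    z = np.zeros(m)   # auxiliary variable z ~ D x
    u = np.zeros(m)   # scaled dual variable
    A = np.eye(p) + rho * (D.T @ D)   # fixed x-update system matrix
    for _ in range(n_iter):
        x = np.linalg.solve(A, c + rho * D.T @ (z - u))
        z = soft(D @ x + u, lam / rho)
        u = u + D @ x - z
    return x
```

With the second-difference filter ($k = 1$), large $\lambda_f$ drives the curvature of the recovered coefficients toward zero, which is the behavior used for the velocity field in Fig. 11.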

APPENDIX C: ALGORITHM FOR GROUP-SPARSE REGRESSION
Given data $\hat{u}(x_i, t_j)$ at discrete locations $x_i$ and time points $t_j$, we use polynomial differentiation [16,27] to approximate the derivatives required to construct the dictionary and the vector $U_t$. To approximately solve the optimization problem in Eq. (5), we derive the GIHT algorithm. This algorithm is based on an approximate proximal operator for nonoverlapping group sparsity, i.e., for cases where the groups $\{g_l : l \in \mathbb{N}_m\}$ form a partition of the index set $\mathbb{N}_P$. In this case, the approximate proximal operator can be applied to each group separately and the results summed [75].
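The assembly of $U_t$ and the dictionary can be sketched for 1D data as follows. For simplicity, the sketch uses central finite differences (`np.gradient`) as a stand-in for the polynomial differentiation used in the text, and the candidate terms and the name `build_dictionary` are illustrative.

```python
import numpy as np

def build_dictionary(u, dx, dt):
    """Assemble U_t and a small candidate dictionary Theta from u(x, t).

    u has shape (n_x, n_t). Derivatives are approximated by central finite
    differences, a simple stand-in for polynomial differentiation.
    """
    u_t = np.gradient(u, dt, axis=1)    # left-hand side U_t
    u_x = np.gradient(u, dx, axis=0)
    u_xx = np.gradient(u_x, dx, axis=0)
    # candidate terms: [1, u, u_x, u_xx, u*u_x]; each becomes one column
    cols = [np.ones_like(u), u, u_x, u_xx, u * u_x]
    Theta = np.stack([c.ravel() for c in cols], axis=1)
    return u_t.ravel(), Theta
```

Groups are then defined as index sets over the columns of `Theta`, e.g., collecting all terms belonging to one conservation law or one latent field.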

Proximal view of the iterative hard thresholding algorithm
We start from the well-known iterative hard thresholding (IHT) algorithm for $\|\cdot\|_0$-regularized sparse regression [40]. We formulate this algorithm from the perspective of projection and proximal operators. For solving the composite optimization problem in Eq. (3), we use linearization and solve the surrogate problem to generate a sequence $\{\xi^k\}$ as
$$\xi^{k+1} = \arg\min_{\xi}\; h(\xi^k) + \big\langle \nabla h(\xi^k),\, \xi - \xi^k \big\rangle + \frac{t_k}{2}\,\big\|\xi - \xi^k\big\|_2^2 + \lambda\, r(\xi). \tag{C1}$$
This linearization works under the assumption that the data-fitting function $h(\xi)$ is continuously differentiable with Lipschitz continuous gradient, i.e., that there exists a positive constant $L > 0$ such that $\|\nabla h(x) - \nabla h(y)\|_2 \leq L\,\|x - y\|_2$ for all $x, y \in \mathbb{R}^d$. The problem in Eq. (C1) is equivalent to applying the proximal operator
$$\xi^{k+1} = \operatorname{prox}_{(\lambda/t_k)\,r}\big(v^k\big), \qquad v^k = \xi^k - \nabla h(\xi^k)/t_k,$$
where $v^k$ is the gradient-descent iterate with step size $1/t_k$. Thus, we perform gradient descent along $-\nabla h(\xi^k)$ and then apply the proximal operator. In the IHT algorithm with the nonconvex penalty $\lambda\|\xi\|_0$, the proximal operator $\operatorname{prox}_r(v)$ is approximated by hard thresholding [40].
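The group variant of this step can be sketched as follows: after the gradient step, an entire group is kept or zeroed based on its Euclidean norm. The threshold $\sqrt{2\lambda/t_k}$ below follows from applying the scalar $\ell_0$-proximal rule groupwise (comparing the cost $\lambda$ of keeping a group with the cost $\tfrac{t_k}{2}\|v_g\|_2^2$ of dropping it); this is an illustrative sketch of the scheme, not a verbatim copy of the GIHT implementation.

```python
import numpy as np

def group_hard_threshold(v, groups, lam, t):
    """Approximate proximal operator for a group-l0 penalty.

    Keeps a whole group if its Euclidean norm exceeds sqrt(2*lam/t) and
    zeroes it otherwise; `groups` is a list of index arrays forming a
    partition of the coefficient vector (nonoverlapping groups).
    """
    xi = np.zeros_like(v)
    thr = np.sqrt(2.0 * lam / t)
    for g in groups:
        if np.linalg.norm(v[g]) > thr:
            xi[g] = v[g]
    return xi

def giht_step(xi, X, y, groups, lam, t):
    """One GIHT-style iteration: gradient step on h = 0.5*||y - X xi||^2,
    followed by groupwise hard thresholding."""
    grad = X.T @ (X @ xi - y)
    return group_hard_threshold(xi - grad / t, groups, lam, t)
```

Iterating `giht_step` with $t \geq L$ then realizes the proximal-gradient scheme of Eq. (C1) for the group-sparse penalty.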