Missing links as a source of seemingly variable constants in complex reaction networks

A major challenge in network science is to determine parameters governing complex network dynamics from experimental observations and theoretical models. In complex chemical reaction networks, for example, such as those describing processes in internal combustion engines and power generators, rate constant estimates vary significantly across studies despite substantial experimental efforts. Here, we examine the possibility that variability in measured constants can be largely attributed to the impact of missing network information on parameter estimation. Through the numerical simulation of measurements in incomplete chemical reaction networks, we show that unaccountability of network links presumed unimportant (with local sensitivity amounting to less than two percent of that of a measured link) can create apparent rate constant variations as large as one order of magnitude even if no experimental errors are present in the data. Furthermore, the correlation coefficient between the logarithmic deviation of the rate constant estimate and the cumulative relative sensitivity of the neglected reactions was less than 0.5 in all cases. Thus, for dynamical processes on complex networks, iteratively expanding a model by determining new parameters from data collected under specific conditions is unlikely to produce reliable results.


I. INTRODUCTION
In the study of real complex network systems, information about the system components is often incomplete, unreliable, or unknown. Recently proposed methods to infer the likelihood of links in a network take advantage of node similarities or network structures [1][2][3][4][5][6]. Combined with statistical inference, such methods have demonstrated remarkable success in a variety of real-world networks with heterogeneous errors [7]. Other approaches include observing links present in reconstructed networks after perturbations [8], applying the inverse problem to generative models for networks with predicted hyperbolic structures [9], and inferring missing nodes from community detection algorithms [10]. Statistical measures of correlation and causation have also been used to infer links from observations [11][12][13][14].
In addition to the topology of their connections, links and nodes often carry parameters that encode their properties. In models of physical networks, such parameters must be determined through measurements of observables, which may be costly and limited. In some cases, while the topology of the network is known in principle, parameters such as link weights can only be estimated for a relatively small proportion of the network. Nevertheless, if the network is derived from a knowledge-based model (i.e., it is believed to include all relevant details and is physically well motivated), it is usually assumed that individual parameters can be measured accurately through targeted experiments.
The key problem in such cases is not to determine which nodes and links are present, but rather, which of them are most important to include in the model. "Weak" nodes and links may be systematically neglected to reduce the number of parameters and derive minimal models, but it is unclear how these missing elements may affect the estimated parameters, and hence the predicted dynamics, of the parts of the network that are retained.
That is, it remains largely unknown what impact missing and deliberately omitted structural information may have on the relevant dynamical processes predicted or described by a network model.
Here, in contrast to previous studies focused on solely network structure, we explore the impact of network incompleteness on the parameters governing the system dynamics. To motivate this work, consider the simple electric network of resistors shown in Fig. 1(a). Suppose an experimenter attempts to fit an incomplete model of the system (in which the red resistor is absent) to current measurements in order to determine the resistance R 1 of the blue resistor. Since the missing resistor is relatively large, little current passes through it. From the network perspective, the links in the network should be given a weight proportional to their admittance (inverse resistance) and, intuitively, a model that neglects small weighted links would still be expected to be good at predicting the dynamics. The absence of the red resistor results in a systematic error due to the modeling approximation, and we assume for simplicity that the measurements are perfect and no additional errors are present.
The currents drawn from the voltage sources are measured, and the resistance R 1 in the model is adjusted until the model's predicted currents minimize the deviations from the measured currents, resulting in a least squares estimate (see Appendix A for details). Figure  1(b) shows how this estimated resistance R est depends on the voltage value V adj that the experiment is conducted under. Crucially, two experimenters who operate with different values of V adj can obtain substantially different estimates for R est and may falsely conclude that the resistance changes with the experimental conditions. The apparent variability of a constant in this example is actually quite easy to understand. Essentially no current passes through the blue resistor when V adj ≈ V 1 , so the measured currents are not sensitive to changes in the value of R 1 (as quantified by local sensitivity in Appendix A). Thus, the error introduced by the incomplete model is amplified around the dotted line in Fig. 1(b), and therefore the measure of link importance based on the admittance (link weight) does not adequately reflect the impact of link removal on parameter estimation.
While the errors induced by the estimation procedure in Fig. 1 are easy to understand in terms of local sensitivities, subtler challenges arise in more complex networks, especially when the network dynamics are nonstationary. In the remainder of the paper, we show that for complex chemical reaction networks in particular, missing information about the underlying network can lead to significant variations in rate constants estimated under different conditions even when the local sensitivities for the neglected links are small compared to those for the links being measured. Given that network models of chemical reactions will necessarily be incomplete (since additional short-lived radical species and rare reactions could always be added to a given model), this variability poses a serious challenge to the efforts to reliably determine rate constant through experiments.
To illustrate this challenge, we focus on important chemical reaction networks involved in natural gas combustion in Section II and ethylene pyrolysis in Section III, which have been carefully developed for decades, primarily through experiments employing shock tubes and laminar flame speed measurements [15][16][17][18][19]. We compare the results for these processes in Section IV, and find in particular that local sensitivity analysis, which is a commonly used methods for determining reaction importance [20][21][22], correlates weakly with the reactions impact on rate constant measurements in all cases. We conclude in Section V with a discussion of implications and directions for future research.

II. NATURAL GAS COMBUSTION
We first consider the impact of unknown network reactions in the case of natural gas combustion. We model the chemical process with a coupled set of kinetic equations, which corresponds to a network of chemical species, elementary reactions, and physical parameters, including the rate constants that relate species concentrations to the rate of reactions (see Appendix B). To emulate the effects of missing information in models of real processes, we take a specific network (which we regard as the complete network) as our "ground truth" from which measurement values (regarded to be exact) are generated using simulations. We employ the GRI 3.0 network [23] as our "complete" natural gas network, which is then simulated using the open source software Cantera to integrate a continuously stirred reactor with an accurate time stepping method for stiff ordinary differential equations [24,25]. The GRI 3.0 network is a specification of 53 chemical species and 325 elementary reactions and rate constants, which have been curated carefully from the body of experiments in the literature. Despite extensive curation, this network is not reliable for general purposes across different conditions when the parameters are fixed, which our analysis will suggest to be due to incompleteness of the network model itself. But for the purpose of this analysis, we regard this network as complete and evaluate our hypothesis by quantifying the impact of further removing information from the network. Specifically, starting with this network, we produce plausible models for the process by removing some reactions from the complete network and use the resulting "incomplete" networks to determine parameter values from measurements. Figure 2(a) shows the evolution of the chemical compositions in a mixture of natural gas and oxygen. The reactants combine explosively as the reaction proceeds and ultimate produce water and a variety of other products. During this process, the temperature and pressure of the gas increase sharply as the fuel ignites, as shown in Fig. 2(b). This ignition is the result of a chain reaction process taking part in the complex network of species which are connected by elementary reactions. The network structure of this process can be represented as a hypergraph, where nodes are chemical species and directed hyperedges are reactions, with weights proportional to the reaction flux. This hypergraph can be visualized by its bipartite incidence graph, where the hyperedges are replaced with reaction nodes, as shown in Fig. 2

(c).
A traditionally employed measure of reaction importance in chemical reaction networks is the local sensitivity, which quantifies changes in the concentration with changes in the rate constant (see Appendix B). To assess the reliability of this metric, we generate new incomplete networks from the complete network by randomly removing reactions that are deemed unimportant while retaining reactions presumed to be important, where reaction importance is assessed on the basis of local sensitivities. Rate constants in these incomplete networks are then fit to simulated measurements from the complete network to assess the impact of network incompleteness on parameter estimation.
For our measurement simulations, we consider two reactions whose rate constants are to be determined,  ing reaction, which helps to populate the radical pool that leads to ignition. Equation (II) is a propagation reaction, which helps to diversify this pool of radicals.
We perform a numerical optimization to minimize an error in order to fit the incomplete network to the measurements. The error we use quantifies the discrepancy in the peak oxygen radical concentration and ignition time, as described in Appendix C. This particular quantity is employed in order to emulate the data available in real experiments, such as shock tube experiments based on oxygen radical concentration peaks [15]. Furthermore, the reactions in Eqs. (I) and (II) are known to be important for combustion modeling, and their local sensitivities S OI and S OII are among the largest of all the reactions in the network during the ignition events, as shown in Fig. 3(a).
Incomplete networks are generated randomly by removing reactions with small oxygen sensitivity from the network. In order to quantify how the sensitivity of the removed reactions impacts the variability of the rate constant estimate, we retain from the N tot total reactions a tunable number of reactions, N ret , with the largest Osensitivities, which are not candidates for removal in the generation of the incomplete networks. Incomplete networks are then generated by randomly selecting a tunable number of reactions N rem from the remaining N tot − N ret reactions and removing them from the network. For each value of N ret and N rem , multiple incomplete networks are generated with differing random seeds, which allows us to statistically study the effects of missing information. Rate constant estimates are obtained from fits minimizing . The fit of one sample realization is shown in Fig. 3(b), but the quality of this fit varies significantly with each randomly realized incomplete network for fixed Under any particular thermodynamic condition, minimizing the error results in a rate constant estimate k est for Eq. (I) or Eq. (II), but these estimates will vary with the thermodynamic conditions. For example, for one incomplete network generated with N ret = 40 and  Table C1 in Appendix C) for the complete network and a fitted incomplete network with Nret = Nrem = 40. The value of the rate constant for the fitted network provides an estimate for the rate constants kest. data point, and so we may hope to eliminate the variation in the estimate by combining many observations in a single fit to produce an estimate.
To test if measurements of multiple data points will improve the estimate, we performed 1000 rate constant estimates for Eqs. (I) and (II) for incomplete networks generated with N rem and N ret varying from 10 to 160 by increments of 15 (in total, 2.42 × 10 5 estimates), where each estimate was obtained by minimizing the aggregated error over all 27 thermodynamic conditions. Figure 4 shows how the average results from these simulations vary with N rem and N ret . For each estimate, we consider the ratio of the sensitivities of the removed and measured reactions, S r /S m , the magnitude of logarithmic deviation of the rate constant estimate (representing the orders-ofmagnitude difference between the rate constant estimate and the actual rate constant), |log 10 (k est /k 0 )|, and the logarithm of the error, log 10 ( ). It is clear from these results that after increasing the amount of observational data in the rate constant estimates, there can still be significant variation in the estimates even when relatively few of the least sensitive reactions are missing from the network and the fit appears adequate.
Having incorporated multiple thermodynamic conditions in each estimate, it is interesting to study how the estimates change under yet different conditions. We performed a second set of rate constant measurements using the same set of incomplete networks described above under slightly different experimental conditions by diluting the gas with about ten percent argon. Argon, a neutral buffer which does not react chemically, is sometimes used in shock-tube experiments to adjust conditions, such as pressure, under the assumption that its presence will not alter the results. However, besides deviating again from the true rate constants, the same incomplete networks also exhibit substantial differences between rate constant measurements performed in air with and without argon dilution, as shown in Fig. 5. Among all the paired measurements, the Pearson correlation coefficient between rate constant estimates with no argon and with argon dilution was 0.93 for Eq. (I) and 0.78 for Eq. (II), while the fraction of variance unexplained by a linear regression between the two conditions was, respectively, 14% and 39%. Furthermore, while networks with small removal sensitivities tend to produce rate constant estimates closer to the real value, the scatter in Fig. 5 is not substantially smaller when the estimates are good or when the removal sensitivity is small. Thus, network incompleteness can clearly result in variable rate constant estimates under differing conditions even when the reactions being measured have large sensitivities compared to the neglected reactions and the change in conditions is seemingly innocuous.

III. ETHYLENE PYROLYSIS
To illustrate that these results generalize beyond combustion, we next consider a different complex reaction network describing ethylene pyrolysis. Pyrolysis is a common industrial chemical process involving the burning of organic compounds in the absence of oxygen. In this process, larger organic molecules like C 2 H 4 (ethylene) break down into smaller molecules such as H 2 and CH 4 . We employ a recent model of light hydrocarbon pyrolysis consisting of 84 species and 1698 reactions as our complete pyrolysis network [19]. Figure 6 shows the evolution of the process and network structure in the pyrolysis case, as in Fig. 2.
For ethylene pyrolysis, we consider rate constant measurements based on net yields of H 2 and CH 4 rather than the peak O concentration and ignition time used previously. We allow the reaction to evolve until the H 2 concentration reaches its steady state with 27 different thermodynamic conditions relevant to low pressure chemical vapor deposition. We consider two reactions, each of which has large sensitivity with respect to the

IV. COMBUSTION/PYROLYSIS COMPARISONS
The distributions of rate constant estimates for the combustion and pyrolysis networks are broadly similar, as summarized by the histograms in Fig. 9(a)-(d). The rate constant estimates typically deviate much more than the error or the removal sensitivities in all cases, despite the differing measurement targets and reaction networks. Table I shows the correlation coefficients between measurement quantities for each reaction in each process; there are moderate to weak correlations between all quantities. Thus, in both these processes, the local sensitivity metric correlates weakly with the rate constant estimates and is not a very reliable measure of link importance. Even when the removed reactions have local sensitivity totaling to much less than the sensitivity of the measured links, the rate constant estimates can vary considerably, as shown in Fig. 9(e)-(f). In the networks we generated, for example, the smallest values of S r /S m leading to order-of-magnitude changes with | log 10 (k est /k 0 )| > 1 are just 0.005 for Eq. (I), 0.0046 for Eq. (II), 0.019 for Eq. (III), and 0.0024 for Eq. (IV). While computational costs prohibit us from testing all reactions in these networks, less systematic simulations suggest that other reactions typically exhibit even greater variations in rate constant estimates, as expected given that the reactions (I)-(IV) are among the most sensitive ones in the networks.
To compare the two processes in more detail, we con-  sider the effect of varying the number of removed and retained reactions in the simulated measurements. We divide the plane spanned by N ret and N rem into an acceptable region (in which more than 95% of the rate constant estimates from fits of incomplete networks deviate from the actual value by less than a factor of two) and an unacceptable region (in which this is not the case). Figure 10 compares the acceptable regions in the combustion and pyrolysis cases by normalizing the axes by N tot , the total number of reactions in each network. In order to achieve acceptable measurements with just 10% of reactions missing for all cases, the retained reactions must amount to more than 15% of the network. While somewhat smaller N ret /N tot is adequate for acceptable measurements for Eqs. (I)-(III) in this case, the proportion of retained reactions required for acceptable measurements rises to about 15% in all cases when 50% of the reactions are missing. Interestingly, the natural gas network shows a stronger dependence on the proportion of removed reactions than the pyrolysis network, indicating that the pyrolysis network has a greater proportion of reactions that are unimportant to the dynamics of the measurement. This may result from the more complex chain reaction involved in ignition compared to the slower and more temporally distributed pyrolysis process. The rate constant estimates are deemed acceptable only if more than 95% of incomplete network fits result in rate constant estimates which differ from the actual rate constant by less than a factor of two. Thin lines show how the boundary varies as the acceptable threshold is reduced to 90%, 85%, 75%, and 70%.

V. DISCUSSION
Our results show that observed variations in rate constants of combustion and pyrolysis networks can be largely attributed to the use of incomplete network models that are missing reactions that were presumed to be negligible. We demonstrated that even when the neglected reactions have small cumulative sensitivity compared to the sensitivity of the measured reaction, rate constant estimates in incomplete networks can vary by one order of magnitude. Furthermore, rate constant estimate deviations and local sensitivity analysis showed weak statistical correlations. While these results highlight significant challenges in constructing a reliable and transferable kinetic model for these important chemical processes, they also reveal that current observations of variability in rate constant estimates are not necessarily an indication that a complete network model applicable across all conditions is unattainable. Yet, since measurements of individual rate constants using the current incomplete network models cannot produce reliable results, iteratively refining individual parameters through targeted experiments is unlikely to ever produce accurate estimates. It would therefore be desirable to accumulate large data sets from measurements across all relevant conditions in order to simultaneously estimate all rate constants in the model.
Our results also have implications for other processes described by complex chemical reaction networks, including atmospheric processes [27][28][29] as well as biochemical processes, such as those mediated by metabolic networks [30][31][32]. Compared to the combustion and pyrolysis networks considered here, parameter estimation in metabolic networks in particular is still in its infancy and could benefit significantly from more informed formulations at the outset. One promising direction for improvement is suggested by information-geometric and sloppy modeling approaches [33][34][35]. Using such methods, it may be possible to identify measurement targets as algebraic combinations of rate constants that are more robust to the variations caused by missing network links. Another promising information theoretic approach for predicting network parameters from data would be to employ maximum entropy models, whose degeneracy issues have been recently addressed [36]. Alternatively, it may be possible to directly estimate reaction fluxes rather than species concentrations in order to disentangle reactions. In principle, molecular dynamics simulations can be used to estimate fluxes even if they are experimentally inaccessible, provided that the molecular force fields can be modeled realistically [37,38].
Beyond chemical reaction networks, social networks [39][40][41] and infrastructure networks may also benefit from our findings. In power grids [42,43], for example, the network components are assumed to be known but can involve uncertainties and the network itself may also vary, as not all components are necessarily operational at all times. Our results suggest that power-grid models may become unreliable even if the component parameters are accurate unless they are updated in real time to account for missing components. In other applications, our results also emphasize an important trade-off between model accuracy and parameter accuracy when dimension reduction is implemented by network trimming. That is, trimmed networks may lead to accurate results under specific conditions at the cost of requiring parameter values that differ significantly from their actual values and that may change substantially as conditions change. We anticipate that the impact of missing information on parameter estimation will ultimately depend on the class of networks under consideration, much in the same way as optimal link detection varies across scientific domains [44].

Appendix A: Electric circuit analysis
We first derive the measured currents for the circuit Fig. 1 using Kirchhoff's laws. The voltage across the blue resistor is known, so the current moving from left to right across the blue resistor is I 3 = (V adj −V 1 )/R 1 . Since the black nodes are neither sources nor sinks of current, the current from top to bottom across the red resistor must be I 2 + I 3 = (V adj − V 3 )/R 2 . This same current must pass through the horizontal 1Ω resistor, so that I 2 + I 3 = V 3 /(1Ω). It follows that V 3 = V adj /(R 2 + 1Ω) and I 2 = V adj /(R 2 + 1Ω) − (V adj − V 1 )/R 1 . Similarly, the current across the vertical 1Ω resistor must be I 1 − I 3 = V 1 /(1Ω), and thus Next, we consider the model of the experimenter who neglects the red resistor. According to this model, all the current drawn through the adjustable voltage source travels across the blue resistor, so that the estimated current is I 2 = −(V adj − V 1 )/R est , where R est is the estimate of the resistance of the blue resistor. Similarly, the experimenter estimates the current drawn from the left voltage source to be I 1 = V 1 /(1Ω) + (V adj − V 1 )/R est . The square error for the incomplete model, 2 ≡ (I 1 −I 1 ) 2 +(I 2 −I 2 ) 2 , is thus where ∆ = (V adj − V 1 )(1/R est − 1/R 1 ). The least square estimate for the resistance in Fig. 1(b) follows by minimizing 2 with respect to R est . The local sensitivity S ij ≡ (R j /I i )(∂I i /∂R j ) quantifies how the ith current varies with the jth resistor. It seems intuitive that the least square estimate derived from the model that omits R 2 can become unreliable when S i2 becomes large compared to S i1 . Indeed, for this circuit, the ratio S 22 /S 21 = (R 2 + 1Ω) 2 (V adj − V 1 )/R 1 R 2 V adj approaches zero as V adj approaches V 1 , where the estimate in Fig. 1(b) becomes unreliable. The chemical reaction networks we consider are described by state variables corresponding to the molar fraction X Ai of the ith species with name A i in an ideal gas and thermodynamic variables including the temperature T and pressure P , with volume held fixed and under adiabatic conditions. For simplicity, we consider wellmixed systems, so that there is no spatial dependence on variables. Molar fractions vary because of elementary chemical reactions between species, such as the th re- which is assumed to occur at a rate governed by mass-action kinetics (i.e., proportional to a rate constant κ times the product of the reactants molar concentrations raised to their stoichiometric coefficients). Each molar fraction increases at a rate given by the reactions that produce it and decreases at a rate given by the reactions that consume it, Aj . The energy released or consumed from each reaction (determined from thermodynamic data about the species) is converted to heat, which enters an energy conservation equation governing the evolution of the temperature, and the pressure and temperature are related through a constitutive relation, which we take as the ideal gas law. Each rate constant has temperature dependence which is based on a modified Arrhenius equation κ = kT b exp(−E/RT ) where k, b, and E are model parameters. Some threebody reactions also depend on pressure with a Troe falloff, and reversible reactions have rate constants derived from detailed balance. The local sensitivity, S i ≡ κ /X i × ∂X i /∂κ , quantifies how the state variable X i changes as the parameter κ varies [21]. Small changes in the rate constants for reactions with large sensitivities produce large changes in the species molar fractions, indicating that sensitivity may be a useful metric for reaction importance.

Appendix C: Simulated combustion measurements
We consider 27 thermodynamic conditions for our measurement simulations that evenly span the range of thermodynamic values over which the GRI network was designed to model ignition, as listed in Table C1. The initial concentrations in the simulation are a mixture of air and methane, with 8 parts N 2 , 2 parts O 2 , and φ parts fuel CH 4 . For the measurements in air diluted with argon, an additional 1 part Ar was added to the initial conditions. Experiments cannot generally observe the concentration of all species or reaction fluxes during the course of ignition. Instead, specific targets are identified to measure and fit. For this study, we consider the ignition time t ig and the peak in the oxygen radical mole fraction X O peak that occurs at ignition, where t ig is taken as the time of the oxygen peak. For simplicity, we fix the values of b and E in the rate constants to their values in the original network and optimize only over the pre-exponential factor k. We optimize over the error quantity where the superscripts o and f indicate the observed and fitted values, respectively. When the observation from multiple thermodynamic conditions are combined to produce a rate constant estimate, the square errors are totaled to give the aggregated error. The particular optimization target in Eq. (C1) is intended to emulate choices made in real combustion experiments [15]. We use a Brent optimization algorithm, with a bracket found from an initial interval of [k 0 /2, 2k 0 ], where k 0 is the original network's rate constant value. To simulate dynamics, we used the open-source software Cantera to numerically integrate the coupled ordinary differential equations in the combustion and pyrolysis networks.
The 2-norm (over the time indexes) of the local sensitivity S O (and S H2 in the case of pyrolysis below) was calculated for all reactions and averaged over all experimental conditions to rank reactions by their sensitivity. The removal sensitivity S r is the total sensitivity of all reactions removed to generate the incomplete network, and the measurement sensitivity S m is the sensitivity of the reaction whose rate constant is being estimated. Table D1 shows the experimental conditions for the pyrolysis rate constant measurements. The fuel purity φ in this case indicates the ratio of ethylene to N 2 in the initial condition. These conditions were chosen for their applicability to low pressure chemical vapor deposition. For ethylene pyrolysis, we optimize over the measurement error

Appendix D: Simulated pyrolysis measurements
where the yield concentrations are evaluated at a time that the H 2 concentration has attained its equilibrium value. The particular optimization target in Eq. (D1) based on methane and hydrogen yields is intended to represent the kinds of limited data that are available in chemical vapor deposition experiments.