Path Weight Sampling: Exact Monte Carlo Computation of the Mutual Information between Stochastic Trajectories



I. INTRODUCTION
Quantifying information transmission is vital for understanding and designing natural and engineered information-processing systems, ranging from biochemical and neural networks to electronic circuits and optical systems [1][2][3]. Claude Shannon introduced the mutual information and the information rate as the central measures of Information Theory more than 70 years ago [4]. These measures quantify the fidelity by which a noisy system transmits information from its inputs to its outputs. Yet, computing these quantities exactly remains notoriously difficult, if not impossible. This is because the inputs and outputs are often not scalar values, but rather temporal trajectories.
Most, if not all, information-processing systems transmit signals that vary in time. The canonical measure for quantifying information transmission via time-varying signals is the mutual information rate [4][5][6][7]. It quantifies the speed at which distinct messages are transmitted through the system, and it depends not only on the accuracy of the input-output mapping but also on the correlations within the input and output signals. Computing the mutual information rate thus requires computing the mutual information between the input and output trajectories, not between their signal values at given time points. The rate at which this trajectory mutual information increases with the trajectory duration in the long-time limit defines the mutual information rate. In the absence of feedback this rate also equals the multistep transfer entropy [8,9].
More generally, useful information is often contained in the temporal dynamics of the signal. A prime example is bacterial chemotaxis, where the response does not depend on the current ligand concentration, but rather on whether it has changed in the recent past [10,11]. Moreover, the information from the input may be encoded in the temporal dynamics of the output [12][13][14][15]. Quantifying information encoded in these temporal features of the signals requires the mutual information not between two time points, i.e. the instantaneous mutual information, but rather between input and output trajectories [6].
Unfortunately, computing the mutual information between trajectories is exceptionally difficult. The conventional approach requires non-parametric estimates of the input and output distributions, e.g. via histograms of data obtained through simulations or experiments [16][17][18][19][20][21]. These non-parametric distribution estimates are necessary because the mutual information cannot generally be computed from summary statistics like the mean or variance of the data alone. However, the high-dimensional nature of trajectories makes it infeasible to obtain enough empirical data to accurately estimate the required probability distributions. Moreover, this approach requires the discretization of time, which becomes problematic when the information is encoded in the precise timing of signal spikes, as, e.g., in neuronal systems [22]. Except for the simplest systems with a binary state space [21], the conventional approach to estimate the mutual information via histograms therefore cannot be transposed to trajectories.
Because there are currently no general schemes available to compute the mutual information between trajectories exactly, approximate methods or simplified models are typically used. While empirical distribution estimates can be avoided by employing the K-nearest-neighbors entropy estimator [23,24], this method depends on a choice of metric in trajectory space and can become unreliable for long trajectories [25]. Alternatively, decoding-based information estimates can be developed for trajectories [26], but these merely provide a lower bound on the mutual information, and it remains unclear how tight these lower bounds are [25,27,28]. Analytical results are available for simple systems [29], and for linear systems that obey Gaussian statistics the mutual information between trajectories can be obtained from the covariance matrix [6]. However, many information-processing systems are complex and non-linear, such that the Gaussian approximation does not hold and analytical solutions do not exist. A more promising approach to estimate the trajectory mutual information for chemical reaction networks has been developed by Duso and Zechner [30] and generalized in Ref. [31]. However, the scheme relies on a moment closure approximation and has so far only been applied to very simple networks, as it seems difficult to extend to complex systems.
Here, we present Path Weight Sampling (PWS), an exact technique to compute the trajectory mutual information for any system described by a master equation. Master equations are widely used to model chemical reaction networks [32][33][34][35], biological population growth [36][37][38], economic processes [39,40], and a large variety of other systems [41,42], making our scheme of interest to a broad class of problems.
PWS is an exact Monte Carlo scheme, in the sense that it provides an unbiased statistical estimate of the trajectory mutual information. In PWS, the mutual information is computed as the difference between the marginal output entropy associated with the marginal distribution P[x] of the output trajectories x, and the conditional output entropy associated with the output distribution P[x|s] conditioned on the input trajectory s. Our scheme is inspired by the observation of Cepeda-Humerez et al. [25] that the path likelihood, i.e. the probability P[x|s], can be computed exactly from the master equation for a static input signal s. This makes it possible to compute the mutual information between a discrete input and a time-varying output via a Monte Carlo averaging procedure over the likelihoods, rather than from an empirical estimate of the intractable high-dimensional probability distribution functions. The scheme of Cepeda-Humerez et al. [25] is, however, limited to discrete input signals that do not vary in time. Here we show that the path likelihood P[x|s] can also be computed for a dynamical input trajectory s, which allows us to compute the conditional output entropy also for time-varying inputs. While this solves the problem in part, the marginal output entropy associated with P[x] cannot be computed with the approach of Cepeda-Humerez et al., and thus requires a different scheme.
We show how, for time-varying input signals, the marginal probability P[x] can be obtained as a Monte Carlo average of P[x|s] over a large number of input trajectories. How to do this effectively is the crux of PWS. We then use the Monte Carlo estimate for P[x] to compute the marginal output entropy. We present three variants of PWS, all of which compute the conditional entropy in the same manner, but which differ in the way the Monte Carlo averaging procedure for computing the marginal probability P[x] is carried out.
To compute P[x], Direct PWS (DPWS) performs a brute-force average of the path likelihoods P[x|s] over the input trajectories s. While we show that this scheme works for simple systems, the brute-force Monte Carlo averaging procedure becomes more difficult for larger systems and exponentially harder for longer trajectories.
Our second and third variants of PWS are based on the realization that the marginal probability P[x] is akin to a partition function. These schemes leverage techniques for computing free energies from statistical physics. Specifically, the second scheme, Rosenbluth-Rosenbluth PWS (RR-PWS), exploits the observation that the computation of P[x] is analogous to the calculation of the (excess) chemical potential of a polymer, for which efficient methods have been developed [43][44][45]. The third scheme, Thermodynamic Integration PWS (TI-PWS), is based on the classic free-energy estimation technique of thermodynamic integration [46][47][48], in conjunction with a trajectory-space MCMC sampler using ideas from transition path sampling [49].
All three PWS variants make it possible to compute the mutual information between trajectories exactly, without the need to discretize the input and output signals in time.
The DPWS method is presented in Section II, together with a review of the required concepts from the theory of Markov jump processes and master equations. The other two PWS schemes are then presented as improvements of DPWS in Section III.
In Section IV we show that, surprisingly, our PWS methods additionally make it possible to compute the mutual information between input and output trajectories of systems with hidden internal states. Hidden states correspond, for example, to network components that merely relay, process, or transform the signal from the input to the output. Indeed, the downstream system typically responds to the information that is encoded in this output, and not to the other internal system components. Most information-processing systems contain such hidden states, and generally we want to integrate out these latent network components. In addition, we can generalize PWS to systems with feedback from the output to the input, as shown in Appendix C.
In Section V we apply PWS to two well-known model systems. The first is a simple pair of coupled birth-death processes, which allows us to test the efficiency of the three PWS variants, as well as to compare the PWS results with analytical results from the Gaussian approximation [6] and the technique by Duso and Zechner [30]. Our second application concerns the bacterial chemotaxis system, which is arguably the best-characterized signaling system in biology. Mattingly et al. [50] recently argued that bacterial chemotaxis in shallow gradients is information limited. Yet, to compute the information rate from their experimental data they had to employ a Gaussian framework. PWS makes it possible to assess the accuracy of this approximation. Our results show that the Gaussian framework is accurate in the regime of shallow concentration gradients as studied in Mattingly et al. [50]. Yet, comparing the PWS predictions of a model based on previous literature against their experimental results reveals that the composition of the receptor array differs from what was hitherto believed.

II. MONTE CARLO ESTIMATE OF THE MUTUAL INFORMATION
In this section we present the fundamental ideas of PWS. These ideas lie at the heart of DPWS and also form the foundation of the other two more advanced PWS variants, which will be explained in subsequent sections.

A. Statement of the Problem
All information-processing systems repeatedly take an input value s and produce a corresponding output x. Due to noise, the output produced for the same input can be different every time, such that the system samples outputs from the distribution P(x|s). In the information-theoretic sense, the device's capabilities are fully specified by its output distributions for all possible inputs. We consider the inputs as being distributed according to a probability density P(s), such that the whole setup of signal and device is completely described by the joint probability density P(s, x) = P(s) P(x|s).
When the conditional output distributions P(x|s) overlap with each other, information is lost because the input cannot always be inferred uniquely from the output (see Fig. 1). The remaining information that the output carries about the signal on average is quantified by the mutual information between input and output.
Mathematically, the mutual information between a random variable S, representing the input, and a second random variable X, representing the output, is defined as

I(S, X) = ∫ ds ∫ dx P(s, x) ln [P(s, x) / (P(s) P(x))],  (1)

where the marginal output distribution is given by P(x) = ∫ ds P(s, x). The quantity I(S, X) as defined above is a non-negative real number, representing the mutual information between S and X in nats. The integrals in Eq. (1) run over all possible realizations of the random variables S and X. In our case, S and X represent stochastic trajectories and so the integrals become path integrals.
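The definition above can be made concrete with a small numerical example. The following sketch assumes nothing beyond Eq. (1) specialized to discrete variables; the channel parameters are purely illustrative:

```python
import math

def mutual_information(p_joint):
    """I(S; X) in nats from a joint distribution {(s, x): probability}."""
    p_s, p_x = {}, {}
    for (s, x), p in p_joint.items():
        p_s[s] = p_s.get(s, 0.0) + p   # marginal P(s)
        p_x[x] = p_x.get(x, 0.0) + p   # marginal P(x)
    return sum(p * math.log(p / (p_s[s] * p_x[x]))
               for (s, x), p in p_joint.items() if p > 0.0)

# Binary symmetric channel: uniform input, flip probability 0.1.
eps = 0.1
joint = {(0, 0): 0.5 * (1 - eps), (0, 1): 0.5 * eps,
         (1, 0): 0.5 * eps, (1, 1): 0.5 * (1 - eps)}
print(mutual_information(joint))  # ln 2 - H(0.1) ≈ 0.368 nats
```

The result, ln 2 minus the binary entropy of the flip probability, illustrates how overlapping output distributions P(x|s) reduce the transmitted information.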
In general, the mutual information can be decomposed into two terms, a conditional and a marginal entropy. Due to the symmetry of Eq. (1) with respect to exchange of S and X, this decomposition can be written as

I(S, X) = H(S) − H(S|X) = H(X) − H(X|S).

The (marginal) input entropy H(S) represents the total uncertainty about the input, and the conditional input entropy H(S|X) describes the remaining uncertainty of the input after having observed the output. Thus, the mutual information I(S, X) = H(S) − H(S|X) naturally quantifies the reduction in uncertainty about the input through the observation of the output. When analyzing data from experiments or simulations, however, the mutual information is generally estimated via I(S, X) = H(X) − H(X|S). This is because simulation or experimental data generally provide information about the distribution of outputs for a given input, rather than vice versa. The accessible entropies are thus the marginal output entropy H(X) and the conditional output entropy H(X|S), which are defined as

H(X) = −∫ dx P(x) ln P(x),
H(X|S) = −∫ ds P(s) ∫ dx P(x|s) ln P(x|s).
The conventional way of computing the mutual information involves generating many samples to obtain empirical distribution estimates for P(x|s) and P(x) via histograms. However, the number of samples needs to be substantially larger than the number of histogram bins to reduce the noise in the bin counts. Obtaining enough samples is effectively impossible for high-dimensional data, like signal trajectories. Moreover, any nonzero bin size leads to a systematic bias in the entropy estimates, even in one dimension [17]. These limitations of the conventional method make it impractical for high-dimensional data, highlighting the need for alternative approaches to accurately compute the mutual information for trajectories.

B. Direct PWS

The central idea of PWS is to compute probability densities for trajectories exactly, sidestepping the problem of having to estimate them via histograms. We exploit the fact that for systems described by a master equation, the conditional probability of an output trajectory for a given input trajectory can be computed analytically. With this insight we can derive a procedure to compute the mutual information. Specifically, we will show that

• for a system described by a master equation, the trajectory likelihood P[x|s] is a quantity that can be computed on the fly in a stochastic simulation;

• input trajectories can be generated from P[s], and output trajectories for a given input s can be generated according to P[x|s] using standard SSA (Gillespie) simulations;

• by combining the two ideas above, we can derive a direct Monte Carlo estimate for the mutual information I(S, X), as illustrated in Fig. 2.
Note that we denote trajectories by bold symbols to distinguish them from scalar quantities. Our technique is conceptually straightforward. Using Monte Carlo simulations we can compute averages over the configuration space of trajectories. Suppose we have a function f[z] that takes a trajectory z and produces a scalar value. The mean of f[z] with respect to the trajectory distribution P[z] is

⟨f[z]⟩_P[z] = ∫ D[z] P[z] f[z],

where we write ∫ D[z] to denote a path integral over all possible trajectories of a given duration. We estimate ⟨f[z]⟩_P[z] by generating a large number of trajectories z_1, ..., z_N from P[z] and evaluating the corresponding Monte Carlo average

⟨f[z]⟩_P[z] ≈ (1/N) Σ_{i=1}^N f[z_i],
which converges to the true mean in the limit N → ∞. Specifically, we want to estimate the conditional and the marginal entropy to compute the mutual information. Let us imagine that we generate N input trajectories s_1, ..., s_N from the distribution P[s]. Next, for every input s_i, we generate a set of K outputs x_{i,1}, ..., x_{i,K} from P[x|s_i]. Then, the Monte Carlo estimate for the conditional entropy is

H(X|S) ≈ −(1/N) Σ_{i=1}^N (1/K) Σ_{j=1}^K ln P[x_{i,j}|s_i].

Secondly, for a given output x we generate M inputs s′_1, ..., s′_M according to P[s]; then we can obtain a Monte Carlo estimate for the marginal probability of the output trajectory:

P[x] ≈ (1/M) Σ_{j=1}^M P[x|s′_j].  (8)

The estimate for the marginal entropy is then given by

H(X) ≈ −(1/N) Σ_{i=1}^N ln P[x_i] ≈ −(1/N) Σ_{i=1}^N ln [(1/M) Σ_{j=1}^M P[x_i|s′_j]],

where in the last step we inserted the result from Eq. (8).
In this estimate, the trajectories x_1, ..., x_N are sampled from P[x], i.e., by first sampling from P[s] and then from P[x|s]. Finally, the mutual information is obtained by taking the entropy difference, i.e., I(S, X) = H(X) − H(X|S).
While this is the main idea behind PWS, it is computationally advantageous to change the order of operations in the estimate. Specifically, computing the difference of two averages leads to large statistical errors. We can obtain an improved estimate by reformulating the mutual information as a single average of differences:

I(S, X) ≈ (1/N) Σ_{i=1}^N [ln P[x_i|s_i] − ln P[x_i]].  (10)

This equation applies to all variants of PWS. They differ, however, in the way P[x] is computed. In the brute-force version of PWS, called Direct PWS (DPWS), we use Eq. (8) to evaluate the marginal probability P[x]. DPWS thus involves two nested Monte Carlo computations, in which N pairs (s_i, x_i) are generated, and for each output x_i, M input trajectories {s} are generated from scratch to compute P[x_i]. In Section III below, we will present two additional variants of PWS in which the brute-force estimate of the marginal probability P[x] is replaced by more elaborate schemes. That said, DPWS is a conceptually simple, straightforward to implement, and exact scheme to compute the mutual information.
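The nested structure of DPWS can be summarized in a short sketch. The routines sample_input, sample_output, and log_likelihood are hypothetical placeholders for the model-specific machinery developed in the following sections; the estimator itself is the single average of differences of Eq. (10), with the marginal of Eq. (8) evaluated in log-space for numerical stability:

```python
import math

def logmeanexp(values):
    """Numerically stable ln of the mean of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def dpws_estimate(sample_input, sample_output, log_likelihood, N, M):
    """Direct PWS: average ln P[x|s] - ln P[x] over N pairs (s, x),
    estimating ln P[x] for each output by a nested average of the
    likelihood over M freshly drawn input trajectories."""
    total = 0.0
    for _ in range(N):
        s = sample_input()
        x = sample_output(s)              # (s, x) ~ P[s, x]
        log_cond = log_likelihood(x, s)   # ln P[x|s]
        log_marg = logmeanexp([log_likelihood(x, sample_input())
                               for _ in range(M)])  # ~ ln P[x]
        total += log_cond - log_marg
    return total / N
```

Note that the nested loop makes the cost scale as N × M likelihood evaluations, which is the computational bottleneck addressed by the variants in Section III.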
Having explained the core ideas of our technique above, we continue this section with a review of the concepts of master equations necessary to implement PWS. First, in Section II C, we derive the formula for the conditional probability P[x|s], which lies at the heart of our technique. In Sections II C and II D, we discuss how trajectories are generated according to P[x|s] and P[s], which are the remaining ingredients required for using DPWS. Then, in Section III, we present the two other variants of PWS that improve on DPWS.

C. Driven Markov Jump Process
Throughout this article we consider systems that can be modeled by a master equation and are being driven by a stochastic signal. The master equation specifies the time evolution of the conditional probability distribution P(x, t|x_0, t_0), which is the probability for the process to reach the discrete state x ∈ Ω at time t, given that it was at state x_0 ∈ Ω at the previous time t_0. The state space Ω is multi-dimensional if the system is made up of multiple components, and therefore x and x_0 can be vectors rather than scalar values. Denoting the transition rate at time t from state x to another state x′ ≠ x by w_t(x′, x), the master equation reads

dP(x, t)/dt = Σ_{x′≠x} [w_t(x, x′) P(x′, t) − w_t(x′, x) P(x, t)],

where, for brevity, we suppress the dependence on the initial condition, i.e., P(x, t) = P(x, t|x_0, t_0). By defining the transition rate matrix Q_t with off-diagonal elements Q_t(x′, x) = w_t(x′, x) and diagonal elements Q_t(x, x) = −Σ_{x′≠x} w_t(x′, x), the master equation takes the compact form

dP(x, t)/dt = Σ_{x′} Q_t(x, x′) P(x′, t).

Note that by definition the diagonal matrix element Q_t(x, x) is the negative exit rate from state x, i.e. the total rate at which probability flows away from state x.
Using the master equation we can compute the probability of any trajectory. A trajectory x is defined by a list of jump times t_1, ..., t_{n−1}, together with a sequence of system states x_0, ..., x_{n−1}. The trajectory starts at time t_0 in state x_0 and ends at time t_n in state x_{n−1}, such that its duration is T = t_n − t_0. At each time t_i (for i = 1, ..., n−1) the trajectory describes an instantaneous jump x_{i−1} → x_i. The probability density of x is a product of the probability of the initial state P(x_0), the rates of the n−1 transitions Q_{t_i}(x_i, x_{i−1}), and the survival probabilities for the waiting times between jumps, given by exp(∫_{t_{i−1}}^{t_i} dt Q_t(x_{i−1}, x_{i−1})) for i = 1, ..., n:

P[x] = P(x_0) ∏_{i=1}^{n−1} Q_{t_i}(x_i, x_{i−1}) ∏_{i=1}^{n} exp(∫_{t_{i−1}}^{t_i} dt Q_t(x_{i−1}, x_{i−1})).  (13)

Computing the Likelihood P[x|s]
To compute the likelihood, or conditional probability, P[x|s] of an output trajectory x for a given input trajectory s, we note that the input determines the time-dependent stochastic dynamics of the jump process. Indeed, the transition rates at time t, given by Q_t(x′, x; s), depend explicitly on the input s(t) at time t and may even depend on the entire history of s prior to t.
In the common case that every input trajectory s leads to a unique transition rate matrix Q_t(x′, x; s), i.e. the map s → Q_t(•, •; s) is injective, the likelihood is directly given by Eq. (13):

P[x|s] = P(x_0|s_0) ∏_{i=1}^{n−1} Q_{t_i}(x_i, x_{i−1}; s) ∏_{i=1}^{n} exp(∫_{t_{i−1}}^{t_i} dt Q_t(x_{i−1}, x_{i−1}; s)),  (14)

where P(x_0|s_0) is the probability of the initial state x_0 of the output given the initial state of the input s_0 = s(t_0).
The evaluation of the trajectory likelihood is at the heart of our Monte Carlo scheme. However, numerically computing a large product like Eq. (14) very quickly reaches the limits of floating point arithmetic, since the result is often either too large or too close to zero to be representable as a floating point number. Thus, to avoid numerical issues, it is vital to perform the computations in log-space, i.e. to compute

ln P[x|s] = ln P(x_0|s_0) + Σ_{i=1}^{n−1} ln Q_{t_i}(x_i, x_{i−1}; s) + Σ_{i=1}^{n} ∫_{t_{i−1}}^{t_i} dt Q_t(x_{i−1}, x_{i−1}; s).  (15)

The computation of the log-likelihood ln P[x|s] for given trajectories s and x according to Eq. (15) proceeds as follows:

• At the start of the trajectory we compute the log-probability of the initial condition ln P(x_0|s_0),

• for every jump x_{i−1} → x_i in x we compute the log jump propensity ln Q_{t_i}(x_i, x_{i−1}; s), and

• for every interval (t_{i−1}, t_i) of constant output value x(t) = x_{i−1} between two jumps of x, we compute the integral ∫_{t_{i−1}}^{t_i} dt Q_t(x_{i−1}, x_{i−1}; s). This integral can be performed using standard numerical methods such as the trapezoidal rule, which is also exact if Q_t(x(t), x(t); s) is a piecewise linear function of t, as in our examples in Section V.
The sum of the three contributions above yields the exact log-likelihood ln P[x|s] as given in Eq. (15). Thus, notably, the algorithm to compute the log-likelihood ln P[x|s] is both efficient and straightforward to implement, being closely related to the standard Gillespie algorithm. The only quantity in Eq. (15) that cannot be directly obtained from the master equation is the log-probability of the initial state, ln P(x_0|s_0).
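As a minimal illustration of the algorithm above, the following sketch accumulates ln P[x|s] from the three contributions of Eq. (15). The interface (a list of jump times with post-jump states, plus user-supplied callables for the log propensity and the exit rate) is a hypothetical choice for illustration; the waiting-time integrals use the trapezoidal rule, which is exact for piecewise linear rates:

```python
import math

def trapz(f, a, b, n=100):
    """Trapezoidal rule for the integral of f over [a, b]."""
    if b <= a:
        return 0.0
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + k * h) for k in range(1, n))
    return total * h

def log_likelihood(logp_x0, x0, jumps, log_propensity, exit_rate, t0, T):
    """ln P[x|s]: initial-state log-probability, plus a log propensity
    per jump, minus the integrated exit rate over each waiting interval.
    `jumps` is a list of (t_i, x_i) pairs; `log_propensity(t, xa, xb)`
    returns ln Q_t(xb, xa; s) and `exit_rate(t, x)` returns -Q_t(x, x; s)."""
    logL, t_prev, x = logp_x0, t0, x0
    for t_i, x_new in jumps:
        logL -= trapz(lambda t: exit_rate(t, x), t_prev, t_i)  # survival
        logL += log_propensity(t_i, x, x_new)                  # jump
        t_prev, x = t_i, x_new
    logL -= trapz(lambda t: exit_rate(t, x), t_prev, T)        # final interval
    return logL
```

For a simple Poisson process with constant rate λ and n jumps in [0, T], this reduces to n ln λ − λT, which provides a convenient sanity check.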
Our scheme can be applied to any system with a well-defined (non-equilibrium) initial distribution P(s_0, x_0), as specified by, e.g., the experimental setup. Most commonly, though, one is interested in studying information transmission for systems in steady state. Then, the initial condition P(s_0, x_0) is the stationary distribution of the Markov process. Depending on the complexity of the system, this distribution can be found either analytically from the master equation [51,52] (possibly using simplifying approximations [53,54]), or computationally from stochastic simulations [55].

Sampling from P[x|s]
Standard kinetic Monte Carlo simulations naturally produce exact samples of the probability distribution P[x|s] as defined in Eq. (14). That is, for any signal trajectory s and initial state x_0 drawn from P(x_0|s_0), we can use the Stochastic Simulation Algorithm (SSA) or variants thereof to generate a corresponding trajectory x. The SSA propagates the initial condition x_0, t_0 forward in time according to the transition rate matrix Q_t(•; s). In the standard Direct SSA algorithm [55] this is done by alternatingly sampling the waiting time until the next transition and then selecting the actual transition.
The transition rates Q_t(x′, x; s) of a driven master equation are necessarily time-dependent, since they include the coupling of the jump process to the input trajectory s, which itself varies in time. While most treatments of the SSA assume that the transition rates are constant in time, this restriction is easily lifted. Consider step i of the Direct SSA, which generates the next transition time t_{i+1} = t_i + Δt_i. For time-varying transition rates the distribution of the stochastic waiting time Δt_i is characterized by the survival function

S_i(Δt) = exp(∫_{t_i}^{t_i+Δt} dt Q_t(x_i, x_i; s)).  (17)

The waiting time can be sampled using inverse transform sampling, i.e. by generating a uniformly distributed random number u ∈ [0, 1] and computing the waiting time using the inverse survival function Δt_i = S_i^{−1}(u). Numerically, computing the inverse of the survival function requires solving the equation

S_i(Δt_i) = u  (18)

for the waiting time Δt_i. Depending on the complexity of Q_t(x_i, x_i; s), this equation can be solved either analytically or numerically, e.g. using Newton's method. Hence, this method to generate stochastic trajectories is only truly exact if we can solve Eq. (18) analytically, as in the example in Section V A. Additionally, in some cases more efficient variants of the SSA with time-dependent rates could be used [56,57].
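For the special case of a piece-wise constant total exit rate, the inverse of the survival function can be computed exactly by stepping through the rate breakpoints. The sketch below illustrates this; the segment-list interface is a hypothetical choice for illustration, not part of any standard SSA implementation:

```python
import math
import random

def sample_waiting_time(t_i, segments):
    """Draw a waiting time dt whose survival probability is
    S(dt) = exp(-integral of the total exit rate over [t_i, t_i + dt]),
    for a piece-wise constant rate.  `segments` is a list of
    (t_start, rate) pairs; the last rate extends to infinity."""
    target = -math.log(1.0 - random.random())   # solve S(dt) = u exactly
    acc, t = 0.0, t_i
    for k, (t_k, rate) in enumerate(segments):
        t_end = segments[k + 1][0] if k + 1 < len(segments) else math.inf
        if t_end <= t:
            continue                          # segment lies before t_i
        start = max(t, t_k)
        if rate > 0.0:
            needed = (target - acc) / rate
            if start + needed <= t_end:
                return start + needed - t_i   # jump falls in this segment
            acc += rate * (t_end - start)
        t = t_end
    return math.inf   # rate vanished before the target was reached
```

For a single segment of constant rate this reduces to the familiar exponential waiting time of the Direct SSA.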

D. Input Statistics
For our mutual information estimate, we need to be able to draw samples from the input distribution P[s].
Our algorithm poses no restrictions on P[s] other than the possibility to generate sample trajectories.
For example, the input signal may be described by a continuous-time jump process, as in Section V A. One benefit is that it is possible to generate exact realizations of such a process (using the SSA) and to exactly compute the likelihood P[x|s] using Eq. (15). Specifically, the likelihood can be evaluated exactly because the transition rates Q_t(•, •; s) for any input trajectory s, while time-dependent, are piece-wise constant. This implies that the integral in Eq. (15) can be evaluated analytically without approximations. Similarly, for piece-wise constant transition rates, the inverse function of Eq. (18) can be evaluated directly, such that we can sample exact trajectories from the driven jump process. As a result, when both input and output are described by a master equation, PWS is a completely exact Monte Carlo scheme to compute the mutual information.
However, the techniques described here do not require the input signal s to be described by a continuous-time jump process, or even to be Markovian. The input signal can be any stochastic process for which trajectories can be generated numerically. This includes continuous stochastic processes that are found as solutions to stochastic differential equations [58]. The application in Section V B provides an example.

III. VARIANTS OF PWS
The DPWS scheme presented in the previous section makes it possible to compute the mutual information between trajectories of a stochastic input process and the output of a Markov jump process. However, the number of possible trajectories increases exponentially with trajectory length, leading to a corresponding increase in the variance of the DPWS estimate. Hence, for long trajectories the DPWS estimate may prove computationally too expensive. To address this issue, we describe two improved variants of PWS in this section, both based on free-energy estimators from statistical physics.

A. Marginalization Integrals in Trajectory Space
The computationally most expensive part of our scheme in Section II B is the evaluation of the marginalization integral P[x_i] = ∫ D[s] P[s, x_i], which needs to be performed for every sample x_1, ..., x_N. Consequently, the computational efficiency of this marginalization is essential for the overall performance.
Marginalization is a general term for an operation in which one or more variables are integrated out of a joint probability distribution. For instance, we obtain the marginal probability distribution P[x] from P[s, x] by computing the integral

P[x] = ∫ D[s] P[s, x].  (19)

In DPWS, we use Eq. (8) to compute P[x], which involves generating independent input trajectories from P[s]. However, this is not the optimal Monte Carlo technique to perform the marginalization. The generated input trajectories are independent of the output trajectory x. Thus, we ignore the causal connection between s and x, and we typically end up sampling trajectories s⋆ whose likelihoods P[x|s⋆] are very small. Then, most sampled trajectories have small integral weights, and only very few samples provide a significant contribution to the average. The variance of the result is then very large because the effective sample size is much smaller than the total sample size. The use of P[s] as the sampling distribution is thus only practical in cases where the dependence of the output on the input is not too strong. It follows, perhaps paradoxically, that this sampling scheme works best when the mutual information is not too large [59]. This is a well-known Monte Carlo sampling problem, and a large number of techniques have been developed to solve it. The two variants of our scheme, RR-PWS and TI-PWS, both make use of ideas from statistical physics for the efficient computation of free energies.
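The weight-degeneracy problem described above can be quantified with the effective sample size (ESS). The toy example below, with a Gaussian input distribution and a Gaussian likelihood chosen purely for illustration, shows how the ESS collapses as the coupling between input and output tightens:

```python
import math
import random

def effective_sample_size(log_weights):
    """ESS = (sum w)^2 / sum w^2, computed stably from log-weights."""
    m = max(log_weights)
    w = [math.exp(v - m) for v in log_weights]
    return sum(w) ** 2 / sum(wi * wi for wi in w)

random.seed(1)
x = 0.0                                                   # a fixed "output"
inputs = [random.gauss(0.0, 1.0) for _ in range(1000)]    # s ~ P[s]

def ess_for(sigma):
    # Log-weights ln P[x|s] for a Gaussian likelihood of width sigma.
    return effective_sample_size([-(x - s) ** 2 / (2.0 * sigma ** 2)
                                  for s in inputs])

# Tight input-output coupling (small sigma) leaves few effective samples.
print(ess_for(1.0), ess_for(0.05))
```

With 1000 prior samples, the loose coupling retains most of them as effective samples, while the tight coupling concentrates nearly all the weight on the few inputs that happen to lie close to the output.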
To understand how we can make use of these ideas to compute the marginal probability P[x], it is convenient to rephrase the marginalization integral in Eq. (19) in the language of statistical physics. In this language, P[x] corresponds to the normalization constant, or partition function, of a Boltzmann distribution for the potential

U[s, x] = −ln P[s, x].  (20)

In Eq. (20), we interpret s as a variable in the configuration space whereas x is an auxiliary variable, i.e. a parameter. Note that both s and x still represent trajectories. For this potential, the partition function is given by

Z[x] = ∫ D[s] exp(−U[s, x]).

The integral only runs over the configuration space, i.e. we integrate only with respect to s but not x, which remains a parameter of the partition function. The partition function is precisely equal to the marginal probability of the output, i.e. Z[x] = P[x], as can be verified by inserting the expression for the potential U[s, x]. Further, the free energy is given by

F[x] = −ln Z[x] = −ln P[x],

which shows that the computation of the free energy of the trajectory ensemble corresponding to U[s, x] is equivalent to the computation of (the logarithm of) the marginal probability P[x].
Note that above we omitted any factors of k_B T, since temperature is irrelevant here. Also note that while the distribution exp(−U[s, x]) looks like the equilibrium distribution of a canonical ensemble from statistical mechanics, this does not imply that we can only study systems in thermal equilibrium. Indeed, PWS is used to study information transmission in systems driven out of equilibrium by the input signal. Thus, the notation introduced in this section is nothing but a mathematical reformulation of the marginalization integral to make the analogy to statistical physics apparent, and we assign no additional meaning to the potentials and free energies introduced here.
In statistical physics it is well known that the free energy cannot be directly measured from a simulation. Instead, one estimates the free-energy difference between the system and a reference system with known free energy F_0[x]. The reference system is described by the potential U_0[s, x] with the corresponding partition function Z_0[x]. In our case, a natural choice of reference potential is

U_0[s, x] = −ln P[s],

with the corresponding partition function

Z_0[x] = ∫ D[s] exp(−U_0[s, x]) = ∫ D[s] P[s] = 1.

Since P[s] is a normalized probability density function, the reference free energy is zero (F_0[x] = −ln Z_0[x] = 0). Hence, for the above choice of reference system, the free-energy difference is

ΔF[x] = F[x] − F_0[x] = −ln P[x].  (23)

Note that in our case the reference potential U_0[s, x] = −ln P[s] does not depend on the output trajectory x, i.e. U_0[s, x] ≡ U_0[s]. It describes a non-interacting version of our input-output system where the input trajectories evolve independently of the fixed output trajectory x.
What is the interaction between the output x and the input trajectory ensemble? We define the interaction potential ΔU[s, x] through

U[s, x] = U_0[s] + ΔU[s, x].  (27)

The interaction potential makes it apparent that the distribution of s corresponding to the potential U[s, x] is biased by x with respect to the distribution corresponding to the reference potential U_0[s]. By inserting the expressions for U_0[s] and U[s, x] into Eq. (27) we see that

ΔU[s, x] = −ln P[x|s],

where the log-likelihood ln P[x|s] is given by Eq. (15). This expression illustrates that the interaction of the output trajectory x with the ensemble of input trajectories is characterized by the trajectory likelihood P[x|s]. Because we can compute the trajectory likelihood from the master equation, we can compute the interaction potential.
In this section we have introduced notation (summarized in Table I) to show that computing a marginalization integral is equivalent to the computation of a free-energy difference. This picture allows us to distinguish two input trajectory ensembles: the non-interacting ensemble distributed according to exp(−U_0[s]) = P[s], and the interacting ensemble with input distribution proportional to exp(−U[s, x]) ∝ P[s|x]. For example, the brute-force estimate of P[x] used in DPWS can be written as

P[x] = ⟨e^{−∆U[s,x]}⟩_0,   (29)

where the notation ⟨···⟩_0 refers to an average with respect to the non-interacting ensemble. By inserting the expressions for U_0 and ∆U, it is easy to verify that this estimate is equivalent to Eq. (8). As explained in Section II B, to compute Eq. (29) using Monte Carlo, it is only necessary to sample from the non-interacting system U_0[s] and to compute the Boltzmann weight ∆U[s, x] (i.e. to sample from P[s] and to compute the log-likelihood ln P[x|s]). This is indeed the DPWS scheme. However, by noting the correspondence between signal trajectories and polymers, and that Eq. (23) has the same form as the expression for the (excess) chemical potential of a polymer, i.e. the free-energy difference between the polymer of interest and the ideal chain [43, 60], more efficient schemes can be developed, as we show next.
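To make the correspondence concrete, the following minimal Python sketch (the published PWS code is in Julia; this toy example is ours) computes a marginal probability by brute-force Monte Carlo in the statistical-physics notation above. For testability, the "trajectories" are replaced by scalar Gaussian variables, so the exact answer is known; the structure — sampling s from the non-interacting ensemble P(s) and averaging the Boltzmann weight e^{−∆U} = P(x|s) — is the same as in DPWS.

```python
import numpy as np

rng = np.random.default_rng(0)

SIGMA = 0.5  # noise strength of the toy "output channel" P(x|s)

def log_likelihood(x, s):
    # ln P(x|s): scalar stand-in for the trajectory log-likelihood, here Gaussian
    return -0.5 * ((x - s) / SIGMA) ** 2 - np.log(SIGMA * np.sqrt(2.0 * np.pi))

def dpws_log_marginal(x, M=100_000):
    # brute-force estimate ln P(x) = ln <exp(-dU[s,x])>_0 with dU = -ln P(x|s),
    # sampling s from the non-interacting ensemble P(s) = N(0, 1)
    s = rng.standard_normal(M)
    logw = log_likelihood(x, s)                   # Boltzmann weights -dU
    return np.logaddexp.reduce(logw) - np.log(M)  # numerically stable log-mean

x = 1.3
# exact marginal for this toy: P(x) = N(0, 1 + SIGMA^2)
exact = -0.5 * x**2 / (1.0 + SIGMA**2) - 0.5 * np.log(2.0 * np.pi * (1.0 + SIGMA**2))
estimate = dpws_log_marginal(x)
print(estimate, exact)  # the estimate converges to the exact ln P(x)
```

The log-sum-exp accumulation mirrors how ln P[x] rather than P[x] is accumulated in practice, which avoids underflow for long trajectories.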

B. RR-PWS
In Rosenbluth-Rosenbluth PWS we compute the free-energy difference ∆F between the ideal system U_0 and U in a single simulation, just like in the brute-force method. However, instead of generating s trajectories in an uncorrelated fashion according to exp(−U_0[s]) = P[s], we bias our sampling distribution towards exp(−U[s, x]) ∝ P[s|x] to reduce the sampling problems found in DPWS.
The classical scheme for biasing the sampling distribution in polymer physics is due to Rosenbluth and Rosenbluth [61] in their study of self-avoiding chains. A substantial improvement of the Rosenbluth algorithm was achieved by Grassberger, by generating polymers using pruning and enrichment steps, thereby eliminating configurations that do not significantly contribute to the average. This scheme is known as the pruned-enriched Rosenbluth method, or PERM [44]. While PERM is much more powerful than the standard Rosenbluth algorithm, its main drawback is that it requires careful tuning of the pruning and enrichment schedule to achieve optimal convergence. Therefore we have opted to use a technique that is similar in spirit to PERM but requires less tuning, the bootstrap particle filter [62]. We will describe how to use PWS with a particle filter below. That said, we want to stress that the particle filter can easily be replaced by PERM or other related methods [63]. Schemes inspired by variants of Forward Flux Sampling [64, 65] could also be developed.
In the methods discussed above, a polymer is grown monomer by monomer. In a continuous-time Markov process, this translates to trajectories being grown segment by segment. To define the segments, we introduce a time discretization 0 < τ_1 < τ_2 < ··· < τ_{n−1} < T. Thus, each trajectory s of duration T consists of n segments, where we denote the segment between τ_i and τ_j by s_{[i,j]} (we define τ_0 = 0 and τ_n = T). The particle filter uses the following procedure to grow an ensemble of trajectories segment by segment:

1. Generate M starting points s^1_0, …, s^M_0 according to the initial condition of the input signal P(s_0).

2. Iterate for i = 1, …, n:

(a) Propagate each trajectory (or, in the first iteration, each starting point) forward in time from τ_{i−1} to τ_i. Propagation is performed according to the natural dynamics of s, i.e. generating a new segment s^k_{[i−1,i]} with probability P[s^k_{[i−1,i]} | s^k_{i−1}] for k = 1, …, M.
(b) Compute the Boltzmann weight of each new segment. The Boltzmann weight ∆U^k_i of a segment from τ_{i−1} to τ_i can be expressed as

e^{−∆U^k_i} = P[x_{[i−1,i]} | x_{i−1}, s^k_{[i−1,i]}],   (33)

see Eq. (28), and is therefore straightforward to compute from the master equation.
(c) Sample M times from the distribution

p(ℓ_i = k) = e^{−∆U^k_i} / (M w_i),   (34)

where the Rosenbluth weight w_i is defined as

w_i = (1/M) Σ_{k=1}^M e^{−∆U^k_i}.

This sampling procedure yields M randomly drawn indices ℓ^1_i, …, ℓ^M_i. Each ℓ^k_i is an index that lies in the range 1, …, M and that points to one of the trajectories that have been generated up to τ_i. To continue the sampling procedure, we relabel the indices such that the resampled set of trajectories, defined by s̃^k_{[0,i]} ← s^{ℓ^k_i}_{[0,i]}, is subsequently used as the input for the next iteration of the algorithm.
The normalized Rosenbluth factor of the final ensemble is then given by

W[x] = Π_{i=1}^n w_i.

As shown in Appendix A, we can derive an unbiased estimate for the desired ratio based on the Rosenbluth factor,

P[x] ≈ P(x_0) Π_{i=1}^n w_i,

with P(x_0) being the probability of the initial output x_0. The particle filter can therefore be integrated into the DPWS algorithm to compute the marginal density P[x], substituting the brute-force estimate given in Eq. (8).
We call the resulting algorithm to compute the mutual information RR-PWS.
We now provide an intuitive explanation for the scheme presented above. First note that steps 1 and 2(a) of the procedure above involve just propagating M trajectories in parallel, according to P[s] = exp(−U_0[s]). The interesting steps are 2(b-c), where we eliminate or duplicate some of the trajectories according to the Boltzmann weights of the most recent segment. Note that in general the list of indices (ℓ^1_i, …, ℓ^M_i) sampled in step 2(c) will contain duplicates (ℓ^k_i = ℓ^{k′}_i for k ≠ k′), thus cloning the corresponding trajectory. Concomitantly, the indices ℓ^1_i, …, ℓ^M_i may not include every original index 1, …, M, thereby eliminating some trajectories. Since indices of trajectories with high Boltzmann weight are more likely to be sampled from Eq. (34), this ensures that we only spend computational effort on propagating those trajectories whose Boltzmann weight is not too small. Hence, at its heart the particle filter is an algorithm for producing samples that tend to be distributed according to exp(−U_0[s]) exp(−∆U[s, x]) = exp(−U[s, x]), i.e. according to the Boltzmann distribution of the interacting ensemble; see also Appendix A. For illustration of the algorithm, one iteration of the particle filter is presented schematically in Fig. 3.
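The following Python sketch illustrates the particle-filter estimate of ln P[x] on a toy two-state hidden Markov model (our own stand-in, not the paper's master-equation setting), where the exact answer is available from the forward algorithm for comparison. Steps 1 and 2(a-c) of the procedure above map directly onto the commented lines.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in: the input "trajectory" s is a 2-state Markov chain observed
# through a noisy emission; each observation plays the role of one segment.
A = np.array([[0.9, 0.1], [0.2, 0.8]])   # transition matrix P(s_i | s_{i-1})
B = np.array([[0.8, 0.2], [0.3, 0.7]])   # emission matrix P(x_i | s_i)
p0 = np.array([0.5, 0.5])                # initial distribution P(s_0)
x = np.array([0, 0, 1, 1, 0, 1, 0, 0])   # a fixed output sequence

def exact_log_marginal(x):
    # forward algorithm: exact ln P[x] for comparison
    alpha = p0 * B[:, x[0]]
    for obs in x[1:]:
        alpha = (alpha @ A) * B[:, obs]
    return np.log(alpha.sum())

def particle_filter_log_marginal(x, M=50_000):
    # ln P[x] accumulates as the sum of log Rosenbluth weights ln w_i
    s = rng.choice(2, size=M, p=p0)                  # step 1: sample P(s_0)
    log_p = 0.0
    for obs in x:
        w = B[s, obs]                                # step 2(b): segment weights
        log_p += np.log(w.mean())                    # w_i = mean Boltzmann weight
        idx = rng.choice(M, size=M, p=w / w.sum())   # step 2(c): resample by weight
        s = s[idx]
        s = (rng.random(M) < A[s, 1]).astype(np.int64)  # step 2(a): propagate
    return log_p

print(particle_filter_log_marginal(x), exact_log_marginal(x))  # nearly equal
```

The product of per-segment mean weights is the unbiased marginal-likelihood estimate of the bootstrap particle filter, which is exactly the role of the Rosenbluth factor above.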
For the efficiency of the particle filter it is important to carefully choose the number of segments n. When the segments are very short (n large), the accumulated weights (Eq. (33)) tend to differ very little between the newly generated segments s^k_{[i−1,i]}. Hence, the pruning and enrichment of the segments is dominated by noise. In contrast, when the segments are very long, the distribution of Boltzmann weights ∆U^k_i becomes very wide. Then only few segments contribute substantially to the corresponding Rosenbluth weight w_i. Hence, to carefully choose n, we need a measure that quantifies the variance of the trajectory weights. To this end, we follow Martino et al. [66] and introduce an effective sample size (ESS),

M^(eff)_i = (Σ_{k=1}^M e^{−∆U^k_i})² / Σ_{k=1}^M e^{−2∆U^k_i},

which lies in the range 1 ≤ M^(eff)_i ≤ M: M^(eff)_i = 1 if one trajectory has a much higher weight than all the others, and M^(eff)_i = M if all trajectories have the same weight. As a rule of thumb, we resample only when M^(eff)_i drops below M/2. Additionally, as recommended in Ref. [67], we use the systematic sampling algorithm to randomly draw the indices in step 2(c), which helps to reduce the variance; we find, however, that the improvement over simple sampling is very minor. Using these techniques, the only parameter that needs to be chosen by hand for the particle filter is the ensemble size M.
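A minimal sketch of the two ingredients just described — the effective sample size of Martino et al. and systematic resampling. The implementations below are standard textbook versions, not taken from the paper's code.

```python
import numpy as np

def effective_sample_size(w):
    # ESS = (sum_k w_k)^2 / sum_k w_k^2, which lies between 1 and M = len(w)
    w = np.asarray(w, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()

def systematic_resample(w, rng):
    # draw M indices from a single uniform offset on a regular grid;
    # lower variance than M independent categorical draws
    w = np.asarray(w, dtype=float)
    M = len(w)
    positions = (rng.random() + np.arange(M)) / M
    return np.searchsorted(np.cumsum(w / w.sum()), positions)

rng = np.random.default_rng(2)
print(effective_sample_size([1.0, 1.0, 1.0, 1.0]))       # equal weights: ESS = M
print(effective_sample_size([100.0, 1e-6, 1e-6, 1e-6]))  # one dominant weight: ESS ~ 1
print(systematic_resample([0.5, 0.5], rng).tolist())     # equal weights keep both particles
```

The rule of thumb from the text then reads: resample (and reset the weights) only when `effective_sample_size(w) < M / 2`.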

C. TI-PWS
Our third scheme, thermodynamic integration PWS (TI-PWS), is based on the analogy of marginalization integrals with free-energy computations. As before, we view the problem of computing the marginal probability P[x] as equivalent to that of computing the free-energy difference between ensembles defined by the potentials U_0[s, x] and U[s, x], respectively. For TI-PWS, we define a potential U_θ[s, x] with a continuous parameter θ ∈ [0, 1] that allows us to transform the ensemble from U_0 to U = U_1. The corresponding partition function is

Z_θ[x] = ∫ Ds e^{−U_θ[s,x]}.

For instance, for 0 ≤ θ ≤ 1, we can define our potential as

U_θ[s, x] = U_0[s] + θ ∆U[s, x],   (40)

such that e^{−U_θ[s,x]} = P[s] P[x|s]^θ. Note that this is the simplest choice for a continuous transformation between U_0 and U_1, but by no means the only one. For reasons of computational efficiency, it can be beneficial to choose a different path between U_0 and U_1, depending on the specific system [47]. Here we will not consider other paths, however, and derive the thermodynamic integration estimate for the potential given in Eq. (40).
To derive the thermodynamic integration estimate for the free-energy difference, we first compute the derivative of ln Z_θ[x] with respect to θ:

d/dθ ln Z_θ[x] = −(1/Z_θ[x]) ∫ Ds ∆U[s, x] e^{−U_θ[s,x]} = −⟨∆U[s, x]⟩_θ.   (41)

Thus, the derivative of ln Z_θ[x] is an average of the Boltzmann weight with respect to P_θ[s|x], which is the ensemble distribution of s given by

P_θ[s|x] = e^{−U_θ[s,x]} / Z_θ[x].   (42)

Integrating Eq. (41) with respect to θ leads to the formula for the free-energy difference

∆F[x] = −ln Z_1[x] + ln Z_0[x] = ∫_0^1 dθ ⟨∆U[s, x]⟩_θ,   (43)

which is the fundamental identity underlying thermodynamic integration. To compute the free-energy difference using Eq. (43), we evaluate the θ-integral numerically using Gaussian quadrature, while the inner average ⟨∆U[s, x]⟩_θ is computed using MCMC simulations. To perform MCMC simulations in trajectory space we use ideas from transition path sampling (TPS) [49]. As discussed in Appendix B, the efficiency of MCMC samplers strongly depends on the proposal moves that are employed. While better proposal moves could be conceived, we only use the forward-shooting and backward-shooting moves of TPS [49] to obtain the results in Section V. These moves regrow either the end or the beginning of a trajectory, respectively. A proposal is accepted according to the Metropolis criterion [68].
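The following toy Python sketch mirrors the TI recipe on a scalar example of our own devising, where ∆F is known exactly: U_0(s) = s²/2 and ∆U(s) = s²/2, so ∆F = ½ ln 2. The θ-integral is evaluated by Gauss-Legendre quadrature and each ⟨∆U⟩_θ by a random-walk Metropolis sampler, which stands in for the shooting moves used in trajectory space.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy scalar stand-in: exp(-U0) is the standard normal and dU(s) = s^2/2,
# so the exact free-energy difference is dF = 0.5 * ln 2.
def dU(s):
    return 0.5 * s ** 2

def mean_dU_at_theta(theta, n_steps=40_000):
    # <dU>_theta via random-walk Metropolis in the ensemble exp(-U0 - theta*dU)
    proposals = rng.normal(size=n_steps)
    log_u = np.log(rng.random(n_steps))
    s, samples = 0.0, np.empty(n_steps)
    for i in range(n_steps):
        s_new = s + proposals[i]
        d_logp = (-0.5 * s_new**2 - theta * dU(s_new)) - (-0.5 * s**2 - theta * dU(s))
        if log_u[i] < d_logp:  # Metropolis acceptance criterion
            s = s_new
        samples[i] = dU(s)
    return samples[n_steps // 10:].mean()  # discard 10% burn-in

# dF = integral_0^1 <dU>_theta dtheta via 5-point Gauss-Legendre mapped to [0, 1]
nodes, weights = np.polynomial.legendre.leggauss(5)
thetas = 0.5 * (nodes + 1.0)
dF = 0.5 * sum(wt * mean_dU_at_theta(t) for wt, t in zip(weights, thetas))
print(dF, 0.5 * np.log(2.0))  # the TI estimate approaches the exact value
```

Because ⟨∆U⟩_θ varies smoothly in θ for this choice of path, a handful of quadrature nodes suffices; in trajectory space, each node requires its own MCMC simulation.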

IV. INTEGRATING OUT INTERNAL COMPONENTS
So far, the output trajectory x has been considered to correspond to the trajectory of the system in the full state space Ω. Concomitantly, the method presented is a scheme for computing the mutual information between the input signal s and the trajectory x, comprising the time evolution of all n components in the system, X_1, X_2, …, X_n. Each component X_i itself has a corresponding trajectory x_i, such that the full trajectory can be represented as a vector x = (x_1, …, x_n). It is also the conditional probability P[x|s] = P[x_1, …, x_n|s] and the marginal probability P[x] = P[x_1, …, x_n] of this vector in the full state space that can be directly computed from the master equation. In fact, it is this vector, which captures the states of all the components in the system, that carries the most information on the input signal s, and thus has the largest mutual information. However, typically the downstream system cannot read out the states of all the components X_1, …, X_n. Often, the downstream system reads out only a few components, or even just one component, the "output component" X_r. The other components then mainly serve to transmit the information from the input s to this readout X_r. From the perspective of the downstream system, the other components are hidden. The natural quantity to measure the precision of information processing is then the mutual information I(S; X_r) between the input s and the output component's trajectory x_r, not I(S; X). The question then becomes how to compute P[x_r] and P[x_r|s], from which I(S; X_r) can be obtained. Here, we present a scheme to achieve this.
As an example, consider a chemical reaction network with species X_1, …, X_n. Without loss of generality, we will assume that the n-th component is the output component, X_r = X_n. The other species X_1, …, X_{n−1} are thus not part of the output, but only relay information from the input signal s to the output signal x_n. To determine the mutual information I(S; X_n) we need P[x_n|s], where x_n is the trajectory of only the readout component X_n. However, from the master equation we can only obtain an expression for the full conditional probability P[x_1, …, x_n|s] of all components. To compute the value of P[x_n|s], we must perform the marginalization integral

P[x_n|s] = ∫ Dx_1 ··· ∫ Dx_{n−1} P[x_1, …, x_n | s].   (44)

We can compute this integral using a Monte Carlo scheme as described below, and use the resulting estimate for P[x_n|s] to compute the mutual information using our technique presented in Section II B.
The marginalization of Eq. (44) entails integrating out degrees of freedom from a known joint probability distribution. In Eq. (8) we solved the analogous problem of obtaining the marginal probability P[x] by integrating out the input trajectories through the integral P[x] = ∫ Ds P[s, x] = ∫ Ds P[s] P[x|s]. As described in Section II B, the integral from Eq. (8) can be computed via a Monte Carlo estimate by sampling many input trajectories s_i from P[s] and taking the average of the corresponding conditional probabilities P[x|s_i]. We will show that in the case where there is no feedback from the readout component back to the other components, a completely analogous Monte Carlo estimate can be derived for Eq. (44). We describe this below. Additionally, in the absence of feedback, the techniques presented in Section III above can be employed to develop computationally more efficient schemes.
More specifically, we can evaluate Eq. (44) via a direct Monte Carlo estimate under the condition that the stochastic dynamics of the other components X_1, …, X_{n−1} are not influenced by X_n (i.e., no feedback from the readout). Using the identity

P[x_1, …, x_n | s] = P[x_n | x_1, …, x_{n−1}, s] P[x_1, …, x_{n−1} | s]   (45)

to rewrite the integrand in Eq. (44), we are able to represent the conditional probability P[x_n|s] as an average over the readout component's trajectory probability,

P[x_n|s] = ⟨P[x_n | x_1, …, x_{n−1}, s]⟩_{P[x_1,…,x_{n−1}|s]}.   (46)

Thus, assuming that we can evaluate the conditional probability of the readout given all the other components, P[x_n | x_1, …, x_{n−1}, s], we arrive at the estimate

P[x_n|s] ≈ (1/M) Σ_{i=1}^M P[x_n | x^i_1, …, x^i_{n−1}, s],   (47)

where the samples x^i_1, …, x^i_{n−1} for i = 1, …, M are drawn from P[x_1, …, x_{n−1} | s]. Notice that the derivation of this Monte Carlo estimate is fully analogous to the estimate in Eq. (8), but instead of integrating out the input trajectory s we integrate out the component trajectories x_1, …, x_{n−1}.
To obtain P[x_n | x_1, …, x_{n−1}, s] in Eqs. (46) and (47), we note that, in the absence of feedback, we can describe the stochastic dynamics of the readout component X_n as a jump process with time-dependent transition rates, whose time dependence arises from the trajectories of the other components x_1, …, x_{n−1} and the input signal s. In effect, this is a driven jump process for X_n, driven by all upstream components X_1, …, X_{n−1} and the input signal. Specifically, denoting by u = (x_1, …, x_{n−1}, s) the joint trajectory representing the history of all upstream components as well as the input signal, we can, as explained in Section II C, write the time-dependent transition rate matrix Q_t(·|u) for the stochastic dynamics of X_n and use Eq. (14) to compute P[x_n|u]. Finally, to compute the mutual information I(S; X_n), e.g. using the estimate in Eq. (10), we additionally need to evaluate the marginal output probability P[x_n]. This requires us to perform one additional integration over the space of input trajectories s:

P[x_n] = ∫ Ds P[s] P[x_n|s].

The corresponding Monte Carlo estimate is

P[x_n] ≈ (1/(N M)) Σ_{i=1}^N Σ_{j=1}^M P[x_n | x^{ij}_1, …, x^{ij}_{n−1}, s_i],

where the input trajectories s_i follow P[s] and the intermediate components (x^{ij}_1, …, x^{ij}_{n−1}), for i = 1, …, N and j = 1, …, M, follow P[x_1, …, x_{n−1} | s_i]. In summary, the scheme to obtain P[x_n|s] in the presence of hidden intermediate components is analogous to that used for computing P[x] from P[x|s]. In both cases, one needs to marginalize a distribution function by integrating out components. Indeed, the schemes presented here and in Section II B are bona fide schemes to compute the mutual information between the input s and either the trajectory of the output component x_n or the full output x. However, when the trajectories are sufficiently long or the stochastic dynamics are sufficiently complex, the free-energy schemes of Section III may be necessary to enhance the efficiency of computing the marginalized distribution, P[x] or P[x_n|s].
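The nested marginalization has the same structure in a toy discrete example, given here as an illustrative Python sketch (the conditional tables are invented): with no feedback from the readout, sampling the hidden component from P(x1|s) and averaging the readout likelihood recovers the exact marginal P(x2|s).

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy stand-in with scalars instead of trajectories: s -> X1 -> X2, with no
# feedback from the readout X2. The conditional tables below are invented.
P_x1_given_s = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows: s,  cols: x1
P_x2_given_x1 = np.array([[0.9, 0.1], [0.4, 0.6]])  # rows: x1, cols: x2

s, x2 = 0, 1
# exact marginalization over the hidden component x1
exact = sum(P_x1_given_s[s, x1] * P_x2_given_x1[x1, x2] for x1 in (0, 1))

# Monte Carlo analogue of Eq. (47): sample the hidden component from P(x1|s)
# and average the readout likelihood P(x2|x1)
M = 200_000
x1_samples = rng.choice(2, size=M, p=P_x1_given_s[s])
estimate = P_x2_given_x1[x1_samples, x2].mean()
print(estimate, exact)  # the Monte Carlo estimate converges to the exact value
```

In the trajectory setting, the sampled table lookups become Gillespie simulations of the upstream components and evaluations of the driven-jump-process likelihood P[x_n|u].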

V. RESULTS
To demonstrate the power of our framework and illustrate how the techniques of the previous sections can be used in practice, we apply PWS to two instructive chemical reaction networks. We first consider a linearly coupled birth-death process. This system has already been studied previously using a Gaussian model [6], and by Duso and Zechner [30] using an approximate technique, and we compare our results to those of these studies. This simple birth-death system serves to illustrate the main ideas of our approach and also highlights that linear systems can be distinctly non-Gaussian. The second example has been chosen to demonstrate the practical applicability of our technique. We use RR-PWS to compute the mutual information rate in the bacterial chemotaxis system, which is a prime example of a complex information-processing system consisting of many reactions. We then compare the computed rate against recent experiments.
The code used to produce the PWS estimates was written in the Julia programming language [69] and has been made freely available [70]. For performing stochastic simulations we use the DifferentialEquations.jl package [71], and biochemical reaction models are set up with help from the ModelingToolkit.jl package [72].

A. Coupled Birth-Death Processes
As a first example system we consider a simple birth-death process ∅ ⇌ X of species X, which is created at rate ρ(t) and decays with constant rate μ per copy of X. This system receives information from an input signal that modulates the birth rate ρ(t). For simplicity, we assume it is given by

ρ(t) = ρ_0 s(t),

where ρ_0 is a constant and s(t) is the input copy number at time t. This is a simple model for gene expression, where the rate of production of a protein X is controlled by a transcription factor S, and X itself has a characteristic decay rate. The input trajectories s(t) themselves are generated via a separate birth-death process ∅ ⇌ S with production rate κ and decay rate λ.

FIG. 4. The mutual information as a function of trajectory duration. The inset shows an enlarged version of the dotted rectangle near the origin. For short trajectories all PWS estimates agree. Yet, for longer trajectories, DPWS and TI-PWS require a much larger number of input trajectories M for computing P[x] than RR-PWS to converge. Results for the three PWS variants are compared with the Duso and Zechner [30] estimate and with the linear noise approximation from Ref. [6]. We find excellent agreement between the Duso scheme and RR-PWS. The Gaussian linear noise approximation systematically underestimates the mutual information. All PWS estimates, as well as the Duso approximation, were computed using N = 10^4 samples from P[s, x].
We compute the trajectory mutual information for this system as a function of the duration T of the input and output trajectories. For T → ∞, the trajectory mutual information is expected to increase linearly with T, since, on average, every additional output segment contains the same additional amount of information on the input trajectory. Because we are interested in the mutual information in steady state, the initial states (s_0, x_0) were drawn from the stationary distribution P(s_0, x_0). This distribution was obtained using a Gaussian approximation. This does not influence the asymptotic rate of increase of the mutual information, but leads to a nonzero mutual information already at T = 0.
Figure 4 shows the mutual information as a function of the trajectory duration T. We compare the three PWS variants and two approximate schemes. One is that of Duso and Zechner [30]. To apply it, we used the code publicly provided by the authors [73], and to avoid making modifications to this code, we chose a fixed initial condition (s_0 = x_0 = 50), which causes the mutual information to be zero for T = 0. The figure also shows the analytical result of a Gaussian model [6], obtained using the linear-noise approximation (see Appendix F).
We find that the efficiency of the respective PWS variants depends on the duration of the input-output trajectories. For short trajectories all PWS variants yield very similar estimates for the mutual information. However, for longer trajectories the estimates of DPWS and, to a smaller degree, TI-PWS diverge, because of poor sampling of the trajectory space in the estimate of P[x]. For longer trajectories, the estimate becomes increasingly dominated by rare trajectories, which make an exceptionally large contribution to the average of P[x]. Missing these rare, high-weight trajectories tends to increase the marginal entropy H(X) (see Eq. (9)), and thereby the mutual information; indeed, the estimates of DPWS and TI-PWS are higher than that of RR-PWS. For brute-force DPWS, the error decreases as we increase the number M of input trajectories per output trajectory used to estimate P[x]. Similarly, for TI-PWS the error decreases as we use more MCMC samples for the marginalization scheme. For RR-PWS, however, the estimate has converged already for M = 128; we verified that a further increase of M does not change the results.
We also find excellent agreement between the RR-PWS estimate and the approximate result of Duso and Zechner [30]. Only very small deviations are visible in Fig. 4. These deviations are mostly caused by the different choice of initial conditions. In RR-PWS, the initial conditions are drawn from the stationary distribution, while in the Duso scheme they are fixed, such that at T = 0 the mutual information computed with RR-PWS is finite while that computed with the Duso scheme is zero. Yet, as the trajectory duration T increases, the Duso estimate slowly "catches up" with the RR-PWS result.
Fig. 4 also shows that although the Gaussian model matches the PWS result for T = 0, it systematically underestimates the mutual information for trajectories of finite duration T > 0. Interestingly, this is not a consequence of small copy-number fluctuations: increasing the average copy number does not significantly improve the Gaussian estimate. We leave a detailed analysis of this observation for future work.
The different approaches for computing the marginal probability P[x] lead to different computational efficiencies of the respective PWS schemes. In Fig. 5, as a benchmark, we show the magnitude of the error of the different PWS estimates in relation to the required CPU time. Indeed, as expected, the computation of the marginal probability poses problems for long trajectories when using the brute-force DPWS scheme. More interestingly, while TI-PWS improves the estimate of the mutual information, the improvement is not dramatic. Unlike the brute-force scheme, thermodynamic integration does make it possible to generate input trajectories s that are correlated with the output trajectories x, but it still overestimates the mutual information for long trajectories unless a very large number of MCMC samples is used. The RR-PWS implementation evidently outperforms the other estimates for this system. The regular resampling steps ensure that we mostly sample input trajectories s with non-vanishing likelihood P[x|s], thereby avoiding the sampling problem of DPWS. Moreover, sequential Monte Carlo techniques such as RR-PWS and FFS [64] have a considerable advantage over MCMC techniques in trajectory sampling. With MCMC path sampling, we frequently make small changes to an existing trajectory, such that the system moves slowly in path space, leading to poor statistics. In contrast, in RR-PWS we generate new trajectories from scratch, segment by segment, and these explore the trajectory space much faster.
The coupled birth-death process represents a simple yet non-trivial system capable of information transmission. In the next section we apply PWS to a more complex and realistic biochemical signaling network.

B. Bacterial Chemotaxis
The chemotaxis system of the bacterium Escherichia coli is a complex information-processing system. It is responsible for detecting nutrient gradients in the cell's environment and using that information to guide the bacterium's movement. Briefly, E. coli navigates through its environment by performing a biased random walk, successively alternating between so-called runs, during which it swims with a nearly constant speed, and tumbles, during which it randomly chooses a new direction [74]. The rates of switching between these two states are controlled by the chemotaxis sensing system (Fig. 6a). This system consists of receptors on the cell surface that detect the ligand, and a downstream signaling network that processes this information by taking the time derivative of the signal, the ligand concentration. This derivative is taken via two antagonistic reactions, which occur on two distinct timescales. Attractant binding rapidly deactivates the receptor, while slow methylation counteracts this reaction by reactivating the receptor, leading to near-perfect adaptation [75][76][77][78]. Lastly, active receptors phosphorylate the downstream messenger protein CheY, which controls the tumbling propensity by binding the flagellar motors that propel the bacterium.
In our model, the receptors are grouped in clusters (Appendix D). Each receptor can switch between an active and an inactive conformational state, but, in the spirit of the Monod-Wyman-Changeux model [79], the energetic cost of having two different conformations in the same cluster is prohibitively large. We can then speak of each cluster as being either active or inactive. Each receptor in a cluster can bind ligand and be (de)methylated, which, together, control the probability that the cluster is active.
In the simulations, receptor (de)methylation is modeled explicitly, because the (de)methylation reactions are slow. In contrast, the timescale of receptor-ligand (un)binding is much faster than the other timescales in the system, i.e., those of the input dynamics, CheY (de)phosphorylation, and receptor (de)methylation. The receptor-ligand binding dynamics can therefore be integrated out without affecting information transmission, in order to avoid wasting CPU time (Appendix D). In addition, the receptor clusters can phosphorylate CheY, while phosphorylated CheY is dephosphorylated at a constant rate. The dynamics of the kinase CheA and the phosphatase CheZ, which drive (de)phosphorylation, are not modeled explicitly. Table II in Appendix D gives the parameter values of our chemotaxis model, which are all based on values reported in the literature. For what follows below, the key parameters are the number of receptors per cluster, which is taken to be N = 6 based on Refs. [80, 81], and the number of clusters, N_c = N_r/N = 400, where 10^3 < N_r < 10^4 is an estimate for the total number of receptors based on Ref. [82]. The translation of our model into a master equation is explained in Appendix D.
We first asked whether this model, based on the current literature, can reproduce the information transmission rate as recently measured by Mattingly et al. [50]. In what follows, we call this model the "literature-based" model. The information transmission rate depends not only on the biochemical chemotaxis network, but also on the dynamics of the input signal. It is therefore important that the dynamics of this signal in our model agree with those in the experiments of Mattingly et al. [50]. In our model, the input signal is the time-dependent ligand concentration c(t) that is experienced by the swimming bacterium.
In the experiments, the cells swim in a very shallow chemical gradient, such that their swimming behavior is, to a good approximation, identical to that in the absence of a gradient. The cell's movement can thus be modeled as a (persistent) random walk. Assuming that the gradient is oriented in the x-direction, the autocorrelation of the x-component of the velocity was experimentally found to be well described by V(t) ≡ ⟨δv_x(t)δv_x(0)⟩ ≃ σ_v^2 e^{−λ|t|}, where σ_v^2 is the variance of the fluctuations in the velocity, and λ^{−1} is the correlation time of these fluctuations [50]. We therefore model the cell's velocity as v̇_x = −λv_x + ξ, which, with ⟨ξ(t)ξ(t′)⟩ = 2σ_v^2 λ δ(t − t′), gives rise to the measured correlation function V(t). The cells swim in a shallow, exponential gradient c(x) ∝ e^{gx} with steepness g. The dynamics of the ligand concentration as experienced by the bacteria are then given by ċ = g c v_x. The cell's own swimming dynamics described by v_x thus give rise to the input signal c(t) of the biochemical network. It constitutes an exponential random walk, as illustrated in Fig. 6b, c (see also Appendix E).
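A minimal Euler-Maruyama sketch of this input model in Python (parameter values are illustrative placeholders, not those of the experiments): the velocity is an Ornstein-Uhlenbeck process with variance σ_v² and correlation time 1/λ, and integrating ċ = g c v_x yields the exponential random walk c(t).

```python
import numpy as np

rng = np.random.default_rng(6)

# Euler-Maruyama integration of the input model; parameter values are
# illustrative placeholders, not those of the Mattingly et al. experiments.
lam, sigma_v, g = 1.0, 1.0, 0.1   # relaxation rate, velocity std, gradient steepness
dt, n = 1e-3, 400_000

v = np.empty(n)
v[0] = 0.0
noise = rng.standard_normal(n - 1) * np.sqrt(2.0 * sigma_v**2 * lam * dt)
for i in range(n - 1):
    # dv_x = -lam * v_x * dt + dW, with <xi(t)xi(t')> = 2*sigma_v^2*lam*delta(t-t')
    v[i + 1] = v[i] - lam * v[i] * dt + noise[i]

# dc/dt = g*c*v_x  =>  ln c(t)/c(0) = g * integral of v_x: an exponential random walk
log_c = g * np.cumsum(v) * dt

print(v.var())  # close to the stationary variance sigma_v^2 = 1
```

Trajectories of c(t) generated this way play the role of the input signal s in the PWS simulations.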
In our model, the output is the concentration of phosphorylated CheY, while in the experiments of Mattingly et al. [50] it is the average activity of the receptor clusters as obtained via FRET measurements. We argue that this difference does not significantly affect the obtained information rates, and thus that it is valid to compare our results to the experiments. In particular, since the copy number of CheY is much larger than the number of receptor clusters, the fluctuations in CheY are dominated by the extrinsic fluctuations coming from the receptor activity noise rather than by the intrinsic fluctuations associated with CheY (de)phosphorylation. To a good approximation, the copy number of phosphorylated CheY, Y_p(t), is thus a deterministic function of the average receptor activity a(t). Mathematically, the mutual information I(X; Y) between two stochastic variables X and Y is the same as the mutual information I(f(X); g(Y)) for deterministic and monotonic functions f and g. It follows that the mutual information between c(t) and Y_p(t) is nearly the same as that between c(t) and the receptor activity a(t). It is therefore meaningful to compare the information transmission rates predicted by our PWS simulations to those measured by Mattingly et al. [50].
We use RR-PWS to exactly compute the mutual information for the literature-based model. Specifically, we measure the mutual information I(C, Y_p; T) between the input trajectory of the ligand concentration c(t) and the output trajectory of phosphorylated CheY, y_p(t), where each trajectory is of duration T. With RR-PWS it is possible to compute I(C, Y_p; τ) for all τ ≤ T within a single PWS simulation of duration T by saving intermediate results after each sampled segment; see Section III B. The receptor states are hidden internal states, and we use the technique of Section IV to integrate them out.
Figure 6e shows the PWS estimate of the information transmission rate for cells swimming in gradients of different steepnesses g. The information transmission rate is obtained from the PWS estimate of the trajectory mutual information I(C, Y_p; T) for different trajectory durations T. As seen in Fig. 6d, for short trajectories the mutual information increases non-linearly with the trajectory duration T, but in the long-duration limit the slope becomes constant. This asymptotic rate of increase of the mutual information with T is the desired information transmission rate R(C, Y_p). The precise definition is

R(C, Y_p) = lim_{T→∞} dI(C, Y_p; T)/dT.

We then compared our results for the information transmission rate of the literature-based model to those of Mattingly et al. [50]. Figure 7c shows that the model predictions differ from the experiments by a factor of ≈ 4. Despite this discrepancy, we believe that the agreement between experiment and theory is, in fact, remarkable, because these predictions were made ab initio: the model was developed based on the existing literature and we did not fit our model to the data of Mattingly et al.
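As an illustration of this definition, the short Python sketch below extracts the asymptotic slope from a mutual-information curve by fitting a line to its tail; the curve here is synthetic (a transient plus linear growth, with made-up numbers), not PWS data.

```python
import numpy as np

# Synthetic mutual-information curve: a transient plus linear growth with
# rate R_true = 0.8 (illustrative numbers, not PWS output).
R_true, b = 0.8, 1.5
T = np.linspace(0.0, 50.0, 501)
I = R_true * T + b * (1.0 - np.exp(-T))

# the information rate is the asymptotic slope, so fit a line to the tail only
tail = T > 25.0
R_est, _ = np.polyfit(T[tail], I[tail], 1)
print(R_est)  # recovers the asymptotic rate R_true = 0.8
```

Restricting the fit to the tail discards the nonlinear short-time transient, which would otherwise bias the estimated rate.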
Yet, the question about the origin of the discrepancy remains. The difference between their measurements and our predictions could be attributed either to the inaccuracy of our model or to the approximation that Mattingly et al. had to employ to compute the information transmission rate from experimental data. Concerning the latter hypothesis: due to the curse of dimensionality and experimental constraints, Mattingly et al. could not directly obtain the information transmission rate from measured time traces of the input and output of the system. Instead, they measured three different kernels that describe the system in the linear regime. Specifically, they obtained the response K(t) of the kinase activity to a step change in the input signal, the autocorrelation function of the input signal V(t), and the autocorrelation N(t) of the kinase activity in a constant background concentration. They then used a Gaussian model to compute the information transmission rate from these measured functions K(t), V(t), and N(t) [6, 50] (see also Appendix F). This Gaussian model is based on a linear-noise assumption and cannot perfectly capture the true nonlinear dynamics of the biochemical network. This could be the cause of the observed discrepancies in the information rate. Indeed, we have already seen in Section V A that there can be substantial differences between exact computations and the Gaussian approximation for the trajectory mutual information.
To uncover the reason for the discrepancy, we first tested whether our literature-based model reproduces the experimentally measured kernels. If the kernels do not match, then, clearly, the discrepancy in the information rate may be caused by the difference between our model and the experimental system, as opposed to the inaccuracy of the Gaussian framework. Our input correlation function V(t) is, by construction, the same as that of Mattingly et al. [50]. However, we find that the response kernel K(t) and the autocorrelation function of the noise N(t) of our system are different. Figure 7a, b shows that our model reproduces the timescales of N(t) and K(t) as measured experimentally. This is perhaps not surprising, because the decay of both N(t) and K(t) is set by the (de)methylation rate, which has been well characterized experimentally. Yet, the figure also shows that our model significantly underestimates the amplitudes of both N(t) and K(t).
This raises the question of whether other parameter values would allow our model to better reproduce the measured kernels K(t) and N(t), and, secondly, whether this would resolve the discrepancy in information rate between our simulations and the experiments.
The amplitude σ_N^2 of the output noise correlation function N(t) is bounded by the number of receptor clusters N_c. In particular, the variance of the receptor activity is σ_N^2 = σ_a^2/N_c, where σ_a^2 ≤ 1/4 is the variance of the activity of a single receptor cluster. Comparing this bound to the measured receptor noise strength σ_N^2 reveals that N_c needs to be much smaller than our original model assumes: the number of clusters needs to be as small as N_c ≲ 10. Indeed, Fig. 7b shows that with N_c = 9, our model quantitatively fits the correlation function N(t) of the receptor activity in a constant background concentration, as measured experimentally [50].
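A quick numerical sketch of this bound; the measured noise strength used below is an illustrative placeholder, not the value reported in Ref. [50]:

```python
# The receptor activity is an average over N_c independent clusters, so its
# variance is sigma_N^2 = sigma_a^2 / N_c, with sigma_a^2 <= 1/4 for a binary
# (active/inactive) cluster.  The "measured" noise strength below is a
# hypothetical placeholder chosen only to illustrate the inference.
sigma_N_sq_measured = 0.025   # hypothetical measured output noise variance
sigma_a_sq_max = 0.25         # maximal variance of a single two-state cluster

# Upper bound on the number of independent clusters consistent with the noise:
N_c_max = sigma_a_sq_max / sigma_N_sq_measured
print(N_c_max)                # approx 10: far below the original N_c = 400
```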
The amplitude of K(t), i.e. the gain, depends on the ratio K_D^A/K_D^I of the dissociation constants of the receptor for ligand binding in its active or inactive state, respectively, as well as on the number of receptors per cluster, N. Both dissociation constants have been well characterized experimentally [80,83], but the number of receptors per cluster has only been inferred indirectly from experiments [80,81]. The higher gain measured experimentally by Mattingly et al. [50] indicates that N is larger than assumed in our model: with N = 15 our model can quantitatively fit K(t) (Fig. 7a).
We thus find that by reducing the number of clusters from N_c = 400 to N_c = 9 while simultaneously increasing their size from N = 6 to N = 15, our model is able to quantitatively fit both N(t) and K(t) [50] (Fig. 7). This suggests that the number of independent receptor clusters is smaller than hitherto believed, while their size is larger.
Finally, how accurately can our revised model reproduce the measured information rate, and how accurate is the Gaussian framework for the experimental system in the regime studied by Mattingly et al. [50]? In the revised model, called the "fitted model", with N_c = 9 and N = 15, all key quantities for computing the information transmission rate within the Gaussian framework, V(t), N(t), and K(t), are nearly identical to the experiments of Mattingly et al. [50], see Fig. 7. Within the Gaussian framework (see Appendix F), the information transmission rate in our model is thus expected to be very similar to the experimentally measured one, and Fig. 7c shows that this is indeed the case. To quantify the accuracy of the Gaussian framework, we then recomputed the information transmission rate for the revised model using exact PWS (see Appendix H). We found that the result matches the Gaussian prediction very well. For these shallow and static chemical gradients, the Gaussian model is thus highly accurate. Our analysis validates a posteriori the Gaussian framework adopted by Mattingly et al. [50].

VI. DISCUSSION
In this manuscript, we have developed a general, practical, and flexible method that makes it possible to compute the mutual information between trajectories exactly. PWS is a Monte Carlo scheme based on the exact computation of trajectory probabilities. We showed how to compute exact trajectory probabilities from the master equation and thus how to use PWS for any system described by a master equation. Since the master equation is employed in many fields and in particular provides an exact stochastic model for well-mixed chemical reaction dynamics, PWS is very broadly applicable.
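To illustrate the core ingredient, the following sketch accumulates the exact log-probability of a trajectory on the fly during a Gillespie simulation of a simple birth-death process. The process and its rate constants are illustrative, not a model from this work; the bookkeeping (log of the fired propensity, minus the total escape rate times the waiting time, plus a final survival factor) is the generic master-equation recipe:

```python
import math, random

# Minimal sketch of the core PWS ingredient: the probability of a Gillespie
# trajectory can be evaluated exactly while it is generated.  Each jump
# contributes ln(rate of the fired reaction) - a_tot * dt; surviving without
# a jump until the final time T contributes -a_tot * (T - t_last).
# Example system (illustrative rates): X -> X+1 at rate k, X -> X-1 at g*x.
def gillespie_birth_death(k, g, x0, T, rng):
    t, x, logp = 0.0, x0, 0.0
    while True:
        rates = [k, g * x]
        a_tot = sum(rates)
        dt = rng.expovariate(a_tot)
        if t + dt > T:                      # no further jump before T
            logp += -a_tot * (T - t)        # survival factor
            return x, logp
        i = 0 if rng.random() < rates[0] / a_tot else 1
        logp += math.log(rates[i]) - a_tot * dt
        t += dt
        x += 1 if i == 0 else -1

rng = random.Random(1)
x_final, logp = gillespie_birth_death(k=10.0, g=1.0, x0=10, T=5.0, rng=rng)
print(x_final, logp)
```

Note that logp is the log of a probability *density* over waiting times, so it need not be negative; what matters is that it is exact for the generated path.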
The application of PWS to the bacterial chemotaxis system shows how crucial it is to have a simulation technique that is exact. Without it, it would be impossible to determine whether the difference between our predictions and the data of Mattingly et al. [50] is due to the inaccuracy of the model, the inaccuracy of the numerical technique used to simulate the model, or the approximations used by Mattingly and coworkers in analyzing the data. In contrast, because PWS is exact, we knew that the difference between theory and experiment is due either to the inaccuracy of the model or to the approximations used to analyze the data. By then employing the same Gaussian framework to analyze the behavior of the model and the experimental system, we were able to establish that the difference is due to the inaccuracy of our original model.
Our analysis indicates that the size of the receptor clusters in the E. coli chemotaxis system, N ≈ 15, is larger than previous estimates, N ∼ 6 [80,81,84]. The early estimates of the cluster size were based on bulk dose-response measurements with relatively slow ligand exchange, yielding N ≈ 6 [80,84]. More recent dose-response measurements, at the single-cell level and with faster ligand exchange, yield a higher average, ⟨N⟩ ≈ 8, with a broad distribution around it arising from cell-to-cell variability [81]. Our estimate, N ≈ 15, based on fitting the response kernel K(t) to that measured by Mattingly et al. [50], therefore appears reasonable. At the same time, the number of clusters, obtained by fitting the noise correlation function N(t) to the data of Mattingly et al. [50], is surprisingly low, N_c ∼ 10, given the total number of receptors, N_r ∼ 10^3–10^4 [82]. Interestingly, recent experiments indicate that the receptor array is poised near a critical point [85], where receptor switching becomes correlated over large distances. This effectively partitions the receptors into a few large domains, which may explain our fitted values for N and N_c.
It has been suggested that information-processing systems are positioned close to a critical point to maximize information transmission [21,86], although it has been argued that the sensing error of the E. coli chemotaxis system is minimized for independent receptors [87]. Mattingly et al. have demonstrated that the chemotactic drift speed in shallow exponential gradients is limited by the information transmission rate [50], but whether the system has been optimized for information transmission, and how the latter affects chemotactic performance in other spatio-temporal concentration profiles, remain interesting questions for future work.
While we have focused on the computation of the mutual information between trajectories of systems governed by a master equation, this concept can be extended to other types of stochastic processes. The crux of PWS is the exact evaluation of the path likelihood P[x|s] from the master equation. Therefore, in order to use PWS with a different stochastic process, we require the ability to compute the path likelihood P[x|s] for its sample paths. Remarkably, for stochastic diffusion processes, which are described by a Langevin equation in a continuous state space, the trajectory probability can be computed by replacing the kernel L_t(s, x) in Eq. (15) with the Onsager-Machlup function [88]. In particular, while the path probability is not well defined for continuous sample paths, Adib [89] shows that the Onsager-Machlup function yields a correct expression for the path probability of time-discretized diffusion trajectories. Therefore, PWS can be extended to handle systems described by a Langevin equation. In such a PWS scheme, the Gillespie simulations are replaced by standard numerical integration techniques for Langevin equations (see, e.g., Ref. [58]), and the trajectory likelihood is evaluated on the fly using the Onsager-Machlup function.
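As a minimal sketch of this idea: for an Euler-Maruyama discretization of dx = f(x) dt + σ dW, the path likelihood of a discretized trajectory is a product of Gaussian transition densities, whose log is the discrete Onsager-Machlup action. The drift, noise strength, and sample path below are illustrative assumptions:

```python
import math

# Log path likelihood of a time-discretized diffusion trajectory
# (x_0, ..., x_n) under Euler-Maruyama dynamics with drift f and noise sigma:
# each increment is Gaussian with mean x + f(x)*dt and variance sigma^2*dt.
def log_path_likelihood(xs, dt, f, sigma):
    logp = 0.0
    for x_prev, x_next in zip(xs, xs[1:]):
        mean = x_prev + f(x_prev) * dt
        var = sigma ** 2 * dt
        logp += -(x_next - mean) ** 2 / (2.0 * var) \
                - 0.5 * math.log(2.0 * math.pi * var)
    return logp

# Ornstein-Uhlenbeck drift as a toy example:
f = lambda x: -x
xs = [0.0, 0.1, 0.05, -0.02]   # a short, hand-written discretized path
print(log_path_likelihood(xs, dt=0.1, f=f, sigma=1.0))
```

Paths that closely follow the deterministic drift receive a higher likelihood than paths with large excursions, as expected from the Onsager-Machlup action.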
PWS cannot be used to directly obtain the mutual information between trajectories from experimental data, in contrast to model-free (yet approximate) methods such as k-nearest-neighbor estimators [23,24], decoding-based information estimates [26], or schemes that compute the mutual information from the data within the Gaussian framework [50]. PWS requires a (generative) model based on a master equation or Langevin equation. However, an increasingly popular approach is to estimate the mutual information from experimental data using hidden Markov models (HMMs) [90–92]. While the construction of these HMMs is beyond the scope of this manuscript, PWS makes it possible to compute the mutual information between time-varying inputs and outputs exactly for such models.
We have applied PWS to compute the mutual information (rate) in steady state, but PWS can be used equally well to study systems out of steady state. For such systems, a (non-)equilibrium initial condition P(s_0, x_0) must be specified, in addition to a well-defined non-stationary probability distribution of input trajectories P[s]. These distributions are defined by the (experimental) setup and lead to a well-defined output distribution P[x] when the system is coupled to the input. Thus, a steady state is not a prerequisite for applying PWS to study the trajectory mutual information.
Throughout the manuscript, we have considered systems in which the output does not feed back onto the input. In systems with feedback, the current output influences future input, which means that we cannot straightforwardly generate input trajectories according to P[s]. Moreover, P[x|s], the central quantity of all three PWS methods, cannot be obtained straightforwardly. Nonetheless, PWS can be extended to systems with feedback, as shown in Appendix C. As in free-energy calculations, the trick is to define a reference system for which the marginal probability distribution, and thus the "free energy", is known, and then compute the "free-energy difference" between that reference system and the system of interest.
Aside from DPWS, we developed two additional variants of PWS, capitalizing on the connection between information theory and statistical physics. Specifically, the computation of the mutual information requires the evaluation of the marginal probability of individual output trajectories P[x]. This corresponds to the computation of a partition function in statistical physics. RR-PWS and TI-PWS are based on techniques from polymer and rare-event sampling that make the computation of the marginal trajectory probability more efficient.
The different PWS variants share some characteristics yet differ in others. DPWS and RR-PWS are static Monte Carlo schemes in which each trajectory is generated independently of the previous ones. These methods are similar to static polymer sampling schemes like PERM [44] and rare-event methods like DFFS or BG-FFS [64]. In contrast, TI-PWS is a dynamic Monte Carlo scheme, in which a new trajectory is generated from the previous trajectory. In this regard, the method is similar to the CBMC scheme for polymer simulations [93] and the TPS [49], TIS [94], and RB-FFS [64] schemes for harvesting transition paths. The benefit of static schemes is that the newly generated trajectories are uncorrelated from the previous ones, which means that they are less likely to get stuck in certain regions of path space and tend to move faster through configuration space. Indeed, TI-PWS suffers from a problem that is also often encountered in TPS or TIS: the middle sections of the trajectories move only slowly in the perpendicular direction. Tricks that have been applied to TPS and TIS to solve this problem, such as parallel tempering, could also be of use here [95].
Another distinction is that RR-PWS generates all the trajectories in the ensemble simultaneously, yet segment by segment, like DFFS, while DPWS and TI-PWS generate only one full trajectory at a time, similar to RB-FFS, BG-FFS, and also TPS and TIS. Consequently, RR-PWS, like DFFS, faces the risk of genetic drift, which means that, after sufficiently many resampling steps, most paths of the ensemble will originate from the same initial seed. Thus, when continuing to sample new segments, the old segments far in the past become essentially fixed, which makes it possible to miss important paths in the RR-PWS sampling procedure. As in DFFS, the risk of genetic drift in RR-PWS can be mitigated by increasing the initial number of path segments. Although we did not employ this trick here, we found that RR-PWS was by far the most powerful of the three variants studied.
Nonetheless, we expect that DPWS and TI-PWS become more efficient in systems that respond to the input signal with a significant delay τ. In these cases, the weight of a particular output trajectory depends on the degree to which the dynamics of the output trajectory correlates with the dynamics of the input trajectory a time τ earlier. Because in RR-PWS a new segment of an output trajectory is generated based on the corresponding segment of the input trajectory spanning the same time interval, it may miss these correlations between the dynamics of the output and that of the input a time τ earlier. In contrast, DPWS and TI-PWS generate full trajectories one at a time and are therefore more likely to capture these correlations.
Overall, PWS is a general framework for computing the mutual information between trajectories. We presented three variants of PWS for systems described by a master equation. Apart from these, we expect that other variants could be developed to improve efficiency for particular applications. Because of its flexibility and simplicity, we envision that PWS will become an important and reliable tool for studying information transmission in dynamic stochastic systems.
To ensure efficient convergence of the resulting Markov chain to its stationary distribution, the proposal kernel must balance two conflicting requirements. To efficiently explore the state space per unit of CPU time, the proposed trajectory s′ must be sufficiently different from the original trajectory s, while at the same time it should not be so radically different that the acceptance probability is drastically reduced. The design of the proposal kernel is thus crucial for an efficient MCMC sampler, and we will discuss various strategies to create trial trajectories. Since different types of trial moves can easily be combined in a Metropolis-Hastings algorithm, the most efficient samplers often incorporate multiple complementary proposal strategies to improve the exploration speed of trajectory space.
The simplest (and naïve) proposal kernel generates an entirely new trajectory s′ independent of s, by sampling directly from P[s] using the SSA. Hence, the transition kernel is given by T(s → s′) = P[s′], and a proposal s → s′ is accepted with probability A(s → s′) = min{1, e^{−U_θ[s′,x] + U_θ[s,x]} P[s]/P[s′]} = min{1, (P[x|s′]/P[x|s])^θ}, where the second equality follows by inserting the definition of U_θ[s, x] given in Eq. (40). Although this simple scheme of completely regenerating an entire trajectory and accepting/rejecting according to A(s → s′) creates correctly distributed trajectories, it should not be used in simulations to compute P[x]: we get a better estimate of P[x] by using the same number of independent sample trajectories from P[s] in the brute-force scheme of Eq. (8), without taking the detour of thermodynamic integration to estimate the normalization constant. Instead, an idea from transition path sampling is to regenerate only part of the old trajectory in the proposal kernel [98]. By not regenerating the entire trajectory, the new trajectory s′ is correlated with the original trajectory s, and this correlation generally improves the acceptance rate. The simplest way to generate trial trajectories using a partial update is a move termed forward shooting, in which a time point τ along the existing trajectory s is randomly selected, and a new trajectory segment is regrown from this point to the end, resulting in the proposal s′. Since the new segment is generated according to the input statistics given by P[s_{[T−τ,T]}], the acceptance probability for the proposed trajectory is given by Eq. (B2). If the input dynamics given by P[s] are time-reversible, we can also perform a backward shooting move. Here, the beginning of s is replaced by a new segment that is generated backwards in time. Assuming that the initial condition is the input's steady-state distribution, the corresponding acceptance probability of the backward shooting move is again given by Eq. (B2). Using these two moves, we create an MCMC sampler in which both ends of the trajectory are flexible; thus, if the trajectory is not too long, the chain will quickly relax to its stationary distribution. This is indeed the MCMC sampler used to obtain the TI-PWS results for the coupled birth-death process in Section V A.
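A minimal runnable sketch of a forward-shooting move is given below. It targets a tempered distribution ∝ P[s] P[x|s]^θ; the model-specific ingredients (SSA regrowth from P[s] and the likelihood ln P[x|s]) are stubbed with a toy Gaussian model, so all names and parameter values are illustrative assumptions:

```python
import math, random

# Forward-shooting move for a trajectory-space MCMC sampler targeting the
# tempered distribution ~ P[s] P[x|s]^theta.  Stubs: the input prior P[s] is
# i.i.d. standard Gaussian per time bin, and log_likelihood stands in for
# ln P[x|s].  Because the tail is regrown from the prior, the prior factors
# cancel in the Metropolis-Hastings ratio.
rng = random.Random(0)

def regrow_tail(s, tau_idx):
    # resample the segment after the shooting point from the input prior P[s]
    return s[:tau_idx] + [rng.gauss(0.0, 1.0) for _ in s[tau_idx:]]

def log_likelihood(x, s):  # stand-in for ln P[x|s]
    return -0.5 * sum((xi - si) ** 2 for xi, si in zip(x, s))

def forward_shooting_step(s, x, theta):
    tau_idx = rng.randrange(len(s))          # random shooting point
    s_new = regrow_tail(s, tau_idx)
    # acceptance min{1, (P[x|s']/P[x|s])^theta}
    log_acc = theta * (log_likelihood(x, s_new) - log_likelihood(x, s))
    if log_acc >= 0.0 or rng.random() < math.exp(log_acc):
        return s_new
    return s

s = [0.0] * 10                               # initial input trajectory
x = [0.5] * 10                               # fixed output trajectory
for _ in range(200):
    s = forward_shooting_step(s, x, theta=0.5)
print(s[:3])
```

In an actual TI-PWS implementation, regrow_tail would run the SSA from the state at the shooting point, and log_likelihood would evaluate the exact master-equation path likelihood.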
For long trajectories, it can prove to be a problem that the middle section is too inflexible when the proposal moves only regenerate either the beginning or the end of a trajectory. Therefore, one can additionally incorporate mid-section regrowth to ensure that the middle parts of the trajectory also become flexible. To regrow a middle segment of duration τ of a trajectory s, we have to generate a new segment of duration τ according to the stochastic dynamics given by P[s], with the additional condition that both endpoints of the new segment connect to the existing trajectory. Although the starting point of the segment can be freely chosen, the challenge is to ensure that the end point of the new segment satisfies the end-point constraint. Stochastic processes that generate trajectories under the condition of hitting a specific point after a given duration τ are called stochastic bridging processes.
The simplest way to generate trajectories from a bridging process is to generate a trajectory segment of length τ from the normal stochastic process and reject the segment if it does not hit the correct end point [99]. Clearly, this strategy is only feasible for very short segments and when the state space is discrete, as otherwise almost every generated segment will be rejected for not hitting the correct end point. To avoid this problem, more efficient algorithms have been developed to simulate stochastic bridges for some types of stochastic processes. For diffusion processes, bridges can be simulated efficiently by introducing a guiding term into the corresponding Langevin equation [100]. For jump processes, bridges can be simulated using particle filters [101], using a weighted stochastic simulation algorithm (wSSA) [102], or using random time-discretization (uniformization) [99].
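The naive rejection approach can be sketched for a birth-death jump process; the rate constants, segment duration, and end-point condition below are illustrative assumptions:

```python
import random

# Naive rejection sampler for a jump-process bridge: draw segments of
# duration tau from the unconstrained birth-death dynamics (X -> X+1 at
# rate k, X -> X-1 at rate g*x) and keep the first one that lands on the
# required end state.  Feasible only for short segments on a discrete
# state space, as discussed in the text.
def simulate_segment(x0, tau, k, g, rng):
    t, x = 0.0, x0
    while True:
        a_tot = k + g * x
        t += rng.expovariate(a_tot)
        if t > tau:
            return x                     # state at the end of the segment
        x += 1 if rng.random() < k / a_tot else -1

def bridge_by_rejection(x_start, x_end, tau, k, g, rng, max_tries=100_000):
    for tries in range(1, max_tries + 1):
        if simulate_segment(x_start, tau, k, g, rng) == x_end:
            return tries                 # number of attempts until success
    return None                          # no bridge found within max_tries

rng = random.Random(2)
tries = bridge_by_rejection(x_start=10, x_end=10, tau=0.1, k=10.0, g=1.0, rng=rng)
print(tries)
```

Even for this short segment, several attempts are typically needed; for longer segments or larger state spaces the acceptance probability collapses, which is what motivates the guided and uniformization-based bridge samplers cited above.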
Further techniques to create trajectory-space MCMC samplers have been developed in the literature. Crooks [103] describes a scheme to create MCMC moves for trajectories evolving under non-equilibrium dynamics, by making MCMC moves that change the trajectories' noise histories. In the Particle Markov Chain Monte Carlo (PMCMC) algorithm, proposal trajectories are generated using a particle filter and accepted with an appropriate Metropolis criterion [104]. Another class of efficient samplers for Markov jump processes can be built using uniformization [105].
affects the incoming signal, and indeed, it follows that no information can be extracted from the input signal without any perturbation of the input dynamics. Often, it is assumed that the amplitude of such perturbations is comparatively small, and thus that the feedback can safely be ignored. Above, the PWS scheme was derived under this assumption. In this section, we drop this assumption and explicitly consider systems where the produced output perturbs the input, i.e. systems where the output feeds back onto the input. In the following, we first discuss the additional problems that arise when computing the mutual information for a system with feedback, and subsequently we present a modified version of PWS that can be used to compute the trajectory mutual information for these systems.

Computing the Mutual Information with Feedback between Input and Output

All PWS schemes presented above require the computation of the trajectory likelihood P[x|s], a quantity that is not readily available for systems with feedback. Indeed, as already mentioned in Section II C 1, for a given input trajectory s, the output dynamics are no longer described by a Markov process in a system with feedback, and therefore we cannot find an expression for P[x|s] based on the master equation. This implies that for systems with feedback, PWS schemes cannot be used without modification. While it is generally not possible to derive an expression for the conditional probability P[x|s] in systems with feedback, we can often still compute the joint probability density P[s, x] instead. Based on this quantity, we present a modified PWS scheme to compute the mutual information for systems with feedback.
Since PWS is a model-based approach to compute the mutual information, when there is feedback from the output back to the input we require a complete model of the combined system. Specifically, such a model must provide an expression for the joint probability P[s, x], describing the input dynamics and the interaction between input and output, including the feedback.
An estimate of the mutual information that relies only on the computation of joint probability densities P[s, x] can be obtained by expressing the mutual information as I(S, X) = ⟨ln(P[s, x]/(P[s] P[x]))⟩_{P[s,x]}, with the marginals P[s] = ∫ D[x] P[s, x] and P[x] = ∫ D[s] P[s, x]. However, these marginalization integrals cannot be directly computed with the techniques described so far. Note that while in Section III we discussed in detail how to compute such marginalization integrals, all methods presented there themselves require the evaluation of the likelihood P[x|s] and cannot be used directly. Therefore, in the following subsection, we discuss how to compute marginalization integrals for systems with feedback. Additionally, as discussed in Section IV, we may also need to integrate out internal components of the master equation even when the output feeds back onto these internal components. The technique discussed below can also be used in this case as a way to compute the marginalization integral in Eq. (44).

Marginalization Integrals for Systems with Feedback
Computing marginalization integrals in systems with feedback is harder than in the case without feedback. Specifically, we will show that it is not obvious how to apply the brute-force Monte Carlo estimate of Eq. (8) or the other, more advanced techniques from Section III A to systems with feedback. Nevertheless, if the system with feedback can be decomposed into a non-interacting part and an interacting part that includes the feedback, it is often still possible to compute marginalization integrals. Below, we sketch the steps necessary to compute marginalization integrals for systems with feedback using such a decomposition.
For concreteness, we discuss how to compute the marginal probability P[x] = ∫ D[s] P[s, x]. Previously, for systems without feedback, we chose the potentials to be U_0[s, x] = −ln P[s] and U[s, x] = −ln P[s, x], with the idea that U is the potential corresponding to the actual system and U_0 is the potential of a reference system with known free energy. Then, by computing the free-energy difference between the reference system and the actual system, we could compute the marginal probability P[x].
However, in systems with feedback we face a problem. The actual system is still described by the potential U[s, x] = −ln P[s, x], even with feedback. Yet, for the reference system described by U_0[s, x] we cannot make the same choice as before, because the previous choice involved the marginal probability P[s], which is not available with feedback.
Instead, we have to find an alternative expression for U_0[s, x]. To construct a suitable reference potential, we can use a decomposition of the full potential into three parts, U[s, x] = U_S[s] + U_X[x] + ΔU[s, x], where ΔU[s, x] describes the features of the system that induce interaction, or correlation, between s and x. The first two terms of this potential therefore describe a non-interacting version of the system, in which the input and output are fully independent of each other. We want to use the potential of that non-interacting version as our expression for U_0, i.e. U_0[s, x] = U_S[s] + U_X[x]. To be able to do so, we require that the corresponding partition function (normalization constant) is known. In other words, we need to choose the decomposition in Eq. (C6) such that the partition function in Eq. (C7) is known either analytically or numerically.
If such a decomposition is found, we can compute the marginal probability P[x] from the free-energy difference ΔF[x] between U and U_0, via −ln P[x] = F_0 + ΔF[x], where F_0 = −ln Z_0[x] is known. Because we have a known expression for U_0[s, x], the free-energy difference ΔF[x] can now be computed using any of the techniques described in Section III A.
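One of the techniques of Section III A is thermodynamic integration, which estimates a free-energy difference as ΔF = ∫₀¹ dθ ⟨ΔU⟩_θ. The toy sketch below applies it to scalar potentials with a known answer as a sanity check; these potentials are illustrative stand-ins, not the trajectory potentials of the main text:

```python
import math, random

# Thermodynamic-integration sketch: U_0(s) = s^2/2 (Gaussian reference,
# Z_0 = sqrt(2*pi)) and Delta_U(s) = s^2/2, so the full potential is
# U(s) = s^2.  The exact answer, Delta_F = -ln(Z/Z_0) = 0.5*ln(2), serves
# as a check.  At interpolation parameter theta, the target is
# exp(-(1+theta) s^2 / 2): a Gaussian with variance 1/(1+theta), which we
# can sample exactly.
rng = random.Random(0)

def sample_theta(theta, n=20000):
    sd = 1.0 / math.sqrt(1.0 + theta)
    return [rng.gauss(0.0, sd) for _ in range(n)]

thetas = [i / 10.0 for i in range(11)]
means = []
for th in thetas:
    xs = sample_theta(th)
    means.append(sum(x * x / 2.0 for x in xs) / len(xs))   # <Delta_U>_theta

# Trapezoidal integration of <Delta_U>_theta over theta in [0, 1]:
dF = sum(0.5 * (a + b) * 0.1 for a, b in zip(means, means[1:]))
print(dF)   # approx 0.5*ln(2) = 0.3466
```

In the feedback setting of the main text, s would be a trajectory, ΔU the interaction part of the potential, and the θ-ensembles would be sampled with the trajectory-space MCMC moves of Appendix B.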
As an example of finding a decomposition like Eq. (C6), let us consider the case where the joint system of input and output is described by a single master equation, i.e. a master equation with two components: S, which represents the input, and X, which represents the output. In such a system, information is transmitted if there exist transitions that change the copy number of X with a rate that depends on the copy number of S. In terms of chemical reactions, S → S + X is an example of such a transition. In turn, this system exhibits feedback if at least one of the transitions that change the copy number of S has a rate that depends on X, as for example the reaction S + X → X. Note that with such reactions, the dynamics of S depend on the current copy number of X, and therefore we cannot evolve S trajectories independently of X trajectories, a consequence of the feedback. Both of the reactions S → S + X and S + X → X introduce correlations between the S and X trajectories.
In a non-interacting system, such interactions between the input and output must be absent. Thus, a non-interacting version of the reaction system contains no reaction that involves both S and X. We now describe how we can use this non-interacting version of the reaction system to obtain the reference potential U_0[s, x]. Since the input and output trajectories are completely independent in the non-interacting system, we can express the joint probability density as the product of the individual components' trajectory densities, P_0[s, x] = P_0[s] P_0[x]. Note that P_0[s] and P_0[x] should not be confused with the marginal probabilities P[s] and P[x] of the interacting version of the reaction system, which must be computed using a marginalization integral. Since in the non-interacting version both S and X obey independent dynamics characterized by individual master equations, both P_0[s] and P_0[x] can be computed individually using Eq. (13). Thus, in this case, the non-interacting potential is U_0[s, x] = −ln P_0[s] − ln P_0[x] and, since the probability densities P_0[s] and P_0[x] are normalized, the corresponding partition function is Z_0 = 1. Hence, for this reaction system, we can straightforwardly define a non-interacting version that can be used to obtain the reference potential U_0[s, x]. Using the techniques described in Section III A, we can then compute the free-energy difference between U_0[s, x] and U[s, x] = −ln P[s, x], where the latter potential describes the dynamics of the fully interacting system. Specifically, we can compute the marginal probabilities P[s] and P[x] pertaining to the interacting system, which are required for the mutual information estimate in Eq. (C2).
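The construction can be sketched concretely by evaluating the interacting potential U[s, x] = −ln P[s, x] and a non-interacting reference U_0[s, x] on the same recorded jump trajectory. The rate constants, reference copy numbers, and the short hand-written trajectory below are all illustrative assumptions:

```python
import math

# Evaluate U = -ln P[s,x] and a non-interacting reference U_0 on one recorded
# jump trajectory of the feedback network discussed above: S birth (rate k_s),
# feedback decay S + X -> X (rate k2*s*x), production S -> S + X (rate k1*s),
# and X decay (rate g*x).
def neg_log_lik(traj, s0, x0, propensities, updates, T):
    """-ln P of a jump trajectory given as (waiting time, reaction index)."""
    s, x, t, u = s0, x0, 0.0, 0.0
    for dt, r in traj:
        a = [p(s, x) for p in propensities]
        u += sum(a) * dt - math.log(a[r])   # escape + jump contribution
        ds, dx = updates[r]
        s, x, t = s + ds, x + dx, t + dt
    a = [p(s, x) for p in propensities]
    u += sum(a) * (T - t)                   # survival until the final time T
    return u

k_s, k2, k1, g = 5.0, 0.02, 1.0, 1.0
updates = [(+1, 0), (-1, 0), (0, +1), (0, -1)]
inter = [lambda s, x: k_s,                  # S birth
         lambda s, x: k2 * s * x,           # S + X -> X (feedback)
         lambda s, x: k1 * s,               # S -> S + X (transmission)
         lambda s, x: g * x]                # X decay

# Non-interacting reference: freeze the partner species at fixed reference
# copy numbers so that no propensity couples s and x; then
# P_0[s, x] = P_0[s] P_0[x] and Z_0 = 1.
s_ref, x_ref = 5, 5
ref = [lambda s, x: k_s,
       lambda s, x: k2 * s * x_ref,         # S dynamics no longer see X
       lambda s, x: k1 * s_ref,             # X production no longer sees S
       lambda s, x: g * x]

traj = [(0.1, 0), (0.2, 2), (0.15, 3), (0.3, 1)]  # hand-written example path
U = neg_log_lik(traj, 5, 5, inter, updates, 1.0)
U0 = neg_log_lik(traj, 5, 5, ref, updates, 1.0)
print(U, U0, U - U0)                        # Delta_U = U - U_0 on this path
```

The difference U − U_0 is exactly the interaction contribution ΔU[s, x] for this path, which is the quantity averaged in the free-energy estimators of Section III A.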
In summary, for systems with feedback, we can compute marginalization integrals by specifying a reference potential U_0[s, x] through a non-interacting version of the system. Barring such a decomposition into non-interacting and interacting parts, however, there is generally no unambiguous choice of the reference potential U_0[s, x] for computing the marginal probability P[x]; the specific choice of U_0[s, x] is system-specific. Still, whenever a suitable expression for U_0[s, x] can be found, we can use the techniques developed in Section III A to compute the marginal probability P[x].

Appendix D: Stochastic Chemotaxis Model
We developed a stochastic chemotaxis model that describes individual reactions using a master equation framework. In our model, receptors are organized in clusters. To each cluster we assign a probability of being active that depends on the ligand concentration. Additionally, we explicitly model the methylation and demethylation events of each receptor, which affect the activity of a cluster. The cluster activity, in turn, determines its ability to phosphorylate the protein CheY. Phosphorylated CheY binds to the molecular motors driving the flagella, which alters the cell's tumbling rate. We do not, however, model this downstream effect.

MWC Model
Receptors are organized in clusters on the cell surface. In our model, each cluster consists of N receptors. The ligand-binding dynamics of a cluster is cooperative, and in the spirit of the Monod-Wyman-Changeux (MWC) model [79,108] we model this cooperativity by coupling the ligand-binding dynamics to the conformational switching dynamics of the receptors. Moreover, the energetic cost of two receptors in the same cluster being in different conformational states is prohibitively large. This means that all receptors within a cluster of size N switch conformations in concert, so that we can meaningfully speak of an active or an inactive cluster. A typical value for the cluster size is reported to be N = 6 by Shimizu et al. [80]. Detailed balance requires that the ligand-binding affinity depends on whether a cluster is in the active or inactive state. Consequently, we have a dissociation constant K_a for a ligand bound to an active receptor and another dissociation constant K_i for a ligand bound to an inactive receptor. For chemotaxis, K_a ≫ K_i, i.e. the ligand-binding affinity is higher for the inactive state.
Additionally, each receptor monomer has M methylation sites that can affect its conformation and therefore the kinase activity. The aspartate receptor Tar has M = 4 methylation sites [80]. Methyl groups can be attached to a receptor by the protein CheR and are removed by the protein CheB. We model the receptors' methylation dynamics following the model of Barkai and Leibler [75], where CheB can only demethylate active receptors. Additionally, to ensure exact adaptation, in our model CheR can only attach methyl groups to inactive receptors, as in Ref. [109].
In an environment with ligand concentration c, the probability p_a(c, m) of a receptor cluster with m methylated sites to be active is determined by the free-energy difference ΔF(c, m) between the active and inactive receptor states: p_a(c, m) = [1 + e^{ΔF(c, m)}]^{−1} (Eq. (D1)), with ΔF(c, m) = N ln[(1 + c/K_i)/(1 + c/K_a)] − α(m − m_0) (Eq. (D2)), where α is the free-energy change per methyl group and m_0 a reference methylation level. Here, the number of methylated sites of a cluster (not of a single receptor) is denoted by m, ranging from 0 to N M; the parameters are again taken from Shimizu et al. The methylation dynamics, dm/dt = k_R[1 − p_a(c, m)] − k_B p_a(c, m) (Eq. (D3)), reach a steady state at the adapted activity p_a(c, m) = a_0, where a_0 = k_R/(k_R + k_B). The steady-state methylation m⋆ can be obtained from Eqs. (D1) and (D2) by solving p_a(c, m⋆) = a_0. To characterize the methylation timescale, we linearize the dynamics of m(t) around the steady state (at constant ligand concentration c(t) = c_0): to first order, dm/dt ≈ −(m − m⋆)/τ_m, where τ_m is the characteristic timescale of the methylation dynamics. We find τ_m by expanding p_a (Eq. (D1)) around m = m⋆, p_a(c_0, m) ≈ a_0 + α a_0 (1 − a_0)(m − m⋆), and plugging this first-order expansion into Eq. (D3). Thus, for small perturbations, the timescale on which the methylation level approaches its steady state is τ_m = [α a_0 (1 − a_0)(k_R + k_B)]^{−1}. The parameters k_R and k_B therefore control two important characteristics of the methylation system: the adapted activity a_0 and the methylation timescale τ_m. Shimizu et al.
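A quick numeric consistency check of the quoted methylation parameters; the per-methyl free-energy change α ≈ 2 (in units of kT, as used by Shimizu et al.) is an assumption of this sketch:

```python
# Consistency check of the methylation parameters.  Assumed ingredients:
# p_a = 1/(1 + exp(dF)) with d(dF)/dm = -alpha and alpha ~ 2 kT per methyl
# group, plus zero-order (de)methylation dm/dt = k_R (1 - p_a) - k_B p_a.
k_R, k_B = 0.075, 0.15      # 1/s, the rates chosen in the text
alpha = 2.0                 # free-energy change per methyl group (assumed)

a_0 = k_R / (k_R + k_B)     # adapted activity from dm/dt = 0
tau_m = 1.0 / (alpha * a_0 * (1.0 - a_0) * (k_R + k_B))
print(a_0, tau_m)           # 1/3 and 10 s, matching the quoted values
```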
[80] report an adapted activity of a_0 = 1/3, and based on experimental data [50, 80] we assume a methylation timescale of τ_m = 10 s. Our parameter choice, which is consistent with both of these observations, is k_R = 0.075 s^−1 and k_B = 0.15 s^−1.

CheY is phosphorylated by CheA, the receptor-associated kinase. The kinase activity is directly linked to the activity of a receptor cluster. Therefore, we assume that CheY is phosphorylated by active receptor clusters. Dephosphorylation of CheY-p is catalyzed by the phosphatase CheZ, which we assume to be present at a constant concentration. The CheZ-catalyzed dephosphorylation rate was reported to be 2.2 s^−1 for the attractant response and 22 s^−1 for the repellent response [107]. Based on these data, we use a dephosphorylation rate of k_Z = 10 s^−1 in our model. In the fully adapted state the fraction of active receptors is a_0, and therefore the mean fraction of phosphorylated CheY, ϕ_Y = [CheYp]/([CheY] + [CheYp]), is given by ϕ_Y = k_A a_0 N_c/(k_A a_0 N_c + k_Z). In the fully adapted state the phosphorylated fraction was found to be ϕ_Y ≈ 0.16 [106]. Hence, we infer a phosphorylation rate of k_A = k_Z ϕ_Y/(a_0 N_c (1 − ϕ_Y)) = 0.015 s^−1 for the literature-based model. Accordingly, for the "fitted" model, based on fitting K(t) and N(t) to those measured by Mattingly et al. [50], we use a larger phosphorylation rate due to the smaller number of clusters N_c.
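These relations can be checked numerically. The sketch below assumes a logistic (MWC-type) form for p_a that is linear in m, with the per-methylation free-energy change δf_m = −2 k_BT used in the model; the offset m_off is an illustrative stand-in for the full free-energy expression, not a fitted value.

```python
import math

# Rate constants from the literature-based model (s^-1).
k_R = 0.075   # methylation rate (CheR acts on inactive clusters)
k_B = 0.15    # demethylation rate (CheB acts on active clusters)

# Setting dm/dt = (1 - p_a) k_R - p_a k_B = 0 gives the adapted
# activity a0 = k_R / (k_R + k_B), independent of ligand concentration.
a0 = k_R / (k_R + k_B)

def p_a(m, dfm=-2.0, m_off=12.0):
    """MWC-type activity: logistic in the free energy, taken linear
    in the cluster methylation m. m_off is an illustrative offset."""
    return 1.0 / (1.0 + math.exp(dfm * (m - m_off)))

# Solve p_a(m*) = a0 for the adapted methylation level by bisection
# (p_a is monotonically increasing in m since dfm < 0).
lo, hi = 0.0, 24.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if p_a(mid) > a0 else (mid, hi)
m_star = 0.5 * (lo + hi)

# Methylation relaxation time from linearizing
# dm/dt = k_R - (k_R + k_B) p_a(m) around m*.
eps = 1e-6
dpa_dm = (p_a(m_star + eps) - p_a(m_star - eps)) / (2 * eps)
tau_m = 1.0 / ((k_R + k_B) * dpa_dm)
```

Because the linearization only involves the slope of p_a at the fixed point, this evaluates to τ_m = [(k_R + k_B) |δf_m| a_0 (1 − a_0)]^{-1} = 10 s for these rates, matching the methylation timescale assumed above.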

Reaction Kinetics
Since the timescale of conformational switching between active and inactive receptors and of ligand binding is much faster than the timescale of phosphorylation or methylation [110], we don't explicitly model ligand (un)binding and conformational switching. Each cluster is characterized by its methylation state m, which ranges from 0 to the total number of methylation sites, i.e., the number of sites per receptor M times the number of receptors per cluster N. In our Gillespie simulation, each possible state of a cluster is its own species, i.e., we have species C_m for m = 0, ..., NM. Overall, our chemotaxis model consists of four types of reactions, describing the (de)methylation of the receptor clusters and the (de)phosphorylation of CheY; only active receptors can be demethylated. These zeroth-order (de)methylation dynamics of the receptors lead to the adaptive behavior of the chemotaxis system as described above.
Similarly, only active receptors can phosphorylate the CheY protein via the receptor-associated kinase CheA. We therefore model phosphorylation as a reaction C_m + Y → C_m + Y_p with propensity k_A p_a(c, m) n_{C_m} n_Y, where k_A is a constant that represents the phosphorylation rate of an active cluster. The dephosphorylation Y_p → Y is carried out by the phosphatase CheZ at a constant rate k_Z = 10 s^−1.
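The stationary CheY-p fraction implied by these rates can be illustrated with a minimal Gillespie simulation of the phosphorylation cycle alone, holding the receptor activity at its adapted value; the lumped propensity k_act and the copy number y_tot = 500 below are illustrative stand-ins, not parameters of the full model.

```python
import random

def gillespie_phospho(k_act, k_Z, y_tot, t_end, seed=1):
    """Minimal Gillespie simulation of Y <-> Yp. The total
    phosphorylation propensity per free CheY molecule is lumped into
    k_act ~ k_A * a0 * Nc (receptor activity held at its adapted value).
    Returns the time-averaged phosphorylated fraction."""
    rng = random.Random(seed)
    yp, t, acc = 0, 0.0, 0.0
    while t < t_end:
        a1 = k_act * (y_tot - yp)   # phosphorylation propensity
        a2 = k_Z * yp               # dephosphorylation propensity
        dt = rng.expovariate(a1 + a2)
        if t + dt >= t_end:
            acc += yp * (t_end - t)  # truncate the last interval
            break
        acc += yp * dt
        t += dt
        if rng.random() * (a1 + a2) < a1:
            yp += 1
        else:
            yp -= 1
    return acc / (t_end * y_tot)

# k_act chosen so the mean-field fixed point matches phi_Y = 0.16:
# k_act = k_Z * phi_Y / (1 - phi_Y).
phi = gillespie_phospho(k_act=10.0 * 0.16 / 0.84, k_Z=10.0,
                        y_tot=500, t_end=100.0)
```

The mean-field fixed point is ϕ_Y = k_act/(k_act + k_Z) = 0.16, and the simulated time average fluctuates closely around this value.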
The information rate in the Gaussian framework can thus be computed by obtaining the required (cross-)correlation functions. In their experiments with E. coli bacteria, Mattingly et al. [50] don't obtain these correlation functions directly, however. Instead, they obtain three kernels, V(t), K(t) and N(t), from which the correlation functions can be inferred. We proceed by discussing the three kernels individually.
V(t) denotes the autocorrelation function of the swimming velocity of the bacteria, i.e., V(t) = ⟨v_x(τ) v_x(τ + t)⟩. As explained in Appendix E, the swimming dynamics of the bacteria determine the statistics of the input signal s(t) = d/dt ln c(t), where c(t) is the ligand concentration as experienced by the bacterium. The input signal correlation function, denoted by C_ss(t), can then be expressed as C_ss(t) = g² V(t), where g is the gradient steepness.
The response kernel, denoted by K(t), represents the time evolution of the average activity of the receptors in response to an instantaneous step change in the input concentration. More precisely, K(t) is defined as

K(t) = θ(t) ⟨a(t) − a_0⟩ / ln(c_s/c_0),

where we assume the input concentration jumps instantaneously from c_0 to c_s at time t = 0, and θ(t) is the Heaviside step function. Note that because the signal s(t) is defined as the time derivative of ln c(t), a step change in c(t) corresponds to a delta impulse in s(t). Thus, K(t) describes the deterministic dynamics of the system after being subjected to a unit stimulus s(t) = δ(t), i.e., K(t) is the Green's function of the system. The stochastic response a(t) to an arbitrary time-dependent signal s(t) can be written as a convolution of K(t) with s(t),

a(t) = a_0 + ∫_{−∞}^{t} dt′ K(t − t′) s(t′) + η_a(t),

where η_a(t) is the receptor activity noise. We define the response x(t) = a(t) − a_0. Assuming the input statistics are stationary and described by the correlation function C_ss(t), it is easy to show that the cross-correlation between s(t) and x(t) is given by

C_sx(t) = ⟨s(τ) x(τ + t)⟩ = ∫_0^∞ dt′ K(t′) C_ss(t − t′).

In other words, the cross-correlation between s(t) and x(t) is given by the convolution of the response kernel with the input correlation function.
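This convolution relation can be verified numerically. The exponential kernel and correlation function below are toy stand-ins (not the measured chemotaxis kernels), chosen so the zero-lag result has a simple closed form.

```python
import math

def cross_corr(K, Css, dt):
    """Discretized C_sx(t_i) = sum_j K(t_j) * Css(t_i - t_j) * dt,
    using that Css is an even function of the lag."""
    n = len(K)
    return [sum(K[j] * Css[abs(i - j)] * dt for j in range(n))
            for i in range(n)]

dt = 0.01
t = [i * dt for i in range(400)]
K = [math.exp(-ti) for ti in t]           # toy response kernel
Css = [math.exp(-2.0 * ti) for ti in t]   # toy (even) input correlation
Csx = cross_corr(K, Css, dt)
# At zero lag, C_sx(0) = integral_0^inf e^{-t'} e^{-2t'} dt' = 1/3.
```

Replacing K by a discrete delta (height 1/dt at t = 0) returns C_ss itself, as expected for a Green's-function kernel.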
The noise kernel N(t) describes the autocorrelation of the activity fluctuations in the absence of an input stimulus. In particular, N(t) = ⟨x(τ) x(τ + t)⟩ = ⟨η_a(τ) η_a(τ + t)⟩, where we assume that s(t) = 0.
We now rewrite Eq. (F3) for the mutual information rate in terms of the three kernels described above. To do so, we need to express the power spectra P_αβ(ω) in terms of the Fourier-transformed kernels V(ω), K(ω), and N(ω). In Appendix E we already showed that P_ss(ω) = g² V(ω). The cross power spectrum is given by P_sx(ω) = K(ω) P_ss(ω), which follows from Eq. (F7). Finally, from Ref. [6] we use the identity P_xx(ω) = P_ss(ω)|K(ω)|² + N(ω) to express the output power spectrum. We insert these expressions into Eq. (F3), which yields

R = (1/4π) ∫ dω ln[1 + g² V(ω)|K(ω)|² / N(ω)].

Then, for shallow gradients, we can make a Taylor approximation to obtain

R ≈ (g²/4π) ∫ dω V(ω)|K(ω)|² / N(ω).

This result shows that the information rate in shallow gradients is proportional to g², with a proportionality constant determined by the measured kernels. Mattingly et al. [50] obtain the relevant kernels V(ω), K(ω), and N(ω) from experiments by fitting phenomenological models to their single-cell data. How we obtain these kernels for our chemotaxis model is described in Appendix G.
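The g² scaling of the shallow-gradient rate can be demonstrated numerically. The Lorentzian velocity spectrum, low-pass response, and flat noise spectrum below are toy stand-ins for the measured V(ω), |K(ω)|², and N(ω).

```python
import math

def info_rate(g, omegas, domega, V, K2, N):
    """Gaussian information rate
    R = 1/(4*pi) * sum_k domega * ln(1 + g^2 V(w_k) K2(w_k) / N(w_k)),
    a Riemann-sum version of the spectral integral."""
    total = sum(math.log(1.0 + g * g * V(w) * K2(w) / N(w)) for w in omegas)
    return total * domega / (4.0 * math.pi)

V = lambda w: 1.0 / (1.0 + w * w)            # toy velocity spectrum
K2 = lambda w: 1.0 / (1.0 + (w / 2.0) ** 2)  # toy |K(omega)|^2
N = lambda w: 0.1                            # toy flat noise spectrum

domega = 0.01
omegas = [i * domega for i in range(-5000, 5000)]
R_small = info_rate(0.01, omegas, domega, V, K2, N)
R_double = info_rate(0.02, omegas, domega, V, K2, N)
# Doubling g roughly quadruples the rate in the shallow-gradient regime.
```

Since ln(1 + x) ≈ x for small x, the ratio R_double/R_small approaches 4 from below as the gradient becomes shallower.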
Figure 8 shows the Fourier representations of the relevant kernels, V(ω), K(ω), and N(ω). We computed these kernels for three different systems: the literature-based model, the fitted model, and the experimental system of Mattingly et al. [50]. Because the kernels are different, so are the Gaussian information rates that we obtain. The results are shown and discussed in the main text.

To evaluate the kernels in the Fourier domain, we sample the time-domain kernel K(t) at N points K_n and compute the discrete Fourier transform (DFT) coefficients

K̃_k = Σ_{n=0}^{N−1} K_n e^{−i2πnk/N}, (G2)

where k = 0, 1, ..., N − 1. These DFT coefficients can be computed efficiently using the Fast Fourier Transform (FFT) algorithm. The DFT provides point estimates for the Fourier-domain kernel K(ω) at discrete frequencies ω_k, i.e., K(ω_k) ≈ K̃_k. This approximation introduces some level of error, known as spectral leakage, due to the finite duration and sampling of the signal. This error can be reduced by multiplying the time-domain kernel with a window function. Thus, we multiply the kernel with a Hanning window, a smooth function that tapers at the edges of the kernel, reducing the effect of discontinuities at the beginning and end of the time series. The Hanning window is defined as

h_n = (1/2)[1 − cos(2πn/(N − 1))].

The windowed kernel k_n is obtained by multiplying the time-domain kernel K_n with the Hanning window h_n: k_n = h_n K_n. Using the FFT algorithm we then compute the DFT coefficients k̃_k of the windowed kernel. The procedure described above to obtain the DFT coefficients k̃_k from K(t) is applied to N(t) as well to obtain the coefficients ñ_k. We can then evaluate the information rate using Eq. (F9) by discretizing the integral ∫ dω F(ω) → Σ_k Δω F(ω_k) with Δω = 2πf_s/N.

Under the assumption that the information rate is zero in the limit N_c → 0, we fit a quadratic function R(N_c) = aN_c − bN_c² with positive coefficients a, b to the data. We provide the fit coefficients for different gradient steepnesses g in Table III. From these fits we obtain the extrapolated information rates for N_c = 9 that are shown in the main text.
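The windowing-and-transform procedure above can be sketched in a few lines; the O(N²) DFT stands in for the FFT, and the decaying kernel is an illustrative example, not a measured one.

```python
import cmath
import math

def hanning(N):
    """Hanning window h_n = 0.5 * (1 - cos(2*pi*n/(N-1)))."""
    return [0.5 * (1.0 - math.cos(2.0 * math.pi * n / (N - 1)))
            for n in range(N)]

def dft(x):
    """Plain O(N^2) DFT, X_k = sum_n x_n exp(-i 2 pi n k / N);
    an FFT computes the same coefficients more efficiently."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N)
                for n in range(N))
            for k in range(N)]

# Window a decaying time-domain kernel before transforming, to reduce
# spectral leakage from the truncated time series.
N = 64
K = [math.exp(-0.1 * n) for n in range(N)]
h = hanning(N)
spectrum = dft([h[n] * K[n] for n in range(N)])
```

The window vanishes at both ends of the time series, which is exactly what suppresses the discontinuity-induced leakage.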

FIG. 1. Schematic of information processing under the influence of noise. Overlapping output distributions for different inputs lead to information loss, because the input cannot always be uniquely inferred from the output. The mutual information I(S, X) quantifies how much information the observation of the output typically retains about the input signal.

FIG. 3. Illustration of one step of the bootstrap particle filter in RR-PWS. We start with a set of trajectories s^k_{[0,i−1]} with time span [τ_0, τ_{i−1}] (left panel). In the next step we propagate these trajectories forward in time to τ_i, according to P[s] (central panel). Then we resample the trajectories according to the Boltzmann weights of their most recent segments, effectively eliminating or duplicating individual segments. An example outcome of the resampling step is shown in the right panel, where the bottom trajectory was duplicated and one of the top trajectories was eliminated. These steps are repeated for each segment, until a set of input trajectories of the desired length is generated. The intermediate resampling steps bias the trajectory distribution from P[s] towards P[s|x].

FIG. 4. Comparing different schemes to compute the mutual information as a function of trajectory duration for a simple coupled birth-death process with rates κ = 50, λ = 1, ρ_0 = 10, µ = 10 and a steady-state initial condition. The top panels show example trajectories of input and output, as well as the mean (solid line) and standard deviation (shaded region). Below, the mutual information is shown as a function of trajectory duration. The inset shows an enlarged version of the dotted rectangle near the origin. For short trajectories all PWS estimates agree. Yet, for longer trajectories, DPWS and TI-PWS require a much larger number of input trajectories M for computing P[x] than RR-PWS to converge. Results for the three PWS variants are compared with the Duso and Zechner [30] estimate, and with the linear noise approximation from Ref. [6]. We find excellent agreement between the Duso scheme and RR-PWS. The Gaussian linear noise approximation systematically underestimates the mutual information. All PWS estimates, as well as the Duso approximation, were computed using N = 10^4 samples from P[s, x].

FIG. 5. Comparing estimation bias for the different PWS variants in relation to their CPU time requirements. Each dot represents a single mutual information estimate with N = 10^4 samples for trajectories of duration T = 5. Almost all of the CPU time of a PWS estimate is spent on the computation of the marginal probability P[x]. The bias of the marginal probability estimate can be reduced by using a larger number M of sampled input trajectories to compute the marginalization integral, which also increases the required CPU time. The RR-PWS estimate converges much faster than the estimates of DPWS and TI-PWS. For DPWS and TI-PWS, the dots represent estimates ranging from M = 2^5 to M = 2^14; for RR-PWS, from M = 2^3 to M = 2^10. As the baseline of zero bias we use the converged result from the RR-PWS estimates.

FIG. 6. The information transmission rate in the bacterial chemotaxis system. a Cartoon of the chemotaxis network of E. coli. Receptors form clusters with an associated CheA kinase. A cluster can be either active or inactive, depending on the number of bound ligands (green dots) and methylated sites (orange dots). Active CheA can phosphorylate CheY; phosphorylated CheY controls the rotation direction of the flagellar motors and thereby the movement of the bacterium. b In a shallow gradient c(x), the bacterium diffuses nearly freely in the x-direction. The variance of the position increases with time, the hallmark of a random walk. The input signal is the concentration c(t) = c(x(t)) as experienced by the bacterium at time t. c This movement gives rise to the input statistics of the signal. The shaded regions indicate the 75th and 95th percentiles; example trajectories are displayed in color. This signal is non-stationary, as its variance keeps growing. d Mutual information I(C_T, Y_T) between input trajectories c(t) and output trajectories y_p(t) as a function of trajectory duration T. In each RR-PWS simulation, N = 7200 Monte Carlo samples were used (M = 256 for the particle filter). e The information transmission rate is defined as I(C_T, Y_T)/T in the limit T → ∞.

FIG. 7. Comparison of theoretical models with experimental data for the bacterial chemotaxis system. Panels a and b show the response and noise kernels, respectively, for the model based on literature parameters (green), the model with parameters fitted to experiments (blue), and the experiments of Mattingly et al. [50] (orange). In panel c, the information transmission rate is shown for each model as a function of gradient steepness, with results from the Gaussian approximation shown alongside exact PWS calculations. The fitted model closely matches the experiments, while the literature-based model overestimates the information transmission rate by a factor of ≈ 4 despite having a lower response amplitude (panel a). This is because the literature-based model has a large number of independent receptor clusters N_c, resulting in much lower noise in the output (panel b). In all cases, the Gaussian approximation matches the exact PWS results, providing support for the accuracy of the measurements by Mattingly et al. [50].
We take Eq. (C5) as the prototype for a marginalization integral we want to compute. Unlike in Section III A, we now assume that x feeds back onto s. That means that we have access to the joint distribution's density P[s, x], but not to the marginal density P[s] or the conditional density P[x|s]. Formulated in the language of statistical physics, all of the techniques of Section III A are estimators of the free-energy difference ∆F[x] = F[x] − F_0[x] between two ensembles described by potentials U[s, x] and U_0[s, x].
[80]. Their experimental results indicate that δf_m = −2 k_BT and m_0 = −N/2. Kamino et al. [81] report ligand dissociation constants of K_a = 2900 µM for active receptors and K_i = 18 µM for inactive Tar receptors (for MeAsp). Note that in the equations we assume units such that k_BT = 1. The dynamics of methylation in our model are described by the following mean-field equation:

dm/dt = (1 − p_a(c, m)) k_R − p_a(c, m) k_B. (D3)

The four reaction types are: (a) the methylation of a receptor cluster, C_m → C_{m+1}; (b) the demethylation of a receptor cluster, C_m → C_{m−1}; (c) the phosphorylation of CheY, C_m + Y → C_m + Y_p; and (d) the single dephosphorylation reaction Y_p → Y. Thus, due to the combinatorial explosion of receptor states, the system has a total number of 3NM + 2 elementary reactions (which amounts to 75 reactions in the literature-based model and 182 reactions in the fitted model). The ligand-concentration-dependent methylation rate for C_m → C_{m+1} is given by

k_{m+}(c, m) = (1 − p_a(c, m)) k_R. (D11)

The term 1 − p_a(c, m) is needed because only inactive receptors can be methylated. The demethylation rate for C_m → C_{m−1} is given by

k_{m−}(c, m) = p_a(c, m) k_B. (D12)

FIG. 9. The information rate as a function of the number of receptor clusters N_c. The cluster size is fixed at N = 15. The left panel shows the increase of the information rate as a function of gradient steepness for different values of N_c, including a line for the experimental data from Mattingly et al. [50]. The right panel shows the same data but highlights the increase of the information rate when increasing the number of receptor clusters. A quadratic fit (shown as dotted lines) is used to extrapolate the information rate. All results were obtained using RR-PWS.
We see that while we don't need to evaluate the likelihood P[x|s], we now need to explicitly compute the joint density P[s, x], and two marginal densities, P[s] and P[x], for each Monte Carlo sample (s, x) ∼ P[s, x]. While the joint density can be evaluated directly by assumption, each of the marginalized densities can only be computed using a nested Monte Carlo estimate. Specifically, for PWS with feedback, we need to compute two marginalization integrals per Monte Carlo sample:

TABLE II.
Parameters required for the chemotaxis model, based on literature values. These are the parameters used in the so-called literature-based model. In the fitted model (see main text) the same parameter values are chosen, except for N = 15 and N_c = 9, which were obtained by fitting to the data of Mattingly et al. [50]; we note that changing N and N_c also requires updating k_A to keep the fraction ϕ_Y of phosphorylated CheY constant.

TABLE III.
Fit coefficients for the information rate as a function of the number of clusters N_c. These coefficients are for a quadratic function R(N_c) = aN_c − bN_c².