Information flow and causality as rigorous notions ab initio

Information flow (or information transfer as may be called) the widely applicable general physics notion can be rigorously derived from first principles, rather than axiomatically proposed as an ansatz. Its logical association with causality and, particularly, the most stringent one-way causality, if existing, is firmly substantiated and stated as a fact in proved theorems. Established in this study are the information flows among the components of time-discrete mappings and time-continuous dynamical systems of arbitrary dimensionality, both deterministic and stochastic. They have been obtained explicitly in closed form, and all possess the property of causality, which reads: if a component, say $x_i$, has an evolutionary law independent of $x_j$, then the information flow from $x_j$ to $x_i$ vanishes. These results have been put to applications with benchmark systems, such as the Kaplan-Yorke map, the R\"ossler system, the baker transformation, the H\'enon map, and a stochastic potential flow. Besides recovering the properties as expected from the respective systems, some of the applications show that the information flow structure underlying a complex trajectory pattern could be tractable. For linear systems, the resulting remarkably concise formula asserts analytically that causation implies correlation, while correlation does not imply causation, resolving unambiguously the long-standing debate over causation versus correlation.


I. INTRODUCTION
Information flow, or information transfer as it may be referred to in the literature, has been realized as a fundamental notion in general physics. Though literally one may associate it with communication, its importance lies far beyond in that it implies causation [1]- [5], uncertainty propagation [6], predictability transfer [7], etc. In fact, it is the recognition of its causality association that has attracted enormous interest from a wide variety of disciplines, particularly in neuroscience [8]- [14], finance [15]- [16], climate science [17]- [18], turbulence research [19]- [20], network dynamics [21]- [24], and dynamical systems particular in the field of synchronization [25]- [31]. This recognition has been further substantiated by the finding that transfer entropy [5] and Granger causality [32] are equivalent (up to a factor 2) [33].
Historically, many information theoretic quantities have been proposed to measure information flow, including time-delayed mutual information [34], transfer entropy [5], momentary information transfer [18], causation entropy [35], to name a few. Among these most notably is transfer entropy, which has spawned many varieties in its family, e.g., [15], [36], [11], and has been widely applied in different disciplines.
A fundamental question to ask is whether information flow needs to be axiomatically proposed as an ansatz (as above), or it can be derived from the first principles in information theory. Naturally, one would like to minimize or avoid the use of axioms in introducing new concepts in order to have the material more coherent within the field to which it belongs. In physics, "flow" or "transfer" does have definite meaning, albeit the meaning may differ depending on the context. One then naturally expects that the concept be rigorized. Indeed, as we will see soon, at least within the framework of dynamical systems, information flow/transfer can be rigorously derived from, rather than empirically or axiomatically proposed with, Shannon entropy.
Another impetus regards the inference of causality. As mentioned in the beginning, information flow arouses enormous interest in a wide range of fields not because of its original meaning in communication but because of its logical implication of causation. Whether the cause-effect relation underlying a system can be faithfully revealed is, therefore, the touchstone for a formalism of information flow. That is to say, information flow should be formulated with causality naturally embedded; it should, in particular, accurately reproduce a one-way causality (if existing), which is unambiguously equal to zero on one side. In this light, the widely used formalism namely transfer entropy is, unfortunately, not as satisfactory one expects. This has even led to discussions on whether the two notions, namely, information flow and causality , should be differentiated (e.g., [38]). Since it is established that Granger causality and transfer entropy are equivalent, one may first look at the problems from the former. Now it is well known that spurious Granger causality may arise due to unobserved variables that influence the system dynamics (a problem identified by Granger himself) [39], due to low resolution in time [40] [41], and due to observational noise [42]. Besides, Granger explicitly excludes deterministic systems in establishing the causality formalism, a case that for sure is important in realistic problems. For transfer entropy, the issue has just been systematically examined [44]. Aside from the failure in recovering the many preset one-way causalities, evidence has shown that sometimes it may even give qualitatively wrong results; see [44] and [43] for such examples.
Realizing the limitation of transfer entropy, different alternatives have been proposed; the above momentary information transfer is one of these proposals. The purpose of this study is, instead of just remedying the deficiencies of the existing formalisms, to put information flow the fundamental physical notion on a rigorous footing so that it is universally applicable. The stringent one-way causality requirement will not be just verified with certain given examples, but rigorously proved as theorems.
With this faith, recently Liang and Kleeman (2005) [45] take the initiative to study the problem with dynamical systems. In this framework, the information source and recipient are abstracted as the system components, and hence the problem is converted into the information flow or information transfer between dynamical system components. The basic idea can be best illustrated with a deterministic system of two components, say, x 1 and x 2 : where we follow the convention in physics and do not distinguish random and deterministic variables, which should be clear in the context. Now what we are to consider are the time evolutions of the marginal entropies of x 1 and x 2 , denoted respectively as H 1 and H 2 . Look at x 1 , its marginal entropy evolution may be due to x 1 itself or subject to the influence of x 2 . This partitions the mechanisms that cause H 1 to grow into two exclusive parts. That is to say, if we write the contribution from the former mechanism as dH * /dt and that from the latter as T 2→1 , This T 2→1 is the very time rate of information flowing from x 2 to x 1 . We remark that this setting is rather generic, except for the requirement of differentiability for the vector field F = (F 1 , F 2 ) T . In particular, the input-output communication problem can be cast within the framework by letting, for example, F 2 = F 2 (x 1 , t), F 1 = F 1 ((x 1 , t), where x 1 is the input/drive and x 2 the output/consequence, and the channel is represented by F 2 . From the above argument, the evaluation of the information flow T 2→1 may be fulfilled through evaluating dH * 1 /dt. This is because that, when a dynamical system is given, the density evolution is known through the corresponding Liouville equation, and, accordingly, dH 1 /dt can be obtained. In [45], Liang and Kleeman prove that the joint entropy of (x 1 , x 2 ) follow a very concise law where E is the operator of mathematical expectation. They then argue that and hence obtain the time rate of information flowing from x 2 to x 1 where ρ 1 is the marginal probability density function of x 1 . The thus-obtained information flow is asymmetric between x 1 and x 2 ; moreover, it possesses a property of causality, which reads, if the evolution of x 1 does not depend on x 2 , then T 2→1 = 0. The above result is later on proved [47] [48]. It is remarkable in that the stringent one-way causality in a system, if existing, can be stated as a proven theorem, rather than a fact for a formalism to verify; see [46] for a review. This result, however, is only for systems of dimension 2 (2D). For systems with many components, it does not work any more. We have endeavored to extend it to more general situations and do have obtained results for deterministic systems of arbitrary dimensionality which possess the property of causality. But, as we will see in the following section, the extension relies on an assumption that is, again, axiomatically proposed. This makes the resulting formalism not one fully derived from first principles, and as we realize later on, it does not work for multidimensional stochastic systems. This line of work, though with a promising start, is stuck at this point.
In this study, we will show that the assumption can be completely removed. In a unified approach, the notion of information flow can be rigorously derived for both deterministic and stochastic systems of arbitrary dimensionality. In the following, we first briefly set up the framework, and show where the snag lies in the above approach. The solution is then presented, and applied to derive the information flows for deterministic mappings (section III), continuous-time deterministic systems (section IV), stochastic mappings (section V), and continuous-time stochastic systems (section VI). For the purpose of demonstration, each section contains one ore more applications. As an important particular case, we specialize to do the derivation for linear systems, and the material is presented in section VII. This study is summarized in section VIII.

II. THE SNAG THAT STUCK THE LIANG-KLEEMAN FORMALISM
The success of the Liang-Kleeman formalism is remarkable. It is, however, only for 2D dynamical systems. When the dimensionality exceeds 2, the resulting quantity, namely, (6), is not the information transfer from x 2 to x 1 , but the cumulant transfer to x 1 from all other components x 2 , x 3 ,..., x n . In this sense, the use of (6) is rather limited.
In order to extend the formalism to systems of higher dimensionality, Liang and Kleeman [47] [48] re-interpret the term dH * 1 /dt in the above decomposition (3), for a 2D system, as the evolution of H 1 with the effect of x 2 excluded. More specifically, it is the evolution of H 1 with x 2 instantaneously frozen as a parameter at time t. To avoid confusing with dH * 1 /dt, denote it as dH 1\ 2 /dt, where the subscript \2 signifies that x 2 is frozen, or that its effect is removed. With this, the disjoint decomposition (3) is re-stated as Note this decomposition, albeit seemingly with only a change of symbol, is actually fundamentally different from (3) in physical meaning; it now holds for systems of arbitrary dimensionality. The information flow is, therefore, Of course, the key is how to find dH 1\ 2 dt . In [47] and [48], Liang and Kleeman start with discrete mappings, and then take the limit as the time stepsize goes to zero. To illustrate, let Φ : R n → R n be a mapping taking x(τ ) to x(τ + 1), from time step τ to τ + 1. Correspondingly there is another mapping P : L 1 (R n ) → L 1 (R n ) that steers its density ρ forward. This mapping is called a Frobenius-Perron operator; we will refer it to F-P operator henceforth. Loosely speaking, P is, for any ω ⊂ R n , such that [49] ω When the sample space is in a Cartesian product form, as is in this case (R n ), the operator can be evaluated. Let a = (a 1 , a 2 , ..., a n ) be some constant point, and ω = [a 1 , It has be established that (e.g., [49]) For convenience, a is usually taken to be the origin. Furthermore, if Φ is nonsingular and invertible, then P can be explicitly written out where J is the Jacobian of Φ.
As the F-P operator carries ρ forth from time step τ to τ + 1, accordingly the entropies H, H 1 , and H 2 are also steered forward. On [τ, τ + 1], let H 1 be incremented by ∆H 1 . By the foregoing argument, the evolution of H 1 can be decomposed into two exclusive parts, namely, the information flow from x 2 , T 2→1 , and the evolution with the effect of x 2 excluded, ∆H 1\ 2 . Hence, for discrete mappings, we have the following counterpart of (8): Liang and Kleeman derive the information flow for the continuous system from the discrete mapping. So the whole procedure relies on how is evaluated, or, more specifically, how H 1\ 2 (τ + 1) is evaluated (since H 1 (τ ) is known). To see where lies its difficulty, first notice that which is the mean of − log(Pρ) 1 (x 1 ). Given Φ, P can be found in the way as shown above, so H 1 (τ + 1) is known. For H 1\ 2 , however, things are much more difficult; − log(P \ 2 ρ) 1 (x 1 ) involves not only the random variable x 1 (τ + 1), but also x 2 (τ ) (embedded in the subscript \2). What is the joint density of (x 2 (τ ), x 1 (τ + 1))? We do not know. In [47] and [48], an approximation was proposed, which gives where y 1 is employed to signify x 1 (τ + 1) and the symbol x 1 is reserved for x 1 (τ ). This is a natural extension of what the authors use in the their original study [45] for 2D discrete mappings. A central approximation is that they use to represent the joint pdf of y 1 and x 2 (and x 3 , ...x n ) (think about ρ(x 2 |x 1 )ρ(x 1 ) = ρ(x 1 , x 2 )). This is, however, only an approximation, since we really don't know what the joint pdf of (y 1 , x 2 ) is. As we will see soon, though the resulting formalism verifies dH 1\ 2 /dt = dH * 1 /dt for 2D systems and the zero-causality property for one-way causal deterministic systems, when stochasticity gets in, the causality property cannot be recovered this way.

A. Derivation
Fortunately, the issue that stuck the Liang-Kleeman formalism can be fixed; we actually can get the entropy without appealing to the joint probability density function of (y 1 , x 2 ), i.e., that of (x 1 (τ + 1), x 2 (τ )) as mentioned above. Consider a mapping where Ω is the sample space (R n in particular). Let ψ : Ω → Ω be an arbitrary differentiable function of x. We have the following theorem: Theorem III.1 Eψ(x(τ + 1)) = Eψ(Φ(x(τ ))).
Remark 1: The expectation operator E on the right hand side applies to a function of x(τ ); it is thence with respect to ρ(τ ). Differently, the left hand side E is with respect to ρ(τ + 1) = Pρ, where P is the F-P operator as introduced in (9). Remark 2: This equality is important in that one actually can obtain the expectation of ψ(x(τ + 1)) without evaluating Pρ. Proof. The following proof is in the framework of Riemann-Stieltjes integration. A more general proof in terms of Lebesgue theory is also possible but is unnecessary, since the functions and vector fields we are dealing with in this study are assumed to differentiable.
Let {ω 1 , ω 2 , ..., ω n } be a partitioning of the sample space Ω. The elements are mutually exclusive and Ω = ∪ n k=1 ω k . To make it simple, assume that these ω k 's have the same diameter (the maximal distance between any two points in ω k ). For clarity, write x(τ + 1) as y, while x is reserved for x(τ ). Then where y k ∈ ω k is some point in ω k . The existence of the Riemann integral Ω Pρ(y)ψ(y)dy assures that it can be any point in ω k as n goes to infinity, while the resulting integral is the same. Now by (9),
The equality (13) actually can be utilized to derive the F-P operator. We look at the particular case when Φ is invertible. By definition, Eq. (13) means If Φ is invertible, the right hand side is Ω ψ(y) · ρ (τ, Φ −1 (y)) |J −1 | dy by transformation of variables. Since ψ is arbitrary, we have which is precisely the Frobenius-Perron operator (11).
The above equality provides us a convenient and accurate way to evaluate H 1 (τ + 1) and H 1 \2(τ + 1). Picking ψ as (log Pρ) 1 and log(P \ 2 ρ) 1 , we obtain, respectively, the following formulas: In these formulas, both the expectations are taken with respect to ρ(x 1 , x 2 , ...x n ), i.e., the pdf at time step τ . In (15), we do not need to care about the joint pdf ρ(y, x 2 ) any more. The information flow from x 2 to x 1 is, therefore, Proof.
Substitute into the above formulas for H 1 and H 1\ 2 and (16) follows. Note that the evaluation of P \ 2 ρ and Pρ generally depends on the system in question. But when Φ and Φ \ 2 are invertible, the information flow can be found explicitly in a closed form.

B. Properties
Theorem III.3 For 2D systems, if Φ 1 is invertible, then Remark: This is the analog of (5) for discrete-time systems [45]. Proof. Let x(τ + 1) ≡ y. For a 2D system, and if Φ 1 is invertible, we have which gives To avoid confusion, we use E x to indicate that the expectation is with respect to x when mixed variables x and y appear simultaneously. Note here x 1 = Φ −1 1 (y 1 ) since this is a 1D system after x 2 is frozen. Thus Theorem III.4 (Property of causality) If Φ 1 is independent of x 2 , then T 2→1 = 0.
Proof. By the definition of the F-P operator, . That is to say, On the other hand, a.e.
C. Application-Kaplan-Yorke map Once a dynamical system is specified, in principle the information flow can be obtained. This subsection presents an application with a discrete-time dynamical system, the Kaplan-Yorke map [50], that exhibits chaotic behavior.
The Kaplan-Yorke map is defined as a mapping Φ = (Φ 1 , Φ 2 ) : [0, 1] × R → [0, 1] × R, (x 1 , x 2 ) → (y 1 , y 2 ), such that A typical trajectory for α = 0.2 is plotted in Fig. 1. We now compute the information flows between the two components. First we need to find the F-P operator Pρ(y 1 , y 2 ). Pick a domain ω = [0, y 1 ] × [0, y 2 ]. By (10) so the key is the finding of Φ −1 (ω). Since it is easy to obtain Φ −1 1 ([0, y 1 ]) = 0, To avoid the round-off error in the computation which will quickly lead to a zero x 1 , we let b = 9722377, and instead compute a n+1 = 2a n mod b, x 1,n+1 = a n /b, x 2,n+1 = αx 2,n + cos(4πx 1,n ). The trajectory is initialized with x 1 = 7722377/b, x 2 = 0. (The initial points outside the attractor are not shown.) Given y 1 , x 1 may be either y 1 /2 or (1 + y 1 )/2, but either way, cos(4πx 1 ) = cos(2πy 1 ). Thus Eq. (19) is, therefore, Pρ(y 1 , y 2 ) = ∂ 2 ∂y 2 ∂y 1 To compute T 2→1 , freeze x 2 . The resulting mapping Φ \ 2 is the dyadic mapping in the x 1 direction. As above, which gives On the other hand, just as one would expect based on the independence of Φ 1 on x 2 . This serves as a validation of Theorem III.4. To compute T 1→2 , notice The corresponding F-P operator is such that which makes sense, considering that, when x 1 is frozen, y 2 is just a translation followed by a rescaling of x 2 . On the other hand, the marginal density Because of the intertwined y 1 and y 2 , these integrals cannot be explicitly evaluated without specifications of ρ. But when ρ is given, it is a straightforward exercise to compute Denote it byH 2 . Then Generally this does not vanish. That is to say, within the Kaplan-Yorke map, there exists a one-way information flow from x 1 to x 2 .

D. Applications-The baker transformation and Hénon map revisited
Since its establishment, the Liang-Kleeman formalism has been applied to a variety of dynamical system problems. Hereafter we will re-study some benchmark examples and see whether the results are different. In this subsection we look at the baker transformation and Hénon map.

Baker transformation
The baker transformation is an extensively studied prototype of area-conserving chaotic maps that has been used to model the diffusion process in real physical world. It mimicks the kneading of dough: first the dough is compressed, then cut in half; the two halves are stacked on one-another, compressed, and so forth; see Fig. 2 for an illustration. In formal language, it is Φ : It is invertible, and the inverse is Thus the F-P operator P can be easily found We now use the above theorem to compute T 2→1 . Integrating (26) with respect to x 2 , When x 2 is frozen as a parameter, the baker transformation (24) becomes a dyadic mapping in x 1 direction, i.e., a mapping Φ 1 : By the above theorem, By Proposition ??, To compute H 2\ 1 , notice that, when x 1 is frozen, the transformation is invertible; moreover, the Jacobian J 2 = 1/2 is a constant. By Theorem III.3, which gives It is easy to show that I + II > 0. In fact, obviously I + II is nonnegative; besides, the two brackets cannot be zero simultaneously, so it cannot be zero. Hence T 1→2 is strictly positive; that is to say, there is always information flowing from the abscissa to the ordinate.
To summarize, T 2→1 = 0, T 1→2 = I + II > 0. These results are precisely the same as those obtained before in [45] and [47]. That is to say, for the baker transformation, the current formalism shows no difference from the previous one based on heuristic arguments and with approximations.

Hénon map
The Hénon map is a mapping with a > 0, b > 0. The case with parameters a = 1.4 and b = 0.3 is called a "canonical Hénon map," whose attractor is shown in Fig. 3.
It is easy to see that the Hénon map is invertible; its inverse is The F-P operator thus can be easily found from (11): In the following we compute the flows between the quadratic component x 1 and the linear component x 2 .
Look at T 2→1 first. By (16), we need to find the marginal density of x 1 at step τ + 1 with and without the effect of x 2 , i.e., (Pρ) 1 and (Pρ) If a = 0, this would give ρ 2 (x 1 − 1), i.e., the marginal pdf of x 2 with argument x 1 − 1. But here a > 0, the integration is taken along a parabolic curve rather than a straight line. Still the final result will be related to the marginal density of x 2 ; for notational simplicity, write To find (P \ 2 ρ) 1 , use y 1 to denote following our convention to distinguish variables at different steps. Modify the system so that x 2 is now a parameter. As before, we need to find the counterimage of (−∞, y 1 ] under the transformation with x 2 frozen: Therefore, Denote the average of ρ 1 (−x 1 ) and ρ 1 (x 1 ) asρ 1 (x 1 ) to make an even function of x 1 . Then Note that the parameter x 2 does not appear in the arguments. Substitute all the above into (16) to get . Comparing this to the result in [47], except for the term −E log |ax 1 |, all other terms are different.
Next consider T 1→2 . From (35), the marginal density of x 2 at τ + 1 is Thus The evaluation of H 2\ 1 is much easier. As x 1 is frozen as a parameter, y 2 becomes definite. In this case, the 2D random variable is degenerated to a 1D variable. Correspondingly P \ 1 ρ becomes a pdf in x 1 only. So Thus H 2\ 1 = −E log(P \ 1 ρ) 2 = 0. By (16), the information flow from x 1 to x 2 is, therefore, In other words, the flow from x 1 to x 2 is equal to the marginal entropy of x 1 , modified by an amount related to the factor b. Particularly, when b = 1, T 1→2 = H 1 . This is precisely the same as what is obtained before in [47].
In a summary, the information flows within the baker transformation are precisely the same as we have obtained before in [47]. For the Hénon map, the flow from x 1 to x 2 , has recovered the benchmark result based on physical grounds. But T 2→1 is generally different from that in [47].

A. Deriving the information flow
Now look at the information flow within the continuous system, the 2D version of which motivates this line of work: . . . . . .
or, in vectorial form, Consider a time interval [t, t + ∆t]. Following [48], we discretize the ordinary differential equation and construct a mapping Φ : . Write x(t + ∆t) as y, a convention we have been using all the time to avoid confusion. Then the mapping Φ : x → y is such that Its Jacobian is As ∆t → 0, J → 1 = 0, so Φ thus constructed is always invertible for ∆t small enough. Moreover, it is easy to obtain the inverse mapping and As a verification, check This yields the Liouville equation ∂ρ ∂t + ∇ · (ρF) = 0, as is expected. With Pρ we now can compute the marginal density At the last step, the y ′ s have been replaced by x ′ s in the integral term. This is legitimate since the difference goes to higher order terms. We now evaluate the marginal entropy increase at t + ∆t. The following is the key step: Take expectation on both sides, the left hand side with respect to (Pρ) 1 (y 1 ), while the right hand side with respect to ρ 1 (x 1 ). This yields Note the third term on the right hand side vanishes after integration with respect to x 1 due to the compactness of the functions. So and hence This is precisely the same as that either from the F-P operator [47] or directly from the Liouville equation [45], serving a validation of our approach in this study.
When x 2 is frozen as a parameter on [t, t+ ∆t], we need to examine the modified mapping . . . . . .
i.e., the mapping Φ with the equation y 2 = x 2 +F 2 ∆t removed, and x 2 frozen as a parameter. Again, here x i stands for x i (t), and y i for x i (t + ∆t). For convenience, we further adopt the following notations: Besides, use ρ \ 2 to signify the joint density of x \ 2 , and ρ 1\ 2 to denote the density of x 1 with x 2 frozen as a parameter on [t, t + ∆t]. Notice the fact ρ 1\ 2 = ρ 1 at time t.
It is easy to know that the Jacobian of Φ \ 2 The corresponding F-P operator P \ 2 : L 1 (R n−1 → L 1 (R n−1 is such that Integrate with respect to (y 3 , ..., y n ) (recall that x 2 is now a parameter) to get where other terms vanish due to the compactness assumed for the functions. Hence Note in the integral term, the y ′ s have been replaced by x ′ s; this is legitimate as the difference goes to the higher order terms. Since ρ 1\ 2 (x 1 ) = ρ 1 (x 1 ) at t, so Take expectation on both sides, the left hand side with respect to the joint probability density of (y 1 , x 2 ), while the right hand side with respect to (x 1 , x 2 ). This is the key step that makes the present study fundamentally different from [48] which relies on an approximation to fulfill the derivation. This yields where ρ 2|1 is the conditional density of x 2 on x 1 . Thus We therefore arrive at the following theorem: B. An alternative derivation For the continuous system consider an interval [t, t + ∆t], and a mapping Recall that by definition Eψ(x(t+ ∆t) = ψ(x)ρ(x, t+ ∆t)dx, +o(∆t), for any test function ψ, and Note the expectation on the left hand side is with respect to ρ(t + ∆t), and that on the right is with respect to ρ(t). This way we obtain the Liouville equation. Now let ψ be the functional log(P \ 2 ρ) 1 . When x 2 is frozen, on interval [t, t + ∆t], there is a Liouville equation for ρ \ 2 the joint density of (x 1 , x 3 , ..., x n ). The equation for its marginal density ρ 1\ 2 = R n−2 ρ \ 2 dx 3 ...dx n is, after integration with respect to (x 3 , x 4 , ..., x n ) and with the consideration of the compact support assumption, Divided by ρ 1\ 2 , this yields Discretizing, and noticing the fact ρ 1\ 2 (t) = ρ 1 (t), which is log(P \ 2 ρ) 1 (x 1 ). As conventional, let x(t + ∆t) ≡ y and leave x for x(t) to avoid confusion. We actually need to find log(P \ 2 ρ) 1 (y 1 ) = log ρ 1\ 2 (t + ∆t; y 1 ) = log ρ 1 (t; x(t + ∆t)) − ∆t Taking expectation and multiplying by (-1) on both sides, we obtain On the other hand, from the Liouville equation it is easy to obtain Hence which is the same as (52) in Theorem IV.1.

C. Properties
Theorem IV.2 For a 2D system we have Remark: This recovers Eq. (5), the key equation originally obtained by Liang and Kleeman [45] through heuristic argument. Here we rigorously prove it.
Proof. If F 1 has no dependence on x 2 , so is F 1 ρ \ 2 . Thus where the fact ρ(x 2 |x 1 )dx 2 = 1 and the assumption of compact support have been used.

D. Application-Rössler system
In this subsection, we present an application study of the information flows within the Rössler system: To calculate the information flows, one needs to obtain the joint probability density function ρ(x 1 , x 2 , x 3 ). It is, of course, obtainable through solving the Liouville equation with some initial condition ρ 0 . However, there is another way, namely, ensemble forecast, which is more efficient in terms of computational load. As illustrated in Fig. 5, instead of solving the Liouville equation, we solve the Rössler systems initialized with an ensemble of initial values of x. This ensemble is formed with entries randomly drawn according to the initial pdf ρ 0 . At each time step, we count the bins thus obtained and estimate the pdf. The resulting pdf is the desired ρ. The Rössler system (58)-(59) is solved using the second order Runge-Kutta method with a time step size ∆t = 0.01. A typical computed trajectory is plotted in Fig. 4. The initial conditions are randomly drawn according to a Gaussian distribution N(µ, Σ), the mean vector and covariance matrix being, respectively,  4,28], clearly covers the attractor. We discretize it into 320 × 320 × 320 = 32, 768, 000 bins with ∆x = ∆y = ∆z = 0.1. To ensure one draw for each bin on average, in the beginning we make 32, 768, 000 random draws. As the ensemble scheme is carried forth, ρ and all other statistics can be estimated as a function of time. By Theorem IV.1 the information flow rates are computed accordingly.  For a system with three components (x, y, z), there are in total 6 flow pairs: T x→y , T y→z , T z→x , T x→z , T y→x , T x→z . A first examination of the system tells that dy/dt does not depend on z and dz/dt does not depend on y. By the property of causality (Theorem IV.3), T z→y and T y→z must vanish. The computational results reconfirm this. In Fig. 6, the two are essentially zero. What makes the results surprisingly interesting is that T x→z is also insignificant, while the dependence of dz/dt on x is explicitly specified. Besides, T z→x is also small. In the figure are essentially the flows between x and y: T y→x and T x→y .
The above information flow scenario motivates us to check the system with only x and y two components. This is an amplifying harmonic oscillator dx dt = Ax where A = 0 −1 1 a ,  a linear system allowing the information flow, say, from y to x, to be simply expressed as T y→x = a 12 σ 12 σ 11 (see below in section VII). That is to say, here the covariance matrix Σ = (σ ij ) completely determines the flow. The evolution of Σ follows Initialized by 4 0 0 4 , σ ij can be easily computed; the resulting T y→x and T x→y are shown in Fig. 7. Comparing to those in Fig. 6, the general trend, including the period, seems to be similar, though the geometry of the curves has been modified from harmonic into a seesaw ones. Besides, the T x→y (T y→x ) is always negative (positive) for the harmonic oscillator, while for the Rössler system, they can be both negative and positive. Note the parameter a in A does not explicitly appear in the formula, but it does contributes to the generation of the information flow. One may easily check that, if it is zero, then dΣ dt = 0, and hence the flow rates will stay zero if originally σ 12 = 0.
The above example is just used for the demonstration of application and, in some cases, for the validation of the proven theorems such as the property of causality. The seemingly vanishing T x→z in spite of the dependence of dz/dx on x for sure deserves further investigation but is beyond the scope of this study. Here we just want to mention that this does conform to the observations with complex systems-Emergence does not result from rules only (e.g., [53]- [54]). It has long been found that regular patterns may emerge out of irregular motions with some simple preset rules; a good example is the 2D turbulent flow in natural world (e.g., [55]). Clearly, these simple, rudimentary rules are not enough for explaining the causal efficacy and the bottom-up flow of information that leads to the emergence of the organized structure. As commented by Corning [56], "Rules, or laws, have no causal efficacy; they do not in fact generate anything... the underlying causal agencies must be separately specified." We shall see a more remarkable example in the following subsection.

E. Application-The truncated Burgers-Hopf system revisited
Here we re-examine the Truncated Burgers-Hopf system (TBS hereafter), a chaotic system which seemingly has rather simple information flow structures in the studies of Liang and Kleeman [48]. For a detailed description of the system itself, see [58]. In this section we only examine the following particular case: As we have described before, the system is intrinsically chaotic, with a strange attractor embedded in The information flow within the TBS cannot be found analytically. As before, we use the ensemble prediction technique to estimate the density evolution, and then evaluate the T 's. The setting and procedure are made precisely the same as that in [48] in order to facilitate a comparison. Details are referred to the original paper and will not be presented here. Figure 8 plots the results for the case with a Gaussian initial distribution N(µ, Σ), where with µ = (9,9,9,9), σ 2 = (9,9,9,9). Shown specifically are the time rates of the 12 information flows: The results are qualitatively the same as before in [48]. That is to say, except for T 3→2 , which is distinctly different from zero, all others are either negligible, or oscillatory around zero. But, of course, the present flows are much smaller in magnitude, in comparison to the one obtained before in [48] using the approximate formula.

V. STOCHASTIC MAPPING
A. Derivation Consider the system where Φ : R n → R n is an n-dimensional mapping, B is an n × m matrix, and w an mdimensional normally distributed random vector, representing an m-dimensional standard Wiener process. Without loss of generality, we assume the covariance matrix of w, Σ = I, since the perturbation amplitude can be put into B.
In general, B may depend on x. But this complicates the derivation a lot. For simplicity, in this section we only consider the case when B is a constant n × m matrix. As x(τ ) is taken to x(τ + 1), there exists an operator, written P : L 1 (R n ) → L 1 (R n ), steering the pdf at time step τ , ρ, to the pdf at step τ + 1, Pρ. Since x(τ ) and w are independent, if B is a constant matrix, one may view x(τ + 1) as the sum of two independent random variables, and then conjecture that Pρ be the convolution of P Φ ρ and some joint Gaussian distribution. Here P Φ stands for the F-P operator associated with the mapping Φ. This is indeed true, as is stated in the following theorem.
Proof. We first assume that Φ is invertible to make the approach more transparent to the reader. As always, write x(τ + 1) as y to avoid confusion. Make a transformation: Its Jacobian The inverse mapping is For any S y ∈ R n , S z ∈ R m , Sy×Sz ρ yz (y, z)dydz = where the independence between x and w has been used (hence ρ xw = ρ x ·ρ w ). P(y) = ρ y (y) is thence the marginal density by integrating out z: Since ρ (Φ −1 (y)) · |J −1 | = P Φ ρ(y), the theorem thus follows. When Φ is singular or noninvertible, let its F-P operator be P Φ , then ∀S y ∈ R n , S z ∈ R m , The conclusion follows accordingly.
With the above theorem, the information flow can be easily computed. Note the theorem actually states that where E w signifies the expectation taken with respect to w. So where B 1 ≡ (b 11 , b 12 , ..., b 1m ) is the row vector. Likewise, In the equation, the subscript \2 in the vector(s) and matrix means the second row is removed from the corresponding entities. So Subtract H 1\ 2 (τ + 1) from H 1 (τ + 1), and the information flow T 2→1 follows: Proof. As we proved for the deterministic case, if Φ 1 is independent of x 2 , then (P Φ ρ) 1 a.e. = (P Φ \ 2 ρ) 1 . If further B 1 has no dependence on x 2 , then the above H 1 (τ + 1) and H 1\ 2 (τ + 1) are equal, and hence T 2→1 = 0.

C. Application: A noisy Hénon map
We now reconsider the benchmark systems that have been examined before, but with Gaussian noise added. The baker transformation is not appropriate here, since the added noise perturbation will take x outside the domain [0, 1]. We hence only look at the Hénon map Φ : R 2 → R 2 : with parameters α, β > 0, and consider only the case T 1→2 which has been shown as a benchmark case. Now perturb Φ to make a stochastic mapping: It is easy to see that Φ is invertible; in fact, J = −2α 2 1 β 0 = −β = 0. The inverse is Thus If x 1 is frozen, Φ 2 (x 1 , x 2 ) = βx 1 is a constant. Hence H 2\ 1 (τ + 1) = 0, and Here F H 1 is the functional H 1 applied by a Gaussian filter. One may understand it as H 1 smeared out by a Gaussian filter. It is less than H 1 , so the noise addition makes the system lose some information, compared to T 1→2 = log β + H 1 in the deterministic case.

A. Derivation
Following what we have done in section IV, we derive the information flow within a continuous-time stochastic system by taking the limit of the corresponding discrete stochastic mapping. In doing this, the results in the preceding section are ready for use. But, as noted, in the above derivation we have assumed a constant matrix B, a simplified case allowing for a clear expression of information flow. (This case does have realistic relevance, though.) For a time continuous system, this assumption actually can be completely relaxed. In the following we will see why.
Consider a system where x and F are n-dimensional vector, B is an n × m matrix, and w an m-vector of standard Wiener process. Note that B can be a function of both x and time t. This above equation may also be written as whereẇ a vector of white noise, or, in component form, . . . . . .
In the equations we have used B i to indicate the i-th row vector of B. Now consider (78) on a small interval [t, t + ∆t]. Euler-Bernstein differencing, This motivates the introduction of a transformation Π : As shown in the discrete mapping case, generally this transformation cannot be inverted. But for this special case where ∆t and ∆w are small, the inversion can be done asymptotically.
In fact, Note, here the higher order terms means terms with order higher than ∆t or (∆w) 2 -We will see soon that E(∆w) 2 = ∆t. The above expansion helps invert Π to The following proposition finds the Jacobian associated with the inverse transformation.
Proposition VI.1 Define the double dot of two dyadics A and B as A : B = i,j a ij b ji , then Proof. By definition The key is the evaluation of det ∂x ∂y . By (86), it is, up to o(∆t), the determinant of Recall that, for an n × n matrix A = (a ij ), where P n is the totality of permutations of {1, 2, ..., n}. By this formula, the terms of order ∆t and ∆w are easy to find; they can only come from the diagonal entries. For terms of order (∆w) 2 , there are three sources: (1) the last term at each diagonal entry, together with n − 1 1's; (2) multiplication of two entries at (i, i) and (j, j), i = j, together with n − 2 1's on the diagonal; (3) similar to (2), but with entries at (i, j) and (j, i), i = j.
In (3) the order between i and j switches and it has sgn = −1 for its permutation. Except for (3) involving off-diagonal entries, all others are from the diagonal. So Notice the factor 1 2 in the last term. Because of the symmetry between i and j and they repeat once when summed over i, j = 1, n. Thus and k,s i,j ∂b ∂y i js The last parenthesis holds because the two pairs (i, k) and (j, s) may be switched under the summation without changing the result. Thence which is J −1 by (88).
With J −1 , we can then evaluate the operator P and hence arrive at dH 1 dt and dH 1\ 2 dt . Proposition VI.2 Let BB T ≡ G = (g ij ). The time rate of change of H 1 is Proof. For any subset S y ∈ R n , S z ∈ R m , due to the independence between x and ∆w. Since S y and S z are arbitrarily chosen, the integrand is the very joint pdf ρ yz (y, z). Thus Note here the fact about Wiener process has been used. As a verification, one may obtain from this step which is precisely the Fokker-Planck equation (cf. the appendix). Denote BB T by G. Integrate both sides of the above equation with respect to (y 2 , y 3 , ..., y n ) to obtain (Pρ) 1 (y 1 ) = ρ 1 (y 1 ) − ∆t So as the rest two terms vanish after applying the operator E(·) = R ρ 1 (·)dy 1 . Expanding y 1 around x 1 , and denoting B 1 ≡ (b 11 , b 12 , ..., b 1n ), we have Let ∆t → 0 and we finally arrive at Now consider during the time interval [t, t + ∆t] to freeze x 2 as a parameter, and examine how the marginal entropy of x 1 evolves. In this case we are actually considering a density ρ 1\ 2 , with rho 1\ 2 (t) = ρ 1 (t) under an (n − 1)-dimensional transformation: R n−1 → R n−1 , x 1\ 2 → y 1\ 2 : .. y n = x n (t + ∆t) = x n (t) + F n ∆t + B n ∆w.
With this system we have the following proposition.
Proof. Following the same procedure as above, we arrive at an equation for log(P \ 2 ) 1 (y 1 ) similar to (95): Note at time t, ρ 1\ 2 = ρ 1 , and in the last two terms y can be replaced by x with error going to higher order terms. Thus Take the limit dH 1\ 2 dt = lim ∆t→0 H 1\ 2 (t + ∆t) − H 1 (t) ∆t and we arrive at the conclusion.

B. Properties
Theorem VI.2 For a 2D system in the absence of stochasticity.
Proof. In this case g 11 = 0, ρ \ 2 = ρ 1 , so Theorem VI.3 If g 11 = m k=1 b 1k b 1k is independent of x 2 , the resulting T 2→1 has a form same as its deterministic counterpart.
Proof. If g 11 is independent of x 2 , so is Hence the integration can be simplified: Theorem VI.4 (Property of causality) If both F 1 and g 11 are independent of x 2 , then T 2→1 = 0.
Proof. As proved above, when g 11 has no dependence on x 2 , the last term of T 2→1 becomes zero. If, moreover, F 1 does not depend on x 2 , then ∂F 1 ρ \ 2 ∂x 1 does not, either. So the integration with respect to x 2 can be taken inside directly to ρ 12 /ρ 1 = ρ 2|1 (x 2 |x 1 ):

C. Application: A stochastic gradient system
We are about to study the information flow within a system which has a drift function in the gradient form. We particularly want to understand how stochastic perturbation may exert influence on the flow. The gradient systems are chosen because their corresponding Fokker-Planck equation admit explicit equilibrium solutions, i.e., solutions of the Boltzmann type. To see this, let where V = V (x) is the potential function. For simplicity, suppose that the stochastic perturbation amplitude B = bI where I is the identity matrix and b = const. Hence G = BB T = gI, and g = b 2 is a constant. It is trivial to verify that where Z is the normalizer (or partition function as is called in statistical physics), solves ∇ · (ρF) = 1 2 g∇ 2 ρ, the equilibrium density equation for the system As an example, consider the potential function This system, though simple, results in a compactly supported density function, while allowing for asymmetric nonlinear interactions among x 1 , x 2 , and x 3 . The resulting vector field is Obviously, T 3→1 = T 1→3 = 0 by the theorem on causality property. The general flow from x j to x i is The computation seems to be easy, but by no means trivial. The difficulty comes from the evaluation of the conditional density ρ j|i (x j |x i ). Theoretically this is not a problem, but in realizing the computation we have to consider the problem on a limited domain, which may not effectively cover the support of the density function. To test how the stochastic perturbation may affect the information flow. tune b to see the response. The tuning range is rather limited, though, with the present computational domain. Shown in Fig. 9 are the results. As expected, T 3→1 and T 1→3 are identically zero. For others, the flow rates generally increase with b. That is to say, they tend to increase the uncertainty of their corresponding target components. This makes sense, since g functions like temperature in thermodynamics, and increase in T surely will lead to increase in uncertainty. If examining more carefully, one finds that the increase is actually not symmetric. Those going to x 2 (T 3→2 and T 1→2 ) are faster than those leaving x 2 (T 2→1 and T 2→3 ), reflecting the property of asymmetry of information flow. Since ρ can be accurately obtained, this example can be utilized to validate our numerical computations for more general cases.

VII. LINEAR STOCHASTIC SYSTEMS
As always, it would be of interest to look at the particular case, namely, the case with linear systems: with A and B being constant matrices. If originally x is normally distributed, then it is normal/Gaussian forever. Let its mean vector be µ and its covariance matrix be Σ. Then In component form µ = (µ 1 , ..., µ n ) T , Σ = (σ ij ) n×n , and BB T has been denoted by G in the above. The distribution is, therefore, We need to find ρ 1 , ρ 12 , ρ \ 2 , and the following facts will help. Fact 1: ρ \ 2 is a multivariate Gaussian N(µ \ 2 , Σ \ 2 ) where µ \ 2 = (µ 1 , µ 3 , µ 4 , ..., µ n ) n , and Σ \ 2 is the covariance matrix of (x 1 , x 3 , x 4 , ..., x n ) n .
Fact 2: The conditional probability density function ρ 2|1 is 2∆ 12 in other words, In the above equations, we have used, and will be using, ∆ ij to shorten det σ ii σ ij σ ij σ jj .
The so many terms are canceled out, and the result turn out to be precisely the same as that for the 2D case we have derived before ever since Liang and Kleeman (2005)! The above remarkably concise formula actually holds for systems of arbitrary dimensionality. This makes the following theorem: Theorem VII.1 If an n-dimensional (n ≥ 2) vector of random variables (x 1 , ..., x n ) T evolves subject to the linear system where A = (a ij ) and B are constant matrices, and if its covariance matrix is (σ ij ), then the information flow from x j to x i is for any i, j = 1, ..., n, i = j.
Proof. It suffices to prove the case (i, j) = (1, 2); if not, we may always reorder the components to make them so. We prove by induction. The 3D case has just been shown above. Now suppose (111) holds for n-dimensional systems. Consider an n+1-dimensional system a nj x j + a n,n+1 x n+1 To distinguish, we now use ρ n to denote the joint density for the n-dimensional system. The information flow from x 2 to x 1 is Note the first term results from integration with respect x n+1 , since all the variables except ρ \ 2 are independent of x n+1 . This is precisely the information flow from x 2 to x 1 for an n-dimensional system; by our assumption it is a 12 σ 12 /σ 11 . For the second term, note that all variables, except ρ 2|1 , are independent of x 2 , so we may take integral with respect to x 2 directly inside with ρ 2|1 . But R ρ 2|1 dx 2 = 1, so the second term results in the integral of ∂ ∂x 1 (a 1,n+1 x n+1 ρ \ 2 ) which vanishes by the compactness of ρ. Therefore (111) holds for n+1-dimensional systems. By induction, it holds for systems of arbitrary dimensionality.
The evolution of the covariance matrix C is governed by flow are subsequently obtained and plotted in Fig. 11. Among them, T 3→1 = 0, just as expected by the property of causality. T 3→2 and T 2→3 oscillate around a value near zero, and T 2→1 oscillates around -0.9. The remaining transfers, T 1→2 and T 1→3 , albeit still oscillatory, approximately approach two constant values. The former approaches 0.16, while the latter approaches 1.

VIII. SUMMARY
Information flow, or information transfer as it may appear in the literature, is a fundamental notion in general physics which has wide applications in different disciplines. In this study we have shown that, within the framework of dynamical systems, it can be rigorously derived from first principles. That is to say, it is a notion ab initio, quite different from the existing axiomatic postulates or empirical proposals. In this light we have studied the information flow for both time-discrete and time-continuous differentiable vector fields in both deterministic and stochastic settings. In a nutshell, the results can be summarized as follows.
Consider an n-dimensional state variable x = (x 1 , x 2 , ...x n ), the corresponding probability density function (pdf) being ρ(x 1 , x 2 , ...x n ), and the marginal pdf of x i being ρ i . For a deterministic mapping Φ : R n → R n , x(τ ) → x(τ + 1) = (Φ 1 (x), Φ 2 (x), ...Φ n (x)), the rate of information flowing from x 2 to x 1 proves to be T 2→1 = E log(P \ 2 ρ) 1 (Φ 1 (x)) − E log(Pρ) 1 where E is the mathematical expectation with respect to x, P the Frobenius-Perron operator of Φ, and P \ 2 the same operator of Φ but with x 2 frozen as a parameter (so (P \ 2 ) 1 (x 1 ) has dependence on x 2 ). The units are in nats per unit time; same below. If the system is continuous in time, i.e., dx dt = F(x, t), where ρ \ 2 = R ρ(x 1 , x 2 , ..., x n )dx 2 , and ρ 2|1 is the conditional pdf of x 2 on x 1 . When stochasticity comes in, in the discrete mapping case: x(τ + 1) = Φ(x(τ ))) + B(x)w, where Φ : R n → R n is an n-dimensional mapping, B an n × m constant matrix, and w an m-dimensional standard Wiener process, then with B 1 = (b 11 , b 12 , ..., b 1m ) a row vector of the matrix B. Here we use E x and E w to indicate that the expectation is taken with respect to x and w, respectively. If what we consider is a continuous-time stochastic system, i.e., a system as dx = F(x, t)dt + Bdw, or alternatively written as whereẇ is the white noise, then the result can be explicitly evaluated: where g 11 = m j=1 b 1j b 1j . Note the first term is just from the deterministic vector field, while the second the contribution from the noise. It has been proved that, if b 1j has no dependence on x 2 , then the stochastic contribution vanishes, making the information flow same in form as that from its deterministic counterpart. We have particularly examined the case F = Ax, i.e., the case when the system is linear and autonomous, dx = Axdt + Bdw with A = (a ij ) n×n and B = (b ij ) n×m being constant matrices, then the information flow from x j to x i is remarkably simple: for any (i, j), 1 ≤ i, j ≤ n, i = j. This result is precisely the same in form as originally we obtained for 2D deterministic systems based on intuitive arguments in [45]. Historically it has been a long-time endeavor to relate information flow to causality. We want specifically to have that, if T j→i = 0, then x j causes x i , otherwise x j is not causal. With the existing empirical/half-empirical measures for information flow, such as the widely used transfer entropy, the endeavor has been fruitful for some problems but unsuccessful for others (e.g., [44]), and the inconsistency has even led to doubt about the association between information flow and causality (e.g., [38]). In this study, the implied causality, the touchstone one-way causality in particular, is a proved fact for dynamical systems, as stated in various theorems. More specifically, when the evolution of x i does not depend on x j , then T j→i = 0. This is particularly clear in the above linear case, the dependence of x i on x j is from the entry a ij of A, so when it is zero, then x j is not causal to x i . This result also quantitatively, and unambiguously, tells us that, causation implies correlation, but not vice versa, resolving the long-standing debate over correlation versus causation.
The above results have been put to applications with a variety of benchmark systems. Particularly we have re-examined the baker transformation, Hénon map, and truncated Burgers-Hopf system. The results are qualitatively similar to what we have obtained before using an approximate formalism, but with magnitudes significantly smaller. Also shown are the information flows within a Kaplan-Yorke map, a noisy Hénon map, a Rössler system, and a stochastic gradient flow. We look forward to more applications to real world problems in the near future.