Unraveling the cause-effect relation between time series

X. San Liang
School of Marine Sciences, Nanjing University of Information Science and Technology (Nanjing Institute of Meteorology), Nanjing 210044, and China Institute for Advanced Study, Central University of Finance and Economics, Beijing 100081, China

Abstract

Given two time series, can one faithfully tell, in a rigorous and quantitative way, the cause and effect between them? Based on a recently rigorized physical notion, namely, information flow, we solve an inverse problem and give this important and challenging question, which is of interest in a wide variety of disciplines, a positive answer. Here causality is measured by the time rate of information flowing from one series to the other. The resulting formula is tight in form, involving only commonly used statistics, namely, sample covariances; an immediate corollary is that causation implies correlation, but correlation does not imply causation. The formula has been validated with touchstone linear and nonlinear series, purposely generated with one-way causality that evades the traditional approaches. It has also been applied successfully to the investigation of real world problems; an example presented here is the cause-effect relation between the two climate modes, El Niño and the Indian Ocean Dipole (IOD), which have been linked to hazards in far-flung regions of the globe. In general, the two modes are mutually causal, but the causality is asymmetric: El Niño tends to stabilize IOD, while IOD functions to make El Niño more uncertain. To El Niño, the information flowing from IOD manifests itself as a propagation of uncertainty from the Indian Ocean.


I. INTRODUCTION
In a philosophical discussion, A.P. Dempster says, "Causal analyses are guides to higher understanding." [1] Indeed, identification of the causality between dynamical events is a subject of enormous interest in a wide variety of disciplines. Examples are seen everywhere in neuroscience [2], biology [3], network dynamics [4], financial economics [5], earth sciences [6], physics [7], to name a few. Toward the end of this study, we will see a climate science example that has been linked to the severe weather and natural disasters on a global scale.
Given that, more often than not, what is known about the two events in question is in the form of time series, causality analysis between time series is of particular importance. This is, however, a very difficult problem; in fact, it is said to be "one of the biggest challenges" in data science ([8], p. 274). Presently a common practice, particularly in climate science, is to compute a time-lagged correlation. However, it is well known that correlation does not carry the needed directedness or asymmetry and hence does not necessarily imply causality. As stated by Barnard ([9], p. 387): "That correlation is not causation is perhaps the first thing that must be said." Besides, one may argue that, for recurrent processes, there is actually no way to distinguish a lag from an advance, unless one has enough a priori knowledge of the processes of concern. This is particularly a problem when the processes are nonsequential, with circular cause and consequence that easily leads one into the chicken-and-egg dilemma.
Another practice is performing the Granger causality test, a statistical test of the usefulness of one time series in forecasting another. This kind of test, as the name implies, provides only a yes/no judgment, lacking the quantitative information that may be required in many circumstances. In an attempt to remedy this deficiency, attention has in recent years turned to an empirical information-theoretic measure, namely, transfer entropy [10]. Unfortunately, this kind of measure is notoriously difficult to evaluate, requiring long time series and posing challenges to computation [11]; moreover, recent studies have shown that it is biased, as its value depends on the autodependency coefficient in a dynamical system [12]. Perhaps this is why, in many applied sciences, climate science in particular, time-delayed correlation analysis is still the major tool.
In this study, we will show that causality analysis can be rigorously formulated and quantitatively realized, and that the resulting formula turns out to be remarkably concise and very easy to compute. Considering the large body of related research in climate science, among other sciences, this study is expected to provide timely help.
We use information flow, or information transfer as it may be referred to in the literature, to measure the causation between dynamical events. It has long been recognized as an appropriate causality measure, as the amount of information exchanged between two events tells not only the magnitude but also the direction of the cause-effect relation (cf. [2]-[6]). Due to its importance, the past decades have seen a surge of interest in formulating this notion. Formalisms have been established empirically or half-empirically, among which is the aforementioned transfer entropy. Realizing that information flow is a real physical notion, and that a real physical notion should be rigorously, rather than empirically, built on a solid foundation so as to be universally applicable to problems in different disciplines, Liang and Kleeman [13] took the initiative, followed by a series of subsequent studies, to establish a rigorous formalism for the information flow within given dynamical systems, both deterministic and stochastic (see [14] for a review). The problem is that this formalism relies on a given dynamical system, and hence on all the information in the sample space (one can produce arbitrarily many realizations from the given system), or on a somewhat partially prescribed causality (though dynamics or laws alone do not have causal efficacy; see, e.g., [15]). What we have here, however, are just two time series, i.e., one single realization of a two-dimensional (2D) system. If we refer to the former as a "forward problem", then what we need for the present study is an "inverse problem", i.e., whether and how the same notion can be realized with no dynamics given but only two time series; that way, the quantitative causal relation between the series will be obtained.
This could be challenging, since only one realization is available, while the resulting formula must verify, to an acceptable extent, the touchstone properties established based on the knowledge of essentially infinitely many realizations. This study shows that such a formula, which is expected to have far-reaching implications, indeed can be obtained.
In the following we first briefly introduce the "forward problem". The major derivation is presented in section III, where a concise formula for causality analysis is obtained. This formula is validated with touchstone time series purposely generated with only one-way causality (section IV); it is then applied to the study of the causal relation between El Niño and the Indian Ocean Dipole (IOD), two major modes of climate variation, with important results which have been evidenced for years but missed in previous observational data analyses with existing tools (section V). This study is discussed in section VI and summarized in section VII.

II. THE "FORWARD PROBLEM"
Before getting into the "inverse problem", let us give a brief review of the "forward problem", i.e., the recently established formalism of information flow with the dynamics given [16]. Consider a d-dimensional stochastic system

$$ d\mathbf{X} = \mathbf{F}(\mathbf{X}, t; \boldsymbol{\theta})\, dt + \mathbf{B}(\mathbf{X}, t; \boldsymbol{\theta})\, d\mathbf{W}, \qquad (1) $$

where F is the vector of drift coefficients, θ a vector of parameters, B = (b_ij) a matrix of diffusion coefficients (or volatility in finance), and W a vector of standard Wiener processes (Ẇ is the so-called white noise). As usual, the vector field F is assumed to be differentiable. The rate of information flowing from one component, say X2, to another, say X1, is the change rate of the marginal entropy of X1, minus the same change rate but with the effect from X2 instantaneously excluded from the system. These rates of information flow/transfer have been derived analytically in closed form for any given dynamical system. The derivation is lengthy and rather technically involved, requiring, for example, the evaluation of a Frobenius-Perron operator [17] and the establishment of a kind of Fokker-Planck equation for the X2-excluded system [16]. But the results appear to be tractable. Particularly, for a system of dimensionality 2 (2D), which we will be considering in this study, we have the following theorem.
Theorem II.1 For the system (1), if d = 2, then the rate of information flow from X2 to X1 is

$$ T_{2\to1} = -E\!\left[\frac{1}{\rho_1}\,\frac{\partial (F_1 \rho_1)}{\partial x_1}\right] + \frac{1}{2}\,E\!\left[\frac{1}{\rho_1}\,\frac{\partial^2 \big((b_{11}^2 + b_{12}^2)\,\rho_1\big)}{\partial x_1^2}\right], \qquad (2) $$

where ρ1 is the marginal probability density of X1, and E the mathematical expectation.
For the proofs of this and the other theorems in this section, the reader is referred to [16]. T2→1 may be either zero or nonzero. A nonzero T2→1 means that X2 is causal to X1: a positive value means that X2 makes X1 more uncertain, and a negative value means the opposite. This measure of information flow possesses an important property, namely, the property of causality, or of flow/transfer asymmetry: a one-way information flow implies nothing about the flow in the opposite direction. In particular, we have the following proved fact:

Theorem II.2 For the system (1), if the evolution of X1 does not depend on X2, then the information flow from X2 to X1 vanishes; specifically, if the vector field component F1 and the diffusion term b11² + b12² are independent of x2, then T2→1 = 0.

This property, which one may expect on physical grounds, is very important and will form the touchstone for causality studies.

III. CAUSALITY ANALYSIS
The above result on information flow is rigorous. It is, however, not yet a causality analysis, as the causal relation is somewhat partially prescribed in the given dynamical system, though dynamical rules alone do not mean causality [15]. Since from a dynamical system one can generate as many realizations as one likes, this formalism is essentially based on infinitely many realizations. In this study, however, the only knowns are two time series, i.e., one realization, with which we need to find the corresponding information flow. Generally this could be very challenging, considering the above touchstone property of one-way causality, among others, that must be verified.
Since the dynamics is unknown, we first need to choose a model. As always, a linear model is the natural choice, at least at the initial stage of development. In this case, (2) can be greatly simplified. In Eq. (1), assume F = f + AX, with f = (f1, f2)^T, A = (a_ij), and B = diag(b1, b2) being constant vector/matrices. If initially (X1, X2) has a bivariate Gaussian distribution, then, as the system is linear, the density remains Gaussian for all time. Denote the mean by μ = (μ1, μ2)^T, and the covariance matrix by Σ = (σ_ij), where σ_ii = σ_i². They evolve following the differential equations

$$ \frac{d\boldsymbol\mu}{dt} = \mathbf{f} + \mathbf{A}\boldsymbol\mu, \qquad \frac{d\boldsymbol\Sigma}{dt} = \mathbf{A}\boldsymbol\Sigma + \boldsymbol\Sigma\mathbf{A}^T + \mathbf{B}\mathbf{B}^T. \qquad (3) $$

The information flows between X1 and X2 are now easy to evaluate. We only need to consider T2→1. Denote by (I) and (II), respectively, the two terms on the r.h.s. of (2), and write b11² + b12² as β for short. Substitution of F1 = f1 + a11 X1 + a12 X2 into (2) for F1 and ρ1 yields

$$ (\mathrm{I}) = a_{12}\,\frac{\sigma_{12}}{\sigma_{11}}, \qquad (\mathrm{II}) = 0, $$

where σ_ij is obtained by solving (3). The former is the theorem in [13] restated in the linear limit. The vanishing of (II) is actually a particular case of the property of causality proved in [16]: when β is independent of x2, (II) = 0. Causality is reflected in the asymmetry of the flow; in (I), it goes with a12/σ11. To understand how asymmetry may arise in (II), let us assume that β is a linear function of x2: β = β0 + κx2. As shown above, β0 does not contribute to T2→1; we need only look at the contribution from κx2. Usually this cannot be found explicitly; one has to solve a density evolution equation and then evaluate (2). For the sake of interpretation, let us assume that the density is also Gaussian (this is generally not true, of course). With this assumption, the resulting contribution can be evaluated, and it is clearly asymmetric between X1 and X2.

Our strategy to approach the causality problem is first to estimate the linear model, i.e., to estimate the parameters (f, A, B), and then compute the information flow. Without loss of much generality, assume the time series are equally spaced in time.
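The linear result above can be checked numerically: in the stationary limit dΣ/dt = 0, Eq. (3) reduces to the Lyapunov equation AΣ + ΣAᵀ + BBᵀ = 0, a 3×3 linear solve in 2D, after which T2→1 = a12 σ12/σ11 follows at once. A minimal sketch, with a function name and coefficients of our own choosing for illustration:

```python
import numpy as np

def stationary_flows(A, Q):
    """Stationary information flows for the 2D linear SDE dX = (f + A X) dt + B dW,
    with Q = B B^T constant.  Solves A Sigma + Sigma A^T + Q = 0 for the stationary
    covariance, then applies T_{2->1} = a12 sigma12/sigma11 and its mirror image."""
    # Unknowns (sigma11, sigma12, sigma22); three equations from the
    # (1,1), (1,2), (2,2) entries of A Sigma + Sigma A^T = -Q.
    M = np.array([
        [2 * A[0, 0], 2 * A[0, 1], 0.0],
        [A[1, 0], A[0, 0] + A[1, 1], A[0, 1]],
        [0.0, 2 * A[1, 0], 2 * A[1, 1]],
    ])
    rhs = -np.array([Q[0, 0], Q[0, 1], Q[1, 1]])
    s11, s12, s22 = np.linalg.solve(M, rhs)
    T21 = A[0, 1] * s12 / s11   # flow X2 -> X1
    T12 = A[1, 0] * s12 / s22   # flow X1 -> X2
    return T21, T12

# One-way coupling: X2 enters the X1 equation, but not vice versa.
A = np.array([[-1.0, 0.5], [0.0, -1.0]])
Q = np.diag([0.01, 0.01])            # i.e., B = diag(0.1, 0.1)
T21, T12 = stationary_flows(A, Q)    # T21 = 1/9 ≈ 0.1111, T12 = 0
```

With a21 = 0 the flow T1→2 vanishes identically, as Theorem II.2 demands, while T2→1 is nonzero.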
We use maximum likelihood estimation (MLE) for this purpose. As the population covariances σ_ij (i, j = 1, 2) can be fairly accurately estimated by the corresponding sample covariances, we focus on the estimation of a12. Consider an interval [nΔt, (n + 1)Δt], Δt being the time stepsize. If the transition probability density function ρ(X_{n+1} | X_n; θ) can be obtained, then, since {X_n} is a Markov process, the log likelihood is

$$ \ell_N(\boldsymbol\theta) = \sum_{n=1}^{N-1} \log \rho(\mathbf{X}_{n+1} \mid \mathbf{X}_n; \boldsymbol\theta) + \log \rho(\mathbf{X}_1), $$

N being the sample size. When N is large, ρ(X1) can be dropped without causing much error. In this linear case, an analytical solution of ρ can be found; the result, however, is complicated. For the sake of practical applicability, we turn to the discretized version of the SDE. Using the Euler-Bernstein scheme,

$$ \mathbf{X}_{n+1} = \mathbf{X}_n + \mathbf{F}(\mathbf{x}_n)\,\Delta t + \mathbf{B}\,\Delta\mathbf{W}_n, $$

where F = f + Ax. For small Δt, ΔW_n is a normal N(0, Δt I), and hence X_{n+1} | X_n = x_n is a normal N(x_n + FΔt, BBᵀΔt). The MLE of θ, θ̂, is therefore the θ that maximizes ℓ_N(θ). For convenience, denote, for i = 1, 2,

$$ \dot X_{i,n} = \frac{X_{i,n+1} - X_{i,n}}{\Delta t}, \qquad (7) $$

and the residuals

$$ R_{i,n} = \dot X_{i,n} - \left( f_i + a_{i1} X_{1,n} + a_{i2} X_{2,n} \right). $$

The estimators of f, A, and B can be found by maximizing ℓ_N. It is interesting to note that the equations governing the estimators (f̂1, â11, â12), (f̂2, â21, â22), and (b̂1, b̂2) are actually decoupled, which makes the estimation much easier. Besides, maximizing ℓ_N over (f_i, a_i1, a_i2) is equivalent to minimizing Q_{N,i} := Σ_n R²_{i,n} over the same group of parameters, for i = 1, 2. So (f̂1, â11, â12) are precisely the least-squares estimators, which satisfy the normal equations; written in a more familiar and succinct form,

$$ \hat a_{11} C_{11} + \hat a_{12} C_{12} = C_{1,d1}, \qquad \hat a_{11} C_{12} + \hat a_{12} C_{22} = C_{2,d1}, $$

where, as usual, the overline signifies the sample mean, C_ij denotes the sample covariance between X_i and X_j, and C_{i,dj} that between X_i and Ẋ_j. Hence

$$ \hat a_{12} = \frac{C_{11}\, C_{2,d1} - C_{12}\, C_{1,d1}}{C_{11} C_{22} - C_{12}^2}. $$
With these results, setting ∂ℓ_N/∂b1 = 0 gives

$$ \hat b_1^2 = \frac{\Delta t}{N}\, Q_{N,1}, $$

with Q_{N,1} evaluated at the least-squares estimators. On the other hand, as the population covariance matrix can be estimated by the sample covariance matrix, we estimate σ12/σ11 by C12/C11. Substituting these results into the linear version of (2), T2→1 = a12 σ12/σ11, we finally obtain the rate of information flowing from X2 to X1:

$$ T_{2\to1} = \frac{C_{11}\, C_{12}\, C_{2,d1} - C_{12}^2\, C_{1,d1}}{C_{11}^2\, C_{22} - C_{11}\, C_{12}^2}, \qquad (10) $$

where C_ij is the sample covariance between X_i and X_j, and C_{i,dj} the sample covariance between X_i and Ẋ_j. Strictly speaking, T2→1 here should bear a caret, since it is an estimator of the true information flow; we will abuse the notation a little for the sake of terseness. The flow in the opposite direction, T1→2, can be written out directly by switching the indices 1 and 2. The units are nats per unit time.
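Since (10) involves nothing beyond sample covariances, it can be implemented in a few lines. A minimal sketch, assuming equally spaced series and the Euler-forward difference (7); the function name is our own:

```python
import numpy as np

def liang_flow(x1, x2, dt):
    """Estimate T_{2->1}, the rate of information flowing from x2 to x1
    (nats per unit time), via the sample-covariance formula (10)."""
    x1, x2 = np.asarray(x1, float), np.asarray(x2, float)
    dx1 = (x1[1:] - x1[:-1]) / dt            # Euler-forward dX1/dt, Eq. (7)
    u, v = x1[:-1], x2[:-1]
    C = np.cov(np.vstack([u, v, dx1]))       # covariances of (X1, X2, dX1/dt)
    C11, C12, C22 = C[0, 0], C[0, 1], C[1, 1]
    C1d1, C2d1 = C[0, 2], C[1, 2]
    return (C11 * C12 * C2d1 - C12**2 * C1d1) / (C11**2 * C22 - C11 * C12**2)
```

The opposite flow follows by swapping the arguments: `liang_flow(x2, x1, dt)` gives T1→2.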
Given a significance level, we may estimate a confidence interval for (10). This can always be achieved with the bootstrap, but here things can be simplified. When N is large, T2→1 is approximately normally distributed around its true value with a variance (C12/C11)² σ̂²_{a12}, thanks to the MLE property. Here σ̂²_{a12} is determined as follows (e.g., [18]). Denote θ = (f1, a11, a12, b1), and compute

$$ (N\mathcal{I})_{ij} = -\sum_{n=1}^{N-1} \frac{\partial^2 \log \rho(\mathbf{X}_{n+1} \mid \mathbf{X}_n; \boldsymbol\theta)}{\partial\theta_i\, \partial\theta_j} \bigg|_{\boldsymbol\theta = \hat{\boldsymbol\theta}}, $$

I being the Fisher information matrix. The inverse (NI)⁻¹ is the covariance matrix of θ̂, within which lies σ̂²_{a12}. Given a significance level, the confidence interval can then be found accordingly.
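The Fisher-information route requires the Hessian of the log-likelihood; the bootstrap mentioned above is an assumption-lighter alternative. Below is a minimal moving-block bootstrap sketch (function names, block length, and replicate count are our illustrative choices, not prescriptions from the text); the finite-difference series is formed once, so no difference is ever taken across a block boundary:

```python
import numpy as np

def _flow(u, v, dx1):
    # Eq. (10) from the sample covariances of (X1, X2, dX1/dt)
    C = np.cov(np.vstack([u, v, dx1]))
    C11, C12, C22 = C[0, 0], C[0, 1], C[1, 1]
    C1d1, C2d1 = C[0, 2], C[1, 2]
    return (C11 * C12 * C2d1 - C12**2 * C1d1) / (C11**2 * C22 - C11 * C12**2)

def bootstrap_ci(x1, x2, dt, n_boot=300, block=200, level=0.95, seed=0):
    """Percentile confidence interval for T_{2->1} via a moving-block
    bootstrap, which preserves short-range serial dependence."""
    dx1 = (x1[1:] - x1[:-1]) / dt      # difference once, on the original series
    u, v = x1[:-1], x2[:-1]
    n = len(u)
    rng = np.random.default_rng(seed)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n - block, size=n // block + 1)
        idx = np.concatenate([np.arange(s, s + block) for s in starts])[:n]
        reps[b] = _flow(u[idx], v[idx], dx1[idx])
    a = 100 * (1 - level) / 2
    lo, hi = np.percentile(reps, [a, 100 - a])
    return lo, hi
```

Resampling aligned triples (X1, X2, Ẋ1) rather than the raw path is what keeps the estimator well defined on each replicate.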
As historically there has been a great deal of debate over correlation versus causation, it is of interest to write (10) in terms of correlation and correlation-like quantities. Let r = C12/√(C11 C22) be the sample correlation coefficient between X1 and X2, and let r_{i,dj} (i, j = 1, 2; i ≠ j) be the "correlation" between X_i and Ẋ_j, but normalized with the variances of X_i and X_j. We then have

$$ T_{2\to1} = \frac{r}{1 - r^2}\left( r_{2,d1} - r\, r_{1,d1} \right). \qquad (11) $$

Obviously, two uncorrelated events (r = 0) must be non-causal (T2→1 = 0); in other words, causation implies correlation. The converse, however, does not hold: correlation does not imply causation.

IV. VALIDATION

A. A linear problem
To validate (10), consider the 2D stochastic differential equation (SDE) set

$$ \frac{dX_1}{dt} = -X_1 + 0.5\,X_2 + 0.1\,\dot W_1, \qquad \frac{dX_2}{dt} = -X_2 + 0.1\,\dot W_2. \qquad (12) $$

Clearly, X2 drives X1, but not vice versa. This kind of problem is very typical in causality analysis: one component causes another, but the latter has no feedback on the former. We now generate a sample path (X1,n, X2,n) and try to recover this touchstone one-way causality from the single realization using Eq. (10). Using a time step Δt = 0.001, we generate 100000 steps, corresponding to a time span t = 0-100. For later use, we purposely initialize the series far from equilibrium to allow for a period of spin-up, as shown in Fig. 1. The series reach a stationary state after approximately t = 4. As here the dynamics is given, the true information flow between X1 and X2 can be evaluated by solving (3). The computed T2→1 and T1→2 are plotted in Fig. 1 (dashed lines). As expected, T1→2 ≡ 0, since the evolution of X2 does not depend on X1; that is to say, to X2, X1 is not causal. On the other hand, X2 drives X1 and hence is causal to X1; correspondingly, T2→1 ≠ 0. In this example, T2→1 approaches a constant 0.1111 no matter how the covariances are initialized, though the result may differ during the spin-up period (t < 4). Now let us see whether the concise formula (10) can recover this one-way causality. The data we have are just one realization, i.e., the above sample path. Our objective is, of course, not to estimate the whole evolution course of T2→1 and T1→2 as shown in Fig. 1; what we expect is to see whether the stationary values (T2→1 = 0.1111, T1→2 = 0) can be estimated with acceptable confidence. For this purpose, we form different series from the sample path, with different resolutions, time intervals, and lengths, and then test the performance of the estimation. The testing results are listed in Table I. Clearly, so long as the time span of the series is long enough, the estimation can be made rather accurate.
With stationary data (t = 5-100), even when one samples the path every 100 time steps (corresponding to a time resolution of 0.1), which yields a time series of only 1000 data points, the result is still acceptable to some extent. To see how the formula may perform when the time span is short, we make the series span only 10 time units: t = 10-20. The result, T2→1 = 0.60 and T1→2 = 0.17, is unfortunately not as good as one would like, even though the data points amount to 10000. Nonetheless, if we estimate the error bounds at the 95% confidence level, which gives 0.54 and 0.47, respectively, the result is still informative: it tells us that T2→1 is significantly greater than zero at the 95% level, while T1→2 is not significantly different from zero, consistent with the accurate result.
We have also tested the formula (10), or (11), for systems with different drift and diffusion coefficients, particularly for systems where the noises dominate. Again, so long as the series have a span long enough, the causal relation can be faithfully recovered.

It would be of interest to check how the popular approaches work with the present series. Since time-delayed correlation analysis is widely used (especially in climate science), though it has long been recognized as inappropriate for causality inference (e.g., [8][9][19]), we compute the time-delayed correlation r between the above time series X1 and X2 (the stationary segment t = 5-100), and plot it as a function of the time lag Δt (positive if X1 lags X2). We know X2 causes X1, while X1 does not cause X2. So, by the logic of correlation advocates, maximal correlation should occur between X1 and some past X2, i.e., X2 should lead X1. For this specific example, unfortunately, this is not true at all. In Fig. 2, the computed r(Δt) is very small for Δt > 0, and takes much larger (though still small) values at negative Δt. That is to say, a correlation advocate would have concluded that X1 causes X2, not the opposite!
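For reference, the time-lagged correlation analysis just described can be sketched as follows; the sign convention matches the text (a positive lag means X1 lags X2, i.e., X2 leads):

```python
import numpy as np

def lagged_corr(x1, x2, max_lag):
    """r(lag) = corr(X1_n, X2_{n-lag}); positive lag means X1 lags X2.
    Returns a dict {lag: correlation}."""
    n = len(x1)
    out = {}
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = x1[lag:], x2[:n - lag]
        else:
            a, b = x1[:n + lag], x2[-lag:]
        out[lag] = np.corrcoef(a, b)[0, 1]
    return out
```

When X1 is literally a delayed copy of X2, the peak of r(lag) sits at the delay; the point of the example in the text is that for dynamically coupled series the peak lag can point the wrong way.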

B. A nonlinear problem
We now look at how the formula (10), which stems from a linear assumption, may work for nonlinear systems. The system we will be considering is a one-way coupled map system (13) from [20], built on the chaotic logistic map f, with g_α a functional of f. In [20], Hahs and Pethel examine the "anticipatory system" with g_α(x) = (1 − α)f(x) + αf²(x), where f² means that the logistic map f is applied twice. This is a very special and chaotic case which needs careful discussion; we defer it to section VI. Here we consider a normal case where f is applied only once. Obviously, when α ≠ 0, X1 is the drive and X2 the response; in other words, X1 causes X2 but not vice versa. Pick the typical case ξ = 0.3. An example series pair is shown in the upper panel of Fig. 3 for α = 1.0 (initialized with X1(1) = 0.4, X2(1) = 0.1). We then use Eq. (10) to compute T2→1 and T1→2 and draw them as functions of α (see the lower panel of the figure). Note that what we plot are |T2→1| and |T1→2|, as it is the absolute values that tell the causality. Besides, the flow rates need not be continuous functions of α, and interpolation between a positive point and a negative point may lead to a spurious zero information flow, i.e., nil causality. (Here it is not a problem, though: all the values are negative.) The figure shows that, compared with T1→2, T2→1 is generally negligible when α ≥ 0.3. When α is small, the two may be comparable, but still T1→2 is much larger. Besides, T1→2 increases with α, consistent with the observation that, in the equation for X2,n+1, α controls the effect from X1. Particularly, when α = 0, both T1→2 and T2→1 vanish, just as one would expect based on the independence between X1 and X2. It is therefore safe to say that the one-way causality within the unidirectionally coupled map, albeit highly nonlinear, has been fairly well recovered with (10).
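This experiment can be sketched numerically, with the caveat that the exact coupled map (13) and its g_α are not reproduced here: as an assumed stand-in we use the simplified one-way coupling X2,n+1 = (1 − c) f(X2,n) + c f(X1,n), with c playing the role of the coupling strength, and apply (10) with Δt = 1:

```python
import numpy as np

def logistic(x):
    return 4.0 * x * (1.0 - x)   # chaotic logistic map

def coupled_series(n, c, x1=0.4, x2=0.1):
    """One-way coupled logistic maps: X1 drives X2 with strength c.
    A simplified, hypothetical stand-in for system (13)."""
    s1, s2 = np.empty(n), np.empty(n)
    s1[0], s2[0] = x1, x2
    for k in range(n - 1):
        s1[k + 1] = logistic(s1[k])
        s2[k + 1] = (1.0 - c) * logistic(s2[k]) + c * logistic(s1[k])
    return s1, s2

def liang_flow(x1, x2, dt=1.0):
    """T_{2->1} via Eq. (10); duplicated here so the sketch is self-contained."""
    dx1 = (x1[1:] - x1[:-1]) / dt
    u, v = x1[:-1], x2[:-1]
    C = np.cov(np.vstack([u, v, dx1]))
    C11, C12, C22 = C[0, 0], C[0, 1], C[1, 1]
    C1d1, C2d1 = C[0, 2], C[1, 2]
    return (C11 * C12 * C2d1 - C12**2 * C1d1) / (C11**2 * C22 - C11 * C12**2)

s1, s2 = coupled_series(30_000, c=0.3)
T12 = liang_flow(s2, s1)   # flow X1 -> X2 (the driving direction)
T21 = liang_flow(s1, s2)   # flow X2 -> X1 (no feedback in the model)
```

With c = 0 the two maps are independent and both estimated flows collapse to sampling noise, mirroring the α = 0 behavior described in the text.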

V. A REAL WORLD APPLICATION
Finally let us look at an application to a real world problem: the causal relation between the two famous climate modes, El Niño and Indian Ocean Dipole (IOD). El Niño is the strongest interannual climate variation in the tropical Pacific air-sea coupled system, occurring at irregular intervals of 2 to 7 years and lasting 9 months to 2 years [21]; it is well known through its linkage to natural disasters in far flung regions of the globe, such as the floods in Ecuador, the droughts in Southeast Asia and Southern Africa, the increased number of storms over the Pacific Ocean. There are several indices measuring the strength of El Niño , the popular ones being Niño3 and Niño4 [22]. IOD is another air-sea coupled climate mode, characterized by an aperiodic oscillation of sea surface temperature (SST) in the Indian Ocean [23]. It has been related to, among others, the floods in East Africa and drought in Indonesia and parts of Australia. IOD is measured by an index called Dipole Mode Index (DMI).
As the dominant modes in the Pacific and Indian Oceans, respectively, the relationship between El Niño and IOD is of enormous interest; see [24] for a review. It has long been recognized that the tropical Indian and Pacific Oceans are interrelated in their SST anomalies [25]. But the relation between the two modes is still to be clarified. In general, it is believed that El Niño may induce IOD, which usually peaks in fall; for a recent study, see [26]. The causality from El Niño is easily understood, considering the alteration of the Walker circulation during El Niño years, which causes subsidence over the Indian Ocean (e.g., [27], [28]). On the other hand, the impact from the IOD has been recognized only recently; in fact, in early efforts, the Indian Ocean used to be treated as a slab of mixed layer responding passively to El Niño [29]. The recognition is mainly through climate predictions, such as El Niño predictions (e.g., [30], [31]) and coupled atmosphere-ocean general circulation model (CGCM) experiments (e.g., [32]). Due to the correlation between the two modes, it has been suggested that an Indo-Pacific perspective be adopted in researching the El Niño and IOD problems (e.g., [33]). The linkage between IOD and El Niño, though evidenced from different aspects, is still an issue of debate. This is partly due to the failure to recover the IOD pattern in the Indian Ocean in many correlation or time-lagged correlation analyses with the indices and the SST observations, although the correlation could be significant. Such an example is shown in Fig. 4 of [24]. This has led people to conclude, with caution, that IOD might be partially independent of El Niño, though El Niño tends to induce IOD. This problem, which is obviously one of cause and effect, could be due to the inadequacy of using correlation analysis for causality purposes.
Now that we have arrived at the formula (10), let us see how things will come out if it is applied to the IOD-El Niño causality analysis.
First look at the causal relation between the indices of the two modes. The DMI used is the monthly series by Japan Agency for Marine-Earth Science and Technology (JAMSTEC), which can be downloaded from their website. The El Niño indices are from NOAA ESRL Physical Sciences Division (http://www.esrl.noaa.gov/psd/); they are also monthly data. Since this DMI series spans from January 1958 to September 2010, the El Niño indices, which have a much longer time span (1870-present), are also tailored to the same period. These series then have a total of 633 time points. Application of (10) to the DMI and Niño4 yields a flow of information from El Niño to IOD: T E→I = −6, and a flow from IOD to El Niño : T I→E = 13 (units: 10 −3 nats/month). Using Niño3 one arrives at a similar result: T E→I = −6 and T I→E = 16. That is to say, El Niño and IOD are mutually causal, and the causality is asymmetric, with the one from the latter to the former larger than its counterpart. Moreover, the different signs indicate that El Niño tends to stabilize IOD, while IOD tends to make El Niño more uncertain.
By our previous experience, series length is crucial. To see whether the length here is adequate, we have tried shortened series: 1963-2010 and 1968-2010. The former yields essentially the same result as the full-length series; the result of the latter is also similar. In fact, even a series with a span as short as 1970-2010 gives a qualitatively similar result. So the series may be long enough to form an ensemble with sufficient statistics, and the resulting information flow rates hence may be acceptable.
With this assertion we proceed to extract the causality patterns from the SST in the Pacific and Indian Oceans. The SST data are from the above NOAA site. As before, we use only the data for the period 1958-2010. First, compute the causality between the DMI and the tropical Pacific SST. The results are shown in Fig. 4. Clearly the DMI and the Pacific SST are mutually causal, and the information flow in either direction has an El Niño-like pattern. The signs of the flows are the same as those computed above with the two index series.
If the above El Niño-like pattern in the Pacific is common, the following pattern in the Indian Ocean is unique to our causality analysis. Compute the information flow between Niño4 and the Indian Ocean SST; the result is plotted in Fig. 5. Fig. 5a shows the flow from El Niño to the SST. In the tropical region, there are indeed two poles, though the eastern one covers a rather limited region over Indonesia. On the map of the feedback from the Indian Ocean SST (Fig. 5b), this structure is much clearer, with a pattern in between the IOD and the Indian Ocean Basin Mode (IOBM). This is remarkable, as in previous correlation or time-lagged correlation analyses the IOD pattern is not seen. Note that in Fig. 5b both centers are positive, indicating that the Indian Ocean influences El Niño through IOD, which functions to amplify the El Niño oscillations. In contrast, the impact of El Niño on the Indian Ocean SST divides over the two centers, which have opposite signs of information flow. The causality analysis with Niño3 shows similar patterns, though the features may not be as pronounced as those with Niño4. The relation between El Niño and IOD is an extensively investigated subject; further discussion, however, is beyond the scope of this study. Our purpose here is to use it as one real world example to demonstrate the utility of the newly derived formula (10). The result is encouraging: the IOD-like pattern, which is anticipated as evidence builds up but has been missed in previous data analyses, shows up in the index-SST causality patterns. Moreover, it is found that, on the whole, El Niño influences the IOD by making the latter more certain, while the causality from IOD to El Niño is generally the opposite. This is perhaps why it has long been observed in a definite way that El Niño may induce IOD, while the influence from IOD to El Niño, albeit stronger, has been recognized only recently, through predictability studies.
The latter, which carries a positive information flow, means that, to El Niño, the Indian Ocean is a source of uncertainty, and that the causality from IOD to El Niño is manifested as a propagation of uncertainty from the former to the latter. Clearly, an efficient way to observe and forecast an event is to deploy extensive observations at its uncertainty source; this is how adaptive observing systems are designed. This is why knowledge of the Indian Ocean facilitates El Niño forecasts (e.g., [30], [31]), i.e., increases the predictability of El Niño.

VI. DISCUSSION
While for linear systems Eq. (10) is satisfactory, and it also succeeds with nonlinear problems such as that in section IV B, caution should be exercised in applying it to highly chaotic cases. Look again at the one-way causal map system (13), with g now defined as

g(x_1) = (1 − α) f(x_1) + α f^2(x_1),

where f^2 means that the logistic map f is applied twice. This is the anticipatory system introduced by Hahs and Pethel [20], with α being the anticipation parameter. Its causal behavior is very unusual, as reflected in the computed transfer entropy rates, written here as T^S_{2→1} and T^S_{1→2}, versus α. Transfer entropy depends on the choice of the delay length m, and on the sample-space resolution in coarse graining, represented by d, the number of partitions in each dimension of the sample space [10]. With m = 2 and d = 2, Hahs and Pethel found that T^S_{1→2} decreases with α ∈ [0, 1], while T^S_{2→1} increases with α. The two curves intersect around α = 0.5, beyond which T^S_{2→1} dominates the computed information flow (see Fig. 1 of [20]). This is, of course, not what one would expect: ideally the information flow from X_2 to X_1 should be around zero, owing to the one-way causality in (13). The above anticipatory system seems to be a nightmare for causality analysis tools; our formula (10), together with (7), encounters a similar difficulty. It could be argued that the formula is developed for linear systems, while the current system is highly nonlinear; however, as linearization is a common practice in nonlinear research, one expects (10) to give a qualitatively correct result. A possible reason for the difficulty is the inadequacy of the differencing scheme for highly chaotic series. It is well known that the derivative of a field is usually much more wildly distributed than the field itself; a good example is velocity versus vorticity in turbulence (e.g., [34]). In fact, for a system with weak solutions, shock-like structures may yield infinite derivatives which, when represented with differencing schemes, are very sensitive in time, resulting in spuriously large local differences that disguise the differential field on larger scales. That is to say, the spuriously large local differences may overwhelm the whole series and make the resulting differences unable to characterize the large-scale (and hence global) trend of the differential field. The above highly chaotic anticipatory system behaves similarly. In Fig. 6, (X_{2,n+1} − X_{2,n}) (thin line) and (X_{2,n+2} − X_{2,n})/2 (light thick line) are plotted versus n (α = 1, initialized with X_{1,1} = 0.4, X_{2,1} = 0.1). Clearly, the former is much wilder and larger in amplitude; the latter better represents the general characteristics of the derivative series, although it is accurate only up to O(2∆t).

FIG. 6. Estimates of dX/dt using different Euler forward schemes. The series {X_{2,n}} is generated from the anticipatory system (13) with α = 1. Thin line: (X_{2,n+1} − X_{2,n}); light thick line: (X_{2,n+2} − X_{2,n})/2.
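For concreteness, the anticipatory series and the two difference estimates of Fig. 6 can be reproduced with a few lines of code. This is only a sketch under stated assumptions: the text does not spell out the logistic map, so the fully chaotic form f(x) = 4x(1 − x) is assumed, as are ∆t = 1 and the function names, which are ours.

```python
import numpy as np

def f(x):
    # fully chaotic logistic map (an assumption; the text only says "logistic map f")
    return 4.0 * x * (1.0 - x)

def anticipatory_series(alpha, n_steps=1000, x1_0=0.4, x2_0=0.1):
    """Generate the one-way causal pair (13): X1 drives X2 through the
    anticipatory coupling g = (1 - alpha) f + alpha f(f(.)) of Hahs and Pethel."""
    x1 = np.empty(n_steps)
    x2 = np.empty(n_steps)
    x1[0], x2[0] = x1_0, x2_0
    for n in range(n_steps - 1):
        x1[n + 1] = f(x1[n])
        x2[n + 1] = (1.0 - alpha) * f(x1[n]) + alpha * f(f(x1[n]))
    return x1, x2

x1, x2 = anticipatory_series(alpha=1.0)
d1 = x2[1:] - x2[:-1]           # one-step difference, Eq. (7) with dt = 1
d2 = (x2[2:] - x2[:-2]) / 2.0   # two-step difference as in Fig. 6
```

Plotting d1 and d2 against n should reproduce the qualitative contrast of Fig. 6: the one-step differences are visibly wilder than the two-step ones.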
Based on the above observation and argument, we modify (7) to

Ẋ_{i,n} = (X_{i,n+2} − X_{i,n}) / (2∆t),    (16)

and use it to compute C_{1,d1} and C_{2,d1} in Eq. (10) for the anticipatory system. The resulting rates of information flow are plotted in Fig. 7a. For clarity and for the sake of comparison, Fig. 7b re-plots their absolute values |T_{2→1}| and |T_{1→2}|, which measure the strength of the underlying causality. The figure shows that, compared with T_{1→2}, T_{2→1} is negligible; even at α = 0.5, where the two stay closest, |T_{1→2}| > 5|T_{2→1}|. The one-way causality within the anticipatory system has thus been recovered fairly faithfully, with only modest computational effort, even though the system is highly nonlinear. We have also tried another differencing scheme; the resulting structure of information flow is similar, except for a left translation of the extrema (minimum and maximum) by about ∆α = 0.1.
In general, time series are formed from measurements, and the sampling cannot be so dense that all time steps are resolved; more often than not, it is coarse. We may then ask whether a re-sampled pair of the anticipatory series can give the right result, with Ẋ_{i,n} = (X_{i,n+1} − X_{i,n})/∆t as defined before in (7). Sample the original series every two time steps and redo the computation. Indeed, the computed T_{1→2} and T_{2→1} are almost the same as those in Fig. 7; the tiny difference lies at α = 0.9 and 1, where T_{1→2} takes the same value, whereas in Fig. 7 α = 0.9 is a local maximum. The success of this "series coarsening" implies that, for realistic problems such as that in section V, Ẋ_{i,n} = (X_{i,n+2} − X_{i,n})/(2∆t), i.e., the scheme (16), can be used in (10) with safety, for measurements essentially cannot be taken at extremely fine resolution.
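The equivalence behind this "series coarsening" can be checked directly: one-step differencing of the series sampled every two steps, whose sampling interval is 2∆t, reproduces the two-step difference on the original series. A minimal check (the series here is an arbitrary stand-in, with ∆t = 1):

```python
import numpy as np

x = np.random.default_rng(0).random(100)  # any series stands in for X_i here
dt = 1.0

# two-step differencing on the original series
dot_fine = (x[2:] - x[:-2]) / (2 * dt)

# one-step differencing of the series sampled every two time steps,
# whose effective sampling interval is 2 * dt
xc = x[::2]
dot_coarse = (xc[1:] - xc[:-1]) / (2 * dt)

# the coarse estimate coincides with every other value of the fine one
assert np.allclose(dot_coarse, dot_fine[::2])
```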
A closer look at Fig. 7 reveals that, besides T_{2→1} being negligible compared with T_{1→2}, T_{1→2} actually changes sign as α varies over [0, 1]. This change of causal structure makes the anticipatory system of [20] particularly interesting. Physically, it means that, when α is small, X_1 functions to reduce the marginal entropy, and hence the uncertainty, of X_2; in other words, X_1 tends to stabilize X_2. As α exceeds 0.5, however, the function is reversed: X_1 now makes X_2 more uncertain, or more unpredictable. It is still unclear whether this fundamental change in causal structure is related to the result of Hahs and Pethel, but the change itself may deserve further study.
In a word, with highly chaotic and densely sampled time series such as those generated with the anticipatory system, the formula (10) still works fairly well, provided that Ẋ_i is differenced every two or more time steps, so as to avoid spuriously large differences that prevent the large-scale trend of the derivative field from being characterized. We remark that, in real-world applications, this should not be a problem, as measurements generally cannot be taken at extremely fine resolution.

FIG. 7. Rates of information flow for the anticipatory system (13) with the differencing scheme (16). The series is formed on MATLAB with initial condition X_{1,1} = 0.4, X_{2,1} = 0.1. (a) T_{1→2} and T_{2→1} (in nats per unit time) vs. anticipation α, estimated with Ẋ_{i,n} = (X_{i,n+2} − X_{i,n})/(2∆t). (b) Same as (a), but for absolute values. Note that |T_{1→2}| ≫ |T_{2→1}|; even at α = 0.5, where they seem to stay close in this figure, the former is actually 5 times larger (3.26e-4 vs. 6.28e-5 nats/unit time).

VII. CONCLUDING REMARKS
We have obtained, in a rigorous and quantitative way, a concise formula for causal analysis between time series. The causality is measured by information flow, a physical notion only recently rigorized. For series X_1 and X_2, the rate of information flowing (units: nats per unit time) from the latter to the former is

T_{2→1} = (C_{11} C_{12} C_{2,d1} − C_{12}^2 C_{1,d1}) / (C_{11}^2 C_{22} − C_{11} C_{12}^2),

where C_{ij} is the sample covariance between X_i and X_j, C_{i,dj} the covariance between X_i and Ẋ_j, and Ẋ_j the difference approximation of dX_j/dt using the Euler forward scheme

Ẋ_{j,n} = (X_{j,n+k} − X_{j,n}) / (k∆t),

with k ≥ 1, but not so large as to compromise precision. For general time series, including those from realistic problems, k = 1 should be fine; for highly chaotic and densely sampled series, one may pick k = 2. In practice, one may first compare the results with k = 1 and k = 2; if they are qualitatively different, the k = 1 result should be discarded. The rate of information flow computed from the above formula, T_{2→1}, may be zero or nonzero. If T_{2→1} = 0, X_2 does not cause X_1; otherwise it does. In the presence of causality, two cases can be distinguished according to the sign of the flow: a positive T_{2→1} means that X_2 functions to make X_1 more uncertain, while a negative value indicates that X_2 tends to stabilize X_1. The formula is tight in form, involving only common statistics, namely sample covariances, and hence is very easy to compute. Besides, it clearly shows that causation implies correlation, but correlation does not necessarily imply causation, resolving the continuing debate over causation versus correlation. We have validated the formula with a couple of touchstone series. In the linear case, the preset stationary one-way causality can be recovered rather accurately, provided the series is long enough (in time span, not number of data points) to contain sufficient statistics; this same causality cannot be inferred with the traditional time-delayed correlation analysis.
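The complete recipe above can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name, the test system, and all variable names are ours, and evenly sampled series with interval dt are assumed.

```python
import numpy as np

def info_flow(x1, x2, k=1, dt=1.0):
    """Rate of information flow T_{2->1} from series x2 to x1 (nats per unit
    time), via the covariance formula
        T_{2->1} = (C11 C12 C2d1 - C12^2 C1d1) / (C11^2 C22 - C11 C12^2),
    with dX1/dt estimated by the Euler forward scheme with step k."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    dx1 = (x1[k:] - x1[:-k]) / (k * dt)   # Euler forward difference of X1
    C = np.cov([x1[:-k], x2[:-k], dx1])   # 3x3 sample covariance matrix
    c11, c12, c22 = C[0, 0], C[0, 1], C[1, 1]
    c1d1, c2d1 = C[0, 2], C[1, 2]         # covariances with dX1/dt
    return (c11 * c12 * c2d1 - c12**2 * c1d1) / (c11**2 * c22 - c11 * c12**2)

# A one-way linear test system of our own: X2 drives X1, but not vice versa
rng = np.random.default_rng(1)
N = 50_000
x1 = np.zeros(N)
x2 = np.zeros(N)
for n in range(N - 1):
    x2[n + 1] = 0.6 * x2[n] + 0.3 * rng.standard_normal()
    x1[n + 1] = 0.5 * x1[n] + 0.6 * x2[n] + 0.3 * rng.standard_normal()

T21 = info_flow(x1, x2)   # flow from the driver X2: clearly nonzero
T12 = info_flow(x2, x1)   # flow from the driven X1: near zero
```

Calling `info_flow` with `k=2` gives the two-step scheme recommended above for highly chaotic, densely sampled series; comparing the `k=1` and `k=2` results is a cheap sanity check.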
When the series length is guaranteed, the formula works even with series of rather coarse resolution. In the nonlinear case, it also shows remarkable success with a chaotic logistic map, and with an unusually chaotic anticipatory system possessing a very special causal structure.
The above formula has been applied to investigate the relation between El Niño and IOD, two climate modes that have been linked to many natural hazards on earth. In general, the two modes are mutually causal, though the causality is asymmetric: El Niño tends to stabilize IOD, while IOD functions to make El Niño more uncertain. In other words, to El Niño, the Indian Ocean is a source of uncertainty, and the causality from IOD to El Niño manifests itself as a propagation of uncertainty from the former to the latter. This assertion is also substantiated by the causality analysis between the El Niño index and the Indian Ocean sea surface temperature. The resulting information flow rates, both T_{El Niño→Indian} and T_{Indian→El Niño}, clearly show a dipolar structure reminiscent of the IOD and the IOBM, a structure that has been missing in previous observational data analyses with traditional approaches.
It would be of interest to compare (10) with Schreiber's transfer entropy; this is, however, a task far beyond the scope of the present study. A brief comparison was made previously in [35], demonstrating that our formalism agrees qualitatively with transfer entropy, though the two differ quantitatively. Recently, Runge et al. [12] found that transfer entropy is biased since, within the framework of a dynamical system, it depends on the autodependency coefficient; as a remedy, they proposed their own measure, namely momentary information transfer, based on graphical models. Our formalism does not have this kind of bias, as implied by Theorem II.2. Nonetheless, the computational results for the linear problem in section IV A seem to be similar, if the transfer entropy is computed using an analytical result such as that in [36]. (Note that, if one does the computation through bin counting, the one-way causality is not seen in the resulting transfer entropy.) A careful and detailed comparison is yet to be made.
It should be pointed out that the derivation of the formula (10) is based on a formalism with respect to Shannon entropy, or absolute entropy, while it has been argued [37] that, in predictability research, relative entropy is a more advantageous choice owing to its nice properties, such as invariance under nonlinear transformations. Fortunately, as we have proved in [17], for 2D systems the information flow thus obtained is the same with both absolute and relative entropies. Of course, the sign has to be changed, as the former measures uncertainty while the latter measures predictability.
It should also be pointed out that our formalism was originally developed for physical systems in nature, such as the climate modes examined in section V. The vector field of concern is assumed differentiable; that is to say, in the governing equation (1), F should be a differentiable function of X. The general formula (2) further requires this, for otherwise the derivative cannot be taken. While in practice the formula (10) may go beyond this assumption and work with some non-differentiable F, problems with systems such as those built on the logical exclusive-or operation should be excluded.

Some issues remain. Firstly, in arriving at (10), a linear model has been adopted, while in reality nonlinearity is ubiquitous. Although a linear model can be a good approximation in many cases, and indeed appears remarkably successful for the benchmark nonlinear systems examined, we take it only as an approximation, and expect that a more sophisticated model will be needed to arrive at a universally applicable formula. Secondly, we have worked with a 2D system. For general multi-dimensional systems, the extension should in principle be straightforward, though a formula like (2) is yet to be obtained for stochastic systems. These issues, among others, will be considered in forthcoming studies.