Normalizing the causality between time series

Recently, a rigorous yet concise formula has been derived to evaluate the information flow, and hence the causality in a quantitative sense, between time series. To assess the importance of a resulting causality, it needs to be normalized. The normalization is achieved by distinguishing three types of fundamental mechanisms that govern the marginal entropy change of the flow recipient. A normalized, or relative, flow measures its importance relative to the other mechanisms. In analyzing realistic series, both absolute and relative information flows need to be taken into account, since the normalizers for a pair of reverse flows belong to two different entropy balances; it is quite normal for two identical flows to differ greatly in relative importance in their respective balances. We have reproduced these results with several autoregressive models. We have also shown applications to a climate change problem and a financial analysis problem. For the former, the role of the Indian Ocean Dipole as an uncertainty source for El Niño prediction is reconfirmed. This might partly account for the unpredictability of certain aspects of El Niño that has led to the recent portentous but spurious forecasts of the 2014 "Monster El Niño". For the latter, an unusually strong one-way causality has been identified from IBM (International Business Machines Corporation) to GE (General Electric Company) in their early era, revealing an old story, which has almost gone to oblivion, about the "Seven Dwarfs" competing with a giant for the mainframe computer market.


I. INTRODUCTION
Information flow, or information transfer as it may be referred to in the literature, has long been recognized as the logically sound measure of causality between dynamical events [1]. It possesses the needed asymmetry, or directionality, for a cause-effect relation and, moreover, provides a quantitative characterization of what is otherwise only a statistical test, e.g., the Granger causality test [2]. For this reason, the past decades have seen a surge of interest in this arena of research. Measures of information flow proposed thus far include, for example, time-delayed mutual information [3], transfer entropy [4], momentary information transfer [6], and causation entropy [7], among which transfer entropy has been proved to be equivalent to Granger causality, up to a factor of 2, for linear systems [5].
Recently, it has been shown that the notion of information flow actually can be put on a rigorous footing within the framework of dynamical systems. The rate of information flowing from one component to another can be derived from first principles, and the expected property of causality turns out to be a proved theorem [10].
In the case where only a pair of time series, rather than the governing system, is given, the information flow can in principle be estimated. Particularly, under the assumption of a linear model with additive noise, the maximum likelihood estimate (MLE) of the information flow in (4) turns out to be very compact in form, involving only common statistics, namely, sample covariances [11]. Take two series X_1 and X_2, for example. The MLE of the rate of information flowing from X_2 to X_1 is shown to be

$$T_{2\to1} = \frac{C_{11} C_{12} C_{2,d1} - C_{12}^2 C_{1,d1}}{C_{11}^2 C_{22} - C_{11} C_{12}^2}, \qquad (1)$$

where C_{ij} is the sample covariance between X_i and X_j, and C_{i,dj} is that between X_i and Ẋ_j, Ẋ_j being the difference approximation of dX_j/dt using the Euler forward scheme [11]. Ideally, when T_{2→1} = 0, X_2 is not the cause of X_1, and vice versa. It is easy to see that, if C_{12} = 0, then T_{2→1} = 0, but when T_{2→1} = 0, C_{12} need not vanish. That is, contrapositively, causation implies correlation, but correlation does not imply causation. (Throughout the text "causation" and "causality" are used synonymously.) In an explicitly quantitative way, this resolves the long-standing debate over causation versus correlation.

The magnitude of an information flow may differ from case to case. It needs to be normalized, just as covariance does, in order to have its importance assessed. In the extreme cases where no causality exists, though theoretically the corresponding information flow rates should be 0, in reality their estimators from time series generally do not precisely vanish. One then cannot tell from the absolute magnitude alone whether the causality indeed exists.
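For readers who wish to experiment, a minimal Python sketch of the estimator in Eq. (1) might read as follows. The function name liang_flow and its interface are our own, not from Ref. [11]; only the covariance algebra follows the formula above.

```python
import numpy as np

def liang_flow(x1, x2, k=1, dt=1.0):
    """MLE of the rate of information flow T_{2->1}, per Eq. (1).

    x1, x2 : 1-D arrays sampled at a regular time step dt.
    k      : differencing step for the Euler-forward derivative
             (k = 1 normally; k = 2 for finely sampled deterministic chaos).
    """
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    dx1 = (x1[k:] - x1[:-k]) / (k * dt)   # Euler-forward estimate of dX1/dt
    y1, y2 = x1[:-k], x2[:-k]             # align the series with dx1
    C = np.cov(y1, y2)                    # sample covariances C_ij
    C11, C12, C22 = C[0, 0], C[0, 1], C[1, 1]
    C1d1 = np.cov(y1, dx1)[0, 1]          # C_{1,d1} = cov(X1, dX1/dt)
    C2d1 = np.cov(y2, dx1)[0, 1]          # C_{2,d1} = cov(X2, dX1/dt)
    # Eq. (1)
    return (C11 * C12 * C2d1 - C12**2 * C1d1) / (C11**2 * C22 - C11 * C12**2)
```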
A simple example may help illustrate the issue. Consider the series generated from two autoregressive processes,

X_1(n+1) = 0.1 + 0.5 X_1(n) + α X_2(n) + e_1(n+1), (2a)
X_2(n+1) = 0.7 + β X_1(n) + 0.6 X_2(n) + e_2(n+1), (2b)

where the errors e_1 ∼ N(0,1) and e_2 ∼ N(0,1) are independent. Generate a pair of series with 80 000 values each, and perform the causality analysis. We list in Table I the information flow rates and their respective confidence intervals at the 90% level. (The results may differ slightly from run to run due to the pseudorandom number generation.) For case I, |T_{2→1}/T_{1→2}| > 740, so one may conclude that this is a one-way causality from X_2 to X_1, as is indeed true. For case II, however, one actually cannot say much from the numbers. Though small, they tell no more than that the information flows in the two directions are of equal importance. Of course, one may argue that the statistical significance test settles it: at the 90% level these flow rates are not significantly different from 0. However, such a test only tells how precise the estimate is with the available data; it depends, for example, on the length of the series, which is irrelevant to the parameter being estimated. In other words, an insignificant estimated rate may appear significant if more data are included. To see this more clearly, look at case III. Obviously the information flows, albeit existent, make only tiny contributions to their respective series, as the coupling coefficients are more than an order of magnitude smaller than the others; in classical perturbation analysis, they could be dropped at the first-order approximation. The computed information flows are significantly different from 0 at the 90% level: from one viewpoint, this testifies to the success of the formalism. However, the small numbers cannot tell how important the flows are, since, with a slowly varying series, even the dominant flow rate could be very low. On the other hand, if we cut the series in half and pick the first 40 000 points for analysis, then the results become T_{2→1} = 0.653 ± 0.751 and T_{1→2} = 1.240 ± 0.690 (in 10^{-4} nats/iteration); T_{2→1} is insignificant while T_{1→2} is significant. (Again, these small numbers may fluctuate due to the pseudorandom number generator.) Can one thus conclude that there is a one-way causality? Or can one assert that the shortened series yields a more reliable estimate? Surely this is absurd. The problem is that we do need a normalized flow to evaluate its importance relative to other factors. In this study, we present a way to arrive at such a flow, and apply it to the analysis of several realistic financial time series.
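The experiment can be reproduced along the following lines, reusing liang_flow from the sketch above. Only case III (α = β = 0.01) is fully specified in the text, so it is the one coded here; the other cases in Table I would use different α and β.

```python
import numpy as np

rng = np.random.default_rng(1)   # results fluctuate with the seed
alpha, beta = 0.01, 0.01         # case III; e.g., set beta = 0 for a
                                 # one-way X2 -> X1 coupling
n = 80_000
x1 = np.zeros(n)
x2 = np.zeros(n)
e1 = rng.standard_normal(n)      # e1 ~ N(0, 1)
e2 = rng.standard_normal(n)      # e2 ~ N(0, 1)
for i in range(n - 1):           # Eqs. (2a) and (2b)
    x1[i + 1] = 0.1 + 0.5 * x1[i] + alpha * x2[i] + e1[i + 1]
    x2[i + 1] = 0.7 + beta * x1[i] + 0.6 * x2[i] + e2[i + 1]

print("T_2->1:", liang_flow(x1, x2))   # nats/iteration (dt = 1)
print("T_1->2:", liang_flow(x2, x1))
```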

II. INFORMATION FLOW NORMALIZATION
The normalization is not as simple as it may seem. A natural normalizer that comes to mind, by analogy with the correlation coefficient, might be the information transferred from a series to itself. One snag is, however, that this quantity may turn out to be 0, as happens in the Hénon map, a benchmark problem we have examined before (see the references in [9]). Another snag is that, unlike in a correlation analysis, where the Cauchy-Schwarz inequality supplies a common normalizer, T_{2→1} and T_{1→2} generally do not share the same normalizer. That is, two information flows of equal magnitude may be of different relative importance in their respective series.
To arrive at a logically and physically sound normalizer, we need to get to the basics and analyze how an information flow within a system is derived. Consider a two-dimensional (2D) stochastic system

$$d\mathbf{X} = \mathbf{F}(\mathbf{X}, t)\,dt + \mathsf{B}(\mathbf{X}, t)\,d\mathbf{W}, \qquad (3)$$

where F = (F_1, F_2)^T is the vector of drift coefficients (a differentiable vector field), B = (b_{ij}) the matrix of stochastic perturbation coefficients, and W a 2D standard Wiener process. Let g_{ij} = Σ_k b_{ik} b_{jk}, and let ρ_i be the marginal probability density function of X_i. It is proved [10] that the time rate of information flowing from X_2 to X_1 is

$$T_{2\to1} = -E\left[\frac{1}{\rho_1}\frac{\partial (F_1\rho_1)}{\partial x_1}\right] + \frac{1}{2}E\left[\frac{1}{\rho_1}\frac{\partial^2 (g_{11}\rho_1)}{\partial x_1^2}\right], \qquad (4)$$

where E signifies the operator of mathematical expectation. This measure of information flow is asymmetric between the two parties, and particularly, if the process underlying X_1 does not depend on X_2, then the resulting information flow from X_2 to X_1 vanishes, i.e., X_2 is not causal to X_1. This is the so-called property of causality, a fact rigorously proven rather than just verified in applications. When T_{2→1} is nonzero, it may take positive or negative values. Ideally, a positive T_{2→1} means that X_2 causes X_1 to be more uncertain, while a negative T_{2→1} means that X_2 reduces the entropy of X_1. For more details, the reader is referred to Ref. [11].
By Ref. [10], the rate of change of the marginal entropy of X_1 is

$$\frac{dH_1}{dt} = -E\left[F_1 \frac{\partial \log\rho_1}{\partial x_1}\right] - \frac{1}{2}E\left[g_{11}\frac{\partial^2 \log\rho_1}{\partial x_1^2}\right]. \qquad (5)$$

It is actually the result of two mutually exclusive mechanisms: the first is the information flow T_{2→1} as shown in (4); the second is the complement, i.e., the rate of entropy increase without taking into account the effect of X_2. Denoting the latter dH_{1\2}/dt, it has been proven in [10] that

$$\frac{dH_{1\backslash 2}}{dt} = E\left[\frac{\partial F_1}{\partial x_1}\right] - \frac{1}{2}E\left[g_{11}\frac{\partial^2 \log\rho_1}{\partial x_1^2}\right] - \frac{1}{2}E\left[\frac{1}{\rho_1}\frac{\partial^2 (g_{11}\rho_1)}{\partial x_1^2}\right]. \qquad (6)$$

The right-hand side has three terms. The first term is precisely the time rate of change of H_1 due to X_1 itself in the absence of stochasticity,

$$\frac{dH_1^*}{dt} = E\left[\frac{\partial F_1}{\partial x_1}\right]; \qquad (7)$$

this is the starting point which we showed in 2005 [8] in establishing the rigorous formalism, and proved later (cf. [9]). The remaining two terms are due to the stochastic perturbation; denote their sum

$$\frac{dH_1^{noise}}{dt} = -\frac{1}{2}E\left[g_{11}\frac{\partial^2 \log\rho_1}{\partial x_1^2}\right] - \frac{1}{2}E\left[\frac{1}{\rho_1}\frac{\partial^2 (g_{11}\rho_1)}{\partial x_1^2}\right]. \qquad (8)$$

Hence, through a careful analysis, the increase in the marginal entropy H_1 is decomposed into three parts, dH_1^*/dt, dH_1^{noise}/dt, and T_{2→1} in (4), which correspond to, respectively, the phase-space expansion along the X_1 direction, the stochastic effect, and the information flowing from X_2; Fig. 1 shows a schematic. Note that this decomposition does not appear explicitly in the marginal entropy evolution equation, (5), as the two stochastic terms cancel out.
The normalization is now made easy. Let

$$Z_{2\to1} \equiv \left|\frac{dH_1^*}{dt}\right| + \left|\frac{dH_1^{noise}}{dt}\right| + |T_{2\to1}|. \qquad (9)$$

Obviously it is no less than |T_{2→1}| in magnitude, and it cannot be 0 unless X_1 does not change, a situation that is excluded in time series analysis. We may therefore pick Z_{2→1} as the normalizer and define

$$\tau_{2\to1} = \frac{T_{2\to1}}{Z_{2\to1}}. \qquad (10)$$

This way, if |τ_{2→1}| = 1, the variation of H_1 is 100% due to the information flow from X_2; if τ_{2→1} is approximately 0, X_2 is not the cause. Therefore, τ_{2→1} assesses the importance of the influence of X_2 on X_1 relative to the other processes. It should be pointed out that the above normalizer applies to T_{2→1} only. For T_{1→2}, it is

$$Z_{1\to2} \equiv \left|\frac{dH_2^*}{dt}\right| + \left|\frac{dH_2^{noise}}{dt}\right| + |T_{1\to2}|, \qquad (11)$$

which may be quite different in value. This, from another viewpoint, reflects the asymmetry between T_{2→1} and T_{1→2}.
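In code, the normalization itself is a one-liner once the three entropy-balance components are in hand; a minimal sketch with our own naming:

```python
def relative_flow(T21, dH1_star, dH1_noise):
    """tau_{2->1} = T_{2->1} / Z_{2->1}, with Z_{2->1} as in Eq. (9)."""
    Z21 = abs(dH1_star) + abs(dH1_noise) + abs(T21)
    return T21 / Z21
```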

III. ESTIMATION
As in Ref. [11], consider a linear version of the stochastic differential equation (3),

$$d\mathbf{X} = (\mathbf{f} + \mathsf{A}\mathbf{X})\,dt + \mathsf{B}\,d\mathbf{W}, \qquad (12)$$

where f is a constant vector, and A = (a_{ij}) and B = (b_{ij}) are constant matrices. If X is initially normally distributed, then it is normally distributed forever. Let the mean vector and covariance matrix be μ and Σ = (σ_{ij}). They evolve according to

$$\frac{d\boldsymbol{\mu}}{dt} = \mathbf{f} + \mathsf{A}\boldsymbol{\mu}, \qquad \frac{d\mathsf{\Sigma}}{dt} = \mathsf{A}\mathsf{\Sigma} + \mathsf{\Sigma}\mathsf{A}^T + \mathsf{B}\mathsf{B}^T. \qquad (13)$$

So Eqs. (7) and (8) can be explicitly evaluated:

$$\frac{dH_1^*}{dt} = E\left[\frac{\partial F_1}{\partial x_1}\right] = a_{11}, \qquad (14)$$

and, for dH_1^{noise}/dt, neither g_{11} nor ρ_1 depends on x_2. But ∫ρ_{2|1} dx_2 = 1, and ρ_1 and its derivatives vanish at infinity, so the whole second term on the right-hand side of (8) vanishes; the first term, evaluated with the Gaussian ρ_1, gives

$$\frac{dH_1^{noise}}{dt} = \frac{g_{11}}{2\sigma_{11}}. \qquad (15)$$

Equations (14) and (15), together with the information flow from X_2 to X_1 as we have obtained before [8,11], T_{2→1} = (σ_{12}/σ_{11}) a_{12}, form the three constituents that account for the evolution of the marginal entropy of X_1.
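For the reader's convenience, the Gaussian evaluation behind Eq. (15) takes one line: with ρ_1 ∼ N(μ_1, σ_{11}),

$$\frac{\partial^2 \log\rho_1}{\partial x_1^2} = \frac{\partial^2}{\partial x_1^2}\left[-\frac{(x_1-\mu_1)^2}{2\sigma_{11}}\right] = -\frac{1}{\sigma_{11}}, \qquad \text{so} \qquad -\frac{1}{2}E\left[g_{11}\frac{\partial^2 \log\rho_1}{\partial x_1^2}\right] = \frac{g_{11}}{2\sigma_{11}}.$$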
An observation about dH_1^{noise}/dt = g_{11}/(2σ_{11}), where g_{11} = b_{11}^2 + b_{12}^2, is that it is always positive. That is, noise always contributes to an increase in the marginal entropy of X_1, conforming to common sense. In financial economics, this reflects the volatility of, say, a stock. On the other hand, for a stationary series, the left-hand side of Eq. (13) tends to 0; its (1,1) component then requires the balance 2(a_{11}σ_{11} + a_{12}σ_{12}) + g_{11} ∼ 0 on the right-hand side. So this quantity, g_{11}/(2σ_{11}), is also related to the noise-to-signal ratio.
The above results need to be estimated if what we are given is just a pair of time series. That is, what we know is a single realization of some unknown system, which, if known, can produce infinitely many realizations. The problem now becomes estimating (14) and (15) with the available statistics of the given time series.
We use maximum likelihood estimation to achieve the goal. The procedure follows precisely that in [11], to which we refer the reader for details. Suppose that the series are sampled at regular instants with a time step Δt, and let N be the sample size. Further assume that b_{12} = 0 (hence g_{11} = b_{11}^2). We have shown that the MLEs are [11] â_{11} = p, â_{12} = q, and f̂_1 = mean(Ẋ_1) − p mean(X_1) − q mean(X_2) (the means being sample means), with

$$p = \frac{C_{22} C_{1,d1} - C_{12} C_{2,d1}}{C_{11}C_{22} - C_{12}^2}, \qquad q = \frac{C_{11} C_{2,d1} - C_{12} C_{1,d1}}{C_{11}C_{22} - C_{12}^2}, \qquad (16)$$

where C_{i,j} is the sample covariance between X_i and X_j, and C_{i,dj} the sample covariance between X_i and Ẋ_j ≈ [X_j(n+k) − X_j(n)]/(kΔt).
(Usually k = 1 should be used to ensure accuracy, but in some cases of deterministic chaos, where the sampling is at the highest resolution, one needs to choose k = 2.) The MLE of g_{11} can be obtained by computing

$$\hat g_{11} = \frac{\Delta t}{N} \sum_{n=1}^{N} \left[\dot X_1(n) - \hat f_1 - \hat a_{11} X_1(n) - \hat a_{12} X_2(n)\right]^2. \qquad (17)$$

On the other hand, the population covariance matrix Σ can be rather accurately estimated by the sample covariance matrix C. So (14) and (15) become, after some algebraic manipulation,

$$\frac{dH_1^*}{dt} = \hat a_{11} = \frac{C_{22} C_{1,d1} - C_{12} C_{2,d1}}{C_{11}C_{22} - C_{12}^2}, \qquad (18)$$

$$\frac{dH_1^{noise}}{dt} = \frac{\hat g_{11}}{2 C_{11}}. \qquad (19)$$

As in Ref. [11] with T_{2→1}, here dH_1^*/dt and dH_1^{noise}/dt (and Z_{2→1} in the following) should each bear a hat, since they are the corresponding estimators; we abuse the notation a little to avoid notational complexity, and from now on they should be understood as their respective estimators. With these the normalizer is

$$Z_{2\to1} = \left|\frac{dH_1^*}{dt}\right| + \left|\frac{dH_1^{noise}}{dt}\right| + |T_{2\to1}|, \qquad (20)$$

and hence we have the relative information flow from X_2 to X_1:

$$\tau_{2\to1} = \frac{T_{2\to1}}{Z_{2\to1}}. \qquad (21)$$

τ_{1→2} can be obtained simply by swapping the indices in T and Z and their respective expressions.
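Putting the estimators (16)-(21) together, a self-contained Python sketch might look as follows; the naming is ours, and the linear model with b_{12} = 0 is assumed, as above.

```python
import numpy as np

def liang_flow_normalized(x1, x2, k=1, dt=1.0):
    """Return (T_{2->1}, tau_{2->1}) estimated per Eqs. (16)-(21)."""
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    dx1 = (x1[k:] - x1[:-k]) / (k * dt)   # Euler-forward dX1/dt
    y1, y2 = x1[:-k], x2[:-k]
    N = y1.size
    C = np.cov(y1, y2)
    C11, C12, C22 = C[0, 0], C[0, 1], C[1, 1]
    C1d1 = np.cov(y1, dx1)[0, 1]
    C2d1 = np.cov(y2, dx1)[0, 1]
    det = C11 * C22 - C12 ** 2
    a11 = (C22 * C1d1 - C12 * C2d1) / det              # p in Eq. (16)
    a12 = (C11 * C2d1 - C12 * C1d1) / det              # q in Eq. (16)
    f1 = dx1.mean() - a11 * y1.mean() - a12 * y2.mean()
    resid = dx1 - f1 - a11 * y1 - a12 * y2
    g11 = dt * np.sum(resid ** 2) / N                  # Eq. (17)
    T21 = C12 / C11 * a12                              # equals Eq. (1)
    dH1_star = a11                                     # Eq. (18)
    dH1_noise = g11 / (2.0 * C11)                      # Eq. (19)
    Z21 = abs(dH1_star) + abs(dH1_noise) + abs(T21)    # Eq. (20)
    return T21, T21 / Z21                              # Eq. (21)
```

The reverse direction follows from liang_flow_normalized(x2, x1), since τ_{1→2} is obtained by swapping the indices.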

IV. VALIDATION
Applying the above normalization to the autoregressive example of Eqs. (2a) and (2b), we find that both flows are clearly negligible in comparison to the contributions from the other processes in their respective series. For the case α = β = 0.01, where one may encounter difficulty due to the ambiguously small numbers, the computed relative information flow rates are τ_{2→1} = 0.018% and τ_{1→2} = 0.015%. Again, they are essentially negligible, just as one would expect.
Generally speaking, the above imbalance is the rule, not the exception, reflecting the asymmetry of information flow. One may reasonably imagine that, in some extreme situation, a flow might be dominant while its counterpart is negligible within their respective series, even though the two are of the same order in absolute value.

V. APPLICATION
We now demonstrate a real-world application with several financial time series. Here it is not our intention to conduct financial economics research or study market dynamics from an econophysical point of view; our purpose is to demonstrate a brief application of the aforementioned formalism for time series analysis. Nonetheless, this topic is indeed of interest to both physicists and economists in the field of macroscopic econophysics; see, for example, [12].
We pick nine stocks in the United States and download their daily prices from YAHOO! FINANCE. The stocks span the information technology industry [IBM, Apple (AAPL), Intel (INTC), and Microsoft (MSFT)], the retail industry [pharmacies (CVS) and discount stores (WMT)], the automotive industry (F), the oil and gas industry (XOM), and the multinational conglomerate corporation GE, which operates through the segments of energy, technology infrastructure, capital finance, etc. Here by "daily" we mean on a trading-day basis, excluding, say, holidays and weekends. Since stock prices are generally nonstationary, we work with the series of daily returns, R(t) = [P(t + Δt) − P(t)]/P(t), or log-returns, r(t) = ln P(t + Δt) − ln P(t), where P(t) is the adjusted closing price in the YAHOO! spreadsheet, and Δt is 1 trading day. Following most people, we use the series of log-returns r for our purpose. In fact, the return and log-return series are approximately equivalent, particularly in the high-frequency regime, as indicated in [13]. Since the most recently listed stock, MSFT, started on March 13, 1986, all the series are chosen from that date through December 26, 2014, when this study commenced. This amounts to 7260 data points, hence 7259 points for each log-return series.
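Assuming one CSV file per ticker in the usual layout of the historical-prices download (Date and Adj Close columns; this layout is an assumption, adjust as needed), the log-return series can be computed as:

```python
import numpy as np
import pandas as pd

def log_returns(csv_path):
    """r(t) = ln P(t + dt) - ln P(t) from the adjusted closing prices."""
    prices = pd.read_csv(csv_path, parse_dates=["Date"], index_col="Date")
    p = prices["Adj Close"].to_numpy(dtype=float)   # assumed column name
    return np.log(p[1:] / p[:-1])                   # N - 1 returns for N prices
```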
Using Eq. (1), we compute the information flows between the nine stocks and form a matrix of flow rates; see Table II. The flow direction is encoded by the matrix indices; specifically, it is from the row index to the column index. For example, listed at location (2,4) is T_{2→4}, i.e., T_{AAPL→INTC}, the flow rate from Apple to Intel, while location (4,2) stores the rate of the reverse flow, T_{INTC→AAPL}. Also listed in the table are the respective confidence intervals at the 90% level.
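The flow-rate matrix of Table II can be assembled with a double loop over the nine return series, reusing log_returns and the estimator liang_flow sketched in Sec. I; the per-ticker file names below are hypothetical.

```python
import numpy as np

tickers = ["IBM", "AAPL", "INTC", "MSFT", "CVS", "WMT", "F", "XOM", "GE"]
# Hypothetical file names; the series must be aligned on common trading days.
returns = {t: log_returns(f"{t}.csv") for t in tickers}

n = len(tickers)
T = np.full((n, n), np.nan)       # T[i, j] holds the flow row -> column
for i, src in enumerate(tickers):
    for j, dst in enumerate(tickers):
        if i != j:
            # liang_flow(a, b) estimates the flow from b to a, so the
            # flow src -> dst is liang_flow(returns[dst], returns[src]).
            T[i, j] = liang_flow(returns[dst], returns[src])
```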
In Table II, most of the information flow rates are significant at the 90% level, as highlighted. Their magnitudes vary from 4 to 22 (in units of 10^{-3} nats/day; the same below). The maximum is |T_{IBM→XOM}|, and second to it are |T_{WMT→CVS}| and |T_{CVS→GE}|.
Look at the table row by row (companies as drivers). Perhaps the most conspicuous feature is that the whole CVS row is significant. Next to it is XOM, with only three insignificant entries. That is, CVS is found to be causal to all the other stocks, though the causality magnitudes have yet to be assessed (see below). This does make sense: as a chain of convenience stores, CVS connects with most of the general consumers.

TABLE II. Rates of absolute information flow among the nine chosen stocks (in 10^{-3} nats per trading day). For each entry the direction is from the row index to the column index of the matrix. Also listed are the standard errors at the 90% confidence level (significant flows are highlighted).
Another interesting observation is that |T_{F→WMT}| > |T_{F→CVS}|. This is easy to understand, as we rely on our motor vehicles to shop at Wal-Mart, while CVS stores could be right in the neighborhood.
The above significant absolute information flows, large or small, still need to be assessed regarding their respective relative importance before any conclusion about causality is reached. Using Eq. (21), we compute the relative information flow rates (as percentages) and list them in Table III. For clarity, those greater than or equal to 1% are highlighted. In contrast to Table II, we see only a few information flows that account for more than 1% of their respective fluctuations. This echoes what we noted in the beginning: though statistically significant, some information flows may be negligible in their own marginal entropy balances.
It should be noted that the causal relations generally change with time. If the series are long enough, we may look at how these information flows vary from period to period. Pick the pair (IBM, GE) as an example. For the duration considered above (March 1986 through December 2014), T_{GE→IBM} = −13 ± 9, while T_{IBM→GE} is not significant; neither τ_{GE→IBM} nor τ_{IBM→GE} reaches 1%. Since at the YAHOO! site both GE and IBM can be dated back to January 2, 1962, we can extend the time series considerably, up to 13 338 data points. The resulting flows differ from those in Tables II and III, with the causal structure changed from a weak two-way causality to a stronger and more or less one-way causality. Since in the above only the data for the most recent 30 years are used, we expect that in the early years this causal structure could be much enhanced. Choosing the first 7000 points (from January 1962 through November 1989), the computed relative information flow rates are τ_{IBM→GE} = 3.1% and τ_{GE→IBM} = −0.2%, in sharp contrast to the entries in Table II. Obviously, during this period, the causality can be approximately viewed as one-way, i.e., from IBM to GE, and at its peak the relative flow exceeds 5%, much larger than the values in Table III.

The above remarkable causal structure for that particular period can actually trace its roots back to the history of GE [14]. There was a period in the 1960s when the "Seven Dwarfs" (Burroughs, Sperry Rand, Control Data, Honeywell, General Electric, RCA, and NCR) competed with IBM, the giant, for the computer business and, particularly, to build mainframes. In 1965, GE had only a 3.7% market share of the industry, though it was then dubbed the "King of the Dwarfs," while IBM had a 65.3% share. Historically, GE was once the largest computer user besides the U.S. Federal Government; it got into computer manufacturing to avoid dependency on others. And, indeed, throughout the 1960s, the causalities between GE and IBM are not significant. Then why, as the 1970s began, did the information flow from IBM to GE suddenly increase to its highest level? It turns out that GE sold its computer division to Honeywell in 1970; in the following years (starting from 1971), GE relied a great deal on IBM products. Clearly, the GE computer history does substantiate the existence of a causation between GE and IBM and, to be more precise, an essentially one-way causation from IBM to GE. In an era when this has almost been forgotten (one cannot even find it on GE's Web site), and GE may have left the impression that it never built any computers, let alone a series of mainframes, this finding, based solely on the causality analysis of a pair of time series, is indeed remarkable.
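A period-by-period analysis like the one above amounts to sliding a window along the two return series; a minimal sketch follows, where the window width and stride are arbitrary illustrative choices and liang_flow_normalized is the estimator from Sec. III.

```python
import numpy as np

def windowed_relative_flow(x1, x2, width=2000, stride=250):
    """tau_{2->1} over sliding windows, for tracking changes in causality."""
    taus = []
    for s in range(0, len(x1) - width + 1, stride):
        _, tau = liang_flow_normalized(x1[s:s + width], x2[s:s + width])
        taus.append(tau)
    return np.array(taus)
```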

VI. CONCLUDING REMARKS
To assess the importance of a flow of information from one time series, say X_2, to another, say X_1, it needs to be normalized. Getting down to the fundamentals, we were able to distinguish three types of mechanisms that contribute to the evolution of the marginal entropy of the recipient X_1: the phase-space expansion in the X_1 direction, the information flow from X_2, and the contribution from noise. On this basis we proposed an approach to normalization. The resulting scheme is described by Eqs. (16)-(21).
It should be noted that a relative information flow is for comparison purposes within its own series. A pair of reverse flows between two series can only be compared in terms of absolute value, since they belong to two different entropy balances. It is quite normal that two information flows of identical absolute value differ significantly in relative importance with respect to their own series, as demonstrated in our examples, reflecting the asymmetry of information flow. In some extreme situation, a pair of equal flows may have one dominant while the other is negligible in their respective entropy balances. In this sense, absolute and relative information flows usually need to be examined together in realistic applications.