Understanding causation via correlations and linear response theory

In spite of the (correct) common-wisdom statement that correlation does not imply causation, a proper use of time correlations and of fluctuation-response theory allows one to understand the causal relations between the variables of a multi-dimensional linear Markov process. It is shown that the fluctuation-response formalism can be used both to find the direct causal links between the variables of a system and to introduce a degree of causation, cumulative in time, whose physical interpretation is straightforward. Although for generic non-linear dynamics there is no simple exact relationship between correlations and response functions, the described protocol can still provide a useful proxy in the presence of weak nonlinear terms.


I. INTRODUCTION
Detection of causation is a fundamental topic in science, whose origin dates back to the philosophical investigation of D. Hume [1] and to the roots of physical thinking. In its most general terms, the problem may be formulated as follows: given the time series $\{x^{(1)}_t\}, \{x^{(2)}_t\}, \ldots, \{x^{(n)}_t\}$ of the $n$ variables constituting an observable system $x_t$, one wishes to determine unambiguously whether the behavior of $x^{(k)}$ has been influenced by $x^{(j)}$ during the dynamics, without knowing the underlying evolution laws. Causal detection has a primary practical relevance in physical modeling [2-4], where the problem of inferring models from data is typically faced [5-8].
A natural idea, summarized by the Latin saying cum hoc ergo propter hoc ("with this, therefore because of this"), is looking at the correlation $C_{jk}(t) = \langle x^{(j)}_t x^{(k)}_0 \rangle$, since a causal link should lead to a non-zero value for it, at least for some $t > 0$. On the other hand, the presence of correlation does not imply causation, as it is possible, for instance, that both $x^{(k)}$ and $x^{(j)}$ are influenced by one or more common-causal variables [2,9-11].
A more reliable way to detect the presence of causal effects between two variables is the popular Granger causality (GC) test [12]. This method allows one to determine whether the knowledge of the past history of $x^{(j)}$ enhances the ability to predict future values of $x^{(k)}$. Basically, it compares the statistical uncertainties of two predictions built on the linear regression of past data, obtained by including or ignoring the trajectory of $x^{(j)}$. The improvement of the prediction, defined by the relative reduction of the uncertainty, gives a measure of how useful $x^{(j)}$ is to the determination of $x^{(k)}$ [13-15]. A similar approach consists in defining a degree of information exchange from $x^{(j)}$ to $x^{(k)}$, which quantifies the loss of information about $x^{(k)}$ that one experiences if $\{x^{(j)}_t\}$ is ignored. This is exactly what is done by transfer entropy (TE) and related quantities [16-20] (which also have interesting interpretations in the context of information thermodynamics [21,22]). Remarkably, TE has been shown to be exactly equivalent to GC in linear autoregressive systems [14,23,24].
Even if GC, TE and similar quantities can provide useful information about the dynamics, their employment as a measure of causal relations may not be completely satisfactory from a physical point of view. Indeed, in physics two variables are usually believed to be in a cause-effect relationship if an external action on one of them results in a change of the observed value of the second [3,23], whereas the above-mentioned tests, strictly speaking, only determine whether, and to what extent, the knowledge of a certain variable is useful to the actual determination of future values of another. In the following, we will call "interventional" the former, physics-inspired definition of cause-effect relation and "observational" the latter. Sometimes a similar distinction is made between the two approaches, distinguishing between the detection of "causal mechanisms" and "causal effects" [25,26]. As we will discuss in the next Section, the strength of the interventional causal link is quantified by a well-known observable, the physical response [27], whose usage to infer causal relations from data is the main subject of this paper.

To clarify the above distinction between interventional and observational causation, let us briefly discuss a simple situation in which this difference may be relevant. Imagine that we want to measure the electrical current passing through a resistor, when its extremities are connected to an external time-dependent source of electric potential, $v(t)$. Let us assume that the amperometer we are using is affected by some noise $\eta(t)$ independent of $v(t)$. In this case, the measured value of the current is given by
$$j_{\rm meas}(t) = j_{\rm true}(t) + \eta(t) = G\, v(t) + \eta(t)\,,$$
where $j_{\rm true}$ is the actual (unknown) value of the current and $G$ is the electrical conductance of the considered resistor. In this case, a good estimator of the interventional causality between $v(t)$ and $j_{\rm meas}(t)$ will only depend on the conductance $G$, since this parameter establishes to which extent an external action on $v(t)$ will influence the observed value of the current, $j_{\rm meas}(t)$ (a notion which does not depend on the intensity of the noise). Conversely, from an observational perspective also the amplitude of the noise $\eta(t)$ does play a role, since our ability to predict future values of $j_{\rm meas}$, given $v(t)$, crucially depends on it: roughly speaking, if the noise is small, the knowledge of $v(t)$ will suffice to give a good estimate of $j_{\rm meas}(t)$, whereas if it is large, the information about $v(t)$ is almost useless.
In this paper we show that linear response theory allows one to understand causal links (in the interventional sense) from time series of data, if the considered process is of Markov type. Moreover, if the dynamics is also linear, only simple time correlation functions need to be taken into account. This approach can be used both to quantify the overall influence of $x^{(j)}$ on $x^{(k)}$, including the effects due to indirect causation, and to infer the matrix of direct links between the elements of the system.
Of course, in most cases an analysis based on linear response will provide results qualitatively similar to those obtained by means of TE or GC, since information transfer and physical interaction are usually related; however, the analytical forms of TE and GC are typically cumbersome, even for very simple models, and this makes it very difficult to get any insight into the structure of the considered system by means of these tools. Moreover, they are usually difficult to apply in practical situations, as in experiments, if the dimensionality of the system is not very small. The method presented here is instead very simple to apply in practice, and its physical interpretation is straightforward; the drawback is that it is rigorously valid only for Markov systems with linear dynamics: generalizations to non-linear evolutions are also possible, provided that the stationary joint probability density function of the system is known.
The paper is structured as follows. In Section II we give a physical definition of causation using the formalism of linear response theory, which is briefly recalled in Appendix A. Section III is devoted to linear Markov systems: we discuss how the response formalism can be used to infer causal links from correlations, and we outline the main differences with other approaches. In Section IV we consider more general cases, i.e. non-linear systems and dynamics with hidden variables, and we discuss the limits of causation determination from data. Finally, in Section V we draw our conclusions.

II. A PHYSICAL DEFINITION OF CAUSATION
As mentioned in the Introduction, we are mainly interested in the study of causation in the interventional sense, i.e. the one accounting for the effects of external actions on the system, as in typical experimental setups. Let us consider the system
$$x_{t+1} = f(x_t; \eta_t)\,, \qquad (1)$$
where $t$ is a (discrete) time index and $\eta_t$ is a noise term. We say that $x^{(j)}$ influences $x^{(k)}$ if a perturbation on the variable $x^{(j)}$ at time 0, $x^{(j)}_0 \to x^{(j)}_0 + \delta x^{(j)}_0$, induces, on average, a change on $x^{(k)}_t$, with $t > 0$. In formulae, we will say that $x^{(j)}$ has an influence on $x^{(k)}$ if a smooth function $\mathcal{F}(x)$ exists such that
$$\overline{\delta \mathcal{F}(x^{(k)}_t)} \neq 0 \quad \text{for some } t > 0\,, \qquad (2)$$
i.e. if perturbing $x^{(j)}_0$ results in a non-zero average variation of $\mathcal{F}(x^{(k)}_t)$ with respect to its unperturbed evolution. Here the over-line represents an average over many realizations of the experiment. Since we will mainly deal with linear Markov systems, considering the identity function $\mathcal{F}(x) = x$ will be sufficient to detect the presence of causal links (see Appendix B for a brief discussion on this point).
This idea is not completely new [3,23], and it is reminiscent of the framework developed by Pearl [2], in which causation is detected by observing the effects of an action on the system (although in that context the role of time is not explicitly considered). In particular, in Pearl's formalism one has to evaluate conditional probabilities assuming that the graph of the interactions between variables is actively manipulated. A similar idea can be found in the "flow of information" introduced in Ref. [28], which can be seen as the information-theoretic counterpart of Pearl's probabilistic formalism.
If the system admits a (sufficiently smooth) invariant distribution, and $\delta x^{(j)}_0$ is small enough, quantities of the form (2) can be evaluated without actually perturbing the system, since they are related to the spontaneous correlations of the unperturbed dynamics by the fluctuation-response (FR) theorem [27,29], also known as fluctuation-dissipation theorem. If $\{x_t\}$ is a stationary process with invariant probability density function (p.d.f.) $p_s(x)$, under rather general conditions the following relation holds (see Appendix A):
$$R^{kj}_t \equiv \frac{\overline{\delta x^{(k)}_t}}{\delta x^{(j)}_0} = -\left\langle x^{(k)}_t \left. \frac{\partial \ln p_s(x)}{\partial x^{(j)}} \right|_{x_0} \right\rangle\,, \qquad (3)$$
where the average $\langle \cdot \rangle$ is computed on the two-times joint p.d.f. $p^{(2)}_s(x_t, x_0)$. $R_t$ is the matrix of the linear response functions (at time $t$) of the considered system.
Eq. (3) shows the existence of a rigorous link between responses and correlations, provided that either the functional form of $p_s(x)$ is known, or it can be inferred from data. Of course, in general the latter will be a rather non-trivial task, at least in high-dimensional systems.

III. LINEAR MARKOV SYSTEMS
In this Section we will limit ourselves to the study of linear stochastic processes of the form
$$x_{t+1} = A x_t + B \eta_t\,, \qquad (4)$$
where $A$ and $B$ are constant $n \times n$ matrices and the components of $\eta_t$ are independent and identically distributed random variables with zero mean and unit variance. The spectral radius of $A$ needs to be less than 1, in order for the dynamics not to diverge with time. In this case one has $\langle x^{(i)} \rangle = 0$ $\forall i$. The following relation holds between the response matrix and the covariance matrix $C_t$, with entries $C^{kj}_t = \langle x^{(k)}_t x^{(j)}_0 \rangle$ (see Appendix B for details):
$$R_t = C_t\, C_0^{-1}\,. \qquad (5)$$
This result can be shown to hold also in cases with continuous time.
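For illustration, a minimal numerical sketch of Eq. (5) in practice (not part of the original analysis; the matrix $A$ and all parameters are arbitrary choices): simulate the linear Markov process, estimate the lagged covariances from a single long trajectory, and compare $C_t C_0^{-1}$ with the exact response $R_t = A^t$.

    # Sketch: response from correlations for x_{t+1} = A x_t + B eta_t.
    import numpy as np

    rng = np.random.default_rng(0)
    n, T = 3, 100_000
    A = np.array([[0.5, 0.3, 0.0],
                  [0.0, 0.4, 0.2],
                  [0.1, 0.0, 0.6]])   # arbitrary choice, spectral radius < 1
    B = np.eye(n)

    x = np.zeros((T, n))
    for t in range(T - 1):
        x[t + 1] = A @ x[t] + B @ rng.standard_normal(n)

    def corr(x, lag):
        """Lagged covariance C_lag = <x_{t+lag} x_t^T> estimated from data."""
        return (x[lag:].T @ x[:len(x) - lag]) / (len(x) - lag)

    C0 = corr(x, 0)
    for t in (1, 2, 3):
        R_est = corr(x, t) @ np.linalg.inv(C0)      # Eq. (5)
        exact = np.linalg.matrix_power(A, t)        # R_t = A^t
        print(t, np.max(np.abs(R_est - exact)))     # small sampling error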
Following the idea of the Green-Kubo formula, which allows one to understand the average effect of an electric field on the current in terms of correlations [29,30], a cumulative "degree of causation" $x^{(j)} \to x^{(k)}$ can be introduced:
$$D_{j \to k} = \sum_{t=1}^{\infty} R^{kj}_t\,. \qquad (6)$$
This quantity characterizes the cumulative effect of the perturbation $\delta x^{(j)}_0$ on the variable $x^{(k)}$. In linear systems with discrete time, from the relation $R_t = A^t$ (see Appendix B) it follows that
$$D_{j \to k} = \left[ (I_n - A)^{-1} A \right]_{kj}\,, \qquad (7)$$
$I_n$ being the $n \times n$ identity matrix. Let us stress that a vanishing value of $D_{j \to k}$ does not exclude causation between $x^{(j)}$ and $x^{(k)}$; indeed, since $R^{kj}_t$ can assume both positive and negative values, contributions with opposite signs in the sum (6) might eventually compensate and give a null result even in the presence of a causal link.
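A quick numerical check of Eq. (7) (same arbitrary matrix as in the previous sketch): the truncated sum of responses converges to the closed form.

    # Sketch: D_{j->k} = sum_{t>=1} [A^t]_{kj} = [(I_n - A)^{-1} A]_{kj}.
    import numpy as np

    A = np.array([[0.5, 0.3, 0.0],
                  [0.0, 0.4, 0.2],
                  [0.1, 0.0, 0.6]])
    n = A.shape[0]

    D_closed = np.linalg.inv(np.eye(n) - A) @ A                        # Eq. (7)
    D_series = sum(np.linalg.matrix_power(A, t) for t in range(1, 200))
    print(np.max(np.abs(D_closed - D_series)))   # ~ 0 up to truncation error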

A. Interventional and observational causation
Let us briefly discuss an important difference between the FR formalism and the other traditional approaches to the study of causation. The formalism of response, as well as Pearl's probabilistic interventional approach [2,28], focuses on the effect of an active perturbation of the considered system, which is a typical physical procedure in experimental practice. In contrast, GC and TE pertain mainly to the observational approach, as they are related to the information exchange between degrees of freedom. As mentioned in the Introduction, the intrinsic statistical fluctuations of the observed variables are not crucial to establish their cause-effect relation from a physical, interventional perspective, because they are not related to the active perturbation of the system and its effects. On the other hand, such fluctuations play a relevant role in the information-based, observational approach, since they contribute to determining the statistics of the observed quantities, and this is relevant to our ability to make predictions.
To show the above point, let us consider model (4) with
$$A = \frac{\sqrt{2}}{2}\begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}\,, \qquad B = \begin{pmatrix} \sqrt{D_1} & 0 \\ 0 & \sqrt{D_2} \end{pmatrix}\,. \qquad (8)$$
The response function $R^{12}_{t=1} = A_{12}$ is equal to $\sqrt{2}/2$ and is independent of $D_1$ and $D_2$, as expected. Indeed, the amplitudes of the noise terms should not play any role in the cause-effect relations, from a physical perspective.
For direct comparison, let us now compute the GC and the TE for the same model. Suppose it generates a long time-series $\{x^{(1)}_t, x^{(2)}_t\}$: the evaluation of the observational causal link between $x^{(2)}$ and $x^{(1)}$ with the GC test requires finding the best approximation of $\{x^{(1)}_t\}$ by the two alternative models
$$x^{(1)}_{t+1} = \alpha_1 x^{(1)}_t + \sqrt{\Delta_1}\,\eta_t \qquad (9)$$
and
$$x^{(1)}_{t+1} = \alpha_2 x^{(1)}_t + \beta_2 x^{(2)}_t + \sqrt{\Delta_2}\,\eta_t\,, \qquad (10)$$
where the coefficients $(\alpha_1, \Delta_1)$ and $(\alpha_2, \beta_2, \Delta_2)$ need to be optimally adjusted. Once they are known, the quantity
$$GC_{x^{(2)} \to x^{(1)}} = \ln\left(\frac{\Delta_1}{\Delta_2}\right) \qquad (11)$$
provides a measure of the increment in the predictability of $x^{(1)}$ when also the trajectory of $x^{(2)}$ is taken into account. In Appendix C, we compute $\Delta_1$ and $\Delta_2$ explicitly for model (8), finding the final result
$$GC_{x^{(2)} \to x^{(1)}} = \ln\left(\frac{1 + 4r + 2r^2}{1 + 3r}\right)\,, \qquad (12)$$
where $r = D_2/D_1$.

Likewise, we can derive analytically the TE for model (8). In this case, we need to evaluate the following expression:
$$TE_{x^{(2)} \to x^{(1)}} = \left\langle \ln \frac{p(x_{t+1} | x_t, y_t)}{p(x_{t+1} | x_t)} \right\rangle\,, \qquad (13)$$
where the average is taken over the joint distribution $p(x_{t+1}, x_t, y_t)$, with $x \equiv x^{(1)}$ and $y \equiv x^{(2)}$. In Appendix D we show that
$$TE_{x^{(2)} \to x^{(1)}} = \frac{1}{2}\ln\left(\frac{1 + 4r + 2r^2}{1 + 3r}\right)\,. \qquad (14)$$
The coincidence of the TE and GC expressions, but for a factor 1/2, is not incidental: indeed, the equivalence of the two quantities for linear regressive systems has been proved in Ref. [23]. Both TE and GC depend on the ratio $r = D_2/D_1$ of the noise amplitudes: as mentioned at the beginning of this Section, this is consistent with the fact that they are related to predictability rather than to mechanistic causality, in contrast with response.

Let us stress that also in the response-theory approach one may define an observational-like causation estimator, by rescaling correlations and responses with the standard deviations of the corresponding variables:
$$\widetilde{R}^{kj}_t = R^{kj}_t\,\sqrt{\frac{C^{jj}_0}{C^{kk}_0}}\,. \qquad (15)$$
Since the quantities $\widetilde{R}^{kj}_t$ are dimensionless, they can be used to compare the effect of different "causes" on a given variable. In the above discussed example, the rescaled response reads:
$$\widetilde{R}^{12}_1 = \frac{\sqrt{2}}{2}\sqrt{\frac{r}{1 + 3r}}\,. \qquad (16)$$
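The $r$-dependence of Eq. (12) can be verified by brute-force regression on simulated data; the following sketch assumes the matrices of Eq. (8) as written above and is not part of the original computation.

    # Sketch: Granger causality x2 -> x1 for model (8) as a function of r.
    import numpy as np

    rng = np.random.default_rng(1)
    s = np.sqrt(2) / 2
    A = np.array([[s, s], [0.0, s]])
    D1, T = 1.0, 200_000
    for r in (0.1, 1.0, 10.0):
        B = np.diag([np.sqrt(D1), np.sqrt(r * D1)])
        x = np.zeros((T, 2))
        for t in range(T - 1):
            x[t + 1] = A @ x[t] + B @ rng.standard_normal(2)
        x1, x2, y = x[:-1, 0], x[:-1, 1], x[1:, 0]
        # restricted model (9): regress x1_{t+1} on x1_t only
        a1 = (y @ x1) / (x1 @ x1)
        Delta1 = np.mean((y - a1 * x1) ** 2)
        # full model (10): regress on both variables
        X = np.stack([x1, x2], axis=1)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        Delta2 = np.mean((y - X @ coef) ** 2)
        GC_num = np.log(Delta1 / Delta2)
        GC_th = np.log((1 + 4 * r + 2 * r ** 2) / (1 + 3 * r))   # Eq. (12)
        print(f"r={r}: GC = {GC_num:.3f} (theory {GC_th:.3f})")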

B. Linear response and correlations
To better understand the role of response in determining non-trivial causal links, let us examine a typical toy model in which the analysis of correlations may lead to wrong conclusions. We consider a 3-dimensional vector $x = (x, y, z)$, whose evolution is ruled by a Gaussian, linear stochastic dynamics at discrete times:
$$\begin{cases} x_{t+1} = a\, x_t + \varepsilon\, y_t + \eta^{(x)}_t \\ y_{t+1} = b\, x_t + a\, y_t + \eta^{(y)}_t \\ z_{t+1} = b\, x_t + a\, z_t + \eta^{(z)}_t \end{cases} \qquad (17)$$
where $\eta^{(x)}, \eta^{(y)}, \eta^{(z)}$ are independent Gaussian processes with zero mean and unit variance, while $a$, $\varepsilon$ and $b$ are constant parameters. The situation is graphically represented in Fig. 1(a). The case $\varepsilon = 0$ is a minimal example in which the behavior of two quantities, $y$ and $z$, is influenced by a common-causal variable $x$; as a consequence, $y$ and $z$ are correlated even though they are not in a causal relationship (black graph in the inset of Fig. 1(b)). The same mechanism may be identified in many situations in which surprising functional dependences arise, such as that between the number of Nobel laureates of a country and its chocolate consumption per year [31]: in this specific case, both quantities may be expected to be influenced by the gross domestic product of the nation.
According to our definition, in order to decide whether there is a causal relation between $y$ and $z$, one has to perturb $y$ at time 0 and measure the average variation $\overline{\delta z_t}$ for $t > 0$. Let us briefly comment on the optimal choice for the intensity of the perturbation. As a general rule, $\delta y$ should be small with respect to the typical values of the variable $y$, since linear response theory requires an expansion for small values of $\delta y$ (see Appendix A); on the other hand, if $\delta y$ is too small, a large number of experiments will be needed to get reliable averages over the stochastic realizations of the noise. Here and in the following examples, we took $\delta y \sim O(10^{-2})$; however, since the dynamics of this example is linear, the results of Appendix A are exact and there is actually no need to choose $\delta y$ small.
The result for $\varepsilon = 0$ is shown in Fig. 1(b), black curve: not surprisingly, $R^{zy}_t = 0$ for all $t > 0$. The situation completely changes if we introduce a small feedback $\varepsilon \neq 0$ from $y$ to $x$, which will eventually result in a causal link between $y$ and $z$. As Fig. 1(b) shows, the corresponding response function correctly reveals that the behavior of $z$ starts to be influenced by a perturbation of $y$ after $t = 2$ time steps, and that the intensity of such causal influence roughly scales with $\varepsilon$.
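The perturbation experiment can be sketched as follows, assuming the coupling structure of Eq. (17) as reconstructed above (parameter values are illustrative): a perturbed and an unperturbed copy of the system are evolved with the same noise realization, and $\overline{\delta z_t}/\delta y$ is averaged over many experiments. For a linear model the difference between the two copies actually evolves deterministically, so a single experiment would suffice; the average is kept to mirror the general protocol.

    # Sketch: measuring R^{zy}_t by direct perturbation of y at time 0.
    import numpy as np

    def step(v, a, b, eps, eta):
        x, y, z = v
        return np.array([a * x + eps * y + eta[0],
                         b * x + a * y + eta[1],
                         b * x + a * z + eta[2]])

    rng = np.random.default_rng(2)
    a, b, eps, dy = 0.5, 0.4, 0.2, 1e-2
    n_exp, horizon = 100, 6
    dz = np.zeros(horizon)
    for _ in range(n_exp):
        v = rng.standard_normal(3)          # initial state (for linear dynamics
        v_p = v + np.array([0.0, dy, 0.0])  # the response does not depend on it)
        for t in range(horizon):
            eta = rng.standard_normal(3)    # SAME noise for both copies
            v, v_p = step(v, a, b, eps, eta), step(v_p, a, b, eps, eta)
            dz[t] += (v_p[2] - v[2]) / dy
    print(dz / n_exp)   # entry t-1 is R^{zy}_t: zero at t=1, ~ b*eps at t=2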
None of these conclusions could have been drawn from the mere analysis of the correlation functions, reported in the inset of Fig. 1(b). However, for linear Markov systems, formula (5) allows the response function to be found by simple operations on the covariance matrix, i.e. by a suitable manipulation of time correlations.
It can be shown [23,32,33] that in linear systems also GC, TE and related quantities can eventually be reduced to functions of correlations, but in general their derivation may be much more involved than that based on response theory. In studying the above example, an important caveat has to be borne in mind: when dealing with more than two variables, in order to get insightful results, we need to use conditional GC and TE [14]. This fact can be understood by looking at the causal link between $y$ and $z$ with a time-lag of 1 step, which is expected to be null from a physical perspective, since no action on $y_t$ will have consequences on $z_{t+1}$ in our model. The "naive" TE will be in general different from zero, because the knowledge of $y_t$ provides indirect information about $x_t$ (the two variables are not independent), and the possibility to forecast the value of $z_{t+1}$ is improved. The problem is solved by considering the conditional TE
$$TE_{y \to z | x} = \left\langle \ln \frac{p(z_{t+1} | z_t, x_t, y_t)}{p(z_{t+1} | z_t, x_t)} \right\rangle\,; \qquad (19)$$
in this case, the conditional probabilities at the numerator and denominator are equal; in fact the knowledge of $y_t$ does not provide additional information about $x_t$, which is already known. Similar considerations hold for the GC analysis. However, let us stress that the FR formalism provides a handy method to deal with many variables at the same time, as in the linear case the problem reduces to the computation of 1-step correlations and matrix operations. The TE approach, instead, requires the evaluation from data of conditional probabilities such as those appearing in Eq. (19), which may be a non-trivial task as soon as the number of conditioning variables is larger than 1 or 2.
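For linear Gaussian systems, the conditional TE of Eq. (19) reduces to one half the log-ratio of regression residual variances, which makes a numerical estimate straightforward. A minimal sketch (function names hypothetical; inputs are zero-mean time series):

    # Sketch: conditional transfer entropy for Gaussian processes,
    # TE_{y->z|x} = 0.5 * ln[ Var(z_{t+1}|z_t,x_t) / Var(z_{t+1}|z_t,x_t,y_t) ].
    import numpy as np

    def resid_var(target, *regressors):
        """Residual variance of the least-squares regression of target."""
        X = np.stack(regressors, axis=1)
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        return np.mean((target - X @ coef) ** 2)

    def cond_te(z_next, z, x, y):
        return 0.5 * np.log(resid_var(z_next, z, x)
                            / resid_var(z_next, z, x, y))

    # usage on trajectories of model (17):
    # cond_te(z[1:], z[:-1], x[:-1], y[:-1])  ->  ~ 0 at lag 1, as argued above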

C. "Direct" causation and modeling via response theory
A typical problem in the study of a complex system is that of inferring the strength of its links, assuming that the dynamics is of the form (4); in other terms, one may be interested in inferring the matrix $A$ from the analysis of long time series $\{x^{(i)}_t\}$, $i = 1, \ldots, n$, $t = 1, 2, \ldots, T \gg 1$. A situation of this kind is usually faced, e.g., in the study of complex proteins [34,35]. In these cases one is mostly interested in the "direct" causation links between the variables, which allow one to understand the structure of the system and the matrix $A$ [20]; this can be done again by means of response theory, which relates the response function to the propagator of the dynamics. In particular, by recalling that $R_t$ and $A$ are simply related by $R_t = A^t$, one has that $A = R_1$. An example is shown in Fig. 2; the matrix $A$ which rules the dynamics is graphically represented in panel (a). In panels (b)-(e) the matrix $R_t$ is shown as reconstructed from time correlations, for different values of $t$. As expected, for $t = 1$ the response matrix equals $A$, and it is possible to infer all (oriented) causal links. For $t > 1$, $R^{kj}_t$ provides information on the indirect influence of $x^{(j)}$ on $x^{(k)}$, i.e. including effects which would not have been present in a system composed by $x^{(j)}$ and $x^{(k)}$ only.
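A minimal sketch of this reconstruction on a synthetic directed chain (the network, parameters and threshold here are illustrative, not those of Fig. 2):

    # Sketch: direct links from data via A = R_1 = C_1 C_0^{-1}.
    import numpy as np

    rng = np.random.default_rng(3)
    n, T = 5, 100_000
    A = 0.3 * np.eye(n)
    for i in range(n - 1):
        A[i + 1, i] = 0.4          # oriented chain 1 -> 2 -> ... -> 5

    x = np.zeros((T, n))
    for t in range(T - 1):
        x[t + 1] = A @ x[t] + rng.standard_normal(n)

    C0 = (x[:-1].T @ x[:-1]) / (T - 1)
    C1 = (x[1:].T @ x[:-1]) / (T - 1)
    A_est = C1 @ np.linalg.inv(C0)
    print(np.round(A_est, 2))                  # close to A
    print((np.abs(A_est) > 0.1).astype(int))   # inferred oriented adjacency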
However, the response formalism is able to give, with minimal effort, much more information on the studied system. In particular, it is especially suitable to determine in a rather simple way also "indirect" causation. It is quite natural to say that there exists an indirect causation relationship $x^{(j)} \to x^{(k)}$ if there exists an oriented path on the graph connecting $j$ with $k$, i.e. there is (at least) a sequence $(i_1, i_2, \ldots, i_{m-1})$ of length $m - 1$ such that
$$A_{i_1 j}\, A_{i_2 i_1} \cdots A_{k\, i_{m-1}} \neq 0\,. \qquad (20)$$
From the time series $\{x^{(i)}_t\}$, $i = 1, \ldots, n$, we can compute the correlation functions and, using Eq. (5), the response matrix. The entries $R^{kj}_t$ allow the understanding of the structure of the graph (i.e. the matrix $A$) and of the causation relationship $x^{(j)} \to x^{(k)}$. If $R^{kj}_t = 0$ for any $t > 0$, the causation link is missing, whereas if $R^{kj}_t = 0$ for $t \le m - 1$ and $R^{kj}_t \neq 0$ for $t \ge m$, this means that there exists at least one path of length $m$ connecting $j$ with $k$. Fig. 3 reports two examples of response functions ($\widetilde{R}^{16,1}_t$ and $\widetilde{R}^{16,5}_t$) for the model described in Fig. 2. It can be verified that, in both cases, the first non-zero value of the response is obtained after a number of time steps equal to the length of the minimum oriented path connecting the considered variables. Again, the relative effect of the variables $x^{(1)}$ and $x^{(5)}$ on $x^{(16)}$ could not have been simply deduced from the correlation functions, reported in the inset of Fig. 3.
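The relation between the first non-vanishing response time and the length of the shortest oriented path can be checked directly on any matrix $A$, e.g. on the directed chain of the previous sketch:

    # Sketch: [A^t]_{kj} is zero until t equals the shortest oriented
    # path length from j to k, and non-zero from then on.
    import numpy as np

    n = 5
    A = 0.3 * np.eye(n)
    for i in range(n - 1):
        A[i + 1, i] = 0.4          # shortest path 1 -> 5 has length 4

    j, k = 0, 4
    for t in range(1, 7):
        print(t, np.linalg.matrix_power(A, t)[k, j])   # zero for t < 4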
Let us just mention that the same reasoning can be easily extended to stochastic processes with continuous time of the form
$$\dot{x} = -F x + B \xi\,, \qquad (21)$$
where $F$ and $B$ are $n \times n$ matrices, the eigenvalues of $F$ have positive real part and $\xi$ is an $n$-dimensional, delta-correlated normalized Gaussian noise. In this case it can be shown [27] that $R_t = \exp(-F t)$, so that inferring $F$ from the study of the response functions is again possible, either by considering the matrix $I_n - R_t$ for $t \to 0$, where $I_n$ denotes the $n \times n$ identity matrix, or by the continuous-time version of Eq. (7),
$$D_{j \to k} = \int_0^{\infty} R^{kj}_t\, dt = \left[ F^{-1} \right]_{kj}\,. \qquad (22)$$
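A short check of Eq. (22) (the matrix $F$ is an arbitrary choice with eigenvalues of positive real part):

    # Sketch: int_0^inf exp(-F t) dt = F^{-1}, verified by a Riemann sum.
    import numpy as np
    from scipy.linalg import expm

    F = np.array([[1.0, 0.3],
                  [0.0, 0.8]])
    dt, T = 1e-2, 40.0
    D_num = sum(expm(-F * t) for t in np.arange(0.0, T, dt)) * dt
    print(D_num)
    print(np.linalg.inv(F))    # the two should agree up to O(dt)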

IV. TACKLING THE GENERAL PROBLEM
In this Section we discuss the difficulties encountered when trying to infer causal relations in more general situations, i.e. in non-linear systems and in cases where not all relevant variables are accessible. While in the former case the FR theory is still applicable in principle, and linear approximations provide quite good results, in the latter the lack of information is a major obstacle to the understanding of the causal links.

A. Non-linear systems
As an example of non-linear dynamics, let us now consider a system composed of three interacting particles in one dimension, moving under the action of an external non-harmonic potential. We assume an overdamped dynamics, so that the state of the system $x = (x, y, z)$ evolves as
$$\begin{cases} \dot{x} = -U'(x) - k(x - y) + \sqrt{2b}\,\xi^{(x)} \\ \dot{y} = -U'(y) - k(2y - x - z) + \sqrt{2b}\,\xi^{(y)} \\ \dot{z} = -U'(z) - k(z - y) + \sqrt{2b}\,\xi^{(z)} \end{cases} \qquad (23)$$
with
$$U(u) = (1 - r)\frac{u^2}{2} + r\frac{u^4}{4}\,, \qquad (24)$$
where $k$ and $b$ are constants, $\xi$ is a delta-correlated Gaussian noise and $r$ is a parameter which determines the degree of nonlinearity of the dynamics: when $r = 0$ the external potential $U$ is harmonic, while for $r > 1$ it takes a double-well shape. We are interested in studying how accurate Eq. (5) remains when applied to this non-linear dynamics. Fig. 4 shows that if the nonlinear contribution to the dynamics is small enough, the "linearized" response (5) still gives meaningful information about the causal relations between the variables of the system. In particular, Fig. 4(d) reports the relative error that one makes by computing $D_{x \to y}$, defined by Eq. (6), with the linear approximation Eq. (5). We observe that the error remains rather bounded even for $r \sim O(1)$, when the joint p.d.f. is quite far from a multivariate Gaussian.

This fact has a quite clear mathematical interpretation. To show it, we consider a system described by the time-dependent vector $x$, ruled by some unknown stochastic dynamics. The system is initially in the state $x_0$, and the dynamics will evolve it to some other state $x_t$ after a time interval $t$, where $x_t$ will, in general, depend both on the initial condition and on the particular realization of the stochastic noise. If we repeat this kind of observation many times along a trajectory, assuming that the dynamics of the considered system is ergodic, we can collect many pairs $(x_0, x_t)_i$. The best linear approximation to predict $x_t$ from $x_0$ will be of the form
$$x_t = L_t x_0 + \zeta_t\,, \qquad (25)$$
where $\zeta_t$ is a vector of random variables with zero mean, independent of $x_0$. The structure of Eq. (25) is the same as that of Eq. (B4). Reasoning as in Appendix B, one finds the linear regression formula (see also Ref. [23])
$$L_t = C_t\, C_0^{-1}\,. \qquad (26)$$
As a consequence, the $R_t$ matrix that one might compute in nonlinear systems by using the "wrong" relation (5) is actually the response associated to the process (25), which is the best linear approximation of the considered transformation $x_0 \to x_t$. Let us notice that, for this result to hold, we do not have to assume any particular dependence of $L_t$ on time.
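A sketch of the numerical experiment of Fig. 4, under the dynamics (23)-(24) as reconstructed above (all parameter values are illustrative): the overdamped system is integrated with the Euler-Maruyama scheme and the "wrong" relation (5) is applied to the sampled covariances.

    # Sketch: linearized response of the nonlinear chain via Eq. (5).
    import numpy as np

    def drift(v, k, r):
        # -U'(u) with U(u) = (1-r) u^2/2 + r u^4/4, plus harmonic couplings
        x, y, z = v
        f = lambda u: -(1 - r) * u - r * u ** 3
        return np.array([f(x) - k * (x - y),
                         f(y) - k * (2 * y - x - z),
                         f(z) - k * (z - y)])

    rng = np.random.default_rng(4)
    k, r, b, dt, T = 0.5, 0.5, 0.1, 1e-2, 200_000
    v = np.zeros(3)
    traj = np.empty((T, 3))
    for t in range(T):
        v = v + drift(v, k, r) * dt \
              + np.sqrt(2 * b * dt) * rng.standard_normal(3)
        traj[t] = v

    lag = 100                                  # response at time lag*dt = 1
    C0 = traj.T @ traj / T
    Ct = traj[lag:].T @ traj[:-lag] / (T - lag)
    print((Ct @ np.linalg.inv(C0))[1, 0])      # linearized estimate of R^{yx}_t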
B. Systems with hidden variables: failure of "embedding" strategies

Let us conclude by discussing the rather common situation in which we do not know the whole state vector $x$ of the system, but we only have access to the time series of two of its variables. In order to show the basic problems in inferring causation, let us refer again to the system (17), assuming that only the time series of $y$ and $z$ are available.
The first attempt to detect the $y$-$z$ causation can be to consider the reduced vector $\Gamma^{(2)}_t = (z_t, y_t)$, assuming that it properly describes the system, and to use formula (5) in this 2-dimensional space. This simple approach leads to wrong results: as shown in Fig. 5 (green circles), the computed "response" function is completely different from the real $R^{zy}_t$ (blue solid line). This is not surprising at all, since $\Gamma^{(2)}_t$ does not contain enough information about the state of the system, and therefore its dynamics is not Markov. A tempting strategy, inspired by Takens' "embedding" approach in the context of deterministic dynamical systems [37,38], suggests trying to reconstruct a vector which completely describes the state of the system, by exploiting the knowledge of past values of $y_t$ and $z_t$. Basically, the idea is to introduce the vector
$$\Gamma^{(2d)}_t = (z_t, z_{t-1}, \ldots, z_{t-d+1}, y_t, y_{t-1}, \ldots, y_{t-d+1}) \qquad (27)$$
and repeat the analysis on this $2d$-dimensional system for increasing values of $d$. For deterministic dynamical systems, if $d$ is large enough, the vector $\Gamma^{(2d)}_t$ can be proved to have an autonomous dynamics, so we might expect that in the context of stochastic processes it would follow a Markov evolution rule. If this were the case, we could apply formula (5) to $\Gamma^{(2d)}_t$ and infer all the causal links. Unfortunately, Fig. 5 shows in a rather convincing way that increasing the embedding dimension $d$ does not lead to any improvement: on the contrary, choosing $d > 1$ can even determine, as in the considered example, a worse estimation of the response function. Similar results would have been observed with different choices of the embedding protocol.
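The failure of the reduced description can be made quantitative without any simulation, by comparing the exact response of model (17) with the output of Eq. (5) restricted to the $(z, y)$ block (a sketch; parameters illustrative):

    # Sketch: Eq. (5) applied to the reduced vector (z, y) gives a wrong answer.
    import numpy as np

    a, b, eps = 0.5, 0.4, 0.2
    A = np.array([[a,  eps, 0.0],
                  [b,  a,   0.0],
                  [b,  0.0, a  ]])        # order (x, y, z), unit noise

    # exact stationary covariance: solve C0 = A C0 A^T + I (vectorized form)
    n = 3
    M = np.eye(n * n) - np.kron(A, A)
    C0 = np.linalg.solve(M, np.eye(n).flatten()).reshape(n, n)

    t = 3
    R_true = np.linalg.matrix_power(A, t)       # full response, R_t = A^t
    Ct = R_true @ C0                            # C_t = A^t C_0
    idx = [2, 1]                                # keep only (z, y)
    R_red = Ct[np.ix_(idx, idx)] @ np.linalg.inv(C0[np.ix_(idx, idx)])
    print(R_true[2, 1], R_red[0, 1])            # z <- y entries: they differ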
The embedding fails for generic random processes because, at variance with deterministic cases, the knowledge of previous values of certain observables is not equivalent, in general, to the knowledge of the entire state of the system. To clarify this point, let us consider a dynamical system $x_t$ composed of $n$ variables $(x^{(1)}, x^{(2)}, \ldots, x^{(n)})$, ruled by some autonomous dynamics in discrete time:
$$x_{t+1} = f(x_t)\,, \qquad (28)$$
where $f: \mathbb{R}^n \to \mathbb{R}^n$, and we know that there exists a unique solution at any time. It is quite obvious that the $n$-dimensional vector obtained with the embedding protocol,
$$\Gamma^{(n)}_t = (x^{(1)}_t, x^{(1)}_{t-1}, \ldots, x^{(1)}_{t-n+1})\,, \qquad (29)$$
gives as much information as the vector $x_t$, see e.g. Refs. [37,38].
Let us now consider a non-autonomous version of (28),
$$x_{t+1} = f(x_t) + g(t)\,, \qquad (30)$$
where $g: \mathbb{R} \to \mathbb{R}^n$ is a vector of $n$ periodic functions with period $T$. The system can be mapped into an autonomous one by introducing a new variable, say $w$, such that
$$w_t = t - T \left\lfloor \frac{t}{T} \right\rfloor\,, \qquad (31)$$
where $\lfloor y \rfloor$ stands for the integer part of $y$. With this definition, $w_t \in [0, T)$ and $g(t) = g(w_t)$, because of the periodicity. Similarly, if $g(t)$ is a linear combination of periodic functions with $k$ (incommensurable) periods $T_1, \ldots, T_k$, system (30) can be mapped into an autonomous system by introducing $k$ variables $w^{(1)}, \ldots, w^{(k)}$ of the form (31). Since a random term can be seen as the superposition of an infinite number of periodic functions with incommensurable frequencies, it is straightforward to understand that in a generic system perturbed by a random forcing, for any finite $d$, the vector $\Gamma^{(d)}_t$ cannot completely describe the state of the original system. In particular, no reliable information about the response function of the original system can be deduced by applying the fluctuation-response relation to it.
This implies that to infer causation from time correlations in a stochastic dynamics, we actually need to know the trajectories of all the variables which are relevant to the dynamics of $y$ and $z$.

V. CONCLUSIONS
Using some tools from the FR theory of out-of-equilibrium statistical mechanics, we have introduced a way of characterizing causation between two variables whose physical interpretation is rather straightforward. The basic idea of this proposal is that $x^{(j)}$ has a causal effect on $x^{(k)}$ after a time interval $\Delta t > 0$ if a perturbation of $x^{(j)}$ at time $t$ induces some change on $x^{(k)}$ at time $t + \Delta t$. In this sense, our definition is reminiscent of the interventional framework developed by Pearl, in which causation is detected by observing the effects of an action on the system. Other approaches to detect causation, such as those related to GC and TE, are based on the idea that causation is associated to information, i.e. $x^{(j)}$ has an effect on $x^{(k)}$ if the knowledge of $x^{(j)}$ helps the prediction of $x^{(k)}$. At first glance, the choice between observational and interventional approaches may seem only a matter of taste; instead, the two methods present important differences, at both the qualitative and quantitative level.
Bearing in mind the above definition, we have described a practical method to understand causal links between the variables of a system by looking at time series of data. Despite the (correct) common-wisdom statement that correlation does not imply causation, we have shown that, at least in multi-dimensional linear Markov processes, the presence/absence of causation between variables can be inferred by a proper use of (all) time correlations. The FR formalism can be used to find "direct" causal links between variables at a given time, and therefore to build linear models based on these findings, as well as to introduce a "degree of causation" cumulative in time. The physical interpretation of this indicator is quite natural and is reminiscent of the Green-Kubo formula for the electric (or thermal) conductivity.
From a computational point of view, the practical implementation of our method is quite easy, much simpler than GC and TE, whose application becomes elaborate in high-dimensional systems. For a generic nonlinear dynamics, even though an exact relation between response functions and certain correlators (whose specific shape depends on the invariant probability distribution) always exists, its explicit form may be very convoluted. However, we have shown that the protocol that holds for the linear case still represents a useful proxy in the presence of weak nonlinear terms.
Serious difficulties arise instead in the case of hidden variables, i.e. when the access to the vector $x$ describing the state of the system is only partial. The tempting idea of using an "embedding" methodology to reconstruct the proper complete phase space, at variance with deterministic systems, does not work, in general, for stochastic processes. Let us stress that this impossibility is not due to mere practical difficulties, such as the limited length of the time series or the high dimension of the system. It seems to us that the only possible way to understand causation from data is to guess the proper set of variables which describe, at least within a certain accuracy, the complete system according to a Markov rule. The above limitation is always present in any purely inductive approach, i.e. in all cases where, without a fair theoretical framework, one tries to infer the essence of a system (or to build an effective model) just from data. Caveats on this topic had already been expressed by Onsager and Machlup [39], and by Ma [40], in a rather vivid way; unfortunately, those wise warnings are often disregarded.

ACKNOWLEDGMENTS
The Authors thankfully acknowledge useful discussions with Erik Aurell. This work is part of MIUR-PRIN2017 Coarse-grained description for non-equilibrium systems and transport phenomena (CO-NEST), whose partial financial support is acknowledged.

Appendix B: Linear response of linear Markov processes

[...] through the linear combination $x^{(k)}_t - \sum_i [A^t]_{ki}\, x^{(i)}_0$. Integrating again by parts, this time with respect to $x^{(k)}_t$, one finally obtains the result (B6). Calling $R_t$ the matrix of linear responses with the choice $\mathcal{F}(x) = x$, and taking into account Eq. (B5), we recover the well-known formula valid for linear Markov systems at discrete times,
$$R_t = C_t\, C_0^{-1}\,, \qquad \text{(B7)}$$
where we have introduced the covariance matrix $C_t = \langle x_t x_0^T \rangle$. From Eq. (B7) it is now clear that in these systems one can observe non-vanishing responses from $x^{(j)}$ to $x^{(k)}$, for any possible choice of $\mathcal{F}(x^{(k)})$, only if $R^{kj}_t \neq 0$; therefore the knowledge of the matrix $R_t$ (i.e., the choice $\mathcal{F}(x) = x$) is sufficient to establish the causal links in a linear Markov dynamics.
Appendix C: Sketch of the computation of Eq. (12)

In this Appendix we sketch the computation leading to Eq. (12). First, by multiplying Eq. (4) by $x^T_{t+1}$ and by $x^T_t$ to the right, and taking averages on the stationary joint p.d.f., we get
$$C_0 = A C_0 A^T + B B^T\,, \qquad C_1 = A C_0\,. \qquad \text{(C1)}$$
For the simple model (8), by solving the above system one finds
$$C_0 = 2 D_1 \begin{pmatrix} 1 + 3r & r \\ r & r \end{pmatrix}\,, \qquad C_1 = \sqrt{2}\, D_1 \begin{pmatrix} 1 + 4r & 2r \\ r & r \end{pmatrix}\,, \qquad \text{(C2)}$$
where $r = D_2/D_1$. Now, we have to compute the amplitudes of the noises $\Delta_1$ and $\Delta_2$ in the two alternative models (9) and (10). $\Delta_1$ is given by a linear regression analysis: Eq. (9) yields
$$\alpha_1 = \frac{C^{11}_1}{C^{11}_0}\,, \qquad \text{(C3)}$$
i.e.
$$\Delta_1 = C^{11}_0 - \frac{\left(C^{11}_1\right)^2}{C^{11}_0} = D_1\, \frac{1 + 4r + 2r^2}{1 + 3r}\,. \qquad \text{(C4)}$$
Instead, $\Delta_2$ is clearly equal to $D_1$, since a similar regression analysis on model (10) shows that the best AR model coincides with the original dynamics. The above values of $\Delta_1$ and $\Delta_2$ lead to Eq. (12).
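As a consistency check, the residual variance (C4) can be verified numerically:

    # Quick check of Delta_1 = C0_11 - (C1_11)^2 / C0_11 against Eq. (C4).
    import numpy as np

    D1, r = 1.0, 0.7
    C0_11 = 2 * D1 * (1 + 3 * r)
    C1_11 = np.sqrt(2) * D1 * (1 + 4 * r)
    Delta1 = C0_11 - C1_11 ** 2 / C0_11
    print(Delta1, D1 * (1 + 4 * r + 2 * r ** 2) / (1 + 3 * r))   # identical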