Dynamical independence: discovering emergent macroscopic processes in complex dynamical systems

We introduce a notion of emergence for coarse-grained macroscopic variables associated with highly-multivariate microscopic dynamical processes, in the context of a coupled dynamical environment. Dynamical independence instantiates the intuition of an emergent macroscopic process as one possessing the characteristics of a dynamical system "in its own right", with its own dynamical laws distinct from those of the underlying microscopic dynamics. We quantify (departure from) dynamical independence by a transformation-invariant Shannon information-based measure of dynamical dependence. We emphasise the data-driven discovery of dynamically-independent macroscopic variables, and introduce the idea of a multiscale "emergence portrait" for complex systems. We show how dynamical dependence may be computed explicitly for linear systems via state-space modelling, in both time and frequency domains, facilitating discovery of emergent phenomena at all spatiotemporal scales. We discuss application of the state-space operationalisation to inference of the emergence portrait for neural systems from neurophysiological time-series data. We also examine dynamical independence for discrete- and continuous-time deterministic dynamics, with potential application to Hamiltonian mechanics and classical complex systems such as flocking and cellular automata.


Introduction
When we observe a large murmuration of starlings twisting, stretching and wheeling in the dusk, it is hard to escape the impression that we are witnessing an individuated dynamical entity quite distinct from the thousands of individual birds which we know to constitute the flock. The singular dynamics of the murmuration as a whole, it seems, in some sense "emerges" from the collective behaviour of its constituents (Cavagna et al., 2013). Analogously, the gliders and particles observed in some cellular automata appear to emerge as distinct and distinctive dynamical entities from the collective interactions between cells (Bays, 2009). In both cases, these emergent phenomena reveal dynamical structure at coarser "macroscopic" scales than the "microscopic" scale of interactivity between individual components of the system: structure which is not readily apparent from the microscopic perspective. Frequently, dynamical interactions at the microscopic level are reasonably simple and/or well-understood; yet an appropriate macroscopic perspective reveals dynamics that do not flow transparently from the micro-level interactions, and, furthermore, appear to be governed by laws quite distinct from the microscopic dynamics. Emergence, it seems, proffers a window into inherent parsimonious structure, across spatiotemporal scales, for a class of complex systems.
In both of the above examples emergent structure "jumps out at us" visually. But this need not be the case, and in general may not be the case. For example, directly observing the population activity of large numbers of cortical neurons [e.g., via calcium or optogenetic imaging (Weisenburger et al., 2019)] may not reveal any visually obvious macroscopic patterning (besides phenomena such as widespread synchrony), even though this activity underlies complex organism-level cognition and behaviour. Even in flocking starlings, while distal visual observation manifestly reveals emergent macroscopic structure, could there be additional emergent structure that would only be apparent from a very different, possibly non-visual, perspective?
In this paper, we address two key questions regarding emergent properties in complex dynamical systems:

• How may we characterise those perspectives which reveal emergent dynamical structure? (1.1a)

• Knowing the microscopic dynamics, how may we find these revealing perspectives? (1.1b)

By providing principled data-driven methods for identifying emergent structure across spatiotemporal scales, we hope to enable new insights into many complex systems, from brains to ecologies to societies.

Emergence and dynamical independence
Emergence is broadly understood as a gross (macroscopic) property of a system of interacting elements, which is not a property of the individual (microscopic) elements themselves. A distinction is commonly drawn between "strong" and "weak" emergence (Bedau, 1997). A strongly-emergent macroscopic property (i) is in principle not deducible from its microscopic components, and (ii) has irreducible causal power over these components. This flavour of emergence appears to reject mechanistic explanations altogether, and raises awkward metaphysical issues about causality, such as how to resolve competition between competing micro- and macro-level "downward" causes (Kim, 2006). By contrast, Bedau (1997, p. 375) characterises emergent phenomena as "somehow constituted by, and generated from, underlying processes", while at the same time "somehow autonomous from underlying processes". He goes on to define a process as weakly emergent iff it "can be derived from [micro-level dynamics and] external conditions but only by simulation". Weakly emergent properties are therefore ontologically reducible to their microscopic causes, though they remain epistemically opaque from these causes.
We propose a new notion and measure of emergence inspired by Bedau's formulation of weak emergence. Our notion, dynamical independence, shares with weak emergence the aim to capture the sense in which a flock of starlings seems to have a "life of its own" distinct from the microscopic process (interactions among individual birds), even though there is no mystery that the flock is in fact constituted by the birds. Following Bedau's formulation, a dynamically-independent macroscopic process is "ontologically reducible" to its microscopic causes, and downward (physical) causality is precluded. However, dynamically-independent macroscopic processes may display varying degrees of "epistemic opacity" from their microscopic causes, loosening the constraint that (weak) emergence relations can only be understood through exhaustive simulation.
Dynamical independence is defined for macroscopic dynamical phenomena associated with a microscopic dynamical system (macroscopic variables) which supervene (Davidson, 1980) on the microscopic. Here, supervenience of macro on micro is operationalised in a looser predictive sense: that macroscopic variables convey no information about their own evolution in time beyond that conveyed by the microscopic dynamics (and possibly a coupled environment). The paradigmatic example of such macroscopic variables, and one which we mostly confine ourselves to in this study, is represented by the coarse-graining of the microscopic system by aggregation of microscopic components at some characteristic scale. Dynamical independence is framed in predictive terms: a macroscopic variable is defined to be dynamically-independent if, even while supervenient on the microscopic process, knowledge of the microscopic process adds nothing to prediction of the macroscopic process beyond the extent to which the macroscopic process already self-predicts. (This should not be taken to imply, however, that a dynamically-independent process need self-predict well, if indeed at all; see discussion in Section 5.)
To bolster intuition, consider a large group of particles, such as a galaxy of stars. The system state is described by the ensemble of position and momentum vectors of the individual stars in some inertial coordinate system, and the dynamics by Newtonian gravitation. We may construct a low-dimensional coarse-grained macroscopic variable by taking the average position and total momentum (and, if we like, also the angular momentum and total energy) of the stars in the galaxy. Elementary physics tells us that this macroscopic variable in fact self-predicts perfectly without any knowledge of the detailed microscopic state; it has a "life of its own", perfectly understandable without recourse to the microscopic level. Yet an arbitrarily-concocted coarse-graining (i.e., an arbitrary mapping of the microscopic state space to a lower-dimensional space) will almost certainly not have this property: indeed, the vast majority of coarse-grainings do not define dynamically-independent processes. Dynamical independence is defined over the range of scales from microscopic, through mesoscopic to macroscopic. It is expressed, and quantified, solely in terms of Shannon (conditional) mutual information (Cover and Thomas, 1991), and as such is fully transformation-invariant; that is, for a physical process it yields the same quantitative answers no matter how the process is measured. Under some circumstances, it may be defined in the frequency domain, thus enabling analysis of emergence across temporal scales. It applies in principle to a broad range of dynamical systems, continuous and discrete in time and/or state, deterministic and stochastic. Examples of interest include Hamiltonian dynamics, linear stochastic systems, neural systems, cellular automata, flocking, econometric processes, and evolutionary processes.
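The conservation argument above is easy to check numerically. The following minimal sketch (a hypothetical system of unit-mass particles with a softened pairwise attraction; all parameter values are illustrative, not taken from any particular simulation in the literature) integrates a small self-gravitating ensemble and confirms that total momentum, a coarse-grained macroscopic variable, evolves autonomously: because pairwise forces obey Newton's third law, it is conserved exactly, whatever the microscopic configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50                                      # number of "stars" (unit masses)
pos = rng.normal(size=(n, 2))
mom = rng.normal(size=(n, 2))

def pairwise_forces(pos):
    # Softened inverse-square attraction; f_ij = -f_ji (Newton's third law),
    # so the forces sum to zero and total momentum is conserved exactly.
    diff = pos[:, None, :] - pos[None, :, :]       # r_i - r_j
    d2 = (diff ** 2).sum(axis=-1) + 1e-2           # softening avoids blow-up
    return -(diff / d2[..., None] ** 1.5).sum(axis=1)

dt = 1e-3
total_p0 = mom.sum(axis=0).copy()                  # macroscopic variable at t = 0
for _ in range(200):                               # symplectic Euler steps
    mom += dt * pairwise_forces(pos)
    pos += dt * mom

conserved = np.allclose(mom.sum(axis=0), total_p0)
print(conserved)
```

By contrast, an arbitrary coarse-graining of the same system (say, the summed momentum of a random half of the stars) would not self-predict in this way.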
As previously indicated, our specific aims are (1.1a) to quantify the degree of dynamical independence of a macroscopic variable, and (1.1b) given the micro-level dynamics, to discover dynamically-independent macroscopic variables. In the current article we address these aims primarily for stochastic processes in discrete time, and analyse in detail the important and non-trivial case of stationary linear systems.
The article is organised as follows: in Section 2 we set out our approach. We present the formal underpinnings of dynamical systems, macroscopic variables and coarse-graining, the information-theoretic operationalisation of dynamical independence, its quantification and its properties. We declare an ansatz on a practical approach to our primary objectives (1.1). In Section 3 we specialise to linear stochastic systems in discrete time, and analyse in depth how our ansatz may be achieved for linear state-space systems; in particular, we detail how dynamically-independent macroscopic variables may be discovered in state-space systems via numerical optimisation, and present a worked example illustrating the procedure. In Section 4 we discuss approaches to dynamical independence for deterministic and continuous-time systems. In Section 5 we summarise our findings, discuss related approaches in the literature and examine some potential applications in neuroscience.

Dynamical systems
Our notion of dynamical independence applies to dynamical systems. We describe a dynamical system by a sequence of variables S_t taking values in some state space S at times indexed by t (when considered as a whole, we sometimes drop the time index and write just S to denote the sequence {S_t}). In full generality, the state space might be discrete or real-valued, and possibly endowed with further structure (e.g., topological, metric, linear, etc.). The sequential time index may be discrete or continuous. The dynamical law, governing how S_t evolves over time, specifies the system state at time t given the history of states at prior times t' < t; this specification may be deterministic or probabilistic. In this study, we largely confine our attention to discrete-time stochastic processes, where the S_t, t ∈ Z, are jointly-distributed random variables. The distribution of S_t is thus contingent on previously-instantiated historical states; that is, on the set s_t^- = {s_{t'} : t' < t}, given that S_{t'} = s_{t'} for t' < t (throughout this article we use a superscript dash to denote sets of prior states). In Section 4, we discuss extension of dynamical independence to deterministic and/or continuous-time systems.
The general scenario we address is a "dynamical universe" that may be partitioned into a "microscopic" dynamical system of interest X_t coupled to a dynamical environment E_t, where X_t and E_t are jointly-stochastic variables taking values in state spaces X and E respectively. Typically, the microscopic state will be the high-dimensional ensemble state of a large number of atomic elements, e.g., birds in a flock, molecules in a gas, cells in a cellular automaton, neurons in a neural system, etc. The microscopic and environmental processes jointly constitute a dynamical system (X_t, E_t); that is, the dynamical laws governing the evolution in time of microscopic and environmental variables will depend on their joint history. Here we assume that the system/environment boundary is given [see Krakauer et al. (2020) for an approach to the challenge of distinguishing systems from their environments].

Macroscopic variables and coarse-graining
Given a microscopic dynamical system X_t and environment E_t, we associate an emergent phenomenon explicitly with some "macroscopic variable" associated with the microscopic system. Intuitively, we may think of a macroscopic variable as a gross perspective on the system, a "way of looking at it" (cf. Section 1), or a particular mode of description of the system (Shalizi and Moore, 2003; Allefeld et al., 2009). We operationalise this idea in terms of a process Y_t that in some sense aggregates microscopic states in X into common states in a state space Y of lower dimension or cardinality, with a consequent loss of information. (We don't rule out that aggregation may occur over time as well as state.) The dimension or cardinality of Y defines the scale, or "granularity", of the macroscopic variable.
The supervenience of macroscopic variables on the microscopic dynamics (Section 1.1) is operationalised in predictive, information-theoretic terms: we assume that a macroscopic variable conveys no information about its own future beyond that conveyed by the joint microscopic and environmental histories. Explicitly, we demand the condition

I(Y_t : Y_t^- | X_t^-, E_t^-) = 0 (2.1)

where I(· : · | ·) denotes conditional mutual information (Cover and Thomas, 1991). A canonical example of a macroscopic variable in the above sense is one of the form Y_t = f(X_t), where f : X → Y is a deterministic, surjective mapping from the microscopic onto the lower-dimensional/cardinality macroscopic state space (here aggregation is over states, but not over time). The relation (2.1) is then trivially satisfied. We refer to this as coarse-graining, to be taken in the broad sense of dimensionality reduction. Coarse-graining partitions the state space, "lumping together" microstates in the preimage with a concomitant loss of information: many microstates correspond to the same macrostate. For concision, we sometimes write Y = f(X) to denote the coarse-grained macroscopic variable Y_t = f(X_t). If the state space X is endowed with some structure (e.g., topological, metric, smooth, linear, etc.) then we generally restrict attention to structure-preserving mappings (morphisms). In particular, we restrict coarse-grainings f to epimorphisms (surjective structure-preserving mappings). There is a natural equivalence relation amongst coarse-grainings: given f : X → Y and f' : X → Y', we write

f ∼ f' iff f' = ψ ∘ f for some isomorphism ψ : Y → Y' (2.2)

f and f' then lump together the same subsets of microstates. When we talk of a coarse-graining, we implicitly intend an equivalence class {f} of mappings X → Y under the equivalence relation (2.2). In the remainder of this article we restrict attention to coarse-grained macroscopic variables.
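In the linear setting, the equivalence relation just described is straightforward to check computationally: two full-rank linear coarse-grainings lump together the same microstates precisely when they have the same kernel. A minimal sketch (all matrices hypothetical and for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 2
L = rng.normal(size=(m, n))                 # a full-rank linear coarse-graining
Psi = np.array([[2.0, 1.0], [0.0, 3.0]])    # nonsingular map of the macro space
L2 = Psi @ L                                # equivalent coarse-graining
L3 = rng.normal(size=(m, n))                # an unrelated coarse-graining

def same_fibres(F, G):
    # F and G lump together the same microstates iff ker F = ker G,
    # i.e. stacking the two maps does not increase the rank.
    rank = np.linalg.matrix_rank
    return rank(np.vstack([F, G])) == rank(F) == rank(G)

equiv, inequiv = same_fibres(L, L2), same_fibres(L, L3)
print(equiv, inequiv)
```

The rank test is simply a numerical stand-in for checking that the two maps induce the same partition of microstates into preimage fibres.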

Dynamical independence
As we have noted, not every coarse-graining f : X → Y will yield a macroscopic variable Y_t = f(X_t) which we would be inclined to describe as emergent (cf. the galaxy example in Section 1.1). Quite the contrary; for a complex microscopic system comprising many interacting components, we may expect that an arbitrary coarse-grained macroscopic variable will fail to behave as a dynamical entity in its own right, with its own distinctive law of evolution in time. For such a coarse-grained variable, the response to the question "What will it do next?" will be: "Well, without knowing the full microscopic history, we really can't be sure"; unsurprising, perhaps, as coarse-graining, by construction, loses information. By contrast, for an emergent macroscopic variable, despite the loss of information incurred by coarse-graining, the macroscopic dynamics are parsimonious in the following sense: knowledge of the microscopic history adds nothing to the capacity of the macroscopic variable to self-predict. Dynamical independence formalises this parsimony as follows (we temporarily disregard the environment):

Given jointly-stochastic processes (X, Y), Y is dynamically-independent of X iff, conditional on its own history, Y is independent of the history of X. (2.3)

In information-theoretic terms, (2.3) holds (at time t) precisely when I(Y_t : X_t^- | Y_t^-) vanishes identically. We recognise this quantity as the transfer entropy (TE; Schreiber, 2000; Paluš et al., 2001; Kaiser and Schreiber, 2002; Bossomaier et al., 2016) T_t(X → Y) from X to Y at time t. Thus we state formally:

Y is dynamically-independent of X at time t iff T_t(X → Y) = 0. (2.4)

Eq. (2.4) establishes an information-theoretic condition for dynamical independence of Y with respect to X; we further propose the transfer entropy T_t(X → Y), the dynamical dependence of Y on X, as a quantitative, non-negative measure of the extent to which Y departs from dynamical independence with respect to X at time t. Crucially, "dynamical independence" refers to the condition, while "dynamical dependence" refers to the measure. Dynamical (in)dependence is naturally interpreted in predictive terms: the unpredictability of the process Y at time t given its own history is naturally quantified by the entropy rate H(Y_t | Y_t^-). We may contrast this with the unpredictability H(Y_t | X_t^-, Y_t^-) of Y given not only its own history, but also the history of X. Thus the dynamical dependence quantifies the extent to which X predicts Y over and above the extent to which Y already self-predicts.
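For jointly-Gaussian processes, transfer entropy has a closed form as a log-ratio of prediction-error variances (half the Granger-causality statistic; Barnett et al., 2009). The following minimal sketch estimates dynamical dependence in this Gaussian sense for a hypothetical two-variable VAR(1) in which x drives y but not conversely (all coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
T = 20000
x = np.zeros(T); y = np.zeros(T)
for t in range(1, T):                       # x drives y; y does not drive x
    x[t] = 0.8 * x[t-1] + rng.normal()
    y[t] = 0.5 * y[t-1] + 0.6 * x[t-1] + rng.normal()

def te_gaussian(src, tgt):
    # T(src -> tgt) = (1/2) ln[ var(tgt | own past) / var(tgt | both pasts) ]
    Y = tgt[1:]
    own = tgt[:-1, None]
    both = np.column_stack([tgt[:-1], src[:-1]])
    res = lambda Z: Y - Z @ np.linalg.lstsq(Z, Y, rcond=None)[0]
    return 0.5 * np.log(res(own).var() / res(both).var())

te_xy = te_gaussian(x, y)   # clearly positive: y is not dyn. independent of x
te_yx = te_gaussian(y, x)   # near zero: x self-predicts as well as (x, y) predict it
print(te_xy > 0.05, te_yx < 0.01)
```

Because the "full" regression nests the "reduced" one, the estimate is non-negative by construction, matching the non-negativity of the dynamical dependence measure.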
As presented above, dynamical independence generalises straightforwardly to take account of a third, jointly-stochastic conditioning variable, via conditional transfer entropy (Bossomaier et al., 2016). In the case where X represents a microscopic system, E a jointly-stochastic environmental process and Y an associated macroscopic variable, we define our dynamical dependence measure as

T_t(X → Y | E) = I(Y_t : X_t^- | Y_t^-, E_t^-) (2.5)

and define the condition for dynamical independence as:

The macroscopic variable Y is dynamically-independent of the microscopic system X, given the environment E, at time t iff T_t(X → Y | E) = 0. (2.6)

The dynamical dependence T_t(X → Y | E) will in general be time-varying, except when all processes are strongly stationary; for the remainder of this article we restrict ourselves to the stationary case and drop the time-index subscript. In the case that the processes X, Y, E are deterministic, mutual information is not well-defined, and dynamical independence must be framed differently. We discuss deterministic systems in Section 4.
As a conditional Shannon mutual information (Cover and Thomas, 1991), the transfer entropy is nonparametric in the sense that it is invariant with respect to reparametrisation of the target, source and conditional variables by isomorphisms of the respective state spaces (Kaiser and Schreiber, 2002). Thus if ϕ is an isomorphism of X, ψ an isomorphism of Y and χ an isomorphism of E, then

T(ϕ(X) → ψ(Y) | χ(E)) = T(X → Y | E). (2.7)

In particular, dynamical (in)dependence respects the equivalence relation (2.2) for coarse-grainings. This means that (at least for coarse-grained macroscopic variables) transfer entropy from macro to micro vanishes trivially; i.e., T(Y → X | E) ≡ 0, which we may interpret as the non-existence of "downward causation".
To guarantee transitivity of dynamical independence (see below), we introduce a mild technical restriction on admissible coarse-grainings to those f : X → Y with the following property:

There exists a mapping u : X → U such that x ↦ (f(x), u(x)) defines an isomorphism of X onto Y × U. (2.8)

Intuitively, there is a "complementary" mapping u which, along with f itself, defines a nonsingular transformation of the system. For example, if f is a projection of the real Euclidean space R^n onto the first m < n coordinates, u could be taken as the complementary projection of R^n onto the remaining n − m coordinates (cf. Section 3). Trivially, (2.8) respects the equivalence relation (2.2). The restriction holds universally for some important classes of structured dynamical systems, e.g., the linear systems analysed in Section 3, and also in general for discrete-state systems; otherwise, it might be relaxed to obtain at least "locally" in state space. We assume (2.8) for all coarse-grainings from now on. Given property (2.8), we may apply the transformation ϕ = f × u and exploit the dynamical-dependence invariance (2.7) to obtain an equivalent system X ∼ (Y, U), U = u(X), in which the coarse-graining Y becomes a projection of X = Y × U onto Y, and dynamical dependence is given by

T(X → Y | E) = T(U → Y | E). (2.9)

Assuming (2.8) for all coarse-grainings, using (2.9) we may show that dynamical independence is transitive:

If Y = f(X) is dynamically-independent of X, and Z = g(Y) is dynamically-independent of Y, then Z is dynamically-independent of X. (2.10)

We provide a formal proof in Appendix A. We have thus a partial ordering on the set of coarse-grained dynamically-independent macroscopic variables, under which they may potentially be hierarchically nested at increasingly coarse scales.

Systems featuring emergent properties are typically large ensembles of dynamically-interacting elements; that is, system states x ∈ X are of the form (x_1, ..., x_n) ∈ X_1 × ... × X_n with X_k the state space of the kth element, where n is large. Dynamical independence for such systems may be related to the causal graph of the system (Barnett and Seth, 2014; Seth et al., 2015), which encapsulates information transfer between system elements. As we shall see (below and Section 3.6), this facilitates the construction of systems with prescribed dynamical-independence structure. Given a coarse-graining y_k = f_k(x), k = 1, ..., m at scale m, using (2.8) it is not hard to show that we may always transform the system so that y_k = x_k for k = 1, ..., m; that is, under some "change of coordinates", the coarse-graining becomes a projection onto the subspace defined by the first m dimensions of the microscopic state space. The dynamical dependence is then given by (2.9) with U = (X_{m+1}, ..., X_n), and we may show that, under such a transformation,

T(X → Y | E) = 0 iff G_ij = 0 for all 1 ≤ i ≤ m < j ≤ n (2.12)

where

G_ij = T(X_j → X_i | X_[ij], E) (2.13)

is the causal graph of the system X conditioned on the environment (here the subscript "[ij]" denotes omission of the i and j components of X). According to (2.12) we may characterise dynamically-independent macroscopic variables for ensemble systems as those coarse-grainings which are transformable into projections onto a sub-graph of the causal graph with no incoming information transfer from the rest of the system. However, given two or more dynamically-independent macroscopic variables (at the same or different scales), in general we cannot expect to find a transformation under which all of those variables simultaneously become projections onto causal sub-graphs. Nonetheless, (2.12) is useful for constructing dynamical systems with prespecified dynamically-independent macroscopic variables (cf. Section 3.6). For many complex dynamical systems, there will be no fully dynamically-independent macroscopic variables at some particular (or perhaps at any) scale; i.e., no macroscopic variables for which (2.6) holds exactly. There may, however, be macroscopic variables for which the dynamical dependence (2.5) is small; in an empirical scenario, for instance,
"small" might be defined as statistically insignificant.We take the view that even "near-dynamical independence" yields useful structural insights into emergence, and adopt the ansatz: • The maximally dynamically-independent macroscopic variables at a given scale (i.e., those which minimise dynamical dependence) characterise emergence at that scale. (2.14a) • The collection of maximally dynamically-independent macroscopic variables at all scales, along with their degree of dynamical dependence, affords a multiscale portrait of the emergence structure of the system. (2.14b)

Linear systems
In this section we consider linear discrete-time, continuous-state systems, and later specialise to linear state-space (SS) systems. For simplicity, we consider the case of an empty environment, although the approach is readily extendable to the more general case.
Our starting point is that the microscopic system is, or may be modelled as, a wide-sense stationary, purely-nondeterministic, stable, minimum-phase ("miniphase"), zero-mean, vector stochastic process X_t, t ∈ Z, defined on the vector space R^n. These conditions guarantee that the process has unique stable and causal vector moving-average (VMA) and vector autoregressive (VAR) representations:

X_t = Σ_{k=0}^∞ B_k ε_{t−k}, B_0 = I (3.1a)

and

X_t = Σ_{k=1}^∞ A_k X_{t−k} + ε_t (3.1b)

respectively, where ε_t is a white-noise (serially-uncorrelated, iid) innovations process with covariance matrix Σ = E[ε_t ε_t^T], and H(z) = Σ_{k=0}^∞ B_k z^k the transfer function, with z the back-shift operator (in the frequency domain, z = e^{−iω}, with ω = angular frequency in radians). B_k and A_k are respectively the VMA and VAR coefficient matrices. The stability and miniphase conditions imply that both the B_k and A_k are square-summable, and that all zeros and poles of the transfer function lie strictly outside the unit disc in the complex plane. The cross-power spectral density (CPSD) matrix for the process is given by (Wilson, 1972)

S(z) = H(z) Σ H(z)^* (3.2)

where "*" denotes conjugate transpose. Note that at this stage we do not assume that the innovations ε_t are (multivariate) Gaussian. Note too that even though we describe the system (3.1) as "linear", this does not necessarily exclude processes with nonlinear generative mechanisms; we just require that the conditions listed above are met. Wold's Decomposition Theorem (Doob, 1953) guarantees a VMA form (3.1a) provided that the process is wide-sense stationary and purely-nondeterministic; if in addition the process is miniphase, the VAR form (3.1b) also exists, and all our conditions are satisfied. Thus our analysis here also covers a large class of stationary "nonlinear" systems, with the caveats that (i) for a given nonlinear generative model, the VMA/VAR representations will generally be infinite-order, and as such may not represent parsimonious models for the system, and (ii) restriction of coarse-graining to the linear domain (see below) may limit analysis to macroscopic variables which lack a natural relationship with the nonlinear structure of the dynamics. Nonetheless, linear models are commonly deployed in a variety of real-world scenarios, especially for econometric and neuroscientific time-series analysis.
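As a sanity check on these representations, the CPSD (3.2) of a hypothetical VAR(1) can be computed from the transfer function H(z) = (I − Az)^{−1} and verified against the zero-lag autocovariance, which for a VAR(1) solves a discrete Lyapunov equation (all parameter values below are illustrative):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.7, 0.2], [0.0, 0.5]])       # stable VAR(1) coefficients
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])   # innovations covariance
n = A.shape[0]

def cpsd(w):
    # H(z) = (I - A z)^{-1} with z = exp(-i w);  S(w) = H Sigma H*  (3.2)
    z = np.exp(-1j * w)
    Hm = np.linalg.inv(np.eye(n) - A * z)
    return Hm @ Sigma @ Hm.conj().T

# The CPSD integrates to the zero-lag autocovariance, which for a VAR(1)
# solves the discrete Lyapunov equation  G = A G A' + Sigma.
ws = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
G_spec = np.mean([cpsd(w) for w in ws], axis=0).real   # periodic trapezoid rule
G_lyap = solve_discrete_lyapunov(A, Sigma)
cpsd_ok = np.allclose(G_spec, G_lyap, atol=1e-8)
print(cpsd_ok)
```

The uniform-grid mean implements the trapezoid rule for a periodic integrand, which converges geometrically here since the spectrum is smooth for a stable VAR.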
Reasons for their popularity include parsimony (linear models will frequently have fewer parameters than alternative nonlinear models), simplicity of estimation, and mathematical tractability. Since we are in the linear domain, we restrict ourselves to linear coarse-grained macroscopic variables (cf. Section 2.2). A surjective linear mapping L : R^n → R^m, 0 < m < n, corresponds to a full-rank m × n matrix, and the coarse-graining equivalence relation (2.2) identifies L, L' iff there is a nonsingular linear transformation Ψ of R^m such that L' = ΨL. (Note that since L is full-rank, Y_t = LX_t is purely-nondeterministic and satisfies all the requirements listed at the beginning of this section.) Considering the rows of L as basis vectors for an m-dimensional linear subspace of R^n, a linear transformation simply specifies a change of basis for the subspace. Thus we may identify the set of linear coarse-grainings with the Grassmannian manifold G_m(n) of m-dimensional linear subspaces of R^n. The Grassmannian (Helgason, 1978) is a compact smooth manifold of dimension m(n − m). It is also a nonsingular algebraic variety (the set of solutions of a system of polynomial equations over the real numbers), a homogeneous space (it "looks the same at any point"), and an isotropic space (it "looks the same in all directions"); specifically, G_m(n) ≅ O(n) / (O(m) × O(n − m)), where O(k) is the Lie group of real orthogonal k × k matrices. Under the Euclidean inner product (vector dot-product), every m-dimensional subspace of R^n has a unique orthogonal complement of dimension n − m, which establishes a (non-canonical) isometry of G_m(n) with G_{n−m}(n); for instance, in R^3, every line through the origin has a unique orthogonal plane through the origin, and vice-versa. There is a natural definition of principal angles between linear subspaces of Euclidean spaces, via which the Grassmannian G_m(n) may be endowed with various invariant metric structures (Wong, 1967).
By transformation-invariance of dynamical dependence, we may assume without loss of generality that the row-vectors of L form an orthonormal basis; i.e.,

L L^T = I (3.3)

The manifold of linear mappings satisfying (3.3) is known as the Stiefel manifold which, like the Grassmannian, is a compact, homogeneous and isotropic algebraic variety, with dimension nm − (1/2)m(m + 1). In contrast to the set of all full-rank mappings R^n → R^m, the Stiefel manifold is bounded, which is advantageous for computational minimisation of dynamical dependence (Section 3.5.1).
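In practice, candidate coarse-grainings satisfying (3.3), for example as starting points for numerical optimisation over the Stiefel manifold, may be generated by orthonormalising a Gaussian random matrix via QR decomposition. A sketch (dimensions hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 7, 3

def random_stiefel(n, m, rng):
    # QR of a Gaussian matrix yields a uniformly-distributed orthonormal frame
    Q, R = np.linalg.qr(rng.normal(size=(n, m)))
    return (Q * np.sign(np.diag(R))).T            # rows orthonormal: L L' = I_m

L = random_stiefel(n, m, rng)
on_stiefel = np.allclose(L @ L.T, np.eye(m))

# Frames related by a nonsingular macro-space transformation represent the
# same point of the Grassmannian G_m(n): the projector L'L is the invariant.
Psi = rng.normal(size=(m, m))                     # almost surely nonsingular
L2 = np.linalg.qr((Psi @ L).T)[0].T               # re-orthonormalise the rows
same_grassmann_point = np.allclose(L2.T @ L2, L.T @ L)
print(on_stiefel, same_grassmann_point)
```

The projector comparison makes concrete the passage from the Stiefel manifold (orthonormal frames) to the Grassmannian (the subspaces they span), which is the space over which dynamical dependence is ultimately minimised.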
The condition (2.8) is automatically satisfied for linear coarse-grainings. In particular, given L satisfying (3.3), we may always find a surjective linear mapping M : R^n → R^{n−m} where the row-vectors of the (n − m) × n matrix M form an orthonormal basis for the orthogonal complement of the subspace spanned by the row-vectors of L. The transformation

Φ = (L; M), i.e., the map x ↦ (Lx, Mx), (3.4)

of R^n is then nonsingular and orthonormal; i.e., ΦΦ^T = I. Given a linear mapping L, our task is to calculate the dynamical dependence T(X → Y) for the coarse-grained macroscopic variable Y_t = LX_t. In the context of linear systems, it is convenient to switch from transfer entropy to Granger causality (GC; Wiener, 1956; Granger, 1963, 1969; Geweke, 1982). In case the innovations ε_t in (3.1) are multivariate-normal, the equivalence of TE and GC is exact (Barnett et al., 2009); else we may either consider the GC approach as an approximation to "actual" dynamical dependence, or, if we wish, consider dynamical dependence framed in terms of GC rather than TE as a linear prediction-based measure in its own right. We note that key properties of dynamical (in)dependence, including transformation invariance (2.7), the existence of complementary mappings (2.8) [cf. (3.4)], transitivity (2.10) and the relationship to the (Granger-)causal graph (2.12), carry over straightforwardly to the GC case. GC has distinct advantages over TE in terms of analytic tractability, sample estimation and statistical inference, in both parametric (Barnett and Seth, 2014, 2015; cf. Section 3.1 below) and nonparametric (Dhamala et al., 2008) scenarios. In Appendix B we provide a concise recap of (unconditional) Granger causality following the classical formulation of Geweke (1982).
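The complementary mapping M of (3.4) may be computed directly as an orthonormal basis for the null space of L; a minimal sketch (dimensions hypothetical):

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(5)
n, m = 6, 2
L = np.linalg.qr(rng.normal(size=(n, m)))[0].T   # orthonormal rows: L L' = I_m

M = null_space(L).T         # orthonormal basis of the orthogonal complement
Phi = np.vstack([L, M])     # the transformation x -> (Lx, Mx) of (3.4)

orthonormal = np.allclose(Phi @ Phi.T, np.eye(n))
complementary = np.allclose(M @ L.T, 0)
print(orthonormal, complementary)
```

This is precisely the linear instance of condition (2.8): the pair (L, M) defines a nonsingular (indeed orthonormal) change of coordinates in which the coarse-graining becomes a projection.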

Linear state-space systems
We now specialise to the class of linear state-space systems (3.1) (under the restrictions listed at the beginning of Section 3), where X_t may be represented by a model of the form

W_{t+1} = A W_t + U_t (3.5a)
X_t = C W_t + V_t (3.5b)

where W_t is the (unobserved) state process, U_t and V_t are zero-mean multivariate white noises, C is the observation matrix and A the state transition matrix. Note the specialised use of the term "state space" in the linear-systems vocabulary: the state variable W_t is to be considered a notional unobserved process, or simply as a mathematical construct for expressing the dynamics of the observation process X_t, which here stands as the "microscopic variable".
The parameters of the model (3.5) are (A, C, Q, R, S), where

[Q S; S^T R] = E[(U_t; V_t)(U_t; V_t)^T] (3.6)

is the joint noise covariance matrix (the purely-nondeterministic assumption implies that R is positive-definite). Stationarity requires that the transition equation (3.5a) satisfy the stability condition max{|λ| : λ ∈ eig(A)} < 1. A process X_t satisfying a stable, miniphase SS model (3.5) also satisfies a stable, miniphase vector autoregressive moving-average (VARMA) model; conversely, any stable, miniphase VARMA process satisfies a stable, miniphase SS model of the form (3.5) (Hannan and Deistler, 2012).
To facilitate calculation of dynamical dependence (Section 3.2 below), it is useful to transform the SS model (3.5) to "innovations form" (cf. Appendix B.1)

W_{t+1} = A W_t + K ε_t (3.7a)
X_t = C W_t + ε_t (3.7b)

with innovations covariance matrix Σ = E[ε_t ε_t^T] and Kalman gain matrix K. The moving-average and autoregressive operators for the innovations-form state-space (ISS) model (3.7) are given by

H(z) = I + z C (I − z A)^{−1} K (3.8)

and

H(z)^{−1} = I − z C (I − z B)^{−1} K (3.9)

respectively, where B = A − KC. The miniphase condition is thus max{|λ| : λ ∈ eig(B)} < 1.
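The operator identity between (3.8) and (3.9) can be checked numerically for a hypothetical stable, miniphase ISS parameter set (A, C, K); all values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(11)
n, r = 2, 3                                  # observation / state dimensions
A = np.diag([0.45, -0.2, 0.3])               # stable state transition
C = rng.normal(size=(n, r))
K = 0.05 * rng.normal(size=(r, n))           # small gain keeps B = A - KC stable
B = A - K @ C
assert max(abs(np.linalg.eigvals(A))) < 1    # stability
assert max(abs(np.linalg.eigvals(B))) < 1    # miniphase

def H(z):      # moving-average operator (3.8)
    return np.eye(n) + z * C @ np.linalg.inv(np.eye(r) - A * z) @ K

def Hinv(z):   # autoregressive operator (3.9)
    return np.eye(n) - z * C @ np.linalg.inv(np.eye(r) - B * z) @ K

z = np.exp(-1j * 0.7)                        # arbitrary point on the unit circle
inverse_ok = np.allclose(H(z) @ Hinv(z), np.eye(n))
print(inverse_ok)
```

The identity holds pointwise in z, which is why the miniphase condition on B is exactly the invertibility condition for the VMA representation.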
A general-form SS (3.5) may be converted to an ISS (3.7) by solving the associated discrete algebraic Riccati equation (DARE; Lancaster and Rodman, 1995; Hannan and Deistler, 2012)

P = APA^T + Q − (APC^T + S)(CPC^T + R)^{−1}(APC^T + S)^T,

which under our assumptions has a unique stabilising solution for P; then

K = (APC^T + S)(CPC^T + R)^{−1},    Σ = CPC^T + R.
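This conversion can be sketched with scipy's DARE solver; note that `solve_discrete_are` solves the control-form equation, so the filtering DARE is obtained by passing the duals A^T, C^T (parameter names here are our own):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def ss_to_iss(A, C, Q, R, S):
    """Sketch: convert a general-form SS model (A, C, Q, R, S) to innovations
    form.  scipy solves the control-form DARE, so we pass the duals A^T, C^T;
    the stabilising solution P then yields the Kalman gain K and the
    innovations covariance Sigma."""
    P = solve_discrete_are(A.T, C.T, Q, R, s=S)
    Sigma = C @ P @ C.T + R
    K = (A @ P @ C.T + S) @ np.linalg.inv(Sigma)
    return K, Sigma, P
```

The returned B = A − KC should then satisfy the miniphase condition.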

Dynamical dependence for state-space systems
From (3.5) it is clear that a macroscopic process Y_t = LX_t will be of the same form; that is, the class of state-space systems is closed under full-rank linear mappings. Now consider the (nonsingular) orthonormal transformation (3.4) above. Setting X̃_t = ΦX_t, X̃_t satisfies the ISS model (3.12), where ε̃_t = Φε_t, so that Σ̃ = ΦΣΦ^T, C̃ = ΦC and K̃ = KΦ^T (note that by orthonormality, Φ^{−1} = Φ^T). Now partitioning X̃_t into X̃_{1t} = LX_t = Y_t and X̃_{2t} = MX_t, by transformation-invariance we have F(X → Y) = F(X̃_2 → X̃_1), where the last equality holds since, given X̃⁻_{1t}, the all-variable history X̃⁻_t yields no additional predictive information about X̃_{1t} beyond that contained in X̃⁻_{2t}. We may now apply the recipe for calculating (unconditional) GC for an innovations-form SS system, as described in Appendix B.1, to (3.12). We find C̃_1 = LC, and (B.7) becomes (3.13). The DARE (B.9) for calculating the innovations-form parameters for the reduced model for X̃_{1t} then becomes (3.14), which has a unique stabilising solution for P. Explicitly, setting V as in (3.15), we have Σ^R_{11} = LVL^T, so that finally the dynamical dependence is given by (3.16). Note that P and V are implicitly functions of the ISS parameters (A, C, K, Σ) and the matrix L.
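The reduced-model recipe can be sketched numerically: the submodel for Y_t = LX_t is itself a state-space model whose filtering DARE yields the reduced innovations covariance, following the state-space GC method of Barnett and Seth (2015). This is a sketch under the normalisations Σ = I and LL^T = I; the function name and decomposition into Q, R, S are our own:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def dynamical_dependence(A, C, K, L):
    """Sketch: dynamical dependence F(X -> Y) for Y_t = L X_t, where X_t has
    ISS parameters (A, C, K) with Sigma = I and L has orthonormal rows.  The
    submodel Y_t = (LC) Z_t + L eps_t is a state-space model with state noise
    K eps_t and observation noise L eps_t; its filtering DARE gives the
    reduced-model innovations covariance, whose log-determinant equals the
    dynamical dependence (since the full-model Sigma_11 = L L^T = I here)."""
    CL = L @ C
    Q = K @ K.T   # state-noise covariance
    R = L @ L.T   # observation-noise covariance (= I for orthonormal L)
    S = K @ L.T   # state/observation noise cross-covariance
    P = solve_discrete_are(A.T, CL.T, Q, R, s=S)
    sign, logdet = np.linalg.slogdet(CL @ P @ CL.T + R)
    return logdet
```

By transformation invariance, the result should be unchanged when L is replaced by RL for any orthogonal m × m matrix R, i.e., it depends only on the Grassmannian element.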
Again using transformation invariance, we note that transformation of R^n by the inverse of the left Cholesky factor of Σ yields Σ = I in the transformed system. Thus from now on, without loss of generality, we restrict ourselves to the case Σ = I, as well as the orthonormalisation (3.3). This further simplifies the DARE (3.14) to (3.17), and the dynamical dependence (3.16) becomes simply (3.18). Spectral GC is invariant under nonsingular linear transformation of source or target variable at all frequencies (Barnett and Seth, 2011). To calculate the spectral dynamical dependence f(X → Y; z), we again apply the orthonormal transformation (3.4) and calculate f(X → Y; z) = f(X̃_2 → X̃_1; z). Firstly, we may confirm that the transfer function transforms as H̃(z) = ΦH(z)Φ^T, so that by (3.2) we have S̃(z) = ΦS(z)Φ^T; in particular, H̃_{12}(z) = LH(z)M^T and S̃_{11}(z) = LS(z)L^T. We may then calculate that under the normalisations LL^T = I and Σ = I we have Σ̃_{22|1} = I, so that, noting that L^TL + M^TM = I, we obtain (3.19). As per (B.5), we may define the band-limited dynamical dependence (3.20). Noting that LH(z)L^T is the transfer function and LH(z)L^T [LH(z)L^T]* the CPSD for the process Y, by a standard result (Rozanov, 1967, Theorem 4.2) together with (B.4), the time-domain dynamical dependence may be compactly expressed in spectral-integral form (3.21), which may be more computationally convenient and/or efficient than (3.18) with the DARE (3.17) (cf. Section 3.5.1). We note that in the presence of an environmental process we must consider conditional spectral GC, which is somewhat more complex than the unconditional version (B.3) (Geweke, 1984; Barnett and Seth, 2015); we leave this for a future study.

Finite-order VAR systems
Finite-order autoregressive systems (Appendix B.2) are an important special case of state-space (equivalently VARMA) systems. For a VAR(p), p < ∞, (B.11), to calculate F(X → Y) we may convert the VAR(p) to an equivalent ISS (B.12) (Hannan and Deistler, 2012), and proceed as in Section 3.2 above. Alternatively, we may exploit the dimensional reduction detailed in Appendix B.2: applying the transformation (3.4) with normalisation LL^T = I, it is easy to calculate that the autoregressive coefficients transform as Ã_k = ΦA_kΦ^T for k = 1, . . ., p, and as before Σ̃ = ΦΣΦ^T. We thus find [cf. (B.18)] Ã_22 = MAM^T and C̃_12 = LCM^T, where M = diag(M, . . ., M) (p blocks of M on the diagonal), and with normalisation Σ = I we obtain the reduced-model parameters [cf. (B.19)],
so that, setting Π = M^TPM for compactness, with P the unique stabilising solution of the reduced p(n − m) × p(n − m) DARE (B.17), the dynamical dependence follows. For spectral GC, the formula (3.19) applies, with transfer function as in (B.21).
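The standard VAR(p)-to-ISS embedding referred to above [cf. (B.12)] can be sketched via the companion-form construction (function name ours):

```python
import numpy as np

def var_to_iss(Ak):
    """Sketch: embed a VAR(p) model X_t = sum_k A_k X_{t-k} + eps_t into an
    innovations-form state-space (ISS) model (A, C, K), with companion state
    Z_t = (X_{t-1}, ..., X_{t-p})."""
    p = len(Ak)
    n = Ak[0].shape[0]
    C = np.hstack(Ak)                    # n x pn observation matrix [A_1 ... A_p]
    A = np.zeros((p * n, p * n))
    A[:n, :] = C                         # top block row: X_t = C Z_t + eps_t
    A[n:, :-n] = np.eye((p - 1) * n)     # shift blocks: X_{t-k} -> X_{t-k-1}
    K = np.zeros((p * n, n))
    K[:n, :] = np.eye(n)                 # innovations enter the first block only
    return A, C, K
```

Running the ISS recursion Z_{t+1} = AZ_t + Kε_t, X_t = CZ_t + ε_t reproduces the VAR recursion exactly for the same innovations sequence.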

Perfect dynamical independence
We now examine generic conditions under which perfectly dynamically-independent macroscopic variables can be expected to exist, where by "generic" we mean "except on a measure-zero subset of the model parameter space". Again applying the transformation (3.4), (B.10) yields

LCB^kKM^T = 0,    k = 0, . . ., r − 1,    (3.25)

where r is the dimension of the state space. Eq. (3.25) constitutes rm(n − m) multivariate-quadratic equations for the m(n − m) free variables which parametrise the Grassmannian G_m(n). For r = 1, we would thus expect solutions yielding dynamically-independent Y_t = LX_t at all scales 0 < m < n; however, as the equations are quadratic, some of these solutions may not be real. For r > 1, except on a measure-zero subset of the (A, C, K) ISS parameter space, there will only be solutions if LC ≡ 0 or KM^T ≡ 0 (or both). The former comprises rm linear equations, and the latter r(n − m) equations, for the m(n − m) Grassmannian parameters. Therefore, we expect generic solutions to (3.25) if r < n and either m ≤ n − r or m ≥ r (or r ≤ m ≤ n − r, in which case 2r ≤ n is required). Generically, for r ≥ n there will be no perfectly dynamically-independent macroscopic variables. We note that r < n corresponds to "simple" models with few spectral peaks; nonetheless, anecdotally it is not uncommon to estimate parsimonious model orders < n for highly multivariate data, especially for limited time-series data.
In the generic VAR(p) case (B.11), the condition (B.20) for vanishing GC (Geweke, 1982) yields

LA_kM^T = 0,    k = 1, . . ., p,    (3.26)

which constitutes pm(n − m) multivariate-quadratic equations for the m(n − m) Grassmannian parameters. Generically, for p = 1 we should again expect to find dynamically-independent macroscopic variables at all scales 0 < m < n, while for p > 1 we do not expect to find any, except on a measure-zero subset of the VAR(p) parameter space (A_1, . . ., A_p).
Regarding spectral dynamical independence, we note that f(X → Y; z) is an analytic function of z. Thus, by a standard property of analytic functions, if the band-limited dynamical dependence (3.20) vanishes on any particular finite interval [ω_1, ω_2] then it is zero everywhere, so that by (B.4) the time-domain dynamical dependence (3.18) must also vanish identically.

Statistical inference
Given empirical time-series data, a VAR or state-space model may be estimated via standard (maximum-likelihood) techniques, such as ordinary least squares (OLS; Hamilton, 1994) for VAR estimation, or a subspace method (van Overschee and de Moor, 1996) for state-space estimation. The dynamical dependence, as a Granger causality sample statistic, may then in principle be tested for significance at some prespecified level, and dynamical independence of a coarse-graining Y_t = LX_t inferred by failure to reject the null hypothesis of zero dynamical dependence (3.25) [see also Appendix B.1, eq. (B.10)].
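A minimal OLS sketch of the VAR estimation step (function name and layout are ours; a production implementation would also select the model order, e.g. by AIC/BIC):

```python
import numpy as np

def var_ols(X, p):
    """Sketch: ordinary least squares estimation of a VAR(p) model from a
    (T x n) time series X; returns the coefficient blocks A_1, ..., A_p and
    the residual covariance matrix."""
    T, n = X.shape
    # design matrix of lagged regressors: row for time t is (X_{t-1}, ..., X_{t-p})
    Z = np.hstack([X[p - k - 1:T - k - 1] for k in range(p)])
    Y = X[p:]
    B, *_ = np.linalg.lstsq(Z, Y, rcond=None)   # (pn x n) stacked coefficients
    E = Y - Z @ B                               # residuals
    Sigma = E.T @ E / (T - p)                   # residual covariance
    Ak = [B[k * n:(k + 1) * n].T for k in range(p)]
    return Ak, Sigma
```

The estimated coefficients and residual covariance then feed into the dynamical dependence statistic and its significance test.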
In the case of dynamical dependence calculated from an estimated state-space model [i.e., via (3.16) or (3.18)], the asymptotic null sampling distribution is not known, and surrogate-data methods would be required. For VAR modelling, the statistic (3.23) is a "single-regression" Granger causality estimator, for which an asymptotic generalised χ² sampling distribution has recently been obtained by Gutknecht and Barnett (2019). Alternatively, a likelihood-ratio, Wald or F-test (Lütkepohl, 2005) might be performed for the null hypothesis (B.20) of vanishing dynamical dependence (3.26).

Maximising dynamical independence
Following our ansatz (2.14), given an ISS system (3.7), whether or not perfectly dynamically-independent macroscopic variables exist at any given scale, we seek to minimise the dynamical dependence F(X → Y) over the Grassmannian manifold of linear coarse-grainings (i.e., over L for Y_t = LX_t). The band-limited dynamical dependence F(X → Y; ω_1, ω_2) (B.5) may also in principle be minimised at a given scale, to yield maximally dynamically-independent coarse-grainings associated with the given frequency range at that scale; we leave this for future research.
Solving the minimisation of dynamical dependence (3.18) over the Grassmannian analytically appears, at this stage, intractably complex (see Appendix C for a standard approach); we thus proceed to numerical optimisation.

Numerical optimisation
Given a set of ISS parameters (A, C, K) (we may as before assume Σ = I), minimising the cost function F(X → Y) of (3.18) over the Grassmannian manifold G_m(n) of linear subspaces presents some challenges. Note that we are not (yet) able to calculate the gradient of the cost function explicitly, provisionally ruling out a large class of gradient-based optimisation techniques.
Simulations (Section 3.6) indicate that the cost function (3.18) appears in general to be multi-modal, so optimisation procedures may tend to find local sub-optima. We do not consider this a drawback; rather, in accordance with our ansatz (2.14), we consider local sub-optima to be of interest in their own right, as an integral aspect of the emergence portrait.
While G_m(n) is compact, of dimension m(n − m), its parametrisation over the (nm − m(m + 1)/2)-dimensional Stiefel manifold V_m(n) of m × n orthonormal basis matrices L is many-to-one. There will thus be m(m − 1)/2-dimensional equi-cost surfaces in the Stiefel manifold. These zero-gradient sub-manifolds may confound standard optimisation algorithms; population-based (non-gradient) methods such as cross-entropy (CE) optimisation (Botev et al., 2013), for example, fail to converge when parametrised by R^{mn} under the constraint (3.3), apparently because the population diffuses along the equi-cost surfaces. Preliminary investigations suggest that simplex methods (Nelder and Mead, 1965), which are generally better at locating global optima, also fare poorly, although the reasons are less clear.
An alternative approach is to use local coordinate charts for the Grassmannian. Any full-rank m × n matrix L can be represented as

L = Ψ [I M] Π,    (3.27)

where Π is an n × n permutation matrix (i.e., a row or column permutation of I_{n×n}), Ψ an m × m non-singular transformation matrix and M an m × (n − m) full-rank matrix. For given Π, the Grassmannian is then locally and diffeomorphically mapped by the m × (n − m) full-rank matrices M. But note that for given Π, while there is no redundancy in the (injective) mapping R^{m(n−m)} → G_m(n) defined by M, the space of such M is unbounded and does not cover the entire Grassmannian, which again makes numerical optimisation awkward.
A partial resolution is provided by a surprising mathematical result due to Knuth (1985) [see also Usevich and Markovsky (2014)], which states, roughly, that given any fixed δ > 1, for any full-rank m × n matrix L_0 there is a neighbourhood of L_0, a permutation matrix Π and a transformation Ψ such that for any L in the neighbourhood, all elements of the M satisfying (3.27) are bounded to lie in [−δ, δ]. That is, in the local neighbourhood of any subspace in the Grassmannian, we can always find a suitable permutation matrix Π such that (3.27) effectively parametrises the neighbourhood by a bounded submanifold of R^{m(n−m)}. During the course of an optimisation process, then, if the current local search (over M) drifts outside its δ-bounds, we can always find a new bounded local parametrisation of the search neighbourhood "on the fly". Finding a suitable new Π is, however, not straightforward, and calculating the requisite Ψ for (3.27) is computationally quite expensive (Mehrmann and Poloni, 2012). Nor is this scheme particularly convenient for population-based optimisation algorithms, which will generally require keeping track of different permutation matrices for different sub-populations; worse, for some algorithms (such as CE optimisation) it seems that the procedure can only work if the entire current population resides in a single (Ψ, Π)-chart.
Regarding computational efficiency, we may apply a useful pre-optimisation trick. From (3.25), F(X → Y) vanishes precisely where the "proxy cost function"

F*(X → Y) = Σ_{k=0}^{r−1} ‖LCB^kKM^T‖²

vanishes, where as before M spans the orthogonal complement of L, and ‖U‖² = trace(UU^T) is the (squared) Frobenius matrix norm. While F*(X → Y) will not in general vary monotonically with F(X → Y), simulations indicate strongly that subspaces L which locally minimise F(X → Y) lie in regions of the Grassmannian with near-locally-minimal F*(X → Y). Since F*(X → Y), as a biquadratic function of (L, M), is considerably less computationally expensive to calculate than F(X → Y), we have found that pre-optimising F*(X → Y) under the constraints LL^T = I, LM^T = 0, MM^T = I leads to significantly accelerated optimisation of F(X → Y), especially for highly multivariate state-space models and/or models with large state-space dimension. The same techniques may be used to optimise the band-limited dynamical dependence. For optimisation of F(X → Y), it may also be more computationally efficient to use the spectral integral form (3.21) rather than (3.18) with the DARE (3.17). Approximating the integral involves choosing a frequency resolution dω for numerical quadrature. We have found that a good heuristic choice is dω ≈ log ρ / log ε, where ρ is the spectral radius of the process (Section 3) and ε the machine floating-point epsilon. Quadrature is then likely to be computationally cheaper than solving the DARE (3.17), provided that ρ is not too close to 1, and for fixed ρ it scales better with system size.
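The proxy cost can be sketched directly; this assumes our reading of the vanishing-GC condition (3.25) as LCB^kKM^T = 0 for k = 0, . . ., r − 1 (function name ours):

```python
import numpy as np

def proxy_cost(A, C, K, L, M):
    """Sketch of the proxy cost function: sum of squared Frobenius norms of
    LC B^k K M^T, k = 0, ..., r-1 (our reading of (3.25)); by Cayley-Hamilton,
    powers beyond r-1 add no further conditions.  Biquadratic in (L, M), so
    far cheaper per evaluation than solving a DARE."""
    B = A - K @ C
    r = A.shape[0]
    cost = 0.0
    Bk = np.eye(r)
    for _ in range(r):
        U = L @ C @ Bk @ K @ M.T
        cost += np.sum(U * U)   # squared Frobenius norm, trace(U U^T)
        Bk = Bk @ B
    return cost
```

In particular, choosing the rows of L in the left null space of C (so that LC = 0) makes the proxy cost vanish identically, matching the generic-solution discussion above.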
For numerical minimisation of dynamical dependence derived from empirical data via state-space or VAR modelling, a stopping criterion may be based on statistical inference (Section 3.4): iterated search may be terminated on failure to reject the appropriate null hypothesis of vanishing dynamical dependence at a predetermined significance level.As mentioned in Section 3.4, in lieu of a known sampling distribution for the state-space Granger causality estimator, this is only likely to be practicable for VAR modelling.
So far we have had some success, with both the redundant L-parametrisation under the orthonormality constraint and the one-to-one M-parametrisation with on-the-fly (Ψ, Π) selection, using (i) stochastic gradient descent with annealed step size and, more computationally efficiently, (ii) a (1+1) evolution strategy (ES; Rechenberg, 1973; Schwefel, 1995), both with multiple restarts to identify local sub-optima. The search space scales effectively quadratically with n; so far, we have been able to solve problems up to about n ≈ 20 (for all m) in under 24 hours on a standard multi-core Xeon™ workstation; with pre-optimisation, we may extend this to about n ≈ 100, although the number of local sub-optima increases with n, necessitating more restarts. Parallel high-performance computing helps significantly, since restarts are independent and may be run concurrently. GPU computing should also improve efficiency, in particular if GPU-enabled DARE solvers are available.
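A generic (1+1)-ES over orthonormal coarse-graining matrices can be sketched as follows; the exponential step-size adaptation factors are our assumption for the 1/5th-success rule, and `cost` stands in for any dynamical dependence (or proxy) evaluation:

```python
import numpy as np

def orthonormalise(L):
    """Project an m x n matrix onto the Stiefel manifold of orthonormal-row
    matrices, via the singular value decomposition."""
    U, _, Vt = np.linalg.svd(L, full_matrices=False)
    return U @ Vt

def es_minimise(cost, L0, sigma=0.1, tol=1e-8, max_iter=20000, seed=0):
    """Sketch of a (1+1) evolution strategy: Gaussian mutation of L, SVD
    re-orthonormalisation, and 1/5th-success-rule step-size adaptation
    (the exponential adaptation factors here are our assumed choice)."""
    rng = np.random.default_rng(seed)
    m, n = L0.shape
    gamma = 1.0 / np.sqrt(m * (n - m) + 1)  # gain factor; m(n-m) = dim G_m(n)
    h = 0.2                                 # target success rate (1/5th rule)
    L = orthonormalise(L0)
    d = cost(L)
    for _ in range(max_iter):
        Lnew = orthonormalise(L + sigma * rng.standard_normal((m, n)))
        dnew = cost(Lnew)
        if dnew <= d:                       # success: accept, grow step size
            L, d = Lnew, dnew
            sigma *= np.exp(gamma * (1.0 - h))
        else:                               # failure: reject, shrink step size
            sigma *= np.exp(-gamma * h)
        if sigma < tol or d < tol:
            break
    return L, d
```

Multiple restarts from uniformly random Grassmannian elements then expose the local sub-optima that make up the emergence portrait.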

Simulation results
In this section we demonstrate the discovery of dynamically-independent macroscopic variables, and estimation of the emergence portrait, for state-space systems with specified causal connectivity, using numerical optimisation (Section 3.5.1). In an empirical setting, a state-space (or VAR) model for stationary data could be estimated by standard methods and the same optimisation procedure followed.
Our simulations are motivated as follows: if at scale 0 < m < n we have a macroscopic variable Y_t = LX_t, then (cf. Section 3.3) the system may be transformed so that the linear mapping L takes the form of a projection onto some m-dimensional coordinate hyperplane x_{i_1}, . . ., x_{i_m}, and according to (2.12), Y_t is perfectly dynamically-independent iff there are no edges directed from nodes outside {i_1, . . ., i_m} into {i_1, . . ., i_m} (3.29), where G is the causal graph (2.13) of the system. While not an entirely general characterisation of the emergence portrait (as remarked at the end of Section 2.3, multiple dynamically-independent linear coarse-grainings will not in general be simultaneously transformable to axes-hyperplane projections), we may nonetheless design linear models with prespecified causal graphs, which then mediate the expected dynamically-independent macroscopic variables at various scales. This construction is simple to achieve with finite-order VAR models (less so with more general state-space models), by setting the appropriate VAR coefficients A_{k,ij} to zero. (We may also achieve "near dynamical independence" by making the appropriate A_{k,ij} "small".) This is illustrated in Fig. 1: random coefficients A_{k,ij}, k = 1, . . ., 7, i, j = 1, . . ., 9, were generated, with coefficients set to zero for "missing" connections (zeros in the matrix in Fig. 1b), and the model normalised to spectral radius ρ = 0.9. In an empirical setting, it would be preferable to "prune" the Granger-causal graph to display only statistically-significant pairwise-conditional Granger causalities as directed edges.
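The construction just described can be sketched as follows (function name ours); rescaling A_k by c^k rescales the companion-matrix spectrum by c, which is how the spectral radius is set exactly:

```python
import numpy as np

def random_var_with_graph(G, p, rho, seed=0):
    """Sketch: random VAR(p) coefficients A_k with A_k[i, j] = 0 wherever the
    causal graph G (an n x n boolean adjacency matrix, G[i, j] meaning
    j Granger-causes i) has no edge, rescaled so that the companion matrix
    has spectral radius rho."""
    rng = np.random.default_rng(seed)
    n = G.shape[0]
    Ak = [rng.standard_normal((n, n)) * G for _ in range(p)]
    # companion matrix of the VAR(p)
    comp = np.zeros((p * n, p * n))
    comp[:n, :] = np.hstack(Ak)
    comp[n:, :-n] = np.eye((p - 1) * n)
    r = np.max(np.abs(np.linalg.eigvals(comp)))
    c = rho / r   # replacing A_k by c^k A_k scales the spectrum by c
    return [Ak[k] * c ** (k + 1) for k in range(p)]
```

The zero pattern of G is preserved exactly, so the prescribed Granger-causal graph is built into the model.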
We minimised dynamical dependence at scales m = 1, . . ., 8 for the VAR model of Fig. 1, using a (1+1)-ES (Section 3.5.1). At each scale, 100 independent optimisation runs were performed, initialised uniformly randomly on the Grassmannian. The (1+1)-ES algorithm was implemented as follows: the initial step size is set to σ = 0.1. At each step, the current orthonormal m × n matrix L representing the Grassmannian element is "mutated" by addition of a random m × n matrix ∆L, with each element drawn independently from a N(0, σ²) distribution. The mutant L′ = L + ∆L is then orthonormalised (using a singular value decomposition) and its dynamical dependence d′ calculated as in Section 3.2.2. If d′ is less than or equal to the current dynamical dependence d, then L is replaced by L′ and the step size σ increased by a multiplicative factor ν_+; otherwise the original L is retained, and the step size decreased by a multiplicative factor ν_−. The adaptation factors were calculated according to a version of the well-known Rechenberg "1/5th success rule" described in Hansen et al. (2015, Sec. 3.1): a gain factor γ is set to 1/√(δ + 1), where δ = m(n − m) is the dimension of the Grassmannian search space; we then set ν_+ = e^{γ(1−h)} and ν_− = e^{−γh}, with h = 1/5. The algorithm is deemed to have converged when either the step size σ or the current dynamical dependence d falls below a threshold value, here set to 10⁻⁸. A typical set of 100 runs at scale m = 6 is plotted in Fig. 2: while many runs converge to the true minimum (the unique perfectly dynamically-independent subspace at that scale), others terminate at local sub-optima. Full optimisation results across all scales are illustrated in Fig.
3, which may be considered a graphical overview of the empirically-derived emergence portrait of the system, in accordance with our ansatz (2.14): at each scale, the 100 locally-optimal terminating values of dynamical dependence are sorted in ascending order and plotted on a bar chart. Zero values indicate the presence of perfectly dynamically-independent subspaces at the corresponding scale, while non-zero values indicate locally-optimal subspaces; near-zero values indicate "nearly dynamically-independent" projections. The width of the ledges of equal dynamical dependence at each scale gives an indication of the size of the basin of attraction of the corresponding local minimum.
We see clearly from the figure that, as expected from the causal graph (Fig. 1), there are perfectly dynamically-independent macroscopic variables only at scales 2 and 6, corresponding to projection onto the sub-graphs {1, 2} and {1, 2, 3, 4, 5, 6} respectively (cf. Figs. 5a, 5g below); only these sub-graphs have no incoming Granger-causal connections from the rest of the graph. (It is important to note, though, that this "no incoming connections" criterion for dynamically-independent subspaces holds for our simple example as a direct consequence of its construction from a causal graph; we are, in effect, working in a "privileged" coordinate system. For an arbitrary system, where we could not be expected to know a priori which particular coordinate transformation(s) map the dynamically-independent subspaces to the causal graph as per eq. (3.29), this would no longer hold in general.) Fig. 3 on its own does not reveal the full detail of the emergence portrait; in particular, while indicating the broad distribution of dynamical dependence of (locally-)optimal subspaces (i.e., the macroscopic variables), it says little about the subspaces in relation to the system itself, or to each other; for example, it is not clear whether bars of equal height at a given scale actually correspond to the same subspace. To dig deeper into the emergence portrait, we need to consider the (locally-)maximally dynamically-independent subspaces explicitly in the structured domain in which they reside, namely the Grassmannian manifold of vector subspaces. Visualisation of vector subspaces (elements of the Grassmannian) in high-dimensional Euclidean spaces is challenging. Below we present a series of visualisations designed to aid intuition on the structure of, and relationships between, locally-optimal subspaces. Firstly, as alluded to in Section 3, we may calculate the principal angles between two subspaces of R^n; in fact, for subspaces of dimensions 0 < m_1 ≤ m_2 < n there are m_1 principal angles
0 ≤ θ_1 ≤ . . . ≤ θ_{m_1} ≤ π/2 (Wong, 1967), and we may define a metric on the Grassmannian (a measure of the distance between subspaces) in terms of these angles. We may use this metric to answer the question posed above: do the locally-optimal subspaces of equal dynamical dependence at a given scale, as evidenced in Fig. 3, in general correspond to the same subspace or not? To this end, at each scale we may calculate the distances between all pairs of the 100 locally-minimal subspaces. In Fig. 4 these distances are represented on a colour scale (this figure should be viewed alongside Fig. 3). We see from the white squares on the diagonals that at each scale the local optima of equal dynamical dependence are in general zero distance from each other, and thus correspond to the same local optimum. Further structural detail may be inferred from Fig. 4; for instance, from the m = 2 results we see that the second-lowest dynamical dependence subspace (around the 70th to 80th runs) is almost orthogonal to the zero-dynamical-dependence subspace (cf. Figs. 5a, 5b below); that is, these subspaces are highly dissimilar.
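Principal angles and a resulting subspace distance can be computed directly with scipy; here we take the 2-norm of the angle vector, one of several standard Grassmannian metrics (the function name is ours):

```python
import numpy as np
from scipy.linalg import subspace_angles

def grassmann_distance(L1, L2):
    """Sketch: distance between the row spaces of L1 and L2, computed from
    their principal angles 0 <= theta_1 <= ... <= theta_m <= pi/2.  We use
    the 2-norm of the angle vector as the metric."""
    # subspace_angles takes column-space bases, so transpose the row bases
    theta = subspace_angles(L1.T, L2.T)
    return np.linalg.norm(theta)
```

Identical subspaces are at distance zero; a pair of orthogonal 1-dimensional subspaces are at distance π/2.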
To gain insight into the placement of locally-optimal subspaces with respect to the system itself, we may calculate the distances between an m-dimensional subspace and each of the coordinate axes x_1, . . ., x_n (in this case there is only a single principal angle). Note that these n distances do not uniquely identify the Grassmannian element unless m = 1 or n − 1 (in higher dimensions there is more "wiggle room" for subspaces); however, if the distance between a subspace L and a coordinate axis is zero, then we can conclude that the axis is co-linear with L. Thus by (3.29) it follows that Y_t = LX_t is perfectly dynamically-independent iff the distances to some set of axes x_{i_1}, . . ., x_{i_m} are zero (L is then a projection onto the linear subspace spanned by those axes). Given a specified Granger-causal graph G and a linear subspace, we present them graphically as a weighted graph, with edges coloured according to the pairwise-conditional Granger causalities and nodes coloured according to the distance between the corresponding coordinate axis and the subspace. Fig. 5 displays some colour-weighted Granger-causal graphs corresponding to locally-optimal subspaces at scales 2, 5 and 6. A solid blue node indicates that the subspace is co-linear with the corresponding axis, a white node that it is orthogonal to that axis, while intermediate shades indicate angles between zero and π/2.
The top row in Fig. 5 displays weighted graphs for 2-dimensional subspaces discovered by the ES, with locally-minimal dynamical dependence. Fig. 5a indicates the unique 2-node sub-graph corresponding to a perfectly dynamically-independent 2-dimensional subspace. Fig. 5b illustrates a nearly-dynamically-independent 2-dimensional subspace; the subspace corresponding to Fig. 5c, while a local minimum, is not very dynamically-independent. The middle row displays graphs for 5-dimensional subspaces; there are no perfectly dynamically-independent subspaces at this scale, but the subspace corresponding to Fig. 5d is, again, nearly dynamically-independent, with the subspaces corresponding to the graphs to its right less so. Fig. 5g identifies the unique perfectly dynamically-independent subspace at scale 6, with the graphs to its right showing local dynamical dependence sub-optima.
The previous visualisation examined subspace (angular) distance from the coordinate axes. This still allowed a good deal of "wiggle room": at least in higher dimensions, many subspaces of a given dimension are equidistant from all the coordinate axes. In the next, finer-grained visualisation we look instead at subspace distance from the coordinate hyperplanes of the same dimension as the subspace, i.e., the subspaces spanned by subsets of the coordinate axes. Now there is no "wiggle room": a subspace is uniquely identified by its distances from all same-dimension coordinate hyperplanes. In Fig. 6, for the same examples as in Fig. 5, at the appropriate scale m we plot a measure of how co-planar the locally-optimal subspace is with each of the (n choose m) subspaces spanned by combinations x_{i_1}, . . ., x_{i_m}, 1 ≤ i_1 < . . . < i_m ≤ n, of the coordinate axes. The horizontal scale is the dictionary (lexicographic) ordering of the axis combinations. The height of the bars is 1 − θ_max, where 0 ≤ θ_max ≤ 1 is the normalised maximum principal angle between the locally-optimal subspace and the corresponding coordinate subspace; thus 1 indicates that the subspaces are co-planar, 0 that they are orthogonal. The figure may be compared with Fig. 5, though here, by contrast, the metrics for each plot collectively identify a unique Grassmannian element. Taking some examples: for the locally-minimal 5-dimensional subspace represented in Fig. 5d, we find that the two large bars in Fig. 6d correspond to the coordinate subspaces x_1, x_3, x_4, x_5, x_6 and x_2, x_3, x_4, x_5, x_6, which is not apparent from Fig. 5d alone. The single high bar in Fig. 6e corresponds to the subspace spanned by the axes x_1, x_2, x_7, x_8, x_9; this is perhaps more apparent from Fig.
5e, while the subspace in Figs. 5f/6f exhibits a more complex relationship to the coordinate subspaces, with the highest co-planarity at x_3, x_4, x_5, x_6, x_9. Comparing Figs. 6a and 6g confirms that the unique scale-2 dynamically-independent variable is nested in the unique scale-6 variable.
To summarise our analysis of the emergence portrait of the VAR(7) system: Fig. 3 displays just the distribution of dynamical dependence values of locally-optimal subspaces at each scale, as discovered by independent optimisation runs; in Fig. 4 we examine at each scale the distances between discovered locally-optimal subspaces, enabling us to distinguish which represent unique subspaces; in Fig. 5 we measure the distances between optimal subspaces and the coordinate axes, allowing a rough graphical depiction of their relationship to nodes on the causal graph; Fig. 6 takes this a step further, pinpointing the exact positioning of the locally-optimal subspaces with respect to the coordinates of the microscopic system space.
Figure 5: Granger-causal graphs for locally-optimal subspaces with nodes colour-weighted by axis angle, for the 9-variable VAR(7) model of Fig. 1. A solid blue node indicates that the subspace is co-linear with the corresponding axis, while a white node indicates that it is orthogonal to that axis. Note that, since runs are sorted in ascending order of dynamical dependence, higher run numbers correspond to less dynamically-independent subspaces. See main text for details.

Figure 6: Bar heights are 1 − θ_max, where 0 ≤ θ_max ≤ 1 is the normalised maximum principal angle between the locally-optimal subspace and the corresponding coordinate subspace; 1 indicates co-planarity, 0 orthogonality. The horizontal scale is the dictionary ordering of the (n choose m) axis combinations. Boxed labels show x-axis numbers of bars; e.g., the "spike" in subfigure (e) at ordinal 35 corresponds to the subspace spanned by the axes x_1, x_2, x_7, x_8, x_9 (cf. Fig. 5e). See main text for details.

Although avowedly a low-dimensional toy model, our analysis of the VAR(7) system of Fig. 1 presents a viable approach to the discovery of macroscopic variables in linear systems and, in accordance with our ansatz (2.14), grants insight into the local dynamical independence structure of such systems. Our analysis also illustrates a more general point: developing a comprehensive emergence portrait (how dynamically-independent macroscopic variables relate to the microscopic system and to each other) involves exposing the structure of the space of macroscopic variables. Thus in Figs. 3, 4, 5 and 6 we "drill down" into this structured space using progressively more detailed metrics. For real-world systems, relating the emergence portrait to macroscopic phenomenology is likely to be an empirical question -
but here too, understanding the structure of the emergence portrait in abstracto is likely to be crucial.

Deterministic and continuous-time dynamics
Although the main thrust of this article concerns dynamical independence for discrete-time stochastic systems (Section 2.3), and in particular discrete-time linear systems (Section 3), many systems commonly associated with emergent phenomena feature deterministic and/or continuous-time dynamics. For deterministic systems the question immediately arises as to how the information-theoretic framework of Section 2.3 might apply, since Shannon information is indeterminate for non-stochastic variables. In continuous time, furthermore, transfer entropy is more nuanced (Spinney et al., 2017) and considerably more complex, even in the linear case (Barnett and Seth, 2017). Below we preview our approach to these important challenges.

Deterministic dynamics in discrete time
For many discrete-time dynamical systems of interest, such as cellular automata, flocking models and discrete-time chaotic systems, the dynamics take the deterministic Markovian form

x_{t+1} = ξ(x_t),    (4.1)

with state transition function ξ : X → X, where the microscopic state space X may be continuous and measurable, or discrete and countable. Thus, given some initial condition x_0, we have x_t = ξ^t(x_0), t ≥ 0, where ξ^t denotes t iterations of the mapping ξ.
For discrete-state systems, we may consider the dynamics (4.1) with stochastic initial conditions. Thus a random variable X_0 on X is introduced to represent the statistical distribution of initial (t = 0) microscopic states, yielding the microscopic stochastic process X_t = ξ^t(X_0), t ≥ 0, on X. Given a coarse-graining f : X → Y, dynamical independence for the macroscopic variable Y_t = f(X_t) may then be analysed along the lines of the general discrete-time stochastic case (Section 2.3). In practice, the choice of initial distribution may be based on general principles (e.g., maximum entropy), or on domain-specific a priori considerations. Then, since Y_t depends deterministically on X⁻_t = {X_0, X_1, . . ., X_{t−1}}, the dynamical dependence (2.5) of Y on X at time t ≥ 0 is given simply by T_t(X → Y) = H(Y_t | Y⁻_t). Given the probability distribution (or density) function p(x_0) for X_0, a general expression for the entropy H(Y_t | Y⁻_t), and thence T_t(X → Y), may be calculated. This approach, though, is unviable for continuous-state systems of the form (4.1), since for deterministic dynamics the transfer entropy T_t(X → Y) based on differential entropies diverges. An alternative approach is to introduce scalable noise into the process (4.1), and then analyse dynamical independence for the resultant discrete-time stochastic system in the limit of vanishing noise. If the state space X is Euclidean, for example, we might consider the autoregressive stochastic process X_{t+1} = ξ(X_t) + σε_t derived from (4.1), in the limit σ → 0, where ε_t is a multivariate-normal white noise.
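For a small discrete-state system, T_t(X → Y) = H(Y_t | Y⁻_t) can be computed by brute-force enumeration over initial states; this toy sketch (names ours) assumes a uniform, maximum-entropy initial distribution:

```python
import numpy as np
from collections import defaultdict

def dynamical_dependence_det(xi, f, states, t):
    """Sketch: T_t(X -> Y) = H(Y_t | Y_0, ..., Y_{t-1}) in bits, for a
    deterministic map xi with coarse-graining f, under a uniform distribution
    over initial states.  Since Y_t is a deterministic function of the micro
    history, the second transfer-entropy term vanishes."""
    # group initial states by their macro history (Y_0, ..., Y_{t-1})
    groups = defaultdict(list)
    for x0 in states:
        x, hist = x0, []
        for _ in range(t):
            hist.append(f(x))
            x = xi(x)
        groups[tuple(hist)].append(f(x))   # value of Y_t for this history
    # H(Y_t | Y^-) = sum over histories h of p(h) H(Y_t | h)
    N = len(states)
    H = 0.0
    for hist, ys in groups.items():
        p_h = len(ys) / N
        vals, counts = np.unique(ys, return_counts=True)
        q = counts / counts.sum()
        H -= p_h * np.sum(q * np.log2(q))
    return H
```

For example, for a 3-bit rotation with the low bit as macroscopic variable, the macro history of length 3 determines the full micro state, so the dynamical dependence vanishes at t = 3 while being 1 bit at t = 1.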

Flows: deterministic dynamics in continuous time
An important case of deterministic continuous-time dynamics is that of flows, defined by the group action

ξ(x, 0) = x,  ξ(ξ(x, s), t) = ξ(x, s + t) (4.3)

If X is a differentiable manifold then the flow is smooth if the function ξ is differentiable, in which case for any fixed t the function x ↦ ξ(x, t) is a diffeomorphism. If the group action (4.3) holds only on a strict subset of X × R, then ξ is said to define a local flow; from now on, we use the term "flow" to include local flows. On Euclidean space X = R^n, smooth flows are essentially equivalent to 1st-order autonomous ordinary differential equations (ODEs); the trajectory x(t) = ξ(x_0, t) is the unique solution of the autonomous ODE28 ẋ(t) = g(x) with initial condition x(0) = x_0, where g(x) = ∂ξ(x, t)/∂t evaluated at t = 0. Many classical dynamical systems, such as Hamiltonian mechanics, flocking, and chaotic dynamical systems, are expressed as ODEs and may thus be considered as flows.
For flows, stochastic initial conditions (Section 4.1) once again run up against problems with diverging differential entropies. As a potential remedy, we reconsider the original expression (2.3) of dynamical independence. There, independence is interpreted in a statistical sense; here we propose a "functional" interpretation more aligned with dynamical systems theory: given a smooth flow ξ : R^n × R → R^n at the microscopic level, a differentiable coarse-graining function f : R^n → R^m, 0 < m < n, is considered to define a dynamically-independent macroscopic variable for ξ iff there is a flow η : R^m × R → R^m on the macroscopic space such that29

f(ξ(x, t)) = η(f(x), t) (4.4)

for all x ∈ R^n and t ∈ R. In terms of ODEs, this is equivalent to the existence of an autonomous ODE30 ẏ(t) = h(y) on R^m such that for any trajectory x(t) of ξ, y(t) = f(x(t)) is a trajectory of η. Thus, in the spirit of (2.3), a dynamically-independent macroscopic system is self-determining: given an initial condition y_0 ∈ R^m, the coarse-grained macroscopic system determines its evolution in time without reference to the micro-level dynamics. Dynamical independence in this sense is invariant with respect to smooth coordinate transformations of both the microscopic space R^n and the macroscopic space R^m; dynamical independence may thus be extended to flows on differentiable manifolds via overlapping coordinate charts (Kobayashi and Nomizu, 1996).
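A minimal numerical illustration of the functional criterion, using an assumed toy flow on R^3 (rotation in the (x_1, x_2)-plane plus exponential decay in x_3, chosen for this sketch): the coarse-graining f(x) = x_3 obeys its own closed macroscopic ODE, while f(x) = x_1^2 + x_2^2 is an invariant of the flow (a trivial macroscopic ODE ẏ = 0):

```python
import numpy as np

def g(x):                       # assumed micro vector field on R^3
    return np.array([-x[1], x[0], -0.5 * x[2]])

def rk4(f, x, dt, n):
    """Integrate xdot = f(x) with classical RK4; return the trajectory."""
    traj = [x]
    for _ in range(n):
        k1 = f(x); k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2); k4 = f(x + dt * k3)
        x = x + (dt / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(x)
    return np.array(traj)

x0 = np.array([1.0, 0.3, 2.0])
dt, nsteps = 0.01, 1000
micro = rk4(g, x0, dt, nsteps)

# f(x) = x3 is dynamically independent: it obeys the closed macro ODE
# ydot = -0.5 y, whatever the rest of the microscopic state is doing.
macro = rk4(lambda y: -0.5 * y, x0[2:], dt, nsteps)
err = np.max(np.abs(micro[:, 2] - macro[:, 0]))

# f(x) = x1^2 + x2^2 is an invariant of the flow.
inv = micro[:, 0]**2 + micro[:, 1]**2
print(f"max |f(x(t)) - y(t)| = {err:.2e}, invariant drift = {np.ptp(inv):.2e}")
```

By contrast, f(x) = x_1 admits no closed scalar ODE: the trajectory revisits the same value of x_1 with different derivatives −x_2.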
In preliminary work (in preparation) we derive necessary and sufficient conditions for dynamical independence in the above sense, and show that dynamically-independent coarse-grainings f : R^n → R^m are built from invariants (conserved quantities) of the flow, along with a "time-like" scalar function; see Appendix D for a summary of these results. Thus dynamical independence in the functional sense essentially reduces to the classical problem of invariants of flows on differentiable manifolds (Cohen, 1911); cf. the example of a Newtonian galaxy at the beginning of Section 2.3. By the celebrated First Theorem of Noether (Arnold, 1978), for Lagrangian systems invariants are associated with symmetries of the Lagrangian action. Thus for such systems Noether's Theorem characterises the dynamically-independent macroscopic variables. However, by no means all systems of interest fall into this class. (The central role of symmetry in Noether's Theorem, though, seems worth bearing in mind.) The functional approach has one drawback: unlike the transfer entropy measure (2.5) in the discrete-time stochastic case, it lacks an information-theoretic interpretation, and does not yield an obvious candidate measure for dynamical dependence, let alone a transformation-invariant one (we are currently investigating whether such a measure may exist); there is thus no ready notion of "near-dynamical independence". As in the discrete-time case (Section 4.1), there is also the possibility of adding scalable noise to the ODE to derive a stochastic differential equation (SDE; Øksendal, 2003), and then considering the limiting behaviour of the micro → macro transfer entropy measure in the limit of vanishing noise. Nonlinear SDEs, though, are challenging to analyse in any generality.

Discussion
In this paper we introduce a notion of emergence of macroscopic dynamical structure in highly-multivariate microscopic dynamical systems, which we term dynamical independence. In contrast to other treatments of emergence, which are largely concerned with part-whole and synergistic relationships between system components (see the discussion below), dynamical independence instantiates the intuition of an emergent process at a macroscopic scale as one which evolves over time according to its own dynamical laws, distinct from and independently of the dynamical laws operating at the microscopic level. More specifically, while prescribed by the microscopic process, a dynamically-independent macroscopic process is, conditional on its own history, independent of the history of the microscopic process. Dynamical independence is quantified by a Shannon information-based (and hence transformation-invariant) measure of dynamical dependence. Importantly, dynamical independence may be conditional on a co-distributed, externally-demarcated process, thus accommodating systems which feature input-output interactions with a dynamic environment.
Critical to any theory of emergence over a potential range of spatiotemporal scales is how we should construe a "macroscopic variable". Here, we try to keep this question as open as possible, with one key constraint: a macroscopic variable may be any process co-distributed with the microscopic process which, in predictive terms, does not "bring anything new to the table" beyond the microscopic: a macroscopic variable is prescribed by the microscopic process in the sense that it does not self-predict beyond the prediction afforded by the microscopic variables (along with the environment). We might thus conclude that if a macroscopic process appears to emerge as a process in its own right - with a "life of its own" - this apparent autonomy is in the eye of a beholder blind to the micro-level dynamics. Emergence, in our approach, is therefore best thought of as being associated with particular "ways of looking" at a system.
A key aspect of our approach is an emphasis on discovery of emergent macroscopic variables - "ways of looking" at the system - given a micro-level description. Although specific problem domains may present "natural" prospective emergent macroscopic variables (which may be tested for degree of emergence by our dynamical dependence measure), this is by no means always the case. For neural systems, for example, it is in general far from clear how to identify candidate emergent processes. Groups of neurons firing in synchrony might intuitively suggest an emergent variable, but there may be many more (and more subtle) patterns of neural activity that may count as emergent without being intuitively apparent to an observer. Our approach addresses this issue by "automating" the discovery process, through consideration of the full space of all admissible macroscopic variables; discovery of emergent variables then becomes a search/optimisation problem across this space. We introduce an ansatz that proposes the results of this search, across all scales, as an informative account of the emergence structure of the given system - an "emergence portrait". Parametric modelling, furthermore, opens up the possibility of data-driven discovery of emergent variables. We present an explicit calculation of dynamical dependence, and a detailed account of the search/optimisation process, for the important class of linear state-space models, suitable for wide deployment across a range of domains, including neural systems.

Related approaches
One difference between our approach and many related approaches concerns the role of the environment, and in particular the system/environment distinction (Krakauer et al., 2020). Another is our emphasis on discovery of emergent phenomena, whereas the majority of approaches, while furnishing criteria or metrics for emergence, do not specify how, for a given microscopic process, candidate emergent processes might actually be found in practice.
Our notion of dynamical independence bears a resemblance to informational closure, introduced by Bertschinger et al. (2006); a process Y is described as "informationally closed" with respect to an environment E when the transfer entropy T(E → Y) vanishes; that is, Y is dynamically-independent with respect to the environment. To compensate for "trivial" systems in which environment E and system Y are independent, this quantity is then subtracted from the mutual information I(Y_t : E^-_t) to yield the "non-trivial informational closure" (NTIC)

NTIC(E → Y) = I(Y_t : E^-_t) − T(E → Y) (5.1)

which, by the chain rule for mutual information, may also be expressed as I(Y_t : Y^-_t) − I(Y_t : Y^-_t | E^-_t). Chang et al. (2020) apply NTIC specifically to the case of coarse-grained macroscopic variables in the context of an environment. Their definition of a C-process requires that the macroscopic variable Y be (i) dynamically-independent of the system-environment "universe" (X, E), and (ii) NTIC with respect to the environment E. Note that condition (i) is not equivalent to dynamical independence of Y with respect to the system in the context of the environment. While Chang et al. (2020) associate a C-process with a measure of consciousness, it is perhaps more generally (and less contentiously) construed as a notion of autonomy or emergence in complex systems.
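For jointly-Gaussian processes these quantities can be computed in closed form from covariance matrices. The sketch below (the model coefficients and the truncation of histories to a single lag are our own simplifying assumptions) computes a one-lag T(E → Y) and NTIC for a toy linear system, and checks the chain-rule identity NTIC = I(Y_t : Y^-_t) − I(Y_t : Y^-_t | E^-_t):

```python
import numpy as np

A = np.array([[0.8, 0.0],    # E_t = 0.8 E_{t-1} + eta_t          (assumed)
              [0.6, 0.5]])   # Y_t = 0.6 E_{t-1} + 0.5 Y_{t-1} + eps_t

S = np.eye(2)                # stationary covariance: solves S = A S A^T + I
for _ in range(2000):
    S = A @ S @ A.T + np.eye(2)

# Covariance of w = (E_{t-1}, Y_{t-1}, E_t, Y_t)
Sigma = np.block([[S, S @ A.T],
                  [A @ S, S]])

def mi(a, b, c=()):
    """Gaussian (conditional) mutual information I(a : b | c) in nats,
    with a, b, c index lists into the covariance matrix Sigma."""
    ld = lambda idx: np.linalg.slogdet(Sigma[np.ix_(idx, idx)])[1] if idx else 0.0
    a, b, c = list(a), list(b), list(c)
    return 0.5 * (ld(a + c) + ld(b + c) - ld(c) - ld(a + b + c))

E_past, Y_past, Y_now = [0], [1], [3]
te = mi(Y_now, E_past, Y_past)             # one-lag T(E -> Y)
ntic = mi(Y_now, E_past) - te              # I(Y_t : E_past) - T(E -> Y)
# Chain-rule identity: NTIC = I(Y_t : Y_past) - I(Y_t : Y_past | E_past)
ntic2 = mi(Y_now, Y_past) - mi(Y_now, Y_past, E_past)
print(f"T(E->Y) = {te:.4f} nats, NTIC = {ntic:.4f} (identity check: {ntic2:.4f})")
```

Here Y is not informationally closed with respect to E (the transfer entropy is positive), and the NTIC is positive because Y is genuinely driven by the environment.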
Another relevant construct is G-emergence (Granger emergence; Seth, 2010). Seth (2010) firstly operationalises the "self-causation" or "self-determination" of a variable Y with respect to an external (multivariate) variable Z as the G-autonomy31 ga(Y | Z), which measures the degree to which inclusion of its own past enhances prediction of Y_t beyond the prediction afforded by the past of the external variable Z_t; in information-theoretic terms, ga(Y | Z) = I(Y_t : Y^-_t | Z^-_t). Given a microscopic process X_t and a macroscopic process Y_t, the G-emergence of Y with respect to X is then

ge(Y | X) = ga(Y | X) + T(X → Y) (5.2)

This expression operationalises the notion that an emergent macroscopic process is, in a predictive sense, at once autonomous from, but also dependent on, the microscopic process - again recalling the conceptual definition of "weak emergence" from (Bedau, 1997). G-emergence differs from dynamical independence in two main respects. First, it requires that the macroscopic variable be non-trivially self-predictive. Second, it includes a micro-to-macro term to assure, in an ad-hoc way, that macro and micro are related, in contrast to the principled approach to coarse-graining taken by dynamical independence.
We recognise immediately the second (transfer entropy) term in (5.2) - designed to ensure that the macro and the micro are related - as our dynamical dependence (2.5) in the absence of a coupled environment, although for G-emergence it "pulls in the opposite direction", in the sense that increasing T(X → Y) increases G-emergence but decreases dynamical independence. Note also, though, that our requirement (2.1) on macroscopic variables - which holds in particular for coarse-grained variables - actually stipulates (in the absence of an environment) that the G-autonomy contribution ga(Y | X) in (5.2) vanishes identically, thus leaving G-emergence as precisely our dynamical dependence, rather than independence, for the situations we consider.
A recent approach with both parallels and differences to ours is that of Rosas et al. (2020a). In contrast to our approach, their concern is explicitly with mereological (part-whole) causal relationships, such as downward causation, what they term causal decoupling and, in particular, causal emergence. The latter is quantified as the unique predictive capacity of a supervenient feature over the microscopic system, beyond the predictive capacity of (parts of) the microscopic system. This is almost the obverse of dynamical independence, which hinges on prediction of the macroscopic rather than the microscopic process. Supervenience for "features" as defined by Rosas et al. (2020a), it should be noted, does not generally correspond to our notion of supervenience for macroscopic variables. In contrast to our supervenience condition (2.1), the comparable condition in (Rosas et al., 2020a, Sec. II) is, in our notation,

I(Y_t : X_{t+1} | X_t) = 0 (5.3)

Although coarse-grained variables trivially satisfy both (2.1) and (5.3), the latter again speaks to prediction of the microscopic, rather than the macroscopic variable.
In order to express causal emergence in information-theoretic terms, Rosas et al. (2020a) make use of a partial information decomposition (PID; Williams and Beer, 2010; Wibral et al., 2017). One challenge for this approach is a lack of consensus on what a "canonical" PID might look like. Further, current PID candidates tend to be computationally intractable, and scale poorly with system size and macroscopic scale. In addition, the proposed measures are frequently framed in terms of discrete-valued (often finite) systems, and it is often unclear how they might be realised - or they become counter-intuitive and/or exhibit discontinuous behaviour - when extended to continuous-valued variables (Barrett, 2015). Connected with the last point, many (though not all) lack the transformation invariance of Shannon information (Chicharro et al., 2018; Rosas et al., 2020b). In recognition of the computational burden attached to PIDs, Rosas et al. (2020a) define Shannon information-based "large system approximations" for their measures, although it is unclear to what extent these reflect the intent of the respective PID formulations.
Closer in spirit to our approach is the theory of emergent brain macrostates propounded by Allefeld et al. (2009). Along similar lines to dynamical independence, they consider dynamics for macroscopic systems which are in a sense "self-contained" with respect to the microscopic dynamics; however, unlike our more general information-theoretic approach, they associate such dynamics with a (1st-order, discrete-valued) Markov property: "...the Markov-property criterion distinguishes descriptive levels at which the system exhibits a self-contained dynamics ('eigendynamics'), independent of details present at other levels." Emergent macroscopic processes are then identified with coarse-grainings which preserve the Markov property ["Markov partitions" (Adler, 1998)]. We note that a Markovian coarse-grained macroscopic variable would automatically satisfy our criterion (2.3) for dynamical independence.
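For finite-state chains, whether a coarse-graining preserves the Markov property can be checked mechanically via the classical strong-lumpability condition of Kemeny and Snell: a partition yields a Markov coarse-graining (for every initial distribution) iff, within each cell, all rows of the transition matrix assign identical total probability to every cell. A sketch with an assumed toy chain:

```python
import numpy as np

P = np.array([[0.1, 0.3, 0.4, 0.2],   # assumed toy 4-state transition matrix
              [0.2, 0.2, 0.5, 0.1],
              [0.3, 0.3, 0.2, 0.2],
              [0.5, 0.1, 0.3, 0.1]])

def lumpable(P, cells):
    """True iff the partition `cells` (a list of lists of micro-state
    indices) is strongly lumpable for P, i.e. the coarse-grained chain
    is Markov for every initial distribution (Kemeny-Snell condition)."""
    for cell in cells:
        # mass[i, j] = total transition probability from state cell[i] into cell j
        mass = np.array([[P[i, c].sum() for c in cells] for i in cell])
        if not np.allclose(mass, mass[0]):
            return False
    return True

print(lumpable(P, [[0, 1], [2, 3]]))   # a Markov coarse-graining
print(lumpable(P, [[0, 2], [1, 3]]))   # coarse chain fails to be Markov
```

In the first partition, both states in each cell carry the same cell-to-cell transition mass, so the two-state coarse chain is a self-contained Markov process; the second partition breaks this.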
Since low-level neural processes, and indeed neurophysiological recordings of these processes, do not naturally take the form of 1st-order discrete-valued Markov processes, Allefeld et al. (2009) devise a discrete approximation scheme33. They then seek Markovian coarse-grainings of the discretised Markov model in the form of metastable macrostates (Olivieri and Vares, 2005), and (putatively emergent) dynamics that transition between such macrostates at slow time scales compared to the underlying microscopic dynamics. This latter idea, more closely aligned with a thermodynamical perspective on coarse-graining (Green, 1952; Jeffery et al., 2019), seems worthy of further investigation in regard to dynamical independence.

Shalizi and Moore (2003) consider the "causal states" of a system, defined as equivalence classes of state histories which yield the same conditional distribution over future states. The sequence of causal states defines a Markov process. A (coarse-grained) macroscopic process is then deemed emergent if its causal states self-predict "more efficiently" than the causal states of the microscopic process, where the predictive efficiency of a process is measured in terms of the ratio of entropy rate to statistical complexity.

Hoel et al. (2013) formulate a notion of causal emergence based on effective information (Tononi and Sporns, 2003). Here, although macro is supervenient on micro, a coarse-grained macroscopic variable is deemed emergent to the extent that it leads to a gain in effective information. Effective information is calculated by comparing the distribution of prior states that could have caused a given current state (the "causal distribution") with the uniform distribution over the full repertoire of possible prior states. The KL-divergence of the causal distribution with respect to the uniform distribution is then averaged over the distribution of current states. The procedure is motivated by the Pearlian approach (Pearl, 2009), which identifies causation with the effects of counterfactual interventions (perturbations) on the system; the uniform (maximum-entropy) distribution then stands as an injection of random perturbations. A drawback of effective information, however, is that it assumes the existence of a uniform distribution of states, thus ruling out a large class of (in particular continuous-state) physical systems for which the uniform distribution does not exist; and even where it exists, it is not clear that the EI will be transformation-invariant. It may also be argued that a uniform distribution over prior states is in any case a purely notional, unphysical construct, and that its deployment consequently fails to reflect causation "as it actually happens" - that is, as stochastic dynamics play out over time. In a related approach, Friston et al. (2021) present a recursive partitioning of neuronal states based on effective connectivity graphs (Friston et al., 2013) and Markov blankets (Pearl, 1998), which they associate with emergent intrinsic brain networks at hierarchical spatiotemporal scales34.
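Effective information is straightforward to compute for a finite transition probability matrix. The sketch below uses a Hoel-style toy example (our own choice, not drawn from the text): three micro states mix uniformly while a fourth is absorbing, and the two-state coarse-graining yields a deterministic macro chain with greater effective information:

```python
import numpy as np

def ei(P):
    """Effective information (bits) of transition matrix P: the mutual
    information between X_t and X_{t+1} when X_t is set to the uniform
    maximum-entropy 'intervention' distribution."""
    Pbar = P.mean(axis=0)                 # effect distribution under uniform cause
    with np.errstate(divide="ignore", invalid="ignore"):
        kl = np.where(P > 0, P * np.log2(P / Pbar), 0.0).sum(axis=1)
    return kl.mean()                      # average KL of rows from Pbar

# Toy micro chain: states 0-2 mix uniformly among themselves; state 3 is absorbing.
P_micro = np.array([[1/3, 1/3, 1/3, 0.0],
                    [1/3, 1/3, 1/3, 0.0],
                    [1/3, 1/3, 1/3, 0.0],
                    [0.0, 0.0, 0.0, 1.0]])
# Coarse-graining A = {0,1,2}, B = {3} gives a deterministic macro chain.
P_macro = np.eye(2)

print(f"EI micro = {ei(P_micro):.3f} bits, EI macro = {ei(P_macro):.3f} bits")
```

The macro chain gains effective information (1 bit versus roughly 0.81 bits), the signature of causal emergence in this framework. Note that the uniform intervention distribution is well-defined here only because the state space is finite.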
Millidge (2021) presents a mathematical theory of abstraction which shares some commonalities with theories of emergence. An abstraction is considered as a set of "summaries" of a system which are sufficient to answer a specified set of "queries" regarding the time evolution of the system. Like macroscopic variables (in the broad sense), abstractions discard information about the system's detailed dynamics - in this case such information as turns out to be irrelevant to the specific queries. It is proposed that the irrelevant information be considered via the maximum-entropy principle (Jaynes, 1985), whereby uncertainty about detailed system behaviour is maximised within the constraint of retention of the ability to answer the queries. Like dynamically-independent macrovariables, abstractions might be considered to have a "life of their own" insofar as they retain sufficient information to predict their own behaviours at a macroscopic level. In common with our approach, Millidge (2021) places an emphasis on data-driven discovery of abstractions, by minimising their "leakiness" - that is, their departure from accurate prediction of the associated macrophenomena (cf. dynamical dependence). In contrast to dynamical independence, abstractions might be said to be driven by the agenda of the observer (in the form of specific queries), rather than, as in our case, unconstrained and intrinsic to the dynamical structure of the microsystem.
Finally, our approach is also clearly related to the general idea of dimensionality reduction in information theory, machine learning and beyond. Importantly, dynamical independence defines a very specific basis for dimensionality reduction, one which flows explicitly from the dynamics of the underlying microscopic system. This might be contrasted, for example, with principal component analysis (PCA), which is essentially determined by correlations within a dataset. Where the data derive from a dynamical process (e.g., econometric data, neuroimaging data, etc.), these correlations are contemporaneous, and as such fail to reflect in full the temporal dynamics of the generative process.

Relationship with autonomy
A macroscopic process Y that is dynamically-independent with respect to the microscopic process X might well be described as "autonomous of X". We avoid this usage, though, because conventionally the term autonomy carries two distinct connotations (Bertschinger et al., 2008): an autonomous process should not only be independent of external "driving" processes, but should also self-determine its evolution over time (Seth, 2010). As remarked in Section 1.1, a dynamically-independent macroscopic variable need not fulfil the self-determination criterion; dynamical independence does not equate to autonomy (cf. Section 5.1, Granger autonomy/emergence). In the extreme case, a dynamically-independent macroscopic variable might in fact be completely random, as in the following trivial VAR(1) example:

X_{1,t} = a X_{1,t−1} + b X_{2,t−1} + ε_{1,t} (5.4a)
X_{2,t} = ε_{2,t} (5.4b)

where ε_{1,t}, ε_{2,t} are uncorrelated white noises. Here the macroscopic (coarse-grained) white noise Y_t = X_{2,t} is clearly dynamically-independent of the microscopic process X_t. Note, however, that a completely random macroscopic variable is not necessarily dynamically-independent: if we replace (5.4b) with

X_{2,t} = ε_{1,t−1} (5.5)

then, while Y_t = X_{2,t} is still a white noise, it is no longer dynamically-independent of X_t35. We consider this as a positive feature of our definition of dynamical independence: as per our ansatz (1.1), if we discover that our microscopic system features a completely-random macroscopic variable at some scale, this tells us something useful about the system. We might even, via (2.8), choose to "factor out" this embedded randomness in order to better reveal significant causal structure.
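One concrete realisation of this example (the coefficients, and the particular replacement equation making X_2 an echo of the other innovation, are illustrative assumptions): simulation confirms that Y_t = X_{2,t} is white in both cases, but is recoverable from the microscopic history only in the modified case:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
a, b = 0.7, 0.4                    # assumed coefficients for the VAR(1)
e1, e2 = rng.standard_normal(N), rng.standard_normal(N)

def simulate(modified):
    x1, x2 = np.zeros(N), np.zeros(N)
    for t in range(1, N):
        x1[t] = a * x1[t-1] + b * x2[t-1] + e1[t]
        # original: X2 is an independent white noise; modified: X2 echoes
        # the other innovation one step late -- still white, but exactly
        # recoverable from the microscopic history
        x2[t] = e1[t-1] if modified else e2[t]
    return x1, x2

def r2(Y, Z):
    """Ordinary-least-squares R^2 of Y on regressors Z."""
    beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return 1.0 - (Y - Z @ beta).var() / Y.var()

results = {}
for modified in (False, True):
    x1, x2 = simulate(modified)
    ac = np.corrcoef(x2[1:], x2[:-1])[0, 1]          # whiteness check
    Z = np.column_stack([x1[1:-1], x2[1:-1], x1[:-2], x2[:-2]])
    results[modified] = (ac, r2(x2[2:], Z))          # predict Y_t from X history
    print(f"modified={modified}: lag-1 autocorr = {ac:+.3f}, "
          f"R^2 from micro history = {results[modified][1]:.3f}")
```

In the modified case the macroscopic "noise" is a deterministic function of two lags of the microscopic past, so its transfer entropy from X is strictly positive despite its flat autocorrelation.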

Discovery of emergent macroscopic processes in neural systems
Notwithstanding that the generative mechanisms underlying neural processes may be highly nonlinear, linear modelling is routinely deployed for the functional analysis of neural systems via neurophysiological recordings (indeed, correlation statistics are associated with linear regression; see also our discussion in Section 3). Granger causality based on VAR (and more recently state-space) modelling in particular is a popular technique for inference of directed functional connectivity (Seth et al., 2015; Barnett and Seth, 2015) from EEG, MEG and iEEG data36. The techniques described in Section 3.5 may thus be applied directly to estimated state-space models for such data, to infer the emergence portrait of neural systems. Issues of scale (see Section 3.5.1) remain significant at this stage, but do not appear to be intractable.
While it may be tempting to draw analogies between dynamically-independent macrovariables in neural systems and functional network analyses, e.g., default-mode networks (Raichle et al., 2001), this would be misleading; a dynamically-independent macrovariable is not a static "network", but rather a macro-scale dynamical entity in its own right, emerging from interactions on the "microscopic" scale (in this case, the scale set by neural recording channels associated with "small" brain regions). A fascinating question for future empirical research is whether specific emergent (dynamically-independent) macrovariables might be associated with ("neural correlates" of) large-scale neural phenomena, such as behaviours, cognition, and specific states of, or disorders of, consciousness.

Caveat
Finally, a caveat: the underlying intuition behind any study of emergence for real-world systems is that identifying emergent structure is likely to advance our understanding of the physical phenomena in question. While reasonable, this conclusion is not a given. Whether emergent dynamical structures turn out to be functionally relevant for explaining a particular system's behaviour will most often be an empirical question.

B Granger causality

Suppose the process X_t is partitioned into two sub-processes X_1t, X_2t, and consider the reduced regression of X_1t on its own history,

X_{1t} = Σ_{k≥1} A^R_k X_{1,t−k} + ε^R_{1t} (B.1)

where Σ^R_11 = E[ε^R_{1t} ε^{R⊤}_{1t}] is the corresponding error covariance matrix. Geweke (1982) then defines the Granger causality as the log-ratio of generalised variances

F(X_2 → X_1) = log( |Σ^R_11| / |Σ_11| ) (B.2)

which quantifies the degree to which the history of X_2t enhances prediction of X_1t beyond the degree to which X_1t is predicted by its own history alone. We note that if the innovations are Gaussian, then the generalised variance |Σ| is proportional to the likelihood function for X_t, so that (B.2) is a log-likelihood ratio, which under ergodic assumptions is asymptotically equivalent to the conditional entropy H(X_t | X^-_t) (Barnett and Bossomaier, 2013); this circumscribes the relationship between Granger causality and transfer entropy [cf. (2.5b)]. Under the classical "large-sample theory" (Neyman and Pearson, 1933; Wilks, 1938; Wald, 1943), the log-likelihood ratio as a sample statistic furnishes asymptotic F- and χ² tests for statistical inference on GC [but see also Gutknecht and Barnett (2019)].
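Geweke's measure can be estimated directly by ordinary least squares on simulated data; the sketch below (VAR coefficients and regression order are illustrative assumptions) compares full and reduced residual variances for a bivariate VAR(1) with one-way coupling:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200_000, 10                 # sample size and regression order (assumed)
e = rng.standard_normal((N, 2))
x = np.zeros((N, 2))
for t in range(1, N):              # VAR(1) with one-way coupling X2 -> X1
    x[t, 0] = 0.5 * x[t-1, 0] + 0.4 * x[t-1, 1] + e[t, 0]
    x[t, 1] = 0.6 * x[t-1, 1] + e[t, 1]

def resid_var(y, regs):
    """Residual variance of the OLS regression of y_t on p lags of `regs`."""
    Z = np.column_stack([r[p-k:N-k] for r in regs for k in range(1, p + 1)])
    beta, *_ = np.linalg.lstsq(Z, y[p:], rcond=None)
    return (y[p:] - Z @ beta).var()

def gc(target, source):
    """Geweke GC: log-ratio of reduced vs full residual variance."""
    full = resid_var(x[:, target], [x[:, target], x[:, source]])
    reduced = resid_var(x[:, target], [x[:, target]])
    return np.log(reduced / full)

print(f"F(X2 -> X1) = {gc(0, 1):.4f}")    # positive: X2 Granger-causes X1
print(f"F(X1 -> X2) = {gc(1, 0):.4f}")    # near zero: no feedback
```

Note that the reduced regression is in principle of infinite order; truncation at a finite order p is the usual practical approximation.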

B.1 Granger causality for linear state-space systems
Suppose that the observation process X_t for an innovations-form SS model (3.7) is partitioned as above into two sub-processes. Following Barnett and Seth (2015), we show how F(X_2 → X_1) may be calculated from the ISS parameters. The utility of innovations form (3.7) for GC analysis is that the innovations ε_t are precisely the residual error terms of the predictive VAR representation (3.1b) for X_t, which feature in the expression for Granger causality. Given (3.7), X_1t satisfies the SS model, now no longer in innovations form,

z_{t+1} = A z_t + K ε_t
X_{1t} = C_1 z_t + ε_{1t} (B.6)

where C, ε_t and Σ are partitioned concordantly with X_1t, X_2t. The joint noise covariance matrices for the SS (B.6) are given by

Q = K Σ K^⊤,  S = K Σ^*_1,  R = Σ_11 (B.7)

where Σ^*_1 = [Σ_11 Σ_12]^⊤ and, converting to innovations form, from (3.10a) we have

Σ^R_11 = C_1 P C_1^⊤ + R (B.8)

where P is the unique stabilising solution38 of the "reduced" DARE [cf. (3.9)]

P = A P A^⊤ − (A P C_1^⊤ + S)(C_1 P C_1^⊤ + R)^{−1}(A P C_1^⊤ + S)^⊤ + Q (B.9)

with Q, R, S as in (B.7), and the GC (B.2) may thus be calculated. The condition for vanishing F(X_2 → X_1) is given in Barnett and Seth (2015, eq. 17 & ff.). In the spectral domain, the formula (B.3) applies, with the transfer function as in (3.1a), and the CPSD specified by (3.2). Note that in the unconditional case presented here, there is no need to solve the DARE (B.9); the conditional spectral GC (required in particular if there is an environmental process E_t) is somewhat more complex (Barnett and Seth, 2015), and requires solution of a DARE.

B.2 Granger causality for finite-order VAR systems
Although finite-order pure-autoregressive models are a special case of state-space (equivalently, VARMA) models, we show here that for such models we may achieve a reduction in computational complexity.
We suppose that a VAR(p) model, p < ∞, is given by

X_t = Σ_{k=1}^p A_k X_{t−k} + ε_t (B.13)

with n × n coefficient matrices A_1, ..., A_p and residuals covariance matrix Σ. We may specify an equivalent (innovations-form) state-space model by

z_{t+1} = A z_t + K ε_t
X_t = C z_t + ε_t

where A is the pn × pn "companion matrix" (Hannan and Deistler, 2012), with first block-row [A_1 A_2 ... A_p] and identity blocks on the subdiagonal, and C, K are as in (B.14) below. The VAR(p) is stable iff the companion matrix A is stable, and is always miniphase. We suppose again that X_t is partitioned into sub-processes X_1t, X_2t. We may then calculate the Granger causality F(X_2 → X_1) as already described in Appendix B.1. Note that the DARE (B.9) is a pn × pn matrix equation. In Gutknecht and Barnett (2019, Appendix B) it is shown that for VAR systems the DARE may be reduced to pn_2 × pn_2 dimensions, where n_2 is the dimension of X_2t. Specifically, the reduced model residuals covariance matrix Σ^R_11 in the expression (B.2) for the Granger causality F(X_2 → X_1) is obtained from the solution of this reduced pn_2 × pn_2 DARE. The condition for vanishing F(X_2 → X_1) is given by (B.20).
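A minimal check of the companion-form construction (the coefficient matrices are assumed for illustration): the state-space model reproduces the VAR(2) exactly, and stability can be read off the spectral radius of the companion matrix:

```python
import numpy as np

A1 = np.array([[0.5, 0.2], [0.0, 0.4]])   # assumed VAR(2) coefficients
A2 = np.array([[-0.2, 0.0], [0.1, -0.1]])
n, p = 2, 2

C = np.hstack([A1, A2])                                        # n x pn
A = np.vstack([C, np.hstack([np.eye(n), np.zeros((n, n))])])   # pn x pn companion
K = np.vstack([np.eye(n), np.zeros((n, n))])                   # pn x n

rho = np.max(np.abs(np.linalg.eigvals(A)))   # VAR stable iff rho < 1

rng = np.random.default_rng(2)
T = 500
eps = rng.standard_normal((T, n))
x_var = np.zeros((T, n))     # direct VAR recursion (zero pre-sample values)
x_ss = np.zeros((T, n))      # state-space recursion
z = np.zeros(p * n)          # SS state: stacked lagged observations
for t in range(T):
    xm1 = x_var[t-1] if t >= 1 else np.zeros(n)
    xm2 = x_var[t-2] if t >= 2 else np.zeros(n)
    x_var[t] = A1 @ xm1 + A2 @ xm2 + eps[t]
    x_ss[t] = C @ z + eps[t]
    z = A @ z + K @ eps[t]

disc = np.max(np.abs(x_var - x_ss))
print(f"spectral radius = {rho:.3f}, max VAR/SS discrepancy = {disc:.2e}")
```

The state z_t simply stacks the p most recent observations, which is why the construction is exact rather than approximate.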

D Dynamical independence for flows
A smooth (local) flow ξ(x, t) satisfying (4.3) may be considered as a vector field on R^n, with associated gradient operator

∇_ξ = Σ_{i=1}^n g_i(x) ∂/∂x_i (D.1)

where the associated ODE is ẋ(t) = g(x), and for any function ϕ(x) on R^n we have

d/dt ϕ(ξ(x, t)) = (∇_ξ ϕ)(ξ(x, t)) (D.2)

for all x, t. As described in Section 4.2, a macroscopic variable y(t) = f(x(t)), where f : R^n → R^m is a coarse-graining40, is deemed dynamically-independent of the process x(t) if it is itself described by a flow; that is, if there is a flow η : R^m × R → R^m such that (4.4) is satisfied. Differentiating (4.4) with respect to t and setting t = 0, we find that y(t) = f(x(t)) is dynamically-independent with respect to x(t) iff

∇_ξ f(x) = h(f(x)) (D.3)

for all x, for some mapping h : R^m → R^m; the associated ODE for y(t) is then ẏ(t) = h(y).
We wish to find a general form for dynamically-independent coarse-grainings. We thus seek scalar functions τ(x), u(x) satisfying

∇_ξ τ = 1 (D.4a)
∇_ξ u = 0 (D.4b)

From (D.2), we see that solutions of (D.4b) are invariants of the flow ξ - that is, u(ξ(x, t)) does not vary with t - while solutions of (D.4a) "parametrise time" along the trajectories of ξ, in the sense that τ(ξ(x, t)) changes at a constant unit rate with t. If the flow ξ(x, t) is known explicitly, invariants may be obtained analytically by eliminating t between pairs ξ_i(x, t), ξ_j(x, t), i ≠ j. It will thus in general (at least locally on R^n) be possible to find a basis set u(x) = (u_1(x), ..., u_{n−1}(x)) of n − 1 functionally-independent invariants41, such that any invariant of the flow is of the form U(u(x)). Given a set of n − 1 invariants u = (u_1, ..., u_{n−1}), to solve (D.4a) let v(x) be any scalar function such that x ↦ (v(x), u(x)) defines a nonsingular transformation of R^n. In the new coordinate system (v, u_1, ..., u_{n−1}) we have ∇_ξ = (∇_ξ v) ∂/∂v, and we may calculate

θ(v, u) = ∫ dv / ∇_ξ v (D.7)

(an indefinite integral evaluated at fixed u, with ∇_ξ v expressed in the coordinates (v, u)), whence the general solution of (D.4a) is

τ(x) = θ(v(x), u(x)) + U(u(x)) (D.8)

where U(x) is an arbitrary invariant. The choice of particular v(x) is not significant, insofar as we may verify42 that θ(v(x), u(x)) is unique up to an additive function of u(x) alone, which may be absorbed into U(x). In practice v(x) may be chosen for convenience of evaluation of the indefinite integral in (D.7). In the coordinate system (τ, u), we have ∇_ξ = ∂/∂τ; intuitively, the transformation x ↦ (τ(x), u(x)) "flattens out" the flow, so that points in R^n are transported at unit rate along straight-line trajectories parallel to the τ-axis.
Returning to the general solution of (D.3) - i.e., finding all dynamically-independent coarse-graining maps f : R^n → R^m for the flow ξ - we take first the case where the coarse-grained flow η is non-trivial (we continue to assume that ξ is non-trivial). Then, from the above analysis, we can always find transformations of R^n and R^m that (at least locally) "flatten out" the respective flows ξ, η as described above. Under these transformations, the condition (D.3) for dynamical independence becomes

∂f_1/∂τ = 1 (D.9a)
∂f_α/∂τ = 0,  α = 2, ..., m (D.9b)

so that, transforming back to the original coordinates, the general form of a dynamically-independent coarse-graining is

f(x) = Ψ(τ(x), u(x)) (D.10)

with u(x) = (u_1(x), ..., u_{m−1}(x)) a set of m − 1 functionally-independent invariants, τ(x) as in (D.8), and Ψ(y) an arbitrary diffeomorphism of R^m. In the case where the coarse-grained flow η(y, t) is trivial, it is easy to see that f(x) must itself comprise a set of m functionally-independent invariants.

D.1 Worked example
Consider the system of ODEs on R^3 defined by

ẋ_1 = −x_2 (D.11a)
ẋ_2 = x_1 (D.11b)
ẋ_3 = 2 x_1 x_2 (D.11c)

We have g(x) = (−x_2, x_1, 2x_1x_2), so that

∇_ξ = −x_2 ∂/∂x_1 + x_1 ∂/∂x_2 + 2x_1x_2 ∂/∂x_3

It is readily verified that

u_1 = x_1² + x_2²,  u_2 = x_3 − x_2²

are functionally-independent invariants, and for simplicity we choose v = x_3, which is functionally independent of u_1, u_2. We have ∇_ξ v = 2x_1x_2, and the general solution to (D.4a) is then

τ(x) = arctan(x_2/x_1) + U(u_1, u_2)

where U(u_1, u_2) is an arbitrary function of two variables. The dynamically-independent coarse-grainings are then obtained from (D.10).
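The worked example is easily checked numerically: integrating (D.11) with RK4, the quantities u_1 = x_1² + x_2² and u_2 = x_3 − x_2² are conserved to integration accuracy, while the time-like function τ = arctan(x_2/x_1) advances at unit rate along trajectories:

```python
import numpy as np

def g(x):   # the vector field of (D.11)
    return np.array([-x[1], x[0], 2.0 * x[0] * x[1]])

def rk4_step(x, dt):
    """One classical RK4 step for xdot = g(x)."""
    k1 = g(x); k2 = g(x + 0.5 * dt * k1)
    k3 = g(x + 0.5 * dt * k2); k4 = g(x + dt * k3)
    return x + (dt / 6) * (k1 + 2 * k2 + 2 * k3 + k4)

dt, nsteps = 0.001, 5000
traj = np.empty((nsteps + 1, 3))
traj[0] = [1.0, 0.5, -0.3]                  # arbitrary initial condition
for t in range(nsteps):
    traj[t + 1] = rk4_step(traj[t], dt)

u1 = traj[:, 0]**2 + traj[:, 1]**2          # invariant
u2 = traj[:, 2] - traj[:, 1]**2             # invariant
tau = np.unwrap(np.arctan2(traj[:, 1], traj[:, 0]))   # time-like function

print(f"u1 drift = {np.ptp(u1):.2e}, u2 drift = {np.ptp(u2):.2e}, "
      f"mean d(tau)/dt = {np.diff(tau).mean() / dt:.6f}")
```

Geometrically, the flow is a unit-rate rotation in the (x_1, x_2)-plane, so the polar angle (computed here with arctan2 and unwrapped across branch cuts) is precisely the "flattened" time coordinate.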
Pairwise-conditional Granger causality matrix: columns index source node, rows target node.

Figure 1: Granger-causal structure for a 9-variable VAR(7) model comprising three fully-connected modules with two inter-module connections. The model was constructed by randomly generating autoregression coefficients

Figure 3: Emergence portrait (I): results of (1+1)-ES dynamical dependence minimisation at all scales of the 9-variable VAR(7) model in Fig. 1. At each scale, the heights of the bars indicate the sorted optimal (minimised) dynamical dependencies for 100 runs with uniform random initialisation. See main text for details.

Figure 4: Pair-wise inter-optimum distances between locally-optimal subspaces for the 9-variable VAR(7) model in Fig. 1, for 100 independent optimisation runs. See main text for details.

Figure 6: Co-planarity of locally-optimal subspaces with subspaces of the same dimension spanned by coordinate axes, for the 9-variable VAR(7) model in Fig. 1. The height of the bars is 1 − θ_max, where 0 ≤ θ_max ≤ 1 is the normalised maximum principal angle between the locally-optimal subspace and the corresponding coordinate subspace; so 1 indicates co-planarity, 0 orthogonality. The horizontal scale is the dictionary ordering of the n

the n × pn matrix

C = [A_1 A_2 ... A_{p−1} A_p] (B.14)

and K the pn × n matrix K = [I 0 ... 0]^⊤.

& ff.):

F(X_2 → X_1) = 0 ⟺ A_{k,12} ≡ 0 for k = 1, ..., p (B.20)

For spectral GC, the formula (B.3) applies, with transfer function H(z) = (I − Σ_{k=1}^p A_k z^k)^{−1}.

C Analytic optimisation of state-space dynamical dependence

To minimise F(X → Y) = log |LVL^⊤| (eq. 3.18) over the set of m × n matrices39 L = (L_{αi}) under the orthonormality constraint LL^⊤ = I, we introduce Lagrange multipliers Λ = (λ_{αβ}) (Λ is m × m symmetric) and solve simultaneously

∇ log |LVL^⊤| = ∇ trace(ΛLL^⊤) (C.1a)
LL^⊤ = I (C.1b)

where V = V(L) = CP(L)C^⊤ + I with P(L) the solution of the DARE (3.17), and for a tensor ψ_{jk...}, ∇ψ denotes the tensor with entries ψ_{jk...,αi} = ∂ψ_{jk...}/∂L_{αi}. We may calculate that

∇ log |LVL^⊤| = 2LQV + 2R (C.2a)
∇ trace(ΛLL^⊤) = 2ΛL (C.2b)

where we have set

Q = L^⊤ (LVL^⊤)^{−1} L (C.3a)
R = ½ trace[Q ∘ ∇V] (C.3b)

Here Q is n × n, and trace[Q ∘ ∇V] is to be understood as the m × n matrix with entries Σ_{j,k} Q_{jk} V_{kj,αi}. Equations (C.1) and (C.2) then yield

LQV + R = ΛL (C.4)

Multiplying both sides on the right by L^⊤, and noting that LQVL^⊤ = I, yields the Lagrange multiplier

Λ = I + RL^⊤ (C.5)

and the optimisation equations may thus be written

R(I − L^⊤L) = L(I − QV) (C.6)

to be solved under the constraint LL^⊤ = I, where all derivatives of the DARE solution P(L) with respect to L are now collected in the LHS term R(L). Note that if LL^⊤ = I then I − L^⊤L will be singular, with rank n − m. To solve (C.6) requires ∇V = C ∘ ∇P ∘ C^⊤, where ∇P may (in principle) be derived by partial differentiation of the DARE (3.17) with respect to the L_{αi}; (C.6) then becomes a set of highly-nonlinear partial differential equations, in which the terms P and ∇P are only defined implicitly. The calculation thus appears, at this stage, intractable.
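The gradient formula (C.2a) can be sanity-checked by finite differences in the special case of a constant V, so that the DARE-derivative term R vanishes; the matrix V, the orthonormal L and the dimensions below are arbitrary test values:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 5, 2
M = rng.standard_normal((n, n))
V = M @ M.T + n * np.eye(n)          # a generic symmetric positive-definite V
L = np.linalg.qr(rng.standard_normal((n, m)))[0].T   # orthonormal rows: L L^T = I

def F(L):
    """Objective log |L V L^T| (log-determinant, via slogdet)."""
    return np.linalg.slogdet(L @ V @ L.T)[1]

# Analytic gradient (C.2a) with R = 0: grad = 2 L Q V, Q = L^T (L V L^T)^{-1} L
Q = L.T @ np.linalg.inv(L @ V @ L.T) @ L
grad_analytic = 2 * L @ Q @ V

# Central finite differences over every entry L_{alpha,i}
h = 1e-6
grad_fd = np.zeros_like(L)
for a in range(m):
    for i in range(n):
        E = np.zeros_like(L); E[a, i] = h
        grad_fd[a, i] = (F(L + E) - F(L - E)) / (2 * h)

print("max gradient error:", np.max(np.abs(grad_analytic - grad_fd)))
```

In the full problem V depends on L through the DARE, and the neglected term R collects exactly those derivatives; this check only validates the explicit part of (C.2a).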