Dynamical maximum entropy approach to flocking

We derive a new method to infer from data the out-of-equilibrium alignment dynamics of collectively moving animal groups, by considering the maximum entropy distribution consistent with temporal and spatial correlations of flight direction. When bird neighborhoods evolve rapidly, this dynamical inference correctly learns the parameters of the model, while a static one relying only on the spatial correlations fails. When neighbors change slowly and detailed balance is satisfied, we recover the static procedure. We demonstrate the validity of the method on simulated data. The approach is applicable to other systems of active matter.

Flocking, the highly coordinated motion displayed by large groups of birds, has attracted much attention over the last twenty years as a prototypical example of out-of-equilibrium collective behavior. It has been suggested that flocking is an emergent phenomenon resulting from mutual alignment of velocities between neighboring birds, much like the spontaneous symmetry breaking towards a magnetized state exhibited by ferromagnetic spins at low temperatures. Although this idea has been extensively studied from a theoretical viewpoint [2][3][4], only recently have advances in the 3D imaging of large flocks of starlings [5] given empirical grounds supporting this picture. Interactions between individuals in the flock were shown to be topological and local [6], leading to the global ordering of flight orientations and scale-free correlation functions [7]. The analogy with ferromagnetic systems was made explicit by the quantitative inference of spin models from empirical data using the principle of maximum entropy [8,9]. These analyses have focused on the steady-state behaviour of flocks, treating the flock configurations as drawn from a given statistical ensemble. This approach allows for an effective equilibrium-like description, without having to make detailed assumptions about the microscopic rules governing flock behaviour. Yet it is an incomplete picture, as it does not take into account the dynamical, out-of-equilibrium nature of the process.
The major difference between flocks and equilibrium spin systems is that birds are active particles, constantly moving within the flock along the direction given by their "spin", exchanging local interaction partners, thus extending their effective interaction range, and also breaking detailed balance. This qualitative difference between equilibrium spins and out-of-equilibrium active particles can dramatically affect the thermodynamic properties of the system, including the existence of an ordered phase in two dimensions and the value of the critical exponents [3]. The parameters of static descriptions of flocks can thus only be naively interpreted as renormalized versions of some underlying, unknown out-of-equilibrium dynamical model.
In this paper we propose a general framework for learning the features of the out-of-equilibrium dynamics directly from data, while making minimal assumptions about the specific microscopic interaction rules. We generalize the principle of maximum entropy to account for multi-time correlations between birds, and show that maximizing the entropy under this constraint is equivalent to inferring a dynamical model of social forces. We test our dynamical inference method on synthetic data generated by a topological Vicsek model (VM), showing that its inferred interaction parameters are consistently better than the ones obtained in an equilibrium framework, especially when the relative mobility between individuals is high. When the interaction network is static, and the dynamics satisfies detailed balance, our method recovers the results of the static approach [8], additionally allowing us to separate the contributions of interaction strength and noise to the alignment dynamics.
Maximum entropy distributions are the least constrained distributions that are consistent with certain selected key observables of the data. They usually map onto equilibrium statistical mechanics problems and do not involve any assumptions about the system under study, besides the choice of the relevant observables, which should be selected according to the fundamental symmetries of the underlying system. They have been particularly successful in describing collective and emergent phenomena in biological systems comprising many correlated degrees of freedom [10]. When considering flocks, where polar order is present, a natural choice of observables to be constrained by the data are the equal-time pairwise correlation functions between bird orientations, $\langle s_i s_j \rangle$, where $s_i$ is a $d$-dimensional unit vector denoting the flight direction of bird $i$, with $i = 1, \ldots, N$. (Throughout the paper inner products over the physical space are implicit.) These correlations were found to exhibit scale-free behavior in natural flocks [7], and characterize the collective nature of flocking. The maximum entropy distribution $P(s)$ for the orientations can then be computed by maximizing the entropy $S[P] = -\sum_s P(s) \ln P(s)$, while constraining the equal-time correlations to their experimental values. The result is the stationary probability distribution of the equilibrium heterogeneous Heisenberg model [8]:

$$P(s) = \frac{1}{Z}\,\exp\Big[\frac{1}{2}\sum_{ij} J^{\rm stat}_{ij}\, s_i s_j\Big], \qquad (1)$$

where $s$ is a shorthand for $(s_1, s_2, \ldots, s_N)$ and $Z$ a normalization constant. The interaction parameters $J^{\rm stat}_{ij}$ are Lagrange multipliers that need to be tuned so that the probability distribution (1) matches the empirical correlation functions $\langle s_i s_j \rangle$. Using 3D, single-individual resolution data of large bird flocks, this class of models was shown to recapitulate quantitatively the ordering properties of real flocks [8].
But infinitely many dynamical models may give rise to this steady-state distribution, most of which break detailed balance. In fact, the change of neighborhoods causes the interaction network to vary in time, keeping the system constantly out of equilibrium. Here we extend the maximum entropy framework to account for the nonequilibrium nature of flocking. We consider the set of entire trajectories $(s^1, s^2, \ldots, s^T)$, where the superscript index denotes time points separated by $\delta t$. We then look for the distribution $P(s^1, \ldots, s^T)$ that maximizes the entropy while reproducing some given experimental observables. Since we want to capture the dynamics, in addition to equal-time correlation functions, we also constrain the correlation functions between two consecutive time points, $\langle s^{t+1}_i s^t_j \rangle$. Doing so yields the following form of the probability distribution over trajectories (see Appendix for details):

$$P(s^1, \ldots, s^T) = \frac{1}{\hat Z}\, e^{-\mathcal{A}}, \qquad (2)$$

where $\hat Z$ is a normalization factor, and the "effective action" (or minus log-likelihood) reads:

$$\mathcal{A} = -\frac{1}{2}\sum_t \sum_{ij} J^{(1)}_{ij;t}\, s^t_i s^t_j \;-\; \sum_t \sum_{ij} J^{(2)}_{ij;t}\, s^{t+1}_i s^t_j. \qquad (3)$$

There now are two sets of time-dependent coupling parameters, $\{J^{(1)}_{ij;t}\}$ and $\{J^{(2)}_{ij;t}\}$, for synchronous and consecutive times. We note that the probability (Eq. 2) corresponds to Markovian dynamics; non-Markov forms are possible if one constrains more complex multi-time observables.
When flight orientations are highly polarized (as is the case in starling flocks [7]), one can use the spin-wave (SW) approximation [11] to rewrite the action explicitly as a sum of Markov terms which are quadratic in the spin-wave variables. Specifically, we write $s_i = \pi_i + n\sqrt{1 - (\pi_i)^2}$, where $n$ is an arbitrary unit vector close to the average flight direction of the flock, and $\pi_i$ is the perpendicular component of the orientation, $\pi_i\, n = 0$. (When there is no ambiguity we drop the time superscript.) When the flock is highly polarized, we have $\pi_i^2 \ll 1$, and we may expand at small $\pi_i$. The action may then be written as a sum of terms corresponding to the transition probabilities $P(\pi'|\pi)$ between successive time points (see Appendix for technical details):

$$P(\pi^{t+1}|\pi^t) = e^{-\mathcal{L}_t}, \qquad \mathcal{L}_t = \sum_{ij} A_{ij;t}\,\big[\pi^{t+1}_i - (M_t \pi^t)_i\big]\big[\pi^{t+1}_j - (M_t \pi^t)_j\big] + \text{const}, \qquad (4)$$

where $\mathcal{L}_t$ is formally equivalent to a Lagrangian density. In (4) we have defined $M_{ij;t} = \sum_k (A_t^{-1})_{ik}\, J^{(2)}_{kj;t}/2$ and $A_t = -J^{(1)}_t/2 - K_t$, where $K_t$ is a calculation intermediate obtained by a descending recursion enforcing normalization at each time step, $K_{t-1} = J^{(2)\,\top}_t A_t^{-1} J^{(2)}_t/4$. The Gaussian form of the transition probabilities, Eq. (4), corresponds to a spin-wave dynamics described by the following stochastic equation:

$$\pi^{t+1}_i = \sum_j M_{ij;t}\, \pi^t_j + \epsilon^t_i, \qquad (5)$$

with $\epsilon^t$ a random, isotropic Gaussian noise perpendicular to $n$, of zero mean and covariance:

$$\langle \epsilon^t_i\, \epsilon^{t'}_j \rangle = \frac{d-1}{2}\,(A_t^{-1})_{ij}\; \delta_{t,t'}, \qquad (6)$$

where $\delta_{t,t'}$ is the Kronecker delta. Eq. (5) can be interpreted as follows. At each time, individual $i$ computes its new orientation from a weighted average over the orientations of other individuals, including itself, at the previous time point, with weights encoded in the matrix $M_t$ (one can check that, by construction, $\sum_j M_{ij} = 1$). The noise $\epsilon^t$ added to this average determines the level of error in the alignment. Without it, all individuals would end up perfectly aligned. This model may be viewed as the spin-wave expansion of a generalized Vicsek model [1] with arbitrary weights and noise.
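As a concrete illustration of the weighted-average update Eq. (5), the sketch below simulates one step of the dynamics for an assumed ring-graph neighbourhood, using the simple parametrization of the weights introduced later in the text. All parameter values, and the ring graph itself, are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the paper): N birds, alignment strength J,
# temperature T, time step dt, dimension d of the flight directions.
N, J, T, dt, d = 50, 1.0, 0.02, 0.01, 3

# Fixed adjacency n_ij: each bird interacts with its 2 neighbours on a ring.
n = np.zeros((N, N))
for i in range(N):
    n[i, (i - 1) % N] = n[i, (i + 1) % N] = 1.0
ni = n.sum(axis=1)

# Alignment matrix M_ij = (1 - J n_i dt) delta_ij + J dt n_ij.
M = np.diag(1.0 - J * ni * dt) + J * dt * n
# By construction each row sums to 1: M defines a weighted average.
assert np.allclose(M.sum(axis=1), 1.0)

# One step of pi' = M pi + eps, with per-component noise variance 2 T dt
# (so the total variance over the d-1 transverse components is 2(d-1) T dt).
pi = rng.normal(scale=0.05, size=(N, d - 1))
eps = rng.normal(scale=np.sqrt(2.0 * T * dt), size=(N, d - 1))
pi_next = M @ pi + eps
```

Because the rows of M sum to one, a uniform shift of all the $\pi_i$ is preserved up to noise: without noise, a perfectly aligned configuration is a fixed point of the update.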
Tuning the parameters to match the correlation functions is equivalent to maximizing the likelihood, Eq. (2) (see Appendix), or equivalently maximizing the log-likelihood $-\sum_t \mathcal{L}_t$, with which we will work from now on. To maximize the likelihood with respect to the two equivalent sets of parameters, $\{J^{(1)}_{ij;t}, J^{(2)}_{ij;t}\}$ or $\{M_t, A_t\}$, we would need to observe a large number of random realizations of the same flock dynamics. This is impossible in practice, because the available data are limited compared to the prohibitively large number of potential configurations of the bird positions that one would need to sample.
To overcome this problem, we need to introduce some additional assumptions about the interaction network and the form of the noise, in order to simplify the parameter space and reduce the number of observables. These simplifications come naturally in the Markovian description parametrized by $M_t$ and $A_t$. From a biological standpoint, it is reasonable to assume that birds treat information from each interacting neighbor (the precise definition of "neighborhood" being left unspecified for the moment) equally, while keeping memory of their own direction. Mathematically this translates into:

$$M_{ij} = (1 - J n_i\, \delta t)\, \delta_{ij} + J\, \delta t\; n_{ij},$$

where $n_{ij} = 1$ if $j$ is one of $i$'s neighbours, and 0 otherwise, and $n_i = \sum_j n_{ij}$ is the total number of neighbors interacting with bird $i$. (For ease of notation we omit the $t$ index, even though $n_{ij}$ depends on $t$.) The scalar parameter $J$ now measures the alignment interaction strength. Errors made by different birds when trying to align with their neighbours can be assumed to be of the same amplitude and independent of each other, so that the noise is uncorrelated and $A$ is proportional to the identity:

$$A_{ij} = \frac{\delta_{ij}}{4 T\, \delta t}.$$

Here $T$ is a squared noise amplitude (the out-of-equilibrium equivalent of a temperature) that sets the level of disorder in the system. The scaling in $\delta t$ ensures a well-defined continuous limit when $\delta t \to 0$, described by a Langevin equation. We can reconcile this dynamical description with the static inference [8] in the special case of equilibrium dynamics, which is realized when $n_{ij}$ is symmetric and constant in time. In this case, for $\delta t \to 0$ the spins can be described by a stationary distribution with the same form as in Eq. (1), and the steady-state couplings take the simple equilibrium value [8], $J^{\rm stat}_{ij} = (J/T)\, n_{ij}$ (see Appendix). Taking the specific form of $M_t$ and $A_t$ above for a given network of neighbours, we obtain a formula for $\mathcal{L}_t$ that only depends on two parameters, the interaction strength $J$ and the "effective temperature" $T$, with $n_c = (1/N)\sum_i n_i$.
Also, the number of independent observables appearing in L t is drastically reduced, to a handful of empirical integrated pair correlation functions defined in Table I. These correlations can be evaluated over pairs of consecutive configurations, or averaged over the entire sequence if we work with time-independent parameters and steady state dynamics.
Maximizing the log-likelihood with respect to $J$ and $T$, $\partial \mathcal{L}_t/\partial T = 0$ and $\partial \mathcal{L}_t/\partial J = 0$, yields simple analytical expressions for the parameters as functions of the empirical correlation functions (Eqs. (8)-(9)). The leading-order temperature $T_0$ is the derivative of a self-correlation function, and obeys the standard fluctuation-dissipation relationship found in equilibrium dynamics. The term $\Omega$ is related to the dynamics of the network. In particular, at steady state $dC_{\rm int}/dt = 0$ implies $\Omega \propto \sum_{ij} \pi_i \pi_j\, dn_{ij}/dt$. In order to apply Eqs. (8)-(9) to data, one still needs to specify the neighboring matrix $n_{ij}$. In the absence of prior information, the simplest possibility is to assume that each bird interacts with its first $n_c$ neighbors [8]. An alternative choice would be to define neighbors according to a metric rule, each bird interacting with neighbors within a given distance $r_c$. In both cases an extra parameter is introduced, either the 'topological' interaction range $n_c$ or the metric range $r_c$, which can also be inferred by likelihood maximization. Another scheme is to define neighbors through a Voronoi tessellation [12], as in the Topological VM [4]. Likelihoods obtained with different neighborhood definitions may also be compared to find the one closest to optimality.
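The closed-form expressions of Eqs. (8)-(9) are not reproduced above, but the same maximum-likelihood logic can be illustrated numerically. The hedged sketch below generates a trajectory of the Gaussian spin-wave model on an assumed fixed ring graph and recovers $J$ and $T$ by minimizing the transition residuals (for a Gaussian transition density, the optimal $T$ at fixed $J$ is the mean squared residual divided by $2\delta t$, and the optimal $J$ minimizes the residuals themselves). All names and parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Ground-truth parameters used to generate synthetic data (illustrative).
N, J_true, T_true, dt, d, steps = 30, 1.0, 0.02, 0.05, 3, 2000

# Fixed ring-graph adjacency (an assumption for this sketch).
n = np.zeros((N, N))
for i in range(N):
    n[i, (i - 1) % N] = n[i, (i + 1) % N] = 1.0
ni = n.sum(axis=1)

def alignment_matrix(J):
    """M_ij = (1 - J n_i dt) delta_ij + J dt n_ij."""
    return np.diag(1.0 - J * ni * dt) + J * dt * n

# Generate a trajectory pi^{t+1} = M pi^t + noise (per-component var 2 T dt).
M_true = alignment_matrix(J_true)
traj = np.zeros((steps + 1, N, d - 1))
for t in range(steps):
    eps = rng.normal(scale=np.sqrt(2.0 * T_true * dt), size=(N, d - 1))
    traj[t + 1] = M_true @ traj[t] + eps

def mean_sq_residual(J):
    """Mean squared one-step prediction error for a candidate J."""
    M = alignment_matrix(J)
    r = traj[1:] - np.einsum('ij,tjk->tik', M, traj[:-1])
    return np.mean(r ** 2)

# Profile likelihood: scan J on a grid, then read off T from the residuals.
Js = np.linspace(0.5, 1.5, 101)
J_hat = Js[np.argmin([mean_sq_residual(J) for J in Js])]
T_hat = mean_sq_residual(J_hat) / (2.0 * dt)
```

With enough data, `J_hat` and `T_hat` converge to the generating values, mirroring the fluctuation-dissipation-like estimate of $T_0$ described in the text.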
We tested our dynamical inference method on synthetic data generated from a slight generalization of the Topological VM on a two-dimensional torus of linear size $L = 32$ with $N = 1024$ particles:

$$\theta^{t+\delta t}_i = {\rm Arg}\Big[s^t_i + \delta t\, J_V \sum_j n_{ij}\, s^t_j\Big] + \xi^t_i, \qquad r^{t+\delta t}_i = r^t_i + v_0\, \delta t\; s^{t+\delta t}_i, \qquad (11)$$

where $s_i = (\cos\theta_i, \sin\theta_i)$, and ${\rm Arg}(s)$ is the angle of the vector $s$. The delta-correlated angular noise $\xi^t_i$ is uniformly distributed in $[-\eta\pi, +\eta\pi]$, corresponding to an effective temperature $T_V = (\eta\pi)^2/6$ for $\delta t \to 0$. The Voronoi adjacency matrix $n_{ij}$ has a non-uniform degree $n_i$, of mean $n_V = 6$. A spin-wave expansion of Eq. (11) leads to an expression of the form of (5)-(6), with $J \approx J_V/(1 + J_V n_V \delta t)$ (see Appendix). The degree of neighbor mixing is characterized by a single mixing parameter $\mu = (1/(N n_c)) \sum_{ij} |dn_{ij}/dt|$, which quantifies how fast birds exchange neighbors. We performed simulations with time step $\delta t = 0.01$ in three regimes with slow, medium and fast neighbor mixing ($\mu = 0.18, 0.35, 0.76$; $v_0 = 0.5, 1.0, 2.0$; $J_V = 1.0, 1.0, 0.1$; and $\eta = 0.3, 0.2, 0.12$, respectively), all of which display the same level of polarization, $N^{-1}|\sum_i s_i| \approx 0.97$. We then applied the inference procedure described in Eqs. (8)-(9) to the synthetic dataset generated by the simulations. In the inference we tried the choices for $n_{ij}$ discussed above: the $n_c$ nearest-neighbor (NN) topological rule, the metric rule where $n_{ij} = 1$ within a metric range $r_c$ (and 0 outside), and the Voronoi rule (actually used to generate the data). Correlation functions were averaged over $10^3$ different configurations in the stationary state, sampled from a single run at 100 time-unit intervals, ensuring independent sampling.
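A minimal self-propelled simulation in the spirit of Eq. (11) can be sketched as follows. To keep it short, Voronoi neighbours are replaced by $k$ nearest neighbours, the update is the standard Vicsek angular average rather than the exact Eq. (11), and the initial state is taken close to full alignment to match the high-polarization regime studied in the paper. All parameter values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative parameters: N particles in a 2D periodic box of side L,
# speed v0, noise amplitude eta, k nearest neighbours, step dt.
N, L, v0, eta, k, steps, dt = 200, 10.0, 0.5, 0.05, 6, 100, 0.1

pos = rng.uniform(0.0, L, size=(N, 2))
theta = 0.1 * rng.normal(size=N)       # start near full alignment

for _ in range(steps):
    s = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    # Periodic pairwise distances (brute force, fine for small N).
    delta = pos[:, None, :] - pos[None, :, :]
    delta -= L * np.round(delta / L)
    dist2 = (delta ** 2).sum(-1)
    np.fill_diagonal(dist2, np.inf)
    nn = np.argsort(dist2, axis=1)[:, :k]        # k nearest neighbours
    avg = s[nn].sum(axis=1) + s                  # include self
    noise = rng.uniform(-eta * np.pi, eta * np.pi, size=N)
    theta = np.arctan2(avg[:, 1], avg[:, 0]) + noise
    pos = (pos + v0 * dt * np.stack([np.cos(theta), np.sin(theta)], 1)) % L

polarization = np.linalg.norm(
    np.stack([np.cos(theta), np.sin(theta)], 1).mean(axis=0))
```

At this low noise the group stays strongly ordered, so the polarization remains close to 1, as in the regimes used for the tests above.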
The likelihood as a function of $n_c$ can be computed with the NN rule using Eqs. (8)-(9) and (7). The result is shown in the inset of Fig. 1 for the high-mixing regime. Its maximum $n^*_c$ corresponds to the most likely interaction range, from which the optimal $J^*$ and $T^*$ are computed via Eqs. (8)-(9). Fig. 1 shows that the new dynamical procedure systematically outperforms the static approach described in [8] in predicting the mean interaction range $n_c$. The error made by the static inference is larger when neighbor mixing is higher and the dynamics is strongly out of equilibrium. That is because in the high-mixing case, the effective number of interacting neighbors, as inferred by the static approach, includes neighbors visited in the recent past in addition to the current ones, and thus is larger than the true $n_c$. Overall, the dynamical inference based on NN interactions performs reasonably well, considering that the model used for the inference incorrectly assumes a constant $n_c$. Not surprisingly, the log-likelihood computed with the (correct) Voronoi topology is larger than with the (incorrect) NN one. The temperature $T$ is well inferred in both cases (8% error), while the alignment strength $J$ is well recovered when assuming Voronoi neighbors (3% error), and only approximately with a NN topology (20% error). Performing the dynamical inference using a metric rule gives significantly worse results, yielding $n_c \sim 3$ (see Fig. 2), a factor of 2 smaller than the correct value. Hence the dynamical method not only gives us the correct interaction parameters, but also distinguishes the rule used to build the interaction network. The method exploits the different ways in which spatial density fluctuations translate into fluctuations in the number of neighbors. In the Voronoi network (the generating one), the number of neighbors $n_i$ of each point fluctuates weakly around its mean value of 6.
In the NN case, n i does not fluctuate at all, whereas with the metric rule n i exhibits very large fluctuations, directly linked to the VM giant density fluctuations [4]. The large fluctuations of n i make the correlation functions of Table I very different from their correct (Voronoi) values.
In summary, we have derived a dynamical maximum entropy method to infer the alignment dynamics of highly ordered animal groups from just two consecutive snapshots. Tests on synthetic data confirm the validity of our method. Our approach is very general and makes minimal, symmetry-based assumptions on the structure of the dynamics under investigation, offering an alternative to other inference methods [13]. Related approaches have been proposed in the context of Ising spins or spiking neurons [14]; however, in that case it is hard to relate a simple interaction form of the Markovian transition probabilities to a principle of maximum entropy. Our work emphasizes the need for a dynamical inference approach to out-of-equilibrium active matter systems, especially when there is no a priori knowledge of the timescales in the system, which is usually the case when dealing with experimental data.
Our approach is applicable to many systems where collective motion is observed, including moving animal groups [15], bacterial colonies [16], motility assays [17], collective motion of epithelial cells [18], or vibrated polar disks [19]. Throughout this work we have assumed that δt is equal to (or smaller than) the real update time lag, namely the biological timescale. This may not be true for some datasets, as the sampling time of the experimental equipment is likely to be larger than the neural update time actually used by animals. This is certainly the case for the starling data of [8]. When this happens, the experimental time series is a coarse-grained version of the real dynamics, so that the present method would probably provide a time-renormalized value of the interaction parameters. It would therefore be important to generalize our equations to deal with this issue. Other generalizations include the analysis of other symmetries than the polar one (as in systems with nematic order [20]), or the extension to second-order dynamics describing systems characterized by linear, not diffusive, dispersion relations [21].
Acknowledgements. We thank Martin Weigt for helpful discussions. I.G. was supported by grants IIT-Seed Artswarm, ERC-StG n. 257126. A.C. was supported by grant US-AFOSR FA95501010250 (through the University of Maryland). F.G. acknowledges support from grants EPSRC First Grant EP/K018450/1 and MC Career Integration Grant PCIG13-GA-2013-618399. Work in Paris was supported by grant ERC-StG n. 306312.
In the maximum entropy approach, one looks for the maximally disordered probability distribution consistent with carefully chosen observables of the data. In practice, given a stochastic variable $s$ and a set of observables $\{O_\mu(s)\}$, with $\mu = 1, \ldots, K$, one looks for the model distribution $P$ of maximum entropy that coincides with the data for the average values of each of the observables:

$$\langle O_\mu \rangle_P = \langle O_\mu \rangle_{\rm data}, \qquad \mu = 1, \ldots, K. \qquad (14)$$

Using the technique of Lagrange multipliers, one shows that the distribution takes the exponential form:

$$P(s) = \frac{1}{Z(\{\lambda_\mu\})}\, \exp\Big[\sum_\mu \lambda_\mu\, O_\mu(s)\Big], \qquad (15)$$

where $\{\lambda_\mu\}$ are Lagrange multipliers that need to be set to satisfy (14), and $Z(\{\lambda_\mu\})$ is a normalization factor enforcing $\sum_s P(s) = 1$. By analogy with the Boltzmann distribution from equilibrium statistical mechanics, the sum inside the exponential may be interpreted as a (minus) energy. Conveniently, the Lagrange multipliers that match the mean values of the observables are also those that maximize the likelihood of the data given the exponential form (15). Given $M$ data points $s^1, \ldots, s^M$, the log-likelihood of the data reads:

$$\mathcal{LL} = \frac{1}{M}\sum_{a=1}^M \ln P(s^a) = \sum_\mu \lambda_\mu \langle O_\mu \rangle_{\rm data} - \ln Z(\{\lambda_\mu\}). \qquad (16)$$

Maximizing the log-likelihood with respect to the parameters $\{\lambda_\mu\}$ implies:

$$\frac{\partial \ln Z}{\partial \lambda_\mu} = \langle O_\mu \rangle_P = \langle O_\mu \rangle_{\rm data}. \qquad (17)$$

By virtue of this equivalence, we will maximize the expression of the log-likelihood with respect to the parameters to find the correct maximum entropy distribution. Let us now consider the specific case of bird flocks. Denote by $s = (s_1, \ldots, s_N)$ the flight directions of birds in a flock of size $N$. The maximum entropy distribution consistent with the synchronous pairwise correlation functions $\langle s_i s_j \rangle$, for all $(i,j)$, reads:

$$P(s) = \frac{1}{Z}\, \exp\Big[\frac{1}{2}\sum_{ij} J_{ij}\, s_i s_j\Big], \qquad (18)$$

where $\{J_{ij}\}$ are the Lagrange multipliers associated with the constraints on the correlation functions.
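As a minimal worked instance of the matching condition (a toy example, not from the paper): for a single binary spin $s \in \{-1, +1\}$ with a single observable $O(s) = s$, the model is $P(s) \propto e^{\lambda s}$, so $\langle s \rangle_P = \tanh\lambda$ and the multiplier has the closed form $\lambda = \mathrm{arctanh}\,\langle s \rangle_{\rm data}$:

```python
import numpy as np

# Toy maximum entropy problem: one binary spin s in {-1,+1}, one observable
# O(s) = s, target empirical mean m (illustrative value).
m = 0.6
lam = np.arctanh(m)          # closed-form Lagrange multiplier

# Check the matching condition <O>_P = <O>_data explicitly.
Z = np.exp(lam) + np.exp(-lam)
mean_model = (np.exp(lam) - np.exp(-lam)) / Z
```

For many coupled spins, as in Eq. (18), no closed form exists and the multipliers must be found by numerically maximizing the log-likelihood, which is the route taken in the main text.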
Generalizing the set of constrained observables to both synchronous and consecutive-time correlation functions, $\{\langle s^t_i s^t_j \rangle\}$ and $\{\langle s^{t+1}_i s^t_j \rangle\}$, for all pairs $(i,j)$ and for all times $t$ in the trajectory, yields a time-dependent maximum entropy distribution:

$$P(s^1, \ldots, s^T) = \frac{1}{\hat Z}\, e^{-\mathcal{A}}, \qquad (19)$$

with $\hat Z$ again a normalization factor, and

$$\mathcal{A} = -\frac{1}{2}\sum_t \sum_{ij} J^{(1)}_{ij;t}\, s^t_i s^t_j - \sum_t \sum_{ij} J^{(2)}_{ij;t}\, s^{t+1}_i s^t_j, \qquad (20)$$

where $\{J^{(1)}_{ij;t}\}$, $\{J^{(2)}_{ij;t}\}$ are the Lagrange multipliers associated with the constraints on the synchronous and consecutive-time correlation functions. Here, $\mathcal{A}$ is more appropriately interpreted as an action, in a path-integral representation of the stochastic trajectories of the whole flock.

Markovian description
Because the action only involves cross-terms between consecutive times, it underlies a Markov process,

$$P(s^1, \ldots, s^T) = P(s^1)\prod_{t=1}^{T-1} P(s^{t+1}|s^t), \qquad (21)$$

and can be rewritten as:

$$\mathcal{A} = -\ln P(s^1) + \sum_{t=1}^{T-1} \mathcal{L}_t(s^{t+1}, s^t), \qquad (22)$$

where $\mathcal{L}_t = -\ln P(s^{t+1}|s^t)$ may be interpreted as a Lagrangian density in the path-integral formalism. Let us check that this Markovian decomposition is possible. Identifying the two expressions of $\mathcal{A}$, (20) and (22), we may write $\mathcal{L}_t$ in the form:

$$\mathcal{L}_t(s', s) = -\frac{1}{2}\sum_{ij} J^{(1)}_{ij;t}\, s'_i s'_j - \sum_{ij} J^{(2)}_{ij;t}\, s'_i s_j - K_t(s') + K_{t-1}(s), \qquad (24)$$

with the constraint that, for all $s$, the transition probability be normalized, $\int ds'\, e^{-\mathcal{L}_t(s',s)} = 1$, which entails:

$$e^{K_{t-1}(s)} = \int ds'\, \exp\Big[\frac{1}{2}\sum_{ij} J^{(1)}_{ij;t}\, s'_i s'_j + \sum_{ij} J^{(2)}_{ij;t}\, s'_i s_j + K_t(s')\Big]. \qquad (26)$$
Eq. (26) defines a descending recursion, by which $K_t$ is calculated from the next time point. Thus the Markovian form of the action is fully specified using (24).

Equivalence with a generalized Vicsek model in the spin-wave approximation
In general, the integral in (26) cannot be calculated analytically, and $K_t$ does not have a simple quadratic form as a function of $s$. However, things simplify in the spin-wave approximation, where the flock is very polarized, as we now show. Denote $s_i = \pi_i + n\sqrt{1 - (\pi_i)^2}$, where $n$ is an arbitrary unit vector and $\pi_i$ is the perpendicular component of the orientation, $\pi_i\, n = 0$. The vector $n$ is chosen to be close to the flock's main direction of flight, so that $\pi_i \ll 1$. Let us assume a quadratic form for $K_t$:

$$K_t(\pi) = \sum_{ij} K_{ij;t}\, \pi_i \pi_j + \text{const}. \qquad (27)$$

The integrand in (26) can then be expanded at small $\pi$, turning the right-hand side into a Gaussian integral over $\pi'$, which can be calculated exactly. Doing so, and expanding the left-hand side of (26) at small $\pi$, yields a closed equation for the matrices $K_t$. Focusing on the non-diagonal terms of the matrix $K_t$, we obtain a simple expression for the recursion:

$$K_{t-1} = \frac{1}{4}\, J^{(2)\,\top}_t A_t^{-1} J^{(2)}_t, \qquad A_t \equiv -\frac{1}{2} J^{(1)}_t - K_t. \qquad (30)$$

We can now insert the expression of $K_t$ into (24), and thus rewrite the transition probability in terms of $\pi$ in a Gaussian form:

$$P(\pi^{t+1}|\pi^t) \propto \exp\Big[-\sum_{ij} A_{ij;t}\, \big[\pi^{t+1}_i - (M_t \pi^t)_i\big]\big[\pi^{t+1}_j - (M_t \pi^t)_j\big]\Big], \qquad (33)$$

with $M_t = \frac{1}{2} A_t^{-1} J^{(2)}_t$.
This transition probability rule describes a random walk in the joint space of bird directions:

$$\pi^{t+1}_i = \sum_j M_{ij;t}\, \pi^t_j + \epsilon^t_i, \qquad (35)$$

with $\epsilon^t$ a random, isotropic Gaussian noise perpendicular to $n$, of zero mean and covariance:

$$\langle \epsilon^t_i\, \epsilon^{t'}_j \rangle = \frac{d-1}{2}\, (A_t^{-1})_{ij}\; \delta_{t,t'}. \qquad (36)$$

Note that the $(d-1)$ factor, here and in previous equations, corresponds to the dimensionality of the perpendicular component $\pi$. $M_t$ defines a well-balanced weighted average, as it satisfies:

$$\sum_j M_{ij;t} = 1 \quad \text{for all } i. \qquad (37)$$

To show this, let us rewrite this identity in matrix form, $M_t u = u$, where $u$ is a vector of ones, $u_i = 1$. Proving (37) is therefore equivalent to showing $J^{(2)}_t u = 2 A_t u$, which follows from the definition of $A_t$ (29).
This identity also allows us to check that the diagonal components in the equality (30) are consistent with the off-diagonal components. This is done by checking that, on both sides of Eq. (30), contraction with $u$ gives zero.
The equation describing the collective random walk in terms of the perpendicular component $\pi$ carries over almost unchanged to the flight direction $s$ itself. Starting from the update equation:

$$s^{t+1}_i = \theta\Big[\sum_j M_{ij;t}\, s^t_j + \eta^t_i\Big], \qquad (38)$$

where $\theta(x) = x/\|x\|$ is the normalization operator, and expanding in the spin-wave approximation ($\pi_i \ll 1$), one recovers (35) with $\epsilon^t_i = \eta^t_i - (n \cdot \eta^t_i)\, n$ the perpendicular component of the vectorial noise $\eta$.

Parametrization
The matrices $M_t$ and $A_t$ are parametrized as follows:

$$M_{ij;t} = (1 - J n_i\, \delta t)\, \delta_{ij} + J\, \delta t\; n_{ij}, \qquad (40)$$

$$A_{ij;t} = \frac{\delta_{ij}}{4 T\, \delta t}, \qquad (41)$$

where $n_{ij} = 1$ if $j$ is one of $i$'s neighbours, and 0 otherwise, and $n_i = \sum_j n_{ij}$. (We drop the $t$ index, even though $n_{ij}$ depends on $t$ in general.) $J$ is interpreted as an alignment strength, and $T$ as a temperature.

Continuous time limit and equivalence with static maximum entropy
The parametrization has a well-defined continuous-time limit. When $\delta t \to 0$, (35) becomes:

$$\frac{d\pi_i}{dt} = -J \sum_j \Lambda_{ij}\, \pi_j + \xi_i, \qquad (42)$$

where $\Lambda_{ij} = n_i \delta_{ij} - n_{ij}$, and the $\xi_i(t)$ are i.i.d. Gaussian white noises with $\langle \xi_i(t)\, \xi_j(t') \rangle = 2(d-1) T\, \delta_{ij}\, \delta(t - t')$, where $\delta(t - t')$ is Dirac's delta function. When $\Lambda$ varies slowly with time, (42) can be formally integrated:

$$\pi(t) = e^{-J \Lambda t}\, \pi(0) + \int_0^t dt'\, e^{-J \Lambda (t - t')}\, \xi(t'). \qquad (43)$$

If, in addition, $\Lambda$ is symmetric, the system reaches an equilibrium steady state. More precisely, the collective mode that is parallel to $u$, which corresponds to the average direction of the flock, $(1/N)\sum_i \pi_i$, follows an unconstrained random walk, as it corresponds to the zero mode of $\Lambda$, $\Lambda u = 0$. All the other modes, orthogonal to $u$, are bounded by a restoring force. The steady-state distribution of $\pi$ is therefore Gaussian, with $C_{ij} = {\rm Cov}(\pi_i, \pi_j)$ (per transverse component, on the subspace orthogonal to $u$) satisfying:

$$J\,(\Lambda C + C \Lambda) = 2 T \Big(\mathbb{1} - \frac{u u^\top}{N}\Big), \qquad (44)$$

where $\mathbb{1}$ is the identity matrix.
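This correspondence between the discrete dynamics and its static Gaussian steady state can be checked numerically. The sketch below (the ring-graph Laplacian and all parameter values are illustrative assumptions) iterates the discrete-time Lyapunov recursion for $\pi' = (\mathbb{1} - J\,\delta t\,\Lambda)\pi + \xi$, with the free zero mode projected out, and compares the stationary covariance with the static prediction $C = (T/J)\,\Lambda^{+}$ ($\Lambda^{+}$ the pseudo-inverse, per transverse component):

```python
import numpy as np

# Illustrative parameters; the ring-graph neighbourhood is an assumption.
N, J, T, dt = 20, 1.0, 0.02, 0.01

n = np.zeros((N, N))
for i in range(N):
    n[i, (i - 1) % N] = n[i, (i + 1) % N] = 1.0
Lap = np.diag(n.sum(1)) - n          # Lambda_ij = n_i delta_ij - n_ij
u = np.ones(N)
assert np.allclose(Lap @ u, 0.0)     # u is the zero mode (global rotation)

# Discrete-time dynamics pi' = M pi + xi, noise variance 2 T dt per component.
M = np.eye(N) - J * dt * Lap
P = np.eye(N) - np.outer(u, u) / N   # projector orthogonal to the zero mode

# Iterate the Lyapunov recursion C <- M C M^T + 2 T dt P to convergence
# (the zero mode is projected out of the noise, since it diffuses freely).
C = np.zeros((N, N))
for _ in range(10000):
    C = M @ C @ M.T + 2.0 * T * dt * P

# Static (maximum entropy) prediction: C = (T/J) * pinv(Lambda).
C_static = (T / J) * np.linalg.pinv(Lap)
```

For small $\delta t$ the two covariances agree up to $O(\delta t)$ discretization corrections, which is the content of the equivalence with the static approach discussed next.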
Remarkably, in the spin-wave approximation, this distribution is the same as the one obtained from the principle of maximum entropy constrained by the static correlation functions:

$$P(s) = \frac{1}{Z}\, \exp\Big[\frac{1}{2}\sum_{ij} J^{\rm stat}_{ij}\, s_i s_j\Big], \qquad (45)$$

with

$$J^{\rm stat}_{ij} = \frac{J}{T}\, n_{ij}. \qquad (46)$$

One can check this by expanding (45) at small $\pi$, after setting $n$ to be the average direction of the flock, so that $\sum_i \pi_i = 0$:

$$P(\pi) \propto \exp\Big[-\frac{J}{2T}\sum_{ij} \Lambda_{ij}\, \pi_i \pi_j\Big].$$

By virtue of Gaussian integration rules, this distribution has the same covariance as (44), and is therefore identical.
The various correlation functions used in this expression are defined in Table I of the main text.
In the case of non-constant $n_i$, $n_c$ is defined as $(1/N)\sum_i n_i$. Note that in the case of constant $n_i = n_c$, as for the nearest-neighbor model, $\tilde C_s = \hat C_s = C_s$, $\tilde G_s = G_s$ and $\tilde C_{\rm int} = C_{\rm int}$.
There are three parameters to optimize over: the interaction strength $J$, the interaction range $n_c$, and the "temperature" $T$, which sets the strength of the noise. The last one is simply given by the condition $\partial \mathcal{L}/\partial T = 0$, which yields $T$ in closed form as a function of the empirical correlations (Eq. (51)). At this optimal value of $T$, the log-likelihood reduces to a function $\bar{\mathcal{L}}$ of the remaining parameters. Minimizing $\bar{\mathcal{L}}$, $\partial \bar{\mathcal{L}}/\partial \alpha = 0$, then yields the optimal value of $\alpha$ (Eq. (53)). At this optimum, one has $\bar{\mathcal{L}} = C^1_s + C_s - 2 G_s + \tilde{\mathcal{L}}$, where $\tilde{\mathcal{L}}$ is the only term that depends on the interaction matrix $n_{ij}$. Therefore, to find the optimal interaction range $n_c$ in the case of the nearest-neighbor model, one just needs to minimize $\tilde{\mathcal{L}}(n_c)$.

Consistency with the static approach
To recover the static inference equations, we start by rewriting the dynamical inference equations, Eqs. (51) and (53), explicitly in terms of the correlation functions of Table I (Eqs. (55)-(56)), with $n_c = (1/N)\sum_i n_i$ and $\Omega$ defined in Eq. (57). When the system is at steady state, we have $C^1_s \approx \tilde C_s$ and $\Omega \approx (2 N n_c)^{-1} \sum_{ij} \pi_i \pi_j\, dn_{ij}/dt$ (directly from the definitions in Table I of the main text and Eq. (57)); the second term in Eq. (56) cancels, and $T \approx T_0$ for small $\delta t$. If we further assume that the data was actually generated by exactly the class of models we are trying to infer (which may not be the case in general, as we are looking at effective descriptions), we have exactly $T = T_0$. If in addition neighbor changes are slow, then $\Omega \approx 0$ and Eq. (44) implies $\tilde C_{\rm int} \approx C_{\rm int}$. Eq. (55) thus gives

$$J^{\rm stat}_{ij} = \frac{J}{T}\, n_{ij}, \qquad (58)$$

which is the result of the static inference [8]. Note, however, that in addition to recovering the alignment strength, the dynamical inference procedure allows us to separate the interaction coupling $J$ from the temperature $T$.
Spin-wave expansion of the Topological Vicsek model

As described in the main text, to test our dynamical inference method we generated synthetic data with the Topological VM, defined by

$$\theta^{t+\delta t}_i = {\rm Arg}\Big[s^t_i + \delta t\, J_V \sum_j n_{ij}\, s^t_j\Big] + \xi^t_i. \qquad (59)$$

In this section, we show that Eq. (59) is in fact equivalent, in the spin-wave limit, to an update equation of the same kind as Eqs. (35), (40) and (41). To this aim, it is convenient to rewrite Eq. (59) in the following equivalent form:

$$s^{t+\delta t}_i = \theta\Big[s^t_i + \delta t\, J_V \sum_j n_{ij}\, s^t_j + \epsilon^t_i\Big], \qquad (61)$$

where $\epsilon^t_i$ is a delta-correlated noise perpendicular to $s_i$ with variance $2(d-1) T_V$ (i.e., whose effect is the same as the angular noise appearing in Eq. (59)).
In the large-polarization regime we can perform a spin-wave expansion, $s_i = \pi_i + n\sqrt{1 - \pi_i^2}$, where $n$ is a vector representing the global direction of motion and $\pi_i$ is the component of the direction $s_i$ perpendicular to $n$. We can now expand the normalization on the r.h.s. of Eq. (61) with respect to $\pi_i^2$, where $n_i = \sum_j n_{ij}$. Eq. (61) then leads to the following update equation for the $\{\pi_i\}$:

$$\pi^{t+\delta t}_i = \frac{\pi^t_i + \delta t\, J_V \sum_j n_{ij}\, \pi^t_j}{1 + \delta t\, J_V\, n_i} + \epsilon^t_i. \qquad (64)$$

When $\delta t$ is small, we can disregard fluctuations in $n_i$, and Eq. (64) is of the same form as Eq. (35) with the parametrization defined in (40)-(41) and $J \approx J_V/(1 + J_V n_V\, \delta t)$.
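The spin-wave reduction can be verified numerically: for small $\pi$ and no noise, the exact normalized update of Eq. (61) and the linearized Eq. (64) should agree up to $O(\pi^3)$ corrections. The sketch below compares the two on an assumed 2D ring neighbourhood (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative parameters; ring neighbourhood is an assumption (2D, no noise).
N, J_V, dt = 40, 1.0, 0.1

n = np.zeros((N, N))
for i in range(N):
    n[i, (i - 1) % N] = n[i, (i + 1) % N] = 1.0
ni = n.sum(1)

pi = 0.01 * rng.normal(size=N)                 # small spin-wave amplitudes
s = np.stack([np.sqrt(1.0 - pi ** 2), pi], 1)  # n = x-axis, pi = y-component

# Exact update: normalize s_i + dt J_V sum_j n_ij s_j (Eq. (61), noiseless).
v = s + dt * J_V * (n @ s)
s_exact = v / np.linalg.norm(v, axis=1, keepdims=True)
pi_exact = s_exact[:, 1]

# Linearized (spin-wave) update, Eq. (64) without noise.
pi_lin = (pi + dt * J_V * (n @ pi)) / (1.0 + dt * J_V * ni)

err = np.max(np.abs(pi_exact - pi_lin))
```

With $|\pi| \sim 10^{-2}$ the discrepancy `err` is of order $10^{-5}$ or below, i.e. cubic in the spin-wave amplitude, confirming that the linearization captures the leading behavior.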