Vicsek Model by Time-Interlaced Compression: a Dynamical Computable Information Density

Collective behavior, both in real biological systems as well as in theoretical models, often displays a rich combination of different kinds of order. A clear-cut and unique definition of"phase"based on the standard concept of order parameter may therefore be complicated, and made even trickier by the lack of thermodynamic equilibrium. Compression-based entropies have been proved useful in recent years in describing the different phases of out-of-equilibrium systems. Here, we investigate the performance of a compression-based entropy, namely the Computable Information Density (CID), within the Vicsek model of collective motion. Our entropy is defined through a crude coarse-graining of the particle positions, in which the key role of velocities in the model only enters indirectly through the velocity-density coupling. We discover that such entropy is a valid tool in distinguishing the various noise regimes, including the crossover between an aligned and misaligned phase of the velocities, despite the fact that velocities are not used by this entropy. Furthermore, we unveil the subtle role of the time coordinate, unexplored in previous studies on the CID: a new encoding recipe, where space and time locality are both preserved on the same ground, is demonstrated to reduce the CID. Such an improvement is particularly significant when working with partial and/or corrupted data, as it is often the case in real biological experiments.


I. INTRODUCTION
Statistical physics and information theory have a long history of cross-fertilization [1], with both disciplines relying on a key concept: a quantitative statistical measure of order [2,3]. For physical systems both in and out of equilibrium, such a measure is the starting point for a general theory of phase transitions, and it is a prerequisite for the study of response to external perturbations [4]. The statistical notion of entropy -the fundamental link between thermodynamics and equilibrium statistical mechanics -is the main inspiration for the central concept of information theory, the Shannon entropy [5]. More recently, information theoretic ideas have proven useful in the study of physical many-body systems, for example, to obtain good estimates for critical temperatures in equilibrium spin systems [6,7], to describe entropy production in the context of stochastic thermodynamics [8,9], and to obtain accurate estimates of the entropy in both equilibrium and non-equilibrium systems [10,11].
In this paper, we seek a method to measure order in some non-equilibrium flocking models, with an eye to eventually analyzing observational data on living systems. In particular, we propose a novel analysis which treats ordering in space and time on an equal footing. We will argue that by analyzing the temporal development of spatial patterns together, we can obtain information in a way that is robust enough to survive the inevitable noise present in collections of living objects.
Our approach is based on a recently proposed information-based measure of order for out-ofequilibrium systems [10]. This proposal, called Computable Information Density (CID), relates to the compression rate measured by lossless compression algorithms [12,13]. CID was shown to give clear signatures of important transitions in the systems studied, which included several absorbing state models [14] as well as an active matter model, repulsive active Brownian particles (ABP), where motility induced phase separation appears at large enough concentrations [15]. We note that these models lack first-principles Hamiltonians, which makes this quantification of order even more compelling.
In particular, we compress a sequence x using a universal lossless compression algorithm (such as one of the Lempel-Ziv algorithms) [16], and define the CID of x as where L(x) is the total binary code length of the compressed sequence, and N is the length of the original sequence x. We note that although we are using the term "sequence", we are not restricted to 1D stringsmicrostates in any dimension may be compressed by appropriate procedures [17], as will be discussed later. For equilibrium systems, the CID gives a good approximation to the thermodynamic entropy [10].
Here, we investigate the effectiveness of CID in flocking models, a class of active matter known to exhibit a different kind of order from ABP. We introduce a novel "space-time" algorithm based on CID which is sensitive enough to yield insight even when it only uses strongly coarse-grained information about the system, such as a binarized (empty/occupied) density field. Our main results are: • We are able to accurately characterize phase behavior by CID using a coarse-grained representation that only considers occupied/unoccupied discrete regions of space.
• A single parameter Q, characterizing the amount of correlations in the system as measured by CID [11], can accurately detect multiple phases in the system.
• By proper incorporation of temporal as well as spatial information, we are able to capture details of phase transitions with greater sensitivity.
• This increase in sensitivity allows us to measure important features even when our data is substantially corrupted, such as is typically the case for data from actual living systems.
In Section II we introduce the flocking model used to test our approach, the two-dimensional Vicsek Model, following which we explain the encoding method we use to translate its dynamical configurations into a binary string, and we then define our order measure. In Section III we present results obtained from numerical simulations, and we present conclusions in Section IV.

A. The Vicsek Model
We will consider here an archetypical model of collective behavior in biological systems, the Vicsek model (VM) [18], which we will study in two dimensions. The VM is an active matter model in which each agent/particle updates its velocity by imitating the average velocity of its nearest neighbors, and then updates its position by following its own velocity [19,20]. The first is an alignment mechanism, identical to classical ferromagnetic models with continuous symmetry, while the second leads to diffusion-like motion. It is worth noting that the interaction (or connectivity) matrix n ij (t) depends on time through the inter-particle distances r ij (t), which in turn depend on the velocities; this closes a feedback loop between positions and velocities that drives the systems out-of-equilibrium. As a function of noise strength, the VM displays a (finite-size) transition between a high-noise disordered phase and a low-noise ordered (or polarized) phase. Close to this transition, nontrivial fluctuations emerge both in velocity and density, whose properties depend on the nature of the noise and interaction [21,22].
More specifically, the model consists of a system of N active particles, each moving in a 2D plane with fixed speed v 0 in a square box of area A = L × L with periodic boundary conditions. At each time step the particles tend to align their direction of motion with that of their neighbors, with some noise added to make the dynamics stochastic. Let r i (t) ≡ (x i (t), y i (t)) and θ i (t) be the position and orientation respectively of particle i at some time t, N i (t) the set of its neighbors in a circle of radius R [23] centered about the particle i, and V i (t) ≡ j∈Ni(t) v j (t)/|N i (t)| the average velocity of the particles in the neighborhood of i. The velocity vector will be represented in the complex plane as v i (t) = v 0 e iθi(t) .
There are two cases to consider: intrinsic (also called vectorial) and extrinsic (also called scalar) noise. For the case of intrinsic noise, each particle in the system evolves according to the update rule is a uniformly distributed random variable in [−π, π] and η ∈ [0, 1] is the noise strength. For extrinsic noise, the updating rule for θ i is a little different: Both types of noise generate a phase diagram characterized by two critical noise values η b < η c which, finite size effects apart, depend on the density ρ of the system and the speed v 0 of the particles. When the noise intensity is large enough, η > η c , a Vicsek fluid stays in a fully disordered state with a spatially homogeneous density and a thermal-like distribution of particle directions. By reducing the noise to η η c the rotational symmetry is spontaneously broken and the system exhibits collective motion: a clear polarization transition appears for the velocities, which is accompanied by strong modifications of the density field, similar to a phase separation with traveling band formation [24].
The chief difference between the two kinds of noise is that extrinsic noise has a stronger contribution to band formation than intrinsic noise. In the latter case, the traveling bands are less clear and defined, though they become more visible as the size of the system increases. When the noise is further reduced to η < η b , the bands disappear (sharply with extrinsic noise, smoothly with intrinsic noise), replaced by a polarized state whose density is spatially homogeneous but possessing giant fluctuations [25]. At finite N , a sharp transition for extrinsic noise [26,27], and a smoother transition for intrinsic noise [28] are usually found. The order of the transition in the large N limit is still under debate, as is whether or not the nature of the intermediate phase of traveling bands is an artifact of the periodic boundary conditions [29]. In this work we will not discuss the asymptotic nature of the transition, but just show how we can distinguish the phases of VM through a single observable related to the CID of a suitable coding of system configurations. In Figure 1 we show the polarization Φ and its fluctuations for the simulated systems, where indicating the nature of the transitions for the two types of noise.

B. Coarse-Graining and Alphabet of Symbols
To implement the CID measurements, we first discretize our space by overlaying a regular square grid of M = m × m cells of size b = L/m, and we observe the system evolution for a time window of T steps. At each time t we assign the symbol '1' to all the cells occupied by at least one particle and the symbol '0' to empty ones: in this way we build a 3-dimensional array of M × T bits which we denote A.
We note that we could have considered a richer alphabet, e.g. based upon combinations of local density and Figure 2. Procedure to get the Z-value, i.e. the coordinate along the Z-order curve which is a one-dimensional spanning of a multi-dimensional space, preserving locality. Each point has a Z-value which is obtained by interleaving the bits of the x coordinate with the bits of the y coordinate. This, in two dimensions, can be done in two different ways: bits can be interleaved in the order xyxyxy . . . (x-first) or in the order yxyxyx . . . (y-first). The two possibilities are illustrated for two different points, using colors (green and blue) to make appreciable the choice of the interleaving order. One choice gives place to the solid curve, the other gives place to the dashed curve. In the example, the distance (along the Zcurve) between the two points depends upon the choice of the order: it is 38 − 7 = 31 for x-first order, and 25 − 11 = 14 for y-first order.
local (e.g. cell averaged) velocity, but there are good reasons for not doing this. First, in real biological data, we typically have direct access only to the positions of the agents; other degrees of freedom (e.g. velocities) are obtained from the knowledge of the positions. Since we will be simultaneously encoding T > 1 configurations, we expect the velocity information to be present implicitly.

C. Scanning: The Z-Order Curve
Since the typical compression programs operate on one-dimensional strings of characters, we need to scan the array A and produce a 1D sequence. Different scanning procedures exist, but in this work we will employ two procedures based on the so-called Z-order or Mortonorder mapping [30], which is similar to Hilbert scanning [17]. This class of mappings have the advantages of preserving spatial locality in a reasonable fashion, and of working in arbitrary dimensions. We will discuss the Zorder mapping for the 2D case; generalization to higher dimensions is immediate.
In 2D, a state of the system is represented by a matrix M with entries M mn [31]. We wish to compose a 1D string σ with entries σ l which are derived from M . The Z-order algorithm does this as follows: 1. Write the integers n and m in binary representation, such that n k and m k are the k th digits in the representations.
2. Interleave the digits of the n and m to form a new binary string n 1 m 1 n 2 m 2 n 3 m 3 .... If the binary string for n or m is shorter than the other, pad it with zeros. This is the binary representation of some integer l.
We note that there are two such possible mappings, (the interlaces n 1 m 1 n 2 m 2 n 3 m 3 ... and m 1 n 1 m 2 n 2 m 3 n 3 ...) as seen in Figure 2. We will discuss this ambiguity later. The generalization to higher dimensions is straightforward: the binary representations of the lattice site (j, k, ..., r) are interleaved in the same fashion as above.
In this work, we wish to study a sequence of T > 1 sequential configurations showing the time evolution of the system. There are two main ways to do this (see Fig. 3): • Serialized Time Coding (STC) We scan the 2D configuration matrix M at each time step using the 2D Z-order algorithm, and we then concatenate the resultant 1D strings in a sequence of length m × m × T according to their time order.
• Interlaced Time Coding (ITC) We scan the entire space-time array A with the 3D Z-order algorithm, producing a 1D sequence of length m×m×T . This method is expected to better preserve time correlations.
We will show that ITC is more sensitive to the ordering than STC, which we argue will play a useful role in analyzing collective dynamics of living things.
For our analysis, we find that it is useful to consider a quantity Q [11] defined by where CID(x sh ) is the CID of a random shuffle of the sequence x, and CID(x sh ) indicates an average over several such shuffled sequences. For asymptotically long strings, 0 ≤ Q ≤ 1, with Q 0 indicating that the data string is uncorrelated, while Q > 0 indicates the presence of order. We note one additional technical detail. The Z-curve does not span the space (or the space-time for ITC case) in an isotropic way, but we can improve the isotropy by averaging Q over the different interleavings of coordinates as shown in Figure 2. For STC, there are 2 possible interleavings (as shown in Figure 2) and for ITC there are 3! = 6 interleavings, since the Z-curve is taken in the full 3D matrix. We define Q I and Q S as the average of Q: where the average is taken over the 2 (for ITC) or 6 (for STC) possible curves.

III. RESULTS
In this Section we show how the dynamical information arises by an analysis of Q by comparing the ITC and STC schemes on simulations of the 2D Vicsek Model. For both intrinsic and extrinsic types of noise, we focus on a typical set of parameters, setting the interaction radius, density and speed to R = 1, ρ = N/V = 2 and v 0 = 0.5, respectively. We simulate the model for different sizes, from N = 2 9 to N = 2 15 , and vary the noise strength 0 ≤ η ≤ 1. The system size is L × L, and we employ periodic boundary conditions. We choose our observation window (that is, the data which we analyze) to be smaller than the entire system to avoid points which are near one another in space being far apart in the Z-order curve. Since such issues arise only near the boundary, we choose an observation window of size L/2 × L/2 at most. Finally, we average Q over many configurations (10 3 − 10 4 ). We begin to collect configuration data for our analysis after waiting for the system to reach a stationary steady state, a time t 10 4 .
We first studied the effect of the cell size b at T = 1 (for which there is no difference between STC and ITC), see Fig. 4 which shows an optimal value (best compression) at b = 1 for most of the choices of the noise intensity. It does not seem a coincidence that this optimal value coincides with the interaction radius R. Since we want to study the effect of other parameters, we fix b = 1 in the remainder of this work.

A. Dependence On The Time Window T And On
The Time Encoding In Fig. 5a we plot Q as a function of the noise η for the ITC and STC protocols for the model with intrinsic noise and different time-window lengths T. Al- though the curves all have the same general shape, the dynamic range (maximum value -minimum value) of the ITC protocol is considerably larger than that of the STC, suggesting that it might be a more sensitive measure of the order. Fig. 5b shows that the value of Q(t) is always higher for ITC than for STC, which is another indication of its better sensitivity. The main polarization transition at η c 0.48 is seen as an inflection point in Q , which appears more cleanly in the curves for the ITC scheme. The polarization phase transition at η c is typically thought of as an ordering of the particle velocities. For this reason, the fact that Q with T = 1 shows a clear signature of this transition is interesting. For T = 1, Q is only a single spatial configuration from which no velocity information can be inferred. This is a clear indication that when the velocities order at η c , there is a simultaneous ordering in the density field, which cannot be appreciated by looking just at polarization. Thus, the behavior of Q is more sensitive than the behavior of the polarization. In Section III.C we give details about how Q describes the full complex phase diagram of the VM. Here we focus on the effect of T and the way time is treated by the different CID encoding protocols. When T is increased, the shape of the curve Q vs. η remains similar. The values of Q obtained with the STC scheme increase slightly, mainly because of the larger statistics of substrings. However, a significant increase of Q is apparent when using the ITC protocol. For instance the ITC with T = 2 is everywhere larger than STC with T = 64, indicating a vastly greater sensitivity, even when η → 1, where the particles are evenly distributed with no polarization. The STC scheme preserves only space locality; therefore at large noise values, where there are no spatial correlations, it gives Q 0. On the contrary the ITC scheme preserves space and time locality and therefore exploits temporal correlations, resulting in Q > 0 even at high noise. Importantly, since temporal correlations are large in the presence of significant order, the curve Q I vs. η shows the polarization transition better and with higher resolution.
The difference between the two types of encoding is larger than the statistical fluctuations, and thus is significant. In Fig. 5b we compare step-by-step the "worst" result for ITC -the minimum value Q < I over the 6 permutations -with the "best" result for STC -the maximum value Q > S over the 2 permutations. We find a gap ∆ = Q < I − Q > S between the two encoding protocols which is always positive, a solid test of the effective improvement achieved by ITC with respect to STC.
From these results we conclude that temporal interlacing has added information about dynamics, information that the STC scheme struggles to reveal. This is also ap-preciated by studying how Q changes when the distance ∆t between successive times in a sequence of T configurations is increased (Fig. 6). This analysis shows a striking difference between STC and ITC. STC is not particularly sensitive to ∆t: it treats all configurations as independent, without respect to their proximity in time. ITC, on the other hand, displays a smooth relaxation towards an asymptotic value for large ∆t. The characteristic time of this relaxation grows when η → η c , η b suggesting a powerlaw rather than an exponential time decay, as in typical slowing-down phenomena close to phase transitions.  Fig. 7 provides a detailed account of how Q correlates with the rich VM phase diagram. Since we wish to analyze the effect of system size N , and since the ITC analysis requires a cubic matrix in space-time, we increase T coherently with N .
We consider both variants of the VM, with intrinsic and extrinsic noise, in order to investigate the sensitivity of Q to the known differences between these two kinds of noise. Intrinsic noise is known to bear the signature of a smooth flocking transition [28], while extrinsic noise exhibits a sharper transition at a higher value of η [26]. Both behaviors are well reproduced by the Q vs. η curve: in the intrinsic noise case Q has a smooth variation close to the known value of η c (see also the dashed line reproducing the polarization order parameter), while in the extrinsic noise case Q has a rapid variation near η c (which is larger).
The variation of Q in the vicinity of the transitions is made more clear by looking at the derivative |dQ/dη|, shown in the insets on the left panels of Fig. 7. Remarkably, Q does not signal only the polarization transition, but also the other known crossover present in the VM phenomenology. In particular, Q reaches a local maximum at the point marked by η * . The maximum decrease is marked by a second peak in |dQ/dη|, at a noise value η b . This loss of order is well explained by the behavior of the density field, represented in typical configurations for selected values of η, in Fig. 7. The polarization transition in the VM is accompanied by traveling band structures. Such structures lose coherence at smaller values of η, leading the density field to become homogeneous. For even smaller values of η < η b we observe a final increase of Q with decreasing η. This is due to the both the further increase of polarization and the appearance of giant fluctuations, a well-studied phenomenon in the VM at low values of noise [25]. Such strong inhomogeneities of the density field appear as large areas with correlated values of the occupation field, contributing to an increase of Q.
It is interesting to take another look at the difference between ITC and STC. In all cases, the the relative sensitivity of ITC with respect to STC increases for larger N and smaller η, with a few exceptions (we note that at low values of noise the large correlations in the system imply large fluctuations). In particular the difference between STC and ITC increases close to the polarization transition η c [32,33]. Left: Q Vs η computed with ITC scheme at increasing system size N and time length T : dashed black lines refer to polarization Φ computed for N = 32768. The inset shows the two peaks of |dQ/dη| for N = 32768 located at η b and ηc (and the local maximum at η * ) which delimit the phase of traveling bands. The density structures are shown in the strips below. We estimate ηc 0.48 and η b 0.34 for intrinsic noise and ηc 0.62 and η b 0.56 for extrinsic one. Right: Differences with the STC scheme by varying the noise strength.

C. Coping With Corrupted Data
The VM is a numerical model that produces trajectories for which identity, position, and velocity of each particle are exactly known at each time step. In real experiments on collective biological systems, reconstructed trajectories (both in 2-and 3-dimensions) may become corrupted in several possible ways, especially when analyzing large groups. For example, in some cases the trajectories of certain individuals are temporarily lost, so that their positions are missing for several time steps. Thus, some trajectories are interrupted, meaning that at each time step we lack information about a certain fraction of objects. This uncertainty fraction typically grows with the system density [34,35].
Here we examine whether our analysis is sensitive enough to show data on ordering in the presence of data corruption, by simulating the interruption of the trajectories by a simple two-state Markov process. Such a process has two parameters that modulate the degree of data corruption: the fraction of missing individuals µ and the average length of a trajectory λ (see Appendix for further details). Fig. 8 shows that both STC and ITC coding are able to detect the transition even in the presence of strong data corruption. Since the corruption spoils correlations, the robustness of Q is non-trivial. An important point is that, with increasing corruption of data, ITC copes with corruption significantly better than STC. Not only Q is always larger in the ITC case than in the STC one, but, more importantly, their difference increases with increasing data corruption (Fig. 8b), apart from few cases at small values of µ. This fact indicates that in real datasets, ITC encoding will be more robust than STC. The reason seems to be that ITC exploits time correlations to recover the information which is lost because of disappeared particles.

IV. CONCLUSIONS
We have studied a new definition of compression-based entropy and applied it to the Vicsek model, an offequilibrium active system which describes collective behavior in biological systems. We have adopted a crude encoding method, based upon a binary coarse-graining of the positional information. We do not directly feed the velocity information to the encoding, despite the crucial role of velocity alignment in the VM. Numerical results show that the CID is able to capture the order-disorder transition normally described by the velocity order parameter, demonstrating that it can exploit the densityvelocity coupling present in the VM and in many nonequilibrium systems. This result is promising for future analysis of real data, as the positional information is easily obtained in experiments. Of course, one could in principle explore other, more refined, ways of encoding the physical information, by changing, and expanding, the alphabet of the compressed string according to other local properties of the system (including velocities). However, we have seen that a larger alphabet makes the space of possible strings larger: therefore -at fixed string length L -it reduces the ability of the compressing scheme to exploit correlations. Given the results obtained with our simple encoding method, we deem that more complex encodings are not necessary in general. Extending our approach to systems in three (or more) dimensions is straightforward, since the Z-ordering curve has an obvious generalization in any dimension. The method is not constrained to regular cubic lattices with T = m = 2 n (we have chosen to do so just to simplify computational work). In fact, regardless of the size of the observation window and of grid features, once the cells have been defined, it is sufficient to sort them according to their Z-value. Effects of the aspect ratio of the simulation box is taken into account by averaging on the permutations of the coordinates, as illustrated in this paper.
Our results show that preserving locality in both space and time in the encoding (ITC), rather then in space only (STC), is important. The ITC encoding always extracts more information than the STC encoding and, most im-portantly, the difference in the performance is robust (or even increases) in the presence of corruption of the data sets. This result is particularly important if one wants to apply the method to actual biological data.
We conclude by describing some future directions of our work. The use of compression-based tools is particularly promising for the study of the response to perturbations in collective biological systems. The fluctuationdissipation theorem (FDT), connecting the unperturbed correlations of a given observable to the linear response of the system to a given small external perturbation, is particularly simple at equilibrium, where the observables involved are dictated by the Hamiltonian. In the case of flocks, swarms and other biological systems one has to exploit one of the many recipes for non-equilibrium generalization of FDT [4]. In all such recipes one needs the knowledge of what are the relevant variables conjugated to the perturbation; however, in the case of biological experiments, it is not at clear what variables of the systems are perturbed in the presence of a given external stimulus. Recent advances in non-equilibrium statistical physics provide a possible way out: for non-equilibrium steady states the response to perturbations can always be expressed in terms of correlations involving observables conjugate with respect to a specific observable, the 'stochastic entropy' [36]. We conjecture that the observable Q investigated here is closely related to it. Our analysis, in particular, shows that Q is coupled to many relevant degrees of freedom in the system and therefore is a promising candidate for a general approach to response in biological systems. Future studies are needed to test this scenario.  Let us illustrate how the LZ77 algorithm works, and derive the corresponding definition for the CID, with an example: we try to encode a sequence of characters in a list of "longest previous factors" (LPF). We represent a LPF by pair of integer (p, l) which are the "instructions" to retrieve the original sequence: "print l symbols starting from p-th character already written, if l = 0 just print p directly". To decode we must go through the list of LPFs in order, and for each of them follow the instructions. Let's consider the following binary sequence of length L = 16 0101100111001110 and read it from left to right. At the beginning no characters have been printed yet so, surely, the first two LPFs will be ('0', 0) and ('1', 0) (both have l = 0, in this case p is the character to be printed, we emphasize it using quotation marks). Next we have a substring '01' that we have already met: the LPF is then (1, 2) (copy 2 characters starting from position 1). The next substring already met is '10' and it is encoded by LPF (2, 2), followed by (3,3) to encode '011'. Finally, we see that the remaining substring '1001110' is obtained by copying 7 characters starting from position 5, so the last LSF is (5,7). It does not matter if the sequence currently available is shorter than 7 characters and incomplete, as the full subsequence will become available during printing. The number of bits L needed to encode this list of C = 6 LPFs can be estimated by the following argument. Let (p i , l i ) be the i-th LPF and b i log p i + log l i the number of bits needed to encode it (all logs are base 2): in this way the length of the original sequence L and the compressed binary length L are given by: Since p i < L we have b i ≤ (log L + log l i ), L must be bounded by log l i Now, we use Jensen's inequality for concave functions and, after simple algebraic manipulation, we obtain our definition of CID as in [10] L L ≤ C log C + 2C log (L/C) L ≡ CID The CID of our example (C = 6, L = 16) is 2.03. The sequence is too short and we doubled the length of the original sequence. For effective compression we must consider longer sequences, for example consider an L = 128 string obtained by replicating the sequence considered above (L = 16) 8 times. In this case we must add one more LPF, (1,112), so, since C = 7 and L = 128, we find CID 0.61 and we have almost halved the number of bits needed to represent the sequence.

B. Data Corruption
In order to mimic real data corruption, or degradation, we proceed as follows. We associate to each particle i a boolean random variable b i ∈ {0, 1} which evolves under the action of a 2-states Markov Chain with transition probability P (b → b ) = P bb : p = P 10 = 1 − P 11 q = P 01 = 1 − P 00 At each simulation step we apply the rules of this Markov Chain to evolve stochastically the b i . As a result, we are able to modulate the degree of data corruption by tuning two parameters: i) the fraction of missing individuals µ and, ii) the average length of a trajectory, λ. In detail, we consider or ignore particle i of the data-set according to the value of b i : when b i = 0 the particle i is removed from the data-set until b i returns to 1. It easy to prove that the invariant measure ρ b (ρ 0 = µ) and the typical length (the average life-span) λ of a trajectory depend on p and q in the following way: Then, by setting p and q to appropriate values p = 1 1 + λ q = 1 − µ µ p and by starting with a configuration of {b i } i=1,N already in equilibrium according to the invariant measure i b i /N ρ 1 , we can simulate data corruption by varying µ and λ.