Sampling of Temporal Networks: Methods and Biases

Temporal networks have been increasingly used to model a diversity of systems that evolve in time; for example, human contact structures over which dynamic processes such as epidemics take place. A fundamental aspect of real-life networks is that they are sampled within temporal and spatial frames. Furthermore, one might wish to subsample networks to reduce their size for better visualization or to perform computationally intensive simulations. The sampling method may affect the network structure and thus caution is necessary to generalize results based on samples. In this paper, we study four sampling strategies applied to a variety of real-life temporal networks. We quantify the biases generated by each sampling strategy on a number of relevant statistics such as link activity, temporal paths and epidemic spread. We find that some biases are common in a variety of networks and statistics, but one strategy, uniform sampling of nodes, shows improved performance in most scenarios. Our results help researchers to better design network data collection protocols and to understand the limitations of sampled temporal network data.


I. INTRODUCTION
Networks have been used to model the interactions and interdependencies between the parts of a system [1]. Social and sexual contacts, flights between airports, email and phone communication, or gene regulatory networks are just a few examples of systems that can be conveniently mapped into networks [1,2]. When modelling real systems as networks, researchers sample data by extracting the relevant information within a given temporal and spatial frame [3], trace-routing or snow-balling from one or multiple sources [4,5], or simply by collecting all network-related information of a specific system, for example, email exchanges within a university or social interactions on a web-community [2,6]. Sampling network data involves at least four main decisions: the choice of (i) the total observation, or sampling, time (e.g. 1 day or 1 year); (ii) which nodes and (iii) which links will be observed (e.g. all or a fraction); (iv) the temporal resolution, i.e. the time interval in which data are recorded. If the temporal resolution is smaller than the total observation time, several interaction events between the same pair of nodes may be recorded, and filtering strategies may be used to remove weak links [7].
Network modelling may involve the traditional framework of static networks or extensions such as temporal networks [8]. In temporal networks, nodes and links are active at given times, in contrast to static networks where nodes and links remain active during the whole period. Temporal networks thus describe more realistically the temporal paths through which information (e.g. through email communication [9]), infections (e.g. over sexual contacts [10]), or resources (e.g. via flights [11]) can propagate or flow. In this temporal perspective, the order and frequency of node and link activations directly affect the dynamics of simulated epidemics [12][13][14][15][16][17] and information spread [18][19][20][21], mixing properties of random walks [16,22,23], and synchronization [24,25] on networks. Although some level of recording error is acceptable, accurate labeling of the interaction events is important to study, for example, simulated infections on real-life temporal networks [26].
Another challenge that comes with the study of real-life temporal networks is the amount of generated data since all timings of link activation are stored. This is in contrast to static networks in which activation events are aggregated and multiple activations of the same link are then represented as single weights [27], saving memory. The memory cost is particularly problematic when handling big data or when designing studies to collect social interactions using electronic devices such as RFID tags [28,29] and mobile phones [30]. In both cases, researchers aim to collect as much relevant data as possible while optimising resources. Furthermore, several algorithms used to extract information or to simulate dynamic processes on networks struggle to deal with large temporal networks, becoming computationally intractable [31][32][33][34][35]. Facing these challenges, the natural question that emerges is what data should be collected and used in network studies.
The four sampling decisions mentioned above are more critical for the study of temporal networks than for static networks. For example, the total observation time might affect birth and death statistics of nodes and links, and add artificial cutoffs to inter-event times since interaction events might be truncated. Similarly, the temporal resolution acts like a filter since only temporal patterns of node and link activity at time scales above the resolution are observable. A typical example is to use a resolution of 1 day to collect data on email communication; this choice would miss the rich dynamic communication patterns happening within a day. The sampling of nodes and links is expected to have at least the same effect as on static networks [5,6], with the aggravated consequence that missing nodes and links would also affect the temporal patterns of the neighbouring nodes.
When sampling temporal network data, one wishes to collect as much information as possible such that both short- and long-term temporal patterns can be observed [36]. Yet, the amount of information should be manageable by existing algorithms. In this paper, we study the impact of four sampling design decisions, or strategies, on key temporal network variables applied to various categories of real-life temporal networks. In particular, we will study how the choice of the observation time, the temporal resolution, and the number of sampled nodes and events affect the statistics of the lifetime and burstiness of links, the number and length of temporal paths between nodes, and the number of secondary infections and outbreak size of simulated epidemics.

II. METHODS

A. Temporal Networks
For a given time period T, a temporal network of size N is defined as a set of nodes i connected by a set E of links (i, j), on which interaction events occur at times t [8]. M denotes the total number of events summed over all links. The temporal resolution δ characterises the size of the time interval (or snapshot) in which the network data are collected; therefore, an event occurring at time t actually means that the event occurred in the time interval [t, t + δ). The statistics of the sampled networks will be denoted by the subscript s; for example, N_s and M_s represent the number of nodes and of events, respectively.
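As a concrete illustration (ours, not part of the original paper), a temporal network with resolution δ = 1 can be stored as a plain list of events (i, j, t), from which N, M and T are read off directly; the event list below is hypothetical example data:

```python
# A temporal network as a list of events (i, j, t): link (i, j) active at time t.
# The event list and node names are hypothetical illustration data.
events = [("A", "B", 1), ("A", "B", 2), ("B", "D", 3),
          ("B", "C", 4), ("A", "B", 6)]

nodes = {u for i, j, _ in events for u in (i, j)}  # set of observed nodes
N = len(nodes)                     # number of nodes
M = len(events)                    # total number of events over all links
T = max(t for _, _, t in events)   # observation period

print(N, M, T)  # → 4 5 6
```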

B. Network Data
We will use six network data sets corresponding to different contexts in which temporal networks are relevant. We have chosen networks with different topological and temporal structures. The first data set corresponds to sexual contacts between sex-workers and their clients (SEX) [14,37]; the second is about online communication between users of a web-community related to movies (FOR) [38]; the third is about email communication within a university (EMA) [9]; the fourth is about online communication between students in an online social network (COL) [39]; the fifth is about face-to-face proximity contacts (≤ 1.5m) between high-school students (HSC) [40]; the sixth is also about proximity contacts but between visitors of a museum exposition (GAL) [41] (see Table I). Links are undirected and only a single event may occur in a time window [t, t + δ) for a given link, i.e. events are unweighted.

C. Sampling Strategies

Sampling consists in making a number of observations or selecting a set of individuals to estimate properties of the target population. In the context of networks, sampling means selecting a number of nodes and links of a system within temporal and spatial frames to build the network of interest. In this paper, we will take network data sets available in the literature as reference populations. We will then study the consequences, on the network structures and on dynamics on the networks, of applying different sampling strategies to these populations. Effectively, we will subsample the original empirical network and then discuss the biased estimates of each sample, that is, the difference between the estimates given by the sampled and the original networks. This subsampling approach is widely used in statistics (see e.g. the subsampling bootstrap [42]) and other disciplines (see e.g. [5,6]).
We will study the effect of four sampling strategies (Fig. 1): (i) to reduce the observation time T_s, where T_s ≤ T and [0, T_s] is the sampling time in which the network data are collected (strategy TS); (ii) to uniformly select a fraction N_s/N of nodes of the original network and thus all events between the sampled nodes (strategy NS); (iii) to uniformly select a fraction M_s/M of events of the original network and thus all nodes connected by these events (strategy ES) - note that this protocol is used, instead of selecting links (and consequently all events associated to a particular link), because of its higher flexibility and because one can design "on-line sampling", that is, collect events as they happen in time; and (iv) to reduce the resolution by setting δ_s to a multiple of the δ of the original network (strategy RS). Note that repeated same-link events in the interval [t, t + δ_s) are merged into a single event.

FIG. 1. Sampled nodes and sampled events are highlighted for each strategy. In (a), we obtain a new temporal network by truncating the observation time to T_s. In this example T_s = 3, therefore all nodes and events in 1 ≤ t ≤ 3 are collected. In (b), we uniformly choose nodes in 1 ≤ t ≤ T. In this example, nodes B, C and E are sampled, therefore only the events between these particular nodes are collected. In (c), we uniformly choose events in 1 ≤ t ≤ T. In this example, the events (A, B) at t = 2, (B, D) at t = 3, and (A, B) at t = 6 are sampled. In (d), we coarse-grain the temporal network in the interval 1 ≤ t ≤ T by letting an event represent the presence of at least one event at that link during δ_s. In this example, we change the resolution from δ = 1 to δ_s = 2, therefore we only record interaction events at times 1, 3, and 5, and events are merged if they repeat (e.g. original events at t = 1 and t = 2 at link (A, B) become a single event at t = 1).
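The four strategies can be sketched in Python as follows (our illustrative code, not the authors' implementation; function names are ours). Each function takes a list of events (i, j, t) with δ = 1 and returns a sampled event list; strategy RS bins times starting at t = 1, as in the example of Fig. 1d:

```python
import random

def sample_TS(events, Ts):
    """Strategy TS: truncate the observation window to [0, Ts]."""
    return [e for e in events if e[2] <= Ts]

def sample_NS(events, fraction, rng=random):
    """Strategy NS: uniformly sample a fraction of nodes and keep
    only the events between the sampled nodes."""
    nodes = sorted({u for i, j, _ in events for u in (i, j)})
    kept = set(rng.sample(nodes, round(fraction * len(nodes))))
    return [(i, j, t) for i, j, t in events if i in kept and j in kept]

def sample_ES(events, fraction, rng=random):
    """Strategy ES: uniformly sample a fraction of events; the nodes
    they connect are kept implicitly."""
    return rng.sample(events, round(fraction * len(events)))

def sample_RS(events, delta_s):
    """Strategy RS: coarse-grain to resolution delta_s (bins start at t = 1);
    repeated same-link events within a bin are merged into one."""
    binned = {(i, j, ((t - 1) // delta_s) * delta_s + 1) for i, j, t in events}
    return sorted(binned, key=lambda e: e[2])
```

For instance, with δ_s = 2, sample_RS merges events at t = 1 and t = 2 on the same link into a single event at t = 1, reproducing the example in the caption of Fig. 1d.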

D. Validation Measures
To compare the effects of the four sampling strategies, we will estimate six measures, or statistics, on each sample s of the original networks. For strategies NS and ES, we will present average values calculated over five random network samples. Two measures are related to the timings of events, two to the temporal paths and two to the dynamics on the network.
The first measure is the burstiness B_s of the link activity [43]. This measure is widely used to characterise temporal patterns on temporal networks. The burstiness depends on the mean m and standard deviation σ of the distribution of same-link inter-event times (the inter-event time is the time between two subsequent same-link activations) and measures the deviation of the link activity from a Poisson process. Considering the distribution of inter-event times of all links collected together, the burstiness is given by

B_s = (σ − m)/(σ + m).    (1)

The second measure is related to the lifetime L_ij (or persistence [44]) of links, that is, the time between the first event, t_ij^first, and the last event, t_ij^last, on the link (i, j), i.e. L_ij = t_ij^last − t_ij^first. The link lifetime can be used as a proxy for the real lifetime of contacts [45]. We measure the average lifetime L_s over all K_s links in which L_ij > 0 (i.e. there are at least two events on the link) to summarise the lifetime of the links in the sampled network, i.e.

L_s = (1/K_s) Σ_{(i,j): L_ij > 0} L_ij.    (2)
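As an illustration (our sketch, not the authors' code), both quantities can be computed in a few lines from an event list (i, j, t), pooling the same-link inter-event times over all links as described above:

```python
from collections import defaultdict
from statistics import mean, pstdev

def burstiness_and_lifetime(events):
    """Burstiness B_s of the pooled same-link inter-event times, and the
    average lifetime L_s over links with at least two events."""
    times = defaultdict(list)               # link -> activation times
    for i, j, t in events:
        times[frozenset((i, j))].append(t)

    # Inter-event times pooled over all links.
    iet = [b - a for ts in times.values()
           for a, b in zip(sorted(ts), sorted(ts)[1:])]
    m, sigma = mean(iet), pstdev(iet)
    B = (sigma - m) / (sigma + m)           # -1 (regular) to 1 (bursty)

    # Link lifetime: last minus first activation; average over K_s links.
    lifetimes = [max(ts) - min(ts) for ts in times.values() if len(ts) > 1]
    return B, mean(lifetimes)
```

For instance, a link activated at t = 0, 2, 4, 6 and another at t = 0, 5 give L_s = (6 + 5)/2 = 5.5 and a negative B_s, since the pooled inter-event times (2, 2, 2, 5) are more regular than a Poisson process.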
The third and fourth measures are related to temporal paths. Temporal paths are particularly relevant in the context of temporal networks because they combine topological and temporal information. They emphasise the role of the timings of events in the connectivity of a node. For example, two nodes may be topologically close (e.g. directly connected by a link) but one may need to wait a long time for this link to be active (i.e. for an interaction event to happen). On the other hand, a more topologically distant pair of nodes (e.g. two links away) may be reached quickly if the interaction events are temporally close. We assume here that, within a time step, a node can only be reached by another node through a direct link. For example, there are no paths connecting nodes A and C if the events (A,B) and (B,C) occur at the same time. An alternative assumption could define a path between A and C in this example [46].
The third measure is the reachability ratio f_s [47]. It is the fraction of ordered pairs of nodes that are connected by at least one temporal path,

f_s = (1/(N_s(N_s − 1))) Σ_{i≠j} F_ij,    (3)

where F_ij = 1 if the time distance τ_ij (defined in the next paragraph) is finite and F_ij = 0 otherwise. It can happen that τ_ij is finite whereas τ_ji is infinite, or vice versa.
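A sketch of how f_s can be computed (our illustrative implementation, not the authors' code): earliest-arrival times from the birth of each node are obtained by a single chronological sweep, disallowing two-link hops within one time step as assumed above. The same τ values also feed the time-distance measure of the next paragraph:

```python
from itertools import groupby

def temporal_distances(events, source):
    """tau_j: time to reach each node j from the birth (first appearance)
    of `source`; within one time step only direct links relay (our sketch)."""
    birth = min(t for i, j, t in events if source in (i, j))
    reached = {source: birth}
    for t, batch in groupby(sorted(events, key=lambda e: e[2]),
                            key=lambda e: e[2]):
        snapshot = dict(reached)        # block two-link hops within one step
        for i, j, _ in batch:
            for a, b in ((i, j), (j, i)):   # links are undirected
                if snapshot.get(a, float("inf")) <= t:
                    reached[b] = min(reached.get(b, float("inf")), t)
    # tau: arrival time minus the birth time of the source.
    return {j: tt - birth for j, tt in reached.items() if j != source}

def reachability_ratio(events):
    """f_s: fraction of ordered node pairs joined by a temporal path."""
    nodes = sorted({u for i, j, _ in events for u in (i, j)})
    finite = sum(len(temporal_distances(events, src)) for src in nodes)
    return finite / (len(nodes) * (len(nodes) - 1))
```

Unreachable nodes simply do not appear in the returned dictionary, which corresponds to τ_ij → ∞ (F_ij = 0).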
The fourth measure is related to the time distance between nodes in the network [31,47,48]. The time distance τ_ij is here defined as the time necessary to reach node j from the first appearance (i.e. birth) of node i through the shortest temporal path connecting i and j. If there is no path between nodes i and j, we set τ_ij → ∞ [48]. We then set

θ_s = (1/(N_s(N_s − 1))) Σ_{i≠j} τ_ij^{−1}    (4)

to summarise the τ_ij over the pairs of nodes. Note that τ_ij → ∞ contributes zero to the sum in Eq. (4) and that both the shortest path from i to j and that from j to i appear in Eq. (4) because τ_ij is not equal to τ_ji in general. This measure is normalized by N_s(N_s − 1), which gives the total number of possible paths between ordered pairs of nodes if all links occur at the same time [46].

For the fifth and sixth measures, we model a susceptible-infected-recovered (SIR) epidemic on the temporal network. In the SIR model, a node can be either susceptible (S), infected (I) or recovered (R). Infected nodes can infect susceptible nodes with probability β and recover with probability µ in a time step. For strategy RS, to account for the change in the resolution δ_s (and consequently in the contact rate), we re-scale the parameters to β/δ_s and µ/δ_s. Re-scaling these parameters effectively conserves the contact rate because we assume the events are unweighted; without the re-scaling, the infection and recovery probabilities would be overestimated for δ_s > δ. We start by infecting a single node and leaving all others susceptible. Under the so-called individual-based approximation [34], the probabilities S_i(t), I_i(t) and R_i(t) that node i is susceptible, infected or recovered at time t evolve as

S_i(t) = S_i(t − 1) Π_{j ∈ N_si(t)} φ_j(t),    (5)

I_i(t) = (1 − µ) I_i(t − 1) + S_i(t − 1) [1 − Π_{j ∈ N_si(t)} φ_j(t)],    (6)

R_i(t) = R_i(t − 1) + µ I_i(t − 1),    (7)

where N_si(t) is the set of neighbors of node i at time t, φ_j(t) = 1 − (1 − µ)βI_j(t − 1) if there is an event between nodes i and j at time t, and φ_j(t) = 1 otherwise. We then measure the average number of secondary infections R_s^eff and the average final outbreak size Ω_s caused by a single infected node at time 0 for each sampled network [34].
R_s^eff is thought to indicate the propensity of an outbreak to become pandemic [49]. The value of Ω_s is not linearly related to R_s^eff, although a larger Ω_s is expected for a larger R_s^eff [50]. Under the individual-based approximation, we obtain

R_s^eff = (1/N_s) Σ_{i0} Σ_{t=1}^{T_s} Σ_{j ∈ N_s,i0(t)} (1 − µ) β I_{i0}(t − 1) S_j(t − 1)    (8)

and

Ω_s = (1/N_s) Σ_{i0} Σ_{i=1}^{N_s} [I_i(T_s) + R_i(T_s)],    (9)

where the outer sums run over the choice of the initially infected node i0.
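The individual-based recursion described above can be sketched as follows (our reconstruction from the text; the function name and example usage are ours). It tracks the probabilities S_i, I_i, R_i for every node with a synchronous update and returns the expected final outbreak size for a given seed:

```python
from collections import defaultdict

def sir_iba(events, nodes, seed, beta, mu, T):
    """Individual-based approximation of SIR on a temporal network given
    as events (i, j, t); our sketch, not the authors' implementation."""
    by_time = defaultdict(list)
    for i, j, t in events:
        by_time[t].append((i, j))

    S = {v: 1.0 for v in nodes}
    I = {v: 0.0 for v in nodes}
    R = {v: 0.0 for v in nodes}
    S[seed], I[seed] = 0.0, 1.0          # single infected node at time 0

    for t in range(1, T + 1):
        # Product of phi_j(t) over the neighbours of each node active at t,
        # using the I values of the previous time step.
        no_inf = {v: 1.0 for v in nodes}
        for i, j in by_time.get(t, ()):
            no_inf[i] *= 1.0 - (1.0 - mu) * beta * I[j]
            no_inf[j] *= 1.0 - (1.0 - mu) * beta * I[i]
        for v in nodes:                  # synchronous update with t-1 values
            newly = S[v] * (1.0 - no_inf[v])
            R[v] += mu * I[v]
            I[v] = (1.0 - mu) * I[v] + newly
            S[v] *= no_inf[v]
    return sum(I[v] + R[v] for v in nodes)   # expected ever-infected nodes
```

Averaging this quantity over all choices of the seed gives an estimate of Ω_s; the expected number of secondary infections of the seed can be accumulated analogously from S_j(t − 1)(1 − µ)βI_seed(t − 1) over the seed's contacts.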

III. RESULTS
A. Network size

Different sampling strategies have a different impact on the number of nodes and events in the sampled networks (Fig. 2a,b). Reducing the temporal resolution δ_s (strategy RS) has no effect on the number of nodes (N_s) but monotonically decreases the number of events (M_s). This happens because some events repeat at subsequent times. If there is little repetition, reducing the temporal resolution will only slightly decrease the number of events. In the SEX network (δ = 1 day), for example, setting δ_s = 63 days only reduces the number of events by 8.87% (Fig. 2a,c). This is the reason for the short magenta curve in Fig. 2a. In contrast, setting δ_s = 55 hours in the EMA network (δ = 1 hour) reduces the number of events by about 40% (Fig. 2b,c). The high turnover of nodes (i.e. shorter lifetimes in comparison to the observation time in the original network) in the SEX network explains why the number of nodes falls more substantially in this case than in the EMA network if we reduce T_s (strategy TS). For example, a reduction of about 43% in T_s results in about 37% fewer nodes in the SEX network (Fig. 2a). For the EMA network, however, a reduction of 48% in T_s implies only 9.8% fewer nodes (Fig. 2b). The same reduction in T_s by half results in approximately half the events in both cases (Fig. 2d).
The uniform sampling of events (strategy ES) has less impact on the number of nodes than the uniform sampling of nodes (strategy NS) if we control for the number of events (Fig. 2a,b). This happens because a node typically has more than one event, with the same or with different neighbours. In strategy ES, highly connected nodes are selected often (proportionally to their number of events [51]) and thus sampled nodes might repeat, decreasing the final number of nodes in the sample. In strategy NS, on the other hand, the selection of nodes brings all of their events (to other sampled nodes), implying that fewer nodes are selected (in comparison to strategy ES) for the same number of events.
In the following analyses, we will present the results for COL, FOR, HSC and GAL using two configurations (A and B) for each strategy. Each configuration corresponds to a fixed number of events M_s, based on an arbitrarily chosen resolution. That is, we set a resolution δ_s and took the number of events of this sample as the reference to be used in the other sampling strategies. For the COL data set, A corresponds to a fraction of 62% (δ_s = 48 hours) and B to a fraction of 77% (δ_s = 12 hours) of the events of the original network. For the FOR data set, we have respectively 56% (δ_s = 24 hours) and 74% (δ_s = 6 hours); for HSC, 54% (δ_s = 60 sec) and 68% (δ_s = 40 sec); and for the GAL data set, we have 57% (δ_s = 60 sec) and 70% (δ_s = 40 sec).

B. Timings of events
We have found that uniformly sampling nodes (strategy NS) seems to be the best strategy to conserve the burstiness. The value of B_s is robust in both SEX and EMA data sets even when only half of the events are sampled (Fig. 3a,b). The fact that the number of sampled nodes (by strategy NS) has little impact on the estimation of the burstiness suggests that all nodes follow similar inter-event time distributions (i.e. a few nodes are sufficient for an accurate estimation, Fig. 2a,b). On the other hand, increasing δ_s (strategy RS) has a significant negative effect on B_s. The resolution affects the distribution of inter-event times since increasing δ_s filters out short inter-event times and reduces the long inter-event times, making the signal move towards more regularity (with larger mean and standard deviation). Strategies ES and TS also generate biases, which are considerably smaller than the biases given by strategy RS. For different reasons, strategies ES and TS also affect the distribution of inter-event times, but to a lesser extent than strategy RS. Strategy ES misses a few events and thus increases the average (and standard deviation of the) inter-event times. In contrast, strategy TS skips events that could generate long inter-event times, since the observation time is truncated, and thus generates smaller means and standard deviations. Similar results are observed for the other data sets (Fig. 3c,d).

Strategies NS and ES generally give good estimations of the average lifetime of links L_s for all data sets (Fig. 4a-d). The uniform sampling of events or nodes decreases the lifetime of some links but also sometimes does not sample any event of a particular link (i.e. some links and nodes may not be sampled at all). The smaller K_s possibly compensates the decrease in the lifetimes such that the average L_s is little affected. Strategy TS introduces cut-offs on the lifetimes of both links and nodes since sampling is limited to the observation time [0, T_s].
Consequently, the lifetime is underestimated. The case of GAL is special because visitors explore the museum in groups at allocated times, meaning that links form and disappear before T_s (Fig. 4c,d). Finally, strategy RS tends to overestimate L_s because increasing δ_s is equivalent to rounding down the times of births and deaths. The rounding down leads to an overall increase in the lifetime of links and a decrease in K_s, since links with a single event are not included in the average.

C. Temporal Paths
The reachability, f_s, changes substantially for the SEX and HSC networks but not as much for the other networks (Fig. 5a-d). For example, in the original SEX network about 34% of the pairs of nodes were reachable, in contrast to about 94% in the original EMA network. After sampling, only strategy RS decreases f_s in the EMA network. However, the difference with the original value is small, e.g. 6.4% in the sampled EMA network containing about 50% of the original events. This is considerably less than in the case of the SEX network, which shows a difference of 55.9% from the original value for the same strategy RS (Fig. 5a,b). The generally low biases generated by strategies NS and ES result from the redundancy of paths, i.e. the fact that there are multiple paths connecting the same pairs of nodes at distinct times. The absence of some events thus has little impact on f_s. The same redundancy is also observed, for example, in the SEX network but to a lesser extent, possibly because of the relatively smaller density of events in the SEX network in comparison to the EMA network (see Table I). Furthermore, the low observed biases of strategy TS (for most data sets) indicate that the number of existing shortest paths decreases at the same rate as the number of potential paths (N_s(N_s − 1)) for smaller T_s. The biases observed for the SEX and HSC data sets, on the other hand, indicate that the new sampled nodes (introduced in the sample for increasing T_s) do not result in the same number of new paths as the number of potential paths that could exist (i.e. f_s decreases with increasing T_s).

Figure 6a-d shows that the statistics of the duration of the temporal paths between nodes, θ_s, changes for EMA, COL, FOR and GAL for strategy TS. For the SEX and HSC data sets, this strategy generates very low biases. Although several shortest temporal paths are formed before T_s, some only exist if we increase T_s.
Therefore, if we truncate the data to T_s, the summation term in θ_s may decrease. But since nodes are also removed (i.e. lower N_s), the overall value of θ_s increases. For the SEX and HSC data sets, the decrease in the summation term is equivalent to the decrease in the number of potential shortest paths (N_s(N_s − 1)). On the other hand, strategy RS results in considerably different values for the SEX, EMA, COL and FOR data sets. Strategy RS generates larger biases than the other strategies because a higher δ_s rounds down the timings of events, collapsing many links to the same time interval and thus removing several temporal paths between nodes, which in turn results in smaller θ_s. Remember that, in our definition, only directly connected nodes have a temporal path within the same time step. For the other two strategies (NS and ES), uniform sampling of nodes or events increases, on average, the temporal distances between nodes. The higher θ_s given by strategy NS, in comparison to strategy ES, is possibly a result of the smaller N_s obtained by strategy NS in comparison to the N_s obtained by strategy ES (see Fig. 2 for the SEX and EMA data sets). The relatively smaller biases in the EMA data set in comparison to the SEX data set are likely a result of the higher redundancy of paths in the EMA network, as discussed in the previous paragraph.

D. Epidemic Variables
We set β = 0.5 and µ = 0.001 to simulate a stochastic epidemic process. These values were chosen because they generate relatively large epidemic outbreaks in all original networks, and thus facilitate the understanding and discussion of the mechanisms regulating the epidemic process. We first look at the average number of secondary infections, R_s^eff. Strategy TS results in a relatively small increase in R_s^eff for most data sets, whereas strategies NS and ES result in a small decrease for all data sets (Fig. 7a-d). The estimations of R_s^eff given by the sampled networks indicate that the systems remain above the epidemic threshold of R_s^eff = 1 for this particular set of parameters, and that an epidemic outbreak will likely occur. Since the value of R_s^eff also indicates how difficult it is to avoid an epidemic outbreak, the estimations given by the sampled networks generally suggest that an outbreak might be easier to control than indicated by the original network (i.e. R_s^eff is closer to one in the sampled networks). The results for strategy RS are substantially far from the value given by the original network for the SEX, EMA and COL data sets, but not for the FOR, HSC and GAL data sets. The low biases produced by strategies NS and ES across the different data sets are explained by the fact that the infection process is temporally finite. Many events do not actually contribute to the spread of the infection given the stochastic nature of the process, i.e. removing randomly selected interaction events rarely prevents infection events. The negative effect of the absence of interaction events is lower for the EMA network, in which events repeat more often than in the SEX network. Therefore, the same neighbour has more chances of being infected at subsequent times in the EMA network than in the SEX network. This is related to the results observed for θ_s (Fig. 6) and f_s (Fig. 5), where a substantial absence of events generated small biases for most networks. Strategy TS also performs well because of the finite time of the infection period, which makes most infection events occur before T_s. If the infection period is long (small µ) or the infection probability is small, the biases given by strategy TS are expected to be larger. Since the number of nodes is smaller in comparison to the original networks, R_s^eff becomes slightly over-estimated by strategy TS. On the other hand, strategy RS generates large biases. Increasing δ_s alters the infection potential through a particular event and extends the infection period because of the re-scaling of the infection and recovery probabilities, respectively. For example, in the SEX data set, if δ_s = 7 days, the effective infection probability is β_s = β/7 ≈ 0.07; this infection probability is too low. Combined with the fact that the number of events (of a single node to different neighbours) at a given time step does not increase much for increasing δ_s, very few neighbours may be infected by an infectious node (Fig. 7a). In the EMA network, on the other hand, there will be more events (connecting different nodes) at a single time step and thus there is a higher chance of infecting some neighbours. See also Fig. 2c for the correspondence between M_s/M and δ_s for the SEX and EMA data sets.

Figure 8a-d shows that the final outbreak size, Ω_s, is close to zero for strategy RS applied to the SEX network, to the EMA network when approximately 65% (or less) of the events are sampled, and to the COL network. For the other three sampling strategies, Ω_s is similar between the sampled and original networks for most data sets but increasingly different for smaller samples in the case of the SEX network. This is again explained by the fact that events repeat over time (less often in the SEX network). This repetition of events creates redundancies of temporal paths.
In the absence of several events (by any of these three strategies), various potential infection routes remain between the nodes, and the epidemic may still grow. The biases should increase for smaller infection probabilities since an infection event will be less likely through a particular interaction event.

IV. CONCLUSIONS
Our analyses indicate that, generally, both measures related to link activity are little affected by uniform sampling of nodes. This strategy also performed very well for the estimation of the statistics of temporal paths and epidemics for all network data sets but the sexual contact data set. These results likely explain the high performance of recently proposed methods to reconstruct temporal networks [52,53]. That is, the temporal patterns extracted from a small sample of the temporal network are sufficient to generate larger temporal networks with realistic temporal properties. However, more research is necessary to validate these methods on diverse types of networks. Uniform sampling of events has also performed well for most statistics on most data sets. Although less efficient than uniform sampling of nodes, sampling of events may be an option when continuously collecting network data. For example, for a given number of nodes, at each time step a fraction of links may be selected and stored as time evolves ("on-line sampling"). This procedure is expected to produce better samples than truncating the observation time. In fact, truncating the observation time produced mixed results. For some networks, this sampling strategy did not much affect the statistics, but for some other data sets relatively high biases are observed (e.g. for the lifetime and for the temporal distance). Although it performed well in some cases, varying the temporal resolution showed the poorest performance. In some networks, there are many repetitions of events. Therefore, merging the events on the same link by reducing the temporal resolution implies small changes in the temporal network structure. On the other hand, if there are few repetitions of events, the network might look substantially different at each temporal resolution, consequently affecting the statistics.
Using a different methodology, previous research suggests that for a set of epidemiological parameters a high temporal resolution might not be necessary to study simulated epidemics in some systems [15].
In general, we have identified differences in the magnitude of the biases on various statistics and real-life networks. Given our results, we advise against reducing the temporal resolution too much; instead, if possible, we recommend uniform sampling of nodes to conserve several of the properties of temporal networks. The choice of a sampling strategy may be case-dependent, leaving some room for sampling design. In practice, a data collection project is likely to combine all of the sampling strategies studied here. It is difficult to predict the consequences of combining them, since a positive bias from one strategy may compensate a negative bias from another. Nevertheless, our study of the effects of separately applying each sampling strategy will likely improve data collection by helping researchers make informed decisions.