Efficient limited-time reachability estimation in temporal networks

Time-limited states characterise many dynamical processes on networks: individuals infected by a disease recover after some time, people forget news spreading on social networks, and passengers do not wait forever for a connection. These dynamics can be described as limited waiting-time processes, and they are particularly important for systems modelled as temporal networks. Such processes have been studied via simulations, which is equivalent to repeatedly finding all limited waiting-time temporal paths from a source node and time. We propose a method that is orders of magnitude more efficient at tracking the reachability of such temporal paths. Our method gives simultaneous estimates of the in- or out-reachability (with any chosen waiting-time limit) from every possible starting point and time. It works on very large temporal networks with hundreds of millions of events on current commodity computing hardware. This opens up the possibility of analysing reachability and the dynamics of spreading processes on large temporal networks in completely new ways. For example, one can now compute centralities based on global reachability for all events, or find, with high probability, the infected node and time that would lead to the largest epidemic outbreak.


I. INTRODUCTION
The topology of the networks underlying complex systems is crucial for any dynamical process taking place on them [1]. This realisation provided new perspectives for understanding various phenomena, such as the spreading of disease [2] and social dynamics [3]. In addition to the topology, it has later become evident that the time-varying nature of these connections also has a large effect on the unfolding of spreading processes [4] and many other dynamical phenomena [5]. This was one of the main realisations leading to the emergence of the field of temporal networks, which studies structures where links are not static but active only at some specific times [6,7]. The timing of connections has both uncovered interesting phenomena never seen before and posed new types of computational problems for the analysis of network data and models.
In static networks, the possible routes along which any dynamics can evolve are determined by topological paths. Paths can also be defined for temporal networks, but with two fundamental differences. First, the paths need to be time-respecting, such that consecutive links are activated in the correct order [8]. Second, the time between activations is often limited. This is because many of the processes are characterised by time-limited states and finite memory, e.g., spreading processes, where these appear as the limited lifetime of the spreading agent. The maximum acceptable transfer time in a transportation network [9,10] or in a gossip protocol [11], as well as the finite infectious period of an individual in the case of disease spreading [12], are all good examples of such dynamics. These processes can only spread through time-respecting paths where consecutive connections take place within some limited time δt.
The detection of temporal paths and the connectivity they provide is fundamental to understanding both the dynamics on and the characteristics of these networks, but it cannot directly rely on the methodologies developed for static structures. Instead, new methods need to be developed, and this work is still in its infancy compared to static networks. For example, temporal connectivity and related measures are routinely computed using breadth-first-search-type algorithms. This is similar to the approach of finding connected components in static graphs in the early studies of percolation phenomena on lattices [13]. Major improvements to these early algorithms, such as the Newman-Ziff algorithm [14], made it possible to analyse large network structures in an unprecedented way and opened the path to our current understanding of network connectivity.
An elegant way to overcome difficulties in temporal networks is to transform temporal problems into static problems, which we know how to solve efficiently. To do so, we need a representation that maps temporal networks to a static structure on which we can then apply static network methods. Weighted temporal event graphs have recently been suggested as one such solution [15,16]. They provide a representation of temporal networks as static directed acyclic graphs (DAGs), which contain all information on the temporal paths. They can be interpreted as temporal line graphs, where events are nodes, and adjacent events are connected by links directed along the arrow of time. Such links can form longer δt-constrained paths, representing the ways a limited waiting-time process can spread through the structure. This representation allows us to design efficient algorithms measuring temporal centrality or connectivity in time-varying networks, while exploiting tools and theories developed for static graphs and directed acyclic graphs.
A particular way of using the weighted event graphs is to apply the Newman-Ziff algorithm [14] to measure the size of the weakly connected components while increasing the δt value [15]. This allows extremely fast sweeps over δt values, where the sizes of the weakly connected components can be measured for each value. However, weakly connected components only give an upper estimate of the outcome of any potential global phenomenon. On the other hand, more precise indicators of connectivity and influence, like in- and out-components, are difficult to measure with current conventional techniques.
Here we take an approach complementary to the Newman-Ziff algorithm and develop a method to make accurate estimates of the sizes of the source and influence sets of every single event in a temporal network, given an arbitrary δt. We rely on the DAG character of the event graph representation, which allows us to convert our temporal reachability problem into a DAG reachability problem, i.e., the graph-theoretical problem of estimating transitive closure sizes [17]. Relying on already developed probabilistic counting methods [18], we devise an algorithm that estimates the global reachability of each event even in extremely large temporal networks with hundreds of millions of events. Further, using this approach, we can efficiently identify, with high probability, the events with the largest out- (and in-) components in massive temporal networks.
To introduce and demonstrate our method, we first define the basic formal concepts of event graphs in Section II A. In Section II B we describe our algorithmic solutions, and in Sections III A and III B we use them to estimate out-component sizes of events in random and real-world networks. The analysed networks include large-scale temporal structures such as mobile phone communication and transportation networks. Note that implementations of the algorithms described in Section II and the Appendices are publicly available [19].

A. Definitions and measures
1. Temporal networks, adjacency, temporal paths, and reachability

Temporal networks can be formally defined in various ways [6]. They are built up from time-varying interaction events, which can be directed or undirected, may have a duration or delay, and can involve two or more nodes. In turn, events induce temporal paths, whose structure depends critically on the event characteristics. To capture all of this complexity, we introduce our methods using a slightly more general definition of temporal networks than usual, which can easily be made more specific depending on the features of the actual temporal network.
We define a temporal network as a tuple G = (V_G, E_G, T) of a finite set of nodes V_G, a finite set of events E_G, and an observation window T. An event e ∈ E_G is defined as e = (u, v, t, τ), where u, v ⊆ V_G are the source and target node sets of the event, t is the time at which the event starts, and τ is its delay or duration. Here we assume that the source and target node sets are small and of constant size, not depending on the length of the temporal data, which is usually the case in real temporal networks.
To capture possible information flow [20], potential causal relationships [21], and mesoscopic motifs in temporal networks [22], we can define an adjacency relation between pairs of events. We say that two events e_i, e_j ∈ E_G are adjacent, e_i → e_j, if they have at least one node in common in their target and source sets, v_i ∩ u_j ≠ ∅, and they are consecutive: the second event e_j, starting at t_j > t_i, cannot start before the first event e_i ends, so the time difference between the two events must satisfy ∆t(e_i, e_j) = t_j − t_i − τ_i > 0. In addition, we can constrain events to be δt-adjacent, e_i →_δt e_j, considering them related only if they happen within a time distance δt, i.e., ∆t(e_i, e_j) ≤ δt.
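As a concrete illustration, the δt-adjacency rule above can be sketched in a few lines of Python (the Event class and its field names are our own illustrative choices, not part of the paper's implementation):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    sources: frozenset  # source node set u
    targets: frozenset  # target node set v
    t: float            # start time
    tau: float = 0.0    # delay or duration

def dt_adjacent(ei: Event, ej: Event, dt_max: float) -> bool:
    """e_i -> e_j iff the events share a node (targets of e_i vs sources
    of e_j) and 0 < t_j - t_i - tau_i <= dt_max."""
    gap = ej.t - ei.t - ei.tau
    return bool(ei.targets & ej.sources) and 0 < gap <= dt_max
```

For undirected events, the node-sharing test would simply compare the full node sets of the two events.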
Unlike in static networks, in temporal structures information can pass between nodes only at the times and in the directions of interactions. Thus, to study any dynamical process on temporal networks, we first need to define how information can propagate through a sequence of events. We define a temporal path (also called a time-respecting path) as an alternating sequence of nodes v_i ∈ V_G and events e_i ∈ E_G, where events consecutive in the sequence must be adjacent. In contrast to static paths, a temporal path is not permanent but depends on the time and the source node of the first interaction. Adjacency of consecutive events means that they happen in the correct temporal order, taking their durations and directions into account as well. In addition, we can constrain consecutive events to be δt-adjacent, to capture processes with a maximum allowed transfer time. With these possible restrictions, we can already encode some characteristics of the dynamical process in the representation of the underlying temporal network. In the following, we often use as an example a simplification of this general description, assuming instantaneous, undirected, and dyadic interactions with only two interacting nodes [21]. This gives us a network G′ with events defined as (u, v, t) ∈ E_G′.
Temporal paths encode reachability in a temporal network, i.e., whether a node at a given time can or cannot influence another node at an upcoming time. Considering all outgoing (or incoming) temporal paths starting from (resp. ending at) a given node at a given time, one can obtain its influence (resp. source) set. This out-component (resp. in-component) can be computed as the union of the nodes on the time-respecting paths starting from (ending at) a given node. The out-component determines the possible routes along which information, epidemics, rumours, or influence can travel after being initiated from a given node at a given time. This may give us the potentially infected set of patients in an epidemic, or the set of people influenced by a political campaign. However, solving the reachability problem is computationally expensive even for small structures [8]. For larger temporal networks, the only feasible solution has been to sample initial source node-time pairs and compute their influence sets using a breadth-first search algorithm [8,23]. This approach, although very expensive, has already provided some insight into the average reachability of temporal networks and its relation to various network features [5,21].

2. Weighted temporal event graphs
A recently introduced higher-order representation of temporal networks, called temporal event graphs, provides effective solutions to many computational problems related to temporal network connectivity [15] and other purposes [16]. Given a temporal network G = (V_G, E_G, T), the temporal event graph representation is formally defined as a weighted graph D = (E_G, A_E, ∆t). The nodes of D are the events of G, and edges are drawn between adjacent events. Every edge is directed along the arrow of time, and its weight is the time difference ∆t(e_i, e_j) between the two events incident to it. Going forward in time, the direct successors of an event e ∈ E_G are the events connected by outgoing edges from e. Going backwards in time, the direct predecessors of an event e are the events with a directed edge to e. Event graphs can be regarded as a temporal line graph representation, capturing higher-order relationships between events.
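For the simplified case of instantaneous, undirected, dyadic events (u, v, t), the event graph construction can be sketched as follows (an illustrative naive implementation; a production version would use per-node event indices rather than a quadratic scan):

```python
def build_event_graph(events, dt_max):
    """Sketch: event-graph edges for instantaneous, undirected dyadic
    events (u, v, t). Returns the time-sorted events and the directed,
    dt-weighted edges (i, j, dt) between dt-adjacent events."""
    events = sorted(events, key=lambda e: e[2])  # sort by time
    edges = []
    for i, (u1, v1, t1) in enumerate(events):
        for j in range(i + 1, len(events)):
            u2, v2, t2 = events[j]
            dt = t2 - t1
            if dt > dt_max:
                break  # events are time-sorted; later ones are farther away
            if dt > 0 and {u1, v1} & {u2, v2}:
                edges.append((i, j, dt))
    return events, edges
```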
Since adjacency is defined between non-simultaneous events and directed by time, temporal event graphs are weighted directed acyclic graphs (DAGs). They are static representations of temporal networks, which can be analysed with the full spectrum of tools and methods developed for static graphs. Further, they allow one to use concepts of static centrality and similarity measures to develop similar concepts for temporal networks. As their most important feature, they appear as a static superposition of all temporal paths present in the original temporal network. In other words, P_G^e = P_D^v, where P_G^e is the set of event sequences in all temporal paths in G, and P_D^v is the set of node sequences in all the paths in D. While every temporal path in G corresponds to a unique path in D, there can be redundancy in the other direction: multiple temporal paths in G may correspond to a single event path in D. This is because multiple temporal paths can use the same sequence of events but a different sequence of nodes. It is easy to construct such sets of paths with events that have multiple source and target nodes. In the case of dyadic interactions, such redundancies are minor or non-existent. In any case, these multiplicities do not have any effect on reachability.

FIG. 1. An undirected temporal network and its corresponding static event graph. Two events e_i and e_j of the temporal network are adjacent if they share at least one node and ∆t(e_i, e_j) ≤ δt, where in this instance we have chosen δt = 3 for the purpose of illustration. One valid temporal path is highlighted in red on both the temporal network and the event graph.

3. Component definitions
Components can be defined in various ways in an event graph. As it is a directed graph, one can identify in- and out-components as well as weakly connected components. Since event graphs are directed acyclic graphs, strongly connected components larger than one node do not exist.
More precisely, the out-component of an event (also called the root event) in a static event graph is defined as the maximal set of future events that can be reached by any temporal path starting from the root event. In the case of an epidemic with the initial infection taking place at the root event, this is the set of temporal contacts (and nodes) which may propagate the disease. We define the maximum out-component of an event graph as its largest out-component, giving us the largest possible effect or outbreak ever observable in the network. Equivalently, the in-component of an event is formed by the incoming temporal paths, and it is defined as the set of earlier events (and nodes) which can have influenced the actual pair of interacting nodes up to the actual time. The definition of weakly connected components is less restrictive, as they include any events that are connected via temporal paths irrespective of the direction of the edges.

FIG. 2. Static event graph representation of a temporal network. The weakly connected components are {e_1, e_3} and {e_2, e_4, e_5, e_6, e_7, e_8}. The out-component set of the event e_2 is {e_2, e_4, e_6, e_7, e_8}. Note that event e_5 is in the same weakly connected component as e_2 but is not a member of its out-component set. The algorithm finds the out-components by going backwards through a topological ordering of events (i.e., reverse time order); at each step, the out-component of an event is calculated as the union of the out-component sets of the events reached by its out-edges, plus the event itself. Since the ordering is a reverse topological order, all events in the out-edge set already have their out-components calculated.
Among all these component types, in the following we are mostly going to focus on the precise identification of out-components, and we will explain how our methodology can be generalised to identify in-components as well. The end goal for our algorithm is to rapidly determine the sizes of these components.
Temporal event graphs provide further ways to define connectivity [15]. Beyond connected events in the components of D, one can look at the set of original network nodes from V_G involved in such components. Since a network node can appear multiple times in an event graph component, an alternative way to measure the influence of an event is to count the total number of network nodes involved in the corresponding event graph out-component. Event graph components also have temporal dimensions, so their connectivity can also be measured in terms of the time span between their first and last events. Comparing this to the total observation time T tells us whether a component has only a local temporal effect or whether it percolates in time and bridges information over a longer course of the observation.

B. Scalable algorithms for in-and out-component size estimation
The out-component of an event e_i ∈ E_G (which is a node in D) can be calculated in several ways. As we have mentioned, one potential solution is to start a breadth-first search from one of the nodes involved in e_i, using the upcoming events in G. Another solution is to compute the direct successor set recursively, using the algorithm explained in Appendix B. However, calculating the sizes of out-components even for a small fraction of all events is not feasible with either of these solutions for large temporal networks, as their complexity scales badly with the number of events |E_G|. Here we propose an alternative solution based on a probabilistic approach, which estimates the size of the largest out-component to arbitrary precision and identifies its root event in any temporal network, even at extremely large sizes.

1. Probabilistic method for estimating out-component sizes
Our main goal is to obtain the out-component size of each node in D. For a more concise presentation, however, we first define an algorithm that computes the exact out-components (i.e., the reachable sets) of each node in D. Our out-component size estimation algorithm is subsequently obtained by replacing the data structure containing these sets with a probabilistic counting data structure [17,18,24].
Our solution is similar to the commonly used algorithm for computing all subtree sizes, where, starting from the leaf nodes, the size of each subtree is given by the sum of its children's subtree sizes plus one. We tailor this idea specifically to DAG structures: the algorithm reuses the already computed out-components of the direct successors to calculate the out-component of each node in a directed acyclic graph. To explain the algorithm, we consider separately the nodes with zero and non-zero out-degree k_out. The out-component of any leaf node i (i.e., a node with k_out = 0) is trivial, as it contains only itself, C_i = {i}. For the other nodes (k_out > 0), the out-component can be built as C_i = {i} ∪ ⋃_{(i,j)∈A_G} C_j, where A_G is the set of edges in the event graph D. We compute the out-components C_i by going through the nodes of D (events of the temporal network) in reverse topological order, for example in reverse temporal order, starting from the event with the largest timestamp and moving backwards. This ensures that we already know the out-components of the direct successors of each event we encounter. The algorithm is illustrated in Figure 2 and described in detail in Algorithm 1.
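The reverse-topological sweep above can be sketched as follows (a simplified illustration with integer node labels; we assume events are numbered 0..n−1 in time order, so every edge (i, j) has i < j):

```python
def out_components(n, edges):
    """Exact out-component sets on a DAG whose node labels 0..n-1 follow a
    topological order (every edge (i, j) has i < j, as in a time-ordered
    event graph). Sweeping in reverse order guarantees that the
    out-components of all direct successors are already computed."""
    succ = [[] for _ in range(n)]
    for i, j in edges:
        succ[i].append(j)
    comp = [None] * n
    for i in reversed(range(n)):
        c = {i}                  # C_i starts from the event itself ...
        for j in succ[i]:
            c |= comp[j]         # ... unioned with the successors' C_j
        comp[i] = c
    return comp
```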
This algorithm goes through each event only once and performs a number of union operations equal to the number of links in the event graph, |A_G|. However, the average out-component size can be directly proportional to the number of events in well-connected networks; that is, the out-components can grow rapidly as the network size grows. This makes the algorithm scale badly both in memory and computational time, due to the cost of union operations on increasingly large sets.
The root of the performance problem is that we store the actual reachability sets when we only need their sizes. The solution is to replace the sets C_i with another data structure Ĉ_i, which has a constant size and a constant-time union operator Ĉ_i ∪ Ĉ_j, and which can return an estimate of the set size |Ĉ_i| (again in constant time). With such a data structure, the scaling of the algorithm becomes O(|E_G| log(|E_G|) + |A_G|), which is much preferable to the breadth-first search approach with O(|E_G||A_G|) complexity. The probabilistic counting methods described next give access to exactly this type of data structure.

ALGORITHM 1. Calculating out-component sizes of events from the static event graph representation described in Appendix A. Data: topo, a reverse topologically sorted list of events. Result: outsize, an associative array mapping each event to its out-component size. Successors(e) (and Predecessors(e)) return the set of direct successors (and predecessors) of event e, as described in Appendix A and Algorithm 2. The associative array out keeps the memory-intensive set representation of the out-component of each event in memory only up to the moment when no references to it remain, at which point it is deleted from out and the cardinality of its out-component set is added to outsize.
The method described above works equally well if we want to measure component sizes in terms of the nodes of the temporal network G. In this case, the reachable sets are populated with the nodes of the events instead of the events themselves. If the sizes are measured as lifetimes, i.e., the time between the first and the last event in the component, the algorithm can be made even simpler: instead of saving the full reachable set, it is enough to save the largest timestamp over all reachable events. That is, the set C_i is replaced with a timestamp T_i, which is initially set to t_i for any event e_i appearing as a leaf node in D, and the union operator is replaced with the maximum operator.
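The maximum-timestamp variant amounts to replacing the set union in the sweep with a running maximum (a simplified sketch; node labels and edge conventions are ours):

```python
def lifetimes(times, edges):
    """Component lifetimes: times[i] is the timestamp of event i (ascending),
    and every edge (i, j) has i < j. The set union of the exact algorithm is
    replaced by a maximum over the successors' latest reachable timestamps."""
    n = len(times)
    succ = [[] for _ in range(n)]
    for i, j in edges:
        succ[i].append(j)
    latest = list(times)  # a leaf's latest reachable timestamp is its own
    for i in reversed(range(n)):
        for j in succ[i]:
            latest[i] = max(latest[i], latest[j])
    return [latest[i] - times[i] for i in range(n)]
```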
Note that although here we discuss the computation of out-component sizes, in-components can be calculated with the same algorithm by reversing the direction of the links in D and the order in which the nodes of D are traversed. In practice, the reversal of the link directions can be obtained by replacing calls to the Successors(e) function with Predecessors(e) and vice versa in Algorithms 1 and 3.
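In the integer-labelled sketch used above, this reversal looks as follows (predecessor lists instead of successor lists, and a forward sweep instead of a reverse one):

```python
def in_components(n, edges):
    """In-component sets on a DAG with node labels 0..n-1 in topological
    order and edges (i, j), i < j: reverse each edge and sweep forward in
    time, so predecessors are always processed before their successors."""
    pred = [[] for _ in range(n)]
    for i, j in edges:
        pred[j].append(i)
    comp = [None] * n
    for j in range(n):
        c = {j}
        for i in pred[j]:
            c |= comp[i]
        comp[j] = c
    return comp
```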

Probabilistic cardinality estimator
For Algorithm 1 to run on large real-world networks, we need to ensure that the time complexity of the union and cardinality operators, as well as the space complexity of the set implementation, do not grow linearly as a function of the cardinality of the set. This is not the case for implementations that keep exact track of the out-component sets, for example using sorted vectors or hash maps. However, to estimate out-component sizes it is not necessary to query the sets for their members, only to insert elements, merge sets, and query the size of each set. We use a data structure implementing a variation of the HyperLogLog probabilistic cardinality estimator [24], which is computationally efficient for all three required operations. HyperLogLog was conceived as a method for estimating the cardinality of massive multisets, usually in the form of streams, using a constant amount of memory.
The basic idea of the algorithm is to use randomisation, in the form of passing the input through a hash function, and to save only the maximum number of leading zeros in the binary representations of the hashed values of the multiset. A cardinality estimate is then made from this count of leading zeros: due to the uniform distribution of the output of hash functions suited for this algorithm, if the maximum number of observed leading zeros is ϱ − 1, then a good estimate of the cardinality is 2^ϱ. Alone, such an estimator is extremely crude, but the algorithm combines many estimators via a process of stochastic averaging: based on the hash value, it splits the input stream into m substreams while keeping track of the maximum number of leading zeros in each substream. Subsequently, it averages the observables using their harmonic mean, which keeps the variability of the estimate in check [24].
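A toy version of the estimator illustrates these steps (illustrative only; the paper's implementation follows HyperLogLog++ [18] with empirical bias correction and other refinements that this sketch omits):

```python
import hashlib

class HyperLogLog:
    """Minimal HyperLogLog sketch: m = 2**p registers, each keeping the
    maximum leading-zero rank observed in its substream."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.reg = [0] * self.m

    def _hash(self, item):
        # 64-bit hash derived from SHA-256 (an arbitrary illustrative choice)
        return int.from_bytes(hashlib.sha256(str(item).encode()).digest()[:8], 'big')

    def add(self, item):
        x = self._hash(item)
        idx = x >> (64 - self.p)                # first p bits pick the register
        rest = x & ((1 << (64 - self.p)) - 1)   # remaining bits give the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        self.reg[idx] = max(self.reg[idx], rank)

    def union(self, other):
        # merging two counters is a register-wise maximum: constant time in
        # the cardinality of the underlying sets
        self.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias-correction constant
        z = sum(2.0 ** -r for r in self.reg)
        return alpha * self.m * self.m / z      # harmonic-mean estimate
```

The constant-time, register-wise union is exactly the property that Algorithm 1 needs in place of the set union.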
We made several choices in our implementation of the algorithm, some of which are described in more detail in the definition of the HyperLogLog++ algorithm [18]. In particular, the following modifications were borrowed from HyperLogLog++: (a) we used a 64-bit hash function, as opposed to the original 32-bit one, to compensate for collisions of hash values in multisets with large cardinalities; (b) empirical bias correction was performed as introduced in [18]; (c) to improve performance characteristics and simplify error analysis, we did not use a separate sparse representation. For a more in-depth study of the different variations of HyperLogLog and the role of and reasoning behind bias estimation, see Refs. [18, 24].
Fig. 3 shows the relative accuracy and bias values for the HyperLogLog cardinality estimator. The difference in the scale of the bias and accuracy values indicates that the bias correction reduced the bias and stopped its growth with increasing cardinality, to the degree that it plays an insignificant role in the total inaccuracy of the estimator.
The relative error of the size estimates can be made arbitrarily small by increasing the number of registers m. The estimate of the cardinality of a multiset S is expected to follow a Gaussian distribution, due to averaging and the central limit theorem, with a mean of |S| and a standard deviation of 1.04|S|/√m (for m > 128) [24]. HyperLogLog needs at most 6 bits (log_2(64)) per register to store the number of leading zeros in the output of the 64-bit hash function, but for ease of use we elected to assign a full 8-bit byte to each register. A HyperLogLog counter has been reported to estimate cardinalities well beyond 1 billion, limited by the rising collision probability as one approaches the limits of a 64-bit hash function [18]. As an example, a counter with m = 2^10 registers has a constant size of one kibibyte and a relative accuracy (the standard deviation of the distribution of estimates as a fraction of the actual cardinality) of 0.0325. While inserting an item into the HyperLogLog estimator requires a constant number of operations with respect to the cardinality and the number of registers, the estimation operation requires a number of operations linear in the number of registers.
For a specific relative error rate, the memory and time scaling of the probabilistic counter is constant and does not depend on the input network size. In practice, the constants involved are relatively large. For this reason, we only keep the cardinality estimator data structures of nodes that still have unprocessed predecessors. This significantly reduces the memory requirements when running the algorithm on real data (see Section III B).
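One way to realise this bookkeeping is reference counting over the remaining predecessors (a sketch; make_counter stands in for any counter exposing add/union/estimate, such as a HyperLogLog; ExactCounter is an exact drop-in used here only for testing the sweep itself):

```python
def out_sizes_mem_lean(n, edges, make_counter):
    """Out-component size sweep that frees each counter as soon as no
    unprocessed predecessor can still reference it. Node labels 0..n-1
    are in topological order and edges (i, j) have i < j."""
    succ = [[] for _ in range(n)]
    preds_left = [0] * n
    for i, j in edges:
        succ[i].append(j)
        preds_left[j] += 1
    live = {}                    # node -> counter, kept only while referenced
    sizes = [0] * n
    for i in reversed(range(n)):
        c = make_counter()
        c.add(i)
        for j in succ[i]:
            c.union(live[j])
            preds_left[j] -= 1
            if preds_left[j] == 0:
                del live[j]      # no remaining predecessor needs this counter
        sizes[i] = c.estimate()
        if preds_left[i] > 0:
            live[i] = c          # still needed by an earlier event
    return sizes

class ExactCounter:
    """Exact set-based counter with the same interface, for testing."""
    def __init__(self):
        self.s = set()
    def add(self, x):
        self.s.add(x)
    def union(self, other):
        self.s |= other.s
    def estimate(self):
        return len(self.s)
```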

3. Finding the event with the largest out-component
The above-described algorithm finds accurate estimates of the out-component sizes of nodes in a DAG. It can be further developed into a probabilistic method that finds the event with the maximum out-component size with an adjustable, arbitrarily high probability. This is possible by complementing the estimates with breadth-first searches: starting from the event with the largest estimated out-component size, we perform consecutive breadth-first searches, finding exact out-component sizes and either identifying an event as having the new largest out-component or ruling it out as such. This is repeated until the probability that any of the remaining estimated (non-exact) out-component sizes is larger than the largest exact out-component size falls below some predefined probability threshold.
Let us assume that the out-component size estimation process provides an estimate ŝ_e of the out-component size of event e. We can calculate the probability distribution of the actual out-component size s_e of that event based on the extended form of Bayes' theorem:

P(s_e | ŝ_e) = P(ŝ_e | s_e) P(s_e) / Σ_s P(ŝ_e | s) P(s),   (2)

where P(s_e | ŝ_e) is the probability of the actual size being s_e when the estimate ŝ_e is observed, P(ŝ_e | s_e) is the probability of estimating the size of a multiset with cardinality s_e as ŝ_e, and P(s_e) is the probability that any multiset has a cardinality of s_e. The term P(ŝ_e | s_e) can be approximated by the probability density function of a Gaussian distribution with a mean of s_e and a standard deviation of 1.04 s_e/√m for m > 128, where m is the number of registers of the probabilistic counter [24]. Assuming a uniform prior for the cardinality of multisets, Eq. 2 simplifies to

P(s_e | ŝ_e) = P(ŝ_e | s_e) / Σ_s P(ŝ_e | s).   (3)

Assume we have estimated the in- or out-component sizes of all events as {ŝ_1, ŝ_2, . . .}. Without loss of generality, we take ŝ_1 to be the largest estimate (i.e., ŝ_1 ≥ ŝ_e for all e ∈ E_G). If the actual in- or out-component size corresponding to event 1, measured using the exact algorithm described in Appendix C, is s_1, the probability that s_1 is the largest in- or out-component size of the whole network can be expressed as

P(s_1 is largest) = Π_{e≠1} (1 − P(x ≥ s_1 | ŝ_e)),   (4)

where, given Eq. 3, P(x ≥ s_e | ŝ_e) can be written as

P(x ≥ s_e | ŝ_e) = Σ_{s≥s_e} P(ŝ_e | s) / Σ_s P(ŝ_e | s).   (5)

Together with a large enough number of registers, this lets us raise the probability of finding the absolute largest in- or out-component to any desired level, by removing estimates one by one and calculating exact in- or out-component sizes with the breadth-first search algorithm. It is also possible to use this technique to find the largest out-component size to a specific number of significant figures.
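Under the Gaussian approximation, the tail probability and the product above can be evaluated directly (a sketch, not the paper's implementation; as a simplifying assumption we evaluate the standard deviation at the estimate ŝ_e rather than integrating over the posterior):

```python
import math

def prob_ge(a, s_hat, m):
    """Approximate P(actual size >= a | estimate s_hat) with a Gaussian of
    mean s_hat and standard deviation 1.04 * s_hat / sqrt(m), where m is
    the number of registers."""
    sd = 1.04 * s_hat / math.sqrt(m)
    return 0.5 * (1.0 - math.erf((a - s_hat) / (sd * math.sqrt(2.0))))

def prob_largest(s1_exact, other_estimates, m):
    """Probability that the exactly measured size s1 exceeds the actual
    sizes behind all remaining (non-exact) estimates."""
    p = 1.0
    for s_hat in other_estimates:
        p *= 1.0 - prob_ge(s1_exact, s_hat, m)
    return p
```

In the verification loop, one would repeatedly measure the largest remaining estimate exactly via breadth-first search until prob_largest exceeds the desired threshold.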

III. APPLICATIONS
To demonstrate the use of our method, we first apply it to simulated temporal networks (Section III A) and subsequently to empirical ones (Section III B). In what follows, we focus on the computation of out-components, but in-components could also be obtained with the same method.

A. Random networks
For the demonstration of our methodology, we use one of the simplest temporal network models, which assumes that both the structure and the link dynamics are completely random and uncorrelated [15]. More specifically, our model network is built on a static structure generated as an Erdős-Rényi random graph with n nodes and average degree k. Each link carries an interaction dynamics modelled as a Poisson process with rate parameter α = 1 over an observation window T; thus, events on links follow each other with exponentially distributed inter-event times. It has been shown earlier [15] that by varying the average degree k and the event adjacency parameter δt, the event graph goes through a percolation phase transition between a disconnected and a connected phase. If δt is small or the underlying network is disconnected (k < 1), only short temporal paths can evolve between small components of connected nodes, and the potential sizes of DAG components are very limited. However, on a connected structure (k > 1), increasing δt makes more and more events δt-adjacent, forming longer paths and potentially larger event graph components. At a critical δt, the event graph goes through a directed-percolation-like phase transition with an emerging giant connected component, which connects the majority of events via valid δt-connected time-respecting paths. This transition has been observed earlier [15] via the measurement of the largest weakly connected component of the temporal event graph, as demonstrated in Fig. 4a. The critical point can be approximated via the simple analytic formula δt_c = 1/(α(2k − 1)) (solid line in Fig. 4a) or via the scaling of different thermodynamic properties of the system [15]. Although the analytic and simulated critical points match each other relatively well (see Fig. 4a), the discrepancies between them are due to (i) the analytic solution being only an approximation that underestimates the critical point and (ii) weakly connected components providing only an upper limit for the actual largest out-component sizes.

Footnote: The actual distributions of the component sizes will be biased towards small components, especially in the regions of δt which are of most interest. A prior with more probability mass on large values would increase our estimate of the number of breadth-first searches needed to achieve the desired accuracy. That is, the uniform prior is likely to be an overly cautious choice of prior for the component sizes.
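The model above can be sketched with a toy generator (parameter names are our own illustrative choices):

```python
import random

def poisson_temporal_network(n, k, T, alpha=1.0, seed=0):
    """Erdos-Renyi static structure with average degree k; each realised
    link carries a Poisson process of instantaneous, undirected events with
    rate alpha on [0, T), i.e., exponential inter-event times."""
    rng = random.Random(seed)
    p = k / (n - 1)          # link probability giving average degree k
    events = []
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                t = rng.expovariate(alpha)   # exponential inter-event times
                while t < T:
                    events.append((u, v, t))
                    t += rng.expovariate(alpha)
    events.sort(key=lambda e: e[2])
    return events
```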
However, comparing the analytic curve to the estimated largest out-component sizes, we find a significantly better match, as shown in Fig. 4b.
Just like in the case of the weakly connected components in [15], the out-component sizes can be measured in three different ways: in terms of the number of events, the number of temporal network nodes, and the time between the first and last event in the component (i.e., the lifetime). As discussed in Section II B 1, the algorithm presented here is easily adaptable to calculating component sizes in terms of temporal network nodes, and an even simpler algorithm can be used for the lifetimes. The results of these calculations for a single average degree value are shown in Fig. 4c. Further, the algorithm produces the out-component sizes for all events in the network, which allows us to study their size distribution. These distributions are shown in Fig. 4d for three δt values around the point at which the largest component size becomes comparable to the system size. Had these distributions been produced by sampling events and performing breadth-first searches, the three distributions with different δt values would have looked almost identical with 10^4 breadth-first search operations, and the differences in the tails would only become visible with around 10^5 to 10^6 breadth-first search operations, which would have been comparable in runtime to performing breadth-first searches from all events.
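The breadth-first search baseline referred to above can be sketched as follows (a minimal illustration for undirected, instantaneous events; the function name and data layout are our assumptions, not the paper's implementation):

```python
from collections import defaultdict, deque

def out_component_size(events, source_idx, dt):
    """Breadth-first search for the out-component of a single event.
    events: a list of (u, v, t) undirected, instantaneous events.
    Two events are dt-adjacent if they share a node and the second
    starts within (0, dt] of the first."""
    by_node = defaultdict(list)            # node -> indices of its events
    for i, (u, v, t) in enumerate(events):
        by_node[u].append(i)
        by_node[v].append(i)
    seen = {source_idx}
    queue = deque([source_idx])
    while queue:
        i = queue.popleft()
        u, v, t = events[i]
        for node in (u, v):
            for j in by_node[node]:        # a sketch: a real implementation
                tj = events[j][2]          # would binary-search the time window
                if t < tj <= t + dt and j not in seen:
                    seen.add(j)
                    queue.append(j)
    return len(seen)
```

Repeating this search from every event is exactly the cost that the proposed method avoids.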

B. Real networks
To benchmark the performance of the algorithm, we measured reachability values for a set of real-world networks: (a) a mobile call network [21] of 325 million events of over 5 million customers over a period of 120 days; (b) 258 million Twitter interactions [25] of over 12 million users over a period of more than 200 days; (c) the air transport network of the United States [26] with 180 112 flights between 279 airports; and (d) the public transportation network of Helsinki [27] with 664 138 trips (defined as a vehicle moving between two consecutive stops) between 6 858 bus, metro, and ferry stops. The mobile call and Twitter interaction datasets were processed as undirected temporal networks. The public transportation and air transportation datasets were processed as directed networks with delays (the duration between departure and arrival) taken into account.
HyperLogLog estimators for the Mobile and Twitter networks use m = 2^10 registers; for the other datasets m = 2^14 registers were used. Largest out-component sizes were measured with a probability of misidentification of at most 0.01.

TABLE I. Running times for real-world networks when calculating the reachability (number of unique reachable events, nodes, and lifetime) from all events in the network. The δt* corresponds to a waiting time around the time at which there is a jump in the largest out-component size (see the text for details; this corresponds to the vertical line in Fig. 5). As δt values around δt* are of interest for a wide range of studies, the runtime for δt = δt* is representative of the running times for a typical study. The values of δt* are 271 seconds for the air and public transport networks, 100 minutes for the Twitter network, and 6.5 hours for the mobile network. Baseline running times are measured by calculating the out-component size on a sample of 500 events based on Algorithm 4 (see Appendix C) and extrapolating to estimate the running time of an exact measurement of the out-component size from every event. The Error column refers to the relative standard error of each reachability estimate based on the number of registers m used in the HyperLogLog estimator. The times are presented in seconds (s), minutes (m), hours (h), and years (y).

Table I provides information on the median runtime (as measured by CPU clock time) of the out-component size estimation portion of the implementation. The running time is shown for a threshold δt = δt* close to a jump in the largest component size, which is likely to be around the interesting region. We also report the largest possible threshold, δt = ∞, which leads to the longest running times.
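The Error column can be reproduced from the standard HyperLogLog approximation, which gives a relative standard error of about 1.04/√m for m registers:

```python
import math

def hll_relative_error(m):
    """Expected relative standard error of a HyperLogLog estimator
    with m registers (the standard 1.04/sqrt(m) approximation)."""
    return 1.04 / math.sqrt(m)

# m = 2^10 registers (Mobile, Twitter) -> about 3.3% relative error
# m = 2^14 registers (other datasets)  -> about 0.8% relative error
```

This illustrates the trade-off behind the register counts quoted above: quadrupling m halves the error but also quadruples the per-event memory footprint.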
For undirected temporal networks (Mobile and Twitter), taking δt to infinity does not result in a substantial increase in the running time, as most of the added event graph links are never considered due to the optimisation removing redundant links (see Appendix B). This, however, is not the case for directed temporal networks, as the optimisation described in Appendix B does not apply to directed events. Assuming a homogeneous distribution of events across time, the runtime for event graphs constructed from directed events grows as O(δt log δt) and reaches a maximum at δt = T, where T is the maximum time difference between any two events in the network. For the case of instantaneous (non-delayed) events, T is equal to the measurement window of the dataset. Table I also gives estimated running times for a breadth-first-search type of algorithm for comparison.
In these examples, the smallest network with less than 200k events takes around the same order of magnitude of time to process with both algorithms. However, even for the second-largest network with around 600k events, there is an order of magnitude of difference in the running times. For the larger networks with hundreds of millions of events, the run time drops from thousands of years with breadth-first search to the order of hours with the new algorithm. This means that large data sets that were previously practically impossible to analyse this way are now accessible even with minimal computational resources. Figure 5 shows a more systematic analysis of the running times for the real data, where we vary the δt parameter. As previously described in Sec. III A, as δt is increased, larger and larger connected structures begin to form in the event graph. The increase in size is also visible from the largest out-component size curves for the same dataset in Fig. 5. This transition period usually marks the most interesting area for further studies. Running times of breadth-first-search type algorithms are in practice dependent on the component sizes and can thus increase dramatically during and after the transition period. The running time plots (Fig. 5) show that, as expected, our algorithm is not sensitive to these transitions: while for the directed networks (air and public transportation) the runtime grows almost linearly as a function of δt, it grows sub-linearly for the undirected networks because of the wider applicability of the optimisation described in Appendix B.

IV. DISCUSSION
We have presented a method for computing component sizes starting from multiple sources (or reaching multiple destinations) in temporal networks which scales well with increasing data size. Using simulated networks and real network data, we show that the method is efficient enough to accurately estimate the δt-reachability for each of the events in networks with hundreds of millions of events. As a further demonstration of the capabilities of the algorithm, we repeated several results from a previous study [15] using accurate estimates of the component sizes instead of using weakly connected components as upper bounds.
Previously, temporal network studies have focused on sampling starting points for reachability or simulated spreading processes and exactly calculating statistics based on that sample [7,8]. The sampling approach usually works well for calculating mean values or estimating the parts of distributions where most of the values lie. However, it is not suitable for analysing the tails of distributions or extreme values in the networks. Perhaps more importantly, sampling is ill-suited for microscopic analysis of the properties of individual nodes or events, which requires calculating the reachability from each of them separately. The algorithm presented here is suitable for this type of study and opens up possibilities for many new kinds of analysis of large data sets.
When presenting the algorithm, we aimed to work at a general level, taking into account various use cases. The definition of a temporal network we used is rather inclusive, although other kinds of hypergraph-type structures could have been considered. Despite these efforts, there are use cases that we still did not consider. Consider, for example, a situation where the edges are available for the paths with some uncertainty, such that events e1 and e2 are adjacent with probability P(e1, e2, δt). In this case the algorithm could easily be used by simulating many deterministic instances of the event graph drawn from this random process and measuring the expected in- and out-component sizes. This is important for processes with a stochastic component, such as infection spreading models. Further, we did not discuss multiple sources or targets for the paths. However, in the various scenarios we have considered, such as the above-mentioned multiple sources and targets, the algorithms proposed here would have required only minor adjustments.
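The probabilistic-adjacency idea can be sketched with a simple Monte Carlo loop (our illustration, using a constant adjacency probability p as a special case of P(e1, e2, δt); the function name is ours):

```python
import random
from collections import deque

def expected_out_component_size(events, source_idx, dt, p, n_samples=200, seed=0):
    """Monte Carlo estimate of the expected out-component size when each
    dt-adjacency is realised only with probability p. Each sample draws
    one deterministic instance of the event graph and measures the source
    event's out-component in it. events: list of (u, v, t) triples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_samples):
        seen = {source_idx}
        queue = deque([source_idx])
        while queue:
            i = queue.popleft()
            u, v, t = events[i]
            for j, (x, y, tj) in enumerate(events):   # sketch: exhaustive scan
                shares_node = bool({u, v} & {x, y})
                if shares_node and t < tj <= t + dt and j not in seen \
                        and rng.random() < p:          # realise this adjacency?
                    seen.add(j)
                    queue.append(j)
        total += len(seen)
    return total / n_samples
```

With p = 1 this reduces to the deterministic out-component size, which provides a convenient sanity check.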
Here we have mainly focused on the algorithmic improvements, and used the new method to demonstrate its ability to handle multiple types of networks with sizes varying all the way to hundreds of millions of events. We have barely scratched the surface of the types of analysis our new method enables. For example, microscopic network statistics such as centrality measures for nodes could now be defined based on the δt-reachability counts. Further, theoretical studies of directed temporal percolation in networks are now within our grasp, as we can efficiently compute the relevant statistics. Our theoretical and algorithmic contributions allow us to effectively study directed percolation phenomena in temporal networks, contrary to earlier works, which are either based on ordered lattices [28] or on assumptions otherwise unsuitable for temporal networks. We expect our work on the computational methods to open doors for many future branches of research in data analysis and theory for temporal networks.

BinarySearchStart(e, vec) finds the index of the first event e′ = (u′, v′, t′, τ′) in vec with start time t′ larger than or equal to that of the input event e. BinarySearchEnd(e, vec) similarly finds the first event in vec whose ending time t′ + τ′ is not less than that of e. Both functions rely on vec being already sorted in ascending order of t or t + τ, respectively. DTAdjacent(e1, e2, δt) checks whether e2 is δt-adjacent to e1.
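The two binary search helpers can be sketched with Python's bisect module (assuming events are stored as (u, v, t, τ) tuples and the sorted key lists are pre-extracted; names follow the pseudocode above):

```python
from bisect import bisect_left

def binary_search_start(e, vec_start_times):
    """Index of the first event whose start time t' >= t of e.
    vec_start_times: the events' start times, sorted ascending."""
    return bisect_left(vec_start_times, e[2])

def binary_search_end(e, vec_end_times):
    """Index of the first event whose end time t' + tau' >= t + tau of e.
    vec_end_times: the events' end times t + tau, sorted ascending."""
    return bisect_left(vec_end_times, e[2] + e[3])
```

Both calls run in O(log |vec|), which is what makes scanning only the events inside a δt window cheap.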
once. For the case of non-delayed events, regardless of directedness, it is possible to dispense with the priority queue and provide a much simpler implementation with a computational complexity of O(|E|), where |E| denotes the number of events. For the case of delayed events, however, the computational complexity will depend on the number of simultaneously in-transit events and the selected implementation of the priority queue.

Calculating the exact out-component of an event from the static event graph representation described in Appendix A.
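A sketch of why the priority queue is unnecessary for non-delayed events: instantaneous events expire from the δt window in the same order they arrive, so a plain FIFO deque suffices for a single O(|E|) sweep over the time-sorted event list (our illustration under that assumption, not the paper's implementation):

```python
from collections import deque

def sweep_recent_events(events, dt):
    """Single pass over time-sorted (u, v, t) events maintaining the set
    of events still within dt of the current time. Because events expire
    in arrival order, a FIFO deque replaces the priority queue, and each
    event is pushed and popped at most once: O(|E|) total.
    Yields, for each event, the number of earlier events still within dt."""
    window = deque()
    for u, v, t in events:
        while window and window[0][2] < t - dt:   # expired, in arrival order
            window.popleft()
        yield len(window)
        window.append((u, v, t))
```

With delayed events the expiry order depends on t + τ rather than t, which is why a priority queue (keyed by end time) becomes necessary in that case.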