Sampling networks by nodal attributes

In a social network the individuals or nodes connect to other nodes by choosing one of the channels of communication at a time to re-establish the existing social links. Since information for research is usually restricted to a limited number of channels or layers, these autonomous decision-making processes by the nodes constitute the sampling of a multiplex network, leading to just one (though very important) example of sampling bias caused by the behavior of the nodes. We develop a general setting to gain insight into and understand the class of network sampling models where the probability of sampling a link in the original network depends on the attributes $h$ of its adjacent nodes. Assuming that the nodal attributes are independently drawn from an arbitrary distribution $\rho(h)$ and that the sampling probability $r(h_i , h_j)$ for a link $ij$ of nodal attributes $h_i$ and $h_j$ is also arbitrary, we derive exact analytic expressions for such characteristics of the sampled network as the degree distribution, degree correlation, and clustering spectrum. The properties of the sampled network turn out to be sums of quantities for the original network topology weighted by factors stemming from the sampling. Based on our analysis, we find that the sampled network may have sampling-induced network properties that are absent in the original network, which implies the potential risk of naively generalizing results from the sample to the entire original network. We also consider the case when neighboring nodes have correlated attributes, showing how to generalize our formalism for such sampling bias, and we find good agreement between the analytic results and numerical simulations.


I. INTRODUCTION
Mapping out the underlying network is an essential part of studying complex systems. Accordingly, many different networks have been constructed from empirical data sets, but this procedure is subject to noise and various biases, thus being of limited applicability. This is especially the case when the network is huge, like the Internet, the WWW, or the human society, where unavoidably only a sample of the whole network can be analyzed, which is inherently likely to cause some biases. Also the identification of network links can be a cause of bias unless all the links are equally measurable. For example, in communication-based social networks difficulties may arise when one wants to detect social links between people using different means of communication. The consequences of these kinds of sampling could hinder the generalization of the properties observed on the sample to the entire system. For instance, the sampling bias may make the degree distribution look like a power law even when the original degree distribution is Poissonian [1,2]. Furthermore, the peaked degree distribution of social interactions is transformed to a monotonic one if only one communication channel is sampled [3]. Also other network quantities, such as degree correlations, centrality measures, and clustering properties, could undergo nontrivial bias effects depending on how networks are sampled [4][5][6][7][8][9][10][11][12][13][14][15][16]. Thus, understanding the effect of sampling biases is crucial for interpreting empirical data better and for studying the original systems.
There have been a number of theoretical and numerical studies on network sampling since its significance was recognized. The sampling methods studied so far are classified as random node sampling [4][5][6][7][8], random link sampling [7,8,16], and path-based sampling [1,2,[7][8][9][10][11][12][13][14]. Path-based sampling is a class of methods that sample nodes and links while traversing the network from certain nodes, which includes breadth-first search, depth-first search, snowball sampling, random walk sampling, and trace-route sampling. For these sampling methods, the effects of the sampling biases are well understood, and algorithms to improve inference of the original network properties have also been suggested [8,[13][14][15][16]].
In this paper, we study another class of network sampling, where links are sampled with a probability depending on the attributes of the nodes. We assume that node i has an attribute h_i and that the link between nodes i and j is sampled with a probability depending on h_i and h_j. This is the case when a multiplex is sampled by a limited number of layers: usually, the layer in which a link can be found depends on the node attributes. For example, in the network of social relationships a family tie between two individuals or nodes exists if both have the attribute of belonging to the same family. This class includes the model of Ref. [3], which was proposed to explain the commonly observed monotonic degree distribution in data sets of social networks sampled by a single communication channel. While people communicate using various communication channels, online and offline, the sources of the data sets are often limited to a single communication channel due to technical and privacy reasons. Thus, extracting a single communication channel is regarded as a sampling process, through which non-trivial biases are inevitably induced. Because each person has a different tendency to select the communication channel [17], and this is adjusted to the preferences of the communication partner, the sampling process is plausibly modeled by introducing attributes for each person, such that the sampling probability depends on the two communicating persons' attributes, rather than by random or path-based sampling methods.
From a mathematical point of view, the sampling model we are going to study here is a generalization of a class of network generation models with hidden variables [18,19]. In this class of models, starting from an empty network, hidden variables are assigned to the nodes, and links are generated according to a function of the hidden variables h_i and h_j. In this paper, on the other hand, the network is obtained by sampling from an original network having certain properties. Hence the sampling model studied here is equivalent to the model studied in [18,19] when the original network is a complete graph.
This paper is organized as follows: In Section II, we present rigorous analytic forms of the degree distribution P(k), degree correlation k_nn(k), and clustering spectrum c(k) for the sampled network in the case where h is independently drawn from a certain distribution ρ(h). In Section III, we apply the results to some concrete examples. In particular, we investigate the model proposed in Ref. [3] and see how the original network affects the sampled network. Then, in Section IV, we numerically study the case where the hidden variables of neighboring nodes in the original network are correlated with each other. The last section is devoted to a summary and discussion.

II. MODEL AND ANALYSIS
We define the model of sampling as follows (see Fig. 1). First, a hidden variable h is assigned to each node in the original network, where each of the hidden variables is randomly and independently drawn from the distribution ρ(h). Then, a link between nodes i and j is sampled with the probability r(h_i, h_j), which we assume to be a symmetric function of h_i and h_j. Although in this paper we mainly consider h as a scalar variable, it is straightforward to extend the model such that a node has a vector attribute (a set of attributes) h, similarly to Axelrod's model [20] or the model in [21], which we will discuss in Section II D.

FIG. 1. A schematic diagram showing the sampling method studied in this paper. For each node i, a hidden variable h_i is drawn randomly from ρ(h). A link ij in the original network is sampled with a probability r(h_i, h_j), where h_i and h_j are the hidden variables of the nodes i and j.
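As an illustration of this sampling procedure, the following is a minimal Python sketch (our own; the function name and the representation of the network as an edge list are not from the paper, and ρ(h) and r(h_i, h_j) are left as pluggable callables):

```python
import random

def sample_network(edges, n_nodes, rho_sampler, r):
    """Sample the links of a network according to nodal attributes.

    edges: list of (i, j) links of the original network
    rho_sampler: draws one hidden variable h from rho(h)
    r: symmetric function r(h_i, h_j) giving the link-sampling probability
    """
    # Assign an independent hidden variable to every node.
    h = [rho_sampler() for _ in range(n_nodes)]
    # Keep each link ij independently with probability r(h_i, h_j).
    return [(i, j) for (i, j) in edges if random.random() < r(h[i], h[j])]
```

For instance, `sample_network(edges, N, random.random, lambda a, b: min(a, b))` would correspond to sampling with the minimum of the two hidden variables, i.e., the β → −∞ limit discussed in Section III.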
Hereafter, we denote the degree in the original network by κ, the degree distribution of the original network by P_o(κ), the conditional probability that a neighbor of a degree-κ node has degree κ' by p_o(κ'|κ), and the average local clustering coefficient of degree-κ nodes by c_o(κ).

A. Degree Distribution
Let us consider the links around a node which has hidden variable h. Since the hidden variables of the neighbors are independently given, the probability that a link around the node with h is sampled is

r(h) = ∫ dh' ρ(h') r(h, h').    (1)

The probability distribution of the degree sampled around a node is a binomial distribution given as a function of the hidden variable h and the original degree κ of the focal node:

g(k|h, κ) = (κ choose k) r(h)^k [1 − r(h)]^(κ−k).

The degree distribution of the sampled network P(k) is therefore written as

P(k) = ⟨⟨ g(k|h, κ) ⟩⟩,

where we defined the weighted sum over h and κ as

⟨⟨ f(h, κ) ⟩⟩ ≡ ∫ dh ρ(h) Σ_κ P_o(κ) f(h, κ).    (4)

Thus, the degree distribution of the sampled network depends only on ρ(h), r(h, h'), and P_o(κ). It is independent of higher-order correlations in the original network, such as the degree correlation or the clustering coefficient. Since g(k|h, κ) is of binomial form, the average degree in the sampled network is simply written as

⟨k⟩ = r̄ ⟨κ⟩,

where ⟨κ⟩ = Σ_κ κ P_o(κ) is the average degree in the original network, and

r̄ = ∫ dh ρ(h) r(h)

is the average sampling probability. Similarly, one can calculate ⟨k²⟩ as

⟨k²⟩ = ⟨⟨ κ r(h) + κ(κ − 1) r(h)² ⟩⟩ = r̄ ⟨κ⟩ + (⟨κ²⟩ − ⟨κ⟩) ∫ dh ρ(h) r(h)².

Thus, the second moment of the sampled degree distribution is written as a function of the first and second moments of P_o(κ) and the weighted average of r(h)². In general, the nth moment of P(k) can be obtained from the nth derivative of the characteristic function,

φ(t) = ⟨e^{itk}⟩ = ⟨⟨ [1 − r(h) + r(h) e^{it}]^κ ⟩⟩.

Thus, the nth moment of the sampled degree depends on up to the nth moment of P_o(κ) and the weighted average of r(h)^n.
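The derivation above can be checked numerically. The following sketch (our own illustration, not from the paper, assuming discrete distributions for h and κ) computes r(h), the binomial g(k|h, κ), and the sampled degree distribution P(k):

```python
from math import comb

def sampled_degree_distribution(rho, r, P_o, k_max):
    """Analytic P(k) of the sampled network for discrete hidden variables.

    rho: dict {h: probability}, the attribute distribution rho(h)
    r:   symmetric function r(h, h'), the link-sampling probability
    P_o: dict {kappa: probability}, the original degree distribution
    """
    # r(h): probability that a given link of an h-node is sampled,
    # averaged over the neighbor's attribute (Eq. (1) with a sum).
    r_h = {h: sum(p2 * r(h, h2) for h2, p2 in rho.items()) for h in rho}
    P = [0.0] * (k_max + 1)
    for h, ph in rho.items():
        for kappa, pk in P_o.items():
            for k in range(0, min(kappa, k_max) + 1):
                # Binomial g(k | h, kappa)
                g = comb(kappa, k) * r_h[h]**k * (1 - r_h[h])**(kappa - k)
                P[k] += ph * pk * g
    return P
```

With a single attribute value and r(h) = 1/2, for example, a node of original degree κ = 2 ends up with the binomial degree distribution (1/4, 1/2, 1/4), as the formula predicts.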

B. Degree Correlation
The degree correlation between neighboring nodes in the sampled network can be characterized by the conditional distribution

p(k'|k) = Σ_κ ∫ dh g*(h, κ|k) Σ_{κ'} ∫ dh' p(h', κ'|h, κ) g(k' − 1|h', κ' − 1),

where g*(h, κ|k) is the inverse of g(k|h, κ). Since one connection has already been used up for the conditional edge with h, g(k' − 1|h', κ' − 1) gives the probability that a node with (h', κ') ends up with degree k'. Using Bayes' formula, we obtain

g*(h, κ|k) = ρ(h) P_o(κ) g(k|h, κ) / P(k).

The conditional probability p(h', κ'|h, κ) is the probability that a neighbor of a node with (h, κ) in the sampled network has the hidden variable h' and the original degree κ'. Since h is assigned independently to the nodes, it is written as the product of two factors:

p(h', κ'|h, κ) = p(h'|h) p_o(κ'|κ),

where the conditional probability p(h'|h) is written as

p(h'|h) = ρ(h') r(h, h') / r(h).    (12)

Note that there is a correlation of h between neighboring nodes after the sampling even though there is no such correlation in the original network: the sampling induces correlations of neighboring h values. The degree correlation is then given by the expression for p(k'|k) above. Using this, the average degree of the neighbors of a degree-k node is

k_nn(k) = 1 + Σ_κ ∫ dh g*(h, κ|k) r_nn(h) (κ_nn(κ) − 1),

where r_nn(h) is the average sampling probability of the links around a neighbor of a node having h, and κ_nn(κ) is the average neighbor degree of a degree-κ node in the original network. These are defined as

r_nn(h) = ∫ dh' p(h'|h) r(h')

and

κ_nn(κ) = Σ_{κ'} κ' p_o(κ'|κ).

Thus, k_nn(k) is written as a weighted sum of r_nn(h) (κ_nn(κ) − 1), which is the product of the correlations of the hidden variables and of the original degrees. It depends on P_o(κ) and κ_nn(κ) but is independent of other higher-order correlations.
In general, the hidden variables of neighboring nodes in the sampled network show a correlation even if h is originally independent between neighbors. This implies that the correlation between the h values in the sampled network is induced by the sampling. Here we note that the correlation of hidden variables in the sampled network, p(h'|h), is totally independent of the original network, because it depends only on the functional forms of ρ(h) and r(h, h'). The hidden variable averaged over the neighbors of a node having h is written as

h̄_nn(h) = ∫ dh' h' p(h'|h),

using Eq. (12). The sampling-induced correlations in h disappear when r(h_i, h_j) is factorized such that r(h_i, h_j) = r'(h_i) r'(h_j), irrespective of the functional form of ρ(h). This is because the conditional probability p(h'|h) is then independent of h:

p(h'|h) = ρ(h') r'(h) r'(h') / [r'(h) ∫ dh'' ρ(h'') r'(h'')] = ρ(h') r'(h') / ∫ dh'' ρ(h'') r'(h'').
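The sampling-induced correlation of h, and its disappearance for a factorized r, can be verified with a short computation. This sketch (our own illustration, for discrete hidden variables) evaluates p(h'|h) = ρ(h') r(h, h') / r(h) directly:

```python
def neighbor_attribute_distribution(rho, r, h):
    """p(h'|h) = rho(h') r(h, h') / r(h) for discrete hidden variables.

    rho: dict {h: probability}; r: symmetric sampling probability.
    """
    # r(h): average sampling probability of a link of an h-node.
    r_h = sum(p2 * r(h, h2) for h2, p2 in rho.items())
    # Normalized distribution of the neighbor's attribute h'.
    return {h2: p2 * r(h, h2) / r_h for h2, p2 in rho.items()}
```

For a factorized r(h, h') = r'(h) r'(h'), the returned distribution is the same for every h, i.e., no correlation is induced; for a non-factorized choice such as r(h, h') ∝ min(h, h'), it depends on h.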

C. Clustering coefficient
Consider a node with hidden variable h and original degree κ. In the original network, the fraction c_o(κ) of the pairs of its neighbors have links between them, where c_o(κ) is the local clustering coefficient in the original network. Therefore, the local clustering coefficient of this node in the sampled network, c_{h,κ}, is

c_{h,κ} = c_o(κ) c_h(h),

where

c_h(h) = ∫ dh' ∫ dh'' p(h'|h) p(h''|h) r(h', h'').

The average local clustering coefficient of a node with sampled degree k, denoted by c(k), is given by the average of c_{h,κ} weighted by the probability that the node has the hidden variable h and the original degree κ:

c(k) = Σ_κ ∫ dh g*(h, κ|k) c_o(κ) c_h(h).

Therefore, c(k) is given by the weighted sum of the product of c_o(κ) and c_h(h). It is notable that it does not depend on the degree correlation between neighbors in the original network; it depends only on P_o(κ) and c_o(κ).
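For discrete hidden variables, the factor c_h(h) can be evaluated directly. The following sketch (our own illustration) computes the probability that a link between two sampled neighbors of an h-node is itself sampled:

```python
def clustering_factor(rho, r, h):
    """c_h(h): probability that the link between two (sampled) neighbors
    of a node with hidden variable h is itself sampled, for discrete h.
    The local clustering of an (h, kappa) node in the sampled network is
    then c_o(kappa) * c_h(h).
    """
    # r(h) and the induced neighbor-attribute distribution p(h'|h).
    r_h = sum(p * r(h, h2) for h2, p in rho.items())
    p_cond = {h2: p * r(h, h2) / r_h for h2, p in rho.items()}
    # Average r(h', h'') over two independent sampled neighbors.
    return sum(p1 * p2 * r(h1, h2)
               for h1, p1 in p_cond.items()
               for h2, p2 in p_cond.items())
```

As a sanity check, a constant sampling probability r ≡ p gives c_h(h) = p for every h, so the clustering spectrum is simply scaled down by p, as expected for uniform random link sampling.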
The average clustering coefficient in the sampled network is then given as

c̄ = Σ_k P(k) c(k) = ⟨⟨ c_o(κ) c_h(h) ⟩⟩.

The equations for the sampled network properties are summarized in Table I.

Network property | Analytic form | Dependency on the original network
P(k) | ⟨⟨ g(k|h, κ) ⟩⟩ | P_o(κ)
k_nn(k) | 1 + Σ_κ ∫ dh g*(h, κ|k) r_nn(h) (κ_nn(κ) − 1) | P_o(κ), κ_nn(κ)
c(k) | Σ_κ ∫ dh g*(h, κ|k) c_o(κ) c_h(h) | P_o(κ), c_o(κ)
h̄_nn(h) | ∫ dh' h' p(h'|h) | None

TABLE I. A summary of the analytic equations for the sampled network properties. The right column shows the dependency on the original network properties; for instance, the sampled degree distribution depends only on the original degree distribution P_o(κ). When neighboring h are correlated, use Eq. (31) and Eq. (32) instead of Eq. (1) and Eq. (12), respectively.

D. Vector attributes
Although we have assumed that the hidden variable h of a node is a scalar, it is straightforward to extend our analysis to general attributes. In general, a node may have a vector attribute h ∈ R^d of dimension d, whose elements may be continuous or discrete. The probability density function ρ(h) : R^d → R_+ and the sampling probability r(h, h') : R^d × R^d → R_+ are defined on the extended spaces.
The analytic equations shown in the previous subsections remain valid after replacing the integrals over the scalar h by integrals over the vector h. For instance, Eq. (1) becomes

r(h) = ∫ dh' ρ(h') r(h, h').

Similarly, Eq. (4) is redefined as

⟨⟨ f(h, κ) ⟩⟩ ≡ ∫ dh ρ(h) Σ_κ P_o(κ) f(h, κ).

In the case where an element of h takes discrete values, the corresponding integral is replaced by a summation. After these modifications, the other equations in the previous subsections are still valid.

III. EXAMPLES
A. Sampling with a generalized mean function of hidden variables

In this section, we study some concrete examples of ρ(h) and r(h_i, h_j). The first model we study is the sampling method proposed in Ref. [3], which was introduced to explain monotonically decreasing degree distributions, as they are commonly observed in various data sets of social networks. In today's empirical studies most of the data sets are taken from a certain online service or communication channel, while in reality people combine the use of various communication channels in their daily life. Without comprehensive access to data from these various communication channels, the use of only one data set is regarded as a sampling process. If we regard the connections in different communication channels as layers of a multiplex, the sampling bias corresponds to sampling a multiplex by a single layer. The general observation from such data, namely the monotonically decreasing degree distribution, indicating that the most frequent degree is one, is considered to be an outcome of the bias of sampling a single communication channel. From everyday experience we can state that it is quite uncommon to find a person who has only one friend or social tie, indicating that the original social network should have a degree distribution with a peak at a degree larger than one. Canonical sampling methods, such as random node/link sampling or snowball sampling, are not suitable for explaining this discrepancy since they predict a peaked degree distribution after the sampling. In contrast, the model presented in Ref. [3] can be considered simple and plausible in explaining the monotonically decreasing degree distribution. It attributes the monotonically decreasing degree distribution to the mixture of rare and frequent users of the communication service. In the model, the tendency of a person to use the communication channel is represented by a nodal attribute h, and the link sampling probability is related to the channel selection.
The model is defined such that each node has a scalar value h_i, which is independently drawn from the distribution ρ(h). The value of h denotes how much a person favors an online service or a communication channel. The distribution of the hidden variables ρ(h) is a Weibull distribution truncated at h = 1:

ρ(h) ∝ (α/h_0) (h/h_0)^(α−1) e^(−(h/h_0)^α),  0 ≤ h ≤ 1.

The sampling probability is defined as the generalized mean of the two hidden variables:

r(h, h') = [(h^β + h'^β)/2]^(1/β),

where β is an exponent characterizing the generalized mean, which takes the form of the arithmetic, geometric, or harmonic mean for β = 1, 0, or −1, respectively. In the limits β → ∞ and β → −∞, the sampling probabilities are equivalent to max{h, h'} and min{h, h'}, respectively. As a demonstration, we compare our analytic expressions of Section II with Monte Carlo simulations. As the original network, we use an Erdős–Rényi random graph of size N = 5000 and link density p = 0.03. The parameters for the sampling are h_0 = 0.3, α = 0.8, and β = −∞. Figure 2 shows that our analytic results are in excellent agreement with the results from the Monte Carlo simulations, confirming the validity of our analysis in the previous section.
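The two ingredients of this model can be sketched in Python as follows (our own illustration, not the paper's code; the truncated Weibull is drawn here by rejection sampling, which is adequate when the truncation point h = 1 is not far in the tail of the distribution):

```python
import random

def truncated_weibull(h0, alpha):
    """Draw h from a Weibull(scale=h0, shape=alpha) distribution
    truncated at h = 1, by rejection sampling."""
    while True:
        h = random.weibullvariate(h0, alpha)  # (scale, shape)
        if h <= 1.0:
            return h

def generalized_mean(h1, h2, beta):
    """Generalized mean of two hidden variables: arithmetic (beta = 1),
    geometric (beta = 0), harmonic (beta = -1); min/max in the limits
    beta -> -inf / +inf."""
    if beta == float("-inf"):
        return min(h1, h2)
    if beta == float("inf"):
        return max(h1, h2)
    if beta == 0:
        return (h1 * h2) ** 0.5
    return ((h1**beta + h2**beta) / 2) ** (1 / beta)
```

These plug directly into a link-sampling routine as ρ(h) and r(h, h'); e.g., `generalized_mean(h1, h2, float("-inf"))` realizes the β = −∞ case used in Fig. 2.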
As shown in Section V of Ref. [3], β ≤ 0 is a necessary condition to obtain a monotonically decreasing degree distribution. This is reasonable because the sampling probability around a node with h, r(h), does not go to 0 for a positive β even when h → 0. In other words, even with an infinitesimal h, a node has a finite positive probability of its links being sampled, which makes low-degree nodes rare and yields a peaked degree distribution.
Although a negative β reproduces a monotonically decreasing degree distribution, it also causes side effects in other network statistics. The degree correlation k_nn(k) shows an increasing behavior while the original network does not have any degree correlation between neighbors, strongly implying that the degree assortativity is induced by the sampling. The sign of β plays a pivotal role for the sampling-induced assortativity. In Ref. [3], it was shown that the neighboring h values in the sampled network are positively (or negatively) correlated for a negative (or positive) β, regardless of the original network topology or the functional form of ρ(h). The correlation of h induced by the sampling also causes the correlation of k in the sampled network. As a matter of fact, an increasing k_nn(k) is commonly observed in various empirical data sets of social networks. A plausible explanation for degree assortativity could be the homophily mechanism; however, the model implies that it may instead be an outcome of the bias caused by sampling a single communication channel. We cannot naively conclude that the original network is assortative even if the data set shows a positive degree correlation between neighbors. The fact that high-degree nodes are likely to be connected due to homophily is not sufficient for degree assortativity, as it is easy to construct disassortative networks with interconnected hubs [22].
The local clustering coefficient as a function of the degree, c(k), is also biased. As seen in Fig. 2(c), it shows an increasing behavior while it originally had a flat profile. This increasing behavior is also explained by the sampling-induced assortativity. The low-k nodes mostly consist of nodes having low h. Because of the positive correlation in h, the neighboring nodes around a low-degree node also tend to have a low h; therefore, the probability of a link between the neighbors is low. On the other hand, for a high-degree node, the hidden variables of its neighbors tend to be high as well, yielding a higher local clustering coefficient. As a result, c(k) shows an increasing trend. In empirical networks, however, a decreasing c(k) is commonly found in many data sets [23][24][25][26]. This is one of the unrealistic aspects of the model, which needs to be resolved in the future.
So far, we have used an Erdős–Rényi random graph as the original network. One can ask how the properties of the original network would affect these results. As we have shown in the previous section, the degree correlation in the sampled network k_nn(k) consists of contributions both from the original degree correlation κ_nn(κ) and from the correlation of h, r_nn(h). To see how it depends on κ_nn(κ), we conducted a thought experiment by manually assigning an increasing or a decreasing function to κ_nn(κ). For simplicity, we adopt linearly increasing or decreasing functions, κ_nn(κ) = ±0.2(κ − ⟨κ⟩) + ⟨κ_nn⟩, where ⟨κ⟩ = Σ_κ κ P_o(κ) and ⟨κ_nn⟩ = Σ_κ κ_nn(κ) P_o(κ). Using the equations in Table I, we can calculate how k_nn(k) would look if such assortative and disassortative networks existed.
The result of the thought experiment is shown in Fig. 3(a). The distribution ρ(h) and the sampling probability r(h, h') are kept the same as before. The figure shows that the curves almost collapse onto a single curve for k < k*, and a significant difference appears once k exceeds k*, where k* ≈ 30. This indicates that k_nn(k) for k < k* mostly reflects the properties of the sampling rather than of the original network. Because the majority of the nodes in the sampled network have a degree smaller than k* [see Fig. 2(a)], it is practically difficult to obtain information about the original network from the sampled network.
A similar experiment is conducted for c(k). We calculate c(k) by assuming three possible cases: c_o(κ) ∝ κ^0, κ^(−1), and κ^(−2). For a fair comparison, we keep the average clustering coefficient in the original network, ⟨c_o⟩ = Σ_κ P_o(κ) c_o(κ), constant. The results are shown in Fig. 3(b). Similarly to k_nn(k), the clustering spectrum c(k) shows the dependency on the original network only for k > k*. The low-degree behavior is determined by the dependency on h, hence it does not contain much information about the original network properties.
These results of the thought experiment are explained by calculating the conditional probability distribution of the original degree κ given a sampled degree k:

P*(κ|k) = ∫ dh g*(h, κ|k) = P_o(κ) ∫ dh ρ(h) g(k|h, κ) / P(k).

If P*(κ|k) is identical to P_o(κ), the network property around a degree-k node is determined only by the h dependency, irrespective of κ. In other words, the difference between P*(κ|k) and P_o(κ) serves as an indicator of the relevance of the original network property. We calculated the expected original degree conditioned on a sampled degree, defined as κ̄(k) = Σ_κ κ P*(κ|k). As shown in Fig. 3(c), κ̄(k) remains constant at ⟨κ⟩ for k < k* while it shows deviations only when k > k*, supporting the observations so far. We also calculated the Kullback-Leibler (KL) divergence between these two distributions for a given k, defined by

D(k) = Σ_κ P*(κ|k) ln [P*(κ|k) / P_o(κ)].

The result (not shown) remains near zero for k < k*, which is again consistent with the behaviors in Figs. 3(a) and 3(b). Since the dependence on κ is not significant, the sampled network topology is mostly attributed to the dependence on h. The conditional probability distribution of h given the sampled degree k, which is the counterpart of P*(κ|k), is written as

P*(h|k) = Σ_κ g*(h, κ|k) = ρ(h) Σ_κ P_o(κ) g(k|h, κ) / P(k).

The expected value h̄(k) = ∫ dh h P*(h|k) is shown in Fig. 3(d). In contrast to Fig. 3(c), h̄(k) is significantly different from the original average ⟨h⟩ over the whole range of k. Thus, the sampled network reflects the h of each node, while the contribution of the original network topology is marginal.
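The indicators P*(κ|k) and D(k) can be computed directly for discrete distributions. The following sketch (our own illustration, not from the paper) implements them:

```python
from math import comb, log

def original_degree_given_sampled(rho, r, P_o, k):
    """P*(kappa|k): distribution of the original degree kappa given a
    sampled degree k, for discrete hidden variables."""
    r_h = {h: sum(p2 * r(h, h2) for h2, p2 in rho.items()) for h in rho}
    joint = {}
    for kappa, pk in P_o.items():
        s = 0.0
        if k <= kappa:
            for h, ph in rho.items():
                # Binomial g(k | h, kappa), averaged over rho(h).
                s += ph * comb(kappa, k) * r_h[h]**k * (1 - r_h[h])**(kappa - k)
        joint[kappa] = pk * s
    norm = sum(joint.values())  # equals P(k)
    return {kappa: v / norm for kappa, v in joint.items()}

def kl_divergence(P_star, P_o):
    """D(k) = sum_kappa P*(kappa|k) ln[P*(kappa|k) / P_o(kappa)]."""
    return sum(p * log(p / P_o[kappa]) for kappa, p in P_star.items() if p > 0)
```

Consistent with the delta-function example discussed below, if P_o(κ) is concentrated on a single degree, then P*(κ|k) = P_o(κ) and D(k) = 0 for every k, so the sampled topology carries no information about the original network beyond that degree.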
In general, the properties of the original network are reflected not for all k but only for a limited range of k, as this example illustrates. It can be hard to obtain information about the original network since the hidden variables can be a decisive factor for the sampled network topology. The conditional probability distribution P*(κ|k) or the KL divergence D(k) serves as an indicator of the dependency on the original network. It is noteworthy that P*(κ|k) and D(k) depend only on ρ(h), r(h, h'), and P_o(κ), hence they are independent of any higher-order correlations in the original network.
The dependency of k_nn(k) and c(k) on the original network may change when the original degree distribution changes. As a trivial example, let us consider the case

FIG. 3. (a) The degree correlation k_nn(k) when the original network is non-assortative, assortative, and disassortative. (b) The clustering spectrum c(k) of the sampled network when the original clustering spectrum is c_o(κ) ∝ κ^0 (const), κ^(−1), and κ^(−2). (c) The average original degree given the sampled degree, κ̄(k). The inferred original degree κ̄(k) does not show a significant difference from the original average degree ⟨κ⟩, depicted as a horizontal dashed line, for k < k*. (d) The average hidden variable given the sampled degree, h̄(k). The horizontal dashed line depicts ⟨h⟩ = ∫ dh h ρ(h). The other settings are kept the same as in Fig. 2.
where $P_o(\kappa)$ is a delta function, as in the case of a regular random network. The $k$-dependency of $\bar{k}_{nn}(k)$ and $c(k)$ is then fully attributed to the sampling and hence contains no information about the original network properties. On the other hand, when the variance of $P_o(\kappa)$ is large, more information about the original network properties is likely to be reflected in the sampled networks.
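This extreme case is easy to check in simulation. The sketch below builds a degree-regular original network (a ring lattice, so $P_o(\kappa)$ is a delta function) and samples each link with $r(h_i, h_j) = \min\{h_i, h_j\}$, the $\beta \to -\infty$ limit of the generalized mean used later in the text; any spread in the sampled degrees is then induced purely by $h$:

```python
import random

def ring_lattice_edges(n, m):
    """Edges of a 2m-regular ring lattice: P_o(kappa) is a delta at 2m."""
    return [(i, (i + d) % n) for i in range(n) for d in range(1, m + 1)]

def sample_edges(edges, h, r):
    """Keep each link ij independently with probability r(h_i, h_j)."""
    return [(i, j) for (i, j) in edges if random.random() < r(h[i], h[j])]

random.seed(1)
n, m = 2000, 3                      # every node has original degree 6
h = [random.uniform(0.1, 0.9) for _ in range(n)]
kept = sample_edges(ring_lattice_edges(n, m), h,
                    r=lambda a, b: min(a, b))
deg = [0] * n
for i, j in kept:
    deg[i] += 1
    deg[j] += 1
# The original network has no degree heterogeneity, so the spread of the
# sampled degrees is entirely a sampling-induced, h-driven property.
spread = max(deg) - min(deg)
```

Here the attribute distribution and the ring-lattice topology are illustrative stand-ins, not the configurations studied in the paper.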

B. Sampling by a vector hidden variable
We demonstrate another example where the hidden variable of a node is not a continuous scalar but a vector $h_i$. Inspired by the Axelrod model for the dissemination of culture [20], $h_i$ is assumed to be an $F$-dimensional vector whose components take one of $q$ discrete values $\{0, \dots, q-1\}$. Although this is nothing but a toy model, it is introduced to show the validity of the theory.
In the following, $F = 2$ and $q = 4$ are used, and the first and second components are denoted as $\sigma$ and $\tau$, that is, $h_i = (\sigma_i, \tau_i)$. The probability mass function $\rho(h)$ is the uniform distribution: $\rho(h) = 1/q^2$ for any $\sigma$ and $\tau$. For the sampling probability function $r(h_i, h_j)$, the following function is used: where $c > 0$ is a parameter controlling the relative weight between the first and second terms. In this example, we use $c = 2$. As the original network, we use an Erdős–Rényi random graph with $N = 50000$ and $p = 0.003$. Figure 4 shows the simulation results as well as the theoretical predictions. The degree distribution $P(k)$, average neighbor degree $\bar{k}_{nn}(k)$, and clustering spectrum $c(k)$ are shown in Figs. 4(a)-4(c), respectively. Although the profiles of these curves are not as simple as those in the previous subsection, the theoretical curves coincide with the simulation results, proving the validity of our analytic approach. We also studied the correlation between neighboring $\sigma$ and $\tau$ by measuring $\bar{\sigma}_{nn}(\sigma)$ and $\bar{\tau}_{nn}(\tau)$, defined as the averages of the neighbors' hidden variables around a node having $\sigma$ and $\tau$, respectively. As shown in Fig. 4(d), $\sigma$ shows a negative correlation while $\tau$ shows the opposite dependency. This competing correlation is why we see non-monotonic profiles in $\bar{k}_{nn}(k)$ and $c(k)$: in some range of $k$, the negative correlation of $\sigma$ plays the major role, while the positive correlation in $\tau$ becomes more evident in other regions of $k$. Figure 4(d) also shows the theoretical values, which agree completely with the simulation results.
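The vector-attribute sampling is straightforward to simulate. The paper's exact $r(h_i, h_j)$ is not reproduced above, so the sketch below *assumes* a form with the described behavior: dissimilar $\sigma$'s and equal $\tau$'s favor sampling, which produces a negative neighbor correlation in $\sigma$ and a positive one in $\tau$ (a smaller network than $N = 50000$ is used for speed):

```python
import random

q, c = 4, 2.0                      # as in the text: q = 4, c = 2

def r(hi, hj):
    """ASSUMED sampling probability (not the paper's exact expression):
    the first term rewards dissimilar sigma, the second rewards equal tau,
    with relative weight c; normalized to lie in [0, 1]."""
    (si, ti), (sj, tj) = hi, hj
    return (abs(si - sj) / (q - 1) + c * (ti == tj)) / (1 + c)

random.seed(2)
n, p = 2000, 0.003                 # Erdos-Renyi original network
h = [(random.randrange(q), random.randrange(q)) for _ in range(n)]
kept = [(i, j) for i in range(n) for j in range(i + 1, n)
        if random.random() < p and random.random() < r(h[i], h[j])]

# Positive tau correlation: sampled links share tau more often than the
# baseline 1/q = 0.25.  Negative sigma correlation: the mean |sigma_i -
# sigma_j| on sampled links exceeds the unconditional mean 1.25.
same_tau = sum(h[i][1] == h[j][1] for i, j in kept) / len(kept)
mean_sdiff = sum(abs(h[i][0] - h[j][0]) for i, j in kept) / len(kept)
```

Any $r$ with one "heterophilic" and one "homophilic" term reproduces the competing correlations described in the text; the specific functional form here is only a plausible stand-in.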

IV. CORRELATED HIDDEN VARIABLES
So far we have derived the analytic forms for various network properties under the assumption that the $h$'s are independent of each other in the original network. Here, we consider a more realistic case where the neighboring $h$'s are correlated. A typical example of correlated attributes is the homophily mechanism in social networks, meaning that people tend to form ties with those similar to themselves [27].
When the $h$'s are correlated, one can conduct a rigorous calculation under the limited condition that $h$ is Markovian, i.e., the probability distribution of $h$ is conditioned only on the neighbors' hidden variables. We also assume that $h$ is independent of the local network topology, such that the degree, the average neighbor degree, and the local clustering coefficient are independent of the $h$ of the node. Under these assumptions, it is straightforward to formally write down the equations for $P(k)$, $\bar{k}_{nn}(k)$, and $c(k)$, as shown shortly. Although these assumptions limit the applicability of the theory, as hidden variables are usually not Markovian, it serves as a good approximation in various practical cases and gives an idea of how the correlation in $h$ affects the topology of the sampled networks.
When $h$ is correlated, the sampling probability around a node, $\bar{r}(h)$, is written as $\bar{r}(h) = \int dh'\, p_o(h'|h)\, r(h, h')$, where $p_o(h'|h)$ is the probability that a neighbor of a node with attribute $h$ has attribute $h'$. Using Eq. (31) instead of Eq. (1), the other equations in Sec. II A remain valid. The degree distribution is calculated using Eq. (3).
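The replacement of $\rho(h')$ by the neighbor-conditional $p_o(h'|h)$ can be evaluated numerically for a discretized attribute. A minimal sketch, using a two-valued attribute and an assumed $r(h, h') = \min\{h, h'\}$ for illustration:

```python
import numpy as np

def r_bar(p_cond, r):
    """Average sampling probability around a node with attribute h when
    neighboring attributes are correlated:
        r_bar(h) = sum_{h'} p_o(h'|h) r(h, h'),
    i.e., the uncorrelated-case rho(h') is replaced by p_o(h'|h).

    p_cond : (n, n) array, p_cond[a, b] = p_o(h'=b | h=a), rows sum to 1
    r      : (n, n) array, r[a, b] = sampling probability of a link (a, b)
    """
    return np.sum(p_cond * r, axis=1)

# Two discrete attribute values with homophilic neighbor correlation.
h_vals = np.array([0.2, 0.8])
p_cond = np.array([[0.9, 0.1],
                   [0.1, 0.9]])          # neighbors tend to share the attribute
r = np.minimum.outer(h_vals, h_vals)     # assumed r(h, h') = min(h, h')
rb = r_bar(p_cond, r)
```

With homophily, the high-$h$ node keeps a much larger fraction of its links ($\bar{r} = 0.74$) than it would against neighbors drawn from the marginal, which is the mechanism behind the heavier-tailed degree distribution discussed below.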
To calculate $\bar{k}_{nn}(k)$, we have to replace Eq. (12) with the following: with the same $\bar{r}(h)$ as in Eq. (31). The remaining calculations in Subsections II B and II C are unchanged. Therefore, the degree correlation in a sampled network is written as the joint effect of the original degree correlation $p_o(\kappa'|\kappa)$, the correlation of $h$ in the original network $p_o(h'|h)$, and the sampling-induced assortativity.
To demonstrate how the correlation in $h$ works, we study the following model as a case study. The original network is constructed using the stochastic block model (SBM), where $N$ nodes are equally partitioned into $C$ communities of size $N_C = N/C$. The probabilities of making intra-community and inter-community links are given by $p_{in}$ and $p_{out}$, respectively, independently of the community. The hidden variable $h$ of a node in community $I$, where $I$ is the community index ranging from $1$ to $C$, is drawn from $\rho_I(h)$. Thus, the distribution of $h$ over all the nodes is $\rho(h) = \sum_I \rho_I(h)/C$. For $\rho_I(h)$, the same functional form as Eq. (25) with $\alpha = 1$ is adopted, but with $h_0$ dependent on the community as $h_0(I) = 0.01 I$. With this community-dependent $h_0$, nodes in the same community have similar $h$ compared to nodes in other communities, yielding a positive correlation between neighboring $h$'s. For the sampling probability, Eq. (26) with $\beta = -\infty$, i.e., $r(h_i, h_j) = \min\{h_i, h_j\}$, is used.
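The model is simple to simulate end to end. Since $\rho_I(h)$ (Eq. 25) is not reproduced here, the sketch below draws $h$ uniformly around the community-dependent offset $h_0(I)$ as a stand-in, and uses a smaller network than the one in the paper:

```python
import random

def sbm_correlated_sample(N, C, p_in, p_out, seed=0):
    """SBM original network with community-dependent hidden variables,
    sampled with r(h_i, h_j) = min(h_i, h_j).  The attribute distribution
    is an illustrative stand-in for the paper's rho_I(h), keeping only the
    community-dependent offset h0(I)."""
    rng = random.Random(seed)
    Nc = N // C
    comm = [i // Nc for i in range(N)]                 # community index 0..C-1
    h = [0.01 * (comm[i] + 1) + rng.uniform(0.0, 0.01) for i in range(N)]
    kept = []
    for i in range(N):
        for j in range(i + 1, N):
            p = p_in if comm[i] == comm[j] else p_out
            if rng.random() < p and rng.random() < min(h[i], h[j]):
                kept.append((i, j))
    return h, comm, kept

# p_out chosen so that roughly half of the original links are intra-community.
h, comm, kept = sbm_correlated_sample(N=1000, C=10, p_in=0.5,
                                      p_out=0.5 * 99 / 900, seed=0)
frac_intra = sum(comm[i] == comm[j] for i, j in kept) / len(kept)
```

Because inter-community links join nodes from communities with different $h_0$, their $\min\{h_i, h_j\}$ is biased low, so intra-community links are over-represented after sampling even though they make up only half of the original network.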
For this model, the conditional probability $p_o(h'|h)$ is given as $p_o(h'|h) = p_o(h, h') / \int dh''\, p_o(h, h'')$, where $p_o(h, h')$ denotes the probability that a link in the original network connects nodes of $h$ and $h'$. This joint probability is given by $p_o(h, h') = \frac{q_{in}}{C} \sum_I \rho_I(h)\rho_I(h') + \frac{q_{out}}{C(C-1)} \sum_{I \neq J} \rho_I(h)\rho_J(h')$, where $q_{in}$ and $q_{out}$ are the fractions of intra- and inter-community links, respectively. These are given by $q_{in} = p_{in}(N_C - 1) / [p_{in}(N_C - 1) + p_{out}(N - N_C)]$ and $q_{out} = 1 - q_{in}$.

Figure 5(a) shows the degree distribution for both the correlated and uncorrelated cases. When the correlation is introduced, the degree distribution has a heavier tail than that for the uncorrelated case. This is because a node with a high $h$ tends to be surrounded by nodes with higher $h$: high-degree nodes have even higher degrees when $h$ is correlated. Figure 5(d) shows the correlation of $h$ in the sampled networks for the correlated and uncorrelated cases. The correlation in the original network enhances the positive correlation in the sampled network. The effects of the correlations are also observed in $\bar{k}_{nn}(k)$ and $c(k)$. The degree assortativity and the increasing behavior of $c(k)$ become stronger in the correlated case, as shown in Figs. 5(b) and 5(c).
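The link fractions follow from counting expected intra- and inter-community links in the SBM; the sketch below assumes the standard counting (this derivation is not reproduced verbatim from the paper) and checks it against the parameter values used in this section:

```python
def link_fractions(N, C, p_in, p_out):
    """Fractions of intra- and inter-community links in a symmetric SBM:
    expected intra links = C * p_in * Nc*(Nc-1)/2,
    expected inter links = p_out * Nc^2 * C*(C-1)/2,
    where Nc = N/C.  (Assumed standard counting.)"""
    Nc = N // C
    intra = C * p_in * Nc * (Nc - 1) / 2
    inter = p_out * Nc * Nc * C * (C - 1) / 2
    q_in = intra / (intra + inter)
    return q_in, 1.0 - q_in

# With the parameter values used here (N = 10000, C = 100, p_in = 0.5,
# p_out = 50/9900), about half of the links are intra-community.
q_in, q_out = link_fractions(10000, 100, 0.5, 50 / 9900)
```

The result $q_{in} \approx 0.497$ is consistent with the statement that half of the links are intra-community in this parameter setting.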
The theoretical predictions using Eqs. (31) and (32) are also shown in Fig. 5. The theoretical curves for $P(k)$ and $\bar{h}_{nn}(h)$, shown in Figs. 5(a) and 5(d), agree very well with the simulation results. To calculate the sampled degree around a node, only the $h$'s of the focal node and its neighbors are necessary; because longer-range correlations do not enter, these equations, which take the nearest-neighbor correlations into account, are rigorous. However, the theoretical curves for $\bar{k}_{nn}(k)$ and $c(k)$ deviate from the simulation results, as depicted in Figs. 5(b) and 5(c). This is because $\bar{k}_{nn}(k)$ and $c(k)$ depend on longer-range correlations: the average neighbor degree depends on the $h$ of a neighbor node and of that neighbor's own neighbors, i.e., the correlations between the focal node and its next-nearest neighbors affect $\bar{k}_{nn}(k)$. The clustering coefficient likewise depends on the correlations between next-nearest neighbors. A theory that takes into account the correlations of $h$ between next-nearest neighbors would improve the accuracy. Even though these equations are not rigorous, they give good approximations in practice and show how the correlations affect the sampling.

V. SUMMARY AND DISCUSSION
In this paper, we studied a class of network sampling where the sampling probability of a link depends on the attributes of the connected nodes. This is typically the case when a multiplex network is sampled through only some of its layers. Rigorous results for $P(k)$, $\bar{k}_{nn}(k)$, and $c(k)$ of the sampled network were derived for general functional forms of $\rho(h)$ and $r(h_i, h_j)$ when the attributes of the nodes are independent. The analytic calculations were compared with simulations, and a perfect match was obtained. As shown in Table I, the properties of the sampled networks are written as the aggregate of the contributions from the original network and the hidden variables. The theory also shows how quantities in the sampled network depend on the original network. For instance, the sampled degree distribution $P(k)$ depends on the original degree distribution $P_o(\kappa)$ but not on higher-order correlations.
As a concrete example, we studied the model where $r(h_i, h_j)$ is the generalized mean of $h_i$ and $h_j$, which was proposed to model sampling a communication channel of a social network [3]. Using the equations in Table I, we compared sampled networks with original networks having different properties. For this model, the original network properties are manifested only in a limited range of $k$ values, indicating that recovering the original network from the sampled network is infeasible. One of the lessons learned from this example is that the network we observe does not necessarily reflect the original network properties; instead, it may reflect the attributes of the nodes.
We also presented a theory for the case where the neighboring $h$'s are correlated, since in reality the attributes are not independent. Although the theory is not rigorous in general, it gives a good approximation in practical cases and shows how the correlation in $h$ alters the properties of the sampled networks.
So far, we have limited ourselves to the case where the original network properties and the hidden variable are uncorrelated, i.e., we assumed that $h$ and the network quantities ($\kappa$, $\bar{\kappa}_{nn}$, and $c_o$) are uncorrelated. Although we leave the correlated case for future research, its impact could be highly significant.
Another future research direction would be the development of methods to infer the original network and/or $h$ from empirical data sets. In that case, the use of metadata, as in [28, 29], as a proxy of $h$ would be of great help, because the correlation of $h$ in the sampled network is independent of the topology of the original network; it would also help to infer the functional forms of $\rho(h)$ and $r(h_i, h_j)$. We believe our results serve as a basis for these future studies.

FIG. 4. (a) Degree distribution $P(k)$, (b) average neighbor degree $\bar{k}_{nn}(k)$, and (c) clustering spectrum $c(k)$ of the sampled network for the model with vector nodal attributes $h$. Simulation results, averaged over 1000 independent runs, are compared with the theoretically predicted lines. (d) Correlation of the hidden variables between neighbors in the sampled networks. Symbols denote the simulation results while dashed curves are the theoretical predictions.

FIG. 5. (a) Degree distribution $P(k)$, (b) average neighbor degree $\bar{k}_{nn}(k)$, (c) clustering spectrum $c(k)$, and (d) correlations of neighboring hidden variables in the sampled network for the model with correlated $h$. Simulation results, depicted by symbols, are compared with the theoretical predictions, depicted as dashed curves. The results are compared with a null model where the correlation is removed by shuffling $h$. The correlated and uncorrelated cases are denoted as CO and UC in the legend and drawn in blue and orange, respectively. The simulation results are averaged over 200 independent runs.

Figure 5 shows the simulation results as well as the theoretical predictions for this model. The parameter values $N = 10000$, $C = 100$, $N_C = 100$, $p_{in} = 0.5$, and $p_{out} = 50/9900$ are used. With this setting, half of the links are intra-community links while the other half connect different communities. To investigate the effect of the correlations, an uncorrelated version of the model was also studied, in which the $h$'s of the nodes are randomly shuffled while the other settings are kept the same.