An information theoretic network approach to socioeconomic correlations

Due to its wide reaching implications for everything from identifying hotspots of income inequality to political redistricting, there is a rich body of literature across the sciences quantifying spatial patterns in socioeconomic data. In particular, the variability of indicators relevant to social and economic well-being between localized populations is of great interest, as it pertains to the spatial manifestations of inequality and segregation. However, heterogeneity in population density, sensitivity of statistical analyses to spatial aggregation, and the importance of pre-drawn political boundaries for policy intervention may decrease the efficacy and relevance of existing methods for analyzing spatial socioeconomic data. Additionally, these measures commonly lack either a framework for comparing results for qualitative and quantitative data on the same scale, or a mechanism for generalization to multi-region correlations. To mitigate these issues associated with traditional spatial measures, here we view local deviations in socioeconomic variables from a topological lens rather than a spatial one, and use a novel information theoretic network approach based on the Generalized Jensen Shannon Divergence to distinguish distributional quantities across adjacent regions. We apply our methodology in a series of experiments to study the network of neighboring census tracts in the continental US, quantifying the decay in two-point distributional correlations across the network, examining the county-level socioeconomic disparities induced from the aggregation of tracts, and constructing an algorithm for the division of a city into homogeneous clusters. These results provide a new framework for analyzing the variation of attributes across regional populations, and shed light on new, universal patterns in socioeconomic attributes.


I. INTRODUCTION
Analysis of spatial data is crucial for understanding a wide variety of human systems and, with our increasing capacity to handle high-resolution data sets, has found applications across the sciences [1]. From analyzing demographic polling behavior [2], to epidemic vulnerability of populations [3], to disparities in access to nutritious food [4], spatial data on social and economic attributes of populations are central to many problems in modern data-driven science. In particular, assessing the extent to which socioeconomic properties differ across regions of space is important for understanding the spatial dynamics of production and consumption [5], the manifestation of segregation [6], and the spatial decomposition of inequality [7]. There has thus been extensive research to understand how socioeconomic indicators fluctuate across space, which has involved the development of many sophisticated mathematical techniques to quantify variation in spatial data.
A major challenge with these methods is determining the scale at which to probe spatial variations, noting that populations tend to disperse heterogeneously across space [8]. Extreme spatial inhomogeneity in population density causes inconsistencies in the interpretability of results based solely on distance, as the pace of economic activity is largely determined by population [9], and space is primarily relevant insofar as it relates to the number of "intervening opportunities" it provides to economic agents [10]. As a result, various methods have been designed to account for heterogeneous population in the analysis of spatial data, including density-equalizing maps [11] and methods based on dasymetric mapping [12]. As an additional complication, there is no a priori way to aggregate regions of space for statistical analysis (an issue that is more precisely quantified by the Modifiable Areal Unit Problem, or MAUP for short) [13]. Consequently, finding suitable scales for various problems in spatial analysis is an open problem that has received extensive interest due to the sensitivity of results to the chosen scale [14]. To make matters worse, policy interventions take place at the level of artificial government-designated boundaries, and so analysis that ignores these boundaries may be irrelevant for certain implementation-driven studies [15]. Here, we assess relationships between official boundaries (census tracts) in a network-based manner, circumventing the aforementioned spatial issues by considering topological distance rather than geographic distance, which also allows for the development of insights at the scale of regions designated for policy intervention.
Another important avenue of research investigates what measures to use to quantify spatial disparities in population data. For nominal distributions, such as race or religious affiliation, there are a wide variety of measures of qualitative variation that take the form of segregation indices between disjoint groups [16]. Some of these indices can also be used with ordinal or interval data such as population counts over income brackets or education levels [6], but few of these measures have the flexibility to accommodate all types of data on the same scale or generalize to more than two regions. Measures based on information theory can also be effectively applied to distributional socioeconomic data [17], having the additional benefit of being founded in fundamental statistical principles, and allowing in some cases an extension to multiple distributions. We develop here a novel approach based on the Generalized Jensen-Shannon Divergence (GJSD) [18] to compare distributional data, which has a number of advantages over other approaches, including flexibility for all distributional data types and an intuitive theoretical interpretation.
We note other approaches proposed to analyze spatial data using networks or information theoretic principles, as there has been similar research in regional science, economic and political geography, urban planning, and spatial analysis. There has been extensive work on using spatial network methods in urban science [19,20], with a focus oriented mostly towards the structure of urban form and the dynamics of urban growth. Numerous methods based on spatial aggregation of neighboring regions within the context of multiscalar diversity indices have been developed to assess the spatial manifestation of diversity [21,22], but rarely do these accommodate distributions or multiple data types. Additionally, there is a body of research constructing spatial correlation and aggregation methods based on information theoretic measures [23,24]. However, these analyses thus far have been limited primarily to racial segregation and ecological diversity, and focus on relationships between individual regional entities within larger clusters rather than multi-distribution correlations as in this study. Additionally, these measures may not be as easily interpreted in terms of a simple statistical process, as is the case with our method, and many of these measures are not adaptable to all data types, which limits their capability in comparative analysis.
In this study, we develop a novel approach for studying spatial variation in distributional socioeconomic data based on regional adjacency networks and information theory, and apply our methodology to the network of adjacent census tracts in the continental US through a few experiments. We first examine two-point correlations in our distributional distance measure with respect to path length across the adjacency network, finding a universal decay pattern with similar scaling exponents and finite size cutoffs across a variety of socioeconomic attributes. We also utilize this methodology to assess disparity with respect to various socioeconomic attributes across US counties by generalizing our measure to the comparison of more than two regions, finding high regional dependence and correlations in our measure for multiple variables. Finally, we discuss a new means for spatial aggregation of regions through community detection with our measures at multiple resolutions, clustering the census tract network for the city of Chicago into meaningful regions of homogeneous socioeconomic characteristics at different cluster size scales. Our methods provide a new means for analyzing spatial variation in all types of distributional data within a universal framework that circumvents limitations in traditional spatial measures. The results from our experiments point to new ways of thinking about how socioeconomic characteristics manifest across space, and can be applied to a wide range of problems across the social sciences.

A. Census Tract Data and Network Construction
In order to study a wide variety of socioeconomic attributes at high spatial resolution, we utilize US Census data at the tract level from the American Community Survey in 2018 [25,26]. The American Community Survey continuously samples US households to collect data on various socioeconomic and demographic characteristics of the population, and it is the largest survey at the household level that is conducted by the Census Bureau. We choose to analyze data at the level of census tracts because they encapsulate highly localized populations, represent officially designated regions relevant for policy intervention [27,28], and have roughly equal populations (the 25th and 75th percentiles in terms of population are 2971 and 5572 for the set of tracts used in the analysis). We aggregate distributional data on educational attainment, house price, income, industry of occupation, and race in order to assess spatial variability across a range of different variables. The techniques we develop can be adapted for continuous distributional data, but here we use the available binned data for housing prices and incomes, leaving to future work the estimation of the full corresponding continuous distributions, as this is a difficult problem on its own [29,30].
In order to quantify variation in the discrete distribution of a variable $X$ across tracts, we encode its possible values as a vector $q_X$, which may be nominal, ordinal, or interval in nature depending on the variable $X$ being analyzed. For census tract $i$, we denote its distributional vector of values for the variable $X$ as $q_X^{(i)}$ (see Table I).

The nearest-neighbor network representation for census tracts is constructed utilizing the TIGER shapefile data [31], and two tracts are neighbors in the network if they share a common length of border or a corner. Only tracts in the continental US were considered for this analysis in order to ensure a single connected component for two-point correlation analyses. After removing tracts with corrupted or incomplete data, the final network had 70,201 nodes and 197,841 edges (for an average degree of 5.6). The overall goal, in terms of practical relevance of the proposed methods, is to inform local, spatially targeted interventions (e.g. at the scale of counties, cities, or neighborhoods, with tracts as the fundamental subdivision), and so we are presently interested only in relatively short-range correlations, hence the choice to construct the underlying network based on spatial adjacency. However, the method we present for comparing local distributions of socioeconomic variables can be applied to any pair of regions (whether or not they are adjacent), which is in fact what is done for our two-point correlation analysis, and so any network structure signifying a relationship between two regions could be used in this framework. For instance, one could construct the network based on population migration flows from region to region, which would no longer necessarily have geographically localized edges, but could be used to see whether or not people move homes between regions with similar or different socioeconomic properties.
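The adjacency rule described above (two tracts are linked if they share a border segment or a corner) can be illustrated with a minimal sketch. This is not the pipeline used in the study: the square "tracts", the vertex-set representation, and the names `build_adjacency` and `unit_square` are assumptions made for illustration, and real shapefile polygons can touch without sharing an exact vertex. The last line applies the identity (average degree) = 2 × (edges) / (nodes) to the counts reported in the text.

```python
from itertools import combinations

def build_adjacency(tract_vertices):
    """Link two tracts whenever their boundaries share at least one vertex.

    tract_vertices maps a tract id to the set of (x, y) vertices on its
    boundary. In this toy setting, sharing a vertex covers both a common
    border segment and a common corner.
    """
    adjacency = {tract: set() for tract in tract_vertices}
    for a, b in combinations(tract_vertices, 2):
        if tract_vertices[a] & tract_vertices[b]:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def unit_square(ix, iy):
    """Corner vertices of the unit square whose lower-left corner is (ix, iy)."""
    return {(ix, iy), (ix + 1, iy), (ix, iy + 1), (ix + 1, iy + 1)}

# Hypothetical 3 x 3 grid of square "tracts" (illustrative data only).
toy_tracts = {(ix, iy): unit_square(ix, iy) for ix in range(3) for iy in range(3)}
toy_adjacency = build_adjacency(toy_tracts)

n_nodes = len(toy_adjacency)
n_edges = sum(len(nbrs) for nbrs in toy_adjacency.values()) // 2
average_degree = 2 * n_edges / n_nodes

# The same identity applied to the node and edge counts reported in the text.
reported_average_degree = 2 * 197_841 / 70_201
```

The pairwise loop is quadratic in the number of tracts, which is fine for a toy grid but would need a spatial index for a network of the size analyzed here.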
As the analyses performed in this study are intentionally topological in nature, rather than geographic, we do not focus on spatial dimensions. However, for better contextualization of our results for those unfamiliar with the spatial extent of subdivisions within the US, we report summary statistics from our network dataset here. For the subset of tracts used in the analysis, the distribution of land areas is heavily right-skewed, with the tracts in the 10th percentile, median, and 90th percentile having areas of 1.0 km$^2$, 6.4 km$^2$, and 269.2 km$^2$, respectively. If we consider the set of tracts kept in the filtered dataset, and construct their (potentially incomplete) associated counties, the distribution of land areas is also right-skewed, with the counties in the 10th percentile, median, and 90th percentile having areas of 953.0 km$^2$, 1911.4 km$^2$, and 4863.0 km$^2$, respectively. The high level of heterogeneity we see in the land area statistics at both the tract and county level further illustrates the utility of an approach to socioeconomic correlations that is independent of spatial scale, as administratively equivalent regions clearly can have drastically different sizes.

B. Generalized Jensen-Shannon Divergence
Due to its desirable properties as a distributional distance measure, which we discuss in more detail, the Generalized Jensen-Shannon Divergence (GJSD) has gained popularity for applications across disciplines, from quantum physics [32], to genomics [33], and even to history [34]. For our purposes, the GJSD will allow us to distinguish distributional data across census tracts in a meaningful way, which can be understood in terms of the following process.
Suppose we have two spatial regions, region 1 and region 2, and we want to determine how similar these regions are with respect to a socioeconomic variable $X$. We assume that their respective populations $n_1$ and $n_2$ are known, as well as the distributions $q_X^{(1)}(x)$ and $q_X^{(2)}(x)$ defined in the Introduction. One way to think about how the populations in regions 1 and 2 differ in their composition of the attribute $X$ is to consider the situation where there was no artificial line drawn between regions 1 and 2, and instead we had just decided to consider them one single "super-region". We can then ask the question: How different is the distribution of $X$ across the population in this super-region than in its individual sub-regions? Rather than naively comparing the distributions $q^{(1)}$ and $q^{(2)}$ directly, this perspective accounts for the population difference between the regions, and will also allow us to address in a natural way the increase in regional homogeneity we get by separating these regions.
From an information theoretic perspective, we can quantify the homogeneity of attribute X within a population by its average information content (or surprisal ), in other words how unpredictable it is. For instance, if a population has relatively equal fractions of people from each race, it is difficult to guess what any given person's race is, and the amount of "information" we gain by finding out each person's race is relatively high on average. However, if nearly everyone is of a single race, it is very easy to guess an individual's race, and we are on average very "unsurprised" upon each discovery of the race of a randomly selected individual in this population. For our thought experiment, we can determine the homogeneity gain we achieve by separating regions 1 and 2 by computing how much the average information content of attribute X in the population is reduced after the split of the super-region.
The average information content of a random variable with probability distribution $q(x)$ is given by its entropy,

$$H[q] = -\sum_x q(x) \log q(x).$$

Generalizing our argument to the merging of $m \geq 2$ regions, we have that the reduction in average information content by the separation of $m$ adjacent regions is given by

$$J^{(1,\dots,m)} = H[\bar{q}] - \sum_{k=1}^{m} \pi_k H[q^{(k)}], \tag{6}$$

where (with $n_k$ the population of region $k$)

$$\pi_k = \frac{n_k}{\sum_{l=1}^{m} n_l} \tag{7}$$

and

$$\bar{q}(x) = \sum_{k=1}^{m} \pi_k\, q^{(k)}(x). \tag{8}$$

We can recognize now that Eq. 6 is equivalent to the Generalized Jensen-Shannon Divergence (GJSD), which is sometimes referred to as just the Jensen-Shannon Divergence for $m = 2$ [18].
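As a concrete illustration of the quantity described above (the entropy of the population-weighted mixture minus the population-weighted average of the regional entropies), the following minimal sketch computes it for toy inputs. The function names `entropy` and `gjsd` and the example numbers are assumptions made for illustration, not taken from the study; natural logarithms are used, and the base only rescales the result.

```python
import math

def entropy(q):
    """Shannon entropy of a discrete distribution (natural log, 0 log 0 = 0)."""
    return -sum(p * math.log(p) for p in q if p > 0)

def gjsd(dists, pops):
    """Generalized Jensen-Shannon divergence of m regional distributions.

    dists: list of m probability vectors over the same bins.
    pops:  list of m region populations, used as mixture weights.
    """
    total = sum(pops)
    weights = [n / total for n in pops]
    n_bins = len(dists[0])
    mixture = [sum(w * q[x] for w, q in zip(weights, dists)) for x in range(n_bins)]
    return entropy(mixture) - sum(w * entropy(q) for w, q in zip(weights, dists))

# Illustrative inputs: two regions, two categories.
j_identical = gjsd([[0.3, 0.7], [0.3, 0.7]], [1000, 3000])  # same composition
j_disjoint = gjsd([[1.0, 0.0], [0.0, 1.0]], [100, 100])     # no overlap
j_partial = gjsd([[0.8, 0.2], [0.4, 0.6]], [1000, 3000])    # partial overlap
```

With equal populations and non-overlapping distributions, the result equals $\log 2$, the entropy of the region label itself.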
Intuitively, if the distributions $\{q^{(k)}\}$ are all very similar, knowing which region a person is from does not reduce our uncertainty about their value of $X$ by much, and $J^{(1,\dots,m)}$ will be close to 0. On the other hand, if the $\{q^{(k)}\}$ are relatively different, then we can reduce our uncertainty about a person's value of $X$ by a lot by knowing which region $k$ they are from, and $J^{(1,\dots,m)}$ will be higher.
We know that Eq. 6 is bounded below by 0 due to the concavity of entropy, and this minimum is achieved when $q^{(k)} = q^{(k')}$ for all $k, k'$, as merging the regions then does not change our uncertainty about a person's value of $X$ at all. On the other hand, the maximum value $J^{(1,\dots,m)}_{\max}$ that Eq. 6 can take is

$$J^{(1,\dots,m)}_{\max} = -\sum_{k=1}^{m} \pi_k \log \pi_k, \tag{9}$$

with $\pi_k$ the population fraction of region $k$, which happens when the $\{q^{(k)}\}$ are entirely non-overlapping in their regions of non-zero probability. We can see that this is the upper bound by rewriting Eq. 6 in a more illuminating manner as

$$J^{(1,\dots,m)} = \sum_{k=1}^{m} \pi_k \sum_x q^{(k)}(x) \log \frac{q^{(k)}(x)}{\sum_{l=1}^{m} \pi_l\, q^{(l)}(x)}, \tag{10}$$

and noting that $\log \frac{q^{(k)}(x)}{\sum_{l} \pi_l\, q^{(l)}(x)} \leq \log \frac{1}{\pi_k}$, with equality holding when $q^{(l)}(x) = 0$ for all $l \neq k$, which is equivalent to the $q$'s having disjoint nonzero support. Eq. 9 is just the average uncertainty we have about which smaller region $k$ a randomly chosen person from the super-region will come from.
We normalize Eq. 6 by the upper bound in Eq. 9 to enforce values to lie in $[0, 1]$, which allows us to compare tract similarities for regions with variable populations $n_k$. The final expression we use for distributional comparison across regions is then

$$L_X^{(1,\dots,m)} = \frac{J^{(1,\dots,m)}}{J^{(1,\dots,m)}_{\max}}. \tag{11}$$

This measure is easily adapted to any discrete variable $X$, which can be nominal, ordinal, or interval in nature, allowing for the application of Eq. 11 to a wide variety of problems. It can also be adapted to continuous distributions through approximations of the differential entropy. We note that for ordered data, Eq. 11 is only sensitive to how much the probability mass changes between distributions of interest, not to where it moves. In this sense, there are other appealing measures for comparing ordered data, such as variants of the earth-mover's distance [36]. However, Eq. 11 has a major advantage over such previous measures in that it can be used to compare results across all types of distributional data on the same scale, and can also accommodate the inclusion of more than two distributions for comparison. In the following section, we perform multiple experiments on the tract adjacency network using Eq. 11, demonstrating new insights on spatial socioeconomic variability that can be gained through our methodology.
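The normalization and the remark about ordered data can be checked numerically with a short sketch. It assumes the normalized measure is the divergence divided by the entropy of the population fractions, as described above; the function name `normalized_gjsd` and all numbers are illustrative. The second pair of distributions is the first pair with bins 2 and 3 swapped, so probability mass moves to a more distant bin while the amount moved is unchanged.

```python
import math

def entropy(q):
    """Shannon entropy of a discrete distribution (natural log, 0 log 0 = 0)."""
    return -sum(p * math.log(p) for p in q if p > 0)

def normalized_gjsd(dists, pops):
    """GJSD divided by its upper bound, so that the result lies in [0, 1]."""
    total = sum(pops)
    weights = [n / total for n in pops]
    n_bins = len(dists[0])
    mixture = [sum(w * q[x] for w, q in zip(weights, dists)) for x in range(n_bins)]
    j = entropy(mixture) - sum(w * entropy(q) for w, q in zip(weights, dists))
    j_max = entropy(weights)  # entropy of the region label
    return j / j_max if j_max > 0 else 0.0

pops = [3000, 5000]
# Mass shifts between neighboring bins 1 and 2 ...
adjacent_shift = normalized_gjsd([[0.7, 0.3, 0.0], [0.3, 0.7, 0.0]], pops)
# ... versus the same amount of mass shifting between bins 1 and 3.
distant_shift = normalized_gjsd([[0.7, 0.0, 0.3], [0.3, 0.0, 0.7]], pops)

# Boundary cases: disjoint supports and identical distributions.
disjoint = normalized_gjsd([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]], [100, 300])
identical = normalized_gjsd([[0.2, 0.8], [0.2, 0.8]], pops)
```

The two shifted cases give the same value, which is the insensitivity to bin order noted in the text; a measure such as the earth-mover's distance would distinguish them.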

A. Two-point Correlations in $L_X$
Two-point correlation functions (a term used to refer generically to functions that measure some type of average correlation between points in a system as a function of the distance between them) are an invaluable tool for describing spatial data for systems as diverse as galaxy clusters [37], turbulent fluids [38], and earthquakes [39]. In more recent work, the concept of the two-point correlation function has been extended to networks [40-42], where it refers to computing correlations between the properties (in most cases, degree) of two nodes as a function of the shortest path distance between them.
Here, in order to assess the "scale" at which socioeconomic properties vary across the US, we compute a two-point correlation function for $L_X$ (Eq. 11) between census tracts as a function of the number of network hops between them. The effective distance we are concerned with is then consistent with policy-relevant boundaries and roughly accounts for the heterogeneous population density across space (as tracts have relatively similar populations, as discussed earlier). In other words, the total population of neighbors at path distance $l$ or less from a focal tract is roughly the same for all tracts, as the degree distribution of the analyzed network is highly homogeneous, as is characteristic of spatial networks in general.
In our case, the two-point correlation function $C_X(l)$ for socioeconomic attribute $X$ as a function of (unweighted) network geodesic distance $l$ is given by

$$C_X(l) = \frac{1}{n(l)} \sum_{i<j} L_X^{(ij)}\, \delta(d_{ij}, l), \tag{12}$$

where $\delta$ is the Kronecker delta function, $n(l)$ is the number of node pairs separated by shortest path distance $l$, and $d_{ij}$ is the shortest path distance between tracts $i$ and $j$ in the adjacency network. $C_X(l)$ gives the average divergence $L_X^{(ij)}$ over all pairs of nodes $(i, j)$ that are separated by $l$ hops.
Calculating $C_X(l)$ exactly is difficult, as there are $\sim 2.5$ billion pairs of tracts $\{i, j\}$ in the network that contribute to the sum in Eq. 12 for a given $X$. We therefore opt for a sampling procedure to compute $C_X(l)$ approximately, traversing nodes $j$ in the network up to a distance $l = 20$ starting at 1,000 uniformly sampled focal tracts $i$, then computing the sum in Eq. 12 over sampled focal tracts $i$ and traversed nodes $j$. A network distance of $l = 20$ corresponds to a spatial distance of roughly 200 km, varying with the location of the central tract, and so captures spatial regions roughly of size 160,000 km$^2$ (or about 2% of the land area of the continental US). Using this distance cutoff thus restricts our analysis to relatively concentrated regions, which may be more relevant for spatially targeted policy interventions.
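The sampling procedure described above can be sketched as a breadth-first traversal from each sampled focal node, accumulating the divergence by hop count. This is a minimal illustration under stated assumptions rather than the implementation used in the study: `divergence` is a hypothetical callable standing in for the pairwise measure of Eq. 11, and the ring network and its divergence function are toy inputs chosen so that the expected answer is known.

```python
import random
from collections import defaultdict, deque

def bfs_distances(adjacency, source, max_hops):
    """Hop distances from source to every node within max_hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_hops:
            continue  # do not expand beyond the cutoff
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def two_point_correlation(adjacency, divergence, n_focal, max_hops, seed=0):
    """Average divergence between sampled focal nodes and nodes l hops away."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    focal = rng.sample(nodes, min(n_focal, len(nodes)))
    total = defaultdict(float)
    count = defaultdict(int)
    for i in focal:
        for j, d in bfs_distances(adjacency, i, max_hops).items():
            if d >= 1:
                total[d] += divergence(i, j)
                count[d] += 1
    return {l: total[l] / count[l] for l in sorted(count)}

# Toy ring of 12 nodes whose "divergence" grows linearly with ring distance.
ring = {i: {(i - 1) % 12, (i + 1) % 12} for i in range(12)}

def toy_divergence(i, j):
    return min(abs(i - j), 12 - abs(i - j)) / 6

c_toy = two_point_correlation(ring, toy_divergence, n_focal=12, max_hops=4)
```

Pairs in which both nodes are focal are counted once from each side; this does not change the average for a symmetric divergence.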
In order to examine the scale over which correlations in each attribute decay, we analyze how quickly $C_X(l)$ approaches its asymptotic value $C_X(\infty)$ from its initial value $C_X(1)$ as we increase $l$. $C_X(\infty)$ is estimated as the average value of $L_X$ over 10,000 tract pairs selected uniformly at random (which should draw primarily from node pairs at distances much greater than $l = 20$ based on the network structure). Taking inspiration from the form of two-point correlations in spin systems, we can then fit the resulting data to the truncated power-law form

$$\tilde{C}_X(l) = l^{\alpha}\, e^{-l/\beta}, \tag{13}$$

where

$$\tilde{C}_X(l) = \frac{C_X(\infty) - C_X(l)}{C_X(\infty) - C_X(1)}, \tag{14}$$

and we have rescaled $C_X \to \tilde{C}_X$ to account for the intercepts at $l = 1$ and $l = \infty$.
The scaling exponent $\alpha$ in Eq. 13 quantifies the rate of decay in correlation in the system as a function of distance (network hops), and the cutoff exponent $\beta$ determines the distance scale (in terms of hops) over which correlation persists. A higher (more positive) value of $\alpha$ indicates a slower decay in correlations as we move away from a given tract, and a higher value of $\beta$ indicates a longer distance over which tracts have correlated distributions with this reference tract. To extract the exponents $\alpha$ and $\beta$, the following OLS fit is performed

$$\log \tilde{C}_X(l) = \alpha \log l - \frac{l}{\beta} + \epsilon_l, \tag{15}$$

with $\epsilon_l$ a white noise process.
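To make the fitting step concrete, the sketch below performs an ordinary least squares fit in log space, assuming a truncated power law of the form $\tilde{C}(l) = l^{\alpha} e^{-l/\beta}$ with no free prefactor. That functional form, the function name `fit_truncated_power_law`, and the synthetic data are assumptions made for illustration; the sketch regresses $\log \tilde{C}$ on $\log l$ and $l$ and converts the slope on $l$ into a cutoff.

```python
import math

def fit_truncated_power_law(hops, c_tilde):
    """OLS fit of log c = alpha * log(l) + b * l, returning (alpha, beta = -1/b).

    Assumes every rescaled value in c_tilde is strictly positive, so that
    its logarithm is defined.
    """
    x1 = [math.log(l) for l in hops]
    x2 = [float(l) for l in hops]
    y = [math.log(c) for c in c_tilde]
    # Normal equations for a two-regressor model without intercept.
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    s1y = sum(a * v for a, v in zip(x1, y))
    s2y = sum(b * v for b, v in zip(x2, y))
    det = s11 * s22 - s12 * s12
    alpha = (s1y * s22 - s12 * s2y) / det
    slope = (s11 * s2y - s12 * s1y) / det
    return alpha, -1.0 / slope

# Noise-free synthetic data generated with known exponents (illustrative only).
true_alpha, true_beta = -0.5, 30.0
hops = list(range(1, 21))
synthetic = [l ** true_alpha * math.exp(-l / true_beta) for l in hops]

alpha_hat, beta_hat = fit_truncated_power_law(hops, synthetic)
```

On real data, hop distances at which the rescaled correlation is zero or negative would have to be excluded before taking logarithms, and residuals should be inspected for autocorrelation before trusting the fit.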
We plot the results of the fit in Fig. 1A, where we show the coefficient of determination $r^2$, the scaling exponent $\alpha$, and the cutoff exponent $\beta$ for the fit in Eq. 15 for each attribute. We can see that the curves for all attributes (apart from "industry", which, owing to autocorrelated residuals, we suspect follows a different decay form that we will not explore here) collapse quite well onto each other. This collapse is not only an indication of a good fit, but may also point to a more fundamental, attribute-independent mechanism behind the variation of different attributes $X$ across regions, which we discuss at the end of this section.
To investigate a potential consequence of the striking similarity in the decay of $\tilde{C}_X(l)$ across attributes $X$ studied in Fig. 1A, we examine the correlations between the losses $L_X^{(ij)}$ and $L_{X'}^{(ij)}$ across edges $(i, j)$ for all pairs of attributes $(X, X')$. We analyze the monotonic dependence between losses using Spearman correlation, which relaxes the linearity assumption of Pearson correlation while still allowing us to test for the significance of observed correlations [43]. Specifically, we compute

$$\rho_{XX'} = \rho\left( \{L_X^{(ij)}\}_{(i,j) \in E},\ \{L_{X'}^{(ij)}\}_{(i,j) \in E} \right), \tag{16}$$

where $E$ is the set of edges in the adjacency network, $\rho$ is the Spearman correlation coefficient, and the arguments to $\rho$ describe the vectors of measurements being correlated. We plot the results as a correlation matrix in Fig. 1B, where we can see relatively high correlations between most of the variables analyzed. The high correlations we see are consistent with associations seen in a multitude of previous economic and sociological studies [44-47], although our framework has the added benefit of utilizing a single unified formalism to analyze all these associations. However, to get at underlying universal mechanisms behind observed socioeconomic data, we must go beyond solely demonstrating statistical associations between variables. The correlations seen in Fig. 1B may actually just be an artifact of a more fundamental process determining the decays in Fig. 1A, and we can make some headway in uncovering this process (or processes) using techniques inspired by urban scaling.
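The edge-level comparison described above amounts to a rank correlation between two vectors of losses indexed by the same edges. The following minimal sketch computes the Spearman coefficient as the Pearson correlation of average ranks; it does not compute significance levels, and the function names and the five toy loss values are assumptions made for illustration.

```python
import math

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical losses for two attributes over the same five edges.
loss_a = [0.02, 0.10, 0.05, 0.30, 0.01]
loss_b = [v ** 2 for v in loss_a]        # monotone but nonlinear in loss_a
loss_c = [-v for v in loss_a]            # perfectly reversed ordering

rho_ab = spearman(loss_a, loss_b)
rho_ac = spearman(loss_a, loss_c)
```

Because only ranks enter, the nonlinear but monotone relationship between `loss_a` and `loss_b` still yields a coefficient of one, which is the sense in which the linearity assumption is relaxed.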
Traditional urban scaling posits that a wide variety of characteristics $Y$ in a city can be predicted solely by the city's population $P$ through relations of the form $Y \sim P^{\eta}$ for some exponent $\eta > 0$, which in practice holds up for a large number of cities and variables of interest [9]. The success of the urban scaling theory relies on a few key factors that are associated with a growing city population: denser organization of facilities and infrastructure, an accelerated pace of life, and increased interaction between agents and activities leading to specialization and innovation [48]. In practice, the data $Y$ for some city-level characteristic (such as new patents or number of gas stations) is fit versus city population $P$ for many different cities, yielding an estimate for the exponent $\eta$ which we can interpret to gain an understanding of the fundamental processes contributing to the scaling behavior of $Y$. For instance, if $\eta > 1$ this says that $Y$ grows superlinearly with $P$, which should be the case for quantities $Y$ that show increasing returns with population (e.g. indicators of innovation such as new patents). On the other hand, $\eta < 1$ indicates economies of scale, or characteristics $Y$ that decrease in unit cost as we increase the city's population (e.g. mobility-related infrastructure such as number of gas stations). Perhaps the most important takeaway from traditional urban scaling analysis is that when we can collapse the behavior of a wide range of seemingly different socioeconomic systems into a single framework with few parameters, these parameters can help us understand basic universal processes underlying these superficially distinct variables.
We can use a similar process to interpret the results of Fig. 1A, except rather than the absolute quantity of a socioeconomic indicator, we are analyzing correlations between distribution-valued quantities, and the fundamental covariate here is network distance l instead of population P . Based on their homogeneous populations and degrees, the total population in tracts at path distance l or less from a focal tract is very similar across tracts, and so l reflects the total population included as we encircle a focal tract at greater and greater radii. As space is a factor for socioeconomic processes mainly in that it provides a medium for interaction among people [49], this distance l may be a more fundamental quantity than standard spatial distance in how it determines socioeconomic activity, and so we may be able to explain the spatial variation in a wide variety of socioeconomic variables using simple functions of l such as Eq. 13. An alternative quantity to l could be derived from literally transforming space based on population to homogenize the population density, a concept which has inspired numerous interesting and informative mapping methods [11]. However, we are ultimately constrained by the basic spatial units designated for data aggregation (e.g. census tracts), and so here we treat these regions, hence l, as fundamental.
In the present case, we can see that the exponents β determining the network correlation cutoff scale are very similar for education, housing, income, and race, indicating that correlations in these regional distributions are non-negligible over a universal distance scale of ∼ 30 hops. However, we see higher variation in the scaling exponents α, with race and housing decaying at a slower rate across the network than education and income. This suggests that perhaps the mechanisms that drive spatial differences in racial composition and local real estate values operate over larger distances than the mechanisms behind income or educational variability, at least in the US.
The association between the spatial distributions of housing values and racial groups has been noted in numerous studies that address "redlining" and other processes that result in lower property values in neighborhoods with high minority populations [50]. The analyses in Fig. 1A may point to additional, more subtle mechanisms behind this inequality due to a significant difference in the scaling exponents for housing and race, as this observed discrepancy indicates that the scales over which housing and racial regional similarity decay are quite different. It is known that home values are also tied to local incomes, which in turn can result in high variability in housing prices due to the relative flexibility of wages and mobility of workers compared to supply-regulated housing [51]. Therefore, perhaps the interplay between the long-range correlated racial composition of the population and the comparatively short-range correlated income distributions plays a role in determining the moderate decay exponent $\alpha$ we see in the data. However, more definite conclusions and practical intervention strategies require a more contextualized analysis in conjunction with domain expertise.

B. County-level Heterogeneity
To examine the regional diversity of a given socioeconomic variable, we employ Eq. 11, this time applied to all the tracts comprising each county within our dataset. More specifically, for each county we examine, we compute $L_X^{(t_1, t_2, \dots)}$ with $t_i$ the census tracts within the county subdivision and $X$ the variable of interest. For notational convenience, we will use the notation $L_X^{(\mathrm{county})}$ from now on for this quantity. In order to compare counties with varying numbers of constituent tracts on the same scale, we normalize for potential biases from the number of included tracts by using a bootstrapping procedure to compute z-scores for each county-level value $L_X^{(\mathrm{county})}$. For each number of tracts $k$, we let $\mu_X(k)$ and $\sigma_X(k)$ denote the mean and standard deviation of $L_X^{(t_1, \dots, t_k)}$ over 100 random samples of $k$ tracts $t_1, \dots, t_k$. Then, we calculate a standardized version of Eq. 11, $\tilde{L}$, for each county using

$$\tilde{L}_X^{(\mathrm{county})} = \frac{L_X^{(\mathrm{county})} - \mu_X(|\mathrm{county}|)}{\sigma_X(|\mathrm{county}|)}, \tag{17}$$

where $|\mathrm{county}|$ is the number of tracts within the county. We will refer to Eq. 17 as a "disparity" measure, as higher values of $\tilde{L}_X^{(\mathrm{county})}$ indicate higher dissimilarity in a county's tract-level distributions $q_X^{(i)}$ relative to what is expected in a randomized null model where the county's tracts are chosen at random. In practice, we will see that $\tilde{L}$ tends to be negative for most counties, and in this case we should note that values of greater magnitude indicate higher similarity in the county-aggregated tracts than expected by chance.

FIG. 1. (A) [...] Table I, with 95% confidence intervals around the scaling and cutoff exponents $\alpha$ and $\beta$. The line $y = x$ is plotted for reference, as a perfect scaling collapse maps all points onto this line. Eq. 15 is deemed a poor fit for $C_{\mathrm{industry}}$ after a residual analysis, and so this result is omitted. (B) Spearman correlation matrix with respect to losses $L_X^{(ij)}$ across all edges $(i, j)$, for all pairs of socioeconomic attributes utilized in our study. All correlations are highly statistically significant at the 1% significance level, with standard errors of $\sim 0.001$.
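The bootstrapped standardization described above can be sketched as follows. The sketch assumes the null samples are drawn uniformly without replacement from all tracts in the dataset and that the sample standard deviation is used; neither detail is specified here, and the function names and the two-type toy dataset are hypothetical.

```python
import math
import random
import statistics

def entropy(q):
    """Shannon entropy of a discrete distribution (natural log, 0 log 0 = 0)."""
    return -sum(p * math.log(p) for p in q if p > 0)

def normalized_gjsd(dists, pops):
    """GJSD divided by the entropy of the population fractions."""
    total = sum(pops)
    weights = [n / total for n in pops]
    n_bins = len(dists[0])
    mixture = [sum(w * q[x] for w, q in zip(weights, dists)) for x in range(n_bins)]
    j = entropy(mixture) - sum(w * entropy(q) for w, q in zip(weights, dists))
    j_max = entropy(weights)
    return j / j_max if j_max > 0 else 0.0

def disparity_z_score(county_tracts, dists, pops, n_samples=100, seed=0):
    """z-score of a county's measure against random tract sets of equal size."""
    rng = random.Random(seed)
    all_tracts = sorted(dists)
    k = len(county_tracts)

    def measure(tracts):
        return normalized_gjsd([dists[t] for t in tracts], [pops[t] for t in tracts])

    null_values = [measure(rng.sample(all_tracts, k)) for _ in range(n_samples)]
    mu = statistics.mean(null_values)
    sigma = statistics.stdev(null_values)  # assumption: sample standard deviation
    return (measure(county_tracts) - mu) / sigma

# Toy dataset: 40 equally populated tracts of two sharply different types.
toy_dists = {t: ([0.9, 0.1] if t < 20 else [0.1, 0.9]) for t in range(40)}
toy_pops = {t: 4000 for t in range(40)}

z_homogeneous = disparity_z_score(list(range(10)), toy_dists, toy_pops)
z_mixed = disparity_z_score(list(range(15, 25)), toy_dists, toy_pops)
```

A county built from tracts of a single type is more homogeneous than a random draw, so its score is negative, while an evenly mixed county scores higher.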
As a first step in understanding county-level disparities across the US, we plot the distribution of $\tilde{L}_X^{(\mathrm{county})}$ over all counties for each socioeconomic attribute $X$ in Fig. 2A as box-and-whisker plots. We can see that the distributions of all quantities tend strongly towards negative values, indicating that most counties have greater similarity in their tract-level distributions $q_X^{(i)}$ than expected in the null model. This is consistent with the spatial autocorrelation at short scales we see in socioeconomic variables in Fig. 1A, although these analyses in some sense provide a complementary viewpoint. Here, rather than assessing the scales over which distributions of socioeconomic characteristics remain similar as in Fig. 1, we are examining whether artificially drawn administrative boundaries are effective at capturing the homogeneity in these attributes. As counties have size scales much smaller than the area covered up to the typical correlation cutoff scale $l \sim 30$ from any reference tract, we expect that correlations between tract-level distributions will be relatively high within counties, and so in this sense these results should be unsurprising.
Looking at Fig. 2A, we do see something perhaps unexpected though: there are many counties that are only slightly more (and sometimes less) homogeneous in their tract-level distributional data than we would expect by chance. In particular, most of the values of $\tilde{L}^{(\mathrm{county})}_{\mathrm{race}}$ are in the interval [−2, 0], which means that their disparity is within two standard deviations of the value expected with completely randomized tracts. This suggests that many counties in the US are relatively representative of the whole US in terms of racial composition, whereas there are relatively few counties with drastically different compositions. The same does not hold for housing though, for which around 75% of the counties studied had disparity values more than two standard deviations away from the null model expectation. This result indicates that there are relatively few counties with distributions of housing values that are diverse enough to reflect typical housing prices nationally.
To determine the association in the disparity values $\tilde{L}^{(\mathrm{county})}_X$ across counties, we plot the corresponding Spearman correlation matrix using the results from all counties studied. Similarly to Eq. 16, we compute the Spearman correlation ρ between the disparity values of each pair of attributes over all counties (Eq. 18). The resulting Spearman correlation matrix is shown in Fig. 2B, where we can see very high correlations between the within-county disparities, even higher than in the values of $L^{(ij)}_X$ shown in Fig. 1B. These correlations are similar in sign and relative magnitude (between attributes $X$) to those in Fig. 1B, but by aggregating tracts at the county level rather than just assessing correlations over edges, we are effectively reducing noise by smoothing out local fluctuations, and so we see a major increase in the values of ρ. In other words, some individual edges (i, j) may have very different divergences $L^{(ij)}_X$ and $L^{(ij)}_Y$ for two attributes $X$ and $Y$, but the effect of these outlier pairwise relationships is reduced when looking at distributions between tracts at the county level. This noise reduction is only possible because, as discussed, the scale at which we are analyzing $\tilde{L}^{(\mathrm{county})}_X$ is smaller than the area associated with the correlation cutoff scales β found in Fig. 1A.
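As a rough illustration of how such a matrix can be assembled, the sketch below computes Spearman correlations as Pearson correlations of average ranks, using only the standard library. The input layout (a dictionary mapping each attribute name to a list of per-county disparity values, in a common county order) and all function names are assumptions made for this example.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for m in range(i, j + 1):
            ranks[order[m]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def spearman_matrix(disparities):
    """Map each attribute pair (a, b) to the Spearman correlation of
    their per-county disparity values."""
    names = list(disparities)
    ranks = {name: average_ranks(disparities[name]) for name in names}
    return {(a, b): pearson(ranks[a], ranks[b]) for a in names for b in names}
```

Because only ranks enter the calculation, the result is unchanged by any monotonic rescaling of the disparity values.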
Finally, as a case study to visualize the geographic manifestation of these county-level disparities, we plot a heatmap of $\tilde{L}^{(\mathrm{county})}_{\mathrm{housing}}$ across all counties studied in Fig. 2C. Here we can immediately see an interesting pattern: the county-level disparity in housing prices, when compared to the same number of randomly selected regions, is actually much lower along the coasts and in metropolitan areas than it is elsewhere. Housing markets in coastal and metropolitan regions are typically seen as having high inequality due to the large variation in home and land values often seen in these areas [52,53]. However, when assessed on a relative scale using distributions at the granularity of census tracts, we see a different story. In this case, we see that these coastal and metropolitan counties actually have quite similar distributions $q^{(i)}_{\mathrm{housing}}$ across their constituent tracts $i$ relative to more inland and rural counties. The primary reason for this may be that the heterogeneity in housing prices in dense, urban counties is primarily manifested at scales below our measurement precision: tracts themselves have house price distributions with high variance, but tracts in a given county tend to have relatively similar distributions. This is consistent with the low rate of spatial decay in housing correlations seen in Fig. 1A, as most tracts are urban [54], and if most of the fluctuations persist at scales smaller than census tracts, we will see a relatively smooth correlation trend at larger scales. Due to the coarse binning of housing values, however, there is also a potential confounding factor here in that many of the home prices in expensive metropolitan and coastal regions fall into the highest bin in the corresponding census data (> \$1,000,000), and so variability due to fluctuations above this price threshold is suppressed when using census data to assess inequalities.

C. Regional Clustering
As a final experiment using our measures, we detect communities at multiple size scales in the census tract subnetwork within the city of Chicago (a frequently used case study in socioeconomic diversity due to its rich history and abundance of available data [55,56]), with the goal of constructing clusters that are relatively homogeneous with respect to each socioeconomic attribute. Optimal aggregation of spatial regions according to various criteria has been a longstanding problem of interest, and numerous approaches have been proposed to tackle this using networks with edges weighted by an attribute representing regional similarity. This approach has the added benefit that since community detection algorithms look for connected clusters of nodes, the clusters detected naturally tend to be contiguous, and thus relevant for spatially localized policy. Attributes used in previous studies include phone calls between regions [57], commuting flows [58], taxi trips [59], and similarity between individual within-region features like our own method [60].
In order to group the tract network into clusters that exhibit homogeneity with respect to attribute $X$, we use $L^{(ij)}_X$ to construct edge weights $w_{ij}$ for the network prior to performing community detection. However, we cannot use $L^{(ij)}_X$ for edge weights directly, as community detection algorithms typically associate higher edge weight with higher node similarity, and $L^{(ij)}_X$ is structured so that lower values indicate greater similarity across an edge (i, j). We thus employ a common transformation from the machine learning literature [61], which is to use an exponential kernel to map the values $L^{(ij)}_X$ to their associated edge weights $w_{ij}$ in the network. The weight transformation can be written as

$$w_{ij} = e^{-\omega L^{(ij)}_X}, \qquad (19)$$

where $\omega > 0$ is a tunable parameter that determines how much differentiation in the weights we will have across edges in the network. A value of $\omega \approx 0$ results in almost no differentiation between edge weights ($w_{ij} \approx 1$ for all edges), whereas $\omega \gg 1$ results in an exaggerated difference in edge weights between edges with lower $L^{(ij)}_X$ and edges with higher $L^{(ij)}_X$. Any decreasing kernel mapping the unit interval to the non-negative reals would suffice to construct the weights $w_{ij}$, but we opt for the exponential function here because it is particularly simple and only has one tunable parameter. For the experiments shown, we find a middle ground between the two extremes presented for ω, for each attribute-based clustering choosing a value of ω that results in a relatively uniform distribution of edge weights across [0, 1]. More specifically, for each attribute $X$ we numerically approximate the ω that maximizes the entropy of the associated distribution of edge weights $e^{-\omega L^{(ij)}_X}$.
A more principled method for choosing ω based on the application of interest is left to future work, but here we use this simple statistical procedure to avoid falling into one of the two cases presented, where there is either no differentiation in the edge weights or only a handful of edges matter.
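One way the exponential kernel and the entropy-based choice of ω might be implemented is sketched below. The text does not specify how the entropy of the weight distribution is estimated, so the uniform histogram on [0, 1], the default of 20 bins, and the grid search over candidate ω values are all assumptions of this example, as are the function names.

```python
import math


def edge_weights(losses, omega):
    """Exponential kernel: w_ij = exp(-omega * L_ij) for each edge loss."""
    return [math.exp(-omega * loss) for loss in losses]


def histogram_entropy(weights, n_bins=20):
    """Shannon entropy of the weights binned uniformly on [0, 1]."""
    counts = [0] * n_bins
    for w in weights:
        # A weight of exactly 1 is placed in the last bin.
        counts[min(int(w * n_bins), n_bins - 1)] += 1
    total = len(weights)
    return -sum(c / total * math.log(c / total) for c in counts if c > 0)


def max_entropy_omega(losses, omegas, n_bins=20):
    """Grid search for the omega whose weight histogram has the highest
    entropy, i.e. the most uniform spread of weights over [0, 1]."""
    return max(
        omegas,
        key=lambda om: histogram_entropy(edge_weights(losses, om), n_bins),
    )
```

With losses spread over the unit interval, very small ω pushes every weight into the top bin and very large ω pushes every weight into the bottom bin, so both extremes have zero histogram entropy and the search selects an intermediate value.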
In order to detect communities in the Chicago subnetwork, we aim to find the configuration of node communities $c = \{c_i\}$ in the subnetwork such that the weighted modularity $Q_\gamma(c)$ is approximately optimized. The modularity $Q_\gamma(c)$ that we use here is defined in Eq. 20, where $W$ is the sum of edge weights in the network, $s_i = \sum_k w_{ik}$ is the sum of weights of edges attached to node $i$, and γ is a resolution parameter [62]. When $\gamma = 1$, Eq. 20 reduces to the traditional notion of weighted modularity, but varying $\gamma \neq 1$ allows us to choose the importance given to $w_{ij}$ relative to $s_i s_j / W$ (which is approximately the expected weight of $w_{ij}$ under random rewiring). In particular, the larger we make γ, the more importance is given to the observed edge weights relative to the expected weights, and the larger the communities in the configurations $c$ that maximize Eq. 20 will be. Thus, by varying ω we can tune how much influence differences in $L^{(ij)}_X$ across edges have, and by varying γ we can determine the characteristic cluster size. We use the Louvain algorithm [63], a greedy optimization method, to find the configuration $c$ that approximately maximizes Eq. 20. There are numerous viable alternative methods, but here we opt for the Louvain algorithm as it is fast and straightforward to implement. It is also important to note that we can perform regional aggregation with $L^{(ij)}$ in a manner where clusters are not likely to be contiguous, for instance by constructing a matrix from all pairwise values of $L^{(ij)}$ and performing one of various matrix-clustering techniques [64]. However, here we are interested in constructing contiguous clusters of tracts in order to coarse grain the city into zones relevant for spatially targeted policy intervention, and so we use community detection to encourage contiguity of the clusters.
In Fig. 3 we show the results of our community detection analysis for the Chicago census tract subnetwork. In Figs. 3A-3C, we show the clusters obtained for edge weights constructed using $L^{(ij)}_{\mathrm{income}}$, at various resolutions γ. We can observe that increasing γ allows us to get a coarser view of the socioeconomic clusters present in the city, and can allow for delineation of super-regions at a desired scale. We also show the officially designated neighborhood boundaries (thick black lines) for Chicago (https://data.cityofchicago.org/) in order to visualize the consistencies and inconsistencies between our clusters and these officially delineated regions. We can see that in the intermediate regime $\gamma \sim 0.1$, our clusters are consistent with some neighborhood boundaries, but deviate significantly from others. This suggests that the officially designated regions are somewhat consistent with homogeneous socioeconomic clusters, but there is room for improvement to these boundaries if the goal is to delineate socioeconomically homogeneous zones within the city (at least regarding income). Of course, there are numerous other factors, both socioeconomic and geographic, that would need to be accounted for in addition to the factors we analyze in order to draw effective policy-relevant boundaries in practice.
We also compute the Adjusted Mutual Information (AMI) between clusters obtained using different attributes $X$ as well as the official neighborhood clusters, in order to assess the consistency in the groups we obtain when considering these different factors. The Mutual Information $MI(c_1, c_2)$ is the amount of shared information (in an information theoretic sense) between the clusterings $c_1$ and $c_2$, or more intuitively, the statistical uncertainty in each independent clustering minus the statistical uncertainty when combined. More specifically, we have that

$$MI(c_1, c_2) = \sum_{s,t} q_{12}(s,t) \log \frac{q_{12}(s,t)}{q_1(s)\, q_2(t)}, \qquad (21)$$

where $q_1(s)$ is the fraction of nodes put into cluster $s$ under configuration $c_1$ (and similarly for $q_2$), and $q_{12}(s,t)$ is the fraction of nodes put into group $s$ under configuration $c_1$ and $t$ under configuration $c_2$. One drawback to using MI, however, is that it gives systematically higher values as we increase the number of clusters, even for completely random cluster configurations [65]. One proposed correction (of many) to this is to use the AMI, defined in Eq. 22 in terms of $MI(c_1, c_2)$ and its expectation value in the null model where the number of items in each cluster is fixed and groups are generated randomly through permutations of labels. The AMI is equal to 0 if the clusters $c_1$ and $c_2$ share the amount of information we expect from random chance purely based on their cluster sizes, and 1 if the clusters are identical.
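The quantities above can be illustrated with a short standard-library sketch. Two choices here are assumptions of the example and not taken from the text: the expected MI under the permutation null model is estimated by Monte Carlo shuffling of labels instead of a closed-form expression, and the adjusted score is normalized by the larger of the two clustering entropies, which is only one of several normalizations in use.

```python
import math
import random
from collections import Counter


def mutual_information(c1, c2):
    """MI (in nats) between two clusterings given as equal-length label lists."""
    n = len(c1)
    q1, q2, q12 = Counter(c1), Counter(c2), Counter(zip(c1, c2))
    # q12 / (q1 * q2) in terms of counts is m * n / (count_s * count_t).
    return sum((m / n) * math.log(m * n / (q1[s] * q2[t]))
               for (s, t), m in q12.items())


def label_entropy(c):
    """Shannon entropy (in nats) of the cluster-size distribution."""
    n = len(c)
    return -sum((m / n) * math.log(m / n) for m in Counter(c).values())


def adjusted_mutual_information(c1, c2, n_perm=500, seed=0):
    """AMI with the expected MI estimated from random label permutations
    (cluster sizes fixed) and max-entropy normalization (assumptions)."""
    rng = random.Random(seed)
    shuffled = list(c2)
    expected = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        expected += mutual_information(c1, shuffled)
    expected /= n_perm
    upper = max(label_entropy(c1), label_entropy(c2))
    return (mutual_information(c1, c2) - expected) / (upper - expected)
```

The score depends only on which nodes are grouped together, so relabeling the clusters leaves it unchanged. For larger problems an analytic expectation is preferable to shuffling; scikit-learn's `adjusted_mutual_info_score` is one existing implementation of that approach.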
In Fig. 3D, we plot the average AMI over all pairs of partitions using the five socioeconomic attributes, as a function of the resolution parameter γ. We can see that there is a clear peak value of γ at which the five attributes share highly overlapping clusters. In practice, this could be used as a heuristic to tune γ for selecting the size scale of the clusters, if the goal is to select clusters that are highly homogeneous with respect to multiple socioeconomic attributes. It is interesting to note the clear scale sensitivity in this analysis: at certain scales, we can divide the city into zones that are relatively socioeconomically homogeneous in all variables studied, but at other scales, the city decomposes into regions with less overlap. Fig. 3E shows the AMI matrix for the clusters obtained at γ ≈ 0.1, the peak in Fig. 3D. We can see from this plot that all socioeconomic attributes are spatially clustered in quite similar patterns at this scale, and that all have high correlation with the official neighborhood boundaries as well. This is perhaps an endorsement for the neighborhood boundaries, as these results suggest that the scale at which the neighborhoods are drawn corresponds to the scale at which the socioeconomic clusters in the city are most similar. Taken together, the results from Fig. 3D and 3E may point to a new method for subdividing a city into different neighborhoods, which can be constructed easily based on any socioeconomic attribute and at any size scale.

IV. CONCLUSION
In this study, we propose a new measure for analyzing socioeconomic data across spatial regions using concepts from network theory and information theory, which accommodates all forms of distributional data, has a natural extension to the comparison of more than two regions, and allows for policy-relevant analysis by considering officially delineated regions as fundamental spatial units. By analyzing spatial data from a topological lens, we can approach regional analysis issues from a relational perspective that avoids the longstanding issue of identifying appropriate spatial scales. We apply our framework in a series of experiments on the adjacency network of US census tracts to demonstrate the new insights we can gain with our methodology. We first find a universal decay pattern in various socioeconomic correlations as a function of path distance, as well as high statistical association between distributional similarities in adjacent tracts. We then aggregate tract-level distributions at the county level, finding again that distributional disparity measures are highly correlated, and also that there are relatively low levels of within-county inequality compared to what one would expect by aggregation of random tracts. Finally, we propose a clustering algorithm for regional aggregation into homogeneous socioeconomic clusters, finding that in practice the clusters obtained by our methodology have high overlap with accepted neighborhood delineations, as well as with each other across attributes. These applications illustrate the versatility of our methods, as well as the universality present in socioeconomic data when analyzed with a unified framework.
There are numerous improvements that can be made to our methodology in future work to increase its effectiveness in practical applications. Firstly, important limitations arise from the quality and resolution of census data, which we do not attempt to address as they are outside the scope of this work. In particular, the coarse binning of interval distributional datasets (here, income and housing) can result in poor estimation of entropy and other uncertainty measures, because the long tails of these distributions, which may contain a large portion of their variability, are not accounted for [66]. One improvement to our methodology to obtain more accurate results would thus be to estimate these full distributions based on the predefined bins and other summary statistics such as mean, median, and Gini coefficient [30,67], then apply our measures using approximations of differential entropy. Additionally, some census data have large margins of error due to various statistical sampling issues [68,69], and so correcting for this noise in our analyses would also improve the efficacy of our techniques. However, we leave these and further improvements to future work.