Linguistic evolution driven by network heterogeneity and the Turing mechanism

Given the rapidly evolving landscape of linguistic prevalence, whereby a majority of the world's existing languages are dying out in favor of the adoption of a comparatively fewer set of languages, the factors behind this phenomenon has been the subject of vigorous research. The majority of approaches investigate the temporal evolution of two competing languages in the form of differential equations describing their behavior at large scale. In contrast, relatively few consider the spatial dimension of the problem. Furthermore while much attention has focused on the phenomena of language shift---the adoption of majority languages in lieu of minority ones---relatively less light has been shed on linguistic coexistence, where two or more languages persist in a geographically contiguous region. Here, we study the geographical component of language spread on a discrete medium to monitor the dispersal of language species at a microscopic level. Language dynamics is modeled through a reaction-diffusion system that occurs on a heterogeneous network of contacts based on population flows between urban centers. We show that our framework accurately reproduces empirical linguistic trends driven by a combination of the Turing instability, a mechanism for spontaneous pattern-formation applicable to many natural systems, the heterogeneity of the contact network, and the asymmetries in how people perceive the status of a language. We demonstrate the robustness of our formulation on two datasets corresponding to linguistic coexistence in northern Spain and southern Austria.


Introduction
Language is the center of human activity and has served as the fundamental mode of communication since the dawn of human civilization. While there currently exists roughly 6000 languages differing structurally in terms of grammar and vocabulary [1], they all evolve dynamically through human interactions, that are shaped by economic, political, geographic and cultural factors [2,3,4,5].
A rather unfortunate outcome of such evolution is the replacement of vernacular tongues, spoken by a minority of the population, with that spoken by the dominant majority. Indeed it is estimated that 90% of existing languages will go extinct by the end of the century [1,6], leading to a huge loss in cultural diversity, given the inextricable links between speech and customs. One of the first mathematical models that accurately reproduced such "language-death" was proposed by Abrams and Strogatz (AS) [7]. In their formulation, two languages compete, with the attractiveness of each of the species being determined by its perceived status amongst the population. As long as the symmetry between the perceived status is broken, the model necessarily predicts a single hegemonic language adopted by the entire population. Indeed, the model successfully accounts for the decline of 42 real-world minority languages in contact with hegemonic counterparts. The formulation however fails to account for those cases where languages coexist in a geographically contiguous region.
To account for this limitation, refinements were made to the AS model by Mira and Paredes [8,9] incorporating bilingualism by introducing an inter-linguistic similarity parameter. An important example of this occurs in the northwestern part of Spain in Galicia, where both Galician and Spanish are spoken. Their model analyzes the temporal evolution of these languages demonstrating the existence of a stable coexistence given enough similarity between the languages.
Both formulations and other related ones [10,11,12] focus on the temporal aspect of language evolution at a macroscopic scale, while ignoring local dynamics on the space where subpopulations of competing tongues reside. To address this, other approaches incorporate the geographic component into reactiondiffusion equations [13,14,15] simulating the dispersal of speakers in a continuous domain [16,17,18,19]. While the approach is reasonable, it fails when geographic regions representing speakers of a common language are no longer contiguous, and thus there is no meaningful diffusive front. Furthermore, the approach is unable to provide spatially detailed description of language spread and retreat.
An agent-based probabilistic model proposed in [20], supported with detailed empirical data from southern Austria shed light on the constituents of language dynamics at a microscopic level. The region, where speakers of German and Slovenian live, is partitioned into quadratic grids where each cell represents an area of one square kilometer. To determine the probability of speaking one of the languages in a given year, the model uses the number of speakers of each language in the preceding year for every cell and their interaction with speakers with surrounding cells, hence accounting only for short-range connections. Although this fine-grained model is able to successfully determine the temporal evolution of the two languages and generate satisfactory results for geographic distribution of the subpopulations, it has limitations in terms of generalization. The historic and elaborate dataset for the number of speakers covering the entire region are from the periods 1880-1910 and 1971-2001, and such detailed records are not so easily available for other parts of the world.
A recent work [21] proposed a district level mean field network with uniform weights for Galicia where two competing languages are spread across 20 districts of the region with different prestige values, to combine internal complexity of each location with influence produced by its neighbors. This approach explains how the interplay between urban and rural dynamics leads to competition in language shift. The framework however requires fine-grained detail on a plethora of parameters to study the sociolinguistic dynamics across the region. While the models described thus far, reproduce, to varying degrees of accuracy, the evolution of the observed linguistic trends (to the extent that such data is available) they are formulated in a fashion that makes it difficult to disentangle the effects of the various mechanisms governing linguistic evolution. Additionally while focusing either on short range interactions or at a macroscopic scale, none of the models consider the effect of human mobility, a rather important ingredient in understanding the dynamics of socioeconomic systems [22,23,24]. Here we propose a coarse-grained model of language dynamics that seeks to uncover the minimal mechanisms that reproduce the observed linguistic trends. We set up our model in such a way that we can interrogate the effect of each of the constituent mechanisms. It is important to note that our goal is not to reproduce exactly the number of speakers of a given language, but rather, sacrificing specificity and erring on the side of generalizability, we focus on the qualitative trends.
Our model consists of primarily three ingredients. We first discretize the space on which linguistic interactions occur by representing towns as nodes and connections between them as edges, incorporating population flows at microscopic level. This geographic network extends the work of [20] by combining both short-range and long-range connections that are not present in the agent-based model and are essential to describe global population interactions. The edges are weighted by a gravity-like relation [24,25,26,27] which is the simplest parameter free model to calculate mobility flows between two communities, while accounting for their geographic separation. Second, the evolution of each language is characterized by reaction-diffusion equations for two competing languages whose dynamics is described by the Lotka-Volterra model, previously used to model linguistic coexistence [28,29]. As opposed to wavefront propagation in continuous space which requires a contiguous region, reaction diffusion on networks prevents the isolation of language islands. The contact network enables interactions along weighted edges such that each node communicates with all of its neighbors. Spatial linguistic patterns emerge on the geographical network through the Turing Mechanism, an exemplar of pattern formation [30,31,32] that relates to many observed natural phenomena in continuous [33] and discrete media [34,35]. The final ingredient in our model is the status perception of each language and its corresponding degree of spatial correlation. We test our model in two different regions of the world with linguistic coexistence, Galicia, in northwestern Spain, where Galician and Spanish are spoken, as well as Carinthia in southern Austria, where one finds both Slovenian and German speakers. In both cases, we find excellent agreement with qualitative trends, demonstrating that in addition to the specific model of linguistic competition one must also account for the geographic network on which the dynamics take place, as well as asymmetries in linguistic status perception.

Data
In our analysis we make use of datasets corresponding to two independent parts of Europe where populations speak at least two languages. The first comes from the Autonomous Region of Galicia in northwestern Spain, where Galician (a Romance language similar to Portuguese) and Castillian (Spanish) are co-official. The dataset includes information about the fraction of Galician and Spanish speakers in 20 districts of the region consisting of 550 cities [36]. In Fig. 1a, we illustrate the different districts with distinct colors for each region. The cities are represented as black points. The linguistic distributions according to the census data are shown in Fig. 2a  Our second dataset corresponds to Slovenian and German speakers in Southern Carinthia, Austria in the year 1910 across 9 districts and 112 cities. The data available in digitized form [37] consists of the fraction of the population speaking either language at the level of cities. The region with each district corresponding to a different color, with the cities represented as points is shown in Fig. 1c. In Fig. 2b we plot the spatial distribution of prevalence for the year 1910, finding a similar trend to that seen for Galicia; regions consisting of large number of Slovenian speakers (primarily near the Austrian-Slovenian border) consist of few German speakers, with the opposite also being true.

Reaction-diffusion equations
We model the language competition with the following set of Lotka-Voltera type differential equations, first proposed in [28,29]: Here u( x, t) and v( x, t) correspond to the frequency of the population speaking each language, whereas the cross-term uv represents the competition between them with a strength c, interpreted as the status of the language; c > 0 indicates that language u has a higher status or attractiveness than language v, with the opposite being true for c < 0. In the absence of competition, the model reduces to logistic growth where S u and S v represent the respective carrying capacities of the languages and α u , α v their natality and mortality rates. For the purposes of our analysis these constants are redundant and without any loss of generality set to 1 except for c, which is constrained to |c| < 1. In this setting, the fixed points of . In continuous media, the spatial component representing the diffusion of species is introduced via a second-order diffusion coefficient [17,29] proportional to ∇ 2 u. It's counterpart in discrete media, such as networks, is the Laplacian matrix L ij [34,38], defined as: where W ij is a symmetric weighted adjacency matrix and s i = j W ij is the weighted degree of node i [39]. To include diffusion on the underlying network one would then add a term of the form N j=1 L ij u j , and a corresponding one for v, in a network of N nodes (i = 1 . . . N ). If one were to consider ordinary diffusion then one need only include a diffusion coefficient d u , d v for each of the populations, which represents the diffusing away from populations of higher density to lower density regions. Yet, in the context of competition, one must also consider effects where a minority population under threat from a majority population diffuses away to a different region to avoid extinction [40,41]. Such cross-diffusion can be introduced by corresponding coefficients a uv , a vu ≥ 0 proportional to the product uv. With these refinements, Eq. (1) can be recast thus, Note, that in districts where c > 0, the higher status tongue corresponds to the population u, and v tends to move away at a rate a uv = γ whereas u remains in the district so a vu = 0. Similarly, in districts where c < 0, u diffuses away and we have a uv = 0, a vu = γ.

Choice of network and Turing instability
Next, we consider the choice of network that best represents the interactions between the populations in different centers. Since we are interested in capturing the effects of population movement, in principle, all cities are accessible to each other through a transportation network, such that all nodes are connected to each other as in a complete graph. Yet the extent of flows between two cities i, j depends on their respective populations p i , p j and the distance d ij . The simplest choice for the coupling between cities is the gravity model [24,25] with weights Here W ij represents the i th row and j th column of the weight matrix W; p i and p j are the populations of the corresponding cities and d ij is the geographical distance separating them. Thus greater flows occur either between high population centers or those proximate to each other. In principle the gravity model can be generalized to more complex dependencies on the population and distance [42] and one can consider related models of mobility such as those based on intervening opportunities [26,43], however the version considered here has been used to accurately described mobility patterns in different contexts [44,45], and our results are not overly sensitive to the precise form of Eq. (4) as long as the networks are heavytailed [35]. In Fig. 1b [33], recently proposed as a mechanism to explain spatial differentiation in linguistic competition [34,40]. A small perturbation to the uniform state triggers the growth of Turing patterns above a critical threshold, corresponding to the ratio of the diffusion constants of the respective linguistic species. The patterns in this context, correspond to distinct populations of nodes differentiated by the relative number of the population speaking a certain language. While in continuous media, perturbations are decomposed into a set of spatial Fourier modes representing plane waves with different wave-numbers, in networks the analog is the set of eigenvectors φ (α) of the Laplacian matrix (with associated eigenvalue Λ α ), where α = 1, . . . N corresponds to the mode [38]. The eigenvalues Λ α are sorted in decreasing order Λ 1 > Λ 2 · · · > Λ N and the first eigenvalue is always zero (Λ 1 = 0). Introducing small perturbations (δu i , δv i ), substituting into Eq. (3), and expanding over the set of the Laplacian eigenvectors, the linear growth rate λ α for each node is calculated from a polynomial equation of the form where b(Λ α ), c(Λ α ) are functions of the (cross)-diffusion coefficients, the competition terms uv, and the status c [34,35,40]. The set of solutions to Eq. (5) for all modes α, correspond to a dispersion relation λ (Λ α ). The Turing instability occurs when at least one of the modes become unstable, indicated by Re(λ α ) > 0 which happens when c(Λ α ) < 0, and the corresponding mode is denoted α c [34,35]. The full details of the calculation are shown in Supplementary Section S1 and Fig. S1. After the onset of the Turing instability, the system reaches a steady-state concentration of speakers for each node (city) which is normalized according toũ = u u + v , and similarly for v. Here the . . . denotes averaging over multiple realizations of the simulation corresponding to different initial conditions. The concentration of speakers at the district level is then a population weighted-sum over all constituent nodes. The full details of the normalization and data aggregation procedure is described in Supplementary Section S2 and Fig. S2. In what is to follow, for the sake of simplicity, we assume that both competing languages (cross)-diffuse at the same rate.

Galicia
We begin our analysis with the case of Galicia. We seek to uncover the role of each underlying mechanism in the observed empirical trends, and therefore systematically probe the effect of the model constituents, starting with the underlying network mediating the population-interactions. Recent results suggest that Turing patterns in networks are influenced and stabilized primarily by network topology provided the distribution of links is heavy-tailed [35].
While the mobility networks we consider are spatial, we first check the extent to which the linguistic patterns can be explained solely by the heavy-tailed nature of the network. To do so we randomly assign each node a weight s sampled from the empirical distribution P (s) for Galicia (Fig. 1b), in effect removing any information about the spatial location of the nodes, and treating the network as a purely topological graph. In addition we set the value of the status c = 0.5 everywhere in the region, the diffusion coefficients to d u , d v = 0.01 and γ = 2.1 for the cross-diffusion coefficients. These numbers were chosen to drive the system to the onset of the Turing instability. In Fig. S3a,d we show the results of the simulation averaged over 100 realizations of the process, compared to the empirical data, as a scatter plot of the districts according to the preponderance of Galician and Spanish, and in Fig. 3a,d we show the spatial linguistic distributions.
The scatter plots indicate relatively few nodes differentiate from their initial fixed-points for both Galician and Spanish. The agreement with the empirical data is rather poor with a Pearson correlationcoefficient of ρ G p = 0.25 for Galician and ρ S p = 0.26 for Spanish. One can also check the relative prevalence of linguistic speakers in each region by ranking districts by the concentration of speakers for each language, and then compute the rank correlation-coefficient. In Fig. S4a,d we show the scatter plot of the simulated and empirical data in terms of the rank of each district. The Spearman correlation coefficient for both Galican and Spanish is ρ G s = ρ S s = 0.07. The results indicate that network topology by itself is a poor indicator of the observed linguistic prevalence.
Next, we restore the spatial nature of the network maintaining both P (s) as well as the geographic position of the nodes, i.e links between nodes are established according to Eq. (4), and re-run the simulation with the same parameters. We show the results in Fig. 3b,e where we plot the spatial distribution and in Fig.S3b,e which shows the scatter plot of concentrations. We find improved correspondence for both Galician (ρ G p = 0.52) and Spanish with (ρ S p = 0.51), although there is an overestimation of Galician speakers, and a corresponding underestimation of Spanish speakers in about half the districts. This can be explained by the choice of a positive value for c in every district, which biases the result towards favoring Galician speakers. A choice of negative c for each district would reverse the trends. This is also reflected in Fig. S4b,e for the rank scatter plots, where we find ρ G s = ρ S s = 0.63. Nevertheless, the reasonable agreement between simulation and data for a majority of districts points to an important role played by the geographic networks in linguistic evolution. By itself, however, it is not enough to explain the full picture.
Next, we incorporate the geographical distribution of the status parameter c into our framework. Surveys and polls conducted in Galicia reveal 12 districts where Galician is perceived to have higher status (c > 0), and 8 districts for the case of Spanish (c < 0) [36]. For those districts where residents report a higher status for Galician we set c = 0.5, and for those that prefer Spanish we set c = −0.5. We then re-run the simulation for the same set of (cross)-diffusion coefficients as before, and report our results in Fig. 3c,f and Fig. S3c,f for the spatial distribution and scatter plots respectively. We now find significantly better agreement with the empirical data for both types of languages, with ρ G p = ρ S p = 0.84. A similar effect is seen in the relative abundance as reflected by the rank scatter plots shown in geographical mobility network coupled with the spatial correlation of the status parameters (along with their asymmetry) is a good predictor of the linguistic prevalence in Galicia.

Southern Carinthia
To check whether our results are unique to Galica, or generalizable to other regions, we now repeat the analysis for Southern Carinthia. We adjust the (cross)-diffusion constants to generate an instability range coinciding with the eigenvalue distribution of the empirical network Laplacian; now d u = d v = 0.1 and γ = 21. We once again, start by randomly assigning weights to nodes sampled from the empirical distribution P (s) seen in Fig. 1d, assign the same value of the status c = 0.5 in all districts and simulate the linguistic evolution. Much like in Galica, we find the same poor agreement with the empirical data in terms of both the fraction of speakers (ρ Ger p = 0.13, ρ Slo p = 0.13, Fig. 4a,d and S5 a,d) as well as their relative abundance (ρ Ger Fig. S6 a,d), indicating that here too, the topological nature of the network by itself is a poor predictor of linguistic prevalence.
Next, we consider the geographical network with weights assigned according to the nodes' positions (Eq. (4)). The results are shown in Fig. 4b,e and Fig. S5b The negative values for the correlation stem from only a few regions, and it is more accurate to say that there appears to be no correlation between the simulated and empirical data. The contrast with Galicia is striking, and is potentially due to the network in Southern Carinthia being an order of magnitude smaller in size. In this relatively small setting, the geographical mobility network has minimal-to-no-role in predicting the linguistic patterns.
Additionally, we do not have access to how residents perceive the status of each language given that there are no (known) surveys or polls. A reasonable choice in determining the status, however is to infer it from the proportion of speakers. That is, in those regions where German is the majority tongue we assume German has the higher status, and similarly regions with majority Slovenian speakers are assigned a higher status for Slovenian. Correspondingly in German majority districts we set c = 0.5 and c = −0.5 for Slovenian majority regions. We re-run the simulation for the same set of (cross)-diffusion coefficients as before, and report our results in Fig. 4c,f and Fig. S5c,f for the spatial distribution and scatter plots respectively. In this case, we find even better agreement with the data as compared to Galicia, with ρ Ger p = ρ Slo p = 0.9. A similar trend is seen for the relative abundance (Fig. S6 c,f) with ρ Ger s = ρ Slo s = 0.9. Thus, in this case, while the network plays a limited role, accounting for the asymmetry and spatial correlation of the status, the Turing mechanism produces good agreement with the empirical linguistic distributions in in Carinthia.

Discussion
In this manuscript we have presented a minimal formulation to explain the observed linguistic trends in two regions of Europe where languages co-exist. Our model, based on the Turing mechanism has as its primary ingredients, a reaction-diffusion model where language species spread and retreat in the same fashion as it occurs in predator-prey dynamics, the mobility network between locations based on the gravity model, coupled with the asymmetries and the geographical distribution in how speakers perceive a given language. Unlike in other descriptions of linguistic evolution, the model constituents are set up in a way, such that we can tease out the effects of each component. Another advantage of our framework as compared to existing formulations is the need for minimal empirical input, as well as its generalizability to multiple settings. Given that the language dynamics occurs on a discrete network we are able to simultaneously capture microscopic and macroscopic dynamics without the rise of pathologies such as "language islands" due to the lack of diffusive fronts in non-contiguous regions.
While patterns have been known to be stabilized by heterogenous network topologies in other settings, considering just the network topology by itself without considering its spatial nature, leads to poor agreement with our results and empirical trends. Once one accounts for the spatial location of nodes, the model gets about half the districts right in Galicia. Note that this occurs despite assigning both languages, Galician and Spanish, equal status among residents. In our version, we assigned higher status to Galican and despite this, the model was able to accurately reproduce some of the districts where Spanish is the majority language. This points to strong evidence for the spatial mobility network of contacts playing an important role in language interaction and diffusion. Similar results were seen for the relative abundance of languages, that is, ranking districts based on the concentration of speakers of each language.
Interestingly enough, the same was not seen for Southern Carinthia, where the spatial network appeared to have little-to-no predictive power in terms of the concentration of German and Slovenian speakers. Nevertheless, when coupled with a bimodal distribution for the status parameter (reflecting asymmetries in how languages are perceived), we got very good agreement in Galicia as well as Southern Carinthia, both in terms of the concentration of speakers and the relative abundance of the languages. Note that, in the latter case, we were able to produce good agreement with the empirical values, despite not knowing the actual values of the status parameters for each language.
Our results are notable, given our minimal set of assumptions as well as little recourse to empirical parameters. Of course, to go beyond this one would need tailored models with more granular data and the introduction of more region-specific parameters. Additionally, we do not consider more complex facets of linguistic prevalence such as bilingualism [9,21], however such features can be in principle introduced through an additional term in Eq. (3). Nevertheless, our objective here was to focus on uncovering the (potential) basic mechanisms of linguistic evolution and compare it against empirical trends, and not necessarily attempt to reproduce exactly the concentration of speakers in a region. We anticipate our formulation will be quite useful in understanding linguistic prevalence in those regions with scarce data on the relevant parameters.
Supplementary Information S1 Turing instability to first-order around the fixed points u 0 , v 0 via perturbations δu i ,δv i , Eq. (3) can be written in linearized form as: with the Jacobian and diffusion matrix are respectively defined as: The eigenvalue equation for the Laplacian matrix is: In terms of small perturbations, Eq. (S1) becomes: The perturbations can be expanded over the set of Laplacian eigenvectors as δu i . Substituting these into Eq. (S3) we obtain the following eigenvalue equation: The characteristic equation of this system is given by: The solutions to Eq. (S5) are then: . When diffusion starts, the only solution with positive real part is λ α1 . Thus, we define the dispersion relation in terms of Laplacian eigenvalues as Re(λ(Λ α )) with roots Λ α1 and Λ α2 , defining the instability range.
The Turing instability is triggered when the eigenvalues Λ(α) become unstable, which indicates that the corresponding growth factors in the dispersion relation Re(λ(Λ α )) become positive. In Fig. S1 we show the eigenvalues distribution for the geographic networks of Galicia (a) and Carinthia (b) along the curve Re(λ(Λ α )). The unstable modes are the eigenvalues that lie in the range [Λ G α1 , Λ G α2 ] and [Λ C α1 , Λ C α2 ] respectively.  Figure S1: Eigenvalue distributions of empirical networks of (a) Galicia and (b) Carinthia along the dispersion curve Re(λ(Λ α )). The instability ranges are marked by [Λ G α1 , Λ G α2 ] and [Λ C α1 , Λ C α2 ] for Galicia and Carinthia respectively. Differentiation of nodes are triggered by the instable growth factors Re(λ(Λ α )) > 0 which correspond to eigenvalues overlapping the instability ranges. The results are then aggregated to the level of districts by calculating the weighted average of node concentration by the population p i of the node that they belong to. In other words, Galician and Spanish speakers of district j is calculated by u j = i∈jũ ipi i∈j pi and v j = i∈jṽ ipi i∈j pi . The same procedure is used in Carinthia.  Figure S3: Comparison of simulation results with empirical concentrations in Galicia, illustrated in the same order as in Fig. 3a, b, c for Galician and d, e, f for Spanish speakers. The Pearson correlationcoefficient (ρ p ) is reported in each panel.     Figure S4: Comparison of simulation results with empirical rank-ordering of districts in terms of concentrations, in the same order as in Fig. 3a, b, c for Galician and d, e, f for Spanish speakers. The Spearman correlation-coefficient (ρ s ) is reported in each panel.  Fig. 4a, b, c for German and d, e, f for Slovenian speakers. The Pearson correlation-coefficient (ρ p ) is reported in each panel.     Figure S6: Comparison of simulation results with empirical rank-ordering of districts in terms of concentrations, in the same order as in Fig. 4a, b, c for German and d, e, f for Slovenian speakers. The Spearman correlation-coefficient (ρ s ) is reported in each panel.