Spatial evolution of human dialects

The geographical pattern of human dialects is a result of history. Here, we formulate a simple spatial model of language change which shows that the final result of this historical evolution may, to some extent, be predictable. The model shows that the boundaries of language dialect regions are controlled by a length minimizing effect analogous to surface tension, mediated by variations in population density which can induce curvature, and by the shape of coastline or similar borders. The predictability of dialect regions arises because these effects will drive many complex, randomized early states toward one of a smaller number of stable final configurations. The model is able to reproduce observations and predictions of dialectologists. These include dialect continua, isogloss bundling, fanning, the wave-like spread of dialect features from cities, and the impact of human movement on the number of dialects that an area can support. The model also provides an analytical form for S\'{e}guy's Curve giving the relationship between geographical and linguistic distance, and a generalisation of the curve to account for the presence of a population centre. A simple modification allows us to analytically characterize the variation of language use by age in an area undergoing linguistic change.


I. INTRODUCTION
Over time, human societies develop systems of belief, languages, technology and artistic forms which collectively may be called culture. The formation of culture requires individuals to have ideas, and then for others to copy them. Historically, most copying has required face-to-face interaction, and because most human beings tend to remain localized in geographical regions which are small in comparison to the world, then human culture can take quite different forms in different places. One aspect of culture where geographical distribution has been studied in great detail is dialect [1].
In order to visualize the spatial extent of dialects, dialectologists have traditionally drawn Isoglosses: lines enclosing the domain within which a particular linguistic feature (a word, a phoneme or an element of syntax) is used. However, it is not usually the case that language use changes abruptly at an isogloss -typically there is a transition zone where a mixture of alternative features is used [1]. In fact, there is debate about whether the most appropriate way to view the geographical organization of dialects is as a set of distinct areas, or as a continuum without sharp boundaries [1][2][3]. Whereas an isogloss represents the extent of an individual feature, a recognisable dialect is typically a combination of many distinctive features [1,4]. We can attempt to distinguish dialects by superposing many different isoglosses, but often they do not coincide [3] leading to ambiguous conclusions.
The first steps toward an objective, quantitative analysis of the shapes of dialect areas were made by Séguy [5,6], who examined large aggregates of features, making comparison between lexical distances and geographic separations. Central to the quantitative study of dialects, called dialectometry (see [7] for a recent review), is * james.burridge@port.ac.uk the measurement of linguistic distance which, for example, can be viewed as the smallest number of insertions, deletions or substitutions of language features needed to transform one segment of speech into another [8]. This "Levenshtein distance" was originally devised to measure the difference between sequences [9]. Using a metric of this kind, a set of dialect observations can be grouped into clusters according to their linguistic (as opposed to spatial) closeness [10][11][12][13]. The clusters then define geographical dialect areas.
The question we address is why dialect domains have particular spatial forms, and to give a quantitative answer requires a model. The question has been addressed in the past, famously (amongst dialectologists) by Trudgill [1,14], with his "gravity model". According to this, the strength of linguistic interaction between two population centres is proportional to the product of their populations, divided by the square of the distance between them. The influence of a settlement (e.g. a city) i on another, j, is then defined to be the product of interaction strength with the ratio P i /(P i + P j ) where P i and P j are the population sizes of settlements i and j. These additive influence scores may then be used to predict the progress of a linguistic change which originated in one city, by determining the settlements over which it exerts greatest net influence. It is then predicted that the change progresses from settlement to settlement in a cascade. Predictions may also be made regarding the combined influence of cities on neighbouring non-urban areas. The model has been partially successful in predicting observed sequences of linguistic change [1,[15][16][17], and offers some qualitative insight into the most likely positions of isoglosses [14]. In this paper we offer an alternative model, also based on population data, which makes use of ideas from Statistical Mechanics. Rather than starting with a postulate about the nature of interactions between population centres, we begin with assumptions about the interactions between speakers. From these as-sumptions about small scale behaviour we derive predictions about macroscopic behaviour. This approach has the advantage of making clear the link between individual human interactions, and population level behaviour. Moreover we are able to unambiguously define the dynamics of the model, and make precise predictions about the locations of isoglosses, the nature of transition regions between linguistic forms, and the most likely structure of dialect domains. There are links between our approach and agent based models of language change [18], which directly simulate the behaviour of individuals. The difference between this approach and ours lies in the fact that for us, assumptions about individual behaviour lead to equations for language evolution which are macroscopic in character. These equations have considerable analytical tractability and offer a simple and intuitive picture of the large scale spatial processes at play. In seeking to model the spatial distribution of language beginning with the individual, we are encouraged by the fact that dialects are created through a vast number of complex interactions between millions of people. These people are analogous to atoms in the physical context, and when very large numbers of particles interact in physical systems, simple macroscopic laws often emerge. Despite the fact dialects are the product of hundreds of years of linguistic and cultural evolution [4], so historical events must have played a role in creating their spatial distribution [19], the physical analogy suggests that it may be possible to formulate approximate statistical laws which play a powerful role in their spatial evolution.
A physical effect analogous to the formation of dialects is phase ordering [20]. This occurs, for example, in ferromagnetic materials, where each atom attempts to align itself with neighbours. If the material is two dimensional (a flat sheet) this leads to the formation of a patchwork of domains where all atoms are aligned with others in the same domain, but not with those in other domains. The boundaries between these regions of aligned atoms evolve so as to minimize boundary length [21,22]. The human agents who interact to form dialects behave in roughly the same way (as do some birds [23]). When people speak and listen to each other, they have a tendency to conform to the patterns of speech they hear others using, and therefore to 'align' their dialects. Since people typically remain geographically localized in their everyday lives, they tend to align with those nearby. This local copying gives rise to dialects in the same way that short range atomic interactions give rise to domains in ferromagnets. However, whereas the atoms in a ferromagnet are regularly spaced, human population density is variable. We will show that as a result, stable boundaries between domains become curved lines.
While our interest is in the spatial distribution of linguistic forms, there are other properties of language for which parallels with the physical or natural world can be usefully drawn, and corresponding mathematical methods applied. For example the rank-frequency distribution of word use, compiled from millions of books, takes the form of a double power law [24,25], which can be explained [24] using a novel form of the Yule process [26,27], first introduced to explain the distribution of the number of species in genera of flowering plants. Historical fluctuations in the relative frequency with which words are used have been shown to decay as a language ages and expands [25], analogous with the cooling effect produced by the expansion of a gas. Methods used to understand disorder in physical systems ("quenched" averages), have been applied to explain how a tendency to focus on topics controls fluctuations in the combined vocabulary of groups of texts [28]. A significant focus of current Statistical Physics research has been on the evolution and properties of networks [29], which have many diverse applications from the spread of ideas, fashions and disease [30], to the vulnerability of the internet [31]. Real networks are often formed by "preferential attachment" where new connections are more often made to already well connected nodes, leading to a "scale free" (power law) distribution of node degree. The popularity of words has been shown to evolve in the same way [32]; words used more in the past tend to be used more in the future. Beyond the study of word use and vocabulary, agent based models such as the naming game [33], used to investigate the emergence of language, and the utterance selection model [34], used to model changes in language use over time have been particularly influential. We follow the latter model by representing language use using a set of discrete linguistic variables. Spatial models motivated by concepts of statistical physics have also been used to study the spread of crime [35] and to devise optimal vaccination strategies to prevent disease [36]. The importance of the emergence of order in social contexts, and connections to Statistical Physics, may be found in a wide ranging review [37] by Castellano et al.

II. SUMMARY FOR LINGUISTS
A. Contents of the paper The aim of this paper is to adapt the theory of phase ordering to the study of dialects, and then to use this theory to explain aspects of their spatial structure. For those without a particular mathematical or quantitative inclination, the model can be simply explained: We assume that people come into linguistic contact predominantly with those who live within a typical travel radius of their home (around 10 to 20 km). If they live near a town or city, we assume that they experience more frequent interactions with people from the city than with those living outside it, simply because there are many more city dwellers with whom to interact. We represent dialects using a set of linguistic variables [1], and we suppose that speakers have a tendency to adapt their speech over time in order to conform to local conventions of language use. Our model is deliberately minimal: these are our only assumptions. We discover that, starting from any historical language state, these assumptions lead to the formation of spatial domains where particular linguistic variants are in common use, as in Figure 2. We find that the isoglosses which bound these domains are driven away from population centres, that they tend to reduce in curvature over time, and that they are most stable when emerging perpendicular to borders of a linguistic domain. These theoretical principles of isogloss evolution are explained pictorially in Figures 3, 4 and 5, and provide a theoretical explanation for a range of observed phenomena, such as the dialects of England (Figure 7), the Rhenish Fan (Figure 10), the wave-like spread of language features from cities ( Figures 12 and 16), the fact that narrow regions often have "striped" dialects ( Figure  11), and that coastal indentations including rivers and estuaries often generate isogloss bundles. Our assumptions also lead to a mathematical expression for the relationship between linguistic and geographical distance -the "Séguy Curve" -and a hypothesis regarding question of when dialects should be viewed as a spatial continuum, as opposed to distinct areas ( Figure 19).
B. How might a linguist make use of this work?
Without using mathematics, but having understood our principles of isogloss evolution and considered the examples set out in this paper, further cases may be sought where the principles explain observations. If the principles cannot explain a particular situation or are violated, one might seek to understand what was missing from the underlying assumptions, or if they were wrong. Since the assumptions are so minimal, they cannot be the whole story, and a discussion of possible missing pieces is given in the conclusion. For the mathematically inclined linguist, appendix A sets out an elementary scheme for solving the fundamental evolution equation on a computer. This scheme also offers a simple and intuitive understanding of the model, and can be implemented using only a spreadsheet (see supplementary material), although a computer program would be much faster. Using this, isogloss evolution can be explored in linguistic domains with any shape and population distribution. The simplicity of the scheme invites adaptation to include more linguistic realism (e.g. bias toward a linguistic variant). Beyond the exploration of individual isoglosses, a line of inquiry which may be of interest to dialectometrists is to test our predicted forms of Séguy's Curve against observations.

III. THE MODEL
Our aim is to define a model of speech copying which incorporates as few assumptions as possible, whilst allowing the effect of local linguistic interaction and movement to be investigated. The model has its roots in the ideas of the linguist Leonard Bloomfield [3] who believed that the speech pattern of an individual constantly evolved through his or her life via pairwise interaction. This microscopic view of language change lead to the prediction that the diffusion of linguistic features should follow routes with the greatest density of communication. Bloomfield defined this as the density of conversational links between speakers accumulated over a given period of time. In our model the analogy of this link density is an interaction kernel weighted by spatial variations in population distribution. We implicitly assume that interaction is inherently local so that linguistic changes spread via normal contact [38], rather than via major displacements, conquests, or dispersion of settled communities. We are therefore modelling language in stable settlements, with initial conditions set by the most recent major population upheaval.
We consider a population of speakers, each of whom has a small home neighbourhood, and we introduce a population density, ρ(x, y), giving the spatial variation of the number homes per unit area. In order to incorporate local human movement within the model, we begin by defining a Gaussian interaction kernel for each speaker Note that the symbol := indicates the definition of a new quantity. Consider a speaker, Anna, whose home neighbourhood is centred on (x 0 , y 0 ). In the absence of variation in population density, φ is the normalized distribution of the relative positions, (∆x, ∆y), of the home neighbourhoods of speakers with whom Anna regularly interacts. The constant σ, the interaction range, is a measure of the typical geographical distance between the neighbourhoods of interacting speakers. Now suppose that density is not uniform due to the presence of a city, or a sparsely populated mountainous area. In this case while Anna is going about her daily life she is more likely to hold conversations with people whose homes lie in a nearby densely populated region because these people constitute a greater proportion of the local population.
To incorporate this density effect we define a normalized weighted interaction kernel for a home at (x 0 , y 0 ) Given any region A, the fraction of Anna's interactions that are with people who live in A is A k(x 0 , y 0 , x, y)dxdy. We distinguish between dialects by constructing a set of linguistic variables whose values vary between dialects. A single variable might, for example, be the pronunciation of the vowel u in the words 'but' and 'up' [4]. In England, northerners use a long form: 'boott' and 'oopp', with phonetic symbol [U], and southerners use a short version, [2] . Considering a single variable which we suppose has V > 1 variants, we define f i (x, y, t) to be the relative frequency with which the ith variant of our variable is used by speakers in the neighbourhood of (x, y), at time t. For mathematical simplicity we assume that nearby speakers use language in a similar way, so that f i (x, y, t) varies smoothly with position.
People speak on average 16,000 words per day [39] and can take months or years (depending on their age and background) to adapt their speech to local forms [40,41]. Changing speech habits therefore involves a very large number of word exchanges, at least in the tens of thousands (comparable in magnitude to typical vocabulary size [42]). Although the rate at which individuals adapt their speech is not constant throughout life (it is particularly rapid in the young), adaptation has been observed even in late middle age [43]. To capture the cumulative effect of linguistic interaction we make use of a forgetting curve, which measures the relative importance of recent interactions to older ones. From a mathematical point of view, the simplest form for this curve is an exponential, and in fact there is some evidence from experiments involving word recall [44] which suggests that this is an appropriate choice. However, we emphasize that the curve, for us, is simply a way to capture the fact that current speech patterns depend on past interactions and that older interactions tend to be less important. With this in mind we make the following definition of the memory of a speaker from the neighbourhood of (x, y), for the ith variant of a variable An intuitive understanding of this equation may be gained by imagining that each speaker possesses an internal tape recorder which records language use as they travel around the vicinity of their home. As time passes, older recordings fade in importance to the speaker, and the variable m i measures the historical frequency with which variable i has been heard, accounting for the declining importance of older recordings. The rate of this decline is determined by the parameter τ , which we call memory length, and note that changing its value simply rescales the unit of time. We note also that this form of memory may be seen as a deterministic spatial version of the discrete stochastic memory used in the Utterance Selection Model [34,45]. On the grounds that speakers collect very large samples of local linguistic information, our definition does not contain terms representing random sampling error. In going from equation (1) to (2) we have used the saddle point method [46] to approximate the spatial integral in equation (1) and assumed that |∇ 2 ρ|/ρ is small compared to σ 2 (that is, population changes approximately linearly over the length scale of human interaction).
To allow speakers to base their current speech on what they have heard in the past we let f i (x, y, t) be a function, p i , of the set of memories (m 1 , m 2 , . . . , m V ) =: m Differentiating equation (2) with respect to t, and rescaling the units of time so that one time unit is equal to one memory length τ , we obtain which governs the spatial evolution of the ith alternative for a single linguistic variable. We note that memory length no longer appears as a parameter. An enhanced intuitive understanding of this evolution equation may be gained from its discrete counterpart, used to find computational solutions, and derived in appendix A. The simplest possible choice for p i is to let speakers use each variant with the same frequency that they remember it being used: p i [m(x, y, t)] = m i (x, y, t). This produces "neutral evolution" [34,45,[47][48][49] where there is no bias in the evolution of each variant. Equation (3) then describes pure diffusion, and variants spread out uniformly over the system. If all linguistic variables evolved in this way we would eventually have one spatially homogeneous mixture of grammar, pronunciation and vocabulary. If our memory model involved a stochastic component [45] then eventually we would expect all but one variant of each variable to disappear. Neither of these outcomes reflects the reality of locally distinctive forms of language.
We motivate our choice for p i based on two observations. The first is that dialects exist. In order for this to be the case, if the ith variant of a linguistic variable has been established amongst a local population for a considerable time so that m i ≈ 1, then a small amount of immigration into the region by speakers using a different variable should not normally be sufficient to change it. Mathematically, this is equivalent to the statement that the non diffusive term, p i [m(x, y, t)] − m i (x, y, t), in our evolution equation (3) must possess a locally stable fixed point at m i = 1. The second observation is derived from experiments on social learning, which show that the behaviour of individuals is considerably influenced by the majority opinion of those with whom they interact [50][51][52]. In fact, such social conformity is widely observed in the animal kingdom and is responsible for the formation of dialects in some species of birds [53]. Recent experimental research into human social learning [52], in which individuals were allowed to make a choice, before being exposed to the opinions of a group, has revealed that the likelihood of an individual switching their decision depends non-linearly on the proportion of the group who disagree. It is an increasing function which climbs rapidly when the proportion exceeds 50%, also possessing an inflection for large groups. In the context of language, such non-linear conforming behaviour would mean that variants which were used more frequently than others should be used with disproportionately large frequency in the future. A simple way to capture this behaviour is to define where β ≥ 1 measures the extent of conformity (nonneutrality). If β = 1 then we have the neutral model, and for β > 1 the non-diffusive term in (3) has a stable fixed point at m i = 1. According to (4), individuals disproportionately favour the most common variants they have heard: they have a tendency to conform to the local majority language use with β measuring the strength of this effect. In the limit β → ∞ all speakers use the only the most common (modal) dialect they have heard. An example of this function is plotted in Figure 1 for the V = 2 model. Conforming behaviour allows local dialects to form, as we shall see below.
The model we have defined is a "coarse-grained" description of real linguistic interactions which in reality are much more complex. Much of this complexity arises because there are often many distinct class, ethnic or age-defined social networks in any given geographical region. Within each of these subgroups the need to conform leads to similar speech patterns among members, and these patterns often, but not always, spread to other groups. Research by linguists has demonstrated that social factors strongly influence the uptake of particular speech patterns [54] and that language use is correlated with social class and identity. In American English, for example, language change is often initiated by the working and lower middle classes [55,56], before spreading to other groups. Some forms of language change are driven by resistance to conformity; for example "prestige dialects" (Received Pronunciation in the UK) are used to signify membership of a social elite, set apart from the common people. The desire to set oneself apart from others can also create reversals in language use amongst subsets of a population. For example, local residents of Martha's Vineyard [57,58] reverted to an archaic form of pronunciation in order reaffirm local tradition in the face of invading tourists. A similar effect was observed on the island of Ocracoke in North Carolina [59], but in this case the reversal was temporary. As well as social factors, language use may also be determined by age, gender or ethnicity [60]. It is clear that reality is far more complex than our simple model, which does not make any of these distinctions between speakers. However, the fact that dialects exists is itself evidence that in general people do adapt to local speech patterns. To model every speaker as having the same need to conform is therefore a reasonable first approximation to reality. It also has the value of simplicity, allowing us later to determine the importance of various additional levels of complexity by comparing how effectively our model fits empirical data when compared more complex models.

A. Application to Great Britain
We apply our model to the island of Great Britain (GB), whose early inhabitants were known as Britons, and spoke Celtic languages [62]. The earliest form of English was brought to the island by invading Germanicspeaking settlers. This became Anglo Saxon (or Old English), as written by Alfred, King of Wessex (849-899 A.D.) but would not be recognisable to modern speakers. It slowly changed, with external influences (notably Norman), into the English we know today [19].
We seek to discover the extent to which the spatial distribution of dialect structures which have emerged in GB can be predicted by equation (3). To model the evolution of individual linguistic variables we take mainland GB as our spatial domain, and numerically solve equation (3) on a grid of discrete points (Figure 2), using an explicit Euler scheme [61] (appendix A). The initial condition for the solution is a randomly generated spatial frequency distribution where each grid point is assigned a randomly selected variant. By repeatedly generating initial conditions and solving the system, we can determine the most probable equilibrium spatial distributions of language use. The population density ρ(x, y) is estimated using 2011 census data [63], which gives the number of inhabitants at each of the ≈ 1.8 × 10 6 UK postcodes. A smooth density is then obtained from this by allowing the inhabitants to diffuse a short distance from the geographical centre of their postcode. Despite significant overall population growth, the locations of major population centres in GB can trace their origins back through hundreds of years. Since dialect evolution equation (3) depends only on relative population densities, the current density distribution therefore serves as reasonable proxy for historical versions. We estimate that σ lies in the range 5km < σ < 15km based on that fact that the average distance travelled to work in GB in 2011 was 15km [63], whereas the average distance travelled to secondary school was 5.5km [64]. In section VII we find that the typical width of a transition region between linguistic variables is ≈ 1.8σ(β − 1) − 1 2 . For example, the transition between northern and southern GB dialects is ≈ 60km wide [1], which if σ = 10km, gives the approximation β ≈ 1.1.

Evolution of isoglosses
When it comes to interpreting our results, the fact that usage frequencies are continuously varying through space presents a similar problem to that faced by dialectologists when trying to draw isoglosses. We resolve this by defining domain boundaries to be lines across which the modal (most common) variant changes. A domain is therefore The surface tension effect at domain boundaries. Blue dots represent speakers and black circles give an approximate representation of interaction ranges. In the red shaded parts of these interaction ranges variant A is more common, and in the yellow shaded parts variant B is more common.
a region throughout which a single variant is the most commonly used. We may think of domain boundaries as synthetic isoglosses generated by equation (3). In Figure 2 we show a series of snapshots of the evolution of domains when there are V = 3 variants. Isogloss evolution is driven by a two dimensional form of surface tension [65]: in the absence of density variation, curved boundaries straighten out. Figure 3 illustrates why this happens faster when curvature is greater. Here, speaker L hears more of variant A so domain B will retract in this locality. Speaker R hears more of variant B and so domain A will retract in this region. The net effect will be to straighten the boundary, reducing its length. If a boundary forms a closed curve then this length reduction effect can cause it to evolve toward a circular shape, and reduce in area, eventually disappearing altogether. However, this shrinking droplet effect can be arrested or reversed if the droplet surrounds a sufficiently dense population centre (a city). In fact, population centres typically repel isoglosses in our model, and so have a tendency to create their own domains. An explanation of this effect is given in Figure 4. Here we have a region of high population density in which linguistic variant B is dominant, surrounded by a low population density region where variant A is in common use. We consider the linguistic neighbourhood of a speaker located on the isogloss separating the two domains. From Figure 4 we see that although the majority of the speaker's interaction range lies in region A, she has many more interactions with those in region B, and is therefore likely to adapt her speech toward variant B, causing the isogloss to shift outward into the low density area.
The most common form of stable isogloss generated by our model is a line, typically with some population density induced curvature, connecting two points on the boundary of the system. In order to be stable, such lines must emerge perpendicular from system boundaries, and as a result they are attracted to indentations in coastline, as illustrated in Figure 5. In this figure we consider two speakers located at the points where two possible isoglosses meet the coast (or other system boundary -a country border, or a mountain range for example). B because the isogloss is not perpendicular to the coast; it will therefore migrate upward toward the apex of the coastal indentation until it reaches the stable form shown by the solid line. This effect can be seen in Figure 2, where the longest east-west isogloss has migrated so that it emerges from the largest indentation on the east coast of GB. In reality this indentation, called 'The Wash', is the site of an isogloss bundle (the coincidence of several isoglosses) separating 'northern' [U] from 'southern' [2] [66]. A similar bundle occurs at the largest indentation in the Atlantic coast of France: the Gironde estuary [1], separating the langue d'oc from the langue d'oil. The fact that bundling at such locations is predicted by our model provides the first sign of the predictive power of the surface tension effect.
Having considered the evolution of a single linguistic variable, we now turn to modelling dialects. A dialect is typically defined by multiple linguistic characteristics, and we can capture this by combining many solutions to equation (3). In Figure 6 we have superposed the synthetic isoglosses for twenty binary (V = 2) linguistic variables. We see that there is a significant degree of bundling where many isoglosses follow similar routes across the system. Given that the initial conditions for each variable are distinct random frequency distributions ( Figure 2), then these bundles represent highly probable isogloss positions: many different early spatial distributions lead to these at later stages of evolution. The key point here is that the final spatial structure of dialect R L FIG. 5. Behaviour of isoglosses at an indentation in coastline or other boundary (political or naturally occurring). Dashed isogloss is unstable and will evolve toward the solid isogloss which emerges perpendicular from the coast. We assume that both isoglosses are effectively anchored to a feature some distance away, opposite the boundary shown. Speakers are shown as blue dots and colours have the same meanings as in Figure 3.
domains is rather insensitive to the early history of the language: the effects of surface tension and population density draw many different isoglosses toward the same stable configurations. In this sense the surface tension effect is an 'invisible hand' which, in the long term, can overpower historical population upheavals. However, we emphasise that our model predicts only the spatial structure of dialects and not their particular sound ; this is very much determined by quirks of history and the initial state of the system. Figure 6 also illustrates the effect of human mobility on dialect structure. For a smaller interaction range (5km), the structure of synthetic isogloss bundles is more complex, producing a larger number of distinct regions. This effect is well documented in studies of the historical evolution of dialects which were, in the past, more numerous and covered smaller geographical areas [4]. Within our model, this is explained by the fact that fluctuations in population density only become relevant to isogloss evolution when they take place over a length scale which is comparable to the interaction range: Two human settlements could only develop distinct dialects if they were separated by a distance significantly greater than σ, otherwise they would be in regular linguistic contact.

Cluster analysis
Having analysed our model using isoglosses, we now make comparison to recent work in dialectometry, where dialect domains have been determined using cluster anal-  Trudgill (1999) [4]. Right map: future dialect boundaries predicted using k-Medoids cluster analysis of 20 synthetic binary linguistic variables when σ = 10km and β = 1.1 at t = 150. Levenshtein distance (or 'edit distance') [9] used as distance metric. Colours, determined by Hungarian method, show mapping between dialect areas. Black dotted line shows North-South isogloss.
ysis and by multi-dimensional scaling [67]. A typical clustering approach [10,12] is to construct a data set giving the frequencies of a wide range of variant pronunciations at different locations, and then to cluster these locations according to the similarity of their aggregated sets of characteristics. Resampling techniques such as bootstrap [68] may be used to generate "fictitious" datasets and improve stability. We mimic this approach by constructing a synthetic data set from 20 solutions of equation (3) with V = 2, and each with different random initial conditions, corresponding to different linguistic variables. We then randomly select a large number (6,000) of sample locations within GB and determine the modal variants for each of the 20 variables at each location. This sample size was chosen to be sufficiently large so that the effect of resampling was only to make short length scale (≪ 1km) changes to cluster boundaries. This aggregated data is then divided into k clusters using the k-medoids algorithm [69] (available in the R language). The metric used for linguistic distance between sample points is the Manhattan distance between the binary vectors where the two variants are labelled 1 or -1. Because we are comparing vectors which can be transformed into one another purely by substitutions (1 for -1 or vice versa), rather than insertions or deletions, then this is equivalent to the Levenshtein distance used in dialectometry [2,9]. We have found that almost identical results are obtained by applying Ward's hierarchical clustering algorithm [70] to the sample locations and subsequently cutting the tree into k clusters.
In order to compare our cluster analysis to the work of dialectologists we consider a prediction for the future dialect areas of England (excluding Wales and Scotland) made by Trudgill [4], shown in the left hand map of Figures 7 and 8. This prediction divides the country into 13 regions, and is the result of a systematic analysis of regional variation in speech and ongoing changes. Such sharp divisions are a significant simplification of reality however, and hide many subtle smaller scale variations. The decision to define 13 regions therefore reflects a judgement on the range of language use which can be categorized as a single dialect. To allow comparison with this prediction we have performed a set of cluster analyses of near-equilibrium (large t) solutions for the whole of GB, for a range of values of the number, k, of clusters (see Figure 9), with the aim of producing 13 within the subset of GB defined by England. The closest result was 14 clus-  [71] to find the mapping between our synthetic dialects, and Trudgill's predicted dialects, which maximizes the total area of overlap between the two. The results are show in Figure 7. To provide a measure of the effectiveness of our model in matching Trudgill's predictions we also define a null model, which divides the country into regions at random, independent of population distribution and without reference to any model of speaker interaction. There are a number of models which generate random tessellations of space [72], many of which are motivated by physical processes such as fracture or crack propagation. We wish to exclude such physical assumption and so opt for the Voronoi tessellation [72], based on the Poisson point process: the simplest of all random spatial processes. Our null model is then a Voronoi tessellation of England ( Figure 8) using 13 points selected uniformly at random from within its borders, with dialects labelled to most closely match Trudgill's map, using the Hungarian method.
Having generated a our synthetic dialect maps, we now quantify the extent to which they match the predictions of Trudgill. The null model, because its lack of modelling assumptions, will reveal the extent to which our model is "better than random" at matching these predictions. We offer four alternative metrics of similarity in Table I. The simplest metric is overlap (OL): the percentage of land area which is identified as belonging to the same dialect as Trudgill's prediction. The Weighted Overlap (WOL) weights overlapping regions in proportion to their population density: it gives the probability that a randomly selected inhabitant will be assigned to the same dialect zone by both maps. From table I we see that this probability is high (82%) for our model, but lower for the random Voronoi model. We suggest that this is a result of the fact that population centres tend to repel isoglosses and therefore often lie at the centres of dialect domains. We will examine this repulsion effect in more detail below. The final two metrics are com- monly used to compare clusterings. Consider a set, S of elements (spatial locations for us) that has been partitioned into clusters (dialect areas) in two different ways. Let us call these two partitions X and Y . The Rand Index (RI) [73] is defined as the probability that given two randomly selected elements of S, the partitions X and Y will agree in their answer to the question: are both elements in the same cluster? A disadvantage with using this index to compare dialect maps is that the larger the number of regions in the maps, the more likely it is that two randomly selected spatial points will not lie in the same cluster in either map. The index therefore approaches 1 as the number of dialect areas grows. This problem may be countered by taking account of its expected value if X and Y were picked at random, subject to having the same number of clusters and cluster sizes as the originals [74]. The "Adjusted Rand Index" (ARI) is then defined The ARI ∈ [−1, 1] measures the extent to which a clustering is a better match than random to some reference clustering and is used by dialectometrists [75] in preference to the original Rand Index (RI). For us the reference clustering is Trudgill's predicted dialect map, and the Rand and Adjusted Rand Indices in Table I measure similarity to this reference. The primary conclusion that may be drawn from the indices in Table I is that by all measures our model provides a better match than the null model (indices all differ by at least three standard deviations, and typically many more). Of particular interest is the weighted overlap probability (WOL= 82%). Isoglosses are typically repelled by population centres, so tend to pass through regions of relatively low density. Because of this the WOL may be viewed as a measure of the effectiveness of the model at determining the centres of dialect regions and is less sensitive to small errors in isogloss construction, explaining its high value. It is important to realise also that Trudgill's predictions may themselves be imperfect.
We now make some qualitative comments. The dotted line in Figure 7 shows the location of our model's most dense North-South isogloss bundle. This is coincident with what is described by Trudgill as "one of the most important isoglosses in England" [14] dividing those who have [U] in butter from those who do not. In our model, the fact that this border lies where it does is a result of the surface tension effect which attracts many isoglosses towards the two coastal indentations at either end (see supplementary video). The fact that many randomized initial boundary shapes evolve toward this configuration, and that the configuration is seen as important by dialectologists supports the hypothesis that surface tension is an important driver of spatial language evolution. We also note that the western extremities of GB (Cornwall and North-West Scotland) support multiple synthetic dialects in our model, and we suggest that this is due to a heavily indented coastline and the fact that high aspectratio tongues of land are likely to be crossed by isoglosses; a fact predictable by analogy with continuum percolation [21]. The south west peninsula has historically supported three dialects.
In future work, the model might be tested by comparing its predictions to well researched dialect areas. On example is the Netherlands. Here, dialectologists have performed a cluster analysis [67] based on Levenshtein distances between field observations of 360 dialect varieties (correspoding to 357 geographical locations), revealing 13 significant geographical groupings. The extent to which a model is consistent with these groupings, accounting for variability caused by finite sample size, could be tested by generating equivalent clusterings for multiple fictitious dialect samples of the same size.

B. Bundles, Fans, Stripes and Circular Waves
We now illustrate a number of well known features of dialect distributions which may be qualitatively reproduced by our model. We consider first the isogloss bundle reported by Bloomfield (1933) [3] separating "High German" from "Low German". The bundle emerged from the tip of an indentation of the Dutch-German speech area (bordered to the East by Slavic languages), and ran roughly East-West before separating approximately 40km East of the river Rhine, and fanning out around cities such as Dusseldorf, Cologne, Koblenz and Trier. This arrangement of isoglosses is known as the "Rhenish Fan". In Figure 10 we have constructed an artificial system with boundaries approximating the geographical structure of relevant parts of the Dutch-German language area illustrated in Bloomfield [3] containing an artificial cluster of population centres representing the German cities located near to the Rhine. The system was initialized using the same randomization procedure used for GB, and Figure 10 shows a superposition of ten solutions, each with different initial conditions. In the early stages of evolution, very little pattern is discernible, but as time progresses the main indentation collects isoglosses, while the cities repel them, producing a fan-like structure. We therefore suggest that the isogloss separation which created the Rhenish fan may have been the result of repulsion by the cities of the Rhine. We next consider an example of what some physicists refer to as "stripe states" [21]: in finite systems which experience phase ordering, and have aspect ratios greater than one, boundaries between two orderings often form across the system by the shortest route (in a rectangle, joining two long sides). A collection of such boundaries form a striped pattern of phase orderings. Figure 11 illustrates this effect, produced by equation (3). Our model therefore predicts such striped dialect patterns in long thin countries, and a particularly striking example of the effect may be seen in the dialects of the Saami language [76]. The Saami people are indigenous to the Sámpi region (Lapland) which includes parts of Norway, Sweden and Finland. Their Arctic homeland forms a curved strip with a length which is approximately five times its average width. The region is divided into ten language areas, and the boundaries of all but two of these take a near-direct route between the two long boundaries of the system, forming a distinctive striped pattern. Another example are the dialects of Japan whose boundaries in many cases cross the country perpendicular to its spine [77].
The relationship between geographical separation and linguistic distance (often measured using Levenshtein distances [2]) is typically sublinear [2,38]. The definition of linguistic distance and its relation to geographic distance was made by Séguy [5,6], and the relationship therefore goes by the name: "Séguy's Curve" [38]. It has been substantially refined and tested since [2,38,78], and also generalised to involve travel time [79]. Séguy's curve is not universal, however. For example, an analysis of Tuscan dialect data [80] reveals an unusually low correlation between phonetic and geographical distances. A more detailed analysis reveals that there are geographically remote areas which are linguistically similar, and that within an approximately circular region (radius ≈ 40km) around the main city, Florence, phonetic variation correlates more strongly with geographical proximity. It is hypothesized [80] that this pattern marks the radial spread of a linguistic innovation (called "Tuscan -gorgia"). This Tuscan data motivates our final example of the qualitative behaviour of our model. To illustrate how a linguistic variable can spread outwards from a population centre, purely through the effects of population distribution and not necessarily driven by prestige or other forms of bias, we have simulated our model using an artificial city with Gaussian population distribution ( Figure 12). The system is initialized with a circular isogloss, centred on the city, representing a local linguistic innovation. Because population density is a decreasing function of the distance from the city centre, speakers on the isogloss hear more of the innovation than their current speech form, allowing it to expand (as explained in Figure 4). We will see in section VII that this expansion will not not necessarily continue indefinitely. Expansion processes such as this have also been observed in Norway [14]. In that case the progress of new linguistic forms was shown to depend on age, with changes more advanced for younger speakers, who are more susceptible to new form of speech. We illustrate how this effect can be analysed in appendix B.

V. SÉGUY'S CURVE
We now determine the relationship between geographical and linguistic distance within our model, providing an analytical prediction for the form of Séguy's curve [5,6,38]. For simplicity we consider the two variant model V = 2 and suppose that our language contains a number, n, of linguistic variables. At each location in space the local dialect is an n dimensional vector of the local modal variants which we label 1, and −1. Letting φ(r, t), where r = (x, y), be the vector field giving the distribution of these variants then the number of differences (the Levenshtein distance [9]) between two dialects φ(r 1 , t) =: φ(1) and φ(r 2 , t) =: φ(2) is (n−φ(1)·φ(2))/2. The linguistic distance, L(1, 2), between two dialects may be defined [81] as the fraction of variables that differ between them Since we have assumed that each variant evolves independently of every other, then the expected linguistic distance is where φ i is the ith component of φ.
To compute E[φ i (1)φ i (2)] we make use of the close similarity between equation (3) and the time dependent Ginzburg Landau equation [20], to derive (see appendix C) an analogue of the Allen-Cahn equation [82] giving the velocity of an isogloss at a point in terms the unit vectorĝ normal to it at that point The quantity κ is the curvature of the isogloss at the point: in the absence of variations of population density, the isogloss moves so as to reduce curvature. The second term in the square brackets produces a net migration of isoglosses towards regions of lower population density. To compute correlation functions between the field φ i at different locations in space we apply the Ohta-Jasnow-Kawasaki (OJK) method [83], introducing smoothly varying auxiliary field m(x, y, t) which gives the value of the ith variant as φ i = sgn(m). Note that the auxiliary field m(x, y, t) is distinct from the memory m i (x, y, t) for the ith variant. The OJK equation, describing the time evolution of this field, adapted to include density effects, is (appendix C) We introduce the fundamental solution, G(r, t; r 0 ), (the Green's Function) of (10), giving the function m(r, t) subject to the initial condition m(r, 0) = δ(r − r 0 ). The solution for arbitrary boundary conditions is then We assume that the initial condition of our system consists of spatially uncorrelated language use, so that E[φ i (r 1 , 0)φ i (r 2 , 0)] = δ r1r2 . A convenient, equivalent condition on the auxiliary field is to let it be Gaussian (Normally) distributed m(r, 0) ∼ N (0, 1) with correlator E[m(r 1 , 0)m(r 2 , 0)] = δ(r 1 − r 2 ).
We can compute this correlator at later times using the fundamental solution G The linearity of our adapted OJK equation (10) ensures that m(r, t) remains Gaussian for all time [20] (to see this note that derivatives are limits of sums, and sums of Gaussian random variables are themselves Gaussian). However, the values of the field at different spatial locations develop correlations so that the joint distribution of any pair is bivariate normal. Following Bray [20] we define the normalized correlator  (19) in the case σ = 4, β = 2. Open and closed dots show simulated linguistic distances using same parameter values in a 1000 × 1000 system at times t = 80, 160. Note that linguistic distance depends only on the combination rt −1/2 so curves evaluated at different time collapse onto one another.

A. Uniform population density
If population density is constant ρ = C then our adapted OJK equation (10) reduces to OJK's original form which has fundamental solution G(r, t; r 0 ) = exp − |r−r0| 2 giving a normalized correlator Our prediction for Séguy's curve at time t is therefore This curve is plotted in Figure 13 along with simulation results. We give the following interpretation of the curve. Starting from a randomized spatial distribution of language use, the need for conformity generates localized regions where particular linguistic variables are in common use, and these regions are bounded by isoglosses. These regions expand, driven by surface tension in isoglosses, so that from any given geographical point one would need to travel further in order to find a change in language use. The linguistic distance between two points therefore tends to decrease with time, and the curve (19) gives the rate of decrease as exponential. There are features of reality which we might expect to alter this behaviour. First, we have assumed that no major population mixing or migration takes place -such events would have the effect of resetting the initial conditions of the model. Our prediction is only valid during times of stability. Second we have assumed that the population is uniformly distributed in the system when in reality populations are clumped and, as we have seen, population centres can support their own dialects if they are large enough. We take some steps toward addressing this issue below. In appendix D we briefly discuss a simple one dimensional simulation model from the dialectology literature [38], which includes the same large r behaviour as in (19) for a particular choice (quadratic) of macroscopic "influence" curve.

B. Peaked population density
We now consider how Séguy's curve is modified by the presence of a peak in population density. In order to allow analytical tractability we consider a simple exponentially decaying peak where R > 1. To understand the behaviour of the modified OJK equation (10) it is useful to decompose it into an advection diffusion equation plus a source term (21) Defining r := x 2 + y 2 , the average velocity field for the diffusing particle is The source term is We now view equation (21) as describing the mass distribution for a collection of Brownian particles which are being driven radially away from the origin. The source term is interpreted as a field which causes particles to produce offspring at rate (rR) −1 as they move through it. The fundamental solution, G(r, t; r 0 ), to (21) is then the mass distribution for a very large (approaching infinite) collection of particles with total mass initially equal to one, all of which started at r 0 .
We wish to compute the dependence of linguistic distance on geographical distance from the peak of the population density (thought of as the centre of a city). We therefore require the expectation E[m(0, t)m(r, t)] = R 2 G(0, t; r 0 )G(r, t; r 0 )dr 0 . (24) Computation of a general closed form expression for G(r, t; r 0 ) is not our aim; preliminary computations in this direction suggest that if such a form existed its complexity would restrict its use to numerical computations alone. Instead we make arguments leading to a simple approximation for Séguy's curve. We observe first that the integrand in (24) is dominated by the region around r 0 = 0. Numerical evidence for this is provided in Figure  14 where we see that the fundamental solution grows in magnitude as r 0 → 0. In general the solution consists of a circular plateaux propagating outward from the origin plus an isolated but spreading peak also drifting away from the origin (Figure 14). The plateaux is formed once the rate of loss of particles from the peak source region (|r 0 | R −1 ) though advection and diffusion is equal to the rate of the creation of new particles. The plateaux height is determined by the particle mass which reaches the peak source region in the early stages of evolution. Due to radial drift, the only particles with a chance of doing this are those with sufficiently small Péclet number [85] where r 0 is their starting point (or that of their earliest ancestor if they are daughters). Values of r 0 which lie outside a region of radius ∝ R (henceforth Péclet region) around the origin can therefore be ignored when computing G(0, t; r 0 ). For R ≫ 1, the peak region forms a small fraction O(R −4 ) of the Péclet region and particles within the Péclet region have a probability of reaching the peak which decays exponentially with their initial distance from it. The function G(0, t; r 0 ) will therefore itself be sharply peaked within the Péclet region, around r 0 = 0, and we make the approximation G(0, t; r 0 ) ≈ hδ(r 0 ), where h is plateaux height. Making use of this approximation in (24) we have To compute the variance we note that if |r| ≪ t/R then the dominant contribution to the integral comes from the plateaux component of the solution. If |r| ≫ t/R then the plateaux will not have reached r so only the spreading peak component of the fundamental solution will contribute. Therefore We will comment on the significance of this behaviour below. To find the form of G(r, t, 0) we note first its circular symmetry, which reduces the number of variables in the OJK equation to two We seek a travelling wave solution, subject to the initial condition m(r, 0) = δ(r), representing the expanding plateaux, valid for large r so that the 1/r term in (29) can be neglected. We obtain, as t → ∞ where t 0 is a time correction which accounts for the fact that the propagation velocity of the plateaux takes some time to settle down to its long time value of βσ 2 /R. We verify in Figure 15 that this is the correct asymptotic solution by comparing it to the numerical solution of (29) for large t. We now approximate the normalized correlator as This approximation neglects the drop in the variance of m(r, t) for r ≫ t/R described by equation (28), which amounts to neglecting a multiplicative factor √ t in the large r behaviour of the correlator. Our approximate analytical prediction for Séguy's curve measured radially from the centre of the exponentially decaying population distribution is therefore This prediction is compared to correlations in the full model ( Figure 16) by generating 100 realizations of isogloss evolution over the exponential population density, each with different randomized initial conditions. From Figure 16 we see that as time progresses a growing region emerges around the centre of the city in which the linguistic distance to the centre is close to zero. An alternative visualization of this effect is given in Figure  18 which shows a superposition of the isoglosses from 20 simulation runs. As time progresses a circular patch emerges in the centre of the system which is devoid of isoglosses, and therefore where all speakers use the same linguistic variables. Outside of this central "city dialect" we note that the asymptotic behaviour of the complementary error function together with the expansion sin −1 (ǫ) = ǫ + O(ǫ 3 ) lead to the prediction that linguistic correlations fall as e −c(∆r) 2 /(∆r) where c is a constant and ∆r is distance from the edge of the city dialect. This is a faster rate of decay than in the flat population density case. It appears from Figure 13 that in reality the decay rate may be even faster than this. Further simulations reveal that the velocity with which the city dialect expands shows some systematic deviation from the prediction v ≈ βσ 2 /R of our OJK analysis. For example, in Figure 17 we have reduced the conformity parameter to β = 1.1 and we see that our theoretical predictions are in close agreement with the simulation data, provided we accelerate time by a factor of ≈ 1.25. The value of β = 1.4 selected in Figure 13 produces a match between predicted and observed velocity, but for larger values of β the prediction is an overestimate. For example, when β = 1.5 with all other parameters identical, the simulated velocity in the full model is smaller than our prediction by a factor of 0.97. One possible explanation for this discrepancy is that the interface shape may effect the constant of pro- portionality in the Allen-Cahn equation (9), for example if it did not match its constant density equilibrium form. We also note that OJK's assumption of isotropy in unit normals to isoglosses, although preserved globally by the circular symmetry of our system, at the edge of the city dialect it is clearly lost locally. Despite these shortcomings the adapted OJK theory allows analytical insight into the formation of dialects in population centres and the behaviour of Séguy's curve around cities. We leave the development of a more sophisticated theory for future work.

VI. DIALECT AREAS AND DIALECT CONTINUA
There is debate amongst dialectologists as to the most appropriate way to view the geographical variation of language use [1,2]. The debate arises because it is rarely the case that dialects are perfectly divided into areas. Chambers and Trudgill [1] imagine the following example: we travel from village to village in a particular direction and notice linguistic differences (large or small) as we go. These differences accumulate so that eventually the local population are using a very different dialect from that of the village we set out from. Did we cross a border dividing the dialect area of the first village from that of the second, and if so, when? Alternatively, is it a mistake to think of dialects as organized into distinct areas; should we only think of a continuum?
We now set out what our model can tell us about these questions. In one sense, language use in our model is always continuous in space. Although domains emerge where one variable is dominant, domain boundaries form transition regions in which the variants change continuously (the width of these regions is computed in section VII). Despite this, the boundary between two sufficiently large single-variant domains will appear narrow compared to the size of the domains, and in this sense the domains are well defined and noticeable by a traveller interested in one linguistic variable. Of more interest are the observations of a traveller who pays attention to the full range of language use. To perceive a dialect boundary, this traveller must see a major change in language use over a short distance. This change must be large in comparison to other, smaller changes perceived earlier. In our model a major language change is created by crossing a large number of isoglosses over a short distance. The question then is: under what circumstances will isoglosses bundle sufficiently strongly for dialect boundaries to be noticeable?
To answer this we need to recall the three effects which drive isogloss motion. First, surface tension which tends to reduce curvature. Second, migration of isoglosses until they emerge perpendicular to a boundary such as the coast, the border of a linguistic region, a sparsely populated zone, or an estuary. Third, repulsion of isoglosses from densely populated areas. There are two major ways in which these effects can induce bundling, both of which require the essential ingredient of time and demographic stability in order for surface tension to take hold. Indented boundaries can collect multiple isoglosses, creating a bundle. Examples already noted include the Wash and the Severn in GB, the Gironde Estuary in France, and the historical indentation in the Dutch-German language area marking the Eastern end of the Rhenish Fan. A major boundary indentation may not always create a bundle however: it may be that other parts of the boundary, or the presence of cities, creates a fanning effect. Variations in population density can also create bundling. Dense population centres which are large in comparison to the typical interaction range will push out linguistic change, and where two centres both repel, we expect to see bundling where their zones of influence meet. Each city would then create its own well defined dialect area. Within real cities we also see sub-dialects spoken by particular social groups [54], but since our model does not account for social affiliations we cannot explicitly model this.
In Figure 19 we schematically illustrate examples of these effects using three imaginary "Island Nations". Nation A exhibits distinct dialect areas. The Northernmost area is supported both by a city, which may generate and then repel language features, and by two indentations which form a "pinch point" which will tend to collect isoglosses via the boundary effect. Several . Schematic diagram of isoglosses (dashed lines) for three language areas or "Island Nations". Yellow / ochre circles represent cities. Nation A supports dialect areas formed by coastal boundary shape and repulsion from cities. Nation B largely exhibits a continuum of variation apart from two somewhat indistinct city dialects. Nation C has a single city dialect, but without this city (or if the city were not sufficiently densely populated) it would have no linguistic variation due to its entirely convex boundary and evenly distributed population.
other pinches exist in which also collect isoglosses, creating distinct dialects. The Southernmost city supports an isogloss via repulsion, which would otherwise migrate south under the combined influence of surface tension and the boundary effect, eventually disappearing. Nation B also possesses boundary indentations, but the lack of pinch points reduces bundling: while the indentations collect isoglosses, the smoother parts of the coastline allow isoglosses to attach anywhere, creating a continuum of language use. Two city dialects do exist however, driven by repulsion. Finally, nation C is a convex region. This is an example of a system which, without a city, could not support more than one dialect, and would tend over time to lose isoglosses. In some regions there are no dialect areas, only a continuum of variation [86], and in others clear dialects exist [4]. The above examples point to some general principles. In regions with low population density, a lack of major boundary indentations and few large cities, we might expect isoglosses to position themselves in a less predictable way, creating language variation which would be perceived as a continuum by a traveller. The regular creation of new isoglosses (resetting the initial conditions of the model) through linguistic innovation or demographic instability could also disrupt the ordering processes. Narrow geographical regions, or where there are major boundary indentations, or dense population centres which push out linguistic change, are particularly susceptible to the formation of distinct dialects.

VII. TRANSITION REGIONS AND CURVATURE
We now derive analytical results which characterise the transition regions between variables, and the effect of population density on the curvature of dialect domain boundaries.
To compute the gradients of transition regions we consider a straight isogloss (with constant population density) in equilibrium between variants 1 and 2, given by the line x = 0. Because the isogloss is vertical the frequencies will not depend on y so we write them f 1 (x) and f 2 (x) and note that f 1 (x) = 1 − f 2 (x) so we need only consider the behaviour of f 1 (x). For notational simplicity we define f := f 1 , m := m 1 and p(m) := p 1 (m). The isogloss will form the midpoint of a transition region where the frequencies change smoothly between one and zero, and where f ′ 1 (0) measures the rate of this transition. In equilibrium, from equation (3) we have . If non-neutrality (conformity) is small we may replace p −1 (f ) with its Taylor series about β = 1, neglecting O((β − 1) 2 ) terms Here we have defined a 'potential' function V (f ), allowing us to identify equation (34) as Newton's second law for the motion of a particle of mass σ 2 /2 in a potential V (f ), where x plays the role of time, so that the total 'energy' E : . Since V is defined by an indefinite integral then we can define V ( 1 2 ) := 0. As x → ±∞ we require that f ′ (x) → 0 and f (x) → 1 or 0 so The magnitude of the frequency gradient at the origin, where f (0) = 1 2 , is therefore From this we see that weak non-neutrality, and larger interaction range produce shallower gradients and therefore wider transition regions. As β approaches one, the FIG. 20. Sequence of radial cross-sections of the frequency of a linguistic variable whose initial domain is concentric with a city. Snapshots taken at times t = 10, 30, 50, . . . starting from initial radius r = 55. Model parameters σ = 5, β = 1.1.
transition region becomes increasingly wide and boundaries disintegrate destroying the surface tension effect described in Figure 3. Equation (36) is verified numerically for an 'almost straight' isogloss in Figure 20 (red dashed line). We now relate the equilibrium shape of isoglosses to population density. Starting from our modified Allen-Cahn equation (9) for isogloss velocity, and introducing the local radius of curvature R = 1/κ we see that when an isogloss is in equilibrium (having zero velocity) whereĝ is a unit normal to the isogloss. We note a simple alternative derivation of this result based on the dialect fraction, F D (x, y), of a domain D. We define this as the fraction of conversations that a speaker at a point (x, y) has with people whose home neighbourhoods lie in D.
Let P := (x * , y * ) be a point on an isogloss with local radius of curvature R ≫ σ, bounding some region D.
An intuitively appealing condition for language equilibrium is that the speaker at P should interact with equal numbers of speakers from within and without D Using the saddle point method [46] to evaluate equation (38) we have Result (37) then follows from the equilibrium condition (39). From this we see that large gradients in population density can produce equilibrium isoglosses with higher curvature. To test this against our full evolution equation (3), consider a circular city with 'radius' R c having a Gaussian population distribution set in a unit uniform background The constant ρ 1 ≥ 1 gives the relative density of the city centre compared to outlying areas. Consider a circular isogloss which is concentric with the city, then equation (37) has two solutions where W p and W m are two branches of Lambert's W function [88], W (z), which solves z = we w . The two solutions R 1 and R 2 are respectively decreasing and increasing functions of ρ 1 . The stability of these solutions may be determined by noting that if F D > 1 2 then D will expand, and contract if the inequality is reversed. From this we are able to determine that R 2 is stable, whereas R 1 is not. If a dialect domain begins with R < R 1 then it will shrink and disappear under the influence of surface tension, but if initially R > R 1 then the domain will expand or contract until R = R 2 . This behaviour is illustrated in Figure 20, where we see that our law (37) accurately predicts the stable radius produced by our evolution equation (3).

A. Summary of results
Departing from the existing approaches of dialectology, we have formulated a theory of how interactions between individual speakers control how dialect regions evolve.
Much of what we have demonstrated is a consequence of the similarity between dialect formation and the physical phenomenon of phase ordering. Humans tend to set down roots, and to conform to local speech patterns. These local patterns may be viewed as analogous to local crystal orderings in binary alloys [82] or magnetic domains [22]. As with phase ordering, surface tension is a dominant process controlling the evolution of dialect regions. However, important differences arise from the fact that human populations are not evenly distributed through space, and the geographical regions in which they live have irregular shapes. These two effects cause many different randomized early linguistic conditions to evolve toward a much smaller number of stable final states. For Great Britain we have shown that an ensemble of these final states can produce predicted dialect areas which are in remarkable agreement with the work of dialectologists.
Since language change is inherently stochastic, at small spatial scales we can only expect predictions of a statistical nature. At larger scales an element of deterministic predictability emerges. Within our model, all stochasticity arises from the randomization of initial conditions. The effect of this randomness is strongest in the early stages of language evolution, when the typical size of dialect domains is small. The superposition of isoglosses lacks a discernible pattern. This "tangle" of lines produces a continuum of language variation, with spatial correlations given by Séguy's curve. As surface tension takes hold, steered by variations in population density and system shape, isoglosses begin to bundle and the continuum resolves into distinct dialect regions. Both long term population stability, and large variations population density play an important role. Without these ingredients isoglosses will not have time to evolve into smooth lines, or bundle.
The assumptions of our model are minimal, and clearly there are many additional complexities involved in language change which we have not captured; we discuss below how the model might be extended to account for some of these. Despite this simplicity, in addition to substantially reproducing Trudgill's predictions for English dialects [4] we have also been able explain the observation of both dialect continua and more sharply bounded dialect domains. We have also provided an explanation for why boundary indentations (e.g. in coastline or in the border between different languages) are likely to collect isoglosses [1]. We have shown that cities repel isoglosses, explaining the origin of the Rhenish Fan [42], the wavelike spread of city language features [15] and the fact that many dialect patterns are centred on large urban areas. We have explained why linguistic regions with high aspect ratio tend to develop striped dialect regions [76,77]. We have computed an analytical form for Séguy's curve [2,5,6,38] which as yet has had no theoretical derivation. We have also adapted this derivation to deal with a population centre. We have quantified how the width of a transition region [1] between dialects is related to the strength of conformity in individual language use, and the typical geographical distances over which individuals interact. We have shown how to relate the curvature of stable isoglosses to population gradients. Finally, in appendix B we show how incorporating an age distribution into the model can quantify the "apparent time" [57] effect used by dialectologists to detect a linguistic change in progress. Given these findings, we suggest that the theoretical approach we have presented would be worth further investigation.

B. Missing Pieces
The model we have presented is deliberately minimal: it allows us to see how much of what is observed can be explained by local interactions and conformity alone. This leads to a simple unified picture with surprising explanatory power. However, having chosen simplicity, we cannot then claim to provide a complete description of the processes at play. We now describe the "missing pieces", indicating what effect we expect them to have, and how to include them.

Innovation
An important aspect of language change which is not explicitly encoded within our model is innovation: the creation of new forms of speech. We have instead assumed that there are a finite set of possible linguistic variables, and for each one, a finite set of alternatives, all of which are present in varying frequencies within the initial conditions of the model. Each alternative is equally attractive so that conformity alone decides its fate. A new variant cannot spontaneously emerge within the domain of another. The model therefore evolves toward increasing order and spatial correlation. Due to the presence of population centres this ordering process is eventually arrested creating distinct, stable domains. If innovation were allowed, then ordering would be interrupted by the initialisation of new features, and Séguy's curve would reach a steady state where the rate of innovation (creating spatial disorder) balanced the rate of ordering.
For a local innovation to take hold, it must overcome local conformity, realized as surface tension and the "shrinking droplet" effect. Several mechanisms might allow this to happen: for example young speakers must recreate their language using input from parents, peers and other community members. This recreation process is inherently imperfect [19], and many interactions are between young speakers who are simultaneously assimilating their language. In this sense the young are weakly coupled to the current, adult speech community, and their language state is subject to fluctuations which may be sufficient to overcome local conformity for long enough to become established. As these speakers age their linguistic plasticity declines, older speakers die, and the change is cemented. Other mechanisms include speech modifications made to demonstrate membership of a social group, or a bias toward easier or more attractive language features. To understand mathematically the effect of innovation on spatial evolution we might simply allow the creation of new variants, and then assign them a "fitness" relative to the existing population.

Interaction Network
By selecting a Gaussian interaction kernel, and not distinguishing between different social groups, we are assuming that the social network through which language change is transmitted is only locally connected in a geographical sense but within each locality the social network is fully connected. That is, I will listen without prejudice to anyone regardless of age, sex, status or ethnicity, as long as they live near my home. Research into 21st century human mobility [89,90] and the work of linguists [54,57,60], indicates that both these assumptions are a simplification. Human mobility patterns, and by implication interaction kernels, exhibit heavy tailed behaviour (with an exponential cut-off at large distances). In our framework, an interaction kernel of this type, combined with densely populated cities would allow long range connectivity between population centres. Long range interactions in phase ordering phenomena can have substantial effects on spatial correlations and domain sizes [91], and may effectively alter the spatial dimension of the system [92].
Further evidence for non-local networks is provided by the Frisian language, spoken in the Dutch province of Friesland. This has a "town Frisian" dialect [67], spoken in towns that are separated from each other by the Frisian countryside, where a different dialect is spoken: town Frisian is "distributed". Within the social network these towns are "local" (nearby) to each other. To incorporate this effect into our model we must either reformulate our fundamental equation (3) to describe evolution on more general network, or generalize our interaction kernel to allow communication between cities. Further empirical evidence for non-local interactions is provided by hierarchical diffusion [93], where linguistic innovations spread between population centres, not necessarily passing through the countryside in between. Such a process motivated the creation of Trudgill's gravity model [14].
In addition, mobility data [90] shows that individuals follow regular, repeating trajectories, introducing strong heterogeneity within the set of individual interaction kernels. Social, as well as spatial heterogeneity may also be important. For example, it has been shown theoretically [94] that the time required for two social groups to reach linguistic consensus is highly sensitive to the level of affinity that individuals have for their own group.

Linguistic Space and Dynamics
By using a set of linguistic variables we are treating dialects as points in vector space. Implicit in our dynamics are two assumptions. First, all transitions between variants are allowed, with probabilities given only by the frequencies with which the variables are used. Second, the evolution of different linguistic variables are mutually independent. There are cases where this is an incomplete description. A notable example is chain shift-ing in vowel sounds [57]. Linguists represent the set of possible vowel sounds as points in a two dimensional domain where closeness of the tongue to the roof of the mouth, and the position of the tongue's highest point (toward the front or back of mouth) form two orthogonal coordinate directions [95,96]. The vowel system of a language is then a set of points in this domain. If one vowel change leaves a gap in this system (an empty region of the domain) then other vowel sounds may shift to fill this gap, producing a chain of interconnected linguistic changes. Similarly, a change in one vowel to bring it closer to another can push it, and then others, out of their positions. A famous example is the Great Vowel Shift [19] in England between the 14th and 17th centuries. Another example concerns changes which spread to progressively more general linguistic (as opposed to geographical) contexts [80]. If we have several variables, each signifying the presence of the change in a different context, then it is clear that the frequency of one variable can influence the frequency of another, contradicting our assumption that variables are independent.
The fact that linguistic variables are sometimes dependent upon one another means that, within our model, p i , which relates the past use of some variable to the current frequency of its ith variant, via the relationship f i (x, y, t) := p i [m(x, y, t)], should sometimes depend on the use of other variables, and might be adapted to capture more than just conformity.

C. Conclusion
We conclude by noting that a major theme of the book War and Peace by Leo Tolstoy, is the idea that history is determined not by great individuals but rather by millions of small choices made by the people.
"To elicit the laws of history we must leave aside kings, ministers, and generals, and select for study the homogeneous, infinitesimal elements which influence the masses." [97] As pointed out by Vitány [98], Tolstoy was, in modern terms, advocating the formulation of a Statistical Mechanics of history. The work we have presented is an attempt to formulate such a theory for the spatial history of language. Because of its simplicity, dealing only with copying and movement, our model may apply more broadly to other forms of culture.

Appendix A: Discretized evolution equation
Here we present the computational scheme used for solving our evolution equation (3). This is aimed at researchers having some familiarity with computer programming, such as linguists interested in quantitative approaches. It can also be implemented using only a spreadsheet (see supplementary material). The discretized version of evolution equation (3) also provides a greater intuitive understanding of its continuous counterpart. For simplicity we consider the V = 2 model and define f := f 1 , m := m 1 , p(m) := p 1 (m) and note that we need only consider the evolution of m and f because We begin by rewriting our evolution equation (3) in terms of the memory and frequency fields where To solve equation (A1) on a computer we discretize space into a rectangular grid of square sites. We let the side of each grid square define one unit of length. The interaction range used in the computer calculation should be expressed in these units. That is, if the side of a grid square is a km long, and the real world interaction range is σ km, then the interaction range used in the computer should be σ c := σ/a. We choose a so that σ c > 1 so speakers interact over distances greater than one square. Subscript c distinguishes the interaction range measured in computer grid units from the interaction range in km.
For each site we store three quantities ρ ij , f ij , m ij where the subscripts i, j are horizontal and vertical indices related to spatial position by x = i × a and y = j × a. These quantities are our approximations to the values of the fields ρ, f, m at the centres of sites, and are stored in three arrays. Intuitively we think of each site as containing a group of a 2 ρ ij speakers, each of whom uses variant 1 with frequency f ij . The linguistic domain of interest is the set of sites with non-zero population density. Within the domain, sites with one or more nearest neighbours with zero population are referred to as boundary sites, otherwize they are bulk sites. Sites which are not part of the domain are never updated and it is useful to include a border of such sites around the edge of the rectangular grid. The scheme we present is based on approximating the Laplacian ∇ 2 = ∂ 2 x + ∂ 2 y using a central finite difference approximation for the derivatives ∂ 2 x and ∂ 2 y . Let g be an arbitrary function defined at every site. We define a local average at each grid point This is just the average of the values of g at the four nearest neighbours of (i, j). The Laplacian is then approximated This follows from the finite difference approximation ∂ 2 x g ≈ g i+1,j − 2g ij + g i−1,j , the effectiveness of which depends on g varying slowly between sites. From equation (A4) we see that ∇ 2 g measures the extent to which g differs from the average of its neighbours. If ∇ 2 g < 0 then g exceeds the local average, and is less than the local average when the inequality is reversed.
We now introduce a small discrete time step δt and write ∆m ij := m ij (t + δt) − m ij (t) for the change in the memory field over the time interval [t, t + δt]. We note also that provided the grid is sufficiently fine then at bulk sites ρ ij ≈ ρ ij so, making use of (A1) and (A4) we have At each time step equation (A5) is used to update the stored values of m ij for all sites in the linguistic domain, after which the frequencies can also be updated using f ij = p(m ij ). The quantity ρf ij / ρ ij is the average of frequencies at neighbouring sites, weighted in proportion to their populations. Using ρ ij rather than ρ ij in the denominator ensures that these weights sum to one. This serves two purposes: First, it avoids the need for an additional condition at boundary sites (intuitively, speakers in boundary sites simply shift attention from empty neighbouring squares to those which are part of the linguistic domain, consistent with the original definition of the weighted interaction kernel). Second it ensures that spatially constant memory and frequency fields constitute a fixed point of the dynamics. Rule (A5) is an explicit scheme and as such its stability requires that δt be chosen sufficiently small. In the case of zero conformity (β = 1) and constant population density the Von-Neumann stability condition [61] is δt < 1/σ 2 . This serves as a guide to find δt sufficiently small for our scheme to converge. For the density fields and conformity values used in this paper we have found that δt < 1/(4σ 2 c ) is more than sufficient.
We conclude this section by explaining the linguistic meaning of the terms on the right hand side of (A5). The first term: f ij − m ij drives conformity. If m ij > 1 2 then this term is positive, driving the memory further towards m ij = 1 where all speakers use variant 1. Otherwise if m ij < 1 2 , the memory is driven towards zero where no speakers do. The second term, ρf ij / ρ ij − f ij , acts to equalize speech use in the local area. If variant 1 is used more at (i, j) than in the surrounding squares, then this term acts to reduce its use in (i, j). If variant 1 is used relatively less at (i, j), the term has the opposite effect.
Appendix B: Incorporating age into the model In order to experimentally detect a linguistic change in progress, ideally one would like to survey the same population of individuals, or a representative sample of the population, at two or more points in time [1]. Such longitudinal studies may be practically difficult to carry out, so linguists have made use of the assumption that speech patterns are acquired mainly in the early part of people's lives. The speech of a 50-year old today should therefore reflect the speech of a 30-year old twenty years ago. It should be noted though, that speech patterns can change throughout life [43], although more slowly in older speakers. A linguistic change detected by observing different speech patterns in the young and old is said to have been observed in apparent time [57,58]. A famous example of apparent time is the replacement of the term chesterfield (meaning an upholstered multipleperson seat) in Canadian speech, with the fashionable American term couch [99]. In this case the use of couch was shown to decrease sigmoidally from ≈ 85% amongst teenagers to ≈ 5% among those in their eighties. The apparent time theory has been tested by comparing language surveys taken at different times, and comparing predictions based on apparent time in the earlier sample, with the observations made in the later one [100]. We note that differences between speech patterns between the generations do not always indicate a linguistic change in progress [43]. For example, the use of some speech forms may change systematically with age in the same way, generation after generation, so that the community as a whole is in a stable state [54].
We now give a simple illustration of how age and apparent time can be incorporated in our model. For simplicity we consider the progress of a straight isogloss between two variants, driven by a slowly declining population density. This density variation is equivalent to a social bias toward one variable which we call the new variant. We let f (x, t) be the frequency with which the new (spreading) variable is used at position x and time t. Note that there is no y dependence due to symmetry. To distinguish between young and old we introduce an age density distribution α(a) giving the fraction of individuals within the age bracket [a 1 , a 2 ] as a2 a1 α(a)da. (B1) Using this distribution we modify our original model to account for the fact that individuals have been exposed only to the linguistic information available in their lifetime. The memory of a speaker with age a is therefore defined (B2) Note that as a → ∞ this definition coincides with our original definition (2) of memory. In the interests of analytical tractability we consider the limit of small memory decay rate (τ → ∞) in which case linguistic memory is a simple "bus stop" average over life history We also take the limit of total conformity β → ∞ so that language is chosen according to a simple majority rule. We consider an exponentially decaying population density ρ(x) = e −ǫx (B4) where ǫ ≪ 1. The weighted interaction kernel for this density is then Notice that the effect of the decaying population density is to shift the interaction kernel to the left so that more attention is paid to language use on that side of the listener. To compute the isogloss velocity we define and prepare the system with initial condition where Λ is the initial location of the isogloss. Because each speaker listens more to the speakers on her left, the isogloss will travel right. In the limit β → ∞ then when the memory, µ a (x, t), of a speaker, with x > Λ, reaches 1 2 they will switch linguistic variables. The motion of the isogloss will then take the form of a travelling wave formed from a superposition of travelling step functions, one for each age f (x, t) = ∞ 0 α(a)η(x − vt + Λ(a))da, with the function Λ(a) > 0 and the velocity v to be determined. According to (B9), at t a := Λ(a)/v, a speaker at the origin with memory of length a will be on the point of switching variable so that As ǫ → 0 the isogloss velocity must also tend to zero. The quantity ∆(a 1 , a 2 ) := Λ(a 1 ) − Λ(a 2 ) (B12) gives the distance between the step functions for speakers with ages a 1 and a 2 as |∆(a 1 , a 2 )|, and this separation must also tend to zero as ǫ → 0. We can therefore compute a series expansion for v in powers of ǫ by expanding the cumulative interaction kernel K(z) in (B10) about z = 0 and ǫ = 0. To lowest order we have Substituting this approximation into equation (B10) we have If the oldest speaker has age A, then the width of the transition region is ∆(A, 0) = Aσ 2 ǫ/ā. We provide a concrete example using a population "pyramid" age distribution, cut off exponentially at low ages to account for the fact that very young speakers listen to, but do not influence, others. Letting a 0 be the low age cut-off we define where C is a normalizing constant. An example of the travelling wave (B9) generated by this age distribution is illustrated in Figure 21. Also shown are the results of a numerical implementation of the full model with a discretized version of the age distribution (B18). This discretization is necessary in order to implement the model numerically, because the memory of each age of speaker must be individually stored. Finally we consider the likely outcome of experimentally sampling the use of language within this model. We let x 0 be the left boundary of the transition region (the oldest speaker at x 0 is just about to switch variable). We then define the indicator function of the event that a speaker of age a, located at position x, is using the new variant An example of this distribution is illustrated in Figure  22. In this example the mean sample location is the centre of the transition region, and we see that uptake of the new variant exhibits the characteristic "S-shaped" age distribution seen in studies of linguistic change observed in "apparent time" [1,57,99].
We conclude by noting that this extension of the model to include different memory lengths should be seen as a "toy" model of the effect of age on language change. The fact that older speakers tend to take longer to change their language use is captured purely by the length of their memories. In reality the influence on a given speaker of the language she is exposed to at different stages of her life will be much more complicated than our simple model [101]. For example, if language use were determined entirely during early life then the forgetting curve should be peaked during these early years -in effect the linguistic memory should stop "recording" once a speaker's youth has ended. Each speaker will respond differently to what they hear, so the forgetting curve will not be identical for every speaker. Such heterogeneity amongst speakers is straightforward to incorporate, but at the cost of tractability. The advantage of the simple approach is to illustrate the apparent time effect in an analytically simple way.

Appendix C: Allen-Cahn Equation and
Ohta-Jasnow-Kawasaki Theory Here we derive an analogue of the Allen-Cahn equation [82] for the velocity of an isogloss, before deriving a modified Ohta-Jasnow-Kawasaki equation [83] which provides a simplified method for understanding the evolution of spatial correlations in the model. In the binary variant case (V = 2) we have that m 1 = 1 − m 2 so after defining p(m 1 ) := p 1 (m) and f := f 1 then f = p(m 1 ).
In terms of f , our evolution equation (3) may be written Following [20,82] we introduce a unit vector,ĝ, normal to the isogloss at a point P . We let g be the displacement from P in the direction ofĝ. At the isogloss we have ∇f = ∂f ∂g tĝ We will also need the cyclic relation