A high-reproducibility and high-accuracy method for automated topic classification

Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic classification. Here, we perform a systematic theoretical and numerical analysis demonstrating that current optimization techniques for LDA often yield results that are not accurate in inferring the most suitable model parameters. Adapting approaches for community detection in networks, we propose a new algorithm which displays high reproducibility and high accuracy, and also has high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable.

The amount of data that we are currently collecting and storing is unprecedented. A challenge for its analysis is that nearly 80% of this data is in the form of unstructured text. As digital data keep increasing, there is a pressing need for fast and reliable algorithms to navigate and turn them into new knowledge. One of the central challenges in the field of natural language processing is bridging the gap between information in text databases and their meaning in terms of topics. Topic classification algorithms are key for filling this gap.
Topic models use a database of text documents to automatically describe each document in terms of the underlying topics. This is the foundation for text recommendation systems [1,2], digital image processing [3,4], computational biology analysis [5], spam filtering [6] and countless other modern-day technological applications. Because of their importance, there has been an extraordinary amount of research and a number of different implementations of topic model algorithms [7][8][9][10][11][12][13][14].
Latent Dirichlet allocation (LDA) [10,15,16] is the state-of-the-art method in topic modeling. Like its predecessor, probabilistic latent semantic analysis (PLSA) [9], it relies on fitting a generative model of the corpus (Fig. 1). Specifically, the model assumes that a document doc in the corpus covers a mixture of topics, and that each topic is characterized by a specific word usage probability distribution. For instance, consider a corpus of documents addressing two topics, mathematics and biology. Each document in the corpus will cover these topics with given probabilities. A biology-focused document $d_{\mathrm{bio}}$, for example, might have p(biology|$d_{\mathrm{bio}}$) = 90% and p(math|$d_{\mathrm{bio}}$) = 10%. Documents focused on different topics will make use of different words. Words such as dna or protein will be used mainly in biology documents because p(dna|biology) ≫ p(dna|math). In contrast, words such as tensor or equation will primarily be used in math-focused documents because p(tensor|biology) ≪ p(tensor|math). Additionally, there will be words such as research or study that are generic and will be used equally by both topics.
In practical applications, one has access to the word counts in each document, but the topic structure will be unobservable or latent. The challenge thus is to estimate the topic structure, which is defined by the probabilities p(topic|doc) and p(word|topic). PLSA and LDA both try to estimate the model with the highest probability of generating the data [9,10,17,18], but PLSA does not account for the probability of choosing a certain topic mixture. Crucially, both methods rely on maximization of a likelihood that depends non-linearly on a large number of variables, an NP-hard problem [19].
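To make the objective concrete, the following minimal sketch evaluates the standard PLSA log-likelihood, the sum over documents and words of n(doc, word) × log Σ_topic p(topic|doc) p(word|topic), for the biology/math example. The function name, the dictionary layout and all numerical values are illustrative assumptions, not taken from the paper.

```python
import math

def plsa_log_likelihood(counts, p_topic_given_doc, p_word_given_topic):
    """PLSA log-likelihood; counts[d][w] is how often word w occurs in doc d."""
    total = 0.0
    for d, word_counts in counts.items():
        for w, n in word_counts.items():
            # Probability of word w in document d under the topic mixture.
            p_w = sum(p_td * p_word_given_topic[t].get(w, 0.0)
                      for t, p_td in p_topic_given_doc[d].items())
            total += n * math.log(p_w)
    return total

# Hypothetical toy numbers for a single biology-focused document.
counts = {"d_bio": {"dna": 3, "equation": 1}}
p_td = {"d_bio": {"biology": 0.9, "math": 0.1}}
p_wt = {"biology": {"dna": 0.8, "equation": 0.2},
        "math": {"dna": 0.1, "equation": 0.9}}
log_lik = plsa_log_likelihood(counts, p_td, p_wt)
```

The optimization problem discussed in the text amounts to maximizing this quantity over both probability tables at once, which is where the roughness of the landscape comes in.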
Although it is well known that the problem is computationally hard, little is known about how, in practice, the roughness of the likelihood landscape impacts an algorithm's performance. In order to gain a more thorough theoretical understanding, we implement a controlled analysis of a highly specified and constructed set of data. This high degree of control allows us to tease apart the theoretical limitations of the algorithms from other sources of error that would normally be uncontrolled in traditional datasets. Our analysis reveals that standard techniques for likelihood optimization are hindered by the very rough topology of the landscape, even in very simple cases such as when topics use exclusive vocabularies. We show that a network approach to topic modeling enables searching the likelihood landscape much more efficiently, yielding more accurate and reproducible results.
Figure 1: A. How documents are modeled as mixtures of topics: we count the word frequencies in each document and we model these as mixtures of different topics, math and biology in this example. The topic structure is latent, meaning we do not have information about the "true" topics which generated the corpus. However, they can be estimated by a topic model algorithm. B. We use data where each document is written in either English, French or Spanish (language = topic). However, a topic algorithm might separate English documents (many) into two subtopics, while French and Spanish (small groups) are merged. B1. We consider an example where each language has a vocabulary of 20 words, and the document length is 10 words: with this choice of the parameters, the alternative model has a better likelihood (for PLSA or symmetric LDA) if the fraction of English documents is bigger than ∼ 0.94, when the overfitting gain compensates the underfitting loss. B2. LDA's performance (variational inference): the curve represents the median likelihood of the model inferred by the algorithm, while the shaded area delimits the 25th and 75th percentiles. The algorithm does not find the right generative model (green curve) before the theoretical limit (grey shaded area). B3. Probability that LDA infers the actual generative model.

I. THE LIKELIHOOD LANDSCAPE IS ALWAYS ROUGH
Most practitioners know that a very large number of topic models can fit the same data almost equally well: this poses a serious problem for an algorithm's stability. We start investigating this problem by considering an elementary test-case, which we denote the language corpus. "Toy" models are helpful because they can be analytically treated and provide useful insights for more realistic and complex cases.
In our language corpus, topics are fully disambiguated languages (that is, no similar words are used across languages) and each document is written entirely in a single language, thus creating the simplest possible test case. As is assumed by the LDA generative model, we use a two-step process to create synthetic documents. In the first step, we select a language with probability p(language), which corresponds to a Dirichlet distribution with very small concentration parameters (see SI). Given the language, in the second step, we randomly sample a given number of words from that language's vocabulary into the document. For the sake of simplicity, we restrict the vocabulary of each language to a set of $N_w$ unique equiprobable words. Thus, an "English" document in the language corpus is just a "bag" of English words. Note that every document uses words from a single language.
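The two-step process can be sketched in a few lines. This is only an illustration under simplifying assumptions: the language of each document is drawn directly from p(language) (the limit of very small Dirichlet concentration parameters), and the function name, word labels and parameter values are hypothetical.

```python
import random

def make_language_corpus(doc_fractions, n_docs, vocab_size, doc_length, seed=0):
    """Return a list of (language, words) pairs for a toy language corpus."""
    rng = random.Random(seed)
    languages = list(doc_fractions)
    weights = [doc_fractions[lang] for lang in languages]
    # Disjoint vocabularies: every word is tagged with its language.
    vocab = {lang: [f"{lang}_{i}" for i in range(vocab_size)] for lang in languages}
    corpus = []
    for _ in range(n_docs):
        # Step 1: pick a single language for the whole document.
        lang = rng.choices(languages, weights=weights, k=1)[0]
        # Step 2: draw equiprobable words from that language's vocabulary.
        words = [rng.choice(vocab[lang]) for _ in range(doc_length)]
        corpus.append((lang, words))
    return corpus

# Three languages with unequal sizes, 20-word vocabularies, 10-word documents.
corpus = make_language_corpus({"en": 0.8, "fr": 0.1, "es": 0.1},
                              n_docs=200, vocab_size=20, doc_length=10)
```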
Let us be more concrete and consider a dataset with three languages and a different number of documents in each language. Consider also that there are more documents in English than in the other languages. An implementation of a topic model algorithm could correctly infer the three languages as topics or alternatively split "English" into two "dialects" and merge the two other languages (see Fig. 1). This alternative model is wrong on two counts: it splits English into two parts, while merging two different languages. Naïvely, one would expect the alternative model to have a smaller likelihood than the correct generative model. However, this is not always the case (Fig. 1C) for PLSA [9] and the symmetric version of LDA [29] [10,17]. In fact, dividing (or overfitting) the "English" documents in the corpus yields an increase of the likelihood. As we show in the SI text, through this process of overfitting the log-likelihood increases by between $1/\pi$ and $(\log 2)^2$ per English document, depending on the average length of the documents. Analogously, merging (or underfitting) the "French" and "Spanish" documents results in a decrease of the log-likelihood of $L_d \log 2$ per "French" and "Spanish" document, where $L_d$ is the average length of the documents. Thus, there is a critical fraction of "French" and "Spanish" documents below which the alternative model will have a greater likelihood than the correct generative model (Fig. 1).
Note that this theoretical limit on a likelihood's ability to identify the correct generative model is not limited to topic modeling. Indeed, it also holds for non-negative matrix factorization [30] [8] with KL-divergence, because of its equivalence to PLSA [20].
However, the critical size of underfitted documents depends on the length $L_d$ of the documents in the corpus, and decreases as $1/L_d$. In fact, by increasing the documents' length or by using asymmetric LDA [21] rather than symmetric LDA, one can show that, for the language corpus, the generative model always has a higher likelihood than the alternative model (see SI). In this case, the ratio of the log-likelihoods of the alternative model and the generative model can be expressed as in Eq. (1), where $\mathcal{L}_{\mathrm{alt}}$ and $\mathcal{L}_{\mathrm{true}}$ are the likelihoods of the alternative and generative model respectively and $f_U$ is the fraction of underfitted documents ("French" and "Spanish," in the example).
Even though the generative model has a greater likelihood, the ratio on the left-hand side of Eq. (1) can be arbitrarily close to 1. The reason is that the ratio is independent of the number of documents in the corpus and of the length of the documents. Thus, even with an infinite number of infinitely long documents, the generative model does not "tower" above other models in the likelihood landscape. The consequences of this fact are important because the number of alternative latent models that can be defined is extremely large: with a vocabulary of 1000 words per language, the number of alternative models is on the order of $10^{300}$ (see SI).
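As a rough consistency check on why such a ratio does not depend on the corpus size or on the document length, one can combine the per-document log-likelihoods of the language corpus: equiprobable words, vocabulary size $N_w$ for a correctly fitted language, and $2N_w$ for the merged one. The sketch below assumes $L_d \gg 1$, so that the overfitting gain and the language-choice term are negligible next to the terms proportional to $L_d$. It is a reconstruction from these definitions and may differ in form from Eq. (1).

```latex
\begin{align}
\log \mathcal{L}_{\mathrm{true}} &\simeq -L_d \log N_w, \\
\log \mathcal{L}_{\mathrm{alt}} &\simeq -(1 - f_U)\, L_d \log N_w - f_U\, L_d \log (2 N_w)
  = -L_d \log N_w - f_U\, L_d \log 2, \\
\frac{\log \mathcal{L}_{\mathrm{alt}}}{\log \mathcal{L}_{\mathrm{true}}}
  &\simeq 1 + \frac{f_U \log 2}{\log N_w}.
\end{align}
```

Under these assumptions $L_d$ cancels and the number of documents never enters, and the ratio approaches 1 when $f_U$ is small or $N_w$ is large.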
In conclusion, we find that only the full Bayesian model [21] can potentially detect the correct generative model regardless of the documents' length, and an extremely large number of models are very close to the correct one in terms of likelihood. In the next section, we will show how current optimization techniques are affected by this problem.

II. NUMERICAL ANALYSIS OF THE LANGUAGE TEST
Although the language corpus is a highly idealized case, it provides an example where many competing models have very similar likelihood and the overwhelming majority of those models have more equally sized topics. Indeed, because of the high degeneracy of the likelihood landscape, standard optimization techniques might not find the model with the highest likelihood even in such simple cases, and they might yield different models across different runs, as has been previously reported [11,21]. Moreover, since small topics are the hardest to resolve (see SI, Sec. 1.6), standard algorithms might require the assumption that there are more topics in the corpus than there really are, because the "extra topics" are needed to resolve small topics.

We test these hypotheses numerically on two synthetic language corpora (Fig. 2). For the first corpus, which we denote egalitarian, each of ten languages comprises an equal number of documents. For the second corpus, which we denote oligarchic, 60% of the documents belong to 20% of the languages. Specifically, we group the languages into two classes. The first class comprises two languages, each with 30% of the documents in the corpus. The second class comprises eight languages, each with 5% of the documents. For both corpora, we used the real-world word frequencies [22] of the languages.
In order to determine the validity of the models inferred by the algorithm under study, we calculated both the accuracy and the reproducibility of the algorithms' outputs. We use a measure of normalized similarity (see Methods) to compare the inferred model to the generative model (accuracy) and to compare the inferred models from two runs of the algorithm (reproducibility).
In the synthetic corpora that we consider, topics are not too unequal in size and documents are sufficiently long, so that, for both PLSA and symmetric LDA, both datasets have their highest likelihood at the generative model. Additionally, we run the standard algorithms [9,10] with the number of topics in the generative model (as we show in the SI, estimating the number of topics via model selection would lead to an over-estimation of the number of topics). We find that PLSA and the standard optimization algorithm implemented with LDA (variational inference) [10] are unable to find the global maximum of the likelihood landscape (see Fig. 2). In the SI we also show the results for asymmetric LDA implementing Gibbs sampling [21], which, interestingly, performs well only in the egalitarian case.
Our results thus show that it is highly inefficient to explore the likelihood landscape blindly, either by starting from random initial conditions or by randomly seeding the topics using a sample of documents (Fig. 2), as is the current standard practice.

III. A NETWORK APPROACH
In order to improve on the performance of current methods, we surmise that it will be useful to build some intuition about where to search in the likelihood landscape. We start by noting that a corpus can be viewed as a bipartite network of words and documents [23], and, using this insight, we construct a network of words which are connected if they co-appear in a document [24].
In the language corpora, finding the languages is as simple as finding the connected components of this graph. In general, however, finding topics will be more complex because of words shared by topics. We propose a new approach comprising three steps, which we denote TopicMapping. In the first step, we filter out words that are unlikely to provide a separation between topics because they are used indiscriminately across topics. Specifically, we compare the dot-product similarity [25] of each pair of words (which co-appear in at least one document) with the expectation for a null model where words are randomly shuffled across documents. For the null model, the distribution of dot-product similarities of pairs of words is well approximated by a Poisson distribution whose average depends on the frequencies of the words (see SI). We set a p-value of 5% for accepting the significance of the similarity between pairs of words.
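The filtering step, followed by a connected-components search, can be sketched as follows. Two points are assumptions made for illustration only: the Poisson mean of the null model is taken to be n_u × n_v / D (word frequencies n_u and n_v, D documents of equal length), whereas the exact expression is given in the SI; and connected components stand in for the clustering step, which is adequate only when vocabularies are fully separated, as in the language corpora. All function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by direct summation of the pmf."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)          # pmf(0)
    cdf = term
    for i in range(1, k):
        term *= lam / i            # pmf(i)
        cdf += term
    return max(0.0, 1.0 - cdf)

def significant_word_network(docs, p_value=0.05):
    """Link two words if their dot-product similarity beats the null model."""
    counts = [Counter(doc) for doc in docs]
    freq = Counter()
    for c in counts:
        freq.update(c)
    similarity = defaultdict(int)
    for c in counts:
        for u, v in combinations(sorted(c), 2):
            similarity[(u, v)] += c[u] * c[v]
    graph = defaultdict(set)
    for (u, v), z in similarity.items():
        lam = freq[u] * freq[v] / len(docs)   # assumed null-model mean
        if poisson_sf(z, lam) < p_value:
            graph[u].add(v)
            graph[v].add(u)
    return dict(graph)

def connected_components(graph):
    """Depth-first search over the filtered word network."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(graph[node] - comp)
        seen |= comp
        components.append(comp)
    return components

# Two "languages" plus one generic word that appears in every document.
docs = [["a", "b", "c", "the"]] * 20 + [["x", "y", "z", "the"]] * 20
graph = significant_word_network(docs)
topics = connected_components(graph)
```

In this toy corpus the generic word "the" co-occurs with every other word no more often than the null model predicts, so all of its links are dropped and the two vocabularies fall into separate components.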
In the second step, we cluster the filtered network of words using a clustering algorithm developed by Rosvall and Bergstrom (Infomap) [26]. Unlike standard topic modeling algorithms, the method does not require an estimate of the number of topics present in the corpus. We use the groups identified by the clustering algorithm as our initial guesses for the number and word composition of the topics. Because our clustering algorithm is exclusive (that is, each word can belong to only a single topic), we must use a latent topic model which allows for non-exclusivity. Specifically, we locally optimize a PLSA-like likelihood in order to obtain our estimate of model probabilities (see SI for more information).
In the third step, we can refine our guess further by running asymmetric LDA likelihood optimization [10] using, as initial conditions, the model probabilities found in the previous step. In general, if the topics are not too heterogeneously distributed, the algorithm converges after only a few iterations, as our guess is generally very close to a likelihood maximum (we actually found only one case where more iterations were needed: the Wikipedia dataset, see Fig. 5). Figure 2 shows the excellent performance of the TopicMapping algorithm.

IV. A REAL WORLD EXAMPLE
In order to test the validity of the TopicMapping algorithm and better compare its performance to standard LDA optimization methods, we next consider a real-world corpus comprising 23,838 documents obtained from Web of Science (WoS). Each document contains the title and the abstract of a paper published in one of six top journals from different disciplines (Geology, Astronomy, Mathematics, Biology, Psychology, Economics). We pre-processed the documents in the WoS corpus by using a stemming algorithm [27] and removing a standard list of stop-words. Pre-processing yielded 106,143 unique words.
We surmised a generative model in which each journal defines a topic and in which each document is assigned exclusively to the topic defined by the journal in which it was published. We then compare the topics inferred by symmetric LDA (variational inference) and TopicMapping with the surmised generative model (Fig. 3). While TopicMapping has nearly perfect accuracy and reproducibility, standard LDA optimization performs significantly worse. When using the standard approach, LDA estimates that the corpus comprises 20 to 30 topics (see SI) and yields a reproducibility of only 55%. Even when LDA is told that there are only six topics, the inferred models merge papers from small journals, yielding an accuracy and a reproducibility of 70%.
Figure 3: Performance of the algorithms on a real world example. In the pie charts, each slice is a different topic found by the method and the colored areas are proportional to the probability of the corresponding journal given that topic: p(journal|topic) = $\sum_{\mathrm{doc}}$ p(journal|doc) × p(doc|topic). The topic labels are the most frequent words in the topic. The "*" symbol is due to the stemming algorithm we used (porter2). A. Performance of standard LDA when we input the number of journals as number of topics. Big topics are split and small ones are merged. B. Performance of LDA when we input the number of topics suggested by model selection. Small topics are now resolved but big ones are split so that each topic is comparable in size. C. TopicMapping's performance. D. Topics found by TopicMapping in a corpus where we added an interdisciplinary journal such as Science. We also show the most frequent affiliations of papers published in Science in each topic (bottom). The total number of topics found is 19 but only topics with probability bigger than 2% are shown in the figure (9 topics).
Adding an interdisciplinary journal (Science), we can see that TopicMapping assigns the majority of papers published in Science to the already found topics, but several new topics are identified. In terms of likelihood, TopicMapping yields a slightly better likelihood than standard LDA optimization, but only if we compare models with the same effective number of topics. A more detailed discussion on this point can be found in the SI.

V. SYSTEMATIC ANALYSIS ON SYNTHETIC DATA
As a final and more systematic evaluation of the accuracy and reproducibility of the different algorithms, we implement a comprehensive generative model, where documents choose a topic distribution from a Dirichlet distribution as proposed in the LDA model. We tune the difficulty in separating topics within the corpora by setting (1) the value of a parameter α which determines both the extent to which documents mix topics, and the extent to which words are significantly used by different topics; and (2) the fraction of words which are generic, that is, contain no information about the topics (see Methods). Fig. 4 shows our results for the synthetic corpora. We have also done a more systematic analysis (see SI), but the main conclusion is the same as for the language test: the generative model has the highest likelihood (topics are sufficiently equal in size), but the number of overfitting models is so large and they are so close in terms of likelihood, that the optimization technique requires help in exploring the right portion of the parameter space. Without the right initialization, we get lower accuracy and reproducibility, as well as equally sized topics and an overestimation of the number of topics (see SI).
The computational overhead of using TopicMapping, for obtaining an initial guess of the parameter values, is small and the algorithm can be easily parallelized. To demonstrate this fact, we applied TopicMapping to a sample of the English Wikipedia with more than a million documents and almost a billion words (see Fig. 5).

VI. CONCLUSIONS
Ten years since its introduction, there has been surprisingly little research on the limitations of LDA optimization techniques for inferring topic models [21]. We are able to obtain a remarkable improvement in method validity by using a much simpler objective function [26] to obtain an educated guess of the parameter values in the latent generative model. This guess is obtained exclusively using word-word correlations to estimate topics, whereas word-document correlations are accounted for later in refining the initial guess. The algorithm is related to some recent work on spectral algorithms [13,14]. However, here we propose a practical implementation which makes no assumption about topic separability or the number of topics, as most spectral algorithms do. Interestingly, TopicMapping provides only slight improvements in terms of likelihood (because of the high degeneracy of the likelihood landscape), but nevertheless yields much better accuracy and reproducibility.

A. Comparing models
Here, we describe the algorithm for measuring the similarity between two models, p and q. Both topic models are described by two probability distributions: p(topic|doc) and p(word|topic). Given a document, we would like to compare two distributions: p(t|doc) and q(t'|doc). The problem is not trivial because the topics are not labeled: the numbers we use to identify the topics in each model are just one of the K! possible permutations of their labels. Documents, by contrast, carry the same labels in both models. For this reason, it is easy to quantify the similarity of topics t and t' from different models if we look at which documents are in these topics: we can use Bayes' theorem to compute p(doc|t) and q(doc|t') and compare these two probability distributions. We propose to measure the distance between p(doc|t) and q(doc|t') as the 1-norm (or Manhattan distance):
$$\| p - q \|_1 = \sum_{\mathrm{doc}} \left| p(\mathrm{doc}|t) - q(\mathrm{doc}|t') \right| .$$
Since we are dealing with probability distributions, $\| p - q \|_1 \le 2$. We can then define the normalized similarity between topics t and t' as:
$$s(t, t') = 1 - \tfrac{1}{2} \| p - q \|_1 .$$
To get a global measure of how similar one model is with respect to the other, we compare each topic t with all topics t' and we pick the topic which is most similar to t. This is the similarity we get best matching model p versus q:
$$\mathrm{BM}(p \to q) = \sum_{t} p(t) \max_{t'} s(t, t'),$$
where BM stands for Best Match, and the arrow indicates that each topic in p looks for the best matching topic in q. Of course, we can make this similarity symmetric by averaging: $\mathrm{BM} = \tfrac{1}{2} [\mathrm{BM}(p \to q) + \mathrm{BM}(p \leftarrow q)]$. Although this similarity is normalized between 0 and 1, it does not inform us about how similar the two models are compared to what we could get with random topic assignments. For this reason, we also compute the average similarity $\mathrm{BM}(p \to q_s)$, where we randomly shuffle the document labels in model q. Our null model similarity is then defined as $\mathrm{BM}_{\mathrm{rand}} = \tfrac{1}{2} [\mathrm{BM}(p \to q_s) + \mathrm{BM}(p_s \leftarrow q)]$. Finally, we define our measure of normalized similarity between the two models by rescaling BM with respect to $\mathrm{BM}_{\mathrm{rand}}$. An analogous similarity score can be defined for words using p(word|topic) instead of p(doc|topic).
Figure 4: A. Creating synthetic corpora using the generative model. For each document, p(topic|doc) is sampled from a Dirichlet distribution whose hyper-parameters are defined as $\alpha_{\mathrm{topic}} = K \times p(\mathrm{topic}) \times \alpha$, where K is the number of topics, p(topic) is the probability (i.e. the size) of the topic and α is a parameter which tunes how mixed documents are: smaller values of α yield a simpler model where documents make use of fewer topics. We also have a parameter to fix the fraction of generic words, and we implement a similar method for deciding p(word|topic) for specific and generic words (see Methods). Once the latent topic structure is chosen, we write a corpus drawing words with probabilities given by the mixture of topics. B. The performance of the topic modeling algorithms on synthetic corpora. In all our tests, we generate a corpus of 1000 documents, of 50 words each, and our vocabulary is made of 2000 unique equiprobable words. We set the number of topics K = 20 and we input this number in LDA and PLSA. "Equally sized" means all the topics have equal probability p(topic) = 5%, while in the "unequally sized" case, 4 large topics have probability 15% each, while the other 16 topics have probability 2.5%. LDA(s) and LDA(r) refer to seeded and random initialization for LDA (variational inference). The plots show the median values as well as the 25th and 75th percentiles.
Figure 5: Here, we show the topics found by TopicMapping after one single LDA iteration: indeed, this dataset represents an example where optimizing LDA until convergence gives rather different results (see SI, Sec. 1.6 and Sec. 10). We highlight the top topics that account for 80% of the total documents: those are just a handful of topics which are very easy to interpret (left). The inset shows the topics we find on the sub-corpus of documents assigned to the main topic "General Knowledge".
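The model-comparison procedure of Sec. A can be sketched as follows. Each model is represented here as a dictionary mapping a topic label to its p(doc|topic) distribution, together with the topic weights p(t). The final rescaling, (BM − BM_rand) / (1 − BM_rand), is an assumed chance-corrected form chosen for illustration; function names and the number of shuffles are likewise hypothetical.

```python
import random

def l1_distance(p, q):
    """1-norm between two distributions over documents (dict doc -> prob)."""
    return sum(abs(p.get(d, 0.0) - q.get(d, 0.0)) for d in set(p) | set(q))

def best_match(p_topics, p_weights, q_topics):
    """BM(p -> q): each topic in p looks for its most similar topic in q."""
    total = 0.0
    for t, p_doc in p_topics.items():
        best = max(1.0 - 0.5 * l1_distance(p_doc, q_doc)
                   for q_doc in q_topics.values())
        total += p_weights[t] * best
    return total

def shuffle_docs(topics, rng):
    """Randomly permute the document labels of a model (null model)."""
    docs = sorted({d for dist in topics.values() for d in dist})
    shuffled = docs[:]
    rng.shuffle(shuffled)
    relabel = dict(zip(docs, shuffled))
    return {t: {relabel[d]: v for d, v in dist.items()}
            for t, dist in topics.items()}

def normalized_similarity(p_topics, p_w, q_topics, q_w, n_shuffles=20, seed=0):
    rng = random.Random(seed)
    bm = 0.5 * (best_match(p_topics, p_w, q_topics)
                + best_match(q_topics, q_w, p_topics))
    bm_rand = 0.0
    for _ in range(n_shuffles):
        bm_rand += 0.5 * (best_match(p_topics, p_w, shuffle_docs(q_topics, rng))
                          + best_match(q_topics, q_w, shuffle_docs(p_topics, rng)))
    bm_rand /= n_shuffles
    # Assumed chance-corrected rescaling; undefined if bm_rand equals 1.
    return (bm - bm_rand) / (1.0 - bm_rand)

# Model q is model p with permuted topic labels; model r groups docs differently.
p = {0: {"d1": 0.5, "d2": 0.5}, 1: {"d3": 0.5, "d4": 0.5}}
q = {"x": {"d3": 0.5, "d4": 0.5}, "y": {"d1": 0.5, "d2": 0.5}}
r = {"a": {"d1": 0.5, "d3": 0.5}, "b": {"d2": 0.5, "d4": 0.5}}
w_p = {0: 0.5, 1: 0.5}
w_q = {"x": 0.5, "y": 0.5}
same = normalized_similarity(p, w_p, q, w_q)
partial = best_match(p, w_p, r)
```

Because topics are matched through the documents they contain, relabeling the topics (p versus q) leaves the similarity at its maximum, while regrouping the documents (p versus r) lowers the best-match score.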

B. Generating synthetic corpora
The algorithm we used to generate synthetic datasets relies on the generative model assumed by LDA. First, we specify the number of documents and the number of words in each document, $L_d$. For simplicity, we set the same number of words for each document, $L_d = L$. Next, we set the number of topics K and the probability distribution of each topic, p(topic). Finally, we specify the number of words in our vocabulary, $N_w$, and the probability distribution of each word, p(word). For the sake of simplicity, we used uniform probabilities for p(word), although the same model can be used for arbitrary probability distributions. All these parameters define the size of the corpus. The other aspect to consider is how mixed documents are across topics and topics are across words: this can be specified by one hyper-parameter α, whose use will be made clear in the following. The algorithm works in the following steps:
1. For each document doc, we decide the probability this document will make use of each topic: p(topic|doc). These probabilities are sampled from the Dirichlet distribution with parameters $\alpha_{\mathrm{topic}} = K \times p(\mathrm{topic}) \times \alpha$. The definition is such that topic will be used in the overall corpus with probability p(topic), while the factor K is a normalization which assures that we get $\alpha_{\mathrm{topic}} = \alpha$ for equiprobable topics. In this particular case, α = 1 means that documents are assigned to topics drawing the probabilities uniformly at random (see SI for more on the Dirichlet distribution).
2. For each topic, we need to define a probability distribution over words: p(word|topic). For this purpose, we first compute p(topic|word) for each word, sampling the same Dirichlet distribution as before ($\alpha_{\mathrm{topic}} = K \times p(\mathrm{topic}) \times \alpha$). Second, we get p(word|topic) from Bayes' theorem:
$$p(\mathrm{word}|\mathrm{topic}) = \frac{p(\mathrm{topic}|\mathrm{word})\, p(\mathrm{word})}{\sum_{\mathrm{word}'} p(\mathrm{topic}|\mathrm{word}')\, p(\mathrm{word}')} .$$
3. We now have all we need to generate the corpus. Every word in document doc can be drawn by, first, selecting topic with probability p(topic|doc) and, second, choosing word with probability p(word|topic).
Small values of the parameter α will yield "easy" corpora where documents are mostly about one single topic and words are specific to a single topic (Fig. 4). For simplicity, we keep α constant for all documents and words. However, it is highly unrealistic that all words are mostly used in a single topic, since every realistic corpus contains generic words. To account for this, we divide the words into two classes, specific and generic words: for the former class, we use the same α as above, while for generic words we set α = 1. The fraction of generic words is a second parameter we set.
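The three steps above, including the split into specific and generic words, can be sketched with the standard library alone. This is an illustrative sketch rather than the code used for the paper: Dirichlet samples are obtained by normalizing Gamma draws, p(word) is uniform so that Bayes' theorem reduces to a column normalization, and the function names and parameter values are hypothetical.

```python
import random

def dirichlet(alphas, rng):
    """Sample a Dirichlet vector by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    if total == 0.0:               # guard against underflow for tiny alphas
        draws = [0.0] * len(alphas)
        draws[rng.randrange(len(alphas))] = 1.0
        total = 1.0
    return [x / total for x in draws]

def generate_corpus(n_docs, doc_length, topic_probs, vocab_size, alpha,
                    generic_fraction=0.0, seed=0):
    rng = random.Random(seed)
    K = len(topic_probs)

    def alphas(a):
        # alpha_topic = K * p(topic) * alpha
        return [K * p * a for p in topic_probs]

    # Step 1: p(topic|doc) for every document.
    p_topic_doc = [dirichlet(alphas(alpha), rng) for _ in range(n_docs)]
    # Step 2: p(topic|word); the first n_generic words are generic (alpha = 1).
    n_generic = int(generic_fraction * vocab_size)
    p_topic_word = [dirichlet(alphas(1.0 if w < n_generic else alpha), rng)
                    for w in range(vocab_size)]
    # Bayes' theorem with uniform p(word): p(word|topic) is proportional
    # to p(topic|word), normalized over words.
    p_word_topic = []
    for t in range(K):
        column = [p_topic_word[w][t] for w in range(vocab_size)]
        norm = sum(column)
        p_word_topic.append([x / norm for x in column])
    # Step 3: draw a topic for each word slot, then a word from that topic.
    words = list(range(vocab_size))
    corpus = []
    for d in range(n_docs):
        topics = rng.choices(range(K), weights=p_topic_doc[d], k=doc_length)
        corpus.append([rng.choices(words, weights=p_word_topic[t], k=1)[0]
                       for t in topics])
    return corpus, p_topic_doc, p_word_topic

# Four equally sized topics, fairly unmixed documents, 10% generic words.
corpus, p_topic_doc, p_word_topic = generate_corpus(
    n_docs=50, doc_length=20, topic_probs=[0.25] * 4,
    vocab_size=200, alpha=0.1, generic_fraction=0.1)
```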

OUTLINE
The supplementary material is organized as follows: • Sec. S1 provides analytical insights on the likelihood landscape: in particular, we discuss the theoretical limitations of PLSA [9] and symmetric LDA [10] in finding the correct generative model. Also, in Sec. S1 F we present an additional example which suggests why equally sized topics often have better likelihood.
• Sec. S2 describes the network approach we take for topic modeling.
• Sec. S3 shows that standard LDA tends to overestimate the number of topics, and to find equally sized topics.
• Sec. S4 presents a more detailed analysis of the synthetic datasets: among other things, we visualize the algorithms' results.
• Sec. S6 discusses the hierarchical topics of the Web of Science dataset and the role of the p-value for TopicMapping.
• Sec. S8 shows the topics we found on a large sample of the English Wikipedia.
• Sec. S9 shows that TopicMapping often provides models with higher likelihood, if we compare models with the same effective number of topics.
• Sec. S10 is an appendix with some more technical information about the calculations presented in Sec. S1, some clarifications about Dirichlet distributions and measuring perplexity, and some technical information about the algorithms' usage.

Most topic model optimizations are known to be computationally hard problems [19]. However, not much is known about how the roughness of the likelihood landscape affects the algorithms' performance.
We investigate this question by (i) defining a simple generative model, (ii) generating synthetic data accordingly and (iii) measuring how well the algorithms recover the generative model (which is considered the "ground truth").
Throughout the study, we examine different generative models. In this section, we study the simplest among those, the language test. For this model, we prove that, if the topics are not sufficiently equal in size, the model which maximizes the likelihood optimized by PLSA and symmetric LDA can be different from the generative model. More specifically, we show that it is possible to find an extremely large number of alternative models (with the same number of topics) which overfit some topics and underfit some others but have a better likelihood than the true generative model. Symmetric LDA is the version of LDA where the prior α is assumed to be the same for all topics, and it is probably the most commonly used. For asymmetric LDA, which allows different priors, the correct generative model has the highest likelihood in the language test. However, we show that the ratio between the log-likelihood of the generative model and that of the alternative models can be arbitrarily close to 1, even in the limit of an infinite number of documents and an infinite number of words per document. This implies that, even when the amount of available information increases, the likelihood of the generative model will not increase relative to the others. Below, we also give some quantitative estimates.

B. The simplest generative model
Let us call K the number of topics. Each topic has a vocabulary of $N_w$ words, and for the sake of simplicity we assume all the words are equiprobable. We also assume that we cannot find the same word in two different topics, so that we are actually dealing with fully disambiguated languages. Then, each document is entirely written in one of the languages, sampling $L_d$ random words from the corresponding vocabulary (we use the same number of words $L_d$ for each document). This should be a very simple problem, since there is neither mixing of words across topics, nor of topics across documents.
Let us compute the log-likelihood, $\log \mathcal{L}_{\mathrm{true}}$, of the generative model. The process of generating a document works in two steps. We first select a language with probability p(L), and we then write a document with probability p(doc|L):
$$\log \mathcal{L}_{\mathrm{true}} = \log p(L) + \log p(\mathrm{doc}|L). \tag{S3}$$
Let us focus on the second part, $\log \mathcal{L}'_{\mathrm{true}} = \log p(\mathrm{doc}|L)$. After we have selected the language that we are going to use, every document has the same probability of being generated:
$$\log \mathcal{L}'_{\mathrm{true}} = -L_d \log N_w. \tag{S4}$$
We will also consider p(L) later. We stress that $\log \mathcal{L}'_{\mathrm{true}}$ is the log-likelihood per document. The prime symbol is a reminder that the likelihood is computed given that we know which language we are using for the document.
Now, let us compute the log-likelihood of an alternative model, where one language (say English) is overfitted in two dialects, and two other languages (say French and Spanish) are merged. Fig. S1 illustrates how we construct the alternative model. French and Spanish are just one topic, in which each French and Spanish word is equiprobable. The English words instead are arbitrarily divided into two groups: the first English dialect makes use of words from the first group with probability $f_1$ and words from the second group with probability $g_1$, and the second dialect has probabilities $f_2$ and $g_2$ for the two groups. We assume that the first group of words is more likely for the first dialect, i.e. $f_1 > g_1$, while the situation is reversed for the second dialect: $g_2 > f_2$. The general idea is that if a document, just by chance, is using words from the first group with higher probability, it might be fitted better by the first dialect: overfitting the noise improves the likelihood and, if the English portion of the corpus is big enough, this improvement might overcome what we lose by underfitting French and Spanish.
In Sec. S10 A, we prove that the log-likelihood per English document of the alternative model exceeds that of the generative model by more than $1/\pi$, regardless of the number of words per document, the size of the vocabulary or the number of documents. More precisely, if $N_w \gg L_d$, the difference can be even higher, $(\log 2)^2$. Calling $\mathcal{L}^\star_E$ the likelihood per English document in the alternative model (given the language), we have that:

$$\langle \log \mathcal{L}^\star_E \rangle = -L_d \log N_w + C, \qquad C > 1/\pi. \tag{S5}$$

Fig. S2 shows the log-likelihood difference per English document, $C$, as a function of $L_d/N_w$.
Keeping the same number of topics, the alternative model pays some cost for underfitting Spanish and French. Since the two languages are merged, the size of the vocabulary is $2N_w$ and the log-likelihood per Spanish or French document is:

$$\log \mathcal{L}^\star_U = -L_d \log (2 N_w). \tag{S6}$$

Now, to compute the expected log-likelihood of the alternative model, we also need to know how often the different languages are used. Let us call $f_E$ the fraction of English documents, and $f_U$ the fraction of documents written in Spanish or French (underfitted documents).
The average log-likelihood per document of the alternative model can then be written as:

$$\langle \log \mathcal{L}^\star_{\rm alt} \rangle = -L_d \log N_w + f_E\, C - f_U\, L_d \log 2. \tag{S7}$$

We recall that, so far, we have not considered the probability that each document will pick a certain language, $p(L)$. Symmetric and asymmetric LDA make different assumptions at this point and we treat them both in the next two sections.
C. Symmetric LDA
PLSA does not account for the probability of picking a language, $p(L)$, in the likelihood. LDA instead does: the hyperparameters $\alpha_L$ are a global set of parameters (one per topic) which tune the probabilities that each document makes use of each topic. In our case, each document is uniquely assigned to a language: therefore, for each document, there is a language which has probability 1 and all the other languages have probability 0. This corresponds to the limiting case $\alpha_L = \kappa\, p(L)$, where the proportionality factor $\kappa$ is very small.
For symmetric LDA, however, all the $\alpha_L$ are equal. This implies that, regardless of the actual size of the languages, the algorithm fits the data with a model for which $p(L) = 1/K$ (we recall that $K$ is the number of languages). This term is the same for both models and cancels in their difference. Therefore:

$$\langle \log \mathcal{L}_{\rm alt} \rangle - \langle \log \mathcal{L}_{\rm true} \rangle = f_E\, C - f_U\, L_d \log 2. \tag{S8}$$

If $f_E$ is big enough, the likelihood of the alternative model can be higher than that of the generative model. To be more concrete, let us consider an example. If $L_d = 10$ and $N_w = 20$, in Sec. S10 A we show that $C$ can be as high as 0.476. Let us consider the simplest case of just three topics, for which $f_U = 1 - f_E$. Setting the right-hand side of Eq. S8 to zero, we find that if $f_E > 0.936$ the alternative model has a better likelihood. If the topics are not balanced enough, symmetric LDA cannot find the right generative model, even though there is no mixing of any sort. However, this critical value depends on $L_d$, and for increasing $L_d$ the generative model will eventually get a better likelihood. The case $L_d \gg 1$ is treated in detail below.

D. Asymmetric LDA
For asymmetric LDA, the model can fit the actual language probabilities $p(L)$, so that the average of $\log p(L)$ contributes to the likelihood of the generative model:

$$\langle \log \mathcal{L}_{\rm true} \rangle = -L_d \log N_w - H_{\rm true}, \tag{S9}$$

where $H_{\rm true}$ is the entropy of the language probability distribution, $H_{\rm true} = -\sum_L p(L) \log p(L)$.

For the sake of simplicity, let us assume that French and Spanish are equiprobable, as well as the two English dialects (see Sec. S10 A). For the alternative model:

$$H_{\rm alt} = H_{\rm true} + (f_E - f_U) \log 2. \tag{S10}$$

From Eq. S7, we finally get:

$$\langle \log \mathcal{L}_{\rm alt} \rangle - \langle \log \mathcal{L}_{\rm true} \rangle = f_E\, (C - \log 2) - f_U\, (L_d - 1) \log 2. \tag{S11}$$

Since $\log 2 > C$, the generative model now has the highest likelihood: in principle, asymmetric LDA is always able to find the generative model. The ratio of the two log-likelihoods, if the documents are long enough, becomes:

$$\frac{\langle \log \mathcal{L}_{\rm alt} \rangle}{\langle \log \mathcal{L}_{\rm true} \rangle} \simeq 1 + f_U\, \frac{\log 2}{\log N_w}. \tag{S12}$$

The same equation holds for symmetric and asymmetric LDA, as well as PLSA. Therefore, even if we had an infinite amount of information (infinite number of documents and words per document), the ratio of the two likelihoods can be very close to 1.
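The numerical example above can be checked with a few lines of Python. This is only a sketch: it assumes that the log-likelihood differences for symmetric and asymmetric LDA take the forms $f_E C - f_U L_d \log 2$ and $f_E (C - \log 2) - f_U (L_d - 1) \log 2$ respectively (forms chosen because they reproduce the critical value 0.936 quoted in the text), and the function names are illustrative.

```python
import math

LOG2 = math.log(2)

def symmetric_gain(f_E, f_U, L_d, C):
    """Alternative minus generative log-likelihood per document
    for symmetric LDA (assumed form, see lead-in)."""
    return f_E * C - f_U * L_d * LOG2

def asymmetric_gain(f_E, f_U, L_d, C):
    """Same difference once the language probabilities p(L) are fitted
    (assumed form, see lead-in)."""
    return f_E * (C - LOG2) - f_U * (L_d - 1) * LOG2

def critical_english_fraction(L_d, C):
    """Value of f_E at which symmetric_gain vanishes when f_U = 1 - f_E
    (the three-topic case)."""
    return L_d * LOG2 / (C + L_d * LOG2)

f_star = critical_english_fraction(L_d=10, C=0.476)
print(round(f_star, 3))  # 0.936
```

Above this fraction of English documents the symmetric difference is positive, while the asymmetric difference stays negative for any $C < \log 2$.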

E. Finding the generative model in practice
The number of alternative models is huge.
In Sec. S10 A, we show that if each language had a vocabulary size $N_w = 1000$, we can find $\sim K \times 10^{300}$ alternative models (this is a conservative estimate). Assuming $f_U = 0.2$ (which would correspond to 10 equiprobable topics), the relative difference in their log-likelihood is $\sim 2\%$, as we can estimate from Eq. S12.
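As a quick check of the $\sim 2\%$ estimate, the snippet below evaluates the relative log-likelihood difference for long documents, assuming it is given by $f_U \log 2 / \log N_w$ (our reading of Eq. S12; the function name is illustrative).

```python
import math

def relative_loglik_difference(f_U, N_w):
    """Relative difference between the log-likelihoods of the alternative
    and generative models for long documents (assumed form of Eq. S12)."""
    return f_U * math.log(2) / math.log(N_w)

print(round(relative_loglik_difference(f_U=0.2, N_w=1000), 3))  # 0.02
```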
One might argue that, even if the relative difference of the log-likelihood is small, the basin of attraction of the generative model could be very large, so that optimization algorithms might be very effective in finding it anyway. Fig. S3 shows that the probability of finding the correct model for equiprobable languages is $\sim 20\%$, while in the heterogeneous case it is $\sim 2\%$ (this was computed using variational inference [10]).

F. Model competition in hierarchical data
In the previous sections, we only discussed the difference in likelihood of the generative model and an alternative model with the same number of topics K. In this section, we consider a similar test case for which, however, we fit the data with a model with K − 1 topics.
The generative model we consider here is illustrated in Fig. S4: we have $K - 1$ topics which have no words in common with any other topic and one bigger topic, say English, which has two subtopics, say "music" and "science", which share some words. Let us call $U_M$ the number of words in one of the English subtopics (music) which cannot be found in the other subtopic, $U_S$ the number of words which can only be found in the other subtopic (science), and $C$ the number of words in common between the two subtopics. We further assume that $U_M = U_S = U$, that the subtopics are equiprobable and that, given a subtopic, each word is equiprobable. Let us call $N_w$ the number of words in each non-English language, $p_E$ the fraction of English documents and $p_k$ the fraction of documents written in a different language (for the sake of simplicity, all languages but English are equiprobable).
This model should be fitted with $K$ topics. However, let us assume that we do not know the exact number of topics (as is usually the case) and we try to fit the data with $K - 1$ topics. In Fig. S4 we show two possible competing models: the first model correctly finds all the languages, while the second correctly finds the English subtopics but merges two languages.
With calculations similar to those above, we can prove (see Sec. S10 B) that the first model has higher likelihood if:

$$2 p_k > p_E\, \frac{U}{U + C}. \tag{S13}$$

The previous equation holds for symmetric LDA, and also for asymmetric LDA if $L_d \gg 1$ (the exact expression for asymmetric LDA can be found in Sec. S10 B). If $U = 0$, the first model is always better (there are no subtopics). If $C = 0$, the better model is the one which underfits the smaller fraction of documents. In general, if English is used enough and $U > 0$, the second model fits the data better.
Let us consider a numerical example: $p_E = 50\%$, $U_M = U_S = 50$ words and $C = 900$ words (1,000 total words in the English vocabulary). This means that 90% of the English words are used by both subtopics. Eq. S13 tells us that English is split into the two subtopics if there are two other topics to merge with $2p_k < 2.6\%$.
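The threshold in this example can be reproduced directly. The sketch below assumes that the condition of Eq. S13 reads $2p_k > p_E\, U/(U + C)$, which is the form that yields the 2.6% quoted above; function names are illustrative.

```python
def split_threshold(p_E, U, C):
    """Value of 2 * p_k below which splitting English into its subtopics
    (and merging two small languages) gives the higher likelihood.
    Assumed form of Eq. S13."""
    return p_E * U / (U + C)

def languages_model_wins(p_E, p_k, U, C):
    """True if the model that resolves all languages has higher likelihood."""
    return 2 * p_k > split_threshold(p_E, U, C)

# Example from the text: p_E = 50%, U = 50, C = 900
print(round(100 * split_threshold(0.5, 50, 900), 1))  # 2.6 (percent)
```

With $U = 0$ the threshold is zero, so the model that resolves all languages wins for any $p_k > 0$, as stated in the text.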
We believe that this is the basic reason why big journals such as Cell and Astronomical Journal are split by standard LDA in the Web of Science dataset (see Sec. S9). In general, since real-world topics are likely to display a hierarchical structure similar to the one described here, we argue that heterogeneity in the topic distribution makes standard algorithms prone to finding subtopics of large topics before resolving smaller ones.

S2. A NETWORK APPROACH TO TOPIC MODELING
We give here a detailed description of TopicMapping. The method works in three steps.
First, we build a network of words, where links connect terms appearing in the same documents more often than what we could expect by chance. Second, we define the topics as clusters of words in such a network, using the Infomap method [26] and then we compute the probabilities p(topic|doc) and p(word|topic) locally maximizing a PLSA-like likelihood. Finally, we can refine the topics further optimizing the (asymmetric) LDA likelihood via variational inference [10].
a. How to define the network. A corpus can be seen as a weighted bipartite network of words and documents: every word $a$ is connected to all documents where the word appears. The weight $\omega^d_a$ of the link is the number of times the word is repeated in document $d$.
From this network, we would like to define a unipartite network of words which have many documents in common. A very simple measure of similarity between any pair of words $a$ and $b$ is the dot product similarity:

$$z_{a,b} = \sum_d \omega^d_a\, \omega^d_b. \tag{S14}$$

From this definition, it is clear that generic words, like "to" or "of", will be strongly connected to many more specific words, bringing together terms which belong to otherwise distant semantic areas. A possible way to filter out generic words is to compare the corpus to a simple null model where all words are randomly shuffled among documents.
For this purpose, we need to consider the probability distribution $p(z_{a,b})$ of the dot product similarity defined in Eq. S14. In the null model each weight $\omega^d_a$ is a random variable which follows a hypergeometric distribution with parameters given by: the total number of words in document $d$, $L_d$; the total number of occurrences of word $a$ in the whole corpus, $s_a = \sum_d \omega^d_a$; and the total number of words in the corpus, $L_C = \sum_d L_d$. The mean $\langle \omega^d_a \rangle$ is:

$$\langle \omega^d_a \rangle = \frac{s_a L_d}{L_C}. \tag{S15}$$

Assuming a large enough number of documents, we can neglect the correlations among the variables $\omega^d_a$ and, from Eqs. S14 and S15, we get:

$$\langle z_{a,b} \rangle = \sum_d \langle \omega^d_a \rangle \langle \omega^d_b \rangle = \frac{s_a s_b}{L_C^2} \sum_d L_d^2. \tag{S16}$$

Figure S3: In this test, the corpus has 5000 documents of 100 words each, and the vocabulary of each language has 1000 equiprobable words. In the equally sized case, we consider 10 equiprobable languages, while in the heterogeneous case, we consider 2 languages with probability 30% each, and 8 languages with probability 5%. A. Cumulative probability of the relative difference between the log-likelihood of the generative model and the one found by the algorithm. B. Scatter plot of the relative difference of the log-likelihood versus the accuracy of the algorithm (accuracy is the Best Match similarity of the two models, see main text). Clear clusters are visible according to how many languages are overfitted. Fig. 2 of the main paper supports the same conclusion when we remove the assumption that words are equiprobable.
Since $z_{a,b}$ is the sum of rare events (if $L_C \gg 1$), its probability distribution can be well approximated by a Poisson distribution ${\rm Pois}_{z_{a,b}}(z)$ with average given by Eq. S16, as shown in Fig. S5.
Finally, our procedure to filter out the noise consists in fixing a p-value $p$ and computing, for all pairs of words $a$ and $b$ which share at least one document, $z_{a,b} - Z_p(s_a, s_b)$, where the latter term is the $(1-p)$-quantile of the Poisson distribution ${\rm Pois}_{z_{a,b}}(z)$. More precisely, $Z_p(s_a, s_b)$ is the largest non-significant dot product similarity:

$$Z_p(s_a, s_b) = \max \Big\{ x : \sum_{z \geq x} {\rm Pois}_{z_{a,b}}(z) > p \Big\}. \tag{S17}$$

If positive, $z_{a,b} - Z_p(s_a, s_b)$ is the weight of the link between words $a$ and $b$. Fig. S6 shows an example.
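The construction of the word network described above can be sketched in plain Python. This is an illustrative sketch, not the released TopicMapping code. It assumes that documents are given as lists of tokens, that the null-model mean is $s_a s_b \sum_d L_d^2 / L_C^2$, and that the threshold $Z_p$ is the largest $x$ whose Poisson upper tail $P(z \geq x)$ still exceeds $p$. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def poisson_threshold(mean, p):
    """Largest x such that P(z >= x) > p for z ~ Poisson(mean)
    (assumed definition of Z_p)."""
    x, pmf, cdf = 0, math.exp(-mean), 0.0  # pmf = P(z = x), cdf = P(z <= x - 1)
    while 1.0 - (cdf + pmf) > p:           # P(z >= x + 1) is still non-significant
        cdf += pmf
        x += 1
        pmf *= mean / x
    return x

def word_network(docs, p=0.05):
    """Return {(a, b): weight} for word pairs whose dot product similarity
    is significantly larger than expected in the shuffled null model."""
    counts = [Counter(doc) for doc in docs]        # omega[d][a]
    lengths = [len(doc) for doc in docs]           # L_d
    L_C = sum(lengths)
    norm = sum(l * l for l in lengths) / L_C ** 2  # sum_d L_d^2 / L_C^2
    s = Counter()                                  # corpus-wide occurrences s_a
    for c in counts:
        s.update(c)
    z = Counter()                                  # dot product similarity z_ab
    for c in counts:
        for a, b in combinations(sorted(c), 2):    # only pairs sharing a document
            z[a, b] += c[a] * c[b]
    links = {}
    for (a, b), z_ab in z.items():
        weight = z_ab - poisson_threshold(s[a] * s[b] * norm, p)
        if weight > 0:
            links[a, b] = weight
    return links

docs = [["star", "galaxy", "the"] * 4, ["star", "galaxy", "the"] * 4,
        ["cell", "protein", "the"] * 4, ["cell", "protein", "the"] * 4]
links = word_network(docs)
print(sorted(links))  # [('cell', 'protein'), ('galaxy', 'star')]
```

In this toy corpus the generic word "the" co-occurs with every other word exactly as often as the null model predicts, so all of its links are filtered out and two clean word clusters remain.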
b. Finding the topics as clusters of words and Local Likelihood Optimization. Once the network is built, we detect clusters of highly connected nodes using the Infomap method [26]. This provides us with a hard partition of words, meaning that words can only belong to a single cluster.
We now discuss how we can compute the distributions p(topic|doc) and p(word|topic), given a partition of words.
We recall that in the probabilistic model of how documents are generated, we assume that every word w appearing in document d has been drawn from a certain topic. We are in the realm of the bag of words approximation, and therefore we are completely discarding any information about the structure of the documents. Then, it is reasonable to assume that every time we see a certain word in the same document, it was always generated by the same topic: let us denote this topic as τ (w, d).
We identify the topic $\tau(w, d)$ with the single module where word $w$ is located by Infomap, $\tau(w)$: in fact, since the partition is hard (no words can sit in different modules), there is no dependency on the documents. Therefore, $p(t|w) = \delta_{t,\tau(w)}$ and:

$$p(w, t) = p(w)\, \delta_{t,\tau(w)} \quad {\rm and} \quad p(t|d) = \frac{1}{L_d} \sum_w \omega^d_w\, \delta_{t,\tau(w)}. \tag{S18}$$

It is also useful to introduce $n(w, t) = L_C\, p(w, t)$, which is the number of times topic $t$ was chosen and word $w$ was drawn.
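Given a hard partition of words, the quantities in Eq. S18 reduce to simple counting. The sketch below assumes that documents are lists of tokens and that the partition is a dictionary mapping each word to its module; the names `topic_mixture` and `topic_word_counts` are illustrative.

```python
from collections import Counter

def topic_mixture(doc, word_topic):
    """p(t|d): fraction of the words of the document whose module is t."""
    counts = Counter(word_topic[w] for w in doc)
    return {t: c / len(doc) for t, c in counts.items()}

def topic_word_counts(docs, word_topic):
    """n(w, t): number of times word w was drawn from topic t in the corpus."""
    n = Counter()
    for doc in docs:
        for w in doc:
            n[w, word_topic[w]] += 1
    return n

word_topic = {"star": 0, "galaxy": 0, "cell": 1}
doc = ["star", "star", "cell", "galaxy"]
print(topic_mixture(doc, word_topic))  # {0: 0.75, 1: 0.25}
```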
So far, we have a model where all words are very specific to topics and documents use many topics, which is probably far from being a good candidate generative model. The model can be substantially improved by optimizing the PLSA-like likelihood:

$$\log \mathcal{L} = \sum_d \sum_w \omega^d_w \log \sum_t p(w|t)\, p(t|d). \tag{S19}$$

We now describe a series of very local moves aimed at improving the likelihood of the model. The local optimization algorithm aims at making the topics fuzzier and the documents more specific to fewer topics. To do so, it simply finds, for each document, topics which are infrequent (a more precise definition follows) and "moves" the words drawn from those topics to the most important topic in that document.

Figure S4: Generative model and two competing models. In this example, we have $K - 1$ languages but one language (English) is bigger than the others and has two subtopics ("music" and "science"). $U_M$ is the number of words in the English vocabulary which can only be found in the music subtopic, $U_S$ is the equivalent for science, whereas $C$ is the number of common words between the two subtopics. If many documents are written in English, Model 2 has a better likelihood than Model 1.
1. For each document $d$, we find its most significant topic $\tau_d$, defined as the topic with the smallest p-value, considering a null model where each word is independently sampled from topic $t$ with probability $p(t) = \sum_w p(w)\, p(t|w)$. Calling $x$ the number of words which actually come from topic $t$ ($x = L_d \times p(t|d)$, see Eq. S18), the p-value of topic $t$ is then computed using a binomial distribution, $B(x; L_d, p(t))$.

2. For document $d$, we define the infrequent topics $t_{\rm in}$ as those which are used with probability smaller than a parameter $\eta$: $p(t_{\rm in}|d) < \eta$. We consider the most significant topic $\tau_d$ (see above) and we increment $p(\tau_d|d)$ by the sum of the probabilities of the infrequent topics, while all $p(t_{\rm in}|d)$ are set to zero. Similarly, $n(w, t)$ has to be decreased by $\omega^d_w$ for each word $w$ which belongs to an infrequent topic, and $n(w, \tau_d)$ is increased accordingly.

3. We repeat the previous step for all documents. We then compute $p(w, t) = n(w, t)/L_C$, as well as the likelihood of the model, $\mathcal{L}_\eta$, where we made explicit its dependency on $\eta$.

4. We loop over all possible values of $\eta$ (from 0% to 50% with steps of 1%) and we pick the model which maximizes $\mathcal{L}_\eta$.
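The document-level part of the move described in the steps above can be sketched as follows. Several details are assumptions made for illustration: the p-value is taken to be the upper binomial tail $P(X \geq x)$, the most significant topic is always kept even if its own probability is below $\eta$, and the update of $n(w, t)$ and the sweep over $\eta$ are left out. Function names are hypothetical.

```python
import math

def binomial_tail(x, n, q):
    """P(X >= x) for X ~ Binomial(n, q); assumed form of the topic p-value."""
    return sum(math.comb(n, k) * q**k * (1 - q)**(n - k) for k in range(x, n + 1))

def most_significant_topic(counts, p_topic):
    """Topic with the smallest p-value in a document.
    counts[t] is the number of words of the document drawn from topic t,
    p_topic[t] is the corpus-wide probability p(t) of the null model."""
    L_d = sum(counts.values())
    return min(counts, key=lambda t: binomial_tail(counts[t], L_d, p_topic[t]))

def prune_infrequent_topics(counts, p_topic, eta):
    """Move the words of topics with p(t|d) < eta to the most significant
    topic and return (tau_d, updated p(t|d))."""
    L_d = sum(counts.values())
    tau = most_significant_topic(counts, p_topic)
    kept = {tau: counts[tau]}
    for t, x in counts.items():
        if t == tau:
            continue
        if x / L_d < eta:
            kept[tau] += x      # infrequent topic: reassign its words to tau_d
        else:
            kept[t] = x
    return tau, {t: x / L_d for t, x in kept.items()}

counts = {0: 8, 1: 1, 2: 1}              # a 10-word document
p_topic = {0: 0.3, 1: 0.4, 2: 0.3}
print(prune_infrequent_topics(counts, p_topic, eta=0.15))  # (0, {0: 1.0})
```

In a full implementation this function would be called for every document and every value of $\eta$, and the resulting models compared through their likelihood.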
c. LDA Likelihood optimization. The model we find at this point can be refined further via iterations of the Expectation-Maximization algorithm optimizing the LDA likelihood. The algorithm closely follows the implementation from [10]. The main difference, however, is that, for computational efficiency, we use sparse data structures, where words and documents are assigned to only a subset of the topics.
In most cases, the model does not change very much and the algorithm converges very quickly. However, if topics are very heterogeneous in size, we might encounter situations similar to the one described in Sec. S1 F (see Sec. S8 for an example). In practice, the software records models every few iterations, allowing users to better explore the data.
d. Implementation details. Here, we would like to make a few points more precise.

Figure S6: [...] B. We build a network connecting words with weights equal to their dot product similarity. C. We filter non-significant weights, using a p-value of 5%. Running Infomap [26] on this network, we get two clusters and two isolated words ("study" and "research"). D. We refine the word clusters using a topic model: the two isolated words can now be found in both topics.

2. After running Infomap, we might find that some words have not been assigned to any topic, because none of their possible connections to other words has been considered significant. In each document which uses any of them, we automatically assign these words to its most significant topic, $\tau_d$.
3. Some (small) topics might not have been selected as the most significant by any document. We remove these topics before the filtering procedure: if we do not, high values of the filter $\eta$ will yield models where these topics do not appear at all, and this might penalize their likelihood just because the number of topics is diminished.
4. Depending on the application, it might also be useful to remove very small topics even if they were selected as the most significant by a handful of documents (this is especially important to avoid the following LDA optimization to inflate them, see Sec. S4 D). We used no threshold for the synthetic datasets, but we selected a threshold of 10 documents for the journals in Web of Science, and 100 documents for Wikipedia. In the implementation of the software, we let the users choose a threshold for removing small topics.
5. The initial α for LDA optimization was set to 0.01 for all topics.

S3. HELD-OUT LIKELIHOOD AND EFFECTIVE NUMBER OF TOPICS
The most widely used method for selecting the right number of topics consists in (i) holding out a certain fraction of documents (say 10% of the corpus), (ii) training the algorithm on the remainder of the dataset, and (iii) measuring the likelihood of the held-out corpus for the model obtained on the training set. The best number of topics should be the one for which the held-out likelihood is maximum. Fig. S7 shows that this method tends to give a higher number of topics than the actual one.
We also show that LDA tends to provide models in which $p({\rm topic})$ is fairly close to a uniform distribution. To assess this, we compare the entropy of the topic distribution, $h(p_t)$, with the maximum possible entropy, i.e. that achieved by equally probable topics: $h_u = \log_2 K$. In fact, it is easier to compare the exponential entropy [28] of the topic probability distribution, $2^{h(p_t)}$, with $K$. The former can be seen as an effective number of topics: it is the number of topics needed by a uniform distribution to achieve the same entropy. Fig. S7 shows that, indeed, the effective number of topics is rather close to the input $K$.
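The effective number of topics is straightforward to compute from a topic probability distribution. The following minimal sketch (the function name is illustrative) uses base-2 logarithms, as in the definition above, and evaluates the heterogeneous language distribution used in Fig. S3 as an example.

```python
import math

def effective_number_of_topics(p_topic):
    """Exponential entropy 2**h of a topic probability distribution."""
    h = -sum(p * math.log2(p) for p in p_topic if p > 0)
    return 2 ** h

print(effective_number_of_topics([0.25] * 4))                         # 4.0
print(round(effective_number_of_topics([0.3, 0.3] + [0.05] * 8), 2))  # 6.83
```

A uniform distribution over $K$ topics gives exactly $K$, while a heterogeneous distribution over the same number of topics gives a smaller value.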

S4. ADDITIONAL ANALYSIS ON THE SYNTHETIC DATASETS
In this section, we present five supplementary sets of results related to the synthetic datasets presented in Fig. 4 in the main paper. In the first section, we measure the performance of the algorithms in terms of perplexity [10] (a standard measure of quality for topic models) and we show that, for our case, this evaluation method has a fairly low discriminatory power. We then propose a visualization of the comparison between the correct generative model and the ones found by the algorithms we considered. The third section is dedicated to measuring the performance of the methods in case we do not have information about the correct number of topics to input. In the fourth section, we study how the performance of LDA is affected by the initial conditions of the optimization procedure, and we show that they are crucial, as expected. Finally, we compare the performance of TopicMapping before and after running LDA as a refinement step.

Figure S7: Held-out likelihood and effective number of topics for the three datasets we considered in the main paper. In the language test, we considered 5,000 documents, while, in the synthetic dataset, we set $\alpha = 10^{-3}$ and the fraction of generic words to 25%. The dashed black lines on the left indicate the number of topics $K$ that should have been selected by the method. The black line on the right-hand panels is $y = x$ (the highest achievable value of the effective number of topics) and the horizontal lines are the actual effective number of topics.
A. Perplexity
Fig. S8 shows the performance of the algorithms on the synthetic datasets in terms of perplexity (in Sec. S10 D we explain in detail how perplexity is defined). Algorithms which yield a lower perplexity are considered to achieve a better performance, because the model they provide is less "surprised" by a portion of the dataset which they have never seen before. The advantage of this approach is that it can be implemented for generic real-world datasets, where the actual generative model is unknown. However, in the case we study here, the measure performs poorly in discriminating the methods.

Figure S8: Perplexity of the algorithms on the synthetic datasets (see Fig. 4 in the main paper). Perplexity seems to have low discriminatory power in this test.
B. Visualizing topic models
Fig. S9 shows a visualization of the performance of the methods on the synthetic datasets. We selected a few runs where the algorithms achieved an average performance. The colors show in which way standard LDA and PLSA fail to recover the generative model. Similarly to what happens in the language test, some (small) topics are merged together (indicated by a "*" symbol) and some other topics are overfitted into two or more dialects.

C. Performances for different number of topics.
Here we discuss how the performance of LDA and PLSA changes if we do not know the exact number of topics. In the main paper, we have fed the algorithms the right number of topics, although we have shown (Sec. S3) that it is hard to guess this information. Here, we show what we get when setting a different number of topics, but still reasonably close to the right value ($K = 20$). In general, the performance gets worse as we move further from the correct number, although 15 or 25 topics sometimes give slightly better results. We also show that the results do not change very much if we increase the number of documents to 5,000.

Figure S9: Topic comparison for the synthetic datasets. All parameters are the same as in Fig. 4 in the main paper, and we set the fraction of generic words (words which are used uniformly across documents) to 40%. Every rectangle is split into 1000 horizontal bars, one per document. Each bar is divided into color blocks representing topics, with block size proportional to $p({\rm topic}|{\rm doc})$. The documents are sorted according to their most prominent topic. A. Performance of LDA, for equally sized topics and $\alpha = 0.001$. The "*" symbols indicate topics inferred by LDA in which two or more actual topics are merged. Top: comparison for documents. Bottom: same procedure for words: generic words are clearly distinguishable from specific ones. The numbers on the corners are obtained from the topic similarity (see main text). B. Unequally sized topics. We show results for two values of $\alpha$, 0.001 and 0.016. Only the comparison of documents is shown. We compare LDA, PLSA and TopicMapping.

D. LDA initial conditions
In this section, we discuss how the initial conditions affect the performance of LDA optimization. Two standard ways of initializing the topics have been considered: random and seeded. The former assigns random initial conditions, while the latter uses randomly sampled documents as seeds. We used both throughout the whole study, but we have only shown the seeded version in the WoS dataset (the difference in performance is not appreciable, though). Here we compare these two initializations with the performance of the method when we guess the best possible initial conditions, meaning that we start from the actual generative model (Fig. S11).
Similarly to the language test, starting from the generative model as initial conditions, we get an outstanding performance, which is also the optimal one in terms of likelihood. However, we checked that if we slightly change the number of topics, the performance gets worse and the likelihood improves. In Fig. S11, we show both performance and likelihood. "24 topics" refers to a model close to the generative one, but where we added 4 small topics from which only one single word can be drawn: more precisely, we pick a word at random, $w_r$, and we define these small topics with word probability distributions $p({\rm word}|{\rm topic}) = \delta_{{\rm word}, w_r}$. LDA will grow these small topics to increase the likelihood, overfitting the data and getting a worse performance. This is the main reason why we decided to threshold small topics in the Web of Science dataset (see Sec. S2).

E. TopicMapping guess
Here, we show the performance of TopicMapping just for the guess, i.e. before running the LDA optimization (see Fig. S12). We do not show the results for the language test because, in that case, there is no difference at all. In the systematic tests, instead, running LDA as a last step slightly improves the performance of the algorithm, although the difference is not dramatic. We found a remarkable difference only in the Wikipedia dataset (see Sec. S8), where the topic distribution provided by the guess was highly heterogeneous.

S5. ASYMMETRIC LDA
In this section we discuss the results we obtain using asymmetric LDA [21] (http://mallet.cs.umass.edu). The algorithm differs in two main respects from the other LDA method we used throughout the study: first, the prior probabilities of using a certain topic are not all equal and, second, the optimization algorithm is based on Gibbs sampling rather than variational inference [10]. Fig. S13 shows that the algorithm performs better than symmetric LDA in the language test, although it still struggles to recognize the languages if the number of documents is large and the language probabilities are unequal. The performance on the synthetic graphs is better than that of standard LDA for certain parameters only (see Fig. S14).

Figure S11 (panels: equally sized topics; unequally sized topics): How the initial conditions affect the performance of LDA. We checked four different ways of initializing the topics: random and seeded are the basic provided options. "Real model" refers to setting the underlying true parameters as initial conditions. "24 topics" refers to the right initial conditions where we added 4 small topics peaked on a single randomly chosen word. The log-likelihood improvement is defined as the relative difference in the log-likelihood we get with the different initial conditions compared to the seeded initialization. The plot shows mean values and standard deviations.

Figure S13 (panels: equally sized topics; unequally sized topics; reproducibility and accuracy versus number of documents): Performance of asymmetric LDA in the language test (same as Fig. 2 in the main text). We used 5,000 and 50,000 iterations for Gibbs sampling and we input the correct number of languages in the algorithm. We optimize the hyperparameters every 100 iterations, but performance is barely affected by the optimization interval. Curves are the median values and the shaded areas indicate 25th and 75th percentiles.

S6. THE HIERARCHY OF WOS DATASET
In this section, we study the subtopic structure of the Web of Science dataset. In fact, we expect to find subtopics in each journal. Although we do not know any "real" topic model to compare with, we can still measure the reproducibility of the algorithm.
Similarly to what we observed above, we find again that standard LDA is not reproducible and the effective number of topics is strongly affected by the input number of topics, see Fig. S15.
For TopicMapping, we observe that the number of topics is affected by the p-value we choose for filtering the noisy words. This does not happen in any of the other tests we have presented so far, which have a rather clear topic structure: there, choosing a p-value of 5% or 1% barely makes any difference. Instead, in analyzing Astronomical Journal abstracts, for instance, the topic structure is not so sharp anymore and we do observe that reducing the p-value provides a higher number of topics. Fig. S15 shows the results. For Astronomical Journal, with a p-value of 5% we only observe one topic. Decreasing the p-value to 1%, we start observing subtopics like "galaxi* observ* emiss*", "star cluster metal" or "orbit system planet". For Cell, we also observe that the effective number of topics increases for smaller p-values. However, in both cases, TopicMapping is much more reproducible.

Figure S15: Reproducibility and effective number of topics for LDA and TopicMapping for the scientific abstracts of Astronomical Journal and Cell. The number of topics can be tuned in LDA by changing the input number of topics. Similarly, in TopicMapping the resolution can be tuned to some extent by filtering words with different p-values. However, this effect is present only in corpora with a less defined topic structure than the language test or the synthetic graphs, for instance. Median and 25th and 75th percentiles are shown.

S7. COMPUTATIONAL COMPLEXITY
For a given vocabulary size, LDA's complexity is proportional to the number of documents times the number of topics.
The computational complexity of TopicMapping's guess is also linear in the number of documents. In particular, building the graph costs $O(\sum_d u_d^2)$, where $u_d$ is the number of unique words in document $d$. Infomap's complexity is of the same order of magnitude (smaller if we filter links), because the algorithm runs in a time proportional to the number of edges in the graph. Local PLSA-likelihood optimization is also linear in the number of documents, and can scale better than LDA with the number of topics, if the assignment of words to topics is sparse. In fact, we use sparse data structures to compute the topics for each document and each word, meaning that for each document, for instance, we do not handle a list of all topics (including topics that are never used), but only a list of the topics the document actually makes use of. This enables the algorithm to scale much better with the number of topics on the synthetic datasets (see Fig. S16).
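The two ingredients of this scaling argument can be illustrated with a short sketch. It assumes that documents are lists of tokens and that a dense topic mixture is a plain list of probabilities; the function names are hypothetical and the code is not taken from the TopicMapping software.

```python
def graph_building_cost(docs):
    """Upper bound on the number of word-pair updates needed to build
    the word network: sum over documents of u_d squared."""
    return sum(len(set(doc)) ** 2 for doc in docs)

def to_sparse(dense_mixture):
    """Keep only the topics a document actually uses, as {topic: probability}."""
    return {t: p for t, p in enumerate(dense_mixture) if p > 0}

docs = [["a", "b", "a"], ["a", "b", "c", "d"]]
print(graph_building_cost(docs))           # 2**2 + 4**2 = 20
print(to_sparse([0.0, 0.7, 0.0, 0.3]))     # {1: 0.7, 3: 0.3}
```

With the sparse representation, the work per document grows with the number of topics it uses rather than with the total number of topics.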
As a further example, to analyze the WoS corpus, TopicMapping takes ∼ 25 minutes on a standard desktop computer. LDA takes ∼ 20 minutes for finding models with 6 topics and 120 minutes for models with 24 topics.


S8. TOPICS IN WIKIPEDIA
We have collected a large sample of the English Wikipedia (May 2013). The whole dataset comprises more than 4 million articles. However, since most of them are very short articles (stubs), we decided to consider only articles with at least 5 in-links, 5 out-links and 100 unique words. Very specific words (those which appear in fewer than 100 articles) have also been pruned. This gives us a dataset of 1,294,860 articles, 118,599 unique words and ∼ 800 million words in total.
In order to get results quickly, we decided to parallelize most of the code. For building the network we used 9 threads, each one was assigned a fraction of the total word pairs we had to consider. Doing so, we were able to construct the graph of words in roughly 12 hours. Infomap is extremely fast: each run of the algorithm takes about one hour and we ran it 10 times. After that, we ran the filtering algorithm with a single thread, taking less than one day (we set a filtering step of 0.05). Finally, we parallelized the LDA optimization on about 50 threads: doing so, each iteration took about an hour.
In the main paper, we have shown the results of TopicMapping after running LDA optimization for one single iteration. The inset was obtained running the algorithm on the sub-corpus consisting of all words which were more likely drawn from the first topic. Fig. S17, instead, shows the results after the full LDA optimization. For comparison, we also show the results starting the algorithm with random initial conditions. Interestingly, in this dataset, LDA optimization changed our guess significantly. This is not what happens in any of the other datasets we have tested, for which the topics in our guess were less heterogeneous (see Sec. S1 F).

S9. TOPICMAPPING AS A LIKELIHOOD OPTIMIZATION METHOD
Here we discuss to what extent TopicMapping provides models with better likelihood compared to standard LDA. Indeed, in controlled test cases such as the synthetic tests we have presented in this work, TopicMapping generally finds better models in terms of likelihood and this explains why it performs better (the actual generative model has the highest likelihood).
In real cases, as we discussed in Sec. S1 F, the likelihood can be maximized by splitting large topics into subtopics and merging smaller topics. Therefore, if we compare the likelihood found by TopicMapping and the one found by variational inference [10] as a function of the number of topics, TopicMapping does not provide models with higher likelihood. However, this comparison heavily penalizes TopicMapping, which often provides models with a broad distribution of topics, many of which are barely used at all. We therefore argue that comparing models with the same number of effective topics is fairer. Doing so, Fig. S18 shows that TopicMapping's models indeed often have higher likelihood. However, the difference is not dramatic, as we can see from the inset of Fig. S18, because of the degeneracy of the likelihood landscape.

In this section, we compute the likelihood of the alternative model for the English documents (Sec. S1). Let us call $a$ and $b$ the number of English words in the first group and in the second group respectively. We have that $a + b = N_w$ and $a f_1 + b g_1 = 1$, and the equivalent holds for $f_2$ and $g_2$. When we write an English document, we randomly sample words from the English vocabulary. This means that the probability that $n_a$ words fall in the first group, and $n_b = L_d - n_a$ in the second, follows a binomial distribution:

$$p(n_a) = \binom{L_d}{n_a} \left(\frac{a}{N_w}\right)^{n_a} \left(\frac{b}{N_w}\right)^{L_d - n_a}. \qquad \text{(S21)}$$

The last ingredient is how to decide which dialect a document should be fitted with. Let us define a threshold $T$ such that, if $n_a \geq T$, we use the first dialect, and we use the second otherwise. Without loss of generality, we also assume that $T \geq 1$, because otherwise we go back to the one single dialect case (Eq. S3).
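Let us call $L_E$ the likelihood of an English document in this model. The average of its logarithm can be written as:

$$\langle \log L_E \rangle = \sum_{n_a=T}^{L_d} p(n_a) \log L_1(n_a) + \sum_{n_a=0}^{T-1} p(n_a) \log L_2(n_a), \qquad \text{(S22)}$$

where $\log L_1(n_a) = n_a \log f_1 + (L_d - n_a) \log g_1$, and the same equation holds for $L_2$, replacing $f_1$ with $f_2$ and $g_1$ with $g_2$.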
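We can compute the optimal values for $f_1$ and $f_2$ simply by setting derivatives to zero. If we call:

$$\omega_1 = \sum_{n_a=T}^{L_d} p(n_a), \qquad m_{a1} = \frac{1}{\omega_1} \sum_{n_a=T}^{L_d} p(n_a)\, n_a, \qquad p_{a1} = \frac{m_{a1}}{L_d} \qquad \text{and} \qquad \mu_a = \frac{L_d\, a}{N_w},$$

with $\omega_2$, $m_{a2}$ and $p_{a2}$ defined in the same way for $n_a < T$, the optimal $f_1$ and $f_2$ can be written as:

$$f_1 = \frac{p_{a1}}{a} = \frac{m_{a1}}{N_w\, \mu_a}, \qquad f_2 = \frac{p_{a2}}{a} = \frac{m_{a2}}{N_w\, \mu_a}.$$

Here $\omega_1$ is how often we use the first dialect, $\mu_a$ is the expected number of words which fall in the first group of English words, and $p_{a1}$ is the probability of using words from the first group, given that we are using the first dialect. We also have that $\omega_2 = 1 - \omega_1$ and $\omega_1 p_{a1} + \omega_2 p_{a2} = p_a = a / N_w$.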
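We can now compute the expected log-likelihood of Eq. S22. Calling the entropy of a binary variable $h(p) = -p \log p - (1 - p) \log (1 - p)$, and $L_{true} = -L_d \log N_w$ the log-likelihood of the same document in the generative model (where each English word has probability $1/N_w$), we get:

$$\langle \log L_E \rangle - L_{true} = L_d \left[ h(p_a) - \omega_1 h(p_{a1}) - \omega_2 h(p_{a2}) \right]. \qquad \text{(S23)}$$

Now, the problem is to find, for given $L_d$ and $N_w$, which choice of the parameters $a$ and $T$ maximizes Eq. S23. It turns out that there are two different regimes, depending on whether $N_w \gg L_d$ or $L_d \gg N_w$.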
If $N_w \gg L_d$, a possible strategy is to assume $T = 1$. This means that we use the second dialect if (and only if) there are no words in the first group, so that $p_{a2} = 0$.
In fact, using the equations above, we get: $\omega_1 = 1 - (1 - p_a)^{L_d}$, $p_{a1} = p_a / \omega_1$ and $p_{a2} = 0$.
It is possible to prove that, for $L_d \gg 1$ and $N_w \gg L_d$, the maximum is attained for $\omega_1 = 1/2$ and $p_a = \log 2 / L_d$ and, disregarding finite-size effects due to $a$ being an integer: $\langle \log L_E \rangle - L_{true} \simeq (\log 2)^2 \simeq 0.4804$.
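In the second case, $L_d \gg N_w$, we restrict ourselves to considering $T = L_d / N_w$. In the limit $L_d \gg 1$, using the Gaussian approximation of the binomial distribution in Eq. S21, we obtain an approximation for the difference $\langle \log L_E \rangle - L_{true}$.
If also $N_w \gg 1$, the difference is independent of $a$: $\langle \log L_E \rangle - L_{true} \simeq 1/\pi \simeq 0.3183$.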
In conclusion, the log-likelihood per document of the alternative model (given that we use English) is bigger than the one of the generative model and, remarkably, the difference varies from roughly 0.5 to 0.3, so that it is substantially independent of all the parameters of the model. Since we can divide the English words into two arbitrary groups, we can actually have a large number of alternative models. For instance, if we have $L_d = 100$ and $N_w = 1000$, the model with highest likelihood splits English into two groups of 7 and 993 words, so that the number of alternative models becomes $\binom{1000}{7} \simeq 2 \times 10^{17}$, and there are many more alternative models with slightly smaller likelihood: for instance, using $a = 500$, the likelihood of the alternative model is 99.6% of the likelihood we obtain for $a = 7$, but the number of models is $\sim 10^{300}$. All these models are likely local maxima of the log-likelihood for Expectation-Maximization algorithms.
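The size of this likelihood gain can be checked numerically. The sketch below is not the authors' code: it assumes that English words are sampled uniformly from a vocabulary of `N_w` words, that the number of first-group words in a document is binomial, and that each dialect uses the maximum-likelihood word probabilities obtained by setting derivatives to zero (first-group probability equal to the average fraction of first-group words among the documents fitted by that dialect, divided by `a`). The function names are hypothetical.

```python
import math


def binomial_pmf(n, L_d, p):
    """Probability that n of the L_d words fall in the first group."""
    return math.comb(L_d, n) * p**n * (1 - p) ** (L_d - n)


def xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def likelihood_gap(L_d, N_w, a, T):
    """Average log-likelihood per document of the two-dialect model
    minus that of the generative (uniform) model."""
    b = N_w - a
    pmf = [binomial_pmf(n, L_d, a / N_w) for n in range(L_d + 1)]
    expected_log_lik = 0.0
    # First dialect: documents with n_a >= T; second dialect: n_a < T.
    for lo, hi in ((T, L_d + 1), (0, T)):
        omega = sum(pmf[lo:hi])  # how often this dialect is used
        if omega == 0:
            continue
        m = sum(n * pmf[n] for n in range(lo, hi)) / omega
        p_dialect = m / L_d           # first-group fraction in this dialect
        f = p_dialect / a             # optimal probability of a first-group word
        g = (1 - p_dialect) / b       # optimal probability of a second-group word
        expected_log_lik += omega * (xlogy(m, f) + xlogy(L_d - m, g))
    log_lik_true = -L_d * math.log(N_w)
    return expected_log_lik - log_lik_true


# Example with the values quoted in the text (L_d = 100, N_w = 1000, T = 1).
print(likelihood_gap(100, 1000, 7, 1))
print(math.comb(1000, 7))  # number of ways to choose the first group
```

With $T = 0$ only one dialect is ever used and the function returns a gap of zero up to rounding, which is a convenient sanity check of the single-dialect case.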
B. Derivation of Eq. S13
Let us start by computing $\log L_{M1}$, the log-likelihood per document for the model where the subtopics are merged and all languages are recovered. We recall that the symbol means that the likelihood is computed given that we know the topics of the document.
If we merge the two English subtopics, the common words ($C$) have probability $1/(U + C)$ while the words only used in one of the two subtopics ($2U$) have probability $1/(2U + 2C)$. Therefore, the average log-likelihood per English document in Model 1 is: which can be re-written as: The log-likelihood per non-English document is: If we instead merge two languages which are not English (Model 2), we get: The difference in the average log-likelihood between the two models becomes (we recall that $p_k$ is the probability of any non-English language): Eq. S13 follows from the equation above. For asymmetric LDA, we also have to consider the difference in the language entropies. Accounting for that, we get: Then, Model 1 has higher likelihood than Model 2 if: The correction from Eq. S13 is $O(L_d^{-1})$.

C. The Dirichlet distribution
The Dirichlet distribution is frequently used in Bayesian statistics since it is the conjugate prior of the multinomial distribution. The distribution is parameterized by $K$ values $\alpha_{topic}$, and the support of the function is the standard $(K - 1)$-simplex, i.e. the set of vectors $x$ of dimension $K$ such that $\sum_i x_i = 1$ and all $x_i \geq 0$. Clearly, $x$ can be interpreted as a probability distribution. Moreover, $\langle x_i \rangle = \alpha_i / \sum_{topic} \alpha_{topic}$. In generating the synthetic corpus, for each document, we use the same $\alpha_{topic} = K \times p(topic) \times \alpha$ to draw $p(topic|doc)$ from the Dirichlet distribution. In fact, even letting $\alpha$ depend on documents (but not on topics), this definition makes sure we get back the pre-defined topic probabilities, since $\langle p(topic|doc) \rangle = p(topic)$. Fig. S19 shows how $p(topic|doc)$ depends on $\alpha$ in the simple case of 20 equiprobable topics.

Figure S19: For each document, we extract $p(topic|doc)$ from the Dirichlet distribution for several values of $\alpha$, setting 20 equally probable topics. We then measure the probability of its most prominent topic (red curve) as well as the sum of the two and five largest topic probabilities (blue and green curves). The plot shows the median together with the 25% and 75% quantiles. Small values of $\alpha$ lead to documents which are mostly assigned to one single topic: for instance, for $\alpha = 10^{-3}$, the probability of the top topic is basically 1 and all others are zero. For $\alpha = 10^{-1}$, the top topic has roughly 0.5 probability, the second one has 0.2 and all the last 15 combined have less than 0.05. For large values like $\alpha \geq 10^{2}$, all topic probabilities tend towards equality, $p(topic|doc) = 0.05$: this means that documents cannot be classified, as they use all topics with equal probability.
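The way the topic mixture of a document depends on $\alpha$ can be explored with a short simulation. The following is a minimal sketch, not the code used for Fig. S19: it draws Dirichlet samples by normalizing independent Gamma variates (a standard construction) using only the Python standard library, and the function names, the number of documents and the seed are illustrative assumptions.

```python
import random
import statistics


def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma variates."""
    # Note: for very small parameters (e.g. 1e-3) the Gamma draws can
    # underflow to zero; a log-space sampler would be needed in that regime.
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def topic_mixture(topic_probs, alpha, rng):
    """Draw p(topic|doc) using alpha_topic = K * p(topic) * alpha."""
    K = len(topic_probs)
    return sample_dirichlet([K * p * alpha for p in topic_probs], rng)


def median_top_topic(alpha, n_topics=20, n_docs=2000, seed=0):
    """Median probability of the most prominent topic over n_docs documents."""
    rng = random.Random(seed)
    equiprobable = [1.0 / n_topics] * n_topics
    return statistics.median(
        max(topic_mixture(equiprobable, alpha, rng)) for _ in range(n_docs)
    )


for alpha in (0.1, 1.0, 100.0):
    print(alpha, median_top_topic(alpha))
```

For small $\alpha$ the printed median should be large (documents dominated by one topic), while for large $\alpha$ it should approach $1/K$.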

D. Measuring Perplexity
Perplexity is the conventional way to evaluate topic models' accuracy [10]. Here, we briefly review how it is computed.
The spirit is to cross-validate the model, whose parameters have been computed on a training set of documents, by looking at how well it fits a small set of unseen documents. Therefore, the procedure is (i) to hold out a fraction of documents from the corpus (typically 10%), (ii) to train the algorithm using the remaining 90% of the documents, (iii) to infer the topic probabilities $p(topic|doc)$ for the unseen documents without changing the topics, i.e. $p(word|topic)$, and (iv) to compare the actual word frequencies $p(word|doc)$ of the unseen documents with the topic mixture $q(word|doc) = \sum_{topic} p(topic|doc) \times p(word|topic)$.
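Step (iv) can be sketched as follows. We assume here the standard definition of perplexity, i.e. the exponential of minus the average log-probability per held-out word under the topic mixture $q(word|doc)$; the data layout (lists of tokens, one dictionary of word probabilities per topic) and the function names are illustrative assumptions.

```python
import math


def mixture_word_prob(word, topic_given_doc, word_given_topic):
    """q(word|doc) = sum over topics of p(topic|doc) * p(word|topic)."""
    return sum(
        p_topic * word_given_topic[t].get(word, 0.0)
        for t, p_topic in enumerate(topic_given_doc)
    )


def perplexity(held_out_docs, topic_given_doc, word_given_topic):
    """Perplexity of held-out documents under fixed topics.

    held_out_docs: list of documents, each a list of word tokens.
    topic_given_doc: inferred p(topic|doc), one list per held-out document.
    word_given_topic: one dict word -> p(word|topic) per topic.
    Assumes every held-out word has non-zero probability under the mixture
    (e.g. smoothed topics); otherwise math.log raises a ValueError.
    """
    log_lik = 0.0
    n_words = 0
    for doc, theta in zip(held_out_docs, topic_given_doc):
        for word in doc:
            log_lik += math.log(mixture_word_prob(word, theta, word_given_topic))
            n_words += 1
    return math.exp(-log_lik / n_words)
```

A useful sanity check is that a single topic which is uniform over $V$ words gives a perplexity of $V$; lower values indicate a better fit to the unseen documents.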

E. Algorithms' usage details
For LDA, we used the implementation that can be found at http://www.cs.princeton.edu/~blei/lda-c/index.html. The stopping criterion in running LDA and PLSA was that the relative improvement of the log-likelihood bound was less than $10^{-5}$ with respect to the previous iteration. In running LDA we let the algorithm optimize $\alpha$ as well. The initial value was set to $\alpha = 1$.
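The stopping criterion can be written as a generic loop. This is a hedged sketch rather than the code of the implementation cited above: `em_step` is a hypothetical callable that performs one optimization iteration and returns the new value of the log-likelihood bound, and `max_iter` is an added safety cap that the text does not mention.

```python
def run_until_converged(em_step, initial_bound, tol=1e-5, max_iter=1000):
    """Iterate until the relative change of the bound falls below tol.

    em_step: hypothetical callable running one iteration and returning
             the updated log-likelihood bound (assumed non-zero).
    Returns the last bound and the number of iterations performed.
    """
    previous = initial_bound
    for iteration in range(1, max_iter + 1):
        current = em_step()
        # Relative improvement with respect to the previous iteration.
        relative_change = abs((current - previous) / previous)
        if relative_change < tol:
            return current, iteration
        previous = current
    return previous, max_iter
```

For example, `em_step` could wrap one E-step and one M-step of a variational EM routine; the loop itself only needs the sequence of bound values.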