Role of neutral evolution in word turnover during centuries of English word popularity


Introduction
English has evolved continually over the centuries, in the branching off from antecedent languages in Indo-European prehistory [31,36], in the rates of regularization of verbs [31] and in the waxing and waning in the popularity of individual words [3,13,34]. At a much finer scale of time and population, languages change through modifications and errors in the learning process [14,27].
This continual change and diversity contrasts with the simplicity and consistency of Zipf's law, by which the frequency of a word, $f$, is inversely proportional to its rank $k$, as $f \sim k^{-\gamma}$, and Heaps law, by which vocabulary size scales sub-linearly with total number of words, across diverse textual and spoken samples [30,38,44,47,15,21,46,40].
The Google Ngram corpus [34] provides new support for these statistical regularities in word frequency dynamics at timescales from decades to centuries [22,38,40,1,28]. With annual counts of n-grams (an n-gram being n consecutive character strings, separated by spaces) derived from millions of books over multiple centuries [32], the n-gram data now cover English books from the year 1500 to the year 2008.
Further research on common words and phrases made possible by the n-gram data demonstrates the "Matthew effect" of stochastic proportional growth, which has been observed in a range of natural, biological and socio-cultural realms [39]. In English, the Zipf's law in the n-gram data [38] exhibits two regimes: one among words with frequencies above about 0.01% (Zipf's exponent $\gamma \approx 1$) and another ($\gamma \approx 1.4$) among words with frequency below 0.0001% [40]. The latter Zipf's law exponent of 1.4 is equivalent to a probability distribution function (PDF) exponent of about 1.7, since the PDF exponent equals $1 + 1/\gamma$.
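The conversion between the two exponents can be made explicit. The short derivation below is a sketch under the usual assumption that the rank of a word equals the number of words that are at least as frequent. Here $\gamma$ denotes the Zipf rank-frequency exponent, and $\alpha_{\mathrm{PDF}}$ is notation introduced purely for illustration to denote the PDF exponent.

```latex
f(k) \sim k^{-\gamma}
\;\Longrightarrow\;
k(f) \sim f^{-1/\gamma} \propto P(F \ge f),
\qquad
p(f) = -\frac{d}{df} P(F \ge f) \sim f^{-(1 + 1/\gamma)} .

\alpha_{\mathrm{PDF}} = 1 + \frac{1}{\gamma}
= 1 + \frac{1}{1.4} \approx 1.71 .
```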
While the well-known Zipf's law provides a necessary but incomplete characterization of stochastic proportional growth, a more complete characterization requires analyzing change in time-resolved data [39]. In this respect, word frequency data have at least two other statistical properties. One, known as Heaps law, refers to the way that vocabulary size scales sub-linearly with corpus size (raw word count). The n-gram data show Heaps law in that, if $N_t$ is corpus size and $v_t$ is vocabulary size at time $t$, then $v_t \propto N_t^{b}$, with $b \approx 0.5$, for all English words in the corpus [40]. If the n-gram corpus is truncated by a minimum word count, then as that minimum is raised the Heaps scaling exponent increases from $b < 0.5$ and approaches, while remaining below, $b = 1$ [40].
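As an illustration of how a Heaps exponent can be estimated from paired corpus and vocabulary sizes, the following minimal sketch fits a straight line to the log-log data by ordinary least squares. The function name `fit_heaps_exponent` and the synthetic data are assumptions made for this example; the fitting procedure used in [40] may differ.

```python
import math

def fit_heaps_exponent(corpus_sizes, vocab_sizes):
    """Least-squares fit of log v = log A + b * log N; returns (A, b)."""
    xs = [math.log(n) for n in corpus_sizes]
    ys = [math.log(v) for v in vocab_sizes]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope on the log-log plot = Heaps exponent
    A = math.exp(mean_y - b * mean_x)  # intercept gives the coefficient A
    return A, b

# Synthetic (hypothetical) data that follow v = 5 * N^0.5 exactly.
sizes = [10 ** e for e in range(3, 9)]
vocab = [5.0 * n ** 0.5 for n in sizes]
A, b = fit_heaps_exponent(sizes, vocab)
```

On real yearly counts, the same function would be called with one (corpus size, vocabulary size) pair per year.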
The other statistical property is dynamic turnover in the ranked list of most commonly used words. This can be measured in terms of how many words are replaced through time on "Top y" ranked lists of different sizes $y$ of most frequently used words [12,17,19,23]. We can define this turnover $z_y(t)$ as the number of new words to have entered the top $y$ most common words in year $t$, which is equivalent to the number of words that dropped out of the top $y$ in that year. The plotting of turnover $z_y$ for different list sizes $y$ can therefore be useful in characterizing turnover dynamics [2].
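The turnover measure defined here can be sketched in a few lines. This is an illustrative implementation only: the helper names `top_y` and `turnover`, the alphabetical tie-breaking rule and the toy counts are all assumptions introduced for the example.

```python
from collections import Counter

def top_y(counts, y):
    """Return the set of the y most frequent words (ties broken alphabetically)."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:y]}

def turnover(prev_counts, curr_counts, y):
    """Number of words in this year's top y that were absent from last year's top y."""
    return len(top_y(curr_counts, y) - top_y(prev_counts, y))

# Toy (hypothetical) word counts for two consecutive years.
year_1 = Counter({"the": 50, "of": 30, "and": 20, "thee": 10, "to": 5})
year_2 = Counter({"the": 55, "of": 28, "to": 22, "and": 21, "thee": 4})
z_3 = turnover(year_1, year_2, 3)  # "to" replaces "and" in the top 3
```

Because the list has fixed size, counting entries is the same as counting exits, which is why only one set difference is needed.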
Many functional or network models readily yield the static Zipf distribution [21,15,39] and Heaps law [33], but not the dynamic aspects such as turnover. Here, we focus on how Heaps law and Zipf's law can be modeled together with continual turnover of words within the rankings by frequency [4,23]. We focus on the 1-grams in Google's English 2012 data set, which samples English language books published in any country [25].
Our overall finding is a model that can replicate patterns observed in the Google corpus, which we assume to be representative of overall language through time. Even if the Google sample is biased toward more recent texts [16], the model reveals its utility in replicating multiple dynamic properties, including growing corpus and vocabulary sizes, frequency distributions, and turnover within those frequency distributions.

Neutral Models of Vocabulary Change
One promising, parsimonious approach incorporates the class of neutral evolutionary models [11,12,7,24,35] that are now proving insightful for language transmission [13,10,43]. The null hypothesis of a Neutral model is that copying is undirected, without biases or different 'fitnesses' of the words being replicated [2,29]. A basic neutral model, which we will call the full-sampling Neutral model (FNM), would assume simply that authors choose to write words by copying those published in the past and occasionally inventing or introducing new words. As shown in Fig. 1(a), the FNM represents each word choice by an author as selecting at random among the $N_t$ words that were published in the previous year [43,10]. This copying occurs with probability $1 - \mu$, where $\mu \ll 1$ is the fixed, dimensionless probability that an author invents a new word (even if the word had originated somewhere 'outside' books, e.g., in spoken slang). Each newly invented word enters with frequency one, regardless of $N_t$. In terms of the modeled corpus, a total of about $\mu N_t$ unique new words are invented per time step. Note that $N_t$ represents the total number of written words, or corpus size, for year $t$, which contrasts with the smaller "vocabulary" size, $v_t$, defined as the number of different words in each year $t$ regardless of their frequency of usage (these terms, which we use for generality, are equivalent to token and type in corpus linguistics, where token is the number of words in the corpus and type is the number of unique words).
As has been well demonstrated, the FNM readily yields Zipf's law [11,9,45], which can also be shown analytically (see Appendix A). Also, simulations of the FNM show that the resulting Zipf distribution undergoes dynamic turnover [12]. Extensive simulations [19] show that when list size $y$ is small compared to the corpus ($0.15\,y < N_t\,\mu$), this neutral turnover $z_y$ per time step is more precisely approximated by

$$z_y = 1.38\,\mu^{0.55}\,y^{0.86}\,n^{0.13}, \tag{1}$$

where $n$ is the number of words per time interval.
This prediction can be visualized by plotting the measured turnover $z_y$ for different list sizes $y$. The FNM predicts the results to follow $z_y \propto y^{0.86}$, such that departures from this expected curve can be identified to indicate biases such as conformity or anti-conformity [2]. It would appear from Eq. (1) that turnover should increase with corpus size. This is the nominal equilibrium for the FNM with constant $N_t$. If corpus size $N_t$ in the FNM is growing exponentially with time, however, then there may be no such nominal equilibrium. In this case, we predict that the turnover $z_y$ can actually decrease with time as $N_t$ increases. This is because newly invented words start with frequency one, and under the neutral model they must essentially make a stochastic walk into the top 100, say. As $N_t$ grows, so does the minimum frequency needed to break into the top 100. As the "bar" is raised, words are more likely to 'die' before they ever reach the bar by stochastic walk [41]. As a result, turnover in the top $y$ can slow down over time as $N_t$ grows.
The FNM does not, however, readily yield Heaps law ($v_t \propto N_t^{b}$, where $b < 1$), for which $b \approx 0.5$ among the 1-gram data for English [40]. In the FNM, the expected exponent is 1.0, as the number of different variants (vocabulary) normally scales linearly with $N_t$ [11]. While the FNM has been a powerful null model, in the case of books, we can make a notable improvement to account for the fact that most published material goes unnoticed while a relatively small portion of the corpus is highly visible. To name a few examples across the centuries, literally billions of copies of the Bible and the works of Shakespeare have been read since the 17th century, as well as tens or hundreds of millions of copies of works by Voltaire, Swift, Austen, Dickens, Tolkien, Fleming, Rowling and so on. While these and hundreds more books become considered part of the "Western Canon," that canon is constantly evolving [28] and many books that were enormously popular in their time (e.g. Arabian Nights or the works of Fanny Burney) fall out of favor. As the published corpus has grown exponentially over the centuries, early authors were more able to sample the full range of historically published works, whereas contemporary authors sample from an increasingly small and more recent fraction of the corpus, simply due to its exponential expansion [28,37].
As a simple way of capturing this, we propose a modified neutral model, called the partial-sampling Neutral model (PNM), of an evolving "canon" that is sampled by an exponentially growing corpus of books. As shown in Fig. 1(b), the PNM represents an exponentially growing number of books that sample words from a fixed-size canon over all previous years since 1700. Our PNM represents a world where there exists an evolving canonical literature as a relatively small subset of the world's books on which all writers are educated. As new contributions to the canon are made, authors sample from the recent generation of writers with occasional innovation. Because the canon is a high-visibility subset of all books, only a fixed, constant number of words of text per year are allowed into a year's canon. The rest of the population learns from the cumulative canon since our chosen reference year of 1700.

Results
The average results from 100 runs of each of the FNM and PNM were used to match summary statistics with the 1-gram data. Several key statistical results emerge from analysis of the 1-gram data, and we compare the FNM to the PNM in terms of these results: (1) Heaps law, which is the sublinear scaling of vocabulary size with corpus size, (2) a Zipf's law frequency distribution for unique words, (3) a rate of turnover that decreases exponentially with time and a turnover versus popular list size that is approximately linear. Here, we describe our results in terms of rank-frequency distributions, turnover and corpus and vocabulary size. We compare the PNM to the full 1-gram data for English.
First, we check that the model replicates the Zipf's law that characterizes the 1-gram frequencies in multiple languages [38]. Our own maximum likelihood determinations, applying available code [15] to the Google 1-gram data, confirm that the mean PDF exponent is $1.75 \pm 0.12$ for the Zipf's law over all English words in the 100 years from 1700 to 1800 (beyond 1800, the corpus size becomes too large for our computation). Normalizing by the word count [21], the form of the Zipf distribution is virtually identical for each year of the dataset, reaching eight orders of magnitude by the year 2000 (Fig. 2(a)). The FNM replicates the Zipf distribution (Fig. 2(b)), but the PNM replicates it better and over more orders of magnitude (Fig. 2(c)). It was not computationally possible with either the FNM or PNM to replicate the Zipf distribution across all nine orders of magnitude, as the modeled corpus size $N_t$ grows exponentially (Fig. 2(d)).
Figure 3(a) illustrates the relationship between corpus size and vocabulary size in our partial-sampling Neutral model. Due to the exponentially increasing sample size, the ratio of vocabulary size over corpus size becomes increasingly small, thus the model gives us the sub-linear relationship described by $v_t \propto N_t^{b}$, where $b < 1$. On the double-logarithmic plot in Fig. 3(a), the Heaps law exponent is equivalent to the slope of the data series. The PNM matches the 1-gram data with a Heaps exponent (slope) of about 0.5, whereas the FNM, with an exponent of about 1.0, does not. Figure 3(b) shows how 100 runs of the PNM yield a Heaps law exponent within the range derived by [40] for several different n-gram corpora (all English, English fiction, English GB, English US and English 1M). The PNM yields a Heaps law exponent $b \approx 0.52 \pm 0.006$, within the range of English corpora, whereas the FNM yields a mismatch with the data of $b \approx 1 \pm 0.002$ (Fig. 3(b)).
In Fig. 3(a), there is a constant offset on the y-axis between vocabulary size in the PNM ($\alpha = 0.02$; $N = 10{,}000$) versus the 1-gram data. Both data series follow Heaps exponent $b \approx 0.5$, but the coefficient, $A$, is several times larger for the 1-gram data than for the PNM. We do not think this is due to our choice of canon size $N$ in the PNM, because if we halve it to 5000, the resulting $A$ does not significantly change. The difference could be resolved, however, with larger exponential growth in PNM corpus size, $S_t$, over the 300 time steps. Computationally, we could only model the PNM with growth exponent $\alpha = 0.02$; using $\alpha = 0.03$, as would fit the actual growth of the n-gram corpus over 300 years [8], makes the PNM too large to compute. Nevertheless, we can roughly estimate the effect: when we reduce $\alpha$ from 0.02 to 0.01, while keeping $N = 10{,}000$, we find that $A$ averaged over one hundred PNM runs is reduced from $6.3 \pm 0.5$ to $1.4 \pm 0.3$. Given an exponential relationship, increasing $\alpha$ to 0.03 would increase $A$ to about 20, which is within the magnitude of offset we see in Fig. 3(a). Of course, this question can be resolved precisely when the much larger PNM can be simulated.
Regarding dynamic turnover, we consider turnover in ranked lists of size $y$, varying the list size $y$ from the top 1000 most common words down to the top 10 (the top word has been "the" since before the year 1700). We measure turnover in the word-frequency rankings by determining the top $y$ rankings independently for each year, and then counting the number of new words to appear on the list from one year to the next. Turnover in two of these lists, including the top 500, decreased exponentially from the year 1700 to 2000, proportional to $e^{-0.012t}$ ($r^2 > 0.91$ for both), where $t$ is years since 1700. This exponential decay equates to a halving of turnover roughly every 58 years, or a decline of about 70% per century. Since the corpus size was increasing with time, Fig. 4 effectively also shows how turnover in the top $y$ list decreases as corpus size increases in the partial-sampling Neutral model. The exponential decay in turnover in the partial-sampling Neutral model is markedly different from the base Neutral model, in which turnover would grow as corpus size grew, due to the term $n^{0.13}$ in Eq. (1).
Finally, we also look at the "turnover profile", plotting list size $y$ versus turnover $z_y$ for different time slices (Fig. 5). For all words, $z_y \propto y^{1.26}$ for different time periods (Fig. 5). We can then compare the turnover profile for the 1-grams to the prediction from Eq. (1) that turnover will be proportional to $y^{0.86}$, as shown in Fig. 5(b).
Table 1 lays out the specific predictions of each of the models and how they fare against empirical data. Bands indicate the 95% range of simulated values. While the predictions for the FNM and PNM are similar for $y = 50$ and for the year 1800, the FNM, although it can fit Zipf's law with the right parameters, cannot also fit Heaps law or the turnover patterns at the same time as matching Zipf's law. In contrast, the PNM can fit Zipf's law, the Heaps law exponent (Fig. 3(a)), and the 2000 series in Fig. 4 (but starts to break down at $y > 150$). Neither the FNM nor the PNM does very well at $y = 200$.

Discussion
We have explored how 'neutral' models of word choice could replicate a series of static and dynamic observations from a historical 1-gram corpus: corpus size, frequency distributions, and turnover within those frequency distributions. Our goal was to capture two static and three dynamic properties of word frequency statistics in one model. The static properties are not only the well-known (a) Zipf's law, which a range of proportionate-advantage models can replicate, but also (b) Heaps law. The dynamic properties are (c) the continual turnover in words ranked by popularity, (d) the decline in that turnover rate through time, and (e) the relationship between list size and turnover, which we call the turnover profile.
We found that, although the FNM predicts Zipf's law in ranked word frequencies, it does not replicate Heaps law between corpus and vocabulary size, or the concavity in the non-linear relationship between list size $y$ and turnover $z_y$, or the slowing of this turnover through time among English words.
It is notable that we found it impossible to capture all five of these properties at once with the FNM. It was a bit like trying to juggle five balls: as soon as the FNM could replicate some of those properties, it dropped the others. Having explored the FNM under a broad range of parameter combinations, we ultimately determined that it could never replicate all these properties at once. This is mainly because vocabulary size in the FNM is proportional to corpus size (rather than roughly the square root of corpus size, as in Heaps law) and also because turnover in the FNM should increase slightly with a growing corpus, not decrease as we see in the 1-gram data over 300 years. Other hypotheses to modify the FNM, such as introducing a conformity bias [2], can also be ruled out. In the case of conformity bias (where agents choose high-frequency words with even greater probability than just in proportion to frequency), both the Zipf law and turnover deteriorate under strong conformity in ways that mismatch the data.
What did ultimately work very well was our partial-sampling Neutral model, or PNM (Fig. 1(b)), which models a growing sample from a fixed-size FNM. Our PNM, which takes exponentially increasing sample sizes from a neutrally evolved latent corpus, replicated the Zipf's law, Heaps law, and turnover patterns in the 1-gram data. Although it did not replicate exactly the particular 1-gram corpus we used here, the Heaps law exponent yielded by the PNM does fall within the range (from 0.44 to 0.54) observed in different English 1-gram corpora [40]. Among all features we attempted to replicate, the one mismatch between PNM and the 1-gram data is that the PNM yielded an order of magnitude fewer vocabulary words for a given corpus size, while increasing with corpus size according to the same Heaps law exponent. The reason for this mismatch appears to be a computational constraint: we could not run the PNM with exponential growth quite as large as that of the actual 300 years of exponential growth in the real English corpus.
As a heuristic device, we consider the fixed-size FNM to represent a canonical literature, while the growing sample represents the real world of exponentially growing numbers of books published every year in English. Of course, the world is not as simple as our model; there is no official fixed canon, that canon does not strictly copy words from the previous year only and there are plenty of words being invented that occur outside this canon. Also, the Google dataset is an imperfect sample of the language for earlier years. At least some of the growth observed over time is due to greater availability and easier digitization of later texts, such that the Google corpus grows faster than the language itself over the years [16].
This does not change our overall result, however, in that the PNM can replicate dynamic properties observed in an exponentially growing corpus (even if that exponent were smaller) that the FNM cannot. In particular, our canonical model of the PNM differs from the explanation by [40], in which a "decreasing marginal need for additional words" as the corpus grows is underlain by the "dependency network between the common words . . . and their more esoteric counterparts." In our PNM representation, there is no network structure between words at all, such as "interword statistical dependencies" [42] or grammar as a hierarchical network structure between words [20].

Conclusion
Since the PNM performed quite well in replicating multiple static and dynamic statistical properties of 1-grams simultaneously, which the FNM could not do, we find two insights. The first is that the FNM remains a powerful representation of word usage dynamics [13,43,26,24,9,5], but it may need to be embedded in a larger sampling process in order to represent a very large data sample. Case studies where the PNM succeeds and the FNM fails could represent situations where mass attention is focused on a small subset of the cultural variants. The same idea seems appropriate for a digital world, where many cultural choices are pre-sorted in ranked lists [24]. In the present century, published books contain only a few percent of the verbiage recorded online, with the volume of digital data doubling about every three years. Centuries of prior evolution in published English word use provide valuable context for future study of this digital transition.

Models and Data
Our aim is to compare key summary statistics from simulated data generated by the hypothetical FNM and PNM processes with summary statistics from Google 1-gram data. See Acknowledgements for data source address and the repository location for the Python code used to generate the FNM and PNM.

Neutral models
The FNM assumes words in a corpus at time $t$ are selected at random from the corpus at time $t-1$. The corpus size $N_t$ increases exponentially, as $N_0 e^{0.021t}$, through time to simulate the exponentially increasing corpus size observed in the Google n-grams data [8]. We ran a genetic algorithm (described in Appendix B) to search the model state space to obtain parameter combinations (latent corpus size $N_t$, innovation fraction $\mu$ and initial corpus size $N_0$) that yielded similar summary statistics to the 1-gram data. With the corpus growth exponent fixed at 0.021, initial corpus size, $N_0$, was constrained by computational capacity.
Following the genetic algorithm search, the model was initialized with corpus size $N_0 = 3000$ and invention fraction $\mu = 0.003$. Once steady state was achieved, we permitted the corpus size in each successive generation to increase at an exponential growth rate comparable to the average annual growth rate of Google 1-gram data until it finally reached $N_{300} = 1.5$ million by time step $t = 301$.
At each time $t$ in the FNM, a new set of $N_t$ words enter the modeled corpus. Each word in the corpus, at time $t$, is either a copy of a word from the previous generation of books, with probability $1 - \mu$, or else invented as a new word with probability $\mu$. Each of the copied words is selected from $v_{t-1}$ possible words (the vocabulary in the previous time step), which follow a discrete Zipf's law distribution, with the probability that a word is selected being proportional to the number of copies the word had in the previous corpus in time step $t-1$ [7].
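The copy-or-invent rule just described can be sketched as follows. This is a minimal illustration and not the authors' published code: the function name `fnm_step`, the use of integer identifiers for words, the all-distinct starting corpus and the small parameter values are assumptions chosen for brevity. Here `mu` stands for the invention probability. Picking a random token from the previous corpus is equivalent to choosing a word with probability proportional to its previous count.

```python
import random
from collections import Counter
from itertools import count

def fnm_step(prev_corpus, n_words, mu, new_word_ids, rng):
    """One FNM generation of n_words tokens drawn from prev_corpus."""
    corpus = []
    for _ in range(n_words):
        if rng.random() < mu:
            # Invention: a brand-new word enters with frequency one.
            corpus.append(next(new_word_ids))
        else:
            # Copying: uniform choice over tokens = frequency-proportional over words.
            corpus.append(rng.choice(prev_corpus))
    return corpus

rng = random.Random(1)
new_word_ids = count()                              # hypothetical word identifiers
corpus = [next(new_word_ids) for _ in range(500)]   # assumed start: all words distinct
for _ in range(50):
    corpus = fnm_step(corpus, n_words=500, mu=0.01,
                      new_word_ids=new_word_ids, rng=rng)
vocab_size = len(Counter(corpus))                   # number of distinct words (types)
```

To mimic a growing corpus, `n_words` would be increased from one generation to the next rather than held at 500.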
The PNM, represented schematically in Fig. 1, draws an exponentially increasing sample (with replacement) from a latent neutrally-evolving canon. We designate the number of words in the sample as $S_t$, and the cumulative number of words in the canon as $N_t$, which grows by a fixed number of words in each time step. This exponentially increasing sample, $S_0 e^{\alpha t}$, has an initial corpus size $S_0 = 3000$ and growth exponent $\alpha = 0.021$, yielding a final sample size $S_{300} = 1.5$ million, matching the FNM. The latent corpus evolves by the rules of the FNM, but with a constant corpus size of 10,000 for each year $t$ (representing a canonical literature from which the main body of authors sample). The cumulative canon, $N_t$, thus grows by 10,000 words per year. The partial sample, $S_t$, at time $t$ can copy words from all canonical literature, $N_t$, up to that time step. We set $\mu = 0.003$ and run for $t = 301$ time steps representing years between 1700 and 2000, which are the same parameters used in the FNM.
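A compact sketch of this partial-sampling scheme is given below. It is an illustration under stated assumptions rather than the authors' implementation: the function name `pnm_run` is hypothetical, the canon starts from all-distinct words instead of a pre-equilibrated steady state, and the parameter values are scaled down so the example runs quickly. Here `alpha` is the sample growth exponent and `mu` the invention probability within the canon.

```python
import math
import random
from itertools import count

def pnm_run(canon_size, s0, alpha, mu, steps, seed=0):
    """Return a list of (sample size S_t, vocabulary size v_t), one per time step."""
    rng = random.Random(seed)
    ids = count()                                    # hypothetical word identifiers
    layer = [next(ids) for _ in range(canon_size)]   # assumed start: all distinct
    cumulative = list(layer)                         # cumulative canon N_t
    history = []
    for t in range(steps):
        # The yearly canon layer evolves neutrally at constant size (FNM rules).
        layer = [next(ids) if rng.random() < mu else rng.choice(layer)
                 for _ in range(canon_size)]
        cumulative.extend(layer)
        # The observed corpus is an exponentially growing sample, drawn with
        # replacement from the whole cumulative canon up to this time step.
        s_t = int(s0 * math.exp(alpha * t))
        sample = rng.choices(cumulative, k=s_t)
        history.append((s_t, len(set(sample))))
    return history

history = pnm_run(canon_size=200, s0=100, alpha=0.05, mu=0.01, steps=40)
```

Plotting the pairs in `history` on double-logarithmic axes gives the sample-size versus vocabulary-size relationship whose slope is compared with the Heaps exponent.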

1-gram data
The 1-gram data are available as csv files directly from Google's Ngrams site [25]. As in a previous study [1], we removed 1-grams that are common symbols or numbers, and 1-grams containing the same consonant three or more times consecutively. As in our other studies [1,8,6], we normalized the count of 1-grams using the yearly occurrences of the most common English word, the. Although we track 1-grams from the year 1700, for turnover statistics we follow other studies [40] in being cautious about the n-grams record before the year 1800, due to misspelled words before 1800 that were surely digital scanning errors related to antique printing styles that may conflate letters such as 's' and 'f' (e.g. myfelf, yourfelf, provifions, increafe, afked, etc.). The code used for modeling is available at: https://github.com/dr2g08/Neutral-evolution-and-turnover-over-centuries-of-English-word-popularity.
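The filtering and normalization steps described above might look like the following sketch. The details are assumptions made for illustration: "symbols or numbers" is interpreted here as any token that is not purely alphabetic, the letter y is treated as a consonant, and the function names `keep_1gram` and `normalize_year` are hypothetical.

```python
import re

# Same consonant three or more times in a row (vowels a, e, i, o, u excluded).
TRIPLE_CONSONANT = re.compile(r"([b-df-hj-np-tv-z])\1\1")

def keep_1gram(token):
    """True if token is purely alphabetic and has no tripled consonant."""
    if not token.isalpha():
        return False
    return TRIPLE_CONSONANT.search(token.lower()) is None

def normalize_year(counts):
    """Divide each retained 1-gram count by that year's count of 'the'."""
    the_count = counts["the"]
    return {word: c / the_count for word, c in counts.items() if keep_1gram(word)}

# Toy (hypothetical) counts for a single year.
normalized = normalize_year({"the": 100, "of": 50, "1800": 7, "zzz": 3})
```

Note that this strict alphabetic test would also drop hyphenated or apostrophized tokens, which may or may not match the original filtering.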
values of $N$ between 5000 and 30,000 and $S_0$ between 1000 and 10,000. In both cases the lower bound is chosen to ensure a minimum acceptable vocabulary size is reached and the upper bound is limited by computational constraints. The product $\mu N$ was limited between 5 and 90, as this is the region in which the Neutral model yields a reasonable Zipf's law. For the genetic algorithm, the fitnesses were scored by the following equations and variable values:

| Summary statistic | Equation | Target variables |
| --- | --- | --- |
| Heaps law | $v = A n^{b}$ | $A$ and $b$ |
| Zipf's law | $f \sim k^{-\gamma}$ | $\gamma$ |
| Turnover decay ($y = 50$) | $z(50) = z_0 e^{-\lambda_{50} t}$ | $\lambda_{50}$ and $z_0$ |
| Turnover decay ($y = 100$) | $z(100) = z_0 e^{-\lambda_{100} t}$ | $\lambda_{100}$ and $z_0$ |
| Turnover decay ($y = 200$) | $z(200) = z_0 e^{-\lambda_{200} t}$ | $\lambda_{200}$ and $z_0$ |

The PNM parameter combination receives a point for each target statistic that is approximately the same as the equivalent value from the n-grams data. The genetic algorithm starts with 100 random parameter combinations, then the following steps are repeated until they converge on parameter combinations that maximize fitness scores:
(1) The fittest 20% of the population is passed to the next generation.
(2) The remaining 80% is populated by recombinations of two randomly selected parents from the fittest 20% of the previous generation.
(3) 15% of the new agents are subject to random mutation of a single parameter to ensure diversity in the population.
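The three steps above can be sketched as a generic loop. This is an illustrative outline, not the authors' code: the function name `evolve`, the fixed number of generations (in place of a convergence test), the uniform per-parameter crossover, the bounds on `mu` and the toy fitness function are all assumptions. The constraint on the product of invention probability and canon size is not enforced here.

```python
import random

def evolve(fitness, bounds, pop_size=100, generations=30,
           elite_frac=0.2, mutate_frac=0.15, seed=0):
    """Return the fittest parameter combination found."""
    rng = random.Random(seed)
    names = list(bounds)

    def random_agent():
        return {k: rng.uniform(*bounds[k]) for k in names}

    population = [random_agent() for _ in range(pop_size)]
    n_elite = max(2, int(elite_frac * pop_size))
    for _ in range(generations):
        # (1) The fittest 20% pass unchanged to the next generation.
        population.sort(key=fitness, reverse=True)
        elite = population[:n_elite]
        # (2) The remaining 80% are recombinations of two elite parents.
        children = []
        while len(children) < pop_size - n_elite:
            mum, dad = rng.sample(elite, 2)
            children.append({k: rng.choice((mum[k], dad[k])) for k in names})
        # (3) 15% of the new agents get a random mutation of one parameter.
        for child in rng.sample(children, int(mutate_frac * len(children))):
            k = rng.choice(names)
            child[k] = rng.uniform(*bounds[k])
        population = elite + children
    return max(population, key=fitness)

# Bounds on N and S0 follow the text; the bounds on mu are hypothetical.
bounds = {"N": (5000, 30000), "S0": (1000, 10000), "mu": (0.0001, 0.01)}
target = {"N": 10000, "S0": 3000, "mu": 0.003}

def toy_fitness(agent):
    """Toy stand-in for the point score: closeness to a fixed target."""
    return -sum(((agent[k] - target[k]) / (bounds[k][1] - bounds[k][0])) ** 2
                for k in target)

best = evolve(toy_fitness, bounds)
```

In the setting described in the text, `fitness` would instead run the PNM for a candidate parameter combination and award one point per matched summary statistic.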