Abstract Lactase persistence (LP) is common among people of European ancestry, but with the exception of some African, Middle Eastern and southern Asian groups, is rare or absent elsewhere in the world. Lactase gene haplotype conservation around a polymorphism strongly associated with LP in Europeans (−13,910 C/T) indicates that the derived allele is recent in origin and has been subject to strong positive selection. Furthermore, ancient DNA work has shown that the −13,910*T (derived) allele was very rare or absent in early Neolithic central Europeans. It is unlikely that LP would provide a selective advantage without a supply of fresh milk, and this has lead to a gene-culture coevolutionary model where lactase persistence is only favoured in cultures practicing dairying, and dairying is more favoured in lactase persistent populations. We have developed a flexible demic computer simulation model to explore the spread of lactase persistence, dairying, other subsistence practices and unlinked genetic markers in Europe and western Asia's geographic space. Using data on −13,910*T allele frequency and farming arrival dates across Europe, and approximate Bayesian computation to estimate parameters of interest, we infer that the −13,910*T allele first underwent selection among dairying farmers around 7,500 years ago in a region between the central Balkans and central Europe, possibly in association with the dissemination of the Neolithic Linearbandkeramik culture over Central Europe. Furthermore, our results suggest that natural selection favouring a lactase persistence allele was not higher in northern latitudes through an increased requirement for dietary vitamin D. Our results provide a coherent and spatially explicit picture of the coevolution of lactase persistence and dairying in Europe.

Author Summary Most adults worldwide do not produce the enzyme lactase and so are unable to digest the milk sugar lactose. However, most people in Europe and many from other populations continue to produce lactase throughout their life (lactase persistence). In Europe, a single genetic variant, −13,910*T, is strongly associated with lactase persistence and appears to have been favoured by natural selection in the last 10,000 years. Since adult consumption of fresh milk was only possible after the domestication of animals, it is likely that lactase persistence coevolved with the cultural practice of dairying, although it is not known when lactase persistence first arose in Europe or what factors drove its rapid spread. To address these questions, we have developed a simulation model of the spread of lactase persistence, dairying, and farmers in Europe, and have integrated genetic and archaeological data using newly developed statistical approaches. We infer that lactase persistence/dairying coevolution began around 7,500 years ago between the central Balkans and central Europe, probably among people of the Linearbandkeramik culture. We also find that lactase persistence was not more favoured in northern latitudes through an increased requirement for dietary vitamin D. Our results illustrate the possibility of integrating genetic and archaeological data to address important questions on human evolution.

Citation: Itan Y, Powell A, Beaumont MA, Burger J, Thomas MG (2009) The Origins of Lactase Persistence in Europe. PLoS Comput Biol 5(8): e1000491. https://doi.org/10.1371/journal.pcbi.1000491 Editor: Mark M. Tanaka, University of New South Wales, Australia Received: February 5, 2009; Accepted: July 28, 2009; Published: August 28, 2009 Copyright: © 2009 Itan et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: YI was funded by UCL Graduate school, UCL ORS (http://www.ucl.ac.uk/), Annals of Human Genetics (http://www.wiley.com/bw/journal.asp?ref=0003-4800) and B'nai B'rith/Leo Baeck London Lodge scholarships (http://www.britishcouncil.org/israel-education-scholarships-further-funding-and-support.htm), and AP was funded by the Arts and Humanities Research Council Centre for the Evolution of Cultural Diversity (http://www.cecd.ucl.ac.uk/home). Both are PhD students. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction Lactase persistence (LP) is an autosomal dominant trait enabling the continued production of the enzyme lactase throughout adult life. Lactase non-persistence is the ancestral condition for humans, and indeed for all mammals [1]. Production of lactase in the gut is essential for the digestion of the milk sugar lactose. LP is common in northern and western Europeans as well as in many African, Middle Eastern and southern Asian pastoralist groups, but is rare or absent elsewhere in the world [1]–[4]. In Europeans LP is strongly associated with a single C to T transition in the MCM6 gene (−13,910*T), located 13.91 kb upstream from the lactase gene [5]. Furthermore, in vitro studies have indicated that the −13,910*T allele can directly affect LCT gene promoter activity [6]. The −13,910*T allele ranges frequency from 6%–36% in eastern and southern Europe, 56%–67% in Central and western Europe, to 73%–95% in the British Isles and Scandinavia [7],[8] while LP ranges in frequency from 15%–54% in eastern and southern Europe, 62%–86% in Central and western Europe, to 89%–96% in the British Isles and Scandinavia [9]. This makes the −13,910*T allele a good candidate for predicting LP in Europe. However, genotype/phenotype frequency comparisons have shown that the −13,910*T allele cannot account for LP frequencies in most African [3] and Middle Eastern populations [10]. Instead, different LP-associated alleles occurring in the same genomic region have been reported, indicating convergent evolution [2],[4],[10],[11]. Using long-range haplotype conservation [8] and variation in closely linked microsatellites [12] as proxies for allelic age, the −13,910*T variant has been estimated to be between 2,188 and 20,650 years old and between 7,450 and 12,300 years old, respectively. These recent age estimates, when considered in conjunction with modern allele frequencies, indicate that −13,910*T has been subjected to very strong natural selection (s = 0.014–0.19; [8]). It is interesting to note that similar estimates for the strength of selection have been obtained for one of the major African LP variants [4]. It is unlikely that lactase persistence would provide a selective advantage without a supply of fresh milk and this has lead to a gene-culture co-evolutionary model where lactase persistence is only favoured in cultures practicing dairying [13]–[16], and dairying is more favoured in lactase persistent populations [14], [17]–[19]. The reasons why LP, in conjunction with dairying, should confer such a strong selective advantage remain open to speculation. Flatz and Rotthauwe [20] proposed the calcium assimilation hypothesis, whereby a lactase persistence allele is favoured in high-latitude regions because reduced levels of sunlight do not allow sufficient synthesis of vitamin-D in the skin. Vitamin D is required for calcium absorption and milk provides a good dietary source of both nutrients. Additional factors are likely to include the ability to consume a calorie and protein-rich food source, the relative constancy in the supply of milk (in contrast to the boom-and-bust of seasonal crops), and the value of fresh milk as a source of uncontaminated fluids. It is likely that the relative advantages conferred by these various factors differ in Europe and Africa. Estimates of the age of the −13,910*T correspond well with estimates of the onset of dairying in Europe. Slaughtering age profiles in sheep, goats and cattle suggest dairying was present in south-eastern Europe at the onset of the Neolithic [21],[22], while residual milk proteins preserved in ceramic vessels provide evidence for dairying in present day Romania and Hungary 7,900–7,450 years BP [23]. Furthermore, residual analyses of fats indicate dairying at the onset of the Neolithic in England, some 6,100 years BP [24],[25], and after to 8,500 BP in the western parts of present day Turkey [26]. Allelic age estimates are also consistent with the results of a recent ancient DNA study [27] which showed that the −13,910*T allele was rare or absent among early farmers from Central and Eastern Europe. These observations lend support to the view that −13,910*T, and thus LP, rose rapidly in frequency only after the onset of dairying, as opposed to the ‘reverse-cause’ hypothesis [14], [17]–[19], whereby dairying developed in response to the evolution of LP. Important questions remain regarding the location of the earliest −13,910*T-carrying dairying groups and the demographic and gene-culture co-evolutionary processes that shaped the modern distribution of LP in Europe. The present-day distribution of the −13,910*T allele might be taken to indicate an origin in Northwest Europe. However, the earliest archaeozoological and residual lipid and protein evidence for dairying is found in the Near East, in Southeast Europe and in Mediterranean Europe [21],[26],[28]. While these observations can seem contradictory, forward computer simulations have shown that the centre of distribution of an allele can be far removed from its location of origin when a population expands along a wave front [29],[30]. Assuming that the −13,910*T-allele was only subjected to strong natural selection in dairying groups, it is likely that −13,910*T-carrying dairyers underwent demographic expansion to a greater extent than non-dairying groups. While gene flow between dairying and non-dairying groups would ultimately lead to genetic homogeneity, under conditions of limited gene flow between cultural groups, it is plausible that the earliest LP peoples would have made a higher contribution to the European gene pool than their non-LP neighbours. In this study we use demic forward computer simulations to examine potential scenarios for the spread of LP in Europe. We simulate three interacting cultural groups (hunter gatherers, non-dairying farmers and dairying farmers) and track the spread of an allele that is selected only in one group (dairying farmers). We also track the expected proportion of genetic ancestry from the geographic region where LP/dairying coevolution began. We parameterize intrademic gene flow between cultural groups, interdemic gene flow, sporadic longer-distance migration, the cultural diffusion of subsistence practices and selection favouring lactase persistent dairyers. We compare the predicted frequency of a LP allele and arrival dates of farmers – from simulation outcomes – to known frequencies of the −13,910*T allele [3],[8] and carbon-14 based estimates of the arrival dates of farmers [31] at different locations throughout Europe. We employ approximate Bayesian computation (ABC), a set of methods that allow the estimation of parameters under models too complex for a full-likelihood approach [32]. By comparing summary statistics on the observed data with those computed on our simulated datasets, ABC enables us to estimate the key demographic and evolutionary parameters including the region where LP-dairying coevolution in began in Europe.

Discussion The simulation model we have employed here is relatively complex compared to related human demographic/evolutionary models reported [33]–[36]. The inclusion of a selected allele and three distinct but interbreeding cultural groups is necessary for the type of questions we are addressing. But the inclusion of four parameters related to sporadic migration activity, namely the proportion of individuals available for sporadic long-distance migration and the sporadic mobility of each of the 3 cultural groups (modeled separately as a Gaussian random walk process) both allows us to tackle the problem of migration overseas and adds, in our view, an extra level of realism to the model. However, as with any simulation model of population history, many simplifying assumptions have to be made and the extent to which these assumptions may lead to erroneous conclusions remains unknown. For example, we have not considered the ‘reverse-cause’ hypothesis [14], [17]–[19] – which proposes that dairying first arose in populations that were already LP – because both ancient DNA evidence [27] and data from lipid residues on pots [26] are inconsistent with this view. However, this does not mean that once LP-dairying gene-culture coevolution was established, conversion to the culture of dairying was more likely in high LP frequency populations. Such a process is captured in our model to an extent, in that ‘cultural’ conversion is determined by the frequency of the receiving cultural group (see equation 4), and LP is unlikely to rise to high frequencies anywhere without the presence of dairying. Nonetheless, a more explicit treatment of this process may lead to different conclusions. Some parameters, such as those relating to the effects of climate zone/elevation, and the logistic growth rate, are fixed based on realistic assumptions [37]–[39]. For those parameters that are allowed to vary within a range we note that an important shortcoming is that in any single simulation their value is constant over the 360-generation duration of the run. This may be a particular issue for selection acting on an LP allele in F d (see below). Since we identify ‘good’ simulations using their fit to only two data sets (arrival time of farming and LP allele frequency, both at a range of geographic locations) it is unsurprising that our analysis is relatively uninformative for some parameters. However, inclusion of these parameters does serve to reflect uncertainty in their values. Estimates of the arrival dates for farming the 11 locations we consider here were calculated as local weighted averages of calibrated carbon-14 dates [31] from a Gaussian sampling region. We set the standard deviation of this region at the average nearest neighbour distance to ensure that most of the carbon-14 data was used. However, the geographic density of carbon-14 dates is highly uneven across Europe and so the number of such dates that are informative for farming arrival time at any of the 11 locations will vary. Also, there appears to be a considerable amount of noise in the dates for the first farmers. For example, the earliest carbon-14 date for farming in Ireland predates those for Great Britain, the Low Countries and Denmark. To test if these concerns had a major effect on our results we reanalysed our simulation date by setting the target farming arrival dates as those inferred by assuming a constant rate of spread of farming (estimated at 0.9 km/year [31]) and calculating the great circle distance from Anatolia to each sampling location. The results of this reanalysis were very similar to those presented above (see Supplementary Figures S3, S4, S5, and S6), We are well aware that the spread of the Neolithic over Europe was not as constant as our model assumes. After the arrival of the Neolithic in the Balkans, there is a pause of approximately 800 years before it starts to spread to Central Europe, and there is another pause of 1,000 years before it spreads further into the northern German lowlands and other parts of the northern Europe. Clearly, the carbon-14 dates we have used to estimate the farming arrival times will not fully reflect the complex history of neolithisation in all parts of the continent. The list of parameters for which the marginal posterior distributions are notably narrower than their corresponding prior ranges (selective advantage, intrademic gene flow, the sporadic migration distance of F d and F nd , and the geographic origin location of LP/dairying co-evolution) – which we interpret as those parameters for which our analysis is informative – is an unsurprising one since we would expect these parameters to have the greatest influence on the spread of an LP allele and farming in Europe. Likewise, it is unsurprising that the proportion available for sporadic migration and the sporadic mobility of (a) dairying farmers, and (b) non-dairying farmers are both strongly negatively correlated (Figure 2A and 2B) since we would expect these parameters to be confounded in influencing the arrival time of farmers at different locations. The estimated selective advantage conferred by a LP allele (mode = 0.0953; 95% CI = 0.0518–0.159) is in good agreement with previous estimates for Europeans (0.014–0.15 [8]). However, it should be noted that (1) this estimate is for selection only in dairying farmers, who make up just under half of the population that we simulate, and (2) we assume that selection is constant over time. It is possible that selection favouring LP has in fact been episodic and possibly spatially structured in different climate zones [20], [40]–[44]. Episodic selection would be difficult to model without additional information on when those episodes were likely to have occurred. But we reason that constant selection strength is a more parsimonious assumption in the absence of evidence to the contrary. If, as modelled here, dairying farmers made up less than half of the European post-Neolithic population then we would expect the real continent-wide selection values for LP to average less that half of what we estimate here. Such a range of selection values are, however, still consistent with previous estimates based on haplotype decay [8]. Perhaps the most interesting result presented here is our estimation of the geographic and temporal origins of LP-dairying co-evolution. We find the highest posterior probabilities for a region between the central Balkans and central Europe (see Figure 3). At first sight such a location of origin may seem counter intuitive since it is far-removed from Northwest Europe, where the −13,910*T allele is found at highest frequency. However, previous simulations have shown that the geographic centroid of allele can be offset from its location of origin, particularly when it occurs on the wave front of a demographic expansion [29],[30]. The lactase-dairying coevolution origin region inferred here is consistent with a number of archaeologically attested patterns concerning the emergence and spread of dairying. Recent carbon isotope ratios from lipids extracted from archaeological sherds show the presence of milk fats in present-day western Turkey and connect these findings to an increased importance of cattle herding [26], [45]–[48]. In general, the spread of the Neolithic lifestyle from the Aegean to Central Europe goes hand in hand with the decline of the importance of sheep and goat and the rise in frequency of cattle bones in archaeological assemblages. While the Balkans at the beginning of the Neolithic still shows a variety of subsistence strategies [49], the middle Neolithic in SE-Europe and the earliest Neolithic in Central Europe after 7,500 BP show a clear preponderance of cattle. Benecke [50] gives the following averaged rates for the respective domestic species: cattle 55.2%, sheep and goat 32.6%, pig 12%. The proportion of cattle in Central Europe increases during the following centuries to an average of 73% and then stays (with a few exceptions) stable for most prehistoric periods of Middle and northern Europe. Thereby, cattle herding is in most cases connected with kill-of profiles indicative for dairying [22], [50]–[55]. Milk consumption and dairying have been proposed to be as early as the Pre-Pottery Neolithic B of the Near East and may even be a reason for domestication [56],[57]. Without doubt, it was a common cultural practice during all phases and regions of the European Neolithic, especially for goat and cattle. However, a fully developed dairying-based farming economy emerges first during the late Neolithic in Southeast Europe and the Middle Neolithic Cultures following the Linearbandkeramik (LBK) in Central Europe, and is connected mainly to cattle and partly also to goat (for the Rössen culture see [50],[55]). In the Mediterranean, milking of cattle occurs episodically [28] and sheep and goat remain the dominant domestics, as they were earlier in Anatolia and the Aegean. It is very likely that the goat and sheep, and to a lesser extent cattle, based economies of the Mediterranean used processed milk in the form of yoghurt, cheese and other milk-derived products instead of fresh milk. The nutritional and agricultural differences between southern Europe, the Mediterranean and central and northern Europe, as well as historic reports, point to this. For instance, the Romans used goat and sheep milk for the production of cheese, and cattle as a draught animal. In contrast the Germanic peoples and other inhabitants of central and northern Europe practised cattle dairying and drank fresh milk in significant amounts. Strabo reports in his Geography [58]: “Their [sc. “the men of Britain”] habits are in part like those of the Celti, but in part more simple and barbaric - so much so that, on account of their inexperience, some of them, although well supplied with milk, make no cheese; and they have no experience in gardening or other agricultural pursuits.” Overall, by considering the results from our simulations and archaeological, archaeozoological, and archaeometric findings, it seems very plausible to connect the geographic origin of the spread of LP to the increasing emergence of a cattle-based dairying economy during the 6th millennium BC. The geographic region of origin of the LBK – in modern day Northwest Hungary and Southwest Slovakia [59],[60] – certainly correlates well with our results (see Supplementary Figure S7). The date of origin of LP-dairying coevolution estimated here (mode = 7,441 years BP; 95% CI = 6,256 to 8,683 years BP; see Figure 4A and Supplementary Table S2) also fits well with dates for the early LBK in Central Europe (∼7,500 years BP) and its proposed main predecessor, the Starčevo culture of the northern Balkan Peninsula and south of Lake Balaton (8,100 to 7,500 years BP; [61]). However, as explained above, our date estimate is conditioned by farming arrival dates in the estimated LP-dairying coevolution origin region. As a result, our date and location estimates are not independently derived. Nonetheless, a role for LP-dairying coevolution in the later rapid spread of LBK culture – from its origins in the Carpathian Basin – into central and Northwest Europe would be consistent with the significantly higher sporadic migration distances we infer for of F d when compared to F nd . This is also consistent with the rapid dissemination of the LBK culture over a territory of 2,000 km width and approximately one million square kilometres within less than 500 years [62]. Contrary to our expectations, we did not find that the presence of a positively selected LP allele in early dairying groups increases the unlinked genetic contribution of people living in the region where LP-dairying coevolution started to the modern European gene pool, when using demographic parameter values estimated here. The main reason for this is likely to be the relatively high inferred rates of intra- and interdemic gene flow between dairying and non-dairying farmers and between neighbouring demes, respectively, leading to a rapid erosion of any demographic ‘hitchhiking’ of unlinked genomic regions. Additionally, we only track the genetic contribution of people living in and around the deme of LP/dairying coevolution from the inception of this process. Since it takes some time for the LP allele to rise to appreciable frequencies, any demographic ‘hitchhiking’ effect may become important only after the allele centroid has moved some distance away from its origin deme. Another notable result was obtained when we compared the range of expected 13,910*T allele frequencies at different European locations – from simulations accepted at the 0.5% tolerance level – to those observed. While all observed values were within the 95% equal tail probability interval of the simulated values, many were somewhat offset from the modes. We interpret this as indicating that our model does not fully explain the distribution of the 13,910*T allele in Europe. One possible explanation for this is that migration activity – as modeled here by interdemic gene flow and sporadic unidirectional migration – has increased subsequent to the expansion of farming into the northwestern reaches of Europe. In this scenario the farming expansion phase, occurring 9,000 to 5,500 years BP, would be mainly responsible for generating the 13,910*T allele frequency cline in Europe but higher migration activity following this period would then have a homogenizing effect in LP allele frequencies. Intriguingly, a general pattern can be seen (Supplementary Figures 1) whereby observed frequencies are lower than expected in northern Europe and higher than expected in southern Europe. Such a pattern is the opposite of what we would expect if selection for LP was higher in northern latitudes through a greater requirement for dietary vitamin D and calcium because low-sunlight conditions reduce UV-mediated vitamin D production in the skin [20]. This frequently cited mechanism [9], [41], [42], [44], [63]–[65] was not included in our model and thus would seem to have negative explanatory power. Thus our simulations indicate that geographically and temporally homogeneous selection in combination with well-attested underlying demographic processes are sufficient to explain, indeed, to over-explain, the LP/latitude correlation in Europe. However, it should be noted that since we have not explicitly included a parameterised latitudinal effect on selection in our model, there may be scenarios where such an effect could also explain patterns of LP in Europe. As inferred here, the spread of a LP allele in Europe was shaped not only by selection but also by underlying demographic processes; in this case the spread of farmers from the Balkans into the rest of Europe. We propose that this combination of factors could also explain the apparent homogeneity of LP-associated mutations in Europe. In Africa there are at least four known LP-associated alleles, including three that are likely to be of African origin [2],[4] as well as −13,910*T, which is likely to be of European origin [3],[12]. The greater apparent diversity of LP-associated mutations in Africa may reflect a greater genetic diversity in general, leading to the availability of more mutations upon which selection can act following the advent of dairying. However, we suggest that this diversity is the result of an ‘imposition’ of dairying culture on a pre-existing farming people, rather than the spread of dairying being tied to the spread of dairyers. Such a model would require the availability of a number of, albeit low-frequency, LP-causing mutations; either through a high mutation rate or a large number of potential LP-causing sites. It is therefore possible that, in the absence of the spread of dairying being linked to a major demographic expansion, high LP-allele diversity will also be found in the Indian subcontinent. We accept that the model we have used does not accommodate all data (both genetic and archaeological) that is potentially informative on the coevolution of LP and dairying in Europe. Future improvements can be made by adding more ‘realism’ to the model and by increasing the number of data types that are used in the ABC analysis, leading to more integrative inference. The former should include both adding more fixed parameter information (such as the effects of past vegetation, climate variation and other geographic features on migration parameters and carrying capacities [66]–[68]) and estimating currently fixed parameters such as the ratio of dairying to non-dairying farmers. The latter could be achieved by writing the simulation model so that it generates expectations for other data types. For example, including the movement of domestic cattle could be used to generate expectations on patterns of ancient and modern cattle genetic diversity, for which considerable data is available [69]–[73]. Finally, it should be possible to extend the approach we have used to study the evolution of LP and dairying in other parts of the world. We infer that the coevolution of European LP and dairying originated in a region between central Europe and the northern Balkans around 6,256 to 8,683 years BP. We propose the following scenario: after the arrival of the Neolithic in south-eastern Europe and the increasing importance of cattle herding and dairying, natural selection started to act on a few LP individuals of the early Neolithic cultures of the northern Balkans. After the initial slow increase of LP frequency in those populations and the onset of the Central European LBK culture around 7,500 BP, LP frequencies rose more rapidly in a gene-culture co-evolutionary process and on the wave front of a demographic expansion (see Supplementary Videos S1, S2 and S3), leading to the establishment of highly developed cattle- (and partly also goat-) based dairying economies during the Middle Neolithic of central Europe around 6,500 BP. A latitudinal effect on selection for LP, through an increased requirement for dietary vitamin D [20], is unnecessary to explain the high frequencies found in northern Europe.

Material and Methods Our simulation approach is motivated by a previous demic computer simulation study [33] and has features in common with more recent applications of this approach [34]–[36]. Geographic space is modelled as a series of rectangular demes arranged to approximate the European landmass (2375 land demes and 1511 sea demes). Each deme has attributes of elevation, area (which varies due to the curvature of Earth and is calculated accordingly for each individual deme), and a climate (Mediterranean, Temperate, or Cold/Desert – see Supplementary Figures S8 and S9). A maximum total population size is specified for each land deme taking into account its area, and assuming that lower elevation and mild Mediterranean climate results in a greater potential population size, while harsher conditions, such as high elevations and cold/desert climates, result in a smaller potential population size [38]. The ratio for the relative contribution coefficients of climate and elevation factors to the population size is fixed at 1∶4 in this study; meaning that elevation has a more dramatic effect than climate on population size. The maximum deme population size (carrying capacity, K deme , Supplementary Figure S10) is calculated by: (1)where cl and el are the climatic and relative elevation factors, respectively; cl having values of 1 for Mediterranean, 2/3 for Temperate, and 1/3 for cold/desert climates [38] (see Supplementary Figure S9), and el being calculated as: (1.1) So el ranges between 0 at the highest elevation and 1 at sea level (see Supplementary Figure S8). D max is the maximum population density and is fixed at 5 individuals per km2 (i.e. in a sea level Mediterranean climate deme [39]), and A deme is the area of the deme in km2. Each deme contains three distinct cultural groups: non-dairying farmers (F nd ), dairying farmers (F d ), and hunter-gatherers (HG). The ratios of ceiling population size for F nd , F d , and HG (as a proportion of the total maximum population size for the deme, K deme ) are 50∶50∶1 respectively [37],[39]. Each cultural group in each deme is assigned a frequency for an allele that is subjected to genetic drift (modelled by intergenerational binomial sampling) and an allele at an unlinked locus that is not (as explained below). Initially the frequency of both ‘alleles’ is set at zero. The former represents a LP allele and is subject to selection of intensity s, only in the F d group. The latter, here termed the GB (genetic background) ‘allele’, is used to track the general genetic ancestry component from the region where the LP allele is first found among dairying farmers. It will be used to infer the expected proportion of genes that originate from this region. The two alleles are assumed to be unlinked and are modelled separately. We treat s as an unknown but bounded parameter, and choose random values ranging from 0 to 0.2 in simulations [8]. The LP and GB ‘allele’ frequency dynamics are determined in each generation by five processes: (1) intrademic bidirectional geneflow between cultural groups; (2) bidirectional geneflow between demes (interdemic) within the same cultural groups; (3) sporadic unidirectional migration within the same cultural groups; (4) cultural diffusion (CD); and (5) selection operating on LP allele-carrying individuals within the F d group. Hardy-Weinberg equilibrium within each cultural group within each deme is assumed. Population size increase for each cultural group in each deme is modelled by logistic growth, limited by the carrying capacity of each group within each deme. We fixed the growth rate to r = 1.3 per generation, a value estimated from data of world population growth rate over the last 10,000 years, excluding the post-Industrial Revolution population boom (US Census Bureau: www.census.gov). In addition, the F d group is allowed to increase in size as a function of the selective advantage of the LP allele, s, by considering the number of LP individuals and the selective advantage to being a LP dairyer (see equation 7). We define intrademic bidirectional geneflow as the exchange of individuals between different cultural groups within a deme (see Supplementary Figure S11). A proportion of individuals in each cultural group, P c , are deemed ‘available to change group’. The actual number of individuals that are exchanged between cultural groups i and j, B i↔j , is determined as follows: (2)Where N i and N j are the total number of individuals belonging to each cultural group. We treat P c as an unknown but bounded parameter, and choose random values ranging from 0 to 0.2 in simulations [74],[75]. We define interdemic bidirectional geneflow as the exchange of individuals between the same cultural groups in neighbouring demes (see Supplementary Figure S11). A proportion of individuals in each cultural group, P d , are deemed ‘available to change deme’. The actual number exchanged is determined using the same formula as for intrademic bidirectional geneflow (equation 2), except we substitute P d for P c , and N i and N j are the total number of individuals belonging to each cultural group in each neighbouring deme. In each generation, each cultural group in each deme undergoes bidirectional geneflow with one neighbouring deme, randomly chosen from the 8 possible. We define sporadic unidirectional migration as the movement of some individuals in a particular cultural group and deme to the same cultural group in a different deme (see Supplementary Figure S11). A proportion of individuals in each cultural group, P s , are deemed ‘available to migrate’. The actual number of individuals that migrate, N mig , is dependent on the ‘pressure’ to leave the current deme and the availability of unoccupied carrying capacity in the destination deme (‘attractiveness’), and is determined as follows: (3)Where K curr and K dest are the carrying capacities, and N curr and N dest are the number of people in the cultural group, in the current home and destination demes respectively. We treat P s as an unknown but bounded parameter, and choose random values ranging from 0 to 0.2 in simulations. The destination deme is chosen by a Gaussian random-walk process, which takes into account the mobility of the cultural group and the topography of the home deme. The Gaussian distribution is centred on the home deme; and its standard deviation is the product of the mobility of the cultural group, M i , and the relative mobility factor of the home deme, M curr . We treat M i as a separate unknown but bounded parameter for each of the three cultural groups, and choose random values ranging from 0 to 3 (demes) in simulations. M curr is determined for each deme by its elevation, allowing greater mobility at lower elevations [76],[77], with fixed values of 0.5 (demes) at mountainous terrain (above 1100 meters), 1.0 at lowlands (below 1100 meters), and 1.5 at coastal demes. The sporadic unidirectional migration function allows movement overseas, but whenever a sea deme is identified as a non-realistic destination deme the nearest neighbouring coastal deme is chosen instead. This feature, together with the attractiveness of low elevation land and the higher M curr value for coastal demes, creates the realistic tendency of a faster spread of farming along coastlines, consistent with archaeological data [78]. We define Cultural Diffusion (CD) as the spread of culture and technology by learning through exposure rather than by migration (see Supplementary Figure S11). In our simulations a proportion of individuals in each cultural group, P dif , are deemed ‘available to convert’ from one cultural group to another. The number of individual that convert from cultural group i to cultural group j, N i→j , is determined by this parameter and the proportion of the carrying capacity (K) of the home deme (deme 0) and in the 8 neighbouring demes (demes 1 to 8) that is taken up by cultural group j, as follows: (4)where b is the relative influences of the home deme and the 8 neighbouring demes (fixed to 0.75). We treat P dif as an unknown but bounded parameter, and choose a random value ranging from 0 to 0.2 in each simulation. That value is then applied to ‘conversions’ between all 3 cultural groups. The geographic location where LP/dairying gene-culture coevolution starts is chosen at random from all land demes. This LP mutation is initialized at a frequency of 0.1 in F d when their population size reaches a critical size in the chosen start deme, set to a minimum of 20 individuals per deme in simulations. While we would expect any de novo mutation to always have an initial frequency of 1/2N, we also expect that it will have a high probability of extinction unless selection is very strong [79]. Indeed, in preliminary simulations this was observed (data not shown). Thus, for computational efficiency we condition on the LP mutation having already reached a frequency of 0.1 in F d in the deme of origin. However, such a starting frequency means that little more than four LP alleles are initialized in simulations. Selection acting on the LP allele, p, increases its frequency in F d only, as follows [80]: (5)where s is the selection coefficient for p, and p′ is the new LP allele frequency. In addition, selection acting on the LP allele increases the number, N, in F d as follows: (6)where N′ is the new number of F d in a particular deme. All simulations were run for 360 generations which, assuming a generation time of 25 years [81],[82], corresponds to the 9,000-year history of farming in Europe. We performed 200,000 simulations in total. The genetic contribution of the population living in the region of origin of LP/dairying gene-culture coevolution to the overall European population is tracked over generations by calculating the GB ‘allele’ frequency over all demes in all 3 cultural groups. In the generation when the LP allele is initialized, all cultural groups in the origin deme and 8 neighbouring demes are assigned the unlinked GB ‘allele’ at a frequency of 1. The GB ‘allele’ is subjected to the same intra- and inter-deme geneflow and migration processes as described above, but is not subject to drift, as modelled by binomial sampling, or to selection. At the end of each simulation this GB allele is taken to represent the general genetic contribution of the population living in the region of origin of LP to the modern European population. The ancestry component of Europeans, at any generation, that originates from people living in the region of origin of the LP allele (F GB ) is calculated as follows: (7)where n is the number of land demes, N i is the total number of people in deme i, and p GBij and N ij are the frequency of the GB ‘allele’ and the population size in deme i/cultural group j, respectively. To estimate parameters of interest we use an ABC approach, following [32]. By comparing summary statistics computed on each simulated dataset to those from the observed data, we are able to accept only those simulations with summary statistics sufficiently close to the target (i.e. the observed summary statistics) and reject the remainder. We then perform a weighted local-linear regression on these retained parameter sets, with weight determined by the “distance” between the simulation summary statistics and the target (all details below). This generates approximate marginal posterior probability distributions for each parameter of interest, from which we derive our modal point estimates. Our chosen summary statistics, U, are the frequencies of the −13,910*T allele at 12 different sample locations around Europe, the Near East and western Asia [7],[8]. In addition we include as summary statistics the times to arrival of farming at 11 of the same locations (the Anatolia location is excluded as the simulation model is initialized with this as the origin of the spread of farming into Europe). We recognize that these are not summary statistics sensu stricto but are parameters in the model for which we have independent estimates. However, the simulations, being stochastic, generate a distribution of arrival times, and we need to condition on those that are consistent with the known archaeological evidence. The most straightforward way to do this is to place a point prior on the arrival dates, and then condition on these using the ABC machinery, as if they are summary statistics. The point priors for the arrival dates of farming at 11 of the 12 sampling locations considered (Anatolia was set to 9,000 years as the simulations begin 360 generations ago in ‘an Anatolia’ populated by farmers) were calculated as follows: (1) The average nearest-neighbour distance (ANND) between each sampling location was calculated (557.13 km). (2) A 2-D Gaussian sampling region was constructed around each of the 11 sampling locations, of standard deviation = ANND/1.96 (this ensures that 95% of each Gaussian sampling region will be within the ANND). (3) A weighted average of all dates within 3 standard deviations of the sampling location was calculated using all calibrated carbon-14 earliest farming arrival dates from Pinhasi et al. [31], and weighting using the distance from the sampling location and the standard probability density function for a Gaussian distribution. Assuming a generation time of 25 years [81],[82] these observed dates are converted to generations from the start of the simulation, which was set at 9,000 years BP or 360 generations ago (see Table 1). We also include two Spearman's rank-order correlation coefficients, calculated separately for the 12 T-allele frequencies and the 11 times to arrival of farming, giving a total of 25 summary statistics. When calculating these statistics for the simulated data: we take LP frequencies in the final generation of the simulation at the 12 corresponding geographic locations; and the time to arrival of farming is defined as the simulation generation at which either F d or F nd reach 1% of their carrying capacity within each of the 11 corresponding location demes. All time to arrival of farming statistics are scaled to the interval [0,1] by dividing by the total number of simulated generations (360). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 1. −13,910*T allele frequencies, inferred farming start dates and geographic coordinates of 12 locations data used in ABC analysis. https://doi.org/10.1371/journal.pcbi.1000491.t001 Parameters of interest, φ, are: the east-west and north-south coordinates of the location where the LP-allele first undergoes selection among F d ; the generation at which this selection starts; the selective advantage of LP within the F d group, s; the proportion available for interdemic bidirectional geneflow, P d ; the proportion available for intrademic bidirectional geneflow among cultural groups, P c ; the rate of cultural diffusion, P dif ; the proportion of people available for sporadic migration, P s ; the mobility of each of the three cultural groups, M i ; and the contribution of people living in the deme where LP-dairying gene-culture coevolution began and its 8 surrounding demes, F GB , to the modern European gene-pool. The uniform prior distributions for each parameter are given in Supplementary Tables S2 and S3. Our full ABC algorithm is as follows: (1) choose the summary statistics U as outlined above and calculate their values, u, for the observed data (these are given in Table 1), (2) choose a tolerance level δ (as suggested we pre-define a proportion of the best fitting simulations, P δ , to accept and from this calculate an implicit tolerance level δ), (3) sample a parameter set φ i from the pre-determined prior distribution of φ, (4) simulate forward under our model using parameter set φ i , (5) in the final generation of our simulation we calculate the summary statistics, u i , for this simulated data, (6) If ∥u i −u∥≤δ (where ∥.∥ is the Euclidean norm between the two vectors) we accept parameter set φ i , (7) steps 3 to 6 are repeated until we have a sufficient number of retained parameter sets, (8) A local-linear standard multiple regression is then performed to adjust the φ i , with each φ i weighted according to the size of ∥u i −u∥ using the Epanechnikov kernel function K δ (t) (see [32] for details), (9) The resulting fitted parameter sets φ i * form a random sample from the approximate joint posterior distribution P(φ|U = u). All retained parameters – except for the two coordinate values and the generation at which the co-evolutionary process starts – were log transformed prior to the regression step, and subsequently back-transformed to produce the fitted parameter sets φ i *, as suggested by Beaumont et al. [32]. The simulation and ABC analysis procedures were written in the Python Programming Language (URL: http://www.python.org/) employing the numarray and Numpy array handling libraries. Maps and animations were generated using the Python library PyNGL. Post-ABC analysis data was processed and visualised using the statistical package ‘R’ (URL: http://www.R-project.org/).

Acknowledgments We thank Dallas Swallow, Lounès Chikhi, Charlotte Mulcare, Pascale Gerbault, Jens Lüning, Kevin Bryson, Norbert Benecke, Mehmet Özdogan, László Bartosievicz, Marjan Mashkour and Yael Pinchevsky for access to date and helpful comments.

Author Contributions Conceived and designed the experiments: MGT. Performed the experiments: YI AP MGT. Analyzed the data: YI AP MAB MGT. Contributed reagents/materials/analysis tools: YI AP MAB JB MGT. Wrote the paper: YI AP JB MGT.