Abstract The Δ32 mutation at the CCR5 locus is a well-studied example of natural selection acting in humans. The mutation is found principally in Europe and western Asia, with higher frequencies generally in the north. Homozygous carriers of the Δ32 mutation are resistant to HIV-1 infection because the mutation prevents functional expression of the CCR5 chemokine receptor normally used by HIV-1 to enter CD4+ T cells. HIV has emerged only recently, but population genetic data strongly suggest Δ32 has been under intense selection for much of its evolutionary history. To understand how selection and dispersal have interacted during the history of the Δ32 allele, we implemented a spatially explicit model of the spread of Δ32. The model includes the effects of sampling, which we show can give rise to local peaks in observed allele frequencies. In addition, we show that with modest gradients in selection intensity, the origin of the Δ32 allele may be relatively far from the current areas of highest allele frequency. The geographic distribution of the Δ32 allele is consistent with previous reports of a strong selective advantage (>10%) for Δ32 carriers and of dispersal over relatively long distances (>100 km/generation). When selection is assumed to be uniform across Europe and western Asia, we find support for a northern European origin and long-range dispersal consistent with the Viking-mediated dispersal of Δ32 proposed by G. Lucotte and G. Mercier. However, when we allow for gradients in selection intensity, we estimate the origin to be outside of northern Europe and selection intensities to be strongest in the northwest. Our results describe the evolutionary history of the Δ32 allele and establish a general methodology for studying the geographic distribution of selected alleles.

Citation: Novembre J, Galvani AP, Slatkin M (2005) The Geographic Spread of the CCR5 Δ32 HIV-Resistance Allele. PLoS Biol 3(11): e339. https://doi.org/10.1371/journal.pbio.0030339 Academic Editor: Andy Clark, Cornell University, United States of America Received: October 29, 2004; Accepted: August 4, 2005; Published: October 18, 2005 Copyright: © 2005 Novembre et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Competing interests: The authors have declared that no competing interests exist. Abbreviations: MLE, maximum likelihood estimate; PDE, partial differential equation

Introduction The geographic spread of advantageous alleles is fundamental to evolutionary processes, including the geographic distribution of adaptive traits, the cohesiveness of species, and the spatial dynamics of coevolution between pathogens and their hosts. Various theoretical models describe the dynamics of how advantageous alleles spread within a population, but few well-studied examples exist, particularly in humans, of how advantageous alleles spread geographically. The CCR5 Δ32 mutation is a good example of an advantageous allele with a well-characterized geographic distribution. The Δ32 mutation currently plays an important role in HIV resistance because heterozygous carriers have reduced susceptibility to infection and delayed onset of AIDS, while homozygous carriers are resistant to HIV infection [1]. The mutation is found principally in Europe and western Asia, where average frequencies are approximately 10%, although the frequency varies within this geographic area. HIV only recently emerged as a human pathogen, so researchers were surprised when various sources of evidence showed strong selection in favor of Δ32 throughout its history. The age of the Δ32 allele has been estimated to be between 700 and 3,500 y based on linkage disequilibrium data [2,3], and recent ancient DNA evidence suggests the allele is at least 2,900 y old [4]. If Δ32 were neutral, population genetics theory predicts it would have to be much older given its frequency. The alternative explanation is that the Δ32 mutation occurred recently and then increased rapidly in frequency because of a strong selective advantage [2,5]. Quantitative studies have concluded that heterozygous carriers of Δ32 in the past had a fitness advantage of at least 5% and possibly as high as 35% [2,3]. Bubonic plague was initially proposed as the selective agent [2], but subsequent analysis suggested that a disease like smallpox is a more plausible candidate ([6–8], with reviews in [9–11]). To understand the origin and spread of Δ32, we modeled the effects of selection and dispersal on the allele. The Δ32 mutation is found only in European, West Asian, and North African populations. The allele frequency exhibits a north–south cline with frequencies ranging from 16% in northern Europe to 6% in Italy and 4% in Greece (Figure 1; [2,5,12–17]). The broadest area of high frequency is located in northeastern Europe, particularly the Baltic region, as represented by samples from Sweden, Finland, Belarus, Estonia, and Lithuania. There are additional peaks of frequency in samples from the northern coast of France, the Russian cities of Moscow and Ryazan, and portions of the Volga–Ural region of Russia. Ashkenazi Jews have high frequencies of Δ32, but this is likely due to founder effects unique to their history rather than the general process of dispersal that spread the allele in other populations [18]. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 1. Shaded Contour Map of Δ32 Allele Frequency Data The sampling locations are marked by black points. The interpolation is masked in regions where data are unavailable. https://doi.org/10.1371/journal.pbio.0030339.g001 Previous discussion of the geographic distribution of Δ32 has focused on the north–south cline in frequency. Lucotte and Mercier [12] suggested that the cline and other features of the geographic distribution imply a Viking origin. In particular they proposed that the allele was present in Scandinavia before 1,000 to 1,200 y ago and then was carried by Vikings northward to Iceland, eastward to Russia, and southward to central and southern Europe. The age and geographic distribution of the allele are consistent with the qualitative predictions of the Viking hypothesis [12,17], but there has been no quantitative analysis of the Viking hypothesis or alternative hypotheses. One alternative is that a northern origin coupled with typical levels of dispersal in Europe is adequate to explain the geographic distribution of Δ32. Under this hypothesis, rare long-distance dispersal events, such as Viking dispersal, play a minor role in the spread of the advantageous allele. Another alternative is that the allele may have arisen in central Europe and increased to a higher frequency in the north because of a geographical gradient in selection intensity [19]. There are two plausible biological causes for a gradient in selection. First, the selective advantage of Δ32 may have been larger in the north. This hypothesis stems from anecdotal evidence that indicates smallpox epidemics were more intense in northern Europe [20]. A second mechanism is that a selective cost associated with the Δ32 allele may have been stronger in the south, and thus the overall selection intensity in favor of Δ32 may have been weaker in the south and stronger in the north. While there is little direct evidence that Δ32 carriers are more susceptible to general infection [10,11], the plausibility of a selective cost of Δ32 is supported by evidence that chemokines play an important role in inflammatory responses to infection [21,22] and by studies with mice that show that CCR5 knockouts have poor immune responses to various pathogenic infections [23–25]. These results suggest some pathogens may have an advantage infecting Δ32 carriers because the immune response is impaired by the absence of functional CCR5 chemokine receptors. If such pathogens tended to be more prevalent in the temperate climates of southern Europe, then a selection gradient would arise. It is even plausible that Δ32 could be disadvantageous in certain areas where the protective effect is outweighed by the disadvantage of a weakened immune response. In a model with selection gradients, Viking dispersal may still contribute to the spread of the allele, but the geographic origin of the allele and the influence of spatially variable selection differ from that in the Viking hypothesis. A further question regarding the geographic spread of Δ32 is whether the historical selective agent acted only in Europe and western Asia or on a larger geographic scale. In the former case, the restriction of Δ32 to Europe and western Asia is explained by spatially varying selection, and in the latter, by insufficient time for the allele to have dispersed farther. Here we fit a simple population genetic model to the geographic distribution of Δ32 in order to infer features of the processes of dispersal and selection that shaped the historical spread of the allele. In particular we conclude that given current estimates of the age of the Δ32 allele, the allele must have spread rapidly via long-range dispersal and intense selection to attain its current range. We find the Δ32 allele is likely restricted geographically because of limited time to disperse rather than local selection pressures. In addition, we show that the data are consistent with origins of the mutation outside of northern Europe and modest gradients in selection.

Discussion We can draw several conclusions from our analysis of the geographic distribution of the Δ32 allele. First, the results suggest that strong selection (s ≥ 0.10) and long-distance dispersal of humans (σ > 100 km) are necessary to explain the current geographic distribution of the Δ32 allele. Values of σ > 100 km are larger than the estimates of 1–75 km found in various studies using historical records for Europeans [27]. If, however, the Δ32 allele is older and less advantageous than previously estimated [3,4,6], which has been suggested recently by [29], then our analysis of the geographic distribution becomes more consistent with previously published estimates of dispersal distances in European populations. Because our results depend on the ratio R = σ2/s, a smaller s implies estimates of σ that are closer to those based on historical records (see Figure 4), although still somewhat larger. For example, for our MLE of R = 1.03 × 106, a value of s = 0.0075 corresponds to σ = 88 km. One explanation for the discrepancy between genetic and historical estimates of dispersal distance is that the estimates based on historical records may be too low because they do not reflect longer-distance dispersal events such as those associated with trade routes or population movements. Our larger estimate of σ may reflect such long-distance dispersal events. Second, we conclude that if selection is spatially uniform, Δ32 arose by mutation in northeast Europe as suggested by Libert et al. [5]. This hypothesis is parsimonious because it does not require gradients in selection intensity. If selection is not spatially uniform, we find the geographic origin could be far from locations where Δ32 is currently in high frequency. We reach this conclusion because, in our model, dispersal dominates initially and spreads the new allele over a large geographic area before selection can increase the allele frequency locally. When we allow for selection gradients and take account of the data from Iceland, we conclude that Δ32 most likely originated either in Spain or northern Germany. The gradients in selection intensity needed are not extreme and are on the order of only a 20% relative difference between southern and northern Europe and a 5% relative difference between eastern and western Europe. Although allowing for selection gradients is less parsimonious, the model with selection gradients had a significantly higher likelihood than the model with uniform selection. The north–south gradient detected here is consistent with anecdotal evidence that smallpox was more prevalent in the north [20]. Third, our results show that the geographically restricted distribution of Δ32 is a result of Δ32 not having had time to disperse more widely, rather than resulting from a geographic restriction of selection favoring it. Given more time and no change in selection affecting Δ32, the allele would have spread over a wider area. Our large estimates of dispersal are consistent with the Viking hypothesis of Lucotte and Mercier [12]. Moreover, when selection is assumed to be spatially uniform, the maximum likelihood origin is in southern Finland. However, incorporating gradients in selection provided significantly better fits to the data, and in models with gradients, origins in Scandinavia did not have high likelihoods. Thus, our likelihood-based analysis provides some support for the Viking hypothesis in that we detect a strong signature of long-range dispersal events, but it also raises the possibility that the allele arose outside of Scandinavia and spread into the region via dispersers from the south. Our analysis makes a number of simplifying assumptions. Our model does not incorporate genetic drift. To examine the effect of ignoring drift, we simulated a stepping stone model with local deme sizes of N e = 2,500 and a selection coefficient of s = 0.05. We found that with drift, the allele frequency surface becomes somewhat more jagged than without drift, but the underlying shapes are still the same (results not shown). Concluding that drift is negligible based on the simulation results is conservative as the selection coefficient of Δ32 is most likely at least 0.05 [2,3] and the effective population size of a deme (where 1° latitude by 1° longitude may be considered a deme) would be greater than 2,500 even assuming 2000 BCE population densities (2.5 km−2; [30]) and N e as one-quarter of the census population size. Additional support for the idea that drift is unimportant for the large-scale patterns in this type of model comes from the analytical results of Kot et al. [31] for a similar model of branching random walks. They show that the average rate of expansion in a stochastic model is similar to that in a deterministic model. Another assumption of our approach is that the allele under selection has an additive effect. We tested the robustness of our results to deviations from additivity by generating allele frequency surfaces in which the fitness advantage of Δ32 heterozygotes is kept constant and a range of fitness advantages of the Δ32/Δ32 genotype was assumed. We found that varying the degree of dominance had little effect on the geographic distribution of Δ32 (results not shown). The negligible importance of the fitness advantage of Δ32 homozygotes arises because the proportion of Δ32 homozygotes is sufficiently small throughout the history of Δ32 that the assumption regarding the fitness of the homozygote only has a minor effect. Our use of diffusion equations assumes that only the mean and variance of the dispersal distribution are needed to model the effects of dispersal and that higher central moments such as kurtosis are negligible. Studies of Fisher's wave of advance in the ecology literature have shown that if kurtosis is non-negligible, as in the case of “fat-tailed” dispersal distributions (distributions whose tails are not exponentially bounded), the asymptotic behavior of the wave of advance changes so that the speed of the wave continually accelerates [32,33]. Because observed dispersal distributions for humans have been found to be leptokurtic (i.e., large kurtosis) [34], we considered the effect of a leptokurtic dispersal distribution on our results. We simulated the spread of an advantageous allele in a two-dimensional stepping stone model on a torus using three different dispersal distributions with varying degrees of kurtosis: a normal distribution, a double gamma distribution used by Cavalli-Sforza et al. [34] to fit human dispersal data, and a modified double exponential distribution used by Clark et al. [35] to describe fat-tailed seed dispersal data (see Table 1). These distributions were scaled to have a standard deviation of 100 km. We sampled data from the resulting spatial distributions of allele frequency, and estimated the ratio of dispersal to selection using the same method we applied to the Δ32 data. The results (Table 1) show that the effect of kurtosis on the estimates is very small: between all three distributions the estimates of R vary in a narrow range and the corresponding values of σ vary only between 98 and 110 km. While violations of each of these simplifying assumptions (no genetic drift, additivity of the selective effect, a diffusion approximation for dispersal) are unlikely to have important effects on our estimates, the variance introduced by violations may contribute to the overdispersion observed in the data and the significant G-test statistic we computed. Another likely cause of the unexplained variance is that our model does not explicitly incorporate specific historical events. Information about particularly important dispersal events will help refine quantitative models of the evolution of Δ32. However, a challenge to developing such models is the difficulty of keeping them from becoming too parameter-rich or overburdened with assumptions regarding historical demographic events [36]. In summary, we present an approach to analyzing the geographic distribution of a selected allele. The approach allows us to estimate the ratio of dispersal to selection as well as fit gradients in selection to the observed allele frequency data. Our analysis confirms Δ32 has been under strong selection, and furthermore shows that long-range dispersal and selection gradients have been important processes in determining the spread of this advantageous allele. The results provide an insight into the history of Δ32 and into the processes that affect the geographic spread of advantageous alleles in humans.

Materials and Methods Landmass data We focused our analysis on the region extending from 22°N to 75°N and 10°W to 154°E. Topographic data were obtained from the ETOPO5 data assembled by the National Geophysical Data Center. The exact dataset used was a version with 1° latitude/longitude resolution that is provided as a standard dataset in MATLAB 7. The coastline data were extracted by taking all values above sea level to be land. A land bridge between Denmark and Sweden was added to model migration between the two closely separated land masses. An image of the habitat is available as Figure S1. Allele frequency data A summary of Δ32 allele frequency data was constructed by pooling data from multiple published papers [12,14,19,37–43]. Latitude and longitude for 58 of the 71 samples were kindly provided by S. Limborska. For the remainder of samples, we used either the latitude and longitude of the city where the sample was collected or the latitude and longitude of a major city in the region sampled. We excluded any data points that were obtained from land masses not connected to the European mainland in our model, as well as allele frequency estimates from Ashkenazi Jews because of the unique founder effects in their history [18]. The allele frequency data are provided in Table S1. Model of selection and dispersal To model the frequency of the Δ32 allele across Europe we used an approach based on Fisher's wave-of-advance theory [26]. The model is deterministic and based on a two-dimensional, nonlinear PDE. The PDE describes the function p(x,y,t), which represents the distribution of allele frequency across the xy plane at time t: where Δ(p) is a nonlinear function of p that represents the change in allele frequency due to selection. The coefficient σ2 denotes the variance of the parent–offspring dispersal distance distribution. To calculate Δ(p), we assume fitnesses of 1 + d, 1 + s, and 1 for the Δ32/Δ32, Δ32/+, and +/+ genotypes, respectively. For this parameterization of selection, standard deterministic theory [44] gives the following result: To incorporate gradients in selection, s and d are replaced with linear functions, denoted s(x,y) and d(x,y), respectively. In particular where s c , x c , and y c represent the selection coefficient and the x and y coordinates of the center of the habitat, respectively. This approach does not limit s(x,y) to being positive; that is, if the gradient in selection is strong enough, portions of the range may have negative selection coefficients. The results presented here are limited to the assumption of additivity, so d(x,y) = 2s(x,y), although the results are not sensitive to the assumption regarding d (see Discussion). To represent the occurrence of the mutation at a single location in space with an initial local frequency of p 0 , we specified the initial conditions of the PDE solution to be where δ(x,y) is a two-dimensional Dirac delta function, which takes on values close to one at x = 0 and y = 0, and values near zero for all other values of x and y. The value of p 0 was calculated by the formula 1/D, where D is the initial population density. Population density only enters the model by determining the initial frequency. We generated results for D = 2.5 and D = 20. The two conditions correspond to published estimates of the population density in Europe at 1000 BCE and 1300 CE [30] and thus represent population density at the two orders of magnitude that are relevant for the origin of Δ32. The general results presented did not differ between D = 2.5 and D = 20, so we report results only for D = 2.5. For the boundary conditions, a model of reflecting boundaries was imposed. The assumption of reflecting boundaries is an implicit assumption that alleles are not lost or gained at the habitat boundaries. For the application of the equations to a geographic habitat, we set the x-axis to be latitude and the y-axis to be longitude. In the results we report σ in units of kilometers. For rendering the habitat we work in coordinates of degrees latitude and longitude, so a simple conversion is needed to change σ in units of kilometers to units of degrees latitude and longitude. To convert from units of latitude we employ the number of kilometers per degree latitude as the scaling coefficient, which is a constant 111 km per degree latitude. To account for the decreasing amount of geographic distance represented by 1° longitude as one moves north, σ in the longitudinal axis is converted by taking σ/m(x) where m(x) is the number of kilometers per degree longitude at latitude x. Without this correction, there would be a nearly 4-fold difference in latitudinal dispersal between the lower edge (22°N) and upper edge (75°N) of the range we considered. A similar correction is applied to s(x,y), d(x,y), and p 0 . Numerical solutions To solve the PDE for p(x,y,t), we used an alternating-direction implicit approach with Crank–Nicholson updates at each time step [45]. In this approach, the continuous habitat is discretized into elements of length Δx and width Δy. Time is discretized into segments of length Δt, so that p(x,y,t) is represented by a three-dimensional matrix P(n) with elements P(n) j,k representing the value of p(x,y,t) at time nΔt at a point (jΔx,kΔy) relative to the origin. Δx and Δy were set to 1° of latitude and longitude, respectively. For the results presented here Δt was fixed at 0.005. Results were qualitatively similar for different values of Δx, Δy, and Δt, provided that all three were sufficiently small. The accuracy of numerical solutions was confirmed by comparison to analytical results that exist for simple geometries and linear selection pressures (results not shown). Model for the frequency of Δ32 in Iceland To model the frequency of Δ32 in Iceland, we used a standard single-population deterministic model of selection in which the additive selection intensity was set to s = 0.2 and the initial frequency was obtained from the PDE model. Specifically, the initial frequency was obtained by setting a selection Intensity of s c = 0.2 in the PDE model and recording the allele frequency at a representative location in Scandinavia (60°N 11°E) 44 generations (≈1,100 y) before the allele reaches 16% frequency. This approach neglects any possible founder effects associated with the founding of Iceland and any recurring migration between Iceland and mainland Europe. It also assumes the selection intensity in Iceland to be similar to that on the mainland. Maximum likelihood estimation To obtain MLEs for the parameters of the model, we used a fixed allele age t a and supposed the number of Δ32 alleles observed at each sample locale arose as an independent binomial sample where the success probability at point (x,y) is determined by p(x,y,t a ). The resulting likelihood function is This likelihood approach benefits from taking into account the sample size at each sampling locale, so that the discrepancy between predicted and observed allele frequencies is penalized less at locations with smaller sample sizes. The value of t a used for the results was the time at which, given the other parameters, the maximum allele frequency first reached 16%, although using a value of 20% provided qualitatively similar results. A grid-based method was used to produce a joint likelihood surface over R, G NS , G EW , x 0 , and y 0 . We used a grid for R that had eight values between 2 × 104 and 2 × 106 spaced evenly on a logarithmic scale; a grid for the geographic origins x 0 and y 0 that contained the 29 locations indicated by the points in Figure 6; a grid for G NS that started at 0 and then covered from 1 × 10−5 to 22 × 10−5 with increments of 3 × 10−5; and a grid for G EW that included zero as well as a range from −17.5 × 10−5 to 17.5 × 10−5 with increments of 5 × 10−5. The resulting grid contained 14,848 points in the five dimensions of R, G NS , G EW , x 0 , and y 0 . Evaluating the goodness of fit To asses the goodness of fit of the model we performed a standard G-test. The G-test statistic can be formulated as 2(L n −L 5 ) where L 5 is the log-likelihood computed using the MLE values in our full five-parameter diffusion-based model, and L n is the log-likelihood computed using the observed sample frequencies as the respective population frequencies for the binomial distributions in the likelihood function (equation 5). For our dataset, L n = −144.9 and L 5 = −247.7. We also estimated the overdispersion parameter ϕ by the ratio of the G-test statistic to the number of degrees of freedom. Under the null hypothesis the estimate of the overdispersion parameter is expected to be one, and for large samples, values greater than one are indicative of overdispersion in the data. Evaluating the effect of kurtosis To evaluate the effect of kurtosis we used simulations on a two-dimensional stepping stone habitat of 121 × 121 demes placed on a torus-shaped habitat arranged in a uniform distribution on (−6,000 km, 6,000 km) along the x and y axes. The origin of the allele was chosen to be at x 0 = 0 and y 0 = 0, and the initial frequency of the allele was set to 10−5. The simulations were stopped when the allele frequency became greater than 16%. Selection was incorporated with an additive allele with s = 0.2. Dispersal was modeled using three dispersal distributions. Here, for simplicity, we present the one-dimensional version of each dispersal kernel. The two-dimensional distribution was found by assuming dispersal along each axis was probabilistically independent and taking the products of the corresponding one-dimensional distributions. The first distribution was a mean-zero Gaussian distribution: The second was a modified double exponential that when c < 1 is not exponentially bounded, and thus qualifies as a fat-tailed dispersal distribution: The third was a double gamma distribution that was used by Cavalli-Sforza et al. [34] to fit historical data on human dispersal: All three distributions were parameterized to have a standard deviation equal to 100 km, so that the effect of kurtosis alone could be assessed. For the shape parameter of the double gamma distribution we used the value of 0.0419 estimated by Cavalli-Sforza et al. [34] for human historical data. The resulting double gamma distribution had a kurtosis of 146.2. For the modified double exponential distribution, we used a shape parameter (c = 1/2) that gives reasonable kurtosis (K = 25.2) and guarantees the tails of the distribution are not exponentially bounded. The resulting allele frequency surfaces were binomially sampled at 49 evenly spaced locations with samples of size 120 to construct a simulated allele frequency dataset. The data were then passed to the likelihood-based method used on the Δ32 data but with G NS , G EW , x 0 , and y 0 all fixed to zero, so that R was the only parameter to estimate. For the grid search we used a grid of R values with nine points from 3.3 × 104 to 1 × 105. The mean and standard error for estimates of R and σ reported in Table 1 are the average of ten replicates for each dispersal distribution.

Supporting Information Figure S1. The Geographic Area Used in the PDE Model The grey squares mark the geographic area used for numerical solution of the PDE. Coastlines are overlaid in the figure only for reference. https://doi.org/10.1371/journal.pbio.0030339.sg001 (38 KB PDF). Table S1. Allele Frequency Data Used in the Analysis https://doi.org/10.1371/journal.pbio.0030339.st001 (33 KB PDF).

Acknowledgments Funding for this work comes from Howard Hughes Medical Institute (JN), the Miller Foundation (APG and MS) and National Institutes of Health grant NIH-GM-40282 (MS). We thank S. Limborska for providing geographic coordinates of his published data, as well as Eric Anderson, Laurent Excoffier, Gerard Lucotte, and two anonymous reviewers for helpful comments regarding the manuscript.

Author Contributions JN, APG, and MS conceived and designed the experiments. JN performed the experiments and analyzed the data. JN, APG, and MS wrote the paper.