The probability of one big step versus many smaller steps

According to the single-mutant view, no other species, not even our closest extinct relatives the Neanderthals, ever had Merge, and Merge was universally in place before Homo sapiens moved out of Africa, restricting its emergence to a time window of approximately 100 000 years. Independent evidence indicates that the effective population size of ancient human populations during this period was maximally around 10 000–15 000 individuals with a considerably smaller population bottleneck (hundreds of effective individuals) around 140 000 years ago15. Small effective population sizes are supported by observations of ethno-linguistic group sizes of modern hunter-gatherers, which tend to be around 150016. Although there may be many of such groups, thus resulting in a larger number of individuals (the “census size”) because most reproduction occurs within the groups, the effective population size tends to be much smaller than the census size, and closer to the group size.

Our analysis of Berwick and Chomsky’s proposal examines the fate of a single mutant that arises in a population of this size. Conveniently, fixation of a single mutant in a small finite population is an extensively studied dynamics: a range of theoretical tools exists (which we will present in sections 2.2 and 2.3 below) for analysis of the core proposal. The single mutation envisaged must have conferred an enormous fitness advantage. Formally, fitness is usually expressed as a selection coefficient. Only the ratio of fitness of different individuals is generally relevant, so the wild type (i.e. the individuals without mutations) is defined to have fitness 1. Mutant individuals have fitness 1 + s, where s is called the selection coefficient. Because in many cases (as in the case at issue) the organisms under consideration have two copies of each gene, there will be two types of mutant individuals: heterozygous (one copy of the mutation) and homozygous (two copies of the mutation). These in general have different selection coefficients: s aA and s AA respectively in our notation.

Selection is usually assumed to be weak in evolutionary models, which licenses many useful methods of approximate analysis. These methods are unavailable to us as a result of the uncommon assumption that a single mutation confers a large fitness effect. This effect cannot be arbitrarily large, because the number of offspring that an individual could raise in prehistoric and pre-agricultural times was limited by many unrelated factors. One way to balance these constraints is to limit our numerical analyses to selection coefficients s ≤ 1 (which still corresponds to mutants having on average twice the number of offspring than the wild type in the strongest case). Berwick and Chomsky’s assumption that fitness advantages reflect increased capacity for individual thought rather than communication implies frequency-independent selection.

Finally, the fitness effect of the proposed mutation would occur in the heterozygote mutant, and its effect must be fully dominant. In other words, both individuals with one copy (heterozygotes) and with two copies (homozygotes) of the mutant gene would have full Merge and thus the full fitness advantage. This is because in the proposed scenario, one either has Merge or not, and therefore there cannot be two separate stages with different fitness effects. The alternative of a recessive mutation, in which individuals need two copies of the mutated gene to have merge, would lead to a strongly reduced probability of the mutation spreading, because homozygotes would be rare initially and the mutant gene would therefore first have to spread through random drift. Figure 2 provides a sample of probabilities; for an exact description of these calculations see the next subsection.

Figure 2 Fixation probabilities for recessive beneficial mutations. Note that these are very low given the large selection coefficients. Full size image

By contrast, the alternative scenario for gradual evolution of linguistic ability proposes that the evolution of language happened in a way that is far less exceptional by biological standards. This means that there were potentially many mutations involved that had small to moderate fitness effects and that all contributed to the ability for language. Because of the smaller fitness effects, fixation probabilities were lower and fixation times were longer. This does not mean that the overall probability of this scenario is lower, however: multiple mutations evolving in parallel can be present in a population at the same time. If the fitness effects are sufficiently small and additive, mutations can be considered to evolve independently (for a discussion of why we think this is justified, see Supplementary Material S7). Also, because the effects of the mutations are small, it is much more probable that if one mutation disappears from the population, eventually another mutation with a similar effect occurs. Finally, in this scenario it is not necessary that all mutations reach full fixation: it is expected that there is some remaining variation in the population. However, the effect of this variation may be small, as there are many genes involved and the variation for a trait determined by many genes with additive effects decreases with the number of genes involved. Unlike the single-mutant hypothesis, there is no reason in this case to assume that the mutations involved were purely dominant. It is unlikely that they were recessive, because recessive mutations of any size are unlikely to spread. However, they may have been semidominant, in which homozygotes have a higher fitness than heterozygotes, which results in higher fixation probabilities and somewhat lower fixation times.

The mutations involved in the gradual scenario may have had positive frequency dependence, because this hypothesis, unlike the Chomskyan view, does not rule out a selective advantage conferred by communication. Positive frequency dependence leads to lower fixation probabilities and longer fixation times, as initially the mutation may not have an influence on fitness. This is quantitatively investigated in section 2.2.4.

Fixation probability and time

Our analysis makes use of the diffusion method developed by Kimura9,10. This approach models the proportion of mutants in a population as a continuous value (in reality the proportion is discrete, because the population consists of a finite number of discrete individuals) and it models evolution of the number of mutants as a diffusion process. This approach is flexible enough to model individuals as haploid or diploid, and in the case of diploid individuals it can model different fitness values for heterozygous and homozygous mutants.

General approach

The diffusion approach only considers the evolution of the proportion p of mutant alleles A, with the wild-type allele denoted as a. Following Kimura17, the evolution of a finite population of diploid individuals can be approximately described by a partial differential equation:

$$\frac{\partial u(p,t)}{\partial t}={M}_{\delta p}(p)\frac{\partial u(p,t)}{\partial p}+\frac{1}{2}{V}_{\delta p}(p)\frac{{\partial }^{2}u(p,t)}{\partial {p}^{2}}$$ (1)

(the Fokker-Planck equation, see Supplement S0 for details) featuring the probability u(p,t) of the mutant allele becoming fixed at time t, the mean change \({M}_{\delta p}(p)\) in proportion p per generation, and its variance \({V}_{\delta p}(p)\). Assuming that there is no more mutation after the initial mutation that generates mutant allele A, the mutants will either disappear or take over the population with probability \(u(p,\infty )\) . (depending on the boundary conditions, this can represent either) and as time tends towards large values, the derivative to time of u tends to zero. We can then impose these conditions of disappearance or takeover in the partial differential equation (see Supplement S0). To find a solution to the partial differential equation, we need approximations to the mean change \({M}_{\delta p}(p)\) in proportion p per generation, and its variance \({V}_{\delta p}(p)\).

Mean and variance terms

There are different ways of approximating the mean and variance for a given population, and these result in different expressions for \({M}_{\delta p}(p)\) and \({V}_{\delta p}(p)\). The simplest way is to model the population as if it were haploid (see Supplementary Section S1 for a demonstration that this way is also the one that most closely approximates Monte Carlo simulations). The reason that this works even for diploid organisms is that initially, homozygous mutants will be very rare (in a randomly mixing population of sufficient size). Therefore, the mutant allele may become so frequent before homozygotes play a role in its evolution, that it is highly likely that the mutant allele reaches fixation. We can therefore ignore the existence of heterozygotes. The mean change and its variance (see Supplementary Section S2 for a derivation) are then:

$${M}_{\delta p}(p)=p(1-p)\frac{{s}_{aA}}{1+{s}_{aA}\cdot p}$$ (2)

$${V}_{\delta p}(p)=\frac{p(1-p)}{2N}\frac{2+{s}_{aA}}{1+{s}_{aA}\cdot p}$$ (3)

where s aA is the selection coefficient of the heterozygote mutant (i.e. assuming the fitness of the wild type is 1, the fitness of the heterozygote mutant is 1 + s aA ) and N is the population size. Assuming a haploid population is mathematically equivalent to assuming a diploid population with s AA = 2s aA , (this is also called the case of no dominance or semidominance).

This simple approximation also yields a closed form solution for the fixation probability described by Eq. (1):

$${p}_{fix}({p}_{0})=u({p}_{0},\infty )=\frac{1-{e}^{-4N\frac{{s}_{aA}}{2+{s}_{aA}}{p}_{0}}}{1-{e}^{-4N\frac{{s}_{aA}}{2+{s}_{aA}}}}$$ (4)

where p 0 is the initial frequency of mutant alleles (which is 1/2 N when starting with a single heterozygous mutant).

Fixation time

In order to estimate the time it takes for a mutant to reach fixation in a population (starting with a given proportion of mutants) the Fokker-Planck Eq. (1) can be adapted. Setting it equal to minus the fixation probability (instead of 0) results in an equation for the conditional fixation time \(\vartheta (p)\), i.e. the product of the probability of reaching fixation multiplied by the time needed to reach fixation, starting with a proportion p of mutants (using equations XII.3.4a and XII.3.4b from van Kampen18):

$$-{p}_{fix}(p)={M}_{\delta p}(p)\frac{\partial \vartheta (p,t)}{\partial p}+\frac{1}{2}{V}_{\delta p}(p)\frac{{\partial }^{2}\vartheta (p,t)}{\partial {p}^{2}}$$ (5)

and using boundary conditions:

$$\vartheta (0)=\vartheta (1)=0$$ (6)

The unconditional fixation time \(\tau (p)\) can then be calculated by dividing the conditional fixation time by the fixation probability. Although ignoring the fact that the population consists of heterozygotes may work well for calculating fixation probabilities, it is not expected to work well for calculating fixation times. This is because it is assumed in our analysis that homozygote mutants have no advantage over heterozygote mutants, while approximating the population as haploid implicitly assumes that homozygotes have a selection coefficient that is twice that of heterozygotes. For this reason Kimura’s17 approximation is used:

$${M}_{\delta p}(p)=p(1-p)({s}_{aA}+({s}_{AA}-2{s}_{aA})p)$$ (7)

$${V}_{\delta p}(p)=\frac{p(1-p)}{2{N}_{e}}$$ (8)

where N e is the effective population size, i.e. the size of a randomly mating population that behaves the same as the actual population (which may not have random mating). These appear to give satisfactory values compared to the Monte-Carlo simulation (see Supplementary Section S1).

Solving the resulting equations numerically, we obtain fixation times, expressed in generations shown in Fig. 3. It shows that for small values of N∙s, fixation time is approximately 4 N, which corresponds to the theoretical value for drift19. For larger values, fixation time drops rapidly.

Figure 3 Unconditional mean fixation times for different selection coefficients and population sizes (in individuals, so the number of alleles is twice the number). Note that the missing points in the graph could not be calculated because of numerical limits. Generations can be converted in years by multiplying with generation time, which is from 25–30 years (20). Full size image

Frequency dependent fitness

When fitness is (positively) frequency dependent, fixation probabilities are expected to become smaller, as initially the spread of mutants will only be due to drift; if the mutation is rare, individuals that have it will have little or no advantage of it. A simple model for frequency dependent fitness is to make the fitness of a mutant individual linearly dependent on the frequency p of mutants in the population, with one linear coefficient for the heterozygote and another for the homozygote (see Supplement S3 for details). These frequency dependent selection coefficients can be substituted in the equations for the mean change and the variance in (2) and (3) and the fixation probabilities (and fixation times) can be estimated.

As can be seen in Fig. 4 the fixation probabilities become dramatically lower in the case of frequency dependent selection, even for high values of the slope of the selection coefficients (implying high values of the selection coefficients when mutant frequencies become sufficiently high).

Figure 4 Comparison of the fixation probabilities of the semidominant case and the case where fitness has positive (linear) frequency dependence. To be as close as possible to the semidominant case, the slope of homozygote fixation is set as twice as that of the heterozygote fixation. Note that the vertical axis (of fixation probabilities) is logarithmic, and that therefore the fixation probabilities in the frequency dependent case are much lower than in the semidominant case. Full size image

Time and probability conclusion

Fixation probability can be modeled with satisfactory accuracy (given that our model parameters are approximations anyway) by using the haploid/semidominant approximation. Fixation time does need to take into account whether the mutation under investigation is dominant, recessive or otherwise. However, assuming the proposed mutation for Merge was dominant and conveyed a large fitness advantage, neither fixation probability nor fixation time are obstacles to the hypothesis. The many mutations with small fitness effects proposed by the gradual scenario would take longer to reach fixation and would be less likely to do so. However, as we already remarked above, multiple mutations of this kind could spread at the same time, and importantly, it is less problematic for this scenario if one mutation does not reach fixation, because it is expected that similar mutations will recur, while the proposed macromutation for Merge will not.

To revisit the probabilistic interpretation we laid out earlier, an appropriate summary of the preceding findings is that as far as fixation probabilities go, both hypotheses (single-mutant and gradual evolution) assign a realistically high conditional probability to the observation that language is universal in our species (i.e. the mutations have spread to fixation or near fixation), each for different reasons. In other words, consideration of fixation dynamics does not decisively distinguish between these two competing hypotheses; both are plausible routes to fixation in populations of approximately the hypothesized size and time depth, conditional on the emergence of the relevant mutation(s) in the first place. As a result, evolutionary support for these hypotheses turns exclusively on consideration of their a priori probability.

Probability of mutations

Orr offers an appropriate theoretical result for the distribution of the size of beneficial mutations, based on extreme value theory8, the theory that considers the distribution of the extreme values of a large random sample. Orr assumes that there are a potentially large number of variants of the genome, all with fitness effects that have been drawn from a distribution that fulfills reasonable assumptions. More specifically, the distribution has to be of Gumbel type 1 (or exponential type), a class of distributions that encompasses most everyday distributions with infinite support, such as the Normal (and log Normal) distributions, the Weibul distribution and the exponential distribution. He further assumes that the wild type already has relatively high fitness compared to all possible variants, because it already is the result of many generations of selection (in Orr’s view fitness is already high because organisms are already highly complex, but it is also compatible with Berwick and Chomsky’s view that the mutation for Merge occurred in creatures that already had a fairly rich cognitive repertoire). There are therefore only comparatively few variants that would result in higher fitness. In this case it can be shown (using extreme value theory) that the fitness of beneficial mutants follows an exponential distribution, and that the parameter of this distribution is equal to the reciprocal of the expected value of the fitness difference between the best and second best mutants. This leads to the following expression for the probability p of a fitness effect:

$$p(s)=\frac{1}{E[{\Delta }_{1}]}{e}^{-\frac{1}{E[{\Delta }_{1}]}s}$$ (9)

where s is the increase in fitness, which is equivalent to the selection coefficient used in section 2.2 and \(E[{\Delta }_{1}]\) the expected value of the difference in fitness between the best and the second best mutations. Experimental evidence is in general agreement with the predictions implied by this model20 (but see Supplement S4 for a detailed discussion of its pros and cons).

In order to use the exponential distribution to estimate the probability of an evolutionary scenario, it is necessary to have an estimate of the parameter Δ 1 . For this it is necessary to have data about the fitness effects of beneficial mutations, and in particular data about the distribution of such mutations when they occur, as opposed to after they have gone to fixation (which would conflate their probability of occurrence with the probability of going to fixation). For practical reasons, data of this kind are hard to obtain, and only exist for very rapidly evolving organisms such as viruses. A small data set is presented in Sanjuán et al.21 Fitting an exponential distribution to their datapoints of “random” mutations, which are the ones needed to estimate Δ 1 , a value of \(E[{\Delta }_{1}]\) of between 1/53 (for maximum likelihood estimation) to 1/37 (for least sum of squares estimation) is found. The data, the interpolation and the theoretical prediction are given in Fig. 5.

Figure 5 Fitness advantage of viruses from Sanjuán et al.‘s (23) sample. The interpolated curve (minimal sum of squares fit) on the random data points (the light blue dashed line and the squares) results in a value of E[Δ 1 ] of 1/37. Based on this value, the theoretically predicted curve (the dark blue dashed line and the triangles) for the preobserved mutations (occurrence and going to fixation) fits the data well. For these curves a large effective population size was assumed, so the term with N in Eq. (10) was ignored. Full size image

Given the small sample, and given that it is unknown how well the data for viruses map on data for humans, there is considerable uncertainty in this estimate. Because lower values correspond to lower probability of large mutations, and in order to prevent an unjustified bias against large mutations, an estimate of 1/30 for the parameter of the exponential distribution is a reasonable approximation.

Probability of the two scenarios

In order to compare the two scenarios, it is necessary to calculate the probability of a single mutation occurring and going to fixation. For clarity’s sake the term improvement is introduced for the event of a beneficial mutation occurring and going to fixation. The magnitude of the improvement is defined to be equal to the fitness advantage of the mutation. Finally, the probabilities of single improvements must be combined in order to calculate how many improvements are needed in order to obtain the desired trait.

The probability of a single improvement

The probability of a single improvement is the product of a mutation of a given fitness advantage occurring Eq. (9), and that mutation going to fixation. It has been shown that Eq. (4) gives a satisfactory approximation of the latter probability. Unfortunately, as far as we know there is no closed-form solution for the integral of the product of (6) and (12) (which would give the cumulative distribution function). However, there is a very good closed-form approximation (see Supplementary Material S5), resulting in the following cumulative distribution function of a single improvement:

$${\rm{CDF}}(s)=1-\frac{{e}^{-\alpha s}}{\beta }[\frac{1}{\alpha }-\frac{1}{2+\alpha }{e}^{-2s}+\frac{1}{N(N+\alpha +1)}{e}^{-(N+1)s}]$$ (10)

where and β is a scaling factor. That this equation is a good approximation is illustrated by the fit to the experimental observations illustrated with the dark blue line in Fig. 5.

The expected number of improvements necessary

The distributions of improvements can be used to estimate how many improvements are necessary to achieve a given total improvement m and what the maximum size of improvements was. The (mean) number of improvements necessary is not equal to m divided by the mean size of an improvement, but somewhat larger: because improvements come in discrete steps, the total improvement achieved will in general exceed m by a little bit. Therefore it is easier to simulate this with Monte-Carlo techniques. This is implemented by accumulating random improvements until they exceed the threshold m. For m = 1, α = 30 and N = 1500, the histograms of the number of improvements and the maximum size of improvements in each string of improvements are given in Fig. 6.

Figure 6 Histograms of the number of improvements needed to reach a total improvement of 1 (a) and the maximal step size that occurred in each run (b), Parameters were α = 30 and N = 1500. Full size image

The following observations can be made: first, the number of necessary steps is larger than 1. Even for a modest (total) improvement (e.g. 1), it is exceedingly unlikely that an improvement of this magnitude would occur in a single step. Second, relatively large improvements (i.e. with a fitness effect between 0.1 and 0.4) are common. For almost all simulations, improvements with effects of this size occurred. This is in line with the observation by Eyre-Walker and Keightley that “[…] large-effect mutations seem to be those that contribute most to adaptation.”20, p. 614. Even though beneficial mutations of large effect are rare, they are more likely to go to fixation. This is related to our final observation, that the total number of improvements involved is generally not very large. In summary: a single mutation event is ruled out in probabilistic terms. Instead, our analysis favors an evolutionary scenario with relatively few, but relatively important events that unfold gradually (see Supplementary Material Section S6 for the precise relation between the number of mutations required and parameters m and α).