Abstract Whole-exome or gene targeted resequencing in hundreds to thousands of individuals has shown that the majority of genetic variants are at low frequency in human populations. Rare variants are enriched for functional mutations and are expected to explain an important fraction of the genetic etiology of human disease, therefore having a potential medical interest. In this work, we analyze the whole-exome sequences of French-Canadian individuals, a founder population with a unique demographic history that includes an original population bottleneck less than 20 generations ago, followed by a demographic explosion, and the whole exomes of French individuals sampled from France. We show that in less than 20 generations of genetic isolation from the French population, the genetic pool of French-Canadians shows reduced levels of diversity, higher homozygosity, and an excess of rare variants with low variant sharing with Europeans. Furthermore, the French-Canadian population contains a larger proportion of putatively damaging functional variants, which could partially explain the increased incidence of genetic disease in the province. Our results highlight the impact of population demography on genetic fitness and the contribution of rare variants to the human genetic variation landscape, emphasizing the need for deep cataloguing of genetic variants by resequencing worldwide human populations in order to truly assess disease risk.

Author Summary Recent resequencing of the whole genome or the coding part of the genome (the exome) in thousands of individuals has described a large excess of low frequency variants in humans, probably arising as a consequence of recent rapid growth in human population sizes. Most rare variants are private to specific populations and are enriched for functional mutations, thus potentially having some medical relevance. In this study, we analyze whole-exome sequences from over a hundred individuals from the French-Canadian population, which was founded less than 400 years ago by about 8,500 French settlers who colonized the province between the 17th and 18th centuries. We show that in a remarkably short period of time this population has accumulated substantial differences, including an excess of rare, functional and potentially damaging variants, when compared to the original European population. Our results show the effects of population history on genetic variation that may have an impact on genetic fitness and disease, and have implications in the design of genetic studies, highlighting the importance of extending deep resequencing to worldwide human populations.

Citation: Casals F, Hodgkinson A, Hussin J, Idaghdour Y, Bruat V, de Maillard T, et al. (2013) Whole-Exome Sequencing Reveals a Rapid Change in the Frequency of Rare Functional Variants in a Founding Population of Humans. PLoS Genet 9(9): e1003815. https://doi.org/10.1371/journal.pgen.1003815 Editor: Scott M. Williams, Dartmouth College, United States of America Received: January 10, 2013; Accepted: August 8, 2013; Published: September 26, 2013 Copyright: © 2013 Casals et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Funding: This study was supported by Canadian Foundation for Innovation, FRSQ and Genome Quebec grants to PA, and by research funds provided by the Terry Fox Foundation and the Canadian Institutes for Health Research. EH is a scholar of the Fonds de la Recherche en Santé du Québec. AH is supported by a Fonds de la Recherche en Santé du Québec (FRSQ) fellowship. JFS is the recipient of a Cole Foundation scholarship. DS holds the François-Karl-Viau Research Chair in Pediatric Oncogenomics. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. Competing interests: The authors have declared that no competing interests exist.

Introduction Genetic variation in humans is a result of stochastic processes, selection and demographic history [1]. Modern humans show a reduced level of differentiation due to recent population dispersion less than 100,000 years ago, and differences between populations are thought to account for little more than 15% of all genetic variation across individuals [2]. However, this picture is based on the allele frequency differences of common and shared variants between populations, representing only a small fraction of the total number of variants. Recently, much effort has been put into the description of the total variation landscape in human populations by resequencing hundreds to thousands of individuals from the same population at particular loci or for complete exomes [3]–[8]. Additionally, the 1000 Genomes Project has characterized the complete genomic sequences of more than one thousand humans covering worldwide diversity [9], [10]. Two important conclusions have arisen from studies deeply characterizing the allele frequency spectrum in human populations. First, the high number of low frequency variants is likely only explainable by models of recent demographic explosion [3]–[8]. Furthermore, low frequency variants are enriched for functional variants, particularly for nucleotide changes that affect protein function, and are therefore putatively more related to disease [3]–[8], [11]. Second, most rare variants are private or show very little sharing among continents [7], [8], [12], [13]. This may be particularly important in terms of genetic fitness, since rare variants are enriched for deleterious alleles. However, until now differences in the relative amount of detrimental variants have only been shown over relatively large timescales by comparing African and European populations [14], [15]. Furthermore, these findings predict a lack of replication in association studies using rare functional variants across populations, since rare variants can show higher levels of stratification [16], thus emphasizing the need of population-specific catalogues of genetic variation [12]. In this work, we analyze whole-exome sequence data from French-Canadian individuals, comparing various population level statistics to those for French and European populations, which allow us to make inferences about the fitness of a population with a unique demographic history. The current population of six million French-Canadians in Quebec are descendants of about 8,500 French settlers who colonized the province between 1608 and 1759, before the English conquest [17], [18]. Although colonization included emigrants from all of France, the migration event mostly originated from the Atlantic coast and Paris region. After 1760, French immigration virtually stopped, and the French-Canadian population experienced rapid growth due to a high birth rate, and became genetically isolated from France with limited exchange with other non-French communities in the same geographical area [19]. Overall, French-Canadians have experienced a growth from 8,500 to six million individuals, which represents a population expansion of more 700% in less than 20 generations. While other colonized territories in America or Oceania may have experienced a similar growth, the uniqueness of the French-Canadian population is due in part to the reduced contribution of new immigration after the first settlers [20] and the founding population is estimated to have contributed 90% of the current French-Canadian genetic pool [21]. In addition, during the 19th century new territories were colonized by a reduced number of settlers, contributing massively to the genetic pool in these regions, giving place to several regional founder effects. This particular component of the demographic history of the French-Canadian population has resulted in a geographic heterogeneity of genetic diseases in Quebec, with more than twenty Mendelian diseases occurring at unexpectedly high frequencies in some areas of the province [19], [22]. Here, we specifically test the theory that deleterious mutations accumulate and/or persist in a population that has undergone a demographic bottleneck and rapid expansion in a short period of time, potentially as a consequence of reduced selection, using the French-Canadian population of Quebec. It has been argued that colonists at the forefront of expansions have a fitness advantage [23]. Here we show that if this is the case, then this short-term fitness advantage may come at an overall long-term cost. We also aim to describe how this complex demographic scenario has shaped the genetic variation in a modern population; as of yet, no study has described how the original genetic bottleneck and subsequent population expansion have affected the full-spectrum of genetic variation among French-Canadians.

Discussion Recent deep resequencing of human populations has highlighted an accumulation of rare variants above that expected under Wright-Fisher models [3]–[8]. Using exome resequencing data from over a hundred French-Canadian individuals, we show that a human founding population that has undergone rapid expansion contains an excess of private and rare variants compared to the French population after a colonization event less than 20 generations ago. Genetic variants in French-Canadians tend to be putatively more deleterious than those in the French. On the population level, evidence for this comes from the fact that mutations in the French-Canadian population tend to occur at functional sites with higher conservation scores and/or sites predicted to be damaging, are located preferentially at missense sites, and have higher missense to synonymous ratios than in French and European populations. Furthermore, at the individual level, this potentially translates into an increased genetic burden, since although French-Canadians carry a similar number of derived alleles as the French, these alleles tend to occur at more putatively damaging sites, as indicated by alleles in French-Canadians occurring at sites that on average have higher GERP scores (Figure 1C). Furthermore, since the French-Canadian population shows lower levels of heterozygosity (and thus higher levels of homozygosity), this may have implications for disease susceptibility. It is known that the incidence of around twenty Mendelian diseases is higher in Quebec [19], [22] and some hereditary diseases show a particular pattern in the French-Canadian population, with local enrichments within particular geographical areas originated by regional founder effects [19], [22], [32]. Although it is difficult to translate our results into specific population genetic risk estimates, it may be possible that the increase of rare deleterious variants and reduced heterozygosity in the French-Canadian population is leading to higher disease risk. Rare alleles that were present at damaging sites in the original population may subsequently have been removed in the French population, yet still persist in French-Canadian individuals due to sampling effects, smaller population sizes, less competition and a higher birth rate. Although this seems unlikely to impact upon diseases caused by recessive variants in homozygous form, damaging mutations that have arisen since the founder event may be dominant or serve as the second, and ultimately vital, mutation within an important gene under a compound heterozygous model of Mendelian disorders. Furthermore, we find some evidence that higher frequency variants (MAF>10%) are on average more damaging in the French-Canadian population when compared to the French, since they tend to have higher GERP scores (see Results), which may impact upon the incidence of Mendelian diseases under a homozygous recessive model. It has previously been shown that there is proportionally more deleterious variation in European populations after the out of Africa expansion [14], [15]. However, this process occurred over a much longer timeframe and also relies on a long bottleneck to explain the increase in deleterious variants in Europeans [14], [15]. In French-Canadians we observe a similar increase of rare deleterious variants but over a markedly short time frame. Furthermore, since the French-Canadian population did not undergo a long population bottleneck, the excess of deleterious variants could be explained by a rapid expansion of the population as well as other demographic factors such as subsequent regional founder effects in Quebec. To test this we performed a number of forward simulations incorporating selection and the demographic history of Europe, as inferred in a recent study [14], followed by a simple population bottleneck and rapid expansion in the French-Canadian population, and a less extreme expansion in the French population (for details, see Materials and Methods). We modeled population bottlenecks of varying sizes, performing 100 replicates for each scenario, and then calculated the difference in the proportion of variants with MAF<5% between the French and French-Canadian populations (Table S6). The scenarios modeled likely represent a simplified version of the actual demographic history of the French and French-Canadian populations, however we use them here to test differences between populations undergoing different rates of expansion under selective constraint after sharing a large proportion of demographic history. Under these models, the largest increase in rare variants in French-Canadians occurs when the population did not undergo a bottleneck, showing differences as large as 5.23% across the 100 replicates, with an average shift of 1.09%. Furthermore, we also observe on average an additional 8.32 deleterious alleles per megabase (defined as having a negative selection coefficient) per replicate segregating in the French-Canadian population compared to the French population. For simulations including a bottleneck, the biggest increase of rare variants in the French-Canadian population occurs for a bottleneck of 75%, with differences as large as 5.74% across the 100 replicates, an average increase of 0.74% and an additional 5.86 deleterious alleles per megabase per replicate in the French-Canadian population. Although these simple models lead to an increase in the proportion of rare variants in the French-Canadian population, the shift observed in the empirical data, which shows an increase of 9.8% of variants with MAF<5% in the French-Canadian population compared to the French population when using the same sample sizes (see above and Figure S1), is larger than that generated by simulations; there are several possible explanations for this. First, it may be that current tools are not able to accurately model recent events such as a rapid population expansion. Second, it is likely that a more complex demographic scenario is needed to explain the size of the increase in rare variants in the French-Canadian population, that may also include changes in selective forces as a consequence of the reduced competition occurring between a small number of founders. In fact, the French-Canadian population is genetically stratified into subpopulations with differentiated demographic histories [19], [21], [22]. Independent settlements and expansions with partially reduced genetic exchange across subpopulations, unequal contribution to the current genetic pool, as well as some admixture with other populations could have also contributed to the shift in the site frequency spectrum. Consistent with these notions, a recent study focusing on a specific sub-founding population within Quebec presented evidence that individuals on the wave front of colonization events have a heritable advantage and a higher contribution to the current genetic pool [23]. In this study, we have not focused on specific regions within the population and have not tested this observation. However, our results do demonstrate that the recent founding event and subsequent colonization events may have had a substantial deleterious impact across genomes. To a lesser extent, rare variants could also arise from the inclusion of founders from different regions in France or other European countries, which could be also related to the level of genetic diversity in Quebec, similar to that reported for European populations [21]. Similarly, the unequal sex ratio of the Quebec settlers of more than ten times more men than women [18], may also have contributed to a shift in the effective population size and loss of heterozygosity. Finally, although there is evidence of a population bottleneck in the French-Canadian population, such as reduced levels of heterozygosity, given the results of our simulations it seems unlikely that the bottleneck was particularly strong. In this study, we show that even in the case of two very close populations that are separated by only 400 years approximately, the differences in the landscape of genetic variation can be substantial under particular demographic conditions. Rare variants are presumed to explain some of the missing heritability not accounted for by common variants in genome wide association analyses for complex disorders [33] as well as most of the rare diseases. Furthermore, there is mounting evidence that coding rare variants are contributing to complex traits [34]. The high number of population private rare functional variants described in this study constitutes a challenge for genetic association studies, affecting the replicability and correlation of genetic risk factors across human populations. Indeed, even from a relatively limited number of French-Canadian chromosomes, we discovered a substantial number of missense mutations that are not found on the widely used Illumina exome-arrays built from SNPs ascertained across a number of major sequencing studies. One third of the missense SNPs we discovered from sequencing over one hundred exomes are not found on these arrays, variation that likely influences complex traits and disease phenotypes, but is missing from analysis of disease risk. Although we understand from population genetics that most variants will be rare, this observation speaks to the need for continued sequencing of isolated or semi-isolated populations. Beyond the particular case of the French-Canadian population, this study highlights the importance of local demographic events in shaping genetic variation, and the need for creating population-based catalogues of human genetic variation [12].

Materials and Methods Ethics statement This research has been approved by the CHU Sainte-Justine's ethical committee. Data was analyzed anonymously. Samples One hundred and fourteen French-Canadians were selected for sequencing. French-Canadian samples are the healthy parents of four disease cohorts (primary immunodeficiencies, acute lymphoblastic leukemia, schizophrenia and autistic spectrum disorder) recruited at the Sainte-Justine Hospital (Montreal). Additionally, sequences from 30 French samples previously analyzed were included in the study [24]. We used principal component analysis to identify and remove the genetic outliers (see below). Exome capture and sequencing Exome capture was performed with the SureSelect Target Enrichment System from Agilent Technologies optimized for Applied Biosystems SOLiD sequencing, using the Agilent SureSelect All Exome Kit (38 Mb) and the Human All Exon 50 Mb kit covering exons annotated in the consensus CCDS [35]. Analyses were performed considering the coding regions targeted in the Agilent SureSelect All Exome Kit (38 Mb). Briefly, 3–5 µg of DNA were sheared by sonication, 5′ ends repaired, and the resulting fragments were ligated to adaptors, which were then run in size-select gels to select fragments of 150–250 bp in size. The extracted DNA was amplified by PCR and hybridized to the capture library containing the human exome. Hybridization was performed in a solution at 65°C for a minimum of 24 hours, followed by washing and capture of the hybridized DNA through magnetic bead selection, PCR and purification. Quantification of DNA libraries was performed using a Bioanalyzer and qPCR instrument. Exome sequencing was performed using SOLiD 3 Plus and SOLiD 4 Systems (Applied Biosystems), following the manufacturer's recommended protocols. Sequence reads were aligned to the human genome reference sequence (hg18, downloaded from http://genome.ucsc.edu) with BioScope, the available mapping tool for the SOLiD technology. GATK recalibration [36] was applied after mapping, PCR duplicates removed with Picard (http://picard.sourceforge.net) and SNP calling was performed using Samtools [37]. In total, 61 gigabases of sequencing reads mapped to the reference genome, with an average of 86% of the targeted regions being covered by at least one sequencing read. Each individual had an average coverage of 17-fold (see supplementary material, Table S5). SNP annotation was performed using the SeattleSeq Annotation tool (http://gvs.gs.washington.edu/SeattleSeqAnnotation/). Variants from the French population were generated from exome sequencing of the same targeted exons using Illumina sequencing [24]. High quality dataset Stringent variant calling criteria were applied to produce a high quality dataset of both the French and French-Canadian populations, including only variants that satisfy all of the following conditions: (i) fall within the regions targeted by the Agilent SureSelect exome capture kit, (ii) with SNP consensus or variant quality of 30 or higher, (iii) with sequence coverage of 10-fold depth or greater and (iv) in Hardy-Weinberg equilibrium (using a stringent p-value of 0.001). Furthermore, variants were included only if these criteria were met in at least 20 individuals in both the French and French-Canadian populations. The average transition/transversion ratio for all the French and French-Canadian samples in the coding variants was 3.32, as expected for exonic sequences [38] and we detected no significant difference between French and French-Canadian samples (3.38 and 3.30, respectively). Similarly, frequencies of the twelve possible nucleotide changes are similar between the two populations (Figure S8). For the resampling analyses, we randomly choose thirty individuals from the French-Canadian population and applied the same filters as above. Population structure analysis In order to use the most genetically homogeneous group of individuals in each population we performed principal component analysis (PCA) for each population sample using SmartPCA as implemented in the program eigenstrat [39]. First, PCA was performed within each population including variants called in at least 80% of the individuals in each population to avoid the effects of missing values; these variants totaled 13,035 positions for the French-Canadian population and 26,843 for the French population. Significant PCs were inferred using the TW-statistic (p-value<0.01) and outlier individuals were identified based on their individual loading exceeding two standard deviations from the mean of each significant axis. This analysis revealed five outlier individuals in the French-Canadian population and none for the French samples (Figure S9). Removing outlier individuals based on population structure analysis of each population separately resulted in the retention of 109 French-Canadian and 30 French individuals for subsequent analyses. Next, we performed PCA combining both populations, including only positions called in at least 80% of the combined samples, and only individuals with missing data less than 1%. This represented a total of 4,588 SNPs in 89 samples. We find no obvious differences between the two populations (Figure S9), although the French-Canadian population seems to show a slightly lower level of diversity and represents only a subset of the total genetic variation in the French population. The joint frequency spectrum of genetic variation was represented using the δaδi software [40]. SNP validation We performed a number of validation procedures to check the quality of our data. First, we performed Sanger sequencing on a total of 113 heterozygous calls detected in the individuals included in this study (89% of the 97 variants have MAF<5% and 54% were singletons). In total we confirmed 109 calls, giving a false positive rate of 3.5%. This figure probably represents an upper bound, since the variants selected for validation are enriched for rare variants which are known to be more prone to sequencing errors [41]. Second, we sequenced the offspring of 16 individuals from the French-Canadian population, using the same protocols and filtering steps as in the parents, in order to confirm the presence of certain alleles in the population. Thus, to check the false positive rate for variants that are likely to contain the most errors (singletons), we isolated any positions in the parents that were singletons in our population and then checked to see if the variant is called in the child, only including the position if the same quality filters were met in the offspring (variant quality>30, coverage>10). Under normal patterns of Mendelian inheritance we expect 50% of singletons to be inherited by the child. Overall, we observe 4,666 singletons across the 16 individuals, 2,328 of which are present in the offspring (49.89%), representing a false positive rate for singletons of ∼0.2%. Third, we also tested the quality of our data by comparing DNA and RNA sequences for three French-Canadian individuals using the same high quality filtering criteria in both datasets (consensus or variant quality greater than 30, coverage greater than 10). For RNA sequencing, RNA was enzymatically fragmented, and cDNA generated by reverse transcription from adaptors ligated to ends of the RNA molecule. Then, the cDNA was amplified using primers complementary to adaptors and purified. Sequencing was performed in a single SOLiD slide containing barcoded samples. Sequence reads were aligned to the human genome reference sequence (hg18, downloaded from http://genome.ucsc.edu) with SOLiD's BioScope mapping tool. Recalibration was performed with GATK [36], and PCR duplicates were removed with Picard (http://picard.sourceforge.net). SNP calling was performed using Samtools [37]. As differences may exist between DNA and RNA as a consequence of RNA editing [42]–[45] and allelic expression [46], for positions that are heterozygous in DNA, we considered a site as successfully validated if at least one read was present in RNA for both alleles; we confirm 474 of 506 sites. Since it is known that approximately 28% of genes show greater than a 4-fold difference in the expression of two alleles in RNA [46], it is likely that some differences between DNA and RNA are driven by allelic specific expression. Indeed, 5 out of the 32 sites that fail validation in one individual show evidence for being heterozygote (displaying at least one read from each allele) in the RNA of at least one of the other two individuals that were sequenced. Differences between DNA and RNA at heterozygous sites are not significantly enriched for rare variants; only 5 out of 32 sites that fail validation have MAF<5% (variants with MAF<5%, 5/66 not validated, p = 0.92). Furthermore, we also considered sites that contained homozygous non-reference alleles in DNA sequences and then checked the corresponding position in RNA. All 242 positions were validated, further confirming the quality of the data. Finally, to consider the quality of common variants, we compared the genotype frequencies at polymorphic sites obtained from our exome sequencing that overlapped with data from 521 French-Canadian individuals that were genotyped on Illumina's Omni 2.5M arrays. In each case we compared the number of homozygous reference, homozygous alternative and heterozygous calls in our exome data with the same number of randomly sampled individuals from the chip data. In total, 23,231 sites were overlapping, 99.94% of which were not significantly different between exome sequencing and array data (p>0.05, after Bonferroni correction). Selection coefficient and fitness effect analysis To estimate the strength of purifying selection in the French and French-Canadian populations we applied two methods. First, we used prfreq, a program that uses Poisson random fields [14] to estimate the maximum likelihood values for different scenarios given an observed site frequency spectrum (SFS). For the French and French-Canadian populations, we projected the SFS down to 60 alleles by randomly sampling individuals from the French-Canadian population and including only sites with 0% missing data. The ancestral allele was inferred from the homologous chimpanzee sequence obtained from Seattleseq annotation (http://gvs.gs.washington.edu/SeattleSeqAnnotation/) and since mutation rates vary across the genome as a function of neighbouring nucleotides [47], we corrected for the uncertainty of the ancestral sequence following the method of Hernandez et al [48]. Maximum likelihood values for each scenario were obtained with a multinomial calculation that estimates the probability of each SNP segregating at a given derived allele frequency. P-values associated with various demographic and selective models were estimated using likelihood-ratio tests. Demographic parameters were inferred from the site frequency spectrum of synonymous variants comparing three scenarios: a stationary population, contraction/expansion, and a population bottleneck and expansion (Tables S2 and S3). Finally, the selective parameters were obtained by comparing the likelihood of the missense SFS using the demographic model inferred from synonymous variants (see above) to the likelihood for the same demographic model incorporating a selection parameter (γ = 2Ne(s)). To compare the γ values estimated in the French and French-Canadian populations we compared the likelihoods estimated in each case with the likelihoods computed using the γ values from the other population. Second, to calculate the distribution of fitness effects associated with mutations occurring in the French and French-Canadian populations we used the DFE-alpha software [31] (http://homepages.ed.ac.uk/eang33/). To construct the unfolded site frequency spectrums for the two populations we included variants and sites in the targeted region in which at least 30 and 90 individuals passed the high quality filters for the French and French-Canadian populations respectively. These numbers were chosen to reduce the amount of missing data at each site, whilst retaining the majority of polymorphic sites for analysis. We then counted the number of sites that had zero to 180 derived alleles in the French-Canadian population, where derived alleles represent sites that have diverged from chimpanzee. The same approach was applied for the French population using 60 chromosomes. For the French-Canadian population, ninety individuals were sampled randomly without replacement at sites where the number of alleles passing quality filters exceeded 180. Derived alleles were inferred from chimpanzee sequences and human and chimpanzee pairwise alignments were downloaded from the UCSC website (http://hgdownload.cse.ucsc.edu/downloads). As in the original DFE analysis [31], intronic sites served as the neutral standard, the distribution of fitness effects was fit to zero-fold degenerate sites and any sites that were part of a CpG dinucleotide were removed. Confidence intervals were generated by bootstrapping; sites were selected randomly across the site frequency spectrum with replacement to generate 100 new datasets for each population. Replication datasets To replicate the major findings of this study we analyzed data from a cohort of fifty French-Canadian individuals sequenced on the Illumina platform representing the unaffected parents from different disease projects (developmental delay and fetal malformations). Exomes were captured from 3 µg of blood genomic DNA, using the Agilent SureSelect Human All Exon Capture kit (V3 and V4; Mississauga, ON), and sequenced paired end using the Illumina Hi2000seq technology. Raw sequencing data was processed using the same pipeline and filtering process as described above, including only those sites that are sequenced in all datasets. PCA was performed as before, taking SNPs with MAF>5% and missing data<5% - zero outliers were removed (Figure S10). For the CEU population, we obtained BAM files for 35 individuals from the 1000 Genomes Project ftp site (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/) sequenced on the SOLiD platform and applied the same pipeline and filters as detailed above. Simulations To test for an increase in rare variants in the French-Canadian population we simulated a number of demographic scenarios under selection using the forward simulator SFS_code [49]. First, we implemented timing and population size scaling for the European demographic history, as detailed in the SFS_code documentation (http://sfscode.sourceforge.net/SFS_CODE_doc.pdf, figure 2, model taken from [14]). This model includes an initial burn-in period with a population size of 7,895, followed by a bottleneck at time zero to a population size of 5,699. Following this, the population remains at constant size for 7,703 generations before an instantaneous growth to 30,030, which remains for a further 874 generations. We scaled this model using an ancestral population size of 1,000. Then, we simulated a population split and a bottleneck of 50%, 75% and 100% (no bottleneck) for one of the populations to represent the founding of Quebec, scaled using the initial population size to occur twenty generations ago. This was then followed by exponential growth over twenty generations in the European and Quebec populations to increase their size by 3 and 600 respectively (as documented in historical records), using 100 replicates for each scenario. In total, we simulated 360 unlinked genes per replicate, each consisting of five 400 bp exons separated by introns of size 2 kb (similar to the average exon and intron sizes documented in [50]). We used a mutation rate per site of 1.5×10−8 and an average recombination rate of half this value. We ignored positive selection since it is likely to be rare and used an average selection coefficient of −0.03, as inferred in [14], sampled from a gamma distribution. In each replicate and for each population, we selected 100 individuals and then compared the proportion of variants with MAF<5%. The results of these simulations are shown in Table S6.

Acknowledgments The authors are indebted to the patients and their parents for participating in this study. Whole exome sequencing was performed at the Child Health Genomics Platform of the Sainte-Justine UHC Research Center. We thank Marc Pybus for his help with forward simulations.

Author Contributions Conceived and designed the experiments: FC AH PA. Performed the experiments: FC YI EG JFS ML. Analyzed the data: FC AH JHu VB TdM JCG VS. Contributed reagents/materials/analysis tools: FFH SG JHe IF DS JLM GAR EH FLD PA. Wrote the paper: FC AH PA.