Samples

Educational attainment has a well-documented health-education gradient as well as phenotypic and genetic relation to cognitive functioning67, and is influenced by environmental and genetic factors19,68. We obtained summary statistics for about ten million single nucleotide polymorphisms (SNPs) from a GWAS of educational attainment (EduYears)38 (sample N = 328,917 Caucasian individuals from North America, Western Europe and Australia), as well as from UK Biobank GWASs of college or university degree (College)37 (sample N = 111,114) and of general cognitive ability (GCA1)40 (sample N = 269,867). The data on GCA1 was based on 269,867 individuals drawn from 14 cohorts, primarily consisting of data from the UK Biobank (sample N = 195,653) and the Cognitive Genomics Consortium (sample N = 35,289). We also obtained summary statistics for the same set of SNPs from a similar GWAS of GCA (i.e., GCA2) in middle and older age by the CHARGE consortium, which included a total of 53,949 individuals39. Finally, we used Height (sample N = 183,727) and BMI (sample N = 339,224) GWAS summary statistics from the GIANT consortium study69,70 as control sets.

Analytical Approach

We employed genetic enrichment methods recently developed to uncover more of the genetic architecture of complex traits60,61,62,71. Specifically, we investigated the enrichment of associations concurrent with the evolutionary affiliations in a covariate-modulated statistical framework61. We investigated whether variants in evolutionarily salient regions or tagging other variants therein, are more likely associated with measures of education attainment and general cognitive function, as well as with other control phenotypes. The visual displays of enrichment were produced with MATLAB (www.mathworks.com/products/matlab). The enrichment test statistics were computed using the LD-score package44.

Cognitive phenotypes

College completion (College) measures the highest level of educational qualification achieved37 while educational attainment (EduYears)38 measures the years of completed schooling. These are used as a proxy for intelligence and they show high genetic correlation48. The general cognitive ability is not a specific cognitive skill but a measure of various fluid cognitive ability tests. The measures used in GCA1 and 2 were constructed using the first un-rotated component extracted from a principal component analysis of the individual cognitive test scores that measured general fluid cognitive functions39. These measures of fluid intelligence correlate highly with general cognitive ability40,48,72. The scores used by these two independent phenotypes capture the shared variance across cognitive test batteries measuring fluid cognitive functions, and explain around 40–50% of the variation across cognitive domains. Further details of the tests administered can be found in the original publications37,39,40.

Post-Neanderthal selective sweep regions

The index of the post-Neanderthal selective sweep (PNSS) regions was obtained from the work of Prüfer et al.36 and is downloadable from, http://cdna.eva.mpg.de/neandertal/. The authors used a hidden Markov model to identify regions in Neanderthals that differ from modern humans36. Neanderthal and Denisovan genes were used to identify regions that differed from the representative modern human population in the 1000 genomes project variants. The identified regions were assigned a score based on genetic lengths and a cut off was assigned for regions that were most likely to have undergone positive selection in humans.

We assigned all SNPs a value of 0 or 1 based on whether these fell outside or inside the regions of recent positive selection in humans, respectively.

Confounding/mediating effects

We controlled for the following factors while assessing the evolutionary enrichment of cognition/education associations:

Brain genes

We used the protein atlas (http://www.proteinatlas.org/humanproteome/brain) to select all genes that are expressed specifically in the brain of Homo sapiens. We identified a total of 4915 genes by filtering for genes that have high expression levels in brain. The 1000 Genomes Project SNPs were then aligned with the identified genes. The ones overlapping with these genes were assigned a “Brain” value of 1, the rest were assigned a “Brain” value of 0. All SNPs were subsequently assigned LD–informed “Brain” scores (see below).

Annotation of genomic regions, LD-based

The SNPs that fall within certain regions of interest may capture only a limited portion of the association signal ascribable to that region. We used an LD-weighted scoring algorithm35,56,71 to identify SNPs that tag specific DNA regions even if they are not situated within them. For each SNP, a pairwise correlation coefficient approximation to LD (r2) was extracted for all 1KGP SNPs within a 1,000,000 base pairs (1 Mb). All r2 values <0.2 were set to 0 and each SNP was assigned an r2 value of 1.0 with itself. LD-weighted region annotation scores for all DNA regions of interest were computed as the sum of LD r2 between the tag SNP and all 1KGP SNPs in those regions. Given SNP i, its LD-weighted region annotation score was computed as LDscore i = Σ j (δ j r ij 2), where r ij 2 is the LD r2 between SNP i and SNP j and δ ij takes values of 1 or 0 depending on whether the 1KGP SNPj is within the region of interest or not. LD scores were assigned to exons, introns, 3′UTR and 5′UTR35,56,71,73.

Intergenic correction

Intergenic SNPs are defined as having LD-weighted annotation scores for exon, intron, 3′UTR and 5′UTR equal to zero and being in LD with no SNPs in the 1KGP reference panel located within 100,000 base pairs of a protein coding gene, within a non-coding RNA, within a transcription factor binding site or within a miRNA binding site71. Those singled out in this way are expected to form a collection of non-genic SNPs not belonging to any annotated functional elements and their LD-associated regions within the genome and therefore represent a collection of likely null associations. Intergenic SNPs were used to estimate the inflation of GWAS summary statistics due to cryptic relatedness. We used intergenic SNPs because their relative depletion of associations suggests they provide a set of reliably null SNPs that is less contaminated by polygenic effects. The inflation factor, λ GC , was estimated as the median squared z-score of independent sets of intergenic SNPs across one hundred LD-pruning iterations, divided by the expected median of a chi-square distribution with one degree of freedom.

Conditional quantile-quantile plots

To visualize enrichment, we constructed conditional quantile-quantile (Q-Q) plots where we compared the nominal p-value distribution to the empirical distribution71. In the presence of null relationships, the nominal p-values form a straight line on a Q-Q plot when plotted against the empirical distribution. We plotted −log 10 nominal p-values against −log 10 empirical p-values for the two SNP strata subdivided by the PNSS score, as well as for all SNPs. Leftward deflections of the observed distribution from the null line reflect increased tail probabilities in the distribution of test statistics (z-scores) and consequently an over-abundance of low p-values compared to that expected under the null hypothesis71. Enrichment is present if the line corresponding to the variants of interest has a leftward deflection from the comparison stratum. To assess polygenic effects below the standard GWAS significance threshold, we focused the Q-Q plots on SNPs with nominal −log10(p) < 7.3 (corresponding to p > 5 × 10−8).

Fold enrichment plots

To visually emphasize the association enrichment, we used conditional fold enrichment plots74. As for Q-Q plots, the covariate of interest, i.e. the PNSS score, is used to subdivide SNPs into two strata. The plots were obtained by computing the empirical cumulative distribution of −log10(p)-values for SNP association with a given phenotype for all SNPs, and for the two SNPs strata determined by the PNSS score. Then each stratum’s fold enrichment was calculated as the ratio CDF stratum /CDF all between the −log10(p) cumulative distribution for that stratum and the −log10(p) cumulative distribution for all SNPs. The nominal −log10(p) values are plotted on the x-axis, the fold enrichment in the y-axis. To assess polygenic effects below the standard GWAS significance threshold, we focused the fold enrichment plots on SNPs with nominal −log10(p) < 7.3 (corresponding to p > 5 × 10−8). Enrichment is present if the line corresponding to the SNPs of interest has an upward deflection. The plots should be interpreted with caution when the baseline is determined by fewer than 5–10 data points.

Stratified enrichment analysis

To quantify the contribution of variants within the PNSS regions we conducted analyses using an approach45 based on stratified LD score regression44. We first dichotomized the PNSS scores into binary scores. We used the LD-score tool with the “—h2” option to estimate SNP-based heritability of variants with negative NSS score, controlling for the a set of 53 annotations44, including standard genomic annotations such as exon, intron, 3′UTR, 5′UTR, presence of enhancers, total LD score, and brain gene affiliation. Given the complex LD75 in the extended major histocompatibility complex (MHC) region (genome build 19 location 25119106–33854733), we excluded SNPs in the MHC region and SNPs in LD (r2 > 0.1) with such SNPs from the analysis, to avoid any inflation due to complex correlations. We then used the LD-score tool with the “—l2” option to calculate the total LD of 1,190,321 variants from the HapMap3 project towards the category of variants with negative scores. The pairwise LD r2 measures were calculated across 9,997,231 variants from the reference panel (1 kG Phase3 genotypes for individuals with European descent). The effect reported (β PNSS ) is the LD score regression coefficient. Its p-value is the probability of the “true” effect size being different from zero based on the standard error estimate of the coefficient.

Data can be obtained from

EduYears: https://www.thessgac.org/data, College: http://www.ccace.ed.ac.uk/node/335, GCA1:https://ctg.cncr.nl/software/summary_statistics, GCA 2: data can be obtained from the CHARGE consortium, BMI and Height:, http://portals.broadinstitute.org/collaboration/giant/index.php/GIANT_consortium_data_files.