Participants

This study included unrelated individuals from the multivariate longitudinal Twins Early Development Study that recruited almost 17,000 twin pairs born in England and Wales between 1994 and 1996.21 The sample is representative of British families in ethnicity, family SES and parental occupation.21 The genotyped subsample is representative of UK census data at first contact (Supplementary Table S1). The Institute of Psychiatry, Psychology and Neuroscience ethics committee (05.Q0706/228) granted project approval and parental consent was obtained prior to data collection.

DNA for 3497 individuals was extracted from saliva samples and hybridized to HumanOmniExpressExome-8v1.2 genotyping arrays at the MRC SGDP Centre Molecular Genetics Laboratories. The raw image data from the array were normalized, pre-processed, and filtered in GenomeStudio according to Illumina Exome Chip SOP v1.4. (http://confluence.brc.iop.kcl.ac.uk:8090/display/PUB/Production+Version%3A+Illumina+Exome+Chip+SOP+v1.4). In addition, prior to genotype calling, 869 multi-mapping SNPs and 353 samples with call rate <0.95 were removed. The ZCALL program22 was used to augment the genotype calling for samples and SNPs that passed the initial QC.

DNA from an additional 3665 samples genotyped earlier in the project was extracted from buccal cheek swabs and genotyped at Affymetrix (Santa Clara, CA, USA). Samples were successfully hybridized to AffymetrixGeneChip 6.0 SNP genotyping arrays (http://www.affymetrix.com/support/technical/datasheets/genomewide_snp6_datasheet.pdf) using experimental protocols recommended by the manufacturer (Affymetrix). The raw image data from the arrays were normalized and pre-processed at the Wellcome Trust Sanger Institute (Hinxton, UK) for genotyping as part of the Wellcome Trust Case Control Consortium 2 (https://www.wtccc.org.uk/ccc2/) according to the manufacturer’s guidelines (http://www.affymetrix.com/support/downloads/manuals/genomewidesnp6_manual.pdf). Genotypes for the Affymetrix arrays were called using CHIAMO (https://mathgen.stats.ox.ac.uk/genetics_software/chiamo/chiamo.html).

After initial quality control and genotype calling, the same quality control was performed on the samples genotyped on the Illumina and Affymetrix arrays separately using PLINK,23 R24 and VCFtools.25 Samples were removed from subsequent analyses on the basis of call rate (<0.99), suspected non-European ancestry, heterozygosity, array signal intensity (>4 s.d. from the mean) and relatedness. SNPs were excluded if the minor allele frequency was <0.05%, if more than 1% of genotype data were missing or if the Hardy Weinberg P-value was lower than 10−5. Non-autosomal markers and indels were removed. Association between the SNP and the array, batch or plate on which samples were genotyped was calculated; SNPs with an effect P-value less than 10−3 were excluded. A total sample of 5825 samples, with 2698 individuals genotyped on Illumina and 3127 individuals genotyped on Affymetrix, remained after quality control.

Genotypes from the two arrays were separately imputed using the Haplotype Reference Consortium26 and Minimac3 1.0.1327, 28 available on the Michigan Imputation Server (https://imputationserver.sph.umich.edu) as reference data. A series of quality checks were performed before merging data from the two arrays imputation (e.g. array effects, allele frequencies by imputation quality). For the present analyses, we limited our analyses to variants genotyped or imputed at info >0.95 on both arrays, and with Hardy Weinberg Equilibrium test P-value >10−5. After stringent pruning to remove markers in high linkage disequilibrium (R2>0.1) and excluding high linkage disequilibrium genomic regions so as to ensure that only genome-wide effects were detected, we performed Principal Component Analysis on a subset of 40, 745 autosomal SNPs that remained after applying our quality control criteria, and that overlapped between the two genotyping arrays. To control for population stratification, we regressed the GPS on the first 10 principal components and used the residuals in all subsequent analyses.

Measures

National Curriculum levels age 7 and 12

English and mathematics National Curriculum levels were collected from teachers when the twins were aged 7 (M=7.2, s.d.=0.27) and 12 (M=11.4, s.d.=0.66). National Curriculum data and genotypes were available for 4047 children at age 7 and 2950 at age 12. The assessments are based on a rubric aligned with the UK National Curriculum, which is the standardized core academic curriculum formulated by the National Foundation for Educational Research (NFER) and the Qualifications and Curriculum Authority (QCA) (NFER: http://www.nfer.ac.uk/index.cfm; QCA: http://www.qca.org.uk). After receiving parental consent, teachers were contacted directly via mail. Teacher ratings assessed two main abilities: English (including ‘speaking and listening’, ‘reading’ and ‘writing’) and mathematics (including ‘using and applying mathematics’, ‘numbers’ and ‘shapes, space and measures’).

At age 7 and 12, teachers rated National Curriculum levels on a 5-point and 9-point scale, respectively, with higher scores representing greater ability. Mathematics and English abilities correlated 0.74 and 0.81 at age 7 and 12, respectively. Therefore, we created overall academic achievement mean scores by calculating the standardized mean for the English and mathematics scores for both ages.

General Certificate of Secondary Education measures age 16

The General Certificate of Secondary Education (GCSE) is a standardized UK-based examination taken at the end of compulsory education at age 16. In addition to the compulsory core subjects of English, mathematics and science, students can choose from a variety of subjects such as physical education, music, geography, modern foreign languages, and information and communication technology.

GCSE results were obtained by questionnaires sent via mail and by telephone interviews of parents and twins themselves. The grades were coded to range from 4 (G; the minimum pass grade) to 11 (A*; the best possible grade). The GCSE score used in this study represents the mean of the compulsory core subjects mathematics and English (if both English language and English literature were taken, a mean grade for English was derived). The two subjects correlated 0.70. We included only mathematics and English grades in the composite score to improve comparability between the educational achievement measures at the different ages. Self-reported GCSE grades of Twins Early Development Study participants show high accuracy, correlating 0.98 English and 0.99 for mathematics grades with data obtained for a subsample from the National Pupil database (NPD: https://www.gov.uk/government/collections/national-pupil-database).14 Data for subject grades and genotypes were available for 4301 twins (mean age=16.62, s.d.=0.32).

General cognitive ability (g)

To measure general cognitive ability, the twins were assessed on various tests including verbal and non-verbal abilities at age 7, 12 and 16. A mean score composite was derived from four tests (’Conceptual Grouping’,29 ‘Similarities’,30 ‘Vocabulary’,30 ‘Picture Completion’30) at age 7; three tests (‘Raven’s Progressive Matrices’,31 ‘General Knowledge’32 ‘Picture Completion’30) at age 12; and two tests (’Raven’s Progressive Matrices’ and ’Mill Hill Vocabulary test’) at age 16. Behavioral and genotypic data were available for 3559 individuals at age 7 (M=7.17, s.d.=0.29); 3349 individuals at age 12 (M=11.46, s.d.=0.64) and 1743 individuals at age 16 (M=16.52, s.d.=0.30). General cognitive ability measures at the different ages correlated on average 0.48. For simplicity we created a general cognitive ability mean composite based on data available at ages 7, 12 and 16. Only participants with data from at least two ages were included (N=2228), and mean imputation was performed on those with a missing third measure. We also report results related to general cognitive ability measured at each age individually in Supplementary Table S6.

Family SES

A composite of several factors such as parental education and occupation is considered to reflect SES better than any single factor.33 Data from 4958 genotyped individuals were available for family SES. This measure represents maternal age at birth of eldest child, the mean score of maternal and paternal highest education level, as well as the respondent’s (mother or father) occupation, administered by the Standard Occupational Classification 2000 (Office for National Statistics, 2000) at child age 2, which was the first age of contact.

Small but significant mean differences between girls and boys were found for educational achievement at all ages (Supplementary Table S2). Small age effects were found for educational achievement within each of the three ages (Supplementary Table S2). Therefore, all measures with the exception of SES and EduYears GPS were recalculated as standardized residuals corrected for gender and age. To account for a slight negative skew in educational achievement tests at age 7 and 16 and a slight positive skew at age 12, measures were quantile normalized.34

Statistical analyses

Genome-wide polygenic scores

We computed GPS for 5825 unrelated individuals using β-weights and P-values from summary statistics obtained by GWA analysis. Summary statistics were derived from the 2016 GWA study on years of education8 with a sample size of 328,918 individuals. It should be noted that the summary statistics we used are slightly different to those of the 2016 EduYears study:8 here 23andMe data are excluded due to legal restrictions, and an initial release of the UK Biobank data are included (see Supplementary Table S3 for cohort details). GPS based on these modified summary statistics correlated highly (r=0.86) with the published GPS8 when both GPS were constructed using the Health and Retirement Study as target sample. Quality-controlled SNPs were clumped for linkage disequilibrium in PRSice,35 using R2=0.1 cutoff within a 250-kb window. In toal, 108,737 SNPs remained after linkage disequilibrium clumping. We used PRSice35 to calculate polygenic scores. Firstly, PRSice calculated GPS for each individual in our sample by summing the trait-associated SNPs that are weighted by their effect size derived from GWA analysis. PRSice then performed a regression analysis to test for association between GPS and each of our outcomes (educational achievement at 7, 12, 16, SES and g). This is repeated for GPS calculated at a large number of P-value thresholds, ranging from 0.001 to 1 (increments of 0.001) in the GWA results, under the high-resolution scoring option in PRSice. Through this high-resolution scoring we identified the ‘best-fit’ GPS for all measures (Supplementary Table S4), which were used throughout our analyses for each respective trait. The ‘best-fit’ GPS is identified as that which gives the smallest P-value for association with outcome among all the regression tests performed on the GPS (see Supplementary Figures S4). Given the multiple testing involved in high-resolution scoring we use an association significance threshold of P=0.001, as recommended in Euesden et al.35

For our GPS analyses, we have more than 80% power to explain 0.2% of the phenotypic variance (see Supplementary Methods S1 for details). To test interactions between different levels of EduYears GPS and family SES, we have more than 80% power to detect a small interaction effect of η2=0.02 (given α=0.05; N=600; number of groups=4).

We performed regression analyses with EduYears GPS as a predictor of educational achievement at ages 7, 12 and 16, as well as of g and family SES. To test for potential differences between correlations between EduYears GPS and educational achievement at the different ages, we performed Fisher’s r-to-z transformations. We also used multiple regression to test whether associations between EduYears GPS and educational achievement remain after controlling for family SES and g. We also tested for mean differences in educational achievement between the extreme septiles of EduYears GPS at each age using analyses of variance. Finally, interaction effects between EduYears GPS and SES on educational achievement and on g were analyzed using multiple regression models that included each main effect and the interaction effect term.