Discovery stage cohorts

Analyses were performed among participants of studies in the Early Growth Genetics (EGG) Consortium, the Initiative for Integrative Psychiatric Research (iPSYCH) study, and the Genomic and Proteomic Network for Preterm Birth Research (GPN). The iPSYCH sample (n = 51,357) included patient groups of six mental disorders: autism, ADHD, schizophrenia, bipolar disorder, depression, and anorexia46. Participating studies from the EGG Consortium included the Avon Longitudinal Study of Parents and Children (ALSPAC, n = 6072) study, the Children’s Hospital of Philadelphia (CHOP, n = 1445) cohort, three sub-samples from the Copenhagen Prospective Studies on Asthma in Childhood (COPSAC_REGISTRY, n = 933; COPSAC2000, n = 356; COPSAC2010, n = 618), a sub-sample from the Danish National Birth Cohort (DNBC, n = 2,130), the Exeter Family Study of Children’s Health (EFSOCH, n = 699) study, the GENERATION R (GenR, n = 1331) study, the Hyperglycemia and Adverse Pregnancy Outcome (HAPO, n = 1347) study, the Infancia y Medio Ambiente (INMA, n = 994) study, a sub-sample of the Norwegian Mother and Child cohort study (MoBa_2008, n = 1064), two Northern Finland Birth Cohort studies (NFBC1966, n = 5209, and NFBC1986, n = 2494), the Western Australian Pregnancy Cohort Study (Raine Study, n = 334), a sub-sample of Statens Serum Institut’s genetic epidemiology (SSI-GE, n = 3294) studies, the Special Turku coronary Risk factor Intervention Project (STRIP, n = 441), the Diabetes and Inflammation Laboratory (1958BC (DIL-T1DGC), n = 2168) cohort, and the Wellcome Trust Case Control Consortium 1958 British Birth Cohort (1958BC (WTCCC), n = 2403). Study protocols within the EGG Consortium were approved at each study center by the local ethics committee and written informed consent had been obtained from all participants and/or their parent(s) or legal guardians. Regarding the iPSYCH and SSI-GE cohorts, GWAS data were generated based on dried blood spot samples obtained during routine neonatal screening and stored in the Danish Neonatal Screening Biobank, which is part of the Danish National Biobank. Parents are informed in writing about the neonatal screening and that the samples can later be used for research, pending approval from relevant authorities46. The iPSYCH and SSI-GE studies were both approved by the Danish Scientific Ethics Committees, the Danish Data Protection Agency and the Danish Neonatal Screening Biobank Steering Committee. The GPN study genotype and phenotype data were downloaded from dbGaP (https://www.ncbi.nlm.nih.gov/gap, accession number: phs000714.v1.p1) and 190 early preterm cases and 274 term infants were included in our early preterm and preterm birth meta-analysis. Study descriptions, relevant sample sizes, and basic characteristics of samples in the discovery stage are presented in Supplementary Data 1.

Replication stage cohorts

Three population-based cohorts with existing GWAS data were used for replication stage analyses of the lead SNPs for gestational duration and early preterm birth, namely an additional sub-sample of the Norwegian Mother and Child cohort study (MoBa_HARVEST, n = 7072), the Born in Bradford study (BiB, n = 1354), and a Finnish cohort from Helsinki (FIN, n = 865). We also included a set of mother–father–child trios from Iowa (n = 276 trios) for family-based association testing of the lead SNPs for early preterm birth. For these trios, genotyping was done using TaqMan (ThermoFisher Scientific) assays for rs1877720 (assay ID: C__12110609_10) and rs2306375 (assay ID: C__42774777_10). The study characteristics of the four replication stage cohorts are described in Supplementary Data 1.

Exclusion criteria for cases and controls

We excluded pregnancies based on the following criteria: (1) stillbirths; (2) twins or any multiple births; (3) ancestry outliers using principal component analysis; (4) outliers in birth weight or birth length (gestational duration possibly wrong); (5) Caesarian section, if due to pregnancy complications; Caesarian sections due to complications during labor were not excluded. Caesarian sections were allowed for cases in the postterm birth analysis; (6) physician initiated births (induced births were allowed for cases in the postterm birth analysis); (7) placental abruption, placenta previa, pre-eclampsia/eclampsia, hydramnios, placental insufficiency, cervical insufficiency, isoimmunization, gestational diabetes, cervical cerclage; (8) pre-existing medical conditions in the mother, such as diabetes, hypertension, autoimmune diseases (including systemic lupus erythematosus, rheumatoid arthritis and sclerodermia), immuno-compromised patients; and (9) known congenital anomalies. Further, the study sample was restricted to individuals of European ancestries, in most cohorts by principal component analysis. Some cohorts were not able to perform exclusions according to all criteria, but applied as many criteria as possible (see Supplementary Data 1 for details).

Data cleaning and imputation

Genotyping in each of the contributing studies was conducted using various high-density SNP arrays (see Supplementary Data 1 for details). Data cleaning was done locally for each study, with sample level exclusion criteria based on high genotype missing rate, high autosomal heterozygosity rate, discrepancy between reported sex and the sex inferred from genotyping, and sample heterogeneity, as well as SNP-level exclusion criteria based on call rate, Hardy–Weinberg disequilibrium, duplicate discordance, Mendelian inconsistencies, and low minor allele frequency. Imputation was performed based on reference data from the Haplotype Reference Consortium (HRC) release 1.1 (ref. 29) for most studies. The iPSYCH sample was imputed based on the integrated phase III release of the 1000 Genomes Project30. Study-specific details on data cleaning filters and imputation are given in Supplementary Data 1. SNP positions were based on National Center for Biotechnology Information (NCBI) build 37 (hg19) and alleles were labeled on the positive strand of the reference genome.

GWAS analysis

We analyzed four traits for association with fetal genotypes, including three binary case–control traits (early preterm birth, preterm birth, and postterm birth) and one quantitative trait (gestational duration). Early preterm birth cases were defined as infants born before gestational week 34+0 (i.e. <238 days of gestation); preterm birth cases were infants born before gestational week 37+0 (i.e. <259 days of gestation, including the early cases); postterm birth cases were infants born at or after gestational week 42+0 (i.e. ≥294 days of gestation). The controls used in the analyses of these three traits were defined as infants born at or after gestational week 39+0 and before gestational week 42+0 (i.e. ≥273 days of gestation, and <294 days of gestation). Case groups in each study contributing to the discovery stage analyses were required to contain at least 50 individuals. The dichotomous trait analyses did not include additional individuals compared to the gestational duration analysis. However, these traits are of high clinical relevance and were therefore included in the study design. Also, including the dichotomous trait tests makes the study sensitive to potentially changing mechanisms influencing parturition at different pregnancy stages.

For the dichotomous traits early preterm birth, preterm birth, and postterm birth, the genome-wide association analyses within discovery studies was done by logistic regression using imputed allelic dosage data under an additive genetic model. For the quantitative trait gestational duration, we applied a rank-based inverse normal transformation. More specifically, in each cohort gestational duration in days (in some cohorts converted from weeks; see Supplementary Data 1) was regressed on infant sex and the resulting residuals were quantile transformed to a standard normal distribution before being tested for association with fetal SNP genotypes. The DNBC and MoBa_2008 samples represent case–control studies of preterm birth, which means that the distribution of gestational duration is bimodal for these studies. In these two cohorts, we transformed gestational duration to be on the same scale as the population-based cohorts (see Supplementary Methods and Supplementary Fig. 11 for details).

Some of the analyzed cohorts represent case–control studies of various diseases. For these, the association analyses of the four outcomes of interest were done in strata defined by disease group. Thus, the iPSYCH study was divided into six patient groups (autism (n = 7147), ADHD (n = 8606), schizophrenia (n = 1101), bipolar disorder (n = 864), depression (n = 13,836), and anorexia (n = 1924)) and a population control group (n = 17,879), which were analyzed separately and combined by fixed-effects meta-analysis. Similarly, the SSI-GE sample was split into six patient groups (atrial septal defects (n = 368), febrile seizures (n = 1350), hydrocephalus (n = 289), hypospadias (n = 301), opioid dependence (n = 685), and postpartum depression (n = 301)) (see Supplementary Data 1 for details), which were analyzed separately and combined by fixed-effects meta-analysis. Genome-wide association analyses in each cohort/sub-sample was conducted using PLINK47, SNPTEST48, or RVTESTS49.

We obtained effect size estimates of the lead variant for gestational duration in the unit of days based on the iPSYCH study. Using a linear model, we regressed gestational duration in days on SNP allele dosage within each iPSYCH disease group, adjusting for infant sex. A combined estimate was obtained by fixed-effects inverse-variance meta-analysis. The iPSYCH study was also used to obtain frequency estimates for the lead variant for gestational duration within samples grouped by gestational duration. In this case, iPSYCH disease status was omitted from the model, since sample sizes would be too small if analyses were stratified by gestational duration groups as well as iPSYCH disease groups.

Meta-analysis

Prior to meta-analysis, SNPs with a minor allele frequency (MAF) <0.01 and poorly imputed SNPs (r2hat <0.3 from MACH50 or info <0.4 from SNPTEST48) were excluded. Furthermore, SNPs available in less than 50% of the discovery cohorts for each trait were excluded. To adjust for inflation in test statistics generated in each cohort, genomic control51 was applied once to each individual study (see Supplementary Table 6 for λ values in each study). The sub-samples within iPSYCH and SSI-GE were meta-analyzed separately first and estimates were then adjusted by genomic control again. Finally, we combined results from all discovery cohorts using fixed-effects inverse-variance-weighted meta-analysis as implemented in METAL52. Final meta-analysis results were obtained for 7,646,297 SNPs for gestational duration with a genomic inflation factor (λ) of 1.049, 7,588,467 SNPs for early preterm birth (λ = 1.005), 7,545,601 SNPs for preterm birth (λ = 1.013), and 7,583,965 SNPs for postterm birth (λ = 1.026). Heterogeneity between studies was estimated using the I2 statistic53. Combined analysis of the discovery and replication stage data was also conducted by fixed-effects inverse-variance-weighted meta-analysis. We considered SNPs with P < 0.05 in the replication stage and P < 5 × 10−8 in the combined analysis to indicate robust evidence of association.

Power analysis

We assessed the statistical power of our study design by computer simulations in R54. For gestational duration, we simulated a quantitative trait influenced by an additive genetic effect and allowed the effect size and the effect allele frequency to vary. For early preterm, preterm, and postterm birth, we simulated disease state from a logistic regression model allowing the odds ratio for a log-additive genetic effect and the frequency of the effect allele to vary. For each combination of effect size and effect allele frequency, we simulated 5000 data sets using the relevant sample size (e.g., for preterm birth: 4775 cases and 60,148 controls in the discovery stage). We then conducted association tests on the simulated data sets and calculated power as the proportion of tests with a P value lower than the relevant significance level (P < 5 × 10−8 for the discovery stage and P < 0.05 for the replication stage).

Test of non-linear effect

At the 2q13 locus, the lead SNP was not associated with early preterm birth or preterm birth suggesting that the association with gestational duration was strongest in later stages of pregnancy. To address this question, we put forward the null hypothesis H 0 : the variant contributes equally to higher gestational duration no matter when the child was born. We used a semi-parametric bootstrap approach to test the null hypothesis. First, we binned the 51,357 births from the largest contributing study (iPSYCH) in five groups by gestational duration. We then calculated observed allele frequencies f1, …, f5 in the five bins. Next, we regressed gestational duration on genotype and extracted the empirical residuals. Gestational duration was now bootstrapped under the null hypothesis with resampling of the empirical residuals. Our test statistic is based on comparing allele frequencies in the five bins in 10,000 bootstrapped data sets. If the variant does not influence gestational duration as much (relative to other factors) in the early part of the distribution, then the observed allele frequency f1 in the first bin will be closer to the overall frequency than expected under H 0 , while the allele frequency in the second bin (f2) will be lower than expected under H 0 and in the fifth bin (f5) the allele frequency will be higher than expected under H 0 . The P value for the test is calculated as the proportion of bootstrapped data sets with allele frequencies that are more extreme than the observed allele frequencies, i.e.

$${\it{P}} = \frac{1}{{10,000}}\mathop {\sum}

olimits_{{\mathrm{boot}} = 1}^{10,000} {1\left( {{\it{f}}1_{{\mathrm{boot}}} \, > \, {\it{f}}1} \right) \ast 1\left( {{\it{f}}2_{{\mathrm{boot}}} \, < \, {\it{f}}2} \right) \ast 1\left( {{\it{f}}5_{{\mathrm{boot}}} \, > \, {\it{f}}5} \right)}.$$ (1)

A more detailed description of the approach is given in the Supplementary Methods.

Bioinformatics analysis

To investigate the functional characteristics of our findings, we annotated all variants with P < 1 × 10−4 at the 2q13 locus using ANNOVAR55 (accessed 1 June 2017), a tool that retrieves variant and region-specific functional annotations from several databases. We retrieved eQTL information for these variants from the GTEx V6 (ref. 31) and GEUVADIS32 project databases. We also queried GeneHancer56, a database of human enhancers and their inferred target genes, which has integrated four different enhancer data sets, including the Encyclopedia of DNA Elements (ENCODE), the Ensembl regulatory build, the functional annotation of the mammalian genome (FANTOM) project, and the VISTA Enhancer Browser. Gene-enhancer scores (>5) were included in the annotation of the variants. We further downloaded all reported variants in the National Human Genome Research (NHGRI) GWAS Catalog33 (accessed 24 November 2017) associated with a trait or disease at P < 5 × 10−8, and searched for SNPs in LD (r2 > 0.2) with the lead SNP at 2q13 locus. Further annotation of these variants was performed with the Ensembl Variant Effect Predictor57.

To assess possible enrichment of cytokine-related variants in the association results for gestational duration, we did a quantile–quantile plot of observed versus expected –log10 P values of SNPs known to be associated with cytokine levels (Supplementary Fig. 9). The cytokine-related SNPs were restricted to cytokine GWAS publications58,59,60, in which the association had been reported in the GWAS Catalog with P < 5 × 10−8.

Exome analysis

Exome sequencing data were available for a subset of samples in the iPSYCH study and analysis was restricted to the overlap between iPSYCH exome samples and the part of the iPSYCH cohort that were included in the GWAS. In total, n = 18,382 individuals, sampled from either schizophrenia (n = 910), bipolar (n = 683), ADHD (n = 3793), autism (n = 5561), affective disorder (n = 1), or controls (n = 7488) were analyzed. For these samples, variants within a 1 MB region (113–114 MB) containing the 2q13 association signal were extracted and combined with the genotype data for the lead variant rs7594852. For these variants, association analysis was performed with gestational duration, transformed as described above. We adjusted the regression model for sex and the first three principal components obtained from the genotyping data. Due to small sample size in the strata of inclusion diagnosis, we did not perform analyses within strata of inclusion diagnosis but instead performed adjustment for four indicator variables denoting whether the individual has schizophrenia, ADHD, bipolar disorder, or autism. In addition, association analysis conditioned on rs7594852 was performed by adding rs7594852 dosage as a covariate. Based on the same sequencing data, we performed a gene-based test for rare-variant association to (quantile transformed) gestational duration, using the optimal sequence kernel association test (SKAT-O)61 approach as implemented in EPACTS version 3.2.6, using default settings.

Estimating fetal and maternal genetic effects

For the 2q13 locus, we analyzed 15,588 mother–child pairs from seven studies with both fetal and maternal genotypes available (ALSPAC, BiB, DNBC, EFSOCH, FIN, MoBa_2008, and MoBa_HARVEST). We used linear regression to test the association between quantile transformed gestational duration (same transformation as in the main analysis) and fetal genotype conditional on maternal genotype and vice versa. In the same complete mother–child pairs (i.e. where genotype data were available for both mother and child), we estimated unconditional effects of fetal and maternal genotype, respectively. We combined the results from the individual studies using fixed-effects meta-analysis. To further address the question of fetal versus maternal effects, we combined unadjusted fetal effect estimates for gestational duration in days (based on 51,357 infants from the iPSYCH study) with corresponding maternal estimates from a recently published GWAS (based on 43,568 mothers)28 using a WLM approach recently described35. Briefly, the fetal effect adjusted for maternal genotype is

$$\hat \beta _{f_{{\mathrm{adj}}}} = - \frac{2}{3}\hat \beta _{m_{{\mathrm{unadj}}}} + \frac{4}{3}\hat \beta _{f_{{\mathrm{unadj}}}}.$$ (2)

And the standard error for the adjusted estimate is

$${\mathrm{SE}}\left( {\hat \beta _{f_{{\mathrm{adj}}}}} \right) = \sqrt {\frac{4}{9}{\mathrm{var}}\left( {\hat \beta _{m_{{\mathrm{unadj}}}}} \right) + \frac{{16}}{9}{\mathrm{var}}\left( {\hat \beta _{f_{{\mathrm{unadj}}}}} \right)}.$$ (3)

Test statistics for the fetal adjusted effect were calculated as

$$Z_{f_{{\mathrm{adj}}}} = \frac{{\hat \beta _{f_{{\mathrm{adj}}}}}}{{{\mathrm{SE}}\left( {\hat \beta _{f_{{\mathrm{adj}}}}} \right)}}$$ (4)

and compared to a standard normal distribution to get two-sided P values. Analogous formulae were used to obtain maternal results adjusted for fetal genotype. Further details and full derivations can be found in the article by Warrington et al.35.

Variance explained and genetic correlation analyses

To estimate the fraction of variance in gestational duration explained by the lead variant rs7594852 at the 2q13 locus, we fitted a linear regression model of quantile transformed gestational duration in the iPSYCH cohort (n = 51,357) where the variant was genome-wide significant. The model was corrected for the iPSYCH disease group and the fraction of variance explained by rs7594852 genotype dosage was extracted. The estimate of variance explained by all common (MAF >1%) autosomal SNPs (also known as SNP heritability) was calculated based on the discovery stage meta-analysis results using LD Score regression62. The main discovery stage meta-analysis was based on quantile transformed gestational duration, but we also estimated the fraction of variance explained for gestational duration in days (based on 51,357 infants from the iPSYCH study). However, both of these estimates are influenced by fetal as well as maternal genetic loci. We therefore used the WLM approach for all common SNPs to obtain estimates of fetal effect adjusted for maternal genotype and vice versa. We then estimated the fraction of variance explained based on the WLM-adjusted results.

LD score regression62 was used to estimate the genetic correlation between fetal effect estimates for (quantile transformed) gestational duration and effect estimates for a 690 traits and diseases in LDHub36. In addition to the traits available in LDHub, we calculated the genetic correlation between fetal effect estimates for gestational duration in days and corresponding estimates from a maternal GWAS of gestational duration28, also using LD score regression.

Computational prediction of gene regulatory mechanisms

In order to prioritize genetic variants for experimental validation, we ranked all variants at the 2q13 locus with r2 > 0.8 to the lead SNP, rs7594852, by their likelihood of being functional based on the strength of the supporting functional genomic data (e.g., ChIP-seq peaks for transcription factors or histone marks, open chromatin as measured by DNAse-seq, see Supplementary Data 7 for details). We used a wide range of functional genomic data in our analysis obtained from sources such as the UCSC Genome Browser63, Roadmap Epigenomics64, Cistrome65, and ReMap-ChIP66. By restricting our analysis to studies performed in relevant cell lines (placenta, chorion, amnion, trophoblasts, neutrophils, and macrophages), we prioritized those variants likely to have regulatory function in these cells. Variants were ranked based upon the total number of data sets they overlap, which is a similar strategic scheme to that employed by RegulomeDB67.

Electrophoretic mobility shift assays

EMSAs were performed to determine whether the rs7594852 polymorphism at the 2q13 locus differentially affected HIC1 binding. Recombinant human HIC1 purified protein (ORIGENE #TP322752) was obtained from ORIGENE (expressed in HEK293 using TrueORF clone, RC222752) with a c-Myc/DDK tag. Double-stranded IRDye700 5′ end-labeled 39 bp oligonucleotides, identical except for the nucleotide at rs7594852 (either the C or T allele), were obtained from IDT. The oligo sequence of the common C allele is

5′-IRDye700/GCCAGACCCCGCCTCCTGGCACAGAGGACCACGCCCGGC-3′.

The alternative T allele oligo sequence is

5′-IRDye700/GCCAGACCCCGCCTCCTGGTACAGAGGACCACGCCCGGC-3′.

The DNA-binding reaction buffer contained 1× binding buffer, 1× DTT/Tw20, 1 μg poly(dI–dC), 0.05% NP-40 (LI-COR EMSA buffer kit), and 1 mM zinc acetate. Binding reactions contained 435 ng of purified HIC1 protein. Fifty femtomoles fluorescent oligo DNAs were then added to the appropriate protein/binding mix and incubated for 20 min at room temperature. For supershift assays, 1 μg per lane at a concentration of 0.05 μg/μL of mouse anti-DDK (FLAG) monoclonal antibody (ORIGENE #TA50011-100) was incubated with the binding buffer for 20 min prior to addition of and incubation with oligo DNA. In all, 1× orange loading dye (LI-COR kit) was added to samples, which were then resolved on (pre-cast, pre-run at 100 V for 60 min) 6% TBE gels (Novex,13 ThermoFisher) in 0.5× TBE buffer for 120 min at 80 V (4C). Fluorescent bands were then imaged using a LI-COR chemiluminescent imaging system. EMSA experiments display representative panels of 2–3 replicates. Densitometric analysis for HIC1 band intensity was performed using a Licor Odyssey scanner. The uncropped image underlying Fig. 3c is shown in the Source Data File.

eQTL analysis in placental samples

eQTL analyses were conducted based on existing RNA sequencing data in placental samples from the Rhode Island Child Health Study. Placenta tissues were from singleton, term pregnancies without pregnancy complications, and the original study reported eQTL results linking SNP array data with genome-wide RNA sequencing data40. In the eQTL analyses of the current study, the sample was restricted to 102 infants of European ancestries. Only genes/transcripts with transcription start sites within 500 kb of the lead SNP rs7594852 for gestational duration, with a total read count >50 across all samples, and with >1 counts per million (cpm) in at least two samples, were considered.

Biomarker analysis

Measurement of the biomarkers BDNF, CRP, EPO, IgA, IL8, IL-18, MCP1, S100B, TARC, and VEGFA was conducted based on infant dried blood spot samples obtained a few days after birth during routine neonatal screening. We tested each measured analyte for association with the lead SNP rs7594852 for gestational duration. We first fitted a linear model with age as a predictor and then normalized and log-transformed the residuals. The log-transformed residuals were tested for association with rs7594852 dosage while adjusting for infant sex, six principal components and iPSYCH disorders.

URLs

For 1000 Genomes Project, see http://www.1000genomes.org/; for ANNOVAR, see http://annovar.openbioinformatics.org/; for Cis-BP, see http://cisbp.ccbr.utoronto.ca/; for dbGaP, see https://www.ncbi.nlm.nih.gov/gap; for EGG Consortium, see http://egg-consortium.org/; for Ensembl Variant Effect Predictor, see https://www.ensembl.org/vep; for EPACTS, see https://github.com/statgen/EPACTS; for GeneHancer, see http://www.genecards.org/; for GEUVADIS data browser, see http://www.ebi.ac.uk/Tools/geuvadis-das/; for GWAS Catalog, see http://www.genome.gov/gwastudies/; for Haplotype Reference Consortium, see http://www.haplotype-reference-consortium.org/; for iPSYCH, see http://ipsych.au.dk/about-ipsych/; for LDHub, see http://ldsc.broadinstitute.org/; for LD Score regression, see https://github.com/bulik/ldsc; for METAL, see http://www.sph.umich.edu/csg/abecasis/metal/; for NCBI Genotype-Tissue Expression (GTEx) eQTL database and browser, see http://www.ncbi.nlm.nih.gov/projects/gap/eqtl/index.cgi; for PLINK, see https://www.cog-genomics.org/plink2; for RVTESTS, see https://github.com/zhanxw/rvtests; for SNPTEST, see https://mathgen.stats.ox.ac.uk/genetics_software/snptest/snptest.html.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.