Imaging data and derived phenotypes

The UK Biobank brain imaging protocol consists of six distinct modalities covering structural, diffusion and functional imaging, summarized in Supplementary Table 1. For this study, we primarily used data from the February 2017 release of ~10,000 participants’ imaging data (and an additional ~5,000 subjects’ data released in January 2018 provided the larger replication sample).

The raw data from these six modalities have been processed for UK Biobank to create a set of IDPs4,5. These are available from UK Biobank, and it is these IDPs from the 2017–2018 data releases that we used in this study.

In addition to the IDPs directly available from UK Biobank, we created two extra sets of IDPs. First, we used FreeSurfer v6.0.037,38 (https://surfer.nmr.mgh.harvard.edu) to model the cortical surface (inner and outer 2D surfaces of cortical grey matter), as well as modelling several subcortical structures. We used both the T1 and T2 FLAIR images as inputs to the FreeSurfer modelling (or just the T1 when the T2 was not available). FreeSurfer estimates a large number of structural phenotypes, including volumes of subcortical structures, surface area of parcels identified on the cortical surface, and grey matter cortical thickness within these areas. The areas are defined by mapping an atlas containing a canonical cortical parcellation onto an individual subject’s cortical surface model, thus achieving a parcellation of that surface. Here we used two atlases in common use with FreeSurfer: the Desikan-Killiany–Tourville atlas (denoted DKT39) and the Destrieux atlas (denoted a2009s40). The DKT parcellation is gyrus-based, whereas Destrieux aims to model both gyri and sulci based on the curvature of the surface. Cortical thickness is averaged across each parcel from each atlas, and the cortical area of each parcel is estimated, to create two IDPs for each parcel. Finally, subcortical volumes are estimated, to create a set of volumetric IDPs.

Second, we applied a dimension reduction approach to the large number of functional connectivity IDPs. Functional connectivity IDPs represent the network edges between many distinct pairs of brain regions, comprising in total 1,695 distinct region-pair brain connections (http://www.fmrib.ox.ac.uk/ukbiobank/). In addition to this being a very large number of IDPs from which to interpret association results, these individual IDPs tend to be substantially noisier than most of the other, more structural, IDPs. Hence, while we did carry out GWAS for each of these 1,695 connectivity IDPs, we also reduced the full set of connectivity IDPs into just six new summary IDPs using data-driven feature identification. We performed this dimensionality reduction by applying ICA41, applied to all functional connectivity IDPs from all subjects, to find linear combinations of IDPs that are independent between the different features (ICA components) identified42. We carried out the ICA feature estimation without any use of the genetic data, and we maximized independence between component IDP weights (as opposed to subject weights). We used split-half reproducibility (across subjects) to optimize both the initial dimensionality reduction (14 eigenvectors from a singular value decomposition was found to be optimal) and also the final number of ICA components (6 ICA components was optimal, with reproducibility of ICA weight vectors greater than r = 0.9). The resulting six ICA features were then treated as new IDPs, representing six independent sets (or, more accurately, linear combinations) of the original functional connectivity IDPs. These six new IDPs were added into the GWAS analyses. The six ICA features explain 4.9% of the total variance in the full set of network connection features, and are visualized in Supplementary Fig. 18. More details of the ICA analysis of the resting state data, together with browsing functionality of the highlighted brain regions can be found on the FMRIB UK Biobank Resource web page (http://www.fmrib.ox.ac.uk/ukbiobank/).

We organized all 3,144 IDPs into 9 groups (Supplementary Table 12), each with a distinct pattern of missing values (not all subjects have usable, high-quality data from all modalities4). For the GWAS in this study we did not try to impute missing IDPs owing to the low levels of correlation observed across groups.

The distributions of IDP values varied considerably between phenotype classes, with some phenotypes exhibiting substantial skew (Supplementary Fig. 19) that would probably invalidate the assumptions of the linear regression used to test for association. To ameliorate this, we quantile-normalized each of the IDPs before association testing. This transformation also helped to avoid undue influence of outlier values. We also (separately) tested an alternative process in which an outlier removal process was applied to the untransformed IDPs; this gave very similar results for almost all association tests, but was found to reduce the significance of a very small number of associations. This possible alternative method for IDP preprocessing was therefore not followed through (data not shown).

No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to allocation during experiments and outcome assessment.

Genetic data processing

We used the imputed genetic dataset made available by UK Biobank in its July 2017 release6. This consists of >92 million autosomal variants imputed from the Haplotype Reference Consortium (HRC) reference panel43 and a merged UK10K + 1000 Genomes reference panel. We first identified a set of 12,623 participants who had also been imaged by UK Biobank. We then applied filters to remove variants with minor allele frequency (MAF) below 0.1% and with an imputation information score below 0.3, which reduced the number of SNPs to 18,174,817. We then kept only those samples (subjects) estimated to have recent British ancestry using the sample quality control information provided centrally by UK Biobank6 (using the variable in.white.British.ancestry.subset in the file ukb_sqc_v2.txt); population structure can be a serious confound to genetic association studies44, and this type of sample filtering is standard. This reduced the number of samples to 8,522. The UK Biobank dataset contains a number of close relatives (third cousins or closer). We therefore created a subset of 8,428 nominally unrelated subjects following procedures similar to those described previously6. After running GWAS on all the (SNP) variants in the 8,428 samples we applied three further variant filters to remove variants with a Hardy–Weinberg equilibrium P value <10−7, remove variants with MAF <0.1% and keep only those variants in the HRC reference panel. This resulted in a dataset with 11,734,353 SNPs.

We used two separate datasets to replicate the associated variants found in this study. The first set of 930 subjects was a subset of the 1,279 subjects with imaging data that we did not use for the main GWAS, who had primarily been excluded because they were not in the recent British ancestry subset. An examination of these samples according the genetic principal components (PCs) revealed that many of those samples are mostly of European ancestry (Supplementary Fig. 20). We selected 930 samples with a first genetic PC <14 from Supplementary Fig. 20 and these constituted the replication sample. In January 2018 a further tranche of 4,588 samples with imaging data was released by UK Biobank. Of these subjects, we selected 3,956 subjects that both had genetic data available and also had been imaged in the same imaging centre as the discovery sample. We applied the same pre-processing pipeline as for the discovery set. We then restricted this to 3,456 subjects that were of recent British ancestry and replication tests were then conducted on these 3,456 subjects.

Potential confounds for brain IDP GWAS

There are a number of potential confounding variables when carrying out GWASs of brain IDPs. We used three sets of covariates in our analyses relating to (a) imaging confounds (b) measures of genetic ancestry, and (c) non-brain imaging body measures.

We identified a set of variables that were likely to represent imaging confounds, for example those associated with biases in noise or signal level, corruption of data by head motion or overall head size changes. For many of these we generated various versions (for example, using quantile normalization and also outlier removal, to generate two versions of a given variable, as well as including the squares of these to help model nonlinear effects of the potential confounds). This was done in order to generate a rich set of covariates and hence reduce as much as possible potential confounding effects on analyses such as the GWAS, which are particularly of concern when the subject numbers are so high4,45.

Age and sex are can be variables of biological interest, but can also be sources of imaging confounds, and here were included in the confound regressors. Head motion is summarized from resting and task-based fMRI as the mean displacement (in mm) between one time point and the next, averaged over all time points and across the brain. Head motion can be a confounding factor for all modalities and not just those comprising timeseries of volumes, but is readily estimable only from the timeseries modalities. Nevertheless, the amount of head motion is expected to be reasonably similar across all modalities (for example, correlation between head motion in resting and task fMRI is r = 0.52) and so it is worth using fMRI-derived head motion estimates as confound regressors for all modalities.

The exact location of the head and the radio-frequency receiver coil in the scanner can affect data quality and IDPs. To help to account for variations in position in different scanned participants, several variables have been generated that describe aspects of the positioning (see http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=25756, http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=25757, http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=25758, and http://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=25759). The intention is that these can be useful as ‘confound variables’; for example, these might be regressed out of brain IDPs before carrying out correlations between IDPs and non-imaging variables. TablePosition is the Z-position of the coil (and the scanner table on which the coil sits) within the scanner (the Z axis points down the centre of the magnet). BrainCoGZ is somewhat similar, being the Z-position of the centre of the brain within the scanner (derived from the brain mask estimated from the T1-weighted structural image). BrainCoGX is the X-position (left–right) of the centre of the brain mask within the scanner. BrainBackY is the Y-position (front–back relative to the head) of the back of brain mask within the scanner.

UK Biobank brain imaging aims to maintain as fixed an acquisition protocol as possible during the 5–6 years that the scanning of 100,000 participants will take. There have been a number of minor software upgrades (the imaging study seeks to minimize any major hardware or software changes). Detailed descriptions of every protocol change, along with thorough investigations of the effects of these on the resulting data, will be the subject of a future paper. Here, we attempted to model any long-term (over scan date) changes or drifts in the imaging protocol or software or hardware performance, by generating a number of data-driven confounds. The first step was to form a temporary working version of the full subjects × IDPs matrix with outliers limited (see below) and no missing data, using a variant of low-rank matrix imputation with soft thresholding on the eigenvalues46. Next, the data were temporally regularized (approximate scale factor of several months with respect to scan date, see https://biobank.ctsu.ox.ac.uk/showcase/field.cgi?id=53, Instance 2) with spline-based smoothing. We then applied PCA and kept the top 10 components, to generate a basis set that reflects the primary modes of slowly changing drifts in the data.

To describe the full set of imaging confounds we use a notation where subscript i indicates quantile normalization of variables, and m indicates median-based outlier removal (discarding values greater than five times the median absolute deviation from the overall median). If no subscript is included, no normalization or outlier removal was carried out. Certain combinations of normalization and powers were not included, either because of very high redundancy with existing combinations, or because a particular combination was not well-behaved. The full set of variables used to create the confounds matrix are: a, age at time of scanning, demeaned (cross-subject mean subtracted); s, sex, demeaned; q, four confounds relating to the position of the radio-frequency coil and the head in the scanner (see above), all demeaned; d, ten drift confounds (see above); m, two measures of head motion (one from resting fMRI, one from task-based fMRI); and h, volumetric scaling factor needed to normalize for head size47.

The full matrix of imaging confounds is then:

$$[a\hspace{5pt}{a}^{2}\hspace{5pt}a\times s\hspace{5pt}{a}^{2}\times s\hspace{5pt}{a}_{i}\hspace{5pt}{a}_{i}^{2}\hspace{5pt}{a}_{i}\times s\hspace{5pt}{a}_{i}^{2}\times s\hspace{5pt}{m}_{m}\hspace{5pt}{m}_{m}^{2}\hspace{5pt}{h}_{m}\hspace{5pt}{q}_{m}\hspace{5pt}{q}_{m}^{2}\hspace{5pt}{d}_{m}\hspace{5pt}{m}_{i}\hspace{5pt}{h}_{i}\hspace{5pt}{q}_{i}\hspace{5pt}{q}_{i}^{2}\hspace{5pt}{d}_{i}]$$

Any missing values in this matrix are set to zero after all columns have had their mean subtracted. This results in a full-rank matrix of 53 columns (ratio of maximum to minimum eigenvalues is 42.6). Additional discussion on the dangers and interpretation of imaging confounds in big imaging data studies, particularly in the context of disease studies, has been published45.

Genetic ancestry is a well-known potential confound in GWAS. We ameliorated this by filtering out samples that were not of recent British ancestry. However, a set of 40 genetic principal components (PCs) has been provided by UK Biobank6, and we used these PCs as covariates in all of our analyses. The matrix of imaging confounds, together with a matrix of 40 genetic principal components, was regressed out of each IDP before the analyses reported here.

There exist a number of substantial correlations between IDPs and non-genetic variables collected on the UK Biobank subjects4. We therefore also carried out some analyses involving variables relating to blood pressure (diastolic and systolic), height, weight, head bone mineral density, head bone mineral content and two principal components from the broader set of bone mineral variables available (https://biobank.ctsu.ox.ac.uk/crystal/docs/DXA_explan_doc.pdf). Supplementary Fig. 21 shows the association of these eight variables against the IDPs and shows significant associations. These are variables that are likely to have a genetic basis, at least in part. Genetic variants associated with these variables might then produce false positive associations for IDPs. To investigate this possibility, we ran GWASs for these eight traits (conditioned on the imaging confounds and genetic PCs) (Supplementary Fig. 22). We also ran a parallel set of IDP GWASs with these ‘body confounds’ regressed out of the IDPs.

Heritability and genetic correlation of IDPs

We used a linear mixed model implemented in the SBAT (sparse Bayesian association test) software (https://jmarchini.org/sbat/) to calculate additive genetic heritabilities for the P = 3,144 traits. To estimate genetic correlations we used a multi-trait mixed model. If Y is an N × P matrix of P phenotypes (columns) measured on N individuals (rows) then we use the model:

$$Y=U+\varepsilon $$ (1)

where U is an N × P matrix of random effects and ε is an N × P matrix of residuals, and these are modelled using Matrix normal distributions as follows:

$$U \sim MN\left(0,K,B\right)$$

$$\varepsilon \sim MN\left(0,{I}_{N},E\right)$$

In this model, K is the N × N kinship matrix between individuals, B is the P × P matrix of genetic covariances between phenotypes and E is the P × P matrix of residual covariances between phenotypes. We estimate the covariance matrices B and E using a new C++ implementation of an EM algorithm48 included in the SBAT software (https://jmarchini.org/sbat/).

For the marginal heritabilities and genetic correlation analysis we used a realized relationship matrix (RRM) for the kinship matrix (K). This RRM was calculated from the 8,428 nominally unrelated individuals using fastLMM (https://github.com/MicrosoftGenomics/FaST-LMM). We used the subset of imputed SNPs that were both assayed by the genotyping chips and included in the HRC reference panel, and so will essentially be hard-called genotypes. In addition, all SNPs with duplicate rsids (reference SNP cluster IDs) were removed. PLINK (http://www.cog-genomics.org/plink/2.0/) was used for file conversion before input into fastLMM.

To estimate genetic correlations, we fit the model to several of the groupings of IDPs detailed in Supplementary Table 12. The estimated covariance matrices B and E were used to estimate the genetic correlation of pairs of IDPs. The genetic correlation between the ith and jth IDPs in a jointly analysed group of IDPs is estimated as

$${r}_{ij}=\frac{{B}_{ij}}{\sqrt{{B}_{ii}{B}_{jj}}}$$

Multi-trait association tests

We used a multi-trait mixed model to test each SNP for association with different groupings of traits (Supplementary Table 7). The model has the form Y = Gα + U + ε, where G is an N × 1 vector of SNP dosages and α is a 1 × P vector of effect sizes. We fit the model using estimates of B and E from the ‘null’ model with α = 0 and a leave one chromosome out (LOCO) approach for RRM calculation. We ran this test on the main set of 8,428 samples and on the replication samples. For the replication analysis we used the estimates of B and E from the main set of 8,428 samples. This test was implemented in SBAT software.

Genetic association of IDPs

We used BGENIE v1.2 (https://jmarchini.org/bgenie/) to carry out GWASs of imputed variants against each of the processed IDPs. This program was designed to carry out the large number of IDP GWAS required in this analysis. It avoids repeated reading of the genetic data file for each IDP and uses efficient linear algebra libraries and threading to achieve good performance. The program has already been used by several studies to analyse genetic data from the UK Biobank49,50. We fit an additive model of association at each variant, using expected genotype count (dosage) from the imputed genetic data. We ran associated tests on the main set of 8,428 samples and the replication samples.

Identifying associated genetic loci

Most GWAS analyse only one or a few different phenotypes, and often uncover just a handful of associated genetic loci, which can be interrogated in detail. Owing to the large number of associations uncovered in this study, we developed an automated method to identify, distinguish and count individual associated loci from the 3,144 GWASs (one GWAS for each IDP). For each GWAS we first identified all variants with –log 10 (P) > 7.5. We applied an iterative process that starts by identifying the most strongly associated variant, storing it as a lead variant, and then removing it, and all variants within 0.25 cM from the list of variants (equivalent to approximately 250 kb in physical distance). The process was then repeated until the list of variants was empty. We applied this process to each GWAS using two filters on MAF: (a) MAF > 0.1%, and (b) MAF > 1%. We grouped associated lead SNPs across phenotypes into clusters. This process first grouped SNPs within 0.25 cM of each other, and this mostly produced sensible clusters, but some hand curation was used to merge or split clusters based on visual inspection of cluster plots and levels of linkage disequilibrium between SNPs. For some clusters in Extended Data Table 1, we report coding SNPs that were found to be in high linkage disequilibrium with the lead SNPs.

Accounting for multiple IDPs

We adjusted the genome-wide significance threshold (−log 10 (P) > 7.5) by a Bonferroni factor (–log 10 (3,144) = 3.5) that accounts for the number of IDPs tested, giving a threshold of –log 10 (P) > 11. This assumes (incorrectly) that the IDPs are independent and so is likely to be conservative, but we preferred to be cautious when analysing so many IDPs.

Genetic correlation analysis

We used linkage disequilibrium score regression51 to estimate the genetic correlation between the IDPs studied in our analysis and ten disease-, personality- or brain-related traits. We gathered summary statistics for GWASs of the neuroticism personality trait (https://www.thessgac.org/data), autism spectrum (https://www.med.unc.edu/pgc/) and sleep duration (http://www.t2diabetesgenes.org/data/) and also seven disease traits: attention deficit hyperactivity disorder, schizophrenia, major depressive disorder and bipolar disorder (https://www.med.unc.edu/pgc/), Alzheimer’s disease (http://web.pasteur-lille.fr/en/recherche/u744/igap/igap_download.php), stroke (PMC4818561 from http://cerebrovascularportal.org/informational/downloads) and amyotrophic lateral sclerosis (http://databrowser.projectmine.com/). The number of samples in each of these studies and the DOIs for the corresponding studies are provided in Supplementary Table 13.

For each IDP–trait pair, we used the LDSCORE regression software (v1.0.0; https://github.com/bulik/ldsc) to compute the genetic correlation between the IDP and the trait, with linkage disequilibrium measurements taken from the 1000 Genomes Project (provided by the maintainers of the LDSCORE regression software). We filtered the SNPs to include only those with imputation INFO ≥ 0.9 and MAF ≥ 0.1%. Only INFO scores for major depressive disorder, schizophrenia and attention deficit hyperactivity disorder were provided by the source studies, and so for these three analyses we applied the INFO threshold to both the SNPs from our study and also the source study. For the remaining six studies, an INFO filter was applied to the SNPs from our own study. Owing to low levels of heritability of the functional edge IDPs, all of these were removed from this analysis. As calculation of genetic correlation between traits only really makes sense if both traits are themselves heritable, we only used those IDPs with z-scores for significantly non-zero heritability greater than 4. In total, we used 897 IDPs. To account for correlations between IDPs, we used the raw phenotype correlation matrix to simulate z-scores (and associated tail probabilities) using samples from a multivariate normal distribution with that same correlation matrix.

Analysis of enrichment of functional categories

We used the LDSCORE regression software to carry out the heritability enrichment partitioning analysis into different functional categories (https://github.com/bulik/ldsc). We used 24 functional categories: coding, UTR, promoter, intron, histone marks H3K4me1, H3K4me3, H3K9ac5 and two versions of H3K27ac, open chromatin DNase I hypersensitivity site (DHS) regions, combined chromHMM/Segway predictions, regions conserved in mammals, super-enhancers and active enhancers from the FANTOM5 panel of samples. For each IDP, the enrichment of each functional category was summarized as the proportion of h2 explained by the category divided by the proportion of common variants in the category. For each IDP and each annotation we used the two-sided enrichment P value as reported by the LDSCORE regression software. We labelled those P values as enriched or depleted depending on whether the enrichment estimate was greater or less than 1. We stratified these P values accordingly into 23 groups of IDPs.

Code availability

Most of the software and code used in this study are publicly available, including custom Matlab scripts used to prepare IDPs for GWAS (http://www.fmrib.ox.ac.uk/ukbiobank/gwaspaper/). Pre-compiled binaries for the latest version of BGENIE and SBAT are available at https://jmarchini.org/software/. This software is currently licensed free for use by researchers at academic institutions. Commercial organizations wishing to use these packages must enquire about a licence from the University of Oxford. Brain image processing was largely carried out with FSL (FMRIB’s Software Library, https://fsl.fmrib.ox.ac.uk/fsl/fslwiki) and further Matlab-based preparation of IDPs and imaging confounds utilized code from FSLNets (https://fsl.fmrib.ox.ac.uk/fsl/fslwiki/FSLNets).

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.