Contact for reagent and resource sharing

Further information and requests for reagents may be directed to the Biobank CARTaGENE which regulates the access to the data and biological materials (http://www.cartagene.qc.ca/en/contact-us).

Study population

The study protocol was approved by the Ethical Review Board Committee of Sainte-Justine Research Center and all participants provided informed consent. CARTaGENE biobank comprises more than 40,000 participants aged between 40 and 60 years, recruited at random among three urban centers in the province of Quebec. CARTaGENE is a regional cohort within the Canadian Partnership for Tomorrow Project, including over 315,000 participants, with various measures obtained from blood parameters, biological function, disease history, lifestyle, and environmental factors19.

Sample selection

For freeze 1, we selected 708 individuals from the CARTaGENE’s biobank samples with available Tempus Blood RNA Tubes (ThermoFisher Scientific) and Framingham risk scores, ensuring an equal representation of ages and gender. Two-hundred-and-ninety-two additional samples were subsequently selected from CARTaGENE (freeze 2) based on their RNA and complete arterial stiffness (AIx) measures availability. These samples were selected for having high AIx values as well as average AIx values to complement the first freeze of samples with the intention of achieving a broad range of arterial stiffness values across the complete study cohort. All samples were collected in the same year, with a standardized protocol in all sampling clinics19. All blood samples were collected in the morning, on fasting participants.

Table 1 Summary of k-mean clustering Full size table

Genotyping and QC

In total, 928 samples with RNA-Seq profiles that passed quality control (QC) thresholds were genotyped on the Illumina Omni2.5 array to obtain high-density SNP genotyping data. A total of 1,213,103 SNP were retained after filtering and QC (Hardy–Weinberg p value > 0.001, MAF > 5% and percent of missing data < 1%).

RNA sequencing

Whole blood samples were collected from participants in 2010. Total RNA was isolated using the Tempus Spin RNA isolation kit (ThermoFisher Scientific) and a globin mRNA-depletion was performed using the GLOBINclear-Human kit (ThermoFisher Scientific). The quality and integrity of the RNA samples were verified using an Agilent Bioanalyzer 2100 and all samples had an RNA integrity number (RIN) > 7.5. A RIN above 7.5 is indicative of high quality RNA in the sample and for which RNA degradation is minimal, indicating optimal transport and preservation conditions. Our RIN threshold is more stringent than other large-scale consortium studying gene expression in tissues51,56. TruSeq RNA Sample Prep kit v2 (Illumina) was used to construct paired-end RNA-Seq libraries with 500 ng of globin-depleted total RNA. Recommended Illumina protocols were followed for quantification and quality control of RNASeq libraries prior to sequencing. Paired-end RNA sequencing was performed on a HiSeq 2000 platform at the Genome Quebec Innovation Center (Montreal, Canada). Sequencing was performed for freeze 1 (708 samples) using three samples per lane, and for freeze 2 (292 samples) using six samples per lane yielding about 60 million reads per samples. All RNA-seq experimental steps following blood draw were conducted in the same central laboratory, and samples were distributed randomly over sequencing lanes (Supplementary Fig. 3a, b), thereby reducing the introduction of experimental bias at these steps.

Reads were trimmed for adapters and bad quality bases first using Trim Galore and were then assembled to a reference genome (hg19, European Hapmap (CEU) Major Allele release) using STAR (v2.3.1z15)57 using the two-pass protocol, as recommended by the Broad Institute. The two-pass protocol consists in two consecutive mappings steps having the same set of parameters with only the reference that is optimized in the second mapping procedure. The first mapping is done using the reference gene definition coming from ENSEMBL (release 75). Then, using the splicing junction database files formed by the first pass mapping step for all the samples combined together and the same gene definition file, a second reference is indexed and optimized and is used for the second mapping step. The number of mismatches allowed across pair is five and a soft-clipping step that optimizes alignment scores is also done automatically by STAR. The PCR duplicates were conserved as it was shown that quantification of highly expressed genes were disproportionately affected by PCR duplicates removal58. Only properly paired reads were kept (using samtools59) for the analysis, according to STAR parameters. After these steps, HTseq (v.0.6.1p1)60 was launched separately on each alignment file using the same gene reference file that was used for the alignments.

All analyses downstream were conducted using R 3.1.2 and R 3.2.2 and Bioconductor R packages.

Fine-scale population genetic structure within Quebec

To unveil finer scale patterns of population structure, i.e., differences between individuals with European ancestry versus individuals having a French Canadian ancestry, we also used ChromoPainter (v0.04)61, a haplotype-based method powerful enough to detect fine-scale genetic structure. Original genotyping data was used apart from singletons, yielding to 1,908,336 SNPs. Singletons were removed as they are non-informative for phasing and contribute to computation burden for the step of haplotypes sharing inference performed with ChromoPainter. Genotypic data was phased with SHAPEIT (v2.r644)62 using the HapMap genetic maps. Coancestry matrices were obtained from ChromoPainter with parameters estimation step done with ten iterations on four chromosomes only. ChromoPainter method performs a reconstruction of every individual genome using chunks of DNA donated by the other individuals and report matrices of the number and length of those chunks. We used the chunk count matrix to (1) run FineSTRUCTURE algorithm to build a tree (as recommended for large data set, we performed 10,000,000 burn-in and runtime MCMC iterations) (Supplementary Fig. 1D) and to (2) perform a PCA (Fig. 1a, Supplementary Fig. 1c). Regional ancestry for each FC was determined based on the three clusters obtained from the fineSTRUCTURE tree, (Supplementary Fig. 1d, Fig. 1b).

In agreement with Quebec settlement history, previous studies of the Quebec population28,63, and the fineSTRUCTURE tree, a PCA of FC individuals reveals groupings of sub-populations of individuals that follow a North–South structure (Fig. 1b, c). The founding event from French settlers followed by the subsequent colonization of remote regions has led to population differentiation among regions in Quebec28,63. By further restricting the group of individuals to be analyzed to only FC (n = 726) and considering their region of residence (either Quebec City, Montreal, and Saguenay) a PCA on the chunk count matrix reveals three groups corresponding to region of residence, with the Montreal and Quebec groups overlapping to a greater extent, in line with their greater geographic proximity (Fig. 1b, c). Those three groups were also recovered by the fineSTRUCTURE tree (Supplementary Fig. 1D). Considering all SNPs and the whole haplotypic structure is the key in seeing differences for those two metropolitan regions that have low differentiation. We further identified several participants with a regional ancestry discordant with their region of residence: an indication of recent regional migration of these participants across Quebec regions (Supplementary Table 2).

Imputation

To increase the power for the association study with gene expression levels, variant imputation was conducted on 968 individuals for which the genotyping was available from the Illumina Omni2.5 array. We pre-phased the genotypes with SHAPEIT (v2.r64410)62 using the default parameters, on both the autosomes and the chromosome X. We filtered variants for MAF > 1% and Hardy–Weinberg p value > 0.0001 and passed the haplotypes to IMPUTE2 (v2.2.2)64 to perform the imputation using the 1000 Genomes Phase I integrated haplotypes (Dec 2013). We used the parameters Ne = 11418 and call thresh = 0.9. We removed variants with a call rate <90%, MAF > 1%, and Hardy–Weinberg p value > 0.0001. A total of 9,157,622 variants passed the filters. Of these, 8,877,297 variants were found on the autosomes and included 779,579 insertion-deletion polymorphisms (indels) (8.78%) and 8,097,718 SNPs (91.22%). 280,325 variants were found on the chromosome X, which included 28,504 indels (10,16%) and 251,821 SNPs (89.84%).

To determine the ancestry of each individual from genotyping data, we carried out a principal component analysis (PCA) with SNPs pruned for LD (pairwise r2 > 0.2 and 50 SNPs window shifting every five SNPs) (Supplementary Fig. 1A), yielding 146,689 SNPs. The continental ancestry (African/European/Asian/Canadian/American/Middle-Eastern) of each individual was determined based on the PCA plot (Supplementary Fig. 1A) and verified as to whether it corresponds to self-reported ancestry based on the country of origin of four grandparents. If the country of origin of three out of four grandparents and the PCA continental grouping were concordant, the individual was assigned to a continental origin.

RNA-sequencing filtering

Genes with counts-per-million below 0.5 in more than half of the cohort (505 individuals) were removed from the analysis for a total of 15,632 genes retained for all downstream analyses. Individuals that showed obvious outlier after visual inspection of principal component plots were removed (three individuals).

Variables contributing to transcriptomic variation

The deep phenotyping of the CARTaGENE cohort allow for a thorough exploration of the biological and environmental factors that may influence genome-wide gene expression patterns. As most statistical procedures assume a normal distribution to the underlying data, we transformed the normalized counts from freeze 1 to a Gaussian distribution using a log2cpm transformation using edgeR. We summarize the gene expression levels by performing a PCA on the normalized expression matrix (ePCA). To identify variables that contribute to genome-wide gene expression variation, we performed a stepwise regression (stepwise search from both directions) on ePC1 and ePC2. Results of the stepwise regression are given in Supplementary Table S1, as well as the results from the replication analyses using freeze 2. We included the following low level endophenotypes in the stepwise procedure: set, region of residence, cell counts (lymphocytes, neutrophils, monocytes), arterial stiffness, age, and sex.

Using the freeze 1 data set of 708 individuals, we quantified the proportion of the variance in expression attributable to cell counts, age, sex, region, and arterial stiffness (Supplemental material) by using principal variance component analysis (PVCA), and found that the region of residence explains ∼16% of the variance in gene expression, while the effects of age, sex, and cell counts were much lower (Fig. 1e). These analyses were repeated on an additional 289 participants (freeze 2) and both of these effects were found to be replicated on expression profiles (Supplementary Table 1). Similarly, when combining transcriptional profiles for all individuals, we found that the region of residence explains ∼15% of the variance in gene expression both in FCs and in Europeans (Supplementary Fig. 2).

Sampling site effect within region

The RNA extractions and library preparation were performed for all individuals in the same laboratory to reduce technical bias. However, participants were sampled across four different sampling sites inevitably situated within geographical regions where participants lived. Our experimental design was built in such a way that sequencing run was not correlated with region of residence (Supplementary Fig. 3a). To evaluate whether the sampling site has any effect on the RNA-Seq quantification data, we performed extensive analyses of the two sampling sites situated within Quebec City: St-Sacrement (STS, n = 136) and Enfant-Jesus (EF, n = 129). QUE individuals expression profiles from the combined data set show that individuals from STS and EF form a single cluster on a ePCA plot (Supplementary Fig. 3b). Furthermore, a variance component analysis (PVCA) was performed on the QUE individuals only and including sampling site as an explanatory variable shows that the sampling site explains <5% of the variance within QUE region, while freeze explains 15%, age 5%, and gender 2.5% (Supplementary Fig. 3c). In comparison, in FCs or Europeans, region of residence accounts for 15% of variance in gene expression. In addition, we performed a differential expression analysis between sites within a region (see details below) using permutations, and found that there are no genes differentially expressed between clinics within a region, supporting the absence of sampling differences between clinics affecting gene expression to a detectable and significant level.

Correction for technical and biological unwanted variation

High quality RNA-sequencing of all 997 individuals reveals a similar geographic structure in transcriptional profiling than population structure from genotyping (Fig. 1c). Investigation of the variance associated to gene expression reveals that region of residency (variable of interest) explains about 16% (Supplementary Fig. 2a) of the variance regarding the population of origin (Supplementary Fig. 2b, c), but unwanted variables explain a certain proportion of the variance (Fig. 1e, Supplementary Fig. 2b, c).

RNA-Seq data generation, and expression data in general, are prone to technical biases which in some cases can mimic, or be confounded, with biological variation. The appropriate normalization pipeline in an RNA-Seq experiment will depend on the experimental design and the hypothesis being tested. Local sequence context can bias the uniformity of read counts along the genome, and sophisticated normalization pipeline may be necessary when comparing expression levels across genes65. Most experimental designs of RNA-Seq studies, like the one presented here, compares different groups of individuals to each other, therefore the normalization pipeline should rather focus on removing unwanted variation across individuals.

We removed the effects of hidden covariates potentially affecting expression levels using surrogate variable analysis (SVA)29. We used the SVA correction, retaining five surrogate variable, for the differential expression analyses, correcting for technical (i.e., runs, sets, number of reads) and biological (i.e., date of appointment, time of the year, sex, smoking status, cell counts) effects on gene expression (Supplementary Fig. 4). We performed the same stepwise regression approach as previously, but on the SVA corrected expression level matrices and show that we retained the variation associated with region, but removed any effects of cell counts and arterial stiffness that was present in the uncorrected expression levels (Supplementary Fig. 4, Supplementary Table 1). The corrections do not fully compensate for the effect of the freeze (technical), we therefore include this covariate in all subsequent analyses. Estimating the variance associated with hidden batch has been shown to remove variation associated with biological and technical factors and also increase the power to identify eQTLs58,66.

Differential expression analysis

Because of the large proportion of the variance in gene expression explained by region of residence, we identified genes that are differentially expressed between pairwise comparisons between the FC-locals from the three regions (Montreal, Quebec, and Saguenay). Using edgeR67, we performed a differential gene expression analysis using the 15,632 genes that passed the QC filters established above. We performed the differential expression modeling using the following statistical model:

$${\mathrm{\mu }}_{{{{i}}\mathrm{g}}} = {\mathrm{\beta }}_{\mathrm{g}}{\mathrm{Rr}}_i + {\mathrm{\beta }}_{\mathrm{g}}{\mathrm{Ro}}_{{i}} + {{B}}_{{i}} + {{S}}_{\mathrm{g}} + \epsilon _{{{{i}\mathrm{g}}}}$$

where Rr is the region of residence, Ro the region of origin, B is the surrogate variable, representing the batch effect estimated by SVA, and S represent the freeze effect that is included in the final (see below for further details).

The significance level of the test was estimated as a gene p value below the Bonferroni-corrected threshold of 3.20 × 10−6 (0.05/15,632). The SVA corrected expression levels retained the variation associated with region, but removed any effects of cell counts that was present in the uncorrected expression levels (Supplementary Table 1).

We performed a power analysis of our ability to detect differentially expressed genes with smaller samples sizes. Several of our comparisons of regional- or continental-migrants with FC-locals involve smaller number of individuals (Supplementary Table 2). We therefore assessed our ability to detect differentially expressed genes by performing differential expression analyses between groups for which we found large number of differentially expressed genes, but using a smaller subset of random individuals (without replacement) of each of these groups. We randomly selected 15 Mtl-locals and 15 Sag-locals, and performed the DGE analysis using the same model as above. We also performed the analysis using 50 Mtl-locals and 50 Sag-locals. In each case, we could identify differentially expressed genes which largely overlap with the differentially expressed genes detected in comparisons using all individuals (Supplementary Fig. 6). We observe that with an increasing number of individuals, our power to detect differentially expressed genes increases and that the identity of the differentially expressed genes detected in each of these comparisons largely overlap (Supplementary Fig. 6).

We further support the effect of region of residency on gene expression by performing differential gene expression analysis across regions using permutations that are even more robust to batch effects. The permutation-DGE analyses confirm that differences are the greatest between MTL and SAG. Similar permutation analyses also show that individuals living in the same region but sampled in different clinics have similar gene expression profiles (Supplementary Fig. 3B), supporting the absence, if not minor, of effects of sampling procedures on the gene expression across sampling clinics.

Regional environmental effects on gene expression

We take advantage of the presence of individuals from different regional and continental origins in our cohort to disentangle further the effects of the genetic background and environmental influences on genome-wide gene expression. We first selected individuals of either FC and European continental ethnicity (Fig. 1a, Supplementary Fig. 1). A total of 798 individuals including 136 Europeans and 662 FC were selected for downstream analyses. We stratified the individuals according to their continental origin (FC vs Europeans), and further stratified the FCs into their assigned genetic ancestry (MTL, QUE, SAG) obtained from the fineSTRUCTURE analysis (Fig. 1b, Supplementary Fig. 1D). We then determined their region of residence (MTL, QUE, SAG) for a total of 12 ancestry-residence groups: we identified individuals for which their origin (Continental or regional) is discordant with the region they reside, which we refer to as continental- and regional-migrants respectively (Supplementary Table 2). We also identified FC individuals for which their regional origin is concordant to the region they reside, which we refer to as FC-locals (Mtl-FC-locals, Que-FC-locals and Sag-FC-locals). We performed the differential gene expression analysis pipeline as described above for different pairs of continental-migrants, regional-migrants, and FC-locals to disentangle the effects of the genetic background and the regional environment on genome-wide expression (Fig. 2). We selected 6649 genes that show differential expression (p value < 3.20 × 10−6) in the comparison between Mtl-FC-locals and Sag-FC-locals. Using the 12 origin-living groups and the 6649 genes, we performed an unsupervised clustering and visualized the groupings using a heatmap (Supplementary Fig. 5).

Gene enrichment and reactome analyses

Gene enrichment analyses were performed using the topGO package in R, with a Fisher exact test. Differentially expressed genes between MTL-locals and SAG-locals were compared against the 15,632 genes expressed in the CARTaGENE cohort that were retained after QC filters (background). Reactome enrichment analyses were conducted with R the package reactomePA, and here again, the background set of genes was defined as the 15,632 genes expressed in blood that pass our filters (Supplementary Fig. 7 and Supplementary Table 3).

Fine-scale environmental data

We obtained air quality measures in the year of sampling (2010) from either land-based stations (SO 2 , ozone) or national LUR models estimates (PM 2.5 and NO 2 ) incorporating information from land use data and satellite remote sensing55,68,69,70. Built environment variables (street network, population density, food deserts, greenness, walkability) and social and material deprivation indicators were accessed through the Quebec government data portal (https://www.inspq.qc.ca/environnement-bati). All environmental data sources are described in Supplementary Table 4.

Environmental data was available at the three-digit postal code district level (i.e., Forward Sortation Area, FSA), or was reformatted to this geographic scale. Postal code districts in Canada are small geographic areas which assist in delivering mail. Postal codes are a series of six digits that identify a small geographic area in a municipality, usually grouping just a few houses together or a small neighborhood. Three-level digits are larger areas that include several houses, a small neighborhood, or a small village. The population of FSAs in Canada range from a few hundreds to tens of thousands of individuals. Three-digit postal code districts can be of different areas, and are smaller in densely populated areas, and larger in areas of low population density. Maps in Fig. 1c, d and Supplementary Fig. 9 depict three-digit postal code districts as thin gray lines areas, and each district is colored with the mean value of interest in each map. Each individual in the CARTaGENE cohort has a three-digit postal code district associated to it, referring to the location of its primary residence. We assigned fine-scale environmental measures to each individual based on its three-digit postal code.

Coinertia analyses

Coinertia analysis (CoIA)31,71 is a multivariate statistical part of the large family of ordination methods, such as PCA, redundancy analysis (RDA), or canonical correlation analysis (CCA). CoIA is a general approach and existing methods such as the ones mentioned above appear as special cases of it31. These methods have been widely used in ecological research, including CoIA which has been more recently developed. Collectively, these methods allow for detecting an underlying data structure between two data tables. CoIA uses a combination of PCA and multivariate linear regressions to detect linear combinations of variables from one data table that explain the variance in the second data table. CoIA is more flexible than RDA or CCA, and overcomes their limitations by allowing for more variables than the number of samples to be tested31,71, which is generally the case in genome-wide scale analyses (i.e., more genes than individuals). This makes CoIA a method of choice to integrate data of diverse types, and of high-throughput like most omics data.

We first used CoIA analysis to reveal the common structure between differentially expressed genes (Fig. 3, Supplementary Fig. 11) and the fine-scale environmental data. We produced two separate principal component analyses (PCAs) based on continuous encoded matrices of both environmental and gene expression levels (normalized for library size and sequencing freeze). The data were centered and reduced to one unit of variance prior performing the PCA analysis. We conserved components for each PCA to explain 80% of the variance in the data. We imputed missing data only for the fine-scale environmental data set (there were no missing data in the gene expression matrix) using the function imputePCA from the R package missMDA. The coinertia analysis performs a double inertia analysis of each data set and then project the variables of the original environmental and gene expression data sets on the new co-inertia axes. Relationships between the two matrices were assessed by comparing the CoIA estimated from the real data set with the CoIA distribution estimated after bootstrapping. Two sets of 500 of CoIAs were computed independently between gene expression and fine-scale environmental data. Supplementary Fig. 11 depicts the resampling scheme. For each Group 1 or Group 2 (n = 497 for each) a total of 10,000× resampling of 200 individuals (without replacement) were performed. We performed a CoIA for each resampling step. We report the median value of the distribution of each environment–gene expression pair cross-tabulated values for each group. Gene enrichment were performed using gProfiler72, and using the 15,632 expressed genes that passed our filters in whole blood as the background gene set (Supplementary Table 3). We evaluated the significance of the correlations between the two matrices with a multivariate generalization of the Pearson correlation coefficient (RV coefficient) using a permutation test (RV-test) with 10,000 steps from the R package ade4.

To identify clinically relevant endophenotypes that are associated with fine-scale environmental data, we performed a CoIA between 57 clinically relevant endophenotypes (Supplementary Fig. 10) and fine-scale environmental data. The 57 clinically relevant endophenotypes were selected to encompass physical measures (BMI, height, age, sex), most systems relevant to the human health (cardiovascular system, pulmonary functions, hepatic system, renal system, disease history, vision, immune system) and lifestyle measures (smoking status, alcohol consumption, nutrition, physical activity). All biochemical endophenotypes were measured in a single central laboratory. We resampled 10,000 times 493 individuals from the cohort, and performed CoIA at each step between endophenotypes and fine-scale environmental variables. We report the median value of the distribution of each environment–endophenotype pair cross-tabulated values (Supplementary Fig. 10).

To reveal possible associations between expression levels and endophenotypes, we then performed CoIAs with a similar resampling scheme between 12 selected endophenotypes that were the most strongly associated with air pollutants from Supplementary Fig. 10 (Stroke, Arterial stiffness measures, spirometry measures, Asthma, monocyte counts, LDL, AST, ALT, GGT) and differentially expressed genes (results shown in Supplementary Fig. 12).

Exposure windows of weekly SO 2

To increase our resolution in air pollution exposures, we used daily SO 2 ambient levels measured in each three-digit postal code. We calculated the average exposure during the 14 days preceding the blood draw for each participant. This way, we reduce the effect of random fluctuations due to technical artifacts or short-term meteorological anomalies that may affect measurements. Also, changes in gene expression and biomarkers in blood following a pollution exposure has been documented as a relatively fast phenomenon, occurring after just a few days of exposure36. We then categorized the participants using a k-means algorithm73 into high exposure or low exposure categories (see details on the number of participants and cluster centers in the eQTL section below).

DGE between high- and low-SO 2 exposure

To find differentially expressed genes between high and low exposure individuals, we used the same approach as described above for identifying differentially expressed genes between regions, with the following modifications: given the unbalanced number of individuals in each category (108 high exposure vs 800 low exposure) of exposure, we resampled 100 times 108 individuals with replacement from the low- and high-exposure category and performed the DGE pipeline. We performed the SVA while retaining variation associated with SO 2 exposure. We combined the results of DGE analysis in a list of 468 differentially expressed genes, and from these candidates, 170 genes were also identified as differentially expressed between regions (Fig. 2a). Those strong 170 candidates were used for enrichment, CoIA, and multivariate models. We also identified genes (transcription factors) that regulate our 170 differentially expressed genes (RDEGs) using cytoscape, and we used them in addition to the differentially expressed genes in the CoIA analyses.

Multivariate models for SO 2 exposure

In an effort to characterize the effects of confounding variables on pollution exposure, we applied multivariate models on gene expression levels. First, similar as in the differential gene expression analysis, we performed a SVA to remove unwanted variation of technical or unknown biological variables while retaining the variation around SO 2 exposure. We then built multivariate models using the SO 2 , O 3 , and PM2.5 14-day exposures, as well as the remaining 9 non-pollution environmental exposures (Supplementary Fig. 9), as well as smoking status. Smoking status may indeed cause similar changes in endophenotypes as pollution exposure. We then selected the endophenotypes revealed by the CoIA as being the most associated with region and pollution exposure (Lung disease, Asthma, Stroke, monocyte counts, liver enzymes (AST, ALT, GGT), Arterial stiffness, spirometry tests, and lymphocyte counts), and tested whether any of these would explain variation in the 170 candidate genes. Furthermore, after having identified the health endophenotypes that are associated with gene expression in MTL and in the whole data set (FEV1, liver enzymes, lung diseases, and arterial stiffness, see Supplementary Figs. 10 and 12), we regressed out their effect from the expression of the 170 candidate genes, and run the multivariate models to test for the effects of environmental variables itself (results collated in Supplementary Table 7).

env-eQTL analysis

Environmental factors not only directly affect phenotypic variation, but can also modulate associations between segregating genetic variants and phenotypes1,44,45. To discover gene-by-environment interactions, we identified eQTLs for which the effect size is modulated by environmental exposure to one of four ambient air pollutants (env-eQTLs): PM2.5, NO 2 , O 3 , and SO 2 . We categorized the participants using a k-means algorithm73 into two categories, high or low exposure, irrespective of the pollutant type (Table 1). A k-means algorithm attempts to partition the individuals into k groups (here, k = 2), such that the sum of squared Euclidean distances from points to the assigned centroid (cluster mean) is minimized.

We adopted a strategy (Supplementary Fig. 15) to randomly divide the CARTaGENE cohort into discovery and replication cohorts. During this process, for the discovery of eSNP–eGene pairs, we scan the genome at ± 500 kb of the TSS of gene to find all putative cis-eSNPs. We used the following model where gene expression (Y) is regressed on a given SNP (S), a given environmental air pollutant (E) and the interaction between S and E:

$${\mathrm{Model}}:{\mathrm{Y}}_{{{ijk}}}\sim {\mathrm{S}}_{{{ijk}}} + {\mathrm{E}}_{{{ijk}}}{\mathrm{ + S}}_{{{ijk}}}{\mathrm{E}}_{{{ijk}}}$$

The gene expression level, was normalized using an inverse normal transformation, and corrected for relatedness and other batch effects using the SVA R package (see above for further information). Here, we focussed solely on the p value associated with the Student’s t-statistic for the interaction term S ijk E ijk . We applied a Bonferroni correction to the interaction p values for SNP-wise multiple testing within gene and retained the most significant putative eSNP–eGene pair from each gene. We then assessed this set of “best” eSNP–eGene p values for significance across all 15,632 genes at the false discovery rate (FDR) threshold of 0.05 by transforming the set into q values46). This represented the set of significant discovery eSNP–eGene pairs to be tested in the replication set. We then reported the environmental eSNP–eGene pairs that were significant (replicated) in the replication cohort (q value < 0.05, adjusted for the ten pairs being tested) and had the same direction of effect in both cohorts (n = 4 out of the 10).

To provide support for the replicated environmental eSNP–eGene pairs that we reported as significant, we estimated “honest” empirical p values for the whole sample (discovery + replication) using permutation: for each eSNP–eGene pair we performed the same eQTL modeling (Y ijk ~ S ijk + E ijk + S ijk E ijk ) and permuted the expression values (Y) before obtaining the test statistic (Student’s t) for the interaction term. By repeating this procedure 1000 times for each eSNP–eGene pair, we built null distributions to assess the original observed (not permuted) t-statistics. The empirical permutation p value for each eSNP–eGene pair was taken as the proportion of permutation t-statistics larger than the observed t-statistic (Supplementary Fig. 16, Supplementary Table 9).

Impact of lower frequency variants

We performed a CoIA analysis between all eSNPs of significant eGenes and endophenotypic traits. To do so, we resampled 1000 times, without replacement, 420 individuals from the cohort, and performed a CoIA at each step between endophenotypes showing variation across environments and the eSNPs. The median value for each endophenotype-eSNP correlation from the 1000 CoIA was calculated. The CoIA results are the correlations between eSNPs and the endophenotypic traits values. We then tested whether the strength of these correlations between eSNPs and endophenotypic traits were related to the MAF of the eSNP by examining the odds ratio of observing less common variants (MAF between 0.05 and 0.1, compared to common variants of MAF > 0.1) for stronger endophenotypic associations (Supplementary Fig. 18). The MAF was estimated from the complete cohort data.

Data availability

Genotyping, expression, health phenotypes, and exposure data used in this study are available from CARTaGENE (www.cartagene.qc.ca) or the CPTP portal (http://portal.partnershipfortomorrow.ca) upon request. The built environment data set is publicly available from the Quebec government data portal. The air pollution data set is available upon request to Air Health Effects division, Government of Canada. All environmental data sources are detailed in Supplementary Table 4.