The CHIMGEN study (chimgen.tmu.edu.cn) was approved by the local ethics committee, and written informed consent was obtained from each participant. The aim of this study was to collect genomic, neuroimaging, environmental, and behavioral data from 10,000 healthy Chinese Han participants aged 18–30 years in 30 research centers from 21 cities in mainland China. To date, we have recruited more than 7000 participants, making CHIMGEN the largest and most integrative Chinese neuroimaging genetics cohort. The detailed inclusion and exclusion criteria as well as the methods and procedures for screening; genotyping; blood sample collection; and behavioral, environmental, and neuroimaging data acquisition are described in the standardized operation procedures (SOPs) of the CHIMGEN study (Supplementary file 2). The detailed quality control procedures for personal information; blood samples; GWAS; and behavioral, environmental, and neuroimaging assessments are elaborated in the quality control manual of the CHIMGEN study (Supplementary file 3). Since the CHIMGEN study is ongoing, the following description of the CHIMGEN cohort is based on the data of only the 5819 participants who had undergone comprehensive quality assessments.

Sampling strategies

All participants were recruited through advertisements posted in colleges and communities. The number of participants in each center depended on the available resources (researchers, funds, scanners, etc.) of the center. The recruited participants were not solely from the city or province of the participating centers. These samples are not intended to represent populations (epidemiological samples), but to investigate biological mechanisms. Their epidemiological relevance needs to be investigated in subsequent studies.

Recruitment distribution

The 5819 participants were recruited from 29 centers. The recruitment distribution of these participants across centers is shown in Fig. 1a. Eighteen of the 29 centers recruited more than 100 participants. The largest center recruited 1307 participants and the smallest center recruited 54 participants.

Fig. 1: Recruitment and neuroimaging, behavioral, and environmental characteristics. a The main graph shows the numbers of participants recruited by each of the 29 centers. The inset shows the numbers of participants recruited using each type of scanner. b The mean parameter maps of the gray matter volume (GMV), regional homogeneity (ReHo), fractional anisotropy (FA), mean diffusivity (MD), mean kurtosis (MK), and cerebral blood flow (CBF). c Data distribution of the representative behavioral assessments. CVLT II-Total score, the total number of correct recalls over the five learning trials of word list A in version 2 of the California verbal learning test; N-back-CR, the correct rate of the 3-back task in the N-back task; No-Go-CR, the correct rate of the No-Go task in the Go/No-Go task; ROCFT-DR score, the score of delayed recall of the Rey-Osterrieth complex figure test; TPQ-RD, reward dependence of tridimensional personality questionnaire. d Data distribution of the representative paper-based environmental assessments. EA emotional abuse, EN emotional neglect, PA physical abuse, PN physical neglect, and SA sexual abuse.

Quality control for MR scanners

For each MR scanner, two phantoms were used to assess the imaging quality of the scanner. Specifically, an American College of Radiology MRI phantom was used to assess the functioning of the MR scanner, including geometric distortion, slice positioning and thickness accuracy, high contrast spatial resolution, intensity uniformity, ghosting artefacts and low contrast object detectability. A custom phantom [12, 13] was used to evaluate temporal stability during a functional MRI acquisition. Moreover, two healthy volunteers were scanned at all centers to assess the consistency of the MRI data acquired by different MR scanners. The effects of scanners on common MRI measures (gray matter volume (GMV), regional homogeneity (ReHo) and fractional anisotropy (FA)) are shown in Supplementary Fig. 1. These measures showed high consistency for MRI data acquired by the same type of MR scanner with the same scan parameters; however, there were visible differences for MRI data acquired by different types of MR scanners. For the latter, a meta-analysis of the results derived from MR data from different scanners may be a practical method to reduce the bias caused by MR scanner types.
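As a rough illustration of the meta-analytic idea mentioned above, the sketch below pools effect estimates obtained separately for each scanner type using fixed-effect inverse-variance weighting. The text does not specify which meta-analytic model would be used, so the fixed-effect choice, the function name `pool_fixed_effect`, and the example numbers are all assumptions made for illustration only.

```python
import math


def pool_fixed_effect(estimates, std_errors):
    """Pool per-scanner-type estimates with inverse-variance weights.

    Assumption: a fixed-effect model; a random-effects model may be
    preferable if scanner types differ systematically.
    """
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("need one standard error per estimate")
    if any(se <= 0 for se in std_errors):
        raise ValueError("standard errors must be positive")
    # Each estimate is weighted by the inverse of its variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical effect estimates from two scanner types (not CHIMGEN data).
betas = [0.10, 0.20]
ses = [0.05, 0.05]
pooled_beta, pooled_se = pool_fixed_effect(betas, ses)
```

With equal standard errors the pooled estimate is simply the average of the inputs, and the pooled standard error shrinks by a factor of the square root of the number of scanner types.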

First-step quality assessments of the neuroimaging data

All 5819 participants were included in the first-step quality assessments of the neuroimaging data: 23 participants were excluded for metal artefacts, 1 for brain atrophy and 1 for excessively large ventricles. The remaining 5794 participants were included in the subsequent quality control and statistical analyses.

Genotyping and quality control

A high-throughput genotyping chip designed for the Asian population (Illumina Asian screening array chip) with 700,000 sampling SNPs was used for genome-wide genotyping. Although all 5794 participants had blood samples, only 4885 participants have been genotyped thus far. After excluding two sex-mismatched samples, nine duplicated or related samples, 29 samples with extreme heterozygosity and one sample with divergent ancestry (Supplementary Fig. 2), 4844 participants (99.16%) passed the quality control for the genetic data. It should be noted that the subsequent quality assessments (n = 5753) included these 4844 participants together with the 909 participants who have not yet been genotyped.
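The sample flow through the first-step imaging assessment and the genetic quality control can be retraced with a few lines of arithmetic. The sketch below uses only the counts reported in the text; the helper name `apply_exclusions` is hypothetical.

```python
def apply_exclusions(start, exclusions):
    """Subtract each (reason, count) exclusion from a starting sample size."""
    remaining = start
    for reason, count in exclusions:
        if count < 0 or count > remaining:
            raise ValueError(f"invalid exclusion count for {reason!r}")
        remaining -= count
    return remaining


# First-step quality assessment of the neuroimaging data.
after_imaging = apply_exclusions(5819, [
    ("metal artefacts", 23),
    ("brain atrophy", 1),
    ("excessively large ventricles", 1),
])

# Genetic quality control among the participants genotyped so far.
genotyped = 4885
passed_genetic_qc = apply_exclusions(genotyped, [
    ("sex mismatch", 2),
    ("duplicated or related", 9),
    ("extreme heterozygosity", 29),
    ("divergent ancestry", 1),
])

# Participants with blood samples that have not yet been genotyped.
not_genotyped = after_imaging - genotyped

# Sample carried into the subsequent quality assessments.
carried_forward = passed_genetic_qc + not_genotyped
genetic_pass_rate = round(100.0 * passed_genetic_qc / genotyped, 2)
```

Running the sketch reproduces the reported figures: 5794 participants after the imaging assessment, 4844 passing genetic quality control (99.16% of 4885), 909 not yet genotyped, and 5753 entering the subsequent assessments.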

Neuroimaging data and quality control

Neuroimaging data were acquired by nine types of 3.0-Tesla MRI scanner (Supplementary Fig. 3). Structural MRI (sMRI), diffusion tensor imaging (DTI) and resting-state functional MRI (rs-fMRI) data were acquired in all centers, and diffusion kurtosis imaging (DKI) and arterial spin labeling (ASL) data were acquired in 16 centers. The numbers of participants whose MRI data were acquired by each type of MRI scanner are shown in the inset of Fig. 1a. The MRI data of 4045 (70.31%) of the 5753 participants were acquired by the MR 750 scanners. For each type of MRI scanner, the voxel-level maps of GMV calculated from sMRI data, ReHo calculated from rs-fMRI data, and FA and mean diffusivity (MD) calculated from DTI data, averaged across all qualified participants, are shown in Supplementary Fig. 4. All types of scanner showed similar and symmetrical spatial distributions of GMV, FA and MD. Eight of the nine types of scanner also showed similar and symmetrical spatial distributions of ReHo; the only exception was the GE Signa HDx, whose ReHo map was asymmetric, especially in posterior brain regions (Supplementary Fig. 4C). Therefore, the rs-fMRI data of the 97 participants acquired by the GE Signa HDx were excluded from this study.

The quality control results of the neuroimaging data (n = 5753) are shown in Supplementary Fig. 5. Of the 5753 participants, 5743 (99.83%) had qualified sMRI data, 5507 (95.72%) had qualified rs-fMRI data, and 5750 (99.95%) had qualified DTI data. Of the 3619 participants with DKI data, 3610 (99.75%) passed the quality control. All 4108 participants with ASL data passed the quality control. Based on these MRI data, thousands of neuroimaging variables can be calculated. For example, the average maps of the GMV of the 5743 participants, the ReHo of the 5507 participants, the FA and MD of the 5750 participants, the mean kurtosis (MK) calculated from the DKI data of the 3610 participants, and the cerebral blood flow (CBF) calculated from the ASL data of the 4108 participants are shown in Fig. 1b. All of these parameter maps showed a symmetrical spatial distribution in the brain.
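Because the modalities were not acquired in the same number of participants, each pass rate has its own denominator. The sketch below recomputes the rates from the counts given above; the helper name `pass_rate` and the dictionary layout are illustrative choices, not part of the CHIMGEN pipeline.

```python
def pass_rate(qualified, total):
    """Return the percentage of participants with qualified data."""
    if total <= 0 or not 0 <= qualified <= total:
        raise ValueError("counts must satisfy 0 <= qualified <= total, total > 0")
    return round(100.0 * qualified / total, 2)


# (qualified, acquired) counts per modality, as reported in the text.
counts = {
    "sMRI": (5743, 5753),
    "rs-fMRI": (5507, 5753),
    "DTI": (5750, 5753),
    "DKI": (3610, 3619),   # acquired in a subset of centers only
    "ASL": (4108, 4108),   # acquired in a subset of centers only
}

rates = {modality: pass_rate(q, n) for modality, (q, n) in counts.items()}
```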

Quality control for behavioral and paper-based environmental data

The preliminary quality control results for behavioral and paper-based environmental data of the 5753 participants are shown in Supplementary Fig. 6. Of the 5753 participants, 8 were excluded for the loss of almost all behavioral and paper-based environmental data. Among the remaining 5745 participants, the numbers with qualified data were as follows (the percentages are relative to all 5753 participants): 5723 (99.48%) for the Beck depression inventory (BDI-II), 5722 (99.46%) for the state and trait anxiety inventory (STAI), 5728 (99.57%) for the tridimensional personality questionnaire (TPQ), 5688 (98.87%) for the California verbal learning test (CVLT-II), 5619 (97.67%) for the symbol digit modalities test (SDMT), 5640 (98.04%) for the Rey-Osterrieth complex figure test (ROCFT), 5578 (96.96%) for the N-back task, 5536 (96.23%) for the Go/No-Go task, 5616 (97.62%) for the ball-tossing game, 5639 (98.02%) for the ultimatum game (UG), 5733 (99.65%) for the urbanization score, and 5728 (99.57%) for the childhood trauma questionnaire (CTQ).

The data distributions of the representative behavioral variables are shown in Fig. 1c and those of the representative paper-based environmental variables in Fig. 1d. Although some variables do not follow a normal distribution, the relatively wide range of values indicates good discriminative power across participants.

Sample characteristics

The demographic characteristics of the 5745 participants with relatively complete assessments are shown in Table 2. This study included 3718 females and 2027 males. Their ages ranged from 18 to 30 years, with a mean ± standard deviation (SD) of 23.7 ± 2.4 years. Their years of education ranged from 9 to 24 years, with a mean ± SD of 16.8 ± 1.9 years. Their heights ranged from 146 to 197 cm, with a mean ± SD of 166.4 ± 7.9 cm. Their weights ranged from 23 to 114 kg, with a mean ± SD of 58.8 ± 10.7 kg. Their body mass indices (BMI) ranged from 10.8 to 38.5, with a mean ± SD of 58.8 ± 10.7. Most of these participants were unmarried (n = 5550), with only 195 married.

Table 2 Sex-specific demographic, behavioral and environmental data (n = 5745).

Sex-specific demographic, behavioral, and paper-based environmental statistics

The sex-specific demographic, behavioral and paper-based environmental statistics of the 5745 participants with relatively complete assessments are shown in Table 2. Although most of these variables showed significant differences (P < 0.05) between male and female participants, the effect sizes were generally very small, except for the sex differences in height (|r| = 0.74, large effect), weight (|r| = 0.67, large effect), and BMI (|r| = 0.39, medium effect).
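The effect-size labels above are consistent with Cohen's conventional benchmarks for a correlation-type effect size (0.1 small, 0.3 medium, 0.5 large). The sketch below applies those benchmarks to the reported |r| values. The thresholds are assumed here because they match the labels in the text; the text does not state how r was computed or which cut-offs were used, and the function name `classify_effect_size` is hypothetical.

```python
def classify_effect_size(r):
    """Label an effect size r using Cohen's conventional benchmarks.

    Assumed cut-offs: |r| >= 0.5 large, >= 0.3 medium, >= 0.1 small.
    """
    magnitude = abs(r)
    if magnitude > 1:
        raise ValueError("|r| cannot exceed 1")
    if magnitude >= 0.5:
        return "large"
    if magnitude >= 0.3:
        return "medium"
    if magnitude >= 0.1:
        return "small"
    return "very small"


# |r| values for sex differences as reported in the text.
reported = {"height": 0.74, "weight": 0.67, "BMI": 0.39}
labels = {name: classify_effect_size(r) for name, r in reported.items()}
```

With a sample of several thousand participants, even very small effects reach P < 0.05, which is why the magnitude of r is more informative here than the significance test alone.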

Quantitative environmental variables derived from remote sensing and national survey databases

In this study, we recorded the precise residential location of each participant in each year from birth to the present. Of the 5745 participants who passed the initial quality controls for the neuroimaging, behavioral and genetic data, 5723 (99.62%) provided both current and birthplace residential locations (Fig. 2a); however, only 3979 (69.26%) provided lifetime migration information (Fig. 2b). Based on remote sensing and national survey databases, we obtained hundreds of macro-environmental measurements for each participant. Some representative macro-environmental variables at birth are shown in Fig. 2c, and their lifetime changes are shown in Fig. 2d.

Fig. 2: Environmental variables derived from remote sensing and national survey data. a Geographic location of each participant’s birthplace (n = 5723). Blue dots indicate rural area, green dots indicate towns, and red dots indicate cities. b The migration map of participants (n = 3979). Red dots indicate current places of residence, and green dots indicate birthplaces. Gray lines connect the birthplaces and current places of residence of a given participant. c Data distribution of the representative environmental variables in the birth year or the year nearest to the birth year. Certified doctors is the number of certified doctors per 10,000 persons. NDVI, normalized difference vegetation index, and GDP, gross domestic product. d Longitudinal changes of the representative environmental variables in selected years. The value in each column is shown as the mean ± SE.

Future plans of the CHIMGEN study

In the future, the CHIMGEN consortium will complete the following tasks: (a) recruit at least 3000 more participants to reach the goal of 10,000 qualified participants; (b) simultaneously obtain the genomic, epigenomic, and transcriptomic data of ~700 participants; (c) recruit 2000–3000 patients with major mental disorders; and (d) develop the CHIMGEN cohort into a longitudinal cohort by recalling the participants at a later time.

Data sharing policy

We would like to share all CHIMGEN data (including the genetic, environmental, neuroimaging and behavioral data) with other scientific communities according to the laws and regulations of the Chinese government. All the raw data of the CHIMGEN study can be accessed via collaboration with the CHIMGEN consortium. The summary statistics of the CHIMGEN data can be freely accessed via a formal application. A detailed scheme for sharing the CHIMGEN data can be found on our website (chimgen.tmu.edu.cn) and in Supplementary file 4.