Data reporting

No statistical methods were used to predetermine sample size. The experiments were not randomized and the investigators were not blinded to the allocation during analysis.

Samples

We accessed publicly available high-coverage, whole-genome FASTQ files from previous studies of human genetic variation52,53,54,55 and combined these with 1,267 high-coverage genomes generated as part of this project. Full details on the samples chosen for sequencing and the informed consent processes for these samples can be found in Supplementary Information 1. We restricted our analyses to genomes generated using Illumina short-read sequencing technology.

Whole-genome sequencing

Whole-genome sequencing libraries were prepared using standard protocols (Illumina) and sequenced on Illumina Hiseq 2500/4000 or X10 machines. We obtained paired-end (2 × 100 bp or 2 × 150 bp) for each sample.

Filtering, alignment and variant calling

We aligned the Illumina short-read sequences to the GRCh37+decoy reference genome with BWA-mem56 using the default parameters. Putative PCR duplicates were flagged using SAMBLASTER57. The SAM outputs were converted to BAM format, and sorted by chromosomal coordinates using Sambamba58, and all BAM files for the same samples were merged.

The sex of the samples was inferred from the coverage of the autosomes and the sex chromosomes, and confirmed from the submitted metadata with the samples. All samples that had an average coverage less than 20-fold or for which we found a difference in the inferred and reported sex were removed from further analysis. We used verifyBamID59 to identify contamination using the chip-free mode and samples for which swaps or contamination was identified were removed from subsequent analyses. A contamination level of 3% was used as a cut-off, and this left us with 1,739 samples that were used for all downstream analyses.

We identified the single-nucleotide substitutions and small indels variants in the 1,739 samples using the reference model (gVCF-based) workflow for joint analysis in GATK60. Variants were called individually in each sample using the HaplotypeCaller in ‘-ERC GVCF’ mode to produce a record of genotype likelihoods and annotations at each site in the genome. Multi-allelic variants are reported in the GenomeAsia browser but were not included in the analysis. A gVCF file was created for every sample, and a subsequent joint genotyping analysis of all gVCFs was done to identify the variants in the cohort. We followed the GATK-recommended best practices for variant recalibration to create a final VCF file and recalibrated the variants to select 99% of the true sites from the training set for VQSR61. The VCF files were zipped using bgzip and indexed using tabix.

Identification of first-degree relative pairs

Several of the reported analyses require filtering to remove related samples. We used KING62 to identify such first-degree relative pairs. We first used vcftools63 and plink64 to convert the VCF file into the required input format for KING. The estimated kinship coefficient was restricted to 0.177–0.354 as described in the KING manual to identify the first-degree relative pairs, and the results were confirmed from the submitted metadata. The number of unrelated samples by country-of-origin is shown in Supplementary Table 1.1.

Quantifying population structure and changes in population size

We restricted our attention to 7,966,132 autosomal markers (that is, SNPs) with MAF ≥ 0.01 and call rate ≥ 98%. In some analysis, severe linkage disequilibrium pruning was applied as follows: sliding windows of size 50 (that is, the number of markers used for linkage disequilibrium testing at a time) and window increments of 5 markers; for any pair of SNPs in a window, the first marker of the pair was discarded if r2 > 0.2. After linkage disequilibrium pruning, 1,089,227 SNPs were retained for analysis. All data-filtering procedures were conducted in PLINK v.1.964.

Analyses of population structure was performed using the quality-control-positive linkage-disequilibrium-pruned set of 1,089,227 autosomal SNPs. Principal component analysis (PCA)18 was conducted across all available populations in EIGENSTRAT v.6.1.4. Results were visualized in Tableau v.9.3. We applied unsupervised hierarchical clustering of individuals using the maximum likelihood method implemented in ADMIXTURE v.1.3.020 using default input parameters. The ‘--cv’ flag was adopted to perform the cross-validation procedure and to calculate the optimal k value.

We used MSMC5 to estimate changes in population size and split times. This analysis used two different phased genome datasets (using Shapeit v.265 and Eagle266). The details for the phasing are described in Supplementary Information 4. Chromosome 6 was excluded from the analysis owing to possible phasing errors in the HLA region. We used four haplotypes (two individual genomes) for estimating changes in population size in a population and eight haplotypes (two genomes from each of a pair of populations) for the estimation of population split times. We assumed a mutation rate of μ = 1.25 × 10−8 per site per generation and an average generation time of 29 years, as in previous studies8,19.

Comparison with 1000 Genomes Project genotype calls

We filtered the variant calls to include only biallelic SNPs with <10% missing genotype calls that were within the 1000 Genomes Project strict mask (available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/accessible_genome_masks/20141020.strict_mask.whole_genome.bed). Then, for each of the 119 overlapping samples considered individually, we calculated variant discordance rates for those filtered SNPs that (1) had a genotype call in both the 1000 Genomes Project data and the GAsP data; and (2) had a ‘variant’ call (that is, a non-homozygous reference genotype call) in at least one of the datasets. These discordance rates were then stratified by the estimated MAF in the GAsP dataset.

Patterns of allele sharing

We used a parsimony-based analysis of allele sharing55 that focused on SNPs that were not present in sub-Saharan Africans or in archaic humans (further details are provided in Supplementary Information 8).

Archaic admixture

We used a method similar to the ‘enhanced’ D-statistic approach8,67 to estimate levels of Neanderthal and Denisovan ancestry in each non-African sample. The estimates were calibrated assuming 0% Denisovan ancestry in the British population, 4% Denisovan ancestry in the Papuan population and 2% Neanderthal ancestry in the British population (full details are provided in Supplementary Information 9).

Determination of high-quality variants for medically related analyses

High-quality variants were defined as variants that (1) had a read-depth ≥ 5 and genotype-quality ≥ 20; (2) were contained in the high-confidence regions as described by Genome in a Bottle (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh37/supplementaryFiles/HG001_GRCh37_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_highconf.bed) and (3) passed the gnomAD_Filter. Variant annotation was carried out using SnpEff68 (v.4.1).

IBD scores

Groups with at least two samples were considered for analysis. We restricted our analysis to genomic regions with high-confidence calls and removed related samples based on reported relationship, kinship, PCA and IBD analyses. The scores given in the figure are relative ratios compared to that of the Finnish group.

PTVs

PTVs are defined as high-quality variants that were annotated as having a strong impact on the protein (such as frameshifts, essential splice sites or premature stop codons). We restricted calls to high-confidence regions determined by Genome in a Bottle as described above and filtered for high-confidence PTVs using the LOFTEE program69. We used a similar strategy for additional filtering of variants as proposed previously47 and flagged variants with ≤7 reads covering the variant site; ≤80% of reads had the variant, were not in the bottom 1 percentile of phyloP or gerpRS65 scores and for which the affected transcripts made up less than 50% of all expression as specified by GTEx.

Enriched medically relevant variants

We compared variant allele counts for Asian and Oceania samples from the GenomeAsia cohort to allele counts present in non-Asian gnomAD samples (European (non-Finnish), European (Finnish), Latino, African or other) for variants found in a set of 124 medically relevant genes. The genes used were 115 genes used for prenatal screening70 as well as the cancer-associated genes BRCA1, BRCA2, TP53, MEN1, MLH1, MSH2, MSH6, PMS1 and PMS2A. A Fisher’s exact test was used to calculate variations that were significantly overrepresented in the GenomeAsia subsamples and corrected for multiple testing using the Bonferroni method. We further accessed variants for these genes that had not previously been reported. All variants were further filtered as being damaging as determined by having a high impact on the protein (stop codon, essential splice site or frameshift mutation) or were predicted to be damaging by the Polyphen2 program. A cumulative comparison of allele counts for all over-represented and novel variants was performed and compared to non-Asian gnomAD to calculate a P value, odds ratio and relative difference in cumulative allele frequency (GenomeAsia cumulative allele frequency minus gnomAD non-Asian allele frequency). Reported P values were corrected for multiple testing using the Bonferroni method.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.