No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessments.

Sequencing data of the 3,000 Rice Genome project

The selection and sequencing of rice accessions have previously been described12. The SNPs/indels and SVs in 3,010 accessions were identified by mapping against the Nipponbare RefSeq, and the pan-genome sequence was created by integrating the Nipponbare RefSeq and non-redundant novel sequences derived from 3,010 rice assemblies. SV comparison and gene PAV analyses focused on 453 rice accessions with sequencing depth >20× and mapping depth >15× (Extended Data Figs. 4a, 5b).

Detection of SNPs and indels

Reads were aligned to the Nipponbare RefSeq using BWA-MEM (release 0.7.10)47. The mapped reads were then sorted and duplicates were removed by Picard tools (release 1.119) (http://broadinstitute.github.io/picard/). The reads around indels were realigned by GATK RealignerTargetCreator and IndelRealigner package (release 3.2-2)48. The variants were called for each accession by the GATK UnifiedGenotyper (release 3.2-2)48 with ‘EMIT-ALL-SITES’ option. A joint genotyping step for comprehensive SNP union and filtering step was performed on the 3,010 emit-all-sites VCF files. A variant position is reported if at least one sample supports it with QUAL no less than 30. A total of 29,399,875 SNPs (27,024,796 are bi-allelic) and 2,467,043 indels (small insertions and deletions <40 bp) were identified from the analyses of the genomes of 3,010 accessions. Three subsets of the 3K-RG Nipponbare SNPs were defined using the following filtering criteria: (1) a base SNP set of ~17 million SNPs created from the ~27 million high-quality bi-allelic SNPs by removing SNPs in which heterozygosity exceeds Hardy–Weinberg expectation for a partially inbred species, with inbreeding coefficient estimated as 1−H obs /H exp , in which H obs and H exp are the observed and expected heterozygosity, respectively (detailed in Supplementary Notes); (2) a filtered SNP set of ~4.8 million SNPs created from the ~17-million-SNP base SNP set by removing SNPs with >20% missing calls and MAF < 1%; and (3) a core SNP set of SNPs derived from the filtered SNP set using a two-step linkage disequilibrium pruning procedure with PLINK49,50, in which SNPs were removed by linkage disequilibrium pruning with a window size of 10 kb, window step of one SNP and r2 threshold of 0.8, followed by another round of linkage disequilibrium pruning with a window size of 50 SNPs, window step of one SNP and r2 threshold of 0.8.

Determining the effects of SNPs

The effects of all bi-allelic SNPs (low, medium and high effects) on the genome were determined based on the pre-built release 7.0 annotation from the Rice Genome Annotation Project (http://rice.plantbiology.msu.edu/) using SnpEff51 release 4.1l, with parameters -v -noLog -canon rice7. Using sequence ontology terms, a low-effect SNP was classified as ‘synonymous_variant’, ‘splice_region_variant’, ‘initiator_codon_variant’, ‘5_prime_UTR_premature_start_codon_gain_variant’ or ‘stop_retained_variant’. A moderate-effect SNP was identified as a ‘missense_variant’ and a high-effect SNP as a ‘start_lost’, ‘stop_gained’, ‘stop_lost’, ‘splice_donor_variant’ or ‘splice_acceptor_variant’. For indel effects, only indels with lengths that were not multiples of three were counted and SNPs overlapped with protein-coding regions (CDSs of RGAP 715 genes) were considered as the most disruptive effects on genes. Results of the SNP and indel effect analysis are given in Supplementary Data 2 Tables 3, 4. We computed the SNP numbers (proportions) of rare SNPs and homozygous singletons for a ‘typical genome’ of a subpopulation as the median SNP number (proportion) of the SNPs in a given category among those genomes for that subpopulation (Supplementary Data 2 Table 5).

Population structure and SNP diversity

Multi-dimensional scaling analysis was performed using the ‘cmdscale’ function in R, using the IBS distance matrix of the 3K-RG genomes computed with PLINK49,50 on the filtered SNP set. The same distance matrix was used to construct a phylogenetic tree by the unweighted neighbour-joining method, implemented in the R package phangorn52. The population structure of the 3K-RG dataset was analysed using ADMIXTURE software46 on the core SNP set (version 0.4, http://snp-seek.irri.org/download.zul). First, ADMIXTURE was run on 30 random 100,000-SNP subsets of the core SNP set with k (the number of groups) ranging from 5 to 18, and k = 9 was chosen because it was the minimal value of k to separate all previously known groups (cA, cB, XI, GJ-trp, GJ-tmp and part of GJ-sbtrp). With k = 9, ADMIXTURE was then run again on the whole core SNP set nine times with varying random seeds; the Q-matrices were aligned using CLUMPP software53 and clustered on the basis of similarity. Then, the matrices belonging to the largest cluster were averaged to produce the final matrix of admixture proportions. Finally, the group membership for each sample was defined by applying the threshold of ≥ 0.65 to this matrix. Samples with admixture components <0.65 were classified as follows. If the sum of components for subpopulations within the major groups XI and GJ was ≥ 0.65, the samples were classified as XI-adm or GJ-adm, respectively, and the remaining samples were deemed ‘fully’ admixed (admix). Branches of the phylogenetic tree were coloured according to the k = 9 admixture classification (Fig. 1).

We computed linkage disequilibrium decay in each subpopulation as follows. The value of r2 was computed for each pair of SNPs of frequency ≥ 10% in the respective subpopulations that are separated by at most 300 kb using PLINK. The distances were binned into 1-kb bins (separately for each chromosome) and the median value of r2 in each bin was taken. The medians for each chromosome were then averaged to produce a final r2 estimate for the bin. We computed nucleotide diversity (π) for non-overlapping 10-kb and 100-kb windows along the Nipponbare RefSeq by adopting an approach similar to VariScan54 for genome-wide DNA polymorphism analyses and implemented as a custom R script.

Detection of genomic SVs and population differentiation

Genomic-SV detection for each of the 3,010 rice accessions was performed using a customized version of novoBreak55 (https://sourceforge.net/projects/novobreak/?source=navbar) against the Nipponbare RefSeq. SVs inferred by no less than 3 reads were further filtered with the following conditions: (1) more than four supporting split reads or (2) no fewer than three discordant read pairs. We detected deletions, inversions and duplications with sizes between 100 bp and 1 Mb, and translocations. Here, translocations were SVs with ‘inter-chromosomal breakpoints’. All SVs that passed the filter criteria in the 3K-RG accessions were pooled together. Two adjacent SVs were identified as the same SV if their start and end positions varied no more than 1 kb, and the overlapping region was more than 50% of the total size. The presence–absence matrix of SVs in each accession was built based on this pooled SV dataset. To obtain reliable SV comparison analysis results, we focused only on the 453 high-depth accessions (Extended Data Fig. 4a). Major-group-unbalanced SVs were determined by two-sided Fisher’s exact test followed by Benjamini–Hochberg adjustment (false discovery rate (FDR) < 0.05), similar to the detection of major-group-unbalanced genes.

De novo assembly

A variation of SOAPdenovo256 (version r240) with customized k-mers was used to assemble the rice genomes. A k-mer value was initially set for each accession according to a linear model ‘K=2*int (0.38*(sequencing depth) +10)+1’, which was trained from 50 randomly selected rice accessions. The best k-mer value was decided by checking the N50 of the SOAPdenovo results. The command line for SOAPdenovo was ‘SOAPdenovo-63mer (or SOAPdenovo-127mer) all -s configure_file (average insertion length set as 460 in the configure file) -K k-mer -R -F’ with iteration over different k-mers until N50 of the assembly with that k-mer is higher than those with ‘k-mer +2’ and ‘k-mer −2’. On average, we needed to run SOAPdenovo ~3.94 times for each rice accession. The quality of the genome assembly was evaluated for these contigs using QUAST version 2.357.

Sequencing and de novo assembly of IR 8 and N 22 reference genomes

High molecular weight DNA was extracted from young leaves adopting the protocol58 with minor modifications. The PacBio library was prepared following the 20-kb protocol (see ‘User-Bulletin-Guidelines-for-Preparing-20-kb-SMRTbell-Templates document.pdf’, available from https://www.pacb.com/support/documentation/?fwp_documentation_search="PN%20100-286-700-04") and was sequenced on an RSII sequencer with movie collection time of 6 h. The raw data of N 22 and IR 8 were assembled with FALCON59 and Canu60, respectively. Contigs were polished twice with PacBio raw reads using Quiver (https://github.com/PacificBiosciences/GenomicConsensus) and the IR 8 assembly was further polished with 66× WGS 2× 150-bp Illumina data using Pilon61. Polished contigs were assigned to pseudomolecules using Genome Puzzle Master62. Assembly statistics can be found in Supplementary Data 3 Table 4. IR 8 and N 22 were applied to evaluate the completeness and redundancy of the pan-genome.

Pan-genome construction

SOAPdenovo assembly for each accession was assessed by QUAST57 with Nipponbare RefSeq as the reference. From QUAST output, unaligned contigs longer than 500 bp were retrieved and merged. CD-HIT version 4.6.163 was used to remove redundant sequences at a cutoff of 90% identity with the command ‘-c 0.9 -T 16 -M 50000’. For remaining sequences, all-versus-all alignments with BLASTN were carried out to ensure that these sequences had no redundancy. Next, various contaminants including Archaea, bacteria, viruses, fungi and metazoans were removed. The non-redundant sequences were aligned to the NT database (downloaded from NCBI, 26 July 2014) with BLASTN with parameters ‘-evalue 1e-5 -best_hit_overhang 0.25 -perc_identity 0.5 -max_target_seqs 10’. Contigs of which the best alignments (considering E-values and identities) were not from Viridiplantae were considered as contaminants and were filtered out. The remaining contigs formed the non-redundant novel sequences. The rice species pan-genome was then generated by combining the Nipponbare RefSeq and non-redundant novel sequences.

Annotation of the pan-genome

The gene–transcript annotation of the Nipponbare RefSeq was downloaded from the Rice Annotation Project64, and if a protein-coding gene contained multiple transcripts only the transcript with the longest open reading frame was selected as the representative for the gene. Protein-coding genes on novel sequences were predicted using MAKER65, a gene prediction tool combining ab initio predictions, expression evidence and protein homologies. In detail, repeats were first masked (soft mask for low-complexity repeats) with RepeatMasker (www.repeatmasker.org) and RepeatRunner66. Two ab initio predictors, SNAP67 and AUGUSTUS68, were called by MAKER65 to predict gene models with their default parameters for rice. All rice expressed sequence tags (ESTs) were downloaded from GenBank (15 December 2014) and were aligned to the novel sequences with BLASTN. All rice proteins were downloaded from NCBI (15 December 2014) and were aligned to the novel sequences with BLASTX. To obtain more informative alignments, Exonerate69 was used to realign each sequence identified by BLAST around splice sites. EVidenceModeller70 was used to combine and refine the ab initio predictions with RNA and protein evidence. Incomplete gene models were removed before the consequent analysis.

Adjustment of predicted genes

We aligned the predicted transcripts against Nipponbare RefSeq to remove potential redundancy. Redundant genes were removed when the genes were clustered into gene families. However, when attempting to identify the number of novel genes, the redundant ones were removed first. We clustered all genes at a global identity of 95%, and removed novel genes that were not representative of the group.

Evaluation of pan-genome redundancy

We ran BUSCO (benchmarking universal single-copy orthologues) v.2.032 on CX140 (a Nipponbare accession) assembly, Nipponbare RefSeq, CX368 (an N 22 accession) assembly, N 22 high-quality reference genome and the pan-genome sequences. Augustus-3.2.368 and hmmer-3.1b71 were used for gene prediction in BUSCO. BUSCO was run with genome mode with embryophyta_odb9 as a reference.

Functional analysis

All protein sequences of pan-genome were extracted and aligned to the GO sequence database (http://geneontology.org/ on 4 April 2015) with BLASTP. Only alignments with E-values < 1 × 10−5 and identity > 0.3 were used. GO terms for each gene were estimated to be the same as those of its best-hit protein. In total, 20,842 (43.3%) genes could be annotated. For a gene family, its GO terms are the non-redundant set of the GO terms of the genes within this gene family. Overall, 6,338 (26.5%) gene families could be annotated. Enrichment of GO terms was carried out using the GOstats72 package in R with all gene families as the background.

Validation of the non-Nipponbare RefSeq genes

We verified the novel genes by multiple approaches. First, for each gene, we examined the number of accessions that possessed it. We mapped the sequencing reads to the pan-genome sequences. Genes with CDS coverage over 0.95 and gene-body coverage over 0.85 were considered to be present. Second, we verified the novel genes with 226 RNA sequencing experiments from 17 projects30. RNA sequencing reads were first trimmed with Trimmomatic version 0.3273 with parameters ‘ILLUMINACLIP:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:36’ and then aligned to the pan-genome sequences with a split-aligner HISAT2 version 2.0.1-beta74 using default parameters. The coverage of each gene was calculated with ‘BEDtools coverage’ in BEDtools suite version 2.17.075.

Gene family annotation

The genes were clustered to gene families with OrthoMCL version 2.0.976. All genes were extracted and translated into protein sequences and the protein sequences were compared by using all-by-all BLASTP (E-value = 1 × 10−5). OrthoMCL was applied to process the BLASTP output and cluster genes to gene families. Similarity of protein families was set to be 0.5 as suggested by a previous publication76.

Determination of gene presence or absence

We proposed a ‘map-to-pan’ strategy to determine gene presence or absence28. For the 453 accessions with high sequencing depth, although only about 60%–70% of their genomes can be de novo assembled (contig ≥ 500 bp), more than 98% of their genomes can be covered by short read mapping. This enabled the use of coverage of genes to determine their presence or absence. In practice, genes with CDS coverage over 0.95 and gene-body coverage over 0.85 were considered present. If one member of a gene family is present in a given rice accession, the gene family is considered as present.

Determination of core and distributed genes or gene families

A core gene (or gene family) is a gene (or gene family) present in all rice accessions, and we further defined candidate core genes (or gene families) as those with loss rates not significantly larger than 0.01 in all major groups. We first examined whether a gene (or a gene family) is distributed (loss rate > 0.01) in each type of O. sativa (XI, GJ, cA and cB). Binomial tests (with a null hypothesis of loss rate < 0.01) were carried out for each gene in each type. A P value below 0.05 meant that this gene (or a gene family) was lost in a significant proportion of rice accessions and is a distributed gene (or gene family) of these subpopulations. If a gene (or a gene family) was not determined to be distributed in all types (and it was not core), it was considered to be a candidate core gene (or gene family) of O. sativa. Other genes (or gene families) were considered to be distributed.

Determination of major-group-unbalanced, subpopulation-unbalanced and random genes or gene families

Distributed genes (or gene families) were divided further into major-group-unbalanced, subpopulation-unbalanced and random genes (or gene families). Major-group-unbalanced genes (or gene families) are defined as genes (or gene families) that are unequally distributed among XI, GJ, cA and cB groups. A two-sided Fisher’s exact test was used to determine whether the distribution of each gene (or gene family) is uniform. The P values of all genes were calculated with the ‘Fisher.test’ function in R and were then adjusted with the Benjamini–Hochberg FDR method. Genes (or gene families) with FDR < 0.05 were considered as major-group-unbalanced.

Subpopulation-unbalanced genes (or gene families) are defined as genes (or gene families) that are unequally distributed among subpopulations; thus, they can be divided into XI-subpopulation-unbalanced genes (or gene families) and GJ-subpopulation-unbalanced genes (or gene families). XI-subpopulation-unbalanced genes (or gene families) are defined as genes (or gene families) that are unequally distributed among XI-1A, XI-1B, XI-2 and XI-3 subpopulations. GJ-subpopulation-unbalanced genes (or gene families) can be defined similarly. The same statistical methods for the major groups were applied to determine the distribution balance for subpopulations. We defined genes (or gene families) that are neither major-group-unbalanced nor subpopulation-unbalanced to be ‘random’ genes.

Gene and gene-family age

Gene ages were inferred with previously described methods77. The NR protein database was downloaded from NCBI (28 March 2015) and all protein sequences were grouped according to 13 taxonomic levels (PS1: Cellular organisms; PS2: Eukaryota; PS3: Viridiplantae; PS4: Streptophyta, Streptophytina; PS5: Embryophyta; PS6: Tracheophyta, Euphyllophyta; PS7: Spermatophyta; PS8: Magnoliophyta, Mesangiospermae; PS9: Liliopsida, Petrosaviidae, Commelinids, Poales; PS10: Poaceae; PS11: BOP clade; PS12: Oryzoideae, Oryzeae, Oryza; and PS13: O. sativa) based on NCBI taxonomy. Thirteen BLASTP databases were built for protein sequences from PS1 to PS13. All genes on pan-genome sequences were first translated into proteins, and were aligned to the 13 databases using BLASTP with E-values < 1 × 10−5 and identity > 0.3. The age of a gene was considered as the taxonomic level of the oldest aligned protein. Genes that failed to align to all databases were assigned gene ages of PS13 (O. sativa). Some PS13 genes were reassigned as PS12 genes if they could be covered by 446 wild rice genomes2 with both gene-body coverage > 0.95 and CDS coverage > 0.95. The age of a gene family was considered as the age of the oldest gene within the gene family.

Introgression test

To test whether an XI sample had a non-introgression haplotype at a locus, we defined a D-value for a sample x as D(x) = d(x,XI) − d(x,GJ), in which d(x,XI) is the mean distance from sample x to a XI sample at the given locus.

With no gene flow from GJ to XI and vice versa, the D-value is negative for XI and positive for GJ. On the other hand, if an XI sample shares a haplotype with a GJ sample, the D-value will be positive and close to the D-values of GJ samples. For an XI sample, we rejected the hypothesis of GJ introgression if its D-value was negative and less than the lower bound of the 99% confidence interval for the D-value of GJ samples, which was computed on the subset of GJ consisting of samples with a positive D-value, to exclude the effect of potential XI-to-GJ introgression.

Reporting summary

Further information on experimental design is available in the Nature Research Reporting Summary linked to this paper.

Code availability

Code for studying pan-genome and gene and gene family PAVs are now integrated and published as the EUPAN toolkit28. Tailored novoBreak-germline is available at https://sourceforge.net/projects/novobreak/?source=navbar. Code for nucleotide diversity and SNP merging is available at https://github.com/dchebotarov/3k-SNP-paper. All other code is available from the corresponding authors upon request.

Data availability

The BAM alignment file and variant calls in VCF format for each accession of the 3K-RG against Nipponbare RefSeq are freely downloadable from Amazon Public Data at https://aws.amazon.com/public-data-sets/3000-rice-genome/ and the Department of Science and Technology Advanced Science and Technology Institute of the Philippines (DOST-ASTI) IRODS site, as described on the 3K-RG project site (http://iric.irri.org/resources/3000-genomes-project). The SV and PAV data of 3K-RG are available in the figshare database78 (https://doi.org/10.6084/m9.figshare.c.3876022.v1).

The following web tools are available for the mining, analysis and visualization of the 3K-RG dataset: SNP-Seek, http://snp-seek.irri.org; RMBreeding databases, http://www.rmbreeding.cn/index.php; rice cloud of genetic data public projects, http://www.ricecloud.org/; IRRI Galaxy, http://galaxy.irri.org/; and the 3,000 rice pan-genome browser79, http://cgm.sjtu.edu.cn/3kricedb/.

The 3K-RG sequencing data used for our analyses can be obtained via project accession PRJEB6180 from NCBI (https://www.ncbi.nlm.nih.gov/sra/?term=PRJEB6180), accession ERP005654 from DDBJ (https://www.ddbj.nig.ac.jp/index-e.html) and from the GigaScience Database (https://doi.org/10.5524/200001).