Genome assembly and annotation

SK sequencing and assembly

We sequenced the inbred line SK, derived from a tropical landrace (BioSample accession code: SAMC036455). High-molecular-weight DNA extraction and purification was performed using a DNeasy Plant Maxi Kit (Qiagen). DNA concentration was measured using NanoDrop (Thermo Fisher Scientific) and Qubit 2.0 (Invitrogen) instruments. A total of 43 single-molecule real-time cells were run on the PacBio Sequel instrument by BGI using Kit 2.0 chemistry, generating 19.7 million reads with a total length of 199 Gb. The PacBio data were de novo assembled using FALCON assembler19 and polished with the Arrow program (https://www.pacb.com/support/software-downloads/). DNA was also sequenced using an Illumina HiSeq 3000 machine. Paired-end libraries with insert sizes of 410 and 670 bp, as well as mate-pair libraries with insert sizes of 2, 5, 10 and 20 kb, were constructed, following a standard protocol provided by Illumina. We also used Illumina data to improve the assembly result by Pilon39—an integrated tool for comprehensive variant detection and genome assembly improvement.

Construction of optical genome maps

Based on standard BioNano protocols40, nicking, labeling, repair and staining processes were implemented. Specifically, DNA was digested by the single-stranded nicking endonuclease Nt.BspQI. Optical maps were assembled with BioNano IrysView41 analysis software; only single molecules with a minimum length of 100 kb and six labels per molecule were used.

PacBio sequence gap filling and gap filling result correction

The gaps in the BioNano assembly result were closed by PBjelly version 15.2.20 (ref. 20) with the PacBio sequence using default parameters. Then, the filled regions were polished with Plion39.

Scaffold construction using 10x Genomics data

The Chromium Genome Reagent Kit42 (10x Genomics) was used for indexing prepared samples and partitioning barcoded libraries. Sequencing was conducted with Illumina HiSeq X Ten to generate linked reads. Scaffolding was performed using 10x Genomics linked reads based the ARCS pipeline. Linked reads with barcodes that did not match the company’s barcode whitelist were filtered out. ARCS was run with sensitive parameters, as specified in a previous study21. To examine the linked scaffold, we used a consensus approach that contained evidence from three different sources: (1) Irys optical maps; (2) PacBio long-read alignments to the scaffolds; and (3) Illumina HiSeq read alignments to the scaffolds. We found that Irys supported the linking 110 paired scaffolds with each other, and there were 62 paired scaffolds that did not align with the Irys optical map. All of the conflicts were disconnected.

Anchoring of the assembled scaffolds

To anchor the scaffolds, a high-density genetic linkage map was developed using the RIL population with 263 recombination inbred lines derived from an SK × Zheng58 cross and genotyped with a 56,000 SNP array43. The genetic map spanned 1,858.9 cM and contained 2,796 bins derived from 13,883 high-quality SNPs. The sequences of probes from the Illumina MaizeSNP50 array43 were mapped to the 10x Genomics assembly result using BLAT44. Around 2.095 Gb (47 scaffolds) could be anchored to ten chromosomes by genetic linkage mapping, which made up 96.90% of the 10x Genomics assembly result. Genotype-by-sequencing probes of high-resolution genetic mapping of the maize pan-genome45 were also mapped to the 10x Genomics assembly result using BLAT software; 151 scaffolds could be assigned to a chromosome, but they could not be located and ordered within the chromosome. The size of the 151 scaffolds was 26 Mb.

Further gap filling

We allocated the corrected PacBio long reads to ten chromosomes by mapping them onto the ten pseudo-chromosomes and then reassembling them respectively. We aligned the contigs resulting from reassembly onto the ten pseudo-chromosomes and filled the gaps manually.

BioNano map-assisted gap filling

The BioNano de novo assembly and BioNano molecules were used to estimate the gap length. Then, we filled the gaps using corrected PacBio long reads with PBjelly20. Finally, the filled regions were polished with Plion39. Irys optical maps and Illumina HiSeq reads were used to examine these areas again.

Genome annotation

Transposable elements found in the SK genome were the result of the integration of independent de novo predictions (LTRharvest46, LTRdigest47, SINE-Finder48 and HelitronScanner49), and of homolog searching from RepeatMasker using P-MITE50 and Repbase databases51 as repeat libraries.

The pipeline for gene prediction included de novo and evidence-based predictions using MAKER-P52 and PASA53 on the repeat-masked genome (Supplementary Fig. 7). For homolog evidence, we collected the protein sequences of Arabidopsis thaliana, Brachypodium distachyon, Oryza sativa, Setaria italica, Sorghum bicolor and Z. mays. Transcript evidence included high-quality, full-length transcripts from Iso-Seq and Trinity-assembled transcripts from the RNA-Seq of nine tissues (male spikelet, female spikelet, internode, seedling root, seedling leaf, mature pollen, unpollinated silks, kernels 15 d after pollination, and vegetative meristem). For de novo gene prediction, we used Augustus54 and FGENESH (http://www.softberry.com/berry.phtml) trained on 2,000 homolog genes, which were supported by Iso-Seq full-length transcripts and monocots. All of the evidence was submitted to MAKER-P52, and the output of MAKER-P52 was refined again by PASA53.

SV calling

To call SVs, we used the smartie-sv pipeline14, which aligns, compares and calls insertions, deletions and inversions (https://github.com/zeeev/smartie-sv). At the core of the code is a modified version of BLASR, which was designed to align large divergent contigs against a reference genome. We called SVs (>10 bp; deletions and insertions) using smartie-sv. We applied two filters to the raw SV calls. First, we omitted SVs that were smaller than 10 bp or within the centromere. Second, regions (1 Mb windows) with more than 50 alignments were also excluded from the analysis. Third, contigs of <200 kb were also excluded. Furthermore, we confirmed >96% of 29 events (from 10 bp to 2 kb in size) by Sanger sequencing (Supplementary Table 15). For larger SVs, we randomly selected 12 SVs (from 5–70 kb) for visual inspection and good collinearty were shown between two genomes of the flanking sequence of SVs (Supplementary Fig. 21). As an initial dataset for identification of pSVs (Supplementary Note), the accuracy of 386,014 SVs should be acceptable, although there might be some false positives in them.

RNA-Seq data analysis and eQTL mapping

RNA-Seq data were obtained from our previous published dataset (SRP026161). A total of 11,496,863 high-quality SNPs were obtained from DNA deep resequencing (~20×) of 521 diverse inbred lines. We referred to a previously published method to conduct the quantification of gene expression and eQTL mapping55. Raw reads were trimmed, to remove adapters and low-quality reads, with Trimmomatics (version 0.36)56. Trimmed reads were mapped to the SK reference genome using STAR57. Read counts of each gene were calculated using HTSeq58 and normalized by library sequencing depth using the R package DESeq2 (ref. 59). After filtering the gene without expression in more than 100 samples, expression counts were normalized using Box–Cox transformation. Before eQTL mapping, 69 hidden factors were calculated using PEER60 and were used as covariates together with five multidimensional scaling coordinates calculated form the SNP dataset. Using these covariates, SNP eQTL and SV eQTL were mapped using Matrix eQTL61.

QTL mapping and transgenic validation of qHKW1

We planted heterozygous individuals derived from one heterogeneous inbred family line to screen new recombinant events34. The plants were planted in the field in Hainan (Sanya; 18.3° N, 109.5° E) and grown in 2.5 m rows, spaced 0.5 m apart, with 11 individuals in each row. The markers used for fine mapping of qHKW1 are listed in Supplementary Table 16. Progeny tests were performed by comparing the HKW of NILSK and NILZHENG58 homozygous individuals from F 3 families for each new recombinant. We used one-way analysis of variance in Excel to test whether there was a significant difference in HKW between two NILs. We fused Zm00001d028317 with yellow fluorescent protein and overexpressed it into maize inbred line ZC01 with the ubiquitin promoter. One-way analysis of variance analysis was used to test whether there were significant differences in expression levels or HKWs between overexpression transgenic-positive and -negative lines. We also performed CRISPR–Cas9-based gene editing of Zm00001d028317, with two guide RNAs targeting the first exon of Zm00001d028317 inserted into pCPB-ZmUbi-hspCas9 (ref. 62). Both of the overexpression and gene-editing transgenic vectors were transformed into C01 with Agrobacterium tumefaciens EHA105 (China National Seed Group). The transgenic lines were planted in a greenhouse in Yunnan province, China (21.9° N, 100.7° E). To avoid the effect of environment, we planted these transgenic materials and controls in the same greenhouse, with 30 cm plant-to-plant and 50 cm row-to-row distances. The primers used for transgenic experiments are listed in Supplementary Table 16.

Expression quantification of Zm00001d028317 and RNA-Seq

We extracted total RNA from the seeds, endosperm and embryos of two NILs, and the leaves of overexpression transgenic lines using a Quick RNA Isolation Kit (Huayueyang Biotech, Beijing, China). First-strand complementary DNA was synthesized using an EasyScript One-Step gDNA Removal and cDNA Synthesis SuperMix (TransGen Biotech). Real-time fluorescence quantitative PCR with SYBR Green Master Mix (Vazyme Biotech) on a CFX96 Real-Time System was used to quantify the expression level of Zm00001d028317. Each set of experiments was repeated three times, and the relative quantification method (2−ΔΔCT) used to evaluate quantitative variation. The primers used for quantitative PCR with reverse transcription are listed in Supplementary Table 16. The RNA, extracted from embryos at 20 d after pollination, of the overexpression-positive and -negative lines and CRISPR-edited and control lines was used to perform RNA-Seq. For each genotype, we performed RNA-Seq of three replicates at Annoroad Gene Technology (Beijing, China). One sample of the overexpression-positive line was excluded from further analysis due to its low global Pearson correlation (r < 0.95) with the other two samples.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.