Banana cultivars mainly involve M. acuminata (A genome) and Musa balbisiana (B genome) and are sometimes diploid but generally triploid5,6. We sequenced the genome of DH-Pahang, a doubled-haploid M. acuminata genotype (2n = 22), of the subspecies malaccensis that contributed one of the three acuminata genomes of Cavendish7. A total of 27.5 million Roche/454 single reads and 2.1 million Sanger reads were produced, representing 20.5× coverage of the 523-megabase (Mb) DH-Pahang genome size, as estimated by flow cytometry. In addition, 50× of Illumina data were used to correct sequence errors. The assembly consisted of 24,425 contigs and 7,513 scaffolds with a total length of 472.2 Mb, which represented 90% of the estimated DH-Pahang genome size. Ninety per cent of the assembly was in 647 scaffolds, and the N50 (the scaffold size above which 50% of the total length of the sequence assembly can be found) was 1.3 Mb (Supplementary Text and Supplementary Tables 1–3). We anchored 70% of the assembly (332 Mb) along the 11 Musa linkage groups of the Pahang genetic map. This corresponded to 258 scaffolds and included 98.0% of the scaffolds larger than 1 Mb and 92% of the annotated genes (Supplementary Text, Supplementary Table 4 and Supplementary Fig. 1).
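
The N50 definition given above can be made concrete with a minimal sketch. The function name `n50` and the toy scaffold lengths are illustrative assumptions and are not taken from the DH-Pahang assembly or its pipeline.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are sorted from longest to shortest; the N50 is the length of
    the scaffold at which the running total first reaches half of the
    total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no scaffold lengths given")


# Toy example (hypothetical lengths, arbitrary units): total = 15, half = 7.5.
# The two longest scaffolds (5 + 4 = 9) already cover half of the total,
# so the N50 is the length of the second scaffold.
toy_scaffolds = [5, 4, 3, 2, 1]
print(n50(toy_scaffolds))
```

Applied to the real scaffold lengths, the same calculation would yield the 1.3 Mb value reported for the assembly.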

We identified 36,542 protein-coding gene models in the Musa genome (Supplementary Tables 1 and 5). A total of 235 microRNAs from 37 families were identified, including only one of the eight microRNA gene (MIR) families found so far solely in Poaceae8 (Supplementary Tables 6 and 7).

Viral sequences related to the banana streak virus (BSV) dsDNA plant pararetrovirus were found to be integrated in the Pahang genome, with 24 loci spanning 10 chromosomes (Supplementary Text and Supplementary Fig. 2). They belonged to a badnavirus phylogenetic group that differed from the endogenous BSV species (eBSV) found in M. balbisiana9 and most of them formed a new subgroup (Supplementary Fig. 3). Importantly, all of the integrations were highly reorganized and fragmented and thus did not seem to be capable of forming free infectious viral particles, contrary to the eBSV described in M. balbisiana10.

Transposable elements account for almost half of the Musa sequence (Supplementary Text and Supplementary Tables 1 and 8–10). Long terminal repeat retrotransposons represent the largest part, with Copia elements being much more abundant than Gypsy elements (25.7% versus 11.6%) (Supplementary Fig. 4). No major recent wave of long terminal repeat retrotransposon insertions appears to have occurred in the Musa lineage. Fewer than 1% of the long terminal repeat retrotransposons are complete and their median date of insertion is around 4 Myr ago, corresponding to the half-life of this type of transposable element11 (Supplementary Fig. 5). Long interspersed elements (LINEs) represent 5.5% of the genome. The banana genome is exceptional in the composition of its class 2 element population, which represents only about 1.3% of the genome. The only superfamilies identified were hAT, followed by Harbinger and Mutator. Only hAT was significantly represented and had non-autonomous deletion derivatives. The superfamilies CACTA and Mariner, which have been found in high copy numbers in all angiosperm genomes studied so far, are absent from the banana genome. Gene-rich regions are mostly located on distal parts of chromosomes, as observed in other plant genomes (Fig. 1 and Supplementary Fig. 1). There is, however, a particularly sharp transition between gene-rich and transposable-element-rich regions. This observation is confirmed by the pattern observed after genomic in situ hybridization, which shows that transposable elements are typically concentrated around centromeres in Musa12 (Supplementary Fig. 6). The asymmetric transposable element distributions along the chromosomes indicated that chromosomes 1 and 2 are acrocentric in DH-Pahang (Fig. 1). Long terminal repeat retrotransposons are particularly abundant in centromeric and pericentromeric chromosome regions. Their accumulation in these regions, particularly for the oldest ones, suggests that they are preferentially eliminated from gene-rich regions13 (Supplementary Fig. 5). Remarkably, typical short tandem centromeric repeats were not found in Musa. However, one long interspersed element (named Nanica) identified in the unassembled reads was localized by fluorescence in situ hybridization in the centromeric region of all Musa chromosomes (Supplementary Fig. 7 and Supplementary Table 10).

Figure 1: Chromosomal distribution of the main M. acuminata genome features. Distribution of genes and transposable elements (left) and paralogous relationships between the 11 chromosomes indicated with 12 distinct colours corresponding to the 12 Musa α/β ancestral blocks (right). LINEs, long interspersed elements.

Whole-genome duplications (WGDs) have played a major role in angiosperm genome evolution14; the first evidence of a WGD event in the Musa lineage was reported by Lescot et al.15. We uncovered a complex pattern of paralogous relationships between the 11 Musa chromosomes (Supplementary Text and Supplementary Fig. 8). Most paralogous gene clusters shared relationships with three other clusters, suggesting that two WGDs (denoted as α and β) occurred (Supplementary Fig. 9). Based on Ks and synteny relationships, duplicated gene clusters were tentatively assembled into 12 Musa ancestral blocks representing the ancestral genome before the α/β duplications (Figs 1 and 2 and Supplementary Figs 10–12). The duplicated segments included in the Musa ancestral blocks cover 222 Mb (67% of the anchored assembly) and contain 26,829 genes (80% of the anchored genes) (Supplementary Table 11). The Ks distribution among pairs of paralogous gene clusters dated the two WGDs at a similar period around 65 Myr ago (Supplementary Fig. 13), consistent with the WGDs that occurred in many different plant lineages near the Cretaceous–Tertiary boundary14 (Fig. 3). Additional paralogous relationships between the 12 Musa ancestral blocks displaying higher Ks values suggested that an additional, more ancient duplication event (denoted as γ) occurred around 100 Myr ago (Fig. 3 and Supplementary Figs 10, 11, 13 and 14).
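
As a rough illustration of how a Ks value translates into a duplication age, the standard molecular-clock relation T = Ks / (2r) can be sketched as follows. This is not the dating procedure used in the Supplementary Text; the function name `ks_to_age_myr` and the substitution rate `r` are hypothetical placeholders chosen only to make the arithmetic visible.

```python
def ks_to_age_myr(ks, rate_per_site_per_year):
    """Convert a synonymous divergence (Ks) into an age in million years.

    Uses the simple molecular-clock relation T = Ks / (2 * r), where the
    factor 2 accounts for substitutions accumulating independently in both
    duplicated copies since the duplication event.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("substitution rate must be positive")
    years = ks / (2.0 * rate_per_site_per_year)
    return years / 1e6


# Hypothetical inputs: a Ks peak of 0.65 and an ASSUMED rate of 5e-9
# synonymous substitutions per site per year (not a value from this paper).
assumed_rate = 5e-9
print(round(ks_to_age_myr(0.65, assumed_rate), 1))
```

Because the result scales inversely with the assumed rate, any such estimate is only as reliable as the rate calibration for the lineage in question.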

Figure 2: Whole-genome duplication events. a, Paralogous relationships between chromosome segments from Musa α/β ancestral blocks 2 (red) and 8 (green). The 12 Musa α/β ancestral blocks are shown in different colours on the circle. b, Orthologous relationships of Musa ancestral blocks 2 and 8 with rice ancestral blocks ρ2, ρ5 and σ6. We did not observe a one-to-one relationship between, for instance, Musa α/β ancestral block 2 and one ρ ancestral block, which suggests that the γ and σ duplications are two separate events. c, Representation of the deduced WGD event.

Figure 3: Timing of whole-genome duplications relative to speciation events within representative monocotyledons and eudicotyledons. Boxes indicate WGD events. Green boxes indicate WGD events analysed in this paper. All nodes have 100% bootstrap support in a maximum likelihood analysis. Branch lengths (synonymous substitution rate) are indicated. The timing of the β WGD event relative to the Musaceae/Zingiberaceae split remains to be clarified.

In the grass lineage, it is well established that one WGD (denoted as ρ) occurred around 50–70 Myr ago, after Poales separated from other monocotyledon orders16,17. Evidence has also been reported for an additional WGD (denoted as σ) earlier in the monocotyledon lineage, but after its divergence from the eudicotyledons18. Our comparison of the Musa ancestral blocks with the Poaceae ρ and σ ancestral blocks as defined by Tang et al.18 revealed that genes from segments of different ρ blocks (corresponding to one σ block) have orthologous relationships with the same Musa regions, showing that the σ Poaceae event is not shared with Musa. Reciprocally, genes from Musa α/β paralogous segments have orthologous relationships with the same ρ and σ regions, showing that the earliest duplication (γ) we identified in the Musa lineage is not shared with Poaceae (Fig. 2 and Supplementary Fig. 15).

Independent phylogenomic analyses performed on 3,553 gene families, including genes mapped to syntenic ancestral blocks, generated further evidence (77.6–98.7% of the gene trees, Supplementary Text) that the three rounds of palaeopolyploidization identified in the Musa genome and the two previously reported in the Poaceae lineage occurred independently after the Poales and Zingiberales divergence estimated at 109–123 Myr ago19 (Fig. 3 and Supplementary Fig. 16).

Resolution of the Zingiberales relationship relative to Poales and Arecales (palms) has been problematic (see, for example, Givnish et al.20), but our analysis of 93 single-copy nuclear genes suggested that the palms are more closely related to Zingiberales (including Musa) than to Poales (Fig. 3, Supplementary Text and Supplementary Fig. 17). Phylogenomic and synteny analyses indicated that the palms do not, however, share any of the Poales or Zingiberales WGDs discussed here (Supplementary Figs 17 and 18). Moreover, our Ks analyses of date-palm gene models21 indicated that the palm genome had its own WGD event (Supplementary Fig. 19).

Most (65.4%) of the genes included in the Musa α/β ancestral blocks are singletons and only 10% are retained in four copies, in agreement with the loss of most duplicated gene copies after WGD22. The most retained gene ontology categories corresponded to genes involved in transcription regulation (transcription factor activity), signal transduction including small GTPase-mediated signal transduction and protein kinases, and translational elongation (Supplementary Text and Supplementary Tables 12–14). This might be explained by the gene balance hypothesis23, which suggests that genes involved in multiprotein complexes or regulatory genes are dosage sensitive and thus are more prone to be co-retained or co-lost after WGD24. With 3,155 genes, the number of Musa transcription factors identified is among the highest of all sequenced plant genomes (Supplementary Tables 15 and 16).

Comparison of Musa, rice, sorghum, Brachypodium, date palm (Phoenix dactylifera) and Arabidopsis proteomes revealed 7,674 gene clusters common to all six species, thus representing ancestral gene families (Fig. 4). Interestingly, many clusters (2,809 in our setting) proved specific to Poaceae, suggesting a high level of gene divergence and diversification within the grass lineage. Musa-specific clusters (759) were enriched in genes encoding transcription factors (for example, Myb and AP2/ERF families), defence-related proteins, enzymes of cell-wall biosynthesis and enzymes of secondary metabolism (Supplementary Table 17).
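
The way shared and lineage-specific clusters are tallied for a Venn diagram of this kind can be sketched as below. The cluster identifiers and species memberships are invented toy data, and the function `venn_counts` is a hypothetical helper, not the clustering software used in the study.

```python
from collections import Counter


def venn_counts(clusters):
    """Count clusters by the exact set of species represented in each.

    `clusters` maps a cluster identifier to the species that have at least
    one sequence in that cluster. Each distinct species combination
    corresponds to one region of the Venn diagram.
    """
    return Counter(frozenset(species) for species in clusters.values())


# Toy data: hypothetical cluster IDs, three species only for brevity.
toy_clusters = {
    "c1": {"Musa", "rice", "Arabidopsis"},  # shared by all three
    "c2": {"Musa", "rice", "Arabidopsis"},
    "c3": {"rice"},                         # specific to one species
    "c4": {"Musa"},
    "c5": {"Musa", "rice"},
}
counts = venn_counts(toy_clusters)
print(counts[frozenset({"Musa", "rice", "Arabidopsis"})])  # clusters shared by all
print(counts[frozenset({"Musa"})])                         # Musa-specific clusters
```

In this framing, the 7,674 ancestral families correspond to the region whose key contains all six species, and the 759 Musa-specific clusters to the region whose key contains Musa alone.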

Figure 4: Six-way Venn diagram showing the distribution of shared gene families (sequence clusters) among M. acuminata, P. dactylifera, Arabidopsis thaliana, Oryza sativa, Sorghum bicolor and Brachypodium distachyon genomes. Numbers of clusters are provided in the intersections. The total number of sequences for each species is provided under the species name (total number of sequences/total number of clustered sequences).

We compared the distribution of GC3 content (G or C in the third codon position) in Musa coding sequences with those of rice, ginger (Zingiber officinale) and date palm because this distribution was shown to be bimodal in Poaceae and unimodal in all analysed eudicotyledons25. In Musa, a GC-rich peak was apparent but less distinct from the GC-poor one (Supplementary Text, Supplementary Figs 20–23 and Supplementary Table 18), which confirms preliminary evidence that placed Musa in an intermediate position15. This feature was shared with ginger (Zingiberales) and contrasts with the unimodal GC distribution of date-palm coding sequences (Supplementary Fig. 21).
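
For readers unfamiliar with the GC3 statistic, a minimal sketch of its calculation for a single coding sequence is given here. The function name `gc3` and the toy sequence are illustrative assumptions; the sketch expects an in-frame sequence whose length is a multiple of three and does not handle ambiguous bases specially.

```python
def gc3(cds):
    """Return the fraction of codons with G or C at the third position.

    `cds` is an in-frame coding sequence; its length must be a non-zero
    multiple of three.
    """
    seq = cds.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("sequence length must be a non-zero multiple of 3")
    third_positions = seq[2::3]  # every third base, starting at codon position 3
    gc_count = sum(base in "GC" for base in third_positions)
    return gc_count / len(third_positions)


# Toy sequence (not a real gene): codons ATG GCC AAA TAA have third-position
# bases G, C, A, A, so half of the third positions are G or C.
print(gc3("ATGGCCAAATAA"))
```

Computing this value for every coding sequence in a genome and plotting the histogram gives the unimodal or bimodal distribution discussed above.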

Plant conserved non-coding sequences (CNSs), a type of phylogenetic footprint, are enriched in known transcription factor binding sites or other cis-acting binding sites, and are usually clustered around regulatory genes, supporting their functionality26. Starting with a collection of 16,978 CNSs conserved in Poaceae, we used the Musa genome to identify the 116 most deeply conserved regulatory binding sequences in the commelinid monocotyledon lineage (Supplementary Text, Supplementary Tables 19 and 20, and Supplementary Fig. 24). Deeply conserved CNSs in commelinids were frequently located 5′ to genes encoding transcription factors, and were significantly enriched in WRKY motifs (Supplementary Table 21). After WGD, genes associated with deeply conserved CNSs were found to be retained as duplicates more often than genes with less deeply conserved CNSs (Supplementary Table 22). The banana genome also served as a stepping-stone to finding CNSs conserved beyond monocotyledons, including 18 CNSs that were found in this study to be conserved in the expected syntenic position in eudicotyledons as well (Supplementary Table 23). This evolutionary distance is not unusual for vertebrate CNSs (detectable after more than 400 million years of divergence)27 but it surpasses the findings of previous plant whole-genome surveys26. Deeply conserved plant CNSs are therefore rare but do exist, are short compared with those of animals27, and must be at least as old as the monocotyledon–eudicotyledon divergence (more than 130 million years of divergence).

The reference Musa genome sequence represents a major advance in the quest to unravel the complex genetics of this vital crop, whose breeding is particularly challenging. Having access to the entire Musa gene repertoire is a key to identifying genes responsible for important agronomic characters, such as fruit quality and pest resistance. Bananas are exported green and then ripened by application of ethylene. RNA-Seq analysis indicated strong transcriptional reprogramming in mature green banana fruits after ethylene treatment (Supplementary Text, Supplementary Tables 24–26 and Supplementary Fig. 25). Transcription factors were particularly involved, with 597 differentially regulated genes. Various modifications confirmed the biochemistry of the banana ripening process28, such as highly upregulated genes encoding cell-wall modifying enzymes, three downregulated starch synthase genes and one upregulated β-amylase gene. Two WGD-derived paralogous vacuolar invertase genes involved in sucrose conversion displayed opposite expression profiles, suggesting subfunctionalization and possible contribution to the soluble sugar balance in ripening bananas (Supplementary Fig. 26). The race against pathogen evolution is particularly critical in clonally propagated crops such as banana. Up to 50 pesticide treatments a year are required in large plantations against black leaf streak disease, a recent pandemic caused by Mycosphaerella fijiensis3. Moreover, outbreaks of a new race of the devastating Panama disease fungus (Fusarium oxysporum f. sp. cubense) are spreading in Asia4. Among defence-related genes, those encoding nucleotide-binding site leucine-rich repeat proteins were found to be poorly represented in the Musa sequence (89 genes) (Supplementary Table 27). RNA-Seq analysis showed that receptor-like kinase genes were upregulated in a partially resistant interaction with M. fijiensis (Supplementary Text, Supplementary Table 28 and Supplementary Fig. 27). Interestingly, direct links between basal plant immunity triggered by receptor-like kinase proteins and quantitative trait loci for partial resistance have been recently established in several plant species (see, for example, Poland et al.29). In addition, we showed that DH-Pahang is highly resistant to the new broad-range Fusarium oxysporum race 4 (Supplementary Text and Supplementary Fig. 28), thus conferring additional specific value to the DH-Pahang sequence.

The Musa genome sequence reported here bridges a large gap in genome evolution studies. As such, it sheds new light on the monocotyledon lineage. Several Poaceae-specific characteristics could be highlighted, boosting prospects for analysing the emergence of this very successful family. The Musa genome also enabled identification of deeply conserved CNSs within commelinid monocotyledons and between monocotyledons and eudicotyledons, representing an invaluable resource for detecting novel motifs with a gene regulation function. We detected three rounds of polyploidization in the Musa lineage, which were followed by gene loss and chromosome rearrangements, resulting in little synteny conservation between lineages (Supplementary Figs 29 and 30) and over-retention of some gene classes, thus providing ample opportunities for independent diversification. In particular, transcription factor families are strikingly expanded in Musa compared with other plant genomes and probably contribute to specific aspects of banana development.

The Musa genome sequence is therefore an important advance towards securing food supplies from new generations of Musa crops, and provides an invaluable stepping-stone for plant gene and genome evolution studies.