Genome size

The flow cytometry (FCM) estimate of the Matina 1-6 genome size is 445 Mbp; this value approximates the mean determined for 28 different T. cacao genotypes (see Additional file 1, Table S1), including representatives of all 10 structural diversity groups defined by Motamayor et al. [2]. When analyzed by structural group, the genome sizes fall into three statistically distinct sets (see Additional file 1, Table S2). Genotypes within the Nacional group have the largest genomes, whereas most of the other genotypes have genomes similar in size to that of Matina 1-6, a member of the Amelonado group. Cultivars within the Criollo group, including B97-61/B2 (409 Mbp) whose genome was recently published [11], have some of the smallest genomes within the species (see Additional file 1, Table S2).

Assignment of chromosomes to pseudomolecules using fluorescence in situhybridization

We performed fluorescence in situ hybridization (FISH)-based karyotyping of mitotic T. cacao chromosomes, using two types of probes: genomic repeats (centromeric repeats) for differential chromosome 'painting', and bacterial artificial chromosomes (BACs) for chromosome identification. Centromeric repeat-based FISH has been used to identify chromosomes in soybean [16] and maize [17, 18]. We identified sequences homologous to a previously identified candidate centromeric repeat [11] by searching unassembled sequence reads acquired during generation of the Matina 1-6 physical map [19]. A ClustalW alignment (not shown) of the 236 sequences with the highest percentage of identity indicated a length for the Matina 1-6 Cent-Tc repeat of 171 bp (see Additional file 2, Figure S1 for consensus), comparable with the centromere monomer length in other plant species [20–22]. We designed specific oligonucleotide probes that target highly conserved or less well conserved centromeric repeat T. cacao (Cent-Tc) regions (see Additional file 2, Figure S1). Probe OLI-07 was present in the majority (74%) of the aligned Cent-Tc repeats, whereas probe OLI-13 was present in 12.5%. When used together in FISH experiments (see Materials and methods section for details) the two probes differentiated the mitotic chromosomes into several distinct color and intensity subgroups (Figure 1). To identify individual chromosomes, BAC probes (Materials and methods; also see Additional file 1, Table S3) derived from low-repeat content regions of each of the 10 pseudomolecules defined by Matina 1-6 physical mapping [19] were individually mapped to Cent-Tc-labeled chromosomes (Figure 1 and data not shown). We then developed a six-component (two Cent-Tc probes plus four BAC probes) FISH 'cocktail' that simultaneously identified every chromosome in a single mitotic chromosome spread (Figure 1). Because the Matina 1-6 physical map is anchored by high-density molecular markers that relate the map to the sequenced genome, pseudomolecule-derived BAC probes permit the association of molecular-linkage groups and the pseudomolecules with the chromosomes themselves, thereby enabling chromosome numbering based on pseudomolecule numbering, and generating a single, unified cytogenetic map for Matina 1-6.

Figure 1 Fluorescence in situ hybridization (FISH)-based karyotype of Theobroma cacao Matina 1-6. A FISH cocktail comprised of two Cent-Tc oligonucleotide probes plus four BAC clones permitted identification of the ten chromosome pairs. (A) Ideogram of the T. cacao Matina 1-6 karyotype. Centromeres are coded in accordance with the color and size of the combined FISH signals for OLI-07 (green pseudo-colored) and OLI-13 (red pseudo-colored). Bacterial artificial chromosome (BAC) probes (red or green) are indicated by paired dots near chromosome termini. The following BACs were used for probes: for Tc03, TcC_Ba057I03 and TcC_Ba027M06; for Tc04, TcC_BB065A03; for Tc06, TcC_Ba018I22. Relative chromosome sizes are not indicated, with the exception of the satellite arm of Tc07, which is shown as a knob. (B) Chromosomes labeled with the FISH cocktail arranged by chromosome number. Chromosomes are discriminated as follows: Tc01 has the second-brightest yellow centromere. Tc02 has the brightest yellow centromere. Tc03, Tc04, Tc06, and Tc07 all have similar centromere labeling (pure green), but are differentiated based on unique BAC probe labeling: Tc03 is labeled at each end by green BAC probes; Tc04 is labeled at one end by a green BAC probe; Tc06 is labeled at one end by a red BAC probe; and Tc09 is not labeled by BAC probes. Tc05 has the second-brightest red centromere; Tc08 has the brightest red centromere with an 'internal' green domain; and Tc09 has the brightest yellow-green centromere and is much longer than Tc10, which has the second-brightest yellow-green centromere. (C) DAPI channel image of chromosomes in (B). The satellite arms of Tc07 are above the centromeres. (D) A FISH image containing a complete chromosome spread. (E) Corresponding DAPI channel image from which chromosomes in (C) were extracted. Full size image

Sequencing and assembly of the Matina 1-6 genome

In total, 32,460,307 sequence reads (a combination of Sanger and Roche 454 pyrosequencing; see Additional file 1, Table S4), were assembled using a modified version of Arachne (version 20071016 [23]). The resulting assembly was integrated with multiple high-quality genetic maps (see Additional file 1, Table S5) to produce a chromosome-scale assembly. Sanger sequences were obtained from fosmids and from BAC ends selected using the minimum tiling path of a previously reported physical map [19] (see Materials and methods).

The initial assembly generated 1,672 scaffold sequences with a scaffold N50 of 7.4 Mbp, and 91 scaffolds longer than 100 kbp (see Additional file 1, Table S6), giving a total scaffold length of 348.7 Mbp. To generate version 1.1 of the Matina 1-6 genome assembly, we removed contaminating sequences and scaffolds containing exclusively any of the following: unanchored repetitive sequences, unanchored ribosomal (r)DNA sequences, mitochondrial DNA, chloroplast DNA, and sequences less than 1 kbp in length. Next, we integrated the maps and constructed chromosome-scale pseudomolecules (see Materials and methods). Unanchored repetitive sequences were identified from the scaffolds remaining from the construction of pseudomolecules. These are scaffolds composed of more than 95% 24-mers occurring more than four times in scaffolds longer than 50 kb. Unanchored rDNA, mitochondrial, and chloroplast sequences were identified by aligning the remaining scaffolds against the nr/nt nucleotide collection at the National Center for Biotechnology Information (NCBI).

The chromosome-scale assembly contains 711 scaffolds that cover 346.0 Mbp of the genome, with a contig N50 of 84.4 kbp and a scaffold N50 of 34.4 Mb. The resulting final statistics are shown in Table 1. Plots of the marker placements were constructed for all three genetic maps used for the assembly, and for the synteny between exons in the T. cacao Matina 1-6 and Criollo genomes (see Additional file 2, Figure S2). The pseudomolecules comprise 99.2% (346.0 Mbp of 348.7 Mbp) of the assembly. For detailed comparison of the Criollo and Matina scaffold assemblies, see Additional file 1 (Table S7). The genome assembly has been deposited in GenBank (accession number [ALXC01000000]).

Table 1 Final summary assembly statistics for chromosome-scale assembly of the Matina 1-6 (version 1.1) genome sequence. Full size table

Arachne assembly assessment

An initial assessment of the completeness of the Arachne2 euchromatic genome assembly was estimated using 1,015,064 Roche/454 leaf transcript sequence reads [24]. The results of this analysis indicated that 98.9% of these (see Materials and methods) mapped to the 10 chromosomes. Other completeness assessments are described in the annotation results (see below). We also validated the Arachne2 sequence assembly against a 998 kbp reference region of Sanger-sequenced BAC clones from the Matina 1-6 library [25]. A dot-plot comparing the assembled genome and the BAC clone reference region was constructed (see Additional file 2, Figure S3). Overall, the analysis indicated high contiguity across the region; a detailed summary reporting error events and bps affected is presented (see Additional file 1, Table S8).

Evidence-based gene annotation

We prioritized the generation and use of RNA sequencing data from diverse experiments, and from available plant whole-gene sets, for evidence-based annotation. The overall results are shown schematically in Figure 2.

Figure 2 Genomic features of Theobroma cacao Matina 1-6. Shown are overall densities of evidence sets (see Materials and methods) that contributed to T. cacao Matina 1-6 annotation, and the final results as described in the text. Data were plotted for the chromosomes (pseudomolecules) in 50 Kbp sliding windows. Yellow denotes protein homology evidence by alignment to proteins of eight previously annotated plant genomes; blue denotes mapping of transcriptome data from second-generation RNA sequencing; green denotes gene models; red denotes transposons from homology-based and structure-based annotation, as described in the text (see Materials and methods). Full size image

Longer transcript read sets (Roche/454, see Materials and methods) were first used to confirm the completeness of genome assembly version 1.1 as above. Seven million reads (see Additional file 1, Table S9) sequenced from leaf, bean, and floral RNA from different genetic backgrounds mapped at rates ranging from 88.25% to 99.25% in Matina 1-6 (version 1.1) and from 85.23% to 97.89% in Criollo (version.1 [11]) genome assemblies. Mapping sequence identities were similar, suggesting that no significant differences were derived from sampling (see Additional file 1, Table S9). The aforementioned reads were also used for gene model annotation together with 1.22 billion shorter paired-end reads from Illumina RNA sequencing (see Additional file 1, Table S10).

We used coding and intron spans from protein alignments of eight annotated plant genomes for homology modeling (see Materials and methods; also see Additional file 1, section 3). The contributions of RNA and protein analyses are displayed independently with final models in Figure 2.

Evidence-supported models represent 29,408 loci with 14,806 alternative transcripts determined from RNA assemblies (see Additional file 1, Table S11). Additionally, about 20,000 elements were annotated with little or no support. These included around 13,000 gene-like regions that overlapped substantially (≥ 33%) with transposon loci (described below) and showed little or no evidence of expression, as well as loci predicted ab initio only, partial gene models, possible non-coding RNAs, and pseudogenes.

If the evidence-supported gene models alone are considered, around 15% of the Matina 1-6 genome (54 Mb) is expressed, and 36 Mbp of this represents protein-coding sequence. Intron size distribution was bimodal, as seen in other plant genes [26] (see Additional file 1, Table S12). The breadth of RNA data and the larger set of plant genome sequences available since our first cacao annotation release (version 0.9) contributed to a more substantiated annotation by every measure. The improved gene models are characterized by longer protein-coding sequence, full expression support through untranslated regions, and fewer fragments relative to both Matina (version 0.9 and Criollo (version 1) (see Additional file 1, Table S13).

Analysis of gene content and orthology

Among the primary cacao protein structures, we found 18,457 unique alignments to UniProt references [27] and 4,903 that were without match in this broadly comprehensive database. Products of 3,426 gene models (5,819 transcripts) were associated with 140 KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway maps [28]. This compares with 125, 124, 123, and 124 currently annotated metabolic pathways for Arabidopsis, poplar, grape, and castor bean, respectively [28]. Cacao genes were categorized using OrthoMCL [29] into 15,523 putative orthologous groups situated among those of 8 other plants used in the annotation, and of these groups, 13,700 contained a single representative. Remarkably, only 43 orthologous groups that were defined based on genes present in the other plant species had no cacao gene representative, the lowest level among the fully sequenced plant genomes analyzed (see Additional file 1, Table S14a). Orthology analyses performed in this study (see Materials and methods; see Additional file 1, section 3) generate group partitions that sometimes divide known phylogenetic clades. Ten (23%) of the groups not represented in cacao were defined by uncharacterized or hypothetical proteins in the other plants. The remainder of the unrepresented groups averaged only one or two copies in the other plants, and generally reflected sequence divergence within plant gene families whose paralogs were indeed present in cacao. For example, although one group in the 2-oxoglutarate and Fe(II)-dependent oxygenase superfamily was not represented in cacao, a number of cacao genes falling into 59 orthologous groups and an additional 8 ungrouped genes were annotated as members of this superfamily. There were 16 calcium-dependent kinase (CDPK) groups and three ungrouped CDPKs found in cacao, although one group that was defined for the other plants was not represented in cacao by this analysis. The over-represented and under-represented gene families for cacao, including these examples are tabulated (see Additional file 1, Table S14b), and the results summarized in a Venn diagram form as shared gene families for five compared plants (see Additional file 1, Figure S4).

The identified proteins were assigned to gene ontology (GO) categories, provided by The Arabidopsis Information Resource (TAIR), and the proportion of genes assigned to each category were found to be relatively close between poplar, cacao, and grape, but significantly different from Arabidopsis (χ2 = 2895.128, degrees of freedom (df) = 78, P < 2.2 ×10-16; see Additional file 1, Table S14c). This initial comparison suggested that cacao has the highest overall similarity to poplar (1.0 correlation, χ2 = 7.2923, df = 25, P= 0.9998), and that Arabidopsis drives most of the differences in gene content per GO category. Although further work will contribute to improve the functional annotation of the proteins identified in Matina 1-6, the initial assessment and functional characterization provided here considerably improves the genomic tools available for cacao, as we describe below for the mapping and identification of genes regulating phenotypic traits.

Lineage-specific genes are undoubtedly among the 703 T. cacao gene clusters [29] that share no apparent orthology with genes in any of the other 8 plants used in this study (see Additional file 1, Table S14a). Ongoing analyses, which incorporate an annotated gene set for cotton, a second member of the Malvaceae family that was recently sequenced [30], will refine this picture, and will also help define genes that are associated with speciation. Hierarchical clustering of cacao gene families, among those of many other plants, is available in Phytozome (version 9 [31]).

To further understand the results from the previously described analyses and to annotate seemingly divergent or uncharacterized unique genes in cacao, we performed additional hidden Markov model (HMM) searches [32]. We annotated 4244 transcripts from 4,085 genes (transposon overlaps excluded) based on expression, that is, association with RNA sequencing evidence alone, as these showed no homology to proteins in the other 8 plant genomes. We then surveyed potentially remote homology that would not have been identified within our annotation standards by searching these cacao proteins against the profile HMM databases Pfam and TIGRFAMS, using HMMER3 [32]. In total, 126 HMM families (E-value 0.005) were found within the no-homology set. We compared these with the set of HMM families obtained from an identical scan of cacao proteins with other plant homology (for groupings of these annotations and the proportion of HMM families between the two sets, see Additional file 1, Table S14d). Although the Pfam database is part of InterPro (used above), probabilistic scoring in the HMMER3 algorithm can exceed the accuracy of finding motifs by single optimal alignments [33, 34]. Although only four HMM families were unique to the no-homology set, these were members of clades that were represented in the set of annotated plant homologs (see Additional file 1, Table S14d). Overall, this analysis therefore added modest numbers of additional members to known structural categories, and approximately 3% more definition to the unknown proteins of cacao.

Gene models of the Matina 1-6 (version 1.1) sequence have been deposited in NCBI as (GenBank accession number [ALXC00000000]).

Transposable elements

A significant portion of eukaryotic genomes is composed of transposable elements (TEs) [35–37]. Using a structure-based and homology-based strategy, we identified 8,542 intact TEs in the T. cacao Matina 1-6 genome. Those elements, together with numerous truncated elements and other fragments, cover 41.53% of the assembled genome. The overall results are shown schematically in Figure 2. In a parallel comparative analysis using the same approach, we identified 5,089 intact TE copies in the T. cacao Criollo genome [11], and found only 35.40% of the Criollo genome assembly to be comprised of TEs (see Additional file 1, Table S15). Although this estimate represents a nearly 10% increase in TEs over a previous annotation of Criollo (version 1 [11]), the proportion of TEs in the assembled portion of the Criollo genome is notably less than that in the assembled portion of the Matina 1-6 (version 1.1) genome.

Retrotransposons, especially long terminal repeat (LTR) retrotransposons, are the predominant class of TEs identified in the cacao genome. Two types of LTRs are found: intact and solo. Solo LTRs are thought to be formed by unequal intra-element homologous recombination (UR) between two closely identical intact LTRs [38]. The 7,545 LTR retrotransposons identified, comprising 5,345 solo LTRs and 2,200 intact LTRs, can be classified into 369 families as follows: 123 copia-like, 233 gypsy-like, and 13 unclassified families (see Additional file 1, Table S16). Of the 333 retrotransposon families identified in Criollo, 277 are common to both Criollo and Matina 1-6 genomes, 56 families are specific to Criollo, and Matina 1-6 contains 92 LTR retrotransposons families not identified in Criollo (see Additional file 1, Table S16). While 96.02% (Criollo) and 98.10% (Matina 1-6) of the intact copies of these retrotransposons belong to these 277 shared families, the copy numbers of elements in each family differ substantially. For example, for Tcr53, which is the largest family in both genomes, 1,124 copies were identified in the Matina 1-6 genome, whereas only 301 copies were detected in the Criollo genome. It should be noted that, because the portions of either the Matina 1-6 or the Criollo sequences that were not assembled into the current genome pseudeomolecules remain unknown, the contribution of TEs to the difference in the contents and sizes of the two genomes could not be assessed more precisely. Nevertheless, four times more copies of Tcr53 elements were detected in Matina 1-6 than in Criollo, which may be explained mainly by the different levels of independent amplification of this family in the two genomes after their divergence. Most of the families that are unique to either genome contained only single copies. DNA transposons represent 8.87% of the Matina 1-6 genome and 7.00% of the Criollo genome. The gypsy-like retrotransposons comprise 79.93% (4.90% out of 6.13%) of the difference in TE coverage between Matina 1-6 and Criollo. These results suggest that retrotransposons may have undergone a recent amplification in the Matina 1-6 genome.

Synteny between Matina 1-6 and Criollo

We further compared the whole-genome sequences of Matina 1-6 (version 1.1) and the Criollo-type B97-61/B2 (version 1 [11]). The results, depicted using Circos [39], indicate very well conserved synteny, as expected, with 271 orthologous regions (ORs) identified (see Additional file 2, Figure S5). The longest OR is a 7.7 Mbp region in chromosome 9 in both Criollo and Matina 1-6, which contains 2,580 matching exons. The mean number of matching exons in each OR is 150, and the mean lengths of the ORs are 681 kbp in Criollo and 717.5 kbp in Matina 1-6. Of the total number annotated in each genome, 87.23 % (28,666 of 32,862) of the Criollo genes and 76.08 % (22,374 of 29,408) of the Matina 1-6 genes are located in the ORs that we identified. The orthology is well conserved at the chromosomal level, but fewer and smaller ORs between non-orthologous chromosomes were also detected (see Additional file 2, Figure S5). Regions on chromosome 2 of Matina 1-6 are orthologous to regions on chromosome 9 of Criollo, for example, and regions on chromosome 3 of Matina 1-6 are orthologous to regions on chromosome 10 of Criollo. In total, 12 ORs between non-orthologous chromosomes were detected and the mean size of the ORs between non-orthologous chromosomes was much smaller than that of the ORs between orthologous chromosomes (see Additional file 1, Table S18).

We analyzed the distribution of the genes associated with transposons in ORs between non-orthologous chromosomes and those between orthologous chromosomes (see Additional file 1, Table S18). We examined the mean span (kb) between genes and between transposons in each region of the two cultivars. In both cultivars, the mean span between transposons was shorter in ORs between non-orthologous chromosomes, even though the mean span between genes was larger in ORs between non-orthologous chromosomes. These results suggest that a much higher proportion of the genes in ORs between non-orthologous regions are associated with transposon activity than those in ORs between orthologous regions. The higher frequency of transposon-related genes in the ORs between non-orthologous chromosomes, along with the smaller size of the ORs between the non-orthologous chromosomes, suggests that those blocks were translocated to non-orthologous regions by transposon activity.

Although more detailed studies are required to discern if the observed differences are inherent to the different assemblies and annotation approaches, we determined whether any disease resistance-related genes (within the leucine-rich repeats: LRR-RLK class) or genes potentially involved with cacao-bean quality [11] reside in ORs (see Additional file 1, Table S19). The genes that reside outside ORs or in ORs between non-orthologous chromosomes could be responsible for the differences in crop quality and disease resistance between the two cultivars. It has been reported that genes encoding proteins involved in plant interaction with biotic and abiotic extrinsic factors are far more likely to have been transposed than those that are involved in relatively stable processes [40].

Many of the LRR-RLK genes, flavonoid biosynthetic pathway genes, lipid biosynthesis genes, and terpenoid synthesis genes identified in T. cacao by Argout et al. [11] are outside ORs (see Additional file 1, Table S19; for the specific gene models involved in this analysis, see Additional file 1, Table S20a). However, this analysis was based on the Criollo annotation [11] and did not use the genes identified through our evidence-based annotation pipeline in Matina 1-6 for these pathways, many of which are also in non-orthologous regions. For example, the LRR-RLK gene motif is found in twice as many genes in non-orthologous regions in the Matina 1-6 assembly as in the Criollo assembly (see Additional file 1, Table S20b).

Mapping pod color

In both T4 mapping populations (see Materials and methods), the progeny segregate in a 1:1 ratio for pod color, suggesting the involvement of a single gene. The common parent in both of these populations, Pound 7, has green pods. The other parents of these crosses, UF 273 Type 1 and Type 2, have red pods (see Additional file 2, Figure S6). In the MP01 mapping population, pod color segregates in a 3:1 red:green ratio, and we infer that both parents of this population, TSH 1188 and CCN 51, are heterozygous for the dominant red allele (see Additional file 2, Figure S6).

Linkage-mapping analyses previously suggested that pod color is regulated by gene(s) located on chromosome 4 [14]. To corroborate this result, we performed Fisher's exact test for each single-nucleotide polymorphism (SNP), using a contingency table of genotype counts and phenotype scores (that is, red and green) and the marker data from the ten linkage groups on the three mapping populations (T4 Type 1, T4 Type 2, and MP01). In all three mapping populations, a large region on chromosome 4 showed significant (based on the Bonferroni correction at α = 0.05) association with pod color (see Additional file 2, Figure S7).

Using haplotypes can be more powerful than using single-marker methods to detect phenotypic effects of low-frequency variants in genome-wide association studies [41, 42]. Haplotype-based methods are also useful for inferring the underlying causal genetic basis for various traits in linkage-mapping populations because, as shown here, it is possible to evaluate the parental inheritance of the associated haplotype more efficiently. We therefore determined parental haplotypes for chromosome 4 in individuals from the three mapping populations studied to help us identify candidate genes that are associated with this trait. We resolved the genotypes of the progeny in each of the mapping populations for chromosome 4 into the two corresponding parental haplotypes, using the output of HAPI-UR [43] (see Materials and methods). To investigate the effect from each parental haplotype separately on pod color, we performed Fisher's exact tests separately for each parent, on individuals from each mapping population, using a 2 × 2 contingency table of parental haplotypes for each marker and red and green pod color. This analysis identified a difference between the T4 and MP01 populations; in both T4 populations, haplotypes belonging to one parent, Pound 7, did not affect the color phenotype; whereas in MP01, haplotypes from both parents did affect pod color (Figure 3). The observation that a single haplotype from each parent associates with red pod color in MP01, but not in T4, is consistent with the mendelian model outlined above, and suggests the existence of a potential common allele in the two T4 mapping populations, even though the parents are different. Our results also suggest that red pod color is dominant over green and that both UF 273 parents are heterozygous for the red allele, whereas the common parent, Pound 7, is homozygous for the recessive green allele.

Figure 3 Statistical significance of association of pod color with markers on chromosome 4 of the parental Theobroma cacao haplotypes. The y-value at each marker is -log 10 (P-value) with the P-value computed using Fisher's exact test for both haplotypes of each parent, taking as input a 2 × 2 contingency table per marker. The segment between the vertical dashed lines is the genomic region most strongly associated with pod color in all three mapping populations. Thresholds denoted by the dashed red line in each plot were calculated using the Bonferroni correction for multiple comparisons at α = 0.05. Full size image

We then focused on the small region that we had identified on chromosome 4 (Figure 3) as that most significantly associated with the trait. The locations of the most significant SNP markers vary only slightly between the mapping populations, ranging from 20,987,989 bp to 21,726,996 bp. The haplotype combinations, taking into account SNP 21,126,449 bp, explain the phenotypes in the data with at least 97% accuracy in each population (see Additional file 1, Table S21; see Materials and methods for more details on the computation with Fisher's exact test).

To fine-tune the mapping of the region of interest, we examined trees exhibiting a recombination of the haplotypes within the associated region, and added flanking markers to extend the region to 20.35 to 22.05 Mbp. In each T4 population, there were 8 to 15 trees with recombination events in at least one parental haplotype (Figure 4a,b). The T4 recombinants refined the location of the causal locus to the region between 20,562,635 and 21,726,996 bp on chromosome 4 (Figure 4a,b).

Figure 4 Haplotype analysis of trees exhibiting recombination in the chromosome 4 segment associated with pod color. (a) Recombinant trees from population T4 Type 1; (b) recombinant trees from population T4 Type 2; (c) recombinant trees from population MP01. Maternal and paternal haplotypes are shown at the top of each figure. Tree names are colored according to pod-color phenotype. Red represents haplotypes associated with red pod color, and green represents haplotypes associated with green pod color from the two parents, while the yellow marker values represent uncertainty in the haplotype assignment. The black vertical bars surround the most likely region regulating pod-color variation according to the haplotypes of the recombinants (that is, if a recombinant shows only haplotypes for a given marker associated with green pods, but its phenotype is red, this indicates that the marker is not associated with pod color; this is the case for CATIE 1-63 at marker 22,053,861 in (a). The P-values from the Fisher's exact test are shown above each marker for each parent. The P-values are colored by parental phenotype, with the father always being bright red. The location of three candidate genes is indicated by colored dots above the closest markers. Full size image

Using MP01 trees that exhibit recombination in this region, the location of the causal locus was further refined (Figure 4c) to the region between 20,562,635 and 21,126,449 bp on chromosome 4. Indeed, the alleles in TcAPO4 and TcMYB6 associated with red pod color in the CCN 51 and TSH 1188 parents were now also associated with green pod color in the recombinants. This result further supports the hypothesis of a single allele underlying pod color (although there is the possibility that different mutations in the same gene could be responsible).

Candidate genes for pod color

The availability of the high-quality Matina 1-6 reference genome permits straightforward identification of candidate genes within the restricted chromosome region identified. Recombination events identified in the T4 Type 1 and Type 2 mapping populations delimit the region regulating pod-color variation to between 20,562,635 and 21,726,996 bp on chromosome 4 (Figure 4a,b). In this region, three gene models are of particular interest: TCM_019192, homologous to Arabidopsis MYB113; TCM_019219, homologous to the Vitis vinifera APO4 gene; and TCM_019261, homologous to Arabidopsis MYB6. Both MYB genes encode transcription factors of the R2R3 domain class, which are known to be diverse in plants [44]; whereas APO proteins promote photosystem 1 complex stability and associated chlorophyll accumulation [45].

Cacao pod color is determined by the degree of red pigment that is superimposed over the base green color of the pod during development. The green color itself ranges widely, from light to dark green [7]. Because the intensity of pod pigment is influenced by the status of the base green color, we considered the gene homolog to APO4 to be a candidate gene for pod-color determination. It has been hypothesized that pod color could be regulated by one locus with alleles that determine the intensity of the green color acting in concert with a second locus that determines pigmentation [7].

Two of the 18 MYB genes on chromosome 4 of the T. cacao Matina 1-6 genome version 1.1_ (see Additional file 1, Table S22), which are localized within the 20,562,635 and 21,726,996 bp interval, are particularly promising candidate genes for pod-color variation in cacao because variants in MYB transcription factors are known to cause variation in berry pigmentation in grapes [46, 47]. While MYB6 is less characterized, MYB113 is in a small clade known to act in complexes with basic helix-loop-helix (bHLH) proteins to activate genes encoding enzymes acting in late stages of flavonol biosynthesis. MYB variants cause pigmentation variation in potato, tomato, and pepper [48], as well as in kale [49]. Pigmentation in apple skin and flesh is also regulated by MYB transcription factors [50]. Moreover, when genes encoding the R2R3 MYB transcription factors (cloned from apple) are transformed into tobacco, they induce the anthocyanin pathway, when co-expressed with bHLH proteins [50]. MYB gene evolution is thought to involve duplication events [51]; a single gene cluster of three MYB genes is involved in berry pigmentation in grape [46], and three genes of this type (including MYB113) are tandemly arrayed in Arabidopsis. The gene cluster in grape that is involved in berry-color variation includes the VvMYBA1 gene (homologous to TCM_019192), VvMYBA4 (homologous to TCM_019261), and VvMybA2. Mutations within VvMybA1 or VvMybA2 cause berries to be pigmented white/green instead of the wild-type purple color [52].

As described above, the MP01 recombinants narrow the region of interest to genes located between 20,562,635 and 21,126,449 bp on chromosome 4 (Figure 4c). Within this region, only one of the two MYB transcription factors is present: TCM_019192, which is homologous to VvMybA1. We refer to this gene as TcMYB113, and it encodes a protein with 275 amino acids sharing 61% identity with grape VvMYBA1 (see Additional file 2, Figure S8 for protein-sequence comparison). In the Criollo genome [11], TCM_019192 is annotated as two genes: Tc04_t014240 and Tc04_t014250. This difference may be attributed to the strict evidence-based Matina 1-6 version 1.1 annotation procedures that we used (see above).

Resequencing of additional genotypes and haplotype phasing of resequenced data

The availability of a high-quality reference genome greatly facilitates resequencing data analyses of additional genotypes. Five parents of the mapping populations (MP01: CCN 51 and TSH 1188,; T4 Type 1: Pound 7 and UF 273 Type 1; T4 Type 2: Pound 7 and UF 273 Type 2), and eight additional genomes (KA 2 101, K 82, LCTEEN 141, NA 331, PA 51, mvP 30, mvT 85, and Criollo 13) were resequenced (see Materials and methods). Reads were phased for the region surrounding the pod-color mapping interval (between 20,561,460 and 21,386,830 bp) on chromosome 4 (see Materials and methods). We identified 21,225 SNPs and 1,879 putative insertions/deletions (indels) within this region. Phasing of the reads permitted generation of haplotype sequences and, consequently, identification of the alleles associated with pod color, specifically for the three candidate genes.

The phased resequencing data from 13 genotypes plus the Matina 1-6 (version 1.1) sequence were used to generate clusters based on the haplotypes for the two MYB genes and APO4 (see Additional file 2, Figure S9). The haplotypes for TcMYB113 appear to cluster in agreement with the haplotype they induce or the haplotype of the phenotype of the clones they represent; this is not the case for TcMYB6 or APO4, in which green and red haplotypes are positioned within the same cluster (see Additional file 1, Figure S9). These results corroborate the association between TcMYB113 and pod-color variation across multiple resequenced unrelated samples with diverse genetic backgrounds. The phylogenetic relationships between the haplotypes also indicate that the haplotypes of the Criollo 13 green-pod genotype are closest to the cluster of red-pod-associated haplotypes (see Additional file 2, Figure S9). This suggests that the TcMYB113 alleles that are associated with red pod color in this study originated from the Criollo genetic group. In fact, only a single allele was found for MYB6 and APO4 in the red pod-associated haplotypes from the mapping population parents (see Additional file 2, Figure S9), but their sequence is identical to the one from the resequenced green-pod Criollo 13. This indicates that the red-pod parents of the mapping populations may have inherited a large segment of chromosome 4 from a Criollo genotype, and in fact, the Criollo genetic group was previously identified as the most likely source of the red-pod gene [7].

Sanger sequencing, association mapping, and putative functional effects of polymorphisms found within TcMYB113

The resequencing data shows specific SNPs in TcMYB113 within coding regions in the alleles associated with red pod color, which are at positions 20,878,747, 20,878,891, 20,878,957, 20,879,122, and 20,879,148. We corroborated this result by Sanger sequencing of the three candidate genes studied and by association mapping (see Additional file 1, Table S23). The association-mapping analysis was performed using 73 SNPs generated via Sanger sequencing (see Materials and methods) and 95 other SNPs from chromosomes 1 to 10 in 54 genotypes with green pods and 17 genotypes with red pods from diverse genetic backgrounds. Without accounting for genetic structure, the most significant P-values from the Fisher test were for the following SNPs (from highest to lowest significance): 20,878,891; 20,875,691, and 20,879,148 (see Additional file 1, Table S23). SNPs at these positions permitted differentiation of alleles that are associated with the green-pod Criollo genotypes from those associated with the red-pod trees of Criollo origin (see Additional file 2, Figure S10). The least significant of these (position 20,879,148; see Additional file 1, Table S23) results in an amino-acid change from serine to asparagine at codon 221 of the TcMYB113 protein in the alleles associated with red pod color. This residue occurs outside of the R2R3 DNA binding domain, but within the C-terminal region that varies substantially between MYB family members, making the structural consequence of this substitution difficult to predict. When genetic structure was taken into account in the association-mapping analysis, the significance of the association between this SNP and pod color was considerably lower (see Additional file 1, Table S23). The second most significant SNP (position 20,875,691) was detected via Sanger sequencing in the TcMYB113 5' untranslated region (UTR), located 25 bases upstream of the ATG start site (see Additional file 1, Table S23). The function of this SNP is difficult to infer; however, mutations occurring near translation start sites are known to affect protein-translation rates [53]. The SNP that was most significantly associated with pod color (position 20,878,891) is a synonymous mutation found within the coding region of TcMYB113. Intriguingly, this SNP is positioned within a target site for a dicot trans-acting small interfering RNA (tasiRNA) derived from TAS4 (TRANS-ACTING siRNA 4) TAS4-siR81(-) as identified by Luo et al. [54] (Figure 5; see Additional file 2, Figure S8). TAS-derived siRNAs post-transcriptionally downregulate protein-coding transcripts in a manner similar to microRNA (miRNA)-directed repression [55, 56]. MYB113 regulation in Arabidopsis additionally involves miR828, which acts both to cleave TAS4 to generate the interfering small RNA TAS4-siR81(-) [57] and also, independently, to silence MYB113 [54]. We identified not only the TAS4-siR81 site, but also a conserved miR828 target sequence within TcMYB113 (Figure 5), suggesting that these regulatory mechanisms are highly conserved in cacao. However, we detected no polymorphism within the miR828 target sequencewhich overlaps with the highly conserved R3 domain (Figure 5). The activities of both TAS4 and miR828 ultimately regulate anthocyanin biosynthesis through MYBs in Arabidopsis [54]. Conservation of and natural variation in this regulatory loop is yet to be explored in other plants.

Figure 5 mir828 and TAS4- siR81 (-) sequence targets in TcMYB113. The green base pair indicates the single-nucleotide polymorphism (SNP) (20,878,891) that was most significantly associated with pod-color variation (C is associated with green pods and G is associated with red pods). Full size image

Resequencing data of the region 4 kbp upstream of TcMYB113 also revealed SNPs associated with the red pod-inducing haplotypes we had identified (see Additional file 2, Figure S11). However, because these SNPs are also present in the resequenced green-pod genotype Criollo 13, these mutations in the putative region containing the promoter of TcMYB113 are unlikely to be functionally associated with the red-pod phenotype.

Real-time quantitative PCR analysis of candidate genes

Pigmentation intensity has previously been correlated with variation in MYB transcript level [50, 58, 59]. In parallel with our genetic analyses, we quantified the accumulation of candidate-gene transcripts to assess their involvement with pod color. RNAs from developing young fruits (cherelles) of Pound 7 (a green-pod genotype) and UF 273 Type 1 and Type 2 (red-pod genotypes) were analyzed by quantitative (q)PCR, to compare the relative transcript levels of the T. cacao MYB113, AP04 and MYB6 genes. Relative expression was normalized to leaf cDNA generated from Pound 7. The TcAPO4 and TcMYB6 genes each showed decreased expression in all samples relative to expression in leaf, but there were no differential gene-expression patterns that correlated with pod color. By contrast, TcMYB113 expression in the small cherelle samples for all genotypes showed a modest increase in expression relative to the Pound 7 leaf (Figure 6a). This suggests that TcMYB113 gene expression might be tissue-specific, because anthocyanin accumulates in young leaves, including those of Pound 7 [7]. We found TcMYB113 transcript accumulation to be correlated with color intensity in a variety of samples tested. TcMYB113 transcript levels in large UF 273 cherelles (a clone with red pod color ), were 25-fold higher than that in large Pound 7 cherelles (a clone with green pod color), and 250-fold higher than that in Pound 7 leaves (Figure 6a). Notably higher transcript levels were detected in large cherelles relative to small cherelles (Figure 6a), and this correlated with the increase in pigmentation intensity during the early stages of fruit development.

Figure 6 TcMYB113 transcript levels determined by quantitative PCR analysis of RNA from the pericarp of cherelles (young Theobroma cacao fruits) of genotypes with green or red pod color (genotype names are colored accordingly). Bar colors indicate the color of (a,b) the cherelles and (c,d) the alleles analyzed Although standard deviations were calculated, the values obtained were too low to graph at the scale used below. All RNA levels are normalized to control RNA from Pound 7 leaf tissue. (a) TcMYB113 transcript levels in the pericarp of small (10 to 20 mm length) and large (30 to 50 mm length) cherelles of the green-pod genotype Pound 7 and the pericarp of the red-pod genotypes UF 273 Type 1 and Type 2. (b) TcMYB113 transcript levels in the pericarp of green cherelles of the green-pod genotype Gainesville II 316 and the pericarp of green, green plus red. and red cherelles of the red-pod genotype Gainesville II 164. Green cherelles from Gainsville II 164 were obtained from the completely shaded part of the tree sampled. (c) Allele-specific TcMYB113 transcript levels in the pericarp from small (10 to 20 mm length) and large (30 to 50 mm length) cherelles of the green-pod genotype Pound 7 and the pericarp of the red-pod genotypes UF 273 Type 1 and Type 2. (d) TcMYB113 allele-specific transcript levels in the pericarp of green cherelles of the green-pod genotype Gainesville II 316 and inthe pericarp of green (no sun), green plus red (partial sun), and red (full sun) cherelles of the red-pod genotype Gainesville II 164. Green cherelles from Gainsville II 164 were obtained from a completely shaded area of the tree sampled. Red bars in the green-pod Pound 7 is due to background fluorescence. Full size image

Interestingly, recent miRNA annotation of apple [60] showed miR828 expression only in flowers and fruit, and not in other tissues examined, suggesting that modulation of MYB expression through non-coding RNAs might be a tissue-specific aspect of conserved anthocyanin biosynthesis regulation.

Because anthocyanin accumulation is influenced by sunlight [58], we also measured transcript levels of TcMYB113 in a genotype (Gainesville II 164) that has red pods at maturity, but exhibits both green and red pigmentation during fruit development according to light exposure. Green cherelles were collected from the shaded branches of a Gainesville II 164 tree, and red cherelles were collected from an area of the same tree that was exposed to sunlight. TcMYB113 transcript levels in Gainesville II 164 were 200-fold and 100-fold higher (relative to the transcript levels in Pound 7 leaf) in the red and the mixed greenred cherelles, respectively, compared with the fully shaded green cherelles (25-fold increase relative to Pound 7 leaf) (Figure 6b). Gainesville II 316 (a green-pod genotype) that had been exposed to full sunlight was used for comparison, and its transcript levels (20-fold increase relative to leaf) were comparable with those of the Gainsville II 164 green cherelles. Together, these results further corroborate the involvement of TcMYB113 in determining pod color in cacao, and are a first indication that anthocyanin production in response to light in pods is mediated by TcMYB113 accumulation.

Given the correlation between differences in TcMYB113 transcript levels and the pod-color variation seen in both qPCR experiments, as well as the association-mapping results presented above, we hypothesize that TAS4 siRNA downregulates TcMYB113 transcripts, which results in pod-color variation. Annotation of TAS and miRNA complements in cacao will require further computational analysis and additional small RNA sequencing, as recently reported for apple [60]. However, we have recently localized a homolog of the conserved TAS4 TCM_019486, also on chromosome 4. An expressed sequence tag from this gene was previously reported [54], and we have directly confirmed expression of this gene and the presence of the siR81 processed segment in small RNA isolated and sequenced from cacao flowers (data not shown).

To further confirm that TAS4 siRNA plays a role in the downregulation of TcMYB113 transcripts, we repeated the experiments described above, but this time using primers specific to the green and red alleles (Figures 6c,d). cDNA from small and large pods of the T4 Type 1 mapping population parents were used as template, as described for the experiments mentioned above. Red-colored bars in the figure represent relative steady-state expression of the red allele, and green bars represent expression of the green allele. These results show a five-fold increase in the expression of the red allele over the green in samples with red pods (UF273 Type 1 and Type 2), whereas in green-colored pods, very little expression of either allele is present. Cumulative expression of both alleles is consistent with the previous expression levels seen for TcMYB113.

Allele-specific transcript levels were also measured in the Gainsville II clone with varying exposure to sunlight. The green-pod Gainsville II 316 showed no steady-state expression of the red allele, and its green-allele expression was similar to that seen in Figure 6d. For the red-pod Gainsville II 164, expression of the red allele was consistently higher than that of the green, and increased with increasing exposure to sunlight. Additionally, green-allele accumulation in this clone, although significantly smaller than that of the red allele, still increased with increasing exposure to sunlight. This suggests that TcMYB113 expression is regulated in response to sunlight exposure.

Together, these experiments further support our hypothesis that the variant in the TAS4-si81 target site on the red allele prevents TAS4 siRNA downregulation of TcMYB113, and allows the accumulation of the TcMYB113 red-allele transcript, whereas the green-allele transcript is degraded.