Sequencing the C. sativa PK genome and transcriptome

We obtained DNA and RNA samples from plants of PK, a clonally propagated marijuana strain that may have been bred in California and is reportedly derived from an "indica" genetic background [24]. Genomic DNA was isolated from PK leaves and used to create six 2 ×100-bp Illumina paired-end libraries with median insert sizes of approximately 200, 300, 350, 580 and 660 bp. Sequencing each of these libraries produced > 92 gigabase (Gb) of data after filtering of low-quality reads (see below), which is equivalent to approximately 110× coverage of the estimated ~820 Mb genome. To improve repeat resolution and scaffolding, we supplemented these data with four 2 × 44-bp Illumina mate-pair libraries with a median insert size of approximately 1.8 kb and two 2 × 44-bp libraries with a median insert size of approximately 4.6 kb, adding 16.3 Gb of sequencing data in 185 million unique mated reads. We also included eleven 454 mate-pair libraries with insert sizes ranging from 8 to 40 kb, obtaining > 1.9 Gb of raw sequence data (~2.3 × coverage of 820 Mb) and 2 M unique mated reads.

To characterize the cannabis transcriptome, we sequenced polyA+ RNA from a panel of six PK tissues (roots, stems, vegetative shoots, pre-flowers (i.e. primordia) and flowers (in early- and mid-stages of development)) obtaining > 18.8 Gb of sequence. To increase coverage of rare transcripts, we also sequenced a normalized cDNA library made from a mixture of the six RNA samples, obtaining an additional 33.9 Gb. The sequencing data obtained for the genomic and RNA-Seq libraries are summarized in Table 1.

Table 1 Purple Kush sequencing library statistics Full size table

Assembling the C. sativa PK genome and transcriptome

We used different approaches for the de novo assembly of the PK genome (SOAPdenovo [25]) and transcriptome (ABySS [26] and Inchworm [27]). To gauge the success of the outputs, and to refine the assemblies, we used both traditional measures (coverage, bases in assembly, N50, maximum contig size and contig count) as well as comparisons between the assembled versions of the genome and transcriptome.

For the transcriptome, we used two different assemblers, ABySS and Inchworm, to obtain the best possible coverage. Both assemblers were run on the individual tissue datasets and normalized cDNA libraries, as well as the full set of RNA-Seq data (summarized in Table 2). We used predicted splice junctions and the presence of apparent coding regions to orient the assembled transcripts and to perform quality control (QC). In general, Inchworm produced assemblies with a larger N50 than ABySS (Table 2); however, we also observed many cases in which adjacent transcripts (e.g. head-to-head transcripts that overlap in their termini) appeared to be merged. Therefore, we considered only Inchworm transcripts with a single blastx hit covering at least 70% of their length when merging assemblies. The filtered individual ABySS and Inchworm assemblies were combined by first selecting the largest transcript among sets of near-identical sequences from each assembly, followed by a second stage where transcripts with blunt overlaps were joined. This second step resulted in a significant improvement of transcript N50 from 1.65 to 1.80 kb (Table 2).

Table 2 Overview of transcriptome assembly stages Full size table

The final merged assembly contains 40,224 transcripts falling into 30,074 clusters of isoforms (Table 3). We selected the transcript with the largest open reading frame (ORF) as the representative for each cluster, resulting in a pruned assembly with an N50 of 1.91 kb. Most representative transcripts (83%) have a blastx hit in other plants, and the distribution of transcript classes, according to Panther [28], is nearly identical between PK and Arabidopsis (Figure 1), as is the total number of transcripts and the N50 (33,602 and 1.93 kb in Arabidopsis, respectively [29]). The total number of bases in representative Arabidopsis transcripts is, however, somewhat larger (50 Mb, [29]) which may indicate that some of the PK transcripts are partial or that genes are represented by more than one non-contiguous fragments. We noted a 3' end bias in the normalized cDNA library, presumably due to the polyA priming step (data not shown). Moreover, by combining near-identical transcripts during assembly merging and isoform clustering, we likely collapsed transcripts of large multi-copy gene families. Indeed, applying our isoform clustering algorithm to the Arabidopsis assembly reduces the total number of bases to 44 Mb, which is mostly due to the loss of transposable element genes. Overall, our assembled PK transcriptome is therefore very similar to the deeply characterized Arabidopsis transcriptome, both in size and composition.

Table 3 Genome and transcriptome assembly statistics Full size table

Figure 1 Transcript classes in Cannabis sativa and Arabidopsis thaliana. Panther [28] was used to determine the distribution of transcripts in (a) C. sativa (PK) (30,074 representative transcripts) and (b) A. thaliana (31,684 transcripts). The high degree of similarity between both species indicates that all major functional classes are proportionally represented in the PK transcriptome assembly. Full size image

Our genome assembly procedure first involved a series of filtering steps to remove low-quality reads, bacterial sequences (about 2% of all reads) and sequencing adapters. Mate-pair libraries (454 and Illumina) were further processed to remove duplicate pairs and unmated reads. We then assembled a small fraction of the Illumina data (1%) together with the 454 data, to reconstruct the mitochondrial (approximately 450 kb) and plastid (approximately 150 kb) genomes, and subsequently removed their highly abundant DNA sequences. The remaining reads were assembled with SOAPdenovo, resulting in a draft assembly that spans > 786 Mb of the cannabis genome and includes 534 million bp (Table 3). The Illumina mate-pair libraries had a significant impact on the assembly, increasing the N50 from 2 kb to 12 kb. Addition of the large-insert 454 data increased this to 16 kb (24.9 kb for scaffolds containing genes). Between 73% and 87% of the reads in each library could be mapped back to the draft genome (Table 1), indicating that our assembly accounts for most of the bases sequenced. As an additional measure of completeness, we also examined the proportion of the transcriptome represented in the genome assembly. Over 94% of assembled transcripts map to the draft genome over at least half of their length, and 83.9% of them are fully represented; that is, all bases of the transcript can be mapped to genomic contigs. Overall, 37.6 Mb (92.5%) of the complete transcriptome is accounted for in the genome assembly (Figure 2), and over 68.9% of transcripts are fully encompassed by a single scaffold. Thus, our draft genome assembly appears to represent a large majority of the genic, non-repetitive C. sativa genome.

Figure 2 Proportion of transcriptome mapping to genome assembly. (a) A histogram showing the number of bases in the transcript assembly that could be mapped to the genome at 98% sequence identity, as a function of transcript length in 300 nt bins. (b) The proportion of transcriptome bases that could be mapped to the genome for the same bins as in (a). The black dashed line indicates the proportion of the transcriptome that is accounted for in the genome assembly. Full size image

The assembled C. sativa PK genome and transcriptome (canSat3) can be downloaded and browsed at a dedicated website [30]. The Cannabis Genome Browser combines the genome assembly with the transcriptome annotations, and has tracks for RNA-Seq data, single nucleotide variants (SNVs) and repeats.

Expression of the cannabinoid pathway in C. sativa PK tissues

Our first step in the functional analysis of the C. sativa genome was to examine the relative expression of each of the 30,074 representative transcripts in the six PK tissues, from which the RNA-Seq data were derived (Figure 3a). In metazoans (e.g. humans), different organs and tissues have different physiological functions, and consequently have unique gene expression profiles [31]. Therefore, many genes have a highly restricted expression pattern. By contrast, in plants, different photosynthetic tissues are often composed of a similar set of cell types. Moreover, photosynthetic processes and primary metabolic pathways have widespread expression, and only a minor proportion of transcripts appear to be uniquely expressed in a given cell type [32]. Consistent with these observations, we found all of the cannabis photosynthetic tissues to have similar expression profiles (Figure 3a).

Figure 3 Analysis of gene expression in PK tissues. (a) RNA-Seq read counts for 30,074 representative transcripts (rows), expressed as log2 RPKM, were subjected to hierarchical agglomerative clustering based on their expression pattern across tissues (columns). (b) Schematic illustration of THCA and CBDA cannabinoid biosynthesis, including the production of fatty acid and isoprenoid precursors via the hexanoate, MEP and GPP pathways. Hexanoate could arise through fatty acid degradation, involving desaturase, lipoxygenase (LOX) and hydroperoxide lyase (HPL) steps. Activation of hexanoate by an acyl-activating enzyme (AAE) yields hexanoyl-CoA, which is the substrate for the polyketide synthase enzyme (OLS) that forms olivetolic acid. The prenyl side-chain originates in the MEP pathway, which provides substrates for GPP synthesis, and is added by an aromatic prenyltransferase (PT) [36]. The final steps are catalyzed by the oxidocyclases THCAS and CBDAS. Pathway enzymes and metabolic intermediates are indicated in black and blue, respectively. (c) Same data as (a), showing the expression levels for genes in the cannabinoid pathway and precursor pathways (rows) across the six assayed tissues (columns). The majority of the genes encoding cannabinoid and precursor pathway enzymes are most highly expressed in the flowering stages. Gene and pathway names correspond to those used in panel B. Full size image

Nonetheless, flowers show a pattern of gene expression consistent with the biosynthesis of cannabinoids and terpenoids in these organs. Cannabinoids are prenylated polyketides that are synthesized from the short-chain fatty acid hexanoate and geranyl diphosphate (GPP) [23, 33]. The latter precursor, which is the substrate for an aromatic prenyltransferase enzyme, is derived from the 2-C-methyl-D-erythritol 4-phosphate (MEP) pathway [34–36] (see Figure 3b for details). We found that the genes encoding cannabinoid pathway enzymes and also most of those encoding proteins (e.g. hexanoate, MEP and GPP) involved in putative precursor pathways were most highly expressed in the three stages of flower development (pre-flowers, and flowers in early and mid-stage of development) (Figure 3c). This finding is consistent with cannabinoids being synthesized in glandular trichomes, the highest density of which is found on female flowers [37]. The production of THCA in marijuana strains (such as PK) and CBDA in hemp, is due to the presence or absence of THCAS and CBDA synthase (CBDAS) in these two chemotypes. Indeed, THCAS is highly expressed in PK flowers of all stages, whereas CBDAS is absent (Figure 3c).

It is worth noting that of the 19 'pathway genes' we analyzed, 18 were complete in the transcriptome assembly, underscoring its quality. The transcript of the MDS gene (which encodes a protein involved in the MEP pathway) was assembled in two fragments with a blunt overlap of 48 nt, narrowly missing the merging threshold of 50 nt. This sequence was resolved by merging the fragments manually. All 'pathway genes' are fully represented in the draft genome and an overview of their genomic locations is provided on the Cannabis Genome Browser website [30].

Comparison of the expression of cannabinoid pathway genes between marijuana (PK) and hemp ('Finola')

Although there are differences in the morphology of marijuana and hemp strains, the THC content of PK and other strains selected and bred for use as marijuana is remarkably high. We investigated whether the high THC production in PK was associated with increased gene expression levels of cannabinoid pathway enzymes, relative to those in hemp. We performed RNA-Seq analysis on Finola flowers at the mid-stage of development, generating a total of 18.2 M reads. 'Finola' is a short, dioecious, autoflowering cultivar developed in Finland for oil seed production. It was created by crossing early maturing hemp varieties from the Vavilov Research Institute (St. Petersburg, Russia), 'Finola' might be derived from a "ruderalis" genetic background [38]. It contains moderate amounts of CBDA in female flowers but very low amounts (< 0.3% by dry weight) of THCA. Figure 4a shows that the overall mid-flower transcript profiles, expressed as RPKM (reads per kb per million reads), are similar between PK and 'Finola'; however, the entire cannabinoid pathway is expressed at higher levels in PK than in 'Finola', with later steps increased as much as 15-fold (Figure 4a).

Figure 4 Comparison of gene expression in female cannabis flowers, and gene copy number, between marijuana (PK) and hemp ('Finola'). (a) A scatter plot of RNA-Seq read counts for all representative transcripts in marijuana and hemp, expressed as log2 RPKM. Specific subsets of transcripts are shown in color, as indicated in the key. The dashed line represents the relative enrichment of trichomes in the marijuana strain, inferred from the ratio in expression of trichome-specific genes, as defined in the text. Gene symbols/abbreviations: CAN - known and putative cannabinoid pathway genes; HEX - putative hexanoate pathway genes; GPP - GPP pathway genes; MEP - MEP pathway genes; TF - putative transcription factors according to PFAM, with at least a 4-fold change in expression in PK relative to 'Finola'; MYB - Myb-domain transcription factors previously suggested as trichome regulators. (b) A scatter plot of the log2 median read depth (MRD) of genomic DNA-Seq reads that aligned uniquely to the PK transcriptome. Genomic reads were trimmed to a length of 32 bases prior to alignment with Bowtie, to allow for mapping close to exon junctions. The lack of outliers in the scatter plot indicates that there have been relatively few changes in gene copy number between marijuana and hemp. (c) The relative RNA-Seq expression of individual genes in the cannabinoid pathway and precursor pathways (is shown on the left), adjusted for enrichment of trichome-specific genes (i.e. relative to the dashed line in panel a). The median genomic DNA read depth for the same genes is shown on the right. The box plots reflect the variation in the depth of coverage of uniquely aligned genomic DNA reads across each transcript, with the median coverage distribution across all transcripts shown as reference (All). Reads that are likely derived from pseudogenes are marked by the symbol [P]. While there is increased expression of most cannabinoid genes in the HEX and CAN pathways (left) in PK, this does not appear to be due to an increased representation of these genes in the PK genome relative to the 'Finola' genome (right). Full size image

The difference in gene expression is not due to morphological variability between the strains, such as in the size or proportion of trichomes. We examined the global expression levels of trichome genes to account for possible differences in trichome density between PK and 'Finola' flowers, by analyzing an RNA-Seq dataset obtained for 'Finola' glandular trichomes (from a separate study, data not shown). From a set of the1000 most abundant transcripts, we selected 100 that had the greatest difference in expression between the mid-flower stage and the maximum expression level found in PK root, shoot or stem in the current study. This subset of genes should be highly enriched for genes predominantly expressed in glandular trichomes (and indeed contained all the cannabinoid and hexanoate 'pathway genes' expressed in 'Finola'). The median difference in expression level after excluding the cannabinoid genes is shown as a dotted line in Figure 4a, and was used to adjust the expression differences shown in Figure 4c. Even after accounting for global trichome differences, cannabinoid pathway enzymes remain among several hundred obvious outliers. Outliers also include several dozen transcription factors, including two Myb-domain proteins that have been previously suggested to play a role in regulating processes in cannabis trichomes [23] (Figure 4a). These data suggest that the increased production of cannabinoids in PK may be due in part to an increase in expression of the biosynthetic genes.

Resequencing of the C. sativa 'Finola' genome reveals copy number changes in a PK cannabinoid pathway related enzyme

To begin our search for the underlying causes of the differences between marijuana and hemp, we sequenced the genome of 'Finola' (e.g. Illumina 100 nt paired-end reads, 200-500 bp inserts, at approximately 50× coverage of the estimated 820 Mb genome). Plant genomes often contain many duplicated genes, and gene amplification represents a well-documented mechanism for increasing expression levels [39]. Therefore, we first asked whether there were apparent differences in copy number for the enzyme-encoding gene set, using the median read depth (MRD) of genomic DNA-Seq reads that could be uniquely mapped to transcripts as a proxy. Figure 4b illustrates that, overall, there appear to be relatively few differences in gene MRD between PK and 'Finola'. The exception to this is the much expanded coverage for AAE3, a gene encoding an enzyme of unknown function in PK. AAE3 is similar to an Arabidopsis AAE [TAIR:At4g05160] that has been shown to activate medium- and long-chain fatty acids including hexanoate [40]. Although AAE1 is a more likely candidate for the hexanoyl-CoA synthetase involved in cannabinoid biosynthesis (JMS and JEP, unpublished results), owing to its high expression in flower tissues and increased transcript abundance in PK (Figure 3b, 4), AAE3 might play an, as yet, unknown role in cannabinoid biosynthesis. Because we could detect both multi- and single-exon copies of AAE3, we believe that the large expansion of AAE3 has occurred through the insertion of processed pseudogenes in the PK genome. In addition, the read depth analysis uncovered reads corresponding to CBDAS in PK and THCAS in 'Finola'. However, on the basis of our inability to assemble these into functional protein-coding genes, we conclude that the THCAS reads in 'Finola' and CBDAS reads in PK are likely to be caused by the presence of pseudogenic copies, as we discuss below. Therefore, it appears that the differences in expression of cannabinoid pathway enzymes between marijuana and hemp are due to subtle genetic differences that cause changes in gene expression, either directly or indirectly.

The PK genome contains two copies of two genes involved in cannabinoid biosynthesis. Copies of AAE1, which encodes a protein likely to synthesize the hexanoyl-CoA precursor for cannabinoid biosynthesis, are found on scaffold1750 [genbank:JH227821] and scaffold29030 [genbank:JH245535]. OLS, which encodes the putative cannabinoid pathway polyketide synthase [18], was found to be duplicated at scaffold15717 [genbank:JH226441] and scaffold16618 [genbank:JH237993].

Analysis of single nucleotide variants (SNVs) in cannabis

To further examine the genetic variation in C. sativa, we estimated the frequency of SNVs in four taxa. In addition to PK and 'Finola', our analysis included the Illumina reads we generated from the hemp cultivar 'USO-31', as well as the reads from the marijuana strain Chemdawg, which were released by Medical Genomics, LLC [41] while this manuscript was in preparation. 'USO-31' is a tall, monoecious fibre hemp cultivar developed in the former Soviet Union that contains very low or undetectable levels of cannabinoids [42]. Our resequencing of 'USO-31' was similar to that of 'Finola' (Illumina 100 nt paired-end reads, 200 and 500 bp inserts, at approximately 16× coverage of the estimated 820 Mb genome). We aligned individual Illumina reads to the PK reference genome, and identified variant bases that qualify as SNVs (see the Methods section for further details). We also quantified the degree of heterozygosity within single genomes. Overall, PK, Chemdawg, 'Finola' and 'USO-31' have comparable rates of heterozygosity (0.20%, 0.26%, 0.25%, and 0.18%, respectively). The lower rate of heterozygosity in 'USO-31', which is monoecious, is presumably due to self-pollination.

The rate of occurrence of SNVs between any two strains ranged from 0.38% (PK versus Chemdawg) to 0.64% (Chemdawg versus 'Finola'). A neighbor-joining tree constructed using the concatenated polymorphic sequences from each of the strains is shown in Figure 5, and supports the expected pedigree of the two hemp cultivars, which are likely to have been bred using germplasm from North Central Asia. Although the ancestry of PK and Chemdawg is unclear, their position on the tree supports the notion that marijuana forms of cannabis are more related to each other than to the hemp forms, and that these two marijuana strains share a common heritage.

Figure 5 Neighbour-joining tree for two hemp cultivars and two marijuana strains. The tree was plotted in MEGA5 [71] using the maximum composite likelihood of SNV nucleotide substitution rates, calculated based on the concatenated SNV sequences in each variety, as a distance metric. The topology of the tree reveals a distinct separation between the hemp and marijuana strains. Full size image

Genomic analysis of cannabinoid chemotypes

The molecular basis for THCA (marijuana) and CBDA (hemp) chemotypes is unclear. De Meijer et al [43] crossed CBDA- and THCA-dominant plants to produce F1 progeny that are intermediate in their ratio of THCA:CBDA; selfing gave F2 progeny that segregated 1:2:1 for THCA-dominant:codominant mixed THCA/CBDA:CBDA-dominant chemotypes. These data suggested two explanations: a single cannabinoid synthase locus (B) exists with different alleles of this gene encoding THCAS or CBDAS; or THCAS and CBDAS are encoded by two tightly linked yet genetically separate loci. In the latter scenario, differences in transcript abundance and/or enzyme efficiencies might account for the observed chemotypic ratios. Indeed, given that both of these enzymes compete for CBGA, reductions in one activity might lead to a proportional increase in the production of the other cannabinoid. Our draft sequence of the THCA-dominant PK genome enables some preliminary insights into possible mechanisms of the inheritance of cannabinoid profiles. Using the published THCAS sequence [genbank:AB057805] [16] to query the PK genome, a single scaffold of 12.6 kb (scaffold19603, [genbank:JH239911]) was identified that contained the THCAS gene as a single 1638 bp exon with 99% nucleotide identity to the published THCAS sequence. Querying the PK transcriptome returned the same THCAS transcript (PK29242.1, [genbank:JP450547]) that was found to be expressed at high abundance in female flowers (Figure 3c). A THCAS-like pseudogene (scaffold1330 [genbank:JH227480], 91% nucleotide identity to THCAS) was also identified. We used the CBDAS sequence [genbank:AB292682] [17] to query the PK genome and identified as many as three scaffolds that contain CBDAS pseudogenes (scaffold39155 [genbank:AGQN01159678], 95% nucleotide identity to CBDAS; scaffold6274 [genbank:JH231038] + scaffold74778 [genbank:JH266266] combined, 94% identity; and scaffold99205 [genbank:AGQN01254730], 94% identity), all of which contained premature stop codons and frameshift mutations. The presence of these pseudogenes in the PK genome accounts for the identification of CBDAS genomic sequences in PK (Figure 4c). A 347-bp transcript fragment (PK08888.1, [genbank:JP462955]) with 100% nucleotide identity to CBDAS could be identified in the PK transcriptome, likely due to the nonsense-mediated decay of transcripts derived from CBDAS pseudogenes. Given that multiple independent loci were identified with high sequence similarity to either THCAS or CBDAS in a THCA-dominant marijuana strain, the two-loci model for the genetic control of THCA:CBDA ratios might be correct. A possible explanation is that during the development of high-THC marijuana strains such as PK, underground breeders selected for non-functional CBDAS that would effectively eliminate substrate competition for CBGA and thus increase THCA production. Alternatively, the CBDAS pseudogene in the PK genome might occur in all cannabis strains. If this is true, the single-locus model might still be correct, and we did not find a CBDAS-encoding allele at this locus because PK is homozygous for THCAS.

Analysis of PK transcriptome for cannabichromenic acid synthase (CBCAS) candidates

To illustrate the potential value of the cannabis genome and transcriptome to elucidate cannabinoid biosynthesis, we searched for genes encoding enzymes that might catalyze the formation of cannabichromenic acid (CBCA), which is present in most cannabis plants as a minor constituent and in certain strains as the dominant cannabinoid [44]. Although a protein that synthesizes CBCA has been purified from cannabis, the gene that encodes the CBCA synthase (CBCAS) has not been identified [45]. We hypothesized that CBCAS is an oxidocyclase enzyme related to THCAS and CBDAS, therefore, we queried the PK transcriptome using THCAS and CBDAS sequences. In total, 23 candidates were identified that had greater than 65% nucleotide identity with these sequences. These include four genes that we designated THCAS-like1 to THCAS-like4, which encode proteins that are 89%, 64%, 68%, and 59% identical to THCAS at the amino acid level, respectively. We also identified transcripts corresponding to CBDAS2 and CBDAS3, which are closely related to CBDAS but do not encode enzymes with CBDAS activity [17]. The remaining 18 transcripts encode proteins that are similar to reticuline oxidase, an oxidoreductase that functions in alkaloid biosynthesis [46]. Biochemical analysis of CBCAS candidates is currently underway.