Whole-genome shotgun sequencing and assembly

All sequencing reads were collected with standard Sanger sequencing protocols on ABI 3730XL automated sequencers at the Joint Genome Institute, Walnut Creek, CA. Three different sized libraries were used for the plasmid subclone sequencing process and paired-end sequencing. A total of 3,446,208 reads from the 2.6-kb sized libraries, 3,479,232 reads from the 6.0-kb sized libraries and 518,016 reads from a 36.2–40.6-kb library were sequenced. Two BAC libraries (EG_Ba, 127.5-kb insert and EG_Bb, 155.0-kb insert) were end sequenced to add an additional 294,912 reads for long-range linking.

The sequence reads were assembled using a modified version of Arachne v.20071016 (ref. 36) with parameters maxcliq1 = 100, correct1_passes = 0, n_haplotypes = 2 and BINGE_AND_PURGE = True. The resulting output was then passed through Rebuilder and SquashOverlaps with parameters to merge adjacent assembled alternative haplotypes and subsequently run through another complete Arachne assembly process to finalize the assembly. This produced 6,043 scaffold sequences, with a scaffold L50 of 4.9 Mb and total scaffold size of 692.7 Mb. Scaffolds were screened against bacterial proteins, organelle sequences, GenBank nr and were removed if found to be a contaminant. Additional scaffolds were removed if they (1) consisted of >95% of base pairs that occurred as 24mers four other times in the scaffolds larger than 50 kb; (2) contained a majority of unanchored RNA sequences; or (3) were less than 1 kb in length.

For chromosome-scale pseudomolecule construction, markers from the genetic map were placed using two methods. SSR-based markers were placed using three successive rounds of e-PCR with N = 0, N = 1 and N = 3. Markers that had sequence associated with them, including SNP markers, were placed with BLAT51 and blastn52. A total of 19 breaks (16 in high coverage (>6x), 3 in low coverage (≤6x)) were made in scaffolds based on linkage group discontiguity; a subset of the broken scaffolds were combined using 257 joins to form the 11 pseudomolecule chromosomes. Map joins were denoted with 10,000 repeats of the letter N (Ns). The pseudomolecules contained 605.9 Mb out of 691.3 Mb (88%) of the assembled sequence. The final assembly contains 4,952 scaffolds with a contig L50 of 67.2 kb and a scaffold L50 of 53.9 Mb. The completeness of the resulting assembly was estimated using 1,007,962 ESTs from BRASUZ1. The goal of this analysis was to obtain an estimate of the completeness of the assembly, rather than to do a comprehensive examination of gene space. Briefly, ESTs <300 bp were removed, along with chloroplast, mitochondrial or rDNA ESTs. All duplicate ESTs were placed against the genome using BLAT51. The remaining ESTs were screened for alignments that had ≥90% identity and ≥85% EST coverage. The screened alignments indicated that 98.98% of available expressed gene loci were included in the 11 chromosome assemblies.

Gene prediction

To produce the current gene set, we used the homology-based FgenesH and GenomeScan predictions. The best gene prediction at each locus was selected and integrated with EST assemblies using the PASA program37. The gene set shown in the browser was generated from the input gene models at JGI. The gene prediction pipeline was structured as follows: peptides from diverse angiosperms and ∼260,000 EST assemblies (from ∼2.9 M filtered E. grandis ESTs and ∼2.4 M EST sequences from other closely related (‘sister’) Eucalyptus species, assembled with PASA) were aligned to the genome and their overlaps used to define putative protein-coding gene loci. The corresponding genomic regions were extended by 1 kb in each direction and submitted to FgenesH and GenomeScan, along with related angiosperm peptides and/or ORFs from the overlapping EST assemblies. These two sets of predictions were integrated with expressed sequence information using PASA37 against ∼260,000 Eucalyptus EST assemblies. The results were filtered to remove genes identified as transposon-related.

Gene family cluster and gene ontology analysis

The Inparanoid algorithm38,39 was used to identify orthologous and paralogous genes that arose through duplication events. Clusters were determined using a reciprocal best pair match and then an algorithm for adding in-paralogues was applied. The peptide sequences used were from Arabidopsis lyrata, Arabidopsis thaliana, Brachypodium distachyon, Caenorhabditis elegans, Chlamydomonas reinhardtii, Danio rerio, Ectocarpus siliculosus, Escherichia coli, Eucalyptus grandis, Fragaria vesca, Glycine max, Homo sapiens, Jatropha curcas, Mus musculus, Neurospora crassa, Nostoc punctiforme, Oryza sativa, Phoenix dactylifera, Physcomitrella patens, Populus trichocarpa, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Selaginella moellendorffii, Solanum tuberosum, Sorghum bicolor, Synechocystis pcc6803, Theobroma cacao, Vitis vinifera and Zea mays. The sequences were downloaded from Gramene53,54, Phytozome (http://www.phytozome.net) and Ensembl (http://www.ensembl.org). A functional annotation pipeline similar to the one used for strawberry genome annotation10 was used to infer Gene Ontology55 assignments to 29,841 protein coding genes (∼82%). Peptide sequences were analysed through an integrated approach involving Interproscan40, SignalP41, Predotar42, TMHMM43 and orthology-based projections from Arabidopsis.

Green plant phylogeny

We used an integrated approach of gene orthology clustering10 and an automated workflow for phylogenomic analyses56 to reconstruct land plant phylogeny of peptide sequences. A total of 174,020 peptides encoded by single-copy protein coding orthologous nuclear genes from 17 plant genomes (Supplementary Data 7) were identified, aligned and assembled into a supermatrix resulting from conservative and liberal superalignments that retained 42.26% (697,423 amino acids) and 46.35% (764,978 amino acids) of the original 1,650,340 amino acid concatenated alignment for maximum likelihood phylogenetic reconstruction44, respectively (Supplementary Data 7). These alignments come from 3268 orthologous gene clusters with each cluster carrying single copy genes from at least 9 (50%) species.

Protein domain analysis

Domains and domain arrangements were compared within the rosids to distinguish a core set of domains and arrangements present in all rosids and those shared by one or more of the four rosid lineages included in the analysis (Supplementary Data 8). Domains occurring at twice the frequency in Eucalyptus compared to the average abundance in the rosids were defined as overrepresented. If several splice variants were present for one protein, we excluded all but the longest transcript. All proteomes were scanned for domains with the Pfam_scan utility and HMMER 3.0 against the Pfam-A and Pfam-B databases57. For the annotation of Pfam-A domains, we used the model-defined gathering threshold and query sequences were required to match at least 30% of the defining model58. Pfam-B domains were annotated using an e-value cutoff of 10−3. When possible, Pfam-A domains were mapped to clans and consecutive stretches of the same domain were collapsed into one large pseudo-domain59,60. We defined domain arrangements as ordered sets of domains for each protein. For the analysis of arrangements, only Pfam-A domains were used.

Genome-wide mRNA expression profiling

To study the expression of predicted protein-coding and ncRNA genes, RNA-seq reads obtained from Illumina sequencing of seven Eucalyptus tissues (that is, shoot tips, young leaf, mature leaf, flower, roots, phloem and immature xylem, http://www.eucgenie.org/, Hefer et al., unpublished data) were mapped to the Eucalyptus genome using TopHat61 with the Bowtie algorithm62 for performing the alignment. The aligned read files were processed by Cufflinks50, with RNA-seq fragment counts (that is, fragments per kilobase of exon per million fragments mapped (FPKM)) to measure the relative abundance of transcripts. Differential ncRNA expression between the seven Eucalyptus tissues was determined using Cuffdiff50.

ncRNA analysis

To predict ncRNAs in Eucalyptus, the genome sequence was scanned using Infernal63 with the covariance models (that is, a combination of sequence consensus and RNA secondary structure consensus) of 1,973 RNA families in the RFam database v10.1 (refs 64, 65). The bit score cutoff of the Infernal search was set as the TC cutoff value that was used by RFam curators as the trusted cutoff. The Infernal search result was further filtered by an e-value cutoff of 0.01. To examine the ncRNA conservation between Eucalyptus and other plant genomes, the Eucalyptus ncRNA candidate sequences obtained from the Infernal search were used as queries to search against the genome sequences listed above using BLAT51 with a minimum coverage (that is, minimum fraction of query that must be aligned) of 80% and a minimum identity of 60%.

5′ UTR empirical curation

Approximately 2.9 million E. grandis ESTs and ∼700 million RNA-seq reads from seven diverse tissues were used to empirically curate 5′ UTR annotations. At each locus, the predicted, EST and RNA-seq derived 5′ UTR lengths were compared. An empirical annotation was prioritized over an in silico prediction and the longest empirical transcript was preferred. Those loci which had a 5′ UTR reported by only FGenesH retained their annotation as the best current annotation.

Genome evolution

We used the E. grandis genome sequence information (http://www.phytozome.net/eucalyptus.php) to unravel the Myrtales evolutionary palaeo-history leading to the modern Eucalyptus genome structure of 11 chromosomes. Independent intraspecific (that is, paralogue) and interspecific (that is, orthologue) comparisons were necessary to infer gene relationships between Eucalyptus and the other rosid genomes. We applied a robust and direct approach45,46 allowing the characterization of genome duplications by aligning the available genes (36,376) on themselves with stringent alignment criteria and statistical validation.

We used the VISTA pipeline infrastructure47,48 for the construction of genome-wide pairwise DNA alignments between E. grandis and Populus trichocarpa. To align genomes we used a combination of global and local alignment methods. First, we obtained an alignment of large blocks of conserved synteny between the two species by applying Shuffle-LAGAN global chaining algorithm66 to local alignments produced by translated BLAT51. After that we used Supermap, the fully symmetric whole-genome extension to the Shuffle-LAGAN. Then, in each syntenic block we applied Shuffle-LAGAN a second time to obtain a more fine-grained map of small-scale rearrangements such as inversions.

Syntenic regions between Eucalyptus chromosome 3 and Populus chromosome XVIII were defined as segments of contiguous sequence. Each contiguous block of DNA was annotated and cross-compared between the two species. Gene models within the syntenic blocks were compared based on a sliding window representing 10 gene models with an allowance of two intercalated gene models. Genes occurring in tandem repeats on either the Eucalyptus or Populus chromosomes were counted as a single locus in either case. The constructed genome-wide pair-wise alignments can be downloaded from http://pipeline.lbl.gov/downloads.shtml and are accessible for browsing and various types of analysis through Phytozome (http://phytozome.org).

For comparative analysis of the E. grandis and E. globulus genomes, enriched nuclei were extracted using a modified BAC library preparation protocol67 and DNA extracted following Tibbits et al.68. DNA was prepared for sequencing using Illumina TruSeq kits and 100-bp paired-end sequencing was performed on a HiSeq2000. NUCLEAR software (Gydle Inc.) was used to filter for high-quality reads that were then mapped to the E. grandis genome scaffold assembly. The VISION software was used to visualize assemblies and assembly metrics were computed using custom Perl, R and Shell scripts.

Genome function analysis

Using homology to Arabidopsis genes and Pfam domain analysis we identified candidate homologues for lignin, cellulose and xylan biosynthetic genes. All possible family members were identified and their gene expression evaluated in seven developing tissues of E. grandis using Illumina RNA-seq analysis. In particular, we analysed each gene’s expression relative to other family members/isoforms in xylem, as well as relative to the median (∼90,000 FPKM) of xylem expression in the entire transcriptome. Considering each gene’s relative and absolute expression levels, all members expressed over median in xylem were noted (Supplementary Data 5 and Supplementary Data 9). Similarly, a search for conserved protein motifs for the terpene synthase gene family was conducted in eight plant genomes, including E. grandis (Supplementary Information section 6 and Supplementary Data 6). The amino acid sequences were aligned and truncated to compare homologous sites. A maximum likelihood tree was created, rooted by the split between two major types of terpene synthase genes, and nodes were coloured by species.