This is a guest post by Michelle Vierra (@the_mvierra)

UPDATE April 21, 2020: PacBio has released a new Redwood infographic, plus the official blog post that includes the final assembly after more sequencing was done.

New data type spawns curiosity at PAG

After another wonderful year at the Plant and Animal Genome Conference (PAG) in January, my colleagues and I were struck by the wide variety of genomes that have been sequenced with PacBio’s new HiFi reads. HiFi data, which is produced by highly accurate long-read sequencing, strikes a balance between read length — with reads up to 25 kb — and accuracy — with reads that are at least 99% accurate. This balance seems to be a winning recipe for assembling complex genomes: the long read lengths easily span shorter repeats, while the high accuracy helps distinguish large, complex repeats. It’s the best of both worlds.

PacBio HiFi reads provide the best of both worlds by having long read lengths and high accuracy.

HiFi data was the belle of the ball at PAG, where we saw extremely high-quality assemblies for everything from humans to cannabis, fish, and tetraploid plants. The informatics community jumped right on board too: three different tools built specifically to make the most of HiFi data (the HiCanu and Hifiasm assemblers and the Nighthawk phasing tool) debuted during the week!

Three assemblers/phasing tools for HiFi data debuted during PAG 2020 — Nighthawk, Hifiasm, and HiCanu.

The number one question I was asked at PAG was whether we recommend using HiFi or traditional long reads for extremely large genomes, like the ~15 Gb hexaploid wheat genome. At that point, we didn’t have enough data to give a formal recommendation. However, I did get a boost of confidence in the ability of HiFi data to resolve even the largest plant genomes after Kevin Fengler, Comparative Genomics Lead at Corteva Agriscience, presented his assembly of the 11 Gb oat genome, done in only 12 hours, leading to a contig N50 of over 20 Mb!

With the oat genome success in mind, we contemplated what some of the most famously crazy genomes would look like with HiFi data. Would that balance of read length and accuracy tame even the wildest of genome challenges the world had to offer?

Go big or go home — tackling a giant genome

Then an idea struck. Being a California-based company sitting next door to Stanford University piqued our interest in one of the locally famous species of tree — the towering California redwood (also known as the coastal redwood). The California redwood genome is estimated to be around 27 Gb and hexaploid — a beast of a genome by any measure!

Sequoia sempervirens is one of the world’s fastest-growing conifers and can live for thousands of years. Once ubiquitous throughout the Northern Hemisphere, the species has declined so far that only 5% of the original old-growth coast redwood forest remains.

After calculating the tissue material required for DNA extraction and the number of SMRT Cells we’d need to sequence, developing a list of software tools that would be up to the challenge, and assembling a fearless team of PacBio scientists, we decided to go for it!

Luckily, California redwood trees were planted on the beautifully landscaped, public Stanford campus. Emily Hatas, our senior director of business development and fellow plant enthusiast, Greg Young, our Bay-Area-based senior field application scientist, and I packed up some ice, scissors, and a kitchen scale, and headed over to the trees one sunny Monday afternoon. After a quick rinse and flash freezing process, we accomplished step 1 — sample acquisition. We then enlisted the help of our applications development group to isolate DNA with the Circulomics Plant Nuclei kit and generate a HiFi library worthy of the cause, completing step 2 — sample preparation.

Sample collection of California redwood on Stanford campus. In under two hours, the samples were retrieved, rinsed, flash-frozen, and stored for DNA extraction.

After a quick single SMRT Cell test to ensure library quality, we went into full production mode, sequencing 606 Gb of HiFi data over a period of 7 days. This data represented roughly 22-fold coverage of our anticipated 27 Gb genome. We have observed in many HiFi genome assembly projects thus far that the traditional method of generating high coverage to polish out errors isn’t needed, and excellent assemblies have been generated from only 20-fold coverage of HiFi reads. Having hit our coverage target, we felt comfortable cracking on with the genome assembly.
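For anyone planning a similar project, the coverage arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the inputs are simply the 606 Gb yield, the 27 Gb genome estimate, and the 20-fold target quoted in this post.

```python
def fold_coverage(total_yield_gb: float, genome_size_gb: float) -> float:
    """Average fold coverage = total sequenced bases / genome size."""
    if genome_size_gb <= 0:
        raise ValueError("genome size must be positive")
    return total_yield_gb / genome_size_gb


def required_yield_gb(target_coverage: float, genome_size_gb: float) -> float:
    """Total HiFi yield (in Gb) needed to reach a target fold coverage."""
    if target_coverage <= 0 or genome_size_gb <= 0:
        raise ValueError("coverage and genome size must be positive")
    return target_coverage * genome_size_gb


# Figures quoted in the post: 606 Gb of HiFi data, ~27 Gb estimated genome.
coverage = fold_coverage(606.0, 27.0)
print(f"Achieved coverage: {coverage:.1f}-fold")

# Yield needed for the 20-fold coverage mentioned above.
print(f"Yield for 20-fold: {required_yield_gb(20.0, 27.0):.0f} Gb")
```

The same two functions work for any genome size, so they can be reused to estimate how much data a smaller or larger genome would need at a chosen coverage.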

Greg Concepcion, our staff engineer of bioinformatics and resident large genome wrangler, then took the reins for a first attempt at this giant genome assembly. Greg chose Hifiasm since it’s been reported to be one of the fastest assemblers and also focuses on resolving haplotypes, both features important to resolving a 27 Gb hexaploid genome.

After just 6 days on 64 cores with 512 GB of RAM, the assembly finished with no issues along the way, a real testament to the clever coding by Haoyu Cheng and Heng Li in the Hifiasm assembler. The results were amazing: an assembly almost twice the size of the expected genome, with a contig N50 of 1.92 megabases! The larger than expected assembly size, which appears to represent two similar haplotypes rather than the six expected for a hexaploid, seems to agree with the suggestion that the most recent polyploidization event for the California redwood is an autopolyploidy event, as described by Scott et al. Overall, we are very pleased to see the improvements that this genome assembly represents over other recent conifer genomes.

Comparison of conifer assembly results. [1] Hybrid assembly of redwood. [2] Fir assembly by Neale et al. [3] Transcript set of Abies alba from Neale et al. Varying number of transcripts aligned to each genome (4,958 mapped to PacBio HiFi redwood, 4,760 mapped to ONT redwood, 16,187 mapped to Douglas fir)
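For readers less familiar with the contig N50 statistic used to compare these assemblies, here is a minimal Python sketch of how it is commonly computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The function name and the toy contig lengths are made up for illustration and are not taken from the redwood assembly.

```python
from typing import Iterable


def contig_n50(contig_lengths: Iterable[int]) -> int:
    """Return the N50: the length of the contig at which the cumulative
    length (longest contigs first) reaches half of the total assembly size."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("need at least one contig length")
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    # Unreachable for non-empty input, kept to make the control flow explicit.
    return lengths[-1]


# Toy example (hypothetical lengths): total = 30, half = 15.
# Longest-first cumulative sums are 10, 18, ... so the N50 is 8.
print(contig_n50([2, 4, 6, 8, 10]))
```

A higher N50 means that more of the assembly sits in long contigs, which is why it is used here as a shorthand for contiguity.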

No genome too large for HiFi reads

So, what do we get out of our work on the California redwood genome? First, I am now confident in my recommendation to use HiFi data to generate high-quality genome assemblies from any organism. This redwood genome hits the mark on all 3 C’s of genome assembly quality — contiguity, completeness, and correctness. Secondly, it shifts the thinking around large, complex genome assemblies, previously thought to take a ton of time and compute resources for assembly — not to mention the sequencing time. This massive genome was put together in just 17 days — 4 days of sample prep, 7 days of sequencing, and 6 days for assembly. I remember not that long ago it took about the same amount of time to assemble a human genome! High-quality genome assemblies for any organism are now truly accessible to anyone wanting to do one.

Assembling the California redwood using PacBio HiFi reads. The entire workflow from flash-frozen needles to completed assembly took only 17 days.

Lastly and most importantly, we get to share a great new resource with the community so you can explore the data for yourself. We’ve opted to make the assembly and data totally public for anyone who wants to try new or additional tools, browse the genome for interesting biology, or just prove to themselves that the assembly can be done. We are hosting the data and assembly here, so have at it and happy genome exploring!

Although we make it sound easy, we could not have done this project without the amazing team of PacBio colleagues who dedicated their time and expertise to make this project happen. A big thank you to: Lee Chern, Primo Baybayan, Lei Zhu, Harsharan Dhillon, Greg Buck, Patrick McNamara, Richard Hall, Jonas Korlach, and all our bosses for letting us do it.

Michelle Vierra is the Strategic Marketing Manager for Plant & Animal Sciences at PacBio. She can be reached via email or phone (1.650.521.8258)

Interested in finding out more about HiFi data for sequencing your organism of interest? Get in touch with a PacBio scientist to scope out your project.