Tardigrades. Never heard of them? Me neither. But lateral/horizontal gene transfer, I’ve heard of.

In the interest of full disclosure, I did not review the Boothby paper in PNAS. But I was asked to provide a comment for one of the two articles authored by Ed Yong. However, as I’m just coming off maternity leave, I was unable to provide one in time due to a lack of bandwidth and sleep as well as a fear that I might be unfairly biased. As I told a colleague on November 24, “I thought their methodology was flawed and couldn’t put together [even] a coherent sentence as a comment. I think there is transfer but I think they severely over estimate it. I could be biased though by the fact they claim this is the largest, but [I think] my Drosophila has more extensive transfer.”

Now having given it time and thought, I think I can summarize these two manuscripts objectively and coherently from the perspective of someone who has spent a great deal of time trying to assess the levels of lateral gene transfer in animal genomes. In the interest of making this more tractable, I am only going to focus on the claim of massive LGT, and not on any aspect related to functionality or other aspects of either manuscript. Since you can have massive LGT that is not functional, I am going to ignore all of the RNASeq and EST data, and conclusions derived from them. They are essentially irrelevant for establishing the claim of massive LGT.

First the Boothby manuscript and why I suspected a flawed methodology. The red flag came with the statement, “our tardigrade cultures are fed algae, not bacteria, and although our algal cultures are not axenic, we would expect little to no bacterial contamination in our sequencing data.” If they are not axenic, you can expect a great deal of bacterial contamination; talk to any microbiologist.

However, and despite this clear bias, the authors did perform several analyses to try to address this. The first was an examination of coverage, but the details are lacking. It is difficult to ascertain much from the figures referenced (Figure S2 C and D) and the focus is on the SD (presumably standard deviation) of the coverage, not the mean. The second piece of evidence is about bacterial rRNA genes. However, it is quite likely all of the bacterial rRNA sequences are collapsed into single contigs. This has plagued all bacterial genome sequence projects from the dawn of genome sequencing, except those with a single rRNA operon. The multiple rRNA operons collapse into a single repeat that does not typically assemble with the rest of the genome. Therefore, this analysis is actually largely uninformative. Third, it is mentioned that general contamination is minimal. For example, human sequences are not found. Dataset S1 is referenced, which appears to be an annotation of contigs, and does not fully address human contamination. Given the size of the human genome and the amount of data generated, contaminating human sequences are unlikely to be assembled into any sizable contig. The true extent of human contamination should be assessed at the read level to be more accurate. Besides, comparing bacterial and human contamination is like comparing apples and oranges (heck, apples and oranges are at least both fruit).

The manuscript then moves to look at the phylogeny of these sequences, establishing they are truly of bacterial origin. There are over 380 pages of supplementary figures analyzing this. I’m sure it was a tremendous amount of work, one that is often requested by reviewers, but it actually is just one aspect of LGT. It’s unfortunate more space wasn’t devoted to the other aspects. This is followed by an analysis of domains that finds genes of foreign origin contribute many unique protein domains. But this seems premature; are these LGTs actually in the genome?

To establish this, the authors perform PCR to test physical linkage of gene pairs. They obtain PCR products for 104 of 107 randomly selected genes. However, it isn’t reported how these genes were randomly selected. It is hard to truly be random and I suspect these represent the best 107 contigs, not random ones. But kudos to the authors for actually providing images of the gels for all 107 amplifications. Yet, only roughly half (58) support lateral gene transfer since they are the only ones that bridge between the metazoan and foreign gene. The remaining 46 amplify from foreign gene to foreign gene, which cannot exclude contamination. Furthermore, there is a lack of specificity in numerous amplification reactions, with products of different sizes. Frequently, the product of the correct size doesn’t even appear to be the most abundant one. That said, among the proportion I checked, a remarkable number do show strong amplification at the correct size. Still, it is difficult to be confident of the correct size given the limited migration in the gels shown and the highlighted boxes that can distort perception. Combined, it could be easy to conclude that the region was amplified when it was not amplified specifically. The solution would have been to end-sequence the PCR products to verify that the correct product was amplified. There is no indication this was undertaken.

PacBio sequencing was conducted to further support these LGTs. First, low coverage PacBio is not a great method for LGT validation since it has steps in library construction that make it prone to chimeras. This is a known problem we have published on that is not yet widely appreciated. However, LGTs that are recovered in both the PacBio dataset and the Illumina dataset should be real as you wouldn’t expect such random events to occur repeatedly across two platforms. One figure is shown in the manuscript that is used to demonstrate the congruity. Congruity is expected, whether or not these are real LGTs, since most of the sequence is from tardigrade. Furthermore, the PacBio assembly is <60 Mbp compared to the >200 Mbp Illumina assembly. This means that only about a quarter of the data in the Illumina assembly is found in the PacBio assembly. Therefore, a lot of data is missing; quite possibly these LGTs. If that was examined more closely, I couldn’t find where it was presented. However, it does not seem to support the hypothesis.

I’m going to stop critiquing the Boothby manuscript at this point. I am curious to understand how Moleculo sequencing may or may not yield LGT-like sequences, as I make it a point to understand this for all the common sequencing platforms. Given the claim, I’m disappointed at the lack of text about LGT in animals and of references to the literature, including my own work. I wish all the points I’ve raised here had been addressed in the review of this manuscript, but clearly some key points were overlooked for whatever reasons. Unfortunately, it happens. But I do feel the review system failed for these authors.

So that leaves the Koutsovoulos bioRxiv preprint, which has not been peer-reviewed and may have been put together quite rapidly. So kudos to the authors! I’ll try to give these authors the benefit of the doubt. For full disclosure though, that means trying to ignore the parenthetical jab at my own paper that really needs to be supported by either an argument or a citation.

Largely, my concern here is genome cleansing. Genome cleansing by assembly experts, database curators, and scientists has led to the erroneous removal of LGT from genomes time and time again. It is clearly the largest cause of LGT under-estimation. Here, the authors use blobplots to identify contigs with abnormal GC content and coverage. These plots are then used to remove “contaminant” sequences through an iterative assembly process. The problem arises because this assumes that LGTs should have a similar composition to the host genome and similar coverage. This is true for LGTs that have been in the genome for large spans of time and are fixed in the population. However, in organisms that acquire LGTs frequently, you would expect to have LGTs of all ages, including those that have not acquired the compositional biases of the host DNA. You would also expect that they are not fixed in the population, yielding abnormally low coverage distributions. (Although I’ll note that in neither paper could I grasp if the population sequenced was inbred.) We’ve even demonstrated in both insects and nematodes that recent transfers from bacterial Wolbachia endosymbionts can be extensively duplicated, so you might even expect abnormally high coverage distributions. Essentially, these criteria aren’t necessarily good at distinguishing contamination from LGT. In fact, even contigs with the same coverage and compositional biases may not be LGT. Therefore, the test employed also does not adequately test the hypothesis. Even setting that aside, a large number of Bacteroidetes sequences seem to have been removed through the process that had the same coverage and compositional bias, and this isn’t explained or transparent.
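
To make that concern concrete, here is a minimal sketch of a GC/coverage screen. It is not the pipeline used in the preprint, which is iterative and guided by plots rather than fixed cutoffs. The `Contig` class, the `flagged_as_contaminant` function, the host ranges, and every contig value below are hypothetical, invented only to illustrate the logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contig:
    name: str
    gc: float        # GC fraction between 0 and 1 (made-up values below)
    coverage: float  # mean read depth (made-up values below)


def flagged_as_contaminant(contig, host_gc=(0.40, 0.50), host_cov=(80.0, 120.0)):
    """Return True if the contig falls outside the assumed host GC or coverage range.

    The default ranges are illustrative assumptions, not measured tardigrade values.
    """
    gc_ok = host_gc[0] <= contig.gc <= host_gc[1]
    cov_ok = host_cov[0] <= contig.coverage <= host_cov[1]
    return not (gc_ok and cov_ok)


contigs = [
    Contig("host_like", gc=0.45, coverage=100.0),
    Contig("old_fixed_lgt", gc=0.46, coverage=95.0),          # ameliorated to host composition
    Contig("recent_unfixed_lgt", gc=0.62, coverage=30.0),     # donor-like GC, not fixed
    Contig("bacterial_contaminant", gc=0.62, coverage=30.0),  # indistinguishable from the above
    Contig("duplicated_recent_lgt", gc=0.36, coverage=400.0), # extensively duplicated transfer
]

for c in contigs:
    print(c.name, flagged_as_contaminant(c))
```

In this toy example the recent LGT and the genuine contaminant have identical values, so no choice of thresholds can separate them; only physical linkage to host sequence can.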

So, is there LGT in the tardigrade genome? Both papers suggest yes; it is the extent that is in question. I suspect that Boothby et al. applied criteria that are too liberal while Koutsovoulos et al. applied criteria that are too conservative. Reality may lie in the vast expanse between the two estimates. An analysis of the coverage of junction-spanning read pairs (JSPRs) may prove informative. Chimeras should occur randomly in a standard Illumina or PacBio run. Therefore, chimeras in the assembly will only be supported by a single pair of reads. Breaking contigs at regions supported by only a single pair of reads and eliminating resulting contigs that lack a metazoan sequence may yield a better estimate of the true extent of LGT, although it will still be an estimate.
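
The counting step of such an analysis could be sketched roughly as follows. This is only an illustration under stated assumptions: read pairs are already mapped and reduced to the outermost coordinates of each fragment, junction positions between metazoan and foreign sequence are already known, and the names `Junction`, `spanning_support`, and `junctions_to_break` are hypothetical rather than taken from any existing tool.

```python
from collections import Counter
from typing import NamedTuple


class Junction(NamedTuple):
    contig: str
    position: int  # assumed coordinate of the metazoan/foreign boundary


def spanning_support(junctions, pairs):
    """Count read pairs whose fragment spans each junction.

    `pairs` is an iterable of (contig, fragment_start, fragment_end) tuples,
    one per properly mapped read pair (an assumed, simplified input format).
    """
    support = Counter({j: 0 for j in junctions})
    by_contig = {}
    for j in junctions:
        by_contig.setdefault(j.contig, []).append(j)
    for contig, start, end in pairs:
        for j in by_contig.get(contig, []):
            if start < j.position < end:
                support[j] += 1
    return support


def junctions_to_break(support, min_pairs=2):
    """Junctions with fewer than `min_pairs` spanning pairs are candidate chimeras."""
    return sorted(j for j, n in support.items() if n < min_pairs)


junctions = [Junction("contig1", 500), Junction("contig2", 300)]
pairs = [("contig1", 400, 700), ("contig1", 450, 650), ("contig2", 100, 400)]
support = spanning_support(junctions, pairs)
print(junctions_to_break(support))
```

Contigs would then be split at the returned junctions, and any resulting piece without metazoan sequence set aside. The `min_pairs` default of 2 reflects the single-pair reasoning above; a real analysis would likely scale it with sequencing depth.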

So how can we know more definitively? Ultimately, the experiment needs to be designed from the beginning to test that, minimizing contamination and using a strategy that minimizes artefactual chimeras. I’m guessing neither group set out to examine LGT in the genome, so they ultimately didn’t devise an experiment where it can be established well. PacBio sequencing to obtain a complete genome on a homozygous inbred line with validation of metazoan-bacterial junctions by amplification and subsequent end sequencing verification should answer the question. Making it a homozygous line reared as aseptically as possible would be even better, possibly using antibiotics for multiple generations to remove bacterial contaminants.

Is that possible? I don’t know; what’s a tardigrade? (Actually, they seem fascinating and I’m glad to have learned more about them).

A few other points. I’ve seen criticism of the UNC authors (Boothby et al.) for not having their data already available at NCBI. There can be many reasons for this, not least of which is that genomes containing LGT typically take a long time to make it through the NCBI submission process. One of the steps is a “contaminant” screen that in effect removes all LGTs, whether they are real or not, whether they are experimentally validated or not. This needs to be remedied.

The UNC authors were also criticized for not comparing their genome to the genome from the Blaxter group. However, I think this was the right call. First, the Blaxter group should be given the first opportunity to publish any large genome comparisons. Second, the Blaxter group demonstrates the value in having them present the comparison because they understand how the genome was assembled. Too often scientists consider genomes to be static pieces of DNA that are unambiguous. Often, this is far from the case.

UPDATE (12/7/15): The UNC/Boothby data was posted online upon numerous requests at http://weatherby.genetics.utah.edu/seq_transf/ between Nov. 30 and Dec. 4. Also, the genes were picked using a random number generator as described in the supplementary information.