There are 10× more bacterial cells in our bodies from the microbiome than human cells. Viral DNA is known to integrate in the human genome, but the integration of bacterial DNA has not been described. Using publicly available sequence data from the human genome project, the 1000 Genomes Project, and The Cancer Genome Atlas (TCGA), we examined bacterial DNA integration into the human somatic genome. Here we present evidence that bacterial DNA integrates into the human somatic genome through an RNA intermediate, and that such integrations are detected more frequently in (a) tumors than normal samples, (b) RNA than DNA samples, and (c) the mitochondrial genome than the nuclear genome. Hundreds of thousands of paired reads support random integration of Acinetobacter-like DNA in the human mitochondrial genome in acute myeloid leukemia samples. Numerous read pairs across multiple stomach adenocarcinoma samples support specific integration of Pseudomonas-like DNA in the 5′-UTR and 3′-UTR of four proto-oncogenes that are up-regulated in their transcription, consistent with conversion to an oncogene. These data support our hypothesis that bacterial integrations occur in the human somatic genome and may play a role in carcinogenesis. We anticipate that the application of our approach to additional cancer genome projects will lead to the more frequent detection of bacterial DNA integrations in tumors that are in close proximity to the human microbiome.

The bacterial cells of the human microbiome outnumber the human cells in the body 10 to 1. Many of those bacteria are in constant, intimate contact with human cells. We sought to establish whether bacterial cells insert their own DNA into the human genome. Such insertions are random mutations that could cause disease in the same manner as mutations induced by mutagens like UV rays from the sun or chemicals in cigarettes. We detected the integration of bacterial DNA in the human genome more readily in tumors than in normal samples. In particular, extensive amounts of DNA with similarity to Acinetobacter DNA were fused to human mitochondrial DNA in acute myeloid leukemia samples. We also identified specific integrations of DNA with similarity to Pseudomonas DNA near the untranslated regulatory regions of four proto-oncogenes. This supports our hypothesis that bacterial integrations occur in the human somatic genome and may play a role in carcinogenesis. Further study in this area may provide new avenues for cancer prevention.

Funding: This work was funded by the National Institutes of Health through the NIH Director's New Innovator Award Program (1-DP2-OD007372) and the NSF Microbial Sequencing Program (EF-0826732). The computational aspects of this work were completed on the IGS Data Intensive Grid (DIAG) funded through NSF Major Research Instrumentation Program (DBI-0959894) to Dr. Owen White and Mr. Anup Mahurkar (diagcomputing.org). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Using publicly available sequence data from the human genome project, the 1000 Genomes Project, and The Cancer Genome Atlas (TCGA), we examined bacterial DNA integration into the human somatic genome, particularly tumor genomes. Here we show that bacterial DNA integrates in human somatic genomes more frequently in tumors than normal samples. These data also support our hypothesis that bacterial integrations occur in the human somatic genome and may lead to altered gene expression.

Almost all cancers associated with Hepatitis B virus (HBV) have the virus integrated into tumor cells [34]. Most of the observed HBV integrations have been isolated as a single occurrence from a single patient [6]. However, a few recurrent integrations into genes promoting tumor formation have been identified, such as the integration of HBV into the human telomerase reverse transcriptase gene [35], [36]. These mutations can result in altered gene expression and promote carcinogenesis. The advent of next generation sequencing has facilitated the investigation of how and where these viruses integrate into the human genome with unprecedented resolution and accuracy. In a recent study, next generation DNA and RNA sequencing identified HBV integrations in liver cancer genomes and concluded that the HBV integrations disrupted chromosomal stability and gene regulation, which was correlated with overall shortened survival of individuals [6].

One of the key mechanisms by which some viruses promote carcinogenesis is through their integration into the human genome, causing somatic mutations [29]–[31]. In the early 20th century viruses were suggested as a transmissible cause of cancer. However, it was not until the mid-1960s that the capability of viruses to promote human cancer was fully recognized [29]. The majority of viral-associated human cancers are related to infection with human papillomaviruses (HPV), hepatitis B and C viruses, and Epstein-Barr virus. Together these viruses are associated with ∼11% of the global cancer burden [32]. In 2002, cervical cancers resulted in ∼275,000 deaths, and HPV had integrated into ∼90% of these cancers [33].

Bacterial plasmids have also been engineered to integrate autonomously in vertebrate genomes using the phiC31 integrase. A phiC31 integrase-containing plasmid was first shown to integrate into human cells in vitro [27] at a pseudo-attP site that does not disrupt normal gene functions. The plasmid also integrates into mice in vivo after hydrodynamic tail-vein injection [28] and can yield a properly expressed protein that rescues a mouse knockout phenotype [28].

The bacterium Bartonella henselae has also been shown to transform human cells in vitro. B. henselae is a human opportunistic pathogen that causes cat-scratch disease [24]. B. henselae and B. quintana are the only known bacteria to cause bacillary angiomatosis, the formation of benign tumors in blood vessels [24], [25]. A recent study demonstrated the ability of B. henselae to integrate its plasmid into human cells in vitro through its type IV secretion system [26].

One of the best studied examples of LGT from bacteria to eukaryotes is LGT to plants from the bacterium Agrobacterium tumefaciens. A. tumefaciens uses a type IV secretion system to inject bacterial proteins and its tumor inducing plasmid into plant cells [19]. Through illegitimate recombination, the plasmid integrates into the plant genome, and plasmid encoded transcripts are produced using endogenous eukaryotic promoters [20], [21]. The corresponding proteins create a specific carbon source for A. tumefaciens and promote the formation of plant tumors [19], [22]. Therefore, A. tumefaciens creates a tumor environment that promotes the bacterium's own growth. A. tumefaciens has been shown to transform a variety of plant and non-plant cells including human cells in vitro [22], [23].

Some eukaryotes have extensive vertically inherited LGT despite potential barriers such as the nucleus, the immune system, and protected germ cells. DNA continues to be transferred from mitochondria and chloroplasts into the eukaryotic nucleus. These organelles originated from α-proteobacteria and cyanobacteria, respectively [10]. LGT from bacteria to eukaryotes, including animals, is also quite widespread [11]–[13], particularly from endosymbionts [3]. Wolbachia endosymbionts infect up to 70% of all insects [14], with ∼70% of examined, available invertebrate host genomes containing gene transfers [15]. The amount of genetic material transferred ranges from 100 bp [15], [16] to bacterial genome sized LGTs [15], [17], [18].

Previous studies have examined LGT from bacteria to humans that would result in vertical inheritance. During the original sequencing and analysis of the human genome, 113 proteins putatively arising from bacterial LGT were initially identified [7]. This was later refuted by an analysis demonstrating that the number of putative LGTs depends on the number of reference genomes used in the analysis, suggesting that the proteins found exclusively in both bacteria and humans at that time were due to the small sample size of genomes sequenced, instead of LGT [8]. A subsequent phylogenetic analysis of LGT in the human genome overlooked comparisons with all prokaryotes [9]. Both analyses only focused on full length genes, missing any smaller LGTs or LGT of non-coding DNA. In addition, by focusing on consensus genome sequences, these analyses focused on LGT to the germ line and ignored somatic cell mutations. While LGT to the germ line can affect future generations and potentially the evolution of our species, LGT to somatic cells has the potential to affect an individual as a unique feature of their personal genome.

Lateral gene transfer (LGT) is the transmission of genetic material by means other than direct vertical transmission from progenitors to their offspring, and has been best studied for its ability to transfer novel genotypes between species. LGT occurs most frequently between organisms that are in close physical proximity to one another [1]. Human somatic cells are exposed to a vast microbiome that includes ∼10^14 bacterial cells that outnumber human cells 10:1 [2]. Considering that (a) some human cells are in a constant and intimate relationship with the microbiome, (b) eukaryotes have widespread LGT from bacteria [3], (c) bacteria in vitro can transform the mammalian genome [4], and (d) viruses integrate into the human genome and cause disease [5], [6], we sought to investigate whether LGT from bacteria to human somatic cells may be a novel mutagen and play a role in diseases associated with DNA damage like cancer.

Results

Identifying bacterial integrations in the somatic human genome

Human DNA for genome sequencing is typically isolated from one of three sources: sperm, blood, or cell lines created by transforming collected cells. Most of the data presented here from the Trace Archive and 1000 Genomes Project were collected from the latter two. Systematic comparisons of the integration rate based on tissue source are not possible because the metadata on source can be missing, internally inconsistent, or at odds with publications of the data. However, it is important to consider that some of the data arises from cell lines. Cell lines may be more permissive to LGT from bacteria. Cell lines are used frequently because once they are generated they can be maintained in the laboratory, allowing greater access to materials by more researchers. On the other end of the spectrum, transfers of bacterial DNA in sperm cells could be inherited by a subsequent generation. In contrast, transfers in blood cells would generate somatic mutations that would not be inherited. In addition, if a transfer occurs in a terminally differentiated cell, even its fate within the individual would be limited. Somatic mutations are frequently overlooked in genome sequencing as there may be only a single instance within the sequenced population of cells that is lost in the consensus-built genome assembly. Therefore, we examined all available human sequence traces for evidence of LGT to somatic cells. Previously, we had developed a pipeline for rapidly identifying LGT between Wolbachia and its hosts by using NUCMER [37] (Figure 1A). BLASTN against NT was used to further validate such transfers. Using this pipeline, 8 of the 11 hosts of Wolbachia endosymbionts that were examined were found to have evidence of LGT between the endosymbiont genome and the host chromosome [15]. In five of these hosts, we were able to successfully characterize every LGT we attempted to validate using standard laboratory techniques [15]. The other three hosts were not examined further.

Figure 1. LGT from bacteria to human somatic cells using Trace Archive data. The schematic illustrates our pipeline that identified 319 clones (a) and 680 traces (b) with the hallmarks of LGT from bacteria to humans using Trace Archive data (Panel A). The traces and clones with similarity to Lactobacillus casei are randomly distributed across the bacterial genome (Panel B). The BLAST search results for one of these reads shows the left portion with similarity to Lactobacillus casei ATCC334 (Panel C), while the right portion of the read has similarity to the human SCCA2 gene (Panel D). The transfer of Lactobacillus casei DNA occurs in the fourth intron of the SCCA2 (SerpinB4) gene. The chromatogram (Panel E) shows the junction between the sequences in C and D and appears to be a single, high quality sequence trace. https://doi.org/10.1371/journal.pcbi.1003107.g001

Bacterial LGT in the trace archive

Given our prior success with the NUCMER-based pipeline, we used it to search for LGT in the somatic cells of humans. We searched 113,046,604 human shotgun Sanger traces from 13 sequencing centers and >8 individuals against 2,241 bacterial genomes using NUCMER (Figure 1A). All reads were subsequently searched against NT with BLASTN (Figure 1A) and manually curated to identify (a) reads containing non-overlapping matches to human and bacteria sequences (Table S1) and (b) read pairs where one read matched human and the other matched bacteria (Table S2). These searches revealed a total of 680 traces that contain significant non-overlapping similarity to both bacteria and human sequences (Figure 1Aa, Table S1). There are also 319 identified clones that contain sequences with similarity to both bacteria and human sequences (Figure 1Ab, Table S2). For example, 40 traces and 220 clones contain bacterial fragments with best BLAST matches to Lactobacillus spp. when NT was the database. These matches were found to be distributed across an entire Lactobacillus genome (Figure 1B) and could not be assembled. The lack of coverage/redundancy across the LGT junctions may be indicative of somatic cell transfers. As an example, one such trace is illustrated that disrupts a gene encoding an antigen found in squamous cell carcinomas [38] (Figure 1CD). The trace containing this junction does not show evidence of an artifact (e.g. two clones being sequenced simultaneously) (Figure 1E).

Laboratory artifacts can lead to sequences resembling bacteria-eukaryote somatic cell LGT. Errors can occur in clone or sequence tracking, such that traces are assigned to the wrong project, or through contamination of plasmid preparations that leads to two sequences being generated simultaneously. Some cases of these were identified and systematically culled. For example, reads with matches to E. coli were systematically eliminated because of the high potential for artifactual contamination of genomic DNA in plasmid sequencing preparations. Similarly, all matches involving Erythrobacter were eliminated since a set of traces submitted by one center was found to contain two sequences, one for human and one for Erythrobacter, likely owing to systematic contamination of the culture stocks or the plasmid preparations. When two templates are present, the resulting read will switch between the two templates as the relative signal between the templates changes, resulting in a consensus read call that resembles LGT. However, such artifacts are not readily apparent for any of the putative LGTs described here since the sequences span multiple plates, libraries, and runs and show no evidence of two templates (Table S1, S2).

Ligation of bacterial DNA to human genomic DNA during library construction can also result in chimeric clones, in which a single clone carries both a bacterial insert and a human insert. This would be observed as a low percentage of bacteria-human mate pairs relative to bacteria-bacteria mate pairs. For example, if 1 in every 100,000 clones contains two inserts, as opposed to the single insert expected, one would expect a chimeric clone with both a human and a bacterial insert to occur at a rate of no more than 1/100,000, or 0.001%. Considering that human sequences greatly outnumber bacterial sequences, we would expect clones with bacteria and human inserts to occur much less frequently than human-human chimeras and that the number of bacteria-human chimeras will be almost solely based on the amount of bacterial DNA in the samples. We would also anticipate that if 0.001% of bacterial reads are found in bacteria-human chimeric clones then 0.001% of human reads will be found in human-human chimeric clones and be discordant in the human genome.
However, we find that the percentage of reads or read pairs supporting integration relative to the number of human mate pairs is higher than one would anticipate or has been measured previously. The average percentage of bacteria-human mate pairs compared to bacteria-bacteria mate pairs is ∼6% (319 highly curated bacteria-human clones/5,280 minimally curated bacteria-bacteria clones), meaning 6% of the bacteria sequences are attached to human sequences. If the bacteria-human sequences were the result of artifactual chimeras, we would expect that 6% of the human sequences should also be erroneously attached to non-adjacent human sequences. This level of artifact chimerism would undermine assembly as well as results regarding human genome structural variation. To the contrary, one such structural variation study found that <1% of the mate pairs were discordant with the reference human genome [39] using some of the same genome sequencing data used here. While it would be prudent to measure the human-human chimerism rates across all the data to compare to the bacteria-human chimerism rates, the lack of a strict ontology for the metadata precludes this. Specifically, it is difficult to determine the exact nature of the pertinent data needed (i.e. sequencing strategy and insert size) for such an analysis.
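The chimerism argument above reduces to a short calculation. The following is a minimal Python sketch, not part of the original analysis: the function name `mate_pair_ratio` is hypothetical, the clone counts are the ones quoted in the text, and the 1-in-100,000 two-insert rate is the illustrative figure used in the text rather than a measured value.

```python
def mate_pair_ratio(bacteria_human: int, bacteria_bacteria: int) -> float:
    """Ratio of bacteria-human clones to bacteria-bacteria clones."""
    if bacteria_bacteria <= 0:
        raise ValueError("bacteria_bacteria must be positive")
    return bacteria_human / bacteria_bacteria


# Clone counts quoted in the text (319 curated bacteria-human clones,
# 5,280 minimally curated bacteria-bacteria clones).
observed = mate_pair_ratio(319, 5_280)

# Illustrative two-insert clone rate from the text; an assumption,
# not a measured library property.
expected_artifact = 1 / 100_000

# How many times larger the observed ratio is than the illustrative rate.
fold_excess = observed / expected_artifact

print(f"observed bacteria-human ratio: {observed:.1%}")
print(f"fold excess over illustrative artifact rate: {fold_excess:,.0f}")
```

Under these assumptions the observed ratio exceeds the illustrative artifact rate by several thousand-fold, which is the gap the paragraph above relies on. The comparison is only as good as the assumed artifact rate, which is why the structural variation benchmark [39] matters.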

LGT in the 1000 Genomes Project

Using this pipeline on 3.15 billion Illumina read pairs from the 1000 Genomes Project available as of February 2011, we identified 7,191 read pairs that supported bacterial integration into the somatic human genome after BLASTN validation, removal of PCR duplicates, and a low complexity filter. The integrations have up to 5× coverage on the human genome. Of the 484 individuals examined, 153 individuals have evidence of LGT from bacteria with 1 individual having >1000 human-bacteria mate pairs and 22 individuals having >100 such pairs. On average, 47 human-bacteria mate pairs were identified in these individuals with putative somatic LGT (median = 2; maximum = 1360). These putative somatic cell LGTs were identified in data from all five centers that contributed data to this release.

Bradyrhizobium was the most common OTU identified in the reads supporting LGT, with Bradyrhizobium sp. BTAi1 being the most common strain-level OTU. The Bradyrhizobium-like reads were distributed across an entire reference Bradyrhizobium genome (Figure S4A) similar to what was observed for Lactobacillus sequences in the Trace Archive data (Figure 1B). BTAi1 is a strain that is unusual in its ability to fix nitrogen and carry out photosynthesis. Therefore, some may consider the presence of BTAi1-like sequences in humans unusual. However, our understanding of what bacteria exist in the body is limited. Most of the samples containing Bradyrhizobium-like reads were from the Han Chinese South (CHS) population and were sequenced by the Beijing Genomics Institute (BGI). OTUs associated with bacterial integration that were detected in only one center may be viewed suspiciously, and several, including this one, were observed. However, population level differences in the diet, lifestyle, and microbiome of the different populations examined could also lead to this result. The CHS study is an example of the difficulties in ascertaining the source of the material sequenced. The study information in the SRA states that lymphoblastoid cell lines were used (SRP001293; http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?study=SRP001293), but the sample information states that blood was used (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=samples).

Two OTUs, Propionibacterium acnes and Enterobacteriaceae, were detected in samples from all five centers. P. acnes is a common skin bacterium that is associated with acne. It is thought to contaminate genomic DNA preparations either from laboratory workers or during sample collection. Whether bacterial DNA arises from contaminants or the microbiome, laboratory artifact chimeras in Illumina whole genome shotgun sequencing that resemble bacterial integrations can occur (a) during PCR amplification steps in library construction or (b) from over-clustering on the flow cell [47]. The other OTU found across all five centers is a family level assignment of Enterobacteriaceae, which includes Escherichia coli. While next generation sequencing no longer relies on plasmid-based clones, it does use ligation steps and recombinant enzymes isolated from E. coli. Therefore, it is quite possible that low levels of E. coli DNA could be introduced with the enzyme preparations. Because both E. coli and P. acnes DNA in samples could arise from contamination of the samples, out of an abundance of caution they were excluded from all analyses. However, we note that this may be a conservative approach given that other Enterobacteriaceae may be found in the samples besides E. coli and both E. coli and P. acnes could contribute to bacterial integration.

Distinguishing bacterial integrations from laboratory artifacts

Given that the putative LGTs detected are likely some combination of real LGT and laboratory-based artifacts of reads from the microbiome, we sought to establish a metric by which the two could be differentiated. Given the short length of these reads, our analysis of next generation sequencing data focused solely on Illumina paired end data, identifying putative bacterial integrations when one read mapped to human and one to bacteria. Due to the length of the reads, chimeric reads could not be identified with BWA (e.g. a 50-bp read that had 25 bp mapping to a bacterium and 25 bp mapping to human could not be identified with BWA because it would remain unmapped). Given the sole use of paired end data, reads from the microbiome were defined as those where both reads only map to a bacterial genome. This is, however, an oversimplification since any integration of bacterial DNA larger than the library insert size is likely to generate such reads. Regardless, the microbes that contribute to putative LGT are just a subset of the microbes present (Figure S5). If junctions of bacteria-human read pairs are merely artifacts, one would anticipate that they form in the same proportion relative to the contaminating DNA. However, this was not observed (Figure S5). Each OTU could be binned into one of two categories based on the difference between the composition of the microbiome and the LGT reads: (A) one where the contribution of the specific bacteria relative to the total population of bacteria is higher in the reads supporting LGT and (B) one where the contribution of the specific bacteria relative to the total population is higher in the reads coming from the microbiome. One would anticipate that the former would contain bacteria participating in real LGT, since the proportion of reads with putative LGT is higher, while the latter would represent the level of artifactual chimeras from contaminating DNA. This cannot be examined on a per sample basis since most samples have a limited amount of bacterial DNA. However, when the data is aggregated across the entire project (Figure 4A), the bacteria do in fact fall into either of these two categories. As expected, bacteria in the families of Propionibacterineae and Enterobacteriaceae fall into category B, along with Xanthomonadaceae. In contrast, Bradyrhizobiaceae falls into category A.
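The category A/B binning described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the pipeline used in the study: the function names `proportions` and `bin_otus` are hypothetical, the OTU labels and read counts are made-up toy values, and assigning ties to category B is an arbitrary choice made here.

```python
from typing import Dict


def proportions(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert per-OTU read counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {otu: n / total for otu, n in counts.items()}


def bin_otus(microbiome: Dict[str, int],
             integration: Dict[str, int]) -> Dict[str, str]:
    """Label each OTU 'A' if it is relatively more abundant among reads
    supporting integration, otherwise 'B' (ties go to 'B' by assumption)."""
    p_mic = proportions(microbiome)
    p_int = proportions(integration)
    bins = {}
    for otu in sorted(set(p_mic) | set(p_int)):
        over_in_integration = p_int.get(otu, 0.0) > p_mic.get(otu, 0.0)
        bins[otu] = "A" if over_in_integration else "B"
    return bins


# Toy counts for two placeholder OTUs; these are NOT values from the study.
toy_microbiome = {"otu_x": 900, "otu_y": 100}   # both mates map to bacteria
toy_integration = {"otu_x": 20, "otu_y": 80}    # one mate human, one bacterial

bins = bin_otus(toy_microbiome, toy_integration)
print(bins)
```

Normalizing each read set separately is what makes the comparison meaningful: an abundant contaminant can contribute many chimeric pairs in absolute terms and still fall into category B.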

Figure 4. Relative proportion of OTUs in the microbiome compared to the proportion in bacterial DNA integration. The relative contribution of an OTU at the family level is shown (Panel A) for the microbiome (blue) and bacterial integrations (purple). OTUs that are over-represented in the microbiome include several common lab contaminants that were observed at low levels across multiple samples and centers (e.g. P. acnes). OTUs that are over-represented in the bacterial integrations are more likely to be the organisms mutagenizing the human somatic genome. The contribution of λ phage to the microbiome (blue) and integrations (red) was measured and illustrated separately (Panel B). https://doi.org/10.1371/journal.pcbi.1003107.g004

In a preliminary analysis, the phage λ was observed to fit into category A. In the above analysis, it is not observed because λ, a bacteriophage, has similarity to sequences with an NCBI taxonomy of “cloning” and “expression vector” that are excluded with our final criteria. However, if we specifically include the λ reads, λ falls within category A (Figure 4B). The reads map only to a small portion of the λ phage, specifically ranging in coverage from 50×–250× on both sides of a HindIII site. It is possible that this is a contaminant as λ is commonly used in research labs. For instance, an excised gel slice may have been contaminated with a λ fragment from an adjacent lane containing a λ ladder. However, this is not consistent with having reads on both sides of the same HindIII site. If the slice was contaminated with two ladder fragments, we would anticipate equal numbers of reads at two additional sites reflecting both ends of the fragment, which was not observed. We could not reconstruct, with in silico digestions of common and uncommon restriction endonucleases, a scenario that explains our observation and reflects what is known about laboratory artifacts in genome sequence data.
Should this integration of λ in the human genome be validated, it is intriguing since a phiC31 integrase-containing plasmid has already been shown to integrate into human cells in vitro [27] at a pseudo-attP site. If prophage can integrate naturally into the human genome, they may also be capable of producing virions that would serve as an immune defense against certain bacteria.

Rate of integration of bacterial DNA in the human genome

To further explore the relationship between bacterial integrations and laboratory artifacts, we sought to establish the mutation rate across each dataset as well as within subsets. The Trace Archive and 1000 Genomes data are derived from terminally differentiated blood samples, where integrations are expected to occur in a single generation. In the Trace Archive data [7], [45], [48] a total of 680 traces contain significant non-overlapping similarity to both bacteria and human sequences and 319 clones contain both bacteria and human sequences (Tables S1 and S2). From this data, an integration rate was measured as 680 integrations in 113,046,604 reads per single generation, or 6.02×10^-6 integrations/generation. While this may be considered an overestimate due to known laboratory artifact chimeras that result from cloning, it may also be an underestimate as reads deposited in the Trace Archive are often cleansed of reads believed, but not proven, to be from bacterial contaminants. In the Illumina reads from the 1000 Genomes Project [49], 7,191 read pairs supporting integration were detected in 3,153,669,437 paired reads sequenced, yielding a remarkably similar mutation rate of 2.28×10^-6 integrations/generation assuming the mutations happen in a single generation. This mutation rate would reflect both integrations and the formation of laboratory artifactual chimeras. To establish the contribution of the laboratory artifacts, we examined putative integrations involving OTUs of Propionibacterium. If reads with this OTU arose from contamination, then any bacteria-human read pairs would arise from laboratory artifacts. Of the 845,260,743 read pairs in runs containing putative integrations and/or reads attributed to the microbiome with a Propionibacterium-level OTU, 191 read pairs represented putative integrations, yielding a mutation rate of 2.26×10^-7, or 10-fold lower than that for the entire dataset. A similar analysis of λ, which may represent true integrations for the reasons outlined above, reveals 554 reads supporting integration out of 404,243,537 read pairs, or a mutation rate of 1.37×10^-6, which is 6-fold higher than the Propionibacterium rate.
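The rates in this section are simple ratios of supporting reads to reads sequenced, and they can be recomputed from the counts quoted in the text. The sketch below is a convenience for checking the arithmetic, not code from the study; the `Subset` class and its labels are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """Counts for one dataset or subset, as quoted in the text."""
    label: str
    supporting: int   # reads or read pairs supporting integration
    sequenced: int    # reads or read pairs examined

    @property
    def rate(self) -> float:
        """Integrations per read (pair), assuming a single generation."""
        return self.supporting / self.sequenced


subsets = {
    s.label: s
    for s in (
        Subset("trace_archive", 680, 113_046_604),
        Subset("1000_genomes", 7_191, 3_153_669_437),
        Subset("propionibacterium", 191, 845_260_743),
        Subset("lambda", 554, 404_243_537),
    )
}

for s in subsets.values():
    print(f"{s.label:>18}: {s.rate:.2e}")

# Fold differences relative to the presumed-contaminant subset.
artifact_rate = subsets["propionibacterium"].rate
overall_vs_artifact = subsets["1000_genomes"].rate / artifact_rate
lambda_vs_artifact = subsets["lambda"].rate / artifact_rate
print(f"1000 Genomes / Propionibacterium: {overall_vs_artifact:.1f}-fold")
print(f"lambda / Propionibacterium: {lambda_vs_artifact:.1f}-fold")
```

Treating the Propionibacterium subset as the artifact baseline follows the reasoning in the text; if some of those read pairs reflect real integrations, the baseline would be an overestimate of the artifact rate.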

Coverage lends support for integrations

Coverage across a bacterial integration would provide greater evidence of its validity and would be observed when more than 1 unique read is present at a single site. Uniqueness of the reads was assessed with PRINSEQ after concatenating the two reads together and identifying whether the concatenated sequences are identical. Such identity of both sequence and insert size suggests that the pairs are either PCR or optical duplicates formed during library construction or sequencing, respectively, and should be counted only once. If coverage of unique read pairs supporting LGT across the human genome can be observed, it may suggest clonal expansion of a population with the LGT and support that they were formed biologically in vivo rather than through laboratory-based artifact formation in vitro. When the analysis is limited to putative LGT with >1× coverage on the human genome, only 275 read pairs support somatic cell LGT. The most predominant bacterial species level OTU, with 100 read pairs, is Stenotrophomonas maltophilia, an emerging opportunistic pathogen of the respiratory and blood systems of immunocompromised individuals [50]. The Stenotrophomonas-like reads were evenly distributed across the bacterial genomes (Figure S4B). Reads supporting S. maltophilia-like LGT were detected in two individuals in the study of Utah residents with Northern and Western European ancestry (CEU) sequenced at the Max Planck Institute of Molecular Genetics (MPIMG). One individual had the majority with 97 of these read pairs. While read pairs with >1× coverage were only detected in two samples from one site, when the coverage limit was relaxed, 450 read pairs with an S. maltophilia-level OTU were detected in both the CEU and CHS studies and from both MPIMG and BGI. While compelling, given the low coverage, the data from the 1000 Genomes Project is inconclusive in the absence of experimental validation. Yet, in terminally differentiated cells, like blood cells that are routinely sequenced, somatic cell LGT cannot be validated because the transfer sequenced was destroyed in the process of sequencing and is likely the only copy that exists. Transfers could occur in progenitor cells, but this is less likely as such cells are typically well protected. Furthermore, extensive coverage is not expected for the same reason. In several cases, we could identify coverage that further supports the validity of these reads but these instances were quite limited. In addition, much of the 1000 Genomes data examined are from the first pilot study that only generated 0.5–4× coverage of the genomes. Lastly, much of the DNA for the 1000 Genomes Project is derived from cell culture, not directly from blood cells. There is an opportunity for LGT to happen in cell culture that would not necessarily be biologically relevant. Therefore, we sought to validate these results further by examining data from cancer samples in TCGA.
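The duplicate filter used in this section (concatenate the two mates, then keep one copy of each identical concatenation) can be illustrated with a short Python sketch. This is not PRINSEQ's implementation and makes no claim about its options; the function name `unique_read_pairs` and the toy sequences are hypothetical, and the separator character is an addition made here so that pairs with unequal read lengths cannot collide.

```python
from typing import Iterable, List, Tuple

ReadPair = Tuple[str, str]


def unique_read_pairs(pairs: Iterable[ReadPair]) -> List[ReadPair]:
    """Keep the first occurrence of each read pair whose concatenated
    mate sequences are identical; later exact copies are treated as
    PCR or optical duplicates and dropped."""
    seen = set()
    unique: List[ReadPair] = []
    for read1, read2 in pairs:
        # Concatenate the mates; '|' cannot occur in a nucleotide sequence.
        key = read1 + "|" + read2
        if key not in seen:
            seen.add(key)
            unique.append((read1, read2))
    return unique


# Toy read pairs (made-up sequences, not data from the study).
toy_pairs = [
    ("ACGTACGT", "TTGACCAA"),
    ("ACGTACGT", "TTGACCAA"),  # exact duplicate of the first pair
    ("ACGTACGT", "TTGACCAG"),  # differs by one base in the second mate
]

deduplicated = unique_read_pairs(toy_pairs)
print(len(toy_pairs), "pairs in,", len(deduplicated), "unique pairs out")
```

Exact matching is deliberately strict: pairs that differ by a single sequencing error are kept as distinct, so a filter like this can only undercount duplicates and therefore can only overstate, never understate, the number of unique supporting pairs.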