Human plasma has long been a rich source for biomarker discovery. It has recently become clear that plasma RNA molecules, such as microRNA, in addition to proteins are common and can serve as biomarkers. Surveying human plasma for microRNA biomarkers using next generation sequencing technology, we observed that a significant fraction of the circulating RNA appears to originate from exogenous species. With careful analysis of sequence error statistics and other controls, we demonstrated that there is a wide range of RNA from many different organisms, including bacteria and fungi as well as from other species. These RNAs may be associated with protein, lipid or other molecules protecting them from RNase activity in plasma. Some of these RNAs are detected in intracellular complexes and may be able to influence cellular activities under in vitro conditions. These findings raise the possibility that plasma RNAs of exogenous origin may serve as signaling molecules mediating, for example, the human-microbiome interaction and may affect and/or indicate the state of human health.

The recent development of highly parallelized next generation (NextGen) sequencing technologies has further advanced the use of sequencing as a tool in studying complex biological systems by genome sequencing and transcriptome analysis [7] , [8] , [9] , [10] . One advantage of using a sequence-based approach for transcriptome analysis is the ability to identify novel transcripts, such as alternative usage of exons or polyadenylation sites of known transcripts. The recent explosion of information on microRNA (miRNA) and other noncoding RNAs (ncRNAs) is the result in part of applying these new technologies. MiRNAs are transcribed from the genome by processes similar to those for protein-coding genes. The primary miRNA transcripts are processed in the nucleus and later in the cytosol by the RNase III enzymes Drosha and Dicer, respectively [11] . Typically, one strand of the resulting mature miRNA duplex then associates with the RNA-induced silencing complex (RISC) where it interacts with its messenger RNA (mRNA) targets. To date more than 1000 different human miRNA species have been identified (see miRBase, www.mirbase.org ). Recently, a significant number of these RNA molecules have been observed in the extracellular environment and have been implicated as important mediators in cell-cell communication [4] , [12] , [13] .

Many novel biological insights have emerged from the analysis of DNA and RNA sequences. Important discoveries, such as various pathology-causing variants in the human genome and the history of human migration, were made possible by the availability of sequencing technology [1] , [2] , [3] . Normal human physiology is the result of a well-orchestrated balance between genetic (intrinsic) and environmental (extrinsic) factors, and the availability of the complete human genome sequence facilitates the study of complex human-environmental interactions. Recently this has included the human-microbiome interaction, especially the gut microbiome [4] . These microbes interact intimately with gut epithelium and the alteration in the spectrum of the gut microbiome has been linked to various physiopathological conditions, such as diarrhea, obesity, and inflammatory pathologies as well as to the general state of health [5] , [6] .

Compared to the total exogenous RNA in human plasma samples ( Table S5 ), the exogenous RNA populations associated with the RISC complex are significantly smaller (ranging from 4% to 16%, depending on search criteria) ( Table 3 ). The exogenous RNA population in fetal bovine serum was even lower than that found associated with the RISC complex. Looking carefully at the sequences we obtained, we found a number of sequences mapped to various bacterial transcripts that were present in both the RISC complex and fetal bovine serum (for examples see Table 4 ; full list in Table S10 ). Some of the regions containing these sequences can form miRNA precursor-like hairpin structures ( Figure 5 ). This observation further supports the suggestion that exogenous RNA sequences may influence the function of cells through a miRNA-like mechanism.

It has been reported that cells in culture can take up microvesicles and internalize their molecular contents including RNA [23] , [24] . Since some of the exogenous RNAs in circulation were packaged in lipid vesicles, we investigated whether the cellular machinery involved in small RNA function could incorporate exogenous RNA sequences. We compared the exogenous RNA spectrum between fetal bovine serum used in cell culture and intracellular RNA associated with the RISC complex, immunoprecipitated by argonaute 2 (Ago2), a key component of the complex. Mass spectrometry based proteomic analysis of the immunoprecipitated proteins revealed the presence of Ago2 along with several DDXs (DEAD box, helicase domain containing proteins), hnRNPs (heterogeneous nuclear ribonucleoproteins) and RRMs (RNA recognition motif containing proteins), as reported earlier [25] ( Table S9 ). The presence of Ago2 in the immunoprecipitated protein mixture was also verified by Western blot ( Figure S7 ).

While the functions of extracellular miRNAs are not fully understood, it has been demonstrated that certain cells can take up the miRNA contained in lipid vesicles, which results in changes to the cell’s gene expression profile [19] . To explore the potential functions of exogenous RNAs in circulation, we transfected a mouse, Dicer-deficient, fibroblast cell line with several synthetic, double-stranded mature microRNA-like molecules selected from observed exogenous miRNA sequences and from some highly abundant exogenous sequences (bacterial rRNAs) that have the potential to form pre-miRNA-like secondary structures ( Figure S5 ). Because they lack the Dicer protein, a key RNase III miRNA-processing enzyme, the Dicer-deficient cells contain far less mature miRNA than normal cells. This provides a good tool for studying the function of miRNAs. By introducing individual miRNAs into these cells and avoiding multiple interactions of microRNA and mRNA (Wang et al in preparation) it is possible to interrogate the cells for mRNA levels, which are informative as to specific miRNA function. Based on microarray profiling results, it is clear that the expression profiles of a number of genes in the cells were affected by some of the exogenous RNA sequences. We verified the changes in mRNA levels of some of these affected genes by QPCR ( Table S2 and Figure S6 ). The pathways enriched among the down-regulated genes are listed in Table 2 . Two of the insect miRNAs, miR-263a-5p and bantam, did not produce any significant effects on the cellular transcriptome, which suggests that the process of transfection itself was not the cause of the observed gene expression changes. This observation suggests that RNA sequences in plasma might have some biological effects on human cells.

It has been shown that endogenous miRNAs can form complexes with proteins or be packaged in various lipid vesicles, protecting them from the abundant RNase in plasma [4] , [19] , [20] , [21] , [22] . To explore the stability of exogenous miRNAs and RNAs in circulation, we treated the plasma samples with DNase, protease, Triton X-100, and additional RNase before RNA isolation. Like endogenous miRNA (miR-16), the levels of specific exogenous miRNA (miR-263a-5p) and RNA (16S rRNA from Pseudomonas putida) were reduced significantly after Triton X-100, protease, RNase, and protease followed by RNase treatments ( Figure S4 ). Additional RNase alone caused less reduction than protease followed by RNase treatment. This suggests that some of the exogenous RNA molecules, like endogenous miRNAs, are probably associated with protein and/or lipid complexes in circulation and that a fraction of those complexes may not be tightly bound, such that the freeze-thawing process or incubation at 37°C during enzyme treatment may release some of the protected RNAs.

Our sequencing results also revealed the presence of exogenous miRNAs from other species. Due to the extreme sequence similarity of miRNA sequences among some species, it is often difficult to determine the exact origin of those exogenous miRNAs. Some of the highly abundant exogenous miRNA species detected in our plasma samples are listed in Table 1 . Except for miR-168a from the common cereal grains such as corn or rice, the rest of the exogenous miRNAs were from various common household insects, including the housefly, mosquito and bees. One interesting observation is the high variation in the number of reads among individual donors for those insect miRNAs. This was probably caused by the different living conditions and levels of contaminated food consumed by our donors, but this remains to be investigated.

The Y-axes are the number of reads in log 10 value and individual species are indicated on the X-axis. The number of reads used in the figures represents the average of all 9 plasma samples used in the study. Figure 4C shows the difference in the abundance of reads mapped to common cereal grains between a serum sample from a Chinese individual (open bars) and the plasma samples (Caucasian) used in the study (solid bars).

After carefully examining the sequences mapped to species other than bacteria and fungi, we observed a significant number of processed reads that mapped to common food items. As we did for bacteria and fungi, we removed all the reads that mapped to rRNAs and tRNAs to increase the accuracy of mapping results. We did not analyze sequences mapped to metazoan species since the risk of coincidental sequence match caused by sequencing error is much higher between human and other metazoan samples. The most abundant food-derived RNA sequences identified in our plasma samples are from corn (Zea mays) followed by rice (Oryza sativa Japonica Group) ( Figure 4a ). The number of reads mapped to corn is on average 66 times higher than the number mapped to rice. In comparing the data from a serum sample from a Chinese individual (downloaded from the public domain: SRR332232), we found that the relative sequence abundance of corn and rice is reversed: rice has the highest number of reads, about 55 times the number from corn ( Figure 4b ). Besides the common cereal grains, we also observed RNA from other food items including soybeans (Glycine max), tomato (Solanum lycopersicum), grape (Vitis vinifera) and others in our plasma samples ( Figure 4c ).

Metarhizium anisopliae, a common fungus in soil, had the most mapped reads, and Thielavia terrestris, a thermophilic fungus, became the species with the most abundant reads after removing tRNA and rRNA sequences ( Figure 3e and 3f ). Either with or without rRNA and tRNA reads, we observed a significant number of reads mapped to yeast (Saccharomyces cerevisiae), which is commonly used in baking and brewing ( Figure S3 ).

Fungi represent the largest source of exogenous RNA, about 14% of the processed reads under the Strategy 2 in our plasma samples ( Table S5 ). Like bacteria, the species mapped covered all major fungal phyla, and Ascomycota is the most abundant phylum either with or without considering rRNA and tRNA reads ( Figure 3d and Table S8 ). We could not detect species from Microsporidia following the removal of rRNA and tRNA sequences.

A significant number of the reads mapped to bacteria are from various ribosomal RNAs and tRNAs. High sequence similarity of these sequences among different microbial species can lead to misassignment of sequence reads. Thus, to increase the reliability of mapping results, we removed reads that mapped to bacterial rRNAs and tRNAs and reanalyzed the remaining reads. Removing rRNA and tRNA sequences affected our ability to detect species from Chloroflexi, Deferribacteres, Fibrobacteres and other phyla ( Figure 3a open bars and Table S8 ). However, the Proteobacteria was still the most abundant phylum followed by Bacteroidetes and Firmicutes.
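
The effect of this filtering step on per-phylum read counts can be illustrated with a minimal Python sketch. It is not the analysis code used in this study: the `Hit` record, the function name, and the toy annotations are hypothetical and chosen only to show how removing rRNA and tRNA reads can make a phylum drop out of the tally.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One mapped read (hypothetical record layout, for illustration only)."""
    phylum: str    # phylum of the best-matching reference
    rna_type: str  # annotation of the matched region, e.g. "rRNA", "tRNA", "mRNA"


# RNA classes removed before re-analysis, as described in the text.
EXCLUDED_RNA_TYPES = {"rRNA", "tRNA"}


def count_reads_by_phylum(hits: Iterable[Hit], drop_structural: bool = False) -> Counter:
    """Tally mapped reads per phylum, optionally skipping rRNA/tRNA hits."""
    counts: Counter = Counter()
    for hit in hits:
        if drop_structural and hit.rna_type in EXCLUDED_RNA_TYPES:
            continue
        counts[hit.phylum] += 1
    return counts


# Made-up toy input, not data from the study.
toy_hits = [
    Hit("Proteobacteria", "rRNA"),
    Hit("Proteobacteria", "mRNA"),
    Hit("Firmicutes", "tRNA"),
    Hit("Bacteroidetes", "mRNA"),
]
all_counts = count_reads_by_phylum(toy_hits)
filtered_counts = count_reads_by_phylum(toy_hits, drop_structural=True)
```

In this toy input the only read assigned to Firmicutes is a tRNA read, so that phylum is absent from `filtered_counts`, which mirrors how filtering can reduce the ability to detect sparsely represented phyla.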

The Y-axes are the numbers of reads in log 10 value and individual phyla are indicated on the X-axis. The number of reads used in the figures represents the average of all 9 plasma samples used in the study. The solid bars represent the total number of processed reads mapped to specific phyla while open bars are the number after removing rRNA and tRNA reads. The individual bacterial and fungal species with the most abundant processed reads (B and E) and processed reads after removing tRNAs and rRNAs (C and F) are also shown.

We observed reads from plasma covering all major bacterial phyla and two archaeal phyla (Euryarchaeota and Crenarchaeota) ( Figure 3a and Table S8 ). We did not observe any significant difference in the sequence distribution patterns among plasma samples from normal individuals and patients with either colorectal cancer or ulcerative colitis ( Table S8 ). Firmicutes, typically the most abundant bacterial phylum in the human gut microbiome [4] , is the third most abundant sequence population in plasma.

To exclude the possibility that the observed exogenous RNAs were from intact bacterial and fungal contamination in our plasma samples, we filtered the plasma samples before RNA isolation through the 0.2 µm filter commonly used in tissue culture to eliminate bacterial and fungal contamination. We did not observe any significant difference in exogenous RNA levels between filtered and unfiltered plasma, using QPCR primers specific to Pseudomonas putida 16S rRNA and Ceratocystiopsis minuta 18S rRNA, matching the results for the human 28S rRNA ( Figure S2 ).

To ensure that the exogenous sequences we observed were not derived from any contaminated instruments or reagents, we analyzed two public domain NextGen sequencing data sets: SRR332232, serum small RNA sequencing results from a normal Chinese individual [15] , and SRR014350, yeast transcriptome data from a yeast culture [16] . The yeast culture should not have any exogenous sequences since it was grown in sterile, defined culture media. In the yeast dataset, less than 0.15% of the reads mapped to sequences other than yeast ( Figure 2c and Table S7 ), a level that is fully attributable to coincidence caused by sequencing errors. By contrast, using our sequencing analysis pipeline, we observed that about 12% of the sequences in the human serum sample were from various exogenous species under Strategy 2.

To eliminate the possibility of bacterial and fungal contamination during plasma preparation and handling, we generated sequencing libraries from other types of samples including human tissue (commercially obtained normal lung RNA), bovine milk (commercial whole milk), and mouse plasma (C57BL/6J), and proceeded through the same analysis scheme. Sequences from bacteria, fungi and other species can also be seen in these samples ( Figure 2b and Table S6 ). The overall percentages of exogenous sequences for mouse plasma were lower than for human plasma samples. The human lung tissue had a very small exogenous fraction: less than 1% of the processed sequences were from exogenous sources under Strategies 1 and 2. The commercially obtained milk contains a significant fraction of sequences attributable to bacteria.

In order to identify the origin of those unmapped sequences in our sequencing results and to ensure that there was no error introduced in preparing the sequencing library that could account for the unknowns, we conducted a systematic search against various sequence databases. We used a “map and remove” approach to analyze the sequence ( Figure 1 ). The processed sequences were first screened against endogenous (human) sequence databases including known human miRNA, human transcripts, followed by human genomic sequence. Except for the miRNA (since some of the miRNAs have very similar sequences), we applied three different levels of error tolerance, 0 mismatch (termed Strategy 0), 1 mismatch (termed Strategy 1) and 2 mismatches (termed Strategy 2) for the endogenous sequence mapping. The remaining unmapped sequences were then compared to sequences from the known human microbiome, miRNA sequences from other species, and the non-redundant nucleic acid sequence collection from NCBI without any mismatch allowance. To our surprise, a significant number of the unmapped reads aligned with various bacterial and fungal sequences ( Figure 2a and Table S5 ).
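
The "map and remove" logic described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: the function names, the toy reads and reference sequences, and the brute-force, substitution-only matching are assumptions made for illustration, and the 0-mismatch setting for the miRNA step is likewise assumed because the text does not state it.

```python
from typing import Dict, List, Sequence, Tuple


def matches_within(read: str, ref: str, max_mismatches: int) -> bool:
    """True if `read` aligns somewhere in `ref` with at most `max_mismatches` substitutions."""
    n = len(read)
    for start in range(len(ref) - n + 1):
        mismatches = 0
        for a, b in zip(read, ref[start:start + n]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatches:
                    break  # this window fails; try the next start position
        else:
            return True  # inner loop finished without exceeding the allowance
    return False


def map_and_remove(
    reads: Sequence[str],
    databases: Sequence[Tuple[str, Sequence[str], int]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Screen reads against databases in order; mapped reads are removed before the next step.

    `databases` is an ordered list of (name, reference sequences, mismatch allowance).
    Returns the reads assigned to each database and the reads left unmapped.
    """
    assigned: Dict[str, List[str]] = {}
    remaining = list(reads)
    for name, refs, max_mm in databases:
        hits, still_unmapped = [], []
        for read in remaining:
            if any(matches_within(read, ref, max_mm) for ref in refs):
                hits.append(read)
            else:
                still_unmapped.append(read)
        assigned[name] = hits
        remaining = still_unmapped
    return assigned, remaining


# Toy sequences invented for illustration; the mismatch allowance of 1 mimics Strategy 1.
toy_reads = ["ACGTACGT", "ACGTTCGT", "TTTTGGGG", "CCCCAAAA"]
toy_databases = [
    ("human_miRNA", ["ACGTACGT"], 0),            # assumed exact matching
    ("human_transcripts", ["GGACGTACGTGG"], 1),  # endogenous step, 1 mismatch allowed
    ("microbiome", ["AATTTTGGGGAA"], 0),         # exogenous step, no mismatch allowance
]
assigned, unmapped = map_and_remove(toy_reads, toy_databases)
```

Because each read is removed as soon as it maps, a read can be assigned to at most one database, and the order of the databases determines which assignment wins. Screening the endogenous (human) databases first therefore biases ambiguous reads toward a human origin.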

To ensure our protocol is effectively mapping back the reads to transcript and genome sequences, a NextGen sequence read simulator, ART [14] ( http://bioinformatics.joyhz.com/ART/ ), was used to generate artificial transcriptome data. With a 2 mismatch allowance, over 98% of the sequences from our simulated dataset can be mapped to the corresponding transcriptome ( Table S4 ). This provided some assurance that our protocol can map most (∼98%) of the NextGen sequencing data under 2 mismatch allowance.

On first examination, we noticed that less than 1.5% of the processed reads actually mapped to human miRNAs. About 11% of the remaining reads mapped to human transcripts and human genome sequence when no sequence mismatch was allowed ( Table S3 ). With a higher tolerance of sequence mismatches, the fraction of reads that can be mapped to known human transcripts rose to about 42% and 15% to other human genomic sequences (under two mismatch allowance). However, this still leaves over 40% of the processed reads with an unknown origin.

Because of the shortcomings of existing miRNA measuring systems, we adapted the NextGen sequencing technology to obtain more accurate spectra of these important molecules in circulation, specifically to explore the plasma-miRNA association with colorectal cancer and ulcerative colitis. Initially we conducted NextGen sequencing on 9 plasma samples: 3 samples from healthy individuals, 3 from patients with colorectal cancer prior to any treatment and 3 from individuals suffering from ulcerative colitis (Mayo Score between 10 and 11) ( Table S1 ). Sequence reads were preprocessed and then aligned to known human miRNAs, human transcripts and human genome sequence. The concentrations of several miRNAs in plasma differed between normal individuals and patients with either colorectal cancer or ulcerative colitis. We conducted quantitative polymerase chain reaction (QPCR) measurements to validate some of these miRNAs ( Figures S1a and S1b ).

Discussion

Since most of the circulating exogenous RNAs we identified originate from specific microbiome species, different conditions, benign and pathological, may skew the circulating RNA population. It may be surprising that we did not observe any significant differences in the general spectrum of endogenous and exogenous RNAs in circulation between samples from patients and normal individuals used in this study. This is probably caused by the high degree of heterogeneity of circulating RNA among individuals and the small sample size used. Imprecise diagnosis can also complicate the linkage between disease and circulating RNA population. Larger sample sizes with well documented pathological conditions, or animal models in a controlled environment, will be used to shed further light on this observation.

It has long been thought that the human body is highly insulated from its environment by two major protective mechanisms: an active and dynamic immune system, and the body’s physical barrier represented by skin, gut epithelium, mucous membranes, etc. While some sequences mapped to known human transcripts and other genomic sequences, we were surprised, even after applying very stringent screening criteria, to see that a significant proportion of the sequencing reads from human plasma samples in our study clearly originated from various microbes, insects and food sources (Figure 2a). We used simulated (Table S4) and public domain datasets (Figure 2c), and also examined several different types of samples, to ensure our sequence mapping protocol reliably assigns the endogenous sequences, and we employed a number of controls to eliminate the possibility of contamination in our samples and reagents. Treating the plasma samples with DNase prior to RNA isolation had little or no effect on the levels of exogenous sequences, but the levels of these sequences decreased by about 2-fold when the plasma samples were treated with RNase (Figure S4). This provides support for the conclusion that most of the exogenous sequences in our results derive from RNA rather than DNA. To escape enzymatic degradation, these RNA molecules probably form complexes with proteins and/or lipid molecules, as has been previously reported for endogenous miRNAs, since treating the plasma samples with protease and Triton further decreased the levels of these exogenous RNA molecules (Figure S4).

The ability to accurately assign the origin of exogenous sequences is dependent on the availability of genome sequences from each species. Even though the genomes of a significant number of species have been determined in recent years, the number is insignificant compared to the number of species in the biota, particularly bacteria, archaea and fungi. This includes many species in the human microbiome. Although we could not determine the origin of those exogenous RNA sequences with complete confidence, the diverse and numerous reads that showed perfect matches to a wide range of known microbial sequences suggest that the spectrum of RNA in the blood reflects the population of the gut, including the microbiome composition. Interestingly, samples derived from the human gastrointestinal tract were recently found to be significantly enriched in small RNAs when compared to microbial community samples derived from other environmental settings [26]. This finding, along with the results of the present work, suggests that the human gastrointestinal microbiota may be disproportionately synthesizing small RNA molecules which are then reflected in human blood. In addition, the finding of a significant amount of RNA mapping to common cereal grains including corn, rice and wheat (Figure 4) clearly indicates that part of the RNA spectrum in the circulation is also provided directly from food intake.

The existence of exogenous RNAs in circulation is intriguing for several reasons. These molecules could possibly be molecular waste in the process of degradation and elimination by the body, and they could represent potential nutrients destined for further degradation and absorption. However, we cannot easily explain how molecules as large as these RNAs enter the bloodstream through the gut epithelium. Most of these RNAs are likely to be complexed with proteins and lipids, allowing them to escape RNase degradation. In addition, we note that these RNA molecules could influence cellular mRNA expression if they are taken up by cells. We demonstrated this by transfecting synthetic RNA sequences from some of the more abundant exogenous sequences we found in plasma samples. We also found identical bacterial sequences from fetal bovine serum in the RISC complex of cultured cell lines (Table 4 and Table S10). Even though the concentration of RNA we used in transfection (10 nM) is much higher than that in plasma (<0.3 pM, based on an average of 100 ng/ml of RNA in plasma with an average length of 100 nucleotides, where the selected RNA sequence represents less than 0.01% of the total RNA population), we cannot exclude the possibility that certain cells in the body have an active uptake system which can pick up circulating RNA (both endogenous and exogenous) at low concentrations. The recent finding of rice miR-168a, which has an estimated serum concentration at the fM level, in various mouse tissues further supports the possibility of an active uptake system by which some cells take low-concentration circulating RNA [15] into their intracellular compartments. Although extracellular nucleic acids in plasma were discovered almost 60 years ago [27], the identification of stable extracellular miRNA in circulation supports the fundamental idea of RNA-mediated signaling processes between cells.
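
The plasma concentration estimate quoted above (<0.3 pM) can be reproduced with a short calculation, sketched here in Python. The average mass of about 330 g/mol per RNA nucleotide is an assumption that is not stated in the text; the remaining inputs (100 ng/ml total RNA, 100 nucleotides average length, 0.01% of the RNA population, 10 nM transfection concentration) are taken from the paragraph above, and the function name is hypothetical.

```python
# Assumed average molar mass of one RNA nucleotide (g/mol); not given in the text.
AVG_NT_MASS_G_PER_MOL = 330.0


def species_molarity(total_rna_ng_per_ml: float, avg_length_nt: int, fraction: float) -> float:
    """Molar concentration (mol/L) of one RNA species within a total RNA pool."""
    grams_per_litre = total_rna_ng_per_ml * 1e-9 * 1e3      # ng/ml -> g/L
    molar_mass = avg_length_nt * AVG_NT_MASS_G_PER_MOL      # g/mol for an average molecule
    total_molarity = grams_per_litre / molar_mass           # mol/L of all RNA molecules
    return total_molarity * fraction


# 0.01% is the upper bound on the fraction, so this is an upper bound on the concentration.
plasma_molar = species_molarity(100.0, 100, 1e-4)
transfection_molar = 10e-9  # 10 nM, as used for transfection

print(f"plasma upper bound: {plasma_molar * 1e12:.2f} pM")
print(f"transfection / plasma: {transfection_molar / plasma_molar:,.0f}-fold")
```

Under these assumptions the total plasma RNA corresponds to roughly 3 nM of molecules, a single species at 0.01% to roughly 0.3 pM, and the transfection concentration is therefore more than four orders of magnitude higher than the estimated plasma level.
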
If some of the exogenous RNAs we find in circulation have the potential to influence cellular activity, individuals having different levels of these RNAs from normal food intake or from different microbiome populations could be affected by these differences in unexpected ways. The complex interactions of diet, microbiome and the cellular functions of the body are affected in turn by genetics, which suggests new dimensions of the gene-environment interaction spectrum.

During the review process of this manuscript, Semenov et al reported the finding of exogenous RNAs that mapped to microbial sequences including Escherichia, Acinetobacter, Propionibacterium and others in normal human plasma using a SOLiD sequencing platform [28]. This independent study adds to the evidence we present here and supports the idea of exogenous RNA present in circulation as a common phenomenon in humans.

The finding of diverse exogenous RNA molecules in plasma, and their potential influence on cellular gene expression, raises several interesting questions about how humans interact with their environments and particularly with their gastrointestinal biota. Though the interaction between microbes and gut epithelium is yet to be fully understood, some sort of feedback signaling process might well be involved [6], [29]. Peptides and small chemicals have long been thought to be the two major types of signaling molecules between microbe and gut epithelium. The finding of microbial RNA in circulation now adds the possibility of an RNA-mediated human-microbiome interaction as an additional communication mechanism of this important axis for human health.