RNA-Seq techniques generate hundreds of millions of short RNA reads using next-generation sequencing (NGS). These RNA reads can be mapped to reference genomes to investigate changes of gene expression but improved procedures for mining large RNA-Seq datasets to extract valuable biological knowledge are needed. RNAMiner—a multi-level bioinformatics protocol and pipeline—has been developed for such datasets. It includes five steps: Mapping RNA-Seq reads to a reference genome, calculating gene expression values, identifying differentially expressed genes, predicting gene functions, and constructing gene regulatory networks. To demonstrate its utility, we applied RNAMiner to datasets generated from Human, Mouse, Arabidopsis thaliana, and Drosophila melanogaster cells, and successfully identified differentially expressed genes, clustered them into cohesive functional groups, and constructed novel gene regulatory networks. The RNAMiner web service is available at http://calla.rnet.missouri.edu/rnaminer/index.html .

Funding: The work was made possible by an NIH Botanical Center grant (P50AT006273) from the National Center for Complementary and Alternative Medicines (NCCAM), the Office of Dietary Supplements (ODS), and the National Cancer Institute (NCI), a NIH R01 grant (R01GM093123), and a NSF CAREER grant (DBI1149224). Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NCCAM, ODS, NCI, the National Institutes of Health, or the National Science Foundation.

Data Availability: All of the six data sets used in this work are publicly available. The third, fourth, and fifth data sets have been deposited into the Gene Expression Omnibus (GEO) database (accession numbers: GSE41570, GSE41679, and GSE35288, respectively). The first, second, and sixth data sets are publicly available at http://calla.rnet.missouri.edu/botanical/RNAMiner.html .

Copyright: © 2015 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

As proof of principle, we have applied the RNAMiner protocol to RNA-Seq data generated from Human, Mouse, Arabidopsis thaliana, and Drosophila melanogaster cells. The data mining process successfully produced valuable biological knowledge such as differentially expressed genes, cohesive functional gene groups, and novel hypothetical gene regulatory networks by reducing the size of the initial data set over a thousand-fold.

In order to address these challenges, RNAMiner has been developed to convert gigabytes of raw RNA-Seq data into kilobytes of valuable biological knowledge through a five-step data mining and knowledge discovery process. RNAMiner integrates both public tools (e.g., TopHat2 [ 8 ], Bowtie2 [ 9 ], Cufflinks [ 10 ], HTSeq [ 11 ], edgeR [ 12 ], and DESeq2 [ 13 ]) with our in-house tools (MULTICOM-MAP [ 14 , 15 , 16 ]) to preprocess data and identify differentially expressed genes in the first three steps. In the last two steps, RNAMiner uses our in-house tools MULTICOM-PDCN [ 17 , 18 ] and MULTICOM-GNET [ 19 , 20 ] to predict both functions and gene regulatory networks of differentially expressed genes, respectively.

A RNA-Seq experiment typically generates hundreds of millions of short reads that are mapped to reference genomes and counted as a measure of expression [ 5 ]. Mining the gigabytes or even terabytes of RNA-Seq raw data is an essential, but challenging step in the analysis.

Transcriptome analysis is essential for determining the relationship between the information encoded in a genome, its expression, and phenotypic variation [ 1 , 2 ]. Next-generation sequencing (NGS) of RNAs (RNA-Seq) has emerged as a powerful approach for transcriptome analysis [ 3 , 4 ] that has many advantages over microarray technologies [ 5 , 6 , 7 ].

Methods

Some RNA-Seq data analysis pipelines (e.g. Galaxy [21], KBase [22], iPlant [23]) provide users with a convenient and free platform for RNA-Seq data analysis by combing public tools, such as TopHat [24], Bowtie [25], Cufflinks [10], Cuffmerge [10], and Cuffdiff [10]. As with these pipelines, RNAMiner combines these public tools such as TopHat2 [8], Bowtie2 [9], Cufflinks [10], Cuffdiff [10], and it is free. However, there are several differences between RNAMiner and other pipelines. First, RNAMiner integrates more tools, such as HTSeq [11], edgeR [12], DESeq2 [13], and our in-house MULTICOM-MAP [14,15,16], to calculate gene expression values and identify differentially expressed genes. These tools can generate more accurate consensus results. For example, RNAMiner uses Cuffdiff, edgeR, and DESeq2 to identify differentially expressed genes based on TopHat mapping results and gene expression values calculated by HTSeq and MULTICOM-MAP. RNAMiner generates up to five distinct lists and one consensus list of differentially expressed genes, which usually produces more accurate results. Second, RNAMiner predicts functions of differentially expressed genes and builds gene regulatory networks by integrating our in-house tools MULTICOM-PDCN [17,18] and MULTICOM-GNET [19,20]. These analyses provide more biological information. Other pipelines (e.g. Galaxy and iPlant) do not provide these analyses. Another software package—KBase—contains a service to predict gene functions, but the service only provides GO annotation for plant genomes. Third, without requirements for user registration and selection of many parameters, RNAMiner is easier to use than other pipelines. Compared to running each tool separately, users can easily run all these tools integrated in RNAMiner at one time and download results generated by all the tools at the RNAMiner web site.

The five data analysis steps of the RNAMiner protocol (Fig 1) are described individually in sub-sections below. Tables 1 and 2 list the versions and the parameters of all the public tools used in RNAMiner and describe the meanings of the parameters.

PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 1. The RNAMiner protocol for big transcriptomics data analysis. Five blue boxes denote five data analysis steps, i.e. mapping RNA-Seq reads to a reference genome, calculating gene expression values, identifying differentially expressed genes, predicting gene functions, and constructing gene regulatory networks. The tools used in each step are listed inside each box. The external input information is represented by brown boxes and the final output information is represented by green boxes. The information flow between these components is denoted by arrows. https://doi.org/10.1371/journal.pone.0125000.g001

1. Mapping RNA-Seq reads to a reference genome We use two public tools, TopHat2 [8] and Bowtie2 [9], to map RNA-Seq reads to reference genomes in the UCSC genome browser [26] in conjunction with the RefSeq genome reference annotations [27]. The workflow of mapping RNA-Seq reads to a reference genome and calculating gene expression values is illustrated in Fig 2. It is worth noting that, since the RefSeq genome reference annotations contain information about some non-coding small RNAs, the reads of the non-coding RNAs are mapped and counted in addition to regular protein coding mRNAs. MULTICOM-MAP [14,15,16] is used to remove reads mapped to multiple locations in a reference genome from the mapping data in BAM/SAM format [28] generated by TopHat2 and Bowtie2. Only reads mapped to a unique location on the genome are retained to calculate the read counts of the genes. We use MULTICOM-MAP to analyze the mapping results to obtain baseline information, such as the total number of reads, the number of reads mapped to a unique location, and the number of reads mapped to multiple locations. This mapping process can generally reduce the size of datasets by several orders of magnitude. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 2. The workflow of mapping RNA-Seq reads to a reference genome and calculating gene expression values. The blue boxes denote the tools (TopHat2, Bowtie2, MULTICOM-MAP, HTSeq, Cufflinks) used in the steps of mapping RNA-Seq reads to a reference genome and calculating gene expression values. The external input information is represented by brown boxes and the output information is represented by green boxes. The information flow between these components is denoted by arrows. https://doi.org/10.1371/journal.pone.0125000.g002

2. Calculating gene expression values For RNAMiner, MULTICOM-MAP [14,15,16] and two public tools: HTSeq [11] and Cufflinks [10] are used to calculate gene expression values according to the genome mapping output and the RefSeq genome reference annotation [27]. MULTICOM-MAP and HTSeq produce raw read counts, while Cufflinks generates normalized values in terms of FPKM, i.e., fragments per kilobase of exon model per million mapped fragments. The normalized gene expression values generated by Cufflinks are used to identify differentially expressed genes in the next step. The read counts generated by MULTICOM-MAP and HTSeq are fed separately into two R Bioconductor packages, edgeR [12] and DESeq2 [13], to identify differentially expressed genes. The normalized gene expression values (RPKM, reads per kilobase of exon model per million mapped reads) of MULTICOM-MAP are used to construct gene regulatory networks in the last step. Cufflinks, MULTICOM-MAP, and HTSeq are largely complementary and mostly differ in how they handle the reads mapped to common exons of multiple isoforms of a gene. Cufflinks distributes the count of such reads to each isoform proportionally according the estimated probability that the reads were derived from the isoform. In contrast, MULTICOM-MAP distributes the total count of such reads to each isoform, while HTSeq discards the reads without counting them for any isoform. This analysis step generates the overall expression profile of most genes in a transcriptome and can reduce the size of data from Step 1 by ~one thousand-fold, from gigabytes to several megabytes.

3. Identifying differentially expressed genes We use Cuffdiff [10] and two R Bioconductor packages, edgeR [12] and DESeq2 [13] to identify differentially expressed genes separately (see Fig 3 for the workflow). EdgeR and DESeq2 identify differentially expressed genes based on the raw read counts calculated by MULTICOM-MAP and HTSeq, resulting in four lists of differentially expressed genes (i.e., edgeR+MULTICOM-MAP, edgeR+HTSeq, DESeq2+MULTICOM-MAP, and DESeq2+HTSeq). In contrast Cuffdiff identifies differentially expressed genes directly from the genome mapping outputs containing only reads mapped to a unique location on the genome, resulting in one list of differentially expressed genes. Cuffdiff, edgeR and DESeq2 further adjust p-values by multiple testing using Benjamini and Hochberg's approach, which controls the false discovery rate (FDR) [10,12,13]. Usually, the cut-off of p-value (or q-value) is set to 0.05. Based on the five lists of differentially expressed genes generated by Cuffdiff, edgeR+MULTICOM-MAP, DESeq2+MULTICOM-MAP, edgeR+HTSeq, and DESeq2+HTSeq, a consensus list of differentially expressed genes is generated as the final output which usually comes from the overlap of at least three lists of differentially expressed genes. This step generates valuable information that may play an important role in the biological experiment. For example, the significantly differentially expressed genes identified by RNAMiner could be the targets for new biological experiments. This analysis step can generally reduce the size of data of the previous step by a couple orders of magnitude, condensing the data set size to several hundred kilobytes. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 3. The workflow of identifying differentially expressed genes. The blue boxes denote the tools (edgeR, DESeq2, Cuffdiff) used in the step of identifying differentially expressed genes. The external input information is represented by brown boxes and the output information is represented by green boxes. The information flow between these components is denoted by arrows. https://doi.org/10.1371/journal.pone.0125000.g003

4. Predicting gene functions We use MULTICOM-PDCN [17,18], a protein function prediction method ranked among the top methods in the 2011–2012 Critical Assessment of Function Annotation (CAFA) [29], to predict functions of differentially expressed genes (see Fig 4 for the workflow). MULTICOM-PDCN integrates sequence-profile and profile-profile alignment methods (PSI-BLAST [30] and HHSearch [31]) with protein function databases such as the Gene Ontology database [32], the Swiss-Prot database [33], and the Pfam database [34], to predict functions of proteins in Gene Ontology [32] terms in three categories: biological process, molecular function, and cellular component. MULTICOM-PDCN also provides some statistical information about the predicted functions, such as the number of differentially expressed genes predicted in each function. We then use the Cochran-Mantel-Haenszel test implemented by R program mantelhaen.test [35,36] to check if predicted function terms are good for Fisher’s exact test to identify the significantly enriched GO function terms. A p-value from the MH test lower than 0.05 suggests the two nominal variables (e.g., two function terms) are conditionally independent in each stratum [35,36]. We then calculate a p-value of enrichment for each predicted function using R function fisher.test [35,36,37,38,39,40,41,42] and sort the predicted functions by their p-value in ascending order, from the most significant ones to the least significant ones. The list of the most significantly enriched functions can provide an overview of the biological processes differentially perturbed in two biological conditions. Although the physical size of the data and knowledge generated in this step is comparable to the size of the data in the previous step, the differentially expressed genes can be organized in three functional perspectives: biological process, molecular function, and cellular component. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 4. The workflow of predicting gene functions. The blue box denotes the tool used in the step of predicting gene functions. The tools of PSI-BLAST and HHSearch used in MULTICOM-PDCN are listed inside the blue box. The external input information is represented by brown boxes and the output information is represented by green boxes. The information flow between these components is denoted by arrows. https://doi.org/10.1371/journal.pone.0125000.g004