NHP samples and RNA sequencing

Whole blood samples were collected from healthy NHPs: rhesus macaque samples RM01 through RM05, African green monkey samples AG01 through AG06, and cynomolgus macaque samples CM01 through CM06. Blood samples were diluted 3:1 in Trizol LS. Tissues were harvested from the uninfected NHPs CM01 and AG01; bone marrow could not be collected from AG01, nor brain tissue from CM01. To prepare samples for nucleotide extraction, 0.5 g of tissue per sample was homogenized in 10 ml of Trizol LS.

RNA was extracted using Trizol LS (Invitrogen, Carlsbad, CA) and used for cDNA synthesis with the TruSeq RNA Sample Prep Kit v2 (Illumina, San Diego, CA), according to the manufacturer’s protocol. Library quality was evaluated on the Agilent 2100 Bioanalyzer (Agilent, Santa Clara, CA). After quantification by real-time PCR with the KAPA qPCR Kit (Kapa Biosystems, Woburn, MA), libraries were diluted to 10 nM. Cluster amplification was performed on the Illumina cBot, and libraries were sequenced on the Illumina GAIIx using 76 bp and 100 bp paired-end formats. Sequencing results are detailed in Additional file 1.

Ethics statement

Each animal received a baseline health assessment, including a complete blood count and blood chemistry, and was determined to be clinically normal on physical examination. All animals were seronegative for measles virus, Macacine Herpesvirus 1, simian immunodeficiency virus, and simian T-cell leukemia virus. All animals were negative for Mycobacterium tuberculosis by tuberculin skin test at least 6 months prior to the study. To ensure applicability of results to Animal Biosafety Level 3 and 4 environments, an exemption for partial and/or full contact housing was approved by the IACUC due to the anticipated stress of permanent social separation from a cage mate, the nature of the diseases studied, and safety and sanitation concerns. Macaques were singly housed in 4.5-ft² cages with 4 cages per rack (Allentown Caging Equipment, Allentown, NJ), with visual and auditory contact with conspecifics at all times. A form of dietary enrichment was provided once daily. Environmental conditions were maintained as recommended in the Guide for the Care and Use of Laboratory Animals (temperature, 68 to 72°F; relative humidity, 30% to 70%; and 12:12-h light:dark cycle) (18). Animals were fed a commercial primate diet (2050 Teklad Global 20% Protein Primate Diet, Harlan Laboratories, Frederick, MD). Fresh water was chlorinated and filtered at the municipal level (Edstrom Industries, Waterford, WI) and was provided ad libitum. A uniform schedule of food and toy enrichments (Challenge ball, Kong, football, and Dental star, Bio-Serv, Frenchtown, NJ) was used as outlined by our institute’s husbandry and care program.

All research was conducted under an IACUC-approved protocol in compliance with the Animal Welfare Act, PHS Policy, and other federal statutes and regulations relating to animals and experiments involving animals. The facility where this research was conducted is accredited by the Association for Assessment and Accreditation of Laboratory Animal Care International and adheres to the principles stated in the Guide for the Care and Use of Laboratory Animals, National Research Council, 2011 (18). Euthanasia was performed to minimize pain and distress by intravenous administration of sodium pentobarbital.

Genome-based transcriptome reconstruction

To generate high-quality assemblies, we first assessed read quality using FastQC [36]. We used the FASTX-Toolkit for trimming, quality filtering, and duplicate removal [37]. Additionally, we employed PRINSEQ-Lite [38] to filter out reads shorter than 50 bases. For paired-end libraries, we removed a read pair only if both the forward and the reverse read (or their complements) were duplicates. We filtered low-complexity sequences using the DUST algorithm (threshold 3) and trimmed bases with a quality score of <15 from the 3′ end.
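The pair-level duplicate rule above (drop a pair only when both mates, or their reverse complements, have been seen before) can be sketched as follows; the function names are illustrative, not part of any published tool:

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence (N maps to N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def filter_pair_duplicates(pairs):
    """Keep each read pair unless BOTH mates (or their reverse
    complements) were already seen in the library.

    `pairs` is a list of (forward, reverse) sequence tuples; the first
    occurrence of any sequence is never treated as a duplicate.
    """
    seen = set()
    kept = []
    for fwd, rev in pairs:
        fwd_dup = fwd in seen or revcomp(fwd) in seen
        rev_dup = rev in seen or revcomp(rev) in seen
        if not (fwd_dup and rev_dup):  # drop only if both mates are duplicates
            kept.append((fwd, rev))
        seen.update({fwd, rev})
    return kept
```

A pair with one novel mate is retained, so only fully redundant pairs are discarded.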

Tophat (version 2.0.8) [24] with default parameters was used to map CM and AG reads to their corresponding reference genomes. In the initial run, reads obtained from liver, lymph node, lung, marrow, and spleen were mapped to the CM genome, and reads from brain, lymph node, liver, spleen, and lung to the AG genome. Bowtie1 (version 0.12.9) was used as the main aligner for Tophat throughout this study. After alignment, Cufflinks (version 2.1.1) was used with default parameters to assemble reads into transcripts. Subsequently, the assembled transcripts were merged with Cuffmerge to obtain a non-redundant, unified set of transcripts. Blood samples were then added for augmentation and benchmarking of these transcriptomes via the reference annotation based transcript (RABT) assembly procedure [24, 39] (using the --GTF option in Tophat and --GTF-guide in Cufflinks, followed by Cuffmerge).

Multi-species annotation (MSA) pipeline

We designed the Multi-Species Annotation pipeline to assign gene symbols to contigs by aligning them with BLAST [40, 41] against sequences in NCBI’s nt database. In the present study, we used a cutoff BLAST e-value of 1e-4. For every subject sequence with a hit in the database and corresponding to a unique accession ID, we used its General Feature Format (GFF) file to describe the coordinates of genes within the sequence. Thus, the pipeline relied on the BLAST output and a concatenated set of GFFs. It comprises the following three steps.

The first step was to add information from the GFF files to the BLAST output and to merge local alignments, so that there was only one row per unique subject-query ID pair. This was necessary because BLAST is a local aligner, so pieces of a query sequence can map to multiple locations on a single subject sequence. In this step, we also computed other information, such as query coverage, subject coverage, and gene coverage. To determine which genes were covered, we converted the BLAST results into BED format and used Bedtools-intersect [42] with the coordinates given in the GFF file. This step resulted in a table listing, for every transcript, all the accession numbers and corresponding species to which it mapped and all the genes with which it intersected.
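The merging of local alignments into one row per subject-query pair, and the query coverage derived from it, can be sketched as follows (hypothetical helper names; a simplification of the actual pipeline step, which also tracks subject and gene coverage):

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)  # extend previous interval
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]

def query_coverage(hsps, query_len):
    """Fraction of the query covered by the union of its local
    alignments (HSPs) against one subject sequence."""
    covered = sum(end - start + 1 for start, end in merge_intervals(hsps))
    return covered / query_len
```

Because overlapping HSPs are unioned before summing, a query region hit twice is only counted once toward coverage.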

In the second step, we parsed this table. While the first step merged multiple alignments over unique subject-query IDs, in this step we merged rows across unique query IDs. With this, one transcript pointed to many genes across many different species. (If a gene symbol was not available, only the accession ID was kept.) At this step, some BLAST results can be excluded based on query, subject, or gene coverage information; however, we chose not to apply any of these filters. Gene symbols were canonicalized into their official HUGO names [43], where possible. Finally, we assigned to each transcript the most frequent gene symbol from the corresponding BLAST alignments to multiple species.
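The final majority-vote assignment can be sketched in a few lines (the tie-breaking rule here is an assumption made for reproducibility, not necessarily the pipeline's):

```python
from collections import Counter

def consensus_symbol(symbols):
    """Return the most frequent gene symbol among a transcript's
    BLAST hits across species; ties are broken alphabetically
    (an illustrative choice)."""
    counts = Counter(symbols)
    return min(counts, key=lambda s: (-counts[s], s))
```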

In the final step, we used the Cufflinks gene model prediction and assigned the consensus gene symbol of all transcript isoforms to their parent gene model (Additional file 5).

Identification of isoforms

One gene symbol may annotate multiple gene models in our assembly. We relied on the Cufflinks-predicted positions of the gene models and the expression of their contig transcripts across all tissues to filter out erroneous contigs. We excluded contig transcripts with less than 35% cumulative query coverage at the annotation step, which constituted <10% of all transcripts. Then, for each gene symbol, we identified the consensus chromosomal position of all gene models and retained only the contig transcripts that matched that position. When the consensus chromosomal position was ambiguous, we chose the gene model whose contig transcripts had the highest total expression, as measured by FPKM [14]. At this point, 10-20% of the contig transcripts were predicted to be single-exon isoforms. We limited the identification of single-exon isoforms to the genes shared between humans and the NHPs in the study. For each gene symbol, we compared the transcript length distribution and the number of exons in the human transcriptome (Ensembl 73, excluding processed and nonsense-mediated decay transcripts) versus the CM or AG assemblies via the Wilcoxon rank sum test. Single-exon isoforms in genes without statistically significant differences in both distributions were retained; the rest were discarded. We evaluated the use of the Wilcoxon rank sum test by applying our methodology to the chimpanzee and RM transcriptomes (Ensembl 73, excluding processed and nonsense-mediated decay transcripts). Chimpanzee has 13,841 genes with gene symbols matching the human transcriptome, and RM has 12,720; however, only 1 gene in chimpanzee and 46 genes in RM showed significantly different transcript length distributions or numbers of exons.
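As an illustration, a Wilcoxon rank-sum comparison of two length distributions can be computed with the large-sample normal approximation (no tie or continuity correction; the published analysis presumably used a standard statistics package rather than this sketch):

```python
from statistics import NormalDist

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation.
    Ties receive average ranks; returns an approximate p-value."""
    n1, n2 = len(x), len(y)
    pooled = sorted(x + y)
    # assign average ranks to tied values
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    w = sum(ranks[v] for v in x)            # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2             # mean of W under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (w - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

Identical distributions give a p-value near 1, while fully separated samples give a small p-value, matching the gene-retention logic described above.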

Gene expression profiling

We obtained public RNA-seq datasets for liver, lymph node, lung, blood, and brain from Homo sapiens (HG), Pan troglodytes (PT), Gorilla gorilla (GG), and rhesus macaque (RM) via Gene Expression Omnibus/ArrayExpress series GSE30352, E-MTAB-513, GSE52166, and GSE50957 [26–29]. We computed gene expression abundance using Htseq-count 0.5.3p3 and used DESeq 1.14.0 [44] to normalize for differences in library size. We used the sva 3.8.0 package [45] in R to adjust for batch effects introduced by combining samples from multiple studies; in particular, we used the ComBat function in the sva package to adjust for batch effects across the five sources of RNA-seq data (USAMRIID, Brawand, Human BodyMap, KirknessSep, and KirknessNov). Since ComBat is designed for microarray data, we converted the counts to log scale before adjustment and exponentiated the adjusted values afterward. We performed principal component analysis using the prcomp function in R. We obtained the list of one-to-one orthologous genes among human, chimpanzee, and rhesus from BioMart [46]. For comparative gene expression profiling of blood samples from RM, CM, and AG, we used DESeq-normalized expression values; no batch-effect correction was required because these samples were all prepared and processed simultaneously.
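The DESeq library-size normalization referred to above uses median-of-ratios size factors, which can be sketched as follows (a stdlib-only illustration of the published method [44], not the package's implementation):

```python
import math

def size_factors(counts):
    """DESeq-style median-of-ratios size factors.

    `counts` is a genes x samples matrix (list of per-gene count
    lists). For each gene, a geometric mean across samples is taken;
    each sample's size factor is the median of its counts divided by
    those geometric means. Genes with any zero count are skipped.
    """
    n_samples = len(counts[0])
    geo_means = []
    for gene in counts:
        if all(c > 0 for c in gene):
            geo_means.append(math.exp(sum(math.log(c) for c in gene) / n_samples))
        else:
            geo_means.append(None)  # excluded from the median, as in DESeq
    factors = []
    for j in range(n_samples):
        ratios = sorted(counts[i][j] / g for i, g in enumerate(geo_means) if g)
        mid = len(ratios) // 2
        median = ratios[mid] if len(ratios) % 2 else (ratios[mid - 1] + ratios[mid]) / 2
        factors.append(median)
    return factors
```

Dividing each sample's counts by its size factor puts the libraries on a common scale; the log transform and exponentiation around ComBat are applied to these normalized values.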

Identification of novel transcripts and validation

We relied on open reading frames (ORFs) to define coding potential, for simplicity and ease of analysis. We searched all six frames of each transcript using TransDecoder (rel16JAN2014) [47] and filtered on a minimum length of 50 amino acids. We then used BLASTP to iteratively align the translated sequences against the human RefSeq proteins, the human subset of the nr protein database, and finally the full nr database (using an e-value cutoff of 1e-2). Compared to BLASTX, this process is computationally efficient and does not compromise sensitivity [48]. We used Primer-BLAST [49] to design primers for validating novel transcripts (Additional file 3).
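A minimal sketch of a six-frame ORF scan with the 50-amino-acid length filter is shown below (TransDecoder additionally scores ORFs by coding likelihood, which is omitted here; only complete ATG-to-stop ORFs are reported, and nested start codons are ignored):

```python
def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orf_lengths(seq, min_aa=50):
    """Lengths (in amino acids, excluding the stop codon) of complete
    ATG-to-stop ORFs in all six reading frames of `seq`."""
    stops = {"TAA", "TAG", "TGA"}
    lengths = []
    for strand in (seq, revcomp(seq)):      # both strands
        for frame in range(3):              # three frames per strand
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if codon == "ATG" and start is None:
                    start = i               # first start codon of this ORF
                elif codon in stops and start is not None:
                    aa = (i - start) // 3   # codons before the stop
                    if aa >= min_aa:
                        lengths.append(aa)
                    start = None
    return lengths
```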

For validation, RNA was extracted from tissues of cynomolgus macaques and African green monkeys using Trizol LS (Invitrogen, Carlsbad, CA). cDNA synthesis was performed using the Superscript III First Strand Synthesis System (Invitrogen, Carlsbad, CA). Amplicons were generated with the replicate primer pairs designed for validation using Phusion Hot Start II DNA Polymerase (New England BioLabs, Ipswich, MA) and run on a 2% agarose gel for confirmation. Positive samples were quantified on the Nanodrop 2000 Spectrophotometer (ThermoScientific, Waltham, MA) and Sanger sequenced on the Applied Biosystems 3730xl DNA Analyzer.