Detecting non-coding sequence conservation

All insect genomes were obtained from the core databases in Ensembl metazoa release 21 [32]. To search for conserved non-coding sequences, we extracted the 2 kb sequence upstream of each gene’s translation start site in N. vitripennis and compared this sequence with the sequence upstream of the orthologous gene in each comparator organism (Additional file 1: Figure S9). Our choice of 2 kb as an appropriate sequence length to analyse was twofold; firstly, based on the OGS v1.2 gene annotation, 2 kb is enough to cover over 95 % of 5’ UTR sequence, secondly, it is a computationally tractable amount of sequence given our resource and time constraints. Orthologs of N. vitripennis genes were computed using a pairwise reciprocal best BLAST hit (RBH) approach [33] based on the protein sequences (protein sequences from Ensembl metazoa release 21) of all transcripts in each genome. RBHs are calculated by a two-way comparison; for example if the best BLAST hit for a gene A in Nasonia is gene X in Drosophila, then it is called an RBH if and only if the best BLAST hit for gene X in Drosophila is gene A in Nasonia. The stringency of this criterion provides high-confidence orthologs. The number of N. vitripennis orthologs calculated for each genome is shown in Table S6. For our purposes of conducting a computationally intensive search for deeply conserved regions, this method was judged preferable to other, more comprehensive ortholog search methods such as orthoMCL [34] through a performance comparison of Apis mellifera and Nasonia vitripennis ortholog detection. While orthoMCL detects orthologs for more genes than the RBH method, RBH detects orthologs for most of these genes (>70 %, Additional file 1: Figure S11). Due to the higher sensitivity of orthoMCL to events such as gene duplication, the pairwise comparisons necessitated by an orthoMCL search (217,485) compared to those (best BLAST hits only) necessitated by an RBH search (7,435) would have been computationally prohibitive.

The 2 kb sequences were then extracted upstream of each translation start site. The choice of 2 kb allows us to capture both 5’ UTR and putative promoter sequence, while still being computationally tractable. If the 5’ end of a 2 kb sequence overlapped with a nearby gene, the sequence was truncated. If the sequence was entirely contained within another gene, then it was removed from the analysis entirely. Nasonia sequences were compared against sequences from each other species. The seaweeds algorithm provides optimal alignment scores for all pairs of possible windows across the two sequences, i.e. about 4 million short optimal alignments. The window length was chosen to be 50 bp. The alignment score was set to 1 for a match, 0 for a mismatch, and −0.5 for any gap. The rationale for this scoring matrix, and for using alignments in general, was to perform a search without a priori knowledge of the regions in question or the types of regulatory elements likely to be found. The choice of 50 bp is a variation on a previous study [35] which allows for greater sensitivity while maintaining specificity. An advantage of the window-based seaweeds algorithm [15] over other algorithms such as Smith-Waterman [36] is the avoidance of the “shadow effect” [37] where longer, but biologically less significant alignments may be computed while different, shorter alignments are ignored. Instead all windows are considered equally and results can be easily compared and tested for statistical significance as all sequences are of equal length.

To calculate a conservation score for a pairwise comparison, we take into account sequence conservation and punitively apply information about annotated repetitive elements to produce intermediate scores. These intermediate scores are then scaled from 0 to 1 using an lower (L) and upper (U) threshold; where scores below L are assigned a conservation score of 0, scores above U are assigned a conservation score of 1, and scores in between are defined on a sigmoid curve to reflect an initially exponential increase in confidence as scores increase above the lower threshold, which levels off as scores approach the upper threshold reflecting saturating confidence. The results of all the pairwise alignments were bundled together to form one inclusive dataset. Essentially, this step involves identifying individual small CNEs which actually map to the same sequence, and combining them into a single CNE (illustrated in Additional file 1: Figure S10). During the bundling process, significantly overlapping alignments, i.e. individual CNEs which score above the lower bound but which are not disjoint from one another, were merged together into longer regions and significant regions in two or more species which mapped to the same subsequence in N. vitripennis were identified and merged into a single CNE. Each potential merged CNE is then assigned a combined conservation score (CCS); the combination of the conservation scores from each pairwise comparison. The CCS is computed using the following formula:

$$ 1\ \hbox{-}\ {\Pi}_i\ \left(1\ \hbox{-}\ {P}_i\right) $$ (1)

Where P is the maximum conservation score for a potential CNE in a species, and i indexes the species.

To set the appropriate parameters for detecting conservation, we aligned randomly paired upstream sequences (pseudo-orthologs) to get an idea of what scores could be expected by chance. Using real non-orthologous genomic sequence as a control is preferable to using randomized or ‘shuffled’ sequence as it maintains the complex sequence makeup of true genomic sequences, whereas in shuffled sequences these motifs will be depleted providing a less stringent control. We first identified candidate CNEs by running the pseudo ortholog sequences and the true ortholog sequences with lower L and U parameters (L: 80 U: 94), and setting a CCS threshold at the level where no conservation was detected in the pseudo ortholog set. The lower and upper bound parameters were initially set based on examination of pairwise comparison histograms; the lower threshold was set at the point where scores were unlikely to be meaningful (i.e. where the control set shows a similar number of CNEs), and the upper threshold at a level where no sequences were reported in the control, as performed by Baxter et al. [35]. These candidate CNEs were filtered for similarity to known coding sequences, and after increasing the parameters to the level where no conservation was detected in the pseudo-ortholog control (L: 87 U: 100), we scored these candidate CNEs for conservation producing a final filtered set of 322 CNEs. At this point of filtering, the lower and upper bound are simply used to define a continuum of confidence scores for CNEs already known to be significant.

Filtering for coding sequences and pseudogenes

To ensure that the CNEs were unlikely to contain unannotated coding sequences or pseudogenes, we utilized BLASTX 2.2.27+ [38] to filter out such sequences. To set an appropriate filtering threshold, we first randomly permuted the nucleotide sequences in the set to be filtered. These sequences were then scored against the nr protein database [39] using BLASTX, and the minimum (most significant) E-value score was noted as the most significant hit likely to be produced by random sequences with identical nucleotide composition to the set to be filtered. Once this threshold was set, the true sequences were scored against nr using BLASTX, and any sequence scoring below this threshold was discarded. The thresholds and number of sequences filtered by this method can be seen in Additional file 2: Table S7.

Pseudo CNE comparisons

To elucidate properties of the CNEs through comparison, we generated a set of pseudo CNEs. The set of 359 pseudo CNEs that we obtained were created by aligning pseudo orthologs with relatively low threshold parameters (L: 80 U: 94), applying a CCS cutoff of 0.528 to retrieve a similar number of sequences to the true CNEs, and performing BLASTX filtering. Pseudo CNEs constitute a good control as they are similar to true CNEs in every way but for the fact that they do not represent true orthologous sequences. By comparing the CNEs with these pseudo CNEs, we can therefore identify properties likely to be characteristic of the truly conserved non-coding sequences.

All comparisons were performed on the N. vitripennis versions of each CNE. GO term overrepresentation analysis was performed on both the CNEs and pseudo CNEs using BiNGO [40], a plugin for Cytoscape [41]. We used annotation obtained from the Gene Ontology Consortium [42] (data-version: 2013-08-23) and tested against a background annotation set, formed by combining annotation derived from the D. melanogaster Ensembl homologs of the full set of N. vitripennis genes, ensembl GO annotation, and the official gene set 2 GO annotation on NasoniaBase [43, 44]. Statistical comparisons of the distributions of GC content, CpG observed/expected, CNE length, and CNE position were performed using the Wilcoxon rank sum test in R [45]. P-values of 2.2e-16 are at the floating point precision limit (p ~ 0 at machine precision). For each CNE, the distance from translation start site was calculated as the distance from the 3’ end of the CNE from the 5’-most annotated translation start site of each gene. Unless indicated otherwise, all comparisons are between the N. vitripennis true CNEs and the N. vitripennis pseudo CNEs.

Nucleosome occupancy prediction and GC content analysis

We used the nucleosome occupancy prediction software described in [46] to predict nucleosome occupancy within each CNE and in the flanking region. We extracted the sequence of each CNE along with flanking sequence in order to obtain a 10 kb sequence centered on each CNE. Sequences containing Ns were removed as per the software requirements. This step removed a significant proportion of CNEs - 142 of 322 CNEs (44 %). Control sets were produced by extracting, for each CNE, a region with the same properties (length, and distance from translation start site) upstream of a randomly selected gene. 10 random sets were created, and the results from these controls were averaged and the standard deviations calculated. For the GC content comparison, sequences were extracted in the same way and GC content was measured in a sliding window (50 bp window size, 10 bp step size) along each sequence using a custom Perl script.

5’ UTR analysis

To get an estimate of how many sequences were conserved in transcribed regions as opposed to non-transcribed regions, we split each CNE into all possible 20-mers and used bowtie2 v2.0.5 [47] to map each sequence to the N. vitripennis evidential gene transcriptome dataset [43, 44] supplemented with RNA-seq data (data available on http://www.waspatlas.com), as well as the Ensembl version of the official gene set OGS v1.2 augmented with the same data. A 20-mer was counted as transcribed if it mapped to either transcriptome, and the percentage of mapped to unmapped reads was calculated to give an overlap percentage for each CNE.

To test for the presence of conserved secondary structures, the SCI (structure conservation index) was calculated for each CNE alignment and used to compute the probability of a conserved secondary structure. The SCI is defined as the minimum free energy of the consensus folding structure divided by the mean minimum free energy (MFE) of each sequence in the alignment folded independently [48]. Sequences in alignments with high structural conservation will show similar energies whether allowed to fold independently or forced into the consensus structure. A high SCI (close to 1) therefore indicates a well-conserved structure, and an SCI of more than 1 may indicate the presence of compensatory mutations. We implemented the SCI in Perl using RNAfold [49] and RNAalifold [50], and used code from [51] to implement a shuffling procedure as a control. Alignments are shuffled on a column-by-column basis, keeping the overall conservation pattern intact. By generating sets of shuffled alignments in this way, we can thus calculate the probability that the conservation of RNA secondary structure in the true alignment is exceptionally strong. For each CNE, we generated 1000 control sequences, calculated the Z score distribution, and used this to generate an empirical p-value. 29 CNEs showed conserved structure at p < 0.05, and 11 at p < 0.01.

Motif overrepresentation

To test for overrepresentation of motifs, we used a set of 1038 position weight matrices, including matrices from JASPAR [52] and PLACE [53] and followed the procedure in [35]. We reduced redundancy by performing hierarchical clustering based on the Hellinger distances between matrices, yielding a set of 735 representative matrices at a threshold of 1.5. The matrix with the median entropy was selected to be the representative of each cluster. Motifs were tested for overrepresentation using a binomial test taking into account the strength of the matches. We produced 100 control sets using the same method as in the nucleosome occupancy prediction/GC content analysis section, with the exception that if we found Ns in the true CNEs then Ns were inserted into each control CNE in the same positions. A matrix was counted as over-represented if the binomial p-value in the true CNE set was lower than the p-value in all control sets, and underrepresented if it was higher than the p-value in all control sets. Of the 735 representative matrices, 88 were found to be under-represented compared to the controls, and 35 were over-represented. As with the nucleosome occupancy analysis, this analysis was highly affected by GC content; in general, matrices with high GC content were found to be overrepresented whilst those with low GC content were found to be underrepresented. GC content of a matrix was measured proportionally with the weight of each position; a matrix with several highly weighted G or C nucleotides will thus have higher GC content than a matrix with G or C nucleotides that have low weight.

Not1 CNE characterization

To test for presence or absence of the Not1 CNE, the consensus sequence from the original CNE alignment was scored against the 100 bp upstream sequences of Not1 homologs in all organisms available in Ensembl metazoa using BLAST [38]. The E-value distribution was plotted, and sequences with E-values lying outside of the main distribution (E < 0.001) (Additional file 2: Table S8) were aligned, and the 30 bp hairpin loop sequences extracted. The stabilities of the hairpins were predicted using RNAfold [49] and sequence logo diagrams prepared using the seqLogo [54] package in R.

Motif elicitation and RPLP1/RPLP2 analysis

MEME [55] was used to elicit motifs from the RPLP1 and RPLP2 CNEs. After initial exploratory analysis, we searched for the 3 best motifs in the sequences of both CNEs, looking for motifs from 6 to 13 bp in the RPLP1 CNE, and from 6 to 15 bp in the RPLP2 CNE. For the RPLP1 CNE, 500 bp sequences upstream of all RPLP1 homologs analyzed were used, and 2 kb sequences for the RPLP2 homologs analyzed. Due to total sequence length restrictions, the RPLP2 motif elicitation analysis was first done with a seed elicitation followed by analysis of the remaining sequences. 3 motifs were extracted for the RPLP1 CNE, and 2 for the RPLP2 CNE, one of which was manually split into two upon further inspection. Sequences were inspected for presence or absence of the CNE motif components, and sequences containing the CNEs were aligned based on the motif positions. Distances of the motifs from translation start sites were calculated and plotted. Sequence logo diagrams were prepared using the SeqLogo [54] package in R. All genes and organisms used in this analysis can be found in Additional file 2: Table S9.

The motif elicitation analysis was repeated using a phylogeny-aware method, Phylogibbs [56]. The motif-based alignments produced by MAST [55] were cut to 17 sequences due to memory constraints, and used as inputs for the algorithm. Phylogenetic trees were constructed using divergence estimates from timetree [8], and blanket proximity values were assigned to branches based on the approximate figure of 0.85 proximity given in the Phylogibbs documentation for the mouse-rat divergence time of 22.6 Myr. The analysis for the RPLP1 CNE revealed three significant motifs overlapping the motifs produced by MEME. Similar results were also obtained for the RPLP2 CNE. The full results of this analysis including the elicited PSSMs can be found in Additional files 3 and 4.