Survey for immunological repertoire and annotation

The genomes of haploid males from a single colony of B. terrestris and of B. impatiens were sequenced by the Bumblebee Genome Consortium and the details of the sequencing, assembly, and automated annotation can be found in [67]. Using OrthoDB [68,69] orthologous groups, we identified orthologs from the two bumblebees, as well as from Apis florea, M. rotunda, and N. vitripennis, Tribolium castaneum, of previously characterized immune genes from D. melanogaster, A. gambiae, and A. mellifera that comprise 27 immune-related gene families or pathways. To complement the orthology searches, we searched for homologs of known immune proteins from the two bumblebees using blastp [70,71] against the official gene sets (NCBI RefSeqs). To confirm the absence of any proteins that appeared to be missing, we searched the genome assemblies and Short Reads Archive using tblastn.

Immunological expression

To confirm the relevance of these genes to immune activation and the validity of novel genes revealed in our annotation we challenged 2- to 3-week-old unmated male and gyne (that is, daughter queen) B. terrestris by injecting them with 2 μl of a suspension of either heat-killed E. coli (Gram-negative) or Arthrobacter globiformis (Gram-positive) at a concentration of 108 cfu/ml, or with sterile Ringer solution under the tergites of the abdomen, or as naïve controls handled them in the same way without any injection. We used 12 replicates for each treatment/caste combination (total N = 96). These experimental bees were the granddaughters and grandsons of queens collected in northern Switzerland in spring 2012 that had established colonies in the lab. We confirmed that these colonies were free of common parasites such as Crithidia bombi and Nosema bombi through visual inspection. All experimental bees were immobilized on ice for 30 minutes before treatment, including the naïve controls. After treatment we housed the bees singly with ad libitum pollen and 50% (w/w) sugar water. Eight hours after treatment we snap froze the bees in liquid nitrogen. We homogenized the abdomens before extraction with 0.5 g Zirkonium beads at 0°C to −4°C using an Omni Bead Ruptor 24 Homogenizer (OMNI International, Kennesaw, GA, USA). We then extracted total RNA using Qiagen RNeasy Plus Mini extraction kits (Qiagen, Hilden, Germany) according to the manufacturer's instructions. We confirmed RNA integrity of every sample with 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA) with the RNA 6000 Nano Kits. We transcribed the RNA to cDNA using Quantitect reverse transcription kits (Qiagen) including controls without reverse transcriptase (no-RT controls) to test for genomic contamination. All samples were checked using quantitative PCR for our housekeeping genes to ensure that the no-RT controls amplified at least 10 cycles later, and thus contain less than 0.1% of the transcripts found in the RT samples.

Based on the full genomic sequences, we selected 27 candidate genes to represent various components of the immune response of insects, including the Toll, JAK/STAT, IMD and JNK pathways; recognition genes, melanization responses, various effectors and antiviral genes. We explored the expression of these genes upon immune stimulation relative to the geometric mean of three housekeeping genes (pla2, ak, ef1a). The full list of genes, their accession numbers and primers can be found in Table 1. All primers were designed using QuantPrime [72], based on the GenBank sequences (Table 1), except those for relish, which were published in [73]. The primers for kayak were designed based on a manually annotated gene given the temporary identifier (Bter:08277927). All primers were tested and have minimal dimer and high amplification efficiency (1.9 to 2.1).

We measured expression on a Fluidigm 96.96 Dynamic array on the BioMark system (Biomark Inc., Pueblo, CO, USA) using EvaGreen as a reporter according to the manufacturer’s protocol (Advanced Development Protocol 14; PN 100–1208 B). We eventually measured expression of 95 samples (12 replicates for each treatment in males and in queens with one naïve queen randomly dropped to make room for the negative control). We ran the samples with three technical replicates and used the average cycle threshold (Ct) of these technical replicates for further analysis.

We standardized expression of all genes of interest relative to the geometric mean of our three housekeeping genes (yielding deltaCt (dCt) values; all dCT values first transformed with Yeo-Johnson power transformations to improve normality and homoscedasticity, 'car' package in R) after confirming that the composite housekeeping value did not vary with sex (F 1,87 = 0.09, P = 0.77), treatment (F 3,87 = 0.29, P = 0.83), or their interaction (F 3,87 = 0.70, P = 0.56) by ANOVA. We analyzed these dCt values using MANOVA with sex (gyne and male) and treatment (naïve, injected with Ringer’s solution, heat-killed E. coli, or heat-killed A. globiformis) as fixed, fully crossed factors (base package in R). We used MANOVA for these analyses, since the expression of any of these genes is not independent of the expression of other genes and because MANOVA accounts for multiple testing and is thus robust to type I error. When MANOVA effects were significant, we subsequently explored the univariate individual gene effects.

Building gene family phylogenies

We retrieved protein sequences of selected gene families from OrthoDB [68,69] and aligned them using ClustalW [74] and adjusted the alignments manually or with Gblocks [75] before tree-building using MrBayes [76] with the mixed model. We ran MrBayes for as long as was necessary (typically for 20,000 to 400,000 generations) to reduce the average deviation of split frequencies below 0.01 or until the split frequency approached 0.01 but did not improve further. We discarded the initial 25% of the trees as a burn-in.

Testing for signatures of selection

We extracted orthologous groups of immune-related genes from OrthoDB6 [68,69]. From the 130 orthologous groups with sequences from B. terrestris, B. impatiens, A. mellifera and A. florea we extracted 148 multiple sequence alignments containing exactly one sequence from each species. We use these 148 alignments for comparisons between the Bombus and Apis clades. From the 122 orthologous groups that contain M. rotundata sequences we further extracted 139 alignments that also contain a M. rotundata ortholog, which we use as an outgroup to compare social with solitary (non-social) bees. In six of the alignments (abaecin, basket, cactus, defensin, kayak and tak1) one or more orthologs were not present in OrthoDB6 and had to be retrieved from an alternative source (that is, NCBI). Protein sequences were aligned independently for the four-taxa (Bombus and Apis) or five-taxa trees (with Megachile) with ProGraphMSA [77] and trimmed using Gblocks with the stringent settings as described in [75]. Where orthologous groups contained multiple sequences for some species the most closely related sets of sequences were aligned. In the 12 orthologous groups that contained more than one sequence for each species we extracted the maximum number of alignments, such that each alignment contains only one sequence from each species. We retrieved cDNA sequences for the alignments from the official gene sets (A. mellifera v.4.5, A. florea v.1.0, B. impatiens v.2.0, B. terrestris v.1.0, M. rotundata v.1.0) using a custom written Python script.

We used likelihood-based codon models implemented in the PAML package [78] to analyze differences in the rate of evolution and to test for signals of selection. We tested hypotheses by using likelihood ratio tests to select the best fitting model among pairs of nested models that differ only in their representation of ω, the ratio of non-synonymous to synonymous substitutions (ω = dN/dS). We make the assumption that ω > 1 indicates positive selection, while ω < 1 and ω = 1 indicates negative and neutral selection.

The average rate of evolution was determined using the M0 [79] model, which assumes a constant ω across all sites and branches. The average ω is not a good indicator for the presence of positive selection, since functional and structural constraints ensure that most sites in functional genes are conserved [80]. Hence, we used the M7 and M8 models to test for the presence of positively selected sites. [79]. Both models allow ω to vary from site-to-site according to a Beta distribution, but the M8 model additionally allows some sites to evolve with ω > 1, to account for sites under positive selection.

In order to detect episodes of positive selection on the connecting branches between clades we used the branch-site model [81,82]. Some branches are assigned a priori to the foreground, where some sites are allowed to evolve with ω > 1, while all sites on background branches are constrained to ω ≤ 1. The branch-site model is compared to a null model where there is no difference between foreground and background branches. We used Clade model D [83] to test for more general differences between clades. This model allows ω to differ between clades in some sites. It is compared to a null model where there are no differences in ω between clades.

To ensure that the PAML optimization did not get stuck in local optima we used six different initial estimates for ω in all analyses and initialized branch lengths to values calculated with PhyML [84]. We corrected for multiple testing by controlling the false discovery rate using the method of Benjamini and Hochberg [85]. To calculate the posterior probabilities of sites being under positive selection in the M8 and Branch-site models we used the Bayes Empirical Bayes (BEB) approach implemented in PAML [86].

We repeated the analyses using Probcons [87] for aligning sequences. However, we only report the results from alignments produced by ProGraphMSA, as these alignments give more conservative estimates and hence a smaller chance of falsely reporting positive selection. Similarly, we do not report results from using Gblocks with the relaxed settings, as described in [75], or no trimming at all. These results are available from the authors.

Data

Sequence data can be found on NCBI (B. impatiens BIMP_2.0 RefSeq Assembly GCF_000188095.1, B. terrestris Bter_1.0 GCF_000214255.1, A. mellifera Amel_4.5 GCF_000002195.4, A. florea Aflo_1.0 GCF_000184785.1, M. rotundata MROT_1.0 GCF_000220905.1). Alignments used in this manuscript can be found in Additional files 12.