Genome sequencing

Ants were collected into 95% ethanol and stored at −20 °C. DNA was extracted using a standard phenol-chloroform procedure. Two Illumina libraries were used for P. gracilis genome assembly: a small insert ‘fragment’ library and a mate-pair library. The fragment library was prepared from the DNA of a single haploid male specimen and was size-selected so that 100 base paired-end reads would overlap, as suggested by the authors of ALLPATHS-LG17. The mate-pair library was prepared from the remaining DNA available from the fragment library male combined with DNA from three of his brothers. The mate-pair library preparation was size-selected for 3 kb inserts. Each library was sequenced on its own Illumina HiSeq2000 lane.

The genome was assembled using ALLPATHS-LG release 44837 (ref. 17). Because our mate-pair library was prepared using DNA from multiple individuals, we ran the assembly pipeline twice. The first time, we set the PATCH_UNIPATHS, PATCH_SCAFFOLDS, and FIX_SOME_INDELS options to false to exclude the mate-pair reads from the part of the pipeline that uses them to patch holes in the assembly. We then ran the pipeline again, this time with default settings, but used the error corrected mate-pair reads resulting from the first run as input. These error-corrected reads were corrected to match the sequences from the fragment library, eliminating any heterozygosity introduced by multiple individuals. Therefore, the final genome assembly is equivalent to the haploid sequence from a single male.

We used CEGMA v2.4 (ref. 27) to assess the completeness of the assembled genome sequence. We used PILER-DF v1.0 (ref. 18) and RepeatModeler (http://www.repeatmasker.org) to predict the presence of TE sequences. CD-HIT (ref. 44) was used to reduce redundancy in the set of predicted TEs by representing sequences that were at least 80% similar over 80% of their length by the single longest sequence. We then aligned our TE predictions against the UniProt Swiss-Prot45 and Drosophila melanogaster proteins (FlyBase release FB2013_02) using BLASTX and removed predictions with bit scores of at least 100 or alignments of 50% similarity over 50% of sequence length as false positives46. We also removed TE predictions less than 80 bases long46. The remaining predictions were classified with the RepeatClassifier module of RepeatModeler. RepeatMasker (http://www.repeatmasker.org) and RepeatRunner47 were used by MAKER to generate a genome-wide repeat annotation based on the final set of TE predictions. The MAKER v2.30 (ref. 24) automated annotation pipeline was used to identify genes in the P. gracilis assembly. Ab initio gene predictions from both SNAP48 and AUGUSTUS49 were used in conjunction with the assembled ESTs and all annotated protein sequences from the ants Acromyrmex echinatior (OGS v. 3.8)50, Atta. cephalotes (OGS v. 1.2)51, Camponotus floridanus (OGS v. 3.3)52, Harpegnathos saltator (OGS v. 3.3)52, Linepithema humile (OGS v. 1.2)53, Pogonomyrmex barbatus (OGS v. 1.2)54, and Solenopsis invicta (OGS v. 2.2.3)55, honey bee, Apis mellifera (OGS v. 1.1)56, and jewel wasp, Nasonia vitripennis (OGS v. 1.2)57. All Hymenopteran gene sets were obtained from www.hymenopteragenome.org on April 3, 2013. SNAP and AUGUSTUS were initially trained on the 451 (of 458) CEGMA genes identified in the P. gracilis genome.

We ran standalone InterProScan version 5.2–45.0 (ref. 25) on all resulting gene models, including the predictions for which no transcript or protein evidence existed. If these predictions were associated with at least one gene ontology term, they were promoted to full models and included in the gene set. Sets of gene ontologies were also generated for Acromyrmex echinatior, Atta cephalotes, Camponotus floridanus, Harpegnathos saltator, Linepithema humile, Pogonomyrmex barbatus, Solenopsis invicta, Apis mellifera, and Nasonia vitripennis. We used the GO Term Finder software version 0.86 (ref. 58) to find gene ontology terms enriched within the P. gracilis annotation.

We used OrthoMCL version 2.0.9 (ref. 59) with default settings to find orthologues between the P. gracilis genes and the annotations from all Hymenoptera listed above as well as Drosophila melanogaster. When genes of interest had multiple possible orthologues or none found with OrthoMCL, we used BLASTP to find the most similar sequence in Drosophila.

Transcriptome

RNA was extracted using the method of Alaux et al.60 with minor modifications. Briefly, ants were homogenized in 1 ml Trizol (Invitrogen Life Technologies, Grand Island, NY) and incubated at room temperature for 5 min. The solution was shaken vigorously for 15 s after adding 100 μl water and 200 μl chloroform, followed by another room temperature incubation for 3 min. The resulting mixture was then centrifuged for 15 min at 12,000 g at 4 °C. The aqueous phase was removed, mixed with an equal volume of 70% ethanol and transferred into a Qiagen RNeasy column for RNA extraction. DNA was eliminated with on-column DNase I (Qiagen, Valencia, CA) treatment.

Six RNA extractions were performed for P. gracilis transcriptome sequencing. Two of these included 10 workers, 2 included 10 larvae, 1 included 10 pupae and 1 included 3 pupae, ∼20 eggs and 6 larvae. All individuals used for RNA extractions were alive at the time of extraction. The RNA concentrations of these 6 extractions were determined with a NanoDrop2000 (Thermo Fisher Scientific, Waltham, MA) and were combined so that RNA from each extraction was equally represented in the final pool.

A standard library created from the RNA pool was sequenced on a single paired-end 100 base lane of the Illumina HiSeq2000 platform. Reads and their partners were eliminated from the data set if <90% of the read length had quality scores >20 using the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/index.html). Remaining reads were assembled with the 25 February 2013 release of Trinity61 using default settings. Transcript redundancy was reduced using CD-HIT44 by clustering sequences at least 97% similar over 95% of their length.

Annotation

Our initial annotations with the automated pipeline MAKER24 yielded a number of gene fusions; thus, we ran MAKER several times using different sets of evidence to reduce these anomalies. First, we executed MAKER with only protein sequences and ab initio predictors. The predictors were retrained on these results and re-run using the same data. We then executed MAKER again with the resulting models but with the addition of the transcriptome data to allow for the addition of untranslated regions to the protein-predicted genes. Lastly, we ran MAKER again with all available protein and transcript data. Our final gene set consisted of non-overlapping gene models from these separate MAKER runs with preference given to those models resulting from the protein-only-based models, except where two genes from the second set overlapped a single protein-only prediction. We also set the MAKER parameters ‘correct_est_fusion=1’ and ‘always_complete=1’, to further improve the annotation. Our approach successfully reduced the number of gene fusions present in the final gene set.

NuMt identification

We queried the 13 protein sequence annotations from the complete Solenopsis geminata mitochondrial genome (HQ215537.1) against the assembled ant genomes using TBLASTN. The hit with the lowest e-value was used as the seed to identify nuclear mitochondrial-like copies of each gene in each genome. These seed nucleotide sequences were queried against the genome using BLASTN and hits with e-values <1 × 10−20 covering at least 35% of the length of the seed sequence were counted as introgressions of that particular gene.

GBS sequencing

The queen and 47 workers of P. gracilis from a single colony in the same population as the individual whose genome was sequenced were genotyped using the GBS approach of Elshire et al.19 with an additional size selection step. As in Elshire et al.19, we used the restriction enzyme ApeKI and drew all adapter and barcode sequences, and the majority of the molecular techniques from that study. The barcoded adapters (5′-ACACTCTTTCCCTACACGACGCTCTTCCGATCTxxxx-3′ and 5′-CWGyyyyAGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT-3′, where ‘xxxx’ and ‘yyyy’ are the barcode sequences shown in Supplementary Data 2) and common adapters (5′-CWGAGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG-3′ and 5′-CTCGGCATTCCTGCTGAACCGCTCTTCCGATCT-3′) were annealed at 50 μM by temperature ramping from 95 °C for 2 min down to 25 °C by 0.1 °C s−1 and holding at 25 °C for 30 min. Approximately 0.06 pmol of a barcoded and common adapter were combined with ∼200 ng of sample DNA and dried down overnight at 37 °C. Restriction digests were performed at 75 °C for 2 h in 20 μl volumes with 1 × NEB Buffer 3 and 3.6 U ApeKI (New England Biolabs, Ipswitch, MA). To these reactions, 30 μl of 1.66 × ligase buffer and 640 U T4 ligase (New England Biolabs) were added, followed by incubation at 22 °C for 1 h and heat inactivation at 65 °C for 30 min, to ligate adapters to restricted sample DNA. From these reactions, 20 μl of each sample was pooled and this pool was purified using a Qiagen QIAquick PCR purification kit. We size-selected for fragments 300–800 bp long using agarose gel electrophoresis and purified the sample with the Qiagen gel extraction kit, eluting in 30 μl. The restriction fragment library was then PCR amplified in 50 μl reactions with 2 μl of the pooled library, 2 μl of each primer (5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT-3′ and 5′-CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT-3′) at 5 μM each, 21 μl of water and 25 μl of 2 × Taq master mix (New England Biolabs). Thermocycling consisted of 95 °C for 1 min followed by 18 cycles of 95 °C for 30 s, 65 °C for 30 s and 72 °C for 1 min with a final extension of 5 min at 72 °C. This reaction was replicated four times and the results pooled to reduce PCR biases. The pooled library was purified using a Qiagen QIAquick PCR purification kit and eluted in 30 μl. The result was sequenced directly on a single lane of an Illumina HiSeq2000 with 100 bp single-end reads.

Linkage mapping

All GBS sequence processing was performed using the pyRAD20 pipeline. We expected restriction recognition site-associated sequences of CWGC for ApeKI. Default quality controls were used and reads were clustered at 90% similarity. We required that no more than three indels could be present in within sample clusters and at least 12 reads to be assigned to a within sample cluster.

We required that the queen was genotyped as heterozygous at each locus used in mapping analyses, as homozygous loci in the queen cannot contribute to the linkage analysis. Ants are haplodiploid; queens and workers are diploid and males are haploid. Typically, Hymenopteran linkage maps are produced by examining the haploid male offspring of a single queen. We instead examined the diploid worker offspring from a single nest. Therefore, each genotyped individual was either homozygous if the father had the same allele as the queen or heterozygous if those alleles differed. As we did not know the parental phase of the alleles at each locus, we followed the procedure of Wang et al.21. Briefly, we arbitrarily assigned the two alleles at each locus to different phases and added duplicates of these alleles with the phases reversed to the input data set. This resulted in a doubled number of linkage groups with two copies of each group consisting of the same loci with mirrored phases. One of the copies for each linkage group was arbitrarily discarded from the results.

We used MSTmap22 for linkage analysis with the Kosambi distance function and default error detection parameters (no_map_dist=15.0 and no_map_size=2). The missing data threshold was set to 25% and the cutoff P-value for locus linkage was set to 7.5 × 10−6 after some parameter exploration. We mapped loci that were successfully assigned to linkage groups to the genome sequence using the Burrows–Wheeler Aligner62 excluding alignments with gaps longer than three bases (-w 3) and discarding reads with more than one alignment in the genome (-c 1).

The resulting genotype calls were strictly filtered for inclusion in linkage map analysis. Samples were checked for excessive heterozygosity and those with large amounts of missing data were excluded. Loci were excluded if <75% of individuals were genotyped at that locus or if >2 alleles were present. The two alleles at each locus are expected to occur in approximately equal proportions. Deviations from this expectation may indicate the inclusion of paralogues in genotype calling or other types of misleading data. We therefore performed Fisher’s exact tests on allele counts at all loci and those found to differ significantly from the expected 50% ratio were discarded.

Sequencing of six ingroup Pseudomyrmex species

We sequenced a single diploid worker from mutualists P. concolor, P. dendroicus and P. flavicornis, and generalists P. pallidus, P. elongatus and the undescribed species P. sp. PSW-54. These species were chosen as representatives of mutualist/generalist sister clades within the genus based on Ward and Downie13, although the most probable phylogenetic positions of P. concolor and P. pallidus are different according to our results. We also sequenced an additional diploid worker of P. gracilis so that diploid genotypes were known for each species. Eight lanes of 100 bp paired-end Illumina HiSeq2000 sequencing were used, one for each species and two for P. gracilis. We estimated genome repetitiveness and single-nucleotide polymorphism rate of all diploid genomes from the raw data using k-mer spectrum analysis of the error correction module of ALLPATHS-LG17. Voucher specimens representing the species for which full genomes were sequenced have been deposited in the Field Museum’s collection under accessions FMNHINS 2821891 to 2821900.

Assembling Pseudomyrmex genomes

We used Stampy version 1.0.21 (ref. 63) to map reads from each additional Pseudomyrmex species to our P. gracilis assembly, specifying a 3% substitution rate between the reads and the mapping reference. We then used Platypus version 0.5.1 (ref. 64) to call genotypes of each species. We required there to be 5 × coverage to call genotypes, filtered PCR duplicate reads, allowed up to 30 variants in each window and merged variant containing windows. We also calculated coverage at all sites using the ‘mpileup’ command of samtools65. We then used custom python scripts to merge these two sources of data into consensus sequences. Those sites for which Platypus did not yield genotype calls were called as the reference base only if they had 5 × or greater coverage. Otherwise, sites were called as unknown bases. When Platypus yielded >3 alleles at a site, they were also masked as unknown bases. We maintained the alignments of reads produced by Stampy and indel calls of Platypus, yielding a whole genome alignment of all seven Pseudomyrmex genomes. This alignment formed the basis of all genomic comparisons. Coordinates and reading frames of coding regions for all genomes were based entirely on the P. gracilis annotation.

Pseudomyrmex phylogeny

We designed the taxonomic sampling of this research based on the phylogeny of Ward and Downie13 who showed that three pairs of mutualistic and generalistic lineages existed within Pseudomyrmex: P. concolor and P. pallidus, P. dendroicus and P. elongatus, and P. flavicornis and P. sp. PSW-54. However, before proceeding with our analyses, we aimed to confirm the phylogenetic relationships of each of these species. We therefore concatenated all scaffolds from each species and inferred a phylogeny using the threaded version of RAxML version 7.3.0 (ref. 66) and the GTRGAMMA model of evolution. We also inferred phylogenies from 25 kb sliding windows with step sizes of 5 kb, analysing only those windows for which all species had known sequence over at least half the window length.

Tests for selection

We only examined signatures of selection in genes longer than 300 codons for which we obtained sequence from all seven species, requiring that each sequence had <20% missing data. Premature stop codons occurred in at least one sequence from ∼40% of genes that otherwise passed quality metrics. These stop codons typically were due to misalignments and were masked for all analyses. We used Gblocks version 0.91 (ref. 67) with default parameters to mask unconserved regions of gene nucleotide sequence alignments.

We used both the branch test and branch-site test implemented in PAML version 4.7 (ref. 37), to test for signatures of positive selection within the mutualistic lineages. We conservatively specified all three mutualists as ‘foreground’ branches, reducing the chances of spuriously identifying genes under selection in just a single lineage and therefore not necessarily related to mutualism. We used the ‘cleandata’ setting to remove ambiguous codons from sequences and used the topology recovered in this study as the species tree with the branch lengths from the total evidence whole-genome phylogeny as the initial estimate (fix_blength=1).

For the branch-site test, we first fit a model that allowed the mutualistic lineages to have a proportion of sites with dN/dS≥1, whereas the generalists had dN/dS≤1. This model was then compared with a second model that did not allow dN/dS to exceed one at any sites. The two models were compared using likelihood ratio tests and the resulting P-values were corrected to a 5% false discovery rate. We found that the branch-site test sometimes falsely identified genes under positive selection when they simply had high rates of substitution. To remedy this issue, we tested those genes that were found to be significant again, setting the non-mutualistic lineages as the foreground branches. Those that were detected to be under positive selection in both tests were discarded.

For the branch test, we again fit two models to the sequence data, setting the mutualistic lineages as the ‘foreground’ branches and either allowing for the mutualists and non-mutualists to differ in estimated dN/dS ratios or fitting a model with just a single ratio across the tree. The resulting models were again compared with likelihood ratio tests and false discovery rate corrected. When possible, those genes found to be of interest were modelled structurally with Phyre2 (ref. 36) to determine the active sites of the proteins.

Examining rates of molecular evolution

We assessed rates of molecular evolution of each sequenced species using a number of approaches, all of which were based on pairs of most closely related mutualists and generalists. It is noteworthy that we considered P. concolor and P. pallidus to be a pair of mutualist and generalist for all analyses, although, given our phylogenetic findings, other taxa were equally closely related. Most basically, we calculated genetic distance of each species from P. gracilis using the same sliding windows used to create phylogenies from across the genome. We assessed the statistical significance of differences in genetic distances using paired t-tests, pairing distances from the same windows. We also counted the number of windows for which mutualists had consistently greater genetic distances from P. gracilis than the most closely related generalists. We used a similar approach with the branch lengths recovered from the inferred phylogenies of each window.

We also estimated rates of non-synonymous and synonymous substitutions in all genes using the free-ratios model implemented in PAML. Again, we used the species tree from this study and used ‘cleandata=1’ to mask ambiguous data. Only those genes with dN/dS ratios <10 were examined as larger values were very likely to be due to assembly or annotation errors. Wilcoxon signed-rank tests were used to compare overall rates of change between species. We confirmed the quality of our overall estimates of dN/dS by summing all dN and dS estimates within each species and calculating the ratio based on these sums, thus reducing the influence of outliers.

Classifying and quantifying repetitive elements

We randomly subsampled 100,000 paired-end sequences from the raw data for each species using seqtk (https://github.com/lh3/seqtk). We then ran Transposome68 on these data with a merge threshold of 0.0001. All other parameters were defaults. Clustered sequences were classified by Transposome using the repeat database built from the P. gracilis genome.

Codon usage

Codon usage bias can be a result of selection, in particular when genes are highly expressed and translation efficiency is influential to their production. We measured codon usage bias by estimating effective codon number with ENCprime69 using the 2,000 bases flanking each end of a given gene to estimate background nucleotide composition. In addition to examining correlations between effective number of codons and the rates of non-synonymous and synonymous changes in each genome estimated by PAML, we also estimated the same parameters using HyPhy38. We tested for differences in genome-wide codon bias between taxa using Wilcoxon rank-sum tests.

Duplications

To determine whether particular genes have been duplicated in mutualistic lineages, we mapped reads from each species to the P. gracilis genome using mrFAST version 2.6.0 (ref. 70) allowing 6% edit distance. The genome was first repeat-masked using RepeatMasker version 4.0.1 (http://www.repeatmasker.org). We examined the resulting mappings with mrCaNaVaR version 0.51 (ref. 70). We then looked for regions where estimated sequence copy number was at least 1.5 and all of the species with each life history were consistently at least 1.5 × higher than all species with the other life history. In particular, we focused on coding regions that showed these patterns.

Signatures of parallel evolution

We used Saguaro version 0.1 (ref. 28) with default settings to search for signatures of parallel evolution within the mutualistic lineages, first filtering for those sites where non-gap calls had been made in at least two mutualists and two generalists. We also examined the topologies resulting from our sliding window analyses of phylogenetic relationships to determine whether any parts of the mutualistic genomes appeared more closely related to each other than to any generalists.

Estimating population sizes

We estimated θ for each of the ingroup species using G-PhoCS33. Genomic regions found to be repetitive by RepeatMasker (http://www.repeatmasker.org) in the P. gracilis reference genome were masked from the alignment, as were all coding sequences and the flanking 5,000 bases, to increase the odds of obtaining neutrally evolving sequences. We then sampled all loci of length 500 at least 15 kb apart with sequence data for at least four of the six in-group species. Sequence was not considered present in a given species if >20% of the total length consisted of undetermined sequence or gaps.

The first 1,000,000 iterations of the G-PhoCS MCMC run were discarded as burn-in and values were sampled for 2,000,000 iterations every 100 iterations thereafter for a distribution of 20,001 samples. These distributions were manually inspected for convergence.

Expression levels of fast-evolving genes

We determined the set of genes with consistently higher rates of molecular evolution in mutualistic lineages using the phylogenetic framework outlined above. Those genes for which all three mutualistic lineages in each pair of species (P. concolor versus P. pallidus, P. flavicornis versus P. sp. PSW-54 and P. dendroicus versus P. elongatus) had greater rates of molecular evolution were considered to be fast-evolving in mutualists. Those genes that showed the opposite pattern, with rates of change consistently higher in generalists, were used for comparison, as were the genes with no consistent pattern between mutualists and generalists. We used modENCODE’s library of Drosophila tissue gene expression data to determine whether these fast-evolving genes are expressed at different levels in particular tissues. There are 29 tissue types included in the modENCODE data set. For all genes for which a Drosophila orthologue was found using OrthoMCL, modENCODE expression levels were standardized to one across all tissues. Within each tissue, the level of expression of the fast-evolving genes was compared with standardized expression of all other genes using t-tests.

Selection on aggression genes

We expected genes known to be involved in aggression in other social Hymenoptera to be over-represented in the genes we identified as being under convergent selection or evolving at higher rates in mutualists due to their highly aggressive behaviour. Therefore, we assembled lists of genes putatively involved in aggression in honey bees39. The first list consists of 51 genes confidently determined to be involved in aggression. The second list includes 1,922 genes, all of which were implicated in aggression in some way. We identified orthologues in Pseudomyrmex and looked for over-representation of fast-evolving genes among these aggressiveness genes.

Data availability

All of the raw sequence data that support the findings of this study have been deposited in the National Center for Biotechnology Information’s Whole Genome Shotgun database under BioProject PRJNA268384 along with the P. gracilis assembly. Pseudomyrmex gracilis annotations and aligned reference-based assemblies for all species are available from the Dryad Digital Repository at http://dx.doi.org/10.5061/dryad.q530s as well as at http://benrubin.science/data/. All other data necessary to obtain our results were obtained from http://hymenopteragenome.org and http://flybase.org.