Biological material and sequencing

Queenright colonies of C. costatus, T. zeteki and T. cornetzi were collected in Gamboa, Panama, and maintained in the lab on a diet of polenta, oatmeal and bramble leaves at 25 °C and 60–70% relative humidity (RH). For A. colombica and T. septentrionalis, ants from single colonies were collected from Gamboa, Panama and from Apalachicola National Forest, Tallahassee, FL, USA, respectively. Fungal cultures were obtained by incubation on potato dextrose yeast-extract agar plates containing streptomycin or, for T. septentrionalis, potato dextrose agar plates containing streptomycin+penicillin followed by propagation in liquid potato dextrose agar medium. Samples that were not immediately processed were stored in RNAlater at −80 °C. DNA and RNA was then extracted using QIAGEN kits or standard extraction protocols (see Supplementary Information for full details). Sequencing libraries with insert sizes ranging from 200 bp to 10 kbp were generated for the genomic DNA using standard procedures, whereas 200 bp fragments were used for complementary DNA sequencing libraries. All libraries were paired-end sequenced on an Illumina HiSeq 2,000 platform with read lengths of 100 bp for small insert sizes, 49 bp for large insert sizes and 90 bp for the cDNA libraries. Queen insemination data are based on earlier work13 supplemented for A. cephalotes by genotyping of ca. 50 workers from 6 colonies using 4 polymorphic microsatellite markers and for T. septentrionalis by using 4 microsatellite markers to genotype ca. 10 workers from 10 field colonies made available by Jon Seal, University of Texas at Tyler.

Assembly and annotation

Genomic sequencing reads were filtered to remove low-quality reads and PCR duplicates, and then assembled using SOAPdenovo (v2.04)32. Contigs were first constructed based on short insert libraries and then scaffolded using paired-end information from all DNA libraries. Unresolved gap regions were locally reassembled by GapCloser (released with SOAPdenovo). Following assembly, we used BLAST33 against NCBI nt databases (e-value cutoff: 10−5) to remove contaminant (bacterial or bacterial+fungal) sequences.

Genomic repeats were annotated by combining several repeat detection methods as described in the Supplementary Information. Protein coding genes were annotated using GLEAN34, to integrate homology- and transcription-based evidence. Protein functions were inferred based on best BLASTP alignment to the SwissProt database35, whereas domains and Gene Ontology (GO)36 annotations were inferred from InterProScan 4.8 (ref. 37) against the InterPro database38. KEGG39 annotations were obtained using the KAAS server40.

In the absence of assembled genomes, fungal RNA-Seq reads were quality filtered and then assembled using Trinity41. Genes were predicted based on inferred open reading frames as described in the Supplementary Information, keeping only the longest isoform for alternatively spliced transcripts. Functional annotation was performed using the same method as described for genome-based annotations above.

Gene family analyses

Genes from the seven attine ant and five other ant genomes (Solenopsis invicta, Pogonomyrmex barbatus, Camponotus floridanus, Linepithema humile and Harpegnathos saltator), as well as three outgroup insects (Apis mellifera, Drosophila melanogaster and Nasonia vitripennis) were clustered into gene families using OrthoMCL v2.0.9 (ref. 42). Homologous relationships among sequences were determined using BLASTp with an e-value cutoff of 10−5 and an alignment length cutoff of 50% of the gene length followed by clustering by MCL. Only gene families found in single copies in all species (2,795) were used for phylogenetic inference (see below).

One-to-one orthologous relationships among genes of attine ants and the two closest, sequenced outgroups (S. invicta and P. barbatus) were determined based on pairwise reciprocal best BLASTP (e-value<10−5) hits. Groups of orthologous genes were combined based on pairwise orthologous relationships, resulting in 7,443 one-to-one ant orthologue groups. Orthologue groups for the cultivars were similarly determined using S. commune and A. bisporus as outgroups. This resulted in 1,075 one-to-one orthologue groups, which were used to build the fungal phylogeny (see below).

Codon-based alignments of groups of one-to-one orthologous ant genes were generated with PRANK v.120716 (ref. 43) and low-scoring sites masked with Guidance v1.2 (ref. 44). Changes in dN/dS ratios were modelled with PAML45 version 4.7, using models with from two to four distinct dN/dS ratios. Model likelihoods were compared with log-ratio tests and false discovery rate (FDR) correction to assess significance. Alignments that showed significant increases in dN/dS ratio were then used for GO analysis using BinGO46 v.2.44, using the Hypergeometric test with an FDR-corrected P-value cutoff of 0.05 and the GO annotations of the A. cephalotes proteins.

Significantly expanded ant gene families were determined by using badirate47 to identify ‘outlier’ gene families. Gene models and family assignments for these candidate outlier families were manually checked, resulting in the identification of two significantly expanded gene families: Nardilysin and Tom70, both of which were expanded in all attine ants. Subcellular localizations of potentially full-length A. echinatior and A. colombica Nardilysin proteins were inferred using WoLF PSORT48.

Overall trends in gene family expansions and contractions were assessed by counting the number of consistently expanded or contracted gene families at ancestral nodes based on gene family sizes at the terminal nodes, including novel or lost genes. Fifth and 95th sampling percentiles were calculated by permuting the data.

Phylogenies

Protein sequences of 2,795 (ants) or 1,075 (fungi) single-copy gene families were aligned using MUSCLE49 with default parameters, converted into coding sequence (CDS) alignments and concatenated in Geneious v7.0 (ref. 50), resulting in a data matrix consisting of 1,886,151 amino acid sites and 13 taxa (ants) or 825,686 amino acid sites and 8 taxa (fungi). The concatenated matrix was analysed under the parsimony criterion in PAUP* v.4.0a140 (ref. 51) using a heuristic search and 100 random-taxon-addition replicates for the ants and an exhaustive search for the fungi, in each case resulting in a single optimal tree.

Using this maximum-parsimony tree as a reference tree and the 2,795 (1,075) loci as the maximum number of possible partitions, a partitioning analysis was conducted in PartitionFinder v.1.1.1 (ref. 52) in which all possible protein models were considered and compared (models=all protein) under the Bayesian Information Criterion using the hcluster search algorithm, resulting in a scheme consisting of 132 (19) partitions. These partitions and models were employed in a maximum-likelihood analysis in RAxML 7.7.7 (ref. 53), resulting in a best tree with topology identical to the maximum-parsimony topology. The partitions and models were also employed in maximum-likelihood bootstrap analyses in RAxML consisting of 1,152 pseudoreplicates under the ‘−b’ (thorough search) bootstrap option, resulting once again in the same topology with bootstrap frequencies of 1.0 at all nodes.

We inferred divergence dates for the maximum-likelihood tree using the penalized likelihood approach implemented in r8s v.1.7 (ref. 54). For the ant dating analysis, the bee outgroup A. mellifera was excluded and two nodes in our tree were calibrated with fixed ages based on the results from a large-scale diversification analysis of the ant subfamily Myrmicinae that employed a total of 27 fossil calibrations across 251 species11. The two calibrated nodes in our tree correspond to (a) the most recent common ancestor (MRCA) of C. costatus and its sister group and (b) the MRCA of P. barbatus and its sister group. Three separate analyses were conducted, using the mean, 5% minimum credibility interval and 95% maximum credibility interval from Ward et al.11, respectively, to calibrate node a (26.6 (19.6, 33.8) MYA) and node b (95.4 (85.2,106.0) MYA). For the fungal dating analysis, the most distant outgroup taxon S. commune was used to root the tree, providing estimates for branch lengths descended from this root node and subsequently excluded from the analyses. We applied a fixed age calibration to the node corresponding to the MRCA of the outgroup Agaricus and its sister group using the results from a previous study55, a procedure similar to another diversification date analysis of lepiotaceous attine cultivars21. We conducted three separate analyses using different fixed ages for this node. These fixed ages were obtained from previous age estimates for this node from Geml et al.55 Thus, we conducted analyses using the mean age (73 MYA) and derived confidence ranges using the 5% minimum age (55 MYA) and the 95% maximum age (91 MYA) calibrations.

Ant genome synteny and arginine biosynthesis pathway loss

Pairwise genome synteny was determined among attine ants, among 5 other sequenced ants, among 12 fruit flies, 8 primates, 22 birds and 16 mosquito genomes. Pairwise orthologous genes were identified based on reciprocal best BLASTp hits as described above. Syntenic blocks were then defined as containing at least five contiguous orthologous genes and were extended across gaps of no more than 4 genes. No more than 5 gene inversions in total were allowed in any pairwise syntenic block.

The loss of synteny between species pairs was assumed to follow an exponential decay process and rates of synteny loss were calculated accordingly as 1−p s 1/T, where T is divergence time (in millions of years) and p s the estimated proportion synteny between two species. Overall differences between taxonomic groups in their rates of pairwise synteny loss were tested using a Kruskal–Wallis non-parametric test and pairs of groups were compared using a Steel–Dwass pairwise post-hoc test. Calculations were performed in JMP version 11.2.0.

Loss of synteny along the branches of the ant phylogeny was estimated by using the FITCH package in the PHYLIP suite of programs v. 3.695 (ref. 56), which reconstructs phylogenies based on distance matrices that are assumed to be additive, but does not make assumptions about an evolutionary clock. The input file was the loss of synteny between pairs of ant species, which was treated as a distance matrix and mapped onto the ant phylogeny by using the ‘U’ option to specify a user-defined tree with branch lengths derived from the dated phylogeny based on genome sequences.

Gene loss in the ant arginine biosynthesis pathway was assessed by mapping the intact argininosuccinate lyase and argininosuccinate synthase CDS sequences from S. invicta and P. barbatus to the attine genome assemblies using BLAT (v.35 × 1)57. Only matches to argininosuccinate synthase were found and Genewise (v2.2.0)58 was used to predict gene structures in the surrounding regions based on the peptide references of S. invicta and P. barbatus. Gene synteny of the flanking regions were found to be intact, whereas putative argininosuccinate synthase genes were pseudogenized by frame shifts and pre-stop codons in all cases.

Fungal CAZy and Interpro analyses

Protein sequences of attine cultivars and three outgroup fungi (C. cinerea v1.0, A. bisporus v2.0 and S. commune v2.0) were matched against the CAZy database (v2013)59 using BLASTp, requiring full-length alignment of the query with an e-value<10−6 and identity >50%. These matches were then subjected to BLAST against a library of individual CAZy module sequences and HMMer searches60 using specific models for each CAZy module family, requiring both methods to yield the same family assignment. For the cultivar of C. costatus, CAZy counts were obtained separately for genome-wide and transcriptome-only-based annotation sets. Previously published classifications61,62 were used to categorize CAZy families according to substrate. Clustering of species based on euclidian distances between normalized counts for each CAZy family was done using the R-package pvclust version 1.3–2. Statistical significance of count differences were assessed using binomial probabilities assuming equal count distributions.

Protein Interpro (IPR)38 losses were initially assessed based on the annotations as described above and were verified using HMMER60 searches with the potentially lost IPR domain profiles against six-frame translations of all transcriptomes, as well as the genomic assemblies of the C. costatus, A. echinatior and A. cephalotes25 cultivars, with an e-value cutoff of 10−2 and requiring the length of the match to be >30% of the domain length. For the ligninase domain, we assessed the synteny of surrounding genes using manual BLAST searches against the A. bisporus (H97 v2.0) and L. gongylophorus (Ac12 v1.0) genome sequences.

Positive selection

Positive selection was assessed using PAML (v4.6)45 branch-site models on the orthologue group alignments, using three different starting values for kappa and omega. We required an FDR-corrected P-value<0.05 from the LRT test and at least one site with a Bayes Empirical Bayes probability>0.95, and manually checked alignment quality around inferred positively selected sites. Signal peptides, protein domains and catalytic sites of positively selected proteins were analysed using PROSITE63 v. 20.114 and SMART64. Myrmicine ant orthologues of the attine chitinase and β-hexosaminidase sequences were identified using NCBI BLASTp. Protein average residue weights and isoelectric points were calculated using the pepstats programme from the EMBOSS package65, version 6.5.7. Significance tests were performed using phylogenetic ANOVA as implemented in the R-package phytools66 version 0.4–45. Protein structure modelling was done using SwissModel67 in both automated and alignment mode, and using several different templates for each. Although none of the models produced high scores, the overall folding remained consistent and poorly scoring regions were primarily confined to non-conserved loop regions that did not contain any of the positively selected sites.

Expression validation

Large A. echinatior workers from four different colonies were submerged in liquid nitrogen and divided into head (prosoma), mesosoma (thorax and propodeum) and metasoma (gaster and petiole). Five animals were pooled per sample. Labial glands and remaining mesosoma were dissected from large workers and immediately cooled on dry ice, using pooled samples of 20 ants each. Total RNA was extracted using the QIAGEN RNeasy Mini Kit with slight modifications. RNA concentration, integrity and purity were determined using a Nanodrop spectrophotometer (Thermo Scientific) and an Experion automated electrophoresis system (Bio-Rad). Total RNA was reverse transcribed into cDNA using the iScript cDNA Synthesis Kit (Bio-Rad), after which the cDNA was diluted with water to a final concentration corresponding to 5 ng μl−1 of total RNA.

Gene expression levels were determined with a QX200 ddPCR system (Bio-Rad) using TaqMan probes. The two genes encoding Ribosomal Protein L18 (RPL18) and TATA-binding protein, with the Genbank accession numbers XM_011064584 and XM_011062766, respectively, were used as housekeeping genes, to normalize the expression levels across samples. Primers and probes were designed using the Primer3Plus68 and PCR efficiency Calculator69 web interfaces, and are shown in Supplementary Table 29. PCR reactions were run on a Bio-Rad S1000 Thermal Cycler using the ddPCR Supermix for Probes (Bio-Rad), ∼1 μl of template per reaction, and a final concentration of primers and probes of 0.9 and 0.25 μM, respectively. Each reaction contained primers and probes for one target gene and one housekeeping gene, so the different fluorophores of the probes allowed discrimination between the PCR products. Following PCR, the samples were transferred to the ddPCR droplet reader, to measure the number of positive and negative droplets.

Absolute transcript concentrations for each gene were obtained using the QuantaLife software and normalized through division by the geometric mean of the housekeeping gene transcript concentrations of the same samples. A pseudocount of 0.08 (corresponding to 1 positive droplet in a reaction) was added to all values before taking the base 10 logarithm to stabilize the variances. Differences in mean expression levels of each of the two target genes among the different tissues were investigated using a one-way ANOVA test followed by a post-hoc Tukey’s honest significant difference test, using a significance level of 0.05 (n=4).

Data availability

All sequencing data described in this study have been deposited in the relevant National Center for Biotechnology Information (NCBI) databases and accession codes are provided in Supplementary Table 2.