Spider sequence data representing all major lineages were collected from previously published transcriptomic and genomic resources (N = 53) and supplemented with newly sequenced transcriptomes (N = 21) to form the target taxon set for the current study. Existing sequence data were acquired from the NCBI SRA database (http://www.ncbi.nlm.nih.gov/sra). Raw transcriptome sequences were downloaded, converted to FASTQ format, and assembled using Trinity (Grabherr et al., 2011). Genomic data sets in the form of predicted proteins were downloaded directly from the literature (Sanggaard et al., 2014) for downstream use in our pipeline. Newly sequenced spiders were collected from a variety of sources; total RNA was extracted using the TRIzol method, purified with the RNeasy Mini Kit (Qiagen), and sequenced in-house at the Auburn University Core Genetics and Sequencing Laboratory on an Illumina HiSeq 2500. This produced 100 bp paired-end reads for each newly sequenced spider transcriptome, which were then assembled using Trinity. Proteins were predicted from each transcriptome using TransDecoder (Haas et al., 2013).

Core ortholog approach and data processing

We employed a core ortholog approach for putative ortholog selection and implicitly compared the effect of using a common arthropod core ortholog set versus one compiled specifically for spiders; the arthropod core ortholog set was deployed as described in Bond et al. (2014). To generate the spider core ortholog set, we used an all-versus-all BLASTP search (Altschul et al., 1990) with an e-value cut-off of 1e-5 to compare the transcripts of the amblypygid Damon variegatus and the spiders Acanthoscurria geniculata, Dolomedes triton, Ero leonina, Hypochilus pococki, Leucauge venusta, Liphistius malayanus, Megahexura fulva, Neoscona arabesca, Stegodyphus mimosarum, and Uloborus sp. Acanthoscurria geniculata and Stegodyphus mimosarum were represented by predicted transcripts from completely sequenced genomes, while the other taxa were represented by our new Illumina transcriptomes. Next, based on the BLASTP results, Markov clustering was conducted using OrthoMCL 2.0 (Li, Stoeckert & Roos, 2003) with an inflation parameter of 2.1.
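Before clustering, the all-versus-all hits must be screened at the stated e-value cut-off. A minimal sketch of that filtering step is below, assuming BLASTP tabular output (`-outfmt 6`, where the e-value is column 11); the sequence identifiers and hit lines are invented for illustration.

```python
# Sketch: filter all-vs-all BLASTP tabular hits (-outfmt 6) at the 1e-5
# e-value cut-off before handing pairs to a clustering step such as OrthoMCL.
# Hit lines and identifiers below are hypothetical examples.
EVALUE_CUTOFF = 1e-5

def filter_hits(lines, cutoff=EVALUE_CUTOFF):
    """Yield (query, subject) pairs whose e-value passes the cut-off."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue = float(fields[10])  # column 11 of -outfmt 6 is the e-value
        if query != subject and evalue <= cutoff:  # drop self-hits
            yield query, subject

hits = [
    "Dtriton_001\tAgenic_042\t87.2\t310\t.\t.\t.\t.\t.\t.\t2e-80\t250",
    "Dtriton_001\tDtriton_001\t100.0\t310\t.\t.\t.\t.\t.\t.\t0.0\t600",
    "Eleonina_007\tUloborus_3\t35.1\t90\t.\t.\t.\t.\t.\t.\t0.5\t40",
]
print(list(filter_hits(hits)))  # → [('Dtriton_001', 'Agenic_042')]
```

Only the first hit survives: the second is a self-hit and the third fails the e-value cut-off.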

The resulting putatively orthologous groups (OGs) were processed with a modified version of the bioinformatics pipeline employed by Kocot et al. (2011). First, sequences shorter than 100 amino acids in length were discarded. Next, each candidate OG was aligned with MAFFT (Katoh, 2005) using the automatic alignment strategy with a maxiterate value of 1,000. To screen OGs for evidence of paralogy, an "approximately maximum likelihood tree" was inferred for each remaining alignment using FastTree 2 (Price, Dehal & Arkin, 2010). Briefly, this program constructs an initial neighbor-joining tree and improves it using minimum evolution with nearest neighbor interchange (NNI) subtree rearrangement. FastTree subsequently uses minimum evolution with subtree pruning and regrafting (SPR) and maximum likelihood with NNI to further improve the tree. We used the "slow" and "gamma" options; "slow" specifies a more exhaustive NNI search, while "gamma" reports the likelihood under a discrete gamma approximation with 20 categories after the final round of branch-length optimization. PhyloTreePruner (Kocot, Citarella & Halanych, 2013) was then employed as a tree-based approach to screen each candidate OG for evidence of paralogy. First, nodes with support values below 0.95 were collapsed into polytomies. Next, the maximally inclusive subtree was selected in which every taxon was represented by no more than one sequence or, in cases where more than one sequence was present for a taxon, all sequences from that taxon formed a monophyletic group or were part of the same polytomy. Putative paralogs (sequences falling outside this maximally inclusive subtree) were then deleted from the input alignment. In cases where multiple sequences from the same taxon formed a clade or were part of the same polytomy, all sequences but the longest were deleted.
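The final deduplication rule above (keep only the longest sequence when several sequences from one taxon form a clade or share a polytomy) can be sketched as follows. This is a simplified illustration that assumes clade membership has already been established by the tree step; the record structure and identifiers are hypothetical, not PhyloTreePruner's actual internals.

```python
# Sketch of the "keep the longest sequence per taxon" rule applied after
# paralog pruning. Assumes all sequences passed in already form a clade or
# polytomy per taxon (the tree-based step is not reproduced here).
def keep_longest_per_taxon(alignment):
    """alignment: dict mapping sequence id -> (taxon, aligned_sequence)."""
    best = {}
    for seq_id, (taxon, seq) in alignment.items():
        length = len(seq.replace("-", ""))  # compare ungapped lengths
        if taxon not in best or length > best[taxon][1]:
            best[taxon] = (seq_id, length)
    keep = {info[0] for info in best.values()}
    return {sid: rec for sid, rec in alignment.items() if sid in keep}

aln = {
    "Hpococki_a": ("Hypochilus_pococki", "MKV--LLT"),
    "Hpococki_b": ("Hypochilus_pococki", "MKVASLLTQ"),
    "Dtriton_1":  ("Dolomedes_triton",  "MKVASLIT"),
}
print(sorted(keep_longest_per_taxon(aln)))  # → ['Dtriton_1', 'Hpococki_b']
```

The shorter of the two Hypochilus pococki sequences is dropped; the single Dolomedes triton sequence is kept unchanged.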
Lastly, in order to eliminate OGs with poor taxon sampling, all groups sampled for fewer than 7 of the 11 taxa, as well as all groups not sampled for Megahexura fulva (the taxon with the greatest number of identified OGs), were discarded. The remaining alignments were used to build profile hidden Markov models (pHMMs) for HaMStR with hmmbuild and hmmcalibrate from the HMMER package (Eddy, 2011).
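The occupancy filter just described amounts to two simple conditions per OG. A minimal sketch, with invented OG contents (the taxon lists below are illustrative, not real OGs from the study):

```python
# Sketch of the occupancy filter: discard OGs sampled for fewer than 7 of
# the 11 core taxa, or missing Megahexura fulva. OG contents are invented.
REQUIRED_TAXON = "Megahexura_fulva"
MIN_TAXA = 7

def passes_occupancy(og_taxa, min_taxa=MIN_TAXA, required=REQUIRED_TAXON):
    """Return True if the OG keeps enough taxa and includes the required one."""
    return len(set(og_taxa)) >= min_taxa and required in og_taxa

og_good = ["Megahexura_fulva", "Dolomedes_triton", "Ero_leonina",
           "Hypochilus_pococki", "Leucauge_venusta", "Liphistius_malayanus",
           "Neoscona_arabesca"]
og_bad = ["Dolomedes_triton", "Ero_leonina", "Hypochilus_pococki",
          "Leucauge_venusta", "Liphistius_malayanus", "Neoscona_arabesca",
          "Uloborus_sp"]  # 7 taxa, but Megahexura fulva is absent

print(passes_occupancy(og_good), passes_occupancy(og_bad))  # → True False
```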

For orthology inference, we employed HaMStR v13.2.3 (Ebersberger, Strauss & Von Haeseler, 2009), which infers orthology based on predefined sets of orthologs. Translated transcripts for all taxa were searched against the new set of 4,934 spider-specific pHMMs (available for download from the Dryad Data Repository) and an arthropod core ortholog set previously employed in Bond et al. (2014). In the spider core ortholog analysis, the genome-derived Acanthoscurria geniculata OGs were used as the reference protein set for reciprocal best hit scoring. Daphnia pulex was used as the reference species for putative ortholog detection in the arthropod core ortholog analysis. Orthologs sharing a core identification number were pooled together for all taxa and processed using a modified version of the pipeline used to generate the custom spider ortholog set. In both analyses, sequences shorter than 75 amino acids were deleted first. OGs sampled for fewer than 10 taxa were then discarded. Redundant identical sequences were removed with the Perl script uniqhaplo.pl (available at http://raven.iab.alaska.edu/~ntakebay/), leaving only unique sequences for each taxon. Next, in cases where one of the first or last 20 characters of an amino acid sequence was an X (corresponding to a codon with an ambiguity, gap, or missing data), all characters between the X and that end of the sequence were deleted and treated as missing data. Each OG was then aligned with MAFFT (mafft --auto --localpair --maxiterate 1000; Katoh, 2005). Alignments were then trimmed with ALISCORE (Misof & Misof, 2009) and ALICUT (Kück, 2009) to remove ambiguously aligned regions. Next, a consensus sequence was inferred for each alignment using the EMBOSS program infoalign (Rice, Longden & Bleasby, 2000). For each sequence in each single-gene amino acid alignment, the percentage of positions differing from the consensus of the alignment was calculated using infoalign's "change" calculation.
Any sequence with a “change” value greater than 75 was deleted. Subsequently, a custom script was used to delete any mistranslated sequence regions of 20 or fewer amino acids in length surrounded by ten or more gaps on either side. This step was important, as sequence ends were occasionally mistranslated or misaligned. Alignment columns with fewer than four non-gap characters were subsequently deleted. At this point, alignments shorter than 75 amino acids in length were discarded. Lastly, we deleted sequences that did not overlap with all other sequences in the alignment by at least 20 amino acids, starting with the shortest sequence not meeting this criterion. This step was necessary for downstream single-gene phylogenetic tree reconstruction. As a final filtering step, OGs sampled for fewer than 10 taxa were discarded.
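The consensus-based "change" filter can be sketched as below: build a plurality consensus per alignment column, then drop any sequence differing from that consensus at more than 75% of its positions. This is a simplified illustration; the gap handling and tie-breaking of EMBOSS infoalign may differ, and the alignment shown is invented.

```python
from collections import Counter

# Sketch of an infoalign-style "change" filter: plurality consensus per
# column, then discard sequences whose percent difference from the
# consensus exceeds 75. Real infoalign behavior may differ in details.
def consensus(seqs):
    """Plurality character for each alignment column."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def change_filter(seqs, max_change=75.0):
    cons = consensus(seqs)
    kept = []
    for seq in seqs:
        diff = sum(a != b for a, b in zip(seq, cons))
        change = 100.0 * diff / len(seq)  # percent positions changed
        if change <= max_change:
            kept.append(seq)
    return kept

aln = ["MKVLT", "MKVLT", "MKALT", "QWYPH"]
print(change_filter(aln))  # → ['MKVLT', 'MKVLT', 'MKALT']
```

The outlier sequence, which disagrees with the consensus at every position (100% change), is removed; the others fall well under the 75% threshold.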

In some cases, a taxon was represented in an OG by two or more sequences (splice variants, lineage-specific gene duplications (= inparalogs), overlooked paralogs, or exogenous contamination). In order to select the best sequence for each taxon and exclude any overlooked paralogs or exogenous contamination, we built trees in FastTree 2 (Price, Dehal & Arkin, 2010) and used PhyloTreePruner to select the best sequence for each taxon as described above. Remaining OGs were then concatenated using FASconCAT (Kück & Meusemann, 2010). The OGs selected by our bioinformatic pipeline were further screened in seven different ways (subsets listed in Table 2). OGs were first sorted based on amount of missing data; the half with the lowest levels was pulled out as matrix 2 (1,699 genes). From matrix 2, a smaller subset of OGs optimized for gene occupancy was extracted, resulting in matrix 3 (850 genes). The full supermatrix (matrix 1) was also optimized using the programs MARE (Meyer, Meusemann & Misof, 2011) and BaCoCa (Base Composition Calculator; Kück & Struck, 2014). MARE assesses the supermatrix by partition, providing a measure of tree-likeness for each gene and optimizes the supermatrix for information content. The full supermatrix was optimized with an alpha value of 5, to produce matrix 7 (1,488 genes, 58 taxa). From the MARE-reduced matrix, genes having no missing partitions for any of the remaining taxa (n = 50) were extracted to form a starting matrix for the BEAST analyses (details below). Matrix assessment was also conducted using BaCoCa, which provides a number of descriptive supermatrix statistics for evaluating bias in amino acid composition and patterns in missing data. This program was used to assess patterns of non-random clusters of sequences in the data, which could potentially mislead phylogenetic analyses. 
Matrix 4 represents a 50% reduction of the full supermatrix using BaCoCa-derived values for phylogenetically informative sites as a guide, essentially reducing missing data from absent partitions and gaps. This matrix is similar, but not identical, to matrix 2. Matrix 5 resulted from application of the arthropod core OGs from Bond et al. (2014) to the extended taxon set. Matrix 6 represents the full spider core OG matrix (matrix 1) with Stegodyphus pruned from the tree. OGs for each matrix were concatenated using FASconCAT (Kück & Meusemann, 2010).
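The construction of matrix 2 (ranking genes by missing data and retaining the better-sampled half) can be sketched as follows; the gene names and missing-data fractions are invented for illustration.

```python
# Sketch of the matrix 2 subset: rank genes by their fraction of missing
# data and retain the better-sampled half. Gene names and missing-data
# fractions below are hypothetical.
def best_half_by_missing(genes):
    """genes: dict mapping gene name -> fraction of missing data (0..1)."""
    ranked = sorted(genes, key=lambda g: genes[g])  # least missing first
    return ranked[: len(ranked) // 2]

genes = {"OG1": 0.10, "OG2": 0.55, "OG3": 0.05, "OG4": 0.40}
print(best_half_by_missing(genes))  # → ['OG3', 'OG1']
```

Applied to the full set of OGs, the same ranking would yield the 1,699-gene subset described above.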