Taxa sampled

We sampled 25 taxa: 23 pseudoxyrhophiines across 21 species, in 17 different genera and included two psammophiines as outgroups, Mimophis mahfalensis, and Rhamphiophis rubropunctatus. With the exception of the outgroup Rhamphiophis rubropunctatus and the pseudoxyrhophiine Parastenophis betsileanus, the same individuals were used for the Sanger and Anchored NGS methods (Additional file 1).

Ethics statement

All sample collection complied with the policies and guidelines of relevant institutions, including the Ministries des Eaux et Forêts, Madagascar National Parks, the Université d’Antananarivo, Departement de Biologie Animale, the University of Michigan Museum of Zoology, and the American Museum of Natural History.

DNA extraction and Sanger sequencing

DNA was extracted using a Quiagen® DNEasy kit, following the tissue protocol. Sanger sequencing was used to sequence five loci that have been previously utilized to infer squamate phylogenies [30, 43, 45]. This dataset consisted of two protein-coding mitochondrial genes (COI, 623 bp; CytB, 550 bp), two nuclear protein-coding genes (Cmos 562 bp; Rag2, 645 bp), and one nuclear intron (Nav intron 5, 610 bp) (PCR and sequencing details in Additional file 2). Sanger sequencing was performed at the American Museum of Natural History on an ABI 3730 sequencer. Sequences were edited and aligned using the Geneious alignment algorithm in Geneious® v.6.1.4 and checked by eye to ensure that protein coding loci did not contain stop codons.

Anchored phylogenomics locus selection and probe design

We used the following approach to identify a set of loci that could be obtained efficiently in snakes using Anchored Hybrid Enrichment. First, in order to improve gene-tree resolution, we increased the size of each target region described in [1] to approximately 1350 bp by including flanking regions that contained sufficient sequence conservation between Homo and Anolis. Because some of the original anchor regions were near each other, some of these loci were joined to form a single locus. Loci that performed poorly in [1] were removed. The resulting locus set contained 394 loci comprising a total of 468,296 bp target region (referred to as the version 2 vertebrate loci; genomic coordinates available on Dryad doi:10.5061/dryad.kp400). In order to improve the capture efficiency for snakes, we also obtained homologous sequences from the Python molurus genome (NCBI accession AEQU000000000) and 15× genomic reads obtained for the brown reed snake Calamaria pavimentata (Illumina PE100bp; specimen obtained from California Academy of Sciences, accession CAS235364). After aligning the Anolis, Python, and Calamaria sequences using MAFFT [46], alignments were trimmed to produce the final probe region alignments (alignments available on Dryad doi:10.5061/dryad.kp400), and probes were tiled at approximately 1.5X tiling density per species (probe specification available on Dryad doi:10.5061/dryad.kp400).

The NGS dataset was generated by the Center for Anchored Phylogenetics [47] using the anchored hybrid enrichment methodology described by [1]. We used a Covaris E220 Focused-ultrasonicator to fragment each genomic DNA sample to a fragment size of ~150–350. A Beckman-Coulter Biomek FXp liquid-handling robot was then used to prepare indexed Illumina libraries following protocol modified from Meyer and Kircher [48] (with SPRIselect size-selection after blunt-end repair using a 0.9x ratio of bead to sample volume). A single pool containing all of the libraries was then enrichment for the target using an Agilent Custom SureSelect kit (Agilent Technologies) that contained the probes described above. The enriched library pool was then sequenced on 1 PE150 Illumina HiSeq2000 lane by the Translational Science Laboratory in the College of Medicine at Florida State University.

NGS data assembly

Paired reads were merged following [49]. Briefly, for each degree of overlap for each read we computed the probability of obtaining the observed number of matches by chance, and selected degree of overlap that produced the lowest probability, with a p-value less than 10−10 required to merge reads. When reads were merged, mismatches are reconciled using base-specific quality scores, which were combined to form the new quality scores for the merged read (see [49] for details). Reads failing to meet the probability criterion were kept separate but still used in the assembly. Between 50 and 75 % of the sequenced library fragments had an insert size between 150 bp and 300 bp.

The reads for each sample were assembled into contigs using a reference assembly approach to map reads to the Calamaria probe regions and a de-novo assembly approach to extend the assembly into the flanks (java scripts available upon request from A. Lemmon). The reference assembler uses a library of spaced 20-mers derived from the sites conserved across squamates. A preliminary match resulted if at least 17 of 20 matches existed between the positions in a read and the corresponding positions in one of the spaced. Reads obtaining a preliminary match were then compared to the Calamaria, with greater than 55 matches out of 100 was considered a significant match. Approximate alignment positions of mapped reads were estimated using the position of the spaced 20-mers, and all 60-mers existing in the read were stored in a hash table used by the de-novo assembler. The de-novo assembler identifies exact matches between a read and one of the 60-mers found in the hash table. Simultaneously using the two levels of assembly described above, the read files were traversed repeatedly until an entire pass through the reads produced no additional mapped reads.

Mapped reads were then clustered for each locus into clusters using 60-mer pairs observed in the reads mapped to that locus. In short, a list of all 60mers found in the mapped reads was compiled, the 60-mers were clustered if found together in at least two reads. The 60-mer clusters were then used to separate the reads into clusters for contig estimation. Relative alignment positions of reads within each cluster were then refined in order to increase the agreement across the reads. Up to one gap was also inserted per read if needed to improve the alignment. Note that given sufficient coverage and an absence of contamination, each single-copy locus should produce a single assembly cluster. Low coverage (leading to a break in the assembly), contamination, and gene duplication, can all lead to an increased number of assembly clusters. A whole genome duplication, for example would increase the number of clusters to two per locus.

Consensus bases were called from assembly clusters as follows. For each site an unambiguous base was called if the bases present were identical or if the polymorphism of that site could be explained as sequencing error, assuming a binomial probability model with the probability of error equal to 0.1 and alpha equal to 0.05. If the polymorphism could not be explained as sequencing error, the ambiguous base was called what corresponded to all of the observed bases at that site (e.g. ‘R’ was used if ‘A’ and ‘G’ were observed). Called bases were soft-masked (made lowercase) for sites with coverage lower than 5. A summary of the assembly results is presented in the additional files (Additional file 3).

In order to filter out possible low-level contaminants, consensus sequences derived from very low coverage assembly clusters (<10 reads) were removed from further analysis. After filtering, consensus sequences were grouped by locus (across individuals) in order to produce sets of homologs. Orthology was then determined for each locus using a distance-based approach. First, a pairwise distance measure was computed for pairs of homologs by computing the percent of 20-mers observed in the two sequences that were found in both sequences. The list of 20-mers was constructed from consecutive 20-mers as well as spaced 20-mers (every third base) in order to allow increased levels of sequence divergence. Finally, we clustered the sequences using the distance matrix and a Neighbor-Joining algorithm that allowed at most one sequence per species to be assigned to a cluster. Only clusters containing at least 50 % of the species were utilized downstream.

Following orthology assessment, sequences in each orthologous set were aligned using MAFFT v7.023b [46]. The flags --genafpair and --maxiterate 1000 were utilized. The alignments for each locus was masked/trimmed in three steps. First, each alignment site was identified as “good” if the most common character observed was present in >50 % of the sequences. Second, 20 bp regions of each sequence that contained <10 good sites were masked. Third, sites with fewer than 4 unmasked bases were removed from the alignment.

Phylogenetic analyses

We estimated species trees using the summary statistics methods STAR [21] and MP-EST [20]. The STAR method estimates the species tree from a distance matrix constructed from the average ranks of gene-coalescence events from gene trees, while MP-EST estimates the species tree from gene trees by maximizing a pseudo-likelihood function of the triplets (rooted 3-taxon statements) of the species tree. Both of these methods require user-provided gene trees for all loci and allow for some missing data. We generated maximum-likelihood gene trees for all loci (NGS and Sanger-sequenced loci) using RAxML v.7.0.4 [50] under the GTRGAMMA model. For each locus, we performed 100 bootstrap replicates, which permits error estimates to be generated on the species trees; this entire process was streamlined using a Perl-scripted pipeline developed by FTB and A. Narechania [51]. We used the Species Tree Analysis Web Server [52] to run all STAR and MP-EST analyses. The STAR and MP-EST programs were run using several different subsets of loci. For the Sanger dataset, we used the complete dataset of 5 loci and subsets of 4 loci (removing CytB), 4 loci (removing COI), and 3 (nucDNA only) loci. For the NGS dataset, we randomly subsampled the entire 377-locus dataset into 200, 100, 50, 25, 10, 5, 4, and 3-locus datasets; these subsets were run five times using different sets of random loci to determine, on average, how discordant the resulting trees were compared to the 377-locus species trees. The topological comparisons between all trees were performed using Robinson-Foulds (RF) distances in the R statistics platform [53] using the package phangorn [54]. The RF distance is a metric that determines the number of bipartitions that differ between trees to indicate the amount of topological discordance between two trees [55]. To show relative differences between the trees using the RF distances, we used the percentage of the maximum RF distance, which was calculated using 2(n - 2), where n is the number of taxa, and n - 2 represents the maximum number of inner branches for a rooted tree [56]. The % RF distance between trees is the ratio of the RF distance divided by the maximum RF distance. We also used the coefficient of variation for each locus-set to determine in the MP-EST and STAR analyses at what point increasing the number of loci reached diminishing returns, where additional loci no longer resulted in an improvement to the species tree; this is indicated by the lowest coefficient of variation value for each locus set. We also compared the NGS and Sanger species trees using the same numbers of loci (5, 4, and 3) to examine how similar the resulting trees are topologically when using the same numbers of loci, again using RF distances.

We next used RAxML to generate concatenated trees using the complete 377-locus NGS dataset and the complete 5-locus Sanger dataset; the Sanger dataset was partitioned by locus and codon position for the protein coding genes (the mtDNA genes were considered a single partition).. These results were compared with the explicit species-tree approaches using % RF distances. To determine how locus number affects bootstrap-support values for the species trees, a mean bootstrap value was taken across the entire tree for each of the NGS subset species trees (200, 100, 50, 25, 10, 5, 4, and 3-locus datasets) and then averaged across the five replicates for each of those subset species trees.

Although the complete 377-locus NGS dataset is too large for the full-coalescent species-tree model implemented in *BEAST [14], we ran an additional series of species trees with subsets of 15 loci to determine if this method could estimate the same topology as the full 377-locus dataset using MP-EST and STAR. A recent study of chickadee phylogenetics [7] found that among several coalescent-based species-tree inference methods, *BEAST was the most robust and consistently converged on the same tree estimated from the authors’ full dataset (40 loci) when using 15 loci. We took five random subsets of 15 NGS loci and estimated five species trees using *BEAST [18] in BEAST v1.8 [57]. In addition, we also ran *BEAST on the 5-locus Sanger dataset to determine how well it performed using a dataset similar to those typically used to infer squamate phylogenies. We used jModelTest v2.1.6 [58] with the Bayesian Information Criteria to choose the most appropriate model of sequence evolution for each locus in the *BEAST analyses, as *BEAST allows for multiple models (details available on Dryad doi:10.5061/dryad.kp400). As in our STAR and MP-EST analyses Mimophis mahfalensis and Ramphiophis rubropuntatus were used as outgroups. The *BEAST analyses were run for 2 × 108 generations each using a log-normal relaxed clock model, a Yule-process speciation prior, and were sampled every 10000 generations. Tracer v1.4 [59] was used to assess stationarity for each of the runs and determine burnin. A summary table of the various methods used with the different datasets and loci is included as an additional file (Additional file 4).

To visualize congruence between resulting gene trees and the species trees, we used the program MetaTree [60]. MetaTree builds a “tree-of-trees” that shows the relationships between alternative phylogenies. This program takes user-inputted phylogenies with fixed sets of taxa to construct a tree that clusters similar topologies together, allowing the user to examine a set of trees and determine how similar topologies are to one another; here we use this program to examine whether gene trees from both the Sanger and NGS datasets are similar to the species tree topology. For our MetaTree analyses, we compared the gene trees from the Sanger dataset to the 377-locus species tree. Since Micropisthodon and Rhamphiophis were not available for all of the Sanger loci (Appendix 1), they were not included and pruned from the species tree for that comparison. For the NGS MetaTree analysis, we used subsets of 50 gene trees (enhancing visual clarity of the results) from the NGS dataset (using the 344 loci with all 25 taxa/individuals) to compare to the 377-locus species tree.