Taxon sampling and library preparation

DNA extraction of 52 squamate samples (total = 45 species) was performed using a phenol–chloroform–isoamyl alcohol (PCI) extraction protocol. Random shotgun genome libraries were prepared by fragmenting DNA samples to an average length of 300–600 bp using a M220 Covaris Ultrasonicator. The NEBNext Illumina DNA Library Prep Kit (New England Biolabs) was used following the manufacturer’s protocol to perform fragment-end repair, poly-A tailing, adapter ligation, and library amplification. After library preparation, fragments were size-selected using a BluePippin (Sage Science) for a length of 350–450 bp. Pooled multiplexed libraries were sequenced on an Illumina MiSeq with 300 bp paired-end reads. Paired reads were merged based on sequence overlap and were adapter and quality trimmed using CLC genomics workbench 9.0.1 64. Roche 454 shotgun sequencing data of nine snake species from previous studies12,13 and draft genome assemblies of 12 additional squamate species (Supplementary Data 1) were also included. Our final sampling included a total of 66 different squamate species. For each species, mitochondrial reads were filtered out in CLC genomics workbench 9.0.1 using the complete mitochondrial genome of the most closely related species available on GenBank35. Reads that mapped to the reference were used to assemble species-specific mitochondrial genomes. Reads that did not map to the reference (i.e., nuclear reads) were used for downstream repeat element annotation and analyses.

SSR identification and analysis

We used Pal_finder v.0.02.0365 (Palfinder hereafter) to identify microsatellites. Default Parfinder parameters were used to identify perfect dinucleotide (2mer), trinucleotide (3mer), and tetranucleotide (4mer) that were tandemly repeated for a total length of at least 12 bp. Perfect pentanucleotide (5mer) and hexanucleotide (6mer) tandemly repeated motifs were annotated only if longer than 15 bp. Loci/Mbp and bp/Mbp frequencies were calculated for all microsatellite motifs, length classes (2–6mers), and total content, and summarized per genome and major taxonomic group. Tests for multiple evolutionary rates of microsatellite abundance across lineages, ancestral state reconstruction of genomic microsatellite frequencies, and quantification of microsatellite landscape differentiation among species were performed using the R packages Phytools v.0.4–6066 and APE v.3.367. For the multiple evolutionary rate analysis of microsatellite (and TE) abundance, we conducted censored rate tests using Phytools with 1000 simulations (to compute p values) on 100 randomly sampled posterior trees using the restricted maximum likelihood technique to obtain unbiased estimates of the evolutionary rate parameter (σ)28. We used the time-calibrated phylogeny and the pic function in R (provided by the APE package) to compute phylogenetic independent contrasts for tests of clade-specific differences in genomic microsatellite content. We performed the nonparametric Kruskal–Wallis H test in R after the data rejected normality (Shapiro–Wilks test; p values <0.05 before and after log transformation) and homogeneity of variances (Bartlett’s test; p values <0.05 before and after log transformation). Between lineages variation was tested using a post hoc Dunn test for multiple comparisons using the Benjamini–Hochberg correction method in R (Supplementary Data 6).

TE identification and analysis

Squamate genomic repeat elements were annotated according to homology-based and de novo identification approaches. Because repeat element annotation can be highly dependent on the repeat library used, we built large multi-species (clade-specific) repeat libraries that we used to annotate repeats for all members of a clade. To build these clade-specific libraries, we first performed de novo repeat element annotation on each species (except where already published) using RepeatModeler v.1.0.968, followed by further repeat classification in CENSOR69. Second, we built clade-specific de novo repeat element libraries, one for all lizard species (33 species de novo reference library) and one for all snake species (de novo TE libraries for 21 species were combined, and merged with the reference library generated by Castoe et al.13). Each clade-specific library was then filtered to avoid redundancy of highly similar elements. We tested whether using a single squamate-specific library for all species would change the inferred relative TE content and overall amount of repeat identified; we found no detectable difference between the results of the two masking protocols (Supplementary Fig. 17), and therefore decided to use the two clade-specific libraries in order to reduce masking time by reducing the overall library size. Additional classification of unknown (unclassified) elements was achieved by comparing these unclassified elements to all elements that were classified using BLAST70. Additionally, we generated squamate-specific BovB and CR1-L3 LINEs reference sequence libraries for all 66 species included (additional information regarding library generation are provided in the following paragraph).

Repeat element analyses were performed in RepeatMasker v.4.0.671 with default parameter settings. To maximize element identification, we used a custom bash script to specify the order of the four libraries used as references for the masking process: (i) BovB-L3 LINEs library, (ii) Tetrapoda RepBase library (version 20.11, 07 August 201572), (iii) classified elements from the clade-specific library for either snakes or lizards, and (iv) unknown elements from the clade-specific library. We used the BovB-L3 LINEs library first to control for limited sampling and low-quality reference sequences of squamate reptile BovB and L3 LINEs in the tetrapoda library. RepeatMasker output files were post-processed using a custom-modified implementation of the ProcessRepeat script included in the RepeatMasker package. Specifically, we modified the output to include additional summary information in the .tab output file for TE subfamilies that are important and/or frequent in squamate reptiles (e.g., CR1-L3, L2, and Rex). Also, because the provided ProcessRepeat script still reflects old and outdated classification schemes of TEs (e.g., Penelope elements are inappropriately classified as LINEs), we made other modifications to the ProcessRepeat script to correct for such errors according to the classification reported by Chalopin et al6.

Comparing sampled and assembled genomes

We tested whether genomic repeat content estimated from unassembled shotgun genomic datasets were similar to estimates derived from fully assembled genomes. We compared RepeatMasker estimates of total TE genomic abundance between assembled genomes and unassembled shotgun genomic datasets for the same species (Python molurus, Boa constrictor, Thamnophis sirtalis, and Deinagkistrodon acutus) or for two closely related species belonging to the same genus (Gekko gecko vs. Gekko japonicus and Ophisaurus attenuatus vs. Ophisaurus gracilis). We also tested for potential biases due to unequal genomic sampling in the shotgun datasets. We extracted at random subsamples of 3, 5, 8, 10, 30, 50, 100, and 250 Mbp from unassembled genomic shotgun datasets of four species (Python molurus, Gekko gecko, Ophisaurus attenuatus, and Pantherophis emoryi), and compared RepeatMasker estimates of total TE genomic abundance for each. Read extraction was performed using the subsample_fasta.py script from the QIIME pipeline73. Finally, we compared RepeatMasker estimates of total TE genomic abundance in relation to the amount of sequence data obtained for all Illumina and 454 genomic shotgun datasets to test for biases related to sequencing technology, and for biases related to the amount of sequence data collected per individual, vs. estimates of total TE genomic abundance.

CR1 and BovB LINEs phylogenetic and evolutionary analyses

Species-specific consensus sequences for both CR1-L3 and BovB LINE retrotransposons were generated in CLC genomic workbench 9.0.1 using default parameters, a linear gap cost, and the global alignment setting. Nuclear reads for each species were mapped to the consensus sequence of the LINE consensus sequence from the most closely related species available, which was used as initial reference (e.g., both CR1-L3 and BovB reference sequences for the Burmese python were generated by Castoe et al.13 and used as reference for building the consensus for the Mexican burrowing python). The first consensus generated was then used as a new reference for further rounds of re-mapping of nuclear reads until no additional mapping reads were recovered. Consensus sequences were determined by simple majority rule consensus, removing regions with coverage <10x after the second mapping iteration, and <20x in the final mapping. Consensus sequences were aligned in ClustalW74 with a gap open penalty of 50, and alignments were manually adjusted prior to downstream analyses (Supplementary Data 10). To the CR1 consensus sequences generated from our 66 squamate species, we added CR1-L3 and CR1-L2 vertebrate consensus sequences available in RepBase, for a total of 155 sequences (Supplementary Data 10). Squamate BovB consensus sequences we generated from our 66 squamates were combined with other metazoan consensus sequences available in RepBase, for a total of 87 sequences (Supplementary Data 9). Bayesian phylogenetic tree reconstruction analyses of squamate CR1 and BovB LINEs were performed in BEAST275. Two independent analyses were run for 200 million generations each, following the Yule model of speciation and a relaxed log-normal clock model; MCMC chains were sampled every 1000 generations. The program Tracer v.1.676 was used to confirm that the MCMC chains had reached convergence. We conservatively discarded the first 25% of collected MCMC generations as burn-in, based on evidence that the likelihood and parameter values reached stationarity after approximately 15% of the sampling process.

CR1 and BovB LINEs coverage and age analyses

For each species, the species-specific CR1-L3 and BovB consensus sequence was used as a reference to estimate read coverage using the BWA mem alignment tool77, and the BEDTools2 (version 2.26.0) coverage tool78. Coverage counts were normalized by the total number of reads aligned to the full-length reference sequence. Read coverage was estimated for: (i) each 10 bp sliding window, (ii) for the first and second half of the reference sequence, and (iii) for each third of the reference.

We used pairwise sequence divergence from the consensus (pairwise π) as a proxy to infer age and relative element level of activity through time. Pairwise distances values for each element and species were estimated following a custom pipeline starting from BWA alignments. An R79 custom script built on the pegas80 and stringr packages was used to calculate pairwise π estimates using multi-fasta pairwise alignments of reads to the reference. Because we expected multiple TE subfamilies to exist, sequence divergence was estimated by excluding sites that define different CR1 and BovB subfamilies. For each species, we calculated the relative nucleotide frequency for each position in the multiple sequence alignment, and then calculated the mode of the frequency distribution (bins of 0.01) of the most frequent nucleotide at each position. Sites for which the most frequent nucleotide was in a bin more than three bins away from the mode were discarded as defining a separate subfamily.

Time-calibrated phylogeny of 66 squamate reptiles

We estimated a time-calibrated phylogeny for the 66 squamate species in our study and an additional eight outgroup vertebrates. We downloaded and parsed 12 mitochondrial-encoded protein-coding genes for each species with a mitochondrial genome sequence available on GenBank. The same genes were parsed from our de novo assembled mitochondrial genomes after genes were annotated for these using MITOS81. We aligned the 12 protein coding genes encoded on the mitochondrial heavy strand using MUSCLE v.3.8.2182 and concatenated the sequences into an alignment that we used for divergence dating (10,479 bp). Prior to divergence dating, we estimated the best-fit partitioning scheme and associated models of nucleotide substitution using Bayesian information criterion and the heuristic search algorithm provided in PartitionFinder v.1.1.183. We provided a starting partitioning scheme that defined 36 partitions (splitting codon positions for each of the 12 genes), and PartitionFinder identified the best-fit partitioning scheme comprising a single partition for each codon position (three total) and a GTR+I+G model for each partition. We estimated divergence times using BEAST v.2.3.484 with a calibrated Yule model of speciation and a log-normal relaxed clock model. We constrained the topology to that provided from previous studies of the squamate phylogeny and diversification85,86; we also constrained divergence times of a total of seven nodes using fossil calibrations also provided in previous studies. Calibration points and associated prior distributions are given in Supplementary Table 1. Two independent MCMC runs were conducted for 100 million generations each, with MCMC chain sampling every 10,000 generations. We assessed convergence to the posterior based on likelihood and parameter stationarity (effective sample size >200 for all parameters) using the program Tracer. We discarded the first 10% of generations as burn-in, based on the likelihood and parameter values exhibiting stationarity before 10% of sampling was completed.

AATAG microsatellite seeding by TE analyses

We performed adjacency analyses of AATAG and ATAG SSR loci on high-quality assembled genomes for seven snake species, and used the green anole lizard as an outgroup. To increase specificity, genomes were first masked only for simple repeats. We extracted coordinates of annotated AATAG and ATAG SSR loci from the .out RepeatMasker output files, and used these coordinates to extract target regions 400 bp upstream and downstream of each microsatellite locus. We then performed a second run of RepeatMasker to mask only TEs located in the extracted target regions that flank AATAG and ATAG loci. Following this strategy, we were able to annotate TEs located in close proximity to SSR loci, and to differentiate TEs that harbor microsatellite-like regions in their reference sequences. The composition of TEs physically associated with SSR loci regions was then compared to the average of five independent randomly generated genomic backgrounds matching in sample size the corresponding microsatellite landscape. For each species, genomic background reads were generated by using the random tool in the BEDTools2 v.2.26.0 package, in which we specified the number of sequences to be extracted and that their length was to match the SSR-adjacent genomic subsample. The generation of random bed files was performed independently five times per species, the TE composition was averaged across these five genomic backgrounds, and then compared to SSR loci adjacent regions. Fisher’s one-tailed exact tests were performed to evaluate the enrichment of TE families in SSR loci regions (at α = 0.01). Finally, to identify the specific element types involved in microsatellite seeding, we estimated genomic and SSR-adjacent conditional probabilities of TE-SSR co-occurrences. We estimated the conditional probability of sampling an AATAG SSR with an adjacent CR1 LINE present within 400 bp, and compared this to the estimated joint probability of sampling an AATAG SSR locus and a CR1 LINE using the genome-wide frequencies. We also calculated the conditional and joint probabilities for Rex LINEs, and compared those to the conditional and joint probabilities of CR1 LINEs, respectively.

Effective population size (N e ) estimation

Whole genomic Illumina paired-end reads for eight squamate reptiles species were first preprocessed for quality using Trimmomatic87. Clean paired and unpaired reads were aligned to their respective reference genome assemblies using BWA v.0.7.12, and single nucleotide polymorphisms were called with SAMtools (v.0.1.18) mpileup88. We applied the PSMC49 using a generation time of 3 years across all eight species (which represents the average of generation time approximations available from the literature; Supplementary Table 2) after verifying that the application of a single generation time yielded results consistent with estimates of average N e produced by the application of generation times within the range reported in the literature. Multiple studies have provided evidence of relatively similar mutation rates across lineages of squamates13,89. Therefore, in our PSMC analyses we used the generalized squamate mutation rate reported in Green et al.89 of 2.4 × 109 /year/site (as estimated from 4-fold degenerate sites between anole and python). To test the robustness of inferred population size estimates, we conducted 100 bootstrap replicate analyses by splitting the scaffolds into smaller segments and randomly sampling the segments with replacement. Default outputs of the psmc_plot.pl script were used to graphically summarize N e changes over time estimations per each bootstrapped sample (Supplementary Fig. 10b).

Coalescent approaches for estimating N e and changes in N e over time (like PSMC) have several intrinsic limitations. Importantly, they rely on explicit assumptions of a single population coalescent model (without subdivision, gene flow, or selection) to estimate the time since the most recent common ancestor of alleles at each locus, as well as an assumed generation time and substitution rate. Population structure has been identified as one major factor that can bias PSMC-based estimates of N e 50,52,90,91. For example, the inferred trend in N e variation of a structured population can portrait either a bottleneck or an expansion in population size whether the alleles were sampled from the same subpopulation or from different subpopulations, respectively51. Episodes of natural selection can also bias estimates of N e obtained using PSMC, as selection can manipulate the rate of coalescence at specific loci that are directly or indirectly linked to targets of selection54,55. Given the nature of our data, we are not able to assess the presence and extent of population substructure or selection, and therefore cannot exclude that our PSMC estimates are immune to such biases. Additionally, PSMC has low power at recovering rapid changes in N e , which may be incorrectly estimated to have occurred over a longer period of time, and cannot recover recent nor very ancient changes in N e (e.g., younger than ~10 kyBP and older than ~3 myBP for humans)49,51. Thus, we suggest caution when interpreting our PSMC estimates of N e and N e changes through time. However, we found low variance across bootstrapped N e estimates once the most recent and most ancient time points were removed, and patterns of expansion and contraction of N e are consistent with alternations of glacial and interglacial periods during the middle Miocene climate transition, the Pliocene and the Pleistocene92. In an attempt to reduce potential biases associated with PSMC estimates of recent and ancient changes in N e , median N e values were calculated after removing the first and the last time points from each sample. We replicated each analysis (see below) after applying different filtering schemes to the standard PSMC outputs (e.g., removal of 10 and 25% of time point data, and inclusion of only time points between 20 kyBP and 10 myBP). Since all tests provided the same conclusions, we report only analyses performed using median N e values that were calculated according to the original filtering scheme. Additionally, we replicated all of our analyses using adult body mass as a proxy for N e 56 to avoid potential biases associated with our coalescence-based methods of N e estimation (i.e., Fig. 4e). For each of the 66 squamate species, we obtained adult body mass measurements from the literature57 which were used to further test for a demographic explanation for variation in repeat content alongside coalescent-based estimates of N e .

Testing demographic explanations of repeat content variation

We performed linear regression analyses to test for correlations between N e and LINE truncation, N e and genomic abundance of BovB and CR1-L3 LINEs, truncation and genomic abundance of repeats, and between truncation and estimates of ages of repeat element activity. We used the pic function in APE and the time-calibrated phylogeny to compute phylogenetic independent contrasts to be used for all linear regressions. These analyses were conducted for both the coalescent-based estimates of N e and adult body mass as a proxy for N e . Since truncation values violated assumptions of normality and homogeneity of variance (Shapiro–Wilks test; p values <0.05 and Bartlett’s test; p values <0.05), we performed statistical analyses on log-transformed values (Shapiro–Wilks test; p values >0.05 and Bartlett’s test; p values >0.05).

Data availability

New raw, unassembled shotgun sequencing data and new assembled genome data have been deposited at NCBI under the following accessions:PRJNA413172 and PRJNA413201. The authors declare that all data and scripts used in this study are available via public databases or available from the corresponding author upon request.