Samples

Blood and fecal samples were collected from six captive baboons (genus Papio) housed at the Southwest National Primate Research Center (SNPRC) at the Texas Biomedical Research Institute. The individuals were of either P. anubis or hybrid ancestry (Supplemental Table S1). All six baboons were fed a diet manufactured by Purina LabDiet (“Monkey Diet 15%”) containing 15% minimum crude protein, 4% minimum crude fat, and 10% maximum crude fiber. In separate sedation events, blood and feces were collected from the same individual who was isolated for the duration of the sedation. Following centrifugation, the buffy coat was isolated from blood samples and stored at −80 °C. 2 ml of feces were also collected into 8 ml tubes containing 4 ml of RNALater (Ambion).

In addition, we collected or obtained fecal samples from 46 wild baboons in Zambia. Samples were collected between 2006 and 2015 from the Luangwa Valley, the Lower Zambezi National Park, Choma, or Kafue National Park and are of P. kindae × P. cynocephalus, P. ursinus griseipes, or P. kindae × P. ursinus griseipes ancestry (Supplemental Table S1). As with the SNPRC samples, 2 ml of feces were collected into 8 ml tubes containing 4 ml of RNALater. In contrast to the SNPRC samples, however, these samples were collected noninvasively from unhabituated animals in remote field conditions. Samples therefore could not be attributed to particular animals, although samples were selected to avoid duplication using either field observations or geographic distance. Following collection, samples were stored without refrigeration for 1–6 months before being frozen at −20 °C for long-term storage.

All procedures involving live animals were carried out in accordance with relevant guidelines and regulations. Experimental procedures at SNPRC were conducted with approval by the Institutional Animal Care and Use Committee of the Texas Biomedical Research Institute (protocol #1403 PC 0). Sedation and blood draws were performed under the supervision of a veterinarian and animals were returned immediately to their enclosures following recovery. Sample collection in Zambia was conducted with approval by the Animal Studies Committee of Washington University (assurance #A-3381–01) and following local laws and regulations in Zambia.

Buffy coat extractions were performed using the QIAamp DNA Blood Mini Kit (Qiagen), following manufacturer’s instructions. Fecal extractions were performed using the QIAamp DNA Stool Mini Kit (Qiagen) following manufacturer’s instructions for optimizing host DNA yields. DNA concentration and yield were measured using a Qubit dsDNA BR Assay (Life Technologies). In some cases, multiple DNA extractions from the same individuals were necessary when DNA was depleted over the course of this study.

We estimated the proportion of host DNA for each fecal DNA extraction using quantitative PCR (qPCR) by comparing estimates of host DNA concentration obtained by qPCR to estimates of total fecal DNA concentration obtained by Qubit. Amplification was conducted using universal mammalian MYCBP primers36 and evaluated against a standard curve constructed from the liver DNA of an individual baboon. Samples and standards were run in duplicate alongside positive and negative controls (see Supplemental Protocol for full details).

DNA enrichment

DNA was enriched using the NEBNext Microbiome DNA Enrichment Kit (New England Biolabs)22.

MBD2-Fc-bound magnetic beads were prepared according to manufacturer instructions in batches ranging from 40 to 160 μl. For each n μl batch, we prebound 0.1 × n μl MBD2-Fc protein to n μl protein A magnetic beads by incubating the mixture with rotation for 10 min at room temperature. The bound MBD2-Fc magnetic beads were then collected by magnet and washed twice with 1 ml ice-cold 1x bind/wash buffer before being resuspended in n μl ice-cold 1x bind/wash buffer.

As a pilot experiment, we prepared two successive libraries, library A and library B, following manufacturer’s instructions for capturing methylated host DNA, with minor protocol modifications incorporated for the second pilot library (library B). Library A included MBD-enriched fecal DNA from 4 SNPRC baboons and 2 Luangwa Valley baboons, as well as blood DNA from the same SNPRC baboons. Library B included MBD-enriched fecal DNA from 4 SNPRC baboons (with two repeats from library A), 4 Kafue National Park baboons, and 2 Luangwa Valley baboons, as well as blood DNA from 2 SNPRC baboons. For each fecal DNA sample, we combined 1–2 μg of extracted fecal DNA with 160 μl of prepared protein-bound beads and a variable volume of ice-cold 5x bind/wash buffer for maintaining 1x concentration of bind/wash buffer. After combining beads and DNA, we incubated the mixture at room temperature with rotation for 15 min. DNA and MBD2-Fc-bound magnetic beads were then collected by magnet and the supernatant removed. For samples in library A, we washed the collected beads with 1 ml of ice-cold 1x bind/wash buffer. For samples in library B, we conducted three expanded wash steps to maximize the removal of unbound DNA. For each wash in library B, we added 1 ml of ice-cold 1x bind/wash buffer and mixed the beads on a rotating mixer for three minutes at room temperature before collecting the beads by magnet and removing the supernatant. Following the final wash, we resuspended and incubated the beads at 65 °C with 150 μL of 1x TE buffer and 15 μL of Proteinase K for 20 min with occasional mixing. The eluted DNA was then separated by magnet, purified with 1.5x homemade SPRI bead mixture31, and quantified using a Qubit dsDNA HS Assay (Life Technologies).

Our pilot sequencing results from libraries A and B revealed large variation in the percentage of reads mapping to the baboon genome, with mapping percentages ranging from 1.1% to 79.3%, with much of the variation correlating with the proportion of host DNA in the unenriched fecal DNA sample (Supplemental Fig. S3). To expand the utility of the enrichment protocol to all fecal DNA samples, we conducted a series of capture experiments designed to optimize the enrichment of host DNA from “low-quality” samples (i.e., samples with low proportions of host DNA). For these experiments, we artificially simulated fecal DNA by combining high-quality baboon liver or blood genomic DNA (liver: SNPRC ID #19334; blood: SNPRC ID #14068 or #25567) with E. coli DNA (K12 or ATCC 11303 strains) at controlled proportions. The resulting post-enrichment proportion of baboon and E. coli DNA was evaluated by qPCR in two analyses using (1) universal mammalian MYCBP36 and (2) universal bacterial 16 S rRNA (16 S)37 primers along with standards created from the same respective organisms (experiments and results are described in detail in Supplemental Table S2).

Based on these capture optimization experiments, we prepared subsequent libraries using a version of the protocol incorporating modifications demonstrated to improve enrichment. Despite preferentially binding CpG-methylated DNA, the MBD2-Fc bait complex nevertheless bound a fraction of nonmethylated DNA. We therefore aimed to minimize both the absolute and relative amount of nonmethylated DNA binding to the bead complex. In our tests, the amount and fraction of nonmethylated bound DNA was highest when the ratio of the MBD2-Fc magnetic bead complex to total DNA was high, suggesting a surplus of MBD2 binding sites given the relatively small fraction of CpG-methylated DNA. We therefore reduced the ratio of prepared MBD2-Fc-bound magnetic beads to total DNA by tuning the amount of beads to the estimated amount of CpG-methylated host DNA. Because the amount of CpG-methylated host DNA in feces is extremely low, this modification also greatly decreased the cost of the reagents. Through our optimization experiments, we found that incorporating an additional wash step reduced the amount of contaminating nonmethylated DNA captured. Finally, we developed a method for serial enrichment of the samples (repeating the enrichment protocol), which substantially improved results. Our initial serial enrichment experiments failed to recover DNA, likely due to the use of proteinase K (combined with TE buffer) in the manufacturer’s elution protocol. Hypothesizing that incomplete removal of proteinase K, even following bead cleanup, interfered with the enrichment protocol, we instead eluted DNA using a high (2 M) NaCl concentration, resulting in successful serial enrichment. These changes, along with a full modified protocol, are detailed in the Supplemental Protocol.

For our next two libraries, libraries C and D, we added a much smaller volume of prepared MBD2-Fc-bound magnetic beads (1–22 μl) based on the estimated proportion of starting host DNA, kept the capture reaction volume consistent at a relatively low 40 μl (concentrating samples as needed using a SPRI bead cleanup), added an extra wash step in which samples were resuspended in 100 μl of 1x bind/wash buffer then incubated at room temperature for 3 minutes with rotation, and eluted samples in 100 μl 2 M NaCl. For four fecal DNA samples in library C and all of library D, we serially enriched the samples by repeating the capture reaction with 30 μl of MBD-enriched DNA (post SPRI-bead cleanup). Library C included fecal DNA from 5 SNPRC baboons, 2 Kafue National Park baboons, and 1 Luangwa Valley baboon. Library D contained fecal DNA from 6 Lower Zambezi National Park baboons, 4 Choma baboons, and 30 Kafue National Park baboons. We prepared a final library, library E, from independently extracted blood DNA from five SNPRC baboons in order to quantify the stochasticity associated with independent library preparation from independent extracts. The composition of libraries A-E are described in detail in Supplemental Tables S2-S3.

Library preparation and sequencing

Library preparation followed standard double-digest restriction site-associated DNA sequencing (ddRADseq) procedures27 with modifications to accommodate low input as described below.

For all samples, including blood DNA and MBD-enriched fecal DNA, we digested DNA with SphI and MluCI (New England Biolabs), following a ratio of 1 unit of each enzyme per 20 ng of DNA. Enzymes were diluted up to 10× using compatible diluents (New England Biolabs) to facilitate pipetting of small quantities, using an excess of enzyme if necessary to avoid pipetting less than 1 μl of the diluted enzyme mix. As the total amount of post-enrichment fecal DNA is by nature low, we adjusted adapter concentrations in the ligation reaction to ~0.1 μM for barcoded P1 and ~3 μM for P2, which correspond to excesses of adapters between 1–2 orders of magnitude. Since adapter-ligated samples are multiplexed into pools in equimolar amounts, we made efforts to combine samples with similar concentrations and enrichment when known. We used the BluePippin (Sage Science) with a 1.5% agarose gel cassette for automated size selection of pooled individuals, with a target of 300 bp (including adapters) and extraction of a “tight” collection range (±39 bp). For PCR amplification, we ran all reactions in quadruplicate to minimize PCR biases and attempted to limit the number of PCR cycles. As the concentration of post-size-selection pools was below the limits of detection without loss of a considerable fraction of the sample, estimation of the required number of PCR cycles was difficult. We therefore iteratively quantified products post-PCR and added cycles as necessary. The total number of PCR cycles per pool is reported in Supplemental Table S3, but was usually 24. Finally, libraries were sequenced using either Illumina MiSeq (libraries A-C; 2 × 150 paired-end) or Illumina HiSeq. 2500 (library D; 2 × 100 paired-end) sequencing with 30% spike in of PhiX control DNA.

Analysis

We demultiplexed reads by sample and mapped them to the baboon reference genome (Panu 2.0; Baylor College of Medicine Human Genome Sequencing Center) using BWA with default parameters and the BWA-ALN algorithm38. For every pair of blood and fecal samples from the same individual, we downsampled mapped reads to create new pairs with equal coverage in order to control for biases due to differences in sequencing depth. After realignment around indels, we identified variants using GATK UnifiedGenotyper39, in parallel analyses (1) calling variants in all samples at once and (2) processing each sample in isolation to avoid biasing variant calls from other samples at the expense of accuracy. Homozygous sites matching the reference genome were listed as missing when variants were inferred in single individuals. Variants were filtered with GATK VariantFiltration (filters: QD < 2.0, MQ < 40.0, FS > 60.0, HaplotypeScore > 13.0, MQRankSum < −12.5, ReadPosRankSum < −8.0) and indels were excluded.

We digested the baboon reference genome in silico, tallied reads within each predicted RADtag, and gathered the following information about each region: length, GC percentage, and CpG count in region ± 5 kb. We also calculated read depth in these simulated RADtags. Distributions of blood and fecal RADtags’ length, GC percentage, and local CpG density (Supplemental Fig. S2) were compared using Wilcoxon rank sum tests and visually inspected for possible gross distortion due to widespread dropout.

If the fecal enrichment procedure caused widespread allelic dropout, the proportion of alleles unique to the blood samples would be higher than that to the fecal sample. We tallied these unique alleles by using VCFtools40 to compute discordance on a per-site basis “--diff-site-discordance” as well as a discordance matrix “--diff-discordance-matrix”, and parsed the results to compute unique SNP percentage in paired blood and fecal samples. We tested for an excess of unique SNPs in blood with a Wilcoxon signed rank test.

To quantify loss of heterozygosity due to allelic dropout, we computed the method-of-moments inbreeding coefficient, F for all blood-feces pairs with equalized coverage, using both the individually called and multi-sample called SNP sets. F was calculated using the “--het” argument in PLINK28,29. The presence of dropout is expected to inflate F. We tested for differences in paired samples’ estimates of F via a Wilcoxon signed rank test. The dataset is not filtered for missingness, so sequencing errors inferred to be true variants may inflate heterozygosity estimates, thus deflating F.

To create a stringently filtered dataset with high genotyping rate, we filtered the multi-sample called SNPs in PLINK28,29, retaining only those genotyped in at least 90% of samples and removing samples with genotypes at fewer than 10% of sites. This filtered set was further pruned for linkage disequilibrium by sliding a window of 50 SNPs across the chromosome and removing one random SNP in each pair with r2 > 0.5. Using all samples, we performed multidimensional scaling to visualize identity by state (IBS). Using just the samples that were part of the same-individual blood-feces pairs, we then performed an association test and missingness chi-square test to detect allele frequencies or missingness that correlated with sample type. We did the same with the unfiltered dataset as well. Though we had few pairs of fecal samples from the same individual, we computed distance between pairs of samples from the same individual using the stringently filtered dataset to compare distance between and within sample types via a Wilcoxon signed rank test.