Ethics statement

The study was approved by the Institutional Review Board at the J. Craig Venter Institute (JCVI) (#2016-238), and all methods were performed in accordance with relevant guidelines and regulations. Written informed consent was obtained from all participants prior to sample collection.

Cohort description and sample collection

Hair samples from the scalp and pubic areas were collected from adults residing in Maryland (MD, n = 8) and California (CA, n = 8). Additionally, scalp hairs were collected from adults residing in Virginia (VA, n = 5). Both males and females of diverse ethnicities were recruited for this study. Samples were self-collected by participants over one week during late winter of 2016. Each individual provided multiple hairs, for a total of 42 scalp and 32 pubic hair samples. The hair collection protocol was as described in Tridico et al.17. Hair shaft samples were self-collected from the same body location by every participant: scalp hair from behind the right ear, near the right retroauricular crease, and pubic hair from the right pubis, near the right inguinal crease. Participants clipped rather than plucked the hair so that the hair shaft, and not the follicle, was sampled. Prior to DNA extraction, hair length was measured and classified as short (<2 cm), medium (2–4 cm) or long (>4 cm).
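The length classes above can be expressed as a small helper. The following is a minimal sketch only; the function name and the example measurements are hypothetical, and it assumes that hairs of exactly 2 cm or 4 cm fall in the medium class, as the stated ranges imply.

```python
def classify_hair_length(length_cm: float) -> str:
    """Map a measured hair length (cm) to short (<2), medium (2-4) or long (>4)."""
    if length_cm < 0:
        raise ValueError("hair length must be non-negative")
    if length_cm < 2.0:
        return "short"
    if length_cm <= 4.0:
        # Assumption: both boundary values (2 cm and 4 cm) count as medium.
        return "medium"
    return "long"


# Hypothetical measurements, for illustration only.
example_lengths_cm = [0.8, 2.0, 3.5, 4.0, 6.2]
example_classes = [classify_hair_length(x) for x in example_lengths_cm]
```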

Sample preparation and DNA extraction

Hair samples were resuspended in 1200 µl of lysis buffer (20 mM Tris-Cl, pH 8.0, 2 mM EDTA, 1.2% Triton X-100) in preparation for DNA extraction. DNA was extracted by enzymatic lysis with 200 mg/ml lysozyme (Sigma-Aldrich, St. Louis, MO) and 20 mg/ml proteinase K (Life Technologies, Carlsbad, CA), followed by phenol:chloroform:isoamyl alcohol extraction and ethanol precipitation. Residual PCR inhibitors were removed using the MO BIO PowerClean kit (MO BIO Laboratories, Carlsbad, CA). DNA was quantified fluorometrically (SYBR Gold, Thermo Fisher Scientific, Waltham, MA) prior to downstream applications.

16S rRNA gene V4 sequencing

Microbiota profiling targeted the V4 region of the 16S rRNA gene. The 16S rRNA gene in each sample was amplified using adaptor- and barcode-ligated V4-specific primers, so that sequences from each sample in the library could be identified by unique barcode indices. Mock community DNA was included in the library preparation step as described previously in Kozich et al.19. The mock community serves as a control for contaminants and as a tool to assess reproducibility and read quality, since it reveals unexpected, spurious operational taxonomic units (OTUs). In addition, PhiX DNA was spiked into all sequencing runs as an internal sequencing control. A high proportion of PhiX spike-in (10–20%) adds base diversity to 16S rRNA gene runs and improves quality. Amplicons from extraction controls and no-template controls were also included to determine whether any contamination occurred during DNA extraction or library preparation. 16S rRNA gene libraries were analyzed on a High Sensitivity DNA chip (Agilent) to ensure that they were free of adapter-dimer contaminants and appropriately sized for the platform. Libraries were sequenced in 2 × 250 bp format with V2 chemistry on an Illumina MiSeq (Illumina Inc, La Jolla, CA) according to the manufacturer’s standard specifications. Quality control (QC) analysis was performed after each sequencing run, monitoring the % of reads ≥ Q30, the number of clusters passing filter and the yield per sample.
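As an illustration of the Q30 metric monitored after each run, the sketch below computes the share of base calls with a Phred quality of at least 30 from FASTQ quality strings. It is not the instrument's own calculation: the function name is hypothetical, the quality strings are invented, and Phred+33 encoding and a per-base (rather than per-read) definition of the metric are assumed.

```python
def fraction_q30(quality_strings, offset=33, threshold=30):
    """Return the fraction of base calls whose Phred score is >= threshold.

    Assumes Phred+33 encoded FASTQ quality strings (an assumption of this sketch).
    """
    total = 0
    passing = 0
    for quality in quality_strings:
        for char in quality:
            total += 1
            # Decode the ASCII character into a Phred score.
            if ord(char) - offset >= threshold:
                passing += 1
    if total == 0:
        raise ValueError("no base calls supplied")
    return passing / total


# Invented quality strings: 'I' = Q40, '?' = Q30, '5' = Q20, '#' = Q2.
example_fraction = fraction_q30(["II?5", "##II"])
```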

16S rRNA gene quantification

To quantify the absolute bacterial biomass in each sample, quantitative real-time polymerase chain reaction (qPCR) was performed using 1 µl of each sample in a 20 µl total reaction volume with LightCycler® 480 SYBR Green I Master (Roche Diagnostics, Rotkreuz, Switzerland). Reactions were run in duplicate on the LightCycler® 480 (Roche Diagnostics). The amplification protocol consisted of 60 cycles of 95 °C for 10 sec, 60 °C for 10 sec, and 72 °C for 30 sec with single acquisition, using 16S rRNA V4 primers19 at a final concentration of 200 nM. Streptococcus pneumoniae serotype 4 strain TIGR4 genomic DNA (NC_003028) was used as the positive control, and a melt curve analysis was performed to confirm the specificity of the primers for the target.

16S rRNA gene sequence data analysis

Sequence reads from the 74 hair samples and 2 negative controls were processed using an in-house 16S rRNA gene data analysis pipeline. Operational taxonomic units (OTUs) were generated with UPARSE20 using default parameters, and taxonomies were assigned to these OTUs with mothur21 using version 123 of the SILVA 16S rRNA gene database22 as the reference. Only samples with more than 500 reads (65 samples) were retained for downstream analysis. OTU count tables were normalized to relative abundances of reads mapping to taxa at each taxonomic level using the R package phyloseq23.
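The read-depth filter and relative-abundance normalization can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the in-house pipeline or the phyloseq implementation; the function name, the nested-dictionary layout and the example counts are hypothetical.

```python
def filter_and_normalize(counts, min_reads=500):
    """Keep samples with more than min_reads reads and convert counts to proportions.

    counts: dict mapping sample name -> dict mapping OTU name -> read count.
    """
    normalized = {}
    for sample, otu_counts in counts.items():
        total = sum(otu_counts.values())
        # Samples with min_reads or fewer reads are dropped ("more than 500" in the text).
        if total <= min_reads:
            continue
        normalized[sample] = {otu: n / total for otu, n in otu_counts.items()}
    return normalized


# Invented OTU table, for illustration only.
example_counts = {
    "S1": {"OTU1": 600, "OTU2": 400},  # 1000 reads, kept
    "S2": {"OTU1": 300, "OTU2": 100},  # 400 reads, dropped
    "S3": {"OTU1": 500},               # exactly 500 reads, dropped
}
example_abundance = filter_and_normalize(example_counts)
```

Each retained sample's proportions sum to 1, which makes samples of different sequencing depth comparable at the cost of discarding absolute count information.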

Statistical analysis

Non-metric multidimensional scaling (NMDS) plots were generated using the phyloseq R package, and permutational multivariate analysis of variance (PERMANOVA) was performed with the vegan R package to test for statistical significance using the Bray-Curtis dissimilarity matrix24. To detect differential abundances in the hair microbiota at the genus level, phyloseq data were converted into a DESeq2 object using the phyloseq_to_deseq2 function, and the DESeq2 package version 1.12.3 in R was used25 for differential abundance testing. DESeq2 was run with a local fit type to estimate dispersions, and p-values were adjusted for multiple testing using the Benjamini & Hochberg False Discovery Rate26. The p-value cutoff for selecting significant OTUs was 0.05 after false discovery rate (FDR) adjustment for multiple comparisons. The Random Forest algorithm implemented in R was used to classify MD versus CA samples.
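Two of the quantities named above, the Bray-Curtis dissimilarity and the Benjamini & Hochberg adjustment, can be written out directly. The sketch below is a plain-Python illustration of the standard formulas, not the vegan or DESeq2 code used in the study; the function names and all input values are hypothetical.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum(|a_i - b_i|) / sum(a_i + b_i)."""
    if len(a) != len(b):
        raise ValueError("abundance vectors must have the same length")
    denominator = sum(a) + sum(b)
    if denominator == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / denominator


def benjamini_hochberg(pvalues):
    """Return Benjamini & Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented inputs, for illustration only.
example_bc = bray_curtis([10, 0, 5], [5, 5, 5])
example_padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
significant = [p < 0.05 for p in example_padj]
```

In this invented example only the first test remains below the 0.05 cutoff after adjustment, even though three raw p-values are below 0.05.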

Availability of data and materials

Raw datasets and associated metadata generated and analyzed as part of this study are available in the NCBI SRA database under accession number SRP149455, as part of NCBI BioProject PRJNA417700. Processed datasets can be analyzed in comparison with other publicly available human microbiota data through the Forensic Microbiome Database (FMD), http://fmd.jcvi.org/.

The processed read sequences analyzed from the Human Microbiome Project (HMP) are available from the HMP Data Analysis and Coordination Center, http://hmpdacc.org/HM16STR/1.