Single inch hair sample preparation performance

Single one-inch hairs yield rich protein profiles that are comparable to profiles established with greater hair quantities; on average, 142 ± 33 (s.d.) proteins were identified from each of 9 head hairs (i.e., from three sets of proteomics-only biological replicates from three individuals), and the average number of identified unique peptides was 1,031 ± 219. From unique peptides, the average number of identified amino acids was 15,527 ± 3,056. The presence of a subset of unique peptides known as genetically variant peptides (GVPs) enabled inference of 16 ± 5 SNPs from major GVPs, and 17 ± 3 SNPs from minor GVPs (i.e., GVPs corresponding to the major and minor alleles, respectively). Because either major or minor GVPs allow SNP inference, non-synonymous (missense) SNPs were reported for both types of GVPs. However, in some cases, detection of both GVPs for the same SNP may not be possible. In previous studies, Parker et al. identified at least 180 proteins in 10 mg of head hair samples from 60 subjects and detected between 156 and 2,011 unique peptides [26], and Adav et al. identified, on average, 195 ± 12 proteins in human hair using various sample preparation methods [28]. Performance commensurate with previous work is thus achieved even when sample size is substantially reduced to simulate the amounts of material available from forensic samples.

In addition, co-extraction of protein and mitochondrial DNA yielded no loss in protein information relative to processing for protein alone. Proteomic results from co-extraction were not significantly different from proteomics-only sample preparation for each of the above metrics (two-sample t-test; p ≥ 0.106; Supplementary Fig. S1); for example, 156 ± 56 proteins were identified from proteomics-only samples and 151 ± 39 proteins were detected in co-extracted samples. These observations indicate that the additional steps taken to co-extract DNA with protein did not adversely affect protein identification or detection of unique peptides and missense SNPs from GVPs. As both sample preparation methods yielded the same proteomic information, the protein/DNA co-extracted sample set was included in this study for all further analyses. Analysis of GVPs and mtDNA can provide corroborating evidence for more confident profiling of individuals, which will be explored in a later publication.

Proteomic variation at different body locations

Hair proteomic variation at three different body locations in 36 hair specimens was first assessed by comparing five metrics: the numbers of detected proteins, unique peptides, amino acids, and missense SNPs from major and minor GVPs (Fig. 1). Two-way ANOVAs with Tukey HSD post-hoc tests were performed for each metric to account for effects of body location and individual. Statistical testing revealed significant effects of body location on the numbers of detected proteins (p = 1.07 × 10⁻⁴), unique peptides (p = 5.66 × 10⁻⁴), and amino acids (p = 2.21 × 10⁻³), while effects of individual and the interaction between body location and individual were not significant. A single inch of pubic hair yields more proteins, unique peptides, and amino acids than a single inch of head or arm hair. A significant effect of body location on the number of SNPs inferred from GVPs was also observed for major (p = 7.56 × 10⁻³) and minor GVPs (p = 1.91 × 10⁻⁵). These results suggest that pubic hair has a more complex protein composition than head and arm hair, from which many GVPs and SNPs can be identified for human identification.

Figure 1 Comparison of numbers of identified (a) proteins, (b) unique peptides, (c) amino acids, and missense SNPs inferred from (d) major and (e) minor GVPs at different body locations. Black lines represent statistically significant comparisons and significance levels are represented as: p ≤ 0.05 (*), p ≤ 0.01 (**), and p ≤ 0.001 (***). Pubic hair samples yield statistically greater numbers of proteins, peptides, amino acids, and inferred SNPs (two-way ANOVA and Tukey HSD; n = 36).

Significant effects of body location observed for these five metrics may arise from differences in mass per unit length of hair. The mass of a single inch of pubic hair (200.1 ± 39.6 µg) was significantly greater than that of an inch of head hair (84.4 ± 27.7 µg; two-way ANOVA and Tukey HSD; p = 1.76 × 10⁻⁹) or arm hair (49.4 ± 22.2 µg; p = 1.74 × 10⁻¹¹). Despite these mass differences, the same injection volume was used for each sample, and thus different quantities of material were loaded onto the column for LC-MS/MS. It is proposed that more proteins, unique peptides, amino acids, and inferred SNPs were identified in pubic hair samples owing to larger on-column mass loadings.

To assess body location-specific proteomic variation without bias from different on-column mass loadings, protein abundances were examined after normalization to total chromatographic peak area of identified peptides. A previous study by Laatsch and co-workers reported differential expression at different body locations for a subset of proteins [24]. To confirm these observations in this study for head and pubic hair, and to assess differential protein expression in arm hair, which was not examined previously, protein quantities derived from mass spectral data were compared. Various approaches have been utilized to quantify proteins using mass spectral data, including spectral counts [24,29,30], precursor ion peak areas from MS scans [31,32], and MS/MS fragment ion abundances [33] to represent peptide abundance. Because dynamic exclusion was used during data acquisition to maximize peptide identification and protein coverage, MS/MS spectral counts do not reliably represent peptide abundance, especially lower abundance peptides [34]. We chose to use the more robust of the latter two methods and tabulated MS scan precursor ion peak areas in mass spectral data from a complete list of identified unique peptides. Bias towards samples with larger mass loadings was removed by normalizing each precursor ion peak area to the total peak area of all identified peptides. Protein abundance in each sample was calculated as the sum of all normalized peak areas assigned to the protein.
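The normalization described above can be illustrated with a minimal sketch. It assumes each sample is available as a list of (protein, precursor peak area) pairs; the function name, the data layout, and the peak-area values are hypothetical and are not taken from the study's processing pipeline.

```python
from collections import defaultdict


def protein_abundances(peptides):
    """Sum normalized precursor peak areas per protein for one sample.

    `peptides` is a list of (protein, peak_area) pairs, one per identified
    unique peptide. This layout is an assumption made for illustration.
    """
    total_area = sum(area for _, area in peptides)
    if total_area <= 0:
        raise ValueError("total peak area must be positive")
    abundances = defaultdict(float)
    for protein, area in peptides:
        # Normalize each peak area to the total area of all identified peptides.
        abundances[protein] += area / total_area
    return dict(abundances)


# Hypothetical peak areas: sample_b is sample_a at three times the mass loading.
sample_a = [("K37", 4.0e6), ("K37", 1.0e6), ("FABP4", 5.0e6)]
sample_b = [(protein, 3 * area) for protein, area in sample_a]

abund_a = protein_abundances(sample_a)
abund_b = protein_abundances(sample_b)
```

In this toy case the two samples give the same normalized abundances even though their raw peak areas differ threefold, which is the loading bias the normalization is meant to remove.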

Protein abundance was examined in this study to observe any effects of body location. Statistical comparison of protein abundances identified 37 proteins with body location-specific differential expression, of which a subset is shown in Fig. 2 (two-way ANOVA and Tukey HSD). Further, many differentially expressed proteins show higher expression in pubic hair and are least abundant in arm hair, suggesting not only that pubic hair comprises a complex set of proteins, but also that these proteins are more abundant in pubic hair than in head and arm hair, even after accounting for mass differences. Not surprisingly, keratins and KAPs comprise only 27% of body location-specific differentially expressed proteins (i.e., 10 proteins), while intracellular proteins such as FABP4, MIF, and ATP5B make up the majority. As keratins and KAPs primarily contribute to the structural integrity of hair, which is highly conserved, it is unlikely that many hair structural proteins would exhibit differential expression at the various body locations. Many intracellular proteins are also least abundant in arm hair, although arm hair samples have notably high abundances of CALML5, GSDMA, and KAP19-5 compared to head hair samples. While the protein abundance profiles of head and arm hair samples are more similar to each other than to those of pubic hair, protein abundance variation in 37 markers enabled distinction of hair fibers from different body locations via principal components analysis (Supplementary Fig. S2). Differential protein expression captured with protein abundance confirms proteomic variation in hair from different body locations.

Figure 2 Average abundances for a subset of differentially expressed hair proteins at different body locations (two-way ANOVA and Tukey HSD; n = 36). Error bars represent standard deviation from 4 replicate measurements of each of three individuals. Black lines represent statistically significant comparisons and significance levels are represented as: p ≤ 0.05 (*), p ≤ 0.01 (**), and p ≤ 0.001 (***).

Effects of proteomic variation on GVP identification

Because protein abundances vary for a subset of hair proteins at different body locations and GVPs result from hair protein digests, it was considered that GVP identification may be affected by body location-specific differential protein expression. Therefore, it was imperative to examine the SNPs identified in each sample and determine whether differential protein expression affects GVP identification and subsequent SNP inference. Further comparison of identified SNPs in each sample was performed to observe whether some SNPs are only identified at specific body locations. Only SNP inferences consistent with an individual’s genotype determined from exome sequencing were considered. SNPs with false positive responses are not robust candidates for a GVP panel and were removed; 65 SNPs remained for further analysis.

To observe any localization of SNPs, distributions of inferred SNPs from major and minor GVPs were compared across body locations. Of 65 SNPs, only exome-proteome consistent SNPs, in which the proteomic response corresponded with the exome response, i.e., true positive and true negative responses, across all 12 samples per body location for either major or minor GVPs, were retained (Fig. 3). Figure 3a,b illustrate the amount of overlap in consistent SNPs across samples from different body locations. From 11 and 14 consistent SNPs identified from major and minor GVPs, respectively, 5 and 8 SNPs are identified at all body locations, which comprise the majority (on average, 69%) of exome-proteome consistent SNPs. This observation suggests that reliable SNP identification in samples within a body location often extends to all samples. Only 11 SNPs in total are not identified at all body locations; there is one unreliably identified SNP that overlaps between major and minor GVPs.
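The consistency filter described above can be sketched as follows. The sketch assumes each sample contributes a pair (allele present in exome, GVP detected) for a given SNP and GVP type; the function names and the example data are hypothetical, chosen only to show how true positive and true negative responses are combined per body location.

```python
LOCATIONS = ("head", "arm", "pubic")


def classify(allele_in_exome, gvp_detected):
    """Label one proteomic response against the exome genotype."""
    if gvp_detected:
        return "TP" if allele_in_exome else "FP"
    return "FN" if allele_in_exome else "TN"


def consistent_locations(responses_by_location):
    """Return body locations where every sample is a true positive or true negative."""
    consistent = set()
    for location, pairs in responses_by_location.items():
        if all(classify(exome, detected) in ("TP", "TN") for exome, detected in pairs):
            consistent.add(location)
    return consistent


# Hypothetical SNP with 12 samples per location, as (allele_in_exome, gvp_detected).
# One arm sample is a false negative, so the SNP is not consistent in arm hair.
snp_x = {
    "head": [(True, True)] * 8 + [(False, False)] * 4,
    "arm": [(True, True)] * 7 + [(True, False)] + [(False, False)] * 4,
    "pubic": [(True, True)] * 8 + [(False, False)] * 4,
}

found = consistent_locations(snp_x)
identified_everywhere = found == set(LOCATIONS)
```

Under these assumptions the example SNP would count as consistent in head and pubic hair only, placing it among the SNPs not identified at all body locations.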

Figure 3 Comparison and distribution of exome-proteome consistent SNPs across different body locations. (a) Distribution of inferred consistent SNPs across the three body locations for major and minor GVPs, respectively. (b) Summary of the number of consistent SNPs at each body location. (c) Comparison of differentially expressed proteins to proteins of 11 SNPs with unreliable identifications at one or two body locations (i.e., not identified at all body locations). The majority of exome-proteome consistent SNPs identified at each body location are identified in all samples. Unreliably identified SNPs at either one or two body locations originate from a set of proteins that are not differentially expressed; there is no overlap between these sets of proteins. Therefore, SNPs are not body location-specific.

The possibility that body location-specific SNP localization results from proteomic variation was further examined by comparing subsets of proteins. The subset of 37 proteins with body location-specific differential expression was compared with the proteins of 11 inconsistently identified SNPs (Fig. 3c). Any overlap in composition would indicate that differential expression of the protein affects downstream GVP identification and SNP inference within that protein. However, no overlap existed between differentially expressed proteins and proteins containing unreliably identified SNPs. With the exception of five proteins (APOD, CALML5, GSDMA, K37, KAP10-3), SNPs are not identified in body location-specific differentially expressed proteins. Despite significant positive correlations between the frequency of identifying SNPs from 3 of these proteins and protein abundance (Pearson product-moment correlation; p ≤ 0.043; Supplementary Fig. S3), identification of these SNPs remains variable among sample replicates, regardless of body location. Further, no statistically significant positive correlation between SNP identification frequency and protein abundance was found for unreliably identified SNPs (Supplementary Fig. S3), demonstrating that body location-specific differential expression is not linked to SNP identification for all exome-proteome consistent SNPs. Therefore, while expression of APOD, GSDMA, and K37 may display some correlation with SNP identification, the vast majority (on average, 97%) of GVP identification is not affected by differential protein expression, especially if the peptides are consistently identified among sample replicates. SNP identification in hair specimens is therefore not dependent on body location. GVP identification from protein digests of hair specimens is equally viable regardless of body location of origin, and all detected GVPs are candidates for a GVP panel.

GVP candidates for human identification panel

A series of criteria were established to evaluate GVP candidates for a robust panel. First, only GVPs that indicate exome-proteome consistent SNPs were considered. Furthermore, only consistent SNPs identified in all samples were selected, as these SNPs have the lowest false negative rates and their GVP counterparts have the highest chance of being detected. After accounting for overlap between major and minor GVPs, 12 SNPs remained for consideration. SNP identifiers, the two most abundant forms of the GVP, and their MS scan precursor ion abundances are reported in Table 1. See Supplementary Table S1 for a complete list of GVPs.

Table 1 SNP and GVP candidates for GVP panel.

The second criterion used to evaluate GVP candidates in Table 1 is marker independence for random match probability (RMP) determination at the SNP level. To assess the performance of a robust panel for forensic identifications in a population, random match probabilities are calculated as the product of genotype frequencies for each SNP locus. However, genotype frequencies for correlated SNPs, i.e., SNPs in linkage disequilibrium [35,36], may be biased in the population, which violates the assumption of marker independence for RMP calculations. To reduce the effect of possible disequilibria, a conservative one-SNP-per-gene rule was adopted; more sophisticated treatment of linkage disequilibrium will allow for inclusion of more GVPs, and thus, lower RMPs. For multiple SNPs from a gene, the SNP with the lowest minor allele frequency was selected. Finally, SNPs without Reference SNP IDs were also not considered further, as genotype frequencies are not known for these candidates. After applying these criteria, 8 SNPs remained for inclusion in a panel from 245 GVPs.
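The selection rules in this paragraph can be sketched in a few lines. The sketch assumes each candidate SNP is a record with a gene name, an optional Reference SNP ID, and a minor allele frequency; the field names, gene labels, identifiers, and frequencies are all hypothetical placeholders rather than values from Table 1.

```python
def select_panel(candidates):
    """Apply the one-SNP-per-gene rule, keeping the lowest-MAF SNP for each gene.

    Candidates without a Reference SNP ID are dropped, because genotype
    frequencies would not be available for them.
    """
    best_per_gene = {}
    for snp in candidates:
        if snp["rsid"] is None:
            continue
        current = best_per_gene.get(snp["gene"])
        if current is None or snp["maf"] < current["maf"]:
            best_per_gene[snp["gene"]] = snp
    return sorted(best_per_gene.values(), key=lambda s: s["gene"])


# Hypothetical candidates; gene names, IDs, and frequencies are placeholders.
candidates = [
    {"gene": "GENE_A", "rsid": "rs_example_1", "maf": 0.30},
    {"gene": "GENE_A", "rsid": "rs_example_2", "maf": 0.12},
    {"gene": "GENE_B", "rsid": None, "maf": 0.05},
    {"gene": "GENE_C", "rsid": "rs_example_3", "maf": 0.41},
]

panel = select_panel(candidates)
panel_ids = [snp["rsid"] for snp in panel]
```

Keeping a single SNP per gene is a deliberately conservative way to approximate marker independence; it does not test for linkage disequilibrium directly.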

GVP profiles and identification performance

GVP profiles for each sample were established using 8 robust SNPs. Each GVP profile was established using the presence or absence of the major and minor GVPs at each SNP locus. Figure 4 displays a simplified version of each profile by using observed phenotype frequencies to represent the presence or absence of GVPs, as described in Materials and Methods. The full set of profiles that denotes the presence or absence of GVPs is found in Supplementary Fig. S4.

Figure 4 GVP profiles of 36 samples using observed phenotype frequency to represent the presence or absence of major and minor GVPs at 8 SNP loci. Profiles within an individual are similar, indicating consistent identification of SNPs with robust GVPs.

GVP profiles within an individual, irrespective of body location, are more similar to one another than GVP profiles from different individuals. Pairwise comparisons of GVP profiles allowed quantification of profile similarity, using the number of observed phenotype differences across 8 SNP loci, termed GVP profile differences. Differences were recorded if the compared responses did not match exactly, and then summed for each pairwise comparison, totaling 630 comparisons. Replicate comparisons, performed between hair specimens from the same individual and body location, yielded 1.17 ± 0.99 GVP profile differences, and within-individual comparisons, between hair samples from the same individual but different body locations, showed 1.06 ± 0.94 differences. As expected, between-individual comparisons exhibited the greatest number of GVP profile differences, with 4.92 ± 0.84, 5.11 ± 0.92, and 2.79 ± 0.71 differences, respectively, between Individuals 1–2, 2–3, and 1–3 (Fig. 5a). All observed profile differences approximate expected GVP profile differences (Fig. 5b). Greatest profile variation lies between individuals (Kruskal-Wallis test; p = 2.96 × 10⁻¹⁰⁸), demonstrating that despite some sample replicate and within-individual variation (e.g., body location), distinct GVP profiles are observed in samples from different individuals.
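The count of 630 pairwise comparisons and its split into comparison categories can be reproduced with a short sketch. It assumes a balanced design of 3 individuals × 3 body locations × 4 samples, which matches n = 36; the per-category counts it produces are derived from that assumption and are not reported in the text. The profile encoding in `profile_differences` (one response per SNP locus, compared for exact equality) is likewise an illustrative assumption.

```python
from collections import Counter
from itertools import combinations, product

individuals = (1, 2, 3)
locations = ("head", "arm", "pubic")
replicates = range(4)  # assumed: 4 samples per individual per body location

# Each sample is identified by (individual, location, replicate index).
samples = list(product(individuals, locations, replicates))


def category(a, b):
    """Assign a pairwise comparison to one of the three categories."""
    if a[0] != b[0]:
        return "between-individual"
    if a[1] != b[1]:
        return "within-individual"
    return "replicate"


counts = Counter(category(a, b) for a, b in combinations(samples, 2))
total_comparisons = sum(counts.values())


def profile_differences(profile_a, profile_b):
    """Count loci at which two GVP profiles do not match exactly."""
    return sum(1 for x, y in zip(profile_a, profile_b) if x != y)


# Hypothetical 8-locus profiles; each entry is (major GVP seen, minor GVP seen).
p1 = [(True, False)] * 5 + [(True, True)] * 3
p2 = [(True, False)] * 4 + [(False, True)] + [(True, True)] * 3
n_diff = profile_differences(p1, p2)
```

Under the assumed design, the 630 comparisons split into 54 replicate, 144 within-individual, and 432 between-individual pairs, so the between-individual category dominates the sample size of the Kruskal-Wallis test.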

Figure 5 (a) Average number of GVP profile differences from different pairwise comparison categories compared to (b) expected number of GVP profile differences. Error bars represent the standard deviation. All but two comparisons, denoted by dotted line, are statistically significant (Kruskal-Wallis and Dunn tests; n = 630; p ≤ 3.80 × 10⁻⁶). The numbers of observed profile differences approximate expected GVP profile differences. Between Individual profile differences are statistically greater than Replicate and Within Individual profile differences.

Furthermore, RMPs, derived as products of observed phenotype frequencies from GVP profiles of each sample, align with the individual (Fig. 6). Experimental RMPs range between 1 in 3 and 1 in 870, within an order of magnitude of expected RMPs for each individual. Most importantly, GVP profiles of samples belonging to the same individual enable distinction of the individual to the same extent, regardless of body location, demonstrating that with a robust panel of inferred SNPs from GVPs, the probative value of one-inch head, arm, and pubic hair samples is equivalent within an individual.
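As a worked illustration of how an RMP follows from a GVP profile, the sketch below multiplies per-locus phenotype frequencies, assuming the loci are independent. The frequencies are hypothetical round numbers chosen for easy arithmetic and do not correspond to any sample in the study.

```python
from math import prod


def random_match_probability(phenotype_frequencies):
    """Multiply per-locus phenotype frequencies, assuming independent loci."""
    for freq in phenotype_frequencies:
        if not 0.0 < freq <= 1.0:
            raise ValueError("each frequency must be in (0, 1]")
    return prod(phenotype_frequencies)


# Hypothetical observed phenotype frequencies at 8 SNP loci for one sample.
frequencies = [0.5, 0.5, 0.5, 0.5, 1.0, 1.0, 1.0, 1.0]

rmp = random_match_probability(frequencies)
one_in = round(1 / rmp)  # 0.5 ** 4 = 0.0625, i.e. 1 in 16
```

Loci at which every individual shows the same phenotype (frequency 1.0) contribute nothing to discrimination, which is why a small panel can produce RMPs at the weaker end of the reported range.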