Fractionation in the gel methods is effectively a two‐dimensional separation: the first dimension separates hair proteins by molecular weight (MW) during SDS‐PAGE, and the second separates the extracted peptides along the LC gradient during LC‐MS/MS. Analyzing each fraction separately enables very low‐abundance GVPs to be identified, which is why we detect more GVPs with the two in‐gel methods than with the in‐solution method. However, we detect fewer GVPs if we combine these fractions and process them as a mixture (Table 3). We also tried a brief “short‐gel” run by applying SDS‐PAGE at 200 V for only 10 min (long gel: 30 min at 200 V). Comparing the GVPs between long‐gel and short‐gel runs, we find that the short‐gel mixture loses even more GVPs (Table 3). This can be explained by hair proteins not being effectively separated in a shorter run, or possibly by SDS not being fully separated from the proteins. In any case, this finding highlights the importance of both separation and sensitivity in finding all identifiable GVPs in a sample. While running 10 fractions is very time‐consuming, the loss of GVPs upon combining fractions (Table 3) indicates that a more rapid analysis using a single LC‐MS/MS run can miss less abundant GVPs. Moreover, the finding that different GVPs are found with different digestion protocols implies that no existing method can be relied on to identify all possible GVPs. Together, these results show the need for future work to find the most efficient way to maximize GVP identification.

As described in the Method section, we identified a total of 14 published tryptic GVP sites from this Asian donor’s hair samples. These sequences, along with the corresponding nonvariant sequences, are listed in Appendix S3. Table 3 shows the specific GVP identifications for the three methods, with three replicate runs for each: our direct method, the modified NaOH + SDS method, and the cleavable surfactant method. For both the direct method and the modified NaOH + SDS method, GVP panel results from different fractions are combined in Table 3. Appendix S3 uses the results from F1 to F10 as an example to illustrate how we performed this analysis for a complete data set by the direct method. Analysis led to a number of general findings:

The major advantage of gel fractionation is that it separates the proteins by molecular weight, thereby showing more clearly the origin of individual GVPs. It can also minimize ion suppression, leading to the identification of additional GVPs. Unfortunately, this approach is time‐consuming. Our attempts to combine fractions led to the loss of potential GVPs (see section C). Identification of all GVPs in a single‐digest analysis is apparently not possible at present (discussed below). Finding optimal methods will be the topic of future research.

The range of total ion current (TIC, upper panel) and peptide identifications (lower panel) across all 10 fractions. Blue dashed lines mark fractions 6 and 7, where TIC values reach their maximum and peptide IDs reach their minimum.

Note that in Table 1, fractions 6 and 7 show the highest peptide signal strengths but the lowest numbers of peptide identifications (IDs). This is confirmed in Fig. 3, where the total ion currents (TICs) are inversely correlated with peptide IDs, with a correlation coefficient of −0.75. This is a consequence of relatively few proteins dominating fractions 6 (type II) and 7 (type I) at high concentration, which leads to higher concentrations of their tryptic peptides and consequent signal suppression of peptides from other, less abundant proteins. In other fractions, no individual proteins dominate, so tryptic peptides are more equally spread across a larger number of proteins, though many of them are cross‐linked, fragmented, or otherwise modified. Table S1 shows that, moving along the gel fractions from F1 to F10, the example large protein (Desmoplakin) decreases and the example small protein (a keratin‐associated protein) increases.
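As a rough illustration of how a correlation coefficient of this kind can be computed, the following Python sketch implements the Pearson coefficient for per‐fraction TIC values and peptide ID counts. We assume here that the reported value is a Pearson coefficient, which the text does not state. The numbers below are invented placeholders, not the values of Table 1 or Fig. 3, and the function name is our own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (>= 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical values for fractions F1..F10 (arbitrary units); NOT measured data.
tic = [2.0, 2.5, 3.0, 3.5, 5.0, 9.0, 8.5, 4.0, 3.0, 2.5]
peptide_ids = [900, 880, 860, 820, 700, 450, 480, 760, 850, 870]

r = pearson_r(tic, peptide_ids)
print(f"r = {r:.2f}")  # negative when high TIC coincides with few peptide IDs
```

A negative coefficient here only reflects how the placeholder values were chosen; the point is the calculation, not the number.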

Figure 2 shows the intensities over the fractions for selected peptides from type I (A) or type II (B) hair cuticular keratin. In both cases, both the GVP and the nonvariant form are shown, along with another major peptide from each protein. The abundance of each peptide was derived from its MS1 ion chromatogram peak area. These results indicate that (i) the major gel bands correspond to type I (fraction 7) and type II (fraction 6) hair cuticular keratins, consistent with literature reports 8, so that fractions 6 (type II) and 7 (type I) are enriched in individual hair cuticular keratins; and (ii) most peptides identified outside the main regions were the same as those inside them. This behavior persisted in all analyses. It is presumably due to the presence of significant quantities of cross‐linked proteins or unseparated complexes, which have higher molecular weights and lower mobilities, as well as fragments of these proteins, which have lower molecular weights and higher mobilities. Keratin GVPs are found in virtually all gel fractions, suggesting that they are distributed among a wide range of cross‐linked proteins; this in turn suggests that the insoluble, cross‐linked portion of the hair protein may not contain additional keratin GVP identifications. According to reference 7, the insoluble, cross‐linked portion has a higher content of nonkeratin proteins and may contain additional nonkeratin GVP identifications. Further, we know of no way to enhance the method’s digestion effectiveness, though such an improvement would be very welcome.

We observed two distinct gel bands in fractions 6 and 7 (Fig. 1 ). The other fractions had several minor bands, but most of the intensity was evenly distributed (Fig. 1 C). Results are discussed below.

Hair cuticular keratins are major components of the hair proteome. Table 2 lists the sequence coverage of all 15 type I and type II hair cuticular keratins obtained by library and Sequest searches of all ten fractions. Peptides present in multiple proteins were used in calculating the sequence coverage of each protein. Since we are interested in GVPs, the better the coverage, the greater the chance of detecting potential GVP sites. In general, library searching provides fuller coverage than database searching, although, except for the most abundant keratin, KRT31, some of these coverages are far less than 100%. There are several possible reasons for this: (i) cross‐linking makes certain sites hard for trypsin to reach during digestion; (ii) extremely long (>50) or short (<6) peptides were not considered under the current search parameters; (iii) extremely hydrophilic or hydrophobic peptides are lost during sample preparation and LC analysis; and (iv) incomplete conversion of proteins to peptides is common throughout proteomics, and according to reference 18, a recovery of approximately 70–80% is expected after extraction from the gel. Putting all ten fractions together, 8 of the 15 hair cuticular keratins reach more than 90% coverage, 5 of the remaining 7 reach more than 50%, and only 2 are below 50% (KRT37 and KRT84). Appendix S2 shows the sequence coverage, in amino acids, of the 15 type I and type II hair cuticular keratins found by library and Sequest searches.
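To make the notion of sequence coverage concrete, the sketch below marks every residue of a protein that is covered by at least one identified peptide and reports the covered percentage. It is a minimal illustration and not the procedure used by the search software: the toy protein and peptide strings are invented, the function name is ours, and we assume that the length limits quoted above (6 to 50) are counted in residues.

```python
def sequence_coverage(protein, peptides, min_len=6, max_len=50):
    """Percent of residues in `protein` covered by at least one peptide.

    Peptides shorter than `min_len` or longer than `max_len` are ignored,
    mirroring the search limits quoted in the text (assumed to be residues).
    A peptide that occurs in several proteins counts for each of them,
    because only exact substring matches against `protein` are used.
    """
    if not protein:
        raise ValueError("protein sequence must not be empty")
    covered = [False] * len(protein)
    for pep in peptides:
        if not (min_len <= len(pep) <= max_len):
            continue
        start = protein.find(pep)
        while start != -1:  # mark every occurrence, including overlapping ones
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein)


# Toy example with an invented 16-residue sequence (not a real keratin).
toy_protein = "MKTAYIAKQRQISFVK"
toy_peptides = ["TAYIAK", "QISFVK", "MK"]  # "MK" is dropped by the length filter
print(sequence_coverage(toy_protein, toy_peptides))
```

Overlapping peptides do not inflate the result, because coverage is counted per residue rather than per peptide.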

Using both spectral library and Sequest searching methods, the results derived from F1 to F10 are compared in Table 1 . As shown in Table 1 , when the “main” library was combined with the “hair” library for spectral library searching, the overall library identification for proteins—for both hair proteome 7 , 9 and hair cuticular keratins (a major subset of the hair proteome) 1 , 8 —was similar to that from Sequest; however, for all peptides identified, the spectral library method was somewhat more sensitive at a given FDR level, consistent with previous observations 14 .

Results for hair proteins extracted from a single 5 cm long hair by the direct method are presented in Table 1 . They were derived from one raw MS data file for each of the ten gel fractions. All were independently analyzed to determine details of the gel separation and digestion process.

We examined overall protein and peptide identifications from all ten gel fractions and compared our library search results to the results from sequence (Sequest) searches. When searching spectral libraries, we added the “hair”‐specific mass spectral library to our “main” library 12, 13 to obtain better search performance. Subsections A and B below discuss these results and demonstrate the effectiveness of spectral library searching for peptide identification. In subsection C, we examine GVP detection with library searching in all ten fractions and compare the GVP panel analysis by the direct method to the other two published methods 1, 8.

Estimation of the digestion yield: The gel‐based method we chose for analysis unfortunately did not allow us to use a conventional bicinchoninic acid (BCA) colorimetric assay to measure protein concentration. Instead, yields of digested peptides measured using the Pierce method mentioned above served a similar, albeit less direct, purpose. Based on a measured mass of 100 µg for a 5‐cm hair (ten 5‐cm lengths were found to weigh 1.0 mg), we found the total peptide yields at incubation times of 5, 10, 15, 30, 60, and 90 min to be 16%, 27%, 37%, 75%, 66%, and 51%, respectively. The incubation time of 30 min, which gave the maximum yield of 75%, was selected as optimal (see above). For comparison, a yield of 47% was reported for an in‐solution method 8 using BCA after precipitating extracted proteins.
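The arithmetic behind these figures can be written out as a short Python sketch. It uses only the numbers quoted above and assumes that yield is defined as recovered peptide mass divided by hair mass, which is our reading of the text rather than a stated definition; the variable and function names are ours.

```python
# Mass of one 5-cm hair, from ten hairs weighing 1.0 mg (= 1000 µg) in total.
TEN_HAIRS_UG = 1000.0
hair_mass_ug = TEN_HAIRS_UG / 10  # 100 µg per 5-cm hair

# Total peptide yield (%) at each incubation time (min), as quoted in the text.
yield_percent_by_minutes = {5: 16, 10: 27, 15: 37, 30: 75, 60: 66, 90: 51}


def peptide_mass_ug(hair_mass, yield_percent):
    """Peptide mass implied by a percent yield (assumed: peptide mass / hair mass)."""
    return hair_mass * yield_percent / 100.0


# Incubation time with the highest yield, and the peptide mass it implies.
best_minutes = max(yield_percent_by_minutes, key=yield_percent_by_minutes.get)
best_mass_ug = peptide_mass_ug(hair_mass_ug, yield_percent_by_minutes[best_minutes])

print(best_minutes, best_mass_ug)
```

Under this assumed definition, the 30‐min optimum corresponds to roughly 75 µg of peptides recovered from a single 100 µg hair.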

We also compared the protein, peptide, and GVP identifications between the direct method and the modified NaOH + SDS method, with the analysis repeated three times for each method. Results for a representative fraction (F6) from the three experimental repeats are listed in Table 4: (i) a higher average peptide yield was obtained with the direct method than with the modified NaOH + SDS method (11.5 vs. 2.9 µg); (ii) more peptides were identified on average by the direct method than by the modified NaOH + SDS method (610 vs. 509); (iii) although a similar average number of GVP ions was observed with the two methods, the direct method was more reproducible, with a much smaller coefficient of variation (CV) across the three experimental repeats (0.02 vs. 0.27, respectively); and (iv) the gel blank gave only a few peptide IDs and no GVP identifications at all. The gel blank serves as a control to check whether handling the blank gel alone introduces any contamination. Table 4 shows that the direct method is not only more sensitive but also more reproducible than the modified NaOH + SDS method.
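For readers unfamiliar with the CV, the sketch below computes it as the standard deviation divided by the mean over replicate measurements. We assume the sample standard deviation (n − 1 denominator); the text does not say which form was used. The replicate counts are invented for illustration and are not the values of Table 4.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = sample standard deviation / mean (assumed definition)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m


# Hypothetical GVP-ion counts from three repeats; NOT measured data.
tight_repeats = [12, 12, 12]
loose_repeats = [10, 12, 14]

print(coefficient_of_variation(tight_repeats))  # identical repeats give 0
print(coefficient_of_variation(loose_repeats))
```

A smaller CV at a similar mean, as reported for the direct method, indicates that repeated extractions agree more closely with one another.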

Comparison of the reproducibility of the direct and modified NaOH + SDS methods. The two gel images compare the reproducibility of (A) the direct method and (B) the modified NaOH + SDS method using 5‐cm‐long hair shaft samples from the same individual donor across 8 replicates (A: A to H; B: 1A to 1H). An MW standard was loaded in the first lane. Note that the NaOH + SDS gel includes a ninth lane, in which the extract from ten 5‐cm‐long hair shaft samples was loaded as a reference. The major bands that correspond to type I and type II hair cuticular keratins are labeled.

To examine the reproducibility of the present method, the extraction was repeated eight times using eight individual 5‐cm‐long hair shafts (labeled A to H in Fig. 6A) from the same donor and compared with the modified NaOH + SDS method (labeled 1A to 1H in Fig. 6B, plus a last lane from 10 hairs included as a reference). We assumed that each individual 5‐cm hair shaft contained the same protein mass. Figure 6 indicates that the direct method is more reproducible than the modified NaOH + SDS method. This presumably arises from lower sample loss in the direct method, which needs only one 30‐min step for hair protein extraction, whereas the multiple steps of the modified NaOH + SDS method (which also mean a much longer bench time) are more prone to sample loss and variable results, especially when the hair sample is very small (workflows of the two methods are shown in S1).

Identification of an example GVP ion with high and low abundance. The example GVP ion (KRT33A A270V_V: QVVSSSEQLQSYQ[V]EIIELR/3_0, higher‐energy collisional dissociation (HCD) = 30 eV) was mapped to an IonPlot (x‐axis: retention time (RT) in min; y‐axis: abundance on a log10 scale) to show its library identification at high abundance (upper blue dot) and at low abundance (lower blue dot). Each blue dot indicates one peptide ion. For each blue dot, the RT and the abundance on a log10 scale are labeled underneath; blue arrows point to the corresponding library identifications, obtained by searching the spectrum of this peptide ion as the query spectrum against the hair‐specific peptide spectral library, which includes known GVP ions. The match factor (MF) is labeled underneath each library identification.

The present direct method is both suitable for very small hair samples and able to identify GVP ions across a broad range of ion intensity. Intensities of reliably identified GVP ions can differ by orders of magnitude. Figure 5 illustrates this for two spectra of the same GVP ion, “QVVSSSEQLQSYQ[V]EIIELR/3_0.” Even though their intensities differ by four orders of magnitude, the retention times were almost identical (161.7 min vs. 161.5 min) and the spectral library match factors were quite high (over 800).

Comparison of the sensitivity in the two methods. The sensitivity of the two methods was measured by comparing multiple metrics across a dilution series from 5D to 1280D: (A) the total number of ions; (B) the total number of peptides; (C) the total number of proteins; and (D) the total number of published GVP ions detected in mass spectral data from 5‐cm‐long hair shaft sample‐derived proteins that were extracted using the direct method (blue) and modified NaOH + SDS method (green). Actual data have been labeled on the points of each dilution series.

We examined the sensitivity of the direct method relative to the modified NaOH + SDS method by comparing multiple metrics across a dilution series. Figure 4 shows the relative sensitivity of the two methods by comparing the degree of dilution needed for each method to yield a similar number of IDs. After comparing the total number of ions (Fig. 4A), peptides (Fig. 4B), proteins (Fig. 4C), and GVP ions (Fig. 4D), we found that the direct method was about eight times more sensitive than the modified NaOH + SDS method. The nonmonotonic behavior of some of the trends is a consequence of the general difficulty in obtaining highly reproducible proteomic results and, for GVPs, of their small numbers and therefore greater statistical fluctuation. Because the GVPs are few in number and variable in intensity, we could not develop a reliable measure of method sensitivity based on GVP identifications alone. This was confirmed in a separate set of analyses: for example, GVP ions increased at 10D and then decreased steadily to the minimum detection level at 1280D.
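One way to turn "the degree of dilution needed to yield a similar number of IDs" into a number is sketched below: for a chosen target count, find the largest dilution at which each method still reaches it, and take the ratio. This is our own simplified reading, not the authors' stated procedure. The counts are invented, are not the data of Fig. 4, and are not meant to reproduce the eightfold result; the intermediate dilution steps (doubling from 5D to 1280D) are also an assumption.

```python
def max_dilution_reaching(counts_by_dilution, target):
    """Largest dilution factor whose ID count is still >= target, or None."""
    qualifying = [d for d, count in counts_by_dilution.items() if count >= target]
    return max(qualifying) if qualifying else None


def fold_sensitivity(method_a, method_b, target):
    """How many times further method A can be diluted than B at the same target."""
    a = max_dilution_reaching(method_a, target)
    b = max_dilution_reaching(method_b, target)
    if a is None or b is None:
        return None  # target never reached by at least one method
    return a / b


# Hypothetical peptide-ID counts per dilution factor; NOT measured data.
method_a = {5: 1000, 10: 960, 20: 900, 40: 780, 80: 600,
            160: 420, 320: 260, 640: 150, 1280: 80}
method_b = {5: 800, 10: 700, 20: 610, 40: 430, 80: 270,
            160: 160, 320: 90, 640: 40, 1280: 10}

print(fold_sensitivity(method_a, method_b, target=600))
```

Taking the largest qualifying dilution, rather than the first drop below the target, keeps the estimate usable when a trend is nonmonotonic, although the result still depends on the target chosen.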

Because the direct method and the modified NaOH + SDS method both use a protein gel to separate hair proteins, they can be compared directly. In this section, we compare the two methods further in terms of sensitivity and reproducibility.

Examination of Artifacts Among Three Methods

In most proteomics experiments, a large fraction of ions sampled are not identified. This not only reduces the efficiency of the experiment but also has potential to generate false‐positive results. Moreover, the identity of the unidentified ions may aid in understanding and optimizing the experiment and provide a measure of quality control.

In the present experiment, almost 90% of ions are not directly identified as tryptic peptides using conventional library searching. Using our recently developed hybrid search 15, as shown in Table S2, 11% can be identified as expected tryptic peptides, while about 75% can be identified via hybrid identification. These hybrid identifications find peptides that are chemically modified forms of conventional tryptic peptides. We examine experimentally introduced artifacts because an artifactual modification may masquerade as a GVP and therefore generate a false‐positive identification: the larger the number of spurious modifications, the greater the chance that one will accidentally coincide with a possible GVP. Proteomics cannot distinguish between biological and artifactual origins of identified peptides. For example, a methylation at or near a serine might be interpreted as a serine‐to‐threonine GVP. The IonPlot in Figure 7 shows the classification of ions (GVP, identified, and not identified ions from F6 of the direct method) by the hybrid search, including several modifications of interest that are discussed further in this section. These analyses also show the nature and extent of certain spurious chemical processes that add to sample complexity and, in effect, diminish the sensitivity and overall quality of the experiment.

Figure 7. Classification of ions by the hybrid search. The IonPlot shows the classification of GVP, identified, and not identified (NoID) ions, as well as several modifications: formylation (formyl), methylation (methyl), alkylation (CAM), acetaldehyde, and acetylation, present in fraction 6 (F6), a representative gel fraction from a protein gel separating proteins derived from a 5‐cm‐long hair shaft of this Asian donor by the direct method. Solid: identified by regular library search; hollow: identified by hybrid library search. x‐axis: retention time (RT) in minutes (min); y‐axis: abundance on a log10 scale.

Since this issue is important for every sample preparation method regarding GVP detection, below we examine the artifacts among the three methods: our direct method, modified NaOH + SDS method, and cleavable surfactant method.

Table 5 compares the twenty most frequently identified DeltaMass values in three methods 15. For more information, Appendix S4 shows the histograms of all DeltaMass values obtained from hybrid search identifications in each method to give a broad view of the distribution of all DeltaMass values. From the top 20 DeltaMass values listed in Table 5, we now further discuss four types of experimentally introduced artifactual modifications (Fig. 8).

Table 5. The twenty most frequently identified DeltaMass values obtained from hybrid search identifications in the three methods.

DeltaMass | Theoretical Value of DeltaMass | Proposed Modification | Percent of Hybrid Identifications (medians; Direct, NaOH + SDS, Cleavable Surfactant)
1.001 | 1.00335483 | 1‐C13 | 17.30, 17.76, 19.34
2.007 | 2.00670966 | 2‐C13 | 6.73, 8.82, 6.71
42.013 | 42.010565 | Acetyl | 6.25, 5.75, 3.54
26.017 | 26.015650 | Acetaldehyde | 3.52, 2.49, 0.66
3.009 | 3.01006449 | 3‐C13 | 3.59, 4.96, 3.55
27.999 | 27.994915 | Formyl | 1.87, 3.03, 1.57
14.018 | 14.015650 | Methyl | 3.08, 2.60, 1.12
−1.011 | −1.00335483 | −1‐C13 | 2.31, 3.05
−17.023 | −17.026549 | −NH3 | 1.62, 1.51, 2.38
70.007 | 70.005480 | Formyl + Acetyl | 0.89, 1.28
4.009 | 4.01341932 | 4‐C13 | 1.78, 2.44, 2.02
12.002 | 12.000000 | Formaldehyde Adduct | 1.45, 1.20
43.014 | 43.005814 | Carbamyl/Acetyl + 1‐C13 | 1.48, 1.07, 0.70
−18.008 | −18.010565 | Dehydration/Glu → pyro‐Glu | 1.34, 1.35, 2.01
−2.013 | −2.00670966 | −2‐C13 | 1.36, 1.58, 1.43
23.986 | 23.98865266 | Sodiated + 2C‐13 | 1.17
57.023 | 57.021464 | CAM | 1.78, 1.87, 4.21
15.997 | 15.994915 | Oxidation | 1.08, 1.28
120.028 | 120.024500 | Desulfurization + CAM + DTT | 0.95
58.010 | 58.005480 | Deamidation + CAM | 1.06, 0.89, 3.33
−91.009 | −91.009185 | Cys(CAM) → Dehydroalanine | 0.82
−16.019 | −16.0231942 | 1C‐13 + −NH3 | 0.76, 0.93
−0.983 | −0.984016 | Amidation | 3.44
5.014 | 5.01677415 | 5‐C13 | 0.69
160.041 | 160.030654 | Add‐Cys + CAM | 1.25
31.995 | 31.989829 | Dioxidation | 1.78
152.003 | 151.996571 | +DTT | 0.86

Rows listing fewer than three percentages had blank cells in one or more method columns; which columns were blank cannot be determined from this text, so the reported values are given in their original order.
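A ranking like the one in Table 5 can be produced by binning the DeltaMass values of all hybrid identifications and reporting each bin as a percentage of the total. The Python sketch below shows one simple way to do this. The bin width (rounding to two decimals), the function name, and the sample values are our assumptions; the binning actually used by the hybrid search software is not described here.

```python
from collections import Counter


def top_delta_masses(delta_masses, n=20, decimals=2):
    """Return the n most frequent DeltaMass bins as (bin, percent_of_total).

    Values are binned by rounding to `decimals` places, which is a
    simplification chosen for this sketch.
    """
    if not delta_masses:
        return []
    bins = Counter(round(dm, decimals) for dm in delta_masses)
    total = len(delta_masses)
    return [(mass, 100.0 * count / total) for mass, count in bins.most_common(n)]


# Hypothetical DeltaMass values (Da) from six hybrid identifications.
observed = [1.001, 1.002, 1.003, 42.013, 42.011, 26.017]

for mass, percent in top_delta_masses(observed, n=2):
    print(f"{mass:8.2f}  {percent:5.1f}%")
```

Rounding can split a cluster that straddles a bin edge, so a real analysis would more likely group values by proximity to a theoretical mass within a stated tolerance.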

Figure 8. Comparison of the artifacts in the three methods. Comparison of experimentally introduced artifactual modifications among three methods using our recently developed hybrid search: cleavable surfactant method (red), modified NaOH + SDS method (green), and the direct method (blue). The compared experimentally introduced artifactual modifications chosen as examples are as follows: acetaldehyde (upper left), acetylation (upper right), formylation (lower left), and over alkylation (lower right).

Acetaldehyde adduction: We compared the occurrence of an acetaldehyde adduct across the three methods. Figure 8 shows that this artifactual modification is more frequently identified in the direct and modified NaOH + SDS methods, due to the presence of ethanol in the SimplyBlue SafeStain that we used to stain the protein gels. Figure 9 gives an example of our main concern, namely that a modification at a peptide’s N‐terminus could be mistaken for a potential GVP. The DeltaMass value from the hybrid search for this identification is 26.0186 Da, within the mass tolerance range; it is likely due to acetaldehyde (26.01565 Da) but may be incorrectly identified as His (H) → Tyr (Y) (26.004417 Da), since His (H) is the first amino acid in this peptide ion. Without the hybrid search, or without awareness of which artifactual modifications exist, such a misidentification can occur.

Figure 9. An example of a modification at peptide N‐terminus mistaken as a GVP. Spectral match of a hair‐derived peptide to the peptide sequence HLQLAIR (Charge = 2, Mods = 0, Spectral Match Score = 705) with a DeltaMass of 26.0186 Da, which is likely due to acetaldehyde (26.01565 Da) but may be incorrectly identified as His (H) →Tyr (Y) (26.004417 Da).
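The ambiguity in this example can be checked numerically. The sketch below compares the observed DeltaMass of 26.0186 Da with the two candidate explanations quoted above and lists those that fall within a given tolerance. The two tolerance values are hypothetical, since the tolerance actually used is not stated in this section, and the function name is ours.

```python
# Candidate explanations and their theoretical mass shifts (Da), from the text.
CANDIDATES = {
    "Acetaldehyde adduct": 26.01565,
    "His -> Tyr substitution": 26.004417,
}


def candidates_within(observed, tolerance_da, candidates=CANDIDATES):
    """Candidates whose theoretical shift lies within tolerance_da of observed.

    Returns (name, observed - theoretical) pairs, closest match first.
    """
    hits = [
        (name, observed - mass)
        for name, mass in candidates.items()
        if abs(observed - mass) <= tolerance_da
    ]
    return sorted(hits, key=lambda hit: abs(hit[1]))


observed_delta = 26.0186  # Da, DeltaMass reported for HLQLAIR

# Hypothetical tolerances: a wide window admits both, a narrow one only the adduct.
for tol in (0.02, 0.005):
    names = [name for name, _ in candidates_within(observed_delta, tol)]
    print(f"tolerance {tol} Da: {names}")
```

The acetaldehyde shift lies about 0.003 Da from the observed value and the His → Tyr shift about 0.014 Da, so mass alone separates the two only when the tolerance is tight; knowing which artifacts a preparation introduces helps to settle the remaining cases.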

Acetylation: While acetylation at Lys (K) and at the protein amino terminus are biological modifications, artifactual acetylation at the peptide N‐terminus can be introduced during sample preparation. Although we do not believe that acetic acid was introduced during sample preparation, this artifactual modification was identified more frequently in the direct and modified NaOH + SDS methods.

Formylation: Formylation differs less across the three methods than the two modifications described above. This is expected, as formic acid is required in all three sample preparations.

Alkylation: Alkylation (CAM) is significantly greater in the cleavable surfactant method than in the direct and modified NaOH + SDS methods. This is consistent with the fact that the iodoacetamide concentration used in sample preparation for the cleavable surfactant method is much higher than in the direct and modified NaOH + SDS methods.

Table 5 and Appendix S4 show that, overall, the results of the three methods have similar degrees of experimentally introduced modifications. It seems likely that the artifactual modifications are a result of the inherent difficulty of digesting such an insoluble and cross‐linked material as hair.

Regarding GVP panel analysis, we find consistent results in regular and hybrid searches. Hybrid searching usually reports more GVP ions, carrying many kinds of unexpected modifications, but does not seem to detect additional known GVP sites. Verifying GVP detection by the hybrid search (seeing not only the version included in the library but also versions with unexpected modifications) increases the confidence of GVP panel analysis.