Computational complexity analysis

ECL is closely related to the work of Chen et al. [32] and Kojak [30]. Chen et al. [32] provided their algorithm’s computational complexity. Hoopmann et al. [30] provided Kojak’s source code without computational complexity analysis, so we analyzed its computational complexity based on the source code. In this section, we will analyze ECL’s computational complexity in detail.

Time complexity

We define the following variables:

k : number of proteins in a database.

n : average number of peptides in a protein.

m : average length of a chain.

h : average number of peaks in an experimental spectrum.

s : number of experimental spectra.

L: number of precursor mass tolerance ranges. This approximately equals the precursor mass range divided by the precursor mass tolerance.

The time complexity of the algorithm proposed by Chen et al. [32] is

$$ O(skn^{2} \log (kn) + sk^{2}n^{2} \log (kn) / L + s k^{2} n^{2}(m + h) / L). $$ (6)

For the first and second terms, the authors only considered one experimental spectrum. We multiply the terms by s because there are s experimental spectra. We also use k^2 n^2/L to replace p in the original paper. For the third term, the authors only considered one PSM. We multiply the term by s k^2 n^2/L because there are k^2 n^2/L peptide-peptide combinations for each experimental spectrum and there are s experimental spectra. The time complexity of Kojak is

$$ O(kn \log(s) + kns (m + h + 1) + s t^{2}). $$ (7)

Here, t is the number of peptides that Kojak pre-selects for each spectrum. Please refer to Additional file 1 for details.

For ECL, the computational complexity is dominated by step 7 in the workflow. The complexity of step 7.1 is O(log(s)). Steps 7.2 and 7.5.1 have the same time complexity, O(m). ECL stores theoretical and experimental spectra in sparse matrices. We developed an algorithm that matches peaks between a theoretical spectrum and an experimental spectrum with O(m+h) complexity (Algorithm 1). Thus, steps 7.3 and 7.5.2 both have time complexity O(m+h). Moreover, for an experimental spectrum and a pair of chains, steps 7.2 and 7.3 only need to be executed once because ECL checks each chain whose mass is smaller than or equal to half of the largest precursor mass in ascending order. Steps 7.3 and 7.5.2 also only need to be executed once for the same reason. The time complexity of step 7.4 is O(log(k n)). The time complexity of steps 7.5.3 and 7.5.4 is O(k n s/L). Thus, the time complexity of step 7 is

$$ O(kn(\log(s) + m + s(m + h) + \log(kn) + kns / L)). $$ (8)
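The O(m+h) peak matching mentioned above can be illustrated with a two-pointer merge over two sorted peak lists. This is only a minimal sketch of the general technique, not a reproduction of Algorithm 1: the function name, the use of plain sorted lists instead of sparse matrices, and the absolute m/z tolerance are all assumptions made for illustration.

```python
def count_matched_peaks(theoretical_mz, experimental_mz, tolerance):
    """Count theoretical peaks that lie within `tolerance` of an experimental peak.

    Both inputs are assumed to be sorted in ascending m/z order.
    Every loop iteration advances exactly one of the two indices,
    so the loop runs at most len(theoretical_mz) + len(experimental_mz) times.
    """
    i = 0  # index into the theoretical peaks (length m)
    j = 0  # index into the experimental peaks (length h)
    matched = 0
    while i < len(theoretical_mz) and j < len(experimental_mz):
        diff = theoretical_mz[i] - experimental_mz[j]
        if abs(diff) <= tolerance:
            # Theoretical peak i is explained; move on to the next one.
            matched += 1
            i += 1
        elif diff > 0:
            # Experimental peak is too small to match peak i or any later one.
            j += 1
        else:
            # Theoretical peak is too small to match peak j or any later one.
            i += 1
    return matched


if __name__ == "__main__":
    theoretical = [100.0, 200.0, 300.0]
    experimental = [99.8, 150.0, 200.3, 400.0]
    print(count_matched_peaks(theoretical, experimental, 0.5))
```

Because neither index ever moves backwards, the cost is linear in the total number of peaks, which is the property the complexity analysis relies on.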

There are seven variables in the time complexity equations. Five of them can be fixed based on biological prior knowledge:

n ≈ 100.

m ≈ 20.

h ≈ 10^2.

s ≈ 10^4.

L ≈ 10^5.

We plotted curves of Eqs. (6), (7), and (8) against different numbers of proteins (Fig. 2). Since Kojak selects t peptides for each spectrum, we plotted three curves corresponding to three different t values. The algorithm of Chen et al. [32] has the highest time complexity. When the number of proteins is small, ECL has lower time complexity than Kojak (leftmost part of Fig. 2). This is because ECL doesn’t need to select peptides beforehand. When the number of proteins is large, ECL has higher complexity than Kojak (rightmost part of Fig. 2). This is because the number of peptide-peptide combinations searched by ECL grows quadratically with the number of proteins (Eq. (8)). This is an unavoidable cost of exhaustive searching. In contrast, the number of peptide-peptide combinations searched by Kojak is almost constant, and its total time complexity increases linearly (Eq. (7)).

Fig. 2 Computational complexity against different numbers of proteins. Three t values were used to plot Kojak’s computational complexity curves. Chen et al. [32] has the highest time complexity. When the number of proteins is small, ECL has smaller time complexity compared to Kojak. When the number of proteins is large, ECL has higher complexity than Kojak.
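The trend described for Fig. 2 can be checked numerically by evaluating Eqs. (6), (7), and (8) with the fixed values listed above. The sketch below is illustrative only: the function names are hypothetical, the base-2 logarithm is an assumption (the text does not state a base), the constant factors hidden by the O-notation are ignored, and t = 250 is used because it is Kojak's default number of pre-selected peptides.

```python
from math import log2

# Fixed values taken from the text.
N = 100      # average number of peptides in a protein
M = 20       # average length of a chain
H = 10**2    # average number of peaks in an experimental spectrum
S = 10**4    # number of experimental spectra
L = 10**5    # number of precursor mass tolerance ranges


def chen_ops(k):
    """Operation count suggested by Eq. (6) for k proteins."""
    kn = k * N
    return (S * k * N**2 * log2(kn)
            + S * k**2 * N**2 * log2(kn) / L
            + S * k**2 * N**2 * (M + H) / L)


def kojak_ops(k, t=250):
    """Operation count suggested by Eq. (7) for k proteins and t pre-selected peptides."""
    kn = k * N
    return kn * log2(S) + kn * S * (M + H + 1) + S * t**2


def ecl_ops(k):
    """Operation count suggested by Eq. (8) for k proteins."""
    kn = k * N
    return kn * (log2(S) + M + S * (M + H) + log2(kn) + kn * S / L)


if __name__ == "__main__":
    for k in (10, 100, 1000, 5000):
        print(k, f"{chen_ops(k):.3e}", f"{kojak_ops(k):.3e}", f"{ecl_ops(k):.3e}")
```

Under these assumptions, the ECL estimate is below the Kojak estimate for a small database (k = 10) and above it for a large one (k = 5000), while the estimate for Chen et al. is the largest in both cases.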

Even though ECL’s time complexity is large, it can still handle a large database. Given a data set containing thousands of tandem mass spectra, ECL only needs 7 h to search a database containing 5200 proteins.

Space complexity

The space complexity of Chen et al. [32] is

$$ O(kn + k^{2} n^{2}/L + knm + h). $$ (9)

For the second term, we use k^2 n^2/L to replace p in the original paper. For the third term, the authors only considered one peptide-peptide combination for each experimental spectrum. We multiply the term by kn considering that there are kn peptides for each experimental spectrum.

There are two steps in Kojak. The space complexity of the first step is O(m + sh), and the space complexity of the second step is O(tm + h). Thus, the total space complexity is

$$ O(m + sh + tm + h). $$ (10)

The space complexity of ECL is

$$ O(knm + sh). $$ (11)

Clearly, Chen et al. [32] has the highest space complexity, and Kojak has the lowest. Although ECL’s space complexity is higher than that of Kojak, in our experience a personal computer with 32 GB of memory is sufficient in most cases.

Experiments

In this paper, we will present two sets of experiments. The first one used a data set from the cross-linking of two synthetic peptides. The second one used four data sets from the 26S proteasome sample [33] provided by xQuest [25, 26]. Since our study did not involve any humans, animals or clinical data, we do not have ethics or consent issues.

An experiment with synthetic peptides

This experiment used two synthetic peptides produced by GL Biochem (Shanghai) Ltd. The sequences were “EVRKELDDLR” and “EAKELIEGLPR”. The N-termini were protected by Fmoc. We used 1 μL peptides and 0.5 μL DSS. Their concentrations were 1 and 0.5 mM, respectively. We dissolved the peptides and DSS in DMSO (dimethyl sulfoxide) to a final concentration of 50 mM. The reaction was carried out at room temperature, and the reaction time was 2 h. After quenching, we added 12.5 μL piperidine to the above solution to remove the Fmoc protection. The reaction lasted for another 2 h. Finally, we freeze-dried the sample to obtain the cross-linked peptides.

LC-MS (liquid chromatography-mass spectrometry) analysis was carried out on a Thermo LTQ Orbitrap XL mass spectrometer (Thermo Fisher Scientific Inc.) with a NanoLC system. The sample was loaded onto a trapping column (PepMap C18; 2 cm × 100 μm × 5 μm, 100 Å) using a flow rate of 4 μL/min of solvent A. The loading lasted for 10 min. Cross-linked peptides were separated at a flow rate of 200 nL/min on a 75 μm × 50 cm C18 column (Acclaim PepMap RSLC C18, 75 μm × 50 cm × 3 μm, 100 Å). The following gradient was used: 0–8 min 2 % B, 8–12 min 2–10 % B, 12–180 min 10–50 % B, 180–200 min 50–98 % B, 200–215 min 98 % B, and 215–240 min 98–2 % B, where B was the ratio of acetonitrile to formic acid. B equaled 100:0.1 in this experiment. The mass spectrometer selected up to five precursors to perform CID. The intensity threshold for triggering fragmentation was 150 counts. Only precursors whose charges were larger than or equal to 2 were considered. CID was performed for 30 ms using 35 % normalized collision energy and a 0.25 activation value. Dynamic exclusion was used with the following parameters: 1 repeat count, 60 s exclusion duration, 500 list size, and 10 ppm mass window. The ion target value was 1,000,000 (or 500 ms fill time) for full scans, and 1,000,000 (or 200 ms fill time) for a tandem mass scan. Fragmented ions were detected in a linear ion trap.

During the search, the precursor mass tolerance was 10 ppm, and the tandem mass tolerance was 0.5 Th. Up to 2 missed cleavages were allowed. The database contained 100 randomly selected proteins and two synthetic peptides. The decoy database was generated by reversing peptides, with lysine and arginine fixed. Because there was only one linkable site in each synthetic peptide, all cross-linked peptides formed by synthetic peptides were treated as inter-protein cross-linked peptides. The q-value cut-off threshold was 0.05.
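The decoy construction described above can be sketched as follows. This is one possible reading of "reversing peptides, with lysine and arginine fixed", namely that K and R keep their positions while all other residues are reversed around them; the function name is hypothetical, and ECL's actual implementation may differ.

```python
def reverse_keep_kr(peptide, fixed="KR"):
    """Return a decoy sequence in which residues listed in `fixed` keep their
    positions and all remaining residues appear in reversed order."""
    # Collect the residues that are allowed to move, then reverse them.
    movable = [aa for aa in peptide if aa not in fixed]
    movable.reverse()
    remaining = iter(movable)
    # Rebuild the sequence: fixed residues stay put, other slots are refilled
    # from the reversed list in order.
    return "".join(aa if aa in fixed else next(remaining) for aa in peptide)


if __name__ == "__main__":
    for target in ("EVRKELDDLR", "EAKELIEGLPR"):
        print(target, "->", reverse_keep_kr(target))
```

Keeping K and R in place preserves the tryptic cleavage pattern, so a decoy built this way has the same length, composition, and mass as its target.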

The search was carried out on a personal computer with an Intel Core i5-4570 CPU (central processing unit) and 32 GB of memory. ECL needed about 100 s to finish the task. Since we knew the ground truth, we could calculate the false discovery proportion. Four out of 149 PSMs were incorrect, which corresponds to a false discovery proportion of 0.03. This experiment indicated that ECL could provide reliable results. Details can be found in Additional file 2.

Experiments with 26S proteasome data

Four data sets from the 26S proteasome sample [25, 26, 33] were used. We first searched the four data sets against a database released along with them. It contained 34 proteins. The latest versions of xQuest, pLink, ProteinProspector, Kojak, and ECL were used: xQuest 2.1.1, pLink 1.23, ProteinProspector 5.14.4, Kojak 1.4.2, and ECL 20160117. The precursor mass tolerance was 10 ppm, and the tandem mass tolerance was 0.2 Da. Other parameters were the same as those in the previous experiment. All the parameter files used by these tools are included in Additional file 3. We used xProphet [26] to estimate the q-value for xQuest’s results by setting “qtransform” to 1 in the “xproph.def” file. Because ProteinProspector did not provide the q-value in its results, we estimated it following Trnka et al. [29]. We used Percolator to estimate the q-value for Kojak’s results, as Kojak requires. Intra-protein cross-linked peptides and inter-protein cross-linked peptides were analyzed separately. For a fair comparison, the q-value threshold was 0.05 for all tools.

Table 1 shows the numbers of non-redundant cross-linked peptides identified by xQuest, pLink, ProteinProspector, Kojak, and ECL, respectively. Corresponding Venn diagrams can be found in Additional file 1. ECL identified more cross-linked peptides than xQuest, pLink, and ProteinProspector. We used protein crystal structures from the Protein Data Bank (PDB) to measure the distances between linking sites in intra-protein cross-linked peptides. Only 3 proteins had structural information. Their UniProt accessions were O94444, P06732, and P50524, and the corresponding PDB IDs were 2X5N, 1I0E, and 4B0Z, respectively. There were 65 PSMs matched to these proteins. Sixty of them had a distance smaller than 30 Å, which meant that they were within the distance tolerance. Details can be found in Additional file 4. We also used ECLAnnotator to generate annotated tandem mass spectra for ECL’s results. They can be found at http://bioinformatics.ust.hk/ecl.html. Then, we analyzed matched and unmatched peaks. Please refer to Additional file 2 for details.

Table 1 Numbers of non-redundant cross-linked peptides identified by xQuest, pLink, ProteinProspector, Kojak, and ECL, respectively. The database contains 34 proteins.

To find out whether the additionally identified cross-linked peptides were due to exhaustive search, we let Kojak output the top 9999 pre-selected peptides for each cross-linked peptide’s highest-scoring spectrum. (The default number of pre-selected peptides is 250. To our knowledge, other tools cannot output their pre-selected peptides.) Then, we compared the cross-linked peptides identified by ECL with those pre-selected peptides in the corresponding spectra. We consider an additionally identified cross-linked peptide pair to be due to exhaustive search if all of the following criteria are satisfied (we thank the anonymous reviewer for suggesting these criteria):

1. The precursor masses in Kojak and ECL are within the same tolerance range.

2. If both peptide chains are in the pre-selection list and at least one is ranked beyond 250, Kojak and ECL identify the same pair of peptide chains.

3. At least one peptide chain isn’t in the pre-selection list.

Table 2 shows the summarized results. About 30 % of these peptides aren’t within the top 250 of Kojak’s pre-selected peptides, which means that the pre-selection procedure is one of the causes of the missed identifications. Each spectrum’s pre-selected peptides and the detailed comparison results can be found in Additional file 5.

Table 2 Whether Kojak searched the peptides that it failed to identify.

Table 3 shows the corresponding running time of xQuest, pLink, Kojak, and ECL, respectively. ProteinProspector spent 1254 seconds on average analyzing one data set. It was run on the authors’ web server so we didn’t compare it with the other four tools. Since Kojak supports multi-thread computing, we ran it with 4 threads. xQuest, pLink, and ECL don’t support multi-thread computing.

Table 3 Running time of xQuest, pLink, Kojak, and ECL, respectively. The unit is seconds.

Finally, we tested whether ECL could search a large database within a reasonable period of time. We searched the same data sets against the whole proteome of Schizosaccharomyces pombe, which contained 5200 proteins. We set the maximum number of allowed missed cleavages to 1. The rest of the parameters were the same as those in the last experiment. xQuest ran for a few days without finishing the search. pLink could not handle such a large database. ProteinProspector spent 1.7 h on average analyzing one data set on the authors’ web server. Kojak spent 0.25 h on average analyzing one data set. ECL spent 7 h on average analyzing one data set.

There were 4×10^10 peptide-peptide combinations including decoy peptides. The precursor mass tolerance was 10 ppm. Thus, there were about 4×10^5 peptide-peptide combinations for each spectrum. Kojak selected the top 250 peptides to generate peptide-peptide combinations for each spectrum, which covered about 8 % of the whole search space. ProteinProspector used a similar pre-selection procedure to select the top 1000 peptides. Thus, the number of peptide-peptide combinations searched by ProteinProspector and Kojak was almost constant as the database size increased. However, the number of peptide-peptide combinations searched by ECL increased quadratically. This is why ECL was slower than ProteinProspector and Kojak.
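The search-space figures above can be reproduced with a short calculation. The text does not state how the 8 % coverage was derived; the sketch below assumes that Kojak's 250 pre-selected peptides are combined into unordered pairs of distinct peptides and that the number of tolerance ranges is L ≈ 10^5, as in the complexity analysis. The variable names are illustrative.

```python
from math import comb

total_pairs = 4 * 10**10      # peptide-peptide combinations, including decoys
tolerance_ranges = 10**5      # L, number of precursor mass tolerance ranges

# Combinations that fall into the tolerance range of a single spectrum.
pairs_per_spectrum = total_pairs // tolerance_ranges

# Assumption: unordered pairs of distinct peptides among Kojak's top 250.
kojak_pairs = comb(250, 2)

coverage = kojak_pairs / pairs_per_spectrum

if __name__ == "__main__":
    print(pairs_per_spectrum)   # combinations per spectrum
    print(kojak_pairs)          # pairs generated from 250 peptides
    print(f"{coverage:.1%}")    # fraction of the per-spectrum search space
```

Under these assumptions the coverage comes out slightly below 8 %, which is consistent with the rounded value quoted in the text.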

ProteinProspector, Kojak, and ECL identified fewer cross-linked peptides than in the previous experiment (Table 4). It is a known issue [34, 35] that larger databases lead to fewer identifications. The discussion of this issue is beyond the scope of this paper. ECL identified more non-redundant peptides than ProteinProspector and Kojak. Please note that no intra-protein cross-linked peptides were identified by Kojak because Percolator reported an error when estimating the q-value for Kojak’s results. The error message said: “the input data has too good separation between target and decoy PSMs”. This is a common error when there are only a few target or decoy PSMs. Please refer to Percolator’s documentation for more detail.