Main dataset

Our group recently compiled experimentally validated THPs (peptides that bind or home to tumors) from the literature into a public database, TumorHoPe [14]. In this study, we obtained 651 THPs from TumorHoPe and used them as positive examples. Developing a classification method also requires negative examples, i.e. peptides that do not bind to tumors (non-THPs). Unfortunately, experimentally validated non-THPs have not been reported in the literature. To build the negative dataset, we therefore generated 651 random peptides from proteins obtained from SwissProt and treated them as non-THPs. Some of these random peptides may have tumor homing properties, but the probability is very low. Using random peptides as negative examples is a standard procedure when experimentally validated negative examples are not available [15,16]. The main dataset thus consists of 651 THPs (experimentally validated) and 651 non-THPs (random peptides).
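The generation of random peptides can be illustrated with a minimal sketch. The text does not state how fragment lengths or start positions were chosen, so the function below makes several assumptions: the caller supplies the desired lengths, each fragment is cut at a uniformly random position from a randomly chosen protein, and fragments containing non-standard residue codes are rejected. The names `random_peptides`, `STANDARD` and `max_tries` are hypothetical.

```python
import random

STANDARD = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 natural amino acids


def random_peptides(proteins, lengths, seed=0, max_tries=1000):
    """Cut one random fragment per requested length from the given proteins.

    Assumption: uniform choice of protein and start position; the original
    sampling scheme is not described in the text.
    """
    rng = random.Random(seed)
    peptides = []
    for length in lengths:
        candidates = [p for p in proteins if len(p) >= length]
        if not candidates:
            raise ValueError(f"no protein has at least {length} residues")
        for _ in range(max_tries):
            protein = rng.choice(candidates)
            start = rng.randrange(len(protein) - length + 1)
            fragment = protein[start:start + length]
            # Reject fragments with ambiguous codes such as X, B or Z.
            if set(fragment) <= STANDARD:
                peptides.append(fragment)
                break
        else:
            raise ValueError(f"no valid fragment of length {length} found")
    return peptides
```

Passing the lengths of the positive peptides as `lengths` would give a negative set with the same length distribution, which is one plausible design choice and not something the text confirms.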

Small dataset

Most THPs have ten or fewer residues. Therefore, we created a subset of the main dataset containing only peptides (THPs or non-THPs) with a minimum of four and a maximum of ten residues. This small dataset contains 469 THPs and an equal number of non-THPs (random peptides).

Terminus datasets

To understand the role of the N- and C-terminal residues of THPs, we created terminus datasets from the N- and C-terminal residues of the peptides in the main dataset. The following terminus datasets were derived: (i) NT5 contains the first five (N-terminal) residues of each peptide; (ii) CT5 contains the last five (C-terminal) residues of each peptide; and (iii) NTCT5, in which the features (amino acid composition, dipeptide composition and binary profiles) of the first five and last five residues were generated separately and then combined for model development. Similarly, the NT10, CT10 and NTCT10 datasets were derived from the main dataset by taking ten residues from one terminus or from both termini.
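The derivation of the terminus datasets can be sketched as follows. The text does not say how peptides shorter than the requested number of residues were handled, so this sketch simply returns `None` for them, which is an assumption. The function names and the `featurize` callback are hypothetical.

```python
def terminus_segments(peptide, k=5):
    """Return the first k (NT) and last k (CT) residues of a peptide.

    Assumption: peptides shorter than k are skipped (None is returned).
    """
    if len(peptide) < k:
        return None
    return {"NT": peptide[:k], "CT": peptide[-k:]}


def ntct_features(peptide, k, featurize):
    """Featurize both termini separately and concatenate the two vectors."""
    segments = terminus_segments(peptide, k)
    if segments is None:
        return None
    return featurize(segments["NT"]) + featurize(segments["CT"])
```

Here `featurize` stands for any function that maps a segment to a list of numbers, for example a composition or a binary profile. For a peptide with fewer than 2k residues, the two segments overlap.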

Sequence logos

To understand the frequency of different types of amino acids at different positions in THPs, we created sequence logos using the WebLogo software [17]. Within a logo, the size of each residue letter reflects the frequency of that residue at the given position. The overall height of the stack at a position is a measure of conservation: the taller the stack, the lower the variability at that position.

Support vector machine

SVM is a machine-learning method based on the structural risk minimization principle of statistical learning theory. SVMs are a set of related supervised learning methods used for classification and regression. The user can choose and optimize a number of parameters and kernels (e.g. linear, polynomial, radial basis function and sigmoidal), or supply a user-defined kernel. In this study, we used the SVMlight package, version 6.02 [18]. SVMlight requires a fixed number of inputs for training, which necessitates a strategy for encapsulating the global information about proteins of variable length in a fixed-length format. This fixed-length format was obtained from the variable-length sequences using amino acid composition, dipeptide composition and binary profiles.
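As an illustration of the fixed-length input, the sketch below writes one example in the sparse text format read by SVMlight: a target value followed by `index:value` pairs with 1-based feature indices in increasing order. The helper name `to_svmlight_line` is hypothetical, and the choice to omit zero-valued features is a convention of the sparse format, not a detail taken from this study.

```python
def to_svmlight_line(label, features):
    """Format one example as '<target> <index>:<value> ...'.

    label: +1 for THPs, -1 for non-THPs (assumed labelling).
    features: fixed-length list of numbers; zeros are left out.
    """
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


print(to_svmlight_line(1, [0.0, 0.25, 0.0, 0.5]))
```

Each peptide would contribute one such line to the training file, with the same number of feature positions for every peptide regardless of its length.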

Amino acid composition (AAC)

Previous studies have shown that the simple frequency of the 20 amino acids in a protein sequence can be used to predict various properties of proteins, such as sub-cellular localization and protein classification [19]. In this study, we used the AAC of peptides to discriminate THPs from non-THPs. Each peptide was thus encapsulated in a vector of 20 dimensions. AAC is the fraction of each amino acid type within a peptide, expressed here as a percentage. The composition of each of the 20 natural amino acids was calculated using the following equation:

$$\mathrm{Comp}(i) = \frac{R_i}{N} \times 100$$

where Comp(i) is the percent composition of amino acid i, $R_i$ is the number of residues of type i, and $N$ is the total number of residues in the peptide.
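The composition defined above can be computed with a few lines of code. This is a minimal sketch; the alphabetical one-letter ordering in `AMINO_ACIDS` and the function name `aac` are assumptions, since the text does not specify the order of the 20 dimensions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 dimensions


def aac(peptide):
    """Percent composition of each of the 20 amino acids in a peptide."""
    n = len(peptide)
    if n == 0:
        raise ValueError("peptide must contain at least one residue")
    return [100.0 * peptide.count(residue) / n for residue in AMINO_ACIDS]
```

For the peptide `AAGC`, for example, Ala accounts for two of four residues (50%), and Gly and Cys for 25% each; all other entries are zero.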

Dipeptide composition (DPC)

DPC gives the composition of residue pairs (e.g. Ala-Ala, Ala-Leu) present in a peptide and is used to transform peptides of variable length into fixed-length feature vectors. It yields a fixed pattern length of 400 (20 × 20) and encapsulates information about the fraction of amino acids as well as their local order. It is calculated using the following equation:
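The equation is not reproduced in this text. A form consistent with the description, assuming overlapping adjacent residue pairs and a fractional (not percent) scale, would be:

$$\mathrm{DPC}(i) = \frac{D_i}{N - 1}, \qquad i = 1, \dots, 400$$

Here $D_i$ (a symbol introduced for this sketch) is the number of occurrences of dipeptide (i) in the peptide, and $N - 1$ is the total number of adjacent residue pairs in a peptide of $N$ residues.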

Where dipeptide (i) is one out of 400 dipeptides.
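A minimal sketch of the dipeptide composition follows. It assumes overlapping adjacent pairs, a fractional scale, and an alphabetical ordering of the 400 dipeptides; none of these details is stated in the text, and the names `DIPEPTIDES` and `dpc` are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs: AA, AC, AD, ..., YY (assumed ordering).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dpc(peptide):
    """Fraction of each of the 400 dipeptides among the adjacent pairs."""
    total = len(peptide) - 1  # number of overlapping adjacent pairs
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = peptide[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are ignored
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]
```

The peptide `AAAL` contains the pairs AA, AA and AL, so the AA entry is 2/3 and the AL entry is 1/3.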

Binary profile patterns (BPP)

A BPP was generated for each peptide, with each amino acid represented by a vector of 20 dimensions (e.g. Ala by 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0). A pattern of window length W was thus represented by a vector of 20 × W dimensions. We created binary profile patterns for the first 5 and first 10 residues from the N-terminus and, similarly, for the last 5 and last 10 residues from the C-terminus of the peptides in the datasets. BPPs have been used in a number of existing methods [20,21,22].
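The binary profile amounts to a one-hot encoding of each residue, concatenated over the window. The sketch below assumes an alphabetical one-letter ordering of the 20 amino acids, which places Ala first as in the example above; the function name `binary_profile` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering; Ala is position 0


def binary_profile(segment):
    """Encode a segment of W residues as a binary vector of length 20 * W."""
    vector = []
    for residue in segment:
        one_hot = [0] * len(AMINO_ACIDS)
        # str.index raises ValueError for non-standard residue codes.
        one_hot[AMINO_ACIDS.index(residue)] = 1
        vector.extend(one_hot)
    return vector
```

A five-residue N-terminal segment therefore yields a vector of 100 values, of which exactly five are 1.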

Cross-validation technique

One of the major challenges in developing in silico models is validating them with standard techniques. A well-known and commonly used validation technique is jack-knife or leave-one-out cross-validation, in which one peptide is used for testing and the remaining peptides are used for training. The process is repeated so that each peptide is used once for testing. Because this technique is CPU-time intensive, we used five-fold cross-validation in this study. Here, all peptides are randomly divided into five sets; four sets are used for training and the remaining set for testing. The process is repeated five times so that each set is used once for testing. The final performance is obtained by averaging the performance over the five sets.
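The five-fold procedure can be sketched as an index split. This is an illustration under assumptions: the text does not say whether the folds were stratified by class or how the random division was seeded, and the function name `five_fold_splits` is hypothetical.

```python
import random


def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) so each item is tested once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds sets of near-equal size.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```

A model would be trained on `train` and evaluated on `test` in each of the five rounds, and the five sets of performance values would then be averaged.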

Performance measure

The performance of the models developed in this study was evaluated using both threshold-dependent and threshold-independent parameters. The threshold-dependent parameters were sensitivity (Sn), specificity (Sp), overall accuracy (Ac) and the Matthews correlation coefficient (MCC), calculated using the following equations.
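The equations are not reproduced in this text. The standard definitions of these four measures are given here as fractions; whether the study reported Sn, Sp and Ac as fractions or multiplied by 100 as percentages is an assumption left open.

$$
\begin{aligned}
\mathrm{Sn} &= \frac{TP}{TP + FN} \\
\mathrm{Sp} &= \frac{TN}{TN + FP} \\
\mathrm{Ac} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\mathrm{MCC} &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{aligned}
$$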

Where TP and TN are correctly predicted positive and negative examples, respectively. Similarly, FP and FN are wrongly predicted positive and negative examples, respectively.
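Given the four counts, the threshold-dependent measures can be computed as in the sketch below. It uses the standard definitions and returns fractions; the function name `threshold_metrics` and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
import math


def threshold_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    def ratio(numerator, denominator):
        # Assumed convention: undefined ratios are reported as 0.0.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "Sn": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "Ac": ratio(tp + tn, tp + tn + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }
```

With 80 true positives, 70 true negatives, 30 false positives and 20 false negatives, for example, this gives Sn = 0.80, Sp = 0.70 and Ac = 0.75.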

We generated receiver operating characteristic (ROC) curves for all models in order to evaluate their performance with a threshold-independent parameter. The ROC plots and the area under the curve (AUC) were produced using the PASW statistical package.
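For readers without access to PASW, the AUC can also be obtained directly from the prediction scores. The sketch below is not the procedure used in this study; it relies on the fact that the AUC equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The function name `roc_auc` is hypothetical.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    if not scores_pos or not scores_neg:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(scores_pos) * len(scores_neg))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for datasets of this size (a few hundred peptides per class).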

Independent dataset

To evaluate the performance of our methods, we created an independent dataset of 83 novel, experimentally validated THPs and an equal number of random peptides (non-THPs), none of which were used in the training, feature selection or parameter optimization of the models. The experimentally validated THPs were collected manually from recent research papers and patents, while the random peptides were generated from proteins obtained from SwissProt, as described for the main dataset.