The Open Drug Discovery Toolkit (ODDT) is provided as a Python library to the cheminformatics community. We have implemented many procedures for common and more sophisticated tasks, and below we review the most prominent of them in more detail. We would also like to emphasize that, by making the code freely available under a BSD license, we encourage other researchers and software developers to implement more modules and functions and to add support for their own software.

Molecule formats

Open Drug Discovery Toolkit is designed to support as many formats as possible by extending the use of Cinfony [13]. This common API unites different molecular toolkits, such as RDKit and OpenBabel, and makes interacting with them more Python-like. All atom information collected from the underlying toolkits is stored as NumPy [14] arrays, which provide both speed and flexibility.

Interactions

The toolkit implements the most popular protein-ligand interactions. Directional interactions, such as hydrogen bonds and salt bridges, carry an additional strict or crude term that indicates whether the angle parameters are also within cutoffs (strict) or only the distance criteria are met (crude). The complete list of interactions implemented in ODDT consists of hydrogen bonds, salt bridges, hydrophobic contacts, halogen bonds, pi-stacking (face-to-face and edge-to-face), pi-cation, pi-metal and metal coordination. These interactions are detected using in-house functions and procedures that use NumPy vectorization for increased performance. Calculated interactions can be used as further (re)scoring terms. Molecular features (e.g., H-acceptors and aromatic rings) are stored in a uniform structure, which enables easy development of custom binding queries.
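The strict/crude distinction and the vectorized detection can be illustrated with a minimal sketch. It is not ODDT's own code: the function name `detect_hbonds`, the 3.5 Å distance cutoff, the 120° angle threshold and the simplified angle definition (neighbor-donor-acceptor) are all assumptions made for this example.

```python
import numpy as np

def detect_hbonds(acceptors, donors, donor_neighbors,
                  dist_cutoff=3.5, min_angle=120.0):
    """Find acceptor/donor contacts and flag which ones are 'strict'.

    acceptors       : (n, 3) acceptor coordinates
    donors          : (m, 3) donor coordinates
    donor_neighbors : (m, 3) coordinates of one atom bonded to each donor
    Cutoff values are illustrative assumptions, not ODDT defaults.
    """
    # All pairwise acceptor-donor distances in one vectorized step: (n, m)
    dist = np.linalg.norm(acceptors[:, None, :] - donors[None, :, :], axis=-1)

    # Crude criterion: the distance alone is within the cutoff
    a_idx, d_idx = np.nonzero(dist <= dist_cutoff)

    # Angle at the donor between its neighbor and the acceptor
    v1 = donor_neighbors[d_idx] - donors[d_idx]
    v2 = acceptors[a_idx] - donors[d_idx]
    cos = np.einsum("ij,ij->i", v1, v2) / (
        np.linalg.norm(v1, axis=1) * np.linalg.norm(v2, axis=1))
    angle = np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

    # Strict criterion: the angle is acceptable as well
    strict = angle >= min_angle
    return a_idx, d_idx, strict
```

Every returned pair satisfies the crude (distance-only) criterion, and the boolean `strict` array marks the pairs that also satisfy the angular one, so both views are available from a single pass.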

Filtering

Filtering small molecules by properties is implemented in ODDT. Users can use predefined filters such as RO5 [15], RO3 [16] and PAINS [17]. It is also possible to apply project-specific criteria for MW, LOGP and other parameters listed in the toolkit documentation. See Example 1 in the “Results and discussion” section for more details on how to use filtering.
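A property filter of this kind can be sketched as a set of allowed ranges checked against precomputed descriptors. The snippet below is an illustration only, not the ODDT interface: the names `RO5`, `PROJECT`, `passes` and the property keys are hypothetical, the molecules are toy records with made-up values, and the rule of five is applied here without tolerating any violation.

```python
# Each criterion maps a property name to an inclusive (low, high) range;
# None means "unbounded on that side".
RO5 = {
    "mw": (None, 500.0),   # molecular weight
    "logp": (None, 5.0),   # octanol-water partition coefficient
    "hbd": (None, 5),      # hydrogen-bond donors
    "hba": (None, 10),     # hydrogen-bond acceptors
}

# A project-specific filter uses the same structure (hypothetical limits)
PROJECT = {"mw": (250.0, 450.0), "logp": (1.0, 4.0)}

def passes(props, criteria):
    """Return True if every listed property lies within its range."""
    for name, (low, high) in criteria.items():
        value = props[name]
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True

# Toy records with made-up descriptor values
library = [
    {"name": "mol_a", "mw": 180.2, "logp": 1.2, "hbd": 1, "hba": 4},
    {"name": "mol_b", "mw": 812.0, "logp": 6.3, "hbd": 7, "hba": 14},
]

kept = [mol["name"] for mol in library if passes(mol, RO5)]
```

Keeping predefined and custom filters in the same data structure means that they can be combined or swapped without changing the filtering function.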

Docking

Merging free/open source docking programs into a pipeline can be a frustrating experience for many reasons. Some programs, like Autodock [18] and Autodock Vina [19], do not support multiple ligand inputs, whereas others write scores to separate files (e.g., GOLD [20]) or even print them directly to the console. Additional effort is required to re-score output ligand-receptor conformations in other software. As a result, in silico discovery projects tend to accumulate custom procedures and scripts whose only purpose is to share data between programs. The docking stack within ODDT provides an easier path through a common docking API. This API allows output conformations and their scores to be retrieved from various widely used docking programs. The docking stack also supports multi-threaded virtual screening independently of the underlying software, helping to utilize all available computational resources.
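The idea of running screening tasks in parallel regardless of the docking engine can be sketched as follows. This is a hedged illustration, not the ODDT docking API: `screen` and `fake_engine` are hypothetical names, the stand-in engine returns an arbitrary number instead of calling a docking program, and lower scores are assumed to be better.

```python
from concurrent.futures import ThreadPoolExecutor

def screen(ligands, dock_one, n_workers=4):
    """Dock every ligand with `dock_one` and return (ligand, score)
    pairs sorted best-first (lower score assumed to be better)."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves input order, so scores align with ligands
        scores = list(pool.map(dock_one, ligands))
    return sorted(zip(ligands, scores), key=lambda pair: pair[1])

def fake_engine(ligand):
    """Stand-in for an engine adapter. A real adapter would launch the
    docking program and parse its output; this one just returns an
    arbitrary number derived from the input string."""
    return -0.5 * len(ligand)

ranking = screen(["CCO", "c1ccccc1", "C"], fake_engine)
```

Because the engine is passed in as a plain callable, the same screening loop works for any program for which such an adapter exists. Threads are a reasonable choice when each task mostly waits on an external process.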

Scoring

Open Drug Discovery Toolkit provides a Python re-implementation of two machine learning-based scoring functions: NNScore (version 2) and RFscore. For RFscore, the training sets from the original publication were used [9]. For NNScore, neither the training set nor the training procedure was made available by the authors, other than a brief description [8]. To support NNScore, we used ffnet [21]. The training procedure for NNScore was reimplemented in ODDT and should closely reproduce the resulting ensemble of neural networks. The training data are stored as csv files, which are used to train the scoring functions locally. After the initial training procedure, the scoring function objects are stored in pickle files so that subsequent runs do not have to repeat the training.
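The train-once, load-afterwards pattern described above can be sketched with the standard library. This is a minimal illustration, not ODDT's implementation: `load_or_train`, `toy_train` and the file name are hypothetical, and a plain dictionary stands in for a trained scoring function.

```python
import os
import pickle
import tempfile

def load_or_train(path, train):
    """Return the cached model at `path`, or train and cache it first."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    model = train()  # the expensive step, executed only once
    with open(path, "wb") as handle:
        pickle.dump(model, handle)
    return model

def toy_train():
    # Stand-in for reading the csv training data and fitting a model
    return {"weights": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "scoring_function.pickle")  # hypothetical name
    first = load_or_train(cache, toy_train)   # trains and writes the file
    second = load_or_train(cache, toy_train)  # reads the file instead
```

Pickle files can execute arbitrary code when loaded, so a cache like this should only be read from locations the user controls.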

Machine learning scoring functions consist of four main building blocks: descriptors, model, training set and test set. ODDT provides a workflow for training new models, with additional support for custom descriptors and custom training and test sets. Such a design allows not only the use of the toolkit to reproduce scores (or reimplement scoring functions) but also enables the researcher to develop their own custom scoring procedures. Finally, if random seeds are defined, the scoring function results in ODDT are fully reproducible.

The ability to assess the predictive performance of a scoring function (or scoring procedure) is of utmost importance. ODDT provides various ways to accomplish this task. One approach uses the area under the receiver operating characteristic curve (ROC AUC and semi-log ROC AUC) and the enrichment factor (EF) at a defined percentage. These methods can be applied to every scoring function (and to their combinations) when training/test sets or active/inactive sets are supplied. Two other methods for testing scoring function performance are internal k-fold and leave-one-out / leave-p-out (LOO/LPO) cross-validation, both of which are particularly useful for detecting model overfitting. These methods are available in ODDT through the sklearn Python package [22].
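For readers unfamiliar with these metrics, the following sketch shows how ROC AUC and the enrichment factor can be computed from active/inactive labels and scores. It is a plain-Python illustration rather than the ODDT or sklearn implementation; the function names are hypothetical, higher scores are assumed to indicate actives, and the labels and scores are toy values.

```python
def roc_auc(labels, scores):
    """ROC AUC as the probability that a random active outscores a
    random inactive (ties count as one half)."""
    actives = [s for l, s in zip(labels, scores) if l]
    inactives = [s for l, s in zip(labels, scores) if not l]
    wins = sum((a > i) + 0.5 * (a == i) for a in actives for i in inactives)
    return wins / (len(actives) * len(inactives))

def enrichment_factor(labels, scores, percentage=1.0):
    """Hit rate in the top `percentage` of the ranked list divided by
    the hit rate of the whole set."""
    ranked = [l for _, l in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    n_top = max(1, int(round(len(ranked) * percentage / 100.0)))
    hit_rate_top = sum(ranked[:n_top]) / n_top
    hit_rate_all = sum(ranked) / len(ranked)
    return hit_rate_top / hit_rate_all

# Toy data: 2 actives among 10 compounds
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.8, 0.3, 0.2, 0.1, 0.05, 0.6, 0.0, 0.35]

auc = roc_auc(labels, scores)                  # 14 of 16 pairs ranked correctly
ef20 = enrichment_factor(labels, scores, 20)   # top 2 compounds hold 1 active
```

An AUC of 0.5 and an EF of 1 both correspond to random ranking, so values above these indicate that the scoring function enriches actives at the top of the list.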

Statistical methods

Modeling the relationship between chemical structural descriptors and compound activities provides insight into SAR. Ultimately, such models may predict screening outcomes of novel compounds, guiding future discovery steps. Because some screening data are linear by nature, simple regressors can be applied to find correlations (e.g., comparative molecular field analysis, CoMFA [23]). We implemented two straightforward regression methods that are widely used in cheminformatics, in both ligand-based and structure-based approaches: multiple linear regression and partial least squares regression.
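As a reminder of what multiple linear regression does, the sketch below fits an intercept and one coefficient per descriptor by ordinary least squares. It uses NumPy directly and is not the ODDT implementation; `fit_mlr` and `predict_mlr` are hypothetical names, and the data are synthetic values generated from a known linear relationship.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares fit; returns [intercept, coef_1, ..., coef_k]."""
    # Prepend a column of ones so that the intercept is fitted as well
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Apply the fitted intercept and coefficients to new descriptor rows."""
    return coef[0] + np.asarray(X) @ coef[1:]

# Synthetic data generated from activity = 1 + 2*d1 + 3*d2
X = [[0, 0], [1, 0], [0, 1], [1, 1]]
y = [1, 3, 4, 6]

coef = fit_mlr(X, y)
prediction = predict_mlr(coef, [[2, 2]])
```

Partial least squares regression addresses the same task but first projects the descriptors onto a small number of latent components, which helps when descriptors are numerous or strongly correlated.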

Nonlinear, more complex data are better assessed by machine learning models. Two forms of machine learning models are particularly important in drug discovery: (1) regressors for continuous data, such as IC50 values or inhibition rates, and (2) classifiers applied to multiple bit-wise features or ligands tagged as active/inactive (e.g., NNScore 1.0). ODDT employs sklearn as the main machine learning backend because it has a mature API and good performance. In some cases when neural networks are required, ODDT mimics the sklearn API and instead uses ffnet [21]. The current version of our toolkit provides machine learning models that are widely used in cheminformatics and drug discovery: (1) random forests, (2) support vector machines, and (3) artificial neural networks (single and multilayer). These models have been shown to provide great guidance when assessing protein-ligand complexes in the development and application of various scoring functions [8–10] and in SAR and QSAR (e.g., [24, 25]).