Predictive modelling is widely used in drug discovery, with applications including prediction of interaction, inhibition and toxicity [1–3]. In ligand-based approaches, such as quantitative structure-activity relationship (QSAR) modelling, machine learning is commonly used to correlate chemical structures with activity, with ligands described numerically using descriptors [4]. Such modelling efforts consist of a set of computational tasks that are commonly invoked manually or with shell scripts that glue multiple tasks together into a simple form of pipeline. A common computational task in machine learning for drug discovery is a set of cross-validations nested with parameter sweeps to find optimal parameters for model training. Such intricate sets of computations create complex task dependencies that are difficult, and sometimes impossible, to encode in existing tools. Furthermore, as data sizes increase, there is a need to use high-performance e-infrastructures such as compute clusters or cloud resources to carry out analyses. These add their own requirements, making reproducible, fault-tolerant automation even more difficult to achieve [5].
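To give a sense of how quickly such nested computations grow, the following is a minimal sketch in plain Python that only enumerates the training tasks implied by a nested cross-validation combined with a parameter sweep. The function names (`kfold_indices`, `enumerate_tasks`), the parameter names (`cost`, `gamma`) and all numbers are hypothetical and chosen for illustration; no model is trained, and this is not the implementation used in this study.

```python
from itertools import product


def kfold_indices(n_samples, n_folds):
    """Split positions 0..n_samples-1 into n_folds contiguous (train, test) pairs."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


def enumerate_tasks(n_samples, outer_folds, inner_folds, param_grid):
    """List every training task: outer fold x inner fold x parameter combination."""
    names = sorted(param_grid)
    combos = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]
    tasks = []
    for outer_id, (outer_train, _) in enumerate(kfold_indices(n_samples, outer_folds)):
        # The inner split is made over positions within the outer training set.
        inner = kfold_indices(len(outer_train), inner_folds)
        for inner_id, (inner_train, inner_test) in enumerate(inner):
            for params in combos:
                tasks.append({
                    "outer_fold": outer_id,
                    "inner_fold": inner_id,
                    "params": params,
                    "n_train": len(inner_train),
                    "n_test": len(inner_test),
                })
    return tasks


# Hypothetical parameter grid: 3 x 2 = 6 combinations.
grid = {"cost": [0.1, 1, 10], "gamma": [0.01, 0.1]}
tasks = enumerate_tasks(n_samples=100, outer_folds=5, inner_folds=4, param_grid=grid)
print(len(tasks))  # 5 outer folds * 4 inner folds * 6 combinations = 120
```

Even this small, hypothetical configuration yields 120 interdependent training runs, each with its own inputs and outputs, before any final model training and assessment per outer fold is added.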

Scientific workflow management systems (WMS) are a possible solution in this context as they provide improved maintainability and robustness to failure over plain shell scripts. They provide this by describing the set of computations, the data they use and the dependencies between them in a generic way. Lower level details such as the logistics of data handling and task scheduling are left to the WMS. By hiding such technical details, they allow the researcher to focus on the research problem at hand when authoring the workflow rather than getting bogged down with peripheral matters. Thus, modifying the workflow connectivity becomes less complex and error-prone.
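The idea of declaring tasks and their dependencies, and leaving the scheduling to the system, can be sketched with the Python standard library alone (`graphlib`, available from Python 3.9). The task names below are hypothetical, and a real WMS additionally handles data transfer, failure recovery and parallel execution, which this sketch omits.

```python
from graphlib import TopologicalSorter

# Declarative description: each (hypothetical) task maps to the set of
# tasks whose outputs it consumes. No execution order is written down.
workflow = {
    "generate_descriptors": set(),
    "split_data": {"generate_descriptors"},
    "train_model": {"split_data"},
    "predict": {"train_model", "split_data"},
    "assess": {"predict"},
}


def execution_order(dependencies):
    """Derive a valid execution order from the declared dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(workflow)
print(order)
```

Changing the connectivity, for instance inserting a filtering step between two tasks, then amounts to editing entries in the dependency description rather than reordering commands in a script.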

Commonly used workflow tools for predictive modelling in drug discovery include KNIME [6, 7] and PipelinePilot [8], where KNIME is open source software with proprietary extensions and PipelinePilot is a proprietary software application. Both provide user interaction via a graphical user interface (GUI) where researchers can drag and drop components and build workflows for predictive modelling, among other things. While a GUI has clear advantages over text-based user interfaces for scientists lacking expertise in scripting or programming, it is unclear whether it provides any advantages in terms of efficiency for expert users [9]. Graphical rich clients typically place more requirements on the computer on which they are run, such as requiring a graphical desktop system, which is not always available on HPC systems. This means that the tool cannot be fully deployed to such HPC systems. Instead, the graphical client has to be run on the user’s local computer even when the jobs are executed remotely. Moreover, even when a graphical desktop system is available on an HPC system, poor performance might make it impractical to access a graphical client over a secure shell (SSH) connection, as is often needed.

As the only open source tool of those mentioned, KNIME might be considered a good default choice for the types of use cases discussed in this study. However, the open source version of KNIME does not support HPC (“remote execution”), creation of libraries of custom, re-usable components (“custom node repository”) or detailed audit logging [10], all of which are vitally important to the use cases discussed in this paper. See Table 1 for a comparison between KNIME and the solution presented in this paper.

Table 1 Feature comparison: KNIME open source versus “Vanilla” Luigi and SciLuigi

In the wider field of bioinformatics there are numerous scientific workflow tools available for analysis, e.g., in genomics and proteomics [11], but a large proportion of these tools have characteristics that limit their usefulness in highly complex analyses, such as those combining cross-validation and parameter sweeps. Furthermore, some tools do not support defining custom, re-usable components that can be assembled ad hoc for new workflows. In many WMS tools, complex workflows cannot be created without combining the workflow tool with shell scripts [12], which points to their limitations for complex use cases.

Galaxy [13–15] and Yabi [16] are GUI-centric tools or frameworks with a client/server architecture that require the installation of a server daemon and metadata to support automatic GUI generation. By their GUI-centric nature, they do not allow a level of programmability similar to that of the text-based tools, meaning that it is not equally easy to use programmatic constructs such as loops to automate repetitive workflow patterns such as parameter sweeps. Galaxy supports a REST interface [17] that can be used to provide this type of programmability, but this requires interfacing the tool with external scripts.

Snakemake [18], NextFlow [19] and BPipe [20] are text-based tools implemented as Domain Specific Languages (DSL). DSLs are mini-languages created specifically for the needs of a specific domain [21], in this case scientific workflows. While DSLs can simplify workflow writing by allowing the workflows to be defined in a language that more closely maps to the problem at hand [22], they often impose limits on the types of workflows that can easily be modelled without having to modify the language itself [23]. They also often require ad hoc solutions for integrating with existing version control software, editors and debuggers [21]. Thus, DSLs can be too limiting for highly complex workflow constructs such as those in machine learning for drug discovery. This was perceived to be the case with Snakemake and BPipe. NextFlow’s DSL allows more flexibility due to its dataflow-based implementation, but does not support creating a library of reusable component definitions. Instead, NextFlow requires components to be defined in conjunction with the workflow definition [24].

Ruffus [25] and Luigi [26] are text-based tools exposed as programming libraries, meaning that their functionality is supposed to be used from within an existing scripting language such as Python. As programming libraries, they generally require more code for defining workflows compared to DSL tools but on the other hand provide greater flexibility, as they allow users to make use of the full power of the generic programming language in which they were implemented [27].

While Ruffus provides an API based on decorators, Luigi provides an object-oriented programming API, which can be perceived as more familiar to some developers. Luigi also allows more control over output file naming than Ruffus. Furthermore, it supports the Apache Hadoop [28] and Apache Spark [29] execution environments alongside the local file system in the same framework. Figure 1 gives an overview of Luigi’s relation to other workflow tools. Luigi was thus perceived as one of the most promising tools for use in the type of analyses described in this paper.
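The difference between the two API styles can be illustrated with a toy sketch in plain Python. It deliberately uses neither Ruffus nor Luigi: the `task` decorator, the `REGISTRY` dictionary and the `Task` base class are hypothetical stand-ins written for this illustration. Only the method names `requires`, `output` and `run` are borrowed from Luigi's task classes, and the real library attaches considerably more behaviour to them.

```python
# Style 1: decorator-based. A plain function becomes a task by decoration,
# and dependencies are passed as arguments to the (hypothetical) decorator.
REGISTRY = {}


def task(after=()):
    def register(func):
        REGISTRY[func.__name__] = {"run": func, "after": tuple(after)}
        return func
    return register


@task()
def make_descriptors():
    return "descriptors"


@task(after=("make_descriptors",))
def train():
    return "model"


# Style 2: object-oriented. Each task is a class whose methods declare its
# upstream tasks, its output location and the work to perform.
class Task:
    def requires(self):
        return []

    def output(self):
        raise NotImplementedError

    def run(self):
        raise NotImplementedError


class MakeDescriptors(Task):
    def output(self):
        return "descriptors.csv"

    def run(self):
        return "wrote " + self.output()


class Train(Task):
    def __init__(self, cost):
        self.cost = cost

    def requires(self):
        return [MakeDescriptors()]

    def output(self):
        # The task decides its own output name, here derived from a parameter.
        return f"model_cost{self.cost}.bin"

    def run(self):
        return "wrote " + self.output()


print(REGISTRY["train"]["after"])
print(Train(cost=10).output())
```

In the class-based style, the output name is computed by the task itself from its parameters, which is one way to picture the finer control over output file naming mentioned above.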

Fig. 1 Sunburst diagram showing the hierarchical structure of the workflow tool landscape and Luigi’s position therein

Despite these advantages, Luigi has shortcomings in some areas that can lead to brittle and hard-to-maintain workflow code, limiting its applicability to complex analyses in drug discovery.

Flow-based programming (see the “Methods” section for details) is a paradigm developed for general purpose programs. It suggests a set of core design principles for achieving robust yet easy to modify component-oriented systems, which is a good description of what scientific workflow systems aim to be.

With this in mind, we present below SciLuigi, a solution for agile development of highly complex workflows in machine learning for drug discovery, based on selected design principles from flow-based programming combined with the Luigi workflow framework. In addition, functionality commonly used in scientific workflows but not included in vanilla Luigi has been added, such as support for an HPC resource manager and audit logging capabilities.

The solution is demonstrated on a machine learning problem for modelling a large set of biochemical interactions using a shared computer cluster. Note that evaluation of the modelling itself, and of its predictive performance, is outside the scope of this article, which focuses instead on solutions for the automation and coordination of such workflows.