Biology and its sub-disciplines, like genomics, have become incredibly data-intensive in recent years. Methods like high-throughput sequencing and mass spectrometry generate huge volumes of data that need to be processed reproducibly and at scale. However, many workflows in bioinformatics and genomics are still driven by a series of shell scripts that are, at least in some cases, triggered manually.

This raises the question: how can bioinformatics and genomics professionals keep using the tooling that they are familiar with or need (i.e., shell scripts that wrap a variety of specialized tools) in a more sustainable, reproducible, and scalable manner? As it turns out, many are turning to Pachyderm as the answer!

Let’s take a common use case as an example. Suppose we want to do variant calling, which identifies variants (differences relative to a reference genome) in sequencing data, using tools from the Genome Analysis Toolkit (GATK). A typical workflow for this might include a couple of shell scripts that (i) compute per-sample variant likelihoods for various input files, and (ii) perform joint genotyping across those samples.
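As a rough sketch, the two steps might look something like the following. This is illustrative only: the reference and BAM filenames are hypothetical placeholders, and a real workflow would add resource flags, interval lists, and so on. The GATK4 subcommands shown (`HaplotypeCaller` with `-ERC GVCF`, `CombineGVCFs`, and `GenotypeGVCFs`) are the standard tools for this pattern.

```shell
#!/usr/bin/env bash
# Sketch of a two-step GATK joint-genotyping workflow.
# Filenames (ref.fasta, sample*.bam) are hypothetical placeholders.
set -euo pipefail

REF=ref.fasta

# Step (i): per-sample variant likelihoods, emitted as GVCFs
for bam in sample1.bam sample2.bam; do
  gatk HaplotypeCaller \
    -R "$REF" \
    -I "$bam" \
    -O "${bam%.bam}.g.vcf.gz" \
    -ERC GVCF
done

# Step (ii): merge the per-sample GVCFs and perform joint genotyping
gatk CombineGVCFs -R "$REF" \
  -V sample1.g.vcf.gz \
  -V sample2.g.vcf.gz \
  -O combined.g.vcf.gz

gatk GenotypeGVCFs -R "$REF" \
  -V combined.g.vcf.gz \
  -O joint_calls.vcf.gz
```

Note how step (i) is naturally parallel per sample, while step (ii) is a join over all of step (i)'s outputs; this shape is exactly what makes manual triggering painful as the number of samples grows.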

Of course, you could manually gather the input files for these shell scripts, trigger them, and collect the results, but this is time-consuming and doesn’t scale to high-throughput scenarios. It’s also far from reproducible.
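With Pachyderm, the per-sample step can instead be declared as a pipeline that runs automatically whenever new data lands in an input repo. The sketch below shows the general shape of a pipeline spec; the repo name, image name, and paths are hypothetical, and a real spec would reference a container image with GATK installed.

```json
{
  "pipeline": { "name": "likelihoods" },
  "input": {
    "pfs": { "repo": "samples", "glob": "/*.bam" }
  },
  "transform": {
    "image": "my-gatk-image",
    "cmd": ["/bin/bash"],
    "stdin": [
      "gatk HaplotypeCaller -R /ref/ref.fasta -I /pfs/samples/*.bam -O /pfs/out/out.g.vcf.gz -ERC GVCF"
    ]
  }
}
```

The `glob` pattern tells Pachyderm to treat each BAM file as an independent datum, so new samples are processed incrementally and in parallel, and every output is versioned alongside the exact input that produced it.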