By Guillaume Filion, filed under software pollution, benchmark, bioinformatics.





I never planned to do bioinformatics. It just happened because I liked the time in front of my computer. Still, like every sane individual, I sometimes think that I could do something else with my life, and I wonder whether I am doing the right thing. On this topic, I recently came across the famous farewell to bioinformatics by Frederick J. Ross, which is worth reading, and whose most emblematic quote is definitely the following.

My attitude towards the subject after all my work in it can probably be best summarized thus: Fuck you, bioinformatics. Eat shit and die.

There is nothing to agree or disagree with in this quote, but Frederick gives further detail about his point of view in the post. In short, bioinformaticians are bad programmers, and community-level obfuscation maintains the illusion.

By making the tools unusable, by inventing file format after file format, by seeking out the most brittle techniques and the slowest languages, by not publishing their algorithms and making their results impossible to replicate, the field managed to reduce its productivity by at least 90%, probably closer to 99%.

There are indeed many issues in the bioinformatics community and I am on Frederick’s side regarding file formats. For instance, I have huge respect for the maintainers of the BAM/SAM format, but here is a quote, straight from the documentation*.

Structure for core alignment information.

    typedef struct {
        int32_t tid;
        int32_t pos;
        uint32_t bin:16, qual:8, l_qname:8;
        uint32_t flag:16, n_cigar:16;
        int32_t l_qseq;
        int32_t mtid;
        int32_t mpos;
        int32_t isize;
    } bam1_core_t;

Fields

    tid      chromosome ID, defined by bam_header_t
    pos      0-based leftmost coordinate
    strand   strand; 0 for forward and 1 otherwise
    bin      bin calculated by bam_reg2bin()
    qual     mapping quality
    l_qname  length of the query name
    flag     bitwise flag
    n_cigar  number of CIGAR operations
    l_qseq   length of the query sequence (read)

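You do not need to know anything about C to notice that the description does not match the structure: it documents a strand field that the structure does not declare, and it says nothing about mtid, mpos and isize. At some point, the core storage format of BAM has changed (just that!) and the old documentation got mixed up with the new one. So much for a planetary standard.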
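For readers who do know some C, here is a minimal sketch of one plausible reading of the mismatch. It assumes, without the quoted documentation saying so, that the strand is no longer stored as its own member and is instead read from the flag field, where the SAM specification reserves bit 0x10 for reads mapped to the reverse strand. The names SKETCH_FLAG_REVERSE and sketch_strand are made up for this illustration and are not part of any library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The structure exactly as quoted in the documentation above. */
typedef struct {
    int32_t tid;
    int32_t pos;
    uint32_t bin:16, qual:8, l_qname:8;
    uint32_t flag:16, n_cigar:16;
    int32_t l_qseq;
    int32_t mtid;
    int32_t mpos;
    int32_t isize;
} bam1_core_t;

/* Bit 0x10 of the SAM flag marks a read mapped to the reverse strand.
 * The constant name is hypothetical, chosen for this sketch only. */
#define SKETCH_FLAG_REVERSE 0x10u

/* Hypothetical helper: returns 0 for forward and 1 otherwise, which is
 * the convention the stale field list gives for "strand". */
static int sketch_strand(const bam1_core_t *core)
{
    assert(core != NULL);
    return (core->flag & SKETCH_FLAG_REVERSE) != 0;
}
```

With this reading, a record whose flag is 0x10 would be reported on the reverse strand and a record whose flag is 0 on the forward strand, so the documented strand field would be redundant rather than missing.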

But no discussion of bioinformatics nonsense would be complete without a benchmark section. In our last software article, we were asked to run our benchmark against an all-pairs algorithm called slidesort. The original benchmark of slidesort concealed two minor details: that it takes months to return, and that it is not an all-pairs algorithm. Since the maintainers' email address was obsolete, we had to put some effort into finding the authors and asking for explanations. The answer was that it was probably a bug. But “bug” is too polite; “software pollution” is more appropriate.

... so why do bioinformatics?

The answer is simple: because it matters. Even though I deeply agree with Frederick, not everything boils down to working with skilful people. The impact of bioinformatics is unacknowledged but visible. How many discoveries started with a BLAST search? How many experiments were possible only because the human genome is sequenced? Besides, not every problem in bioinformatics is about memory footprint and CPU cycles; in some cases there are lives at stake. Choosing a treatment for cancer patients, deciding upon an abortion based on genotype data, initiating a vaccination campaign... and so much more.

Bioinformatics is biology, and it matters.

Notes:

* The text has since been updated.