How cancer will be cured.
Simon Grondin 7 min read

And also diabetes, Alzheimer’s and countless other genetic diseases.

The short answer: through relatively recent software advances, namely general-purpose distributed computation systems (Spark, Hadoop and others) and machine learning. We’re currently putting together the pieces necessary to identify the exact root causes of those diseases. It won’t be this year or the next, but the people who will find those cures are most likely not only already alive but also already working on them.

It definitely won’t be easy, though. That whole industry is built on top of an extremely shaky software foundation. The people doing the groundwork are mostly unpaid grad students creating standard formats to hold genomic data and converting existing data into those formats. Currently, each genetics study has its own standard, and the data is usually stored in an ad hoc relational schema. Those datasets are then often patched together, and the result is not only hard to work with, it’s also error-prone. All this makes it impossible to combine them into a larger dataset where more subtle patterns and answers could be discovered.

The researchers in biology and genetics are not trained in software best practices. They are assisted by technicians who often don’t have any formal training in programming and simply picked up Bash, Python and SQL on the fly. Published results contain the output of those messy Python scripts, which are built mostly from copy-pasted incantations taken from StackOverflow and the like. Those scripts are then stashed away into folders and forgotten. In the meantime, the original dataset used to generate the results has “grown organically”. Tables have been added, others have changed, and the original patient data has been updated and augmented. In other words, it becomes nearly impossible to reproduce the studies.

Continue reading →

Some software design guidelines
Simon Grondin 1 min read

I have no idea whether I’m paraphrasing this from someone or if I made it up, but these design guidelines have had a tremendous impact on how I design software and how I judge it, especially programming languages.

Make the good easy, short and rewarding

Make the exceptional explicit and verbose, but possible

Make the bad impossible

Even better, when only a small number of actions are good, it’s easy to keep them all in memory and build using only these tools. Any task that ends up looking ugly sends a clear signal that it’s time to go dig into the specialty toolbox.

Those exceptional actions, being explicit, document the intent and the original reasons behind the solution. Being verbose ensures that their usage remains sparse and calculated.

Anything that’s possible will end up being used. There’s no number of “X Considered Harmful” blog posts that will prevent it, and the bad will never truly go away without drastic breaking changes.

Using libraries and tools that follow these guidelines in turn helps shape code in a way that makes it safer and easier to build upon. Your code then passes on the benefits to the layer above, whether it’s another library, or the end user.

Continue reading →
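As a rough illustration of the three guidelines, here is a minimal Python sketch. The `Settings` class and its method names are hypothetical, invented for this example. Python cannot make anything truly impossible, so "impossible" here only means that the public API offers no way to do it.

```python
from types import MappingProxyType


class Settings:
    """Hypothetical settings store; every name here is illustrative."""

    def __init__(self, values):
        # Copy the input so later changes to the caller's dict do not leak in.
        self._values = dict(values)

    # The good: reading a value is short and easy.
    def get(self, key):
        return self._values[key]

    # The exceptional: possible, but explicit and verbose. The keyword-only
    # `reason` argument documents the intent at the call site.
    def copy_with_override(self, key, value, *, reason):
        if not reason.strip():
            raise ValueError("an override needs a non-empty reason")
        updated = dict(self._values)
        updated[key] = value
        return Settings(updated)

    # The bad: in-place mutation. The public API only hands out a
    # read-only view, so item assignment on it raises TypeError.
    def as_mapping(self):
        return MappingProxyType(self._values)


defaults = Settings({"timeout_s": 30})
patched = defaults.copy_with_override(
    "timeout_s", 5, reason="integration tests need fast failures"
)
print(defaults.get("timeout_s"), patched.get("timeout_s"))  # prints: 30 5
```

The override returns a new object instead of changing the old one, so the common path stays simple and every exception to it carries its own justification.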

Things I’ve learned about making portable binaries
Simon Grondin 7 min read

I don’t claim to be a master at linking and ELF (Linux) executables, but there are some tricks I’ve learned that I wish someone had explained to me back then.

There are two problems to solve to make clean distributable binaries: dependency hell and libc compatibility. By solving both, we can get an executable to run on any recent Linux system, regardless of the distribution and installed packages.

Dependency hell

Compiling to a binary is a two-step process. There’s the actual compiling, then the linking. In the first step, each source file gets turned into an object file (.o extension). Then all the .o and the .a files are put together in a binary and the ELF meta-information is added.

.a files are static library files. If your code uses an external library and that library has previously been statically compiled, then the .a file can be directly embedded into the binary.

The ELF headers contain data such as “where is the main?” and “where should I look for the dynamically linked libraries (.so files)?”

Continue reading →
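To make the "ELF headers contain data" point concrete, here is a small Python sketch that decodes a few fields from the start of an ELF file, following the field order of the standard ELF header. The helper name `parse_elf_header` is made up for this example, and it reads only the first few fields. Finding where dynamic libraries are searched would mean walking the program headers and the dynamic section, which this sketch does not attempt.

```python
import struct

ELF_MAGIC = b"\x7fELF"


def parse_elf_header(data):
    """Decode a few fields from the start of an ELF file (illustrative helper)."""
    if len(data) < 16 or data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    ei_class, ei_data = data[4], data[5]
    if ei_class not in (1, 2) or ei_data not in (1, 2):
        raise ValueError("unsupported ELF class or byte order")
    endian = "<" if ei_data == 1 else ">"
    # After the 16-byte e_ident come e_type, e_machine, e_version, e_entry.
    # e_entry is 4 bytes wide in 32-bit files and 8 bytes in 64-bit files.
    layout = endian + ("HHII" if ei_class == 1 else "HHIQ")
    e_type, e_machine, _e_version, e_entry = struct.unpack_from(layout, data, 16)
    return {
        "bits": 32 if ei_class == 1 else 64,
        "little_endian": ei_data == 1,
        "type": e_type,        # 2 = ET_EXEC, 3 = ET_DYN (shared object or PIE)
        "machine": e_machine,  # 62 = EM_X86_64
        "entry": e_entry,      # virtual address where execution starts
    }


# A synthetic 64-bit little-endian header, so the demo runs on any OS.
fake = ELF_MAGIC + bytes([2, 1, 1, 0]) + bytes(8)
fake += struct.pack("<HHIQ", 2, 62, 1, 0x401000)
header = parse_elf_header(fake)
print(header["bits"], hex(header["entry"]))  # prints: 64 0x401000
```

On a Linux machine, the same function can be pointed at the first 64 bytes of a real executable. Note that the entry address normally points at startup code that eventually calls `main`, not at `main` itself.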

Logarithmic scaling with the fastest JSON validator
Simon Grondin 5 min read

If there’s something any backend engineer loves, it’s when they manage to solve a problem in an O(log(n)) way. I did that this week. As the number of requests per second goes up exponentially, the number of servers needed to handle them goes up linearly (until we reach a bandwidth bottleneck). Amazing!

First, let me recap. We were having performance issues in an application. We needed a lot of big and expensive EC2 compute-optimized instances to handle fairly low traffic by our standards (250mb/s). Those instances were burning CPU like crazy. The codebase is built with performance in mind and we couldn’t find a single inefficiency in it. That means it’s time for flame graphs! They look like this:

Continue reading →
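To put numbers on what O(log(n)) scaling means here, the sketch below models it in Python. The function name and all constants are hypothetical and not taken from the post. They only show that when server count grows with the logarithm of traffic, each doubling of requests adds a fixed number of servers.

```python
import math


def servers_needed(requests_per_s, baseline_rps=1_000, servers_per_doubling=1):
    """Hypothetical model: 1 server at the baseline, plus a fixed number
    of extra servers for every doubling of traffic above it."""
    if requests_per_s <= baseline_rps:
        return 1
    doublings = math.log2(requests_per_s / baseline_rps)
    return 1 + math.ceil(servers_per_doubling * doublings)


for rps in (1_000, 2_000, 4_000, 8_000, 1_024_000):
    print(rps, servers_needed(rps))
```

Under these assumed constants, traffic grows 1024-fold between the first and last row while the server count only goes from 1 to 11.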

An overview of OCaml
Simon Grondin 30 min read

A quick overview of the OCaml programming language and ecosystem.

This is not a tutorial. This article is targeted at programmers who already know mainstream languages such as Java and JavaScript at an intermediate level. I’ll cover the basics of the OCaml language with examples and some theory sprinkled here and there. The goal is to give you a “feel” for the language.

If you’ve ever wondered:

What functional programming looks like in practice

How functional languages represent objects

How they do asynchronous operations

How they even write real-world code when everything is immutable

then you’re about to get answers to all those questions and a whole lot more.

First of all, a quick description. OCaml is a functional programming language. It’s pragmatic: it aims for beautiful, expressive, immutable high-level code, but recognizes that sometimes it’s necessary to drop down to imperative code in hot sections. Its syntax can appear radical at first, but it suits the language well. It’s a high-level language that retains support for low-level operations and excellent abilities to call C code transparently when needed. OCaml uses strong static typing.

Continue reading →