My PhD is entitled: Integration and visualisation of clinical-metabolic datasets for medical-decision making. After several initial projects, I have spent the past 1.5 years working on a general system for the integration, exploration and visualisation of complex biological datasets.

The problem with such an ambitious and broad PhD title is that it can fragment your attention and you can quickly lose focus (which happened to me several times in the past years). But the more I thought about and worked on this problem, the better the various pieces came together to form a coherent and (hopefully) useful system.

It started out as a simple correlation network visualisation tool, and today, 12,000 lines of code later, it’s a complete cloud-based (AWS) research tool which combines advanced feature selection algorithms and covariance estimation techniques with novel, interactive visualisation interfaces.

What does CorrMapper do?

I think CorrMapper could be best understood by thinking of the problems researchers with complex biological datasets face these days, so let’s enumerate these:

1. Most of the analytical platforms we use today in life sciences and medical research are getting cheaper each year, some of them at a ridiculous rate. The sequencing and assembly of the first human genomes cost hundreds of millions to billions of dollars ($3 billion for the US government funded project and $300,000,000 for the private Celera initiative) and took 11 and 3 years respectively. Yet today, less than 15 years later, we are about to pass the $1000 price point in human genome sequencing with the Illumina HiSeq X Ten, which will be capable of sequencing 18,000 human genomes per year at the gold standard of 30× coverage. Usually, when products get cheaper two things happen: a/ more people start using them, b/ people use more of them. If this happens to several analytical platforms simultaneously, then researchers with the same budget will suddenly be able to design studies which utilise multiple platforms at once. This is precisely what has happened in the past 10 years in life sciences, and “multi-omics” studies are becoming a lot more popular and affordable. Multi-omics just means that we have more than one omics dataset, where omics is the term used in life sciences to refer collectively to the data coming from genomics, transcriptomics, metagenomics, metabolomics, etc. Multi-omics studies have great potential as they allow us to examine the biology behind a disease from multiple viewpoints, each analytical platform opening a new window onto the underlying biochemical processes. For example, the change of gene expression in colon cancer is just as important as the changes in epigenomic markers, and the gut microbiome cannot be ignored either, as there seems to be a complex, multi-level interplay between the bugs in our gut and our health. The question is: how do we relate these disparate datasets and combine them so that their complementary information can be harnessed to expand our biomedical knowledge?

2. Modern omics datasets are extremely feature rich, and in multi-omics studies this complexity is compounded by a second or even third dataset. Many of these features, however, might be completely irrelevant to the studied biological problem, or redundant in the context of others (multicollinearity). We wouldn’t expect, for example, all 25000 human genes to be involved in breast cancer, or all urinary metabolites to be candidate biomarkers for liver failure. Learning from such feature rich datasets inevitably incurs an increased computational cost. It also increases the chance of overfitting the noise in our data, which reduces the predictive power of our models.

3. The correlation networks arising from these high-throughput datasets are often hard to interpret and explore, both because they are so dense and because interactive tools are lacking.

4. Finally, biomedical studies will have increasing amounts of metadata attached to the actual omics measurements, making the stratification of patients easier than ever before. This is largely due to the explosion of digital/wearable health gadgets and the radical improvement in the digitization of healthcare records.

CorrMapper attempts the near impossible and addresses several of these problems at once:

1. CorrMapper provides an automatically generated metadata explorer which allows researchers to explore and stratify their patient cohort with an interactive dashboard that seamlessly integrates metadata with up to two omics datasets. See this demo.

2. Several cutting edge feature selection algorithms have been built into CorrMapper to allow researchers to focus their attention on only those features which have the most discriminatory power with respect to a metadata variable, for example cancer vs. control. This not only decreases the computational cost of subsequent analysis steps but also helps with the interpretation of the data.

3. The selected features are then used to estimate a sparse inverse covariance matrix using the Glasso algorithm in the huge R package. This is useful because under Gaussian assumptions a cell of this inverse covariance matrix is zero if and only if the row and column variables of that cell are conditionally independent given all other features in the matrix. In plain English, we can work out which variables are conditionally independent (having removed all the confounding effects of the others). This is hugely important because without it the complexity of biological systems would almost certainly guarantee that we see a lot of spurious and confounded correlations in our analysis.

4. Based on the inverse covariance matrix we can draw a network of related/correlated variables, see below. The edges of the network represent Spearman rank correlations for which p values are estimated using 10000 permutations. The p values are made more precise using a Generalized Pareto Distribution based method. Finally, the p values are corrected for multiple testing using a method selected by the user. The resulting heatmap and networks of correlations are then simultaneously visualised and interlinked, see this demo.

5. If the uploaded datasets have genomic features, CorrMapper can use a more appropriate genomic network visualisation, where the features are laid out in clockwise fashion along the genome of the given species, see this demo. Both network visualisation modules are highly interactive, enabling researchers to dig deep and explore the intricate correlations in their datasets.
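
To make the inverse covariance idea a bit more concrete, here is a small Python sketch. It is not CorrMapper's code and it does not run Glasso: it simply inverts a known covariance matrix exactly, whereas Glasso adds an L1 penalty so that a sparse inverse can be estimated from noisy data. The toy chain x -> y -> z (y = x + noise, z = y + noise, all noise with unit variance), the CHAIN_COV constant and both function names are assumptions made up for this illustration.

```python
import numpy as np


def marginal_correlations(cov):
    """Ordinary (marginal) correlation matrix derived from a covariance matrix."""
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)


def partial_correlations(cov):
    """Partial correlation of each pair given all other variables.

    Computed from the precision (inverse covariance) matrix P as
    -P[i, j] / sqrt(P[i, i] * P[j, j]). Exact inversion is used here,
    which is an assumption of this toy example and not the Glasso estimator.
    """
    precision = np.linalg.inv(cov)
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor


# Hypothetical chain x -> y -> z with unit-variance noise at each step:
# Var(x)=1, Var(y)=2, Var(z)=3, Cov(x,y)=1, Cov(x,z)=1, Cov(y,z)=2.
CHAIN_COV = np.array([
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 2.0],
    [1.0, 2.0, 3.0],
])

if __name__ == "__main__":
    print("marginal:\n", marginal_correlations(CHAIN_COV))
    print("partial:\n", partial_correlations(CHAIN_COV))
```

In this toy chain x and z are clearly correlated marginally (about 0.58), yet their partial correlation is zero up to floating point error, because z depends on x only through y. That x-z edge is exactly the kind of confounded correlation that would be dropped from the network.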
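
The permutation and multiple testing steps can be sketched in Python as well. Again this is only an illustration under stated assumptions and not CorrMapper's implementation: the function names are made up, the ranking helper assumes continuous measurements without ties, the Generalized Pareto refinement is left out entirely, and Benjamini-Hochberg is picked here merely as one common example of a correction method.

```python
import numpy as np


def rank(values):
    """Ranks 0..n-1 of a 1-D array; assumes there are no tied values."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return np.corrcoef(rank(x), rank(y))[0, 1]


def permutation_pvalue(x, y, n_perm=10000, seed=0):
    """Two-sided permutation p value for the Spearman correlation of x and y."""
    rng = np.random.default_rng(seed)
    rx, ry = rank(x), rank(y)
    observed = abs(np.corrcoef(rx, ry)[0, 1])
    hits = 0
    for _ in range(n_perm):
        # Shuffling one variable breaks any real association between the two.
        permuted = rng.permutation(ry)
        if abs(np.corrcoef(rx, permuted)[0, 1]) >= observed - 1e-12:
            hits += 1
    # Add-one estimate, so the p value can never be exactly zero.
    return (hits + 1) / (n_perm + 1)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted
```

With this add-one estimate the smallest p value the sketch can return is 1 / (n_perm + 1), so with 10000 permutations very strong correlations all end up with the same p value of roughly 0.0001. The sketch makes no attempt to resolve p values below that limit.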

Of course, the problems we outlined above are so enormous that they will easily provide countless sleepless nights and tiring years for thousands of future PhD students. Nonetheless, I hope that CorrMapper will prove to be a valuable (if small) step towards solving these puzzling problems.

Below is an outline of CorrMapper’s pipeline.

When will it be available?

I’m in the process of migrating this work from the AWS test server to our cluster at Imperial, and writing up the project in a paper. CorrMapper.com is live and working, but it’s running on a small instance of AWS so it often runs out of memory if your dataset is too big. So although you can (and I encourage you to) register and upload your datasets into CorrMapper, please don’t expect production level runtimes and smoothness just yet.

But in the meantime you can have a look at the poster from a recent conference if you’re interested in what’s going on behind the scenes, or check out these demos: