The unravelling of the human genome and subsequent advances in genetic technology have had a profound effect on our understanding of the genetics of severe, rare diseases. The genetic map provided by the Human Genome Project (HGP) a decade ago allowed researchers to track down the specific mutations underlying these diseases far more quickly than they could have previously, and that process has been accelerated even further by recent advances in DNA sequencing technology. It took a full decade of painstaking work (between 1983 and 1993) to track down the gene responsible for Huntington's disease; with recently developed technology allowing simultaneous sequencing of all protein-coding genes in the genome (so-called exome sequencing), the same outcome can - for lucky researchers and patients - be achieved in a matter of weeks.

However, the current torrent of genomic data from disease patients - as with so many technology-driven revolutions - brings challenges as well as rewards. Looking at every variant in a patient's protein-coding genes can be an overwhelming experience; determining which of the thousands of potentially damaging variants in a patient's exome is the true disease-causing mutation still requires a combination of genetic epidemiology, population genetics, functional analysis and (often) more than a little luck.

Unfortunately, one simple and crucial question in performing this analysis - which of these variants have been seen before in a patient suffering from a similar disease? - can often be shockingly difficult to answer.
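To see why that question should be easy to answer - and isn't - here's a minimal sketch in Python of what the check would look like if a comprehensive, reference-mapped catalogue of known disease mutations were freely available. The file name, column layout and coordinates below are all hypothetical; the point is simply that with such a resource the question reduces to a handful of dictionary lookups.

```python
# A minimal sketch of the lookup a comprehensive, reference-mapped catalogue
# would make routine. The file name, column layout, catalogue contents and
# the example coordinates are all hypothetical.

import csv

def load_known_mutations(path):
    """Read a tab-delimited catalogue of reported disease mutations,
    keyed by their position on the reference genome."""
    known = {}
    with open(path) as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            key = (row["chrom"], int(row["pos"]), row["ref"], row["alt"])
            known[key] = row["phenotype"]
    return known

def seen_before(patient_variants, known):
    """Yield the patient's variants that already appear in the catalogue."""
    for chrom, pos, ref, alt in patient_variants:
        phenotype = known.get((chrom, pos, ref, alt))
        if phenotype is not None:
            yield chrom, pos, ref, alt, phenotype

if __name__ == "__main__":
    known = load_known_mutations("disease_mutations.tsv")  # hypothetical file
    patient = [("chr4", 1234567, "C", "T"),   # illustrative coordinates only
               ("chr7", 7654321, "G", "A")]
    for hit in seen_before(patient, known):
        print(*hit)
```

In practice, of course, the hard part is not the lookup but assembling and curating the catalogue itself - which is exactly where things currently fall down.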

Why is this so hard? Detailed information about human disease-causing variation has been painstakingly collected by clinicians around the world over decades; unfortunately, much of it is currently effectively locked away in "locus-specific databases" (LSDBs), small and often custom-built databases storing detailed information about mutations found in just one or a small number of genes. While LSDBs often represent the most accurate and comprehensive records of disease-associated mutations, the information they contain can be difficult or impossible to access for anyone outside the small disease community they serve.

That's not to say that centralised databases of human disease mutations don't exist; however, none of them are currently sufficiently accessible or comprehensive for high-throughput clinical genomics. The venerable Online Mendelian Inheritance in Man (OMIM) database, for instance, was started in print form by Victor McKusick in 1966 and now contains a formidable amount of textual information about human diseases and their genetic basis; while its online version remains the first port of call for many disease geneticists, it is far from comprehensive, and the current lack of systematic mapping of disease mutations to the human reference sequence makes it challenging to use for high-throughput analysis. Plenty of other competitors have sprung up since: the Human Gene Mutation Database is almost certainly the most useful, but requires a commercial subscription to access its most up-to-date content. Currently there is no open-access resource allowing researchers to easily compare their patient data with a comprehensive, up-to-date, readily accessible and mapped list of known disease mutations. That is, to put it mildly, a huge problem for researchers embarking on projects that will involve sequencing the entire exomes of hundreds or thousands of patients.

So, given the clear and pressing need for such a resource, why doesn't it exist yet? Part of the issue is logistical: combing both existing databases and the primary literature for mutations is seriously complicated by a lack of consistent nomenclature and formatting. Expert curation is essential, given that even mutations published as disease-causing frequently turn out on closer inspection to be innocent genetic polymorphisms (which is why, for instance, geneticist James Lupski found that his genome carries two copies each of five different variants reported to cause serious diseases he doesn't actually have). However, this process is expensive and time-consuming.
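To make the nomenclature problem concrete, here's a toy sketch in Python (with an invented transcript accession and coordinates) of how the same substitution can be reported under several different labels. Naive string matching treats them as four separate mutations, and even a crude normalisation step only reconciles two of them; collapsing the rest onto the reference sequence is exactly the kind of work that demands expert curation.

```python
# A toy illustration of the nomenclature problem. The transcript accession,
# coordinates and amino-acid change are invented; real reports are messier.

reports = [
    "NM_000000.1:c.76A>T",   # coding-DNA description with a transcript prefix
    "c.76A>T",               # the same change, transcript omitted
    "chr1:g.1000076A>T",     # the same change in genomic coordinates
    "p.Lys26Ter",            # only the protein-level consequence reported
]

def crude_key(description):
    """Strip any prefix before the last colon - a very rough normalisation.
    Proper reconciliation needs HGVS parsing and mapping to the reference."""
    return description.split(":")[-1]

print(len(set(reports)), "distinct strings as reported")               # 4
print(len({crude_key(r) for r in reports}), "after crude stripping")   # still 3
```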

And logistics are not the only obstacle; politics also rears its ugly head. LSDBs represent, to their creators, massive investments of time and resources, so there's (somewhat understandable) resistance to the idea of having that valuable information slurped up into a giant central resource without their receiving sufficient credit. The issue of patient privacy is also frequently raised as a generic objection to open data sharing - although this becomes much harder to justify as a reason for withholding data when all that is required is a list of validated mutations and some basic, non-identifiable clinical data.

Despite these challenges, multiple large efforts have now coalesced with the goal of building centralised databases of human variation, and the last month has seen two major announcements in this area. Firstly, the Chinese government announced that it was committing a staggering US$300 million over the next decade to the Human Variome Project (HVP), an international consortium seeking to catalogue all disease-causing variation - although even that substantial investment will apparently cover just 25% of the Project's budget. Secondly, this week saw the announcement in Nature Biotechnology of MutaDATABASE, "a new, freely available, online database developed to contain standardized information on each human disease gene that not only will list all DNA variations identified in that gene but also aims to combine that data with clinical information for the individuals carrying the DNA variation".

This sounds promising; so, when can we start taking advantage of these resources? Not quite yet, unfortunately. The HVP's website provides an extensive list of publications but no obvious way to access the data it has collected so far, and at the time this post went live MutaDATABASE's front page carried this rather depressing table:

[Screenshot: the summary table from MutaDATABASE's front page.]

(That led, incidentally, to some rather snarky comments from the informatics community on Twitter; but to be fair, the Nature Biotechnology article announces the creation of the database, not its population with data.)

Added in edit: I somehow missed Neil Saunders' blog post on MutaDATABASE, which makes some critical but fair points about publishing a resource before it's functional; go check it out, along with the subsequent comment thread.

The creation of multiple independent databases of disease variation will no doubt strike some as potentially wasteful, and the announcement of MutaDATABASE drew a carefully worded missive from the convener of the HVP (thanks to Nick Loman for the pointer):

In the past two months we have had the opportunity to speak with Patrick Willems of GENDIA several times regarding the proposed mutaDATABASE project. We commend Patrick for the energy and vigour with which he is approaching the challenges that the Human Variome Project was created to address. The mutaDATABASE project is an ambitious undertaking that would eventually see a locus specific database in operation for every human gene. Obviously, this is a goal that is shared by the Human Variome Project. However, there exists a substantial amount of overlap between aspects of the mutaDATABASE project and several other initiatives being run or facilitated by the Human Variome Project, the Human Genome Variation Society and GEN2PHEN. As one of the core values of the Human Variome Project is “efficiency” we strongly urge all of these overlapping initiatives to combine their efforts to minimise wasteful duplication of effort. The upcoming Human Variome Project meeting (http://www.humanvariomeproject.org/meetings/paris/) would be an ideal opportunity for a substantial discussion on these issues. The Human Variome Project strongly supports all efforts to reduce the amount and severity of the burden of genetic disease on Human Society and actively encourages all projects designed with this objective in mind. It is only through working together, as a consortium, as a discipline, and as a planet, that we will accomplish our goals and alleviate some of the worst of human suffering.

Is it inefficient to build multiple independent resources for the same purpose? Undoubtedly, but I'd argue that it's also a welcome opportunity to introduce some genuine competition into a field that is moving far too slowly. Both the HVP and MutaDATABASE - and OMIM, and HGMD, and any other competing resource - will need to work hard to demonstrate their utility quickly, lest they find themselves rendered irrelevant by more nimble and user-friendly alternatives. And as geneticists begin to sink beneath a massive and advancing tide of clinical sequencing data, such a push could not be more timely.

Bale, S., Devisscher, M., Van Criekinge, W., Rehm, H., Decouttere, F., Nussbaum, R., den Dunnen, J., & Willems, P. (2011). MutaDATABASE: a centralized and standardized DNA variation database. Nature Biotechnology, 29(2), 117-118. DOI: 10.1038/nbt.1772