Empty Pipes

The 20 Most Studied Genes 08 Dec 2014

genbank | bioinformatics



The National Center for Biotechnology Information (NCBI) maintains an enormous amount of biological data and provides it all to the public at no cost as a collection of databases. One of the most popular is GenBank, which contains information about annotated genes. Consider the gene p53, which encodes a tumor suppressor protein, the absence of which allows many cancers to proliferate. By looking at its entry in GenBank, we can immediately find out its full name (tumor protein p53), which organism this entry corresponds to (Human), its aliases (BCC7, LFS1, TRP53), a short description, and a whole host of other technical information.

Among the information provided with each entry is a section listing the papers that have referenced this gene. In a sense, each reference is a paper that has contributed some bit of knowledge about the function of this piece of DNA (or RNA). This got me wondering: which are the most studied genes? Which genes have made an appearance in the most published papers?

To answer this, I downloaded the table that contains the reference information from GenBank, performed some rudimentary analysis, and generated the following chart of the top 20 most popular genes, as measured by the number of times they have been cited:

The graph above shows the number of references in PubMed to a particular gene in GenBank. The color of each bar indicates the organism in which the gene is found. It was made using d3.js, and the script for generating it can be found here (github.com), while the data itself is located here (github.com).
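The counting step behind the chart can be sketched in a few lines of Python. This is not the script linked above, just a minimal illustration under some assumptions: the downloaded table is assumed to be tab-separated with one (taxonomy ID, gene ID, PubMed ID) triple per row and comment lines starting with `#`. The names `read_rows` and `top_genes`, and the IDs in the toy sample, are made up for the example.

```python
import csv
from collections import defaultdict


def read_rows(path):
    """Yield (tax_id, gene_id, pubmed_id) tuples from a tab-separated file.

    Assumed layout: three leading columns per row, '#' marks a header/comment.
    """
    with open(path, newline="") as handle:
        for record in csv.reader(handle, delimiter="\t"):
            if not record or record[0].startswith("#"):
                continue  # skip blank lines and the header
            yield tuple(record[:3])


def top_genes(rows, n=20):
    """Return the n genes with the most distinct papers, most cited first."""
    papers = defaultdict(set)
    for tax_id, gene_id, pubmed_id in rows:
        # A set ignores duplicate rows for the same gene/paper pair.
        papers[(tax_id, gene_id)].add(pubmed_id)
    # Sort by descending paper count; ties are broken by the (tax_id, gene_id) key.
    ranked = sorted(papers.items(), key=lambda item: (-len(item[1]), item[0]))
    return [(tax, gene, len(pmids)) for (tax, gene), pmids in ranked[:n]]


# Toy rows with placeholder IDs (not real gene or PubMed identifiers).
sample = [
    ("1", "111", "p1"),
    ("1", "111", "p2"),
    ("1", "111", "p2"),  # duplicate row, counted once
    ("2", "222", "p1"),
    ("1", "333", "p3"),
    ("1", "333", "p4"),
    ("1", "333", "p5"),
]

for tax, gene, count in top_genes(sample, n=2):
    print(tax, gene, count)
```

For a real download, `top_genes(read_rows("path/to/table.tsv"))` would produce the ranking; the path here is a placeholder. Keying on the pair of taxonomy ID and gene ID keeps the organism attached to each gene, which is what the bar colors in the chart need.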

The genes on the list can be broadly placed into 6 categories:

Immediately evident is the overrepresentation of disease-related genes. Fifteen of the 20 genes are heavily involved in some human disease. The remaining entries are either regulatory (UBC, ACE and ESR1), historic (w) or simply useful (Gt(ROSA)26Sor). The majority come from human, followed by mouse (used to express genes also found in humans: Tnf and Trp53), and finally HIV and Drosophila. This is something of a reflection of where our interests and funding lie. The two most studied genes are involved in cancer, research into which is both well-funded and heavily reliant on genetic analysis. Four on the list are associated with the immune system (IL6, IL10, NFKB1, and HLA-DRB1), two (APOE and ACE) are associated with heart disease and one with HIV. We focus the majority of our attention on the things which are likely to kill us.

Conspicuously absent from the list are any genes from plants or genes involved in metabolism. Important processes such as differentiation, DNA replication and protein synthesis are all absent. That’s not to say that they are not studied; they just receive less attention than processes involved in our demise. Then again, the age of molecular genetics began only in the last century or so. Perhaps our interests will shift in the future as we find cures and treatments for existing maladies and start having to deal with others such as a changing climate, energy crises and an aging population. Biology may hold partial solutions to these problems, and the proportion of effort we put into finding processes to remove carbon dioxide from the air, to produce fuels from biomatter, or to limit or reverse aging may grow to eclipse that put into research on the current top-20 genes.