June 23, 2004. BMC Bioinformatics publishes “Mistaken Identifiers: Gene name errors can be introduced inadvertently when using Excel in bioinformatics”. We roll our eyes. Do people really do that? Is it really worthy of publication? However, we admit that if it happens then it’s good that people know about it.

October 17, 2012. A colleague on our internal Yammer network writes:



Sad but true. I keep finding newbie bioinformatics errors in the Cancer Genome Atlas project data. This time a text download of 450K methylation from the Cancer Genome Atlas project reveals that Excel has had its evil way with the data at some point. Gene names such as MAR1, DEC1, OCT4 and SEPT9 are now reformatted as dates. For example:

    barcode                       probe name  beta value        gene symbol  chromosome  position
    TCGA-06-0125-02A-11D-2004-05  cg13918206  0.92035091902012  1-Dec        9           118159781

I click through the TCGA data portal in search of more datasets and choose, more or less at random, another file containing data from the Illumina 450K platform. It’s called jhu-usc.edu__HumanMethylation450__TCGA-G4-6628-01A-11D-1837-05__methylation_analysis.txt. Let’s get that into R:

    tcga <- read.table("jhu-usc.edu__HumanMethylation450__TCGA-G4-6628-01A-11D-1837-05__methylation_analysis.txt", sep = "\t", header = TRUE)
    dim(tcga)
    # [1] 485577 6
    head(tcga$gene.symbol)
    # [1] 1-Dec 1-Dec 1-Dec 1-Dec 1-Dec 1-Dec
    # 25107 Levels: 10-Mar 11-Mar 11-Sep 13-Sep 14-Sep 1-Dec 1-Mar 1-Sep ... ZZZ3

Seems we have a problem. Let’s count up gene names:

    genes <- as.data.frame(table(tcga$gene.symbol))
    head(genes, 20)
    #      Var1   Freq
    # 1         119652
    # 2  10-Mar      5
    # 3  11-Mar     21
    # 4  11-Sep     32
    # 5  13-Sep     18
    # 6  14-Sep      4
    # 7   1-Dec      8
    # 8   1-Mar     30
    # 9   1-Sep     10
    # 10  3-Mar     33
    # 11  4-Mar     25
    # 12  5-Mar      4
    # 13  5-Sep     12
    # 14  6-Mar     21
    # 15  7-Mar     13
    # 16  8-Sep      2
    # 17  9-Mar      6
    # 18  9-Sep     16
    # 19   A1BG      6
    # 20   A1CF      5

Yes, we have a problem.
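
Eyeballing the top of a frequency table works here only because the mangled names happen to sort first. A rough sketch of a more direct check follows. It assumes the damage always takes the day-month form seen above ("1-Dec", "10-Mar"), and `is_date_like` is a hypothetical helper name, not something from the TCGA tooling or any package:

```r
# Hypothetical helper: flag values that look like Excel's "d-Mon" date rendering.
# The pattern is an assumption based on the values seen above, not a complete
# catalogue of everything Excel can do to a gene symbol.
is_date_like <- function(x) {
  months <- paste(month.abb, collapse = "|")        # "Jan|Feb|...|Dec" from base R
  pattern <- paste0("^[0-9]{1,2}-(", months, ")$")  # e.g. "1-Dec", "10-Mar"
  grepl(pattern, as.character(x))                   # as.character() handles factors
}

# Toy input standing in for tcga$gene.symbol
symbols <- c("1-Dec", "10-Mar", "9-Sep", "A1BG", "A1CF", "", "ZZZ3")
flagged <- is_date_like(symbols)

symbols[flagged]
# [1] "1-Dec"  "10-Mar" "9-Sep"
sum(flagged)
# [1] 3
```

On the real file, something like `table(is_date_like(tcga$gene.symbol))` would give a quick count of affected rows. Note that this only detects the problem; mapping a value such as "1-Mar" back to the original symbol is not safe to automate, since more than one gene name could have produced it.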

“Newbie bioinformaticians” are one thing. Large institutes, awarded millions of dollars to contribute to “big science” projects, are another.

Despair at the quality of public data, fears about reproducibility in science. Must be Monday.