Abstract

Background
Genetic studies are challenging in many complex diseases, particularly those with limited diagnostic certainty or low prevalence, or those of old age. As a result, genes may be reported as disease-causing with varying levels of evidence, and in some cases the data may be so limited as to be indistinguishable from chance findings. When there are large numbers of such genes, an objective method for ranking the evidence is useful. Using the complex neurodegenerative disease amyotrophic lateral sclerosis (ALS) as a model, and the disease-specific database ALSoD, our objective was to develop a method that uses publicly available data to generate a credibility score for putative disease-causing genes.

Methods
Genes with at least one publication suggesting involvement in adult onset familial ALS were collated following an exhaustive literature search. SQL was used to generate a score from information extracted from the publications, combined with a pathogenicity analysis using bioinformatics tools. The resulting score allowed us to rank genes in order of credibility. To validate the method, we compared the objective ranking with a ranking generated by ALS genetics experts, using Spearman's Rho to compare the rankings generated by the different methods.

Results
The automated method ranked ALS genes in the following order: SOD1, TARDBP, FUS, ANG, SPG11, NEFH, OPTN, ALS2, SETX, FIG4, VAPB, DCTN1, TAF15, VCP, DAO. This compared very well to the ranking of ALS genetics experts, with Spearman's Rho of 0.69 (P = 0.009).

Conclusion
We have presented an automated method for scoring the level of evidence for a gene being disease-causing. In developing the method we have used the model disease ALS, but it could equally be applied to any disease in which there is genotypic uncertainty.

Citation: Abel O, Powell JF, Andersen PM, Al-Chalabi A (2013) Credibility Analysis of Putative Disease-Causing Genes Using Bioinformatics. PLoS ONE 8(6): e64899. https://doi.org/10.1371/journal.pone.0064899
Editor: Bart Dermaut, Pasteur Institute of Lille, France
Received: January 24, 2013; Accepted: April 19, 2013; Published: June 5, 2013
Copyright: © 2013 Abel et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The authors are especially grateful for the long-standing and continued funding of this project from the ALS Association and the MND Association of Great Britain and Northern Ireland. They also thank ALS Canada, MNDA Iceland and the ALS Therapy Alliance for support. The research leading to these results has received funding from the European Community's Health Seventh Framework Programme FP7/2007–2013 under grant agreement number 259867. AA-C receives salary support from the National Institute for Health Research (NIHR) Dementia Biomedical Research Unit at South London and Maudsley NHS Foundation Trust and King's College London. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. Aleks Radunovic, Nigel Leigh, and Ian Gowrie originally conceived ALSoD. ALSoD is a joint project of the World Federation of Neurology (WFN) and European Network for the Cure ALS (ENCALS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: AA-C is a consultant for Biogen Idec and Cytokinetics. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.

Introduction

Genetic studies are challenging in many complex diseases, particularly those with limited diagnostic certainty, low incidence and prevalence, or those of old age. Association studies suffer a reduction in power when there is phenotypic heterogeneity resulting from difficulty with diagnosis, and linkage studies are limited because the older generations are not available and the younger generations have not yet reached the age of risk. The result is that genes are reported as causative with varying levels of evidence and it can be difficult for those not in the field to assess how credible any genetic evidence is.

One such condition is amyotrophic lateral sclerosis (ALS). This is an adult onset neurodegenerative syndrome of upper and lower motor neuron degeneration, with a mean age of onset of 56 years in diagnosed familial cases (FALS) and 60 to 70 years in apparently sporadic cases, and an average survival of 3 to 5 years from symptom onset [1], [2]. Illustrating the complexity and difficulty in performing genetic research on ALS, the reported frequency of familial ALS varies from 0.8% [3] to 17–18% [4], although all studies agree that most cases are apparently sporadic [5]. There is, however, a genetic basis both to familial and apparently sporadic ALS [6], [7], [8]. All genes reported mutated in familial ALS have also been found mutated in sporadic ALS. Because of the late age of onset and poor prognosis, suitable families are difficult to collect for linkage, and large populations are difficult to collect for association.

The first gene identified for familial ALS was SOD1 [9], [10]. Through linkage and association studies of SNPs, microsatellites and copy number variants, as well as through direct sequencing of candidate genes and whole exome sequencing using high throughput methods, over 100 genes have now been implicated in the cause of ALS [11].
The level of supporting evidence for each gene or gene variant varies from small to overwhelming, and is in some cases contradictory. Furthermore, the increasing cooperation between ALS researchers internationally, and the understanding that large datasets are needed, coupled with advances in technology, mean that the rate of detection of putative new ALS genes is rapid and increasing. This leads to two immediate problems: first, it is difficult to keep up with what is an “accepted” ALS gene, and second, there is no simple, objective way to define the list of ALS-causing genes. As a result, researchers may find themselves unable to agree on whether any one gene is an ALS gene or not. The situation is further compounded by the loose definition of ALS, which for genetic purposes has a far wider phenotypic definition than most ALS researchers would accept in a clinical setting [12]. For example, ALS2 includes an infantile, slowly progressive upper motor neuron syndrome that is most similar to hereditary spastic paraparesis, rather than an adult onset mixed upper and lower motor neuron syndrome with a poor prognosis for survival. Similarly, ALS with frontotemporal dementia is regarded as a slightly different entity from ALS even though frontotemporal dementia and ALS are in at least some cases a continuum of disease, and in many cases ALS genes and frontotemporal dementia genes are the same as genes for ALS with frontotemporal dementia.

One solution to this problem is to design a method for objectively scoring the level of evidence supporting a gene or gene variant as disease causing. This would have the advantage that the phenotype could be defined by the user, allowing a loose or a more stringent definition as required. The ALSoD database stores data on putative ALS genes using information derived from publications and directly input by researchers. We have therefore explored the possibility of using these data to generate a credibility score for ALS genes, with the aim of producing a system that can be generalized to other similar conditions.

Results

For the pathogenicity prediction, using a threshold score >1 (that is, where the combination score is 2 or 3) to define pathogenicity, just 110 mutations out of 425 were identified as pathogenic, with particularly poor predictions for FUS and TARDBP when compared with biological evidence of pathogenicity. Using a threshold score of >0 (that is, where the combination score is 1, 2 or 3) to define pathogenicity brought the number of pathogenic mutations to 198 of 425 (47%), suggesting that just under half of recorded FALS mutations are pathogenic based on bioinformatics predictions.

There were 14 genes that fulfilled the inclusion criteria for generation of a credibility score at the time of the survey, and had sufficient data manually curated from publications as explained in the data extraction process above. These were ALS2, FUS, DAO, VCP, VAPB, ANG, DCTN1, FIG4, SETX, SOD1, TARDBP, SPG11, NEFH, and OPTN. Using the full set of 11 procedures, the automated method ranked these as ALS-causing genes in the following order: SOD1, TARDBP, FUS, ANG, SPG11, NEFH, OPTN, ALS2, SETX, FIG4, VAPB, DCTN1, TAF15, VCP, DAO.

Subsets of the 11 procedures may be defined by the user if needed. This allows flexibility in which evidence is regarded as useful. For example, in Figure 3, using the number of mutations reported in a single gene and the number predicted to be pathogenic as the test criteria ranks the genes in the following order: SOD1, TARDBP, FUS, ANG, OPTN, SETX, ALS2, SPG11, FIG4, DCTN1, VAPB, VCP, DAO. The output shows that the first six genes, SOD1, TARDBP, FUS, ANG, OPTN and SETX, have a total of 121, 17, 19, 12, 5 and 4 pathogenic mutations respectively. It also shows, for example, that the I113T, D90A and A4V pathogenic mutations of the SOD1 gene were replicated in 17, 14 and 12 studies respectively, and that there are 6 different mutations in codon 93 of SOD1 and 5 different mutations in codon 521 of FUS.
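The effect of the two pathogenicity thresholds can be illustrated with a minimal Python sketch. It assumes that the combination score is simply the number of prediction tools (out of three) that call a mutation pathogenic, which is consistent with the 0 to 3 range implied above but is not a description of the actual implementation. The mutation labels and tool calls are hypothetical.

```python
# Hypothetical tool calls for five unnamed mutations (True = "pathogenic").
# Assumption: three prediction tools, so the combination score runs from 0 to 3.
TOOL_CALLS = {
    "mutation_1": (True, True, True),     # score 3
    "mutation_2": (True, True, False),    # score 2
    "mutation_3": (False, True, False),   # score 1
    "mutation_4": (False, False, False),  # score 0
    "mutation_5": (True, False, False),   # score 1
}


def combination_score(calls):
    """Number of tools that call the mutation pathogenic."""
    return sum(1 for call in calls if call)


def count_pathogenic(tool_calls, threshold):
    """Count mutations whose combination score is strictly above the threshold."""
    return sum(
        1 for calls in tool_calls.values() if combination_score(calls) > threshold
    )


strict = count_pathogenic(TOOL_CALLS, threshold=1)   # score 2 or 3
lenient = count_pathogenic(TOOL_CALLS, threshold=0)  # score 1, 2 or 3
print(f"threshold >1: {strict} of {len(TOOL_CALLS)} mutations")
print(f"threshold >0: {lenient} of {len(TOOL_CALLS)} mutations")
```

Lowering the threshold can only keep or increase the number of mutations classed as pathogenic, which is the direction of change reported above (110 to 198 of 425).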
Other displayed information includes the number of countries in which gene mutations have been reported. For example, SOD1 mutations have been reported in 34 countries, with representation from every continent of the world, while TARDBP, ALS2, ANG, FUS, SETX and NEFH mutations have been reported in 13, 9, 7, 7, 6 and 5 unique countries respectively. Genes such as FIG4, DPP6, DCTN1, UBQLN2 and TAF15, each recorded in only one country, have the lowest ranks.

Of 25 ALS genetics experts, selected on the basis of having published at least one paper on ALS genetics, 8 responded. Comparison of the full automated method with the ALS genetics experts' rankings gave a Spearman's Rho of 0.69 (P = 0.009) for the forced expert rankings, and 0.57 (P = 0.042) for the unforced rankings, indicating a good correlation between the methods.
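For readers unfamiliar with the statistic, the following Python sketch shows one way Spearman's Rho can be computed for two rankings, including the tied ranks that arise when experts are allowed to give equal ranks. It is a generic illustration, not the authors' analysis code: the rank vectors are hypothetical, and the P value is not computed.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest), giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's Rho: the Pearson correlation of the rank-transformed data."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (sd_x * sd_y)


# Hypothetical ranks for five genes (not the survey data).
automated = [1, 2, 3, 4, 5]
forced_expert = [1, 3, 2, 4, 5]    # every gene gets a distinct rank
unforced_expert = [1, 2, 2, 4, 4]  # equal ranks allowed

print(spearman_rho(automated, forced_expert))
print(spearman_rho(automated, unforced_expert))
```

Averaging the ranks of tied values is the conventional treatment of ties; it is an assumption here that the unforced expert rankings were handled in this way.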

Discussion

We have presented an automated method for using published information to score the level of evidence supporting a causative relationship between gene mutation and a disease. The information on which the credibility analysis is based is collected routinely by locus-specific databases and the method can therefore be generalized to other diseases. The method used has been applied to amyotrophic lateral sclerosis but could equally be applied to any disease in which there is phenotypic and genotypic heterogeneity.

A strength of this method is that multiple lines of evidence are used to generate an objective opinion as to the credibility of a gene as a disease gene, and while publication bias will affect the score, this is minimized by several factors. First, in this study unpublished data are used since the database includes directly input information from researchers who have not published. Second, a major part of the score is generated using theoretical models of pathogenicity. Third, once published, any information remains useable, and not prone to the vagaries of scientific fashion, or the bias of individual opinion leaders.

The effects of these components on the score can be seen by comparing the automated ranking and the ranking generated by both groups of ALS genetics experts. In general the rankings were in agreement. For example, with one exception, the top five genes were the same for all three methods. For some genes there were strikingly different ranks. ANG was ranked 9 of 13 by the experts who could give equal ranks, but in the top five for the other two methods. The biggest discrepancies were otherwise for ALS2, NEFH, and VAPB, each of which was ranked in the bottom two for one of the methods and in the middle for the other two methods.

Similar approaches have been used in association studies.
In previous work, three criteria were used to determine how credible a disease gene might be: the amount of evidence (manifest as the number of studies and the population size studied), the replicability of a result, and protection from bias by good study design [24]. We have tried to follow similar principles in generating this credibility score.

A weakness of this method is that it relies on an agreed set of criteria for analysis to generate the score, but there is no way to decide objectively whether the criteria are reasonable or what their relative weights should be. For example, we have not included pathogenicity demonstrated in animal models in the score but others might regard this as a vital component. Although we have tried to build in flexibility so that researchers can include or exclude certain criteria, unless the available criteria are exhaustive there will always be the possibility that the method is incomplete. Similarly, because the criteria can be user-selected, there can be no truly universal measure of credibility using this system.

Since this tool was developed, pathological expansion in the C9orf72 gene has been identified as a cause of ALS and frontotemporal dementia [25], [26]. At the time of our survey of experts this was not the case and it has therefore been excluded from the analysis presented.

A major advantage of this tool is its automation: the rank of a gene changes as the evidence recorded in the database changes. This system could be applied to other complex diseases where multiple genes are responsible for a phenotype.
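The dependence of the ranking on user-selected criteria and their weights can be illustrated with a small Python sketch. The counts are those reported in the Results for five genes (pathogenic mutations and number of countries), but the scoring rule, an optionally weighted sum of raw counts, is purely an assumption for illustration and is not the scoring used by the 11 procedures; the function and field names are likewise hypothetical.

```python
# Counts taken from the Results; only genes with both values reported are included.
EVIDENCE = {
    "SOD1":   {"pathogenic_mutations": 121, "countries": 34},
    "TARDBP": {"pathogenic_mutations": 17,  "countries": 13},
    "FUS":    {"pathogenic_mutations": 19,  "countries": 7},
    "ANG":    {"pathogenic_mutations": 12,  "countries": 7},
    "SETX":   {"pathogenic_mutations": 4,   "countries": 6},
}


def rank_genes(evidence, criteria, weights=None):
    """Rank genes by a weighted sum of the selected criteria (highest first).

    Assumption: a plain weighted sum of raw counts; weights default to 1.
    """
    weights = weights or {}

    def score(gene):
        return sum(weights.get(c, 1.0) * evidence[gene][c] for c in criteria)

    return sorted(evidence, key=score, reverse=True)


# One criterion only: FUS (19) ranks above TARDBP (17).
print(rank_genes(EVIDENCE, ["pathogenic_mutations"]))

# Adding a second criterion swaps TARDBP (17 + 13) above FUS (19 + 7).
print(rank_genes(EVIDENCE, ["pathogenic_mutations", "countries"]))
```

Even in this toy setting the order of TARDBP and FUS depends on which criteria are included, which is the sense in which no single user-independent ranking exists.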

Author Contributions Conceived and designed the experiments: OA JFP PMA AA-C. Wrote the paper: OA JFP PMA AA-C. Advised on criteria used and literature review: JFP PMA AA-C. Contributed genetic data: PMA AA-C. Survey and Statistical Analysis of data: OA AA-C.