Whole exome and whole genome sequencing have both become widely adopted methods for investigating and diagnosing human Mendelian disorders. As pangenomic agnostic tests, they are capable of more accurate and agile diagnosis than traditional sequencing methods. This article describes new software called Mendel,MD, which combines multiple types of filter options and makes use of regularly updated databases to facilitate exome and genome annotation, the filtering process and the selection of candidate genes and variants for experimental validation and possible diagnosis. This tool offers a user-friendly interface and leads clinicians through simple steps that limit the number of candidates on the way to a final diagnosis of a medical genetics case. A useful innovation is the “1-click” method, which lists all the relevant variants in genes present in OMIM for perusal by clinicians. Mendel,MD was experimentally validated using clinical cases from the literature and was tested by students at the Universidade Federal de Minas Gerais, at GENE–Núcleo de Genética Médica in Brazil and at the Children’s University Hospital in Dublin, Ireland. We show in this article how it can simplify and speed up the identification of the culprit mutation in each of the clinical cases that were received for further investigation. Mendel,MD proved to be a reliable, open-source and time-efficient web-based tool for identifying the culprit mutation in different clinical cases of patients with Mendelian Disorders. It is also freely accessible for academic users at the following URL: https://mendelmd.org.

Funding: This work was made possible by a research grant from the Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG). RGCCLC was supported by a graduate fellowship from Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), NDL was supported by a graduate fellowship from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPQ), RLF was supported by a graduate fellowship from Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPQ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2017 G. C. C. L. Cardenas et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The goal of Mendel,MD is not to provide a single candidate gene, but rather a limited list of good candidates that can always be manually investigated by researchers and doctors using their research and clinical skills. One innovative strategy we developed is a “1-Click” automatic search that uses a minimal preset of filter options and thresholds to produce a list of candidate variants in genes included in the Online Mendelian Inheritance in Man (OMIM) [12] and the Clinical Genomic Database (CGD) [13]. Users can also add extra filters for different modes of inheritance, chromosomal positions, variant effects, functional classes, variant frequencies and pathogenicity scores, among other options.

Mendel,MD uploads a VCF file, annotates it, inserts it into a database and finally filters it. For this process, it uses a simple web interface that can be freely accessed from any computer, tablet or smartphone with an Internet browser.

Currently, a few commercial tools attempt to address this problem, such as Variant Analysis from Ingenuity [4], VarSeq from Golden Helix [5] and Sequence Miner from Wuxi NextCode [6]. There are also a few open-source tools, such as GEMINI [7], seqr [8], VCF-Miner [9], BiERapp [10] and BrowseVCF [11], that aim to provide a graphical user interface to simplify the analysis of a patient’s genetic information. In Table 1 we provide a feature grid comparing Mendel,MD with the other available tools.

Whole exome sequencing (WES) and whole genome sequencing (WGS) have revolutionized clinical genetics through the discovery of new genes, the characterization of new genetic diseases, and the description of new phenotypic features in previously known disorders [1–3]. The efficiency of WES and WGS in unraveling Mendelian Disorders originates from the collective characterization of genes in a pangenomic, agnostic, non-targeted fashion. Variants that are present in all expressed human genes are analyzed in parallel, using multiple filter options while searching for the “culprit” variant in each clinical case. Such a process depends on software that ideally should be easy to use by clinicians, who sometimes have limited knowledge of computing. Thus, in the best of all possible worlds, computer tools for genomic analysis should be simple, intuitive and user-friendly.

Design and Implementation

Mendel,MD was developed to be compatible with Python 2.7 and 3.x. We developed the web interface using the Django web framework [14]. We combined different methods, tools and sources of information to generate, at the end of the process, a fully annotated VCF file [15] with all the information needed to select good candidate variants and genes that could be responsible for the disease in each clinical case.

This data is inserted into a PostgreSQL database in order to facilitate the filtering of each patient’s variants through a web browser (see an example of this annotated VCF file in S1 Data).

The first thing we developed was the upload system, using a JavaScript library called JQuery File-Upload [16], which allows a user to simply drag-and-drop VCF files from their desktop into the browser or to select multiple VCF files and upload them all at once to Mendel,MD. The current system accepts the following formats for upload: .VCF, .VCF.GZ, .VCF.ZIP and .VCF.RAR. In Fig 1 we present the web interface of the upload system.


Fig 1. Upload System. This figure shows the interface for submitting VCF files to the system using a library called JQuery File-Upload. As soon as the user selects the VCF files from their computer, the upload starts automatically, and an estimated time remaining is displayed and updated constantly until the upload completes. https://doi.org/10.1371/journal.pcbi.1005520.g001
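As an illustration of how the accepted upload formats might be read on the server side, the following minimal Python sketch yields the lines of a plain, gzip-compressed or zip-compressed VCF using only the standard library. The function name `iter_vcf_lines` is hypothetical and is not taken from the Mendel,MD code base; RAR archives are deliberately left unhandled here because the standard library cannot read them.

```python
import gzip
import io
import zipfile


def iter_vcf_lines(path):
    """Yield text lines from a .vcf, .vcf.gz or .vcf.zip file (hypothetical helper)."""
    lower = path.lower()
    if lower.endswith(".vcf"):
        with open(path, "rt") as handle:
            yield from handle
    elif lower.endswith(".vcf.gz"):
        with gzip.open(path, "rt") as handle:
            yield from handle
    elif lower.endswith(".vcf.zip"):
        with zipfile.ZipFile(path) as archive:
            # Assumption: the archive holds exactly one VCF file.
            members = [n for n in archive.namelist() if n.lower().endswith(".vcf")]
            if len(members) != 1:
                raise ValueError("expected exactly one .vcf file in the archive")
            with archive.open(members[0]) as raw:
                yield from io.TextIOWrapper(raw, encoding="utf-8")
    elif lower.endswith(".vcf.rar"):
        # RAR extraction needs a third-party tool, so it is out of scope here.
        raise NotImplementedError("RAR archives are not handled in this sketch")
    else:
        raise ValueError("unsupported file extension: %s" % path)
```

Because the function is a generator, unsupported or malformed inputs raise only when iteration starts, for example inside `list(iter_vcf_lines(path))`.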

Diseases

In order to aggregate information about Mendelian Disorders into our database we used two main sources of information: the Online Mendelian Inheritance in Man (OMIM) [12] and the Clinical Genomic Database (CGD) [13]. The list of genes is always compared live for each filter analysis search to allow, for example, the investigation of variants only in genes previously known to be associated with Mendelian Disorders. In the “Disease” section of Mendel,MD, it is possible to search for diseases by their names or by the gene symbols associated with them (e.g. ‘Mitochondrial depletion syndrome 5’ or ‘SUCLA2’) and quickly retrieve a list of genes and diseases associated with every term. From the results of this search, it is possible to select a list of genes and search for variants only in the selected genes, screening all the individuals present in our database.

Genes

We added to the database the official list of gene symbols and descriptions from the HUGO Gene Nomenclature Committee (HGNC) website, which currently has 19,006 protein-coding genes. In the “Genes” section of our tool it is possible to search for gene symbols and gene names (e.g. ASS1P1 or argininosuccinate) and select from the list of genes to visualize variants in all the individuals present in our database.

Annotation framework

We used a distributed task queue system called Celery [17] to annotate multiple VCFs in parallel. This tool makes it possible to scale the annotation of VCF files across a cluster of computers, or to run it faster on larger machines. We used 4 queues to annotate VCFs, parse the results and insert the final results into our database. In Fig 2, we present the annotation framework that we called pynnotator [18], which was developed together with Mendel,MD. Next we describe in more detail how this annotation framework works.


Fig 2. Pynnotator annotation framework. This figure shows the processes we perform on each VCF file after a user uploads it to the system. First we perform a vcf-validation and a sanity check to make sure the file is ready to be annotated. After that, we send each checked VCF file to be annotated in parallel by SNPEFF, SNPSIFT, VEP, Decipher, and two other Python scripts we developed in-house. The first one is capable of using other VCF files to annotate a VCF file, so we use the VCFs from 1000Genomes, dbSNP and the NHLBI GO Exome Sequencing Project to annotate a VCF; the second Python script uses data from dbNSFP to add functional annotation and prediction scores to our annotation. After merging the results of all tasks we compress the file and prepare it to be inserted into the database. https://doi.org/10.1371/journal.pcbi.1005520.g002

After a user submits a VCF file, the first step our framework performs is the validation of each file using a method called “vcf-validator” from VCFtools [15]. After this validation, we execute a Python script called “sanity-check” to prepare the VCF to be annotated by Mendel,MD. This script searches for and removes lines of the VCF file that contain the genotype “0/0”, removes the “chr” prefix from the beginning of each chromosome name, sorts all the variants of the VCF by chromosome name and position, and finally removes the EFF tag of any prior annotation done with SNPEFF. Another tool that provides similar functionality is VCFAnno [19].

After validating and checking each file, we make use of the “threading” module of Python to execute the following tools in parallel: SNPEFF [20] and SNPSIFT [21], Variant Effect Predictor (VEP) [22] and “vcf-annotate” from VCFtools [23].
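The four clean-up operations attributed to the “sanity-check” script can be sketched in a few lines of Python. This is an illustrative reconstruction and not the pynnotator source: the function names are hypothetical, the genotype is assumed to be the first subfield of each sample column, and the chromosome ordering (numeric first, then X, Y, MT) is one reasonable choice among several.

```python
def chrom_sort_key(chrom):
    """Order numeric chromosomes first, then X, Y, M/MT, then anything else."""
    special = {"X": 23, "Y": 24, "M": 25, "MT": 25}
    if chrom.isdigit():
        return (int(chrom), "")
    return (special.get(chrom, 99), chrom)


def sanity_check(lines):
    """Return cleaned VCF lines: header lines first, then sorted records."""
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        fields = line.split("\t")
        # 1. Drop records carrying a homozygous-reference genotype (0/0).
        #    Assumption: GT is the first subfield of each sample column.
        if any(sample.split(":")[0] == "0/0" for sample in fields[9:]):
            continue
        # 2. Strip the "chr" prefix from the chromosome name.
        if fields[0].startswith("chr"):
            fields[0] = fields[0][3:]
        # 3. Remove any EFF tag left in INFO by an earlier SNPEFF run.
        info = [item for item in fields[7].split(";") if not item.startswith("EFF=")]
        fields[7] = ";".join(info) if info else "."
        records.append(fields)
    # 4. Sort records by chromosome name and position.
    records.sort(key=lambda f: (chrom_sort_key(f[0]), int(f[1])))
    return header + ["\t".join(f) for f in records]
```

The sketch keeps the whole file in memory, which is adequate for an exome-sized VCF but would need streaming or an external sort for very large inputs.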
Following this, we use a Python script called “vcf-annotator.py”, an important step of our annotation because it provides a generic way to annotate any VCF file using multiple VCF files as references. This script itself also uses multiple threads in order to make this part of the annotation more efficient. We use the following projects and databases as references for the annotation task: 1000 Genomes Project [24], dbSNP and ClinVar [25], Exome Sequencing Project (ESP) [26] and dbNSFP [27]. These files were downloaded and stored in the BGZIP format and indexed using tabix [28], which helped reduce the amount of space required to perform our annotation (30 GB) while keeping the files indexed and enabling fast information retrieval based on genome coordinates. The library pysam [29] was used for interfacing with tabix to access the required information. Finally, we used two VCF files with information from the public HGMD mutations (downloaded from Ensembl) and the Haploinsufficiency Index of some genes as calculated by Huang et al [30].

At the end of our annotation process, we merge the output of all the tools into a final VCF file containing hundreds of annotated fields added to the INFO column of every line that was present in the original file. This file contains annotations for various pathogenicity scores such as SIFT [31], PolyPhen-2 [32], VEST [33] and CADD [34]. These scores are important for evaluating the pathogenicity of each variant and can help select good candidates for each clinical case. In S1 Data we present an example of a VCF file annotated by Mendel,MD.

We noticed early in this project that the task of re-annotating each VCF file would need to be repeated many times in order to keep this information updated.
To address this challenge we created a page called “Dashboard” where a user with administration privileges can quickly select individuals and send them to be re-annotated whenever new datasets and tools become available upstream. We developed this process so that new tools and datasets can easily be integrated into it, allowing changes to be made continuously with the goal of improving the quality of the analysis.

After the annotation was finished we inserted each annotated VCF into an SQL database developed using PostgreSQL in order to store, index, and quickly retrieve this information. To take care of filtering variants from multiple individuals we developed a method called “Filter Analysis”. Next we describe how this method is useful for excluding variants according to filter options pre-defined by the user. In Fig 3 we show a summary for a VCF file with metrics about the read depth, quality score and total number of variants, which help define thresholds for the Filter Analysis described in the next section.


Fig 3. VCF summary. Here you can see some metrics for each VCF file such as the total number of variants and the number of novel variants. https://doi.org/10.1371/journal.pcbi.1005520.g003
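Summary metrics of this kind can be computed with a single pass over the VCF records. The sketch below is only an approximation of what such a summary might contain and rests on two labeled assumptions: a variant is counted as novel when its ID column is “.”, and read depth is taken from a `DP=` entry in the INFO column. The function name `summarize_vcf` is hypothetical.

```python
def summarize_vcf(lines):
    """Compute simple per-file metrics from VCF text lines (hypothetical helper)."""
    total = novel = 0
    quals, depths = [], []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        total += 1
        # Assumption: a record without an identifier is treated as novel.
        if fields[2] == ".":
            novel += 1
        if fields[5] != ".":
            quals.append(float(fields[5]))
        # Assumption: read depth is stored as DP=<int> in the INFO column.
        for item in fields[7].split(";"):
            if item.startswith("DP="):
                depths.append(int(item[3:]))
                break

    def mean(values):
        return sum(values) / len(values) if values else None

    return {
        "total_variants": total,
        "novel_variants": novel,
        "mean_quality": mean(quals),
        "mean_depth": mean(depths),
    }
```

Inspecting these values before filtering gives a rough sense of where read-depth and quality thresholds could be set for a given file.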

Filter analysis

To implement the filtering of the VCF data we made extensive use of the Django object-relational mapper (ORM), which translates Python code into SQL queries and thus makes it easier to build complex queries that can be combined to reduce the number of candidate variants and genes for each clinical case. In Fig 4 we show the interface that was developed for filtering these variants based on the fields from the VCF that were annotated and inserted into the database. With these options a user can exclude variants based on fields such as the type of mutation (e.g. homozygous or heterozygous), the impact of the mutation according to SNPEFF (e.g. high, moderate, modifier or low), and the frequency of the mutation according to the 1000 Genomes, dbSNP and Exome Sequencing Project databases.


Fig 4. Filter analysis. Using this method, a user can quickly define multiple options in order to exclude variants according to different criteria. This enables you to quickly repeat the filtering step using different options to adjust your analysis according to your preferences. https://doi.org/10.1371/journal.pcbi.1005520.g004

It is also possible to search for variants only in genes previously known to be associated with Mendelian disorders. We implemented autocomplete fields where the user can type a word and quickly retrieve a list of the diseases matching this term to add to their search. This speeds up the process of adding options and also allows the user to search for variants only in genes associated with specific diseases. We made this part of the analysis user-friendly so that it could be easily performed by doctors and researchers, which can greatly hasten the identification of good candidate variants for experimental validation. In the results section of this search, the user can see a list of genes that are already known to be associated with Mendelian Disorders in OMIM and the Clinical Genomic Database and decide to focus only on variants present in these genes. This strategy can markedly reduce the number of candidate variants that may cause a Mendelian Disorder.
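The way independent filter options combine into a single query can be illustrated without Django. The sketch below uses the standard-library `sqlite3` module as a stand-in for the Django ORM and PostgreSQL, so it shows the idea of composing optional conditions rather than the actual Mendel,MD implementation. The table name, column names (`zygosity`, `impact`, `frequency`, `gene`), the gene labels and all data values are hypothetical.

```python
import sqlite3


def build_filter_query(zygosity=None, impacts=None, max_frequency=None, genes=None):
    """Combine the selected filter options into one parameterized SQL query."""
    clauses, params = [], []
    if zygosity is not None:
        clauses.append("zygosity = ?")
        params.append(zygosity)
    if impacts:
        clauses.append("impact IN (%s)" % ", ".join("?" * len(impacts)))
        params.extend(impacts)
    if max_frequency is not None:
        # Variants absent from the frequency source (NULL) are kept as rare.
        clauses.append("(frequency IS NULL OR frequency <= ?)")
        params.append(max_frequency)
    if genes:
        # Restrict to a gene list, e.g. genes already linked to a disease.
        clauses.append("gene IN (%s)" % ", ".join("?" * len(genes)))
        params.extend(genes)
    sql = "SELECT chrom, pos, gene FROM variant"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY chrom, pos", params


def make_demo_db():
    """Create an in-memory table with a few made-up variants."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE variant (chrom TEXT, pos INTEGER, gene TEXT, "
        "zygosity TEXT, impact TEXT, frequency REAL)"
    )
    conn.executemany(
        "INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("1", 100, "GENE_A", "hom", "HIGH", 0.001),
            ("1", 200, "GENE_A", "het", "LOW", 0.20),
            ("2", 300, "GENE_B", "hom", "MODERATE", None),
            ("3", 400, "GENE_C", "hom", "HIGH", 0.30),
        ],
    )
    return conn


conn = make_demo_db()
sql, params = build_filter_query(
    zygosity="hom", impacts=["HIGH", "MODERATE"], max_frequency=0.01
)
candidates = conn.execute(sql, params).fetchall()
```

Each option adds one condition only when the user sets it, so repeating the analysis with different thresholds amounts to calling the builder again with new arguments. In Django the same composition would typically be expressed by chaining `filter()` calls on a queryset.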