Epigenetic Proteins

Epigenetic proteins were selected on the basis of their reported role, in the published literature, as major players in the epigenetic mechanisms of gene regulation and cancer. Using this criterion, we selected 167 epigenetic proteins and compiled the available data on them to develop dbEM. These proteins were broadly classified as DNA methyltransferases (DNMTs), histone deacetylases (HDACs), histone acetyltransferases (HATs), histone methyltransferases (HMTs), histone demethylases (HDMs) and chromatin remodelers.

Genomic Data

The Cancer Cell Line Encyclopedia (CCLE) [12] and the Catalogue of Somatic Mutations in Cancer (COSMIC) [4] were the two primary resources used to retrieve mutation data for these proteins. From the CCLE and COSMIC data, we extracted a total of 17159 mutations in the 167 epigenetic proteins. Substitution was the most common type of mutation, as shown in Fig. 2. For expression and copy number variation (CNV) data, we used CCLE expression data from the Affymetrix U133 Plus array (processed with RMA using quantile normalization) and log2 CNV data from Affymetrix SNP 6.0 arrays. A Circos plot [13] highlights the gene expression and CNV data in Fig. 3.

Figure 2 Bar graph showing the counts of various types of mutations in epigenetic proteins.

Figure 3 Circos plot depicting gene expression and copy number variation. Moving from the outer to the inner circles, the first circle represents the chromosomes on which the epigenetic genes have been mapped. Gene expression data are shown in the second and third circles as a scatter diagram and a heat map, respectively (red and green dots denote gene expression values greater than and less than 10, respectively). The fourth and fifth circles show copy number variation (CNV) as a scatter diagram and a heat map, respectively (red and green dots denote values greater than and less than 2, respectively). Each dot represents a cancer cell line.

Gene Essentiality Data

To investigate how essential the genes encoding the above-mentioned epigenetic proteins are for cancer cell survival, we compiled shRNA dropout profiles from the COLT-Cancer database [14]. This database contains gene essentiality data for around 16000 genes in 72 cancer cell lines of ovarian and pancreatic tissues. The essentiality of a gene is expressed as a GARP score with an associated P-value. The GARP score is a measure of the shRNA dropout rate; a lower (more negative) GARP score indicates greater essentiality.
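As a rough illustration of how such scores might be used, the sketch below selects genes whose GARP score is strongly negative and whose P-value is significant. The record layout, the gene and cell line names, and both cutoffs (`garp_cutoff`, `p_cutoff`) are hypothetical choices made for this example; they are not values taken from COLT-Cancer or dbEM.

```python
from dataclasses import dataclass


@dataclass
class DropoutRecord:
    """One gene in one cell line (hypothetical layout for illustration)."""
    gene: str
    cell_line: str
    garp: float      # more negative = faster shRNA dropout
    p_value: float


def essential_hits(records, garp_cutoff=-1.0, p_cutoff=0.05):
    """Return records passing both cutoffs, most essential first.

    Both cutoffs are illustrative assumptions, not published thresholds.
    """
    hits = [r for r in records
            if r.garp < garp_cutoff and r.p_value < p_cutoff]
    return sorted(hits, key=lambda r: r.garp)


# Toy data with placeholder names.
records = [
    DropoutRecord("GENE_A", "LINE_1", -2.4, 0.001),
    DropoutRecord("GENE_B", "LINE_1", -0.3, 0.400),
    DropoutRecord("GENE_C", "LINE_1", -1.7, 0.020),
    DropoutRecord("GENE_D", "LINE_1", -3.1, 0.200),
]

for hit in essential_hits(records):
    print(hit.gene, hit.garp, hit.p_value)
```

In this toy set, GENE_D has the most negative score but is excluded because its P-value is not significant, which is why both quantities are checked together.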

Tertiary Structure and Structural Domains

To assist the design of small molecules targeting epigenetic proteins, structural information about these proteins is also compiled in dbEM. Although PDB structures are available for around 109 of the proteins, we used HH-suite 2.0.16 [15] and Modeller 9.13 [16] to model the structures of all 167 epigenetic proteins. The modeled structures can be visualized in a Jmol applet and downloaded as PDB files. We have also provided links to all available structures in the PDB. In addition, we mapped Pfam [17] and Superfamily [18] domains to identify the characteristic domains of epigenetic proteins, using the FASTA sequences of the proteins as queries with default settings. In total, we mapped 837 Pfam domains and 371 Superfamily domains in the 167 epigenetic proteins.

Sequence Alignment

dbEM allows the user to align the sequences of epigenetic proteins in different ways. Information about normal variations in epigenetic proteins was taken from the 1000 Genomes Project in VCF format. The variations of each protein were extracted from the VCF file and mapped onto the wild-type sequence to create variant sequences, which were then aligned with the wild-type protein. Wild-type sequences were also aligned with the mutants available in CCLE and COSMIC. To obtain evolutionary information about these epigenetic proteins, their homologous sequences from different species (obtained from NCBI) were aligned and an evolutionary tree was generated. ClustalW [19] and Jalview [20] were used for sequence alignment and visualization, respectively.
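The step of mapping variations onto a wild-type sequence can be sketched as follows. This is a minimal illustration, not the pipeline used for dbEM: it assumes the VCF records have already been annotated to protein-level single-residue substitutions written like `A3V` (reference residue, 1-based position, new residue), and the sequence and function names are made up for the example.

```python
import re

# Matches protein-level substitutions such as "A3V" (assumed input format).
SUBSTITUTION = re.compile(r"^([A-Z])(\d+)([A-Z])$")


def apply_substitution(wild_type, change):
    """Return a copy of wild_type with one substitution applied."""
    match = SUBSTITUTION.match(change)
    if match is None:
        raise ValueError(f"unsupported change: {change!r}")
    ref, alt = match.group(1), match.group(3)
    index = int(match.group(2)) - 1  # convert 1-based position to 0-based
    if not 0 <= index < len(wild_type):
        raise ValueError(f"position out of range for {change!r}")
    if wild_type[index] != ref:
        raise ValueError(f"reference residue mismatch for {change!r}")
    return wild_type[:index] + alt + wild_type[index + 1:]


def build_variants(wild_type, changes):
    """Create one variant sequence per substitution, keyed by the change."""
    return {change: apply_substitution(wild_type, change)
            for change in changes}


wild_type = "MKALVE"  # toy sequence, not a real protein
variants = build_variants(wild_type, ["A3V", "E6K"])
for change, sequence in variants.items():
    print(change, sequence)
```

Checking the reference residue before substituting guards against coordinates that refer to a different isoform or sequence version than the wild-type sequence in hand.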

HMM and PSSM Profiles

HMM and PSSM profiles, which give a conservation score at each position of the protein, were generated for each protein. The ‘jackhmmer’ and ‘hmmbuild’ modules of the HMMER [21] software were used to generate HMM profiles, whereas PSSM profiles were made using the ‘blastpgp’ and ‘makemat’ modules of the BLAST software [22]. Three databases were used to create these profiles: the UniProt database, the mutated sequences database and the normal variant database.

Post-translational Modifications

Earlier studies have shown that the functioning of various proteins is regulated by post-translational modifications (PTMs) such as phosphorylation, acetylation, ubiquitylation and sumoylation [23,24,25], so PTM information was also included in this database. For all the epigenetic proteins, dbEM provides the position, the modified amino acid and the type of each modification, compiled from dbPTM [26].

Epigenetic Drugs and Inhibitors

Recently, the FDA approved epigenetic drugs such as vorinostat and azacitidine for the treatment of cutaneous T cell lymphoma and myelodysplastic syndrome, respectively [7,8,9]. Epigenetic proteins can serve as potential therapeutic targets for the treatment of cancer, and small-molecule inhibitors targeting them are already in pre-clinical and clinical trials. dbEM provides information about 54 small-molecule inhibitors of HDACs, HMTs, DNMTs and HDMs, which have been used either alone or in combination for the treatment of a variety of cancers. dbEM is further linked to the DrugBank and PubChem databases to provide more information about these epigenetic drugs and inhibitors.

Integration of web tools

Several user-friendly tools have been integrated into dbEM to make it easy to extract information about epigenetic proteins in the context of cancer.

Search

We have implemented three search tools: simple search, composite search and similarity search. Simple search allows the user to search with a single keyword such as a protein name, location, class, subclass, domain, mutation, inhibitor, inhibitor class or inhibitor target, and returns a table containing all the major information available for that keyword. Composite search allows the user to build a complex query over three fields (cellular location, class and subclass) combined with logical operators (AND/OR). Similarity search allows the user to perform a similarity-based search of a query protein against the epigenetic proteins included in dbEM.
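The composite search logic can be illustrated with a short sketch that combines field conditions with AND or OR. The field names, record layout and toy entries below are assumptions made for the example and do not reflect the actual dbEM schema or implementation.

```python
# Hypothetical field names standing in for cellular location, class, subclass.
FIELDS = ("location", "protein_class", "subclass")


def composite_search(proteins, criteria, operator="AND"):
    """Return proteins matching the criteria under AND or OR.

    criteria maps a field name to the required value (case-insensitive).
    """
    operator = operator.upper()
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    unknown = set(criteria) - set(FIELDS)
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    combine = all if operator == "AND" else any
    return [
        p for p in proteins
        if combine(p[field].lower() == value.lower()
                   for field, value in criteria.items())
    ]


# Toy records with placeholder names and values.
proteins = [
    {"name": "PROT_A", "location": "Nucleus",
     "protein_class": "HDAC", "subclass": "Class I"},
    {"name": "PROT_B", "location": "Cytoplasm",
     "protein_class": "HDAC", "subclass": "Class II"},
    {"name": "PROT_C", "location": "Nucleus",
     "protein_class": "HMT", "subclass": "Lysine"},
]

and_hits = composite_search(
    proteins, {"location": "Nucleus", "protein_class": "HDAC"}, "AND")
or_hits = composite_search(
    proteins, {"location": "Cytoplasm", "protein_class": "HMT"}, "OR")
print([p["name"] for p in and_hits])
print([p["name"] for p in or_hits])
```

With AND, a protein must satisfy every condition; with OR, satisfying any one of them is enough, so the OR query returns a broader set.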

Align With Modifiers

dbEM allows the user to align a query protein sequence with four different types of sequences: (i) normal epigenetic protein sequences, (ii) CCLE mutants, (iii) COSMIC mutants and (iv) 1000 Genomes variants. Alignments can be visualized in a Jalview applet, which gives information about the consensus, quality and conservation of the protein sequence. This module also allows the user to view a dendrogram of the query sequence together with the selected type of sequences.

Profile Based Prediction

This tool allows the user to predict whether a given change in a protein sequence is more likely to be a normal variation (SNP) or a cancer-causing mutation. It is based on the similarity score of the query sequence against HMM profiles of normal variants from the 1000 Genomes Project and of cancer mutants from CCLE and COSMIC. If the similarity score of the query sequence is higher against the HMM profile of cancer mutants than against that of normal variants, the change is reported as a cancer-causing mutation, and vice versa. The tool calculates the similarity to the respective HMM profiles using the ‘hmmsearch’ module of the HMMER suite (version 3.1b1) [21].
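The decision rule described above can be sketched as a comparison of two profile scores. The sketch assumes the scores are read from the tabular output that `hmmsearch` writes with its `--tblout` option, where the sixth whitespace-separated column is the full-sequence bit score. The sample rows, the profile names and the handling of ties or missing hits are assumptions for illustration and are not described in the source.

```python
def best_score(tblout_text):
    """Return the highest full-sequence bit score in hmmsearch --tblout text.

    Returns None when the table contains no hits.
    """
    scores = []
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        fields = line.split()
        scores.append(float(fields[5]))  # column 6: full-sequence score
    return max(scores) if scores else None


def classify(cancer_score, normal_score):
    """Label the query by the profile it scores higher against.

    Ties and the no-hit case return 'undetermined' (an assumption).
    """
    cancer = float("-inf") if cancer_score is None else cancer_score
    normal = float("-inf") if normal_score is None else normal_score
    if cancer > normal:
        return "cancer-causing mutation"
    if normal > cancer:
        return "normal variation"
    return "undetermined"


# Made-up tblout rows with placeholder names.
cancer_tbl = """# target name  accession  query name  accession  E-value  score  bias
query_seq - cancer_profile - 1.2e-50 170.3 0.1 1.5e-50 170.0 0.1 1.0 1 0 0 1 1 1 1 -
"""
normal_tbl = """# target name  accession  query name  accession  E-value  score  bias
query_seq - normal_profile - 3.4e-48 164.9 0.1 4.0e-48 164.6 0.1 1.0 1 0 0 1 1 1 1 -
"""

label = classify(best_score(cancer_tbl), best_score(normal_tbl))
print(label)
```

Because the two scores are often close for a single-residue change, the size of the score difference may be worth reporting alongside the label.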

Browse Section

dbEM has a browsing facility that allows the user to browse the database and acquire information through five major modules:
1. Epigenetic Modifiers: This module lists all epigenetic proteins in tabulated form, with each protein linked to UniProt, homologous proteins, the PDB and PubChem/ChEMBL bioassays.
2. Chromosomes: In this module, epigenetic proteins are categorized by the chromosomal location of their genes. A Circos plot depicts the chromosomal distribution, gene expression and copy number variation (CNV) of the epigenetic genes (Fig. 3).
3. Frequency of Mutation: The frequency of cancer mutations and of normal variations for each protein was calculated from the mutation data in CCLE and COSMIC and the variants from the 1000 Genomes Project. A higher ratio of cancer mutation frequency to normal variant frequency marks a protein as a potential drug target for anticancer therapy. DNMT3A, HDAC2 and KDM6A have the highest frequency of mutation in cancer, as shown in Table 1.
4. Genomic Features: This module allows the user to select epigenetic proteins by specifying ranges for genomic features: mutation frequency, expression and copy number variation (CNV).
5. Drugs/Inhibitors: This module provides information about 54 molecules that are used as inhibitors or epigenetic drugs for the treatment of various cancers. The information includes the class, chemical class, therapeutic use, clinical trial status and mode of action of these molecules. The molecules in dbEM are linked to the PubChem and DrugBank databases for detailed information.
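The ratio used in the Frequency of Mutation module can be illustrated with a small calculation. The source does not state how frequency is normalized, so this sketch assumes frequency means the count divided by protein length; the protein names and counts are placeholders, not values from dbEM.

```python
import math


def frequency(count, length):
    """Events per residue (assumed definition of frequency)."""
    return count / length


def cancer_to_normal_ratio(cancer_count, normal_count, length):
    """Ratio of cancer mutation frequency to normal variant frequency."""
    cancer = frequency(cancer_count, length)
    normal = frequency(normal_count, length)
    if normal == 0:
        # No normal variants observed: the ratio is unbounded if any
        # cancer mutations exist, otherwise there is nothing to compare.
        return math.inf if cancer > 0 else 0.0
    return cancer / normal


# Placeholder entries: (name, cancer mutations, normal variants, length).
proteins = [
    ("PROT_A", 120, 30, 900),
    ("PROT_B", 40, 80, 500),
    ("PROT_C", 15, 0, 300),
]

ranked = sorted(
    ((name, cancer_to_normal_ratio(c, n, length))
     for name, c, n, length in proteins),
    key=lambda item: item[1],
    reverse=True,
)
for name, ratio in ranked:
    print(name, ratio)
```

Under this definition the protein length cancels out of the ratio, so the ranking depends only on the two counts; a different normalization would need the formula changed accordingly.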

Table 1 Top 10 mutated epigenetic proteins in cancer.

Information

This section of dbEM provides data statistics, publications, related links and acknowledgments. Data Stats: This module gives the distribution of the epigenetic proteins by class, subclass, cellular location, type of mutation and chromosomal location. Literature: This link lists recent research articles related to epigenetics in health and disease. Useful Links: This page links to the databases that were used for data acquisition. Acknowledgments: This section acknowledges the authors of the databases and software used in the construction of dbEM.

Get Data

To maximize the usefulness of dbEM to the scientific research community, a dedicated download page allows the user to download the data available in dbEM on mutations, expression, copy number variation, modeled tertiary structures and domains, sequence alignments and profiles, and post-translational modifications of epigenetic proteins.