Overview

Figure 1 gives an overview of what XGR is and what the user can expect from it. XGR has two ends, the backend (an R package) [25] and the frontend (a web-app) [26]. Metaphorically, it works as a knowledge-driven ‘megabus’, carrying the passengers (users of varying computational skills) from the departure (a user-input list of genes, SNPs, or genomic regions) to the destination (outputs in a user-friendly format including ontology enrichments and network relationships). The petrol used by this megabus is the ontology and network knowledge (see next section), and the engine is its analytical capability, currently supporting enrichment, similarity, network, and annotation analysis (summarised in Table 1; see below for details). Put simply, XGR is designed to interpret genomic summary data resulting from modern genetic studies (differential expression, GWAS, and eQTL mappings), not targeting the upstream generation of summary data but instead enhancing its downstream biological discovery.

Fig. 1 Schematic workflow of XGR: achieving enhanced interpretation of genomic summary data. This flowchart illustrates the basic concepts behind XGR. The user provides an input list of either genes, SNPs, or genomic regions, along with their significance levels (collectively referred to as genomic summary data). XGR, available as both an R package and a web-app, is then able to run enrichment, network, similarity, and annotation analyses based on this input. The analyses themselves are run using a combination of ontologies, gene networks, gene/SNP annotations, and genomic annotation data (built-in data). The output comes in various forms, including bar plots, directed acyclic graphs (DAG), circos plots, and network relationships. Furthermore, the web-app version provides interactive tables, downloadable files, and other visuals (e.g. heatmaps) Full size image

Table 1 A summary of XGR characteristics for tasks achieved and runtime required Full size table

Source data and uniform representations

As a central part of the knowledge-driven interpretations, we have assembled currently available knowledge at the gene, SNP, and genomic region level (detailed below). All source data are represented uniformly as well-documented RData-formatted files, taking advantage of the R software open-development environment and its infrastructure packages such as igraph [27] and GenomicRanges [28]. The primary source data are maintained as part of in-house relational databases, from which Perl scripts are used to create RData files. Following an established pipeline, they are subject to regular updates and are also regularly supplemented to keep pace with the explosive nature of big data in genomics.

Ontologies and annotations at the gene level

Conceptually similar to a dictionary, an ontology contains well-defined vocabularies (called ‘terms’) and their relationships to each other, and is readable by both humans and computers. Depending on how relationships between terms are organised, ontologies can be broadly categorised into two types: 1) structured ontologies, where terms are organised in a tree-like structure (specifically a directed acyclic graph (DAG)), e.g. Gene Ontology [10], Disease Ontology [11], Phenotype Ontologies in human and mouse [12, 13]; 2) non-structured ontologies, where terms are simply listed as keywords, such as a collection of pathways from MSigDB [29], and of gene druggable categories from DGIdb [30]. Using ontologies to annotate genes is one of the most effective and scalable ways of capturing a particular knowledge sphere. The reuse of existing knowledge through ontology annotations is one of the key principles behind XGR. At the time of writing (October 2016), XGR supports nearly 30 gene annotations covering almost every type of knowledge domain, ranging from functions to diseases, phenotypes, pathways, and many others (Table 1). Whether structured or non-structured (in which case an artificial root is created to link together all terms), an ontology together with annotations is universally represented as an annotated directed graph. This design aids in performing operations such as graph visualisation, annotation propagation (according to the true-path rule), and semantic similarity calculations between terms. Ontologies and their identifier codes used in XGR are summarised in [31].

Ontology annotations at the SNP level

SNP annotations are based on the Experimental Factor Ontology (EFO). EFO standardises GWAS traits from the NHGRI GWAS Catalog using well-defined terms [3]. SNPs associated with one or more related traits grouped together by an EFO term are annotated by this term. Like any structured ontology, EFO is organised as a DAG. By the true-path rule, an SNP associated with a trait (mapped to an EFO term) should also be annotated by its ancestor terms (more general terms). For example, SNPs annotated by a term ‘EFO:0000540’ (immune system disease) consist of: 1) SNPs directly annotated with this term; and 2) SNPs associated with its child terms such as ‘EFO:0005140’ (autoimmune disease) and ‘EFO:0000706’ (spondyloarthropathy), which inherit the parent annotation. The problem of linkage disequilibrium (LD) makes it necessary to also include additional SNPs that are in strong LD with GWAS lead SNPs. For ease use in XGR, LD SNPs are pre-calculated using PLINK [32] based on the 1000 Genomes Project data [33] in different population panels, and those with R2 > 0.8 with GWAS lead SNPs are retained.

Annotations at the genomic region level

Unlike coding genes that are well annotated using ontologies, non-coding genomic regions lack such annotations. Interpretation of these regions relies largely on functional genomic data generated experimentally and on comparative genomic data predicted by computational methods. Genomic annotations currently supported in XGR include a broad spectrum of genomic and epigenomic data including, transcription factor binding sites, DNaseI hypersensitivity sites, histone modifications, expressed enhancers, and genome segmentations (Table 1). Each genomic annotation set is represented as a ‘GRanges’ object, primarily based on the ‘hg19’ (GRCh37) genome build. Also supported is conversion of genomic regions between commonly used builds: ‘hg19’, ‘hg38’ (GRCh38), and ‘hg18’. Data types, sources, and identifier codes used in XGR are summarised in [31].

Interaction networks at the gene level

XGR supports networks of different interaction types (functional, physical, and pathway-derived), of varying interaction quality (highest, high, and medium), and of two interaction directions (directed versus undirected). Networks are mainly sourced from the STRING database [22] and the Pathway Commons database [23]. STRING is a meta-integration of undirected interactions from a functional aspect, while Pathway Commons contains both undirected and directed interactions from a physical and pathway aspect. Interaction type and quality, as well as identifier codes used in XGR, are summarised in [31].

Enrichment analysis

Enrichment analysis (or ‘Enricher’) is based on conventional statistical tests (Fisher’s exact test, hypergeometric or binomial test) to identify enriched ontology terms using either built-in or custom ontologies. The Fisher’s exact test establishes the independence between, for example, a user-defined gene group and a group of genes annotated by a term, and compares sampling only to the left part of the null background (without replacement). The hypergeometric test is to sample at random (without replacement) from the null background containing annotated and non-annotated genes. Finally, and in contrast to the hypergeometric test, the binomial test is to sample at random (with replacement) from the null background with the constant probability. As to the ease of reporting the significance level of a term (Additional file 1), they are, in order: hypergeometric test > Fisher’s exact test > binomial test. In other words, in terms of the calculated p value, hypergeometric test < Fisher’s exact test < binomial test. To further investigate the property of the statistical test, we simulated a random set of genes (having the same number of genes as in the real data) and estimated how often each enriched term in the real data would be expected from a null distribution based on the simulated data. As seen in Additional file 2, the chance (false positive rate) of enrichments in the real data that is falsely called significant from the simulated null data is extremely low. We also assessed false positive rate by simulating a random set of genes of different sizes and found they were independent of the size of gene sets (Additional file 3).

XGR is unique in being designed to produce much more informative enrichment results. This is achieved either by taking into account the ontology tree-like structure when using a structured ontology or by applying a filtering procedure when using a non-structured ontology (Fig. 2).

Fig. 2 Necessity of respecting ontology tree-like structure and of removing redundant non-structured pathways in enrichment analysis. This is demonstrated by analysing differentially expressed genes induced by 24-h interferon gamma in monocytes. The effect of taking ontology tree-like structure into account is demonstrated using Disease Ontology (DO) and the removal of redundant non-structured ontologies using Reactome pathways. a Side-by-side bar plots comparing the significant DO terms between the analysis without considering the tree structure (DO Tree(-)) versus the analysis considering the tree structure (DO Tree(+)). The horizontal dotted line separates commonly identified terms (top section) and redundant terms in the DO Tree(-) analysis. b DAG plot comparing commonly identified terms (coloured in cyan) and redundant terms from the DO Tree(-) analysis (coloured in light cyan). The term name (if significant) is prefixed in the form ‘x1-x2’. x1 represents ‘DO Tree (-)’ and x2 ‘DO Tree (+)’. The value of x1 (or x2) can be ‘1’ or ‘0’, denoting whether this term is identified (present) or not (absent). c The top pathway enrichments, with the redundant pathways to be removed indicated (X). d Illustrations of whether a less significant pathway B is redundant considering a more significant pathway A. Pathway B is counted redundant if it meets both criteria. Criterion 1: more than 90% of input genes annotated with pathway B are also covered by pathway A. Criterion 2: more than 50% of input genes annotated with pathway A are also covered by pathway B. Scenario 1 does not meet either criteria, scenario 2 meets both, and scenario 3 meets criterion 1 but not criterion 2. Notably, criterion 2 ensures the resulting pathways (as shown in scenario 3) are informative in capturing knowledge spheres of different granularities; otherwise, pathway B would be considered redundant in scenario 3, leading to loss of information. FDR: false discovery rate Full size image

Using a structured ontology

The basic idea is to account for the dependency of terms during enrichment analysis; for example, estimating the significance of a term after removing gene annotations that its significant child terms have. For technical details, please refer to publications [34, 35].

Using a non-structured ontology

A filtering procedure is applied to further remove redundant terms resulting from enrichment analysis. Take pathway enrichment analysis as an example (Fig. 2), assuming that there are two significant pathways, A and B, and that pathway A is more significantly enriched than pathway B. The less significant pathway B is deemed to be redundant if it meets both of the following criteria: 1) >90% of input genes annotated with pathway B are also annotated by pathway A; and 2) >50% of input genes annotated by pathway A are also annotated by pathway B. Both criteria were chosen empirically, as we observed that the increase in criterion 1 (90%) would result in the inability to remove redundant terms (Additional file 4a) and that criterion 2 (50%) produces the relative stability of redundant terms being removed (Additional file 4b). It should be noted that, although these default criteria should be applicable in most circumstances, the user can refine them by manipulating different thresholds.

Functionality

The function ‘xEnricherGenes’ conducts gene-level enrichment analysis using either structured ontologies or non-structured ontologies. The function ‘xEnricherSNPs’ conducts EFO-based enrichment analysis at the SNP level, allowing the inclusion of additional SNPs that are in LD with input SNPs. The function ‘xEnricherYours’ enables customised analysis using the user’s own ontologies and annotations for entities beyond genes and SNPs. Enrichment outputs are stored as an object of a newly defined class ‘eTerm’. Directly operating on this object, the function ‘xEnrichBarplot’ visualises enrichment results using a barplot, and the function ‘xEnrichDAGplot’ uses a DAG plot to display enriched terms in the context of the ontology tree. The function ‘xEnrichCompare’ is specially designed for side-by-side barplot comparison when involving two or more enrichment results (e.g. across different conditions but using the same ontology). The function ‘xEnrichDAGplotAdv’ takes this comparison further, highlighting which terms are shared and which are unique in the ontology tree.

Annotation analysis

Annotation analysis (or ‘Annotator’) aims to interpret a list of user-defined genomic regions in two ways: either via annotations of nearby genes by ontologies or via co-localised functional genomic annotations. Thanks to the diversity of source data available and the generalisation of data representation (see above), XGR enables multifaceted interpretation of poorly annotated genomic regions.

Functionality

The function ‘xGRviaGeneAnno’ takes as input a list of user-defined genomic regions, defines the nearest genes within a user-specified distance gap, and conducts enrichment analysis using nearby gene annotations. Similar to enrichment analysis at the gene level, this function gives the choice of structured and non-structured ontologies, producing informative enrichment results that can be visually displayed/compared. Alternatively, both functions ‘xGRviaGenomicAnno’ and ‘xGRviaGenomicAnnoAdv’ conduct region-based enrichment analysis using co-localising functional genomic annotations. The function ‘xGRviaGenomicAnno’ uses the binomial test for estimating the significance of overlaps at base resolution. The function ‘xGRviaGenomicAnnoAdv’ estimates the significance of the observed overlaps against the expectation under the null distribution, which is generated through random sampling from background genomic regions. By default, the background uses annotatable genomic regions (depending on which genomic annotations are used). However, it is advisable for the user to specify this background according to experimental settings. Enrichment results (as ‘eTerm’ objects) from annotation analysis can be visualised and compared using functions ‘xEnrichBarplot’ and ‘xEnrichCompare’.

Similarity analysis

Similarity analysis (or ‘Socialiser’) calculates semantic similarity between two genes (or between two SNPs) based on their ontology annotation profiles. More precisely, it assesses the degree of relatedness in meaning of annotation profiles from a structured ontology. The function ‘xSocialiserGenes’ conducts similarity analysis for genes using annotations by structured ontologies, while the function ‘xSocialiserSNPs’ conducts SNP-based similarity analysis using annotations from EFO.

SNP semantic similarity

The procedure used to calculate semantic similarity between two SNPs is as follows. First, the information content (IC) of a term is defined to measure how informative it is when used to annotate SNPs: –log 10 (frequency of SNPs annotated by this term). Semantic similarity between each pair of terms is pre-calculated, usually quantified as IC at the most informative common ancestor (MICA) of the two terms. Finally, semantic similarity SIM(S 1 , S 2 ) between two SNPs, S 1 and S 2 , is derived from pairwise term similarity, using best-matching (BM) based methods: average (Eq. 1), maximum (Eq. 2), or complete (Eq. 3). For a term in the annotation profile of one SNP, all these BM-based methods calculate the maximum similarity to any term in the profile of the other SNP. It can be deduced from the formula that the average and maximum methods are more sensitive to the number of terms than the complete method. However, due to the current sparse nature of EFO-based annotation of GWAS SNPs, using any of the three methods produces similar results. Indeed, they are interchangeable, although results from the average and maximum methods are more similar to each other than to the complete method (Additional file 5). By default, the complete method is used to minimise the impact of the number of terms. The resulting SNP semantic similarity network is a weighted undirected graph, with SNPs as nodes and semantic similarity scores as the edge weights. Inclusion of LD SNPs is also possible for similarity analysis.

Basis of SNP similarity

The function ‘xCircos’ displays the similarity results using a circos plot, in which the degree of similarity between two SNPs is indicated by the coloured link. This function can be used to display the most similar links, or those links involving a specific SNP only. Two functions, ‘xSocialiserDAGplot’ and ‘xSocialiserDAGplotAdv’, are specially designed to explore the basis of similarity seen in the circos plot. The function ‘xSocialiserDAGplot’ is used to visualise the ontology annotation profile for an SNP, i.e. as a DAG plot of terms used to annotate the SNP, including original annotations (rectangular nodes) and inherited annotations (elliptical nodes). The function ‘xSocialiserDAGplotAdv’ uses a DAG plot to compare annotation profiles between two similar SNPs.

$$ SIM\left({S}_1,{S}_2\right)=\frac{1}{2}\times \left(\frac{1}{n_1}\sum_{t_1\in {T}_1}\underset{t_2\in {T}_2}{MAX}\left( MICA\left({t}_1,{t}_2\right)\right)+\frac{1}{n_2}\sum_{t_2\in {T}_2}\underset{t_1\in {T}_1}{MAX}\left( MICA\left({t}_1,{t}_2\right)\right)\right), $$ (1)

$$ SIM\left({S}_1,{S}_2\right)= MAX\left(\frac{1}{n_1}\sum_{t_1\in {T}_1}\underset{t_2\in {T}_2}{MAX}\left( MICA\left({t}_1,{t}_2\right)\right),\frac{1}{n_2}\sum_{t_2\in {T}_2}\underset{t_1\in {T}_1}{MAX}\left( MICA\left({t}_1,{t}_2\right)\right)\right), $$ (2)

$$ SIM\left({S}_1,{S}_2\right)=MIN\left(\underset{t_1\in {T}_1}{U}\underset{t_2\in {T}_2}{MAX}\left( MICA\left({t}_1,{t}_2\right)\right),\underset{t_2\in {T}_2}{U}\underset{t_1\in {T}_1}{MAX}\left( MICA\left({t}_1,{t}_2\right)\right)\right), $$ (3)

where T 1 is a set of n 1 EFO terms used to annotate S 1 , T 2 is a set of n 2 EFO terms annotating S 2 , MICA(t 1 , t 2 ) is the IC of the MICA of two terms t 1 and t 2 , operators MAX, MIN, and U denote, respectively, maximum, minimum, and union.

Network analysis

Network analysis (or ‘Networker’) identifies the subset (gene subnetwork) from a gene interaction network with nodes/genes labelled with significance information. Depending on how the node/gene significance information is provided, there are two types of network analyses supported in XGR: gene-based network analysis and SNP-based network analysis.

Gene-based network analysis

The node/gene information is directly provided, e.g. differentially expressed genes with significance measured by false discovery rate (FDR). Given a gene interaction network with nodes/genes labelled with significance, the function ‘xSubneterGenes’ searches for a maximum-scoring gene subnetwork enriched with the most significant (highly scored) genes but allowing for a few less significant genes as linkers (usually hubs). The search for this maximum-scoring subnetwork is achieved via heuristically solving a prize-collecting Steiner tree problem; this approach has been demonstrated to be superior to other state-of-the-art methods. If required, an iterative procedure is applied to identify the subnetwork with a desired number of nodes/genes. For details please refer to our previous publication [36].

SNP-based network analysis

We extend the network analysis to the SNP level, allowing node/gene information to be indirectly provided (i.e. derived from the input), e.g. via GWAS SNPs along with p values. The function ‘xSubneterSNPs’ is designed to identify a gene subnetwork that is likely modulated by input SNPs and/or their LD SNPs. It consists of three steps (Fig. 3a):

Fig. 3 Informativeness of using cross-disease GWAS summary data in characterising relationships between immunological disorders. a Gene scoring from GWAS SNPs prior to network analysis. b Heatmap of cross-disease gene scores for 11 common immunological disorders based on ImmunoBase GWAS summary data. c Consensus neighbour-joining tree based on the gene-scoring matrix resolves disease classification/taxonomy according to the genetic and cellular basis of autoinflammation and autoimmunity. Subdivided into 1) polygenic autoinflammatory diseases with a prominent autoinflammatory component, 2) polygenic autoimmune diseases with a prominent autoimmune component, and 3) mixed diseases having both components. Inter-disease distance is defined as the cumulative difference in gene scores Full size image

1. SNP scoring (Eq. 4), which considers the p values, the threshold (e.g. 5e-8 for typical GWAS), and (for LD SNPs) LD strength R 2. 2. Gene scoring (Eq. 5), which scores genes based on genomic proximity to quantify their genetic modulation by SNPs (and LD SNPs). 3. Network scoring, using the function ‘xSubneterGenes’ to identify a maximum-scoring gene subnetwork (with the desired number of nodes if required).

$$ Scor{e}_{SNP}={R}^2\times \left(lo{g}_{10}\frac{1-{P}_{SNP}}{P_{SNP}}-lo{g}_{10}\frac{1-{P}_{thresh}}{P_{thresh}}\right), $$ (4)

where P SNP is the SNP p value, P thresh is the significance threshold (usually 5e-8), and R 2 is the LD strength.

$$ Scor{e}_{gene}=\underset{SNP\in \varOmega }{MAX}\left\{ Scor{e}_{SNP}\times \left[{\left(1-\frac{d}{D}\right)}^{\lambda}\times \left|d\le D\right|\right]\right\}, $$ (5)

where Score SNP is the SNP score calculated using Eq. 4, d is the gene-to-SNP distance within a maximum of the distance window D, λ is the decay exponent controlling the decaying influence of an SNP on a nearby gene as the distance increases, Ω stands for collections of SNPs (input SNPs and LD SNPs), and MAX denotes maximum scoring scheme used here to only keep the most-informative SNP when a large number of interdependent SNPs are located within the same genetic region.

Other implementation issues

Control for multiple testing

Where a large number of tests are involved, we adjust p values either controlling the FDR (by default) or controlling the family-wise error rate (FWER). FDR is a less stringent condition than FWER. The user can choose how to account for multiple testing.

R package dependency

We rely on the package ‘ggplot2’ [37] for various visuals and adapt the package ‘RCircos’ [38] for a circos plot. Where necessary for high-performance parallel computing, two packages, ‘doMC’ and ‘foreach’, are used to reduce computational costs. Other dependent packages are listed in [25].

Web-app implementation

We use a next-generation Perl web framework ‘Mojolicious’ [39], under which the XGR web-app is portable requiring nearly zero-effort maintenance. Its maintenance is further simplified as the web-app is purely powered by the XGR R package (stably deposited into the CRAN repository).