The classical pathway analysis begins by considering a comparison between two conditions, e.g. disease versus healthy. Evidence for differential gene expression can be provided by any technique such as fold change, t-statistic, Kolmogorov-Smirnov statistic, or perturbation factor. These statistics are then compared against the null distribution to determine how unlikely it is for the observed differences between the two conditions to occur by chance, thereby producing a ranked list of DE genes. After this hypothesis testing is done at the gene level, the next step is hypothesis testing at the pathway level producing a ranked list of impacted pathways. In summary, the input of a classical pathway analysis method includes: (i) a pathway database, and (ii) a gene expression dataset. The output is a list of pathways ranked according to their p-values.

Similarly, the input of the new approach includes: (i) a pathway database, (ii) a database of miRNA-mRNA interactions, (iii) multiple gene expression datasets, and (iv) multiple miRNA expression datasets. Each dataset is obtained from an independent study of the same disease. Here we describe a framework that transforms the new problem into the classical pathway analysis problem.

Figure 1 illustrates the pipeline of our framework, for the case of colorectal cancer. Panel (a) represents the biological knowledge obtained from public databases: pathway information and miRNA targets. Panel (b) shows a set of gene expression datasets obtained from independent studies, coming from different laboratories. For this example, we have 7 datasets (GSE4107, GSE9348, GSE15781, GSE21510, GSE23878, GSE41657, and GSE62322), all related to the same disease, colorectal cancer. Each dataset consists of two groups of samples: disease (group D) and control/healthy (group C). Panel (c) represents a set of miRNA expression datasets (GSE33125, GSE35834, GSE39814, GSE39833, GSE41655, GSE49246, GSE54632, and GSE73487), also from colorectal cancer. Similar to gene expression datasets, each miRNA dataset consists of disease and control samples. The data provided in panels (a,b,c) serve as the input for our framework.

Figure 1: Overall pipeline of the proposed framework. The input consists of (i) a pathway database and a miRNA database including known targets (panel a), (ii) multiple mRNA expression datasets (panel b), and (iii) multiple miRNA expression datasets (panel c). Each expression dataset consists of two groups of samples, e.g. disease versus control. The framework first augments the signaling pathways with miRNA molecules and their interactions with coding mRNA genes (panel d). It then calculates the standardized mean difference and its standard error in each expression dataset. The summary size effect across multiple datasets for each data type are then estimated using the REstricted Maximum Likelihood (REML) algorithm (panels e,f). Similarly, the p-value for differential expression is calculated for each dataset and then combined using the additive method (add-CLT). The augmented pathways, the combined p-values, and the estimated size effects then serve as input for ImpactAnalysis, which is a topology-aware pathway analysis method (panel g). Full size image

Pathways in public databases are typically described as graphs, where nodes are genes and edges are interactions between genes. In the first step, we extend the existing pathways with additional interactions between miRNAs and mRNAs. Panel (d) shows a part of the pathway Colorectal cancer, where blue nodes are genes and red nodes are miRNAs. The black arrow-headed lines represent activation while the red bar-headed lines represent inhibition. For example, hsa-miR-483-5p is known to suppress the expression of MAPK3 and therefore an inhibition relationship is added between the two nodes in the pathway. All pathways are extended to include the known miRNA-mRNA interactions. The next step is to estimate the expression changes of each node (gene, miRNA) under the effects of the disease.

Panel (e) shows the expression changes and the p-values for one gene in the mRNA data, across several datasets. In this case, the MAPK3 gene is used as an example. In the forest plot shown in this panel, each horizontal line represents the expression change in each study. The small black box in each line shows the standardized mean difference (SMD) and the segment shows the confidence interval of SMD. We use the standardized mean difference instead of the raw difference because the independent studies measure the expression in a variety of ways (different platforms, sample preparation, etc.). The number on the right side of each line is the p-value of the test for differential expression, using the modified t-test provided in the limma package31.

As shown in the figure, the SMD and p-value of a gene vary from study to study. We use the REstricted Maximum Likelihood (REML) algorithm32,33,34,35 to estimate the central tendency of SMD. We also use the add-CLT method28 to combine the independent p-values. Likewise, we compute the estimated SMDs and p-values for miRNA datasets (panel f).

The augmented pathways, the combined p-value, together with the estimated size effect then serve as input for classical pathway analysis. In this work, we use Impact Analysis, which is a topology-aware pathway analysis method, to calculate the p-value for each augmented pathway (panel g).

Standardized mean difference for each gene

Consider a study composed of two independent groups, and suppose we wish to compare their means for a given gene. Let and represent the sample means for that gene in the two groups, n 1 and n 2 the number of samples in each group, and S pooled the pooled standard deviation of the two groups. The pooled standard deviation and the standardized mean difference (SMD) can be estimated as:

The estimation of the standardized mean difference described in Equation (2) is often called Cohen’s d36,37. The variance of Cohen’s d is given as follows:

In the above equation, the first term reflects uncertainty in the estimate of the mean difference, and the second term reflects uncertainty in the estimate of S pooled . The standard error of d is the square root of V d . We note that Cohen’s d, which is based on sample averages, tends to overestimate the population effect size for small samples. Let n be the degrees of freedom used to estimate S pooled , i.e. n = n 1 + n 2 − 2. The corrected effect size, or Hedges’ g38, can be computed as follows:

where Γ is the gamma function. In this work, we use Hedge’ g as the standardized mean difference (SMD) between disease and control groups for each gene/miRNA.

Random-effects model and REML

Consider a collection of m studies where the effect size estimates, y 1 , …, y m have been derived from a set of studies, each of them modeled as in Equation (5). A fixed-effects model would assume that there is one true effect size which underlies all of the studies in the analysis, such that all differences in observed effects are due to sampling error. However, this assumption is implausible since it cannot account for heterogeneity between studies32,33,34,35.

In contrast, the random-effects model allows for variability of the true effect. For example, the effect size might be higher (or lower) in studies where the participants are older, or have a healthier lifestyle compared to others. The random-effects model assumes that each effect size estimate can be decomposed into two variance components by a two-stage hierarchical process33,39,40. The first variance represents the variability of the effect size across studies, and the second variance represents the sampling error within each study. We can write the random-effects model as:

where μ is the central tendency of the effect size, N(0, σ2) represents the error term by which the effect size in the ith study differs from the central tendency μ, and represents the sampling error.

The derivation and formulation of the REstricted Maximum Likelihood (REML) algorithm has been described in the literature33,41,42,43. The log-likelihood function for Equation (6) is given by:

The REML estimators of and are then computed by iteratively maximizing the log-likelihood. In our framework, we calculate for each node (mRNA and miRNA) of the extended pathways. The estimated overall effect size and the combined p-value of individual genes and miRNAs serve as the input for Impact Analysis.

Combining independent p-values

We first briefly recap some classical methods for combining independent p-values. Next, we describe the additive method28,44,45,46 that is used to combine p-values for each mRNA and miRNA molecule in our framework.

Fisher’s method47 is the most widely used method for combining independent p-values. Considering a set of m independent significance tests, the resulting p-values P 1 , P 2 , …, P m are independent and uniformly distributed on the interval [0, 1] under the null hypothesis. The random variables X i = −2lnP i (i ∈ {1, 2, …, m}) follow a chi-squared distribution with two degrees of freedom ( ). Consequently, the log product of m independent p-values follows a chi-squared distribution with 2m degrees of freedom. We note that if one of the individual p-values approaches zero, which is often the case for empirical p-values, then the combined p-value approaches zero as well, regardless of other individual p-values. For example, if P 1 → 0, then X → ∞ and therefore, Pr(X) → 0 regardless of P 2 , P 3 , …, P m .

Stouffer’s method48 is another classical method that is closely related to Fisher’s. The test statistic of Stouffer’s method is the sum of p-values transformed into standard normal variables, divided by the square root of m. Denoting ϕ as the standard normal cumulative distribution function, and p i (i ∈ [1..m]) the individual p-values that are independently and uniformly distributed under the null, the z-scores are calculated as z i = ϕ−1 (1 − p i ). By definition, these z-scores follow the standard normal distribution. The summary statistic of Stouffer’s method also follows the standard normal distribution under the null hypothesis. Similar to Fisher’s method, the combined p-values approach zero when one of the individual p-values approaches zero.

The additive method28,44,45,46,49 uses the sum of the p-values as the test statistic, instead of the log product. Consider the p-values resulting from m independent significance tests, P 1 , P 2 , …, P m . Let the sum of these p-values, (X ∈ [0, m]), be the new random variable. X is known to follow the Irwin-Hall distribution45,46 with the following probability density function (pdf):

when m is large, some addends will be too small or too large to be stored in the memory. This leads to a totally inaccurate calculation when m passes a certain threshold, depending on the number of bits used to store numbers on the computer. For this reason, a modified version of the additive method, named add-CLT, was proposed28.

Let Y represent the average of p-values: (Y ∈ [0, 1]). Since the probability density function (pdf) and the corresponding cumulative distribution function (cdf) of Y can be derived using a linear transformation of X as follows:

The variable Y is the mean of m independent and identically distributed (i.i.d.) random variables (the p-values from each individual experiment), that follow a uniform distribution with a mean of and a variance of . From the Central Limit Theorem50, the average of such m i.i.d. variables follows a normal distribution with mean and variance , i.e. for sufficiently large values of m. The transition from the additive method to the Central Limit Theorem takes place at the m ≥ 20 threshold.

In this work, we use the add-CLT method described above to combine the p-values calculated from the modified t-test (limma package).

Graphical representation of augmented pathways

Here we give the formal description of the pathway augmentation process. Let P = (V, E) be the graphical representation of the pathway we want to extend with miRNA-mRNA interactions. V is the set of vertices (genes) while the directed edges in E represent the interactions between genes in the pathway. Each interaction consists of an ordered pair of vertices and the type of interaction between the pair, i.e. E = {(x i , y i ), r i } where x i , y i ∈ G (gene set) and r i is the type of relation between x i and y i , such as activation, repression, phosphorylation, etc. Topology-based pathway analysis methods, such as Impact Analysis, use interaction types to weigh the edges or to set the strength of signal propagation along the paths in a pathway.

From the miRNA database, we get a set of miRNAs and their targets. Let us denote Z as the set of known miRNAs, ζ ∈ Z is one miRNA, and t(ζ) is the set of known targets for the miRNA ζ. The augmented pathway of P = (V, E) is denoted as P* = (V*, E*) and is constructed as follows:

In other words, if a miRNA ζ targets a gene g that belongs to the pathway, we add ζ to the pathway and then connect ζ with its targets in the pathway. By default, the interaction type of new edges is repression, which represents the translation blockage of miRNAs to mRNA. The interaction type can be changed to suit the interaction between the miRNA molecule and its targets. We extend all pathways in the pathway database using the formulation described in Equation (10). The R package mirIntegrator51 for pathway augmentation is available on Bioconductor website (www.bioconductor.org).

Impact analysis of augmented pathways

The Impact Analysis method14,52 combines two types of evidence: (i) the over-representation of DE genes in a given pathway9,10, and (ii) the perturbation of that pathway, caused by disease, as measured by propagating expression changes through the pathway topology. These two aspects are captured, respectively, by the independent probability values, P NDE and P PERT . Here we review the Impact Analysis formulation.

The first p-value, P NDE , is obtained using the hypergeometric model9,10, which is the probability of obtaining at least the observed number of differentially expressed genes. The second p-value, P PERT , depends on the identity of the specific genes that are differentially expressed as well as on the interactions described by the pathway. It is calculated based on the perturbation factor in each pathway. The perturbation factor of a gene, PF(g), is calculated as follows:

The first term represents the signed normalized expression change of the gene g, i.e. log standardized mean difference as shown in panels (e,f) of Fig. 1. The second term is the sum of perturbation factors of upstream genes, normalized by the number of downstream genes of each such upstream gene. The value of β ug quantifies the strength of interaction between u and g. Here, β ug = 1 for activation and β ug = −1 for repression.

The above equation essentially describes the perturbation factor PF for a gene as a linear function of the perturbation factors of all genes in a given pathway. In the stable state of the system, all relationships must hold, so the set of all equations defining the impact factors for all genes form a system of simultaneous equations whose solution will provide the values for the gene perturbation factors PF g . The net perturbation accumulation at the level of each gene, Acc(g), is calculated by subtracting the observed expression change from the perturbation factor.

The total accumulated perturbation in the pathway is then computed as:

The null distribution of Acc(P i ) is built by permutation of expression change. The p-value, P PERT , is then calculated by the probability of having values more extreme than the actually observed Acc(P i ).