As of October 2018, we have identified eight software tools for differential expression analysis of scRNAseq data, which are designed specifically for such data [21, 29, 30, 33, 34, 36,37,38] (SCDE, MAST, scDD, D3E, Monocle2, SINCERA, DEsingle, and SigEMD). We also considered tools designed for bulk RNAseq data that are widely used [4, 43] (edgeR, and DESeq2) or can apply to multimodal data [31] (EMDomics). The general characteristics of the eleven tools are provided in Table 1. MAST, scDD, EMDomics, Monocle2, SINCERA, and SigEMD use normalized TPM/FPKM expression values as input, while SCDE, D3E, and DEsingle use read counts obtained from RSEM as input. D3E runs on Python, while all other methods are developed as an R package. In the following sections, we provide the details of the tools.

Table 1 Software tools for identifying DE genes using scRNAseq data Full size table

Differential gene expression analysis methods for scRNAseq data

Single-cell differential expression (SCDE)

SCDE [21] utilizes a mixture probabilistic model for gene expression values. The observed read counts of genes are modeled as a mixture of drop-out events by a Poisson distribution and amplification components by a negative binomial (NB) distribution:

$$ \left\{\begin{array}{c}{r}_c\sim NB(e)\ \mathrm{for}\ \mathrm{normal}\ \mathrm{amplified}\ \mathrm{genes}\\ {}{r}_c\sim Possion\left({\lambda}_0\right)\ \mathrm{for}\ \mathrm{drop}-\mathrm{out}\ \mathrm{genes}\end{array}\ \right., $$

where e is the expected expression value in cells when the gene is amplified, and λ 0 is always set to 0.1. The posterior probability of a gene expressed at level x in cell c based on observed r c and the fitted model Ω c is calculated by:

$$ p\left(x|{r}_c,{\Omega}_c\right)={p}_d(x){p}_{Possion}\left(x|{r}_c\right)+\left(1-{p}_d(x)\right){p}_{NB}\left(x|{r}_c\right), $$

where p d is the probability of a drop-out event in cell c for a gene expressed at an average level x, and p poisson (x| r c ) and p NB (x|r c ) are the probabilities of observing expression value r c in the cases of drop-out (Poisson) and successful amplification (NB) of a gene expressed at level x in cell c, respectively. Then, after the bootstrap step, the posterior probability of a gene expressed at level x in a subpopulation of cells S is determined as an expected value:

$$ {p}_s(x)=E\left[{\prod}_{c\in B}p\left(x|{r}_c,{\Omega}_c\right)\right], $$

where B is the bootstrap samples of S. Based on the posterior probabilities of gene expression values in cells S and G, p S (x) and p G (x), SCDE uses a fold expression difference f in gene g for the differential expression analysis between subgroups S and G, which is determined as:

$$ p(f)={\sum}_{x\in X}{p}_S(x){p}_G(x), $$

where X is the expression range of the gene g. An empirical p-value is determined to test the differential expression.

Model-based analysis of single-cell transcriptomics (MAST)

MAST [29] proposes a two-part generalized linear model for differential expression analysis of scRNAseq data. One part models the rate of expression level, using logistic regression:

$$ logit\left(p\left({Z}_{ig}=1\right)\right)={X}_i{\beta}_g^D, $$

where Z = [Z ig ] indicates whether gene g is expressed in cell i.

The other part models the positive expression mean, using a Gaussian linear model:

$$ p\left({Y}_{ig}=y|{Z}_{ig}=1\right)=N\left({X}_i{\beta}_g^C,{\sigma}_g^2\right), $$

where Y = [y ig ] is the expression level of gene g in cell i observed Z ig = 1. The cellular detection rate (CDR) for each cell, defined as CDR i = (1/N)∑ g = 1 Z ig (N is the total number of genes), is introduced as a column in the design matrix X i of the logistic regression model and the Gaussian linear model. For the differential expression analysis, a test with asymptotic chi-square null distribution is utilized, and a false discovery rate (FDR) adjustment control [44] is used to decide whether a gene is differentially expressed.

Bayesian modeling framework (scDD)

scDD [39] employs a Bayesian modeling framework to identify genes with differential distributions and to classify them into four situations: 1—differential unimodal (DU), 2—differential modality (DM), 3—differential proportion (DP), and 4—both DM and DU (DB), as shown in Additional file 1: Figure S1. The DU situation is one in which each distribution is unimodal but the distributions across the two conditions have different means. The DP situation involves genes with expression values that are bimodally distributed. The bimodal distribution of gene expression values in each condition has two modes with different proportions, but the two modes across the two conditions are the same. DM and DB situations both include genes whose expression values follow a unimodal distribution in one condition but a bimodal distribution in the other condition. The difference is that, in the DM situation, one of the modes of the bimodal distribution is equal to the mode of the unimodal distribution, whereas in the DB situation, there is no common mode across the two distributions.

Let Y g be the expression value of gene g in a collection of cells. The non-zero expression values of gene g are modeled as a conjugate Dirichlet process mixture (DPM) model of normals, and the zero expression values of gene g are modeled using logistic regression as a separate distributional component:

$$ \left\{\begin{array}{c}\mathrm{nonzero}\ {Y}_g\sim \mathrm{conjugate}\ \mathrm{DPM}\ \mathrm{of}\ \mathrm{normals}\\ {}\mathrm{zero}\ {Y}_g\sim \mathrm{logistic}\ \mathrm{regression}\end{array}\right. $$

For detecting the DE genes, a Bayes factor for gene g is determined as:

$$ {BF}_g=\frac{f\left({Y}_g|{M}_{DD}\right)}{f\left({Y}_g|{M}_{ED}\right)}, $$

where f(Y g | M DD ) is the predictive distribution of the observed expression value from gene g under a given hypothesis, M DD denotes the differential distribution hypothesis, and M ED denotes the equivalent distribution hypothesis that ignores conditions. As there is no solution for the Bayes factor BF g , a closed form is calculated to present the evidence of whether a gene is differentially expressed:

$$ Scor{e}_g=\log \frac{f\left({Y}_g,{Z}_g|{M}_{DD}\right)}{f\left({Y}_g,{Z}_g|{M}_{ED}\right)}=\log \frac{f_{C1}\left({Y}_g^{C1},{Z}_g^{C1}\right){f}_{C2}Y\left({}_g^{C2},{Z}_g^{C2}\right)}{f_{C1,C2}\left({Y}_g,{Z}_g\right)}, $$

where Z g is the vector of the mean and the variance for gene g, and C 1 and C 2 represent the two conditions.

EMDomics

EMDomics [31], a nonparametric method based on Earth Mover’s Distance (EMD), is proposed to reflect the overall difference between two normalized distributions by computing the EMD score for each gene and determining the estimation of FDRs. Suppose P = {(p 1, w p1 ),(p 2, w p2 )…(p m, w pm )} and Q = {(q 1, w q1 ),(q 2, w q2 )… (q n, w qn )} are two signatures, where p i and q j are the centers of each histogram bin, and w pi and w qj are the weights of each histogram bin. The COST is defined as the summation of the multiplication of flow f ij and the distance d ij :

$$ COST\left(P,Q,F\right)={\sum}_{i=1}^m{\sum}_{j=1}^n{f}_{ij}{d}_{ij}, $$

where d ij is the Euclidean distance between p i and q j , and f ij is the amount of weight that need to be moved between p i and q j . An optimization algorithm is used to find a flow F = [f ij ] between p i and q j to minimize the COST. After that, the EMD score is calculated as the normalized minimum COST.

$$ EMD\left(P,Q\right)=\frac{\sum_{i=1}^m{\sum}_{j=1}^n{f}_{ij}{d}_{ij}}{\sum_{i=1}^m{\sum}_{j=1}^n{f}_{ij}} $$

A q-value, based on the permutations of FDRs, is introduced to describe the significance of the score for each gene.

Monocle2

Monocle2 [38] is an updated version of Monocle [32], a computational method used for cell type identification, differential expression analysis, and cell ordering. Monocle applies a generalized additive model, which is a generalized linear method with linear predictors that depend on some smoothing functions. The model relates a univariate response variable Y, which belongs to the exponential family, to some predictor variables, as follows:

$$ h\left(E(Y)\right)={\beta}_0+{f}_1\left({x}_1\right)+{f}_2\left({x}_2\right)+\dots +{f}_m\left({x}_m\right), $$

where h is the link function, such as identity or log function, Y is the gene expression level, x i is the predictor variable that expresses the cell categorical label, and f i is a nonparametric function, such as cubic splines or some other smoothing functions. Specifically, the gene expression level Y is modeled using a Tobit model:

$$ Y=\left\{\begin{array}{c}{Y}^{\ast }\ if\ {Y}^{\ast }>\lambda \\ {}\lambda\ if\ {Y}^{\ast}\le \lambda \end{array}\right., $$

where Y* is a latent variable that corresponds to predictor x, and λ is the detection threshold. For identifying DE genes, we use an approximate chi-square (χ2) likelihood ratio test.

In Monocle2, a census algorithm is used to estimate the relative transcript counts, which leads to an improvement of the accuracy compared with using the normalized read counts, such as TPM values.

Discrete distributional differential expression (D3E)

D3E [33] consists of four steps: 1—data filtering and normalization, 2—comparing distributions of gene expression values for DE genes analysis, 3—fitting a Poisson-Beta model, and 4—calculating the changes in parameters between paired samples for each gene. For the normalization, D3E uses the same algorithm as used by DESeq2 [11] and filters genes that are not expressed in any cell. Then, the non-parametric Cramer-von Mises test or the Kolmogorov-Smirnov test is used to compare the expression values’ distributions of each gene for identifying the DE genes. Alternatively, a parametric method, the likelihood ratio test, can be utilized after fitting a Poisson-Beta model:

$$ {\displaystyle \begin{array}{c} PB\left(n|\alpha, \beta, \gamma, \lambda \right)= Poisson\left(n|\frac{\gamma x}{\lambda}\right)\underset{x}{\bigwedge \limits } Beta\left(x|\alpha, \beta \right)\\ {}=\frac{\gamma^n{e}^{-\frac{\gamma }{\lambda }}\varGamma \left(\frac{\alpha }{\lambda }+\frac{\beta }{\lambda}\right)}{\lambda^n\varGamma \left(n+1\right)\varGamma \left(\frac{\alpha }{\lambda }+\frac{\beta }{\lambda }+n\right)\varGamma \left(\frac{\alpha }{\lambda}\right)}\varPhi \left(\frac{\alpha }{\lambda },\frac{\alpha }{\lambda }+\frac{\beta }{\lambda }+n,\frac{\gamma }{\lambda}\right)\end{array}}, $$

where n is the number of transcripts of a particular gene, α is the rate of promoter activation, β is the rate of promoter inactivation, γ is the rate of transcription when the promoter is in the active state, λ is the transcript degradation rate, and x is the auxiliary variable. The parameters α, β, and γ can be estimated by moments matching or Bayesian inference method, but λ should be known and assumed to be constant.

SINCERA

SINCERA [34] is a computational pipeline for single cell downstream analysis that enables pre-processing, normalization, cell type identification, differential expression analysis, gene signature prediction, and key transcription factors identification. SINCERA calculates the p-value for each gene from two groups based on a statistical test to identify the DE genes. It provides two methods: one-tailed Welch’s t-test for genes, assuming they are from two independent normal distributions, and the Wilcoxon rank sum test for small sample sizes. Last, the FDRs are adjusted, using the Benjamini and Hochberg method [44].

edgeR

edgeR [4] is a negative binomial model-based method to determine DE genes. It uses a weighted trimmed mean of the log expression ratios to normalize the sequencing depth and gene length between the samples. Then, the expression data are used to fit a negative binomial model, whereby the mean μ and variance ν have a relationship of ν = μ + αμ2, and α is the dispersion factor. To estimate the dispersion factor, edgeR combines a common dispersion across all the genes, estimated by a likelihood function, and a gene-specific dispersion, estimated by the empirical Bayes method. Last, an exact test with FDR control is used to determine DE genes.

DESeq2

DESeq2 [43] is an advanced version of DESeq [11], which is also based on the negative binomial distribution. Compared with the DESeq, which uses a fixed normalization factor, the new version of DESeq2 allows the use of a gene-specific shrinkage estimation for dispersions. When estimating the dispersion, DESeq2 uses all of the genes with a similar average expression. The fold-change estimation is also employed to avoid identifying genes with small average expression values.

DEsingle

DEsingle [36] utilizes a ZINB regression model to estimate the proportion of the real and drop-out zeros in the observed expression data. The expression values of each gene in each population of cells are estimated by a ZINB model. The probability mass function (PMF) of the ZINB model for read counts of gene g in a group of cells is:

$$ {\displaystyle \begin{array}{c}P\left({N}_g=n|\theta, r,p\right)=\theta \bullet I\left(n=0\right)+\left(1-\theta \right)\bullet {f}_{NB}\left(r,p\right)\\ {}=\theta \bullet I\left(n=0\right)+\left(1-\theta \right)\bullet \left(\begin{array}{c}n+r-1\\ {}n\end{array}\right){p}^n{\left(1-p\right)}^r\end{array}}, $$

where θ is the proportion of constant zeros of gene g in the group of cells, I(n = 0) is an indicator function, f NB is the PMF of the NB distribution, r is the size parameter and p is the probability parameter of the NB distribution. By testing the parameters (θ, r, and p) of two ZINB models for the two different groups of cells, the method can classify the DE genes into three categories: 1—different expression status (DEs), 2—differential expression abundance (DEa), and 3—general differential expression (DEg). DEs represents genes that they show significant different proportion of cells with real zeros in different groups (i.e. θs are significantly different) but the expression of these genes in the remaining cells show no significance (i.e. r, and p show no significance). DEa represents genes that they show no significance in the proportion of real zeros, but show significant differential expression in remaining cells. DEg represents genes that they not only have significant difference in the proportion of real zeros, but also significantly expressed differentially in the remaining cells.

SigEMD

SigEMD [37] employs logistic regression to identify the genes that their zero counts significantly affect the distribution of expression values; and employs Lasso regression to impute the zero counts of the identified genes. Then, for these identified genes, SigEMD employs EMD, similar to EMDomics, for differential analysis of expression values’ distributions including the zero values; while for the remaining genes, it employs EMD for differential analysis of expression values’ distributions ignoring the zero values. The regression model and data imputation declines the impact of large amounts of zero counts, and EMD enhances the sensitivity of detecting DE genes from multimodal scRNAseq data.

Datasets

In this work, we used both simulated and real data to evaluate the performance of the differential expression analysis tools.

Simulated data

As we do not know exactly the true DE genes in real single-cell data, we used simulated data to compute the sensitivities and specificities of the eleven methods. Data heterogeneity (multimodality) and sparsity (large number of zero counts), which are the main characteristics of scRNAseq data, are modeled in simulated data. First, we generated 10 datasets, including simulated read counts in the form of log-transformed counts, across a two-condition problem by employing a simulation function from the scDD package [30] in R programing language [45]. For each condition, there were 75 single cells with 20,000 genes in each cell. Among the total 20,000 genes, 2000 genes were simulated with differential distributions, and 18,000 genes were simulated as non-DE genes. The 2000 DE genes were equally divided into four groups, corresponding to the DU, DP, DM, and DB scenarios (Additional file 1: Figure S1). Examples of these four situations from the real data are shown in Fig. 1a. From the 18,000 non-DE genes, 9000 genes were generated, using a unimodal NB distribution (EE scenario), and the other 9000 genes were simulated using a bimodal distribution (EP scenario). All of the non-DE genes had the same mode across the two conditions. Then, we simulated drop-out events by introducing large numbers of zero counts. To introduce zero counts, first, we built the cumulative distribution function (CDF) of the percentage of zeros of each gene, using the real data, F X (x). Then, in the simulated data for each gene, we randomly selected c (c~ F X (x)) cells from the total cells for half of the genes in each scenario and forced their expression values to zero (10,000 genes in total). Thus, the CDF of the percentage of zeros of each gene is similar between the simulated and real data (Additional file 1: Figure S2). This way, the distribution of the total counts in the simulated data is more similar to real data, which enables us to assess the true positives (TPs) and false positives (FPs) more accurately.

Real data

We used the real scRNAseq dataset provided by Islam et al. [46] as the positive control dataset to compute TP rates. The datasets consist of 22,928 genes from 48 mouse embryonic stem cells and 44 mouse embryonic fibroblasts. The count matrix is available in the Gene Expression Omnibus (GEO) database with Accession No. GSE29087. To assess TPs, we used the already-published top 1000 DE genes that are validated through qRT-PCR experiments [47] as a gold standard gene set [21, 40, 42].

We also used the dataset from Grün et al. [48] as the negative control dataset to assess FPs. We retrieved 80 pool-and-split samples that were obtained under the same condition from the GEO database with Accession No. GSE54695. By employing random sampling from the 80 samples, we generated 10 datasets to obtain the statistical characteristics of the results. For each generated dataset, we randomly selected 40 out of the 80 cells as one group and considered the remaining 40 cells as the other group [42]. Because all of the samples are under the same condition, there should be no DE genes in these 10 datasets.

In the preprocessing of the real datasets, we filtered out genes that are not expressed in all cells (zero read counts across all cells), and we used log-transformed transcript per millions (TPM) values as the input.