As we mentioned earlier, analysing time-course microarray data by means of Gaussian process (GP) regression is not a new idea (cf. Background). In this section we review the methodology to estimating the continuous trajectory of a gene expression by GP regression and subsequently describe a likelihood-ratio approach to ranking the differential expression of its profile. The following content is based on the key components of GP theory as described in [21, 22].

The Gaussian process model

The idea is to treat trajectory estimation given the observations (gene expression time-series) as an interpolation problem on functions of one dimension. By assuming the observations have Gaussian-distributed noise, the computations for prediction become tractable and involve only the manipulation of linear algebra rules.

A finite parametric model

We begin the derivation of the GP regression model by defining a standard linear regression model (a more concrete example of such a model is for ϕ = (1, x, x2)⊤, i.e. a line mapped to a quadratic curve)

(1)

where gene expression measurements in time y = {y n } n = 1..N are contaminated with white Gaussian noise and the inputs (of time) are mapped to a feature space Φ = {ϕ (x n )⊤} n = 1..N . Furthermore, if we assume the noise to be i.i.d. (identically and independently distributed) as a Gaussian with zero mean and variance

(2)

then the probability density of the observations given the inputs and parameters (data likelihood) is Gaussian-distributed

(3)

Where .

Introducing Bayesian methodology

Now turning to Bayesian linear regression, we wish to encode our initial beliefs about the parameters w by specifying a zero mean, isotropic Gaussian distribution as a prior over the parameters

(4)

By integrating the product of the likelihood × prior with respect to the parameters, we get the marginal likelihood

(5)

which is jointly Gaussian. Hence the mean and covariance of the marginal are

(6)

(7)

(8)

By computing the marginal likelihood in eq. (8), we can compare or rank different models, without fear of overfitting on the data, or having to explicitly apply a regulariser to the likelihood; the marginal likelihood implicitly penalises too complex models [21, sec. 5.4].

Notice in eq. (7) how the structure of the covariance implies that choosing a different feature space Φ results in a different K y . Whatever K y is, it must satisfy the following requirements to be a valid covariance matrix of the GP:

Kolmogorov consistency , which is satisfied when K ij = K ( x i , x j ) for some covariance function K , such that all possible K are positive semidefinite ( y ⊤ Ky ≥ 0).

Exchangeability, which is satisfied when the data are i.i.d.. It means that the order in which the data become available has no impact on the marginal distribution, hence there is no need to hold out data from the training set for validation purposes (for measuring generalisation errors, etc.).

Definition of a Gaussian process

More formally, a Gaussian process is a stochastic process (or collection of random variables) over a feature space, such that the distribution p (y(x 1 ), y(x 2 ),..., y(x n )) of a function y(x), for any finite set of points {x 1 , x 2 , ..., x n } mapped to that space, is Gaussian, and such that any of these Gaussian distributions is Kolmogorov consistent.

If we remove the noise term I from K y in eq. (7) we can have noiseless predictions of f(x) rather than y(x) = f(x) + ε. However, when dealing with finite parameter spaces K f may be ill-conditioned (cf. sec. SE derivation), so the noise term guarantees that K y will have full rank (and an inverse). Having said that, we now formulate the GP prior over the latent function values f by rewriting eq. (8) as

(9)

where the mean function (usually defined as the zero function) and the covariance function respectively are

(10)

(11)

The squared-exponential kernel

In this paper we only use the univariate version of the squared-exponential (SE) kernel. But before embarking on its analysis, the reader should be aware of the existing wide variety of kernel families, and potential combinations of them. A comprehensive review of the literature on covariance functions is found in [21, chap. 4].

Derivation and interpretation of the SE kernel

In the GP definition section we mentioned the possibility of an ill-conditioned covariance matrix. In the case of a finite parametric model (as in eq. (1)), K f can have at most as many non-zero eigenvalues as the number of parameters in the model. Hence for any problem of any given size, the matrix is non-invertible. Ensuring K f is not ill-conditioned, involves adding the diagonal noise term to the covariance. In an infinite-dimensional feature space, one would not have to worry about this issue as the features are integrated out and the covariance between datapoints is no longer expressed in terms of the features but by a covariance function. As demonstrated in [22, sec.45.3] and [21, sec.4.2.1], with an example of a one-dimensional dataset, we express the covariance matrix K f in terms of the features Φ

(12)

then by considering a feature space defined by radial basis functions and integrating with respect to their centers h, eq. (12) becomes

(13)

where one ends up with a smooth (infinitely differentiable) function on an infinite-dimensional space of (radial basis function) features. Taking the constant out front as a signal variance and squaring the exponential gives rise to the standard form of the univariate squared-exponential (SE) covariance function. The noisy univariate SE kernel -- the one used in this paper is

(14)

The SE is a stationary kernel, i.e. it is a function of d = x i - x j which makes it translation invariant in time. δ ij is the Kronecker delta function which is unity when i = j and zero otherwise and l2 is the characteristic lengthscale which specifies the distance beyond which any two inputs (x i , x j ) become uncorrelated. In effect, the lengthscale l2 governs the amount that f varies along the input (time). A small lengthscale means that f varies rapidly along time, and a very large lengthscale means that f behaves almost as a constant function, see Figure 5. This parameterisation of the SE kernel becomes very powerful when combined with hyperparameter adaptation, as described in a following section. Other adapted hyperparameters include the signal variance which is a vertical scale of function variation and the noise variance (introduced in eq. (2)) which is not a hyperparameter of the SE itself, but unless we consider it as a constant in the noisy case, its adaptation can give different explanations about the latent function that generates the data.

Figure 5 Gaussian process fit on expression profile of gene Cyp1b1 in the experimental mouse data. Figure 5: A GP fitted on the centred profile of the gene Cyp1b1 (probeID 1416612_at in the GSE10562 dataset) with different settings of the lengthscale hyperparameter ℓ2. The blue crosses represent zero-mean hybridised gene expression in time (log2 ratios between treatment and control) and the shaded area indicates the point-wise mean plus/minus two times the standard deviation (95% confidence region). (a) Mean function is constant as ℓ2 → ∞ (0 inverse lengthscale in eq. (14)) and all of the observed data variance is attributed to noise ( ). (b) The lengthscale is manually set to a local-optimum large value (ℓ2 = 30) and thus the mean function roughly fits the data-points. The observed data variance is equally attributed to signal ( ) and noise ( ). Consequently, the GP features high uncertainty in its predictive curve. (c) The lengthscale is manually set to a local-optimum small value (ℓ2 = 15.6) and thus the mean function tighly fits the data-points with high certainty. The interpretation from the covariance function in this case is that the profile contains a minimal amount of noise and that most of the observed data variance is attributed to the underlying signal ( ). (d) The contour of the corresponding LML function plotted by an exhaustive search of ℓ2 and SNR values. The two main local-optima are indicated by the green dots and a third optimum that corresponds to the 1st panel appears almost as flat in the contour and its vicinity encompasses the whole lengthscale axis for very small values of SNR (i.e. given that SNR ≈ 0, the lengthscale is trivial). Full size image

One can also combine covariance functions as long as they are positive-definite. Examples of valid combined covariance functions include the sum and convolution of two covariance functions. In fact, eq. (14) is a combined sum of the SE kernel with the covariance function of isotropic Gaussian noise.

Gaussian process prediction

To interpolate the trajectory of gene expression at non-sampled time-points, as illustrated in Figure 5, we infer a function value f * at a new input (non-sampled time-point) x * , given the knowledge of function estimates f at known time-points x. The joint distribution p(f * , f ) is Gaussian, hence the conditional distribution p(f * | f ) is also Gaussian. In this section we consider predictions using noisy observations; we know the noise is Gaussian too, so the noisy conditional distribution does not differ. By Bayes' rule

(15)

where the Gaussian process prior over the noisy observations is

(16)

Predictive equations for GP regression

We start by defining the mean function and the covariance between a new time-point x * and each of the ith known time-points, where i = 1..N

(17)

(18)

For every new time-point a new vector k * is concatenated as an additional row and column to the covariance matrix K C to give

(19)

where C = N..N * is incremented with every new k * added to K C . By considering a zero mean function and eq. (19), the joint distribution p(f*, y) from eq. (15) can be computed

(20)

Finally, to derive the predictive mean and covariance of the posterior distribution from eq. (15) we use the Gaussian identities presented in [21, sec.A.2]. These are the predictive equations for GP regression of a single new time-point

(21)

(22)

(23)

and K f = K f (x, x). These equations can be generalised easily for the prediction of function values at multiple new time-points by augmenting k * with more columns and k(x*, x*) with more components, one for each new time-point x*.

Hyperparameter learning

Given the SE covariance function, one can learn the hyperparameters from the data by optimising the log-marginal likelihood function of the GP. In general, a non-parametric model such as the GP can employ a variety of kernel families whose hyperparameters can be adapted with respect to the underlying intensity and frequency of the local signal structure, and interpolate it in a probabilistic fashion (i.e. while quantifying the uncertainty of prediction). The SE kernel allows one to give intuitive interpretations of the adapted hyperparameters, especially for one-dimensional data such as a gene expression time-series, see Figure 5 for interpretations of various local-optima.

Optimising the marginal likelihood

In the context of GP models the marginal likelihood results from the marginalisation over function values f

(24)

where the GP prior p(f|x) is given in eq. (9) and the likelihood is a factorised Gaussian . The integral can be evaluated analytically [21, sec. A.2] to give the log-marginal likelihood (LML, it is common practice to take the log of the antiderivative for the sake of numerical stability, as it yields a sum instead of a product)

(25)

We notice that the marginal here is explicitly conditioned on θ (hyperparameters) to emphasise that it is a function of the hyperparameters through K f . To optimise the marginal likelihood we take the partial derivatives of the LML with respect to the hyperparameters

(26)

We use scaled conjugate gradients[32] -- a standard optimisation scheme -- to maximise the LML.

Ranking with likelihood-ratios

Alternatively, one may choose to go "fully Bayesian" by placing a hyper-prior over the hyperparameters p(θ| ), where represents some type of model, and compute a posterior over hyperparameters

(27)

based on some initial beliefs, such as the functions having large lengthscales, and optimise the marginal likelihood so that the optimum lengthscale tends to a large value, unless there is evidence to the contrary. Depending on the model , the integral in eq. (27) may be analytically intractable and thus one has to resort to approximating this quantity [33] (e.g. Laplace approximation) or using Markov Chain Monte Carlo (MCMC) methods to sample from the posterior distribution [34].

In the case where one is using different types of models (e.g. with different number of hyperparameters), a Bayesian-standard way of comparing between such two models is through Bayes factors [11, 19, 23] -- ratios of the integral quantities in eq. (27)

(28)

where the models usually represent two different hypotheses, namely - the profile has a significant underlying signal and thus it is truly differentially expressed and - there is no underlying signal in the profile and the observed gene expression is just the effect of random noise. The ranking is based on how likely in comparison to , given a profile.

In this paper we present a much simpler -- but effective to the task -- approach to ranking the differential expression of a profile. Instead of integrating out the hyperparameters, we approximate the Bayes factor with a log-ratio of marginal likelihoods (cf. eq. (25))

(29)

with each LML being a function of different instantiations of θ. We still maintain hypotheses and that represent the same notions explained above, but in our case they differ simply by configurations of θ. Specifically, on the hyperparameters are fixed to θ 1 = (∞, 0; var(y))⊤ to encode a function constant in time (l2 → ∞), with no underlying signal , which generates a time-series with a variance that can be solely explained by noise . Analogously, on the hyperparameters θ 2 are initialised to encode a function that fluctuates in accordance to a typical significant profile (e.g. ℓ2 = 20), with a distinct signal variance that solely explains the observed time-series variance and with no noise .

Local optima of the log-marginal likelihood (LML) function

These two configurations correspond to two points in the three-dimensional function that is the LML, both of which usually lie close to local-optimum solutions. This assumption can be verified, empirically, by exhaustively plotting the LML function for a number of profiles, see Figure 5. In case the LML contour differs for some profiles, more initialisation points should be used to ensure convergence to the maximum-likelihood solution. Because the configuration of the second hypothesis (no noise, ) is an extremely unlikely scenario, we let θ 2 adapt to a given profile by optimising the LML function, as opposed to keeping it fixed like θ 1 .

In most cases the LML (eq. (25)) is not convex. Multiple optima do not necessarily pose a threat here; depending on the data and as long as they have similar function values, multiple optima present alternative interpretations on the observations. To alleviate the problem of spurious local optimum solutions however, we make the following observation: when we explicitly restrict the signal variance hyperparameter ( ) to low values during optimisation, we also implicitly restrict the noise variance hyperparameter ( ) to large values. This occurs as the explanation of the observed data variance (var(y)) is shared between the signal and noise variance hyperparameters, i.e. . This dependency allows us to treat the three-dimension optimisation problem as a two-dimension problem, one of lengthscale ℓ 2 and one of signal-to-noise ratio without fear of missing out an optima.

Figure 5 illustrates the marginal likelihood as a function of the characteristic lengthscale ℓ2 and the SNR. It features two local optima, one for a small lengthscale and a high SNR, where the observed data are explained with a relatively complex function and a small noise variance, and one optimum for a large lengthscale and a low SNR, where the data are explained by a simpler function with high noise variance. We also notice that the first optimum has a lower LML. This relates to the algebraic structure of the LML (eq. (25)); the first term (dot product) promotes data fitness and the second term (determinant) penalizes the complexity of the model [21, sec.5.4]. Overall, the LML function of the Gaussian process offers a good fitness-complexity trade-off without the need for additional regularisation. Optionally, one can use multiple initialisation points focusing on different non-infinite lengthscales to deal with the multiple local optima along the lengthscale axis, and pick the best solution (max LML) to represent the hypothesis in the likelihood-ratio during the ranking stage.

Source code

The source code for the GP regression framework is available in MATLAB code http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/gp/ and as a package for the R statistical computing language http://cran.r-project.org/web/packages/gptk/. The routines for the estimation and ranking of the gene expression time-series are available upon request for both languages. The time needed to analyse the 22690 profiles in the experimental data, with only the basic two initialisation points of hyperparameters, is about 30 minutes on a desktop running Ubuntu 10.04 with a dual-core CPU at 2.8 GHz and 3.2 GiB of memory.