The GO is a vocabulary that describes the functions and locations of genes and its terms are arranged in a DAG structure, i.e. every node has zero, one or more parents and children. A protein can be assigned to one or multiple terms from each domain of GO [7]. The TPR of the GO-DAG implies that when a protein is known to be assigned to a particular GO term, it should also be assigned to all ancestor terms. In contrast, when a protein is known not to be a member of a GO term, it should not be a member of any of all the successors of that term. By GO-DAG consistency we denote satisfaction of the TPR (also see Table 1). In terms of prediction, given a probably inconsistent vector of input probabilities, one has to find the most probable multiple and consistent GO-DAG paths that the protein has to be annotated to.

Table 1 Parent-child relationship in a GO-DAG Full size table

Naturally, methods that treat GO terms independently and neglect the DAG structure of the GO can make predictions that are inconsistent. In particular for probabilistic methods those inconsistencies may appear in the form of p i >p j in which the term j is an ancestor of term i, and thus more general. In this study, we aim to find the most probable consistent GO term assignments, using such probability vectors as input. We first describe the general probabilistic setting, then derive two likelihood based objective functions and finally an evolutionary algorithm for the optimization.

Bayesian network modelling of GO

Consider a Directed Acyclic Graph (DAG) G=(V,E) with nodes V (denoting the set of GO terms) and E directed edges (the set of parent-child relationships). Vector θ denotes the input probability vector which is |V| - dimensional and x is the corresponding binary labeling, where x g =1 denotes membership for a particular protein to the g-th GO term in V GO term.

We model the GO-DAG as a Bayesian Network, with density for x:

p x ∣ G , θ = ∏ g = 1 ∣ V ∣ p x g ∣ x pa ( g ) (1)

where p a(g) denotes the parent set of node g and x p a(g) the set of labels that correspond to those parents.

The probability p(x g ∣x p a(g) ), under the DAG constraints, is given using the Conditional Probability Table (CPT) of Table 2. The table shows that when m i n(x p a(g) )=0 (i.e. at least one of the parents has label 0) then x g =0 with probability 1. Otherwise x g =1 with probability θ g and x g =0 with probability 1−θ g . Note that all inconsistent labelings have zero probability.

Table 2 Conditional probability table, under the DAG constraints Full size table

Given equation 1 and conditional probability tables with parameters = (θ 1 ,⋯,θ ∣V∣ ), one wishes to identify the most probable labelings vector x. There are two challenges in this. The first is how to choose the parameter vector θ, discussed in this section, and the second is how to search for the most probable labelings vector x, which is discussed in the next section.

Most computational methods for GO term prediction are developed under a multi-class classification framework, where each GO term denotes a class and for each protein the probability of being member of that class is evaluated by the method. Classes are arranged according to a DAG hierarchy and further each protein may belong to one or multiple classes. In GOStruct [13] a SVM approach was developed to perform multi-class classification in a single step. However, the vast majority of the methods split the multi-class problem in multiple binary classification ones (i.e. one versus all) and therefore act per GO term and disregard the GO hierarchy. GeneMANIA [19], Kernel Logistic Regression [20] and BMRF [17] propagate function information through networks of protein associations and this operation is performed per GO term. Blast2GO [21], GO-At [22] and Aranet [23] perform overrepresentation analysis for each GO term separately. Such methods do not return the conditional probabilities in the sense of equation 1. The membership probabilities that they return are perhaps best viewed as marginal probabilities, i.e. summed over all configurations for GO terms other than the specific term g. We might have tried to retrieve θ from the relation between marginal and conditional probabilities, but this is certainly not an easy way. We attempted other ways.

Methods such as BMRF return low probabilities for detailed GO terms and high ones for general terms. Prioritization of the proteins in a particular GO term can then be achieved by simply sorting them. By contrast, prioritization of GO terms for a particular protein (a more important task) is not simple as the sets of probabilities for different GO terms are not directly comparable. To make them comparable, the probabilities need to be calibrated. We derive two approaches.

The first, called DeltaL is based on the maximization of the difference of the likelihood and prior probability of the labelings as they are defined in equation (1). The second, called LogitR, is based on explicit calibration of the input probability vector.

For DeltaL, we modify the objective function of equation (1) by incorporating the prior probabilities of membership θ∗. The prior probability θ g ∗ depends on the generality of the GO term g (i.e. the class size) and is estimated as the fraction of the total proteins annotated to that GO term. We use the log ratio between the input probability and the prior, log ( p ( x g ∣ θ g ) / p ( x g ∣ θ g ∗ ) ) as score function for the labeling of the g-th GO term. For x g =1 the score is equal to log ( θ g / θ g ∗ ) , while for x g =0 it takes the value of log ( ( 1 − θ g ) / ( 1 − θ g ∗ ) ) . When θ g > θ g ∗ then x g =1 maximizes the function. In the opposite case x g =0 gives the maximum. The extended function is given by the difference of the log likelihoods:

ΔL X ; θ , θ ∗ = ∑ g = 1 ∣ V ∣ log p x g ∣ θ g p x g ∣ θ g ∗ , (2)

giving

ΔL X ; θ , θ ∗ = log p ( X ∣ θ ) − log p X ∣ θ ∗ . (3)

Note that when the input probabilities are very close to the priors, the objective function of DeltaL becomes multimodal.

In LogitR optimization of equation 1 is performed on a calibrated input probability vector. The calibration is done as follows:

logit θ cg = logit θ g ∗ + α logit θ g − logit θ g ∗ (4)

where θ c g is the calibrated probability for node g and can be calculated using the inverse of the logistic transformation, θ g ∗ is the prior probability of membership for node g and α a slope parameter. In this objective function, when the posterior probability θ=θ∗ then the probability of membership is equal to θ∗ (Figure 2A). As θ deviates from the prior, the calibrated probability θ c g changes according to the logistic function given θ g and α (Figure 2B). The α parameter was tuned using Saccharomyces cerevisiae data. In particular, for a range of α=1,1.5,2.0,2.5,3.0 the LogitR approach was applied taking as input BMRF based predictions obtained from a previous study [17] before March 2010. The evaluation set consisted of 327 proteins that were annotated after March 2010, according to the GO annotation file of July 2011. The relevant part of GO DAG contained 423 terms from Biological Process. For each value of α the prediction performance was measured using the F-score, which is the harmonic mean of precision and recall. The largest F-score was obtained for α=2 and therefore we fixed α to that value.

Figure 2 Calibration of posterior probabilities using α=2 . A. Calibrated probabilities (y-axis) against the posterior probabilities (x-axis) when the prior is equal to 0.2. B. Image plot, for the entire range of prior and posterior probabilities. The colors denote the calibrated probabilities. Full size image

Optimization by differential evolution: The FALCON algorithm

The DeltaL in equation(3) and LogitR in equation(4) approaches do not involve directly the TPR constraints. We develop an optimization algorithm inspired from Differential Evolution (DE) [16] that by construction is restricted to the subspace of consistent labelings. We call our algorithm Functional Annotation with Labeling CONsistency (FALCON). In general, DE works by evolving a population of candidate solutions to explore the search space and retrieve the maximum. Because DE is derivative free, it has appealing global optimization properties. Also, it is suitable for optimization in discrete spaces (like the labelings space in our problem).

The graph representation of the labelings is helpful to explain how the algorithm works. Given the graph G and its corresponding labeling X, we define a reduced graph R = ( V R , E R ) which contains the nodes with corresponding labels x=1. If X is consistent, in the TPR sense, R will be a connected sub-graph of G and maintaining the original structure for the V R nodes. Consider two labelings L 1 , L 2 and their graphs R 1 , R 2 respectively is given in Figure 3. Graph union R 1 ∪ R 2 gives the expanded graph R union = ( V R 1 ∪ V R 2 , E R 1 ∪ E R 2 ) , while graph intersection R 1 ∩ R 2 , gives the contracted one R int = ( V R 1 ∩ V R 2 , E R 1 ∩ E R 2 ) . The nodes that will be included in the resulting graph are given by set operations (i.e. ( V R 1 ∪ V R 2 ) and ( V R 1 ∩ V R 2 ) respectively), but also equivalently by performing logical OR (for union), X 1 ∨X 2 , and logical AND (for intersection), X 1 ∧X 2 operations on the labelings directly. Table 3 and Figure 3 illustrate those operations.

Figure 3 Examples of graph (upper row) and logical (lower row) operations, using the DAG structure of Figure 1A. Full size image

Table 3 Logical operations OR and AND for all the combinations of labels Full size table

Operations between consistent graphs (labelings) result in consistent graphs (labelings) as well, because the edge set of the last is the union or the intersection of the operands and therefore a particular edge has to pre-exist in at least one of the operands without violating the TPR. This property can be seen as follows: For any parent-child pair of nodes there are three types of configurations that are consistent (Table 1). Graph union and intersection between any combination of those pairs leads to locally consistent labeling. This holds for all the parent-child pairs, so it holds for the full labeling. Therefore the outcome of graph union and intersection will be consistent as well. Further, operations between more than two labelings will be consistent as well due to the associativity property. The FALCON optimization algorithm is based on the generation and evolution of a population of subgraphs i.e. R 1 ... R N , with N=2∣V∣. The population is first initialized with consistent labelings (graphs) and evolved exploiting the graph-union and graph-intersection operations between individuals. Through the generations, all the constructed labelings will be consistent due to the abovementioned property. In our optimization problem we used four strategies to propose a new candidate solution (labelings) for the i-th graph R i , that is member of the population:

S1: Global Union R Cand = R 1 ∨ R 2 ∨ e

S2: Global Intersection R Cand = R 1 ∧ R 2 ∨ e

S3: Local Union R Cand = R i ∨ R 1 ∨ R 2 ∨ e

S4: Local Intersection R Cand = R i ∧ R 1 ∨ R 2 ∨ e

The first two types are called global because they do not involve R i while the latter two are local moves. Graph e is a random subgraph of the original full graph (i.e. GO-DAG), constructed by sampling a random node and all its ancestors. e ensures that all consistent configurations can be eventually proposed and reached.

With f the objective function i.e. being DeltaL or LogitR, the scheme of the FALCON algorithm is as follows: Initialize Population R of size N=2∣V∣ by picking random consistent vectors (see below): while Convergence or Maximum generations not reached do for i=1 to N do Sample two labelings from the population R 1 , R 2 ≠ R i Construct R Cand using the a randomly picked strategy S 1,S 2,S 3,S 4 if f ( R Cand ) > f ( R i ) then, R i : = R Cand end for end while

Initialization of the population for DeltaL is done by random sampling GO terms according to their individual score (log ratio of the input and prior probability), while LogitR by sampling from the binomial distribution with probability equal to the calibrated one. In both cases the nodes were up-propagated in order to construct a consistent labeling. The computation was terminated after 10,000 generations or after reaching a plateau (i.e. there is no improvement in the objective function for 100 generations). Finally we point that a valid Markov Chain Monte Carlo algorithm cannot be derived using those proposal strategies because they do not represent reversible moves. The bitwise exclusive OR move proposed by Sterns in [24] is reversible but does not lead to consistent labelings. Implementation of the algorithm was done in R language for Statistical Computing and using the igraph R package [25].

Performance evaluation

We evaluated the performance of the FALCON algorithm on the DeltaL and LogitR objective function using Precision, Recall and F-score. Precision is defined as the percentage of correct GO terms in the list of the GO predictions. Recall is equal to the percentage of the GO assignments that were identified and F-score is the geometric mean of the Precision and Recall.

Simulated data

First, we tested the capability of FALCON to retrieve the most probable graph using the full graphs in Figure 1 with hundred simulated probability vectors. The first contains six nodes and the second fifteen. Because the graphs are small, exhaustive search of the most probable labeling was computationally tractable. We generated a hundred random probability vectors, by sampling probabilities for each node from the uniform distribution. Then we identified the most probable labeling for each simulated probability vector and the one returned by FALCON using equation (1) as objective function. Performance measures were calculated by comparing the vectors obtained by FALCON with the most probable ones as calculated from the exhaustive search.

Real data

The performance of FALCON was further evaluated using as input the GO membership probabilities of the Arabidopsis proteins as computed by BMRF in [18]. This method provides membership probabilities per GO term independently. We constructed two evaluation datasets from those data. First, we randomly picked 100 Arabidopsis proteins that were already annotated at the time of computing the BMRF posterior probabilities. One constraint was that they should have at least fifty annotations (after up-propagating their original annotations). In this way we ensured that they were annotated in rather detailed GO terms, and therefore the attempt to get GO-DAG consistent predictions would be sensible. Although these proteins had a fixed labeling in the computations, BMRF can calculate membership probabilities for them, by reconstitution, i.e. as if they were unknown. The second dataset consisted of 387 proteins that were annotated later than the date of the BMRF computations. Thus, at the time of the computation the proteins were not annotated. We used this second set of proteins to evaluate the performance of FALCON in realistic conditions. In addition, we obtained a further list of predictions using the supervised approach proposed in [18]. In this approach, from the posterior probabilities of the annotated proteins, an F-score based optimal threshold was calculated per GO term. Using this approach, called maxF, we derived a set of predictions for each evaluation dataset. Note that those lists are not guaranteed to be GO-DAG consistent.