Construction of the mammalian PPIN

Eighteen mammalian PPIN datasets and databases were combined (Table 1). To consolidate interactions, mouse identifiers were converted to their human orthologs using Homologene. Interactions without PMIDs and unary interactions were dropped. In addition, 134,590 PPIs from publications that reported more than 10 interactions were excluded from most analyses. The resulting mammalian PPIN consists of 50,478 PPIs covering 9384 proteins, extracted from 34,853 publications with discovery dates ranging from April 1967 to October 2013. The yeast (Saccharomyces cerevisiae) PPIN was downloaded from iRefWeb 4.1 [23], including only experimental physical interactions, filtering out unary interactions, and excluding from most analyses 82,391 PPIs from publications associated with more than 10 interactions. The yeast PPIN has 9678 PPIs between 3154 proteins, extracted from 6208 publications with discovery dates ranging from June 1946 to November 2011.

Table 1 Mammalian PPINs resources

Entropy calculation

We define the entropy of the sequence of discovery times for PPIs involving a given protein i, with known degree \( {\tilde{k}}_i \), by:

$$ S\left({\tilde{k}}_i\right)=-\sum_{j=1}^{{\tilde{k}}_i}\frac{f_j}{{\tilde{k}}_i}\log \frac{f_j}{{\tilde{k}}_i} $$ (1)

where \( f_j \) is the number of discovered PPIs involving protein i in the j th time interval. The intervals are defined by taking the period from the time at which protein i was first observed until the final observation in the whole dataset, and dividing it into \( {\tilde{k}}_i \) equal-sized bins. This entropy measure was also normalized by dividing by the maximum possible entropy \( \log \left({\tilde{k}}_i\right) \).
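The binning and normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function name `discovery_entropy`, the numeric time representation, and the handling of a discovery that falls exactly on the final observation time (assigned to the last bin) are assumptions.

```python
import math

def discovery_entropy(times, t_end, normalize=True):
    """Entropy of the discovery times of one protein, following Eq. 1.

    times : discovery times of the PPIs involving the protein (any numeric unit)
    t_end : time of the final observation in the whole dataset
    """
    k = len(times)              # known degree of the protein
    if k < 2:
        return 0.0              # log(1) = 0, so the entropy is trivially zero
    t0 = min(times)             # first observation of the protein
    width = (t_end - t0) / k    # k equal-sized bins between t0 and t_end
    counts = [0] * k
    for t in times:
        # Assumption: a discovery exactly at t_end is placed in the last bin.
        j = min(int((t - t0) / width), k - 1) if width > 0 else 0
        counts[j] += 1
    # Empty bins contribute nothing (0 * log 0 is taken as 0).
    s = -sum((f / k) * math.log(f / k) for f in counts if f > 0)
    return s / math.log(k) if normalize else s

evenly_spread = discovery_entropy([0, 1, 2, 3], t_end=4)   # one discovery per bin
two_bursts = discovery_entropy([0, 0, 3, 3], t_end=4)      # two bins hold everything
```

With one discovery per bin the normalized entropy reaches its maximum of 1, while concentrating the discoveries in fewer bins lowers it.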

Random data permutations

In order to compare the entropy and interval distributions to a null distribution based on uniform randomization of the data, we destroyed the original data order while preserving the frequency distributions by employing random permutations. The first reshuffling method acts globally in time by randomly reassigning the time index to PPI discoveries. The second reshuffling method is local in that it only randomly reassigns time indices from the first appearance of the protein under consideration.
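A minimal sketch of the two reshuffling schemes is given below. It rests on one reading of the description above: the global shuffle permutes time indices across all PPI discoveries, and the local shuffle permutes only those time indices that are not earlier than the first appearance of the chosen protein. The record layout `(time, protein_a, protein_b)` and the function names are hypothetical.

```python
import random

def global_shuffle(records, rng):
    """Randomly reassign time indices across all PPI discoveries."""
    times = [t for t, _, _ in records]
    rng.shuffle(times)
    return [(t, a, b) for t, (_, a, b) in zip(times, records)]

def local_shuffle(records, protein, rng):
    """Reassign time indices only from the first appearance of `protein` onward."""
    first = min(t for t, a, b in records if protein in (a, b))
    idx = [n for n, (t, _, _) in enumerate(records) if t >= first]
    times = [records[n][0] for n in idx]
    rng.shuffle(times)
    out = list(records)
    for n, t in zip(idx, times):
        out[n] = (t, records[n][1], records[n][2])
    return out

# Toy data: (time index, protein, protein)
records = [(1, "A", "B"), (2, "C", "D"), (3, "A", "C"), (4, "B", "C"), (5, "C", "E")]
rng = random.Random(0)
shuffled_all = global_shuffle(records, rng)
shuffled_c = local_shuffle(records, "C", rng)
```

Both functions only move time indices between records, so the multiset of discovery times and the set of interacting pairs are left unchanged.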

Generation of artificial networks for the network discovery model

Underlying networks for the PPI discovery model were generated by five different algorithms which resulted in networks with various global properties. In order to approximate the size of the true underlying mammalian PPIN, we constructed artificial networks with 25,000 nodes and tuned the parameters of the different network construction models to produce networks that have ~650,000 links. These numbers agree with a recent estimate of the size of the human PPIN [24].

The background networks were created as follows: 1) the Barabási-Albert (BA) scale-free network was created using the Barabási-Albert preferential attachment model [25]; 2) the BA cluster network was created using the Holme and Kim algorithm [26], which adds an extra step to the Barabási-Albert preferential attachment model in which, with a probability of 0.995, a link is added to a neighbor of the node, so that the average clustering coefficient is close to that observed for the mammalian LC-PPIN; 3) the duplication-divergence (DD) network was generated using the algorithm of Ispolatov et al. [27] with a link retention probability of 0.6473; and 4) the Erdős-Rényi random network was created using the algorithm of Batagelj and Brandes [28] with a link creation probability of 0.00208. The global properties of the underlying networks are summarized in Table 2.
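As a quick consistency check on the parameters quoted above, the expected number of links in an Erdős-Rényi network with 25,000 nodes and a link creation probability of 0.00208 can be computed directly. The variable names are illustrative only.

```python
n_nodes = 25_000     # number of nodes in each artificial network
p_link = 0.00208     # Erdős-Rényi link creation probability

# Every unordered pair of distinct nodes is linked independently with probability p_link.
possible_pairs = n_nodes * (n_nodes - 1) // 2
expected_links = p_link * possible_pairs

# Each link contributes to the degree of two nodes.
mean_degree = 2 * expected_links / n_nodes
```

The expected link count comes out at about 650,000, in line with the target size, which corresponds to a mean degree of about 52.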

Table 2 Properties of the artificial network models

A model of protein-protein interaction network discovery

The true underlying PPIN is represented by the graph G(V, E), where the vertices V correspond to the set of all proteins and the edges E correspond to the set of all true PPIs. We examine the five different network structures described above in order to study their effect on network discovery dynamics. For a given PPIN, edges are “discovered” by random choice. At a given time step, the probability of discovering the true link between vertices i and j is given by \( {\mu}_{ij}\propto \mu \left({\tilde{k}}_i,{\tilde{k}}_j\right) \), where \( {\tilde{k}}_x \) is the currently known degree of vertex x. The form of the function μ determines the nature of the discovery process in this model. For example,

$$ \mu \left({\tilde{k}}_i,{\tilde{k}}_j\right)\propto \mathrm{Constant} $$ (2)

corresponds to a uniform unbiased discovery of the network in which all true edges are equally likely to be discovered. A biased PPIN discovery process can be modeled simply by:

$$ \mu \left({\tilde{k}}_i,{\tilde{k}}_j\right)\propto 1+{\tilde{k}}_i+{\tilde{k}}_j $$ (3)

In this case there is a process of reinforcement whereby proteins that have many discovered interactions are more likely to be examined for further interactions. Furthermore, we can enhance what Tria et al. [16] refer to as “triggering”, whereby a new discovery triggers adjacent possibilities for subsequent discovery, simply by setting:

$$ \mu \left({\tilde{k}}_i,{\tilde{k}}_j\right)\propto {\tilde{k}}_i+{\tilde{k}}_j $$ (4)

In this case only links that are connected to at least one previously discovered protein can be discovered.
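The three discovery kernels of Eqs. 2 to 4 can be compared with a small simulation such as the sketch below. It is an illustration under stated assumptions rather than the implementation used in the study: one edge is discovered per time step, the function names are hypothetical, and the `n_seed` parameter is an added assumption that discovers the first few edges uniformly, because under Eq. 4 every weight is zero until at least one protein has a known interaction.

```python
import random

def mu_uniform(k_i, k_j):
    return 1.0                    # Eq. 2: unbiased discovery

def mu_reinforced(k_i, k_j):
    return 1.0 + k_i + k_j        # Eq. 3: reinforcement

def mu_triggered(k_i, k_j):
    return float(k_i + k_j)       # Eq. 4: triggering only

def simulate_discovery(edges, mu, rng, n_seed=0):
    """Discover one true edge per step with probability proportional to mu."""
    known = {}                    # vertex -> currently known degree
    hidden = list(edges)          # true edges not yet discovered
    order = []                    # edges in order of discovery
    while hidden:
        # Assumption: the first n_seed edges are drawn uniformly to start the process.
        kernel = mu_uniform if len(order) < n_seed else mu
        weights = [kernel(known.get(i, 0), known.get(j, 0)) for i, j in hidden]
        if sum(weights) <= 0:
            break                 # no edge can be discovered under this kernel
        pick = rng.choices(range(len(hidden)), weights=weights)[0]
        i, j = hidden.pop(pick)
        known[i] = known.get(i, 0) + 1
        known[j] = known.get(j, 0) + 1
        order.append((i, j))
    return order, known

# Toy underlying network: a path of five proteins.
path = [(0, 1), (1, 2), (2, 3), (3, 4)]
order, known = simulate_discovery(path, mu_triggered, random.Random(1), n_seed=1)
```

On this toy path, every edge discovered after the seed touches a protein that already has a known interaction, which is the behavior Eq. 4 is meant to enforce. Swapping in `mu_uniform` or `mu_reinforced` changes only the weighting of the hidden edges.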

In the unbiased case, at times which are far from saturation we expect that the known degree of each protein will increase linearly at a rate which is proportional to its true degree:

$$ {\tilde{k}}_i(t)=\frac{d_i}{2\sum_j d_j}t $$ (5)

where \( d_i \) is the true degree of protein i, and the factor of 2 arises because each link is shared by two nodes. In this case we do not expect any significant acceleration of growth for the nodes, i.e., we expect to discover interactions involving any given protein at a roughly constant rate.

Community structure analysis

The community structure detection algorithm used is based on modularity optimization [29]. The modularity of a partition of the network into communities measures the density of links inside communities compared with links between communities, and is defined as [30]:

$$ Q = \frac{1}{2m}{\displaystyle \sum_{i,\ j}}\left[{a}_{ij} - \frac{d_i{d}_j}{2m}\ \right]\delta \left({c}_i,{c}_j\right) $$ (6)