For DNA sequences of various species we construct the Google matrix of Markov transitions between nearby words composed of several letters. The statistical distribution of matrix elements of this matrix is shown to be described by a power law with the exponent being close to those of outgoing links in such scale-free networks as the World Wide Web (WWW). At the same time the sum of ingoing matrix elements is characterized by the exponent being significantly larger than those typical for WWW networks. This results in a slow algebraic decay of the PageRank probability determined by the distribution of ingoing elements. The spectrum of is characterized by a large gap leading to a rapid relaxation process on the DNA sequence networks. We introduce the PageRank proximity correlator between different species which determines their statistical similarity from the view point of Markov chains. The properties of other eigenstates of the Google matrix are also discussed. Our results establish scale-free features of DNA sequence networks showing their similarities and distinctions with the WWW and linguistic networks.

Funding: This research is supported in part by the EC FET Open project “New tools and algorithms for directed network analysis” (NADINE No. 288956); the France-Armenia collaboration grant CNRS/SCS No. 24943 (IE-017) on “Classical and quantum chaos”; VK is supported by CNRS – Region Midi-Pyrénées grant. No additional external funding received for this study. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2013 Kandiah, Shepelyansky. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

At present the investigations of statistical properties of DNA sequences are actively developed by various bioinformatic groups (see e.g. [14] , [15] , [16] , [17] , [18] ). The development of various methods of statistical analysis of DNA sequences become now of great importance due to a rapid growth of collected genomic data. We hope that the Google matrix approach, which already demonstrated its efficiency for enormously large networks [2] , [3] , will find useful applications for analysis of genomic data sets.

An important step in the statistical analysis of DNA sequences was done in [13] applying methods of statistical linguistics and determining the frequency of various words composed of up to 7 letters. A first order Markovian models have been also proposed and briefly discussed in this work. Here we show that the Google matrix analysis provides a natural extension of this approach. Thus the PageRank eigenvector gives the frequency appearance of words of given length. The spectrum and eigenstates of characterize the relaxation processes of different modes in the Markov process generated by a symbolic DNA sequence. We show that the comparison of word ranks of different species allows to identify proximity between species.

In this work we use the Google matrix approach to study the statistical properties of DNA sequences of the species: Homo sapiens (HS, human), Canis familiaris (CF, dog), Loxodonta africana (LA, elephant), Bos Taurus (bull, BT), Danio rerio (DR, zebrafish), taken from the publicly available database [11] . The analysis of Poincaré recurrences in these DNA sequences [12] shows their similarities with the statistical properties of recurrences for dynamical trajectories in the Chirikov standard map and other symplectic maps [7] . Indeed, a DNA sequence can be viewed as a long symbolic trajectory and hence, the Google matrix, constructed from it, highlights the statistical features of DNA from a new viewpoint.

The Google matrix belongs to the class of Perron-Frobenius operators naturally appearing in dynamical systems (see e.g. [5] ). Using the Ulam method [6] a discrete approximant of Perron-Frobenius operator can be constructed for simple dynamical maps following only one trajectory in a chaotic component [7] or using many independent trajectories counting their probability transitions between phase space cells [8] , [9] , [10] . The studies of Google matrix of such directed Ulam networks provides an interesting and detailed analysis of dynamical properties of maps with a complex chaotic dynamics [7] , [8] , [9] , [10] .

The theory of Markov chains [1] finds impressive modern applications to information retrieval and ranking of directed networks including the World Wide Web (WWW) where the number of nodes is now counted by tens of billions. The PageRank algorithm (PRA) [2] uses the concept of the Google matrix and allows to rank all WWW nodes in an efficient way. This algorithm is a fundamental element of the Google search engine used by a majority of Internet users. A detailed description of this method and basic properties of the Google matrix can be found e.g. in [3] , [4] .

Results

Construction of Google matrix from DNA sequence From [11] we collected DNA sequences of HS represented as a single string of length base pairs (bp) corresponding to 5 individuals. Similar data are obtained for BT ( bp), CF ( bp), LA ( bp), DR ( bp). For HS, CF, LA, DR the statistical properties of Poincaré recurrences in these sequences are analyzed in [12]. All strings are composed of 4 letters and undetermined letter . The strings can be found at the web page [19]. For a given sequence we fix the words of letters length corresponding to the number of states . We consider that there is a transition from a state to state inside this basis when we move along the string from left to right going from a word to a next word . This transition adds one unit in the transition matrix element . The words with letter are omitted, the transitions are counted only between nearby words not separated by words with . There are approximately such transitions for the whole length since the fraction of undetermined letters is small. Thus we have . The Markov matrix of transitions is obtained by normalizing matrix elements in such a way that their sum in each column is equal to unity: . If there are columns with all zero elements (dangling nodes) then zeros of such columns are replaced by . Such a procedure corresponds to one used for the construction of Google matrix of the WWW [2], [3]. Then the Google matrix of DNA sequence is written as (1)where is the damping factor for which the Google search uses usually the value [3]. The matrix belongs to the class of Perron-Frobenius operators. It has the largest eigenvalue with all other eigenvalues . For WWW usually there are isolated subspaces so that at there are many degenerate eigenvalues [4] so that the damping factor allows to eliminate this degeneracy creating a gap between and all other eigenvalues. For our DNA Google matrices we find that there is already a significant spectral gap naturally present. In this case the PageRank vector is not sensitive to the damping factor being in the range (other eigenvectors are independent of [3], [4], [9]). Due to that in the following we present all results at the value . The spectrum and right eigenstates are determined by the equation (2)The PageRank eigenvector at has positive or zero elements which can be interpreted as a probability to find a random surfer on a given site with the total probability normalized to unity . Thus, all sites can be ordered in a decreasing order of probability that gives us the PageRank order index with most frequent sites at low values of . It is useful to consider the density of matrix elements in the PagePank indexes similar to the presentation used in [20], [21] for networks of Wikipedia, UK universities, Linux Kernel and Twitter. The image of the DNA Google matrix of HS is shown in Fig. 1 for words of 5 and 6 letters. We see that almost all matrix is full that is drastically different from the WWW and other networks considered in [20] where the matrix is very sparse. Thus the DNA Google matrix is more similar to the case of Twitter which is characterized by a strong connectivity of top PageRank nodes [21]. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 1. DNA Google matrix of Homo sapiens (HS) constructed for words of 5-letters (top) and 6-letters (bottom) length. Matrix elements are shown in the basis of PageRank index (and ). Here, and axes show and within the range (left) and (right). The element at is placed at top left corner. Color marks the amplitude of matrix elements changing from blue for minimum zero value to red at maximum value. https://doi.org/10.1371/journal.pone.0061519.g001 It is interesting to analyze the statistical properties of matrix elements . Their integrated distribution is shown in Fig. 2. Here is the number of matrix elements of the matrix with values . The data show that the number of nonzero matrix elements is very close to . The main fraction of elements has values (some elements since for certain there are many transitions to some node with and e.g. only one transition to other with ). At the same time there are also transition elements with large values whose fraction decays in an algebraic law with some constant and an exponent . The fit of numerical data in the range of algebraic decay gives for : (BT), (CF), (LA), (HS), (DR). For HS case we find at and at with the average for . There are visible oscillations in the algebraic decay of with but in global we see that on average all species are well described by a universal decay law with the exponent . For comparison we also show the distribution for the WWW networks of University of Cambridge and Oxford in year 2006 (data from [4], [20]). In these networks we have and on average 10 links per node. We see that in these cases the distribution has a very short range in which the decay is at least approximately algebraic ( ). In contrast to that for the DNA sequences we have a large range of algebraic decay. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 2. Integrated fraction Integrated fractionof Google matrix elements withas a function of Left panel : Various species with 6-letters word length: bull BT (magenta), dog CF (red), elephant LA (green), Homo sapiens HS (blue) and zebrafish DR(black). Right panel : Data for HS sequence with words of length (brown), (blue), (red). For comparison black dashed and dotted curves show the same distribution for the WWW networks of Universities of Cambridge and Oxford in 2006 respectively. https://doi.org/10.1371/journal.pone.0061519.g002 Since in each column we have the sum of all elements equal to unity we can say that the differential fraction gives the distribution of outgoing matrix elements which is similar to the distribution of outgoing links extensively studied for the WWW networks [3], [23], [24], [25]. Indeed, for the WWW networks all links in a column are considered to have the same weight so that these matrix elements are given by an inverse number of outgoing links [3]. Usually the distribution of outgoing links follows a power law decay with an exponent even if it is known that this exponent is much more fluctuating compared to the case of ingoing links. Thus we establish that the distribution of DNA matrix elements is similar to the distribution of outgoing links in the WWW networks with . We note that for the distribution of outgoing links of Cambridge and Oxford networks the fit of numerical data gives the exponents (Cambridge) and (Oxford). It is known that on average the probability of PageRank vector is proportional to the number of ingoing links [3]. This relation is established for scale-free networks with an algebraic distribution of links when the average number of links per node is about to that is usually the case for WWW, Twitter and Wikipedia networks [4], [20], [21], [22], [23], [24], [25]. Thus in such a case the matrix is very sparse. For DNA we find an opposite situation where the Google matrix is almost full and zero matrix elements are practically absent. In such a case an analogue of number of ingoing links is the sum of ingoing matrix elements . The integrated distribution of ingoing matrix elements with the dependence of on is shown in Fig. 3. Here is defined as the number of nodes with the sum of ingoing matrix elements being larger than . A significant part of this dependence, corresponding to large values of and determining the PageRank probability decay, is well described by a power law . The fit of data at gives (BT), (CF), (LA), (HS), (DR). For HS case at we find respectively and . For and other species we have an average . PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 3. Integrated fraction Integrated fractionof sum of ingoing matrix elements with Left and right panels show the same cases as in Fig. 2 in same colors. The dashed and dotted curves are shifted in -axis by one unit left to fit the figure scale. https://doi.org/10.1371/journal.pone.0061519.g003 Usually for ingoing links distribution of WWW and other networks one finds the exponent [23], [24], [25]. This value of is expected to be the same as the exponent for ingoing matrix elements of matrix . Indeed, for the ingoing matrix elements of Cambridge and Oxford networks we find respectively the exponents and (see curves in Fig. 3). For ingoing links distribution of Cambridge and Oxford networks we obtain respectively and which are close to the usual WWW value . Thus we can say that for the WWW type networks we have . In contrast the exponent for DNA Google matrix elements gets significantly larger value . This feature marks a significant difference between DNA and WWW networks. For DNA we see that there is a certain curvature in addition to a linear decay in log-log scale. From one side, all species are close to a unique universal decay curve which describes the distribution of ingoing matrix elements (there is a more pronounced deviation for DR which does not belong to mammalian species). However, from other side we see visible differences between distributions of various species (e.g. non mammalian DR case has the largest deviation from others mammalian species). We will discuss the links between and the exponent of PageRank algebraic decay in next sections.

Spectrum of DNA Google matrix The spectrum of eigenstates of DNA Google matrix of is shown in Fig. 4 for words of letters and matrix sizes . The spectra for DNA sequences of bull BT, dog CF, elephant LA and zebrafish DR are shown in Fig. 5 for words of letters. The spectra and eigenstates are obtained by direct numerical diagonalization of matrix using LAPACK standard code. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 4. Spectrum of eigenvalues in the complex plane Spectrum of eigenvalues in the complex planefor DNA Google matrix of Homo sapiens (HS) shown for words ofletters (from top to bottom). https://doi.org/10.1371/journal.pone.0061519.g004 PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 5. Spectrum of eigenvalues in the complex plane Spectrum of eigenvalues in the complex planefor DNA Google matrix of of bull BT, dog CF, elephant LA, zebrafish DR shown for words ofletters (from top to bottom). https://doi.org/10.1371/journal.pone.0061519.g005 In all cases the spectrum has a large gap which separates eigenvalue and all other eigenvalues with (only for non mammalian DR case we have a small group of eigenvalues within ). This is drastically different from the spectrum of WWW and other type networks which usually have no gap in the vicinity of (see e.g. [4], [21], [22]). In a certain sense the DNA spectrum is similar to the spectrum of randomized WWW networks and the spectrum of of the Albert-Baraási network model discussed in [26], but the properties of the PageRank vector are rather different as we will see below. Visually the spectrum is mostly similar between HS and CF having approximately the same radius of circular cloud . For DR this radius is the smallest with . Thus the spectrum of indicates the difference between mammalian and non mammalian sequences. For HS the increase of the word length leads to an increase of . For the number of nonzero matrix elements is close to and thus on average we have only about transitions per each element. This determines an approximate limit of reliable statistical computation of matrix elements for available HS sequence length . For HS at we verified that two halves of the whole sequence still give practically the same spectrum with a relative accuracy of for eigenvalues in the main part of the cloud at . This means that the spectrum presented in Figs 4,5 is statistically stable at the values of used in this work. We also constructed the Google matrix by inverting the direction of transitions and then normalizing sum of all elements in each column to unity. This procedure is also equivalent to moving along the sequence, from word to word, not from left to right but from right to left. We note that for WWW and other networks such a matrix with inverted direction of links was used to obtain the CheiRank vector (which is the PageRank vector of matrix ). Due to the inversion of links the CheiRank vector highlights very communicative nodes [4], [20], [21], [22]. In our case the spectrum of and are identical. As a result the probability distributions of PageRank and CheiRank vectors are the same. This is due to some kind of detailed balance principle: we count only transitions between nearby words in a DNA sequence and the direction of displacement along the sequence does not affect the average transition probabilities so that (up to statistical fluctuations). In a certain sense this situation is similar to the case of Ulam networks in symplectic maps where the conservation of phase space area leads to the same properties of and [7], [10]. We tried to test if a random matrix model can reproduce the distribution of eigenvalues in plane. With this aim we generated random matrix elements with exactly the same distribution as for HS case at (see Fig. 2). However, in this random model we found all eigenvalues homogeneously distributed in the radius being significantly smaller compared to the real data. Also in this case the PageRank probability changes only by 30% in the whole range being absolutely different from the real data (see next section). Thus the construction of random matrix models which are able to produce results similar to the real data remains as a task for future investigations.