Transcriptional regulation critically depends on proper interactions between transcription factors (TF) and their cognate DNA binding sites. The widely used model of TF-DNA binding – the Positional Weight Matrix (PWM) – presumes independence between positions within the binding site. However, there is evidence to show that the independence assumption may not always hold, and the extent of interposition dependence is not completely known. We hypothesize that the interposition dependence should partly be manifested as correlated evolution at the positions. We report a Maximum-Likelihood (ML) approach to infer correlated evolution at any two positions within a PWM, based on a multiple alignment of 5 mammalian genomes. Application to a genome-wide set of putative cis elements in human promoters reveals a prevalence of correlated evolution within cis elements. We found that the interdependence between two positions decreases with increasing distance between the positions. The interdependent positions tend to be evolutionarily more constrained and moreover, the dependence patterns are relatively similar across structurally related transcription factors. Although some of the detected mutational dependencies may be due to context-dependent genomic hyper-mutation, notably CG to TG, the majority is likely due to context-dependent preferences for specific nucleotide combinations within the cis elements. Patterns of evolution at individual nucleotide positions within mammalian TF binding sites are often significantly correlated, suggesting interposition dependence. The proposed methodology is also applicable to other classes of non-coding functional elements. A detailed investigation of mutational dependencies within specific motifs could reveal preferred nucleotide combinations that may help refine the DNA binding models.

Here we present a novel Maximum-Likelihood (ML) approach to quantify co-evolution of pairs of positions within a TF binding motif. Our analysis is based on putative binding sites for 64 vertebrate TFs within human proximal promoters. We infer evolutionary patterns from genome-scale alignments of Human, Chimpanzee, Rat, Mouse, and Dog. We found that interposition dependence is highly prevalent, especially between adjacent positions within binding sites. Typically a TF residue interacts only with a few adjacent DNA bases [20] , [21] . Accordingly, we found a trend of decreasing dependence with increasing distance between the two positions. The interdependent positions are evolutionarily more constrained than the positions that are independent. Moreover, we found that the interposition dependence pattern is relatively similar among the structurally related TFs, suggesting a structural basis for these patterns. We discuss a few cases of interdependent positions in the context of solved structures of DNA-bound TFs. In summary, our work presents compelling additional evidence to support co-evolution, and thus interdependence, between positions within mammalian cis elements.

In any biological system with interdependent components, a mutation in one component may lead to a compensatory change in other interacting components. Compensatory changes and co-evolution of functionally interacting components have been previously demonstrated in several contexts [10] , [11] , [12] , [13] , [14] , [15] , [16] . In the context of TF elements, several previous studies have assessed interposition dependence by computing the correlation between nucleotides at two positions [17] , [18] . However, these studies are based on instances of the DNA element only within a single species. A more direct approach to assess interposition dependence is to compare the histories of nucleotide substitutions at the two positions [19] . Specifically, if a mutation at position i, say from nucleotide u to v, frequently coincides with a mutation at position j, say from nucleotide x to y, such correlated evolutionary patterns can serve as a reasonable proxy for dependence between positions i and j.

Eukaryotic gene transcription is tightly regulated, in large part, by transcription factor proteins (TF) that bind to DNA, often in a sequence-specific fashion [1] , [2] . The DNA-binding preference of a TF is determined using a variety of in vitro and in vivo approaches [3] , and is commonly represented by a Positional Weight Matrix (PWM) [4] . A PWM is a 4-by-n matrix where the rows correspond to the 4 bases, and the columns correspond to n positions in the binding site. Each column indicates the preference for the 4 bases at a specific position. Although the PWM is currently used as the de facto model of TF-DNA interaction, a major shortcoming of this model is the assumption that the nucleotide preferences at individual positions within the binding site are independent of each other. However, there are both direct experimental evidence [5] , [6] , as well as indirect evidence based on computational modeling [7] , [8] , that suggest that the interposition independence assumption does not hold universally. The extent and nature of interposition dependence is not completely known, and it has been argued that overall, a simple additive (assuming independence between positions) model may be sufficient to capture the TF-DNA interaction [9] . However, our focus here is on detecting the specific instance of inter-positional dependence and not on the extent to which these dependencies affect the overall accuracy of binding site prediction.

Results

Overview of the Data and the Approach Figure 1 illustrates our approach. Our analysis is based on a genome-wide set of putative TF binding sites in human proximal promoters based on 64 vertebrate PWMs in JASPAR [22] (see Table S1 for the list of PWMs). We only consider binding sites contained within gapless regions in the multiple alignment of 5 species – Human, Chimpanzee, Mouse, Rat, and Dog, obtained from the UCSC database [23] (see Additional File in Supplementary Data for all binding sites for all PWMs used in our study). Consider PWM M, L bases long, and with N binding sites in the human promoters. For each pair of positions (i,j), 1≤ i<j ≤ L, our Foreground set includes N pairs of multiple alignment columns corresponding to the positions i and j in the N binding sites. For each position-pair (i,j) in the Foreground set, we computed CoEvol(i,j) to quantify the extent to which positions i and j co-evolve and thus can be deemed interdependent. We compare the CoEvol estimated from the Foreground set with that for the following three control sets, each similarly consisting of N pairs of positions. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 1. Overview of the approach. The top left panel depicts positions i and j within a binding site for a PWM M having N genome-wide matches. The binding site in human is shown in the context of a 5-species multiple alignment. The top left panel also shows the phylogenetic trees for the two positions. The phylogenetic parameters are estimated from the genome-wide set of promoter alignments (top right panel). The likelihoods of ancestral nodes for a specific PWM and position (or position-pair) are then estimated from the N instances and the phylogenetic parameters. The likelihoods of individual trees and tree-pairs and ultimately the CoEvol for a pair of PWM positions are then estimated as detailed in the text. The lower right panel illustrates the procedure to generate Shuffle control. The figure depicts N instances of position pair (i,j) in the central row. A random j-position is paired with each of the i-positions (lower row). https://doi.org/10.1371/journal.pone.0055521.g001 Random: For each PWM motif M, with length L, and with N binding sites in the genome, we randomly selected L positions N times from gapless positions of the multiple alignment corresponding to the human promoters. Each set of L positions was concatenated and treated as a ‘binding site’. The total of N random sites thus generated were then treated identically to the Foreground to estimate CoEvol for all (i,j) position pairs. This control represents the baseline expectation. RandomContext: This control is meant to capture the context-dependent mutational variations in the genome. The well-known CG to TG hyper-mutability is an extreme example of this phenomenon. There may be other reasons for local dependence in the promoter that needs to be distinguished from the inter-positional dependence within TF binding sites. For this control, we randomly selected N sites, each L bases long from gapless aligned portion of the promoter, similar to the Foreground. Shuffle. While using the same L*N positions as in Foreground, we pair each of the N i-position instances randomly with one of the N j-position instances (without replacement) (Figure 1). In other words, we construct each of the N random binding sites by selecting each of the L positions randomly from one of the N instances of that position. This procedure breaks the contextual link between (i,j) instances while still preserving the independent compositional and species-specific properties of the two position sets.

A Sizable Fraction of Position-pairs within Transcription Factor Binding Sites have Co-evolved In each of 4 sets – Foreground, Random, RandomContext, and Shuffle, for each motif M, and for each position pair (i,j) we computed CoEvol(i,j). The random expectation of CoEvol is zero, as evident for Random control. A positive value may indicate correlated evolution. We define scope as |i-j|. For each scope s, we pooled all CoEvol(i,i+s) for all i and all M. For scope = 1, Figure 2 shows the CoEvol distributions in all sets. For technical reasons, we had to devise a special procedure to deal with RandomContext – see Methods for details. As expected, the CoEvol values for RandomContext are significantly greater than 0, suggesting interdependence between adjacent positions in the promoter regions regardless of TF binding, as noted previously [24]. The Shuffle control comes closest to the Foreground, indicating that the underlying compositional and evolutionary patterns of the two positions greatly contribute to CoEvol values in the Foreground. Nevertheless, even relative to this most stringent Shuffle control, the Foreground has higher values of CoEvol. We regenerated the Shuffle control set 100 times and pooled all CoEvol values to test the null hypothesis that the CoEvol values for the Foreground are no greater than that for the Shuffle. We found that the Foreground CoEvol values for scope 1 are significantly greater than that for the pooled Shuffle values (Mann-Whitney U test p-value = 0.02, Kolmogorov-Smirnov test p-value = 1.1e-12). The corresponding p-values for scope 1 when comparing Foreground versus RandomContext were 6.1e-4 and 0, respectively. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 2. The distribution of CoEvol values for scope 1. Shown are CoEvol values for Foreground, Random, RandomContext and Shuffle controls. Values for RandomContext are included after correcting for a shift in the distribution of tree likelihoods (see Methods for details). Relative to the most stringent control – Shuffle, the Foreground CoEvol values are significantly greater (Mann-Whitney U test p-value = 0.02, Kolmogorov-Smirnov test p-value = 1.1e-12). https://doi.org/10.1371/journal.pone.0055521.g002 We repeated the Mann-Whitney test for each scope from 1 through 8 comparing Foreground and the pooled Shuffle CoEvol values. The p-values were significant only for scopes 1, 2, and 3 and not so for greater scopes. Thus we find that the CoEvol signal is limited to lower scopes. It is possible that at higher scopes, only a small but significant number of position-pairs co-evolve, and will not be detected by a global test of differences in distribution. Also, the median binding site length for our set of 64 PWMs is 12; so higher scopes are only relevant to a fraction of the PWMs. For these reasons, we restrict our analysis to scopes 1–8. To measure differences between Foreground and Shuffle in the right tail of the distribution, we adopted the following strategy. Using Random as the negative control, we computed T99 as the 99th percentile threshold of all CoEvol values for the Random control. Let FS represent the fraction of all Foreground CoEvol values for scope s that are greater than T99. The expected value of FS is 1%. However, we observed that in all scopes, 1≤ s ≤8, FS ranged from 17% to 25%. Similarly, we computed SS, defined as the fraction of all Shuffle CoEvol values for scope s that are greater than T99. To estimate the significance of FS relative to SS, we computed the fraction of 100 Shuffle sets in which SS ≥ FS. For scopes 1 through 7, none of the 100 Shuffle sets had SS ≥ FS, corresponding to a nominal p-value <0.01. For scope 8 the p-value was 0.04. Thus, based on the right tail analysis, there is significantly greater enrichment of high CoEvol values in the Foreground relative to the Shuffle control in all scopes.

The Number of Interdependent Position Pairs Decreases with Increasing Scope For a motif M with N binding sites, and for a specific position pair (i,j), let CEF be the CoEvol value in the Foreground. CES is defined similarly for the Shuffle. We estimate the significance of CEF as the fraction of 100 shuffles for which CES ≥ CEF. Given the p-values for all CoEvol(i, j) (all position pairs for each PWM), we estimate their q-values (False Discovery Rate) using the Storey-Tibshirani method to control for multiple testing [25]. All CoEvol(i, j) with a q-value ≤ 0.05 are considered to represent instances of co-evolution. Out of a total of 3914 position pairs evaluated for all motifs and for all scopes, 315 were deemed significant with an estimated false discovery rate of 5% (see Figure S1 and Table S2). Of these 315, 92 were in scope 1, monotonically decreasing to 23 for scope 8. An alternative measure of interposition dependence, and one that does not rely on co-evolution, is the mutual information between the nucleotide probability distributions at the two columns of the PWM. We found that the 315 position-pairs that our method deemed to be co-evolving have significantly greater mutual information relative to all other position-pairs (Mann-Whitney U test p-value = 8.7e-06).

Interdependent Positions have Greater Evolutionary Conservation The 315 interdependent position-pairs correspond to 353 unique PWM positions. There are 830 other positions for the 64 PWMs that were not deemed co-evolving with any other position. Thus, the detected position pairs are not dominated by a few positions, and include a large fraction of positions. We compared the 353 interdependent positions with the 830 independent positions with respect to their compositional and conservation properties. We did not find a significant difference in the C+G content between the interdependent and independent positions. To estimate the evolutionary conservation for the ith position of PWM M with N binding sites in the genome, we extracted the 17-species Phastcons evolutionary conservation score from Galaxy (main.g2.bx.psu.edu) and averaged over the N instances of position i. Our choice of 17-species alignment to estimate conservation is meant to minimize the dependence on the input set of alignment based on 5 mammalian species. We found that the interdependent positions tend to be evolutionarily more conserved (Mann-Whitney U test p-value = 3.5e-4). We note that there is no a priori bias in our method’s ability to detect co-evolution towards greater evolutionary conservation. Our results thus suggest that the interdependent positions are under a greater constraint against mutations.

Structurally Related TFs Exhibit Similar Patterns of Interposition Dependence For a pair of PWMs M1 and M2, we quantified the similarity J(M1,M2) in their interposition dependence as follows. A position-pair (i1, j1) for M1 is considered to “match” a position-pair (i2, j2) for M2 if (1) j1-i1 = j2-i2, i.e., they have the same scope, and (2) |i1-i2| ≤ D, where the parameter D allows for a shift between the positions in the two PWMs. J(M1,M2) is then defined as the ratio of the number of matching interdependent position-pairs between the two PWMs and the total number of interdependent position-pairs for the two PWMs; this is analogous to the standard Jaccard index. We grouped PWMs according to the TF’s structural class annotated in TRANSFAC. We compared, using the Mann-Whitney U test, the 98 J(M1,M2) values corresponding to PWM pairs within the same family, with the 1498 J(M1,M2) values corresponding to PWM pairs in different families. We found that the within-family similarity was significantly greater than the cross-family similarity with p-values of 2.9e-6, 1.4e-6, and 1.7e-8 for D = 0, 1, and 2 respectively. This suggests that structurally related TFs tend to exhibit similar patterns of interposition dependence.