We first investigated whether some promoter patterns occur more often than others. Second, we determined which of these patterns are more common in certain species and whether their distribution may have evolutionary implications. In the third analysis, we examined the distribution of these promoter classes among human tissues.

Promoter classification

Once promoter patterns are generated, some initial general conclusions can be drawn. Although these promoter sequences are less conserved between species, they exhibit similar patterns. Each pattern is composed of vertically aligned clusters of Kappa IC (y-axis) and (G + C)% (x-axis) values. The vertical positions of these clusters form a promoter pattern whose form is specific to each promoter sequence. We were able to classify promoters according to their patterns and noticed ten general types of promoters (Figure 1A-J). Although the overall shape and density seem to be conserved across different classes of promoters, the patterns do differ in finer details. This may indicate a further possible organization of promoter classes into several subclasses. The shape of a pattern is explained by the presence of different structures such as simple sequence repeats (SSRs) or short tandem repeats (STRs). Among these structures we found an interesting distribution of short and long homopolymer tracts and of di- and tri-nucleotide formations, many of which are consistent with previous studies[31, 32]. The partition into ten classes rests on clear visual distinctions in the shape and cluster density of the patterns. Each promoter class was named according to its average nucleotide content and Kappa IC values, as follows:

1) AT-based promoters. AT-based representative patterns are distinguished by high (A + T)% and Kappa IC values. The left side of the pattern is predominant, while the right side is significantly less pronounced. The shape of this pattern reflects short poly(dA:dT) homopolymer tracts of various lengths (Figure 1C). AT-based patterns are characteristic of gene promoters from Drosophila melanogaster and Arabidopsis thaliana and are less common in humans.

2) CG-based promoters. These promoters are represented by patterns containing a high percentage of C + G and high Kappa IC values. CG-based promoters show a high CpG content. The right side of the pattern is predominant, while the left side is significantly less pronounced (Figure 1A). The shape of this pattern reflects short poly(dC:dG) homopolymer tracts of various lengths. In addition, the average frequencies of occurrence of AT-based and CG-based promoters appear to differ completely in these species, but curiously, the two classes tend to stand in relative opposition within each species (Figure 2A,B). This observation suggests that these species have different preferences for the allocation of certain fundamental functions. Patterns of this class are particularly characteristic of genes from Homo sapiens.

3) ATCG-compact promoters. ATCG-compact patterns characterize promoters with centrally disposed clusters, leading to the formation of a round-shaped pattern (Figure 1D). The middle-lower region of the pattern contains evenly interspersed nucleotides (A,T,C,G ≈ 25%) and the middle-upper area shows short homopolymer tracts of different lengths (poly(dA), poly(dT), poly(dC), poly(dG)) disposed in tandem in any order. ATCG-compact patterns are characteristic of gene promoters from Arabidopsis thaliana.

4) ATCG-balanced promoters. Promoter sequences belonging to the ATCG-balanced class show an almost balanced G + C and A + T content. The right and the left side of the pattern tend to share a relative 2-fold rotational symmetry. These patterns are generally composed of equally distributed short poly(dA:dT) and poly(dC:dG) homopolymer tracts (Figure 1B). ATCG-balanced and CG-spike promoters tend to occur in the same proportion in each species and appear to have similar average frequencies between species (Figure 2A,B). This observation indicates that for some specific functions the same classes of promoters are preferred across species. These patterns are characteristic of gene promoters from Homo sapiens and Oryza sativa.

5) ATCG-middle promoters. ATCG-middle patterns are characterized mainly by promoter sequences with balanced A + T and C + G values and higher than average Kappa IC values. The right side and the left side of the pattern are equally distributed, but the central part is pronounced. These patterns are similar to the ATCG-balanced class in that they also have a relative 2-fold rotational symmetry, but they contain additional short homopolymer tracts (poly(dA), poly(dT), poly(dC), poly(dG)) disposed in tandem in any order (Figure 1E). These patterns are rare and are almost equally distributed in all four species.

6) ATCG-less promoters. Promoters from this class are represented by an abrupt transition between two C + G threshold levels. As in ATCG-balanced promoters, the right side and the left side of the pattern are equally distributed; however, some sequences around the central region are missing or have a lower density. Typically, these central regions lack tandem short homopolymer tracts and short sequences consisting of equally interspersed nucleotides (A,T,C,G ≈ 25%), or short sequences showing small variations over 50% in favor of A + T or C + G nucleotides (Figure 1F). Based on the promoter sequence features, these promoter patterns seem to be complementary to ATCG-middle promoters. ATCG-less patterns are very rare (an overall frequency between species of 0.10% - 0.16%) and are characteristic of promoters from Homo sapiens and Oryza sativa, but are almost absent in Drosophila melanogaster and Arabidopsis thaliana.

7) AT-less promoters. Promoter sequences belonging to the AT-less class exhibit a high frequency of short CG-rich sequences. Although both sides of the pattern show a relative 2-fold rotational symmetry, the clusters on the left side of the pattern exhibit a lower density than those on the right. These patterns are characterized by a large number of short poly(dC:dG) tracts and a lower number of short poly(dA:dT) tracts (Figure 1G). Short poly(dA:dT) tracts typically occur as a consequence of an abrupt depletion of C + G nucleotides over short distances (30b–60b) inside the promoter sequence. Such a depletion is accompanied by high Kappa IC values and is typically present near the TSS (± 200b), suggesting a regular expression of the corresponding genes. AT-less patterns are generally rare and are found equally in all four species, but are slightly more frequent in Homo sapiens.

8) CG-less promoters. In contrast, CG-less promoters are distinguished by a high frequency of short AT-rich sequences and are more common in Oryza sativa and Arabidopsis thaliana. The right and left side of the pattern tend to be equally distributed; however, the clusters on the right side of the pattern exhibit a lower density than those on the left. AT-less and CG-less promoters seem to be characterized by an imbalance between the number of short poly(dA:dT) tracts and short poly(dC:dG) tracts. Complementary to the AT-less promoter characteristics, these patterns are characterized by a large number of short poly(dA:dT) tracts and a much lower number of short poly(dC:dG) tracts (Figure 1I). Compared with AT-less promoters, the overall preference for CG-less promoters is very high between species. However, in Homo sapiens the number of AT-less promoters slightly exceeds the number of CG-less promoters (Figure 2A).

9) AT-spike promoters. Promoter sequences belonging to the AT-spike class are represented by long repetitive sequences with a high content of A or T nucleotides. These patterns exhibit a central part and an elongated left side containing small density clusters. The shape of AT-spike representative patterns is explained by the presence of long poly(dA) or long poly(dT) homopolymer tracts or tandem short poly(dA) or short poly(dT) tracts (Figure 1J). These promoters are prevalent in Arabidopsis thaliana.

10) CG-spike promoters. In contrast to the AT-spike promoter architecture, these promoters are represented by long repetitive sequences with a high content of C or G nucleotides. CG-spike patterns exhibit a central part and an elongated right side containing small density clusters. These patterns contain long poly(dC) or long poly(dG) homopolymer tracts or tandem short poly(dC) or short poly(dG) tracts (Figure 1H). AT-spike and CG-spike promoters seem to be complementary, given that the two classes are distinguished by two opposite types of homopolymer tracts. AT-spike and CG-spike classes appear to be equally preferred between species; nevertheless, their promoters tend to be in opposition within each species (Figure 2B). This observation suggests a possible conservation of their antagonistic role between these species, yet a different preference for certain functions. These patterns are common in Oryza sativa and Homo sapiens.
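The class descriptions above repeatedly refer to short and long poly(dA:dT) and poly(dC:dG) homopolymer tracts. As a rough illustration only, the following Python sketch locates such tracts in a sequence and tallies them. The minimum tract length (3) and the boundary between "short" and "long" (10) are hypothetical values chosen for the example; this section does not state the thresholds actually used.

```python
import re

# Hypothetical thresholds: the text does not define where a tract begins
# or where "short" ends and "long" begins.
MIN_TRACT = 3
LONG_TRACT_MIN = 10


def homopolymer_tracts(seq, min_len=MIN_TRACT):
    """Return (start, base, length) for each run of a single nucleotide
    that is at least min_len bases long."""
    seq = seq.upper()
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.group(1), len(m.group(0))) for m in pattern.finditer(seq)]


def summarize_tracts(seq, long_min=LONG_TRACT_MIN):
    """Count short and long tracts, grouping A/T runs as poly(dA:dT)
    and C/G runs as poly(dC:dG)."""
    counts = {"short_AT": 0, "long_AT": 0, "short_CG": 0, "long_CG": 0}
    for _, base, length in homopolymer_tracts(seq):
        family = "AT" if base in "AT" else "CG"
        size = "long" if length >= long_min else "short"
        counts[f"{size}_{family}"] += 1
    return counts


if __name__ == "__main__":
    example = "AAAA" + "CG" + "T" * 12 + "GGG" + "CA"
    print(homopolymer_tracts(example))
    print(summarize_tracts(example))
```

Under these assumptions, a sequence dominated by `long_AT` counts would lean towards the AT-spike description, and an excess of `short_CG` over `short_AT` towards the AT-less description; the actual class assignment in the text relies on the full pattern, not on these counts alone.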

Figure 1 Ten classes of promoters and their representative patterns. Each promoter pattern is composed of vertically aligned clusters of Kappa IC (y-axis) and GC% (x-axis) values. The center of weight for each pattern is represented by a black circle. These representative promoter patterns are shown in the following sections as follows: (A) AT-based, (B) CG-based, (C) ATCG-compact, (D) ATCG-balanced, (E) ATCG-middle, (F) ATCG-less, (G) AT-less, (H) CG-spike, (I) CG-less and (J) AT-spike.
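To make the construction of a pattern more concrete, the sketch below computes (GC%, Kappa IC) points over sliding windows and the center of weight of the resulting pattern. It is a minimal illustration under stated assumptions, not the authors' implementation: the Kappa IC formula used here (the average percentage of matching positions between a window and each of its shifted copies), the window length of 30 and the step of 1 are all assumptions, since this section does not define them.

```python
def gc_percent(window):
    """Percentage of G and C nucleotides in a window."""
    return 100.0 * sum(base in "GC" for base in window) / len(window)


def kappa_ic(window):
    """Assumed form of Kappa IC: compare the window with every shifted copy
    of itself and average the percentage of matching positions."""
    n = len(window)
    if n < 2:
        raise ValueError("window must contain at least two nucleotides")
    total = 0.0
    for shift in range(1, n):
        overlap = n - shift
        matches = sum(window[i] == window[i + shift] for i in range(overlap))
        total += 100.0 * matches / overlap
    return total / (n - 1)


def promoter_pattern(seq, window=30, step=1):
    """Return one (GC%, Kappa IC) point per sliding window.
    window and step are hypothetical defaults."""
    seq = seq.upper()
    return [
        (gc_percent(seq[i:i + window]), kappa_ic(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


def center_of_weight(points):
    """Mean of the x and y coordinates of a pattern."""
    xs, ys = zip(*points)
    return sum(xs) / len(xs), sum(ys) / len(ys)


if __name__ == "__main__":
    pattern = promoter_pattern("ACGT" * 10 + "A" * 20 + "GC" * 10)
    print(len(pattern), center_of_weight(pattern))
```

With this definition, a homopolymer window such as `"AAAA"` yields the maximum Kappa IC of 100, which is in line with the text's association between homopolymer tracts and high Kappa IC values.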

Figure 2 Organism-specific frequencies of each promoter class. Each column represents a class of promoters. Starting at the bottom of each column we present the class name, (B) the average preference of promoter classes between species, a representative shape of the promoter class (pink areas show denser clusters whereas light grayish gold color shows lower density clusters) and (A) the proportion of promoter classes in Arabidopsis thaliana, Drosophila melanogaster, Homo sapiens and Oryza sativa.

Promoter distribution

Our comparative analyses revealed similarities and differences in promoter architecture between Arabidopsis thaliana, Drosophila melanogaster, Homo sapiens and Oryza sativa. We plotted the centers of weight of 20,586 promoter patterns, grouped by species, in order to highlight the distribution of these regulatory sequences (Figure 3). The center of weight of each promoter pattern indicates an average over all SSR and STR sequences. ATCG-middle patterns contain almost all types of SSR and STR sequences and can provide some visual insight into different promoter regions (Figure 4A-F). Although phylogenetic relationships are usually based on sequence alignment algorithms, the Kappa IC approach is based on a frequency/content comparison. A superposition of the promoter distributions from each species shows the shared surfaces, representing conserved promoter sequences (Figure 3E-J). Promoter sequences from Arabidopsis thaliana and Oryza sativa were notably differentiated, and only a small fraction of promoters was shared (Figure 3B,D and Figure 3I). Moreover, Arabidopsis thaliana promoters seem to have more structural features in common with those from Drosophila melanogaster (Figure 3F). Promoters from Arabidopsis thaliana exhibit higher Kappa IC values than promoters from Drosophila melanogaster, while the variations in C + G content are roughly the same. Curiously, the highest rate of conserved promoters was encountered between Homo sapiens and Oryza sativa (Figure 3J) and the lowest rate of conservation was observed between Arabidopsis thaliana and Homo sapiens (Figure 3H). Promoter sequences from Homo sapiens show both a wider distribution of C + G content and the highest values of Kappa IC (Figure 3A,E,H,J). The superposition of the promoter distributions of the four species shows that promoters do not reflect distant phylogenetic relationships (Figure 3E-J). We also noted the directions and the angles of these promoter distributions, which may suggest an evolutionary tendency for each species.

Figure 3 Promoter distributions for each species. (A) Homo sapiens, (B) Drosophila melanogaster, (C) Oryza sativa and (D) Arabidopsis thaliana. Each point represents the center of weight from a promoter pattern. Red color areas represent denser clusters of promoters. (E-J) superposition between promoter distributions. Red color areas represent conserved promoter sequences.

Figure 4 Location of SSRs and STRs within a promoter pattern. The light grayish gold shape represents a model of a promoter pattern from ATCG-middle class in which we approximate the location of various structures that compose a promoter sequence. (A) long Poly(dA) or Poly(dT) tracts or tandem short Poly(dA) or Poly(dT) tracts, (B) non-ordered short Poly(dA) and Poly(dT) and Poly(dC) and Poly(dG) tracts, (C) long Poly(dC) or Poly(dG) tracts or tandem short Poly(dC) or Poly(dG) tracts, (D) short Poly(dC) and Poly(dG) tracts, (E) evenly interspersed nucleotides (A,T,C,G ≈ 25%), (F) short Poly(dA) and Poly(dT) tracts.

TATA-less and TATA-containing correlations

Reported proportions of TATA-containing promoters in Homo sapiens vary between studies, depending on the number of promoters used[33]. An earlier study found 32% TATA-containing promoters in a set of ~1,000 genes[34]. More recent genome-wide studies show that only ~10% of human genes contain TATA-dependent promoters[20, 35]. However, the EPD dataset (Additional file 1) has been cleared of redundant promoters that shared the same TSS. Accordingly, the EPD promoter set has a much higher presence of known promoter elements, such as TATA or GC boxes. From the EPD collection of 8,512 Homo sapiens promoters, we searched for TATA motifs in a sample of 795 promoter sequences. Of this sample, ~41% were TATA-containing promoters (Additional file 2). TATA-containing promoter levels were higher in the AT-based, AT-less, ATCG-compact, ATCG-balanced and ATCG-middle classes, whereas TATA-less promoter levels were higher in the CG-based, AT-spike, CG-less and ATCG-less classes (Figure 5). The most extreme differences between TATA-containing and TATA-less promoters were observed in the CG-based (TATA-containing 5.28%, TATA-less 36.72%) and AT-based (TATA-containing 6.41%, TATA-less 0.75%) classes (Additional file 2).
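The TATA search can be illustrated with a short Python sketch. This is a simplified example rather than the procedure used for the figures above: the motif definition (the commonly cited consensus TATAWAW, with W = A or T) and the choice to scan the whole promoter sequence are assumptions, as this section does not specify the motif or the search region.

```python
import re

# Assumed TATA-box consensus TATAWAW (W = A or T); the motif actually used
# for the reported percentages is not given in this section.
TATA_BOX = re.compile(r"TATA[AT]A[AT]")


def has_tata(promoter):
    """True if the assumed TATA consensus occurs anywhere in the sequence."""
    return TATA_BOX.search(promoter.upper()) is not None


def tata_fraction(promoters):
    """Percentage of promoters in the sample that contain the motif."""
    if not promoters:
        return 0.0
    return 100.0 * sum(has_tata(p) for p in promoters) / len(promoters)


if __name__ == "__main__":
    # Made-up sequences for illustration only, not EPD data.
    sample = ["ggcTATAAAAgc", "GCGCGCGC", "CCTATATATCC", "GGGGCCCC"]
    print(tata_fraction(sample))
```

Restricting the search to a fixed window upstream of the TSS, or using a position weight matrix instead of a regular expression, would change the resulting percentage, which is one reason reported TATA frequencies differ between studies.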

Figure 5 TATA-less and TATA-containing correlations. In each class, blue bars show the proportion of TATA-less promoters and light yellow bars show the proportion of TATA-containing promoters. Observations were made on a sample of 795 promoters, randomly selected from a collection of 8512 Homo sapiens promoters.

Transitional states

Previous studies suggested that TATA-less and TATA-containing promoters have different chromatin structures[36–41]. Over evolutionary time, chromatin structure may influence the distribution of point mutations or other mutational events in the promoter sequence. A chromatin-dependent distribution of point mutations can lead to a gradual shift from one promoter class to another (i.e. by disruption of poly(dA:dT) or poly(dC:dG) tracts into shorter elements), thus changing the predisposition for low or high levels of gene expression. Promoter patterns “trapped” in transitional states between classes may also indicate a shift in the relationship of their genes towards other biological pathways. We found intermediate states between these patterns, which may suggest an evolutionary transition mechanism (Figure 6). The transition states were first detected by our neural network (Additional file 3). All promoter patterns were assigned to the class with the highest percentage of recognition. Certain promoter patterns show similar percentages for two separate classes of promoters, indicating potential membership in two classes simultaneously. Exact intermediate patterns are rare (sometimes even unique) and differ drastically from the majority of patterns (Figure 6). For instance, the ATCG-balanced class appears to have several patterns with a transitional tendency towards the ATCG-compact class, or vice versa (Figure 6A). These transitions are based on the successive elimination or insertion of short poly(dA:dT) and poly(dC:dG) tracts. Another example is a systematic reduction of short poly(dA:dT) tracts, which leads to a transition of AT-less promoters to the CG-based class (Figure 6C). In contrast, a systematic reduction of short poly(dC:dG) tracts leads to a class transition from CG-less promoters to AT-based promoters (Figure 6D). From what we have observed, none of these classes represents the “end of the line” for these transitions, since we observed intermediate patterns between all classes. Furthermore, we observed varying degrees of difficulty of transition from one class to another. This difficulty is reflected in the number of promoters belonging to each class (Additional file 2). For example, the CG-based and AT-based, AT-spike and CG-spike, and AT-less and CG-less classes tend to form mirror pairs. These pairs of classes have the lowest probability of transitioning directly from one to the other. This claim is supported by the small number of intermediate patterns that we found between the classes of each pair. For instance, intermediate patterns between AT-spike and CG-spike promoters can have both long poly(dA:dT) and long poly(dC:dG) tracts, a sequence arrangement that is rarely encountered (Figure 6B). Consequently, we suggest that such direct transitions of promoters between paired classes may be caused by strong selection pressures conditioned by radical changes in the environment.
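The assignment rule described above (highest recognition percentage wins, with near-ties read as transitional states) can be sketched as follows. The recognition percentages in the example and the 5-point margin used to call two scores "similar" are hypothetical; the source does not state a numerical criterion, and the neural network itself is not reproduced here.

```python
def classify_pattern(scores, margin=5.0):
    """Assign a pattern to the class with the highest recognition percentage.

    scores: dict mapping class name -> recognition percentage.
    margin: hypothetical cut-off; if the runner-up is within this many
            percentage points of the best class, the pattern is flagged
            as transitional between the two.
    Returns (best_class, runner_up_class or None).
    """
    if len(scores) < 2:
        raise ValueError("at least two classes are required")
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (second, second_score) = ranked[0], ranked[1]
    if best_score - second_score <= margin:
        return best, second
    return best, None


if __name__ == "__main__":
    # Made-up recognition percentages for illustration only.
    print(classify_pattern({"AT-spike": 48.0, "CG-spike": 46.5, "CG-based": 5.5}))
    print(classify_pattern({"CG-based": 90.0, "AT-less": 8.0, "AT-based": 2.0}))
```

In the first call the pattern would be reported as AT-spike with a transitional tendency towards CG-spike; in the second it is assigned to CG-based without ambiguity.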

Figure 6 Promoter patterns found in transitional states. (A) MDH1B gene promoter found in a transitional state between ATCG-compact and ATCG-balanced class, (B) UFC1 gene promoter found in a transitional state between AT-spike and CG-spike class, (C) LRRN1 gene promoter found in a transitional state between AT-less and CG-based class and (D) PCDHB10 gene promoter found in a transitional state between AT-based and CG-less class.

Tissue-specificity in humans

Our general classification criterion allowed us to demonstrate compelling biological correlates among 2,369 tissue-specific genes (Figure 7A,B). Some of our observations also build on previous studies that suggest direct correlations between short or long homopolymer tracts and certain levels of gene expression[42–46]. Indeed, we also observed a constant presence of different homopolymer elements in these patterns, suggesting that different promoter classes (i.e. CG-spike or AT-spike) indicate a predisposition for various levels of gene expression as well as for a distinct number of factors that trigger gene expression. Specific interaction clusters have been reported in the past, such as the muscle and heart or kidney and liver clusters[30]. We show some additional interaction groups, both between promoter classes and within each promoter class. In addition to these groups, the tissue order within each class further reflects the significance of the observed interactions (Additional file 4). The highlights of our observations include:

1. CG-based promoters have the highest percentage of occurrence (37.59%) and appear to correspond to the TATA-less class, which tends to be associated with “housekeeping” genes. CG-based promoters are not only the most common but, as expected, they also show the highest levels in all tissues. The six tissues in which CG-based promoters have the highest percentages are cervix, skin, stomach, ovary, mammary gland and tongue (Additional file 4: Figure S10B online).

2. AT-based promoters (5.25%) are present in all tissues except the mammary gland. The six tissues in which AT-based promoters have the highest percentages are liver, heart, kidney, lymph node, soft tissue and muscle. This order largely coincides with the six tissues in which ATCG-compact promoters have the highest percentages, namely prostate, liver, kidney, muscle, heart and lymph node. Equally curious, the six tissues in which CG-based promoters have the lowest percentages are liver, uterus, kidney, heart, lung and brain (Additional file 4: Figure S10G and Figure S7B online). This implies a special relationship between CG-based and AT-based promoters, because their proportions seem to indicate an almost antagonistic activity, which may suggest an involvement of these promoters in some metabolic processes. Nevertheless, the relationship between CG-based promoters and other classes of promoters in these tissues seems to conceal more than a simple association with the housekeeping genes.

3. AT-less promoters (14.36%) are overrepresented in uterus, while CG-less and ATCG-balanced promoters are overrepresented in testis (Additional file 4: Figure S10E,F,H online).

4. CG-less promoters have an occurrence of 3.98% and are present in all tissues except spleen (Additional file 4: Figure S10F online).

5. There was no clear correlation in tissue order between AT-less and CG-less promoters. Nevertheless, we noticed that some tissues have a tendency to stay grouped, such as muscle and heart, stomach and soft tissue, larynx and colon, lymph node and liver, or bone marrow and peripheral nervous system (Additional file 4: Figure S10E,F online). These groups may suggest a role of these promoters in simple feedback mechanisms among tissues responsible for maintaining homeostasis. Furthermore, the occurrence of short poly(dA:dT) tracts over short distances near the TSS could also indicate an involvement of AT-less promoters (and, by association, a complementary role for their CG-less counterparts) in short-term non-critical gene expression, which may strengthen our hypothesis regarding their physiological role. Moreover, in different tissues the AT-less and CG-less percentages show a combined relationship of complementarity and proportionality (Figure 8C).

6. AT-spike promoters are found especially in tissues that require high levels of gene expression, such as lung, eye, pancreas, uterus, liver, soft tissue, brain, kidney, prostate and blood. This tissue order and the presence of long poly(dA) or long poly(dT) tracts suggest an involvement of these promoters in survival mechanisms, possibly responsible for interactions with the environment.

7. CG-spike promoters also appear to be involved in survival mechanisms. These promoters are found in large numbers especially in tissues that need short-term critical gene expression. This is supported by the order of the first seven tissues in which these promoters are most common, namely lung, eye, brain, peripheral nervous system, spleen, heart and blood, which also tend to have a high interaction with the environment (Additional file 4).

8. The proportions of CG-spike and AT-spike promoters seem to be similar in the first two tissues, namely lung and eye. The occurrence of long poly(dA:dT) or tandem short poly(dA:dT) tracts over short distances (>30b) near the TSS could also indicate an involvement of AT-spike and CG-spike promoters in short-term critical gene expression.

Figure 7 Tissue-distribution frequencies for 2,369 human promoters. Two visualization methods are used: (A) shows the distribution of 30 tissues for each class of promoters and section (B) shows the distribution of promoter classes in each tissue.

Figure 8 An overall comparison between different promoter classes in each tissue. (A) tendency for a complementarity relationship between the CG-based and AT-spike classes, (B) tendency for a direct proportionality relationship between the AT-based and ATCG-compact classes, (C) a combined relationship between the AT-less and CG-less classes, both of complementarity and direct proportionality.

The frequency of AT-spike promoters (13.02%) exceeds that of CG-spike promoters (8.93%), but the two classes show proportional relative values in most tissues. Exceptions are cervix and muscle, where the number of CG-spike promoters surpasses the number of AT-spike promoters (Additional file 4).
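One way to make the terms "direct proportionality" and "complementarity" between two promoter classes across tissues more tangible is a simple correlation check. The sketch below is an interpretation, not the method used for Figure 8: it assumes that direct proportionality corresponds to a clearly positive Pearson correlation of per-tissue percentages and complementarity to a clearly negative one, and the 0.5 threshold and the example values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant values")
    return cov / math.sqrt(var_x * var_y)


def relationship(xs, ys, threshold=0.5):
    """Label the trend between two classes across tissues.
    threshold is a hypothetical cut-off, not taken from the source."""
    r = pearson(xs, ys)
    if r >= threshold:
        return "direct proportionality"
    if r <= -threshold:
        return "complementarity"
    return "no clear trend"


if __name__ == "__main__":
    # Made-up per-tissue percentages for two classes, for illustration only.
    class_a = [12.0, 14.5, 9.0, 16.0]
    class_b = [8.0, 9.5, 6.5, 10.0]
    print(relationship(class_a, class_b))
```

A combined relationship such as the one described for AT-less and CG-less promoters would not be captured by a single coefficient; it would require examining subsets of tissues separately.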