Abstract

Genome-wide association studies (GWAS) published in the last decade raised the number of loci associated with type 1 (T1D) and type 2 diabetes (T2D) to more than 50 for each of these diabetes phenotypes. Environmental factors seem to play an important role in the expression of these genes, acting through transcription factors that bind to promoters. Using the available databases, we examined the promoters of various genes classically associated with the two main diabetes phenotypes. Our comparative analyses revealed significant architectural differences between promoters of genes classically associated with T1D and T2D. Nevertheless, five gene promoters (about 16%) belonging to T1D and six gene promoters (over 19%) belonging to T2D showed some intermediary structural properties, suggesting a direct relationship to either the LADA (Latent Autoimmune Diabetes in Adults) phenotype or the non-autoimmune type 1 phenotype. The distribution of these promoters in at least three separate classes seems to indicate specific pathogenic pathways. The image-based patterns (DNA patterns) generated by promoters of genes associated with these three phenotypes support the clinical observation of a smooth link between specific cases of typical T1D and T2D. In addition, a global distribution of these DNA patterns suggests that promoters of genes associated with T1D are evolutionarily more conserved than those associated with T2D. Thus, the image-based patterns obtained by our method might be a useful new parameter for understanding the pathogenetic mechanism and the diabetogenic gene networks.

Citation: Ionescu-Tîrgovişte C, Gagniuc PA, Guja C (2015) Structural Properties of Gene Promoters Highlight More than Two Phenotypes of Diabetes. PLoS ONE 10(9): e0137950. https://doi.org/10.1371/journal.pone.0137950

Editor: Lucienne Chatenoud, Université Paris Descartes, FRANCE

Received: April 1, 2015; Accepted: August 25, 2015; Published: September 17, 2015

Copyright: © 2015 Ionescu-Tîrgovişte et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This work was supported by the postdoctoral program "CERO–Career profile: Romanian Researcher", grant number POSDRU/159/1.5/S/135760, cofinanced by the European Social Fund for Sectoral Operational Programme Human Resources Development 2007-2013.

Competing interests: The authors declare that they have no competing interests.

Introduction

Diabetes mellitus is a heterogeneous syndrome with an onset that can occur from birth to any point in one's lifetime [1]. The hereditary nature of diabetes has long been known, but its genetic basis started to be unravelled only in the seventh decade of the last century [2,3]. It was found that the common phenotypes of diabetes are polygenic and not monogenic, as previously supposed according to the Mendelian laws of heredity. It is not surprising that ~30 monogenic forms of diabetes could be identified relatively easily [4]. Each of these forms has a different clinical phenotype and, frequently, different therapeutic indications [5]. However, their prevalence remains below ~5% of the total diabetes cases. Gene-sequencing chips using targeted next-generation sequencing allow their quick and efficient detection [5,6].

The identification of the genetic basis of the common polygenic diabetes phenotypes proved to be a much more difficult issue. One obstacle was the inconsistency of diabetes classifications over time. The characterization of diabetes phenotypes began 150 years ago, when Etienne Lancereaux (1829–1910), based only on clinical observation corroborated with forensic studies, reached the conclusion that diabetes is not a simple disease but a complex syndrome. Based on their features, Lancereaux identified two main clinical forms. He described “thin” diabetes (which appears at a young age and is characterized by rapid weight loss and rapid progression towards death) and “fat” diabetes (which appears in adults in the presence of obesity, shows a hereditary nature and usually has a slow and torpid evolution) [7,8]. Due to its familial nature, the second phenotype was also named “constitutional” diabetes. All the official classifications proposed by WHO (1965, 1980, 1985 and 1998) derived from these initial observations [9].
Finally, a neutral designation of type 1 (T1D) and type 2 (T2D) diabetes was adopted for the two major forms of diabetes. The first important breakthrough in elucidating the pathogenesis of diabetes was the autoimmune-genetic theory of T1D [3,10]. Thus, it was confirmed that diabetes is a polygenic disease and that the mechanism of β-cell destruction is immune in nature. From this point on, genetic studies were planned on the assumption that the two major phenotypes of diabetes were two different diseases. Consequently, some researchers focused on the genetics of T1D and others on the genetics of T2D. Usually, pediatric patients were selected for T1D studies [11,12], while predominantly adult and obese patients were selected for T2D studies. Such a “black and white” vision of diabetes phenotypes led to a tendency to highlight mainly the differences between the two phenotypes. Moreover, the restrictive selection of patients enrolled in these genetic studies excluded almost all patients with diabetes onset between 20 and 40 years of age, whose separate analysis could have provided useful information for rethinking the classification of diabetes. “Intermediary” or “secondary insulin dependent” diabetes [13], better known as “Latent Autoimmune Diabetes in Adults–LADA” [14–21], occupies a grey zone between the two major phenotypes and later proved to be associated with both classic T1D and T2D genes [11,22–27]. The genetics of T2D progressed rather slowly during the decade of candidate gene analysis, perhaps because of an ill-advised focus on putative insulin resistance rather than on β-cell function, its true cause [28,29]. The genetic landscape of the two major diabetes phenotypes included only a couple of genes when the Genome Wide Association (GWA) scan emerged, a technique awaited with much interest and optimism.
GWAS have been able to establish an extended (but only provisional) inventory of the genes associated with T1D (~50) and with T2D (~60). The number of genes associated with T1D and T2D is expected to rise in the near future [30]. However, the discovery of new genes with a significant contribution to the pathogenesis of these phenotypes is less likely. The current genetic analysis techniques are mainly based on genotyping: genomic SNPs are tested for their association with one of the two investigated diabetes phenotypes. One major limitation of this technique lies in identifying the causal gene linked to the identified SNP, which can be located nearby but also at some distance from that SNP [30,31]. The second limitation is the difficulty of describing the function of the encoded proteins for many of these new genes. There is, however, hope that these drawbacks will be eliminated in the future [32–35]. The third limitation is the low contribution of recently identified genes to the genetic risk score of the disease [4,30,36–40]. Finally, the fourth limitation is the GWA scan technique itself. Even with a potentially higher SNP density in the future, it is unlikely that GWAS will identify new relevant genes associated with these two phenotypes. However, a more precise localization of genes already associated with these two phenotypes is highly expected in the near future. The current study proposes a new approach to genetic analysis that is complementary to classical GWAS analysis. Gene promoters have rarely been studied as a whole in relation to this syndrome, although their key role in the expression of genes associated with diabetes may be the root of the issue.

Materials and Methods

In our approach we used 31 promoter sequences (15 promoters from T1D and 16 promoters from T2D) obtained from the Eukaryotic Promoter Database (EPD) and HomoloGene. To unravel the design principles of these promoter architectures, we used Visual Basic to develop a software program for promoter analysis called PromKappa (Promoter analysis by Kappa), which was recently published [57–59]. In brief, we used a sliding window approach (window size of 30 nucleotides (nt) and a step of 1 nt) to extract two types of values, namely Kappa IC and (C+G)%. Kappa IC values were plotted on a graph against (C+G)% values, which formed a recognizable promoter pattern for each promoter sequence (S3 File). A promoter pattern is an image that consists of 470 lines, whose coordinates are given by the two values extracted from each sliding window (Fig 5A, 5B, 5C and 5D). The shape of a pattern is composed of clusters of lines of various sizes on the y-axis (Fig 5D). The pattern colors range from blue to red according to the number of overlapping lines. Unlike sequence alignment algorithms, our method compares the frequency and the nucleotide content of a promoter sequence, thereby measuring the degree of randomization of a DNA sequence [58]. The centers of weight of 8,515 promoter patterns were plotted on a second graph in order to show the distribution boundaries of promoters in the human genome (Fig 5D). Next, we superimposed on this distribution the promoter locations of genes associated with T1D and T2D. For comparison with the promoters of genes associated with diabetes, we show a total of 10 possible classes of gene promoters in eukaryotes (Fig 5E), found in our previous study [57].
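The sliding-window step described above can be illustrated with a minimal Python sketch. It is not the PromKappa implementation (which was written in Visual Basic). The function names `sliding_windows` and `center_of_weight`, the synthetic 500-nt sequence, and the reading of "center of weight" as the arithmetic mean of the pattern points are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def sliding_windows(seq: str, size: int = 30, step: int = 1) -> Iterator[str]:
    """Yield every window of `size` nt, moving `step` nt at a time."""
    for start in range(0, len(seq) - size + 1, step):
        yield seq[start:start + size]


def center_of_weight(points: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Assumed definition: the mean x and mean y of all pattern points."""
    if not points:
        raise ValueError("at least one point is required")
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return sum(xs) / len(xs), sum(ys) / len(ys)


# Hypothetical 500-nt promoter, used only to check the window arithmetic.
promoter = "ACGT" * 125
windows = list(sliding_windows(promoter))
n_points = len(windows)        # 500 - 30 + 1 windows
n_segments = n_points - 1      # segments joining consecutive points
```

Under these assumptions a 500-nt promoter yields 471 windows, and joining consecutive window points gives 470 segments, which agrees with the 470 lines per pattern mentioned in the text.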


Fig 5. Schematic overview of the promoter analysis. (A) Promoter sequences; (B) Kappa IC and (C+G)% values are extracted from each sliding window; (C) sliding window values plotted on a graph, which shows a recognizable image-based pattern for each promoter sequence; (D) the center of weight of each promoter pattern plotted on a second graph in order to show the distribution of 8,515 promoters. Red areas represent denser clusters of promoters. (E) The representative eukaryotic promoter classes: AT-based class, CG-based class, ATCG-compact class, ATCG-balanced class, ATCG-middle class, ATCG-less class, AT-less class, CG-spike class, CG-less class and AT-spike class [57–59]. https://doi.org/10.1371/journal.pone.0137950.g005

Promoters of the INS (insulin), FTO (fat mass and obesity associated) and CTLA4 (cytotoxic T lymphocyte associated antigen 4) genes, which were not found in EPD (composed of 8,515 Homo sapiens promoter sequences), were extracted from HomoloGene (500 bp genomic regions upstream of the gene). The promoters found in HomoloGene were then added to the EPD database file. Furthermore, the available EPD promoters were compared with the HomoloGene genomic regions (500 bp) upstream of genes associated with the T1D and T2D phenotypes in order to ensure their accuracy.

Kappa Index of Coincidence

The Index of Coincidence (IC) principle derives from cryptography, where it has been used in the analysis of ciphertext. The Kappa Index of Coincidence is a modified form of IC, adapted for the analysis of a single DNA sequence [57–59]. Here, the Kappa IC algorithm was used primarily as a unit of measure for the information contained in the DNA of the promoter regions. Thus, Kappa IC is used to calculate the level of “randomization” of a DNA sequence.
Kappa IC is sensitive to various degrees of sequence organization, such as simple sequence repeats (SSRs) or short tandem repeats (STRs). Kappa IC is defined for two sequences A and B of the same length N: the sum ∑ is incremented by 1 only when nucleotide A[i] from sequence A matches its counterpart B[i] from sequence B. The same method for measuring the Index of Coincidence was applied to a single sequence, in which the sequence is compared with shifted copies of itself, as shown in the pseudocode below.

    function KIC(A)
        T = 0
        C = 0
        N = length(A) - 1
        for u = 1 to N
            B = A[u + 1] … A[length(A)]
            for i = 1 to length(B)
                if A[i] = B[i] then C = C + 1
            next i
            T = T + (C / length(B) × 100)
            C = 0
        next u
        IC = Round((T / N), 2)
    end function

Here A represents the sliding window content and N is the number of shifts applied to it (the window length minus one). B is the part of A that starts at position u + 1, so each value of u produces one shifted variant of A. C counts the coincidences between the current B sequence and the A sequence, and T accumulates the percentage of coincidences over all B sequences.

C+G content

We extracted C+G values from each sliding window considering the nucleotide frequencies of the entire promoter sequence. In the first stage, the (C+G)% content of the entire (Total = TOT) promoter sequence was determined as:

$$CG_{TOT} = \frac{(C+G)_{TOT}}{(A+T+C+G)_{TOT}} \times 100$$

where CG_TOT represents the percentage of cytosine and guanine in the promoter sequence, (A+T+C+G)_TOT represents the sum of the number of occurrences of A, T, C and G in the promoter sequence, and (C+G)_TOT represents the sum of the number of occurrences of C and G in the promoter sequence. In the next stage, the value of CG_TOT was used to calculate the (C+G)% content of the sliding window (sw), CG_SW, which represents the percentage of cytosine and guanine in the sliding window. The promoter patterns are therefore relative to the (C+G)% of the entire promoter sequence; in this regard, the CG_SW value is relative to CG_TOT. The expression (A+T+C+G)_SW represents the sum of the number of occurrences of A, T, C and G in the sliding window sequence, and (C+G)_SW represents the sum of the number of occurrences of C and G in the sliding window sequence.
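As an illustration of the two measurements described in this section, the following Python sketch mirrors the KIC pseudocode and computes a plain (C+G)% value. It is not the authors' Visual Basic code. The names `kappa_ic`, `gc_percent` and `pattern_points` are hypothetical, and the rescaling of the window value relative to CG_TOT is deliberately not attempted, so the first coordinate of each point is simply the (C+G)% of the window itself.

```python
from typing import List, Tuple


def kappa_ic(window: str) -> float:
    """Self-comparison Kappa IC of one window, mirroring the KIC pseudocode."""
    shifts = len(window) - 1              # N in the pseudocode
    if shifts < 1:
        raise ValueError("window needs at least two nucleotides")
    total = 0.0                           # T in the pseudocode
    for u in range(1, shifts + 1):
        suffix = window[u:]               # B: A with its first u symbols dropped
        # zip stops at the shorter string, so A[i] is paired with B[i]
        matches = sum(a == b for a, b in zip(window, suffix))  # C
        total += matches / len(suffix) * 100
    return round(total / shifts, 2)


def gc_percent(seq: str) -> float:
    """(C+G)% over the A, T, C and G symbols of seq; other symbols are ignored."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ATCG")
    if counted == 0:
        raise ValueError("sequence contains no A, T, C or G")
    return (seq.count("C") + seq.count("G")) / counted * 100


def pattern_points(promoter: str, size: int = 30,
                   step: int = 1) -> List[Tuple[float, float]]:
    """Return one ((C+G)%, Kappa IC) pair per sliding window (unscaled GC)."""
    return [
        (gc_percent(promoter[i:i + size]), kappa_ic(promoter[i:i + size]))
        for i in range(0, len(promoter) - size + 1, step)
    ]
```

For example, a homopolymer window such as `"AAAA"` matches itself at every shift and scores 100, whereas `"ACGT"` has no coincidences at any shift and scores 0. This is in line with the statement that Kappa IC is sensitive to repeats.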

Conclusions

A third diabetes phenotype, known as double diabetes or 1.5 diabetes, is often observed in clinical practice. The results of our genetic analysis objectively support this view, showing that this third phenotype forms a smooth transition between T1D and T2D. It is interesting to note that the Kappa IC values of IDM overlap with those of T2D but not with those of T1D. These genetic particularities may explain the difficulties of classifying some diabetic patients into the two “traditional” diabetes phenotypes. We have shown that the number of different phenotypes of diabetes is higher than two and that the existence of IDM is objectively supported by our data. The third phenotype itself has two sub-phenotypes corresponding to several clinical particularities. Thus, the number of diabetes phenotypes is expected to increase in the near future, providing a strong impetus for a new classification of diabetes.

Acknowledgments

This work was supported by the postdoctoral program "CERO–Career profile: Romanian Researcher", grant number POSDRU/159/1.5/S/135760, cofinanced by the European Social Fund for Sectoral Operational Programme Human Resources Development 2007–2013.

Author Contributions

Conceived and designed the experiments: CIT PAG CG. Performed the experiments: CIT PAG CG. Analyzed the data: PAG. Contributed reagents/materials/analysis tools: PAG CIT. Wrote the paper: PAG CIT. Conceived of the study and participated in its design and coordination: CIT PAG. Created the algorithms and the software used in the analysis: PAG. Carried out the assembly of promoter files and manually tested the correctness of each promoter sequence: CIT CG. Participated in the promoter sequence analysis and drafted the manuscript: PAG CIT CG. Verified the accuracy of the data and repeated the experiment independently: CIT PAG CG. Discussed the results and commented on the manuscript: PAG CIT CG.