The mutation rate is not just variable at the megabase scale; recent studies have revealed that it also varies at the local scale, a term which we use to denote chromatin features spanning between one and a few hundred nucleotides (well below the megabase boundary; Box 2). In this article we review our current knowledge on the influence of such local features (nucleosomes, bound transcription factors, and the genic structure, among others) on the rate of somatic (and in some cases also germline) mutations, and explore its relevance. Although the review focuses on studies of their influence on the mutation rate along the human genome, it discusses, when pertinent, results obtained in other eukaryotic genomes that support and complement the former.

Two methodological points should be highlighted regarding the analysis of mutation rate at local features, which are not as key for megabase-scale features. First, since their influence is very localized, their effect on the mutation rate may only be revealed through the combined analysis of many of them in comparison with their flanking regions, as schematically represented in panels B and C of the accompanying figure. Alternatively, some researchers () have centered their analysis on observed mutations and computed the enrichment for local features with respect to the flanking non-mutated nucleotides. Second, some of these differences in mutation rate with respect to flanking areas may be due to particularities in the sequence composition of local features. Therefore, in these comparisons, their observed mutation rate needs to be corrected for that expected solely from the sequence, as described in panels B and C of the accompanying figure.

(A) Examples of local chromatin features reviewed in the article. Their genomic coordinates are obtained from different sources depending on the nature of the feature. For example, exon and intron coordinates may be obtained from Gencode (); nucleosome positions, from high-resolution MNase-seq experiments (see text); TF binding site locations, from genome-wide ChIP-seq probes ().

(B) Left: instances of a local feature across the entire genome are stacked on their centers, and the observed mutation rates in the features and their flanking regions are computed. Right: the effect of differences in sequence composition between instances of the feature and their flanks on their observed mutation rate must be taken into account. This may be done, for example, by randomly distributing the number of mutations observed in each sequence in the stack following the genome-wide frequency of tri-nucleotide context substitutions. This way, an expected mutation rate in the local features and their flanking regions based solely on their sequence composition may be computed.

(C) The difference between the observed and expected mutation rates yields a relative (with respect to the expected) increase of the mutation rate centered at the stacked feature instances.
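The correction for sequence composition described in the figure panels can be illustrated with a minimal Python sketch. It is not the procedure of any particular study: instead of randomly redistributing mutations, it computes the analytical expectation of that redistribution, and it uses one weight per tri-nucleotide rather than per tri-nucleotide substitution. The function names and the `context_weight` table are hypothetical and introduced only for illustration.

```python
def expected_profile(sequences, mutation_counts, context_weight):
    """Expected mutations per stacked position, from sequence composition alone.

    sequences       -- equal-length DNA strings aligned on the feature center
    mutation_counts -- mutations observed in each sequence (same order)
    context_weight  -- hypothetical genome-wide mutability per tri-nucleotide
    """
    length = len(sequences[0])
    expected = [0.0] * length
    for seq, n_mut in zip(sequences, mutation_counts):
        # Only positions with a complete tri-nucleotide context get a weight.
        weights = [0.0] * length
        for i in range(1, length - 1):
            weights[i] = context_weight.get(seq[i - 1:i + 2], 0.0)
        total = sum(weights)
        if total == 0:
            continue
        # Spread this sequence's mutations proportionally to the weights.
        for i, w in enumerate(weights):
            expected[i] += n_mut * w / total
    return expected


def relative_increase(observed, expected):
    """(observed - expected) / expected per position; None where undefined."""
    return [(o - e) / e if e > 0 else None for o, e in zip(observed, expected)]


# Toy stack of two identical sequences, each carrying four mutations.
weights = {"ACG": 1.0, "CGT": 3.0, "GTA": 0.0}
exp = expected_profile(["ACGTA", "ACGTA"], [4, 4], weights)
rel = relative_increase([0, 4, 6, 0, 0], exp)
```

In this toy example the second stacked position carries twice as many mutations as its sequence context predicts, while the third matches the expectation exactly. A real analysis would also need substitution-specific weights and a treatment of positions whose expectation is close to zero.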

Recent decades have seen the emergence of a wave of discoveries concerning the distribution of the mutation rate along the genome before selection and its underlying causes. Chromosomal regions that replicate late or which are compacted as heterochromatin (such as those associated with the nuclear lamina) accumulate more germline and somatic mutations than early-replicating or euchromatic areas (). Across genes, the mutation rate is strongly dependent on whether and at what levels they are transcribed (). These features have proven to be major determinants of mutation distribution in eukaryotic cells at the megabase scale. This is demonstrated by the fact that the distribution of mutations found in the cells of a tumor years after neoplastic transformation may be explained with high accuracy by the chromatin configuration of its cell of origin (). These discoveries were made possible by the availability of millions of germline variants detected across human populations, such as those identified by the 1000 Genomes Consortium (); of de novo mutations identified in trios of parents and offspring (see, for example,); and of somatic mutations found in normal cells () and in tumors ().

Mutations arising as a result of the same processes in germ cells, referred to as de novo mutations or variants, are passed along to the offspring and incorporated into the inherited pool of genetic variation of a species (). Germline variants have long been linked to genetic diseases, ranging from those that follow a Mendelian pattern of inheritance () to complex disorders, such as diabetes or hypertension, with both strong genetic and environmental components. The study of evolution (from the somatic cells of an organism to the divergence between species) and of the genetic underpinnings of disease relies on understanding how mutations naturally arise along genomes, specifically the propensity of different genomic regions to become mutated in different cell types.

Point mutations accumulate in the cells of multicellular organisms over cycles of stem cell divisions (). Mutations (Box 1) appear as the result of DNA damage that remains unrepaired at the time of DNA replication or of errors introduced by DNA polymerases. The majority of point mutations that occur in somatic cells are innocuous to the organism. However, some somatic mutations are capable of driving the tumorigenic transformation of cells () by providing capabilities which have been cataloged over the years as the “hallmarks of cancer” (). Somatic mutations that accumulate in specific cell types may also be one of the underlying causes of other human diseases. Examples of these include somatic mutations affecting LIS1 and DCX (lissencephaly), SCN1A (Dravet syndrome), TUBB2B (pachygyria), AKT3 (hemimegalencephaly), and FLNA (periventricular nodular heterotopia) (). Somatic mutations also likely play a role in aging (), and their accumulation in certain areas of the brain has been associated with neurodegeneration ().

Mutational signature: Profile of tri-nucleotide changes likely related to a mutational process active across a number of tumors. Specific mutational signatures mentioned in the review correspond to those collected at the COSMIC database (https://cancer.sanger.ac.uk/cosmic/signatures).

Relative increase of mutation rate: Increase (or decrease if negative) of the observed rate of mutations with respect to its expected value given the sequence composition of the type of local feature under study.

The chromatin landscape at the local level, the focus of this Review (Box 2), is dominated by the alternation of nucleosomes and linkers, interspersed with nucleosome-free regions where transcription factors (TFs) are often bound. The alternation of exons and introns in transcribed regions also contributes to this heterogeneous landscape at the local level. All these local features of the chromatin, and others that occur at lower frequencies, may interfere with one or more of the three steps of the mutational processes described above (Figure 1), thus yielding mutation rates that differ from those of their neighboring areas. In the following sections, we review our current knowledge of the influence of an array of local genomic features on the distribution of several types of DNA damage, the efficiency of the systems of DNA repair and damage tolerance, and ultimately, the mutation rate in somatic and germ cells.

Mismatches introduced by the polymerases during DNA replication also contribute to mutations in all cells if not corrected by the MMR (). The loss of the proofreading capability of DNA polymerases (such as POLE) that occurs in some tumors generates C>A mutations in a TCN context (). The defective functioning of certain DNA repair systems, such as MMR and homologous recombination-directed repair, also leaves patterns of mutations with enrichment in distinctive contexts (see, for example,). Furthermore, a recent study highlighted the role of the overexpression of a group of proteins as an endogenous source of mutations, both in human and bacterial cells ().

Some exogenous damaging agents create bulky lesions on the DNA that distort the double helical structure. One such mutagen is UV light, the incidence of which creates two types of DNA adducts called Cyclobutane Pyrimidine Dimers (CPDs) and (6-4) Pyrimidine-pyrimidone photoproducts (6-4PPs) at dipyrimidine sites (). Several carcinogens in tobacco also generate DNA adducts, e.g., benzo[a]pyrene-7,8-diol-9,10-epoxide on guanines, or BPDE-dGs (). Such bulky lesions may be recognized and repaired by the nucleotide excision repair (NER) machinery (reviewed in), but the ones that remain unrepaired during replication may promote the collapse of the replication fork through DNA polymerase stalling (). This outcome may be avoided by the recruitment of low-fidelity translesion polymerases, able to bypass the distorted region in an error-free or error-prone manner, the latter introducing mismatches in the DNA sequence (reviewed in). The sequence preference of the damage and the properties of the DNA polymerases determine that the mutations generated at the sites of unrepaired lesions are primarily C>T (at TCN and CCN tri-nucleotide contexts) for UV light and G>T (C>A) for tobacco carcinogens (). In the case of UV light-induced lesions, the spontaneous deamination of cytosines or 5mCs involved in CPDs also contributes to the generation of mutations alongside low-fidelity translesion polymerases ().

DNA damaging processes can be both endogenous and exogenous (Figure 1). The spontaneous deamination of 5-methylcytosines (5mCs) in the CpG context is an important endogenous source of mutations (). This process takes place at different rates depending on structural chromatin features, such as whether the 5mC is located at a nucleosome or a linker (). Upon deamination, 5mCs convert to thymines, producing a G:T mismatch. Both the mismatch repair (MMR) pathway, as evidenced by the marked increase of mutations arising from 5mCpG contexts in MMR-deficient colorectal tumors (), and the base excision repair (BER) pathway () play an important role in the recognition and correction of these errors. Unrepaired mismatches generated through 5mC deamination appear primarily as CpG>TpG mutations upon DNA replication. MMR corrects mispairings due to 5mC deamination more efficiently in early-replicating areas, and as a consequence, the number of CpG>TpG mutations correlates with replication timing (increasing in late-replicating areas) in MMR-proficient tumors, but not in their MMR-deficient counterparts (). Moreover, CpG>TpG mutations in POLE-deficient tumors are unevenly distributed with relation to the direction of DNA replication. One possible explanation for this is an increased misincorporation of adenine opposite 5mC by the leading-strand polymerase epsilon (). The spontaneous deamination of cytosines and 5mCs is the largest contributor to mutations in many somatic cell types and in germ cells. The rate of CpG>TpG mutations correlates with the number of cell divisions in the tissue under observation; in other words, it behaves in a clocklike manner ().

In recent years, the analysis of the frequencies of tri-nucleotide substitutions (the replaced base and those in its immediate 5′ and 3′ vicinity) in tumors has uncovered dozens of so-called mutational signatures. These constitute a set of irreducible vectors of relative frequencies from which the original tri-nucleotide substitution frequencies observed across tumors in a cohort can be best reconstructed (). Mutational signatures are proxies of the mutational processes active in tumors, and the discovery of their etiologies has profited from decades of epidemiological studies on cancer (as in the cases of tobacco, UV light, and BRCA deficiency) and from sequencing the genomes of cell lines or cells from model organisms exposed to hosts of mutagens (). Mutational processes have their origin in the interplay between the damage created by mutagenic agents, DNA repair systems, and the replication machinery, and for some mutational signatures, this interplay has been inferred (). It is thus helpful to visualize the mutational processes as a three-step mechanism (Figure 1), in which DNA damage or nucleotide mismatches (upper box) are dealt with by one or more cross-talking DNA repair systems (middle box). Different translesion polymerases during replication may then bypass them, either restoring the original sequence or leaving a mutation (bottom box).
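To make the notion of tri-nucleotide substitution frequencies concrete, the following Python sketch classifies single-nucleotide substitutions into the 96 pyrimidine-centered channels commonly used to describe mutational signatures, and computes their relative frequencies. It assumes the usual convention of reporting each substitution from the strand that carries the pyrimidine of the mutated base pair. The function names are hypothetical, and the sketch covers only the counting step, not the decomposition into signatures.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def substitution_channel(context, alt):
    """Return the pyrimidine-centered channel, e.g. 'T[C>T]G'.

    context -- reference tri-nucleotide with the mutated base in the middle
    alt     -- the base that replaces the middle one
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or alt == context[1]:
        raise ValueError("expected a tri-nucleotide and a different alt base")
    if context[1] in "AG":
        # Purine reference: describe the change on the opposite strand.
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def all_channels():
    """Enumerate the 6 substitution types x 16 flanking contexts."""
    subs = [("C", a) for a in "AGT"] + [("T", a) for a in "ACG"]
    return [f"{five}[{ref}>{alt}]{three}"
            for ref, alt in subs for five in "ACGT" for three in "ACGT"]


def channel_frequencies(mutations):
    """Relative frequency of each observed channel in (context, alt) pairs."""
    counts = Counter(substitution_channel(c, a) for c, a in mutations)
    total = sum(counts.values())
    return {channel: n / total for channel, n in counts.items()}
```

For instance, a G>A change in a CGA context and a C>T change in a TCG context both map to the channel `T[C>T]G`, which is a CpG>TpG substitution. Vectors of such frequencies across tumors would be the input to a signature extraction method, which is outside the scope of this sketch.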

In subsequent figures, which follow this three-step logic at different local chromatin features, we represent the distribution of the DNA damage, the efficiency of DNA repair machineries, and the mutations as solid lines when they have been experimentally established, or may be directly computed from single-nucleotide resolution maps. Otherwise, their distribution is represented through dashed lines.

The figure presents the three steps involved in the generation of a mutation. An exogenous or endogenous DNA damaging agent or a replication error (top panel) leaves specific lesions or mismatches (second panel). These lesions or mismatches may be dealt with by several DNA multi-protein machineries charged with their repair (thus returning the DNA sequence to its original state; third panel) or left uncorrected if these machineries fail to repair them. In this case, at the time of replication, their bypass by either high-fidelity polymerases or error-prone polymerases may lead to the introduction of the correct nucleotide (leaving no mutation) or an incorrect one, thus producing mutations (bottom panel).

More recently, a systematic survey examined the distribution of whole-genome somatic SNVs detected across 3,494 tumors from 28 cohorts along more than 3.5 million mapped human intergenic nucleosomes and linkers. To eliminate any variability in the frequency of mutations between nucleosomes and linkers due to the tri-nucleotide composition of the DNA in these two regions, the authors subtracted the rate of SNVs expected solely from the sequence composition. SNVs contributed by some mutational processes showed a relative increase in their rate (with respect to that expected) in nucleosomes, whereas those contributed by others were depleted in nucleosomes (Figure 2). For example, mutations contributed mainly by UV light appear at a higher rate than expected within nucleosomes. The authors hypothesized that this could be attributed primarily to the known decreased efficiency of the global NER pathway () in nucleosomes compared to linkers. They proposed that other mechanisms of DNA repair, such as BER, could also show lower efficiency at nucleosomes than at linkers (Figure 2). This could explain, for example, the higher-than-expected nucleosomal rate of mutations contributed by processes represented by signature 17, possibly a result of oxidative damage mainly in esophageal and gastric adenocarcinomas (). On the other hand, while BPDE-dGs are mainly repaired by the global NER system in intergenic nucleosomes, their formation is known to be inhibited at nucleosomal DNA. Therefore, the interaction of the mutagen with the nucleosome probably determines the differential distribution of tobacco-induced mutations across nucleosomes and linkers.

As whole genomes of tumors started to become available, their analysis showed that cancer somatic mutations also appear at different rates in nucleosomes and linkers. The rate of cancer mutations was found to correlate with nucleosome occupancy (). More detailed analyses demonstrated that somatic mutations contributed by specific processes occur more frequently than expected in nucleosomes. For example, mutations contributed by processes represented by signatures 17 and 18 in breast tumors were observed to be more frequent in nucleosomes than expected from their sequence composition. Sabarinathan and collaborators also reported, from an analysis centered at TF binding sites (TFBS), that melanoma mutations (the vast majority of which arise from unrepaired CPDs and 6-4PPs generated by UV light) follow a periodic pattern, which differs from their expected distribution based on sequence composition (Figure 2). The maxima of this pattern correlate very strongly with the dyads of translationally well-positioned (see below) nucleosomes that flank the active TFBS ().

The availability in recent years of techniques to map the positions of nucleosomes () has fueled studies dedicated to answering the question of whether mutations distribute differentially between nucleosomes and linkers. Tolstorukov and collaborators observed in 2011 that germline SNVs appeared more frequently at bulk nucleosomes, but not at those bearing certain histone marks, than at linkers (). On the other hand, they reported that germline indels appeared less frequently at any type of nucleosome than at linkers. In the interpretation of these results, the authors favored a model in which purifying selection, rather than a differential rate of mutation generation, plays the predominant role in the depletion of germline indels and SNVs at nucleosomes bearing certain histone marks. A study of the rate of de novo substitutions and indels in yeast showed similar results (), while another, comparing the rates of different substitution types, reported that C>T transitions occur less frequently at nucleosomal DNA (). Consistently, the analysis of the human-chimpanzee divergence found that while divergent sites in general tend to occur more frequently within nucleosomes, the frequency of C>T substitutions is reduced with respect to linkers (). C>T substitutions largely result from the spontaneous deamination of cytosines and 5mCs (see above). This deamination reaction is more likely to occur in transient single-stranded DNA regions that arise through “breathing” (i.e., local opening of the DNA double helix), which is hindered by the structural constraints imposed on the DNA wrapped around nucleosomes (). Thus, in this case, the differential rate of damage generation at nucleosomes versus linkers (i.e., the first step of the model mutational process outlined above) is likely to contribute to the heterogeneity of the mutation rate.

Nucleosomes cover between 75% and 90% of the genomes of eukaryotes, making the alternation of nucleosome cores (nucleosomes, for simplicity) and linkers (represented respectively as green and yellow segments in Figure 2 ) their most pervasive local chromatin feature (). Consisting of histone proteins wrapped by 1.67 turns of the DNA fiber, nucleosomes provide the first level of DNA compaction, and constitute the building blocks of higher-order chromatin structures. Furthermore, they provide a framework for the binding of chromatin remodelling proteins which can interact with an array of post-translational modifications of histones (). Through these capabilities, nucleosomes play an important role in the regulation of chromatin structure and gene expression ().

The figure schematically represents two types of interactions of DNA damaging agents and DNA repair machineries with the nucleosomes, which result in two different periodic mutation rate patterns. UV light generates bulky DNA adducts in an aperiodic pattern, i.e., roughly the same amount of damage in nucleosomes and linkers (left panel), as inferred from whole-genome CPD maps (see text). BPDE-dGs generated by tobacco carcinogens, on the other hand, follow a periodic distribution, with a higher rate of damage in the linkers (right panel), also experimentally mapped (). DNA repair machineries (NER in these cases) charged with the repair of these lesions act more efficiently in linkers, which are more accessible. The resulting mutation distribution in both scenarios is radically different, with a higher-than-expected mutation rate in the nucleosomes for UV light-induced mutations (left panel) and in the linkers for tobacco-generated mutations (right panel).

In agreement with results obtained from the comparison of several Drosophila species, a higher-than-expected frequency of rare SNVs and C>T polarized divergence has also been observed at minor-in stretches. It is thus plausible that SNVs generated by certain mutational processes tend to accumulate more than expected at minor-in stretches even before the action of selection and that, therefore, the increased C>T divergence in evolution could arise through differential mutagenesis at minor-in and minor-out sites in germ cells.

SNVs resulting from different mutational processes show a relative increase at either minor-in or minor-out DNA stretches (). For example, somatic SNVs arising from exposure to UV light (signature 7) or tobacco carcinogens (signature 4) exhibit a higher rate than expected at minor-out stretches, while SNVs likely resulting from oxidative damage in esophageal and stomach carcinomas (signature 17) are more frequent than expected at minor-in segments (Figure 3). In the case of the UV light signature, the mechanistic cause underlying the differential distribution of mutations between these two types of nucleosome-covered DNA segments has been traced to the interaction of DNA lesions and the mechanisms of DNA repair. On the basis of genome-wide maps of UV damage in yeast, it was demonstrated that more CPDs are generated in minor-out than in minor-in segments immediately after irradiation. Despite a higher recorded efficiency of NER at the more exposed minor-out stretches, more CPDs remained there after 1 h. By combining similar genome-wide maps of UV damage in human cells with the distribution of mutations in melanomas, other studies demonstrated that these differences in the generation and repair of DNA damage result in a differential distribution of mutations across minor-in and minor-out segments.

It is nevertheless also possible that the differential interaction of mutagenic agents and/or DNA repair machineries with minor-in and minor-out stretches of the DNA may result in different rates in the generation of mutations across these two regions. In support of the latter hypothesis, Mao and collaborators recently observed a decreased activity in the repair of methylated guanines (generated by MMS) at minor-in stretches of yeast DNA. Upon demonstration that these regions are also less efficiently degraded by DNase I than their counterparts facing away from histones, Mao and collaborators proposed that the underlying reason for the decreased activity of BER at minor-in regions is their reduced accessibility. They also put forward the idea that, brought to its natural conclusion, the reduced repair of methylated bases would result in an increased rate of MMS-induced mutations at minor-in regions ().

Several observations point to the existence of an interaction between the rotational positioning of nucleosomes and the mutation rate. For example, different mutational biases have been observed along nucleosome-covered DNA in humans and chimpanzees, as well as in two nucleosome-carrying archaebacteria. The authors proposed that these could be explained through mutations promoting nucleosome repositioning through evolution. An analysis of inherited SNPs across D. melanogaster populations, and of the polarized divergence between D. melanogaster and a closely related species, showed a higher frequency of C>T substitutions at minor-in stretches of nucleosome-covered DNA (). The authors interpreted this as evidence of selection favoring the nucleotide changes that improve the bendability of the DNA fiber around the histones (see above).

The location of the nucleosomes along the DNA fiber, or translational positioning, is determined by both DNA sequence-dependent and -independent factors. For example, homopolymeric sequences are known to be unfavorable for the positioning of nucleosomes (). An example of a sequence-independent factor is the positioning of TFs at their binding sites, which also excludes nucleosomes. The first whole-genome nucleosome mapping efforts (see previous section) observed that the translational positions of a small fraction of nucleosomes were more conserved across the population of probed cells (). These well-positioned nucleosomes correspond precisely to those flanking the binding sites of TFs. Nevertheless, within the area occupied by nucleosomes (irrespective of the strength of their translational positioning), certain locations that result in a particular orientation of the DNA minor groove relative to the histone core are preferred. This is known as the rotational positioning of nucleosomes (). In detail, the preferred rotational positions are those that place A/T-rich segments at stretches of DNA with the minor groove facing the nucleosome and C/G-rich segments at stretches of DNA with the minor groove facing away. In Figure 3, these two structurally distinct stretches of DNA are represented as green (minor-in) and yellow (minor-out) segments, respectively. The reason for this is that the segments of the DNA whose minor groove faces the nucleosome are structurally constrained by chemical groups of histones H3 and H4. A/T (or WW) di-nucleotides, with their higher flexibility and low steric hindrance, are more stable at these sites ().

The minor groove of nucleosome-wrapping DNA is oriented alternately toward the histones or away from them every 10 bp. As a result, the interaction between some DNA damaging agents and the DNA repair machineries produces a 10 bp periodic pattern of mutation rate. (In the extended representation of nucleosome-covered DNA the segments corresponding to the dyads are omitted.) The distribution of 8-oxo-Guanines (8-oxo-Gs) across nucleosome-covered DNA is unknown, and therefore is represented with a dashed line (left panel). UV light-generated bulky DNA adducts follow a periodic pattern, with maxima at segments with the minor groove facing away from the histones (minor-out; right panel), as inferred from whole-genome CPD maps (). DNA repair machineries (NER in the case of UV light damage and possibly BER in that of 8-oxo-Gs) charged with the repair of these lesions act more efficiently in minor-out segments due to higher accessibility (). In the case of signature 17 mutations, possibly caused by 8-oxo-Gs, the mutation rate is higher than expected at minor-in segments (left panel). On the other hand, UV light-induced mutations distribute with a rate higher than expected at minor-out segments (right panel).
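One simple way to check whether a stacked profile of relative mutation rate along nucleosome-covered DNA carries the 10 bp periodicity described above is to compare its spectral power across candidate periods. The Python sketch below is only an illustration under stated assumptions: the profile is a plain list of per-position values, the candidate periods are supplied by the user, and the function names are hypothetical. It is not the statistical procedure used in the studies reviewed here.

```python
import cmath
import math


def power_at_period(profile, period):
    """Spectral power of a mean-centered profile at a given period (in bp)."""
    mean = sum(profile) / len(profile)
    component = sum(
        (value - mean) * cmath.exp(-2j * math.pi * position / period)
        for position, value in enumerate(profile)
    )
    return abs(component) ** 2 / len(profile)


def dominant_period(profile, periods):
    """Candidate period with the highest spectral power."""
    return max(periods, key=lambda period: power_at_period(profile, period))


# Synthetic profile: a flat rate modulated by a 10 bp oscillation.
synthetic = [1 + 0.2 * math.cos(2 * math.pi * k / 10) for k in range(140)]
best = dominant_period(synthetic, range(5, 21))
```

On the synthetic profile the highest power falls at a period of 10 bp. With real data, the signal would have to be compared against profiles from randomized mutations to judge whether a peak is stronger than expected by chance.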

Taken together, these findings indicate that the frequency of somatic mutations observed at regulatory regions, such as TFBSs, is shaped by the complex interplay between DNA damage and repair levels. For example, in melanomas, the higher somatic mutation rate at TFBSs can be explained by either increased UV-induced DNA damage, impaired NER activity, or a combination of both. Further studies are necessary to fully understand the determinants of this process.

In other tumor types, such as colorectal adenocarcinomas, no genome-wide increase of somatic mutation at TFBSs is observed. However, the binding sites of specific TFs like CTCF (CCCTC-binding factor) do exhibit higher somatic mutation rate in a subset of colorectal tumors () and other cancer types () ( Figure 4 C). Coherently, higher rates of somatic mutations have been observed at the binding sites of tissue-specific TFs, such as the estrogen receptor in estrogen-receptor-positive breast cancers () and androgen receptor in prostate cancers (). However, the mechanisms underlying mutagenesis at these specific sites are not fully understood.

In melanomas, recurrent C>T mutations at active promoter regions are observed at a specific nucleotide context (CTTCCG—the mutated base is at the 5′ end or one base upstream), which is also the binding site motif for ETS-family TFs ( Figure 4 B). This increase of mutation rate in a specific nucleotide context has been attributed to the increased level of UV-induced damage (CPDs) in both NER repair-proficient and -deficient cell lines (). Similarly, the analysis of UV-induced damage (CPDs and 6-4PPs) maps at the binding sites of specific TFs (such as CTCF, NFYB, POU2F2, and SP1) has uncovered different levels of damage at the binding sites of different TFs. Differences have also been observed depending on the strand of the DNA to which the TFs bind and the type of damage (). The differences in the DNA damage levels specific to a TF can be explained by different levels of DNA conformational changes (e.g., DNA bending) induced by the binding of the TF, which may expose positions that are more or less vulnerable to UV-induced damage ().

The rate of somatic mutations is lower at accessible regulatory regions (defined by the DHS) than in their flanking areas and the rest of the genome, both in cell lines and primary tumors (Figure 4). This has been attributed to the higher accessibility of the DNA in these regions to global DNA repair machineries (such as NER and BER) (). Consistent with this, the in vitro analysis of NER activity in UV-irradiated skin-fibroblast cell lines has shown that DNA repair activity is higher in open chromatin regions (). Although open chromatin regions are highly accessible, the binding of regulatory proteins (like TFs) within these regions can influence the rate of DNA damage or repair activity locally. Several studies have shown that in specific tumor types (i.e., melanomas and some lung tumors), the rate of somatic mutations is higher at active TFBSs (i.e., TFBSs within DHS regions) than in their immediate flanking regions. This can be linked to reduced NER activity at the TFBS (Figure 4A).

(A) The rate of generation of UV-induced mutations (which correspond to mutational signature 7) at DNA sites bound by a TF is increased with respect to their flanks, which in the case of many TFs may be attributed to a decrease of NER efficiency driven by the obstruction caused by the TF to lesion recognition (). The generation of damage is also influenced by the binding of TFs to DNA, with different effects depending on the TF ().

(B) At the binding sites of TFs of the ETS family, the increase of signature 7 mutations can be explained by an increase of UV-induced damage (computed from whole-genome single-nucleotide resolution CPD maps), which cannot be counteracted even if there is an increased NER efficiency. The increased rate in the generation of lesions is driven by specific positions of the TF binding site and caused by binding-induced perturbations in the DNA structure that favor CPD formation ().

(C) An increased rate in the generation of mutations at the binding sites of CTCF (in a context that does not correspond to mutational signature 7) with respect to their flanks is observed across microsatellite stable (MSS) colorectal (CRC) and gastric tumors with chromosome instability (CIN). A similar increase of mutation rate is observed at CTCF binding sites across melanomas (signature 7 mutations). Again, the increase of mutations is driven by specific positions within the binding site.

The binding of transcription factors (TFs) to specific DNA sequences, referred to as transcription factor binding sites (TFBSs), can influence the expression of genes. In humans, there are 1,639 known or likely TF proteins, with at least one DNA binding domain (). Of these, around 70% have preferred binding sites (TFBS motifs), determined through various experimental approaches in vitro (such as protein binding microarray and SELEX based methods). High-throughput approaches, such as ChIP-seq and DNaseI-seq, are widely used for the genome-wide identification of TFBS and accessible DNA regions, respectively (). DNase I hypersensitive sites (DHS), corresponding to open chromatin regions, detected through DNaseI-seq help to demarcate regulatory DNA regions (where TFs bind) and genic regions that are actively transcribed (). Thus, the data generated using ChIP-seq and DNaseI-seq are complementary to each other and are used to locate TF binding sites within DHS.

In summary, genes that are actively transcribed are safeguarded by DNA repair machineries through different mechanisms during DNA replication and transcription and thus accumulate mutations at a lower rate. However, the local chromatin structure within genic regions and error-prone DNA repair pathways play a major role in the local distribution of mutations.

Recent studies have shown that there is considerable local variation of the mutation rate along the gene body. One study showed that, in tumors proficient in mismatch repair, the rate of somatic mutations in exons is lower than expected based on their sequence context. In contrast, mismatch repair-deficient tumors do not exhibit this decreased exonic mutation burden. This indicates that the decrease is caused by higher activity of the mismatch repair (MMR) pathway in exons than in introns (Figure 5B). Higher MMR activity in exonic regions can be driven by the recruitment of MMR to the H3K36me3 histone mark (), which is enriched in the nucleosomes within exonic regions compared to intronic ones (). Another study has shown that the 3′ end of genes is enriched for A>G mutations, particularly in tumor types exposed to carcinogens such as UV and alcohol. The authors attributed this observation to the activity of error-prone polymerases of the non-canonical MMR pathway, recruited to H3K36me3-rich regions in actively expressed genes. A third study found that somatic indels (insertions/deletions) are particularly enriched in non-coding regions of highly expressed lineage-specific genes in lung adenocarcinomas and other tumor types, through as yet unknown mutational processes.
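The comparison between observed mutation counts and those expected from sequence context, which underlies the exon result described above, can be sketched as follows. This is a minimal illustration rather than the pipeline of the cited studies: the function names, the toy sequence, and the per-context rates are hypothetical, and a real analysis would estimate the rates from genome-wide tri-nucleotide substitution frequencies and account for both DNA strands.

```python
from collections import Counter

def trinucleotide_counts(seq):
    """Count the tri-nucleotide contexts centered on each internal position."""
    seq = seq.upper()
    contexts = (seq[i - 1:i + 2] for i in range(1, len(seq) - 1))
    return Counter(c for c in contexts if set(c) <= set("ACGT"))

def expected_mutations(seq, rate_per_context):
    """Expected mutation count given a (hypothetical) rate per context."""
    counts = trinucleotide_counts(seq)
    return sum(n * rate_per_context.get(ctx, 0.0) for ctx, n in counts.items())

def relative_excess(observed, expected):
    """Relative difference of observed with respect to expected mutations."""
    return (observed - expected) / expected

# Toy example: rates are illustrative assumptions, not measured values.
toy_rates = {"ACG": 0.1, "CGT": 0.2}
exp = expected_mutations("ACGTACG", toy_rates)
print(exp, relative_excess(0.2, exp))
```

A negative relative excess in exons, computed in this way across many genes, would correspond to the lower-than-expected exonic mutation burden.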

Within the genic region, the rate of mutations is higher in the non-transcribed (non-template) DNA strand due to the preferential activity of transcription-coupled repair (TCR) on the transcribed (template) DNA strand. However, a process of transcription-coupled damage (TCD) targeting the non-transcribed strand during transcription has also been reported () (Figure 5A). The mutational signatures associated with UV light exposure (signature 7), smoking (signature 4), and alkylating agents (signature 11) show the characteristic footprint of transcriptional strand bias in the distribution of mutations, which can be primarily attributed to TCR. On the other hand, the transcriptional strand bias observed for the APOBEC mutational signatures (signatures 2 and 13) and for A>G substitutions (in a subset of liver cancers) can be explained by TCD ().
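A transcriptional strand asymmetry of this kind can be quantified by assigning each mutated base pair to the strand that carries its pyrimidine. The sketch below assumes a hypothetical input of (reference base on the genome plus strand, strand of the overlapping gene) pairs; it is not taken from any of the cited studies, and a real analysis would also normalize by the number of each base available on each strand.

```python
from collections import Counter

def pyrimidine_strand(ref, gene_strand):
    """Strand of the gene (template or non-template) carrying the pyrimidine.

    `ref` is the reference base on the genome '+' strand; the gene's own
    strand is its coding, i.e., non-template, strand.
    """
    pyrimidine_on = "+" if ref.upper() in ("C", "T") else "-"
    return "non-template" if pyrimidine_on == gene_strand else "template"

def strand_asymmetry(mutations):
    """Return (non-template count, template count, ratio) for the mutations."""
    counts = Counter(pyrimidine_strand(ref, strand) for ref, strand in mutations)
    nt, t = counts["non-template"], counts["template"]
    return nt, t, (nt / t if t else float("inf"))

# Hypothetical toy input: (reference base, gene strand)
toy = [("C", "+"), ("G", "-"), ("G", "+"), ("C", "+")]
print(strand_asymmetry(toy))
```

For a lesion that forms on pyrimidines, such as UV-induced CPDs, a ratio above 1 is the pattern expected when TCR preferentially removes lesions from the template strand.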

(A) Mutations accumulate more in the non-template strand of the DNA than in the template strand due to transcription-coupled damage and repair. The transcription-coupled DNA repair pathway of NER (left panels) is recruited by RNA polymerase to sites of DNA lesions on the template strand that cause the stalling of transcription elongation. This asymmetric recruitment results in mutations accumulating more in the non-template than in the template strand. On the other hand, the lack of protection of the non-template strand during transcription causes an increase in its level of DNA lesions by mutagenic processes targeting single-stranded DNA (right panels).

(B) Exons exhibit a lower-than-expected mutation rate in certain tumor types due to differential mismatch repair between exonic and intronic areas. Exonic mismatches are corrected with higher efficiency than intronic ones. The distribution of mismatches and the efficiency of MMR (represented as brown and blue dashed lines, respectively) across exons and introns are actually unknown; they have been inferred from the computed expected mutation rate and from its difference from the observed mutation rate, respectively (). This difference in MMR efficiency is probably due to the exonic enrichment for certain histone marks, such as the tri-methylation of H3K36, known to be bound by the MSH2/MSH6 heterodimer, which forms the mismatch detection complex ().

Gene expression levels have long been associated with the variation of the mutation rate across genic regions in the genome. Highly expressed genes tend to exhibit a lower mutation rate than lowly expressed ones. This can be explained by two main factors: (1) transcription-coupled DNA repair, such as NER (TC-NER), preferentially corrects DNA lesions encountered by the RNA polymerase on the transcribed (template) DNA strand (); and (2) highly expressed genes are more accessible to DNA repair, and they tend to replicate early and thus with high fidelity ().

Finally, local structural properties of the DNA double helix may also affect the mutation rate. A recent study showed that the curvature of the DNA around spontaneous variants in yeast and human tumor somatic mutations is significantly lower than that expected from the whole genome. In other words, mutations tend to accumulate at regions with lower curvature.

Stretches of single-stranded DNA may also appear at certain local features capable of adopting secondary structures that facilitate their appearance. Examples of these rare local features, which cover between 0.07% and 4% of the genome (), include G-quadruplexes, hairpins or cruciform structures (which can form at inverted repeats), slipped structures (which appear at regions with direct repeats), triple-stranded DNA (at mirror repeats or H-DNA stretches), and Z-DNA regions. A recent analysis surveyed the rate of somatic mutations at these rare features across several cancer types (). The authors observed that the mutation rate is higher at these local features than at their flanking regions. Furthermore, for some of these features (notably, inverted repeats), they found that the more the sequence favored their formation (i.e., with optimal spacer-to-arm ratios), the higher the rate of mutations observed at the spacer. Since spacers are composed of single-stranded DNA, their likelihood of mutation increases with the probability of formation of the structure. The same applies to G-quadruplexes, in which the authors found an enrichment for mutations at single-stranded loops compared to G-rich stretches. The increase of the mutation rate at these local features is thus caused by the secondary structures they adopt.
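To make the distinction between loops and G-rich stretches concrete, the sketch below locates candidate G-quadruplex motifs and separates mutations falling in loops from those falling in G-runs. It uses a widely applied sequence heuristic (four runs of at least three guanines separated by loops of 1 to 7 nucleotides), which is an assumption here and not necessarily the definition used in the analysis discussed above. The function names are hypothetical, and only the given strand is scanned.

```python
import re

# Heuristic motif: G-run, loop, G-run, loop, G-run, loop, G-run
G4_PATTERN = re.compile(
    r"(G{3,})([ACGT]{1,7}?)(G{3,})([ACGT]{1,7}?)(G{3,})([ACGT]{1,7}?)(G{3,})"
)

def find_g4(seq):
    """Return candidate motifs with 0-based, end-exclusive loop intervals."""
    motifs = []
    for m in G4_PATTERN.finditer(seq.upper()):
        loops = [m.span(group) for group in (2, 4, 6)]
        motifs.append({"start": m.start(), "end": m.end(), "loops": loops})
    return motifs

def count_loop_and_run_mutations(motif, positions):
    """Count mutated positions inside the motif's loops and inside its G-runs."""
    inside = [p for p in positions if motif["start"] <= p < motif["end"]]
    in_loops = sum(any(s <= p < e for s, e in motif["loops"]) for p in inside)
    return in_loops, len(inside) - in_loops

motif = find_g4("TTGGGAGGGTGGGCGGGTT")[0]
print(motif, count_loop_and_run_mutations(motif, [5, 6, 13, 18]))
```

Because loops and G-runs differ in length and base composition, these raw counts would still need to be compared against an expectation based on sequence context before concluding that loops are enriched for mutations.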

The transient appearance of single-stranded DNA sequences at certain genomic regions may promote locally increased mutation rates. Such single-stranded DNA segments may form at otherwise double-stranded areas due to certain cellular processes. For example, three-stranded RNA:DNA hybrid structures that include a stretch of single-stranded DNA (R-loops) form in the course of transcription (). Cytosines and 5mCs at these stretches of single-stranded DNA spontaneously deaminate into uracils and thymines, respectively, at higher rates than those at double-stranded stretches (). The deamination of cytosines into uracils at these stretches of single-stranded DNA may also result from the activity of the AID (activation-induced cytidine deaminase) enzyme, the physiological role of which consists in the hypermutation of the immunoglobulin heavy chain during B lymphocyte maturation (). C>U changes may be targeted by BER, either resulting in a correct repair or leaving behind an abasic site or a nick, which may result in base substitutions or strand breaks, respectively. The deamination of 5mC into thymine, on the other hand, yields a mismatch that, if unrepaired, would be incorporated as a C>T substitution.

Implications of Local Variation of Mutation Rate

The recent studies reviewed here have uncovered the local variability of the mutation rate across the genome, with implications for different areas of knowledge. These include our understanding of (1) basic cellular processes (i.e., DNA damage and repair), (2) the evolution of the genome sequence, (3) the evolution of species and of tumors, with special interest for the identification of disease-related mutations, and (4) the study of the mutagenic effect of chemotherapeutic agents on healthy tissues.

The emerging local variability of the rate of mutations sheds light on the mechanisms of DNA repair. In particular, it shows how the efficiency of different DNA repair mechanisms is affected by the interaction of the DNA with certain genomic features, such as nucleosomes, transcription factors, and the transcriptional machinery. It also shows that the structure that DNA adopts at the local level has a strong influence on the rate at which some types of DNA lesions appear due to the interaction of mutagenic processes with these local genomic features. The interplay between differential DNA damage and repair ultimately results in the observed differential rates of mutations along the genome.

The uneven local mutation rate caused by mutational processes active in germ cells has the potential to leave an imprint in the genome. The differential efficiency of DNA repair (and/or DNA damage) at certain local features across evolutionary time may have influenced the sequence composition of eukaryotic genomes. For example, Pich and colleagues speculated that, over evolutionary time, the 10-bp periodicity in the rate of C>T germline substitutions in nucleosome-covered DNA could lead to a periodic enrichment of A/Ts in eukaryotic genomes (Pich et al., 2018). Indeed, it has long been observed that A/T di-nucleotides show a periodic pattern, with higher frequency where the DNA minor groove faces the nucleosome, creating what is referred to as a WW periodicity along the sequence of eukaryotic genomes (Mrázek, 2010; Tolstorukov et al., 2011). The results by Pich and collaborators thus propose a mechanism for the emergence of this WW periodicity across eukaryotic genomes.
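A WW periodicity of this kind can be made visible by counting how often two WW di-nucleotides (AA, AT, TA, or TT) occur at a given distance from each other. The following is a minimal sketch with hypothetical function names and a synthetic sequence; it is not the method of the cited studies, which work on nucleosome-aligned genomic sequences and use more robust spectral or autocorrelation analyses.

```python
def ww_positions(seq):
    """0-based start positions of WW di-nucleotides (W = A or T)."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1)
            if seq[i] in "AT" and seq[i + 1] in "AT"]

def lag_counts(positions, max_lag):
    """Number of WW pairs separated by each distance from 1 to max_lag."""
    occupied = set(positions)
    return {lag: sum((p + lag) in occupied for p in positions)
            for lag in range(1, max_lag + 1)}

# Synthetic sequence with an AA di-nucleotide every 10 bp (illustrative only).
toy_seq = ("AA" + "GCGCGCGC") * 5
counts = lag_counts(ww_positions(toy_seq), max_lag=12)
print(counts)
```

In this synthetic example all pairs fall at a distance of 10 bp; in real nucleosomal DNA the signal would appear as a modest excess at multiples of roughly 10 bp over a noisy background.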

Evolutionary studies, both of species and of tumors, require an assessment of the background mutation rate in order to study the effect of selection. Local differences in the mutation rate need to be incorporated into the models of the background mutation rate. For example, methods aimed at the identification of cancer genes and non-coding genomic features containing driver mutations rely on the identification of signals of positive selection in the pattern of mutations. To this end, these methods identify genes or other genomic features with mutational patterns that are significantly different from those computed using a statistical model of the mutational processes operating across tumors. The abnormally high mutation rate observed at TFBSs in melanomas, and at other rare local features (see previous section), may thus lead to the identification of false-positive non-coding driver genomic elements, such as promoters (Rheinbay et al., 2017).
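The way an underestimated background rate can produce a false-positive driver call may be illustrated with a simple Poisson tail test. This is only a sketch: actual driver discovery methods use considerably richer background models, the function name is hypothetical, and the mutation counts below are invented for illustration.

```python
import math

def poisson_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam): chance of at least k mutations."""
    cdf_below_k = sum(math.exp(-lam) * lam ** i / math.factorial(i)
                      for i in range(k))
    return max(0.0, 1.0 - cdf_below_k)

observed = 12          # hypothetical mutations observed in a promoter
expected_naive = 3.0   # hypothetical expectation ignoring local features
expected_local = 9.0   # hypothetical expectation including the TFBS effect

p_naive = poisson_tail(observed, expected_naive)
p_local = poisson_tail(observed, expected_local)
print(f"naive background: p = {p_naive:.2e}")
print(f"local background: p = {p_local:.2e}")
```

With the naive background, the promoter appears to carry a significant excess of mutations; once the locally elevated rate is included in the expectation, the same count is no longer surprising.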

Several DNA-damaging agents have been used as chemotherapies against cancer for decades. However, we still lack a basic understanding of their effect on the genome of the healthy cells of treated patients. Moving forward, understanding how chemotherapy-induced DNA damage and repair processes are affected by local chromatin features will be key to identifying tumor mutations that elicit resistance to these agents and to understanding secondary malignancies related to treatment. Furthermore, somatic mutations are relevant not only in cancer, but also in aging and several related diseases. As NGS applications are refined and expanded to identify somatic mutations in cells from non-tumoral tissues, including those of healthy donors and other diseased tissues, we will be able to study the interplay between endogenous and extrinsic DNA-damaging agents and DNA repair across different healthy and unhealthy tissues. To carry out this analysis, we will need to take into account the influence of local features, including their cell-type specificity, on the generation of variants.

In conclusion, the interaction of DNA with certain local chromatin features has a strong influence on how nucleotides are damaged and repaired at the local level, which ultimately results in different mutation probabilities. This has implications for our understanding of basic cellular processes and for evolutionary and disease studies.