We use in situ Hi-C to probe the 3D architecture of genomes, constructing haploid and diploid maps of nine cell types. The densest, in human lymphoblastoid cells, contains 4.9 billion contacts, achieving 1 kb resolution. We find that genomes are partitioned into contact domains (median length, 185 kb), which are associated with distinct patterns of histone marks and segregate into six subcompartments. We identify ∼10,000 loops. These loops frequently link promoters and enhancers, correlate with gene activation, and show conservation across cell types and species. Loop anchors typically occur at domain boundaries and bind CTCF. CTCF sites at loop anchors occur predominantly (>90%) in a convergent orientation, with the asymmetric motifs “facing” one another. The inactive X chromosome splits into two massive domains and contains large loops anchored at CTCF-binding repeats.

Here, we report the results of an effort to comprehensively map chromatin contacts genome-wide, using in situ Hi-C, in which DNA-DNA proximity ligation is performed in intact nuclei. The protocol facilitates the generation of much denser Hi-C maps. The maps reported here comprise over 5 Tb of sequence data recording over 15 billion distinct contacts, an order of magnitude larger than all published Hi-C data sets combined. Using these maps, we are able to clearly discern domain structure, compartmentalization, and thousands of chromatin loops. In addition to haploid maps, we were also able to create diploid maps analyzing each chromosomal homolog separately. The maps provide a picture of genomic architecture with resolution down to 1 kb.

To interrogate all loci at once, we developed Hi-C, which combines DNA proximity ligation with high-throughput sequencing in a genome-wide fashion (). We used Hi-C to demonstrate that the genome is partitioned into numerous domains that fall into two distinct compartments. Subsequent analyses have suggested the presence of smaller domains and have led to the important proposal that compartments are partitioned into condensed structures ∼1 Mb in size, dubbed “topologically associated domains” (TADs) (). In principle, Hi-C could also be used to detect loops across the entire genome. To achieve this, however, extremely large data sets and rigorous computational methods are needed. Recent efforts have suggested that this is an increasingly plausible goal ().

Various methods have emerged to assess the 3D architecture of the nucleus. In one seminal study, the binding of a protein to sites at opposite ends of a restriction fragment created a loop, which was detectable because it promoted the formation of DNA circles in the presence of ligase. Removal of the protein or either of its binding sites disrupted the loop, eliminating this “cyclization enhancement” (). Subsequent adaptations of cyclization enhancement made it possible to analyze chromatin folding in vivo, including nuclear ligation assay () and chromosome conformation capture (), which analyze contacts made by a single locus, extensions such as 5C for examining several loci simultaneously (), and methods such as ChIA-PET for examining all loci bound by a specific protein ().

The spatial organization of the human genome is known to play an important role in the transcriptional control of genes (). Yet important questions remain, like how distal regulatory elements, such as enhancers, affect promoters, and how insulators can abrogate these effects (). Both phenomena are thought to involve the formation of protein-mediated “loops” that bring pairs of genomic sites that lie far apart along the linear genome into proximity ().

Like the peak loci of most other loops, nearly all the superloop anchors bind CTCF (23 of 24). The six anchor regions most frequently associated with superloops are large (up to 200 kb). Four of these anchor regions contain whole long noncoding RNA (lncRNA) genes: loc550643, XIST, DXZ4, and FIRRE. Three (loc550643, DXZ4, and FIRRE) contain CTCF-binding tandem repeats that only bind CTCF on the inactive homolog.

There were also significant differences in loop structure between the chromosome X homologs. We observed 27 large “superloops,” each spanning between 7 and 74 Mb, present only on the inactive X chromosome in the diploid map ( Figure 7 E). The superloops were also seen in all four unphased maps from karyotypically normal XX cells, but were absent in unphased maps from X0 and XY cells ( Figure S7 I). Two of the superloops (chrX:56.8 Mb-DXZ4 and DXZ4-130.9 Mb) were reported previously in a locus-specific study ().

Interestingly, the boundary between the superdomains (ChrX: 115 Mb ± 500 kb) lies near the macrosatellite repeat DXZ4 (ChrX: 114,867,433–114,919,088) near the middle of Xq. DXZ4 is a CpG-rich tandem repeat that is conserved across primates and monkeys and encodes a long noncoding RNA. In males and on the active X, DXZ4 is heterochromatic, hypermethylated and does not bind CTCF. On the inactive X, DXZ4 is euchromatic, hypomethylated, and binds CTCF. DXZ4 has been hypothesized to play a role in reorganizing chromatin during X inactivation ().

Pronounced differences were seen on the diploid intrachromosomal maps of chromosome X. The paternal X chromosome, which is usually inactive in GM12878, is partitioned into two massive domains (0–115 Mb and 115–155.3 Mb). These “superdomains” are not seen in the active, maternal X ( Figure 7 D). When we examined the unphased maps of chromosome X for the karyotypically normal female cell lines in our study (GM12878, IMR90, HMEC, NHEK), the superdomains on X were evident, although the signal was attenuated due to the superposition of signals from active and inactive X chromosomes. When we examined the male HUVEC cell line and the haploid KBM7 cell line, we saw no evidence of superdomains ( Figure S7 G).

We also observed differences in loop structure between homologous autosomes at some imprinted loci. For instance, the H19/Igf2 locus on chromosome 11 is a well-characterized case of genomic imprinting. In our unphased maps, we clearly see two loops from a single distal locus at 1.72 Mb (that binds CTCF in the forward orientation) to loci located near the promoters of both H19 and Igf2 (both of which bind CTCF in the reverse orientation, i.e., the above consensus motif lies on the opposite strand; see Figure 7 C). We refer to this distal locus as the H19/Igf2 Distal Anchor Domain (HIDAD). Our diploid maps reveal that the loop to the H19 region is present on the maternal chromosome (from which H19 is expressed), but the loop to the Igf2 region is absent or greatly attenuated. The opposite pattern is found on the paternal chromosome (from which Igf2 is expressed).

For autosomes, the maternal and paternal homologs exhibit very similar inter- and intrachromosomal contact profiles (Pearson’s R > 0.998). One interchromosomal difference was notable: an elevated contact frequency between the paternal homologs of chromosome 6 and 11 that is consistent with an unbalanced translocation fusing chr11q:73.5 Mb and all distal loci (a stretch of over 60 Mb) to the telomere of chromosome 6p ( Figures 7 B and S7 B). The signal intensity suggests that the translocation is present in between 1.2% and 5.6% of our cells ( Extended Experimental Procedures ). We tested this prediction by karyotyping 100 GM12878 cells using Giemsa staining and found three abnormal chromosomes, each showing the predicted translocation, der(6)t(6,11)(pter;q) ( Figures S7 C–S7F). The Hi-C data reveal that the translocation involves the paternal homologs, which cannot be determined with ordinary cytogenetic methods.

(I) The large panel on the right shows the unphased GM12878 (XX) chromosome X contact map with several long-range superloops. Five of the main anchors involved in the superloop network are indicated on the axes (LOC550643, XIST, x75.35, DXZ4, and FIRRE). Superloops are marked by squares. Blowouts show two of the loops in more detail in various maps (LOC550643-DXZ4 in blue and x75.35-DXZ4 in green). Maps from IMR90 (female cell line, has inactive X) as well as the phased inactive X in GM12878 (paternal homolog) show evidence of superlooping, while maps from KBM7 (a haploid cell line), HUVEC (a male cell line), and the phased active X in GM12878 (maternal homolog), all of which lack an inactive X, do not exhibit these superloops.

(G and H) Unphased contact maps for chromosome X in the four human female cell lines examined in this paper (GM12878, IMR90, HMEC, NHEK) exhibit the trace of the two superdomains, due to the contribution of the inactive X chromosome (G). Conversely, neither KBM7 (haploid cell line) nor HUVEC (male cell line) has an inactive X chromosome, and neither map displays the superdomains (H).

(E and F) Spectral karyotyping (SKY) confirms the existence of the Chr11/Chr6 translocation identified via Hi-C. A normal (above) and affected karyotype (below) are shown. Blowouts of chromosomes 6 and 11 are shown in (F). The nonreciprocal translocation is indicated. A trisomy of chromosome 16 is also seen in the cell with the translocation.

(C and D) Giemsa staining of GM12878 confirms the translocation identified through the Hi-C map. Karyotype (a) is normal, karyotypes (b)-(d) show all three cells with the translocation. Blowouts of chromosomes 6 and 11 from the above karyotypes are shown in (D) for the normal karyotype (a) and for the three abnormal cells (b-d). The affected regions of chromosomes 6 and 11 are marked in black.

(B) The interchromosomal diploid contact matrix between paternal chromosome 11 and paternal chromosome 6 shows a strong enrichment between chr6:0-1 Mb and chr11:72-74 Mb. The enrichment is not present in the contact matrix between maternal chromosome 11 and maternal chromosome 6 (nor in the other 6/11 homolog pairs). Far right: we divide the 6/11 paternal-paternal contact matrix by the 6/11 maternal-maternal contact matrix. The enrichment in the chr6:0-1 Mb; chr11:72-74 Mb region shows up as bright red (values larger than 1). These maps suggest an unbalanced translocation between the paternal homologs of chromosome 11 and chromosome 6 and specifically suggest that the locus on 11q is fused to the distal end of 6p.

Because many of our reads overlap SNPs, it is possible to use GM12878 phasing data () to assign contacts to specific chromosomal homologs ( Figure 7 A; Table S8 ). Using these assignments, we constructed a “diploid” Hi-C map of GM12878 comprising both maternal (238 M contacts) and paternal (240 M) maps.

(E) The “superloop” between FIRRE and DXZ4 is present in the unphased GM12878 map (top), in the paternal GM12878 map (middle right), and in the map of the female cell line IMR90 (bottom right); it is absent from the maternal GM12878 map (middle left) and the map of the male HUVEC cell line (bottom left). Contact matrices are shown at 50 kb resolution.

(D) The inactive (paternal) copy of chromosome X (bottom) is partitioned into two massive “superdomains” not seen in the active (maternal) copy (top). DXZ4 lies at the boundary. Contact matrices are shown at 500 kb resolution.

(C) Top: in our unphased Hi-C map of GM12878, we observe two loops joining both the promoter of the maternally-expressed H19 and the promoter of the paternally-expressed Igf2 to a distal locus, HIDAD. Using diploid Hi-C maps, we phase these loops: the HIDAD-H19 loop is present only on the maternal homolog (left) and the HIDAD-Igf2 loop is present only on the paternal homolog (right).

In mouse, we found that 7% of peak anchors lie within SINEB2 repeat elements containing a CTCF motif, which has been exapted to be functional. (The spread of CTCF binding via retrotransposition of this element, which contains a CTCF motif in its consensus sequence, has been documented in prior studies [].) The CTCF motifs at peak anchors in SINEB2 elements show the same strong bias toward convergent orientation seen throughout the genome (89% are oriented toward the opposing loop anchor versus 94% genome-wide). The orientation of these CTCF motifs is aligned with the orientation of the SINEB2 consensus sequence in 97% of cases. This suggests that exaptation of a CTCF in a SINEB2 element is more likely when the orientation of the inserted SINEB2 is compatible with local loop structure.

In addition, the presence of CTCF and RAD21 sites at many of our peaks provides an opportunity to compare our results to three recent ChIA-PET experiments reported by the ENCODE Consortium (in GM12878 and K562) in which ligation junctions bound to CTCF (or RAD21) were isolated and analyzed. We found strong concordance with our results in all three cases () ( Extended Experimental Procedures ).

The specific orientation of CTCF sites at observed peaks provides evidence that our peak calls are biologically correct. Because randomly chosen CTCF pairs would exhibit each of the four orientations with equal probability, the near-perfect association between our loop calls and the convergent orientation could not occur by chance (p < 10 −1,900 , binomial distribution).

The observation that looped CTCF sites occur in the convergent orientation also allows us to analyze peak loci containing multiple CTCF-bound motifs to predict which motif instance plays a role in a given loop. In this way, we can associate nearly two-thirds of peak loci (8,175 of 12,903, or 63.4%) with a single CTCF-binding motif.

If CTCF sites were randomly oriented, one would expect all four orientations to occur equally often. But when we examined the 4,322 peaks in GM12878 where the two corresponding peak loci each contained a single CTCF-binding motif, we found that the vast majority (92%) of motif pairs are convergent ( Figures 6 D and 6E). Overall, the presence, at pairs of peak loci, of bound CTCF sites in the convergent orientation was enriched 102-fold over random expectation ( Extended Experimental Procedures ). The convergent orientation was overwhelmingly more frequent than the divergent orientation, despite the fact that divergent motifs also lie on opposing strands: in GM12878, the counts were 3,971-78 (51-fold enrichment, convergent versus divergent); in IMR90, 1,456-5 (291-fold); in HMEC, 968-11 (88-fold); in K562, 723-2 (362-fold); in HUVEC, 671-4 (168-fold); in HeLa, 301-3 (100-fold); in NHEK, 556-9 (62-fold); and in CH12-LX, 625-8 (78-fold). This pattern suggests that a pair of CTCF sites in the convergent orientation is required for the formation of a loop.

The consensus DNA sequence for CTCF-binding sites is typically written as 5′-CCACNAGGTGGCAG-3′. Because the sequence is not palindromic, each CTCF motif has an orientation; we designate the consensus motif above as the “forward” orientation. Thus, a pair of CTCF sites on the same chromosome can have four possible orientations: (1) same direction on one strand, (2) same direction on the other strand, (3) convergent on opposite strands, and (4) divergent on opposite strands.

We found that most peak loci encompass a unique DNA site containing a CTCF-binding motif, to which all three proteins (CTCF, SMC3, and RAD21) were bound (5-fold enrichment). We were thus able to associate most of the peak loci (6,991 of 12,903, or 54%) with a specific CTCF-motif “anchor.”

We next wondered whether peaks are associated with specific proteins. We examined the results of 86 chromatin immunoprecipitation sequencing (ChIP-seq) experiments performed by ENCODE in GM12878. We found that the vast majority of peak loci are bound by the insulator protein CTCF (86%) and the cohesin subunits RAD21 (86%) and SMC3 (87%) ( Figure 6 C). This is consistent with numerous reports, using a variety of experimental modalities, that suggest a role for CTCF and cohesin in mediating DNA loops (). Because many of our loops demarcate domains, this observation is also consistent with studies suggesting that CTCF delimits structural and regulatory domains ().

We also found that overlapping loops are strongly disfavored: pairs of loops L1-L3 and L2-L4 (where L1, L2, L3 and L4 occur consecutively in the genome) are found 4-fold less often than expected under a random model ( Extended Experimental Procedures ).

In some cases, adjacent loop domains (bounded by peak loci L1-L2 and L2-L3, respectively) exhibit transitivity—that is, L1 and L3 also correspond to a peak. This may indicate that the three loci simultaneously colocate at a single spatial position. However, many peaks do not exhibit transitivity, suggesting that the corresponding loci do not colocate. Figure 6 B shows a region on chromosome 4 exhibiting both configurations.

A large fraction of peaks (38%) coincide with the corners of a contact domain—that is, the peak loci are located at domain boundaries ( Figures 6 A and S6 ). Conversely, a large fraction of domains (39%) had peaks in their corner. Moreover, the appearance of a loop is usually (in 65% of cases) associated with the appearance of a domain demarcated by the loop. Because this configuration is so common, we use the term “loop domain” to refer to contact domains whose endpoints form a chromatin loop.

(G–N) Peak annotations across many cell types further support the strong association between CTCF/Cohesin and peak loci. Here, we repeat the analysis presented in Figure 6 C for all of our peak annotations. All enrichments shown here are normalized enrichments (average protein enrichment normalized to unity). (G) Protein enrichments on GM12878 peak loci are shown again, this time including the results of 10 ENCODE ChIP-seq experiments for histone modifications as well as DNase Hypersensitivity Site (DHS) calls. DHSs are present on ∼90% of peak loci. (H-N) CTCF is present in 84% of peak loci on average across all of our cell types (>90%, if ChIP-seq peaks called in any replicate are included); RAD21 is present in 87% of peak loci on average.

(A–F) We repeat the analysis presented in Figure 6 A and show that our peak annotations in non-GM12878 cell types also tend to demarcate domains. We plot histograms of the corner score for peak pixels in a given cell type and for a control group of randomly-chosen pixels with an identical distance distribution.

(F) A schematic rendering of a 2.1 Mb region on chromosome 20 (48.78–50.88 Mb). Eight domains tile the region, ranging in size from 110 kb to 450 kb; 95% of the region is contained inside a domain (contour lengths are shown to scale). Six of the eight domains are demarcated by loops between convergent CTCF-binding sites located at the domain boundaries. The other two domains are not demarcated by loops. The motif orientation is indicated by the direction of the arrow. Note that not every CTCF-binding site is shown.

Occasionally, gene activation is accompanied by the emergence of a cell-type-specific network of peaks. Figure 5 D illustrates the case of ADAMTS1, which encodes a protein involved in fibroblast migration. The gene is expressed in IMR90, where its promoter is involved in six loops. In GM12878, it is not expressed, and the promoter is involved in only two loops. Many of the IMR90 peak loci form transitive peaks with one another (see discussion of “transitivity” below), suggesting that the ADAMTS1 promoter and the six distal sites may all be located at a single spatial hub.

Third, the presence of cell type-specific peaks is associated with changes in expression. When we examined RNA sequencing (RNA-seq) data produced by ENCODE, we found that the appearance of a loop in a cell type was frequently accompanied by the activation of a gene whose promoter overlapped one of the peak loci. For example, a cell-type-specific loop is anchored at the promoter of the gene encoding L-selectin (SELL), which is expressed in GM12878 (where the loop is present), but not in IMR90 (where the loop is absent, Figure 5 B). Genome-wide, we observed 557 loops in GM12878 that were clearly absent in IMR90. The corresponding peak loci overlapped the promoters of 43 genes that were markedly upregulated (>50-fold) in GM12878, but of only one gene that was markedly upregulated in IMR90. Conversely, we found 510 loops in IMR90 that were clearly absent in GM12878. The corresponding peak loci overlapped the promoters of 94 genes that were markedly upregulated in IMR90, but of only three genes that were markedly upregulated in GM12878. When we compared GM12878 to the five other human cell types for which ENCODE RNA-seq data were available, the results were very similar ( Figure 5 C; Table S7 ).

First, our peaks frequently have a known promoter at one peak locus (as annotated by ENCODE’s ChromHMM) () and a known enhancer at the other ( Figure 5 A). For instance, 2,854 of the 9,448 peaks in our GM12878 map bring together known promoters and known enhancers (30% versus 7% expected by chance). The peaks include classic promoter-enhancer loops, such as at MYC (chr8:128.35–128.75 Mb, in HMEC) and alpha-globin (chr16:0.15–0.22 Mb, in K562). Second, genes whose promoters are associated with a loop are much more highly expressed than genes whose promoters are not associated with a loop (6-fold).

(D) Left: two loops in GM12878 are anchored at the promoter of the inactive ADAMTS1 gene. Right: a series of loops and domains appear, along with transitive looping. ADAMTS1 is on.

(C) Genes whose promoters participate in a loop in GM12878 but not in a second cell type are frequently upregulated in GM12878 and vice versa.

(B) Left: a loop in GM12878, with one anchor at the SELL promoter and the other at a distal enhancer. The gene is on. Right: the loop is absent in IMR90, where the gene is off.

Next, we compared peaks across species. In CH12-LX mouse B-lymphoblasts, we identified 2,927 high-confidence contact domains and 3,331 peaks. When we examined orthologous regions in GM12878, we found that 50% of peaks and 45% of domains called in mouse were also called in humans. This suggests substantial conservation of 3D genome structure across the mammals ( Figures 4 B–4E).

We found that peaks were often conserved across cell types ( Figure 4 A): between 55% and 75% of the peaks found in any given cell type were also found in GM12878 ( Figure S5 D).

We also identified peaks in the other seven human cell lines ( Table S1 ). Because these maps contain fewer contacts, sensitivity is reduced, and fewer peaks are observed (ranging from 2,634 to 8,040). APA confirmed that these peak calls were consistent with the dilution Hi-C maps reported here (in IMR90, HMEC, HUVEC, and NHEK), as well as with all previously published Hi-C maps in these cell types () ( Data S2 , I.F).

Finally, we demonstrated that the peaks observed were robust to particular protocol conditions by performing APA on our GM12878 dilution Hi-C map and on our 112 supplemental Hi-C experiments exploring a wide range of protocol variants. Enrichment was seen in every experiment. Notably, these include five experiments (HIC043-HIC047; Table S1 ) in which the Hi-C protocol was performed without crosslinking, demonstrating that the peaks observed in our experiments cannot be byproducts of the formaldehyde-crosslinking procedure.

We also confirmed that our list of peaks was consistent with previously published Hi-C maps. Although earlier maps contained too few contacts to reliably call individual peaks, we developed a method called Aggregate Peak Analysis (APA) that compares the aggregate enrichment of our peak set in these low-resolution maps to the enrichment seen when our peak set is translated in any direction ( Experimental Procedures ). APA showed strong consistency between our loop calls and all six previously published Hi-C experiments in lymphoblastoid cell lines () ( Figure 3 D; Data S2 , I.E; Table S6 ).

To independently confirm that peak loci are closer than neighboring locus pairs, we performed 3D-FISH () on four loops ( Table S5 ). In each case, we compared two peak loci, L1 and L2, with a control locus, L3, that lies an equal genomic distance away from L2 but on the opposite side ( Figures 3 C and S5 B). In all cases, the 3D-distance between L1 and L2 was consistently shorter than the 3D-distance between L2 and L3 ( Extended Experimental Procedures ).

(D) Venn diagrams showing the overlap of our GM12878 annotation of peaks with our annotation of peaks in any other given cell type. On average, 63% of peaks called in one of our non-GM12878 cell types are present in GM12878 as well.

(C) Number of loops called on chromosome 20 (GM12878, primary+replicate data set) after subsampling the data indicate that most loops are recovered with only 60% of the data. Solid line: Lowess smoothing.

(B) Cumulative distributions of the distance between the peak loci L1 and L2 (blue), and between one peak locus L2 and a control locus L3 (red) as measured by 3D-FISH. Along the linear genome L2 is midway between L1 and L3, but the 3D-distance between L1 and L2 (peak loci) is consistently smaller than the 3D-distance between L2 and the control locus, L3, indicating that peak loci have an enhanced tendency for close spatial proximity. See Table S5 for probe locations.

(A) Alternate data visualization scheme, in which the number of contacts at a pixel is represented as the height in a 3D map. Peaks in 2D contact maps appear as sharp apexes in the corresponding 3D map.

These findings were reproducible across all of our high-resolution Hi-C maps. Examining the primary and replicate maps separately, we found 8,054 peaks in the former and 7,484 peaks in the latter, with 5,403 in both lists (see Figures 3 A and 3B; Data S1 , V; Table S4 ). The differences were almost always the result of our conservative peak-calling criteria ( Extended Experimental Procedures ). We also called peaks using our GM12878 dilution Hi-C experiment. Because the map is sparser and thus noisier, we called only 3,073 peaks. Nonetheless, 65% of these peaks were also present in the list of peaks from our in situ Hi-C data set, again reflecting high interreplicate reproducibility.

Our algorithm detected 9,448 peaks in the in situ Hi-C map for GM12878 at 5 kb matrix resolution. These peaks are associated with a total of 12,903 distinct peak loci (some peak loci are associated with more than one peak). The vast majority of peaks (98%) reflected loops between loci that are <2 Mb apart.

We next sought to identify the positions of chromatin loops by using an algorithm to search for pairs of loci that show significantly closer proximity with one another than with the loci lying between them ( Figure 3 A). Such pairs correspond to pixels with higher contact frequency than typical pixels in their neighborhood. We refer to these pixels as “peaks” in the Hi-C contact matrix and to the corresponding pair of loci as “peak loci.” Peaks reflect the presence of chromatin loops, with the peak loci being the anchor points of the chromatin loop. (Because contact frequencies vary across the genome, we define peak pixels relative to the local background. We note that some papers [] have sought to define peaks relative to a genome-wide average. This choice is problematic because, for example, many pixels within a domain may be reported as peaks despite showing no locally distinctive proximity; see Discussion .)

(D) APA plot shows the aggregate signal from the 9,448 GM12878 loops we report by summing submatrices surrounding each peak in a low-resolution GM12878 Hi-C map due to. Although individual peaks cannot be seen in thedata (that contains 42 M contacts), the peak at the center of the APA plot indicates that the aggregate signal from our peak set as a whole can be clearly discerned using their data set.

(A) We identify peaks by detecting pixels that are enriched with respect to four local neighborhoods (blowout): horizontal (blue), vertical (green), lower-left (yellow), and donut (black). These “peak” pixels indicate the presence of a loop and are marked with blue circles (radius = 20 kb) in the lower-left of each heatmap. The number of raw contacts at each peak is indicated. Left: primary GM12878 map; Right: replicate; annotations are completely independent. All contact matrices in this and subsequent figures are 10 kb resolution unless noted.

Upon closer visual examination, we noticed the presence of a sixth pattern on chromosome 19 ( Figure 2 F). Our genome-wide clustering algorithm missed this pattern because it spans only 11 Mb, or 0.3% of the genome. When we repeated the algorithm on chromosome 19 alone, the additional pattern was detected. Because this sixth pattern correlates with the Compartment B pattern, we labeled it B4. Subcompartment B4 comprises a handful of regions, each of which contains many KRAB-ZNF superfamily genes. (B4 contains 130 of the 278 KRAB-ZNF genes in the genome, a 65-fold enrichment). As noted in previous studies (), these regions exhibit a highly distinctive chromatin pattern, with strong enrichment for both activating chromatin marks, such as H3K36me3, and heterochromatin-associated marks, such as H3K9me3 and H4K20me3.

The other three interaction patterns (labeled B1, B2, and B3) are correlated with loci in compartment B ( Figure S4 E) and show very different properties. Subcompartment B1 correlates positively with H3K27me3 and negatively with H3K36me3, suggestive of facultative heterochromatin ( Figures 2 D and 2E). Replication of this subcompartment peaks during the middle of S phase. Subcompartments B2 and B3 tend to lack all of the above-noted marks and do not replicate until the end of S phase (see Figure 2 D). Subcompartment B2 includes 62% of pericentromeric heterochromatin (3.8-fold enrichment) and is enriched at the nuclear lamina (1.8-fold) and at NADs (4.6-fold). Subcompartment B3 is enriched at the nuclear lamina (1.6-fold), but strongly depleted at NADs (76-fold).

Two of the five interaction patterns are correlated with loci in compartment A ( Figure S4 E). We label the loci exhibiting these patterns as belonging to subcompartments A1 and A2. Both A1 and A2 are gene dense, have highly expressed genes, harbor activating chromatin marks such as H3K36me3, H3K79me2, H3K27ac, and H3K4me1 and are depleted at the nuclear lamina and at nucleolus-associated domains (NADs) ( Figures 2 D, 2E, and S4 I; Table S3 ). While both A1 and A2 exhibit early replication times, A1 finishes replicating at the beginning of S phase, whereas A2 continues replicating into the middle of S phase. A2 is more strongly associated with the presence of H3K9me3 than A1, has lower GC content, and contains longer genes (2.4-fold).

When we analyzed the data at low matrix resolution (1 Mb), we reproduced our earlier finding of two compartments (A and B). At high resolution (25 kb), we found evidence for at least five “subcompartments” defined by their long-range interaction patterns, both within and between chromosomes. These findings expand on earlier reports suggesting three compartments in human cells (). We found that the median length of an interval lying completely within a subcompartment is 300 kb. Although the subcompartments are defined solely based on their Hi-C interaction patterns, they exhibit distinct genomic and epigenomic content.

Next, we partitioned loci into categories based on long-range contact patterns alone, using four independent approaches: manual annotation and three unsupervised clustering algorithms (HMM, K-means, Hierarchical). All gave similar results ( Figure S4 B; Extended Experimental Procedures ). We then investigated the biological meaning of these categories.

(H) The percentage of transitions that occur from a cluster (indexed by the rows) to another cluster (indexed by the columns). Rows sum to 100. For example, A1 most frequently transitions to B1.

(E and F) Cluster similarity, determined by the Spearman correlation of the one-dimensional derivative vectors. In the genome-wide annotation (E), B1, B2, and B3 have similar patterns, as their derivative vectors all correlate with each other, and A1 and A2 have similar pattern, as they correlate with each other. A1 and A2 anticorrelate with B1, B2, and B3. (F) For Chr19, the derivative of the Chr 19 O/E matrix was calculated, and then collapsed onto a matrix of four rows, where each row represented a cluster and contained the mean derivative for that cluster. Pearson correlations of these derivatives were calculated for each pair of clusters. B4 is more closely correlated with B2 and B1 than it is with A1, suggesting a B-type pattern.

(D) Contact enrichment between the two k = 5 clustering assignments (clustering was performed separately on odd and even chromosomes). The mean number of contacts for each pair of clusters is divided by the expected number of contacts (calculated from the average of 100 shuffled clustering annotations). Each cluster on the odd chromosomes preferentially interacts with only one cluster on the even chromosomes (and vice versa); we merged each strongly interacting pair of clusters into a single cluster. (Data were normalized using Interchromosomal KR.)

(C) The total number of base pairs assigned to each of the clusters (cluster coverage) and the mean size of a contiguous cluster group (adjacent loci that share the same cluster assignment). Median sizes are shown with horizontal lines.

(A and B) We show the (A) Akaike Information Criterion and the Bayes Information Criterion (which differed only slightly) for the HMM clustering as the number of clusters is varied. While no sharp transition is seen, the curve suggests that choosing k between 4 and 8 is appropriate. (B) When we clustered our interchromosomal matrix using K-means or hierarchical clustering instead of an HMM, the results were highly similar. Here we show the percentage of each of the 5 K-means/hierarchical clusters that overlapped the HMM clusters (comparing the even K-means/hierarchical clustering to the even HMM clusters).

Loci within a contact domain show correlated histone modifications for eight different factors (H3K36me3, H3K27me3, H3K4me1, H3K4me2, H3K4me3, H3K9me3, H3K79me2, and H4K20me1) based on data from the ENCODE project in GM12878 cells (). By contrast, loci at comparable distance but residing in different domains showed much less correlation in chromatin state ( Figures 2 B, S2 I, and S2K; Extended Experimental Procedures ). Strikingly, changes in a domain’s chromatin state are often accompanied by changes in the long-range contact pattern of domain loci (i.e., the pattern of contacts between loci in the domain and other loci genome-wide), indicating that changes in chromatin pattern are accompanied by shifts in a domain’s nuclear neighborhood ( Figures 2 C and S3 C–S3E; Extended Experimental Procedures ). This observation is consistent with microscopy studies associating changes in gene expression with changes in nuclear localization ().

The presence of smaller domains in Hi-C maps is consistent with several other recent studies (). We explore the relationship between the domains we annotate and those annotated in prior studies in the Discussion

In our new, higher resolution maps (200- to 1,000-fold more contacts), we observe many small squares of enhanced contact frequency that tile the diagonal of each contact matrix ( Figure 2 A). We used the Arrowhead algorithm (see Experimental Procedures ) to annotate these contact domains genome-wide. The observed domains ranged in size from 40 kb to 3 Mb (median size 185 kb). As with megadomains, there is an abrupt drop in contact frequency (33%) for pairs of loci on opposite sides of the domain boundary ( Figure S2 G). Contact domains are often preserved across cell types ( Figures S3 A and S3B).

(E) We show two domains between 147.5 Mb and 149 Mb on chromosome 6 that are conserved between GM12878 and IMR90. In GM12878, both domains are marked with H3K27me3 and share a pattern of long-range contacts. In IMR90, the right domain (gray highlight) is instead marked with H3K36me3 and depleted for H3K27me3 while the left domain remains enriched for H3K27me3. Correspondingly, the long-range patterns of the two domains are anticorrelated.

(D) We show three domains between 15.5 Mb and 16.7 Mb on chromosome 3 that are conserved between GM12878 and IMR90. In GM12878, the central domain (gray highlight) is marked with H3K27me3 while the flanking domains are both marked with H3K36me3. The long-range patterns of the two flanking domains are correlated with each other and anticorrelated with that of the central domain. In IMR90, H3K36me3 spreads over the central domain, such that now all three domains are marked with H3K36me3, and the long-range pattern of the central domain changes to match that of the flanking domains.

(C) We examined a set of 153 conserved domain boundaries between GM12878 and IMR90 where in one cell type the level of H3K36me3 was comparable across the boundary (concordant) and in the other cell type, H3K36me3 was enriched on one side of the boundary and depleted on the other (discordant). Correlations of long-range contacts across discordant boundaries are markedly lower than across concordant boundaries, indicating that changes in histone modifications correspond to changes in the chromatin neighborhood.

(B) Domains that exhibit different patterns of histone modifications between cell types are still strongly conserved. For each cell line, we plot the cumulative distribution of the corner scores for four sets of domains: actual domains called in that cell line (blue), domains called in GM12878 (red), domains called in GM12878 that differ in H3K36me3 decoration between GM12878 and the cell line (“redecorated domains,” yellow), and random “domains” (maintaining the same size distribution as the GM12878 domain annotation, black). The median corner score for a domain called in GM12878 in another cell type corresponds to the 95 th percentile of random domains. The distribution of corner scores for GM12878 domains with changing histone marks is indistinguishable from the distribution of scores for all GM12878 domains.

(A) Venn diagrams showing the overlap of our GM12878 domain annotation with our domain annotations in any other given cell type. On average, 54% of domains called in one of our non-GM12878 cell types are present in GM12878.

(L) Same as (K), except we show the correlation matrix for 11 chromatin marks in relation to random domains (instead of true domains), in our in situ GM12878 map. No sharp changes occur at domain boundaries.

(K) Correlation matrices for 11 chromatin marks in relation to domain boundaries in our in situ GM12878 map. Domains were divided into 10 equally sized bins, and the bins were extended 10 to the left and to the right of the domain borders (with the bin size kept the same both inside and outside of the domains). For each mark, we calculated its 30 by 30 correlation matrix in these regions (the middle 10 by 10 values of this matrix indicate correlations of the mark at loci within the domain). Marks are correlated within domains, but the correlation sharply drops at domain borders. We also show the correlation matrices for all pairs of epigenetic marks in relation to domain boundaries, and find strong correlation or anticorrelation of marks within domains that drops at domain boundaries. We combine all of the correlation matrices into one large image. Domain boundaries are marked by black tick marks; red tick marks separate the different epigenetic marks.

(J) Correlation of supercoiling within domains, using data from Naughton et al. (2013), see Extended Experimental Procedures . Loci within the same domain show higher correlations than loci in different domains, in contrast to the smooth correlation matrix (right) calculated for randomly placed domains.

(I) Spearman correlation coefficient of various chromatin marks for loci separated by 50 kb. Loci in the same domain (top row) show higher correlations for chromatin marks than loci in different domains (bottom row).

(H) The cumulative distributions of the long-range gradient score G; higher values of G indicate greater differences in the long-range patterns of neighboring loci. The gradient score is enriched at domain boundaries (solid red), relative to random domain boundaries (solid gray), implying that pattern switches happen at domain boundaries. The interior of domains (dashed red) is depleted for long-range pattern changes relative to random domain interiors (dashed gray).

(G) The ratio of the mean interdomain contacts to mean intradomain contacts at various distances; contacts between loci in different domains are depleted, accounting for the drop in contacts we observe at domain boundaries.

(D–F) Same as on the left, except we apply the Arrowhead algorithm to a 1.75 Mb region in chromosome 14 in our in situ GM12878 map (D). Domains are transformed into arrowheads (or pairs of red and blue triangles) in the Arrowhead matrix (E). The Score matrix (obtained using T 1 = 0.2 and T 2 = 0.5) identifies areas of high corner likelihoods and the maxima are marked as domain corners (F). Our domain calls for this region are shown in black in (D).

(A–C) We apply the Arrowhead algorithm to a simulated contact map (A); the result is the Arrowhead matrix (B), where the domain has been transformed into a pair of triangles of opposite signs, named U and L. Using a heuristic scoring algorithm (with thresholds T 1 = 0.5 and T 2 = 0.5) we obtain the Score matrix (C), where the pixels located in the corner of the contact matrix appear as a patch of high values. The pixel with the maximum score in this patch is annotated as the domain corner, which in this case is exactly the corner of the simulated domain.

(C) Conserved contact domains on chromosome 3 in GM12878 (left) and IMR90 (right). In GM12878, the highlighted domain (gray) is enriched for H3K27me3 and depleted for H3K36me3. In IMR90, the situation is reversed. Marks at flanking domains are the same in both: the domain to the left is enriched for H3K36me3 and the domain to the right is enriched for H3K27me3. The flanking domains have long-range contact patterns that differ from one another and are preserved in both cell types. In IMR90, the highlighted domain is marked by H3K36me3 and its long-range contact pattern matches the similarly-marked domain on the left. In GM12878, it is decorated with H3K27me3, and the long-range pattern switches, matching the similarly-marked domain to the right. Diagonal submatrices, 10 kb resolution; long-range interaction matrices, 50 kb resolution.

(A) We annotate thousands of domains across the genome (left, black highlight). To do so, we define an arrowhead matrix A (right) such that A= (M∗– M∗)/(M∗+ M∗), where M∗ is the normalized contact matrix. This transformation replaces domains with an arrowhead-shaped motif pointing toward the domain’s upper-left corner (example in yellow); we identify these arrowheads using dynamic programming. See Experimental Procedures

We also found that individual 1 Mb loci could be assigned to one of two long-range contact patterns, which we called compartments A and B, with loci in the same compartment showing more frequent interaction. Megadomains—and the associated squares along the diagonal—arise when all of the 1 Mb loci in an interval exhibit the same genome-wide contact pattern. Compartment A is highly enriched for open chromatin; compartment B is enriched for closed chromatin ().

We began by probing the 3D partitioning of the genome. In our earlier experiments at 1 Mb map resolution (), we saw large squares of enhanced contact frequency tiling the diagonal of the contact matrices. These squares partitioned the genome into 5–20 Mb intervals, which we call “megadomains.”

Adequate tools for visualization of these large data sets are essential. We have therefore created the “Juicebox” visualization system that enables users to explore contact matrices, zoom in and out, compare Hi-C matrices to 1D tracks, superimpose all features reported in this paper onto the data, and contrast different Hi-C maps. All contact data and feature sets reported here can be explored interactively via Juicebox at http://www.aidenlab.org/juicebox/

To account for nonuniformities in coverage due to the number of restriction sites at a locus or the accessibility of those sites to cutting () we use a matrix-balancing algorithm due to Extended Experimental Procedures ).

We also performed 112 supplementary Hi-C experiments using three different protocols (in situ Hi-C, dilution Hi-C, and Tethered Conformation Capture) while varying a wide array of conditions such as extent of crosslinking, restriction enzyme, ligation volume/time, and biotinylated nucleotide. These include several in situ Hi-C experiments in which the formaldehyde crosslinking step was omitted, which demonstrate that the structural features we observe cannot be due to the crosslinking procedure. In total, 201 independent Hi-C experiments were successfully performed, many of which are presented in Data S1 and S2

When we used our original dilution Hi-C protocol to generate maps of GM12878, IMR90, HMEC, NHEK, HUVEC, and CH12-LX, we found that, as expected, in situ Hi-C maps were superior at high resolutions, but closely resembled dilution Hi-C at lower resolutions. For instance, our dilution map of GM12878 (3.2 billion contacts) correlated highly with our in situ map at 500, 50, and 25 kb resolutions (R > 0.96, 0.90, and 0.87, respectively) ( Data S1 , I; Figure S1 ).

(E–I) Correlation as a function of distance between our primary (isH1) and replicate (isH2) GM12878 in situ Hi-C experiments as well as our GM12878 dilution Hi-C (dH) experiment, shown at (E) 500 kb, (F) 50 kb, (G) 25 kb, (H) 5 kb, and (I) 1 kb resolutions. Our data are highly correlated: At 500 kb resolution, Pearson’s r > 0.95 even at distances greater than 100 megabases, and at 1 kb resolution, our data show strong correlations at short distances (up to 100 kb).

(D) The distribution of the four types of reads (“left,” “inner,” “outer,” “right”) as a function of distance for various restriction enzymes; the distance at which all read types are equally likely indicates the minimum meaningful contact distance. For HindIII and NcoI (6 bp restriction enzymes), this distance is ∼30 kb; for DpnII and MboI (4bp restriction enzymes), this distance is ∼3 kb. (Vertical lines drawn at where the four distributions converge to 25% +/− 1%.)

(A and B) Row sums on chromosome 1 subsequent to applying the coverage normalization techniques computed by our pipeline (top: vanilla, middle: square root, bottom: KR), performed on the intrachromosomal contact matrix at 1 Mb resolution (A) and 10 kb resolution (B). At low resolutions vanilla overcorrects low-coverage loci, while square root vanilla creates a matrix whose row and column sums are all approximately equal. KR normalization works at both low and high resolutions.

We constructed in situ Hi-C maps of nine cell lines in human and mouse ( Table S1 ). Whereas our original Hi-C experiments had a map resolution of 1 Mb, these maps have a resolution of 1 kb or 5 kb. Our largest map, in human GM12878 B-lymphoblastoid cells, contains 4.9 billion pairwise contacts and has a map resolution of 950 bp (“kilobase resolution”) ( Table S2 ). We also generated eight in situ Hi-C maps at 5 kb resolution, using cell lines representing all human germ layers (IMR90, HMEC, NHEK, K562, HUVEC, HeLa, and KBM7) as well as mouse B-lymphoblasts (CH12-LX) ( Table S1 ). Each map contains between 395 M and 1.1 B contacts.

A Hi-C map is a list of DNA-DNA contacts produced by a Hi-C experiment. By partitioning the linear genome into “loci” of fixed size (e.g., bins of 1 Mb or 1 kb), the Hi-C map can be represented as a “contact matrix” M, where the entry M i,j is the number of contacts observed between locus L i and locus L j . (A “contact” is a read pair that remains after we exclude reads that are duplicates, that correspond to unligated fragments, or that do not align uniquely to the genome.) The contact matrix can be visualized as a heatmap, whose entries we call “pixels.” An “interval” refers to a set of consecutive loci; the contacts between two intervals thus form a “rectangle” or “square” in the contact matrix. We define the “matrix resolution” of a Hi-C map as the locus size used to construct a particular contact matrix and the “map resolution” as the smallest locus size such that 80% of loci have at least 1,000 contacts. The map resolution is meant to reflect the finest scale at which one can reliably discern local features.

Our in situ Hi-C protocol combines our original Hi-C protocol (here called dilution Hi-C) with nuclear ligation assay (), in which DNA is digested using a restriction enzyme, DNA-DNA proximity ligation is performed in intact nuclei, and the resulting ligation junctions are quantified. Our in situ Hi-C protocol involves crosslinking cells with formaldehyde, permeabilizing them with nuclei intact, digesting DNA with a suitable 4-cutter restriction enzyme (such as MboI), filling the 5′-overhangs while incorporating a biotinylated nucleotide, ligating the resulting blunt-end fragments, shearing the DNA, capturing the biotinylated ligation junctions with streptavidin beads, and analyzing the resulting fragments with paired-end sequencing ( Figure 1 A). This protocol resembles a recently published single-cell Hi-C protocol (), which also performed DNA-DNA proximity ligation inside nuclei to study nuclear architecture in individual cells. Our updated protocol has three major advantages over dilution Hi-C. First, in situ ligation reduces the frequency of spurious contacts due to random ligation in dilute solution—as evidenced by a lower frequency of junctions between mitochondrial and nuclear DNA in the captured fragments and by the higher frequency of random ligations observed when the supernatant is sequenced ( Extended Experimental Procedures available online). This is consistent with a recent study showing that ligation junctions formed in solution are far less meaningful (). Second, the protocol is faster, requiring 3 days instead of 7 ( Extended Experimental Procedures ). Third, it enables higher resolution and more efficient cutting of chromatinized DNA, for instance, through the use of a 4-cutter rather than a 6-cutter ( Data S1 , I).

(D) Overview of features revealed by our Hi-C maps. Top: the long-range contact pattern of a locus (left) indicates its nuclear neighborhood (right). We detect at least six subcompartments, each bearing a distinctive pattern of epigenetic features. Middle: squares of enhanced contact frequency along the diagonal (left) indicate the presence of small domains of condensed chromatin, whose median length is 185 kb (right). Bottom: peaks in the contact map (left) indicate the presence of loops (right). These loops tend to lie at domain boundaries and bind CTCF in a convergent orientation.

(B) Contact matrices from chromosome 14: the whole chromosome, at 500 kb resolution (top); 86–96 Mb/50 kb resolution (middle); 94–95 Mb/5 kb resolution (bottom). Left: GM12878, primary experiment; Right: biological replicate. The 1D regions corresponding to a contact matrix are indicated in the diagrams above and at left. The intensity of each pixel represents the normalized number of contacts between a pair of loci. Maximum intensity is indicated in the lower left of each panel.

Discussion

Using the in situ Hi-C protocol, we probed genomic architecture with high resolution; in the case of GM12878 lymphoblastoid cells, better than 1 kb. We observe the presence of contact domains that were too small (median length = 185 kb) to be seen in previous maps. Loci within a domain interact frequently with one another, have similar patterns of chromatin modifications, and exhibit similar long-range contact patterns. Domains tend to be conserved across cell types and between human and mouse. When the pattern of chromatin modifications associated with a domain changes, the domain’s long-range contact pattern also changes. Domains exhibit at least six distinct patterns of long-range contacts (subcompartments), which subdivide the two compartments that we previously reported based on low resolution data. The subcompartments are each associated with distinct chromatin patterns. It is possible that the chromatin patterns play a role in bringing about the long-range contact patterns, or vice versa.

Our data also make it possible to create a genome-wide catalog of chromatin loops. We identified loops by looking for pairs of loci that have significantly more contacts with one another than they do with other nearby loci. In our densest map (GM12878), we observe 9,448 loops.

The loops reported here have many interesting properties. Most loops are short (<2 Mb) and strongly conserved across cell types and between human and mouse. Promoter-enhancer loops are common and associated with gene activation. Loops tend not to overlap; they often demarcate contact domains, and may establish them. CTCF and the cohesin subunits RAD21 and SMC3 associate with loops; each of these proteins is found at over 86% of loop anchors.

The most striking property of loops is that the pair of CTCF motifs present at the loop anchors occurs in a convergent orientation in >90% of cases (versus 25% expected by chance). The importance of motif orientation between loci that are separated by, on average, 360 kb is surprising and must bear on the mechanism by which CTCF and cohesin form loops, which seems likely to involve CTCF dimerization. Experiments in which the presence or orientation of CTCF sites is altered may enable the engineering of loops, domains, and other chromatin structures.

Sexton et al., 2012 Sexton T.

Yaffe E.

Kenigsberg E.

Bantignies F.

Leblanc B.

Hoichman M.

Parrinello H.

Tanay A.

Cavalli G. Three-dimensional folding and functional organization principles of the Drosophila genome. Cook and Brazell, 1975 Cook P.R.

Brazell I.A. Supercoils in human DNA. Vogelstein et al., 1980 Vogelstein B.

Pardoll D.M.

Coffey D.S. Supercoiled loops and eucaryotic DNA replicaton. Zehnbauer and Vogelstein, 1985 Zehnbauer B.A.

Vogelstein B. Supercoiled loops and the organization of replication and transcription in eukaryotes. Dixon et al., 2012 Dixon J.R.

Selvaraj S.

Yue F.

Kim A.

Li Y.

Shen Y.

Hu M.

Liu J.S.

Ren B. Topological domains in mammalian genomes identified by analysis of chromatin interactions. It is interesting to compare our results to those seen in previous reports. The contact domains we observe are similar in size to the “physical domains” that have been reported in Hi-C maps of Drosophila () and to the “topologically constrained domains” (mean length: 220 kb) whose existence was demonstrated in the 1970s and 1980s in structural studies of human chromatin (). On the other hand, the domains we observe are much smaller than the TADs (1 Mb) () that have been reported in humans and mice on the basis of lower-resolution contact maps. This is because detecting TADs involves detection of domain boundaries. With higher resolution data, it is possible to detect additional boundaries beyond those seen in previous maps. Interestingly, nearly all the boundaries we observe are associated with either a subcompartment transition (that occur approximately every 300 kb), or a loop (that occur approximately every 200 kb); and many are associated with both.

We created diploid Hi-C maps by using polymorphisms to assign contacts to distinct chromosomal homologs. We found that the inactive X chromosome is partitioned into two large superdomains whose boundary lies near the locus of the lncRNA DXZ4. We also detect a network of long-range superloops, the strongest of which are anchored at locations containing lncRNA genes (loc550643, XIST, DXZ4, and FIRRE). With the exception of XIST, all of these lncRNAs contain CTCF-binding tandem repeats that bind CTCF only on the inactive X.

In our original report on Hi-C, we observed that Hi-C maps can be used to study physical models of genome folding, and we proposed a fractal globule model for genome folding at the megabase scale. The kilobase-scale maps reported here allow the physical properties of genome folding to be probed at much higher resolution. We will report such studies elsewhere.

Just as loops bring distant DNA loci into close spatial proximity, we find that they bring disparate aspects of DNA biology—domains, compartments, chromatin marks, and genetic regulation—into close conceptual proximity. As our understanding of the physical connections between DNA loci continues to improve, our understanding of the relationships between these broader phenomena will deepen.