No statistical methods were used to predetermine sample size. The experiments were not randomized and investigators were not blinded to allocation during experiments and outcome assessment.

The methods used in this paper and many additional results are described in detail in the Supplementary Information. Here, we summarize the key aspects of the analysis.

Generation of the structural-variant call set

The final set of structural variants used in this Article was generated by the Technical Working Group of the PCAWG Consortium and is described in the main PCAWG paper8. In brief, four variant callers were used to identify somatically acquired structural variants from matched tumour and germline whole-genome sequencing data: SvABA (Broad pipeline), DELLY (DKFZ pipeline), BRASS (Sanger pipeline) and dRanger (Broad pipeline). Their calls were merged into a final call set using a graph-based algorithm that identifies overlapping breakpoint junctions across algorithms. Detailed visual inspection of structural-variant calls suggested that a simple approach of accepting all structural-variant calls made by two or more of the four algorithms gave the best trade-off between sensitivity and specificity.
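The two-or-more-callers rule can be illustrated with a minimal sketch. This is not the PCAWG merging code: the Call record, the 200-bp matching tolerance and the requirement that chromosomes and strands agree are assumptions introduced here for illustration. Calls from different callers that describe the same junction are linked as edges of a graph, connected components are found with a union-find structure, and a component is kept if at least two distinct callers contribute to it.

```python
from collections import defaultdict
from typing import NamedTuple


class Call(NamedTuple):
    """One breakpoint junction reported by one caller (hypothetical record)."""
    caller: str
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def same_junction(a, b, tol):
    """Assumed matching rule: same chromosomes and strands, both ends within tol bp."""
    return (a.chrom1 == b.chrom1 and a.chrom2 == b.chrom2
            and a.strand1 == b.strand1 and a.strand2 == b.strand2
            and abs(a.pos1 - b.pos1) <= tol and abs(a.pos2 - b.pos2) <= tol)


def consensus_calls(calls, tol=200, min_callers=2):
    """Group matching calls and keep groups supported by >= min_callers callers."""
    parent = list(range(len(calls)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge between every pair of matching calls from different callers.
    for i in range(len(calls)):
        for j in range(i + 1, len(calls)):
            if (calls[i].caller != calls[j].caller
                    and same_junction(calls[i], calls[j], tol)):
                parent[find(i)] = find(j)

    # Collect connected components and apply the caller-count threshold.
    components = defaultdict(list)
    for i, call in enumerate(calls):
        components[find(i)].append(call)
    return [members for members in components.values()
            if len({c.caller for c in members}) >= min_callers]
```

The pairwise comparison is quadratic in the number of calls, which is acceptable for a sketch; a production implementation would index calls by position first.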

Structural-variant clustering and annotation

We developed a method for grouping structural variants into clusters and footprints so that structural and mechanistic inferences could be made systematically. In parallel, we processed the somatic copy-number data and merged it with structural-variant junctions to enable us to produce rearrangement patterns from the generated structural-variant clusters and footprints. We produced normalized representations of structural-variant cluster patterns, which enabled us to tabulate the number of different cluster and footprint patterns and analyse their features. Finally, we performed manual and simulation-assisted interpretation of the recurrently observed cluster and footprint patterns. The individual steps of the structural-variant classification pipeline are outlined below and detailed in the subsequent subsections: (1) computing the exact breakpoint coordinates from clipped reads; (2) removing redundant ‘segment-bypassing’ structural variants; (3) merging rearrangement breakpoints with copy-number data to yield structural-variant breakpoint-demarcated, normalized, absolute copy-number data; (4) clustering individual structural variants into structural-variant clusters and footprints; (5) heuristically refining structural-variant clusters and footprints; (6) filtering artefactual fold-back-type structural variants with insufficient support; (7) determining balanced overlapping breakpoints (this step distinguishes very short templated insertions from mutually overlapping balanced breakpoints); and (8) computing rearrangement patterns and categories.

Distribution of structural variants across the genome

We divided the hg19 human reference genome (autosomes and chromosome X) into 3,036,315 pixels of 1 kb, and calculated a suite of metrics per pixel to summarize a variety of genome properties with potential relevance to the distribution of rearrangements, as listed in the Supplementary Information. Properties were matched as closely as possible to the tissue of origin for cancer samples from the PCAWG data. All other genome properties were held fixed across all tissues. To test for associations between structural-variant event classes and the library of genome properties, the genome property metrics were compared between real structural-variant positions (randomly choosing one side of each breakpoint junction to reduce dependence between observations) and one million uniform random positions from the callable genome space. To compare the tissue-specific properties, each random position was assigned a random tissue type, drawn from the observed tissue-type distribution in the structural-variant call set. For each genome property and each event class, the real observations were pooled with the random ones, and the pooled values were then rank-transformed and normalized on a scale from 0 to 1. Under the null hypothesis of no event-versus-property association, the ranks of the real observations would follow a uniform distribution. We tested this in each case with a Kolmogorov–Smirnov test, then applied a Benjamini–Yekutieli correction for false-discovery rate across the entire suite of tests and set the threshold for significance reporting at 0.01.
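The testing procedure described above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the analysis code: the function names are hypothetical, ties receive average ranks, ranks are scaled as (rank - 0.5)/N, and the Kolmogorov–Smirnov P value uses the common asymptotic series with a small-sample correction rather than an exact distribution. In practice a library routine such as scipy.stats.kstest would normally be used for the test itself.

```python
import math


def normalized_ranks(real, background):
    """Pool real and background values, rank them (average ranks for ties)
    and return the ranks of the real values scaled into (0, 1)."""
    pooled = [(v, True) for v in real] + [(v, False) for v in background]
    pooled.sort(key=lambda t: t[0])
    n = len(pooled)
    ranks = []
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1..j
        ranks.extend((avg_rank - 0.5) / n
                     for k in range(i, j) if pooled[k][1])
        i = j
    return ranks


def ks_uniform(u):
    """One-sample KS test of values in [0, 1] against Uniform(0, 1).
    Returns (D, approximate P value) using the asymptotic series."""
    u = sorted(u)
    n = len(u)
    d = max(max((k + 1) / n - x, x - k / n) for k, x in enumerate(u))
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:  # the series converges poorly here; P is effectively 1
        return d, 1.0
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


def benjamini_yekutieli(pvals):
    """Benjamini-Yekutieli adjusted P values (valid under dependence)."""
    m = len(pvals)
    c_m = sum(1 / k for k in range(1, m + 1))  # harmonic correction factor
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):  # step down from the largest P value
        i = order[rank - 1]
        running = min(running, pvals[i] * m * c_m / rank)
        adjusted[i] = running
    return adjusted
```

For one genome property and one event class, the real and random metric values would be passed to normalized_ranks, the result to ks_uniform, and the P values collected across all property and event-class pairs would be adjusted together with benjamini_yekutieli before applying the 0.01 threshold.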

Structural-variant-signature analysis

We used two algorithms for extracting structural-variant signatures. Both used the same input files, comprising a matrix of counts per patient (across all patients) of structural-variant clusters falling into a number of mutually exclusive categories. These categories included the major classes of structural variants, with the more-common events (deletions, tandem duplications and inversions) split by size and/or replication timing. The two algorithms that were used for extracting the signatures were (1) a hierarchical Dirichlet process and (2) non-negative matrix factorization. Further details on the implementation of these algorithms are available in the Supplementary Information.
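As an illustration of the shared input and of the second algorithm, the sketch below builds a patient-by-category count matrix and factorizes it with a basic non-negative matrix factorization. It is not the implementation used for the paper, which is described in the Supplementary Information: the category labels in the usage note are hypothetical, the update rule is the standard multiplicative update for squared error, and the number of signatures is supplied by the caller rather than chosen by any model-selection step.

```python
import random
from collections import Counter


def count_matrix(clusters, categories):
    """clusters: iterable of (patient_id, category) pairs, one per cluster.
    Returns the sorted patient list and a patients x categories count matrix."""
    counts = Counter(clusters)
    patients = sorted({patient for patient, _ in counts})
    return patients, [[counts[(p, c)] for c in categories] for p in patients]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2
               for rv, rwh in zip(v, wh) for x, y in zip(rv, rwh)) ** 0.5


def nmf(v, n_signatures, iterations=200, seed=0, eps=1e-12):
    """Factorize V (patients x categories) as W (patients x signatures)
    times H (signatures x categories) with multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_signatures)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(n_signatures)]
    for _ in range(iterations):
        # Update H: H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(n_signatures)]
        # Update W: W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps)
              for k in range(n_signatures)] for i in range(n)]
    return w, h
```

For example, count_matrix([("P1", "deletion_small"), ("P2", "tandem_dup")], ["deletion_small", "tandem_dup"]) returns the patient list and a 2 x 2 matrix of counts, which can be passed to nmf(v, n_signatures=2). Each row of H is then a candidate signature over the categories and each row of W gives the corresponding exposures of one patient.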

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.