A detailed description of methods and validations is available as Supplementary Information. No statistical methods were used to predetermine sample size. The experiments were not randomized, and investigators were not blinded to allocation during experiments and outcome assessment.

Sample collection

Patients with advanced cancer that was not curable by local treatment options, and who were candidates for any type and any line of systemic treatment, were included as part of the CPCT-02 (NCT01855477) and DRUP (NCT02925234) clinical studies, which were approved by the medical ethical committees (METC) of the University Medical Center Utrecht and the Netherlands Cancer Institute, respectively. A total of 41 academic, teaching and general hospitals across The Netherlands participated in these studies and collected material and clinical data by standardized protocols52. Patients gave explicit consent for whole-genome sequencing and data sharing for cancer research purposes. Core needle biopsies were sampled from the metastatic lesion or, when this was considered not feasible or not safe, from the primary tumour site, and were frozen in liquid nitrogen. A single 6-μm section was collected for haematoxylin and eosin (H&E) staining and estimation of tumour cellularity by an experienced pathologist, and 25 sections of 20 μm were collected in a tube for DNA isolation. In parallel, a tube of blood was collected. Leftover material (biopsy, DNA) was stored in biobanks associated with the studies at the University Medical Center Utrecht and the Netherlands Cancer Institute.

Whole-genome sequencing and variant calling

DNA was isolated from biopsies (>30% tumour cellularity) and blood according to the supplier’s protocols (Qiagen) using the DSP DNA Midi kit for blood and QIAsymphony DSP DNA Mini kit for tissue. A total of 50–200 ng of DNA (sheared to an average fragment length of 450 nt) was used as input for TruSeq Nano LT library preparation (Illumina). Barcoded libraries were sequenced as pools on HiSeqX, generating 2 × 150 bp read pairs using standard settings (Illumina). BCL output was converted using the bcl2fastq tool (Illumina, v.2.17 to v.2.20) with default parameters. Reads were mapped to the reference genome GRCh37 using BWA-mem v.0.7.5a53, duplicates were marked for filtering and indels were realigned using GATK v.3.4.46 IndelRealigner54. GATK HaplotypeCaller v.3.4.4655 was run to call germline variants in the reference sample. For somatic SNV and indel variant calling, GATK BQSR56 was applied to recalibrate base qualities. Somatic SNVs and indels were called using Strelka v.1.0.1457 with optimized settings and post-calling filtering. Structural variants were called using Manta (v.1.0.3)58 with default parameters, followed by additional filtering to improve precision using an internally built tool (Breakpoint-Inspector v.1.5). To assess the effect of sequencing depth on variant calling sensitivity, we downsampled the BAMs of 10 samples at random by 50% and reran the identical somatic variant calling pipeline.

Purity, ploidy and copy number calling

Copy number calling and determination of sample purity were performed using PURPLE (PURity & PLoidy Estimator), which combines B-allele frequency, read depth and structural variants to estimate the purity of a tumour sample and determine the copy number and minor allele ploidy for every base in the genome. The purity and ploidy estimates and copy number profile obtained from PURPLE were validated on in silico simulated tumour purities, by DNA fluorescence in situ hybridization (FISH) and by comparison with an alternative tool (ASCAT59). ASCAT was run on GC-corrected data using default parameters except for gamma, which was set to 1 (as recommended for massively parallel sequencing data). We implemented a simple heuristic to determine whether a whole-genome duplication (WGD) event has occurred: major allele ploidy > 1.5 over at least 50% of each of at least 11 autosomes. This threshold was chosen because the number of duplicated autosomes per sample (that is, the number of autosomes that satisfy the above rule) follows a bimodal distribution, with 95% of samples having either ≤6 or ≥15 autosomes duplicated.
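
The WGD heuristic described above can be illustrated with a minimal Python sketch. The input format (a mapping from autosome name to the fraction of that chromosome with major allele ploidy > 1.5) and the function names are assumptions made for illustration and are not taken from PURPLE.

```python
from typing import Mapping

# Autosomes only; sex chromosomes are ignored by the heuristic.
AUTOSOMES = [str(i) for i in range(1, 23)]


def count_duplicated_autosomes(
    fraction_major_gt_1_5: Mapping[str, float],
    min_fraction: float = 0.5,
) -> int:
    """Count autosomes on which major allele ploidy > 1.5 covers
    at least `min_fraction` of the chromosome.

    `fraction_major_gt_1_5` is a hypothetical input: chromosome name ->
    fraction of that chromosome with major allele ploidy > 1.5.
    """
    return sum(
        1
        for chrom in AUTOSOMES
        if fraction_major_gt_1_5.get(chrom, 0.0) >= min_fraction
    )


def has_wgd(
    fraction_major_gt_1_5: Mapping[str, float],
    min_autosomes: int = 11,
) -> bool:
    """Apply the rule: at least 11 autosomes must be 'duplicated'."""
    return count_duplicated_autosomes(fraction_major_gt_1_5) >= min_autosomes
```

Because the duplicated-autosome count is bimodal (most samples have ≤6 or ≥15), the classification is expected to be insensitive to small shifts of the 11-autosome threshold.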

Sample selection for downstream analyses

Following copy number calling, samples were filtered out based on absence of somatic variants, purity <20%, and GC biases, yielding a high-quality dataset of 2,520 samples. Where multiple biopsies exist for a single patient, the highest purity sample was used for downstream analyses (resulting in 2,399 samples).
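
The selection logic can be sketched as follows. This is a minimal illustration in which the record fields, the boolean GC-bias flag and the tie-breaking behaviour (the first sample encountered wins) are assumptions; the actual quality-control criteria are detailed in the Supplementary Information.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Sample:
    # All field names are hypothetical and chosen for illustration.
    sample_id: str
    patient_id: str
    purity: float               # tumour purity as a fraction (0 to 1)
    somatic_variant_count: int
    gc_bias_flag: bool          # True if the sample failed a GC-bias check


def passes_qc(sample: Sample, min_purity: float = 0.20) -> bool:
    """Keep samples with somatic variants, purity >= 20% and no GC bias."""
    return (
        sample.somatic_variant_count > 0
        and sample.purity >= min_purity
        and not sample.gc_bias_flag
    )


def select_samples(samples: Iterable[Sample]) -> List[Sample]:
    """Return one QC-passing sample per patient: the one with highest purity."""
    best: Dict[str, Sample] = {}
    for sample in samples:
        if not passes_qc(sample):
            continue
        current = best.get(sample.patient_id)
        if current is None or sample.purity > current.purity:
            best[sample.patient_id] = sample
    return list(best.values())
```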

Mutational signature analysis

Mutational signatures were determined by fitting SNV counts per 96 trinucleotide context to the 30 COSMIC signatures26 using the MutationalPatterns package60. Residuals were calculated as the sum of the absolute differences between observed and fitted counts across the 96 buckets. Signatures with <5% overall contribution to a sample or an absolute fitted mutational load of <300 variants were excluded from the summary plot.
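
The residual calculation and the summary-plot filter can be illustrated with a short Python sketch. It does not perform the signature fit itself (which was done with the R package cited above); the function names and the dictionary input format are assumptions for illustration.

```python
from typing import Dict, Sequence


def residual(observed: Sequence[float], fitted: Sequence[float]) -> float:
    """Sum of absolute differences between observed and fitted counts
    across the trinucleotide context buckets (96 in the analysis above)."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    return sum(abs(o - f) for o, f in zip(observed, fitted))


def signatures_for_summary(
    contributions: Dict[str, float],
    min_fraction: float = 0.05,
    min_load: float = 300,
) -> Dict[str, float]:
    """Keep signatures contributing >= 5% of a sample's fitted mutations
    AND >= 300 fitted variants; all others are excluded from the summary.

    `contributions` is a hypothetical input: signature name -> fitted
    number of variants attributed to that signature in one sample.
    """
    total = sum(contributions.values())
    if total <= 0:
        return {}
    return {
        name: load
        for name, load in contributions.items()
        if load / total >= min_fraction and load >= min_load
    }
```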

Germline predisposition variant calling

We searched for pathogenic germline variants (SNVs, indels and copy number alterations) in a broad list of 152 germline predisposition genes previously curated61, using GATK HaplotypeCaller55 output from each sample. For each variant identified, we assessed the genotype in the germline (HET or HOM), whether there was a second somatic hit in the tumour, and whether the wild type or the variant itself was lost by a copy number alteration. We observed that, for variants in many of the 152 predisposition genes, the rate of loss of the wild type in the tumour via loss of heterozygosity (LOH) was lower than the average rate of LOH across the cohort, and that fewer than 5% of observed variants had a second somatic hit in the same gene. Moreover, in many of these genes, the ALT variant was lost via LOH as frequently as the wild type, suggesting that a considerable portion of the 566 variants may be passengers. For our downstream analysis and driver catalogue, we therefore restricted our analysis to a more conservative ‘high confidence’ list including only the 25 cancer-related genes in the ACMG secondary findings reporting guidelines (v.2.0)62, together with four curated genes (CDKN2A, CHEK2, BAP1 and ATM), selected because these are the only additional genes from the larger list of 152 genes with a significantly increased proportion of called germline variants with loss of wild type in the tumour sample.

Clonality and biallelic status of point mutations

The ploidy of each variant is calculated by adjusting the observed VAF by the purity and then multiplying by the local copy number to work out the absolute number of chromatids that contain the variant. We mark a mutation as biallelic (that is, no wild type remaining) if variant ploidy > local copy number − 0.5. For each variant, we also determine a probability that it is subclonal. This is achieved via a two-step process involving fitting the somatic ploidies for each sample into a set of clonal and subclonal peaks and calculating the probability that each individual variant belongs to each peak. Subclonal counts are calculated as the total density of the subclonal peaks for each sample. Subclonal driver counts are calculated as the sum across the driver catalogue of subclonal probability × driver likelihood.
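
The variant ploidy and biallelic rule can be sketched in Python as below. The exact purity adjustment is not spelled out in the text above, so this sketch assumes that the variant is absent from normal cells and that normal cells carry two copies of the locus; the function names are illustrative only.

```python
def variant_ploidy(
    vaf: float,
    purity: float,
    copy_number: float,
    normal_copy_number: float = 2.0,
) -> float:
    """Estimate the number of tumour chromatids carrying the variant.

    Assumptions (not stated in the source): the variant is somatic, and
    normal cells contribute `normal_copy_number` copies of the locus.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    # Average number of copies of the locus per cell in the mixed sample.
    total_copies = purity * copy_number + (1 - purity) * normal_copy_number
    # Purity-adjusted VAF is vaf * total_copies / (purity * copy_number);
    # multiplying by copy_number gives the variant ploidy.
    return vaf * total_copies / purity


def is_biallelic(ploidy: float, copy_number: float) -> bool:
    """No wild type remaining: variant ploidy > local copy number - 0.5."""
    return ploidy > copy_number - 0.5
```

For example, under these assumptions a variant with VAF 1/3 in a sample of 50% purity at a locus with tumour copy number 1 has a variant ploidy of approximately 1 and is therefore marked as biallelic.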

MSI status determination

To determine microsatellite instability (MSI) status, we used the method described by the MSIseq tool63 and counted the number of indels per million bases occurring in homopolymers of five or more bases or in dinucleotide, trinucleotide and tetranucleotide sequences with a repeat count of four or more. Samples with an MSIseq score of >4 were considered MSI.
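
The scoring rule can be illustrated with a minimal Python sketch. The function names are hypothetical, and the denominator (the number of megabases over which indels are counted) is left as an input because it is not specified in the text above.

```python
def in_qualifying_repeat(repeat_unit_length: int, repeat_count: int) -> bool:
    """True if an indel lies in a repeat context that is counted:
    homopolymers (unit length 1) of five or more bases, or di-, tri- and
    tetranucleotide repeats (unit length 2 to 4) with four or more repeats."""
    if repeat_unit_length == 1:
        return repeat_count >= 5
    if 2 <= repeat_unit_length <= 4:
        return repeat_count >= 4
    return False


def msi_score(qualifying_indel_count: int, megabases: float) -> float:
    """Qualifying indels per million bases. `megabases` is an assumed
    input: the size, in Mb, of the region over which indels are counted."""
    if megabases <= 0:
        raise ValueError("megabases must be positive")
    return qualifying_indel_count / megabases


def is_msi(score: float, threshold: float = 4.0) -> bool:
    """Samples with a score strictly greater than 4 are considered MSI."""
    return score > threshold
```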

Significantly mutated driver genes

We used Ensembl64 v.89.37 as a basis for gene definitions and took the union of Entrez identifiable genes and protein-coding genes as our base panel (25,963 genes, of which 20,083 are protein coding). At both the pan-cancer and the individual cancer type level, we tested the normalized nonsynonymous (dN) to synonymous (dS) substitution rate (that is, dN/dS) using dNdScv24 against a null hypothesis that dN/dS = 1 for each variant subtype. To identify significantly mutated genes in our cohort, we used a strict significance cut-off value of q < 0.01.

To search for significantly amplified and deleted genes, we first calculated the minimum exonic copy number per gene. For amplifications, we searched for all the genes with high-level amplifications only (defined as minimum exonic copy number >3 × sample ploidy). For deletions, we searched for all the genes in each sample with either full or partial homozygous gene deletions (defined as minimum exonic copy number < 0.5), excluding the Y chromosome. We then searched separately for amplifications and deletions, on a per-chromosome basis, for the most significant focal peaks, using an iterative GISTIC-like peel-off method65. Most of the deletion peaks resolve clearly to a single target gene, which reflects the fact that homozygous deletions are highly focal, but for amplifications this is not the case and most of our peaks have ten or more candidates. We therefore annotated the peaks to choose a single putative target gene using an objective set of automated curation rules. Finally, filtering was applied to yield highly significant deletions and amplifications.
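
The per-gene amplification and deletion definitions can be sketched as follows. This illustrates only the thresholds stated above, not the peak-finding or curation steps; the function names and the list-of-exon-copy-numbers input are assumptions for illustration.

```python
from typing import Sequence


def min_exonic_copy_number(exon_copy_numbers: Sequence[float]) -> float:
    """Minimum copy number across the exons of one gene in one sample."""
    if not exon_copy_numbers:
        raise ValueError("gene has no exon copy numbers")
    return min(exon_copy_numbers)


def is_high_level_amplification(min_exonic_cn: float, sample_ploidy: float) -> bool:
    """High-level amplification: minimum exonic copy number > 3 x sample ploidy."""
    return min_exonic_cn > 3 * sample_ploidy


def is_homozygous_deletion(min_exonic_cn: float, chromosome: str) -> bool:
    """Full or partial homozygous deletion: minimum exonic copy number < 0.5.
    Genes on the Y chromosome are excluded."""
    return chromosome != "Y" and min_exonic_cn < 0.5
```

Using the minimum over exons means that a deletion affecting a single exon is sufficient to call a partial homozygous deletion, whereas an amplification must span every exon of the gene.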

Homozygous deletions were also annotated as common fragile sites based on their genomic characteristics, including a strong enrichment in long genes (>500,000 bases) and a high rate (>30%) of deletions between 20 kb and 1 Mb27.

Somatic driver catalogue construction

We created a catalogue of mutations in known cancer genes in our cohort across all variant types on a per-patient basis. This was done in a similar incremental manner to that previously described32 (N. Lopez, personal communication) in which we first calculated the number of genes with putative driver mutations in a broad panel of known and significantly mutated genes across the full cohort, and then assigned the candidate driver mutations for each gene to individual patients by ranking and prioritizing each of the observed variants. Key points of difference in this study were both the prioritization mechanism used and our choice to ascribe each mutation a probability of being a driver rather than a binary cut-off based on absolute ranking.

The four steps to create the catalogue are as follows. (1) Create a panel of candidate genes for point mutations, comprising significantly mutated genes and known cancer genes, as the union of Martincorena significantly mutated genes24 (filtered to a significance of q < 0.01), HMF significantly mutated genes (q < 0.01) at the global or cancer type level, and COSMIC curated genes26 (v.83). (2) Determine the TSG or oncogene status of each significantly mutated gene using a logistic regression classification model (trained using COSMIC annotation). (3) Add mutations from all variant classes to the catalogue when they meet any of the following criteria: (i) all missense mutations and in-frame indels for panel oncogenes; (ii) all non-synonymous and essential splice point mutations for TSGs; (iii) all high-level amplifications for significantly amplified target genes and panel oncogenes; (iv) all homozygous deletions for significantly deleted target genes and panel TSGs; (v) all known or promiscuous in-frame gene fusions; and (vi) recurrent TERT promoter mutations. (4) Calculate a per-sample likelihood score (between 0 and 1) for each mutation in the catalogue as a potential driver event, to ensure that only likely pathogenic and excess mutations (based on dN/dS) are used to determine the number of drivers. All putative driver mutation counts reported at a per-cancer type or sample level refer to the sum of driver likelihoods for that cancer type or sample.
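
Because each catalogue entry carries a likelihood rather than a binary call, reported driver counts are sums of likelihoods and need not be integers. A minimal Python sketch of this aggregation is given below; the tuple layout of a catalogue entry and the function name are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical catalogue entry: (sample_id, cancer_type, gene, driver_likelihood)
CatalogueEntry = Tuple[str, str, str, float]


def driver_counts(
    catalogue: Iterable[CatalogueEntry],
) -> Tuple[Dict[str, float], Dict[str, float]]:
    """Sum driver likelihoods per sample and per cancer type."""
    per_sample: Dict[str, float] = defaultdict(float)
    per_cancer_type: Dict[str, float] = defaultdict(float)
    for sample_id, cancer_type, _gene, likelihood in catalogue:
        if not 0.0 <= likelihood <= 1.0:
            raise ValueError("driver likelihood must be between 0 and 1")
        per_sample[sample_id] += likelihood
        per_cancer_type[cancer_type] += likelihood
    return dict(per_sample), dict(per_cancer_type)
```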

Clinical associations and actionability analysis

To determine clinical associations and potential actionability of the variants observed in each sample, we compared all variants with three external clinical annotation databases (OncoKB40, CGI41 and CIViC39) that were mapped to a common data model as defined by https://civicdb.org/help/evidence/evidence-levels. Here, we considered only A and B level variants. This classification of potentially actionable events can also be mapped to the recently proposed ESMO Scale for Clinical Actionability of molecular Targets (ESCAT)66 as follows: ESCAT I-A+B (for A on-label) and I-C (for A off-label), and ESCAT II-A+B (for B on-label) and III-A (for B off-label). Each candidate actionable mutation was also classified as either on-label (that is, evidence supports treatment in that specific cancer type) or off-label (evidence exists in another cancer type). To do this, we annotated both the patient cancer types and the database cancer types with relevant DOIDs, using the disease ontology database67. For each candidate actionable mutation in each sample, we aggregated all the mapped evidence that was available supporting both on-label and off-label treatments at the A or B evidence level. Treatments that also had evidence supporting resistance based on other biomarkers in the sample at the same or higher evidence level were excluded as non-actionable. Samples classified as MSI in our driver catalogue were also mapped as actionable at level A evidence based on clinical annotation in the OncoKB database. For each sample, we reported the highest level of predicted actionability, ranked first by evidence level and then by on-label vs off-label.
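
The ranking of actionability and the ESCAT mapping can be sketched in Python as follows. The representation of evidence as (level, on_label) pairs and the function names are assumptions for illustration; resistance filtering and DOID matching are not modelled here.

```python
from typing import Dict, Iterable, Optional, Tuple

# (evidence level, on_label) -> ESCAT tier, as given in the text above.
ESCAT_TIER: Dict[Tuple[str, bool], str] = {
    ("A", True): "I-A+B",
    ("A", False): "I-C",
    ("B", True): "II-A+B",
    ("B", False): "III-A",
}


def rank_key(level: str, on_label: bool) -> Tuple[int, int]:
    """Lower is better: evidence level first (A before B),
    then on-label before off-label."""
    return (0 if level == "A" else 1, 0 if on_label else 1)


def highest_actionability(
    evidence: Iterable[Tuple[str, bool]],
) -> Optional[Tuple[str, bool]]:
    """Return the best (level, on_label) pair for one sample, considering
    only A and B level evidence, or None if there is none."""
    eligible = [item for item in evidence if item[0] in ("A", "B")]
    if not eligible:
        return None
    return min(eligible, key=lambda item: rank_key(*item))
```

Note that, under this ordering, off-label level A evidence outranks on-label level B evidence, because evidence level takes precedence over label status.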

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.