Clonal expansion of human SCs

Human SkM biopsies were obtained from healthy volunteers according to Ethical Permit EPN 2015/847-31/1, the Stockholm regional ethical review board. Informed consent was obtained from all donors. In total, 50–100 mg of tissue from the muscle vastus lateralis were obtained using a percutaneous needle biopsy technique40. A part of the tissue was frozen, while ~40 mg were freshly digested using 5 ml Trypl-LX in two rounds of 20 min incubations at 37 °C, 5% CO 2 and gentle agitation. Cells were pre-plated in the growth medium in a 10-cm cell culture dish (Dulbecco’s modified Eagle’s medium (DMEM) F12 containing 1% ABAM and 20% fetal bovine serum (FBS)) for 30 min at 37 °C and 5% CO 2 . The non-attached cells were transferred to a 6-cm cell culture dish coated with collagen I (5 µg/cm2 of Collagen I bovine protein, Gibco #A10644-01, following the “thin coating procedure”). The cells were left to adhere for 48 h. Then the cells were detached and stained for CD56-PE (clone MY31, BD #347747, dilution 1:200). Using a BD FACSAria™ Mu cell sorter (BD Biosciences, USA), the cells were sorted based on their size and strong positivity for the antigen (see Supplementary Fig. 1) and were single cell plated in at least one 96-multiwell collagen-coated plate per biopsy. The cells were grown in a conditioned medium, i.e., growth medium that was collected every 48 h from a culture of feeder SCs (40–60% cell confluency), filtered through a 2-µm-wide pore filter and supplemented with FBS 1/10 of the volume. Colonies were counted after 14 days and scored as positive when at least 16 cells were visible in the well. Approximately 21 days after plating, the colonies were moved to 24-multiwell plates. Depending on the cell confluency, the colonies were then moved to six-multiwell plates. The new plates were collagen coated but growth medium instead of conditioned medium was provided. After an average of 52.9 ± 2.2 days in culture, the colonies were confluent and used for the DNA extraction.

DNA extraction

DNA was extracted from the confluent wells of the six-multiwell plate using the Gentra Puregen Kit, Qiagen. DNA was extracted from 30 mg of muscle biopsy using the Gentra Puregen Kit, supplemented with a lysis buffer containing Proteinase K as recommended by the supplier. DNA was extracted from 3 ml of total blood that was collected in EDTA as recommended by the instructions of the Gentra Puregen Blood Kit.

Whole-genome sequencing

The library preparation and sequencing were carried out at NGI Sweden, Science for Life Laboratories, Stockholm, following standard methods. For the SCCs, the library preparation was performed by a semiautomatic NeoPrep station using the Illumina TruSeq Nano Kit (350 bp average insert size) and 25 ng of DNA starting material. The libraries of the bulk DNA samples were prepared with Illumina TruSeq PCR-free library preparations (350 bp average insert size). Sequencing was performed on Illumina HiSeq X, PE 2×150 bp.

SC purity of cultured CD56+ populations

The purity of the SC populations isolated with our protocol was compared in different muscle biopsies by assessing the transcript levels of the SC marker Pax7. In parallel with single-cell sorting, CD56+ cells were sorted in a 24-well plate (Supplementary Table 1) and cultured for 1 week to expand the population prior to RNA extraction. In addition, the purity of the single-cell culture was tested by assessing the percentage of clones derived from single-cell sorted cells that expressed the myocyte marker MyoD. RNA levels were tested in those SCCs that were able to grow confluent in a 24-well plate and 97.4% of clones (CES1 n = 3, CES6 n = 9, CES7 n = 14, and CES8 n = 13) were found positive. The relative expression of SC markers was measured using the same calibrator, the CD56+ sorted population from CES3, expanded in culture for 4 weeks (Pax7 expression characterized in Supplementary Fig. 1b). As negative controls, a human fibroblast population and two clones from human subcutaneous fat were used. Negative control samples never showed amplification of the SC genes.

To further characterize the contaminant cell types that were present in our CD56+ sorted populations, we expanded for 3 weeks and froze the unsorted populations left from the original cell sorting of the CES biopsies. Thawed cells were sorted for CD56 and fluorescence-activated cell sorting analysis was performed to assess the percentage of cells expressing markers of SCs (Pax7 monoclonal, Hybridoma Bank, 10 μg/ml) and bone marrow-derived cells (CD45-APC, clone HI30, BD Biosciences, USA). At least 7000 cells per sample were counted. In addition, 5000 cells were cytospun on glass slides and immunofluorescently labeled for the fibroblast marker TE-7 (CBL271, Millipore, CA, USA). At least 300 cells per sample were counted. Thawed cells sorted for CD56 were grown as single-cell clones with the same protocol used in the original experiment and again the myogenic origin was tested by MyoD expression. 100% of clones resulted positive (CES3 = 4, CES6 = 3, CES7 = 6, CES8 = 6).

Verification of SC origin of the sequenced SCCs

Culture of SCCs from CD56+ sorted cells does not guarantee the exclusive growth of SC colonies, due to the presence of contaminant cells of myogenic and non-myogenic origin in the CD56+ pool (see Supplementary Table 1). For this reason, the SC origin of the sequenced SCCs was tested in a myotube-formation assay or by assessing the expression of the SC markers Pax7, MyoD, and myogenin via qPCR. This test could be performed for 21 of the 29 clones. The remaining eight clones showed a scarce growth, and after DNA extraction, no cells were available for additional tests.

Cell-culture assays

Proliferation assays were performed on the seven biopsies included in the study (Supplementary Table 1). For every biopsy, at least 96 SCs were plated as single cells (CES1, CES7, and CES8: N = 96; CES2, CES3, CES4, and CES6: N = 144). After 2 weeks in culture, the percentage of cells able to form a colony of at least 16 cells was scored (colony-forming SCs). SCs that were able to grow up to 30,000–100,000 cells (confluent well of six-well plate) were scored positive for long-lasting proliferative potential independent of the time in culture. The differentiation assay was performed on confluent cells, by replacing growing medium with differentiating medium (DMEM-F12, 1% FBS) for 5 days. The clones were scored for the presence of visible myotubular structures and the RNA was extracted to verify the myogenic origin. For this assay, the original plating of the single-cell clones was used for biopsies CES1 n = 37, CES2 n = 39, and CES4 n = 29. For biopsies CES3, CES6, CES7, and CES8, the frozen expanded populations of unsorted cells were thawed and grown for 1 week, than sorted for CD56, single cell plated, and grown to confluency in 24-well plates (CES3 n = 21, CES6 n = 37, CES7 n = 24, and CES8 n = 28 SCCs).

Long-culture SCCs

Freshly isolated SCs from a young individual (Supplementary Table 1) were expanded in culture for 50 days using growth medium (DMEM F12 containing 1% ABAM and 20% FBS), then the SCs were single cell plated and cultured according to our protocol for clonal expansion of SCs from human SkMs. The clones and blood bulk DNA from the donor were sequenced, and the somatic variants were analyzed as a control for the in vitro-induced mutations.

Somatic variant calling

Raw reads were aligned to the human reference genome (GRCh37/hg19 assembly version), using bwa mem 0.7.1241. Samtools 0.1.1942 was used for alignment sorting and indexing, and qualimap v2.243 for the alignment quality-control statistics. The raw alignments were then processed using the GATK best practice44 with version 3.3 of the GATK software suite. GATK RealignerTargetCreator and IndelRealigner were used to realign around indels, Picard MarkDuplicates 1.120 to mark duplicates, and GATK BaseRecalibrator to recalibrate base quality scores. Finally, genomic VCF files were created using the GATK HaplotypeCaller 3.3. Reference files came from the GATK 2.8 resource bundle and steps were coordinated using Piper v1.4.0 (www.github.com/NationalGenomicsInfrastructure/piper).

To identify somatic variants, a specific pipeline was developed. For each SCC, the union of variants called with HaplotypeCaller (GATK)45, MuTect2 (GATK 3.5.0), and FermiKit version r17846 were subjected to further filtering steps. First, the variants present in any of the SCCs were gathered in a comprehensive list of interesting positions, which was specific to each individual. For every clone of the individual, the read distribution of the interesting positions was derived from the .bam files and matched to the relative blood bulk. Variants were called when all of the following criteria were met: the read fraction supporting the alternative allele fell in the desired region (0.4–0.6), the read fraction in the blood was low (alternative < 0.1), and the coverage in both the clone and blood was at least 15×. Chromosomes X and Y were excluded from the analyses. Additional quality filters were applied as follows: the reads supporting the variants were on both strands, the maximum coverage was 1000×, and the variants that were located in problematic regions47,48 and those detected in more than one individual were removed. The variants were annotated using the Ensembl Variant Effector Predictor from ref. 49. Filtered somatic mutations in young, old, and long-term cultured human SC identified in the study are available in Supplementary Data 1–3.

Estimate of false negative rate

The estimate of false negative rate was done by counting how often heterozygous SNVs were detected in the clone bam file with an allele frequency that allows our pipeline to classify it as have occurred in vivo, i.e., 0.4–0.6 allele frequency. First, a set of high-confidence SNVs for which the individual was germline heterozygote (bulk and blood) and present with an allele frequency > 0.3 in the SweGen population was generated. Second, the germline heterozygote SNVs with a clone depth minimum of 15× were selected. Third, the number of SNVs that were detected in the clone bam file by GATK and/or Fermikit and/or MuTect2 were collected. Finally, the number of SNVs with allele frequency 0.4–0.6 were counted using the Samtools mpileup. Similarly, we estimated how often homozygous SNVs are detected with minor allele frequency under 0.1 in blood. High-confident homozygous SNPs were created in the same way as above, and Samtools mpileup was used to count the number of SNVs with allele frequency below 0.1 in the blood. To yield the false negative rate, the positive prediction rate (e.g., the fraction of SNVs with allele frequency 0.4–0.6 of the total number of SNVs and correct called homozygous SNVs) was subtracted from 1.

Variant validation

The variant validation was performed on technical replicates of WGS. Clone P2703_113 was sequenced twice with independent library preparations. Clone P2703_116 was split into 2 wells during the cell culture (1000 cell-stage) and resulted in 2 independently grown clones (P2703_116 and P2703_119) derived from the same ancestor cell. The DNA was extracted and sequenced independently, but clone P2703_119 was not included in the study. Variants were called in clones P2703_113 and P2703_116 (discovery set) according to our somatic variant calling pipeline. Called variants that had a minimum coverage of 10× in both the discovery and the validation sets were used for the validation. In total, 798 SNVs and 82 indels of clone P2703_113 and 1140 SNVs and 66 indels of clone P2703_116 were tested. Variants were considered validated when at least three reads supporting the alternative alleles were present in the validation set. As a control for the background signal, we validated the variants in unrelated clones, e.g., clones derived from a different founder cell obtained from the same or a different biopsy. For a WGS-independent validation, we used Agena genotyping (sequenom maldi-tof technology), which was performed at the Mutation Analysis Facility at the Karolinska Institute. The SNVs predicted to have high-impact consequences on the encoded protein were selected from all clones and tested when assays were designable (n = 11) in all clone, muscle bulk, and blood bulk DNA included in the study. The variants were considered validated when the expected genotype was found in the clone from which the SNV was discovered, and the reference allele was found in all the other samples, including the relative blood bulk sample. In addition, six SNVs selected based on their impact on muscle physiology were validated by digital droplet PCR (ddPCR; Bio-Rad) in the clone from which the SNV was discovered and the relative bulk samples.

RNA extraction and qPCR

RNA was extracted from muscle biopsies using TriZol® Reagent (Invitrogen) and from cultured cells using the RNeasy Mini Kit (Qiagen), according to the manufacturers' instructions. cDNA synthesis was performed using random hexamers and SuperScript Reverse Transcriptase (Invitrogen). Quantitative reverse transcriptase PCR (RT-PCR) was performed using Taqman qPCR reagents and an ABI7500 fast system sequencing detection instrument (Applied Biosystems). Primers and probes were assays on demand from Applied Biosystem (Pax7 Hs00242962_m1; MyoD1, Hs00159528_m1; myogenin Hs01072232_m1) and the normalizer gene was glyceraldehyde 3-phosphate dehydrogenase (#4352665, Applied Biosystems).

Digital droplet PCR

The rare event detection (RED) analysis was performed using the QX200 ddPCR system (Bio-Rad, Hercules, CA, USA). DdPCR assays (Bio-Rad, Hercules, CA, USA) were designed to detect six SNVs, namely, HSPG2 (chr1: 22174499G>A, assay id: dHsaMDS988410713), TTN (chr2: 179560728G>A, assay id: dHsaMDS896792909), FN1 (chr2: 216272892C>T, assay id: dHsaMDS668887829), FNDC1 (chr6: 159682281T>A, assay id: dHsaMDS711365861), RAP1GDS1 (chr4: 99338610C>T, assay id: dHsaMDS467491822), and CACNA2D2 (chr3: 50426879C>T, assay id: dHsaMDS892017893). The PCR reactions were performed using 2× ddPCR Supermix for Probes (no dUTP) (Bio-Rad), 20× ddPCR mut assays (FAM/HEX labeled) (Bio-Rad), 5 U of HindIII restriction enzyme (New England BioLabs, Ipswich, MA, USA), and template DNA and ran according to the manufacturer’s instructions. gBlocks gene fragments (IDT, Coralville, IA, USA), carrying one of the six SNVs each, were spiked into the human DNA sample and used for testing the ddPCR assays and optimizing the PCR thermal cycling conditions prior to the analysis of the samples. Approximately 30 ng of sample DNA per well were used in the analysis, except for the case of DNA from the additional clones from individual CES6 for which only 5 ng per well were used due to the low amount of available DNA. All samples were run in two or more replicate wells, and the data were merged from all the replicate wells to calculate the fractional abundances of the mutant alleles. Sample data were only accepted when falling within established detection parameters, which include a minimum of 3 positive droplets per sample and 10,000 accepted droplets per well.

Owing to the exonic position of the HSPG2 and FN1 mutations, the same assay could be used with both DNA and cDNA. To perform the RED on the cDNA samples, 1 μg of RNA was treated with DNase using RQ1 RNase-Free DNase (Promega, Madison, WI, USA) prior to the cDNA synthesis via the SuperScript First-Strand Synthesis System for RT-PCR (Invitrogen, Carlsbad, CA, USA), which was performed according to the manufacturer’s instructions. The RED was performed as described for DNA. Approximately 50 ng of sample cDNA per well were used for the analysis, except for the case of cDNA from the SC population from a control individual, for which the amount of cDNA had to be reduced to 7.5 ng per wall to prevent overloading of the assay.

Mutational signatures

All analyses of base substitutions (Supplementary Fig. 2b) and mutational signatures (Fig. 1 and Supplementary Figs. 3 and 4) were carried out using the R-package MutationalPatterns16,50 following the creators’ instructions. In brief, base substitution patterns of the somatic mutations detected in SCCs were analyzed and visualized with MutationalPatterns using the reference genome set to “BSgenome.Hsapiens.UCSC.hg19”. Using the function “mut_matrix”, the relative contribution of the six single base substitution subtypes: C:G>A:T, C:G>G:C, C:G>T:A, T:A>A:T, T:A>C:G, and T:A>G:C, was refined by including the sequence context 5ʹ and 3ʹ of each mutated base to obtain a 96 trinucleotide substitution count matrix. We performed de novo extraction of mutational signatures from the SCC mutations using non-negative matrix factorization51 by applying the function “extract_signatures” to our data. We used the 96 trinucleotide substitution count matrices for all SCCs (n = 29) as input and specified that three signatures should be extracted (rank = 3). The contribution of the 96 possible trinucleotide substitution types to the extracted signatures was visualized using the function “plot_96_profile” (Supplementary Fig. 1f). The relative contribution of the extracted signatures to the mutational catalog was used for the PCA (Supplementary Fig. 1g). We used the function “fit_to_signatures” to assess how well the mutational catalog of our SCCs can be expressed as a combination of the previously identified cancer signatures (http://cancer.sanger.ac.uk/cosmic/signatures). Whole-genome data were then expressed as SNVs/Gbase/SCC and replotted (Fig. 1f). For each signature, the number of mutations per SCC was plotted and the linear fit with donor age was used to calculate the year increase (Fig. 1g). Analyses in R were done using RStudio Version 0.98.1025. Signature data were analyzed using the following pipeline to obtain principal components of the 29 sequenced SSCs and previously published sequence data from human fibroblasts27: (i) data were normalized on a per clone base; (ii) normalized data were transformed into principal components (NB: these principal components are the ones shown in the plots); (iii) 100,000 permutations of k-means clustering algorithm (with k = 2) have been performed on data restricted to the first three principal components. This pipeline has been implemented using MATLAB, and all programs have been run with default parameters.

Genomic distribution of mutations

The distribution of somatic mutations across the human genome (Supplementary Fig. 5a) was visualized using the function “plot_rainfall” of the R-package MutationalPatterns16,50. The observed and expected numbers of mutations in the different genomic regions (Fig. 2b, c) were analyzed using the function “genomic_distribution”. Input data were the somatic mutations detected in SCCs, genomic regions with enough sequencing depth (15×) in every SCC, and the genomic regions that should be tested for enrichment or depletion of somatic mutations, all represented as GRanges objects52. Regulatory regions were extracted from Ensembl using the BioMart database “regulation”, dataset “hsapiens_regulatory_feature” and GRCh “37”. Promoter regions were extracted using the filter “regulatory_feature_type_name” set to “Promoter”. Promoter flanking regions were extracted using the filter “regulatory_feature_type_name” set to “Promoter Flanking Region”. Open chromatin regions were extracted using the filter “regulatory_feature_type_name” set to “Open chromatin”. Transcription factor (TF)-binding regions were extracted using the filter “regulatory_feature_type_name” set to “TF binding site”. To test for significant depletion or enrichment of somatic mutations in genomic regions, we used the function “enrichment_depletion_test” with the observed and expected number of mutations in the genomic regions as input. The results were visualized using the fuction “plot_enrichment_depletion”, with the significance test data as input. Exon data were extracted from Ensembl using the BioMart database “ensembl”, dataset “hsapiens_gene_ensembl” and GRCh “37”. Information on genes that were detected or not detected on RNA level in the SkM was extracted from the human protein atlas (http://proteinatlas.com). A list of genes detected in the SkM was obtained by concatenating the results from the search terms “tissue_specificity_rna:skeletal muscle;elevated AND sort_by:tissue specific score”, “tissue_specificity_rna:any;expressed in all”, and “tissue_specificity_rna:skeletal muscle;mixed” at http://proteinatlas.org. A list of genes not detected in the SkM was generated by the search term “tissue_specificity_rna:skeletal muscle;not detected” at http://proteinatlas.org. Based on Ensembl ID, we selected the subsets of all exons extracted from the BioMart database “ensemble” that were present in the list of genes detected in the SkM or in the list of genes not detected in the SkM.

The distribution of mutations in 173 muscle disease genes was tested against a null model as follows. SCCs were assigned to the young (n = 13) and old (n = 16) groups and all genes harboring a mutation were registered for each group. The total number of mutations with encode annotation to a gene (e.g., harbored in extended gene regions including exons, introns, and 5 kb upstream and downstream of each gene) was counted for each group. We sampled 104 times the same number of exons from the list of all exons (“hsapiens_gene_ensembl” previously described) proportionally to their length and then computed how many extracted extended gene regions were present in the list of muscle disease genes. P-value and Z-score were then evaluated by comparing the empirical number of mutations with the distribution of the null-models using one-sided t-test. Full gene length was obtained from Gencode, release 19.

Mutation load in specific genomic regions and gene sets

For the analysis of genes, promoters and enhancers actively expressed during myoblast to myotube differentiation (Fig. 3a) data were downloaded from the FANTOM5 project at http://fantom.gsc.riken.jp/data/ (human_permissive_enhancers_phase_1_and_2_expression_tpm_matrix.txt.gz and hg19.cage_peak_phase1and2combined_counts_ann.osc.txt.gz)53. Expression data were available at 8 different time points during the myoblast to myotube differentiation (CNhs13847-14585): day 00 (basal), day 01, 02, 03, 04, 06, 08, 10, and 12 upon exposure to differentiation stimuli. Exploiting different time points, two expression sets, “basal” and “upon differentiation”, were created. “Basal” corresponds to transcribed regions on day 00. For “upon differentiation”, transcribed regions at all time points were intersected with expressed regions on day 00 and the non-overlapping portion of transcribed regions at all time points was taken. To determine actively transcribed promoters and enhancers, we determined an arbitrary threshold based on surveyed region size (see Supplementary Fig. 8). For enhancers, all regions with an expression score ≥1 were included and the considered regions were enlarged +/−100 bp. For promoters, all regions with an expression score ≥30 were included and the considered regions were enlarged of an asymmetric window of 860 bp upstream and 100 bp downstream. For exons, gene names were derived from expressed regions reported in the promoter list and the exonic regions of each gene were derived from “hsapiens_gene_ensembl” previously described. Three different donor replicates were used and regions transcribed with intensity equal or above the threshold in at least one replicate were included. Size of surveyed regions was 4525728 and 32409968 bp for enhancers in “basal” and “upon differentiation”, respectively, 5335985 and 3903736 for promoters, and 17542615 and 11123995 for exons. Somatic SNVs and indels were annotated to the described regions and quantified for every SCC and the average of mutations in young and old SCCs was reported. When enhancer and promoter regions overlapped, mutations were attributed to either one based on manual investigation of the region.

For the analysis of the numbers of mutations in gene sets involved in muscle function (Fig. 4b), genes were derived as described in Supplementary Tables 7 to 9. Lists of exonic mutations in young and old clones were screened and every mutation that was annotated with a gene name included in the gene set was scored. The average number of mutations per gene was calculated by dividing the number of mutations in the gene set by the number of tested SCC (young n = 13, old n = 16) and the number of genes in the gene set.

Statistical analyses

Unless otherwise indicated, P-values were calculated using two-tailed distribution, two-sample unequal variance Student’s t-tests, with significance defined as P < 0.05 (*P < 0.05, **P < 0.01, ***P < 0.001). The results are presented as the mean ± standard error of the mean (SEM). All calculations were performed using the GraphPad Prism software. The linear fits between mutation numbers and age shown in Fig. 1 were obtained using a robust mixed model where the dependent variable is the number of mutations or a given mutational signature, the fixed effect is age, and the random effect is the individual. Bonferroni correction for the number of tested signatures was applied in Fig. 1f. Analyses were performed in R.

Data availability

The sequence data that support the findings of this study are available within the article and its Supplementary Information (Supplementary Data 1–3). Raw sequence data are only available from the corresponding author upon request and will be made available to research projects that fulfill the informed consent, owing to regulations pertaining to the authors' deposition of these data in public repositories.