Participants

The cohort of 1806 total participants used in this study is summarised below. This study was approved by the human research ethics committees of Macquarie University (5201600387) and Sydney South West Area Health Service. Samples from ALS patients, family members, and unrelated controls were obtained from the Macquarie University Neurodegenerative Diseases Biobank, Molecular Medicine Laboratory at Concord Hospital, and the Australian MND DNA bank. Written informed consent was obtained from all study subjects and all methods were performed in accordance with the relevant guidelines and regulations. Most participants were of European descent and patients were clinically diagnosed with definite or probable ALS based on El Escorial criteria67. Genomic DNA was extracted from peripheral blood using standard protocols. RNA was extracted from peripheral blood with the QIAsymphony PAXgene blood RNA kit (Qiagen, Hilden, Germany).

Twin/triplet cohort

Three ALS discordant monozygotic twin pairs, one ALS discordant MZ triplet set and two control MZ twin pairs were included in this study (Fig. 1 and Table 1). Monozygosity for each twin/triplet set was confirmed using STR fragment analysis and/or SNP microarrays. Longitudinal samples were available from two twin sets (male and female sporadic ALS (SALS) twin sets in Fig. 1). The four discordant twin/triplet sets had previously undergone mutation analysis for known ALS genes and whole genome analysis for novel and/or rare de novo variants.

Data processing and extended cohorts

Additional samples were used in this study for data processing (Illumina HumanMethylation 450K, EpiTYPER methylation assays, and RNA-Seq) and for examination of significant findings (Illumina HumanMethylation 450K assay and RNA-Seq). Demographic characteristics between cases and controls in extended cohorts were assessed with t-tests for age and χ2 tests for sex.

For quality control and processing of the EpiTYPER data, 279 samples with C9orf72 EpiTYPER data (158 familial ALS/FTD samples, 56 asymptomatic samples (individuals harbouring a causal gene mutation but currently unaffected), and 65 control samples), and 261 samples with SOD1 EpiTYPER data (123 familial ALS, 65 asymptomatic, and 73 control samples) were used.

For the Infinium HumanMethylation 450K BeadChip, 1658 samples were used in data processing and normalisation. This comprised 889 individuals with sporadic or familial ALS, 92 asymptomatic and 668 controls. The familial ALS and asymptomatic cases largely overlap with the EpiTYPER cohort. The extended cohort subset comprised 650 sporadic ALS individuals and 539 unrelated controls.

One hundred and ninety samples were used for data processing and normalisation of the RNA-Seq data, comprising 114 individuals with ALS (99 sporadic ALS, 15 familial ALS) and 76 unrelated controls. The validation subset comprised of 96 sporadic cases and 69 controls. The majority of the 96 validation sporadic ALS cases were also present in the HumanMethylation 450K BeadChip SALS/control cohort.

Methylation assays and data processing

All quality control and data processing steps were carried out in R v 3.4.468.

EpiTYPER assay

Custom EpiTYPER assays (Sequenom, San Diego, USA) were used to quantify CpG methylation of 56 and 39 CpG units respectively of the two gene-associated CpG islands for C9orf72 and the gene-associated CpG island upstream of SOD1. EpiTYPER uses base-specific cleavage of bisulphite-converted DNA and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MADL-TOF MS) to quantify DNA methylation69. Primers for overlapping amplicons were designed with Sequenom’s EpiDesigner software to target the CpG island regions, and therefore the promoter regions, as shown in Supplementary Fig. S1. Primer and assay details are available in Supplementary Table S1. Samples were assayed in one or two batches, and either in duplicate or as singletons (Supplementary Table S1). Sample processing was performed by Agena Bioscience (Brisbane, Queensland, Australia). As each gene assay was run across several plates of samples, the highly methylated DNA control was used to calculate the between-plate coefficient of variation, determined to be 4.9% and 2.3% for SOD1 and C9orf72 plates, respectively. CpG methylation was quantified as the percentage of methylated cytosines for each CpG unit, where CpG units consist of one or more CpG sites. For units with multiple CpG sites, methylation percentages were normalised by averaging across the number of sites.

EpiTYPER data processing

EpiTYPER data processing was adapted from a previously established method70. Twin samples were processed together with the full familial cohorts to leverage the increased sample size. In brief, CpG units that failed to meet assay reliability standards were discarded and samples in duplicate were averaged for the remaining CpG units. Given the relatively small number of CpG units remaining after removal of those determined to be unreliable and the relatively high failure rate of samples and units, a two-step sample/unit filtering process was used. First, failed samples, with ≥90% of CpG unit readings missing, were removed, followed by CpG units which were missing data for ≥90% of samples. Second, samples with a low detection success (missing data for ≥15% of units) were removed, and the same threshold applied to remove CpG units with low detection success (units missing data for ≥15% of samples). Finally, any remaining missing values were imputed with the mean for that unit. Following data processing and filtering, 28 of 56 and 23 of 39 CpG units (for C9orf72 and SOD1, respectively) remained for analysis.

Infinium Human Methylation 450K v1.2 BeadChip array

Genome-wide methylation was investigated using the Infinium HumanMethylation 450K v1.2 BeadChip (Illumina, San Diego, USA). This microarray provides qualitative methylation values for approximately 480,000 CpG sites distributed throughout the genome. Bisulphite-converted DNA was hybridised to the Infinium HumanMethylation 450K BeadChip. Fluorescence imaging of the BeadChip using an Illumina HiScan SQ scanner successfully generated raw Intensity Data files (.idat) for all samples.

450K data processing

Data processing of the .idat files was adapted from the method presented by71. Twin samples were processed together with the full cohort to leverage larger sample sizes. All default settings were used except where otherwise specified. In brief, samples with less than 99% of CpGs detected were removed. shinyMethyl (v. 1.12.072) was used to visually identify possible outliers, with confirmation of sex queries using RnBeads (v. 1.0.0.73). Samples with any possibility of incorrect identification were removed. Data were normalised with the dasen function from wateRmelon (v. 1.20.374). Probes that had failed to be detected (threshold p > 0.05) with the minfi (v. 1.22.175) function detectionP were removed (n = 10270 probes). Normalised data were submitted to Horvath’s online DNAm age calculator28. Samples that did not strongly correlate (r < 0.85) with the DNAm age results gold standard were removed. Leveraging technical replicate/duplicate samples (n = 30), both 1) in the form of multiple blood collections at the same time and resulting independent DNA extractions (technical replicates) and 2) multiple aliquots of single DNA extractions (duplicates), a custom filtering step was included to identify and remove highly variable probes. Any probe identified to have multiple pairs of technical replicate or duplicate samples with differences greater than three standard deviations from the probe’s mean difference was discarded (n = 38697). Of the remaining probes, any known to cross hybridise, be located on sex chromosomes, or bind to SNPs, were removed (n = 50362)76.

Following raw data processing, quantitative CpG methylation values for 1215 samples (including 34 twin samples from Table 1 and 1179 case/control extended cohort samples outlined in Supplementary Table S2) and 386183 probes remained for analysis and examination. Comparison of the case-control extended cohort (Supplementary Table S2) showed that sex (\({\chi }_{(1,{\rm{n}}=1179)}^{2}\) = 33.8, p < 0.01) and age (t 1174 = 4.20, p < 0.01) were significantly different between ALS cases and controls.

Analysis of methylation data

All statistical analyses were carried out in R v. 3.4.468.

Gene-specific targeted methylation analysis of SOD1 and C9orf72 in the FALS twin/triplet sets

Methylation of SOD1 or C9orf72, as quantified by both EpiTYPER and 450K assays, was visualised in the relevant monozygotic disease discordant twin/triplets. Four and five 450K CpGs were available in the post processing data set in the targeted region of C9orf72 (cg05990720, cg11613875, cg14363787, cg23074747) and SOD1 (cg16086310, cg17253939, cg18126791, cg19948014, cg26893544), respectively. Since only one twin set and one triplet set are available in our cohort for each respective variant, results are descriptive only.

Observed differences in DNA methylation age, blood cell composition and global methylation within a twin/triplet set

DNA methylation age was determined from 450K methylation data using the method of Horvath28. Blood cell proportions in whole blood derived methylation was estimated from 450K methylation data with the minfi implementation of Houseman et al.’s algorithm29. Global methylation levels were determined as the mean methylation estimate across all post-processing 450K CpG sites per sample. CpG sites were also divided into one of four categories based on HIL CpG classes (high-density CpG island (HC), intermediate-density CpG island (IC) and non-island (LC); ICshore, intermediate-density CpG island shore that borders HCs)77,78 and the mean methylation for each was calculated.

Methylome-wide analysis in MZ sets to identify differentially methylated probes

The list of differentially methylated probes (DMP) across all MZ sets was identified using an established ranked magnitude-significance method79. In brief, statistical significance per CpG site was determined using a paired t-test on methylation M-values, using the per-patient mean of longitudinal samples and unaffected triplets. The magnitude of the difference in methylation was calculated as the mean difference in β-methylation between co-twins. Both methods were used to rank all CpGs, and a final ranked list was determined from the mean of these two ranking methods. Top DMPs were the subset of all CpG probes that met the following two criteria, 1) they were in high CpG density regions of the genome and 2) the ranked list of high-density probes was truncated immediately prior to the first probe to show a difference in the direction of change across the four discordant MZ sets. The ability of these probes to discriminate between ALS and healthy individuals was assessed by hierarchical clustering and principal components analysis of all twin/triplet sets.

Within-twin/triplet set DMPs were also identified. A CpG probe was considered to be differentially methylated within a twin/triplet set where there was an absolute difference in β-methylation ≥0.25 between the affected twin and their unaffected co-twin/triplet.

Examination of identified twin DMPs in a sporadic ALS cohort

Twin/triplet DMPs were examined in the larger sporadic case-control cohort. Differences between cases and controls for each of the identified probes were analysed, along with the ability of the DMP list to cluster cases and controls separately.

Gene expression

RNA sequencing

Raw sequencing reads in fastq format were generated for male SALS twins (based on longitudinal sample availability) and the sporadic case-control validation cohort as outlined in Supplementary Table S3.

RNA-Seq data processing

The quality of raw sequencing reads was evaluated using fastQC (v 0.11.780) for both datasets. Trimming and alignment was performed as outlined in Table 1 using either Trimmomatic (v. 0.3681) or Cutadapt (version 1.8.182) and HISAT2 (v2.0.583). All subsequent data processing and analysis was completed in R (v. 3.4.4), using BioConductor packages edgeR (v. 3.18.184) and limma (v. 3.32.1085). A standard edgeR TMM normalisation and filtering pipeline was used in data processing, with only those genes where expression was greater than 0.3 counts per million in a minimum of 3 samples (male SALS twins) or 2 counts per million in a minimum of 75 samples (case-control cohort) retained for analysis, which is equivalent to approximately 12–15 raw counts in the smallest library size for each dataset. For the male SALS twins RNA-Seq data, of the 27685 human genes present in the per-gene read counts generated by HTSeq86, 13718 genes remained following raw data processing using edgeR84. Whereas in the case-control cohort, of the 23368 human genes present in the per-gene read counts generated by HTSeq86, 7354 genes remained following raw data processing using edgeR84. MDS (multi-dimensional scaling) indicated the presence of three outliers in the case-control cohort, 1 control and 2 SALS samples. All three were removed and final clinical details for the cohort can be found in Supplementary Table S4. Comparison of the RNA-Seq case-control validation cohort (Supplementary Table S4) showed that there were no significant differences in age (t 157.9 = 1.74, p = 0.08) between the ALS cases and controls, but a difference was observed in sex between cases and controls (\({\chi }_{(1,{\rm{n}}=165)}^{2}\) = 6.5, p = 0.01).

Differentially expressed genes in MZ twins

To identify differentially expressed genes (DEGs) using the paired longitudinal RNA-Seq samples from the ALS-discordant male SALS twins (Table 1), read count data was analysed using limma85, including model terms for longitudinal sample collection and disease status. Voom87 transformation was applied prior to modelling. Multiple testing correction using the BH-FDR method88 was applied to the full list of post-processing genes.

Validation of twin DEGs in a sporadic cohort

Genes identified in twin analyses were investigated for an effect of disease in the full case-control cohort with limma, including sex as a covariate. Data were voom transformed, given the highly variable library sizes. Multiple testing corrections using the BH-FDR method was applied only on the subset of genes identified as differentially expressed in twins. Hierarchical clustering and principal components analysis of the expression of these DEGs in the case-control cohort was assessed.

Combined methylation and expression analysis

Intersect of top CpGs and genes

To identify genes most likely to be altered in disease, results from independent analysis of genome-wide methylation and expression data sets were integrated. Longitudinal RNA-Seq data is only available for one male SALS twin set, therefore we first identified the overlap between top DEGs and the genes annotated to the most differentially methylated probes within that twin set. We extended this analysis by overlapping the same list of DEGs with the genes annotated to the top differentially methylated probes across all combined twin sets.

Gene Ontology analysis

Gene Ontology enrichment analysis89,90 for biological processes was applied to the genes identified as differentially expressed in male SALS twins. The gene list was analysed with PANTHER overrepresentation tests (GO Ontology database release 2018-08-0991). Enrichment was tested relative to all genes detected in the appropriate post-processing data set. Fisher’s exact test with FDR correction was used.

Statistics

All analyses were carried out in R (v. 3.4.4)68. Linear mixed effects models were used to analyse DNAm age, blood cell type proportions and global mean M-methylation. Modelling was carried out using the lmer function in the package lme4 (v. 1.1.1492) for DNAm age and mean methylation, while a mixed effects beta regression for cell type proportions was applied with the glmmTMB function from the glmmTMB package (v. 0.2.2.093). Blood cell type proportions were increased by 0.001 to all estimates to avoid taking the log of zero. All mixed models assessed the effect of disease status while controlling for age at sample collection and sex. When analysing DNAm age, the interaction of disease and age at collection was also tested. Random effects were introduced for repeated sampling within co-twins, and a random intercept per twin/triplet set. When modelling cell types, due to convergence issues, the random slope for repeated sampling was dropped, leaving random intercepts for each co-twin and twin set. Likelihood ratio tests were used to determine significance of model terms. Linear models were used for case-control examination of probes identified in the MZ cohort, with the same fixed effect terms of age at sample collection and sex as described for mixed models. Hierarchical clustering utilised the package cluster (v. 2.0.6), with Manhattan distance and ward clustering methods for 450K data94, and Spearman correlation distance and average linkage clustering for log-transformed RNA-Seq count data95. Where appropriate, technical replicates are shown as means with error bars indicating standard deviation (unless otherwise stated).