In this section, we describe: 1) how the data are processed (including methylation microarray and sequencing data); 2) the implementation of CancerLocator; 3) how the simulation data are generated while taking into account copy number aberrations; 4) how the training and testing data are split; and 5) what measures we use to evaluate performance.

Methylation data collection and processing

Data collection

We collect a large set of public methylation data of solid tumors and plasma cfDNA samples taken from both healthy people and cancer patients. The majority of tumor methylation profiles in TCGA were assayed using the Infinium HumanMethylation450 microarray. We collect those data for solid tumors with >100 samples from five different organs: 681 samples of breast (BRCA), 290 samples of colon (COAD), 522 samples of kidney (including 300 KIRC and 156 KIRP samples), 169 samples of liver (LIHC), and 809 samples of lung (including 450 LUAD and 359 LUSC samples) cancer.Footnote 2

The public methylation data of plasma cfDNA samples are from Chan et al. [19] and Sun et al. [21]. The two datasets include the WGBS data of plasma samples taken from 32 normal people, eight patients infected with HBV, 29 liver cancer patients, four lung cancer patients, five breast cancer patients, and a number of patients with tumors in organs without a large blood flow. We also generated WGBS data from plasma samples collected from eight cancer patients (five early-stage lung cancer patients, one late-stage lung cancer patient, two lung cancer patients with unknown stage information) and one patient with a benign lung tumor. We used only the normal, HBV, and breast/liver/lung cancer patients in our study, for a total of 87 plasma samples. Note that these public WGBS data have very low sequencing coverage (~4× on average), while the coverage of our newly generated data for all nine samples is around 10×.

Human subjects

The blood samples of eight lung cancer patients and one benign lung tumor patient were collected. The demographic and clinical features of the patients profiled are presented in Additional file 1: Table S2.

Cell-free DNA isolation and whole-genome bisulfite sequencing

Blood samples were centrifuged at 1600 × g for 10 minutes and then the plasma was transferred into new microtubes and centrifuged at 16,000 × g for another 10 minutes. The plasma was collected and stored at −80 °C. cfDNA was extracted from 5 ml plasma using the Qiagen QIAamp Circulating Nucleic Acids Kit and quantified using a Qubit 3.0 Fluoromter (Thermo Fisher Scientific). Bisulfite conversion of cfDNA was performed using a EZ-DNA-Methylation-GOLD kit (Zymo Research). After that, an Accel-NGS Methy-Seq DNA library kit (Swift Bioscience) was used to prepare the sequencing libraries. The DNA libraries were then sequenced with 150-bp paired-end reads.

Building features (CpG clusters)

The Infinium HumanMethylation450 microarray data from TCGA measure all solid tumor samples at ~450,000 CpGs. Since our testing sample [19] comprises WGBS data with very low sequencing coverage, we grouped the CpG sites into CpG clusters in order to use more mappable reads. For a CpG site covered by a probe on the microarray, we define the region 100 bp up- and downstream as its flanking region and assume that all CpG sites located within this region have the same average methylation level as the CpG sites covered by probes. Two adjacent CpG sites are grouped into a CpG cluster if their flanking regions overlap. Finally, only those CpG clusters containing at least three CpGs covered by microarray probes are used in this study. We choose the size of the flanking region and the number of CpGs in a cluster according to three criteria: (i) at least three CpG sites (in the microarray data) are included to obtain a robust measurement of methylation values in the solid tumor samples; (ii) the cluster is reasonably sized, so that there are sufficient CpG sites to calculate the methylation values, even when low coverage sequencing data are used; (iii) keep as many clusters that span within a type of genomic region (either CpG islands or shores) as possible. This procedure yielded 42,374 CpG clusters, which together include about one-half of all the CpG sites on the Infinium HumanMethylation450 microarray. Most of these clusters are each associated with only one gene. These CpG clusters are used for subsequent feature selection.

Methylation microarray data processing

The microarray data (level 3 in TCGA database) provide the methylation levels of individual CpG sites. We define the methylation level of a CpG cluster as the average methylation level of all CpG sites in the cluster. A cluster’s methylation level is marked as “not available” (NA) if more than half of its CpG sites do not have methylation measurements.

WGBS data processing

Bismark [24] is employed to align the reads to the reference genome HG19 and call the methylated cytosines. After the removal of PCR duplications, the numbers of methylated and unmethylated cytosines are counted for each CpG site. The methylation level of a CpG cluster is calculated as the ratio between the number of methylated cytosines and the total number of cytosines within the cluster. However, if the total number of cytosines in the reads aligned to the CpG cluster is less than 30, the methylation level of this cluster is treated as NA.

Feature filtering

For each CpG cluster, we used the methylation range (MR) to indicate a feature’s differential power between classes. We first obtained the average methylation level of all samples from each class (i.e., healthy plasma or each tumor type), then defined MR as the range of this set of mean values (i.e., the difference between the largest and smallest mean values). The higher the MR of a cluster is, the more differential power it has. Finally, we selected those CpG clusters whose MRs were no lower than a threshold.

Statistical inference of the ctDNA burden and tissue of origin

A mixture model of methylation levels of plasma cfDNAs

The cfDNA in the plasma of cancer patients can be regarded as a mixture of normal background DNA and tumor-released DNA. Formally, for each CpG cluster k ∈ {1, 2, ⋯, K}, the methylation level x k of the plasma cfDNA from a given patient can be approximated as a mixture of v k and u k , which are the methylation levels of the normal plasma sample and the solid tumor tissue, respectively. Let θ ∈ (0, 1) denote the proportion of tumor-derived DNAs in plasma cfDNA. Then x k can be expressed as the weighted sum of v k and u k , i.e., x k = (1 − θ)v k + θu k .

We assume that an individual carries at most one type of tumor among the T possible tumor types. Let t ∈ {0, 1, 2, ⋯, T} be the variable representing either normal plasma (t = 0) or a tumor type (1 ≤ t ≤ T). For each CpG cluster k, we model its methylation level in a sample of type t as a Beta distribution: v k ~ Beta(α k0 , β k0 ) for normal plasma samples (t =0) and u k ~ Beta(α kt , β kt ) for solid tumor samples of type t ∈ {1, ⋯, T}, where α k0 and β k0 (α kt and β kt ) are the parameters of the beta model of methylation levels of CpG cluster k in normal plasma (solid tumor) samples. As illustrated in step 1 of Fig. 1, the parameters of these Beta distributions are estimated by the method of moments, using the large amount of public tumor data and normal plasma data.

By integrating the two Beta distributions (v k and u k ), as shown in Fig. 2, x k can be modeled by a derived distribution with the given ctDNA burden θ and source tumor type t. This model is denoted as the probability density function ψ(x k |θ, t), which is calculated by the convolution of Beta(α k0 , β k0 ) and Beta(α kt , β kt ). It is formally expressed as:

$$ \psi \left({x}_k\Big|\theta, t\right)={\displaystyle \underset{0}{\overset{1}{\int }}}{f}_{\mathrm{Beta}}\left(\frac{x_k-\theta {u}_k}{1-\theta}\Big|{\alpha}_{k0},{\beta}_{k0}\right){f}_{\mathrm{Beta}}\left({u}_k\Big|{\alpha}_{k t},{\beta}_{k t}\right)\; d{u}_k $$ (1)

where f Beta is the probability mass function of the Beta distribution.

Modeling the methylated cytosine count of plasma cfDNA sequencing data

Due to its low abundance in plasma, the methylation profile of cfDNA is usually measured by sequencing-based methods, and the methylation levels (x k ) of a CpG cluster k can be characterized by the numbers of methylated and unmethylated cytosines in the reads. Let M = (m 1 , m 2 , ⋯, m K ) and N = (n 1 , n 2 , ⋯, n K ) be the number of methylated cytosines and the total number of cytosines mapped to all CpG sites, respectively, where the index runs over all K CpG clusters. For each CpG cluster k, m k can be modeled by a binomial distribution: m k ~ Binomial(n k , x k ). By integrating the mixture model of x k in Eq. 1, we have the likelihood function for each CpG cluster k which has the inputs from the model parameters (θ, t, α k0 and β k0 , α kt , and β kt ) and the sequence measurements of plasma samples (m k , n k ):

$$ f\left({m}_k\Big|\theta, t,{n}_k\right)={\displaystyle \underset{0}{\overset{1}{\int }}}{f}_{\mathrm{Binomial}}\left({m}_k\Big|{n}_k,{x}_k\right)\psi \left({x}_k\Big|\theta, t\right)\; d{x}_k $$ (2)

where f Binomial is the probability density function of the binomial distribution.

Maximum-likelihood estimation of blood tumor burden and type

Given the methylation sequencing profile of a patient’s plasma cfDNA sample, the vectors M and N, we aim to find the maximum-likelihood estimate of two model parameters: a sample’s cfDNA tumor burden θ and its source tumor type t. For integrating the mixture models of multiple markers into the formulation, we adopted a commonly used assumption: all features or markers are independent of each other. This assumption has been widely used in a number of cell-type deconvolution studies [25, 26]. Under this assumption, the log-likelihood can be written as:

$$ \log \kern0.5em L\left(\theta, t\Big| M, N\right)={\displaystyle \sum_{k=1}^K}\kern0.5em \log \kern0.5em f\left({m}_k\Big|\theta, t,{n}_k\right) $$ (3)

Since the integrals in Eqs. 1 and 2 cannot be easily solved analytically, we use Simpson's rule to calculate the log-likelihood. That is, a set of J predefined θ values, \( \Theta =\left\{0,\frac{1}{J},\frac{2}{J},\dots, \frac{J-1}{J}\right\} \), is used to conduct a grid search for the best estimation (i.e., a global optimization solution). The higher the resolution (J), the more precise the estimation. After obtaining the solution (i.e., \( \hat{\theta} \) and \( \hat{t} \)) that maximizes Eq. 3, we use the estimated parameters to calculate a simple yet effective prediction score that answers two questions: “Does the patient have cancer?”; and “If the patient has cancer, which tumor type is it?” This prediction score is defined below:

$$ \lambda =\frac{1}{K}\left[ \log \kern0.5em L\left(\hat{\theta},\hat{t}\Big| M, N\right)- \log \kern0.5em L\left(\theta =0\Big| M, N\right)\right] $$ (4)

where the denominator K is used to normalize the log-likelihood, so that λ is comparable when using a different number of features. The variable t is not included in L(θ = 0|M, N) because θ = 0 indicates a normal plasma sample. The larger the prediction score λ, the higher the chance that the patient has a cancer tumor of type \( \hat{t} \). Specifically, if λ is greater than a threshold, the patient is predicted as having cancer with the ctDNA burden \( \hat{\theta} \) and the tumor type \( \hat{t} \); otherwise, he/she is classified as not having cancer.

Simulation data generation

We simulate the methylation sequencing data of a patient’s plasma cfDNAs using the previously described probabilistic models: (i) a mixture model that treats the cfDNA as a mixture of normal plasma cfDNA and DNAs released from primary tumor sites; and (ii) a binomial model for the methylated cytosine count of plasma cfDNA sequencing data. In addition, to make the simulation data more realistic, we incorporate CNAs and read depth bias. The procedure for simulating plasma cfDNA methylation sequencing data is detailed in the following sections.

Inputs

Inputs include: (i) the genomic regions of all K CpG clusters; (ii) the total number of cytosines (Z) on the sequencing reads that are aligned to any CpG cluster; (iii) the range of θ : (θ L , θ U ); (iv) the collections of normal plasma samples (denoted as POOL normal ) and solid tumor samples (denoted as POOL tumor ); and (v) b k , the background probability for a CpG dinucleotide to be aligned to CpG cluster k, satisfying ∑ k = 1 K b k = 1. The last input reflects the read-depth bias introduced during the sequencing process and read alignment and the density of CpG sites in the clusters. Refer to Additional file 1 for details of how to obtain b k .

Output

Output comprises a simulated methylation sequencing profile of a plasma sample, represented by the integer vectors M = (m 1 , m 2 , ⋯, m K ) and N = (n 1 , n 2 , ⋯, n K ). The elements m k and n k are the number of methylated cytosines and the total number of cytosines in the reads mapped to CpG cluster k, respectively.

Procedure

1. Generate a random ctDNA fraction θ from the distribution θ ~ Uniform(θ L , θ U ). 2. Generate a random integer copy number c k for each CpG cluster k, from the categorical distribution c k ~ Cat(6, p 0 , p 1 , p 2 , p 3 , p 4 , p 5 ). Here, p c denotes the probability of observing copy number c ∈ {0, 1, 2, 3, 4, 5} in the sequencing data. The probabilities p c satisfy three criteria: (i) their sum is equal to one, \( \begin{array}{l}{\displaystyle {\sum}_{\mathrm{c}=0}^5{\mathrm{p}}_{\mathrm{c}}=1}\\ {}\end{array} \); (ii) the average copy number is equal to two, ∑ 5 c = 0 c ∗ p c = 2; and (iii) extreme CNAs are less likely to occur. In this work, we predefine p 0 = 0.005, p 1 = 0.16, p 2 = 0.7, p 3 = 0.105, p 4 = 0.025, p 5 = 0.005. Note that the sum of all these probabilities except p 2 (30% in this case) is the probability of any given CpG cluster having a CNA event. We have tried other probability configurations for the simulation with more (50%) or fewer (10%) CNA events and obtained similar results (Additional file 1). No CNA event is considered (i.e., c k is fixed to two) when simulating a normal plasma sample. 3. Randomly select a normal plasma sample from POOL normal whose methylation profile is denoted by (v 1 , v 2 , ⋯, v K ), and randomly select a solid tumor from POOL tumor whose methylation level profile is denoted by (u 1 , u 2 , ⋯, u K ). Note that we also randomly select two normal plasma samples from POOL normal in order to simulate a new normal plasma sample. 4. Calculate the methylation level x k of plasma cfDNA at CpG cluster k. This is the adjusted linear combination of v k and u k after incorporating the copy number c k generated in step 2. That is, x k = (1 − θ k ')v k + θ k ' u k , where θ k ' is the adjusted value of θ given by \( {\uptheta}_{\mathrm{k}}^{\hbox{'}}=\frac{{\uptheta \mathrm{c}}_{\mathrm{k}}}{{\uptheta \mathrm{c}}_{\mathrm{k}}+2\left(1-\uptheta \right)} \). θ k ' describes the actual ctDNA fraction after considering the copy number c k of the ctDNA. 5. Generate a random number n k , representing the total number of cytosines in CpG cluster k, from the Poisson distribution n k ~ Poisson(ZB k ). B k is the adjusted CpG dinucleotide bias b k , given by \( {\mathrm{B}}_{\mathrm{k}}=\frac{{\mathrm{b}}_{\mathrm{k}}\left(1-\uptheta +{\uptheta \mathrm{c}}_{\mathrm{k}}/2\right)}{{\displaystyle {\sum}_{\mathrm{k}=1}^{\mathrm{K}}}{\mathrm{b}}_{\mathrm{k}}\left(1-\uptheta +{\uptheta \mathrm{c}}_{\mathrm{k}}/2\right)} \), after scaling with the copy number c k generated in step 2. 6. Generate a random number m k from the binomial distribution m k ~ Binomial(n k , x k ).

Due to the limited number of normal plasma samples, we also simulated new normal plasma samples by mixing two normal plasma samples at different mixture ratios. The procedure is the same as above except that step 2 is ignored by fixing all copy numbers as two because there are no CNA events in the normal plasma samples.

Performance evaluation

Data partitions for learning signature features, simulation, and real data experiments

All TCGA solid tumor tissues and plasma samples are divided into non-overlapping sets for three tasks: (i) learning discriminating features; (ii) simulation experiments; and (iii) testing on the real data. Specifically, as shown in Fig. 6, we split TCGA solid tumors of each tissue type into two partitions: 75% for learning signature features and 25% for generating simulation data. We also split all normal plasma samples into two partitions: 75% for learning signature features and 25% for generating simulation data or for real data experiments. All the plasma samples of the cancer patients are used to form the testing set in the real data experiments. Note that not these plasma samples, but only solid tumor samples collected from public methylation databases, and a subset of normal plasma samples that were not used for testing, were used for learning features. All data are randomly partitioned following the above proportions, and applying a method on one such partition is regarded as “one run”. For making the robust results, we repeat the experiments for ten runs and aggregate all predictions obtained in the ten runs into a single confusion matrix as the final result. Because we had a limited number of real cancer plasma samples (only 5, 12, and 29 cfDNA samples from breast, lung, and liver cancer patients, respectively) for testing, it would not allow the typical cross-validation for the method’s hyperparameter estimation. For fully utilizing the test samples for effective performance evaluation, we report only the best prediction results for each of three methods (CancerLocator, RF and SVM) after examining all possible values of each method’s hyperparameters. The only hyperparameter of CancerLocator is the threshold of the prediction score λ, which is set as 0.023 to generate the predictions on the real plasma samples. For consistency with the real data experiments, we apply the same strategies to simulation data experiments and calculate the error rate averaged over ten runs.

Fig. 6 Illustration of the data partition for learning discriminating features, in both simulation and real data experiments. Note that simulation and real data experiments share the same subset (25%) of normal plasma samples Full size image

Prediction performance measures

The error rate and accuracy are the most popular and established multi-class classification performance measures [27–29]. They are equivalent to each other. This study uses the error rate, which is defined as the percentage of incorrect predictions out of all predictions.