Cell culture

HEK293T, MDA-MB-231, and THP-1 cells were obtained from ATCC. HEK293T cells were cultured in Dulbecco's Modified Eagle's Medium (DMEM, D5671, Sigma) supplemented with 10% fetal bovine serum (FBS), 2 mm l-glutamine, 100 U/ml penicillin, and 100 µg/ml streptomycin. MDA-MB-231 and THP-1 cells were cultured in Leibovitz’s L-15 Medium (11415064, Gibco) and RPMI-1640 Medium (52400025, Gibco), respectively, both supplemented with 10% FBS, 100 U/ml penicillin, and 100 µg/ml streptomycin. For passaging or harvesting, HEK293T and MDA-MB-231 cells were first detached by incubating with 1X TrypLE™ Express (Life Technologies) for 2 min at 37 °C.

TNFα stimulation

HEK293T, MDA-MB-231, and THP-1 cells were seeded in six-well plates at densities of 0.7 million cells, 0.5 million cells, and 1 million cells per well, respectively. After 2 days, cells were stimulated with TNFα (R&D Systems) at 10 ng/ml. Aliquots were collected for analysis at 0, 5, 10, 15, 30, and 60 min (stimulation was performed in reverse order to enable simultaneous harvesting of all conditions). At 20 min before harvesting, IdU was added to the medium at the final concentration of 10 μm. At 2 min before harvesting HEK293T and MDA-MB-231 media were replaced with 1X TrypLE to induce detachment. At the time of harvest, paraformaldehyde (PFA, Electron Microscopy Sciences) was added to the cell suspension at a final percentage of 1.6%, and samples were incubated at room temperature for 10 min. Crosslinked cells were washed twice with cell staining media (CSM, PBS with 0.5% BSA, 0.02% NaN 3 ). After removal of supernatant, ice-cold methanol was used to resuspend the cells, followed by a 10 min permeabilization on ice or long-term storage at −80 °C.

Immunofluorescence and three-dimensional reconstruction

CultureWell™ chambered coverglass wells (16-well, Thermo Fisher Scientific) were pre-coated by incubation with 10 μg/ml bovine plasma fibronectin (Thermo Fisher Scientific) at 37 °C for 1 h. MDA-MB-231 cells were then seeded at a density of 1500 cells per well. On the second day, 4% PFA was used to crosslink the cells at room temperature for 20 min. The slide was then washed with PBST (0.5% Tween 20 in PBS) three times, and cells were subsequently permeabilized for 5 min with 0.1% TritonX-100 (diluted in PBS) at room temperature. After washing with PBST three times, cells were incubated in blocking buffer (10% goat serum diluted in PBST) for 30 min at room temperature. Primary antibody (anti-GAPDH, 6C5, Thermo Fisher Scientific, 1 μg/ml; anti-Rab7, D95F2, Cell Signaling Technology, 1:100; or anti-β-actin, D6A8, Cell Signaling Technology, 1:200) was added, and slides were incubated overnight at 4 °C. Secondary antibody (goat anti-mouse Alexa Fluor® 488, 1:200 or goat anti-rabbit Alexa Fluor® 555, 1:200), supplemented with Hoechst 33342 at a final concentration of 100 μg/ml, was applied, and slides were incubated for 1 h at room temperature. Antibodies were diluted in the blocking buffer. Slides were washed three times with PBST after each incubation step. For cell-volume analyses, cells were stained for total proteins with Alexa Fluor® 647 NHS ester (Thermo Fisher Scientific) for 10 min at a final concentration of 1 μg/ml. Slides were then mounted with ProLong Gold Antifade Reagent (Life Technologies) before imaging with a CLSM Leica SP5 microscope. Stacks were imaged every 0.5 μm, and the three-dimensional reconstruction and quantification of the total cell volume was performed with Imaris 7.7.2.

Antibody conjugation

Isotope-labeled antibodies were generated with MaxPAR antibody conjugation kit (Fluidigm) using the manufacturer’s standard protocol. The antibody yield was determined based on absorbance of 280 nm. Candor PBS antibody stabilization solution (Candor Bioscience GmbH) was used to dilute antibodies for long-term storage at 4 °C.

Barcoding and staining protocol

Formalin-crosslinked and methanol-permeabilized cells were washed three times with CSM and once with PBS. Cells were incubated in PBS containing barcoding reagents (105Pd, 106Pd, 108Pd, 110Pd, 113In, 115In, and 139La) at a final concentration of 50 nm for 30 min at room temperature and then were washed three times with CSM16. Barcoded cells were pooled and stained with the metal-conjugated antibody mix (Supplementary Table 1) at room temperature for 1 h. The antibody mix was removed by washing cells three times with CSM and once with PBS. For DNA staining, iridium-containing intercalator (Fluidigm) was diluted in PBS with 1.6% PFA and incubated with the cells at 4 °C overnight. On the day before measurement, the intercalator solution was removed, and cells were washed with CSM, PBS, and doubly distilled H 2 O sequentially. Total protein staining was performed with 25 μg/ml ASCQ_Ru (96631, Sigma) in 0.1 m NaHCO 3 solution for 10 min at room temperature. Cells were then washed with CSM, PBS, and doubly distilled H 2 O sequentially. After the last wash, cells were resuspended in doubly distilled H 2 O and filtered through a 70 μm strainer.

Mass cytometry analysis

EQTM Four Element Calibration Beads (Fluidigm) were added to the cell suspension at a 1:10 ratio (v/v). Samples were analyzed on a Helios mass cytometer (Fluidigm). The manufacturer’s standard operation procedures were used for acquisition at a rate of ~200 cells per second. After data acquisition, all .fcs files from the same barcoded sample were concatenated. Data were then normalized, and bead events were removed41. Doublets were removed, and cells were de-barcoded into their corresponding wells using a doublet-filtering scheme and single-cell deconvolution algorithm42. Subsequently, data were processed using Cytobank (http://www.cytobank.org/). Additional gating on the DNA channels (191Ir and 193Ir) was used to remove remained doublets, debris, and contaminating particulates. Manual gating was performed on IdU, cyclin B1, p-HH3, and p-RB to identify cell-cycle stages32.

CellCycleTRACER workflow

CellCycleTRACER requires as an input measurements of the four cell-cycle markers (namely p-HH3, p-RB, cyclin B1, and IdU) as well as measurements of cell volume, ideally using the ASCQ_Ru markers.

Data processing and cell-volume correction

To determine cell volume at a single cell level, we initially experimented with three housekeeping proteins, namely GAPDH, actin, and RAB7 in HEK293T, MDA-MB-231, and THP-1 cells. Although all three proteins highly correlate with the total cell volume (Supplementary Fig. 14), single-cell measurements from a mixed population of three different cell lines revealed a large degree of cell-line-specific variability that surprisingly involved these housekeeping proteins (Supplementary Fig. 15). This indicated that these housekeeping proteins cannot be used for cell-volume correction when heterogeneous populations are analyzed.

ASCQ_Ru (Supplementary Fig. 3a) is conventionally used in electrophoresis for the determination of protein abundance and has been reported to outperform other staining reagents with high sensitivity and large linear dynamic range of protein binding. Taking advantage of its additional fluorescent property, we validated ASCQ_Ru as a precise cell volume indicator using three-dimensional reconstruction of confocal images (Supplementary Fig. 3b−d)). Measurements across the three cell lines showed reduced cell line variability in comparison to housekeeping protein measurements (Supplementary Fig. 16).

CellCycleTRACER corrects the data (uploaded as.fcs files) on ASCQ_Ru to enable correction for cell-volume heterogeneity. Let \(j = 1, \ldots ,m\) be the quantified protein marker in the \(i = 1, \ldots ,n\) single cell. Let \(y_{i,j}\) denote the abundance of marker \(j\) in cell i in raw experimental data. At the same time, let \(v = 1, \ldots ,l\) denote the subset of the protein markers \(\left( {\left\{ v \right\} \subset \left\{ {{j}} \right\}} \right)\) that contain information on the total cell volume—in our case the ASCQ_Ru markers. During the cell-volume correction step, CellCycleTRACER first normalizes the raw cell-volume measurements \(y_{i,v}\) by dividing each marker by its mean value:

$$y_{i,v}^{\mathrm{norm}} = \frac{{y_{i,v}}}{{\frac{1}{n}\mathop {\sum }

olimits_{i = 1}^n y_{i,v}}}.$$

The raw measurements of all \(j = 1, \ldots ,m\) markers are then corrected for cell-volume variations by dividing \(y_{i,j}\) by the mean value of \(y_{i,v}^{\mathrm{norm}}:\)

$$y_{i,j}^{\mathrm{corr}} = \frac{{y_{i,j}}}{{\frac{1}{l}\mathop {\sum }

olimits_{v = 1}^l y_{i,v}^{\mathrm{norm}}}}.$$ (1)

Results of this process are shown in Supplementary Fig. 17. Comparison of measurements of the phosphorylated vs. total amount of proteins MEK1/2 and ERK1/2 before and after cell-volume correction indicated that the correction process was equally effective for both activated and total amounts of proteins (Supplementary Fig. 18). To avoid dividing by zero, CellCycleTRACER checks the data for zero values and, if found, substitutes zeros with the respective mean value. For more details on volume correction see Supplementary Note 2.

After cell-volume correction, selected channels of the raw measurements are transformed using the inverse hyperbolic sine function (asinh):

$$y_{i,j}^{\mathrm{trans}} = {\mathrm{asinh}}\left( {y_{i,j}^{\mathrm{corr}}} \right) = \ln \left( {\frac{{y_{i,j}^{\mathrm{corr}}}}{c} + \sqrt {\left( {\frac{{y_{i,j}^{\mathrm{corr}}}}{c}} \right)^2 + 1} } \right),$$ (2)

where the constant c, commonly referred to as the cofactor, is set to 5 according to the CyTOF community’s standard practice. Unlike the standard logarithmic function that is undetermined at zero values, asinh is linear around zero and becomes logarithmic beyond a threshold determined by the cofactor value. The overall effect of this transformation is to selectively compress large values, eliminating the typical long tails found in raw cytometry measurements. This results in a more symmetrical distribution that facilitates clustering and other machine learning analyses.

Cell-cycle phase prediction

After the volume correction, CellCycleTRACER classifies the single cells into discrete cell-cycle phases. To achieve this, we exploit the measurements of the four above-mentioned cell-cycle markers (IdU, p-HH3, cyclin B1, and p-RB) that are typically used in mass cytometry for manual cell-cycle gating. To eliminate possible biases introduced by variations in antibody concentrations and affinities, the data are standardized and set to have zero mean and unit variance. The prediction process is based on a hybrid approach that consists of two steps: (i) a classification step in which single-cell measurements of the four cell-cycle markers are given as input into a decision tree classifier to automatically predict the cell-cycle phase and (ii) a clustering step in which a Gaussian mixture model (GMM) is used to fit the data into clusters that represent the cell-cycle phases. The GMM is initialized using the predictions of the decision tree as prior knowledge.

Detailed descriptions of the implementation and performance on different data sets are given in Supplementary Note 3. In brief, we first used the four cell-cycle marker measurements from an experiment using THP-1 cells together with their class labels derived by manual gating (Supplementary Fig. 5a) to train a decision tree classifier. The resulting decision tree and the class proportions at the terminal nodes are shown in Supplementary Fig. 5b. We observed four pure terminal nodes, equal to the number of classes, indicating 100% classification accuracy in the training set. The order of the splits was identical to the order in the classification performed manually, indicating that the model faithfully captures the manual gating process. The decision tree accuracy in the independent test set was also 100%, meaning that all cells were correctly classified. After classification performance was validated, new experimental measurements were given as inputs and were automatically classified. The results on a HEK293T test set are shown in Supplementary Fig. 7a.

Second, the measurements were clustered using a GMM where the number of components was set to four, equal to the number of cell-cycle phases in our model. The parameters of the GMMs (mean vectors, covariance matrices, and class proportions) were initialized using the decision tree predictions and iteratively refined until convergence using an expectation-maximization (EM) algorithm (Supplementary Fig. 6). After convergence, posterior probabilities of each of the four GMM components were computed, and the single cells were assigned to the component with the maximal posterior (results in Supplementary Fig. 7b).

This hybrid approach combines the intrinsic interpretability of decision trees, which enable extraction of a set of comprehensive if−else rules from the training data, with the probabilistic capabilities of the GMM framework. Specifically, decision trees are ideal for partitioning of a space using training data as prior knowledge, but they lack the notion of distribution and suffer from rigid boundaries. Since mass cytometry data are produced from different cell lines and possibly on different days, these data exhibit inter-experimental variability that makes the algorithm prone to misclassifying the points at the tails of the distribution. GMMs, on the other hand, allow flexible treatment of outliers. Shortcomings of GMMs—and other unsupervised clustering methods—are an inability to match the clusters to the known labels and no guarantee of convergence to the optimal solution. These limitations are especially acute when the classes are significantly imbalanced, as in the case of cell-cycle fractions which differ by an order of magnitude (e.g., G0/G1: 40–60%; M phase: 3–5% of the total cell population). By combining decision tree and GMM approaches we benefit from the advantages of decision trees to provide an initial guess close to the optimal solution and of GMMs to allow for a probabilistic interpretation of the class assignments. This refinement translates into better assignments for outliers and captures the classification uncertainty of cells transitioning between phases, a subtlety that is entirely missed by the decision tree.

Trajectory reconstruction, alignment, and correction

Cells progress along the cell cycle in a continuous way, gradually transitioning across consecutive phases whose boundaries are not always clearly defined, and exhibiting intra-phase variability (e.g., cells at early S and late S are drastically different). To better represent these pseudo-temporal fluctuations, we devised a method that reconstructs trajectories of biological cell-cycle time (pseudotime) from a population of unsynchronized single cells, ordering them according to cell-cycle progression. The details of the reconstruction method are given in Supplementary Note 4.

We assume that n single cells are classified in four cell-cycle phases. Let \(y_i\) denote a four-dimensional vector of cell-cycle marker abundances in each cell. We seek to construct a one-dimensional embedding function of the four-dimensional vectors\(y_i\), denoted as \(f_\alpha \left( y \right)\), that represents pseudotime. One possible choice is to define \(f_a(y)\) as a linear combination of \(y_i\):

$$f_\alpha \left( {y_i} \right) = \mathop {\sum }\limits_{j = 1}^4 \alpha _jy_{i,j},$$

where the coefficients α j take values in \({\Bbb R}_{ \ge 0}^4\). Under this formulation, our problem reduces to identifying a vector of coefficients \(\alpha = \left( {\alpha _1\,\alpha _2\,\alpha _3\,\alpha _4} \right)\) that optimally maps the cell-cycle marker measurements to pseudotime. Since the ordering of the discrete classes is known a priori (G1→S→G2→M), we follow an optimization process that aims to guarantee this ordering in the desired embedding by minimizing the difference across cells that belong to adjacent classes (see also Supplementary Fig. 8). More specifically, for all cells \(i_p,i_q\) belonging to adjacent classes p,q, we estimate α such that \(f_\alpha \left( {y_{i_p}} \right) < f_\alpha \left( {y_{i_q}} \right)\), a constraint that translates into preserving the ordering in the embedding.

Collapsing the four-dimensional measurements into a lower-dimensional space may not result in fully separated clusters, with the implication that the ordering constraints might not be satisfied for all cells. To tackle this problem, we introduced slack variables into all constraints; these non-negative variables represent a degree of violation of the ordering constraint. We then minimize over a weighted sum of all slack variables, which leads to a mathematically well posed linear programming (LP) problem. Even though degenerate LPs can have multiple equivalent optima (convex set of optimal solutions), due to the presence of extrinsic and intrinsic variability in CyTOF data this does not occur in practice. Thus, the LP results in a single, optimal ordering. Since the solution time of the resulting LP grows substantially with the number of cells (Supplementary Fig. 19a) and can thus be computationally intensive, we randomly picked a fixed percentage of cells from each class and computed an optimal set of weights. A numerical investigation (Supplementary Fig. 19b) indicated that the solution of the LP, which yields the values of parameters α, is robust to the sampling of cells even when a small percentage of cells is considered.

Once the values of parameters α are estimated, CellCycleTRACER visualizes the results by ordering the single cells based on their pseudotime values, resulting in single-cell trajectories for each marker. Additionally, CellCycleTRACER computes the mean trajectory of each marker by applying a mean filter on the single-cell trajectory, where the value for each cell is replaced by the mean of the neighboring cells in a sliding window of fixed size. Since different samples (e.g., different cell lines) can exhibit strong variations in the relative duration of the cell-cycle phases, CellCycleTRACER permits multi-sample analysis by either aligning the relative cell-cycle phase proportions across individual samples to the mean vector of cell-cycle phase proportions or, alternatively, aligning them to one sample (e.g., one cell line) of interest (Fig. 1f, Supplementary Note 5). To remove the effect of the cell cycle on the marker measurements, CellCycleTRACER exploits the abovementioned mean trajectory, rescales it around 1 by dividing by the mean abundance of the marker and then divides the single-cell trajectory by the rescaled mean. This step removes cell-cycle-specific fluctuations and redistributes the single cells independently of cell-cycle variation (Fig. 1g, Supplementary Note 6). After every step of the analysis (e.g., cell-cycle classification, correction, alignment), data can be exported as .fcs files for further analysis.

Implementation

All methods were implemented using the Statistical and Optimization Toolboxes of MATLAB R2011b.

Software availability

CellCycleTRACER is implemented as a web application, accessible using the following link: https://www.zurich.ibm.com/compsysbio/publications.html.

Data availability

CyTOF data for the three cell lines at all stimulation time-points are available on Cytobank under project 1129.