Data information: Boxplots: the horizontal line represents the median of the distribution, the lower and upper limits of the box indicate the first and third quartiles, respectively, and whiskers extend 1.5 times the interquartile range from the limits of the box. Values outside this range are indicated as outlier points. Related to Figs EV1 and EV2 and Datasets EV1, EV2, and EV3.

Analysis of protein abundance shift in cellular compartments in published datasets. Density plots for protein fold change distributions in different compartments for (B) lung vs. liver cells (Geiger), (C) hepatocytes vs. Kupffer cells (Azimifar), and (D) healthy human kidney vs. renal carcinoma cells (Guo). Distribution of protein fold changes for 10 different compartments is shown as boxplot (inset); mean fold changes (log 2) are indicated below the density plots; below the boxplot, asterisks mark the cellular compartments that show a significantly different distribution of fold changes compared to the whole proteome (Mann–Whitney test *P < 0.01).

Here, we show that in several published proteomics datasets proteins associated with different cell compartments/organelles show distinct distribution of fold changes across the conditions tested. This manifests as a consequence of underlying differences in cell morphology that are not taken into account by classical differential expression tools. Although such differences provide robust signals about different cell states and, as such, can be used as biomarkers, the non‐uniform distribution of fold changes can mask biologically relevant alterations in the composition of cell compartments/organelles. We thus propose an approach that is able to better reflect compartment‐specific protein changes (Fig 1 A), and we experimentally validate this by analyzing the proteomic changes identified in nuclei isolated from different cancer cell lines compared to their total lysate. Using our approach, changes in protein abundance identified in comparative proteomic studies can be re‐interpreted to better reflect the context of protein sub‐cellular localization, and to provide an additional level of detail about the biological differences between cellular states. To demonstrate this, we re‐analyzed a dataset of chronological aging in the nematode Caenorhabditis elegans and observed heterogeneous abundance changes among mitochondrial and extracellular proteins implying an age‐dependent remodeling of these cellular compartments. These novel biological insights were not apparent when traditional analysis methods were applied. Our approach is broadly applicable to large‐scale proteomic studies, and we anticipate analogous strategies to be derived for different context levels such as protein complexes and pathways.

Mass spectrometry‐based proteomics has been successfully used to determine sub‐cellular protein composition, discover new portions of the cellular interactome, and map post‐translational modifications. Different experimental strategies have been developed to perform quantitative experiments where differences in protein abundance are determined, for example, by introducing stable isotopes in one of the experimental conditions tested (Ong et al, 2002; Ong & Mann, 2005). A major factor for the interpretation of the experimental outcome is data processing (Park et al, 2003). The main data processing strategies used in proteomics, e.g., scaling by mean or median through quantile normalization, have been developed for microarray and RNAseq data, with the underlying assumption that the total level of mRNA in the cell is stable and does not differ significantly between the compared samples (Bolstad et al, 2003). For transcriptomics data, it has been shown that global changes in transcript levels, e.g., down‐regulation of all transcripts due to general inhibition of transcription, can introduce artifacts when standard differential analysis approaches are employed (apparent up‐ and down‐regulation of transcripts instead of, e.g., widespread down‐regulation; Jaksik et al, 2015). Similarly, profound morphological differences between cellular types or states can also influence the outcome of comparative genomic and proteomic analysis (Lin et al, 2012; Lovén et al, 2013). Lundberg et al (2008) showed that the majority of all proteins are expressed in a cell size‐dependent fashion and that the comparative analysis of the protein expression values requires a normalization procedure.
Additionally, it is known that different tissues show different levels of respiratory activity and variable amounts of mitochondria (Kirby et al, 2007; Fernández‐Vizarra et al, 2011) and that the number and composition of organelles can be affected, for example, by the aging process (Cellerino & Ori, 2017). Covariation of protein abundance across different conditions can also be exploited, and it can contribute to functional proteomics (Kustatscher et al, 2016). Currently, there is a lack of systematic approaches able to detect and deal with differences in cellular organization that might influence the outcome of proteomics data analysis from unfractionated samples.
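The normalization artifact described above for global down‐regulation can be illustrated with a minimal sketch. The fold change values are hypothetical and the centering function is a simplified stand‐in for median scaling, not the procedure of any specific tool: when every feature is truly down‐regulated, centering on the median reports part of them as up‐regulated.

```python
from statistics import median

def median_center(log2_fc):
    """Subtract the median so that the distribution is centered on zero."""
    m = median(log2_fc)
    return [x - m for x in log2_fc]

# Hypothetical log2 fold changes: every feature is truly down-regulated.
true_log2_fc = [-2.0, -1.5, -1.0, -0.5, -0.25]

# After median centering, the same data show apparent up- and down-regulation.
centered = median_center(true_log2_fc)
print(centered)
```

The same reasoning applies to a single compartment: if all of its proteins shift together, a proteome‐wide centering step cannot distinguish that shift from protein‐specific regulation.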

Results and Discussion

Compartment‐specific shifts in protein abundance are apparent in large‐scale proteomics datasets

We analyzed seven mass spectrometry datasets covering the proteomes of different mammalian tissues (Geiger et al, 2013), cell types (Azimifar et al, 2014; Sharma et al, 2015), healthy and diseased states (Wiśniewski et al, 2012; Guo et al, 2015; Tyanova et al, 2016), and cancer development stages (Wiśniewski et al, 2015). In these experiments, the abundance fold change (FC) of thousands of proteins has been calculated using standard differential analysis approaches (see Materials and Methods section). To investigate whether changes in organelle number or size are reflected in these datasets, we assigned cellular localization to each of the quantified proteins using Gene Ontology (GO) term annotation (The Gene Ontology Consortium, 2015). On average, 78% of the proteins in the analyzed datasets could be annotated using GO cellular compartment terms (Fig EV1A). We compared abundance changes of proteins belonging to four major cellular compartments (nucleus, cytoplasm, mitochondrion, and extracellular space) which, on average among the analyzed datasets, accounted for 96% of all the annotated proteins. In all seven datasets, we observed that proteins assigned to specific cellular compartments tend to display similar protein fold changes, indicating that their abundances are associated with each other. We therefore calculated protein fold change distributions for each cellular compartment and found statistically significant shifts between such distributions (Fig 1B–D).
We found a distinct increase in mitochondrial proteins in liver cells that can be readily captured from the comparison of lung and liver tissue proteomes (Geiger et al, 2013; average change of +1 log 2 FC, Mann–Whitney test P = 5.3 × 10−32; Fig 1B and Dataset EV1), consistent with the knowledge that hepatocytes have an elevated number of mitochondria as compared to other cell types (Veltri & Espiritu, 1990). Similar differences can also be detected between more closely related cell types deriving from the same organ. For instance, we found increased abundance of nuclear and extracellular proteins and decreased abundance of mitochondrial proteins in Kupffer cells vs. hepatocytes (Azimifar et al, 2014; average change of +0.4, +0.3 and −1.2 log 2 FC, Mann–Whitney test P = 3.3 × 10−22, P = 8.5 × 10−03, and P = 1.9 × 10−88, respectively; Fig 1C and Dataset EV2).

Figure EV1. Protein annotation and compartment fold change distribution for different proteomics datasets

A. Protein annotation in Gene Ontology (cellular component terms) for the seven datasets analyzed. Percentage of proteins with (i) no GO annotation, (ii) any GO cellular component annotation, (iii) annotation to ten major cellular compartments (nucleus, cytoplasm, mitochondrion, extracellular space, endoplasmic reticulum, Golgi apparatus, cell membrane, nuclear membrane, lysosome, and peroxisome), and (iv) annotation to four major cellular compartments (nucleus, cytoplasm, mitochondrion, and extracellular space) are reported for each dataset.

B. Percentage of proteins annotated to one, two, three, four, five, or more compartments in each dataset, in order (Geiger et al, 2012; Wiśniewski et al, 2012; Azimifar et al, 2014; Guo et al, 2015; Sharma et al, 2015; Wiśniewski et al, 2015; Tyanova et al, 2016).

C–E. Density plot for the top four significant compartments (Mann–Whitney test P < 0.05) depicting the average fold change distributions for (C) lung vs. liver cells (Geiger et al, 2013), (D) hepatocytes vs. Kupffer cells (Azimifar et al, 2014), and (E) healthy kidney vs. renal carcinoma cells (Guo et al, 2015).

F. Scatter plot of average abundance shift of mitochondrial proteins (x‐axis) and peroxisomal proteins (y‐axis) in seven different datasets (Geiger et al, 2012; Wiśniewski et al, 2012; Azimifar et al, 2014; Guo et al, 2015; Sharma et al, 2015; Wiśniewski et al, 2015; Tyanova et al, 2016); a gray line represents the line fitted using the resulting points (Pearson's R = 0.97, P = 3.5 × 10−4). Proteins shared between the two compartments (annotated as both mitochondrial and peroxisomal) were excluded for this analysis.

Data information: Related to Fig 1 and Dataset EV4.

Major morphological changes can also be a consequence of disease such as malignant transformation. We analyzed the abundance of proteins in healthy kidney cells and renal carcinoma cells (Guo et al, 2015) and found a decrease in mitochondrial proteins in cancer cells (average change of −0.5 log 2 FC, Mann–Whitney test P = 1.8 × 10−22; Fig 1D and Dataset EV3). In addition, we also observed progressive shifts in the relative abundance of nuclear (Mann–Whitney test P < 0.01) and extracellular (Mann–Whitney test P < 0.01) proteins between healthy colorectal mucosa, adenomas, and colon cancers (Wiśniewski et al, 2015; Fig EV2). Subsequently, we extended the analysis to proteins mapping to six additional organelles: endoplasmic reticulum, Golgi apparatus, cell membrane, nuclear membrane, lysosome, and peroxisome (Fig EV1C–E). This allowed us to observe a previously unappreciated correlation between the abundance changes of proteins annotated as peroxisomal and mitochondrial (Pearson's R = 0.97, P = 3 × 10−04) that manifested in all seven datasets used (Fig EV1F).
Figure EV2. Collective shifts of abundance for nuclear and extracellular proteins in colorectal cancer

Cellular compartment shifts during colorectal cancer progression: cancer/healthy protein ratios are compared to cancer/adenoma protein ratios for 3,231 and 3,206 nuclear proteins, respectively (Mann–Whitney test *P < 0.01), and for 2,172 and 2,155 extracellular proteins, respectively (Mann–Whitney test *P < 0.01; Wiśniewski et al, 2015). The average ratio of the protein abundances between two conditions is represented by boxplots; significant comparisons are marked with stars. Related to Fig 1. Boxplots: the horizontal line represents the median of the distribution, the lower and upper limits of the box indicate the first and third quartiles, respectively, and whiskers extend 1.5 times the interquartile range from the limits of the box. Values outside this range are indicated as outlier points.

Collectively, our analysis indicates the widespread existence of cell compartment‐specific shifts in the output of comparative mass spectrometry experiments reflecting morphological differences between the compared cell states. These major shifts can be detected using protein annotation and a simple statistical test and, if present, should be taken into account when interpreting the data. However, this approach does not inform about variations of protein abundance within the same cellular compartment. As an example, a mitochondrial protein complex might appear increased in abundance, reflecting an increase in mitochondrial number or size, although its actual abundance with respect to all other mitochondrial proteins remains unchanged.
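The "simple statistical test" mentioned above can be sketched as follows. This is a minimal pure‐Python Mann–Whitney U test using the normal approximation with tie correction (no continuity correction), not the implementation used for the published analysis; the function names and the fold change values are hypothetical, and the reference group is simplified to a list of background proteins.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected)."""
    n1, n2 = len(x), len(y)
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u1, p_value

# Hypothetical log2 fold changes for one compartment and for background proteins.
mito_fc = [0.8, 1.0, 1.2, 0.9, 1.1, 1.3]
background_fc = [-0.2, 0.1, 0.0, -0.1, 0.2, 0.05, -0.3, 0.15]

u_stat, p = mann_whitney_u(mito_fc, background_fc)
print(u_stat, p)
```

For real datasets with thousands of proteins the normal approximation is usually adequate; for very small groups an exact test would be preferable.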

Differential protein expression analysis in the context of cell compartments

In order to gain insight into the composition of cellular compartments across cell states, we propose a normalization approach that complements standard differential analysis by taking into account differences in size or abundance of cell compartments. Our approach aims at partitioning total proteome data using prior knowledge deriving from the GO annotation, and calculates new relative abundances for proteins belonging to major cellular compartments. For each compartment, a linear model is built from the abundances of proteins in the two conditions compared (Fig 2A). In all the datasets that we tested, the log 2 abundances of proteins annotated to the same cell compartment followed linear models between the compared samples; therefore, non‐linear modeling was not explored (see Materials and Methods section). Each linear model was evaluated through its statistics, namely the P‐value and the R² (Dataset EV4A). In each linear model, the distance, i.e., the residual value, between the protein abundance and the linear fit, can be used as a compartment‐normalized variation (CNV) value. This value reflects the relative abundance difference of a protein compared to its cellular compartment (Fig 2B and Dataset EV3). Since many proteins are associated with more than one cellular compartment (on average, only 20% of the annotated proteins were specific to one compartment and 36% were annotated to two compartments, Fig EV1B; Thul et al, 2017), we wanted to assess the robustness of the linear models when taking into account multiple compartment annotations for the same proteins. Thus, we compared the CNV models built using all the proteins mapping to a given compartment to CNV models built using only proteins that are exclusive to a given compartment, so that there are no shared proteins between the linear models.
We measured an average Pearson correlation of 0.97 between the CNV values for the same proteins using the two types of models. In the case of mitochondria, we evaluated an independent and curated annotation of mitochondrial proteins from MitoCarta (Calvo et al, 2016), and used it to build the mitochondrial‐exclusive CNV model. The average Pearson correlation between the previous models and these mitochondrion‐exclusive models was 0.99. The statistics of the CNV models and their correlation with compartment‐exclusive models are reported in Dataset EV4A. Finally, we tested whether proteins belonging to different compartments are more likely to be detected as significantly affected by the CNV approach. We did not observe a significant association between multiple compartment annotation for proteins and their classification as differentially expressed by the CNV approach (Dataset EV4B). Thus, we conclude that the linear models underpinning the CNV approach are robust regardless of whether proteins annotated to multiple compartments are considered or not.

Figure 2. Compartment‐specific analysis reveals differences in organelle composition that can be validated by sub‐cellular fractionation

A–C. Mitochondrial proteins are plotted using their absolute abundance (IBAQ score) in healthy kidney vs. renal carcinoma cells (Guo et al, 2015). Each mitochondrial protein is colored according to (A) its fold change calculated by standard differential expression using the limma package (Ritchie et al, 2015; Phipson et al, 2016); (B) its CNV value (five proteins with the highest CNV values and five proteins with the lowest CNV values are highlighted and annotated in boxes); and (C) the absolute difference between the two values.

D. Correlation between standard fold change values (left panel) and CNV values (right panel) of proteins quantified in the total lysate and isolated nuclei of HeLa and RKO cells. Only proteins that are differentially regulated (adj. P < 0.05) in isolated nuclei are shown.

E. Clustering of nuclear protein fold changes estimated from whole cells and isolated nuclei, and comparison to the CNV values obtained from whole cell data.

F–H. Comparison of standard differential expression and CNV approach for the three datasets shown in Fig 1B–D. Volcano plots based on fold changes and adjusted P‐values (limma) obtained for (F) lung vs. liver cells (Geiger et al, 2013), (G) hepatocytes vs. Kupffer cells (Azimifar et al, 2014), and (H) healthy kidney vs. renal carcinoma cells (Guo et al, 2015). Proteins are colored depending on their significance when using the standard limma approach and the CNV approach (based on the four main compartments, q‐value < 0.1). A stacked barplot (inset) shows the percentage of unique proteins belonging to each category for three q‐value thresholds (0.05, 0.1, and 0.25).

Data information: Related to Figs EV3 and EV4, and Datasets EV1 EV3, and EV5.

Depending on the extent of the cellular compartment shift in whole proteome data, proteins can be assigned different standard fold change and CNV values (Fig 2C). Therefore, in order to evaluate the performance of the CNV approach, we analyzed the proteome profiles obtained from two cell lines that are known to have distinct morphological features. The commonly used HeLa cells and the colon carcinoma‐derived cells (RKO) have drastically different morphology and nuclear size, with RKO nuclei being more than 2‐fold smaller than HeLa nuclei (average nuclear surface area 206.3 and 543.5 Å2 for RKO and HeLa, respectively; Fig EV3). Proteome profiles from whole cell extract and isolated nuclei are available for both cell lines (Geiger et al, 2012; Ori et al, 2013).
Indeed, when we analyzed whole cell data using a standard differential expression analysis tool [limma package (Ritchie et al, 2015; Phipson et al, 2016)], we observed that the majority (70%) of the nuclear proteins differentially regulated in isolated nuclei are classified as down‐regulated in RKO cells compared to HeLa cells, reflecting the smaller nuclear size of the former cell line. However, the same analysis performed on data from isolated nuclei showed that the ratio between up‐ and down‐regulated proteins is, as expected, more balanced, reflecting differences in the composition of the nucleus between the two cell lines. The discrepancies between the fold changes estimated from whole cell and isolated nuclei result in a significant, but modest, correlation between the two datasets (R = 0.49; Fig 2D, left panel). We then re‐analyzed the whole cell data treating cellular localizations independently using our CNV approach (Dataset EV5). We found that our approach is able to take into account the differences attributed to changed morphology and provides fold change values for nuclear proteins that are considerably closer to the values obtained from isolated nuclei, with an improved correlation (R = 0.65) between whole cell and isolated nuclei estimates (Figs 2D, right panel, and E). These data show that our CNV approach is useful to derive insights on the proteome of a cell compartment when applied to total proteome data, irrespective of morphological differences between the compared samples.

Figure EV3. Comparison of nuclear surface area of HeLa and RKO cells

Distributions, represented as boxplots, of nuclear surface area (Å2) of 44 HeLa and 41 RKO cells (Ori et al, 2013). HeLa nuclei are significantly larger than RKO nuclei (t‐test P = 2.5 × 10−15). Nuclear surface area was estimated from the radius of isolated nuclei measured in phase contrast images, assuming a spherical shape. Related to Fig 2D and Dataset EV5. Boxplots: the horizontal line represents the median of the distribution, the lower and upper limits of the box indicate the first and third quartiles, respectively, and whiskers extend 1.5 times the interquartile range from the limits of the box. Values outside this range are indicated as outlier points.
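The compartment‐normalized variation (CNV) described in this section can be sketched as an ordinary least‐squares fit per compartment, with the residual of each protein taken as its CNV value. This is a minimal illustration under stated assumptions: the direction of the regression (condition B on condition A), the function names, and the abundance values are all hypothetical, and the significance testing of CNV values is not reproduced here.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def cnv_values(log2_a, log2_b, compartments):
    """Residuals of per-compartment linear models (assumed: B regressed on A).

    log2_a, log2_b: dict mapping protein id -> log2 abundance per condition.
    compartments:   dict mapping compartment name -> list of protein ids.
    """
    result = {}
    for name, proteins in compartments.items():
        x = [log2_a[p] for p in proteins]
        y = [log2_b[p] for p in proteins]
        slope, intercept = fit_line(x, y)
        result[name] = {
            p: log2_b[p] - (slope * log2_a[p] + intercept) for p in proteins
        }
    return result

# Hypothetical data: the whole compartment shifts by about +1 log2 unit,
# and only protein P3 deviates from that compartment-wide trend.
log2_a = {"P1": 20.0, "P2": 22.0, "P3": 24.0, "P4": 26.0, "P5": 28.0}
log2_b = {"P1": 21.0, "P2": 23.0, "P3": 26.0, "P4": 27.0, "P5": 29.0}
compartments = {"mitochondrion": ["P1", "P2", "P3", "P4", "P5"]}

cnv = cnv_values(log2_a, log2_b, compartments)
standard_fc = {p: log2_b[p] - log2_a[p] for p in log2_a}
print(standard_fc)
print(cnv["mitochondrion"])
```

In this toy case every protein has a positive standard fold change, whereas only P3 has a clearly positive residual, which mirrors the distinction between a compartment‐wide shift and a protein‐specific change.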

Comparison of proteome‐wide and compartment‐specific differential expression analysis

In order to quantify the impact of the CNV approach on the analysis of proteomics data, we compared the outcome of standard differential expression and the CNV approach for the three datasets where we detected significant compartment shifts (Fig 1B–D). Direct comparison of the statistics revealed low correlation between q‐values assigned by the two approaches (Fig EV4), indicating complementarity between them. Notably, the extent of complementarity was not uniform between datasets, being more pronounced when different tissues (e.g., liver vs. lung) are compared (Fig 2F and H). We reasoned that the CNV can provide two additional levels of information: (i) It can reveal alterations of protein level that reflect a compartment‐wide abundance change rather than a protein‐specific one; and (ii) it can discover new protein changes that emerge only after normalizing for compartment‐wide changes. Therefore, we explicitly investigated the overlap between significant proteins identified by standard differential expression and the CNV approach. Across the three datasets tested, we found a variable proportion of cases (ranging between 50 and 92%) that were identified as differentially expressed by the standard approach (q‐value < 0.1), but are very close to the linear model of their respective compartment, and, thus, classified as not significant with the CNV approach (Fig 2F–H, colored in red). We interpret these cases as deriving from compartment‐wide abundance changes. This effect was particularly pronounced for the Azimifar et al (2014) dataset that showed very prominent shifts for nuclear, mitochondrial, and extracellular proteins (Fig 1C). Regarding newly discovered cases, we found 104, 53, and 38 proteins that were identified as significant (q‐value < 0.1) exclusively by the CNV approach, respectively, for the lung vs. liver (Geiger et al, 2013), hepatocytes vs. Kupffer cells (Azimifar et al, 2014), and healthy kidney vs. carcinoma (Guo et al, 2015) datasets (Fig 2F–H, colored in cyan). The majority of these cases display low fold changes relative to the total proteome, but appear as outliers in the linear models for the respective compartment. Taken together, these data demonstrate that protein expression can be analyzed in the context of cellular compartment by building simple linear models that allow a complementary interpretation of the results of canonical differential expression (Fig EV4), revealing new differences in the abundance of proteins belonging to the same compartment across cell types and states.

Figure EV4. Comparison of statistics obtained by standard differential expression (limma) and CNV approach applied to the same dataset

A–C. Scatter plots of q‐values (−log 10‐transformed) associated with limma protein fold changes (x‐axis) and CNV values (y‐axis) for (A) lung vs. liver cells (Geiger et al, 2013), (B) hepatocytes vs. Kupffer cells (Azimifar et al, 2014), and (C) healthy kidney vs. renal carcinoma cells (Guo et al, 2015). Pearson's correlations are reported inside the plots. Related to Fig 2F and H and Datasets EV1, EV2, and EV3.
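The overlap analysis between the two approaches can be sketched as a simple four‐way classification of proteins by their q‐values. The function name, category labels, and q‐values below are hypothetical; only the threshold of 0.1 is taken from the text.

```python
def classify_proteins(q_standard, q_cnv, threshold=0.1):
    """Group proteins by significance in the standard and CNV analyses.

    q_standard, q_cnv: dict mapping protein id -> q-value from each approach.
    Only proteins present in both dictionaries are classified.
    """
    categories = {"both": [], "standard_only": [], "cnv_only": [], "neither": []}
    for protein in sorted(set(q_standard) & set(q_cnv)):
        sig_standard = q_standard[protein] < threshold
        sig_cnv = q_cnv[protein] < threshold
        if sig_standard and sig_cnv:
            key = "both"
        elif sig_standard:
            key = "standard_only"  # consistent with a compartment-wide change
        elif sig_cnv:
            key = "cnv_only"       # emerges only after compartment normalization
        else:
            key = "neither"
        categories[key].append(protein)
    return categories

# Hypothetical q-values for four proteins.
q_standard = {"P1": 0.01, "P2": 0.03, "P3": 0.60, "P4": 0.80}
q_cnv = {"P1": 0.02, "P2": 0.70, "P3": 0.04, "P4": 0.90}

groups = classify_proteins(q_standard, q_cnv)
print(groups)
```

Under this reading, the "standard_only" group corresponds to the proteins colored in red in Fig 2F–H and the "cnv_only" group to those colored in cyan.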