Delineating the factors that govern protein expression and activity in cells is among the most fundamental research topics in biology. Although the number of potential protein‐coding genes in the human genome is stabilizing at about 20,000, high‐quality evidence for their physical existence has not yet been found for all of them, and intense efforts are ongoing to identify the ~13% that are currently “missing proteins” (Omenn et al, 2017). While it is also generally accepted that the quantities of proteins vary greatly within and across different cell types, tissues and body fluids (Kim et al, 2014; Wilhelm et al, 2014), this has not been analysed systematically for many human tissues. Furthermore, it is not yet clear how the many anabolic and catabolic processes are coordinated to give rise to the often vast differences in the levels of proteins. Messenger RNA levels are important determinants of protein abundance (Vogel et al, 2010; Schwanhäusser et al, 2011), and extensive mRNA expression maps of human cell types and tissues have been generated as proxies for estimating protein abundance (GTEx Consortium, 2013; Uhlén et al, 2015; Thul et al, 2017). However, other studies have highlighted the much higher dynamic range of protein than transcript abundance as well as a rather poor correlation of mRNA and protein levels, suggesting that further and possibly diverse regulatory elements play important roles (Schwanhäusser et al, 2011; Liu et al, 2016; Franks et al, 2017). Decades of careful research have revealed numerous mRNA elements affecting translation or mRNA stability, such as codon usage, start codon context or secondary structure, to name a few. However, most of these studies focussed on single or few genes or single cell types, or were performed in model organisms distinct from human systems, and often covered only a limited number of proteins.
Broader scale analyses have more recently become possible owing to advances in proteome and transcriptome profiling technologies, but these have mostly focussed on a single (disease) tissue or the cell‐type resolved analysis of protein expression in single tissues (Zhang et al, 2014; Mertins et al, 2016). To the best of our knowledge, no broad‐scale quantitative and integrative analysis of transcriptomes and proteomes across many healthy human tissues has yet been performed that would enable a comprehensive analysis of factors explaining the experimentally observed differences between mRNA and protein expression. Therefore, the purpose of this study was to generate a resource of molecular profiling data at the mRNA and protein level to facilitate the study of protein expression control and proteogenomics in humans. To this end, we analysed 29 major histologically healthy human tissues from the Human Protein Atlas (HPA) project (Uhlén et al, 2015) to provide a comprehensive baseline map of protein expression across the human body. As we show below as well as in Eraslan et al, 2019, these data can be used in many ways to explore protein expression and its regulation in humans. To facilitate further research on this fundamentally important topic and the many further uses that can be envisaged, all data are available in ArrayExpress (Kolesnikov et al, 2015) and ProteomeXchange (Vizcaíno et al, 2014).

Results and Discussion

Comprehensive transcriptomic and proteomic analysis of 29 human tissues

We analysed 29 histologically healthy tissue specimens representing major human organs by label‐free quantitative proteomics and RNA‐Seq (Fig 1A; see Appendix Figs S1–S6 for the assessment of data quality). Tissues were collected by the HPA project (Fagerberg et al, 2014), and adjacent cryosections were used for paired (allele‐specific) transcriptome and proteome analysis. RNA‐Seq profiling detected and quantified a total of 18,072 protein‐coding genes with an average of 12,262 (± 1,007 standard deviation, SD) genes per tissue (Fig 1B) when using a cut‐off of 1 fragment per kilobase of transcript per million mapped reads (FPKM; Uhlén et al, 2015). Proteomic profiling by mass spectrometry resulted in the identification and intensity‐based absolute quantification (iBAQ; Schwanhäusser et al, 2011) of a total of 15,210 protein groups with an average of 11,005 (± 680 SD) protein groups per tissue at a false discovery rate (FDR) of < 1% at the protein, peptide and peptide‐spectrum match (PSM) level (Fig EV1A). Protein identification was based on 277,698 non‐redundant tryptic peptides, representing a total of 13,640 genes and, on average, 10,541 (± 512 SD) genes per tissue, covering, on average, 86% of the expressed genome in every tissue. While the total number of confidently identified proteins in this study is smaller than that of other (community‐based) resources such as ProteomicsDB (Schmidt et al, 2018) and neXtProt (Gaudet et al, 2017; coverage of 15,721 and 17,470 protein‐coding genes, respectively), it provides a highly consistent collection of tissue proteomes including the deepest proteomes to date for many of the tissues analysed. It also provides protein‐level evidence for 37 proteins (represented by at least one unique peptide) that are not yet covered by neXtProt (release 2018‐01‐17; Table EV1). These proteins were validated by synthetic peptides (see PRIDE submission for mirror spectra).
Eighteen of these 37 have antibody staining in the current release of the HPA project and all of them show signal in the same tissue they were detected in by MS. This corroborates the detection of these new proteins by an independent method. Eight of these proteins also meet the guidelines of the Human Proteome Project (HPP), which require ≥ 2 peptides for a new protein, each ≥ 9 amino acids in length (Deutsch et al, 2016). We note that the HPP guidelines use reasonable but ad hoc criteria which are likely too conservative and therefore likely discriminate against further genuine cases. Comparing spectra of endogenous to synthetic peptides is likely the more objective criterion, which is why we added mirror plots of all evaluated cases to PRIDE (Zolg et al, 2017). The expression levels of the “new” proteins were about a factor 10 below median (iBAQ at log10 scale, 7.4 versus 8.3), which explains why they may have been missed before. Interestingly, 15 of these proteins were detected in the fallopian tube, an organ that has not yet been extensively profiled by proteomics.

Figure 1. Comprehensive proteomic and transcriptomic analysis of 29 human tissues from healthy donors

A. Body map of analysed tissues.

B. Number of genes detected on protein and mRNA level in each tissue. The colouring of the bars indicates the fractions of transcripts and proteins that are expressed everywhere or enriched in certain tissues. The full classification is provided in the text.

C. Abundance distribution of all transcripts detected in all tissues (grey); the fraction of detected proteins is shown in blue and the fraction of transcripts for which no protein was detected is shown in orange.

D. Relative distribution and absolute numbers of transcripts and proteins in selected functional classes across the expression categories shown in panel (B). Colours are the same as in panel (B).
Figure EV1. Further characterization of human proteomes and transcriptomes

A. Number of identified protein groups for each of the 29 tissues.

B. Number of genes in all tissues that were detected at the transcript level with higher than average expression but not detected at the protein level. Note the very high number of such cases in testis.

C. Abundance distribution of all proteins detected in human brain (grey). Proteins in blue are expressed in all 29 tissues, and proteins in orange show elevated expression in brain.

D. Clustering of gene ontology terms (biological process) for proteins and transcripts that show the most divergent expression across all tissues. Boxes give examples of GO terms for four different tissues (appendix, brain, heart and testis).

Overall, 13,413 protein‐coding genes were detected on both transcript and protein levels, and the detected proteins spanned almost the entire range of mRNA expression, again indicating very substantial coverage of the expressed proteome (Fig 1C). However, some proteins could not be detected even for highly expressed mRNAs (i.e. higher than the mean mRNA abundance). About 1/3 of these mRNAs were found in testis (478 of 1,408) and no other tissue contained nearly as many highly expressed mRNAs without protein evidence (Fig EV1B). The “missing” proteins in the testis were statistically significantly enriched for processes related to spermatogenesis by gene ontology analysis (clusterProfiler; n = 82 genes; BH‐adjusted P = 8 × 10⁻¹⁴). Although the rich expression of mRNAs in testis has been known for a long time and exploited for, e.g., the cloning of many genes from cDNAs, the apparent absence of so many testis proteins with high mRNA expression is surprising.
This was not due to, e.g., poor coverage of the testis proteome (11,024 detected protein‐coding genes) or other obvious technical factors (such as inefficient extraction of membrane proteins or difficulties with identifying small proteins) that would prevent detection of these proteins. Interestingly, almost 300 of these “missing” proteins have also not been detected by antibodies in testis (according to HPA) and nearly 200 have no ascribed molecular function. The inability to detect these proteins by mass spectrometry or antibodies despite high levels of mRNA poses a number of questions. For example, are these proteins rapidly degraded implying specialized (and perhaps transient) functions in testis or sperm functionality? Are they perhaps stabilized in response to egg fertilization?

Proteins missing at the lower end of the mRNA expression range (less than mean mRNA abundance) are overrepresented in G‐protein‐coupled receptor activity (n = 173; BH‐adjusted P = 8.3 × 10⁻⁵⁰), ion channels (n = 109; BH‐adjusted P = 7 × 10⁻¹⁰) and cytokine‐related biology (n = 76; BH‐adjusted P = 6 × 10⁻⁹). The abundance of these proteins may simply have been below the mass spectrometric detection limit or, as described many times, can be difficult to extract from cells owing to the presence of multi‐pass transmembrane domains giving rise to few if any MS‐compatible tryptic peptides after digestion.

To explore which and how many proteins show a tissue‐specific expression profile, we applied the classification scheme of Uhlén et al (2015, 2016) previously developed for mRNA profiling and which stratifies genes into the five classes “tissue‐enriched” (fivefold above any other tissue), “group‐enriched” (fivefold above any group of 2–7 tissues), “enhanced” (fivefold above the average of all other tissues), “expressed in all” (expressed in all tissues) as well as “mixed” genes (which do not match the other categories).
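The five‐class scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the HPA or the authors' implementation. The function name `classify_gene`, the detection cut‐off, the order in which the classes are tested and the reading of “group‐enriched” (a group of 2–7 tissues whose lowest member is at least fivefold above the highest remaining tissue) are all assumptions made for the example.

```python
def classify_gene(levels, detect_cutoff=1.0, fold=5.0, max_group=7):
    """Assign one gene to an expression class (illustrative sketch only).

    levels: dict mapping tissue name -> abundance on a linear scale
            (e.g. FPKM; the cut-off of 1.0 mirrors the FPKM cut-off in the
            text, a protein-level cut-off would have to be chosen separately).
    """
    ranked = sorted(levels.values(), reverse=True)
    n = len(ranked)
    if n < 2 or ranked[0] < detect_cutoff:
        return "not detected"

    # Tissue-enriched: top tissue at least `fold` above every other tissue.
    if ranked[0] >= fold * ranked[1]:
        return "tissue-enriched"

    # Group-enriched (assumed reading): the k highest tissues (2 <= k <= 7)
    # are all at least `fold` above the highest tissue outside the group.
    for k in range(2, min(max_group, n - 1) + 1):
        if ranked[k - 1] >= detect_cutoff and ranked[k - 1] >= fold * ranked[k]:
            return "group-enriched"

    # Enhanced: some tissue is `fold` above the mean of all other tissues.
    # Checking the maximum suffices because the condition is monotone.
    mean_of_others = (sum(ranked) - ranked[0]) / (n - 1)
    if ranked[0] >= fold * mean_of_others:
        return "tissue-enhanced"

    # Expressed in all: detected above the cut-off in every tissue.
    if ranked[-1] >= detect_cutoff:
        return "expressed in all"

    return "mixed"
```

Applying such a function to every gene, once on the FPKM table and once on the iBAQ table, would give per‐class counts of the kind reported in the following paragraph, although the exact numbers depend on the assumed cut‐offs.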
Overall, a large fraction of all represented genes was expressed in all tissues: 37% (6,725) at the transcript level and 39% (5,400) at the protein level. However, 43% (7,866) of all transcripts and 53% (7,244) of all proteins showed elevated expression in one or more tissues (“tissue‐enriched”, “group‐enriched” or “tissue‐enhanced”). Only 0.73% (on average) of all transcripts and 0.65% of all proteins showed a tissue‐enriched profile. Two notable exceptions are brain and testis which exhibit a higher percentage of tissue‐enriched proteins and transcripts in line with a recent analysis of RNA‐Seq data from the HPA and GTEx projects (GTEx Consortium, 2013). Proteins with more tissue‐restricted expression tended to be of slightly lower abundance (Fig EV1C). For 1,270 of the total 1,998 tissue‐enriched proteins detected in our study, antibody staining was available in the HPA. In the 29 tissues that are common between HPA and the current study, 775 proteins were detected in the same tissue lending support to the mass spectrometry‐based data presented here. In addition, we compared our tissue‐enriched expression data to the targeted MS (PRM) data acquired for about 52 proteins by Edfors et al (2016) and 10 tissues that overlapped with our tissue panel (see Appendix Figs S7–S9). Incidentally, the Edfors’ study had data on three tissue‐enriched proteins. First, myoglobin (MB) was highly tissue‐enriched in our data in the heart which was confirmed by the PRM analysis as well as antibody staining in HPA. Second, the protein PDK1 (3‐phosphoinositide‐dependent protein kinase‐1) was also found to be a heart‐enriched protein and the PRM data confirmed this. This protein was detected in all tissues by antibody staining but we note that immunohistochemistry (IHC) stains are not quantitative so it is difficult to conclude if broad detection of this protein was due to overstaining or poor antibody specificity. 
The third example is the protein CANT1 (soluble calcium‐activated nucleotidase 1) which we detected as a prostate‐enriched protein. Again, this was confirmed by the PRM measurement but was again detected in most tissues by IHC. The above global trends in transcript and protein tissue expression distributions were also mirrored by functional categories of genes but with some interesting detail (Fig 1D, Table EV4). For example, while the tissue distribution of expression of disease‐associated genes followed that of all genes, the expression of drug targets in general and GPCRs in particular was much more tissue‐restricted speaking to the notion that proteins may make for better drug targets if they are not ubiquitously expressed (Hao & Tatonetti, 2016). In this context, we point out that our baseline map of protein expression across the human body may be of general value for drug discovery as one can, e.g., quickly examine the expression profile of a particular target of interest, to help better understand adverse clinical effects and off‐target mechanisms of action of drugs. For instance, a recent study revealed phenylalanine hydroxylase (PAH) as an off‐target of the pan‐HDAC inhibitor panobinostat (Becher et al, 2016). Our map of protein expression shows that PAH is abundantly expressed in liver (and kidney) which is also the major site of hydroxylation in the human body (Matthews, 2007), indicating that the liver is the major site where panobinostat exerts its detrimental effects, i.e. leading to decreased tyrosine levels, and eventually hypothyroidism in affected patients. In contrast, essential genes (Blomen et al, 2015; Hart et al, 2015; Wang et al, 2015) as well as mitochondrial genes were found in the vast majority of all tissues in line with their central roles for maintaining cellular homeostasis. 
Despite the differences in detail, our dataset confirms, at the protein level, that there is a core set of ubiquitously expressed genes/proteins and that individual tissues are not strongly characterized by the categorical presence or absence of mRNAs or proteins but rather by quantitative differences (Geiger et al, 2013). This is also evident from an analysis of the most divergently expressed proteins or transcripts that shows enrichment of proteins related to the functional specialization of the respective tissue (Fig EV1D, Table EV3).

mRNA and protein expression

The relationship between mRNA and protein expression has been studied extensively over the past years and there continues to be debate in terms of how the various correlations that can be computed may be interpreted in terms of technical artefacts or biological meaning (Liu et al, 2016; Fortelny et al, 2017; Franks et al, 2017; Wilhelm et al, 2017). While it is beyond the scope of the current study to attempt to reconcile the different views, the extensive data on both mRNA and protein expression provided in this resource should help to eventually bring clarity. Therefore, in the following, we confine our analysis of the expression data to a few basic points we nonetheless deem important. The dynamic range of transcripts detected by RNA‐Seq spanned about four orders of magnitude and that of proteins detected by mass spectrometry spanned eight orders of magnitude (Fig 2A; see Appendix Fig S10 for the corresponding plot using copy numbers that show essentially the same characteristics; Table EV5). This difference alone explains (at least in part) the overall higher coverage of expressed genes by RNA‐Seq compared to that by LC‐MS/MS. This is because there is limited “sequencing capacity”, particularly in mass spectrometry. Thus, detecting very low‐abundance molecules becomes harder the wider the dynamic range of expression and the lower the sampling depth. For example, the (paired‐end) RNA data provided (on average) 18 M reads per tissue. Those 18 M reads are distributed across four orders of magnitude of abundance with an inevitable bias towards the more abundant transcripts. The MS data only provided (on average) ~76,000 peptides and ~284,000 identified tandem mass spectra (peptide‐spectrum matches; PSMs) per tissue and these are distributed over eight orders of magnitude, also with a bias towards the more abundant proteins.
As a result, it is currently much easier to cover many genes by RNA‐Seq than it is to cover the same number by LC‐MS/MS.

Figure 2. Analysis of protein and transcript expression levels within and across tissues

A. Distribution of global transcript and protein abundance in all tissues. It is apparent that the dynamic range of protein expression (iBAQ scale) exceeds that of mRNA expression (FPKM scale; see Appendix Fig S10 for the corresponding plot for RNA and protein copy numbers).

B. Protein‐to‐mRNA abundance plot for brain tissue. The slope of the regression line indicates that high‐abundance mRNAs give rise to more protein copies per mRNA than low‐abundance mRNAs.

C. Ranked abundance plot of proteins and transcripts in human heart. While the 10 most abundant transcripts cover almost 70% of all transcripts in this tissue, the corresponding proteins only represent about 20% of the total protein.

D. Analysis of the number of genes that are shared among the 100 most abundant transcripts and proteins. Regardless of the tissue, the fraction of shared genes rarely exceeds 20%.

E. Correlation analysis of protein‐to‐RNA abundance (in log10 scale) across tissues, resulting in almost 90% positive correlations. The proteins highlighted in the next panel are marked.

F. Examples for proteins that show high (SYK, left panel) or no (EIF4A3, right panel) correlation of protein/RNA ratios across tissues. While the former indicates that different tissues express different quantities of SYK, EIF4A3 expression appears to be similar in all tissues.

As noted before, the much wider dynamic range at the protein level implies that protein synthesis and protein stability play an important role in determining protein levels beyond mRNA levels (Schwanhäusser et al, 2011; Vogel & Marcotte, 2012).
Similarly, the number of protein molecules produced per molecule of mRNA appears to be much larger for high‐ than for low‐abundance transcripts, leading to a nearly quadratic relationship between mRNA levels and protein levels in every tissue (slope of 2.6 in Fig 2B for brain and between 1.8 and 2.7 for all 29 tissues, Fig EV2A; Appendix Fig S11). While this observation has been made before in yeast (Csárdi et al, 2015), this study shows that it is a general phenomenon across human tissues. The effect may be rationalized by cellular economics such that genes encoding highly abundant proteins not only express high mRNA levels, but also encode regulatory elements that favour high translation efficiency and high protein stability (Vogel et al, 2010). The often vast differences in mRNA and protein expression within a tissue can also be visualized by plotting the ranked order of relative intensities of transcripts and proteins (Fig 2C, Appendix Fig S12). For example, in the heart (an extreme case), 41% of the total mRNA quantity (by FPKM) encodes a single protein (MT‐ATP8) and nearly 60% of the total mRNA covers just five transcripts (all coding for mitochondrial proteins). In contrast, about 13% of the total protein quantity (by iBAQ) is contributed by five proteins (four of which are myosins and one represents a “contamination” from blood present in the tissue). One would expect the heart to be rich in both protein families owing to the contractile function of the organ, which requires a lot of energy. While it is possible that some of the mitochondrial proteins are underrepresented in quantitative terms (because, e.g., MT‐ATP8 is a very small protein (7 kDa) and its iBAQ value may therefore not reflect its true quantity, or because our lysis conditions may not have solubilized this organelle with high efficiency), it is surprising that even among the 100 most highly expressed mRNAs and proteins, only about 20% are the same (Fig 2D).
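The two quantities discussed above, the slope of the protein‐versus‐mRNA relationship on a log–log scale and the share of genes common to the most abundant transcripts and proteins, can be illustrated with a short sketch. It assumes an ordinary least‐squares fit of log10 protein abundance on log10 mRNA abundance; the regression method used for Fig 2B is not stated in this section, so the fit type, the function names `loglog_slope` and `top_n_overlap` and the dictionary input format are all assumptions for illustration.

```python
import math


def loglog_slope(mrna, protein):
    """Least-squares slope of log10(protein) versus log10(mRNA).

    mrna, protein: dicts mapping gene -> abundance in one tissue
    (e.g. FPKM and iBAQ). Only genes with positive values in both
    dictionaries are used. OLS is an assumption of this sketch.
    """
    pairs = [
        (math.log10(mrna[g]), math.log10(protein[g]))
        for g in mrna
        if g in protein and mrna[g] > 0 and protein[g] > 0
    ]
    if len(pairs) < 2:
        raise ValueError("need at least two genes quantified at both levels")
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    if sxx == 0:
        raise ValueError("all mRNA values are identical")
    return sxy / sxx


def top_n_overlap(mrna, protein, n=100):
    """Fraction of genes shared by the n most abundant transcripts and proteins."""
    if n <= 0:
        raise ValueError("n must be positive")
    top_mrna = set(sorted(mrna, key=mrna.get, reverse=True)[:n])
    top_protein = set(sorted(protein, key=protein.get, reverse=True)[:n])
    return len(top_mrna & top_protein) / n
```

A slope of 2 on this scale means that a 10‐fold increase in mRNA corresponds to a 100‐fold increase in protein, which is what “nearly quadratic” refers to. Calling `top_n_overlap` with `n=100` and `n=5000` for each tissue would correspond to the comparisons summarized in Fig 2D and Fig EV2B.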
This overlap only increases to about 60% for the 5,000 most abundant proteins and transcripts (Fig EV2B). The above reflects why RNA–protein abundance plots generally show only modest correlation. The above rank order lists of transcripts and proteins are also quite different between tissues, with the spleen showing the opposite characteristics compared to the heart, and the lung showing a more even distribution of transcript and protein levels (Fig EV2C and D).

Figure EV2. Relationships between mRNA and protein expression

A. Slopes of the regression line in protein versus mRNA abundance plots (see main Fig 2B) for each tissue.

B. Number of genes that are shared among the 5,000 most abundant transcripts or proteins in each tissue.

C, D. Ranked abundance plots for transcripts and proteins in spleen and lung showing different characteristics in the abundance distributions (see also main Fig 2C and Appendix Fig S12 for all tissues).

E. Clustering of protein abundances across all tissues. It is apparent that many proteins have similar expression levels across several/many tissues.

We find that many proteins are often expressed at broadly similar levels across human tissues (say within a factor 10; Fig EV2E). It is, therefore, not very surprising that the correlation of mRNA/protein ratios across tissues is generally not very strong (Fig 2E; median 0.35). Still, there is positive correlation in ~90% of all cases and almost half are also statistically significant. This distribution is not affected by requiring detection of a protein in 10, 20 or all 29 tissues (see Appendix Figs S13–S15). However, great care has to be taken when interpreting such distributions. As shown in Fig 2F, the transcript and protein levels of the tyrosine kinase SYK span an expression range of 45‐fold and 39‐fold, respectively (natural scale), and are highly correlated across tissues, reflecting the specialized function of the protein in T‐ and B‐cell biology. In contrast, RNA and protein expression of EIF4A3 (a DEAD‐box RNA helicase involved in translation initiation) only spanned sixfold and 11‐fold between tissues and showed no correlation. We note that cases such as the latter may merely be the result of technical variation in the measurement or may reflect genuinely similar expression levels in most tissues, in line with the roles of these proteins in central biological processes in all tissues (Appendix Fig S16; Wilhelm et al, 2017). It is noteworthy that proteomes correlate more strongly between tissues (median of 0.77) than transcriptomes (median of 0.67; Fig 3A; see Appendix Fig S17). It is possible that, because the dynamic range of protein levels is larger than that of RNA, small biological or technical variations of individual genes have less impact on the overall rankings (Fortelny et al, 2017; Franks et al, 2017).
It might, however, also imply that there are (hitherto not very clear) mechanisms in cells that “buffer” the protein quantities against changes in mRNA abundance (Liu et al, 2016; Kustatscher et al, 2017). The strongest correlations for both transcripts and proteins were found for the anatomically adjacent small intestine and duodenum. At the proteome level, the brain showed clear differences to other proteomes and gastrointestinal organs appear to be more similar to each other. Visualizing the transcriptome and proteome profiles in a plane using co‐inertia analysis (CIA; Culhane et al, 2005) indicates that mRNA and protein levels are more similar to each other within tissues than between tissues (Fig 3B), which is also reflected by an RV coefficient of 0.77 (a multivariate generalization of the squared Pearson correlation coefficient). Moreover, the CIA grouped several tissues according to similarities in their physiological function, with tissues of the immune system and of the gastrointestinal tract representing the largest groups. It is interesting to note that this clustering appears to be driven by the cellular composition of individual tissues (Table EV6). For instance, the appendix co‐clusters with the spleen, lymph node and tonsil and all four tissues contain a large fraction of lymphocytes (Fig 3C, blue panel). Similarly, the duodenum and small intestine comprise a large proportion of (intestinal) glandular cells, which are important determinants of the molecular make‐up of those tissues (Fig 3C, grey panel). All the above illustrates that there must be multiple molecular factors and mechanisms determining the quantitative expression of proteins. This aspect of the present mRNA/protein expression resource may be particularly useful for the community as it provides a rich data source for the study of protein expression control (see also Eraslan et al, 2019).
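For readers unfamiliar with the RV coefficient mentioned above, the following sketch computes it from two matrices that share their rows (here, tissues), using the standard definition RV = tr(S_X S_Y) / sqrt(tr(S_X S_X) · tr(S_Y S_Y)) with S_X = XXᵀ and S_Y = YYᵀ after column centring. It is an illustration only, not the co‐inertia implementation used for Fig 3B. The function name `rv_coefficient` and the list‐of‐rows input format are assumptions, and any scaling or filtering applied before the published analysis is not reproduced here.

```python
import math


def _center_columns(rows):
    """Subtract the column mean from every entry (rows = tissues)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def _row_cross_product(rows):
    """Return the n x n matrix X X^T of inner products between rows."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in rows] for ri in rows]


def _trace_of_product(a, b):
    """tr(A B) for symmetric A and B equals the sum of elementwise products."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def rv_coefficient(x_rows, y_rows):
    """RV coefficient between two data tables with matching rows.

    x_rows, y_rows: lists of equal length; each element is the list of
    values for one tissue (e.g. log-transformed transcript and protein
    levels). The two tables may have different numbers of columns.
    """
    if len(x_rows) != len(y_rows) or not x_rows:
        raise ValueError("both tables need the same, non-zero number of rows")
    sx = _row_cross_product(_center_columns(x_rows))
    sy = _row_cross_product(_center_columns(y_rows))
    denom = math.sqrt(_trace_of_product(sx, sx) * _trace_of_product(sy, sy))
    if denom == 0:
        raise ValueError("at least one table has no variance")
    return _trace_of_product(sx, sy) / denom
```

When each table has a single column, this expression reduces to the squared Pearson correlation of the two columns, which is the sense in which the RV coefficient generalizes it.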
Figure 3. Correlation analysis of protein and transcript expression levels

A. Global correlation analysis of proteomes versus proteomes and transcriptomes versus transcriptomes across human tissues. It is apparent that proteomes correlate more strongly across tissues than transcriptomes.

B. Co‐inertia analysis of transcriptome and proteome levels of all 29 tissues (arrow base: transcriptome; arrow head: proteome) showing that the information carried by transcriptomes and proteomes was closer to each other in the same than across different tissues. Grey lines are used to aid identifying tissue names for the respective arrows. Shaded areas highlight tissues that are related by their molecular profiles.

C. Average cellular compositions of tissues highlighted in panel (B) showing that the molecular similarities in their transcriptomes and proteomes are driven by similarities in cell types.