Whereas the discovery of most CT-genes depended on whole-testis expression data, we here used a unique list of 16,589 genes expressed in human male germ cells, generated in our laboratory [15], to identify true germ cell-specific cancer genes (GC-genes). This germ cell transcriptome was generated from tissue samples provided by six men who underwent vasectomy reversal procedures. Specific germ cell subtypes were isolated using laser capture microdissection and subsequently sequenced on an Illumina HiSeq 2000, as described in detail in Jan et al. [15]. Using R2, a genomics analysis and visualization platform we developed recently [18], we compared this germ cell subtype-specific transcriptome [15] to data from the Genotype-Tissue Expression (GTEx) project [16] and The Cancer Genome Atlas (TCGA) [17] (Supplementary Data 1A). The GTEx and TCGA datasets were both downloaded in November 2015 from the GTEx data portal (GTEx_Analysis_V4_RNA-seq_RNA-SeQCv1.1.8_gene_rpkm.gct.gz, RSEM processed [16]) and the TCGA data portal (unc.edu_****.IlluminaHiSeq_RNASeqV2.Level_3.1.*.0.tar.gz, RSEM processed [17]), respectively. Analyses were performed with custom Perl scripts or R/Bioconductor. Images were generated in R/Bioconductor or on the R2 platform (r2.amc.nl).

At the time of our analysis, these databases contained 2617 samples from 53 different healthy somatic tissues (Supplementary Data 1B) and 9232 samples from 33 tumor types (Supplementary Data 1C), respectively. From the GTEx dataset, we excluded ovary and testis tissues. In addition, we excluded transformed lymphocytes and transformed fibroblasts, as they are transformed cell lines rather than whole tissues and may have upregulated genes associated with cancer cells. For each of the three datasets, the maximum expression measured per gene was used to determine arbitrary inclusion criteria that remove background noise (from the expression in male germ cells and tumors) and avoid false-positive GC-genes (expression in healthy somatic tissues) (Supplementary Figure 1). We applied the strictest criterion to the GTEx dataset (only 13% included) to avoid false positives. Our comparison yielded 756 putative novel GC-genes that are highly expressed in germ cells, not expressed in any somatic tissue, and highly expressed in tumors (Supplementary Data 1D). To visualize how the 756 GC-genes vary by tumor type, we plotted their expression across the 33 tumor types as a heat map, which shows that hundreds of GC-genes are expressed in all tumor types (Fig. 1).
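The selection logic described above can be sketched in a few lines of Python. This is only a minimal illustration and not the Perl/R pipeline used for the analysis: the function names, the toy expression values, and the three cut-offs (`germ_min`, `somatic_max`, `tumor_min`) are hypothetical placeholders, and the criteria actually applied are those referred to in Supplementary Figure 1.

```python
def max_per_gene(expr):
    """Map each gene to its maximum expression over all samples."""
    return {gene: max(values) for gene, values in expr.items()}


def select_gc_genes(germ, somatic, tumor, germ_min, somatic_max, tumor_min):
    """Keep genes whose maximum expression is above germ_min in germ cells,
    below somatic_max in healthy somatic tissues, and above tumor_min in tumors.

    All three cut-offs are hypothetical parameters for this sketch.
    """
    germ_top = max_per_gene(germ)
    somatic_top = max_per_gene(somatic)
    tumor_top = max_per_gene(tumor)
    selected = []
    for gene, germ_value in germ_top.items():
        if gene not in somatic_top or gene not in tumor_top:
            continue  # the gene must be measured in all three datasets
        if (germ_value > germ_min
                and somatic_top[gene] < somatic_max
                and tumor_top[gene] > tumor_min):
            selected.append(gene)
    return sorted(selected)


# Toy data (hypothetical): gene -> expression values, one per sample.
germ = {"GENE_A": [5.0, 6.2], "GENE_B": [4.8, 5.1], "GENE_C": [0.2, 0.4]}
somatic = {"GENE_A": [0.1, 0.3], "GENE_B": [3.5, 0.2], "GENE_C": [0.1, 0.1]}
tumor = {"GENE_A": [4.0, 0.5], "GENE_B": [6.0, 5.5], "GENE_C": [7.0, 6.5]}

gc_genes = select_gc_genes(germ, somatic, tumor,
                           germ_min=2.0, somatic_max=1.0, tumor_min=3.0)
print(gc_genes)  # ['GENE_A']
```

In this toy example GENE_B is rejected because one somatic sample exceeds the somatic cut-off, and GENE_C because it is not expressed in germ cells. Using the maximum per gene means that a single somatic sample with appreciable expression is enough to exclude a gene.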

Fig. 1 Hundreds of germ cell-specific genes are widely expressed in tumors. Shown here as hierarchical clustering of the average expression per tumor group (Euclidean distance, Ward linkage). These germ cell cancer genes (GC-genes) divide tumors into two main groups, mainly based on GC-gene cluster 3, which contains genes involved in mitotic and meiotic metaphase regulation. Gene expression levels are indicated by a Z-score-dependent color, where blue and red represent low and high expression, respectively.

To allow readers to set their own inclusion criteria, we have developed a web-based application. For any interactively chosen cut-off criteria, the resulting GC-gene list can be downloaded. This application is available from: https://www.amsterdamresearch.org/web/reproduction-and-development/tools.htm.

To avoid false-positive results, the selection criteria we applied to identify GC-genes are more stringent than the criteria previously used to identify CT-genes. Moreover, whereas most studies allowed expression in 1–2 tissues other than the testis [11,12,13,14], our selection excludes all genes expressed in healthy tissues other than the testis. Thus, for most of the 756 genes identified in this study, we can be certain that they are true GC-genes. However, genes with low expression that are expressed only briefly or transiently, only in rare cell types, or only under certain conditions may have escaped our selection. In addition, because germ cell tumors can be expected to express many germ cell-specific genes, we analyzed which genes would not have been included in our initial list after excluding testicular germ cell tumors, and identified a subset of 45 GC-genes that are only expressed in germ cell tumors (Supplementary Data 2). Of our original list of 16,589 genes expressed in male germ cells, 165 are present in the CT database [10], a database of genes specifically expressed in cancer and whole testis tissue. Of the 255 CT-genes present in this database at the time of our analysis, only 25 overlap with our newly identified 756 GC-genes. This can be explained by the fact that the testis consists largely of somatic cells. Germ cell-specific RNAs can therefore be diluted below detection levels in whole testis lysates, while testicular somatic genes are not included in our analysis. Indeed, of the 1019 potential CT-genes revealed by a more recent analysis [9], only 123 (12%) were also present in our analysis (Fig. 2). In addition, the interactive Venn diagram can be used to show that relaxing the criterion for high tumor expression has the largest impact on the overlap with the CT-gene studies. This implies that many genes identified by the previous CT-gene studies were not highly expressed in at least one of the tumor types that we investigated. Taken together, our current analysis has identified 630 GC-genes that have not been previously identified as CT-genes, of which 615 are expressed in non-testicular tumors.
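The overlap between the three gene lists amounts to set arithmetic once all gene names share one annotation. The following Python sketch illustrates the idea under stated assumptions: the gene names, the alias table, and the helper names `harmonize` and `venn_counts` are hypothetical, and the real lists are those given in the Supplementary Data.

```python
def harmonize(names, aliases):
    """Convert gene names to one common annotation using an alias table."""
    return {aliases.get(name, name) for name in names}


def venn_counts(gc, ct_db, wang):
    """Return the size of each of the seven regions of a three-set Venn diagram."""
    return {
        "gc_only": len(gc - ct_db - wang),
        "ct_db_only": len(ct_db - gc - wang),
        "wang_only": len(wang - gc - ct_db),
        "gc_and_ct_db": len((gc & ct_db) - wang),
        "gc_and_wang": len((gc & wang) - ct_db),
        "ct_db_and_wang": len((ct_db & wang) - gc),
        "all_three": len(gc & ct_db & wang),
    }


# Hypothetical alias table: outdated symbol -> current symbol.
aliases = {"OLD_D": "D"}

# Toy gene lists (hypothetical symbols).
gc = harmonize({"A", "B", "C", "D"}, aliases)
ct_db = harmonize({"C", "OLD_D", "E"}, aliases)
wang = harmonize({"D", "E", "F", "G"}, aliases)

counts = venn_counts(gc, ct_db, wang)
novel = gc - (ct_db | wang)  # GC-genes not reported in either CT-gene list

print(counts)
print(sorted(novel))  # ['A', 'B']
```

The seven region sizes always sum to the size of the union of the three lists, which is a convenient sanity check after harmonizing names.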

Fig. 2 Most GC-genes have not been described before as CT-genes. Venn diagram comparing the present analysis of germ cell-specific cancer (GC) genes (red) to Cancer/Testis (CT) genes identified earlier by Wang et al. (light blue) and the CT database (dark blue). The number in each section represents the number of genes. The overlap between the CT database [10], Wang et al. [9], and the present analysis was assessed by converting gene names to one common annotation (Supplementary Data 10A–C). Of the 276 genes in the CT database, 21 were either merged with existing genes (n = 19) or could not be retrieved (n = 2) (Supplementary Data 10D).

Hierarchical cluster analysis revealed that, based on expression of GC-genes, the tumors form two main groups, mostly characterized by high or low expression of a specific subset of GC-genes (gene cluster 3) (Fig. 1 and Supplementary Data 3A). We investigated the biological processes of this gene cluster using DAVID Bioinformatics Resources 6.7 [19], and reported only significant annotation clusters (enrichment score > 1.3, corresponding to p < 0.05). Interestingly, this gene ontology (GO) analysis revealed that gene cluster 3 predominantly contains genes involved in M-phase and cell cycle regulation, intriguingly both mitotic and meiotic (Table 1). Processes pivotal to meiosis, such as DNA double-strand break repair and homologous recombination, are also well represented in this cluster (Supplementary Data 3D). Further GO-analysis revealed that six biological processes are significantly represented by all 756 GC-genes (Supplementary Data 3F): the regulation of transcription and gene expression, including the metabolic processes required for RNA and DNA synthesis, the M-phase of the mitotic and meiotic cell cycle, DNA double-strand break repair, DNA metabolic processes, spermatogenesis, and cell adhesion. Additional GO-analysis was performed on the top 25% of GC-genes that were most widely expressed in tumors. Six biological processes appeared to be significantly represented by these 189 GC-genes, including cell cycle regulation and checkpoints, post-translational protein modification, and DNA damage responses (Supplementary Data 3G). In line with previous research [8, 9, 20], these processes suggest that GC-genes are not just randomly expressed germ cell-specific genes but may actually contribute to tumor cell survival, proliferation, and metastasis.
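The correspondence between the enrichment score cut-off of 1.3 and p < 0.05 follows from the score being reported on a minus log10 scale: to our understanding, DAVID defines the enrichment score of an annotation cluster as the geometric mean of the member terms' p-values, expressed as -log10. A short Python check, in which the example p-values are hypothetical:

```python
import math


def enrichment_score(p_values):
    """Geometric mean of the p-values, expressed on a -log10 scale."""
    mean_log10 = sum(math.log10(p) for p in p_values) / len(p_values)
    return -mean_log10


# A p-value of 0.05 corresponds to a score of about 1.3.
threshold = -math.log10(0.05)
print(round(threshold, 3))  # 1.301

# Hypothetical p-values of the terms in one annotation cluster.
cluster_p_values = [0.001, 0.01, 0.04]
score = enrichment_score(cluster_p_values)
print(score > 1.3)  # True: this cluster would be reported as significant
```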

Table 1 GC-genes represent processes that are likely to contribute to tumor cell survival, proliferation, and metastasis

Because proteins that are located on the outer cell surface would be ideal targets for induced adaptive immune (therapeutic) responses, we used the Panther 10.0 classification system [21] and identified 17 GC-genes that are known to encode cell surface proteins (Table 2 and Supplementary Data 4A). These genes are predominantly responsible for the regulation of multicellular organismal processes, cell proliferation, and cell–cell communication (Supplementary Data 4B). Interestingly, all tumors express at least one of these 17 genes. Because CT-genes have previously been identified using gene expression profiles of whole testis, including the testicular somatic cells, we additionally compared gene expression of whole testis tissue from the GTEx project [16] to gene expression in human male germ cells [15]. For each gene with high expression (>1.6) in male germ cells [15] and low expression (<1.8) in any other tissue [16], the difference in expression compared to whole testis was calculated. To correct for differences in expression distributions, genes with a difference in expression of more than one log2 unit (i.e., a twofold difference) were included. This resulted in a list of 706 genes that are expressed in germ cells but were not previously detected in testis as a whole. When comparing our list of 756 GC-genes to these 706 genes, we identified a subset of 334 GC-genes (44%) whose expression is very low or undetectable (<1.8) in testis as a whole [16] (Supplementary Data 5A). Interestingly, among this subset of 334 GC-genes, the average expression across all tumors is higher than in the remaining GC-genes (p = 0.032, Student’s t-test assuming unequal variances). GO-analysis of this subset identified transcription regulation and cell adhesion as significantly enriched processes (Supplementary Data 5B), both of which are essential for germ cell development as well as tumor proliferation and metastasis. This, and the fact that they are highly germ cell and cancer specific, makes these genes interesting candidates for future research.
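The comparison between germ cells and whole testis can be illustrated with a short Python sketch. It uses the cut-offs quoted in the text (>1.6, <1.8, and more than one log2 unit), but it rests on assumptions that the text does not state explicitly: that all values are on a log2 scale, and that the difference is taken as germ cell expression minus whole testis expression. The function name, gene symbols, and expression values are hypothetical.

```python
GERM_MIN = 1.6        # high expression in male germ cells (cut-off from the text)
OTHER_MAX = 1.8       # low expression in any other tissue (cut-off from the text)
MIN_LOG2_DIFF = 1.0   # more than one log2 unit, i.e. a twofold difference


def masked_in_whole_testis(germ, other_max, testis):
    """Select genes expressed in germ cells but much lower in whole testis.

    germ, other_max and testis map gene -> expression, assumed to be log2 values.
    The direction of the difference (germ minus testis) is an assumption.
    """
    selected = set()
    for gene, germ_value in germ.items():
        if germ_value <= GERM_MIN or other_max[gene] >= OTHER_MAX:
            continue
        if germ_value - testis[gene] > MIN_LOG2_DIFF:
            selected.add(gene)
    return selected


# Toy data (hypothetical symbols and values).
germ = {"G1": 4.0, "G2": 2.0, "G3": 1.0, "G4": 5.0}
other_max = {"G1": 0.5, "G2": 0.4, "G3": 0.2, "G4": 2.5}
testis = {"G1": 1.2, "G2": 1.5, "G3": 0.1, "G4": 1.0}

masked = masked_in_whole_testis(germ, other_max, testis)
gc_genes = {"G1", "G2"}       # stand-in for the list of GC-genes
subset = gc_genes & masked    # GC-genes that are masked in whole testis

print(sorted(masked))            # ['G1']
print(sorted(subset))            # ['G1']
print(round(100 * 334 / 756))    # 44, the percentage reported in the text
```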

Table 2 GC-genes that are known to encode cell surface proteins

Previously, CT-genes have been divided into X and non-X CT-genes, depending on whether they are located on the X chromosome or not. According to the CT database, approximately half of the CT-genes are CT-X [10]. However, of the 1019 CT-genes identified by Wang et al. [9], only 105 were located on the X chromosome. In line with this study, our analysis returned only 29 (4%) X-linked GC-genes, and GC-genes seem to be distributed relatively evenly across all chromosomes (Supplementary Figure 2).

To validate to what extent germ cell-specific RNA expression reflects protein expression in various human tissues, we used the Human Protein Atlas (v15) [22]. From the atlas, we retrieved all proteins expressed in testis or ovary and selected those with highly reliable immunohistochemistry (Premium Tissue). In addition, because many CT-genes are known to be expressed in trophoblasts [23], which will later form the placenta, we similarly retrieved all proteins expressed in placenta. The resulting proteins were then matched to our 756 GC-genes, resulting in a list of 49 genes that were manually checked for germ cell- or placenta-specific protein expression (Supplementary Data 6). This yielded three proteins that are exclusively present in placenta and 24 proteins that are present in male germ cells and not in somatic cells of the testis or elsewhere (Table 3).

Table 3 GC-genes whose restricted expression in non-cancerous tissues is validated on the protein level

To develop a therapy without side effects in healthy tissues, it would in principle be sufficient to identify genes that are uniquely expressed in tumors. For this, our list of human germ cell expressed genes would not be required. We therefore performed a similar analysis without our list of germ cell expressed genes and including testis from the GTEx database. This resulted in 724 cancer-specific genes, of which 301 genes appeared not to be GC-genes (Supplementary Data 7A). However, GO-analysis revealed that these 301 genes are predominantly involved in immunological responses (Supplementary Data 7B). Hence, in contrast to GC-genes, targeting these genes as a cancer therapy can be expected to lead to immunological side effects.

Infertility is a major side effect of current anticancer treatments, and it would still be a potential side effect when targeting most GC-genes. A way to circumvent this would be to exclude genes expressed in the spermatogonial stem cells. In humans, these stem cells are included in the pool of quiescent or mitotically proliferating and differentiating spermatogonia, and are required to maintain life-long spermatogenesis. Because our dataset contains information about germ cell-type-specific gene expression [15], we were able to exclude genes expressed in spermatogonia. Of the 756 GC-genes, 69 displayed negligible expression in the spermatogonial stages (Supplementary Data 8). Hence, targeting these 69 GC-genes would not affect the spermatogonial stem cells and therefore only lead to temporary infertility. Importantly, we have recently found that spermatogonia already express many mRNAs that are not translated until later stages during spermatogenesis [15]. This implies that the number of GC-genes that can be targeted without inducing permanent infertility will most likely be larger than 69.

We here show that expression of hundreds of germ cell-specific genes may not only contribute to already established hallmarks of cancer [1, 2], but can be considered a hallmark of cancer in itself. Germ cells and cancer cells share the intrinsic drive to propagate, regardless of survival of the soma [24,25,26]. Studying the behavior and characteristics of germ cells may thus lead to novel insights into cancer development. Because our datasets are publicly available, more tumor types can now be analyzed for the expression of germ cell-specific genes. We anticipate that this will lead to a better understanding of tumor biology and improved treatment options.