Data and filters

We used mRNA sequencing data (Illumina paired-end, 76 bp) from the GTEx project Analysis Freeze V6 release (phs000424.v6.p1). RNA-seq libraries are non-strand specific, with Poly-A selection and generated with Illumina TruSeq protocol. Further details on sample collection, processing and quality control of the RNA-seq samples from version V6 can be found in the supplementary material of24 and in21,23,40. As described in Carithers et al.21 the GTEx project made an effort to collect tissues within 8 h of PMI and RIN values ≥6. All samples under analysis in this study were collected and preserved with the PAXgene Tissue preservation system developed by Qiagen21.

From the 55 available tissues in the V6, we started by selecting those with at least 20 samples. Brain samples are preserved either with PAXgene Preserved or Fresh Frozen methods. The latter does not have ischemic time available. We have included 161 samples from Cerebellum and 147 from Cortex preserved with PAXgene method and with ischemic time available. We further removed cell lines from the set of tissues. The final dataset comprises 36 tissues, with 36–1049 samples with a mean of 253 samples, see Supplementary Fig. 1.

Gene expression

RNA-seq reads were aligned to the human genome (hg19/GRCh37) using TopHat41 (v1.4) and Gencode annotation v1926 was used for gene quantification. We considered genes with at least 5 reads mapping in exons and from all biotypes in the annotation. The raw read counts were used for differential expression analysis and the RPKM42 values, which were log2 transformed with an added pseudo-count used in the remaining analysis. For the regression analyses, the matrix of expression values was obtained for the samples of each tissue and then normalized with the normalize.quantiles function from the preprocessCore library43.

In order to investigate the global patterns of gene expression we have considered the gene expression values of all the tissues and all the genes from the annotation26. We have then performed multi-dimensional scaling (MDS) using the isoMDS function from the package MASS in R. We defined the distance for two samples A and B as:

dist(A, B) = 1–PearsonCorrel(A, B). As shown in Supplementary Fig. 3 there is a clear transcriptional signature characterizing each tissue. Further discussion on detectable and tissue specificity expression and gene expression patterns across tissues can be found here25.

Post-mortem interval (PMI) information

GTEx annotation provides information on three types of ischemic time: Total Ischemic time for a sample, Total Ischemic time for a donor and Ischemic Time (time for the start of the GTEx procedure). All these variables are quantified in minutes. Throughout the text we have used the term Post-Mortem Interval (PMI) to refer to ischemic time and except if explicitly stated it refers to the sample ischemic time. Negative PMI values (observed in blood samples) correspond to samples extracted pre-mortem. GTEx annotation contains a total of 249 sample and subject variables, which are divided in six groups: Sample attributes (prefix SM), Death circumstance (DTH), Demographic (no specific prefix), Medical History (MH), Tissue Recovery (TR), Serology results (LB). We selected those variables with > = 2 and < = 15 values, removing cases of unknown values. We then computed a linear regression with PMI, obtaining the adjusted R2 and the Pearson correlation with PMI (cor.test in R with use = “na.or.complete”). Categorical variables were converted to numeric. Supplementary Fig. 4 shows the respective correlation values for all the variables, where we observe that TR, LB and DTH variables are highly correlated with ischemic time. This basically reflects different aspects of the tissue collection procedure that are highly associated with PMI.

Covariate selection

In order to assess the impact of PMI on gene expression we need to account for the possible effect of other pre-mortem variables on the variation of gene expression. For the selection of the variables of interest we excluded TR, LB and DTH variables due to their strong association with PMI, which may result from the fact that all these variables capture the underlying characteristics of death circumstances and tissue extirpation procedure. We then focused on sample, demographic and medical history variables (correlation values with ischemic time ignoring missing values), summing 166 variables. We further filter for those covariates that are qualitative and describe a phenotype such as age or gender and that exclude those that are simply metrics from the sequencing (SM). Finally, we kept those variables with |r| > 0.1 with PMI. Non-numerical variables are converted to numeric format. The final set of covariates used for regression analysis of PMI and gene expression is presented in Supplementary Table 5.

Gene expression and PMI regression model

In order to assess the impact of PMI on gene expression we took into account a set of fourteen covariates (Supplementary Table 5). Then, for each gene (with average expression across the tissue samples greater than 0.5 RPKM) we implemented a linear regression model where the gene expression profile is modeled with relation to the covariates: reg = lm(gene_expression ~ matrix.selectedCovariates). The residuals of the model are then used as the expression phenotype: gene_expression.resid = residuals(reg). Finally, the correlation between the residuals and the PMI is calculated, r = correlation(gene_expression.resid, pmi.vals). The corresponding correlation and p-values (adjusted with BH method44) are then stored for all genes. This procedure is repeated for all tissues (Supplementary Fig. 37). To compute the correlation values of gene expression and PMI without the covariates, a procedure similar to the above was used where Pearson correlation is obtained between gene expression and PMI values (Supplementary Fig. 38). Supplementary Note 2 provides the algorithmic details of the methodology.

Non-linear temporal differential expression

In order to find non-linear differential expression we developed a method to identify significant changes between different post-mortem intervals. For each tissue we grouped the samples as in21 and in five different PMI intervals I1: < 1 h, I2: ≥ 1 h and < 4 h, I3: ≥ 4 h and < 6 h, I4: ≥ 6 h and < 15 h and I5: ≥ 15 h. We then normalized the gene expression of each gene computing a Z-score = ((X-mean)/stdev) and calculated the median expression in each of these intervals. Every two consecutive intervals, with a minimum number of five samples are then compared. We consider an event of temporal differential expression between T i and T i+1 , where T i and T i+1 correspond to the expression values of the gene in the interval i and i + 1 if we meet the two following conditions:

$${{\mathrm{pval}}(i,\,i + 1)= {\mathrm{wilcox.test}}\left( {T_i},\,{T_{i + 1}} \right),{\mathrm{with}}\,{\mathrm{pval}} < 0.05}$$ (1)

$${\mathrm{fold}}\_{\mathrm{change}}(i,\,i + 1) = log_2\left( {\mathrm{median}}\left( {T_i} \right)/{\mathrm{median}} ( {T_{i + 1}} ) \right) ,|{\mathrm{fold}}\_{\mathrm{change}}(i,\,i + 1)| > 2$$ (2)

Supplementary Fig. 39 provides the algorithmic details of the methodology.

Tissue similarity for PMI correlated expression

In order to build a tissue similarity matrix of the correlation profiles (gene expression and PMI) we performed a pairwise comparison of all tissues. For every pair of tissues we obtain the common genes by intersecting genes that in both tissues have correlation value of gene expression with PMI. For every pair of tissues we then obtain a Spearman ranking correlation based on the correlation values of the common genes. We then used the heatmap.2 function from gplots to calculate the heatmap with dendrogram in Fig. 2f.

Functional enrichment analysis

For functional enrichment analysis we used the R libraries: DOSE45, ClusterProfiler46, Kegg.db47 following the tutorial of ClusterProfiler46.

Differential expression in blood samples

For differential expression analysis we used the statistical methods implemented in the edgeR package48. We started by building a matrix with gene read counts in premortem (n = 169) and postmortem (n = 223) Blood samples. Genes were filtered to have at least 5 reads per million mapped reads in at least 10% of the samples on one of the tested groups (cpm function). We created a design matrix taking into account 2 groups (pre- and postmortem samples) and several covariates:

$$\begin{array}{l} {\mathrm{design}} < - {\mathrm{model.matrix}} \left(\sim \right. {\mathrm {SMRIN + AGE + ETHNCTY + MHCANCERNM +}} \\ \left. {\mathrm{SMCENTER + SMTSTPTREF + SMNABTCHT + GENDER + group,covars.matrix}} \right)\end{array}$$

Covariates SMRIN and AGE were discretized according to the following intervals: SMRIN = < 7*/7−8/8−9/9–10 and AGE = 20−30/30−40/40−50/50−60/60−70. Covariates were converted as factors. See Supplementary Table 5 for the description of the covariates, where group variable corresponds to the pre and post-mortem samples. We then followed the protocol at48 performing the normalization with the TMM method49 Generalized Linear Model (GLM) based functions to estimate common dispersion and differential tests. For the differential expression analysis across different post-mortem blood intervals, we first divided the post-mortem samples in four groups, which provided an equal number of samples in each group: G 1 (n = 56): 0 < pmi < = 406 min; G 2 (n = 56): pmi > 406 and pmi < = 635; G 3 (n = 56): pmi > 635 and pmi < = 867; G 4 (n = 55): pmi > 867 and pmi < = 1401. A similar approach as the one described above was then applied to compare all the pre-mortem samples with each of the G 1 , G 2 , G 3 , and G 4 groups. Figure 4b and Supplementary Table 8 shows the number of differentially expressed genes for the different intervals.

Transcriptional patterns of pre and post-mortem blood

In order to explore the transcriptional differences in pre and post mortem blood samples we have built the respective expression matrix based on RPKM values that were then log2 converted and normalized with normalize.quantiles function as previously described. We then performed hierarchical clustering (HC) and multidimensional scaling (MDS). We defined the distance between samples a and b as, dist(a,b) = 1 – cor(a,b), where cor is the Pearson correlation of a and b vector. Hierarchical clustering solution was then computed with hclust function using the average method. Visualization was performed using the heatmap.2 function with the input of the distance matrix and the previously calculated HC solution as the dendrogram parameter. Postmortem samples (n = 20) with a PMI smaller than the respective individual PMI were excluded. Heatmap with PMI interval colors is shown in Supplementary Fig. 20, and samples in MDS plot were colored according to Hardy Scale (Fig. 4a).

Signaling pathway models

The hiPathia34 tool was used for the interpretation of the consequences of the combined changes of gene expression levels and/or genomic mutations in the context of signaling pathways (see Supplementary Fig. 40 and 41). Significant circuits associated to PMI were obtained by fitting a linear model and were summarized by the median value across samples per circuit and time points. Supplementary Note 3 and Hidalgo et al.34 provide the algorithmic details of the methodology.

Gene structural features

Features were derived from the Gencode annotation v1926, including the number of projected (non-redundant exonic regions) exons, length of the coding regions, overall length of the gene, biotype. We obtained projected exons first by sorting by genomic coordinates and then by merging exons. We used bedtools50 for this step. GC content was obtained from the Ensembl Biomart (www.ensembl.org/biomart). For each tissue we have calculated the Pearson correlation between the vector of gene features and the respective correlation value between gene expression and PMI. Supplementary Table 7 contains the correlation values per tissue for each feature.

Mitochondrial transcription

For estimating the mitochondrial RNA concentrations (MT%), we divided all reads in annotated mitochondrial (mt) genes by the total number of reads in annotated (nuclear and mitochondrial) genes. To account for the substantial different mitochondrial activity across tissues, we divided each sample by the median MT% found in the corresponding tissue (nMT%). We then regressed a linear model nMT% ~ PMI and compared the slopes obtained at each time point between the different tissues (Supplementary Fig. 17). Correcting for the influence of age in the linear model changed the distribution of relative MT% only marginally (shifted values < = 0.05) (Supplementary Fig. 16).

RNA-seq metrics across tissues

We have explored if the different tissues show differences in RNA-seq quality control metrics obtained with the RNA-SeQC40 pipeline. Supplementary Table 11 lists the variables used for this analysis. The mapping proportions along the different gene features are shown in Supplementary Fig. 11 and 12. Degradation of the RNA may result in different mapping bias effects, in particular in a higher read coverage at the 3′ end of the genes. We have calculated for each sample a read coverage ratio between the 5′ and the 3′ 50bp-based normalization. Distribution of these values for the different tissues is shown in Supplementary Fig. 13 and the relation with RIN and PMI are shown in Supplementary Fig. 14 and 15.

Clustering modularity

To assess if gene expression signatures of tissues are preserved across the PMI bins, as defined above (section “Non-linear Temporal Differential Expression”), we selected only the tissues that had at least 10 samples within each bin of PMI. Because differences in the number of samples per tissues can introduce variation in the network structure, we randomly selected the same number of samples per tissue in each PMI bin, corresponding to the minimum number of samples per tissue across all the PMI bins. Thus, we have 4 combinations of 422 samples, one for each PMI bin, with the same tissues, and the same number of samples per tissue. For each combination of samples, we compute pairwise Pearson’s correlation coefficient on the log2-transformed RPKM expression values after adding a pseudo-count of 1. From each matrix of correlation coefficients we built 7 networks, where nodes are the samples and edges are connections between samples that are correlated with a coefficient higher than a given threshold (out of 7 thresholds, from 0.86 to 0.92). These thresholds gave comparable network densities, defined as the proportion of connected nodes over the total number of possible edges, across all the networks. We used the modularity formula (R package igraph51, modularity function) to measure how well the samples in each network are aggregated by tissue type. Supplementary Fig. 9 shows the distribution of modularity with relation to network density.

Exon inclusion analysis

GTEx samples were processed through the Integrative Pipeline for Splicing Analyses (IPSA) pipeline with default settings52. Namely, short reads were mapped to human genome (hg19/GRCh37) using TopHat41(v1.4). The alignments were filtered to have an overhang of at least 8 nt and entropy of the offset distribution of at least 1.5 bits. Novel short exons (shorter than the read length) were predicted using reads with more than one split with canonical GT/AG splicing nucleotides and minimum entropy of at least 1.5 bits for each splice junction. The percent-spliced-in (PSI) metric was computed as in Wang et al.33 by using inclusion and exclusion reads with the minimum total count of 5 reads; that is, exons for which the combined number of inclusion and exclusion reads was less than 5 were excluded.

From the PSI values calculated by the IPSA pipeline52 we have performed further filtering on a tissue basis based on the three following criteria: Select exons with NAs in less than 10% of the cases; Select exons they have a standard deviation greater than 0; Select exons if the difference between the max and min PSI values is larger than 0.1;

From this subset of selected exons we have then performed correlation analysis of the PSI value with the PMI value for each tissue. Supplementary Fig. 18a shows the number of tested exons according to the above filtering and Fig. 3d the number of significant exons at 1% FDR and |r| > 0.5.

Differential exon inclusion in blood

Exons with PSI values following the three criteria defined in the previous section in Blood samples were selected for differential expression analysis. This yielded a set of 3441 exons. Next, differential exon inclusion was tested with Wilcoxon Sum Rank test, with multiple testing adjustments by Benjamini-Hochberg method44, and the median of the PSI values in each group calculated. Exons were deemed significant included if they pass the following criteria: FDR < 1%; and |ΔPSI| > 0.1;

Cellular composition

In order to perform gene expression signal deconvolution we applied CIBERSORT35 v1.04 and the LM22 gene signature to all blood samples, using gene RPKMs, with default parameters, deconvoluting the signal into 22 different cell types. We discarded four cell types with average fraction below 0.01 in both conditions, keeping 18 cell types. We compared the cell-fractions of pre- and post-mortem samples globally using the Anderson–Darling test53 and by cell-type obtaining a p-value using the two-sided Wilcoxon Rank-Sum test54, adjusted to multiple-testing by Benjamini–Hochberg method44.

Splicing entropy analysis and PMI association

To investigate changes in patterns of isoform usage and how these correlate with the PMI we have calculated the splicing entropy based on the relative abundance of an isoform/transcript within a gene. The following selection criteria and calculation was performed on a tissue-by-tissue basis: start by selecting genes with two or more isoforms; next, select genes with a non-zero expression in 90% of the samples of the tissue. Calculate isoform ratios for each gene: For a gene G, with k isoforms I, the splicing ratio is defined as:

\(P\left( {I_i} \right) = \frac{I_i}{\sum

olimits_{i = 1}^{k} {I_i}}\), where I i corresponds to the RPKM value for the isoform i of G.

Finally, calculate the entropy of a gene based on the Shannon Entropy formula, as:

$$E( {G} ) = - \mathop {\sum}

olimits_{i = 1}^{k} {p}\left( {I_i} \right) \times \log p\left( {I_i} \right)$$

The splicing entropy of gene G is maximal if all its isoforms have the same ratio and minimal if one of the isoforms dominates all the expression of G.

Then, for each gene, we correlate the splicing entropy with the respective PMI of the sample. From this test, we obtain the r-value and p-value. Perform p-value adjustment for multiple testing by Benjamini–Hochberg method44. We repeat this analysis for all the selected tissues. In Supplementary Table 10 we provide the total number of genes tested per tissues and the genes with a |r| > 0.5 and FDR < 5%. Figures 3g, h present an example of a gene with a significant change in lung. Supplementary Fig. 19 presents the distribution of the correlation values for Splicing Entropy and PMI across the different tissues.

Machine learning models for PMI prediction

The predictive model for PMI based on gene expression was constructed with a two-step approach using an ensemble of gradient boosted trees (Supplementary Fig. 24) in order to provide a robust estimate of PMI and avoid overfitting. 528 available individuals are initially partitioned into training and testing datasets, using 75% and 25% of the data, respectively. This partition is performed in such a way that we try to keep a similar underlying distribution of the number of available tissues per individual both for the training and testing datasets.

In order to build these tissue models (with the R implementation of the xgboost package37), we first create a fixed split of individuals into training and test sets. For a given tissue, we perform 3-repeat-5-fold cross validation with the samples corresponding to the individuals of the training block in order to select the best model, and we generate the predictions over the unseen test set using this model. This process is repeated 13 times using different seeds to take into account the variation in the hyperparameter optimization process. The output is a matrix of n samples×13 columns, where each column represents the tissue PMI prediction of all test samples for each iteration. The final tissue PMI predictions will be taken as the row average of this matrix. Only protein coding genes with a correlation > = 0.4 with the tissue ischemic time were used in order to reduce the computational burden of model fitting. No other covariate was considered. Hyperparameter search was performed during this cross-validation loop using standard grid search for tree depth ranging from 4 to 6, η ranging from 0.001 to 0.1, γ ranging from 0 to 0.15 and up to 1000 rounds, using RMSE as optimization criteria. For each tissue we repeat the previous process 13 times using different seeds to determine the training set fold partitions in order to have a measure of the variability of the final prediction while applying the models on the test set (Supplementary Fig. 24b). On Supplementary Fig. 25a, we show the variability of the number of genes selected by the each 13 models, per tissue.

Once we have obtained the 13 models per tissue, we use each one of them to generate PMI predictions for each tissue sample in the test set. Therefore, 13 predictions are generated per tissue for a given test individual. We take the average of these predictions as our final PMI prediction for that particular tissue. On Supplementary Fig. 27 we can see examples of the tissue performance on the test set.

In the second step of our procedure, for each individual we will correct the final tissue PMI predictions by subtracting the elapsed time of the GTEx procedure. Since we know how much time has passed since the beginning of the GTEx procedure until a specific sample has been processed, this time difference has to be subtracted from the tissue PMI predictions in order to normalize them to a reference level, which is the start of the GTEx procedure. Finally, to predict the individual PMI (which is considered to be the time from death until the beginning of the procedure), we compute the average of these corrected tissue PMI predictions.

One important remark is that the final quality of the individual PMI prediction for the test individuals will highly depend on how accurate the individual tissue models are. For this reason, while performing the second step of the prediction procedure, we decided to use only the top 20 tissues with the best R2 performance in the training data (Supplementary Fig. 25b). The R2 for each of these tissues while applying the models on the test set is shown on Supplementary Fig. 26. The density of the individual PMI prediction error, which is defined as the signed difference of the real and predicted individual PMI is shown on Supplementary Fig. 30.

To investigate whether we are recovering a real predictive signal from gene expression instead of just an artifact, we used whole Blood samples, where there is information available for pre-mortem individuals as well. We reasoned if we were able to predict accurately the time to death for pre-mortem individuals this would reflect overfitting. To this end, for each cohort (pre-mortem and postmortem), we partitioned the data into training and testing datasets, fitted the model on the training data with 3-repeat-5-fold cross validation, performed the predictions on the test set and then obtained the regression statistics of real vs. predicted Blood PMI. We repeated this process a hundred times with different seeds, generating a different training/testing partition each time, in order to study the variability of the regression statistics. There is a significant difference of the regression statistics between pre-mortem and postmortem samples, with a very poor fit for pre-mortem samples (Supplementary Fig. 29).

To study the stability of the tissue PMI predictive models we decided to evaluate the tissue performance irrespective of the training/testing partition used so far. For each tissue we partition all the available samples into 75% for training and 25% for testing. Once each tissue model is fitted on the training set with 3-repeat-5-fold cross validation, we calculate the p-value of the F-test (Supplementary Fig. 28), R2 and slope (not shown) for the regression of the predicted tissue PMI in the test set versus the real tissue PMI. This process is then repeated 50 times by varying the training/testing partition in order to measure the variability of the regression statistics.

In order to find the optimal subsets of tissues for predicting an individual’s PMI, we computed the individual PMI with all the possible combinations of sizes 2–6 of the available tissues for each individual in the test set. Then for each set size we keep the tissue combination that performed the best for each individual, in terms of individual PMI prediction error. With this, we can calculate the proportion of times a given tissue appeared in the optimal subset of a fixed size (Supplementary Fig. 32). We see that Adipose—Subcutaneous, Lung, Thyroid and Skin—Sun Exposed (Lower leg) are the tissues that consistently appeared on more optimal subsets of all sizes; and these tissues are also the ones that appear among the top most stable ones on the previously mentioned stability analysis. We then compute the individual PMI using all the possible combinations from size 2 to 4 using these four tissues (Supplementary Fig. 33 and 34).

As an additional description of stability of the individual PMI predictions, we have computed the standard deviation (SD) and coefficient of variation (CV) of the corrected tissue PMI predictions for each individual in our original test set, and generated the density curves of these statistics, shown on Supplementary Fig. 31, both for the top 20 tissues and the top 4 tissues of the best subset analysis.

Supplementary Fig. 35 shows the distribution of the death classes in the test set individuals of our PMI prediction methodology. In order to inspect if there is any immediate effect of the cause of death on the individual predictions, we have grouped the death classes in three larger categories: cerebrovascular disease, heart disease (which groups “Ischemic heart disease” and “Other forms of heart disease”) and others (which groups the remaining classes). We observe that the performance of the predictions in terms of R2 is very similar among the death classes.

To perform the prediction using TIN15 data, we have employed the same methodology as with gene expression data, using the same initial partition of training and testing individuals. Since it is a proof of concept, we have only performed 3 repetitions of the process instead of 13 like in the gene expression method in order to reduce the computational burden. Supplementary Fig. 36a shows the performance of the model based on the TIN measure at the tissue level, while Supplementary Fig. 36b compares the predictions based on TIN with the predictions based on gene expression at the individual level. On Supplementary Fig. 36c, we show the number of most informative genes (defined as the union of genes with importance > = 0.1 across all the 13 models in the case of the gene expression model, and the union of the genes with importance > = 0.1 across the 3 models in the case of TIN, with “importance” being a measure computed by xgboost) with respect to each tissue and data type.

Data Availability

All data are available from dbGaP (accession phs000424.v6.p1) with multiple publicly available data views available from the GTEx Portal (www.gtexportal.org). The code can be obtained at https://public_docs.crg.es/rguigo/Papers/human_PMI_transcriptome/.