Defining and computing disease prevalence curves

To capture the distribution of disease prevalence across age and sex, we computed disease prevalence curves by dividing the total number of disease codes (ICD codes) within each age-sex stratum by the number of enrolled patients matching these demographics (see Methods part 1 for the precise definitions). A disease prevalence curve’s shape reflects the multiplicity of age-specific landmark events in a patient’s life, ranging from health-neutral medical checkups (which can nevertheless reveal underlying conditions), to age-specific hormonal changes (e.g., puberty, pregnancy, or menopause), and to traumas and infections that may also correlate with age. Despite the existence of some country-specific variations (for example, in bipolar disorder, rheumatoid arthritis, and depression, see Fig. 1a), the curve shapes are rather consistent across countries for a large set of diseases, (see autism and gastrointestinal infection curves, and Supplementary Fig. 1 for selected examples, which were first discovered using the US25 and Danish26 cohorts, and then validated using the Swedish data27,28).

Fig. 1: Disease prevalence curves fall into five major shape clusters. a Representative disease prevalence curves for neurodevelopmental, psychiatric, infectious, inflammatory, autoimmune, and some miscellaneous diseases; we show disease names at the top of the corresponding plot. A curve’s x-axis corresponds to the age of diagnoses (not necessarily the first one in the patient’s recorded health trajectory), and the y-axis denotes the relative prevalence of each diagnosis in the corresponding age and sex group. For ease of comparison across countries, we re-normalized each curve to sum to 1. We computed the curves for two countries: the US and Denmark. US male-specific curves are depicted with blue-dotted lines and female-specific ones with red solid lines. As for their Danish counterparts, male-specific curves are shown with green-dotted lines and female-specific ones with purple solid lines. Each curve is supplied with a 99% confidence interval (in transparent colors). We find that some disease curves are consistent across countries and sexes (e.g., autism and gastrointestinal infection), while others vary by country only (e.g., bipolar disorder and rheumatoid arthritis), and still others vary by sex only (e.g., osteoporosis and Crohn’s disease). b A distance matrix, shown as a heatmap, represents the shape dissimilarity between curves measured via the Jensen-Shannon divergence (Methods part 2). We applied a hierarchical clustering algorithm and elbow model selection to arrive at a five-cluster classification of curve shapes; the five clusters, c1–c5, are shown in red, yellow, green, blue, and purple, respectively. c At the left side of the plate, three columns of stacked bar charts summarize the compositions of each cluster in terms of disease category, sex, and country. At the right side of the plate, we show the optimal curve alignments (after relative shifts along the x-axis) of several representative diseases from each cluster. For each disease, we computed variations across four prevalence curve instances (two countries by two sexes), showing the curve mean and variation as a solid line and a shaded same-color area, respectively. The optimal relative shifts for the alignment are written as bracketed integer numbers (in years) after each disease name. Full size image

To further investigate shape-of-curve similarity, we defined a symmetric distance measure (see Methods part 2 for analytical details; Fig. 1b showing the full dissimilarity matrix). Our comparison across the whole disease spectrum and two countries (US and Denmark) identified five clusters (see Supplementary Fig. 2 for model selection results and Supplementary Data 7 for the complete list of over 500 studied diseases), and they have very distinct disease category, sex, and country compositions (Fig. 1c). For example, the smaller Clusters 4 and 5 primarily comprise neoplastic and developmental diseases, respectively. In these two clusters, the proportions of Denmark-derived disease curves are larger than US-derived ones. In contrast, US-derived curves are more common in Cluster 3. These clusters correspond to distinct shapes of curves: Cluster 1 corresponds to L-shaped early-onset conditions; Clusters 2 and 4 include reversed L-shaped curves (the former being early but slow rising, while the latter being later- but steeper-rising); Cluster 3 is the only multi-modal curve shape type; and Cluster 5 presents a skewed bell shape with a less heavy right tail than that of Cluster 1. For each cluster, we show a few representative disease curve alignments across different categories (indicated below in brackets). For example, in Cluster 1, parasitic infection aligns with an array of noninfectious diseases, including neurofibromatosis (hereditary and neoplastic), tympanic membrane disorders (otic), osteogenesis imperfecta (hereditary and musculoskeletal), and congenital eye anomaly (developmental). To the best of our knowledge, disease curves, standardized across age and sex, have never previously been systematically compared. The discovered resemblance, which can be of great interest to researchers and physicians, likely reflects a combination of shared factors in genetics (e.g., among autism, conduct disorder, tics, and Tourette syndrome29), environmental exposures, and developmental triggers (onset of puberty or menopause), or even direct causal links.

Defining and computing disease embeddings

To further augment disease similarity descriptors for this imputation procedure, we implemented a disease embedding approach30,31,32, inspired by natural language processing. To construct an embedding, we used a neural network as the mathematical representation of a word’s underlying semantics, given its surrounding words (Methods part 3). The intuition behind this method is that semantically-similar words likely share contexts and would thus be encoded by similar vectors of continuous parameters learned by a neural network. By analogy, chronologically ordered diseases in a patient’s health record are “words,” while the entire historical record becomes a “sentence.” We employed over 151 million unique patient histories to compute the embedding for over 500 major diseases within a 20-dimensional continuous space, in which each disease is represented by a 20-dimensional vector (see Fig. 2 for snapshots of 3-dimensional projections of the embedding). To make our choice of dimensionality for the embedding space, we were driven by the following considerations: (1) the space dimensionality should be much smaller than the “vocabulary” size (over 500 in our case), but also be reasonably large enough to ensure adequate predictive power, and (2) the disease embedding with 20 latent dimensions should generate a reasonable nosology, as judged by physicians in our team. We further compiled a collection of published parameter estimates, including 1146 h2 estimates and 1947 various corr estimates (see Supplementary Data 1 and 2, respectively).

Fig. 2: An embedding disease mapping into metric space positions, with related diseases close to each other. Diseases can be mapped to points in low-dimensional metric space (so-called “disease embedding”). See the three-dimensional projections of our 20-dimensional embedding in (a)–(d) in this figure, where similar diseases are closer to each other in metric space than dissimilar ones. This 20-dimensional disease embedding turned out to be extremely useful in this study for estimating population-genetics parameters for individual diseases. a We projected the 20-dimensional disease embedding vectors of over 500 diseases into 3-dimensional space for ease of visualization, using the t-SNE algorithm52. We color-coded the spheres representing the diseases by each corresponding disease category. Plate b shows Mendelian vs. non-Mendelian disease distribution. Plate c shows disease-specific sex bias (defined in such a way that it is 0 for diseases that are equally frequent in males and females, −0.5 for diseases that occur only in females, and +0.5 for those occurring only in males). Plate d shows diseases color-coded in accordance with their onset ages, where green colors indicate early-onset childhood diseases, and warmer colors point to later-onset diseases. Full size image

Building estimators from disease descriptors

Disease prevalence curves and disease embeddings derived from the US dataset were used as disease-specific descriptors for modeling. The modeling features also included specifications about predicted estimates (data type and mathematical model used), basic information about the investigated cohorts (country of origin and sex), and disease characteristics (category of biological systems that the disease belongs to, and the onset age). A detailed description of disease features used in the model can be found in Methods part 4. Equipped with various descriptors for individual diseases, disease–disease similarity measures, and large collections of legacy heritability and inter-disease correlation estimates, we proceeded to estimate missing genetic parameters across the whole pathology spectrum (see Fig. 3a for the outline of our modeling framework, and Fig. 3b–d for full collection of results). We tested a battery of modeling approaches, out of which a gradient boosting regression performed the best33,34,35 (see Table 1 and Methods part 7 for details). As a measure of estimate quality, we used Pearson’s correlation between imputed values and “actual” parameter values, training the ensemble regression model on 80% of the data and testing it on a held-out 20%. To ensure that results were not biased by a single lucky data split, we repeated this computation for 1000 randomly partitioned datasets.

Fig. 3: Estimating population-genetics parameters for hundreds of diseases and thousands of disease pairs. Here, h2 denotes heritability, and corr is a correlation between a disease pair which can be genetic, environmental, or phenotypic. a A workflow explains the key steps of our model development. We used three national-scale health registries, representing the United States, Denmark, and Sweden, which comprised 3.8 billion, 154 million, and 95 million disease diagnoses, respectively. We computed curves reflecting disease prevalence by age and sex (disease prevalence curves) and derived a metric mapping (disease embedding in metric space) for the whole disease spectrum. We used these two complementary representations to estimate hundreds of thousands of disease-specific parameters. We then validated the accuracy of our model’s predictions by benchmarking them against previously-published (“actual”) estimates that were not used in model training. Plates b and c show kernel density estimation plots we computed from 1000 random 4:1 splits of data (4/5 for training and 1/5 for testing). We used these plots to visualize the joint distribution of the actual data for testing and model-predicted values. The linear fit slopes between the actual and predicted values are 0.996 for h2 and 0.993 for corr, indicating nearly perfectly unbiased estimations. d The distributions of Pearson’s correlations between the actual and predicted values have mean values of 0.870 for h2 and 0.874 for corr. e A distribution of the mean age of disease-specific diagnosis bearers. The median of the mean ages over all diseases is around 42 years, and specifically, the mean ages of autism, bipolar disorder, and schizophrenia that appeared in the US data are 9, 40, and 41, respectively. f There is a significant positive correlation between disease onset age and diagnosis count in the US data, suggesting there are less-than-expected, rare, late-onset diseases. g The relationship also holds for each of the five disease clusters. For individual clusters (c1–c5), we show the best linear approximation, regression coefficients (p values were computed using Student’s t test), and Spearman’s correlation ρ (p values were computed using algorithm AS 89), color-coded by the shape cluster. Superscript asterisks indicate significance level of the estimates being different from 0. Full size image

Table 1 Performance comparison of different modeling algorithms. Full size table

Contour plots in Fig. 3b, c show the joint distributions of model predicted and previously published estimates: in the case of h2, the density peaked around (0, 0) and (0.4, 0.4), indicating denser collocations of published and predicted estimates there; while as for corr, the estimates exhibited a unimodal distribution with a peak close to (0.05, 0.05). The slopes of both linear regressions were close to 1, with negligible intercepts, indicating that our estimates were nearly perfectly unbiased. The correlations between our predictions and the corresponding legacy estimates had means of 0.870 ± 0.001 (95% confidence interval computed based on 1000 replicates) for h2 and 0.874 ± 0.001 for corr (Fig. 3d). As shown in Supplementary Table 1, over 40% of the useful information is from disease prevalence curves, over 30% from disease embeddings, and the rest mostly attributable to data types and mathematical models used in the relevant published studies. A detailed breakdown of the contribution of the 20 embedding factors shows that all the 20 factors contribute nearly equally.

To evaluate if our estimators were reasonably accurate, we computed a measure of agreement among previously published h2 estimates. For this comparison, we used only published, independent estimates, matched both by data type and estimation methodology (if there were more than two estimates of the same type, we used all of them, generating all possible comparison pairs). We obtained 205 pairs of estimates in total and computed a Pearson’s correlation value of 0.51, Student’s t test p = 3.5 × 10−15 (Supplementary Fig. 3a); agreement among these past estimates was much lower than what we observed for our estimates. Furthermore, the comparison between our estimates and very recently published sets of estimates12 (which were used in neither training nor validation in our analysis) showed a significantly higher concordance between the two sets of estimates than legacy data (Pearson’s correlation 0.71, Student’s t test p = 4.8 × 10−22, the number of estimates for comparison = 136; see Supplementary Fig. 3b and Supplementary Data 3 for comparison details). Similarly, to assess the accuracy of correlation estimates, we were able to identify an additional, independent dataset of genetic correlations36 and reserved it exclusively for testing purposes. This test dataset was generated in context of GWASs and using a linkage disequilibrium score (LDSC) regression, we compared our predictions for the same data type and mathematical method. This confirmed a significantly high concordance (Pearson’s correlation = 0.73, Student’s t test p = 1.7 × 10−14, the number of estimates for comparison = 80; please see Supplementary Fig. 3d and Supplementary Data 5 for comparison details). We therefore cautiously claim that our estimates are at least as good as those computed with traditional methods.

The addition of numerous genetic parameter estimates helped to statistically empower our downstream findings.

Properties of disease heritability estimates

Our initial analysis of US medical data generated a curious finding: The apparent overabundance of diseases with early onset. The distribution over the mean age of first disease diagnosis is approximately bell-shaped but skewed towards younger ages, with onset age mode around 42 years over all diseases (Fig. 3e). The total number of disease assignments is positively correlated with disease onset age (see Methods part 1 for the precise definition; Spearman’s correlation is ρ = 0.32, p < 10−16 computed using algorithm AS 89 (refs. 37,38), as shown in Fig. 3f), and individual shape clusters all agree on the positive correlation (Methods part 5, Fig. 3g, and Supplementary Table 2). This observation suggests that early-onset diseases outnumber late-onset diseases in the human population, and the former tend to have lower prevalence. Possible explanations for this could be associated with: (1) current clinical practice, such as routine newborn screening and monitoring, generating an overabundance of early-life health observations; (2) a tendency for conditions with substantial genetic etiology to have an earlier onset age while simultaneously being driven to lower prevalence through negative selection; and (3) a historical bias in biomedical discovery since early-onset diseases were easier to document and categorize.

Because hypotheses (1) and (3) mentioned above cannot be tested with the data currently available to us, we focused first on hypothesis (2) and sought evidence of a systemically increased genetic load among early-onset diseases, using a narrow-sense heritability. The relevant legacy estimates were sparse (Fig. 4a and Supplementary Fig. 4a plot the twin/family and SNP/PRS-type estimates, respectively, against disease onset age). This study’s imputation analysis added about 800 estimates to twin/family and SNP/PRS-type heritability estimates. The overall linear relationship between onset age and heritability was significantly, negatively sloped (Fig. 4a, b and Supplementary Fig. 4b). However, when examined individually, the five-disease prevalence curve shape clusters exhibited heterogeneous behavior (Methods part 5, Fig. 4c, and Supplementary Fig. 4c). The detailed results from this analysis are provided in Supplementary Table 3.

Fig. 4: Analyses empowered by our estimates of heritability (h2), and genetic and environmental correlations (r g and r e ). Plate a includes analyses solely based on the previously published estimates of twin/family-type h2, suggesting a significantly negative correlation between disease onset age and heritability. b Our estimator substantially enriched the collection of twin/family-type h2 estimates, filling in numerous missing estimates for under-studied diseases. When we analyzed disease prevalence curves jointly, we found a significantly negative correlation between disease onset age and h2, which also holds for h2 estimates based on other data types, such as SNP/PRS-type (Supplementary Fig. 4b). c We performed the same analysis for diseases within each of the five curve shape clusters, also confirming the significantly negative correlations for shape Clusters 1–3. In the smaller Clusters 4 and 5, the correlations were not significant (Methods part 5). d To understand the relationship between a disease pair’s dissimilarity of disease prevalence curves (D soc ), and the r g and r e for the same disease pair, we performed a regression analysis, expressing D soc as a function of r g , r e , and an interaction term r g ·r e (p values were computed using Student’s t test, see Methods part 6). The corresponding regression coefficients turned out to be 0.41, −0.30, and −0.52, respectively. This regression analysis suggests the following: When two diseases have only high genetic correlation, their prevalence curves are likely to be very different; if only environmental correlation is high, the prevalence curves tend to be much more similar. However, disease prevalence curves are most similar when both environmental and genetic correlations between the two diseases are high. The included disease pairs across all categories are represented as hundreds of thousands of data points in the plot and they are colored according to the D soc values. We also repeated the same D soc regression analysis with all disease pairs from distinct disease categories (see Supplementary Data 6 and Supplementary Fig. 5). Full size image