Deriving the distribution of absolute genetic risk

Human diseases are often considered to be dichotomous traits: an individual is either affected or unaffected. For such traits, the heritability of liability is frequently used to study inheritance2. The concept implies that every individual has a liability to disease, which is the sum of several genetic and environmental components. Usually the liability is assumed to be normally distributed in the population, and a threshold on the liability scale determines whether an individual acquires the disease. Hence, the standard liability model is usually interpreted as a threshold model7,21. This model allows for the decomposition of the variance into genetic and environmental components. It is appealing because the variance on the liability scale does not depend on the disease prevalence. Furthermore, the normally distributed liability has some justification in the central limit theorem: if the liability of a trait is due to several additive genetic and environmental factors, it may approximately follow a normal distribution.

In the 1970s a mathematically equivalent interpretation of the threshold model was described, which is based on the genetic liability \(\iota_G\), i.e., the liability solely due to genotype22. In the Methods section, we derive the risk of disease given \(\iota_G\), which we denote Y. We then express the distribution of Y to study how the genetic risk varies on an individual level. Wray et al.23 use similar concepts to show that the probit model fits observed family data24. Here, we will use summary estimates of the heritability \(h^2\) from twin studies to derive the distribution of Y for 15 common cancers. Once the absolute risk distribution is derived, we can obtain various measures of the genetic inequality in risk.

Exploring inequality in risk for 15 cancers

Mucci et al.20 recently reported heritability estimates for 15 common cancers based on the heritability of liability model, using data from Nordic twin registries. We will apply the sampling algorithm described in the Methods section to derive the distribution of absolute risk for these 15 cancers. To illustrate this, Fig. 1 shows the estimated genetic risk distribution for the four most common cancers. We interpret the genetic risk as the individual life-time risk of disease, given that the individual's genetic make-up is known but the environmental exposure is unknown. The interpretation relies on the assumptions underlying the heritability of liability model, e.g., that genetic factors and environmental factors are independent on the liability scale.
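The sampling algorithm itself is given in the Methods section, which is not reproduced here. As a rough sketch only, assuming the standard threshold model in which the total liability is the sum of a genetic part G ~ N(0, h^2) and an independent environmental part E ~ N(0, 1 - h^2), the risk given genetic liability g is Phi((g - T) / sqrt(1 - h^2)), where T is the threshold matching the life-time risk. The function name `sample_genetic_risk` and the input values below are hypothetical and are not estimates from the twin study.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def sample_genetic_risk(h2, prevalence, n=100_000, seed=0):
    """Draw n values of Y, the disease risk given genetic liability.

    Assumption (not taken from the Methods section): liability = G + E with
    G ~ N(0, h2) and E ~ N(0, 1 - h2) independent, and disease occurs when
    the liability exceeds a threshold T.
    """
    if not 0.0 < h2 < 1.0:
        raise ValueError("h2 must lie strictly between 0 and 1")
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    rng = random.Random(seed)
    # Threshold T chosen so that P(liability > T) equals the prevalence.
    threshold = STD_NORMAL.inv_cdf(1.0 - prevalence)
    sd_genetic = h2 ** 0.5
    sd_environment = (1.0 - h2) ** 0.5
    risks = []
    for _ in range(n):
        g = rng.gauss(0.0, sd_genetic)
        # P(E > T - g) for the environmental part, given genetic liability g.
        risks.append(STD_NORMAL.cdf((g - threshold) / sd_environment))
    return risks


# Illustrative inputs only.
risks = sample_genetic_risk(h2=0.4, prevalence=0.1)
mean_risk = sum(risks) / len(risks)
```

Under these assumptions the mean of the sampled risks should be close to the prevalence, which gives a simple sanity check on the simulation.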

Fig. 1 Genetic risk distribution for four common cancers. The distribution of risk due to genetic differences is displayed for four common cancers, using heritability and prevalence data from Nordic twin registries20.

By obtaining the risk distributions, we are able to explore the genetic contribution to disease risk. To do this, we will suggest some useful summary measures.

Gini index

First, we use the Lorenz curve and its summary measure, the Gini index. Although rarely used in medicine and epidemiology, this metric adequately describes the variation in disease risk25,26. Importantly, it allows for comparison across measurement scales; the Gini index depends neither on the cumulative risk of a disease in a population (or the total size of an economy) nor on the size of the population itself. It only relies on the relative mean absolute difference between individuals26. Crudely, the Gini index is a number between 0 and 1, describing the inequality in disease risk across individuals. More precisely, the Lorenz curve is represented by a function L(S), in which S is a cumulative proportion of the population, ordered from lowest to highest risk, and L(S) is the fraction of the total risk that is carried by S. For example, if the risk is equal among subjects in the population, the fraction of risk carried by any 50% of the population would be L(0.5) = 0.5, which means that the Lorenz curve is a straight line. The Gini index is a ratio describing the deviation from this straight line, which can be interpreted as a coefficient of deviation in risk, either on the absolute or the relative scale26. A formal mathematical derivation is found in the Methods section.
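To make the definition concrete, the following minimal sketch computes an empirical Lorenz curve and the corresponding Gini index from a list of individual risks. It uses the usual identity that the Gini index equals one minus twice the area under the Lorenz curve, evaluated with the trapezoid rule. The function names are hypothetical, and the formal derivation in the Methods section may be organised differently.

```python
def lorenz_curve(risks):
    """Return points (S, L(S)): the share of total risk L(S) carried by the
    lowest-risk fraction S of the population."""
    ordered = sorted(risks)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, risk in enumerate(ordered, start=1):
        running += risk
        points.append((i / n, running / total))
    return points


def gini_index(risks):
    """Gini index as 1 - 2 * (area under the Lorenz curve)."""
    points = lorenz_curve(risks)
    area = 0.0
    for (s0, l0), (s1, l1) in zip(points, points[1:]):
        area += (s1 - s0) * (l0 + l1) / 2.0  # trapezoid between adjacent points
    return 1.0 - 2.0 * area


# Equal risk for everybody: the Lorenz curve is the straight line L(S) = S.
equal = gini_index([0.1, 0.1, 0.1, 0.1])
# All risk concentrated in one of four individuals.
concentrated = gini_index([0.0, 0.0, 0.0, 1.0])
```

With equal risks the index is 0 up to rounding. With all risk in one of n individuals the empirical index is (n - 1)/n, which approaches 1 as the population grows.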

In our context, a Gini index of 0 means that everybody has the same genetic risk of a particular cancer, whereas a Gini index of 1 implies maximum inequality in risk across individuals. The Gini index is widely used in economics and demography, e.g., to study inequality in income and wealth. In Fig. 2, we show the Gini index for 15 common cancers. The Gini index is derived by using the heritability \(h^2\) and life-time risk estimates from a recent Nordic twin study20. The red dashed line denotes the Gini index of income in the USA, using data from the World Bank27. Interestingly, the plot reveals a major inequality in cancer risk for the common cancers. For all specific cancers, the inequality in genetic risk seems to be larger than the inequality in income in the USA. We also studied the genetic risk of cancer overall, using the heritability of acquiring any type of cancer. This heritability estimate is lower than those of the individual cancers20, which is expected because a factor increasing the risk of a particular cancer does not necessarily increase the risk of other cancers. Still, the Gini index of acquiring any type of cancer was almost as large as the Gini index for income in the USA.

Fig. 2 Gini indices for 15 common cancers. The Gini indices with 95% confidence intervals are derived by using data from Nordic twin registries20. The red dashed line marks the Gini index of income in the USA.

We have displayed the relation between the Gini index and the heritability (Fig. 3a), and the relation between the Gini index and the observed relative risk in monozygotic co-twins of affected individuals (\(\lambda_M\)) (Fig. 3b). The areas of the circles are proportional to the life-time risk of the cancers. The three different measures of genetic contribution are related, but not co-linear, indicating that they capture non-overlapping information about the risk of disease. In particular, for cancer sites with similar heritability, the Gini index is relatively larger for the rarer sites.

Fig. 3 Gini indices, \(h^2\) estimates and twin recurrence risks are not co-linear. a The relation between Gini indices and heritability estimates is displayed for 15 common cancers. b The relation between Gini indices and monozygotic twin recurrence risks (\(\lambda_M\)). The area of each circle is proportional to the life-time risk of the corresponding cancer.

Quantile ratios

Alternatively, we may study the inequality in risk by using a quantile ratio. The population is partitioned into subsets according to quantiles of genetic risk, and we may estimate the ratio of affected individuals in the highest risk partition compared to the lowest risk partition. This metric is also frequently used to compare incomes in economics, e.g., the 20:20 ratio (\({\rm{RR}}_{20:20}\)), which compares the richest 20% with the poorest 20% of a population. Table 1 shows the \({\rm{RR}}_{20:20}\) of genetic risk, which highlights a substantial difference in risk across subgroups; those in the highest 20% of genetic risk carry substantially more of the disease burden than those in the lowest 20%. In comparison, \({\rm{RR}}_{20:20}\) for income is ~5 in the UK and ~9 in the USA28.
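As an illustration, the 20:20 ratio can be computed from a list of individual risks by comparing the total risk carried by the highest-risk fifth with that carried by the lowest-risk fifth. This is a minimal sketch with a hypothetical function name and toy input values, not the values reported in Table 1.

```python
def quantile_ratio(risks, fraction=0.2):
    """Ratio of the expected number of cases in the highest-risk `fraction`
    of the population to that in the lowest-risk `fraction`."""
    ordered = sorted(risks)
    k = int(round(len(ordered) * fraction))  # individuals per partition
    if k == 0:
        raise ValueError("too few individuals for the requested fraction")
    low_total = sum(ordered[:k])    # total risk in the lowest partition
    high_total = sum(ordered[-k:])  # total risk in the highest partition
    if low_total <= 0.0:
        raise ValueError("lowest partition carries no risk; ratio undefined")
    return high_total / low_total


# Toy example: ten individuals with risks 0.01, 0.02, ..., 0.10.
toy_risks = [i / 100 for i in range(1, 11)]
rr_20_20 = quantile_ratio(toy_risks)
```

Because both partitions contain the same number of individuals, the ratio of total risks equals the ratio of mean risks in the two groups.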

Table 1 Summary measures of the genetic risk of 15 common cancers

A hypothetical intervention

Related to quantile ratios, we may estimate the effect of hypothetical interventions on particular risk groups. Suppose, for example, that we were able to reduce the genetic risk of each individual in the highest 20% of the risk distribution to the average risk in the lowest 20%. This question could be relevant for public health professionals, because it suggests the potential benefit of identifying and subsequently intervening on high-risk populations.

We could calculate the relative risk of such interventions, assuming that the environment is left unaltered. Indeed, this relative risk is immediately obtained from the cumulative risk distribution. Let \(y_{20}\) denote the 20th percentile of genetic risk and let \(y_{80}\) denote the 80th percentile. Then

$${\rm{RR}}_{{\rm{interv}}{\rm{.}}} = \frac{{{\int}_0^{y_{80}} yf_Y(y){\rm{d}}y + {\int}_0^{y_{20}} yf_Y(y){\rm{d}}y}}{{E(Y)}}.$$

Relative risk estimates after such hypothetical interventions are found in Table 1. These risk estimates also suggest a major contribution of genes to disease development; for example, if we were able to reduce the risk of prostate cancer in the highest 20% to the average risk in the lowest 20%, we would reduce the number of cancers by a proportion of 1 − 0.26 = 0.74.
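The same quantity can be approximated from sampled risks. In the sketch below, each individual in the highest-risk fifth is assigned the mean risk of the lowest-risk fifth, and the new total risk is divided by the original total. This mirrors the two integrals in the numerator of the expression for the relative risk after intervention. The function name and toy inputs are hypothetical.

```python
def intervention_relative_risk(risks, fraction=0.2):
    """Relative risk after lowering the highest-risk `fraction` of the
    population to the mean risk of the lowest-risk `fraction`."""
    ordered = sorted(risks)
    n = len(ordered)
    k = int(round(n * fraction))  # individuals in each tail
    if k == 0:
        raise ValueError("too few individuals for the requested fraction")
    mean_low = sum(ordered[:k]) / k          # average risk in the lowest tail
    untouched = sum(ordered[: n - k])        # everybody below the upper tail
    new_total = untouched + k * mean_low     # upper tail replaced by mean_low
    return new_total / sum(ordered)


# Toy example: ten individuals with risks 0.01, 0.02, ..., 0.10.
toy_risks = [i / 100 for i in range(1, 11)]
rr_intervention = intervention_relative_risk(toy_risks)
prevented_fraction = 1.0 - rr_intervention
```

As in the text, one minus the relative risk gives the proportion of cases that would be avoided under the hypothetical intervention, assuming the environment is left unaltered.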

Using different sources of heritability data

Heritability estimates need not be derived from twin studies. Genome-wide association studies (GWAS) allow for the calculation of heritability estimates without relying on family structures29,30. These estimates account for the variability due to genetic variants tagged by single-nucleotide polymorphisms (SNPs), usually with a population frequency above 1–5%. Such array heritability estimates are therefore considered to be lower bounds of the overall heritability, but may yield important information about the inequality in risk due to genetic variants associated with common SNPs. Lu et al.29 estimated array heritability for a range of cancers, highlighting that array estimates capture approximately half the heritability from older twin studies. We may immediately apply our approaches to explore the inequality in cancer risk due to genetic variants tagged by SNPs. This could yield insight into, e.g., the benefit of targeting genetic variants tagged by SNPs in future interventions. In Fig. 4, we display the Gini indices derived from the array heritability estimates in Lu et al.29, again highlighting the substantial inequality in genetic risk.

Fig. 4 Gini indices derived from array heritability estimates. Gini indices with 95% confidence intervals are calculated from array heritability estimates derived from Lu et al.29. The black boxes are based on array heritability removing loci with known association with the cancers. The red dashed line marks the Gini index of income in the USA.

Alternative to the threshold model

Although frequently used, the assumptions of the heritability of liability model are not necessarily satisfied1. Considering the liability to be normally distributed is convenient and may agree with the central limit theorem, but testing this assumption is usually infeasible in practice7,24, and it may not be robust if the genetic risk is determined by few, rare genes1. When using twin data, we usually assume no gene-environment interaction on the liability scale1,31, and we consider monozygotic and dizygotic twins to share the same amount of environmental factors. Another issue is that the confidence intervals of heritability and common environmental components are often wide, even when hundreds of thousands of individuals are included in the study20.

Until now we have based our results on the heritability of liability assumptions. We may, however, suggest a different approach that does not rely on the concept of heritability. We achieve this by assuming that the risk due to both heritable factors and common environment follows a parametric distribution. First, we let this distribution be the beta distribution, which allows for a wide range of shapes of the risk distribution and is bounded by 0 and 1. Importantly, in this model the risk distribution is uniquely defined by the observed recurrence risk (e.g., \(\lambda_M\)) and the disease prevalence6. We use the beta model to investigate the risk distribution due to the total effect of genes and shared environment. That is, this measure will capture the maximum inequality in risk due to genes and shared environment. Hence, we would generally expect inequality measures from this approach, e.g., the Gini index, to be larger in magnitude than the heritability-based estimates. Intuitively, the differences should be relatively large if the shared environmental component is substantial, and relatively small if the shared environmental component is minor. In Table 1, the Gini index from the beta model \(\left( {GC_{{\rm{beta}}}} \right)\) is shown together with the Gini index from the heritability model \(\left( {GC_{h^2}} \right)\). The Gini indices from the beta model are generally larger than the estimates from the heritability model. As expected, the discrepancy is larger for the cancers with larger shared environmental components, which may be obtained from twin data as the fraction of the variance on the liability scale due to shared environment20 (env2 in Table 1). A plot similar to Fig. 2 including the beta Gini estimates is found in Fig. 5. For the cancers that were studied in both Mucci et al.20 and Lu et al.29, we have also compared twin heritability, array heritability and the estimates derived in this section (Fig. 6).
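One way to see why prevalence and recurrence risk pin down the beta distribution is a method-of-moments argument. The sketch below assumes, as a simplification that may differ from the procedure actually used, that monozygotic twins share the same risk Y and are affected independently given Y. Then E(Y) equals the prevalence K and E(Y^2) equals \(\lambda_M\) K^2, which determines both beta parameters. The Gini index then follows from the known closed form for a beta distribution, 2B(2a, 2b) / (a B(a, b)^2). All function names are hypothetical.

```python
import math


def fit_beta_risk(prevalence, recurrence_ratio):
    """Beta parameters (a, b) matching E(Y) = K and E(Y^2) = lambda * K^2.

    Assumption: co-twins share the risk Y and are conditionally independent
    given Y, so P(both affected) = E(Y^2).
    """
    k, lam = prevalence, recurrence_ratio
    if not (0.0 < k < 1.0 and lam > 1.0 and lam * k < 1.0):
        raise ValueError("need 0 < K < 1, lambda > 1 and lambda * K < 1")
    s = (1.0 - lam * k) / (k * (lam - 1.0))  # s = a + b
    return k * s, (1.0 - k) * s


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def beta_gini(a, b):
    """Gini index of a Beta(a, b) distribution: 2 B(2a, 2b) / (a B(a, b)^2)."""
    log_gini = (math.log(2.0) + log_beta(2.0 * a, 2.0 * b)
                - math.log(a) - 2.0 * log_beta(a, b))
    return math.exp(log_gini)


# Illustrative check: K = 0.5 and lambda = 4/3 correspond to a uniform
# risk distribution, Beta(1, 1).
a, b = fit_beta_risk(prevalence=0.5, recurrence_ratio=4.0 / 3.0)
gini_uniform = beta_gini(a, b)
```

Working on the log scale with `math.lgamma` avoids overflow when the fitted parameters are very small or very large, which can happen for rare diseases with high recurrence risk.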

Fig. 5 Gini indices from the beta distribution. Gini indices with 95% confidence intervals are displayed for the twin estimates in Fig. 2 (blue) together with estimates from the alternative beta distribution (red). The red dashed line marks the Gini index of income in the USA.

Fig. 6 Comparing Gini indices from different risk distributions. The Gini indices with 95% confidence intervals displayed in Figs. 2, 4 and 5 are shown together. Only the cancers that were reported in both Mucci et al.20 and Lu et al.29 are included. The red dashed line marks the Gini index of income in the USA.