Important knowledge about the determinants of complex human phenotypes can be obtained from the estimation of heritability, the fraction of phenotypic variation in a population that is determined by genetic factors. Here, we make use of extensive phenotype data in Iceland, long-range phased genotypes, and a population-wide genealogical database to examine the heritability of 11 quantitative and 12 dichotomous phenotypes in a sample of 38,167 individuals. Most previous estimates of heritability are derived from family-based approaches such as twin studies, which may be biased upwards by epistatic interactions or shared environment. Our estimates of heritability, based on both closely and distantly related pairs of individuals, are significantly lower than those from previous studies. We examine phenotypic correlations across a range of relationships, from siblings to first cousins, and find that the excess phenotypic correlation in these related individuals is predominantly due to shared environment as opposed to dominance or epistasis. We also develop a new method to jointly estimate narrow-sense heritability and the heritability explained by genotyped SNPs. Unlike existing methods, this approach permits the use of information from both closely and distantly related pairs of individuals, thereby reducing the variance of estimates of heritability explained by genotyped SNPs while preventing upward bias. Our results show that common SNPs explain a larger proportion of the heritability than previously thought, with SNPs present on Illumina 300K genotyping arrays explaining more than half of the heritability for the 23 phenotypes examined in this study. Much of the remaining heritability is likely to be due to rare alleles that are not captured by standard genotyping arrays.

Phenotype is a function of a genome and its environment. Heritability is the fraction of variation in a phenotype determined by genetic factors in a population. Current methods to estimate heritability rely on the phenotypic correlations of closely related individuals and are potentially upwardly biased, due to the impact of epistasis and shared environment. We develop new methods to estimate heritability over both closely and distantly related individuals. By examining the phenotypic correlation among different types of related individuals such as siblings, half-siblings, and first cousins, we show that shared environment is the primary determinant of inflated estimates of heritability. For a large number of phenotypes, it is not known how much of the heritability is explained by SNPs included on current genotyping platforms. Existing methods to estimate this component of heritability are biased in the presence of related individuals. We develop a method that permits the inclusion of both closely and distantly related individuals when estimating heritability explained by genotyped SNPs and use it to make estimates for 23 medically relevant phenotypes. These estimates can be used to increase our understanding of the distribution and frequency of functionally relevant variants and thereby inform the design of future studies.

Funding: This work was funded by NIH grant R03HG005732 (NZ and ALP), NIH fellowship 5T32ES007142-27 (NZ), and the Rose Traveling Fellowship Program in Chronic Disease Epidemiology and Biostatistics (NZ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Copyright: © 2013 Zaitlen et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

For most of the 23 phenotypes examined here, our results show that accounts for more than half of . As GWAS have not identified many SNPs with large effect sizes (i.e. is small), and is greater than by a considerable margin, it follows that there must be many associated sequence variants that remain to be discovered, i.e. these phenotypes are highly polygenic. Currently, only common variants are well captured by the genotyping arrays used in most GWAS studies. As the difference between and is likely due to common and rare variants not captured by the genotyping array [12] , it may be assumed that a fair number of association signals remain to be identified through more comprehensive approaches, such as whole genome-sequencing. However, our estimates of show that GWAS genotyping arrays capture a greater proportion of than indicated by previous twin-based estimates of .

We further introduce a new variance components method that provides simultaneous estimates of and . This method has two principal advantages. First, by adequately taking account of both closely and distantly related pairs of individuals, it minimizes the standard error of the estimates, whilst avoiding the upward bias that can result from calculations based on closely related pairs. Second, it produces both estimates of heritability for the same population sample, ensuring that and are directly comparable.

Here, we analyze the heritability of 23 complex phenotypes in an Icelandic cohort of 38,167 individuals, leveraging both a population-wide genealogical database and genotype data from over 300,000 SNPs that have been long-range phased across and between chromosomes (i.e. where not only the phase, but also the parental origin of alleles has been determined) [20] . Importantly, we develop an approach that allows to be estimated on the basis of both closely and distantly related pairs of individuals. We find, for all of the quantitative phenotypes, that our estimates of are smaller than those from the literature that were based on MZ/DZ twins [21] . Our results indicate that previous estimates were inflated by the impact of epistasis or shared environment.

There are two major challenges in comparing and to quantify missing heritability. First, there is the potential for inflation of estimates based on closely related individuals such as MZ/DZ twins. It is well known that epistatic interactions can inflate heritability estimates in studies of related individuals [13] . Recent work from Zuk et al [10] has examined this in detail. Other factors that could also lead to inflated estimates of using closely related pairs of individuals include dominance and shared environment. Second, there is a tradeoff between inflation and sampling variance when estimating . The recent variance component approach described by Yang et. al results in inflated estimates of in the presence of related individuals [12] , [14] , [15] , [16] , [17] . However, removing related individuals reduces the sample size, resulting in a larger standard error around the estimate [18] , [19] . Both of these issues can adversely affect estimates of missing heritability.

The narrow-sense heritability of a phenotype ( ) is the fraction of phenotypic variance that can be described by an additive model over the set of SNPs that are functionally related to the phenotype (i.e. the causal SNPs) [11] . It is commonly estimated by comparing the phenotypic correlation of monozygotic (MZ) to that of dizygotic (DZ) twins. The difference between and the fraction of phenotypic variance accounted for by variants discovered by means of GWAS ( ) is the so-called missing heritability. Recently, Yang et al [12] developed a method to estimate the variance explained by all SNPs on a genotyping platform including those that are not genome-wide significant ( ), representing the limit of for infinite sample size.

Although genome-wide association studies (GWAS) have resulted in the discovery of thousands of novel associations of loci to hundreds of phenotypes [1] , concerns have been raised about the finding that these loci appear to explain a relatively small proportion of the estimated heritability, the fraction of phenotypic variation in a population that is due to genetic variation [2] . This has led to considerable speculation by researchers about the genetic basis of complex human phenotypes and the “missing heritability”, i.e. the fraction of heritability not accounted for by the associations discovered to date [3] , [4] , [5] , [6] , [7] , [8] , [9] . Among the proposed explanations for missing heritability is the existence of many presently unidentified common variants with small effect sizes, rare variants not captured by current genotyping platforms, structural variants, epistatic interactions, gene-environment interactions, parent-of-origin effects, or inflated heritability estimates [3] , [5] , [10] . Studies that examine the sources of missing heritability can help researchers to evaluate the prospects of future studies focusing on common versus rare variation and thereby devise effective strategies to discover the remaining sequence variants that affect disease risk and other aspects of phenotypic variation in humans.

Results

Overview of methods Below, we provide an overview of the approaches we used to estimate various components of heritability. The details of these approaches are provided in the Methods section. We used a linear mixed model approach to estimate components of heritability. In this approach, each phenotype is modeled using a multivariate normal distribution. Each of the components of heritability that we estimated corresponds to a different model of the phenotypic covariance. Narrow-sense heritability ( ) estimates from variance component models rely on covariance matrices specifying the genome-wide genetic relatedness of individuals in the data set. An estimate of can be obtained by using an identity-by-descent (IBD) based covariance matrix, which is trivial to obtain from long-range phased genotype data (see below). The fine-scale estimates of IBD used here rely on long-range phasing data that are not available in most data sets. An estimate of can also be obtained by using an identity-by-state with threshold (IBS>t) based covariance matrix with all values below a threshold t set to 0, i.e. focusing on closely related individuals. An alternative is to use the full IBS based covariance matrix to obtain an estimate of the heritability explained by genotyped SNPs ( ), however, this requires removing related individuals [12]. If related individuals are included, the resulting estimate is neither an estimate of nor an estimate of . Previous approaches to estimating the heritability explained by genotyped SNPs ( ) required filtering related individuals, thereby increasing the standard error of the estimates. However, joint estimates of and can be obtained using two covariance matrices based on IBS>t and IBS. The first component provides an estimate of , and the second provides an estimate of . This approach removes the need to filter related individuals. Alternately, joint estimates of and can be obtained using two covariance matrices based on IBD and IBS, where here IBD replaces IBS>t to estimate . Broad-sense heritability (H2) is the sum of additive, dominant, and epistatic components of heritability. The additive, dominant, environmental (ADE) model can be used to obtain joint estimates of dominance and additive components of heritability, using two covariance matrices based on IBD2 (two copies shared IBD) and IBD [22]. Below, we investigate all of these modeling approaches. Table S1 contains definitions of all parameters quantifying components of heritability that are used in the text.

Estimates of narrow-sense heritability (h2) Ideally, estimates of narrow-sense heritability of a particular phenotype would be based on a genetic relationship matrix constructed from causal variants, representing the true genetic contribution to the phenotype [23]. However, as this set of variants is typically not known for most phenotypes, a proxy must be used for the pair-wise genetic covariance of individuals at the causal variants. Traditionally, this proxy has been derived from genealogical information, representing, for each pair of individuals in a sample, the expected fraction of their genomes that is identical-by-descent (IBD) – i.e. identical as a result of being inherited from a recent common ancestor [23]. The availability of dense genome-wide data from microarray SNP genotyping platforms has made it possible to directly estimate the fraction of the genome shared IBD between each pair of individuals (K IBD ). However, fine-scale estimation of K IBD in population samples is dependent on information about the chromosomal phase of alleles, which requires long-range phasing of the data [17], [24], [25]. Previous studies reporting estimates in close relatives based on K IBD [26], [27], [28] had very high standard errors based on their study design and sample size. Recent work has examined IBD-based heritability estimates from distantly related individuals [29]. Ours is the first study to provide fine-scale IBD-based estimates of based on pairs of individuals at a range of relationship from siblings to distant relatives. We refer to these estimates as . IBD-based estimates of for the 11 quantitative traits ( ) are shown in Table 1. For these and subsequent estimates of h2, age, sex, and geographic region were included as covariates to prevent confounding. These estimates range from 0.099 for recombination rate to 0.691 for height. The only quantitative trait yielding an estimate not significantly different from 0 was sex-ratio of offspring. For each of the eight quantitative phenotypes with published estimates of , our estimates were smaller than the mean published estimate. For example, our estimate of 0.69 for height was lower than previous estimates of 0.80 [30], but was consistent with previous estimates in being lower for females (0.724 s.e. 0.019) than males (0.780 s.e. 0.029) [31]. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 1. Narrow-sense heritability estimated from IBD ( Narrow-sense heritability estimated from IBD () and thresholding IBS () for 11 quantitative traits. https://doi.org/10.1371/journal.pgen.1003520.t001 Previous studies, based on either genealogical or direct estimates of IBD sharing, have been limited to closely related individuals (first-cousins or closer), and may therefore be upwardly biased due to the impact of shared environment, dominance, or epistatic interactions [10]. On average, our estimates of were lower than those from previous studies by a ratio of 0.75 (s.e. 0.067), most likely because the latter were inflated by one of the three aforementioned factors. We return to this point below, performing a pedigree-based analysis to assess the impact of these factors. Dichotomous phenotypes in this study were ascertained to increase the number of available cases, leading inflation in . A discussion of this inflation and the resulting estimates are presented in Text S1 and Table S2. In most cases, researchers do not have access to long-range phased genotypes with which to estimate h2. One suggested solution to this problem is the use of K IBS , the genome-wide proportion of alleles shared identical-by-state (IBS) at all genotyped loci, as a substitute for K IBD [32], when estimating on the basis of closely-related pairs of individuals (it is assumed that K IBS provides a poor estimate of K IBD for distantly related pairs of individuals). Taking advantage of long-range phase based estimates of K IBD , we sought to evaluate the use of K IBS for the estimation of h2. For this purpose, we computed K IBS as defined in [12] and found that it produced downwardly biased estimates of for both simulated and real data sets that included many pairs of distantly related individuals (see Methods). As noted by Vattikuti et al [19], this bias is due to the fact that, when used to estimate h2, the K IBS matrix captures information from two distinct sources, depending on the degree of relationship between pairs of individuals. For large values of IBS it estimates genetic covariance over all SNPs in the genome. For low values of IBS it estimates genetic covariance over just those SNPs on the genotyping platform (see next section). Thus, the resulting heritability estimates from K IBS therefore tend to lie between the true value, , and the typically lower value of . To avoid this bias, we implemented a different approach, retaining all individuals for the calculation of h2, but setting values of K IBS less than or equal to a threshold t (K IBS>t ) to 0, for t = 0.00, 0.025 and 0.05. This threshold defines the separation between closely and distantly related individuals. We evaluated this approach using both simulations and real data sets and observed a significant downward bias of narrow-sense heritability estimated from tresholded IBS ( ) at t = 0. For example, when t = 0 for height is 0.58, while when t = 0.05 = 0.70 (similar results were obtained for the other phenotypes). We observed no bias at t = 0.025 or t = 0.05 (see Methods and Table S3). To err on the side of caution, we present values for t = 0.05 ( ) in Table 1 and Table S2, for the quantitative and dichotomous traits, respectively. The difference between narrow-sense heritability estimated from IBD ( ) and was less than 0.015 for all traits and not statistically significant for any of them. The correlation between the two estimators was 0.9998 and 0.9999 for the quantitative and dichotomous traits, respectively. Furthermore, in our extensive simulations over real data, the difference between the estimators was always less than 0.02 and not statistically significant (see Methods and Tables S3, S4, S5). These results indicate that when phase information is not available K IBS can provide unbiased and precise estimates of h2, by means of , in data consisting of a mixture of closely and distantly related pairs of individuals. The choice of threshold t is a function of the relatedness structure of the individuals in the study as well as the properties of the population they are drawn from (see Discussion).

Joint estimation of Recently, Yang et al [12] developed a method for estimating , the fraction of narrow sense heritability explained by genotyped SNPs (and SNPs in LD with genotyped SNPs). The interest in derives from the fact that it is the upper bound on the heritability that can be described from GWAS ( ) conducted on the same genotyping platform used to estimate . The Yang et al. method is based on a variance component model with a genetic relationship matrix K IBS estimated from the genotyped SNPs. To prevent inflation, the method requires that all pairs of individuals have K IBS <0.025 [12]. In studies where the Yang et al. [12] approach has been applied [18], [19], the removal of related individuals resulted in a significant decrease in sample size and a concomitant increase in the standard error of the heritability estimates (e.g. a standard error of 19% in one study [18]). Filtering such that K IBS <0.025 for all individuals in our data leaves less than 3000 individuals, which is not adequate to estimate with low standard error(for example, resulting in a standard error for of 10.0% for height). To enable unbiased calculation of in data sets that contain a both closely and distantly related pairs of individuals, we have devised an alternative approach based on a model with two variance components (see Methods). The first variance component, K IBS has a parameter and is an estimate of , the genetic variance due genotyped SNPs. The second variance component K IBS , has a parameter and is an estimate of , the total narrow-sense heritability (the subscript+is used for both parameters to denote that they are estimated simultaneously). Although we have access to fine-scale estimates of K IBD , based on long-range phased genotype data, we demonstrate the application of this approach using K IBS >t , because fine-scale K IBD estimates are typically not available to most researchers. We note that in the empirical results and in simulation, the use of K IBD and K IBS in the model produced results that were similar to those obtained using K IBS >t and K IBS (see Methods). Extensive testing of this model was performed to demonstrate that it estimates the appropriate quantities (see Methods), and estimates of closely match those of narrow-sense heritability estimated from tresholded IBS and IBD ( and ), both in our data and in simulations. Table 2 shows heritability results for quantitative traits using the joint model where heritability estimated from thresholding IBS ( ) and heritability explained by genotyped SNPs ( ) are fit jointly. We examined the nine quantitative traits where h2>0. Our results are concordant with the previous estimates of for height, high-density lipoprotein (HDL), WHR, and BMI [6], [19]. For most of the traits, accounts for more than half of , with a maximum of 0.75 for waist-to-hip ratio (WHR), and a minimum of 0.33 for age at menopause. For each trait, we tested for deviation from a /h2 ratio of 0.53 (the average across all the traits) and found that only height, with a value of 0.58 was significantly different (p-value<0.0017, see Text S1). However, as our estimates of were smaller than previous estimates, the fraction 0.53 (s.e. 0.042) of variance explained by genotyped SNPs based on our estimates of heritability was larger than the fraction 0.40 (s.e. 0.037) based on published estimates [6]. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 2. Heritability estimated from thresholding IBS ( Heritability estimated from thresholding IBS () and heritability explained by genotyped SNPs (). https://doi.org/10.1371/journal.pgen.1003520.t002

Joint estimation of narrow-sense and heritability explained by genotyped SNPs for dichotomous phenotypes For dichotomous phenotypes, ascertainment in samples with closely related pairs of individuals induces an upward bias in narrow-sense heritability jointly estimated from IBS above a threshold ( ) when converting to the liability scale (Table S6 and Text S1). Thus, our estimates should be viewed as an upper bound. However, it is possible to account for case-control ascertainment amongst distantly related pairs when converting heritability explained by genotyped SNPs jointly estimated from IBS below a threshold ( ) from the observed to the liability scale [33]. This correction is not possible when affected relatives are included in the study. For example, a study that ascertains affected sib pairs will have severely inflated heritability estimates, and the case-control ascertainment correction does not address this type of bias (see Text S1). Table 3 shows estimates on the liability scale, derived from a model with two variance components K IBS>t and K IBS , for 11 dichotomous traits. Estimates of primarily capture the heritability derived from distantly related pairs of individuals. Results on the observed scale are given in Table S7. The inflated narrow-sense heritability estimates of the dichotomous phenotypes leads to a lower ratio of heritability explained by genotyped SNPs ( ) to . PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 3. Narrow-sense heritability explained by genotyped SNPs ( Narrow-sense heritability explained by genotyped SNPs () for dichotomous phenotypes on the liability scale. https://doi.org/10.1371/journal.pgen.1003520.t003 Our results are lower than previous estimates of the heritability explained by genotyped SNPs ( ) for rheumatoid arthritis (RA), 2 diabetes (T2D), and coronary artery disease (CAD) [34]. The differences between our estimates and previous estimates could be due to the use of population controls in our study rather than non-affected controls, differences in disease prevalence between populations, differences in the genotyping platform used, differences in ascertainment strategies such as age matching in previous work, or actual differences in the heritability of the phenotype across populations. If a small number of common variants were responsible for a large fraction of the phenotypic variation, they would have been identified by previous GWAS. However, since most of the loci identified through GWAS have a small effect, our results suggest a highly polygenic model of disease for the dichotomous phenotypes, as in the case of the quantitative traits. This is consistent with previous studies [6].