Motivated by the prediction that rare variants have large effect sizes that explain some of the missing heritability in complex traits [25], investigators can use a variety of study designs to find rare variant associations. The success of such a study depends on multiple factors that influence the range of observable effect sizes (for example, sample size and the magnitude and direction of natural selection) [26].

Extreme phenotype sampling

For studies of quantitative traits, it has been shown that the power to detect rare variant associations can be increased by sampling from the extremes of the trait distribution [27,28]. To do so, the phenotype (or a transformed version of the phenotype) is typically assumed to follow a normal distribution. Then, individuals in the top and bottom n percent of the distribution are chosen for study, where n is typically less than five. For disease outcomes, the power of the study may be increased by sampling from the extremes of known risk factors (such as looking at early onset disease) [29]. For instance, Lange and colleagues [30] sampled from the extremes of the distribution of LDL-cholesterol levels to select individuals for WES. They combined these ‘extreme’ samples with other ‘normal’ samples to discover a burden of rare and low-frequency variants in the PNPLA5 gene that were associated with LDL-cholesterol. Similarly, Emond and co-workers [31] chose samples for exome sequencing based on the extremes of time to first Pseudomonas infection in individuals with cystic fibrosis. This approach yielded a novel genetic association between rare coding variants in the DCTN4 gene and time to first Pseudomonas infection, a surrogate measure of cystic fibrosis severity. More recently, Flannick and colleagues [32] selected individuals from the extremes of type 2 diabetes (T2D) risk by including both young and lean T2D cases as well as elderly, non-obese controls. The initial analysis discovered a nonsense variant in SLC30A8 that was strongly protective against T2D. Additional genotyping of over 44,000 cases and controls confirmed a 53% reduction in T2D risk for carriers of the nonsense variant.
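The selection step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_extremes`, the simulated cohort, and the 2.5% tail size are hypothetical and are not taken from any of the cited studies, and the trait values are assumed to be already transformed to approximate normality.

```python
import random


def select_extremes(phenotypes, tail_percent=5.0):
    """Return (low, high) lists of sample IDs from the two tails of a trait.

    phenotypes   -- dict mapping sample ID to a quantitative trait value
    tail_percent -- percentage of the cohort to draw from EACH tail
    """
    if not 0 < tail_percent < 50:
        raise ValueError("tail_percent must be between 0 and 50")
    # Rank sample IDs by trait value, smallest first.
    ranked = sorted(phenotypes, key=phenotypes.get)
    # Number of individuals per tail (at least one).
    k = max(1, int(len(ranked) * tail_percent / 100))
    return ranked[:k], ranked[-k:]


# Hypothetical cohort: 10,000 individuals with a standard-normal trait.
rng = random.Random(0)
cohort = {f"S{i:05d}": rng.gauss(0.0, 1.0) for i in range(10_000)}

low, high = select_extremes(cohort, tail_percent=2.5)
print(len(low), len(high))
```

In practice the two tails would then be sent for sequencing, and the downstream analysis would need to account for the non-random sampling.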

Although extreme sampling may boost the statistical power of a study to detect associations, data analysis often requires sophisticated statistical techniques to remove sampling bias [33,34]. Furthermore, the results may be difficult to generalize to the underlying population from which the extremes were drawn. For rare variants, tens of thousands of samples may still be necessary in order to detect modest effects even for extreme trait designs [27].

Population isolates

Owing to a variety of demographic forces (for example, famine, war, migration), many subpopulations around the world have undergone extreme population bottlenecks, and have become isolated and remained so for many generations [35,36]. These extreme bottlenecks and the resultant population isolates produce several genetic and phenotypic consequences that are of interest to geneticists. From a phenotypic perspective, population isolates often demonstrate environmental and cultural homogeneity, resulting in a lack of phenotypic variability that can be advantageous for an association study. From a genetic perspective, because of reduced genetic diversity (due to the bottleneck) and increased genetic drift (due to isolation), population isolates often show allele frequencies that differ markedly from those of non-isolated populations [37]. Because the power to detect an association is partly determined by allele frequency, population isolates can be very useful in discovering rare variant associations [37]. If the disease-causing variants occur at high frequency in the population isolate, the power to detect an association may be high.
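To make the dependence of power on allele frequency concrete, the following sketch approximates the power of a two-sided allelic test (a two-proportion z-test on allele counts) at a fixed odds ratio. It is a rough normal approximation under assumed values, not the method used in the cited studies: the function name `allelic_power`, the odds ratio of 1.5, the sample sizes, and the 0.05 significance threshold are all illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def allelic_power(control_freq, odds_ratio, n_cases, n_controls, alpha=0.05):
    """Approximate power of a two-sided allelic two-proportion z-test.

    control_freq -- risk-allele frequency in controls (0 < value < 1)
    odds_ratio   -- per-allele odds ratio (assumed, not estimated)
    """
    if not 0 < control_freq < 1:
        raise ValueError("control_freq must be strictly between 0 and 1")
    # Allele frequency in cases implied by the allelic odds ratio.
    odds = odds_ratio * control_freq / (1 - control_freq)
    case_freq = odds / (1 + odds)
    # Each diploid individual contributes two alleles.
    se = sqrt(case_freq * (1 - case_freq) / (2 * n_cases)
              + control_freq * (1 - control_freq) / (2 * n_controls))
    z = abs(case_freq - control_freq) / se
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)
    # Probability that the test statistic falls in either rejection region.
    return norm.cdf(z - crit) + norm.cdf(-z - crit)


# Same effect size and sample size, three hypothetical allele frequencies.
for freq in (0.001, 0.01, 0.05):
    power = allelic_power(freq, odds_ratio=1.5, n_cases=5000, n_controls=5000)
    print(f"control frequency {freq}: power ~ {power:.2f}")
```

With the effect size and sample size held constant, the approximate power rises as the variant becomes more common, which is the reason a variant that has drifted to higher frequency in an isolate can be easier to detect there.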

Recently, a study of WGS data from 1,795 Icelanders identified a non-coding, low-frequency variant associated with prostate cancer [38]. The risk allele was observed at much higher frequency in Iceland (3% in cases, 1% in controls) compared to other populations that served for replication (for example, 0.4% and 0.1% in Spanish cases and controls, respectively). The same group later identified several low-frequency and rare variants that were associated with T2D [39]. These variants occurred at much higher frequency in Icelandic and Danish populations compared to an Iranian population used for replication. Perhaps the most well-known examples of risk variants found in a population isolate are the BRCA1 and BRCA2 mutations, which occur at high frequency in the Ashkenazi Jewish population and are associated with risk for breast and ovarian cancer [40]. However, the lack of genetic diversity in population isolates can serve as a serious disadvantage as well. Disease-causing variant(s) may be exceedingly rare, or monomorphic in the population isolate, leaving little chance of detecting an association.

Family studies

A different type of design for identifying disease-causing rare variants is to study a family with multiple affected members. Often referred to as ‘family studies’, such a design involves sequencing co-affected family members and searching for overlapping variants that co-segregate with the condition of interest. Both linkage-based and genetic association methodologies are amenable to family studies. This type of design has been very successful in identifying large effect, highly penetrant mutations that underlie Mendelian disorders [41,42]. However, for many common diseases co-segregation analysis cannot sufficiently distinguish among a large set of candidate pathogenic variants [43]. Given the challenges of performing a comprehensive analysis of pedigree sequencing data, most studies rely on a series of ad hoc filtering criteria, although there has been recent progress in developing unified and rigorous methods for analyzing sequence data from pedigrees [44]. If the disease-causing variant occurs with high frequency in the affected families (compared with the general population), a family study may provide a significant boost in statistical power compared to other designs. For family studies, as well as population isolates, this is sometimes referred to as ‘hitting the jackpot’, because investigators are essentially hoping to ‘get lucky’ by observing the disease-causing variant with high frequency in the affected families (or population isolate) [45].
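The co-segregation filtering idea can be illustrated with simple set operations. The sketch below is a toy example, not a published pipeline: the function `cosegregating_variants`, the variant identifiers, the frequencies, and the 1% frequency cut-off are hypothetical, and excluding variants seen in unaffected relatives assumes full penetrance, which often does not hold for common diseases.

```python
def cosegregating_variants(affected, unaffected=(), population_freq=None,
                           max_freq=0.01):
    """Return variants shared by all affected relatives.

    affected        -- iterable of sets of variant IDs, one set per affected relative
    unaffected      -- iterable of sets for unaffected relatives (optional)
    population_freq -- dict of variant ID -> population allele frequency (optional)
    max_freq        -- keep only variants at or below this frequency
    """
    affected = [set(carried) for carried in affected]
    if not affected:
        raise ValueError("at least one affected relative is required")
    # Variants carried by every affected family member.
    shared = set.intersection(*affected)
    # Drop variants also carried by unaffected relatives
    # (assumes full penetrance; relax this for complex traits).
    for carried in unaffected:
        shared -= set(carried)
    # Optionally keep only variants that are rare in the general population;
    # variants missing from the frequency table are treated as rare.
    if population_freq is not None:
        shared = {v for v in shared if population_freq.get(v, 0.0) <= max_freq}
    return shared


# Hypothetical pedigree with three affected and one unaffected relative.
affected = [
    {"chr1:1000A>G", "chr2:500C>T", "chr7:42G>A"},
    {"chr1:1000A>G", "chr2:500C>T"},
    {"chr1:1000A>G", "chr2:500C>T", "chr9:7T>C"},
]
unaffected = [{"chr2:500C>T"}]
freqs = {"chr1:1000A>G": 0.0002, "chr2:500C>T": 0.15}

candidates = cosegregating_variants(affected, unaffected, freqs)
print(candidates)
```

Filters of this kind are the ad hoc criteria referred to above; with real exome or genome data they typically still leave many candidates, which is why co-segregation alone is often insufficient for common diseases.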

In addition to co-segregation analysis, genotype data from trios (an affected offspring and his or her parents) are often used in studies with a family-based design. The transmission disequilibrium test (TDT) [46] was developed to detect associations in these types of designs. For rare diseases (for instance, those with a prevalence <0.5%), the TDT with n trios provides the same statistical power as a case–control design with n cases and n controls [47]. For common diseases, case–control designs are more powerful (Figure 1).
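For a biallelic variant, the TDT reduces to a McNemar-type statistic on heterozygous parents: if b is the number of times the allele of interest was transmitted to an affected child and c the number of times it was not, then (b - c)^2 / (b + c) follows approximately a chi-square distribution with one degree of freedom under the null hypothesis. A minimal sketch, in which the function name `tdt` and the counts are hypothetical:

```python
from math import erfc, sqrt


def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one biallelic variant.

    transmitted   -- times heterozygous parents passed the allele to an
                     affected offspring (b)
    untransmitted -- times heterozygous parents did not pass it on (c)
    Returns (chi-square statistic, asymptotic p-value with 1 df).
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: 60 transmissions versus 40 non-transmissions.
chi2, p_value = tdt(60, 40)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

Only heterozygous parents are informative, so for rare variants the number of usable transmissions can be small and the asymptotic p-value should be treated with caution; an exact binomial test may be preferable in that setting.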

Figure 1 Comparison of power for trios and case–control designs. Power to detect associations for 10,000 cases and 10,000 controls (blue) and 10,000 trios (red) across a range of minor allele frequencies (MAFs). Power was calculated with a significance threshold of P < 0.05, a prevalence of 0.1 and a relative risk of 1.1, using the Genetic Power Calculator tool [112].

The underlying genetic architecture of the trait of interest determines which study design is best powered to detect the association [48]. For most complex traits, the genetic architecture is unknown or at best partially known. Thus, there is no way to predict a priori which design will be most powerful. Both population-based studies (such as extreme phenotype sampling or population isolates) and family designs are powerful and useful in differing contexts and should serve as complementary, rather than competing, approaches for uncovering the genetic contribution of rare variants to complex traits.