Despite the recent rapid growth in genome-wide data, much of human variation remains entirely unexplained. A significant challenge in the pursuit of the genetic basis for variation in common human traits is the efficient, coordinated collection of genotype and phenotype data. We have developed a novel research framework that facilitates the parallel study of a wide assortment of traits within a single cohort. The approach takes advantage of the interactivity of the Web both to gather data and to present genetic information to research participants, while taking care to correct for the population structure inherent to this study design. Here we report initial results from a participant-driven study of 22 traits. Replications of associations (in the genes OCA2, HERC2, SLC45A2, SLC24A4, IRF4, TYR, TYRP1, ASIP, and MC1R) for hair color, eye color, and freckling validate the Web-based, self-reporting paradigm. The identification of novel associations for hair morphology (rs17646946, near TCHH; rs7349332, near WNT10A; and rs1556547, near OFCC1), freckling (rs2153271, in BNC2), the ability to smell the methanethiol produced after eating asparagus (rs4481887, near OR2M7), and photic sneeze reflex (rs10427255, near ZEB2, and rs11856995, near NR2F2) illustrates the power of the approach.

Twin studies have shown that many human physical characteristics, such as hair curl, earlobe shape, and pigmentation, are at least partly heritable. To identify the genes involved in such traits, we administered Web-based surveys to the customer base of 23andMe, a personal genetics company. Upon completion of surveys, participants were able to see how their answers compared to those of other customers. Our examination of 22 different common traits in nearly 10,000 participants revealed associations between several single-nucleotide polymorphisms (SNPs, a type of common DNA sequence variation) and freckling, hair curl, asparagus anosmia (the inability to detect certain urinary metabolites produced after eating asparagus), and photic sneeze reflex (the tendency to sneeze when entering bright light). Additionally, our analysis verified the association of a large number of previously identified genes with variation in hair color, eye color, and freckling. Our analysis not only identified new genetic associations, but also showed that our novel way of doing research (collecting self-reported data over the Web from involved participants who also receive interpretations of their genetic data) is a viable alternative to traditional methods.

Competing interests: NE, JMM, JYT, LSH, BN, SS, LA, AW, and JM are or have been employed by 23andMe and own stock options in the company. 23andMe co-president AW has provided general guidance, including guidance related to the company's research undertakings and direction. PLoS Genetics' Editor-in-Chief Gregory S. Barsh is a potential consultant to 23andMe and therefore recused himself from the editorial and peer-review process. PLoS co-founder Michael B. Eisen is a member of the 23andMe Scientific Advisory Board.

Funding: This study was funded by the participants and by 23andMe. Company co-president and co-author AW has provided financial support to 23andMe for its general operational needs.

Copyright: © 2010 Eriksson et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

The other traits analyzed here fall into three broad categories. The first category consists of laterality preferences: handedness, footedness, ocular dominance, and hand-clasp (which thumb is on top when clasping one's hands). The second group consists of simple physical characteristics: whether participants have had cavities, have worn braces, have had wisdom teeth removed, have astigmatism, wear glasses, have attached earlobes, and suffer from motion sickness while riding in a car. The third group consists of personality traits and preferences: optimism, a preference for sweet versus salty food, and preference for night-time versus morning-time activity. None of these traits have well-established associations with SNPs, although handedness [12] and diurnal preference [13] have putative genetic associations.

Listed under “ACHOO (Autosomal-dominant Compelling Helio-Ophthalmic Outburst) syndrome” in Online Mendelian Inheritance in Man (OMIM), the “photic sneeze reflex” refers to the tendency to sneeze when moving from relative darkness into bright light—most often sunlight. Aristotle discussed the trait in a section of his Book of Problems called “Problems concerning the nose,” hypothesizing that heat-generated movement led to tickling of the nose. No previous studies have reported genes or SNPs associated with this particular reflex.

The study of the ability to smell the urinary metabolites of asparagus (probably mostly methanethiol, a sulfur-containing compound) also dates back over 100 years [10]. Since then, authors have debated whether the variation among humans is in the ability to produce methanethiol (thought from family studies to be inherited in an autosomal dominant manner [10]) or in the ability to smell that compound [11]. No previous studies have reported genes or single nucleotide polymorphisms (SNPs) associated with this sensory ability.

Human hair varies in thickness as well as in the extent of curl, which is related to the shape of the hair (round versus flat cross-section). Over 100 years ago Davenport and Davenport [8] reported a study of hair morphology in families, concluding that straight hair was recessive to curly hair. More recently, a candidate gene approach discovered that EDAR is associated with hair thickness in Asians [9].

Pigmentation has been a fruitful area for genetic research since the 19th century, when scientists realized that the mice with varied coat colors that “mouse fanciers” had been developing for centuries provided easily tracked phenotypes for genetic analysis [7]. Many genetic variants underlying “normal” variation in human pigmentation have recently been discovered [2]–[5]. These variants account for a significant part of the known variation in pigmentation (approximately 30% for hair color and 60% for eye color; see Results), but much remains unexplained.

From the initial set of surveys released, we report results on the 22 traits meeting our sample size criteria (over 1500 unrelated northern European respondents with the additional requirement of at least 500 cases for binary traits). The phenotypes that met these criteria are described below.
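The inclusion rule above can be written as a small predicate. This is a minimal sketch, not the authors' code: the function and constant names are hypothetical, and only the two thresholds (over 1500 respondents, at least 500 cases for binary traits) come from the text.

```python
from typing import Optional

MIN_RESPONDENTS = 1500  # "over 1500" unrelated northern European respondents
MIN_CASES = 500         # "at least 500" cases, applied to binary traits only


def meets_sample_size(n_respondents: int, n_cases: Optional[int] = None) -> bool:
    """Return True if a trait passes the stated sample-size criteria.

    n_cases is None for non-binary traits (for example, ordinal scales),
    in which case only the respondent count is checked.
    """
    if n_respondents <= MIN_RESPONDENTS:  # the text says "over", so strictly greater
        return False
    if n_cases is not None and n_cases < MIN_CASES:
        return False
    return True
```

For example, `meets_sample_size(5000, n_cases=420)` is rejected even though the respondent count is ample, because the binary outcome has too few cases.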

The parallel and continual nature of this research framework facilitates the rapid recruitment of participants to many studies at once. Furthermore, the presentation of interpreted genetic data to the participants creates incentive for them to return to the website, lowering the marginal cost of recontacting for additional analyses. The participant-driven nature of this study design and the resulting heterogeneity of the data sets require that care be taken to eliminate population stratification and other possible sources of bias. However, the challenges of eliminating such stratification and biases are balanced by the continuous accrual of new data as participants sign up and respond to new surveys.

(A) Individual participant signs up for service via web; (B) Participant receives sample collection kit; (C) Participant consents, via web, to use of anonymized genotype and survey responses for research; (D) Participant sends saliva sample to contracted lab; (E) Participant logs on to service web site with option to respond to one or more surveys, prior to having access to personal genetic data; (F) Laboratory processes sample, generating data for single nucleotide polymorphisms (SNPs); (G) Encrypted genotype data transferred from laboratory to secure server; (H) Participant logs on to service web site with option to access personal genetic data (both raw genotypes and customized reports); (I) Participant logs on and has the option to respond to one or more surveys; (J) New genetic reports posted, new surveys posted; (K) Genotype data and survey responses for individuals, coded and stripped of individually identifying information, are transferred to research team. Shaded boxes indicate participant actions; clear boxes indicate lab or service actions. Dashed boxes indicate optional participant actions within framework of service access.

Data for these studies was collected within a research framework wherein research participants, derived from the customer base of 23andMe, Inc., a direct-to-consumer genetic information company, consented to the use of their data for research and were provided with access to their personal genetic information (Figure 1). They were then given the option of contributing phenotype data via a series of web-based surveys. The result is a single, continually expanding cohort, containing a self-selected set of individuals who participate in multiple studies in parallel.

Many common human traits have long been understood to have a genetic basis, yet in only a few cases have influential genes been identified. Even pigmentation, for which almost 40 years ago Cavalli-Sforza [1] estimated that there were four genes underlying variation, has yielded associations only recently [2]–[5] (showing that pigmentation is rather polygenic [6]). We have conducted, within a novel, web-based research framework, genome-wide association studies of 22 common human traits. These traits were selected based on indications of heritability or a simple mode of inheritance, ease of phenotype data collection via web-based self-report, and broad interest.

Results

Of the 22 studies, eight yielded positive results, with novel associations discovered for four traits and replications for five traits (Table 1). All five replications are for pigmentation-related traits. The novel associations reveal SNPs associated with hair morphology, detection of a urinary metabolite of asparagus, photic sneeze reflex, and freckling. Manhattan and qqplots for the novel associations and replications are shown in Figure 2, Figure 3, and Figure 4.

Figure 2. Manhattan plots for new associations. Shown are (A) hair curl, (B) asparagus anosmia, (C) photic sneeze reflex, and (D) freckling. Plots show scores (−log10 p-values) for all SNPs by physical position. All plots are trimmed at a maximum score of 15. For regions with a more significant association, the strongest score in that region is shown above the region. https://doi.org/10.1371/journal.pgen.1000993.g002

Figure 3. Manhattan plots for replications. Shown are (A) hair color, (B) red hair, (C) blue to brown eye color, and (D) green versus blue eye color. Plots show scores (−log10 p-values) for all SNPs by physical position. All plots are trimmed at a maximum score of 15. For regions with a more significant association, the strongest score in that region is shown above the region. https://doi.org/10.1371/journal.pgen.1000993.g003

Figure 4. Quantile-quantile plots for new associations and replications. Shown are (A) hair curl, (B) asparagus anosmia, (C) photic sneeze reflex, (D) freckling, (E) hair color, (F) red hair, (G) blue to brown eye color, and (H) green versus blue eye color. All plots are trimmed at a maximum score of 15 to better show details. Approximate 95% CIs are indicated in blue. The red line passes through the median p-value. https://doi.org/10.1371/journal.pgen.1000993.g004
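As an illustration of how the points in a quantile-quantile plot such as those in Figure 4 can be computed, the sketch below pairs each observed score with the score expected under a uniform null distribution of p-values. The plotting position (rank − 0.5)/n and the function name are assumptions; the text does not say how the published plots were constructed. Only the trimming at a score of 15 is taken from the caption.

```python
import math


def qq_points(p_values, max_score=15.0):
    """Return (expected, observed) pairs on the -log10 scale.

    Points are ordered from the most to the least significant p-value.
    All p-values must be strictly positive. Observed scores are trimmed
    at max_score, mirroring the trimming described in the caption.
    """
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        # Assumed plotting position for the rank-th smallest of n uniform draws.
        expected = -math.log10((rank - 0.5) / n)
        observed = min(-math.log10(p), max_score)
        points.append((expected, observed))
    return points
```

Points lying well above the diagonal at the high-score end indicate associations stronger than expected by chance, whereas a departure along the whole curve would suggest inflation from sources such as population structure.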

The studies were performed on multiple overlapping datasets drawn from a single cohort. The cohort was derived from the subset of the customer base of 23andMe, Inc. who took surveys relevant to the 22 traits considered here. Of these individuals, only those assessed to be of northern European ancestry were included. In addition, individuals were eliminated until any pair of participants shared at most 700 cM of full or half identity by descent (IBD), approximately the lower end of sharing between a pair of first cousins. Average IBD between a pair of participants was 0.146 cM, median IBD was 0, and only 123 pairs of individuals shared more than 100 cM. The resulting cohort consisted of a total of 9126 individuals who had answered at least one of the surveys considered here. Each individual was genotyped on the Illumina HumanHap550+ BeadChip platform (consisting of the HumanHap550 panel along with a custom set of approximately SNPs selected by 23andMe). After quality control (see Methods), SNPs were used from this platform.
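The relatedness filter described above (removing individuals until no remaining pair shares more than 700 cM IBD) can be sketched as a greedy pruning loop. The order in which individuals are removed is not given in the text, so dropping the person involved in the most over-threshold pairs first is an assumption, as are the function name and the input format.

```python
from collections import defaultdict


def prune_related(ids, ibd_cm, max_cm=700.0):
    """Greedily drop individuals until no kept pair shares more than max_cm.

    ids    : iterable of participant identifiers
    ibd_cm : dict mapping (id_a, id_b) -> shared IBD in cM (hypothetical format)
    Returns the set of identifiers that remain.
    """
    # For each person, the set of others they are too closely related to.
    over = defaultdict(set)
    for (a, b), cm in ibd_cm.items():
        if cm > max_cm:
            over[a].add(b)
            over[b].add(a)

    kept = set(ids)
    while True:
        # Pick the person with the most unresolved violations (ties broken by id).
        worst = max(over, key=lambda i: (len(over[i]), str(i)), default=None)
        if worst is None or not over[worst]:
            break  # no over-threshold pairs remain
        for other in over.pop(worst):
            over[other].discard(worst)
        kept.discard(worst)
    return kept
```

Removing the most-connected individual first tends to keep more participants than removing one member of each offending pair arbitrarily, but it is a heuristic and does not guarantee the largest possible unrelated set.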

Phenotypes were collected using 13 surveys posted on the 23andMe website. From these surveys, 22 traits met our criteria for inclusion, which required over 1500 unrelated participants who responded to the relevant survey questions and were assessed to be of northern European origin. In addition, for binary phenotypes we required at least 500 participants with each outcome before analysis. See Text S5 for full descriptions of the phenotypes.

Detailed summaries of results are shown in Table 2, Table 3, and Table 4. These summaries include the SNPs selected within each associated region to give the best predictive model while attempting not to over-fit (using a stepwise regression approach with the AIC, see Methods). The reader should be warned that this approach is anti-conservative about the number of effects fitted and, in particular, that not all SNPs appearing in these tables are necessarily independently associated. Throughout, “score” refers to the negative log10 p-value for the association between a SNP and a trait. We also include Bayes factors (see Methods) in the tables and the plots of imputed and genotyped SNPs within each associated region due to their usefulness in comparing associations and their ability to incorporate uncertainty arising from imputation. Because we performed multiple (22) parallel studies, we used a very conservative threshold of for a SNP before it was claimed to reach genome-wide significance (see Methods). Associations significant under this correction for a single study but not for all studies (that is, with scores between and ) are called “suggestive” and are shown in Table 5.
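To make the relationship between scores, the single-study threshold, and the all-studies threshold concrete, the sketch below assumes a Bonferroni-style correction at a family-wise level of 0.05. That correction, the level, and the SNP count used in the usage note are all assumptions for illustration; the actual thresholds are defined in the Methods and are not reproduced here.

```python
import math


def score(p_value):
    """Score as used in the text: the negative log10 of the p-value."""
    return -math.log10(p_value)


def score_threshold(n_snps, n_studies=1, alpha=0.05):
    """Score needed under an assumed Bonferroni correction.

    Corrects for n_snps tests in each of n_studies studies.
    """
    return -math.log10(alpha / (n_snps * n_studies))


def classify(p_value, n_snps, n_studies=22):
    """Label an association using the two thresholds described in the text."""
    s = score(p_value)
    if s >= score_threshold(n_snps, n_studies):
        return "genome-wide significant"
    if s >= score_threshold(n_snps, 1):
        return "suggestive"
    return "not significant"
```

With a hypothetical 500,000 SNPs, the single-study threshold under these assumptions is a score of 7.0, and correcting for 22 studies raises it by log10(22), about 1.34. An association with p = 1e-8 would therefore be labelled "suggestive" rather than genome-wide significant.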

Phenotype data

The collection of a broad range of data from each participant allows us to control for some sources of bias and to assess the error rate of the phenotype collection. To control for sources of bias, phenotypes were checked for correlations with a set of covariates (age, sex, and principal components of population variation). Covariates showing significant correlations (at the level) were included in the analysis (Supplementary Table 2 in Text S4). In addition, by asking participants the same question multiple times in different ways, we were able to assess the repeatability of responses. We asked about eye color, hair color, freckles, handedness and age twice each in different places or ways. A total of 177 people were removed from analysis based on a single discordant answer to one of these questions. Overall, a total of only 0.72% of participants answered any pair of these questions inconsistently.

One source of bias unique to our replication studies was the fact that participants were shown their genetic data along with analyses of their data for approximately 100 traits and diseases. In some cases, this led to a severe bias. For example, a survey examining perceived performance in sprint versus long-distance races was placed on a web page within 23andme.com where customers were shown their genotype for rs1815739 (a SNP in ACTN3 [14]). If they logged on before their genotype data had been processed, they saw the survey question alongside sample data. If their data was available, they were predicted to fall in a category including either world-class sprinters (carriers of the C allele) or endurance athletes (T homozygotes). The response distribution differed significantly ( ) between respondents who had seen their genotypes with the suggested outcome versus those who hadn't. The results (Supplementary Table 1 in Text S5) of this comparison are consistent with large fractions (24.2% of C carriers, 41.2% of T homozygotes) of respondents answering differently than they would have if they had not seen their genotype data and interpretation.

Six of the 13 surveys were posted on pages where customers were shown their genotypes and predictions for related conditions. Due to the possibility of bias from this prediction, primary analysis of these six surveys considered only those participants who took the surveys before receiving their genetic data (so they only saw a sample prediction for the phenotype). As a result, none of these traits made our sample size cutoff. For the 22 phenotypes considered here, participants were shown predictions for hair color, eye color, and freckling, although they were on separate pages from the surveys for these phenotypes. There was no evidence that for any of these traits participants who saw their genotypes gave different responses from those who did not (Methods). Therefore, we did not restrict attention to only those who hadn't seen their genotypes.

Survey response rates correlated with sex, age and the first (north-to-south) principal component of population structure. That is, women, people of northern European ancestry, and older people were more likely to answer more surveys than men, people of central European ancestry, and younger people (p-values , , and ), respectively. A genome-wide association study (GWAS) using the number of surveys answered as the trait analyzed did not show any significant associations (with p-values under ) when these covariates were taken into account.
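The comparison of response distributions between participants who had and had not seen their genotype can be illustrated with a test of independence on a 2×2 table. The text does not state which test the authors used, so the Pearson chi-square test below is an assumption, and the counts in the usage note are invented purely for illustration.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test of independence for a 2x2 table.

    Rows: saw genotype / did not see genotype.
    Columns: answered "sprinter" / answered "endurance" (hypothetical labels).
    Returns (statistic, p_value) with one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df, via the normal tail.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For instance, `chi2_2x2(30, 10, 10, 30)` gives a statistic of 20 and a p-value far below 0.001, the kind of result that would indicate that seeing a prediction shifted the answers. A table with identical rows gives a statistic of 0 and a p-value of 1.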

Freckling

We find one new association for freckling and replicate two known regions. The novel association is at rs2153271, in an intron of BNC2 (Zinc finger protein basonuclin-2), with a score of 9.4 and an estimated effect size of −0.4 (on a 17 point scale). See Supplementary Table 2 in Text S6 for details. Our most significant association, rs12203592, with score 90.7, lies in an intron in IRF4 (Figure 10). This SNP was previously associated with hair color, eye color, and tanning response to sunlight [5]. A more mildly associated SNP in this region, rs1540771 (with score 13.2), has previously been associated with freckling (as well as eye color, sensitivity to sun, and hair color) [4]; however, rs12203592 (60 kb away) was not typed in that analysis. For eye and hair color and tanning ability it was suggested [5] that in fact rs12203592 was in closer LD with the causal SNP. Here we confirm this finding for hair and eye color and establish the same for freckling.

Figure 10. Bayes factors for genotyped and imputed SNPs for freckling around BNC2. For details, see Figure 5. https://doi.org/10.1371/journal.pgen.1000993.g010

The other loci we associate with freckling are MC1R, ASIP, and TYR, all known associations [2], [22]. Although the SNPs selected by the regression procedure as most influential are slightly different than those for red hair, the sets are quite similar for these highly correlated phenotypes.

Hair color

We confirm known associations for hair color, both blond to brown and non-red to red. For blond to brown, excluding red, we find hits in five regions: OCA2/HERC2, IRF4, SLC24A4, SLC45A2, and MC1R (aside from MC1R, the same set of regions as [5] in their analysis excluding red hair). A multiple regression using the seven SNPs in Table 3 (with sex and five principal components) estimates that these five regions together explain about 28.1% of the variance in hair color (blond to brown) within northern Europe. In the OCA2/HERC2 region, rs12913832, first found by [3], has a score of and of , explaining % of the variance. These numbers (as well as those for the other SNPs) concord well with those in [5] (which estimated % of the variance was explained by this SNP and using a five point scale from dark to light as compared to our eight point scale from light to dark). For IRF4, rs12203592 has an estimated effect size of 0.59 and explains 3.9% of the variance. For SLC24A4, rs12896399 has an estimated of and explains 1.7% of the variance. In SLC45A2, rs16891982 has and explains % of the variance. Finally, rs12931267 near MC1R has of and explains % of the variance. Sulem et al. [4] found associations for hair color in four of these five regions (excluding SLC45A2) as well as KITLG. The SNP rs12821256 in KITLG showed a mild but significant association with hair color in our study, with a score of 4.2 and of (95% CI from to ). This is similar to the relatively weak effect for this SNP found in [5].

For the other direction of variation in hair color (red versus non-red hair) we found many associated SNPs in the MC1R region, long known to be associated with this phenotype [2]. Although some of the SNPs contributing to the model in this region lie far from MC1R, some of the biggest effects are from rs1805008 and rs1805009, non-synonymous changes in the MC1R gene. This region is strongly associated with common variation in red hair [4], [5]. We also replicated the claim that a large haplotype containing the pigmentation gene ASIP is associated with red hair (also associated with burning and freckling in [22]). While rs291671 is about 900 kb away from ASIP, it appears to be tagging the same haplotype found there.