We have implemented a large, IRB-approved genetic study using social media. Participants must be at least 18 years old, live in the US, and have a Facebook account. They are recruited via snowball sampling, i.e., by finding our Genes for Good Facebook application through friends, family, and social media connections. Once a person has consented, they are invited to complete online health history assessments at their convenience. The surveys consist of health history questionnaires, daily tracking surveys, and an optional health conditions module in which participants can list other conditions that they have. Once they have completed a minimum number of required questionnaires, they are mailed a spit kit to collect DNA for analysis. The cost per participant is about $80, which includes postage, DNA extraction, and genotyping; there is essentially no cost associated with recruitment or data collection. Throughout the course of the study, we have typically employed 2–3 full-time staff (study coordinator, developers), several graduate and undergraduate students, and a part-time administrative assistant who helps with sending and receiving spit kits.

For the GWAS of Genes for Good participants’ BMI, the BMI measurements were calculated from the height and weight survey in the app, which was derived from height and weight questionnaires available from PhenX Toolkit. Weight measurements for the first several thousand genotyped participants were bottom-coded at 80 lb and top-coded at 251 lb; then, the top-coded value was changed to 381 lb partway through the study to capture a greater range of variation. For participants who were pregnant at the time of answering the survey, we used their pre-pregnancy weight obtained from the same survey. The BMI values were then regressed on sex, age, array chip version, and the first five principal components; the residuals were inverse-normal transformed in order to compare effect size estimates to the largest published meta-analysis of BMI and to reduce the impact of extreme observations. We used the SAIGE software to run a mixed model GWAS, accounting for sample relatedness and population structure. Polygenic risk scores were calculated using PLINK.
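The phenotype preparation described above can be sketched in a few lines. This is a minimal illustration, not the study's actual code: it assumes height is reported in inches, uses the later 381 lb top-code as the default ceiling, and applies a rank-based inverse-normal transform with a Blom offset of 3/8 (the offset is an assumption; the text does not specify one). The regression on covariates is omitted, so `inverse_normal_transform` would be applied to residuals computed elsewhere. All function names are hypothetical.

```python
from statistics import NormalDist

LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms
IN_TO_M = 0.0254       # exact definition of the inch in metres


def code_weight(weight_lb, floor=80.0, ceiling=381.0):
    """Clamp a reported weight to the coded range (bottom- and top-coding)."""
    return max(floor, min(ceiling, weight_lb))


def bmi(weight_lb, height_in):
    """BMI in kg/m^2 from a coded weight in pounds and a height in inches (assumed unit)."""
    kg = code_weight(weight_lb) * LB_TO_KG
    m = height_in * IN_TO_M
    return kg / (m * m)


def inverse_normal_transform(values, c=0.375):
    """Rank-based inverse-normal transform; tied values share their average rank.

    c = 3/8 is the Blom offset, chosen here as an assumption.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    return [normal.inv_cdf((r - c) / (n - 2 * c + 1)) for r in ranks]
```

For example, `bmi(150, 65)` gives roughly 24.96, and `inverse_normal_transform` maps the median of an odd-length list of distinct values to 0. Passing `ceiling=251.0` to `code_weight` would mimic the earlier top-code.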

DNA is genotyped at ∼600,000 SNPs using either the Illumina Infinium CoreExome-24 v.1.0 or v.1.1 arrays, which include both nonsynonymous exonic variants and a panel of common genome-wide markers (see Web Resources). The standard set of markers on the array is augmented with missense, loss-of-function, and potential lipid- and myocardial infarction-associated variants identified in the HUNT whole-genome sequencing and whole-exome sequencing projects; height-associated variants from GIANT; potential stop-gain variants in 96 genes at loci potentially implicated in type 2 diabetes, blood lipid levels, Alzheimer disease, nicotine/alcohol metabolism, and several others with mutations implicated in serious but treatable health conditions; complex trait-associated variants in the EBI/NHGRI GWAS catalog; a random subset of Neanderthal SNPs from the 1000 Genomes Project; ancestry informative markers identified by Paschou et al. that were highly correlated with the principal components of Human Genome Diversity Project samples; and pain-related variants proposed by Dr. Chad Brummett of the University of Michigan Division of Pain Research. Genotypes at an additional >30 million variants in the 1000 Genomes Phase 3 panel are imputed using Minimac3. After quality control, local genetic ancestry is estimated using RFMix, global ancestry with ADMIXTURE, and principal components analysis is performed with TRACE, using the Human Genome Diversity Project samples as a reference panel for all three analyses. We provide each Genes for Good participant with a section in the app to view these estimates of genetic ancestry on the sample they provided.

We provide participants with several ways to interact with both their own data and the research study as a whole. After each health history survey is completed, we provide charts summarizing the information, in some cases comparing each participant’s answers to the Genes for Good study population (example in Figure 7). Similarly, for daily tracking surveys, we generate summaries of each participant’s health behavior over time as well as summary statistics for the entire study (example in Figure 8). In addition to providing this ongoing feedback and summary of the survey responses, we also offer participants who submit a sample a breakdown of their genetic ancestry; the current version includes seven continental human populations (Europe, Africa, East Asia, Central/South Asia, West Asia/North Africa, Americas, and Oceania), and results are served in the form of a global ancestry estimate, local ancestry inference, and principal components analysis using the methods described previously (RFMix, ADMIXTURE, TRACE). Before seeing their estimates of genetic ancestry, participants are required to watch a short video on how to interpret their results. Participants can also download their array and imputed genotypes.

Privacy and Data Security

All Genes for Good data are divided into two classes: (1) personally identifiable information, such as email addresses, Facebook user IDs, and physical mailing addresses; and (2) research information, such as survey responses and genetic data. Each class of data is stored in a distinct relational database and served from a distinct server. Extracts for outside researchers include only research-specific data. We plan to ask participants to allow use of their mailing address to link to information such as geocoded pollution data, built environment (for instance, the number of fast food outlets or public parks within a certain radius of one’s home), and census tract data. In these cases, the participants’ physical addresses would still be withheld from external collaborators, but variables generated using addresses could be shared upon request.
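As a rough illustration of the two-class separation, the sketch below splits one participant record into an identifiable part and a research part. The field names and the random `study_id` used to link the two parts are hypothetical; the text does not describe how records in the two databases are keyed, so this is one possible scheme rather than the study's implementation.

```python
import secrets

# Identifier classes named in the text; the field names themselves are hypothetical.
PII_FIELDS = {"email", "facebook_user_id", "mailing_address"}


def split_record(record):
    """Split a record into (pii, research) dicts that share only a random study ID."""
    study_id = secrets.token_hex(8)  # 16 hex characters; hypothetical linking key
    pii = {k: v for k, v in record.items() if k in PII_FIELDS}
    research = {k: v for k, v in record.items() if k not in PII_FIELDS}
    pii["study_id"] = study_id
    research["study_id"] = study_id
    return pii, research
```

Under this scheme an extract for outside researchers would be built from the `research` part only, and a derived variable such as a census tract code could be added to it without exposing the address it was computed from.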

The privacy of Genes for Good data is monitored by the University of Michigan Institutional Review Board. All genetic and survey results are stored in a secure server on campus that is not directly connected to the public internet, and DNA samples are stored in physically secure spaces with restricted access. In addition, all archived data, including participants’ demographic summaries and genetic information, are de-identified to protect participant privacy. Even though Genes for Good uses Facebook to authenticate login, Facebook does not access information we collect through the app, and we do not use participants’ social media postings and connections in our research. We make efforts to communicate with participants about the extensive measures we take to ensure the privacy of their data and to ease their worries about using social media as a platform for genetic research.

All communication to and from the application is encrypted. Participants are authenticated using a Facebook account and Facebook’s OAuth implementation, ensuring that participants have access only to their own data once inside the application. Communication with Facebook servers is limited to authentication only; although Facebook receives and retains information about which Facebook accounts have accessed the Genes for Good app, all other information provided by participants is provided directly to Genes for Good servers. Facebook cannot see any of the data entered by participants.

Once participants have their genetic data analyzed, they are notified that they may access results inside the app with a Results Access Code, a randomly generated alphanumeric code that must be requested by the participant and will be delivered to the email address on the participant’s Genes for Good profile. Participant genotype data are processed internally on University of Michigan servers and are distributed to participants upon request via Box, a secure third-party file-sharing platform. Participants may request their raw genotypes as often as they like from within the genetic results section of the app. Each request compresses and uploads raw genotype data and supplementary information to a private, password-protected Box account directory. For security purposes, all requested genotypes automatically expire from Box servers 3 days after being uploaded.
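The access-code and expiry behavior described above can be sketched as follows. This is an illustrative sketch only: the code length, the alphabet, and the function names are assumptions, and the expiry check is a local stand-in for what Box enforces on its own servers (no Box API is used here). Only the 3-day window comes from the text.

```python
import secrets
import string
from datetime import datetime, timedelta, timezone

ALPHABET = string.ascii_uppercase + string.digits  # assumed alphabet
EXPIRY = timedelta(days=3)                         # 3-day window stated in the text


def new_access_code(length=10):
    """Return a random alphanumeric code drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def is_expired(uploaded_at, now=None):
    """True once at least 3 days have passed since the upload time."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - uploaded_at >= EXPIRY
```

Using `secrets` rather than `random` matters here because the code acts as a credential: it should not be predictable from earlier codes.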