As we were organizing analyses for the ExAC flagship paper, we were inspired by Titus Brown’s manifesto on reproducibility. As with most collaborative papers, we had a core team of analysts, each responsible for their subset of analyses and figure panels. Partway through, we realized there were many shared features that we wanted to keep consistent across analyses, particularly variant filtering strategies and annotations. We decided to organize the code in a central GitHub repository, which team members could access and edit as analyses progressed. Today, we are happy to release this code, which you can use to reproduce every figure in the paper!

How to use

The code and miscellaneous data are available at https://github.com/macarthur-lab/exac_2015.

git clone git@github.com:macarthur-lab/exac_2015.git
cd exac_2015

Then, start R and run the following. Warning: this process requires a decent amount of memory, ideally a machine with at least 16G of RAM (newer MacBook Pros are fine, but not MacBook Airs). On Linux and other machines with stricter memory policies, we recommend 32G or even 64G to allow for overhead during loading.

source('exac_constants.R')
exac = load_exac_data()

## [1] "Using locally loaded exac file. Run exac = load_exac_data(reload=T) to update to newest file."
## [1] "Step (1/6): Loading data..."
## [1] "Now processing data (2/6): Determining sequence contexts..."
## [1] "(3/6): Computing allele frequencies..."
## [1] "(4/6): Calculating bad regions of the genome..."
## [1] "(5/6): Determining what to _use_..."
## [1] "(6/6): Parsing SIFT and PolyPhen scores, and separating functional categories..."
## [1] "Done! Here you go. Took:"
##    user  system elapsed
## 857.437  17.924 878.059

This will take about 10-15 minutes to load, plus a ~800M download the first time you run it. exac is then a data frame of all ~10M variants in the dataset, with over 100 annotations per variant. Each variant gets its own row, as opposed to a typical VCF, where multi-allelic variants are combined into a single line. Note that these are unfiltered variant calls. Make sure to filter to either PASS sites, or better yet, our criteria for high-quality calls (given by the column use; see below for details):

filtered_calls = subset(exac, filter == 'PASS')

or:

use_data = subset(exac, use)
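The difference between these two subsets can be illustrated on a toy data frame (the filter and use values below are made up for illustration, not real ExAC rows):

```r
# Toy version of the two filtering columns; values are hypothetical.
toy = data.frame(
  filter = c('PASS', 'VQSRTrancheSNP99.60to99.80', 'PASS', 'AC_Adj0_Filter'),
  use = c(TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# PASS keeps sites that passed the site-level filters...
nrow(subset(toy, filter == 'PASS'))  # 2
# ...while use applies the stricter high-quality criteria.
nrow(subset(toy, use))               # 1
```

In the real data frame, use is stricter than filter == 'PASS': it layers additional quality criteria on top of the PASS filter.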

You can view the table using head(), or with dplyr’s tbl_df, which formats the output nicely:

tbl_df(exac)

## Source: local data frame [10,195,872 x 116]
##
##    chrom   pos   ref   alt    id                     filter
##    (chr) (int) (chr) (chr) (chr)                      (chr)
## 1      1 13372     G     C     .                       PASS
## 2      1 13380     C     G     . VQSRTrancheSNP99.60to99.80
## 3      1 13382     C     G     . VQSRTrancheSNP99.60to99.80
## 4      1 13402     G     C     . VQSRTrancheSNP99.60to99.80
## 5      1 13404     G     A     . VQSRTrancheSNP99.60to99.80
## 6      1 13404     G     T     . VQSRTrancheSNP99.60to99.80
## 7      1 13417     C CGAGA     .                       PASS
## 8      1 13418     G     A     .             AC_Adj0_Filter
## 9      1 13426     A     G     . VQSRTrancheSNP99.80to99.90
## 10     1 13438     C     A     . VQSRTrancheSNP99.80to99.90
## ..   ...   ...   ...   ...   ...                        ...
## Variables not shown: ac (int), ac_afr (int), ac_amr (int),
##   ac_adj (int), ac_eas (int), ac_fin (int), ac_hemi (int),
##   ac_het (int), ac_hom (int), ac_nfe (int), ac_oth (int),
##   ac_sas (int), ac_male (int), ac_female (int),
##   ac_consanguineous (int), ac_popmax (int), an (int),
##   an_afr (int), an_amr (int), an_adj (int), an_eas (int),
##   an_fin (int), an_nfe (int), an_oth (int), an_sas (int),
##   an_male (int), an_female (int), an_consanguineous (int),
##   an_popmax (int), hom_afr (int), hom_amr (int), hom_eas
##   (int), hom_fin (int), hom_nfe (int), hom_oth (int),
##   hom_sas (int), hom_consanguineous (int),
##   clinvar_measureset_id (int), clinvar_conflicted (int),
##   clinvar_pathogenic (int), clinvar_mut (chr), popmax
##   (chr), k1_run (chr), k2_run (chr), k3_run (chr),
##   doubleton_dist (dbl), inbreedingcoeff (dbl),
##   esp_af_popmax (dbl), esp_af_global (dbl), esp_ac (int),
##   kg_af_popmax (dbl), kg_af_global (dbl), kg_ac (int),
##   vqslod (dbl), gene (chr), feature (chr), symbol (chr),
##   canonical (chr), ccds (chr), exon (chr), intron (chr),
##   consequence (chr), hgvsc (chr), amino_acids (chr), sift
##   (chr), polyphen (chr), lof_info (chr), lof_flags (chr),
##   lof_filter (chr), lof (chr), context (chr), ancestral
##   (chr), coverage_median (dbl), coverage_mean (dbl),
##   coverage_10 (dbl), coverage_15 (dbl), coverage_20 (dbl),
##   coverage_30 (dbl), pos_id (chr), indel (lgl),
##   bases_inserted (int), transition (lgl), cpg (lgl),
##   alt_is_ancestral (lgl), af_popmax (dbl), af_global (dbl),
##   singleton (lgl), maf_global (dbl), daf_global (dbl),
##   daf_popmax (dbl), pos_bin (fctr), bin_start (dbl), sector
##   (chr), bad (lgl), use (lgl), lof_use (lgl), sift_word
##   (chr), sift_score (dbl), polyphen_word (chr),
##   polyphen_score (dbl), category (chr)

A previous blog post by Monkol Lek, A Guide to the Exome Aggregation Consortium (ExAC) Data Set, described some of the challenges in exploring the data. With the dataset loaded into R, we can dive into a few more tips and tricks. First, you’ll notice that there are many pre-calculated annotations per variant, in particular:

Allele frequency

One set of questions that always comes up concerns the allele counts (AC) and allele numbers (AN). The raw counts (ac and an) refer to the total number of chromosomes carrying this allele, and the total number of chromosomes that could be called (whether reference or alternate), respectively. Thus, the allele frequency is ac/an. However, for more stringent applications, we adjusted the allele count to include only individuals with genotype quality (GQ) >= 20 and depth (DP) >= 10, which we term ac_adj (and correspondingly, an_adj for the total number of chromosomes from individuals meeting those criteria, regardless of genotype call). In this table, af_global is then ac_adj/an_adj, and the per-population ac and an columns are filtered to adjusted counts only (i.e. ac_eas includes only alleles from individuals of East Asian descent who meet these criteria). Additionally, we provide the maximum allele frequency across all populations as af_popmax and the corresponding population as popmax, which is useful for filtering variants in Mendelian gene discovery. Finally, we have hom_* for all populations, which is the count of individuals homozygous for this allele.
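As a small sanity check on these relationships, here is a sketch using made-up counts (not real ExAC values) showing how af_global is derived and then used for rare-variant filtering:

```r
# Hypothetical adjusted allele counts and numbers for three variants.
toy = data.frame(
  ac_adj = c(1, 25, 4000),
  an_adj = c(120000, 121000, 119000)
)

# af_global is simply the adjusted count over the adjusted number.
toy$af_global = toy$ac_adj / toy$an_adj

# A typical rare-variant cutoff for Mendelian filtering (0.1%):
rare = subset(toy, af_global < 0.001)
nrow(rare)  # 2
```

For Mendelian filtering on the real data you would typically use af_popmax rather than af_global, so that a variant common in any single population is excluded.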

Functional annotations

The output from VEP is parsed for each allele and, in the default table, summarized using the most severe consequence across all transcripts. You can run load_exac_data('canonical'), for instance, to load the annotations for only the canonical transcripts, or pass transcript=T or gene=T to load a data frame with one entry per transcript or gene, respectively, for each variant. Loss-of-function (LoF) annotations of variants from LOFTEE are provided in the lof, lof_filter, and lof_flags columns, and the highest-confidence variants are indicated as lof_use. PolyPhen and SIFT annotations are parsed into words (e.g. benign or possibly_damaging) and scores. There are also annotations from ClinVar, intersected with the output of Eric Minikel’s ClinVar parsing tool. A query for pathogenic variants without conflicting evidence may use clinvar_pathogenic == 1 & clinvar_conflicted == 0 & clinvar_mut == 'ALT'.
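A minimal sketch of that ClinVar query on toy rows (the values below are hypothetical, not real ClinVar annotations). Note the single &: subset() needs the vectorized operator, since && would inspect only the first element of each column:

```r
# Hypothetical ClinVar columns for three variants.
toy = data.frame(
  clinvar_pathogenic = c(1, 1, 0),
  clinvar_conflicted = c(0, 1, 0),
  clinvar_mut = c('ALT', 'ALT', 'REF'),
  stringsAsFactors = FALSE
)

# Pathogenic, unconflicted, and the ClinVar mutation matches the ALT allele.
path = subset(toy, clinvar_pathogenic == 1 &
                   clinvar_conflicted == 0 &
                   clinvar_mut == 'ALT')
nrow(path)  # 1: only the first row survives all three conditions
```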