For the results of associations between various low frequency variant filtering methods, for 797 DrugBank genes using whole exome sequencing data, we found a total of 91 results that passed the Bonferroni threshold (P-value = 1.08e − 07); all 91 results passing this threshold are in Supplementary Table 1. Table 1 also lists the most potentially novel gene-phenotype associations for clinical lab and diagnosis codes of our study.

Table 1 Potential novel associations from PheWAS analyses. Full size table

The two most significant results of this study are associations between genes and phenotypes where the impact of loss of function highly relates to the known function of these genes. For example, the top result from the diagnosis codes analysis observed in the functional annotation filter category 2 was an association between the calcium-sensing receptor gene (CASR) and the diagnosis of “hypercalcemia” (ICD-9 275.42, P-value = 1.34e − 22, beta = 3.89, functional annotation filter 2), Supplemental Table 1. CASR plays an essential role in calcium homeostasis and is expressed mostly in kidneys and parathyroid glands. Mutations in CASR lead to familial hypocalciuric hypercalcemia (FHH)21. In our study, this association was Bonferroni significant using all four functionally annotated filter categories with the most significant result for functional annotation filter 2. Associations between CASR and hypercalcemia were least significant via the ‘all variants’ filtering category (functional annotation filter 1). This suggests the effect of association is impacted more strongly by functionally annotated variants in the CASR gene rather than non-functionally annotated variants. The diagnoses used in drug treatments that target CASR are hyperparathyroidism, bone destruction, chronic kidney disease with secondary hyperparathyroidism and impaired renal function. The most significant result from the clinical laboratory measurement analyses was the association between the gene GPT and alanine aminotransferase levels (P-value = 3.29e − 83; Beta = −0.64; functional annotation filter 1), Supplemental Table 1. The GPT gene encodes the enzyme glutamate-pyruvate transaminase 1, also known as cytosolic alanine aminotransferase. Comparative analyses of GPT with alanine aminotransferase among the four filter categories suggests that functionally annotated variants have a larger impact on phenotypic variation than non-functionally annotated variants.

Associations with Clinical Lab Measures

To show the overall landscape of results for the clinical lab measures for both highly significant and more potentially suggestive associations, we plotted all results below an exploratory P-value of 0.001 for clinical labs in Fig. 1A and B. There were 197 unique gene-phenotype combinations.

Figure 1 Phewas-view plot of clinical laboratory measures. (A) Represents clinical laboratory measurement gene based associations for results with P-value < 0.001. The Y-axis lists all phenotypes. Triangles represent the –log10 P-value of associations on the left, with normalized beta in on the right with standard error bars. Points are colored based on the filter category. The direction of the triangle corresponds to the direction of effect: up is positive, down is negative. The results for alanine aminotransferase (ALT) are plotted separately in (B) due to the significance of tshe results on a different scale from the rest of the results. (C) Is a Manhattan plot of associations with P-value < 0.001 for ICD-9 code based case/control diagnoses. The x-axis corresponds to ICD-9 category and y-axis corresponds to the −log10(P-value) of the association and the points are colored based on the ICD-9 category. Within each diagnosis category the plotted points are ordered from most significant associations to the least significant associations. Full size image

We found a total of 21 Bonferroni significant associations with quantitative laboratory measurements from a total of 5 unique gene-phenotype combinations (due to significant associations for the same gene-phenotype combinations from multiple ways of filtering rare variants). Of these, 20 associations are supported by the known function of these genes, impact on respective associated phenotypes through previously reported common genetic variation results, and through other existing biological knowledge. We identified 1 potentially novel association without considerable previous biological knowledge relating the gene to the phenotype.

A total of 14 associations out of all Bonferroni significant results were for bilirubin levels and represented alternative splicing forms of UGT1A gene. The UGT1A gene is known for highly significant associations with bilirubin for common frequency variants22,23,24,25. This gene family encodes enzyme UDP-glucuronosyltransferase, which converts toxic bilirubin into non-toxic form26. Of note, all the highly significant associations only included functionally annotated filtered and all variants categories. No associations in the non-functionally annotated category reached Bonferroni significance.

We also identified associations between GOT1 and aspartate aminotransferase levels27 (in functional annotation Filter 1 and 2). GOT1 is known as Glutamic-Oxaloacetic Transaminase 1 (also known as aspartate aminotransferase), therefore its association with aspartate aminotransferase levels in serum reflects knowledge of this gene.

We found a highly significant association between TUBB1 and platelet counts (P-value = 7.85e − 11; normalized beta = −6.50; functional annotation Filter 2) that was also Bonferroni significant via functional annotation filters 1 and 3. TUBB1 was also associated with the related mean platelet volume (P-value = 3.57e − 08; Normalized Beta = 5.51 in functional annotation Filter 3), but with a positive direction of effect. The TUBB1 gene is highly expressed in platelets and megakaryocytes and has been inferred to be involved in proplatelet production and platelet release28,29. The TUBB1 protein is one of the two core families to form microtubules. Loss of function in the TUBB1 gene inhibits platelet release, which results in decrease in platelet counts. This supports our finding of a negative direction of effect in our association with platelet counts, indicating that an enrichment in functionally annotated mutations leads to a decrease in platelet counts. It has also been shown that mutation in this gene is associated with autosomal dominant macrothrombocytopenia, with both a reduction in platelet counts as well as an increase in platelet volume. This is also consistent with the observation in our study that loss of function of this gene is positively associated with platelet mean volume (a measure of the average size of platelets in blood)30,31.

Another association from our study is between GLCCI1 gene (using functional annotation filter 2) with white blood cell (WBC) counts (Table 1). Common frequency SNPs in the GLCCI1 gene have been previously associated with asthma32,33, but not WBC counts. High WBC counts are a reflection of inflammation, and higher WBC counts are observed in patients with severe allergy and asthma34. A direct connection between this gene and WBC levels is not known.

Associations with Clinical Diagnoses

Of the 70 Bonferroni significant associations with ICD-9 code based diagnoses, there were 60 unique gene-phenotype combinations. Of these, 14 associations are closely related to the known function of these genes, and 57 associations are more novel with respect to existing understanding of these genes.

For ICD-9 code associations, we have highlighted some of the key results of these associations in the Manhattan plot of Fig. 1C. Among the top associations is the gene TACR1 associated with chronic sinusitis (ICD-9 473.9; P-value = 2.01e − 10; beta = 2.34; functional annotation filter 3). The TACR1 gene is from the family of tachykinin receptors that are characterized by the interactions with G-proteins. Other G-protein receptors such as I KACH , GNB2 are linked to some other forms of sinusitis35,36 but this association between functional annotation mutations in TACR1 and chronic sinusitis is novel. The drug aprepitant is a known target drug for gene TACR1, and is an antagonist of the receptor. It is used to treat nausea and vomiting symptoms caused by chemotherapy treatment for cancer37,38. This novel association among functionally annotated mutations in TACR1 and chronic sinusitis warrants further investigation to understand what impact this drug may have in relation to sinusitis, including a potential side effect of tachykinin receptor blocking through the use of this drug.

In our study, we also identified functionally annotated variants in the PTGR2 gene (from all categories where functionally annotated mutations were tested) with “abnormal glucose intolerance of mother”. These association results are below the Bonferroni cutoff in all 4 categories where functionally annotated variants are included. For these associations, we observed the highest odds ratio of 55.70 in functional annotation filter 2 and lowest OR of 18.17 in the all variants category. These results are shown in Table 1. This association is also not p previously reported. Gene PTGR2 also showed a Bonferroni significant association in the non-functionally annotated category with the diagnosis of ICD-9 code 309.28 (anxiety and depression).

Overall Trends of Results Across Variant Filtering Approaches

A focus of this study was comparing and contrasting the results of gene-based comprehensive associations across a wide range of phenotypes when using a range of approaches for filtering rare variants. Figure 2 below shows a circos plots representing all results with P-values less than 0.001 from the functional annotation filter 2 category for results from associations with ICD-9 based case/control status as well as the quantitative clinical lab measures. Plots for other functionally annotated filters, all variants and non- functionally annotated filter categories are shown in Supplementary Figures 1, 2, 3 and 4. We have presented results from each of the functional annotation filters in separate colors to exemplify the differences and similarities observed in these analyses.

Figure 2 Circos plot of –log 10(P-value) association results by chromosome from ICD-9 codes (outer circle, points represented as triangles) and clinical laboratory values (points represented as squares) for results with P-value < 0.001. The genes are labeled at their respective chromosome base pair location boundaries. Yellow points represent results from functional annotation filter 1 category. The axis on both plots is same and goes from 0 to 22 (−log10 P-value). Full size image

For ICD-9 based diagnoses, the results from functional annotation filter 2 had the least number of associations that were significant at an alpha level of 0.001. However, this filter also had the most significant association of the entire study (CASR and hypercalcemia), as well as the most significant associations for ICD-9 based diagnoses. Also for ICD-9 based diagnoses the non-functionally annotated category showed the highest number of results at P-value < 0.001, but the lowest P-value was 1.08e − 09 for the gene GSK3B associated with the diagnosis “thoracic aneurysm”, ICD-9 441.2. In the other functionally annotated filter categories for associations with diagnoses, the lowest P-value was 1e − 10, and in the non-functionally annotated category the lowest P-value was 1e − 08.

For clinical lab measures associations, filtering rare variants for functional annotation showed more number of highly statistically significant results than non-filtering by functional annotation (all variants and non-functional annotation filter categories), with the top result for functional annotation filter 1, 2 and 3 was for GPT associated with alanine aminotransferase levels (P-value = 3.29e − 83).

To compare and contrast the effect of associations that are significant for one rare variant filter and marginally or not significant in other filters, we picked the top 5 genetic associations from each functional annotation filter and plotted the P-values of the same gene and phenotype associations from the other functional annotation categories. These results are shown in Fig. 3A and B. For example, for the ICD-9 diagnosis based associations, the most significant association in the all variants category was between the gene THBD and the diagnosis “other closed fractures of distal end of radius” (ICD-9 code 813.42, P-value = 3.54e − 09). This result seems to be influenced mainly by non-functionally annotated variants as it is (a) significant in all variants, (b) not statistically significant for the non-functionally annotated filter (P-value = 2.23e − 06) and (c) not significant in the functional annotation filter categories.

Figure 3 (A) Shows the top 5 results from each rare variant filtering strategy and corresponding –log10(P-value) and magnitude and direction of effect (boxes linked to the data points) from other filtering strategies for the same gene-phenotype associations for clinical lab measures. The x axis shows the clinical lab name and y-axis shows the –log10(p-value). Beta coefficient are represented by numbers next to points which are color coded by filter category and shape corresponds to gene name. (B) Shows the top 5 results from each rare variant filtering strategy and corresponding –log10(P-value) and magnitude and direction of effect (boxes linked to the data points) from other filtering strategies for the same gene-phenotype associations for ICD-9 based diagnoses. The x axis shows the general ICD-9 category the diagnosis was grouped into and y-axis shows the –log10(p-value). Beta coefficients are represented by numbers next to points, are color coded by filter category, and the shape corresponds to the specific gene listed in the legend. Full size image

The association between gene ADRA2B and “alcohol abuse” ICD-9 305.00 (mental disorders category) is observed as most significant result for functional annotation filter 2 (P-value = 3.88e − 10). This association does not reach Bonferroni significance in other filter categories implying the relevance of LOF and deleterious variants from this category and their link to alcohol abuse. ADRA2B has been linked to diseases such as hypertension, obesity, epilepsy, etc39,40,41 and its association with addiction is also known42, but the specific link with alcohol consumption and abuse has not been reported. Notably, the Non-functionally annotated results for this gene and phenotype were very non-significant.

In the functional annotation filter 3 category, we observed an association between the gene NPR3 with “anxiety behavior” ICD-9 309.24 (diagnosis category “adjustment disorder with anxiety”) with P-value = 1.21e − 09. The result however was statistically non-significant for functional annotation filter 1 and All Variants, underscoring the contribution of LOF variants in filter 3 to these associations. Natriuretic receptors are well known to play essential role in blood pressure regulation. These receptors are also known to be very important in fluid regulation in central nervous system and thus can effect emotional behaviors such as anxiety43.

We repeated this analysis using only the more highly significant top 5 clinical laboratory measure associations from each rare-variant filter from the results presented in Fig. 3A. As previously mentioned, the association between GPT gene and alanine aminotransferase is the most significant result in all 4 categories consisting of functionally annotated variants. This association is not significant in non-functionally annotated category. Also, it is interesting to note that top 5 associations that are significant in non-functionally annotated category are not significant at all in other categories and do not pass Bonferroni significance in general.

Next, we explored the results intersecting among the rare-variant filters and again looked at the count of results with P-value < 0.001, shown in Fig. 4. For clinical laboratory measures, we observed 8 results that were in functionally annotated filtered categories. For ICD-9 diagnoses, we observed only 5 results that were shared among all categories implying that the effect of association is from the combination of all functionally annotated and not annotated variations, 200 results that were only present for functionally annotated filters, and 202 results in both the functionally annotated and all variants filter.

Figure 4 Intersection of results passing a P-value < 0.001 for associations from different rare variant filtering strategies. The results for clinical laboratory measurements are on the left in (A) and the results for ICD-9 codes are on the right in (B). This figure shows the breakdown of these results based on how the variants were filtered, and whether or not the association was found only with one approach for filtering variants, or more than one way of filtering variants (lines connected between filtering method). The “intersection size” shows for each combination of filtering approach how many results passed the P-value cutoff, and set size indicates across that filtering approach total how many results there were. Full size image

The intersection plots shown in Fig. 4 for the counts of associations shared in each category highlight results due to various functionally annotated filters. There were 155 associations in functional annotation filter 2 and 3 in the ICD-9 based associations and 6 associations for the laboratory measures t indicating that these associations were mostly influenced by LOF and deleterious variants.

Associations Only Identified For Functionally Annotated Filters

The PheWAS-view plot in Fig. 5 represents the common results identified from all functionally annotated filters, with minimum number of cases 501 (track 3 in Fig. 5). Among the top most significant associations that were found only by functional annotation filtering of variants, we identified associations between the FOS gene with “sensorineural hearing loss” ICD-9 389.10 (See Table 1). One factor that causes sensorineural hearing loss is noise and studies in mouse models have suggested that noise exposure activates the MAPK signaling pathway and the FOS gene among other genes is up-regulated in MAPK Signaling pathway44,45. Another interesting association was observed is between gene ATF7 and the ICD-9 diagnosis 278.02 (overweight) (see Table 1). Even though this result did not achieve statistical significance, it might reflect clinical importance with further study. The ATF7 gene is known to be linked to familial atrial fibrillation46 but its association with obesity is not completely known.

Figure 5 PheWAS-view plot showing P-values, Beta and case number track for all results with strongest associations derived from filtering on functionally annotated variants. Full size image

Associations Only Identified for Variants Not Functionally Annotated

We explored the top-most associations where the effect was not due to functionally annotated variants, only due to associations with non-functionally annotated variants. We identified 2120 such associations from the ICD-9 analyses and 31 associations from the clinical lab analyses where the results are filtered at P-value of 0.001. Among the most significant associations is a novel association between the gene PTGS2 (Prostaglandin-Endoperoxide Synthase 2) and ICD-9 493.20 (chronic obstructive asthma) with P-value = 4.90e − 08 and also between the gene ADRA1D and ICD9-9 780.93 (memory loss) with P-value = 7.64e − 07, where the gene ADRA1D is already known to be associated with Schizophrenia47. All Bonferroni significant results in this category are shown in Supplementary Table 1.

The top most associations are between gene NGF and WBC counts (P-value = 5.24e − 06, normalized beta = −4.55) and gene AZIN2 and creatinine levels (P-value = 5.94e − 06, normalized beta = −4.54). Both of these associations have not been reported previously by rare variant association studies. Nerve Growth Factor (NGF) has been known to act in inflammatory responses in rat studies48,49 and is also known to be responsible for T and B-cell activation in humans50, therefore, its association with WBC counts based on rare variants offer further evidence. Further EHR-based research could help in providing useful insights into understanding the genetics behind these results.

Associations with diagnoses matching the diagnoses for drugs of the DrugBank database

We characterized gene-disease associations where the diagnosis matches the reason drugs are prescribed that target specific genes. The DrugBank database provides list of genes that are targets for drugs, along with the condition the drugs are prescribed for. We mapped these gene and drug combinations to the ICD-9 code ranges corresponding to disease diagnosis. We matched results below a P-value threshold of 0.001 to the DrugBank listed ICD-9 codes range as explained above. A total of 1,277 associations (7 out of those associations were Bonferroni significant) had a match between the target gene, associated phenotype, and the diagnosis drugs that are prescribed to target that gene. There were 874 unique gene – phenotype combinations. Figure 6A shows how these results across individual ways of filtering rare variants, and when the associations were present with more than one way of filtering rare variants. In supplemental materials, we describe in more detail the number of associations we had depending on the functional annotation of variants.

Figure 6 (A) Shows the intersection of results where the diagnosis of the association matched the diagnosis used for the drug targeting the gene of the DrugBank database. A total of 874 associations had a match between the target gene, associated phenotype, and the diagnosis that drugs are prescribed to target that gene, for p < 0.001. (B) Represents the forty-one associations where there was a match between the target gene, associated phenotype, and the diagnosis drugs are prescribed for that target that gene, for p < 0.001. The x-axis shows the disease description and tracks from top to bottom show -log10 P-value, the magnitude and direction of effect, and the number of cases. Colors represent the different filters applied to the variants before gene based association testing. The Drug name is listed in the same column as the association showing the drug prescribed for the diagnosis listed. Bonferroni significant associations are shown in larger size on the plot. Full size image

Two of the top associations, apart from the CASR results already described, were for functionally annotated filtered variants. We found an association between ADRA2B and “substance addiction and disorders” (ICD-9 305.00; lowest P-value = 3.88e − 10; beta = 4.51; functional annotation filter 2), the drugs known to target this gene include Amoxipine, Amphetamine, Loxapine, Clozapine, and the drugs are prescribed for are depression, stress and attention deficit hyperactivity disorder (ADHD). There were also associations between RBP1 and ICD-9 695.89 “other specified erythematous conditions” (lowest P-value 1.28e − 09, beta = 3.25, functional annotation filter 1). RBP1 plays an important role in transportation of vitamin A to epithelial tissue and thus is a gene target for the drug Acitretin51 which is used to treat psoriasis, squamous cell carcinoma, chronic hand dermatitis, and malignant melanoma, among others52.

In order to further contrast how the significance of results changed depending on variant filter, Fig. 6B shows these results in PheWAS-view plot53 where we present P-values, betas and the number of cases for each unique gene-phenotype combination. We also list the drugs as listed in the DrugBank database that are prescribed for the matching diagnosis code. Notably in the figure, regardless of the annotation or all variant filter, for each association the direction of effect is consistently positive. Thus, all associations are with a protective direction of effect for the disease conditions of the associations. In Supplemental file 1 we describe in detail how many associations, and how much overlap, there was for these associations depending on variant filtering.

Cross-phenotype Associations

Exploring cross-phenotype and potentially pleiotropic associations are possible with PheWAS studies due to a wide range of phenotypes evaluated at once. We filtered all results at P-value threshold of 1.08e − 08 and number of cases greater than 150 (for ICD-9 code associations) to explore cross-phenotype associations. We did not observe any cross-phenotype associations where results were all Bonferroni significant. To explore these results further, for each gene where P-value with phenotype was Bonferroni significant, we also extracted results for same genes at P-value < 1e − 04. Here we only report unique gene-phenotype combinations from all filter categories as represented in Supplementary Fig. 5. This exploratory search of cross-phenotype associations at an exploratory P-value cutoff resulted in several interesting observations. For example, we see association among gene ABCA1 with HDL levels, total cholesterol levels, and irritable bowel syndrome (IBS). The drug Probucol is used to target ABCA1 gene which helps in controlling cholesterol and is also known to lower HDL cholesterol levels. In the MyCode dataset, we observed 1,486 patients that are cases for IBS and out of these patients 528 have total cholesterol > 200 mg/dl and 198 patients have HDL <40 mg/dl. A connection between lipid levels and IBS warrants further investigation.

Next, we also observed associations of various phenotypes such as hyposomality (abnormal levels of electrolytes), hypopotassemia, chloride levels and septicemia with the gene Cytochrome oxidase 6 C (COX6C). The only Bonferroni significant association with COX6C is abnormal electrolyte levels but this exploratory search points to the direction of the changes in body caused due to sepsis infection. Septicemia can cause abnormal levels of potassium, sodium, chloride, etc54,55. Cholic acid is used to target COX6C in treatment of adults and children with bile acid synthesis disorders such as Zellweger Syndrome56. Cytochromes are known to be crucial during development of sepsis57,58 and there is a relationship between cholic acid levels increasing with sepsis.

Pathway Analyses for Gene-Sets

With the results of our DrugBank PheWAS, we also used gene set enrichment analysis to see if there were multiple genes within the same genetic pathways, that had been associated with the same phenotype. This provides further information about key pathways impacting phenotypic variability, and the impact of perturbing biological networks on outcomes. This approach can also identify multiple potential drug targets across a single pathway. We used the P-values from the regression analyses, separated by each phenotype and variant filtering approach, and ranked results from most significant gene association to least significant association. We then performed gene-set enrichment analysis (GSEA)59,60 for the results of each phenotype and filtering approach separately (described further in methods)

Supplementary Fig. 6 shows an overview of number of results in ICD-9 code categories at FDR q-value < 0.25 for each of the filter categories. Similarly, Supplementary Fig. 7 represents an overview of the number of results from clinical laboratory measurements. In our analysis, we did not identify the enrichment of highly-significant genes in any pathways, we instead observed that less significant genes (ranked lower in the list) were enriched in pathways. These results are plotted in a heatmap in Supplementary Fig. 8.

We also explored gene set enrichment analysis at a non-stringent FDR q-value threshold of 0.001. Counts varied from range of 1-5, we picked all gene-sets, gene and phenotype combination where minimum counts of genes are 3. Using a less stringent approach with the association results resulted in combination of 18 genes and 3 gene-sets (GO Drug Metabolic process, KEGG Drug Metabolism Cytochrome P450 and KEGG Retinol Metabolism) that are found to be most enriched in our analysis. These results are shown in Supplementary Fig. 9.

It is not surprising to see drug metabolic processes as the top results from GSEA since we started our analysis with genes that are common drug targets. We observed that in majority of cases for each of these enriched pathways, that even though we performed gene set enrichment separately, there were multiple filter categories (evident from overlaying points in Supplemental Fig. 9) that showed similar gene enrichment results. We did however observe some results where genes were found to be enriched in pathways for results only from one functional variant filtering category. For example, all the genes listed in Supplementary Fig. 9 in the KEGG retinol metabolism pathway are from associations with triglycerides for functional annotation filter 2, there was no enrichment for these genes from other variant filters in this pathway.