Overview of PGx analytics method

To facilitate the identification of drugs that are predicted to exhibit significant population differences in response between a pair of population examined, we employed a novel ‘PGx analytics’ method as detailed in Fig. 1. The approach involves evaluating each SNP based on two properties: (1) whether the allele frequency of the SNP in one population is significantly different from the frequency in another population and (2) whether the SNP is predicted to be potentially functional affecting either gene/protein expression or activity. SNPs that fulfill either the first criteria alone pdSNP, or both criteria pf-pdSNPs were mapped to their corresponding genes. Genes containing these pf-pdSNPs were then mapped to drug pathways using publicly available drug–gene databases. Multiple random samplings-based statistical analyses were subsequently performed to identify drugs that have an enriched representation of genes carrying pf-pdSNPs. This approach was then evaluated for concordance with literature reported real-world occurrences of population difference in drug response. In addition, the relationship between drugs enriched with genes carrying pf-pdSNPs and PGx warning labels or adverse reaction reports was investigated to provide further evidence of the utility of this method.

Fig. 1 Big-data and deep analytics approach to identify drugs associated with genes carrying SNPs that are differentiated between different populations. F ST statistics were determined for all SNPs from the 1000 Genome Project and dbSNP. SNPs with F ST statistics in the top 1% of all SNPs in each population pair comparison were regarded as pdSNPs. pdSNPs were then queried against the pfSNP database (http://pfs.nus.edu.sg/) to identify potentially functional pf-pdSNPs. Genic pf-pdSNPs were then mapped to their corresponding genes and genes containing at least one pf-pdSNPs were named pf-pdGene. Four databases (CTD, Chembl, DrugBank, and PharmGKB) were employed to identify genes associated with drugs/drug pathways. Multiple random samplings-based statistical analyses was performed to identify drugs that are enriched with pf-pdGenes (enrichment Z-score > 2.58). The robustness of the algorithm was evaluated for its capability to detect such enrichment in drugs previously reported with a real-world population differences in response. We also determined if drugs with pharmacogenetics (PGx) warning labels or adverse drug reaction (ADR) reports are associated with population differentiation Full size image

Identification of potentially functional, population differentiated SNPs

A big-data approach was employed to identify drugs enriched with genes carrying pfSNPs that exhibit significant population differentiation (Fig. 1). SNPs with significant population differentiation were identified using data from the 1000 Genomes Project comprising a total of 1029 unrelated individuals representing 14 different populations, including 59 ASW (African ancestries from Southwest United States); 79 LWK (Luhya individuals in Kenya); 86 YRI (Yoruba individuals in Nigeria); 97 CHB (Han Chinese in Beijing); 100 CHS (Han individuals in Southern China); 89 JPT (Japanese individuals in Tokyo); 60 CLM (Columbian in Medellin); 58 MXL (Mexicans in Los Angeles USA); 55 PUR (Puerto Rican in Puerto Rico); 54 CEU (Northern and Western European ancestries in Utah USA); 93 FIN (Finish in Finland); 87 GBR (British in England and Scotland); 98 TSI (Toscani in Italy); and 14 IBS (Iberian in Spain) [34].

To identify SNPs which are associated with significant population variations, allele frequencies of SNPs data-mined from the 1000 Genomes Project [34] were calculated using VCF tools (version 0.1.9). Pairwise F ST statistics [35, 36] were computed in the ‘R’ environment for each SNP with each of the 14 population-pair comparisons, resulting in 91 different population-pair F ST scores for each SNP. pd using pairwise F ST was determined on 71.56% (22,866,661) of the total SNPs from the 1000 Genomes Project but was not computed for the other 28.44% of SNPs as some SNPs were only observed in a single population, while others had variant call errors. The F ST statistic is a measure of the proportion of genetic variance found within a population relative to the genetic variance found in both populations and is often defined as: [37]

$$F_{\mathrm{ST}} = \frac{{H_{\rm{T}} - H_{\rm{S}}}}{{H_{\rm{T}}}}$$

For polymorphic biallelic markers where M is the mean frequency of the more frequent allele across K subpopulations, p k is the frequency of the allele in subpopulation k, n k is the size of subpopulation k and N is the sum of subpopulation sizes:

$$H_{\mathrm{S}} = 1 - \frac{1}{N}\mathop {\sum }\limits_{k = 1}^K n_k[p_k^2 + \left( {1 - p_k} \right)^2]$$

and

$$H_{\mathrm{T}} = 1 - [M^2 + \left( {1 - M} \right)^2]$$

In this study, as only pairwise comparisons were made, K was set to 2. Between two populations, a SNP is regarded as a population differentiated SNP or pdSNP if its F ST score is amongst the top 1% of all the F ST scores in the respective pairwise population comparison. This allowed us to extract SNPs that are considered to be extremely population differentiated between two populations by considering those positioned at the top one percentile with respect to the F ST scores distribution.

SNPs were mapped to functional gene regions and categorized based on their location according to the NCBI dbSNP (build 137) [38]. In the coding region (i.e., exons), amino acid-substituting SNPs are classified as nonsynonymous (nsSNPs), whereas the silent or nonamino acid-substituting SNPs are referred as synonymous (sSNPs). For SNPs in noncoding regions, the following classifications were applied: promoter for SNPs residing within 5.5 Kb upstream of a gene transcription start site; intronic for SNPs residing in introns; as well as 5′ UTR and 3′ UTR for SNPs residing in the 5′ or 3′ terminal of mRNA untranslated regions.

pdSNPs, which were predicted/evaluated to be potentially functional were named potentially functional (pf) population differentiated (pd) SNPs or pf-pdSNPs. The pfSNP (http://pfs.nus.edu.sg/) resource [39] developed by our laboratory was employed to evaluate the potential functionality of the pdSNPs (Supplementary Table 1). pfSNPs are defined as SNPs, where a single nucleotide change is predicted to either alter the expression, structure, function, or activity of the associated gene/protein or their isoform, or that reside within regions that are genetically determined to be under natural selection forces. For coding SNPs, we evaluated if they reside within important protein domains/functional regions, potentially altering important protein modification sites (e.g., phosphorylation sites) [40], or are predicted to alter nonsense-mediated decay or exonic splice enhancer/silencer sites [41, 42]. sSNPs within the coding region were further evaluated for significant codon usage bias as this may potentially influence translational speed and structure/function of the protein [43, 44], while nsSNPs were selected if they were predicted to be deleterious [45,46,47,48].

For noncoding SNPs, those residing in the promoter/5′ UTR regions were evaluated to see if they alter transcription factor binding sites, while those in 3′ UTR were selected if they reside within 3′ UTR conserved regions [49], as they may have functional consequences [50] or alter miRNA binding sites [51,52,53]. Noncoding SNPs in introns were selected if they alter splice sites [54] or intronic splice regulatory elements [55]. The pd-SNPs and pf-pdSNPs were then mapped on to genes in the following way. A pdGene is a gene, which carries at least one pdSNP, while a gene containing at least one pf-pdSNP is regarded as a pf-pdGene. Supplementary Table 2 contains an explanation of all the abbreviations used in the paper.

Enrichment analyses of pf-pdGenes in drug pathways for identification of drugs with population differentiated response

To identify drugs (pf-pdDrugs) enriched with genes carrying pf-pdSNPs, we integrated four major literature-backed drug–gene databases (PharmGKB [56], Chembl [57], Comparative Toxicogenomics Database (CTD) [58], and Drug Bank [59]) to obtain drug–gene information from 10,902 unique drugs/compounds (Supplementary Fig. 1). The identification of genes in the pathway of the drugs is based on scientific, peer-reviewed literature evidence curated by these four databases. An example of a few genes documented to be associated with the drug statin is shown in Supplementary Table 3. Through the integration of the FDA approved drugs/compounds with these four drug–gene databases, gene information for 1512 FDA-approved drugs were obtained for this study. These drugs were then evaluated for enrichment of pf-pdGenes in their drug pathway as follows.

The population-pair specific enrichment Z-score of each drug was obtained by performing 10,000 sampling iterations involving random genes that are of a similar size range to the genes in that drug pathway. For each drug random sampling set, the proportion of pf-pdGenes found in the random sample was recorded. These 10,000 iterations would yield an empirical distribution specific to the drug and population-pair in question. The population-pair Z-scores of the drug will signify enrichment of the observed proportion of pf-pdGenes in the drug pathway relative to the empirical distribution generated in the random sampling, which can be calculated with the following equation.

$$Z\;score = \frac{{p_{{\mathrm{pf}} - {\mathrm{pdGenes}}} - \bar P_{{\mathrm{pf}} - {\mathrm{pdGenes}}}}}{{eSD_{{\mathrm{pf}} - {\mathrm{pdGenes}}}}},$$

where:

p pf-pdGenes = observed proportion of pf-pdGenes in the drug pathway for the specific population-pair

\({\bar{\mathrm{P}}}_{{\mathrm{pf}} - {\mathrm{pdGenes}}}\) = mean proportion of pf-pdGenes in the empirical distribution of the respective drug for the specific population-pair

eSD pf-pdGenes = standard deviation of empirical distribution of the respective drug for the specific population-pair

A drug that is significantly enriched with pf-pdGenes for a population-pair has a Z-score of >2.58 or is within the top 0.5 percentile of the respective empirical distribution. On the other hand, a drug that is not enriched in pf-pdGenes for that population-pair has a low enrichment Z-score for that population-pair. The availability of SNP and gene information from 14 different populations in our database resulted in each drug having up to 91 population-pair Z-scores. This enabled us to identify the specific pair of populations with an enrichment of pf-pdGenes in a particular drug pathway.

Evaluating the performance of the algorithm

The performance of the algorithm was evaluated in terms of whether it can appropriately detect population differentiation patterns in drugs that have been previously reported [7] to show population differences in response. This could provide an initial gauge on the potential capability and real-world relevance of our approach. Only drugs previously reported to be associated with population differences in response and with available population-pair enrichment Z-score were included in this literature-based evaluation. Supplementary Table 4 details the publications of the drugs that were reported to be population differentiated. It includes information about the actual population pairs reported and the population pair from our database that was most similar to the one reported. The accuracy of our method was evaluated by comparing the concordance between the drugs that pass the Z-score threshold from our algorithm with the drugs found from the literature. The maximum and minimum F ST scores specific to the reported drug for each population pair were also determined.

Classification and ranking of population differentiated drugs by drug classes/disease conditions

The Anatomical Therapeutic Chemical (ATC) Classification System by the World Health Organization (WHO) (http://www.whocc.no/) or manual curation was employed to categorize drugs found to be population differentiated by our algorithm into their respective 414 drug classes, while the CTD database was used to categorize these drugs based on their associated 2783 disease conditions. Only drug classes (n = 134) or disease groups (n = 1775) containing three or more drug members were included in our analyses. We then ranked the drug classes/diseases groups by multiplying the mode of all population pair Z-scores (most common Z-score) for each drug by the observed number of population pairs exhibiting significant population differentiation for that drug. This value for each drug in the drug class is, then summed and normalized against the number of constituent drugs within each drug class/disease group. After obtaining the top 30 drug classes/disease groups, information about each individual drug’s number of enriched population pairs was extracted and presented in graphical form.

Analyses of pharmacogenetic labels and ADR reports

We further assessed whether drugs with existing pharmacogenetic (PGx) warning labels or ADR reports were associated with the number of significantly enriched population pairs. Drugs with PGx warning labels were obtained from PharmGKB [56], which contains information issued by the US Food and Drug Administration (FDA), European Medicines Agency (EMA), Japan’s Pharmaceuticals and Medical Devices Agency (PMDA), and Health Canada (Santé Canada) (HCSC). The four categories of drugs labels or PharmGKB ‘PGx Levels’ used in this study were ‘Genetic testing required’, ‘Genetic testing recommended’, ‘Actionable PGx’, and ‘Informative PGX’. The definitions of these labels can be found at https://www.pharmgkb.org/page/drugLabelLegend.

The ADR reports summary was obtained from publicly available quarterly reports of the US FAERS database (2007–2009) (https://www.fda.gov/Drugs/GuidanceComplianceRegulatoryInformation/Surveillance/AdverseDrugEffects/) and from Pharmacovigilance Branch at the Singapore Health Science Authority (HSA) (http://www.hsa.gov.sg/content/hsa/en/Health_Products_Regulation/Safety_Information_and_Product_Recalls/Report_Adverse_Events_related_to_health_products.html) (2007–2009). Both databases are based on voluntary reporting of suspected ADR, which can be directly submitted by healthcare professionals and consumers or through mandatory reporting from drug manufacturers.

Cumulative distribution function (CDF) plots of the number of significant population-pairs against the fraction of drugs with PGx labels/adverse drug reactions were constructed in R. Bar plots of the average number of population pairs showing significant genomic differentiation across the different PGx labelled drug groups/ADR groups were also constructed in R and statistical significance assessed using the two-sided Student's t-test. There was a total of 373 drugs with no ADR reports, 978 drugs with ADR reports, 966 drugs with ADR reports from US, and 391 drugs with reports from Singapore. For the PGx labels, there were 1206 drugs with no PGx labels, 124 drugs with ‘Genetic Testing Recommended’, and 22 drugs with ‘Genetic Testing Required’. All groups had a similar variance. The incidence rates and population pair profiles of the top 20 drugs with the highest ADR rates in Singapore and in the USA were also compared.