M. tuberculosis strain dataset

The selected set of M. tuberculosis strains are representative of various antimicrobial resistance phenotypes, geographic isolation sites, and genetic diversity. References for the published and unpublished data sets are provided (Supplementary Discussion, Supplementary Data 7). The sequencing data for the TB Antibiotic Resistance Catalog (TB-ARC) projects (Supplementary Data 7) were generated at the Broad institute. Additional information for each of these unpublished projects can be found at the Broad Institute website. All data were acquired from the PATRIC database.

M. tuberculosis pan-genome construction and QA/QC

We employed QA/QC of the constructed 1595 pan-genome by initially filtering out outlier strains. The initial selection of 1603 strains was reduced to 1595 upon review of both the cluster size distribution and the number of unique clusters across the set of all strains (Supplementary Fig. 3a, b). We found only four strains in the PATRIC database that had either a very low (<2000) or high number (>5500) of clusters. The final selection of 1595 strains has a cluster size distribution between 3900 and 4400, and a reasonable unique cluster distribution where the number of unique clusters did not exceed 160 (note that unique is defined here as being in only one strain). The pan-genome of all 1595 strains was constructed by clustering protein sequences based on their sequence homology using the CD-hit package (v4.6). CD-hit clusters protein sequences based on their sequence identity33. CD-hit clustering was performed with 0.8 threshold for sequence identity and a word length of 5.

Pan-genome core and unique cutoff determination

We determined the core and unique pan-genome through sensitivity analysis by plotting the change in core and unique cutoff values by the change in percentage. The cutoffs were chosen to be at the point where the second derivative of the curve is the largest. The curve represents the change in pan-genome core percentage to changes in the number of strains a gene must be found in to be defined as core (Supplementary Fig. 3c, d).

Phylogenetic tree and categorization of lineages

We created a robust phylogenetic tree of the 1595 strains using SNPs at the core genome. Specifically, we chose a set of 2803 core genes that appeared in at least 1593 strains, included the H37Rv reference strain (83332.12). We used needle34 to align sequences within the 2803 pan-genome clusters (a cluster is representative of a particular loci) to the H37Rv reference allele. We built a binary SNP matrix using all of the SNPs identified from the 2803 genes (21,206 SNPs in total), and then estimated a maximum-likelihood phylogeny using RaXML version 835. The tree was visualized using iTOL36.

We used an existing SNP typing scheme11 for categorizing the strains into lineages and sublineages.

Specifically, we used a total of 141 SNPs for identifying lineages and sublineages for our 1595 TB strains. These SNPs were previously determined to be sufficient for categorizing lineages11. Of these SNPs, 61 were in nonsynonymous sites and the other 70 were SNPs found in drug resistance genes. These 141 SNPs comprised a total of 74 genes. The presence of SNPs were then used to categorize the strains into the defined lineages. Of the 1595 strains, 1366 strains were categorized and 229 were uncategorized. The remaining 229 strains were categorized according to their proximity to strains with lineage-defining SNPs, with proximity defined according to our core genome SNP phylogeny. We have included the frequency of lineage variants in order to help readers discern between epistatic alleles and those in tight linkage (Supplementary Data 8). Implicated co-occurring alleles that span different lineages are unlikely to be in tight linkage (i.e., hitchhikers).

For the numeric subscripts shown in Fig. 2—describing the number of unique sublineages for each allele–allele pair—were determined as the maximum number of unique sublineages at a single branch amongst all lineage/sublineage branches.. For example, an allele co-occurrence which has strains in both lineage 1.1 and 1.2 counts as two sublineages. An allele co-occurrence which has strains in both lineage 1.1, 1.1.2, 1.1.3, 1.1.3.1, 1.1.3.2, and 1.1.3.3 counts as three sublineages (1.1.3.1, 1.1.3.2, and 1.1.3.3). If an allele co-occurrence has strains in sublineages 4.1, 4.1.2, and 4.1.2.1, then only one sublineage is counted, since the strains can be traced through a single lineage (4.1 to 4.1.2 to 4.1.2.1).

Pan-genome-wide correlation analysis

We performed pairwise association analysis for all alleles in the pan-genome and for the 13 antibiotics to identify key AMR genes. We utilized MI, chi-squared tests, and ANOVA F-tests. MI has many statistical benefits, which include being a nonparametric method that can quantify nonlinear relationships, unlike Pearson’s correlation which measures a linear relationship. MI has proven to be a natural and powerful means to equitably quantify statistical associations in large datasets37. The pairwise MI was calculated for each column vector in the unique variant pan-genome with each drug susceptibility vector (Supplementary Fig. 3g). The discrete entropy calculations were carried out using the Non-Parametric Entropy Estimation Toolbox (NPEET, https://github.com/gregversteeg/NPEET). Since both vectors are binary, the naive implementation of discrete entropy estimation used in NPEET is sufficient. The top 40 MI associations for 11 drugs are recorded (Supplementary Data 1).

Associations were similarly calculated with chi-squared and ANOVA tests. P values were adjusted using the Bonferroni multiple-hypothesis testing correction. Theses statistical tests and corrections were implemented using the python package, statsmodels38. The top 40 associations determined by chi-squared and ANOVA F-test were recorded for 10 AMR classifications (Supplementary Data 1).

Allele feature selection through support vector machines

The support vector machine (SVM) attempts to account for all variants together by learning a multidimensional hyperplane that best separates the susceptible and resistant strains. The resulting hyperplane is a function of all exact-variant vectors in the pan-genome. Since the goal is not to predict resistance with high accuracy, but to instead extract key insights from the data, we take a feature selection approach by gearing the linear SVM with an L1-norm penalty and stochastic gradient descent optimization algorithm using the scikit-learn package. The L1-norm enforces sparsity in the decision function, which is ideal for feature selection. The stochastic gradient descent algorithm, in conjunction with the L1-norm, returns a different set of significant features each run. Since the chosen SVM does not reach the same solution, we look at the ensemble of 200 SVM feature selection simulations. Furthermore, we performed bootstrapping by randomly selecting a subpopulation representing 80% of the training data for each SVM simulation.

Prior to simulation, we took out the primary resistance-conferring gene of an antibiotic from the machine learning analysis of other antibiotics in order to amplify the signal of other genes—a preprocessing step previously utilized in AMR gene identification studies5 (Supplementary Table 3). For example, all katG alleles were only accounted for as features in the machine learning analysis for isoniazid. Furthermore, we removed all mobile element proteins, PE/PPE/PE-PGRS proteins, transposases, and hypothetical proteins from consideration in the machine learning analysis due to primarily appearing in the accessory and unique pan-genome of M. tuberculosis which may confound the results. Finally, we balanced the class weight in the SVM algorithm in order to account for the imbalance of resistant and susceptible strains seen for each drug in our dataset.

Features were selected from the SVM based on a threshold value. The value was determined through tenfold cross-validation where the threshold value was optimized through grid search (Supplementary Table 3). The use of bootstrapping in the machine learning algorithm may account for biased subpopulations in the data, which often confounds GWAS analysis for M. tuberculosis29,30.

Filtering of gene sets for epistatic analysis

Leveraging machine learning towards identification of genetic interactions, we constructed a correlation matrix of allele weights across the ensemble of randomized SVM hyperplanes for each antibiotic (Supplementary File 3). We limited our machine learning analysis to AMR classifications that achieved an average AUC (i.e., average area under ensemble of receiver–operator curves) greater than 0.80 (Supplementary Fig. 5). We selected the top 100 gene–gene correlations that include genes in the top 25 ranked SVM alleles for each antibiotic. We limited the correlations to in the top 25 ranked alleles in order to avoid the case when low weighted alleles appear sparsely with other low weighted alleles, which lead to significant correlations. The resulting set of gene–gene pairs were then analyzed using a logistic regression model in order to determine statistically significant interactions. The filtering of potential gene–gene pairs prior to classical quantitative epistasis analysis addresses the problem of combinatorial explosion of pairwise interaction terms in conventional techniques.

Epistatic analysis with logistic regression models

We utilized logistic regression to identify significant epistatic interactions. A logistic regression model was built for each potential gene–gene pair previously determined by the ensemble SVM correlation analysis. The variables of the gene–gene logistic regression model were composed of both alleles and allele–allele interaction variables:

$$Y \sim \beta _o + \mathop {\sum}

olimits_i {\beta _ia_i} + \mathop {\sum}

olimits_j {\beta _{I + j}b_j} + {\sum} {\mathop {\sum}

olimits_{i,j} {\beta _{I + J + k}a_ib_j} }{,}$$ (1)

where i and j index the alleles for genes a and b, respectively, I and J are the total number of alleles for genes a and b, respectively, Y is the binary AMR phenotype, k indexes each unique interaction term, a i b j , and β is the regression coefficient corresponding to each predictor. The interaction terms were limited to cases in which the two alleles co-occur in at least one strain. The interaction variable was the dot product of the two allele presence–absence vectors. In order to account for collinearity in the variables, we applied the following three filtering criteria (note that a i is interchangeable with b j ):

1. If the allele a i presence–absence is the same as the interaction a i b j presence–absence, remove the a i b j interaction variable from the logistic regression model 2. If the allele a i presence–absence is equal to allele b j presence–absence, remove both variables as well as the allele–allele interaction variable, a i b j . 3. If the allele a i presence–absence is equal to the sum of all interaction variables involving that allele (i.e., a i b j for all j), remove the allele variable, but keep the interaction variables.

We filtered for allele–allele interactions with P value < 0.05 after Benjamini–Hochberg multiple-testing corrections. The resulting set of gene–gene interactions encompassing significant allele–allele interactions were portrayed through allele co-occurrence tables (Supplementary Data 5). Logistic regression and statistical tests were implemented using the python package statsmodels38.

Calculation of log odds ratio in allele co-occurrence tables

The odds ratio of each cell in the allele co-occurrence tables was determined as follows:

$${{\mathrm{OR} = \frac{{\mathrm{BR} \ast {\mathrm{NR}}}}{{\mathrm{BS} \ast {\mathrm{NS}}}}}}{,}$$ (2)

where BR is the number of strains that have both alleles and are resistant to the specified antibiotic, NR is the number of strains that do not have both alleles and are resistant to the specified antibiotic, BS is the of strains that have both alleles and are susceptible to the specified antibiotic, NS is the number of strains that do not have both alleles and are susceptible to the specified antibiotic. For a single allele, the odds ratio was calculated the same way with each variable representing the single allele case. If any of the four values (BR, BS, NR, and NS) were zero, 0.5 was added to each value in order to ensure a value when computing the logarithm of the odds ratio.

Missing alleles in allele co-occurrence tables counts

The lack of specific alleles shown in the allele co-occurrence table is due to strains missing some alleles. For example, embB allele 5 is found in 147 strains but only 144 strains have both embB allele 5 and ubiA allele 2 (Fig. 2). Specifically, the three strains missing the three ubiA alleles are the following PATRIC strains as described by their genome identifiers: 1423432.3, 1448794.3, and 1448824.3. Searching on the PATRIC database for either ubiA or Rv3806c results in 0 hits for these organisms. While it is unlikely that the strain is missing this allele, these limitations are not due to the analysis but instead results from the selection of strains. These events happen quite rarely and were accounted for in the partitioning of pan-genome portions. The large sample size was able to recapitulate the key genes due to large sample size.

Structural protein analysis of identified AMR genes

For identified AMR genes, the ssbio software was used to gain gene-specific, protein sequence and structure based information about residue-level changes (SNPs and deletions) present in the M. tuberculosis alleles19. Each AMR gene was mapped to a reference protein sequence file obtained from UniProt39 and sequence-based metadata identifying protein-specific features (e.g., active sites, secondary structures, and mutations in studied wild-type strains) was used to determine the occurrence of allele-specific AMR mutations within the gene feature set (Supplementary Data 6). When available, AMR genes were additionally mapped to experimentally obtained protein structures from the RCSB Protein Data Bank or to homology structures generated using the Iterative Threading ASSEmbly Refinement (I-TASSER) platform40,41. To help elucidate the mechanistic effects of AMR mutations, both AMR mutations and the residue-level feature set were mapped to these structures and visualized using the NGLview Jupyter notebook plugin42. The structural information was utilized to calculate distances between each mutation and annotated protein feature (Supplementary Data 6).

Code availability

The computational platform is provided as a github code repository.