To our knowledge, this study represents the most comprehensive dataset on MTBC lineages created by systematically assembling genotyping data from studies that used representative sampling techniques. The data show geographic variation in MTBC genotypes, which is consistent with previously published studies that used convenience samples and much smaller datasets. We find some evidence for clinical variation between genotypes, although we also show significant variation between studies, which highlights the need for additional data.

Global variation in bacterial strains that cause TB disease

The results presented in this study are consistent with previously published maps showing that MTBC strains that evolved more recently in human history (lineage 2, lineage 3, and lineage 4 strains) tend to be more widely distributed around the world [22, 35, 47, 48]. We also show that lineage 1, lineage 2, and lineage 3 are more prevalent in Europe and in North and South America than shown in previously published maps [35, 47, 48]. Moreover, we show that lineage 3 strains may be increasing in prevalence in Europe, while lineage 1 strains may be decreasing in prevalence in West Asia. These patterns in genotype distribution likely reflect both historical and recent movement of strains with people from East Asia and the Indian subcontinent to Europe and the American continent. The dominance of lineage 4 globally, and in particular in South American countries, also supports the hypothesis that European colonialists aided in the dispersion of this lineage in the mid-sixteenth to nineteenth centuries [32, 48, 49]. If the first inhabitants of the American continent brought early forms of lineage 2 strains with them when they migrated from north-eastern Asia, these strains may have been displaced after the arrival of strains carried by European colonialists.

Human migration is likely not the only determinant of MTBC genotype distribution. Lineages 5 and 6 are prevalent only in West Africa [35, 47, 48]. The reasons for this geographic restriction are largely unknown but may be related to clinical characteristics of the patients infected with these strains. Patients infected with lineage 6 are more likely than patients infected with other strains to be older, HIV-infected, and severely malnourished [50]. In addition, we showed that lineage 5 and 6 strains may be less likely to cause transmission chains than lineage 4 strains and that these findings were more consistent in Europe and the Americas than in Africa. This may reflect biological differences and/or social mixing patterns that prevent these strains from spreading through non-West African populations. We also found that lineage 3 strains were associated with reduced risk of transmission chains in Europe and the Americas, which is consistent with the findings from a household contact study in Montreal [51]. In contrast, we found that Beijing family strains may be more likely to cause transmission chains, which could reflect the ability of Beijing strains to spread quickly through human populations [46, 52, 53]. These findings are not consistent with previous household contact studies that showed no differences in transmission between lineages [46, 54, 55]. Thus, further studies are required to confirm our findings.

Several studies included in our analysis showed that treatment failure was associated with lineage 2 Beijing family strains [43, 44]. Beijing family strains are also associated with drug resistance [56], which has been reviewed previously [12, 22, 23]. Additionally, lineage 1 strains have been associated with more rapid response to treatment in drug-susceptible TB cases in the USA [57]. Thus, there is evidence for a relationship between bacterial genotype and treatment outcome, at least in certain populations or contexts. Future studies that carefully control for potential confounders that may impact treatment failure are required to confirm these findings. This type of information could be particularly important to clinicians if it could inform the development of novel diagnostic tools that test for bacterial genotypes associated with poor response to treatment and development of drug resistance.

Variation between studies and implications for variation in MTBC genotypes

There was variation in the sampling methods and representativeness of the studies included in this systematic review. The majority of studies were representative of much smaller geographic locations than the national level, and despite the large number of bacterial isolates included in this study, they represented only a small fraction of the total estimated TB cases. While the goal of this study was to summarize the MTBC genotyping data available, not to make nationally representative estimates, it is important to note that this variation was not distributed evenly throughout the world. There was less information available about MTBC genotype distribution in South America and Sub-Saharan Africa than in other regions, and the data in Central and Eastern Asia represented a smaller proportion of all estimated TB cases than elsewhere. Thus, the genetic diversity shown in the map in Fig. 3 for these regions is likely less representative of the underlying populations.

Another source of variation that may affect representativeness is whether studies were biased towards including either rural or urban populations. MTBC genetic diversity is likely greater in patients from urban populations than in patients from rural areas, since urban areas experience higher rates of travel and migration. Most studies included in this analysis did not report the urban/rural composition of their sample, and the bias towards one or the other would likely vary depending on study location. For example, the majority of the studies included in our systematic review used samples collected from public hospitals or reference laboratories. Therefore, in countries such as India, where people in urban areas may be more likely to seek care from private health clinics [58], the urban population may be underrepresented and we may have underestimated genetic diversity. On the other hand, in countries such as Uganda, where the rural population has limited access to public health facilities [59], the rural population may be underrepresented and we may have overestimated genetic diversity. This highlights the importance of data from prevalence surveys that use active surveillance techniques to reach a broader subset of the population.

We also identified substantial heterogeneity between studies in the meta-analysis of genetic clustering associated with genotypes. One source of this heterogeneity is likely methodological differences between the studies, such as genotyping method, sampling method, and study duration, which have been shown to affect genetic clustering [27, 28]. For example, duration of sampling ranged from 2 months to 9 years, and genotyping methods ranged from the use of either spoligotyping or MLVA typing to the use of both methods (Additional file 1: Table S4). Studies that used shorter sampling durations may have missed transmission chains and underestimated clustering, while studies that used spoligotyping only may have overestimated clustering [60]. An additional source of heterogeneity may be confounders that affect genetic clustering and transmission, such as social mixing, immigration, age structure, comorbidities, and underlying TB incidence [27, 28]. These confounders likely also varied between studies but were often not reported. For example, only 14 of the studies reported HIV prevalence (range 0 to 91%), only 6 reported the proportion of immigrants (range 0 to 78%), and only 14 reported the mean age of patients included in the sample (range 25 to 50 years) (Additional file 1: Table S4). If social mixing was high in each of the studies, this could have led us to overestimate the impact of genotype on transmission chains, while if migration was high, this could have led us to underestimate the presence of transmission chains.
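As an illustration of how between-study heterogeneity of this kind is commonly quantified, the sketch below computes Cochran's Q and the I² statistic from study-level log odds ratios and their standard errors, using inverse-variance weights. This is a minimal sketch under stated assumptions: the text above does not specify which heterogeneity statistic was used, the function name `heterogeneity` is hypothetical, and the input values are placeholders rather than data from the included studies.

```python
def heterogeneity(log_effects, std_errors):
    """Return (pooled effect, Cochran's Q, I^2 in percent).

    log_effects: study-level effects on the log scale (e.g. log odds ratios).
    std_errors:  the corresponding standard errors.
    """
    # Inverse-variance weights: more precise studies count for more.
    weights = [1.0 / se ** 2 for se in std_errors]

    # Fixed-effect pooled estimate (weighted mean of the study effects).
    pooled = sum(w * y for w, y in zip(weights, log_effects)) / sum(weights)

    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, log_effects))

    # I^2: share of total variation attributed to heterogeneity
    # rather than chance, truncated at zero.
    df = len(log_effects) - 1
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return pooled, q, i2


# Placeholder inputs for illustration only (not values from this review).
pooled, q, i2 = heterogeneity([0.0, 2.0], [0.5, 0.5])
print(f"pooled={pooled:.2f}, Q={q:.2f}, I2={i2:.1f}%")
```

A high I² indicates that most of the observed variation in effect estimates reflects real differences between studies, such as the methodological differences and unreported confounders described above, rather than sampling error alone.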

Study limitations

A limitation of this study is that we grouped strains into seven lineages, which masks within-lineage variation. Distinct sub-lineages of the Beijing family are associated with differences in transmissibility in human populations [61, 62], and lineage 4 contains both geographically widespread and restricted sub-lineages [49]. However, we consider this the most appropriate approach because it allowed us to (1) include a broad range of studies, including those that did not report sub-lineages, and (2) synthesize studies that used WGS- or PCR-based typing together with studies that used methods more common in resource-limited settings, such as spoligotyping and MLVA typing.

Another limitation is that we did not include data from WGS databases. A challenge of incorporating WGS data is identifying the study metadata linked with genomes, such as sampling methods and demographic characteristics of patients. In addition, many of the available WGS data are well suited to phylogeographic studies and to examining the presence of specific mutations [32, 49, 56], but are less representative of the populations from which they were isolated. These data are often from outbreaks or studies of specific sub-populations, which we excluded from this analysis. As WGS data linked with metadata become more available (through prevalence surveys [63] and endeavors such as ReSeqTB), including these data would be an important extension of our study. Our study supports these future studies by illustrating the importance of using genome sequences to determine phylogenetic lineages or sub-lineages. The dataset we have created could be used to fill geographic gaps in future WGS-based maps, particularly in regions where WGS technology is unavailable, and to verify results from convenience-based samples.