Collection of dataset and gene clustering with OrthoMCL

To collect a comprehensive dataset, we considered the largest bacterial genome repertoire at NCBI Genebank34 (ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/) and retrieved 109 completely sequenced Mycobacterium proteomes (Supplementary Table S1). Proteins sharing minimum 30% of amino acid identity over at least 50% of their length were clustered together with the OrthoMCL algorithm using default parameters35. OrthoMCL is an all-against-all BLAST search program that cluster homologous sequences (orthologs and recent paralogs) based on Markov gene clustering method35. Inflation index is an important parameter that determines the size of the cluster35. A lower inflation value (stringent clustering i.e. fewer clusters with more proteins) may place true orthologs in separate clusters, while higher inflation value (lenient clustering i.e. more clusters with fewer proteins) may cluster sequences of different functions together35. Here, we considered inflation index value of 1.5 which was suggested to balance sensitivity and specificity while including the maximum number of sequences in the clusters35. OrthoMCL resulted in 17,215 families of orthologous proteins (Cluster of Orthologous Groups or COGs) of which 523 groups of short proteins (less than 50 amino acids) and 252 probable species-specific groups containing members from same species (paralogs) were discarded. Remaining 16,440 groups were considered for further analysis.

Ortholog identification and tree construction

In order to identify probable orthologs outside of Mycobacterium genus, the longest member of each remaining COG (16,440) was searched against the NCBI non-redundant (NR) protein database with the BlastP (v-2.3.0+) algorithm. For each COG, Blast hits were complemented with other members of the group and filtered with a combination of three cutoffs, minimum e-value 1 × 10−6, sequence identity 50% and query coverage 70%. COGs were then classified into two categories based upon the distribution of their significant blast hits, (a) groups showing hits within Mycobacterium genus and (b) groups showing hits outside of Mycobacterium genus. If a group was found to have multiple significant hits in same species (indicated by same taxonomic identifier), we retained the hit with the highest degree of identity. Here we considered the groups (total 9,014 groups, details in Supplementary File 2) showing at least one significant hit outside of Mycobacterium genus and at least 3 significant hits altogether. Amino acid sequences of significant hits were retrieved from NCBI non-redundant (NR) database using Blastdbcmd utility and their full taxonomic lineages were obtained from NCBI taxonomy database using ETE toolkit36 with the help of their NCBI Gene Identifier (GI) numbers. For each group, multiple sequence alignment was generated by Muscle (v-3.6) multiple sequence alignment tool under default settings37. Phylogenetic trees were constructed in two phases. First, we constructed maximum likelihood trees from each such alignments using double-precision version of FastTree38 algorithm with parameters -spr 4, -mlacc 2, –slownni for slower and more exhaustive search and –gamma option which accounts for the uncertainties in rates of evolution at different sites38. For each group, the best fitting amino acid substitution model optimized for maximum likelihood trees was detected by ProtTest339 and implemented in FastTree. Currently, FastTree supports three amino acid substation models JTT, WAG, and LG38. When ProtTest suggested (for 183 of 9104 groups) any other model we used the best of these three models. For the initial screening here we used FastTree because FastTree was shown to be 100–1,000 times faster than many of other standard maximum likelihood-based phylogenetic tree construction algorithms such as PhyML 3.0 or RAxML38,40. However, the maximum likelihood topology of FastTree was suggested to be less accurate than that of RAxML which uses more intensive tree search algorithm38. Therefore, to reduce false detection of HGTs all the groups where we found signals for probable HGTs (next section) in FastTree phylogeny were again subjected to phylogenetic tree construction using RAxML40 with 100 bootstrap replicates and option for automatic detection of the best substitution model parameters. Finally, HGTs were inferred in RAxML phylogeny when we found similar signals for HGT as in FastTree phylogeny. All these phylogenetic trees have been deposited at http://www.mediterranee-infection.com/article.php?laref=981.

Detection of HGT by tree-topology analysis

For inferring HGT from tree-topology, we followed the principle as suggested in the previous literature. In a well-supported phylogenetic tree if any gene exhibits higher sequence homology with some distant species rather than with its closest relatives then it is considered as a probable case of HGT1,2,3,4,6,7,8,29,30,31. The pattern that we considered as probable HGT with Mycobacterium species is described in Fig. 1. The pattern has basically three components a receiver clade (target Mycobacterium groups), a donor clade (the sister group of the receiver clade) and an external clade (the sister group of the donor clade). Mycobacterium species were considered to receive genes from other bacterial species, when one or more Mycobacterium genes (receiver clade) branched within some unrelated bacteria (with overall minimum bootstrap value 70%) such that there is no Mycobacterium gene in up to five neighboring sister clades and there is no member from Corynebacterium and Nocardia genera (closest relatives of Mycobacterium41) in the donor clade. To detect gene exchanges with non-bacterial species, we assumed that acquired genes must be present in Mycobacterium species but will be absent from other closely related bacterial genomes. Here, we considered similar patterns, however, no bacterial homolog was allowed either in the donor or in the external clade (Fig. 1).

Figure 1 Schematic representation of the phylogenetic patterns that we searched for inferring probable HGTs. Each topology has three basic components: receiver clade (in green), donor clade (in red) and external clade (in blue). The species those we considered for identifying different types of gene transfer are indicated in different panels. (A) Gene transfer from other bacteria to Mycobacterium, (B) gene transfer from archaea to Mycobacterium, (C) gene transfer from virus to Mycobacterium and (D) gene transfer from eukaryotes to Mycobacterium. Full size image

Inferring donor and receiver clades

Although, our aim was to identify probable HGTs at the species level, however, in most of the cases we found that the donor and receiver groups consisted of genes from multiple genomes. To identify the point of acquisition of foreign genes we reconstructed their ancestral states along the Mycobacterium phylogeny using Count42, a bioinformatic suite for evolutionary analysis. Given the phyletic distribution of any character along a phylogeny, Count can reconstruct its evolutionary history by various statistical models. Here, the ancestral states were reconstructed through unordered states assumption of Wagner Parsimony model using various gain penalties (1, 3, 5 etc.). For Mycobacterium species tree needed for this analysis, we first listed all the Mycobacterium genomes involved in gene transfer (receiver clade of trees where we found HGT). A representative phylogenetic tree (Supplementary Fig. S1) was constructed for these genomes based on the alignment of 16S rRNA gene sequences retrieved from SILVA (release 128)43, a high-quality ribosomal RNA database. Reliable 16S rRNA gene sequences could not be found for few genomes and were not included in the tree. For each probable HGT, a gene presence/absence matrix is generated by mapping the species in receiver groups with the species in the tree and used as the character matrix for ancestral state reconstruction. To identify the evolutionary lineage of donor groups, we searched the lowest taxonomic level (lowest common ancestor) of the members of each donor clade. Donor groups were inferred according to the lowest taxonomic rank (at genus/species/family, etc. level) of the common ancestor. Lowest taxonomic rank of the species in donor groups was fetched from ETE toolkit36 using their taxonomic identifier.

Detecting HGT by T-REX algorithm

T-REX is a suite of phylogenetic tools dedicated for several analyses including in-silico detection of HGT32,33. Given a test and a reference tree, T-REX calculates their proximity by several distance-based measures and predicts minimum-cost scenario HGTs by the progressive reconciliation of those trees. T-REX was optimized by taking into account the evolutionary events including gene duplication and deletion and was shown to be faster and more accurate than most of the other currently available tree discordance methods like LatTrans and RIATA-HGT32. The trees generated by the FastTree38 algorithm were used as gene trees for this analysis. Species trees were constructed for each gene tree separately. For this, we retrieved the taxonomic ids of all the members of each alignment and fetched their common tree from NCBI taxonomy browser using those ids. Each pair of species and gene tree was then subjected to the T-REX algorithm for inferring probable HGTs. We discarded the “Trivial” (low confidence) HGTs from the prediction and considered the cases where one or more Mycobacterium species were listed in the acceptor field with no Mycobacterium in the donor field.

Functional annotations of horizontally transferred genes

Functional annotation of candidate HGTs was done using the InterPro (v-66) functional annotation tool44. InterPro provides comprehensive information about protein families, domains, and functional sites by integrating signatures from several protein annotation servers including Pfam, PRINTS, PROSITE, SMART, ProDom, SUPERFAMILY, PANTHER, CATH-Gene3D, TIGRFAMs and HAMAP44. COGs are clusters of homologous genes and supposed to consist of proteins sharing the same function. Therefore, we choose one representative protein (longest member) from each COG and used it for functional prediction. However, we first tested the utility of our representative protein in retrieving the functional annotation of groups affected by HGT. We retrieved all members of receiver groups for more than 100 families and compared their functional annotations with the functional annotation of the corresponding representative proteins. As expected, we noticed that all the members of each receiver group were composed of similar types of functional domains and were reflected in the composition of the representative proteins. By this way, we could annotate more than 68% of the groups involved in gene transfer with their Gene Ontology (GO) terms and Pfam domains. To get a relative idea, we compared the distribution of Pfam domains and three ontologies (biological process, cellular component, and molecular functions) among the candidate HGTs in reference to their distribution in the whole proteome of 15 Mycobacterium species (Supplementary Table S2). Functional annotations of all (76,243) proteins of these 15 Mycobacterium species were also retrieved from InterPro44. Functional enrichment analysis was performed using Fisher’s exact test with the help of 2 × 2 contingency tables and statistical significance was evaluated through P–values. For reliable statistics, functional enrichment analysis was conducted for the GO terms/Pfam domains which appear at least 10 times considering both the dataset (Supplementary File 2). It is noteworthy that since we searched through representative protein, the number of entries in HGT category is based on HGT events rather than genes.

Mapping of horizontally transferred genes on KEGG pathways

Genes involved in HGT were mapped to the KEGG45 (Kyoto Encyclopedia of Genes and Genomes) functional categories using BlastKOALA46 genome annotation tool. BlastKOALA assigns KEGG Orthology functional categories (designated by “K” numbers) based upon homology to precompiled databases. Here, the representative sequence (longest) of each HGT COG was searched against “species prokaryotes” database of BlastKOALA using default settings. Enzyme Commission (EC) numbers and pathway information were retrieved following the functional description of the best “K” number assigned to each input sequence. KEGG metabolic pathway map was generated using iPATH Interactive Pathways Explorer (v-3)47 with the help of ‘K’ numbers.

Antibiotic resistance and Mycobacterium virulence

Candidate HGT proteins were annotated with antibiotic resistance information based upon their sequence similarity with known antibiotic resistance genes. Representative sequences from the candidate HGT COGs were searched against the following databases with BlastP algorithm (i) ARDB-Antibiotic Resistance Genes Database (v-1.1)48, (ii) The Comprehensive Antibiotic Resistance Database (CARD) (v-1.2.1)49, (iii) Antibiotic Resistance Gene-ANNOTation database (v-3)50, and (iv) MEGARes: an Antimicrobial Database for High-Throughput Sequencing (v-1.0.1)51. These databases are repositories of putative antibiotic resistance genes and related information collected from various resources. Currently, these databases contain 7828, 2311, 1808, and 3824 putative antibiotic resistance gene/protein sequences respectively. To identify the sequences with significant similarity we considered cut-off values of 50% identity over half (50%) of the protein length and e-value less than 10−5. HGT proteins were classed according to their putative antibiotic resistance potentiality based upon the functional description of their significant Blast hits. For virulence information we retrieved the protein coding sequences of bacterial virulence factor determinants from the Virulence Factors of Bacterial Pathogens database52 and treated in a similar manner. Currently (last updated 24th October 2017), this database contains sequences of 2,595 experimentally verified and 26,524 known and predicted virulence factors. Here, our query sequences were annotated against the experimentally verified dataset.