With a genome size of ∼580 kb and approximately 480 protein coding regions, Mycoplasma genitalium is one of the smallest known self-replicating organisms and, additionally, has extremely fastidious nutrient requirements. The reduced genomic content of M. genitalium has led researchers to suggest that the molecular assembly contained in this organism may be a close approximation to the minimal set of genes required for bacterial growth. Here, we introduce a systematic approach for the construction and curation of a genome-scale in silico metabolic model for M. genitalium. Key challenges included estimation of biomass composition, handling of enzymes with broad specificities, and the lack of a defined medium. Computational tools were subsequently employed to identify and resolve connectivity gaps in the model as well as growth prediction inconsistencies with gene essentiality experimental data. The curated model, M. genitalium iPS189 (262 reactions, 274 metabolites), is 87% accurate in recapitulating in vivo gene essentiality results for M. genitalium. Approaches and tools described herein provide a roadmap for the automated construction of in silico metabolic models of other organisms.

There is growing interest in elucidating the minimal number of genes needed for life. This challenge is important not just for fundamental but also practical considerations arising from the need to design microorganisms exquisitely tuned for particular applications. The genome of the pathogen Mycoplasma genitalium is believed to be a close approximation to the minimal set of genes required for bacterial growth. In this paper, we constructed a genome-scale metabolic model of M. genitalium that mathematically describes a unified characterization of its biochemical capabilities. The model accounts for 189 of the 482 genes listed in the latest genome annotation. We used computational tools during the process to bridge network gaps in the model and restore consistency with experimental data that determined which gene deletions led to cell death (i.e., are essential). We achieved 87% correct model predictions for essential genes and 89% for non-essential genes. We subsequently used the metabolic model to determine components that must be part of the growth medium. The approaches and tools described here provide a roadmap for the automated metabolic reconstruction of other organisms. This task is becoming increasingly critical as genome sequencing for new organisms is proceeding at an ever-accelerating pace.

Copyright: © 2009 Suthers et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

In this paper, we highlight the development of an in silico model of metabolism of M. genitalium. It was subjected to network connectivity gap detection and reconnection as well as restoration of consistency with in vivo gene essentiality experiments [19] . We subsequently used the model to pinpoint components in the growth medium that are needed for the production of all components of biomass in an effort to eventually eliminate the need for non-defined components such as serum in the growth medium.

The above underlines the importance of investigating the molecular biology of mycoplasma and M. genitalium in particular. However, a major hindrance to M. genitalium research and laboratory diagnosis of infection has been their cultivation in vitro. While defined media are present for some mycoplasmas [28] – [30] , researchers have often had to resort to complex media to cultivate most mycoplasmas, including M. genitalium. M. genitalium, and many other mycoplasmas are cultured in vitro in SP-4 medium. This extremely rich medium contains several undefined additives including peptones, yeast hydrolysate, yeast extract and 17% fetal bovine serum [31] . The use of complex undefined growth media has interfered with the molecular definition of mycoplasma metabolic pathways, genetic analyses, estimation of growth requirements, characterization of auxotrophic mutants and examining the nutritional control of bacterial pathogenecity.

Mycoplasma genitalium is not only the closest known approximation of a minimal cell but also an important sexually transmitted human pathogen. It is a cause of nongonococcal urethritis in men and is associated with genital tract inflammatory diseases in women, including endometritis, cervicitis, pelvic inflammatory disease, and tubal factor infertility (for a recent review see [23] ). Additionally, evidence suggests that M. genitalium infection increases the risk of contracting HIV-1 [24] – [26] . Mycoplasmas, the generic name for the bacteria that comprise the Mollicutes taxon, evolved from the low G+C Gram positive bacteria through a process of massive genome reduction [27] . Their salient characteristics in addition to small genomes are a lack of a cell wall, and an almost complete inability to synthesize the building blocks of DNA, RNA, proteins, and cell membranes.

M. genitalium has received considerable attention as it is the smallest organism that can be grown in pure culture, having a genome size of ∼580 kb and approximately 480 protein coding regions [18] , [19] . An examination of its genome content revealed limited metabolic capabilities [20] , leading researchers to suggest it may be a close approximation to the minimal set of genes required for bacterial growth [19] , [21] . Several researchers have carried out genomic and proteomic analysis of M. genitalium to quantify this minimal set. For example, Mushegian and Koonin have carried out a detailed comparison of M. genitalium and H. influenzae proteins to derive a set of 256 genes that they suggested are necessary for viability [22] . Further, genomic analyses of these species revealed that Mycoplasma genes encode for several catabolic and metabolite transport proteins but for only a limited number of anabolic proteins suggesting that Mycoplasma species need to scavenge for the required nutrients from the surrounding environment [20] . More recently, Glass and co-workers performed global transposon mutagenesis and established that 382 of the 482 protein coding sequences are essential genes for this minimal bacterium [19] . These gene sets and essential gene analyses, however, have not been put into context of a complete functional metabolic model.

The four steps used in the present work are 1) identification of biotransformations using automated homology searches 2) assembly of reaction sets into a genome-scale metabolic model 3) automated network connectivity analysis and restoration, and 4) automated evaluation and improvement of model performance when compared to in vivo gene essentiality data.

In response to this flood of present and future genomic information, automated tools such as Pathway Tools [16] and SimPheny (Genomatica) have been developed that, using homology comparisons, allow for the automated generation of draft organism-specific metabolic reconstructions that can subsequently be upgraded into metabolic models. All of these models remain to some extent incomplete as manifested by the presence of unreachable metabolites [17] and some growth inconsistencies between model predictions and observed in vivo behavior [2] . In particular, optimization-based techniques for automatically identifying metabolites disconnected from the rest of metabolism (i.e., GapFind) and hypotheses generators (i.e., GapFill) for reconnecting them have recently been introduced [17] . In order to resolve substrate utilization prediction inconsistencies, Reed et al. [2] introduced a novel approach for identifying what reactions to add to the genome-scale metabolic models of E. coli to correct some of the in silico growth predictions. In our group, we have taken the next step for gene deletion data by attempting to correct all such growth inconsistencies by allowing not just additions but also eliminations of functionalities in the model (i.e., GrowMatch) (Satish Kumar and Maranas, submitted). As outlined in Figure 2 , in this work, we describe the application of these automated methodologies during the Mycoplasma genitalium model construction process (as opposed to an a posteriori mode of deployment).

The numbers of both metabolic models and completed genome sequences have been growing exponentially. Despite the increase in the number of metabolic models, their total remains smaller than 1% of that of sequenced genomes.

Genome-scale metabolic reconstructions are already in place or under development for a growing number of organisms including eukaryotic, prokaryotic and archaeal species [1] . Metabolic pathway reconstructions are increasingly being queried by systems engineering approaches to refine the quality of the resulting metabolic models [2] . Curated metabolic models are indispensable for computationally driving engineering interventions in microbial strains for targeted overproductions [3] – [6] , elucidating the organizing principles of metabolism [7] – [10] and even pinpointing drug targets [11] , [12] . Currently, over 700 genomes have been fully sequenced [13] whereas only about 20 organism-specific genome-scale metabolic models have been constructed [1] , [14] , [15] . Figure 1 pictorially demonstrates, in logarithmic space, the widening gap between organism-specific metabolic models and fully sequenced genomes over the past twelve years. It appears that metabolic model generation can only keep pace with about 1% of the fully sequenced genomes.

Results

The metabolic model reconstruction process is subdivided into four steps. These four steps are summarized in Figure 2 and include (1) identification of biotransformations using homology searches, (2) assembly of reaction sets into a genome-scale metabolic model, (3) network connectivity analysis and restoration, and (4) evaluation and improvement of model performance when compared to in vivo gene essentiality data. Model construction is followed by minimal defined medium component elucidation.

Generation of Computations-Ready Model A metabolic reconstruction has been described as a 2-D annotation of a genome [32]. The generation of a computations-ready model requires the complete assignment of metabolites to reactions, inclusion of exchange reactions, resolution of gene-enzyme associations, and derivation of biomass equations. Here, we largely follow the steps put forth by [33] in the latest E. coli metabolic reconstruction. Assignment of ORFs, reactions & metabolites and their classification. Under this step we first include in the model all reactions associated with enzymes as identified above (ensuring, when necessary, that the metabolites have the appropriate protonation state corresponding to pH 7.2 and are elementally balanced). We then extend the list to include non-gene associated reactions (e.g., account for the substrates and products for the kinase pool). Subsequently, we append spontaneous reactions for chemical transformations that are not enzyme catalyzed and for the transport of molecules that can freely diffuse across the cell membrane and are likely to be present in the natural or laboratory-adopted environment of M. genitalium. A distribution of the ORFs, metabolites, and reactions included in the final model iPS189 are provided (see Tables S1, S2, S3 and Text S1). A breakdown of the ORFs into functional clusters is shown in Figure 3. Here, each ORF is assigned to a cluster of orthologous groups (COGs) ontology [35] along with the percent of the total ORFs for the entire M. genitalium genome that is included in iPS189 for each classification. System boundary identification. The system boundary includes the entire reaction network. Exchange reactions were created that allowed metabolites to enter or leave the extracellular space. Constraints on these exchange reactions were established to account for the simulated growth conditions by restricting the uptake (substrate) metabolites of the system. Conversion of model into computations-ready form. The stoichiometric matrix for iPS189 after step 1 involves 179 rows (i.e., reactions and exchange reactions) and 292 columns (i.e., total metabolites, taking the compartment designations into account). We also established gene-protein-reaction (GPR) associations to link reactions to genes. Only 48 reactions contained more than a single GPR association. Of these, none contained single-protein isozymes, and 47 were catalyzed by just one of the various enzyme complexes (among which was the ribosome and its associated proteins). Only one contained both isozymes and protein complexes. Biomass equation derivation. The derived biomass equation drains all metabolites present in cellular biomass in their appropriate molar biological ratios (see Table 3 for a list of components). Because the dry cell content of M. genitalium has not been experimentally measured, the biomass composition was estimated from the protein composition of closely related mycoplasmas, comparisons with a metabolic model for the related Bacillus subtilis, and a reduction of the biomass equation from the Escherichia coli model iAF1260. These biomass compositions were adjusted to account for components known to be absent in M. genitalium such as the cell wall. The biomass equation explicitly includes charged and uncharged tRNA molecules, rather than only the amino acids themselves. It also contains a non-growth-associated ATP maintenance reaction. The genome and RNA content was adjusted to reflect the low G+C content of M. genitalium. While determining the precise biomass composition has been a challenge for many other models [36], previous work has shown that the calculated biomass production is relatively insensitive to the exact ratios used [37]. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 3. Classification of the ORFs included in iPS189 grouped into COG functional categories. The percent assigned to each class refers to the coverage of the total number in the genome accounted for in the model. Some of the ORFs in Mycoplasma genitalium do not currently have a COG functional category assignment (here represented as N/A). Note that although each ORF is only counted once within each COG functional category, some ORFs have multiple COG category assignments. https://doi.org/10.1371/journal.pcbi.1000285.g003 PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Table 3. Biomass components used in iPS189. https://doi.org/10.1371/journal.pcbi.1000285.t003 The computations-ready model along with the biomass description allows for the use of optimization-based techniques for testing and correcting for the presence of connectivity gaps (step 3) and growth prediction inconsistencies (step 4).

Analysis and Restoration of Network Connectivity The initial model was constructed almost exclusively based on homology searches within model libraries. This procedure led to the presence of many network gaps [17] preventing 177 reactions (99% of total) from carrying flux under all uptake conditions (i.e., they were blocked). As a consequence, these blocked reactions precluded the formation of some of the biomass components. Using GapFind [17] we found that a total of 175 (70%) cytoplasmic metabolites could not be produced inside or transported into the intracellular space. These metabolites included a number of biomass precursor metabolites (e.g., some amino acids, cofactors and metal ions) that had not been assigned uptake reactions. Of all the blocked metabolites, thirteen were involved in nucleotide metabolism and eight were metal ions without an identified transporter. We also note that 40 of these metabolites are charged/uncharged tRNA molecules, which are active in closed reaction cycles used in forming the protein component of biomass. Through the use of GapFill [17] we subsequently sought to bridge these network gaps through the addition of reactions, transport pathways and relaxation of irreversibilities of reactions already in the model. Reactions known not to be present in M. genitalium (e.g., an incomplete TCA cycle) were excluded as gap filling candidates. We first applied GapFill to unblock constituents of biomass guided by the known components in the growth medium. We unblocked biomass production by adding 65 reactions, for which most (i.e., 43) were involved in metabolite transport, such as for the uptake of amino acids (14), folate, riboflavin, metal ions (8), and cofactors such as CoA. Among the remaining reactions were those responsible for the hydrolysis of dipeptides (15) and eight reactions involving other biotransformations. We performed an additional round of BLASTp comparisons of genes annotated with these reactions against the M. genitalium genome to determine if we could associate any of these reactions with specific genes in M. genitalium. We found five proteins catalyzing these reactions that had BLASTp scores smaller than 10−5 (see Table 2). For example, GapFill suggested the addition of reaction glutamyl-tRNA(Gln) amidotransferase in the model to allow the formation of the gln-tRNA molecule. BLASTp searches allowed us to link this activity with the genes encoding for the three subunits (MG098, MG099 and MG100). Note that these three genes (and others added during this step) were not added earlier (steps 1 and 2) on account of their ambiguous functional characterization. By bringing to bear both homology (though BLASTp) and connectivity restoration (through GapFill), here we rely on multiple pieces of evidence when appending a new functionality and corresponding genes to the model. Even after unblocking biomass formation, 43 metabolites remained blocked and were subsequently analyzed by GapFill. The results from GapFill are summarized in Figure 4. We were able to reconnect three metabolites by treating three reactions as reversible. We also found that the originally assigned (based on the auto-model) directionality of 1-acyl-sn-glycerol-3-phosphate acyltransferase was incorrect. It was subsequently reversed and found to be in accordance with both KEGG and MetaCyc entries. An additional 21 metabolites were reconnected by adding 18 reactions from the KEGG and MetaCyc databases (see Materials and Methods). The addition of these 18 reactions also introduced an additional nine metabolites (three of which were involved in glycerolipid metabolism) to the model. Finally the incorporation of uptake/transport reactions reconnected an additional four metabolites. We performed an additional round of BLASTp comparisons and we were able to associate three out of 22 reactions with specific genes (see Table 2). We found that the associated gene (i.e. MG066) for 1-deoxy-D-xylulose 5-phosphate synthase was already included in the model but with a different functionality (i.e., transketolase). The secondary synthase functionality, revealed by GapFill/BLASTp, was subsequently associated with gene MG066 in the model. A similar situation occurred with MG053, which was already associated with phosphomannomutase in the model. In addition, gene MG259 (annotated as “modification methylase, HemK family” in the Comprehensive Microbial Resource, http://cmr.jcvi.org) was added to the model to carry out the glutamine-N5 methyltransferase activity elucidated by GapFill/BLASTp. The model statistics after correcting for network gaps are summarized in Table 1. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 4. Summary of the GapFind and GapFill procedures after biomass unblocking. During biomass formation unblocking, most metabolites that could not be produced (disconnected metabolites) were reconnected through the addition of transport reactions. Metabolites not directly connected to biomass were mostly reconnected through the addition of reactions from KEGG and Metacyc. https://doi.org/10.1371/journal.pcbi.1000285.g004

Model Correction Using In Vivo Gene Essentiality Data Based on in vivo gene essentiality data [19] we deduced that there are 174 essential genes and 19 non-essential genes among the 193 genes provisionally present in the model (after steps 1, 2, and 3). We note that the in vivo gene essentiality experiments were performed using non-defined medium containing serum and yeast hydrolysate among other rich components. During the in silico model predictions/comparisons, we allowed the uptake of all extracellular metabolites with transport reactions, except for sugars other than glucose, in order to computationally approximate this medium. Using a recently proposed diagnostic of the percentage of correctly-identified essential genes [38],[39], we found that the model correctly identified 137 out of a total of 174 essential genes (i.e., specificity of 79%) and 16 out of a total of 19 non-essential genes (i.e., sensitivity of 84%). This implies that the model (after steps 1, 2, and 3) was 79% correct in its overall accuracy in growth predictions (i.e., 153 of 193). Most of the mismatches (92%) were over-predictions of the metabolic capabilities (i.e., predicting growth when none is observed in vivo) instead of under-predictions (i.e., predicting no growth when growth is observed in vivo). We subsequently deployed the GrowMatch method (Satish Kumar and Maranas, submitted) to rectify as many as possible of the erroneous essentiality predictions by the model. GrowMatch functions by identifying the minimal number of model modifications required to restore consistency between growth predictions and gene essentiality experiments (see Materials and Methods). Model under-predictions include mutants (MG410 and MG411), which encode the subunits for the phosphate transporter, preventing in both cases the uptake of phosphate. This implies that M. genitalium must have an additional uptake route of phosphates. Even though GrowMatch suggested a number of phosphate uptake alternatives to resolve this conflict and Glass and coworkers [19] had posited the activity of a putative phosphonate transporter (MG289, MG290, and MG291), we decided not to add them to the model as no direct evidence exists to ascertain their presence. For instance, the putative phosphonate transporter might be nonspecific thus also enabling uptake of phosphate. Alternatively, the unidentified phosphonate substrate might be catabolized to yield phosphate through a number of reactions. The other incorrect under-prediction involved MG138 (homologous to elongation factor 4 in E. coli), which had been associated with macromolecule formation during the automodel construction. We observed that deletion mutants of the homolog in E. coli (lepA) are viable [40]. Based on this information, we removed this gene and its erroneous association as an essential component of the biomass equation from the model. Interestingly, three of the 37 erroneous over-predictions were corrected by adding three membrane components to the biomass equation (see Table 3). These components were not added during the initial model construction because it was not clear which (if any) of this class of metabolites were essential. An additional three erroneous predictions of non-essentiality were corrected by suppressing two reactions. One of these reactions, inosine kinase, was added during GapFill but not linked to an associated gene. Suppression of this reaction corrected two over-predictions but did not invalidate any correct model predictions, suggesting that the reaction activity is unlikely to be present in vivo, at least under the experimental conditions, and perhaps is not an activity encoded by M. genitalium. An additional six over-predictions involved two metal ion ABC transporters. GrowMatch identified each transporter to be essential when the other one was suppressed. We rejected co-regulation of the two transporters as a model restoration mechanism. Instead, we restored consistency for three of the six genes by assigning the cobalt uptake to the complex with a better homology to characterized cobalt transporters (MG179, MG180, MG181). The remaining three genes were removed from the model. An alternative interpretation of the GrowMatch results is that some other ion uptake reaction(s) are uniquely associated with these transporters and are thus responsible for the in vivo phenotype. Overall, the application of GrowMatch to the metabolic model led to the generation of a number of testable hypotheses regarding the presence or absence of specific functionalities and emphasized the importance of determining the substrate specificity of the transporters. We also identified reactions that had to be inactivated only for certain knock-outs suggesting their dependence on the genetic background in addition to the specific environmental conditions. For example, MG112 (ribulose-phosphate 3-epimerase) had to be suppressed in conjunction with two single gene deletions (i.e., conditional suppressions) to restore consistency with the in vivo data, suggesting possible regulation events. Figure 5 summarizes the complete GrowMatch results. Considering only those changes that could be repaired with global model adjustments conservatively raised the overall percent accuracy of the model (versioned as iPS189) from 79 to 87%. By recalculating the diagnostics of the percentage of correctly- identified essential genes [38],[39] we found that the model is now 87% (i.e., 149 of 171) correct in its essentiality predictions (specificity) and 89% (i.e., 16 of 18) correct in its non-essentiality predictions (sensitivity). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Figure 5. Summary of the GrowMatch reconciliations. The ORFs included in the comparisons are ordered within each COG functional category (see Figure 3 for abbreviations). If an ORF belongs to more than one category only the first one is used. The dark shading indicates an ORF that was essential, and the light shading indicates a non-essential ORF. Before the application of GrowMatch, 79% of the model predictions on gene essentiality agreed with in vivo experiments. Nearly all incorrect predictions occurred when the model predicted growth, but no growth was observed in the in vivo experiment (termed GNG mismatches). GrowMatch was able to rectify eighteen of these mismatches. This increased the percent agreement to 87%. The rightmost heat-map and the number of resolved mismatches also include the removal of one gene to correct a GNG experiment and three genes to resolve NGG mismatches. https://doi.org/10.1371/journal.pcbi.1000285.g005

M. genitalium iPS189 Model Characteristics The iPS189 model predicts that M. genitalium uptakes fructose via a PTS system. The fructose is converted to fructose 1,6-bisphosphate (fdp) and finally enters the glycolytic pathways to produce lactate via lactate dehydrogenase. Neither fructose nor glycerol uptake was found to be essential as glucose could be efficiently taken up and converted. Specifically, glucose is transformed to glucose 6-phosphate and finally to fdp via phosphofructokinase. As expected, the model also indicates that co-enzyme A (CoA) is taken up, since M. genitalium has no coA biosynthesis genes. Additionally, accetal-CoA (accCoA) is not formed via pyruvate formate lyase but rather by pyruvate dehydrogenase. Interestingly, we find that should acetate be taken up, it is converted to acetyl phosphate (actp) and finally to accoa by phosphotransacetylase. Sources for acyl-CoA (aCoA) and CDP-glucose are also required for lipid production. The metabolites riboflavin and nicotinic acid (niacin) are taken up for synthesis of the cofactors FAD and NAD, respectively. In addition, both spermidine and putrescene are directly imported as biomass components. Similarly, we also found that D-ribose (rib-D) is needed to fuel the truncated pentose phosphate pathway. Examination of fluxes indicated that the uptake of rib-D results in production of 5-phospho-α-d-ribose 1-diphosphate, which enables the conversion of adenine to amp. We also deduced that only adenine and cytidine are precursors to nucleotides and nucleosides (CTP, dCTP, UTP, dUTP, dTTP). Interestingly, the model required the direct uptake of GTP and could not be produced through the uptake of guanine. Model modifications that eliminate this requirement using GrowMatch resulted in a number of incorrect gene essentiality predictions. The need for the direct uptake of GTP is consistent with the fact that in M. mycoides the guanine nucleotide pathways depend on transport of preformed guanine derivatives [34], and that a number of other Mycoplasmas are not able to grow on medium that only contains guanine as a nucleobase [41]. In addition, all amino acids are imported directly from the environment as either monomers or dipeptides. Unlike many other mycoplasmas, M. genitalium is an arginine nonfermenting species, and not surprisingly arginine deiminase activity was not present in the model. Furthermore, in iPS189, the only participation of the amino acid arginine is its direct incorporation into biomass. Finally, flux predictions revealed that lactate is the main product of M. genitalium fermentation.