The use of computational modeling algorithms to guide the design of novel enzyme catalysts is a rapidly growing field. Force-field based methods have now been used to engineer both enzyme specificity and activity. However, the proportion of designed mutants with the intended function is often less than ten percent. One potential reason for this is that current force-field based approaches are trained on indirect measures of function rather than direct correlation to experimentally-determined functional effects of mutations. We hypothesize that this is partially due to the lack of data sets for which a large panel of enzyme variants has been produced, purified, and kinetically characterized. Here we report the k cat and K M values of 100 purified mutants of a glycoside hydrolase enzyme. We demonstrate the utility of this data set by using machine learning to train a new algorithm that enables prediction of each kinetic parameter based on readily-modeled structural features. The generated dataset and analyses carried out in this study not only provide insight into how this enzyme functions, they also provide a clear path forward for the improvement of computational enzyme redesign algorithms.

Copyright: © 2016 Carlin et al. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

In this study we report the largest data set of its kind, in which 100 mutants of BglB are produced, purified, and kinetically characterized (i.e., kinetic constants k cat , K M , and K i measured) using the reporter substrate p-nitrophenyl-β-D-glucoside (pNPG). The production of this dataset revealed several mutations to non-catalytic residues (i.e. those not directly involved in the proposed reaction chemistry) that are as important to the enzyme-catalyzed reaction as catalytic residues. In addition, we demonstrate the ability to use this dataset to train computational algorithms for the prediction of k cat , K M , and k cat /K M using readily-calculated metrics derived from molecular modeling. Finally, we illustrate how machine learning can be used to identify structural features from the molecular models that significantly improve the predictive accuracy of the molecular modeling. These analyses provide insight into the factors important for catalysis in BglB as well as a path forward for the development and evaluation of next-generation enzyme reengineering algorithms.

(A) Structure of BglB in complex with the modeled p-nitrophenyl-β-D-glucoside (pNPG) used for design. Alpha carbons of residues mutated shown as blue spheres. The image was drawn with PyMOL. [ 16 ] (B) The BglB–catalyzed reaction on pNPG used to evaluate kinetic constants of designed mutants.

Here, we take the first step towards developing a data set of enzyme mutants with measured effects on kinetic constants that is both large enough and has a wide enough dynamic range to enable training of computational protein design algorithms. The initial enzyme of focus is a family 1 glycoside hydrolase: β-glucosidase B (BglB) from Paenibacillus polymyxa. The family 1 glycoside hydrolases have been the subject of numerous structural and kinetic studies due to their importance as the penultimate step in cellular ligno-cellulose utilization. [ 15 ] The structure of BglB indicates that it follows a classical Koshland double-displacement mechanism in which E353 performs a nucleophilic attack on the anomeric carbon of the substrate’s glucose moiety. The leaving group is protonated by E164. A third active site residue, Y295, orients E353 for catalysis with a hydrogen bond. [ 15 ] The protein structure and reaction scheme are provided in Fig 1 .

The use of large datasets to train and evaluate force-field based algorithms for protein function has been previously validated in the context of protein thermostability. For example, the ProTherm database has over twenty thousand measured effects of mutations on thermostability, and serves as the gold standard for the development of numerous algorithms developed to predict effects of mutations on thermostability. [ 7 , 8 , 9 ] Current algorithms for protein redesign, in contrast, are not directly trained on experimentally measured effects, but rather indirect measures such as sequence recovery (i.e. the ability to recapitulate a known active site after running a design simulation). While there have been several previous efforts to construct large families of functionally-characterized mutants, none have produced, purified, and measured the kinetic constants of more than twenty mutants. [ 10 , 11 , 12 , 13 , 14 ] In order to develop algorithms for the rational modulation of kinetic parameters, we hypothesize that it will be necessary to develop libraries of mutant enzymes for which the functional effects of mutations on catalytic efficiency (k cat /K M ), apparent substrate affinity (estimated by K M ), and turnover rate (k cat ) have been measured.

The ability to rationally reengineer enzyme function using computational approaches has the potential to enable rapid development of highly efficient and specific catalysts tailored for needs beyond those selected for during natural evolution. [ 1 ] A growing route for engineering enzyme catalysts is the use of computational tools to evaluate potential mutations in silico prior to experimental characterization. Using the Rosetta Molecular Modeling Suite, reengineering of both specificity and chemistry has been accomplished. [ 2 , 3 , 4 , 5 , 6 ] However, often less than ten percent of designs engineered using this force-field based approach are found to have the intended functional effect. Furthermore, there have been no reports evaluating the predictive power of the Rosetta Molecular Modeling Suite on the functional effects of enzyme mutations. Therefore efforts to both evaluate and improve the predictive power of this computationally inexpensive and widely accessible algorithm are necessary.

Results

Computationally-directed engineering of BglB A crystal structure (PDB 2JIE) of recombinant BglB in complex with the substrate analog 2-deoxy-2-fluoro-α-D-glucopyranose was used to identify the substrate binding pocket and the catalytic residues. To generate a molecular model representative of a proposed transition state in the hydrolysis of pNPG, an S N 2-like transition state structure was built and minimized in Spartan based on a 3D conformer of PubChem CID 92930. Functional constraints were used to define catalytic distances, angles, and dihedrals between pNPG, the acid-base E164, the nucleophile E353, and Y295, which is proposed to orient the nucleophilic glutamate. The angle between the attacking oxygen from E353, the anomeric carbon, and the phenolic oxygen was constrained to 180°, in accordance with an S N 2-like mechanism. [17] A complete set of files that were used for modeling are provided in S1 Code. Two approaches were used to establish a set of mutants to generate and kinetically characterize. The first approach was a systematic alanine scan of the BglB active site where each residue within 12 Å of the ligand in our model was individually mutated to alanine. In the second approach, mutations predicted to be compatible with the modeled pNPG transition state in BglB structure were selected by students learning about molecular modeling through the program Foldit, a graphical user interface to the Rosetta Molecular Modeling Suite. [4,18] Mutations were modeled and scored in Foldit and a selection of mutations that were either favorable or did not increase the energy of the overall system by greater than 5 Rosetta energy units were chosen to synthesize and experimentally characterize. Fig 1A illustrates the positions in the protein where mutations were introduced, and the complete set of mutations selected is listed in S1 Table. A total of 69 positions were covered over the 100 mutants made.

Protein production and purification Each of the 100 mutants was made via Kunkel mutagenesis [19] using the Transcriptic cloud laboratory platform and verified by Sanger sequencing. Plasmids containing the mutant genes were transformed into Escherichia coli BL21(DE3), 5 mL cultures grown in Terrific Broth and expression induced with IPTG. Proteins were purified via immobilized metal affinity chromatography and eluted in 200 μL HEPES buffer, as described in detail in S1 Text. The absorbance at 280 nm of eluted protein was used to quantify protein yield and SDS-PAGE was used to evaluate purity (S1 Fig). All proteins used in this study were greater than 80% pure, and fresh resin was used for each mutant to prevent wild type contamination. A total of ten biological replicates of the native BglB were used to assess expression and purification. The average concentration of proteins after purification was found to be 1.2 ± 0.4 mg/mL. Of the 100 mutants synthesized, 89 express and purify as soluble protein (Fig 2). The final concentrations for all 100 mutants are included in S1 Table. Greater than 35% maintained the yields obtained for native BglB, and 15% did not express and purify as a soluble protein above our limit of detection (0.1 mg/mL) for protein yield after purification based on A 280 and SDS-PAGE. PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 2. Log scale relative kinetic constants of 100 BglB mutants. The heatmap depicts the effect of each mutation on each kinetic constant relative to native BglB, normalized at 0. As indicated in the color legend, gold is for higher value and blue for a lower value. The metric 1/K M is used so a higher value is consistently corresponding to a “better” kinetic constant (assuming a lower K M is better) when evaluating k cat , k cat /K M , and K M . If the kinetic constant was not measurable, an X is depicted in the box. Proteins that were expressed as soluble protein with a final purification concentration of >0.1 mg/mL and validated by SDS-PAGE are labeled with a black box in the first column. Those below our limit of detection of 0.1 mg/mL are labeled with an empty box. Values are on a log scale and the ranges are as follows: 10–11,000 min-1 (k cat ), 0.6–85 mM (K M ), and 10–560,000 M-1min-1 (k cat /K M ) with wild type constants of 880 ± 10 min-1, 5.0 ± 0.2 mM, and 171,000 ± 8000 M-1 min-1 for k cat , K M , and k cat /K M respectively. A full table of kinetic constants and substrate versus velocity curves for each are provided in S1 Table and S3 Fig. https://doi.org/10.1371/journal.pone.0147596.g002

Kinetic characterization of mutants Michaelis-Menten kinetic constants for each of the 100 mutants were determined using the colorimetric assay of pNPG hydrolysis. The results are represented as a heatmap in Fig 2. Ten biological replicates of the wild type enzyme have an average k cat of 880 ± 10 min–1, K M of 5.0 ± 0.2 mM, and k cat /K M of 171,000 ± 8,000 M–1 min–1. To determine kinetic constants, observed rates at 8 substrate concentrations were fit to the Michaelis-Menten equation. If no clear saturation was observed then a linear equation was used to determine k cat /K M . Experimentally measured kinetic constants and nonlinear regression analysis for each mutant can be found in S1 Table and S3 Fig, respectively. Based on the maximum concentration of enzyme used in our assays and colorimetric absorbance changes at the highest substrate concentration used, we estimate our limit of detection for k cat /K M to be 10 M-1min-1. Of the 89 solubly purified mutants, 6 are below the limit of detection. The highest catalytic efficiency observed is 560,000 M-1min-1 for mutation R240A. In addition, while no substrate inhibition is observed for the wild type BglB, four mutants exhibit measurable substrate inhibition (the inhibition parameter K i for only these mutants is reported in S1 Table as it was not measurable for most mutants).

Observed sequence–structure–function relationships in BglB In agreement with previous studies, our results demonstrate the importance of E164, E353, and Y295 for catalysis. Mutating any of these residues to alanine results in a >85,000-fold reduction in catalytic efficiency (k cat /K M ). However, beyond the catalytic residues, the systematic alanine scan of every residue within 12 Å of the ligand revealed mutations which have an equivalent functional effect to mutating the established catalytic residues to alanine. For example, the Q19A mutant showed a dramatic effect on function: catalytic efficiency decreased by 57,000-fold. Analysis of the crystal structure of BglB suggests that both the nitrogen and oxygen of the amide sidechain interact with hydroxyl groups on the substrate (Fig 3A). A multiple sequence alignment of the BglB enzyme family in the Pfam database (including 1,554 non-redundant proteins), revealed that Q19 is 95% conserved in this family (Fig 3B). Unlike E353, the nucleophilic glutamate directly involved in the reaction chemistry, Q19 is not directly involved in the reaction. This is consistent with the theory that orientation of the substrate is a critical aspect of catalysis ("orbital steering") for which Q19 is likely crucial. [20] A crystal structure of BglB Q19A in complex with the 2-deoxy-2-fluoro-α-D-glucopyranose inhibitor may help elucidate the structural effect of this mutation. Based on molecular modeling, no major structural change for this mutant is predicted (S2 Fig). PPT PowerPoint slide

PowerPoint slide PNG larger image

larger image TIFF original image Download: Fig 3. Active site model and conservation analysis of BglB. (A) Docked model of pNPG in the active site of BglB showing established catalytic residues (navy) and a selection of residues mutated (gold). A multiple sequence alignment of the Pfam database’s collection of 1,554 family 1 glycoside hydrolases was made and the sequence logo for (B) selected regions around specific residues discussed in the text and (C) over the entire BglB coding sequence is represented. The height for each amino acid indicates the sequence conservation at that position. https://doi.org/10.1371/journal.pone.0147596.g003 A novel finding was a tenfold increase of k cat by a single point mutant, R240A. The BglB crystal structure reveals that R240 forms two hydrogen bonds with E222 (Fig 3A). Molecular modeling of the R240A mutant predicts that E222 would adopt an alternative conformation in which the acid functional group of the glutamate is 2 Å closer to the active site (S2 Fig). This would likely result in a significant change of the electrostatic environment around the active site, and indicates that the electronegative environment enhances catalysis of pNPG hydrolysis. Consistent with this hypothesis is the observation that the mutation E222A decreases k cat by tenfold. Both observations support previous evidence that the electrostatic environment of an enzyme's active site is of primary importance to catalysis. [21]

Conservation analysis of the BglB active site Of the 44 positions in the active site systematically mutated to alanine, 11 are conserved by >85% in amino acid identity with respect to 1,554 homologues in the Pfam database. When any one of these amino acids is mutated to alanine, catalytic efficiency decreases >100-fold (S3 Table). This supports the widely held assumption that highly conserved residues within an enzyme active site are functionally important. However, only 11 of the 44 residues within 12 Å of the active site are >85% conserved. Of the 33 remaining residues within 12 Å of the active site, only 8 alanine mutations resulted in a decrease in catalytic efficiency of greater than 100-fold, and 10 of these 33 mutations were not found to significantly affect catalytic efficiency. Based on these findings, there does not appear to be a strong correlation between residue identity and function if a particular residue is <85% conserved. This observation supports the hypothesis that native sequence recovery is not a good metric for training design algorithms. In addition, the mutation R240A, which is not observed in any natural variant in the glycoside hydrolase 1 family, resulted in a 10-fold increase in k cat on pNPG. This emphasizes the importance of not limiting design efforts to changes previously observed in nature when engineering function towards a new substrate.

Computational modeling and evaluation of predictive ability In order to evaluate the Rosetta Molecular Modeling Suite’s ability to predict the functional effects of mutations on BglB kinetic properties, molecular models were generated for each of the 100 BglB mutants. For each mutant, the modeled pNPG previously described was docked into the active site by a Monte Carlo simulation with random perturbation of the ligand followed by functional constraint optimization through rigid body minimization of the ligand, sidechain and ligand conformational sampling, and, finally, ligand, sidechain, and backbone minimization. This protocol was used to mimic protocols used in successful enzyme reengineering efforts. [2] An example set of input files for wild type BglB are provided in S1 Code. For each mutant, 100 models were generated as described above and the lowest 10 in overall system energy for each mutant were selected for subsequent structural analysis. A value for each of 59 potentially informative features (such as predicted interface energy, number of hydrogen bonds between protein and ligand, and change in solvent accessible surface area upon ligand binding) was calculated for each model. Correlation of the average calculated structural features to each kinetic constant was assessed using Pearson Correlation Coefficient (PCC) and Spearman Rank Correlation (SRC). For both k cat /K M and k cat , the strongest correlation observed is to the total number of non-local contacts (count of residues separated by more than 8 sequence positions that interact with each other), with a PCC of 0.57 (p-value 0.009; Wilcoxon test) and 0.43 (p-value 0.004; Wilcoxon test), respectively. For 1/K M , the highest PCC is 0.29 (p-value 0.0005; Wilcoxon test) to the total number of hydrogen bonds in each BglB model. The SRC follows similar trends to PCC for all three predicted constants (SRC of 0.55, 0.42 and 0.38 for k cat /K M , k cat and 1/K M respectively). The PCC and SRC values for all features are available in S2 Table.