There are many ways to envision integration of AI-based CASP tools into the medicinal chemistry workflow, and adoption appears to be on the rise. The discussion below will largely focus on the use of tools within the open-source ASKCOS software suite. (20) However, the applications are general. We will break the use cases into multistep route planning, forward-reaction prediction, and condition recommendation, as outlined in the introduction. Finally, we will briefly discuss how the incorporation of programmatic interfaces can aid the DMTA workflow, along with the general feedback from MLPDS member companies on ASKCOS functionality and its adoption in their organizations.

Multistep Route Planning

Many of the available commercial and academic synthetic route planning software packages provide a stand-alone graphical user interface (GUI) or web-based interface where users can interact with the suggested routes and predictions. The target users of the software range from nonchemists, without much knowledge of chemical reactivity, to highly trained expert chemists who want to streamline their synthesis workflow. The MLPDS consortium member companies report that the primary users of the software are expert, Ph.D.-level chemists, and that adoption ranges from indifference to enthusiastic, everyday use. At Janssen, many chemists use synthesis planning tools in parallel with traditional database lookups of known reactions to generate ideas more quickly. Other users are computational chemists and chemical engineers who may not have as much practice at retrosynthetic planning but are involved with molecular design or process development. Most companies pilot small rollouts to select expert chemists, who are in the strongest position to evaluate the capabilities of machine-learning CASP tools and identify key limitations. At BASF, experts from different stages in product development (e.g., early phase and process development) provided an understanding of the different expectations across business areas. These small rollouts are necessary to understand the obstacles to wider adoption and further integration into synthesis pipelines. Close contact is necessary between each company’s beta testers and the developers of retrosynthesis algorithms, since the true assessment of performance must be carried out by trained experts who can validate the models’ suggestions.

The proof of principle for full pathway planning has been established, but further refinement will require the input of chemists who can objectively evaluate retrosynthetic predictions. Input from the MLPDS member companies has identified some general trends in where the machine-learning algorithms perform well and where they perform poorly. Generally, the ASKCOS suite of tools tends to perform well on target molecules that occupy a chemical space similar to that of product molecules found in Reaxys or USPTO. These target molecules can be accessed using well-established chemistries, and the models can perform adequately within their domains of applicability.

For example, input of the structure of branebrutinib (BMS-986195, 1, Figure 3) resulted in an ASKCOS-proposed synthesis. While the synthetic route was first reported in 2016, (55,56) the training data for ASKCOS stops before the initial disclosure of this molecule, which demonstrates the ability of the machine-learning models to generalize to new target compounds. The ASKCOS-proposed synthesis begins from commercially available starting materials that are similar to those in the reported route and uses several types of reactions (C–N cross-coupling, heterocycle formation, diazotization/reduction) to arrive at the final product. While the overall synthesis appears plausible, the selective Boc deprotection of 6 to 5 is likely problematic, and the Boc group could be easily substituted with an orthogonal protecting group when deciding on the final route. As previously noted, one area for improvement is better prediction of regioselectivity of reactions where there are two pendant reactive groups. For the proposed C–N-bond-forming reaction, ASKCOS suggests Boc protection of the N–H of the alkynamide intermediate (3) while the second coupling partner (4) contains a free carboxamide. In the literature synthesis of 1, the authors note that a carboxamide unexpectedly prevents the Buchwald (C–N) coupling from proceeding. Thus, that team performed the reaction with a nitrile substituted for the carboxamide of 4. While the exact ASKCOS C–N coupling has never been tried, the fact that chemists attempted the C–N coupling with a free carboxamide demonstrates that the ASKCOS prediction is reasonable (i.e., worth trying), but the carboxamide would likely need to be converted to a nitrile. This example is just one small piece of evidence that ASKCOS can reasonably disconnect modern drug-like molecules, and it shows how nuanced experimental results (including failures) that are not captured in large reaction databases can negatively impact algorithm performance.

Figure 3. Retrosynthetic analysis of branebrutinib performed by ASKCOS. The route is similar to that in ref (56), with the difference that the authors found a nitrile analogue of 4 to be optimal for the C–N-bond-coupling step.

Many different aspects play into the “success” of machine-learning-based path-planning tools. One of the simplest factors in whether these programs are able to find pathways is the coverage of the database of compounds considered to be commercially available; simply put, a larger starting material database increases the odds that a search will terminate successfully. In an effort to better understand how the database of buyable chemicals affects tree search outcomes, GlaxoSmithKline compared the stock ASKCOS database of buyable compounds (138k) and a larger set that was augmented with their internal compounds/vendors (8M). On an internal set of 69 target molecules, using the most liberal path-planner settings, a route was found by ASKCOS for 54% of compounds with the stock database and 67% of compounds with their internal database. These results highlight the dependency of path-planning algorithms on the database used as the stop criterion. The dependence on a buyable database, however, complicates the comparison of CASP tools since every software package uses a different (typically undisclosed) buyable database. This problem may be alleviated by the implementation of straightforward utilities to load and use custom building-block sets in every CASP tool. Such a capability is generally useful since all MLPDS corporate members maintain large internal collections of building blocks.
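The effect of the buyable database on search outcomes can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the molecule names are placeholders, the hand-written `TOY_DISCONNECTIONS` table stands in for a single-step retrosynthesis model, and the recursive search is a simple depth-limited expansion rather than the tree search used in ASKCOS.

```python
from typing import Dict, List, Optional, Set

# Hypothetical single-step "model": each product maps to candidate precursor sets.
# Names are placeholders, not real compounds or ASKCOS output.
TOY_DISCONNECTIONS: Dict[str, List[List[str]]] = {
    "target_1": [["amine_a", "acid_b"]],
    "target_2": [["boronate_c", "halide_d"]],
    "halide_d": [["arene_e"]],
    "target_3": [["core_f", "amine_a"]],
}


def find_route_depth(
    molecule: str, buyables: Set[str], max_depth: int = 5
) -> Optional[int]:
    """Return the longest linear sequence of the shortest route found, or None.

    A branch stops expanding as soon as the molecule is in `buyables`,
    which plays the role of the stop criterion.
    """
    if molecule in buyables:
        return 0
    if max_depth == 0:
        return None  # depth limit reached without hitting a buyable compound
    best: Optional[int] = None
    for precursors in TOY_DISCONNECTIONS.get(molecule, []):
        depths = [find_route_depth(p, buyables, max_depth - 1) for p in precursors]
        if any(d is None for d in depths):
            continue  # at least one precursor cannot be traced back to stock
        steps = 1 + max(depths)
        if best is None or steps < best:
            best = steps
    return best


def success_rate(targets: List[str], buyables: Set[str]) -> float:
    """Fraction of targets for which any route terminates in buyables."""
    solved = sum(find_route_depth(t, buyables) is not None for t in targets)
    return solved / len(targets)


targets = ["target_1", "target_2", "target_3"]
small_stock = {"amine_a", "acid_b", "boronate_c"}
large_stock = small_stock | {"arene_e", "core_f"}

print(success_rate(targets, small_stock))
print(success_rate(targets, large_stock))
```

In this toy setting, enlarging the stock set is the only change between the two calls, so any difference in success rate comes from the stop criterion alone. Running two CASP tools against the same custom building-block set would make the same kind of like-for-like comparison possible.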

How the availability of starting materials affects synthetic accessibility (SA) can be seen by enumerating methylated variants of compound A (57) followed by an evaluation using ASKCOS’s retrosynthesis tree search. While an expert chemist can easily infer that compound A is disconnected by amide formation and C–C cross-coupling, general knowledge of the commercial availability of the methylated starting materials is less likely. The input of methyl analogues of A results in the expected bond disconnections (Figure 4, representative ASKCOS results shown). Since the stop criterion for the tree search is commercial availability, the algorithm will assess at each disconnection whether the suggested starting materials are purchasable. In this example, the starting material for compound 8 can be purchased after only a few retrosynthetic steps. Compounds 9 and 10 require an extra step compared to 8. Finally, access to indole 11 would necessitate further steps to synthesize, which draws the step count to almost double that of the synthesis of compound 8. Notably, the information is obtained using ASKCOS with one search per compound. This assessment provides chemists with information on which analogues are most synthetically accessible and can factor into the decision-making process for prioritization of target molecules.

Figure 4. Retrosynthetic analysis of methylated analogues of compound A. ASKCOS proposes routes of different lengths depending on the availability of starting materials, where the tree search stop criterion is commercial availability (<$100/g). Compound 8 can be accessed from commercially available starting materials in 3 steps, compounds 9 and 10 require one extra step, and compound 11 requires 2 extra steps.

An expected feature of machine-learning methods for predictive chemistry is that retraining models on proprietary data ought to allow companies to achieve better predictive ability on chemistries that are used in-house. (58) These in-house chemistries may not be well represented in the public or published data sets on which most CASP systems are trained. Researchers from AstraZeneca and the University of Bern applied a workflow for retrosynthetic template extraction (28) and training/application (29) to several public and proprietary data sets and compared the performance of the different models. (59) They found that Reaxys has the most unique reaction templates, of which 2% are shared between all the data sets used in the study, and only 0.6% are shared between Reaxys and a subset of their proprietary ELN data. Eli Lilly identified a subset of 6k target compounds from approved, experimental, and investigational drugs to represent the chemical space of interest to the company. Using the Lilly database of building blocks and ChemoPrint, an in-house synthetic planning platform, retrosynthetic expansion was performed using a template set from (1) only Lilly ELN data (13297 templates) and (2) Lilly ELN data plus patent data (13297 + 50275 templates). Routes could be found for 40.1% of the 6k compounds with the first template set. Supplementing the template set with additional patent templates only provided a 5.8% increase in the ability to successfully furnish a route, corresponding to a success rate of 46.9%. (60) For full-pathway planning, these results demonstrate the necessity of further testing on internal and proprietary data sets and the influence that company data may have on multistep path planning.
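A template-overlap statistic of the kind quoted above can be computed with basic set operations. The sketch below is illustrative only: the data set names and template identifiers are placeholders, and taking the reference set's unique template count as the denominator is an assumption, since the exact definition used in the cited study is not given here.

```python
from typing import Dict, Set


def shared_fraction(template_sets: Dict[str, Set[str]], reference: str) -> float:
    """Fraction of the reference set's unique templates present in every set."""
    shared = set.intersection(*template_sets.values())
    return len(shared) / len(template_sets[reference])


# Placeholder template identifiers, one set of unique templates per data set.
toy_sets = {
    "public_a": {"t1", "t2", "t3", "t4", "t5"},
    "public_b": {"t1", "t6"},
    "eln_subset": {"t1", "t2", "t7"},
}

# Overlap across all data sets, relative to public_a.
print(shared_fraction(toy_sets, "public_a"))

# Pairwise overlap between public_a and the ELN subset only.
pair = {name: toy_sets[name] for name in ("public_a", "eln_subset")}
print(shared_fraction(pair, "public_a"))
```

A low shared fraction means each data set contributes templates the others lack, which is why retraining on proprietary data can change what a planner is able to propose.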

There are still many molecular structures for which retrosynthesis planning fails to find any route. The MLPDS consortium members have identified a lack of coverage for several company-specific target molecules or reactions in full-path planning. Commonly identified substructures that are not successful in full-path planning are small, densely functionalized carbon cores with or without many contiguous stereocenters, caged structures where 3D geometry is crucial for selectivity, newly discovered heterocycles, and complicated polycyclics. Some of these substructures, such as densely functionalized carbon cores, require chemistry that is specific to each core’s substructural environment (perhaps with <5 precedents in the literature). Using the conventional template extraction procedures, the model will not be able to generalize due to the high specificity of the template. Conversely, path planning of some target molecules will find numerous pathways but include many poor retrosynthetic suggestions where regio- or stereoselectivity may not be predicted appropriately. To correct the issues of selectivity, further filtering using an accurate forward-prediction model will provide richer route suggestions. Another set of failures is due to the limitations of the search methods for navigating a synthetic tree. Since recursive retrosynthetic expansion has to restrict the search to avoid combinatorial explosion, most implementations cannot yet navigate a search path deeper than 15 synthetic steps. If chemists are using CASP tools for the ideation of routes and pathway planning cannot successfully navigate a synthetic graph to produce a route, another solution is necessary.

When full-path planning fails, a chemist may resort to using single-step retrosynthetic predictions to manually construct a route. Figure 5 is an example where a path to branebrutinib is manually explored. Interestingly, the suggestion of using a nitrile, which was found to be ideal in practice, is in the precursor list but is ranked #37, so a chemist would have to sort through many higher-ranked suggestions. Manually constructing a route from tens to thousands of disconnections is a time-consuming task. A synthesis planning feature that was born from discussions between MLPDS member companies and MIT was the implementation of an interactive path planner using single-step retrosynthetic predictions. The interactive planner addresses the issue of displaying diverse suggestions and gives the user more control over a synthetic plan. When chemists are initially developing a route, the precise choice of leaving groups matters less, and as the routes are refined, specific leaving groups are chosen based on the desired reactivity. Machine-learning models for retrosynthesis generally handle all possible reactants as distinct options. For example, the chloride, bromide, and iodide forms of a halogenated precursor are not normally lumped into a single category. It is inconvenient for chemists to sort through numerous suggestions that have the same fundamental disconnection but different leaving groups. Thus, a clustering algorithm was developed to group similar suggestions (based on a k-means clustering using a reaction fingerprint (61)) and expedite the exploration of distinct disconnections. Several routes are displayed using one visualization, which can be downloaded and shared. Although none of the underlying machine-learning models were changed, expert users are much happier with exploring pathways interactively when an automated path-planning job fails. This success demonstrates that tight collaboration between end users and the developers of synthetic planning software is helpful for adoption, particularly when it comes to the user interface.
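The idea of grouping suggestions that share a disconnection but differ in leaving group can be sketched with a small k-means example. This is not the ASKCOS implementation: the three-component vectors below are hand-made stand-ins for a reaction fingerprint (the fingerprint of ref (61) is not reproduced), the suggestion labels are placeholders, and the k-means routine is a plain Lloyd's algorithm with deterministic farthest-first seeding chosen only to keep the example reproducible.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int, n_iter: int = 20) -> List[int]:
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Seed the next centroid with the point farthest from existing ones.
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for each point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def group_suggestions(names: List[str], fingerprints: List[Vector], k: int) -> List[List[str]]:
    """Group suggestion labels by their k-means cluster assignment."""
    clusters: Dict[int, List[str]] = {}
    for name, lab in zip(names, kmeans(fingerprints, k)):
        clusters.setdefault(lab, []).append(name)
    return list(clusters.values())


# Toy vectors: the first two components encode the disconnection type,
# the third a small leaving-group-dependent perturbation (all hypothetical).
names = ["aryl chloride + amine", "aryl bromide + amine", "aryl iodide + amine",
         "carboxylic acid + amine", "acid chloride + amine"]
fingerprints = [[1.0, 0.0, 0.1], [1.0, 0.0, 0.2], [1.0, 0.0, 0.3],
                [0.0, 1.0, 0.0], [0.0, 1.0, 0.1]]

for cluster in group_suggestions(names, fingerprints, k=2):
    print(cluster)
```

With the halide variants collapsed into one group, a chemist browsing the list sees two distinct disconnections instead of five entries. In practice the number of clusters would have to be chosen or tuned, which this sketch does not address.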

Figure 5. Screenshot of the ASKCOS interactive path planner. (Left) Visualization of a full synthetic graph constructed by the user. Boxes are color-coded green if they are purchasable and blue for the root target compound (branebrutinib). (Right) The selected molecule, for which a single-step retrosynthesis prediction has been performed, is displayed on top, and all of the predicted precursors are listed on the bottom.

An advantage of many synthesis planning packages is that reaction templates, or rules, are associated with a specific set of literature precedents. MLPDS member companies report that CASP tools are used more often when literature examples, upon which predictions are based, are easily accessible. For example, ASKCOS provides a mechanism to use reaction IDs tied to reaction examples in the training data and can direct users to literature lookups or in-house reaction entries.