During algorithm development and selection of heuristics (cf. above), we scrutinized its results on the total of 548 reactions which can be referred to as a “training set”. This set comprised 241 typical reactions with full stoichiometry and taken from the Organic Syntheses collection37; set of 191 randomly selected and typically mostly stoichiometrically unbalanced reactions from Reaxys collection38; and 116 mechanistically complex reactions (73 stoichiometrically balanced and 43 unbalanced reactions) taken from various literature sources (e.g., Kurti’s “Strategic Application of Named Reactions in Organic Synthesis”39 or Grossman’s “The Art of Writing Reasonable Organic Reaction Mechanisms”1)—whose mapping should pose a challenge even to human experts. These reactions were mapped by our algorithm, by ReactionMap32, Marvin JS version 16.4.1833, and as a benchmark for correctness, by the authors (S.S., B.M.K., T.K., all expert organic chemists with track record as co-developers of the Chematica retrosynthetic software13,40). All mapped reactions are provided in the Supplementary Note 2. For the 241 typical reactions with full stoichiometry, our algorithm provides 93.8% correct assignments compared to 92.1% for ReactionMap and 86.7% for MarvinJS methods—that is, it does slightly better than the competing solutions. For the 191 Reaxys reactions in the second collection, the accuracy of our algorithm is 94.2% vs 90.5% for MarvinJS and only 12% for ReactionMap. The poor performance of ReactionMap could be expected since, as its authors admit25, the program cannot generally tackle reactions without full stoichiometry. Our algorithm outperforms the competition most decisively in mapping the 116 complex reactions—here, it mapped correctly 85.3% reactions compared to 44.8% for ReactionMap and 59.4% for MarvinJS.

Next, we performed similar comparisons on a set of 401 reactions that were not considered during training (for reaction miniatures with mappings, see Supplementary Note 3). This “test” set comprised: 100 relatively simple reactions with full stoichiometry taken from total syntheses published in Org. Lett., J. Am. Chem. Soc., and J. Org. Chem.; 100 typical reactions without full stoichiometry taken from patents; and 201 mechanistically complex reactions (92 stoichiometrically balanced and 109 unbalanced reactions) which include rearrangements and multicomponent reactions taken from recent literature (in most cases, after 2010 and from Org. Lett., J. Am. Chem. Soc., and J. Org. Chem.). For all 401 reactions, we compared the performance of our algorithm not only against MarvinJS and ReactionMap but also ChemDraw Prime (version 16.0.0.82) and Indigo (version 1.3.0 beta). The results summarized in Fig. 5 evidence that for simple reactions with full stoichiometry (green bars), all algorithms with exception of Indigo (65% correctness) are performing well—ours has 98% correctness, MarvinJS 85%, ReactionMap 95%, and ChemDraw 96%. For 100 reactions without full stoichiometry (blue bars), our algorithm is slightly more accurate than ChemDraw, MarvinJS, and Indigo (100% vs 97%, 96% and 90% correctness, respectively) and significantly better than ReactionMap (11% correctness), which does not handle missing stoichiometry. As in the training set, the major differences are observed for complex reactions (red bars) for which our algorithm is correct in 84% of cases compared to 66% for ChemDraw, 62% for MarvinJS, 38% for Indigo, and 33% for ReactionMap. We make two comments regarding these results. First, the figures of merit for ChemDraw and Indigo are overestimated, because a sizable fraction of the results these mappers produce come without full mappings (see Supplementary Fig. 5)—though the fragments missing atom assignments are chemically unique and can be unambiguously assigned by a human chemist (on the other hand, such results might not be treated as correct during, say, automatic reaction rule extraction). If a more stringent criterion is applied that all atoms in the molecule must have unique numbers, the statistics for the two mappers are much worse (solid parts of the bars in the figure): for ChemDraw, 27% correctness on simple reactions with full stoichiometry, 16% for typical reactions without full stoichiometry, and 27% for complex reactions; for Indigo, 37% for simple reactions with full stoichiometry, 12% for typical reactions without full stoichiometry, and 16% for complex reactions. Second, it is instructive to compare the performance of all mappers on complex reactions with full stoichiometry (73 such reactions in the training set and 92 in the test set). For such reactions, correctness of ChemDraw, MarvinJS, and Indigo does not change much (respectively, 67%, 59%, 34%) but the performance of ReactionMap improves quite significantly (~71%)—though it is still perceptibly below our mapper (86% correctness for the stoichiometrically-balanced reactions in the training set and 87% for balanced reactions in the test set).

Fig. 5 Performance of various mappers quantified on the 401 reactions from the main test set. For ChemDraw and Indigo, solid parts of the bars correspond to the more stringent criterion that all atoms in the molecule must be assigned numbers. Shaded parts of the bars give the percentages of correct answers assuming a more lenient criterion allowing for unmapped but chemically unique atoms (see examples in Supplementary Fig. 6). ReactionMap does significantly better (~71%) on complex reactions with full stoichiometry (see main text) Full size image

Some of these complex reactions we tested on, involving multiple mechanistic steps and/or rearrangements, are shown in Fig. 6. For example, reaction in Fig. 6a commences with a [3,3]-sigmatropic allylic azide rearrangement with subsequent intramolecular Schmidt-Aubé reaction to form bicyclic amide41. In Fig. 6b, cyclopropenyl lithium reagent generated in situ reacts with an aldehyde, followed by an intramolecular Diels-Alder reaction with a furan ring to form bridged bicyclic scaffold42. In Fig. 6c, a Lewis acid catalyzed semipinacol rearrangement creates bicyclo[3.2.1]octan-8-one scaffold43. In Fig. 6d, a Lewis acid catalyzed oxa-Piancatelli rearrangement of an alcohol creates oxaspirocycle scaffold44. A multistep sequence in Fig. 6e involves cross-coupling of 2-bromofuran with an amide, intramolecular Diels-Alder reaction, thermal rearrangement of amidofuran to dihydro-2H-carbazolone and, finally, cyclization with allylamine to form the main scaffold of minfiensine alkaloid45. On the flipside of the coin, the 16% of incorrectly mapped reactions are usually ones in which key, mechanistically important substrates or by-products are missing, those that comprise sequences of mechanistically complex rearrangements and unusual migrations of functional groups (changing atomic environments in non-trivial ways, cf. Supplementary Fig. 8), or those that are cascades of not necessarily complex but just too many (>4, 5) reactions46 that could/should be written as separate transformations (Fig. 6f). As narrated earlier in the text, such corner cases cannot be overcome by simply adding more specialized heuristics since their application creates additional decoy solutions—especially when the heuristics’ atom cores overlap—having comparable scores but ultimately yielding wrong atom assignments.

Fig. 6 Examples of mapped, complex reactions. a–e Reactions mapped correctly by our algorithm. f An example of a reaction in which the mapping found is incorrect. Correct mapping is shown in the bottom row. Bonds that are disconnected are colored in red whereas bonds created are marked green. For discussion, see main text and refs. 41,42,43,44,45,46 for chemical details Full size image

As the second test set, we considered the performance of our algorithm in mapping patent reactions from which reactivity scores for atom pairs11 or reaction templates/cores47 were previously derived and then used for, respectively, reaction prediction or retrosynthetic planning. Proper atom assignments in this and similar reaction sets are important since machine-extraction of mapped reaction cores is common to most retrosynthetic design programs (Wiley/CAS ChemPlanner17, InfoChem’s ICSynth16, BenevolentAI/Waller’s12, and MIT/Coley’s47 programs), though not of our Chematica platform13,40. Here, we considered a subset of 50,000 reactions selected by Landrum and co-workers5,48 from the United States Patent and Trademark Office (USPTO) database to represent reactions most essential for medicinal chemistry (this collection was later used by the MIT team in the abovementioned studies11 and47). After further cleaning for chemically nonsensical entries (see Supplementary Fig. 9), we categorized the reactions according to the number of bonds that were altered (from one to six) and selected samples from each class at random to ultimately collect 281 examples (cf. Fig. 7 and Supplementary Note 4)—we note that this method of categorization and selection placed emphasis on the more complicated reactions that would otherwise be infrequent (~72% of the USPTO set alters one or two bonds) but are important in the context of learning non-trivial chemical reaction rules. When the reactions were mapped by human experts, we compared the correctness of mappings provided in the USPTO set (red bars in Fig. 7a) vs. the mappings generated by our algorithm (green bars). As seen, for reactions disconnecting/creating one to two bonds, there are no major performance differences; on the other hand, for more complex reactions changing more than two bonds, the percentage of correctly mapped reactions in the USPTO collection drops rapidly to 16–52% while it remains between 82 and 92% for our mapper. A manifestation of this trend are illustrated in Fig. 7b which shows two useful reactions, Diels-Alder cycloaddition and N-alkylation of amides, incorrectly mapped in the USPTO set. We observe that when the authors of ref. 11 used such mappings to derive reactivity scores for atom pairs and then—using a deep neural network based on Weisfeiler-Lehman architecture— predicted outcomes of new reactions, they claimed that “the overall model performance does not depend strongly on atom mapping quality”. While this might be the case for some very simple chemistries, we have verified that if incorrect mappings are used to train the neural net, predictions of products of test reactions are, in vast majority, chemically nonsensical (see Supplementary Notes 5.2 vs. 5.3 and more thorough discussion in Section 8 of the Supplementary Information to ref. 49). In a wider context, such examples help us understand why synthesis-planning softwares based on machine-extracted rules or reactivity indices that come with faulty atom mappings are generally not applicable to complex targets whose syntheses require the use of more advanced chemistries. We note, however, that other approaches are also being developed that do not require atom mappings but, instead, base reaction predictions on reaction fingerprints14 or the so-called sequence to sequence models50,51.

Fig. 7 Comparison of our algorithm against USPTO mappings. a Percentages of correct USPTO mappings (red bars) and those by our algorithm (green bars) categorized according to the number of bonds broken/created, from one to six. Statistics are based on 50 reactions each for one to five bonds being changed and 31 reactions for six bonds. We note that in addition to chemically meaningful reactions for which we compared the mappings, the USPTO set also contained a large fraction of nonsensical reactions that were likely due to human entry errors in the databases they used (e.g., missing key reaction partners, creating atoms “ex nihilo”, etc., see Supplementary Fig. 9). Such reactions as well as simple deprotections were not included in the statistics shown. b, c Examples of two reactions—Diels-Alder cycloaddition (b) and N-alkylation of amide (c)—mapped incorrectly in the USPTO set and correctly by our software. For more examples, see Supplementary Note 4 Full size image

We sought additional validation from synthetic chemists outside of our group (three M.Sc. students, two Ph.D. candidates, two postdoctoral fellows, all listed in the Acknowledgments section). First, to benchmark our expert mappings, four of these chemists re-mapped a randomly-chosen sample of our 100 reactions—their mappings agreed with those of our internal experts. Then, to ensure that the reactions we chose for our tests were not in any way biased, all seven external chemists were asked to provide additional samples (25 reactions per person) they considered representative to modern synthetic chemistry. All these reactions—differing in the level of mechanistic complexity, provided to us in a typical synthetic notation (in 38.9% of cases, without full stoichiometry), and all listed in the Supplementary Section 5—were mapped by our algorithm and the results were compared against external mappings (in addition, our internal experts validated the external mappings once more). In the end, the algorithm provided 90% correct mappings for 100 reactions provided by Ph.D. candidates and postdocs, and 95% correct answers for reactions provided by M.Sc. students.

Finally, we considered algorithm’s speed. Because our method uses several heuristics for the NP hard subgraph isomorphism problem, its speed is, in principle, exponential with respect to graph size—however, tests summarized in Supplementary Fig. 7 indicate that typical mapping times remain practical (91.5% of reactions mapped within 1 s, 96.5% in less than 10 s).

In summary, mapping of organic reactions is an example of a NP-hard problem for which prior attempts have been largely restricted to simple reaction types which chemists can typically map without much effort. In our approach, junction of graph theory and combinatorics with domain chemical knowledge (embodied in the minimal set of reaction heuristics) enables mapping of both simple and very complex organic reactions, including those that might challenge human experts. The graphical user interface of our algorithm is made freely available at http://mapper.grzybowskigroup.pl/marvinjs/ (see Supplementary Note 1.3 for a short tutorial) and we hope it will be useful for colleagues working on the applications we touched upon above (especially assignment of atoms in machine-extracted synthetic rules) and also in related fields, notably in the mapping of biochemical pathways (see examples in the Supplementary Note 1.4).