Medicinal chemists’ “intuition” is critical for success in modern drug discovery. Early in the discovery process, chemists select a subset of compounds for further research, often from many viable candidates. These decisions determine the success of a discovery campaign, and ultimately what kind of drugs are developed and marketed to the public. Surprisingly little is known about the cognitive aspects of chemists’ decision-making when they prioritize compounds. We investigate 1) how and to what extent chemists simplify the problem of identifying promising compounds, 2) whether chemists agree with each other about the criteria used for such decisions, and 3) how accurately chemists report the criteria they use for these decisions. Chemists were surveyed and asked to select chemical fragments that they would be willing to develop into a lead compound from a set of ∼4,000 available fragments. Based on each chemist’s selections, computational classifiers were built to model each chemist’s selection strategy. Results suggest that chemists greatly simplified the problem, typically using only 1–2 of many possible parameters when making their selections. Although chemists tended to use the same parameters to select compounds, differing value preferences for these parameters led to an overall lack of consensus in compound selections. Moreover, what little agreement there was among the chemists was largely in what fragments were undesirable. Furthermore, chemists were often unaware of the parameters (such as compound size) which were statistically significant in their selections, and overestimated the number of parameters they employed. A critical evaluation of the problem space faced by medicinal chemists and cognitive models of categorization were especially useful in understanding the low consensus between chemists.

The counsel of experts is often sought on subjects or items within their field that are too complex for a non-expert to handle – for example, bloodstock agents are consulted to assess how promising a yearling thoroughbred horse is prior to purchase, or a specialized doctor might be sought to diagnose a puzzling symptom. These assessments are often summarized in verbal or written reports, which in turn inform decisions. It would seem almost ludicrous for an expert to make an important recommendation based on their “gut feeling,” yet there seems to be mounting evidence that the unconscious mind under certain circumstances in fact outperforms the conscious mind. Research suggests that the unconscious is especially good at making complex decisions, [32] and that introspection can actually reduce the quality of decisions. [33] It has also been reported that humans are often unaware of the important factors that play a role during complex problem solving. [34] Furthermore, people seem to be ultimately less satisfied with choices that were consciously made, compared to those made unconsciously. [35] , [36] Importantly, complex pattern recognition, which is especially relevant to the current study, can be obtained unconsciously. [37] This invites one to reconsider the role of the conscious and unconscious mind when expert chemists prioritize compounds. When faced with the inherently complex problem of assessing the desirability of a compound, are chemists aware of the criteria they use when selecting compounds to carry forward during drug discovery campaigns?

Turning to our domain of interest, medicinal chemistry, reports of consensus between chemists from previous studies have been varied. When assessing the synthetic accessibility of compounds, chemists have demonstrated both a considerable amount of consensus (the correlation coefficient r² between chemists ranged from 0.73 to 0.84), [29] and moderate consensus (r² ranged from 0.50 to 0.63). [30] Lower consensus was observed when chemists assessed the drug-likeness of compounds (r² ranged from 0.40 to 0.56). [30] In a study most relevant to the current paper, chemists asked to remove undesirable compounds from lists of putative compounds for inventory acquisition showed little consensus. [31] One difference in the present work is that in our case chemists were asked to actively select desirable compounds, rather than reject undesirable compounds. More importantly, we have gone a step further by analyzing what criteria individual chemists use to select desirable compounds, revealing why there is an apparent lack of consensus, and the degree – if any – to which these criteria are consistent across chemists.

Second, prior work on expert classification suggests that expert specialization can affect consensus within a common domain of expertise. For instance, tree experts with different specializations (maintenance, landscaping, or taxonomy) overall agreed in their classification of local tree species, but only landscaping experts showed a distinct tendency to group trees based on their utilitarian value. [27] Similarly, a comparison of Native American and majority-culture fishermen in northern Wisconsin showed overall consensus in their categorization of local freshwater fish species, but also clear differences with respect to the use of morphological (majority-culture) and ecological (Native American) dimensions. [28]

Another question of interest is the degree to which highly-trained and experienced medicinal chemists agree with each other when making decisions about promising chemical fragments. In a seminal paper, Einhorn argued that consensus among experts is a mark of expertise, implying that a lack of consensus among experts demonstrates a lack of expertise. [15] However, evidence from previous work on expert agreement is mixed. First, consensus proved to vary with the domain of expertise [24] : for example, stockbrokers have demonstrated low consensus, [16] while weather forecasters have demonstrated high consensus. [25] Shanteau proposed that the degree of consensus among experts may depend on the properties of the problem space, such as predictability [24] , [26] .

In this paper, we address the question of how expert medicinal chemists approach the problem of selecting promising compounds from large sets. Do they aim for exhaustive assessment of each compound, by taking into account all pieces of available information, or do they simplify the problem by focusing on a small subset of compound properties?

Consistent with this view, experts often use only a subset of available information in decision making. This has been observed in fields as diverse as medical radiology, [14] medical pathology, [15] stock trading, [16] clinical psychology, [17] and grain judging. [18]–[20] Moreover, experts appear to utilize fewer cues in realistic decision-making settings than in more controlled experimental settings. [21] For example, judges tended to use all available information when reaching decisions in a simulated courtroom setting, but only a small subset in an actual courtroom. [22] Indeed, experts do not appear to differ from novices in the amount of information they use, but rather in what information they use, suggesting that experts are more capable of discriminating what is diagnostic from what is not. [23]

However, recent developments in the study of reasoning question the idea that “less” always means “worse.” As Gigerenzer, Todd, and the ABC research group proposed, [6] the accuracy-effort trade-off is not the only reason why people resort to using incomplete information. In certain environments (i.e., those characterized by high cue redundancy [a cue can be thought of as a feature that signals something; for example, shorts and cleats are cues that someone is a soccer player], low predictability of outcomes, or a small amount of evidence relative to the number of potentially available cues), heuristic-based reasoning that efficiently ignores some of the available information and uses simpler computations can in fact lead to more accurate decisions. [8] In one study, the predictive accuracy of two relatively simple heuristics – “tallying” and “take-the-best” – was compared to multiple regression, a more complex estimation technique, in 20 scenarios ranging from predicting fish fertility to fuel consumption. [10] (The tallying heuristic ignores cue weights and simply counts the number of favoring cues, while take-the-best searches through cues in order of validity and bases a decision on the first cue that discriminates between the alternatives. Regression methods weight the cues differentially, and use all of them when making predictions.) Regression was shown to be superior in fitting the available data, but its flexibility came at the price of capturing unsystematic patterns in the data, and it was ultimately outperformed by both heuristic methods when it came to prediction (see also [11] ). Such “less-is-more” effects – where less information leads to higher accuracy – have been observed in a variety of settings. For example, expert sports players often make better decisions under time pressure. [12] , [13] It appears that for some kinds of problems and environments, ignoring pieces of available information can be a signature of expert decision making rather than faulty reasoning.
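
As a rough illustration of the two heuristics described above, the following Python sketch compares two alternatives on binary cues. The function names, the 0/1 cue encoding, and the example cue values are assumptions made for illustration; they are not taken from the cited studies.

```python
def tallying(cues_a, cues_b):
    """Pick the alternative with more favoring cues (1 = favoring); weights are ignored."""
    count_a, count_b = sum(cues_a), sum(cues_b)
    if count_a == count_b:
        return "tie"
    return "a" if count_a > count_b else "b"


def take_the_best(cues_a, cues_b, validity_order):
    """Check cues from most to least valid; decide on the first one that discriminates."""
    for i in validity_order:
        if cues_a[i] != cues_b[i]:
            return "a" if cues_a[i] > cues_b[i] else "b"
    return "tie"  # no cue discriminates between the alternatives


# Hypothetical example: three binary cues, indexed by decreasing validity.
a = [1, 0, 0]
b = [0, 1, 1]
print(tallying(a, b))                  # "b": two favoring cues versus one
print(take_the_best(a, b, [0, 1, 2]))  # "a": the most valid cue already discriminates
```

In this toy case the two heuristics disagree, which shows that they use the same cues in different ways: tallying pools all cues equally, whereas take-the-best stops at the first informative one.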

For most decisions we face in the real world based on sampling available information, the world is much like a superstore – it offers too much, and most of what’s offered does not meet our specific requirements. Given this state of perpetual information overload, people are bound to filter out a great deal of information. Classic work in cognitive science has been critical of this strategy, portraying human reasoning as plagued with biases, based on heuristics that ignore relevant information, and prone to fallacies. [4] , [5] This work claims that cognitive limitations lead people to selectively attend to a subset of available information and therefore to systematically make non-normative decisions.

In this paper we examine how chemists tackle this problem as a way of addressing the more general question of how humans deal with cognitive complexity. Specifically, we asked chemists to sort through ∼4,000 chemical fragments over several sessions, and to identify those they deemed attractive for follow-up. (Chemical fragments are compounds with molecular weight <300 that are smaller than typical drug-sized compounds. They are used as starting points for building larger, more drug-like compounds.) We built classification models to best characterize which objective properties of the fragments were most predictive of each individual chemist’s decisions. In order to ascertain the potentially complex patterns of features that chemists might find desirable or undesirable, we applied two orthogonal classification algorithms: semi-naïve Bayesian (SNB) and Random Forest (RF). While both methods are capable of identifying important features and recognizing complex interdependencies between features, SNB is more readily interpretable. Thus both methods were used to identify important features, while SNB models were used to visualize and interpret chemists’ preferences. We also asked chemists to explain their decision-making. We aim to address three major questions: 1) How and to what extent do chemists simplify the problem of identifying promising chemical fragments to move forward in the discovery process? 2) Do different chemists use the same criteria for such decisions? 3) Can chemists accurately report the criteria they use for such decisions? Below we provide a background for these three questions.

For instance, in research departments across the pharmaceutical industry, medicinal chemists routinely sift through long lists of compounds with associated data (biochemical activities, physicochemical properties, etc.) in order to prioritize some for further optimization or study, and discard others in the search for new drug candidates. [1] Although computational tools have been developed to aid compound prioritization, [2] medicinal chemists remain intimately involved in compound review. In order to prioritize compounds, chemists must consider whether they possess desirable physical chemical properties (e.g., solubility), how easily they can be synthetically accessed and chemically manipulated, and whether they can be optimized to bind a desired target while avoiding undesirable biological properties such as off-target interactions or mutagenicity. Indeed, guiding compounds through all the potential pitfalls that lie between an initial ensemble of hits and a drug candidate is an extremely complex task, and the selection of the initial chemical starting points for this endeavor greatly impacts the path that is explored, and the ultimate success of a drug discovery campaign.

A core function of human cognition is to reduce the complexity of the world to manageable proportions. In everyday life, we ignore most of the information available in the environment in an attempt to focus on what is likely to be most important. In some professional contexts, this process is raised to an art form, providing a useful context in which to investigate the human cognitive response to complexity.

In perhaps one of the more astounding discrepancies noted above, chemist 3 reported that several properties were important, but failed to report that size played any role during selections. Our SNB and RF classifiers both revealed that size, an especially straightforward parameter to assess, was the most important feature in distinguishing chemist 3’s selections from rejections (discussed above).

To assess the extent of chemists’ self-awareness, we compared the parameters reported by chemists to those identified by our SNB and RF classifiers ( Fig. 2 ). The average number of parameters reported by each chemist (8.1±2.2) was much larger than the number of parameters identified by the SNB (2.1±0.5) or RF (1.6±0.6) classifiers for each chemist, a difference that the two-tailed paired-sample t-test indicates is significant (p = 9.1×10⁻¹⁰ and p = 5.7×10⁻¹⁰, respectively). Indeed, every single chemist reported properties that were never identified as important by our SNB or RF classifiers. In addition to the properties reported in Figure 2 , there were simple parameters (chiral centers and rotatable bonds; included in the averages above) and more complex parameters (shape and complexity; not included in the averages above) that were reported by chemists, though our approach never identified them as being useful in reproducing selections ( Figure S11 ). Furthermore, Fisher exact probability tests indicated that for each parameter reported in Figure 2 , the SNB parameters or RF parameters were independent of the self-reported parameters (p-values range from 0.46–0.74 for SNB or 0.22–0.80 for RF, excluding the Novelty/IP parameter, Fig. 2 ), while indicating that the SNB and RF parameters are consistent with each other (p-values range from 0.0058–0.11). In addition, for 12/19 chemists, the primary parameters identified by SNB and RF are in agreement with each other. In other words, there was no systematic relation between the parameters reported by the chemists and those indicated by our modeling, although the parameters identified by the SNB and RF classifiers were consistent with each other.

We next assessed to what extent the consensus between chemists with high estimated consensus was enhanced compared to the consensus between the same number of chemists selected randomly when considering the entire dataset of selections ( Fig. S9B and S9C ). The chemists with high estimated consensus (chemists 1, 6, 8, 11, 15, 18, and 19) showed a significantly greater agreement in undesirable compounds ( Fig. S9B ). The agreement in desirable compounds, however, was no greater than the agreement between chemists selected randomly ( Fig. S9C ). This reinforces the notion that while there seems to be agreement in what is undesirable, there does not appear to be agreement in what is desirable.

The cultural consensus model was applied to a subset of fragments (311) with >75% agreement by chemists. The estimated consensus obtained by this method is plotted against the fraction of fragments passed by chemists for the entire survey. Each shape describes the primary SNB parameter used to reproduce chemists’ selections, and the color depicts the ROC score of naïve Bayesian classifiers built using ECFP4 as a descriptor for each chemist. A subset of high consensus chemists is above the dashed grey line.

We then sought to characterize the selection characteristics of chemists who agreed most with the group. We found that chemists with higher estimated consensus tended to select an intermediate fraction of fragments (∼0.2–0.7, Fig. 6 ). This is not entirely intuitive, since the majority of compounds that the CCM was built on were rejected compounds, so we might expect a high rejection rate for chemists with high estimated consensus. We might also suspect that chemists with high estimated consensus rely on the same parameters when making selections. Since the ring topology metric was the most common primary SNB parameter for chemists ( Fig. 2 ), it makes sense intuitively that it should be an important property to chemists with the highest estimated consensus. Indeed, ring topology was identified as the primary SNB parameter for the chemists with the highest estimated consensus (chemists 6, 8, 11, and 19), and as a secondary SNB parameter for the chemists with the next highest estimated consensus (chemists 1, 15, and 18). We also noted that a chemist’s estimated consensus was unrelated to the predictability of the chemist’s selections (color-coded, Fig. 6 ).

We then investigated to what extent individual chemists agreed with the group as a whole on compounds where there appeared to be consensus. The cultural consensus model (CCM) is an ideal method for this purpose since it estimates the knowledge – what we term estimated consensus – of respondents on a scale of 0–1 based on the observed agreement between survey answers. [40] (The cultural consensus theory assumes that high consensus is a sign of knowledge (expertise), and thus high-consensus individuals are termed high-knowledge individuals. We use the cultural consensus model as an atheoretical tool to identify members that agree most with the group, so we term them “high estimated consensus” individuals, rather than “high estimated knowledge” individuals.) In this case the survey answers are the fragment selections. As a prerequisite, a single underlying model explaining respondents’ decisions must first be demonstrated. The CCM as implemented in ANTHROPAC 4.0 [41] was used to test for consensus. As expected, a single underlying model did not fit the entire set of selections. By preselecting a set of high-agreement compounds (>75% agreement, 313 compounds), a one-culture model could be built, as attested by a large ratio of 6.9 between the first and second eigenvalues. In general, an eigenvalue ratio greater than 3 to 1 indicates a single pattern of responses across questions. [42] Importantly, by applying the CCM to the subset of high-consensus compounds, an estimated consensus for each chemist was obtained, which revealed a vast spectrum of agreement with the group, ranging from 0.07 to 0.66. From this analysis we could also identify a subset of chemists who agreed most with the group; from this subset we could further investigate agreement among high-consensus chemists (see below).

Furthermore, NB models were built on the consensus (≥75% agreement) selections of all chemists ( Table S11 – 12 ). Separate models were built to identify consensus “good” compounds and consensus “bad.” Models were built with extended connectivity fingerprints (ECFP4). We anticipate that the features identified by consensus selections of chemists for identifying undesirable compounds will be particularly useful in removing undesirable fragments from large collections of compounds, for example, during compound acquisition or when designing focused in-house screens of fragments.

To further investigate these patterns, we calculated the percentage of chemists in agreement on each compound ( Fig. S9A ). Strikingly, consensus (defined here as 75% of chemists agreeing on acceptance or rejection) was reached for only 8% of the compounds reviewed (313 compounds). Moreover, agreement was asymmetrical; 1% of the compounds were considered good while 7% of the compounds were considered bad ( Fig. S9A ). This is not simply due to a bias in chemists rejecting more compounds than they accept, since on average chemists accepted nearly half (45%) of the compounds. Representative examples of the most undesirable fragments are depicted in Figure S10 .
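
The per-compound agreement calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the 0/1 encoding (1 = accepted), the use of "at least 75%" as the cutoff, and the toy selection matrix are all hypothetical and are not the survey data.

```python
def classify_consensus(selections, threshold=0.75):
    """Label each compound as consensus 'good', consensus 'bad', or 'none'.

    selections: one list of 0/1 votes per chemist, all over the same compounds.
    """
    n_chemists = len(selections)
    labels = []
    for votes in zip(*selections):  # iterate over compounds
        frac_accept = sum(votes) / n_chemists
        if frac_accept >= threshold:
            labels.append("good")
        elif 1 - frac_accept >= threshold:
            labels.append("bad")
        else:
            labels.append("none")
    return labels


# Hypothetical toy data: 4 chemists (rows) reviewing 5 compounds (columns).
toy = [
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 1, 0],
    [0, 0, 0, 1, 1],
]
labels = classify_consensus(toy)
print(labels)
print(sum(label != "none" for label in labels) / len(labels))  # fraction with consensus
```

Counting the "good" and "bad" labels separately gives the kind of asymmetry reported above, where consensus on rejection is more common than consensus on acceptance.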

We next examined how similar individual chemists’ selections were to themselves (consistency) and to each other (consensus) when viewing the same compounds. The modified Tanimoto similarity (S_MT), [38] which ranges from 0 (entirely dissimilar) to 1 (identical), was used to assess the agreement between chemists’ selections. This measure is symmetrical, and therefore equally sensitive to both agreement in selections and rejections. It also takes into account the fraction of selections or rejections for a given comparison; for example, if there is a low number of selections when comparing two chemists, agreement in selections will be weighted more heavily than agreement in rejections. For assessing consistency, a subset of 227 compounds that were present in more than one batch was used. When chemists were compared to themselves, the similarity between selections ranged from 0.37–0.82, with an average of 0.52 ( Fig. S8A ), indicating moderate consistency. To examine consensus between chemists, the entire set of 3,685 unique compounds was used. When chemists’ selections were compared to each other, the similarity ranged from 0.05–0.52, with an average similarity of 0.28 ( Fig. S8B –D); this indicates substantial disagreement about particular fragments. In sum, chemists were moderately internally consistent in their evaluation of compounds, but the consensus between chemists was low.
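
To make the weighting behavior concrete, the sketch below implements one commonly cited form of a modified Tanimoto index, which mixes the Tanimoto coefficient on selections (T1) with the Tanimoto coefficient on rejections (T0) using weights (2 − p)/3 and (1 + p)/3, where p is the fraction of selections. Whether this matches the exact formulation of reference [38] is an assumption, as are the choice of p as the mean selection fraction of the two vectors being compared and the convention that an empty union counts as full agreement.

```python
def _tanimoto(n_common, n_union):
    # Assumed convention: if neither vector has any such entries, treat as full agreement.
    return n_common / n_union if n_union else 1.0


def modified_tanimoto(x, y):
    """Similarity of two equal-length 0/1 selection vectors (1 = selected)."""
    n = len(x)
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # both selected
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)  # both rejected
    ones_x, ones_y = sum(x), sum(y)
    t1 = _tanimoto(n11, ones_x + ones_y - n11)
    t0 = _tanimoto(n00, (n - ones_x) + (n - ones_y) - n00)
    p = (ones_x + ones_y) / (2 * n)  # assumed: mean fraction of selections
    # Few selections (small p) puts more weight on agreement in selections.
    return (2 - p) / 3 * t1 + (1 + p) / 3 * t0


# Hypothetical selection vectors for two chemists over four fragments.
print(modified_tanimoto([1, 0, 0, 0], [1, 1, 0, 0]))
```

The two weights always sum to 1, so identical vectors score 1 and fully complementary vectors score 0, consistent with the range described above.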

One simple metric of agreement is the fraction of compounds selected by each chemist per batch. The fraction of compounds deemed suitable to carry forward varied widely between chemists, ranging from 7% to 97% (average = 45%), though each chemist was relatively consistent from batch to batch (average standard deviation = 7%, Fig. S6A ). This variance between chemists was not related to their ideal library size ( Fig. S7A ) nor linearly related to the number of targets a chemist had previously worked on (R² = 0.05, Fig. S7B ). The fraction passed could, however, be explained by each chemist’s reported selection strategy ( Fig. S7C ). Chemists who reported selecting only the “best” fragments passed a lower fraction of compounds (0.13±0.07) than chemists who reported excluding only the “worst” fragments (0.61±0.34); those who reported intermediate strategies passed an intermediate fraction of compounds (0.39±0.25).

Similar logic can be used to examine agreement on two-parameter models; here, with 36 unique binary combinations of nine parameters, the probability of random agreement is .028. One chemist’s decisions could only be described by a one-parameter model; eleven different two-parameter models were needed to describe the remaining 18 chemists. Of these, more than expected by chance used ring topology plus functional groups (N = 5, p = 0.0001). Likewise, more chemists used ring topology plus hydrogen bond donors/acceptors than expected by chance (N = 4, p = 0.001). No other two-parameter model was observed more than expected by chance.
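
The chance level for two-parameter models can be checked with a short Python sketch. The placeholder parameter names, the helper function, and the use of a one-sided binomial tail over the 18 chemists with two-parameter models are assumptions for illustration; the exact test settings used in the study are not stated here.

```python
from itertools import combinations
from math import comb


def binomial_tail(k, n, p):
    """Probability of observing k or more successes in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Nine placeholder parameter names (hypothetical labels, not the study's descriptors).
parameters = [f"parameter_{i}" for i in range(1, 10)]
pairs = list(combinations(parameters, 2))
chance = 1 / len(pairs)

print(len(pairs))        # number of unordered parameter pairs
print(round(chance, 3))  # chance probability of any one pair

# Assumed setup: 18 chemists, each equally likely to use any pair.
print(binomial_tail(5, 18, chance))  # five chemists sharing one pair
print(binomial_tail(4, 18, chance))  # four chemists sharing one pair
```

Under these assumptions the tail probabilities come out at roughly 1×10⁻⁴ and 1×10⁻³, in line with the values quoted above.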

While 14 parameters were available for constructing models, only 9 parameters were actually observed in the SNB classifiers for each chemist; 5 were observed in the one-parameter models. If preference for each parameter is equally likely, we can take .111 (i.e., 1 out of a possible 9 parameters observed) as a hypothetical random probability of a given chemist preferring a given parameter, and compare the observed distribution to this prediction via binomial probability (i.e., compute whether more chemists prefer a particular model than expected by chance). Doing so, we observed that eight chemists’ best one-parameter model utilized ring topology (p = .0006). Four chemists utilized functional groups, and another four used hydrogen bond donors/acceptors; these distributions of parameter preferences did not differ from chance levels (p = 0.153).
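
The binomial comparison described above can be sketched as follows. The function name and the choice of a one-sided tail over all 19 chemists are assumptions; the sketch is meant only to show how the chance level of 1/9 translates into the reported probabilities.

```python
from math import comb


def prob_at_least(k, n, p):
    """One-sided binomial probability of k or more chemists preferring a parameter."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_chemists = 19  # assumed: all surveyed chemists have a best one-parameter model
p_chance = 1 / 9  # one of nine observed parameters, each equally likely

p_ring = prob_at_least(8, n_chemists, p_chance)  # eight chemists: ring topology
p_four = prob_at_least(4, n_chemists, p_chance)  # four chemists: e.g. functional groups

print(f"{p_ring:.4f}")
print(f"{p_four:.3f}")
```

With these assumptions, eight chemists sharing a parameter is far less likely than chance alone would predict (a probability below 0.001), while four chemists sharing a parameter is unremarkable (about 0.15).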

Because our classifiers revealed which parameters best predicted individual chemists’ responses ( Fig. 2 ), one way in which chemists might show agreement is by relying on the same parameters to guide decisions. For the following analysis, we rely on the SNB classifiers, as their predictive accuracy was on average greater than that of the RF classifiers.

The question of consensus among chemists is a complex one; accordingly we approached it in a number of ways. As a first step, the agreement in parameters used by each chemist during selections was examined. We then investigated the fraction of compounds selected by each chemist. Next, we assessed the similarity of chemists’ selections with themselves (consistency) and with each other (consensus). Finally, we investigated the amount of consensus between chemists’ selections as a group, and applied the cultural consensus model to assess to what extent individual chemists agreed with the group.

In sum, our models show that medicinal chemists appear to have approached a complex decision-making problem regarding the attractiveness of chemical starting points by reducing a massively multidimensional problem space down to one or two salient parameters (or types of information). In some cases, these parameters represent a simple pattern of selections, while in others more complex patterns have been identified, such as multiple dimensions being considered jointly.

The most favorable and unfavorable keys for the RingBonds_AromaticBonds_RingAssemblies (RB_AB_RA) descriptor model, which measures the number of ring bonds (RB), aromatic bonds (AB), and ring assemblies (RA) present in a compound, were examined. Representative scaffolds that correspond to these keys are depicted, and are clustered based on how chemists viewed them. The Bayes score of each key in the model built for each individual chemist is reported in a heat map. The favorable keys receive a positive score, while unfavorable keys receive a negative score.

We then investigated how models built with the same parameter compared between chemists. Seven chemists based their decision largely on ring topology; Figure 5 depicts a subset of the most desirable and undesirable values for a descriptor that jointly measures the number of ring bonds, aromatic bonds, and ring assemblies present in a fragment. Representative ring systems that match each descriptor value are depicted. Once again, we see that interdependencies between features are present in ring system preferences. For example, for chemist 19, fused aromatic 6-membered rings (11_11_1) are desirable, but when they are connected to an aliphatic 6-membered ring (17_11_2), they are undesirable. We note that the rings are grouped together in a chemically intuitive way when they are clustered based on the chemists’ preferences. The chemists were also clustered based on which descriptor values they preferred, revealing the underpinnings of some of the similarities (S_MT) observed between chemists (discussed below). For example, one of the highest similarities observed was between chemists 11 and 19 (S_MT = 0.47, Fig. S8 ), and for the subset of values from chemists’ models depicted in Figure 5 , they are also the most similar and cluster together first. The ring topology preferences of chemists 10 and 16, on the other hand, are in clear contrast with each other. For example, chemist 10 favors 1–2 ring structures that are not fused, while chemist 16 disfavors these ( Fig. 5 ). Furthermore, chemist 16 highly favors certain fused tricyclic ring structures (17_12_1, 16_11_1, and 16_6_1, Fig. 5 ) which are disfavored by chemist 10. These differences explain at least in part the low similarity between the overall selections of chemists 10 and 16 (S_MT = 0.19, Fig. S8 ). Thus, even if chemists use the same parameter to assess compounds, their individual preferences can be quite different. We explore the question of consensus between chemists, which these comparisons foreshadow, in depth in the next section.

Keys that represent the presence (black) or absence (white) of chemical substructures are ordered from negative (bad) on the left to positive (good) values on the right ( A ). The worst and best substructure keys are zoomed in on ( B ). Specific chemical substructures (tertiary amine – blue, aromatic heteroatom – violet, hydroxyl – aqua, and carboxylic acid - orange) are highlighted for one of the worst keys and two of the best keys, and illustrative examples of fragments that would be described by these keys are depicted ( C ).

In contrast to these straightforward preferences, we also observed models that revealed more complex preferences, revealing interdependencies between features. For example, the primary SNB parameter for chemist 1 was identified as functional groups ( Fig. 2 ). Chemist 1’s selections were based on specific combinations of these functional groups ( Fig. 4 ). For example, compounds with hydroxyl groups and tertiary amines were deemed favorable, but if aromatic heteroatoms were also present, they were deemed unfavorable. In fact, chemist 1 in general disfavored compounds containing aromatic heteroatoms. If, however, fragments containing aromatic heteroatoms also contain a carboxylic acid, the compound was seen as favorable. This may be due to the carboxylic acid increasing the attractiveness of the otherwise unfavorable fragment since it might be seen as an especially desirable chemical handle. Importantly, these interdependencies would not have been recognized by our SNB classifiers if the functional groups were considered independently rather than jointly.

A : Histogram of number of atoms of fragments selected by chemist 3 as good (green) or bad (red) starting points for drug discovery campaigns. Frequencies are normalized by the total number of selected or unselected compounds, respectively. B : Bayesian score versus number of atoms for minimal Bayesian model built for chemist 3. A positive score indicates a favorable number of atoms, while a negative score indicates an unfavorable number of atoms. C : Histogram of molecular polar surface area of fragments selected by chemist 12 as good (green) or bad (red) starting points for drug discovery campaigns. Frequencies are normalized by the total number of selected or unselected compounds, respectively. D : Bayesian score versus molecular polar surface area bins for SNB classifier built for chemist 12.

We found that in some cases when SNB classifiers were applied to chemists’ decisions, models revealed relatively straightforward preferences. For instance, compounds above a certain cutoff for a particular property are favored, while those below it are disfavored, or vice versa. For chemist 3, size (as measured by the number of atoms) was the most important parameter ( Fig. 2 ); indeed larger fragments were more desirable ( Fig. 3A–B ). In contrast, modeling revealed polarity to be the primary parameter for chemist 12 ( Fig. 2 ), who showed a strong preference for compounds with a molecular polar surface area less than ∼70 Å² ( Fig. 3C–D ).

One of the advantages of our approach is that the SNB classifiers built for each chemist could be visually investigated to bring to light each chemist’s preferences in detail. It should be noted that two models that use the same number of parameters can vary immensely in the complexity or amount of information that they use, although the type of information is the same. For example, two chemists might select fragments based on size and polarity. One chemist might use a complex strategy that exploits interdependencies between these parameters (“large and polar” or “small and nonpolar” compounds are desirable), while another might use a simple strategy in which these parameters are considered independently (“large” is desirable, and “highly polar” is desirable). We verified that our SNB classifiers could represent both of these strategies (see Methods and Fig. S2 ).

The primary parameters for the classifiers are depicted as stars, and the secondary parameters are depicted as circles. The one-tailed Fisher exact probability test (p) is reported for each parameter (except chains and charge), indicating that the SNB and RF parameters agree with each other, while the self-reported parameters are independent of either classifier’s parameters.

The types of parameters used by the SNB and RF classifiers are depicted in Figure 2: we refer to the most important parameter as primary (stars), and all other parameters used as secondary (circles). The descriptors that underlie these parameters are reported in Tables S9 and S10 . To our surprise, the majority of the classifiers only used 1–2 types of information. For example, for the SNB classifiers, the majority of classifiers used 2 parameters (16 chemists), while only a few used 1 (1 chemist) or 3 (2 chemists) parameters. The RF classifiers suggest even fewer parameters are important: the majority of classifiers use 1 (9 chemists) or 2 (9 chemists) parameters, while only 1 classifier uses 3 parameters. This suggests that medicinal chemists reduce a complicated problem to a more tractable one by generally assessing just 1–2 parameters (or types of information) rather than several.
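The Figure 2 caption reports a one-tailed Fisher exact test for each parameter. As a hedged illustration of that kind of test, the sketch below computes the one-tailed probability for a 2×2 table directly from the hypergeometric distribution. How the tables were actually assembled is not stated here; the layout assumed in the comments (chemists cross-classified by whether each of two sources uses a given parameter) and the function name `fisher_one_tailed` are assumptions for illustration only.

```python
from math import comb


def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (hypothetical):
        a -- chemists for whom both sources use the parameter
        b -- only the first source uses it
        c -- only the second source uses it
        d -- neither source uses it
    Returns the probability of observing an overlap of at least `a`
    when the row and column totals are held fixed.
    """
    row1 = a + b          # chemists flagged by the first source
    col1 = a + c          # chemists flagged by the second source
    n = a + b + c + d     # all chemists
    total = comb(n, col1)
    tail = 0
    for k in range(a, min(row1, col1) + 1):
        # Number of tables with overlap k and the same margins.
        tail += comb(row1, k) * comb(n - row1, col1 - k)
    return tail / total


# Invented counts for eight hypothetical chemists.
p_example = fisher_one_tailed(3, 1, 1, 3)
```

A small p-value under this layout would indicate that the two sources flag the same chemists more often than expected by chance, whereas a large value is consistent with independence.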

As a first step, we assessed the predictive accuracy of the SNB and RF classifiers compared to benchmark classifiers built with state-of-the-art descriptors that are not as interpretable ( Figure 1 ). For the benchmark classifiers, we trained classifiers with extended connectivity fingerprints (ECFP4) and simple physical properties (ALogP, Molecular_Weight, Num_H_Donors, Num_H_Acceptors, Num_Rotatable_Bonds, and Molecular_FractionalPolarSurfaceArea). The interpretable SNB and RF models compared favorably in predictive accuracy, and in many cases outperformed the corresponding benchmark. The high predictive accuracy of the majority of the classifiers supports the notion that most of the chemists evaluate compounds in an internally consistent manner. For example, for the SNB benchmark, 15/19 models yielded a ROC score >0.7 ( Figure 1A , black).
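For readers less familiar with the ROC score used as the accuracy measure, the following sketch shows one standard way to compute the area under the ROC curve: as the fraction of (selected, unselected) fragment pairs in which the selected fragment receives the higher model score, with ties counted as one half. The function name `roc_auc` and the toy scores are hypothetical, and this is not a reconstruction of the software used in the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores -- model score for each fragment (higher = more likely selected)
    labels -- truthy if the chemist actually selected the fragment
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one selected and one unselected item")

    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0      # selected fragment ranked above unselected
            elif p == q:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Invented scores: one selected fragment is ranked below an unselected one.
auc_example = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the >0.7 threshold quoted above marks models that rank a chemist’s selections well above chance.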

Chemists (N = 19) were asked to select desirable fragments from 8 batches of 500 fragments each. In order to determine the number and type of properties that best predicted each chemist’s decisions, we built semi-naïve Bayesian (SNB) and Random Forest (RF) classifiers based on each individual chemist’s selections. Descriptors relevant to medicinal chemistry were used to train the classifiers, so that the resulting models could readily be related to the types of information (or parameters) that were important during selections.

Discussion

Overview In this paper we explored how medicinal chemists categorized chemical fragments as desirable or undesirable starting points for development into lead compounds. This allowed us to not only investigate the cognitive basis of this important aspect of drug discovery, but also to address basic issues in cognitive science. We focused on three major questions: 1) to what extent, if any, do chemists simplify the problem of identifying promising chemical fragments to move forward in the discovery process? 2) Do chemists agree with each other about the criteria used for such decisions? 3) Can chemists accurately report the criteria they use for such decisions?

Reducing Complexity Our results clearly show that chemists greatly reduced the complexity of the problem they were solving. Potentially, one could utilize dozens of parameters (or types of information) to make decisions about fragment suitability. We specifically queried 14 possible parameters in our modeling, 9 of which were used at least once by at least 1 chemist according to either the SNB or RF classifiers. Strikingly, our modeling suggests that the vast majority of chemists only used 1–2 parameters to categorize compounds. In other words, chemists transformed a massively complex categorization problem into a tractable one- or two-dimensional problem. This does not seem to be a bias of our approach, since applying our method to simulated classifiers indicated that we could correctly identify at least 4 parameters used in categorization. Furthermore, we used two types of orthogonal classification algorithms to reach these conclusions. It should also be pointed out that SNB models using only 1 parameter can capture rather complex preferences, as in the case of chemist 1’s functional group model. Even so, it is clear that a one-parameter model does not use all of the types of information that are available. Category formation based on one dimension, as opposed to many, has been observed in previous psychology experiments as well, even when subjects were asked to use all dimensions when categorizing items [43].

Consensus among Chemists We found evidence of moderate agreement among medicinal chemists with respect to the parameters that best modeled their decisions about chemical fragments. For example, for the SNB classifiers, eight chemists’ primary parameter was ring topology, and out of 36 possible two-parameter models, two accounted for 47% of chemists. However, we found little agreement with respect to decisions about particular fragments. Only 8% of fragments were accepted or rejected by more than 75% of chemists, the similarity among chemists’ decisions was low, and the cultural consensus model failed to reveal a single underlying model of chemists’ decisions for the complete fragment set. In other words, even if chemists used the same feature to categorize compounds–which they generally did–they often preferred different values for these features. Moreover, more agreement among chemists was observed regarding what constitutes an undesirable fragment. We also applied the cultural consensus model to identify individuals who agreed the most with the group as a whole, and to assess the amount of agreement between the chemists. Applying the model to a subset of compounds with high agreement between chemists (≥75%) was necessary in order to obtain a one-culture model. It should be noted that the majority of these compounds were deemed undesirable (265/313, Fig. S9A). When we looked at the agreement on desirable and undesirable fragments (for the entire set of survey compounds) between a subset of chemists with high estimated consensus versus a subset of randomly selected chemists, the agreement in the fraction of undesirable compounds was greater, but there was no difference in the fraction of desirable compounds (Fig. S9B–C). These results imply that while there is some agreement regarding undesirable fragments, there does not seem to be a significant amount of agreement regarding desirable fragments.
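The per-fragment agreement statistic can be illustrated with a short sketch. It assumes, for simplicity, that every chemist gave an accept/reject decision on every fragment and that "more than 75%" is a strict inequality; neither detail is confirmed here, and the function name `high_agreement_fraction` and the toy votes are hypothetical.

```python
def high_agreement_fraction(decisions, threshold=0.75):
    """Fraction of fragments on which chemists largely agree.

    decisions -- one list per fragment; each entry is True (accepted)
                 or False (rejected) for one chemist
    threshold -- a fragment counts as high-agreement when strictly more
                 than this share of chemists accepted it, or strictly
                 more than this share rejected it (assumed convention)
    """
    hits = 0
    for votes in decisions:
        frac_accept = sum(votes) / len(votes)
        frac_reject = 1.0 - frac_accept
        if frac_accept > threshold or frac_reject > threshold:
            hits += 1
    return hits / len(decisions)


# Invented votes from four hypothetical chemists on four fragments.
toy_votes = [
    [True, True, True, True],      # unanimous accept
    [False, False, False, False],  # unanimous reject
    [True, True, False, False],    # evenly split
    [True, True, True, False],     # exactly 75% accept: not counted
]
agreement = high_agreement_fraction(toy_votes)
```

Counting accepts and rejects separately in the same loop would additionally show whether the high-agreement fragments are mostly rejections, which is the pattern reported above.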
This may be an example of negativity bias – “bad” information tends to be processed longer than “good” information, and stronger memories are formed of “bad” items. [44], [45] Perhaps chemists have retained more knowledge of chemical motifs or properties that the literature refers to as undesirable, or that they have had bad personal experiences with, and also paid more attention to these undesirable motifs or properties while they were processing the compounds. In some sense this finding also seems to contradict the notion that chemists tend to recycle privileged scaffolds that they find attractive, ultimately constraining the diversity of chemical series and libraries. [46] It suggests that while individuals have preferences for specific scaffolds, as evidenced by the highly predictive SNB and RF classifiers that were built, these biases are not often shared between chemists. As mentioned in the introduction, a lack of consensus does not necessarily reflect a lack of expertise, but rather may be a result of the particular problem space under investigation. [24], [26] Three structural factors that contribute to lack of consensus among experts are especially relevant to compound prioritization in drug discovery. One factor that leads to low consensus is if a single solution does not exist. [24] This is especially true in drug discovery, as evidenced by multiple drugs often being developed for a single target. In light of this, chemists may be playing to their own strengths. In the same way that a master chess player must navigate his chess pieces towards victory, and opens a game in a manner that complements his own style of play, a medicinal chemist, in the context of a project team, must navigate the path of compounds that he selects to work with towards more optimal properties. The path that one chemist might take likely differs from another’s, due to the diversity of knowledge and skill sets that an individual brings to the table.
A second factor that leads to low consensus is if the basic science in a field is still evolving. [24] This is particularly true of drug discovery – for example, some topics that have recently garnered much attention and are especially relevant to the current paper are which scaffolds are the most promising in drug discovery, [47] what are the optimal properties of chemical starting points [48] or drug candidates, [49], [50] what are the actual properties of compounds explored by medicinal chemists and how have they varied over time, [51] and how does the subset of chemical reactions that tend to be employed in drug discovery constrain the exploration of chemical space. [52], [53] These studies bear witness to the fact that there is still a great deal to learn about the basic science of drug discovery. A third structural factor that results in low consensus is when experts work in dynamic situations with evolving constraints. [24] In drug discovery, the intended targets of therapeutics are constantly changing, and thus the chemical matter employed to perturb these targets is constantly evolving as well. Furthermore, the constraints placed on what defines a suitable therapeutic compound have changed over time. More than ever, researchers are aware of undesirable on- or off-target effects, and in many cases are able to interrogate them, ultimately raising the bar for target specificity and minimal toxicity. Indeed, it has been argued that many historically successful therapeutics such as aspirin and acetaminophen would not be considered suitable therapeutics in the current drug discovery environment [54].

Tying Complexity Reduction and Consensus Together: Goal Derived Categories One interesting way to frame both the complexity reduction and consensus results is in terms of goal-derived categories. Goal-derived categories unite otherwise diverse entities in the service of a particular goal; for instance, shirts, novels, and toothbrushes are all things to pack in a suitcase. [55] Like common taxonomic categories (e.g., dog, tree, car), goal-derived categories have been shown to exhibit prototype structure (i.e., some exemplars are more prototypical or “better” members of the category than others). However, different factors determine prototype structure for the two types of categories. The best examples of taxonomic categories tend to be similar to many other members; they represent the central tendency of the category. In contrast, the best examples of goal-derived categories tend to be instances that satisfy specific ideals–i.e., instances that have characteristics that serve the goal optimally. Another determinant of typicality for goal-derived categories is frequency of instantiation, or how often an instance is encountered as a member of the category. It’s plausible that our chemists are deciding whether or not the target fragments are members of the goal-derived category promising fragments for drug discovery follow-up. If so, chemists should make decisions based on how well fragments satisfy ideals, and their frequency of instantiation as promising leads. [56] In our case, ideals are characteristics that fragments should possess if they are considered desirable for lead development (e.g., synthetic accessibility, facile derivatization, etc.), whereas the frequency of instantiation could be thought of as the number of times a chemist encounters a compound or chemical motif and associates it with being desirable or undesirable for lead development. 
Our results show that although chemists tend to converge on a small subset of possible parameters for making these decisions, they show little agreement on the optimal values for these parameters. This lack of consensus could arise from several sources. First, the complexity of what constitutes an attractive starting compound for optimization in the drug-discovery process may have led to differences in the ideals that chemists sought to optimize. Second, people often optimize more than one ideal during categorization, [55] and it is likely that in our case individual chemists may also weight the importance of multiple ideals differently. For example, one chemist might place more emphasis on making sure a fragment can be easily evolved, while another might place more emphasis on reducing potential toxicity. Furthermore, chemists may also associate different parameters with these ideals. For instance, two chemists may both desire a fragment that specifically interacts with a target, and one chemist may view shape as an important feature, while another may view hydrogen bonding interactions as more important. One reason that chemists might share the same ideals (e.g., synthetic ease) while favoring different values for these ideals may be their personal experience (e.g., the synthetic transformations they are most familiar with). In other words, the distribution of frequencies of instantiation is undoubtedly different for individuals, and this may be reflected by different optimal values. If chemists have worked in different target areas, they may have been exposed to different chemotypes or functional groups. [47], [57] A follow-up questionnaire was employed to identify which target areas survey takers had experience in (Fig. S12). The diversity of backgrounds that was observed may have led chemists to view different motifs that are commonly encountered while working on specific drug target areas as “druglike,” privileged, or easy to work with.
It is also likely that even if chemists have been exposed to the same target classes during their professional careers, they may extract different features from desirable compounds during learning based on their backgrounds [58], [59]. There is likely a complex relationship between a chemist’s ideals and the parameters that were identified by the SNB and RF classifiers as indicative of their selections. In specific cases, however, by visually inspecting the individual SNB classifiers, it is tempting to extrapolate ideals for individual chemists based on the ideal’s impression upon optimal values for specific parameters. For example, in one model (chemist 12), compounds with a polar surface area below a certain threshold are desirable, and those above it are undesirable. This ideal has been stated in the drug design literature: the polar surface area of a drug-like compound should not be too high, as it negatively impacts oral bioavailability [60], [61].

Chemists’ Awareness of Decision Criteria Chemists were largely unaware of the factors that influenced their decisions about compounds. Chemists reported that they relied on more parameters than they actually did, according to the SNB and RF classifiers, and there was little agreement overall between the properties chemists identified and the parameters that predicted their decisions. We should point out that in specific instances, parts of the self-reports were extremely accurate. For example, chemist 10 disclosed a list of features largely related to the ring topology parameter. This list was written down before evaluating the first set of compounds, and was used as a reminder throughout the exercise. Although the reported features were evident in chemist 10’s selections, several other self-reported parameters were not identified as important. In stark contrast to chemist 10 is a chemist who reported that sometimes, in addition to the specific properties they reported, they trusted their “gut feeling.” Perhaps, since a predictive model could be built for this chemist, this “gut feeling” is really based on previous unconscious learning. As discussed in the introduction, such lack of awareness of the factors affecting decisions is fairly characteristic of human decision-making in complex situations. Furthermore, experts have also been described as inarticulate about the process used to make decisions. [62] In our study, the intuition was clearly rooted in expertise: a compound is unlikely to “strike” anyone as promising or unpromising unless one has an extensive record of performing such complex evaluations. This raises an interesting question: would novice chemists be more or less aware of the parameters they based their decisions on than experts proved to be?
If lack of expertise makes compound evaluation a slower, more effortful process, we can expect novices to be more accurate in reporting the parameters that influenced their decisions, unless they are put under time pressure that forces them to rely on their fast (non-expert) intuitive thinking. Another question is why the participants overestimated the number of parameters they relied upon. Perhaps, if the self-reports were based on post hoc rationalization of already made decisions, the reports were driven by a meta-expectation about the average number of parameters an expert should consider in such a situation in order to arrive at a justified decision. If chemists reading this paper find themselves surprised at the small number of parameters their colleagues used, their reaction informally testifies to the existence of that very meta-expectation.