How will this molecule smell? We still cannot reliably predict what a given substance will smell like. Keller et al. launched an international crowd-sourced competition in which many teams tried to predict how the smell of a molecule will be perceived by humans. The teams were given access to a database of responses from subjects who had sniffed a large number of molecules and been asked to rate each smell across a range of different qualities. The teams were also given a comprehensive list of the physical and chemical features of the molecules smelled. The teams produced algorithms to predict the correspondence between the quality of each smell and a given molecule. The best models that emerged from this challenge could accurately predict how a new molecule would smell. Science, this issue p. 820

Abstract It is still not possible to predict whether a given molecule will have a perceived odor or what olfactory percept it will produce. We therefore organized the crowd-sourced DREAM Olfaction Prediction Challenge. Using a large olfactory psychophysical data set, teams developed machine-learning algorithms to predict sensory attributes of molecules based on their chemoinformatic features. The resulting models accurately predicted odor intensity and pleasantness and also successfully predicted 8 among 19 rated semantic descriptors (“garlic,” “fish,” “sweet,” “fruit,” “burnt,” “spices,” “flower,” and “sour”). Regularized linear models performed nearly as well as random forest–based ones, with a predictive accuracy that closely approaches a key theoretical limit. These models help to predict the perceptual qualities of virtually any molecule with high accuracy and also reverse-engineer the smell of a molecule.

In vision and hearing, the wavelength of light and frequency of sound are highly predictive of color and tone. In contrast, it is not currently possible to predict the smell of a molecule from its chemical structure (1, 2). This stimulus-percept problem has been difficult to solve in olfaction because odors do not vary continuously in stimulus space, and the size and dimensionality of olfactory perceptual space is unknown (1, 3, 4). Some molecules with very similar chemical structures can be discriminated by humans (5, 6), and molecules with very different structures sometimes produce nearly identical percepts (2). Computational efforts developed models to relate chemical structure to odor percept (2, 7–11), but many relied on psychophysical data from a single 30-year-old study that used odorants with limited structural and perceptual diversity (12, 13).

Twenty-two teams were given a large, unpublished psychophysical data set collected by Keller and Vosshall from 49 individuals who profiled 476 structurally and perceptually diverse molecules (14) (Fig. 1A). We supplied 4884 physicochemical features of each of the molecules smelled by the subjects, including atom types, functional groups, and topological and geometrical properties that were computed using Dragon chemoinformatic software (version 6, Talete S.r.l., see supplementary materials) (Fig. 1B).

Fig. 1 DREAM Olfaction Prediction Challenge. (A) Psychophysical data. (B) Chemoinformatic data. (C) DREAM Challenge flowchart. (D) Individual and population challenges. (E) Hypothetical example of psychophysical profile of a stimulus. (F) Connection strength between 21 attributes for all 476 molecules. Width and color of the lines show the normalized strength of the edge. (G) Perceptual variance of 21 attributes across 49 individuals for all 476 molecules at both concentrations sorted by Euclidean distance. Three clusters are indicated by green, blue, and red bars above the matrix. (H) Model Z-scores, best performers at left. (I and J) Correlations of individual (I) or population (J) perception prediction sorted by team rank. The dotted line represents the P < 0.05 significance threshold with respect to random predictions. The performance of four equations for pleasantness prediction suggested by Zarzo (10) [from top to bottom: equations (10, 9, 11, 7, 12)] and of a linear model based on the first seven principal components inspired by Khan et al. (8) are shown.

Using a baseline linear model developed for the challenge and inspired by previous efforts to model perceptual responses of humans (8, 11), we divided the perceptual data into three sets. Challenge participants were provided with a training set of perceptual data from 338 molecules that they used to build models (Fig. 1C). The organizers used perceptual data from an additional 69 molecules to build a leaderboard to rank performance of participants during the competition. Toward the end of the challenge, the organizers released perceptual data from the 69 leaderboard molecules so that participants could get feedback on their model and to enable refinement with a larger training + leaderboard data set. The remaining 69 molecules were kept as a hidden test set available only to challenge organizers to evaluate the performance of the final models (Fig. 1C). Participants developed models to predict the perceived intensity, pleasantness, and usage of 19 semantic descriptors for each of the 49 individuals and for the mean and standard deviation across the population of these individuals (Fig. 1, D and E).

We first examined the structure of the psychophysical data using the inverse of the covariance matrix (15) calculated across all molecules as a proxy for connection strength between each of the 21 perceptual attributes (Fig. 1F and fig. S1). This yielded a number of strong positive interactions, including those between “garlic” and “fish”; “musky” and “sweaty”; and “sweet” and “bakery”; and among “fruit,” “acid,” and “urinous”; and a negative interaction between pleasantness and “decayed” (Fig. 1F and fig. S1A). The perception of intensity had the lowest connectivity to the other 20 attributes. To understand whether a given individual used the full rating scale or a restricted range, we examined subject-level variance across the ratings for all molecules (Fig. 1G). Applying hierarchical clustering on Euclidean distances for the variance of attribute ratings across all the molecules in the data set, we distinguished three clusters: subjects that responded with high-variance for all 21 attributes (left cluster in green), subjects with high-variance for four attributes (intensity, pleasantness, “chemical,” and “sweet”) and either low variance (middle cluster in blue) or intermediate variance (right cluster in red) for the remaining 17 attributes (Fig. 1G).
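To illustrate why the inverse covariance matrix can act as a proxy for connection strength, the standard relation between that matrix and partial correlations is sketched below. The symbols Σ, Θ, and ρ are introduced here for illustration only, and the exact normalization used for Fig. 1F is an assumption, since the text does not state it.

```latex
\Theta = \Sigma^{-1}, \qquad
\rho_{ij \mid \mathrm{rest}} = -\,\frac{\Theta_{ij}}{\sqrt{\Theta_{ii}\,\Theta_{jj}}}, \quad i \neq j
```

Here Σ would be the 21 × 21 covariance matrix of the perceptual attributes computed across molecules, and ρ is the correlation between attributes i and j after controlling for the other 19. Under this convention, a strongly negative off-diagonal entry of Θ corresponds to a strong positive direct association between two attributes.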

We assessed the performance of models submitted to the DREAM Challenge by computing for each attribute the correlation between the predictions of the 69 hidden test molecules and the actual data. We then calculated a Z-score by subtracting the average correlations and scaling by the standard deviation of a distribution based on a randomization of the test-set molecule identities. Of the 18 teams who submitted models to predict individual perception, Team GuanLab (author Y.G.) was the best performer with a Z-score of 34.18 (Fig. 1H and table S1). Team IKW Allstars (author R.C.G.) was the best performer of 19 teams to submit models to predict population perception, with a Z-score of 8.87 (Fig. 1H and table S1). The aggregation of all participant models gave Z-scores of 34.02 (individual) and 9.17 (population) (Fig. 1H), and a postchallenge community phase where initial models and additional molecular features were shared across teams gave even better models with Z-scores of 36.45 (individual) and 9.92 (population) (Fig. 1H).
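The scoring procedure described above can be sketched for a single attribute as follows. This is a minimal illustration, not the challenge's scoring code: the function names, the number of shuffles, and the fixed seed are assumptions, and the actual challenge combined scores across attributes.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def permutation_z_score(predicted, observed, n_shuffles=1000, seed=0):
    """Z-score of the observed correlation against a null distribution
    built by shuffling which molecule each observed rating belongs to."""
    rng = random.Random(seed)  # fixed seed (assumption) for reproducibility
    r_obs = pearson(predicted, observed)
    shuffled = list(observed)
    null = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # randomize molecule identities
        null.append(pearson(predicted, shuffled))
    return (r_obs - statistics.fmean(null)) / statistics.stdev(null)


# Synthetic ratings for 69 hypothetical test molecules (not real data).
observed = [i % 7 + 0.1 * i for i in range(69)]
predicted = [v + ((i * 37) % 11 - 5) * 0.2 for i, v in enumerate(observed)]
z = permutation_z_score(predicted, observed)
```

Because the null distribution comes from shuffled molecule identities, the Z-score expresses how far the model's correlation lies above what random assignments of predictions to molecules would achieve.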

Predictions of the models for intensity were highly correlated with the observed data for both individuals (r = 0.56; t test, P < 10⁻²²⁸) and the population (r = 0.78; P < 10⁻⁹) (Fig. 1, I and J). Pleasantness was also well predicted for individuals (r = 0.41; P < 10⁻¹²³) and the population (r = 0.71; P < 10⁻⁸) (Fig. 1, I and J). The 19 semantic descriptors were more difficult to predict, but the best models performed respectably (individual: r = 0.21; P < 10⁻³³; population: r = 0.55; P < 10⁻⁵) (Fig. 1, I and J). Previously described models to predict pleasantness (8, 10) performed less well on this data set than our best model (Fig. 1J). To our knowledge, there are no existing models to predict the 19 semantic descriptors.

Random-forest (Fig. 2A and table S1) and regularized linear models (Fig. 2B and table S1) outperformed other common predictive model types for the prediction of individual and population perception (Fig. 2, fig. S2, and table S1). Although the quality of the best-performing model varied greatly across attributes, it was exceptionally high in some cases (Fig. 2C), and always considerably higher than chance (dotted line in Fig. 1I), while tracking the observed perceptual values (fig. S2 for population prediction). In contrast to most previous studies that attempted to predict olfactory perception, these results all reflect predictions of a hidden test set and avoid the pitfall of inflated correlations due to overfitting of the experimental data.

Fig. 2 Predictions of individual perception. (A) Example of a random-forest algorithm that utilizes a subset of molecules from the training set to match a semantic descriptor (e.g., garlic) to a subset of molecular features. (B) Example of a regularized linear model. For each perceptual attribute yᵢ, a linear model utilizes molecular features xᵢ,ⱼ weighted by βᵢ to predict the psychophysical data of 69 hidden test-set molecules, with sparsity enforced by the magnitude of λ. (C) Correlation values of best-performer model across 69 hidden test-set molecules, sorted by Euclidean distance across 21 perceptual attributes and 49 individuals. (D) Correlation values for the average of all models (red dots, mean ± SD), best-performing model (white dots), and best-predicted individual (black dots), sorted by the average of all models. (E) Prediction correlation of the best-performing random-forest model plotted against measured standard deviation of each subject’s perception across 69 hidden test-set molecules for the four indicated attributes. Each dot represents one of 49 individuals. (F) Correlation values between prediction correlation and measured standard deviation for 21 perceptual attributes across 49 individuals, color coded as in (E). The dotted line represents the P < 0.05 significance threshold obtained from shuffling individuals.
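One common form of a sparsity-inducing regularized linear model is the lasso, written out below as an illustration of the idea in panel (B). The L1 penalty is an assumption, since the caption says only that sparsity is controlled by λ, and the indexing (m for molecules, j for features, β₀ for an intercept) is introduced here and may differ from the caption's notation.

```latex
\hat{\beta} = \arg\min_{\beta_0,\,\beta}\;
\sum_{m=1}^{M} \Bigl( y_m - \beta_0 - \sum_{j=1}^{p} x_{m,j}\,\beta_j \Bigr)^{2}
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert
```

Here y_m would be the rating of one perceptual attribute for molecule m, and x_{m,j} the value of molecular feature j for that molecule. Larger values of λ drive more coefficients β_j exactly to zero, so fewer molecular features contribute to the prediction.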

The accuracy of predictions of individual perception for the best-performing model was highly variable (Fig. 2C), but the correlation of six of the attributes was above 0.3 (white circles in Fig. 2D). The best-predicted individual showed a correlation above 0.5 for 16 of 21 attributes (Fig. 2D). We asked whether the usage of the rating scale (Fig. 1G) could be related to the predictability of each individual. Overall, we observed that individuals using a narrow range of attribute ratings—measured across all molecules for a given attribute—were more difficult to predict (Fig. 2, E and F, derived from the variance in Fig. 1G). The relations between range and prediction accuracy did not hold for intensity and pleasantness (Fig. 2, E and F).

We next compared the results of predicting individual and population perception. The seven best-predicted attributes overall (intensity, “garlic,” pleasantness, “sweet,” “fruit,” “spices,” and “burnt”) were the same for both individuals and the population (Fig. 2D and Fig. 3A except “fish”). Similarly, the seven attributes that were the most difficult to predict (“acid,” “cold,” “warm,” “wood,” “urinous,” “chemical,” and “musky”) were the same for both individual and population predictions (Figs. 2D and 3A), and except for a low correlation for “warm,” these attributes are anticorrelated or uncorrelated to the “familiarity” attribute (14). This suggests some bias in the predictability of more familiar attributes, perhaps due to a better match to a well-defined reference molecule (14), and that, in this categorization, individual perceptions are similar across the population. For the population predictions, the first 10 attributes have a correlation above 0.5 (Fig. 3A). The connectivity structure in Fig. 1F follows the model’s performance for the population (Fig. 3A). “Garlic”-“fish” (P < 10⁻⁴), “sweet”-“fruit” (P < 10⁻³), and “musky”-“sweaty” (P < 10⁻³) are pairs with strong connectivity that were also similarly difficult to predict.

Fig. 3 Predictions of population perception. (A) Average of correlation of population predictions. Error bars, SDs calculated across models. (B) Ranked prediction correlation for 69 hidden test-set molecules produced by aggregated models (open black circles; gray bars, SD) and the average of all models (solid black dots; black bars, SD). (C to E) Prediction correlation with increasing number of molecular features using random-forest (red) or linear (black) models. Attributes are ordered from top to bottom and left to right by the number of features required to obtain 80% of the maximum prediction correlation using the random-forest model. Plotted are intensity and pleasantness (C), and attributes that required six or fewer (D) or more than six features (E). The combined training + leaderboard set of 407 molecules was randomly partitioned 250 times to obtain error bars for both types of models.

We analyzed the quality of model predictions for specific molecules in the population (Fig. 3B). The correlation between predicted and observed attributes exceeded 0.9 (t test, P < 10⁻⁴) for 44 of 69 hidden test-set molecules when we used aggregated model predictions, and 28 of 69 when we averaged all model correlations (table S1). The quality of predictions varied across molecules, but for every molecule, the aggregated models exhibited higher correlations (Fig. 3B). The two best-predicted molecules were 3-methyl cyclohexanone followed by ethyl heptanoate. Conversely, the five molecules that were most difficult to predict were l-lysine and l-cysteine, followed by ethyl formate, benzyl ether, and glycerol (Fig. 3B and fig. S3).

To better understand how the models successfully predicted the different perceptual attributes, we first asked how many molecular features were needed to predict a given population attribute. Although some attributes required hundreds of features to be optimally predicted (Fig. 3, C to E), both the random-forest and linear models achieved prediction quality of at least 80% of that optimum with far fewer features. By that measure, the algorithm to predict intensity was the most complex, requiring 15 molecular features to reach the 80% threshold (Fig. 3C). Fish was the simplest, requiring only one (Fig. 3D). Although Dragon features are highly correlated, these results are remarkable because even those attributes needing the most molecular features to be predicted required only a small fraction of the thousands of chemoinformatic features.
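The 80% criterion used above can be made concrete with a short sketch. The function name and the example curve are hypothetical, and the sketch assumes the maximum prediction correlation is positive.

```python
def features_needed(correlation_by_n, fraction=0.8):
    """Smallest number of features whose prediction correlation reaches
    `fraction` of the best correlation seen on the curve.

    correlation_by_n maps a feature count to a prediction correlation.
    """
    best = max(correlation_by_n.values())
    for n in sorted(correlation_by_n):
        if correlation_by_n[n] >= fraction * best:
            return n
    return None  # not reached when best is positive


# Made-up curve for one attribute (not values from the study).
curve = {1: 0.20, 2: 0.35, 5: 0.50, 10: 0.58, 100: 0.60}
n_required = features_needed(curve)
```

In this made-up example the maximum correlation is 0.60, so the threshold is 0.48, and the first feature count to reach it is 5.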

We asked what features are most important for predicting a given attribute (figs. S4 to S6 and table S1). The Dragon software calculates a large number of molecular features but is not exhaustive. In a postchallenge phase (triangles in Fig. 1H), four of the original teams attempted to improve their model predictions by using additional features. These included Morgan (16) and neighborhood subgraph pairwise distance kernel (NSPDK) (17), which encode features through the presence or absence of particular substructures in the molecule; experimentally derived partition coefficients from EPI Suite (18); and the common names of the molecules. We used cross-validation on the whole data set to compare the performance of the same models using different subsets of Dragon and these additional molecular features. Only Dragon features combined with Morgan features yielded decisively better results than Dragon features alone, both for random-forest (Fig. 4A) and linear (Fig. 4B) models. We then examined how the random-forest model weighted each feature (table S1 for a similar analysis using the linear model). As observed previously, intensity was negatively correlated with molecular size but was positively correlated with the presence of polar groups, such as phenol, enol, and carboxyl features (fig. S6A) (1, 7). Predictions of intensity relied primarily on Dragon features.

Fig. 4 Quality of predictions. (A and B) Community phase predictions for random-forest (A) and linear (B) models using both Morgan and Dragon features for population prediction. The training set was randomly partitioned 250 times to obtain error bars: *P < 0.05, **P < 0.01, ***P < 0.001, corrected for multiple comparisons [false discovery rate (FDR)]. (C) Comparison between correlation coefficients for model predictions and for test-retest for individual perceptual attributes by using the aggregated predictions from linear and random-forest models. Error bars reflect standard error obtained from jackknife resampling of the retested molecules. Linear regression of the model-test correlation coefficients against the test-retest correlation coefficients yields a slope of 0.80 ± 0.02 and a correlation of r = 0.870 (black line) compared with a theoretically optimal model (perfect prediction given intraindividual variability, dashed red line). Only the model-test correlation coefficient for burnt (15) was statistically distinguishable from the corresponding test-retest coefficient (P < 0.05 with FDR correction). (D) Schematic for reverse-engineering a desired sensory profile from molecular features. The model was presented with the experimental sensory profile of a molecule (spider plot, left) and tasked with searching through 69 hidden test-set molecules (middle) to find the best match (right, model prediction in red). Spider plots represent perceptual data for all 21 attributes, with the lowest rating at the center and highest at the outside of the circle. (E) Example where the model selected a molecule with a sensory profile 7th closest to the target, butyric acid. (F) Population prediction quality for the 69 molecules in the hidden test set when all 19 models are aggregated. The overall area under the curve (AUC) for the prediction is 0.83, compared with 0.5 for a random model (gray dashed line) and 1.0 for a perfect model.

There is already anecdotal evidence that some chemical features are associated with a sensory attribute. For example, sulfurous molecules are known to smell “garlic” or “burnt,” but no quantitative model exists to confirm this. Our model confirms that the presence of sulfur in the Dragon descriptors used by the model correlated positively with both “burnt” (r = 0.661; P < 10⁻⁶²) (fig. S4A) and “garlic” (r = 0.413; P < 10⁻²²; table S1). Pleasantness was predicted most accurately using a mix of both Dragon and Morgan-NSPDK features. For example, pleasantness correlated with both molecular size (r = 0.160; P < 10⁻³) (9) and similarity to paclitaxel (r = 0.184; P < 10⁻⁴) and citronellyl phenylacetate (r = 0.178; P < 10⁻⁴) (fig. S6B). “Bakery” predictions were driven by similarity to the molecule vanillin (r = 0.45; P < 10⁻²⁴) (fig. S4B). Morgan features improved prediction in part by enabling a model to template-match target molecules against reference molecules for which the training set contains perceptual data. Thus, structural similarity to vanillin or ethyl vanillin predicts “bakery” without recourse to structural features.
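The idea of template-matching on substructure presence or absence can be illustrated with a Tanimoto (Jaccard) similarity between sets of substructure identifiers. This is a swapped-in, generic similarity measure and not necessarily the one the teams used; the integer identifiers and molecule labels below are hypothetical placeholders rather than real fingerprints.

```python
def tanimoto(features_a, features_b):
    """Tanimoto similarity between two sets of substructure identifiers:
    shared substructures divided by all substructures present in either."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical substructure identifiers (placeholders, not real data).
reference = {11, 42, 77, 108}   # stands in for a reference molecule
candidate_close = {11, 42, 77, 300}
candidate_far = {5, 6, 300}

sim_close = tanimoto(reference, candidate_close)
sim_far = tanimoto(reference, candidate_far)
```

A model given such a similarity score as a feature could learn that molecules close to a reference with known ratings tend to share its percept, which is the template-matching behavior described above.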

Twenty of the molecules in the training set were rated twice (“test” and “retest”) by each individual, providing an estimate of within-individual variability for the same stimulus. This within-individual variability places an upper limit on the expected accuracy of the optimal predictive model. We calculated the test-retest correlation across individuals and molecules for each perceptual attribute. This value of the observed correlation provides an upper limit to any model, because no model prediction should produce a better correlation than data from an independent trial with an identical stimulus and individual. To examine the performance of our model compared with the theoretically best model, we calculated a correlation coefficient between the prediction of a top-performing random-forest model and the test data. All attributes except “burnt” were statistically indistinguishable from the test-retest correlation coefficients evaluated at the individual level (Fig. 4C). The slope for the best linear fit of the test-retest and model-test correlation coefficients was 0.80 ± 0.02, with a slope of 1 expected for optimal performance (Fig. 4C). Similar results were obtained using a model-retest correlation. Thus, given this data set, performance of the model is close to that of the theoretically optimal model.
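The comparison against the test-retest ceiling can be sketched as an ordinary least-squares fit of model-test correlations on test-retest correlations, one point per attribute. Fitting with an intercept is an assumption, and the numbers below are synthetic, chosen to lie exactly on a line of slope 0.8 purely to show the calculation.

```python
def least_squares_slope(x, y):
    """Slope of the ordinary least-squares line of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Synthetic per-attribute correlations (illustrative, not study data).
test_retest = [0.2, 0.4, 0.6, 0.8]      # ceiling set by rating repeatability
model_test = [0.16, 0.32, 0.48, 0.64]   # model predictions vs. test ratings

slope = least_squares_slope(test_retest, model_test)
```

A slope of 1 would mean the model predicts each attribute as well as a subject's own repeat rating does; a slope below 1 indicates how far the model falls short of that ceiling.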

We evaluated the specificity of the predictions of the aggregated model by calculating how frequently the predicted sensory profile had a better correlation with the actual sensory profile of the target molecule than it did with the sensory profiles of any of the other 68 molecules in the hidden test set (Fig. 4, D and E). For 14 of 69 molecules, the highest correlation coincided with the actual sensory profile (P < 10⁻¹¹). For an additional 20%, it was second highest, and 65% of the molecules ranked in the top-ten predictions [Fig. 4F and table S1; area under the curve (AUC) = 0.83]. The specificity of the aggregated model shows that its predictions could be used to reverse-engineer a desired sensory profile by using a combination of molecular features to synthesize a designed molecule.
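The ranking step behind this specificity analysis can be sketched as follows. The function names and the toy profiles (three molecules, four attributes) are assumptions for illustration; the study used 69 molecules and 21 attributes, and its AUC calculation is not reproduced here.

```python
def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def profile_ranks(predicted, observed):
    """For each molecule, rank its own observed profile among all observed
    profiles by correlation with its predicted profile (1 = best match)."""
    ranks = []
    for k, pred in enumerate(predicted):
        target = pearson(pred, observed[k])
        better = sum(1 for j, obs in enumerate(observed)
                     if j != k and pearson(pred, obs) > target)
        ranks.append(1 + better)
    return ranks


def top_k_fraction(ranks, k):
    """Fraction of molecules whose true profile ranks within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Toy sensory profiles (made up): rows are molecules, columns attributes.
observed = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
predicted = [[1.1, 2.0, 2.9, 4.2], [3.9, 3.1, 2.0, 1.2], [1.0, 3.2, 1.9, 4.1]]
ranks = profile_ranks(predicted, observed)
```

In this toy case each predicted profile matches its own molecule best, so every rank is 1. Plotting `top_k_fraction` against k gives a curve of the kind summarized by the AUC reported above.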

Finally, to ensure that the performance of our model would extend to new subjects, we trained it on random subsets of 25 subjects from the DREAM data set and consistently predicted the attribute ratings of the mean across the population of the 24 left-out subjects (fig. S7A). To test our model across new subjects and new molecules, we took advantage of a large unpublished data set of 403 volunteers who rated the intensity and pleasantness of 47 molecules, of which only 32 overlapped with the stimuli used in the original study (table S1). Using a random-forest model trained on the original 49 DREAM Challenge subjects and all the molecules, we are able to show that the model robustly predicts the average perception of all of these molecules across the population (fig. S7B).

The DREAM Olfaction Prediction Challenge has yielded models that generated high-quality personalized perceptual predictions. This work substantially expands on previous modeling efforts (2, 3, 7–11) because it predicts not only pleasantness and intensity, but also 8 out of 19 semantic descriptors of odor quality. The predictive models enable the reverse-engineering of a desired perceptual profile to identify suitable molecules from vast databases of chemical structures and closely approach the theoretical limits of accuracy when accounting for within-individual variability. Although the predictions are highly significant, there is still much room for improvement, particularly in the individual predictions. Although the current models can only be used to predict the 21 attributes, the same approach could be applied to a psychophysical data set that measured any desired sensory attribute (e.g., “rose,” “sandalwood,” or “citrus”). How can the highly predictive models presented here be further improved? Recognizing the inherent limits of using semantic descriptors for odors (12–14), we think that alternative perceptual data, such as ratings of stimulus similarity, will be important (11).

What do our results imply about how the brain encodes an olfactory percept? We speculate that, for each molecular feature, there must be some quantitative mapping, possibly one to many, between the magnitude of that feature and the spatiotemporal pattern and activation magnitude of the associated olfactory receptors. If features rarely or never interact to produce perception, as suggested by the strong relative performance of linear models in this challenge, then these feature-specific patterns must sum linearly at the perceptual stage (19). Peripheral events in the olfactory sensory epithelium, including receptor binding and sensory neuron firing rates, might have nonlinearities, but the numerical representation of perceptual magnitude must be linear in these patterns. It is possible that stronger nonlinearity will be discovered when odor mixtures or the temporal dynamics of odor perception are investigated. Many questions regarding human olfaction remain that may be successfully addressed by applying this method to future data sets that include more specific descriptors; more molecules that represent different olfactory percepts than those studied here; and subjects of different genetic, cultural, and geographic backgrounds.

Results of the DREAM Olfaction Prediction Challenge may accelerate efforts to understand basic mechanisms of ligand-receptor interactions, and to test predictive models of olfactory coding in both humans and animal models. Finally, these models have the potential to streamline the production and evaluation of new molecules by the flavor and fragrance industry.

Supplementary Materials www.sciencemag.org/content/355/6327/820/suppl/DC1 Materials and Methods Figs. S1 to S7 Table S1 Reference (20)