Atomic and molecular properties could be evaluated from the fundamental Schrodinger’s equation and therefore represent different modalities of the same quantum phenomena. Here, we present AIMNet, a modular and chemically inspired deep neural network potential. We used AIMNet with multitarget training to learn multiple modalities of the state of the atom in a molecular system. The resulting model shows on several benchmark datasets state-of-the-art accuracy, comparable to the results of orders of magnitude more expensive DFT methods. It can simultaneously predict several atomic and molecular properties without an increase in the computational cost. With AIMNet, we show a new dimension of transferability: the ability to learn new targets using multimodal information from previous training. The model can learn implicit solvation energy (SMD method) using only a fraction of the original training data and an archive median absolute deviation error of 1.1 kcal/mol compared to experimental solvation free energies in the MNSol database.

Here, we present the AIMNet (atoms-in-molecules), a chemically inspired, modular deep neural network (DNN) molecular potential. We use multimodal and multitask learning to obtain an information-rich representation of an atom in a molecule. We show the state-of-the-art accuracy of the model to simultaneously predict energies, atomic charges, and volumes. We also show how the multimodal information about the atom state could be used to efficiently learn new properties, such as solvation free energies, with much less training data.

Natural phenomena often inspire the structure of ML models and techniques ( 19 ). For example, the human brain is constantly interacting with various types of information related to the physical world; each piece of information is called a modality. In many systems, multiple data modalities can be used to describe the same process. One such physical system is the human brain, which provides more reliable information processing based on multimodal information ( 20 ). Many ML-related fields of research have successfully applied multimodal ML model training ( 21 ). In chemistry, molecules, which are often represented by structural descriptors, can also be described with accompanying properties (dipole moments and partial atomic charges) and even electron densities. The multimodal learning that treats multimodal information as inputs has been an actively developing field in recent years ( 22 ). Multimodal and multitask learning aims at improving the generalization performance of a learning task by jointly learning multiple related tasks together and could increase the predictivity of a model ( 23 ). This boost is caused by the use of additional information that captures the implicit mapping between the learnable endpoints.

The success of modern ML may be largely attributed to a highly flexible functional form for fitting to high-dimensional data. ML is known to extract complex patterns and correlations from these data. These data can contain counterintuitive statistical correlations that are difficult for humans to comprehend. With a few notable exceptions ( 16 , 17 ), these models do not capture the underlying physics of electrons and atoms. These statistical fits are often fragile, and the behavior of an ML potential far from the training data could be nonphysical. The essential challenge for ML is to capture the correct physical behavior. Therefore, the immediate frontiers in ML lie in physics-aware artificial intelligence (PAI) and explainable artificial intelligence (XAI) ( 18 ). Future PAI methods will learn a model with relevant physics-inspired constraints included. The inclusion of physics-inspired constraints promises to deliver better performance by forcing the model to obey physical laws and cope better with sparse and/or noisy data. XAI will step even further, complementing models with logical reasoning and explanations of their actions, to ensure that researchers are getting the right answer for the right reasons ( 18 ).

Various techniques for improving the accuracy and transferability of ML potentials have been used. Active learning methods ( 11 , 12 ), which provide a consistent and automated improvement in accuracy and transferability, have contributed greatly to the success of general purpose models. An active learning algorithm achieves this by deciding what new QM calculations should be performed and then adding the new data to the training dataset. The act of letting the ML algorithm drive sampling is shown to greatly improve the transferability of an ML potential. Further, transfer learning methods allow the training of accurate ML potentials by combining multiple QM approximations ( 13 ). Several recent reviews summarized the rapid progress in this field ( 14 , 15 ).

The high computation cost of quantum chemical (QM) methods has become a critical bottleneck, which limits a researcher’s abilities to study larger realistic atomistic systems, as well as long-time scales relevant to an experiment. Hence, robust approximate but accurate methods are required for continued scientific progress. Machine learning (ML) has been successfully applied to approximate potential energy surfaces of molecules ( 1 – 4 ), obtain atomic forces ( 5 ), and even predict reaction synthesis ( 6 ). ML techniques have become popular for the use in predicting QM molecular properties. ML models seek to learn a “black box” function that maps a molecular structure to the property of interest. Until recently, many ML-based potentials relied on a philosophy of parametrization to one chemical system at a time ( 7 ). These methods can achieve high accuracy with relatively small amounts of QM data but are not transferable to new chemical systems. Using this approach for any new system requires a new set of QM calculations and extra parametrization time for each new study. Recent breakthroughs in the development of ML models in chemistry have produced general purpose models that accurately predict potential energies and other molecular properties for a broad class of chemical systems ( 2 , 3 , 8 – 10 ). General purpose models promise to make ML a viable alternative to empirical potentials and classical force fields. Force fields are known to have many weaknesses, for example, poor description of the underlying physics and lack of transferability, and are hard to improve in accuracy systematically.

RESULTS

The AIMNet architecture The name and concept of the AIMNet model is inspired by Bader’s theory of atoms in molecules (AIM) (24). The quantum theory of AIM is a model in which a molecule could be partitioned into interacting atoms via an observable electron density distribution function. When atoms combine into a molecule, their electron density changes due to interaction with other atoms. In density functional theory (DFT), the final solution for the electron density and energy for the molecule is usually obtained with an iterative self-consistent field (SCF) procedure. Within the AIM model, each step of the SCF-like procedure could be viewed as the change of electron density distribution within atomic basins to reflect changes in the basins of neighboring atoms. In the AIMNet model, instead of electron density, atoms are characterized by learnable feature vectors and complex interatomic interactions are approximated with the DNN. The high-level architecture of the AIMNet model is shown in Fig. 1. The model uses atomic coordinates (R) and numbers (Z) as inputs and transforms them into atom-centered environment vectors (AEVs) that are used as features for embedding, interaction, update, and AIM neural network blocks. The model predicts a set of molecular and/or atomic properties (p). The overall algorithm can be summarized as follows 1) Encode relative positions R of all neighboring atoms as AEVs. 2) Select initial atomic feature vectors (AFVs) corresponding to atomic numbers. 3) For each atom, embed its AEV into the space of AFVs of neighboring atoms, combining geometrical and atomic feature information. 4) Calculate interaction of the atom with the environment to get AIM representation of the atom. 5) Calculate atom properties from the AIM representation. 6) Calculate environment-dependent update to the AFVs and repeat steps 3 to 5 until converged. Fig. 1 Architecture of the AIMNet model. The model uses atomic numbers Z and coordinates R as input features. The coordinates are transformed with ANI-type symmetry functions into AEVs. Atom types are represented with learnable atomic feature vectors (AFV), which are used as embedding vectors for AEVs. The interaction of an atom with its environment produces the AIM representation of the atom used to predict a set of target atom properties {p k }. The environment-dependent update to AFV within N iterations is used to make the embedding vectors for each atom consistent with its environment. Input data are colored in blue, predicted endpoints are in orange, and neural network blocks are in green. At step 6, the AFV for every atom in the molecule is updated, which changes the embedding at the next iteration. This is effectively describing the interactions between atoms by passing messages (25, 26) through the neural network. Convergence is a learned feature of the model, when the state of each atom (AFV) is consistent with the state of its neighbors and subsequent updates are approaching zero. Therefore, we call this procedure “SCF-like.” The implementation details of individual AIMNet blocks are given below. Embedding block. Geometrical arrangement for ith atom of a molecule are encoded as a set of ANI-type (3, 7) radial g ij ( r ) and angular g ijk ( a ) AEVs with indexes j corresponding to every neighboring atom and jk to every unique pair of neighbors g ij ( r ) = exp ( − η ( r ) ( r ij − r s ( r ) ) ) f C ( r ij ) (1) g ijk ( a ) = 2 1 − ζ ( 1 + ( cos θ ijk − θ s ( a ) ⊺ ) ) exp ( − η ( a ) ( r ij + r ik 2 − r s ( a ) ) 2 ) f C ( r ij ) f C ( r ik ) (2) f C ( r ij ) = 0.5 cos ( π min ( r ij R C , 1 ) ) + 0.5 (3) Here, r and θ are the distances and angles between atoms and f c is the cosine cutoff function, which smoothly zeroes AEVs for neighbors located outside of cutoff radius R C , chosen at 4.6 Å for radial AEVs and at 3.1 Å for angular AEVs. All hyperparameters for AEVs, such as radial and angular probe vectors r s and θ s , respectively and probe widths η and ζ match the ANI-1x model (12) (see also the Supplementary Materials for details). The g ij ( r ) and g ijk ( a ) AEVs contain only geometrical information but not atom types. To differentiate neighbors by atom type, we embed these vectors into the space of learnable AFVs defined for every chemical element Z as ℝ, a z ∈ ℝd, with dimensionality d being another hyperparameter of the model (here used d = 16). We selected the outer product of g ij ( r ) and a j as an embedding operation for radial AEVs. The result is a matrix G ij ( r ) ∈ ℝm × d, where m is the dimensionality of g ij ( r ) (the size of probe vectors r s in Eq. 1) and d is the size of AFVs. By their design, symmetry functions (Eqs. 1 to 3) are many body functions, i.e., they could be summed for all the neighbors of an atom, providing an integral description of the atomic environment (3). The same applies to outer products G ij ( r ) , given a sufficiently large embedding vector a j . We obtain radial features of atomic environment of the ith atom as a fixed-length vector after flattening the corresponding matrix G i ( r ) = ∑ j g ij ( r ) ⋅ a j ⊺ (4) Embedding of the angular symmetry functions g ijk ( a ) requires atom-pair feature vectors a jk * defined for every combination of chemical elements. If introduced as learnable parameters of the model, then the size of this embedding layer would grow as O(N2)with the number of chemical elements. Instead, in the AIMNet model, we learn an interaction between AFVs, which give appropriate atom-pair atomic features a jk * . We construct concatenation of elementary symmetric polynomials, e.g., element-wise multiplication and addition of two AFVs and use it as an input layer for multilayer neural network or perceptron function (MLP) ℱ MLP 1 a jk * = F MLP 1 ( [ a j ∘ a k , a j + a k ] ) (5) Analogous to the radial part, the combined angular AEV is defined as G i ( a ) = ∑ jk g ijk ( a ) ⋅ a jk * ⊺ (6) The embedding stage is finalized by application of another neural network to the concatenation of embedded radial and angular symmetry functions f i = F MLP 2 ( [ G i ( r ) , G i ( a ) ] ) (7) The ℱ MLP 2 function extracts information about the environment of the atom; therefore, vector f is referred to as the atomic environment field. In this work, the AIMNet model was trained to learn six diverse atomic or molecular properties. In addition to molecular energies, we used modules for atomic electric moments of the atoms up to l = 3, i.e., atomic charge, dipole, quadrupole, and octupole, as well as atomic volumes. Figure 2 provides correlation plots for four predicted quantities. Atomic dipoles and quadrupoles are probably not very useful per se and could be considered as “auxiliary.” Hence, the accuracy of their fits is summarized in the Supplementary Materials. Fig. 2 Performance of the AIMNet model on the DrugBank subset of the COMP6-SFCl benchmark. Plots show correlation between ground-truth DFT (x axes) and predicted with AIMNet (y axes) values for total molecular energies, components of force vectors on atoms (∂E/∂R), atomic charges, and volumes. For each plot, units for both axes and for RMSD and mean absolute deviation values are the same. Logarithm of point density is shown with color. The accuracy of fit is assessed on the DrugBank subset of the COMP6-SFCl benchmark. This benchmark contains properties of 23,203 nonequilibrium conformations of 1253 drug-like molecules. The median molecule size is 43 atoms, more than three times larger than molecules in the training dataset. Thus, this benchmark shows transferability and extensibility of the AIMNet model. The root mean square error (RMSE) of the energy predictions is 4.0 kcal/mol within the range of about 1500 kcal/mol. For comparison, an ensemble of ANI models trained and evaluated on the same datasets has an RMS energy error of 5.8 kcal/mol and a force error of 7.1 kcal mol–1Å–1. Predicted components of atomic forces have RMS deviation (RMSD) = 4.7 kcal mol–1Å–1, highlighting the AIMNet model utility to reproduce the curvature of molecular potential energy surfaces accurately and thus its applicability for geometry optimization and molecular dynamics. Atomic charges could be learned up to a “chemical accuracy” of 0.01e. Overall, this level of accuracy for both AIMNet and ANI-1x is on par with the best single-molecule potentials constructed with the original Behler-Parrinello (BP) descriptor (27).

Iterative SCF-like update The AIMNet model was trained with outputs from every SCF-like pass contributing equally to the cost function (see the Supplementary Materials for details). This way, the model learns to give the best possible answer for each pass, given the input AFVs. The AFVs are improved with every iteration, leading to lower errors. The model with t = 1 is conceptually similar to the ANI-1x network since no updates are made to the atomic features in both models [in BP-type networks, the representation of atomic features is hidden within the neural network (NN) layers], and the receptive field of the AIMNet model is roughly equal to the size of the AEV descriptor in ANI-1x. Figure 3 shows the aggregated performance of prediction for energies (E), relative conformer energies (ΔE), and forces (F) improves with increasing number of passes t. As expected, the accuracy of AIMNet with t = 1 is very similar or better compared to the ANI-1x network. The second iteration (t = 2) provides the biggest boost in performance for all quantities. After t = 3, results do not change much; therefore, we used t = 3 to train all models in the paper. Overall, the biggest gains in accuracy were up to 0.75 kcal/mol for relative energies and 1.5 kcal mol–1Å–1 for forces. This corresponds to about 15 to 20% error reduction. Fig. 3 AIMNet predictions with different number of iterative passes t evaluated on the DrugBank subset of the COMP6-SFCl benchmark. (A) Comparison of AIMNet performance at different t values with ANI-1x model trained on exactly the same dataset for relative conformer energies (ΔE), total energies (E), and atomic forces (F). (B) AIMNet accuracy in prediction of total energies (E), relative conformer energies (ΔE), atomic forces (F), charges (q), and volumes (V) at different t values. Relative RMSE is calculated as ratio of RMSE at given t divided by RMSE at t = 3 (the values used to train model). Notably, there is a major difference between an SCF-like update and a real SCF procedure used along with DFT methods. Convergence of the SCF procedure is determined by the variational principle, e.g., the solution is electron density distribution that minimizes the total energy. The AIMNet model is not variational; lower energy does not imply more correct prediction. Therefore, there is no guarantee for convergence of SCF-like updates. Although, on average, AIMNet converges very fast, within two or three iterations, this behavior is controlled by the L2 regularization we used during training for every learned parameter. The model learns to make better prediction using the smallest possible update to the AEVs. After training with t = 3, the output of the model not only does not improve with larger number of iterations but also does not explode because of the accumulation of errors. We tested the model with up to six updates and found that, at t = 7, the AIMNet still performs better than with t = 1 (Fig. 3B). Figure 3B shows that atomic charges and volumes are much more sensitive toward the iterations than energies and forces. To illustrate the importance of long-range interactions on charge redistribution, let us consider a simple but realistic series of substituted thioaldehydes (Fig. 4). Here, a substituent R is located as far as 5 Å away from the sulfur atom. This distance is longer than the cutoff radius of 4.6 Å we used in AEV construction. However, because of a conjugated chain and the high polarizability of sulfur, the partial atomic charge on sulfur could change by as much as 0.15e by varying R with different electron-donating or electron-withdrawing groups. The AIMNet model with t = 1 (as well as all neural network potentials (NNPs) with local geometric descriptors) will incorrectly predict that the sulfur partial charge in all molecules is equal. The correct trend could be recovered with either an increase of the radius for local environment (usually at a substantial computational cost and potentially an impact on model extensibility) or with iterative updates to the AFVs. AIMNet with t = 2 reproduces DFT charges on the sulfur atom notably better than with t = 1, except for the most polar ones. At t = 3, the charge redistribution in the AIMNet model completes and quantitatively reproduces DFT charges for all molecules considered. Fig. 4 DFT ωB97x/def2-TZVPP atomic charges on the sulfur atom of substituted thioaldehyde and AIMNet prediction with a different number of iterative passes t.

The nature of the AFV representation To gain insights into the learned latent information inside the AFVs, we performed a parametric t-distributed stochastic neighbor embedding (pt-SNE) (28) of this 16-dimensional (16D) space into a 2D space. The pt-SNE is an unsupervised dimensionality reduction technique. pt-SNE learns a parametric mapping between the high-dimensional data space and the low-dimensional latent space using a DNN in such a way that the local structure of the data in the high-dimensional space is preserved as much as possible in the low-dimensional space. Figure 5 shows the 2D pt-SNE for 3742 DrugBank molecules or about 327 k atoms in total. Fig. 5 pt-SNE of AFVs (t = 3) for a set of drug-like molecules. (B) Feature vectors for different chemical elements. Values of feature vectors at t = 1 (before SCF-like update) marked with cross symbol. (A and C) Feature vectors for several of the most common types of the carbon and hydrogen atom environments. In the AIMNet model, the AFVs are used to discriminate atoms by their chemical types. The trivial discrimination could be achieved with orthogonal vectors (which would effectively be a one-hot encoding). The pt-SNE of AFVs in Fig. 6A shows that the location of clusters corresponding to different chemical elements resembles their positions in the periodic table. The pt-SNE component on the horizontal axis roughly corresponds to a period and vertical component to a group in the periodic table. Embeddings for hydrogen atoms are closer to halogens than to any other element. It is interesting to note the wide spread of the points corresponding to sulfur and hydrogen atoms. In the case of sulfur, this is the only element in the set that may have distinctly different valence states (6 and 2) in common organic molecules. Fig. 6 Performance of the AIMNet model on several benchmark sets compared to MM, semiempirical, DFT, and ab initio QM methods. (A and B) Torsion benchmark of Sellers et al. (29) for gas phase and solvent models. The dots correspond to mean absolute error (MAE) for each of the 62 torsion profiles for each method. (C) The dots correspond to absolute error for relative conformer energy for each of the 196 conformers in the dataset. (A to C) The boxes represent the upper and lower quartiles, while the whiskers represent 1.5 times the interquartile range or the minimum/maximum values. The methods are ordered in descending median errors from top to bottom. Boxes colored by the class of the computational method (ML, MM, SQM, DFT, and ab initio). The basis sets used to obtain reference energies are 6-311+G**, for (A) and (B), and def2-TZVPD, for (C), where applicable but not specified. (D) Correlation plot of experimental solvation energies for 238 neutral molecules from the MNSol database (31) and AIMNet predictions, calculated as the difference between the prediction of DFT and DFT(SMD) energies. The most structure and diversity in the pt-SNE plot is observed for carbon atoms. In Fig. 5A, we show a zoomed in region of the carbon atoms, with coloring by the hybridization and structure of local chemical environments. Two main distinct clusters corresponding to sp2 (or aromatic) and sp3 C atoms appear. Inside every cluster, atoms are grouped by the local bonding environment. There is also a clear trend in the increase of substituent polarity from the top to the bottom of the plot. Similarly, the spread for the H atoms is determined mainly by the parent sp2 and sp3 carbon atoms or heteroatom environments (Fig. 5C).

Conformations and dihedral profiles benchmark One of the most promising applications of the NNPs for computational drug discovery is conformer generation and ranking. Therefore, we evaluated the performance of the AIMNet model against two distinct external datasets. Both are tailored to benchmark the performance of molecular potentials to describe conformer energies, which have high-quality CCSD(T)/CBS reference data. The small-molecule torsion benchmark of Sellers et al. (29) measures the applicability of an atomistic potential in drug discovery. This benchmark includes 62 torsion profiles from molecules containing the elements C, H, N, O, S, and F computed with various force fields, semiempirical QM, DFT, and ab initio QM methods. These methods are compared to CCSD(T)/CBS reference calculations. Figure 6A provides the performance of the methods presented in Sellers et al. (29) together with AIMNet single-point calculations on MP2/6-311+G** optimized geometries. According to this benchmark, the AIMNet potential is much more accurate compared to semiempirical methods and OPLS-type force fields, which is specifically tailored to describe conformation energies of drug-like molecules. The performance of the AIMNet model could be directly compared to MP2/6-311+G** and DFT methods. Another benchmark set, MPCONF196 (30), measures the performance of various potentials to rank both the low- and high-energy conformers of acyclic and cyclic peptides and several macrocycles, including 13 compounds in total. The reference data were obtained as single-point calculations at the CCSD(T)/CBS level (Tight-DLPNO approximation for the largest molecules in the dataset) on MP2/cc-pVTZ geometries. Figure 6C shows a comparison of AIMNet to a subset of methods benchmarked in the original paper of Řezáč et al. (30). The DFT methods fall into two categories, depending on whether dispersion corrections were included or not, with highly empirical M06-2x being somewhere in between. The AIMNet model has been trained to reproduce the ωB97x functional without explicit dispersion correction. Therefore, its performance is clearly worse compared to the dispersion-corrected counterpart. However, much of the error could be reduced by adding an empirical D3 dispersion correction to AIMNet energies. The benchmark data show that the AIMNet model is on par with traditional DFT functionals without dispersion correction, and it clearly outperforms semiempirical methods, even methods that have built-in dispersion correction.