We will first outline theoretical results, most prominently the ML-HK map, and then illustrate the approach with simulations of one-dimensional (1D) systems and three-dimensional (3D) molecules.

ML-HK map

Previous results show that for an ML-OF approach, the accuracy of ML-KS kinetic energy models \(T_{\rm{s}}^{{\rm{ML}}}\left[ n \right]\) improve rapidly with the amount of data. However, minimizing the total energy via gradient descent requires the calculation of the gradient of the kinetic energy model \(T_{\rm{s}}^{{\rm{ML}}}\) (Fig. 1) and calculating this gradient is challenging. Due to the data-driven nature of, e.g., kernel models, the machine-learned kinetic energy functional has no information in directions that point outside the data manifold26. This heavily influences the gradient to an extent that it becomes unusable without further processing20. There have been several suggestions to remedy this problem, but all of them share a significant loss in accuracy compared to T s [n]21, 22, 27.

Here, we propose an interesting alternative to gradients and the ML-OF approach. Recently, it has been shown that the HK map for the density as a functional of the potential can be approximated extremely accurately using semiclassical expressions28. Such expressions do not require the solution of any differential equation, and become more accurate as the number of particles increases. Errors can be negligible even for just two distinct occupied orbitals.

Inspired by this success, we suggest how one might circumvent the kinetic energy gradient and directly train a multivariate ML model. We name this the ML-HK map:

$${n^{{\rm{ML}}}}[v](x) = \mathop {\sum}\limits_{i = 1}^M {\beta _i}(x)k\left( {v,{v_i}} \right).$$ (1)

Here each density grid point is associated with a group of model weights β. Training requires solving an optimization problem for each density grid point. Although this is possible in 1D, it rapidly becomes intractable in 3D, as the number of grid points grows cubically.

The use of a basis representation for the densities, as in

$${n^{{\rm{ML}}}}[v](x) = \mathop {\sum}\limits_{l = 1}^L {u^{(l)}}[v]{\phi _l}(x),$$ (2)

renders the problem tractable even for 3D. A ML model that predicts the basis function coefficients u (l)[v] instead of the grid points is then formulated.

Predicting the basis function coefficients not only makes the ML model efficient and allows the extension of the approach to 3D but also permits regularization, e.g., to smooth the predicted densities by removing the high-frequency basis functions, or to further regularize the ML model complexity for specific basis functions.

For orthogonal basis functions, the ML model reduces to several independent regression models and admits an analytical solution analogous to kernel ridge regression (KRR, see Supplementary Eq. (5)):

$${{\boldsymbol{\beta}}^{(l)}} = {\left( {{{\bf{K}}_{{{\boldsymbol{\sigma}}^{(l)}}}} + {{\boldsymbol{\lambda}}^{(l)}}{\bf{I}}} \right)^{ - 1}}{{\bf{u}}^{(l)}},\quad l = 1, \ldots ,L.$$ (3)

Here, for each basis function coefficient, λ (l) are regularization parameters and \({{\bf{K}}_{{{\boldsymbol{\sigma }}^{(l)}}}}\) is a Gaussian kernel with kernel width σ (l). The parameters λ (l) and σ (l) can be chosen individually for each basis function via independent cross-validation (see refs. 12, 29). This ML-HK model avoids prior gradient descent procedures and, with it, the necessity to “de-noise” the gradients. Due to the independence of Eq. (3) for each l, the solution scales favourably.

Functional and density-driven error

How can the performance of the ML-HK map be measured? It has recently been shown how to separate out the effect of the error in the functional F and the error in the density n(r) on the resulting error in the total energy of any approximate, self-consistent DFT calculation30. Let \(\tilde F\) be an approximation of the many body functional F, and \(\tilde n({\bf{r}})\) the approximate ground-state density when \(\tilde F\) is used in the Euler equation. Defining \(\tilde E[n] = \tilde F[n] + {\int} {{d^3}r\,{\kern 1pt} n({\bf{r}})\,{\kern 1pt} v({\bf{r}})}\) yields

$$\Delta E = \tilde E\left[ {\tilde n} \right] - E[n] = \Delta {E_{\rm{F}}} + \Delta {E_{\rm{D}}}$$ (4)

where \(\Delta {E_{\rm{F}}} = \tilde F[n] - F[n]\) is the functional-driven error, while \(\Delta {E_{\rm{D}}} = \tilde E\left[ {\tilde n} \right] - \tilde E[n]\) is the density-driven error. In most DFT calculations, ΔE is dominated by ΔE F . The standard DFT approximations can, in some specific cases, produce abnormally large density errors that dominate the total error. In such situations, using a more accurate density can greatly improve the result30,31,32. We will use these definitions to measure the accuracy of the ML-HK map.

1D potentials

The following results demonstrate how much more accurate ML is when applied directly to the HK map. The box problem originally introduced in Snyder et al.20 is used to illustrate the principle. Random potentials consisting of three Gaussian dips were generated inside a hard-wall box of length 1 (atomic units), and the Schrödinger equation for one electron was solved extremely precisely. Up to 200 cases were used to train an ML model \(T_{\rm{s}}^{{\rm{ML}}}[n]\) for the non-interacting kinetic energy functional T s [n] via KRR (for details, see Supplementary Methods).

To measure the accuracy of an approximate HK map, the analysis of the previous section is applied to the KS DFT problem. Here F is just T s , the non-interacting kinetic energy, and

$$\Delta {E_{\rm{F}}} = {\tilde T_{\rm{s}}}[n] - {T_{\rm{s}}}[n],$$ (5)

i.e., the error made in an approximate functional on the exact density. Table 1 on the left gives the errors made by ML-OF for the total energy, and its different components, when the density is found from the functional derivative. This method works by following a gradient descent of the total energy functional based on the gradient of the ML model \(T_{\rm{s}}^{{\rm{ML}}}\),

$${{{n}}^{\left( {j + 1} \right)}} = {{{n}}^{\left( j \right)}} - \epsilon P\left( {{{{n}}^{\left( j \right)}}} \right)\frac{\delta }{{\delta {{n}}}}{E^{{\rm{ML}}}}\left( {{{{n}}^{\left( j \right)}}} \right),$$ (6)

where \(\epsilon\) is a small number and P(n (j)) is a localized PCA projection to de-noise the gradient. Here, and for all further 1D results, we use

$${E^{{\rm{ML}}}}[n] = T_{\rm{s}}^{{\rm{ML}}}[n] + {\int} {{\rm{d}}x\,n(x)\,v(x).}$$ (7)

The density-driven contribution to the error, ΔE D , which we calculate exactly here using the von Weizsäcker kinetic energy33, is always comparable to, or greater than, the functional-driven error, ΔE F , due to the poor quality of the ML functional derivative20. The calculation is abnormal and can be greatly improved by using a more accurate density from a finer grid. As the number of training points M grows, the error becomes completely dominated by the error in the density. This shows that the largest source of error lies in the use the ML approximation of T s to find the density by solving the Euler equation.

Table 1 Energy errors for the 1D data set Full size table

The next set of columns analyzes the ML-HK approach, using a grid basis. The left-most of these columns shows the energy error we obtain by utilizing the ML-HK map:

$$\Delta E = \left| {{E^{{\rm{ML}}}}\left[ {{n^{{\rm{ML}}}}[v]} \right] - E} \right|.$$ (8)

Note that both ML models, \(T_{\rm{s}}^{{\rm{ML}}}\) and n ML, have been trained using the same set of M training points.

The ML-HK approach is always more accurate than ML-OF, and its relative performance improves as M increases. The next column reports the density-driven error, ΔE D , which is an order-of-magnitude smaller than that of ML-OF. Lastly, we list an estimate to the density-driven error

$$\Delta E_{\rm{D}}^{{\rm{ML}}} = \left| {{E^{{\rm{ML}}}}\left[ {{n^{{\rm{ML}}}}[v]} \right] - {E^{{\rm{ML}}}}[n]} \right|,$$ (9)

which uses the ML model \(T_{\rm{s}}^{{\rm{ML}}}\) for the kinetic energy functional in 1D. This proxy is generally a considerable overestimate (a factor of 3 too large), so that the true ΔE D is always significantly smaller. We use it in subsequent calculations (where we cannot calculate \(T_{\rm{s}}^{{\rm{ML}}}\)) to (over-)estimate the energy error due to the ML-HK map.

The last set of columns are density-driven errors for other basis sets. Three variants of the ML-HK map were tested. First, direct prediction of the grid coefficients: in this case, \({{u}}_i^{(l)} = {n_i}\left( {{x_l}} \right)\), l = 1, …, G. Five hundred grid points were used, as in Snyder, et al.20. This variant is tested in 1D only; in 3D the high dimensionality is prohibitive. Second, a common Fourier basis is tested. The density can be transformed efficiently via the discrete Fourier transform, using 200 Fourier basis functions in total. In 3D, these basis functions correspond to plane waves. The back-projection \({{u}} \mapsto n\) to input space is simple, but although the basis functions are physically motivated, they are very general and not specifically tailored to density functions. The performance is almost identical to the grid, on average, although maximum errors are much less. For M = 20, the error that originates from the basis representation starts to dominate. This is a motivation for exploring, third, a Kernel PCA (KPCA) basis34. KPCA35 is a popular generalization of PCA that yields basis functions that maximize variance in a higher dimensional feature space. The KPCA basis functions are data-driven, and computing them requires an eigen-decomposition of the Kernel matrix. Good results are achieved with only 25 KPCA basis functions. The KPCA approach gives better results because it can take the non-linear structure in the density space into account. However, it introduces the pre-image problem: it is not trivial to project the densities from KPCA space back to their original (grid) space (Supplementary Note 1). It is thus not immediately applicable to 3D applications.

Molecules

We next apply the ML-HK approach to predict electron densities and energies for a series of small molecules. We test the ML models on KS-DFT results obtained using the PBE XC functional36 and atomic pseudopotentials with the projector augmented-wave (PAW) method37, 38 in the Quantum ESPRESSO code25. As the ML-OF approach applied in the previous section becomes prohibitively expensive in 3D due to the poor convergence of the gradient descent procedure (Supplementary Note 2), we compare the ML-HK map to the ML-KS approach. This approach models the energy directly as a functional of v(r), i.e., it trains a model

$${E^{{\rm{ML}}}}[v] = \mathop {\sum}\limits_{i = 1}^M {\alpha _i}k\left( {{v_i},v} \right)$$ (10)

via KRR (for details, see Supplementary Methods).

We also apply the ML-HK map with Fourier basis functions. Instead of a \(T_{\rm{s}}^{{\rm{ML}}}[n]\) model, we learn an E ML[n] model

$${E^{{\rm{ML}}}}[n] = \mathop {\sum}\limits_{i = 1}^M {\alpha _i}k\left( {{n_i},n} \right),$$ (11)

which avoids implementing the potential energy and XC functionals.

Both approaches require the characterization of the Hamiltonian by its external potential. The external (Coulomb) potential diverges for 3D molecules and is, therefore, not a good feature to measure the distance in ML. Instead, we use an artificial Gaussian potential of the form

$$v({\bf{r}}) = \mathop {\sum}\limits_{\alpha = 1}^{{N^{\rm{a}}}} {{\bf{Z}}_\alpha }\,{\rm{exp}}\left( {\frac{{ - {{\left\| {{\bf{r}} - {{\bf{R}}_\alpha }} \right\|}^2}}}{{2{\gamma ^2}}}} \right),$$ (12)

where R α are the positions, and Z α are the nuclear charges of the N a atoms. The Gaussian potential is used for the ML representation only. The width γ is a hyper-parameter of the algorithm. The choice is arbitrary but can be cross-validated. We find reasonable results with γ = 0.2 Å. The idea of using Gaussians to represent the external potential has been used previously39. The Gaussian potential is discretized on a coarse grid with grid spacing Δ = 0.08. To use the discretized potential in the Gaussian kernel, we flatten it into a vector form and thus use a tensor Frobenius norm.

Our first molecular prototype is H 2 , with the only degree of freedom, R, denoting the distance between the H atoms. A data set of 150 geometries is created by varying R between 0.5 and 1.5 Å (sampled uniformly). A randomly chosen subset of 50 geometries is designated as the test set and is unseen by the ML algorithms. These geometries are used to measure the out-of-sample error reported below.

The remaining 100 geometries make up the grand training set. To evaluate the performance of the ML-KS map and the ML-HK map, subsets of varying sizes M are chosen out of the grand training set to train the E ML[v] and n ML[v] models, respectively. Because the required training subsets are so small, careful selection of a subset that covers the complete range of R is necessary. This is accomplished by selecting the M training points out of the grand training set so that the R values are nearly equally spaced (see Supplementary Note 3 for details).

For practical applications, it is not necessary to run DFT calculations for the complete grand training set, only for the M selected training points. Strategies for sampling the conformer space and selecting the grand training set for molecules with more degrees of freedom are explained for H 2 O and MD simulations later on.

The performance of the ML-KS map and ML-HK map is compared by evaluating E ML[v] that maps from the Gaussian potential to the total energy and the combination of n ML[v] that maps from Gaussian potential to the ground-state density in a 3D Fourier basis representation (l = 25) and E ML[n] that maps from density to total energy. The prediction errors are listed in Table 2.

Table 2 Prediction errors on H 2 and H 2 O Full size table

The mean average error (MAE) of the energy evaluated using the ML-HK map is significantly smaller than that of the ML-KS map. This indicates that even for a 3D system, learning the potential-density relationship via the HK map is much easier than directly learning the potential-energy relationship via the KS map.

Figure 1b shows the errors made by the ML-KS and the ML-HK maps. The error of the ML-HK map is smoother than the ML-KS error and is much smaller, even for the most problematic region when R is smaller than the equilibrium bond distance of R 0 = 0.74 Å. The MAE that is introduced by the PBE approximation on the H 2 data set is 2.3 kcal/mol (compared to exact CI calculations), i.e., well above the errors of the ML model and verifies that the error introduced by the ML-HK map is negligible for a DFT calculation.

The next molecular example is H 2 O, parametrized with three degrees of freedom: two bond lengths and a bond angle. In order to create a conformer data set, the optimized structure (R 0 = 0.97 Å, θ 0 = 104.2 using PBE) is taken as a starting point. A total of 350 geometries are then generated by changing each bond length by a uniformly sampled value between ±0.075 Å and varying the angle θ between ±8.59° (±0.15 rad) away from θ 0 (see Supplementary Fig. 1 for a visualization of the sampled range). For this molecule, the out-of-sample test set again comprises a random subset of 50 geometries, with the remaining 300 geometries used as the grand training set. Because there are now three parameters, it is more difficult to select equidistant samples for the training subset of M data points. We therefore use a K-means approach to find M clusters and select the grand training set geometry closest to each cluster’s center for the training subset (see Supplementary Note 4 for details).

Models are trained as for H 2 . The results are given in Table 2. As expected, the increase in degrees of freedom for H 2 O compared to H 2 requires a larger training set size M. However, even for the more complicated molecule, the ML-HK map is consistently more precise than the ML-KS map and provides an improved potential energy surface, as shown in Fig. 2. With an MAE of 1.2 kcal/mol for PBE energies relative to CCSD(T) calculations for this data set, we again show that ML does not introduce a new significant source of error.

Fig. 2 Result of the comparison for H 2 O. (Top) Distribution of energy errors against PBE on the H 2 O data set for ML-KS and ML-HK. The errors are plotted on a symmetric log scale with a linear threshold of 0.01, using nearest neighbor interpolation from a grid scan for coloring. Black dots mark the test set geometries with averaged bond lengths. (Bottom left) Comparison of the PBE errors made by ML-HK and ML-KS on the test set geometries. (Bottom right) Energy landscape of the ML-HK map for symmetric geometries (R vs. θ). All models were trained on M = 15 training points. Energies and errors in kcal/mol. A black cross marks the PBE equilibrium position Full size image

The ML maps can also be used to find the minimum energy configuration. The total energy is minimized as the geometry varies with respect to both bond lengths and angles. For optimization, we use Powell’s method40, which requires a starting point and an evaluation function to be minimized. For the H 2 O case, the search is restricted to symmetric configurations with a random symmetric geometry used as the starting point. Results are reported in Table 2. The optimizations consistently converge to the correct minimum regardless of starting point, consistent with the maps being convex, i.e., the potential energy curves are sufficiently smooth to avoid introducing artificial local minima.

For larger molecules, generating random conformers that sample the full configurational space becomes difficult. Therefore, we next demonstrate that MD using a classical force field can also be used to create the grand training set. As an example, we use benzene (C 6 H 6 ) with only small fluctuations in atomic positions out of the molecular plane (Supplementary Fig. 2). Appropriate conformers are generated via isothermal MD simulations at 300, 350, and 400 K using the General Amber Force Field (GAFF)41 in the PINY_MD package42. Saving snapshots from the MD trajectories generates a large set of geometries that are sampled using the K-means approach to obtain 2000 representative points for the grand training set. Training of n ML[v] and E ML[n] is performed as above by running DFT calculations on M = 2000 points. We find that the ML error is reduced by creating the training set from trajectories at both the target temperature and a higher temperature to increase the representation of more distorted geometries. The final ML model is tested on 200 conformational snapshots taken from an independent MD trajectory at 300 K (Fig. 3a). The MAE of the ML-HK map for this data set using training geometries from 300 and 350 K trajectories is only 0.37 kcal/mol for an energy range that spans more than 10 kcal/mol (Table 3).

Fig. 3 Energy errors of ML-HK along classical MD trajectories. PBE values in blue, ML-HK values in red. a A 20 ps classical trajectory of benzene. b A 20 ps classical trajectory of ethane Full size image

Table 3 Energy and density-driven errors of the ML-HK approach on the MD data sets Full size table

For benzene, we further quantify the precision of the ML-HK map in reproducing PBE densities. In Fig. 4, it is clear that the errors in the Fourier basis representation are larger than the errors introduced by the ML-HK map by two orders of magnitude. Furthermore, the ML-HK errors in density (as evaluated on a grid in the molecular plane of benzene) are also considerably smaller than the difference in density between density functionals (PBE vs. LDA43). This result verifies that the ML-HK map is specific to the density used to train the model and should be able to differentiate between densities generated with other electronic structure approaches.

Fig. 4 The precision of our density predictions using the Fourier basis for the molecular plane of benzene. The plots show a the difference between the valence density of benzene when using PBE and LDA functionals at the PBE-optimized geometry. b Error introduced by using the Fourier basis representation. c Error introduced by the n ML[v] density fitting (a–c on same color scale). d The total PBE valence density. e The density differences along a 1D cut of a–c. f The density error introduced with the ML-HK map (same data but different scale, as in c) Full size image

Ethane (C 2 H 6 ), with a small energy barrier for the relative rotation of the methyl groups, is also evaluated in the same way. Using geometries sampled with K-means from 300 and 350 K classical trajectories, the ML-HK model reproduces the energy of conformers with a MAE of 0.23 kcal/mol for an independent MD trajectory at 300 K (Fig. 3b). This test set includes conformers from the sparsely-sampled eclipsed configuration (Supplementary Fig. 3). Using points from a 400 K trajectory improves the ML-HK map due to the increased probability of higher energy rotamers in the training set (Table 3). The training set could also be constructed by including explicit rotational conformers, as is commonly done when fitting classical force field parameters41. In either case, generating appropriate conformers for training via computationally cheap classical MD significantly decreases the cost of the ML-HK approach.

As additional proof of the versatility of the ML-HK map, we show that this approach is also able to interpolate energies for proton transfer in the enol form of malonaldehyde (C 3 H 4 O 2 ). This molecule is a well-known example of intramolecular proton transfer, and our previous AIMD and ab initio path integral studies44 found classical and quantum-free energy barrier values of 3.5 and 1.6 kcal/mol, respectively, from gradient-corrected DFT. In this work, classical MD trajectories are run for each tautomer separately, with a fixed bonding scheme, then combined for K-means sampling to create the grand training set. The training set also includes an artificially constructed geometry that is the average of tautomer atomic positions. For the test set, we use snapshots from a computationally expensive Born–Oppenheimer ab initio MD trajectory at 300 K. Figure 5a shows that the ML-HK map is able to predict DFT energies during a proton transfer event (MAE of 0.27 kcal/mol) despite being trained on classical geometries that did not include these intermediate points.

Fig. 5 Energy errors of ML-HK along ab initio MD and ML-generated trajectories. a Energy errors of ML-HK along a 0.25 ps ab initio MD trajectory of malonaldehyde. PBE values in blue, ML-HK values in red. The ML model correctly predicts energies during proton transfer in frames 7–15 without explicit inclusion of these geometries in the training set. b Energy errors of ML-HK along a 1 ps MD trajectory of malonaldehyde generated by the ML-HK model. ML-HK values in red, PBE values of trajectory snapshots in blue Full size image

We show, finally, that the ML-HK map can also be used to generate a stable MD trajectory for malonaldehyde at 300 K (Fig. 5b). In principle, analytic gradients could be obtained for each timestep, but for this first proof-of-concept trajectory, a finite-difference approach was used to determine atomic forces. The ML-HK-generated trajectory samples the same molecular configurations as the ab inito MD simulation (see Fig. 6 and Supplementary Table 1) with a mean absolute energy error of 0.77 kcal/mol, but it typically underestimates the energy for out-of-plane molecular fluctuations at the extremes of the classical training set (maximum error of 5.7 kcal/mol, see Supplementary Fig. 4). Even with the underestimated energy values, however, the atomic forces are sufficiently large to return the molecule to the equilibrium configuration, thus resulting in a stable trajectory. The new set of coordinates could be further sampled to expand the training set in a self-consistent manner. Using iterative ML-HK-generated MD trajectories would eliminate the need to run computationally expensive MD simulations with KS DFT and would provide an iterative approach to reduce the energy errors for conformations not included in the classical training set.