Physics-informed descriptors

The successful application of ML approaches on the modeling of material properties requires the selection of an appropriate set of modeling variables or, namely, the descriptors for the property of interest. In general, the descriptors are expected to be capable of both sufficiently distinguishing each of the modeled compounds/materials and determining the targeted property. In this context, chemical compositions are straightforwardly used as one type of the most common descriptors as they are usually unique for each modeled material, and many material properties are eventually compositional dependent. In fact, several recent works have shown that using chemical compositions only as descriptors can describe the glass properties through the artificial neural network based ML algorithm20,21,22,35. However, only using compositional descriptors could make the model have limited extrapolative ability13,24,28.

Alternatively, one can construct the descriptors using a group of material feature parameters that have physical correlations with the targeted property. In this way, the resulting model could potentially capture the underlying physical mechanisms after training, and thus offer reliable predictions for the chemistries beyond the training set. These material quantities are generally classified into two categories, namely the chemical and structural feature parameters15,17. Chemical feature parameters are usually elemental properties, such as the effective ionic charge, atomic radius and weight, and electronegativity, which can be obtained by requiring the knowledge of the material chemistry only. Structural feature parameters, such as the atomic coordination number and bonding distances, and radial distribution function, require knowledge of the specific atomistic structures of the material (in addition to the chemistry), and they need to be determined experimentally or from atomistic simulations, such as MD simulations in the present work.

For fast mapping the glass properties in a complex computational space, it is not efficient to use both the chemical and structural feature parameters to construct descriptors. Densities and elastic moduli of the SiO 2 -based glasses are indeed strongly correlated with or determined by many of the glass structural features, such as atomic packing density, coordination numbers and ring sizes of network formers5,7,36,37,38,39,40,41. These structural feature parameters, however, are unknown for a given glass composition in the present work unless the MD simulations have been performed to obtain the corresponding atomistic structure. On the other hand, if the atomistic structure of a glass material is already known, there is no need to perform any ML-based predictions as the elastic moduli and density can be easily and quickly calculated via a MS simulation using the strain-stress method described in Methods Section. In fact, obtaining the glassy structure via MD simulations is the most time-consuming step when computing the density and elastic moduli of a SiO 2 -based glass. Thus, only the chemical quantities are considered for the construction of the descriptors for the ML model in the present work. As a result, the developed ML model is able to predict the properties by only requiring the information of the glass chemistry and without the need to run any additional MD simulations.

The construction of the descriptors should always start with the ones that are physically relevant to the material property of interest. In the MD and MS simulations, the interatomic interactions are determined by the force-field potentials. The calculated density and elastic moduli are actually derived quantities from many multilevel and intricate MD and MS runs. Therefore, the parameters of the force-field potentials can be a set of suitable candidates to construct the ML descriptors due to their intrinsic characteristics to describe the physical features of interatomic bonds.

In the present work, a set of self-consistent force-field potentials27,42,43,44,45,46,47,48,49,50,51 are employed to perform the MD and MS simulations. The potential consists of long-range Coulomb interactions and short-range interactions described in the Buckingham form52, which can be expressed as,

$${\it{U}}_{{i},{j}}\left( {{\it{r}}_{{i},{j}}} \right) = \frac{{{\it{q}}_{\it{i}}{\it{q}}_{j}}}{{4{\it{\uppi \upvarepsilon }}_0{\it{r}}_{{\it{i}},{\it{j}}}}} + {\it{A}}_{{\it{i}},{\it{j}}}{\mathrm{exp}}\left( { - \frac{{{\it{r}}_{{\it{i}},{\it{j}}}}}{{{\it{B}}_{{\it{i}},{\it{j}}}}}} \right) - \frac{{{\it{C}}_{{\it{i}},{\it{j}}}}}{{{\it{r}}_{{\it{i}},{\it{j}}}^6}}$$ (1)

Here r i,j is the interatomic distance between atom i and j; q i and q j are the effective ionic charges of atom i and j, respectively; A i,j , B i,j , and C i,j are the energy parameters of the Buckingham form between i and j. The values of the effective ionic charges and Buckingham parameters for each element are summarized in Table 1. Therefore, according to Eq. 1, the descriptors associated with the Coulomb interactions for a given glass composition is written as,

$$u_{q_m,q_n}^{\mathrm{Coul}} = \sum \limits_{i} c_{i_m} \cdot q_m \cdot \sum \limits_{j} c_{j_n} \cdot q_n$$ (2)

Here q m and q n denote the effective ionic charges listed in Table 1, which have values among −1.2, +0.6, +1.2, +1.8, and +2.4 e; \(c_{i_m}\) and \(c_{j_n}\) denote the mole fractions of the constituent elements i and j with effective ionic charge q m and q n , respectively. For example, for a glass that containing Na, K, Ca, and Sr, as the effective ionic charges are +0.6 e for Na/K and +1.2 e for Ca/Sr, respectively, the descriptor that corresponds to the Coulomb interactions between the ions with +0.6 and +1.2 e charges is calculated as \(u_{ + 0.6, + 1.2}^{{\mathrm{Coul}}} = \left( {c_{Na_{ + 0.6}} + c_{K_{ + 0.6}}} \right) \cdot 0.6 \cdot \left( {c_{Ca_{ + 1.2}} + c_{Sr_{ + 1.2}}} \right) \cdot 1.2\), where \(c_{Na_{ + 0.6}},c_{K_{ + 0.6}}\), \(c_{Ca_{ + 1.2}},{\it{{\mathrm{and}}}}\,c_{Sr_{ + 1.2}}\) are the elemental mole fractions of Na, K, Ca, and Sr, respectively. Because there are five different types of charge valences assigned for the elements that modeled in the present work, the total number of the Coulomb interactions descriptors, \(u_{q_m,q_n}^{{\mathrm{Coul}}}\), is \(C_5^1 + C_5^2\) = 15.

As shown in Eq. 1 and Table 1, the MD parameters associated with the Buckingham term describe the short-range interactions between each ion in a very complex way. Since we do not have a priori knowledge of how to combine these parameters to result in optimal modeling results, based on our previous experience17, the corresponding descriptors are constructed as a series of weighted Hölder means, from which the ML model selects the most useful descriptors for modeling and predicting the glass properties of interest. As shown in Table 1, there are three individual Buckingham parameters (i.e., A i,o , B i,o , and C i,o ) for each element to describe its short-range interactions with the O anions (including the O–O self-interactions). Among these three parameters, the B i,o term influences the short-range interaction energy exponentially based on Eq. 1. Therefore, different from A i,o and C i,o , B i,o is not directly used as the feature parameter for the descriptor construction. Instead, in order to accurately describe the exponential effects of B i,o , we proposed to use a parameter, \(B_{i,O}^\prime\), for the descriptor construction. The \(B_{i,O}^\prime\) parameter is calculated from B i,o ,

$$B_{i,O}^\prime = {\exp}\left( { - \frac{{r_{i,O}^0}}{{B_{i,O}}}} \right)$$ (3)

where \(r_{i,O}^0\) is the distance where the first derivative of the Buckingham form becomes zero. Therefore, for each type of the ions, \(r_{i,O}^0\) is actually calculated from the values of A i,o , B i,o , and C i,o . In addition, since C i,o of Li has a zero value, extra procedures were applied to obtain the value of the \(r_{i,O}^0\) term for Li, which is described in detail in Supplementary Note 3. The calculated values of the \(B_{i,O}^\prime\) term for all the elements studied in the present work are summarized in Table 1, along with their MD parameters. Thus, the descriptors associated with the short-range interactions are eventually generated from A i,o , \(B_{i,O}^\prime\), and C i,o based on the glass composition (c i ) as the following,

$$\begin{array}{l}u_p^x = \left( {\mathop {\sum}\limits_{i \in S_{ele}} {c_ix_{i,O}^p} } \right)^{\frac{1}{p}},p = - 4, - 3, - 2, - 1,1,2,3,4,\\ u_p^x = exp\left( {\mathop {\sum}\limits_{i \in S_{ele}} {c_i{\mathrm{ln}}(x_{i,O})} } \right),p = 0,\end{array}$$ (4)

where \(u_p^x\) denotes the descriptors generated from the feature parameter x i,O associated with the Buckingham short-range interactions between the element i and O. There are three types of x i,O , A i,o , \(B_{i,O}^\prime\), and C i,o . Let \(S_{{\mathrm{ele}}} = \left\{ {{\mathrm{Si,O,Li,Na,K}} \ldots } \right\}\) be the set of the elements contained in the glass. Different values of p results in different Hölder means of x, which are the quartic-harmonic mean (p = −4), cubic-harmonic mean (p = −3), quadratic-harmonic mean (p = −2), harmonic mean (p = −1), geometric mean (p = 0), arithmetic mean (p = 1), Euclidean mean (p = 2), cubic mean (p = 3), and the quartic mean (p = 4), respectively. In addition, in Eq. 4, c i is the mole fraction of the glass constituent element i. Besides, we also consider the standard deviation of the arithmetic means (\(u_1^x\)) as a type of descriptors, which is calculated as,

$$u_{1}^{x - \sigma } = \left( {\left( {\frac{1}{{1 - {\sum

olimits_{i \in S_{{\mathrm{ele}}}}} c_{i}^2}}} \right) \cdot \left( {{\sum \limits_{i = i \in S_{{\mathrm{ele}}}}} c_{i}\left( {x_{i,O} - u_{1}^x} \right)^2} \right)} \right)^{\frac{1}{2}}$$ (5)

Based on Eqs. 4 and 5, thirty distinct descriptors are generated in total from A i,o , \(B_{i,O}^\prime\), and C i,o (27 from Eq. 4, and 3 from Eq. 5). In addition, we include the multiplications between any two of the thirty descriptors as interaction terms to consider the non-linear relations among these descriptors. Finally, we also include the arithmetic mean of the atomic mass as an individual descriptor. As a result, overall 511 input descriptors are generated for the ML models, in which there are fifteen descriptors associated with long-range Coulomb interactions, thirty descriptors generated from the MD parameters of the Buckingham term and 465 corresponding interaction terms (including self-interactions, thus \(C_{30}^1 + C_{30}^2\) = 465), and one descriptor representing the mean atomic mass.

Regressions accuracy of training data

In the present work, the training dataset was generated by high-throughput MD simulations, which contains the densities, bulk and shear moduli (i.e., K and G) of 498 individual glass compositions in 11 binary and 20 ternary SiO 2 -based systems as summarized in Supplementary Table 5. In all, 11 types of additive oxides were considered, namely Li 2 O, Na 2 O, K 2 O, CaO, SrO, Al 2 O 3 , Y 2 O 3, La 2 O 3 , Ce 2 O 3 , Eu 2 O 3 , and Er 2 O 3 . The ML models (i.e. GBM-LASSO and M5P models) were applied to learn each of the glass properties separately.

The densities from the MD-calculated training dataset are plotted in Fig. 1 against the corresponding regression results from the GBM-LASSO and M5P models. For the sake of a clear representation, the data points are grouped into four categories, which are pure amorphous SiO 2 , type-I glasses that only contain alkali and alkaline earth oxides as additives, type-II glasses that contain Al 2 O 3 and other oxides, and type-III glasses that contain rare-earth and other oxides. As shown in Fig. 1, the glass densities produced from both GBM-LASSO and M5P models agree well with the results from MD calculations with root-mean-squared-errors (RMSE) as small as 0.0229 and 0.0325 g cm−3, respectively. It is also found that the distributions of the prediction residuals are close to norm distributions. Together with the histogram of residuals, Fig. 1 implies the ML models demonstrate the correlations of interests very well without any abnormal performance. The regression results of the two ML models on the bulk and shear moduli are also illustrated as parity plots shown in Figs 2 and 3. Still, good agreements are observed between the predictions from ML models and those from MD simulations in the training set. The residuals of the models also approximately follow normal distributions. The regression RMSEs of K and G of the GBM-LASSO model are 2.99 and 1.31 GPa, respectively, while 2.59 and 0.97 GPa for the M5P model. In addition, the GBM-LASSO model seems to yield slight underestimations on the glass samples with higher moduli, as shown in Figs 2a and 3a.

Fig. 1: Performances of the ML models on the glass densities of the training set. a Performance of the GBM-LASSO model. b Distribution of residuals between the GBM-LASSO predictions and the MD results of the training set. c Performance of the M5P model. d Distribution of residuals between the M5P predictions and the MD results of the training set. The curved lines in b, d are normal distributions constructed from the mean and the standard deviation of the residuals. The data points are grouped into four categories based on their glass chemistry, which are pure amorphous SiO 2 , type-I glasses that only contain alkane and alkane earth oxides as additives, type-II glasses that contain Al 2 O 3 and other oxides, and type-III glasses that contain rare-earth and other oxides. Full size image

Fig. 2: Performances of the ML models on the bulk moduli (K) of the training set. a Performance of the GBM-LASSO model. b Distribution of residuals between the GBM-LASSO predictions and the MD results of the training set. c Performance of the M5P model. d Distribution of residuals between the M5P predictions and the MD results of the training set. The curved lines in b, d are normal distributions constructed from the mean and the standard deviation of the residuals. The data points are grouped into four categories based on their glass chemistry by following the definitions Fig. 1. Full size image

Fig. 3: Performances of the ML models on the shear moduli (G) of the training set. a Performance of the GBM-LASSO model. b Distribution of residuals between the GBM-LASSO predictions and the MD results of the training set. c Performance of the M5P model. d Distribution of residuals between the M5P predictions and the MD results of the training set. The curved lines in b, d are normal distributions constructed from the mean and the standard deviation of the residuals. The data points are grouped into four categories based on their glass chemistry by following the definitions in Fig. 1. Full size image

Here, to further evaluate the regression accuracy of the ML models, we define the relative error as,

$${\mathrm{Relative}}\;{\mathrm{error}} = \frac{{\left| {{\it{X}}_{{\it{{\mathrm{ML}}}}} - {\it{X}}_{{\it{{\mathrm{MD}}}}}} \right|}}{{{\it{X}}_{{\it{{\mathrm{MD}}}}}}}\;\left( {X = {\mathrm{density}},K\;{\mathrm{or}}\;G} \right)$$ (6)

where X MD is the density or elastic modulus calculated from MD simulation and X ML is the prediction from the GBM-LASSO or M5P model. As shown in Table 2, for both K and G, over 60% of the predictions from both ML models have a relative error of <5%, and over 90% predictions are within a relative error of <10%, indicating that excellent regression accuracy is achieved. Additionally, we find that the LASSO method has indeed significantly shrunk the size of the descriptor set. Among the 511 input descriptors, only 119, 127, and 87 descriptors are found to have non-zero regression coefficients when the ML models predict the glass density, K and G, respectively. It is also found that many of these descriptors have been multiply used for the LASSO regressions at different GBM iterative steps, indicating they are indeed important and useful to describe these glass properties.

Table 2 Regression results of the GBM-LASSO and M5P machine learning models on the training set, including root mean squared error (RMSE), and the percentage of predictions within 5%, 10%, 20%, and 30% relative errors according to Eq. 6, respectively. Full size table

Prediction capability

Since the ML models are only trained with a small set of data from MD simulations for the binary and ternary systems, providing reliable predictions out of the domain of the training set is quite crucial for the present models in terms of the future applications in the practical glass design spaces. Here, we randomly choose 11 ternary, 30 quaternary, 30 quinary, and 30 senary glass compositions that are not included in the training dataset to evaluate the prediction capabilities of the ML models in the compositional space beyond the training set. For each chosen composition, the GBM-LASSO and M5P models are applied to predict its density, K and G, and then MD simulations are correspondingly performed to validate the ML predictions. The validation results are shown as parity plots in Fig. 4. In addition, the prediction errors are analyzed and summarized in Table 3 in the same way as the error analysis of the training process (Table 2). On the one hand, it is found that the M5P model seems to yield large uncertainties when extrapolating. As shown in Table 3, the RMSEs of the predictions from the M5P model with respect to MD validations are 0.1774 g cm−3, 5.24 and 2.27 GPa for density, K and G, respectively, which are much larger compared to the RMSEs of the learning results listed in Table 2 (0.0325 g cm−3, 2.59 and 0.97 GPa for density, K and G). In addition, as shown in Fig. 4a–c, the data points in the parity plots of the extrapolative predictions are more scattered compared to the results of the training process (Figs 1c, 2c, and 3c). Particularly, as marked out in Fig. 4b, c, there are several predictions for the bulk and shear moduli that largely deviated from the MD results. Their relative errors are found to be over 20%. Moreover, it is worth to note that the M5P model is also trained by further decreasing the number of descriptors, which only resulted in a further increase in the training RMSEs but no significant improvements on the prediction RMSEs.

Fig. 4: Prediction performances of the ML models on glass compositions beyond the training set. The predictions from the M5P and GBM-LASSO model are plotted versus the validation results from MD simulations. a–c Density, bulk and shear moduli predicted by the M5P model. d–f Density, bulk and shear moduli predicted by the GBM-LASSO model. The glass compositions used for the testing are from 101 randomly chosen ternary, quaternary and senary systems that are not included in the training set. The composition information of each data point can be found in Supplementary Table 6. The data points within the region between two black dot-dashed lines have relative errors less than 10%. Full size image

Table 3 Prediction errors of the GBM-LASSO and M5P machine learning models for the glass compositions that are not included in the training set, including root mean squared error (RMSE), and the percentage of predictions within 5%, 10%, 20%, and 30% relative error according to Eq. 6, respectively. Full size table

On the other hand, the developed GBM-LASSO model shows very promising prediction capabilities for multicomponent glass systems beyond the training set. As shown in Fig. 4d–f, the density, K and G predicted from the GBM-LASSO model are in very good agreement with the MD results. Nearly 85% of the predictions for K and over 90% for G have relative errors <10%. Moreover, as shown in Table 3, the RMSEs of the predictions from the GBM-LASSO model with respect to MD validations are 0.0536 g cm−3, 3.69 and 1.34 GPa for density, K and G, respectively, agreeing well to the training uncertainties of the model listed in Table 2. The results suggest that, after training with a small set of data for only binary and ternary systems, the developed GBM-LASSO model shows promising abilities to give reliable predictions for multicomponent k-nary glasses as long as their constituent oxides are included in the training set.

Moreover, we find the prediction range of the GBM-LASSO model can be possibly extended to cover more types of additive oxides by adding a small amount of related binary and ternary MD data to the training set. Here we use B 2 O 3 and ZrO 2 as examples, as the Buckingham potentials for boron and Zr have been recently developed by Du et al.43,53, which are also compatible with the set of MD potentials used in the present work. The original training set is slightly modified by adding a few new binary and ternary data with glass compositions containing B 2 O 3 or ZrO 2 . Specifically, 7 binary and 21 ternary data are added with compositions from the xB 2 O 3 -(100-x)SiO 2 (x = 5, 10, 15, 20, 25, 30, and 35) and xB 2 O 3 -yNa 2 O-(100-x-y)SiO 2 systems (x, y = 5, 10, 15, 20, 25, and 30, and x + y ≤ 35), respectively. Also, for ZrO 2 , 13 new data are added to the training dataset, which are xZrO 2 -(100-x)SiO 2 (x = 5, 10, 15, 20, 25, 30, 35) and xZrO 2 -(35-x)Na 2 O-65SiO 2 (x = 5, 10, 15, 20, 25, 30). The GBM-LASSO model is re-trained with the corresponding new training set. Notably, the density, K and G of the newly added glass compositions are well reproduced by the new training dataset, and the overall RMSEs are just slightly varied (0.012 g cm−3 for density, 0.26 GPa for K and 0.30 GPa for G) from the values listed in Table 2. As shown in Fig. 5a, b, the non-linear effects of B 2 O 3 on the bulk and shear moduli are accurately described for the xB 2 O 3 -(100-x)SiO 2 and xB 2 O 3 -(30-x)Na 2 O-70SiO 2 glasses after training. Moreover, the newly trained model can then be expanded to the multicomponent glasses that contain B 2 O 3 and ZrO 2 . As shown in Fig. 5c, the ML predictions for several B 2 O 3 -containing compositions, which are not in the training set, are well confirmed by MD validations. Similar results are also observed for the ZrO 2 -containing glasses as shown in Supplementary Fig. 4. These results suggest that the developed GBM-LASSO has great potentials to be further expanded to cover more types of additive oxides in the future. To achieve such expansions, we only need a few of MD simulations to generate the binary and ternary data containing new types of oxides for the training set.

Fig. 5: Extendibility of the GBM-LASSO model for new oxide species. a, b Reproduction of the non-linear effects of B 2 O 3 on bulk and shear modulus in the a xB 2 O 3 -(100-x)SiO 2 and (b) xB 2 O 3 -(30-x)Na 2 O-70SiO 2 glasses in the training set. c Predictions from GBM-LASSO versus MD results on the test set. The test set is composed of 15 randomly selected compositions for the B 2 O 3 -containing multicomponent glasses (detailed information is listed in Supplementary Table 7) that are not included in the training dataset. Full size image

We believe the outstanding prediction capability of the GBM-LASSO model may benefit from two aspects: the method of descriptor construction and the advantages of the regression algorithms employed in the model. As described in Eqs. 2–5, instead of directly using the chemical composition as descriptors, the present model constructs descriptors from the compositional averages of the MD potential parameters. As a result, these descriptors can not only smoothly map the entire design space as they are continuous functions of the glass compositions but also contain the information to reflect the intrinsic physical features of each component element, which are compositionally discrete. More importantly, the construction method ensures that the total number of the descriptors is invariant to the arity of the glass chemistry. In other words, it generates the same number of descriptors for any given glass composition, no matter how many types of additive oxides it contains, as long as the interatomic potentials based on Eq. 1 is used for MD simulations. In addition, most of the descriptors still have non-zero values even when the investigated glass contains only one or two types of additive oxides. As a result, this would allow the ML models to transform the extrapolation problems in the chemical compositional space into interpolation-like problems in the constructed descriptor space based on both glass composition and MD force-field parameters.

Furthermore, the GBM-LASSO model may also benefit from some unique features of the regression algorithms employed in the model. In principle, a good prediction ability means a model should avoid over-fitting performance and still achieve a regression accuracy as high as possible. In the present work, due to a relatively small size of the train set, the number of descriptors is almost the same as the number of training data. This results in a potential risk of over-fitting if all the descriptors are considered equally strong and used for regression. The LASSO regression method could be particularly useful to resolve this issue as it screens out the nonsignificant descriptors by setting their coefficient to zero. As a result, the risk of over-fitting could be efficiently reduced as the regression is actually produced by a much smaller number of descriptors.

Moreover, for a broader comparison, we also applied our descriptors and training/testing data with other two typical ML models, a frequently used GBM regression tree model (GBM-RT) implemented in the XGboost package54 and a model using the elastic net method55 under the GBM framework (GBM-EN). The prediction performances of these two models are described in detail in Supplementary Note 5. Comparing the prediction performances of all the test ML models (i.e., GBM-LASSO, GBM-EN, GBM-RT and M5P), it is noticed that GBM-LASSO/EN models generally show better performance than the tree-based models when predicting beyond the training set. One possible reason could be that the GBM-LASSO/EN models conduct continuous regression functions (LASSO and EN) by considering all the observations/descriptors simultaneously at each GBM-iterative step, and they do not perform data classification like the tree-based model. As a result, the regression processes enforce more smoothness than the tree-based models in the functions mapping continuous descriptors to observations, especially when the size of the training set is small and the targeted responses are continuous functions of descriptors. On the other hand, tree-based methods usually require hard thresholds on the classification boundary. This requirement could result in large prediction uncertainties for the untrained sample if one or several input descriptors have values very close to the classification boundary, especially when the model itself is trained with a small set of data but used for extrapolative predictions. For this reason, the GBM-LASSO model proposed in the present work could be advantageous for many of materials problems. In these cases, the properties of interests (e.g., density and elastic moduli) are reasonably continuous and smooth to the descriptors (e.g., compositions), but the training set is relatively small and established from the studies of sparse regions.

Comparison between ML predictions and experimental measurements

To further evaluate the model reliability, the predictions of the present GBM-LASSO model are validated with a large amount of experimental data across a multicomponent compositional space. Specifically, we collected the experimentally measured density and shear (G) and Young’s (E) moduli from the Sciglass 7.12 database, which in turn were gathered from academic literature and patents published up to May 201456, for the SiO 2 -based glasses containing the 12 additive oxides (i.e., Li 2 O, Na 2 O, K 2 O, CaO, SrO, Al 2 O 3 , Y 2 O 3, La 2 O 3 , Ce 2 O 3 , Eu 2 O 3 , Er 2 O 3 , and ZrO 2 ) that have been considered in the present work. When collecting the data, we constrained the composition of SiO 2 to be no <50 mol%. In comparison, it is worth to note all the glass compositions in our MD training dataset have no <65 mol% SiO 2 . Overall 550 data points, including 142 binary, 303 ternary, 95 quaternary, and 10 higher-order data (oxide components more than four), were collected for G; 1010 data points, including 231 binary, 464 ternary, 157 quaternary, and 158 higher-order data, were collected for E; 4647 data points, including 1327 binary, 2483 ternary, 607 quaternary and 230 higher-order data, were collected for density. Moreover, ~30% of the data have the SiO 2 composition less than 65 mol%, which can serve as a validation to test the extrapolation capability of the present ML model in the compositional space. In addition, among these collected data, some of them can correspond to the same or very similar glass compositions, but they are gathered from different literature sources, as the density and elastic moduli for those compositions have been measured multiple times previously.

For each of the collected experimental data point, we took the corresponding glass composition to predict the G, E and density using our GBM-LASSO ML model and compare them with the experimental values. The predicted E is calculated from predicted K and G as described by Eq. 10 in Methods Section. It is worth to mention that the GBM-LASSO model is still only trained with the MD training set, and the collected experimental data were not used for training. As shown in Fig. 6, the validation results are characterized as 2D-hexbin plots with the ML predicted results versus the experimental values. It is found that the predictions from the GBM-LASSO model generally agree well with the experimental measurements. Compared to the experimental values, over 50% of the model predictions have relative errors <7%, and ~90% predictions are with relative errors <15% for both G and E. In terms of density, the predictions from the ML model yields even better agreement with experiments, where over 80% of predictions have relative errors <3% and 96% of predictions are with relative error <6%.

Fig. 6: Glass properties predicted by the GBM-LASSO model validated with experimental results. Experimental data were collected from the Sciglass 7.12 database56. The GBM-LASSO model is only trained with the MD training set and the experimental data were not used for training. a Shear modulus; b Young’s modulus; c Density. The dashed line is the identity where the predictions are equal to the experimental values. The hexagonal unit with a hotter color means that there are more data points within the coverage area of the unit. The dashed-line circles in a mark out typical examples of prediction outliers caused by experimental data inconsistency. Full size image

Besides the general agreement between the ML predictions and experimental data, as shown in Fig. 6, it is noted that there are still scattered ML predictions that are largely deviated from the experimental values. After a careful analysis, we found that many of these prediction outliers should result from the inconsistency between the experimental data as they were gathered from different sources. In other words, the predictions of the ML model are in a good agreement with other sets of the experimental data with the glass compositions that are equal or close to the outliers. Here we show two typical examples as marked by the dashed-line circles in Fig. 6a. One set of the data there corresponds to a measurement on the Li 2 O-SiO 2 binary glasses with Li 2 O contents ranging from 26 mol% to 40 mol%, in which shear modulus of the glasses were reported to range from 5.71 to 13.79 GPa57. In contrast, at the corresponding compositions, the ML model predicted that the shear moduli should be ~31–33 GPa, which are actually in very good agreement with the results of experimental measurements on similar glass compositions from other two studies58,59. Another set of data marked by the circle in Fig. 6 corresponds to a measurement on the Al 2 O 3 -Y 2 O 3 -SiO 2 glasses60, where the ML model yields conflict predictions. However, in the meanwhile, the ML predictions on the Al 2 O 3 -Y 2 O 3 -SiO 2 glass systems are also confirmed by other experimental measurements61,62,63 (More details are described in Supplementary Note 6). In addition, we acknowledge that, for some of the prediction outliers in Fig. 6, we still cannot have clear reasons as there are no other data available for comparison. These outliers can result from the inaccuracy of the MD simulations or the ML model when predicting the elastic moduli and densities for some specific glass chemistries. For example, it is found that the present ML model generally underestimates the densities of ternary glasses containing both Al 2 O 3 and rare-earth oxides (i.e., Y 2 O 3 , La 2 O 3 , Eu 2 O 3 and Er 2 O 3 ).

More importantly, after we remove these outliers (i.e. 15 out of 550, 35 out of 1010, and 77 out of 4647 in total for G, E and density, respectively) that can be confidently regarded as the experimental inconsistency, the RMSEs of the predictions from the present GBM-LASSO model are 2.51, 6.67, and 0.0700 g cm−3 for G, E and D, respectively, which are reasonably small by considering the possible uncertainties of the experimental measurements. Such uncertainties are quite common in the Sciglass database due to different experimental methods and sources (one example is shown in Supplementary Fig. 2b). The general agreements between the ML predictions and experimental data shown in Fig. 6 further support the prediction reliability of the present GBM-LASSO model in a complex compositional space.

In addition, when validating with the experimental data for the B 2 O 3 -containing glasses from the Sciglass database, we found that the present GBM-LASSO model could have relatively large uncertainties in prediction accuracy. For example, the model predictions on the Young’s moduli of the B 2 O 3 -Na 2 O-SiO 2 ternary glasses are found to agree with the experimental measurements from some certain groups64,65,66 (RMSE: ~6.33 GPa) but largely deviate from other experimental data in the Sciglass database (RMSE: ~15.05 GPa)56. There are two possible reasons for such fluctuations in prediction accuracy. First, the experimental data from different studies already contain large fluctuations in elastic moduli for glasses with similar chemical compositions67,68,69, indicating potentially large errors in some experiments. Second, the force-field potential of B 2 O 3 employed in the present work can be inaccurate in terms of describing the elastic moduli. As reported by the developers of this B 2 O 3 potential53, the MD predicted bulk, shear and Young’s modulus can be much higher than the experimental values in the B 2 O 3 -Na 2 O-SiO 2 ternary system (up to 50% depending on the concentrations), although the variation trends with respect to the glass compositions are reproduced. However, because of the consistency between the MD results and our ML predictions (Fig. 5c), our developed GBM-LASSO model still has the capability to provide more reliable and accurate predictions for the B 2 O 3 -containing glasses, as long as compatible interatomic potentials that are more accurate on elastic properties are developed. Under that situation, one would only need to use the new interatomic potential to calculate a small amount of binary and ternary data and incorporate them into the training set.

Furthermore, the prediction capability of the GBM-LASSO model on elastic moduli is also evaluated by comparing it with a widely used physics-based model developed by Makishima and Mackenzie70,71, hereafter referred to as MM model. Noteworthily, the MM model requires the actual density of the glass as an extra input, but the present GBM-LASSO model can make predictions only according to glass compositions, which makes it more suitable to be used as a fast screening tool before practical syntheses. Additionally, in the MM model, the interactions between atoms are assumed to be fully ionic so that Young’s modulus can be derived from the Coulomb form of the electrostatic energy71. Such an ionic assumption could be problematic when it is applied for modeling the transition-metal oxides since the partially covalent characteristics of the metal-oxygen chemical bonds cannot be ignored. However, the covalent characteristics can be well captured by the Buckingham short-range interaction parameters in MD simulations, which are also used as input features to construct ML descriptors in the present work.

Indeed, compared to the MM model, it is found that the GBM-LASSO model yields considerable improvements on the elastic moduli predictions for the SiO 2 -based glasses containing transition-metal oxides. By using an experimental validation dataset collected from the Sciglass database, which is composed of multicomponent SiO 2 -based glasses with Y 2 O 3 as one of the constituent components, the prediction RMSE of the GBM-LASSO model is calculated to be 10.16 GPa. As a comparison, the prediction RMSE of the MM model on the same dataset is as high as 22.42 GPa if the density-inputs are taken from the predictions of a widely used empirical regression model developed by Priven10, and 13.39 GPa if experimental densities are used as inputs. Similar results were also observed for the ZrO 2 -containing glasses, where the prediction RMSE of the GBM-LASSO model is 6.69 GPa, much smaller than that of the MM model, which is 10.55 GPa. More detailed information is provided in Supplementary Note 7.

As a further demonstration, we also performed an investigation in the Y 2 O 3 -SiO 2 binary systems. Since there are no experimental measurements for this binary system, we performed ab initio MD simulations (AIMD) on bulk modulus (K) for several glass compositions to validate the results of our classical MD simulations. Due to the high computational costs, the AIMD simulations were not performed for predicting Young’s modulus. The calculation settings of the AIMD simulations are described in detail in Supplementary Note 8. As shown in Fig. 7, the bulk modulus predicted from the GBM-LASSO model agree well with both the classical MD and AIMD simulations. However, the predictions from the MM model largely deviate from the results of MD simulations using the glass densities no matter computed from classical MD simulations or predicted from the widely used empirical model developed by Priven10.

Fig. 7: Bulk modulus of the Y 2 O 3 -SiO 2 binary glasses. The bulk moduli predicted from the GBM-LASSO and MM models70 are in comparison with the calculations from the classical MD and AIMD simulations. The error bar of the AIMD results are generated from the results calculated under different applied strains. Full size image

Rapid screening of glass density and elastic moduli

The GBM-LASSO model developed in the present work is able to predict the density and elastic moduli of a given glass composition in a negligible fraction of a second, making it possible for a rapid and comprehensive screening on these properties in a complex compositional space. As an illustration, we apply the trained GBM-LASSO model to systematically map the distributions and variations of the densities and elastic moduli of Y 2 O 3 -doped soda-lime-alumina glasses. Specifically, a quinary compositional space composed of Na 2 O, CaO, Al 2 O 3 , Y 2 O 3 , and SiO 2 is homogenously meshed with a compositional interval of 1.0 mol% and under a constraint that the concentration of SiO 2 is no less than 65.0 mol%. The GBM-LASSO model is employed to predict the density, K and G for the glass composition at each mesh point. Overall, 82,251 compositions were studied by running the program on a regular personal computer (PC) in just a few hours. In contrast, tremendous computational powers (108–109 CPU hours) will be burned if purely using the MD simulations to generate the same amount of data.

The prediction results are visualized in Fig. 8 as a 2D histogram plot with respect to density and Young’s modulus, E, which is calculated from predicted K and G. From a practical point of view, one would expect a structural glass to have Young’s modulus as high as possible, and meanwhile keep a relatively low density. From Fig. 8 we can know that most of the glasses in the Na 2 O-CaO-Al 2 O 3 -Y 2 O 3 -SiO 2 system have Young’s moduli around 83 GPa and densities around 2.6 g cm−3. From the screening, it is also found that low Young’s moduli generally occur for the glasses with high Na 2 O contents, while the large additions of Al 2 O 3 and Y 2 O 3 result in a significant enhancement on Young’s moduli, which is consistent with the previous experimental observation61. As marked by the red-dashed-line circle in Fig. 8, one can achieve a series of glasses with Young’s moduli higher than 100 GPa and densities ranging from 2.5 to 3.1 g cm−3 by optimizing the contents of the additive oxides. In addition, from the screening results, one can also know that it is probably difficult to prepare glasses with densities lower than 2.4 g cm−3 but Young’s moduli larger than 80 GPa in this system. All in all, using the present developed GBM-LASSO model, a compositional-property database for any glass systems of interest can be rapidly generated as long as the corresponding force-field potentials are available and accurate enough to describe the structural and elastic properties. These databases allow the designers to have a fruitful overview on the density and elastic properties to enlighten their own design before experimental syntheses.