Numerous properties are related to the distribution of solvents around solutes, including solvation free energy, partial molar volume, salting-out constants and binding free energies. However, while it is possible to approximate the solvent distribution from rigorous statistical mechanics, determining these properties directly from simple solvent data alone has proved problematic. As Maxim Fedorov and colleagues point out in their Journal of Physics: Condensed Matter report, “using a purely theoretical approach, it is difficult to relate these [solvent] distributions to the substance’s biological effects which are a result of a large number of complex interrelated phenomena, such as toxicity or bioaccumulation.” Using machine learning based on a 3D convolutional neural network, they show how to bridge the gap between this simple input data and the complex physical-chemical properties associated with it.

Fedorov – Director of the Skoltech Center for Computational and Data-Intensive Science and Engineering and a researcher at the Skolkovo Institute of Science and Technology in Russia and Professor at the University of Strathclyde in Scotland – worked alongside Sergey Sosnin at Skoltech, Maksim Misin at the University of Tartu in Estonia and David S Palmer at the University of Strathclyde. They obtained input data on the concentration of water molecules around various organic molecules from molecular theory, using the three-dimensional reference interaction site model (3D-RISM). They then split a data set from the US Environmental Protection Agency into two sets – one to train their neural network and one to test it. The data set comprised measured bioconcentration factor values for various fish species including carp and salmonids.
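
The split into training and test sets described above can be sketched in a few lines of Python. This is only an illustration, not the researchers' script: the function name split_dataset, the 20% test fraction, the fixed seed and the placeholder records are all assumptions, since the report's actual split ratio and data format are not given here.

```python
import random


def split_dataset(records, test_fraction=0.2, seed=0):
    """Shuffle records reproducibly and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    shuffled = list(records)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for repeatability
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Placeholder records (hypothetical identifiers and dummy values, not EPA data):
# each pairs a molecule label with a stand-in for a bioconcentration value.
records = [(f"mol_{i}", 0.1 * i) for i in range(10)]
train, test = split_dataset(records, test_fraction=0.2, seed=42)
print(len(train), len(test))
```

Keeping the test molecules completely separate from the training molecules is what allows the accuracy measured on the test set to be read as an estimate of performance on unseen compounds.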

From simple data to complex properties

Fedorov and colleagues found that their neural network could determine bioconcentration factors from solvent data with an accuracy equal to that of the ‘consensus’ model provided by the US EPA. “This result is noteworthy due to the fact that our model was based only on the 3D distribution of water molecules while the EPA’s models used a large set of descriptors of varying nature,” they explain in their report. In contrast, results from a graph convolution model were notably less accurate.
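
To give a feel for what a 3D convolution does with a solvent distribution, the following sketch slides a small cubic kernel over a voxel grid and records a weighted sum at each position. It is a toy under stated assumptions, not the network from the report: the function conv3d_valid, the 4×4×4 grid of ones and the averaging kernel are invented for illustration, and a real network would learn many kernels and stack several such layers.

```python
def conv3d_valid(grid, kernel):
    """Valid-mode 3D cross-correlation of a cubic grid with a cubic kernel.

    Both arguments are nested lists indexed as [x][y][z].
    """
    n, k = len(grid), len(kernel)
    size = n - k + 1  # output edge length when no padding is used
    out = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    for x in range(size):
        for y in range(size):
            for z in range(size):
                total = 0.0
                for i in range(k):
                    for j in range(k):
                        for m in range(k):
                            total += grid[x + i][y + j][z + m] * kernel[i][j][m]
                out[x][y][z] = total
    return out


# Toy input (assumption): uniform density of 1.0 on a 4x4x4 grid,
# filtered with a 2x2x2 kernel that averages its eight voxels.
grid = [[[1.0] * 4 for _ in range(4)] for _ in range(4)]
kernel = [[[0.125] * 2 for _ in range(2)] for _ in range(2)]
features = conv3d_valid(grid, kernel)
print(len(features), features[0][0][0])
```

Because the kernel sees a small neighbourhood in all three directions at once, features built this way depend on the spatial arrangement of the solvent density, which is the information a list of scalar descriptors does not carry directly.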

The researchers have already simplified their script for the model, but further development is needed to handle the size of the input data describing the solvent distribution. Nevertheless, the work demonstrates that artificial neural networks can provide useful links between simple input data and complex physical-chemical properties.

Full details are reported in the Journal of Physics: Condensed Matter.