Synthetic data for AI research

Data privacy and data security have become major issues of our time. GDPR in Europe now enforces strict rules on how data is gathered, stored and used. This can quickly become an expensive nightmare for any company that deals with data. GDPR can be an even more serious issue for companies doing R&D, since in those cases data is indispensable.

I was recently asked about the data from my research on using machine learning to predict football injuries. Unfortunately, this data is not open, so it can't be shared with other researchers. Datasets in this field are very difficult to acquire, which hinders further research. However, there is an easy way around this problem: synthetic data.

Synthetic data is data produced by an algorithm, based on an original dataset. It can't always replace the real thing, but in many cases it can be just as good. For example, in my course on predicting football injuries using R, Python and Weka, I used synthetic data based on the original data from my PhD. For the purposes of teaching machine learning, synthetic data is just as good.

Synthetic data can also be useful when developing and testing new algorithms. A recent article from MIT discusses this very topic. The authors report that in more than 70% of the cases examined, research using synthetic data performed the same as or better than research using the original dataset.

So, how can you generate synthetic data? There are different approaches. Given that this is a relatively new concept, there are no standard methods yet, but this doesn't mean you can't go ahead and do it.

Generating synthetic data

The simplest way to generate synthetic data is to fit a distribution to each column individually and then sample random values from it. You need to be careful to fit distributions that actually make sense for your data. For example, if you are modelling variables that never take zero or negative values, the Gaussian distribution might be a bad choice. Then again, height is a variable that is famously modelled as normally distributed, even though height can't be zero or negative; in that case the normal distribution is simply a convenient simplification.
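As a minimal sketch of this per-column approach, the snippet below fits a distribution to each column of a toy dataset and resamples from it using scipy. The column names, sample sizes and distribution choices are illustrative assumptions, not part of any real dataset:

```python
# Per-column synthesis sketch: fit a distribution to each column
# independently, then sample fresh values from the fitted model.
# All columns here are toy data invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Toy "original" dataset: height is roughly normal, while weekly
# training minutes are strictly positive, so a gamma fit suits them better.
original = {
    "height_cm": rng.normal(180, 7, size=500),
    "training_min": rng.gamma(shape=4.0, scale=60.0, size=500),
}

synthetic = {}

# Height: estimate mean and standard deviation, then resample.
mu, sigma = stats.norm.fit(original["height_cm"])
synthetic["height_cm"] = stats.norm.rvs(mu, sigma, size=500, random_state=rng)

# Training minutes: fit a gamma distribution, which never goes negative.
a, loc, scale = stats.gamma.fit(original["training_min"], floc=0)
synthetic["training_min"] = stats.gamma.rvs(
    a, loc=loc, scale=scale, size=500, random_state=rng
)
```

Note how the gamma column keeps its non-negativity, which a naive Gaussian fit would not guarantee.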

The problem with this approach is that it does not take into account any dependencies or correlations between variables. In the MIT paper, the authors used Gaussian copulas to capture the covariance between columns. This is a more advanced, but more complete, method. The copula package in R can be a good choice if you are willing to go down this path.
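To make the copula idea concrete, here is a simplified Gaussian-copula sketch in Python rather than R. It uses empirical ranks and quantiles for the marginals instead of fitted parametric distributions, so it is an assumption-laden toy version of the technique, not the MIT implementation:

```python
# Gaussian-copula synthesis sketch: preserve cross-column correlations
# by modelling them in "normal score" space, while keeping each
# column's own (empirical) marginal distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def gaussian_copula_sample(data, n, rng):
    """data: (rows, cols) matrix of originals; returns n synthetic rows."""
    n_rows, n_cols = data.shape
    # 1. Map each column to (0, 1) via its empirical rank...
    u = (stats.rankdata(data, axis=0) - 0.5) / n_rows
    # 2. ...then to standard-normal scores.
    z = stats.norm.ppf(u)
    # 3. Estimate the correlation structure in normal space.
    corr = np.corrcoef(z, rowvar=False)
    # 4. Draw correlated normals, then push them back through each
    #    column's empirical quantile function.
    z_new = rng.multivariate_normal(np.zeros(n_cols), corr, size=n)
    u_new = stats.norm.cdf(z_new)
    return np.column_stack(
        [np.quantile(data[:, j], u_new[:, j]) for j in range(n_cols)]
    )

# Two strongly correlated toy columns.
x = rng.normal(size=300)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.5, size=300)])
synth = gaussian_copula_sample(data, 300, rng)
```

Unlike independent per-column fitting, the synthetic columns here stay correlated roughly as strongly as the originals.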

Finally, another way to create synthetic data, which applies mainly to computer vision problems, is data augmentation. Data augmentation applies transformations and noise to the original image in order to create multiple versions of it. For example, we can shift, zoom in, add blur and so on, ending up with ten new images for each original. This is an easy way to enrich your original dataset with more examples.
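A numpy-only sketch of the idea is below: each transform produces a new variant of an "image" (a 2-D array standing in for pixels). Real pipelines would typically use a dedicated library such as Keras's or torchvision's augmentation utilities, and the specific transforms and parameters here are illustrative:

```python
# Data augmentation sketch: generate altered copies of one image
# via simple transforms (mirror, shift, additive noise).
import numpy as np

rng = np.random.default_rng(1)

def augment(image):
    """Return several altered copies of the input image."""
    variants = []
    variants.append(np.fliplr(image))                 # horizontal mirror
    variants.append(np.roll(image, shift=2, axis=1))  # shift right by 2 px
    noisy = image + rng.normal(scale=5.0, size=image.shape)
    variants.append(np.clip(noisy, 0, 255))           # additive noise
    return variants

# A random 32x32 grayscale "image" stands in for a real photo.
img = rng.integers(0, 256, size=(32, 32)).astype(float)
augmented = augment(img)  # three new training examples from one original
```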

Synthetic data: the solution to data privacy?

Is synthetic data the solution to everything related to data privacy and security? Not always, since in some cases the real data will be needed. For example, when making predictions, real data has to be used. Furthermore, the data used to generate the synthetic data still needs to be stored in a database, which is potentially vulnerable to hacking.

However, synthetic data opens up many possibilities. A hospital, for example, could share synthetic data based on its patient records instead of the originals, eliminating the risk of identifying individuals. So if you work in a field where you handle sensitive data, you should seriously consider trying synthetic data. It could help you approach research questions that might otherwise have remained unapproachable.