In PloS one ; h5-index 176.0

INTRODUCTION : The potential for synthetic data to act as a replacement for real data in research has attracted attention in recent months due to the prospect of increasing access to data and overcoming data privacy concerns when sharing data. The field of generative artificial intelligence and synthetic data is still early in its development, and there is a gap in research evidencing that synthetic data can adequately be used to train algorithms that are then applied to real data. This study compares the performance of a series of machine learning models trained on real data and synthetic data, based on the National Diet and Nutrition Survey (NDNS).

METHODS : Features identified as potentially relevant using directed acyclic graphs were isolated from the NDNS dataset and used to construct synthetic datasets and impute missing data. Recursive feature elimination identified that only four variables were needed to predict mean arterial blood pressure: age, sex, weight and height. Bayesian generalised linear regression, random forest and neural network models were constructed based on these four variables to predict blood pressure. Models were trained on the real data training set (n = 2408), a synthetic data training set (n = 2408), a larger synthetic data training set (n = 4816) and a combination of the real and synthetic data training sets (n = 4816). The same test set (n = 424) was used for each model.
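
The comparison design described above (several training sets, one shared test set) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are simulated rather than drawn from the NDNS, the `jitter_resample` function is a crude stand-in for whatever synthetic data generator the authors used, and ordinary least squares stands in for the Bayesian generalised linear regression, random forest and neural network models. Only the sample sizes and the four predictors come from the abstract.

```python
import random

FEATURES = 4  # age, sex, weight, height (the four predictors named in the abstract)


def simulate_real(n, rng):
    """Simulated stand-in for survey rows; NOT NDNS data. Coefficients are invented."""
    rows = []
    for _ in range(n):
        age = rng.uniform(18, 90)
        sex = rng.choice([0.0, 1.0])
        height = rng.gauss(162 + 13 * sex, 7)                # cm
        weight = rng.gauss(0.9 * (height - 100) + 10, 12)    # kg
        bp = 60 + 0.25 * age + 3 * sex + 0.15 * weight + 0.02 * height + rng.gauss(0, 10)
        rows.append((age, sex, weight, height, bp))
    return rows


def jitter_resample(rows, n, rng, scale=0.05):
    """Hypothetical synthetic-data stand-in: bootstrap rows, perturb continuous values.

    This offers no privacy guarantee; it only mimics 'data resembling the real set'.
    """
    out = []
    for _ in range(n):
        age, sex, weight, height, bp = rng.choice(rows)
        noisy = [v * (1 + rng.gauss(0, scale)) for v in (age, weight, height, bp)]
        out.append((noisy[0], sex, noisy[1], noisy[2], noisy[3]))
    return out


def fit_ols(rows):
    """Least-squares coefficients (intercept first) from the normal equations."""
    X = [(1.0,) + r[:FEATURES] for r in rows]
    y = [r[FEATURES] for r in rows]
    k = FEATURES + 1
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def mean_absolute_error(coef, rows):
    """Average absolute difference between predicted and observed blood pressure."""
    errors = [abs(coef[0] + sum(c * v for c, v in zip(coef[1:], r[:FEATURES])) - r[FEATURES])
              for r in rows]
    return sum(errors) / len(errors)


rng = random.Random(0)
real = simulate_real(2408 + 424, rng)
train, test = real[:2408], real[2408:]   # sizes taken from the abstract

training_sets = {
    "real": train,
    "synthetic": jitter_resample(train, 2408, rng),
    "synthetic_large": jitter_resample(train, 4816, rng),
    "combined": train + jitter_resample(train, 2408, rng),
}

# Every model is scored on the same held-out test set of real rows.
results = {name: mean_absolute_error(fit_ols(rows), test)
           for name, rows in training_sets.items()}
for name, score in results.items():
    print(f"{name:16s} MAE = {score:.2f}")
```

Scoring every model on the same held-out real rows is what makes the errors comparable across training sets. The sketch uses mean absolute error as its metric, which is an assumption about what the abstract's error measure refers to, and its numbers say nothing about the study's actual results.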

RESULTS : Synthetic datasets demonstrated a high degree of fidelity with the real dataset. There was no significant difference between the performance of models trained on real, synthetic or combined datasets. Mean average error across all models and all training data ranged from 8.12 to 8.33. This indicates that synthetic data was capable of training machine learning models as accurate as those trained on real data.

DISCUSSION : Further research is needed on a variety of datasets to confirm the utility of synthetic data as a replacement for potentially identifiable patient data. Further research is also urgently needed to establish whether synthetic data can truly protect patient privacy against adversarial attempts to re-identify real individuals from the synthetic dataset.

Arora Anmol, Arora Ananya

2023