In Studies in health technology and informatics ; h5-index 23.0
BACKGROUND : Assurance of digital health interventions involves, among other things, clinical validation, which requires large datasets to test the application in realistic clinical scenarios. Developing such datasets is time-consuming, and preserving patient anonymity and obtaining consent make it challenging.
OBJECTIVE : To develop synthetic datasets that preserve the statistical properties of the real datasets.
METHOD : A generative adversarial network based on artificial neural networks was implemented and trained on numerical and categorical variables, including ICD-9 codes, from the MIMIC-III dataset to produce a synthetic dataset.
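The abstract does not say how the mixed numerical and categorical variables were presented to the network. One common arrangement, shown here purely as an assumption and not as the authors' pipeline, is to scale numeric columns to [0, 1] and encode each record's set of ICD-9 codes as a multi-hot vector. The helper names (`min_max_scale`, `build_vocabulary`, `multi_hot`) and the toy records are hypothetical.

```python
def min_max_scale(values):
    """Scale a numeric column to [0, 1]; constant columns map to 0.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


def build_vocabulary(records):
    """Map every distinct code seen in the records to a column index."""
    codes = sorted({code for rec in records for code in rec})
    return {code: i for i, code in enumerate(codes)}


def multi_hot(codes, vocabulary):
    """Encode a set of codes as a 0/1 vector over the vocabulary."""
    vector = [0.0] * len(vocabulary)
    for code in codes:
        vector[vocabulary[code]] = 1.0
    return vector


# Hypothetical records: an age and a set of ICD-9 codes per patient.
# The values are invented for illustration, not taken from MIMIC-III.
ages = [34, 58, 71]
diagnoses = [{"401.9"}, {"250.00", "401.9"}, {"428.0"}]

vocab = build_vocabulary(diagnoses)
scaled_ages = min_max_scale(ages)

# One fixed-length row per patient: [scaled age, code_0, code_1, ...].
rows = [[a] + multi_hot(d, vocab) for a, d in zip(scaled_ages, diagnoses)]
for row in rows:
    print(row)
```

Rows of this shape could serve as the real samples for a discriminator. Generated rows would need the inverse mapping (thresholding the code columns and rescaling the numeric ones) to be read back as records.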
RESULTS : The synthetic dataset exhibits a correlation matrix highly similar to that of the real dataset, shows good Jaccard similarity, and passes the Kolmogorov-Smirnov (KS) test.
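The three checks named in the results can be illustrated with a minimal, standard-library sketch. It assumes the correlation matrices are compared by their largest element-wise difference and that Jaccard similarity is taken over sets of codes. The abstract does not specify either choice, and the function names and toy values below are hypothetical.

```python
from bisect import bisect_right
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a list of numeric columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def max_abs_difference(m1, m2):
    """Largest element-wise gap between two same-shaped matrices."""
    return max(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


def jaccard(a, b):
    """Jaccard similarity: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def ks_statistic(x, y):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    return max(
        abs(bisect_right(xs, v) / len(xs) - bisect_right(ys, v) / len(ys))
        for v in xs + ys
    )


# Toy columns, invented for illustration only (not drawn from MIMIC-III).
real_age = [30, 40, 50, 60, 70]
real_stay = [2, 3, 5, 6, 8]
synth_age = [32, 41, 49, 61, 69]
synth_stay = [2, 4, 5, 6, 7]

real_corr = correlation_matrix([real_age, real_stay])
synth_corr = correlation_matrix([synth_age, synth_stay])

print("max correlation gap:", max_abs_difference(real_corr, synth_corr))
print("Jaccard:", jaccard({"401.9", "250.00", "428.0"}, {"401.9", "428.0", "584.9"}))
print("KS statistic (age):", ks_statistic(real_age, synth_age))
```

The sketch reports only the KS statistic. Deciding whether a column "passes" also needs a p-value or critical value at a chosen significance level, which a statistics library such as SciPy (`scipy.stats.ks_2samp`) can supply.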
CONCLUSIONS : The proof of concept was successful, and the approach is promising for further work.
Bilici Ozyigit Eda, Arvanitis Theodoros N, Despotou George
Machine learning, generative adversarial networks, privacy, realistic synthetic dataset