ArXiv Preprint
This paper presents a novel approach to simulating electronic health records
(EHRs) using diffusion probabilistic models (DPMs). Specifically, we
demonstrate the effectiveness of DPMs in synthesising longitudinal EHRs that
capture mixed-type variables, including numeric, binary, and categorical
variables. To our knowledge, this represents the first use of DPMs for this
purpose. We compared our DPM-simulated datasets to previous state-of-the-art
results based on generative adversarial networks (GANs) for two clinical
applications: acute hypotension and antiretroviral therapy (ART) for human
immunodeficiency virus (HIV).
Given the lack of comparable previous studies using DPMs, a core component of
our work is exploring the advantages and caveats of employing DPMs across a
wide range of aspects. In addition to assessing the realism of the synthetic
datasets, we also trained reinforcement learning (RL) agents on the synthetic
data to evaluate their utility for supporting the development of downstream
machine learning models. Finally, we estimated that our DPM-simulated datasets
are secure and pose a low patient exposure risk for public access.
Nicholas I-Hsien Kuo, Louisa Jorm, Sebastiano Barbieri
2023-03-22