ArXiv Preprint
Electronic health records (EHR) offer unprecedented opportunities for
in-depth clinical phenotyping and prediction of clinical outcomes. Combining
multiple data sources is crucial to generate a complete picture of disease
prevalence, incidence and trajectories. The standard approach to combining
clinical data involves collating clinical terms across different terminology
systems using curated maps, which are often inaccurate, incomplete, or both. Here,
we propose sEHR-CE, a novel framework based on transformers to enable
integrated phenotyping and analyses of heterogeneous clinical datasets without
relying on these mappings. We unify clinical terminologies using textual
descriptors of concepts, and represent individuals' EHR as sections of text. We
then fine-tune pre-trained language models to predict disease phenotypes more
accurately than non-text and single-terminology approaches. We validate our
approach using primary and secondary care data from the UK Biobank, a
large-scale research study. Finally, we illustrate in a type 2 diabetes use
case how sEHR-CE identifies individuals without a diagnosis who share clinical
characteristics with diagnosed patients.
Anna Munoz-Farre, Harry Rose, Sera Aylin Cakiroglu
2022-11-30
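
The idea in the abstract of unifying terminologies through textual descriptors and representing an individual's EHR as text can be illustrated with a minimal sketch. This is not the authors' implementation: the descriptor tables, the codes (PC001, H001, ...), the `Event` type, the `patient_to_text` helper, and the " [SEP] " separator are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical descriptor tables: code -> textual description.
# The codes and entries are illustrative placeholders, not real
# terminology extracts.
DESCRIPTORS = {
    "primary_care": {
        "PC001": "Type 2 diabetes mellitus",
        "PC002": "Raised blood pressure reading",
    },
    "hospital": {
        "H001": "Essential hypertension",
        "H002": "Obesity",
    },
}


@dataclass(frozen=True)
class Event:
    source: str  # which terminology the code comes from
    code: str    # code within that terminology
    when: date   # date the code was recorded


def patient_to_text(events, sep=" [SEP] "):
    """Turn coded events into one text string, ordered by date.

    Codes from different terminologies are never mapped onto each
    other; each is replaced by its own textual descriptor.
    Codes without a known descriptor are skipped (an assumption).
    """
    texts = []
    for event in sorted(events, key=lambda e: e.when):
        description = DESCRIPTORS.get(event.source, {}).get(event.code)
        if description is None:
            continue
        texts.append(description.lower())
    return sep.join(texts)


events = [
    Event("hospital", "H001", date(2015, 3, 2)),
    Event("primary_care", "PC001", date(2012, 6, 14)),
    Event("primary_care", "PC999", date(2013, 1, 1)),  # unknown code
]
text = patient_to_text(events)
print(text)
```

Under these assumptions, the resulting string could then be tokenized and passed to a pre-trained language model for fine-tuning, so that records from different coding systems share one input space without a curated code-to-code map.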