In Clinical Pharmacology and Therapeutics
Natural language processing (NLP) tools turn free-text notes (FTN) from electronic health records (EHR) into data features that can supplement confounding adjustment in pharmacoepidemiologic studies. However, current applications are difficult to scale. We used unsupervised NLP to generate high-dimensional feature spaces from FTN and assessed whether they improve prediction of drug exposure and outcomes compared with claims-based analyses. We linked Medicare claims with EHR data to construct three cohort studies comparing different classes of medications on the risk of various clinical outcomes. We used a 'bag-of-words' approach to generate features for the 20,000 most prevalent terms in the FTN. We compared machine learning (ML) prediction algorithms using different sets of candidate predictors: Set1 (39 researcher-specified variables), Set2 (Set1 + ML-selected claims codes), Set3 (Set1 + ML-selected NLP-generated features), and Set4 (Set1+2+3). When modeling treatment choice, we observed a consistent pattern across the examples: ML models using Set4 performed best, followed by Set2, Set3, and then Set1. When modeling outcome risk, there was little to no improvement beyond models based on Set1. Supplementing claims data with NLP-generated features from free-text notes improved prediction of prescribing choices but yielded little or no improvement in clinical risk prediction. These findings have implications for strategies to improve confounding adjustment using EHR data in pharmacoepidemiologic studies.
Wyss Richard, Plasek Joseph M, Zhou Li, Bessette Lily G, Schneeweiss Sebastian, Rassen Jeremy A, Tsacogianis Theodore, Lin Kueiyu Joshua
2022-Dec-17
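
The 'bag-of-words' step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the tokenizer, the function names (`tokenize`, `top_terms`, `bag_of_words`), the three synthetic notes, and the cutoff of 3 terms (standing in for the 20,000 used in the study) are all assumptions made for illustration. "Prevalence" is interpreted here as the number of notes in which a term appears, which the abstract does not specify.

```python
import re
from collections import Counter

# Hypothetical tokenizer: lowercase alphanumeric runs. The study's actual
# text preprocessing is not described in the abstract.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(note):
    """Split a free-text note into lowercase tokens."""
    return TOKEN_RE.findall(note.lower())


def top_terms(notes, k):
    """Return the k terms that appear in the most notes (document frequency)."""
    doc_freq = Counter()
    for note in notes:
        # set() so each term counts at most once per note
        doc_freq.update(set(tokenize(note)))
    # Sort by descending document frequency, then alphabetically for determinism
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def bag_of_words(notes, vocabulary):
    """Build one row of term counts per note, with columns ordered as in vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    rows = []
    for note in notes:
        row = [0] * len(vocabulary)
        for token in tokenize(note):
            j = index.get(token)
            if j is not None:
                row[j] += 1
        rows.append(row)
    return rows


# Synthetic notes, invented for this example only
notes = [
    "Patient reports chest pain; started on apixaban.",
    "No chest pain today. Continue apixaban.",
    "Fall at home, hip pain, frail.",
]

vocab = top_terms(notes, k=3)          # the study used the top 20,000 terms
features = bag_of_words(notes, vocab)  # one feature row per note
```

In a design like the one described, columns such as these would be offered to an ML selection step alongside the researcher-specified variables (Set3), rather than entered into a model directly.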