In Journal of the American Heart Association ; h5-index 70.0
Background Real-world healthcare data are an important resource for epidemiologic research. However, accurate identification of patient cohorts-a crucial first step underpinning the validity of research results-remains a challenge. We developed and evaluated claims-based case ascertainment algorithms for pulmonary hypertension (PH), comparing conventional decision rules with state-of-the-art machine-learning approaches. Methods and Results We analyzed an electronic health record-Medicare linked database from two large academic tertiary care hospitals (years 2007-2013). Electronic health record charts were reviewed to form a gold standard cohort of patients with (n=386) and without PH (n=164). Using health encounter data captured in Medicare claims (including patients' demographics, diagnoses, medications, and procedures), we developed and compared 2 approaches for identifying patients with PH: decision rules and machine-learning algorithms using penalized lasso regression, random forest, and gradient boosting machine. The most optimal rule-based algorithm-having ≥3 PH-related healthcare encounters and having undergone right heart catheterization-attained an area under the receiver operating characteristic curve of 0.64 (sensitivity, 0.75; specificity, 0.48). All 3 machine-learning algorithms outperformed the most optimal rule-based algorithm (P<0.001). A model derived from the random forest algorithm achieved an area under the receiver operating characteristic curve of 0.88 (sensitivity, 0.87; specificity, 0.70), and gradient boosting machine achieved comparable results (area under the receiver operating characteristic curve, 0.85; sensitivity, 0.87; specificity, 0.70). Penalized lasso regression achieved an area under the receiver operating characteristic curve of 0.73 (sensitivity, 0.70; specificity, 0.68). Conclusions Research-grade case identification algorithms for PH can be derived and rigorously validated using machine-learning algorithms. Simple decision rules commonly applied in published literature performed poorly; more complex rule-based algorithms may potentially address the limitation of this approach. PH research using claims data would be considerably strengthened through the use of validated algorithms for cohort ascertainment.
Ong Mei-Sing, Klann Jeffrey G, Lin Kueiyu Joshua, Maron Bradley A, Murphy Shawn N, Natter Marc D, Mandl Kenneth D
computable phenotype, machine learning, pulmonary hypertension