In Cancer epidemiology, biomarkers & prevention : a publication of the American Association for Cancer Research, cosponsored by the American Society of Preventive Oncology
BACKGROUND : To expand nasopharyngeal carcinoma (NPC) screening to larger populations, more practical NPC risk prediction models independent of Epstein-Barr virus (EBV) and other lab tests are necessary.
METHODS : Patient data before diagnosis of NPC were collected from hospital electronic medical records (EMR) and used to develop machine learning (ML) models for NPC risk prediction using XGBoost. NPC risk factor distributions were generated through connection delta ratio (CDR) analysis of patient graphs. By combining EMR-wide ML with patient graph analysis, the number of variables in these risk models was reduced, allowing for more practical NPC risk prediction ML models.
RESULTS : Using data collected from 1,357 NPC patients and 1,448 control patients, an optimal set of 100 variables (ov100) was determined for building NPC risk prediction ML models that had the following performance metrics: 0.93-0.96 recall, 0.80-0.92 precision, and 0.83-0.94 AUC (area under the curve). Aided by the analysis of top CDR-ranked risk factors, the models were further refined to contain only 20 practical variables (pv20), excluding EBV. The pv20 NPC risk XGBoost model achieved 0.79 recall, 0.94 precision, 0.96 specificity, and 0.87 AUC.
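For readers less familiar with the metrics reported above, the following is a minimal sketch of how recall, precision, specificity, and AUC are conventionally computed for a binary classifier. It is not the authors' evaluation code: the function names `binary_metrics` and `auc` are hypothetical, and the counts and scores are toy values chosen for illustration, not data from the study.

```python
from typing import Dict, Sequence


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard confusion-matrix metrics for a binary classifier."""
    return {
        "recall": tp / (tp + fn),        # share of true cases that were flagged
        "precision": tp / (tp + fp),     # share of flagged cases that were true
        "specificity": tn / (tn + fp),   # share of controls correctly cleared
    }


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random
    negative (ties count as one half). Quadratic time; fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values only (assumed for illustration, not taken from the study).
toy_metrics = binary_metrics(tp=8, fp=2, tn=18, fn=2)
toy_auc = auc(labels=[1, 1, 0, 0], scores=[0.9, 0.4, 0.5, 0.1])
print(toy_metrics, toy_auc)
```

Recall and specificity are each computed within one group (cases or controls), whereas precision also depends on the ratio of cases to controls in the evaluated sample.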
CONCLUSIONS : This study demonstrated the feasibility of developing practical NPC risk prediction models using EMR-wide ML and patient graph CDR analysis, without requiring EBV data. These models could enable broader implementation of NPC risk evaluation and screening recommendations for larger populations in urban community health centers and rural clinics.
IMPACT : These more practical NPC risk models could help increase the NPC screening rate and identify more early-stage NPC patients.
Chen Anjun, Lu Roufeng, Han Ruobing, Huang Ran, Qin Guanjie, Wen Jian, Li Qinghua, Zhang Zhiyong, Jiang Wei
2022-Dec-07