In Computer methods and programs in biomedicine

BACKGROUND AND OBJECTIVE : Because the early symptoms of chronic obstructive pulmonary disease (COPD) are not obvious, patients are difficult to identify and the best window for prevention and treatment is often missed. In the present study, machine learning (ML) methods were used to construct a risk prediction model for COPD in order to improve prediction efficiency.

METHODS : We collected data on 5807 cases with a complete COPD diagnosis from the 2019 COPD Surveillance Program in Shanxi Province and extracted 34 potentially relevant variables from the dataset. First, we used feature selection methods (generalized elastic net, Lasso and adaptive Lasso) to select ten variables. Next, to handle the class imbalance, we combined cost-sensitive learning and SMOTE resampling, respectively, with each of the ML methods (Logistic Regression, SVM, Random Forest, XGBoost, LightGBM, NGBoost and Stacking). Finally, we assessed the performance of the resulting models.
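
The two imbalance-handling strategies named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `balanced_class_weights` and `smote`, the toy data, and the choice of the common "balanced" weighting heuristic are all assumptions made for illustration. Cost-sensitive learning is represented by per-class weights that penalise errors on the rare class more heavily, and SMOTE by interpolation between a minority sample and one of its nearest minority neighbours.

```python
import math
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    This is the common "balanced" heuristic; the paper does not state
    which weighting scheme was used, so this is an assumption.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points (basic SMOTE idea).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


# Toy data (invented, not from the study): 8 negatives, 2 positives.
labels = [0] * 8 + [1] * 2
positives = [[1.0, 2.0], [3.0, 4.0]]

weights = balanced_class_weights(labels)      # rare class gets the larger weight
new_points = smote(positives, n_new=6, k=1)   # 6 synthetic positive samples
```

In practice the weights would be passed to a classifier's loss function, while the synthetic points would be appended to the training set only, never to the test set.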

RESULTS : Frequent cough at or before age 14 and nine other variables were significant predictors of COPD. The Stacking heterogeneous ensemble model showed relatively good performance on the unbalanced dataset. Logistic Regression with class weighting achieved the best classification performance on the balanced data when the composite indicators (AUC, F1-Score and G-mean) were used as criteria for model comparison. The F1-Score/G-mean values for the top three ML models were 0.290/0.660 for Logistic Regression with class weighting, 0.288/0.649 for Stacking with synthetic minority oversampling technique (SMOTE), and 0.285/0.648 for LightGBM with SMOTE.
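
For readers unfamiliar with the composite indicators, the sketch below computes F1-Score and G-mean from a binary confusion matrix. It assumes the usual definitions (F1 as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity), which the abstract does not spell out. The helper name `f1_and_gmean` and the counts are hypothetical and are not taken from the study.

```python
import math


def f1_and_gmean(tp, fp, fn, tn):
    """Return (F1, G-mean) for a binary confusion matrix.

    F1 is the harmonic mean of precision and recall; G-mean is the
    geometric mean of sensitivity (recall) and specificity.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    gmean = math.sqrt(recall * specificity)
    return f1, gmean


# Hypothetical counts for a rare positive class (100 positives, 900 negatives).
f1, gmean = f1_and_gmean(tp=60, fp=280, fn=40, tn=620)
print(f"F1 = {f1:.3f}, G-mean = {gmean:.3f}")
```

With a rare positive class, even a modest false-positive rate produces many false positives relative to true positives, so precision and therefore F1 can stay low while G-mean remains moderate.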

CONCLUSIONS : This study combined feature selection methods, imbalanced-data processing methods and machine learning methods with data from disease surveillance questionnaires and physical measurements to identify people at risk of COPD. It concluded that machine learning models based on survey questionnaires could automatically identify patients at risk of COPD and provide a simple, scientific aid for the early identification of the disease.

Wang Xuchun, Ren Hao, Ren Jiahui, Song Wenzhu, Qiao Yuchao, Ren Zeping, Zhao Ying, Linghu Liqin, Cui Yu, Zhao Zhiyang, Chen Limin, Qiu Lixia

2023-Jan-06

Chronic obstructive pulmonary disease, Disease risk prediction, Imbalanced data, Machine learning