In Journal of Medical Internet Research; h5-index 88.0
BACKGROUND : The pandemic caused by the SARS-CoV-2 virus will probably stand as the greatest health catastrophe of the modern era. The Spanish healthcare system has been exposed to unmanageable numbers of patients in a short period of time, causing the system to collapse. Given that diagnosis is not immediate and there is no effective treatment, other tools have had to be developed to identify patients at risk of severe disease complications, and thus optimize material and human resources in health care. There are still no established tools for determining which patients have a worse prognosis than others.
OBJECTIVE : In this study, we aimed to process a sample of electronic health records of COVID-19 patients in order to develop a machine learning model to predict the severity of infection and mortality through clinical laboratory parameters. Early patient classification can help optimize material and human resources, and analysis of the most important features of the model could provide insights into the disease.
METHODS : After an initial performance evaluation based on a comparison with several other well-known methods, the extreme gradient boosting (XGBoost) algorithm was chosen as the predictive method for this study. In addition, SHAP (SHapley Additive exPlanations) was used to analyze the importance of the features of the resulting model.
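To make the feature-importance step concrete, the following is a minimal sketch of what a Shapley-value attribution computes. It is not the study's pipeline: the scoring function `toy_risk`, its coefficients, and the patient and baseline values are hypothetical stand-ins for a trained model, and the exact permutation-based computation is only practical for a handful of features (the SHAP library uses faster algorithms for tree ensembles such as XGBoost).

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    A feature counts as "absent" when it takes its baseline value.
    Cost grows factorially with the number of features.
    """
    n = len(x)
    phi = [0.0] * n
    orders = list(permutations(range(n)))
    for order in orders:
        current = list(baseline)
        previous_output = model(current)
        for i in order:
            current[i] = x[i]  # switch feature i from baseline to its real value
            output = model(current)
            phi[i] += output - previous_output  # marginal contribution of i
            previous_output = output
    return [total / len(orders) for total in phi]


def toy_risk(features):
    """Hypothetical risk score with an LDH x CRP interaction (illustration only)."""
    ldh, crp, urea = features
    return 0.002 * ldh + 0.01 * crp + 0.005 * urea + 0.00002 * ldh * crp


# Hypothetical patient and reference values: [LDH, C-reactive protein, urea]
patient = [600.0, 120.0, 60.0]
baseline = [250.0, 5.0, 35.0]

phi = shapley_values(toy_risk, patient, baseline)
for name, value in zip(["LDH", "C-reactive protein", "urea"], phi):
    print(f"{name}: {value:+.3f}")
```

The attributions sum to the difference between the patient's score and the baseline score, which is the additivity property that lets per-feature contributions be ranked and compared across patients.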
RESULTS : After data preprocessing, 1823 confirmed COVID-19 patients and 32 predictor features were selected. On bootstrap validation, the XGBoost classifier yielded a value of 0.97 (95% CI 0.96-0.98) for the area under the receiver operating characteristic curve, 0.86 (95% CI 0.80-0.91) for the area under the precision-recall curve, 0.94 (95% CI 0.92-0.95) for accuracy, 0.77 (95% CI 0.72-0.83) for F-score, 0.93 (95% CI 0.89-0.98) for sensitivity, and 0.91 (95% CI 0.86-0.96) for specificity. The four most relevant features for model prediction were lactate dehydrogenase (LDH), C-reactive protein, neutrophils, and urea.
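As an illustration of how confidence intervals like those above can be obtained, the sketch below computes threshold-based metrics and a percentile bootstrap interval. The abstract does not specify the exact bootstrap scheme, so resampling (label, prediction) pairs with replacement is an assumption here, and the labels and predictions are synthetic, not study data.

```python
import random


def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, accuracy and F-score for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, len(pairs)),
        "f_score": ratio(2 * tp, 2 * tp + fp + fn),
    }


def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample cases with replacement, recompute the metric."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        value = confusion_metrics(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )[metric]
        if value == value:  # skip NaN from resamples lacking a class
            stats.append(value)
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


# Synthetic example: 20 deaths (18 detected), 80 survivors (6 false alarms).
y_true = [1] * 20 + [0] * 80
y_pred = [1] * 18 + [0] * 2 + [1] * 6 + [0] * 74

point = confusion_metrics(y_true, y_pred)
for name in point:
    low, high = bootstrap_ci(y_true, y_pred, name)
    print(f"{name}: {point[name]:.2f} (95% CI {low:.2f}-{high:.2f})")
```

Areas under the ROC and precision-recall curves would need predicted probabilities rather than hard labels, so they are left out of this sketch.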
CONCLUSIONS : The predictive model obtained in this work achieved excellent results in discriminating COVID-19 patients who died, relying mainly on laboratory parameter values. Analysis of the resulting model identified the set of features with the greatest impact on the prediction, thereby relating them to a higher risk of mortality.
Domínguez-Olmedo Juan L, Gragera-Martínez Álvaro, Mata Jacinto, Pachón Victoria