In Interdisciplinary Sciences: Computational Life Sciences
Microsatellite instability (MSI), a vital mutator phenotype caused by DNA mismatch repair deficiency, is frequently observed in several tumor types. MSI is recognized as a critical molecular biomarker for diagnosis, prognosis, and therapeutic selection in several cancers. Identifying MSI status with the current gold-standard methods, which rely on experimental analysis, is laborious, time-consuming, and costly. Although several machine learning-based computational methods have been proposed to identify MSI status, it remains unclear which machine learning model is best suited to MSI identification and which feature subset is most strongly related to MSI. Answering these questions would allow more effective machine learning-based methods to be developed and improve the performance of MSI status identification. In this work, we present MSINGB, an NGBoost-based method for identifying MSI status from tumor somatic mutation annotation data. MSINGB first evaluates the prediction performance of 11 popular machine learning algorithms and 9 deep learning models for identifying MSI. Among these 20 models, NGBoost, a novel natural gradient boosting method, achieves the best overall performance. MSINGB then applies two feature selection strategies to find a compact feature subset that is strongly related to MSI, and employs the SHAP approach to interpret how the selected features affect the model's predictions. MSINGB achieves better prediction performance than state-of-the-art methods on both the tenfold cross-validation test and the independent test.
Chen Jinxiang, Wang Miao, Zhao Defeng, Li Fuyi, Wu Hao, Liu Quanzhong, Li Shuqin
2022-Nov-09
Feature selection, Machine learning, Microsatellite instability, NGBoost
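
The abstract describes comparing candidate models by tenfold cross-validation. The following is a minimal sketch of that evaluation pattern only, not the authors' implementation: it uses the Python standard library, and the helper names (`stratified_folds`, `cross_validate`, `fit_majority`, `fit_threshold`) as well as the two toy stand-in models are hypothetical. In practice an NGBoost or other classifier would take the place of the stand-ins.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a similar class share.
        for pos, idx in enumerate(indices):
            folds[(pos + offset) % k].append(idx)
        offset += len(indices)
    return folds


def cross_validate(fit, X, y, k=10, seed=0):
    """Return the mean held-out accuracy of `fit` over k stratified folds.

    `fit(X_train, y_train)` must return a function mapping one sample
    to a predicted label.
    """
    scores = []
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        scores.append(correct / len(test_idx))
    return mean(scores)


def fit_majority(X_train, y_train):
    """Hypothetical baseline: always predict the most common training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority


def fit_threshold(X_train, y_train):
    """Hypothetical toy model: cut one feature midway between class means."""
    pos = [x for x, label in zip(X_train, y_train) if label == 1]
    neg = [x for x, label in zip(X_train, y_train) if label == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: int(x > cut)


if __name__ == "__main__":
    # Synthetic one-feature data for illustration only (not from the paper).
    X = list(range(100))
    y = [1 if x >= 70 else 0 for x in X]
    for name, fit in [("majority", fit_majority), ("threshold", fit_threshold)]:
        print(name, cross_validate(fit, X, y))
```

Stratifying the folds keeps the ratio of MSI to non-MSI samples roughly constant across splits, which matters when one class is much rarer than the other; scoring every candidate on the same folds keeps the comparison between models fair.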