In Cancer informatics
Motivation : Prediction of cancer outcome is a major challenge in oncology and is essential for treatment planning. Repositories such as The Cancer Genome Atlas (TCGA) contain vast amounts of data for many types of cancers. Our goal was to create reliable prediction models using TCGA data and validate them using an external dataset.
Results : For 16 TCGA cancer type cohorts we have optimized a Random Forest prediction model using parameter grid search followed by a backward feature elimination loop for dimensions reduction. For each feature that was removed, the model was retrained and the area under the curve of the receiver operating characteristic (AUC-ROC) was calculated using test data. Five prediction models gave AUC-ROC bigger than 80%. We used Clinical Proteomic Tumor Analysis Consortium v3 (CPTAC3) data for validation. The most enriched pathways for the top models were those involved in basic functions related to tumorigenesis and organ development. Enrichment for 2 prediction models of the TCGA-KIRP cohort was explored, one with 42 genes (AUC-ROC = 0.86) the other is composed of 300 genes (AUC-ROC = 0.85). The most enriched networks for both models share only 5 network nodes: DMBT1, IL11, HOXB6, TRIB3, PIM1. These genes play a significant role in renal cancer and might be used for prognosis prediction and as candidate therapeutic targets.
Availability And Implementation : The prediction models were created and tested using Python SciKit-Learn package. They are freely accessible via a friendly web interface we called surviveAI at https://tinyurl.com/surviveai.
Nayshool Omri, Kol Nitzan, Javaski Elisheva, Amariglio Ninette, Rechavi Gideon
2022
Cancer survivors, classification, gene expression, molecular targeted therapy, supervised machine learning