In Journal of Molecular Biology ; h5-index 65.0
MOTIVATION : Continuous emergence of new variants through the appearance, accumulation, and disappearance of mutations is a hallmark of many viral diseases. SARS-CoV-2 variants in particular have exerted tremendous pressure on global healthcare systems owing to their life-threatening and debilitating implications. The sheer plurality of variants and the huge scale of genomic data have added to the challenge of tracing the mutations/variants and their relationship to infection severity (if any).
RESULTS : We explored the suitability of virus-genotype guided machine learning for infection prognosis and for identifying features/mutations-of-interest. A total of 199,519 outcome-traced genomes, representing 45,625 nucleotide mutations, were employed. Among these, Low and High severity genomes were classified using an integrated model (employing virus genotype, epitopic-influence and patient-age) with consistently high ROC-AUC (Asia: 0.97±0.01, Europe: 0.94±0.01, N. America: 0.92±0.02, Africa: 0.94±0.07, S. America: 0.93±0). Although virus-genotype alone could enable high predictivity (0.97±0.01, 0.89±0.02, 0.86±0.04, 0.95±0.06, 0.9±0.04), the performance was not consistent: the models for a few geographies displayed significant improvement in predictivity when the influence of age and/or epitope was incorporated with virus-genotype (Wilcoxon p_BH < 0.05). Neither age nor epitopic-influence nor clade information alone could outperform the integrated features. A sparse model (6 features), developed using patient-age and epitopic-influence of the mutations, performed reasonably well (> 0.87±0.03, 0.91±0.01, 0.87±0.03, 0.84±0.08, 0.89±0.05). High-performance models were employed for inferring the important mutations-of-interest using Shapley Additive exPlanations (SHAP). The changes in HLA interactions of the mutated epitopes of reference SARS-CoV-2 were subsequently probed. Notably, we also describe the significance of a 'temporal-modeling approach' for benchmarking models linked with continuously evolving pathogens. We conclude that while machine learning can play a vital role in identifying relevant mutations and factors driving severity, caution should be exercised in using genotypic signatures for predictive prognosis.
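The idea behind a temporal benchmark can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout (a `collected` date per genome), the function names, and the cutoff are all hypothetical, and the classifier itself is left out. The sketch only shows the two pieces such a benchmark needs: a split that trains on earlier samples and evaluates on later ones, and a ROC-AUC computed from predicted scores.

```python
from datetime import date

def temporal_split(records, cutoff):
    """Split records by collection date (hypothetical 'collected' field).

    Samples collected before `cutoff` go to training; the rest are held
    out, so the model is always evaluated on genomes sampled later than
    anything it was trained on.
    """
    train = [r for r in records if r["collected"] < cutoff]
    test = [r for r in records if r["collected"] >= cutoff]
    return train, test

def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive sample gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute ROC-AUC")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy records: label 1 = High severity, 0 = Low severity (illustrative only).
records = [
    {"collected": date(2020, 5, 1), "label": 0},
    {"collected": date(2020, 9, 1), "label": 1},
    {"collected": date(2021, 2, 1), "label": 0},
    {"collected": date(2021, 6, 1), "label": 1},
]
train, test = temporal_split(records, cutoff=date(2021, 1, 1))
```

Comparing the ROC-AUC obtained on such a later-in-time test set against the ROC-AUC from a random split is one way to check whether a genotype-based model still holds up as the pathogen evolves.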
Nagpal Sunil, Kumar Pinna Nishal, Pant Namrata, Singh Rohan, Srivastava Divyanshu, Mande Sharmila S
Genome classification, Machine learning, Mutation identification, Predictive prognosis, SARS-CoV-2