In Molecular Informatics
The current pandemic has propelled research efforts in an unprecedented fashion, primarily triggering computational efforts towards new vaccine and drug development as well as drug repurposing. There is an urgent need to design novel drugs with targeted biological activity and minimal adverse reactions that may be useful in managing viral outbreaks. Hence, an attempt has been made to develop Machine Learning based predictive models that can be used to assess whether a compound has the potential to be antiviral or not. To this end, a set of 2358 antiviral compounds was compiled from the CAS COVID-19 antiviral SAR dataset, with activity reported as IC50 values. A total of 1157 two-dimensional molecular descriptors were computed, among which the most highly correlated descriptors were selected using Tree-based, Correlation-based and Mutual information-based feature selection methods. Seven Machine Learning algorithms, namely Random Forest, XGBoost, Support Vector Machine, KNN, Decision Tree, MLP Classifier and Logistic Regression, were benchmarked. The best performance was achieved by the models developed using the Random Forest and XGBoost algorithms with all of the feature selection methods. The maximum predictive accuracy of both these models was 88 % with internal validation. With an external dataset, a maximum accuracy of 93.10 % was achieved for the XGBoost based model and 100 % for the Random Forest based model. Furthermore, the study demonstrated scaffold analysis of the molecules as a pragmatic approach to explore the importance of structurally diverse compounds in data-driven studies.
John Lijo, Soujanya Yarasi, Mahanta Hridoy Jyoti, Narahari Sastry G
Antivirals, Chemoinformatics, Feature Selection, MCC, Machine Learning, Molecular Descriptors, SARS-COVID-19
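
The abstract mentions correlation-based feature selection, accuracy, and (in the keywords) MCC without showing how these are computed. The following is a minimal, standard-library Python sketch of those three ideas only. It is not the authors' pipeline: the function names (`select_by_correlation`, `mcc`, and so on) are hypothetical, the toy descriptor matrix and labels are invented for illustration and are not data from the study, and ranking descriptors by absolute Pearson correlation with the activity label is an assumption about what a correlation-based filter might look like.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_by_correlation(X, y, k):
    """Return the indices of the k descriptor columns with the largest
    absolute Pearson correlation to the label vector y (assumed filter)."""
    scores = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        scores.append((abs(pearson(column, y)), j))
    # Highest score first; ties broken by lower column index.
    scores.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scores[:k]]


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, with 1 as the active class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Toy data (invented): 4 compounds x 3 descriptors, binary activity labels.
toy_X = [
    [0.1, 5.0, 1.0],
    [0.9, 5.0, 0.0],
    [0.2, 5.0, 1.0],
    [0.8, 5.0, 1.0],
]
toy_y = [0, 1, 0, 1]
selected = select_by_correlation(toy_X, toy_y, k=2)

# Toy predictions (invented) to illustrate the two metrics.
counts = confusion_counts([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print("selected descriptor indices:", selected)
print("accuracy:", accuracy(*counts), "MCC:", mcc(*counts))
```

Reporting MCC alongside accuracy is a common choice for binary activity classification because MCC uses all four cells of the confusion matrix, so it is less easily inflated by class imbalance than accuracy alone.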