In Frontiers in genetics ; h5-index 62.0
Motivation: Drug-induced liver injury (DILI) is one of the primary problems in drug development. Early prediction of DILI, based on the chemical properties of substances and experiments performed on cell lines, would bring a significant reduction in the cost of clinical trials and faster development of drugs. The current study aims to build predictive models of risk of DILI for chemical compounds using multiple sources of information. Methods: Using several supervised machine learning algorithms, we built predictive models for several alternative splits of compounds between DILI and non-DILI classes. To this end, we used chemical properties of the given compounds, their effects on gene expression levels in six human cell lines treated with them, as well as their toxicological profiles. First, we identified the most informative variables in all data sets. Then, these variables were used to build machine learning models. Finally, composite models were built with the Super Learner approach. All modeling was performed using multiple repeats of cross-validation for unbiased and precise estimates of performance. Results: With one exception, gene expression profiles of human cell lines were non-informative and resulted in random models. Toxicological reports were not useful for prediction of DILI. The best results were obtained for models discerning between harmless compounds and those for which any level of DILI was observed (AUC = 0.75). These models were built with Random Forest algorithm that used molecular descriptors.
Lesiński Wojciech, Mnich Krzysztof, Rudnicki Witold R
data integration, drug induced liver injury, feature selection, machine learning, random forest