In Journal of Chemical Information and Modeling
In the context of bioactivity prediction, the question of how to calibrate a score produced by a machine learning method into a probability of binding to a protein target is not yet satisfactorily addressed. In this study, we compared the performance of three such methods, namely Platt Scaling (PS), Isotonic Regression (IR) and Venn-ABERS Predictors (VA), in calibrating prediction scores obtained from ligand-target prediction models built with the Naïve Bayes (NB), Support Vector Machine (SVM) and Random Forest (RF) algorithms. Calibration quality was assessed on bioactivity data available at AstraZeneca for 40 million data points (compound-target pairs) across 2,112 targets, and performance was assessed using Stratified Shuffle Split (SSS) and Leave 20% of Scaffolds Out (L20SO) validation. VA achieved the best calibration performance across all machine learning algorithms and cross-validation methods tested, as well as the lowest (best) Brier score loss (the mean squared difference between the probability estimates assigned to a compound and the actual outcome). In comparison, the PS and IR methods can actually degrade the assigned probability estimates, particularly for RF under both SSS and L20SO validation. Sphere Exclusion (SE), a method to sample additional (putative) inactive compounds, was shown to inflate the overall Brier score loss performance, owing to the artificial requirement for inactive molecules to be dissimilar to active compounds, and to result in over-confident estimators. VA was able to successfully calibrate the probability estimates even for small calibration sets. The multi-probability values (lower and upper probability boundary intervals) were shown to produce large discordance for test set molecules that are neither very similar nor very dissimilar to the active training set, and which were hence difficult to predict, suggesting that multi-probability discordance can be used as an estimate of target prediction uncertainty.
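The PS and IR calibration set-up described above can be illustrated with a minimal scikit-learn sketch (not the authors' code; the synthetic data and model settings here are assumptions for demonstration only). `CalibratedClassifierCV` with `method="sigmoid"` corresponds to Platt Scaling and `method="isotonic"` to Isotonic Regression, and `brier_score_loss` computes the mean squared difference between predicted probabilities and outcomes:

```python
# Sketch: comparing raw RF scores against Platt-scaled (sigmoid) and
# isotonic-calibrated probabilities via the Brier score loss.
# Synthetic data stands in for the bioactivity data used in the paper.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Uncalibrated Random Forest baseline.
raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
# Platt Scaling: a sigmoid fitted to out-of-fold scores.
platt = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="sigmoid", cv=5).fit(X_train, y_train)
# Isotonic Regression: a monotone step function fitted to out-of-fold scores.
iso = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic", cv=5).fit(X_train, y_train)

for name, model in [("raw RF", raw), ("Platt", platt), ("Isotonic", iso)]:
    p = model.predict_proba(X_test)[:, 1]
    print(f"{name:9s} Brier score loss = {brier_score_loss(y_test, p):.4f}")
```

A lower Brier score loss indicates better-calibrated probabilities; whether calibration helps or (as reported above for PS and IR with RF) hurts depends on the data split.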
Overall, we were able to show in this work that VA scaling of target prediction models improves probability estimates in all testing instances, and this approach is currently being applied to in-house target prediction models.
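The multi-probability intervals at the core of VA can be sketched with the standard inductive Venn-ABERS construction (an assumption on my part; the paper's own implementation may differ): for a test score s, isotonic regression is fitted on the calibration scores twice, once with (s, 0) appended and once with (s, 1), yielding the lower and upper probability bounds p0 and p1 whose gap gives the discordance.

```python
# Sketch of an inductive Venn-ABERS predictor (IVAP) for one test score.
# The calibration scores/labels below are made-up toy values.
import numpy as np
from sklearn.isotonic import IsotonicRegression

def venn_abers_interval(cal_scores, cal_labels, test_score):
    """Return (p0, p1): the multi-probability bounds for one test score."""
    bounds = []
    for hypothetical_label in (0, 1):
        # Refit isotonic regression with the test point assigned each
        # hypothetical label in turn.
        ir = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
        ir.fit(np.append(cal_scores, test_score),
               np.append(cal_labels, hypothetical_label))
        bounds.append(ir.predict([test_score])[0])
    return bounds[0], bounds[1]

cal_scores = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.9])
cal_labels = np.array([0, 0, 0, 1, 1, 1])

# A score in the ambiguous middle region yields a wide (discordant) interval.
p0, p1 = venn_abers_interval(cal_scores, cal_labels, 0.45)
print(f"interval = [{p0:.3f}, {p1:.3f}], discordance = {p1 - p0:.3f}")
```

As the abstract notes, test molecules falling between the active and inactive regions of score space produce large p1 - p0 discordance, which is what motivates using the interval width as an uncertainty estimate.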
Lewis Mervin, Avid M. Afzal, Ola Engkvist, Andreas Bender