In BMC biology
BACKGROUND : Identifying variants that drive tumor progression (driver variants) and distinguishing these from variants that are a byproduct of the uncontrolled cell growth in cancer (passenger variants) is a crucial step for understanding tumorigenesis and precision oncology. Various bioinformatics methods have attempted to solve this complex task.
RESULTS : In this study, we investigate the assumptions on which these methods are based, showing that the different definitions of driver and passenger variants influence the difficulty of the prediction task. More importantly, we prove that the data sets have a construction bias which prevents the machine learning (ML) methods to actually learn variant-level functional effects, despite their excellent performance. This effect results from the fact that in these data sets, the driver variants map to a few driver genes, while the passenger variants spread across thousands of genes, and thus just learning to recognize driver genes provides almost perfect predictions.
CONCLUSIONS : To mitigate this issue, we propose a novel data set that minimizes this bias by ensuring that all genes covered by the data contain both driver and passenger variants. As a result, we show that the tested predictors experience a significant drop in performance, which should not be considered as poorer modeling, but rather as correcting unwarranted optimism. Finally, we propose a weighting procedure to completely eliminate the gene effects on such predictions, thus precisely evaluating the ability of predictors to model the functional effects of single variants, and we show that indeed this task is still open.
Raimondi Daniele, Passemiers Antoine, Fariselli Piero, Moreau Yves
Bias in machine learning, Cancer driver variant prediction, Clever Hans effect