ArXiv Preprint
Machine learning (ML) approaches to data analysis are now widely adopted in
many fields, including epidemiology and medicine. Before ML is applied,
confounds must first be removed; this is commonly done featurewise, by removing
their variance from each feature using linear regression. Here, we show that
this common approach to confound removal biases ML models, leading to
misleading results. Specifically, this deconfounding approach can leak
information such that null or moderate effects are amplified to near-perfect
prediction when nonlinear ML approaches are subsequently applied. We identify
and evaluate possible mechanisms for such confound-leakage and provide
practical guidance to mitigate its negative impact. We demonstrate the
real-world importance of confound-leakage by analyzing a clinical dataset in
which accuracy is overestimated when predicting attention deficit hyperactivity
disorder (ADHD) with depression as a confound. Our results have wide-reaching
implications for the implementation and deployment of ML workflows and urge
caution against naïve use of standard confound removal approaches.
Sami Hamdan, Bradley C. Love, Georg G. von Polier, Susanne Weis, Holger Schwender, Simon B. Eickhoff, Kaustubh R. Patil
2022-10-17
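
As a minimal sketch of the featurewise confound-removal step the abstract describes (and cautions against applying naïvely), the following Python snippet regresses a single confound out of each feature with linear regression and keeps the residuals. It assumes scikit-learn; the function and variable names are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def remove_confound_featurewise(X, confound):
    """Regress `confound` out of each column of `X` and return the
    residuals. This is the standard featurewise deconfounding step;
    the abstract argues that feeding such residuals to nonlinear ML
    models can leak confound information and inflate accuracy."""
    c = np.asarray(confound, dtype=float).reshape(-1, 1)
    X = np.asarray(X, dtype=float)
    X_deconf = np.empty_like(X)
    for j in range(X.shape[1]):
        lr = LinearRegression().fit(c, X[:, j])
        X_deconf[:, j] = X[:, j] - lr.predict(c)
    return X_deconf

# Illustrative usage with synthetic data (hypothetical example):
rng = np.random.default_rng(0)
confound = rng.normal(size=200)
X = rng.normal(size=(200, 5)) + confound[:, None]  # features contaminated by the confound
X_clean = remove_confound_featurewise(X, confound)
```

Per the abstract, residuals produced this way are not automatically safe: when nonlinear ML models are trained on them, null or moderate effects can appear as near-perfect prediction.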