ArXiv Preprint
There is increasing interest in data-driven approaches for dynamically
choosing optimal treatment strategies in many chronic disease management and
critical care applications. Reinforcement learning methods are well-suited to
this sequential decision-making problem, but must be trained and evaluated
exclusively on retrospective medical record datasets, because direct online
exploration is unsafe and infeasible. Despite this requirement, the vast
majority of dynamic treatment optimization studies use off-policy RL methods
(e.g., Double Deep Q-Networks (DDQN) and variants) that are known to perform
poorly in purely offline settings. Recent advances in offline RL, such as
Conservative Q-Learning (CQL), offer a suitable alternative, but challenges
remain in adapting these approaches to real-world applications where
suboptimal examples dominate the retrospective dataset and strict safety
constraints need to be satisfied. In this work, we introduce a practical
transition sampling approach to address action imbalance during offline RL
training, and an intuitive heuristic to enforce hard constraints during policy
execution. We provide theoretical analyses to show that our proposed approach
would improve over CQL. We perform extensive experiments on two real-world
tasks for diabetes and sepsis treatment optimization to compare the performance of
the proposed approach against prominent off-policy and offline RL baselines
(DDQN and CQL). Across a range of principled and clinically relevant metrics,
we show that our proposed approach enables substantial improvements in expected
health outcomes and in consistency with relevant practice and safety
guidelines.
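
The two ideas named in the abstract, sampling transitions to counter action imbalance and enforcing hard constraints at execution time, can be illustrated with a minimal sketch. The abstract does not specify the authors' actual sampling scheme or heuristic, so everything below is an assumption: the sketch uses inverse-frequency weighting over discrete actions and masks disallowed actions before a greedy argmax. The function names `action_balanced_weights` and `constrained_greedy_action` are hypothetical.

```python
import numpy as np


def action_balanced_weights(actions, n_actions):
    """Sampling probability per transition, inversely proportional to how
    often its action appears in the dataset (an assumed scheme, not
    necessarily the one used in the paper)."""
    actions = np.asarray(actions)
    counts = np.bincount(actions, minlength=n_actions)
    # Every action present in `actions` has count >= 1, so no division by zero.
    weights = 1.0 / counts[actions]
    return weights / weights.sum()


def constrained_greedy_action(q_values, allowed_mask):
    """Greedy action restricted to those permitted by a hard constraint
    (assumed to be expressible as a boolean mask over discrete actions)."""
    q = np.asarray(q_values, dtype=float)
    mask = np.asarray(allowed_mask, dtype=bool)
    if not mask.any():
        raise ValueError("no action satisfies the constraints")
    # Disallowed actions get -inf so argmax can never select them.
    masked_q = np.where(mask, q, -np.inf)
    return int(np.argmax(masked_q))
```

Under these assumptions, a training batch could be drawn with `np.random.default_rng().choice(len(actions), size=batch_size, p=probs)`, where `probs` is the output of `action_balanced_weights`, so that each action that occurs in the dataset contributes equal total probability mass regardless of how often it was logged.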
Milashini Nambiar, Supriyo Ghosh, Priscilla Ong, Yu En Chan, Yong Mong Bee, Pavitra Krishnaswamy
2023-02-15