arXiv Preprint
We consider local kernel metric learning for off-policy evaluation (OPE) of
deterministic policies in contextual bandits with continuous action spaces. Our
work is motivated by practical scenarios where the target policy needs to be
deterministic due to domain requirements, such as prescription of treatment
dosage and duration in medicine. Although importance sampling (IS) provides a
basic principle for OPE, it is ill-posed for the deterministic target policy
with continuous actions. Our main idea is to relax the target policy and pose
the problem as kernel-based estimation, where we learn the kernel metric in
order to minimize the overall mean squared error (MSE). We present an analytic
solution for the optimal metric, based on the analysis of bias and variance.
Whereas prior work has been limited to scalar action spaces or kernel bandwidth
selection, our work goes a step further by handling vector-valued action spaces
and optimizing the full kernel metric. We show that our estimator is consistent,
and demonstrate through experiments on various domains that it significantly
reduces the MSE compared to baseline OPE methods.
Haanvid Lee, Jongmin Lee, Yunseon Choi, Wonseok Jeon, Byung-Jun Lee, Yung-Kyun Noh, Kee-Eung Kim
2022-10-24
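
To make the kernel-relaxation idea concrete, the following is a minimal, illustrative sketch (not taken from the paper) of a kernel-relaxed importance-sampling estimator for a deterministic target policy on logged bandit data. It assumes a Gaussian kernel parameterized by a positive-definite metric matrix; the function names, signatures, and the choice of kernel are hypothetical, and the sketch takes the metric as an input rather than implementing the paper's analytic solution for the MSE-optimal metric.

```python
# Illustrative sketch only: kernel-relaxed OPE for a deterministic target policy.
import numpy as np

def gaussian_kernel(u, metric):
    """Multivariate Gaussian kernel with SPD metric matrix `metric` (d x d).
    Normalized to integrate to 1 over R^d, so the relaxed estimator stays on
    the reward scale."""
    d = u.shape[-1]
    quad = np.einsum('ni,ij,nj->n', u, metric, u)
    norm = np.sqrt(np.linalg.det(metric)) / (2.0 * np.pi) ** (d / 2.0)
    return norm * np.exp(-0.5 * quad)

def kernel_relaxed_ope(contexts, actions, rewards,
                       behavior_density, target_policy, metric):
    """Kernel-relaxed importance-sampling estimate of a deterministic
    target policy's value from logged contextual-bandit data.

    contexts:         (n, d_x) logged contexts
    actions:          (n, d_a) logged continuous actions
    rewards:          (n,)     logged rewards
    behavior_density: callable (x, a) -> mu(a | x), logging-policy density
    target_policy:    callable x -> deterministic target action, shape (d_a,)
    metric:           (d_a, d_a) SPD kernel metric; (1/h**2) * I recovers
                      ordinary bandwidth-h kernel smoothing
    """
    target_actions = np.stack([target_policy(x) for x in contexts])   # (n, d_a)
    # Replace the ill-posed Dirac density ratio with a kernel weight around
    # the target action, then importance-weight by the behavior density.
    weights = gaussian_kernel(actions - target_actions, metric)
    mu = np.array([behavior_density(x, a) for x, a in zip(contexts, actions)])
    return float(np.mean(weights * rewards / mu))
```

With an isotropic metric the sketch reduces to the usual bandwidth-based kernel OPE estimator; the paper's contribution is instead to choose the full metric so as to trade the bias introduced by smoothing against the variance of the importance weights, minimizing the overall MSE.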