arXiv Preprint
Offline reinforcement learning is important in domains such as medicine, economics, and e-commerce, where online experimentation is costly, dangerous, or unethical, and where the true model is unknown. However, most methods assume
all covariates used in the behavior policy's action decisions are observed.
This untestable assumption may be incorrect. We study robust policy evaluation
and policy optimization in the presence of unobserved confounders. We assume
the extent of possible unobserved confounding can be bounded by a sensitivity
model, and that the unobserved confounders are sequentially exogenous. We
propose and analyze an (orthogonalized) robust fitted-Q-iteration that uses
closed-form solutions of the robust Bellman operator to derive a loss
minimization problem for the robust Q function. Our algorithm enjoys the
computational ease of fitted-Q-iteration and statistical improvements (reduced
dependence on quantile estimation error) from orthogonalization. We provide
sample complexity bounds and insights, and demonstrate effectiveness in simulations.
David Bruns-Smith, Angela Zhou
2023-02-01
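
The abstract's core step, plugging a closed-form robust Bellman target into a regression to fit the robust Q function, can be sketched as follows. This is a minimal illustration, assuming a sensitivity model that bounds the confounding-induced density ratio in [1/Λ, Λ] and a finite action set; the function names, the quantile-and-reweighting target, and the choice of regressors are illustrative assumptions, not the paper's implementation, and the orthogonalized correction for quantile estimation error is omitted.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical sketch of robust fitted-Q-iteration under a sensitivity model
# with parameter Lam >= 1.  All names and the exact robust target are
# illustrative assumptions, not the paper's implementation.

def robust_fqi(s, a, r, s_next, actions, gamma=0.99, Lam=1.5, n_iters=30):
    """s, s_next: (n, d) state arrays; a, r: (n,) action and reward arrays."""
    X = np.column_stack([s, a])
    # Quantile level at which the worst-case weights {Lam, 1/Lam} normalize.
    tau = 1.0 / (1.0 + Lam)
    q_model = None
    for _ in range(n_iters):
        if q_model is None:
            v_next = np.zeros_like(r, dtype=float)
        else:
            # Greedy value of the next state over the finite action set.
            q_vals = np.column_stack([
                q_model.predict(np.column_stack([s_next, np.full(len(r), act)]))
                for act in actions
            ])
            v_next = q_vals.max(axis=1)
        # Conditional tau-quantile of the next-state value given (s, a).
        quant = GradientBoostingRegressor(loss="quantile", alpha=tau).fit(X, v_next)
        cutoff = quant.predict(X)
        # Closed-form worst case of E[V] over density ratios in [1/Lam, Lam]:
        # upweight values below the cutoff by Lam, downweight the rest by 1/Lam.
        v_robust = np.where(v_next <= cutoff, Lam * v_next, v_next / Lam)
        target = r + gamma * v_robust
        # Fitted-Q step: one regression per iteration for the robust Q function.
        q_model = GradientBoostingRegressor().fit(X, target)
    return q_model
```

The structure mirrors standard fitted-Q-iteration: each iteration is an ordinary regression, with the robust adjustment entering only through a conditional quantile estimate and a per-sample reweighting of the next-state value.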