Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism
11th November 2022, 1:30 pm – 2:30 pm
In this talk, I will focus on offline reinforcement learning (RL) problems, where one aims to learn an optimal policy from a fixed dataset without active data collection. Depending on the composition of the offline dataset, two categories of methods are used: imitation learning, which is suitable for expert datasets, and vanilla offline RL, which often requires datasets with uniform coverage. In practice, however, datasets often deviate from these two extremes, and the exact data composition is usually unknown a priori.
To bridge this gap, I will present a new offline RL framework that smoothly interpolates between the two extremes of data composition, thereby unifying imitation learning and vanilla offline RL. The new framework is centered around a weak version of the concentrability coefficient that measures the deviation of the behavior policy from the expert policy alone. Under this new framework, we show that a lower confidence bound (LCB) algorithm based on pessimism is adaptively minimax optimal for solving offline contextual bandit problems. Extensions to Markov decision processes will also be discussed.
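
To make the pessimism principle concrete, here is a minimal sketch of an LCB-style rule for an offline contextual bandit with finitely many contexts and actions. It is an illustration under stated assumptions, not the algorithm as presented in the talk: the function name `lcb_policy`, the data layout, the assumption that rewards lie in [0, 1], and the Hoeffding-style penalty with failure probability `delta` are all choices made for this example.

```python
import math
from collections import defaultdict


def lcb_policy(dataset, delta=0.1):
    """Pick, per context, the action with the highest lower confidence bound.

    dataset: iterable of (context, action, reward) tuples, rewards assumed in [0, 1].
    delta:   failure probability used in the (assumed) Hoeffding-style penalty.
    Returns a dict mapping each observed context to the chosen action.
    """
    # stats[context][action] = [count, reward_sum]
    stats = defaultdict(lambda: defaultdict(lambda: [0, 0.0]))
    for context, action, reward in dataset:
        entry = stats[context][action]
        entry[0] += 1
        entry[1] += reward

    policy = {}
    for context, per_action in stats.items():
        best_action, best_lcb = None, -math.inf
        # Actions never observed in this context do not appear here,
        # which is equivalent to giving them a lower bound of -infinity.
        for action, (count, total) in per_action.items():
            mean = total / count
            # Pessimism: subtract an uncertainty penalty that shrinks with count.
            penalty = math.sqrt(math.log(2.0 / delta) / (2.0 * count))
            lcb = mean - penalty
            if lcb > best_lcb:
                best_action, best_lcb = action, lcb
        policy[context] = best_action
    return policy


# Usage: action 0 is well covered with a moderate reward; action 1 was seen
# once with a lucky reward of 1.0. A greedy rule on empirical means would
# pick action 1, while the pessimistic rule prefers the well-covered action.
data = [("s", 0, 0.6)] * 100 + [("s", 1, 1.0)]
chosen = lcb_policy(data)
```

The penalty term is what lets a rule of this kind behave like imitation learning when the data come mostly from one policy (rarely seen actions are heavily discounted) and like value-based offline RL when coverage is broad (penalties are small and empirical means dominate).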