Discriminator-Actor-Critic: Addressing Sample Inefficiency and Reward Bias in Adversarial Imitation Learning


Algorithms for imitation learning based on adversarial optimization, such as generative adversarial imitation learning (GAIL) and adversarial inverse reinforcement learning (AIRL), can effectively mimic demonstrated behaviours by employing both reward and reinforcement learning (RL). However, applications of such algorithms are challenged by the inherent instability and poor sample efficiency of on-policy RL. In particular, the inadequate handling of absorbing states in canonical implementations of RL environments causes an implicit bias in reward functions used by these algorithms. While these biases might work well for some environments, they lead to sub-optimal behaviors in others. Moreover, despite the ability of these algorithms to learn from a few demonstrations, they require a prohibitively large number of the environment interactions for many real-world applications. To address these issues, we first propose to extend the environment MDP with absorbing states which leads to task-independent, and more importantly, unbiased rewards. Secondly, we introduce an off-policy learning algorithm, which we refer to as Discriminator-Actor-Critic. We demonstrate the effectiveness of proper handling of absorbing states, while empirically improving the sample efficiency by an average factor of 10. Our implementation is available online.