The Mirage of Action-Dependent Baselines in Reinforcement Learning


Model-free reinforcement learning with flexible function approximators has recently shown success in solving goal-directed sequential decision-making problems. Policy gradient methods are a promising class of model-free algorithms, but their gradient estimates have high variance, which necessitates large batches and results in low sample efficiency. Typically, a state-dependent control variate (baseline) is used to reduce this variance. Recently, several papers have introduced state- and action-dependent control variates and shown that they significantly reduce variance and improve sample efficiency on continuous control tasks. We theoretically and numerically evaluate the bias and variance of these policy gradient methods and show that action-dependent control variates do not appreciably reduce variance in the tested domains. Instead, seemingly insignificant implementation details enable these prior methods to achieve good empirical improvements, but at the cost of introducing further bias into the gradient. Our analysis indicates that these biased methods tend to improve performance significantly more than their unbiased counterparts.
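To make the variance-reduction role of a baseline concrete, the following is a minimal sketch (not the paper's code) of a score-function (REINFORCE) gradient estimator for a one-step Gaussian policy, with and without a constant baseline. In this one-step setting a state-dependent baseline reduces to a constant; the reward function and all names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0      # policy mean, the parameter we differentiate with respect to
sigma = 1.0   # fixed policy standard deviation

def reward(a):
    # Toy reward; any bounded function suffices for this illustration.
    return -(a - 3.0) ** 2

def grad_samples(n, baseline=0.0):
    # REINFORCE estimator per sample: (d/dmu) log pi(a) * (R(a) - b).
    # Subtracting a constant b leaves the estimator unbiased because
    # E[(d/dmu) log pi(a)] = 0.
    a = rng.normal(mu, sigma, size=n)
    score = (a - mu) / sigma ** 2  # (d/dmu) log N(a; mu, sigma)
    return score * (reward(a) - baseline)

n = 100_000
no_base = grad_samples(n)
# Estimate the mean reward under the policy and use it as the baseline.
b = reward(rng.normal(mu, sigma, size=n)).mean()
with_base = grad_samples(n, baseline=b)

print("mean (no baseline):  ", no_base.mean())
print("mean (with baseline):", with_base.mean())
print("var  (no baseline):  ", no_base.var())
print("var  (with baseline):", with_base.var())
```

Both estimators agree in expectation (the true gradient here is 4), but the baselined version has markedly lower sample variance, which is the effect the abstract's analysis decomposes and measures for state- versus action-dependent baselines.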