Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics

Arsenii Kuznetsov
Samsung AI
Pavel Shvechikov
Samsung AI, HSE
Alexander Grishin
Samsung AI, HSE
Dmitry Vetrov
Samsung AI, HSE

Abstract
Overestimation bias is one of the major impediments to accurate off-policy learning. This paper investigates a novel way to alleviate overestimation bias in a continuous control setting. Our method, Truncated Quantile Critics (TQC), blends three ideas: a distributional representation of the critic, truncation of the critics' predictions, and ensembling of multiple critics. Distributional representation and truncation allow for arbitrarily granular overestimation control, while ensembling provides additional score improvements. TQC outperforms the current state of the art on all environments from the continuous control benchmark suite, demonstrating a 25% improvement on the most challenging Humanoid environment.

Main idea

To achieve granularity in controlling the overestimation, we "decompose" the expected return into the atoms of a distributional representation. By varying the number of atoms, we can control the precision of the return distribution approximation.

To control the overestimation, we propose to truncate the approximation of the return distribution: we drop atoms with the largest locations and estimate the Q-value by averaging the locations of the remaining atoms. By varying the total number of atoms and the number of dropped ones, we can flexibly balance between under- and overestimation. The truncation naturally accounts for the inflated overestimation due to the high return variance: the higher the variance, the lower the Q-value estimate after truncation.
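
The truncation step described above can be illustrated with a minimal sketch. It assumes the atoms are given as a plain list of locations; the function name `truncated_q_value` and the example atom values are hypothetical and chosen only to show that, for the same mean, a wider distribution yields a lower truncated estimate.

```python
from statistics import mean


def truncated_q_value(atoms, num_dropped):
    """Average the atom locations left after dropping the largest ones.

    `atoms` is a list of atom locations and `num_dropped` is the number of
    largest atoms to discard (both names are illustrative assumptions).
    """
    if not 0 <= num_dropped < len(atoms):
        raise ValueError("num_dropped must be in [0, len(atoms))")
    kept = sorted(atoms)[: len(atoms) - num_dropped]
    return mean(kept)


# Two hypothetical return distributions with the same mean but different spread.
low_variance = [9.0, 9.5, 10.0, 10.5, 11.0]
high_variance = [2.0, 6.0, 10.0, 14.0, 18.0]

q_plain = truncated_q_value(low_variance, 0)      # no truncation: the plain mean
q_low_var = truncated_q_value(low_variance, 2)    # drop the 2 largest atoms
q_high_var = truncated_q_value(high_variance, 2)  # same truncation, wider spread

print(q_plain, q_low_var, q_high_var)
```

With no atoms dropped the estimate is the ordinary mean; dropping more atoms pushes the estimate down, and pushes it down further when the atoms are more spread out.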

To improve the Q-value estimation, we ensemble multiple distributional approximators in the following way. First, we form a mixture of the distributions predicted by $N$ approximators. Second, we truncate this mixture by removing the atoms with the largest locations and estimate the Q-value by averaging the locations of the remaining atoms. The order of operations (truncating the mixture versus mixing truncated distributions) may matter. Truncating a mixture removes the largest outliers from the pool of all predictions. Such truncation may be useful in the hypothetical case where one of the critics overestimates far more than the others. In this case, truncating the mixture removes the atoms predicted by this faulty critic. In contrast, a mixture of truncated distributions truncates all critics evenly.
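
The difference between the two orders of operations can be sketched on toy data. This is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, each critic's distribution is a plain list of atom locations, and the total number of atoms dropped from the mixture is assumed to be the per-critic number times the number of critics, so that both variants keep the same number of atoms.

```python
from statistics import mean


def truncate_mixture(critic_atoms, dropped_per_critic):
    """Pool the atoms of all critics, then drop the largest atoms of the pool."""
    pooled = sorted(a for atoms in critic_atoms for a in atoms)
    # Assumption: drop dropped_per_critic * N atoms in total from the pool.
    num_dropped = dropped_per_critic * len(critic_atoms)
    return pooled[: len(pooled) - num_dropped]


def mix_truncated(critic_atoms, dropped_per_critic):
    """Drop the largest atoms of each critic separately, then pool the rest."""
    kept = []
    for atoms in critic_atoms:
        kept.extend(sorted(atoms)[: len(atoms) - dropped_per_critic])
    return sorted(kept)


# Hypothetical predictions: two reasonable critics and one that overestimates.
critics = [
    [1.0, 2.0, 3.0, 4.0],
    [1.5, 2.5, 3.5, 4.5],
    [20.0, 30.0, 40.0, 50.0],  # the overestimating critic
]

q_truncated_mixture = mean(truncate_mixture(critics, 2))
q_mixture_of_truncated = mean(mix_truncated(critics, 2))

print(q_truncated_mixture, q_mixture_of_truncated)
```

In this toy case, truncating the mixture discards every atom of the overestimating critic, while truncating each critic evenly still keeps two of its atoms and yields a much larger estimate.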

Results

We compare our method with the original implementations of state-of-the-art algorithms: SAC, TrulyPPO, and TD3. For HalfCheetah, Walker, and Ant we evaluate the methods over an extended frame range, until all methods plateau (5e6 versus the usual 3e6). For Hopper, we extended the range to 3e6 steps.

For our method, we selected the number of dropped atoms d for each environment independently, based on a separate evaluation. The best value is d=5 for Hopper, d=0 for HalfCheetah, and d=2 for the rest.

The figure on top shows the learning curves. In the table we report the average and standard deviation over 10 seeds. The performance of each seed is the average of its last 100 evaluations. We evaluate performance every 1000 frames as the average of 10 deterministic rollouts. As our results suggest, TQC performs consistently better than any of the competitors.
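
The aggregation protocol described above can be sketched as follows. The function names and the toy numbers are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, stdev


def seed_score(evaluation_returns, last_k=100):
    """Score of one seed: the average of its last `last_k` evaluations."""
    return mean(evaluation_returns[-last_k:])


def summarize(per_seed_returns, last_k=100):
    """Average and standard deviation of the per-seed scores."""
    scores = [seed_score(returns, last_k) for returns in per_seed_returns]
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return mean(scores), stdev(scores)


# Toy data: two seeds with three evaluations each, scored on the last two.
toy_runs = [
    [0.0, 10.0, 20.0],  # seed score: 15.0
    [0.0, 20.0, 30.0],  # seed score: 25.0
]
avg, std = summarize(toy_runs, last_k=2)

print(avg, std)
```

In the setting described in the text, each inner list would hold one seed's evaluation returns (each already an average of 10 deterministic rollouts), and `last_k` would stay at its default of 100.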

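| Environment | TrulyPPO | TD3  | SAC  | TQC  |
|-------------|----------|------|------|------|
| Hopper      | 2.0      | 3.3  | 2.9  | 3.7  |
| HalfCheetah | 5.8      | 15.1 | 12.4 | 18.0 |
| Walker2d    | 4.0      | 5.1  | 5.8  | 7.0  |
| Ant         | 0.0      | 5.7  | 6.2  | 8.0  |
| Humanoid    | 5.9      | 5.4  | 7.8  | 9.5  |

Average of the seed returns (thousands) on MuJoCo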