To control the overestimation at a fine granularity, we "decompose" the expected return into atoms of a distributional representation. By varying the number of atoms, we can control the precision of the return distribution approximation.
To control the overestimation, we propose to truncate the approximation of the return distribution: we drop atoms with the largest locations and estimate the Q-value by averaging the locations of the remaining atoms. By varying the total number of atoms and the number of dropped ones, we can flexibly balance between under- and overestimation. The truncation naturally accounts for the inflated overestimation due to the high return variance: the higher the variance, the lower the Q-value estimate after truncation.
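The truncation step can be illustrated with a minimal sketch. It assumes the return distribution is given as a plain list of atom locations with equal weights; the function name `truncated_q_estimate` and the example values are hypothetical and serve only as illustration.

```python
from statistics import fmean


def truncated_q_estimate(atom_locations, num_dropped):
    """Average the atom locations left after dropping the largest ones."""
    atoms = sorted(atom_locations)
    if not 0 <= num_dropped < len(atoms):
        raise ValueError("num_dropped must be in [0, number of atoms)")
    # Keep everything except the num_dropped largest locations.
    kept = atoms[: len(atoms) - num_dropped]
    return fmean(kept)


# Two illustrative distributions with the same mean (10) but different spread.
narrow = [9.0, 9.5, 10.0, 10.5, 11.0]
wide = [0.0, 5.0, 10.0, 15.0, 20.0]

q_narrow = truncated_q_estimate(narrow, num_dropped=2)
q_wide = truncated_q_estimate(wide, num_dropped=2)
```

With no atoms dropped, both distributions give the same estimate. After dropping two atoms, the wider distribution gives the lower estimate, which matches the claim that higher variance leads to a lower truncated Q-value.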
To improve the Q-value estimation, we ensemble multiple distributional approximators in the following way. First, we form a mixture of the distributions predicted by $N$ approximators. Second, we truncate this mixture by removing the atoms with the largest locations and estimate the Q-value by averaging the locations of the remaining atoms. The order of operations (truncating the mixture versus mixing truncated distributions) may matter. Truncating the mixture removes the largest outliers from the pool of all predictions. This may be useful in the hypothetical case where one of the critics goes astray and overestimates much more than the others: the truncation of the mixture then removes the atoms predicted by this inadequate critic. In contrast, the mixture of truncated distributions truncates all critics evenly.
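The difference between the two orders of operations can be seen in a small sketch. It assumes every critic outputs the same number of equally weighted atoms, so that pooling the atoms corresponds to a uniform mixture; the function names and the toy critic outputs are hypothetical.

```python
from statistics import fmean


def truncate_mixture(critic_atoms, total_dropped):
    """Pool the atoms of all critics, then drop the largest ones."""
    pooled = sorted(a for atoms in critic_atoms for a in atoms)
    if not 0 <= total_dropped < len(pooled):
        raise ValueError("total_dropped must be in [0, total number of atoms)")
    return fmean(pooled[: len(pooled) - total_dropped])


def mix_truncated(critic_atoms, dropped_per_critic):
    """Drop the largest atoms of each critic separately, then pool the rest."""
    kept = []
    for atoms in critic_atoms:
        if not 0 <= dropped_per_critic < len(atoms):
            raise ValueError("dropped_per_critic must be in [0, atoms per critic)")
        kept.extend(sorted(atoms)[: len(atoms) - dropped_per_critic])
    return fmean(kept)


# Toy example: the third critic overestimates far more than the other two.
critics = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 2.0, 3.0, 4.0],
    [11.0, 12.0, 13.0, 14.0],
]

# Both variants drop three atoms in total.
q_truncated_mixture = truncate_mixture(critics, total_dropped=3)
q_mixture_of_truncated = mix_truncated(critics, dropped_per_critic=1)
```

In this toy case, truncating the mixture removes three of the four atoms of the overestimating critic, while truncating each critic separately removes only one of them, so the first estimate is lower.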
We compare our method with the original implementations of state-of-the-art algorithms: SAC, TrulyPPO, and TD3. For HalfCheetah, Walker, and Ant, we evaluate the methods on an extended frame range, until all methods plateau (5e6 frames versus the usual 3e6). For Hopper, we extended the range to 3e6 steps.
For our method, we selected the number of dropped atoms $d$ for each environment independently, based on a separate evaluation. The best value is $d=5$ for Hopper, $d=0$ for HalfCheetah, and $d=2$ for the remaining environments.
The figure on top shows the learning curves. In the table, we report the average and standard deviation over 10 seeds. The performance of each seed is the average of its last 100 evaluations. We evaluate the performance every 1000 frames as the average of 10 deterministic rollouts. As our results suggest, TQC performs consistently better than any of the competitors.
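The aggregation behind the reported numbers can be sketched as follows. The sketch assumes each seed is stored as a list of evaluation returns in chronological order, and it uses the population standard deviation; the text does not specify the estimator, so that choice, like the function name `summarize_seeds`, is an assumption.

```python
from statistics import fmean, pstdev


def summarize_seeds(eval_returns_per_seed, last_k=100):
    """Return (mean, std) over seeds of the average of the last_k evaluations."""
    # Performance of one seed: average of its last_k evaluation returns.
    per_seed = [fmean(returns[-last_k:]) for returns in eval_returns_per_seed]
    # Assumption: population standard deviation across seeds.
    return fmean(per_seed), pstdev(per_seed)


# Placeholder data, not experimental results: seed s has a constant return of s.
placeholder_runs = [[float(seed)] * 200 for seed in range(10)]
mean_return, std_return = summarize_seeds(placeholder_runs)
```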