Ray: [rllib] TD3 & DDPG bugs

Created on 20 Apr 2019 · 1 comment · Source: ray-project/ray

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): linux
  • Ray installed from (source or binary): source
  • Ray version: master
  • Python version: 3.6
  • Exact command to reproduce: N/A

Describe the problem

(I'm currently fixing the issues below; this issue is just to flag them before I make a PR)

The TD3/DDPG implementation has a few bugs:

  • The current implementation tries to change the actor & critic learning rates by scaling the corresponding losses by {actor,critic}_loss_coeff. That _will_ change the gradient magnitudes at each iteration, as intended. However, it won't affect the actual Adam updates much, because Adam normalises gradients by a running estimate of their second moment (roughly a running standard deviation), which cancels out any constant rescaling factor (the sketch after this list illustrates this point and the next).
  • policy_delay is implemented by zeroing out the critic loss on some iterations. That also won't work because Adam updates network parameters using a running mean of gradients, rather than just the gradient at the current iteration. Zeroing losses (& thus gradients) at one step will just decrease the magnitude of that running mean slightly. Really the actor & critic need separate optimisers.
  • When smooth_target_policy is enabled, ActionNetwork uses IID Gaussian noise for exploration, as desired (see the use_gaussian_noise branch in ActionNetwork.__init__()). However, that noise ignores the eps given to ActionNetwork & updated via the exploration schedule. Thus the exploration schedule is ignored when running TD3, and exploration is implicitly still enabled at evaluation time.
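To make the first two points concrete, here is a small standalone NumPy sketch (not RLlib code; the 1-D gradient stream and constants are purely illustrative). It shows that a constant loss coefficient is cancelled out by Adam's normalisation, and that zeroing the loss on "skipped" iterations still produces parameter updates through Adam's running mean:

```python
import numpy as np

def adam_updates(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of Adam parameter updates for a 1-D gradient stream."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g          # running mean of gradients
        v = beta2 * v + (1 - beta2) * g ** 2     # running mean of squared gradients
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(updates)

rng = np.random.default_rng(0)
grads = rng.normal(size=1000)

# 1) Scaling the loss (and hence every gradient) by a constant barely changes
#    the updates: m_hat and sqrt(v_hat) scale by the same factor and cancel.
print(np.max(np.abs(adam_updates(grads) - adam_updates(0.1 * grads))))  # ~0

# 2) Zeroing the loss on every other iteration does not stop the parameters
#    from moving on those iterations: the running mean m is still nonzero,
#    so Adam keeps applying an update. A real policy_delay needs a separate
#    optimiser whose apply op simply isn't run on the delayed iterations.
masked = grads.copy()
masked[1::2] = 0.0
print(np.abs(adam_updates(masked)[1::2]).mean())  # clearly nonzero
```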

On usability: it'd be nice to separate the configuration options into those targeted at vanilla DDPG, those introduced by TD3, and those introduced by SAC. At the moment it's very hard to tune the implementation to imitate TD3 because it's not clear which configuration options have to be changed. The fact that some options (e.g. smooth_target_policy) cause other options to be silently ignored makes the situation even worse. Separate tuned examples for DDPG and TD3 would improve the situation. Arguably the best fix would be to separate TD3 out into its own agent class so that it actually works out of the box, although refactoring the current implementation to support that without lots of code duplication would be a bit of a pain. A rough sketch of how the options could be grouped follows below.
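As a purely illustrative sketch of that separation (this is not the actual RLlib config surface; apart from smooth_target_policy, policy_delay and the *_loss_coeff options mentioned above, the key names below are hypothetical), the TD3 settings could live in a small override dict layered on top of the DDPG defaults, much like the DQN vs APEX split:

```python
# Hypothetical grouping of options; key names other than smooth_target_policy
# and policy_delay are placeholders, not the real RLlib config keys.
DDPG_DEFAULTS = {
    "actor_lr": 1e-4,
    "critic_lr": 1e-3,
    "exploration_noise_type": "ornstein_uhlenbeck",  # placeholder name
}

TD3_OVERRIDES = {
    "twin_q": True,                 # clipped double-Q learning
    "policy_delay": 2,              # delayed actor updates
    "smooth_target_policy": True,   # target policy smoothing
    "target_noise_sigma": 0.2,      # placeholder name for smoothing stddev
    "target_noise_clip": 0.5,       # placeholder name for smoothing clip
}

# A TD3 agent class would just apply the overrides to the shared defaults.
TD3_DEFAULTS = {**DDPG_DEFAULTS, **TD3_OVERRIDES}
```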

Source code / logs

N/A

Label: bug


All comments

Thanks for pointing these issues out! It's unfortunate that we will need multiple optimizers, but I guess it can't be helped. cc @hartikainen @joneswong

> Arguably the best fix would be to separate TD3 out into its own agent class so that it actually works out of the box, although refactoring the current implementation to support that without lots of code duplication would be a bit of a pain.

Agree this is a good idea, we can do something similar to DQN vs APEX in terms of splitting up the class.
