(I'm currently fixing the issues below; this issue is just to flag these problems before I make a PR)
The TD3/DDPG implementation has a few bugs:
- The relative weighting of the actor and critic losses is controlled by `{actor,critic}_loss_coeff`. That _will_ change the gradient magnitudes at each iteration, as intended. However, it won't affect the actual Adam updates much, because Adam normalises gradients by a running standard deviation, which cancels out the effect of any constant rescaling factor (a numerical illustration is sketched after this list).
- `policy_delay` is implemented by zeroing out the critic loss on some iterations. That also won't work, because Adam updates network parameters using a running mean of gradients rather than just the gradient at the current iteration. Zeroing losses (and thus gradients) at one step will just decrease the magnitude of that running mean slightly. Really, the actor and critic need separate optimisers (see the sketch after this list).
- When `smooth_target_policy` is enabled, `ActionNetwork` uses IID Gaussian noise for exploration, as desired (see the `use_gaussian_noise` branch in `ActionNetwork.__init__()`). However, that noise ignores the `eps` given to `ActionNetwork` and updated via the exploration schedule. Thus the exploration schedule is ignored when running TD3, and exploration is implicitly still enabled at evaluation time.
- On usability: it would be nice to separate the configuration options into those targeted at vanilla DDPG, those introduced by TD3, and those introduced by SAC. At the moment it is very hard to tune the implementation to imitate TD3 because it is not clear which configuration options have to be changed. The fact that some options (e.g. `smooth_target_policy`) cause other options to be silently ignored makes the situation even worse. Separate tuned examples for DDPG and TD3 would improve the situation. Arguably the best fix would be to separate TD3 out into its own agent class so that it actually works out of the box, although refactoring the current implementation to support that without lots of code duplication would be a bit of a pain.
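To illustrate the first point, here is a minimal NumPy sketch of the Adam update rule (plain Python, not RLlib code): multiplying every gradient by a constant scales both the first-moment estimate and the square root of the second-moment estimate by that constant, so the resulting parameter updates are essentially unchanged.

```python
# Minimal sketch (NumPy only) of why a constant loss coefficient barely
# changes Adam's updates: scaling every gradient by c scales both m_hat and
# sqrt(v_hat) by c, so the ratio m_hat / (sqrt(v_hat) + eps) is ~unchanged.
import numpy as np

def adam_updates(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g            # running mean of gradients
        v = b2 * v + (1 - b2) * g ** 2       # running mean of squared gradients
        m_hat = m / (1 - b1 ** t)            # bias correction
        v_hat = v / (1 - b2 ** t)
        steps.append(lr * m_hat / (np.sqrt(v_hat) + eps))
    return np.array(steps)

rng = np.random.default_rng(0)
grads = rng.normal(size=1000)
base = adam_updates(grads)
scaled = adam_updates(100.0 * grads)        # e.g. a loss coefficient of 100
print(np.max(np.abs(base - scaled)))        # tiny: the rescaling cancels out
```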
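And a rough sketch of the separate-optimiser fix for `policy_delay` (PyTorch-style code with placeholder networks and dummy losses, purely to show the structure; the actual RLlib implementation is TensorFlow-based): skip the actor's optimiser step entirely on delayed iterations instead of zeroing its loss, so the actor optimiser's running gradient moments are never polluted by zero gradients.

```python
import torch

# Placeholder networks and data, purely for illustration.
actor = torch.nn.Linear(8, 2)
critic = torch.nn.Linear(8 + 2, 1)
obs = torch.randn(32, 8)

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
policy_delay = 2

for it in range(1, 11):
    # Critic update every iteration (dummy regression loss for the sketch).
    q = critic(torch.cat([obs, actor(obs).detach()], dim=-1))
    critic_loss = ((q - 1.0) ** 2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update only every `policy_delay` iterations: skip the step
    # entirely rather than zeroing the loss, so Adam's moments stay clean.
    if it % policy_delay == 0:
        actor_loss = -critic(torch.cat([obs, actor(obs)], dim=-1)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
```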
Thanks for pointing these issues out! It's unfortunate that we will need multiple optimizers, but I guess it can't be helped. cc @hartikainen @joneswong
> Arguably the best fix would be to separate TD3 out into its own agent class so that it actually works out of the box, although refactoring the current implementation to support that without lots of code duplication would be a bit of a pain.
Agree this is a good idea, we can do something similar to DQN vs APEX in terms of splitting up the class.