Hi,
I skimmed the authors' implementation, and it seems they no longer use the value network (at least in the follow-up, "SAC and Applications"). Instead, they use only the Q-networks. It seems they removed it in this commit.
Thanks,
Lukas
Hello,
Thanks for pointing out that change.
> I skimmed over the author's implementation and it seems that they don't use the value network
Did you try that variant? Does that improve the results?
Hi, I didn't compare the performance. It would be quite computationally expensive, and even then one couldn't be certain of the result. It's RL, after all (https://arxiv.org/abs/1709.06560) :).
In the original SAC [1,2], we observed that adding another learned value function stabilized the learning. However, when testing the more recently released version of SAC [3,4], we found no cases where the value function made any difference (or at least none where it improved the performance), so we decided to drop it for the sake of simplicity.
We believe that either the lack of the reparameterization trick in the policy update or the use of Gaussian mixture models originally introduced more variance into the learning, and that the learned value function therefore made things more stable. We have not yet confirmed this hypothesis, though. If anyone wants to give it a try, I'd love to hear the results.
[1] https://arxiv.org/pdf/1801.01290.pdf
[2] https://github.com/haarnoja/sac
[3] https://arxiv.org/pdf/1812.05905.pdf
[4] https://github.com/rail-berkeley/softlearning/
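For anyone comparing the two variants, here is a rough sketch of how the Q-function target differs with and without the separate value network. It is not taken from either repository: the function names, argument names, and default values for `alpha` and `gamma` are assumptions for illustration, and the inputs are assumed to be NumPy arrays of per-sample values that the target networks and the policy have already produced.

```python
import numpy as np


def q_target_with_value_net(reward, done, next_value_target, gamma=0.99):
    """Variant with a learned value function, as in [1].

    The Q-networks bootstrap from a separate target value network,
    evaluated at the next state: y = r + gamma * (1 - done) * V_target(s').
    """
    return reward + gamma * (1.0 - done) * next_value_target


def q_target_without_value_net(reward, done, next_q1, next_q2,
                               next_log_prob, alpha=0.2, gamma=0.99):
    """Variant without a value network, as in [3].

    The soft value of the next state is computed directly from the two
    target Q-networks and the policy's log-probability of a freshly
    sampled next action a' ~ pi(.|s'):
        V(s') ~= min(Q1'(s', a'), Q2'(s', a')) - alpha * log pi(a'|s')
    """
    soft_value = np.minimum(next_q1, next_q2) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value
```

In the second variant the value network is replaced by a single-sample estimate of the same soft value, so the only things left to learn are the Q-networks and the policy (plus the temperature `alpha`, if it is tuned automatically).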
@hartikainen thanks for the answer =)