Stable-baselines: Value network in SAC

Created on 12 Apr 2019 · 4 Comments · Source: hill-a/stable-baselines

Hi,

I skimmed over the authors' implementation and it seems that they don't use the value network (at least in the follow-up "SAC and Applications"). Instead, they only use the Q-networks. It seems they removed it in this commit.

Thanks,

Lukas

question

jendelel

Most helpful comment

In the original SAC [1,2], we observed that adding another learned value function stabilized the learning. However, when testing the more recently released version of SAC [3,4], we found no cases where the value function would make any difference (or at least improve the performance) and decided to drop it for the sake of simplicity.
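For readers comparing the two variants, here is a minimal NumPy sketch of how the Q-function regression target differs with and without the separate value network, as I understand [1] and [3]. The function names and the default values of `gamma` and `alpha` are illustrative assumptions, not taken from either codebase, and the inputs stand in for network outputs.

```python
import numpy as np


def q_target_with_value_net(reward, done, next_value_target, gamma=0.99):
    """Original SAC [1]: bootstrap from a target value network V_target(s')."""
    return reward + gamma * (1.0 - done) * next_value_target


def q_target_without_value_net(reward, done, next_q1_target, next_q2_target,
                               next_log_prob, alpha=0.2, gamma=0.99):
    """Later SAC [3]: build the soft value of s' from the target Q-networks.

    next_q*_target are Q_target(s', a') and next_log_prob is log pi(a'|s'),
    both evaluated at a fresh action a' sampled from the current policy.
    """
    soft_value = np.minimum(next_q1_target, next_q2_target) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value


# Hypothetical batch of two transitions; the second one is terminal.
reward = np.array([1.0, 1.0])
done = np.array([0.0, 1.0])
print(q_target_with_value_net(reward, done, np.array([2.0, 2.0])))
print(q_target_without_value_net(reward, done, np.array([3.0, 3.0]),
                                 np.array([2.5, 2.5]), np.array([-1.0, -1.0])))
```

In the second variant the value network disappears because its role, estimating the soft value of the next state, is taken over by a single-sample estimate from the target Q-networks.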

We believe that either the lack of the reparameterization trick in the policy update or the use of Gaussian mixture models originally introduced more variance into the learning, and that the learned value function therefore made things more stable. We have not yet confirmed this hypothesis, though. If anyone wants to give it a try, I'd love to hear the results.
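To illustrate what "more variance without the reparameterization trick" can mean, here is a toy NumPy sketch. It is not an experiment on SAC and does not confirm the hypothesis above. It assumes a 1-D Gaussian policy with mean `mu` and a made-up objective `f(a) = -(a - 1)^2`, and compares per-sample gradient estimates of the expected objective with respect to `mu` from the score-function (likelihood-ratio) estimator and from the reparameterized estimator.

```python
import numpy as np


def gradient_samples(mu=0.0, sigma=1.0, n=10_000, seed=0):
    """Per-sample estimates of d/dmu E[f(a)], a ~ N(mu, sigma^2), f(a) = -(a - 1)^2."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    a = mu + sigma * eps                # reparameterized sample
    f = -(a - 1.0) ** 2                 # toy objective (an assumption, not SAC's)

    # Score-function estimator: f(a) * d/dmu log N(a; mu, sigma^2)
    score_grads = f * (a - mu) / sigma ** 2
    # Reparameterized estimator: d/dmu f(mu + sigma * eps) = f'(a)
    reparam_grads = -2.0 * (a - 1.0)
    return score_grads, reparam_grads


score, reparam = gradient_samples()
print("score-function: mean", score.mean(), "variance", score.var())
print("reparameterized: mean", reparam.mean(), "variance", reparam.var())
```

Both estimators target the same gradient (here `-2 * (mu - 1)`), but in this toy setting the score-function samples are considerably noisier. Whether that effect explains the stabilizing role of the value network in the original SAC remains untested.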

[1] https://arxiv.org/pdf/1801.01290.pdf
[2] https://github.com/haarnoja/sac
[3] https://arxiv.org/pdf/1812.05905.pdf
[4] https://github.com/rail-berkeley/softlearning/

hartikainen on 14 May 2019

👍2

All 4 comments

Hello,

Thanks for pointing out that change.

> I skimmed over the author's implementation and it seems that they don't use the value network

Did you try that variant? Does that improve the results?

araffin on 15 Apr 2019

Hi, I didn't compare the performance. It would be quite computationally expensive and even then one can't be certain. It's RL in the end (https://arxiv.org/abs/1709.06560) :).

jendelel on 21 Apr 2019

@hartikainen thanks for the answer =)

araffin on 14 May 2019
