Stable-baselines: Value network in SAC

Created on 12 Apr 2019 · 4 Comments · Source: hill-a/stable-baselines

Hi,

I skimmed over the authors' implementation, and it seems that they don't use the value network (at least in the follow-up, "SAC and Applications"). Instead, they use only the Q-networks. It seems they removed it in this commit.

Thanks,

Lukas

question

All 4 comments

Hello,

Thanks for pointing out that change.

> I skimmed over the author's implementation and it seems that they don't use the value network

Did you try that variant? Does that improve the results?

Hi, I didn't compare the performance. It would be quite computationally expensive and even then one can't be certain. It's RL in the end (https://arxiv.org/abs/1709.06560) :).

In the original SAC [1,2], we observed that adding a separately learned value function stabilized learning. However, when testing the more recently released version of SAC [3,4], we found no cases where the value function made any difference (or at least none where it improved performance), so we decided to drop it for the sake of simplicity.
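
For readers comparing the two variants, the difference comes down to how the bootstrap target for the Q-functions is formed. The snippet below is only a minimal NumPy sketch, not code from either repository: the function names, the explicit temperature `alpha`, and the default values of `gamma` and `alpha` are assumptions made for illustration.

```python
import numpy as np


def q_target_with_value_net(reward, done, next_value_target, gamma=0.99):
    """Variant with a value network: bootstrap from a target value net V_targ(s')."""
    return reward + gamma * (1.0 - done) * next_value_target


def value_target(q1, q2, log_prob, alpha=0.2):
    """Regression target for the value network: min of the two Q estimates
    minus the entropy term, for an action sampled from the current policy."""
    return np.minimum(q1, q2) - alpha * log_prob


def q_target_without_value_net(reward, done, next_q1_target, next_q2_target,
                               next_log_prob, gamma=0.99, alpha=0.2):
    """Variant without a value network: the soft value of s' is computed
    directly from the target Q-networks and the policy's log-probability."""
    soft_value = np.minimum(next_q1_target, next_q2_target) - alpha * next_log_prob
    return reward + gamma * (1.0 - done) * soft_value


# Toy batch of two transitions (all numbers are made up for illustration).
reward = np.array([1.0, 0.5])
done = np.array([0.0, 1.0])
print(q_target_with_value_net(reward, done, np.array([2.0, 3.0])))
print(q_target_without_value_net(reward, done,
                                 np.array([3.0, 1.0]), np.array([2.5, 1.5]),
                                 np.array([-1.0, -0.5])))
```

In this sketch, dropping the value network replaces one learned quantity (the target value network's output) with an expression built from the target Q-networks, which means one fewer network to train.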

We believe that either the lack of the reparameterization trick in the policy update or the use of Gaussian mixture models originally introduced more variance into learning, which is why the learned value function made things more stable. We have not yet confirmed this hypothesis, though. If anyone wants to give it a try, I'd love to hear the results.
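
To give a feel for the variance argument, here is a toy comparison of a score-function (likelihood-ratio) gradient estimator with a reparameterized one. It is not SAC and does not test the hypothesis above: the objective `f(x) = x**2`, the Gaussian sampling distribution, and every name in the snippet are assumptions chosen for illustration. Both estimators target the same gradient with respect to the mean `mu`, which is `2 * mu` for this objective.

```python
import numpy as np


def gradient_estimates(mu=1.0, sigma=1.0, n_samples=100_000, seed=0):
    """Per-sample estimates of d/dmu E[x**2] for x ~ N(mu, sigma**2)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)
    x = mu + sigma * eps  # reparameterized sample

    # Score-function estimator: f(x) * d/dmu log N(x; mu, sigma**2)
    score_function = (x ** 2) * (x - mu) / sigma ** 2

    # Reparameterized estimator: differentiate f(mu + sigma * eps) w.r.t. mu
    reparameterized = 2.0 * x

    return score_function, reparameterized


score, reparam = gradient_estimates()
print("score-function:  mean=%.3f  var=%.3f" % (score.mean(), score.var()))
print("reparameterized: mean=%.3f  var=%.3f" % (reparam.mean(), reparam.var()))
```

Both sample means should land near `2 * mu`, while the per-sample variance of the score-function estimator is noticeably larger in this toy setting.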

[1] https://arxiv.org/pdf/1801.01290.pdf
[2] https://github.com/haarnoja/sac
[3] https://arxiv.org/pdf/1812.05905.pdf
[4] https://github.com/rail-berkeley/softlearning/

@hartikainen thanks for the answer =)
