Stable-baselines: Choice of Q value in the policy loss of SAC algorithm

Created on 14 May 2019 · 5 Comments · Source: hill-a/stable-baselines

Hello,

Why is Q value 1 (qf1_pi) used to compute the policy loss in the SAC algorithm? Shouldn't it be the min of the two Q values? If not, could you briefly explain why?

In the following file, line 237:

stable_baselines/sac/sac.py

# Take the min of the two Q-Values (Double-Q Learning)
min_qf_pi = tf.minimum(qf1_pi, qf2_pi)

# ...

# Compute the policy loss
# Alternative: policy_kl_loss = tf.reduce_mean(logp_pi - min_qf_pi)
policy_kl_loss = tf.reduce_mean(self.ent_coef * logp_pi - qf1_pi) # min_qf_pi instead of qf1_pi?

Thank you for your help,

question

maximeLR

Most helpful comment

Yeah, I believe there is no particular reason for that choice -- they all work pretty much equally well.

haarnoja on 14 May 2019

All 5 comments

Hello,

good question.

Why is Q value 1 (qf1_pi) used to compute the policy loss in the SAC algorithm? Shouldn't it be the min of the two Q values?

I don't see an obvious reason to choose one option over the other. This was done to follow the original implementation, although they seem to have changed that in their new implementation (see this commit: https://github.com/rail-berkeley/softlearning/commit/c4fc3d71f195208f376cbdd317f77c5d9f501b6f).

So this is more of a question for @hartikainen and @haarnoja.

(I would expect no significant change in performance if you change it to the min instead of the first q value, but if you try, I'm interested in your results ;) )
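For anyone who wants to try the comparison, here is a minimal pure-Python sketch of the two policy-loss variants on a toy batch. It only mirrors the arithmetic of the snippet quoted in the question; the helper `policy_loss`, its `use_min` flag, and the numbers are made up for illustration and are not part of stable-baselines.

```python
from statistics import mean


def policy_loss(logp_pi, qf1_pi, qf2_pi, ent_coef, use_min):
    """Batch mean of ent_coef * logp_pi - Q (hypothetical helper).

    Q is either the first critic's value (use_min=False) or the
    element-wise min of both critics (use_min=True).
    """
    per_sample = []
    for logp, q1, q2 in zip(logp_pi, qf1_pi, qf2_pi):
        q = min(q1, q2) if use_min else q1
        per_sample.append(ent_coef * logp - q)
    return mean(per_sample)


# Toy batch with made-up values, for illustration only
logp_pi = [-1.0, -0.5, -2.0]  # log-probabilities of the sampled actions
qf1_pi = [1.0, 2.0, 3.0]      # first critic evaluated at those actions
qf2_pi = [1.5, 1.0, 3.0]      # second critic evaluated at those actions
ent_coef = 0.2

loss_q1 = policy_loss(logp_pi, qf1_pi, qf2_pi, ent_coef, use_min=False)
loss_min = policy_loss(logp_pi, qf1_pi, qf2_pi, ent_coef, use_min=True)
print(loss_q1, loss_min)
```

Since min(q1, q2) <= q1 for every sample, the min variant can never give a lower loss value than the qf1 variant on the same batch; whether that changes training in practice is the empirical question here.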

Also related: #270

araffin on 14 May 2019

Good question. Based on my tests, there was no difference at all between using the min vs. a single value, and I converged to using the min just to be consistent with the usage of Q in the TD-update.
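
To illustrate what "the usage of Q in the TD-update" can look like, here is a rough sketch of a soft TD target that takes the min of two Q estimates at the next state. It assumes the formulation without a separate value network, which is not necessarily what stable_baselines/sac/sac.py did at the time; the function name `soft_q_target` and all arguments are hypothetical.

```python
def soft_q_target(reward, done, gamma, ent_coef, q1_next, q2_next, logp_next):
    """Soft TD target for a single transition (illustrative sketch).

    q1_next, q2_next: the two Q estimates at the next state for an
    action sampled from the current policy; logp_next: its log-probability.
    """
    # Take the min of the two estimates (clipped double-Q trick)
    min_q_next = min(q1_next, q2_next)
    # Soft state value: Q minus the entropy-weighted log-probability
    soft_value_next = min_q_next - ent_coef * logp_next
    # No bootstrapping past a terminal transition (done == 1.0)
    return reward + gamma * (1.0 - done) * soft_value_next


# Made-up transition, for illustration only
target = soft_q_target(reward=1.0, done=0.0, gamma=0.99, ent_coef=0.2,
                       q1_next=2.0, q2_next=1.5, logp_next=-1.0)
terminal_target = soft_q_target(reward=1.0, done=1.0, gamma=0.99, ent_coef=0.2,
                                q1_next=2.0, q2_next=1.5, logp_next=-1.0)
print(target, terminal_target)
```

Using the same min in the policy loss keeps the actor and the critic target based on the same Q estimate, which is the consistency argument made above.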