qf1, qf2, value_fn = self.policy_tf.make_critics(self.processed_obs_ph, self.actions_ph,
                                                 create_qf=True, create_vf=True)
qf1_loss = 0.5 * tf.reduce_mean((q_backup - qf1) ** 2)
qf2_loss = 0.5 * tf.reduce_mean((q_backup - qf2) ** 2)
values_losses = qf1_loss + qf2_loss + value_loss
Are the two Q networks just initialized differently? If so, does that improve results significantly?
Yes, that's one of the tricks in SAC. See the Spinning Up description of SAC. You may close this issue if you have no further questions related to stable-baselines.
Are two Q networks just initialized differently? If so, does it improve the effect significantly?
I recommend reading the TD3 paper (which introduces clipped double Q-learning) and the SAC paper for a better understanding.
In short: yes, they are initialized differently, and taking the min of the two reduces overestimation of the Q-value.
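To make the "min of the two" idea concrete, here is a minimal NumPy sketch. It is not the stable-baselines implementation: the helper names `clipped_double_q` and `mse_critic_loss` are hypothetical, and the two critics are simulated as the true Q-value plus independent noise, standing in for two networks that differ only in their initialization.

```python
import numpy as np


def clipped_double_q(q1, q2):
    """Element-wise minimum of two critics' Q-value estimates (hypothetical helper)."""
    return np.minimum(q1, q2)


def mse_critic_loss(q_backup, q):
    """Same form as the critic losses quoted above: 0.5 * mean squared error."""
    return 0.5 * np.mean((q_backup - q) ** 2)


# Assumption: each critic's error is independent zero-mean noise around the true value.
rng = np.random.default_rng(0)
true_q = np.zeros(10_000)
q1 = true_q + rng.normal(scale=1.0, size=true_q.shape)
q2 = true_q + rng.normal(scale=1.0, size=true_q.shape)

# The minimum is never larger than either estimate, so it is a pessimistic value.
min_q = clipped_double_q(q1, q2)

# Both critics are regressed toward the same backup target, each with its own loss.
q_backup = true_q
total_qf_loss = mse_critic_loss(q_backup, q1) + mse_critic_loss(q_backup, q2)

print("mean q1:", q1.mean())
print("mean min(q1, q2):", min_q.mean())
print("summed critic loss:", total_qf_loss)
```

Because the minimum is biased downward, it counteracts the upward bias that arises when a policy is trained to maximize a noisy Q estimate. Both critics are still trained against the same target, which is why the two losses are simply summed.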
Thank you!!!