If I understand correctly, the output of the policy network represents the parameters of a Gaussian distribution conditioned on the state observation.
Here is the code that builds the distribution parameters on top of the policy net core:
line 236 in stable_baselines.common.distributions.DiagGaussianProbabilityDistributionType:
def proba_distribution_from_latent(self, pi_latent_vector, vf_latent_vector, init_scale=1.0, init_bias=0.0):
    mean = linear(pi_latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
    logstd = tf.get_variable(name='pi/logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
    pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
    q_values = linear(vf_latent_vector, 'q', self.size, init_scale=init_scale, init_bias=init_bias)
    return self.proba_distribution_from_flat(pdparam), mean, q_values
The mean depends on the latent observation vector. But the standard deviation is designed as constant w.r.t. the state observation.
Why is the design chosen like this? What is the intuition behind it?
In my understanding, the confidence to choose an action (std of the Gaussian) could very well depend on the state. The risk of obtaining degenerate distributions should be guarded by the entropy loss added to the target.
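To make the two options concrete, here is a minimal pure-Python sketch of both parameterizations. It is not the stable-baselines code: the helper names (`state_independent_head`, `state_dependent_head`) and all weights are hypothetical and chosen only for illustration.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer without activation: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def state_independent_head(latent, w_mean, b_mean, logstd):
    """Mean depends on the latent vector; logstd is a free parameter vector."""
    mean = linear(latent, w_mean, b_mean)
    return mean, list(logstd)


def state_dependent_head(latent, w_mean, b_mean, w_logstd, b_logstd):
    """Both mean and logstd are linear functions of the latent vector."""
    mean = linear(latent, w_mean, b_mean)
    logstd = linear(latent, w_logstd, b_logstd)
    return mean, logstd


def sample_action(mean, logstd, rng):
    """Draw one action from the diagonal Gaussian N(mean, exp(logstd)^2)."""
    return [rng.gauss(m, math.exp(s)) for m, s in zip(mean, logstd)]


# Hypothetical weights for a 2-dimensional latent and a 1-dimensional action.
W_MEAN, B_MEAN = [[0.5, -0.2]], [0.0]
LOGSTD = [0.0]
W_LOGSTD, B_LOGSTD = [[0.1, 0.3]], [-1.0]

latent_a = [1.0, 2.0]
latent_b = [-1.0, 0.5]

indep_a = state_independent_head(latent_a, W_MEAN, B_MEAN, LOGSTD)
indep_b = state_independent_head(latent_b, W_MEAN, B_MEAN, LOGSTD)
dep_a = state_dependent_head(latent_a, W_MEAN, B_MEAN, W_LOGSTD, B_LOGSTD)
dep_b = state_dependent_head(latent_b, W_MEAN, B_MEAN, W_LOGSTD, B_LOGSTD)

rng = random.Random(0)
action = sample_action(*dep_a, rng)
```

With the first head, `indep_a[1]` and `indep_b[1]` are identical for every state, while the second head yields a different spread for each latent vector.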
Hello,
Good question.
Among the algorithms that support continuous actions and rely on a probability distribution, most use exploration that is independent of the state (e.g. PPO, A2C).
Only SAC has state-dependent exploration.
I don't think there is a particular reason other than "it works best that way".
Why is the design chosen like this?
The main reason is that we followed the original implementations, and the performance matched the published results.
In my understanding, the confidence to choose an action (std of the Gaussian) could very well depend on the state. The risk of obtaining degenerate distributions should be guarded by the entropy loss added to the target.
You could also see it this way: having a state-independent std is like having a global exploration schedule (explore a lot at the beginning and become more and more deterministic over training). It is true that without the entropy bonus, the std can decrease quite fast, leading to early convergence. In practice, the entropy coefficient is usually set to zero (cf. rl zoo), and at the end we use the deterministic policy for testing (for continuous-action environments).
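As a small illustration of the "global schedule" view: the entropy of a diagonal Gaussian depends only on the log std, not on the mean, so with a state-independent log std the entropy term is the same in every state. The sketch below uses the standard closed-form entropy; the function name is only illustrative.

```python
import math


def diag_gaussian_entropy(logstd):
    """Entropy (in nats) of a diagonal Gaussian with the given per-dimension log std.

    H = sum_i (logstd_i + 0.5 * log(2 * pi * e)); the mean does not appear.
    """
    const = 0.5 * math.log(2.0 * math.pi * math.e)
    return sum(s + const for s in logstd)


# Lowering the log std by 1 lowers the entropy by exactly 1 nat per dimension.
for value in (0.0, -1.0, -2.0):
    print(value, diag_gaussian_entropy([value]))
```

So as training shrinks the single log std vector, exploration is reduced everywhere at once, which is what makes it behave like a schedule rather than a per-state confidence.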
As mentioned in the Spinning Up guide by @jachiam:
The way standard deviations are parameterized. In VPG, TRPO, and PPO, we represent the log std devs with state-independent parameter vectors. In SAC, we represent the log std devs as outputs from the neural network, meaning that they depend on state in a complex way. SAC with state-independent log std devs, in our experience, did not work. (Can you think of why? Or better yet: run an experiment to verify?)
So I think it is mostly for empirical reasons. If you try it, I would be interested in the results ;) (you may need to tune the hyperparameters a bit, too)
Thanks for your answer.
I quite like the intuition you give (exploration schedule). This makes a lot of sense if you assume the optimal policy to be deterministic. The random part of the policy then merely expresses the uncertainty about the exact point in continuous space (or the need to explore). But the optimal policy might just as well be a non-degenerate Gaussian. This all, of course, depends on the nature of the underlying MDP.
As you stated, it is probably best to turn this into an empirical argument. I will try both approaches and see what happens. If I get any insights on this, I might post results here. But for now I am going to close the issue.
Also, thanks for the references to the Spinning Up guide and rl-baselines-zoo. As I am relatively new to the topic, those seem to be valuable resources for getting deeper into it and for making a plan on how to approach things.
Update:
I tried different "policy headers" in my application.
In my use case, the action can have any real value within (0,1).
So the most natural action distribution would have support only within this range. Otherwise I'd have to clip actions (like the stable baselines implementation does).
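A rough sketch of the difference, using only the Python standard library: Gaussian samples have to be clipped into the range and pile up on the boundaries, while Beta samples already lie in the unit interval. The parameter values are arbitrary and only for illustration; this is not the code from my application.

```python
import random


def clipped_gaussian_samples(mean, std, n, rng):
    """Sample from N(mean, std^2) and clip each sample into [0, 1]."""
    return [min(1.0, max(0.0, rng.gauss(mean, std))) for _ in range(n)]


def beta_samples(alpha, beta, n, rng):
    """Sample from Beta(alpha, beta), whose support is the unit interval."""
    return [rng.betavariate(alpha, beta) for _ in range(n)]


rng = random.Random(0)
gauss_actions = clipped_gaussian_samples(0.5, 1.0, 1000, rng)
beta_actions = beta_samples(2.0, 2.0, 1000, rng)

# Fraction of Gaussian samples that were pushed onto a boundary by clipping.
boundary_fraction = sum(a in (0.0, 1.0) for a in gauss_actions) / len(gauss_actions)
print(boundary_fraction)
```

With a wide Gaussian, a large share of the actions actually executed are exactly 0 or 1, even though the policy "intended" something else.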
I tried the following distributions:
The best results were achieved with the Beta header (presumably the support restriction helps to find reasonable policies faster). But I got similar performance with the other distributions as well.
Only the Normal distribution has parameters that independently control position and spread. But as the other distributions gave better results, I did not investigate further the impact of separating parameters from the observation layers.
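To illustrate the coupling for the Beta case: in the standard (alpha, beta) parameterization, both the mean and the variance depend on both parameters, so there is no single "spread" parameter that could be detached from the state the way `logstd` is for the Normal. A small sketch using the textbook formulas (the function name is just for illustration):

```python
def beta_mean_var(alpha, beta):
    """Mean and variance of Beta(alpha, beta)."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


# Changing alpha alone moves the mean AND changes the variance.
print(beta_mean_var(2.0, 2.0))
print(beta_mean_var(4.0, 2.0))
```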
I think the best choice of distribution and parameterization very much depends on the problem to be solved.