Stable-baselines: [question] What does .action_probability mean for continuous spaces?

Created on 12 Dec 2018 · 3 Comments · Source: hill-a/stable-baselines

I understand what the return value of action_probability means for discrete action spaces: it is simply an array holding the probability of each discrete action.

But I'm puzzled by the return value of action_probability in continuous action spaces. For example, here is a stable_baselines RLModel that I briefly trained on the CartPole environment:

(Pdb) expert_policy.action_probability(obs)
array([-0.51748574], dtype=float32)

(Pdb) expert_policy.predict(obs)
(array([0.6057056], dtype=float32), None)
(Pdb) expert_policy.predict(obs)
(array([-1.9825947], dtype=float32), None)
(Pdb) expert_policy.predict(obs)
(array([0.31649584], dtype=float32), None)

Clearly expert_policy is choosing actions stochastically here. But it seems unlikely that .action_probability(obs) is returning the parameters of a Normal distribution. For that, wouldn't it need to return two floats, the mean and the variance?

How should I interpret the return?

question

shwang

All 3 comments

Hello,

That is a good question. action_probability does not make sense for continuous actions, but we left it in because that simplifies the code (it avoids an if branch in several places). It is not meant to be used.

The output is defined here: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/policies.py#L185
So, in the case of continuous actions, it should be the same as predict.
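To illustrate why a single "probability" is ill-defined for a continuous action, here is a minimal sketch in plain Python. It is not taken from stable-baselines: it assumes a one-dimensional Gaussian policy, the mean and std values are made up, and gaussian_pdf is a helper defined here only for illustration. The point is that a density evaluated at an action can exceed 1, so it cannot be read as a probability.

```python
import math


def gaussian_pdf(action, mean, std):
    """Density of a 1-D normal distribution evaluated at `action`."""
    z = (action - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical policy parameters, chosen only for illustration.
mean, std = 0.0, 0.1

# With a narrow distribution the density at the mean is greater than 1,
# which a probability could never be.
density_at_mean = gaussian_pdf(mean, mean, std)
print(density_at_mean > 1.0)
```

What does describe such a policy fully is the pair (mean, std), which is what the question asks about.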

@hill-a, @erniejunior should we raise a NotImplementedError instead?

araffin on 13 Dec 2018

Hey,

Well, first of all, I'm assuming you are talking about actor-critic models; as @araffin said, this would not make sense for models with deterministic continuous actions (e.g. DDPG).

But we can easily make this work for actor-critic models, as the continuous action probability distribution object in the policy has access to the policy mean and std.

Something like this here should do what you want, and work quite well:

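# Default: expose the raw policy output.
self.policy_proba = self.policy
if self.is_discrete:
    # Discrete actions: turn logits into per-action probabilities.
    self.policy_proba = tf.nn.softmax(self.policy_proba)
elif self.is_box:
    # Continuous (Box) actions: expose the distribution's mean and std.
    self.policy_proba = [self.proba_distribution.mean, self.proba_distribution.std]
self._value = self.value_fn[:, 0]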

You could then even set self.policy_proba = None for distributions that are not implemented yet, and raise a NotImplementedError in that case.
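As a rough, framework-free sketch of that branching (not the library's code): the function names, the space_kind strings, and the keyword parameters below are all hypothetical, and plain Python lists stand in for tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_proba(space_kind, logits=None, mean=None, std=None):
    """Hypothetical helper mirroring the branching in the snippet above."""
    if space_kind == "discrete":
        return softmax(logits)  # one probability per action
    if space_kind == "box":
        return [mean, std]  # parameters of the Gaussian
    return None  # distribution not handled yet


def action_probability(space_kind, **params):
    """Raise for unsupported distributions, as suggested above."""
    proba = policy_proba(space_kind, **params)
    if proba is None:
        raise NotImplementedError(
            f"no action probability for {space_kind!r}"
        )
    return proba
```

For a Box space this returns the two values the original question expected, for example action_probability("box", mean=0.3, std=1.2) gives [0.3, 1.2].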

hill-a on 13 Dec 2018

Ok, started a quick branch for this issue and possibly #127.

Decided not to use NotImplementedError, but instead only a warning for distributions not implemented yet, as I still want the code to be backwards compatible.
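A warning-based variant could look roughly like the following sketch. This is an assumption about the general shape, not the code from the branch; action_probability_or_warn and its parameters are hypothetical names. It relies only on the standard warnings module, so existing callers keep running and simply receive None.

```python
import warnings


def action_probability_or_warn(proba, space_name):
    """Return `proba`, warning instead of raising when it is unavailable."""
    if proba is None:
        warnings.warn(
            f"action probability is not implemented for {space_name}; "
            "returning None",
            UserWarning,
        )
    return proba
```

The trade-off is that callers must handle a None result themselves, whereas an exception would have stopped them immediately.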

hill-a on 13 Dec 2018
