Stable-baselines: [question] MLPpolicy network output layer tanh activation for limiting actions?

Created on 10 Jul 2019 · 4 Comments · Source: hill-a/stable-baselines

Hello,

I want to take actions in a limited range from the MLP policy network by adding tanh activations to the output layer. As far as I can tell, the output layer is implemented at lines 236-237 of common/distributions.py:

https://github.com/hill-a/stable-baselines/blob/0e940c791e5f625c07fb4ffa9e042aeb0288caf3/stable_baselines/common/distributions.py#L236-L241

Instead of outputting the linear layer directly, I want to use a bounded output function such as tanh. Is there a way to change this? Also, would it break the training process?

question

All 4 comments

Hello,

If I understand your question correctly, you might be interested in issue #112 and this small thread.

> I want to take actions in a limited range from MLP policy network by adding tanh activations

Be careful: for continuous actions, the output of the network (for most algorithms) is the set of parameters of a Gaussian distribution with a diagonal covariance matrix, i.e. it outputs the mean and standard deviation (in fact the log std) of the distribution.

Currently, the action bound is handled automatically by clipping the action sampled from the Gaussian distribution (for A2C, PPO2, TRPO, ...); this is in the documentation.
DDPG (and TD3) output the action directly (in fact a tanh that is scaled afterward), so they already take the bound into account.
Finally, SAC squashes the Gaussian with a tanh and accounts for the change in probability distribution (see here).
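
The first two bounding strategies described above can be sketched in plain Python. This is only an illustration under stated assumptions, not stable-baselines code: the helper names `sample_gaussian_action`, `clip_action` and `scale_tanh_output` are hypothetical, and the bounds are taken to be [-1, 1] per dimension.

```python
import math
import random

def sample_gaussian_action(mean, log_std, rng):
    """Sample from a diagonal Gaussian parameterised by mean and log std."""
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]

def clip_action(action, low, high):
    """Clip each component to [low, high] (the clipping described for A2C, PPO2, TRPO)."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]

def scale_tanh_output(raw, low, high):
    """Squash an unbounded output with tanh, then rescale from [-1, 1] to [low, high]."""
    return [lo + 0.5 * (math.tanh(r) + 1.0) * (hi - lo) for r, lo, hi in zip(raw, low, high)]

rng = random.Random(0)
low, high = [-1.0, -1.0], [1.0, 1.0]

# Stochastic policy: sample first, clip afterwards.
sampled = sample_gaussian_action(mean=[0.9, -0.2], log_std=[0.0, 0.0], rng=rng)
clipped = clip_action(sampled, low, high)

# Deterministic policy: the tanh keeps the output inside the bounds by construction.
scaled = scale_tanh_output([3.0, -0.5], low, high)
```

Note that clipping leaves the Gaussian itself untouched (the log-probability is still that of the unclipped sample), whereas the tanh-scaled output needs no clipping at all.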

@halil93ibrahim
This has been closed already, but I recently ran into similar concerns about PPO2 in stable-baselines. I have a continuous action space from [-1, -1] to [1, 1] for controlling/steering an RC car.
These are my test approaches and results:

Case 1: PPO2 original version of stable-baselines
It can learn to control the car, but the entropy was very high (mostly above 5) with the default hyperparameters. It seemed that the agent learned to control the car outside of the desired action range, so the control signal was very high or very low, e.g. -1/-1/1/1/0.9/-0.9. So I lowered ent_coef a bit (from the default 0.01 to 0.003), and then it works well.

Case 2: PPO2 with SAC style action squashing
SAC applies a tanh after sampling from the normal distribution of the policy network. This is done to both mu and (mu + sampled eps * sigma), where mu and sigma are outputs of the policy network. I tried this with PPO2, which required a custom distribution class that computes sampling and neglogp the same way SAC does. Modifying the entropy function is harder, though, and I could not solve for the additional entropy term caused by the tanh transformation (= E[log(1 - tanh(x)^2)]). So for the entropy I used the original normal distribution entropy, since it only acts as a regularization term. Unfortunately, this seems to make the gradient calculation unstable, and the process crashes after some timesteps. I may have made mistakes, but I didn't go further.
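
For reference, the change-of-variables correction mentioned in Case 2 can be sketched in plain Python. This is an illustration under assumptions, not the commenter's custom distribution class and not stable-baselines code: all function names are hypothetical, the numerically stable form of log(1 - tanh(x)^2) is one common choice, and the Monte Carlo entropy estimate is just one possible way to approximate the term that has no closed form.

```python
import math
import random

def gaussian_log_prob(u, mean, log_std):
    """Log density of a diagonal Gaussian at the pre-squash sample u."""
    total = 0.0
    for ui, m, ls in zip(u, mean, log_std):
        z = (ui - m) / math.exp(ls)
        total += -0.5 * z * z - ls - 0.5 * math.log(2.0 * math.pi)
    return total

def log_one_minus_tanh_sq(x):
    """Stable log(1 - tanh(x)^2), using 2 * (log 2 - |x| - log(1 + exp(-2|x|)))."""
    ax = abs(x)  # the function is even in x
    return 2.0 * (math.log(2.0) - ax - math.log1p(math.exp(-2.0 * ax)))

def squashed_log_prob(u, mean, log_std):
    """Log density of a = tanh(u): Gaussian log-prob minus the log-Jacobian."""
    return gaussian_log_prob(u, mean, log_std) - sum(log_one_minus_tanh_sq(ui) for ui in u)

def squashed_entropy_estimate(mean, log_std, rng, n_samples=10000):
    """Monte Carlo estimate: H(a) = H(u) + E[sum_i log(1 - tanh(u_i)^2)]."""
    gaussian_entropy = sum(ls + 0.5 * math.log(2.0 * math.pi * math.e) for ls in log_std)
    correction = 0.0
    for _ in range(n_samples):
        u = [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]
        correction += sum(log_one_minus_tanh_sq(ui) for ui in u)
    return gaussian_entropy + correction / n_samples

estimate = squashed_entropy_estimate([0.0, 0.0], [0.0, 0.0], random.Random(0), n_samples=2000)
```

Because log(1 - tanh(x)^2) is never positive, the squashed entropy is never larger than the Gaussian entropy, so using the plain Gaussian entropy as the regularizer (as described above) overstates the entropy of the actions actually taken.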

Case 3: PPO2 with just wrapping mu of normal distribution by tanh
This is very similar to what you proposed. Here is my modification.

# Modified method of the diagonal Gaussian distribution type in common/distributions.py
def proba_distribution_from_latent(self, pi_latent_vector, vf_latent_vector, init_scale=1.0, init_bias=0.0):
    mean = linear(pi_latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
    with tf.variable_scope('pi'):
        mean = tf.tanh(mean)  # squash the mean only, so it stays in [-1, 1]
    # State-independent log std, broadcast to the batch shape of the mean
    logstd = tf.get_variable(name='pi/logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
    pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
    q_values = linear(vf_latent_vector, 'q', self.size, init_scale=init_scale, init_bias=init_bias)
    return self.proba_distribution_from_flat(pdparam), mean, q_values

This also works well (without tuning ent_coef). The entropy loss declines to -3 (sigma is about 0.01). But I'm not sure whether this is theoretically correct.

Hello,
Case 1: you should have looked at the tuned hyperparameters in the zoo ;) (cf. PR #536)
For continuous actions, the entropy coefficient is usually not used (set to zero).

Case 2: that's what I mentioned; by squashing the output, you change the probability distribution.

Case 3: this is legitimate because your probability distribution is still a Gaussian whose mean you restrict to a given range. You still need to clip the action to avoid out-of-bounds commands.
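
The point about Case 3 can be checked numerically with a small sketch. It assumes bounds of [-1, 1] and an arbitrary standard deviation of 0.5 chosen only for illustration; the helper names are hypothetical and this is not stable-baselines code.

```python
import math
import random

def sample_with_squashed_mean(raw_mean, log_std, rng):
    """Squash only the mean with tanh, then sample from the (unsquashed) Gaussian."""
    mean = [math.tanh(r) for r in raw_mean]
    return [m + math.exp(ls) * rng.gauss(0.0, 1.0) for m, ls in zip(mean, log_std)]

def clip(action, low=-1.0, high=1.0):
    """Clip every action component to [low, high]."""
    return [min(max(a, low), high) for a in action]

rng = random.Random(0)
raw_mean = [2.0, -2.0]                      # tanh keeps the mean inside (-1, 1)
log_std = [math.log(0.5), math.log(0.5)]    # illustrative standard deviation of 0.5

samples = [sample_with_squashed_mean(raw_mean, log_std, rng) for _ in range(1000)]
# Count samples with at least one component outside the bounds.
out_of_bounds = sum(any(abs(a) > 1.0 for a in s) for s in samples)
clipped = [clip(s) for s in samples]
```

Even though the mean is bounded, the Gaussian noise still pushes some samples past the limits, which is why the final clip remains necessary.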

@araffin
For case 1, good to know that! Thanks.
