Hello,
I want to restrict the actions of an MLP policy network to a limited range by adding a tanh activation to the output layer. As far as I can tell, the output layer is implemented at lines 236-237 of common/distribution.py:
Instead of outputting the linear layer directly, I want to use a bounded output function such as tanh. Is there a way to change this? Also, would it break the training process?
Hello,
If I understand your question correctly, you might be interested in issue #112 and this small thread.
> I want to take actions in a limited range from MLP policy network by adding tanh activations
Be careful: for continuous actions, the output of the network (for most algorithms) is the set of parameters of a Gaussian distribution with a diagonal covariance matrix, i.e. it outputs the mean and the standard deviation (in fact the log std) of the distribution.
Currently, the action bounds are handled automatically by clipping the action sampled from the Gaussian distribution (this applies to A2C, PPO2, TRPO, ...); this is described in the documentation.
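To make the clipping approach concrete, here is a minimal NumPy sketch. It is not the stable-baselines code: the function name `sample_clipped_action` and the example values are hypothetical, and it assumes that only the action sent to the environment is clipped while the raw sample is kept for the log-probability.

```python
import numpy as np

def sample_clipped_action(mean, log_std, low, high, rng):
    """Sample from a diagonal Gaussian, then clip to the action bounds."""
    std = np.exp(log_std)
    raw_action = mean + std * rng.standard_normal(mean.shape)
    # Assumption: clipping happens outside the distribution, so the
    # Gaussian itself (and its log-probability) is left untouched.
    clipped_action = np.clip(raw_action, low, high)
    return raw_action, clipped_action

rng = np.random.default_rng(0)
mean = np.array([0.8, -0.2])
log_std = np.zeros(2)  # std = 1 in every dimension
raw, clipped = sample_clipped_action(mean, log_std, -1.0, 1.0, rng)
```

With a large standard deviation, many samples fall outside the bounds and are mapped to exactly `low` or `high`, which is one way saturated commands can appear.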
DDPG (and TD3) output the action directly (in fact a tanh that is scaled afterward), so they already take the bounds into account.
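A rough illustration of "a tanh that is scaled afterward" for a deterministic policy. The helper `scale_action` and the bounds are made up for this example, and the library may implement the rescaling differently.

```python
import numpy as np

def scale_action(pre_activation, low, high):
    """Map an unbounded network output to the box [low, high] via tanh."""
    squashed = np.tanh(pre_activation)  # each component lies in [-1, 1]
    return low + 0.5 * (squashed + 1.0) * (high - low)

low = np.array([-2.0, 0.0])
high = np.array([2.0, 1.0])
# tanh(0) = 0 maps to the middle of the range, a large input maps to the upper bound
action = scale_action(np.array([0.0, 100.0]), low, high)
```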
Finally, SAC squashes the Gaussian with a tanh and accounts for the resulting change in the probability distribution (see here).
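The change in probability distribution can be sketched with the change-of-variables formula. This is a NumPy illustration under stated assumptions, not the SAC implementation: the function names and the small `eps` added for numerical safety are choices made for this example.

```python
import numpy as np

def gaussian_log_prob(u, mean, log_std):
    """Log-density of a diagonal Gaussian, summed over action dimensions."""
    std = np.exp(log_std)
    per_dim = -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)
    return np.sum(per_dim, axis=-1)

def squashed_sample_and_log_prob(mean, log_std, rng, eps=1e-6):
    """Sample a = tanh(u) with u ~ N(mean, std^2) and return log pi(a)."""
    u = mean + np.exp(log_std) * rng.standard_normal(mean.shape)
    a = np.tanh(u)
    # Change of variables: log pi(a) = log N(u) - sum_i log(1 - tanh(u_i)^2).
    # eps (an assumption of this sketch) avoids log(0) when |a| is close to 1.
    correction = np.sum(np.log(1.0 - a ** 2 + eps), axis=-1)
    return a, gaussian_log_prob(u, mean, log_std) - correction

rng = np.random.default_rng(0)
a, log_prob = squashed_sample_and_log_prob(np.zeros(2), np.zeros(2), rng)
```

The sampled action always stays inside (-1, 1), so no clipping is needed, but the log-probability is no longer that of a plain Gaussian.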
@halil93ibrahim
This has been closed already, but I recently ran into similar concerns with PPO2 in stable-baselines. I have a continuous action space from [-1, -1] to [1, 1] for controlling and steering an RC car.
These are my test approaches and results:
Case 1: PPO2 original version of stable-baselines
It can learn to control the car, but the entropy was very high (roughly 5 or more) with the default hyperparameters. It seemed that the agent learned to control the car with actions outside the desired range, so the control signal was saturated at very high or very low values like -1/-1/1/1/0.9/-0.9. So I lowered ent_coef a bit (from the default 0.01 to 0.003), and then it worked well.
Case 2: PPO2 with SAC style action squashing
SAC applies a tanh after sampling from the normal distribution produced by the policy network. It is applied to both mu and (mu + sampled eps * sigma), where mu and sigma are outputs of the policy network. I tried this with PPO2, which required a custom distribution class that computes sampling and neglogp the same way SAC does. Modifying the entropy function is harder, though: I could not solve for the additional entropy term introduced by the tanh transformation, E[log(1 - tanh(x)^2)]. So for the entropy I used the original normal-distribution entropy, since it only acts as a regularization term. Unfortunately, this seems to make the gradient computation unstable, and the process crashes after some timesteps. I may have made mistakes, but I did not investigate further.
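For reference, the extra entropy term follows from the change of variables. The notation below is chosen for this note (u is the Gaussian sample, written x in the post, and a = tanh(u) elementwise). The expectation has no closed form, but it can be estimated from samples. Since log(1 - tanh^2(u)) is never positive, the plain Gaussian entropy overestimates the entropy of the squashed distribution. The last line is an algebraically equivalent form that avoids evaluating log near zero; whether that was the cause of the crash described above is not something this derivation can confirm.

```latex
% u ~ N(mu, diag(sigma^2)), a = tanh(u) applied elementwise
\begin{aligned}
\log \pi(a) &= \log \mathcal{N}(u;\, \mu, \sigma^2) - \sum_i \log\bigl(1 - \tanh^2(u_i)\bigr), \\
\mathcal{H}[a] &= -\mathbb{E}\bigl[\log \pi(a)\bigr]
               = \mathcal{H}[u] + \sum_i \mathbb{E}_{u}\Bigl[\log\bigl(1 - \tanh^2(u_i)\bigr)\Bigr], \\
\log\bigl(1 - \tanh^2(u_i)\bigr) &= 2\bigl(\log 2 - u_i - \operatorname{softplus}(-2u_i)\bigr).
\end{aligned}
```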
Case 3: PPO2 with just wrapping mu of normal distribution by tanh
This is very similar to what you proposed. Here is my modification.
def proba_distribution_from_latent(self, pi_latent_vector, vf_latent_vector, init_scale=1.0, init_bias=0.0):
    mean = linear(pi_latent_vector, 'pi', self.size, init_scale=init_scale, init_bias=init_bias)
    with tf.variable_scope('pi'):
        mean = tf.tanh(mean)  # squashing mean only
    logstd = tf.get_variable(name='pi/logstd', shape=[1, self.size], initializer=tf.zeros_initializer())
    pdparam = tf.concat([mean, mean * 0.0 + logstd], axis=1)
    q_values = linear(vf_latent_vector, 'q', self.size, init_scale=init_scale, init_bias=init_bias)
    return self.proba_distribution_from_flat(pdparam), mean, q_values
This also works well (without tuning ent_coef). The entropy loss declines to -3 (sigma is about 0.01). But I'm not sure this is theoretically correct.
Hello,
Case 1: you should have looked at the tuned hyperparameters in the zoo ;) (cf. PR #536)
For continuous actions, the entropy coefficient is usually not used (set to zero).
Case 2: that's what I mentioned: by squashing the output, you change the probability distribution.
Case 3: this is legal because your probability distribution is still a Gaussian whose mean is restricted to a given range. You still need to clip the action to avoid out-of-bound commands.
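The point about Case 3 can be illustrated with a small NumPy sketch (hypothetical names and values, not the stable-baselines code): only the mean is squashed, so the density stays a plain Gaussian with no Jacobian correction, but a sample can still land outside the bounds and has to be clipped.

```python
import numpy as np

def bounded_mean_policy(latent_mean, log_std, rng, low=-1.0, high=1.0):
    """Gaussian policy whose mean is restricted to [-1, 1] by tanh."""
    mean = np.tanh(latent_mean)
    std = np.exp(log_std)
    u = mean + std * rng.standard_normal(mean.shape)
    # The sample itself is not transformed, so the ordinary Gaussian
    # log-density applies unchanged.
    per_dim = -0.5 * ((u - mean) / std) ** 2 - log_std - 0.5 * np.log(2.0 * np.pi)
    log_prob = np.sum(per_dim, axis=-1)
    # The noise can push the sample past the bounds, hence the clip.
    return mean, u, np.clip(u, low, high), log_prob

rng = np.random.default_rng(0)
mean, raw, clipped, log_prob = bounded_mean_policy(
    np.array([5.0, -5.0]), np.zeros(2), rng
)
```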
@araffin
For case 1, good to know that! Thanks.