it would be very useful to be able to adjust the gradient based on a binary vector of which outputs you want to be considered when computing the gradient.
This would be insanely helpful when dealing with environments where the number of actions is dependent on the observation. A simple example of this would be in StarCraft. At the beginning of a game, not every action is valid.
Hello,
How would you implement that?
which outputs you want to be considered when computing the gradient.
I think I only understand partially your idea. The high-level idea is clear (prevent from taking invalid action) but the technical details are not for me.
EDIT: do you have any reference implementation for that?
I am not entirely sure myself. You could call tf.stop_gradients on the weights of the last layer that go to each output you don't want considered, but I don't see a way to do it for the deeper layers as you go backwards.
A simpler approach would be adding an optional argument on .step() for a binary vector of which actions to consider.
Edit: The original request is for Box, and the second one is for Discrete, to clarify.
I hacked this for a little bit and figured out how to do it. I made a new tf tensor that is the masking vector. I then multiply where appropriate in common/distriubtions.py.
The scope of this tensor is rather buried away from a user perspective (in common/policies). I cheated and used a global to access the mask vector created by the environment in the policy; I am not by any means familiar with this code base, but if this feature is desired, I could work with someone to make a PR.
I just found a similar idea, but that only does masking, no use of tf.stop_gradient in this repo:
https://github.com/calclavia/NOS
I hacked this for a little bit and figured out how to do it.
It runs, but does it actually work?
It works fine. I've been doing this for two-ish years with my own implementation of A3C.
Do you have a minimal example on how you made it work?
I'm afraid this would require a lot of changes if we include that in the codebase. However, if it is no so hard to integrate in a custom version, I would like to have it in the documentation ;) (dealing with invalid actions seems to be often a problem with games).
I made a tf placeholder and in feed, I accessed the valid actions from a global variable, which obviously isn't a real solution.
You're absolutely right about invalid actions being a "feature" of games. RL is getting applied to more and more things everyday it seems, and this feature would be very valuable for many.
In PySC2, they embed this information within the observation. Maybe allow environments to allow returning a dict observation (lots of change on your end), or allow environments to have a function that tells you the valid actions, and if it is implemented, use the mask vector and apply it?
Both require some recoding of various areas, but the second one does not introduce a possible breaking update that would cause old environments to no longer work.
I accessed the valid actions from a global variable
aybe allow environments to allow returning a dict observation (lots of change on your end),
It already exists, this is called the info dict ;) (returned when calling env.step(), one caveat: it is not present at reset)
That's awesome to hear. Would you consider adding the feature if I made a PR?
Would you consider adding the feature if I made a PR?
as I told you, it really depends on the amount of complexity it adds. Do you have a branch where you made the changes?
But I would like to have it documented anyway, this could be really helpful to others ;)
Really the only changes that are needed are creating a 1D tensor of all ones, then at each step, check to see if info dict contains something like "valid_actions", assign it to the tensor, then in common/distributions.py, apply the masking tensor to the various functions for each distribution.
I do not know what is considered "standard" when computing negative log likelihood with masked actions. Do you only take the softmax of the valid actions? I don't think any literature exists on this topic, unfortunately.
apply the masking tensor to the various functions for each distribution.
Only for the discrete distribution (and derivatives), no?
Do you only take the softmax of the valid actions?
You can apply the mask before or after the softmax. In both cases, the loss will only affect the taken action, no? (so invalid actions won't be affected)
Only for the discrete distribution (and [derivatives]), no?
Yeah
You can apply the mask before or after the softmax. In both cases, the loss will only affect the taken action, no? (so invalid actions won't be affected)
Wouldn't applying the masking before vs after the softmax influence the gradient computed when trying to minimize the negative log likelihood?
Wouldn't applying the masking before vs after the softmax influence the gradient computed when trying to minimize the negative log likelihood?
You are right. I think we need to write that down or test it to see if there is a huge difference in doing one way or the other.
I have a gut feeling that if you apply the mask before the softmax, gradients may explode (and possibly become inf). Testing is required for sure.
Replying to your email:
so I figured I'd email you as to not clutter up issue#351 on hill-a/stable-baselines.
That's not a problem to have a long discussion in the issue. I prefer this way so everybody (especially the other maintainers) know about it.
I am confused on how to set up the dict that created in _train_step (I
am working with PPO2 to start, function begins at line 247).
Yes, PPO2 seems like a good starting point.
My question is: I have a placeholder for the action mask, sized
(n_batch, ac_space.n), somewhat following suite with the action_ph, but
I cannot get a forward to pass to work, because the the action_mask at
runtime is just shape ac_space.n, but when I remove the batch size dim,
the dimensions don't match when actually training.
I am rather lost at untangling this problem, any insight would be helpful.
It seems like you would need some broadcasting.
In numpy this would look like that:
import numpy as np
# n_batch = 10
actions = np.ones((10, 2))
# Only the second action is valid
# mask has shape (,2)
mask = np.array([0, 1])
# Element wise multiplication
print(actions * mask)
And in tensorflow:
import tensorflow as tf
import numpy as np
actions = tf.placeholder(shape=(None, 2), dtype=tf.float32)
masks = tf.placeholder(shape=(2,), dtype=tf.float32)
masked_actions = actions * masks
sess = tf.Session()
sess.run(tf.global_variables_initializer())
sess.run([masked_actions], feed_dict={actions: np.ones((10, 2)), masks: np.array([1.0, 0.0])})
To start, the action_mask_ph is currently in the BasePolicy class, is
that appropriate?
It seems ok.
I have this working using the info dict to store the Boolean vector. I use tf.boolean_mask to apply it. I'd like to test this on more environments beyond my custom gym environment to figure out where the best location to apply the mask with respect to the softmax activation (before or after).
I'll submit a PR once I get all the tests passing and write a some tests for this, per the contributing guide.
We've encountered a similar issue while setting up our custom environment. It would be really helpful if you could outline your final plan in more detail, or post some code while you're getting it ready for a PR. We've got limited time - the internship ends soon! Thanks!
@indisputablyk8: I have been busy with two paper deadlines that are in the week, so I haven't been able to get everything ready for a PR in the timeline that I wanted. It is ready for PPO, which i just uploaded here for your sake: https://github.com/H-Park/stable-baselines/commit/6107ef4a260f824506622669fa530dff1865552b
@araffin: I am realizing how much I changed in common/policies.py and common/distributions.py. You said that if this feature didn't add too much complexity it'd get merged. Is this too much overhead?
Note: My fork will only work for PPO, extending it to another algorithm is pretty straight-forward if you follow what I did in ppo2/ppo.py. It is not possible to run other algorithms using a discrete action space as is, it will error out.
Usage guideline:
In your environment's step method, return the following:
return obs, reward, done, {'valid_actions': valid_actions}
where valid_actions is a binary vector.
Thank you so much! We've been pouring over this issue thread trying to reconstruct what you suggested.
You said that if this feature didn't add too much complexity it'd get merged. Is this too much overhead?
I don't have much time now to look closely (will be AFK for some weeks) but at least finish and polish what you did with PPO2 because anyway we will put a link to your fork (in the best case, we'll merge your PR ;)) in the documentation if it is too much changes.
I did a very quick pass, so here is one remark:
I have mostly everything working (and removed the try/catch blocks), except one set of unit tests. In the distribution tests in test/test_distri.py, due to my addition of a tensor ("action_mask_ph"), causes the test to fail because this test does not instantiate an actor/critic model.
I am not entirely sure how to solve this while still keeping the unit test simple. Thoughts?
This would be extremely usefull!
I redid my changes and moved them to the actual policy, which removes the tensor look up by name in common/distributions.py, of which I was never a fan. The implementation requires that the action space be from [0, n], as the mask is just an element-wise multiplication. There is no way to use tf.boolean_mask on an element by element basis. While tf.ragged.boolean_mask exists, it returns a ragged tensor, which is not a valid datatype for the operations in common/distributions. The code is a lot cleaner now, but there is a lot of repeated code across algorithms, but that seems to be common for general data handling.
@H-Park I'm back ;)
Could you open a draft PR so it will be easier to take a look at what you did?
but there is a lot of repeated code across algorithms, but that seems to be common for general data handling.
Unfortunately, we cannot avoid that for some part of the code (this is a trade-off between ease of use for the user and modularity)
Although we have not yet discussed the correct solution, we offer another solution for the transition period.
Most helpful comment
I have this working using the info dict to store the Boolean vector. I use tf.boolean_mask to apply it. I'd like to test this on more environments beyond my custom gym environment to figure out where the best location to apply the mask with respect to the softmax activation (before or after).
I'll submit a PR once I get all the tests passing and write a some tests for this, per the contributing guide.