Describe the bug
"Sometimes", AgentAction ignores masked actions while using Discrete Vector Space and on demand Decision Requests which causes invalid agent actions and corrupts learning process. I noticed this in my own project and tried to replicate it in original Unity examples; I was able to modify "Basic" example to replicate this problem there too.
To Reproduce
Steps to reproduce the behavior:
1. In the Basic example, set the discrete action branch size to 210 and adjust the vector observation space size to match the observations below.
2. In `BasicAgent.CollectObservations()`, after `AddVectorObs(m_Position, 20);`, add:

```csharp
var maskList = new List<int>();
for (int i = 3; i < 210; ++i)
{
    maskList.Add(i);
}
SetActionMask(0, maskList);

var dummyObsList = new List<float>();
for (int i = 0; i < 52; ++i)
{
    dummyObsList.Add(i);
}
AddVectorObs(dummyObsList);
```

3. In `AgentAction()`, right after `var movement = (int)vectorAction[0];`, add:

```csharp
Debug.Assert(movement < 3, "Action was called even though it was masked: " + movement);
```

4. Train with the following trainer configuration:

```yaml
trainer: ppo
batch_size: 10
beta: 5.0e-3
buffer_size: 100
epsilon: 0.2
hidden_units: 128
lambd: 0.95
learning_rate: 3.0e-4
learning_rate_schedule: linear
max_steps: 15.0e4
memory_size: 256
normalize: false
num_epoch: 3
num_layers: 2
time_horizon: 64
sequence_length: 64
summary_freq: 1000
use_recurrent: false
vis_encode_type: simple
reward_signals:
  extrinsic:
    strength: 1.0
    gamma: 0.99
```

5. Run training:

```
mlagents-learn config/config.yaml --run-id=basicx3 --train
```

Console logs / stack traces
You will *eventually* see the assertion you added to BasicAgent.cs fire; this should never happen, since all of those actions were masked.
Environment (please complete the following information):
UPDATE: This problem occurs starting from a branch size of 26; the extra dummy observation lines above are not needed. (I did not remove them, as they may still give ideas.) Step 2 above can be replaced with the following; the other steps remain the same, apart from the branch size (26) and the observation space size, which reverts to 20:
```csharp
public override void CollectObservations()
{
    AddVectorObs(m_Position, 20);
    SetActionMask(0, new int[] { 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25 });
}
```
I also attached a debugger at that moment to check the ActionMasker's current state, and I can clearly see the masked values there. So, somehow, AgentAction is still called with an action that is explicitly marked as masked. With lower branch sizes there does not *seem* to be a problem (though I cannot guarantee it), but with a branch size >= 26 I always hit the assertion at some point.
Hi @jacobson,
Sorry for the delay on this - I'm going to look into it in more detail tomorrow.
Hi,
I was able to reproduce this behavior with the steps you provided. Using a branch size of 210, this happened 6 times in 150,000 training steps.
I think the cause of this is an epsilon that we add to the probabilities before taking the log (to avoid taking log(0)): https://github.com/Unity-Technologies/ml-agents/blob/1a1919a2cff14582da06c89df7eccd14673b78d9/ml-agents/mlagents/trainers/models.py#L445-L449
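To make the failure mode concrete, here is a rough numpy sketch (my own illustration, not the actual ml-agents code; it assumes the resulting log-probabilities feed a standard categorical sampler). Adding EPSILON before the log leaves every masked action with a tiny but nonzero probability; with 207 of 210 actions masked and EPSILON = 1e-7, that is roughly a 2e-5 chance per decision, i.e. about 3 expected hits in 150,000 steps, the same order of magnitude as the 6 occurrences above:

```python
import numpy as np

EPSILON = 1e-7  # value used at the linked models.py lines
rng = np.random.default_rng(0)

n_actions, n_valid = 210, 3
probs = np.zeros(n_actions)
probs[:n_valid] = 1.0 / n_valid  # actions 3..209 are masked to exactly 0

# Sampling from log(probs + EPSILON) is equivalent to sampling from the
# renormalized probabilities below, so masked actions keep a tiny mass.
log_probs = np.log(probs + EPSILON)
sample_probs = np.exp(log_probs) / np.exp(log_probs).sum()

p_masked = sample_probs[n_valid:].sum()
print(p_masked)               # ~2.07e-5 per decision
print(150_000 * p_masked)     # ~3.1 expected masked picks in 150k steps

draws = rng.choice(n_actions, size=150_000, p=sample_probs)
print((draws >= n_valid).sum())  # a handful of masked actions get sampled
```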
I don't have a good workaround at the moment, but I'll see if we can get a simple fix for this. I've also logged it in our internal tracker as MLA-218.
@chriselion Thank you. I checked it, and yes, it affects stability.
I changed EPSILON from 1e-7 to 1e-10 and it seems to be working: AgentAction was NOT called with an invalid action during 630,000 steps with 210 total actions (I stopped the run after 630,000 steps). (1e-10 also worked for 99,999 total actions; I stopped that run after 42,000 steps.)
I understand that 1e-7 was chosen as EPSILON with 32-bit floats in mind; but as I tested, the EPSILON in models.py already behaves as a 64-bit float by default on most modern computers (its machine epsilon, np.finfo(EPSILON).eps, is 2.220446049250313e-16), since it is not explicitly defined as float32, unlike many places in the codebase. On top of that, I see that 1e-10 is already used in models.py (inside the create_learning_rate function), so such values do not seem to be off-limits. For now, does this look like an acceptable workaround? (Probably even 2.220446049250313e-16 could be used as EPSILON "for now".)
What do you think?
@jacobson Glad the workaround works. Using any sort of epsilon here still feels bad, because there is always a chance it picks an invalid action; we're looking at rewriting this using tfp.distributions.Multinomial, which takes probabilities instead of log-probabilities, so zeroing out masked actions will be cleaner.
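As a sketch of why that is cleaner (using numpy as a stand-in for tfp.distributions.Multinomial; this illustrates the intended behavior, not the actual rewrite): a sampler that consumes probabilities directly can assign masked actions exactly 0, and a zero-probability entry can never be drawn, with no epsilon involved.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, n_valid = 210, 3
probs = np.zeros(n_actions)
probs[:n_valid] = 1.0 / n_valid  # masked actions get exactly 0 probability

# A sampler that consumes probabilities (rather than log-probabilities)
# can never draw a zero-probability entry.
draws = rng.choice(n_actions, size=1_500_000, p=probs)
assert (draws < n_valid).all()   # masked actions are never selected
```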
(Even 2.220446049250313e-16 does not work well enough after 1.5 million steps, so this workaround is no longer helpful. I am looking forward to the fix you mentioned. Thank you for your concern, time, and effort. I do not want to dig into the Python code myself, since you can modify it at any time, so I would really appreciate it if it were possible to increase the priority of this bug internally, as it directly affects training with many actions when using a discrete vector action space. I am also not able to resume training with --load under these circumstances.)
Hi @jacobson,
I'm not sure about the current prioritization of the bug, but I'll try to get someone to look into it this week.
In the meantime, I'd suggest you add a check for masked actions before taking them, and add some sort of fallback logic in case one does get selected; something like the sketch below.
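A minimal sketch of the idea (in Python for brevity; in your project the equivalent check would live in the C# AgentAction code, and all names here are hypothetical):

```python
import random

def safe_action(chosen, masked_actions, n_actions):
    """If the policy returned a masked action, fall back to a uniform
    draw over the actions that are actually allowed."""
    if chosen not in masked_actions:
        return chosen
    valid = [a for a in range(n_actions) if a not in masked_actions]
    return random.choice(valid)

masked = set(range(3, 210))
print(safe_action(5, masked, 210))  # 5 is masked -> falls back to 0, 1, or 2
print(safe_action(2, masked, 210))  # 2 is valid  -> returned unchanged
```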
First of all, thank you very much for at least considering the possibility of prioritizing this.
I have deliberately not done that, so as not to make the brain's statistics more complex: I want to return a 100% meaningful result for every incoming decision. I could add a DoNothing method for invalid actions (which would not pass the turn in my board game but re-request another decision), but then the agent would learn that those decisions lead to a (VALID) DoNothing result, when it should never even consider making a decision that results in DoNothing. Yes, those cases may become insignificant after many decisions, but they would still pollute the brain. I understand that this approach is perfectly usable in various situations; but in my case, I want to feed the brain only correct action-result pairs. If my assumptions are wrong, please correct me. I welcome any suggestions, and I would be grateful to read about your experience with similar things. Thank you!
Hi @CharlieReece,
Sorry for the long delay on this. This bug is still open but we hope to get it fixed soon.
As for what to do when an invalid action is selected, I would recommend:
Thank you for the update. To be honest, I stopped using ML-Agents right after running into this problem, because:
I have not given up on using ML-Agents at all, but I have postponed using it (in my current project) until its first major release or a release candidate. I want to start using it again in non-production environments and test projects, to adapt to the latest changes, as soon as I have time.
In the meantime, I had to go back and build the AI in my project the old-school way, hopefully to be replaced or updated with ML-Agents later.
Thank you for your concern, your time, sharing ML-Agents with us, and letting us be a part of it too.