ML-Agents: Cumulative Reward and Extrinsic Reward graphs differ when only extrinsic rewards are used

Created on 8 Apr 2020  ·  7 comments  ·  Source: Unity-Technologies/ml-agents

Hi, I've brought up this issue before in #3563, but as mentioned there, even though this changed behaviour/bug was introduced in the same update as self-play, it happens regardless of whether self-play is used.

The issue is that the Cumulative Reward graph and the Extrinsic Reward graph no longer match, even though I'm only using extrinsic rewards. (Before the update, they were always identical.) An example screenshot is attached. Since zero-sum rewards are used, the Cumulative Reward graph is correct; I'm not sure how to interpret the other one.
[Screenshot: Cumulative Reward vs. Extrinsic Reward TensorBoard graphs]

The migration notes mention that steps are now counted per-agent rather than as environment steps. I thought that might be the reason, but the x-axes of both graphs are the same and they appear to have the same number of data points.

Could you please verify whether this changed behavior is a bug or not?

bug

All 7 comments

Hi @niskander

The reason for this is that in release v0.15, rewards are averaged over all agents that share a behavior name, so in zero-sum games this produces the figure you see. Your concern (raised in issue #3563) led us to change this so that only trajectories collected by the learning agent are logged to TensorBoard. This is currently on master and will be part of the next release.
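To make the averaging concrete, here is a toy sketch (not the actual ML-Agents code; agent names and reward values are hypothetical) of how averaging over all agents sharing a behavior name differs from logging only the learning agent's reward in a zero-sum game:

```python
# Hypothetical per-episode rewards for a zero-sum, two-agent game where
# both agents share a single behavior name.
episode_rewards = {"agent_0": +1.0, "agent_1": -1.0}  # winner / loser

# v0.15-style stat: average over *all* agents sharing the behavior name.
# In a zero-sum game the opposing rewards cancel out.
avg_all_agents = sum(episode_rewards.values()) / len(episode_rewards)

# Post-fix behavior: only the learning agent's trajectory is logged,
# so its actual reward is reported unchanged.
learning_agent = "agent_0"  # hypothetical learning-agent id
learning_agent_reward = episode_rewards[learning_agent]

print(avg_all_agents)         # 0.0
print(learning_agent_reward)  # 1.0
```

This is only meant to show why the two logging strategies can produce different-looking graphs from the same episodes.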

Thank you very much for using this feature and helping us refine it!

What do Cumulative Rewards and Extrinsic Rewards represent when we have multiple agents in the same instance?

@andrewcoh That's great to hear. However, I think there must be another issue, because the graphs differ even when self-play is turned off and all agents use the same brain.

Here's an example graph:
[Screenshot: example reward graph]
In this scenario there are both rewards and penalties, but the rewards are always larger in magnitude.
It looks like the Cumulative Reward graph is summing the rewards, while the Extrinsic Reward graph is displaying each one as a separate data point. Again, this behaviour differs from what it used to be. Could that be it?

@niskander Do you have the same issue with PPO?

I am using PPO with self-play. My environment has a fixed episode length, with +1 for winning, -1 for losing, and 0 for a draw. The cumulative reward is correct (always zero), while the extrinsic reward shows some spikes.

[Screenshots: Extrinsic Reward and Cumulative Reward graphs]

I am not sure I understand the extrinsic reward. Does it show only the reward of the learning team?

@andrewcoh Since @fedetask confirmed it for PPO I didn't try. Do you think it's a bug?

Issue doesn't exist in more recent versions, closing.
