Hello, I would like to understand what the metrics returned by the train() function mean. Most of them have intuitive names, but their values sometimes don't match what I expect.
My main concern at the moment is understanding why 'episode reward' and 'policy reward' differ in multi-agent environments even when there is only one policy (e.g. with PPO, the policy reward generally seems to be higher?).
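For concreteness, here is a minimal sketch of the kind of run where I observe this, assuming the 0.8.x API: a toy two-agent CartPole wrapped as a MultiAgentEnv and trained with PPO using a single shared policy. The env, names and config values are only illustrative, and the exact result keys may differ between versions:

```python
import gym
import ray
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.env.multi_agent_env import MultiAgentEnv
from ray.tune.registry import register_env


class TwoAgentCartPole(MultiAgentEnv):
    """Two independent CartPole copies, one per agent id."""

    def __init__(self, env_config=None):
        self.agents = {"a0": gym.make("CartPole-v0"),
                       "a1": gym.make("CartPole-v0")}
        self.dones = set()

    def reset(self):
        self.dones = set()
        return {aid: env.reset() for aid, env in self.agents.items()}

    def step(self, action_dict):
        obs, rew, done, info = {}, {}, {}, {}
        for aid, action in action_dict.items():
            obs[aid], rew[aid], done[aid], info[aid] = self.agents[aid].step(action)
            if done[aid]:
                self.dones.add(aid)
        done["__all__"] = len(self.dones) == len(self.agents)
        return obs, rew, done, info


if __name__ == "__main__":
    ray.init()
    register_env("two_agent_cartpole", lambda cfg: TwoAgentCartPole(cfg))

    single = gym.make("CartPole-v0")  # only used to grab the per-agent spaces
    trainer = PPOTrainer(
        env="two_agent_cartpole",
        config={
            "num_workers": 1,
            "multiagent": {
                # Both agents are mapped to the same (single) policy.
                "policies": {
                    "shared": (None, single.observation_space,
                               single.action_space, {}),
                },
                "policy_mapping_fn": lambda agent_id: "shared",
            },
        },
    )
    result = trainer.train()
    # These are the two metrics I am comparing; policy_reward_mean is a dict
    # keyed by policy id (key names may vary between RLlib versions).
    print("episode_reward_mean:", result["episode_reward_mean"])
    print("policy_reward_mean :", result.get("policy_reward_mean"))
```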
I am also interested in understanding what the time metrics (grad_time, sample_time, load_time...) really correspond to (sometimes they don't add up the way I would expect).
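For reference, right now I just dump the whole result dict to look at them; in 0.8.2 the timing stats (sample_time_ms, grad_time_ms, load_time_ms, ...) seem to appear under the info section, though I may be misreading the structure:

```python
from ray.tune.logger import pretty_print

# Dump everything train() returns, to see where the timing stats live
# (in 0.8.2 they appear to be nested under "info", e.g. info/sample_time_ms).
print(pretty_print(result))
```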
Could you explain this a little, please? I think it would also be helpful to have this in the documentation (including the algorithm-specific metrics).
Regards,
Yann
Ray version: 0.8.2
I have found this issue, which also reports strange behavior of episode_reward vs. policy_reward in multi-agent environments: https://github.com/ray-project/ray/issues/6970
I still think that a documentation page describing these different metrics (at least the common ones) would be more than helpful.
Hi, I'm a bot from the Ray team :)
To help human contributors focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.
If there is no further activity within the next 14 days, the issue will be closed!
You can always ask for help on our discussion forum or Ray's public Slack channel.