I get the following output when training a PPO model on my environment:
| approxkl | 0.00069229305 |
| clipfrac | 0.00390625 |
| ep_rewmean | nan |
| eplenmean | nan |
| explained_variance | 0.847 |
| fps | 142 |
| nupdates | 782 |
| policy_entropy | 3.3405766 |
| policy_loss | -0.011813248 |
| serial_timesteps | 100096 |
| time_elapsed | 656 |
| total_timesteps | 100096 |
| value_loss | 0.4478733 |
What do these values mean, or where can I find a description of them?
Hello,
For that, I recommend reading the PPO paper.
The parameters not related to PPO:
Stable-Baselines Documentation: https://stable-baselines.readthedocs.io/en/master/modules/ppo2.html
Additional Documentation: https://spinningup.openai.com/en/latest/algorithms/ppo.html
Hi,
Great. Thank you very much for the pointers.
- serial_timesteps: I think it is the same as total_timesteps (kept for legacy reasons, I suppose)
I'm not sure that the explanation of `serial_timesteps` and `total_timesteps` is correct. If you look at where these come from, in the `for update in range(1, nupdates + 1):` training loop of `PPO2.learn`, `total_timesteps` can be seen to be the number of _gradient updates_ (i.e. epochs) performed on the network, and thus has nothing to do with the number of environment steps that have been taken.
By contrast, `serial_timesteps` is a slightly confusing metric for the number of environment steps taken, one that disregards the number of envs running in parallel. With `n_steps=64`, `serial_timesteps` increases by 64 every time new data is collected to train the policy network for `noptepochs` epochs. It doesn't matter whether `n_envs=1` or `n_envs=100`; `serial_timesteps` still only increases by 64. I might open an issue suggesting that `serial_timesteps` be renamed `env_timesteps` and increase by `n_envs*n_steps` each time the policy is trained, rather than by `n_steps`. Similarly, perhaps `total_timesteps` should be renamed `n_epochs`, or removed altogether, since given `n_updates` and the number of epochs per update it provides somewhat redundant information.
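The behaviour of `serial_timesteps` described above can be sketched as follows. This is only an illustration, assuming the counter is logged as `update * n_steps`; `env_timesteps` is the hypothetical renamed metric from the suggestion above, and neither function is part of the Stable-Baselines API.

```python
def serial_timesteps(update: int, n_steps: int) -> int:
    # Assumed logging rule: grows by n_steps per update, ignoring n_envs.
    return update * n_steps


def env_timesteps(update: int, n_steps: int, n_envs: int) -> int:
    # Hypothetical metric: counts the steps taken across all parallel envs.
    return update * n_steps * n_envs


if __name__ == "__main__":
    n_steps = 64
    for n_envs in (1, 100):
        # After one update, serial_timesteps is 64 in both cases,
        # while env_timesteps scales with the number of envs.
        print(n_envs, serial_timesteps(1, n_steps), env_timesteps(1, n_steps, n_envs))
```

With a single environment the two counters coincide; they only diverge once several environments run in parallel.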
@araffin you've mentioned the fps, and I'm trying to figure out why the fps is showing "0" for me when using ppo2 (I only tested ppo2). Does that indicate an issue in the implementation, or could it be caused by slow steps per second? Here's a sample of my output:
| approxkl | 0.000308714 |
| clipfrac | 0.0 |
| ep_len_mean | 43.7 |
| ep_reward_mean | -162 |
| explained_variance | -1.19e-07 |
| fps | 0 |
| n_updates | 8 |
| policy_entropy | 1.79129 |
| policy_loss | -0.00336207 |
| serial_timesteps | 1024 |
| time_elapsed | 1.92e+03 |
| total_timesteps | 1024 |
| value_loss | 1433.95 |
------------------------------------
Does that indicate an issue in the implementation or could be caused by slow steps per second?
Please fill in the issue template for that. I don't have enough information to reproduce the described behavior.
EDIT: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/ppo2/ppo2.py#L373
Looking at the code, your simulation is very slow, so the fps value gets rounded down to zero.
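A minimal sketch of that rounding, assuming fps is computed as the integer part of the batch size divided by the wall-clock time of one update (this is a reading of the linked line, not a copy of it, and `fps_as_logged` is a hypothetical helper). The inputs are inferred from the log above: 1024 total timesteps over 8 updates gives 128 steps per update, and 1.92e+03 seconds over 8 updates gives roughly 240 seconds per update.

```python
def fps_as_logged(n_batch: int, seconds_per_update: float) -> int:
    # Assumed formula: steps per second, truncated to an integer.
    return int(n_batch / seconds_per_update)


# Values taken from the log in the question above.
n_updates = 8
total_timesteps = 1024
time_elapsed = 1.92e3  # seconds

n_batch = total_timesteps // n_updates          # steps collected per update
seconds_per_update = time_elapsed / n_updates   # average wall-clock time per update

if __name__ == "__main__":
    print(n_batch / seconds_per_update)                # well below 1 step per second
    print(fps_as_logged(n_batch, seconds_per_update))  # truncates to 0
```

Any environment that yields fewer than one step per second would show an fps of 0 under this assumption, even though training is progressing.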