Hi! I believe that Distributed PPO uses multiple workers for data collection and gradient calculation (the gradients are then averaged to update the policy). Is using PPO2 with multi-processing equivalent to DPPO?
Hello,
Looking at the paper, there are some similarities and some differences.
I wouldn't call it DPPO, more "multiprocessed" PPO with a single process for doing the gradient update.
First, it uses several workers to gather data. However, compared to DPPO, PPO2 does not average gradients from the workers; it samples from all the experience gathered by the workers (see here).
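To make the difference concrete, here is a minimal sketch of "pool the experience, then sample minibatches" in plain NumPy. It is not the stable-baselines code: `pooled_minibatches`, the rollout shapes, and the fake worker data are all hypothetical names and values chosen for illustration.

```python
import numpy as np

def pooled_minibatches(worker_rollouts, n_minibatches, rng):
    """Concatenate per-worker rollouts, shuffle once, and yield minibatches.

    A single learner would compute one gradient per minibatch, instead of
    averaging gradients computed separately by each worker.
    """
    pooled = np.concatenate(worker_rollouts, axis=0)
    indices = rng.permutation(len(pooled))
    for chunk in np.array_split(indices, n_minibatches):
        yield pooled[chunk]

rng = np.random.default_rng(0)
# 4 hypothetical workers, each contributing 8 transitions with 3 features.
# Every row is filled with its worker id so the mixing is easy to inspect.
rollouts = [np.full((8, 3), worker_id, dtype=np.float32) for worker_id in range(4)]
batches = list(pooled_minibatches(rollouts, n_minibatches=4, rng=rng))
```

Each minibatch can mix transitions from several workers, which is the point: the workers only collect data, and the update happens in one place.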
Then, PPO2 uses the "clip" variant of PPO (instead of the KL penalty) and does not have (for now) any check on the estimated KL divergence (see https://github.com/hill-a/stable-baselines/issues/213), which is a feature that would be nice to have.
Finally (there may be more differences, but I have not read the paper thoroughly yet), the OpenAI PPO version (PPO2) also clips the value function (since #343 you can deactivate or tune that) and normalizes the advantage. Note that those changes were not documented by OpenAI.
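For reference, the two undocumented tweaks can be sketched roughly as follows. This is a NumPy illustration under my own assumptions, not a copy of the library code: the function names are hypothetical, and using `clip_range_vf=None` to mean "no value clipping" is a convention of this sketch only.

```python
import numpy as np

def normalize_advantages(advantages, eps=1e-8):
    """Rescale advantages to roughly zero mean and unit standard deviation."""
    return (advantages - advantages.mean()) / (advantages.std() + eps)

def clipped_value_loss(values, old_values, returns, clip_range_vf=None):
    """Squared-error value loss, optionally with PPO2-style value clipping.

    With clipping, the new value prediction may move at most clip_range_vf
    away from the old prediction, and the larger (more pessimistic) of the
    clipped and unclipped errors is used.
    """
    unclipped = (values - returns) ** 2
    if clip_range_vf is None:  # sketch convention: None disables clipping
        return 0.5 * unclipped.mean()
    values_clipped = old_values + np.clip(
        values - old_values, -clip_range_vf, clip_range_vf
    )
    clipped = (values_clipped - returns) ** 2
    return 0.5 * np.maximum(unclipped, clipped).mean()

advantages = normalize_advantages(np.array([1.0, 2.0, 3.0, 4.0]))
values = np.array([1.5, 0.0])
old_values = np.array([1.0, 0.0])
returns = np.array([1.5, 1.0])
loss_plain = clipped_value_loss(values, old_values, returns)
loss_clipped = clipped_value_loss(values, old_values, returns, clip_range_vf=0.2)
```

Because of the element-wise maximum, the clipped loss is never smaller than the unclipped one, so clipping only ever penalizes large jumps in the value prediction.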
Then, PPO2 uses the "clip" variant of PPO (instead of the KL penalty) and does not have (for now) any check on the estimated KL divergence (see #213), which is a feature that would be nice to have.
What feature would you suggest exactly? (Maybe early stopping?) I am not sure that would make much sense since this requires clever hyperparameterizing, and having few hyperparameters is originally one strength of PPO-clip, unless I am missing something?
(First time here, I may start contributing to this nice repo)
What feature would you suggest exactly? (Maybe early stopping?)
Yes, it is the feature described in #213. Because the clipping strategy is not a hard constraint, it may be violated (see here).
I am not sure that would make much sense since this requires clever hyperparameterizing, and having few hyperparameters is originally one strength of PPO-clip, unless I am missing something?
The idea would be to at least allow the user to set it up, either by deactivating this feature by default (in the same vein as what was done with clipping the value function in #343) or by having a very high value by default.
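A minimal sketch of what such an opt-in check could look like, assuming a simple quadratic estimate of the KL divergence from log-probabilities. The names `approx_kl`, `should_stop_early`, and `target_kl` are hypothetical and not an existing PPO2 argument; `target_kl=None` keeps the check off, so the default behaviour would not change.

```python
import numpy as np

def approx_kl(old_log_prob, new_log_prob):
    """Cheap quadratic estimate of the KL between old and new policy."""
    return 0.5 * np.mean((new_log_prob - old_log_prob) ** 2)

def should_stop_early(old_log_prob, new_log_prob, target_kl=None):
    """Return True when the estimated KL exceeds target_kl (off if None)."""
    if target_kl is None:
        return False
    return approx_kl(old_log_prob, new_log_prob) > target_kl

# Action probabilities under the old and the updated policy (made-up numbers).
old_lp = np.log(np.array([0.5, 0.5, 0.5]))
new_lp = np.log(np.array([0.75, 0.5, 0.25]))

# The probability ratio reaches 1.5 and 0.5 here, well outside [0.8, 1.2]
# for a clip range of 0.2: clipping flattens the objective but does not
# prevent the policy from moving that far.
ratio = np.exp(new_lp - old_lp)
stop = should_stop_early(old_lp, new_lp, target_kl=0.01)
```

In a training loop, the check would run after each minibatch update or each epoch, and the remaining epochs on the current batch would be skipped once it returns True.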
(First time here, I may start contributing to this nice repo)
Contributions are welcome =) (please don't forget to read the contribution guide first)