Episode reward summaries are all concentrated together on a few steps, with jumps in between.
Zoomed out:

Zoomed in:

Every other summary looks fine:

To reproduce, run PPO2 on DummyVecEnv(["Pendulum-v0" for _ in range(8)]).
Hello,
I have also encountered that issue in the past... I did not investigate much, but I think it came from using multiple environments.
Could you run two experiments that could provide some insights:
I can confirm that it works with just one env. The relevant code is in total_episode_reward_logger, which is called by PPO2. To me it is absolutely unclear what total_episode_reward_logger does exactly, and unfortunately I have no time to look into the issue.
The root cause of this problem may be related to this:
total_episode_reward_logger is "borrowed" from the A2C module, and used incorrectly in PPO2.
It calculates the step counter of the add_summary call by adding the length of the episode to the current self.num_timesteps variable.
This is correct only if self.num_timesteps += self.n_batch is called after total_episode_reward_logger, as in A2C. If that line is called before the logger, the step counter is shifted by n_batch.
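To illustrate the ordering problem, here is a minimal sketch in plain Python. It is not the stable-baselines code: `logged_steps` is a hypothetical stand-in that assumes the logger places each finished episode at `num_timesteps` plus the index of the `done` flag inside the rollout, and the values (`n_batch = 256`, a 200-step episode) are only illustrative.

```python
def logged_steps(num_timesteps, n_batch, done_indices, increment_first):
    """Return the global steps at which finished episodes would be logged.

    Hypothetical model of the logger: step = num_timesteps + done index.
    `increment_first` mimics incrementing num_timesteps before logging.
    """
    if increment_first:
        num_timesteps += n_batch  # counter already advanced before logging
    return [num_timesteps + idx for idx in done_indices]


n_batch = 256
done_indices = [199]  # one episode finishing at index 199 of the rollout

# Log first, increment afterwards (the order described for A2C above).
log_then_increment = logged_steps(0, n_batch, done_indices, increment_first=False)
# Increment first, log afterwards (the order described for PPO2 above).
increment_then_log = logged_steps(0, n_batch, done_indices, increment_first=True)

shift = increment_then_log[0] - log_then_increment[0]
print(log_then_increment, increment_then_log, shift)
```

Under these assumptions the second ordering places the same episode exactly `n_batch` steps later on the x-axis.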
@balintkozma
Hey. Could you make a pull request out of this? From a brief look, it seems one mostly has to move this block before the self.num_timesteps increment. The rest of the variables seem to be fine with this modification.
@Miffyli
I created the PR. Meanwhile, I found another problem:
If a shorter episode is added to the episode reward summary after a longer one, the graph will go backwards, because tensorboard connects the dots in the order they are added, and the calculated step counter will be smaller.
So, the curves on the zoomed-in picture can be avoided by sorting the finished episodes by length before they are added to the tensorboard summary.
Not implemented yet, I will create a separate PR.
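A rough sketch of the sorting idea, in plain Python and independent of the actual PR: `ordered_episode_points` is a hypothetical helper that assumes each finished episode is described by the index at which it ended in the rollout and its total reward.

```python
def ordered_episode_points(base_step, episodes):
    """Return (step, reward) pairs sorted by step.

    `episodes` is a list of (done_index, reward) pairs collected from all
    environments in one rollout. Sorting before writing keeps the step
    counter non-decreasing, so the plotted line never goes backwards.
    """
    points = [(base_step + done_index, reward) for done_index, reward in episodes]
    return sorted(points, key=lambda point: point[0])


# Illustrative values: a longer episode is collected before two shorter ones.
episodes = [(199, -1200.0), (57, -300.0), (120, -640.0)]
points = ordered_episode_points(1000, episodes)
steps = [step for step, _ in points]
print(steps)
```

Each point would then be written to the summary in this order instead of the order in which the environments were iterated.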
Not implemented yet, I will create a separate PR.
Please do only one PR that solves this issue.
@balintkozma
Thanks for the quick reply!
I think that could also be fixed in the same PR, as these two are relate-...
Ninj'd by Arrafin
There are many more issues with the timestep computation than just the call to total_episode_reward_logger. All other plots are also wrong when running a VecEnv.
Orange: multiple environment, Red: single environment

I do not understand the computation of the current timestep:
timestep = self.num_timesteps // update_fac + ((self.noptepochs * self.n_batch + epoch_num * self.n_batch + start) // batch_size)
Am I missing something? To me the only requirement to the timestep computation is that the values are plotted in the same order as they were computed.
I already fixed this for my own use and would make a pull request if appreciated.
I already fixed this for my own use and would make a pull request if appreciated.
If you have a solution that does not change too many parts at once, go ahead and make a PR out of it :). If it is a large change it might need time/discussion before merge, as we (try) to focus on v3.0 at the moment.
Am I missing something? To me the only requirement to the timestep computation is that the values are plotted in the same order as they were computed.
Looking at the issue again, the computation of timestep does not really make sense. A real fix would be to plot the average of those values instead of plotting each one of them...
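As a sketch of what "plot the average" could look like (plain Python, not the library code): `summarize_update` is a hypothetical helper, and the loss values and `n_batch = 256` are made up for illustration. One point is produced per update, at the current number of timesteps, so the x-axis is monotonic by construction.

```python
from statistics import mean


def summarize_update(num_timesteps, minibatch_losses):
    """Collapse all minibatch losses of one update into a single point."""
    return num_timesteps, mean(minibatch_losses)


n_batch = 256
num_timesteps = 0
history = []

# Two updates, each with four illustrative minibatch loss values.
for update_losses in ([0.9, 0.8, 0.7, 0.6], [0.5, 0.45, 0.4, 0.35]):
    num_timesteps += n_batch
    history.append(summarize_update(num_timesteps, update_losses))

logged_at = [step for step, _ in history]
print(history)
```

This trades per-minibatch detail for a curve that no longer depends on any per-minibatch timestep formula.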
Hi, I also encountered some issues described in the comments above. A recap follows.
If you run ppo2 with a single process training for 256 timesteps (N=1, T=256) and try to visualize the episode reward and the optimization statistics:
the episode reward is shifted by T (instead of being in [0,256], it is plotted in [256,512]) for the reason explained in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-552952355
the optimization statistics are affected by the timestep calculations highlighted in https://github.com/hill-a/stable-baselines/issues/143#issuecomment-584530173
Moreover, if you try to plot data using multiple processes (for instance N=4 workers with T=256 timesteps per worker):
the episode rewards are concentrated within T timesteps, followed by a jump of (N-1)*T timesteps in the plot
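The multi-worker pattern can be reproduced with a small model in plain Python. This is a sketch under assumptions, not the actual logger: `naive_reward_steps` is hypothetical, it assumes every worker's episode is logged at the shared counter plus the index where the episode ended, and it increments the counter after logging so that only the multi-worker effect is visible.

```python
def naive_reward_steps(n_envs, n_steps, n_updates, done_index):
    """Model the x-positions of episode rewards without per-worker offsets."""
    steps = []
    num_timesteps = 0
    for _ in range(n_updates):
        for _env in range(n_envs):
            # Every worker uses the same base counter, so points overlap.
            steps.append(num_timesteps + done_index)
        # The counter advances by the full batch, N*T, after each update.
        num_timesteps += n_envs * n_steps
    return steps


N, T = 4, 256
steps = naive_reward_steps(N, T, n_updates=2, done_index=199)

# Points of one update can only fall in a window of width T, while the next
# window starts N*T later, which leaves an empty stretch of (N-1)*T steps.
empty_stretch = N * T - T
print(steps, empty_stretch)
```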
I implemented the following solutions for the visualization issues:
the model is optimized for K epochs on N*T//M minibatches (M being the number of training timesteps in a minibatch), so a fixed number of data points is collected during the optimization, namely K * N*T//M
the K * N*T//M optimization data points are equally distributed over the batch size N*T
As a result, in the showcases above:

the rewards collected by the N workers are plotted side by side
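One possible reading of this scheme, sketched in plain Python. The function names and the values `M = 64`, `K = 4` are assumptions for illustration, not the code from the proposed change.

```python
def optimization_steps(base_step, n_envs, n_steps, minibatch_size, n_epochs):
    """Spread the K * N*T//M optimization points evenly over N*T timesteps."""
    batch_size = n_envs * n_steps                          # N*T
    n_points = n_epochs * (batch_size // minibatch_size)   # K * N*T//M
    spacing = batch_size / n_points
    return [base_step + int(i * spacing) for i in range(n_points)]


def reward_step(base_step, worker_idx, n_steps, done_index):
    """Offset each worker by worker_idx * T so rewards sit side by side."""
    return base_step + worker_idx * n_steps + done_index


# Single process: N=1, T=256, with assumed M=64 and K=4.
opt_steps = optimization_steps(0, n_envs=1, n_steps=256, minibatch_size=64, n_epochs=4)

# Four workers, each finishing an episode at index 199 of its rollout.
side_by_side = [reward_step(0, w, 256, 199) for w in range(4)]
print(opt_steps[:3], opt_steps[-1], side_by_side)
```

With these values all optimization points stay inside [0, 256), and each worker's reward lands in its own window of width T.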
The modifications are few and straightforward. Regarding the side-by-side visualization of the rewards in the multiprocess case, do you believe that plotting the mean and variance of the collected data would be more appropriate instead?
If it is appreciated, I would open a PR with the implemented modifications, which I can update if the mean and variance solution is recommended.
@paolo-viceconte thanks, I'll try to take a look at what you did this week (unless @Miffyli can do it before). We have too many issues related to that function (cf. all linked issues).