Stable-baselines: Learn with number of episodes rather than total_timesteps

Created on 21 Nov 2019 · 8 Comments · Source: hill-a/stable-baselines

Hi,

I would like to find a way to make the .learn method in PPO1 (and, I guess, other agents) stop after a given number of episodes, e.g. .learn(nr_episodes), rather than after an explicitly defined number of steps. This could be useful in situations where episodes have different lengths that cannot be determined exactly beforehand.

As a quick hack, I made some changes in pposgd_simple.py. I added a new default argument to:

.learn(..., total_episodes=None)

And then replaced

if total_timesteps and timesteps_so_far >= total_timesteps:
    break

with

if total_episodes and episodes_so_far >= total_episodes:
    break

Finally, I was planning to just call

.learn(None, total_episodes=nr_episodes)

But then I noticed this line

elif self.schedule == 'linear':
    cur_lrmult = max(1.0 - float(timesteps_so_far) / total_timesteps, 0)

So I'll probably make a rough estimate of the timesteps and set total_timesteps accordingly or, alternatively, change the schedule line to

elif self.schedule == 'linear':
    cur_lrmult = max(1.0 - float(episodes_so_far) / total_episodes, 0)

But I was wondering: do any of these modifications have consequences I'm not considering?

Labels: duplicate, question

joaoreisucl
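
A minimal sketch of the two changes described in the question, written as standalone helpers rather than as a patch to pposgd_simple.py. The function names are hypothetical; only the expressions inside them come from the snippets above. It also shows why calling .learn(None, total_episodes=...) would clash with the timestep-based linear schedule: the division needs a known total.

```python
def should_stop(timesteps_so_far, episodes_so_far,
                total_timesteps=None, total_episodes=None):
    """Return True once either configured limit has been reached."""
    if total_timesteps and timesteps_so_far >= total_timesteps:
        return True
    if total_episodes and episodes_so_far >= total_episodes:
        return True
    return False


def linear_lr_multiplier(progress_so_far, total):
    """Linear decay from 1.0 to 0.0, same expression as the 'linear' schedule."""
    if not total:
        # With total=None (or 0) the original expression would raise an error.
        raise ValueError("a linear schedule needs a known, non-zero total")
    return max(1.0 - float(progress_so_far) / total, 0)
```

Passing episodes_so_far and total_episodes to linear_lr_multiplier keeps the schedule measured in the same unit as the stopping rule.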

All 8 comments

Hello,

I think this is a duplicate (in a way) of #62.
What you suggest (having a maximum number of episodes) will work only in two cases:

when you have one environment
when the number of timesteps per episode is fixed

In all other cases, there is no proper way of training for n episodes.
I think that for V3 (cf #576) we should have the callback called at every step of the environment, even when we have multiple environments. That way, using a callback, you could monitor the number of episodes and exit when the desired number is reached.

It is possible to do that with the current version of SB (cf the doc on how to use a callback to stop training early), but it won't be accurate (you will get a small difference between the desired number of episodes and the actual number).

EDIT: for a more extensive explanation on how to use callbacks, please take a look at our recent tutorial: jnrr19

araffin on 29 Nov 2019
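
The callback idea can be sketched without the library. This is not the stable-baselines callback API: StopAfterEpisodes and simulate are hypothetical names, and the loop stands in for a vectorized environment whose episodes have fixed lengths. It follows the convention that a callback returning False aborts training, and it shows why the count can overshoot when several environments finish on the same step.

```python
class StopAfterEpisodes:
    """Count finished episodes and ask the training loop to stop at a target."""

    def __init__(self, max_episodes):
        self.max_episodes = max_episodes
        self.episodes_done = 0

    def on_step(self, dones):
        # dones holds one flag per parallel environment for the current step.
        self.episodes_done += sum(bool(d) for d in dones)
        # Returning False signals "stop training".
        return self.episodes_done < self.max_episodes


def simulate(episode_lengths, max_episodes):
    """Step a toy vectorized env until the callback asks to stop.

    episode_lengths[i] is the (fixed) episode length of environment i.
    Returns (steps_taken, episodes_counted).
    """
    callback = StopAfterEpisodes(max_episodes)
    step = 0
    while True:
        step += 1
        dones = [step % length == 0 for length in episode_lengths]
        if not callback.on_step(dones):
            return step, callback.episodes_done


single_env = simulate([7], max_episodes=3)    # one environment
two_envs = simulate([4, 4], max_episodes=3)   # two environments in lockstep
```

With one environment the loop stops exactly at the target. With two environments that finish on the same step, the count jumps from 2 to 4 and the target of 3 is overshot by one.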

Hi @araffin ,

Even though this is closed, and maybe there is something I am not getting, I would like to make my case for this issue.

The particular case where the number of timesteps per episode is known and fixed is quite common for stock trading envs. Also, in stock trading scenarios, it can be quite valuable to scan all data points an equal number of times.
I do not think this is that similar to issue #62, and I am also not sure about the impact of using callbacks to identify the end of episodes.
Also, we do not necessarily want to monitor anything. In particular, it is just more convenient and less error-prone to use an episode count instead of a timestep count.

For now, I am counting the number of data points I have in my price time series, and multiplying it by the number of episodes I want my model to experience during learning.

Alternatively, I am also considering the SubprocVecEnv approach, where the num_envs variable would equal my number of episodes, and then setting total_timesteps to the number of time points in my training sample.

All things considered, I think it would be quite useful to have an option to set a specific number of episodes when calling the learn() function.

If there is something wrong with my reasoning, or if you have any suggestions, please feel welcome to point it out.

Thanks in advance for your time. =)

xicocaio on 13 Aug 2020

If episode lengths are unknown, then any scheduler based on the number of timesteps would not make sense (e.g. a linear scheduler that goes to zero at 1M steps taken: if the desired number of episodes is reached first, the schedule is suddenly cut midway). If lengths are known, then you can do what you suggested and set the training length to num_episodes_wanted * episode_length.

In any case, since araffin's comment we now have better callbacks, which can trigger on each step. You can track the number of episodes played with that and stop training once enough episodes have passed.

Miffyli on 13 Aug 2020
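
A small worked example of both points, using the same linear expression quoted in the question. All numbers (252 data points per episode, 40 episodes, a stop at 250k of 1M planned steps) are made up for illustration.

```python
def linear_lr_multiplier(timesteps_so_far, total_timesteps):
    """Same expression as the 'linear' schedule quoted in the question."""
    return max(1.0 - float(timesteps_so_far) / total_timesteps, 0)


# Known, fixed episode length: the timestep budget is a plain product,
# and the schedule reaches zero exactly when training ends.
episode_length = 252          # hypothetical number of data points per episode
num_episodes_wanted = 40      # hypothetical number of passes over the data
total_timesteps = num_episodes_wanted * episode_length
multiplier_at_end = linear_lr_multiplier(total_timesteps, total_timesteps)

# Unknown episode lengths: the schedule was planned for 1M steps, but the
# episode target happens to be reached after 250k steps, so the decay is
# cut off while the multiplier is still far from zero.
planned_timesteps = 1_000_000
steps_when_episode_target_hit = 250_000
multiplier_at_stop = linear_lr_multiplier(steps_when_episode_target_hit,
                                          planned_timesteps)
```

Here total_timesteps comes out to 10080 and multiplier_at_end to 0, while multiplier_at_stop is still 0.75 when training is cut short.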

Hi, @Miffyli

Yes, this num_episodes_wanted * episode_length is exactly what I meant by:

For now, I am counting the number of data points I have in my price time series, and multiplying it by the number of episodes I want my model to experience during learning.

However, this answer on StackOverflow says that

Where the episode length is known, set it to the desired number of episode you would like to train. However, it might be less because the agent might not (probably wont) reach max steps every time.

I must admit I do not know how accurate this answer is. Is it right?

Also, won't the callback method maybe cause the env to be run more times than desired in an async case? Or even slow things down, by calling a callback at every step to check a condition?

Thank you for your help.

xicocaio on 13 Aug 2020

I must admit I do not know how accurate this answer is. Is it right?

Not sure what they mean by "max steps". If it means "training steps", then it is true that some of the steps will be "wasted" at the end of training, because it was gathering a batch, reached the maximum number of steps and terminated training there.

Also, won't the callback method maybe cause the env to be run more times than desired in an async case? Or even slow things down, by calling a callback at every step to check a condition?

The slowdown is negligible for checks like this unless the environment is super-fast, like CartPole envs. There is no async behaviour in stable-baselines; all vecenvs are synchronous, so that should not be an issue.

Miffyli on 13 Aug 2020
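
One reading of the budget question can be sketched as follows. The helper name and the episode lengths are hypothetical; it only counts how many whole episodes fit in a timestep budget and how many steps are left over for an episode that never finishes. It does not model how any particular algorithm rounds the budget to its batch size.

```python
def episodes_within_budget(episode_lengths, total_timesteps):
    """Return (completed_episodes, leftover_steps) for a timestep budget.

    episode_lengths lists the length of each episode in the order played.
    leftover_steps is the part of the budget not covered by whole episodes.
    """
    completed = 0
    used = 0
    for length in episode_lengths:
        if used + length > total_timesteps:
            break  # this episode would not finish within the budget
        used += length
        completed += 1
    return completed, total_timesteps - used


# Episodes of varying length against a budget of 30 steps.
result = episodes_within_budget([10, 7, 10, 9], total_timesteps=30)
```

With these lengths, three episodes complete after 27 steps and the remaining 3 steps go to a fourth episode that is cut off.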

Hi @Miffyli,

What I understand from the mentioned answer is quite the opposite of wasting. I think some data points would end up not being scanned an equal number of times.

About the async, ok, that makes sense.

Still, for this callback approach, I would have to pass a total_timesteps value that is high enough to reach the desired number of episodes. This callback approach seems like an out-of-the-way workaround.

As I see that you are also a contributor to V3, can I expect that at some point passing a number of episodes instead of total_timesteps to model.learn() can be implemented, instead of having to rely on a callback?

Should I open an issue in that repo if that is a feature that I would like to see?

Thank you

xicocaio on 14 Aug 2020

Yes, open an issue suggesting that feature on the other repo, where we can further discuss whether it should be included. No new features are being added to this repo, where we focus on bug-fixing, so I doubt it would be included here.

Miffyli on 14 Aug 2020

Excellent, I will do that, thank you very much.

xicocaio on 14 Aug 2020
