Stable-baselines: [question] Training sequential tasks on multiple instances of an environment

Created on 7 Jun 2020 · 5 Comments · Source: hill-a/stable-baselines

I want to train an agent to reach point B from point A and then reach point C from point B. The idea is to train two separate agents, one of which learns the A -> B move while the other learns the B -> C move. There are multiple ways to do this:

1. Create two environment instances and, in the second one, initialize the second agent's position randomly around point B.
2. Create two environment instances, and initialize the second agent's position in the second environment to the last point the first agent visited in the first environment. For this, we need to train agent 1 for one episode, then train agent 2 for one episode, and repeat this loop again and again.

Is it possible to implement the second idea with stable_baselines?

mhtb32

All 5 comments

The second one is doable, although not trivial. You need to launch two separate training instances with two different environments (1 and 2), where each environment waits for information from the other: environment 1 plays and ends at some point X, then transfers this information to environment 2 (e.g. via sockets), which then initializes itself to start from that point X.

Sidenote: If possible, I would just create one environment for "A -> B" and a second for "B -> C", and train separate agents on them, but I assume A, B and C are not known (or at least B is not known).
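
To make the hand-off concrete, here is a minimal sketch of environment 1 passing its end point X to environment 2. It is only an illustration under several assumptions: `LineEnv` and `run_episodes` are hypothetical names, the environment is a toy 1-D line rather than a real Gym environment, random actions stand in for the agents, and a `queue.Queue` shared between two threads stands in for the sockets you would need between two separate training processes.

```python
import queue
import random
import threading


class LineEnv:
    """Toy 1-D environment (hypothetical): move right until `goal` is reached."""

    def __init__(self, goal, default_start=0, start_source=None, end_sink=None):
        self.goal = goal
        self.default_start = default_start
        self.start_source = start_source  # queue to read start positions from
        self.end_sink = end_sink          # queue to publish end positions to
        self.pos = default_start

    def reset(self):
        # Environment 2 blocks here until environment 1 has published point X.
        if self.start_source is not None:
            self.pos = self.start_source.get()
        else:
            self.pos = self.default_start
        return self.pos

    def step(self, action):
        self.pos += action
        done = self.pos >= self.goal
        if done and self.end_sink is not None:
            self.end_sink.put(self.pos)  # hand point X to the other environment
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {}


def run_episodes(env, n_episodes, starts, rng):
    """Stand-in for a training loop: random actions instead of a real agent."""
    for _ in range(n_episodes):
        starts.append(env.reset())
        done = False
        while not done:
            _, _, done, _ = env.step(rng.choice([1, 2]))


handoff = queue.Queue()
env1 = LineEnv(goal=5, end_sink=handoff)       # A -> B
env2 = LineEnv(goal=10, start_source=handoff)  # B -> C, starts at X
starts1, starts2 = [], []
t1 = threading.Thread(target=run_episodes, args=(env1, 3, starts1, random.Random(0)))
t2 = threading.Thread(target=run_episodes, args=(env2, 3, starts2, random.Random(1)))
t1.start()
t2.start()
t1.join()
t2.join()
print("env 2 start positions:", starts2)
```

Each episode of environment 2 consumes exactly one end point from environment 1, so both sides have to run the same number of episodes or the second one will block in `reset()`.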

Miffyli on 7 Jun 2020

👍1

You can do it with just callbacks: override the _on_step function to start training the second agent once the first agent has reached checkpoint B.

PartiallyTyped on 7 Jun 2020

You can do it with just callbacks: override the _on_step function to start training the second agent once the first agent has reached checkpoint B.

Miffyli on 7 Jun 2020

👍2

Hmm, it might be possible, but we do not recommend calling train repeatedly in a row (it might redo initialization all over again, possibly leaking memory and also erasing any optimizer statistics).

Right, yes, I did that once and it took a long time to learn anything...

When I tried to do something similar for hierarchical algorithms, I used generators.

@mhtb32
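If you are comfortable with the source code, you can use generators. You have to change the learn function so that it yields when you reach the target. Then it boils down to:

agent1 = ...  # model for A -> B
agent2 = ...  # model for B -> C
# learn() here is the modified version that yields at the target
g1 = agent1.learn(...)
g2 = agent2.learn(...)
while your_condition:
    next(g1)  # run agent 1 until its next yield
    next(g2)  # then run agent 2 until its next yield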

But you will have to slightly modify the source.
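
For anyone who wants to see the pattern run without touching the library, here is a self-contained toy version. Everything in it is an assumption for illustration: `ToyAgent` is a hypothetical stand-in, not the stable-baselines API, its `learn` is a generator that yields at the end of every episode, and a random walk on a line replaces the actual policy and update step.

```python
import random


class ToyAgent:
    """Hypothetical stand-in for an RL model; NOT the stable-baselines API."""

    def __init__(self, goal, seed=0):
        self.goal = goal
        self.rng = random.Random(seed)
        self.start = 0              # where the next episode begins
        self.last_position = None
        self.episodes_trained = 0

    def learn(self, total_episodes):
        """Generator version of learn(): pauses after every episode."""
        for _ in range(total_episodes):
            pos = self.start
            while pos < self.goal:
                # Placeholder for "act with the policy, then update it".
                pos += self.rng.choice([1, 2])
            self.last_position = pos
            self.episodes_trained += 1
            yield pos  # hand control back; the caller decides who trains next


agent1 = ToyAgent(goal=5, seed=0)   # A -> B
agent2 = ToyAgent(goal=10, seed=1)  # B -> C
g1 = agent1.learn(total_episodes=3)
g2 = agent2.learn(total_episodes=3)

handoffs = []
for end_of_agent1 in g1:            # one episode of agent 1
    agent2.start = end_of_agent1    # agent 2 starts where agent 1 stopped
    handoffs.append(end_of_agent1)
    next(g2)                        # one episode of agent 2

print("hand-off points:", handoffs)
```

Because each generator keeps its local state between `next` calls, nothing is re-initialized when control switches from one agent to the other, which is the main advantage over calling a regular learn function again and again.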

PartiallyTyped on 7 Jun 2020

👍1

I think DLR-RM/stable-baselines3#55 addresses this.

mhtb32 on 11 Jun 2020
