Stable-baselines: [question] Training sequential tasks on multiple instances of an environment

Created on 7 Jun 2020 · 5 Comments · Source: hill-a/stable-baselines

I want to train an agent to reach point B from point A and then reach point C from point B. The idea is to train two separate agents: one learns the A -> B move and the other learns the B -> C move. There are multiple ways to do this:

  1. Create two environment instances and, in the second one, initialize the second agent's position randomly around point B.
  2. Create two environment instances, and initialize the second agent's position in the second environment to the last point the first agent visited in the first environment. For this, we need to train agent 1 for one episode, then train agent 2 for one episode, and keep repeating this loop.

Is it possible to implement the second idea with stable_baselines?

All 5 comments

The second one is doable, although not trivial. You need to launch two separate training instances with two different environments (1 and 2), where each environment waits for information from the other: environment 1 plays an episode and ends at some point X, transfers this information to environment 2 (e.g. via sockets), and environment 2 then initializes itself to start from that point X.

Sidenote: If possible, I would just create one environment for "A -> B" and a second one for "B -> C", and train separate agents on them, but I assume A, B and C are not known (or at least B is not known).
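The handoff described in this comment can be sketched without any RL library. The following is a minimal illustration under stated assumptions: `LineEnv`, `run_episode` and `noisy_policy` are hypothetical names invented for this example, an in-process `queue.Queue` stands in for the socket, and no actual learning takes place. It only shows environment 2 starting each episode from the last point environment 1 visited.

```python
import queue
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


class LineEnv:
    """Toy 1-D environment (hypothetical): walk along a line toward a goal."""

    def __init__(self, goal, default_start=0.0, max_steps=20):
        self.goal = goal
        self.default_start = default_start
        self.max_steps = max_steps
        self.pos = default_start
        self.steps = 0

    def reset(self, start=None):
        # Start from the handed-over position if one is given.
        self.pos = self.default_start if start is None else start
        self.steps = 0
        return self.pos

    def step(self, action):
        self.pos += action
        self.steps += 1
        done = abs(self.pos - self.goal) < 0.5 or self.steps >= self.max_steps
        reward = -abs(self.pos - self.goal)
        return self.pos, reward, done


def noisy_policy(obs, goal):
    # Stand-in for a learned policy: step toward the goal with some noise.
    direction = 1.0 if goal > obs else -1.0
    return direction + rng.uniform(-0.3, 0.3)


def run_episode(env, policy, start=None):
    """Play one episode and return the last position visited."""
    obs = env.reset(start)
    done = False
    while not done:
        obs, _, done = env.step(policy(obs, env.goal))
    return obs


handoff = queue.Queue(maxsize=1)                 # stands in for the socket
env_ab = LineEnv(goal=5.0)                       # A = 0, B = 5
env_bc = LineEnv(goal=10.0, default_start=5.0)   # B = 5, C = 10

history = []
for _ in range(3):
    end_ab = run_episode(env_ab, noisy_policy)   # environment 1 ends at X
    handoff.put(end_ab)                          # transfer X
    start_bc = handoff.get()                     # environment 2 receives X
    end_bc = run_episode(env_bc, noisy_policy, start=start_bc)
    history.append((end_ab, start_bc, end_bc))
```

With two real training processes, the queue would be replaced by whatever inter-process channel is available, and each environment's `reset` would block until the other side has sent its point.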

You can do it with just the callbacks. You can override the _on_step function to start training the second agent once the first one has reached checkpoint B.
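To make the callback idea concrete, here is a library-free sketch of the pattern. Everything in it (`ToyTrainer`, `StopAtCheckpoint`, `on_step`) is hypothetical and is not the stable_baselines API; in the library the hook would be `_on_step` on a `BaseCallback` subclass, which, as far as I know, aborts training when it returns False. This variant stops agent 1 at checkpoint B and then starts agent 2 from that point, rather than launching agent 2 from inside the callback.

```python
class ToyTrainer:
    """Hypothetical training loop that calls a callback once per step."""

    def __init__(self, start, step_size=1.0):
        self.pos = start
        self.step_size = step_size
        self.num_steps = 0

    def learn(self, total_timesteps, callback):
        for _ in range(total_timesteps):
            self.pos += self.step_size       # stand-in for env.step + policy update
            self.num_steps += 1
            if not callback.on_step(self):   # False means "stop training now"
                break
        return self


class StopAtCheckpoint:
    """Stops training once the agent is within `tol` of `checkpoint`."""

    def __init__(self, checkpoint, tol=0.5):
        self.checkpoint = checkpoint
        self.tol = tol
        self.reached_at = None

    def on_step(self, trainer):
        if abs(trainer.pos - self.checkpoint) <= self.tol:
            self.reached_at = trainer.pos    # remember where training stopped
            return False
        return True


# Agent 1 trains until it reaches checkpoint B (here B = 5).
cb_b = StopAtCheckpoint(checkpoint=5.0)
agent1 = ToyTrainer(start=0.0).learn(total_timesteps=100, callback=cb_b)

# Agent 2 starts from wherever agent 1 stopped and heads for C (here C = 10).
cb_c = StopAtCheckpoint(checkpoint=10.0)
agent2 = ToyTrainer(start=cb_b.reached_at).learn(total_timesteps=100, callback=cb_c)
```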

Hmm, it might be possible, but we do not recommend calling train repeatedly in a row (it might redo initializations and other setup each time, possibly leaking memory and also erasing any optimizer statistics).

Right, yes, I did that once and it took a long time to learn anything...

When I tried to do something similar for hierarchical algorithms, I used generators.

@mhtb32
If you are comfortable with the source code, you can use generators. You have to change the learn function so that it yields when you reach the target. Then it boils down to:

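agent1 = ...
agent2 = ...
# learn() here is the modified version that yields, so g1 and g2 are generators
g1 = agent1.learn(...)
g2 = agent2.learn(...)
while your_condition:
    next(g1)  # run agent1 until its next yield
    next(g2)  # run agent2 until its next yield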

But you will have to slightly modify the source.
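As a self-contained illustration of the generator pattern, the sketch below uses a toy class in place of a real model. `ToyAgent` and its `learn` generator are hypothetical stand-ins; the library's own `learn` does not yield unless you patch it as described. Iterating with `zip` is one way to avoid the `StopIteration` that a bare `next()` raises once a generator is exhausted.

```python
class ToyAgent:
    """Hypothetical stand-in for an RL model; not a stable_baselines class."""

    def __init__(self, name):
        self.name = name
        self.episodes_trained = 0

    def learn(self, total_episodes):
        """Generator version of a training loop: pauses after every episode."""
        for _ in range(total_episodes):
            # ... collect one episode and update the policy here ...
            self.episodes_trained += 1
            yield self.episodes_trained   # hand control back to the caller


agent1 = ToyAgent("A->B")
agent2 = ToyAgent("B->C")
g1 = agent1.learn(total_episodes=3)
g2 = agent2.learn(total_episodes=3)

schedule = []
# Each iteration advances agent1 by one episode, then agent2 by one episode.
# zip ends the loop as soon as one of the generators is exhausted.
for n1, n2 in zip(g1, g2):
    schedule.append((agent1.name, n1))
    schedule.append((agent2.name, n2))
```

Because each agent's state lives inside its suspended generator, nothing is re-initialized between turns, which is the property that repeated calls to a training function may not give you.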

I think DLR-RM/stable-baselines3#55 addresses this.
