I want to train an agent to reach point B from point A and then reach point C from point B. The idea is to train two separate agents: one learns the A -> B move and the other learns the B -> C move. There can be multiple ways to do this:
Is it possible to implement the second idea with stable_baselines?
The second one is doable, although not trivial. You need to launch two separate training instances with two different environments (1 and 2), where each environment waits for information from the other: environment 1 plays until it ends at some point X, transfers this information to environment 2 (e.g. via sockets), and environment 2 then initializes itself to start from that point X.
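To make the handoff concrete, here is a minimal single-process sketch. It assumes a toy 1-D world and uses a `queue.Queue` as a stand-in for the socket connection; `StageOneEnv`, `StageTwoEnv`, and the fixed action are hypothetical names made up for illustration, not part of stable_baselines or gym.

```python
import queue


class StageOneEnv:
    """Hypothetical 1-D environment for the A -> B stage (not a gym.Env)."""

    def __init__(self, handoff: queue.Queue, start: float = 0.0, goal: float = 5.0):
        self.handoff = handoff
        self.start = start
        self.goal = goal
        self.pos = start

    def reset(self) -> float:
        self.pos = self.start
        return self.pos

    def step(self, action: float):
        self.pos += action
        done = self.pos >= self.goal
        if done:
            # Publish the end point X so the second environment can start there.
            self.handoff.put(self.pos)
        reward = -abs(self.goal - self.pos)
        return self.pos, reward, done, {}


class StageTwoEnv:
    """Hypothetical environment for the B -> C stage; starts where stage one ended."""

    def __init__(self, handoff: queue.Queue, goal: float = 10.0):
        self.handoff = handoff
        self.goal = goal
        self.pos = 0.0

    def reset(self) -> float:
        # Wait (up to 1 s) for environment 1 to report its end point X.
        self.pos = self.handoff.get(timeout=1.0)
        return self.pos

    def step(self, action: float):
        self.pos += action
        done = self.pos >= self.goal
        reward = -abs(self.goal - self.pos)
        return self.pos, reward, done, {}


handoff = queue.Queue()
env1 = StageOneEnv(handoff)
env2 = StageTwoEnv(handoff)

env1.reset()
done = False
while not done:
    _, _, done, _ = env1.step(1.5)  # fixed action in place of a policy

start_two = env2.reset()  # environment 2 begins at environment 1's end point
print(start_two)
```

With two real training processes, the queue would be replaced by whatever transport you pick (sockets, pipes, a multiprocessing queue), but the blocking `reset` in environment 2 is the part that ties the two runs together.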
Sidenote: If possible, I would just create one environment for "A -> B" and a second one for "B -> C", and train separate agents on them, but I assume A, B and C are not known (or at least B is not known).
You can do it with just the callbacks. You can override the _on_step function to start training the second agent once the first one has reached checkpoint B.
Hmm, it might be possible, but we do not recommend calling train repeatedly in a row (it might redo initialization and other setup each time, possibly leaking memory and also erasing any optimizer statistics).
Right, yes, I did that once and it took a long time to learn anything...
When I tried to do something similar for hierarchical algorithms, I used generators.
@mhtb32
If you are comfortable with the source code, you can use generators. You have to change the learn function so that it yields when you reach the target. Then it boils down to:
agent1 = ...
agent2 = ...
g1 = agent1.learn(...)
g2 = agent2.learn(...)
while your_condition:
    next(g1)
    next(g2)
But you will have to slightly modify the source.
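For anyone who wants to see the generator pattern run end to end, here is a self-contained sketch. `ToyAgent` and its `learn(total_updates, yield_every)` signature are hypothetical stand-ins, not the stable_baselines API; in the real library you would have to patch `learn` yourself to add the `yield`.

```python
from typing import Iterator


class ToyAgent:
    """Hypothetical stand-in for an RL agent; not a stable_baselines class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.updates = 0  # number of training updates performed so far

    def learn(self, total_updates: int, yield_every: int) -> Iterator[int]:
        """Generator version of a training loop.

        Hands control back to the caller every `yield_every` updates while
        keeping all local state alive between calls to next().
        """
        for step in range(1, total_updates + 1):
            self.updates += 1  # placeholder for one real training update
            if step % yield_every == 0:
                yield self.updates


def interleave(g1: Iterator[int], g2: Iterator[int]) -> int:
    """Alternate between two generators until either one is exhausted."""
    rounds = 0
    while True:
        try:
            next(g1)
            next(g2)
        except StopIteration:
            break
        rounds += 1
    return rounds


agent1 = ToyAgent("A->B")
agent2 = ToyAgent("B->C")
rounds = interleave(agent1.learn(100, 10), agent2.learn(100, 10))
print(rounds, agent1.updates, agent2.updates)
```

Because each generator is suspended rather than restarted, nothing is re-initialized between turns, which is the main difference from calling the training function repeatedly.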
I think DLR-RM/stable-baselines3#55 addresses this.