Stable-baselines: [question] Custom environment recommendations

Created on 18 Mar 2019 · 6 Comments · Source: hill-a/stable-baselines

Hi! I am trying to create some RL-based agents in my custom Unity ML-agents environment. I implemented all required functions in the Env but I have several questions:

  • should the environment reset itself every particular number of timesteps? In my case it is important to learn behavior from different perspectives (a new one is generated on each reset), yet I did not find any env.reset() calls in learn function. Should I call it myself, repeat the learn() call, or do something else?
  • what happens when the agent reaches its target? (when the environment is "done"?) I noticed some freezing when one of the agents sends done signal, perhaps the environment should take care of such situation and reset this agent environment? Or should it ignore this fact and wait for all environments to reset? Is it somehow taken care of in the baselines lib?

For now I am using the A2C algorithm with several concurrent environments. Of course, I can provide any necessary additional information on my setup.

custom gym env question

All 6 comments

Hello,

yet I did not find any env.reset() calls in learn function

It depends on which algorithm you are using. For instance, PPO2/A2C use VecEnv that reset automatically (as stated in the doc).
For other algorithms, like SAC, the reset is explicit.
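To make the auto-reset behavior concrete, here is a minimal plain-Python sketch of the idea. It is not the stable-baselines implementation; `ToyEnv` and `AutoResetVec` are hypothetical names invented for illustration, and the environments only mimic the gym-style `reset()`/`step()` interface.

```python
class ToyEnv:
    """Hypothetical stand-in for a custom env with gym-style reset/step."""

    def __init__(self, episode_length):
        self.episode_length = episode_length
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # initial observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_length
        return float(self.t), 1.0, done, {}


class AutoResetVec:
    """Illustrative vectorized wrapper (assumption, not the library class):
    a sub-environment is reset as soon as it reports done."""

    def __init__(self, envs):
        self.envs = envs

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            ob, reward, done, info = env.step(action)
            if done:
                # Only the finished env restarts; the others keep running.
                # The returned observation is the first one of the new episode.
                ob = env.reset()
            obs.append(ob)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return obs, rewards, dones, infos


vec = AutoResetVec([ToyEnv(episode_length=2), ToyEnv(episode_length=3)])
vec.reset()
history = []
for _ in range(3):
    obs, rewards, dones, infos = vec.step([0, 0])
    history.append((obs, dones))
```

In this sketch each environment finishes and restarts on its own schedule, so the training loop never has to call `reset()` itself and no environment waits for the others.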

should take care of such situation and reset this agent environment?

I assume you are talking about VecEnv, then the answer is in the previous paragraph ;)

Btw, if you are using A2C with continuous actions, there is a bug in the current implementation that is fixed in #206 (will be merged soon, but the fix is only one line of code), I would recommend you to either use PPO2 (until it is merged) or fix the code (see commit https://github.com/hill-a/stable-baselines/pull/206/commits/689afd16f5b07d2fead1fa5e8474a8efa2826a64).

Thanks for your answer!

Btw, if you are using A2C with continuous actions, there is a bug in the current implementation that is fixed in #206 (will be merged soon, but the fix is only one line of code), I would recommend you to either use PPO2 (until it is merged) or fix the code (see commit 689afd1).

That is very good to know. I've been training A2C for the last few days and concentrated on my reward function, thinking that it could be the reason the agent was not learning. I will now try PPO2 to check that I configured everything correctly, and then try the fixed A2C.

should the environment reset itself every particular number of timesteps?

I recommend just returning done = True after some timeout even if the target is not reached. Be sure to omit any terminal rewards in that case. Do not call reset() manually.
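A rough sketch of that timeout pattern, assuming a gym-style `step()` that returns `(obs, reward, done, info)`. The class `TimeoutEnv`, its toy 1-D dynamics, and the `timed_out` info key are all made up for illustration; a real Unity-backed environment would look different.

```python
class TimeoutEnv:
    """Hypothetical env: the episode ends when the target is reached
    or after max_steps, whichever comes first."""

    def __init__(self, target=5, max_steps=10):
        self.target = target
        self.max_steps = max_steps
        self.position = 0
        self.steps = 0

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        self.position += action
        self.steps += 1
        reached = self.position >= self.target
        timed_out = self.steps >= self.max_steps and not reached
        # The terminal bonus is given only for a real success,
        # never for an episode that merely ran out of time.
        reward = 1.0 if reached else 0.0
        done = reached or timed_out
        return self.position, reward, done, {"timed_out": timed_out}


env = TimeoutEnv(target=5, max_steps=3)
env.reset()
for _ in range(3):
    obs, reward, done, info = env.step(0)  # the agent never moves
```

After the third step the episode ends with `done = True` and no terminal reward, so a vectorized wrapper that auto-resets can start a fresh episode without any manual `reset()` call.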

Hi @PiotrJZielinski ,

Were you able to implement PPO2 and use it in your custom environment?

Hi @valdezf10 ,
Yes, I did use PPO2 for my solution. It worked fine; however, the Unity ML-agents API has changed since then and I did not upgrade my app.

Looks like this issue is solved
