I have implemented PPO2 with MlpPolicy on a custom environment. I can train and test it and have gotten some reasonable results.
Now I want to switch to MlpLstmPolicy, but I think I do not understand how the input should be structured during the prediction phase.
For MlpPolicy I was creating an observation space like this:
obs_space = spaces.Box(low=0, high=1, shape=(history_lookback + 1, num_features))
So it was a matrix (number of steps to look back in time * num_features) and everything worked fine.
Now with MlpLstmPolicy I thought I have to pass only one row of features per time step, resulting in a simple vector as input. Accordingly, I adjusted the observation space and the method for retrieving the next observation. Training seemed to work fine, but when I try to use the model to predict on an input vector I receive the following error:
ValueError: Cannot feed value of shape (1, 1, 15) for Tensor 'input/Ob:0', which has shape '(4, 1, 15)'
It seems that the model wants to get 4 rows instead of one?
Here is the relevant code I am using for testing the model (not changed since using MlpPolicy):
while not done:
    action, _states = model.predict(obs)
    obs, reward, done, info = test_env.step(action)
obs is indeed a vector of length num_features.
These are the (imo) relevant parameters I am using locally:
{
"n_steps" : NumberInt(128),
"learning_rate" : 0.00025,
"max_grad_norm" : 0.5,
"lam" : 0.95,
"nminibatches" : NumberInt(2),
"noptepochs" : NumberInt(2),
"cliprange" : 0.2,
"_init_setup_model" : true,
"gamma" : 0.99,
"vf_coef" : 0.5,
"n_cpu" : 4,
}
Many thanks in advance!!!
(edit: I am using stable-baselines 2.6.0 if that is important)
Duplicate of https://github.com/hill-a/stable-baselines/issues/166
See answer here: https://github.com/hill-a/stable-baselines/issues/166#issuecomment-456374442
We should maybe add that to the documentation.
Ok, thank you! That did the trick for me.
Here is my final code for testing the MlpLstmPolicy:
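import numpy as np

init_obs = test_env.reset()
done = False
state = None  # LSTM state; None lets predict() start from the initial state
n_cpu = 4  # number of environments used during training
# The model expects a batch of n_cpu observations, so pad with zeros
zero_completed_obs = np.zeros((n_cpu,) + test_env.observation_space.shape)
zero_completed_obs[0, :] = init_obs
while not done:
    action, state = model.predict(zero_completed_obs, state=state)
    # Only the first action belongs to the real test environment
    new_obs, reward, done, info = test_env.step(action[0])
    zero_completed_obs[0, :] = new_obs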
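For anyone reusing this, the zero completion can be pulled out into a small helper. This is only a sketch: `zero_complete` is a hypothetical name, not part of stable-baselines, and the shapes in the example are taken from the error message above (4 training envs, observation shape (1, 15)).

```python
import numpy as np


def zero_complete(obs, n_envs):
    """Put a single observation in slot 0 of an all-zero batch of n_envs.

    Hypothetical helper, not a stable-baselines function.
    """
    obs = np.asarray(obs)
    batch = np.zeros((n_envs,) + obs.shape, dtype=obs.dtype)
    batch[0] = obs
    return batch


# Shapes assumed from the error message: 4 training envs, obs shape (1, 15)
single_obs = np.random.rand(1, 15)
batch = zero_complete(single_obs, n_envs=4)
print(batch.shape)  # (4, 1, 15)
```

The result has the batch shape the model was built with, and only the first entry carries real data.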
Just for clarification there are a few questions open:
Do I have to pass n_cpu observations (usually equal to n_training_envs) in the model.predict(...) call because the model at that point just expects the same shape, despite using only one environment in the end?
yes
Do I use action[0] because I only need the action related to the first (and only) env? Can I just ignore the others?
yes
If I am right with both of the above questions, the interface differs from that of MlpPolicy, although both policies use multiple envs for training.
The interface is slightly different (input and output are the same, but the expected shapes differ).
The main issue here is that it seems not to be properly documented.
Is there a reason why we have to use a different interface here?
This comes from TensorFlow, and it's hard to solve (the LSTM code is not the easiest to read either). We could maybe do the zero completion inside the agent though (but this won't work if the user provides an observation with n_envs > n_training_envs).
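A rough sketch of what zero completion inside the agent could look like, including the case where it cannot work. Everything here is an assumption for illustration: `predict_padded` and `fake_predict` are hypothetical names, and `fake_predict` only mimics a model with a fixed batch size; it is not the stable-baselines implementation.

```python
import numpy as np


def predict_padded(predict_fn, obs, n_training_envs, state=None):
    """Pad a batch of k <= n_training_envs observations with zeros,
    call predict_fn on the full batch, and return only the first k actions."""
    obs = np.asarray(obs)
    k = obs.shape[0]
    if k > n_training_envs:
        # Padding cannot help here: the batch is already too large
        raise ValueError(
            "got %d observations, but the model was built for %d envs"
            % (k, n_training_envs)
        )
    padded = np.zeros((n_training_envs,) + obs.shape[1:], dtype=obs.dtype)
    padded[:k] = obs
    actions, state = predict_fn(padded, state)
    # The state is kept for all envs so it can be fed back on the next call
    return actions[:k], state


def fake_predict(batch, state):
    """Hypothetical stand-in for a model with a fixed batch size of 4."""
    if batch.shape[0] != 4:
        raise ValueError("expected a batch of 4 observations")
    return batch.sum(axis=1), state


actions, state = predict_padded(fake_predict, np.ones((1, 15)), n_training_envs=4)
print(actions)  # [15.]
```

Passing more than n_training_envs observations raises a ValueError, which is the limitation mentioned above.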
Thanks for the clarification!