Stable-baselines: [question] MlpLstmPolicy - input values

Created on 9 Jan 2020 · 4 Comments · Source: hill-a/stable-baselines

I have implemented PPO2 with MlpPolicy on a custom environment. Training and testing both work and give plausible results.

Now I want to switch to MlpLstmPolicy, but I do not understand how the input should be structured during the prediction phase.

For MlpPolicy I was creating an observation space like this:
obs_space = spaces.Box(low=0, high=1, shape=(history_lookback + 1, num_features))
So the observation was a matrix of shape (history_lookback + 1, num_features), i.e. one row of features per look-back step, and everything worked fine.

With MlpLstmPolicy I assumed I only have to pass one row of features per time step, so the input becomes a simple vector. I adjusted the observation space and the method for retrieving the next observation accordingly. Training seems to work fine, but when I use the model to predict on an input vector I get the following error:
ValueError: Cannot feed value of shape (1, 1, 15) for Tensor 'input/Ob:0', which has shape '(4, 1, 15)'

It seems that the model wants to get 4 rows instead of one?
Here is the relevant code I am using for testing the model (not changed since using MlpPolicy):

        while not done:
            action, _states = model.predict(obs)
            obs, reward, done, info = test_env.step(action)

obs is indeed a vector of length num_features.

These are the (imo) relevant parameters I am using locally:

{
        "n_steps" : NumberInt(128), 
        "learning_rate" : 0.00025, 
        "max_grad_norm" : 0.5, 
        "lam" : 0.95, 
        "nminibatches" : NumberInt(2), 
        "noptepochs" : NumberInt(2), 
        "cliprange" : 0.2, 
        "_init_setup_model" : true, 
        "gamma" : 0.99, 
        "vf_coef" : 0.5,
        "n_cpu" : 4,
}

Many thanks in advance!!!

(edit: I am using stable-baselines 2.6.0 if that is important)

documentation duplicate question

tonyskulk

All 4 comments

Duplicate of https://github.com/hill-a/stable-baselines/issues/166
See answer here: https://github.com/hill-a/stable-baselines/issues/166#issuecomment-456374442

We should maybe add that to the documentation.

araffin on 9 Jan 2020

Ok, thank you! That did the trick for me.
Here is my final code for testing the MlpLstmPolicy:

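        import numpy as np

        n_cpu = 4  # number of environments the model was trained with
        init_obs = test_env.reset()
        done = False
        state = None  # no LSTM state yet; the state returned by predict() is fed back in below
        # The model expects a batch of n_cpu observations: pad with zeros
        # and put the real observation in slot 0.
        zero_completed_obs = np.zeros((n_cpu,) + test_env.observation_space.shape)
        zero_completed_obs[0, :] = init_obs
        while not done:
            action, state = model.predict(zero_completed_obs, state=state)
            # Only the first action belongs to the real test environment.
            new_obs, reward, done, info = test_env.step(action[0])
            zero_completed_obs[0, :] = new_obs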
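
For anyone who wants to check the shapes without loading a model: the padding step above can be tried in isolation with NumPy only. This is just a sketch. `pad_observation` is a hypothetical helper name (not part of stable-baselines), and the observation shape (1, 15) and the 4 training envs are taken from the error message in the question.

```python
import numpy as np


def pad_observation(obs, n_envs):
    """Put one observation in slot 0 of a zero batch of shape (n_envs, *obs.shape).

    Hypothetical helper for illustration; not a stable-baselines function.
    """
    obs = np.asarray(obs)
    batch = np.zeros((n_envs,) + obs.shape, dtype=obs.dtype)
    batch[0] = obs  # slots 1..n_envs-1 stay zero and are ignored later
    return batch


# Shapes taken from the error message: one observation of shape (1, 15),
# model trained with 4 environments.
single_obs = np.random.rand(1, 15).astype(np.float32)
padded = pad_observation(single_obs, n_envs=4)
print(padded.shape)  # (4, 1, 15)
```

The padded array has the shape the 'input/Ob:0' tensor asks for, and only row 0 carries real data.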

Just for clarification there are a few questions open:

Am I using n_cpu (usually equal to n_training_envs) in the model.predict(...) call because the model at that point just expects the same shape, despite using only one environment in the end?
I have to use action[0] instead of action here because I will use the action related to the first (and only) env? Can I just ignore the others?
If I am right with both of the above questions, the interface differs from using an MlpPolicy, despite both policies using multiple envs for training. Is there a reason why we have to use a different interface here? (It would be more convenient to have one interface when trying different policies/algorithms for your problems.)

tonyskulk on 10 Jan 2020

because the model at that point just expects the same shape, despite using only one environment in the end?

yes

because I will use the action related to the first (and only) env? Can I just ignore the others?

yes

If I am right with both of the above questions, the interface differs from using an MlpPolicy, despite both policies using multiple envs for training.

The interface is slightly different (input and output are the same, but the expected shapes differ).
The main issue here is that it seems not to be properly documented.

Is there a reason why we have to use a different interface here?

This comes from tensorflow, and it's hard to solve (the lstm code is not the easiest to read either). We could maybe do the zero completion inside the agent though (but this won't work if the user provides an observation with n_envs > n_training_envs).
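
To make the caveat concrete, here is a rough sketch of what such a zero completion could look like, written as a standalone NumPy function. It is not how stable-baselines does it. `zero_complete` and its arguments are hypothetical names, and the observation size of 15 is only borrowed from the error message above.

```python
import numpy as np


def zero_complete(obs_batch, n_training_envs):
    """Pad a batch of observations with zeros up to n_training_envs rows.

    Hypothetical illustration only. Returns the padded batch and the number
    of real observations, so the caller knows which outputs to keep.
    """
    obs_batch = np.asarray(obs_batch)
    n_given = obs_batch.shape[0]
    if n_given > n_training_envs:
        # The case mentioned above: padding cannot help when more
        # observations are passed than the model was trained with.
        raise ValueError(
            "got %d observations, but the model was trained with %d envs"
            % (n_given, n_training_envs)
        )
    padded = np.zeros((n_training_envs,) + obs_batch.shape[1:], dtype=obs_batch.dtype)
    padded[:n_given] = obs_batch
    return padded, n_given


two_envs = np.ones((2, 15), dtype=np.float32)
padded, n_given = zero_complete(two_envs, n_training_envs=4)
print(padded.shape, n_given)  # (4, 15) 2
```

With something like this, the actions coming back from model.predict(padded, state=state) would be cut down to the first n_given entries, which for a single test env is the same as taking action[0].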

araffin on 10 Jan 2020

Thanks for clarification!

tonyskulk on 14 Jan 2020
