Stable-baselines: [question] Questions about MlpLstmPolicy

Created on 8 Jan 2020 · 14 Comments · Source: hill-a/stable-baselines

I successfully implemented PPO2 with MlpPolicy with two different custom environments I built. Now I want to extend to MlpLstmPolicy in one of my games.

I tried to understand the MlpLstmPolicy by reading the source code but it's a bit involved. So several questions:

1) If successfully implemented, does the LSTM memorize only the steps taken within a single game? Or does it also memorize the steps it took in previous games (before resetting)?

Follow-up question: if the answer to the second question is no, is there any way to achieve this? Concretely, I want my agent to come up with paths that are vastly different from those of previous games (quantitatively measured by correlation). Implementing curiosity might help, but it does not directly learn to find paths distinct from the previous games.

2) What role does the variable nminibatches play in training? Does it only affect the training speed?

3) I tried replacing MlpPolicy with MlpLstmPolicy in my game directly, without changing anything else, and learning appears to be much worse: even after many more learning steps, the reward is far below what MlpPolicy achieves. Are there general tips for using MlpLstmPolicy, or modifications that are necessary when switching from MlpPolicy to MlpLstmPolicy?

Thanks a million in advance!

question

matthew-hsr

👍1

Most helpful comment

1) The LSTM only memorizes the past _inside a single game_; it does not remember anything from outside that episode.
2) nminibatches specifies the number of minibatches to use when updating the policy on the gathered samples. E.g. if you have gathered 1000 samples in total and nminibatches=4, it will split the samples into four minibatches of 250 elements each and do parameter updates on these batches noptepochs times.
3) LSTMs are generally harder to train than non-recurrent networks (more parameters, gradients that depend on multiple timesteps, etc.), and the implementation here is probably not one of the best (see e.g. the R2D2 paper for research on this). I would run it at least 5x longer than the non-recurrent version to see whether learning simply starts later.
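
To make the arithmetic in point 2 concrete, here is a minimal plain-Python sketch of how a batch could be cut into minibatches and revisited for several epochs. It is not the stable-baselines implementation: the helper name `iterate_minibatches` is made up for illustration, and the value of 4 for `noptepochs` is simply assumed for the example.

```python
import random

def iterate_minibatches(n_samples, nminibatches, noptepochs, seed=0):
    """Yield (epoch, indices) pairs: every epoch reshuffles the sample
    indices and cuts them into `nminibatches` equally sized chunks."""
    assert n_samples % nminibatches == 0, "batch must split evenly"
    batch_size = n_samples // nminibatches
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for epoch in range(noptepochs):
        rng.shuffle(indices)
        for start in range(0, n_samples, batch_size):
            # Each yielded chunk stands for one gradient update.
            yield epoch, indices[start:start + batch_size]

updates = list(iterate_minibatches(n_samples=1000, nminibatches=4, noptepochs=4))
print(len(updates))        # 16 gradient updates in total
print(len(updates[0][1]))  # 250 samples per update
```

Read this way, nminibatches changes both the number of samples per gradient update and the number of updates per rollout, so it plausibly affects more than training speed. As far as I recall, PPO2 in stable-baselines also requires the number of parallel environments to be a multiple of nminibatches for recurrent policies, which would explain nminibatches=1 with a single environment.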

If you feel something in the docs was unclear on these questions, please point it out so we can fix it :)

Miffyli on 8 Jan 2020

👍2

All 14 comments

Thanks a lot @Miffyli!

Do you have any suggestions for the follow-up question too? Thanks again!

matthew-hsr on 8 Jan 2020

A quick idea that comes to mind is simply extending one episode to span multiple game episodes. E.g. the episode does not end when the player dies, but only after the player has died, say, 10 times. RNNs could then carry memory across these games. However, I do not know if this would help with your problem at all. Things like curiosity sound better if you want different paths in different games.
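
The "one episode spans several games" idea could be sketched roughly as below. This is only an illustration under assumptions: `MultiEpisodeEnv` and `CountdownEnv` are hypothetical names, the wrapper is duck-typed instead of subclassing `gym.Wrapper`, and it assumes the classic gym step signature `(obs, reward, done, info)`.

```python
class MultiEpisodeEnv:
    """Hypothetical wrapper: reports done=True only after `games_per_episode`
    inner games have finished, so a recurrent state can span several games."""

    def __init__(self, env, games_per_episode=10):
        self.env = env
        self.games_per_episode = games_per_episode
        self.games_finished = 0

    def reset(self):
        self.games_finished = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        if done:
            self.games_finished += 1
            if self.games_finished < self.games_per_episode:
                # Inner game over: restart it silently and keep the outer
                # episode (and hence the recurrent state) alive.
                obs = self.env.reset()
                done = False
        return obs, reward, done, info


class CountdownEnv:
    """Toy stand-in for a real game: every inner game lasts exactly 3 steps."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return 0, 1.0, self.t == 3, {}


env = MultiEpisodeEnv(CountdownEnv(), games_per_episode=10)
env.reset()
steps, done = 0, False
while not done:
    _, _, done, _ = env.step(0)
    steps += 1
print(steps)  # 30: ten inner games of three steps each
```

For use with stable-baselines the wrapper would also need to expose `action_space` and `observation_space` (for instance by subclassing `gym.Wrapper`), and the agent may need some signal in the observation that an inner game has just restarted.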

A possibly tangentially related paper to this could be MERLIN where the agent learns to remember where the goal is in the level, and then can navigate to it easily.

Miffyli on 8 Jan 2020

Thanks a lot @Miffyli! I will read the paper and see if it helps!

In addition, I am trying a very simple custom environment to test the LSTM policy. However, it fails at something I expected to be easy for an agent with an LSTM policy.

In particular, there are only two possible actions, 0 or 1, and the observation is always 0.

With a memory, the agent should easily figure out the steps required to achieve full score, but both MlpLstmPolicy and MlpLnLstmPolicy are unable to reach the optimal result (and are in fact worse than plain MlpPolicy, even though they are trained with more episodes).

I thought it might be because the memory only remembers the states seen, but the LSTM agent still fails to solve these tasks after changing the observation to the previous action.

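import gym
import numpy as np
import matplotlib.pyplot as plt
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv

solns = [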
    [0] + [1],
    ([0] + [1]) * 2,
    ([0] + [1]) * 3,
    [0] * 2 + [1] * 2,
    [0] * 3 + [1] * 3,
    [0] * 4 + [1] * 4,
]

for soln in solns:
    print(f'For soln = {soln}')
    class CustomEnv(gym.Env):
        def __init__(self):
            super(CustomEnv, self).__init__()
            self.action_space = gym.spaces.Discrete(2)
            self.observation_space = gym.spaces.Discrete(1)
            self.step_count = 0
            self.soln = soln

        def step(self, action):
            if action!=self.soln[self.step_count]:
                done = True
                reward = -10
            else:
                done = False
                reward = 1
            self.step_count+=1
            if self.step_count == len(self.soln):
                done = True
                reward = 100
            return 0, reward, done, {}
        def reset(self):
            self.step_count = 0
            return 0
        def render(self, mode='human'):
            pass
        def close (self):
            pass

    env0 = CustomEnv()
    env = DummyVecEnv([lambda: env0])  # The algorithms require a vectorized environment to run
    mlp_model = PPO2(MlpPolicy, env)
    lstm_model = PPO2(MlpLstmPolicy, env, nminibatches=1)

    mlp_model.set_env(env)
    lstm_model.set_env(env)

    def evaluate(model, num_games=100):
        """
        For DummyVecEnv only.
        """
        episode_rewards = [0.0]
        obs = env.reset()
        all_rewards = []
        for i in range(num_games):
            temp_reward = 0
            dones = [False]
            while dones[0]==False:
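                # Note: the LSTM state is not fed back in here; the evaluate_ltsm
                # function posted later in this thread passes state and mask.
                action, _states = model.predict(obs)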
                obs, rewards, dones, info = env.step(action)
                temp_reward += rewards[0]
            obs = env.reset()
            all_rewards.append(temp_reward)
    #     print("Mean reward:", np.mean(np.array(all_rewards)))
        return np.mean(np.array(all_rewards))

    mlp_mean_reward = []
    for i in range(100):
        mlp_model.learn(total_timesteps=200)
        mlp_mean_reward.append(evaluate(mlp_model, num_games=50))
    print(f'For mlp, mean reward is {np.mean(np.array(mlp_mean_reward[-50:])):.2f}')

    plt.plot(mlp_mean_reward)
    plt.show()

    lstm_mean_reward = []
    for i in range(1000):
        lstm_model.learn(total_timesteps=200)
        lstm_mean_reward.append(evaluate(lstm_model, num_games=50))
    print(f'For lstm, mean reward is {np.mean(np.array(lstm_mean_reward[-50:])):.2f}')

    plt.plot(lstm_mean_reward)
    plt.show()

Outputs:

For soln = [0, 1]
For mlp, mean reward is 99.76
For lstm, mean reward is 101.00
For soln = [0, 1, 0, 1]
For mlp, mean reward is 6.69
For lstm, mean reward is -9.00
For soln = [0, 1, 0, 1, 0, 1]
For mlp, mean reward is -5.43
For lstm, mean reward is -8.08
For soln = [0, 0, 1, 1]
For mlp, mean reward is 7.98
For lstm, mean reward is -7.96
For soln = [0, 0, 0, 1, 1, 1]
For mlp, mean reward is -7.00
For lstm, mean reward is -7.00
For soln = [0, 0, 0, 0, 1, 1, 1, 1]
For mlp, mean reward is -6.01
For lstm, mean reward is -6.00

Am I missing something trivial?

matthew-hsr on 9 Jan 2020

Looking at your code, it seems you do not run the LSTM policy for longer; you repeat the experiment 10x more often than with MlpPolicy. Try training both policies for e.g. 1000 or even 10000 steps to see if that makes a difference. I would also try a somewhat more classical recurrence-testing environment, where the agent has to recall the value seen N steps ago. It _should_ be able to solve that one eventually with enough training, but with all the nuisance factors of RL algorithms it may have a hard time doing so. As far as I know, the current LSTM implementation has been useful in robotics/locomotion tasks.

Miffyli on 9 Jan 2020

I made the game easier, i.e. select the observation 2 steps ago, so that there is feedback to reach the solution (reward = -np.abs(action - self.soln[self.step_count - 2])). I also trained the models 5000 steps instead. However, the LSTM versions still do not outperform the plain MLP version.

I also corrected the evaluation code for LSTM, feeding in the internal state to the agent (It may be a good idea to add some documentation to the code - I only figured out that is needed after reading through the underlying code).

I tried both PPO2 and A2C as the agent but the results are the same.

I must be missing something and I'd really appreciate some help / insight.

Thanks in advance!

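import gym
import numpy as np
import matplotlib.pyplot as plt
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv

action_space_size = 10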
observation_space_size = 10
soln_length = 1000

class CustomEnv(gym.Env):
    def __init__(self):
        self.action_space = gym.spaces.Discrete(action_space_size)
        self.observation_space = gym.spaces.Discrete(observation_space_size)
        self.step_count = 0

    def step(self, action, done = False):
        if self.step_count < 2:
            reward = 0
        else: # step_count>=2
            reward =  -np.abs(action - self.soln[self.step_count - 2])
        self.step_count+=1
        if self.step_count == soln_length-1:
            done = True          
        return self.soln[self.step_count], reward, done, {}
    def reset(self):
        self.step_count = 0
        self.soln = np.random.randint(action_space_size, size = soln_length)
        return self.soln[0]

env0 = CustomEnv()
env = DummyVecEnv([lambda: env0])  # The algorithms require a vectorized environment to run
mlp_model = PPO2(MlpPolicy, env)
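lstm_model = PPO2(MlpLstmPolicy, env, nminibatches=1)
# lnlstm_model is used further down but was not defined in the posted code;
# presumably it was created the same way.
lnlstm_model = PPO2(MlpLnLstmPolicy, env, nminibatches=1)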

mlp_model.set_env(env)
lstm_model.set_env(env)

def evaluate(model, num_games=100):
    """
    For DummyVecEnv only.
    """
    episode_rewards = [0.0]
    obs = env.reset()
    all_rewards = []
    for i in range(num_games):
        temp_reward = 0
        dones = [False]
        while dones[0]==False:
            action, _states = model.predict(obs)
            obs, rewards, dones, info = env.step(action)
            temp_reward += rewards[0]
        obs = env.reset()
        all_rewards.append(temp_reward)
    return np.mean(np.array(all_rewards))

def evaluate_ltsm(model, num_games=100):
    """
    For DummyVecEnv, LSTM only.
    """
    episode_rewards = [0.0]
    obs = env.reset()
    all_rewards = []
    for i in range(num_games):
        _states = None
        temp_reward = 0
        dones = [False]
        while dones[0]==False:
            action, _states = model.predict(obs, 
                                            state=_states,
                                            mask=dones)
            obs, rewards, dones, info = env.step(action)
            temp_reward += rewards[0]
        obs = env.reset()
        all_rewards.append(temp_reward)
    return np.mean(np.array(all_rewards))

mlp_mean_reward = []
for i in range(100):
    mlp_model.learn(total_timesteps=soln_length*5)
    mlp_mean_reward.append(evaluate(mlp_model, num_games=50))
print(f'For mlp, mean reward is {np.mean(np.array(mlp_mean_reward[-50:])):.2f}')

lstm_mean_reward = []
for i in range(100):
    print(i)
    lstm_model.learn(total_timesteps=soln_length*5)
    lstm_mean_reward.append(evaluate_ltsm(lstm_model, num_games=50))

print(f'For lstm, mean reward is {np.mean(np.array(lstm_mean_reward[-50:])):.2f}')

plt.plot(lstm_mean_reward)
plt.show()

lnlstm_mean_reward = []
for i in range(100):
    lnlstm_model.learn(total_timesteps=soln_length*5)
    lnlstm_mean_reward.append(evaluate_ltsm(lnlstm_model, num_games=50))
print(f'For lnlstm, mean reward is {np.mean(np.array(lnlstm_mean_reward[-50:])):.2f}')

plt.plot(lnlstm_mean_reward)
plt.show()

Output:

For mlp, mean reward is -2494.52
For lstm, mean reward is -2490.98
For lnlstm, mean reward is -2492.31
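
One way to read these numbers, offered as an observation and not as a diagnosis from the thread: the sketch below computes the expected episode return of a policy that ignores its input and always outputs the same action. It assumes targets drawn uniformly from 0..9 (as `np.random.randint(action_space_size, ...)` does) and that 997 of the 999 steps per episode are scored, since the first two steps give reward 0.

```python
action_space_size = 10
soln_length = 1000
scored_steps = (soln_length - 1) - 2  # 999 steps per episode, first two unscored

def expected_abs_error(action):
    """Mean |action - target| when the target is uniform over 0..9."""
    return sum(abs(action - t) for t in range(action_space_size)) / action_space_size

# Expected episode return for each constant action.
constant_returns = {a: -scored_steps * expected_abs_error(a)
                    for a in range(action_space_size)}
best_action = max(constant_returns, key=constant_returns.get)
print(best_action, constant_returns[best_action])  # 4 -2492.5 (action 5 ties)

# Expected return when actions are drawn uniformly at random instead.
random_return = sum(constant_returns.values()) / action_space_size
print(round(random_return, 1))  # -3290.1
```

All three reported means (-2494.52, -2490.98, -2492.31) sit close to -2492.5. That would be consistent with each policy having learned the best constant guess (action 4 or 5) without recalling the observation from two steps back, in which case the MLP and LSTM results are not really distinguishable on this task.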

matthew-hsr on 10 Jan 2020

Or, more simply, does anyone have any example code of applying MlpLstmPolicy to custom environment that I may refer to? I feel that I must be missing something trivial...

matthew-hsr on 10 Jan 2020

Hmm looking over the code it does seem right, sorry for not spotting the lack of states in the first one ^^'. One last thing I would try is even longer training, but apart from that I am out of ideas.

@araffin

RNNs in RL are not my forte. Is this expected behavior, and/or should LSTMs be tested in a different environment?

Miffyli on 10 Jan 2020

> Is this expected behavior, and/or should LSTMs be tested in a different environment?

I don't have much time to invest in this issue, but if I understand correctly, the question is how to get better performance with LSTM?
My answer would be hyperparameter optimization (cf. doc).

Otherwise, I don't have much experience with RL + RNN (when I used RNNs, it was mostly for NLP); usually stacking frames (a history of observations) + MLP is enough...
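
The frame-stacking alternative mentioned here can be sketched without any RL library. `ObservationHistory` is a hypothetical helper written for this illustration; it keeps the last few observations and hands them to a non-recurrent policy as one tuple.

```python
from collections import deque

class ObservationHistory:
    """Hypothetical helper: keeps the last `n_stack` observations and returns
    them as one tuple, oldest first, padded with the first observation."""

    def __init__(self, n_stack):
        self.n_stack = n_stack
        self.frames = deque(maxlen=n_stack)

    def reset(self, first_obs):
        self.frames.clear()
        self.frames.extend([first_obs] * self.n_stack)
        return tuple(self.frames)

    def push(self, obs):
        self.frames.append(obs)  # the oldest entry drops out automatically
        return tuple(self.frames)


history = ObservationHistory(n_stack=3)
print(history.reset(7))  # (7, 7, 7)
print(history.push(2))   # (7, 7, 2)
print(history.push(5))   # (7, 2, 5)
print(history.push(9))   # (2, 5, 9)
```

With n_stack=3 the value seen two steps ago is the first element of the tuple, which is what the recall task earlier in this thread rewards. stable-baselines also ships a `VecFrameStack` wrapper for vectorized environments; whether it supports a given observation space is worth checking in its documentation.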

araffin on 10 Jan 2020

@araffin Noted. Question was on how to test the current LSTM implementation if it works right, and so far there was trouble to solve a simple recall environment. I hazard a guess that the issue lies in the problems that arise when you throw together RNNs and RL algorithms.

@matthew-hsr
I know the LSTM implementation has improved performance in some RL environments (video games), so the problem may lie in hyperparameter tuning. You could look at rl-zoo for parameters and hints that could work (afaik it does not include LSTM runs). Edit: See the reply below this one.

Miffyli on 10 Jan 2020

> Question was on how to test the current LSTM implementation if it works right, and so far there was trouble to solve a simple recall environment.

@Miffyli

We have a test for that ;)
https://github.com/hill-a/stable-baselines/blob/master/tests/test_lstm_policy.py#L43 (see PR https://github.com/hill-a/stable-baselines/pull/244)

araffin on 10 Jan 2020

👍1

@matthew-hsr @araffin @Miffyli does the observation space need to contain information about past time steps when using any of the policies with an Lstm? For instance, in the example by @matthew-hsr, the observation space only contains 1 time step of information, i.e. the state at time t. Now, I am wondering if instead it should contain e.g. the last 10 time steps of information, in this context a vector with values from t-10 to t.

When training an LSTM in e.g. Keras, the input X generally takes the shape [number of observations, number of time steps per observation, number of features per observation]. In other words, every observation is a matrix with dimensions [number of time steps per observation, number of features per observation], meaning that every observation contains several time steps of information. Hence, I am wondering if this applies to the context of RL as well or if, instead, every observation should only contain 1 time step of information.
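
The two input conventions contrasted above can be illustrated with a toy recurrence. This is a sketch only: `cell` is a made-up leaky running sum, not an LSTM, and nothing here is taken from the stable-baselines source.

```python
def cell(state, x):
    """Toy recurrent update (not a real LSTM): a leaky running sum."""
    return 0.5 * state + x

def run_on_window(window):
    """Window-style usage: hand over all time steps of one sample at once."""
    state = 0.0
    for x in window:
        state = cell(state, x)
    return state

# Step-style usage: one observation per call, the caller carries the state.
observations = [1.0, 2.0, 3.0, 4.0]
state = 0.0
for obs in observations:
    state = cell(state, obs)

print(state == run_on_window(observations))  # True
```

The evaluation loop posted earlier in this thread follows the step-style pattern: it passes a single observation to `model.predict` together with the `state` returned by the previous call. Whether the recurrent policies additionally benefit from observations that contain several past time steps is not settled in this thread.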

I was not sure if I should create a new issue for this question (in addition to this one here and this one: https://github.com/hill-a/stable-baselines/issues/667). I thought it might fit here because an example is already provided. In any case, it would be great to have some information in the documentation about what the observation space for recurrent policies should look like.