I'm looking into the source code of PPO + LSTM, but I found that I cannot use both LSTM and GAE for the value function.
In the paper "Learning Dexterous In-Hand Manipulation", OpenAI successfully used both LSTM and GAE.
Why can't I use both LSTM and GAE in the framework?
If I remove the vf_config["use_lstm"] = False statement, can I use LSTM for the value function without any problem?
Thank you for your help.
Hey @whikwon , this was out of expediency in implementation; there's not a good reason why you can't. I believe you can use LSTM for the value function with some changes to the ppo policy (see the diff further down that I was originally considering).
This will enable rllib to track the RNN state for both models during rollouts. It might also make sense to share the lstm cells, in which case you can redefine the vf model to instead share self.model.last_layer, which is the lstm cell output, and you don't have to add anything to state_inputs/outputs.
@ericl Thank you so much. I'll try it.
@ericl I'm having trouble understanding the code; please check whether I understood it correctly.
Question
To implement the LSTM for the vf, do I have to handle the states of the model / vf_model attributes separately?
I think you're right; that's needed if you use two different sets of placeholders for (state_in, state_out, seq_len). However, if you use a "combined model" that produces both logits and vf outputs from a single set of (state_in, state_out, seq_len), then it can be used in rllib out of the box.
The simplest solution is to share layers, e.g. the following works:
--- a/python/ray/rllib/agents/ppo/ppo_policy_graph.py
+++ b/python/ray/rllib/agents/ppo/ppo_policy_graph.py
@@ -161,8 +161,9 @@ class PPOPolicyGraph(TFPolicyGraph):
             vf_config["free_log_std"] = False
             vf_config["use_lstm"] = False
             with tf.variable_scope("value_function"):
-                self.value_function = ModelCatalog.get_model(
-                    obs_ph, 1, vf_config).outputs
+                from ray.rllib.models.misc import linear, normc_initializer
+                self.value_function = linear(
+                    self.model.last_layer, 1, "value", normc_initializer(1.0))
             self.value_function = tf.reshape(self.value_function, [-1])
         else:
             self.value_function = tf.zeros(shape=tf.shape(obs_ph)[:1])
Actually the above is how rllib implements value functions in a3c and impala. Probably PPO should do it in the same way. To do more complex things, you can also define a custom LSTM model.
How does env_runner generate states for the model? How can I adapt it for the vf_model?
It will generate the initial state using get_initial_state(), pass it in via state_in, and read it back via state_out. However, as noted, this only works for a single model, not for multiple models.
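To make that concrete, here is a minimal, self-contained sketch of the loop the sampler effectively runs for a single recurrent model. This is not rllib code; DummyPolicy and compute_single_action are made-up stand-ins that only illustrate how the state returned by get_initial_state() is threaded through state_in / state_out step by step:

# Illustrative only: DummyPolicy stands in for a policy graph with one LSTM model.
import numpy as np


class DummyPolicy(object):
    cell_size = 4

    def get_initial_state(self):
        # Analogous to self.model.state_init: one (c, h) pair for a single LSTM.
        return [np.zeros(self.cell_size, np.float32),
                np.zeros(self.cell_size, np.float32)]

    def compute_single_action(self, obs, state):
        # In rllib this would feed `state` into the model's state_in placeholders
        # and read the updated state back from state_out; here we just fake it.
        c, h = state
        return 0, [c + 1.0, h + 1.0]


policy = DummyPolicy()
state = policy.get_initial_state()          # produced once per episode
for _ in range(3):
    obs = np.zeros(8, np.float32)
    action, state = policy.compute_single_action(obs, state)  # state_in -> state_out
print(state)  # the carried RNN state after 3 steps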
Just for reference, originally I was thinking the following would be sufficient (it isn't, because we assume only one set of placeholders for state in/out):
--- a/python/ray/rllib/agents/ppo/ppo_policy_graph.py
+++ b/python/ray/rllib/agents/ppo/ppo_policy_graph.py
@@ -159,10 +159,9 @@ class PPOPolicyGraph(TFPolicyGraph):
             # mean parameters and standard deviation parameters and
             # do not make the standard deviations free variables.
             vf_config["free_log_std"] = False
-            vf_config["use_lstm"] = False
             with tf.variable_scope("value_function"):
-                self.value_function = ModelCatalog.get_model(
-                    obs_ph, 1, vf_config).outputs
+                self.vf_model = ModelCatalog.get_model(obs_ph, 1, vf_config)
+                self.value_function = self.vf_model.outputs
             self.value_function = tf.reshape(self.value_function, [-1])
         else:
             self.value_function = tf.zeros(shape=tf.shape(obs_ph)[:1])
@@ -191,9 +190,9 @@ class PPOPolicyGraph(TFPolicyGraph):
             action_sampler=self.sampler,
             loss=self.loss_obj.loss,
             loss_inputs=self.loss_in,
-            state_inputs=self.model.state_in,
-            state_outputs=self.model.state_out,
-            seq_lens=self.model.seq_lens,
+            state_inputs=self.model.state_in + self.vf_model.state_in,
+            state_outputs=self.model.state_out + self.vf_model.state_out,
+            seq_lens=self.model.seq_lens + self.vf_model.seq_lens,
             max_seq_len=config["model"]["max_seq_len"])

         self.sess.run(tf.global_variables_initializer())
@@ -244,4 +243,4 @@ class PPOPolicyGraph(TFPolicyGraph):
             self._loss, colocate_gradients_with_ops=True)

     def get_initial_state(self):
-        return self.model.state_init
+        return self.model.state_init + self.vf_model.state_init
@ericl Thank you so much.
Are you considering implementing independently modifiable actor and critic networks in rllib?
Is there an example in rllib of an LSTM like the one used in the paper (sharing layers between the value function and the policy)?
Here's an example of sharing LSTM layers: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/a3c/a3c_tf_policy_graph.py#L53
For having independent actor and critic models, since that's a more advanced customization, it should be done by modifying the policy graph, either in place or by plugging in a custom one: https://ray.readthedocs.io/en/latest/rllib-models.html#customizing-policy-graphs
From the link you attached (https://ray.readthedocs.io/en/latest/rllib-models.html#customizing-policy-graphs),
I think independent actor and critic models wouldn't work with a custom policy graph alone, because env_runner would not return the critic's state.
Do I also have to add Model-Based Rollouts to make env_runner return the critic's state?
What you can do is have the actor and critic in the same model class. There can be multiple independent tensorflow graphs in the same model class. For example, the state could be split into two arrays: the actor uses the first 2 elements only and the critic the last 2 only.
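A tiny numpy-only sketch of that splitting, just to illustrate the indexing (the shapes and names are made up; the full TF version is in the model class later in this thread):

# Illustrative only: a 4-element state list, split between actor and critic.
import numpy as np

cell_size = 256
# [c1, h1, c2, h2], matching what get_initial_state() returns for the model below.
state = [np.zeros(cell_size, np.float32) for _ in range(4)]

actor_state = state[:2]    # c1, h1 -> fed to the policy LSTM only
critic_state = state[2:]   # c2, h2 -> fed to the value-function LSTM only

assert len(actor_state) == 2 and len(critic_state) == 2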
Thank you. I followed your instructions and made a model script as below. I ran the code and got the error below. Could you check what I missed from what you meant?
Error:

@whikwon the following works for me (it's a bit less modular than having two LSTM()s, but I think there are some complications with that approach).
Note that the new LSTM state is now a list of four elements:
diff --git a/python/ray/rllib/agents/ppo/ppo_policy_graph.py b/python/ray/rllib/agents/ppo/ppo_policy_graph.py
index 894d868..30cec63 100644
--- a/python/ray/rllib/agents/ppo/ppo_policy_graph.py
+++ b/python/ray/rllib/agents/ppo/ppo_policy_graph.py
@@ -155,19 +155,7 @@ class PPOPolicyGraph(TFPolicyGraph):
         self.logits = self.model.outputs
         curr_action_dist = dist_cls(self.logits)
         self.sampler = curr_action_dist.sample()
-        if self.config["use_gae"]:
-            vf_config = self.config["model"].copy()
-            # Do not split the last layer of the value function into
-            # mean parameters and standard deviation parameters and
-            # do not make the standard deviations free variables.
-            vf_config["free_log_std"] = False
-            vf_config["use_lstm"] = False
-            with tf.variable_scope("value_function"):
-                self.value_function = ModelCatalog.get_model(
-                    obs_ph, 1, vf_config).outputs
-            self.value_function = tf.reshape(self.value_function, [-1])
-        else:
-            self.value_function = tf.zeros(shape=tf.shape(obs_ph)[:1])
+        self.value_function = self.model.value_function

         self.loss_obj = PPOLoss(
             action_space,
# Imports as in rllib's built-in lstm.py (module paths assumed for this rllib version).
import distutils.version

import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn

from ray.rllib.models.lstm import add_time_dimension
from ray.rllib.models.misc import linear, normc_initializer
from ray.rllib.models.model import Model


class LSTM(Model):
    """Adds policy and value-function LSTM cells on top of some other model output.

    Uses a linear layer at the end for each output.

    Important: we assume inputs is a padded batch of sequences denoted by
    self.seq_lens. See add_time_dimension() for more information.
    """

    def _build_layers(self, inputs, num_outputs, options):
        cell_size = options.get("lstm_cell_size", 256)
        use_tf100_api = (distutils.version.LooseVersion(tf.VERSION) >=
                         distutils.version.LooseVersion("1.0.0"))
        last_layer = add_time_dimension(inputs, self.seq_lens)

        # Setup the LSTM cells: one for the policy, one for the value function
        if use_tf100_api:
            lstm1 = rnn.BasicLSTMCell(cell_size, state_is_tuple=True)
            lstm2 = rnn.BasicLSTMCell(cell_size, state_is_tuple=True)
        else:
            lstm1 = rnn.rnn_cell.BasicLSTMCell(cell_size, state_is_tuple=True)
            lstm2 = rnn.rnn_cell.BasicLSTMCell(cell_size, state_is_tuple=True)
        self.state_init = [
            np.zeros(lstm1.state_size.c, np.float32),
            np.zeros(lstm1.state_size.h, np.float32),
            np.zeros(lstm2.state_size.c, np.float32),
            np.zeros(lstm2.state_size.h, np.float32)
        ]

        # Setup LSTM inputs
        if self.state_in:
            c1_in, h1_in, c2_in, h2_in = self.state_in
        else:
            c1_in = tf.placeholder(
                tf.float32, [None, lstm1.state_size.c], name="c1")
            h1_in = tf.placeholder(
                tf.float32, [None, lstm1.state_size.h], name="h1")
            c2_in = tf.placeholder(
                tf.float32, [None, lstm2.state_size.c], name="c2")
            h2_in = tf.placeholder(
                tf.float32, [None, lstm2.state_size.h], name="h2")
            self.state_in = [c1_in, h1_in, c2_in, h2_in]

        # Setup LSTM outputs
        if use_tf100_api:
            state1_in = rnn.LSTMStateTuple(c1_in, h1_in)
            state2_in = rnn.LSTMStateTuple(c2_in, h2_in)
        else:
            state1_in = rnn.rnn_cell.LSTMStateTuple(c1_in, h1_in)
            state2_in = rnn.rnn_cell.LSTMStateTuple(c2_in, h2_in)
        lstm1_out, lstm1_state = tf.nn.dynamic_rnn(
            lstm1,
            last_layer,
            initial_state=state1_in,  # feed the incoming policy-LSTM state
            sequence_length=self.seq_lens,
            time_major=False,
            dtype=tf.float32)
        with tf.variable_scope("value_function"):
            lstm2_out, lstm2_state = tf.nn.dynamic_rnn(
                lstm2,
                last_layer,
                initial_state=state2_in,  # feed the incoming value-LSTM state
                sequence_length=self.seq_lens,
                time_major=False,
                dtype=tf.float32)
            self.value_function = tf.reshape(
                linear(
                    tf.reshape(lstm2_out, [-1, cell_size]),
                    1, "vf", normc_initializer(0.01)),
                [-1])
        self.state_out = list(lstm1_state) + list(lstm2_state)

        # Compute outputs
        last_layer = tf.reshape(lstm1_out, [-1, cell_size])
        logits = linear(last_layer, num_outputs, "action",
                        normc_initializer(0.01))
        return logits, last_layer
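In case it helps, here is a rough usage sketch of how the pieces could be wired together. A few assumptions on my part: the class above is saved as a module called two_lstm.py (made-up name), it is plugged in via ModelCatalog.register_custom_model and the "custom_model" config key (rather than by editing the built-in lstm.py), and the ppo_policy_graph.py patch above has been applied.

# Rough usage sketch under the assumptions stated above; not tested end to end.
import ray
from ray.rllib.agents.ppo import PPOAgent
from ray.rllib.models import ModelCatalog

from two_lstm import LSTM  # made-up module name holding the class above

ModelCatalog.register_custom_model("two_lstm", LSTM)

ray.init()
agent = PPOAgent(
    env="CartPole-v0",
    config={
        "model": {
            "custom_model": "two_lstm",
            "lstm_cell_size": 256,
            "max_seq_len": 20,
        },
    })
print(agent.train())  # get_initial_state() now returns the 4-element state list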
@ericl It works. Thank you so much for the help.
Filed https://github.com/ray-project/ray/issues/2716 for customizing value functions more easily