I'm looking into the source code of PPO + LSTM, but I found that I cannot use both LSTM and GAE for the value function.
In the paper "Learning Dexterous In-Hand Manipulation", OpenAI successfully used both LSTM and GAE.
Why can't I use both LSTM and GAE in the framework?
If I remove the vf_config["use_lstm"] = False statement, can I use LSTM for the value function without any problem?
Thank you for your help.
Hey @whikwon , this was out of expediency in implementation; there's not a good reason why you can't. I believe you can use LSTM for the value function with some changes to the ppo policy (see the diff further down that I was originally considering).
This will enable rllib to track the RNN state for both models during rollouts. It might also make sense to share the lstm cells, in which case you can redefine the vf model to instead share self.model.last_layer, which is the lstm cell output, and you don't have to add anything to state_inputs/outputs.
@ericl Thank you so much. I'll try it.
@ericl I'm having trouble understanding the code; please check whether I understood it correctly.
Question
To implement the LSTM for the vf, do I have to handle the states of the model / vf_model attributes separately?
I think you're right; that's needed if you use two different sets of placeholders for (state_in, state_out, seq_len). However, if you use a "combined model" that produces both logits and vf outputs from a single set of (state_in, state_out, seq_len), then it can be used in rllib out of the box.
The simplest solution is to share layers, e.g. the following works:
--- a/python/ray/rllib/agents/ppo/ppo_policy_graph.py
+++ b/python/ray/rllib/agents/ppo/ppo_policy_graph.py
@@ -161,8 +161,9 @@ class PPOPolicyGraph(TFPolicyGraph):
             vf_config["free_log_std"] = False
             vf_config["use_lstm"] = False
             with tf.variable_scope("value_function"):
-                self.value_function = ModelCatalog.get_model(
-                    obs_ph, 1, vf_config).outputs
+                from ray.rllib.models.misc import linear, normc_initializer
+                self.value_function = linear(
+                    self.model.last_layer, 1, "value", normc_initializer(1.0))
             self.value_function = tf.reshape(self.value_function, [-1])
         else:
             self.value_function = tf.zeros(shape=tf.shape(obs_ph)[:1])
Actually the above is how rllib implements value functions in a3c and impala. Probably PPO should do it in the same way. To do more complex things, you can also define a custom LSTM model.
How does env_runner generate states for the model? How can I adapt it for the vf_model?
It will generate the initial state using get_initial_state(), pass it in via state_in, and read it back via state_out. However, as noted, this only works for a single model, not for multiple models.
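To make that concrete, here is a minimal, self-contained sketch of the loop the sampler effectively runs for a single recurrent model. This is not rllib code; DummyPolicy and compute_single_action are made-up stand-ins that only illustrate how the state returned by get_initial_state() is threaded through state_in / state_out step by step:

# Illustrative only: DummyPolicy stands in for a policy graph with one LSTM model.
import numpy as np


class DummyPolicy(object):
    cell_size = 4

    def get_initial_state(self):
        # Analogous to self.model.state_init: one (c, h) pair for a single LSTM.
        return [np.zeros(self.cell_size, np.float32),
                np.zeros(self.cell_size, np.float32)]

    def compute_single_action(self, obs, state):
        # In rllib this would feed `state` into the model's state_in placeholders
        # and read the updated state back from state_out; here we just fake it.
        c, h = state
        return 0, [c + 1.0, h + 1.0]


policy = DummyPolicy()
state = policy.get_initial_state()          # produced once per episode
for _ in range(3):
    obs = np.zeros(8, np.float32)
    action, state = policy.compute_single_action(obs, state)  # state_in -> state_out
print(state)  # the carried RNN state after 3 steps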
Just for reference, originally I was thinking the following would be sufficient (it isn't, because we assume only one set of placeholders for state in/out):
--- a/python/ray/rllib/agents/ppo/ppo_policy_graph.py
+++ b/python/ray/rllib/agents/ppo/ppo_policy_graph.py
@@ -159,10 +159,9 @@ class PPOPolicyGraph(TFPolicyGraph):
             # mean parameters and standard deviation parameters and
             # do not make the standard deviations free variables.
             vf_config["free_log_std"] = False
-            vf_config["use_lstm"] = False
             with tf.variable_scope("value_function"):
-                self.value_function = ModelCatalog.get_model(
-                    obs_ph, 1, vf_config).outputs
+                self.vf_model = ModelCatalog.get_model(obs_ph, 1, vf_config)
+                self.value_function = self.vf_model.outputs
             self.value_function = tf.reshape(self.value_function, [-1])
         else:
             self.value_function = tf.zeros(shape=tf.shape(obs_ph)[:1])
@@ -191,9 +190,9 @@ class PPOPolicyGraph(TFPolicyGraph):
             action_sampler=self.sampler,
             loss=self.loss_obj.loss,
             loss_inputs=self.loss_in,
-            state_inputs=self.model.state_in,
-            state_outputs=self.model.state_out,
-            seq_lens=self.model.seq_lens,
+            state_inputs=self.model.state_in + self.vf_model.state_in,
+            state_outputs=self.model.state_out + self.vf_model.state_out,
+            seq_lens=self.model.seq_lens + self.vf_model.seq_lens,
             max_seq_len=config["model"]["max_seq_len"])

         self.sess.run(tf.global_variables_initializer())
@@ -244,4 +243,4 @@ class PPOPolicyGraph(TFPolicyGraph):
             self._loss, colocate_gradients_with_ops=True)

     def get_initial_state(self):
-        return self.model.state_init
+        return self.model.state_init + self.vf_model.state_init
@ericl Thank you so much.
Are you considering implementing independently modifiable actor and critic networks in rllib?
Is there an example in rllib of an LSTM like the one used in the paper (sharing layers between the value function and the policy)?
Here's an example of sharing LSTM layers: https://github.com/ray-project/ray/blob/master/python/ray/rllib/agents/a3c/a3c_tf_policy_graph.py#L53
For having independent actor and critic models, since that's a more advanced customization, it should be done by modifying the policy graph, either in place or by plugging in a custom one: https://ray.readthedocs.io/en/latest/rllib-models.html#customizing-policy-graphs
From the link you attached (https://ray.readthedocs.io/en/latest/rllib-models.html#customizing-policy-graphs),
I think independent actor and critic models wouldn't work with a custom policy graph alone, because env_runner would not return the critic's state.
Do I also have to add Model-Based Rollouts to make env_runner return the critic's state?
What you can do is have the actor and critic in the same model class. There can be multiple independent tensorflow graphs in the same model class. For example, the state could be split into two arrays: the actor uses the first 2 elements only and the critic the last 2 only.
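A tiny numpy-only sketch of that splitting, just to illustrate the indexing (the shapes and names are made up; the full TF version is in the model class later in this thread):

# Illustrative only: a 4-element state list, split between actor and critic.
import numpy as np

cell_size = 256
# [c1, h1, c2, h2], matching what get_initial_state() returns for the model below.
state = [np.zeros(cell_size, np.float32) for _ in range(4)]

actor_state = state[:2]    # c1, h1 -> fed to the policy LSTM only
critic_state = state[2:]   # c2, h2 -> fed to the value-function LSTM only

assert len(actor_state) == 2 and len(critic_state) == 2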
Thank you. I followed your instructions and made a model script as below. I ran the code and got the error below. Could you check what I missed from what you meant?
Error:

@whikwon the following works for me (it's a bit less modular than having two LSTM()s, but I think there are some complications with that approach).
Note that the new LSTM state is now a list of four elements:
diff --git a/python/ray/rllib/agents/ppo/ppo_policy_graph.py b/python/ray/rllib/agents/ppo/ppo_policy_graph.py
index 894d868..30cec63 100644
--- a/python/ray/rllib/agents/ppo/ppo_policy_graph.py
+++ b/python/ray/rllib/agents/ppo/ppo_policy_graph.py
@@ -155,19 +155,7 @@ class PPOPolicyGraph(TFPolicyGraph):
         self.logits = self.model.outputs
         curr_action_dist = dist_cls(self.logits)
         self.sampler = curr_action_dist.sample()
-        if self.config["use_gae"]:
-            vf_config = self.config["model"].copy()
-            # Do not split the last layer of the value function into
-            # mean parameters and standard deviation parameters and
-            # do not make the standard deviations free variables.
-            vf_config["free_log_std"] = False
-            vf_config["use_lstm"] = False
-            with tf.variable_scope("value_function"):
-                self.value_function = ModelCatalog.get_model(
-                    obs_ph, 1, vf_config).outputs
-            self.value_function = tf.reshape(self.value_function, [-1])
-        else:
-            self.value_function = tf.zeros(shape=tf.shape(obs_ph)[:1])
+        self.value_function = self.model.value_function

         self.loss_obj = PPOLoss(
             action_space,
# Imports as in rllib's built-in lstm.py (module paths assumed for this rllib version).
import distutils.version

import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn

from ray.rllib.models.lstm import add_time_dimension
from ray.rllib.models.misc import linear, normc_initializer
from ray.rllib.models.model import Model


class LSTM(Model):
    """Adds policy and value-function LSTM cells on top of some other model output.

    Uses a linear layer at the end for each output.

    Important: we assume inputs is a padded batch of sequences denoted by
    self.seq_lens. See add_time_dimension() for more information.
    """

    def _build_layers(self, inputs, num_outputs, options):
        cell_size = options.get("lstm_cell_size", 256)
        use_tf100_api = (distutils.version.LooseVersion(tf.VERSION) >=
                         distutils.version.LooseVersion("1.0.0"))
        last_layer = add_time_dimension(inputs, self.seq_lens)

        # Setup the LSTM cells: one for the policy, one for the value function
        if use_tf100_api:
            lstm1 = rnn.BasicLSTMCell(cell_size, state_is_tuple=True)
            lstm2 = rnn.BasicLSTMCell(cell_size, state_is_tuple=True)
        else:
            lstm1 = rnn.rnn_cell.BasicLSTMCell(cell_size, state_is_tuple=True)
            lstm2 = rnn.rnn_cell.BasicLSTMCell(cell_size, state_is_tuple=True)
        self.state_init = [
            np.zeros(lstm1.state_size.c, np.float32),
            np.zeros(lstm1.state_size.h, np.float32),
            np.zeros(lstm2.state_size.c, np.float32),
            np.zeros(lstm2.state_size.h, np.float32)
        ]

        # Setup LSTM inputs
        if self.state_in:
            c1_in, h1_in, c2_in, h2_in = self.state_in
        else:
            c1_in = tf.placeholder(
                tf.float32, [None, lstm1.state_size.c], name="c1")
            h1_in = tf.placeholder(
                tf.float32, [None, lstm1.state_size.h], name="h1")
            c2_in = tf.placeholder(
                tf.float32, [None, lstm2.state_size.c], name="c2")
            h2_in = tf.placeholder(
                tf.float32, [None, lstm2.state_size.h], name="h2")
            self.state_in = [c1_in, h1_in, c2_in, h2_in]

        # Setup LSTM outputs
        if use_tf100_api:
            state1_in = rnn.LSTMStateTuple(c1_in, h1_in)
            state2_in = rnn.LSTMStateTuple(c2_in, h2_in)
        else:
            state1_in = rnn.rnn_cell.LSTMStateTuple(c1_in, h1_in)
            state2_in = rnn.rnn_cell.LSTMStateTuple(c2_in, h2_in)
        lstm1_out, lstm1_state = tf.nn.dynamic_rnn(
            lstm1,
            last_layer,
            initial_state=state1_in,  # feed the incoming policy-LSTM state
            sequence_length=self.seq_lens,
            time_major=False,
            dtype=tf.float32)
        with tf.variable_scope("value_function"):
            lstm2_out, lstm2_state = tf.nn.dynamic_rnn(
                lstm2,
                last_layer,
                initial_state=state2_in,  # feed the incoming value-LSTM state
                sequence_length=self.seq_lens,
                time_major=False,
                dtype=tf.float32)
            self.value_function = tf.reshape(
                linear(
                    tf.reshape(lstm2_out, [-1, cell_size]),
                    1, "vf", normc_initializer(0.01)),
                [-1])
        self.state_out = list(lstm1_state) + list(lstm2_state)

        # Compute outputs
        last_layer = tf.reshape(lstm1_out, [-1, cell_size])
        logits = linear(last_layer, num_outputs, "action",
                        normc_initializer(0.01))
        return logits, last_layer
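In case it helps, here is a rough usage sketch of how the pieces could be wired together. A few assumptions on my part: the class above is saved as a module called two_lstm.py (made-up name), it is plugged in via ModelCatalog.register_custom_model and the "custom_model" config key (rather than by editing the built-in lstm.py), and the ppo_policy_graph.py patch above has been applied.

# Rough usage sketch under the assumptions stated above; not tested end to end.
import ray
from ray.rllib.agents.ppo import PPOAgent
from ray.rllib.models import ModelCatalog

from two_lstm import LSTM  # made-up module name holding the class above

ModelCatalog.register_custom_model("two_lstm", LSTM)

ray.init()
agent = PPOAgent(
    env="CartPole-v0",
    config={
        "model": {
            "custom_model": "two_lstm",
            "lstm_cell_size": 256,
            "max_seq_len": 20,
        },
    })
print(agent.train())  # get_initial_state() now returns the 4-element state list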
@ericl It works. Thank you so much for the help.
Filed https://github.com/ray-project/ray/issues/2716 for customizing value functions more easily