The documentation does not state explicitly whether the default policies in common.policies share weights between the value network and the policy network. After carefully reading the code, I could deduce that the
LstmPolicy shares all weights between value and policy network except for the very last linear layer.FeedForwardPolicy shares weights if the 'cnn' extractor is used but uses two entirely different data streams if the 'mlp' extractor is used.Are there any justifications for this specific setup?
Are there any justifications for this specific setup?
I think the only justification I can give you is that it was there in the original baselines.
Also, in my opinion, sharing the CNN makes sense, because sharing the feature extractor saves memory and computation (convolutions are expensive!).
For future versions, it could be a good idea to allows both (shared or separate networks). For now, if you want to change that, you can still define a custom policy, but that requires more effort.
I see we have a cnn_extractor so why not also introduce an mlp_extractor that gets a layers argument and from this constructs a shared or non-shared network for policy and value. layers could be a list of layer-info tuples with the following semantics:
<n> or (<n>, ) or dict(size=<n>) or dict(size=<n>, shared=False): two non-shared layers with (<n>, "shared") or dict(size=<n>, shared=True): shared layer with (<p>, <v>) or dict(pi_layer_size=<p>, vf_layer_size=<v>): non-shared layer with units for the policy network and
Any of the layer sizes being None or < 1 would mean a pass-through. Look at some illustrating examples:
obs
|
<128>
|
<128>
/ \
latent_policy latent_value
would be encoded as [(128, "shared"), (128, "shared")] or [dict(size=128, shared=True), dict(size=128, shared=True)].
obs
|
<128>
/ \
latent_policy <256>
|
<256>
|
latent_value
would be encoded as [(128, "shared"), (None, 256), (None, 256)] or [dict(size=128, shared=True), dict(vf_layer_size=256), dict(vf_layer_size=256)].
obs
|
<128>
/ \
<16> <256>
| |
latent_policy latent_value
would be encoded as [(128, "shared"), (16, 256)] or [dict(size=128, shared=True), dict(pi_layer_size=16, vf_layer_size=256)].
A rough draft implementation (by no means tested yet or integrated with the rest of the code) would be:
def mlp_extractor(flat_observations, layers, act_fun):
def layer_info(layer):
if type(layer) is int:
layer = [layer]
if type(layer) is list or type(layer) is tuple:
assert len(layer) in [1, 2]
is_shared = len(layer) == 2 and layer[1] == "shared"
pi_layer_size = layer[0]
vf_layer_size = layer[1] if not is_shared else layer[0]
if type(layer) is dict:
pi_layer_size = layer['pi_layer_size'] if 'pi_layer_size' in layer else None
vf_layer_size = layer['vf_layer_size']if 'vf_layer_size' in layer else None
is_shared = layer['shared'] if 'shared' in layer else False
if 'size' in layer:
pi_layer_size = vf_layer_size = layer['size']
assert not (pi_layer_size != vf_layer_size and is_shared)
return pi_layer_size, vf_layer_size, is_shared
pi_latent = flat_observations
vf_latent = flat_observations
for idx, layer in enumerate(layers):
pi_layer_size, vf_layer_size, is_shared = layer_info(layer)
if is_shared:
with tf.variable_scope("shared_{}".format(idx)):
if pi_latent == vf_latent: # previous layer was shared (we continue shared stream)
n_input = pi_latent.get_shape()[1].value
weight = tf.get_variable("w", [n_input, pi_layer_size], initializer=ortho_init(np.sqrt(2)))
bias = tf.get_variable("b", [pi_layer_size], initializer=tf.constant_initializer(0.0))
pi_latent = vf_latent = act_fun(tf.matmul(pi_latent, weight) + bias)
else: # previous layer was not shared (we merge stream)
n_input_pi = pi_latent.get_shape()[1].value
n_input_vf = vf_latent.get_shape()[1].value
weight_pi = tf.get_variable("w_pi", [n_input_pi, pi_layer_size], initializer=ortho_init(np.sqrt(2)))
weight_vf = tf.get_variable("w_vf", [n_input_vf, vf_layer_size], initializer=ortho_init(np.sqrt(2)))
bias = tf.get_variable("b", [pi_layer_size], initializer=tf.constant_initializer(0.0))
pi_latent = vf_latent = act_fun(tf.matmul(pi_latent, weight_pi) + tf.matmul(vf_latent, weight_vf) + bias)
else: # (not is_shared) layers are not shared
if pi_layer_size is None or pi_layer_size < 1: # path through, there is no layer for the pi network
pass
else:
with tf.variable_scope("pi_{}".format(idx)):
n_input = pi_latent.get_shape()[1].value
weight = tf.get_variable("w", [n_input, pi_layer_size], initializer=ortho_init(np.sqrt(2)))
bias = tf.get_variable("b", [pi_layer_size], initializer=tf.constant_initializer(0.0))
pi_latent = act_fun(tf.matmul(pi_latent, weight) + bias)
if vf_layer_size is None or vf_layer_size < 1: # path through, there is no layer for the value network
pass
else:
with tf.variable_scope("vf_{}".format(idx)):
n_input = vf_latent.get_shape()[1].value
weight = tf.get_variable("w", [n_input, vf_layer_size], initializer=ortho_init(np.sqrt(2)))
bias = tf.get_variable("b", [vf_layer_size], initializer=tf.constant_initializer(0.0))
vf_latent = act_fun(tf.matmul(vf_latent, weight) + bias)
return pi_latent, vf_latent
What do you think?
It looks like a nice starting point, I'll take a deeper look this weekend, I should have more time ;)
@hill-a @araffin It would be nice to get feedback on this one so I know if it is worth investing the time to implement and test this.
@erniejunior I really like the idea of the tuple definition, it is both concise, powerful (you can express a lot of different architecture) and easy to understand, we just need to specify clearly in the doc in which order things are expected. The dict version is more verbose but there is no doubt of what is going on, still I'm not super happy with it, maybe @hill-a disagree?
For maybe future version, you can maybe also specify the activation function as a third argument in the tuple, but I would prefer in the beginning a first clean version.
Regarding the code you posted, it was a bit hard to follow at first. Anyway, I can provide you with more feedback later to improve the code ;)
To sum it up, I really think that this feature is a cool one =) (and that we will come up with a clean implementation after some discussions (code review feedback) ;))
I think this idea makes a lot sense, I like it.
My only fear, is the notation might get confusing for the tuples, as in which number is for which network. Also the dict notation feels cumbersome (although as @araffin said, very readable).
I might suggest an intermediary approach:
[128, 128, {"vf": [], "pi": [64, 32]}]
obs
|
<128>
|
<128>
/ \
<64> latent_value
|
<32>
|
latent_policy
I like the intermediary approach. It takes away the option to "rejoin" the two data streams which was probably useless in the first place. I might start to work on it when I get around to it.
I started with a first draft (have a look at it in my feature/flexible_mlp_policies branch if you are interested) and found a potential issue with this notation: it is not backwards compatible. FeedForwardPolicy's used to have separate value and policy networks. Now, with the same layers argument, they are suddenly shared.
The question is: does this matter? And do we also need to support loading models generated with the older version of the library? Because then we need to be even more careful.
Now, with the same layers argument, they are suddenly shared.
Can you elaborate a bit more? I think that the default policies must be backward compatible. For the custom policies, that can be a problem. One way to solve that is to create a new variable name (e.g. net_arch) and deprecate the old behavior (raising warning when people use layers) while maintaining backward compatibility.
Can you elaborate a bit more?
Old implementation of FeedForwardPolicy has no shared weights between policy and value network.
It does support a layers parameter. If I e.g. pass layers=[128, 128] to the old implementation, I get a policy without shared weights. If I pass the same arguments to the new implementation I get a policy with shared weights.
I like your proposal of a new parameter to keep backwards compatibility. I did not like the name layers anyways because it makes me want to pass an actual list of layers and not a list of layer sizes.
What about loading models generated with the old library version? Do we need that?
Old implementation of FeedForwardPolicy has no shared weights between policy and value network.
Thanks for the clarification ;)
What about loading models generated with the old library version? Do we need that?
Well, you mean agent saved by previous version? If we keep the default policies the same, this should work, no?
I changed the scope names a bit but. This might cause incompatibilities when loading the weights. I can undo that to match the old scope names.
When looking deeper into the code, I saw that you use your own utilities for dense layers and lstm layers. I was considering switching to the inbuilt tf.layers.dense and tf.contrib.rnn.LSTMBlockCell . Reasons:
But this would most certainly break backwards compatibility when loading old policies.
Yes, I would keep backward compatibility for now. Especially because there are some tricks in the code that rely on scope/name of variable (e.g. parameter perturbation for DDPG).
I saw that you use your own utilities for dense layers and lstm layers.
That is legacy code... I definitely agree that we should switch in the future to built-in layers (maybe for V3, so it is obvious that they are breaking changes). I added that to the roadmap.
I think I mostly finished the implementation for the FeedForwardPolicy and now is is time to think about how we want to deal with the LstmPolicy.
Current state:
The LSTMPolicy constructs a shared MLP extractor (based on the layers parameter) and appends a shared LSTM to that.
What I would whish to have:
feature_net_arch parameter defines the feature extraction network.post_lstm_net_arch parameter defines the network architecture after the LSTM. If we already have two separate LSTMs, this parameter must not include any shared layers.Alternative notation:
We put everything in the net_arch parameter, which is split up in the two networks by an "lstm" layer.
Examples:
Shared feature network + shared LSTM + shared 20 unit layer + two different post-networks for value and policy.
[128, "lstm", 20, dict(pi=[128], vf=[128, 128])]
Partially shared feature network + different LSTMs + different post-networks
[128, dict(pi=[20], vf=[20]), "lstm", dict(pi=[128], vf=[128])]
What I would like to implement:
I think it would be a greater hassle to implement support for non-shared LSTMs. So I would implement the above proposal but without the support for non-shared LSTMs. After we have fixed issue #90, it should be a lot simpler to support multiple LSTM cells.
What do you guys think?
I would favor the "alternative notation" to avoid having too much arguments. Also, I don't really see the need of a "post-lstm" architecture. That allows full control on the policy/value net architecture but I'm afraid that is not really needed (as controlling pre-lstm net + share or not shared + lstm architecture is already sufficient).
non-shared LSTMs
What makes it hard to have non-shared LSTMs? or is it a problem of computation and memory?
Why post-LSTM architecture?
For my use-case I would like the possibility to deploy the trained (recurrent) policy on embedded hardware. This means I want to keep it as small and shallow as possible. On the other hand a large value network gives me better training results (remember: I do not need to run the value network in my embedded hardware). This means I want an architecture with an LSTM and a value network that is much larger than the policy network. The only way to get there is to add non-shared value network layers after the LSTM.
Why no non-shared LSTMS?
As mentioned above, I want non-shared LSTMs, but only after fixing #90 because it would involve changing the policy base class. This step might break other things and needs more strategic planning.
Uhhh I can hear a PR rumbling in the distance ...
closed by #111
Most helpful comment
Uhhh I can hear a PR rumbling in the distance ...