Stable-baselines: Clarify which policies share weights between policy and value network.

Created on 21 Nov 2018 · 18Comments · Source: hill-a/stable-baselines

The documentation does not state explicitly whether the default policies in common.policies share weights between the value network and the policy network. After carefully reading the code, I could deduce that the

LstmPolicy shares all weights between value and policy network except for the very last linear layer.
FeedForwardPolicy shares weights if the 'cnn' extractor is used but uses two entirely different data streams if the 'mlp' extractor is used.

Are there any justifications for this specific setup?

documentation enhancement help wanted

Source

erniejunior

Most helpful comment

Uhhh I can hear a PR rumbling in the distance ...

erniejunior on 3 Dec 2018

👍2

All 18 comments

Are there any justifications for this specific setup?

I think the only justification I can give you is that it was there in the original baselines.
Also, in my opinion, sharing the CNN makes sense, because sharing the feature extractor saves memory and computation (convolutions are expensive!).
For future versions, it could be a good idea to allows both (shared or separate networks). For now, if you want to change that, you can still define a custom policy, but that requires more effort.

araffin on 21 Nov 2018

I see we have a cnn_extractor so why not also introduce an mlp_extractor that gets a layers argument and from this constructs a shared or non-shared network for policy and value. layers could be a list of layer-info tuples with the following semantics:

<n> or (<n>, ) or dict(size=<n>) or dict(size=<n>, shared=False): two non-shared layers with units
(<n>, "shared") or dict(size=<n>, shared=True): shared layer with units
(<p>, <v>) or dict(pi_layer_size=<p>, vf_layer_size=<v>): non-shared layer with
units for the policy network and units for the value network
Any of the layer sizes being None or < 1 would mean a pass-through. Look at some illustrating examples:

Completely shared with two layers of size 128

              obs
               |
             <128>
               |
             <128>
       /               \
latent_policy      latent_value

would be encoded as [(128, "shared"), (128, "shared")] or [dict(size=128, shared=True), dict(size=128, shared=True)].

Value nework deeper than policy network, first layer shared

              obs
               |
             <128>
       /               \
latent_policy          <256>
                         |
                       <256>
                         |
                    latent_value

would be encoded as [(128, "shared"), (None, 256), (None, 256)] or [dict(size=128, shared=True), dict(vf_layer_size=256), dict(vf_layer_size=256)].

Initially shared then diverging

              obs
               |
             <128>
       /               \
     <16>             <256>
       |                |
latent_policy    latent_value

would be encoded as [(128, "shared"), (16, 256)] or [dict(size=128, shared=True), dict(pi_layer_size=16, vf_layer_size=256)].

A rough draft implementation (by no means tested yet or integrated with the rest of the code) would be:

def mlp_extractor(flat_observations, layers, act_fun):

    def layer_info(layer):
        if type(layer) is int:
            layer = [layer]
        if type(layer) is list or type(layer) is tuple:
            assert len(layer) in [1, 2]
            is_shared = len(layer) == 2 and layer[1] == "shared"
            pi_layer_size = layer[0]
            vf_layer_size = layer[1] if not is_shared else layer[0]
        if type(layer) is dict:
            pi_layer_size = layer['pi_layer_size'] if 'pi_layer_size' in layer else None
            vf_layer_size = layer['vf_layer_size']if 'vf_layer_size' in layer else None
            is_shared = layer['shared'] if 'shared' in layer else False
            if 'size' in layer:
                pi_layer_size = vf_layer_size = layer['size']
            assert not (pi_layer_size != vf_layer_size and is_shared)
        return pi_layer_size, vf_layer_size, is_shared

    pi_latent = flat_observations
    vf_latent = flat_observations
    for idx, layer in enumerate(layers):
        pi_layer_size, vf_layer_size, is_shared = layer_info(layer)
        if is_shared:
            with tf.variable_scope("shared_{}".format(idx)):
                if pi_latent == vf_latent:  # previous layer was shared (we continue shared stream)
                    n_input = pi_latent.get_shape()[1].value
                    weight = tf.get_variable("w", [n_input, pi_layer_size], initializer=ortho_init(np.sqrt(2)))
                    bias = tf.get_variable("b", [pi_layer_size], initializer=tf.constant_initializer(0.0))
                    pi_latent = vf_latent = act_fun(tf.matmul(pi_latent, weight) + bias)
                else:  # previous layer was not shared (we merge stream)
                    n_input_pi = pi_latent.get_shape()[1].value
                    n_input_vf = vf_latent.get_shape()[1].value
                    weight_pi = tf.get_variable("w_pi", [n_input_pi, pi_layer_size], initializer=ortho_init(np.sqrt(2)))
                    weight_vf = tf.get_variable("w_vf", [n_input_vf, vf_layer_size], initializer=ortho_init(np.sqrt(2)))
                    bias = tf.get_variable("b", [pi_layer_size], initializer=tf.constant_initializer(0.0))
                    pi_latent = vf_latent = act_fun(tf.matmul(pi_latent, weight_pi) + tf.matmul(vf_latent, weight_vf) + bias)
        else:  # (not is_shared) layers are not shared
            if pi_layer_size is None or pi_layer_size < 1:  # path through, there is no layer for the pi network
                pass
            else:
                with tf.variable_scope("pi_{}".format(idx)):
                    n_input = pi_latent.get_shape()[1].value
                    weight = tf.get_variable("w", [n_input, pi_layer_size], initializer=ortho_init(np.sqrt(2)))
                    bias = tf.get_variable("b", [pi_layer_size], initializer=tf.constant_initializer(0.0))
                    pi_latent = act_fun(tf.matmul(pi_latent, weight) + bias)

            if vf_layer_size is None or vf_layer_size < 1:  # path through, there is no layer for the value network
                pass
            else:
                with tf.variable_scope("vf_{}".format(idx)):
                    n_input = vf_latent.get_shape()[1].value
                    weight = tf.get_variable("w", [n_input, vf_layer_size], initializer=ortho_init(np.sqrt(2)))
                    bias = tf.get_variable("b", [vf_layer_size], initializer=tf.constant_initializer(0.0))
                    vf_latent = act_fun(tf.matmul(vf_latent, weight) + bias)
    return pi_latent, vf_latent

What do you think?

erniejunior on 21 Nov 2018

It looks like a nice starting point, I'll take a deeper look this weekend, I should have more time ;)

araffin on 21 Nov 2018

@hill-a @araffin It would be nice to get feedback on this one so I know if it is worth investing the time to implement and test this.

erniejunior on 26 Nov 2018

@erniejunior I really like the idea of the tuple definition, it is both concise, powerful (you can express a lot of different architecture) and easy to understand, we just need to specify clearly in the doc in which order things are expected. The dict version is more verbose but there is no doubt of what is going on, still I'm not super happy with it, maybe @hill-a disagree?

For maybe future version, you can maybe also specify the activation function as a third argument in the tuple, but I would prefer in the beginning a first clean version.

Regarding the code you posted, it was a bit hard to follow at first. Anyway, I can provide you with more feedback later to improve the code ;)

To sum it up, I really think that this feature is a cool one =) (and that we will come up with a clean implementation after some discussions (code review feedback) ;))

araffin on 26 Nov 2018

I think this idea makes a lot sense, I like it.

My only fear, is the notation might get confusing for the tuples, as in which number is for which network. Also the dict notation feels cumbersome (although as @araffin said, very readable).

I might suggest an intermediary approach:

[128, 128, {"vf": [], "pi": [64, 32]}]

              obs
               |
             <128>
               |
             <128>
       /               \
     <64>         latent_value
      |
     <32>
      |
  latent_policy

hill-a on 27 Nov 2018

👍1

I like the intermediary approach. It takes away the option to "rejoin" the two data streams which was probably useless in the first place. I might start to work on it when I get around to it.

erniejunior on 27 Nov 2018

👍1

I started with a first draft (have a look at it in my feature/flexible_mlp_policies branch if you are interested) and found a potential issue with this notation: it is not backwards compatible. FeedForwardPolicy's used to have separate value and policy networks. Now, with the same layers argument, they are suddenly shared.

The question is: does this matter? And do we also need to support loading models generated with the older version of the library? Because then we need to be even more careful.

erniejunior on 27 Nov 2018

Now, with the same layers argument, they are suddenly shared.

Can you elaborate a bit more? I think that the default policies must be backward compatible. For the custom policies, that can be a problem. One way to solve that is to create a new variable name (e.g. net_arch) and deprecate the old behavior (raising warning when people use layers) while maintaining backward compatibility.

araffin on 27 Nov 2018

Can you elaborate a bit more?

Old implementation of FeedForwardPolicy has no shared weights between policy and value network.
It does support a layers parameter. If I e.g. pass layers=[128, 128] to the old implementation, I get a policy without shared weights. If I pass the same arguments to the new implementation I get a policy with shared weights.

I like your proposal of a new parameter to keep backwards compatibility. I did not like the name layers anyways because it makes me want to pass an actual list of layers and not a list of layer sizes.

What about loading models generated with the old library version? Do we need that?

erniejunior on 27 Nov 2018

Old implementation of FeedForwardPolicy has no shared weights between policy and value network.

Thanks for the clarification ;)

What about loading models generated with the old library version? Do we need that?

Well, you mean agent saved by previous version? If we keep the default policies the same, this should work, no?

araffin on 28 Nov 2018

I changed the scope names a bit but. This might cause incompatibilities when loading the weights. I can undo that to match the old scope names.

When looking deeper into the code, I saw that you use your own utilities for dense layers and lstm layers. I was considering switching to the inbuilt tf.layers.dense and tf.contrib.rnn.LSTMBlockCell . Reasons:

Less lines of code to debug/maintain
This LSTM just one node in the TF graph -> a lot faster

But this would most certainly break backwards compatibility when loading old policies.

erniejunior on 28 Nov 2018

Yes, I would keep backward compatibility for now. Especially because there are some tricks in the code that rely on scope/name of variable (e.g. parameter perturbation for DDPG).

I saw that you use your own utilities for dense layers and lstm layers.

That is legacy code... I definitely agree that we should switch in the future to built-in layers (maybe for V3, so it is obvious that they are breaking changes). I added that to the roadmap.

araffin on 28 Nov 2018

I think I mostly finished the implementation for the FeedForwardPolicy and now is is time to think about how we want to deal with the LstmPolicy.

Current state:
The LSTMPolicy constructs a shared MLP extractor (based on the layers parameter) and appends a shared LSTM to that.

What I would whish to have:

The feature_net_arch parameter defines the feature extraction network.
If the feature extraction network is entirely shared, we append a shared LSTM.
If the feature extraction network is not shared in the end, we append two LSTM cells for value and policy network.
A post_lstm_net_arch parameter defines the network architecture after the LSTM. If we already have two separate LSTMs, this parameter must not include any shared layers.
This would allow to split the streams before or after the LSTM.

Alternative notation:
We put everything in the net_arch parameter, which is split up in the two networks by an "lstm" layer.
Examples:

Shared feature network + shared LSTM + shared 20 unit layer + two different post-networks for value and policy.
[128, "lstm", 20, dict(pi=[128], vf=[128, 128])]

Partially shared feature network + different LSTMs + different post-networks
[128, dict(pi=[20], vf=[20]), "lstm", dict(pi=[128], vf=[128])]

What I would like to implement:
I think it would be a greater hassle to implement support for non-shared LSTMs. So I would implement the above proposal but without the support for non-shared LSTMs. After we have fixed issue #90, it should be a lot simpler to support multiple LSTM cells.

What do you guys think?

erniejunior on 28 Nov 2018

I would favor the "alternative notation" to avoid having too much arguments. Also, I don't really see the need of a "post-lstm" architecture. That allows full control on the policy/value net architecture but I'm afraid that is not really needed (as controlling pre-lstm net + share or not shared + lstm architecture is already sufficient).

non-shared LSTMs

What makes it hard to have non-shared LSTMs? or is it a problem of computation and memory?

araffin on 30 Nov 2018

Why post-LSTM architecture?
For my use-case I would like the possibility to deploy the trained (recurrent) policy on embedded hardware. This means I want to keep it as small and shallow as possible. On the other hand a large value network gives me better training results (remember: I do not need to run the value network in my embedded hardware). This means I want an architecture with an LSTM and a value network that is much larger than the policy network. The only way to get there is to add non-shared value network layers after the LSTM.

Why no non-shared LSTMS?
As mentioned above, I want non-shared LSTMs, but only after fixing #90 because it would involve changing the policy base class. This step might break other things and needs more strategic planning.

erniejunior on 30 Nov 2018

Uhhh I can hear a PR rumbling in the distance ...

erniejunior on 3 Dec 2018

👍2

closed by #111

araffin on 5 Dec 2018

Was this page helpful?

0 / 5 - 0 ratings