Hello,
I am using A2C with LstmPolicy, and my state space is MultiDiscrete.
I would like to pass only one part of the state (let's say the first 5 discrete values) to the shared FC and LSTM layers, then concatenate the result with the remaining discrete values before it goes into the value and policy networks.
Is there an easy way to do this?
Thanks.
Unfortunately there is no easy way to do this. You have to create a custom policy network where you split the observation placeholder into multiple parts according to your needs. I recommend starting by taking a look at how LSTM policies work, and perhaps making a copy of one for your purpose.
Thanks! Looking at the policies, I see they use self._processed_obs as the input tensor, which comes from observation_input. For a MultiDiscrete observation space, this function concatenates the one-hot encodings of each dimension. I feel I could just change this to slice right before the encoding?
Moreover, what is the reason for one-hot encoding? Is it not a problem when my values are large integers?
There is no pre-made way to do this so I recommend going with the method that is easiest for you. I would recommend staying inside custom policies and not modifying the stable-baselines code itself to avoid getting too messy, though.
One-hot encoding is a common way to encode discrete values (known to work, easy to implement). One alternative, used in e.g. language processing, is an embedding layer, but those are not implemented here. Here, Discrete/MultiDiscrete are designed for a "handful" of options, and the encoding will indeed get really sparse if you have many options. As this is not a stable-baselines issue, I recommend finding more information on this elsewhere (e.g. examples/tutorials on language processing and representing characters).
If there are no more questions related to stable-baselines, you can close this issue :).
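To make the sparsity point concrete, here is a minimal pure-Python sketch of concatenated one-hot encoding for a MultiDiscrete-style observation. The helper one_hot_multidiscrete and the nvec values are hypothetical and only for illustration; this is not the stable-baselines implementation. It also shows that, in such a concatenated encoding, the columns belonging to the first 5 dimensions end at sum(nvec[:5]), which is where a custom policy could slice.

```python
def one_hot_multidiscrete(values, nvec):
    """Concatenate one one-hot block per dimension; dimension i has nvec[i] options."""
    encoded = []
    for value, n in zip(values, nvec):
        if not 0 <= value < n:
            raise ValueError(f"value {value} is outside [0, {n})")
        block = [0.0] * n
        block[value] = 1.0
        encoded.extend(block)
    return encoded


# Hypothetical space: five small dimensions followed by two large "integer" ones.
nvec = [3, 4, 2, 5, 3, 1000, 50000]
obs = [2, 0, 1, 4, 1, 999, 12345]

encoded = one_hot_multidiscrete(obs, nvec)

# The encoded width is sum(nvec), so large integer ranges dominate the input size.
print(len(encoded))

# The one-hot columns of the first 5 dimensions occupy the first sum(nvec[:5]) entries.
split_at = sum(nvec[:5])
first_part = encoded[:split_at]
remaining_part = encoded[split_at:]
print(len(first_part), len(remaining_part))
```

Only one entry per dimension is non-zero, so almost all of the input is zeros once a dimension has thousands of options.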
Actually, my question about one-hot encoding comes from the fact that there is no support for Dict/Tuple observation space, right?
The two parts of my state are in fact MultiDiscrete & "Multi-Integer" where some integer values can be very large. But I had to use MultiDiscrete for all due to lack of Dict/Tuple.
In fact, the multi-integers are fixed per episode.
So is it possible to use only MultiDiscrete for observation space, while inputting additional values to the network?
The "standard" hacky way around this is to define your observation space as a gym.spaces.Box (so no extra preprocessing happens), concatenate everything into one vector, and then on the stable-baselines policy side split this vector into different parts and feed them through different processing steps.
The parts you want to modify are nature_cnn or the mlp_extractor, but you also have to modify the policy to use the new "extractor" (see how nature_cnn/mlp_extractor are used in the policies).
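The pack-and-split idea described above can be sketched with NumPy alone. The names N_DISCRETE, N_INTEGER, pack_observation and split_observation are hypothetical, and the sizes are assumptions taken from the question (5 discrete values plus some per-episode integers). In an actual setup the packed vector would be what the environment returns under a Box observation space, and the same last-axis slicing would be applied to the observation tensor inside the custom policy.

```python
import numpy as np

# Hypothetical layout of the flat observation vector (assumed sizes).
N_DISCRETE = 5  # values meant for the shared FC/LSTM layers
N_INTEGER = 2   # per-episode integers that bypass those layers


def pack_observation(discrete_part, integer_part):
    """Environment side: concatenate both parts into one flat float vector."""
    return np.concatenate([
        np.asarray(discrete_part, dtype=np.float32),
        np.asarray(integer_part, dtype=np.float32),
    ])


def split_observation(batch):
    """Policy side: slice the last axis back into the two parts."""
    batch = np.asarray(batch, dtype=np.float32)
    return batch[..., :N_DISCRETE], batch[..., N_DISCRETE:]


obs = pack_observation([2, 0, 1, 4, 1], [999, 12345])
batch = np.stack([obs, obs])  # e.g. two parallel environments

discrete, integers = split_observation(batch)
print(discrete.shape, integers.shape)
```

Note that with this approach the discrete values arrive as raw floats, so any one-hot encoding or scaling you want has to be done inside the custom extractor.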