I have a trained PPO2 agent with a MlpLnLstmPolicy. I want to input a state and action and obtain the q-value for that combination (basically checking what the q-function is estimating).
Is there a function that already does this? I did not see anything in the documentation, and I'm a bit overwhelmed by the source code.
Thank you in advance!
First thing to note: PPO2 does not use Q-values (state-action values), but it does have state values ("V-values"). The following applies to the V-values.
There is no pre-made function for this, but you can take a look at how predict gets the action from the policy. PPO2 also has a self.value function with the same parameters as self.step, so you should be able to replicate the call inside the predict function. Make sure you feed in the right hidden state, though.
Thank you for pointing those out!
What does the "hidden state" refer to? I ran it with a vectorized environment of 4 and observed that it had a shape of (4, 512). Is it an array for the TensorFlow session to run?
I have this function that I added to the ActorCriticRLModel class and it seems to work.
Does it feed the right hidden state?
def predict_value(self, observation, state=None, mask=None):
    if state is None:
        state = self.initial_state
    if mask is None:
        mask = [False for _ in range(self.n_envs)]
    observation = np.array(observation)
    observation = observation.reshape((-1,) + self.observation_space.shape)
    value = self.value(observation, state, mask)
    return value
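For reference, here is a small NumPy-only sketch of what the reshape in the function above does. The observation shape (3,) and the batch size 4 are made-up values for illustration, not taken from any real environment.

```python
import numpy as np

# Hypothetical stand-in for self.observation_space.shape.
obs_shape = (3,)

# A single observation and a batch of 4 observations (one per environment).
single_obs = np.array([0.1, 0.2, 0.3])
batch_obs = np.zeros((4,) + obs_shape)

# The -1 lets NumPy infer the batch axis from the number of elements.
single_batched = single_obs.reshape((-1,) + obs_shape)
batch_unchanged = batch_obs.reshape((-1,) + obs_shape)

print(single_batched.shape)
print(batch_unchanged.shape)
```

So a single observation gains a leading batch axis, and an already-batched observation is left as it is. With a recurrent policy, the batch size probably has to match the first dimension of the state (4 in my case), though I have not verified that.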
You are welcome :).
state is the hidden state of the recurrent network. In your code you are feeding in the right one, but I was referring to the order in which these functions are called. For example:
# Pseudo-codeish, not correct calls
action, state = model.predict(obs, state, ...)
value, state = model.predict_value(obs, state, ...)
Note that here predict_value would receive a different state than predict did, because predict has already returned an updated one. You must feed the same state into both functions. For the next observation you can use the new state from either call.
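To make the ordering issue concrete, here is a minimal sketch using a toy recurrent model written in plain NumPy. ToyRecurrentModel and all of its weights are hypothetical and are not part of stable-baselines. Also, this toy predict_value returns only the value, as in the function posted above. The only point is that the value depends on which hidden state is passed in.

```python
import numpy as np


class ToyRecurrentModel:
    """Hypothetical stand-in for a recurrent actor-critic model."""

    def __init__(self, obs_dim, state_dim, n_envs, seed=0):
        rng = np.random.default_rng(seed)
        self.n_envs = n_envs
        self.initial_state = np.zeros((n_envs, state_dim), dtype=np.float32)
        self._w_obs = rng.normal(size=(obs_dim, state_dim))
        self._w_state = rng.normal(size=(state_dim, state_dim))
        self._w_value = rng.normal(size=(state_dim,))

    def _advance(self, obs, state):
        # One recurrent step: the new hidden state depends on the old one.
        return np.tanh(obs @ self._w_obs + state @ self._w_state)

    def predict(self, obs, state):
        # Returns an action per environment and the updated hidden state.
        new_state = self._advance(obs, state)
        actions = (new_state.sum(axis=1) > 0).astype(int)
        return actions, new_state

    def predict_value(self, obs, state):
        # Returns one value estimate per environment for the given state.
        new_state = self._advance(obs, state)
        return new_state @ self._w_value


model = ToyRecurrentModel(obs_dim=3, state_dim=8, n_envs=4)
obs = np.ones((4, 3))
state = model.initial_state

# Correct: query the value with the same state that predict receives.
value_ok = model.predict_value(obs, state)
actions, next_state = model.predict(obs, state)

# Incorrect: the value is computed from the already-updated state.
value_wrong = model.predict_value(obs, next_state)

print(np.allclose(value_ok, value_wrong))
```

In this toy setup the two value estimates differ, which is the mismatch you would get by reusing the state returned from predict for the same observation.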
Ok, that makes sense. Thank you so much for your help!