Stable-baselines: [question] How to evaluate the Q-Values Approximator

Created on 30 Jan 2020 · 4 Comments · Source: hill-a/stable-baselines

I have a trained PPO2 agent with a MlpLnLstmPolicy. I want to input a state and action and obtain the q-value for that combination (basically checking what the q-function is estimating).
Is there a function that already does this? I did not see anything in the documentation, and I'm a bit overwhelmed by the source code.
Thank you in advance!

question


titusquah

All 4 comments

First thing to note: PPO2 does not use Q-values (state-action values), but it does have state values ("V-values"). The following applies to the V-values.

There is no pre-made function for this, but you can take a look at how predict gets the action from the policy. PPO2 also has a self.value function with the same parameters as self.step, so you should be able to replicate what the predict function does, calling self.value instead. Make sure you feed in the right hidden state (the state argument), though.

Miffyli on 30 Jan 2020


Thank you for pointing those out!
What does the hidden state ("state") refer to? I ran it with a vectorized environment of 4 environments and observed that it had a shape of (4, 512). Is it an array for the TensorFlow session to run?

I have this function that I added to the ActorCriticRLModel class, and it seems to work.
Does it feed in the right hidden state?

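def predict_value(self, observation, state=None, mask=None):
    # Requires "import numpy as np" at module level.
    # Default to the model's initial recurrent hidden state.
    if state is None:
        state = self.initial_state
    # Default mask: one False entry per environment.
    if mask is None:
        mask = [False for _ in range(self.n_envs)]
    # Reshape to (batch, *observation_shape) before feeding the network.
    observation = np.array(observation)
    observation = observation.reshape((-1,) + self.observation_space.shape)
    # Return the state-value (V) estimate for each observation.
    return self.value(observation, state, mask)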

titusquah on 30 Jan 2020

You are welcome :).

state is the hidden state of the recurrent network. In your code you are feeding in the right one, but I was referring to the order in which these functions are called. E.g.:

# Pseudo-code showing the problematic order, not correct calls
action, state = model.predict(obs, state, ...)
# state has now been overwritten with the updated hidden state
value, state = model.predict_value(obs, state, ...)

Note that here predict_value would receive a different state than predict did, because the first call has already overwritten state. You must feed the same state into both functions. For the next observation you can use the new state from either call.
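The ordering issue can be illustrated with a small self-contained sketch. ToyRecurrentModel below is a hypothetical stand-in and not the stable-baselines API: its hidden state is a single float and its update rule is made up for illustration only. The point is just the bookkeeping: keep the incoming state in its own variable (state_in here, also a made-up name) and pass that same value to both calls.

```python
class ToyRecurrentModel:
    """Hypothetical stand-in for a recurrent actor-critic model."""

    def _advance(self, obs, state):
        # Made-up recurrent update; a real LSTM is far more involved.
        return 0.5 * state + obs

    def predict(self, obs, state):
        # Returns an action and the updated hidden state.
        new_state = self._advance(obs, state)
        action = 1 if new_state > 0 else 0
        return action, new_state

    def predict_value(self, obs, state):
        # Returns a value estimate computed from the same kind of update.
        return 2.0 * self._advance(obs, state)


def rollout_same_state(model, observations, initial_state=0.0):
    """Feed the same hidden state to predict and predict_value."""
    state = initial_state
    results = []
    for obs in observations:
        state_in = state  # remember the state both calls must share
        action, state = model.predict(obs, state_in)
        value = model.predict_value(obs, state_in)
        results.append((action, value))
    return results


def rollout_overwritten_state(model, observations, initial_state=0.0):
    """Problematic order: predict_value sees the already-updated state."""
    state = initial_state
    results = []
    for obs in observations:
        action, state = model.predict(obs, state)
        value = model.predict_value(obs, state)  # state was advanced above
        results.append((action, value))
    return results
```

With this toy model, the two rollouts return different values for the same observations, which is the mismatch described above: in the second version the value no longer corresponds to the step at which the action was chosen.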

Miffyli on 30 Jan 2020

Ok, that makes sense. Thank you so much for your help!

titusquah on 30 Jan 2020

