Describe the bug
The output of DQN should be the estimated Q-values, but there seems to be a softmax layer at the end of the DQN network. How is this DQN trained? Have I misunderstood something?
Code example
from stable_baselines.common.atari_wrappers import make_atari
from stable_baselines.deepq.policies import MlpPolicy, CnnPolicy
from stable_baselines import DQN,ACKTR,A2C,PPO2
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack
import cv2
import matplotlib.pyplot as plt
import numpy as np
from IPython import display
import os
game_list = ['Pong','Breakout','SpaceInvaders','Seaquest','BeamRider','Qbert','Enduro']
method_list = ['dqn','ppo2','a2c','acktr']
method = method_list[0]
for game in game_list:
    env = make_atari_env('{}NoFrameskip-v4'.format(game), num_env=1, seed=0)
    env = VecFrameStack(env, n_stack=4)
    model = DQN.load("trained_agents/{}/{}NoFrameskip-v4.pkl".format(method, game))
    env.reset()
    model.set_env(env)
    obs = env.reset()
    for i in range(1000):
        actions = model.action_probability(obs)  # Here I want to get the Q values
        argmax_action = np.argmax(actions)
        action, _states = model.predict(obs)
        print(actions)
        print('the sum: {}'.format(np.sum(actions)))
        obs, rewards, dones, infos = env.step(action)
        episode_infos = infos[0].get('episode')
Output
[[0.16696884 0.16787794 0.16488545 0.16646059 0.16625851 0.16754872]]
the sum: 1.0
[[0.16700052 0.16733757 0.16515337 0.1665579 0.16647142 0.1674792 ]]
the sum: 1.0
[[0.16888289 0.16632704 0.16429803 0.16766034 0.16476472 0.168067 ]]
the sum: 1.0
[[0.16873147 0.1668533 0.1633186 0.16757944 0.16445793 0.16905922]]
the sum: 0.9999999403953552
[[0.16896197 0.16633949 0.16370982 0.16852172 0.16337387 0.16909312]]
the sum: 1.0
[[0.16926633 0.1685718 0.16038649 0.16665836 0.16892989 0.16618706]]
the sum: 0.9999998807907104
[[0.17271757 0.16515626 0.15838557 0.1654946 0.16886355 0.16938245]]
the sum: 1.0
[[0.17090748 0.16443412 0.15841582 0.16640182 0.17027445 0.16956638]]
the sum: 1.0
Hello,
you are not looking at the Q-values but at the action probabilities (cf. the documentation).
PS: as mentioned in the issue template, please format your code using code blocks.
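To illustrate the difference between Q-values and action probabilities, here is a minimal sketch in plain Python. It assumes, as this discussion suggests, that the probabilities are a softmax over the Q-values. The Q-values below are made up for illustration and do not come from any trained agent.

```python
import math

def softmax(q_values):
    """Turn a list of Q-values into probabilities that sum to 1."""
    # Subtracting the max is only for numerical stability; the result is unchanged.
    shift = max(q_values)
    exps = [math.exp(q - shift) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical Q-values for six actions (illustrative numbers only).
q_values = [0.52, 0.54, 0.50, 0.51, 0.49, 0.53]
probs = softmax(q_values)

print(probs)
print("the sum:", sum(probs))

# The softmax is monotonic, so the greedy action is the same either way.
best_q = max(range(len(q_values)), key=q_values.__getitem__)
best_p = max(range(len(probs)), key=probs.__getitem__)

# log(p) recovers the Q-values only up to one shared additive constant.
offsets = [q - math.log(p) for q, p in zip(q_values, probs)]
```

Under that assumption, nearly uniform probabilities like those printed in the issue only say that the Q-values are close to each other. Their absolute scale cannot be read back from the probabilities.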
Hi Araffin,
Thanks for your reply.
May I ask how to output the Q-values for a given state s_t after I have fully trained a DQN model?
From my understanding, the action probabilities are normalized from the output Q-values. However, I could not find how to access the Q-values themselves.
Thanks a lot
You need to check the code for that: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/deepq/policies.py#L149
Example call: https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/deepq/dqn.py#L310
_, qvalues, _ = model.step_model.step(state, deterministic=True)
This line of code obtains the Q-values from the DQN model.
I hope this helps other researchers who are looking for a solution.