Stable-baselines: DQN output is normalized?

Created on 24 Feb 2020 · 4 Comments · Source: hill-a/stable-baselines

Important Note: We do not do technical support, nor consulting and don't answer personal questions per email.

If you have any questions, feel free to create an issue with the tag [question].
If you wish to suggest an enhancement or feature request, add the tag [feature request].
If you are submitting a bug report, please fill in the following details.

If your issue is related to a custom gym environment, please check it first using:

from stable_baselines.common.env_checker import check_env

env = CustomEnv(arg1, ...)
# It will check your custom environment and output additional warnings if needed
check_env(env)

Describe the bug
The output of DQN should be the estimated Q-values, but it seems that there is a softmax layer at the end of the DQN network. How is this DQN trained? Have I misunderstood something?

Code example

Please use the markdown code blocks
for both code and stack traces.

from stable_baselines.common.atari_wrappers import make_atari
from stable_baselines.deepq.policies import MlpPolicy, CnnPolicy
from stable_baselines import DQN,ACKTR,A2C,PPO2
from stable_baselines.common.cmd_util import make_atari_env
from stable_baselines.common.vec_env import VecFrameStack
import cv2
import matplotlib.pyplot as plt
import numpy as np
from IPython import display
import os
game_list = ['Pong','Breakout','SpaceInvaders','Seaquest','BeamRider','Qbert','Enduro']
method_list = ['dqn','ppo2','a2c','acktr']
method = method_list[0]
for game in game_list:
    env = make_atari_env('{}NoFrameskip-v4'.format(game), num_env=1, seed=0)
    env = VecFrameStack(env, n_stack=4)    
    model = DQN.load("trained_agents/{}/{}NoFrameskip-v4.pkl".format(method,game))
    env.reset()
    model.set_env(env)
    obs = env.reset()
    for i in range(1000):
        actions = model.action_probability(obs) #Here I want to get the Q values
        argmax_action = np.argmax(actions)
        action, _states = model.predict(obs)
        print(actions)
        print('the sum: {}'.format(np.sum(actions)))
        obs, rewards, dones, infos = env.step(action)
        episode_infos = infos[0].get('episode')

System Info
[[0.16696884 0.16787794 0.16488545 0.16646059 0.16625851 0.16754872]]
the sum: 1.0
[[0.16700052 0.16733757 0.16515337 0.1665579 0.16647142 0.1674792 ]]
the sum: 1.0
[[0.16888289 0.16632704 0.16429803 0.16766034 0.16476472 0.168067 ]]
the sum: 1.0
[[0.16873147 0.1668533 0.1633186 0.16757944 0.16445793 0.16905922]]
the sum: 0.9999999403953552
[[0.16896197 0.16633949 0.16370982 0.16852172 0.16337387 0.16909312]]
the sum: 1.0
[[0.16926633 0.1685718 0.16038649 0.16665836 0.16892989 0.16618706]]
the sum: 0.9999998807907104
[[0.17271757 0.16515626 0.15838557 0.1654946 0.16886355 0.16938245]]
the sum: 1.0
[[0.17090748 0.16443412 0.15841582 0.16640182 0.17027445 0.16956638]]
the sum: 1.0
RTFM question

All 4 comments

Hello,
you are not looking at the Q-values but at the action probabilities (cf. the documentation).

PS: as mentioned in the issue template, please format your code using a code block.
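
For context, the printed vectors sum to 1 because they are probabilities rather than Q-values. As far as I know, the stable-baselines (v2) DQN policy derives these probabilities by applying a softmax to its Q-values, though this should be checked against the library source. The sketch below is a plain-Python illustration of that relationship, not the library's code; the Q-values are made-up numbers, not taken from any trained agent.

```python
import math


def softmax(q_values):
    """Turn a list of Q-values into probabilities that sum to 1."""
    q_max = max(q_values)  # subtract the max for numerical stability
    exps = [math.exp(q - q_max) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical Q-values for a 6-action game (illustrative numbers only).
q_values = [1.20, 1.25, 1.10, 1.18, 1.17, 1.23]
probs = softmax(q_values)

# The greedy action is the same whether we look at Q-values or probabilities.
greedy_from_q = max(range(len(q_values)), key=lambda a: q_values[a])
greedy_from_p = max(range(len(probs)), key=lambda a: probs[a])

# Shifting every Q-value by a constant leaves the probabilities unchanged,
# so absolute Q-values cannot be recovered from the probabilities alone.
shifted_probs = softmax([q + 100.0 for q in q_values])

# Differences between Q-values are recoverable: log p_i - log p_j = q_i - q_j.
recovered_gap = math.log(probs[1]) - math.log(probs[0])

print(probs)
print("the sum: {}".format(sum(probs)))
```

Under this assumption, Q-values that lie close together produce near-uniform probabilities, which would be consistent with the values around 0.167 printed in the issue.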

Hi Araffin,

Thanks for your reply.
How can I output the Q-values for a given state s_t after fully training a DQN model?
From my understanding, the action probabilities are obtained by normalizing the output Q-values. However, I could not find how to access the Q-values directly.
Thanks a lot.

_, qvalues, _ = model.step_model.step(state, deterministic=True) 

This line of code obtains the Q-values from the DQN model.
I hope this helps other researchers looking for this solution.
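
A minimal sketch of how that call might be wrapped for use in the evaluation loop from the issue. The names get_q_values and greedy_actions are hypothetical helpers, not part of stable-baselines. The sketch assumes that model is a loaded stable-baselines (v2) DQN model whose step_model.step returns a tuple of (actions, Q-values, states), as the line above suggests, and that obs is a batched observation from the vectorized environment.

```python
def get_q_values(model, obs):
    """Return the raw Q-values for a batch of observations.

    Assumes `model` exposes `step_model.step(obs, deterministic=True)`
    returning (actions, q_values, states), as in the comment above.
    This helper is hypothetical, not a stable-baselines function.
    """
    _, q_values, _ = model.step_model.step(obs, deterministic=True)
    return q_values


def greedy_actions(q_values):
    """Pick the index of the highest Q-value for each row in the batch."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q_values]
```

In the loop from the issue, q_values = get_q_values(model, obs) would take the place of the model.action_probability(obs) call, and the printed values would then no longer be expected to sum to 1.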
