I'm running Blackjack-v0 with Python 3. It looks like env.reset() does not reset the environment properly: state = env.reset() returns a non-starting state for each episode. For example, (20, 8, False) is returned as the first state of an episode, which does not look right, since in theory the first value of the state should be less than 11. This would lead to incorrect training results. Could you help fix the issue? Thanks!
The observation is: player current total, dealer visible total, and whether the player has a usable Ace.
In this case, a draw of 20 could indicate (for example) the player has two 10s.
The blackjack environment class has a documentation string that tries to capture this:
https://github.com/openai/gym/blob/b5108b384ed1e29f3d241d654666b8e347f0f7b9/gym/envs/toy_text/blackjack.py#L64
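To make the observation layout concrete, here is a minimal sketch of how such a tuple could be built from the two hands. It is not the environment's code: `observe` is a hypothetical helper, and it assumes that cards are stored as integers with an ace as 1 and face cards as 10, and that the first card in the dealer's hand is the visible one.

```python
def observe(player_hand, dealer_hand):
    """Hypothetical helper: (player total, dealer's visible card, usable ace)."""
    total = sum(player_hand)
    # Assumption: an ace is stored as 1 and is "usable" when counting it
    # as 11 (adding 10 to the total) does not push the hand over 21.
    usable = 1 in player_hand and total + 10 <= 21
    if usable:
        total += 10
    return (total, dealer_hand[0], usable)


# Two 10s against a dealer showing an 8: a valid first observation.
print(observe([10, 10], [8, 5]))  # (20, 8, False)
# Ace plus 10 against a dealer showing a 3.
print(observe([1, 10], [3, 9]))   # (21, 3, True)
```

Under these assumptions, a first observation of (20, 8, False) simply reflects a two-card starting hand rather than a single card.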
@machinaut Thanks! Yes I know. But I don't think 20 should appear immediately after I call state = env.reset() because the sum for the first card shouldn't exceed 11.
OK. It looks like state = env.reset() contains data for the first two cards in the episode. But I still seem to get a strange sequence of observations. The following code plays 100 episodes with random actions:
import gym

env = gym.make("Blackjack-v0")
env.seed(10)

for i_episode in range(100):
    observation = env.reset()
    print("==============================")
    print("Episode {}:".format(i_episode))
    print("observation reset as: ", observation)
    for t in range(10000):
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        print("action: ", action)
        print("observation: ", observation)
        print("reward: ", reward)
        print("done: ", done)
        if done:
            print("Episode finished after {} timesteps.\n".format(t+1))
            break
For some episodes, the starting observation from state = env.reset() looks incorrect, for example:
==============================
Episode 4:
observation reset as: (21, 3, True)
action: 1
observation: (18, 3, False)
reward: 0
done: False
action: 1
observation: (27, 3, False)
reward: -1
done: True
Episode finished after 2 timesteps.
You can see that the card sum in the second observation (18) is less than the one in the first observation (21), which looks incorrect, and several other episodes show the same pattern. So I wonder if there is something wrong in env.reset(). Thanks for any help!
I believe the behavior is correct. The first observation (21, 3, True) means the player has cards including an ace that sum to 21. Could be [Ace, 10]. It's legal (though silly) to hit on this observation. In this case, it updated to (18, 3, False) which probably means the cards are [Ace, 10, 7] which sum to 18 because you now count the ace as 1 instead of 11.
The logic for this is in usable_ace and sum_hand.
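As an illustration, here is a sketch of that logic as I understand it. It borrows the names usable_ace and sum_hand from the environment, but it is a reconstruction and may not match the source line for line. It assumes an ace is stored as 1 and a 10 or face card as 10.

```python
def usable_ace(hand):
    # An ace (stored as 1) is usable if counting it as 11 does not bust the hand.
    return 1 in hand and sum(hand) + 10 <= 21


def sum_hand(hand):
    # Count one ace as 11 when that is safe; otherwise every ace counts as 1.
    return sum(hand) + 10 if usable_ace(hand) else sum(hand)


hand = [1, 10]                           # Ace + 10
print(sum_hand(hand), usable_ace(hand))  # 21 True

hand.append(7)                           # hit and draw a 7: the ace drops to 1
print(sum_hand(hand), usable_ace(hand))  # 18 False

hand.append(9)                           # hit again and draw a 9: bust
print(sum_hand(hand))                    # 27
```

This trace reproduces the sequence 21, 18, 27 from Episode 4. The specific cards (7 and 9) are only one possibility consistent with those totals.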