Gym: Blackjack bug for "env.reset()"

Created on 19 Dec 2016 · 4 Comments · Source: openai/gym

I'm running Blackjack-v0 with Python 3. It looks like env.reset() does not reset the environment properly: state = env.reset() returns a non-starting state at the beginning of each episode. For example, (20, 8, False) is returned as the first state of an episode, which does not look right, since in theory the first value of the state should be less than 11. This would lead to incorrect training results. Could you help fix the issue? Thanks!

All 4 comments

The observation is a tuple: the player's current total, the dealer's visible card, and whether the player has a usable ace.

In this case, a draw of 20 could indicate (for example) the player has two 10s.
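To see which totals a two-card starting hand can produce, here is a minimal sketch that enumerates every pair of cards. It does not import gym; the deck values (ace as 1, face cards as 10) and the helper name hand_total are assumptions made for illustration, not the environment's own code.

```python
from itertools import product

# Assumed card values: ace counts as 1, face cards count as 10.
DECK = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 10, 10]


def hand_total(hand):
    """Return (total, usable_ace), counting one ace as 11 when that does not bust."""
    total = sum(hand)
    usable = 1 in hand and total + 10 <= 21
    return (total + 10 if usable else total), usable


# Every total that a two-card starting hand can reach.
starting_totals = sorted({hand_total(pair)[0] for pair in product(DECK, repeat=2)})
print(starting_totals)
print(hand_total((10, 10)))
```

Under these assumptions the starting totals run from 4 (two 2s) up to 21 (an ace with a 10), so a first observation of 20 is consistent with a hand that already holds two cards.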

The blackjack environment class has a documentation string that tries to capture this:
https://github.com/openai/gym/blob/b5108b384ed1e29f3d241d654666b8e347f0f7b9/gym/envs/toy_text/blackjack.py#L64

@machinaut Thanks! Yes I know. But I don't think 20 should appear immediately after I call state = env.reset() because the sum for the first card shouldn't exceed 11.

OK. It looks like state = env.reset() contains data for the first 2 cards of the episode. But I still seem to get a strange sequence of observations. The following code plays 100 episodes with random actions:

import gym

env = gym.make("Blackjack-v0")
env.seed(10)  # fix the seed so the episodes are reproducible

for i_episode in range(100):
    # reset() deals the starting hand and returns the first observation
    observation = env.reset()
    print("==============================")
    print("Episode {}:".format(i_episode))
    print("observation reset as: ", observation)
    for t in range(10000):
        # pick a random action and advance the game by one step
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        print("action: ", action)
        print("observation: ", observation)
        print("reward: ", reward)
        print("done: ", done)
        if done:
            print("Episode finished after {} timesteps.\n".format(t + 1))
            break

For some episodes, the starting observation from state = env.reset() looks incorrect, for example:

==============================
Episode 4:
observation reset as:  (21, 3, True)
action:  1
observation:  (18, 3, False)
reward:  0
done:  False
action:  1
observation:  (27, 3, False)
reward:  -1
done:  True
Episode finished after 2 timesteps.

You can see that the card sum in the second observation (18) is less than the one in the first observation (21), which looks incorrect, and you will probably see the same issue in several other episodes. So I wonder if there is something wrong in env.reset(). Thanks for any help!

I believe the behavior is correct. The first observation (21, 3, True) means the player has cards including an ace that sum to 21. Could be [Ace, 10]. It's legal (though silly) to hit on this observation. In this case, it updated to (18, 3, False) which probably means the cards are [Ace, 10, 7] which sum to 18 because you now count the ace as 1 instead of 11.

The logic for this is in usable_ace and sum_hand.
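As a rough illustration of that rule, the sketch below reimplements usable_ace and sum_hand and replays the hand from episode 4. The function bodies are written to match the behavior described above and are not a verbatim copy of the gym source. The cards 7 and 9 are inferred from the logged totals (18 and 27), not read from the environment.

```python
def usable_ace(hand):
    """True if the hand holds an ace that can count as 11 without busting."""
    return 1 in hand and sum(hand) + 10 <= 21


def sum_hand(hand):
    """Hand total, counting one ace as 11 when it is usable."""
    if usable_ace(hand):
        return sum(hand) + 10
    return sum(hand)


DEALER_CARD = 3  # dealer's visible card in the logged episode

hand = [1, 10]  # ace + 10: the ace counts as 11
print((sum_hand(hand), DEALER_CARD, usable_ace(hand)))

hand.append(7)  # hit: the ace must now count as 1
print((sum_hand(hand), DEALER_CARD, usable_ace(hand)))

hand.append(9)  # hit again: the total exceeds 21, so the player busts
print((sum_hand(hand), DEALER_CARD, usable_ace(hand)))
```

The three printed tuples correspond to the observations (21, 3, True), (18, 3, False) and (27, 3, False) from the log, so the drop from 21 to 18 comes from the ace switching from 11 to 1 rather than from a faulty reset.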
