Reinforcement-learning: issue with value update function

Created on 14 Apr 2018 · 6 comments · Source: dennybritz/reinforcement-learning

Shouldn't this be the value update function?

v += action_prob * (reward + discount_factor * prob * V[next_state])

instead of :
v += action_prob * prob * (reward + discount_factor * V[next_state])

According to the slides:

[slide capture: value update equation from the slides]
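
For reference, the equation on the slide is (transcribing it here, so take my notation with a grain of salt) the Bellman expectation equation:

v_\pi(s) = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_\pi(s') \Big)

where \mathcal{R}_s^a is the expected immediate reward for taking a in s. The reward sits outside the sum over s', which is what the first code line expresses, while the notebook's line puts a per-transition reward inside the sum.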

All 6 comments

@gskishan004 Which file are you talking about?

@jonahweissman I'm talking about the policy_eval function in Policy Evaluation Solution.ipynb

@gskishan004 I'm not totally confident about this, but it looks like the only difference between the two equations is whether the reward is multiplied by the transition probability. The slides assume that taking an action a in state s yields an expected reward R no matter which state transition occurs. In dennybritz's implementation, an action can yield different rewards depending on which state the environment puts you in.
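
To make the difference concrete, here is a minimal sketch of the two backup variants, assuming the (prob, next_state, reward, done) tuples that env.P[s][a] exposes in this repo (the function names are mine, just for illustration):

def backup_notebook(env, policy, V, s, discount_factor=1.0):
    # Reward is weighted by the transition probability, so the backup
    # handles rewards that depend on which next_state the env selects.
    v = 0.0
    for a, action_prob in enumerate(policy[s]):
        for prob, next_state, reward, done in env.P[s][a]:
            v += action_prob * prob * (reward + discount_factor * V[next_state])
    return v

def backup_slides(env, policy, V, s, discount_factor=1.0):
    # Reward sits outside the probability weighting, as on the slide.
    # Note: if env.P[s][a] holds more than one transition, this adds the
    # reward once per transition, unweighted, so it only matches the
    # slides' expected reward R(s, a) when the transition is deterministic.
    v = 0.0
    for a, action_prob in enumerate(policy[s]):
        for prob, next_state, reward, done in env.P[s][a]:
            v += action_prob * (reward + discount_factor * prob * V[next_state])
    return v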

I tested both formulas, and it turns out they give exactly the same result. Strange!

Any explanation? Might it be because both converge to the same optimal value?

@gskishan004 What environment are you testing in? If the environment you picked is deterministic (taking an action a in a given state s always transitions to the same next state s'), then it makes perfect sense that the two equations behave identically. They differ only when the transition probabilities are not 0 or 1.
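
A quick numeric check with a made-up stochastic transition (the numbers are hypothetical, not from GridWorld) shows the divergence:

# Hypothetical action with two possible outcomes: land in state 0
# (reward 0) with prob 0.5, or in state 1 (reward 10) with prob 0.5.
transitions = [(0.5, 0, 0.0), (0.5, 1, 10.0)]  # (prob, next_state, reward)
V = [1.0, 2.0]
gamma, action_prob = 1.0, 1.0

notebook = sum(action_prob * p * (r + gamma * V[ns]) for p, ns, r in transitions)
slides = sum(action_prob * (r + gamma * p * V[ns]) for p, ns, r in transitions)

print(notebook)  # 0.5*(0 + 1) + 0.5*(10 + 2) = 6.5
print(slides)    # (0 + 0.5*1) + (10 + 0.5*2) = 11.5

With probabilities of exactly 0 or 1, both sums collapse to the same value.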

@gskishan004 This is a duplicate of https://github.com/dennybritz/reinforcement-learning/issues/130
Please see the answer from @memoiry there.
Briefly, in the case of GridWorld the transitions are deterministic: the agent's action fully determines the next state, i.e. p(s' | s, a) = 1 for the single state s' that action a leads to, and 0 otherwise (e.g. p(s' | s, a) = 1 for s' = 2, s = 6, a = 'up/north', and 0 for every other s' -- cf. figure below).
[screenshot: GridWorld diagram illustrating the deterministic transitions]
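
Spelled out: with a deterministic transition the sum over s' collapses to a single term with probability 1,

\sum_{s'} p(s' \mid s, a)\,\big(r + \gamma V(s')\big) = 1 \cdot \big(r + \gamma V(s')\big) = r + \gamma \cdot 1 \cdot V(s'),

so weighting the reward by prob changes nothing when prob = 1, which is why both code lines give identical results on GridWorld.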
