Reinforcement-learning: issue with value update function

Created on 14 Apr 2018 · 6 comments · Source: dennybritz/reinforcement-learning

Shouldn't this be the value update function?

v += action_prob * (reward + discount_factor * prob * V[next_state])

instead of :
v += action_prob * prob * (reward + discount_factor * V[next_state])

According to the slides:

[slide capture: value update equation from the slides]
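
For reference, the equation on the slide is (transcribing it here, so take my notation with a grain of salt) the Bellman expectation equation:

v_\pi(s) = \sum_{a} \pi(a \mid s) \Big( \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \, v_\pi(s') \Big)

where \mathcal{R}_s^a is the expected immediate reward for taking a in s. The reward sits outside the sum over s', which is what the first code line expresses, while the notebook's line puts a per-transition reward inside the sum.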

All 6 comments

@gskishan004 Which file are you talking about?

@jonahweissman I'm talking about the policy_eval function in Policy Evaluation Solution.ipynb

@gskishan004 I'm not totally confident about this, but it looks like the only difference between the two equations is whether the reward is multiplied by the transition probability. The slides assume that taking an action a in state s yields an expected reward R no matter which state transition occurs. In dennybritz's implementation, an action can yield different rewards depending on which state the environment puts you in.
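
To make the difference concrete, here is a minimal sketch of the two backup variants, assuming the (prob, next_state, reward, done) tuples that env.P[s][a] exposes in this repo (the function names are mine, just for illustration):

def backup_notebook(env, policy, V, s, discount_factor=1.0):
    # Reward is weighted by the transition probability, so the backup
    # handles rewards that depend on which next_state the env selects.
    v = 0.0
    for a, action_prob in enumerate(policy[s]):
        for prob, next_state, reward, done in env.P[s][a]:
            v += action_prob * prob * (reward + discount_factor * V[next_state])
    return v

def backup_slides(env, policy, V, s, discount_factor=1.0):
    # Reward sits outside the probability weighting, as on the slide.
    # Note: if env.P[s][a] holds more than one transition, this adds the
    # reward once per transition, unweighted, so it only matches the
    # slides' expected reward R(s, a) when the transition is deterministic.
    v = 0.0
    for a, action_prob in enumerate(policy[s]):
        for prob, next_state, reward, done in env.P[s][a]:
            v += action_prob * (reward + discount_factor * prob * V[next_state])
    return v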

I tested both formulas, and it turns out they give exactly the same result. Strange!

Any explanation? Might it be because both converge to the same optimal value?

@gskishan004 What environment are you testing in? If the environment you picked is deterministic (taking an action a in a given state s always transitions to the same next state s'), then it makes perfect sense that the two equations behave identically. They differ only when the transition probabilities are not 0 or 1.
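
A quick numeric check with a made-up stochastic transition (the numbers are hypothetical, not from GridWorld) shows the divergence:

# Hypothetical action with two possible outcomes: land in state 0
# (reward 0) with prob 0.5, or in state 1 (reward 10) with prob 0.5.
transitions = [(0.5, 0, 0.0), (0.5, 1, 10.0)]  # (prob, next_state, reward)
V = [1.0, 2.0]
gamma, action_prob = 1.0, 1.0

notebook = sum(action_prob * p * (r + gamma * V[ns]) for p, ns, r in transitions)
slides = sum(action_prob * (r + gamma * p * V[ns]) for p, ns, r in transitions)

print(notebook)  # 0.5*(0 + 1) + 0.5*(10 + 2) = 6.5
print(slides)    # (0 + 0.5*1) + (10 + 0.5*2) = 11.5

With probabilities of exactly 0 or 1, both sums collapse to the same value.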

@gskishan004 This is a duplicate of https://github.com/dennybritz/reinforcement-learning/issues/130
Please see the answer from @memoiry there.
Briefly, in the case of GridWorld the transitions are deterministic: the agent's action fully determines the next state, i.e. p(s' | s, a) = 1 for the single state s' that action a leads to, and 0 otherwise (e.g. p(s' | s, a) = 1 for s' = 2, s = 6, a = 'up/north', and 0 for every other s' -- cf. figure below).
[screenshot: GridWorld diagram illustrating the deterministic transitions]
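
Spelled out: with a deterministic transition the sum over s' collapses to a single term with probability 1,

\sum_{s'} p(s' \mid s, a)\,\big(r + \gamma V(s')\big) = 1 \cdot \big(r + \gamma V(s')\big) = r + \gamma \cdot 1 \cdot V(s'),

so weighting the reward by prob changes nothing when prob = 1, which is why both code lines give identical results on GridWorld.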
