TP7 - RL - ex1cd - iteration 4 in algorithm #6
Comments
Hi, note that, in order to take the walls into account, any policy that would take you into a wall has the effect of making you stay at the same spot. For me, starting from iteration 2 (where M(2,4) = 16.27): if you compute the backup for policy "up" at M(2,4) you get …, and if you compute it for policy "right" at M(2,4), you get …
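To make the wall handling concrete, here is a minimal sketch of that backup (ignoring, for brevity, the stochastic slip mentioned later in the thread); the grid values, reward, and discount factor below are placeholders, not the session's actual numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.uniform(0, 20, size=(4, 5))  # placeholder value grid at iteration k

gamma = 1.0   # assumed discount factor
reward = 0.0  # assumed immediate reward per step

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(pos, action):
    """Successor cell for a deterministic move; walls keep you in place."""
    i, j = pos[0] + MOVES[action][0], pos[1] + MOVES[action][1]
    if 0 <= i < M.shape[0] and 0 <= j < M.shape[1]:
        return (i, j)
    return pos  # bumped into a wall: stay at the same spot

def backup(pos, action):
    """One Bellman backup for a single state-action pair (no slip here)."""
    return reward + gamma * M[step(pos, action)]

# Compare the two actions discussed for M(2,4) (0-indexed (1, 3) here).
print(backup((1, 3), "up"), backup((1, 3), "right"))
```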
Hello, thanks for your answer so far. I think I misused the Bellman equation, but I still have some grey areas about the reasoning and the algorithm... I computed it as follows, considering the 3rd iteration where M(2,4) = 16.3: for policy "up", I got …; for policy "right", … By the way, for the 2nd iteration you've written out, I don't understand why you're computing the term …
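For reference, the general value-iteration update under discussion is the Bellman optimality backup; writing $M_k$ for the value grid at iteration $k$, one sweep computes

$$M_{k+1}(s) \;=\; \max_{a}\; \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, M_k(s')\bigr],$$

and the Q-value of a single action is the inner sum before taking the max. With walls, the "stay in place" outcome simply shows up as $s' = s$ in that sum.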
Hi, yes indeed, my bad, you are right: there is a mistake in the correction of the practical session. As for fixing the problem:
In its current form, the optimal policy is (U=up, D=down, L=left, R=right): … and the values converge to: …
But if you introduce a negative reward of -1 for each action, then it becomes: …
Or if you define the discount factor (gamma) as 0.9: …
I hope this clarifies the matter. (This SE answer should also give you more info: https://ai.stackexchange.com/questions/35596/what-should-the-discount-factor-for-the-non-slippery-version-of-the-frozenlake-e ) I don't have time to fix it now in the correction (you are welcome to make a pull request, btw), so I'll leave the issue open.
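As a sketch of where those two fixes plug in, here is a generic value-iteration loop; the grid size, cliff cells, goal, and deterministic transitions below are placeholder assumptions, not the session's actual environment:

```python
import numpy as np

rows, cols = 4, 5                    # hypothetical grid size
gamma = 0.9                          # fix 2: discount factor < 1
step_reward = -1.0                   # fix 1: negative reward per action
cliff = {(3, 1), (3, 2), (3, 3)}     # hypothetical cliff cells
goal = (3, 4)                        # hypothetical terminal goal

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def succ(s, a):
    i, j = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (i, j) if 0 <= i < rows and 0 <= j < cols else s  # walls block

V = np.zeros((rows, cols))
for _ in range(100):                 # sweep until approximately converged
    new_V = np.zeros_like(V)
    for i in range(rows):
        for j in range(cols):
            if (i, j) == goal or (i, j) in cliff:
                continue             # terminal cells keep a fixed value of 0
            new_V[i, j] = max(step_reward + gamma * V[succ((i, j), a)]
                              for a in MOVES)
    V = new_V
```

Either knob alone removes the degeneracy: with gamma < 1, longer paths are discounted more, and with a per-step cost they accumulate penalty, so two policies that both reach the goal are no longer tied and detours (like bumping into walls) stop being optimal.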
I think there's an error at the fourth iteration when updating the policy (and thus the values accordingly). Indeed, I've found a higher value of Q when the action is right (Q = 16.446) instead of up (Q = 16.32) for the position (line 2, column 4). The same error also occurs in the next iterations, and all other values are thus slightly different.
In fact, the algorithm finds it more advantageous to go into the wall (0% chance of falling into the cliff) than to go up (5% chance of falling into the cliff).
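To spell that out, here is a hedged sketch of the two backups at that cell, using the 5% slip figure from the comment above (the exact probabilities, rewards, and values in the session may differ):

$$Q(s,\text{right}) = R + \gamma\,V(s) \qquad \text{(wall: stay in place, 0\% cliff risk)}$$

$$Q(s,\text{up}) = R + \gamma\,\bigl[\,0.95\,V(s_{\text{up}}) + 0.05\,V(s_{\text{cliff}})\,\bigr]$$

Whenever the 5% cliff term drags the second expectation below $V(s)$, the backup prefers bumping into the wall, which is exactly the behaviour reported above.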