TP7 - RL - ex1cd - iteration 4 in algorithm #6
Comments
Hi, note that, in order to take the walls into account, any policy that would take you into a wall has the effect of making you stay at the same spot. For me, starting from iteration 2 (where M(2,4) = 16.27): if you compute the backup for policy "up" at M(2,4) you get …, and if you compute it for policy "right" at M(2,4), you get …
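To make the wall handling concrete, here is a minimal sketch of that backup (ignoring, for brevity, the stochastic slip mentioned later in the thread); the grid values, reward, and discount factor below are placeholders, not the session's actual numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.uniform(0, 20, size=(4, 5))  # placeholder value grid at iteration k

gamma = 1.0   # assumed discount factor
reward = 0.0  # assumed immediate reward per step

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(pos, action):
    """Successor cell for a deterministic move; walls keep you in place."""
    i, j = pos[0] + MOVES[action][0], pos[1] + MOVES[action][1]
    if 0 <= i < M.shape[0] and 0 <= j < M.shape[1]:
        return (i, j)
    return pos  # bumped into a wall: stay at the same spot

def backup(pos, action):
    """One Bellman backup for a single state-action pair (no slip here)."""
    return reward + gamma * M[step(pos, action)]

# Compare the two actions discussed for M(2,4) (0-indexed (1, 3) here).
print(backup((1, 3), "up"), backup((1, 3), "right"))
```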
Hello, thanks for your answer so far. I think I misused the Bellman equation, but I still have some grey areas about the reasoning and the algorithm... I computed it as follows, considering the 3rd iteration where M(2,4) = 16.3: for policy "up", I got …; for policy "right", … By the way, for the 2nd iteration you've written out, I don't understand why you're computing the term …
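For reference, the general value-iteration update under discussion is the Bellman optimality backup; writing $M_k$ for the value grid at iteration $k$, one sweep computes

$$M_{k+1}(s) \;=\; \max_{a}\; \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\, M_k(s')\bigr],$$

and the Q-value of a single action is the inner sum before taking the max. With walls, the "stay in place" outcome simply shows up as $s' = s$ in that sum.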
Hi, yes indeed, my bad, you are right: there is a mistake in the correction of the practical session. As for fixing the problem:
In its current form, the optimal policy is (U=up, D=down, L=left, R=right): … and the values converge to: …
But if you introduce a negative reward of -1 for each action, then it becomes: …
Or if you define the discount factor (gamma) as 0.9: …
I hope this clarifies the matter. (This SE answer should also give you more info: https://ai.stackexchange.com/questions/35596/what-should-the-discount-factor-for-the-non-slippery-version-of-the-frozenlake-e ) I don't have time to fix it now in the correction (you are welcome to make a pull request, btw), so I'll leave the issue open.
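As a sketch of where those two fixes plug in, here is a generic value-iteration loop; the grid size, cliff cells, goal, and deterministic transitions below are placeholder assumptions, not the session's actual environment:

```python
import numpy as np

rows, cols = 4, 5                    # hypothetical grid size
gamma = 0.9                          # fix 2: discount factor < 1
step_reward = -1.0                   # fix 1: negative reward per action
cliff = {(3, 1), (3, 2), (3, 3)}     # hypothetical cliff cells
goal = (3, 4)                        # hypothetical terminal goal

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def succ(s, a):
    i, j = s[0] + MOVES[a][0], s[1] + MOVES[a][1]
    return (i, j) if 0 <= i < rows and 0 <= j < cols else s  # walls block

V = np.zeros((rows, cols))
for _ in range(100):                 # sweep until approximately converged
    new_V = np.zeros_like(V)
    for i in range(rows):
        for j in range(cols):
            if (i, j) == goal or (i, j) in cliff:
                continue             # terminal cells keep a fixed value of 0
            new_V[i, j] = max(step_reward + gamma * V[succ((i, j), a)]
                              for a in MOVES)
    V = new_V
```

Either knob alone removes the degeneracy: with gamma < 1, longer paths are discounted more, and with a per-step cost they accumulate penalty, so two policies that both reach the goal are no longer tied and detours (like bumping into walls) stop being optimal.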
I think there's an error at the fourth iteration when updating the policy (and thus the values accordingly). Indeed, I've found a higher value of Q when the action is right (Q = 16.446) instead of up (Q = 16.32) for the position (line 2, column 4). The same error also occurs in the next iterations, and all other values are thus slightly different.
In fact, the algorithm finds it more advantageous to go into the wall (0% chance of falling into the cliff) than to go up (5% chance of falling into the cliff).
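To spell that out, here is a hedged sketch of the two backups at that cell, using the 5% slip figure from the comment above (the exact probabilities, rewards, and values in the session may differ):

$$Q(s,\text{right}) = R + \gamma\,V(s) \qquad \text{(wall: stay in place, 0\% cliff risk)}$$

$$Q(s,\text{up}) = R + \gamma\,\bigl[\,0.95\,V(s_{\text{up}}) + 0.05\,V(s_{\text{cliff}})\,\bigr]$$

Whenever the 5% cliff term drags the second expectation below $V(s)$, the backup prefers bumping into the wall, which is exactly the behaviour reported above.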