Question about strategy #6
Personally I've found the simpler the rewards the better. A great example is http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html. That wall-proximity reward isn't really needed: given rewards for eating things, plus sensors which can see both walls and things to eat, the agent should learn to avoid walls on its own. It's not like people walk around thinking "oh shit, a wall, so very afraid of walls, I'd better go this way"; walls are simply obstacles between them and rewards. The forward-movement bonus isn't essential either. It smooths movement, but moving forward is how you get closer to rewards anyway, so it's inherent in eating.

That said, with a value-based DQN (no actor-critic, no policy-based method, no LSTM or RNN), it takes strings of random actions to get past an obstacle and onto a reward; sequences aren't really a thing with this setup.

https://github.com/mryellow/reinforcejs/tree/demo-multiagent

WallWorld there is a little experiment with two sets of sensors: eyes which see walls, and nostrils which see goals (even through walls). Given the random experiences needed to bypass the wall, it can learn that walls are obstacles without any further reward signals. However, without access to sequences of actions it can't escape traps: in the corners it behaves deterministically, saying "wall, turn around, head towards the goal" instead of "another wall, keep following it, there is a wall in our past blocking the way". So here, something like that proximity reward ensures there is always a signal to go on, and the reliance on randomly stumbling into the "right" experiences is reduced.

In the Atari presentation slides Google say they went with feeding in constant small rewards, on the assumption that humans respond to little rewards rather than looking too far ahead for the big payoff.
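As a rough sketch of the "keep rewards simple" point, assuming a hypothetical world/agent API (this is not the demo's actual code): reward only the eating, and leave wall avoidance to the sensors.

```js
// Rough sketch of the "keep rewards simple" idea, with a hypothetical world
// API: reward only the eating, no wall-proximity term, no forward-motion bonus.
function computeReward(agent, world) {
  var reward = 0;
  if (world.agentAteItem(agent)) {  // hypothetical helper: did we eat something this tick?
    reward += 1.0;
  }
  // Deliberately no wall-proximity penalty and no forward bonus; with sensors
  // that see both walls and food, avoidance should be learned implicitly.
  return reward;
}
```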
Cool. So in something trivial like tic-tac-toe: a little reward (0.1) for a valid move, a medium penalty (-1.0) for breaking the rules by trying for an occupied square, and a big +10 for a win. I had considered not forcing it to learn the rules, and instead doing something like allowed=[array of empty valid squares]; but yeah, that smells funny.
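A sketch of that reward scheme, using the values suggested above and a hypothetical board API:

```js
// Sketch of the reward scheme above; the board helpers are assumptions.
function ticTacToeReward(board, move, player) {
  if (!board.isEmpty(move)) return -1.0;   // medium penalty: tried an occupied square
  board.place(move, player);
  if (board.isWin(player))  return 10.0;   // big payoff for a win
  return 0.1;                              // small reward for any other valid move
}
```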
You could filter it after the fact, but with the right setup it should be able to learn the rules. If the filter is captured in the experiences then it's something the agent can learn to exploit (it becomes part of the rules); if it's after-the-fact then you're cleaning up actions to obey the rules, choosing a new action whenever something is wrong. (Philosophically I believe this is a bad idea: the "3 laws of robotics" is something for agents to work around, rather than something ingrained. Seems like an opportunity for over-complexity and bugs. That said, I have used that kind of approach with decent results.) When actually playing and wanting few/zero illegal moves, it's probably a matter of setting …
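For concreteness, a sketch of what that after-the-fact filter looks like. The board helpers are assumptions; only the agent.act(state) call follows the usual reinforcejs shape.

```js
// Sketch of the "filter after the fact" approach discussed (and argued against)
// here: if the agent picks an illegal square, silently re-pick among legal ones.
function chooseLegalAction(agent, state, board) {
  var action = agent.act(state);
  if (!board.isEmpty(action)) {
    var legal = board.emptySquares();                        // e.g. [0, 4, 8]
    action = legal[Math.floor(Math.random() * legal.length)];
  }
  return action;  // the agent never sees that its original pick was overridden
}
```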
Morning shower brain kicked in and I'm off on a bit of a tangent there. The filter and the learning-in-the-past / epsilon stuff is more about when you must choose an action in real time. With tic-tac-toe you'd just keep adding invalid moves to the experience log until one worked. Like when a human who doesn't know chess is playing against a computer which rejects their illegal moves: no move actually happens. Now the question is whether you penalise those illegal moves or simply give 0 reward. One or the other may push and pull on the resulting approximation a little more than desired.
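For contrast, a sketch of the "keep adding invalid moves to the experience log until one works" loop. The reward values follow the earlier suggestion (+0.1 legal, -1.0 illegal), and the board helpers plus the exact act/learn ordering are assumptions for illustration rather than the library's actual training loop.

```js
// Sketch: reject illegal moves, but still log each attempt as an experience.
function playOneMove(agent, board, player) {
  if (board.emptySquares().length === 0) return null;  // board full, nothing to do
  while (true) {
    var state  = board.toStateVector(player);
    var action = agent.act(state);
    if (board.isEmpty(action)) {
      board.place(action, player);
      agent.learn(0.1);           // legal move: small positive reward
      return action;
    }
    // Illegal move: no move actually happens on the board, but the attempt is
    // still learned from. Whether this deserves a penalty or just 0 reward is
    // the open question above.
    agent.learn(-1.0);
  }
}
```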
I feel like I'm close to being clear. Is there a way to both have memory (something small, like 7-move visibility into the past) AND not clutter it up with illegal moves?
I am fairly certain that it runs counter to the theory of RL to include …
It seems to me like your configuration is never going to work perfectly …
You'll have a hard time achieving the 2nd goal if part of your state/action …
Sutton and Barto's introduction to RL has a couple good passages on when RL …
The most basic message of these texts is that RL algorithms are useful when …
Playing legal tic-tac-toe moves wouldn't really qualify as something an RL …
Notice Andrej Karpathy's demos: the 'rules' of movement for the agents are …
Maybe, I'm no mathematician, purely practical.
You can learn both the rules and how to win at the same time. Sure, mistakes will be made and illegal moves attempted, but non-perfect actions also get executed when focusing exclusively on winning. The agent should learn to avoid the illegal moves, as they don't lead quickly to a decaying reward.
Struggling to find the paper, a recent one about how reflexes are trained in the past and executed in the present. Turning …

edit: To clarify that last bit: if the environment is the thing which is always changing (and you have a log of that state over time), your goal is to select actions with good outcomes for the agent in that changing state. Then it's possible to choose actions in the past and determine the resulting reward (instantly if you like, as you're in the past and can look up the future states). Then in the present you can run at …
Likely where an RNN comes in, where the good moves are represented in the net's weights over time, regardless of what is going on in the experience replay. Right? Not sure that really matters for tic-tac-toe though; each state isn't too dependent on previous actions (or sequences of actions), they are right there in the current state to be seen.
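A sketch of that "everything is visible in the current state" point: a flat encoding of the board that could be fed to the value network each turn. The one-hot-per-square layout and the board.squares field are assumptions for illustration.

```js
// Sketch of why tic-tac-toe needs no memory: the whole game is visible in the
// current board, so a flat state vector is enough each turn.
function boardToState(board, player) {
  var state = [];
  for (var i = 0; i < 9; i++) {
    var cell = board.squares[i];                         // assumed: 0 = empty, else a player id
    state.push(cell === player ? 1 : 0);                 // my mark
    state.push(cell !== 0 && cell !== player ? 1 : 0);   // opponent's mark
    state.push(cell === 0 ? 1 : 0);                      // empty
  }
  return state;  // length 27, fed straight to the value network
}
```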
Ok. So if I'm getting it all clear:
FYI: I'm trying to learn to play this: http://mathwithbaddrawings.com/2013/06/16/ultimate-tic-tac-toe/
That is an interesting part of it. Experience replay allows DQN to work like supervised learning, where it has a labelled data-set to chew on.

In most of these JavaScript experiments, when you first load up a pre-trained agent you're only loading the net weights. The pre-trained agent has a brain all ready to, say, avoid a wall, but given that it forms its own experience-replay data-set there is an opportunity for it to very quickly poison itself and go off-track. There is less diversity of past experiences to balance against the current one, while those same past experiences are being "randomly" selected every single tick for replay, pushing the weights in a biased direction.

To combat this, some of Karpathy's demos have a grace period where no learning is done until a number of experiences have been gathered. You could also bootstrap with experiences from a past episode, or with prioritised experiences showing key states/actions.

When playing game after game, you'll probably want to keep everything for replay. Then down the track maybe look at bootstrapping in some good experiences when you want to demo it.
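A sketch of that grace-period idea, assuming a hypothetical learnFromExperience() update; it mirrors the behaviour described above but is not the demos' actual code.

```js
// Sketch: gather experiences every tick, but hold off on weight updates until
// the replay buffer has some diversity. Sizes are assumptions.
var replay = [];
var GRACE_PERIOD = 1000;     // experiences to gather before learning starts
var REPLAYS_PER_TICK = 10;

function tick(state, action, reward, nextState) {
  // Keep everything, as suggested above for game-after-game play.
  replay.push({ s: state, a: action, r: reward, s2: nextState });

  if (replay.length < GRACE_PERIOD) return;  // still warming up: no learning yet

  // Replay a handful of randomly selected past experiences each tick.
  for (var k = 0; k < REPLAYS_PER_TICK; k++) {
    var e = replay[Math.floor(Math.random() * replay.length)];
    learnFromExperience(e);  // hypothetical Q-update from one stored transition
  }
}
```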
Fantastic library!
I have a ton of questions, most of which likely have answers along the lines of "it depends" :) But, the top questions:
Any particular reason to have it be a function? (relates back to my #1 question)