
Question about strategy #6

salamanders opened this issue Oct 21, 2015 · 9 comments

@salamanders

Fantastic library!
I have a ton of questions, most of which likely have answers along the lines of "it depends" :) But, the top questions:

  1. In a basic game with many invalid moves (don't crash into a wall, don't play an invalid move, etc.) and only a few valid moves, is it normally better to let the system "work out" the rules? I was considering an alternative of only offering the agent a list of valid moves and having it pick among them, but intuitively that seems more confusing: a small shift in the valid moves would "off-by-one" the list, and that would be hard for the agent to learn.
  2. I combined a few examples for the spec. Are any of these "bad" in a general game solver?
    spec.update = 'qlearn'; // 'qlearn' or 'sarsa'
    spec.gamma = 0.9; // discount factor, [0, 1)
    spec.epsilon = 0.2; // initial epsilon for epsilon-greedy policy, [0, 1)
    spec.lambda = 0.8; // eligibility trace decay, [0,1). 0 = no eligibility traces
    spec.replacing_traces = false; // use replacing or accumulating traces
    spec.planN = 50; // number of planning steps per iteration. 0 = no planning
    spec.smooth_policy_update = true; // non-standard, updates policy smoothly to follow max_a Q
    spec.beta = 0.1; // learning rate for smooth policy update
    spec.alpha = 0.005; // value function learning rate
    spec.experience_add_every = 5; // number of time steps before we add another experience to replay memory
    spec.experience_size = 10000; // size of the experience replay memory
    spec.learning_steps_per_iteration = 5; // replay samples trained on per learning step
    spec.tderror_clamp = 1.0; // clamp TD error magnitude, for robustness
    spec.num_hidden_units = 100; // number of neurons in hidden layer
  3. In your examples, you have
    env.getNumStates = function() {
      return 9;
    };
    env.getMaxNumActions...

Any particular reason to have it be a function? (relates back to my #1 question)
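For reference, here's a minimal sketch of how I understand that env/spec pair plugs into the DQN agent, following the README (the 9-action tic-tac-toe numbers are my own assumption):

    // Sketch: wiring the env and spec above into a reinforcejs DQN agent.
    var env = {};
    env.getNumStates = function() { return 9; };     // length of the state vector fed to the net
    env.getMaxNumActions = function() { return 9; }; // assumption: one action per square

    var agent = new RL.DQNAgent(env, spec);          // spec as listed above

    // Typical act/learn loop from the README:
    var state = [0, 0, 0, 0, 0, 0, 0, 0, 0];         // hypothetical board encoding
    var action = agent.act(state);                   // integer in [0, getMaxNumActions())
    // ... apply the action in the game and compute a reward ...
    var reward = 0;                                  // placeholder
    agent.learn(reward);                             // nudges Q towards the observed reward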

@mryellow

> is it normally better to let the system "work out" the rules?

Personally I've found the simpler the rewards, the better.

A great example is:

http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html

That wall-proximity reward isn't really needed. Given rewards for eating things, and sensors which can see both walls and things to eat, the agent should learn to avoid walls. It's not like people walk around thinking "oh shit, a wall, so very afraid of walls, I better go this way"; instead, walls are just an obstacle to rewards. The forward-movement bonus isn't essential either; it smooths movement, but moving forward is already how you get closer to rewards, so it's inherent to eating.

That said, with a value-based DQN (not actor-critic or policy-based, and without an LSTM or RNN), it takes strings of random actions to get past an obstacle and onto a reward; sequences aren't really a thing with this setup.

https://github.com/mryellow/reinforcejs/tree/demo-multiagent

WallWorld here is a little experiment with 2 sets of sensors: eyes which see walls, and nostrils which see goals (even through walls). Given the random experiences needed to bypass the wall, it can learn that walls are obstacles without any further reward signals. However, without access to sequences of actions, it can't escape traps: the corners are deterministic and say "wall, turn around, towards goal", instead of "another wall, keep following, there is a wall in our past blocking it".

So here, stuff like that proximity reward ensures there is always something to go on, and reduces how much you rely on randomly having the "right" experiences to learn from. In the Atari presentation slides, Google say they went with feeding in constant rewards, on the assumption that humans respond to little rewards rather than looking too far ahead for the big payoff.
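As a rough sketch of what "simpler rewards" means here (hypothetical names, not the demo's actual code):

    // Hypothetical per-step reward for a foraging agent: reward the thing you
    // actually care about (eating) and let wall avoidance emerge on its own.
    function computeReward(world) {
      var reward = 0;
      if (world.ateGoodItem) reward += 1.0;
      if (world.ateBadItem) reward -= 1.0;
      // No wall-proximity penalty and no forward-motion bonus: with sensors that
      // see both walls and food, avoiding walls and moving forward get learned
      // because they are what leads to the eating reward.
      return reward;
    }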

@salamanders (Author)

Cool. So in something trivial like tic-tac-toe, a little (0.1) reward for a valid move, a medium penalty (-1.0) for breaking the rules by trying for an occupied square, and a big +10 for a win.

I had considered not forcing it to learn the rules, and instead doing something like

var allowed = [/* indices of the empty, valid squares */];
var move = allowed[action % allowed.length];

But yea. That smells funny.
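For the reward scheme above, a rough sketch (the `board` array and `isWin` helper are hypothetical):

    // Hypothetical reward for a proposed tic-tac-toe move (square index 0..8),
    // following the scheme above: +0.1 legal, -1.0 illegal, +10 for a win.
    // `board` is a 9-element array with 0 meaning empty.
    function rewardFor(board, move, player) {
      if (board[move] !== 0) return -1.0;   // illegal: square already occupied
      var next = board.slice();
      next[move] = player;
      return isWin(next, player) ? 10.0 : 0.1;
    }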

@mryellow

You could filter it after the fact, but with the right setup it should be able to learn the rules. If the filter is captured in the experiences, then it's something the agent can learn to exploit (it becomes part of the rules); if it's after-the-fact, then you're cleaning up actions to obey the rules, choosing a new action if something is wrong. (Philosophically I believe this is a bad idea: "3 laws of robotics" is something for agents to work around, rather than something ingrained. It seems like an opportunity for over-complexity and bugs. That said, I have used that kind of approach with decent results.)

When actually playing and wanting few/zero illegal moves, it's probably a matter of setting epsilon to 0 for a forward pass to remove the random element, while still including it for training experiences.
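In reinforcejs terms that's roughly the following sketch (as far as I can tell the DQN agent keeps `epsilon` as a plain property, so you can just toggle it):

    // Sketch: greedy forward pass for "real" play, exploration kept for training.
    function greedyAct(agent, state) {
      var saved = agent.epsilon;
      agent.epsilon = 0;              // no random exploration for this decision
      var action = agent.act(state);
      agent.epsilon = saved;          // restore exploration for training passes
      return action;
    }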

@mryellow

Morning shower brain kicked in and I'm off on a little bit of a tangent there.

The filtering and learning-with-past-epsilon stuff is more about when you must choose an action in real time. With tic-tac-toe you'd just keep adding invalid moves to the experience log until one worked. Like when a human who doesn't know chess plays against a computer which rejects their illegal moves: no move actually happens.

Now the question is: do you penalise those illegal moves, or simply give 0 reward? One or the other may push and pull on the resulting approximation a little more than desired.
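Concretely, that "keep adding invalid moves until one works" loop could look something like this sketch (`encodeBoard` is a hypothetical helper, and the illegal-move reward is the knob in question):

    // Sketch: let the agent keep proposing moves; rejected moves still become
    // experiences. Set ILLEGAL_REWARD to -1.0 (penalise) or 0 (just no reward).
    var ILLEGAL_REWARD = -1.0;

    function pickLegalMove(agent, board) {
      while (true) {
        var action = agent.act(encodeBoard(board));   // hypothetical state encoder
        if (board[action] === 0) return action;       // legal: play it
        agent.learn(ILLEGAL_REWARD);                  // rejected, but still learned from
      }
    }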

@salamanders (Author)

I feel like I'm close to being clear.
So for tic-tac-toe: the fact that you are rejecting illegal moves doesn't "reset" the memory, so the brain might mistakenly think "I have to make these 9 illegal moves, with some small punishment, to get to the winning game", which is of course silly.

Is there a way to both have memory (something small, like 7-move visibility into the past) AND not clutter it up with illegal moves?

@rwill128

I am fairly certain that it runs counter to the theory of RL to include illegal moves in the state/action-value estimates. I'm not sure why you're trying to do it this way.

It seems to me like your configuration is never going to work perfectly because you're not actually trying to make the agent learn a single goal. Instead you're trying to achieve two goals: 1. learn how to play tic-tac-toe (i.e. make 'legal' choices depending on the state), and 2. win at tic-tac-toe (make legal and strong moves).

You'll have a hard time achieving the 2nd goal if part of your state/action space includes moves that are not even legal, because those will always be chosen some % of the time, unless you drop your exploration rate to zero, and in that case you don't have a learning algorithm anymore. You just have an obfuscated set of instructions for doing one thing.

Sutton and Barto's introduction to RL has a couple of good passages on when RL algorithms are useful (and there are many other sources on this as well).

The most basic message of these texts is that RL algorithms are useful when the behavior necessary to achieve a certain goal is either not understood at all at the time a program is written, or is not subject to clear, formulaic rules that can be expressed in a traditional algorithm.

Playing legal tic-tac-toe moves wouldn't really qualify as something an RL algorithm is designed to do, although it could. Current methods will always do it sub-optimally compared to traditional algorithms, though.

Notice Andrej Karpathy's demos: the 'rules' of movement for the agents are baked into the simulation. His code doesn't say "agent can only move forward and backward and up and down, but if it tries to move diagonally we'll just do nothing and punish it."

@mryellow

> runs counter to the theory of RL

Maybe. I'm no mathematician, purely practical.

> includes moves that are not even legal, because those will always be chosen some % of the time

You can learn both the rules and how to win at the same time. Sure, mistakes will be made and illegal moves attempted, but non-perfect actions will also be executed when focusing exclusively on winning. The agent should learn to avoid the illegal moves, as they don't lead quickly to a decaying reward.

> unless you drop your exploration rate to zero, and in that case you don't have a learning algorithm anymore

Struggling to find the paper; a recent one about how reflexes are trained in the past and executed in the present. Turning epsilon to zero for a forward pass to find a good action works in practice... You've just gotta keep something learning in the background (this depends on having another training set of experiences recorded, separate from your "real" actions; the training set still influences its own data... It's easier when the agent's actions don't have effects on the environment).

edit: To clarify that last bit: if the environment is the thing which is always changing (and you have a log of that state over time), your goal is to select actions with good outcomes for the agent in that changing state. Then it's possible to choose actions in the past and determine the resulting reward (instantly if you like, as you're in the past and can look up the future states). Then in the present you can run at epsilon=0, reacting to the current state based on what you've learnt on delayed states.

> Is there a way to both have memory (something small like 7 move visibility into the past) AND not clutter it up with illegal moves?

Likely where an RNN comes in, where the good moves are represented in the net's weights over time, regardless of what is going on in the experience replay. Right? Not sure that really matters for tic-tac-toe though: each state isn't too dependent on previous actions (or sequences of actions); they are right there in the current state to be seen.

@salamanders (Author)

Ok. So if I'm getting it all clear:

  • Q1: Should you punish illegal moves, or disallow them entirely?
    • A1: Depends: some say yes, some say no. It depends on the game, and on how much sense the game would make from the learner's POV if only legal moves are allowed. E.g. asking "which number move should I make" is really bad, because a new legal move at the beginning off-by-ones the entire rest of the moves.
  • Q2: When punishing illegal moves, should you start the game over?
    • A2: That can take too long. Better to punish quickly and ask the learner to "try again" (see the sketch after this list).
  • Q3: Regardless of allow/deny and reset/punish-and-continue, how do you train on a "history of moves" and somehow tell the system "we are continuing the game, keep track of history" vs. "we are resetting the game, please reset your history"?
    • A3: Dunno.
  • Q4: Why do you care about history of moves?
    • A4: I don't for tic-tac-toe. It is for a more complex variation, where the history of the moves DOES matter and isn't completely reflected in the current state of the board. But it was awesome to see the basic learner figure out tic-tac-toe. Now I just need to scale it up, IF that is at all possible.
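Sketch of the per-game loop I have in mind (all helpers hypothetical; only the board resets between games, while the agent and its replay memory carry over):

    // Sketch: play game after game; a fresh board is the whole "reset", the
    // agent keeps learning across games. encodeBoard / isWin / gameOver /
    // opponentMove are hypothetical helpers.
    function playOneGame(agent) {
      var board = [0, 0, 0, 0, 0, 0, 0, 0, 0];
      while (!gameOver(board)) {
        var move = agent.act(encodeBoard(board));
        if (board[move] !== 0) { agent.learn(-1.0); continue; }  // illegal: punish, retry
        board[move] = +1;
        agent.learn(isWin(board, +1) ? 10.0 : 0.1);
        if (gameOver(board)) break;
        board[opponentMove(board)] = -1;   // hypothetical opponent reply
      }
    }

    for (var game = 0; game < 100000; game++) playOneGame(agent);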

FYI: I'm trying to learn to play this: http://mathwithbaddrawings.com/2013/06/16/ultimate-tic-tac-toe/

@mryellow

"we are continuing the game, keep track of history" vs. "we are resetting the game, please reset your history"

That is an interesting part of it. Experience replay allows a DQN to work like supervised learning, where it has a labelled data-set to chew on.

In most of these JavaScript experiments, when you first load up a pre-trained agent you're only loading the net weights.

The pre-trained agent has a brain all ready to, say, avoid a wall, but given that it forms its own experience replay data-set, there is an opportunity for it to very quickly poison itself and go off-track. There is less diversity of past experiences to balance against the current one, while those same past experiences are being "randomly" selected every single tick for replay, pushing the weights in a biased direction.

To combat this, some of karpathy's demos have a grace period where no learning is done until a number of experiences have been gathered. You could also bootstrap with experiences from a past episode, or with prioritised experiences showing key states/actions.
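A crude way to approximate that grace period from outside the agent is a sketch like this (it just defers learning for a while; karpathy's demos implement the threshold inside the agent itself):

    // Sketch of a warm-up guard: act every tick, but only start calling learn()
    // once some minimum number of steps has passed. WARMUP_STEPS is arbitrary.
    var WARMUP_STEPS = 1000;
    var steps = 0;

    function tick(state, reward) {
      var action = agent.act(state);
      steps++;
      if (steps > WARMUP_STEPS) {
        agent.learn(reward);   // before this point, no updates are applied
      }
      return action;
    }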

When playing game after game, you'll probably want to keep everything for replay. Then, down the track, maybe look at bootstrapping in some good experiences when you want to demo it.
