DevReport.md
Introduction

For my Poker Squares player, JharrisTDPlayerMulti, I chose a modified multithreaded Monte Carlo tree search guided by a learned mean partial-board value evaluator. This approach was chosen after several false starts, described below. The evaluator maps a board (potentially with some cards already placed) to an estimate of the mean final score. It is a relatively standard neural network with two fully connected hidden layers of 20 and 10 neurons respectively and a single output neuron, with weights initialized randomly from a Gaussian scaled by 1/2. Its inputs are as follows: 3 inputs for each cell on the board (a boolean for whether the cell holds a card, one input for the suit, and one for the rank), plus one input for the current value of the (partial) board as evaluated by the PokerSquaresPointSystem.getScore(grid) function. The evaluator was trained using a variant of temporal difference learning: play through a game using a greedy 1-deep search over the evaluator (that is, try every open position for the next card and pick the one the evaluator rates highest), then at the end of the game take the final score and backtrack, training the evaluator on each intermediate partial board state toward that final score with an exponentially decreasing learning rate.
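The sketch below illustrates this setup under stated assumptions; it is not the player's actual code. Board, NeuralNet, and Card are hypothetical stand-ins, the 0.9 decay factor is a guess, and only the 76-input encoding and the backtracking update described above are meant to be representative.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Minimal sketch of the evaluator's input encoding and of the training
 * variant described above. Board, NeuralNet, and Card are hypothetical
 * stand-ins, not classes from JharrisTDPlayerMulti or the PokerSquares
 * framework; the 0.9 decay factor is likewise a guess.
 */
public final class EvaluatorSketch {
    /** 25 cells * 3 inputs each, plus the current partial-board score. */
    static final int INPUTS = 25 * 3 + 1;

    interface Card {}

    interface Board {
        boolean isEmpty(int row, int col);
        int suitAt(int row, int col);    // e.g. 0..3
        int rankAt(int row, int col);    // e.g. 0..12
        int partialScore();              // stands in for PokerSquaresPointSystem.getScore(grid)
        List<int[]> openCells();
        void place(int row, int col, Card card);
        void remove(int row, int col);
        boolean isFull();
        int finalScore();
    }

    interface NeuralNet {
        double predict(double[] inputs);                         // estimated mean final score
        void train(double[] inputs, double target, double rate); // one gradient step toward target
    }

    /** Encodes a (partial) board as the 76 evaluator inputs described above. */
    static double[] encode(Board board) {
        double[] x = new double[INPUTS];
        int i = 0;
        for (int r = 0; r < 5; r++) {
            for (int c = 0; c < 5; c++) {
                boolean filled = !board.isEmpty(r, c);
                x[i++] = filled ? 1.0 : 0.0;
                x[i++] = filled ? board.suitAt(r, c) : 0.0;
                x[i++] = filled ? board.rankAt(r, c) : 0.0;
            }
        }
        x[i] = board.partialScore();
        return x;
    }

    /** One training game: greedy 1-deep self-play, then a backtracking update toward the final score. */
    static void trainOneGame(Board board, NeuralNet net, List<Card> deck, double baseRate) {
        Collections.shuffle(deck);
        List<double[]> states = new ArrayList<>();
        for (Card next : deck) {
            if (board.isFull()) break;
            int[] best = null;
            double bestValue = Double.NEGATIVE_INFINITY;
            for (int[] cell : board.openCells()) {   // try every open position for the next card
                board.place(cell[0], cell[1], next);
                double value = net.predict(encode(board));
                board.remove(cell[0], cell[1]);
                if (value > bestValue) { bestValue = value; best = cell; }
            }
            board.place(best[0], best[1], next);
            states.add(encode(board));
        }
        double target = board.finalScore();
        double rate = baseRate;
        for (int s = states.size() - 1; s >= 0; s--) { // backtrack from the last placement
            net.train(states.get(s), target, rate);
            rate *= 0.9;                               // exponentially decreasing learning rate
        }
    }
}
```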

False Starts

There were several false starts to this project. Initially, a naive alpha-beta search was attempted to maximize the mean value given ideal play. The initial search window used, as its lower bound (alpha), the sum of the minimum possible scores of each row and column calculated in isolation, and as its upper bound (beta), the corresponding sum of maxima. Although this initially seemed promising, and led to an improved understanding of the combinatorics involved, it was eventually rejected when it became obvious that the bounds were nowhere near tight enough to be practical: the search exhausted the entire Java heap on the development machine and crashed. A tighter bound was therefore required, namely the actual minimum and maximum possible final values of the board. An initial sub-search for these values was attempted, again using an alpha-beta search. This worked, but was far too slow to use as a subroutine, taking several hours per board state. It was therefore decided to temporarily offload this sub-search to an external PBSAT solver to evaluate the practicality of the overall approach; this was coded, tested, and found to work satisfactorily. Unfortunately, even with the tightest possible bound, the main search was still too slow and memory-hungry. Although there were several dimensions of redundancy that could be removed (e.g. swapping rows, swapping columns, rotating the board, swapping suits), these did not resolve the situation. At this point the author decided to abandon this course of action and pursue a rather different approach.
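For illustration only, the sketch below shows one way the loose window described above could be computed. Card and HandBounder are hypothetical stand-ins (HandBounder would have to enumerate completions of a single hand); this is not the abandoned implementation.

```java
import java.util.List;

/**
 * Sketch of the loose alpha/beta window described above: sum, over the ten
 * hands (5 rows + 5 columns), of the worst and best scores each hand could
 * reach in isolation, ignoring that the hands share cards. Card and
 * HandBounder are hypothetical stand-ins, not the abandoned implementation.
 */
public final class LooseBoundsSketch {
    interface Card {}

    interface HandBounder {
        int minScore(List<Card> partialHand, List<Card> remainingDeck);
        int maxScore(List<Card> partialHand, List<Card> remainingDeck);
    }

    /** Returns {alpha, beta} for the current position. */
    static int[] window(List<List<Card>> rows, List<List<Card>> cols,
                        List<Card> remainingDeck, HandBounder bounder) {
        int alpha = 0, beta = 0;
        for (int i = 0; i < 5; i++) {
            alpha += bounder.minScore(rows.get(i), remainingDeck)
                   + bounder.minScore(cols.get(i), remainingDeck);
            beta  += bounder.maxScore(rows.get(i), remainingDeck)
                   + bounder.maxScore(cols.get(i), remainingDeck);
        }
        return new int[] { alpha, beta };
    }
}
```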

Development

At this point the author decided to try machine learning. Having previously stumbled upon a page describing TD-Gammon (a backgammon player trained with temporal difference learning), it was decided to attempt a player using temporal difference learning. A naive temporal difference learner was written. It did fairly well (~40 points), but learning was extremely slow (8-12 hours per run), and there was an unexpected occurrence when increasing the learning rate: the estimator would initially learn quickly, but the estimated values would soon “blow up” relative to the actual scores and then collapse to predicting 1000 points every game (the maximum the estimator can represent) while scoring ~9. Note that this is consistently worse than a player placing cards completely at random.

After some puzzlement, it was determined that this was caused by the final play (where the learner is corrected toward the actual final score rather than an estimated score) happening on only 1 move out of 25. It therefore had a relatively high likelihood of adjusting the other estimates upward (which cascades, as one estimate adjusted up means the prior estimate will be adjusted up the next time, and so on) by more than the final step adjusted them down. Initially this was overcome by using a higher learning rate on the final step than on the earlier steps within a game, but this was found to hamper the final stages of learning. The standard temporal difference learning algorithm was therefore replaced by the version described in the Introduction.

The author then attempted a larger neural network (up to that point, networks with one hidden layer of 10 neurons were all that had been used). However, this slowed learning down even more. The player was therefore run through a profiler, where it was discovered that 90% of the runtime was spent in... Double.isNaN. These debug checks (added early on after a few incidents where neuron weights became NaN) were removed, and the author, who until then had believed that ~40 points was the value the network converged to, was surprised to see it begin converging to ~80 points instead. (Notably, the network mean appears to plateau at ~40 points, then again at ~60 points, then again at ~80 points. It is unknown whether the network would plateau again at a yet-higher value given sufficient learning runs; the timescale involved proved too large for the author to attempt it.)

Various alternatives to the hyperbolic tangent activation were then tested, notably several approximations of varying accuracy and the function x / (1 + abs(x)). None of them was sufficiently faster than the standard library to offset the reduced rate of learning; indeed, most of the approximations were slower than the standard library, so the standard library version was retained. This would prove to be a slight regret later.

After training multiple networks of various configurations, the author's initial selection of a “larger” network proved to be relatively close to optimal, and the two-hidden-layer configuration of 20 and 10 neurons was selected. This could be a route for later improvement, as the configuration was not revisited after this point. A number of other tweaks were attempted and discarded; these are briefly described, in no particular order, after the following aside.
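As an aside, the x / (1 + abs(x)) alternative mentioned above is the standard “softsign” function. The micro-benchmark below is a hypothetical sketch of how such a comparison could be timed, not the harness that was actually used.

```java
/**
 * Hypothetical micro-benchmark comparing Math.tanh with the x / (1 + |x|)
 * ("softsign") approximation that was tried as a faster activation. This is
 * a sketch, not the project's actual timing harness.
 */
public final class ActivationSketch {
    static double softsign(double x) {
        return x / (1.0 + Math.abs(x));
    }

    public static void main(String[] args) {
        final int n = 10_000_000;
        double sink = 0.0;

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) sink += Math.tanh(i * 1e-6 - 5.0);
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) sink += softsign(i * 1e-6 - 5.0);
        long t2 = System.nanoTime();

        System.out.printf("tanh:     %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("softsign: %d ms%n", (t2 - t1) / 1_000_000);
        System.out.println(sink); // keep the JIT from eliminating the loops
    }
}
```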
Notably, several standard incremental improvements to artificial neural networks were attempted (as they tend to help) and discarded. Learning momentum was added, but appeared to have no significant effect beyond slowing down learning. Weight decay was added, but only hindered the final value the network reached. Dropout was added, but made the network slow enough to learn that the attempt was cancelled for lack of time; it is hypothesized that the network would have reached the same mean value, if not a higher one, merely at a much reduced rate. In hindsight it is unsurprising that these changes did not help: they help neural networks that are otherwise over-fitting, and TD learning presents enough data to avoid this regardless. An alternative to naive back-propagation was also tried, namely normalizing the gradient before performing the gradient-descent step. It learned substantially faster, but was ultimately removed as it made the network substantially less stable.

Finally, with the estimator trained to ~80 points when used in a naive 1-deep greedy search, the author decided a refinement was needed, and a greedy Monte Carlo tree search was written for the final player. (One downside is that up to this point the player and estimator were deterministic; because the MCTS iterates until it runs out of time, it is not.) Given the extreme number of simulations MCTS runs, the code was written to avoid allocation inside the inner MCTS loop where possible, to avoid problems with Java's garbage collection. (Notably, a GC pause or slowdown at the end of the final move can be catastrophic, as it may cause the player to run out of time unless it leaves substantially more slack than is otherwise necessary.) The code was therefore written in a “do-undo” style: speculative games are undone rather than copying the board state before each search. This led to several problems when debugging, so the time was taken to add assertions checking that the board state before and after each “do-undo” matched. Note that these assertions are not enabled in the final player: an uncaught exception counts as 0 points for the game, so the player can literally do no worse by continuing than by throwing.

Only three refinements were attempted on the MCTS player. Whereas games previously took less than a millisecond each, the MCTS player takes the full 30 seconds per game (modulo the 500 ms buffer kept to ensure the player does not go over time), and it was determined that several hundred games at a minimum were necessary to measure the player's actual performance, i.e. several hours per experiment. First, multithreading was added for the MCTS portion of each move. This required a fair bit of refactoring and some defensive copies in the per-move setup, but it explores more per turn than the single-threaded MCTS whenever 2+ processors are available. It did not appear to cause any significant improvement in score, but was kept as it is unlikely to cause a decrease. Second, the raw estimator value for the candidate move was included in the final decision of which move to make. This appears to give a slight improvement with a weight of around 0.2 (that is, the score used to choose a move is 0.2 * raw estimate + (1 - 0.2) * MCTS estimate), although the weight was not exhaustively searched. Finally, the amount of time allocated to each move was tweaked; however, these tweaks did not improve the score over a flat allocation, and in many cases decreased it, so the change was reverted.
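A minimal sketch of the do-undo pattern and the blended move score described above is shown below; SearchBoard, Rollout, and the stateHash debug check are hypothetical stand-ins rather than the player's actual API.

```java
import java.util.List;

/**
 * Sketch of the "do-undo" simulation style and the blended move selection
 * (0.2 * raw estimate + 0.8 * MCTS estimate). SearchBoard, Rollout, and the
 * stateHash helper are hypothetical stand-ins, not JharrisTDPlayerMulti's
 * actual API.
 */
public final class DoUndoSketch {
    static final double RAW_WEIGHT = 0.2;

    interface Card {}

    interface SearchBoard {
        List<int[]> openCells();
        void place(int row, int col, Card card);   // "do"
        void remove(int row, int col);             // "undo"
        long stateHash();                          // cheap fingerprint for debug assertions
    }

    interface Rollout {
        double playOut(SearchBoard board);         // e.g. random play to the end of the game
    }

    /** Runs one speculative placement, restoring the board before returning. */
    static double simulate(SearchBoard board, int[] cell, Card card, Rollout rollout) {
        long before = board.stateHash();
        board.place(cell[0], cell[1], card);
        double result = rollout.playOut(board);    // no per-simulation board copy
        board.remove(cell[0], cell[1]);
        // Java assertions are disabled unless the JVM is started with -ea,
        // matching the report's note that these checks are off in the final player.
        assert board.stateHash() == before : "do-undo left the board inconsistent";
        return result;
    }

    /** Blends the evaluator's raw estimate with the MCTS mean for the final decision. */
    static double moveScore(double rawEstimate, double mctsEstimate) {
        return RAW_WEIGHT * rawEstimate + (1.0 - RAW_WEIGHT) * mctsEstimate;
    }
}
```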

Final Thoughts

Overall, the player works fairly well. There are definitely further improvements to be made, but it nonetheless significantly outperforms a naive player hard-coded to go for flushes. It also occasionally wins the game (i.e. scores 200+ points), and outperforms the author. Notably, during final development the player achieved a score of 280 in a series of approximately 350 games.

Other Potential Improvements

There are several further improvements that could be made beyond those described above. For one, it was discovered in the final stages of development that, due to a bug, the output of the function estimator was scaled to between -1000 and 1000 rather than between 0 and 1000. Unfortunately, as the estimator was trained against this buggy target scaling, it cannot be corrected without another TD learning run. (An additional minor improvement would be to scale the output to between the actual minimum and maximum possible values; the true maximum value of a board using the American point system is less than 1000.) This could potentially improve both the value and the learning rate of the estimation system significantly.

One other obvious improvement would be to fall back to a naive tree search for the last few moves of the game. This would both free up time for moves in the early part of the game and ensure optimal play in the endgame. It may also be useful to limit the depth of the MCTS and apply the estimator directly to the “leaf” boards instead of playing out full games. There is a trade-off here: the shallower the MCTS is made, the worse its estimates will be, but the faster each simulation will run, and it is unknown whether this would be a net improvement. Restoring deterministic behaviour of the player would also ease development of many of these improvements. (This could be done by, for instance, recording how many iterations were completed on each core when playing under the time limit, for later “playback”.)

Finally, if the processor time is available, it is suspected that continuing to do online learning on the MCTS player's function estimator, in the same style as the earlier TD learning, may significantly improve its performance. However, considering that the learner required several hundred thousand games to converge and the MCTS takes 30 seconds per game, this would require at a minimum ~35 days of processor time per development cycle; note also that it is not obvious how to parallelize the TD learning.
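As an illustration only, the sketch below shows one way the depth-limited idea could look: each playout stops after a fixed number of random placements and the estimator values the resulting partial board. All of the types, helper methods, and the cutoff depth are assumptions, not existing code.

```java
import java.util.List;
import java.util.Random;

/**
 * Hypothetical sketch of the proposed depth-limited rollout: instead of
 * playing a speculative game to completion, stop after MAX_DEPTH random
 * placements and let the value estimator score the partial board. The types,
 * the cutoff, and the helper methods are all assumptions, not existing code.
 */
public final class TruncatedRolloutSketch {
    static final int MAX_DEPTH = 5;                  // arbitrary cutoff for illustration

    interface Card {}

    interface Board {
        boolean isFull();
        List<int[]> openCells();
        void place(int row, int col, Card card);
        void remove(int row, int col);
        int finalScore();
    }

    interface Evaluator {
        double estimate(Board partialBoard);         // learned mean-final-score estimator
    }

    interface Deck {
        Card draw();                                 // next undealt card for this simulation
        void undraw();                               // put it back (do-undo style)
    }

    static double truncatedPlayout(Board board, Deck deck, Evaluator eval, Random rng) {
        int depth = 0;
        int[][] played = new int[MAX_DEPTH][];
        while (!board.isFull() && depth < MAX_DEPTH) {
            List<int[]> open = board.openCells();
            int[] cell = open.get(rng.nextInt(open.size()));
            board.place(cell[0], cell[1], deck.draw());
            played[depth++] = cell;
        }
        double value = board.isFull() ? board.finalScore() : eval.estimate(board);
        while (depth > 0) {                          // undo the speculative placements
            int[] cell = played[--depth];
            board.remove(cell[0], cell[1]);
            deck.undraw();
        }
        return value;
    }
}
```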