-
Notifications
You must be signed in to change notification settings - Fork 13
/
manual.tex
862 lines (646 loc) · 42.9 KB
/
manual.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
%===========================================================================================
% standard format
%\documentclass[12pt]{article}
%\usepackage[T1]{fontenc}
%\usepackage[latin1]{inputenc}
%\usepackage{verbatim}
%\usepackage{url}
%\usepackage{textcomp}
%\usepackage{amsmath}
%\usepackage{alltt}
%\usepackage{graphicx}
%===========================================================================================
% format taken from http://tex.stackexchange.com/questions/84478/how-to-format-title-and-abstract
\documentclass[12pt]{article}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}
\usepackage{verbatim}
\usepackage{url}
\usepackage{textcomp}
\usepackage{amsmath}
\usepackage{alltt}
\usepackage{graphicx}
\usepackage{lipsum}
\usepackage{authblk}
\usepackage[top=2cm, bottom=2cm, left=2cm, right=2cm]{geometry}
\usepackage{fancyhdr}
%
\pagestyle{fancy}
%
\renewenvironment{abstract}{%
\hfill\begin{minipage}{0.95\textwidth}
\rule{\textwidth}{1pt}}
{\par\noindent\rule{\textwidth}{1pt}\end{minipage}}
%
\makeatletter
\renewcommand\@maketitle{%
\hfill
\begin{minipage}{0.95\textwidth}
\vskip 2em
\let\footnote\thanks
{\LARGE \@title \par }
\vskip 1.5em
{\large \@author \par}
\end{minipage}
\vskip 1em \par
}
\makeatother
%
%===========================================================================================
%\newcommand{\cmdln}[1] {\begin{quotation}\texttt{#1}\end{quotation}}
\newcommand{\rgstrd} {\textsuperscript{\textregistered{}}}
\newcommand{\cpyrht} {\textsuperscript{\textcopyright{}}}
\newcommand{\person} [1] {\textit{#1}}
\newcommand{\swtch} [1] {\texttt{\textbf{#1}}}
\newcommand{\filename} [1] {\texttt{#1}}
\newcommand{\cmdln}[1]{
\par
\begingroup
\leftskip=2em
\addtolength{\rightskip}{0em}
\noindent \small{\texttt{#1}}
\par
\endgroup
}
\newcommand{\inctxt}[1]{
\begingroup
\fontsize{9pt}{11pt}\selectfont
\verbatiminput{#1}
\endgroup
}
% Ra just to silence previous coloring, for some reason "Ra" ends it in gedit
%===========================================================================================
\title{ORDO \input{version.txt} \\ Ratings for chess and other games}
\author{Miguel A. Ballicora\footnote{e-mail: mballicora (at gmail dot com)}}
\date{April 2015}
\parindent 0mm
\parskip 2mm
\begin{document}
\maketitle
\begin{abstract}
Ordo\footnote{Copyright \cpyrht{} 2015 Miguel A. Ballicora} is a program designed to calculate ratings of individual chess engines (or players).
It has a similar concept than the Elo rating\footnote{\url{http://en.wikipedia.org/wiki/Elo_rating_system}}, but with a different model and algorithm.
Ordo keeps consistency among ratings because it calculates them considering all results at once.
In that respect, it behaves similarly to BayesElo\footnote{\url{http://remi.coulom.free.fr/Bayesian-Elo/}}.
Ordo is distributed under the GPL license and binaries are available for GNU/Linux, Windows\rgstrd{}, and OS X.
In addition, the sources are portable and could be easily compiled for other systems.
\end{abstract}
\subsubsection*{Precompiled Files}
In this distribution, you may find versions for GNU/Linux (32 and 64 bits) or Windows\rgstrd{} (64 and 32 bits).
For convenience, you can rename the proper file for your system to \textbf{ordo} (GNU/Linux) or \textbf{ordo.exe} (Windows\rgstrd{}).
As an input example, a publicly available file \filename{games.pgn} is included\footnote{Taken from the recently discontinued \person{Ingo Bauer's} IPON rating list}.
A batch file (\filename{ordo\_example.bat}) is included in the Windows\rgstrd{} distribution\footnote{Kindly prepared by \person{Adam Hair}}.
It is a quick and great start for users of that operating system.
\subsubsection*{GNU/Linux compilation and installation}
After unzipping the contents, you can type
\cmdln{make}
\cmdln{make install} or in Ubuntu \cmdln{sudo make install}
\subsubsection*{Usage}
The input should be a file that adheres to the PGN standard\footnote{\url{http://en.wikipedia.org/wiki/Portable_Game_Notation}}.
Based on the results in that file, Ordo automatically calculates a ranking .
The output can be a plain text file and/or a \textit{comma separated value}\footnote{\url{http://en.wikipedia.org/wiki/Comma-separated_values}} (\filename{.csv}) file.
The \filename{.csv} file is an interesting option because it can be opened/imported by most spreadsheet programs.
Once imported, the user can choose to format the output externally.
The simplest way to use Ordo is typing in the command line:
\cmdln{ordo -p games.pgn}
which will take the results from \filename{games.pgn} and output the text ranking on the screen.
If you want to save the results in a file \filename{ratings.txt}, you can run:
\cmdln{ordo -p games.pgn -o ratings.txt}
By default, the average rating of all the individuals is 2300.
If you want a different overall average, you can use the switch \swtch{-a} to set it.
For instance to have and average of 2500, you can do:
\cmdln{ordo -a 2500 -p games.pgn -o ratings.txt}
or if you want the results in \filename{.csv} format, use the switch \swtch{-c}.
\cmdln{ordo -a 2500 -p games.pgn -c rating.csv}
If you want both, you can use:
\cmdln{ordo -a 2500 -p games.pgn -o ratings.txt -c rating.csv}
\subsubsection*{Multiple input files}
Ordo can use several pgn input files at the same time (limited by the memory of the system), which means that it is not needed to combine them beforehand.
Each file could be listed after the switch \swtch{-}\swtch{-} like this
\cmdln{ordo -a 2500 -p g1.pgn -o ratings.txt -c rating.csv \swtch{-}\swtch{-} g2.pgn g3.pgn "Other games.pgn"}
For that reason, the switch \swtch{-p} can be omitted and all files could be listed after \swtch{-}\swtch{-}
\cmdln{ordo -a 2500 -o ratings.txt -c rating.csv \swtch{-}\swtch{-} g1.pgn g2.pgn g3.pgn "Other games.pgn"}
Another option to input multiple pgn files is to list them in one text file and provide the name with the switch \swtch{-P <file>}
\cmdln{ordo -a 2500 -o ratings.txt -c rating.csv -P list.txt}
where \swtch{list.txt} contains for instance
\cmdln{g1.pgn\\
g2.pgn\\
g3.pgn}
\subsubsection*{Name synonyms}
Sometimes, the same player (engine) has been named differently in tournaments.
The user can speficify what names are actually synonyms, so Ordo will consider them one.
The switch \swtch{-Y <file>} indicates that the file (.csv format) will contain a list of synonyms.
Each line has the following information: main,syn1,syn2 etc.
An example of a synonym file with two lines would be:
\cmdln{"Gaviota 1.0","Gaviota v1","gaviota v1.0"}
\cmdln{"Stockfish 6","Stockfish 6.0"}
In this example, Gaviota 1.0 and Stockfish 6 would be the names used by Ordo.
The other ones will be converted.
\subsubsection*{Excluding games}
In certain situations, the user may want to include/discard in the calculation only a subset of the games present in the input file/s.
Switches \swtch{-i <file>} and \swtch{-x <file>} are used for this purpose.
Switch \swtch{-i} includes \textit{only} the games of participants listed in \swtch{<file>}.
In this file, each participant name has to be in different lines.
Also, each of those names may or may not be surrounded by quotes.
Both are accepted.
For this reason, if a .csv file is provided as a list of participants, only the first column is read.
In addition, \swtch{-x} could be use to \textit{exclude} games of participants listed in \swtch{<file>}.
\subsubsection*{Output information (columns)}
The user can select what information is displayed with the switch \swtch{-U}.
For instance, \swtch{-U "0,1,3"} will select and print "rank and name", "rating", and "points total" columns in that order (each column has a predefined number see below).
The default output is \swtch{"0,1,2,3,4,5"}.
\cmdln{\\
0 rank and name\\
1 rating\\
2 error\\
3 points total\\
4 games total\\
5 score percent\\
6 confidence for superiority\\
7 won games\\
8 draw games\\
9 losses\\
10 draw rate\\
11 opponent average rating\\
12 opponent average error\\
13 number of opponents\\
14 diversity (effective number of opponents based on information theory)}
Option 14 is will be equal to option 13 if the number of games among opponents is equally distributed.
\begin{equation}
OppDiv = e^{ -\left(\sum\limits_{i=1}^N f_{i} ln(f_{i})\right) }
\end{equation}
where $f_{i} = n_{i}/N$. $N$ is the total number of games played by a given participant, and $n_{i}$ is the number of games played agains opponent $i$.
Another switch (\swtch{-b <file>}) controls the width of each column in the output and the header of each column.
This \swtch{<file>} consists of a text in which each line has \swtch{<item>,<width>,"HeaderName"}.
For instance, if you want to have the item 4 (games total, see above), to be 5 characters wide and be named "Points", you will have to include a line like this:
\cmdln{4,8,"Points"}
In a file named \swtch{columns.txt} (or whatever name you choose) and then include the switch \swtch{-b columns.txt}
Note that the width selected for item 0 (rank and name), will be ingnored since it is automatically adjusted.
A default file would look like this (spaces are ignored):
\cmdln{\\
0, 0, "PLAYER"\\
1, 6, "RATING"\\
2, 6, "ERROR"\\
3, 7, "POINTS"\\
4, 7, "PLAYED"\\
5, 5, "(\%)"\\
6, 7, "CFS(next)"\\
7, 4, "W"\\
8, 4, "D"\\
9, 4, "L"\\
10, 5, "D(\%)"\\
11, 7, "OppAvg"\\
12, 7, "OppErr"\\
13, 5, "OppN"\\
14, 7, "OppDiv"}
\subsubsection*{Decimals}
The switch \swtch{-N} provides the ability to control the precision displayed for the ratings (and their errors).
For instance, \swtch{-N2} will output ratings with two decimals.
If an extra parameter is provided separated by comma, it will control the number of decimals for the score and draw percentage.
For example, the default is actually \swtch{-N0,1} or \swtch{-N "0,1"}.
If a second parameter is not provided, the default is used.
\subsubsection*{Anchor}
The switch \swtch{-A} will fix the rating of a given player as a reference (\textit{anchor}) for the whole pool.
\cmdln{ordo -a 2800 -A "Deep Shredder 12" -p games.pgn -o ratings.txt}
That will calculate the ratings from \filename{games.pgn}, save it in \filename{ratings.txt}, and \textit{anchor} the engine \textit{Deep Shredder 12} to a rating of 2800.
Names that contain spaces should be surrounded by quote marks as in this example.
\subsubsection*{White advantage}
The switch \swtch{-w} sets the rating advantage for having white pieces in chess.
Alternatively, the (highly recommended) switch \swtch{-W} lets Ordo calculate it automatically.
With this switch we can complete the above example:
\cmdln{ordo -a 2800 -A "Deep Shredder 12" -p games.pgn -o ratings.txt -W}
If the user knows that the \textit{white advantage} is a value within a certain range, this uncertainty could be given by the switch \swtch{-u}.
A combination of switches \swtch{-w <value>} and \swtch{-u <deviation>} may provide a prior information that Ordo will use to calculate the white advantage.
As the number of games played increases, the prior information will be less and less relevant.
Ordo assumes a Gaussian distribution centered in \swtch{<value>} with a standard \swtch{<deviation>}.
\subsubsection*{Simulation and errors}
The switch \swtch{-s~<n>} instructs Ordo to perform \swtch{<n>} simulations, virtually \textit{replaying} the games \swtch{<n>} times.
The results will be randomly re-assigned for each game according to the probabilities calculated from the ratings.
After running the simulations, and based on all those different results, Ordo calculates standard deviations (\textit{errors}) for the ratings.
For this purpose, an optional switch is \swtch{-F~value}, where \swtch{value} is the \% \textit{confidence level} (The default is 95.0, which is roughly equivalent to $\pm$ 2 standard deviations).
The errors displayed are relative to the pool average.
However, if one of the players is \textit{anchored}, the rest of the errors will be relative to that \textit{anchor}.
In this case, the anchor error will be zero since it is the point of reference.
To get the errors for rating differences between a given pair of players, the switch \swtch{-e~file.csv} should be added.
It will generate an error matrix saved in \swtch{file.csv}.
A minimum reasonable number for the simulations is about \swtch{-s~100}.
The more simulations, the longer it takes to complete the runs.
The errors calculated will be more accurate, but more than 1000 simulations is probably not needed.
This is an example to use these switches:
\cmdln{ordo -a 2800 -A "Deep Shredder 12" -p games.pgn -o ratings.txt -W -s1000 -e errs.csv}
It is important to emphasize that the errors displayed in the output are always against the reference (anchor).
For example, if the anchor is Engine X (Deep Shredder 12 in the example above) set at 2800, and Engine Y is 2900 with an error of 20, then the interpretation is that the difference between Y and X is 100 $\pm$ 20.
As mentioned above, when no engine is set as anchor the \textit{hidden} reference is the average of the pool.
For instance, if the average is set to 2500 (default is 2300) and the rating output for Engine X is 2850 $\pm$ 20, the difference between Engine X and the average of the pool is 350 $\pm$ 20.
That is how the output should be interpreted.
It is incorrect to use this error to estimate relative values against other engines.
For that purpose, the switch \swtch{-e} needs to be provided to obtain a matrix with every single error for every engine-engine match ups.
If an anchor (reference) is provided, but the user wants relative errors to the average of the pool, the switch \swtch{-V} should be used.
This is what other rating software has as default.
\cmdln{ordo -a 2800 -A "Deep Shredder 12" -p games.pgn -o ratings.txt -W -s1000 -e errs.csv -V}
In this case, you will see that the rating of \swtch{Deep Shredder 12} will not have an error of zero.
\subsubsection*{Parallel calculation of simulations}
If the switch \swtch{-n <value>} is used, Ordo will use \swtch{<value>} number of processors in parallel for the simulations.
This may be a significant speed-up.
\subsubsection*{Superiority confidence}
If simulations have been run, using the switch \swtch{-C} will output a matrix with the confidence for superiority (CFS) between each of the players.
Each of the numbers is an answer to the question \textit{"What is the maximum confidence I can set the test to show that player x is not inferior to player y and still obtain the same positive answer?"}.
The matrix file is in \textit{comma separated values} format, and it could be opened by any spreadsheet program if it was saved with the *.csv extension.
In addition, if the user provides the switch \swtch{-J}, the CFS values between the player and the next one in the ranking will be displayed in the output.
\subsubsection*{Draw rate (for equal opponents)}
By default, Ordo considers that the draw rate for evenly matched players is 50\%.
Internally, it calculates the draw for matches in which a player is stronger than the other.
This parameter does not change the rating results, but it will affect the errors calculated after simulations.
Two switches can control this parameter.
First, \swtch{-d} sets the draw rate (which is assumed to be constant throughout the database).
Alternatively, the (highly recommended) switch \swtch{-D} lets Ordo calculate it automatically.
It makes sense if the user wants to calculate more accurate errors, or just for informative purposes.
For instance:
\cmdln{ordo -a 2800 -A "Deep Shredder 12" -p games.pgn -o ratings.txt -W -s1000 -e errs.csv -V -D}
Will calculate the draw rate and outputs it at the end of \cmdln{ratings.txt} (or the screen, if the switch \swtch{-o} is omitted).
Similarly to the calculation of the white advantage, the user can provide prior information for the draw rate.
A combination of switches \swtch{-d <value>} and \swtch{-k <deviation>} will do that.
Ordo assumes a Gaussian distribution centered in \swtch{<value>} with a standard \swtch{<deviation>}.
When as the number of games increases, this information will have less and less impact on the final result.
\subsubsection*{Ignore draws (\swtch{-X} switch)}
This switch internally ignores all draws from the database as they have not been played.
This is only present for experimentation, not for a serious rating calculation.
\subsubsection*{Minimum games}
In certain cases, the user may not want to include certain players with very few games played in the rating.
For that reason, the switch \swtch{-t <value>} provides to the program a threshold of minimum games played for a participant to be included in the final list.
The games are still included for calculation.
\subsubsection*{Perfect winners and losers}
Players who won all games (\textit{perfect winners}) or lost all of them (\textit{perfect losers}) create problems in the rating calculation.
It is impossible to estimate those rating accurately because winning all or losing all correspond to a $+\infty$ or $-\infty$ rating, respectively.
In addition, the calculation slows down considerably because of the impossibility to converge.
Ordo removes these players automatically during the calculation, and place them back after convergence has been reached.
The rating assigned to them is a minimum ("floor") for \textit{perfect winners} and a maximum ("ceiling") for \textit{perfect losers}.
This is indicated by a > or < symbol in the output text.
These limits are established by calculating the rating they would have had if one of the games was a draw.
For example, if player had a performance of 10/10, a proper rating estimation lays between ($+\infty$) and the one corresponding to a performance of 9.5/10.
A nice side effect of this technique is that distinguishes players with perfect score that had different type of opposition or played different number of games.
It is not the same to have been undefeated for three games than twenty.
\subsubsection*{Group connections and pathological data}
Sometimes, a data set contains players or groups/pools of players that did not play enough games against the rest.
These \textit{isolated} groups produce meaningless ratings when compared to the general pool.
The \swtch{-g} switch saves a report of how many groups are in this situation.
The information in this report may guide the user to properly link those groups with extra games.
Doing so will stabilize the whole ranking.
When the data set is "ill" connected, Ordo will attempt to run by purging \textit{perfect winners} and \textit{perfect losers}.
Their ceiling or floor rating will be estimated at the end (see above).
However, a warning will be diplayed.
When purging those players is not enough to guarantee a proper connection, a second warning will be issued.
But, this time the program will stop and exit with an error code (i.e. non-zero).
To force the calculation even in these conditions, the switch \swtch{-G} should be used.
Be careful, this could be slow and the algorithm may not converge.
\subsubsection*{Multiple anchors}
When several players are known to have very accurate ratings, it is possible to assigned fixed values to them.
In that case, they will behave like multiple anchors. An example will be:
\cmdln{ordo -p games.pgn <optional switches> -m anchors.csv}
where \filename{anchors.csv} is a file that contains lines like this
\cmdln{"Gull 1.1", 2350.0\\
"Glaurung 2.2 JA", 2170\\
"Crafty 23.1 JA", 2000}
telling Ordo to fix Gull 1.1, Glaurung 2.2 JA, and Crafty 23.1 JA to 2350, 2170, and 2000, respectively.
The name of the anchors should be present in \filename{games.pgn}.
\subsubsection*{Match up information}
The switch \swtch{-j} will output to a file information about all different matches that have been played.
It shows the rating difference (Diff) between those particular players and the standard deviation (SD) for that difference.
These values come from the simulations performed with the switch \swtch{-s}, so everything is taken into account, not only the information about a particular match.
In addition, there is a column with the \textit{confidence} that would be needed in order to be able to claim superiority based on Diff and SD.
The column is CFS, confidence for superiority, which plays the same role as the \textit{likelihood of superiority}\footnote{\url{https://chessprogramming.wikispaces.com/Match+Statistics}}.
A fragment of the output is:
\inctxt{readme-example-j-switch.txt}
In this example, we can say that Critter 1.4 SSE42 is superior to Deep Rybka 4.1 SSE42 with a 99\% confidence. We can only say that it is better than Komodo 4 SSE42 with a 71\% confidence. The reason is because the rating difference is 6 and the standard deviation is 11.
\subsubsection*{Loose anchors with prior information (\swtch{-y})}
Ordo offers an alternative approach to calculate ratings with previous knowledge from the user (using Bayesian concepts).
With the switch \swtch{-y}, the user can provide a file with a list of players whose ratings will float around an estimated value.
Those players will work as \textit{loose anchors} in the list.
This strategy is useful when the data is scarce and, as a consequence, wild swings could appear in the ratings.
This is what happens at the beginning of a new rating list or tournament.
Ordo accepts an estimated rating for a player, but takes into account how uncertain that value is.
In other words, the user also has to provide the standard error for the estimated value.
That means that the value will be 68\% of the time between $\pm$ the uncertainty value provided.
It is assumed that the estimated rating will follow Gaussian distribution.
In Bayesian terms, that constitute the prior distribution for the rating of that particular player.
For instance, if one line of the file provided with the switch \swtch{-y} contains
\cmdln{"Houdini 3", 3200, 50}
That means Houdini's initial rating is 3200 with an uncertainty of 50.
With this approach the user should have the best educated guess possible, otherwise, the ranking will suffer.
Using information from a previous well established rating lists can add stability to the new list and, as games are added, the contribution of the "previous information" will fade away.
\subsubsection*{Relative anchors (\swtch{-r})}
Another problem in some engine tournaments is that version upgrades enter with no previous ratings.
However, we know in certain situations that the new versions cannot have very different ratings from the previous one.
Therefore, the user can make a good educated guess about the rating of the new version.
For instance, if you know that the new version is within 20 points of the previous one you can use the \swtch{-r} switch to provide a file with lines like this:
\cmdln{"Bouquet 1.8a", "Bouquet 1.8", 0, 20}
That means version 1.8a came after 1.8 and it is estimated to have the same rating (0) with an uncertainty of 20.
With different versions, you can have different lines. An example with Stockfish may be:
\cmdln{"Stockfish 160913", "Stockfish 4", 0, 20\\
"Stockfish 4", "Stockfish 250413", 0, 50\\
"Stockfish 250413", "Stockfish 120413", 0, 20\\
"Stockfish 120413", "Stockfish 250313", 0, 20}
This constitute different \textit{relative anchors}.
When two versions are radically different, you can say nothing and they will be treated as different engines, or for instance
\cmdln{"Komodo 1063", "Komodo 4534", 0, 1000}
The first is a complete rewrite with a parallel search. Thus, the uncertainty of 1000 reflects this fact and make both versions virtually disconnected.
If you want to include more specific info, you could say
\cmdln{"Komodo 1063", "Komodo 4534", 160, 100}
Here, 160 is the estimation of how much improvement you have by going from 1 core to 16 and 100 represents how uncertain that is.
\subsubsection*{Switches}
The list of the switches provided are:
\inctxt{tmp-switches.txt}
\subsubsection*{Memory Limits}
Currently, the program can handle almost un unlimited number of games and players. It is only limited by the memory of the system.
\subsubsection*{Exit code}
When Ordo ran successfully, it will exit with a code = 0.
When problems arose (insufficient memory, database not well connected, empty input, wrong parameters, etc.), Ordo will return a number that is guaranteed to be non-zero.
This could be used in scripts to know whether the process reached its goal or not.
For instance, the following script in bash (linux) will catch if processing games.pgn was correct or not.
\begin{verbatim}
#!/bin/sh
./ordo -p games.pgn
exit_code=$?
if [ $exit_code = 0 ]; then
echo Ordo run properly
else
echo Ordo returned with error: $exit_code
fi
\end{verbatim}
\subsubsection*{Ordoprep}
A tool is available in another distribution\footnote{\url{https://github.com/michiguel/Ordoprep/releases}} to shrink the PGN file.
The output will contain only the results of the games.
In addition, it could discard players that won all games, or lost all games.
Other switches allow the exclusion of players that do not have a minimum performance or played too few games.
Typical usage is:
\cmdln{ordoprep -p raw.pgn -o shrunk.pgn}
Which saves in \filename{shrunk.pgn} a pgn file with only the results.
You can add switches like this:
\cmdln{ordoprep -p raw.pgn -o shrunk.pgn -d -m 5 -g 20}
where \swtch{-d} tells Ordoprep to discard players with 100\% or 0\% performance, \swtch{-m~5}
will exclude players who did not reach a 5\% performance, and \swtch{-g~20} will exclude players with less than 20 games.
After all this, \filename{shrunk.pgn} could be used as input for Ordo
\subsubsection*{Model for rating calculation}
The model assumes that differences in strength are analogous to differences in levels of energy (Fig. \ref{fig:figlevels1}).
A lower (more stable) level of energy would represent a stronger player.
The analogy is that a valley is better at attracting water than a mountain top.
In physics and chemistry, a particle or a molecule that can be in two different states can be predicted to be in one or the other with a certain probability.
\begin{figure}[htb]
\center{\includegraphics [width=0.7\textwidth] {figures/figlevels1.eps} }
% \vspace{3in} //[width=\textwidth]
\caption{\label{fig:figlevels1} Energetic levels as strength levels}
\end{figure}
The probability to be found at each level is proportional to the
\textit{Boltzmann factor}\footnote{\url{https://en.wikipedia.org/wiki/Boltzmann_factor}} $e^{-\beta E_{i}}$.
If $N_{a}$ is the number of particles in level A, and $N_{b}$ is the number of particles in level B, their ratio will be:
\begin{equation} \label{eq:erl}
\frac{N_{a}}{N_{b}} = \frac{ e^{-\beta E_{a}} }{ e^{-\beta E_{b}} } = e^{-\beta(E_{a}-E_{b})}
\end{equation}
$\beta$ is a constant of the system.
The analogy is that we treat the probabilities of a \textit{win} to land in level A or B as the probability of a particle to be in A or B.
Therefore, after reordering equation \ref{eq:erl}, the \textit{fraction of wins} ($f_{b,a}$) of player B in a match vs. A will be:
\begin{equation}
f_{b,a} = \frac{ N_{b} }{ N_{a}+N_{b} } = \frac{1}{1 + e^{-\beta(E_{a}-E_{b})}}
\end{equation}
if we define strength rating $R$ as the negative value of \textit{energy}, then, $R_{a} = -E_{a}$.
For convenience, we flip the scales with the purpose that higher ratings are represented with higher values (Fig. \ref{fig:figlevels2}), and the the \textit{fraction of wins} ($f_{b,a}$) of player B in a match vs. A will be represented by eq. \ref{eq:flipped}.
\begin{figure}[htb]
\center{\includegraphics [width=0.7\textwidth] {figures/figlevels2.eps} }
% \vspace{3in} //[width=\textwidth]
\caption{\label{fig:figlevels2} Rating scale}
\end{figure}
\begin{equation} \label{eq:flipped}
f_{b,a} = \frac{1}{1 + e^{-\beta(R_{b}-R_{a})}}
\end{equation}
This equation has the same form as the logistic function\footnote{\url{http://en.wikipedia.org/wiki/Logistic_function}}.
With this equation we can calculate the predicted fraction of wins between two players.
The predicted performance $P_{x}$, or number of wins of player $x$ among a pool of other players will be the summation of each of the predicted fractions $f$ for each game.
\begin{equation}
P_{x} = f_{x,opp(1)} + f_{x,opp(2)} + ... + f_{x,opp(n)} = \sum\limits_{i=1}^n f_{x,opp(i)}
\end{equation}
where $n$ is the total number of games played by $x$ and $opp(i)$ is the opponent it faced in the game $i$. Then:
\begin{equation}
P_{x} = \sum\limits_{i=1}^n \frac{1}{1 + e^{-\beta(R_{x}-R_{opp(i)})}}
\end{equation}
The most likely strength rating values ($R$) for each player are ones that satisfy that each \textit{predicted} performance $P_{x}$ equals the respective \textit{observed} performance ($O_{x}$) of player $x$ (actual number of games won by $x$).
Therefore, the goal is to find $R$ values so the following unfitness ($U$) score equals zero, where $m$ is the total number of players, and $j$ is each individual player.
\begin{equation}
U = \sum\limits_{j=1}^m (P_{j} - O_{j})^2
\end{equation}
Finding an adequate procedure to minimize $U$ until reaches zero is critical for a proper convergence towards the optimal solution.
The way Ordo fits it is in discrete steps (similar to \textit{hill climbing}\footnote{\url{http://en.wikipedia.org/wiki/Hill_climbing}}), and making those steps smaller and smaller once the convergence was reached.
However, those steps are constrained to certain values to avoid big swings during the calculation.
After many different tests, this procedure was found to be safe and fast.
\subsubsection*{Scale}
Chess players are accustomed to the Elo rating.
Traditionally, it has been based on a normal (Gaussian) distribution, which is the one that the World Chess Federation (FIDE) still uses\footnote{\url{http://en.wikipedia.org/wiki/Elo_rating_system}}.
Here, the default value of $\beta$ was chosen to resembles the Elo scale.
For that reason, the rating difference when the winning expectancy is 76\% has been set to 202 rating points.
This parameter could be modified with the switch \swtch{-z}, and the overall scale can be displayed with switch \swtch{-T}.
The model is valid if the strength assigned to the individual players is additive like energy.
If we know the strength differences between A$\to$B and B$\to$C, we should be able to calculate A$\to$C as A$\to$B + B$\to$C.
Then, this should accurately predict the results of a match between A and C.
Empirical observations seem to suggests that those estimations are reasonable, at least within a certain range.
Certain theoretical assumptions have be done to account the existence of draws.
One of the is that the actual draw rate remains similar throughout the rating scale.
Empirically, this is a reasonable approximation for most cases.
\subsubsection*{White advantage calculation}
The rationale to calculate the white advantage ($W_{adv}$) is that the expected outcome for white should be as close as possible to the actual white performance.
In other words, the number of points obtained by white ($W_{p}$) should be the same as the number of points expected to be obtained by white ($W_{e}$).
\begin{equation} \label{eq:white_adv_errors_1}
E = (W_{p} - W_{e})^2
\end{equation}
Therefore, the optimum $W_{adv}$ is the one that minimizes $E$, which is the overall error squared in equation \ref{eq:white_adv_errors_1}.
\begin{equation} \label{eq:white_adv_errors_1b}
W_{e} = \sum\limits_{i=1}^n Expectancy(RW_{i} + W_{adv}, RB_{i})
\end{equation}
Here, $n$ is the total number of games, $RW_{i}$ and $RB_{i}$ are the ratings (in game $i$) of white and black, respectively.
$Expectancy$ is actually equation \ref{eq:flipped}.
\begin{equation} \label{eq:white_adv_errors_1c}
W_{e} = \sum\limits_{i=1}^n \frac{1}{1 + e^{-\beta(RW_{i} + W_{adv}-RB_{i})}}
\end{equation}
Then, combining \ref{eq:white_adv_errors_1} and \ref{eq:white_adv_errors_1c}
\begin{equation} \label{eq:white_adv_errors_1d}
E = \left(W_{p} - \sum\limits_{i=1}^n \frac{1}{1 + e^{-\beta(RW_{i} + W_{adv}-RB_{i})}}\right)^2
\end{equation}
$W_{adv}$ is calculated iteratively, until $E$ is minimized.
This calculation assumes that $W_{adv}$ is relatively constant throughout the database.
Once $W_{adv}$ is obtained, the ratings are re-calculated.
The procedure continues until the numbers stabilize.
%~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
\subsubsection*{Draw rate model}
To estimate the probability of a draw in a single game the model from Fig. \ref{fig:figlevels2} needs to be expanded to have an extra "draw state" (Fig. \ref{fig:threestates}).
\begin{figure}[htb]
\center{\includegraphics [width=0.9\textwidth] {figures/threestates.eps} }
% \vspace{3in} //[width=\textwidth]
\caption{\label{fig:threestates} Rating scale introducing an extra state for draws}
\end{figure}
The draw rate does not affect the rating calculation, or the performance for each player in the simulations.
However, it affects the relative distribution of \textit{wins}, \textit{losses}, and \textit{draws} simulated, which has an influence on the errors calculated.
Therefore, to have a more realistic simulation and an accurate estimation of the errors we need to predict the probability for a draw.
But, the draw rate is not uniform, as it depends on the rating differences between the opponents.
Thus, draw rate depends on two parameters, $D_{eq}$ (draw rate when the two opponents are of equal strength) and $\Delta R$.
Ordo assumes that $D_{eq}$ is relatively constant throughout the database.
If we know $D_{eq}$, the following equation
\begin{equation} \label{eq:draw_rate_errors}
E = \sum\limits_{m=1}^M (D_{m} - N_{m}\: D_{exp}(\Delta R_{m} + W_{adv}, D_{eq}))^2
\end{equation}
will give $E$ as the overall error in the estimation of $D_{eq}$.
Here, $m$ is the match number, $M$ is the total number of matches, $N_{m}$ is the number of games played in each match $m$, $\Delta R_{m}$ is the rating difference in that particular match, $D_{m}$ is the number of draws observed, and $W_{adv}$ is the \textit{white advantage}.
$D_{exp}$ is a function that gives the draw rate expected given a certain $\Delta R$ and $D_{eq}$.
Note that here a match is considered any series of games between two opponents with the same colors.
In other words, they are any set of games with the same opponents and conditions.
With this equation, $D_{exp}$ is calculated iteratively until $E$ is minimized.
To apply this algorithm we need the function $D_{exp}$.
In the following section we show how to calculate the draw rate when opponents are of equal strength and later from a given $p$ and $D_{eq}$.
From $\Delta R$, the performance expected ($p$) can be directly calculated.
%=====
\subsubsection*{Draw rate between opponents of equal strength}
We can model the draw rate by introducing an extra \textit{draw state} (Fig. \ref{fig:threestates}).
This is a derivation of the equation that relates draw rate ($D$) and $\delta$.
\begin{equation} \label{eq:basic_WDL}
1 = W + D + L
\end{equation}
Here, $W$, $D$, and $L$ are the respective win, draw, and loss rates. Since the opponents are of equal strength, W equals L.
\begin{equation} \label{eq:erll}
1 = 2\: W + D
\end{equation}
Based on the assumptions that the probabilities of the different levels are proportional to the \textit{Boltzmann factor}\footnote{\url{https://en.wikipedia.org/wiki/Boltzmann_factor}} $e^{-\beta E_{i}}$, the following ratio can be established ($R_{i}=-E_{i}$, higher ratings mean lower "energy levels").
\begin{equation}
\frac{D}{W} = \frac{e^{\beta R_{D}}}{e^{\beta R_{W}}} = e^{\beta(R_{D}-R_{W})} = e^{\beta\delta}
\end{equation}
Replacing into eq. \ref{eq:erll}
\begin{equation}
1 = e^{\beta\delta} W + 2W
\end{equation}
\begin{equation} \label{eq:wdep}
W = \frac{1}{e^{\beta\delta} + 2}
\end{equation}
Combining with eq. \ref{eq:erll} we obtained $D_{eq}$, which is the draw rate when both players are equally strong. This value depends on $\delta$.
\begin{equation} \label{eq:eqlog}
D = D_{eq} = 1 - \frac {2}{e^{\beta\delta} + 2} = \frac{e^{\beta\delta}}{e^{\beta\delta}+2}
\end{equation}
%=====
\subsubsection*{Draw rate from $p$ (performance) and $D_{eq}$}
Performance ($p$) is the ratio of the total points obtained by a player in a given number of games.
It is defined by this simple relationship.
\begin{equation} \label{eq:deq_perf}
p = W + D/2;\:\:\: W = p - D/2
\end{equation}
To define $D_{eq}$, we are going to assume it is constant, regardless of the absolute strength of each individual.
We then have three possible states, W (win), D (draw), and L (loss), in which the state D is separated by $\delta$ from the average of the levels W and L.
In this scenario, and reordering eq. \ref{eq:eqlog} we have:
\begin{equation} \label{eq:deq0}
\frac{1-D_{eq}}{2 D_{eq}} = e^{-\beta \delta}
\end{equation}
For convenience we will call $e^{-\beta \delta} = \phi$ then
\begin{equation} \label{eq:deq0b}
\frac{1-D_{eq}}{2 D_{eq}} = \phi
\end{equation}
$D_{eq}$ is the rate when $R_{W}$ and $R_{L}$ are at the same level.
If $R_{W}$ and $R_{L}$ change, and $\delta$ remains at the same distance from the average of $R_{W}$ and $R_{L}$, the equations that relate the probabilities for each state are:
\begin{equation} \label{eq:deq1}
R_{avg} = \frac{R_{W} - R_{L}}{2};\:\: x = R_{W} - R_{avg} = R_{avg} - R_{L}
\end{equation}
\begin{equation} \label{eq:deq2}
W/D = e^{\beta(x-\delta)} = e^{\beta x} e^{-\beta \delta}
\end{equation}
\begin{equation} \label{eq:deq3}
D/L = e^{\beta(x+\delta)} = e^{\beta x} / e^{-\beta \delta}
\end{equation}
For convenience, if we call $e^{-\beta \delta} = \phi$ as we did before we get
\begin{equation} \label{eq:deq4}
W/D = e^{\beta x} \phi;\:\:\:
D/L = e^{\beta x} / \phi
\end{equation}
therefore
\begin{equation} \label{eq:deq_wldd}
\frac{W}{D} \frac{L}{D} = \phi^2; \:\:\: L = \frac{\phi^2 D^2}{W}
\end{equation}
combining this equation with eq. \ref{eq:basic_WDL} and reordering:
\begin{equation} \label{eq:deq7}
0 = W^2 + D W - W + \phi^2 D^2
\end{equation}
replacing W with eq. \ref{eq:deq_perf} we obtain
\begin{equation} \label{eq:deq9}
0 = (p - D/2)^2 + D (p - D/2) - (p - D/2) + \phi^2 D^2
\end{equation}
expanding, simplifying, and reordering leads to
\begin{equation} \label{eq:deq10}
0 = (4 \phi^2 -1) D^2 + 2 D + 4 (p^2 - p)
\end{equation}
replacing with eq. \ref{eq:deq0b}
\begin{equation} \label{eq:deq11}
0 = \left(
\left(\frac {1-D_{eq}}
{D_{eq}}
\right) ^2 -1
\right)
D^2 + 2 D + 4 (p^2 - p)
\end{equation}
Solving this quadratic equation, we obtain the predicted draw rate ($D$) between two given opponents, as long as we know the predicted performance ($p$) and the draw rate between equally matched opponents ($D_{eq}$). This is used to plug it in eq. \ref{eq:draw_rate_errors}.
\subsubsection*{Draw rate and win rate relationship}
Reordering eq. \ref{eq:deq7} we obtain
\begin{equation} \label{eq:draw_rate1}
D^2 = \phi^{-2}\: W (1 - W - D)
\end{equation}
Note that this relationship is equivalent to the basic assumption used by Davidson\footnote{Equation 2.5 in \url{http://stat.fsu.edu/techreports/M169.pdf}} to develop his draw model
\begin{equation} \label{eq:drawrate_davidson}
D = \nu\: \sqrt{W \: L}
\end{equation}
Here, $\nu = \phi^{-2}$ and $L = 1 - W - D$. Shawul and Coulom showed that this relationship is superior for chess engines when compared to other alternatives\footnote{\url{https://dl.dropboxusercontent.com/u/55295461/elopapers/elopapers/ChessOutcomes.pdf}}.
Replacing $\phi$ in eq. \ref{eq:draw_rate1} with eq. \ref{eq:deq0b} we obtain
\begin{equation} \label{eq:draw_rate2}
D^2 = \left(\frac{2 D_{eq}}{1-D_{eq}}\right)^2\: W (1 - W - D)
\end{equation}
Equation \ref{eq:draw_rate2} is the one used by Ordo to obtain the draw rate for any pair of opponents as a function of win probability ($W$) and draw rate for equal opponents ($D_{eq}$).
%---------------------------------------------------------
\subsubsection*{Draw rate calculation}
The rationale to calculate the draw rate for equal opponents ($D_{eq}$) is that the expected outcome of number of draws showuld be as close as possible to the actual number of draws in the database.
In other words, the number of draws observed ($D_{obs}$) should be the same as the number of draws expected ($D_{exp}$).
\begin{equation} \label{eq:drawrate_errors_1}
E = (D_{obs} - D_{exp})^2
\end{equation}
Therefore, the optimum $D_{eq}$ is the one that minimizes $E$, which is the overall error squared in equation \ref{eq:drawrate_errors_1}.
\begin{equation} \label{eq:drawrate_errors_1b}
D_{exp} = \sum\limits_{i=1}^n D_{i}
\end{equation}
Here, $n$ is the total number of games, and $D_{i}$ is the probability of a draw for game $i$.
From equation \ref{eq:deq11}, $D_{i}$ could be solved as
\begin{equation} \label{eq:drawrate_errors_1c}
D_{i} = \frac{- 1 + \sqrt{ 1 - 4 (p_{i}^2-p_{i}) (4 (\frac{1-D_{eq}}{2 D_{eq}})^2 - 1) } }{4 (\frac{1-D_{eq}}{2 D_{eq}})^2 - 1}
% b=2
% 4 (p_{i}^2-p_{i})
% a = (4 (\frac{1-D_{eq}}{2 D_{eq}})^2 - 1)
% D_{i} = \frac{1+ \sqrt{1 - 4 (4 \phi^2 - 1) (p_{i}^2-p_{i})} }{4 \phi^2 - 1}
% (\frac{1-D_{eq}}{2 D_{eq}})
\end{equation}
where $p_{i}$ is the expected performance for white for each game, and could be calculated from equation \ref{eq:flipped} as
\begin{equation} \label{eq:drawrate_errors_1d}
p_{i} = \frac{1}{1 + e^{-\beta(RW_{i} + W_{adv}-RB_{i})}}
\end{equation}
$RW_{i}$ and $RB_{i}$ are the ratings (in game $i$) of white and black, respectively.
Once $D_{eq}$ is estimated, $p_{i}$ and $D_{i}$ are calculated (equations \ref{eq:drawrate_errors_1c} and \ref{eq:drawrate_errors_1d}) for each game to obtain $D_{exp}$ and $E$ (equations \ref{eq:drawrate_errors_1} and \ref{eq:drawrate_errors_1b}).
Optimum value of $D_{eq}$ is the one that minimizes $E$ and it is calculated iteratively.
This calculation assumes that $D_{eq}$ is relatively constant throughout the database.
Once $D_{eq}$ is obtained, the ratings are re-calculated as it is done with $W_{adv}$.
The procedure continues until the numbers stabilize.
%---------------------------------------------------------
\subsubsection*{Rating calculation with prior information}
When user provides Ordo with either \textit{loose anchors}, \textit{relative anchors}, \textit{white advantage} uncertainty, or a \textit{draw rate} uncertainty the calculation is performed by a \textit{maximum-likelihood estimation}.
In those cases, for each game the probability for the given outcome (W, D, or L) is calculated and the logarithm of this value is added and accumulated.
This will constitute an unfitness score that will need to be minimized.
In addition, to this score, the logarithm of the probabilities for each loose anchor, relative anchor, white advantage, and draw rate are accumulated.
An overall minimization brings optimum values for the ratings of each player and each of the above mentioned parameters.
Note that adding the logarithm of each of the probabilities is analogous to multiplying the probabilities.
\subsubsection*{Forcing maximum likelihood}
Another option to force Ordo to perform a \textit{maximum-likelihood estimation} to calculate the ratings is by providing the switch \swtch{-M}.
This option is generally a bit slower and probably not necessary since the output should be nearly identical with perfect convergence, but it is a good feature for comparison an debugging.
%~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
\subsubsection*{Acknowledgments}
\person{Adam Hair} has extensively tested and suggested valuable ideas.
\subsubsection*{License}
\normalsize{
\inctxt{tmp-lic.txt}
}
\end{document}