Note: in addition to the nice, tidy CHANGELOG, there's also a verbose DEVLOG below.
Version 0.1.2.1 -- Fixed duplicate DRBG depend in .cabal
Version 0.1.2 -- Unfortunate version bump to fix .cabal description.
Version 0.1.1 -- Initial release
Version 0.1.2.2 -- Removed LargeWord implementation. It's now in its own module.
Version 0.1.2.3 -- Updated to work with CryptoAPI >= 0.5
--------------------------------------------------------------------------------
DEVLOG:
--------------------------------------------------------------------------------
[2011.01.28] {unexpected performance bump}
Note: when I duplicated part of the Crypto package (LargeWord), I also saw a small
performance bump on my 3.33 GHz Nehalem desktop, IN SPITE of the fact
that I had tweaked and reinstalled my system's Crypto library, taking
care to compile it with -O2. Better inlining?
Old numbers, per second:
15,753 random ints generated [BurtonGenSlow/reference] ~ 209,832 cycles/int
31,577 random ints generated [BurtonGenSlow] ~ 104,680 cycles/int
New numbers:
16,052 random ints generated [BurtonGenSlow/reference] ~ 205,226 cycles/int
32,411 random ints generated [BurtonGenSlow] ~ 101,641 cycles/int
[2011.01.29] {First results}
This is what I'm getting (3.33 GHz Nehalem) on a single thread, compiled without -threaded:
How many random numbers can we generate in a second on one thread?
Cost of rdtsc (ffi call): 83
Approx getCPUTime calls per second: 205,640
Approx clock frequency: 3,306,891,339
First, timing with System.Random interface:
193,178,901 random ints generated [constant zero gen] ~ 17.12 cycles/int
14,530,358 random ints generated [System.Random stdGen] ~ 228 cycles/int
16,346 random ints generated [BurtonGenSlow/reference] ~ 202,306 cycles/int
32,965 random ints generated [BurtonGenSlow] ~ 100,315 cycles/int
Comparison to C's rand():
118,766,285 random ints generated [rand/store in C loop] ~ 27.84 cycles/int
114,668,028 random ints generated [rand / Haskell loop] ~ 28.84 cycles/int
114,675,116 random ints generated [rand/store Haskell] ~ 28.84 cycles/int
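The cycles/int column in these runs appears to be the approximate clock frequency divided by the number of ints generated per second. Here is a minimal Python sketch of that arithmetic, using figures from the run above. The function name is made up for illustration and is not part of the benchmark.

```python
def cycles_per_int(clock_hz, ints_per_second):
    """Estimate cycles spent per random int from a clock rate and a throughput."""
    return clock_hz / ints_per_second

# Figures copied from the non-threaded run above.
CLOCK_HZ = 3_306_891_339
for label, ints in [("constant zero gen", 193_178_901),
                    ("System.Random stdGen", 14_530_358)]:
    print(f"{label}: {cycles_per_int(CLOCK_HZ, ints):.2f} cycles/int")
```

Because the estimate scales directly with the measured frequency, a bad frequency reading skews every cycles/int figure derived from it.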
And with -threaded but a single thread (no -N argument):
How many random numbers can we generate in a second on one thread?
Cost of rdtsc (ffi call): 83
Approx getCPUTime calls per second: 226,961
Approx clock frequency: 3,306,672,915
First, timing with System.Random interface:
177,617,807 random ints generated [constant zero gen] ~ 18.62 cycles/int
13,603,612 random ints generated [System.Random stdGen] ~ 243 cycles/int
15,314 random ints generated [BurtonGenSlow/reference] ~ 215,925 cycles/int
31,241 random ints generated [BurtonGenSlow] ~ 105,844 cycles/int
Comparison to C's rand():
57,296,260 random ints generated [rand/store in C loop] ~ 57.71 cycles/int
63,888,404 random ints generated [rand / Haskell loop] ~ 51.76 cycles/int
62,688,727 random ints generated [rand/store Haskell] ~ 52.75 cycles/int
Finished.
Wow... the foreign versions take a big hit from -threaded.
Ack, perhaps worse: with -N4 but only a single-threaded *workload* we run into problems:
How many random numbers can we generate in a second on one thread?
Cost of rdtsc (ffi call): 317
Approx getCPUTime calls per second: 154,859
Approx clock frequency: 2,797,993,233
First, timing with System.Random interface:
12,774,233 random ints generated [constant zero gen] ~ 219 cycles/int
8,211,430 random ints generated [System.Random stdGen] ~ 341 cycles/int
4,990 random ints generated [BurtonGenSlow/reference] ~ 560,720 cycles/int
18,429 random ints generated [BurtonGenSlow] ~ 151,826 cycles/int
Comparison to C's rand():
540,942,561 random ints generated [ptr store in C loop] ~ 5.17 cycles/int
56,722,299 random ints generated [rand/store in C loop] ~ 49.33 cycles/int
63,173,134 random ints generated [rand / Haskell loop] ~ 44.29 cycles/int
61,999,404 random ints generated [rand/store Haskell] ~ 45.13 cycles/int
It doesn't measure the clock frequency accurately in this mode!
I can't understand why that would be... but thread migration may be the culprit.
Yet -qm and -qa don't fix the problem at all...
Apparently -N4 also suffers from HIGH VARIANCE in throughput. That is
independent of the CPU frequency measurement, though the bad frequency
reading does mean that the cycles/int numbers below are bogus.
31,413,439 random ints generated [constant zero gen] ~ 87.08 cycles/int
8,999,510 random ints generated [System.Random stdGen] ~ 304 cycles/int
11,755 random ints generated [BurtonGenSlow/reference] ~ 232,703 cycles/int
367 random ints generated [BurtonGenSlow] ~ 7,453,473 cycles/int
41,861,866 random ints generated [constant zero gen] ~ 47.84 cycles/int
8,073,351 random ints generated [System.Random stdGen] ~ 248 cycles/int
11,850 random ints generated [BurtonGenSlow/reference] ~ 168,992 cycles/int
19,853 random ints generated [BurtonGenSlow] ~ 100,869 cycles/int
12,774,233 random ints generated [constant zero gen] ~ 219 cycles/int
8,211,430 random ints generated [System.Random stdGen] ~ 341 cycles/int
4,990 random ints generated [BurtonGenSlow/reference] ~ 560,720 cycles/int
18,429 random ints generated [BurtonGenSlow] ~ 151,826 cycles/int
Presumably some of this is because of attempted sparking gone awry?
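One rough way to put a number on that variance is the ratio of the fastest to the slowest run for each generator. The sketch below uses the ints-per-second figures from the three -N4 runs above. The helper name and the choice of max/min as the spread measure are assumptions made for illustration, not something the benchmark reports.

```python
# Ints per second from the three -N4 runs listed above.
RUNS = {
    "constant zero gen": [31_413_439, 41_861_866, 12_774_233],
    "System.Random stdGen": [8_999_510, 8_073_351, 8_211_430],
    "BurtonGenSlow/reference": [11_755, 11_850, 4_990],
    "BurtonGenSlow": [367, 19_853, 18_429],
}

def spread_ratio(samples):
    """Ratio of the fastest run to the slowest run (1.0 means no variation)."""
    return max(samples) / min(samples)

for name, samples in RUNS.items():
    print(f"{name}: fastest/slowest = {spread_ratio(samples):.1f}x")
```

Working from raw throughput rather than cycles/int sidesteps the unreliable clock frequency measurement.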
--------------------------------------------------------------------------------
MULTI-THREADED MEASUREMENTS
--------------------------------------------------------------------------------
I'm starting to measure 4-thread performance (all threads generating
randoms), and I'm about to switch to looking at per-thread throughput,
but first here's aggregate throughput. Kind of surprising, right?
The reference version is doing better all of a sudden.
179,217,901 random ints generated [constant zero gen] ~ 10.80 cycles/int
49,573,433 random ints generated [System.Random stdGen] ~ 39.05 cycles/int
69,093 random ints generated [BurtonGenSlow/reference] ~ 28,020 cycles/int
44,454 random ints generated [BurtonGenSlow] ~ 43,550 cycles/int
Comparison to C's rand():
6,450,217 random ints generated [rand/store in C loop] ~ 384 cycles/int
8,589,888 random ints generated [rand / Haskell loop] ~ 289 cycles/int
4,038,556 random ints generated [rand/store Haskell] ~ 614 cycles/int
And the foreign/rand version took a MASSIVE hit, which doesn't make
that much sense, especially for the first variant. I forg
--------------------------------------------------------------------------------
NOW SWITCHING TO MEAN PER-THREAD THROUGHPUT, NORMALIZED TO NUMTHREADS
--------------------------------------------------------------------------------
These numbers are not looking bad for the pure Haskell versions.
Pretty much keeping all four CPUs busy. C's rand is not supposed to
be reentrant, so its numbers don't really matter.
Beware the incorrect clock frequency:
How many random numbers can we generate in a second on one thread?
Cost of rdtsc (ffi call): 228
Approx getCPUTime calls per second: 134,849
Approx clock frequency: 2,478,290,084
Now 4 threads, reporting mean randoms-per-second-per-thread:
First, timing with System.Random interface:
46,142,182 random ints generated [constant zero gen] ~ 54.48 cycles/int
9,957,644 random ints generated [System.Random stdGen] ~ 252 cycles/int
14,146 random ints generated [BurtonGenSlow/reference] ~ 177,683 cycles/int
29,046 random ints generated [BurtonGenSlow] ~ 86,539 cycles/int
Comparison to C's rand():
441,296,179 random ints generated [ptr store in C loop] ~ 5.70 cycles/int
1,488,240 random ints generated [rand/store in C loop] ~ 1,689 cycles/int
1,443,434 random ints generated [rand / Haskell loop] ~ 1,741 cycles/int
1,725,817 random ints generated [rand/store Haskell] ~ 1,456 cycles/int
I don't know why the "reference" version seemed to be doing better
above... now it's back to its old pattern.
I'm still seeing a fair amount of variance even with the
all-threads-working workload. For example, the same setup that did 29K
BurtonGenSlow above can also give me:
47,082,714 random ints generated [constant zero gen] ~ 58.68 cycles/int
10,789,920 random ints generated [System.Random stdGen] ~ 256 cycles/int
11,917 random ints generated [BurtonGenSlow/reference] ~ 231,843 cycles/int
23,854 random ints generated [BurtonGenSlow] ~ 115,824 cycles/int
15,289,001 random ints generated [constant zero gen] ~ 162 cycles/int
3,586,250 random ints generated [System.Random stdGen] ~ 693 cycles/int
4,487 random ints generated [BurtonGenSlow/reference] ~ 553,683 cycles/int
8,994 random ints generated [BurtonGenSlow] ~ 276,241 cycles/int
The variance doesn't seem to correlate with mutator PRODUCTIVITY.
Productivity is pretty good (92%) on the Haskell-only part of the
benchmarks.
                     MUT time  (elapsed)    GC time  (elapsed)
  Task  0 (worker) :  16.84s  (  4.46s)      0.00s  (  0.00s)
  Task  1 (worker) :  16.84s  (  4.46s)      0.00s  (  0.00s)
  Task  2 (bound)  :  16.77s  (  4.46s)      0.07s  (  0.02s)
  Task  3 (worker) :  16.38s  (  4.46s)      0.46s  (  0.15s)
  Task  4 (worker) :  16.84s  (  4.46s)      0.00s  (  0.00s)
  Task  5 (worker) :  16.05s  (  4.46s)      0.78s  (  0.23s)
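As a rough cross-check, the share of each task's time spent in the mutator can be estimated from the table as MUT / (MUT + GC). This sketch is only an approximation: the runtime computes its own productivity figure over the whole run, and the 92% quoted above covers only part of the benchmark, so the per-task ratios need not match it. The function name is hypothetical.

```python
# (MUT seconds, GC seconds) per task, taken from the first column of each pair above.
TASKS = {
    0: (16.84, 0.00),
    1: (16.84, 0.00),
    2: (16.77, 0.07),
    3: (16.38, 0.46),
    4: (16.84, 0.00),
    5: (16.05, 0.78),
}

def mutator_share(mut_seconds, gc_seconds):
    """Fraction of MUT + GC time that was spent in the mutator."""
    return mut_seconds / (mut_seconds + gc_seconds)

for task, (mut, gc) in TASKS.items():
    print(f"Task {task}: {mutator_share(mut, gc):.1%} mutator")
```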
[2011.01.29] {Wrapping the Intel AES Sample Library}
The big question is whether to drop in their whole distribution and
call their build script, or try to rip out just the bits I need.
Here are the commands they use to build the intel_aes64.a library:
for i in $asm; do echo do $i.s; $yasm -D__linux__ -g dwarf2 -f elf${sz} asm/x${arch}/$i.s -o obj/x${arch}/$i.o; done
gcc -O3 -g -m${sz} -Iinclude/ -c src/intel_aes.c -o obj/x${arch}/intel_aes.o
ar -r lib/x${arch}/intel_aes${arch}.a obj/x${arch}/*.o
OK... I tried rebuilding the library by hand to get a .so:
gcc -fPIC -O3 -g -Iinclude/ -c src/intel_aes.c -o obj/x64/intel_aes.o
gcc -shared -dynamic -o lib/x64/libintel_aes64.so obj/x64/*.o
[2011.02.02] {Annoying link problems}
In commit 4bf79dfe55.. I reverted some recent refactorings to fix the weird link problem.
Linking dist/build/benchmark-intel-aes-rng/benchmark-intel-aes-rng ...
/home/newton/Dropbox/working_copies/intel-aes/dist/build/libHSintel-aes-0.1.1.a(GladmanAES.o): In function `s3iP_info':
(.text+0x34c3): undefined reference to `__stginit_intelzmaeszm0zi1zi1_CodecziCryptoziConvertRNG_'
collect2: ld returned 1 exit status
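The undefined symbol in that linker error is Z-encoded, which is how GHC mangles package and module names into linker-safe identifiers. Decoding it shows which package and module the linker could not find. The sketch below handles only a small subset of the encoding (the escapes for '-', '.', '_', 'z' and 'Z'), which is enough for this symbol. The function and table names are made up for illustration.

```python
# Small subset of GHC's Z-encoding; enough for package and module names.
Z_CODES = {"zm": "-", "zi": ".", "zu": "_", "zz": "z", "ZZ": "Z"}

def z_decode(symbol):
    """Decode the Z-encoded escapes listed in Z_CODES, leaving other text alone."""
    out = []
    i = 0
    while i < len(symbol):
        pair = symbol[i:i + 2]
        if pair in Z_CODES:
            out.append(Z_CODES[pair])
            i += 2
        else:
            out.append(symbol[i])
            i += 1
    return "".join(out)

# The part of the undefined symbol that follows the __stginit_ prefix.
print(z_decode("intelzmaeszm0zi1zi1_CodecziCryptoziConvertRNG_"))
```

Read this way, the missing symbol refers to module Codec.Crypto.ConvertRNG in package intel-aes-0.1.1.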
[2011.02.07] {Trying to build in Windows, not having much luck}
I tried to get it to work with gcc 4.3.4; that failed for lack of intrin.h.
I tried to actually add the ../VC/bin directory to my path so I could
run cl.exe. Then I tried the .bat file that came with the Intel
Sample library. It seemed to get to the final "lib.exe" call, at which
point it silently fails (error code 127) with no message.
[2011.02.07] {Haddock problems}
I'm having a variety of different problems trying to build
documentation for this package. If I do "runhaskell Setup.hs haddock"
I get the following:
GHCi runtime linker: fatal error: I found a duplicate definition for symbol
__hscore_S_IFDIR
whilst processing object file
/home/newton/.cabal/lib/directory-1.0.1.2/ghc-7.0.1/HSdirectory-1.0.1.2.o
This could be caused by:
* Loading two different object files which export the same symbol
* Specifying the same object file twice on the GHCi command line
* An incorrect `package.conf' entry, causing some object to be
loaded twice.
GHCi cannot safely continue in this situation. Exiting now. Sorry.
If I try "cabal haddock" it fails silently with error code 127.
But if I run "haddock" by hand on the .hs files it seems to work ok.
At one point I thought I was dealing with this problem:
http://hackage.haskell.org/trac/hackage/ticket/656
In particular I believe I ran into this error:
cabal: Can't find transitive deps for haddock
[2011.02.07] {Ah, the error 127 business above was really a cabal problem}
I must be crashing cabal silently with my intel-aes.cabal file.
Nope, that's not it: dist/setup/setup lost its execute bit (thanks,
Dropbox) and that completely flummoxed cabal.
The "cabal haddock" option above DOES result in the transitive deps error.
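A lost execute bit like the one on dist/setup/setup can be detected and restored with a few lines of script. This is a minimal sketch, assuming a POSIX filesystem. The helper name is hypothetical, and it only restores the owner's execute permission.

```python
import os
import stat

def ensure_executable(path):
    """Add the owner execute bit if it is missing.

    Returns True if the bit had to be added, False if it was already set.
    """
    mode = os.stat(path).st_mode
    if mode & stat.S_IXUSR:
        return False
    os.chmod(path, mode | stat.S_IXUSR)
    return True

if __name__ == "__main__":
    target = "dist/setup/setup"  # path mentioned in the entry above
    if os.path.exists(target):
        print("fixed" if ensure_executable(target) else "already executable")
```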