Parameterize summations over RealFloats #64
base: master
Conversation
The instance is useless, as one can use Prelude.sum for the same effect. The test will have the same result as its type parameter would.
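To illustrate the point above, here is a hedged sketch (with a hypothetical `Summation`-style class and names, not the library's actual API): an instance whose accumulator is just the running total computes exactly what `Prelude.sum` computes, so it adds nothing over it.

```haskell
import Data.List (foldl')

-- Hypothetical Summation-style class; names are illustrative only.
class Summation s where
  szero   :: s
  sadd    :: s -> Double -> s
  sresult :: s -> Double

-- The "useless" instance: the accumulator is just the running total,
-- so it performs the same plain left fold as Prelude.sum.
newtype Naive = Naive Double

instance Summation Naive where
  szero             = Naive 0
  sadd (Naive s) x  = Naive (s + x)
  sresult (Naive s) = s

sumNaive :: [Double] -> Double
sumNaive = sresult . foldl' sadd (szero :: Naive)

main :: IO ()
main = do
  let xs = [0.1, 0.2, 0.3, 0.4]
  print (sumNaive xs == sum xs)  -- True: no benefit over Prelude.sum
```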
The main problem is of course the performance impact. We go from 3 words per KBN accumulator to 5 words per KBN plus pointer chasing. And then there's the question of what the actual impact of this change is. Benchmark results (with -O2) are exactly the same, so I suppose GHC is smart enough to just unbox everything in the main loop and not allocate Kahan/KBN/KB2 objects at all. Do benchmarks with -O1 (could be changed in the cabal file) continue to show no difference?

P.S. Out of curiosity: how exactly do you use compensated summation with ad?
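For context, the KBN step under discussion is tiny. Here is a self-contained sketch of Kahan-Babuska-Neumaier summation parameterized over the element type, in the spirit of this PR (names are illustrative, not the library's actual API):

```haskell
import Data.List (foldl')

-- Accumulator: running sum and compensation term (two fields; the boxed
-- version discussed above adds constructor headers and pointers).
data KBNSum a = KBNSum !a !a

zeroKBN :: Num a => KBNSum a
zeroKBN = KBNSum 0 0

-- One KBN step: add x, then recover the rounding error of that
-- addition into the compensation term.
addKBN :: (Num a, Ord a) => KBNSum a -> a -> KBNSum a
addKBN (KBNSum s c) x = KBNSum s' c'
  where
    s' = s + x
    c' | abs s >= abs x = c + ((s - s') + x)  -- low bits of x were lost
       | otherwise      = c + ((x - s') + s)  -- low bits of s were lost

kbnSum :: (Num a, Ord a) => [a] -> a
kbnSum xs = case foldl' addKBN zeroKBN xs of KBNSum s c -> s + c

main :: IO ()
main = do
  let xs = 1 : replicate 10000 1e-16 :: [Double]
  print (sum xs)     -- naive: every 1e-16 term is rounded away
  print (kbnSum xs)  -- compensated: recovers roughly 1 + 1e-12
```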
Okay, here are some. I also dumped the assembly for some minimal examples (old, new). The new ASM doesn't look great, but I'm not really in a position to evaluate it.
Here are some more easily comparable benchmarks: https://gist.github.com/414owen/ea366fc110a4e416ae9ceea035689a03 The new generalized version with
I looked into the Core, and yes, GHC unboxes the KBN/etc. accumulators, so the inner loop operates on unboxed values. The question turns out to be: how easy is it to break the optimization above? I can't invent an example on the spot, so I need to think a bit about it.
AFAIK the NCG was never among the best, so it could generate suboptimal assembly.
And how do KBN & friends enter the picture?
I've added a summation benchmark which inhibits inlining and requires the compiler to actually allocate KBN objects on the heap:

```haskell
kbnStep :: Sum.KBNSum Double -> Double -> Sum.KBNSum Double
kbnStep = Sum.add
{-# NOINLINE kbnStep #-}
...
, bench "kbn.Noinline" $ whnf (Sum.kbn . U.foldl' kbnStep Sum.zero) v
...
```

The results are, to say the least, surprising:
For some reason the boxed version outperforms the unboxed one. I didn't try to find out why that's the case. It looks like making the accumulator types boxed doesn't result in a large performance penalty, and for good performance everything must be inlined anyway.

Another possibly related puzzle is kb2 outperforming everything else, including naive summation. And this happens despite kb2 doing much more work.
This PR has turned into digging into benchmarking weirdness, but it's difficult to make performance-sensitive choices when benchmarks lie to you. It turns out that benchmark results depend on the order in which they're run:
When kbn is run last, its run time goes from 3.7 ms to 1.6 ms, a more than 2× speedup! Something weird is going on here. I also attempted to measure the run time of kbn/kb2 summation using perf tools (gist here). The results are very boring and in line with what's expected: kbn — 1.9× slowdown, kb2 — 2.7× slowdown.
I'm trying to gauge interest in upstreaming some more generalized floating-point compensated summations.

This is useful to me for use with the `ad` library, where `grad` takes a function `(Traversable f, Num a) => f (Reverse s a) -> Reverse s a`. I use the `RealFloat` instance of `Reverse s a` to implement a loss function, which includes compensated floating-point arithmetic.

Currently all tests pass apart from GHCJS 8.4. I'll look into it if there's enough interest in merging this.
I've run some preliminary benchmarks, which seem very promising, and which I think establish that any potential data boxing coming from this generalization won't do too much harm to users.
edit:
<removed obsolete benchmarks>