Parameterize summations over RealFloats #64

Open · wants to merge 4 commits into master
Conversation

@414owen commented Sep 25, 2020

I'm trying to gauge interest in upstreaming some more generalized compensated floating-point summations.
This is useful to me for use with the ad library, where grad takes a function of type (Traversable f, Num a) => f (Reverse s a) -> Reverse s a. I use the RealFloat instance of Reverse s a to implement a loss function, which includes compensated floating-point arithmetic.

Currently all tests pass apart from GHCJS 8.4. I'll look into it if there's enough interest in merging this.

I've run some preliminary benchmarks, which seem very promising, and which I think establish that any potential data boxing resulting from this generalization won't do too much harm to users.

edit: <removed obsolete benchmarks>

The instance is useless, as one can use Prelude.sum for the same effect.

The test will have the same result as its type parameter would.
@414owen changed the title from "Generalized summations" to "Parameterize summations over RealFloats" on Sep 25, 2020
@Shimuuar (Collaborator)

The main problem is, of course, the performance impact. We go from 3-words/KBN to 5-words/KBN + pointer chasing. And then there's the question of what the actual impact of this change is. Benchmark results (with -O2) are exactly the same, so I suppose GHC is smart enough to just unbox everything in the main loop and not allocate Kahan/KBN/KB2 objects at all.

Do benchmarks with -O1 (this could be changed in the cabal file) continue to show no difference?

P.S. Out of curiosity. How exactly do you use compensated summation with ad?
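
For reference, a minimal sketch of the representation change being discussed; the type and field names are illustrative, not the library's exact definitions:

```haskell
-- Specialized accumulator (the "3 words" case): strict, monomorphic
-- Double fields can be unpacked by GHC, so a value is a single heap
-- object holding two raw Double# fields.
data KBNSum = KBNSum {-# UNPACK #-} !Double {-# UNPACK #-} !Double

-- Parameterized accumulator (the "5 words + pointer chasing" case):
-- polymorphic fields cannot be unpacked, so the constructor holds
-- pointers to separately allocated boxed values.
data KBNSumP a = KBNSumP !a !a
```

Whether this matters in practice depends on whether GHC can unbox the accumulator in the inner loop, which is exactly what the benchmarks below probe.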

@414owen (Author) commented Sep 28, 2020

Okay, here are some more complete benchmarks.

I also dumped the assembly for some minimal examples (old, new).

The new ASM doesn't look great, but I'm not really in a position to evaluate it.
The benchmarks don't seem to care too much...

@414owen (Author) commented Sep 28, 2020

I'm passing to grad, and optimizing, a function of type (Traversable f, RealFloat b) => f b -> b, where b ~ Reverse s a in grad's type.
Elsewhere, I use the same function with b ~ Double.
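
A hypothetical sketch of this usage pattern: a loss function polymorphic over RealFloat can be differentiated with ad's grad (instantiated at b ~ Reverse s Double) and also evaluated directly at b ~ Double. The body of `loss` here is a placeholder; in the PR's setting it would call the generalized compensated summation internally.

```haskell
import Numeric.AD (grad)

-- Placeholder loss: sum of squares. RealFloat is the constraint that
-- lets the body use compensated floating-point arithmetic.
loss :: (Traversable f, RealFloat b) => f b -> b
loss = sum . fmap (\x -> x * x)

-- Differentiated via ad, at b ~ Reverse s Double.
lossGrad :: [Double] -> [Double]
lossGrad = grad loss

-- The same function evaluated directly, at b ~ Double.
lossValue :: [Double] -> Double
lossValue = loss
```

This is why the accumulator types need to be parameterized: at Reverse s Double the fields cannot be plain unboxed Doubles.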

@414owen (Author) commented Oct 4, 2020

Here are some more easily comparable benchmarks: https://gist.github.com/414owen/ea366fc110a4e416ae9ceea035689a03

The new generalized version with -O2 is slightly faster than the current version in every test.
I don't know why this is the case.

@Shimuuar (Collaborator) commented Oct 4, 2020

I looked into the Core, and yes, GHC unboxes the KBN/etc. accumulators, so the inner loop has type Double# -> Double# -> Int# -> Double. This means it doesn't matter whether the accumulator type is boxed or not. On one hand, this is pretty much the ideal case for the optimizer: a simple loop where everything is inlined. On the other, it's more or less how the accumulator is intended to be used.

The question turns out to be: how easy is it to break the optimization above? I can't invent a counterexample on the spot, so I need to think about it a bit.
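
For readers following along, a minimal sketch of the Kahan–Babuška–Neumaier step that this inner loop implements (not the library's exact code; a plain tuple stands in for the KBNSum type):

```haskell
import Data.List (foldl')

-- One KBN step: add x into the running sum s, accumulating the
-- rounding error of s + x into the compensation term c. The branch
-- picks the order that recovers the low-order bits lost in s + x.
kbnAdd :: (Double, Double) -> Double -> (Double, Double)
kbnAdd (s, c) x
  | abs s >= abs x = (s', c + ((s - s') + x))
  | otherwise      = (s', c + ((x - s') + s))
  where s' = s + x

-- Final result: running sum plus accumulated compensation.
kbnSum :: [Double] -> Double
kbnSum = uncurry (+) . foldl' kbnAdd (0, 0)
```

For example, kbnSum [1, 1e100, 1, -1e100] yields 2.0, where naive summation loses both 1s and returns 0. When everything inlines, GHC reduces this to the Double# -> Double# -> Int# -> Double loop described above.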

> The new ASM doesn't look great, but I'm not really in a position to evaluate it.

AFAIK the NCG was never among the best, so it may well generate suboptimal assembly.

> I'm passing grad, and optimize a function ...

And how do KBN & friends enter the picture?

@Shimuuar (Collaborator)

I've added a summation benchmark which inhibits inlining and requires the compiler to actually allocate KBN objects on the heap:

kbnStep :: Sum.KBNSum Double -> Double -> Sum.KBNSum Double
kbnStep = Sum.add
{-# NOINLINE kbnStep #-}

...
      , bench "kbn.Noinline" $ whnf (Sum.kbn . U.foldl' kbnStep Sum.zero) v
...

The results are, to say the least, surprising:

Sum/kbn                     3.704 ms
Sum/kbn.Noinline [unboxed]  8.467 ms
Sum/kbn.Noinline [boxed]    5.871 ms (!)

For some reason the boxed version outperforms the unboxed one. I haven't tried to find out why. It looks like making the accumulator types boxed doesn't result in a large performance penalty, and for good performance everything must be inlined anyway.

Another, possibly related, puzzle is kb2 outperforming everything else, including naive summation, despite kb2 doing much more work.

@Shimuuar (Collaborator)

This PR is turning into a dig into benchmarking weirdness, but it's difficult to make performance-sensitive choices when the benchmarks lie to you. It turns out that the results depend on the order in which the benchmarks are run:

Sum/naive                                mean 3.632 ms  ( +- 30.21 μs  )
Sum/kahan                                mean 5.437 ms  ( +- 19.12 μs  )
Sum/kb2                                  mean 2.310 ms  ( +- 17.17 μs  )
Sum/kbn                                  mean 1.566 ms  ( +- 8.775 μs  )

Sum/naive                                mean 3.621 ms  ( +- 24.39 μs  )
Sum/kahan                                mean 5.476 ms  ( +- 20.11 μs  )
Sum/kbn                                  mean 3.712 ms  ( +- 28.19 μs  )
Sum/kb2                                  mean 2.284 ms  ( +- 19.84 μs  )

When kbn is run last, its run time goes from 3.7 ms to 1.6 ms, a more than 2× speedup! Something weird is going on here.

I also attempted to measure the run time of kbn/kb2 summation using perf tools (gist here). The results are very boring and in line with expectations: kbn shows a 1.9× slowdown and kb2 a 2.7× slowdown.
