implement epsilon-based floating point number comparisons in tests #46

sfindeisen · 2022-05-19T19:32:49Z

Depending on the architecture and/or platform, we're sometimes getting test errors like:

    black/white 250 209310 2.0e-2:                             FAIL (2.49s)
      test/common/TestMnistFCNN.hs:75:
      expected: 23.025850929940457
       but got: 23.02585092994046

This shows we need more relaxed floating point value equality checks in tests. Perhaps something like: abs(x-y) < ε, where ε should be configurable.

Within the scope of this ticket is:

implement the ε mechanism
make ε configurable
refactor all the flaky tests to use the new ε

Related issue: #47

The text was updated successfully, but these errors were encountered:

sfindeisen · 2022-05-19T20:06:47Z

More failures:

$ cabal test --enable-optimization shortTestForCI
[...]
    black/white 250 209310 2.0e-2:                             FAIL (2.49s)
      test/common/TestMnistFCNN.hs:75:
      expected: 23.025850929940457
       but got: 23.02585092994046
      Use -p '/black\/white 250 209310 2.0e-2/' to rerun this test only.
    random 100 250 209310 2.0e-2:                              OK (4.73s)
    first 100 trainset samples only 250 209310 2.0e-2:         FAIL (6.17s)
      test/common/TestMnistFCNN.hs:75:
      expected: 3.233123290489956
       but got: 3.2331232904899556
      Use -p '/first 100 trainset samples only 250 209310 2.0e-2/' to rerun this test only.
[...]
  Sine RNN tests
    train 1 2 2 991 5.0e-2:                                    FAIL (8.16s)
      test/common/TestMnistRNN.hs:154:
      expected: 5.060827754123346e-5
       but got: 5.060827754124135e-5
      Use -p '/train 1 2 2 991 5.0e-2/' to rerun this test only.
    feedback 1 2 2 991 5.0e-2:                                 FAIL (2.61s)
      test/common/TestMnistRNN.hs:226:
      expected: [-0.9980267284282716,-0.9655322144631203,-0.8919588317267176,-0.7773331580548076,-0.6212249872512189,-0.4246885094957385,-0.19280278430361192,6.316924614971235e-2,0.3255160857644734,0.5731149496491759,0.7872840563791541,0.957217059407527,1.0815006200684472,1.1654656874016613,1.2170717188563214,1.2437913143303263,1.251142657837598,1.2423738174804864,1.2186583377053681,1.1794148708577938,1.1226117988569018,1.0450711676413071,0.9428743310020188,0.8120257428038534,0.6495453130357101,0.45507653540664667,0.23281831228915612,-6.935736916677385e-3,-0.24789484923780786,-0.4705527193222155]
       but got: [-0.9980267284282716,-0.965532214463121,-0.8919588317267196,-0.7773331580548113,-0.6212249872512243,-0.42468850949574577,-0.19280278430362144,6.316924614970049e-2,0.3255160857644596,0.5731149496491608,0.7872840563791381,0.9572170594075105,1.0815006200684296,1.1654656874016416,1.2170717188562976,1.243791314330297,1.2511426578375604,1.242373817480437,1.2186583377053033,1.1794148708577084,1.1226117988567896,1.045071167641161,0.9428743310018309,0.8120257428036174,0.649545313035422,0.4550765354063085,0.2328183122887787,-6.935736917072714e-3,-0.24789484923819122,-0.4705527193225557]
      Use -p '/feedback 1 2 2 991 5.0e-2/' to rerun this test only.
    trainVV 1 33 0 991 5.0e-2:                                 FAIL (12.65s)
      test/common/TestMnistRNN.hs:154:
      expected: 4.6511403967229306e-5
       but got: 4.6511403967249e-5
      Use -p '/trainVV 1 33 0 991 5.0e-2/' to rerun this test only.
    feedbackVV 1 33 0 991 5.0e-2:                              FAIL (4.20s)
      test/common/TestMnistRNN.hs:226:
      expected: [-0.9980267284282716,-0.9660899403337656,-0.8930568599923028,-0.7791304201898077,-0.6245654477568863,-0.4314435277698684,-0.2058673183484546,4.0423225394292085e-2,0.29029630688547203,0.5241984159992963,0.7250013011527577,0.8820730400055012,0.9922277361823716,1.057620382863504,1.08252746840241,1.070784986731554,1.0245016946328942,0.9438848015250431,0.827868146535437,0.6753691437632174,0.48708347071773117,0.26756701680655437,2.6913747557207532e-2,-0.21912614372802072,-0.45154893423928943,-0.6525638736434227,-0.8098403108946983,-0.9180866488182939,-0.9775459850131992,-0.9910399864230198]
       but got: [-0.9980267284282716,-0.9660899403337655,-0.8930568599923028,-0.779130420189808,-0.624565447756887,-0.4314435277698694,-0.20586731834845626,4.042322539428911e-2,0.29029630688546687,0.5241984159992878,0.7250013011527451,0.8820730400054831,0.9922277361823462,1.0576203828634687,1.0825274684023618,1.0707849867314898,1.0245016946328098,0.9438848015249338,0.8278681465352983,0.6753691437630459,0.4870834707175275,0.26756701680632367,2.6913747556961208e-2,-0.2191261437282658,-0.45154893423951464,-0.6525638736436126,-0.809840310894844,-0.9180866488183926,-0.9775459850132528,-0.9910399864230315]
      Use -p '/feedbackVV 1 33 0 991 5.0e-2/' to rerun this test only.
    trainAR 3 0 0 3 5.0e-2:                                    FAIL (0.07s)
      test/common/TestMnistRNN.hs:154:
      expected: 6.327978161031336e-23
       but got: 6.327271647343858e-23
      Use -p '/trainAR 3 0 0 3 5.0e-2/' to rerun this test only.
    feedbackAR 3 0 0 3 5.0e-2:                                 FAIL (0.02s)
      test/common/TestMnistRNN.hs:226:
      expected: [-0.9980267284282716,-0.9510565162972417,-0.8443279255081759,-0.6845471059406962,-0.48175367412103653,-0.24868988719256901,-3.673766846290505e-11,0.24868988711894977,0.4817536740469978,0.6845471058659982,0.8443279254326351,0.9510565162207472,0.9980267283507953,0.9822872506502898,0.9048270523889208,0.7705132427021685,0.5877852522243431,0.3681245526237731,0.12533323351198067,-0.1253332336071494,-0.36812455271766376,-0.5877852523157643,-0.7705132427900961,-0.9048270524725681,-0.9822872507291605,-0.9980267284247174,-0.9510565162898851,-0.844327925497479,-0.6845471059273313,-0.48175367410584324]
       but got: [-0.9980267284282716,-0.9510565162972417,-0.8443279255081759,-0.6845471059406962,-0.48175367412103653,-0.2486898871925689,-3.6737446418300124e-11,0.2486898871189501,0.4817536740469982,0.6845471058659988,0.8443279254326359,0.9510565162207482,0.9980267283507965,0.9822872506502912,0.9048270523889225,0.7705132427021704,0.5877852522243452,0.3681245526237752,0.12533323351198278,-0.1253332336071474,-0.368124552717662,-0.5877852523157627,-0.770513242790095,-0.9048270524725675,-0.9822872507291602,-0.9980267284247176,-0.9510565162898857,-0.8443279254974801,-0.6845471059273328,-0.48175367410584524]
      Use -p '/feedbackAR 3 0 0 3 5.0e-2/' to rerun this test only.
[...]
OK (25.40s)
    randomLL2 140 0 3 5 54282 5.0e-2:                          FAIL (2.07s)
      test/common/TestMnistRNN.hs:173:
      wrong result: 30.061871495725264 is expected to be a member of [30.061871495723956,30.06187089990927]
      Use -p '/randomLL2 140 0 3 5 54282 5.0e-2/' to rerun this test only.
[...]

Mikolaj · 2022-05-19T20:54:28Z

BTW, in some cases it's fine to simplify the tests. E.g., if a test returns a pair of a float and a list of floats often it's enough to check the first float to pin the test result, because the list is printed out of curiosity, not because it's independent enough component of the result to be worth verifying. Or the test can be tweaked so that the list is summed and often the probability of us getting exactly the same sum from a wrong list is not worth considering.

sfindeisen · 2022-06-15T12:57:11Z

In the above PR (#57) I implemented ε configuration using Tasty's custom options mechanism. I refactored a single test (in ShortTestForCI) to illustrate this. @Mikolaj is this the way to go? Looks bloated to me but I can't see a better way. What is also striking is that Tasty does not seem to support this (obvious) usecase out of the box.

Mikolaj · 2022-06-15T13:05:53Z

In the above PR (#57) I implemented ε configuration using Tasty's custom options mechanism. I refactored a single test (in ShortTestForCI) to illustrate this. @Mikolaj is this the way to go? Looks bloated to me but I can't see a better way.

Thank you, I will have a look.

What is also striking is that Tasty does not seem to support this (obvious) usecase out of the box.

So you confirm this. Indeed, I couldn't believe that's true. Is there a package built on top of Tasty or Test.Tasty.HUnit that implements that? Any ticket in their bugtracker? Do alternative testing packages supports that? I know of spec and https://github.com/NorfairKing/sydtest as possible contenders to Tasty, but I haven't checked in a while.

sfindeisen · 2022-06-16T07:35:31Z

UnkindPartition/tasty#337

sfindeisen · 2022-06-17T10:14:38Z

@Mikolaj how do you like their answer? The main maintainer (Roman) seems to be in the Ukrainian military now.

Mikolaj · 2022-06-17T10:21:42Z

Oh dear.

Regarding Eq instance, if that works, why not? Does it really give us @?= and @=? for free?

sfindeisen · 2022-06-17T16:02:50Z

Yes we can do it like this: 56663a4 . Is it cool?

Mikolaj · 2022-06-17T16:50:38Z

I think that's acceptable. We are hackers after all, transitivity of Eq is something we use, not honour. :)

The wrapping of stuff in Precision5, though, is noisy. I suppose we'd define our own @?=!#*&@ that wraps, but then it's harder to understand at first glance by somebody that knows HUnit. But if we define our own operator, I suppose we can define it without Precision5?

sfindeisen · 2022-06-17T17:15:07Z

our own operator? where? inside Tasty?

andreasabel · 2022-06-17T18:54:09Z

I'd think you can easily define your function/operator of choice that calls assertBool (or even just assertFailure) in a suitable way. See the source code for tasty-hunit: https://hackage.haskell.org/package/tasty-hunit-0.10.0.3/docs/src/Test.Tasty.HUnit.Orig.html#assertEqual
The only convenience of assertEqual over assertBool seems to be that it has a predefined message "expected ... but got ...".

P.S.: Equality on floats: We all suffer from many years of indoctrination in high school where they taught us there was a concept of equality on reals, when there is really none (from the perspective of computing). But doing away with equality on reals would be a nightmare for mathematicians, so they rather push it onto us... [sorry for the rant]

sfindeisen · 2022-06-18T08:37:58Z

@Mikolaj how about this: https://github.com/Mikolaj/horde-ad/pull/57/files#diff-a78806a9b87bdb9a4a9f67dfe9b566d0daed05f8f153b835a6ad49a3c2a71247R157 . Our ε is now hardcoded though: https://github.com/Mikolaj/horde-ad/pull/57/files#diff-a4e3f755eecb73528d5df54496793310f1aadace29ece94d400f898d0de443b5R35 . (I borrowed some code from Tasty.)

Mikolaj · 2022-06-19T06:21:26Z

Given that this lives in IO, could it get the epsilon from, say, commandline?

sfindeisen · 2022-06-20T12:36:10Z

Do you mean a commandline argument, like here: https://github.com/Mikolaj/horde-ad/pull/57/files#diff-78fe67374a9d4d8a20b6f836d420d2299f903f2167d98408e9c86cfb46ac8aaeR40 ? Yes we can do it like that, the only problem is that we have to propagate our ε from the program start all the way down to assertClose, which makes the code noisy. Is there a better way?

Or did you mean asking the user during the run? IMHO that would be even more ugly and wouldn't work well in batch mode.

Mikolaj · 2022-06-20T12:40:57Z

Yes, definitely the first option. I haven't given this much thought, but how about an IORef for the epsilon that gets updated with the commandline parameter and is read by assertClose?

sfindeisen · 2022-06-20T12:48:28Z

Yes I was thinking about IORef but is this pure (enough) and cool? Is there a better way?

Mikolaj · 2022-06-20T12:53:23Z

This, regrettably, uses the IO monad, so it's not ideal. However, once we have this working, we can think of a better way (or find other tasks more worthwhile). Perhaps tasty folk can offer feedback, especially if you report the results of your experiment.

sfindeisen mentioned this issue May 19, 2022

investigate huge floating point errors in (flaky) tests #47

Open

Mikolaj added help wanted Extra attention is needed good first issue Good for newcomers labels May 19, 2022

sfindeisen mentioned this issue Jun 15, 2022

implement epsilon-based floating point number comparisons in tests #57

Merged

sfindeisen closed this as completed in #57 Jul 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

implement epsilon-based floating point number comparisons in tests #46

implement epsilon-based floating point number comparisons in tests #46

sfindeisen commented May 19, 2022 •

edited

Loading

sfindeisen commented May 19, 2022

Mikolaj commented May 19, 2022 •

edited

Loading

sfindeisen commented Jun 15, 2022

Mikolaj commented Jun 15, 2022

sfindeisen commented Jun 16, 2022

sfindeisen commented Jun 17, 2022

Mikolaj commented Jun 17, 2022

sfindeisen commented Jun 17, 2022

Mikolaj commented Jun 17, 2022

sfindeisen commented Jun 17, 2022

andreasabel commented Jun 17, 2022 •

edited

Loading

sfindeisen commented Jun 18, 2022 •

edited

Loading

Mikolaj commented Jun 19, 2022

sfindeisen commented Jun 20, 2022

Mikolaj commented Jun 20, 2022

sfindeisen commented Jun 20, 2022

Mikolaj commented Jun 20, 2022

implement epsilon-based floating point number comparisons in tests #46

implement epsilon-based floating point number comparisons in tests #46

Comments

sfindeisen commented May 19, 2022 • edited Loading

sfindeisen commented May 19, 2022

Mikolaj commented May 19, 2022 • edited Loading

sfindeisen commented Jun 15, 2022

Mikolaj commented Jun 15, 2022

sfindeisen commented Jun 16, 2022

sfindeisen commented Jun 17, 2022

Mikolaj commented Jun 17, 2022

sfindeisen commented Jun 17, 2022

Mikolaj commented Jun 17, 2022

sfindeisen commented Jun 17, 2022

andreasabel commented Jun 17, 2022 • edited Loading

sfindeisen commented Jun 18, 2022 • edited Loading

Mikolaj commented Jun 19, 2022

sfindeisen commented Jun 20, 2022

Mikolaj commented Jun 20, 2022

sfindeisen commented Jun 20, 2022

Mikolaj commented Jun 20, 2022

sfindeisen commented May 19, 2022 •

edited

Loading

Mikolaj commented May 19, 2022 •

edited

Loading

andreasabel commented Jun 17, 2022 •

edited

Loading

sfindeisen commented Jun 18, 2022 •

edited

Loading