Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement epsilon-based floating point number comparisons in tests #46

Closed
sfindeisen opened this issue May 19, 2022 · 17 comments · Fixed by #57
Closed

implement epsilon-based floating point number comparisons in tests #46

sfindeisen opened this issue May 19, 2022 · 17 comments · Fixed by #57
Labels
good first issue Good for newcomers help wanted Extra attention is needed

Comments

@sfindeisen
Copy link
Contributor

sfindeisen commented May 19, 2022

Depending on the architecture and/or platform, we're sometimes getting test errors like:

    black/white 250 209310 2.0e-2:                             FAIL (2.49s)
      test/common/TestMnistFCNN.hs:75:
      expected: 23.025850929940457
       but got: 23.02585092994046

This shows we need more relaxed floating point value equality checks in tests. Perhaps something like: abs(x-y) < ε, where ε should be configurable.

Within the scope of this ticket is:

  1. implement the ε mechanism
  2. make ε configurable
  3. refactor all the flaky tests to use the new ε

Related issue: #47

@sfindeisen
Copy link
Contributor Author

More failures:

$ cabal test --enable-optimization shortTestForCI
[...]
    black/white 250 209310 2.0e-2:                             FAIL (2.49s)
      test/common/TestMnistFCNN.hs:75:
      expected: 23.025850929940457
       but got: 23.02585092994046
      Use -p '/black\/white 250 209310 2.0e-2/' to rerun this test only.
    random 100 250 209310 2.0e-2:                              OK (4.73s)
    first 100 trainset samples only 250 209310 2.0e-2:         FAIL (6.17s)
      test/common/TestMnistFCNN.hs:75:
      expected: 3.233123290489956
       but got: 3.2331232904899556
      Use -p '/first 100 trainset samples only 250 209310 2.0e-2/' to rerun this test only.
[...]
  Sine RNN tests
    train 1 2 2 991 5.0e-2:                                    FAIL (8.16s)
      test/common/TestMnistRNN.hs:154:
      expected: 5.060827754123346e-5
       but got: 5.060827754124135e-5
      Use -p '/train 1 2 2 991 5.0e-2/' to rerun this test only.
    feedback 1 2 2 991 5.0e-2:                                 FAIL (2.61s)
      test/common/TestMnistRNN.hs:226:
      expected: [-0.9980267284282716,-0.9655322144631203,-0.8919588317267176,-0.7773331580548076,-0.6212249872512189,-0.4246885094957385,-0.19280278430361192,6.316924614971235e-2,0.3255160857644734,0.5731149496491759,0.7872840563791541,0.957217059407527,1.0815006200684472,1.1654656874016613,1.2170717188563214,1.2437913143303263,1.251142657837598,1.2423738174804864,1.2186583377053681,1.1794148708577938,1.1226117988569018,1.0450711676413071,0.9428743310020188,0.8120257428038534,0.6495453130357101,0.45507653540664667,0.23281831228915612,-6.935736916677385e-3,-0.24789484923780786,-0.4705527193222155]
       but got: [-0.9980267284282716,-0.965532214463121,-0.8919588317267196,-0.7773331580548113,-0.6212249872512243,-0.42468850949574577,-0.19280278430362144,6.316924614970049e-2,0.3255160857644596,0.5731149496491608,0.7872840563791381,0.9572170594075105,1.0815006200684296,1.1654656874016416,1.2170717188562976,1.243791314330297,1.2511426578375604,1.242373817480437,1.2186583377053033,1.1794148708577084,1.1226117988567896,1.045071167641161,0.9428743310018309,0.8120257428036174,0.649545313035422,0.4550765354063085,0.2328183122887787,-6.935736917072714e-3,-0.24789484923819122,-0.4705527193225557]
      Use -p '/feedback 1 2 2 991 5.0e-2/' to rerun this test only.
    trainVV 1 33 0 991 5.0e-2:                                 FAIL (12.65s)
      test/common/TestMnistRNN.hs:154:
      expected: 4.6511403967229306e-5
       but got: 4.6511403967249e-5
      Use -p '/trainVV 1 33 0 991 5.0e-2/' to rerun this test only.
    feedbackVV 1 33 0 991 5.0e-2:                              FAIL (4.20s)
      test/common/TestMnistRNN.hs:226:
      expected: [-0.9980267284282716,-0.9660899403337656,-0.8930568599923028,-0.7791304201898077,-0.6245654477568863,-0.4314435277698684,-0.2058673183484546,4.0423225394292085e-2,0.29029630688547203,0.5241984159992963,0.7250013011527577,0.8820730400055012,0.9922277361823716,1.057620382863504,1.08252746840241,1.070784986731554,1.0245016946328942,0.9438848015250431,0.827868146535437,0.6753691437632174,0.48708347071773117,0.26756701680655437,2.6913747557207532e-2,-0.21912614372802072,-0.45154893423928943,-0.6525638736434227,-0.8098403108946983,-0.9180866488182939,-0.9775459850131992,-0.9910399864230198]
       but got: [-0.9980267284282716,-0.9660899403337655,-0.8930568599923028,-0.779130420189808,-0.624565447756887,-0.4314435277698694,-0.20586731834845626,4.042322539428911e-2,0.29029630688546687,0.5241984159992878,0.7250013011527451,0.8820730400054831,0.9922277361823462,1.0576203828634687,1.0825274684023618,1.0707849867314898,1.0245016946328098,0.9438848015249338,0.8278681465352983,0.6753691437630459,0.4870834707175275,0.26756701680632367,2.6913747556961208e-2,-0.2191261437282658,-0.45154893423951464,-0.6525638736436126,-0.809840310894844,-0.9180866488183926,-0.9775459850132528,-0.9910399864230315]
      Use -p '/feedbackVV 1 33 0 991 5.0e-2/' to rerun this test only.
    trainAR 3 0 0 3 5.0e-2:                                    FAIL (0.07s)
      test/common/TestMnistRNN.hs:154:
      expected: 6.327978161031336e-23
       but got: 6.327271647343858e-23
      Use -p '/trainAR 3 0 0 3 5.0e-2/' to rerun this test only.
    feedbackAR 3 0 0 3 5.0e-2:                                 FAIL (0.02s)
      test/common/TestMnistRNN.hs:226:
      expected: [-0.9980267284282716,-0.9510565162972417,-0.8443279255081759,-0.6845471059406962,-0.48175367412103653,-0.24868988719256901,-3.673766846290505e-11,0.24868988711894977,0.4817536740469978,0.6845471058659982,0.8443279254326351,0.9510565162207472,0.9980267283507953,0.9822872506502898,0.9048270523889208,0.7705132427021685,0.5877852522243431,0.3681245526237731,0.12533323351198067,-0.1253332336071494,-0.36812455271766376,-0.5877852523157643,-0.7705132427900961,-0.9048270524725681,-0.9822872507291605,-0.9980267284247174,-0.9510565162898851,-0.844327925497479,-0.6845471059273313,-0.48175367410584324]
       but got: [-0.9980267284282716,-0.9510565162972417,-0.8443279255081759,-0.6845471059406962,-0.48175367412103653,-0.2486898871925689,-3.6737446418300124e-11,0.2486898871189501,0.4817536740469982,0.6845471058659988,0.8443279254326359,0.9510565162207482,0.9980267283507965,0.9822872506502912,0.9048270523889225,0.7705132427021704,0.5877852522243452,0.3681245526237752,0.12533323351198278,-0.1253332336071474,-0.368124552717662,-0.5877852523157627,-0.770513242790095,-0.9048270524725675,-0.9822872507291602,-0.9980267284247176,-0.9510565162898857,-0.8443279254974801,-0.6845471059273328,-0.48175367410584524]
      Use -p '/feedbackAR 3 0 0 3 5.0e-2/' to rerun this test only.
[...]
OK (25.40s)
    randomLL2 140 0 3 5 54282 5.0e-2:                          FAIL (2.07s)
      test/common/TestMnistRNN.hs:173:
      wrong result: 30.061871495725264 is expected to be a member of [30.061871495723956,30.06187089990927]
      Use -p '/randomLL2 140 0 3 5 54282 5.0e-2/' to rerun this test only.
[...]

@Mikolaj Mikolaj added help wanted Extra attention is needed good first issue Good for newcomers labels May 19, 2022
@Mikolaj
Copy link
Owner

Mikolaj commented May 19, 2022

BTW, in some cases it's fine to simplify the tests. E.g., if a test returns a pair of a float and a list of floats often it's enough to check the first float to pin the test result, because the list is printed out of curiosity, not because it's independent enough component of the result to be worth verifying. Or the test can be tweaked so that the list is summed and often the probability of us getting exactly the same sum from a wrong list is not worth considering.

@sfindeisen
Copy link
Contributor Author

In the above PR (#57) I implemented ε configuration using Tasty's custom options mechanism. I refactored a single test (in ShortTestForCI) to illustrate this. @Mikolaj is this the way to go? Looks bloated to me but I can't see a better way. What is also striking is that Tasty does not seem to support this (obvious) usecase out of the box.

@Mikolaj
Copy link
Owner

Mikolaj commented Jun 15, 2022

In the above PR (#57) I implemented ε configuration using Tasty's custom options mechanism. I refactored a single test (in ShortTestForCI) to illustrate this. @Mikolaj is this the way to go? Looks bloated to me but I can't see a better way.

Thank you, I will have a look.

What is also striking is that Tasty does not seem to support this (obvious) usecase out of the box.

So you confirm this. Indeed, I couldn't believe that's true. Is there a package built on top of Tasty or Test.Tasty.HUnit that implements that? Any ticket in their bugtracker? Do alternative testing packages supports that? I know of spec and https://github.com/NorfairKing/sydtest as possible contenders to Tasty, but I haven't checked in a while.

@sfindeisen
Copy link
Contributor Author

UnkindPartition/tasty#337

@sfindeisen
Copy link
Contributor Author

@Mikolaj how do you like their answer? The main maintainer (Roman) seems to be in the Ukrainian military now.

@Mikolaj
Copy link
Owner

Mikolaj commented Jun 17, 2022

Oh dear.

Regarding Eq instance, if that works, why not? Does it really give us @?= and @=? for free?

@sfindeisen
Copy link
Contributor Author

Yes we can do it like this: 56663a4 . Is it cool?

@Mikolaj
Copy link
Owner

Mikolaj commented Jun 17, 2022

I think that's acceptable. We are hackers after all, transitivity of Eq is something we use, not honour. :)

The wrapping of stuff in Precision5, though, is noisy. I suppose we'd define our own @?=!#*&@ that wraps, but then it's harder to understand at first glance by somebody that knows HUnit. But if we define our own operator, I suppose we can define it without Precision5?

@sfindeisen
Copy link
Contributor Author

our own operator? where? inside Tasty?

@andreasabel
Copy link

andreasabel commented Jun 17, 2022

I'd think you can easily define your function/operator of choice that calls assertBool (or even just assertFailure) in a suitable way. See the source code for tasty-hunit: https://hackage.haskell.org/package/tasty-hunit-0.10.0.3/docs/src/Test.Tasty.HUnit.Orig.html#assertEqual
The only convenience of assertEqual over assertBool seems to be that it has a predefined message "expected ... but got ...".

P.S.: Equality on floats: We all suffer from many years of indoctrination in high school where they taught us there was a concept of equality on reals, when there is really none (from the perspective of computing). But doing away with equality on reals would be a nightmare for mathematicians, so they rather push it onto us... [sorry for the rant]

@sfindeisen
Copy link
Contributor Author

sfindeisen commented Jun 18, 2022

@Mikolaj
Copy link
Owner

Mikolaj commented Jun 19, 2022

Given that this lives in IO, could it get the epsilon from, say, commandline?

@sfindeisen
Copy link
Contributor Author

Do you mean a commandline argument, like here: https://github.com/Mikolaj/horde-ad/pull/57/files#diff-78fe67374a9d4d8a20b6f836d420d2299f903f2167d98408e9c86cfb46ac8aaeR40 ? Yes we can do it like that, the only problem is that we have to propagate our ε from the program start all the way down to assertClose, which makes the code noisy. Is there a better way?

Or did you mean asking the user during the run? IMHO that would be even more ugly and wouldn't work well in batch mode.

@Mikolaj
Copy link
Owner

Mikolaj commented Jun 20, 2022

Yes, definitely the first option. I haven't given this much thought, but how about an IORef for the epsilon that gets updated with the commandline parameter and is read by assertClose?

@sfindeisen
Copy link
Contributor Author

Yes I was thinking about IORef but is this pure (enough) and cool? Is there a better way?

@Mikolaj
Copy link
Owner

Mikolaj commented Jun 20, 2022

This, regrettably, uses the IO monad, so it's not ideal. However, once we have this working, we can think of a better way (or find other tasks more worthwhile). Perhaps tasty folk can offer feedback, especially if you report the results of your experiment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers help wanted Extra attention is needed
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants