Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#103] drop utf8-string dependency #104

Merged
merged 1 commit into from
Oct 19, 2018
Merged

[#103] drop utf8-string dependency #104

merged 1 commit into from
Oct 19, 2018

Conversation

sphaso
Copy link
Contributor

@sphaso sphaso commented Oct 13, 2018

Resolves #103

Let me know if that's what you had in mind and if this change warrants a line in the CHANGELOG. If so, I'll be happy to add it.

Checklist:

HLint

  • I've changed the exposed interface (add new reexports, remove reexports, rename reexported things, etc.).
    • I've updated hlint.dhall accordingly to my changes (add new rules for the new imports, remove old ones, when they are outdated, etc.).
    • I've generated the new .hlint.yaml file (see this instructions).

General

  • I've updated the CHANGELOG with the short description of my latest changes.
  • All new and existing tests pass.
  • I keep the code style used in the files I've changed (see style-guide for more details).
  • I've used the stylish-haskell file.
  • My change requires the documentation updates.
    • I've updated the documentation accordingly.
  • I've added the [ci skip] text to the docs-only related commit's name.

@chshersh chshersh added refactoring remove Hacktoberfest https://hacktoberfest.digitalocean.com/ labels Oct 14, 2018
Copy link
Contributor

@chshersh chshersh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally okay with this change. But I would like to see, whether performance and behavior is different from utf8-string package. I think it's still a good idea to provide fast conversion between different data types even if it adds one extra dependency. But if the performance is almost the same and if the results of two functions are also the same, then it's okay to remove this dependency 👍

@sphaso
Copy link
Contributor Author

sphaso commented Oct 14, 2018

Thanks @chshersh , it does seem like a sensible approach. On this note I created a small repo to test both performance and correctness here.
I'm not a performance expert, so take my words with a grain of salt. I've noticed that the new functions are significantly faster (~ 2x) than the old ones. By simply running criterion however there's some false results for the ByteString.Lazy part (running individual benchmarks give different results from running the whole batch, I believe it might be related to this Criterion bug).
My correctness test is very naive, but I wouldn't know how to test otherwise if not by exhausting the whole unicode range. On this note, it's not unlikely that the two libraries have different idiosyncrasies regarding more exotic unicode ranges (e.g. see this).

@chshersh
Copy link
Contributor

@sphaso Thanks for great work! I will review benchmarks and tests more carefully in nearest couple days. If this is indeed even faster, than it looks awesome! 👍

@chshersh
Copy link
Contributor

@sphaso Unit test looks ok. The more robust test will be something with the hedgehog library where you generate random String and then encode it with both old and new algorithms and compare results. Same with the decode function.

Regarding your benchmarks: they contain single error, because of which the results you've observed are not reliable. Specifically, you used whnf but you should use nf. whnf doesn't evaluate the whole String or the whole LByteString, it only pattern matches on the first constructor ((x : xs) in case of String so the result is not evaluated fully and we don't see the number).

After changing this for every benchmarks and running for GHC 8.4.3 and GHC 8.6.1 I've observed the following interesting behavior (numbers are slightly different for different GHC version, but relation is the same):

  • ByteString: new encode is about ~10x faster than the old one 💯
  • ByteString: new decode is ~4x faster than the old one 👍
  • LByteString: new encode is ~4x slower than the old one 😞
  • LByteString: new decode is ~1.5x faster than the old one 🆗

So, we can't say yet that the new implementation is strictly better...

But I think that the problem with String -> LByteString should be solved separately. Meanwhile you can try to implement property-based tests so we will be 146% sure that it works correctly and then I can merge the change.

@sphaso
Copy link
Contributor Author

sphaso commented Oct 16, 2018

@chshersh I wasn't sure nf was actually needed. Good to know!
I'll implement the property tests with hedgehog in the side project within the week.

@sphaso
Copy link
Contributor Author

sphaso commented Oct 17, 2018

@chshersh Hi! I implemented a very basic test with hedgehog here. Using Gen.unicode the test passes, using Gen.unicodeAll some discrepancies show. e.g.

  ✓ strict passed 100 tests.
  ✗ lazy failed after 77 tests and 13 shrinks.
  
       ┏━━ test/test.hs ━━━
    18 ┃ stringToLazyByteString :: Property
    19 ┃ stringToLazyByteString = property $ do
    20 ┃     xs <- forAll $ Gen.string (Range.linear 0 10000) Gen.unicodeAll
       ┃     │ "\65534"
    21 ┃     (oldDecodeUtf8 . (oldEncodeUtf8 :: String -> LB.ByteString)) xs === (newDecodeUtf8 . (newEncodeUtf8 :: String -> LB.ByteString)) xs
       ┃     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       ┃     │ Failed (- lhs =/= + rhs)
       ┃     │ - "\65533"
       ┃     │ + "\65534"
    
    This failure can be reproduced by running:
    > recheck (Size 76) (Seed 11525628145680357244 13351701613768266159) lazy

Gen.unicodeAll in fact comprises the whole unicode range, including '\55296'..'\57343'. The case that breaks the tests is the same reported in the issue I linked before (this one). I'm not sure how much consistency is expected on that range.

Let me know if I can do anything else to help with this PR!

Copy link
Contributor

@chshersh chshersh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphaso I think that the tests results are ok because text like "\55296" is not a valid Text at the end, because it's only a half of a surrogate pair.

I think that you only need to update the CHANGELOG and this can be merged 👍

src/Relude/String/Conversion.hs Outdated Show resolved Hide resolved
@chshersh
Copy link
Contributor

@sphaso I don't understand one thing: the error in your test case is on the "\65533" but the range from issue you linked doesn't include this symbol: '\55296'..'\57343'

@sphaso
Copy link
Contributor Author

sphaso commented Oct 18, 2018

@chshersh oops! you're right, interestingly I found this issue in hedgehog. It's funny how this character keeps popping up.
Regarding the changelog (and the cabal file I presume?), would you agree that it's a minor change?

@chshersh
Copy link
Contributor

@sphaso Regarding unicode: this is indeed strange. The fix is merged to the hedgehog library and is released, so you shouldn't see such unicode symbol...

Regarding changelog: this can go to the 0.4.0 version because it's not released to hackage yet with all other changes.

@sphaso
Copy link
Contributor Author

sphaso commented Oct 18, 2018

@chshersh I'll add the change to the changelog as soon as I get home from work. I overlooked one major detail regarding our mystery character: the PR in hedgehog filters out the char in Gen.unicode, but the tests break in Gen.unicodeAll :P

Copy link
Contributor

@chshersh chshersh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maintainer of utf8-string package confirmed that it's better to use text package. So well done 👍 Sometimes couple lines of code lead to rabbit hole and unexpected discoveries. The discussion around this issue is more valuable then the contribution itself, so don't be upset if it took so much time 👌

@chshersh chshersh merged commit 62d4497 into kowainik:master Oct 19, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Hacktoberfest https://hacktoberfest.digitalocean.com/ refactoring remove
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants