[#103] drop utf8-string dependency #104

sphaso · 2018-10-13T22:54:50Z

Resolves #103

Let me know if that's what you had in mind and if this change warrants a line in the CHANGELOG. If so, I'll be happy to add it.

Checklist:

HLint

I've changed the exposed interface (add new reexports, remove reexports, rename reexported things, etc.).
- I've updated hlint.dhall accordingly to my changes (add new rules for the new imports, remove old ones, when they are outdated, etc.).
- I've generated the new .hlint.yaml file (see this instructions).

General

I've updated the CHANGELOG with the short description of my latest changes.
All new and existing tests pass.
I keep the code style used in the files I've changed (see style-guide for more details).
I've used the stylish-haskell file.
My change requires the documentation updates.
- I've updated the documentation accordingly.
I've added the [ci skip] text to the docs-only related commit's name.

chshersh

I'm generally okay with this change. But I would like to see, whether performance and behavior is different from utf8-string package. I think it's still a good idea to provide fast conversion between different data types even if it adds one extra dependency. But if the performance is almost the same and if the results of two functions are also the same, then it's okay to remove this dependency 👍

sphaso · 2018-10-14T13:14:36Z

Thanks @chshersh , it does seem like a sensible approach. On this note I created a small repo to test both performance and correctness here.
I'm not a performance expert, so take my words with a grain of salt. I've noticed that the new functions are significantly faster (~ 2x) than the old ones. By simply running criterion however there's some false results for the ByteString.Lazy part (running individual benchmarks give different results from running the whole batch, I believe it might be related to this Criterion bug).
My correctness test is very naive, but I wouldn't know how to test otherwise if not by exhausting the whole unicode range. On this note, it's not unlikely that the two libraries have different idiosyncrasies regarding more exotic unicode ranges (e.g. see this).

chshersh · 2018-10-14T13:20:21Z

@sphaso Thanks for great work! I will review benchmarks and tests more carefully in nearest couple days. If this is indeed even faster, than it looks awesome! 👍

chshersh · 2018-10-16T07:09:33Z

@sphaso Unit test looks ok. The more robust test will be something with the hedgehog library where you generate random String and then encode it with both old and new algorithms and compare results. Same with the decode function.

Regarding your benchmarks: they contain single error, because of which the results you've observed are not reliable. Specifically, you used whnf but you should use nf. whnf doesn't evaluate the whole String or the whole LByteString, it only pattern matches on the first constructor ((x : xs) in case of String so the result is not evaluated fully and we don't see the number).

After changing this for every benchmarks and running for GHC 8.4.3 and GHC 8.6.1 I've observed the following interesting behavior (numbers are slightly different for different GHC version, but relation is the same):

ByteString: new encode is about ~10x faster than the old one 💯
ByteString: new decode is ~4x faster than the old one 👍
LByteString: new encode is ~4x slower than the old one 😞
LByteString: new decode is ~1.5x faster than the old one 🆗

So, we can't say yet that the new implementation is strictly better...

But I think that the problem with String -> LByteString should be solved separately. Meanwhile you can try to implement property-based tests so we will be 146% sure that it works correctly and then I can merge the change.

sphaso · 2018-10-16T11:33:14Z

@chshersh I wasn't sure nf was actually needed. Good to know!
I'll implement the property tests with hedgehog in the side project within the week.

sphaso · 2018-10-17T19:07:39Z

@chshersh Hi! I implemented a very basic test with hedgehog here. Using Gen.unicode the test passes, using Gen.unicodeAll some discrepancies show. e.g.

  ✓ strict passed 100 tests.
  ✗ lazy failed after 77 tests and 13 shrinks.
  
       ┏━━ test/test.hs ━━━
    18 ┃ stringToLazyByteString :: Property
    19 ┃ stringToLazyByteString = property $ do
    20 ┃     xs <- forAll $ Gen.string (Range.linear 0 10000) Gen.unicodeAll
       ┃     │ "\65534"
    21 ┃     (oldDecodeUtf8 . (oldEncodeUtf8 :: String -> LB.ByteString)) xs === (newDecodeUtf8 . (newEncodeUtf8 :: String -> LB.ByteString)) xs
       ┃     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       ┃     │ Failed (- lhs =/= + rhs)
       ┃     │ - "\65533"
       ┃     │ + "\65534"
    
    This failure can be reproduced by running:
    > recheck (Size 76) (Seed 11525628145680357244 13351701613768266159) lazy

Gen.unicodeAll in fact comprises the whole unicode range, including '\55296'..'\57343'. The case that breaks the tests is the same reported in the issue I linked before (this one). I'm not sure how much consistency is expected on that range.

Let me know if I can do anything else to help with this PR!

chshersh

@sphaso I think that the tests results are ok because text like "\55296" is not a valid Text at the end, because it's only a half of a surrogate pair.

I think that you only need to update the CHANGELOG and this can be merged 👍

src/Relude/String/Conversion.hs

chshersh · 2018-10-18T07:22:24Z

@sphaso I don't understand one thing: the error in your test case is on the "\65533" but the range from issue you linked doesn't include this symbol: '\55296'..'\57343'

sphaso · 2018-10-18T11:34:39Z

@chshersh oops! you're right, interestingly I found this issue in hedgehog. It's funny how this character keeps popping up.
Regarding the changelog (and the cabal file I presume?), would you agree that it's a minor change?

chshersh · 2018-10-18T15:00:57Z

@sphaso Regarding unicode: this is indeed strange. The fix is merged to the hedgehog library and is released, so you shouldn't see such unicode symbol...

Regarding changelog: this can go to the 0.4.0 version because it's not released to hackage yet with all other changes.

sphaso · 2018-10-18T15:53:08Z

@chshersh I'll add the change to the changelog as soon as I get home from work. I overlooked one major detail regarding our mystery character: the PR in hedgehog filters out the char in Gen.unicode, but the tests break in Gen.unicodeAll :P

chshersh

Maintainer of utf8-string package confirmed that it's better to use text package. So well done 👍 Sometimes couple lines of code lead to rabbit hole and unexpected discoveries. The discussion around this issue is more valuable then the contribution itself, so don't be upset if it took so much time 👌

chshersh requested review from chshersh and vrom911 October 14, 2018 02:47

chshersh assigned sphaso Oct 14, 2018

chshersh added refactoring remove Hacktoberfest https://hacktoberfest.digitalocean.com/ labels Oct 14, 2018

chshersh reviewed Oct 14, 2018

View reviewed changes

chshersh reviewed Oct 18, 2018

View reviewed changes

src/Relude/String/Conversion.hs Outdated Show resolved Hide resolved

chshersh mentioned this pull request Oct 18, 2018

Consider adding property-based roundtrip tests for decode/encode glguy/utf8-string#27

Closed

[#103] drop utf8-string dependency

5468713

chshersh approved these changes Oct 19, 2018

View reviewed changes

chshersh merged commit 62d4497 into kowainik:master Oct 19, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[#103] drop utf8-string dependency #104

[#103] drop utf8-string dependency #104

sphaso commented Oct 13, 2018 •

edited

Loading

chshersh left a comment

sphaso commented Oct 14, 2018

chshersh commented Oct 14, 2018

chshersh commented Oct 16, 2018

sphaso commented Oct 16, 2018

sphaso commented Oct 17, 2018

chshersh left a comment

chshersh commented Oct 18, 2018

sphaso commented Oct 18, 2018

chshersh commented Oct 18, 2018

sphaso commented Oct 18, 2018

chshersh left a comment

[#103] drop utf8-string dependency #104

[#103] drop utf8-string dependency #104

Conversation

sphaso commented Oct 13, 2018 • edited Loading

Checklist:

HLint

General

chshersh left a comment

Choose a reason for hiding this comment

sphaso commented Oct 14, 2018

chshersh commented Oct 14, 2018

chshersh commented Oct 16, 2018

sphaso commented Oct 16, 2018

sphaso commented Oct 17, 2018

chshersh left a comment

Choose a reason for hiding this comment

chshersh commented Oct 18, 2018

sphaso commented Oct 18, 2018

chshersh commented Oct 18, 2018

sphaso commented Oct 18, 2018

chshersh left a comment

Choose a reason for hiding this comment

sphaso commented Oct 13, 2018 •

edited

Loading