
Performance Optimizations #467

Merged
merged 17 commits into from
Mar 20, 2023

Conversation

lihaoyi
Member

@lihaoyi lihaoyi commented Mar 13, 2023

This PR attempts to optimize the uPickle read/write logic while preserving binary compatibility.

Code Changes

  • Generally avoid calling BaseElemRenderer.getElemSafe and ElemBuilder.appendUnsafe in hot paths. Instead, in both cases we try to call requestUntilGetSafeIndex/ensureLength and then read/write as many characters as we can directly on the byte/char array.
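
The bulk-read idea above can be sketched roughly as follows. This is an illustrative stand-in (`Buffer`, `ensureLength`, and `scanDigits` are hypothetical names, not uPickle's actual API), showing why ensuring capacity once and then scanning the raw array beats a bounds-checked per-element call:

```scala
// Hypothetical sketch of the bulk-read pattern: fill the buffer once,
// then index the raw array directly in a tight loop.
object BulkReadSketch {
  // stand-in for the parser's internal buffer state (illustrative names)
  final class Buffer(val arr: Array[Char], var limit: Int) {
    def ensureLength(n: Int): Unit = () // a real impl would refill from input here
  }

  // count consecutive ASCII digits starting at `start`
  def scanDigits(buf: Buffer, start: Int): Int = {
    buf.ensureLength(start + 1)
    val arr = buf.arr
    var i = start
    // tight loop over the raw array: no per-element safety check
    while (i < buf.limit && arr(i) >= '0' && arr(i) <= '9') i += 1
    i - start
  }
}
```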

  • Made ElemBuilder.arr and ElemBuilder.length public so that they can be used directly outside of upickle.core

  • Adjust requestUntil0 to properly handle cases where readDataIntoBuffer returns n = -1
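
A minimal sketch of the EOF-handling fix (the `fill`/`read` names here are illustrative, not the real requestUntil0 signature): a -1 return from the underlying read must be treated as end-of-input, never added to the fill count:

```scala
// Illustrative refill loop showing the n = -1 case handled correctly.
object RefillSketch {
  // `read` returns the number of bytes read, or -1 at end of input
  def fill(read: Array[Byte] => Int, buf: Array[Byte], needed: Int): Int = {
    var filled = 0
    var eof = false
    while (filled < needed && !eof) {
      val n = read(buf)
      if (n == -1) eof = true // treating -1 as a byte count would corrupt `filled`
      else filled += n
    }
    filled
  }
}
```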

  • Write integers directly to the output without first rendering them to a temporary String via .toString
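
The direct integer-rendering idea can be sketched like this (a hypothetical helper, not uPickle's actual code): digits are emitted straight into the char array, avoiding the temporary String that .toString would allocate:

```scala
object IntRenderSketch {
  // Writes `value` in decimal starting at `pos`; returns the index just
  // past the last character written.
  def writeInt(out: Array[Char], pos: Int, value: Int): Int = {
    var i = pos
    var v = value
    if (v < 0) { out(i) = '-'; i += 1 }
    else v = -v // work in negative space so Int.MinValue does not overflow
    val start = i
    var u = v
    // emit digits least-significant first
    while ({ out(i) = ('0' - (u % 10)).toChar; u /= 10; i += 1; u != 0 }) ()
    // reverse the digit run in place to most-significant-first order
    var a = start; var b = i - 1
    while (a < b) { val t = out(a); out(a) = out(b); out(b) = t; a += 1; b -= 1 }
    i
  }
}
```

Working in negative space (negating positives instead of negatives) is a common trick so that Int.MinValue, whose magnitude exceeds Int.MaxValue, round-trips correctly.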

  • Read (most) integers directly from the input without going through a temporary String

  • In ArrayWriter and SeqLikeWriter, lift ctx.subVisitor out of the loop since it never changes
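
A stripped-down illustration of the hoist (stand-in types, not the real ObjArrVisitor API): since the sub-visitor is loop-invariant, it is fetched once before the loop instead of once per element:

```scala
object HoistSketch {
  // minimal stand-in for the visitor context (illustrative, not uPickle's API)
  final class Ctx {
    var subVisitorCalls = 0
    def subVisitor: String => Int = { subVisitorCalls += 1; (s: String) => s.length }
  }
  def writeAll(ctx: Ctx, xs: Seq[String]): Seq[Int] = {
    val sub = ctx.subVisitor // hoisted: one lookup instead of one per element
    xs.map(sub)
  }
}
```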

  • Separated out the "fast-path" simple-ASCII-string loop from the "slow-path" escape/unicode loop during rendering and parsing
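
The fast-path/slow-path split can be sketched as below (illustrative only, writing into a StringBuilder rather than uPickle's output buffer): plain printable ASCII is appended in a tight loop, and the escape logic runs only when a special character is hit:

```scala
object EscapeSketch {
  def render(s: String): String = {
    val sb = new java.lang.StringBuilder(s.length + 2)
    sb.append('"')
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c >= ' ' && c <= '~' && c != '"' && c != '\\') {
        sb.append(c) // fast path: printable ASCII, no escaping needed
      } else {
        // slow path: emit an escape sequence for this one character
        c match {
          case '"'  => sb.append("\\\"")
          case '\\' => sb.append("\\\\")
          case '\n' => sb.append("\\n")
          case other =>
            sb.append('\\').append('u')
            sb.append(f"${other.toInt}%04x")
        }
      }
      i += 1
    }
    sb.append('"').toString
  }
}
```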

  • Remove redundant flushBuffer and flushElemBuilder calls inside visitFloat32/visitFloat64

  • Fixed a bug in WrapElemArrayCharSeq.charAt which was not correct when Elem=Byte

  • Added new visitFloat64{Char,Byte}Parts visitor methods which are optimized versions of visitFloat64StringParts.

  • Fix a bug in StringWriter where it was not accepting visitFloat64StringParts calls, even though it accepts all the other visitFloat* and visitInt* calls.

  • Directly write unicode characters to UTF-8 bytes without going through an intermediate makeString: String and .getBytes: Array[Byte]
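
A rough sketch of encoding a single BMP character straight to UTF-8 bytes (surrogate pairs omitted for brevity; this is a hypothetical helper, not uPickle's actual encoder), skipping the intermediate String/getBytes round-trip:

```scala
object Utf8Sketch {
  // returns the index just past the last written byte
  def writeChar(out: Array[Byte], pos: Int, c: Char): Int = {
    if (c < 0x80) { out(pos) = c.toByte; pos + 1 } // 1-byte ASCII
    else if (c < 0x800) {                          // 2-byte sequence
      out(pos) = (0xC0 | (c >> 6)).toByte
      out(pos + 1) = (0x80 | (c & 0x3F)).toByte
      pos + 2
    } else {                                       // 3-byte sequence (BMP)
      out(pos) = (0xE0 | (c >> 12)).toByte
      out(pos + 1) = (0x80 | ((c >> 6) & 0x3F)).toByte
      out(pos + 2) = (0x80 | (c & 0x3F)).toByte
      pos + 3
    }
  }
}
```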

  • Vendored/patched the fast Schubfach Double/Float serialization code from Jackson-Core. This is similar to the version used in OpenJDK, but OpenJDK's is GPL-licensed (rather than permissively licensed) and thus cannot be used. Furthermore, even if a future version of Java includes the fast double serializer, vendoring/patching it allows us to write the output directly to the output char/byte array, saving an intermediate string and byte array. For now this is Scala-JVM only, which lets us re-use the Java sources with minimal changes while still providing the performance gains on the platform where they matter the most

Testing Changes

  • Added some micro-benchmarks to the bench/ folder, to better tease out differences that only affect specific code paths

  • Moved slow unit tests into dedicated .testSlow module, to get them out of the fast path

  • Added some fuzz tests reading/writing a variety of Strings/Numbers of different lengths, to catch any bugs that the trivial unit tests miss

Benchmarks

git checkout optimize && ./mill bench.jvm.run && git checkout main && git checkout optimize bench && ./mill bench.jvm.run on my M1 MacBook Pro, Scala 2.13.10, Java 11

  • "macro" benchmarks upickleDefault Read/Write show 16-33% speedups when handling JSON

  • "micro" benchmarks show large speedups in integers Write, and the string handling operations in general

  • The micro-benchmarks also show a minor across-the-board speedup that seems evident despite the random run-to-run variation.

Main -> PR (higher is better)

As of 66d9af8

Name JsonString JsonByteArray MsgPack
upickleDefault Read 1432 -> 1625 (+13%) 1120 -> 923 (-17%) 2438 -> 2324 (-4%)
upickleDefault Write 1602 -> 2060 (+28%) 1553 -> 1889 (+21%) 2755 -> 2613 (-5%)
integers Read 196864 -> 285963 (+45%) 229698 -> 246587 (+7%) 592572 -> 580823 (-1%)
integers Write 296029 -> 396122 (+33%) 273371 -> 385921 (+41%) 653991 -> 666033 (+1%)
doubles Read 86299 -> 97270 (+12%) 92724 -> 52961 (-42%) 673140 -> 650428 (-3%)
doubles Write 89059 -> 142984 (+60%) 88151 -> 140816 (+59%) 591629 -> 655322 (+10%)
shortStrings Read 232302 -> 301723 (+29%) 212627 -> 226610 (+6%) 295688 -> 375804 (+27%)
shortStrings Write 306868 -> 337049 (+9%) 320702 -> 368960 (+15%) 574461 -> 562958 (-2%)
longStrings Read 9598 -> 17142 (+78%) 11041 -> 17451 (+58%) 200153 -> 220143 (+9%)
longStrings Write 11010 -> 12785 (+16%) 9490 -> 15009 (+58%) 101027 -> 100749 (0%)
unicodeStrings Read 71976 -> 135159 (+87%) 26527 -> 44733 (+68%) 62247 -> 95197 (+52%)
unicodeStrings Write 75241 -> 70620 (-6%) 56140 -> 50046 (-10%) 119705 -> 119706 (0%)
caseClasses Read 4376 -> 4122 (-5%) 3852 -> 3949 (+2%) 5025 -> 4417 (-12%)
caseClasses Write 7450 -> 7457 (0%) 7966 -> 7869 (-1%) 6856 -> 6476 (-5%)
sequences Read 1135 -> 1263 (+11%) 1228 -> 1240 (0%) 1615 -> 1632 (+1%)
sequences Write 1420 -> 1559 (+9%) 1191 -> 1298 (+8%) 1466 -> 1561 (+6%)

Notes

  • A lot of function signatures changed or were moved around. I left @deprecated forwarder methods where necessary to keep binary compatibility

  • For now, some possible performance gains are still left on the table: custom double parsing, vectorized parsing, etc.

  • For now, most changes are pretty localized. There are probably some optimization possibilities that would involve larger changes to the core interfaces

  • The performance effects of changes can be pretty surprising. A number of things that I thought would speed things up actually slowed things down (e.g. making visitFloat64StringParts/visitString take Array[Byte/Char] rather than CharSequence, or making escapeByte's String -> UTF-8 bytes conversion direct rather than going through an intermediate CharBuilder)

Per-commit rough benchmarks

5-second-long rough benchmarks on each significant commit, to see the numbers changing over time

Bench R/W 8e1 7bf 5c3 cb5 0b3 632 600 848 49b
upickleDefault Read 678 745 700 688 697 711 841 807 789
upickleDefault Write 910 975 1012 1028 1013 1016 963 1030 922
upickleDefaultByteArray Read 629 661 613 607 650 617 700 716 684
upickleDefaultByteArray Write 852 1018 1036 1018 1069 943 937 948 946
upickleDefaultBinary Read 1312 1293 1317 1308 1290 1302 1291 1321 1305
upickleDefaultBinary Write 1397 1325 1298 1282 1334 1343 1330 1302 1413
integers Read 89334 92446 90720 105653 101459 100886 100721 99277 142552
integers Write 198271 198352 174247 207267 209200 202796 179766 203459 204256
integersByteArray Read 105246 102440 112112 115220 111878 113574 116542 112830 155813
integersByteArray Write 188687 188913 197311 192353 181826 197057 165848 196274 187304
integersBinary Read 282797 142942 276694 279362 282047 286733 278783 278341 297563
integersBinary Write 325789 327924 338787 317988 306883 337490 263196 333712 312733
doubles Read 39963 46040 41880 45239 42929 42325 42941 42855 47863
doubles Write 71034 68354 59554 70768 70380 71103 66641 71240 70470
doublesByteArray Read 46845 47178 43895 46609 46309 44951 44367 44563 40914
doublesByteArray Write 69656 69132 70225 70212 68752 70929 66443 70732 68820
doublesBinary Read 321771 501575 323180 315392 320393 318485 318434 325906 329329
doublesBinary Write 295796 295993 321326 321252 283749 329617 259816 323299 300292
shortStrings Read 121639 130327 128765 113842 121797 120442 158563 129727 141102
shortStrings Write 158975 178526 182496 183346 178737 169459 164988 170451 164864
shortStringsByteArray Read 114474 116461 115078 123167 121479 106868 124290 126967 121628
shortStringsByteArray Write 168558 205763 212324 198424 210045 184594 188127 187928 185694
shortStringsBinary Read 130088 134634 126799 151912 152714 151418 192270 203078 201106
shortStringsBinary Write 281902 285634 286099 278813 280846 281251 284006 278087 281807
longStrings Read 4757 4870 4829 4795 4877 4859 8617 7530 8698
longStrings Write 5601 9533 9566 9635 9542 6397 6268 6404 6243
longStringsByteArray Read 5512 5522 5454 5403 5507 5192 7583 7574 8939
longStringsByteArray Write 4728 11756 11979 11398 11962 7487 7482 7460 7411
longStringsBinary Read 103204 109631 104458 102516 103513 104587 111451 109700 109303
longStringsBinary Write 50642 50240 50739 50601 50720 50158 50588 50120 49964
unicodeStrings Read 37295 36067 38745 38058 38096 37983 61195 52783 53140
unicodeStrings Write 38483 76162 76756 86370 76720 33131 33731 31363 31383
unicodeStringsByteArray Read 15022 13299 14876 14836 13414 14910 20655 17170 23055
unicodeStringsByteArray Write 22951 43731 44270 43590 44064 24946 24584 23989 24514
unicodeStringsBinary Read 44506 31734 43731 45995 31111 43678 47326 32642 47521
unicodeStringsBinary Write 59195 60047 60002 59369 59611 59796 60004 59083 59675
caseClasses Read 2507 2375 2253 2569 2549 2304 2171 2641 2245
caseClasses Write 3403 3046 3296 3169 3486 3707 3591 3720 3649
caseClassesByteArray Read 2031 2323 1917 2822 2530 2409 2596 2699 2641
caseClassesByteArray Write 3545 3342 3828 3218 3905 3916 3196 4023 3852
caseClassesBinary Read 2829 2452 2451 3038 2486 2336 2458 2574 2664
caseClassesBinary Write 3803 3870 4005 3839 3446 3439 3743 3622 3517
sequences Read 545 658 576 565 573 576 669 673 568
sequences Write 733 717 585 776 745 796 583 794 733
sequencesByteArray Read 709 714 625 697 717 701 704 699 695
sequencesByteArray Write 621 618 659 645 573 659 524 655 609
sequencesBinary Read 820 836 615 810 818 790 811 827 777
sequencesBinary Write 740 737 769 764 699 799 614 776 725

@lihaoyi lihaoyi changed the title Some micro-optimizations Performance Optimizations Mar 17, 2023
@lihaoyi lihaoyi marked this pull request as ready for review March 17, 2023 03:11
@lihaoyi lihaoyi requested review from lolgab and lefou March 17, 2023 03:37
@lihaoyi
Member Author

lihaoyi commented Mar 17, 2023

Review by @lefou @lolgab @plokhotnyuk if anyone wants to have a look through before I merge it.

It's a bit of a messy PR, but a lot of it is pretty boilerplatey code, so hopefully not too dense. Scala 3's inline defs would help alleviate the boilerplate, some day in the far future when we drop Scala 2 support.

The test suite is green, the MIMA bin-compat checks pass, and the benchmarks seem to indicate good improvements.

@plokhotnyuk
Contributor

plokhotnyuk commented Mar 17, 2023

A good improvement would be switching to JMH for benchmarking. Following the best practices from the JMH samples prevents mistakes in measurements, and JMH gives you a good tool for measuring secondary metrics like CPU events and JVM allocations (using the built-in perfnorm and gc profilers).

@lihaoyi Would you consider integrating the visitor API to work with jsoniter-scala-core? I'm ready to add any missing parsing or serialization methods to the jsoniter-scala-core API, as was done recently for hasRemaining.

@lefou
Member

lefou commented Mar 17, 2023

@lihaoyi
Member Author

lihaoyi commented Mar 17, 2023

Yeah using JMH would be nice. The current benchmark suite is pretty old.

I haven't thought seriously about integrating with jsoniter. I've poked around (e.g. when looking for a double serializer to use), but it's a pretty intimidating codebase, so I didn't get as far as figuring out where the possible integration points are.

@plokhotnyuk
Contributor

plokhotnyuk commented Mar 17, 2023

Yeah using JMH would be nice. The current benchmark suite is pretty old.

As a workaround, for now you can publish a snapshot version locally and use the JMH benchmarks from the jsoniter-scala-benchmarkJS and jsoniter-scala-benchmarkJVM sub-projects.

I haven't thought seriously about integrating with jsoniter. I've poked around (e.g. when looking for a double serializer to use), but it's a pretty intimidating codebase, so I didn't get as far as figuring out where the possible integration points are.

Here is an attempt to do that in weePickle's codebase. With cooperation from both sides, we could probably achieve quite competitive solutions, like the jsoniter-scala-circe and play-json-jsoniter boosters.

@lihaoyi
Member Author

lihaoyi commented Mar 17, 2023

@plokhotnyuk what's the story for JMH and Scala.js/Scala-Native? IIRC one of the reasons I had my own benchmarking code in Scala is so I can easily re-use the benchmarks across platforms. Is JMH only for Scala-JVM? Or is there some cross-platform support too?

@plokhotnyuk
Contributor

plokhotnyuk commented Mar 17, 2023

@plokhotnyuk what's the story for JMH and Scala.js/Scala-Native? IIRC one of the reasons I had my own benchmarking code in Scala is so I can easily re-use the benchmarks across platforms. Is JMH only for Scala-JVM? Or is there some cross-platform support too?

In the jsoniter-scala-benchmarkJS sub-project I use scalajs-benchmark (a Scala.js port of JMH for browsers).

For Scala Native I have only rough measurements, from validating huge JSON files using an application from the examples.

AFAIK @armanbilge has started porting JMH to Scala for cross-platform support in his scala-jmh repo. I think the Scala community could sponsor his work on that.

Comment on lines +150 to +159
// If we do not copy the data from `cs0` into our own local array before
// parsing it, we take a significant performance penalty in the
// `integers Read` benchmarks, but *only* when run together with the rest
// of the benchmarks! When `integers Read` is run alone, this does not
// happen, and is presumably something to do with the JIT compiler.

// Since any real-world use case would exercise all sorts of code paths,
// it would more closely resemble the "all benchmarks together" case,
// so we leave this copy in place to optimize performance
// for that scenario.
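
The pattern the comment describes can be sketched as follows (illustrative; `parseDigits` is a hypothetical stand-in for the real number-parsing code):

```scala
// Copy-before-parse sketch: snapshot the CharSequence into a local array
// once, then parse out of the array rather than via cs0.charAt, which per
// the comment above was measurably slower when run alongside the other
// benchmarks.
object CopyParseSketch {
  def parseDigits(cs0: CharSequence): Long = {
    val arr = new Array[Char](cs0.length)
    var i = 0
    while (i < arr.length) { arr(i) = cs0.charAt(i); i += 1 }
    // parse the decimal digits from the local copy
    var acc = 0L
    i = 0
    while (i < arr.length) { acc = acc * 10 + (arr(i) - '0'); i += 1 }
    acc
  }
}
```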
Member Author

@lihaoyi lihaoyi Mar 19, 2023


This is especially weird behavior that I don't understand. @plokhotnyuk have you seen anything like this before in your own performance work? I'm aware of things like the JIT causing method calls to slow down when running a bunch of code paths causes callsites to become megamorphic, but I've never heard of JIT-related slowdowns when reading and writing entries directly into arrays like this. And yet, this behavior seems reproducible when I tried it on OpenJDK Corretto 11/17/19.

Contributor


Both ways of testing are useful for understanding how the parser works: the isolated one shows how parsing of a given data type behaves on its own, while the combined one shows more realistic scenarios, which are usually based on real-world samples.

From the start of its development, jsoniter-scala has had no predefined codecs and uses private[this] methods in generated codecs as much as possible, so that the type-class interface is generated only for the top-level class of the nested data structures.

Member Author


@plokhotnyuk the unusual thing is that in both cases, the benchmark runs the parser in a tight single-threaded loop again and again in isolation (nothing else running in the JVM/OS).

The difference is that when the benchmark is run on a "clean" JVM right after startup, this code is fast. When the benchmark is run on a "dirty" JVM that has already run a bunch of other benchmarks on the same thread (but which have all since terminated), this code is slow. That's the part that's rather unusual.

@lihaoyi lihaoyi merged commit 0a53ce8 into main Mar 20, 2023
@lefou lefou added this to the 3.1.0 milestone Mar 20, 2023