Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add weejson-jsoniter-scala #105

Open
wants to merge 12 commits into
base: v1
Choose a base branch
from

Conversation

htmldoug
Copy link
Contributor

@htmldoug htmldoug commented Dec 19, 2021

Summary

Adds basic integration for the very fast jsoniter-scala parser.

Benchmarks

Optional Jar

Based on performance alone, I'd love to have this parser as weepickle's default UTF-8 implementation, but:

  1. I'm not giving up jackson-core integration (particularly for dataformats-binary and dataformats-text).
  2. Having two default parser dependencies is slightly "wat"
  3. I'm not sure if jsoniter-scala v3 will use the same package names and cause us dependency hell. Last major version release was in Nov 2019 and did not change package names.)

The README suggests as a feature:

Support of shading to another package for locking on a particular released version

We may want to go this route in the future, but for now, I'm adding the integration as an optional jar primarily for use by non-libraries at the absolute root of the dependency tree.

TODO

  • parse floats less horribly
  • update readme

@htmldoug htmldoug marked this pull request as ready for review March 9, 2022 05:41
Comment on lines 151 to 175
val cs = new String(in.readRawValAsBytes(), StandardCharsets.US_ASCII)

/**
* This regex performs rather badly, but gets the tests passing.
*
* We're looking for a value we can pass through the Visitor interface--
* either a primitive Double or a CharSequence representing a *valid*
* number conforming to https://datatracker.ietf.org/doc/html/rfc7159#page-6.
*
* `in.readRawValAsBytes()` does NOT do that validation. It will happily
* return a String of "------".
*
* `in.readBigDecimal(null).toString` is tempting, but will not provide the raw input.
* Instead, it transforms the input from "0.00000001" to "1.0E-8".
* This fails roundtrip tests.
*
* I tried combining the two approaches, `in.readBigDecimal(null)` for validation,
* then `in.rollbackToMark()` + `in.readRawValAsBytes()` to capture the raw input,
* but for a value like "1.0-----", `in.readBigDecimal(null)` will read "1.0",
* then `in.readRawValAsBytes()` will return the whole string, including the unwanted
* trailing hyphens.
*
*/
require(ValidJsonNum.pattern.matcher(cs).matches(), "invalid number")
v.visitFloat64String(cs)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plokhotnyuk, I seem to be stuck here. Do you have any suggestions to achieve better performance than my hacky regex given the failed attempts in the comment above? Specifically, is there anything in JsonReader that can help me that I might have overlooked?

Copy link
Contributor

@plokhotnyuk plokhotnyuk Mar 9, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@htmldoug You can use some counters to calculate number of valid chars like in the following code snippet (BEWARE: it is not tested)

    private def parseNumber[J](
      in: JsonReader,
      v: Visitor[_, J]
    ): J = {
      in.setMark()
      var b = in.nextByte()
      var digits, index = 0
      var decIndex, expIndex = -1
      if (b == '-') {
        b = in.nextByte()
        index += 1
      }
      try {
        digits -= index
        while (b >= '0' && b <= '9') {
          b = in.nextByte()
          index += 1
        }
        digits += index
        if (b == '.') {
          decIndex = index
          b = in.nextByte()
          index += 1
        }
        digits -= index
        while (b >= '0' && b <= '9') {
          b = in.nextByte()
          index += 1
        }
        digits += index
        if ((b | 0x20) == 'e') {
          expIndex = index
          b = in.nextByte()
          index += 1
          if (b == '-' || b == '+') {
            b = in.nextByte()
            index += 1
          }
          while (b >= '0' && b <= '9') {
            b = in.nextByte()
            index += 1
          }
        }
      } catch {
        case _: JsonReaderException => // ignore the end of input error for now
      } finally in.rollbackToMark()
      if ((decIndex & expIndex) == -1) {
        if (digits < 19) v.visitInt64(in.readLong())
        else {
          val x = in.readBigInt(null)
          if (x.bitLength < 64) v.visitInt64(x.longValue)
          else v.visitFloat64StringParts(x.toString, -1, -1)
        }
      } else {
        val cs = new String(in.readRawValAsBytes(), StandardCharsets.US_ASCII)
        require(cs.length - 1 == index, "invalid number")
        v.visitFloat64String(cs)
      }
    }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, comparing the walked length vs the string.length is smart. I'll run with this implementation tomorrow. Thank you, @plokhotnyuk!

I tried something similar previously, but it wasn't giving me the throughput I expected. I just realized that most of the time is spent in endOfInputError > appendHexDump. "3.14 " parses 3x faster than "3.14"! I wonder if something similar could also be happening in BigDecimalReading, BigIntReading, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@htmldoug You can parse an array of numbers to make your benchmark more realistic. I bet nobody use JSON parsers to read a single number.

@plokhotnyuk
Copy link
Contributor

@htmldoug I've added a missing method for checking of remaining bytes in the reader, you can now use it to parse numbers without need to catch end of input errors, like here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants