BufferedValue equality #107

russellremple · 2021-12-29T06:10:51Z

Resolve issues with byte array comparisons in Binary and Ext. Resolve attribute order sensitivity in Obj. Allow equality across AnyNum subtypes, only converting to BigDecimal when needed. Enhance property tests to deal with all subtypes and test Obj attribute shuffles.
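
For context on the byte array issue: on the JVM, `==` on two `Array[Byte]` values compares references, not contents, so a case class that wraps an array needs its own `equals` and `hashCode`. A minimal sketch, assuming a hypothetical `Bin` wrapper (this is not the actual weejson `Binary` or `Ext` class):

```scala
import java.util.Arrays

// Hypothetical stand-in for a node that wraps raw bytes.
final case class Bin(bytes: Array[Byte]) {
  // Compare contents; the generated equals would compare array references.
  override def equals(that: Any): Boolean = that match {
    case Bin(other) => Arrays.equals(bytes, other)
    case _          => false
  }
  // Keep hashCode consistent with the content-based equals.
  override def hashCode(): Int = Arrays.hashCode(bytes)
}

object BinDemo {
  val a: Array[Byte] = Array[Byte](1, 2, 3)
  val b: Array[Byte] = Array[Byte](1, 2, 3)

  val arraysEqual: Boolean = a == b            // reference comparison on arrays
  val wrappedEqual: Boolean = Bin(a) == Bin(b) // content comparison via Bin.equals
  val sameHash: Boolean = Bin(a).hashCode == Bin(b).hashCode
}
```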

htmldoug · 2021-12-30T20:55:27Z

weejson/src/main/scala/com/rallyhealth/weejson/v1/BufferedValue.scala

+
+    override def equals(that: Any): Boolean = that match {
+      case NumLong(otherL) => this.l == otherL
+      case NumDouble(otherD) => this.l.toDouble == otherD


Does l == otherD work here? Tests seem to pass with it.

You are right, I'll fix it. I was trying to avoid the subtleties with |L| > 2^53, but it looks like Long.equals already accommodates that. However, the inverse is not true, i.e., this.l == otherD.toLong will not work for all values outside of the 2^53 range.
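
To make the 2^53 subtlety concrete, here is a small sketch using only plain `Long` and `Double` values (the object and value names are illustrative, not from the PR). As far as I understand, `Long == Double` widens the `Long` to a `Double` before comparing, so it behaves like the explicit `toDouble` form, while narrowing with `toLong` saturates for out-of-range doubles:

```scala
object LongDoubleEquality {
  // 2^53 + 1 is the first positive Long that a Double cannot represent exactly.
  val big: Long = (1L << 53) + 1
  val nearest: Double = 9007199254740992.0 // 2^53

  // Both forms widen the Long to a Double, so they agree with each other.
  val widened: Boolean = big == nearest
  val explicit: Boolean = big.toDouble == nearest

  // Going the other way, toLong clamps large doubles to Long.MaxValue,
  // so a Double far outside the Long range can look equal to Long.MaxValue.
  val saturated: Long = 1e19.toLong
  val inverse: Boolean = Long.MaxValue == 1e19.toLong
}
```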

htmldoug · 2021-12-30T21:06:03Z

weejson/src/main/scala/com/rallyhealth/weejson/v1/BufferedValue.scala

  case class Num(s: String, decIndex: Int, expIndex: Int) extends AnyNum {
-    override def value: BigDecimal = BigDecimal(s)
+    override lazy val value: BigDecimal = BigDecimal(s)


This lazy val is going to cost an extra 8 bytes of allocation since it pushes us over the object alignment boundary. I'm not convinced that the extra memory footprint when buffering (and extra allocation pressure, even when memory is plentiful) is a good trade. It's fine to have a slower equals in BufferedValue since in practice, it should only be used for tests.

I'll remove it, and we can study the performance tradeoffs later. My hope is that the stringy Num is something we can try to avoid in general (i.e., map to Long or Double early), but we gotta figure that out. My thoughts at the time I stuck in the lazy were (1) conversion from a String to BigDecimal is known to be expensive, (2) for the Num to be useful in a numeric context, it will have to get converted at some point, (3) if it's gonna happen once, we probably should cache the results somehow, and (4) putting in lazy will provoke a response from Doug and get us talking about it. Mission accomplished!

htmldoug · 2021-12-30T21:08:12Z

weejson/src/main/scala/com/rallyhealth/weejson/v1/BufferedValue.scala

+    override def equals(that: Any): Boolean = that match {
+      case Obj(thatValue0 @ _*) =>
+        this.value0.size == thatValue0.size &&
+          this.value0.sortBy(_._1).zip(thatValue0.sortBy(_._1)).forall {
+            case ((thisKey, thisValue), (thatKey, thatValue)) =>
+              thisKey == thatKey && thisValue == thatValue
+          }
+      case _ => super.equals(that)
+    }
+
+    // expensive but reliable
+    override def hashCode(): Int = this.value0.sortBy(_._1).hashCode()


Why do we need an unordered equals/hashCode here? Can we have this do a cheap ordered comparison instead? Seems like ordered would be preferable anyway for use in tests.

So you want Obj("a" -> 1, "b" -> 2) != Obj("b" -> 2, "a" -> 1)? Seems wrong. This goes back to the "dup key" conversation, and whether the underlying data structure for Value's Obj (a Map, which is unordered and does not allow duplicates) is superior to the underlying data structure for BufferedValue (a Seq, which is ordered and allows duplicates). You can argue against the underlying structure being optimal (and I think we agree on that point), but given what we have, this is the most appropriate definition for equals and hashCode IMO.

I suppose one approach could be to do a cheap, ordered compare first, and only attempt the expensive, unordered compare if there is a mismatch. If transformations generally preserve key order and compared values generally match, compares would be cheaper. But if transformations frequently scramble key order or compared values generally mismatch, compares would be even more expensive (doing everything twice). I'm honestly not inclined to fine-tune this too much if our position is to just keep using Value for the time being, but it is important to me for it to at least be correct.
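
The two-step idea, trying the cheaper comparison first and falling back to the more expensive one only on a mismatch, might look roughly like this. This is a sketch under assumptions: `SimpleObj` is a hypothetical stand-in with `Int` values, not the real `BufferedValue.Obj`.

```scala
// Hypothetical stand-in for an object node backed by an ordered Seq of fields.
final case class SimpleObj(fields: Seq[(String, Int)]) {
  override def equals(that: Any): Boolean = that match {
    case SimpleObj(other) =>
      fields.size == other.size && (
        fields == other ||                         // cheap path: same keys in the same order
          fields.sortBy(_._1) == other.sortBy(_._1) // expensive path: ignore key order
      )
    case _ => false
  }

  // Hash the sorted fields so that shuffled-but-equal objects hash alike.
  override def hashCode(): Int = fields.sortBy(_._1).hashCode()
}

object SimpleObjDemo {
  val ab: SimpleObj = SimpleObj(Seq("a" -> 1, "b" -> 2))
  val ba: SimpleObj = SimpleObj(Seq("b" -> 2, "a" -> 1))
  val other: SimpleObj = SimpleObj(Seq("a" -> 1, "b" -> 3))
}
```

Note that `sortBy` is stable, so duplicate keys keep their relative order in this sketch; whether that is the right treatment of duplicates is part of the open question above.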

Part of the reason this feels hard is that we lack a clear definition of what "correct equality" should mean for BufferedValue. Given its current role buffering input, how about:

Two BufferedValue are equal if one can be substituted for another in a bufferedValue.transform(to[T]) call for any valid To[T], such that the resulting values of T are also equal.

This definition would clarify that losing precision over Infinity (see #107 (comment)) isn't valid. As for objects, I'm not sure, but I'm not yet convinced that we can eliminate the possibility of some data format or structure that permits multimaps or is otherwise order-sensitive.

That makes sense to me, and clearly both Obj("a" -> 1, "b" -> 2) and Obj("b" -> 2, "a" -> 1) would transform to the same case class Thing(a: Int, b: Int), so we should consider them equal (somehow).

Yeah, it works for case classes and scala Maps, but are we prepared to go so far as to say that a To[T] where the T cares about object order is invalid? Does this hold for MongoDBObjects? YAML? XML? The rest of the textual and binary formats in the jackson ecosystem and beyond?
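
A toy illustration of why the answer depends on the target type. The two functions below are hypothetical stand-ins for a `To[T]` (they do not use the weejson API): one builds an order-insensitive `Map`, the other an order-preserving `Vector`.

```scala
object SubstitutionSketch {
  type Fields = Seq[(String, Int)]

  // Order-insensitive target, comparable to a case class or a Scala Map.
  def toMap(fields: Fields): Map[String, Int] = fields.toMap

  // Order-sensitive target, comparable to a format that records field order.
  def toOrdered(fields: Fields): Vector[(String, Int)] = fields.toVector

  val ab: Fields = Seq("a" -> 1, "b" -> 2)
  val ba: Fields = Seq("b" -> 2, "a" -> 1)

  // The shuffled inputs are interchangeable for the Map target...
  val sameAsMap: Boolean = toMap(ab) == toMap(ba)
  // ...but not for the target that preserves field order.
  val sameAsOrdered: Boolean = toOrdered(ab) == toOrdered(ba)
}
```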

So now we agree on what "correct equality" means, but don't agree on what "valid" means in the context of "...for any valid To[T]" :'(

Did you mean "..for all valid To[T]"?

htmldoug · 2021-12-30T21:08:47Z

weejson/src/test/scala/com/rallyhealth/weejson/v1/GenBufferedValue.scala

+ * Generator for BufferedValue
+ *
+ * @param jsonReversible if you are piping the arbitrary BufferedValue through JSON, set this to true so that
+ *                       only reversible types are used (i.e., excludes Timestamp, Ext, and Binary, which are
+ *                       encoded in such a way that they are not reversible)
+ */
+abstract class GenBufferedValue(jsonReversible: Boolean) {


Great idea!

htmldoug · 2021-12-30T21:20:00Z

weejson/src/main/scala/com/rallyhealth/weejson/v1/BufferedValue.scala

+      case NumDouble(d) =>
+        val thisD = value.toDouble // may chop precision or go infinite
+        if (thisD.isInfinite) value == d.value else thisD == d


What's the case for Infinity equality? BigDecimal doesn't seem to agree. I'm not sure we should accept loss of precision here. BufferedValue is supposed to be as faithful to the raw input as possible.

scala> val bd = BigDecimal("1e500")
val bd: BigDecimal = 1E+500

scala> bd.toDouble
val res20: Double = Infinity

scala> bd == res20
val res21: Boolean = false

I'll remove it. The property tests only ensure things that should be equal are equal, but I was concerned about things being considered equal that should not be. However, there is a flaw in my logic -- BigDecimal can't represent an infinity anyway, so when NumDouble declares override def value: BigDecimal = BigDecimal(d), that ain't gonna work for infinities since BigDecimal(Double.PositiveInfinity) throws java.lang.NumberFormatException! So, although BigDecimal is "larger" than Double in almost every sense, in this sense it is "smaller". I was jumping through invisible hoops.
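
The asymmetry can be reproduced with the standard library alone. A small sketch (the object and value names are illustrative):

```scala
import scala.util.Try

object InfinityDemo {
  // A BigDecimal far beyond the Double range overflows to Infinity on conversion.
  val huge: BigDecimal = BigDecimal("1e500")
  val overflowed: Double = huge.toDouble

  // The reverse conversion fails: BigDecimal has no representation for Infinity.
  val conversion: Try[BigDecimal] = Try(BigDecimal(Double.PositiveInfinity))
}
```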

russellremple added 2 commits December 28, 2021 22:05

BufferedValue equality

4dd96ba

Simplify generator

d2457e5

russellremple marked this pull request as draft December 29, 2021 06:19

russellremple added 5 commits December 28, 2021 22:27

make mima happy

a9d0753

arbInstant not found?

05d8d22

arbInstant not found?!?

ee27586

arbInstant not in 2.11 libs!

e8e4be1

param arbValue is JSON reversible

b7193f6

russellremple marked this pull request as ready for review December 29, 2021 18:07

hashCode consistent with equals

05631cd

russellremple requested a review from htmldoug December 29, 2021 19:38

russellremple added 3 commits December 29, 2021 11:50

less expensive AnyNum hashCode

e47ec21

property tests for equals/hashCode, fix bugs

7cbd2bb

Double.isInfinite

490d965

htmldoug reviewed Dec 30, 2021


simpler

baeac07

russellremple requested a review from htmldoug December 30, 2021 22:28

Merge branch 'v1' into bveq

9ef75eb

russellremple marked this pull request as draft March 1, 2022 23:27
