-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BufferedValue equality #107
base: v1
Are you sure you want to change the base?
Conversation
|
||
override def equals(that: Any): Boolean = that match { | ||
case NumLong(otherL) => this.l == otherL | ||
case NumDouble(otherD) => this.l.toDouble == otherD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does l == otherD
work here? Tests seem to pass with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I'll fix it. I was trying to avoid the subtleties with |L| > 2^53, but it looks like Long.equals
already accommodates that. However, the inverse is not true, i.e., this.l == otherD.toLong
will not work for all values outside of the 2^53 range.
case class Num(s: String, decIndex: Int, expIndex: Int) extends AnyNum { | ||
override def value: BigDecimal = BigDecimal(s) | ||
override lazy val value: BigDecimal = BigDecimal(s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This lazy val is going to cost an extra 8 bytes of allocation since it pushes us over the object alignment boundary. I'm not convinced that the extra memory footprint when buffering (and extra allocation pressure, even when memory is plentiful) is a good trade. It's fine to have a slower equals in BufferedValue
since in practice, it should only be used for tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove it, and we can study the performance tradeoffs later. My hope is that the stringy Num
is something we can try to avoid in general (i.e., map to Long
or Double
early), but we gotta figure that out. My thoughts at the time I stuck in the lazy
were (1) conversion from a String
to BigDecimal
is known to be expensive, (2) for the Num
to be useful in a numeric context, it will have to get converted at some point, (3) if it's gonna happen once, we probably should cache the results somehow, and (4) putting in lazy
will provoke a response from Doug and get us talking about it. Mission accomplished!
override def equals(that: Any): Boolean = that match { | ||
case Obj(thatValue0 @ _*) => | ||
this.value0.size == thatValue0.size && | ||
this.value0.sortBy(_._1).zip(thatValue0.sortBy(_._1)).forall { | ||
case ((thisKey, thisValue), (thatKey, thatValue)) => | ||
thisKey == thatKey && thisValue == thatValue | ||
} | ||
case _ => super.equals(that) | ||
} | ||
|
||
// expensive but reliable | ||
override def hashCode(): Int = this.value0.sortBy(_._1).hashCode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need an unordered equals/hashCode here? Can we have this do a cheap ordered comparison instead? Seems like unordered would be preferable anyway for use in tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So you want Obj("a" -> 1, "b" -> 2)
!= Obj("b" -> 2, "a" -> 1)
? Seems wrong. This goes back to the "dup key" conversation , and whether the underlying data structure for Value
's Obj
(a Map
which is unordered and does not allow duplicates) is superior to the underlying data structure for BufferedValue
(a Seq
, which is ordered and allows duplicates). You can argue against the underlying structure being optimal (and I think we agree on that point), but given what we have, this is the most appropriate definition for equals
and hashCode
IMO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose one approach could be to do a cheap, unordered compare first, and only attempt the expensive, ordered compare if there is a mismatch. If transformations generally preserve key order and compared values generally match, compares would be cheaper. But if transformations frequently scramble key order or compared values generally mismatch, compares would be even more expensive (doing everything twice). I'm honestly not inclined to fine-tune this too much if our position is to just keep using Value
for the time being, but it is important to me for it to at least be correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the reason this feels hard is that we lack a clear definition of what "correct equality" should mean for BufferedValue
. Given its current role buffering input, how about:
Two
BufferedValue
are equal if one can be substituted for another in abufferedValue.transform(to[T])
call for any validTo[T]
, such that the resulting values ofT
are also equal.
This definition would clarify that losing precision over Infinity
#107 (comment) isn't valid. As for objects, I'm not sure, but I'm not yet convinced that we can eliminate the possibility of some data format or structure that permits and allows multimaps or is otherwise order-sensitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me, and clearly both Obj("a" -> 1, "b" -> 2)
and Obj("b" -> 2, "a" -> 1)
would transform to the same case class Thing(a: Int, b: Int)
, so we should consider them equal (somehow).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it works for case classes and scala Maps, but are we prepared to go so far as to say that a To[T]
where the T
cares about object order is invalid? Does this hold for MongoDBObjects? YAML? XML? The rest of the textual and binary formats in the jackson ecosystem and beyond?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So now we agree on what "correct equality" means, but don't agree on what "valid" means in the context of "...for any valid To[T]" :'(
Did you mean "..for all valid To[T]"?
* Generator for BufferedValue | ||
* | ||
* @param jsonReversible if you are piping the arbitrary BufferedValue through JSON, set this to true so that | ||
* only reversible types are used (i.e., excludes Timestamp, Ext, and Binary, which are | ||
* encoded in such a way that they are not reversible) | ||
*/ | ||
abstract class GenBufferedValue(jsonReversible: Boolean) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great idea!
case NumDouble(d) => | ||
val thisD = value.toDouble // may chop precision or go infinite | ||
if (thisD.isInfinite) value == d.value else thisD == d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the case for Infinity
equality? BigDecimal
doesn't seem to agree. I'm not sure we should accept loss of precision here. BufferedValue
is supposed to be as faithful to the raw input as possible.
scala> val bd = BigDecimal("1e500")
val bd: BigDecimal = 1E+500
scala> bd.toDouble
val res20: Double = Infinity
scala> bd == res20
val res21: Boolean = false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove it. The property tests only ensure things that should be equal are equal, but I was concerned about things being considered equal that should not be. However, there is a flaw in my logic -- BigDecimal
can't represent an infinity anyway, so when NumDouble
declares override def value: BigDecimal = BigDecimal(d)
, that ain't gonna work for infinities since BigDecimal(Double.PositiveInfinity)
throws java.lang.NumberFormatException! So, although BigDecimal
is "larger" than Double
in almost every sense, in this sense it is "smaller". I was jumping through invisible hoops.
Resolve issues with byte array comparisons in
Binary
andExt
. Resolve attribute order sensitivity inObj
. Allow equality acrossAnyNum
subtypes, only converting toBigDecimal
when needed. Enhance property tests to deal with all subtypes and testObj
attribute shuffles.