Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BufferedValue equality #107

Draft
wants to merge 13 commits into
base: v1
Choose a base branch
from
Draft

Conversation

russellremple
Copy link
Contributor

@russellremple russellremple commented Dec 29, 2021

Resolve issues with byte array comparisons in Binary and Ext. Resolve attribute order sensitivity in Obj. Allow equality across AnyNum subtypes, only converting to BigDecimal when needed. Enhance property tests to deal with all subtypes and test Obj attribute shuffles.

@russellremple russellremple marked this pull request as draft December 29, 2021 06:19
@russellremple russellremple marked this pull request as ready for review December 29, 2021 18:07

override def equals(that: Any): Boolean = that match {
case NumLong(otherL) => this.l == otherL
case NumDouble(otherD) => this.l.toDouble == otherD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does l == otherD work here? Tests seem to pass with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right, I'll fix it. I was trying to avoid the subtleties with |L| > 2^53, but it looks like Long.equals already accommodates that. However, the inverse is not true, i.e., this.l == otherD.toLong will not work for all values outside of the 2^53 range.

case class Num(s: String, decIndex: Int, expIndex: Int) extends AnyNum {
override def value: BigDecimal = BigDecimal(s)
override lazy val value: BigDecimal = BigDecimal(s)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lazy val is going to cost an extra 8 bytes of allocation since it pushes us over the object alignment boundary. I'm not convinced that the extra memory footprint when buffering (and extra allocation pressure, even when memory is plentiful) is a good trade. It's fine to have a slower equals in BufferedValue since in practice, it should only be used for tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it, and we can study the performance tradeoffs later. My hope is that the stringy Num is something we can try to avoid in general (i.e., map to Long or Double early), but we gotta figure that out. My thoughts at the time I stuck in the lazy were (1) conversion from a String to BigDecimal is known to be expensive, (2) for the Num to be useful in a numeric context, it will have to get converted at some point, (3) if it's gonna happen once, we probably should cache the results somehow, and (4) putting in lazy will provoke a response from Doug and get us talking about it. Mission accomplished!

Comment on lines +167 to +178
override def equals(that: Any): Boolean = that match {
case Obj(thatValue0 @ _*) =>
this.value0.size == thatValue0.size &&
this.value0.sortBy(_._1).zip(thatValue0.sortBy(_._1)).forall {
case ((thisKey, thisValue), (thatKey, thatValue)) =>
thisKey == thatKey && thisValue == thatValue
}
case _ => super.equals(that)
}

// expensive but reliable
override def hashCode(): Int = this.value0.sortBy(_._1).hashCode()
Copy link
Contributor

@htmldoug htmldoug Dec 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need an unordered equals/hashCode here? Can we have this do a cheap ordered comparison instead? Seems like unordered would be preferable anyway for use in tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you want Obj("a" -> 1, "b" -> 2) != Obj("b" -> 2, "a" -> 1)? Seems wrong. This goes back to the "dup key" conversation , and whether the underlying data structure for Value's Obj (a Map which is unordered and does not allow duplicates) is superior to the underlying data structure for BufferedValue (a Seq, which is ordered and allows duplicates). You can argue against the underlying structure being optimal (and I think we agree on that point), but given what we have, this is the most appropriate definition for equals and hashCode IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose one approach could be to do a cheap, unordered compare first, and only attempt the expensive, ordered compare if there is a mismatch. If transformations generally preserve key order and compared values generally match, compares would be cheaper. But if transformations frequently scramble key order or compared values generally mismatch, compares would be even more expensive (doing everything twice). I'm honestly not inclined to fine-tune this too much if our position is to just keep using Value for the time being, but it is important to me for it to at least be correct.

Copy link
Contributor

@htmldoug htmldoug Dec 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of the reason this feels hard is that we lack a clear definition of what "correct equality" should mean for BufferedValue. Given its current role buffering input, how about:

Two BufferedValue are equal if one can be substituted for another in a bufferedValue.transform(to[T]) call for any valid To[T], such that the resulting values of T are also equal.

This definition would clarify that losing precision over Infinity #107 (comment) isn't valid. As for objects, I'm not sure, but I'm not yet convinced that we can eliminate the possibility of some data format or structure that permits and allows multimaps or is otherwise order-sensitive.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense to me, and clearly both Obj("a" -> 1, "b" -> 2) and Obj("b" -> 2, "a" -> 1) would transform to the same case class Thing(a: Int, b: Int), so we should consider them equal (somehow).

Copy link
Contributor

@htmldoug htmldoug Dec 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it works for case classes and scala Maps, but are we prepared to go so far as to say that a To[T] where the T cares about object order is invalid? Does this hold for MongoDBObjects? YAML? XML? The rest of the textual and binary formats in the jackson ecosystem and beyond?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So now we agree on what "correct equality" means, but don't agree on what "valid" means in the context of "...for any valid To[T]" :'(

Did you mean "..for all valid To[T]"?

Comment on lines +10 to +16
* Generator for BufferedValue
*
* @param jsonReversible if you are piping the arbitrary BufferedValue through JSON, set this to true so that
* only reversible types are used (i.e., excludes Timestamp, Ext, and Binary, which are
* encoded in such a way that they are not reversible)
*/
abstract class GenBufferedValue(jsonReversible: Boolean) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great idea!

Comment on lines 195 to 197
case NumDouble(d) =>
val thisD = value.toDouble // may chop precision or go infinite
if (thisD.isInfinite) value == d.value else thisD == d
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the case for Infinity equality? BigDecimal doesn't seem to agree. I'm not sure we should accept loss of precision here. BufferedValue is supposed to be as faithful to the raw input as possible.

scala> val bd = BigDecimal("1e500")
val bd: BigDecimal = 1E+500

scala> bd.toDouble
val res20: Double = Infinity

scala> bd == res20
val res21: Boolean = false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove it. The property tests only ensure things that should be equal are equal, but I was concerned about things being considered equal that should not be. However, there is a flaw in my logic -- BigDecimal can't represent an infinity anyway, so when NumDouble declares override def value: BigDecimal = BigDecimal(d), that ain't gonna work for infinities since BigDecimal(Double.PositiveInfinity) throws java.lang.NumberFormatException! So, although BigDecimal is "larger" than Double in almost every sense, in this sense it is "smaller". I was jumping through invisible hoops.

@russellremple russellremple marked this pull request as draft March 1, 2022 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants