This is an edge case that may not be worth fixing.
SparseHll has a small bug that causes the wrong bucket value to be recorded when the number of leading zeros is within 6 of the maximum possible number. For example, with 4096 buckets (indexBitLength = 12), the maximum possible bucket value is 64 - indexBitLength + 1 = 53 (52 leading zeros). Values up to 46 (45 leading zeros) are correctly recorded, but inserting a hash with 46 or more leading zeros records an incorrect value. This is reflected in the added unit tests in #55.
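To make the arithmetic above concrete, here is a minimal sketch that computes the affected range for a given indexBitLength. It is not code from SparseHll: the class and method names are hypothetical, and it only assumes what this issue states (64-bit hashes, VALUE_BITS = 6, and bucket value = leading zeros + 1).

```java
// Hypothetical helper, not part of airlift: reproduces the arithmetic in this issue.
final class SparseHllEdgeCase {
    // Assumptions taken from the issue text, not read from the library source.
    static final int HASH_BITS = 64;
    static final int VALUE_BITS = 6;

    // Bits left after the bucket index; also the largest possible leading-zero count.
    static int maxLeadingZeros(int indexBitLength) {
        return HASH_BITS - indexBitLength;
    }

    // Bucket value is the leading-zero count plus one.
    static int maxBucketValue(int indexBitLength) {
        return maxLeadingZeros(indexBitLength) + 1;
    }

    // Smallest leading-zero count reported as recorded incorrectly
    // ("within 6 of the maximum possible number").
    static int firstAffectedLeadingZeros(int indexBitLength) {
        return maxLeadingZeros(indexBitLength) - VALUE_BITS;
    }

    // Largest bucket value that is still recorded correctly.
    static int largestCorrectBucketValue(int indexBitLength) {
        return (firstAffectedLeadingZeros(indexBitLength) - 1) + 1;
    }

    public static void main(String[] args) {
        int indexBitLength = 12; // 4096 buckets, as in the example above
        System.out.println("max bucket value: " + maxBucketValue(indexBitLength));
        System.out.println("largest correct value: " + largestCorrectBucketValue(indexBitLength));
        System.out.println("first affected leading zeros: " + firstAffectedLeadingZeros(indexBitLength));
    }
}
```

For indexBitLength = 12 this gives a maximum bucket value of 53, a largest correctly recorded value of 46, and a first affected leading-zero count of 46, matching the numbers quoted above. Whether the same offset holds for other values of indexBitLength is an extrapolation from the single case described here.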
I have not thoroughly investigated why this is, but I believe the implementation of HyperLogLog++ in SparseHll differs slightly from that laid out in the original paper. (See in particular the encoding and decoding of hashes in the sparse setting, and note that 6 is equal to VALUE_BITS in SparseHll.) I suspect fixing this bug would require making breaking changes to the format of SparseHll. Meanwhile, the probability of this bug occurring is very remote, so its effects are virtually nil. Nonetheless, I figured it would be worthwhile to document this behavior for any future reference.
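As a rough check on how remote this is, the sketch below estimates the chance that a single hash lands in the affected range. It assumes the hash bits are uniformly random, so the probability of at least k leading zeros after the index bits is 2^-k; the class and method names are hypothetical and nothing here is taken from the library.

```java
// Hypothetical helper, not part of airlift: back-of-the-envelope probability estimate.
final class SparseHllBugProbability {
    // Assumptions taken from the issue text: 64-bit hashes and VALUE_BITS = 6.
    static final int HASH_BITS = 64;
    static final int VALUE_BITS = 6;

    // Smallest leading-zero count (after the index bits) that is recorded incorrectly.
    static int firstAffectedLeadingZeros(int indexBitLength) {
        return HASH_BITS - indexBitLength - VALUE_BITS;
    }

    // P(at least k leading zeros) = 2^-k for uniformly random hash bits.
    static double perHashProbability(int indexBitLength) {
        return Math.scalb(1.0, -firstAffectedLeadingZeros(indexBitLength));
    }

    // Expected number of affected hashes among `distinctHashes` insertions.
    static double expectedAffected(int indexBitLength, long distinctHashes) {
        return distinctHashes * perHashProbability(indexBitLength);
    }

    public static void main(String[] args) {
        int indexBitLength = 12;
        System.out.println("per-hash probability: " + perHashProbability(indexBitLength));
        System.out.println("expected affected in 1e9 hashes: "
                + expectedAffected(indexBitLength, 1_000_000_000L));
    }
}
```

With indexBitLength = 12 the per-hash probability is 2^-46, roughly 1.4e-14, so even a billion distinct hashes would be expected to hit the affected range far less than once.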
jonhehir added a commit to jonhehir/airlift that referenced this issue on Jun 24, 2022:
This commit makes no functional changes and only adds tests. Beyond merely improving test
coverage, this commit serves as partial documentation of one minor (but surprising) edge case
(prestodb#56) and as verification of behavior that was contested in prestodb#42.