-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: SerdeVecMap type for serializing ID maps #25492
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments in-line.
influxdb3_catalog/src/catalog.rs
Outdated
// insta::with_settings!({ | ||
// sort_maps => true, | ||
// description => "catalog serialized, this snapshot is good for catching changes in \ | ||
// catalog serialization" | ||
// }, { | ||
// assert_json_snapshot!(catalog); | ||
// }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This and a couple of other insta
snapshots are commented out in this PR. We likely still want to be doing some kind of snapshot testing, but as per the TODO
comment above these lines, this is not possible when using a HashMap
like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to test serialization and deserialization, but I don't know if insta's snapshots are the way to go. Feels more brittle than I'd like.
@@ -466,7 +420,7 @@ pub struct DatabaseSchema { | |||
pub id: DbId, | |||
pub name: Arc<str>, | |||
/// The database is a map of tables | |||
pub tables: BTreeMap<TableId, TableDefinition>, | |||
pub tables: SerdeVecHashMap<TableId, TableDefinition>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this to use a HashMap
instead of BTreeMap
for faster lookups.
/// During deserialization, there are no duplicate keys allowed. If duplicates are found, an error | ||
/// will be thrown. | ||
#[derive(Debug, Clone, PartialEq, Eq)] | ||
pub struct SerdeVecHashMap<K: Eq + std::hash::Hash, V>(HashMap<K, V>); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is mainly used for ID maps, which are integers, we might be able to use FNV hash: https://doc.servo.org/fnv/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #25494
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. My thought on the snapshot thing is something like this: we want to test serialization and deserialization. Anything that is going to be a file format needs to be versioned. So I'd think that we'd want:
- A static file with the appropriate data members checked into the repo
- A test that can deserialize the static file and validate it has the members we expect (this is all hard coded stuff)
- Another test that takes an in memory structure, serializes it, then deserializes that and validates they're the same (this might be unnecessary)
influxdb3_catalog/src/catalog.rs
Outdated
// TODO: with the use of the SerdeVecHashMap, using a snapshot test is flaky. This is | ||
// because the serialized vector can not have a deterministic order. So, this part of the | ||
// test is disabled for now. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think with these kinds of serialization things, snapshot testing might not be what we want to use anyway. We want to confirm that we can serialize and deserialize data. Later when we make updates to the types, we want to make sure we can still deserialize older data.
But if we're marshalling into RAM, it's the equality of the data members that we're looking for.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, insta
is compelling but not the right tool here. I think @waynr was grappling with a similar issue in clustered so I may pick his brain to see if he landed on a solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened #25493
influxdb3_catalog/src/catalog.rs
Outdated
// insta::with_settings!({ | ||
// sort_maps => true, | ||
// description => "catalog serialized, this snapshot is good for catching changes in \ | ||
// catalog serialization" | ||
// }, { | ||
// assert_json_snapshot!(catalog); | ||
// }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we want to test serialization and deserialization, but I don't know if insta's snapshots are the way to go. Feels more brittle than I'd like.
Given the review comments, I went and cleaned up the commented out code and removed the now unused snapshot files in Once ✅ I will merge. |
This PR introduces a new type
SerdeVecHashMap
that can be used in places where we need a HashMap with the following properties:This is useful in places where we need to create map types that map from an identifier (integer) to some value, and need to serialize that data. For example: in the WAL when serializing write batches, and in the catalog when serializing the database/table schema.
This PR refactors the code in
influxdb3_wal
andinfluxdb3_catalog
to use the new type for maps that useDbId
andTableId
as the key. Follow on work can give the same treatment toColumnId
based maps once that is fully worked out.Explanation
If we have a
HashMap<u32, String>
,serde_json
will serialize it in the following way:i.e., the integer keys are serialized as strings, since JSON doesn't support any other type of key in maps.
SerdeVecHashMap<u32, String>
will be serialized byserde_json
in the following way:and will deserialize from that vector-based structure back to the map. This allows serialization/deserialization to run directly off of the
HashMap
'sIterator
/FromIterator
implementations.The Controversial Part
One thing I also did in this PR was switch the catalog from using a
BTreeMap
for tables to using the newHashMap
type. This breaks the deterministic ordering of the database schema'stables
map and therefore wrecks the snapshot tests we were using. I had to comment those parts of their respective tests out, because there isn't an easy way to make the underlying hashmap have a deterministic ordering just in tests that I am aware of.If we think that using
BTreeMap
in the catalog is okay over aHashMap
, then I think it would be okay to roll a similarSerdeVecBTreeMap
type specifically for the catalog. Coincidentally, this may actually be a good use case forindexmap
, since it holds supposedly similar lookup performance characteristics to hashmap, while preserving order and having faster iteration which could be a win for WAL serialization speed. It also accepts different hashing algorithms so could be swapped in with FNV likeHashMap
can.Follow-up work
Use the
SerdeVecHashMap
for column data in the WAL following #25461