feat: SerdeVecMap type for serializing ID maps #25492

hiltontj · 2024-10-25T01:34:52Z

This PR introduces a new type SerdeVecHashMap that can be used in places where we need a HashMap with the following properties:

1. When serialized, it is written as a list of key-value pairs instead of a map.
2. When deserialized, it assumes the serialization format from (1.) and reads a list of key-value pairs back into a map.
3. It does not allow duplicate keys on deserialization.

This is useful in places where we need to create map types that map from an identifier (integer) to some value, and need to serialize that data. For example: in the WAL when serializing write batches, and in the catalog when serializing the database/table schema.

This PR refactors the code in influxdb3_wal and influxdb3_catalog to use the new type for maps that use DbId and TableId as the key. Follow on work can give the same treatment to ColumnId based maps once that is fully worked out.

Explanation

If we have a HashMap<u32, String>, serde_json will serialize it in the following way:

{"0": "foo", "1": "bar"}

i.e., the integer keys are serialized as strings, since JSON doesn't support any other type of key in maps.

SerdeVecHashMap<u32, String> will be serialized by serde_json in the following way:

[[0, "foo"], [1, "bar"]]

and will deserialize from that vector-based structure back to the map. Serialization and deserialization can therefore be driven directly by the HashMap's Iterator and FromIterator implementations.
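To make the mechanism concrete, here is a minimal std-only sketch of the conversion underneath the new type. It deliberately leaves out the serde wiring, and the names `to_pairs`, `from_pairs`, and `DuplicateKey` are hypothetical, not the identifiers used in this PR.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical error for a pair list that contains the same key twice.
#[derive(Debug, PartialEq, Eq)]
pub struct DuplicateKey;

/// "Serialize" side: flatten the map into a list of (key, value) pairs
/// by walking the map's iterator. The order of the pairs is unspecified.
pub fn to_pairs<K: Clone + Eq + Hash, V: Clone>(map: &HashMap<K, V>) -> Vec<(K, V)> {
    map.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
}

/// "Deserialize" side: rebuild the map from the pairs, returning an error
/// as soon as a key is seen for the second time.
pub fn from_pairs<K: Eq + Hash, V>(pairs: Vec<(K, V)>) -> Result<HashMap<K, V>, DuplicateKey> {
    let mut map = HashMap::with_capacity(pairs.len());
    for (k, v) in pairs {
        // `insert` returns the previous value if the key was already present.
        if map.insert(k, v).is_some() {
            return Err(DuplicateKey);
        }
    }
    Ok(map)
}
```

Note that a plain `collect()` through FromIterator would silently keep the last value for a repeated key, so the duplicate check needs an explicit loop like the one above.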

The Controversial Part

One thing I also did in this PR was switch the catalog from using a BTreeMap for tables to using the new HashMap type. This breaks the deterministic ordering of the database schema's tables map and therefore breaks the snapshot tests we were using. I had to comment out those parts of their respective tests because, as far as I am aware, there isn't an easy way to give the underlying HashMap a deterministic ordering only in tests.

If we think that keeping BTreeMap in the catalog instead of a HashMap is okay, then I think it would be fine to roll a similar SerdeVecBTreeMap type specifically for the catalog. Alternatively, this may be a good use case for indexmap, since it reportedly has lookup performance similar to HashMap while preserving insertion order and iterating faster, which could be a win for WAL serialization speed. It also accepts different hashing algorithms, so FNV could be swapped in the same way it can with HashMap.
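The ordering difference can be illustrated with a small std-only sketch. It is not what this PR does: sorting the pairs by key before comparing is just one assumed workaround, and the helper names `sorted_pairs` and `btree_pairs` are hypothetical.

```rust
use std::collections::{BTreeMap, HashMap};

/// Collect a HashMap's entries and sort them by key so the output order
/// no longer depends on the map's internal layout.
pub fn sorted_pairs<K: Ord + Clone, V: Clone>(map: &HashMap<K, V>) -> Vec<(K, V)> {
    let mut pairs: Vec<(K, V)> = map.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
    pairs.sort_by(|a, b| a.0.cmp(&b.0));
    pairs
}

/// A BTreeMap already iterates in ascending key order, so no sort is needed.
pub fn btree_pairs<K: Ord + Clone, V: Clone>(map: &BTreeMap<K, V>) -> Vec<(K, V)> {
    map.iter().map(|(k, v)| (k.clone(), v.clone())).collect()
}
```

The sort costs O(n log n) per serialization, which is the trade-off against simply keeping a BTreeMap (or an order-preserving map) in the catalog.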

Follow-up work

Use the SerdeVecHashMap for column data in the WAL following #25461

hiltontj

Left some comments in-line.

hiltontj · 2024-10-25T15:11:53Z

influxdb3_catalog/src/catalog.rs

+        // insta::with_settings!({
+        //     sort_maps => true,
+        //     description => "catalog serialized, this snapshot is good for catching changes in \
+        //     catalog serialization"
+        // }, {
+        //     assert_json_snapshot!(catalog);
+        // });


This and a couple of other insta snapshots are commented out in this PR. We likely still want to be doing some kind of snapshot testing, but as per the TODO comment above these lines, this is not possible when using a HashMap like this.

I think we want to test serialization and deserialization, but I don't know if insta's snapshots are the way to go. Feels more brittle than I'd like.

hiltontj · 2024-10-25T15:12:31Z

influxdb3_catalog/src/catalog.rs

@@ -466,7 +420,7 @@ pub struct DatabaseSchema {
    pub id: DbId,
    pub name: Arc<str>,
    /// The database is a map of tables
-    pub tables: BTreeMap<TableId, TableDefinition>,
+    pub tables: SerdeVecHashMap<TableId, TableDefinition>,


Changing this to use a HashMap instead of BTreeMap for faster lookups.

hiltontj · 2024-10-25T15:30:53Z

influxdb3_id/src/serialize.rs

+/// During deserialization, there are no duplicate keys allowed. If duplicates are found, an error
+/// will be thrown.
+#[derive(Debug, Clone, PartialEq, Eq)]
+pub struct SerdeVecHashMap<K: Eq + std::hash::Hash, V>(HashMap<K, V>);


If this is mainly used for ID maps, which are integers, we might be able to use FNV hash: https://doc.servo.org/fnv/

I opened #25494
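For reference, a rough sketch of what swapping the hasher could look like. To stay dependency-free it hand-rolls a 64-bit FNV-1a hasher rather than using the fnv crate linked above; the `Fnv1a` and `FnvHashMap` names are assumptions for illustration only.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// Hand-rolled 64-bit FNV-1a hasher (illustrative stand-in for the fnv crate).
pub struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        // FNV-1a 64-bit offset basis.
        Fnv1a(0xcbf29ce484222325)
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }

    fn write(&mut self, bytes: &[u8]) {
        for byte in bytes {
            // XOR in the byte, then multiply by the 64-bit FNV prime.
            self.0 ^= u64::from(*byte);
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }
}

/// A HashMap that uses the FNV-1a hasher instead of the default SipHash.
pub type FnvHashMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;
```

A map built this way is created with `FnvHashMap::default()` rather than `HashMap::new()`, since `new` is only defined for the default hasher. FNV is not resistant to hash-flooding, so it suits internally generated integer IDs better than untrusted keys.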

pauldix

Looks good. My thought on the snapshot thing is something like this: we want to test serialization and deserialization. Anything that is going to be a file format needs to be versioned. So I'd think that we'd want:

A static file with the appropriate data members checked into the repo
A test that can deserialize the static file and validate it has the members we expect (this is all hard coded stuff)
Another test that takes an in memory structure, serializes it, then deserializes that and validates they're the same (this might be unnecessary)
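The testing approach listed above could be sketched roughly as follows. This is only an illustration: the line-based `v1` format, the `FIXTURE_V1` constant, and the `encode`/`decode` functions are hypothetical stand-ins for the real serialized catalog and a file checked into the repo.

```rust
use std::collections::HashMap;

/// Hypothetical fixture contents; in a real test this would be read from a
/// static file checked into the repo.
pub const FIXTURE_V1: &str = "v1\n0,foo\n1,bar\n";

/// Encode as a version line followed by one `id,name` line per entry,
/// sorted by id so the output is stable.
pub fn encode(map: &HashMap<u32, String>) -> String {
    let mut ids: Vec<&u32> = map.keys().collect();
    ids.sort();
    let mut out = String::from("v1\n");
    for id in ids {
        out.push_str(&format!("{},{}\n", id, map[id]));
    }
    out
}

/// Decode the format written by `encode`, rejecting unknown versions,
/// malformed lines, and duplicate keys.
pub fn decode(input: &str) -> Result<HashMap<u32, String>, String> {
    let mut lines = input.lines();
    match lines.next() {
        Some("v1") => {}
        other => return Err(format!("unsupported version: {:?}", other)),
    }
    let mut map = HashMap::new();
    for line in lines {
        let (raw_id, name) = line
            .split_once(',')
            .ok_or_else(|| format!("malformed line: {}", line))?;
        let id: u32 = raw_id
            .parse()
            .map_err(|e| format!("bad id {:?}: {}", raw_id, e))?;
        if map.insert(id, name.to_string()).is_some() {
            return Err(format!("duplicate key: {}", id));
        }
    }
    Ok(map)
}
```

The fixture test would call `decode(FIXTURE_V1)` and assert on hard-coded members, while the round-trip test would assert that `decode(&encode(&value))` equals `value`. Neither depends on the order in which the map happens to iterate.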

pauldix · 2024-10-25T16:54:51Z

influxdb3_catalog/src/catalog.rs

+        // TODO: with the use of the SerdeVecHashMap, using a snapshot test is flaky. This is
+        // because the serialized vector can not have a deterministic order. So, this part of the
+        // test is disabled for now.
+


I think with these kinds of serialization things, snapshot testing might not be what we want to use anyway. We want to confirm that we can serialize and deserialize data. Later when we make updates to the types, we want to make sure we can still deserialize older data.

But if we're marshalling into RAM, it's the equality of the data members that we're looking for.

Yeah, insta is compelling but not the right tool here. I think @waynr was grappling with a similar issue in clustered so I may pick his brain to see if he landed on a solution.

I opened #25493

pauldix · 2024-10-25T16:58:12Z

influxdb3_catalog/src/catalog.rs

+        // insta::with_settings!({
+        //     sort_maps => true,
+        //     description => "catalog serialized, this snapshot is good for catching changes in \
+        //     catalog serialization"
+        // }, {
+        //     assert_json_snapshot!(catalog);
+        // });


I think we want to test serialization and deserialization, but I don't know if insta's snapshots are the way to go. Feels more brittle than I'd like.

hiltontj · 2024-10-25T17:26:48Z

Given the review comments, I cleaned up the commented-out code and removed the now-unused snapshot files in influxdb3_catalog in 06e9689.

Once ✅ I will merge.

feat: SerdeVecMap type for serializing ID maps

824f0ad

hiltontj self-assigned this Oct 25, 2024

hiltontj added 3 commits October 25, 2024 09:31

refactor: move serdevecmap to id crate and upgrade rust toolchain

ed4c289

refactor: use SerdeVecHashMap in WriteBatch

e490bc0

refactor: use SerdeVecMap in catalog

48c3f84

hiltontj commented Oct 25, 2024


hiltontj requested review from mgattozzi and pauldix and removed request for mgattozzi October 25, 2024 15:50

hiltontj marked this pull request as ready for review October 25, 2024 15:50

pauldix approved these changes Oct 25, 2024


This was referenced Oct 25, 2024

Versioned serialization/deserialization tests #25493

Open

Use FNV hasher for ID maps #25494

Open

refactor: remove commented code and unused snapshots

06e9689

hiltontj merged commit 0e814f5 into main Oct 25, 2024
13 checks passed

hiltontj deleted the hiltontj/persisted-snapshot-serde branch October 25, 2024 17:49

hiltontj added the v3 label Nov 1, 2024

hiltontj mentioned this pull request Nov 1, 2024

refactor: move to ColumnId and Arc<str> as much as possible #25495

Merged
