Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: SerdeVecMap type for serializing ID maps #25492

Merged
merged 5 commits into from
Oct 25, 2024

Conversation

hiltontj
Copy link
Contributor

@hiltontj hiltontj commented Oct 25, 2024

This PR introduces a new type SerdeVecHashMap that can be used in places where we need a HashMap with the following properties:

  1. When serialized, it is serialized as a list of key-value pairs, instead of a map
  2. When deserialized, it assumes the serialization format from (1.) and deserializes from a list of key-value pairs to a map
  3. Does not allow for duplicate keys on deserialization

This is useful in places where we need to create map types that map from an identifier (integer) to some value, and need to serialize that data. For example: in the WAL when serializing write batches, and in the catalog when serializing the database/table schema.

This PR refactors the code in influxdb3_wal and influxdb3_catalog to use the new type for maps that use DbId and TableId as the key. Follow on work can give the same treatment to ColumnId based maps once that is fully worked out.

Explanation

If we have a HashMap<u32, String>, serde_json will serialize it in the following way:

{"0": "foo", "1": "bar"}

i.e., the integer keys are serialized as strings, since JSON doesn't support any other type of key in maps.

SerdeVecHashMap<u32, String> will be serialized by serde_json in the following way:

[[0, "foo"], [1, "bar"]]

and will deserialize from that vector-based structure back to the map. This allows serialization/deserialization to run directly off of the HashMap's Iterator/FromIterator implementations.

The Controversial Part

One thing I also did in this PR was switch the catalog from using a BTreeMap for tables to using the new HashMap type. This breaks the deterministic ordering of the database schema's tables map and therefore wrecks the snapshot tests we were using. I had to comment those parts of their respective tests out, because there isn't an easy way to make the underlying hashmap have a deterministic ordering just in tests that I am aware of.

If we think that using BTreeMap in the catalog is okay over a HashMap, then I think it would be okay to roll a similar SerdeVecBTreeMap type specifically for the catalog. Coincidentally, this may actually be a good use case for indexmap, since it holds supposedly similar lookup performance characteristics to hashmap, while preserving order and having faster iteration which could be a win for WAL serialization speed. It also accepts different hashing algorithms so could be swapped in with FNV like HashMap can.

Follow-up work

Use the SerdeVecHashMap for column data in the WAL following #25461

@hiltontj hiltontj self-assigned this Oct 25, 2024
Copy link
Contributor Author

@hiltontj hiltontj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments in-line.

Comment on lines 1066 to 1072
// insta::with_settings!({
// sort_maps => true,
// description => "catalog serialized, this snapshot is good for catching changes in \
// catalog serialization"
// }, {
// assert_json_snapshot!(catalog);
// });
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This and a couple of other insta snapshots are commented out in this PR. We likely still want to be doing some kind of snapshot testing, but as per the TODO comment above these lines, this is not possible when using a HashMap like this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to test serialization and deserialization, but I don't know if insta's snapshots are the way to go. Feels more brittle than I'd like.

@@ -466,7 +420,7 @@ pub struct DatabaseSchema {
pub id: DbId,
pub name: Arc<str>,
/// The database is a map of tables
pub tables: BTreeMap<TableId, TableDefinition>,
pub tables: SerdeVecHashMap<TableId, TableDefinition>,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing this to use a HashMap instead of BTreeMap for faster lookups.

/// During deserialization, there are no duplicate keys allowed. If duplicates are found, an error
/// will be thrown.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SerdeVecHashMap<K: Eq + std::hash::Hash, V>(HashMap<K, V>);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is mainly used for ID maps, which are integers, we might be able to use FNV hash: https://doc.servo.org/fnv/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #25494

@hiltontj hiltontj requested review from mgattozzi and pauldix and removed request for mgattozzi October 25, 2024 15:50
@hiltontj hiltontj marked this pull request as ready for review October 25, 2024 15:50
Copy link
Member

@pauldix pauldix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. My thought on the snapshot thing is something like this: we want to test serialization and deserialization. Anything that is going to be a file format needs to be versioned. So I'd think that we'd want:

  • A static file with the appropriate data members checked into the repo
  • A test that can deserialize the static file and validate it has the members we expect (this is all hard coded stuff)
  • Another test that takes an in memory structure, serializes it, then deserializes that and validates they're the same (this might be unnecessary)

// TODO: with the use of the SerdeVecHashMap, using a snapshot test is flaky. This is
// because the serialized vector can not have a deterministic order. So, this part of the
// test is disabled for now.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with these kinds of serialization things, snapshot testing might not be what we want to use anyway. We want to confirm that we can serialize and deserialize data. Later when we make updates to the types, we want to make sure we can still deserialize older data.

But if we're marshalling into RAM, it's the equality of the data members that we're looking for.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, insta is compelling but not the right tool here. I think @waynr was grappling with a similar issue in clustered so I may pick his brain to see if he landed on a solution.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened #25493

Comment on lines 1066 to 1072
// insta::with_settings!({
// sort_maps => true,
// description => "catalog serialized, this snapshot is good for catching changes in \
// catalog serialization"
// }, {
// assert_json_snapshot!(catalog);
// });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we want to test serialization and deserialization, but I don't know if insta's snapshots are the way to go. Feels more brittle than I'd like.

@hiltontj
Copy link
Contributor Author

Given the review comments, I went and cleaned up the commented out code and removed the now unused snapshot files in influxdb3_catalog in 06e9689

Once ✅ I will merge.

@hiltontj hiltontj merged commit 0e814f5 into main Oct 25, 2024
13 checks passed
@hiltontj hiltontj deleted the hiltontj/persisted-snapshot-serde branch October 25, 2024 17:49
@hiltontj hiltontj added the v3 label Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants