Best way to model data like comments? #170

vedantroy · 2022-05-22T06:25:12Z

vedantroy
May 22, 2022

New

If I have a key that is associated with multiple values, and each value has an id, what would be the best way to delete a value given its id?

Old

What would be the recommended way to model something like comments?
For example, we have something like:

type Post = {
   text: string;
   comments: Comment[]
}

type Comment = {
    id: string;
    text: string
}

I could model this as a MessagePack object, but then writing a new comment would mean reading and writing the entire post. I expect the text of each post to be roughly 500 KB, if not more, so this seems bad for performance.

Is there a recommendation for what to do here? I have some initial thoughts:

Store posts in a separate database from comments.
The comments database is a table of post id => comments (using the multiple-values-per-key feature).

However, with this model, I do not know how to update a specific comment.

I wanted to hear your thoughts on whether there's a better way to model this.

Answered by kriszyp

kriszyp · 2022-05-23T03:34:31Z

kriszyp
May 23, 2022
Maintainer

This is a great question, and it is worth considering some of the different ways you can structure this in lmdb-js. I think you are on the right track here, but a lot of this depends on what you want to optimize for, so I will summarize some of the different approaches (which you may have already considered).

Embedded Arrays
As you have mentioned, comments can easily be embedded arrays, and this is great for smaller data structures that change less frequently. However, as you realized, if there are larger data structures with frequent additions/updates to the comments, the reading and writing of the post and all other comments can significantly increase the overhead for adding comments. However, one important note is that in many of these types of applications, reading may be far more frequent (often there might be over 10x more people reading a post than commenting on it). And if the primary way this data is accessed is retrieving the posts with all the comments, this is really fast, just one database operation!
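To make the trade-off concrete, here is a minimal sketch of the embedded-array layout. It uses a plain in-memory Map as a stand-in for the store (this is not the lmdb-js API), and addEmbeddedComment is a hypothetical helper name:

```javascript
// In-memory stand-in for a key-value store (NOT the lmdb-js API).
const postStore = new Map();

// Hypothetical helper: appends a comment to a post stored as a single value.
function addEmbeddedComment(postId, comment) {
  const post = postStore.get(postId); // reads the post and every existing comment
  if (!post) throw new Error(`unknown post: ${postId}`);
  const updated = { ...post, comments: [...post.comments, comment] };
  postStore.set(postId, updated); // rewrites the post and every comment
  return updated;
}

postStore.set('post-1', { text: 'A long post body', comments: [] });
addEmbeddedComment('post-1', { id: 'c1', text: 'First!' });
addEmbeddedComment('post-1', { id: 'c2', text: 'Nice post' });

// Reading the post together with all of its comments is a single lookup.
const postWithComments = postStore.get('post-1');
console.log(postWithComments.comments.length);
```

Every write touches the whole value, while every read is one lookup, which is why this layout suits read-heavy data with small, rarely changing comment lists.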

Relational DB Approach
The traditional relational database approach, following principles of database normalization of tables with scalar values (in Codd we trust), would be to separate into distinct tables and use an index. So your Comment table/structure would look like:

type Comment = {
    id: string; // yeah, each comment still gets its own id
    postId: string; // foreign key to posts, and assuming you are using string ids
    text: string;
    // And I'd assume this will eventually have other properties too
}

In this comment table, each comment has its own id, so comments can individually be updated and deleted, as needed.
Now, the next part of relating two tables (through the postId foreign key) is an index. This is where we would use the multiple-values-per-key dupSort option. The Comment-Post index would be another database/store where the keys are the postIds and the values are the comment ids (multiple comments for a post means multiple entries with the same key). You would need to update this index on every comment addition or removal (a traditional SQL database does this for you, but I'm assuming you are using lmdb-js because you like having some control over this).

import { open } from 'lmdb';

// make sure to open with the proper encoding:
let commentIndexDb = open(path, {
    dupSort: true,
    encoding: 'ordered-binary',
})
...
commentIndexDb.put(postId, commentId) // adding to the index

And then when it comes time to retrieve/render your post with comments, you search the index for all the comment ids for a given post id, and then retrieve each comment by the id that you found in the index. There is a clear performance disadvantage for reading here: retrieving the set of all comments for a post requires a random read/get for each comment. However, this is good, solid, normalized database design; it would have excellent write performance and would still have good read characteristics that would probably scale and perform as well as or better than any SQL database. It is also worth noting that for entries larger than a couple hundred bytes, there is usually more time spent on deserialization than on the actual database lookups (partly because LMDB is so fast), and this approach would have basically the same total deserialization cost as the first approach. Also, with post/comment apps, there might be good reason for keeping retrieval of posts as a separate action from retrieving comments. Typically you want to show a post as fast as possible, and then there is plenty of time to load comments later.
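As a rough illustration of the table-plus-index layout, the sketch below uses in-memory Maps as stand-ins for the two databases (this is not the lmdb-js API). The names commentTable, commentIndex, addComment, removeComment and getCommentsForPost are all hypothetical:

```javascript
// In-memory stand-ins (NOT the lmdb-js API).
const commentTable = new Map(); // commentId -> comment
const commentIndex = new Map(); // postId -> Set of commentIds (plays the role of the dupSort index)

function addComment(comment) {
  commentTable.set(comment.id, comment);
  if (!commentIndex.has(comment.postId)) commentIndex.set(comment.postId, new Set());
  commentIndex.get(comment.postId).add(comment.id); // keep the index in sync on every add
}

function removeComment(commentId) {
  const comment = commentTable.get(commentId);
  if (!comment) return false;
  commentIndex.get(comment.postId).delete(commentId); // remove the reference
  commentTable.delete(commentId); // and remove the comment itself
  return true;
}

function getCommentsForPost(postId) {
  const ids = commentIndex.get(postId) ?? new Set();
  // One lookup in the index, then one random read per comment id.
  return [...ids].map((id) => commentTable.get(id));
}

addComment({ id: 'c1', postId: 'p1', text: 'First!' });
addComment({ id: 'c2', postId: 'p2', text: 'Other post' });
addComment({ id: 'c3', postId: 'p1', text: 'Second' });
removeComment('c1');

const commentsForP1 = getCommentsForPost('p1');
console.log(commentsForP1.map((c) => c.id));
```

The point of the sketch is the bookkeeping: every add and remove touches both structures, and a read costs one index lookup plus one get per comment.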

Multipart Keys
lmdb-js actually provides another way to define keys: a key can be a sequence (or array) of multiple values (this is different from having multiple entries with the same key). Using these types of keys, you could have a comment table where the keys are of the form [postId, commentId]. If you were adding a new comment, you would add it like this:

commentDb.put([postId, generateUniqueCommentId()], commentData)

With this approach, your comments are now sequentially ordered, so that all the comments for a post can be retrieved with a single getRange query:

for (let { key, value: comment } of commentDb.getRange({ start: postId, end: [postId, MAXIMUM_KEY] })) {
...
}

Since these are sequentially accessed in single traversal, this is faster than doing random access retrievals for each comment.
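The reason a single range scan works is that entries are kept in key order, so every key beginning with the same postId sits in one contiguous run. Here is a small in-memory sketch of that idea (a sorted array, not the lmdb-js API; compareKeys, put and rangeForPost are hypothetical helpers, and the comparison is plain JavaScript string ordering rather than the library's own key encoding):

```javascript
// In-memory stand-in for an ordered store keyed by [postId, commentId] (NOT the lmdb-js API).
const entries = [];

// Compare two array keys element by element, using plain string ordering.
function compareKeys(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length; // a shorter key sorts before its extensions
}

function put(key, value) {
  entries.push({ key, value });
  entries.sort((x, y) => compareKeys(x.key, y.key)); // keep entries in key order
}

// Walk the sorted entries and stop as soon as the run for this post ends.
function* rangeForPost(postId) {
  let started = false;
  for (const entry of entries) {
    if (entry.key[0] === postId) {
      started = true;
      yield entry;
    } else if (started) {
      return; // keys are sorted, so no later entry can belong to this post
    }
  }
}

put(['p2', 'c9'], { text: 'on p2' });
put(['p1', 'b'], { text: 'second on p1' });
put(['p3', 'a'], { text: 'on p3' });
put(['p1', 'a'], { text: 'first on p1' });

const p1CommentIds = [...rangeForPost('p1')].map((entry) => entry.key[1]);
console.log(p1CommentIds);
```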

And since each comment still has a unique id, you can delete or update existing comments:

commentDb.remove([comment.postId, comment.id])
commentDb.put([comment.postId, comment.id], comment)

Caching
This is not really a different approach, but if you use the relational database approach and you need better read performance, you could employ caching, and lmdb-js is great for this. You could cache the combination of a post with all its comments when a post is accessed, and then invalidate such cache entries whenever the post is updated or any comments for that post are added/updated. In this way you could still get fast write performance (if you do multiple writes between accesses) along with the speed of most accesses being resolved in a single read.
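A minimal sketch of that invalidate-on-write pattern, again with in-memory Maps standing in for the databases (not the lmdb-js API). The names posts, commentsByPost, renderedCache, getPostWithComments and addComment are hypothetical:

```javascript
// In-memory stand-ins (NOT the lmdb-js API).
const posts = new Map(); // postId -> { text }
const commentsByPost = new Map(); // postId -> array of comments
const renderedCache = new Map(); // postId -> post combined with its comments
let cacheMisses = 0;

function getPostWithComments(postId) {
  if (renderedCache.has(postId)) return renderedCache.get(postId); // single read on a hit
  cacheMisses++;
  const combined = {
    ...posts.get(postId),
    comments: [...(commentsByPost.get(postId) ?? [])],
  };
  renderedCache.set(postId, combined);
  return combined;
}

function addComment(postId, comment) {
  const list = commentsByPost.get(postId) ?? [];
  commentsByPost.set(postId, [...list, comment]);
  renderedCache.delete(postId); // invalidate: the cached combination is now stale
}

posts.set('p1', { text: 'Hello' });
getPostWithComments('p1'); // miss: builds and caches
getPostWithComments('p1'); // hit
addComment('p1', { id: 'c1', text: 'First!' }); // invalidates
addComment('p1', { id: 'c2', text: 'Second' }); // second write, no rebuild in between
const fresh = getPostWithComments('p1'); // miss: rebuilt once for both writes
console.log(fresh.comments.length, cacheMisses);
```

Two writes in a row cost only one rebuild, which is the "multiple writes between accesses" benefit described above.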

kriszyp · 2022-05-23T13:34:22Z

kriszyp
May 23, 2022
Maintainer

Regarding the second question:

If I have a key that is associated with multiple values, and each value has an id, what would be the best way to delete a value given its id?

If what you are asking is how to do this:

let commentDb = open({ dupSort: true })
commentDb.put('post-key', { name: 'a comment', id: 4 })
commentDb.remove('post-key', /* what goes here to delete the comment with id 4? */)

Technically, what you are asking for is possible:

for (let value of commentDb.getValues('post-key')) {
  if (value.id === 4) {
    commentDb.remove('post-key', value);
    break;
  }
}

However, you probably don't want to store arbitrary data objects as values in dupSort databases. dupSort databases are really designed to hold just keys (that reference other databases) as values, and values in dupSort databases are subject to the same maximum size limit as keys (normally 1978 bytes, unless you have altered the pageSize). So you would be very limited in how much data you can store that way. And, as the example above shows, it is probably undesirable to have to iterate through values to find an entry.

On the other hand, if your comment index database just has values that are keys (and make sure to use encoding: 'ordered-binary' for this), then removing is simple:

commentIndexDb.remove('post-key', 4); // remove the reference
commentDb.remove(4); // and remove the comment from the table of comments

Or, if you use arrays as keys (the third approach above), it is even simpler:

commentDb.remove(['post-key', 4]);

vedantroy · 2022-05-24T00:39:10Z

vedantroy
May 24, 2022
Author

Thanks for the excellent & comprehensive explanations!

A few clarifying questions / thoughts:

For multipart keys, e.g. commentDb.put([postId, generateUniqueCommentId()], commentData): presumably, generateUniqueCommentId must return an integer, right (i.e., it cannot return a UUID)? Can this integer skip around (e.g., what if it increments 2, 5, 8, etc.)?
It seems like I misunderstood something from the documentation. db.remove(key, value) will (in a dupSort database) remove the entry for that key whose value equals value. For some reason, I thought value was the index of the value. E.g., if there are 3 values associated with a key, then db.remove(key, 2) would remove the second value (or the third, if we're using 0-indexing). I thought this because the documentation says the type is number, and I figured number meant index.

Can I do db.remove('post-key', uuid) where uuid is a UUID? There's a bit of an issue where UUIDs are too big to represent as integers in JavaScript, but maybe I can pass in the UUID as a Buffer object instead?

kriszyp · 2022-05-24T01:04:24Z

kriszyp
May 24, 2022
Maintainer

For multipart keys, e.g commentDb.put([postId, generateUniqueCommentId()], commentData). Presumably, generateUniqueCommentId must return an integer right. (e.g, it cannot return a UUID). Can this integer skip around (i.e, what if it increments 2, 5, 8 etc.)

No, they can both/either be strings/UUIDs or any primitive values (as long as the total key size isn't more than 1978 bytes). And yes, they can skip around, be randomly generated, or whatever you want.

I thought this because the documentation says the type is number, and I figured number meant index.

No, it's not the index, it's the actual value. And you are correct: the docs incorrectly state the type as number (it has to be a number if you are specifying the version number, though, and the TS docs do have a correct entry for deleting by value of any type). I will fix that.

Can I do db.remove('post-key', uuid) where uuid is a UUID?

Yes, this will work fine, and is a great choice. You can use UUID strings as the keys and values of dupSort databases (and inside key arrays).

There's a bit of an issue where UUIDs are too big to represent as integers in Javascript, but maybe I can pass in the UUID as a Buffer object instead?

It is worth noting that buffers are directly written as is without any typing, so that means they won't round trip. You can use them as keys for gets and puts, but if you do getRange() and try to read the keys, they won't be preserved as buffers (unless you use a custom encoder). So using strings would be easier. (I have been meaning to make a custom encoder for UUIDs sometime, so they can be written in their compact 16-byte glory).
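For reference, converting between a UUID string and its compact 16-byte form needs only Node's built-in Buffer. The sketch below is not the custom encoder mentioned above and does not show how to plug anything into lmdb-js; uuidToBuffer and bufferToUuid are hypothetical helper names:

```javascript
// Hypothetical helpers: pack a UUID string into 16 bytes and back (Node.js Buffer).
function uuidToBuffer(uuid) {
  const hex = uuid.replace(/-/g, '');
  if (!/^[0-9a-fA-F]{32}$/.test(hex)) throw new Error(`not a UUID: ${uuid}`);
  return Buffer.from(hex, 'hex');
}

function bufferToUuid(buf) {
  if (buf.length !== 16) throw new Error('expected 16 bytes');
  const hex = buf.toString('hex'); // always lowercase
  return [
    hex.slice(0, 8),
    hex.slice(8, 12),
    hex.slice(12, 16),
    hex.slice(16, 20),
    hex.slice(20),
  ].join('-');
}

const id = '123e4567-e89b-12d3-a456-426614174000';
const packed = uuidToBuffer(id);
console.log(packed.length); // 16 bytes instead of a 36-character string
console.log(bufferToUuid(packed) === id);
```

Note that the round trip normalizes to lowercase hex, so an uppercase UUID string would come back in lowercase.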

3 replies

vedantroy May 24, 2022
Author

Thanks again. This clarifies a lot of confusion I had with regard to the dupSort flag.
One final question: what should MAXIMUM_KEY be if I'm using UUIDs for the commentId in [postId, commentId]? When I'm doing a range scan to grab all the comments, I won't know the maximum UUID beforehand. Also, I'm not sure what a "maximum" would be when it comes to UUIDs. Would it just be 99999999-9999-9999-9999-99999999999 or something like that?

kriszyp May 24, 2022
Maintainer

MAXIMUM_KEY is actually an export of the ordered-binary package (the package that handles all the key (de)serialization) that you can use. But it just maps to Buffer.from([0xff]), so you can also use that (no string/primitive can ever be that "high"). Or, with UUIDs in hex, even 'z' is higher than any UUID key.
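The intuition can be checked with plain JavaScript comparisons. This sketch only shows ordinary string and byte ordering in Node.js; it does not reproduce how ordered-binary actually encodes keys:

```javascript
// The highest possible hex UUID string, and every character a hex UUID can contain.
const highestUuid = 'ffffffff-ffff-ffff-ffff-ffffffffffff';
const uuidChars = '0123456789abcdefABCDEF-';

// Every character that can appear in a hex UUID sorts below 'z'.
const zIsHigher = [...uuidChars].every((ch) => ch < 'z');
console.log(zIsHigher);
console.log('z' > highestUuid);

// As raw bytes, 0xff sorts after any UTF-8 text, because UTF-8 never produces the byte 0xff.
const byteOrder = Buffer.compare(Buffer.from([0xff]), Buffer.from(highestUuid, 'utf8'));
console.log(byteOrder);
```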

BlackGlory Dec 20, 2022

It would be nice if the size limit of the key was written in the documentation.
