Now that 2.0 is the default, we should avoid making changes to it, even non-breaking feature changes, so that we can work out all the kinks. There are still a number of things we would like to improve, so we will work on a 2.1 release. I'd like to focus on the following things for 2.1:
Compression for strings, integers (and possibly floats)
Let's enable FSST in 2.1 and add some more tests to confirm stability. For integer compression we need bitpacking / delta / frame of reference. For floating point compression we can investigate ALP, although this is a lower priority (it's not clear that ALP can help much with embeddings, as they tend to be rather compressed already).
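To make the integer compression item concrete, here is a minimal sketch of frame of reference followed by bitpacking. It is an illustration only, not the encoder planned for 2.1: the function names and the byte layout (little-endian, values packed back to back) are assumptions made for this example.

```python
def for_bitpack_encode(values):
    """Frame-of-reference encode, then bitpack the offsets.

    Returns (base, bit_width, packed_bytes). Hypothetical layout for
    illustration; not the on-disk format.
    """
    if not values:
        return 0, 0, b""
    base = min(values)                      # the "frame of reference"
    offsets = [v - base for v in values]    # all offsets are >= 0
    width = max(offsets).bit_length()       # bits per value (0 if all equal)
    packed = 0
    for i, off in enumerate(offsets):
        packed |= off << (i * width)        # place value i at bit i * width
    nbytes = (len(values) * width + 7) // 8
    return base, width, packed.to_bytes(nbytes, "little")


def for_bitpack_decode(base, width, data, count):
    """Inverse of for_bitpack_encode."""
    packed = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [base + ((packed >> (i * width)) & mask) for i in range(count)]


values = [1000, 1003, 1001, 1007]
base, width, data = for_bitpack_encode(values)
# Four 64-bit integers would take 32 bytes; here the payload is 2 bytes.
assert for_bitpack_decode(base, width, data, len(values)) == values
```

Delta encoding would slot in as a step before this one: store differences between consecutive values, then apply frame of reference and bitpacking to the differences.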
1-2 IOPS structural encodings & repetition index
Miniblock and zipped structural encodings will give us 1-2 IOPS per random access (1 for fixed width types and 2 for variable width types) regardless of how many levels of nesting and repetition are present. This should give us maximum performance for random access.
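The difference between 1 and 2 IOPS can be illustrated with a toy reader that counts reads. This is only a sketch of the general idea for flat (non-nested) columns: `CountingFile`, `take_fixed`, and `take_variable` are hypothetical names, and the layout is not the miniblock or zipped layout itself.

```python
class CountingFile:
    """In-memory stand-in for a file that counts read requests (IOPS)."""

    def __init__(self, data):
        self.data = data
        self.reads = 0

    def read(self, offset, length):
        self.reads += 1
        return self.data[offset:offset + length]


def take_fixed(f, row, width=4):
    # Fixed width: the value's position is computable, so one read suffices.
    return int.from_bytes(f.read(row * width, width), "little")


def take_variable(f, row, offsets_start, data_start):
    # Variable width: one read for the (start, end) offsets pair,
    # then a second read for the bytes themselves.
    raw = f.read(offsets_start + row * 4, 8)
    start = int.from_bytes(raw[:4], "little")
    end = int.from_bytes(raw[4:], "little")
    return f.read(data_start + start, end - start)


fixed = CountingFile(b"".join(v.to_bytes(4, "little") for v in [10, 20, 30]))
assert take_fixed(fixed, 2) == 30 and fixed.reads == 1

offsets = b"".join(o.to_bytes(4, "little") for o in [0, 1, 6, 6])
strings = CountingFile(offsets + b"ahello")
assert take_variable(strings, 1, 0, len(offsets)) == b"hello"
assert strings.reads == 2
```

The goal stated above is to keep these counts the same when the column is nested or repeated, which this flat example does not attempt to show.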
Complete row based encodings
We introduced packed struct in 2.0. We should introduce a new encoding which handles variable width types (we can call it packed row or just extend packed struct). In addition, we should make it possible to create a file that is entirely row-major.
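As a rough picture of what a row-major layout with a variable width field could look like, the sketch below packs rows of `(id, name)` back to back and keeps one offset per row. The layout (`<i32 id><u32 length><utf-8 bytes>`) and the names `pack_rows` and `read_row` are assumptions for this example, not a proposed format.

```python
import struct


def pack_rows(rows):
    """Pack (int, str) rows in row-major order.

    Each row is stored as <i32 id><u32 name length><utf-8 name bytes>.
    Returns the buffer and a list of row start offsets (len(rows) + 1 entries).
    """
    buf = bytearray()
    offsets = [0]
    for row_id, name in rows:
        encoded = name.encode("utf-8")
        buf += struct.pack("<iI", row_id, len(encoded))
        buf += encoded
        offsets.append(len(buf))
    return bytes(buf), offsets


def read_row(buf, offsets, i):
    """Read row i: all of its fields sit in one contiguous byte range."""
    start = offsets[i]
    row_id, length = struct.unpack_from("<iI", buf, start)
    name = buf[start + 8:start + 8 + length].decode("utf-8")
    return row_id, name


buf, offsets = pack_rows([(1, "a"), (2, "hello"), (3, "")])
assert read_row(buf, offsets, 1) == (2, "hello")
```

Because every field of a row is contiguous, fetching a whole row touches a single byte range, which is the appeal of row-major storage for point lookups across many columns.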
Simplified priority
The logic for calculating priority (for backpressure) is pretty complicated in 2.0. I believe we have it correct now, but we have to do some expensive calculations (e.g. binary searches into list offsets) to calculate the priority correctly, and there is quite a bit of complexity to handle some corner cases. In 2.1 we will simplify things by always recording the top-level row number on each page. This will be the only change to the file format itself (i.e. apart from new encodings) that I'm aware of.
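The contrast can be sketched as follows, assuming the priority of a page is the top-level row that its first item belongs to (an assumption for this example; the `Page` class and both function names are hypothetical). With only list offsets available, the row has to be found by binary search; with the row number recorded on the page, it is a field lookup.

```python
from bisect import bisect_right
from dataclasses import dataclass


def priority_from_offsets(list_offsets, first_item):
    """Find the top-level row containing item index `first_item`.

    list_offsets[i] is the index of the first item of row i; the last
    entry is the total item count. Empty rows repeat an offset.
    """
    return bisect_right(list_offsets, first_item) - 1


@dataclass
class Page:
    first_item: int
    top_level_row: int  # recorded at write time


def priority_from_page(page):
    """No search needed when the page records its top-level row."""
    return page.top_level_row


# Row 0 holds items 0-2, row 1 is empty, row 2 holds items 3-6, row 3 holds 7-9.
list_offsets = [0, 3, 3, 7, 10]
page = Page(first_item=4, top_level_row=2)
assert priority_from_offsets(list_offsets, page.first_item) == 2
assert priority_from_page(page) == 2
```

With several levels of nesting the search has to be repeated once per level, and empty lists are one example of the corner cases that need care.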
Potential extra features which we may tackle as opportunity permits but are not part of the focus:
Run length encoding (using repetition index)
Enhanced I/O schedulers for NVMe (e.g. io_uring) and RAM (e.g. fully synchronous)
This issue is an umbrella issue that will cover a number of tasks to achieve the above goals.
Tasks:
Better Compression
New Structural Encodings
Row Based Encodings
Simplified Priority