Compressor Optimizer #367
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

@@            Coverage Diff             @@
##               main     #367      +/-  ##
============================================
+ Coverage     68.22%   70.40%    +2.18%
+ Complexity     1128     1063       -65
============================================
  Files           319      305       -14
  Lines         12795    12186      -609
  Branches       1277     1165      -112
============================================
- Hits           8729     8580      -149
+ Misses         3540     3121      -419
+ Partials        526      485       -41

This pull request uses carry forward flags.
So to sum up: this is a non-breaking change; we could compress better (by ~7%) if we re-compressed the full blob each time we attempt to append a block, but doing so is too slow. So you propose to keep the original method, which would give a "preliminary result", and change the result behind the scenes between calls? Wouldn't this have some side effects @jpnovais? (i.e. the compressor could say "we compressed block 1, it takes N bytes, the current blob is now at 100kB", then recompress it with more context and update the internal state to "current blob is 98kB" without notifying the coordinator.)

Re implementation: before introducing an async optimizer, I'd prefer to understand the perf constraints better, i.e. how long it takes now, how long it would take if we recompressed the whole blob at each append, and within what limit we need to operate. If we say the compressor could take as much as XXXms, then we may just want to have simpler, cleaner code and kill this async pattern.
My input on this optimization is: Context:
My take based on the above:
Yep, that would use less CPU if we only do it when the blob is full. But that last call to "CanWrite" may be 10x slower than the previous calls.
It's ok to have a call to CanWrite that takes 500ms at the end of the blob, as long as the preceding calls are not affected timewise.
I agree, doing it synchronously at the end is a good idea. In fact it's similar to the "no compress" logic we already have.
LGTM 👍 I would probably just add one or two tests to ensure this is correctly triggered and that the internal state is correctly reset afterwards.
Signed-off-by: Arya Tabaie <[email protected]>
What does this mean, "based on how much time the compressor has had to optimize", from the coordinator's PoV?
This was for the parallel optimizer, so it no longer applies. Removing it from the description.
@jpnovais Does it look good now?
@Tabaie sure. Sorry for the delay and for not getting back to you on the 1st comment.
Seems good to me. Before merging, can you confirm that by referencing the same buffer when recompressing we don't get internal overwrites of bytes.Buffer which could cause issues?
Also some comments about simplifying error handling.
That said - imo the current approach may be error-prone when wanting to make changes in the future. It heavily relies on reverting the compressor in case the block doesn't fit (or when we only run CanWrite), and the state logic is imo complex enough that it is difficult to follow and test.
What I would recommend is to add a method to clone the compressor using a pool and then perform operations on the cloned compressor. If the write succeeds then we update the reference in BlobMaker; otherwise we discard the cloned compressor (by resetting it and putting it back into the pool). Internally it would imo also require using a pool of bytes.Buffer.
Imo trying to optimize for memory here doesn't make a lot of sense, as the blocks/blobs are in general small (imo a blob is at most 2MB, so at most 10MB of uncompressed data assuming 5x compression) relative to the rest of the memory usage.
From my side this is not a blocker for merging the PR for now, but it would definitely simplify the implementation and guard against bugs in the future.
This PR implements issue #366.
The compressor now attempts to recompress everything from scratch upon encountering a full blob, before attempting to bypass compression altogether.