-
Notifications
You must be signed in to change notification settings - Fork 650
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code #1266
base: unstable
Are you sure you want to change the base?
Conversation
This PR is replacing https://github.com/valkey-io/valkey/pull/692. It consists of 2 commits:
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #1266 +/- ##
============================================
+ Coverage 70.69% 70.70% +0.01%
============================================
Files 114 116 +2
Lines 63161 63230 +69
============================================
+ Hits 44650 44708 +58
- Misses 18511 18522 +11
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly minor comments, and all the tests are still passing which is good. I might take another pass after this to just cleanup some comments for clarity.
3443501
to
639fa6b
Compare
@madolson, I fixed the comments |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It LGTM, only sticky point left is just some follows ups on the info fields. Would appreciate if one of @JimB123 or @zuiderkwast have time to take a look as well to review the jemalloc logic.
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: zvi-code <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: zvi-code <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: zvi-code <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: zvi-code <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
11d69e6
to
39a1f6d
Compare
Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: zvi-code <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]> Signed-off-by: zvi-code <[email protected]>
@valkey-io/core-team this PR replaces previous PR due to git history issues i could not fix. Can we put the 8.1 target (and if it's important than also mark with |
… required) Signed-off-by: Zvi Schneider <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! This looks very close merging now. Just some nits and some questions.
#define getBinindNormal(_sz, _offset, _last_sz_pow2) \ | ||
((SIZE_CLASS_GROUP_SZ - (((1 << (_last_sz_pow2)) - (_sz)) >> ((_last_sz_pow2) - LG_QUANTOM_8_FIRST_POW2))) + \ | ||
(((_last_sz_pow2) - (LG_QUANTOM_8_FIRST_POW2 + 3)) - 1) * SIZE_CLASS_GROUP_SZ + (_offset)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be a static inline
function instead? It'd be easier to read.
// following groups have SIZE_CLASS_GROUP_SZ size-class that are | ||
uint64_t last_sz_in_group_pow2 = 64 - __builtin_clzll(sz - 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
following groups have SIZE_CLASS_GROUP_SZ size-class that are ... what? Unexpected end of sentence?
+nit: We generally use /* */
comments.
/* ----------------------------------------------------------------------------- | ||
* Helper functions for jemalloc translation between size and index | ||
* -------------------------------------------------------------------------- */ | ||
#define LG_QUANTOM_8_FIRST_POW2 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quantum 8 = lg-quantum 3, right? Should this be LG_QUANTUM_3_FIRST_POW2
?
Btw, quantum is misspelled.
#define LG_QUANTOM_8_FIRST_POW2 3 | ||
#define SIZE_CLASS_GROUP_SZ 4 | ||
|
||
#define LG_QUANTOM_OFFSET_3 ((64 >> LG_QUANTOM_8_FIRST_POW2) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quantum misspelled.
* Note: In case future PR will return the binind (that is better API anyway) we can get rid of | ||
* these conversion functions | ||
*/ | ||
inline unsigned jeSize2BinIndexLgQ3(size_t sz) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline unsigned jeSize2BinIndexLgQ3(size_t sz) { | |
static inline unsigned jeSize2BinIndexLgQ3(size_t sz) { |
Whenever we use inline
, use static inline
instead, or only static
. Only inline
by itself is a compiler hint that is more or less ignored and doesn't have a clear meaning.
// lg-quantum should be 3 | ||
assert(jemalloc_quantum == 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much of this code depends on lg-quantum == 3? Just wondering because if we some day want to change to the default 4, how hard would that be?
*note: this PR replaced prior PR
Summary of the change
This is a base PR for refactoring defrag. It moves the defrag logic to rely on jemalloc native api instead of relying on custom code changes made by valkey in the jemalloc (je_defrag_hint) library. This enables valkey to use latest vanila jemalloc without the need to maintain code changes cross jemalloc versions.
This change requires some modifications because the new api is providing only the information, not a yes\no defrag. The logic needs to be implemented at valkey code. Additionally, the api does not provide, within single call, all the information needed to make a decision, this information is available through additional api call. To reduce the calls to jemalloc, in this PR the required information is collected during the
computeDefragCycles
and not for every single ptr, this way we are avoiding the additional api call.Followup work will utilize the new options that are now open and will further improve the defrag decision and process.
Added files:
allocator_defrag.c
/allocator_defrag.h
- This files implement the allocator specific knowledge for making defrag decision. The knowledge about slabs and allocation logic and so on, all goes into this file. This improves the separation between jemalloc specific code and other possible implementation.Moved functions:
zmalloc_no_tcache
,zfree_no_tcache
- these are very jemalloc specific logic assumptions, and are very specific to how we defrag with jemalloc. This is also with the vision that from performance perspective we should consider using tcache, we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcache, one for free and one for alloc].frag_smallbins_bytes
- the logic and implementation moved to the new fileExisting API:
computeDefragCycles
zmalloc_get_allocator_info
: gets from jemalloc allocated, active, resident, retained, muzzy,frag_smallbins_bytes
frag_smallbins_bytes
: for each bin; gets from jemalloc bin_info,curr_regs
,cur_slabs
je_defrag_hint
is getting a memory pointer and returns {0,1} . Internally it uses this information points:nonfull_slabs
total_slabs
Jemalloc API (via ctl interface)
[BATCH]
experimental_utilization_batch_query_ctl
: gets an array of pointers, returns for each pointer 3 values,[EXTENDED]
experimental_utilization_query_ctl
:experimental_utilization_batch_query_ctl
vs valkeyje_defrag_hint
?[good]
[bad]
nonfull_slabs
and total allocated regs.New functions:
defrag_jemalloc_init
: Reducing the cost of call to je_ctl: use the MIB interface to get a faster calls. See this quote from the jemalloc documentation:The mallctlnametomib() function provides a way to avoid repeated name lookups for
applications that repeatedly query the same portion of the namespace,by translating
a name to a “Management Information Base” (MIB) that can be passed repeatedly to
mallctlbymib().
jemalloc_sz2binind_lgq*
: this api is to support reverse map between bin size and it’s info without lookup. This mapping depends on the number of size classes we have that are derived fromlg_quantum
defrag_jemalloc_get_frag_smallbins
: This function replacesfrag_smallbins_bytes
the logic moved to the new file allocator_defragdefrag_jemalloc_should_defrag_multi
→handle_results
- unpacks the resultsshould_defrag
: implements the same logic as the existing implementation inside je_defrag_hintdefrag_jemalloc_should_defrag_multi
: implements the hint for an array of pointers, utilizing the new batch api. currently only 1 pointer is passed.Logical differences:
In order to get the information about #
nonfull_slabs
and #regs
, we use the query cycle to collect the information per size class. In order to find the index of bin information given bin size, in o(1), we usejemalloc_sz2binind_lgq*
.Testing
This is the first draft. I did some initial testing that basically fragmentation by reducing max memory and than waiting for defrag to reach desired level. The test only serves as sanity that defrag is succeeding eventually, no data provided here regarding efficiency and performance.
Test:
activedefrag
used_memory
reaches 10GBmaxmemory
to 5GB andmaxmemory-policy
toallkeys-lru
mem_fragmentation_ratio
to reach 2activedefrag
- start test timermem_fragmentation_ratio
= 1.1Results*:
(With this PR)Test results:
56 sec
(Without this PR)Test results:
67 sec
*both runs perform same "work" number of buffers moved to reach fragmentation target
Next benchmarking is to compare to:
je_get_defrag_hint
int defrag_hint() {return 1;}