Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code #1266

Open
wants to merge 15 commits into
base: unstable
Choose a base branch
from

Conversation

zvi-code
Copy link

@zvi-code zvi-code commented Nov 5, 2024

*note: this PR replaced prior PR

Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to rely on jemalloc native api instead of relying on custom code changes made by valkey in the jemalloc (je_defrag_hint) library. This enables valkey to use latest vanila jemalloc without the need to maintain code changes cross jemalloc versions.

This change requires some modifications because the new api is providing only the information, not a yes\no defrag. The logic needs to be implemented at valkey code. Additionally, the api does not provide, within single call, all the information needed to make a decision, this information is available through additional api call. To reduce the calls to jemalloc, in this PR the required information is collected during the computeDefragCycles and not for every single ptr, this way we are avoiding the additional api call.
Followup work will utilize the new options that are now open and will further improve the defrag decision and process.

Added files:

allocator_defrag.c / allocator_defrag.h - This files implement the allocator specific knowledge for making defrag decision. The knowledge about slabs and allocation logic and so on, all goes into this file. This improves the separation between jemalloc specific code and other possible implementation.

Moved functions:

zmalloc_no_tcache , zfree_no_tcache - these are very jemalloc specific logic assumptions, and are very specific to how we defrag with jemalloc. This is also with the vision that from performance perspective we should consider using tcache, we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcache, one for free and one for alloc].
frag_smallbins_bytes - the logic and implementation moved to the new file

Existing API:

  • [once a second + when completed full cycle] computeDefragCycles
    • zmalloc_get_allocator_info : gets from jemalloc allocated, active, resident, retained, muzzy, frag_smallbins_bytes
    • frag_smallbins_bytes : for each bin; gets from jemalloc bin_info, curr_regs, cur_slabs
  • [during defrag, for each pointer]
    • je_defrag_hint is getting a memory pointer and returns {0,1} . Internally it uses this information points:
      • #nonfull_slabs
      • #total_slabs
      • #free regs in the ptr slab

Jemalloc API (via ctl interface)

[BATCH]experimental_utilization_batch_query_ctl : gets an array of pointers, returns for each pointer 3 values,

  • number of free regions in the extent
  • number of regions in the extent
  • size of the extent in terms of bytes

[EXTENDED]experimental_utilization_query_ctl :

  • memory address of the extent a potential reallocation would go into
  • number of free regions in the extent
  • number of regions in the extent
  • size of the extent in terms of bytes
  • [stats-enabled]total number of free regions in the bin the extent belongs to
  • [stats-enabled]total number of regions in the bin the extent belongs to

experimental_utilization_batch_query_ctl vs valkey je_defrag_hint?

[good]

  • We can query pointers in a batch, reduce the overall overhead
  • The per ptr decision algorithm is not within jemalloc api, jemalloc only provides information, valkey can tune\configure\optimize easily

[bad]

  • In the batch API we only know the utilization of the slab (of that memory ptr), we don’t get the data about #nonfull_slabs and total allocated regs.

New functions:

  1. defrag_jemalloc_init: Reducing the cost of call to je_ctl: use the MIB interface to get a faster calls. See this quote from the jemalloc documentation:

    The mallctlnametomib() function provides a way to avoid repeated name lookups for
    applications that repeatedly query the same portion of the namespace,by translating
    a name to a “Management Information Base” (MIB) that can be passed repeatedly to
    mallctlbymib().

  2. jemalloc_sz2binind_lgq* : this api is to support reverse map between bin size and it’s info without lookup. This mapping depends on the number of size classes we have that are derived from lg_quantum

  3. defrag_jemalloc_get_frag_smallbins : This function replaces frag_smallbins_bytes the logic moved to the new file allocator_defrag
    defrag_jemalloc_should_defrag_multihandle_results - unpacks the results

  4. should_defrag : implements the same logic as the existing implementation inside je_defrag_hint

  5. defrag_jemalloc_should_defrag_multi : implements the hint for an array of pointers, utilizing the new batch api. currently only 1 pointer is passed.

Logical differences:

In order to get the information about #nonfull_slabs and #regs, we use the query cycle to collect the information per size class. In order to find the index of bin information given bin size, in o(1), we use jemalloc_sz2binind_lgq* .

Testing

This is the first draft. I did some initial testing that basically fragmentation by reducing max memory and than waiting for defrag to reach desired level. The test only serves as sanity that defrag is succeeding eventually, no data provided here regarding efficiency and performance.

Test:

  1. disable activedefrag
  2. run valkey benchmark on overlapping address ranges with different block sizes
  3. wait untill used_memory reaches 10GB
  4. set maxmemory to 5GB and maxmemory-policy to allkeys-lru
  5. stop load
  6. wait for mem_fragmentation_ratio to reach 2
  7. enable activedefrag - start test timer
  8. wait until reach mem_fragmentation_ratio = 1.1

Results*:

(With this PR)Test results: 56 sec
(Without this PR)Test results: 67 sec

*both runs perform same "work" number of buffers moved to reach fragmentation target

Next benchmarking is to compare to:

  • DONE // existing je_get_defrag_hint
  • compare with naive defrag all: int defrag_hint() {return 1;}

@zvi-code zvi-code changed the title (fixes over previous closed PR #692)defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code (fixes over previous closed PR #692) Nov 5, 2024
@zvi-code
Copy link
Author

zvi-code commented Nov 5, 2024

This PR is replacing https://github.com/valkey-io/valkey/pull/692.

It consists of 2 commits:

  1. A cherry-pick of the original commit from PR 692 (on top of unstable)
  2. A CR fixes commit to address the comments made on PR 692

@zvi-code zvi-code changed the title defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code (fixes over previous closed PR #692) defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code (replaces https://github.com/valkey-io/valkey/pull/692) Nov 5, 2024
@zvi-code zvi-code changed the title defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code (replaces https://github.com/valkey-io/valkey/pull/692) defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code Nov 5, 2024
Copy link

codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 98.29060% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.70%. Comparing base (2df56d8) to head (41a9175).
Report is 4 commits behind head on unstable.

Files with missing lines Patch % Lines
src/server.c 33.33% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #1266      +/-   ##
============================================
+ Coverage     70.69%   70.70%   +0.01%     
============================================
  Files           114      116       +2     
  Lines         63161    63230      +69     
============================================
+ Hits          44650    44708      +58     
- Misses        18511    18522      +11     
Files with missing lines Coverage Δ
src/allocator_defrag.c 100.00% <100.00%> (ø)
src/defrag.c 84.68% <100.00%> (-1.58%) ⬇️
src/zmalloc.c 82.60% <100.00%> (-2.07%) ⬇️
src/server.c 87.66% <33.33%> (-0.03%) ⬇️

... and 20 files with indirect coverage changes

@madolson madolson self-requested a review November 7, 2024 16:33
Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly minor comments, and all the tests are still passing which is good. I might take another pass after this to just cleanup some comments for clarity.

src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
@zvi-code zvi-code force-pushed the align_defrag_vanila_fix_history branch from 3443501 to 639fa6b Compare November 9, 2024 21:26
@zvi-code
Copy link
Author

zvi-code commented Nov 9, 2024

@madolson, I fixed the comments

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It LGTM, only sticky point left is just some follows ups on the info fields. Would appreciate if one of @JimB123 or @zuiderkwast have time to take a look as well to review the jemalloc logic.

src/allocator_defrag.c Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
src/allocator_defrag.c Outdated Show resolved Hide resolved
Zvi Schneider and others added 12 commits November 13, 2024 15:06
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Signed-off-by: zvi-code <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Signed-off-by: zvi-code <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Signed-off-by: zvi-code <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Signed-off-by: zvi-code <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
Signed-off-by: Zvi Schneider <[email protected]>
@zvi-code zvi-code force-pushed the align_defrag_vanila_fix_history branch from 11d69e6 to 39a1f6d Compare November 13, 2024 13:39
zvi-code and others added 2 commits November 13, 2024 16:01
Co-authored-by: Madelyn Olson <[email protected]>
Signed-off-by: zvi-code <[email protected]>
Co-authored-by: Madelyn Olson <[email protected]>
Signed-off-by: zvi-code <[email protected]>
@zvi-code
Copy link
Author

@valkey-io/core-team this PR replaces previous PR due to git history issues i could not fix. Can we put the 8.1 target (and if it's important than also mark with major decision approved)

@zuiderkwast
Copy link
Contributor

this PR replaces #692 due to git history issues i could not fix

@zvi-code OK. It's possible to completely replace the content of a PR by force-pushing completely new commits to the same branch name though. Next time. :)

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! This looks very close merging now. Just some nits and some questions.

Comment on lines +122 to +124
#define getBinindNormal(_sz, _offset, _last_sz_pow2) \
((SIZE_CLASS_GROUP_SZ - (((1 << (_last_sz_pow2)) - (_sz)) >> ((_last_sz_pow2) - LG_QUANTOM_8_FIRST_POW2))) + \
(((_last_sz_pow2) - (LG_QUANTOM_8_FIRST_POW2 + 3)) - 1) * SIZE_CLASS_GROUP_SZ + (_offset))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a static inline function instead? It'd be easier to read.

Comment on lines +139 to +140
// following groups have SIZE_CLASS_GROUP_SZ size-class that are
uint64_t last_sz_in_group_pow2 = 64 - __builtin_clzll(sz - 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

following groups have SIZE_CLASS_GROUP_SZ size-class that are ... what? Unexpected end of sentence?

+nit: We generally use /* */ comments.

/* -----------------------------------------------------------------------------
* Helper functions for jemalloc translation between size and index
* -------------------------------------------------------------------------- */
#define LG_QUANTOM_8_FIRST_POW2 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quantum 8 = lg-quantum 3, right? Should this be LG_QUANTUM_3_FIRST_POW2?

Btw, quantum is misspelled.

#define LG_QUANTOM_8_FIRST_POW2 3
#define SIZE_CLASS_GROUP_SZ 4

#define LG_QUANTOM_OFFSET_3 ((64 >> LG_QUANTOM_8_FIRST_POW2) - 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quantum misspelled.

* Note: In case future PR will return the binind (that is better API anyway) we can get rid of
* these conversion functions
*/
inline unsigned jeSize2BinIndexLgQ3(size_t sz) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
inline unsigned jeSize2BinIndexLgQ3(size_t sz) {
static inline unsigned jeSize2BinIndexLgQ3(size_t sz) {

Whenever we use inline, use static inline instead, or only static. Only inline by itself is a compiler hint that is more or less ignored and doesn't have a clear meaning.

Comment on lines +222 to +223
// lg-quantum should be 3
assert(jemalloc_quantum == 8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much of this code depends on lg-quantum == 3? Just wondering because if we some day want to change to the default 4, how hard would that be?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants