
Evaluate block replacement algorithms that receive attention for hit-ratio improvement #200

Open
jserv opened this issue Aug 30, 2023 · 21 comments
Labels: research (Study certain topics)

@jserv
Contributor

jserv commented Aug 30, 2023

Block replacement algorithms continue to receive attention for improving their hit ratios. Numerous replacement algorithms have been proposed; among them, LIRS stands out with a consistently higher hit ratio across various workloads while incurring low time and space overheads. However, there are still access patterns for which LIRS yields a sub-optimal hit ratio and leaves room for further improvement.

Recently, LIRS2 has been introduced, incorporating a new measure into the LIRS algorithm aimed at reducing its miss ratio. In extensive experiments on traces from various sources, LIRS2 has consistently demonstrated a lower cache miss ratio with minimal overhead.

Quoted from LIRS2: An Improved LIRS Replacement Algorithm:

  • For a looping access pattern, when the cache is not large enough to hold all the blocks, LRU has a 100% miss ratio.
  • Algorithms that consider access frequency, such as LFU and LRFU, are able to perform well on this access pattern. However, such algorithms can be too expensive, as they need to track every block's access frequency and cannot respond quickly to access pattern changes.
  • At the same time, we identify access patterns where LIRS has a critical weakness or room for improvement. One illustrative example is a zigzag access pattern, where a file or data set is repeatedly accessed in alternating order, such as reading from the file head to its tail and then reversing the order.
  • The LIRS2 algorithm is an enhancement of LIRS that replaces the reuse distance in its locality measure with the sum of the two most recent reuse distances.
  • LFU is not responsive to access pattern changes and often performs poorly for dynamic workloads. Additionally, it can be too expensive in a practical system.
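To make the quoted locality measure concrete, here is a minimal sketch in C. It uses a simplified notion of reuse distance (elapsed virtual time between accesses rather than the distinct-block count used in the papers), and the names block_meta_t, lirs_measure, and lirs2_measure are invented for illustration, not taken from either paper or from rv32emu:

/* Track the three most recent access times of a block and derive the
 * LIRS-style and LIRS2-style locality measures from them. */
#include <stdint.h>

typedef struct {
    uint64_t t[3]; /* virtual times of the three most recent accesses; t[0] is the newest */
} block_meta_t;

/* LIRS-style measure: the single most recent reuse distance
 * (smaller means the block looks "hotter"). */
static uint64_t lirs_measure(const block_meta_t *b)
{
    return b->t[0] - b->t[1];
}

/* LIRS2-style measure: the sum of the two most recent reuse distances,
 * which collapses to the span since the third-most-recent access. */
static uint64_t lirs2_measure(const block_meta_t *b)
{
    return (b->t[0] - b->t[1]) + (b->t[1] - b->t[2]); /* == t[0] - t[2] */
}

/* Shift in the new access time on every reference to the block. */
static void block_touch(block_meta_t *b, uint64_t now)
{
    b->t[2] = b->t[1];
    b->t[1] = b->t[0];
    b->t[0] = now;
}

Because the LIRS2 measure spans two consecutive reuse intervals, a single out-of-pattern access (as in the zigzag example) perturbs a block's ranking less than it would under LIRS.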

Reference: LIRS2: An Improved LIRS Replacement Algorithm (SYSTOR '21)

@visitorckw
Collaborator

I attempted to implement an LIRS cache and conducted performance testing.
Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.
Here are the data:

  • Test machine: Intel Core i5-10500

Metric (higher is better)   LFU        ARC        LIRS
Coremark                    2299.937   2099.059   2065.093
Dhrystone                   3706.87    3376.85    3374.75

Reference:
Code - LIRS cache implementation

@jserv
Contributor Author

jserv commented Oct 22, 2023

I attempted to implement an LIRS cache and conducted performance testing. Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.

Lately, we have encountered unexpected code generation overhead from MIR. As an experiment, @qwe661234 is testing clang as the backend for JIT code generation instead of MIR. Please switch to the git branch wip/research_jit, which is expected to be incorporated into his technical reports: https://github.com/qwe661234/rv32emu/tree/wip/research_jit
(Make sure that MIR macro is unset when JIT configuration is enabled.)

@jserv
Contributor Author

jserv commented Oct 22, 2023

Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.

In addition to the typical benchmark suite, which includes CoreMark and Dhrystone, we also make use of additional programs such as scimark2, richards, and nqueens. You can locate the corresponding prebuilt ELF executable files in the build directory.

@visitorckw
Collaborator

I attempted to implement an LIRS cache and conducted performance testing. Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.

Lately, we have encountered unexpected code generation overhead from MIR. As an experiment, @qwe661234 is testing clang as the backend for JIT code generation instead of MIR. Please switch to the git branch wip/research_jit, which is expected to be incorporated into his technical reports: https://github.com/qwe661234/rv32emu/tree/wip/research_jit (Make sure that MIR macro is unset when JIT configuration is enabled.)

It seems that the code on the "wip/research_jit" branch is unable to pass the "make check" test with JIT enabled.
Here's the output:

Running hello.elf ... [OK]
Running puzzle.elf ... [OK]
Running pi.elf ... Failed.
make: *** [Makefile:174: check] Error 1

@visitorckw
Collaborator

Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.

In addition to the typical benchmark suite, which includes CoreMark and Dhrystone, we also make use of additional programs such as scimark2, richards, and nqueens. You can locate the corresponding prebuilt ELF executable files in the build directory.

Nice!
I'll make use of these benchmark programs for testing.

@visitorckw
Collaborator

I've noticed that when using the code from the 'wip/jit' branch with MIR as the JIT backend and running the scimark2 benchmark with my LIRS cache implementation, I encountered a use-after-free error. It's possible that there is an issue with my current LIRS implementation, or there could be a problem in the JIT-related code. I will examine the relevant code more thoroughly to pinpoint the exact cause of the error.

@qwe661234
Collaborator

Actually, the function destroy_cache has a problem I haven't solved yet.

@visitorckw
Collaborator

After conducting further testing, I have observed that regardless of the cache replacement algorithm used, the cache hit ratio consistently remains above 99.99%. Therefore, I believe that the primary factor influencing performance may be the overhead of cache_get() operations, despite their theoretical O(1) time complexity. Based on the current test results, I would recommend continuing to use LFU as the default cache replacement algorithm.

Here is the testing data:

Execution time (ms)   ARC        LFU        LRU        LIRS       FIFO
Coremark              20614.7    18755.07   20083.02   21027.97   18963.75
Dhrystone             1677.52    1535.485   1620.851   1683.112   1564.062
Nqueens               3555.8     3193.08    3364.245   3507.252   3198.479
Richards              583.442    518.0965   544.4019   585.9974   551.1093

Hit ratio             ARC        LFU        LRU        LIRS       FIFO
Coremark              0.99999    0.999997   0.999997   0.999996   0.999997
Dhrystone             0.99999    0.999986   0.999986   0.999986   0.999986
Nqueens               0.99999    0.999991   0.999991   0.999991   0.999991
Richards              0.99997    0.999966   0.999966   0.999966   0.999966
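For a rough sense of why miss-ratio differences this small stop mattering, plug hypothetical costs into the usual average-lookup-cost formula (the 50 ns hit cost and 5 µs miss cost are made-up figures for illustration, not measurements from rv32emu):

\[
T_{\text{avg}} = h \, t_{\text{hit}} + (1 - h) \, t_{\text{miss}}
              = 0.99999 \times 50\,\text{ns} + 0.00001 \times 5\,\mu\text{s}
              \approx 50\,\text{ns} + 0.05\,\text{ns}
\]

At hit ratios this high the hit path dominates completely, which is consistent with the execution-time differences above tracking the per-lookup cost of each policy rather than its miss ratio.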

@jserv
Contributor Author

jserv commented Nov 19, 2023

After conducting further testing, I have observed that regardless of the cache replacement algorithm used, the cache hit ratio consistently remains above 99.99%. Therefore, I believe that the primary factor influencing performance may be the overhead of cache_get() operations, despite their theoretical O(1) time complexity. Based on the current test results, I would recommend continuing to use LFU as the default cache replacement algorithm.

I am concerned that benchmark programs like CoreMark and Dhrystone may not accurately represent real-world workloads, especially when applying this RISC-V instruction emulation within a system emulation context, similar to what semu achieves. Can you suggest empirical approaches for utilizing cache replacement policies in system emulation?

@visitorckw
Collaborator

I acknowledge the possibility that CoreMark or Dhrystone may exhibit significant differences from real-world workloads. Perhaps, it would be prudent for us to explore alternative benchmarks that more accurately represent the actual usage scenarios. I will make an effort to identify more suitable benchmarks for our testing purposes.

@jserv
Contributor Author

jserv commented Nov 20, 2023

I acknowledge the possibility that CoreMark or Dhrystone may exhibit significant differences from real-world workloads. Perhaps, it would be prudent for us to explore alternative benchmarks that more accurately represent the actual usage scenarios. I will make an effort to identify more suitable benchmarks for our testing purposes.

If you are exploring semu, please note that there is a fork called semu-c64 which includes additional features like insn_count for enhanced analysis and checkpoint functionality.

@visitorckw
Collaborator

visitorckw commented Nov 21, 2023

If you are exploring semu, please note that there is a fork called semu-c64 which includes additional features like insn_count for enhanced analysis and checkpoint functionality.

Thanks for the information! I'll take a look at semu-c64.

jserv added the enhancement (New feature or request) and research (Study certain topics) labels and removed the enhancement label on Dec 25, 2023
@jserv
Contributor Author

jserv commented Dec 30, 2023

libCacheSim is

  • a high-performance cache simulator for running cache simulations.
  • a high-performance and versatile trace analyzer for analyzing different cache traces.
  • a high-performance library for building cache simulators.

@visitorckw
Collaborator

I noticed that the JIT compiler's code cache in the JVM doesn't utilize any cache replacement algorithm. Instead, when the code cache is full, it either disables the JIT compiler or flushes the code cache (depending on the configuration). This surprised me, and I haven't delved into the reasons behind this approach.

@visitorckw
Collaborator

My speculation is that in real workloads, the number of cache_get operations is typically much larger than the number of cache_put operations, yet only cache_put may need to select a victim for eviction from the cache. Therefore, trying to reduce the cache miss ratio at the cost of increasing the overhead of cache_get operations (even though this overhead is usually low) might not be cost-effective. However, this assumption requires further experimentation and validation.
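One low-effort way to test this speculation is to count both operations and report the ratio at exit. The snippet below is a hypothetical instrumentation sketch; the hook names are invented and are not part of rv32emu's cache API:

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t n_get, n_put, n_hit;

/* Call from the lookup path (e.g. inside cache_get), passing whether it hit. */
static inline void stats_on_get(bool hit)
{
    n_get++;
    if (hit)
        n_hit++;
}

/* Call from the insertion path (e.g. inside cache_put). */
static inline void stats_on_put(void)
{
    n_put++;
}

/* Register once with atexit(stats_report) so the summary prints on exit. */
static void stats_report(void)
{
    fprintf(stderr,
            "gets=%" PRIu64 " (hit ratio %.6f) puts=%" PRIu64 " get/put=%.1f\n",
            n_get, n_get ? (double) n_hit / n_get : 0.0, n_put,
            n_put ? (double) n_get / n_put : 0.0);
}

If the get/put ratio turns out to be very large, that argues for keeping the hit path as cheap as possible (as LFU and FIFO appear to do in the tables above) rather than paying per-lookup overhead to squeeze out a few more hits.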

@jserv
Contributor Author

jserv commented Dec 31, 2023

I noticed that the JIT compiler's code cache in the JVM doesn't utilize any cache replacement algorithm. Instead, when the code cache is full, it either disables the JIT compiler or flushes the code cache (depending on the configuration). This surprised me, and I haven't delved into the reasons behind this approach.

@qwe661234, can you clarify this?

@qwe661234
Collaborator

I noticed that the JIT compiler's code cache in the JVM doesn't utilize any cache replacement algorithm. Instead, when the code cache is full, it either disables the JIT compiler or flushes the code cache (depending on the configuration). This surprised me, and I haven't delved into the reasons behind this approach.

@qwe661234, can you clarify this?

Sure. I believe the reason the JVM flushes the code cache when it's full is similar to the implementation in QEMU. The runtime-generated machine code is stored in contiguous memory, and its size varies. The concept is akin to OS memory management: if we don't divide the machine code into fixed-size units like memory pages, problems may arise. For instance, if the new machine code is larger than the block it replaces, it might cause an overflow. Conversely, if the new machine code is smaller than the replaced one, it could lead to fragmentation. Therefore, flushing the code cache when it is full is a simpler way to address these issues.
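As a concrete illustration of the flush-when-full scheme described above, here is a minimal sketch: translated blocks are bump-allocated from one contiguous executable buffer, and when a new block no longer fits, the whole buffer plus the guest-PC-to-host-code map is discarded. All names and sizes are hypothetical, not taken from rv32emu, QEMU, or the JVM:

#include <stddef.h>
#include <stdint.h>

#define CODE_CACHE_SIZE (16u << 20) /* 16 MiB of executable memory */

typedef struct {
    uint8_t *base;  /* start of the executable buffer (e.g. from mmap) */
    size_t offset;  /* bump pointer: next free byte */
} code_cache_t;

/* Hypothetical hook that invalidates every guest-PC -> host-code mapping. */
extern void block_map_clear(void);

static void code_cache_flush(code_cache_t *cc)
{
    cc->offset = 0;
    block_map_clear(); /* all previously emitted blocks are now invalid */
}

/* Reserve `size` bytes for a newly translated block. Blocks have variable
 * sizes, so freeing an individual one would leave holes (external
 * fragmentation); the simple policy is: if it does not fit, flush everything
 * and start over. A real implementation would also reject any single block
 * larger than CODE_CACHE_SIZE. */
static uint8_t *code_cache_alloc(code_cache_t *cc, size_t size)
{
    if (cc->offset + size > CODE_CACHE_SIZE)
        code_cache_flush(cc);
    uint8_t *p = cc->base + cc->offset;
    cc->offset += size;
    return p;
}

Compared with evicting individual translated blocks, this trades occasional re-translation bursts for an allocator that is O(1) and never fragments.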

@jserv
Contributor Author

jserv commented Jan 3, 2024

For instance, if the new machine code is larger than the block it replaces, it might cause an overflow. Conversely, if the new machine code is smaller than the replaced one, it could lead to fragmentation. Therefore, flushing the code cache when it is full is a simpler way to address these issues.

I attempted to run build/donut.elf using the JIT-enabled build and observed that the animation didn't perform as expected. In contrast to the usual smooth movement of a spinning donut, this animation was uneven and failed to render consistently over time. I'm unsure whether the intermittent lag observed is due to the current tier-1 JIT compiler's practice of periodically flushing the emitted code.

@qwe661234
Collaborator

I can confirm that the problem does not arise from this. We haven't designed any mechanism to flush the cache, so if the code cache is full, the program simply aborts. I will design this later.

@jserv
Contributor Author

jserv commented Feb 5, 2024

For scenarios that require caching, a good cache eviction strategy can improve the cache hit rate. The most commonly used cache eviction strategies are LFU (Least Frequently Used) and LRU (Least Recently Used).

LFU needs to maintain the frequency of each element, and for accuracy it requires maintaining global frequencies, which can bring significant space costs. Moreover, once an element has built up a high frequency, it becomes difficult to evict, even though in reality access frequency changes over time.
LRU is an excellent cache eviction strategy, but it requires more space to ensure that elements are not evicted before their next access.

W-TinyLFU, on the other hand, is an eviction strategy that comprehensively addresses the issues encountered in real scenarios: it achieves excellent hit rates and adapts well to changing workloads, but it is relatively complex to implement. If such a high hit rate is not required, SLRU is also worth considering, as it still achieves a good hit rate.
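For a flavor of what makes (W-)TinyLFU effective, below is a compact sketch of its admission idea: keep approximate access frequencies in a small count-min sketch with periodic aging, and admit a candidate into the cache only if it is estimated to be more frequent than the block that would be evicted. This is an illustrative sketch with invented names and sizes, not Caffeine's or any other library's implementation:

#include <stdbool.h>
#include <stdint.h>

#define CMS_ROWS 4
#define CMS_COLS 1024 /* must be a power of two */

typedef struct {
    uint8_t counters[CMS_ROWS][CMS_COLS]; /* 4-bit values stored in bytes */
    uint64_t additions;                   /* triggers periodic aging */
} tinylfu_t;

static uint32_t cms_hash(uint32_t key, uint32_t row)
{
    /* Simple multiplicative hashing; any reasonable per-row hash works. */
    static const uint32_t seeds[CMS_ROWS] = {
        0x9e3779b1u, 0x85ebca77u, 0xc2b2ae3du, 0x27d4eb2fu
    };
    return (key * seeds[row]) >> 22; /* top 10 bits -> index in [0, 1023] */
}

/* Record one access to `key`, and periodically halve all counters so stale
 * frequencies decay and the policy adapts to access pattern changes. */
static void tinylfu_increment(tinylfu_t *t, uint32_t key)
{
    for (uint32_t r = 0; r < CMS_ROWS; r++) {
        uint8_t *c = &t->counters[r][cms_hash(key, r)];
        if (*c < 15)
            (*c)++;
    }
    if (++t->additions % (10 * CMS_COLS) == 0)
        for (uint32_t r = 0; r < CMS_ROWS; r++)
            for (uint32_t i = 0; i < CMS_COLS; i++)
                t->counters[r][i] >>= 1;
}

/* Estimated frequency of `key`: the minimum counter across the rows. */
static uint8_t tinylfu_estimate(const tinylfu_t *t, uint32_t key)
{
    uint8_t min = 15;
    for (uint32_t r = 0; r < CMS_ROWS; r++) {
        uint8_t c = t->counters[r][cms_hash(key, r)];
        if (c < min)
            min = c;
    }
    return min;
}

/* Admission: insert the candidate only if it is estimated to be accessed
 * more frequently than the block the eviction policy would remove. */
static bool tinylfu_admit(const tinylfu_t *t, uint32_t candidate, uint32_t victim)
{
    return tinylfu_estimate(t, candidate) > tinylfu_estimate(t, victim);
}

The "W" in W-TinyLFU additionally places a small LRU admission window in front of the main (SLRU) segment, so that brand-new blocks get a chance to build up frequency before facing this filter.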

Reference:

@ben-manes

ben-manes commented Feb 7, 2024

I'd recommend capturing workload traces and running them through a simulator like Caffeine's. That is extremely helpful for seeing how the policy actually performs and for making a choice that balances versatility, simplicity, efficiency, etc.

As a general-purpose library, Caffeine can't make many assumptions, as our users expect us to excel in all cases. That means databases and analytics (MRU-biased), event-driven messaging (LRU-biased), or the typical in-between (mixed LRU/LFU). Some will want concurrency with high throughput, low latency, and linearizable behavior (like computations). Then add O(1), low memory footprint, features, and an easy-to-use API. The adaptivity allows us to have superior or competitive hit rates in all workloads, and I'll test on traces from new papers in case there are any weaknesses to investigate. LIRS (v1 and v2) is also excellent in my experience and makes a similar attempt at being a strong general-purpose algorithm.

Recently a CMU paper claimed to have significantly better results, but the author didn't show their data. When probing them and running their traces, I found that this difference was 2% in a single cherry-picked case, while their policy did miserably across a variety of other real workloads. That's in fact perfectly fine and not meant to discredit their work. It's quite nice to have a variety of choices that are vetted for specialized workloads so that you can make the right tradeoffs. I would only say it is important to validate, as some research papers are more marketing than substance. I'd also run the author's simulators to avoid bias, as I have observed both less-than-honest work and innocent bugs.

If you can narrow down on a set of tradeoffs (what you want and don't care about) then I can point you at likely candidates. But if you don't have a strong inclination now then I'd recommend choosing something simple to maintain that is good enough so that you can swap it out later once you know more if necessary (aka YAGNI).
