
Evaluate block replacement algorithms that receive attention for hit-ratio improvement #200

Open
jserv opened this issue Aug 30, 2023 · 21 comments
Labels: research (Study certain topics)

@jserv
Contributor

jserv commented Aug 30, 2023

Block replacement algorithms continue to receive attention for improving their hit ratios. Numerous replacement algorithms have been proposed; among them, LIRS stands out with a consistently higher hit ratio across various workloads while incurring low time and space overheads. However, there are still access patterns for which LIRS yields a sub-optimal hit ratio and leaves room for further improvement.

Recently, LIRS2 has been introduced, incorporating a new measure into the LIRS algorithm aimed at reducing its miss ratio. In extensive experiments on traces from various sources, LIRS2 has consistently demonstrated a lower cache miss ratio with minimal overhead.

Quoted from LIRS2: An Improved LIRS Replacement Algorithm:

  • For a looping access pattern, when the cache is not large enough to hold all the blocks, LRU has a 100% miss ratio.
  • Algorithms that consider access frequency, such as LFU and LRFU, are able to perform well on this access pattern. However, such algorithms can be too expensive, as they need to track every block's access frequency and cannot respond quickly to access pattern changes.
  • At the same time, we identify access patterns where LIRS has a critical weakness or room for improvement. One illustrative example is a zigzag access pattern, where a file or data set is repeatedly accessed in alternating order, such as reading from the file head to its tail and then reversing the order.
  • The LIRS2 algorithm is an enhancement of LIRS that replaces the reuse distance in its locality measure with the sum of the two most recent reuse distances.
  • LFU is not responsive to access pattern changes and often performs poorly for dynamic workloads. Additionally, it can be too expensive in a practical system.
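To make the quoted locality measure concrete, here is a minimal sketch in C. It uses a simplified notion of reuse distance (elapsed virtual time between accesses rather than the distinct-block count used in the papers), and the names block_meta_t, lirs_measure, and lirs2_measure are invented for illustration, not taken from either paper or from rv32emu:

/* Track the three most recent access times of a block and derive the
 * LIRS-style and LIRS2-style locality measures from them. */
#include <stdint.h>

typedef struct {
    uint64_t t[3]; /* virtual times of the three most recent accesses; t[0] is the newest */
} block_meta_t;

/* LIRS-style measure: the single most recent reuse distance
 * (smaller means the block looks "hotter"). */
static uint64_t lirs_measure(const block_meta_t *b)
{
    return b->t[0] - b->t[1];
}

/* LIRS2-style measure: the sum of the two most recent reuse distances,
 * which collapses to the span since the third-most-recent access. */
static uint64_t lirs2_measure(const block_meta_t *b)
{
    return (b->t[0] - b->t[1]) + (b->t[1] - b->t[2]); /* == t[0] - t[2] */
}

/* Shift in the new access time on every reference to the block. */
static void block_touch(block_meta_t *b, uint64_t now)
{
    b->t[2] = b->t[1];
    b->t[1] = b->t[0];
    b->t[0] = now;
}

Because the LIRS2 measure spans two consecutive reuse intervals, a single out-of-pattern access (as in the zigzag example) perturbs a block's ranking less than it would under LIRS.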

Reference: LIRS2: An Improved LIRS Replacement Algorithm (SYSTOR '21)

@visitorckw
Collaborator

I attempted to implement an LIRS cache and conducted performance testing.
Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.
Here are the data:

  • Test machine: Intel Core i5-10500

Metric (higher is better)   LFU        ARC        LIRS
Coremark                    2299.937   2099.059   2065.093
Dhrystone                   3706.87    3376.85    3374.75

Reference:
Code - LIRS cache implementation

@jserv
Contributor Author

jserv commented Oct 22, 2023

I attempted to implement an LIRS cache and conducted performance testing. Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.

Lately, we have encountered unexpected code generation overhead from MIR. As an experiment, @qwe661234 is testing clang as the backend for JIT code generation instead of MIR. Please switch to the git branch wip/research_jit, which is expected to be incorporated into his technical reports: https://github.com/qwe661234/rv32emu/tree/wip/research_jit
(Make sure that MIR macro is unset when JIT configuration is enabled.)

@jserv
Contributor Author

jserv commented Oct 22, 2023

Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.

In addition to the typical benchmark suite, which includes CoreMark and Dhrystone, we also make use of additional programs such as scimark2, richards, and nqueens. You can locate the corresponding prebuilt ELF executable files in the build directory.

@visitorckw
Collaborator

I attempted to implement an LIRS cache and conducted performance testing. Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.

Lately, we have encountered unexpected code generation overhead from MIR. As an experiment, @qwe661234 is testing clang as the backend for JIT code generation instead of MIR. Please switch to the git branch wip/research_jit, which is expected to be incorporated into his technical reports: https://github.com/qwe661234/rv32emu/tree/wip/research_jit (Make sure that MIR macro is unset when JIT configuration is enabled.)

It seems that the code on the "wip/research_jit" branch is unable to pass the "make check" test with JIT enabled.
Here's the output:

Running hello.elf ... [OK]
Running puzzle.elf ... [OK]
Running pi.elf ... Failed.
make: *** [Makefile:174: check] Error 1

@visitorckw
Collaborator

Unfortunately, I observed regressions in both the Coremark and Dhrystone benchmarks.

In addition to the typical benchmark suite, which includes CoreMark and Dhrystone, we also make use of additional programs such as scimark2, richards, and nqueens. You can locate the corresponding prebuilt ELF executable files in the build directory.

Nice!
I'll make use of these benchmark programs for testing.

@visitorckw
Collaborator

I've noticed that when using the code from the 'wip/jit' branch with MIR as the JIT backend and running the scimark2 benchmark with my LIRS cache implementation, I encountered a use-after-free error. It's possible that there is an issue with my current LIRS implementation, or there could be a problem in the JIT-related code. I will examine the relevant code more thoroughly to pinpoint the exact cause of the error.

@qwe661234
Collaborator

Actually, the function destroy_cache has a problem I haven't solved yet.

@visitorckw
Collaborator

After conducting further testing, I have observed that regardless of the cache replacement algorithm used, the cache hit ratio consistently remains above 99.99%. Therefore, I believe that the primary factor influencing performance may be the overhead of cache_get() operations, despite their theoretical O(1) time complexity. Based on the current test results, I would recommend continuing to use LFU as the default cache replacement algorithm.

Here is the testing data:

Execution time (ms)   ARC        LFU        LRU        LIRS       FIFO
Coremark              20614.7    18755.07   20083.02   21027.97   18963.75
Dhrystone             1677.52    1535.485   1620.851   1683.112   1564.062
Nqueens               3555.8     3193.08    3364.245   3507.252   3198.479
Richards              583.442    518.0965   544.4019   585.9974   551.1093

Hit ratio             ARC        LFU        LRU        LIRS       FIFO
Coremark              0.99999    0.999997   0.999997   0.999996   0.999997
Dhrystone             0.99999    0.999986   0.999986   0.999986   0.999986
Nqueens               0.99999    0.999991   0.999991   0.999991   0.999991
Richards              0.99997    0.999966   0.999966   0.999966   0.999966
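For a rough sense of why miss-ratio differences this small stop mattering, plug hypothetical costs into the usual average-lookup-cost formula (the 50 ns hit cost and 5 µs miss cost are made-up figures for illustration, not measurements from rv32emu):

\[
T_{\text{avg}} = h \, t_{\text{hit}} + (1 - h) \, t_{\text{miss}}
              = 0.99999 \times 50\,\text{ns} + 0.00001 \times 5\,\mu\text{s}
              \approx 50\,\text{ns} + 0.05\,\text{ns}
\]

At hit ratios this high the hit path dominates completely, which is consistent with the execution-time differences above tracking the per-lookup cost of each policy rather than its miss ratio.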

@jserv
Contributor Author

jserv commented Nov 19, 2023

After conducting further testing, I have observed that regardless of the cache replacement algorithm used, the cache hit ratio consistently remains above 99.99%. Therefore, I believe that the primary factor influencing performance may be the overhead of cache_get() operations, despite their theoretical O(1) time complexity. Based on the current test results, I would recommend continuing to use LFU as the default cache replacement algorithm.

I am concerned that benchmark programs like CoreMark and Dhrystone may not accurately represent real-world workloads, especially when applying this RISC-V instruction emulation within a system emulation context, similar to what semu achieves. Can you suggest empirical approaches for utilizing cache replacement policies in system emulation?

@visitorckw
Collaborator

I acknowledge the possibility that CoreMark or Dhrystone may exhibit significant differences from real-world workloads. Perhaps, it would be prudent for us to explore alternative benchmarks that more accurately represent the actual usage scenarios. I will make an effort to identify more suitable benchmarks for our testing purposes.

@jserv
Contributor Author

jserv commented Nov 20, 2023

I acknowledge the possibility that CoreMark or Dhrystone may exhibit significant differences from real-world workloads. Perhaps, it would be prudent for us to explore alternative benchmarks that more accurately represent the actual usage scenarios. I will make an effort to identify more suitable benchmarks for our testing purposes.

If you are exploring semu, please note that there is a fork called semu-c64 which includes additional features like insn_count for enhanced analysis and checkpoint functionality.

@visitorckw
Collaborator

visitorckw commented Nov 21, 2023

If you are exploring semu, please note that there is a fork called semu-c64 which includes additional features like insn_count for enhanced analysis and checkpoint functionality.

Thanks for the information! I'll take a look at semu-c64.

jserv added the enhancement (New feature or request) and research (Study certain topics) labels and removed the enhancement label on Dec 25, 2023
@jserv
Contributor Author

jserv commented Dec 30, 2023

libCacheSim is

  • a high-performance cache simulator for running cache simulations.
  • a high-performance and versatile trace analyzer for analyzing different cache traces.
  • a high-performance library for building cache simulators.

@visitorckw
Collaborator

I noticed that the JIT compiler's code cache in the JVM doesn't utilize any cache replacement algorithm. Instead, when the code cache is full, it either disables the JIT compiler or flushes the code cache (depending on the configuration). This surprised me, and I haven't delved into the reasons behind this approach.

@visitorckw
Collaborator

My speculation is that in real workloads, the number of cache_get operations is typically much larger than the number of cache_put operations, yet only cache_put may need to select a victim for eviction from the cache. Therefore, trying to reduce the cache miss ratio at the cost of increasing the overhead of cache_get operations (even though this overhead is usually low) might not be cost-effective. However, this assumption requires further experimentation and validation.
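One low-effort way to test this speculation is to count both operations and report the ratio at exit. The snippet below is a hypothetical instrumentation sketch; the hook names are invented and are not part of rv32emu's cache API:

#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t n_get, n_put, n_hit;

/* Call from the lookup path (e.g. inside cache_get), passing whether it hit. */
static inline void stats_on_get(bool hit)
{
    n_get++;
    if (hit)
        n_hit++;
}

/* Call from the insertion path (e.g. inside cache_put). */
static inline void stats_on_put(void)
{
    n_put++;
}

/* Register once with atexit(stats_report) so the summary prints on exit. */
static void stats_report(void)
{
    fprintf(stderr,
            "gets=%" PRIu64 " (hit ratio %.6f) puts=%" PRIu64 " get/put=%.1f\n",
            n_get, n_get ? (double) n_hit / n_get : 0.0, n_put,
            n_put ? (double) n_get / n_put : 0.0);
}

If the get/put ratio turns out to be very large, that argues for keeping the hit path as cheap as possible (as LFU and FIFO appear to do in the tables above) rather than paying per-lookup overhead to squeeze out a few more hits.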

@jserv
Contributor Author

jserv commented Dec 31, 2023

I noticed that the JIT compiler's code cache in the JVM doesn't utilize any cache replacement algorithm. Instead, when the code cache is full, it either disables the JIT compiler or flushes the code cache (depending on the configuration). This surprised me, and I haven't delved into the reasons behind this approach.

@qwe661234, can you clarify this?

@qwe661234
Collaborator

I noticed that the JIT compiler's code cache in the JVM doesn't utilize any cache replacement algorithm. Instead, when the code cache is full, it either disables the JIT compiler or flushes the code cache (depending on the configuration). This surprised me, and I haven't delved into the reasons behind this approach.

@qwe661234, can you clarify this?

Sure. I believe the reason the JVM flushes the code cache when it's full is similar to the implementation in QEMU. The runtime-generated machine code is stored in contiguous memory, and its size varies. The concept is akin to OS memory management: if we don't divide the machine code into fixed-size units like memory pages, problems may arise. For instance, if the new machine code is larger than the block it replaces, it might cause an overflow. Conversely, if the new machine code is smaller than the replaced one, it could lead to fragmentation. Therefore, flushing the code cache when it is full is a simpler way to address these issues.
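As a concrete illustration of the flush-when-full scheme described above, here is a minimal sketch: translated blocks are bump-allocated from one contiguous executable buffer, and when a new block no longer fits, the whole buffer plus the guest-PC-to-host-code map is discarded. All names and sizes are hypothetical, not taken from rv32emu, QEMU, or the JVM:

#include <stddef.h>
#include <stdint.h>

#define CODE_CACHE_SIZE (16u << 20) /* 16 MiB of executable memory */

typedef struct {
    uint8_t *base;  /* start of the executable buffer (e.g. from mmap) */
    size_t offset;  /* bump pointer: next free byte */
} code_cache_t;

/* Hypothetical hook that invalidates every guest-PC -> host-code mapping. */
extern void block_map_clear(void);

static void code_cache_flush(code_cache_t *cc)
{
    cc->offset = 0;
    block_map_clear(); /* all previously emitted blocks are now invalid */
}

/* Reserve `size` bytes for a newly translated block. Blocks have variable
 * sizes, so freeing an individual one would leave holes (external
 * fragmentation); the simple policy is: if it does not fit, flush everything
 * and start over. A real implementation would also reject any single block
 * larger than CODE_CACHE_SIZE. */
static uint8_t *code_cache_alloc(code_cache_t *cc, size_t size)
{
    if (cc->offset + size > CODE_CACHE_SIZE)
        code_cache_flush(cc);
    uint8_t *p = cc->base + cc->offset;
    cc->offset += size;
    return p;
}

Compared with evicting individual translated blocks, this trades occasional re-translation bursts for an allocator that is O(1) and never fragments.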

@jserv
Contributor Author

jserv commented Jan 3, 2024

For instance, if the new machine code is larger than the block it replaces, it might cause an overflow. Conversely, if the new machine code is smaller than the replaced one, it could lead to fragmentation. Therefore, flushing the code cache when it is full is a simpler way to address these issues.

I attempted to run build/donut.elf using the JIT-enabled build and observed that the animation didn't perform as expected. In contrast to the usual smooth movement of a spinning donut, this animation was uneven and failed to render consistently over time. I'm unsure whether the intermittent lag observed is due to the current tier-1 JIT compiler's practice of periodically flushing the emitted code.

@qwe661234
Collaborator

I can confirm that the problem does not arise from this. We haven't designed any mechanism to flush the cache, so if the code cache is full, the program simply aborts. I will design this later.

@jserv
Contributor Author

jserv commented Feb 5, 2024

For scenarios that require caching, a good cache eviction strategy can improve the cache hit rate. The most commonly used cache eviction strategies are LFU (Least Frequently Used) and LRU (Least Recently Used).

LFU needs to maintain the frequency of each element, and for accuracy it requires maintaining global frequencies, which can bring significant space costs. Moreover, once an element has built up a high frequency, it becomes difficult to evict, even though in reality access frequency changes over time.
LRU is an excellent cache eviction strategy, but it requires more space to ensure that elements are not evicted before their next access.

W-TinyLFU, on the other hand, is an eviction strategy that comprehensively addresses the issues encountered in real scenarios: it achieves excellent hit rates and adapts well to changing workloads, but it is relatively complex to implement. If such a high hit rate is not required, SLRU is also worth considering, as it still achieves a good hit rate.
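For a flavor of what makes (W-)TinyLFU effective, below is a compact sketch of its admission idea: keep approximate access frequencies in a small count-min sketch with periodic aging, and admit a candidate into the cache only if it is estimated to be more frequent than the block that would be evicted. This is an illustrative sketch with invented names and sizes, not Caffeine's or any other library's implementation:

#include <stdbool.h>
#include <stdint.h>

#define CMS_ROWS 4
#define CMS_COLS 1024 /* must be a power of two */

typedef struct {
    uint8_t counters[CMS_ROWS][CMS_COLS]; /* 4-bit values stored in bytes */
    uint64_t additions;                   /* triggers periodic aging */
} tinylfu_t;

static uint32_t cms_hash(uint32_t key, uint32_t row)
{
    /* Simple multiplicative hashing; any reasonable per-row hash works. */
    static const uint32_t seeds[CMS_ROWS] = {
        0x9e3779b1u, 0x85ebca77u, 0xc2b2ae3du, 0x27d4eb2fu
    };
    return (key * seeds[row]) >> 22; /* top 10 bits -> index in [0, 1023] */
}

/* Record one access to `key`, and periodically halve all counters so stale
 * frequencies decay and the policy adapts to access pattern changes. */
static void tinylfu_increment(tinylfu_t *t, uint32_t key)
{
    for (uint32_t r = 0; r < CMS_ROWS; r++) {
        uint8_t *c = &t->counters[r][cms_hash(key, r)];
        if (*c < 15)
            (*c)++;
    }
    if (++t->additions % (10 * CMS_COLS) == 0)
        for (uint32_t r = 0; r < CMS_ROWS; r++)
            for (uint32_t i = 0; i < CMS_COLS; i++)
                t->counters[r][i] >>= 1;
}

/* Estimated frequency of `key`: the minimum counter across the rows. */
static uint8_t tinylfu_estimate(const tinylfu_t *t, uint32_t key)
{
    uint8_t min = 15;
    for (uint32_t r = 0; r < CMS_ROWS; r++) {
        uint8_t c = t->counters[r][cms_hash(key, r)];
        if (c < min)
            min = c;
    }
    return min;
}

/* Admission: insert the candidate only if it is estimated to be accessed
 * more frequently than the block the eviction policy would remove. */
static bool tinylfu_admit(const tinylfu_t *t, uint32_t candidate, uint32_t victim)
{
    return tinylfu_estimate(t, candidate) > tinylfu_estimate(t, victim);
}

The "W" in W-TinyLFU additionally places a small LRU admission window in front of the main (SLRU) segment, so that brand-new blocks get a chance to build up frequency before facing this filter.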

Reference:

@ben-manes

ben-manes commented Feb 7, 2024

I'd recommend capturing workload traces and running them through a simulator like Caffeine's. That is extremely helpful for seeing how the policy actually performs and for making a choice that balances versatility, simplicity, efficiency, etc.

As a general-purpose library, Caffeine can't make many assumptions, as our users expect us to excel in all cases. That means databases and analytics (MRU-biased), event-driven messaging (LRU-biased), or the typical in-between (mixed LRU/LFU). Some will want concurrency with high throughput, low latency, and linearizable behavior (like computations). Then add O(1), low memory footprint, features, and an easy-to-use API. The adaptivity allows us to have superior or competitive hit rates in all workloads, and I'll test on traces from new papers in case there are any weaknesses to investigate. LIRS (v1 and v2) is also excellent in my experience and makes a similar attempt at being a strong general-purpose algorithm.

Recently a CMU paper claimed to have significantly better results, but the author didn't show their data. When probing them and running their traces, I found that this difference was 2% in a single cherry-picked case, while their policy did miserably across a variety of other real workloads. That's in fact perfectly fine and not meant to discredit their work. It's quite nice to have a variety of choices that are vetted for specialized workloads so that you can make the right tradeoffs. I would only say it is important to validate, as some research papers are more marketing than substance. I'd also run the author's simulators to avoid bias, as I have observed both less-than-honest work and innocent bugs.

If you can narrow down on a set of tradeoffs (what you want and don't care about) then I can point you at likely candidates. But if you don't have a strong inclination now then I'd recommend choosing something simple to maintain that is good enough so that you can swap it out later once you know more if necessary (aka YAGNI).
