
Add options to specify cache type #338

Open
wants to merge 3 commits into master
Conversation

@stevefan1999-personal (Contributor) commented Apr 5, 2023

Part 3 of the #294 series. This PR focuses on letting the user specify their own cache type, in contexts such as no_std or when the user needs to explicitly state the kind of cache they want to use. This means users can now supply LinearMap from heapless, SgMap from scapegoat, or even Simsys/fchashmap (a fixed-capacity no_std hashmap).

As I tried before, I wanted to consolidate the cache interface into the one facade trait stated here, but I quickly found that I would have to use a Box to wrap the dynamic trait -- which is not acceptable in a no_std environment. This means we have to resort to "duck typing": the user specifies a concrete type in the cache type definition. The cache type must follow this "trait protocol":

pub trait GrammarRuleCache<T> {
    /// Store the parse result `v` for input position `k`.
    fn insert(&mut self, k: usize, v: T);
    /// Look up a previously stored result for input position `k`.
    fn get(&self, k: usize) -> Option<&T>;
}

And the type must also implement Default.
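For illustration, a minimal sketch (not part of this PR) of a type satisfying the protocol, using alloc::collections::BTreeMap so it also works on no_std targets that have an allocator:

extern crate alloc;
use alloc::collections::BTreeMap;

// Sketch only: BTreeMap implements Default and provides the two
// operations the protocol needs.
impl<T> GrammarRuleCache<T> for BTreeMap<usize, T> {
    fn insert(&mut self, k: usize, v: T) {
        // The inherent insert returns the old value; the protocol discards it.
        BTreeMap::insert(self, k, v);
    }
    fn get(&self, k: usize) -> Option<&T> {
        BTreeMap::get(self, &k)
    }
}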

This PR is based on #336 and #337 as prerequisites.

@stevefan1999-personal (Contributor, Author)

@kevinmehall Umm, the test failed because it was running in a no_std context, and indeed there is no println! macro there. I wonder if we should take this out too: instead of printing to the console, put all the messages together into a Vec<String>, or even structure them, and then print everything out at the end. This could also help external compiler tracing, but I would like to keep that for another PR.
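Roughly what I have in mind, as a minimal sketch with invented names (not the crate's actual trace machinery), assuming an alloc-capable no_std target:

extern crate alloc;
use alloc::{string::String, vec::Vec};

// Sketch only: buffer trace lines instead of println!-ing them.
#[derive(Default)]
struct TraceBuffer {
    events: Vec<String>,
}

impl TraceBuffer {
    fn record(&mut self, msg: String) {
        self.events.push(msg);
    }

    // At the end of a parse, the caller can flush these anywhere,
    // e.g. over a serial port on an embedded target.
    fn drain(&mut self) -> impl Iterator<Item = String> + '_ {
        self.events.drain(..)
    }
}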

@kevinmehall (Owner)

A couple thoughts / concerns with this:

  • A suggested use case is switching to data structures with limited / fixed capacity. What happens if the capacity is filled? #[cache] could potentially evict items, but I'm not sure #[cache_left_rec] can drop items and preserve correctness.
  • Are there use cases where individual rules may want different cache data structures? If not, would it make more sense to configure this grammar-wide rather than repeating it every time the cache is specified?
  • Another related case for this is still using HashMap, but replacing the hasher with a faster one. I think the default SipHash being slow is part of the reason that #[cache] is rarely a win in my experience. I think that's doable with this design, but it requires first defining a type alias (see the sketch after this list). Not terrible, but not necessarily obvious. I can't think of a better way to do it, though, since the macro supplies the key and value type parameters. If the syntax involved specifying the full type explicitly, including all type params, it would expose the internal RuleResult type and require more repetition.
  • Do you have a concrete use case where you want caching on no_std right now, or is this just for completeness? I'm having a hard time imagining something that you'd want to parse on an embedded system that would greatly benefit from caching vs just restructuring the grammar to reduce redundancy.
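A minimal sketch of that type-alias workaround, assuming the fnv crate (any BuildHasher would do):

use std::collections::HashMap;

// Sketch only: a cache type alias that swaps SipHash for FNV while
// keeping the two type parameters the macro would supply.
type FnvCache<K, V> = HashMap<K, V, fnv::FnvBuildHasher>;

Since fnv::FnvBuildHasher implements Default, HashMap<K, V, FnvBuildHasher> does too, so the alias also satisfies the Default requirement.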

I wouldn't mind just skipping the no_std tests when the trace feature is enabled (#![cfg(not(feature = "trace"))] or similar).

@stevefan1999-personal (Contributor, Author)

  • A suggested use case is switching to data structures with limited / fixed capacity. What happens if the capacity is filled? #[cache] could potentially evict items, but I'm not sure #[cache_left_rec] can drop items and preserve correctness.

Maybe we will need to define the cache behavior. I think we should first try to insert the entry, and if the insert fails, execute the rest of the production rule as usual.
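A minimal sketch of those semantics under the proposed GrammarRuleCache protocol (everything except the trait is invented):

// Sketch only: memoize a rule result at input position `pos`. A
// fixed-capacity cache may silently drop the insert when it is full;
// correctness is preserved because the freshly computed result is
// returned either way.
fn memoize<T: Clone, C: GrammarRuleCache<T>>(
    cache: &mut C,
    pos: usize,
    run_rule: impl FnOnce() -> T,
) -> T {
    if let Some(hit) = cache.get(pos) {
        return hit.clone();
    }
    let result = run_rule();
    cache.insert(pos, result.clone());
    result
}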

  • Are there use cases where individual rules may want different cache data structures? If not, would it make more sense to configure this grammar-wide rather than repeating it every time the cache is specified?

I theorize the following situation: because hash-based containers rely on cost amortization (the algorithmic cost of an operation only approaches its big-O bound asymptotically, once n grows large), they struggle when you have few entries but big batches of repeated operations. The smaller the hash structure, the higher the chance of triggering hash map expansion and rehashing.

Given the large number of recursive parsing calls in a PEG parser, I suggest that a binary-tree-based map would yield a nice improvement in expected performance/throughput over a hash-based one, because its behavior is much more stable and deterministic: a self-balancing binary tree takes O(log n) time for any operation, whereas a hash map achieves O(1) only amortized, so it can be closer to O(n) for the first ~512 entries or so.

...Well, on the other hand, also theoretically, O(log n) is by definition slower than O(1), so a benchmark is still needed to test this thoroughly. We could also test whether FNV, SipHash, or Farmhash, compared with a B+ tree, a scapegoat tree, or even a naive index map (just a linear array with linear search), makes a difference.
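Something like this rough criterion sketch could be a starting point (setup and names assumed, not part of this PR); std::hint::black_box keeps the lookups from being optimized away:

use std::collections::{BTreeMap, HashMap};
use std::hint::black_box;
use criterion::{criterion_group, criterion_main, Criterion};

// Sketch only: compare insert/get throughput at the small key counts
// typical of rule memoization.
fn bench_caches(c: &mut Criterion) {
    let keys: Vec<usize> = (0..512).collect();
    c.bench_function("hashmap_siphash", |b| {
        b.iter(|| {
            let mut m = HashMap::new();
            for &k in &keys {
                m.insert(k, k);
                black_box(m.get(&k));
            }
        })
    });
    c.bench_function("btreemap", |b| {
        b.iter(|| {
            let mut m = BTreeMap::new();
            for &k in &keys {
                m.insert(k, k);
                black_box(m.get(&k));
            }
        })
    });
}

criterion_group!(benches, bench_caches);
criterion_main!(benches);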

  • Do you have a concrete use case where you want caching on no_std right now, or is this just for completeness? I'm having a hard time imagining something that you'd want to parse on an embedded system that would greatly benefit from caching vs just restructuring the grammar to reduce redundancy.

It's for completeness. I use peg to parse serial console commands and sometimes ACPI bytecode. Both of them have worked fine without the cache so far, but they are for embedded use, and I can't risk using a hash map on determinism grounds.

@stevefan1999-personal (Contributor, Author)

I also want to test how well this works with caches that do cache replacement, e.g. LRU, TinyLFU, and so on.
