overreserve: cache: report currently-used NRT object #222

Open
ffromani opened this issue Jun 26, 2024 · 1 comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines.

Comments


ffromani commented Jun 26, 2024

Currently the overreserve cache is too opaque, which makes it hard to debug and troubleshoot desyncing issues.
As it stands today, thanks to the pfpstatus support we have at least some tools to learn the scheduler's view of the nodes, which can help us understand why nodes are not being synced / why the PFP does not match.

To improve debuggability, it would be useful to somehow dump the current cached NRT object, so we can have a perspective on what the overreserve cache is currently reporting. It is currently extremely hard to reconstruct this information with the existing tools/logs.

An option would be to create a parallel dump directory like pfpstatus, into which we dump the JSON representation of the current cached NRT object. This can be done at the end of each FlushNodes loop and should be done asynchronously in another utility goroutine (not in the FlushNodes body). Write errors should be tolerated. We should have a way to correlate the dump with the flush attempt that set it in the cache.

Publishing this information, or a subset thereof, as a proper k8s object would be much better, but there's no obvious way to do this at the moment, so a filesystem dump is remarkably better than nothing.

@ffromani ffromani added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jun 26, 2024
ffromani (Member, Author) commented:

we need to clarify the foreign pods handling logic and document why we also mark nodes as holding foreign pods on delete
