Currently the overreserve cache is too opaque, which makes it hard to debug and troubleshoot desync issues.
As it stands today, the pfpstatus support gives us at least some tooling to inspect the scheduler's view of the nodes, which can help explain why nodes are not being synced or why the PFP does not match.
To improve debuggability, it would be useful to dump the current cached NRT object, so we can see what the overreserve cache is currently reporting. This information is currently extremely hard to reconstruct from the existing tools and logs.
One option is to create a dump directory parallel to the pfpstatus one, into which we dump the JSON representation of the current cached NRT object. This can be done at the end of each `FlushNodes` loop and should be done asynchronously in a separate utility goroutine (not in the `FlushNodes` body). Write errors should be tolerated. We should have a way to correlate each dump with the flush attempt that set it in the cache.
Publishing this information, or a subset thereof, as a proper k8s object would be much better, but there is no obvious way to do this at the moment, so a filesystem dump is still far better than nothing.
ffromani added the help wanted label on Jun 26, 2024