Cloudbeat can receive a context cancel at any moment. Right now, once a context cancel happens, we log errors from multiple different places in cloudbeat.
This is especially troublesome in agentless, where pods tend to be restarted or deleted more frequently than in a standard agent-based solution. On top of that, in agentless we are paged based on the number of errors, so a cloudbeat shutdown during a cycle might alert the engineer on duty (urgency low, example).
The error logging is spread throughout the code, and we can't just unify all errors and raise them up, because some of them are "optional" errors (we log them but they don't stop the execution). Example.
Ideally we find a strategy that raises no alert in such a scenario, because a context canceled on shutdown is something an on-caller has nothing to act upon, and is therefore a false positive.
There are two directions we could see ourselves going in:
From cloudbeat, we could write a wrapper or handler around logp that receives the error and, if it is a context canceled, lowers the level to warn (or whatever else we decide). Or we could handle it case by case, which would be very repetitive.
Don't alert in case of pod shutdown or restart. That might be tricky to configure and might hide a legit issue. But the fact is that once a pod is shut down there is nothing an on-caller can do. There is no customer impact. There is nothing to fix: the pod is gone. So should we alert on non-actionable problems?