Zombie checks #722
Comments
I think that is not the case: if a session/batch job was killed externally, Lighter would get a "not found" response and could mark the session/job as failed without a zombie check. If I remember correctly (@Minutis, maybe you do?), this was originally implemented to work around cases where the Kubernetes driver pod is being re-scheduled on a different node. This functionality makes sense only in that specific configuration, so I think we could make it configurable and allow users to switch it off. It looks like you have an unusual use case for Lighter sessions; maybe it would make sense for you to look into our "Permanent session" feature. There is a way to configure a session that should run forever, and when it dies or is killed by external forces, Lighter will recreate that session under the same ID. We didn't have time to document this feature (we'll do it eventually).
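To illustrate the idea, here is a minimal sketch of such a "recreate under the same ID" watchdog. All names below (Session, SessionService, PermanentSessionWatchdog) are hypothetical and are not Lighter's actual API:

```java
import java.util.Optional;

// Hypothetical types, sketching the described behaviour only.
interface Session {
    boolean isFinished();
}

interface SessionService {
    Optional<Session> find(String id);
    void recreate(String id);
}

class PermanentSessionWatchdog {
    private final SessionService sessions;

    PermanentSessionWatchdog(SessionService sessions) {
        this.sessions = sessions;
    }

    /** Recreate a permanent session under the same ID if it has died or vanished. */
    void ensureAlive(String permanentId) {
        boolean dead = sessions.find(permanentId)
                .map(Session::isFinished)
                .orElse(true); // missing entirely counts as dead
        if (dead) {
            sessions.recreate(permanentId); // fresh session, same ID
        }
    }
}
```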
I can make some extra checks, but I think you are wrong, at least for sessions. Putting a session to
The first case is pretty simple and happens on an error after the session is launched. The second case is more interesting, as in
I did some tests with Lighter 0.0.50 on k8s. When I killed my session pod (

I am just wondering if anything can be done about it. Of course, one option is to simply make it configurable and let users decide what the "zombie" time should be. My biggest concern is that if something bad happens to a session (like being killed permanently or being rescheduled on another node), it can take too much time before I can find out (via the Lighter REST API). If the session pod is killed, I can still submit statements to it, which will never be executed, and after 30 mins the session will end up in

Permanent sessions can be helpful, but at some point I will need to kill them - sessions occupy too many resources, and because I am trying to use custom Python libraries, I cannot easily reuse them
Yes, that is how it works now. If the pod is killed (not deleted), it should mark it as "error" sooner, but if it is removed, it takes 30 mins. The zombie check was implemented without having sessions in mind, and the value was set to 30 min when we were trying to work around an issue in our infrastructure - something related to longer spot instance start times and pod recreation, I can't really remember what exactly. I'm not saying it is configurable now, but we can make it configurable; in many cases it does not make sense to have this check at all, since we can treat a "not found" pod as an error right away.
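A minimal sketch of what a configurable variant could look like - the names and structure are illustrative, not Lighter's actual implementation:

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical configurable zombie check.
public class ZombieCheck {
    private final Duration zombieTimeout;

    // Duration.ZERO would mean: treat a "not found" pod as an error right away.
    public ZombieCheck(Duration zombieTimeout) {
        this.zombieTimeout = zombieTimeout;
    }

    /**
     * Decide whether an application whose driver pod is no longer
     * visible to the backend should be marked as failed.
     *
     * @param notFoundSince when the backend first stopped seeing the pod
     */
    public boolean shouldMarkFailed(Instant notFoundSince) {
        return !Instant.now().isBefore(notFoundSince.plus(zombieTimeout));
    }
}
```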
Sorry for the delayed response. I think the zombie case was also happening in the YARN backend. When YARN as a whole service is restarted, all information about running jobs is gone. That means that if any batch applications were running and YARN was restarted, Lighter could no longer get any information about the jobs and marked them as zombies after the delay. I think it makes sense to make it configurable. It could also be handled differently in different backends, but of course the code would be harder to manage, since the same functionality would need a different implementation depending on the backend serving the jobs.
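One way to keep that manageable would be a small per-backend policy - again, just an illustrative sketch, not Lighter's actual class layout, and the grace periods below are made-up values:

```java
import java.time.Duration;

// Hypothetical per-backend zombie policy.
interface ZombiePolicy {
    /** How long a "not found" application may stay unseen before it is failed. */
    Duration gracePeriod();
}

// Kubernetes: allow time for a driver pod to be rescheduled on another node.
class KubernetesZombiePolicy implements ZombiePolicy {
    public Duration gracePeriod() {
        return Duration.ofMinutes(30);
    }
}

// YARN: a service restart wipes job state, so a missing job can be failed quickly.
class YarnZombiePolicy implements ZombiePolicy {
    public Duration gracePeriod() {
        return Duration.ofMinutes(1);
    }
}
```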
Looking through the code of Lighter, I found that there is a check for "zombies". I imagine it is done to cover the case in which a session is somehow killed/terminated with no intervention from Lighter itself, for example when the session pod is killed manually (in the case of k8s).
Just wondering why there is a 30-minute window:
lighter/server/src/main/java/com/exacaster/lighter/application/ApplicationStatusHandler.java, line 88 in fc76c23
Is 30 mins some kind of "good enough" value? BTW, the 30 mins seem to be out of sync with the log message you put in the log storage:
lighter/server/src/main/java/com/exacaster/lighter/application/ApplicationStatusHandler.java, lines 93 to 94 in fc76c23
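The referenced lines were not captured in this page, but from the discussion the check presumably has roughly this shape - a hedged reconstruction, not the actual Lighter source:

```java
import java.time.LocalDateTime;

// Guessed shape of the hardcoded 30-minute zombie window being discussed.
class ZombieWindowSketch {
    static boolean isZombie(LocalDateTime lastSeen) {
        // An application unseen by the backend for over 30 minutes is
        // treated as a zombie and transitioned to an error state.
        return lastSeen.isBefore(LocalDateTime.now().minusMinutes(30));
    }
}
```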
My biggest concern is: if the session pod is somehow killed (by accident, by the k8s scheduler, or whatever), Lighter will find out after at least 30 mins. Until then, the session will be in state IDLE and statements can be submitted. Effectively, these statements will never be executed...