Context: Kasper B. tried to cancel a batch job on vlcc-prod. For a reason that is still unclear, its corresponding k8s application had already finished and was gone, but the job's status could not be set to canceled (r-2407102c0270404dbaf57b0901bf6fcc):
Traceback (most recent call last):
File "/opt/openeo/lib/python3.8/site-packages/flask/app.py", line 1484, in full_dispatch_request
rv = self.dispatch_request()
File "/opt/openeo/lib/python3.8/site-packages/flask/app.py", line 1469, in dispatch_request
return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
File "/opt/openeo/lib/python3.8/site-packages/openeo_driver/users/auth.py", line 95, in decorated
return f(*args, **kwargs)
File "/opt/openeo/lib/python3.8/site-packages/openeo_driver/views.py", line 1521, in cancel_job
backend_implementation.batch_jobs.cancel_job(job_id=job_id, user_id=user.user_id)
File "/opt/openeo/lib/python3.8/site-packages/openeogeotrellis/backend.py", line 2795, in cancel_job
delete_response_sparkapplication = api_instance_custom_object.delete_namespaced_custom_object(group, version, namespace, plural, name)
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 891, in delete_namespaced_custom_object
return self.delete_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs) # noqa: E501
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 1018, in delete_namespaced_custom_object_with_http_info
return self.api_client.call_api(
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 415, in request
return self.rest_client.DELETE(url,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 265, in DELETE
return self.request("DELETE", url,
File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '07080f79-35c9-49e8-9633-292aada6c9a9', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '78a6a66b-33dc-4871-84cc-c67eb8021575', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'b089eb3e-e97a-4e2c-ae7c-dd74289034c5', 'Date': 'Wed, 10 Jul 2024 08:42:09 GMT', 'Content-Length': '314'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"sparkapplications.sparkoperator.k8s.io \"a-1f49b32ca95947bcbd68c74815f47331\" not found","reason":"NotFound","details":{"name":"a-1f49b32ca95947bcbd68c74815f47331","group":"sparkoperator.k8s.io","kind":"sparkapplications"},"code":404}
The YARN implementation of cancel_job intentionally swallows the error and logs a warning if the YARN job could not be cancelled, then sets the OpenEO job status to "canceled" anyway.
The k8s implementation instead fails and does not change the status: that's a bug. In this case, the job remained in the running state, blocking the client while the job_tracker was also not able to update its status (again, because the k8s application was gone).
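A minimal sketch of the behaviour the k8s path could adopt, assuming the fix mirrors the YARN approach for the "application already gone" case only. The function name and both callables are hypothetical stand-ins, not the driver's actual API: `delete_app` represents the `delete_namespaced_custom_object(...)` call from the traceback, and `set_status` represents whatever updates the job status. The sketch relies on the kubernetes client's `ApiException` exposing the HTTP code as `.status`.

```python
import logging
from typing import Callable

log = logging.getLogger(__name__)


def cancel_job_tolerating_missing_app(
    delete_app: Callable[[], None],
    set_status: Callable[[str], None],
) -> None:
    """Delete the k8s application if it still exists, then mark the job canceled.

    Both callables are hypothetical stand-ins: `delete_app` for the
    delete_namespaced_custom_object(...) call seen in the traceback,
    `set_status` for whatever updates the job's status.
    """
    try:
        delete_app()
    except Exception as e:
        # kubernetes' ApiException carries the HTTP code in `.status`.
        # Only a 404 ("application already gone") is treated as harmless;
        # any other failure is re-raised and the status stays unchanged.
        if getattr(e, "status", None) != 404:
            raise
        log.warning(
            "k8s application not found, marking job as canceled anyway",
            exc_info=True,
        )
    set_status("canceled")
```

Whether other errors (for example a 5xx from the API server) should also be swallowed, as the YARN path does, is a design choice this sketch deliberately leaves open.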
bossie changed the title from "unable to cancel a batch job for which the k8s is gone" to "unable to cancel a batch job for which the k8s application is gone" on Jul 10, 2024
YARN cancel_job code referenced in the issue description: openeo-geopyspark-driver/openeogeotrellis/backend.py, lines 2827 to 2832 in 4aea4a1