unable to cancel a batch job for which the k8s application is gone #823

Open
bossie opened this issue Jul 10, 2024 · 0 comments
Collaborator

bossie commented Jul 10, 2024

Context: Kasper B. tried to cancel a batch job on vlcc-prod. For a still unclear reason, its corresponding k8s application had already finished and was gone, and the job's status could not be set to canceled (r-2407102c0270404dbaf57b0901bf6fcc):

Traceback (most recent call last):
  File "/opt/openeo/lib/python3.8/site-packages/flask/app.py", line 1484, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/openeo/lib/python3.8/site-packages/flask/app.py", line 1469, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/opt/openeo/lib/python3.8/site-packages/openeo_driver/users/auth.py", line 95, in decorated
    return f(*args, **kwargs)
  File "/opt/openeo/lib/python3.8/site-packages/openeo_driver/views.py", line 1521, in cancel_job
    backend_implementation.batch_jobs.cancel_job(job_id=job_id, user_id=user.user_id)
  File "/opt/openeo/lib/python3.8/site-packages/openeogeotrellis/backend.py", line 2795, in cancel_job
    delete_response_sparkapplication = api_instance_custom_object.delete_namespaced_custom_object(group, version, namespace, plural, name)
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 891, in delete_namespaced_custom_object
    return self.delete_namespaced_custom_object_with_http_info(group, version, namespace, plural, name, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 1018, in delete_namespaced_custom_object_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 415, in request
    return self.rest_client.DELETE(url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 265, in DELETE
    return self.request("DELETE", url,
  File "/usr/local/lib/python3.8/site-packages/kubernetes/client/rest.py", line 233, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': '07080f79-35c9-49e8-9633-292aada6c9a9', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '78a6a66b-33dc-4871-84cc-c67eb8021575', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'b089eb3e-e97a-4e2c-ae7c-dd74289034c5', 'Date': 'Wed, 10 Jul 2024 08:42:09 GMT', 'Content-Length': '314'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"sparkapplications.sparkoperator.k8s.io \"a-1f49b32ca95947bcbd68c74815f47331\" not found","reason":"NotFound","details":{"name":"a-1f49b32ca95947bcbd68c74815f47331","group":"sparkoperator.k8s.io","kind":"sparkapplications"},"code":404}

The YARN implementation of cancel_job intentionally swallows the error and logs a warning if the YARN job could not be cancelled, then sets the openEO job status to "canceled" anyway:

except CalledProcessError as e:
    logger.warning(f"Could not kill corresponding Spark job {application_id}, output was: {e.stdout}",
                   exc_info=True, extra={'job_id': job_id})
finally:
    with self._double_job_registry as registry:
        registry.set_status(job_id, user_id, JOB_STATUS.CANCELED)

The k8s implementation instead fails and does not change the status: that's a bug. In this case, the job remained in the running state, blocking the client while the job_tracker was also not able to update its status (again, because the k8s application was gone).
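
One possible direction for a fix, as a minimal sketch and not the actual patch: treat a 404 from the delete call as "application already gone", log a warning, and still mark the job as canceled. The helper name cancel_k8s_job and its two callback parameters are hypothetical; in backend.py the first callback would wrap the delete_namespaced_custom_object call from the traceback and the second would wrap registry.set_status. The fallback ApiException class only exists so the sketch runs without the kubernetes package installed.

```python
import logging

try:
    from kubernetes.client.exceptions import ApiException
except ImportError:
    # Stand-in (assumption) so the sketch runs without the kubernetes package.
    class ApiException(Exception):
        def __init__(self, status=None, reason=None):
            super().__init__(f"({status}) Reason: {reason}")
            self.status = status
            self.reason = reason

logger = logging.getLogger(__name__)


def cancel_k8s_job(delete_application, set_canceled, job_id):
    """Hypothetical helper: delete the k8s application, then mark the job canceled.

    delete_application: zero-argument callable that deletes the SparkApplication.
    set_canceled: zero-argument callable that sets the job status to canceled.
    """
    try:
        delete_application()
    except ApiException as e:
        if e.status != 404:
            # Anything other than "not found" is still a real failure.
            raise
        # The application is already gone, so there is nothing left to kill.
        logger.warning(
            f"Could not delete k8s application for job {job_id}: {e.reason}",
            exc_info=True, extra={"job_id": job_id},
        )
    set_canceled()
```

Unlike the YARN code above, this sketch only tolerates 404 and lets other API errors propagate without touching the status; whether the k8s path should instead mirror YARN and set the status in a finally block is a design decision for the maintainers.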

@bossie bossie added the bug label Jul 10, 2024
@bossie bossie changed the title from "unable to cancel a batch job for which the k8s is gone" to "unable to cancel a batch job for which the k8s application is gone" Jul 10, 2024