The job application was rebuilt and the same endpoint name was kept which made it impossible to assign an IP address. #3699

Open
ty-dc opened this issue Jul 8, 2024 · 0 comments

Spiderpool Version

v0.9.3

Bug Type

IPAM

Main CNI

macvlan

What happened?

Warning  FailedCreatePodSandBox  31s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "d03d1785bad07f92d23677169acc40ecdd3ff90658d18c39ead55b010438fb4b": plugin type="multus" name="multus-cni-network" failed (add): [drun-test/llama2-master-0/f699f414-842c-40ab-8379-71710eac15c0:sriov-gpu20-enp40s0np0]: error adding container to network "sriov-gpu20-enp40s0np0": failed to set up IPAM plugin type "spiderpool" from the device "enp40s0np0": spiderpool IP allocation error: [POST /ipam/ip][500] postIpamIpFailure  failed to allocate IP addresses in standard mode: failed to patch IP allocation results to Endpoint: Operation cannot be fulfilled on spiderendpoints.spiderpool.spidernet.io "llama2-master-0": the object has been modified; please apply your changes to the latest version and try again
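
For context, the last part of that message ("the object has been modified; please apply your changes to the latest version and try again") is the wording the Kubernetes API server uses for an optimistic-concurrency conflict: the write was based on an older version of the object than the one currently stored. The sketch below is a toy in-memory model of that mechanism, not Spiderpool or Kubernetes code. `Store`, `Conflict`, and `update_with_retry` are hypothetical names made up for illustration, and the sketch does not claim to identify the root cause of this issue.

```python
class Conflict(Exception):
    """Raised when a write is based on an outdated resource version."""


class Store:
    """Toy in-memory object store with optimistic concurrency (hypothetical)."""

    def __init__(self):
        self._objects = {}  # name -> (resource_version, data)

    def create(self, name, data):
        self._objects[name] = (1, dict(data))

    def get(self, name):
        version, data = self._objects[name]
        return version, dict(data)

    def update(self, name, expected_version, data):
        current_version, _ = self._objects[name]
        # Reject the write if the caller read an older version of the object.
        if expected_version != current_version:
            raise Conflict(f'"{name}": the object has been modified')
        self._objects[name] = (current_version + 1, dict(data))
        return current_version + 1


def update_with_retry(store, name, mutate, attempts=3):
    """Re-read the object and reapply the change whenever a conflict occurs."""
    for _ in range(attempts):
        version, data = store.get(name)
        mutate(data)
        try:
            return store.update(name, version, data)
        except Conflict:
            continue
    raise Conflict(f'"{name}": still conflicting after {attempts} attempts')
```

In this model, a writer that holds on to an old version keeps failing until it re-reads the object, which is why the error text asks the caller to apply changes to the latest version.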

What did you expect to happen?

The Pod is assigned an IP address and starts successfully.

How to reproduce it (as minimally and precisely as possible)

PyTorch creates jobs in batches, and the resulting Pod names carry sequence numbers, like the Pods of a stateful application (for example, llama2-master-0). When an administrator creates a set of tasks, quickly cancels them, and then creates a new set, endpoints with the same names occasionally remain from the previous run, and IP addresses cannot be allocated to the new Pods.

Additional Context

Proposed solution: the Pod whose UUID is recorded in the leftover endpoint no longer exists. Detect this case, delete or update the endpoint object, and use GC to clean up the old data.
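
The detection step proposed above could look roughly like the following. This is a minimal sketch under stated assumptions, not Spiderpool's implementation: `Endpoint`, `is_stale`, and `stale_endpoints` are hypothetical names, the endpoint is assumed to record the UID of the Pod it was created for, and `live_pod_uids` stands in for whatever lookup of currently existing Pods the real code would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical, simplified view of an endpoint object."""
    namespace: str
    name: str
    pod_uid: str  # UID of the Pod this endpoint was created for


def is_stale(endpoint, live_pod_uids):
    """Return True if the endpoint's recorded Pod no longer exists.

    live_pod_uids maps (namespace, name) -> UID of the Pod that exists now.
    A missing Pod, or a Pod with the same name but a different UID,
    both count as stale in this sketch.
    """
    current_uid = live_pod_uids.get((endpoint.namespace, endpoint.name))
    return current_uid != endpoint.pod_uid


def stale_endpoints(endpoints, live_pod_uids):
    """Collect the endpoints that should be deleted or updated."""
    return [ep for ep in endpoints if is_stale(ep, live_pod_uids)]
```

Comparing UIDs rather than names matters here because the rebuilt Pod reuses the old name. A real implementation would likely also need to account for endpoints that are kept on purpose, which this sketch ignores.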

@ty-dc added the kind/bug label Jul 8, 2024
@ty-dc changed the title from "The working application was rebuilt and the same endpoint name was kept which made it impossible to assign an IP address." to "The job application was rebuilt and the same endpoint name was kept which made it impossible to assign an IP address." Jul 8, 2024