Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: ensure containerd-related directories removed on failed bootstrap/join-cluster. #863

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

aznashwan
Copy link
Contributor

No description provided.

@aznashwan aznashwan requested a review from a team as a code owner December 3, 2024 19:00
@aznashwan aznashwan force-pushed the containerd-bootstrap-cleanup branch 5 times, most recently from 1f725fc to 773eb93 Compare December 3, 2024 20:49
Copy link
Member

@berkayoz berkayoz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, we should probably wait and adjust for #817

continue
}

// Ensure lockfile's contents is the one we expect:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this check ensure for us? Is it to check someone else did not create the locks?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I admit I have no concrete usecase where this check would ever be relevant, it just "felt right" to double-check in case someone changed their contents from what the k8s-snap expects.

Note that, as far as I can tell, the k8s-snap does NOT use the paths written within the lockfiles (considering they're hardcoded/derived within the k8s-snap itself, it has no need for them).

The only place I've seen the actual file contents read is in the remove hook of the snap.

I could gladly remove it if it's "too redundant" or switch the k8s-snap to always clean up the paths which are written within the file, instead of what the k8s-snap expects that file lock to be for...

} else {
if err := os.RemoveAll(dirpath); err != nil {
log.Info("WARN: failed to remove containerd data directory", "directory", dirpath, "error", err)
continue // Avoid removing the lockfile path.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't cleanup done on a best effort basis? Do we check the lockfile later on at bootstrap time?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is, but IMO it would have been "incorrect" for this code to remove the lockfile path if it had failed to remove the actual path it was pointing too. (IMO either both should be deleted; or none)

Do we check the lockfile later on at bootstrap time?

AFAIK the LockFilesDir() is used only by k8sd in this file when setting up the containerd dirs, and to mark a a node as worker (contents doesn't matter for marking workers, only the file existing does).

These contents of the lockfiles are only read by the remove hook of the k8s-snap (it also neglects to delete the lockfiles after it's done though, unsure whether that's intentional or not...).

tests/integration/tests/test_cleanup.py Show resolved Hide resolved
@aznashwan
Copy link
Contributor Author

@berkayoz note that it just hit me that because of the new pytest.mark.tags(tags.NIGHTLY) test tagging system the integration test doesn't get executed as part of the PR, sorry 🤦

While I believe it's better to leave that test for NIGHTLY, I've manually run the test on a 22.04 VM and I have.

I needed to add this small fix (I had forgotten to actually call socket.listen() in the contextmanager).

Can vouch that the added integration test seems to work when run manually now BUT it does sometimes exhibit the EOF error on my machine (which I am still positive is unrelated and hope Claudiu will push a fix for soon).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants