If you copy checkpoints from HOME to gcs they can get deleted #394

Open
sshleifer opened this issue Jul 8, 2023 · 3 comments


@sshleifer

sshleifer commented Jul 8, 2023

This happens because of this line:

if is_gcs_path(path) and not (path / _COMMIT_SUCCESS_FILE).exists():

The copied checkpoints are in GCS but have no commit success file, so Orbax treats them as temporary and cleans them up.
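To make the failure mode concrete, here is a minimal sketch of the check. It is a simplified model, not Orbax code: `treated_as_tmp` is a hypothetical helper, the marker file name `commit_success.txt` is an assumption about the value of `_COMMIT_SUCCESS_FILE`, and the `on_gcs` flag stands in for `is_gcs_path(path)` so the example runs on a local disk.

```python
import tempfile
from pathlib import Path

# Assumed marker file name; the real value of _COMMIT_SUCCESS_FILE may differ.
COMMIT_SUCCESS_FILE = "commit_success.txt"


def treated_as_tmp(path: Path, on_gcs: bool) -> bool:
    """Hypothetical model of the check: on GCS, no marker means 'temporary'."""
    return on_gcs and not (path / COMMIT_SUCCESS_FILE).exists()


with tempfile.TemporaryDirectory() as root:
    step_dir = Path(root) / "100"
    step_dir.mkdir()
    (step_dir / "checkpoint").write_text("weights")

    # Saved on a local filesystem: no marker is written, and none is required.
    local_result = treated_as_tmp(step_dir, on_gcs=False)
    # The same directory after a copy to GCS: the marker is still missing.
    copied_result = treated_as_tmp(step_dir, on_gcs=True)
    # Once the marker is added by hand, the copy is no longer a cleanup candidate.
    (step_dir / COMMIT_SUCCESS_FILE).touch()
    fixed_result = treated_as_tmp(step_dir, on_gcs=True)

print(local_result, copied_result, fixed_result)
```

The middle case is the one reported here: a complete checkpoint that only looks unfinished because it was written under the non-GCS convention.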

I would suggest either always or never writing the commit success file.

This is not blocking me (it was easy to write the extra commit success files once I found this), but I wanted to report it because the behavior was very unexpected and moving checkpoints around is very common.
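For anyone hitting the same thing, the workaround can be sketched roughly as follows. This is an illustration under assumptions, not a supported Orbax utility: `add_commit_markers` is a hypothetical helper, the file name `commit_success.txt` is assumed, and the sketch assumes the directories Orbax inspects are the immediate subdirectories of the checkpoint root. It uses `pathlib` on a local directory, so it would be run before copying; writing directly to a `gs://` path would need a GCS-aware path library instead.

```python
import tempfile
from pathlib import Path
from typing import List

# Assumed marker file name; check it against your installed Orbax version.
COMMIT_SUCCESS_FILE = "commit_success.txt"


def add_commit_markers(root: Path) -> List[Path]:
    """Write the marker into every immediate subdirectory that lacks one."""
    written = []
    for step_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        marker = step_dir / COMMIT_SUCCESS_FILE
        if not marker.exists():
            marker.touch()
            written.append(marker)
    return written


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for step in ("100", "200"):
        (root / step).mkdir()

    # First pass writes one marker per step directory.
    first_pass = [m.parent.name for m in add_commit_markers(root)]
    # Second pass is a no-op, so the helper is safe to re-run.
    second_pass = add_commit_markers(root)

print(first_pass, second_pass)
```

Only run something like this on checkpoints you know are complete, since the marker is exactly what tells the GCS code path that a save finished.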

@cpgaffney1
Collaborator

Thanks for the report. We currently use different mechanisms to ensure atomicity on GCS vs. other filesystems, a practice we inherited from earlier code. I will look into standardizing this.
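As a rough illustration of the two mechanisms being contrasted (a sketch for readers of this thread, not the Orbax implementation), one approach finalizes a save by renaming a temporary directory, and the other by writing a marker file last. The helper names, the `.tmp` suffix, and the marker file name below are all hypothetical.

```python
import os
import tempfile
from pathlib import Path

COMMIT_SUCCESS_FILE = "commit_success.txt"  # assumed marker file name
TMP_SUFFIX = ".tmp"  # hypothetical suffix for in-progress directories


def commit_by_rename(tmp_dir: Path) -> Path:
    """Finalize by renaming `<name>.tmp` to `<name>` in a single rename call."""
    final_dir = tmp_dir.with_name(tmp_dir.name[: -len(TMP_SUFFIX)])
    os.rename(tmp_dir, final_dir)
    return final_dir


def commit_by_marker(final_dir: Path) -> Path:
    """Finalize by writing a marker file after all other files are in place."""
    marker = final_dir / COMMIT_SUCCESS_FILE
    marker.touch()
    return marker


with tempfile.TemporaryDirectory() as root:
    # Strategy 1: the directory name itself signals completion.
    tmp_dir = Path(root) / ("100" + TMP_SUFFIX)
    tmp_dir.mkdir()
    renamed = commit_by_rename(tmp_dir)
    renamed_name = renamed.name
    renamed_ok = renamed.is_dir() and not tmp_dir.exists()

    # Strategy 2: the presence of the marker signals completion.
    marked_dir = Path(root) / "200"
    marked_dir.mkdir()
    marker_ok = commit_by_marker(marked_dir).exists()

print(renamed_name, renamed_ok, marker_ok)
```

A checkpoint finalized with the first strategy carries no marker, which is why it fails a marker-based check after being copied.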

@young-geng

+1 on this. I had a lot of trouble loading a checkpoint that was saved locally and then copied to GCS: Orbax keeps reporting the checkpoint as incomplete because the _COMMIT_SUCCESS_FILE file is missing.

@cpgaffney1
Collaborator

Update: our previous intention was to switch to the same logic everywhere, i.e. relying on atomic rename. That is not possible on all filesystems, though, so we intend to make it configurable instead, while defaulting to atomic rename for GCS and internal. This now has higher priority, to better support cloud users; hopefully we will get to it within a month.
