AWS check if bagged_score.csv exists #503

Open
maikia wants to merge 8 commits into master
Conversation

maikia (Contributor) commented Jan 7, 2021

This PR checks whether the bagged score file (the last score saved by ramp) exists on the AWS EC2 instance.

It should close #483 and close #499.
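
For reference, a minimal sketch of this kind of remote existence check (the host, key path, and remote path below are placeholders, and the ssh-plus-subprocess approach is an assumption for illustration, not necessarily how ramp_engine implements it):

import subprocess

def remote_file_exists(host, key_path, remote_path):
    # return True if remote_path exists on the EC2 instance
    # (placeholder ssh-based check; adapt to the actual worker config)
    cmd = ["ssh", "-i", key_path, host, "test", "-e", remote_path]
    return subprocess.run(cmd).returncode == 0

# e.g. the bagged score file written at the end of a successful run
# (the exact remote path is an assumption, not taken from this PR):
# remote_file_exists("ubuntu@<ec2-host>", "key.pem",
#                    "~/submission/training_output/bagged_scores.csv")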

codecov bot commented Jan 7, 2021

Codecov Report

Merging #503 (fff3a5d) into master (b2bd7e3) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #503      +/-   ##
==========================================
+ Coverage   93.65%   93.67%   +0.01%     
==========================================
  Files          99       99              
  Lines        8633     8658      +25     
==========================================
+ Hits         8085     8110      +25     
  Misses        548      548              
Impacted Files Coverage Δ
ramp-engine/ramp_engine/tests/test_aws.py 87.78% <100.00%> (+1.28%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b2bd7e3...fff3a5d.

maikia changed the title from "WIP AWS check if bagged_score.csv exists" to "AWS check if bagged_score.csv exists" on Jan 8, 2021
maikia (Contributor, Author) commented Jan 8, 2021

The error happens because, although the screen no longer exists, the score is not yet saved.

The outcome of the submission might be:

  • successful (we can check if bagged_scores.csv exists)
  • failed (check if error.txt exists in one of the fold_* directories).

This is not a very nice solution, and I don't know yet how to properly test it with pytest, but it might work temporarily. A sketch of the combined check is below.
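
A minimal sketch of that combined check (the paths, the helper name, and the ssh-based approach are illustrative assumptions, not the PR's actual code):

import subprocess

def has_error_or_score_file(host, key_path, submission_dir):
    # successful run: bagged_scores.csv was written at the end of training
    score_cmd = ["ssh", "-i", key_path, host,
                 f"test -e {submission_dir}/training_output/bagged_scores.csv"]
    if subprocess.run(score_cmd).returncode == 0:
        return True
    # failed run: error.txt exists in one of the fold_* directories
    # (the glob is expanded by the remote shell; ls fails if nothing matches)
    error_cmd = ["ssh", "-i", key_path, host,
                 f"ls {submission_dir}/training_output/fold_*/error.txt"]
    return subprocess.run(error_cmd, capture_output=True).returncode == 0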

maikia (Contributor, Author) commented Jan 13, 2021

@tomMoral or @kegl, could you please also take a look at this before merging?
Thanks in advance.

# this can only work if the training was successful
has_score_file = _has_error_or_score_file(
    config, instance_id, submission_name)
return not has_screen and has_score_file
Collaborator

Doesn't this risk deadlocking the workers if a screen failed but the results were not saved? Which mechanism ensures that you will exit the training loop in this case?

Contributor Author

You are right. Do you have a suggestion for an additional check to avoid that?

Collaborator

You need three states: running (screen alive), finished (no screen and a score file exists), and broken (no screen and no score). Stop when you are no longer running.
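
A sketch of how those three states could be encoded (the names and the helper function are illustrative only):

from enum import Enum

class SubmissionState(Enum):
    RUNNING = "running"    # screen session still alive
    FINISHED = "finished"  # no screen, and a score or error file exists
    BROKEN = "broken"      # no screen and no score or error file

def submission_state(has_screen, has_score_or_error_file):
    if has_screen:
        return SubmissionState.RUNNING
    if has_score_or_error_file:
        return SubmissionState.FINISHED
    return SubmissionState.BROKEN

# the polling loop would then stop as soon as the state is not RUNNING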

Contributor Author

Yes, I understand the three states, and the finished state can end on either the score file or the error file. But the broken state is problematic, because I don't know how to check that we are not running anymore when we are broken. That's why the check on whether the screen exists was introduced.
Do you have an idea how to do that?

Otherwise, I want to introduce a timeout for AWS instances, which would be a partial solution: in the few cases where, for some unexplained reason, the instance has no screen and did not manage to save a file (i.e. it is broken), in the worst case it would run idle for the set timeout length of time.
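
The timeout idea could look roughly like this (the start-time bookkeeping and the 12-hour value are assumptions for illustration):

import time

MAX_RUNTIME = 12 * 3600  # illustrative hard timeout, in seconds

def timed_out(start_time):
    # give up on the submission once it has been running longer than
    # MAX_RUNTIME, even if neither a screen nor a score file can be found
    return time.time() - start_time > MAX_RUNTIME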

Collaborator

I would say that if you are in a broken state, just restart: consider it a checking error, no?
That way you launch the submission again when you are in a state where you don't know what to do.
This is the classical approach taken in multiprocessing computations.
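
A sketch of that restart-on-broken behaviour (the worker object and its methods are invented here for illustration; they are not the actual ramp_engine API):

def poll_once(worker):
    # worker.check_state() is assumed to return "running", "finished"
    # or "broken", following the three states discussed above
    state = worker.check_state()
    if state == "finished":
        worker.collect_results()
    elif state == "broken":
        # treat the broken state like a checking error: report it and
        # relaunch the submission instead of waiting forever
        worker.log_error("broken: no screen and no score or error file")
        worker.restart_submission()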

Contributor Author

Yes, good idea, we can do that. But still, do you have an idea how to make sure that we really are broken?
(Previously, no screen and no score could also mean that we were still in the process of saving the score, so it would not be a good idea to rely on those two flags alone to restart the whole submission.)

Collaborator

Can't you compute the score in the same screen?
Otherwise, you can use a patience parameter to allow for some extra time.
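
The patience idea could be sketched like this (a grace period after the screen disappears before declaring the submission broken; the 600-second value and the state names are illustrative):

import time

PATIENCE = 600  # seconds of grace after the screen disappears (example value)

def classify(has_screen, has_score_or_error_file, screen_gone_since):
    if has_screen:
        return "running"
    if has_score_or_error_file:
        return "finished"
    if time.time() - screen_gone_since < PATIENCE:
        # the score file may still be in the process of being written
        return "running"
    return "broken"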

rth (Collaborator) commented Apr 4, 2022

What's the conclusion on this? Can we merge it or not?

agramfort (Collaborator) commented

I have no clue.

maikia (Contributor, Author) commented Apr 5, 2022

Me neither. I guess it depends on the need: is ramp still running into this issue nowadays?
This was just a patch; if it's not urgent, a better solution might be found.

tomMoral (Collaborator) commented Apr 5, 2022

Yes, I ran into this one multiple times with follicles, because of a test-time error which created fold_0 but not the others.
I would say it would be nice to fix this.

But we should finish this and detect the broken state (and report the error).

Development

Successfully merging this pull request may close these issues:

  • BUG train_time not found
  • BUG missing bagged scores.csv when collecting the data [AWS workers]

4 participants