AWS check if bagged_score.csv exists #503
base: master
Conversation
Codecov Report
@@ Coverage Diff @@
## master #503 +/- ##
==========================================
+ Coverage 93.65% 93.67% +0.01%
==========================================
Files 99 99
Lines 8633 8658 +25
==========================================
+ Hits 8085 8110 +25
Misses 548 548
Continue to review full report at Codecov.
The error happens because, although the screen no longer exists, the score session is not saved. The outcome of the submission might be:
This is not a very nice solution and I don't know how to properly test it using pytest.
    # this can only work if the training was successful
    has_score_file = _has_error_or_score_file(
        config, instance_id, submission_name)
    return not has_screen and has_score_file
Doesn't this risk deadlocking the workers if a screen failed but the results were not saved? Which mechanism ensures that you will exit the training loop in this case?
You are right. Do you have a suggestion for an additional check to avoid that?
You need to have three states: one running (screen alive), one finished (no screen and the score file exists), and another one broken (no screen and no score). Stop when you are no longer running.
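The three states described above could be sketched like this. The flag names mirror the checks discussed in this PR (`has_screen`, `has_score_file`), but the enum and function names here are illustrative, not the PR's actual code:

```python
from enum import Enum


class TrainingState(Enum):
    RUNNING = "running"    # screen alive
    FINISHED = "finished"  # no screen, score (or error) file exists
    BROKEN = "broken"      # no screen and no score file


def training_state(has_screen: bool, has_score_file: bool) -> TrainingState:
    """Classify a submission from the two flags discussed in the thread."""
    if has_screen:
        return TrainingState.RUNNING
    if has_score_file:
        return TrainingState.FINISHED
    return TrainingState.BROKEN
```

The dispatcher would then stop polling as soon as the state is no longer `RUNNING`.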
Yes, I understand the three states, and the finished state can end on either the score file or the error file. But the broken state is problematic because I don't know how to check that we are no longer running when we are broken. That's why checking if the screen exists was introduced. Do you have an idea how to do that?
Otherwise, I want to introduce a timeout for AWS instances, which would be a partial solution: in those few cases when, for some unexplained reason, the instance has no screen and did not manage to save a file (i.e. is broken), in the worst-case scenario it would run for the set timeout length of time doing nothing.
I would say that if you are in a broken state, just restart: consider it a checking error, no?
That way you relaunch the submission, since you are in a state where you don't know what to do.
This is the classical approach taken in multiprocessing computations.
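The "restart when you don't know what to do" policy suggested here could be sketched as a small decision function. The action names and the function itself are hypothetical, just to make the policy concrete:

```python
def next_action(has_screen: bool, has_score_file: bool) -> str:
    """Map the two observable flags to a dispatcher action."""
    # Screen still alive: the training is running, keep waiting.
    if has_screen:
        return "wait"
    # No screen but a score (or error) file: finished, collect results.
    if has_score_file:
        return "collect"
    # No screen and no file: broken; treat it as a checking error
    # and relaunch the submission rather than wait forever.
    return "restart"
```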
Yes, good idea. We can do that. But still, do you have an idea how to ensure that we are broken? (No screen and no score previously meant still being in the process of saving the score, so it would not be a good idea to take only those two flags into account to restart the whole submission.)
Can't you compute the score in the same screen?
Otherwise, you can use a patience parameter to allow for some time.
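A patience parameter would resolve the ambiguity raised above: "no screen and no score" only means broken once it has persisted for a while, since right after the screen exits the score may still be being written. A minimal sketch, where `no_screen_since` is hypothetical bookkeeping (the monotonic timestamp at which the screen was first seen missing), not something the PR tracks:

```python
import time


def is_broken(has_screen: bool, has_score_file: bool,
              no_screen_since: float, patience: float = 60.0) -> bool:
    """Declare the submission broken only after `patience` seconds
    with no screen and no score file."""
    if has_screen or has_score_file:
        return False
    # Within the grace period, assume the score is still being saved.
    return time.monotonic() - no_screen_since > patience
```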
What's the conclusion on this? Can we merge it or not?
I have no clue.
Me neither. I guess it depends on the need: is ramp running into this issue nowadays?
Yes, I run into this one multiple times. But we should finish this and detect the broken state (and report the error).
It checks whether the bagged score file (the last score saved by ramp) exists on the AWS EC2 instance.
It should close #483 and close #499
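The existence check the PR describes could look roughly like the following. The plain-ssh transport, the `host` argument, and the injectable `run` hook are all assumptions for illustration; the PR itself goes through ramp's own AWS helpers:

```python
import subprocess


def remote_file_exists(host: str, path: str, run=subprocess.run) -> bool:
    """Check over ssh whether `path` exists on the remote instance.

    `test -f` exits 0 when the file exists, non-zero otherwise, so the
    ssh return code carries the answer. `run` is injectable for testing.
    """
    result = run(["ssh", host, "test", "-f", path], capture_output=True)
    return result.returncode == 0
```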