AWS check if bagged_score.csv exists #503

Open
maikia wants to merge 8 commits into master
Conversation

maikia (Contributor) commented Jan 7, 2021

This PR checks whether the bagged score file (the last score saved by ramp) exists on the AWS EC2 instance.

It should close #483 and close #499.
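
For reference, a minimal sketch of this kind of remote existence check (the host, key path, and remote path below are placeholders, and the ssh-plus-subprocess approach is an assumption for illustration, not necessarily how ramp_engine implements it):

import subprocess

def remote_file_exists(host, key_path, remote_path):
    # return True if remote_path exists on the EC2 instance
    # (placeholder ssh-based check; adapt to the actual worker config)
    cmd = ["ssh", "-i", key_path, host, "test", "-e", remote_path]
    return subprocess.run(cmd).returncode == 0

# e.g. the bagged score file written at the end of a successful run
# (the exact remote path is an assumption, not taken from this PR):
# remote_file_exists("ubuntu@<ec2-host>", "key.pem",
#                    "~/submission/training_output/bagged_scores.csv")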

codecov bot commented Jan 7, 2021

Codecov Report

Merging #503 (fff3a5d) into master (b2bd7e3) will increase coverage by 0.01%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master     #503      +/-   ##
==========================================
+ Coverage   93.65%   93.67%   +0.01%     
==========================================
  Files          99       99              
  Lines        8633     8658      +25     
==========================================
+ Hits         8085     8110      +25     
  Misses        548      548              
Impacted Files Coverage Δ
ramp-engine/ramp_engine/tests/test_aws.py 87.78% <100.00%> (+1.28%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b2bd7e3...fff3a5d.

maikia changed the title from "WIP AWS check if bagged_score.csv exists" to "AWS check if bagged_score.csv exists" on Jan 8, 2021
maikia (Contributor, Author) commented Jan 8, 2021

The error happens because, although the screen no longer exists, the score is not yet saved.

The outcome of the submission might be:

  • successful (we can check if bagged_scores.csv exists)
  • failed (check if error.txt exists in one of the fold_* directories).

This is not a very nice solution, and I don't know yet how to properly test it with pytest, but it might work temporarily. A sketch of the combined check is below.
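
A minimal sketch of that combined check (the paths, the helper name, and the ssh-based approach are illustrative assumptions, not the PR's actual code):

import subprocess

def has_error_or_score_file(host, key_path, submission_dir):
    # successful run: bagged_scores.csv was written at the end of training
    score_cmd = ["ssh", "-i", key_path, host,
                 f"test -e {submission_dir}/training_output/bagged_scores.csv"]
    if subprocess.run(score_cmd).returncode == 0:
        return True
    # failed run: error.txt exists in one of the fold_* directories
    # (the glob is expanded by the remote shell; ls fails if nothing matches)
    error_cmd = ["ssh", "-i", key_path, host,
                 f"ls {submission_dir}/training_output/fold_*/error.txt"]
    return subprocess.run(error_cmd, capture_output=True).returncode == 0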

maikia (Contributor, Author) commented Jan 13, 2021

@tomMoral or @kegl, could you please also take a look at this before merging?
Thanks in advance.

# this can only work if the training was successful
has_score_file = _has_error_or_score_file(
    config, instance_id, submission_name)
return not has_screen and has_score_file
Collaborator

Doesn't this risk deadlocking the workers if a screen failed but the results were not saved? Which mechanism ensures that you will exit the training loop in this case?

Contributor Author

You are right. Do you have a suggestion for an additional check to avoid that?

Collaborator

You need three states: running (screen alive), finished (no screen and a score file exists), and broken (no screen and no score). Stop when you are no longer running.
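
A sketch of how those three states could be encoded (the names and the helper function are illustrative only):

from enum import Enum

class SubmissionState(Enum):
    RUNNING = "running"    # screen session still alive
    FINISHED = "finished"  # no screen, and a score or error file exists
    BROKEN = "broken"      # no screen and no score or error file

def submission_state(has_screen, has_score_or_error_file):
    if has_screen:
        return SubmissionState.RUNNING
    if has_score_or_error_file:
        return SubmissionState.FINISHED
    return SubmissionState.BROKEN

# the polling loop would then stop as soon as the state is not RUNNING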

Contributor Author

Yes, I understand the three states, and the finished state can end on either the score file or the error file. But the broken state is problematic, because I don't know how to check that we are not running anymore when we are broken. That's why the check on whether the screen exists was introduced.
Do you have an idea how to do that?

Otherwise, I want to introduce a timeout for AWS instances, which would be a partial solution: in the few cases where, for some unexplained reason, the instance has no screen and did not manage to save a file (i.e. it is broken), in the worst case it would run idle for the set timeout length of time.
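
The timeout idea could look roughly like this (the start-time bookkeeping and the 12-hour value are assumptions for illustration):

import time

MAX_RUNTIME = 12 * 3600  # illustrative hard timeout, in seconds

def timed_out(start_time):
    # give up on the submission once it has been running longer than
    # MAX_RUNTIME, even if neither a screen nor a score file can be found
    return time.time() - start_time > MAX_RUNTIME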

Collaborator

I would say that if you are in a broken state, just restart: consider it a checking error, no?
That way you launch the submission again when you are in a state where you don't know what to do.
This is the classical approach taken in multiprocessing computations.
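
A sketch of that restart-on-broken behaviour (the worker object and its methods are invented here for illustration; they are not the actual ramp_engine API):

def poll_once(worker):
    # worker.check_state() is assumed to return "running", "finished"
    # or "broken", following the three states discussed above
    state = worker.check_state()
    if state == "finished":
        worker.collect_results()
    elif state == "broken":
        # treat the broken state like a checking error: report it and
        # relaunch the submission instead of waiting forever
        worker.log_error("broken: no screen and no score or error file")
        worker.restart_submission()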

Contributor Author

Yes, good idea, we can do that. But still, do you have an idea how to make sure that we really are broken?
(Previously, no screen and no score could also mean that we were still in the process of saving the score, so it would not be a good idea to rely on those two flags alone to restart the whole submission.)

Collaborator

Can't you compute the score in the same screen?
Otherwise, you can use a patience parameter to allow for some extra time.
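
The patience idea could be sketched like this (a grace period after the screen disappears before declaring the submission broken; the 600-second value and the state names are illustrative):

import time

PATIENCE = 600  # seconds of grace after the screen disappears (example value)

def classify(has_screen, has_score_or_error_file, screen_gone_since):
    if has_screen:
        return "running"
    if has_score_or_error_file:
        return "finished"
    if time.time() - screen_gone_since < PATIENCE:
        # the score file may still be in the process of being written
        return "running"
    return "broken"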

rth (Collaborator) commented Apr 4, 2022

What's the conclusion on this? Can we merge it or not?

agramfort (Collaborator) commented

I have no clue.

maikia (Contributor, Author) commented Apr 5, 2022

Me neither. I guess it depends on the need: is ramp still running into this issue nowadays?
This was just a patch; if it's not urgent, a better solution might be found.

tomMoral (Collaborator) commented Apr 5, 2022

Yes, I ran into this one multiple times with follicles, because of a test-time error which created fold_0 but not the others.
I would say it would be nice to fix this.

But we should finish this and detect the broken state (and report the error).

Development

Successfully merging this pull request may close these issues:

  • BUG train_time not found
  • BUG missing bagged scores.csv when collecting the data [AWS workers]

4 participants