AWS check if bagged_score.csv exists #503

Open · wants to merge 8 commits into master
28 changes: 26 additions & 2 deletions ramp-engine/ramp_engine/aws/api.py
@@ -795,9 +795,14 @@ def _is_ready(config, instance_id):

 def _training_finished(config, instance_id, submission_name):
     """
-    Return True if a submission has finished training
+    Return True if a submission has finished training (the screen no longer
+    exists on the ec2 instance and a bagged_scores.csv or error file was saved)
     """
-    return not _has_screen(config, instance_id, submission_name)
+    has_screen = _has_screen(config, instance_id, submission_name)
+    # this can only work if the training was successful
+    has_score_file = _has_error_or_score_file(
+        config, instance_id, submission_name)
+    return not has_screen and has_score_file
Collaborator:

Doesn't this risk deadlocking the workers if the screen failed but the results were not saved? Which mechanism ensures that you will exit the training loop in this case?

Contributor Author:

You are right. Do you have a suggestion for an additional check to avoid that?

Collaborator:

You need three states: running (screen alive), finished (no screen and the score file exists), and broken (no screen and no score). Stop polling when you are not running anymore. See the sketch below.
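
A minimal sketch of what such a three-state check could look like (the `TrainingState` enum and `_training_state` helper are hypothetical illustrations, not part of this PR; `_has_screen` and `_has_error_or_score_file` are the helpers from the diff above):

from enum import Enum

class TrainingState(Enum):
    RUNNING = 'running'      # screen still alive
    FINISHED = 'finished'    # no screen, and a score or error file exists
    BROKEN = 'broken'        # no screen and no file: outcome unknown

def _training_state(config, instance_id, submission_name):
    # hypothetical helper combining the two checks from this PR
    if _has_screen(config, instance_id, submission_name):
        return TrainingState.RUNNING
    if _has_error_or_score_file(config, instance_id, submission_name):
        return TrainingState.FINISHED
    return TrainingState.BROKEN

The polling loop would then keep waiting only while the state is RUNNING.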

Contributor Author:

Yes, I understand the three states, and the finished state can end on either the score file or the error file. But the broken state is problematic, because I don't know how to check that we are not running anymore if we are broken; that's why checking whether the screen exists was introduced.
Do you have an idea how to do that?

Otherwise, I want to introduce a timeout for the AWS instances, which would be a partial solution: in those few cases where, for some unexplained reason, the instance has no screen and did not manage to save a file (i.e. is broken), in the worst-case scenario it would run for the set timeout length doing nothing.

Collaborator:

I would say that if you are in a broken state, just restart and consider it a checking error, no?
That way you launch the submission back, since you are in a state where you don't know what to do.
This is the classical approach taken in multiprocessing computations.
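
A rough sketch of that suggestion, built on the hypothetical `_training_state` helper above (the `requeue` callback and the string return values are assumptions, not taken from the codebase):

import logging

logger = logging.getLogger(__name__)

def _check_training(config, instance_id, submission_name, requeue):
    # map the three states onto worker actions; `requeue` is an assumed
    # callback that puts the submission back into the queue
    state = _training_state(config, instance_id, submission_name)
    if state is TrainingState.BROKEN:
        # outcome unknown: treat it as a checking error and relaunch
        logger.warning('screen died without leaving a score or error '
                       'file; adding the submission back to the queue')
        requeue(submission_name)
        return 'retry'
    return 'running' if state is TrainingState.RUNNING else 'finished'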

Contributor Author:

Yes, good idea. We can do that. But still, do you have an idea how to ensure that we are broken?
(No screen and no score previously meant still being in the process of saving the score, so it would not be a good idea to take only those two flags into account before restarting the whole submission.)

Collaborator:

Can't you compute the score in the same screen?
Otherwise, you can use a patience parameter to allow for some time, as sketched below.
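
A sketch of the patience idea (the function name, defaults, and polling loop are assumptions; `_has_error_or_score_file` is the helper added in this PR):

import time

def _wait_for_result_file(config, instance_id, submission_name,
                          patience=60, poll_interval=5):
    # once the screen has died, allow up to `patience` seconds for
    # bagged_scores.csv or an error file to appear before declaring
    # the run broken
    deadline = time.time() + patience
    while time.time() < deadline:
        if _has_error_or_score_file(config, instance_id, submission_name):
            return True   # finished, with either a score or a logged error
        time.sleep(poll_interval)
    return False          # broken: no file appeared within the patience window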



 def _training_successful(config, instance_id, submission_name,

@@ -844,6 +849,25 @@ def _has_screen(config, instance_id, screen_name):
     return nb > 0


+def _has_error_or_score_file(config, instance_id, screen_name):
+    """
+    Return True if a 'bagged_scores.csv' file exists (the submission
+    terminated with a success) or if an error.txt file exists in one of
+    the fold_* directories (the submission terminated with an error) on
+    the ec2 instance
+    """
+    submission_path = os.path.join(
+        config['remote_ramp_kit_folder'],
+        'submissions', screen_name, 'training_output')
+    cmd = ("bash -c '"
+           f"if [ -f {submission_path}/bagged_scores.csv ] || "
+           f"[ -f {submission_path}/fold_*/error.txt ]; "
+           "then echo 1; else echo 0; fi'")
+    is_log_or_score = _run(config, instance_id, cmd, return_output=True)
+
+    return int(is_log_or_score)


 def _tag_instance_by_submission(config, instance_id, submission_name):
     """
     Add tags to an instance with infos from the submission to know which
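
One caveat on the shell test in `_has_error_or_score_file`: `[ -f {submission_path}/fold_*/error.txt ]` only works if the glob expands to at most one path; if several folds write an error.txt, `[` receives multiple arguments, errors out, and the command falls through to `echo 0`. A possible alternative (assuming bash with the `compgen` builtin is available on the instance) succeeds whenever the glob matches at least one file:

cmd = ("bash -c '"
       f"if [ -f {submission_path}/bagged_scores.csv ] || "
       f"compgen -G \"{submission_path}/fold_*/error.txt\" > /dev/null; "
       "then echo 1; else echo 0; fi'")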
43 changes: 43 additions & 0 deletions ramp-engine/ramp_engine/tests/test_aws.py
@@ -333,6 +333,49 @@ class DummyInstance:
     assert 'Adding the submission back to the queue' in caplog.text
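
A note on the decorator stack in the new test below: `mock.patch` decorators are applied bottom-up, so the innermost patch (`_has_error_or_score_file`) is passed as the first test argument (`has_score_file`) and the outermost (`is_spot_terminated`) as the last (`spot_terminated`).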


+@mock.patch('ramp_engine.aws.api.is_spot_terminated')
+@mock.patch('ramp_engine.aws.api.launch_train')
+@mock.patch('ramp_engine.aws.api._has_screen')
+@mock.patch('ramp_engine.aws.api._has_error_or_score_file')
+def test_not_finished_until_bagged_or_log_saved(has_score_file, has_screen,
+                                                launch_train,
+                                                spot_terminated,
+                                                caplog):
+    """Check that the submission is considered finished only if it has
+    finished training and the bagged_scores.csv file was saved, or if it
+    failed with errors which appear in the log file."""
+    class DummyInstance:
+        id = 1
+    launch_train.return_value = 0
+    has_screen.return_value = True
+    has_score_file.return_value = False
+
+    # set up the AWS worker
+    event_config = read_config(ramp_aws_config_template())['worker']
+
+    worker = AWSWorker(event_config, submission='starting_kit_local')
+    worker.config = event_config
+    worker.submission = 'dummy submissions'
+    worker.instance = DummyInstance
+
+    # the submission has not yet finished training
+    spot_terminated.return_value = False
+
+    worker.launch_submission()
+    assert worker.status == 'running'
+    assert caplog.text == ''
+
+    # the screen is no longer there
+    has_screen.return_value = False
+    assert worker.status == 'running'
+    assert caplog.text == ''
+
+    # the score file was saved
+    has_score_file.return_value = True
+    assert worker.status == 'finished'
+    assert caplog.text == ''
+
+
 def test_aws_worker():
     if not os.path.isfile(os.path.join(HERE, 'config.yml')):
         pytest.skip("Only for local tests for now")