Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI Feature to detect stalled experiments #2049

Conversation

TerrenceMcGuinness-NOAA
Copy link
Collaborator

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA commented Nov 9, 2023

Description

A helper python script rocoto_statcount.py is added in $HOMEgfs/ci/scrpts/utils that checks to see if an experiment is "stalled". This is done by running rocotorun and checking within all the jobs and if none are RUNNING, SUBMITTING, or QUEUED then the experiment is STALLED.

Resolves #2008

Type of change

  • New feature (adds functionality)

Change characteristics

  • Is this a breaking change (a change in existing functionality)? NO
  • Does this change require a documentation update? NO

How has this been tested?

  • Ran in CI CRON development space and set up a case that had a missing file dependency and it was appropriately flagged.

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA added feature New feature or request CI/CD Issue related to CI/CD labels Nov 9, 2023
ci/scripts/run_ci.sh Fixed Show fixed Hide fixed
fixed some grammar in description
echo "Running: ${rocotorun} -v 10 -w ${xml} -d ${db}"
"${rocotorun}" -v 10 -w "${xml}" -d "${db}"
set +e
${ROOT_DIR}/ci/scripts/utils/rocoto_statcount.py -d "${db}" -w "${xml}" --check_stalled

Check notice

Code scanning / shellcheck

Double quote to prevent globbing and word splitting.

Double quote to prevent globbing and word splitting.
fixed double quotes for bash lint
Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks ok, but I spotted some minor typos.
See suggestions.
And one comment on something that probably is incorrect.

@@ -131,7 +131,7 @@ for pr in ${pr_list}; do
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Running" --add-label "CI-${MACHINE_ID^}-Failed"
error_logs=$("${rocotostat}" -d "${db}" -w "${xml}" | grep -E 'FAIL|DEAD' | awk '{print "-c", $1, "-t", $2}' | xargs "${rocotocheck}" -d "${db}" -w "${xml}" | grep join | awk '{print $2}') || true
{
echo "Experiment ${pslot} Terminated: *** FAILED ***"
echo "Experiment ${pslot} Terminated: *** FAILED *** on ${MACHIND_ID^}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small typo

Suggested change
echo "Experiment ${pslot} Terminated: *** FAILED *** on ${MACHIND_ID^}"
echo "Experiment ${pslot} Terminated: *** FAILED *** on ${MACHINE_ID^}"

Comment on lines 155 to 156
echo "Experiment ${pslot} **SUCCESS** ${DATE +%Y%m%d} on ${MACHINE_ID^}" >> "${output_ci_single}"
echo "Experiment ${pslot} **SUCCESS** at ${DATE +%Y%m%d} on ${MACHIND_ID^}" >> "${output_ci}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cannot possibly work ${DATE +%Y%m%d}.

output_ci="${pr_dir}/output_runtime_single.log"
{
echo "${pslot} has *** STALLED **** on ${MACHINE_ID^}"
echo "A job in expermint ${pslot} in ${pslot_dir}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
echo "A job in expermint ${pslot} in ${pslot_dir}"
echo "A job in experiment ${pslot} in ${pslot_dir}"

ci/scripts/utils/rocoto_statcount.py Outdated Show resolved Hide resolved
ci/scripts/utils/rocoto_statcount.py Outdated Show resolved Hide resolved
Copy link
Contributor

@aerorahul aerorahul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some suggestions for rocoto_statcount.py

ci/scripts/utils/rocoto_statcount.py Outdated Show resolved Hide resolved
ci/scripts/utils/rocoto_statcount.py Outdated Show resolved Hide resolved
xml_file_path = os.path.abspath(args.w)
db_file_path = os.path.abspath(args.d)

rocotostat_all = which("rocotostat")
Copy link
Contributor

@aerorahul aerorahul Nov 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do this again? You already have rocotostat. You can use that any number of times with different arguments.

@aerorahul I have not been able to find a way to use add_default_args to "update" the argument list of an Executable Object. Can you fine an example, could not find such a use case in the repo.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could not find a way to "change" arguments once the are set on the Executable object.

Comment on lines 58 to 68
rocotostat.add_default_arg(['-w',xml_file_path,'-d',db_file_path,'-s'])
rocotostat_all.add_default_arg(['-w',xml_file_path,'-d',db_file_path,'-a'])

rocotostat_output = rocotostat(output=str)
rocotostat_output = rocotostat_output.splitlines()[1:]
rocotostat_output = [line.split()[0:2] for line in rocotostat_output]

rocotostat_output_all = rocotostat_all(output=str)
rocotostat_output_all = rocotostat_output_all.splitlines()[1:]
rocotostat_output_all = [line.split()[0:4] for line in rocotostat_output_all]
rocotostat_output_all = [line for line in rocotostat_output_all if len(line) != 1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can show you how to reduce this and simplify it considerably

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure that would be great, you can jot down pseudo code it you like


if args.check_stalled:
if rocoto_state != 'Done':
rocoto_run = which("rocotorun")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having seen this, I think we should have a RocotoCommands class and initialize it with everything from rocoto you need; e.g. rocotostat, rocotorun, etc. While you are at it, also provide the xml and database to the class object rather than have to do this over and over.

ci/scripts/utils/rocoto_statcount.py Outdated Show resolved Hide resolved
@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA marked this pull request as draft November 13, 2023 15:38
@@ -152,8 +152,8 @@
rm -f "${output_ci_single}"
# echo "\`\`\`" > "${output_ci_single}"
DATE=$(date)
echo "Experiment ${pslot} **SUCCESS** ${DATE}" >> "${output_ci_single}"
echo "Experiment ${pslot} **SUCCESS** at ${DATE}" >> "${output_ci}"
echo "Experiment ${pslot} **SUCCESS** $(date +'%A %b %d, %Y') on ${MACHINE_ID^}" >> "${output_ci_single}"

Check notice

Code scanning / shellcheck

Consider invoking this command separately to avoid masking its return value (or use '|| true' to ignore).

Consider invoking this command separately to avoid masking its return value (or use '|| true' to ignore).
echo "Experiment ${pslot} **SUCCESS** ${DATE}" >> "${output_ci_single}"
echo "Experiment ${pslot} **SUCCESS** at ${DATE}" >> "${output_ci}"
echo "Experiment ${pslot} **SUCCESS** $(date +'%A %b %d, %Y') on ${MACHINE_ID^}" >> "${output_ci_single}"
echo "Experiment ${pslot} **SUCCESS** at $(date +'%A %b %d, %Y') on ${MACHINE_ID^}" >> "${output_ci}"

Check notice

Code scanning / shellcheck

Consider invoking this command separately to avoid masking its return value (or use '|| true' to ignore).

Consider invoking this command separately to avoid masking its return value (or use '|| true' to ignore).
ci/scripts/check_ci.sh Fixed Show fixed Hide fixed
ci/scripts/check_ci.sh Fixed Show fixed Hide fixed
update to more robust logic

Co-authored-by: Walter Kolczynski - NOAA <[email protected]>
added correct exception when command is not found

Co-authored-by: Rahul Mahajan <[email protected]>
added import for correct exception

Co-authored-by: Rahul Mahajan <[email protected]>
@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA marked this pull request as ready for review November 13, 2023 21:09
trying to force refresh
stabs at solving double quote requirements for shellnoms
double quoting flags wont stop even though the are correct
still wrestling with double quote complaints from shellnormns even thought they are correct in the script and trying that ever the F&*@
Copy link
Collaborator Author

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done with my review

xml_file_path = os.path.abspath(args.w)
db_file_path = os.path.abspath(args.d)

rocotostat_all = which("rocotostat")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could not find a way to "change" arguments once the are set on the Executable object.

ci/scripts/run_ci.sh Outdated Show resolved Hide resolved
ci/scripts/run_ci.sh Outdated Show resolved Hide resolved
Comment on lines +22 to +27
description = """
Using rocotostat to get the status of all jobs this scripts
determines rocoto_state: if all cycles are done, then rocoto_state is Done.
Assuming rocotorun had just been run, and the rocoto_state is not Done, then
rocoto_state is Stalled if there are no jobs that are RUNNING, SUBMITTING, or QUEUED.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems more like a description for the whole script or rocoto_statcount than this function.

@WalterKolczynski-NOAA
Copy link
Contributor

@TerryMcGuinness-NOAA still waiting on you to make the suggestions I recommended

@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA marked this pull request as draft November 28, 2023 23:05
Co-authored-by: Walter Kolczynski - NOAA <[email protected]>
Co-authored-by: Walter Kolczynski - NOAA <[email protected]>
@TerrenceMcGuinness-NOAA TerrenceMcGuinness-NOAA deleted the feature/detect_dependency branch July 12, 2024 15:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CI/CD Issue related to CI/CD feature New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature BASH CI detects when a dependacy isn't being met
4 participants