CI Feature to detect stalled experiments #2049

Closed
39 commits
fb871ab
started to add python rocoto stat
TerryMcGuinness-NOAA Nov 8, 2023
b865bf2
first steps with rocoto stat script
TerryMcGuinness-NOAA Nov 8, 2023
1e59bef
finished working version of rocoto_statcount
TerryMcGuinness-NOAA Nov 8, 2023
dd4c735
replaced explicit decleration of stat dict with loop in rocoto statcount
TerryMcGuinness-NOAA Nov 9, 2023
f31ec82
fixed end state mixup of varialble names
TerryMcGuinness-NOAA Nov 9, 2023
aad09fd
more slight improvments on user GitHub messaging outputs
TerryMcGuinness-NOAA Nov 9, 2023
d45017a
added failed state updates with stalled in run ci
TerryMcGuinness-NOAA Nov 9, 2023
b170ca7
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 9, 2023
ea4da7d
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 9, 2023
d529dbb
bug updates after testing depedancies catch
TerryMcGuinness-NOAA Nov 9, 2023
12c26e8
fixed conflict with run_ci
TerryMcGuinness-NOAA Nov 9, 2023
5ae0d72
Update rocoto_statcount.py
TerrenceMcGuinness-NOAA Nov 9, 2023
5b98e01
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 9, 2023
b48f519
spelling, typos, and brain fart on {DATE +}
TerryMcGuinness-NOAA Nov 13, 2023
02bd37e
Update ci/scripts/utils/rocoto_statcount.py
TerrenceMcGuinness-NOAA Nov 13, 2023
ba7f6d9
Update ci/scripts/utils/rocoto_statcount.py
TerrenceMcGuinness-NOAA Nov 13, 2023
729002c
Update ci/scripts/utils/rocoto_statcount.py
TerrenceMcGuinness-NOAA Nov 13, 2023
5529000
updated elseif line and replaced PENDING with SUBMITTING
TerryMcGuinness-NOAA Nov 13, 2023
6d8057a
added exit code and removed redundent which
TerryMcGuinness-NOAA Nov 13, 2023
b9a2d08
put Walters test back for no RUNNING, SUBMITTING, or QUEUED and added…
TerryMcGuinness-NOAA Nov 13, 2023
1ca1097
added back which method to get new and sperate Execute Object to have…
TerryMcGuinness-NOAA Nov 13, 2023
ae0ca41
Update check_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
fd8c33b
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
8146364
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
688613f
removed DATE as date command
TerryMcGuinness-NOAA Nov 13, 2023
e74567f
white spaces from lint
TerryMcGuinness-NOAA Nov 13, 2023
1654e52
pynorm indent on dict
TerryMcGuinness-NOAA Nov 13, 2023
919b0c2
more white space related pynorm stuff
TerryMcGuinness-NOAA Nov 13, 2023
1bfe74a
hopefully last white space related pynorm stuff
TerryMcGuinness-NOAA Nov 13, 2023
ffd92cf
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
084719b
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
14e0df1
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
18557c0
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
9639ae3
Merge branch 'NOAA-EMC:develop' into feature/detect_dependency
TerrenceMcGuinness-NOAA Nov 14, 2023
52bf3bb
Merge branch 'NOAA-EMC:develop' into feature/detect_dependency
TerrenceMcGuinness-NOAA Nov 14, 2023
451b55c
updated discription for rocoto_statcount
TerryMcGuinness-NOAA Nov 14, 2023
cf3b262
Merge branch 'NOAA-EMC:develop' into feature/detect_dependency
TerrenceMcGuinness-NOAA Nov 16, 2023
338831c
Update ci/scripts/run_ci.sh
TerrenceMcGuinness-NOAA Nov 28, 2023
a839b4b
Update ci/scripts/run_ci.sh
TerrenceMcGuinness-NOAA Nov 28, 2023
9 changes: 4 additions & 5 deletions ci/scripts/check_ci.sh
@@ -85,8 +85,8 @@ for pr in ${pr_list}; do
# shellcheck disable=SC2312
if [[ -z $(ls -A "${pr_dir}/RUNTESTS/EXPDIR") ]] ; then
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Running" --add-label "CI-${MACHINE_ID^}-Passed"
sed -i "1 i\All CI Test Cases Passed on ${MACHINE_ID^}" "${output_ci}"
sed -i "1 i\`\`\`" "${output_ci}"
sed -i "1 i\All CI Test Cases Passed:" "${output_ci}"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci}"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --remove_pr "${pr}" --dbfile "${pr_list_dbfile}"
# Check to see if this PR was opened by the weekly tests and, if so, close it if it passed on all platforms
@@ -131,7 +131,7 @@ for pr in ${pr_list}; do
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Running" --add-label "CI-${MACHINE_ID^}-Failed"
error_logs=$("${rocotostat}" -d "${db}" -w "${xml}" | grep -E 'FAIL|DEAD' | awk '{print "-c", $1, "-t", $2}' | xargs "${rocotocheck}" -d "${db}" -w "${xml}" | grep join | awk '{print $2}') || true
{
echo "Experiment ${pslot} Terminated: *** FAILED ***"
echo "Experiment ${pslot} Terminated: *** FAILED *** on ${MACHINE_ID^}"
echo "Experiment ${pslot} Terminated with ${num_failed} tasks failed at $(date)" || true
echo "Error logs:"
echo "${error_logs}"
@@ -151,9 +151,8 @@ for pr in ${pr_list}; do
rm -Rf "${pr_dir}/RUNTESTS/COMROT/${pslot}"
rm -f "${output_ci_single}"
# echo "\`\`\`" > "${output_ci_single}"
DATE=$(date)
echo "Experiment ${pslot} **SUCCESS** ${DATE}" >> "${output_ci_single}"
echo "Experiment ${pslot} **SUCCESS** at ${DATE}" >> "${output_ci}"
echo "Experiment ${pslot} **SUCCESS** $(date +'%A %b %d, %Y') on ${MACHINE_ID^}" >> "${output_ci_single}" || true
echo "Experiment ${pslot} **SUCCESS** at $(date +'%A %b %d, %Y') on ${MACHINE_ID^}" >> "${output_ci}" || true
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci_single}"

fi
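The `sed -i "1 i\..."` calls in this script each insert a line at the top of the log, so whichever line is inserted last ends up first in the posted comment. A minimal Python sketch of that ordering, assuming a hypothetical helper `prepend_line` and a placeholder experiment name (neither is part of the PR):

```python
import tempfile
from pathlib import Path

FENCE = "`" * 3  # the Markdown code-fence marker the CI scripts prepend


def prepend_line(path: Path, line: str) -> None:
    """Make `line` the new first line of `path`, keeping the existing content below it."""
    original = path.read_text()
    path.write_text(line + "\n" + original)


def demo() -> list:
    """Build a small log, prepend two lines, and return the resulting lines."""
    with tempfile.TemporaryDirectory() as tmp:
        log = Path(tmp) / "output.log"
        # Placeholder experiment name, for illustration only.
        log.write_text("Experiment example_pslot **SUCCESS**\n")
        prepend_line(log, FENCE)
        prepend_line(log, "All CI Test Cases Passed")
        return log.read_text().splitlines()
```

The second prepended line lands above the first, which is why the order of the `sed` calls determines whether the summary sentence sits outside or inside the code fence.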
5 changes: 2 additions & 3 deletions ci/scripts/driver.sh
@@ -159,7 +159,7 @@ for pr in ${pr_list}; do
set +e
export LOGFILE_PATH="${HOMEgfs}/ci/scripts/create_experiment.log"
rm -f "${LOGFILE_PATH}"
"${HOMEgfs}/workflow/create_experiment.py" --yaml "${HOMEgfs}/ci/cases/pr/${case}.yaml" 2>&1 "${LOGFILE_PATH}"
"${HOMEgfs}/workflow/create_experiment.py" --yaml "${HOMEgfs}/ci/cases/pr/${case}.yaml" > "${LOGFILE_PATH}" 2>&1
ci_status=$?
set -e
if [[ ${ci_status} -eq 0 ]]; then
@@ -174,8 +174,7 @@ for pr in ${pr_list}; do
} >> "${output_ci}"
else
{
echo "*** Failed *** to create experiment: ${pslot}"
echo ""
echo "*** Failed *** to create experiment: ${pslot} on ${MACHINE_ID^} for PR #${pr}"
cat "${LOGFILE_PATH}"
} >> "${output_ci}"
"${GH}" pr edit "${pr}" --repo "${REPO_URL}" --remove-label "CI-${MACHINE_ID^}-Building" --add-label "CI-${MACHINE_ID^}-Failed"
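The `create_experiment.py` change in this file sends stdout to the log file first and then points stderr at the same destination, so both streams are captured. The same idea can be sketched in Python with `subprocess`; the helper name `run_and_log` and the child command are assumptions made for illustration, not code from the PR:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_log(cmd: list, logfile: Path) -> int:
    """Run `cmd`, writing stdout and stderr to `logfile`; return the exit status."""
    with logfile.open("w") as log:
        # stderr=STDOUT plays the role of `2>&1` after `> logfile`.
        result = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


def demo():
    """Run a child that writes to both streams and exits with status 3."""
    child = ("import sys; print('to stdout'); "
             "print('to stderr', file=sys.stderr); sys.exit(3)")
    with tempfile.TemporaryDirectory() as tmp:
        logfile = Path(tmp) / "create_experiment.log"
        status = run_and_log([sys.executable, "-c", child], logfile)
        return status, logfile.read_text()
```

Capturing the status separately from the log mirrors the `ci_status=$?` check that follows the command in `driver.sh`.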
19 changes: 18 additions & 1 deletion ci/scripts/run_ci.sh
@@ -13,6 +13,8 @@ ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." >/dev/null 2>&1 && pwd )"
scriptname=$(basename "${BASH_SOURCE[0]}")
echo "Begin ${scriptname} at $(date -u)" || true
export PS4='+ $(basename ${BASH_SOURCE})[${LINENO}]'
GH=${HOME}/bin/gh
REPO_URL="https://github.com/NOAA-EMC/global-workflow.git"

#########################################################################
# Set up runtime environment variables for accounts on supported machines
@@ -81,7 +83,22 @@ for pr in ${pr_list}; do
pslot=$(basename "${pslot_dir}")
xml="${pslot_dir}/${pslot}.xml"
db="${pslot_dir}/${pslot}.db"
echo "Running: ${rocotorun} -v 10 -w ${xml} -d ${db}"
"${rocotorun}" -v 10 -w "${xml}" -d "${db}"
set +e
"${ROOT_DIR}/ci/scripts/utils/rocoto_statcount.py" -d "${db}" -w "${xml}"
rc=$?

if [[ "${rc}" -ne 0 ]]; then

Check notice (Code scanning / shellcheck): This is actually an end quote, but due to next char it looks suspect. (Fixed)

Check warning (Code scanning / shellcheck): Did you forget to close this double quoted string? (Fixed)
output_ci="${pr_dir}/output_runtime_single.log"

Check notice (Code scanning / shellcheck): This is actually an end quote, but due to next char it looks suspect. (Fixed)
{
echo "${pslot} has *** STALLED **** on ${MACHINE_ID^}"
echo "A job in experiment ${pslot} in ${pslot_dir}"
echo "may have dependencies that are not being met"
} >> "${output_ci}"
sed -i "1 i\`\`\`" "${output_ci}"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci}"
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Running" --add-label "CI-${MACHINE_ID^}-Failed"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --remove_pr "${pr}" --dbfile "${pr_list_dbfile}"

Check failure (Code scanning / shellcheck): Couldn't parse this double quoted string. Fix to allow more checks. (Fixed)
fi
done
done
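`run_ci.sh` treats any nonzero return code from `rocoto_statcount.py` as a stall, and the script signals a stall with `sys.exit(-1)`. A small sketch of what the caller actually observes: on POSIX systems the exit status is truncated to one byte, so the shell sees a nonzero value (255) rather than -1. The helper name `exit_status_of` is hypothetical.

```python
import subprocess
import sys


def exit_status_of(code: int) -> int:
    """Return the exit status a parent process observes when a child calls sys.exit(code)."""
    child = f"import sys; sys.exit({code})"
    return subprocess.run([sys.executable, "-c", child]).returncode


if __name__ == "__main__":
    # A nonzero status is all that the `[[ "${rc}" -ne 0 ]]` test relies on.
    print("sys.exit(0)  ->", exit_status_of(0))
    print("sys.exit(-1) ->", exit_status_of(-1))
```

Because the shell test only checks for "not zero", the exact value does not matter here, but a small positive code such as 1 would read more clearly in logs.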
99 changes: 99 additions & 0 deletions ci/scripts/utils/rocoto_statcount.py
@@ -0,0 +1,99 @@
#!/usr/bin/env python3

import sys
import os

from wxflow import Executable, which, Logger, CommandNotFoundError
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter

logger = Logger(level=os.environ.get("LOGGING_LEVEL", "DEBUG"), colored_log=False)


def input_args():
    """
    Parse command-line arguments.

    Returns
    -------
    args : Namespace
        The parsed command-line arguments.
    """

    description = """
        Using rocotostat to get the status of all jobs, this script
        determines rocoto_state: if all cycles are done, then rocoto_state is Done.
        Assuming rocotorun had just been run, and the rocoto_state is not Done, then
        rocoto_state is Stalled if there are no jobs that are RUNNING, SUBMITTING, or QUEUED.
        """
Comment on lines +22 to +27
Contributor

This seems more like a description for the whole script or rocoto_statcount than this function.


    parser = ArgumentParser(description=description,
                            formatter_class=ArgumentDefaultsHelpFormatter)

    parser.add_argument('-w', help='workflow_document', type=str)
    parser.add_argument('-d', help='database_file', type=str)

    args = parser.parse_args()

    return args


def rocoto_statcount():
    """
    Run rocotostat and process its output.
    """

    args = input_args()

    try:
        rocotostat = which("rocotostat")
    except CommandNotFoundError:
        logger.exception("rocotostat not found in PATH")
        raise CommandNotFoundError("rocotostat not found in PATH")

    xml_file_path = os.path.abspath(args.w)
    db_file_path = os.path.abspath(args.d)

    rocotostat_all = which("rocotostat")
Contributor

@aerorahul Nov 9, 2023

Why do this again? You already have rocotostat. You can use that any number of times with different arguments.

@aerorahul I have not been able to find a way to use add_default_args to "update" the argument list of an Executable object. Can you find an example? I could not find such a use case in the repo.

Collaborator Author

I could not find a way to "change" arguments once they are set on the Executable object.
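One way to picture the reviewer's suggestion, sketched with the standard library rather than wxflow: keep the base command once and pass the differing flags on each call instead of storing them as defaults. `make_runner` is a hypothetical helper, the child process is a stand-in for rocotostat, and nothing here is a claim about what wxflow's `Executable` supports.

```python
import subprocess
import sys
from typing import Callable, List


def make_runner(base_cmd: List[str]) -> Callable[..., str]:
    """Return a function that runs `base_cmd` plus per-call arguments and returns stdout."""
    def run(*extra_args: str) -> str:
        result = subprocess.run([*base_cmd, *extra_args],
                                capture_output=True, text=True, check=True)
        return result.stdout
    return run


# Stand-in for rocotostat: a tiny Python child that echoes its arguments.
echo_args = make_runner(
    [sys.executable, "-c", "import sys; print(' '.join(sys.argv[1:]))"]
)

# One runner, two invocations that differ only in the trailing flag.
summary = echo_args("-w", "workflow.xml", "-d", "workflow.db", "-s")
all_jobs = echo_args("-w", "workflow.xml", "-d", "workflow.db", "-a")
```

With this shape there is a single handle on the command, and the `-s` and `-a` variants cannot drift apart in their shared `-w` and `-d` arguments.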

    rocotostat.add_default_arg(['-w', xml_file_path, '-d', db_file_path, '-s'])
    rocotostat_all.add_default_arg(['-w', xml_file_path, '-d', db_file_path, '-a'])

    rocotostat_output = rocotostat(output=str)
    rocotostat_output = rocotostat_output.splitlines()[1:]
    rocotostat_output = [line.split()[0:2] for line in rocotostat_output]

    rocotostat_output_all = rocotostat_all(output=str)
    rocotostat_output_all = rocotostat_output_all.splitlines()[1:]
    rocotostat_output_all = [line.split()[0:4] for line in rocotostat_output_all]
    rocotostat_output_all = [line for line in rocotostat_output_all if len(line) != 1]

    rocoto_status = {
        'Cycles': len(rocotostat_output),
        'Done_Cycles': sum([sublist.count('Done') for sublist in rocotostat_output])
    }

    status_cases = ['SUCCEEDED', 'FAIL', 'DEAD', 'RUNNING', 'SUBMITTING', 'QUEUED']
    for case in status_cases:
        rocoto_status[case] = sum([sublist.count(case) for sublist in rocotostat_output_all])

    return rocoto_status


if __name__ == '__main__':

    args = input_args()

    rocoto_status = rocoto_statcount()
    for status in rocoto_status:
        print(f'Number of {status} : {rocoto_status[status]}')
    if rocoto_status['Cycles'] == rocoto_status['Done_Cycles']:
        rocoto_state = 'Done'
    elif 'UNKNOWN' in rocoto_status:
        rocoto_state = 'Unknown'
        print(f'Rocoto State : {rocoto_state}')
    elif rocoto_status['RUNNING'] + rocoto_status['SUBMITTING'] + rocoto_status['QUEUED'] == 0:
        rocoto_state = 'Stalled'
        print(f'Rocoto State : {rocoto_state}')
        sys.exit(-1)
    else:
        rocoto_state = 'Running'
        print(f'Rocoto State : {rocoto_state}')
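The counting and the Done/Stalled/Running decision can be exercised without rocoto by separating them into pure functions. The sketch below mirrors the parsing steps of the script on a made-up sample; `SAMPLE_ALL`, `count_states`, and `classify` are hypothetical names, and the sample text only imitates the general shape the script expects (a header line, a one-token separator line, then whitespace-separated columns with the state in the fourth position). It is not captured rocotostat output.

```python
from typing import Dict

# Made-up sample for illustration only.
SAMPLE_ALL = """\
CYCLE TASK JOBID STATE EXIT_STATUS TRIES DURATION
=================================================
202103231200 prep 1001 SUCCEEDED 0 1 52.0
202103231200 fcst 1002 DEAD 1 2 10.0
202103231200 post - - - - -
"""

STATES = ['SUCCEEDED', 'FAIL', 'DEAD', 'RUNNING', 'SUBMITTING', 'QUEUED']


def count_states(text: str) -> Dict[str, int]:
    """Count task states in the first four columns, skipping header and separator."""
    rows = [line.split()[0:4] for line in text.splitlines()[1:]]
    rows = [row for row in rows if len(row) != 1]  # drop the separator line
    return {state: sum(row.count(state) for row in rows) for state in STATES}


def classify(cycles: int, done_cycles: int, counts: Dict[str, int]) -> str:
    """Apply the script's rule: Done, else Stalled if nothing is active, else Running."""
    if cycles == done_cycles:
        return 'Done'
    if counts['RUNNING'] + counts['SUBMITTING'] + counts['QUEUED'] == 0:
        return 'Stalled'
    return 'Running'


counts = count_states(SAMPLE_ALL)
state = classify(cycles=1, done_cycles=0, counts=counts)
```

In this sample one task is DEAD and nothing is RUNNING, SUBMITTING, or QUEUED, so an unfinished cycle is classified as Stalled, which is the case the CI change is meant to catch.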