CI Feature to detect stalled experiments #2049

Closed
Changes from 13 commits
39 commits
fb871ab
started to add python rocoto stat
TerryMcGuinness-NOAA Nov 8, 2023
b865bf2
first steps with rocoto stat script
TerryMcGuinness-NOAA Nov 8, 2023
1e59bef
finished working version of rocoto_statcount
TerryMcGuinness-NOAA Nov 8, 2023
dd4c735
replaced explicit decleration of stat dict with loop in rocoto statcount
TerryMcGuinness-NOAA Nov 9, 2023
f31ec82
fixed end state mixup of varialble names
TerryMcGuinness-NOAA Nov 9, 2023
aad09fd
more slight improvments on user GitHub messaging outputs
TerryMcGuinness-NOAA Nov 9, 2023
d45017a
added failed state updates with stalled in run ci
TerryMcGuinness-NOAA Nov 9, 2023
b170ca7
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 9, 2023
ea4da7d
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 9, 2023
d529dbb
bug updates after testing depedancies catch
TerryMcGuinness-NOAA Nov 9, 2023
12c26e8
fixed conflict with run_ci
TerryMcGuinness-NOAA Nov 9, 2023
5ae0d72
Update rocoto_statcount.py
TerrenceMcGuinness-NOAA Nov 9, 2023
5b98e01
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 9, 2023
b48f519
spelling, typos, and brain fart on {DATE +}
TerryMcGuinness-NOAA Nov 13, 2023
02bd37e
Update ci/scripts/utils/rocoto_statcount.py
TerrenceMcGuinness-NOAA Nov 13, 2023
ba7f6d9
Update ci/scripts/utils/rocoto_statcount.py
TerrenceMcGuinness-NOAA Nov 13, 2023
729002c
Update ci/scripts/utils/rocoto_statcount.py
TerrenceMcGuinness-NOAA Nov 13, 2023
5529000
updated elseif line and replaced PENDING with SUBMITTING
TerryMcGuinness-NOAA Nov 13, 2023
6d8057a
added exit code and removed redundent which
TerryMcGuinness-NOAA Nov 13, 2023
b9a2d08
put Walters test back for no RUNNING, SUBMITTING, or QUEUED and added…
TerryMcGuinness-NOAA Nov 13, 2023
1ca1097
added back which method to get new and sperate Execute Object to have…
TerryMcGuinness-NOAA Nov 13, 2023
ae0ca41
Update check_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
fd8c33b
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
8146364
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
688613f
removed DATE as date command
TerryMcGuinness-NOAA Nov 13, 2023
e74567f
white spaces from lint
TerryMcGuinness-NOAA Nov 13, 2023
1654e52
pynorm indent on dict
TerryMcGuinness-NOAA Nov 13, 2023
919b0c2
more white space related pynorm stuff
TerryMcGuinness-NOAA Nov 13, 2023
1bfe74a
hopefully last white space related pynorm stuff
TerryMcGuinness-NOAA Nov 13, 2023
ffd92cf
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
084719b
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
14e0df1
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
18557c0
Update run_ci.sh
TerrenceMcGuinness-NOAA Nov 13, 2023
9639ae3
Merge branch 'NOAA-EMC:develop' into feature/detect_dependency
TerrenceMcGuinness-NOAA Nov 14, 2023
52bf3bb
Merge branch 'NOAA-EMC:develop' into feature/detect_dependency
TerrenceMcGuinness-NOAA Nov 14, 2023
451b55c
updated discription for rocoto_statcount
TerryMcGuinness-NOAA Nov 14, 2023
cf3b262
Merge branch 'NOAA-EMC:develop' into feature/detect_dependency
TerrenceMcGuinness-NOAA Nov 16, 2023
338831c
Update ci/scripts/run_ci.sh
TerrenceMcGuinness-NOAA Nov 28, 2023
a839b4b
Update ci/scripts/run_ci.sh
TerrenceMcGuinness-NOAA Nov 28, 2023
8 changes: 4 additions & 4 deletions ci/scripts/check_ci.sh
@@ -85,8 +85,8 @@ for pr in ${pr_list}; do
# shellcheck disable=SC2312
if [[ -z $(ls -A "${pr_dir}/RUNTESTS/EXPDIR") ]] ; then
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Running" --add-label "CI-${MACHINE_ID^}-Passed"
sed -i "1 i\All CI Test Cases Passed on ${MACHINE_ID^}" "${output_ci}"
sed -i "1 i\`\`\`" "${output_ci}"
sed -i "1 i\All CI Test Cases Passed:" "${output_ci}"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci}"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --remove_pr "${pr}" --dbfile "${pr_list_dbfile}"
# Check to see if this PR that was opened by the weekly tests and if so close it if it passed on all platforms
@@ -131,7 +131,7 @@ for pr in ${pr_list}; do
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Running" --add-label "CI-${MACHINE_ID^}-Failed"
error_logs=$("${rocotostat}" -d "${db}" -w "${xml}" | grep -E 'FAIL|DEAD' | awk '{print "-c", $1, "-t", $2}' | xargs "${rocotocheck}" -d "${db}" -w "${xml}" | grep join | awk '{print $2}') || true
{
echo "Experiment ${pslot} Terminated: *** FAILED ***"
echo "Experiment ${pslot} Terminated: *** FAILED *** on ${MACHIND_ID^}"
Contributor

Small typo

Suggested change
echo "Experiment ${pslot} Terminated: *** FAILED *** on ${MACHIND_ID^}"
echo "Experiment ${pslot} Terminated: *** FAILED *** on ${MACHINE_ID^}"

echo "Experiment ${pslot} Terminated with ${num_failed} tasks failed at $(date)" || true
echo "Error logs:"
echo "${error_logs}"
@@ -152,8 +152,8 @@ for pr in ${pr_list}; do
rm -f "${output_ci_single}"
# echo "\`\`\`" > "${output_ci_single}"
DATE=$(date)
echo "Experiment ${pslot} **SUCCESS** ${DATE}" >> "${output_ci_single}"
echo "Experiment ${pslot} **SUCCESS** at ${DATE}" >> "${output_ci}"
echo "Experiment ${pslot} **SUCCESS** ${DATE +%Y%m%d} on ${MACHINE_ID^}" >> "${output_ci_single}"
echo "Experiment ${pslot} **SUCCESS** at ${DATE +%Y%m%d} on ${MACHIND_ID^}" >> "${output_ci}"
Contributor

This cannot possibly work ${DATE +%Y%m%d}.

"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci_single}"

fi
5 changes: 2 additions & 3 deletions ci/scripts/driver.sh
@@ -159,7 +159,7 @@ for pr in ${pr_list}; do
set +e
export LOGFILE_PATH="${HOMEgfs}/ci/scripts/create_experiment.log"
rm -f "${LOGFILE_PATH}"
"${HOMEgfs}/workflow/create_experiment.py" --yaml "${HOMEgfs}/ci/cases/pr/${case}.yaml" 2>&1 "${LOGFILE_PATH}"
"${HOMEgfs}/workflow/create_experiment.py" --yaml "${HOMEgfs}/ci/cases/pr/${case}.yaml" > "${LOGFILE_PATH}" 2>&1
ci_status=$?
set -e
if [[ ${ci_status} -eq 0 ]]; then
@@ -174,8 +174,7 @@ for pr in ${pr_list}; do
} >> "${output_ci}"
else
{
echo "*** Failed *** to create experiment: ${pslot}"
echo ""
echo "*** Failed *** to create experiment: ${pslot} on ${MACHINE_ID^} for PR #${pr}"
cat "${LOGFILE_PATH}"
} >> "${output_ci}"
"${GH}" pr edit "${pr}" --repo "${REPO_URL}" --remove-label "CI-${MACHINE_ID^}-Building" --add-label "CI-${MACHINE_ID^}-Failed"
19 changes: 17 additions & 2 deletions ci/scripts/run_ci.sh
@@ -13,6 +13,8 @@ ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." >/dev/null 2>&1 && pwd )"
scriptname=$(basename "${BASH_SOURCE[0]}")
echo "Begin ${scriptname} at $(date -u)" || true
export PS4='+ $(basename ${BASH_SOURCE})[${LINENO}]'
GH=${HOME}/bin/gh
REPO_URL="https://github.com/NOAA-EMC/global-workflow.git"

#########################################################################
# Set up runtime environment variables for accounts on supported machines
@@ -81,7 +83,20 @@ for pr in ${pr_list}; do
pslot=$(basename "${pslot_dir}")
xml="${pslot_dir}/${pslot}.xml"
db="${pslot_dir}/${pslot}.db"
echo "Running: ${rocotorun} -v 10 -w ${xml} -d ${db}"
"${rocotorun}" -v 10 -w "${xml}" -d "${db}"
set +e
"${ROOT_DIR}/ci/scripts/utils/rocoto_statcount.py" -d "${db}" -w "${xml}" --check_stalled

Code scanning / shellcheck (notice, fixed): This is actually an end quote, but due to next char it looks suspect.
Code scanning / shellcheck (warning, fixed): Did you forget to close this double quoted string?
rc=$?
if [[ "${rc}" -ne 0 ]]; then

Code scanning / shellcheck (notice, fixed): This is actually an end quote, but due to next char it looks suspect.
Code scanning / shellcheck (warning, fixed): Did you forget to close this double quoted string?
output_ci="${pr_dir}/output_runtime_single.log"

Code scanning / shellcheck (notice, fixed): This is actually an end quote, but due to next char it looks suspect.
{
echo "${pslot} has *** STALLED **** on ${MACHINE_ID^}"
echo "A job in expermint ${pslot} in ${pslot_dir}"
Contributor

Suggested change
echo "A job in expermint ${pslot} in ${pslot_dir}"
echo "A job in experiment ${pslot} in ${pslot_dir}"

echo "may have dependencies that are not being met"
} >> "${output_ci}"
sed -i "1 i\`\`\`" "${output_ci}"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci}"
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Running" --add-label "CI-${MACHINE_ID^}-Failed"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --remove_pr "${pr}" --dbfile "${pr_list_dbfile}"

Code scanning / shellcheck (failure, fixed): Couldn't parse this double quoted string. Fix to allow more checks.
fi
done
done
104 changes: 104 additions & 0 deletions ci/scripts/utils/rocoto_statcount.py
@@ -0,0 +1,104 @@
#!/usr/bin/env python3

import sys
import os

from wxflow import Executable, which, Logger
from argparse import ArgumentParser, ArgumentDefaultsHelpFormatter

logger = Logger(level=os.environ.get("LOGGING_LEVEL", "DEBUG"), colored_log=False)


def input_args():
    """
    Parse command-line arguments.

    Returns
    -------
    args : Namespace
        The parsed command-line arguments.
    """

    description = """
        Using rocotostat to get the status of all jobs, this script
        determines rocoto_state: if all cycles are done, then rocoto_state is Done.
        If not all cycles are done, then rocoto_state is Running.
        If --check_stalled is used, rocotorun is issued and rocotostat is run
        again; if no jobs have advanced, then rocoto_state is Stalled and
        the script exits with -1.
        """

    parser = ArgumentParser(description=description,
                            formatter_class=ArgumentDefaultsHelpFormatter)

    parser.add_argument('-w', help='workflow_document', type=str)
    parser.add_argument('-d', help='database_file', type=str)
    parser.add_argument('--check_stalled', help='check if any jobs do not advance (stalled)', action='store_true', required=False)

    args = parser.parse_args()

    return args

def rocoto_statcount():
    """
    Run rocotostat and process its output.
    """

    args = input_args()

    rocotostat = which("rocotostat")
    if not rocotostat:
        logger.exception("rocotostat not found in PATH")
        sys.exit(-1)

    xml_file_path = os.path.abspath(args.w)
    db_file_path = os.path.abspath(args.d)

    rocotostat_all = which("rocotostat")
Contributor (@aerorahul, Nov 9, 2023)

Why do this again? You already have rocotostat. You can use that any number of times with different arguments.

@aerorahul I have not been able to find a way to use add_default_arg to "update" the argument list of an Executable object. Can you find an example? I could not find such a use case in the repo.

Collaborator Author

Could not find a way to "change" arguments once they are set on the Executable object.
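
One possible reading of the suggestion above is to keep the shared arguments on a single command object and pass the flag that differs on each call. The sketch below illustrates that pattern with the standard library only; it is not wxflow's `Executable`, and `BaseCommand` and the stand-in command are hypothetical names introduced for illustration.

```python
import subprocess
import sys


class BaseCommand:
    """Hypothetical helper: a program plus arguments shared by every call."""

    def __init__(self, program, default_args):
        self.cmd = [program, *default_args]

    def __call__(self, *extra_args):
        # Extra arguments apply to this call only; the defaults are untouched.
        result = subprocess.run([*self.cmd, *extra_args],
                                capture_output=True, text=True, check=True)
        return result.stdout


# Stand-in for rocotostat so the sketch runs anywhere: a Python one-liner
# that prints the arguments it received.
show_args = BaseCommand(sys.executable,
                        ['-c', 'import sys; print(" ".join(sys.argv[1:]))',
                         '-w', 'workflow.xml', '-d', 'workflow.db'])

summary_output = show_args('-s')  # same object, summary flag
all_output = show_args('-a')      # same object, all-tasks flag
```

With this shape a second lookup of the same program is not needed, because the per-call flag never becomes part of the stored defaults.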

    rocotostat.add_default_arg(['-w',xml_file_path,'-d',db_file_path,'-s'])
    rocotostat_all.add_default_arg(['-w',xml_file_path,'-d',db_file_path,'-a'])

    rocotostat_output = rocotostat(output=str)
    rocotostat_output = rocotostat_output.splitlines()[1:]
    rocotostat_output = [line.split()[0:2] for line in rocotostat_output]

    rocotostat_output_all = rocotostat_all(output=str)
    rocotostat_output_all = rocotostat_output_all.splitlines()[1:]
    rocotostat_output_all = [line.split()[0:4] for line in rocotostat_output_all]
    rocotostat_output_all = [line for line in rocotostat_output_all if len(line) != 1]
Contributor

I can show you how to reduce this and simplify it considerably

Collaborator Author

Sure, that would be great. You can jot down pseudocode if you like.
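
The reviewer's simplification is not shown in this thread. As one guess at what a shorter version could look like, the sketch below counts task states in a single pass with `collections.Counter`. The sample text is a made-up stand-in for `rocotostat -a` output, and `count_task_states` is a hypothetical name. It matches the fourth field exactly, which differs slightly from the original code's search across the first four fields.

```python
from collections import Counter

# Hypothetical sample: header line, separator line, then one row per task.
SAMPLE_OUTPUT = """\
       CYCLE          TASK     JOBID       STATE   EXIT STATUS   TRIES   DURATION
=================================================================================
202103231200       gfsfcst     40123   SUCCEEDED             0       1       95.0
202103231200       gfspost     40124     RUNNING             -       1        0.0
202103231800       gfsfcst         -           -             -       -          -
202103231800       gfspost     40125        DEAD           271       2       12.0
"""

# Same list of states as in the diff.
STATUS_CASES = ['SUCCEEDED', 'FAIL', 'DEAD', 'RUNNING', 'PENDING', 'QUEUED']


def count_task_states(output):
    """Count the STATE column (fourth field) of every task row."""
    rows = (line.split() for line in output.splitlines()[1:])
    # The separator line splits into a single token, so it is skipped here.
    states = Counter(fields[3] for fields in rows if len(fields) >= 4)
    # Report every state of interest, including those with zero tasks.
    return {case: states.get(case, 0) for case in STATUS_CASES}


counts = count_task_states(SAMPLE_OUTPUT)
```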


    rocoto_status = {
        'Cycles' : len(rocotostat_output),
        'Done_Cycles' : sum([ sublist.count('Done') for sublist in rocotostat_output ])
    }

    status_cases = [ 'SUCCEEDED', 'FAIL', 'DEAD', 'RUNNING', 'PENDING', 'QUEUED']
    for case in status_cases:
        rocoto_status[case] = sum([ sublist.count(case) for sublist in rocotostat_output_all ])

    return rocoto_status

if __name__ == '__main__':

    args = input_args()

    rocoto_status = rocoto_statcount()
    for status in rocoto_status:
        print(f'Number of {status} : {rocoto_status[status]}')
    rocoto_state = 'Running'
    if rocoto_status['Cycles'] == rocoto_status['Done_Cycles']:
        rocoto_state = 'Done'

    if args.check_stalled:
        if rocoto_state != 'Done':
            rocoto_run = which("rocotorun")
Contributor

Having seen this, I think we should have a RocotoCommands class and initialize it with everything from rocoto you need; e.g. rocotostat, rocotorun, etc. While you are at it, also provide the xml and database to the class object rather than have to do this over and over.
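
A minimal sketch of what such a class might look like, assuming only the standard library. `RocotoCommands` is the name the reviewer proposes, but every method name and the injectable `runner` parameter below are assumptions made for illustration, not an existing API.

```python
import os
import shutil
import subprocess


class RocotoCommands:
    """Hypothetical wrapper that stores the workflow XML and database once."""

    def __init__(self, xml_path, db_path, runner=subprocess.run):
        self.xml_path = os.path.abspath(xml_path)
        self.db_path = os.path.abspath(db_path)
        self._runner = runner  # injectable so the class can be tested offline

    def command(self, program, *flags):
        # Every rocoto command gets the same -w/-d pair, then its own flags.
        return [program, '-w', self.xml_path, '-d', self.db_path, *flags]

    def _run(self, program, *flags):
        if shutil.which(program) is None:
            raise FileNotFoundError(f'{program} not found in PATH')
        result = self._runner(self.command(program, *flags),
                              capture_output=True, text=True, check=True)
        return result.stdout

    def stat_summary(self):
        return self._run('rocotostat', '-s')

    def stat_all(self):
        return self._run('rocotostat', '-a')

    def run(self):
        return self._run('rocotorun')


# Constructing the object does not execute anything.
rocoto = RocotoCommands('expdir/test.xml', 'expdir/test.db')
```

Callers would then use `rocoto.stat_summary()`, `rocoto.stat_all()`, and `rocoto.run()` without repeating the `-w`/`-d` arguments at each call site.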

            rocoto_run.add_default_arg(['-w',args.w,'-d',args.d])
            rocoto_run()
            rocoto_status2 = rocoto_statcount()
            if rocoto_status2 == rocoto_status:
                rocoto_state = 'Stalled'
                print(f'Rocoto State : {rocoto_state}')
                sys.exit(-1)
            else:
                rocoto_state = 'Running'
    print(f'Rocoto State : {rocoto_state}')
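
The stalled check above compares two status snapshots taken before and after a rocotorun pass. The sketch below isolates that decision as a pure function so it can be exercised without rocoto. `classify_state` and `require_idle` are hypothetical names. The optional idle guard is an assumption based on a later commit message in this PR, which mentions testing for no RUNNING, SUBMITTING, or QUEUED tasks.

```python
ACTIVE_STATES = ('RUNNING', 'SUBMITTING', 'QUEUED')


def classify_state(before, after, require_idle=False):
    """Return 'Done', 'Stalled', or 'Running' from two status snapshots.

    `before` and `after` are dicts shaped like the rocoto_statcount() result,
    taken before and after one rocotorun pass.
    """
    if before['Cycles'] == before['Done_Cycles']:
        return 'Done'
    unchanged = before == after
    # Assumed extra guard: with active tasks, an unchanged snapshot may just
    # mean a long-running job rather than unmet dependencies.
    idle = all(after.get(state, 0) == 0 for state in ACTIVE_STATES)
    if unchanged and (idle or not require_idle):
        return 'Stalled'
    return 'Running'


snapshot = {'Cycles': 2, 'Done_Cycles': 1, 'SUCCEEDED': 5, 'RUNNING': 0}
advanced = {'Cycles': 2, 'Done_Cycles': 1, 'SUCCEEDED': 6, 'RUNNING': 0}
state_unchanged = classify_state(snapshot, dict(snapshot))
state_advanced = classify_state(snapshot, advanced)
```

On POSIX systems `sys.exit(-1)` reaches the calling shell as status 255, which still satisfies the `-ne 0` test in run_ci.sh.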