Robust CI Restarts #2093

Merged
Changes from 14 commits
46 commits
7e90fad
added PID kill on label change
TerryMcGuinness-NOAA Nov 27, 2023
3bad056
updated chmod on ci_utils.sh
TerryMcGuinness-NOAA Nov 27, 2023
739a0c0
remove bracket and move id up
TerryMcGuinness-NOAA Nov 27, 2023
a6192b1
added better kill all to make sure to get all descendants
TerryMcGuinness-NOAA Nov 27, 2023
cdc48e2
few shell norms on kill command
TerryMcGuinness-NOAA Nov 28, 2023
17601e3
another shell norm on kill line
TerryMcGuinness-NOAA Nov 28, 2023
3d4fdb0
added log output on link fail and some touchups on output
TerryMcGuinness-NOAA Nov 28, 2023
24a74af
updated starting message
TerryMcGuinness-NOAA Nov 28, 2023
3c4e7f7
moved DATE assignment outside of if
TerryMcGuinness-NOAA Nov 28, 2023
2e172ae
quote around ps for shell norms
TerryMcGuinness-NOAA Nov 28, 2023
fb5c8fb
removed quotes in grep of ps for kill driver 304527
TerryMcGuinness-NOAA Nov 28, 2023
c251060
quoted the grep pattern on ps kill of drivers 304527
TerryMcGuinness-NOAA Nov 28, 2023
ea93c92
add ignore SC009 because there is not a pgrep version of this
TerryMcGuinness-NOAA Nov 28, 2023
d55539f
added pgrep shellnorms work arounds
TerryMcGuinness-NOAA Nov 28, 2023
97225ac
removed pid from data base after building
TerryMcGuinness-NOAA Nov 28, 2023
0997559
fixed wrong path to ci_utils.sh
TerryMcGuinness-NOAA Nov 28, 2023
b746d08
type syntax error on echo in scancel
TerryMcGuinness-NOAA Nov 28, 2023
5e70ba1
better kill switch
TerryMcGuinness-NOAA Nov 28, 2023
7e65d3f
added cleaner headers on user messages
TerryMcGuinness-NOAA Nov 28, 2023
afe37b5
added true to kill line for shell norms
TerryMcGuinness-NOAA Nov 28, 2023
2cde8b4
shorter underline
TerryMcGuinness-NOAA Nov 28, 2023
afaded5
Merge branch 'NOAA-EMC:develop' into hotfix/restart_build
TerrenceMcGuinness-NOAA Nov 28, 2023
85996e3
small _ removed from output on restart
TerryMcGuinness-NOAA Nov 28, 2023
b22e552
Merge branch 'hotfix/restart_build' of github.com:TerrenceMcGuinness-…
TerryMcGuinness-NOAA Nov 28, 2023
e2bfb0e
added machine name on single exe completion lines
TerryMcGuinness-NOAA Nov 28, 2023
86d0236
updated REPO_URL to global just before to submit PR
TerryMcGuinness-NOAA Nov 28, 2023
b0bc70c
Update ci/scripts/check_ci.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
33a8a70
Update ci/scripts/driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
7311a8b
Update ci/scripts/driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
f55bf8f
moved STMP to /work2 on orion because /work is full
TerryMcGuinness-NOAA Nov 29, 2023
f414200
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
9c7d2c7
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
94c29fe
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
f3634f2
Update ci/scripts/utils/ci_utils.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
31ee4bb
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
1f0f842
Update ci_utils.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
9fcc897
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
7e05961
Merge branch 'NOAA-EMC:develop' into hotfix/restart_build
TerrenceMcGuinness-NOAA Nov 29, 2023
2b29dee
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
efa3e08
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
063a95b
Merge branch 'NOAA-EMC:develop' into hotfix/restart_build
TerrenceMcGuinness-NOAA Nov 29, 2023
7c602df
Update ci/scripts/driver.sh
TerrenceMcGuinness-NOAA Nov 30, 2023
010a86f
Update ci/scripts/driver.sh
TerrenceMcGuinness-NOAA Dec 1, 2023
501753e
added some edification documentation for clarity
TerryMcGuinness-NOAA Dec 1, 2023
b1e2b13
Update clone-build_ci.sh
TerrenceMcGuinness-NOAA Dec 1, 2023
4b27ebe
Update ci/scripts/utils/ci_utils.sh
TerrenceMcGuinness-NOAA Dec 4, 2023
2 changes: 1 addition & 1 deletion ci/platforms/config.orion
@@ -2,7 +2,7 @@

export GFS_CI_ROOT=/work2/noaa/stmp/GFS_CI_ROOT
export ICSDIR_ROOT=/work/noaa/global/glopara/data/ICSDIR
- export STMP="/work/noaa/stmp/${USER}"
+ export STMP="/work2/noaa/stmp/${USER}"
export SLURM_ACCOUNT=nems
export max_concurrent_cases=5
export max_concurrent_pr=4
2 changes: 1 addition & 1 deletion ci/scripts/check_ci.sh
@@ -33,7 +33,7 @@ case ${MACHINE_ID} in
esac
set +x
source "${ROOT_DIR}/ush/module-setup.sh"
- source "${ROOT_DIR}/ci/scripts/utils/ci_utils.h"
+ source "${ROOT_DIR}/ci/scripts/utils/ci_utils.sh"
module use "${ROOT_DIR}/modulefiles"
module load "module_gwsetup.${MACHINE_ID}"
module list
16 changes: 8 additions & 8 deletions ci/scripts/driver.sh
@@ -89,14 +89,14 @@ for pr in ${pr_list}; do
if [[ "${driver_PID}" -ne 0 ]]; then
echo "Driver PID: ${driver_PID} no longer running this build having it killed"
if [[ "${driver_HOST}" == "${host_name}" ]]; then
- pstree -A -p "${driver_PID}" | grep -Eow "[0-9]+" | xargs kill || true
- sleep 30
+ #shellcheck disable=SC2312,SC2312
WalterKolczynski-NOAA marked this conversation as resolved.
+ pstree -A -p "${driver_PID}" | grep -Pow "(?<=\()[0-9]+(?=\))" | xargs kill
Check notice (Code scanning / shellcheck): Consider invoking this command separately to avoid masking its return value (or use '|| true' to ignore).
else
+ # shellcheck disable=SC2312
ssh "${driver_HOST}" 'pstree -A -p "${driver_PID}" | grep -Eow "[0-9]+" | xargs kill'
TerrenceMcGuinness-NOAA marked this conversation as resolved.
- sleep 30
fi
{
- echo "Driver PID: ${driver_PID} on ${driver_HOST} is no longer running this test"
+ echo "Driver PID: Requested termination of ${driver_PID} and children on ${driver_HOST}"
echo "Driver PID: has restarted as $$ on ${host_name}"
} >> "${output_ci_single}"
fi
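The change from `grep -Eow "[0-9]+"` to the lookaround pattern in this hunk can be illustrated with a small sketch. It does not call `pstree`; the `sample` string is a hypothetical stand-in for one line of `pstree -A -p` output, and the helper names `pids_loose` and `pids_strict` are made up for this illustration. It also assumes a `grep` built with PCRE support for `-P`.

```shell
#!/usr/bin/env bash
# Hypothetical line standing in for `pstree -A -p <pid>` output; not real CI data.
sample='driver.sh(100)-+-job.7(200)'

# Pattern from the removed line: any standalone run of digits.
# This also picks up digits that are part of a process name such as "job.7".
pids_loose() { printf '%s\n' "$1" | grep -Eow "[0-9]+"; }

# Pattern from the added line: only digits directly enclosed in parentheses,
# which is how `pstree -p` prints PIDs. Assumes grep supports -P (PCRE).
pids_strict() { printf '%s\n' "$1" | grep -Pow "(?<=\()[0-9]+(?=\))"; }

# xargs with no command joins the matches onto one line for display.
echo "loose:  $(pids_loose "${sample}" | xargs)"
echo "strict: $(pids_strict "${sample}" | xargs)"
```

With the loose pattern, the `7` from the process name would be passed to `kill` alongside the real PIDs; the strict pattern returns only the parenthesized values.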
@@ -105,8 +105,8 @@ for pr in ${pr_list}; do
if [[ -z "${experiments}" ]]; then
echo "No current experiments to cancel in PR: ${pr} on ${MACHINE_ID^}" >> "${output_ci_single}"
else
- for cases in ${experiments}; do
- case_name=$(basename "${cases}")
+ for case in ${experiments}; do
+ case_name=$(basename "${case}")
cancel_slurm_jobs "${case_name}"
{
echo "Canceled all jobs for experiment ${case_name} in PR:${pr} on ${MACHINE_ID^}"
@@ -115,8 +115,8 @@ for pr in ${pr_list}; do
fi
sed -i "1 i\`\`\`" "${output_ci_single}"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci_single}"
- db_list=$("${ROOT_DIR}/ci/scripts/pr_list_database.py" --remove_pr "${pr}" --dbfile "${pr_list_dbfile}")
- db_list=$("${ROOT_DIR}/ci/scripts/pr_list_database.py" --add_pr "${pr}" --dbfile "${pr_list_dbfile}")
+ "${ROOT_DIR}/ci/scripts/pr_list_database.py" --remove_pr "${pr}" --dbfile "${pr_list_dbfile}"
+ "${ROOT_DIR}/ci/scripts/pr_list_database.py" --add_pr "${pr}" --dbfile "${pr_list_dbfile}"
fi
done
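One detail in the `ssh` branch of the first driver.sh hunk may be worth checking: the remote command is single-quoted, so `${driver_PID}` is not expanded locally and is left for the remote shell to resolve. The sketch below illustrates the quoting difference only. It uses a local `bash -c` with the variable removed from the environment as a stand-in for the remote shell, which is an assumption of this example; `run_remote` and the value `4242` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical PID value for illustration only.
driver_PID=4242

# Stand-in for `ssh host '<command>'`: a child shell that does not
# inherit driver_PID, as a remote login shell normally would not.
run_remote() { env -u driver_PID bash -c "$1"; }

# Single quotes: the child shell expands ${driver_PID} itself and finds it unset.
single_quoted='echo "pid=${driver_PID}"'

# Double quotes: the local shell substitutes the value before the command is sent.
double_quoted="echo \"pid=${driver_PID}\""

echo "single-quoted -> $(run_remote "${single_quoted}")"
echo "double-quoted -> $(run_remote "${double_quoted}")"
```

If the remote host is expected to see the local PID, a double-quoted command string (with inner quotes escaped) is one way to pass it. Whether that is the intent here is not stated in the diff.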

2 changes: 1 addition & 1 deletion ci/scripts/utils/ci_utils.sh
@@ -8,7 +8,7 @@ function cancel_slurm_jobs() {

for job_id in ${job_ids}; do
job_name=$(sacct -j "${job_id}" --format=JobName%100 | head -3 | tail -1 | sed -r 's/\s+//g') || true
- if [[ "${job_name}" == *"${substring}"* ]]; then
+ if [[ "${job_name}" =~ ${substring} ]]; then
echo "Canceling Slurm Job ${job_name} with: scancel ${job_id}"
scancel "${job_id}"
TerrenceMcGuinness-NOAA marked this conversation as resolved.
fi
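The switch from `== *"${substring}"*` to `=~ ${substring}` changes how the experiment name is interpreted: the first is a literal substring test, the second treats the unquoted value as an extended regular expression. A minimal sketch of the difference, using made-up job names and hypothetical helper functions `glob_match` and `regex_match`:

```shell
#!/usr/bin/env bash
# Literal substring test, as in the removed line (quoted part matches literally).
glob_match() { if [[ "$1" == *"$2"* ]]; then echo yes; else echo no; fi; }

# Regex test, as in the added line (unquoted right-hand side is a regex).
regex_match() { if [[ "$1" =~ $2 ]]; then echo yes; else echo no; fi; }

# Made-up job name and pattern: "." is literal for the glob test,
# but matches any single character for the regex test.
echo "glob:  $(glob_match  "C48xATM_gfs" "C48.ATM")"
echo "regex: $(regex_match "C48xATM_gfs" "C48.ATM")"
```

For plain alphanumeric case names the two tests agree. They can diverge when the name contains regex metacharacters such as `.` or `+`.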