Robust CI Restarts #2093

Merged
Changes from 38 commits
46 commits
7e90fad
added PID kill on label change
TerryMcGuinness-NOAA Nov 27, 2023
3bad056
updated chmod on ci_utils.sh
TerryMcGuinness-NOAA Nov 27, 2023
739a0c0
remove bracket and move id up
TerryMcGuinness-NOAA Nov 27, 2023
a6192b1
added better kill all to make sure to get all descendants
TerryMcGuinness-NOAA Nov 27, 2023
cdc48e2
few shell norms on kill command
TerryMcGuinness-NOAA Nov 28, 2023
17601e3
another shell norm on kill line
TerryMcGuinness-NOAA Nov 28, 2023
3d4fdb0
added log output on link fail and some touchups on output
TerryMcGuinness-NOAA Nov 28, 2023
24a74af
udpated starting message
TerryMcGuinness-NOAA Nov 28, 2023
3c4e7f7
moved DATE assignment outside of if
TerryMcGuinness-NOAA Nov 28, 2023
2e172ae
quote around ps for shell norms
TerryMcGuinness-NOAA Nov 28, 2023
fb5c8fb
removed quotes in grep of ps for kill driver 304527
TerryMcGuinness-NOAA Nov 28, 2023
c251060
quoted the gerp patter on ps kill of drivers 304527
TerryMcGuinness-NOAA Nov 28, 2023
ea93c92
add ingnore SC009 because there is not a pgrep version of this
TerryMcGuinness-NOAA Nov 28, 2023
d55539f
added pgrep shellnorms work arounds
TerryMcGuinness-NOAA Nov 28, 2023
97225ac
removed pid from data base after building
TerryMcGuinness-NOAA Nov 28, 2023
0997559
fixed woron path to ci_utils.sh
TerryMcGuinness-NOAA Nov 28, 2023
b746d08
type syntax error on echo in scancel
TerryMcGuinness-NOAA Nov 28, 2023
5e70ba1
better kill switch
TerryMcGuinness-NOAA Nov 28, 2023
7e65d3f
added cleaner headers on user messages
TerryMcGuinness-NOAA Nov 28, 2023
afe37b5
added true to kill line for shell norms
TerryMcGuinness-NOAA Nov 28, 2023
2cde8b4
shorter underline
TerryMcGuinness-NOAA Nov 28, 2023
afaded5
Merge branch 'NOAA-EMC:develop' into hotfix/restart_build
TerrenceMcGuinness-NOAA Nov 28, 2023
85996e3
small _ removed from ouput on restart
TerryMcGuinness-NOAA Nov 28, 2023
b22e552
Merge branch 'hotfix/restart_build' of github.com:TerrenceMcGuinness-…
TerryMcGuinness-NOAA Nov 28, 2023
e2bfb0e
added machine name on sinlge exe completion lines
TerryMcGuinness-NOAA Nov 28, 2023
86d0236
updated REPO_URL to global just before to submit PR
TerryMcGuinness-NOAA Nov 28, 2023
b0bc70c
Update ci/scripts/check_ci.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
33a8a70
Update ci/scripts/driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
7311a8b
Update ci/scripts/driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
f55bf8f
moved STMP to /work2 on orion because /work is full
TerryMcGuinness-NOAA Nov 29, 2023
f414200
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
9c7d2c7
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
94c29fe
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
f3634f2
Update ci/scripts/utils/ci_utils.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
31ee4bb
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
1f0f842
Update ci_utils.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
9fcc897
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
7e05961
Merge branch 'NOAA-EMC:develop' into hotfix/restart_build
TerrenceMcGuinness-NOAA Nov 29, 2023
2b29dee
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
efa3e08
Update driver.sh
TerrenceMcGuinness-NOAA Nov 29, 2023
063a95b
Merge branch 'NOAA-EMC:develop' into hotfix/restart_build
TerrenceMcGuinness-NOAA Nov 29, 2023
7c602df
Update ci/scripts/driver.sh
TerrenceMcGuinness-NOAA Nov 30, 2023
010a86f
Update ci/scripts/driver.sh
TerrenceMcGuinness-NOAA Dec 1, 2023
501753e
added some edification documentation for clearity
TerryMcGuinness-NOAA Dec 1, 2023
b1e2b13
Update clone-build_ci.sh
TerrenceMcGuinness-NOAA Dec 1, 2023
4b27ebe
Update ci/scripts/utils/ci_utils.sh
TerrenceMcGuinness-NOAA Dec 4, 2023
2 changes: 1 addition & 1 deletion ci/platforms/config.orion
@@ -2,7 +2,7 @@

export GFS_CI_ROOT=/work2/noaa/stmp/GFS_CI_ROOT
export ICSDIR_ROOT=/work/noaa/global/glopara/data/ICSDIR
export STMP="/work/noaa/stmp/${USER}"
export STMP="/work2/noaa/stmp/${USER}"
export SLURM_ACCOUNT=nems
export max_concurrent_cases=5
export max_concurrent_pr=4
15 changes: 8 additions & 7 deletions ci/scripts/check_ci.sh
@@ -33,6 +33,7 @@ case ${MACHINE_ID} in
esac
set +x
source "${ROOT_DIR}/ush/module-setup.sh"
source "${ROOT_DIR}/ci/scripts/utils/ci_utils.sh"
module use "${ROOT_DIR}/modulefiles"
module load "module_gwsetup.${MACHINE_ID}"
module list
@@ -86,7 +87,7 @@ for pr in ${pr_list}; do
if [[ -z $(ls -A "${pr_dir}/RUNTESTS/EXPDIR") ]] ; then
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Running" --add-label "CI-${MACHINE_ID^}-Passed"
sed -i "1 i\`\`\`" "${output_ci}"
sed -i "1 i\All CI Test Cases Passed:" "${output_ci}"
sed -i "1 i\All CI Test Cases Passed on ${MACHINE_ID^}:" "${output_ci}"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci}"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --remove_pr "${pr}" --dbfile "${pr_list_dbfile}"
# Check whether this PR was opened by the weekly tests and, if so, close it once it has passed on all platforms
@@ -131,8 +132,8 @@ for pr in ${pr_list}; do
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Running" --add-label "CI-${MACHINE_ID^}-Failed"
error_logs=$("${rocotostat}" -d "${db}" -w "${xml}" | grep -E 'FAIL|DEAD' | awk '{print "-c", $1, "-t", $2}' | xargs "${rocotocheck}" -d "${db}" -w "${xml}" | grep join | awk '{print $2}') || true
{
echo "Experiment ${pslot} Terminated: *** FAILED ***"
echo "Experiment ${pslot} Terminated with ${num_failed} tasks failed at $(date)" || true
echo "Experiment ${pslot} *** FAILED *** on ${MACHINE_ID^}"
echo "Experiment ${pslot} with ${num_failed} tasks failed at $(date +'%D %r')" || true
echo "Error logs:"
echo "${error_logs}"
} >> "${output_ci}"
@@ -141,7 +142,7 @@ for pr in ${pr_list}; do
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --remove_pr "${pr}" --dbfile "${pr_list_dbfile}"
for kill_cases in "${pr_dir}/RUNTESTS/"*; do
pslot=$(basename "${kill_cases}")
sacct --format=jobid,jobname%35,WorkDir%100,stat | grep "${pslot}" | grep "PR\/${pr}\/RUNTESTS" | awk '{print $1}' | xargs scancel || true
cancel_slurm_jobs "${pslot}"
done
break
fi
@@ -151,9 +152,9 @@ for pr in ${pr_list}; do
rm -Rf "${pr_dir}/RUNTESTS/COMROT/${pslot}"
rm -f "${output_ci_single}"
# echo "\`\`\`" > "${output_ci_single}"
DATE=$(date)
echo "Experiment ${pslot} **SUCCESS** ${DATE}" >> "${output_ci_single}"
echo "Experiment ${pslot} **SUCCESS** at ${DATE}" >> "${output_ci}"
DATE=$(date +'%D %r')
echo "Experiment ${pslot} **SUCCESS** on ${MACHINE_ID^} at ${DATE}" >> "${output_ci_single}"
echo "Experiment ${pslot} *** SUCCESS *** at ${DATE}" >> "${output_ci}"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci_single}"

fi
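
The message changes in check_ci.sh lean on two small features: the Bash expansion `${MACHINE_ID^}`, which upper-cases the first character of the value (Bash 4 or newer), and `date +'%D %r'`, which prints a compact date and 12-hour time stamp. The sketch below shows both in isolation. The `MACHINE_ID` and `pslot` values are made up for illustration and are not taken from the PR.

```shell
#!/usr/bin/env bash
# Hypothetical values for illustration only.
MACHINE_ID="orion"
pslot="C48_ATM"

# ${var^} capitalizes the first character (requires Bash 4 or newer).
label="CI-${MACHINE_ID^}-Passed"

# %D is the date as mm/dd/yy; %r is the locale's 12-hour clock time.
DATE=$(date +'%D %r')

message="Experiment ${pslot} **SUCCESS** on ${MACHINE_ID^} at ${DATE}"
echo "${label}"
echo "${message}"
```

Computing `DATE` once and reusing it, as the diff does, keeps the time stamp identical across the per-experiment log and the combined log.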
20 changes: 13 additions & 7 deletions ci/scripts/clone-build_ci.sh
@@ -72,16 +72,17 @@ cd sorc || exit 1
set +e
./checkout.sh -c -g -u >> log.checkout 2>&1
checkout_status=$?
DATE=$(date +'%D %r')
if [[ ${checkout_status} != 0 ]]; then
{
echo "Checkout: *** FAILED ***"
echo "Checkout: Failed at $(date)" || true
echo "Checkout: Failed at ${DATE}" || true
echo "Checkout: see output at ${PWD}/log.checkout"
} >> "${outfile}"
exit "${checkout_status}"
else
{
echo "Checkout: Completed at $(date)" || true
echo "Checkout: Completed at ${DATE}" || true
} >> "${outfile}"
fi

@@ -92,25 +93,30 @@ rm -rf log.build
./build_all.sh >> log.build 2>&1
build_status=$?

DATE=$(date +'%D %r')
if [[ ${build_status} != 0 ]]; then
{
echo "Build: *** FAILED ***"
echo "Build: Failed at $(date)" || true
echo "Build: see output at ${PWD}/log.build"
echo "Build: Failed at ${DATE}"
cat "${PWD}/log.build"
} >> "${outfile}"
exit "${build_status}"
else
{
echo "Build: Completed at $(date)" || true
echo "Build: Completed at ${DATE}"
} >> "${outfile}"
fi

./link_workflow.sh
LINK_LOGFILE_PATH=link_workflow.log
rm -f "${LINK_LOGFILE_PATH}"
./link_workflow.sh >> "${LINK_LOGFILE_PATH}" 2>&1
link_status=$?
if [[ ${link_status} != 0 ]]; then
DATE=$(date +'%D %r')
{
echo "Link: *** FAILED ***"
echo "Link: Failed at $(date)" || true
echo "Link: Failed at ${DATE}"
cat "${LINK_LOGFILE_PATH}"
} >> "${outfile}"
exit "${link_status}"
fi
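
The checkout, build, and link steps above all follow the same pattern: run the step with its output captured in a log, record the exit status, then append a status line (and, on failure, the log itself) to the summary file. The sketch below factors that pattern into one function. `run_step` is a hypothetical helper that does not exist in the PR; it only illustrates the pattern.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: run one step, capture its log,
# and append a status line (plus the log on failure) to a summary file.
run_step() {
  local name=$1 logfile=$2 outfile=$3
  shift 3
  local status=0 stamp
  rm -f "${logfile}"
  # Send stdout and stderr of the step to its own log file.
  "$@" >> "${logfile}" 2>&1 || status=$?
  stamp=$(date +'%D %r')
  if [[ ${status} -ne 0 ]]; then
    {
      echo "${name}: *** FAILED ***"
      echo "${name}: Failed at ${stamp}"
      cat "${logfile}"
    } >> "${outfile}"
  else
    echo "${name}: Completed at ${stamp}" >> "${outfile}"
  fi
  return "${status}"
}
```

A call such as `run_step "Link" link_workflow.log "${outfile}" ./link_workflow.sh || exit $?` would then stand in for one of the repeated blocks, assuming the caller wants to exit with the step's own status.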
97 changes: 69 additions & 28 deletions ci/scripts/driver.sh
@@ -25,7 +25,7 @@ export REPO_URL=${REPO_URL:-"https://github.com/NOAA-EMC/global-workflow.git"}
################################################################
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/../.." >/dev/null 2>&1 && pwd )"
scriptname=$(basename "${BASH_SOURCE[0]}")
echo "Begin ${scriptname} at $(date -u)" || true
echo "Begin ${scriptname} at $(date +'%D %r')" || true
export PS4='+ $(basename ${BASH_SOURCE})[${LINENO}]'

#########################################################################
@@ -48,6 +48,7 @@ esac
# setup runtime env for correct python install and git
######################################################
set +x
source "${ROOT_DIR}/ci/scripts/utils/ci_utils.sh"
source "${ROOT_DIR}/ush/module-setup.sh"
module use "${ROOT_DIR}/modulefiles"
module load "module_gwsetup.${MACHINE_ID}"
@@ -68,24 +69,54 @@ pr_list=$(${GH} pr list --repo "${REPO_URL}" --label "CI-${MACHINE_ID^}-Ready" -
for pr in ${pr_list}; do
pr_dir="${GFS_CI_ROOT}/PR/${pr}"
db_list=$("${ROOT_DIR}/ci/scripts/pr_list_database.py" --add_pr "${pr}" --dbfile "${pr_list_dbfile}")
pr_id=0
output_ci_single="${GFS_CI_ROOT}/PR/${pr}/output_single.log"
#############################################################
# Check whether a PR has been relabeled Ready after it was
# already picked up; if so, cancel all of its previous jobs in
# the scheduler and remove the PR from the filesystem to
# start clean
#############################################################
if [[ "${db_list}" == *"already is in list"* ]]; then
pr_id=$("${ROOT_DIR}/ci/scripts/pr_list_database.py" --dbfile "${pr_list_dbfile}" --display "${pr}" | awk '{print $4}') || true
pr_id=$((pr_id+1))
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --dbfile "${pr_list_dbfile}" --update_pr "${pr}" Open Ready "${pr_id}"
for cases in "${pr_dir}/RUNTESTS/"*; do
if [[ -z "${cases+x}" ]]; then
break
driver_ID=$("${ROOT_DIR}/ci/scripts/pr_list_database.py" --dbfile "${pr_list_dbfile}" --display "${pr}" | awk '{print $4}') || true
driver_PID=$(echo "${driver_ID}" | cut -d":" -f1) || true
driver_HOST=$(echo "${driver_ID}" | cut -d":" -f2) || true
host_name=$(hostname -s)
rm -f "${output_ci_single}"
{
echo "CI Update on ${MACHINE_ID^} at $(date +'%D %r')" || true
echo "================================================="
echo "PR:${pr} Reset to ${MACHINE_ID^}-Ready by user and is now restarting CI tests" || true
} >> "${output_ci_single}"
if [[ "${driver_PID}" -ne 0 ]]; then
echo "Driver PID: ${driver_PID} no longer running this build having it killed"
if [[ "${driver_HOST}" == "${host_name}" ]]; then
# shellcheck disable=SC2312
pstree -A -p "${driver_PID}" | grep -Pow "(?<=\()[0-9]+(?=\))" | xargs kill
else
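# Double quotes so that ${driver_PID} expands locally before the command is sent to the remote host
# shellcheck disable=SC2029,SC2312
ssh "${driver_HOST}" "pstree -A -p ${driver_PID} | grep -Eow '[0-9]+' | xargs kill"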
fi
pslot=$(basename "${cases}")
sacct --format=jobid,jobname%35,WorkDir%100,stat | grep "${pslot}" | grep "PR\/${pr}\/RUNTESTS" | awk '{print $1}' | xargs scancel || true
done
rm -Rf "${pr_dir}"
{
echo "Driver PID: Requested termination of ${driver_PID} and children on ${driver_HOST}"
echo "Driver PID: has restarted as $$ on ${host_name}"
} >> "${output_ci_single}"
fi

experiments=$(find "${pr_dir}/RUNTESTS/EXPDIR" -mindepth 1 -maxdepth 1 -type d) || true
if [[ -z "${experiments}" ]]; then
echo "No current experiments to cancel in PR: ${pr} on ${MACHINE_ID^}" >> "${output_ci_single}"
else
for case in ${experiments}; do
case_name=$(basename "${case}")
cancel_slurm_jobs "${case_name}"
{
echo "Canceled all jobs for experiment ${case_name} in PR:${pr} on ${MACHINE_ID^}"
} >> "${output_ci_single}"
done
fi
sed -i "1 i\`\`\`" "${output_ci_single}"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci_single}"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --remove_pr "${pr}" --dbfile "${pr_list_dbfile}"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --add_pr "${pr}" --dbfile "${pr_list_dbfile}"
fi
done
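
The restart logic above stores the driver as a `PID:HOST` string in the PR database, splits it apart, and turns `pstree -A -p` output into a list of PIDs to kill. The sketch below walks through the same steps on sample data, without killing anything. The `driver_ID` value and the `sample_tree` text are invented for illustration, and GNU grep with `-P` support is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical database field in the "PID:HOST" form written by the driver.
driver_ID="304527:orion-login-1"

# Parameter expansion splits the field without spawning cut.
driver_PID="${driver_ID%%:*}"   # text before the first colon
driver_HOST="${driver_ID#*:}"   # text after the first colon

# Sample of what `pstree -A -p <pid>` might print (made up for illustration).
sample_tree='driver.sh(304527)-+-clone-build_ci.(304530)---build_all.sh(304541)
                  `-sleep(304533)'

# Keep only the numbers that sit between parentheses, one PID per line.
pids=$(grep -Pow '(?<=\()[0-9]+(?=\))' <<< "${sample_tree}")

echo "host=${driver_HOST} pid=${driver_PID}"
echo "${pids}"
```

Matching only digits enclosed in parentheses avoids picking up numbers that happen to appear in process names, which a bare `[0-9]+` pattern could do.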

@@ -110,34 +141,44 @@ for pr in ${pr_list}; do
if [[ -z "${pr_building+x}" ]]; then
continue
fi
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Ready" --add-label "CI-${MACHINE_ID^}-Building"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --dbfile "${pr_list_dbfile}" --update_pr "${pr}" Open Building
echo "Processing Pull Request #${pr}"
id=$("${GH}" pr view "${pr}" --repo "${REPO_URL}" --json id --jq '.id')
pr_dir="${GFS_CI_ROOT}/PR/${pr}"
output_ci="${pr_dir}/output_ci_${id}"
output_ci_single="${GFS_CI_ROOT}/PR/${pr}/output_single.log"
driver_build_PID=$$
driver_build_HOST=$(hostname -s)
"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Ready" --add-label "CI-${MACHINE_ID^}-Building"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --dbfile "${pr_list_dbfile}" --update_pr "${pr}" Open Building "${driver_build_PID}:${driver_build_HOST}"
rm -Rf "${pr_dir}"
mkdir -p "${pr_dir}"
# call clone-build_ci to clone and build PR
id=$("${GH}" pr view "${pr}" --repo "${REPO_URL}" --json id --jq '.id')
{
echo "CI Update on ${MACHINE_ID^} at $(date +'%D %r')" || true
echo "============================================"
echo "Cloning and Building global-workflow PR: ${pr}"
echo "with PID: ${driver_build_PID} on host: ${driver_build_HOST}"
echo ""
} >> "${output_ci_single}"
sed -i "1 i\`\`\`" "${output_ci_single}"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci_single}"
set +e
output_ci="${pr_dir}/output_build_${id}"
rm -f "${output_ci}"
"${ROOT_DIR}/ci/scripts/clone-build_ci.sh" -p "${pr}" -d "${pr_dir}" -o "${output_ci}"
#echo "SKIPPING: ${ROOT_DIR}/ci/scripts/clone-build_ci.sh"
ci_status=$?
##################################################################
# Special case: the Ready label was reset while this driver was
# still building. A race condition can let clone-build_ci.sh start
# and then fail before this instance is killed. In that case we
# exit 0 so this instance of the driver script stops without
# relabeling the PR.
#################################################################
if [[ ${ci_status} -ne 0 ]]; then
pr_id_check=$("${ROOT_DIR}/ci/scripts/pr_list_database.py" --display "{pr}" --dbfile "${pr_list_dbfile}" | awk '{print $4}') || true
if [[ "${pr_id}" -ne "${pr_id_check}" ]]; then
build_PID_check=$("${ROOT_DIR}/ci/scripts/pr_list_database.py" --display "${pr}" --dbfile "${pr_list_dbfile}" | awk '{print $4}' | cut -d":" -f1) || true
if [[ "${build_PID_check}" -ne "$$" ]]; then
echo "Driver build PID: ${build_PID_check} no longer running this build ... exiting"
exit 0
fi
fi
set -e
if [[ ${ci_status} -eq 0 ]]; then
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --dbfile "${pr_list_dbfile}" --update_pr "${pr}" Open Built
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --dbfile "${pr_list_dbfile}" --update_pr "${pr}" Open Built "0:0"
#setup space to put an experiment
# export RUNTESTS for yaml case files to pickup
export RUNTESTS="${pr_dir}/RUNTESTS"
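
The failed-build check above asks whether the PID recorded for the PR still belongs to this driver instance. The sketch below isolates that ownership test. `owns_build` is a hypothetical helper, and comparing the host as well as the PID is an assumption that goes slightly beyond what the diff does (the diff compares only the PID against `$$`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the "PID:HOST" entry recorded for a PR
# still names this shell on this host.
owns_build() {
  local recorded=$1
  local my_pid=${2:-$$}
  local my_host=${3:-$(hostname -s)}
  [[ "${recorded}" == "${my_pid}:${my_host}" ]]
}

# Example: the entry was overwritten by a newer driver, so this one bows out.
if ! owns_build "4242:login2" 1111 login1; then
  echo "Driver build PID: 4242 no longer running this build ... exiting"
fi
```

Including the host guards against the case where two login nodes happen to hand out the same PID to different driver instances.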
@@ -159,7 +200,7 @@ for pr in ${pr_list}; do
set +e
export LOGFILE_PATH="${HOMEgfs}/ci/scripts/create_experiment.log"
rm -f "${LOGFILE_PATH}"
"${HOMEgfs}/workflow/create_experiment.py" --yaml "${HOMEgfs}/ci/cases/pr/${case}.yaml" 2>&1 "${LOGFILE_PATH}"
"${HOMEgfs}/workflow/create_experiment.py" --yaml "${HOMEgfs}/ci/cases/pr/${case}.yaml" > "${LOGFILE_PATH}" 2>&1
ci_status=$?
set -e
if [[ ${ci_status} -eq 0 ]]; then
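
The `create_experiment.py` change in this hunk fixes a redirection mistake: `2>&1 "${LOGFILE_PATH}"` without a `>` passes the log path as an extra argument instead of writing to it. The sketch below reproduces both forms with a hypothetical `noisy` function standing in for the real command.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a command that writes to both streams.
noisy() {
  echo "to stdout"
  echo "to stderr" >&2
  echo "argc=$#"
}

log=$(mktemp)

# Correct order: point stdout at the file first, then duplicate stderr onto it.
noisy --yaml case.yaml > "${log}" 2>&1
captured=$(cat "${log}")

# Without '>', the path is just a third argument and the file stays empty.
: > "${log}"
printed=$(noisy --yaml case.yaml 2>&1 "${log}")
leftover=$(cat "${log}")

echo "log file held: ${captured}"
echo "broken form printed: ${printed}"
rm -f "${log}"
```

In the broken form the command still runs and may even succeed, which is why the empty log file is easy to overlook.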
@@ -174,7 +215,7 @@ for pr in ${pr_list}; do
} >> "${output_ci}"
else
{
echo "*** Failed *** to create experiment: ${pslot}"
echo "*** Failed *** to create experiment: ${pslot} on ${MACHINE_ID^}"
echo ""
cat "${LOGFILE_PATH}"
} >> "${output_ci}"
@@ -186,7 +227,7 @@ for pr in ${pr_list}; do
done

"${GH}" pr edit --repo "${REPO_URL}" "${pr}" --remove-label "CI-${MACHINE_ID^}-Building" --add-label "CI-${MACHINE_ID^}-Running"
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --dbfile "${pr_list_dbfile}" --update_pr "${pr}" Open Running
"${ROOT_DIR}/ci/scripts/pr_list_database.py" --dbfile "${pr_list_dbfile}" --update_pr "${pr}" Open Running "0:0"
"${GH}" pr comment "${pr}" --repo "${REPO_URL}" --body-file "${output_ci}"

else
16 changes: 16 additions & 0 deletions ci/scripts/utils/ci_utils.sh
@@ -0,0 +1,16 @@
#!/bin/env bash

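function cancel_slurm_jobs() {
# Cancel each of the current user's queued or running Slurm jobs whose
# job name matches the pattern passed as the first argument.

local substring=$1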
local job_ids
job_ids=$(squeue -u "${USER}" -h -o "%i")

for job_id in ${job_ids}; do
job_name=$(sacct -j "${job_id}" --format=JobName%100 | head -3 | tail -1 | sed -r 's/\s+//g') || true
if [[ "${job_name}" =~ ${substring} ]]; then
echo "Canceling Slurm Job ${job_name} with: scancel ${job_id}"
scancel "${job_id}"
fi
done
}
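
`cancel_slurm_jobs` above calls `sacct` once per job to look up its name. As a possible alternative, the sketch below reads the job ID and job name together from `squeue` using the `%i %j` output format, so no per-job lookup is needed. `cancel_jobs_matching` is a hypothetical name, and it does a plain substring match rather than the regex match (`=~`) used in the PR.

```shell
#!/usr/bin/env bash
# Sketch only: a variant of cancel_slurm_jobs that reads job names straight
# from squeue ("%i %j" = job id and job name) instead of calling sacct per job.
cancel_jobs_matching() {
  local substring=$1
  local job_id job_name
  while read -r job_id job_name; do
    # Skip blank lines, e.g. when the user has no jobs in the queue.
    if [[ -z "${job_id}" ]]; then
      continue
    fi
    if [[ "${job_name}" == *"${substring}"* ]]; then
      echo "Canceling Slurm Job ${job_name} with: scancel ${job_id}"
      scancel "${job_id}"
    fi
  done < <(squeue -u "${USER}" -h -o "%i %j")
}
```

Feeding the loop through process substitution rather than a pipe keeps it in the current shell, so any variables set inside the loop remain visible to the caller.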