Fix xml file setup and complete C48 ATM and S2SW runs for CI on Gaea #2701

Merged
18 commits merged on Jul 2, 2024
Changes from 12 commits
5 changes: 5 additions & 0 deletions env/GAEA.env
@@ -36,4 +36,9 @@ elif [[ "${step}" = "atmos_products" ]]; then

export USE_CFP="YES" # Use MPMD for downstream product generation

elif [[ "${step}" = "oceanice_products" ]]; then

export NTHREADS_OCNICEPOST=${nth_oceanice_products:-1}
export APRUN_OCNICEPOST="${launcher} -n 1 --cpus-per-task=${NTHREADS_OCNICEPOST}"

fi
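
For reference, a minimal sketch of how the new oceanice_products block would expand at run time; "srun" standing in for ${launcher} on Gaea and the thread count of 2 are assumptions for illustration, not values taken from this PR:

# Hypothetical standalone expansion of the oceanice_products block above.
launcher="srun"               # assumed Slurm launcher on Gaea
nth_oceanice_products=2       # example value; if unset, the default of 1 applies
export NTHREADS_OCNICEPOST=${nth_oceanice_products:-1}
export APRUN_OCNICEPOST="${launcher} -n 1 --cpus-per-task=${NTHREADS_OCNICEPOST}"
echo "${APRUN_OCNICEPOST}"    # prints: srun -n 1 --cpus-per-task=2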
1 change: 1 addition & 0 deletions parm/config/gfs/config.base
@@ -19,6 +19,7 @@ export QUEUE_SERVICE="@QUEUE_SERVICE@"
export PARTITION_BATCH="@PARTITION_BATCH@"
export PARTITION_SERVICE="@PARTITION_SERVICE@"
export RESERVATION="@RESERVATION@"
export CLUSTERS="@CLUSTERS@"

# Project to use in mass store:
export HPSS_PROJECT="@HPSS_PROJECT@"
29 changes: 24 additions & 5 deletions parm/config/gfs/config.resources
@@ -113,7 +113,11 @@ case ${step} in
export nth_waveinit=1
export npe_node_waveinit=$(( npe_node_max / nth_waveinit ))
export NTASKS=${npe_waveinit}
export memory_waveinit="2GB"
if [[ "${machine}" == "GAEA" ]]; then
export memory_waveinit=""
Contributor:
Can you explain why specifying memory on Gaea is inappropriate (here and elsewhere)?

Contributor Author:
@aerorahul Thanks for looking over the PR.

I tried a few different options for setting the memory on Gaea before contacting the Gaea help desk:

1. add 2G:
   sbatch: error: Memory specification can not be satisfied
   sbatch: error: Batch job submission failed: Requested node configuration is not available
2. --mem=2G:
   sbatch: error: Memory specification can not be satisfied
3. 0 works

The response from the Gaea help desk and ORNL:
"Due to the configuration of slurm on Gaea, users are not expected to set the memory for batch jobs. In cases of node sharing (on a specific partition, on a given set of nodes) among users, you would then be required to explicitly request a certain amount of memory for a job.
I talked to the admins at ORNL to see if it was intentional and with the way slurm is configured memory is a consumable resource which is not shared among jobs meaning exclusivity is assumed in this case. Users should not have to manually set the real memory on the batch partition."

Here's a simple test script that demonstrates the error:
#!/bin/bash
#SBATCH -A ufs-ard
#SBATCH -M c5
#SBATCH --mem=...    # 0 succeeds; 1G fails
#SBATCH --time=1:00:00
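
If one wants to confirm the help desk's description directly, standard Slurm queries like the ones below should show whether memory is treated as a consumable resource and what memory limits the batch partition advertises; this is a suggested check using stock Slurm commands, not something that was run as part of this PR:

# Hypothetical verification on a Gaea login node; "-M c5" may be needed
# if the multi-cluster setup requires targeting the c5 cluster explicitly.
scontrol show config | grep -i 'SelectTypeParameters'    # e.g. CR_Core_Memory => memory is consumable
scontrol show partition batch | grep -iE 'MemPer|OverSubscribe'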

Contributor:
Instead of doing this for every task, can you try this at the end of config.resources while we get feedback from the Gaea sysadmins:

if [[ "${machine}" == "GAEA" ]]; then
  for mem_var in $(env | grep '^memory_' | cut -d= -f1); do
    unset "${mem_var}"
  done
fi

It should unset all memory_ variables for Gaea without having to do so for each task.
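
A quick interactive sanity check of that loop could look like the sketch below; the two memory_ variables are just example names exported for the test. Note that env only lists exported variables, which matches how config.resources exports its memory_ settings:

# Hypothetical shell session exercising the suggested unset loop.
export memory_waveinit="2GB" memory_cleanup="4096M"
for mem_var in $(env | grep '^memory_' | cut -d= -f1); do
  unset "${mem_var}"
done
env | grep '^memory_' || echo "no memory_ variables remain exported"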

Contributor Author:
Morning @aerorahul. I'm responding to the ticket right now and will test this out today. Thanks.

else
export memory_waveinit="2GB"
fi
;;

"waveprep")
@@ -137,8 +141,13 @@ case ${step} in
export nth_wavepostsbs=1
export npe_node_wavepostsbs=$(( npe_node_max / nth_wavepostsbs ))
export NTASKS=${npe_wavepostsbs}
export memory_wavepostsbs="10GB"
export memory_wavepostsbs_gfs="10GB"
if [[ "${machine}" == "GAEA" ]]; then
export memory_wavepostsbs=""
export memory_wavepostsbs_gfs=""
else
export memory_wavepostsbs="10GB"
export memory_wavepostsbs_gfs="10GB"
fi
;;

# The wavepost*pnt* jobs are I/O heavy and do not scale well to large nodes.
@@ -777,7 +786,11 @@ case ${step} in
export npe_oceanice_products=1
export npe_node_oceanice_products=1
export nth_oceanice_products=1
export memory_oceanice_products="96GB"
if [[ "${machine}" == "GAEA" ]]; then
export memory_oceanice_products=""
else
export memory_oceanice_products="96GB"
fi
;;

"upp")
@@ -935,6 +948,8 @@ case ${step} in
declare -x "memory_${step}"="4096M"
if [[ "${machine}" == "WCOSS2" ]]; then
declare -x "memory_${step}"="50GB"
elif [[ "${machine}" == "GAEA" ]]; then
declare -x "memory_${step}"=""
fi
;;

@@ -943,7 +958,11 @@ case ${step} in
export npe_cleanup=1
export npe_node_cleanup=1
export nth_cleanup=1
export memory_cleanup="4096M"
if [[ "${machine}" == "GAEA" ]]; then
export memory_cleanup=""
else
export memory_cleanup="4096M"
fi
;;

"stage_ic")
2 changes: 1 addition & 1 deletion sorc/link_workflow.sh
@@ -75,7 +75,7 @@ case "${machine}" in
"hercules") FIX_DIR="/work/noaa/global/glopara/fix" ;;
"jet") FIX_DIR="/lfs4/HFIP/hfv3gfs/glopara/git/fv3gfs/fix" ;;
"s4") FIX_DIR="/data/prod/glopara/fix" ;;
"gaea") FIX_DIR="/gpfs/f5/epic/proj-shared/global/glopara/data/fix" ;;
"gaea") FIX_DIR="/gpfs/f5/ufs-ard/world-shared/global/glopara/data/fix" ;;
*)
echo "FATAL: Unknown target machine ${machine}, couldn't set FIX_DIR"
exit 1
21 changes: 12 additions & 9 deletions workflow/hosts/gaea.yaml
@@ -1,19 +1,22 @@
BASE_GIT: '/gpfs/f5/epic/proj-shared/global/glopara/data/git'
DMPDIR: '/gpfs/f5/epic/proj-shared/global/glopara/data/dump'
BASE_CPLIC: '/gpfs/f5/epic/proj-shared/global/glopara/data/ICSDIR/prototype_ICs'
PACKAGEROOT: '/gpfs/f5/epic/proj-shared/global/glopara/data/nwpara'
COMROOT: '/gpfs/f5/epic/proj-shared/global/glopara/data/com'
BASE_GIT: '/gpfs/f5/ufs-ard/world-shared/global/glopara/data/git'
DMPDIR: '/gpfs/f5/ufs-ard/world-shared/global/glopara/data/dump'
BASE_CPLIC: '/gpfs/f5/ufs-ard/world-shared/global/glopara/data/ICSDIR/prototype_ICs'
PACKAGEROOT: '/gpfs/f5/ufs-ard/world-shared/global/glopara/data/nwpara'
COMROOT: '/gpfs/f5/ufs-ard/world-shared/global/glopara/data/com'
COMINsyn: '${COMROOT}/gfs/prod/syndat'
HOMEDIR: '/gpfs/f5/epic/scratch/${USER}'
STMP: '/gpfs/f5/epic/scratch/${USER}'
PTMP: '/gpfs/f5/epic/scratch/${USER}'
HOMEDIR: '/gpfs/f5/ufs-ard/scratch/${USER}'
STMP: '/gpfs/f5/ufs-ard/scratch/${USER}'
PTMP: '/gpfs/f5/ufs-ard/scratch/${USER}'
NOSCRUB: $HOMEDIR
ACCOUNT: epic
ACCOUNT: ufs-ard
ACCOUNT_SERVICE: ufs-ard
SCHEDULER: slurm
QUEUE: normal
QUEUE_SERVICE: normal
PARTITION_BATCH: batch
PARTITION_SERVICE: batch
RESERVATION: ''
CLUSTERS: 'c5'
CHGRP_RSTPROD: 'NO'
CHGRP_CMD: 'chgrp rstprod'
HPSSARCH: 'NO'
2 changes: 2 additions & 0 deletions workflow/rocoto/tasks.py
@@ -217,6 +217,8 @@ def get_resource(self, task_name):
native = '--export=NONE'
if task_config['RESERVATION'] != "":
native += '' if task_name in Tasks.SERVICE_TASKS else ' --reservation=' + task_config['RESERVATION']
if task_config['CLUSTERS'] != "":
native += ' --clusters=' + task_config['CLUSTERS']

queue = task_config['QUEUE_SERVICE'] if task_name in Tasks.SERVICE_TASKS else task_config['QUEUE']
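
For illustration, with the Gaea host values above (RESERVATION: '' and CLUSTERS: 'c5'), the native directives assembled here for a non-service task would come out roughly as in the shell sketch below; this is a hand-expanded example, not output captured from the workflow:

# Hypothetical expansion of the native string built in get_resource() on Gaea.
native="--export=NONE"
reservation=""          # RESERVATION is empty, so nothing is appended
clusters="c5"           # CLUSTERS from workflow/hosts/gaea.yaml
if [[ -n "${reservation}" ]]; then native+=" --reservation=${reservation}"; fi
if [[ -n "${clusters}" ]]; then native+=" --clusters=${clusters}"; fi
echo "${native}"        # prints: --export=NONE --clusters=c5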
