
The UPP job (offline post) failed at Hera #2227

Closed
WenMeng-NOAA opened this issue Jan 16, 2024 · 14 comments
Labels: bug

Comments

@WenMeng-NOAA
Contributor

What is wrong?

The standalone JGLOBAL_ATMOS_UPP job failed on Hera with model history files in C768.
The runtime log indicates an out-of-memory issue:

 84: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=54315683.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h23c26: task 92: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=54315683.0
  0: slurmstepd: error: *** STEP 54315683.0 ON h4c47 CANCELLED AT 2024-01-16T16:37:16 ***
 89: forrtl: error (78): process killed (SIGTERM)
 89: Image              PC                Routine            Line        Source

The C768 case has the following compute resource configuration:

2024-01-16 16:34:57,621 - INFO     - upp         : BEGIN: pygfs.task.upp._call_executable
2024-01-16 16:34:57,621 - DEBUG    - upp         : ( <exe: ['srun', '-l', '--export=ALL', '-n', '120', '--cpus-per-task=1', '/scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test/rundir/upp.x']> )
2024-01-16 16:34:57,621 - INFO     - upp         : Executing srun -l --export=ALL -n 120 --cpus-per-task=1 /scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test/rundir/upp.x
 59:   mype=          59  ierr=           0

What should have happened?

This job runs the offline post to generate the GFS master, flux, and GOES files.

What machines are impacted?

Hera

Steps to reproduce

Check out the global-workflow develop branch and run jobs/rocoto/upp.sh with model history files from GFS V17 HR2.

Additional information

None.

Do you have a proposed solution?

Update the env/HERA.env file:
Change
export APRUN_UPP="${launcher} -n ${npe_upp} --cpus-per-task=${NTHREADS_UPP}"
to
export APRUN_UPP="${launcher} -n ${npe_upp}"
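
An alternative sketch (untested here, for illustration only): keep the --cpus-per-task flag but let the job step use the whole node's memory, assuming SLURM's --mem=0 semantics of "all available memory on each allocated node".

export APRUN_UPP="${launcher} -n ${npe_upp} --cpus-per-task=${NTHREADS_UPP} --mem=0"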

@WenMeng-NOAA added the bug and triage labels on Jan 16, 2024
@aerorahul
Contributor

@WenMeng-NOAA
I think this was added by @KateFriedman-NOAA based on feedback received from the RDHPCS admins.
The PR was #2077 and the issue was #2044

So I think the issue is the memory, not the number-of-threads assignment in srun.

@KateFriedman-NOAA
Member

@WenMeng-NOAA I think this was added by @KateFriedman-NOAA based on feedback received from the RDHPCS admins. The PR was #2077 and the issue was #2044

That is correct. We had to add the --cpus-per-task flag for Orion and I also added it for Hera. The Hera sysadmins added something in the background that exports the value to SLURM but the Orion sysadmins hadn't, so we have to specify that flag on Orion for SLURM now. Even though there is a background workaround for Hera, I also added the flag there to be consistent and just in case that background workaround is ever removed. I ran with and without the --cpus-per-task flag on Hera during testing for PR #2077 and there was no difference whether I used it or not (because of the background workaround) so it was safe to add.
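
For context, a minimal sketch of the kind of background workaround described above, assuming it relies on the SRUN_CPUS_PER_TASK environment variable that newer SLURM releases honor (the actual Hera mechanism is not documented in this thread):

# Hypothetical site-level shim: newer SLURM no longer propagates --cpus-per-task
# from sbatch to srun, so re-exporting it restores the old behavior for every srun call.
if [[ -n "${SLURM_CPUS_PER_TASK:-}" ]]; then
  export SRUN_CPUS_PER_TASK="${SLURM_CPUS_PER_TASK}"
fi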

@WenMeng-NOAA
Contributor Author

@aerorahul and @KateFriedman-NOAA From my testing, removing the "--cpus-per-task=${NTHREADS_UPP}" option resolves the offline post failure.
My UPP test job card is /scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test/submit_gfsv17_gw_upp_hera.sh on Hera. Please advise on a fix. Thanks!

@aerorahul
Contributor

I think this needs RDHPCS input, as the line was added based on their suggestion and should have no impact on run success/failure.
Can you remind us what the memory requirement of this job is at this resolution, since the failure is an out-of-memory error?

@WenMeng-NOAA
Contributor Author

For the standalone UPP GFS tests on Hera, I usually don't specify the memory size, but I make sure the tasks are not all packed onto one node.
Here is my job card setting on Hera:

#SBATCH -o upp.gfs.oe%j
#SBATCH -e upp.gfs.oe%j
#SBATCH -J gfs_upp
#SBATCH -t 00:30:00
#SBATCH -N 10 --ntasks-per-node=12
#SBATCH -q debug
#SBATCH -A ovp

Following the compute resource configuration for UPP in global-workflow, the offline post is run as:
Executing srun -l --export=ALL -n 120 --cpus-per-task=1 /scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test/rundir/upp.x

It then fails with out-of-memory errors:

  0:   in WRFPOST npset=           1  num_pset=           2
  0:   in WRFPOST size datapd           0
 84: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=54315683.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h23c26: task 92: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=54315683.0
  0: slurmstepd: error: *** STEP 54315683.0 ON h4c47 CANCELLED AT 2024-01-16T16:37:16 ***
 89: forrtl: error (78): process killed (SIGTERM)

You may look at my runtime log upp.gfs.oe54315683 at /scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test on Hera.
Any suggestions are appreciated.
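
For illustration only, the same job card with an explicit memory request so the step is not bound by a per-CPU cgroup limit (--mem=0 asks SLURM for all memory on each node; this is a sketch, not the tested fix):

#SBATCH -N 10 --ntasks-per-node=12
#SBATCH --mem=0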

@aerorahul
Contributor

Can you open a ticket with Hera RDHPCS and ask them why adding --cpus-per-task would cause this failure, since it goes against their previous (very recent) recommendation?

@WenMeng-NOAA
Contributor Author

@aerorahul Will do it.

@WalterKolczynski-NOAA removed the triage label on Jan 16, 2024
@aerorahul
Contributor

aerorahul commented Jan 16, 2024

@WenMeng-NOAA This also might help
https://rdhpcs-common-docs.rdhpcs.noaa.gov/wiki/index.php?title=Running_and_Monitoring_Jobs&title=Running_and_Monitoring_Jobs#Setting_the_--cpus-per-task_parameter

Since we are putting --cpus-per-task=1, and using 12 tasks per node, you will need to specify memory.
Without --cpus-per-task, 12 tasks are using the full node memory. With that argument, each task is only getting 1/40th of the total available memory.
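
To make that arithmetic concrete (assuming a 40-core Hera node with roughly 96 GB of usable memory, which is an assumption rather than a number from this thread):

# per-task memory under cgroup enforcement; node size is assumed, not measured
node_mem_gb=96   # assumed usable memory per node
echo "with --cpus-per-task=1: ~$((node_mem_gb / 40)) GB per task (1/40th of the node)"
echo "without the flag: 12 tasks share ~${node_mem_gb} GB, i.e. ~$((node_mem_gb / 12)) GB each on average"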

@WenMeng-NOAA
Contributor Author

@WenMeng-NOAA This also might help https://rdhpcs-common-docs.rdhpcs.noaa.gov/wiki/index.php?title=Running_and_Monitoring_Jobs&title=Running_and_Monitoring_Jobs#Setting_the_--cpus-per-task_parameter

Since we are putting --cpus-per-task=1, and using 12 tasks per node, you will need to specify memory. Without --cpus-per-task, 12 tasks are using the full node memory. With that argument, each task is only getting 1/40th of the total available memory.

@aerorahul I tested with "srun -l --export=ALL -n 120 --cpus-per-task=2". The job completed successfully. Do you have any suggestions on specifying memory or tuning 'NTHREADS_UPP' in env/HERA.env for the overall GFS resource configuration in global-workflow?

@aerorahul
Contributor

--cpus-per-task=2 is not the correct way to do this. It works in this case, but that is just luck. This number is for threading. Do we intend to run UPP with threads; i.e. OMP_NUM_THREADS=2?
The proper way is to request the right amount of memory for the application per node, regardless of threading count.
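
As a hedged sketch of that approach, using SLURM's standard memory flags with placeholder values (the right number for C768 would have to be measured):

export APRUN_UPP="${launcher} -n ${npe_upp} --cpus-per-task=${NTHREADS_UPP} --mem=48G"
# or, sized per task instead of per node:
# export APRUN_UPP="${launcher} -n ${npe_upp} --cpus-per-task=${NTHREADS_UPP} --mem-per-cpu=4G"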

@WenMeng-NOAA
Contributor Author

@aerorahul There is no need to run UPP with threads. OMP_NUM_THREADS=1 would be fine.

@WenMeng-NOAA
Contributor Author

@aerorahul @KateFriedman-NOAA For my offline post testing in global-workflow, I can make some local changes in order to complete the job. I am wondering if you see a similar issue in a high-resolution GFS end-to-end run. The offline post is used to post-process analysis data from the model.

@WenMeng-NOAA
Contributor Author

@aerorahul I am confused about the setting of NTHREADS_UPP at
https://github.com/NOAA-EMC/global-workflow/blob/develop/env/HERA.env#L205

Do you mean the "--cpus-per-task=??" option is for threads? If so, could this option be removed from the offline post configuration?

@aerorahul
Contributor

I think we resolved this by providing more memory.
