
The UPP job (offline post) failed at Hera #2227

Closed
WenMeng-NOAA opened this issue Jan 16, 2024 · 14 comments
Labels: bug

Comments

@WenMeng-NOAA
Contributor

What is wrong?

The standalone JGLOBAL_ATMOS_UPP job failed on Hera with model history files in C768.
The runtime log indicates an out-of-memory issue:

 84: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=54315683.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h23c26: task 92: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=54315683.0
  0: slurmstepd: error: *** STEP 54315683.0 ON h4c47 CANCELLED AT 2024-01-16T16:37:16 ***
 89: forrtl: error (78): process killed (SIGTERM)
 89: Image              PC                Routine            Line        Source

The C768 case has the following compute resource configuration:

2024-01-16 16:34:57,621 - INFO     - upp         : BEGIN: pygfs.task.upp._call_executable
2024-01-16 16:34:57,621 - DEBUG    - upp         : ( <exe: ['srun', '-l', '--export=ALL', '-n', '120', '--cpus-per-task=1', '/scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test/rundir/upp.x']> )
2024-01-16 16:34:57,621 - INFO     - upp         : Executing srun -l --export=ALL -n 120 --cpus-per-task=1 /scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test/rundir/upp.x
 59:   mype=          59  ierr=           0

What should have happened?

This job runs the offline post to generate the GFS master, flux, and GOES files.

What machines are impacted?

Hera

Steps to reproduce

Check out the global-workflow develop branch and run jobs/rocoto/upp.sh with model history files from GFS V17 HR2.

Additional information

None.

Do you have a proposed solution?

Update the env/HERA.env file:
Change
export APRUN_UPP="${launcher} -n ${npe_upp} --cpus-per-task=${NTHREADS_UPP}"
to
export APRUN_UPP="${launcher} -n ${npe_upp}"
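
An alternative sketch (untested here, for illustration only): keep the --cpus-per-task flag but let the job step use the whole node's memory, assuming SLURM's --mem=0 semantics of "all available memory on each allocated node".

export APRUN_UPP="${launcher} -n ${npe_upp} --cpus-per-task=${NTHREADS_UPP} --mem=0"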

@WenMeng-NOAA added the bug and triage labels on Jan 16, 2024
@aerorahul
Contributor

@WenMeng-NOAA
I think this was added by @KateFriedman-NOAA based on feedback received from the RDHPCS admins.
The PR was #2077 and the issue was #2044

So I think the issue is the memory, not the number-of-threads assignment in srun.

@KateFriedman-NOAA
Member

@WenMeng-NOAA I think this was added by @KateFriedman-NOAA based on feedback received from the RDHPCS admins. The PR was #2077 and the issue was #2044

That is correct. We had to add the --cpus-per-task flag for Orion and I also added it for Hera. The Hera sysadmins added something in the background that exports the value to SLURM but the Orion sysadmins hadn't, so we have to specify that flag on Orion for SLURM now. Even though there is a background workaround for Hera, I also added the flag there to be consistent and just in case that background workaround is ever removed. I ran with and without the --cpus-per-task flag on Hera during testing for PR #2077 and there was no difference whether I used it or not (because of the background workaround) so it was safe to add.
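
For context, a minimal sketch of the kind of background workaround described above, assuming it relies on the SRUN_CPUS_PER_TASK environment variable that newer SLURM releases honor (the actual Hera mechanism is not documented in this thread):

# Hypothetical site-level shim: newer SLURM no longer propagates --cpus-per-task
# from sbatch to srun, so re-exporting it restores the old behavior for every srun call.
if [[ -n "${SLURM_CPUS_PER_TASK:-}" ]]; then
  export SRUN_CPUS_PER_TASK="${SLURM_CPUS_PER_TASK}"
fi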

@WenMeng-NOAA
Contributor Author

@aerorahul and @KateFriedman-NOAA From my testing, removing the "--cpus-per-task=${NTHREADS_UPP}" option resolves the offline post failure.
My UPP test job card is /scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test/submit_gfsv17_gw_upp_hera.sh on Hera. Please advise on a fix. Thanks!

@aerorahul
Contributor

I think this needs RDHPCS input, as the line was added based on their suggestion and should have no impact on run success/failure.
Can you remind us what the memory requirement of this job is at this resolution, since the failure is an out-of-memory error?

@WenMeng-NOAA
Contributor Author

For the standalone UPP GFS tests on Hera, I usually don't specify the memory size, but I make sure the tasks are not all packed onto one node.
Here is my job card setting on Hera:

#SBATCH -o upp.gfs.oe%j
#SBATCH -e upp.gfs.oe%j
#SBATCH -J gfs_upp
#SBATCH -t 00:30:00
#SBATCH -N 10 --ntasks-per-node=12
#SBATCH -q debug
#SBATCH -A ovp

Following the compute resource configuration for UPP in global-workflow, the offline post is run as:
Executing srun -l --export=ALL -n 120 --cpus-per-task=1 /scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test/rundir/upp.x

It then fails with out-of-memory errors:

  0:   in WRFPOST npset=           1  num_pset=           2
  0:   in WRFPOST size datapd           0
 84: slurmstepd: error: Detected 1 oom-kill event(s) in StepId=54315683.0. Some of your processes may have been killed by the cgroup out-of-memory handler.
srun: error: h23c26: task 92: Out Of Memory
srun: launch/slurm: _step_signal: Terminating StepId=54315683.0
  0: slurmstepd: error: *** STEP 54315683.0 ON h4c47 CANCELLED AT 2024-01-16T16:37:16 ***
 89: forrtl: error (78): process killed (SIGTERM)

You may look at my runtime log upp.gfs.oe54315683 at /scratch1/NCEPDEV/stmp2/Wen.Meng/gw_test on Hera.
Any suggestions are appreciated.
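
For illustration only, the same job card with an explicit memory request so the step is not bound by a per-CPU cgroup limit (--mem=0 asks SLURM for all memory on each node; this is a sketch, not the tested fix):

#SBATCH -N 10 --ntasks-per-node=12
#SBATCH --mem=0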

@aerorahul
Contributor

Can you open a ticket with Hera RDHPCS and ask them why adding --cpus-per-task would cause this failure, since it goes against their previous (very recent) recommendation?

@WenMeng-NOAA
Contributor Author

@aerorahul Will do it.

@WalterKolczynski-NOAA removed the triage label on Jan 16, 2024
@aerorahul
Contributor

aerorahul commented Jan 16, 2024

@WenMeng-NOAA This also might help
https://rdhpcs-common-docs.rdhpcs.noaa.gov/wiki/index.php?title=Running_and_Monitoring_Jobs&title=Running_and_Monitoring_Jobs#Setting_the_--cpus-per-task_parameter

Since we are putting --cpus-per-task=1, and using 12 tasks per node, you will need to specify memory.
Without --cpus-per-task, 12 tasks are using the full node memory. With that argument, each task is only getting 1/40th of the total available memory.
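
To make that arithmetic concrete (assuming a 40-core Hera node with roughly 96 GB of usable memory, which is an assumption rather than a number from this thread):

# per-task memory under cgroup enforcement; node size is assumed, not measured
node_mem_gb=96   # assumed usable memory per node
echo "with --cpus-per-task=1: ~$((node_mem_gb / 40)) GB per task (1/40th of the node)"
echo "without the flag: 12 tasks share ~${node_mem_gb} GB, i.e. ~$((node_mem_gb / 12)) GB each on average"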

@WenMeng-NOAA
Contributor Author

@WenMeng-NOAA This also might help https://rdhpcs-common-docs.rdhpcs.noaa.gov/wiki/index.php?title=Running_and_Monitoring_Jobs&title=Running_and_Monitoring_Jobs#Setting_the_--cpus-per-task_parameter

Since we are putting --cpus-per-task=1, and using 12 tasks per node, you will need to specify memory. Without --cpus-per-task, 12 tasks are using the full node memory. With that argument, each task is only getting 1/40th of the total available memory.

@aerorahul I tested with "srun -l --export=ALL -n 120 --cpus-per-task=2". The job completed successfully. Do you have any suggestions on specifying memory or tuning 'NTHREADS_UPP' in env/HERA.env for the overall GFS resource configuration in global-workflow?

@aerorahul
Contributor

--cpus-per-task=2 is not the correct way to do this. It works in this case, but that is just luck. This number is for threading. Do we intend to run UPP with threads; i.e. OMP_NUM_THREADS=2?
The proper way is to request the right amount of memory for the application per node, regardless of threading count.
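
As a hedged sketch of that approach, using SLURM's standard memory flags with placeholder values (the right number for C768 would have to be measured):

export APRUN_UPP="${launcher} -n ${npe_upp} --cpus-per-task=${NTHREADS_UPP} --mem=48G"
# or, sized per task instead of per node:
# export APRUN_UPP="${launcher} -n ${npe_upp} --cpus-per-task=${NTHREADS_UPP} --mem-per-cpu=4G"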

@WenMeng-NOAA
Contributor Author

@aerorahul There is no need to run UPP with threads. OMP_NUM_THREADS=1 would be fine.

@WenMeng-NOAA
Contributor Author

@aerorahul @KateFriedman-NOAA For my offline post testing in global-workflow, I can make some local changes in order to complete the job. I am wondering if you see a similar issue in a high-resolution GFS end-to-end run. The offline post is used to post-process analysis data from the model.

@WenMeng-NOAA
Contributor Author

@aerorahul I am confused about the setting of NTHREADS_UPP at
https://github.com/NOAA-EMC/global-workflow/blob/develop/env/HERA.env#L205

Do you mean the "--cpus-per-task=??" option is for threads? If so, could this option be removed from the offline post configuration?

@aerorahul
Contributor

I think we resolved this by providing more memory.
