Final Updates for HR2 #1827
Conversation
group means wcoss2 fails
parm/config/gfs/config.ufs
Outdated
if [[ ${machine} = "HERA" ]]; then
export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20
fi
Please fix the indentation
Why is this just for Hera? Is this load balanced?
I think it was an issue of tabs, this has been updated.
Why is this just for Hera? Is this load balanced?
It's an issue of memory on hera. The jobs will not run on hera with less than 20 here, and will not run on wcoss2 with 20 here.
Can someone explain why 20 on wcoss2 fails?
I know @GeorgeVandenberghe-NOAA was looking into some of this. There were machine issues on wcoss2, but despite many tries by myself and @jiandewang, 20 never worked on WCOSS2, but 10 does. I know this is not ideal, but I don't know of another solution.
Is there an issue open for this anywhere and is it being tracked?
There's ongoing work led by @junwang-noaa to determine optimal load balancing for operational targets. This admittedly is not an extensive load balancing, but an attempt to ensure that everyone could run HR2 with C768 out of the box on various machines. We did confirm we had similar performance for this setup as we did for HR1, which was not surprising. I have not tested in cycled mode, therefore did not add lines for the gdas component, but yes, we'd likely be in a similar situation and I'd be happy to add additional lines there as this soon will be something that is tested.
Thanks. Running any configuration out of the box is what users ideally want. Achieving this, though, requires significant refactoring of the configuration system and is beyond the scope of any upcoming effort. Having said that, I am not against merging this in the interest of getting this in for HR2, though I would suggest not bringing this in.
I know this wasn't ideal, but at some point I don't know how to deal with the fact that one machine has significantly less memory than the others and requires different settings. I know there are efforts to determine ideal settings for multiple machines, but in the meantime I'm not sure of a way around it. I do not want to branch off of develop, as I think that encourages people into their own sandboxes and counters the efforts we've made to get everyone back to the develop branch. How about we leave the 20 here and maybe add a comment about the WCOSS2 setting as a compromise? Alternatively, we'll just leave the 20 there.
@JessicaMeixner-NOAA is there an issue available for tracking the issues with the write grid component settings in HR2 on the tested platforms? Can you list the error message, model revision, platforms, etc. so that we can confirm it is the memory issue? If I remember correctly, you mentioned Hera node issues before; can you include that information in the issue too?
The node failure issue on hera has an open hera helpdesk issue; I'll ask to see what the status is. Many people have experienced this issue, and an issue was previously opened on global-workflow for this: #1793, which was closed when we moved to 20 instead of 10. We've frequently run into memory issues that were simply solved by changing settings, so this was not seen as a red flag or cause for opening an issue on the model side. There's also a general workflow issue open about memory issues: #1664. If a specific ufs-weather-model issue is needed I can create one and add the information I have, although I finally cleaned up a lot of my runs as we were running out of space and we have successful configurations, so I do not have much in terms of failed log files at this point.
@JessicaMeixner-NOAA Thanks for the information. May I ask if there are diag_table updates in HR2 from HR1? Have we output more fields in history files? I am trying to understand what may cause the memory increase.
There were a few additional variables added to the diag_table. For full changes in the workflow:
What is the size of the sfc file compared to HR1?
I don't have the size of HR1 sfc files readily available. HR2 sfc NetCDF files are 1.3G.
just checked HR1 sfc,
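For anyone wanting to reproduce this comparison, here is a minimal sketch that checks both the file sizes and whether extra fields were added; the paths are placeholders, not actual workflow locations, and it assumes ncdump is available:
# Placeholder paths; point these at real HR1/HR2 sfc history files.
HR1_SFC=/path/to/HR1/sfcf006.nc
HR2_SFC=/path/to/HR2/sfcf006.nc
ls -lh "${HR1_SFC}" "${HR2_SFC}"                          # compare file sizes
diff <(ncdump -h "${HR1_SFC}") <(ncdump -h "${HR2_SFC}")  # compare headers/variable lists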
Thanks @jiandewang. The likely fix for the hera node failures will be implemented in the next maintenance on Sept 6th.
What is that number, 10 or 20?
WCOSS2 I/O groups work much better when they do not share nodes with either another group or another component. Therefore 64 MPI ranks per I/O group with 32 ranks per node works well. Others were found by chance experiment; in particular 48 seems to work well and is used in production. 60 and 30 ranks per group failed (hung) for me. The nature of the hangs is that a small amount of netcdf gets written, then writes stop and the ATM component eventually stops.
This problem either goes away or is much improved with hdf5/1.14.1, but that is not in production on wcoss2 yet (I built it in private benchmark testing and was allowed to build that library exclusively for benchmarks). We should upgrade to hdf5/1.14.1 and everything that depends on it (netcdf, esmf, pio, gftl, and MAPL) when hdf5/1.14.1 becomes available for general wcoss2 use.
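For context, the write component layout ends up in model_configure. A minimal sketch of the two relevant entries, using the 64-ranks-per-group example above (the group count of 2 is purely illustrative; the workflow computes these values from the WRTTASK_PER_GROUP_* settings):
# Illustrative values only: 64 ranks per group at 32 ranks/node fills exactly two nodes per group.
write_groups:            2
write_tasks_per_group:   64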
Just to confirm, the node failure has no impact on the write grid comp settings?
Not to my knowledge, no. While we didn't always see the "node failure" as the cause of an issue, the failures were widespread beyond just me: I know at least 3-4 people who all independently had failures that were fixed by going from WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10 to 20 on hera. @jiandewang do you have a log file? Otherwise the failure will have to be reproduced. @GeorgeVandenberghe-NOAA the 10/20 is referring to WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS.
Well, that answers my question. WCOSS2 should still use a value that prevents node sharing within groups and between a group and other components. hdf5/1.14.1 relaxes this constraint, but I haven't been able to test (forbidden by policy) whether that completely eliminates it until hdf5/1.14.1 is generally available.
Discussed it with Rahul. Let's go with this to get HR2 in.
@WalterKolczynski-NOAA and @aerorahul -- I'll make this update now!
Removed the hera exception and added a note that 10 is needed for WCOSS2.
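For reference, a sketch of what the updated parm/config/gfs/config.ufs stanza might look like after this change; the comment wording here is illustrative, not the exact text committed:
# NOTE: C768 needs 20 here to fit in memory on Hera, but the write component
# hangs on WCOSS2 with 20; use 10 on WCOSS2 until the underlying issue
# (possibly addressed by hdf5/1.14.1) is resolved.
export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20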
It's assigned by the run script that writes model_configure. ESMF threads make it trickier, but basically the write group size should be an integral factor or multiple of the number of ranks per node requested.
(In reply to Walter Kolczynski - NOAA, who wrote: "20*4*6 = 480, so the write group size itself should fit in a whole number of nodes on WCOSS2. I don't know that it necessarily would be assigned such by UFS though.")
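As an illustration of that rule of thumb, here is a hypothetical bash check (not part of the workflow; the values are the ones quoted above) that flags whether a proposed group size lines up with the node size:
group_ranks=64        # MPI ranks per write group (a value reported above to work well on WCOSS2)
ranks_per_node=32     # MPI ranks per node actually requested
# Rule of thumb: group size should be an integral factor or multiple of the
# ranks per node, so group boundaries line up with node boundaries.
if (( group_ranks % ranks_per_node == 0 || ranks_per_node % group_ranks == 0 )); then
  echo "OK: ${group_ranks} ranks/group is a factor or multiple of ${ranks_per_node} ranks/node"
else
  echo "WARNING: ${group_ranks} ranks/group does not align with ${ranks_per_node} ranks/node"
fi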
Thank you!!!! @WalterKolczynski-NOAA can this be tagged for HR2? Let me know if you want me to make an issue for that request, I'd be happy to.
Already on it. Just need to hop on a machine and clone so I can make the tag. Will be
Thank you @WalterKolczynski-NOAA !!!
Description
This PR has the final updates for HR2. There was one added variable for diagnostic output:
iopt_diag=3
as required by the land team. This also points to HR2 updates for the initial conditions, which update the land states in sfc* files compared to HR1. Lastly, it was determined that the setting added to enable running on hera,
WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20
will actually fail on WCOSS2. While not ideal, a setting that would run out of the box on both hera and wcoss2 for C768 could not be found; therefore a comment was added noting the need for a different setting on WCOSS2 until a solution can be found.
Resolves #1500 (although note this was technically already completed and this PR itself does not update the ufs-weather-model hash)
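For illustration only, the new option would typically appear in the FV3 input.nml physics namelist roughly as below, assuming it sits alongside the other Noah-MP iopt_* switches in gfs_physics_nml; where the workflow actually templates it may differ:
&gfs_physics_nml
  ! ... other Noah-MP iopt_* options unchanged ...
  iopt_diag = 3
/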
Type of change
Change characteristics
How has this been tested?
Checklist