Final Updates for HR2 #1827
Conversation
group means wcoss2 fails
parm/config/gfs/config.ufs
Outdated
if [[ ${machine} = "HERA" ]]; then
export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20
fi
Please fix the indentation
Why is this just for Hera? Is this load balanced?
I think it was an issue of tabs, this has been updated.
Why is this just for Hera? Is this load balanced?
It's an issue of memory on hera. The jobs will not run on hera with less than 20 here, and will not run on wcoss2 with 20 here.
Can someone explain why 20 on wcoss2 fails?
I know @GeorgeVandenberghe-NOAA was looking into some of this. There were machine issues on wcoss2, but despite many tries by myself and @jiandewang, 20 never worked on WCOSS2, but 10 does. I know this is not ideal, but I don't know of another solution.
Is there an issue open for this anywhere and is it being tracked?
There's ongoing work led by @junwang-noaa to determine optimal load balancing for operational targets. This admittedly is not an extensive load balancing, but an attempt to ensure that everyone could run HR2 with C768 out of the box on various machines. We did confirm we had similar performance for this setup as we did for HR1, which was not surprising. I have not tested in cycled mode, therefore did not add lines for the gdas component, but yes, we'd likely be in a similar situation and I'd be happy to add additional lines there as this soon will be something that is tested.
Thanks. Running any configuration out of the box is what users ideally want. Achieving this, though, requires significant refactoring of the configuration system and is beyond the scope of any upcoming effort. Having said that, I am not against merging this in the interest of getting this in for HR2, though I would suggest not bringing this in.
I know this wasn't ideal, but at some point I don't know how to deal with the fact that one machine has significantly less memory than the others and requires different settings. I know there are efforts to determine ideal settings for multiple machines, but in the meantime I'm not sure of a way around it. I do not want to branch off of develop, as I think that encourages people into their own sandboxes and counters the efforts we've made to get everyone back to the develop branch. How about we leave the 20 here and maybe add a comment about the WCOSS2 setting as a compromise? Alternatively, we'll just leave the 20 there.
@JessicaMeixner-NOAA is there an issue available for tracking the issues with the write grid component settings in HR2 on the tested platforms? Can you list the error message, model revision, platforms, etc. so that we can confirm it is the memory issue? If I remember correctly, you mentioned Hera node issues before; can you include that information in the issue too?
The node failure issue on hera has an open hera helpdesk issue; I'll ask to see what the status is. Many people have experienced this issue, and an issue was previously opened on global-workflow for this: #1793, which was closed when we moved to 20 instead of 10. We've frequently run into memory issues that were simply solved by changing settings, so this was not seen as a red flag or cause for opening an issue on the model side. There's also a general workflow issue open about memory issues: #1664. If a specific ufs-weather-model issue is needed I can create one and add the information I have, although I finally cleaned up a lot of my runs as we were running out of space and we have successful configurations, so I do not have much in terms of failed log files at this point.
@JessicaMeixner-NOAA Thanks for the information. May I ask if there are diag_table updates in HR2 from HR1? Have we output more fields in history files? I am trying to understand what may cause the memory increase.
There were a few additional variables added to the diag_table. For full changes in the workflow:
What is the size of the sfc file compared to HR1?
I don't have the size of HR1 sfc files readily available. HR2 sfc NetCDF files are 1.3G.
just checked HR1 sfc,
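For anyone wanting to reproduce this comparison, here is a minimal sketch that checks both the file sizes and whether extra fields were added; the paths are placeholders, not actual workflow locations, and it assumes ncdump is available:
# Placeholder paths; point these at real HR1/HR2 sfc history files.
HR1_SFC=/path/to/HR1/sfcf006.nc
HR2_SFC=/path/to/HR2/sfcf006.nc
ls -lh "${HR1_SFC}" "${HR2_SFC}"                          # compare file sizes
diff <(ncdump -h "${HR1_SFC}") <(ncdump -h "${HR2_SFC}")  # compare headers/variable lists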
Thanks @jiandewang. The likely fix for the hera node failures will be implemented in the next maintenance on Sept 6th.
What is that number, 10 or 20?
WCOSS2 I/O groups work much better when they do not share nodes with either another group or another component. Therefore 64 MPI ranks per I/O group with 32 ranks per node works well. Others were found by chance experiment; in particular 48 seems to work well and is used in production. 60 and 30 ranks per group failed (hung) for me. The nature of the hangs is that a small amount of netcdf gets written, then writes stop and the ATM component eventually stops.
This problem either goes away or is much improved with hdf5/1.14.1, but that is not in production on wcoss2 yet (I built it in private benchmark testing and was allowed to build that library exclusively for benchmarks). We should upgrade to hdf5/1.14.1 and everything that depends on it (netcdf, esmf, pio, gftl, and MAPL) when hdf5/1.14.1 becomes available for general wcoss2 use.
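For context, the write component layout ends up in model_configure. A minimal sketch of the two relevant entries, using the 64-ranks-per-group example above (the group count of 2 is purely illustrative; the workflow computes these values from the WRTTASK_PER_GROUP_* settings):
# Illustrative values only: 64 ranks per group at 32 ranks/node fills exactly two nodes per group.
write_groups:            2
write_tasks_per_group:   64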
Just to confirm, the node failure has no impact on the write grid comp settings?
Not to my knowledge, no. While we didn't always see the "node failure" as the cause of an issue, the failures were widespread beyond just me: I know at least 3-4 people who all independently had failures that were fixed by going from WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=10 to 20 on hera. @jiandewang do you have a log file? Otherwise the failure will have to be reproduced. @GeorgeVandenberghe-NOAA the 10/20 is referring to WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS.
Well, that answers my question. WCOSS2 should still use a value that prevents node sharing within groups and between a group and other components. hdf5/1.14.1 relaxes this constraint, but I haven't been able to test (forbidden by policy) whether that completely eliminates it until hdf5/1.14.1 is generally available.
Discussed it with Rahul. Let's go with this to get HR2 in.
@WalterKolczynski-NOAA and @aerorahul -- I'll make this update now!
Removed the hera exception and added a note that 10 is needed for WCOSS2.
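For reference, a sketch of what the updated parm/config/gfs/config.ufs stanza might look like after this change; the comment wording here is illustrative, not the exact text committed:
# NOTE: C768 needs 20 here to fit in memory on Hera, but the write component
# hangs on WCOSS2 with 20; use 10 on WCOSS2 until the underlying issue
# (possibly addressed by hdf5/1.14.1) is resolved.
export WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20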
It's assigned by the run script that writes model_configure. ESMF threads make it trickier, but basically the write group size should be an integral factor or multiple of the number of ranks per node requested.
(In reply to Walter Kolczynski - NOAA, who wrote: "20*4*6 = 480, so the write group size itself should fit in a whole number of nodes on WCOSS2. I don't know that it necessarily would be assigned such by UFS though.")
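As an illustration of that rule of thumb, here is a hypothetical bash check (not part of the workflow; the values are the ones quoted above) that flags whether a proposed group size lines up with the node size:
group_ranks=64        # MPI ranks per write group (a value reported above to work well on WCOSS2)
ranks_per_node=32     # MPI ranks per node actually requested
# Rule of thumb: group size should be an integral factor or multiple of the
# ranks per node, so group boundaries line up with node boundaries.
if (( group_ranks % ranks_per_node == 0 || ranks_per_node % group_ranks == 0 )); then
  echo "OK: ${group_ranks} ranks/group is a factor or multiple of ${ranks_per_node} ranks/node"
else
  echo "WARNING: ${group_ranks} ranks/group does not align with ${ranks_per_node} ranks/node"
fi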
Thank you!!!! @WalterKolczynski-NOAA can this be tagged for HR2? Let me know if you want me to make an issue for that request, I'd be happy to.
Already on it. Just need to hop on a machine and clone so I can make the tag. Will be
Thank you @WalterKolczynski-NOAA !!!
Description
This PR has the final updates for HR2. There was one added variable for diagnostic output:
iopt_diag=3
as required by the land team. This also points to HR2 updates for the initial conditions, which update the land states in sfc* files compared to HR1. Lastly, it was determined that the setting added to enable running on hera,
WRTTASK_PER_GROUP_PER_THREAD_PER_TILE_GFS=20
will actually fail on WCOSS2. While not ideal, a setting that would run out of the box on both hera and wcoss2 for C768 could not be found; therefore a comment was added noting the need for a different setting on WCOSS2 until a solution can be found.
Resolves #1500 (although note this was technically already completed and this PR itself does not update the ufs-weather-model hash)
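For illustration only, the new option would typically appear in the FV3 input.nml physics namelist roughly as below, assuming it sits alongside the other Noah-MP iopt_* switches in gfs_physics_nml; where the workflow actually templates it may differ:
&gfs_physics_nml
  ! ... other Noah-MP iopt_* options unchanged ...
  iopt_diag = 3
/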
Type of change
Change characteristics
How has this been tested?
Checklist