Add netCDF PIO capability for restarts and run-time history for dev/ufs-weather-model #1303

DeniseWorthen · 2024-09-24T15:13:35Z

Pull Request Summary

Enable netCDF restarts and run-time netCDF history using PIO.

Description

PIO configuration

PIO is configured via configuration settings in WAV attributes for UFS and via share code for CESM. Defaults are provided for the stride, number of io tasks and rearranger settings. The PIO subsystem is managed and initialized in wav_pio_mod.F90

netCDF restarts

PIO/netCDF restarts are enabled by using use_restartnc=true in the configuration attributes, otherwise native WW3 binary restarts will be written. If netCDF restarts are enabled, an additional option exists to restart from an existing binary restart (restart_from_binary=true). In this case, the binary restart filename needs the same form as the netCDF restarts (minus the .nc at the end).

With netCDF restarts enabled, restarts will be written at the rstwr frequency defined in the cap, which will allow both the flexible (ie, non-interval) restarts and end-of-run restarts to be enabled in the same way as for other coupled components in UFS. These two features are not implemented in this work.

The netCDF restarts are read and written in a new routine wav_restart_mod.F90. Only two fields (va and mapsta) are written to the restart file by default. As noted in the comment in #1298, when waves are in the slow-coupling loop, the ice field is also written. The addition of extra restart fields is generalized via the already-existing restart%extra namelist.

netCDF history

Runtime netCDF history is enabled by using use_historync=true in the configuration attributes, otherwise native WW3 binary history files will be written. The PIO/netCDF capability builds on the existing serial netCDF capability in the mesh cap. The two existing routines used for this (w3iogoncdmd.F90, wav_grdout.F90) have been reduced to a new single netCDF history module wav_history_mod.F90 containing the same functionality.

History files can be custom named (if user_histname = true), otherwise they will be named in the native WW3 file naming. History files will be written at the histwr frequency. This frequency will be set using the normal history_n,history_option settings, if present. Otherwise, the field%stride frequency will be used to determine the histwr frequency. (This requires that the stride be set in units of seconds, which is less flexible than enabling the history_n,history_option settings.)

Removal of mesh-cap specific modifications to w3wavemd, w3iorsmd and w3iogomd..

Modifications to the three listed routines were required for the mesh cap to allow for additional flexibility of restart and history writing. The changes in w3iogomd were resolved w/ the PR to dev/ufs-weather-model for the Langmuir turbulence calculations (#1034). With the new restartnc and historync options via PIO, the changes to the other two routines (wavemd and iorsmd) are no longer required, meaning these two routines are now much more aligned with the same code in the develop branch. Since the dev/ufs-weather-model branch has not been updated for nearly a year, the modifications are best seen by comparing these two routines in this PR branch against the develop branch at 4d8c315

Performance Testing

To test the performance of the new PIO/netCDF restarts vs the existing binary restarts, a series of runs were made using a DATM + 1/4 deg MOM6+CICE6 and WW3 utilizing the uglo_15km-hr4_BlkSlake_360 mesh (~1.8M spatial elements) with 1800 frequencies + directions. Although this PR also enables PIO to write the run-time netCDF history files (vs the current serial netCDF implementation), no performance testing was done for the history writes.

Cases were run for 24 hours, with all component output turned off except for WW3 restarts, which were written every 2 hours. A range of resources was tested for WW3, from ~400 to ~2800. The primary metric used for evaluation was the mean time for a Model Advance on non-restart intervals vs restart intervals (the restart-write "penalty"). Because WW3 is cold-starting, the first 6 hours of integration was ignored.

The primary configuration options for PIO are the stride (and/or number of IO tasks) and the rearranger. In addtion to these options, two options were tested for writing the VA field---either as a single array, dimensioned (1.8M,1800) or with each spectral index written to a separate, named field (e.g. va.0001 through va.1800). This "multi-field" option creates an identically sized restart file and has been also tested in UWM, where it passes all tests. Thanks go to @jedwards4b for helping me get the single array working for subset rearranger and the operational mesh, which allowed me to test the single field vs multi-field options.

Initial scout runs were made to identify the best combination of stride and rearranger and va-array structure. At all WAV tasks tests, the best results were consistently found using pio_stride=4, pio_rearranger=subset and with the VA field written as multiple fields. The finding that stride=4 was a little unexpected, since by default, as resources increase the stride is also set to increase. I think that because we are under-subscribing the nodes in order to provide enough memory to WW3, this acts as a sort of de-facto striding and stride=4 (and even stride=2) can be used even at high task counts.

Once the best combination of parameters were found for PIO, cases were run for a total of 3 times. The binary case was also run three times and all runs were averaged. The raw results can be found here

The summary results are presented below.

Issue(s) addressed

fixes Add netcdf restart and history files using PIO (parallel netCDF) for dev/ufs-weather-model #1298

Commit Message

Check list

Branch is up to date with the authoritative repository (NOAA-EMC) develop branch.
Checked the checklist for a developer submitting to develop.
If a version number update is required, checked the updating version number checklist.
If a new feature was added, a regression test for testing the new feature is added.

Testing

How were these changes tested?

Using UFS-WM, a baseline was created and verified to pass for all tests which include a WAV component. Full testing as well as scalability results will be documented in the associated UFS-WM PR.

* remove wav_grdout routine, now moved into wav_history_mod

* remove more cap stuff from w3iorsmd. only ww3 native filenaming is possible w/ binary restarts * remove ifdef w3_ascii from w3wavemd, since the ascii commit is not yet present in mesh cap branch * clean up config variable logic for filenaming

* nrqrs can be non-zero if also using the restart from binary option

* flpart needs to be set for either historync or not

* move restart and history nc blocks outside of ww3 time testing block. * add log print messages for reading and writing restarts

* get logging working correctly for ufs * fix noclobber variable and file name in wav_history * clarify some comments

* make binary history files match when restartnc=true * rework wav_restart_mod, which originally was designed to be able to read and write restarts for testing purposes from inside wav_comp_nuopc.

* verboselog is true by default, but can be set false by config * the header text for ww3 logging as it steps through time is now turned off in w3init and placed into the mesh cap. this allows the mesh cap to order the logging correctly

* move block where addrstflds was set to before call to w3init since restarts are read in w3init * ensure that if nml file lacks a specification of extra fields, the default value of "unset" will not be returned as a field name

* only ice is added for now * tab cleanup in w3grid

* need to send explicit array bounds for ice array since it is 0:nsea * all baselines b4b against f9531d0

* intialize floutg and floutg2 which are can be unintialized when waves are in slow loop and historync is true

DeniseWorthen · 2024-09-30T13:50:52Z

@MatthewMasarik-NOAA Can I ask if you've tested the mesh you provided me in debug mode?

When using the dev/ufs-weather-model branch, I am getting a failure in debug mode at LN 1388 in w3init

WW3/model/src/w3initmd.F90

Lines 1384 to 1398 in 7f548c7

    
                 DO IK=0, NK+1 
        
                   ! 
        
                   !         Calculate wavenumbers and group velocities. 
        
           #ifdef W3_PDLIB 
        
                   CALL WAVNU3(SIG(IK),DEPTH,WN(IK,IS),CG(IK,IS)) 
        
           #else 
        
                   CALL WAVNU1(SIG(IK),DEPTH,WN(IK,IS),CG(IK,IS)) 
        
           #endif 
        
                   ! 
        
           #ifdef W3_T1 
        
                   WRITE (NDST,9052) IK, TPI/SIG(IK), WN(IK,IS), CG(IK,IS) 
        
           #endif 
        
                   ! 
        
                 END DO 
        
               END DO

MatthewMasarik-NOAA · 2024-09-30T14:13:41Z

Hi @DeniseWorthen, I have not tested this mesh in either standard or debug mode. I was provided this HR4 mesh from @AliS-Noaa and @JessicaMeixner-NOAA who have done the unstructured grid creation, and could probably give that information.

You had mentioned offline that at one point that Ali had told you a grid was passing in standalone, but not in coupled. I had not been briefed on that. Do you know if anyone followed up on that at the time? Or can you confirm if this new mesh is the same as the one you were discussing with Ali?

DeniseWorthen · 2024-09-30T14:26:19Z

@MatthewMasarik-NOAA The remark about failing in coupled mode was from an informal lunch-time discussion I had w/ Ali, some time ago, where he simply mentioned in passing what he was working on. It was hardly a "briefing" I had on the subject.

I have only the mesh you provided me, as well as one which Jiande gave me which he is currently using for HR4.

MatthewMasarik-NOAA · 2024-09-30T14:29:17Z

@MatthewMasarik-NOAA The remark about failing in coupled mode was from an informal lunch-time discussion I had w/ Ali, some time ago, where he simply mentioned in passing what he was working on. It was hardly a "briefing" I had on the subject.

I have only the mesh you provided me, as well as one which Jiande gave me which he is currently using for HR4.

@DeniseWorthen, understood. It was more than has been shared with me so you know more than I do at this point.

JessicaMeixner-NOAA · 2024-09-30T14:31:35Z

I think we might be getting meshes confused as well. HR4 was supposed to originally have an update and use the new mesh from @AliS-Noaa - however, due to time it's using the same as HR3. A new HR with updated wave grids will happen. @sbanihash I think @DeniseWorthen has pointed out an oversight from us that we should test this new mesh with debug mode on - would you be able to assign someone to run the new mesh in debug mode?

DeniseWorthen · 2024-10-15T13:26:44Z

@DeniseWorthen Let me clarify if I understand correctly. By "max time" I mean the "Max" time in ESMF_Profile.summary, usually that time gives the total run time of a model component.

When Gaea-C5 comes back from maintenance, I will add that information.

CatherineThomas-NOAA · 2024-10-15T14:39:49Z

Thanks for the tag @DeniseWorthen. Tagging @RuiyuSun as well.

sbanihash · 2024-10-15T15:31:20Z

@DeniseWorthen awesome work! This is very exciting. Thanks for adding the run details to the PR description.

junwang-noaa · 2024-10-15T19:41:58Z

@DeniseWorthen Is your branch https://github.com/DeniseWorthen/WW3/tree/feature/pio4ww3 ready for testing in HR4? If yes, @GeorgeVandenberghe-NOAA can you use Denise's branch to test ww3 IO performance in HR4? Thanks

DeniseWorthen · 2024-10-15T20:02:44Z

@junwang-noaa I am putting in a final change, which is to error out if the dimensions in the netcdf restart file don't match. It should be finished by today.

There will need to be changes in ufs.configure in order for George to test. To tell you the right settings, I need to know whether HR4 will have a binary WW3 restart available to warm start or is it a cold start?

* tested by using nco to alter the nth,nk values in a restart file. The model correctly aborted after finding nth did not match the expected value

jkbk2004 · 2024-10-22T15:10:51Z

@MatthewMasarik-NOAA we are about to merge ufs-community/ufs-weather-model#2463. We consider to work on this PR after the PR 2463. Let me know if you need more time to review this PR.

MatthewMasarik-NOAA · 2024-10-22T15:53:58Z

@MatthewMasarik-NOAA we are about to merge ufs-community/ufs-weather-model#2463. We consider to work on this PR after the PR 2463. Let me know if you need more time to review this PR.

Hi @jkbk2004. I have the tests going on hera right now, and am working on the code review. It is a significant PR though, so I could use a little more time to review if that's OK for your schedule.

* delay the deallocation of gindex arrays until after the diagnose_mesh routine is called. This only impacts cases where dbug_flag>5, compiled with gnu+debug

MatthewMasarik-NOAA · 2024-10-22T22:28:43Z

The RTs just came back and here is the test_changes.list. The tests were run before the latest commit, and from a quick look at the list, the changes might all be explained by that. I'll have a closer look tomorrow.

DeniseWorthen · 2024-10-22T22:44:51Z

The test_changes.list is not related to the latest commit---that fixes only an issue if dbug_flag>5 and dbug_flag=0 for all tests.

The test_changes.list is because we do not compare ww3 restart files for any cpld test, because ww3 restart files do not reproduce. All of tests should report the reason for failure is that the restart file (prefix.ww3.r.date.nc) does not exist in the baseline. Let me know if that is not the case.

MatthewMasarik-NOAA · 2024-10-22T22:51:16Z

The test_changes.list is not related to the latest commit---that fixes only an issue if dbug_flag>5 and dbug_flag=0 for all tests.

The test_changes.list is because we do not compare ww3 restart files for any cpld test, because ww3 restart files do not reproduce. All of tests should report the reason for failure is that the restart file (prefix.ww3.r.date.nc) does not exist in the baseline. Let me know if that is not the case.

Okay, thanks for that information. I suppose I should re-run the tests at any rate after the new commit. Do you anticipate any more commits?

DeniseWorthen · 2024-10-22T23:29:33Z

I will run against the UWM baseline on hercules overnight and post the test_changes.list.

MatthewMasarik-NOAA · 2024-10-23T20:19:43Z

I ran the RTs last night on hera RegressionTests_hera.log, test_changes.list. Most on the change list are failures due to comparision of wave restarts (which do not pass), as @DeniseWorthen pointed out. A few are 'restart' tests that failed to start, so that is expected as well. There are only two changes then remaining: cpld_control_gfsv17_iau and cpld_mpi_pdlib_p8_intel. The cpld_control_gfsv17_iau is likely due to the reported issue. So cpld_mpi_pdlib_p8_intel is the only changed test that I could not attribute a reason for the failure, and the reason given was UNABLE TO START TEST. @DeniseWorthen I wanted to check with you to see if you are seeing the same thing, and if there is a reason for the cpld_mpi_pdlib_p8_intel test to fail that I overlooked or wasn't aware of?

DeniseWorthen · 2024-10-23T20:40:21Z

I think the issue there is that currently in the rt.conf, cpld_mpi_pdlib_p8 (mistakenly) is dependent on cpld_control_pdlib_p8. So if that test does not complete/pass, the mpi test won't run.

The iau test did not run for the same reason. The test it is dependent on (cpld_control_gfsv17) did not pass, so it did not run.

MatthewMasarik-NOAA · 2024-10-23T21:13:09Z

I think the issue there is that currently in the rt.conf, cpld_mpi_pdlib_p8 (mistakenly) is dependent on cpld_control_pdlib_p8. So if that test does not complete/pass, the mpi test won't run.

The iau test did not run for the same reason. The test it is dependent on (cpld_control_gfsv17) did not pass, so it did not run.

Great, thank you for giving that clarification. I'm okay with the testing then. I'll work on finishing the code review today. If everything goes well it should be ready late today or early tomorrow. Fyi @jkbk2004

jkbk2004 · 2024-10-24T14:22:40Z

@MatthewMasarik-NOAA all sounds good. we can go ahead to commit for full testing this afternoon. @DeniseWorthen can you attach hera pre-test log and push test_changes.list at ufs-community/ufs-weather-model#2445 ?

DeniseWorthen · 2024-10-24T14:39:23Z

@jkbk2004 The test_changes list was generated on Hercules and posted at b5ae65b. However, the WW3 PR has still not been approved, so we should not schedule it for commit until it is.

MatthewMasarik-NOAA

Code review
Pass

Testing
Pass

See logs attached to #1303 (comment) and description in #1303 (comment) as well.

Approved.

MatthewMasarik-NOAA · 2024-10-24T15:21:52Z

Fyi @jkbk2004 @DeniseWorthen

thanks for your patience

FernandoAndrade-NOAA · 2024-11-14T18:23:47Z

Testing on #2445 has completed, please continue with the merge process.

MatthewMasarik-NOAA · 2024-11-14T18:30:18Z

Testing on #2445 has completed, please continue with the merge process.

Will do.

MatthewMasarik-NOAA · 2024-11-14T19:01:45Z

@DeniseWorthen thank you for your hard work developing and testing this very fine upgrade to WW3.

…fs-weather-model (NOAA-EMC#1303) (#2) Co-authored-by: Denise Worthen <[email protected]>

DeniseWorthen and others added 22 commits August 28, 2024 08:06

clean branch for pio work

31d7f36

add new files

0fe584a

rename w3iogoncmd as wav_history_mod

099645b

rework order to enable PIO initialization

c7a9ee3

* remove wav_grdout routine, now moved into wav_history_mod

prevent mpi_startalls if restartnc is active

efd5774

* nrqrs can be non-zero if also using the restart from binary option

move setting of flpart flag

3b038ff

* flpart needs to be set for either historync or not

remove unneeded CESMCOUPLED from w3iorsmd

6e32b5f

update nc block location, add log messages

9976c4d

* move restart and history nc blocks outside of ww3 time testing block. * add log print messages for reading and writing restarts

use pio_ioformat for create mode

32c5f25

make iosys a pointer, add debug level config

4edfbc6

fix ww3 thrd cases

fbeb83e

remove stray bin2nc ifdef

5aedf8c

fixes for logs and wav_history

4f62341

* get logging working correctly for ufs * fix noclobber variable and file name in wav_history * clarify some comments

resolve issue for binary history files

553f1b1

* make binary history files match when restartnc=true * rework wav_restart_mod, which originally was designed to be able to read and write restarts for testing purposes from inside wav_comp_nuopc.

fix multiple writes when waves are in slow loop

8d0f170

rename oldst2 variable in wav_restart

9cb49d1

add verbose log option

4674dae

* verboselog is true by default, but can be set false by config * the header text for ww3 logging as it steps through time is now turned off in w3init and placed into the mesh cap. this allows the mesh cap to order the logging correctly

add optional restart files via nml

7c43376

resolve issue w/ nml for extra fields

f9531d0

* move block where addrstflds was set to before call to w3init since restarts are read in w3init * ensure that if nml file lacks a specification of extra fields, the default value of "unset" will not be returned as a field name

add generic read/write for added restart fields

ad51cd6

* only ice is added for now * tab cleanup in w3grid

fix generic read/write in wav_restart

d92e9ff

* need to send explicit array bounds for ice array since it is 0:nsea * all baselines b4b against f9531d0

DeniseWorthen mentioned this pull request Sep 24, 2024

Update WW3 for PIO/netCDF restarts ufs-community/ufs-weather-model#2445

Merged

14 tasks

DeniseWorthen changed the title ~~Add netCDF PIO capability for restarts and run-time history~~ Add netCDF PIO capability for restarts and run-time history for dev/ufs-weather-model Sep 24, 2024

initialize variables

e0134de

* intialize floutg and floutg2 which are can be unintialized when waves are in slow loop and historync is true

add error checks for dimensions in restart file

6f28c27

* tested by using nco to alter the nth,nk values in a restart file. The model correctly aborted after finding nth did not match the expected value

DeniseWorthen mentioned this pull request Oct 15, 2024

Flexible restart write times (restart_fh) for med, ocn, ice, wav ufs-community/ufs-weather-model#2419

Open

14 tasks

DeniseWorthen marked this pull request as ready for review October 15, 2024 21:29

fix deallocation issue when dbug_flag>5

06c2099

* delay the deallocation of gindex arrays until after the diagnose_mesh routine is called. This only impacts cases where dbug_flag>5, compiled with gnu+debug

MatthewMasarik-NOAA approved these changes Oct 24, 2024

View reviewed changes

DeniseWorthen mentioned this pull request Nov 8, 2024

remove unneeded fields for dev/ufs-weather-model #1317

Draft

4 tasks

MatthewMasarik-NOAA merged commit abe1e79 into NOAA-EMC:dev/ufs-weather-model Nov 14, 2024
2 of 20 checks passed

sbanihash mentioned this pull request Nov 15, 2024

Add netCDF PIO capability for restarts and run-time history for dev/u… sbanihash/WW3#2

Merged

4 tasks

sbanihash added a commit to sbanihash/WW3 that referenced this pull request Nov 15, 2024

Add netCDF PIO capability for restarts and run-time history for dev/u…

39be107

…fs-weather-model (NOAA-EMC#1303) (#2) Co-authored-by: Denise Worthen <[email protected]>

JessicaMeixner-NOAA mentioned this pull request Nov 18, 2024

WW3 cannot build pre-and-post processing jobs ufs-community/ufs-weather-model#2502

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add netCDF PIO capability for restarts and run-time history for dev/ufs-weather-model #1303

Add netCDF PIO capability for restarts and run-time history for dev/ufs-weather-model #1303

DeniseWorthen commented Sep 24, 2024 •

edited

Loading

DeniseWorthen commented Sep 30, 2024

MatthewMasarik-NOAA commented Sep 30, 2024

DeniseWorthen commented Sep 30, 2024

MatthewMasarik-NOAA commented Sep 30, 2024

JessicaMeixner-NOAA commented Sep 30, 2024

DeniseWorthen commented Oct 15, 2024

CatherineThomas-NOAA commented Oct 15, 2024

sbanihash commented Oct 15, 2024

junwang-noaa commented Oct 15, 2024

DeniseWorthen commented Oct 15, 2024

jkbk2004 commented Oct 22, 2024

MatthewMasarik-NOAA commented Oct 22, 2024 •

edited

Loading

MatthewMasarik-NOAA commented Oct 22, 2024

DeniseWorthen commented Oct 22, 2024

MatthewMasarik-NOAA commented Oct 22, 2024

DeniseWorthen commented Oct 22, 2024

MatthewMasarik-NOAA commented Oct 23, 2024

DeniseWorthen commented Oct 23, 2024

MatthewMasarik-NOAA commented Oct 23, 2024

jkbk2004 commented Oct 24, 2024

DeniseWorthen commented Oct 24, 2024 •

edited

Loading

MatthewMasarik-NOAA left a comment

MatthewMasarik-NOAA commented Oct 24, 2024

FernandoAndrade-NOAA commented Nov 14, 2024

MatthewMasarik-NOAA commented Nov 14, 2024

MatthewMasarik-NOAA commented Nov 14, 2024

Add netCDF PIO capability for restarts and run-time history for dev/ufs-weather-model #1303

Add netCDF PIO capability for restarts and run-time history for dev/ufs-weather-model #1303

Conversation

DeniseWorthen commented Sep 24, 2024 • edited Loading

Pull Request Summary

Description

Performance Testing

Issue(s) addressed

Commit Message

Check list

Testing

DeniseWorthen commented Sep 30, 2024

MatthewMasarik-NOAA commented Sep 30, 2024

DeniseWorthen commented Sep 30, 2024

MatthewMasarik-NOAA commented Sep 30, 2024

JessicaMeixner-NOAA commented Sep 30, 2024

DeniseWorthen commented Oct 15, 2024

CatherineThomas-NOAA commented Oct 15, 2024

sbanihash commented Oct 15, 2024

junwang-noaa commented Oct 15, 2024

DeniseWorthen commented Oct 15, 2024

jkbk2004 commented Oct 22, 2024

MatthewMasarik-NOAA commented Oct 22, 2024 • edited Loading

MatthewMasarik-NOAA commented Oct 22, 2024

DeniseWorthen commented Oct 22, 2024

MatthewMasarik-NOAA commented Oct 22, 2024

DeniseWorthen commented Oct 22, 2024

MatthewMasarik-NOAA commented Oct 23, 2024

DeniseWorthen commented Oct 23, 2024

MatthewMasarik-NOAA commented Oct 23, 2024

jkbk2004 commented Oct 24, 2024

DeniseWorthen commented Oct 24, 2024 • edited Loading

MatthewMasarik-NOAA left a comment

Choose a reason for hiding this comment

MatthewMasarik-NOAA commented Oct 24, 2024

FernandoAndrade-NOAA commented Nov 14, 2024

MatthewMasarik-NOAA commented Nov 14, 2024

MatthewMasarik-NOAA commented Nov 14, 2024

DeniseWorthen commented Sep 24, 2024 •

edited

Loading

MatthewMasarik-NOAA commented Oct 22, 2024 •

edited

Loading

DeniseWorthen commented Oct 24, 2024 •

edited

Loading