
Update UFS to latest #1889

Conversation

JessicaMeixner-NOAA
Contributor

Description

This updates the model to the latest version of ufs-weather-model, including config file changes that follow the regression test updates: ufs-community/ufs-weather-model@GFSv17.HR2...f7a94ce. Some of these changes are described in #1811.

Resolves #1811

Type of change

  • Maintenance (code refactor, clean-up, new CI test, etc.): update model

Change characteristics

  • Is this a breaking change (a change in existing functionality)? No
  • Does this change require a documentation update? No

How has this been tested?

Working on testing, posting this now so that others can check updates.

Checklist

  • Any dependent changes have been merged and published
  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing tests pass with my changes
  • I have made corresponding changes to the documentation if necessary

@JessicaMeixner-NOAA
Contributor Author

@DeniseWorthen can you check the updates here that are required and mentioned in the issue you created #1811

@rmontuoro there were two GOCART updates here, can you check if these are correct. Are there other config updates that might have been missed?

@JessicaMeixner-NOAA
Contributor Author

I'll be leaving this as a draft until (1) I get confirmation from Denise and the aerosols team about the updates made, and (2) I can resolve the module issues that make the wave init job fail.

Right now I'm getting the error:
/lfs/h2/emc/couple/noscrub/jessica.meixner/epwave/global-workflow/exec/ww3_grid: error while loading shared libraries: libhdf5_hl.so.310: cannot open shared object file: No such file or directory

I've seen quite a few module updates be needed for various reasons; I think this specific one is the hdf5 version in the model not being the same as what's being loaded at run time. Any thoughts on the best workaround for this specific module issue, @aerorahul or @WalterKolczynski-NOAA? Should I try to update HDF5 in versions/run.wcoss2.ver, or do you prefer a different solution?

@WalterKolczynski-NOAA
Contributor

Yes, the various version files should be updated if the mismatch is causing an issue. If you run the coupled atm-only, I can run a cycled test on WCOSS2 if you'd like when it is ready. CI will take care of Hera and Orion.

@JessicaMeixner-NOAA
Contributor Author

@rmontuoro @bbakernoaa I got past some of the module issues and have now run into a MAPL failure on wcoss2.

The error log can be found in full here: /lfs/h2/emc/couple/noscrub/jessica.meixner/epwrk/test04/COMROT/test04/logs/2021032312/gfsfcst.log.0

The error is:

nid001023.cactus.wcoss2.ncep.noaa.gov 0: PASS: fcstRUN phase 1, n_atmsteps =               17 time is         5.030901
nid001023.cactus.wcoss2.ncep.noaa.gov 0: UFS Aerosols: Advancing from 2021-03-23T17:40:00 to 2021-03-23T18:00:00
nid001023.cactus.wcoss2.ncep.noaa.gov 0:
nid001023.cactus.wcoss2.ncep.noaa.gov 0:  Writing:     28 Slices to File:  gocart.inst_aod.20210323_1800z.nc4
nid001023.cactus.wcoss2.ncep.noaa.gov 0: pe=00000 FAIL at line=00187    NetCDF4_FileFormatter.F90                <status=13>
pe=00000 FAIL at line=00061    HistoryCollection.F90                    <status=13>
pe=00000 FAIL at line=00790    ServerThread.F90                         <status=13>
pe=00000 FAIL at line=00138    BaseServer.F90                           <status=13>
pe=00000 FAIL at line=00981    ServerThread.F90                         <status=13>
pe=00000 FAIL at line=00094    MessageVisitor.F90                       <status=13>
pe=00000 FAIL at line=00113    AbstractMessage.F90                      <status=13>
pe=00000 FAIL at line=00107    SimpleSocket.F90                         <status=13>
pe=00000 FAIL at line=00429    ClientThread.F90                         <status=13>
pe=00000 FAIL at line=00363    ClientManager.F90                        <status=13>
pe=00000 FAIL at line=03524    MAPL_HistoryGridComp.F90                 <status=13>
pe=00000 FAIL at line=01818    MAPL_Generic.F90                         <status=13>
pe=00000 FAIL at line=01284    MAPL_CapGridComp.F90                     <status=13>
pe=00000 FAIL at line=01213    MAPL_CapGridComp.F90                     <status=13>
pe=00000 FAIL at line=01159    MAPL_CapGridComp.F90                     <status=13>
pe=00000 FAIL at line=00827    MAPL_CapGridComp.F90                     <status=13>
pe=00000 FAIL at line=00967    MAPL_CapGridComp.F90                     <status=13>
nid001023.cactus.wcoss2.ncep.noaa.gov 0: MPICH ERROR [Rank 0] [job id cc2d288f-abf7-43ad-84f0-4d2628fa5a5c] [Wed Sep 27 15:14:57 2023] [nid001023] - Abort(1) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 0

nid001023.cactus.wcoss2.ncep.noaa.gov 0: forrtl: severe (174): SIGSEGV, segmentation fault occurred

Any chance you can either help, or point me to someone who could help with this error? You can replicate what I did by running this branch (all changes have been pushed); this describes the set-up: /lfs/h2/emc/couple/noscrub/jessica.meixner/epwave/global-workflow/workflow/coupled.sh

It could just be a configuration setting that needs to be updated, as I'm unsure whether the changes I made were correct. It could also be a module issue, although I don't expect that for the forecast job since I believe it's sourcing the ufs model modules. In the meantime I will continue by running on hera and/or without aerosols.

@JessicaMeixner-NOAA
Contributor Author

I'm running into the same module problems on hera as I did on wcoss2 for the wave init jobs, which isn't super surprising. That said, I also see PR #1882, which also has module issues, but their fix is more job-specific because of some known issues that it looks like I'll also run into. So I'm wondering whether changing the module versions is actually the way to go. @aerorahul, any chance you have thoughts on this before I continue?

@JessicaMeixner-NOAA
Contributor Author

Tagging @lipan-NOAA to see if he can help confirm the aerosol configuration and diagnose the aerosol-related error mentioned above.

@bbakernoaa
Contributor

@JessicaMeixner-NOAA it looks like you need to delete the gocart.inst_aod*.nc files. If these files exist during model execution, it will crash.
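A minimal sketch of that cleanup, run before the model starts (the `clean_stale_gocart` function name is illustrative; `${DATA}` is assumed to be the forecast run directory, per the workflow's conventions):

```shell
#!/usr/bin/env bash
# Sketch: remove any stale GOCART AOD files from the run directory before
# the model starts, since pre-existing gocart.inst_aod*.nc* files crash MAPL.
set -u

clean_stale_gocart() {
  local rundir=$1
  # -f so the command succeeds even when no stale files exist
  rm -f "${rundir}"/gocart.inst_aod*.nc*
}

# ${DATA} (assumed name) is the forecast run directory
clean_stale_gocart "${DATA:-.}"
```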

@JessicaMeixner-NOAA
Contributor Author

Okay let me try a clean test run and see if I still get this error.

@JessicaMeixner-NOAA
Contributor Author

@bbakernoaa's offline suggestion to remove the linking of the gocart.inst_aod.* netcdf files solved the MAPL issue. MAPL apparently doesn't like those files to be linked.

Meanwhile, on hera the forecast job fails (even for S2S) because the new module updates in the ufs model mean that prod_util can no longer be loaded. So I'm looking for a compatible prod_util module to load... so far no luck.

@JessicaMeixner-NOAA
Contributor Author

The prod_util issue that I'm running into has an issue created here: JCSDA/spack-stack#780

@JessicaMeixner-NOAA
Contributor Author

Okay - so this works on WCOSS2 right now. But to have this work on all machines, we need the wave tasks to take the same approach the forecast job does here: https://github.com/NOAA-EMC/global-workflow/blob/develop/jobs/rocoto/fcst.sh#L11-L58 It's possible the wcoss2 module versions would also need to be walked back so as not to encounter unexpected issues elsewhere.
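The fcst.sh approach amounts to loading the model's own modulefiles rather than module_base. A rough sketch for a wave job, where the modulefile path, naming convention, and both function names are assumptions based on the repo layout, not verified:

```shell
#!/usr/bin/env bash
# Sketch: have a wave job load the ufs-weather-model modulefile the same
# way jobs/rocoto/fcst.sh does, instead of sourcing module_base.
set -u

# Pure helper: build the expected modulefile path (path layout assumed).
ufs_modulefile() {
  local homegfs=$1 machine=$2 compiler=${3:-intel}
  echo "${homegfs}/sorc/ufs_model.fd/modulefiles/ufs_${machine}.${compiler}.lua"
}

# Load the model's modulefile if present; otherwise signal the caller to
# fall back to the workflow's usual module setup.
load_ufs_modules() {
  local homegfs=$1 machine=$2
  local modfile
  modfile=$(ufs_modulefile "${homegfs}" "${machine}")
  if [[ -f "${modfile}" ]]; then
    module use "${homegfs}/sorc/ufs_model.fd/modulefiles"
    module load "ufs_${machine}.intel"
  else
    echo "WARNING: ${modfile} not found; using module_base instead" >&2
    return 1
  fi
}
```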

Also note that the linking of the gocart nc files was commented out; they should be copied at the end of this job (or try the newer version of gocart, which should be merged soon and may or may not have the same issue).

@JessicaMeixner-NOAA
Contributor Author

I've made some progress but am now getting issues on orion (at minimum) with wave post sbs because WGRIB2 is not defined; I get this even when I load the wgrib2 module. I think it's because WGRIB2 needs to be set explicitly, which I'm trying now. Also, the point job seemed to take significantly longer on orion; given that orion occasionally has file system issues, that's likely not surprising, but it's something to watch on the other platforms as I continue testing.
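Setting WGRIB2 explicitly could look like this sketch (the `ensure_wgrib2` helper is hypothetical; it just falls back to whatever executable the loaded module put on PATH):

```shell
#!/usr/bin/env bash
# Sketch: make sure $WGRIB2 points at an executable before the wave post
# jobs use it. The module puts wgrib2 on PATH but may not export WGRIB2.
set -u

ensure_wgrib2() {
  if [[ -z "${WGRIB2:-}" ]]; then
    # fall back to whatever the loaded module put on PATH
    WGRIB2=$(command -v wgrib2 || true)
  fi
  if [[ -z "${WGRIB2}" ]]; then
    echo "FATAL: WGRIB2 is not defined and wgrib2 is not on PATH" >&2
    return 1
  fi
  export WGRIB2
}
```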

@WalterKolczynski-NOAA
Contributor

$WGRIB2 is being defined in module_base.orion, so it should be set for all the jobs except the forecast and new DA jobs (they currently bypass module_base). And development doesn't have an issue running wave post (or any other job using wgrib2). I don't see any changes that would break that, but maybe you have ones that haven't been pushed yet? Please let me know if I can help troubleshoot.

@JessicaMeixner-NOAA
Contributor Author

> $WGRIB2 is being defined in module_base.orion, so it should be set for all the jobs except the forecast and new DA jobs (they currently bypass module_base). And development doesn't have an issue running wave post (or any other job using wgrib2). I don't see any changes that would break that, but maybe you have ones that haven't been pushed yet? Please let me know if I can help troubleshoot.

@WalterKolczynski-NOAA it's because I'm not using module_base: it doesn't use the latest HDF5, so I'm using the ufs-weather-model modules instead. It's my understanding that simply updating everything to the latest HDF5 is not the way to go and will likely have unintended consequences. I got past the WGRIB2 issue, but am now having other issues with MPMD jobs on WCOSS2. Since this branch is "working" and some might be using it, I've been putting my latest updates here: https://github.com/jessicameixner-noaa/global-workflow/tree/feature/updateufsstack15

@JessicaMeixner-NOAA
Contributor Author

@WalterKolczynski-NOAA I'll run a fresh case on WCOSS2 tomorrow and see what error I get - I forgot I was in the process of re-cloning and building there on Friday. Help would be great. I think the other branch https://github.com/jessicameixner-noaa/global-workflow/tree/feature/updateufsstack15 is running on both hera and orion (although the wave point job runs really long and needs extra time), but I can't get wcoss2 to work. Let me know if you'd rather I send error messages/help requests here or via a different issue/venue. Thanks for the offer to help!

@WalterKolczynski-NOAA
Contributor

Specific things to look at would be appreciated, rather than me running off to run it independently.

For HDF5, if the version file is updated that should theoretically be applied to everything anyway, as that is the point to having the versions file.

@JessicaMeixner-NOAA
Contributor Author

Waiting for development to open on wcoss2. I'm no longer updating the version files; instead I'm trying to have the wave jobs use the ufs-weather-model modules plus whatever else is needed. This is being done in this branch: https://github.com/jessicameixner-noaa/global-workflow/tree/feature/updateufsstack15

@JessicaMeixner-NOAA
Contributor Author

@WalterKolczynski-NOAA - I finally got a fresh install and test on dogwood and was able to get past my last error. I don't have everything running yet but am at least making progress again. I'll let you know if I run into any other issues.

@@ -1038,7 +1038,7 @@ GOCART_postdet() {
     rm -f "${COM_CHEM_HISTORY}/gocart.inst_aod.${vdate:0:8}_${vdate:8:2}00z.nc4"
   fi

-  ${NLN} "${COM_CHEM_HISTORY}/gocart.inst_aod.${vdate:0:8}_${vdate:8:2}00z.nc4" \
-    "${DATA}/gocart.inst_aod.${vdate:0:8}_${vdate:8:2}00z.nc4"
+  #${NLN} "${COM_CHEM_HISTORY}/gocart.inst_aod.${vdate:0:8}_${vdate:8:2}00z.nc4" \
Contributor Author

@WalterKolczynski-NOAA @bbakernoaa @lipan-NOAA @rmontuoro --- I think I have other things working in my other branch, so I'm coming back to the remaining issues, including this one. Is it expected that the model won't run if I link these files? Barry helped me figure out that this was causing some earlier crashes. I can re-test to see if this is still an issue, but I want to check whether there's a known issue with this or a suggested workaround. My other branch is: https://github.com/JessicaMeixner-NOAA/global-workflow/tree/feature/updateufsstack15

Contributor

Without the links, the output will never get into a permanent location. I think what needs to happen is any existing files at the target need to be deleted. GOCART seems to be okay with the links as long as the target doesn't exist, otherwise we would be seeing more problems.
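That ordering can be sketched as follows, with `link_gocart_aod` a hypothetical stand-in for the `${NLN}` call in forecast_postdet.sh:

```shell
#!/usr/bin/env bash
# Sketch: delete any existing file at the COM target before creating the
# link in the run directory, so GOCART never sees a pre-existing target.
set -u

link_gocart_aod() {
  local com_dir=$1 run_dir=$2 fname=$3
  rm -f "${com_dir}/${fname}"                          # target must not exist
  ln -sf "${com_dir}/${fname}" "${run_dir}/${fname}"   # ${NLN} equivalent
}
```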

Contributor Author

Okay, I can try again without this commented out and see if other issues I was having were partially disguised as this one.

Contributor Author

Uncommenting this line still resulted in a failure (even from a completely clean run):

nid001088.dogwood.wcoss2.ncep.noaa.gov 0: UFS Aerosols: Advancing from 2021-03-23T17:40:00 to 2021-03-23T18:00:00
nid001088.dogwood.wcoss2.ncep.noaa.gov 0:
nid001088.dogwood.wcoss2.ncep.noaa.gov 0:  Writing:     28 Slices to File:  gocart.inst_aod.20210323_1800z.nc4
nid001088.dogwood.wcoss2.ncep.noaa.gov 0: pe=00000 FAIL at line=00187    NetCDF4_FileFormatter.F90                <status=13>
pe=00000 FAIL at line=00061    HistoryCollection.F90                    <status=13>
pe=00000 FAIL at line=00790    ServerThread.F90                         <status=13>
pe=00000 FAIL at line=00138    BaseServer.F90                           <status=13>
pe=00000 FAIL at line=00981    ServerThread.F90                         <status=13>
pe=00000 FAIL at line=00094    MessageVisitor.F90                       <status=13>
pe=00000 FAIL at line=00113    AbstractMessage.F90                      <status=13>
pe=00000 FAIL at line=00107    SimpleSocket.F90                         <status=13>
pe=00000 FAIL at line=00429    ClientThread.F90                         <status=13>
pe=00000 FAIL at line=00363    ClientManager.F90                        <status=13>
pe=00000 FAIL at line=03524    MAPL_HistoryGridComp.F90                 <status=13>
pe=00000 FAIL at line=01818    MAPL_Generic.F90                         <status=13>
pe=00000 FAIL at line=01284    MAPL_CapGridComp.F90                     <status=13>
pe=00000 FAIL at line=01213    MAPL_CapGridComp.F90                     <status=13>
pe=00000 FAIL at line=01159    MAPL_CapGridComp.F90                     <status=13>
pe=00000 FAIL at line=00827    MAPL_CapGridComp.F90                     <status=13>
pe=00000 FAIL at line=00967    MAPL_CapGridComp.F90                     <status=13>
nid001088.dogwood.wcoss2.ncep.noaa.gov 0: MPICH ERROR [Rank 0] [job id 2794f105-9f31-4db4-b1dc-56a1883195f6] [Fri Oct 13 12:32:25 2023] [nid001088] - Abort(1) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 0

nid001088.dogwood.wcoss2.ncep.noaa.gov 0: forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
ufs_model.x        000000000641185A  Unknown               Unknown  Unknown
libpthread-2.31.s  00001456DAA868C0  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001456DCBE3BFA  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001456DCABC05F  Unknown               Unknown  Unknown
libmpi_intel.so.1  00001456DB1C9DA8  MPI_Abort             Unknown  Unknown
ufs_model.x        00000000013E79C4  _ZN5ESMCI3VMK5abo         757  ESMCI_VMKernel.C
ufs_model.x        00000000013C6B57  _ZN5ESMCI2VM5abor        3597  ESMCI_VM.C
ufs_model.x        0000000000BE8E83  c_esmc_vmabort_          1190  ESMCI_VM_F.C
ufs_model.x        000000000054D279  esmf_vmmod_mp_esm        9431  ESMF_VM.F90
ufs_model.x        00000000006CEFCF  esmf_initmod_mp_e        1226  ESMF_Init.F90
ufs_model.x        000000000042B7B0  MAIN__                    403  UFS.F90
ufs_model.x        000000000042A292  Unknown               Unknown  Unknown
libc-2.31.so       00001456DA69124D  __libc_start_main     Unknown  Unknown

Full log file: /lfs/h2/emc/couple/noscrub/jessica.meupdatemodel/s2swc48t03/COMROOT/s2swc48t03/logs/2021032312/gfsfcst.log.0

It does seem that GOCART has a problem with this, unless I'm missing something. At this point I'm ready to do a fresh round of low-res testing plus a high-res spot check and open a new PR to update the model, but this is likely to be a sticking point. I'm hoping someone who works on the aerosols component can chime in.

Contributor

For now, comment it out and add a GOCART_out() to match the others that copies the files to COM_CHEM_HISTORY at the end of the forecast.
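A sketch of what such a GOCART_out() might look like, copying rather than linking (the glob and argument handling are assumptions, not the actual workflow code):

```shell
#!/usr/bin/env bash
# Sketch: copy GOCART AOD history files from the run directory to
# COM_CHEM_HISTORY at the end of the forecast instead of linking them.
set -u

GOCART_out() {
  local run_dir=$1 com_dir=$2
  mkdir -p "${com_dir}"
  local f
  for f in "${run_dir}"/gocart.inst_aod.*.nc4; do
    [[ -e "${f}" ]] || continue   # glob matched nothing; skip
    cp -p "${f}" "${com_dir}/"
  done
}
```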

I'd also like to know what changed that this no longer works and if there is anyone working to change it back.

Contributor Author

The error from: /scratch1/NCEPDEV/climate/Jessica.Meixner/HR3/updatemodel02/s2swc48t02/COMROOT/s2swc48t02/logs/2021032312/gfsfcst.log.0
is:

SUB GOCART_out: Copying output data for GOCART
+ forecast_postdet.sh[1052]: for fhr in '${FV3_OUTPUT_FH}'
++ forecast_postdet.sh[1053]: date --utc -d '20210323 12 + 0 hours' +%Y%m%d%H
+ forecast_postdet.sh[1053]: local vdate=2021032312
+ forecast_postdet.sh[1054]: /bin/cp -p /scratch1/NCEPDEV/climate/Jessica.Meixner/HR3/updatemodel02/s2swc48t02/RUNDIRS/s2swc48t02/fcst.123448/gocart.inst_aod.20210323_1200z.nc4 /scratch1/NCEPDEV/climate/Jessica.Meixner/HR3/updatemodel02/s2swc48t02/COMROOT/s2swc48t02/gfs.20210323/12//model_data/chem/history/gocart.inst_aod.20210323_1200z.nc4
/bin/cp: cannot stat '/scratch1/NCEPDEV/climate/Jessica.Meixner/HR3/updatemodel02/s2swc48t02/RUNDIRS/s2swc48t02/fcst.123448/gocart.inst_aod.20210323_1200z.nc4': No such file or directory

This was the code: JessicaMeixner-NOAA@6a61d8a

Contributor Author

I understand wanting to copy explicit lists, but we should really be as general as possible here, as inst_aod is just one of the possible output files.

It would really be better if we copied or linked all of the gocart.*.nc4 files to the chem directory, as there are lots of possible diagnostics available.

We can't link them or the run dies. If what I'm trying now works, we could try to make it slightly more general, as long as it doesn't conflict with other linking statements.

Contributor

I think you need to skip the f000 one.
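A copy loop that skips the initial-time (f000) file might look like this sketch; the `copy_gocart_skip_f000` name and the `init_stamp` argument are illustrative, with the timestamp format taken from the file names in the logs above:

```shell
#!/usr/bin/env bash
# Sketch: copy all GOCART history output except the initial-time (f000)
# file, which is assumed not to be written by the model.
set -u

copy_gocart_skip_f000() {
  local run_dir=$1 com_dir=$2 init_stamp=$3   # e.g. 20210323_1200z
  local f base
  mkdir -p "${com_dir}"
  for f in "${run_dir}"/gocart.*.nc4; do
    [[ -e "${f}" ]] || continue
    base=$(basename "${f}")
    [[ "${base}" == *"${init_stamp}"* ]] && continue   # skip the f000 file
    cp -p "${f}" "${com_dir}/"
  done
}
```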

Contributor

There might be something in there that we just need to add to the AERO_HISTORY.rc file.

It needs to be added at the top of the file

Allow_Overwrite: true

Testing now
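Assuming the rc-file layout, the change being tested would sit at the very top of AERO_HISTORY.rc:

```
# top of AERO_HISTORY.rc (placement per the suggestion above)
Allow_Overwrite: true
```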

Contributor Author

How'd your tests go, @bbakernoaa? Mine did not go well. The changes I tried are here: https://github.com/JessicaMeixner-NOAA/global-workflow/tree/trygocartfix with the allow-overwrite setting and the other changes backtracked.

I also haven't had good luck copying files at the end of the run - I keep getting errors. That branch is here: https://github.com/JessicaMeixner-NOAA/global-workflow/tree/updateUFS101223

@JessicaMeixner-NOAA
Contributor Author

Closing this PR and have opened new one: #1933

@JessicaMeixner-NOAA JessicaMeixner-NOAA deleted the feature/updateufs branch March 3, 2024 13:28
Development

Successfully merging this pull request may close these issues.

changes required for implementing ocean albedo calculation in coupled model