Update UFS to latest #1889
Conversation
@DeniseWorthen can you check the updates here that are required and mentioned in the issue you created, #1811? @rmontuoro there were two GOCART updates here; can you check whether these are correct? Are there other config updates that might have been missed?
I'll be leaving this as a draft until (1) I get confirmation from Denise and the aerosols team about the updates made and (2) I can resolve the module issues that make the wave init job fail. Right now I'm getting the error: I've seen quite a few module updates be needed for various reasons. I think this specific one is that the HDF5 version in the model is not the same as the one being loaded at run time. Any thoughts on the best workaround for this specific module issue, @aerorahul or @WalterKolczynski-NOAA? Should I try to update HDF5 in versions/run.wcoss2.ver, or do you prefer a different solution?
Yes, the various version files should be updated if the mismatch is causing an issue. If you run the coupled atm-only, I can run a cycled test on WCOSS2 if you'd like when it is ready. CI will take care of Hera and Orion.
@rmontuoro @bbakernoaa I got past some of the module issues and have now run into a MAPL failure on wcoss2. The error log can be found in full here: /lfs/h2/emc/couple/noscrub/jessica.meixner/epwrk/test04/COMROT/test04/logs/2021032312/gfsfcst.log.0 The error is:
Any chance you can either help or point this error to someone who could help? You can replicate what I did by running this branch (all changes have been pushed); this describes the set-up: /lfs/h2/emc/couple/noscrub/jessica.meixner/epwave/global-workflow/workflow/coupled.sh It could just be a configuration setting that needs to be updated, as I'm unsure whether the changes I made were correct. It could be a module issue, although I don't expect that for the forecast job since I believe it's sourcing the ufs model modules. In the meantime I will continue by running on hera and/or without aerosols.
I'm running into the same module problems on hera as I did on wcoss2 for the wave init jobs, which isn't super surprising. That being said, I also see this PR: #1882, which also has module issues, but their fix is more job specific because of some known issues that it looks like I'll also run into. So I'm wondering whether changing the module versions is actually the way to go. @aerorahul any chance you have thoughts on this before I continue?
Tagging @lipan-NOAA to see if he can help confirm the aerosol configuration and diagnose the aerosol-related error mentioned above.
@JessicaMeixner-NOAA it looks like you need to delete the |
Okay, let me try a clean test run and see if I still get this error.
@bbakernoaa's offline suggestion to remove the linking of the gocart.inst_aod.* netcdf files solved the MAPL issue. MAPL apparently doesn't like those files to be linked. Meanwhile on hera, the forecast job fails (even for S2S) because the new module updates in the ufs model mean that prod_util can no longer be loaded. So I'm looking for a compatible prod_util module to load... so far no luck.
The prod_util issue that I'm running into has an issue created here: JCSDA/spack-stack#780
Okay, so this works on WCOSS2 right now. But to have this work on all machines, the wave tasks need to take the same approach that the forecast job takes here: https://github.com/NOAA-EMC/global-workflow/blob/develop/jobs/rocoto/fcst.sh#L11-L58 It's possible that the wcoss2 module versions would also need to be walked back so as not to encounter unexpected issues elsewhere. Also note that the linking of the gocart nc files was commented out; the files should be copied at the end of this job (or try the newer version of gocart, which should be merged in soon and may or may not have that same issue).
I've made some progress but am now getting failures on orion (at minimum) in wave post sbs because WGRIB2 is not defined; I get this even when I load the wgrib2 module. I think it's because WGRIB2 needs to be set explicitly. Trying this now. Also, the point job seemed to take significantly longer on orion. Given that orion occasionally has file system issues, that is not very surprising, but it is something to watch on the other platforms as I continue testing.
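As an editorial illustration of "set WGRIB2 explicitly", here is a minimal sketch. The helper name `set_tool_var` is hypothetical and not part of global-workflow; it assumes that loading the module puts the `wgrib2` executable on `PATH` without exporting a `WGRIB2` variable, which is only what the comment above suspects.

```shell
# Hypothetical helper (not from the workflow): export VAR as the full path of
# an executable found on PATH, unless VAR already has a value.
set_tool_var() {
  local var_name="$1" exe_name="$2" current_value exe_path
  # Read the current value of the variable whose name is in var_name.
  eval "current_value=\${${var_name}:-}"
  # Respect a value that a module or the caller already provided.
  if [ -n "${current_value}" ]; then
    return 0
  fi
  # Fail loudly if the executable is not on PATH.
  if ! exe_path="$(command -v "${exe_name}")"; then
    echo "FATAL: ${exe_name} not found on PATH; cannot set ${var_name}" >&2
    return 1
  fi
  export "${var_name}=${exe_path}"
}

# Possible use in a job script, after the modules are loaded:
#   set_tool_var WGRIB2 wgrib2 || exit 1
```

Leaving an existing value untouched means a machine whose module does define the variable is unaffected.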
@WalterKolczynski-NOAA it's because I'm not using module_base, since it's not using the latest HDF5, so I'm using the ufs-weather-model modules. It's my understanding that simply updating everything to the latest HDF5 is not the way to go and will likely have unintended consequences. I got past the WGRIB2 issue, but am now having other issues with MPMD jobs on WCOSS2. Since this branch is "working" and some might be using it, I've been putting my latest updates here: https://github.com/jessicameixner-noaa/global-workflow/tree/feature/updateufsstack15
@WalterKolczynski-NOAA I'll run a fresh case on WCOSS2 tomorrow and see what error I get. I forgot I was in the process of re-cloning and building there on Friday. Help would be great. I think the other branch https://github.com/jessicameixner-noaa/global-workflow/tree/feature/updateufsstack15 is running on both hera and orion (although the wave point job runs really long and needs extra time), but I can't get wcoss2 to work. Let me know if you'd rather I send error messages/help requests here or via a different issue/venue. Thanks for the offer to help!
Specific things to look at would be appreciated, rather than me running off to run it independently. For HDF5, if the version file is updated, that should theoretically be applied to everything anyway, as that is the point of having the versions file.
Waiting for development to open on wcoss2. I'm not updating the version files anymore but instead trying to have the wave jobs use the ufs-weather-model modules + whatever else is needed. This is being done in this branch: https://github.com/jessicameixner-noaa/global-workflow/tree/feature/updateufsstack15
@WalterKolczynski-NOAA - I finally got a fresh install and test on dogwood and was able to get past my last error. I don't have everything running yet but am at least making progress again. I'll let you know if I run into any other issues.
@@ -1038,7 +1038,7 @@ GOCART_postdet() {
rm -f "${COM_CHEM_HISTORY}/gocart.inst_aod.${vdate:0:8}_${vdate:8:2}00z.nc4"
fi

${NLN} "${COM_CHEM_HISTORY}/gocart.inst_aod.${vdate:0:8}_${vdate:8:2}00z.nc4" \
"${DATA}/gocart.inst_aod.${vdate:0:8}_${vdate:8:2}00z.nc4"
#${NLN} "${COM_CHEM_HISTORY}/gocart.inst_aod.${vdate:0:8}_${vdate:8:2}00z.nc4" \
@WalterKolczynski-NOAA @bbakernoaa @lipan-NOAA @rmontuoro --- I think I have other things working in my other branch, so I'm coming back to the remaining issues, including this one. Is it anticipated that the model wouldn't run if I link these files? Barry helped me figure out that this was causing some earlier crashes. I can re-test to see if this is still an issue, but I want to see if there's a known issue with this or a suggested workaround. My other branch is: https://github.com/JessicaMeixner-NOAA/global-workflow/tree/feature/updateufsstack15
Without the links, the output will never get into a permanent location. I think what needs to happen is any existing files at the target need to be deleted. GOCART seems to be okay with the links as long as the target doesn't exist, otherwise we would be seeing more problems.
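The suggestion above (clear any existing file at the target, then link) can be sketched as follows. This is an illustration only: `link_fresh` is a hypothetical helper, and it assumes `${NLN}` in the workflow behaves like `ln -sf`. Later in this thread, re-enabling the link is reported to fail even from a completely clean run, so this should not be read as a confirmed fix.

```shell
# Hypothetical helper: create link_dir/fname as a symlink to target_dir/fname,
# deleting any stale file at the target first so the model starts from scratch.
link_fresh() {
  local target_dir="$1" link_dir="$2" fname="$3"
  # Remove leftover output from a previous attempt, if any.
  rm -f "${target_dir}/${fname}"
  # Assumption: NLN in the workflow is equivalent to "ln -sf".
  ln -sf "${target_dir}/${fname}" "${link_dir}/${fname}"
}

# Possible use inside GOCART_postdet (variable names taken from the diff above):
#   link_fresh "${COM_CHEM_HISTORY}" "${DATA}" \
#     "gocart.inst_aod.${vdate:0:8}_${vdate:8:2}00z.nc4"
```

Anything written through the link in the run directory then lands in the permanent location, which is the reason given above for wanting the links.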
Okay, I can try again without this commented out and see whether the other issues I was having were partially disguised as this one.
Uncommenting this line resulted in a failure still (even from a completely clean run):
nid001088.dogwood.wcoss2.ncep.noaa.gov 0: UFS Aerosols: Advancing from 2021-03-23T17:40:00 to 2021-03-23T18:00:00
nid001088.dogwood.wcoss2.ncep.noaa.gov 0:
nid001088.dogwood.wcoss2.ncep.noaa.gov 0: Writing: 28 Slices to File: gocart.inst_aod.20210323_1800z.nc4
nid001088.dogwood.wcoss2.ncep.noaa.gov 0: pe=00000 FAIL at line=00187 NetCDF4_FileFormatter.F90 <status=13>
pe=00000 FAIL at line=00061 HistoryCollection.F90 <status=13>
pe=00000 FAIL at line=00790 ServerThread.F90 <status=13>
pe=00000 FAIL at line=00138 BaseServer.F90 <status=13>
pe=00000 FAIL at line=00981 ServerThread.F90 <status=13>
pe=00000 FAIL at line=00094 MessageVisitor.F90 <status=13>
pe=00000 FAIL at line=00113 AbstractMessage.F90 <status=13>
pe=00000 FAIL at line=00107 SimpleSocket.F90 <status=13>
pe=00000 FAIL at line=00429 ClientThread.F90 <status=13>
pe=00000 FAIL at line=00363 ClientManager.F90 <status=13>
pe=00000 FAIL at line=03524 MAPL_HistoryGridComp.F90 <status=13>
pe=00000 FAIL at line=01818 MAPL_Generic.F90 <status=13>
pe=00000 FAIL at line=01284 MAPL_CapGridComp.F90 <status=13>
pe=00000 FAIL at line=01213 MAPL_CapGridComp.F90 <status=13>
pe=00000 FAIL at line=01159 MAPL_CapGridComp.F90 <status=13>
pe=00000 FAIL at line=00827 MAPL_CapGridComp.F90 <status=13>
pe=00000 FAIL at line=00967 MAPL_CapGridComp.F90 <status=13>
nid001088.dogwood.wcoss2.ncep.noaa.gov 0: MPICH ERROR [Rank 0] [job id 2794f105-9f31-4db4-b1dc-56a1883195f6] [Fri Oct 13 12:32:25 2023] [nid001088] - Abort(1) (rank 0 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 0
nid001088.dogwood.wcoss2.ncep.noaa.gov 0: forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image PC Routine Line Source
ufs_model.x 000000000641185A Unknown Unknown Unknown
libpthread-2.31.s 00001456DAA868C0 Unknown Unknown Unknown
libmpi_intel.so.1 00001456DCBE3BFA Unknown Unknown Unknown
libmpi_intel.so.1 00001456DCABC05F Unknown Unknown Unknown
libmpi_intel.so.1 00001456DB1C9DA8 MPI_Abort Unknown Unknown
ufs_model.x 00000000013E79C4 _ZN5ESMCI3VMK5abo 757 ESMCI_VMKernel.C
ufs_model.x 00000000013C6B57 _ZN5ESMCI2VM5abor 3597 ESMCI_VM.C
ufs_model.x 0000000000BE8E83 c_esmc_vmabort_ 1190 ESMCI_VM_F.C
ufs_model.x 000000000054D279 esmf_vmmod_mp_esm 9431 ESMF_VM.F90
ufs_model.x 00000000006CEFCF esmf_initmod_mp_e 1226 ESMF_Init.F90
ufs_model.x 000000000042B7B0 MAIN__ 403 UFS.F90
ufs_model.x 000000000042A292 Unknown Unknown Unknown
libc-2.31.so 00001456DA69124D __libc_start_main Unknown Unknown
Full log file: /lfs/h2/emc/couple/noscrub/jessica.meupdatemodel/s2swc48t03/COMROOT/s2swc48t03/logs/2021032312/gfsfcst.log.0
It does seem that GOCART has a problem with this, unless I'm missing something. At this point I'm ready to do a fresh round of low res testing + a high res spot check and open a new PR to update the model. But this seems to likely be a sticking point. I'm hoping someone who works on the aerosols component can chime in on this.
For now, comment it out and add a GOCART_out() to match the others that copies the files to COM_CHEM_HISTORY at the end of the forecast.
I'd also like to know what changed that this no longer works and if there is anyone working to change it back.
The error from: /scratch1/NCEPDEV/climate/Jessica.Meixner/HR3/updatemodel02/s2swc48t02/COMROOT/s2swc48t02/logs/2021032312/gfsfcst.log.0
is:
SUB GOCART_out: Copying output data for GOCART
+ forecast_postdet.sh[1052]: for fhr in '${FV3_OUTPUT_FH}'
++ forecast_postdet.sh[1053]: date --utc -d '20210323 12 + 0 hours' +%Y%m%d%H
+ forecast_postdet.sh[1053]: local vdate=2021032312
+ forecast_postdet.sh[1054]: /bin/cp -p /scratch1/NCEPDEV/climate/Jessica.Meixner/HR3/updatemodel02/s2swc48t02/RUNDIRS/s2swc48t02/fcst.123448/gocart.inst_aod.20210323_1200z.nc4 /scratch1/NCEPDEV/climate/Jessica.Meixner/HR3/updatemodel02/s2swc48t02/COMROOT/s2swc48t02/gfs.20210323/12//model_data/chem/history/gocart.inst_aod.20210323_1200z.nc4
/bin/cp: cannot stat '/scratch1/NCEPDEV/climate/Jessica.Meixner/HR3/updatemodel02/s2swc48t02/RUNDIRS/s2swc48t02/fcst.123448/gocart.inst_aod.20210323_1200z.nc4': No such file or directory
This was the code: JessicaMeixner-NOAA@6a61d8a
I understand wanting to copy explicit lists but we should really be general as possible here as the inst_aod is just one of the output files that are possible.
It would really be better if we copied or linked all of the gocart.*.nc4 files to the chem directory, as there are lots of possible diagnostics available.
We can't link them or the run dies. If what I'm trying now works, we could try to make it slightly more general as long as it doesn't conflict with other linking statements.
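A more general copy, along the lines discussed above, might look like the sketch below. The function name `copy_gocart_output` is hypothetical; in the workflow the two arguments would presumably be `${DATA}` and `${COM_CHEM_HISTORY}`, but that mapping is an assumption, and the sketch has not been checked against the other linking statements mentioned above.

```shell
# Hypothetical helper: copy every gocart.*.nc4 file from the run directory to
# the permanent chem history directory at the end of the forecast.
copy_gocart_output() {
  local src_dir="$1" dest_dir="$2" f
  mkdir -p "${dest_dir}"
  for f in "${src_dir}"/gocart.*.nc4; do
    # If nothing matched, the pattern is left unexpanded; skip it.
    [ -e "${f}" ] || continue
    # -p preserves timestamps and permissions, as in the cp call in the log above.
    cp -p "${f}" "${dest_dir}/"
  done
}

# Possible use at the end of the forecast job:
#   copy_gocart_output "${DATA}" "${COM_CHEM_HISTORY}"
```

Because the loop only copies files that exist, it does not hit the "cannot stat" error that an explicit per-hour list produces when an expected file was never written.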
I think you need to skip the f000 one.
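A sketch of what skipping f000 could look like when building the list of files to copy. `gocart_aod_names` is a hypothetical helper; it assumes GNU `date` (the same call style appears in the log above) and the file naming pattern from the diff. Whether hour 0 is really the only file that is not written is the commenter's guess, not something the thread confirms.

```shell
# Hypothetical helper: print the expected gocart.inst_aod file name for each
# forecast hour, skipping hour 0.
# Arguments: cycle date (YYYYMMDD), cycle hour (HH), then forecast hours.
gocart_aod_names() {
  local pdy="$1" cyc="$2" fhr stamp
  shift 2
  for fhr in "$@"; do
    # Skip f000, for which the log above shows no file in the run directory.
    if [ "${fhr}" -eq 0 ]; then
      continue
    fi
    # Assumption: GNU date is available, as on the platforms in this thread.
    stamp="$(date --utc -d "${pdy} ${cyc} + ${fhr} hours" +%Y%m%d_%H)"
    echo "gocart.inst_aod.${stamp}00z.nc4"
  done
}

# Example: cycle 2021032312 with output at forecast hours 0, 6 and 12.
gocart_aod_names 20210323 12 0 6 12
```

The names it prints can then be fed to the copy loop in place of the full `FV3_OUTPUT_FH` list.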
There might be something in there that we just need to add to the AERO_HISTORY.rc file.
It needs to be added at the top of the file
Allow_Overwrite: true
Testing now
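For anyone wanting to try the same experiment, the edit described above could be scripted roughly as below. `ensure_allow_overwrite` is a hypothetical helper, and the line it adds is copied verbatim from the comment; the exact spelling MAPL accepts for the value is not confirmed in this thread and should be checked against the MAPL documentation. Note also that a later comment reports that tests with the allow-overwrite change did not go well.

```shell
# Hypothetical helper: put "Allow_Overwrite: true" at the top of an rc file
# (e.g. AERO_HISTORY.rc) unless an Allow_Overwrite line is already present.
ensure_allow_overwrite() {
  local rc_file="$1" tmp_file
  # Do nothing if the setting is already in the file.
  if grep -q '^Allow_Overwrite:' "${rc_file}"; then
    return 0
  fi
  tmp_file="$(mktemp)"
  # New first line, followed by the original contents.
  { echo 'Allow_Overwrite: true'; cat "${rc_file}"; } > "${tmp_file}"
  # Write back in place so the rc file keeps its permissions.
  cat "${tmp_file}" > "${rc_file}"
  rm -f "${tmp_file}"
}

# Possible use on the staged copy in the run directory:
#   ensure_allow_overwrite "${DATA}/AERO_HISTORY.rc"
```

The check for an existing line makes the helper safe to call more than once.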
How'd your tests go, @bbakernoaa? Mine did not go well. The changes I tried, with the allow overwrite and backtracking the other changes, are here: https://github.com/JessicaMeixner-NOAA/global-workflow/tree/trygocartfix
I also haven't had good luck copying files at the end of the run; I keep getting errors. That branch is here: https://github.com/JessicaMeixner-NOAA/global-workflow/tree/updateUFS101223
Closing this PR; I have opened a new one: #1933
Description
This updates the model to the latest version of ufs-weather-model, including config file changes that follow the regression test updates: ufs-community/ufs-weather-model@GFSv17.HR2...f7a94ce. Some of these changes are described in #1811.
Resolves #1811
Type of change
Change characteristics
How has this been tested?
Working on testing, posting this now so that others can check updates.
Checklist