p8 / p5 tag issue on gaea: CPC experiment support #1755

Closed
jkbk2004 opened this issue May 16, 2023 · 83 comments

@jkbk2004
Collaborator

Description

  • Due to recent OS updates, the Gaea modulefile in the existing tags needs to be validated
  • An experiment setup with the p8 tag is underway at CPC

Solution

  • The develop branch includes the Gaea modulefile update. The Gaea9 node, which was dedicated to the old C3 OS, is no longer supported in the develop branch.
@jieshunzhu
Collaborator

Thanks for your help in advance.

I cloned tags/Prototype-P8 on Gaea. In tests/, I replaced module-setup.sh with the one from the develop branch. The main difference is the line "source /lustre/f2/dev/role.epic/contrib/Lmod_init.sh". But when I compiled it, I got the following error:
++++++++++++
Lmod has detected the following error: The following module(s) are unknown:
"intel/2021.3.0" "gcc/8.3.0" "intel/18.0.6.288" "PrgEnv-intel/6.0.5"
"cray-python/3.7.3.2"
++++++++++++

My directory is /lustre/f2/scratch/ncep/JieShun.Zhu/FV3_RT/rt_22995/compile_001

In addition, my default shell is tcsh. Before compiling, I switched to bash by typing "bash".
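
A minimal sketch, assuming the Lmod message has been saved as text, of pulling out the module names it rejects so that each one can then be checked with `module avail` on the upgraded system. The function name `unknown_modules` is hypothetical; the sample text is the error quoted above.

```python
import re


def unknown_modules(lmod_error):
    """Return the module names listed after Lmod's 'are unknown:' marker."""
    marker = "unknown:"
    start = lmod_error.find(marker)
    if start == -1:
        return []
    # Module names appear as double-quoted "name/version" tokens after the marker.
    return re.findall(r'"([^"\s]+)"', lmod_error[start + len(marker):])


message = '''Lmod has detected the following error: The following module(s) are unknown:
"intel/2021.3.0" "gcc/8.3.0" "intel/18.0.6.288" "PrgEnv-intel/6.0.5"
"cray-python/3.7.3.2"'''

for name in unknown_modules(message):
    print(name)
```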

@natalie-perlin
Collaborator

@jieshunzhu -
Following the recent Gaea updates, the modules "intel/2021.3.0" "gcc/8.3.0" "intel/18.0.6.288" "PrgEnv-intel/6.0.5" "cray-python/3.7.3.2"
are no longer available on either the C3 or the C4 partition. Please see the notes on stack changes for Gaea in WM issue #1753.

@jieshunzhu
Collaborator

@natalie-perlin Thanks for that. I am looking at #1753.
BTW, I was able to compile the UFS develop branch (the 20230515 version). Do you know what else I should modify in P8, other than module-setup.sh?

@jieshunzhu
Collaborator

I replaced /modulefiles with the one from the develop branch. When compiling, I got an error about w3nco (/lustre/f2/scratch/ncep/JieShun.Zhu/FV3_RT/rt_12517/compile_001/err):
++++++
CMake Error at CMakeLists.txt:135 (find_package):
By not providing "Findw3nco.cmake" in CMAKE_MODULE_PATH this project has
asked CMake to find a package configuration file provided by "w3nco", but
CMake did not find one.

Could not find a package configuration file provided by "w3nco" (requested
version 2.4.0) with any of the following names:

w3ncoConfig.cmake
w3nco-config.cmake

Add the installation prefix of "w3nco" to CMAKE_PREFIX_PATH or set
"w3nco_DIR" to a directory containing one of the above files. If "w3nco"
provides a separate development package or SDK, be sure it has been
installed.
+++++

In the CMakeLists.txt of P8, I found "find_package(w3nco 2.4.0 REQUIRED)". But in the same file in the develop branch, I found "find_package(w3emc 2.9.2 REQUIRED)". Are w3nco and w3emc interchangeable?
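
A minimal sketch of how the library requirements of two tags could be compared, assuming each CMakeLists.txt has been read into a string. It only lists what each file asks for and does not say whether one library can stand in for another. The function names are hypothetical, and the two input lines are the ones quoted above.

```python
import re

# Matches e.g. find_package(w3nco 2.4.0 REQUIRED); the version is optional.
# Commented-out calls are not filtered, which is fine for a quick comparison.
FIND_PACKAGE = re.compile(r"find_package\(\s*(\w+)(?:\s+(\d[\d.]*))?[^)]*\)")


def package_versions(cmake_text):
    """Map each package named in a find_package() call to its requested version (or None)."""
    return {name: version or None for name, version in FIND_PACKAGE.findall(cmake_text)}


def only_in_first(first, second):
    """Package names requested by `first` but not by `second`."""
    return sorted(set(package_versions(first)) - set(package_versions(second)))


p8_line = "find_package(w3nco 2.4.0 REQUIRED)"
develop_line = "find_package(w3emc 2.9.2 REQUIRED)"

print("only in P8:", only_in_first(p8_line, develop_line))
print("only in develop:", only_in_first(develop_line, p8_line))
```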

@natalie-perlin
Collaborator

natalie-perlin commented May 17, 2023

@jkbk2004 -
Please note that the stack /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/ was built on the Gaea C3 partition before the C3 upgrade but after the C4 upgrade, and it attempted to accommodate the different compilers, mpich, and Cray programming environment modules on C3 and C4. After the C3 upgrade, the modules on C3 and C4 appear to be identical; some questions remain as to whether that "intermediate stack" can be fully used.

The main difference is that before the C3 upgrade, the UFS weather-model compile jobs in the regression tests were built on a Gaea C3 login node, which used the same compilers and Cray programming environment as were used when the hpc-stack was built. Only the RT test binaries were run on C4.

After the C3 upgrade, the RT weather-model compile jobs use different modules and a different programming environment from those in place when ./hpc-stack/intel-2022.0.2/ was built. (This may or may not create issues at runtime.)

@natalie-perlin
Collaborator

natalie-perlin commented May 18, 2023

An updated stack has been prepared with the same compilers as ./hpc-stack/intel-2022.0.2/,
now adapted for the upgraded C3 and C4: ./hpc-stack/intel-classic-2022.0.2/
The ufs_gaea.intel.lua module loads the stack as follows:

[RegressionTests_gaea.intel.log.txt](https://github.com/ufs-community/ufs-weather-model/files/11508541/RegressionTests_gaea.intel.log.txt)

prepend_path("MODULEPATH","/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/stack")
load(pathJoin("hpc", os.getenv("hpc_ver") or "1.2.0"))

load(pathJoin("intel-classic", os.getenv("intel_classic_ver") or "2022.0.2"))
load(pathJoin("cray-mpich", os.getenv("cray_mpich_ver") or "7.7.20"))
load(pathJoin("hpc-intel-classic", os.getenv("hpc_intel_classic_ver") or "2022.0.2"))
load(pathJoin("hpc-cray-mpich", os.getenv("hpc_cray_mpich_ver") or "7.7.20"))
load(pathJoin("libpng", os.getenv("libpng_ver") or "1.6.37"))
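
The `os.getenv(...) or "<default>"` idiom in the modulefile lines above lets an environment variable override each default version. A minimal Python sketch of the same lookup, for illustration only; `resolve_modules` is a hypothetical name and the table simply restates the lines above.

```python
import os

# (module name, environment variable, default version), restating the modulefile lines above.
STACK_MODULES = [
    ("hpc", "hpc_ver", "1.2.0"),
    ("intel-classic", "intel_classic_ver", "2022.0.2"),
    ("cray-mpich", "cray_mpich_ver", "7.7.20"),
    ("hpc-intel-classic", "hpc_intel_classic_ver", "2022.0.2"),
    ("hpc-cray-mpich", "hpc_cray_mpich_ver", "7.7.20"),
    ("libpng", "libpng_ver", "1.6.37"),
]


def resolve_modules(env=None):
    """Return "name/version" strings, preferring versions set in the environment."""
    env = os.environ if env is None else env
    # As with Lua's `os.getenv(var) or default`, the default applies only
    # when the variable is not set at all.
    return [f"{name}/{env.get(var, default)}" for name, var, default in STACK_MODULES]


print(resolve_modules({}))
```

For example, `resolve_modules({"cray_mpich_ver": "8.1.25"})` would request `cray-mpich/8.1.25` (a made-up version used only to show the override) while leaving the other defaults unchanged.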

A subset of regression tests (from the "# ATM tests" line until the end of the list in rt.conf) has finished successfully; log attached.
model setup: /lustre/f2/dev/role.epic/sandbox/UFS-WM/ufs-wm-dev1/tests
RT run directory: /lustre/f2/scratch/role.epic/FV3_RT/rt_32501

Closing issue #1753 for now; it was for the stack for the higher-version compilers.
RegressionTests_gaea.intel.log.txt

@natalie-perlin
Collaborator

@jkbk2004 @zach1221
All the regression tests have passed on Gaea with the stack
/lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/stack/

The log from the remaining set of regression tests (coupled) is attached.
RegressionTests_gaea.intel.log2.txt

@jkbk2004
Collaborator Author

@natalie-perlin can you add yafyaml/v0.5.1 to /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/modulefiles/stack ? We used yafyaml with the p8 tag that @jieshunzhu is trying to use.

@jieshunzhu
Collaborator

By using intel-2022.0.2 and making some other minor changes, I was able to compile the P8 tag. The regression tests, however, could not find their baselines. The same thing happened for tag GFSv17.HR1.

I will try the intel-classic-2022.0.2 stack that @natalie-perlin pointed to.

@jkbk2004
Collaborator Author

I agree the baselines for those tags may have gone missing during the OS transition. But we can compare a few cases by creating new baselines with the tag. The compiler change is likely to cause differences at the white-noise level; we can confirm that manually.
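
A minimal sketch of one way to confirm such differences manually, assuming the same field from the old and new runs has already been read into two equal-length sequences of numbers. The function names and the `rel_tol` threshold are hypothetical and not part of the regression-test scripts; the right threshold depends on the field.

```python
def max_relative_difference(baseline, candidate):
    """Largest |a - b| / max(|a|, |b|) over paired values; 0.0 if all pairs are equal."""
    if len(baseline) != len(candidate):
        raise ValueError("fields must have the same number of values")
    worst = 0.0
    for a, b in zip(baseline, candidate):
        scale = max(abs(a), abs(b))
        if scale > 0.0:
            worst = max(worst, abs(a - b) / scale)
    return worst


def within_noise(baseline, candidate, rel_tol=1e-6):
    """True if no paired value differs by more than rel_tol (an assumed threshold)."""
    return max_relative_difference(baseline, candidate) <= rel_tol
```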

@natalie-perlin
Collaborator

natalie-perlin commented May 18, 2023

@jkbk2004
yafyaml/v0.5.1 is already part of the stack, for both hpc-intel/2022.0.2 and hpc-intel-classic/2022.0.2.
Please let me know what might be missing. Is a different module name needed (v0.5.1 as opposed to 0.5.1)?

This is what you find when loading the ./intel-2022.0.2/ stack:

module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/modulefiles/stack
module load hpc
module load hpc-intel/2022.0.2
module show yafyaml
module avail yafyaml
------ /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/modulefiles/compiler/intel/2022.0.2 ------
   yafyaml/v0.5.1 (L)

... and when loading the ./intel-classic-2022.0.2/ stack:

module unload yafyaml hpc-intel/2022.0.2 hpc 
module use /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/stack
module load hpc
module load hpc-intel-classic
module load yafyaml
module avail yafyaml
 /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-classic-2022.0.2/modulefiles/compiler/intel-classic/2022.0.2 
  yafyaml/v0.5.1 (L)

UPD: Links created, modules are loadable either way, yafyaml/v0.5.1 or yafyaml/0.5.1.

@jieshunzhu
Collaborator

Even though it might not matter for me (because I have got P8 and HR1 compiled by using intel-2022.0.2), I want to give you an update on the HR1 compilation with intel-classic-2022.0.2. I got an error related to the ESMF library.
+++++++++++++++++++++++++++++++++++
CMake Warning at CMakeModules/Modules/FindESMF.cmake:114 (message):
ESMFMKFILE does not exist
Call Stack (most recent call first):
CMakeLists.txt:122 (find_package)

CMake Error at /ncrc/sw/gaea-cle7/uasw/ncrc/envs/20200417/opt/linux-sles15-x86_64/gcc-7.5.0/cmake-3.20.1-w7tkahac22qulhbcbi6io54u5dfr36zs/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find ESMF (missing: ESMF_LIBRARY_LOCATION
ESMF_INTERFACE_LINK_LIBRARIES ESMF_F90COMPILEPATHS) (Required is at least
version "8.3.0")
Call Stack (most recent call first):
/ncrc/sw/gaea-cle7/uasw/ncrc/envs/20200417/opt/linux-sles15-x86_64/gcc-7.5.0/cmake-3.20.1-w7tkahac22qulhbcbi6io54u5dfr36zs/share/cmake-3.20/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
CMakeModules/Modules/FindESMF.cmake:121 (find_package_handle_standard_args)
CMakeLists.txt:122 (find_package)
+++++++++++++++++++++++++++++++++++++++

More details are in /lustre/f2/scratch/ncep/JieShun.Zhu/FV3_RT/rt_7661/compile_001
My source code directory is /lustre/f2/dev/ncep/JieShun.Zhu/HR1/ufs-weather-model

@natalie-perlin
Collaborator

@jieshunzhu - looking into it now!
Regression testing has been passing in another full-suite round of tests, however (see #1758).

@natalie-perlin
Collaborator

natalie-perlin commented May 19, 2023

@jieshunzhu -
it doesn't look like you have the hpc-cray-mpich module loaded...
The modulefile /lustre/f2/dev/ncep/JieShun.Zhu/HR1/ufs-weather-model/modulefiles/ufs_gaea.intel.lua does not have all the modifications needed to load hpc-cray-mpich, as suggested in
#1755 (comment)

It needs to have the following:

load(pathJoin("cray-mpich", os.getenv("cray_mpich_ver") or "7.7.20"))

load(pathJoin("hpc-cray-mpich", os.getenv("hpc_cray_mpich_ver") or "7.7.20"))
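
A minimal sketch of how a modulefile could be checked for the two loads above before compiling, assuming the .lua file has been read into a string. `missing_loads` is a hypothetical helper, and the pattern only recognizes `load("name")`, `try_load("name")`, and `load(pathJoin("name", ...))` forms.

```python
import re

# Matches load("name"...), try_load("name"...) and load(pathJoin("name", ...)),
# but not unload(...), in an Lmod Lua modulefile.
LOAD_CALL = re.compile(r'\b(?:try_)?load\(\s*(?:pathJoin\(\s*)?"([^"]+)"')


def missing_loads(modulefile_text, required=("cray-mpich", "hpc-cray-mpich")):
    """Return the required module names that the modulefile never loads."""
    loaded = set()
    for line in modulefile_text.splitlines():
        code = line.split("--", 1)[0]  # drop Lua line comments
        for name in LOAD_CALL.findall(code):
            loaded.add(name.split("/", 1)[0])  # tolerate load("name/version")
    return [name for name in required if name not in loaded]


snippet = 'load(pathJoin("hpc", os.getenv("hpc_ver") or "1.2.0"))'
print(missing_loads(snippet))
```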

@jieshunzhu
Collaborator

@natalie-perlin Thanks for the quick response. Got your idea. Let me try it again. I will update soon.

@jieshunzhu
Collaborator

@natalie-perlin now both compilation and regression tests are done, but the regression tests are missing their baselines.

@jkbk2004
Collaborator Author

@jieshunzhu wrote: "now both compilation and regression tests are done, but regression tests miss baseline."

Do you want us to create a baseline with the code you are testing, so that we can keep following along as you move forward?

@jieshunzhu
Collaborator

@jkbk2004 not necessary if you are busy on other projects. Thanks for the help. Really appreciate it.

@jieshunzhu
Collaborator

@jkbk2004 @natalie-perlin Could you please reopen the issue?

It looks like someone removed the hpc-stack that I used for building P8 months ago: /lustre/f2/dev/role.epic/contrib/hpc-stack/intel-2022.0.2/modulefiles/stack

I have now tried to rebuild it with /lustre/f2/dev/wpo/role.epic/contrib/spack-stack/spack-stack-1.4.1-c4/envs/ufs-pio-2.5.10/install/modulefiles/Core. With spack-stack-1.4.1-c4, I can compile the develop branch.

But when building P8, I got errors about "PIO". Could you please help me take a look?
+++++++++++++++++++++++++++
CMake Error at /lustre/f2/dev/wpo/role.epic/contrib/spack-stack/spack-stack-1.4.1-c4/envs/unified-env/install/intel/2022.0.2/cmake-3.23.1-gteb7td/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
Could NOT find PIO (missing: C Fortran) (Required is at least version
"2.5.3")
Call Stack (most recent call first):
/lustre/f2/dev/wpo/role.epic/contrib/spack-stack/spack-stack-1.4.1-c4/envs/unified-env/install/intel/2022.0.2/cmake-3.23.1-gteb7td/share/cmake-3.23/Modules/FindPackageHandleStandardArgs.cmake:594 (_FPHSA_FAILURE_MESSAGE)
CMakeModules/Modules/FindPIO.cmake:184 (find_package_handle_standard_args)
CMakeLists.txt:130 (find_package)
++++++++++++++++++++++++++++++

@jkbk2004 jkbk2004 reopened this Oct 3, 2023
@jieshunzhu
Collaborator

@natalie-perlin. I have no clear idea about which way is more efficient. The P5 version was set up at CPC two years ago by another person who has since retired.
Looking into the compile scripts, it seems that many files have to be modified, not only in fv3. Examples include ./NEMS/src/conf/module-setup.sh.inc, ./tests/compile.sh, ./FV3/ccpp/build_ccpp.sh.

@natalie-perlin
Collaborator

@jieshunzhu - yes, I'm looking into these scripts, too.
Do you know (or does anybody know) how to compile the code? Any way works.
It differs from the current UFS WM, so I have to know how to test my changes to the modules.

@jieshunzhu
Collaborator

@natalie-perlin The person who built it at CPC is Weiyu Yang. I don't know whether he followed EMC's structure or used completely his own style. He has retired, but let me try contacting him. If I find any useful information, I will share it with you. I really appreciate your help, Natalie.

@jieshunzhu
Collaborator

@natalie-perlin I called Weiyu and did not get any useful information. As I mentioned earlier, Weiyu put in lots of hard-coded modifications. We have to modify them one by one when testing any new stack.

In addition, I compiled the system around 2 years ago. The associated log files are still here: /lustre/f2/dev/ncep/JieShun.Zhu/ufsp5/ufs-s2s-model_zbot/tests/log_gaea.intel/compile_1.log. That may help you better follow the scripts.

@natalie-perlin
Collaborator

Thank you for clarifying what needs to be done, and for the log files.

@jieshunzhu
Collaborator

@jkbk2004 @natalie-perlin I am able to compile P5 using libraries that were built for P8 with hpc-stack. But when testing the executable, I got errors saying "Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, MOVBE, POPCNT, AVX, F16C, FMA, BMI, LZCNT and AVX2 instructions."

Have you seen this error before? Thanks.

@jkbk2004
Collaborator Author

Sounds like it is not capturing the processor information at the compiler level. What about running cpuinfo?
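
A minimal sketch of the kind of check meant here, assuming a Linux compute node where /proc/cpuinfo can be read into a string (for example from inside a batch job). The function names and the default flag list are hypothetical; `avx`, `avx2`, and `fma` are the names Linux uses for three of the instruction sets in the error message. It reports only what the CPU advertises, which is a separate question from any other check the executable performs at startup.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=("avx", "avx2", "fma")):
    """Return the required flags that the CPU does not advertise."""
    have = cpu_flags(cpuinfo_text)
    return [flag for flag in required if flag not in have]


# Made-up sample in the /proc/cpuinfo layout; on a real node the text would
# come from open("/proc/cpuinfo").read().
sample = "processor\t: 0\nflags\t\t: fpu sse sse2 avx avx2 fma\n"
print(missing_flags(sample))
```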

@jieshunzhu
Collaborator

@jkbk2004 Can you point me to where I specify processor information in UFS? Here is the job card for my experiment: /lustre/f2/scratch/ncep/JieShun.Zhu/UFS_zbot/fcst_25e1/cpld_fv3_ccpp_mom6_cice_cmeps_cold_2023102500/job_card. The error is shown in the "out" file.

@jkbk2004
Collaborator Author

Somewhere at the CMake level, I think: https://github.com/ufs-community/ufs-weather-model/blob/develop/cmake/configure_s4.intel.cmake or add it directly to the compile flags, as in ./CICE-interface/CICE/configuration/scripts/machines/Macros.derecho_intel:FFLAGS := -fp-model precise -convert big_endian -assume byterecl -ftz -traceback -march=core-avx2

@jieshunzhu
Collaborator

@jkbk2004 My original compilation flags included the option xcore-avx2, which is related to Intel processors. After removing it, the model ran for a bit but stopped after "COMPLETED MOM INITIALIZATION". The model just hung there until it reached the wall-clock limit. Have you or @natalie-perlin seen a similar problem before?

@jkbk2004
Collaborator Author

@jieshunzhu It might be worth building MOM6 in debug mode, or adding some print statements at the MOM6 main driver level.

@jieshunzhu
Collaborator

@jkbk2004 Thanks for the suggestions. I tried building MOM6 with debug, but it is interesting that I did not see additional log information. I am actually working with the same strategy as your second idea. I will let you know if I find anything.

@jkbk2004
Collaborator Author

@jieshunzhu I am not sure if DDT (debugger) is available on gaea. I will check just in case.

@jkbk2004
Collaborator Author

@jieshunzhu
Collaborator

Thanks @jkbk2004. I have not tried DDT before. I may ask you questions about it later.

@natalie-perlin
Collaborator

@jieshunzhu @jkbk2004
FYI on the progress with the P5 code, if it is still under consideration (as you mentioned the Nov. 30 deadline).

I'm getting close to having the P5 code compiled on my end on Gaea C5 with spack-stack/1.4.1, which corresponds to the same version of compilers as the EPIC-built hpc-stack (intel-classic-2023.1.0), and higher versions of hdf5/1.14.0, netcdf-c/4.9.2, esmf/4.8.2. A couple of relatively simple errors/paths still need fixing for the fms build.
I plan to do a test run after it is fully built, but I haven't yet looked into the setup of the initialization files or whether anything special is required to stage the run. Please let me know if you have any comments.
Please feel free to take a look into my setup on Gaea:
/lustre/f2/scratch/ncep/Natalie.Perlin/ufs-s2s/ufsp5/ufs-s2s-model_zbotC5/

@jieshunzhu
Collaborator

@natalie-perlin Thanks for the update. I think I have almost fixed the problem by using hpc-stack. You can put it on hold on your side (I do not want to waste your time).

But I may need to ask you how to build spack-stack, which I need for jedi-soca. Since the jedi-soca version is not the develop branch, I may need to build an older spack-stack.

Thanks again for your and @jkbk2004 Jong's persistent support and help on our projects at CPC. Really appreciate it!

@jieshunzhu
Collaborator

@jkbk2004 @natalie-perlin Just want to give you an update about transitioning P5 to C5: it works now. The key thing here is still the version of ESMF: I need to use an old version for P5. Thanks again for all your support!

@natalie-perlin
Collaborator

Thank you so much for letting us know that this works for you!
If you don't mind sharing your recent staging location for P5 on Gaea-c5, I'd be glad to check that it all looks consistent!

@natalie-perlin
Collaborator

As to older spack-stack, if the packages and versions that you need in the jedi-soca have been made available to spack central repository, there should be no issues of building them as a part of custom spack-stack. The key is to know the list of exact packages to specify for the spack-stack configuration.

@jieshunzhu
Collaborator

Sure. It will be my pleasure.
My P5 source code directory with modifications: /lustre/f2/dev/ncep/JieShun.Zhu/ufsp5/ufs-s2s-model_zbotC5t
My running directory with outputs: /lustre/f2/scratch/ncep/JieShun.Zhu/UFS_zbot/fcst_25e1
The stack used to compile P5: /lustre/f2/dev/ncep/JieShun.Zhu/util/hpc-stack/c5/intel-classic-2023.1.0P5
The source code for building the stack: /lustre/f2/dev/ncep/JieShun.Zhu/util/hpc-stack/c5/src-intel-classic-2023.1.0P5

@jieshunzhu
Collaborator

@natalie-perlin wrote: "As to older spack-stack, if the packages and versions that you need in the jedi-soca have been made available to spack central repository, there should be no issues of building them as a part of custom spack-stack. The key is to know the list of exact packages to specify for the spack-stack configuration."

Thanks for sharing the information. I need to finish some other more urgent projects before going into the spack-stack. When starting with it, I may ask you questions about it. Thanks in advance.

@jkbk2004
Collaborator Author

@jieshunzhu Congrats! It will be beneficial to continue supporting CPC's p5/p8/c5 operational runs: stack, ufs-wm version updates, etc. I will tag up with you later.

@jieshunzhu
Collaborator

jieshunzhu commented Dec 14, 2023

@jkbk2004 @natalie-perlin Do you have time to help me with another small tool? This tool converts CFSR atmospheric states to FV3 initial conditions. It uses lots of libraries of UFS/FV3, i.e., hpc-stack. I need to compile it on C5 as well.

  • The source code is here: /lustre/f2/dev/ncep/JieShun.Zhu/util/ICchgres_CFSR_FV3_C5/global_chgres.fd4EPIC.
  • I gave it a try in ../global_chgres.fd, in which I made a new file (mk.sh) that includes the library information. The error information is in make.out. My problem is related to linking against those libraries.

@jieshunzhu
Collaborator

Never mind. I got the problem fixed. Thanks anyway.

@jkbk2004
Collaborator Author

@jieshunzhu we can extend a bit of #2005 on our side.

@jieshunzhu
Collaborator

@jkbk2004 @natalie-perlin Happy New Year! I am now trying to transition the JEDI soca-science to C5. Similar to my UFS problem, on C5 I failed to run the version of soca-science I need using spack-stack 1.5.1 (which works for the "develop" repository of soca-science). On C4, I can run it with spack-stack 1.4.0. So I tried to install spack-stack 1.4.0 in my own directory (/lustre/f2/dev/ncep/JieShun.Zhu/util/spack-stack/c5/spack-stack-1.4.0).

I cloned spack-stack-1.4.0 directly from the JCSDA website and did not make any changes. After installation, I cannot see Core/ under /envs/unified-dev/install/modulefiles/. Could you please give me some hints about my problem? I saved the installation log files in my directory. Thanks in advance.

@zach1221
Collaborator

@jieshunzhu I'm going to place this ticket in resolved. Please let me know if you feel it should be kept open.

@jieshunzhu
Collaborator

@zach1221 Sure, it can be closed. Thanks!
