Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up kokkos/tril situation #6473

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

jgfouca
Copy link
Member

@jgfouca jgfouca commented Jun 13, 2024

If Trilinos is being used, we should use the kokkos that comes with it. This removes the need to explicitly set Kokkos_ROOT in config_machines.xml to the Trilinos one.

[BFB]

If Trilinos is being used, we should use the kokkos that comes
with it. This removes the need to explicitly set Kokkos_ROOT in
config_machines.xml to the Trilinos one.
@jgfouca jgfouca added BFB PR leaves answers BFB CMake build system labels Jun 13, 2024
@jgfouca jgfouca self-assigned this Jun 13, 2024
Copy link

PR Preview Action v1.4.7
🚀 Deployed preview to https://E3SM-Project.github.io/E3SM/pr-preview/pr-6473/
on branch gh-pages at 2024-06-13 18:46 UTC

set(ENV{Kokkos_ROOT} ${INSTALL_SHAREDPATH})
# We should use the kokkos from Trilinos if we are using Trilinos
if (USE_TRILINOS)
set (ENV{Kokkos_ROOT} ${Trilinos_ROOT})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One potential issue with this is that when we switch to the preinstalled kokkos, we'll need a new path. But maybe I can change this then by creating something like Trilinos_Kokkos_ROOT.

Copy link
Contributor

@bartgol bartgol Jun 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at first Kokkos_ROOT is only used to decide whether CIME needs to build kokkos during sharedlib phase. Then, in the CMake logic of the modelbuild phase, we could look for Kokkos only after handling Albany/Trilinos (if needed), so that we may skip find_package(Kokkos) altogether. Something like

if (USE_ALBANY)
  find_package(Albany REQUIRED)  # this also finds trilinos and kokkos, as they are deps
endif()
if (USE_TRILINOS)
  find_package(Trilinos REQUIRED)  # this also finds kokkos, as it is a dep
endif()
if (USE_KOKKOS)
  find_package(Kokkos REQUIRED)
endif()

In this version, if USE_ALBANY=TRUE, the calls to find_pacakge for trilinos and kokkos return immediately, since the package was alrady found.

Right now, instead, we first look for kokkos, then for trilinos, and finally for albany. We should do it the other way around, since cmake pulls in deps anyways... If we do that, then Kokkos_ROOT becomes unused, since Kokkos is already found when find_package(Kokkos REQUIRED) is parsed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bartgol , those latter two blocks should be elseifs , right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't matter. The calls to find_package return immediately if the pkg was already found anyways...

@mperego
Copy link
Contributor

mperego commented Jun 13, 2024

What's the long term solution when we'll have different components using Kokkos?

@rljacob
Copy link
Member

rljacob commented Jun 13, 2024

Yes what happens when EAMxx and MALI are both running?

@bartgol
Copy link
Contributor

bartgol commented Jun 13, 2024

@rljacob we first need to answer another question: should e3sm force all components to use the same backend in kokkos? Or should they be free to use different ones? I think the latter is the right choice. Though we can (and should) force all components to use the same installation of kokkos.

That said, we could potentially install a prebuilt kokkos with all "reasonable" backends enabled (that is, Serial/OpenMP on CPU machines, Serial/OpenMP/Cuda on pm-gpu, Serial/OpenMP/HIP on Frontier,...). Then, each component is free to use whatever backend they need from kokkos. If MALI was installed with both Serial/OpenMP enabled, then eamxx will work just fine with compile_threaded on or off, regardless of whether MALI uses OpenMP or Serial.

@jgfouca
Copy link
Member Author

jgfouca commented Jun 13, 2024

Or should they be free to use different ones? I think the latter is the right choice

I agree although we will all have to be really careful to explicitly use the device we want to use. The default device will change depending on Kokkos cmake settings.

@bartgol
Copy link
Contributor

bartgol commented Jun 13, 2024

Or should they be free to use different ones? I think the latter is the right choice

I agree although we will all have to be really careful to explicitly use the device we want to use. The default device will change depending on Kokkos cmake settings.

Yes, using the default device should not be allowed. I don't know how to tell different components to use different devices (like "lnd use OpenMP, atm use Cuda, ..."), or how to force components to use Serial if compile_threaded=false. I don't even know exactly what we want to achieve. I suppose that if component A can run on GPU, we will never ask it to run on CPU on a gpu machine, right? The only catch is then how to prevent use of OpenMP if compile_threaded=False. I think we could have E3SM define the CMake var E3SM_KOKKOS_DEVICE, which components can use to pick a pre-defined device. E.g., e3sm can set E3SM_KOKKOS_DEVICE=Serial if compile_threaded=False (on a CPU machine only, since on GPU we prob want to use the GPU device).

If, otoh, we may want component A to run on CPU even if we are on a GPU machine (e.g., component B can use GPU better), then we need new XML logic to specify a device for each component. Something like ATM_KOKKOS_DEVICE, LND_KOKKOS_DEVICE, ..., all of which can be initialized to whatever is the default device on that machine... But this would conflict a bit with compile_threaded, there would be some redundant (and possibly conflicting) information.

@jewatkins
Copy link
Contributor

If MALI was installed with both Serial/OpenMP enabled, then eamxx will work just fine with compile_threaded on or off, regardless of whether MALI uses OpenMP or Serial.

If we can preinstall kokkos with Serial/OpenMP and have compile_threaded use the appropriate device for both eamxx/MALI, that seems to be the solution. I'm not sure if that's possible with preinstalled trilinos/albany. At worst, we'd have to preinstall two different versions of each. Alternatively, we could force trilinos/albany to use a specific backend regardless of e3sm parameters. It probably does make sense to have all backends though to be able to control which component runs on CPU/GPU.

@ndkeen
Copy link
Contributor

ndkeen commented Jun 13, 2024

When I try to build on pm-cpu with ERP_Ln9.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_gnu

/global/cfs/cdirs/e3sm/ndk/repos/jgfouca_cleanup_kokkos_tril/components/eamxx/src/share/atm_process/atmosphere_process_dag.cpp: In lambda function:
/global/cfs/cdirs/e3sm/ndk/repos/jgfouca_cleanup_kokkos_tril/components/eamxx/src/share/atm_process/atmosphere_process_dag.cpp:149:37: error: 'const Units' {aka 'const class ekat::units::Units'} has no member named 'get_string'\
; did you mean 'get_si_string'?
  149 |         s += " [" + fid.get_units().get_string() + "]";
      |                                     ^~~~~~~~~~
      |                                     get_si_string

If I replace all of those get_string with get_si_string, it gets further

./eamxx/src/share/atm_process/atmosphere_process_dag.cpp:149:        s += " [" + fid.get_units().get_string() + "]";
./eamxx/src/share/field/field_identifier.cpp:51:  m_identifier += " [" + m_units.get_string() + "]";
./eamxx/src/share/io/scorpio_output.cpp:965:    std::string units = fid.get_units().get_string();

but still fails:

/global/cfs/cdirs/e3sm/ndk/repos/jgfouca_cleanup_kokkos_tril/components/eamxx/src/share/field/field_manager.cpp:49:69: error: invalid initialization of reference of type 'const scream::FieldLayout&' from expression of type 'const Un\
its' {aka 'const ekat::units::Units'}
   49 |           "         - input field units:  " + to_string(id.get_units()) + "\n"
      |                                                         ~~~~~~~~~~~~^~

@jewatkins
Copy link
Contributor

I'm not sure if that's possible with preinstalled trilinos/albany. At worst, we'd have to preinstall two different versions of each.

actually, I'm not sure if two different versions would work either if we use kokkos default device and we don't have a "TRILINOS_KOKKOS_DEVICE" or "ALBANY_KOKKOS_DEVICE".

@rljacob
Copy link
Member

rljacob commented Jun 13, 2024

I don't think EAMxx and MALI will appear in the same compset until low-res EAMxx is working so an intermediate solution is fine for now.

@bartgol
Copy link
Contributor

bartgol commented Jun 13, 2024

@jewatkins yeah, if you want MALI to honor compile_threaded, then we need multiple MALI installations. I don't see a way out of that.

That said, if compile_threaded=False, we could have E3SM set OMP_NUM_THREADS=1, so that even if MALI used OpenMP all the time, it would still result in a single-thread run (Serial is not the same as OpenMP with 1 thread in terms of perf, but it should be bfb the same).

@ndkeen
Copy link
Contributor

ndkeen commented Jun 13, 2024

Not sure if this is what you meant, but with E3SM, builds without openmp are not BFB with builds using openmp and only 1 thread. Only threaded builds are BFB with each other.

@bartgol
Copy link
Contributor

bartgol commented Jun 13, 2024

Not sure if this is what you meant, but with E3SM, builds without openmp are not BFB with builds using openmp and only 1 thread. Only threaded builds are BFB with each other.

O_O

Isn't this a bit concerning?

@ndkeen
Copy link
Contributor

ndkeen commented Jun 13, 2024

Yes, it's always bugged me, but there were some strange openmp decisions made long ago (before I joined). Was kinda hoping we would be rid of them once moved to C++/kokkos.

@bartgol
Copy link
Contributor

bartgol commented Jun 14, 2024

Well, in all fairness, it is possible that reduction kernels are handled slightly differently in Serial vs OpenMP. Though that's a remote possibility..

@jonbob
Copy link
Contributor

jonbob commented Jul 9, 2024

Currently BG cases, which have both EAM and MALI active, are broken. So it's not just eamxx? Or was it the removal of Kokkos_ROOT machine settings in PR #6489?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BFB PR leaves answers BFB CMake build system
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants