
[NASA:Update] Distributed dace cache (rework) #16

Merged: 71 commits merged into NOAA-GFDL:main on Sep 11, 2023

Conversation

@FlorianDeconinck (Collaborator) commented Aug 23, 2023:

Purpose

The original strategy to compile the DaCe orchestrated backend was:

  • Compile on a 3x3 layout, writing caches .gt_cache_00000 -> .gt_cache_00008, reflecting the 9 possible codepaths
  • Any other layout would RUN by mapping each rank onto one of those 9 caches

Unfortunately this is hard to follow and demands a compile at the 3x3 layout before doing any other runs.

The new strategy cleans up the code and generates the correct caches for any layout. E.g.:

  • 9 ranks will compile, tagging the caches TL, T, TR, R, BR, B, BL, L (for TopLeft, Top, etc.), which describes the actual code path on the cube-sphere face
  • At RUN, any rank on any layout computes which codepath it is part of and loads the matching cache (see the sketch below)
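
A minimal sketch of that run-time codepath resolution, assuming each rank knows its (x, y) position in the face layout. FV3CodePath and identify_code_path do appear in this PR, but the enum values and signature below are illustrative guesses at the shape of the API, not the merged code:

    from enum import Enum


    class FV3CodePath(Enum):
        # The 8 edge/corner codepaths named in the PR description, plus the
        # interior ("Center") one that completes the 9 caches.
        TopLeft = "TL"
        Top = "T"
        TopRight = "TR"
        Left = "L"
        Center = "C"
        Right = "R"
        BottomLeft = "BL"
        Bottom = "B"
        BottomRight = "BR"


    def identify_code_path(x: int, y: int, layout_x: int, layout_y: int) -> FV3CodePath:
        """Map a rank at (x, y) in a layout_x-by-layout_y face layout to the
        codepath it executes (a 1x1 layout would need special-casing)."""
        left, right = x == 0, x == layout_x - 1
        bottom, top = y == 0, y == layout_y - 1
        if top and left:
            return FV3CodePath.TopLeft
        if top and right:
            return FV3CodePath.TopRight
        if bottom and left:
            return FV3CodePath.BottomLeft
        if bottom and right:
            return FV3CodePath.BottomRight
        if top:
            return FV3CodePath.Top
        if bottom:
            return FV3CodePath.Bottom
        if left:
            return FV3CodePath.Left
        if right:
            return FV3CodePath.Right
        return FV3CodePath.Center

Under this scheme a rank at (2, 3) on a 4x4 layout resolves to Top and loads the T-tagged cache, regardless of which layout produced that cache.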

The same system should be deployed for the gt backends, but it is more complex there due to the atomic nature of compilation, and is therefore not part of this work.

This PR also synchronizes the NASA & NOAA forks.

Code changes:

  • ETA values for 137 levels
  • GEOS wrapper bridge prints more information

Checklist

Before submitting this PR, please make sure:

  • You have followed the coding standards guidelines established at Code Review Checklist.
  • Docstrings and type hints are added to new and updated routines, as appropriate
  • All relevant documentation has been updated or added (e.g. README, CONTRIBUTING docs)
  • For each public change and fix in pace-util, HISTORY has been updated
  • Unit tests are added or updated for non-stencil code changes

pchakraborty and others added 30 commits January 26, 2023 20:47
2. Made GEOS specific changes to thresholds in saturation adjustment
Parametrize tool with backend, output format
Add saturation adjustement threshold to const
Fix bad merge for bdt with GEOS_Wrapper
@FlorianDeconinck (Collaborator, Author):

We are having a problem with determinism of temporaries. The unit test fails sometimes, which sounds bad, but the regression tests pass fine.

@FlorianDeconinck (Collaborator, Author):

> We are having a problem with determinism of temporaries. The unit test fails sometimes, which sounds bad, but the regression tests pass fine.

This is now solved.

@fmalatino (Contributor) left a comment:

Reviewed with Oliver

@oelbert (Contributor) left a comment:

Overall a good PR. Minor points about some of the structure of the config/cache checking, and the heat_source and diss_est issue keeps coming back up.

Comment on lines -188 to +186
-    backend = quantity_factory.empty(
+    backend = quantity_factory.zeros(
Contributor:

Why does it matter whether we use empty or zeros here? It seems like setting the memory to 0 is an unnecessary step. Also is there a better way to get at the backend than through a Quantity?

FlorianDeconinck (Collaborator, Author):

This is to fix the "deterministic" test in utest: empty grabs whatever memory is available, so the test fails because halos end up with the wrong values, which then get passed down. I did a blanket replacement of all the empty calls with zeros; since none of them execute at runtime, the small extra cost of zeroing them out does not matter.
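
For illustration, a small numpy analogy of the empty-vs-zeros trade-off described here (quantity_factory is pace's allocator; the numpy calls below only mirror its behavior):

    import numpy as np

    # empty() returns uninitialized memory: halo cells that no stencil ever
    # writes keep whatever garbage the allocator handed back, so bit-for-bit
    # comparisons between two runs can differ.
    field = np.empty((8, 8))

    # zeros() guarantees a deterministic initial state. The extra memset is
    # paid once at allocation, which is negligible when allocations happen at
    # setup rather than inside the time-stepping loop.
    field = np.zeros((8, 8))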

Resolved threads: driver/pace/driver/initialization.py, dsl/pace/dsl/caches/codepath.py (outdated)
On dsl/pace/dsl/dace/build.py:
from pace.util import CubedSpherePartitioner


def identify_code_path(
Contributor:

Something about this feels awkward to me, but I'm not sure making this live inside of a more fully-featured FV3CodePath class makes sense?

FlorianDeconinck (Collaborator, Author):

We need a proper build class/system that uproots most of this and consolidates it in one place. It should include distributed compiling for non-orchestrated backends, better hashing/caching, and much more.

I went for a functional paradigm in the meantime, since I fully expect this to be reworked.

Contributor:

Yeah, I don't have a super compelling alternative and it's not really blocking; it just felt awkward when I read it.

Resolved thread (outdated): fv3core/pace/fv3core/stencils/d_sw.py
attr1, attr2, err_msg=f"{attr} not equal"
)
except AssertionError:
assert np.allclose(attr1, attr2, equal_nan=True)
Contributor:

Do we have NaNs in our temporaries? Is there another reason for this change?

FlorianDeconinck (Collaborator, Author):

Yes. Grid/metric term generation produces NaNs.

bensonr (Contributor):

I don't recall this happening in the Fortran, since debug mode checks for everything but underflows, which are flushed to zero. Do you know what causes this in the Python code?

FlorianDeconinck (Collaborator, Author):

We have NaNs in the geometry under some circumstances, at least:

/home/runner/work/pace/pace/util/pace/util/grid/geometry.py:516: RuntimeWarning: invalid value encountered in divide
    del6_v = sina_u * dy / dxc

I believe those are in the halo.
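
A minimal reproduction of the pattern under discussion, with illustrative array shapes: an unfilled halo region yields 0/0 in the metric-term computation, which produces NaN, so only a NaN-aware comparison such as np.allclose(..., equal_nan=True) treats two identical runs as equal:

    import numpy as np

    sina_u = np.ones((4, 4))
    dy = np.ones((4, 4))
    dxc = np.ones((4, 4))
    # Pretend the first row is halo that was never filled with valid metric
    # terms: 0/0 below produces NaN (the "invalid value" RuntimeWarning).
    sina_u[0, :] = 0.0
    dxc[0, :] = 0.0

    with np.errstate(invalid="ignore"):
        del6_v = sina_u * dy / dxc

    run_a, run_b = del6_v, del6_v.copy()
    # Plain equality fails because NaN != NaN ...
    assert not np.array_equal(run_a, run_b)
    # ... while equal_nan=True accepts matching NaNs.
    assert np.allclose(run_a, run_b, equal_nan=True)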

Resolved threads: tests/main/dsl/test_caches.py (outdated), util/pace/util/grid/eta.py
@bensonr (Contributor) commented Aug 29, 2023:

@FlorianDeconinck - in the orchestration.py source file, the highlighted comments are outdated with respect to the new logic. Please update them.

@bensonr (Contributor) left a comment:

Looks good. Please see the non-review comment as well as the inline review comments.


@FlorianDeconinck (Collaborator, Author):

All issues raised are logged.

@oelbert (Contributor) left a comment:

Awesome, thanks

@bensonr merged commit b1ef6b5 into NOAA-GFDL:main on Sep 11, 2023. 2 checks passed.
@FlorianDeconinck deleted the up/feature/distributed_dace_cache branch on September 12, 2023.