WIP: Testing some par_dispatch stuff #1156

Open
wants to merge 32 commits into
base: develop
from


Collaborator

lroberts36 commented Aug 21, 2024

PR Summary

This is an attempt to see whether some of the template magic in #1142 could be written a little differently. It copies the ideas there but structures the code differently. It seems to work both on CPU and on device.

  • All loops can be called with either sets of integers (as currently supported) or sets of IndexRanges defining the loop bounds. This is enabled by LoopBoundTranslator.
  • All loop patterns should work for any rank of loop (with exceptions such as LoopPatternTPTTR, which requires at least rank 2)
  • Things are set up to try different types of work partitioning in hierarchical loops like TPTTR
  • If an unsupported loop pattern is requested for a particular loop, the dispatch should automatically fall through to a pattern that does support that type of loop (generally FlatRange).
  • Adds ThreadVectorRange as an option for par_for_inner
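
To make the first and fourth bullets concrete, here is a minimal, Kokkos-free sketch of the two ideas: translating a mixed argument list of integer pairs and index ranges into a fixed-size array of bounds, and resolving a requested pattern to a fallback when the loop rank is too low. This is not the code in this PR. The `IndexRange` struct, `TranslateBounds`, `LoopRank`, `MinRank`, and `ResolvedPattern` below are hypothetical stand-ins written only for illustration, and the tag types merely reuse the pattern names mentioned above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

// Stand-in for an inclusive index range (field names s/e are assumptions).
struct IndexRange {
  int s;
  int e;
};

// Count how many of the argument types are IndexRange.
template <class... Ts>
struct CountRanges : std::integral_constant<std::size_t, 0> {};
template <class T, class... Ts>
struct CountRanges<T, Ts...>
    : std::integral_constant<std::size_t,
                             (std::is_same<T, IndexRange>::value ? 1 : 0) +
                                 CountRanges<Ts...>::value> {};

// Loop rank: each IndexRange is one dimension, each (lower, upper) integer
// pair is one dimension.
template <class... Ts>
constexpr std::size_t LoopRank() {
  static_assert((sizeof...(Ts) - CountRanges<Ts...>::value) % 2 == 0,
                "integer bounds must come in (lower, upper) pairs");
  return CountRanges<Ts...>::value +
         (sizeof...(Ts) - CountRanges<Ts...>::value) / 2;
}

// Recursive helpers that consume either one IndexRange or two integers.
inline void FillBounds(IndexRange *) {}
template <class... Ts>
void FillBounds(IndexRange *out, IndexRange r, Ts... rest);
template <class... Ts>
void FillBounds(IndexRange *out, int lo, int hi, Ts... rest);

template <class... Ts>
void FillBounds(IndexRange *out, IndexRange r, Ts... rest) {
  *out = r;
  FillBounds(out + 1, rest...);
}
template <class... Ts>
void FillBounds(IndexRange *out, int lo, int hi, Ts... rest) {
  *out = IndexRange{lo, hi};
  FillBounds(out + 1, rest...);
}

// Translate any mix of IndexRanges and integer pairs into an array of bounds.
template <class... Ts>
std::array<IndexRange, LoopRank<Ts...>()> TranslateBounds(Ts... bounds) {
  std::array<IndexRange, LoopRank<Ts...>()> out{};
  FillBounds(out.data(), bounds...);
  return out;
}

// Tag types standing in for loop patterns.
struct LoopPatternFlatRange {};
struct LoopPatternTPTTR {};

// Minimum rank each pattern supports (assumed: 1 unless specialized).
template <class Pattern>
struct MinRank : std::integral_constant<std::size_t, 1> {};
template <>
struct MinRank<LoopPatternTPTTR> : std::integral_constant<std::size_t, 2> {};

// Fall through to FlatRange when the requested pattern cannot handle the rank.
template <class Pattern, std::size_t Rank>
using ResolvedPattern =
    std::conditional_t<(Rank >= MinRank<Pattern>::value), Pattern,
                       LoopPatternFlatRange>;
```

With these definitions, `TranslateBounds(IndexRange{0, 3}, 2, 5)` yields a rank-2 array of bounds, and `ResolvedPattern<LoopPatternTPTTR, 1>` is `LoopPatternFlatRange`. Resolving the fallback at compile time keeps the choice out of the kernel launch path.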

PR Checklist

  • Code passes cpplint
  • New features are documented.
  • Adds a test for any bugs fixed. Adds tests for new features.
  • Code is formatted
  • Changes are summarized in CHANGELOG.md
  • Change is breaking (API, behavior, ...)
    • Change is additionally added to CHANGELOG.md in the breaking section
    • PR is marked as breaking
    • Short summary API changes at the top of the PR (plus optionally with an automated update/fix script)
  • CI has been triggered on Darwin for performance regression tests.
  • Docs build
  • (@lanl.gov employees) Update copyright on changed files

Collaborator Author

lroberts36 commented Aug 22, 2024

I tested vectorization using this branch in Riot with gcc/9.4.0 on a skylake-platinum node with -O3 -march=skylake-avx512. With the old version of par_dispatch the simd for loop is vectorized in 191 places, and with the new version it is vectorized in 200 places. I'm not sure what to make of this difference; it is hard to determine more from the gcc vectorization report.
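
For anyone wanting to reproduce this kind of count on a small case, the sketch below shows the shape of an innermost loop marked for SIMD. It is a standalone illustration, not code from this branch: the `Axpy` kernel is hypothetical. As far as I know, gcc only honors `#pragma omp simd` when `-fopenmp` or `-fopenmp-simd` is passed, and `-fopt-info-vec-optimized` prints one note per vectorized loop, so those notes can be counted. Whether the numbers above were gathered with exactly those options is not stated here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kernel: y[i] += a * x[i] over the common length of x and y.
// The pragma asks the compiler to vectorize the loop; it is ignored unless
// OpenMP SIMD support is enabled at compile time.
inline void Axpy(double a, const std::vector<double> &x,
                 std::vector<double> &y) {
  const double *xp = x.data();
  double *yp = y.data();
  const std::size_t n = x.size() < y.size() ? x.size() : y.size();
#pragma omp simd
  for (std::size_t i = 0; i < n; ++i) {
    yp[i] += a * xp[i];
  }
}
```

Comparing the vectorization notes for the same kernel dispatched through the old and new code paths is one way to see whether the extra vectorized loops come from the dispatch layer or from additional template instantiations.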

The run-time difference for a 3D AMR test problem between these two branches is negligible. The old one wins by maybe a couple of percent, but I think the run-to-run variance is larger than that.
