This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
This PR applies a technique similar to the one used in the segmented sort algorithm: segments are partitioned by size, and a different thread group is assigned to each size category. While optimizing segmented reduction, I introduced a warp reduce agent and generalized the reduce agent implementation. Below are the speedups for small segment sizes; the best speedup is about 66x:
![small](https://user-images.githubusercontent.com/9890394/193343154-c55bc4cb-ef73-4830-862f-a5765dd68c74.png)
Medium-sized segments see minor slowdowns, which further tuning could address:
![mid](https://user-images.githubusercontent.com/9890394/193343515-53797e3b-00af-4aea-bb1e-44334563a4f8.png)
Large segments are not affected by the optimization:
![large](https://user-images.githubusercontent.com/9890394/193343560-f62eac98-8f42-468e-9577-b92477026f31.png)
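For readers unfamiliar with the segmented sort approach, the partitioning idea can be sketched on the host in plain C++. This is only an illustration, not the code in this PR: the thresholds `small_segment_max` and `medium_segment_max`, the `SegmentPartition` struct, and `partition_segments` are hypothetical names and values, and the real implementation performs this step on the device.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical thresholds for illustration; the tuned values used by the PR
// are not stated in this description.
constexpr int small_segment_max  = 32;   // assumed: small enough for a warp or less
constexpr int medium_segment_max = 1024; // assumed: small enough for one thread block

// Segment indices grouped by size category (hypothetical helper type).
struct SegmentPartition
{
  std::vector<int> small_ids;
  std::vector<int> medium_ids;
  std::vector<int> large_ids;
};

// `offsets` holds num_segments + 1 entries; segment i covers
// the half-open range [offsets[i], offsets[i + 1]).
inline SegmentPartition partition_segments(const std::vector<int>& offsets)
{
  SegmentPartition result;

  for (std::size_t i = 0; i + 1 < offsets.size(); ++i)
  {
    const int size = offsets[i + 1] - offsets[i];
    const int id   = static_cast<int>(i);

    if (size <= small_segment_max)
    {
      result.small_ids.push_back(id);
    }
    else if (size <= medium_segment_max)
    {
      result.medium_ids.push_back(id);
    }
    else
    {
      result.large_ids.push_back(id);
    }
  }

  return result;
}
```

Each category can then be processed by a kernel whose thread group size matches it, which is where the gain for small segments comes from: a whole thread block is no longer spent on a segment of a few items.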
The commits include an attempt to fuse the reduction of small segments with the partitioning stage. This optimization doesn't perform as well. My guess is that it slows down decoupled look-back in the partitioning stage or affects its occupancy, which leads to an overall slowdown.
To avoid breaking stream capture, if it is in use, I added a separate check for it. We might need to cover stream capture mode in our tests later.