[DRAFT] GR1: Additional Vectorization Pass supporting more fusion potentials. #870
@philipportner @pdamme
Here is my work on the new vectorization capabilities in DAPHNE, reduced to the first set of changes. It adds more fusion potential by increasing the number of compatible operations and situations (mainly horizontal/sibling fusion).
To reproduce the unexpected behaviour of slower execution when horizontal fusion is used, you will find a Python script named run_horz.py in the root directory of the repo. It generates benchmark scripts and measures the performance of the current implementation with and without horizontal fusion (see the sketch after the flag list below).
e.g.
python3 run_horz.py --tool PAPI_STD --script ADD --verbose-output --explain --num-ops 10 --threads 1 --rows 30000 --cols 30000 --batchSize 0 --samples 2
--tool: selects the measuring tool: PAPI_STD, PAPI_L1, PAPI_MPLX, or NOW (these can be found and configured in shared.py). NOW measures with now() inside the generated DAPHNE script.
--script: selects one of the two script templates, ADD or ADD_SUM.
--num-ops: specifies the number of ops/pairs N.
--threads: how many threads should be used for the vectorized execution.
--rows and --cols: specify the size of the shared input matrix X.
--batchSize: number of rows per vectorized task; 0 means the normal behaviour of the MTWrapper (task size calculated based on 8 MB).
--samples: number of executions for each of the two settings (with and without horizontal fusion).
--verbose-output: prints the stdout and stderr of each run of the DAPHNE executable.
--explain: adds --explain=vectorized to the command used to run DAPHNE.
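To illustrate what the comparison does, here is a minimal sketch of the kind of measurement loop run_horz.py performs. This is not the actual implementation: the --no-hf and --explain=vectorized flags and the script name _horz.py are taken from the description above, while the bin/daphne path, the --vec flag for enabling the vectorized engine, and the flag placement are assumptions.

# Minimal sketch of the with/without horizontal-fusion comparison;
# not the actual run_horz.py, paths and flag placement are assumptions.
import subprocess
import time

def run_once(extra_flags):
    # Run the generated script through the daphne binary and return the wall-clock time.
    cmd = ["bin/daphne", "--vec", "--explain=vectorized"] + extra_flags + ["_horz.py"]
    start = time.perf_counter()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.perf_counter() - start

samples = 2
with_hf = [run_once([]) for _ in range(samples)]              # horizontal fusion enabled
without_hf = [run_once(["--no-hf"]) for _ in range(samples)]  # horizontal fusion disabled
print(f"with horz. fusion:    {min(with_hf):.3f} s")
print(f"without horz. fusion: {min(without_hf):.3f} s")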
The generated DAPHNE script that gets executed is placed in the CWD from which the Python script was run; it is named _horz.py (a sketch of its rough structure follows below).
Needed packages: numpy, tabulate, pandas (the latest versions should work).
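To make the idea concrete, a hypothetical generator for the ADD case might look roughly like this. The exact DaphneDSL code that run_horz.py emits is not shown here, so the N independent additions on the shared matrix X and the rand()/print(sum()) calls are assumptions for illustration only.

# Hypothetical sketch of generating a benchmark script for the ADD case;
# the real generated code in run_horz.py may differ.
def write_add_script(path, num_ops, rows, cols):
    lines = [f"X = rand({rows}, {cols}, 0.0, 1.0, 1.0, 42);"]
    for i in range(num_ops):
        # N independent element-wise additions on the shared input X;
        # these are the candidates for horizontal/sibling fusion.
        lines.append(f"Y{i} = X + X;")
    for i in range(num_ops):
        # Consume the results so the ops are not optimized away (sketch only).
        lines.append(f"print(sum(Y{i}));")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

write_add_script("_horz.py", num_ops=10, rows=30000, cols=30000)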
For DAPHNE itself, we introduced the following arguments (an example invocation follows the list):
--vec-type (GREEDY_1 or DAPHNE): selects the vectorization strategy (DAPHNE is not tested).
--no-hf: deactivates the horizontal fusion pass.
--batchSize: allows experimenting with the task size of a vectorized execution.
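A possible invocation combining these flags would be, e.g. (the bin/daphne entry point and the existing --vec flag are assumptions, script.daphne is a placeholder; only the new flags themselves come from this PR):
bin/daphne --vec --vec-type GREEDY_1 --batchSize 0 --explain=vectorized script.daphne
Adding --no-hf to the same command disables the horizontal fusion pass for a comparison run.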
Let me know if you need anything else.