
[Core][Parallelization] Making explicitly schedule(runtime), with dynamic by default, in OMP loops in ParallelUtils #12923

Open · wants to merge 15 commits into base: master
Conversation


@loumalouomega loumalouomega commented Dec 11, 2024

📝 Description

Making schedule(runtime) explicit, with dynamic by default, in OMP loops in ParallelUtils. I still need to add a benchmark and actually verify that it is faster. This also updates the banner with the parallelism information:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 10.1."0"-core/explicit-schedule-parallel-utili-d7754dadfa-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with Clang-14.0
Compiled with threading and MPI support. Threading support with OpenMP, scheduling dynamic.
Maximum number of threads: 20.
Running without MPI.
  • Add benchmark
  • Compare results

Fixes #12924

🆕 Changelog

@loumalouomega loumalouomega added Kratos Core Performance Parallel-SMP Shared memory parallelism with OpenMP or C++ Threads labels Dec 11, 2024
@loumalouomega loumalouomega changed the title [Core] Making explicitly schedule(dynamic) by default in OMP loops in ParallelUtils [Core][Parallelization] Making explicitly schedule(dynamic) by default in OMP loops in ParallelUtils Dec 11, 2024
…ith dynamic schedule without conflicting the GIL
@RiccardoRossi (Member)

Are you sure this is needed? Because this is C++ code, I don't think the GIL presents a problem here.

@loumalouomega (Member Author)

Are you sure this is needed? Because this is C++ code, I don't think the GIL presents a problem here.

Look at https://github.com/KratosMultiphysics/Kratos/actions/runs/12273173829/job/34243450170

@loumalouomega (Member Author) commented Dec 11, 2024

Are you sure this is needed? Because this is C++ code, I don't think the GIL presents a problem here.

Look at KratosMultiphysics/Kratos/actions/runs/12273173829/job/34243450170

And now it is failing when running tests: https://github.com/KratosMultiphysics/Kratos/actions/runs/12275201329/job/34250231555?pr=12923. I will define it in CMake.

@RiccardoRossi (Member)

@loumalouomega dynamic scheduling is used today, for example, in the builder and solver... without the need to release the GIL.

Why is that different?

@loumalouomega (Member Author)

@loumalouomega dynamic scheduling is used today, for example, in the builder and solver... without the need to release the GIL.

Why is that different?

No idea; look at the outcome from the CI. We tested some functions and the improvement is significant. This was added in a recent version of pybind11: pybind/pybind11#4246

@loumalouomega (Member Author)

Okay, it looks like the last change fixed the issue.

@loumalouomega loumalouomega marked this pull request as ready for review December 11, 2024 14:53
@loumalouomega loumalouomega requested a review from a team as a code owner December 11, 2024 14:53
@loumalouomega (Member Author)

@RiccardoRossi we can set it at runtime with this: https://www.openmp.org/spec-html/5.0/openmpse49.html and keep the current code, setting OMP_SCHEDULE to "dynamic" by default.

@loumalouomega (Member Author)

Modified to be set at runtime, defaulting to dynamic.

@loumalouomega loumalouomega changed the title [Core][Parallelization] Making explicitly schedule(dynamic) by default in OMP loops in ParallelUtils [Core][Parallelization] Making explicitly schedule(runtime), with dynamic by default, in OMP loops in ParallelUtils Dec 12, 2024
@loumalouomega (Member Author)

Okay, it looks like the runtime approach works.

@RiccardoRossi (Member)

Right now, if you have 4 tasks and 1000 items, you will do 250 on each... definitely suboptimal for dynamic scheduling...

@loumalouomega (Member Author) commented Dec 12, 2024

Right now, if you have 4 tasks and 1000 items, you will do 250 on each... definitely suboptimal for dynamic scheduling...

The default is dynamic, not dynamic,4; dynamic,4 is just an example, not the actual default. Anyway, I am seeing that it is not picking up the environment variable properly.

@loumalouomega (Member Author)

Right now, if you have 4 tasks and 1000 items, you will do 250 on each... definitely suboptimal for dynamic scheduling...

The default is dynamic, not dynamic,4; dynamic,4 is just an example, not the actual default. Anyway, I am seeing that it is not picking up the environment variable properly.

Okay, I fixed that issue. BTW, the banner now includes the parallelism information:

 |  /           |                  
 ' /   __| _` | __|  _ \   __|    
 . \  |   (   | |   (   |\__ \  
_|\_\_|  \__,_|\__|\___/ ____/
           Multi-Physics 10.1."0"-core/explicit-schedule-parallel-utili-d7754dadfa-Release-x86_64
           Compiled for GNU/Linux and Python3.10 with Clang-14.0
Compiled with threading and MPI support. Threading support with OpenMP, scheduling dynamic.
Maximum number of threads: 20.
Running without MPI.

@@ -206,7 +206,7 @@ class BlockPartition
KRATOS_PREPARE_CATCH_THREAD_EXCEPTION

TReducer global_reducer;
-#pragma omp parallel for
+#pragma omp parallel for schedule(runtime)
Member:

@loumalouomega as I said, take a look at line 154. It does not make sense to change this unless we change what happens there.

Also, to my understanding, the runtime behaviour potentially has a very high overhead due to the need to make a syscall to fetch an environment variable.

https://stackoverflow.com/questions/7460552/reading-environment-variables-is-slow-operation/7460612#7460612

Not sure if that matters... but at least we need to be aware of this.

Member Author:

We should use the benchmark to check whether it affects performance significantly.

Member Author:

The idea of runtime is to give flexibility; if you prefer, we can define it at compile time...

Member:

@loumalouomega aside from the comments on the opportunity of using OMP_SCHEDULE, did you take a look at what I am writing?

We are doing the chunking "by hand". If we don't change that, it makes no sense to use a different scheduling, as every thread will be working on its own chunk (as of now we do not have more chunks than threads!).

Member Author:

Okay... let me think about this...

Member Author:

[image: benchmark results]

Currently there is no significant effect; I need to rethink this...

Member Author:

In that case we may need to rethink the chunking (to make it dependent on the CPU architecture).

Member Author:

@RiccardoRossi what do you suggest exactly? Because I have been studying this, and our chunking conflicts with the OMP scheduling; a priori the most efficient approach would be to let OMP do the chunking. The problem is that then we lose the parallel_utilities design and the reduction utilities.

# Check if the environment variable OMP_SCHEDULE is defined
if(DEFINED ENV{OMP_SCHEDULE})
    # Set the already defined one
    set(KRATOS_OMP_SCHEDULE $ENV{OMP_SCHEDULE})
Member:

OMP_SCHEDULE is a runtime environment variable; it is an extremely bad idea to use it as a compilation switch (IMO).

Member Author:

I understand, but the idea is the following:

During compilation, OMP_SCHEDULE sets KRATOS_OMP_SCHEDULE, which is used as the default when OMP_SCHEDULE is not defined at runtime; but if OMP_SCHEDULE is defined at runtime, it takes precedence. Do you see what I mean?

@pooyan-dadvand (Member)

I agree with the chunk size argument by @RiccardoRossi.

My point (in #12924) was to first provide a way to select dynamic scheduling in our for-each loop. This would let us fine-tune our parallelization in the many cases where dynamic would be better, or at least not worse.

As for having dynamic as the default, I now understand that it would not work and that the chunk size would be an important blocker...

@RiccardoRossi (Member)

I agree with the chunk size argument by @RiccardoRossi.

My point (in #12924) was to first provide a way to select dynamic scheduling in our for-each loop. This would let us fine-tune our parallelization in the many cases where dynamic would be better, or at least not worse.

As for having dynamic as the default, I now understand that it would not work and that the chunk size would be an important blocker...

To clarify, it is NOT difficult to change the chunking algorithm (I guess it will be about 20 lines of code); I am simply saying that it needs to be done separately from the other changes.

Successfully merging this pull request may close these issues.

[Core][Parallelization] Shall we change our parallel utils to use dynamic scheduling instead of static?
4 participants