[Core][Parallelization] Shall we change our parallel utils to use dynamic scheduling instead of static? #12924

pooyan-dadvand · 2024-12-11T09:15:29Z

The default schedule in most OpenMP implementations is static, since it has the lowest overhead of the available schedules.

However, many modern CPUs have two types of cores: performance cores and efficiency cores. This makes the current static loops inefficient, because the performance cores finish their share much earlier than the others and then wait for a long time. In our test, the effect is about a 3x slowdown on a 24-core laptop.
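As a rough illustration of why equal static shares hurt on mixed cores, here is a minimal sketch of an idealized timing model. The core counts and relative speeds are assumptions chosen for illustration, not measurements from the test above, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Idealized model: each core processes `speed` work units per time unit.
// Static scheduling gives every core the same share, so the loop ends
// only when the slowest core has finished its share.
double StaticLoopTime(double total_work, const std::vector<double>& speeds)
{
    assert(!speeds.empty());
    const double share = total_work / static_cast<double>(speeds.size());
    const double slowest = *std::min_element(speeds.begin(), speeds.end());
    return share / slowest;
}

// Ideal dynamic scheduling (very fine chunks, zero overhead) keeps every
// core busy until the end, so the time is total work / combined speed.
double IdealDynamicLoopTime(double total_work, const std::vector<double>& speeds)
{
    assert(!speeds.empty());
    const double combined = std::accumulate(speeds.begin(), speeds.end(), 0.0);
    return total_work / combined;
}

// Assumed 24-core layout: 8 fast cores and 16 cores at half their speed.
std::vector<double> ExampleSpeeds()
{
    std::vector<double> speeds(8, 1.0);
    speeds.insert(speeds.end(), 16, 0.5);
    return speeds;
}
```

With 2400 work units, this model gives 200 time units for the static loop and 150 for the ideal dynamic one. The gap grows with the speed ratio between the core types; real loops also pay scheduling overhead that the model ignores.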

My proposal is:

1. Change the default schedule to dynamic.
2. Add an additional template argument for scheduling to all for_each functions, so that the schedule can be selected in special situations.
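The two points above could look roughly like the following sketch. `LoopSchedule` and `ForEach` are hypothetical names made up for this example; they are not the existing parallel utilities, whose signatures are not shown in this thread.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical tag for selecting the schedule at compile time.
enum class LoopSchedule { Static, Dynamic };

// Sketch of a for_each helper with dynamic scheduling as the default.
// When compiled without OpenMP the pragmas are ignored and the loops
// simply run serially.
template <LoopSchedule TSchedule = LoopSchedule::Dynamic, class TContainer, class TFunction>
void ForEach(TContainer& rContainer, TFunction&& rFunction)
{
    const std::ptrdiff_t size = static_cast<std::ptrdiff_t>(rContainer.size());
    if (TSchedule == LoopSchedule::Static) {
        #pragma omp parallel for schedule(static)
        for (std::ptrdiff_t i = 0; i < size; ++i) {
            rFunction(rContainer[i]);
        }
    } else {
        #pragma omp parallel for schedule(dynamic)
        for (std::ptrdiff_t i = 0; i < size; ++i) {
            rFunction(rContainer[i]);
        }
    }
}
```

A call such as `ForEach(values, f)` would then use the dynamic schedule, while `ForEach<LoopSchedule::Static>(values, f)` opts back into static for loops with uniform, cheap iterations.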

@KratosMultiphysics/technical-committee

loumalouomega · 2024-12-11T09:25:44Z

Related #12923

loumalouomega · 2024-12-11T09:27:17Z

FYI: https://610yilingliu.github.io/2020/07/15/ScheduleinOpenMP/

loumalouomega · 2024-12-11T09:31:20Z

I would also suggest detecting the OpenMP version so we can add more modern features (OpenMP is up to date on Linux/Mac, but not on Windows). This can be done with the _OPENMP macro, for example:

```cpp
#include <iostream>

// Define macros for different OpenMP versions.
// _OPENMP expands to the release date (yyyymm) of the supported specification.
#if defined(_OPENMP)
    #if _OPENMP >= 201811
        #define OPENMP_VERSION "OpenMPv5.0+"
    #elif _OPENMP >= 201511
        #define OPENMP_VERSION "OpenMPv4.5+"
    #elif _OPENMP >= 200805
        #define OPENMP_VERSION "OpenMPv3.0+"
    #else
        // Anything older than 3.0 (2.0 or 2.5)
        #define OPENMP_VERSION "OpenMPv2.x"
    #endif
#else
    #define OPENMP_VERSION "OpenMP is not supported"
#endif

int main() {
    std::cout << OPENMP_VERSION << std::endl;
    return 0;
}
```

matekelemen · 2024-12-11T13:59:14Z

I'm not against dynamic scheduling, but

1. I don't like the reasoning. Users must at least know the basics of their hardware; we just cannot rack up small performance hits in an effort to pamper every type of hardware at the same time. The performance/efficiency core tradeoff is very similar to hyperthreading, which completely murders our performance.
2. It'd be best to stop relying on OpenMP in the long run and replace it with either raw C++11 threads (or jthreads), or some 3rd party lib.
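For reference, dynamic scheduling on raw C++11 threads can be sketched with a shared atomic counter from which workers claim chunks. This is only an illustration under assumed names (`DynamicForEachIndex` is hypothetical), without a thread pool or exception handling, and it needs the platform's thread library at link time.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sketch: dynamic scheduling without OpenMP. Each worker repeatedly claims
// the next chunk of indices from a shared atomic counter until none is left.
template <class TFunction>
void DynamicForEachIndex(std::size_t size, std::size_t chunk_size,
                         unsigned num_threads, TFunction function)
{
    assert(chunk_size > 0 && num_threads > 0);
    std::atomic<std::size_t> next{0};
    auto worker = [&]() {
        for (;;) {
            const std::size_t begin = next.fetch_add(chunk_size);
            if (begin >= size) break;
            const std::size_t end = std::min(begin + chunk_size, size);
            for (std::size_t i = begin; i < end; ++i) function(i);
        }
    };
    std::vector<std::thread> threads;
    for (unsigned t = 1; t < num_threads; ++t) threads.emplace_back(worker);
    worker();  // the calling thread takes part as well
    for (auto& r_thread : threads) r_thread.join();
}
```

Spawning threads on every call is expensive, so a real replacement would probably keep a persistent pool; the sketch only shows the chunk-claiming idea.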

loumalouomega · 2024-12-11T14:05:11Z

> I'm not against dynamic scheduling, but
>
> 1. I don't like the reasoning. Users must at least know the basics of their hardware; we just cannot rack up small performance hits in an effort to pamper every type of hardware at the same time. The performance/efficiency core tradeoff is very similar to hyperthreading, which completely murders our performance.

We can refactor the utilities to accept more arguments in order to choose the mechanism, but if we want to minimize changes, this is the simplest way.

> 2. It'd be best to stop relying on OpenMP in the long run and replace it with either raw C++11 threads (or jthreads), or some 3rd party lib.

Wouldn't the C++17 parallel algorithms be better? As for alternatives, we have mostly tested Intel's TBB, but that can be problematic.

RiccardoRossi · 2024-12-12T09:14:29Z

> I would also suggest detecting the OpenMP version so we can add more modern features (OpenMP is up to date on Linux/Mac, but not on Windows). This can be done with the _OPENMP macro.

About this: we are essentially restricted to the lowest version because of MSVC. I don't love the idea of having compile-time dependencies on the OpenMP version and different behaviours depending on it (but this is just my current personal opinion, and I am open to contributions on this).

I also agree with @matekelemen that we should transition away from OpenMP and move towards native parallelism.

The point I am raising here is, however, slightly different: as of now we are doing the scheduling by hand, based on partitioning the range into a few chunks (as many as there are cores). If we want to switch to dynamic scheduling, we should change the chunking first.
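To make the chunking point concrete, here is a hypothetical sketch; it is not the actual partitioning code, which is not shown in this thread, and both function names are made up for the example. With one block per thread, a dynamic schedule over the blocks has nothing left to rebalance, so the number of blocks has to exceed the number of threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Splits [0, size) into at most `num_chunks` contiguous blocks of near-equal
// length and returns the boundaries b, where block k is [b[k], b[k+1]).
std::vector<std::size_t> MakeChunkBoundaries(std::size_t size, std::size_t num_chunks)
{
    assert(num_chunks > 0);
    num_chunks = std::min(num_chunks, std::max<std::size_t>(size, 1));
    std::vector<std::size_t> boundaries(num_chunks + 1);
    for (std::size_t k = 0; k <= num_chunks; ++k) {
        boundaries[k] = (size * k) / num_chunks;
    }
    return boundaries;
}

// Loops over the blocks with a dynamic schedule. Passing several blocks per
// thread lets fast cores pick up extra blocks while slow cores are busy.
template <class TFunction>
void ForEachChunked(std::size_t size, std::size_t num_chunks, TFunction&& rFunction)
{
    const std::vector<std::size_t> boundaries = MakeChunkBoundaries(size, num_chunks);
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(boundaries.size()) - 1;
    #pragma omp parallel for schedule(dynamic)
    for (std::ptrdiff_t k = 0; k < count; ++k) {
        for (std::size_t i = boundaries[k]; i < boundaries[k + 1]; ++i) {
            rFunction(i);
        }
    }
}
```

How many blocks per thread is a trade-off between load balance and scheduling overhead, and the right value would need to be measured.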

On the positive side, I think that @loumalouomega's argument for transitioning to dynamic stems from the trend towards heterogeneous cores (E-cores and P-cores on Intel). In a context where not all cores are the same, it does make sense to use dynamic over static...

loumalouomega · 2024-12-12T09:31:43Z

We can also define it at run time and detect the CPU type before assigning the schedule type.
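A minimal sketch of that idea, assuming the loops are declared with schedule(runtime): the schedule is then set once through omp_set_schedule, which requires OpenMP 3.0 or newer and is therefore guarded here. How heterogeneous cores are detected is platform-specific and left as a boolean input; `ScheduleKind`, `ChooseSchedule`, `ApplySchedule` and `RuntimeScheduledFor` are hypothetical names.

```cpp
#include <cstddef>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

enum class ScheduleKind { Static, Dynamic };

// Hypothetical policy: prefer dynamic scheduling when cores differ in speed.
ScheduleKind ChooseSchedule(bool heterogeneous_cores)
{
    return heterogeneous_cores ? ScheduleKind::Dynamic : ScheduleKind::Static;
}

// Stores the choice in OpenMP's run-time schedule setting, which loops
// declared with schedule(runtime) read. Does nothing without OpenMP 3.0+.
void ApplySchedule(ScheduleKind kind, int chunk_size)
{
#if defined(_OPENMP) && _OPENMP >= 200805
    omp_set_schedule(kind == ScheduleKind::Dynamic ? omp_sched_dynamic
                                                   : omp_sched_static,
                     chunk_size);
#else
    (void)kind;
    (void)chunk_size;
#endif
}

// Loop whose schedule is decided at run time rather than at compile time.
template <class TFunction>
void RuntimeScheduledFor(std::ptrdiff_t size, TFunction&& rFunction)
{
    #pragma omp parallel for schedule(runtime)
    for (std::ptrdiff_t i = 0; i < size; ++i) {
        rFunction(i);
    }
}
```

On compilers limited to OpenMP 2.0, schedule(runtime) can still be steered through the OMP_SCHEDULE environment variable, so the loop itself would not need to change.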

loumalouomega linked a pull request Dec 11, 2024 that will close this issue

[Core][Parallelization] Making explicitily schedule(runtime), with dynamic by default, in OMP loops in ParallelUtils #12923
