
Discussion on performance distortions introduced by the instrumenting profilers #64

Open
tyoma opened this issue Jan 12, 2021 · 2 comments

tyoma (Owner) commented Jan 12, 2021

No description provided.

tyoma (Owner, Author) commented Jan 12, 2021

@Lectem Can you please share your code? Below is what I'm getting.
MicroProfiler does indeed compensate the measured times for the latencies caused by its own overhead. This compensation, though, can only work accurately when the processor speed is fixed, which may not be the case when the 'Balanced' power profile is active on Windows: you need to fix the processor frequency by setting both minimum and maximum processor power to 99% in the power profile settings.

(screenshot)
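As an illustration of the compensation idea (a minimal sketch with made-up numbers, not MicroProfiler's actual implementation): the profiler calibrates the cost of one enter/leave probe pair in advance and subtracts it from every raw measurement, which is only valid while that calibrated cost stays constant, i.e. while the clock frequency is pinned.

```cpp
#include <cassert>

// Sketch: subtract the calibrated probe cost from a raw measurement.
// probe_pairs is how many instrumented enter/leave pairs ran inside the
// measured interval; per_probe_pair_ns is the calibrated cost of one pair.
// If the CPU frequency changes after calibration, per_probe_pair_ns no
// longer matches reality and the compensation over- or under-corrects.
long long compensated_ns(long long measured_ns, long long probe_pairs,
                         long long per_probe_pair_ns)
{
    return measured_ns - probe_pairs * per_probe_pair_ns;
}
```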

This is the sample:

```cpp
template <typename IteratorT>
typename iterator_traits<IteratorT>::value_type pairwize_sum(IteratorT begin, IteratorT end)
{
    switch (const auto d = distance(begin, end))
    {
    case 1: return *begin;
    default: return pairwize_sum(begin, begin + d / 2) + pairwize_sum(begin + d / 2, end);
    case 2: const auto v1 = *begin; return v1 + *++begin;
    }
}
```

UPD: I realized my sample wasn't correct because `*begin + *++begin` has unsequenced side effects; it's now fixed, but that didn't affect the outcome.
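For reference, here is the snippet wrapped into a self-contained form (my harness, not part of the original comment); note that it assumes a non-empty range, since a distance of 0 would fall into the default branch and recurse forever:

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Copy of the pairwize_sum snippet above, std::-qualified for portability.
template <typename IteratorT>
typename std::iterator_traits<IteratorT>::value_type pairwize_sum(IteratorT begin, IteratorT end)
{
    switch (const auto d = std::distance(begin, end))
    {
    case 1: return *begin;
    default: return pairwize_sum(begin, begin + d / 2) + pairwize_sum(begin + d / 2, end);
    case 2: { const auto v1 = *begin; return v1 + *++begin; }
    }
}
```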

Lectem commented Jan 12, 2021

(transferred from #63)

1st comment:

> What do you mean by 'layout changed'? What did it look like before?

See the space between the pie chart and the first column.

> As a sidenote: I get what you're saying about inlining - it may reduce the number of functions inlined, but not prevent inlining. You can check it out yourself here - I ran profiling of the same agge rendition code with and without inlining (in Release, all other optimizations and flags the same). You can look at the spreadsheet or at the profiles attached (you can open them from Extensions/MicroProfiler/Open...): agge-rendition-with-or-without-inlining.zip

My point was that it entirely changes the measured performance of a function as soon as it has a few callees, and it's even worse if they are not inlined. See the following:


| noinstru |  instru  | Summation 256
|---------:|---------:|:--------------
|   100.0% |   100.0% | `Naive raw sum`
|   100.6% |    83.7% | `std::accumulate`
|   152.8% |    21.7% | `pair-wise summation`
|   146.6% |    21.5% | `pair-wise (iterators) summation`
|   290.8% |   127.8% | `pair-wise simd summation`
|    20.9% |    28.0% | `kahan summation`
|    66.5% |    83.2% | `neumaier summation`
|    59.4% |    76.7% | `knuth summation`
|   817.2% |   442.0% | `naive raw sum ispc`
|   162.7% |   158.7% | `kahan summation ispc`
|   306.7% |   244.1% | `neumaier summation ispc`
|   375.2% |   322.2% | `knuth summation ispc`
            			
| noinstru |  instru  | Summation 1024
|---------:|---------:|:---------------
|   100.0% |   100.0% | `Naive raw sum`
|   100.0% |    98.2% | `std::accumulate`
|   177.5% |    12.2% | `pair-wise summation`
|   170.0% |    18.7% | `pair-wise (iterators) summation`
|   324.3% |   117.8% | `pair-wise simd summation`
|    23.2% |    25.0% | `kahan summation`
|    77.2% |    81.3% | `neumaier summation`
|    67.3% |    66.0% | `knuth summation`
|   848.8% |   565.0% | `naive raw sum ispc`
|   190.7% |   186.0% | `kahan summation ispc`
|   418.9% |   397.2% | `neumaier summation ispc`
|   508.8% |   464.3% | `knuth summation ispc`
            			
| noinstru |  instru  | Summation 1048576
|---------:|---------:|:------------------
|   100.0% |   100.0% | `Naive raw sum`
|   100.0% |    99.8% | `std::accumulate`
|   174.8% |    17.1% | `pair-wise summation`
|   174.4% |    13.5% | `pair-wise (iterators) summation`
|   289.4% |   106.8% | `pair-wise simd summation`
|    23.9% |    23.9% | `kahan summation`
|    81.1% |    80.8% | `neumaier summation`
|    72.4% |    72.4% | `knuth summation`
|   789.4% |   791.3% | `naive raw sum ispc`
|   199.4% |   198.5% | `kahan summation ispc`
|   453.2% |   446.5% | `neumaier summation ispc`
|   554.3% |   552.5% | `knuth summation ispc`

When instrumented (with MicroProfiler), you reach the opposite conclusion when comparing naive and pair-wise summation (which is recursive).
I suppose MicroProfiler does something to compensate a bit, but it still shows the wrong picture:

(screenshot)
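My guess at the mechanism behind this inversion (an illustrative model with assumed numbers, not MicroProfiler internals): every instrumented call carries a small residual probe cost that the compensation does not fully remove, and the recursive summation makes thousands of instrumented calls per sum while the naive loop makes one:

```cpp
#include <cassert>

// apparent time = true time + (number of instrumented calls) * the residual
// per-call overhead that survives the profiler's compensation.
// All numbers fed into this are illustrative, not measured.
long long apparent_ns(long long true_ns, long long calls,
                      long long residual_per_call_ns)
{
    return true_ns + calls * residual_per_call_ns;
}
```

For example, if the recursive sum really takes 60 µs and the naive one 100 µs, ~16000 residual-laden calls at an assumed 10 ns each make the recursive version appear to take ~220 µs, which is the kind of inversion the tables show.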


Answer to tyoma's comment above:

Yes, I am aware of frequency changes and turbo boost; I did set my minimum and maximum frequency to 99%.

> @Lectem Can you please, share your code? Because below is what I'm getting.

Sure!

Try the code in the following benchmark https://quick-bench.com/q/Kd1zaYVHvOo8gOWu-2iER7Z9bBU to get an idea of the execution-time ratio between sum_naive and sum_pairwise.

Then try the following (it should work by simply copy-pasting when using MSVC).
To remove any bias:

  • Make sure to use the doNotOptimizeAwaySink function
  • Make sure you run both tests the same number of times
  • Run them in multiple passes over the same data to remove cache bias
  • I used noinline because otherwise naive_sum would not be shown by the profiler; note that it had no impact on the results of sum_pairwise
```cpp
#include <algorithm>
#include <vector>
#include <numeric>
#include <thread>
#include <chrono>
#include <cstddef>  // ptrdiff_t
#include <cstdio>   // printf
#include <cstdlib>  // rand, RAND_MAX
#include <intrin.h> // _ReadWriteBarrier (MSVC)

void doNotOptimizeAwaySink(const void*);

template <class Tp>
__forceinline void DoNotOptimize(Tp const& value) {
    doNotOptimizeAwaySink(&value);
    _ReadWriteBarrier();
}

__declspec(noinline) double naive_sum(const double* begin, const double* end)
{
    double acc = double(0);
    while (begin != end)
        acc += *(begin++);
    return acc;
}

__declspec(noinline) double pairwise_sum(ptrdiff_t n, const double* values)
{
    if (n <= 16)
        return std::accumulate(values, values + n, double(0));

    const ptrdiff_t pivotIdx = n / 2;
    return pairwise_sum(pivotIdx, values) + pairwise_sum(n - pivotIdx, values + pivotIdx);
}

int main()
{
    std::vector<double> values;
    values.resize(100000);
    std::generate(values.begin(), values.end(), [] { return double(rand()) / RAND_MAX; });

    int nbPasses = 100;
    int nbIterations = 1000;

    for (int i = 0; i < nbPasses; i++)
    {
        for (int iter = 0; iter < nbIterations; iter++) {
            auto s = naive_sum(values.data(), values.data() + values.size());
            DoNotOptimize(s);
        }

        for (int iter = 0; iter < nbIterations; iter++) {
            auto s = pairwise_sum(values.size(), values.data());
            DoNotOptimize(s);
        }
    }

    printf("Start sleeping\n");
    using namespace std::chrono_literals;
    // put a breakpoint here!
    std::this_thread::sleep_for(20s);
}

#pragma optimize("", off)
void doNotOptimizeAwaySink(void const*) {}
#pragma optimize("", on)
```

MicroProfiler still tells me that pairwise_sum is 1.7 times slower than naive, while in reality it should be the opposite!

(screenshot)
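One way to quantify why the recursive version is hit so much harder (my own estimate, not something MicroProfiler reports): count the instrumented calls a single invocation of pairwise_sum makes. The function below mirrors the splitting logic of the benchmark above (leaf when n <= 16):

```cpp
#include <cassert>
#include <cstddef>

// Number of calls one invocation of pairwise_sum(n, ...) makes:
// one call for the current node, plus the two recursive halves.
long long pairwise_call_count(ptrdiff_t n)
{
    if (n <= 16)
        return 1;
    const ptrdiff_t pivotIdx = n / 2;
    return 1 + pairwise_call_count(pivotIdx) + pairwise_call_count(n - pivotIdx);
}
```

For n = 100000 every leaf of the recursion tree sits at depth 13, so one sum makes 2^14 - 1 = 16383 instrumented calls, versus exactly one call for naive_sum; any residual per-probe overhead is multiplied accordingly.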

It's even worse if I remove the small-array-size optimization and use the following checks (closer to your implementation) instead of the if (n <= 16):

```cpp
if (n == 1)
    return *values;

if (n == 2)
    return *values + *(values + 1);
```

Here is what I get (it took almost 5 minutes to finish):

(screenshot)
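A back-of-the-envelope for this variant (my numbers, not measured): with the n == 1 / n == 2 base cases the leaves hold at most two elements, so one 100000-element sum makes at least ~100000 instrumented calls, and the harness's 100 passes x 1000 iterations push that past 10^10 probe pairs, which is consistent with a multi-minute run at just a few nanoseconds per probe.

```cpp
#include <cassert>
#include <cstddef>

// Call count for the variant with n == 1 / n == 2 base cases: every leaf
// holds 1 or 2 elements, so for n elements there are at least n/2 leaves
// and at least n - 1 calls in total.
long long pairwise_call_count_small(ptrdiff_t n)
{
    if (n <= 2)
        return 1;
    const ptrdiff_t pivotIdx = n / 2;
    return 1 + pairwise_call_count_small(pivotIdx) + pairwise_call_count_small(n - pivotIdx);
}
```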
