
Discussion on performance distortions introduced by the instrumenting profilers #64

Open
tyoma opened this issue Jan 12, 2021 · 2 comments

tyoma (Owner) commented Jan 12, 2021

No description provided.

tyoma (Owner, Author) commented Jan 12, 2021

@Lectem Can you please share your code? Below is what I'm getting.
MicroProfiler does indeed compensate the measured times for the latencies caused by its own overhead. This compensation, though, can only work accurately when the processor speed is fixed, which may not be the case when the 'Balanced' power profile is active on Windows: you need to fix the processor frequency by setting both minimum and maximum processor power to 99% in the power profile settings.

(screenshot)
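As an illustration of the compensation idea (a minimal sketch with made-up numbers, not MicroProfiler's actual implementation): the profiler calibrates the cost of one enter/leave probe pair in advance and subtracts it from every raw measurement, which is only valid while that calibrated cost stays constant, i.e. while the clock frequency is pinned.

```cpp
#include <cassert>

// Sketch: subtract the calibrated probe cost from a raw measurement.
// probe_pairs is how many instrumented enter/leave pairs ran inside the
// measured interval; per_probe_pair_ns is the calibrated cost of one pair.
// If the CPU frequency changes after calibration, per_probe_pair_ns no
// longer matches reality and the compensation over- or under-corrects.
long long compensated_ns(long long measured_ns, long long probe_pairs,
                         long long per_probe_pair_ns)
{
    return measured_ns - probe_pairs * per_probe_pair_ns;
}
```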

This is the sample:

```cpp
template <typename IteratorT>
typename iterator_traits<IteratorT>::value_type pairwize_sum(IteratorT begin, IteratorT end)
{
    switch (const auto d = distance(begin, end))
    {
    case 1: return *begin;
    default: return pairwize_sum(begin, begin + d / 2) + pairwize_sum(begin + d / 2, end);
    case 2: const auto v1 = *begin; return v1 + *++begin;
    }
}
```

UPD: I realized my sample wasn't correct because `*begin + *++begin` has unsequenced side effects; it's now fixed, but that didn't affect the outcome.
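For reference, here is the snippet wrapped into a self-contained form (my harness, not part of the original comment); note that it assumes a non-empty range, since a distance of 0 would fall into the default branch and recurse forever:

```cpp
#include <cassert>
#include <iterator>
#include <vector>

// Copy of the pairwize_sum snippet above, std::-qualified for portability.
template <typename IteratorT>
typename std::iterator_traits<IteratorT>::value_type pairwize_sum(IteratorT begin, IteratorT end)
{
    switch (const auto d = std::distance(begin, end))
    {
    case 1: return *begin;
    default: return pairwize_sum(begin, begin + d / 2) + pairwize_sum(begin + d / 2, end);
    case 2: { const auto v1 = *begin; return v1 + *++begin; }
    }
}
```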

Lectem commented Jan 12, 2021

(transferred from #63)

1st comment:

> What do you mean by 'layout changed'? What did it look like before?

See the space between the pie chart and the first column.

> As a sidenote: I get what you're saying about inlining - it may reduce the number of functions inlined, but not prevent inlining. You can check it out yourself here - I ran profiling of the same agge rendition code with and without inlining (in Release, all other optimizations and flags the same). You can look at the spreadsheet or at the profiles attached (you can open them from Extensions/MicroProfiler/Open...): agge-rendition-with-or-without-inlining.zip

My point was that it entirely changes the measured performance of a function as soon as it has a few callees, and it's even worse if they are not inlined. See the following:


| noinstru |  instru  | Summation 256
|---------:|---------:|:--------------
|   100.0% |   100.0% | `Naive raw sum`
|   100.6% |    83.7% | `std::accumulate`
|   152.8% |    21.7% | `pair-wise summation`
|   146.6% |    21.5% | `pair-wise (iterators) summation`
|   290.8% |   127.8% | `pair-wise simd summation`
|    20.9% |    28.0% | `kahan summation`
|    66.5% |    83.2% | `neumaier summation`
|    59.4% |    76.7% | `knuth summation`
|   817.2% |   442.0% | `naive raw sum ispc`
|   162.7% |   158.7% | `kahan summation ispc`
|   306.7% |   244.1% | `neumaier summation ispc`
|   375.2% |   322.2% | `knuth summation ispc`
            			
| noinstru |  instru  | Summation 1024
|---------:|---------:|:---------------
|   100.0% |   100.0% | `Naive raw sum`
|   100.0% |    98.2% | `std::accumulate`
|   177.5% |    12.2% | `pair-wise summation`
|   170.0% |    18.7% | `pair-wise (iterators) summation`
|   324.3% |   117.8% | `pair-wise simd summation`
|    23.2% |    25.0% | `kahan summation`
|    77.2% |    81.3% | `neumaier summation`
|    67.3% |    66.0% | `knuth summation`
|   848.8% |   565.0% | `naive raw sum ispc`
|   190.7% |   186.0% | `kahan summation ispc`
|   418.9% |   397.2% | `neumaier summation ispc`
|   508.8% |   464.3% | `knuth summation ispc`
            			
| noinstru |  instru  | Summation 1048576
|---------:|---------:|:------------------
|   100.0% |   100.0% | `Naive raw sum`
|   100.0% |    99.8% | `std::accumulate`
|   174.8% |    17.1% | `pair-wise summation`
|   174.4% |    13.5% | `pair-wise (iterators) summation`
|   289.4% |   106.8% | `pair-wise simd summation`
|    23.9% |    23.9% | `kahan summation`
|    81.1% |    80.8% | `neumaier summation`
|    72.4% |    72.4% | `knuth summation`
|   789.4% |   791.3% | `naive raw sum ispc`
|   199.4% |   198.5% | `kahan summation ispc`
|   453.2% |   446.5% | `neumaier summation ispc`
|   554.3% |   552.5% | `knuth summation ispc`

When instrumented (with MicroProfiler), you reach the opposite conclusion when comparing naive and pair-wise summation (which is recursive).
I suppose MicroProfiler does something to compensate a bit, but it still shows the wrong picture:

(screenshot)
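My guess at the mechanism behind this inversion (an illustrative model with assumed numbers, not MicroProfiler internals): every instrumented call carries a small residual probe cost that the compensation does not fully remove, and the recursive summation makes thousands of instrumented calls per sum while the naive loop makes one:

```cpp
#include <cassert>

// apparent time = true time + (number of instrumented calls) * the residual
// per-call overhead that survives the profiler's compensation.
// All numbers fed into this are illustrative, not measured.
long long apparent_ns(long long true_ns, long long calls,
                      long long residual_per_call_ns)
{
    return true_ns + calls * residual_per_call_ns;
}
```

For example, if the recursive sum really takes 60 µs and the naive one 100 µs, ~16000 residual-laden calls at an assumed 10 ns each make the recursive version appear to take ~220 µs, which is the kind of inversion the tables show.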


Answer to tyoma's comment above:

Yes, I am aware of frequency changes and turbo boost; I did set my minimum and maximum frequency to 99%.

> @Lectem Can you please, share your code? Because below is what I'm getting.

Sure!

Try the code in the following benchmark https://quick-bench.com/q/Kd1zaYVHvOo8gOWu-2iER7Z9bBU to get an idea of the execution-time ratio between sum_naive and sum_pairwise.

Then try the following (it should work by simply copy-pasting when using MSVC).
To remove any bias:

  • Make sure to use the doNotOptimizeAwaySink function
  • Make sure you run both tests the same number of times
  • Run them in multiple passes over the same data to remove cache bias
  • I used noinline because otherwise naive_sum would not be shown by the profiler; note that it had no impact on the results of sum_pairwise
```cpp
#include <algorithm>
#include <vector>
#include <numeric>
#include <thread>
#include <chrono>
#include <cstddef>  // ptrdiff_t
#include <cstdio>   // printf
#include <cstdlib>  // rand, RAND_MAX
#include <intrin.h> // _ReadWriteBarrier (MSVC)

void doNotOptimizeAwaySink(const void*);

template <class Tp>
__forceinline void DoNotOptimize(Tp const& value) {
    doNotOptimizeAwaySink(&value);
    _ReadWriteBarrier();
}

__declspec(noinline) double naive_sum(const double* begin, const double* end)
{
    double acc = double(0);
    while (begin != end)
        acc += *(begin++);
    return acc;
}

__declspec(noinline) double pairwise_sum(ptrdiff_t n, const double* values)
{
    if (n <= 16)
        return std::accumulate(values, values + n, double(0));

    const ptrdiff_t pivotIdx = n / 2;
    return pairwise_sum(pivotIdx, values) + pairwise_sum(n - pivotIdx, values + pivotIdx);
}

int main()
{
    std::vector<double> values;
    values.resize(100000);
    std::generate(values.begin(), values.end(), [] { return double(rand()) / RAND_MAX; });

    int nbPasses = 100;
    int nbIterations = 1000;

    for (int i = 0; i < nbPasses; i++)
    {
        for (int iter = 0; iter < nbIterations; iter++) {
            auto s = naive_sum(values.data(), values.data() + values.size());
            DoNotOptimize(s);
        }

        for (int iter = 0; iter < nbIterations; iter++) {
            auto s = pairwise_sum(values.size(), values.data());
            DoNotOptimize(s);
        }
    }

    printf("Start sleeping\n");
    using namespace std::chrono_literals;
    // put a breakpoint here!
    std::this_thread::sleep_for(20s);
}

#pragma optimize("", off)
void doNotOptimizeAwaySink(void const*) {}
#pragma optimize("", on)
```

MicroProfiler still tells me that pairwise_sum is 1.7 times slower than naive, while in reality it should be the opposite!

(screenshot)
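One way to quantify why the recursive version is hit so much harder (my own estimate, not something MicroProfiler reports): count the instrumented calls a single invocation of pairwise_sum makes. The function below mirrors the splitting logic of the benchmark above (leaf when n <= 16):

```cpp
#include <cassert>
#include <cstddef>

// Number of calls one invocation of pairwise_sum(n, ...) makes:
// one call for the current node, plus the two recursive halves.
long long pairwise_call_count(ptrdiff_t n)
{
    if (n <= 16)
        return 1;
    const ptrdiff_t pivotIdx = n / 2;
    return 1 + pairwise_call_count(pivotIdx) + pairwise_call_count(n - pivotIdx);
}
```

For n = 100000 every leaf of the recursion tree sits at depth 13, so one sum makes 2^14 - 1 = 16383 instrumented calls, versus exactly one call for naive_sum; any residual per-probe overhead is multiplied accordingly.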

It's even worse if I remove the small-array-size optimization and use the following checks (closer to your implementation) instead of the if (n <= 16):

```cpp
if (n == 1)
    return *values;

if (n == 2)
    return *values + *(values + 1);
```

Here is what I get (it took almost 5 minutes to finish):

(screenshot)
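A back-of-the-envelope for this variant (my numbers, not measured): with the n == 1 / n == 2 base cases the leaves hold at most two elements, so one 100000-element sum makes at least ~100000 instrumented calls, and the harness's 100 passes x 1000 iterations push that past 10^10 probe pairs, which is consistent with a multi-minute run at just a few nanoseconds per probe.

```cpp
#include <cassert>
#include <cstddef>

// Call count for the variant with n == 1 / n == 2 base cases: every leaf
// holds 1 or 2 elements, so for n elements there are at least n/2 leaves
// and at least n - 1 calls in total.
long long pairwise_call_count_small(ptrdiff_t n)
{
    if (n <= 2)
        return 1;
    const ptrdiff_t pivotIdx = n / 2;
    return 1 + pairwise_call_count_small(pivotIdx) + pairwise_call_count_small(n - pivotIdx);
}
```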
