
Assignment of xt::zeros needs major speed-up #702

Closed
ukoethe opened this issue Mar 18, 2018 · 29 comments

Comments

@ukoethe
Contributor

ukoethe commented Mar 18, 2018

I benchmarked resetting an xarray to zero and found that array = xt::zeros<float>(shape) is 27x slower than std::fill(), whereas I expected to see no difference. Here are the numbers for gcc-7:
(Edit: I added results for dynamic view, which are even worse - 54x slower.)
(Edit2: in releases 0.10 to 0.14, array = xt::zeros<float>(shape) was "only" 16x slower.)

----------------------------------------------------------------------
Benchmark                               Time           CPU Iterations
----------------------------------------------------------------------
init_memset<float>                  70685 ns      70658 ns       9878
xarray_init_std_fill<float>         67681 ns      67683 ns      10334
xarray_init_zeros<float>          1868781 ns    1866331 ns        372
dynamic_view_init_zeros<float>    3712534 ns    3712598 ns        188

Code:
(BTW, why is straightforward array = 0 not supported?)

    constexpr int SIZE = 1 << 20;

    template <class V>
    void init_memset(benchmark::State& state)
    {
        std::vector<V> array(SIZE);

        for (auto _ : state)
        {
            std::memset(array.data(), 0, SIZE*sizeof(V));
            benchmark::DoNotOptimize(array.data());
        }
    }
    BENCHMARK_TEMPLATE(init_memset, float);

    template <class V>
    void xarray_init_std_fill(benchmark::State& state)
    {
        auto array = xt::xarray<V>::from_shape({SIZE});

        for (auto _ : state)
        {
            std::fill(array.begin(), array.end(), V());
            benchmark::DoNotOptimize(array.raw_data());
        }
    }
    BENCHMARK_TEMPLATE(xarray_init_std_fill, float);

    template <class V>
    void xarray_init_zeros(benchmark::State& state)
    {
        auto array = xt::xarray<V>::from_shape({SIZE});

        for (auto _ : state)
        {
            array = xt::zeros<V>({SIZE});
            benchmark::DoNotOptimize(array.raw_data());
        }
    }
    BENCHMARK_TEMPLATE(xarray_init_zeros, float);

    template <class V>
    void dynamic_view_init_zeros(benchmark::State& state)
    {
        auto array = xt::xarray<V>::from_shape({SIZE});
        auto view  = xt::dynamic_view(array, xt::slice_vector{xt::all()});

        for (auto _ : state)
        {
            view = xt::zeros<V>({SIZE});
            benchmark::DoNotOptimize(array.raw_data());
        }
    }
    BENCHMARK_TEMPLATE(dynamic_view_init_zeros, float);
@JohanMabille
Member

Hi,

Assigning to a container or a view after it has been initialized involves a temporary to avoid aliasing. Could you test with the noalias proxy:

    template <class V>
    void xarray_init_zeros(benchmark::State& state)
    {
        auto array = xt::xarray<V>::from_shape({SIZE});

        for (auto _ : state)
        {
            xt::noalias(array) = xt::zeros<V>({SIZE});
            benchmark::DoNotOptimize(array.raw_data());
        }
    }
    BENCHMARK_TEMPLATE(xarray_init_zeros, float);

    template <class V>
    void dynamic_view_init_zeros(benchmark::State& state)
    {
        auto array = xt::xarray<V>::from_shape({SIZE});
        auto view  = xt::dynamic_view(array, xt::slice_vector{xt::all()});

        for (auto _ : state)
        {
            xt::noalias(view) = xt::zeros<V>({SIZE});
            benchmark::DoNotOptimize(array.raw_data());
        }
    }
    BENCHMARK_TEMPLATE(dynamic_view_init_zeros, float);

Regarding the assignment xarray = 0, scalar and 0-d expressions are considered equivalent and handled the same way in xtensor. Thus, to be consistent, array = 0 should resize the array to a 0-d array and then assign it the value 0, which would be confusing.

@ukoethe
Contributor Author

ukoethe commented Mar 18, 2018

noalias() indeed helps the dynamic view, so that it is only 27x slower. However, this is unsatisfactory for three reasons:

  • It is still too slow.
  • Since xt::zeros<V>({SIZE}) can never share data with any array, aliasing is impossible, and the library should be able to detect this without external help.
  • Even if aliasing is possible (as in the assignment view1 = view2), it can be easily ruled out in 80% of the cases by a simple check for overlapping address ranges in the LHS and RHS. I have used this technique in vigra for a long time and never encountered any problems. You really ought to improve this!

Thus, to be consistent, array = 0 should resize the array to a 0-d array and then assign it the value 0, which would be confusing.

To the contrary, I was shocked when I first learned that assignment of an xscalar resizes the LHS to 0-d. This behavior is counter-intuitive and useless in practice. As the Zen of Python puts it: "practicality beats purity". Assignment of scalars (both built-in types and xscalar) should not resize the array, but just overwrite its contents.

@SylvainCorlay
Member

Even if aliasing is possible (as in the assignment view1 = view2), it can be easily ruled out in 80% of the cases by a simple check for overlapping address ranges in the LHS and RHS. I have used this technique in vigra for a long time and never encountered any problems. You really ought to improve this!

In fact, the rhs can be a complicated expression involving views, concatenations, computations, or views on combinations of those. The current mechanism is very general, and there is no systematic way of ruling out aliasing.

Even the lhs can have a "view semantics", which complicates the picture.

There are cases where we can probably rule out aliasing like when the rhs is a generator like zeros or linspace. (Although, when we move to strided random generators, it might be a more complicated question.)
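The hazard the temporary guards against can be shown with plain C++, independent of xtensor. This is a hypothetical sketch (the names are illustrative only): shifting elements through overlapping ranges corrupts the result unless a copy is made first.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Shifting left reads ahead of the writes, so the overlap is harmless.
std::vector<int> shift_left_naive(std::vector<int> v)
{
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        v[i] = v[i + 1];
    return v;
}

// Shifting right the same way clobbers the source before it is read:
// {1, 2, 3, 4} becomes {1, 1, 1, 1} instead of the intended {1, 1, 2, 3}.
std::vector<int> shift_right_naive(std::vector<int> v)
{
    for (std::size_t i = 1; i < v.size(); ++i)
        v[i] = v[i - 1];
    return v;
}

// A temporary copy of the source restores the intended result -- the
// same role the temporary plays in xtensor's assignment when aliasing
// cannot be ruled out.
std::vector<int> shift_right_copy(std::vector<int> v)
{
    std::vector<int> tmp = v;
    for (std::size_t i = 1; i < v.size(); ++i)
        v[i] = tmp[i - 1];
    return v;
}
```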

To the contrary, I was shocked when I first learned that assignment of an xscalar resizes the LHS to 0-d. This behavior is counter-intuitive and useless in practice.

In xtensor, xscalars are expressions of dimension zero, and assignment proceeds as for any other expression; we are not making it a special case. Since you quote the Zen of Python:

Special cases aren't special enough to break the rules.

Now, the assignment behavior depends on what the lhs is. In the case where the lhs has a view semantics, the rhs is broadcasted to fill the content (including the case of the rhs being e.g. 1-D). In the case where the lhs has a container semantics, it is an assignment, just like for a numpy array.

You really ought to improve this!

The fact that assignment from a generator goes through the whole broadcasting logic (compared to a simple std::fill) still comes at a cost, but we can certainly improve this.

@ukoethe
Contributor Author

ukoethe commented Mar 19, 2018

Thanks for the quick answer, I really appreciate this. I understand the logic behind your assignment semantics, but I'm sorry to say that I'm not convinced of the premises.

Vigra fundamentally rests on the assumption that algorithms cannot know (and should not care) if they are called with arrays or views. This is important because views are very common (for example, many algorithms are applied to each channel individually). This premise has two implications:

  • Arrays and views must have the same API (but arrays may offer additional functionality).
  • Operations on arrays and views should be equally fast.

In vigra, these requirements are fulfilled by construction, because arrays are implemented as subclasses of views. In contrast, xtensor fulfills neither, and this is a big problem for me. Do you have any suggestion how to resolve this dilemma?

@ukoethe
Contributor Author

ukoethe commented Mar 19, 2018

In xtensor, xscalars are expressions of dimension zero, and assignment proceeds as for any other expression; we are not making it a special case.

Please cite practical use cases where a resize to 0-d is actually the desired outcome of a scalar assignment. Otherwise, "practicality beats purity" kicks in (at least this is how I understand the Zen).

@ukoethe
Contributor Author

ukoethe commented Mar 19, 2018

There are cases where we can probably rule out aliasing like when the rhs is a generator like zeros or linspace.

Here is how I do it in vigra: all arrays, views, and expressions provide a function check_memory_overlap(void * begin, void * end) (expressions forward this call to their parts). This function is called at the beginning of an assignment with the memory bounds of the LHS. If the call returns false, no aliasing can occur and no temporary is needed. This is really easy to implement and covers 80% of the cases (after all, aliasing is rare). Why does this technique not work in xtensor?
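A minimal sketch of the interval test described here, assuming half-open byte ranges; the name check_memory_overlap mirrors the description above, but this is not vigra's actual signature:

```cpp
#include <cassert>
#include <cstdint>

// True if the half-open byte ranges [a_begin, a_end) and
// [b_begin, b_end) share at least one address. When this returns
// false, assignment can safely skip the temporary.
inline bool check_memory_overlap(const void* a_begin, const void* a_end,
                                 const void* b_begin, const void* b_end)
{
    auto a0 = reinterpret_cast<std::uintptr_t>(a_begin);
    auto a1 = reinterpret_cast<std::uintptr_t>(a_end);
    auto b0 = reinterpret_cast<std::uintptr_t>(b_begin);
    auto b1 = reinterpret_cast<std::uintptr_t>(b_end);
    return a0 < b1 && b0 < a1;
}
```

An expression tree would forward the call to its leaves and OR the results; a pure generator like xt::zeros owns no memory and would simply return false.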

@JohanMabille
Member

JohanMabille commented Mar 19, 2018

Even if aliasing is possible (as in the assignment view1 = view2), it can be easily ruled out in 80% of the cases by a simple check for overlapping address ranges in the LHS and RHS. I have used this technique in vigra for a long time and never encountered any problems

The problem is the remaining 20%. You end up with a scheme where the user has no simple rule to determine whether noalias is required. When writing generic code that could work with any kind of expression, one would use noalias anyway to guarantee that no copy is made when the dynamic check cannot determine whether overlap happens.

To the contrary, I was shocked when I first learned that assignment of an xscalar resizes the LHS to 0-d.

If assigning a 0d array / scalar to an array was broadcasting to the array instead of resizing it to a 0d array, the copy constructor and assignment operators would have different behaviors:

double s = 1.;
xt::xarray<double> a(s);
// a is a 0d array
xt::xarray<double> b = {{0., 1., 2.}, {3., 4., 5.}};
b = s;
// b is still a 2d array

And that is a really bug-prone behavior.

  • Arrays and views must have the same API (but arrays may offer additional functionality).
  • Operations on arrays and views should be equally fast.
    [...] In contrast, xtensor fulfills neither

I agree with you on the second point (and we are working hard to fix that); on the first point, I think arrays and views provide similar APIs (with additional features for arrays, such as resizing and SIMD instructions).

@ukoethe
Contributor Author

ukoethe commented Mar 19, 2018

The problem is the remaining 20%.

I don't see the problem.

  1. When 80% of the cases are fast, and 20% somewhat slower, the code's overall performance wouldn't suffer much. You should also keep in mind that array expressions are the least expensive operations in number crunching applications. Usually, most of a program's running time is spent in expensive functions (such as convolution) where noalias() doesn't apply anyway.
  2. There is no contradiction between check_memory_overlap() and noalias(), because both can be used together (one could even throw an exception when the user misused noalias()).
  3. I don't have the feeling that the programmer possesses superior knowledge about the possibility of aliasing in most situations. For example, in a generic function that can be called with arbitrarily overlapping views, I have no idea where noalias() is safe. The best I can do is to create copies of the data at strategic places, and give the guarantee afterwards. But this is exactly the situation, which is also easy to detect by check_memory_overlap().

Can you give specific use cases where the programmer knows that noalias() is safe, but check_memory_overlap() can't detect it (i.e. would return true)?

@ukoethe
Contributor Author

ukoethe commented Mar 19, 2018

If assigning a 0d array / scalar to an array was broadcasting to the array instead of resizing it to a 0d array, the copy constructor and assignment operators would have different behaviors:

For that reason, vigra doesn't provide a constructor taking a scalar. The constructor's first argument is always the shape, or an array expression that provides a method shape(). Therefore

array_t<double> a(2.0);

is an error, whereas

array_t<double> a{2.0};

would invoke the initializer_list constructor, creating an array of size 1 -- but this syntax cannot be confused with assignment.

For the record: anyone I asked wants the assignment a = 0; to zero out the array.

@SylvainCorlay
Member

  1. Like in numpy, when the lhs is a view, the rhs is broadcast to match its shape if possible

numpy:

>>> import numpy as np
>>> x = np.array([[[1, 2], [3, 4]], [[10, 20], [30, 40]]])
>>> x[:, :, 1] = [-1, -2]
>>> x

array([[[ 1, -1],
        [ 3, -2]],

       [[10, -1],
        [30, -2]]])

xtensor:

>>> auto x = xt::xarray<double>({{{1, 2}, {3, 4}}, {{10, 20}, {30, 40}}});
>>> auto v = xt::view(x, xt::all(), xt::all(), 1);
>>> xt::xtensor<double, 1> t = {-1, -2};
>>> v = t;
>>> std::cout << x << std::endl;

{{{  1.,  -1.},
  {  3.,  -2.}},
 {{ 10.,  -1.},
  { 30.,  -2.}}}
  2. When the lhs is a container, assigning an expression to the container will change the dimension of the container.

  3. In all of xtensor, scalars are treated as 0-D expressions, and assignment is treated accordingly.

It would basically break xtensor's consistency to treat scalars differently for assignment, so I don't think that this can change.

@JohanMabille
Member

JohanMabille commented Mar 20, 2018

For that reason, vigra doesn't provide a constructor taking a scalar.

With that solution, the inconsistency is now between scalars and 0-d expressions:

xarray<double> a = {{0., 1., 2.}, {3., 4., 5.}};
xarray<double> b = a;
b = sum(a);
// => b is a 0d array
b = a;
double s = sum(a);
b = s;
// => b is still a 2d array

This is still bug-prone. This decision (i.e. assigning scalars as 0d expressions) has not been taken lightly: we want to ensure consistency between different ways of writing the "same" code and avoid vicious bugs when refactoring (e.g. I want to cache the result of a 0d expression in a scalar because it is now used many times, while lazy computation was fine before).

So this is something that we won't change. I'll add a paragraph explaining all of this in the documentation of xtensor.

However, we can think of a way to simplify the filling of an N-dimensional expression without resizing it:

  • provide a fill function like numpy
  • see if user-defined literals can help, so we can have a syntax like:
xarray<double> a = {{0., 1., 2. }, {3., 4., 5.}};
a = 1_br;
// => a is a 2d array containing 1 everywhere
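The proposed fill function could be sketched generically over any container exposing begin()/end(). This is an illustrative sketch, not xtensor's actual API:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Overwrites every element with `value` without touching the shape --
// the non-resizing counterpart to assignment.
template <class Container, class T>
void fill(Container& c, const T& value)
{
    std::fill(c.begin(), c.end(), value);
}
```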

@ukoethe
Contributor Author

ukoethe commented Mar 20, 2018

I implemented @JohanMabille's use case and printed the shapes:

xarray<double> a = {{0., 1., 2.}, {3., 4., 5.}};
xarray<double> b = a;
std::cerr << "shape before: (" << b.shape()[0] << ", " << b.shape()[1] << 
             ")\ncontents:\n" << b << "\n";
b = sum(a);
std::cerr << "shape after: (" << b.shape()[0] << ", " << b.shape()[1] << 
             ")\ncontents:\n" << b << "\n";

The output is

shape before: (2, 3)
contents:
{{ 0.,  1.,  2.},
 { 3.,  4.,  5.}}
shape after: (2, 3)
contents:
 15.

This looks like a bug to me.
Edit: Presumably, it is not a bug, because b.shape() has size 0 after the assignment. Sorry for the mistake, but this is a trap many users will fall into.

@ukoethe
Contributor Author

ukoethe commented Mar 20, 2018

Continuing on @JohanMabille's use case, when I use a view instead of an array

xarray<double> a = {{0., 1., 2.}, {3., 4., 5.}};
xarray<double> b = a;
auto c = view(b, all(), all());
std::cerr << "shape before: (" << c.shape()[0] << ", " << c.shape()[1] << 
             ")\ncontents:\n" << c << "\n";
c = sum(a);
std::cerr << "shape after: (" << c.shape()[0] << ", " << c.shape()[1] << 
             ")\ncontents:\n" << c << "\n";

I get (as expected)

shape before: (2, 3)
contents:
{{ 0.,  1.,  2.},
 { 3.,  4.,  5.}}
shape after: (2, 3)
contents:
{{ 15.,  15.,  15.},
 { 15.,  15.,  15.}}

but the behavior differs between arrays and views. I can absolutely not live with this inconsistency, because vigra's generic algorithms must work identically, regardless of their arguments being arrays or views.

I see only one (admittedly radical) solution: arrays adopt the assignment semantics of views, i.e. assignments cannot resize the LHS (with one exception: when the LHS does not yet contain data, e.g. after default construction).

@SylvainCorlay
Member

SylvainCorlay commented Mar 20, 2018

but the behavior differs between arrays and views. I can absolutely not live with this inconsistency, because vigra's generic algorithms must work identically, regardless of their arguments being arrays or views.

Don't hold your breath though 😄. Thanks for catching the bug on sum!

Regarding the different behavior on views and containers, we don't consider this as an inconsistency, and from the very start we have separated the two semantics (just like numpy).

One thing you can probably do if you want view semantics is take a view on an array with no slice. Soon(ish), strided views should have exactly the same performance as arrays.

@ukoethe
Contributor Author

ukoethe commented Mar 20, 2018

One thing you can probably do if you want view semantics is take a view on an array with no slice.

That means, every algorithm must begin with

auto in_view = make_view(in);

etc. -- not a pretty design, and easy to forget.

@SylvainCorlay
Member

This looks like a bug to me.

Edit: Presumably, it is not a bug, because b.shape() has size 0 after the assignment. Sorry for the mistake, but this is a trap many users will fall into.

Yeah, we were posting an answer at the same time. So no bug, phew.

@ukoethe
Contributor Author

ukoethe commented Mar 20, 2018

we have separated the two semantics (just like numpy)

It is technically impossible in Python to assign a scalar to an array, because the expression array = 0 doesn't touch the array at all, but only changes the reference behind the identifier array. I therefore don't buy the analogy to numpy as a justification for your design decision allowing assignments to resize the LHS.

From the top of my head, I can't come up with other semantic differences between numpy's arrays and views. What differences are you referring to?

@SylvainCorlay
Member

SylvainCorlay commented Mar 20, 2018

Containers should behave like std::vector. Assigning or moving a vector to another should result in a resized container.

Assigning or moving any expression to an xarray should behave like assigning an xarray to an xarray: it should resize. Otherwise it would be inconsistent.
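The std::vector analogy can be made concrete; a small illustration (the helper name is hypothetical):

```cpp
#include <cassert>
#include <vector>

// Container semantics: operator= replaces both size and contents,
// rather than broadcasting the rhs into the old shape.
inline std::vector<double> assign_demo()
{
    std::vector<double> lhs = {0., 1., 2., 3., 4., 5.};
    std::vector<double> rhs = {42.};
    lhs = rhs;    // lhs now has size 1, like assigning a 0-d expression
    return lhs;
}
```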

@JohanMabille
Member

I therefore don't buy the analogy to numpy as a justification for your design decision

That's because you're reading the arguments one by one instead of taking them as a whole. So to be clear:

  • in numpy, when assigning an array / numpy expression to a view, the right hand side is broadcasted to the shape of the view
    => The same happens in xtensor, so we're consistent with numpy behavior

  • in numpy, when assigning an array / numpy expression to an array, from a user's point of view it is as if the LHS had been resized to match the RHS shape (the technical details are quite different because it is Python: you get a new array under the hood, the reference points to this new array, and the original array is not really resized, but that does not matter here)
    => The same happens in xtensor, so we're consistent with numpy behavior. Besides we're consistent with STL's containers behavior.

  • in numpy, you cannot assign a scalar to a numpy array or to a numpy view
    => That's because it is Python; in C++, however, it is technically possible.

Then you have 3 means to handle that last point:

  • you forbid scalar assignment; that is too restrictive from our point of view
  • assigning a scalar broadcasts it to the lhs => this leads to inconsistency with 0-D expression assignment; I have given an example in one of my posts above (and this will soon be in the xtensor documentation)
  • assigning a scalar is handled like assigning a 0-D expression, and the system is consistent.

So the analogy with numpy justifies the broadcasting when assigning to a view while resizing when assigning to an array, consistency between scalars and 0-D expression justifies the way we handle scalar assignment.

Thus, as already said before, we won't change this behavior.

That being said, we can consider ways of simplifying the filling of an N-dimensional expression without resizing it.

@ukoethe
Contributor Author

ukoethe commented Mar 20, 2018

Then you have 3 means to handle that last point:

You forgot to mention the fourth possibility: Change arrays to adopt view semantics (i.e. scalars and 0-D expressions are always broadcast to the LHS), and the system is consistent.

@SylvainCorlay
Member

SylvainCorlay commented Mar 20, 2018

Then you have 3 means to handle that last point:

You forgot to mention the fourth possibility: Change arrays to adopt view semantics (i.e. scalars and 0-D expressions are always broadcast to the LHS), and the system is consistent.

No, it breaks consistency with STL containers, as explained two comments above.

  • You want to be able to assign or move a container to another (like an STL vector), which should resize, even when the RHS is 0-D.

  • You want the behavior to be the same for any expression on the RHS.

@ukoethe
Contributor Author

ukoethe commented Mar 20, 2018

When you work on speeding up expressions on views, also include view creation itself, i.e. the functions xt::view() and xt::dynamic_view(). I need view creation to be cheap as well (and looking at the code of xt::dynamic_view(), this might prove quite challenging).

@ukoethe
Contributor Author

ukoethe commented Mar 20, 2018

You want to be able to assign or move a container to another

No, I don't want to do this:

  • I have no use case -- arrays don't change shape halfway through the algorithm.
  • One doesn't want memory management to happen inside loops. Thus, allocation is done before the loop with constructor calls.
  • I can't rely on the resizing capability, because the LHS could also be a view.
  • Consistency with STL containers is not important, because we are in an entirely different domain.
  • Should I be wrong, and resizing assignment is needed, it can still be achieved by the standard idiom xarray<double>(rhs).swap(lhs).
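The swap idiom from the last bullet, illustrated with std::vector standing in for xarray (the helper name is hypothetical):

```cpp
#include <cassert>
#include <vector>

// Resizing assignment via copy-and-swap: construct a fresh container
// from rhs, then exchange buffers with lhs in O(1).
inline void swap_assign(std::vector<double>& lhs,
                        const std::vector<double>& rhs)
{
    std::vector<double>(rhs).swap(lhs);
}
```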

You want the behavior to be the same for any expression on the RHS.

I think this is not violated when arrays have view semantics.

@SylvainCorlay
Member

You want to be able to assign or move a container to another

No, I don't want to do this:

This was a generic "you". There are other users. If you want something like an array that has view semantics, you can make one that is a valid expression for xtensor by inheriting from xview_semantics and xstrided_container. There are only a couple of constructors to implement.

Consistency with STL containers is not important, because we are in an entirely different domain.

We strongly disagree. This was a deliberate conscious choice.

@ukoethe
Contributor Author

ukoethe commented Mar 20, 2018

This was a generic "you".

I repeatedly asked for use cases from your side, but got none.

@SylvainCorlay
Member

SylvainCorlay commented Mar 20, 2018

I repeatedly asked for uses cases from your side, but got none.

I think that we are in the land of sea-lioning here.

@ukoethe
Contributor Author

ukoethe commented Mar 20, 2018

If you want something like an array that has a view semantics, you can make one that is a valid expression for xtensor by inheriting from xview_semantics and xstrided_container.

This looks like what I'll try next. I didn't know that it would guarantee view semantics as opposed to array semantics.

@wolfv
Member

wolfv commented Mar 21, 2018

Another note on this issue: for #666 I was planning to overload assign for the special case of an RHS consisting of a single reducer, and then reroute the reducer to use the faster strided-loop reducers instead of the lazy ones.
Also, once we have implemented the strided-loop variant of assign, broadcasts (and probably also broadcasts of scalars) should speed up quite a bit.

@wolfv wolfv closed this as completed Apr 19, 2018
@wolfv
Member

wolfv commented May 3, 2018

So I've added a benchmark for xt::zeros and assignment to a container is now as fast as using std::fill directly:

----------------------------------------------------------
Benchmark                   Time           CPU Iterations
----------------------------------------------------------
benchmark_arange        52827 ns      52004 ns      12112
benchmark_std_iota      10540 ns      10519 ns      50931
benchmark_ones          14028 ns      14005 ns      38908
benchmark_std_fill      14309 ns      14280 ns      38371

As you can see, there is still some work outstanding on arange. We'll spend some cycles adding assign_to to other expressions.

aosmw added a commit to aosmw/navigation2 that referenced this issue Mar 28, 2024
* state.reset() now also resets pose and speed and its xtensors
* iterate over control_history_ and call reset()
* use xt::noalias() when zeroing in reset() funcs to actually
assign zero and clean out old state.
"Assigning to a container or a view after it has
been initialized involves a temporary to avoid aliasing."
See xtensor-stack/xtensor#702 (comment)

Signed-off-by: Mike Wake <[email protected]>