Matrix assembly and CPU single core execution #1753

Ben90001 · 2024-12-12T11:31:34Z

Hi,
In the context of my bachelor's thesis I am comparing Ginkgo to DUNE-ISTL.
In the single-core CPU comparison ISTL seems to be ahead by a good margin, so I was wondering whether I am perhaps not using Ginkgo in the most efficient way:

1. The original paper from 2018 states that Ginkgo focuses on GPU performance before implementing efficient CPU kernels. Have the CPU kernels been optimized since?
2. The reference executor is mentioned somewhere to simply be the OMP executor using only 1 core, so I assumed it is suitable for a single-core performance comparison. Is that correct?
3. I am measuring much higher matrix generation times when using matrix_assembly_data compared to using matrix_data, even though the documentation warns me that matrix_data is not optimized for performance.
4. The same note warns that matrix_data can only exist on the CPU, implying that matrix_assembly_data is able to run on the GPU, as it has no such warning. Would that require writing custom CUDA code? Or should matrix_assembly_data run faster when using the CUDA executor?
5. Is there a way to give additional information to the assembly data structure or when building the matrix? For example, the average #nnz per row or the total #nnz?

Thanks in advance, all help is very much appreciated. I'll append the generation code below.
Should you be interested in the data I generated so far, you can find a preliminary version here. The repository I used to generate the data can be found here.

n is the grid size in each dimension, and d is the number of dimensions.

#include <ginkgo/ginkgo.hpp>

#include <chrono>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <map>
#include <memory>
#include <vector>

// implementation using matrix_data -> uses AoS
template <class MatrixType, typename CoefficientFunction, typename BoundaryTypeFunction, class ExecutorType>
std::unique_ptr<MatrixType> diffusion_matrix_matrix_data(const size_t n, const size_t d,
                                              CoefficientFunction diffusion_coefficient,
                                              BoundaryTypeFunction dirichlet_boundary,
                                              ExecutorType exec)
{
  // relevant types
  // using MatrixEntry = double;
  using mtx = MatrixType;

  // prepare grid information
  std::vector<std::size_t> sizes(d + 1, 1);
  for (int i = 1; i <= d; ++i)
    sizes[i] = sizes[i - 1] * n; 
  double mesh_size = 1.0 / n;
  int N = sizes[d];

  // create matrix entries
  // gko::matrix_data<double,size_t> mtx_data{gko::dim<2,size_t>(N,N)};
  gko::matrix_data<> mtx_data{gko::dim<2>(N)};           
  //gko::matrix_data<> mtx_data{gko::dim<2>{N}};
  for (std::size_t index = 0; index < sizes[d]; index++) /// each grid cell
  {
    // create multiindex from row number                    ///fancy way of doing 3(=d) for loops over n -> more powerful: works for all d
    std::vector<std::size_t> multiindex(d, 0);
    auto copiedindex = index;
    for (int i = d - 1; i >= 0; i--)
    {
      multiindex[i] = copiedindex / sizes[i];
      copiedindex = copiedindex % sizes[i];
    }

    // the current cell
    std::vector<double> center_position(d);
    for (int i = 0; i < d; ++i)
      center_position[i] = multiindex[i] * mesh_size;
    double center_coefficient = diffusion_coefficient(center_position);
    double center_matrix_entry = 0.0;

    // loop over all neighbors
    for (int i = 0; i < d; i++)
    {
      // down neighbor
      if (multiindex[i] > 0)
      {
        // we have a neighbor cell
        std::vector<double> neighbor_position(center_position);
        neighbor_position[i] -= mesh_size;
        double neighbor_coefficient = diffusion_coefficient(neighbor_position);
        double harmonic_average = 2.0 / ((1.0 / neighbor_coefficient) + (1.0 / center_coefficient));
        mtx_data.nonzeros.emplace_back(index, index - sizes[i], -harmonic_average);              ///matrix_data usage
        center_matrix_entry += harmonic_average;
      }
      else
      {
        // current cell is on the boundary in this direction
        std::vector<double> neighbor_position(center_position);
        neighbor_position[i] = 0.0;
        if (dirichlet_boundary(neighbor_position))
          center_matrix_entry += center_coefficient * 2.0;
      }

      // up neighbor
      if (multiindex[i] < n - 1)
      {
        // we have a neighbor cell
        std::vector<double> neighbor_position(center_position);
        neighbor_position[i] += mesh_size;
        double neighbor_coefficient = diffusion_coefficient(neighbor_position);
        double harmonic_average = 2.0 / ((1.0 / neighbor_coefficient) + (1.0 / center_coefficient));
        mtx_data.nonzeros.emplace_back(index, index + sizes[i], -harmonic_average);             ///matrix_data usage
        center_matrix_entry += harmonic_average;
      }
      else
      {
        // current cell is on the boundary in this direction
        std::vector<double> neighbor_position(center_position);
        neighbor_position[i] = 1.0;
        if (dirichlet_boundary(neighbor_position))
          center_matrix_entry += center_coefficient * 2.0;
      }
    }

    // finally the diagonal entry
    mtx_data.nonzeros.emplace_back(index, index, center_matrix_entry);                          ///matrix_data usage
  }
  // create matrix from data
  auto pA = mtx::create(exec);
  pA->read(mtx_data);                                                                           ///matrix_data usage

  return pA;
}
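
The index arithmetic in the loop above can be checked in isolation. The following is a minimal standalone sketch and not part of the posted code; the helper names to_multiindex and to_linear are hypothetical. It splits a linear cell index on a grid with n cells per dimension into d coordinates, with dimension 0 varying fastest, which matches the stride convention of the sizes vector above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: split a linear cell index into d grid coordinates.
// Dimension 0 has stride 1, dimension 1 has stride n, dimension 2 has
// stride n * n, and so on.
std::vector<std::size_t> to_multiindex(std::size_t index, std::size_t n,
                                       std::size_t d)
{
    std::vector<std::size_t> multiindex(d, 0);
    for (std::size_t i = 0; i < d; ++i) {
        multiindex[i] = index % n;  // coordinate along dimension i
        index /= n;                 // drop the dimension just handled
    }
    return multiindex;
}

// Hypothetical inverse: recombine the coordinates into the linear index.
std::size_t to_linear(const std::vector<std::size_t>& multiindex,
                      std::size_t n)
{
    std::size_t index = 0;
    // Walk from the slowest-varying dimension down to dimension 0.
    for (std::size_t i = multiindex.size(); i-- > 0;) {
        index = index * n + multiindex[i];
    }
    return index;
}
```

For n = 4 and d = 3, cell 27 maps to the coordinates (3, 2, 1), and its "up" neighbor in dimension 1 is cell 27 + 4 = 31, which is the index + sizes[i] offset used for the off-diagonal entries.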

The matrix_assembly_data version is exactly the same but using

auto mtx_assembly_data = gko::matrix_assembly_data<>{gko::dim<2>(N)};
(...)
mtx_assembly_data.set_value(index, index - sizes[i], -harmonic_average);
(...)
mtx_assembly_data.set_value(index, index + sizes[i], -harmonic_average);
(...)
mtx_assembly_data.set_value(index, index, center_matrix_entry);
(...)
auto pA = mtx::create(exec);
pA->read(mtx_assembly_data.get_ordered_data());

Find the full file here.

Answered by MarcelKoch

MarcelKoch · 2024-12-12T12:11:03Z

Maintainer

Hi @Ben90001, thanks for your interest in comparing Ginkgo to Dune-ISTL. I will try to answer your questions one-by-one:

1. Our CPU kernels are indeed lacking optimizations compared to our GPU kernels. Honestly, I would not be surprised if ISTL is faster than Ginkgo on CPUs. However, we would welcome any contributions to our CPU kernels, if this is something you want to pursue.
2. If you want to run Ginkgo only on a single CPU core, then yes, the ReferenceExecutor would be the best option. If you want to run on the full CPU, then you would need to use the OmpExecutor. You can also use the OmpExecutor to run on a single core by setting the environment variable OMP_NUM_THREADS=1. I would assume that this has some overhead compared to the ReferenceExecutor. One thing to note about our ReferenceExecutor is that we explicitly do not care about optimizing our kernels for it. Its purpose is only to ensure the correctness of our code.
3. It is expected that matrix_assembly_data has worse performance than matrix_data, even though we don't explicitly state this in our documentation.
4. Similar to the previous point, the documentation is a bit incomplete. matrix_assembly_data is also not capable of running on GPUs. If you are interested in doing the assembly on the GPU, then you need to use device_matrix_data.
5. Compared to ISTL, the matrix assembly is a lot simpler in Ginkgo. You already found all the approaches to assembly that we have available.

As I mentioned in 2. you can also try out the OmpExecutor. Maybe it would be interesting to see if that gives you different performance results.

I think your matrix generation code looks fine, it's similar to what we use for example in our benchmarks.

I've also worked a bit with Dune in the past, so feel free to ask any other questions regarding interfacing Ginkgo and Dune.

3 replies

upsj Dec 12, 2024
Maintainer

I would expand upon 2. a bit more: The algorithms we design for the ReferenceExecutor aim at simplicity, so they may sometimes use suboptimal data structures to keep the code clearer. The OpenMP algorithms tend to be more optimized due to better algorithms/data structures and explicit loop unrolling, which makes vectorization easier. The overhead incurred by the OpenMP runtime should be negligible when running with a single core, since the main thread will also participate in the execution. So if you want the best possible performance, regardless of the number of cores, I would always recommend using OpenMP.

The design for matrix_assembly_data was motivated by the need for a simple framework to assemble matrices where multiple assembled parts (e.g. cells in FEM) may contribute to the same non-zero entry of the matrix. The std::unordered_map we use under the hood is well known for its poor performance compared to other data structures like abseil's flat_map, but using it avoided adding another dependency to our library. It should be easy to build a data structure that can use an upper bound for the number of nnz per row to build a small flat hashmap for each row. I'll put it on my TODO list :)
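
To illustrate the trade-off described above, here is a minimal standalone sketch that does not use Ginkgo and is not Ginkgo's implementation. The type toy_assembly and its methods are hypothetical stand-ins: contributions to the same (row, column) position are summed in a std::unordered_map, and the entries are then extracted as triplets sorted by row and column.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <unordered_map>
#include <vector>

// One nonzero as a (row, column, value) triplet.
struct triplet {
    std::int64_t row;
    std::int64_t col;
    double value;
};

// Hypothetical stand-in for a hash-based assembly structure.
class toy_assembly {
public:
    explicit toy_assembly(std::int64_t num_cols) : num_cols_{num_cols} {}

    // Add a contribution; repeated positions are summed, as happens when
    // several cells contribute to the same matrix entry.
    void add_value(std::int64_t row, std::int64_t col, double value)
    {
        entries_[row * num_cols_ + col] += value;
    }

    // Return all entries sorted by row, then by column.
    std::vector<triplet> ordered() const
    {
        std::vector<triplet> result;
        result.reserve(entries_.size());
        for (const auto& kv : entries_) {
            result.push_back(
                {kv.first / num_cols_, kv.first % num_cols_, kv.second});
        }
        std::sort(result.begin(), result.end(),
                  [](const triplet& a, const triplet& b) {
                      return std::tie(a.row, a.col) < std::tie(b.row, b.col);
                  });
        return result;
    }

private:
    std::int64_t num_cols_;
    // Key is the flattened position row * num_cols + col.
    std::unordered_map<std::int64_t, double> entries_;
};
```

In this sketch every add_value call pays for a hash lookup and possibly a node allocation, whereas appending a triplet to a vector is a contiguous write. That is one plausible reason why a hash-based assembly path can be slower when each entry is only written once, as in the finite-difference generator from the original post.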

Ben90001 Dec 12, 2024
Author

Thanks a lot for the fast answers :)
I don't know whether including the abseil hashmap will fit into my schedule, but I will definitely run my timings on the OMP executor with a single core as well and take a look at device_matrix_data.

upsj Dec 12, 2024
Maintainer

I'll also add #1754 here, in case we get around to addressing it soon. Abseil's flat map might work well as a global hash map for the entire matrix, but since we know that all rows are non-empty, we could do a bit better by storing a single flat map for each row. Using Abseil for that would probably give suboptimal performance, since it requires an allocation for each row, but if we know an upper bound for each row length, we can implement a flat map ourselves.
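
One small flat map per row with a known upper bound on the row length might look roughly like the following. This is only a sketch of the concept under stated assumptions, not the design tracked in #1754: the class name row_bounded_assembly, its interface, and the choice of linear probing with column modulo capacity as the hash are all hypothetical. All rows share two contiguous arrays, so no per-row allocation is needed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Marker for an unused slot; valid column indices are non-negative.
constexpr std::int64_t empty_slot = -1;

// Hypothetical per-row open-addressing map with a fixed capacity per row.
class row_bounded_assembly {
public:
    row_bounded_assembly(std::size_t num_rows, std::size_t max_row_nnz)
        : capacity_{max_row_nnz},
          cols_(num_rows * max_row_nnz, empty_slot),
          vals_(num_rows * max_row_nnz, 0.0)
    {
        if (max_row_nnz == 0) {
            throw std::invalid_argument("max_row_nnz must be positive");
        }
    }

    // Add a contribution to (row, col); repeated positions are summed.
    void add_value(std::size_t row, std::int64_t col, double value)
    {
        const std::size_t base = row * capacity_;
        const std::size_t start = static_cast<std::size_t>(col) % capacity_;
        for (std::size_t probe = 0; probe < capacity_; ++probe) {
            const std::size_t slot = base + (start + probe) % capacity_;
            if (cols_[slot] == empty_slot) {
                cols_[slot] = col;  // claim the first free slot
            }
            if (cols_[slot] == col) {
                vals_[slot] += value;
                return;
            }
        }
        throw std::length_error("row exceeds the declared nnz bound");
    }

    // Look up an entry; positions that were never written read as zero.
    double get(std::size_t row, std::int64_t col) const
    {
        const std::size_t base = row * capacity_;
        const std::size_t start = static_cast<std::size_t>(col) % capacity_;
        for (std::size_t probe = 0; probe < capacity_; ++probe) {
            const std::size_t slot = base + (start + probe) % capacity_;
            if (cols_[slot] == col) return vals_[slot];
            if (cols_[slot] == empty_slot) break;  // no deletions, so stop
        }
        return 0.0;
    }

    // Number of distinct columns stored in a row.
    std::size_t row_nnz(std::size_t row) const
    {
        std::size_t count = 0;
        for (std::size_t k = 0; k < capacity_; ++k) {
            if (cols_[row * capacity_ + k] != empty_slot) ++count;
        }
        return count;
    }

private:
    std::size_t capacity_;
    std::vector<std::int64_t> cols_;  // capacity_ column slots per row
    std::vector<double> vals_;        // value for each column slot
};
```

For the d-dimensional finite-difference stencil in the original post, the bound would be 2 * d + 1 entries per row, so each lookup touches at most a handful of adjacent slots.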
