# accessor-BLAS

The purpose of this repository is to both showcase the performance of the Ginkgo accessor and to serve as an integration example.

## Including Ginkgo's accessor in your project

### Integration into the build system

To use Ginkgo's accessor, you need to:

  1. use C++14 or higher in your own project, and
  2. add `${GINKGO_DIR}` as an include directory (only `${GINKGO_DIR}/accessor` is needed from Ginkgo).

In this repository, we use CMake, which makes the integration straightforward. We give users the option to either specify a local copy of the Ginkgo repository or to have it cloned automatically into the build directory. We achieve both with these few lines of CMake:

```cmake
string(COMPARE EQUAL "${GINKGO_DIR}" "" empty_ginkgo_dir)
# If directory not specified, try to find it. If not found, clone it locally
if (empty_ginkgo_dir)
    message(STATUS "GINKGO_DIR was not specified. Cloning Ginkgo into build directory...")
    find_package(Git REQUIRED)
    execute_process(
        COMMAND "${GIT_EXECUTABLE}" clone --branch develop
                "https://github.com/ginkgo-project/ginkgo.git" "Ginkgo"
        WORKING_DIRECTORY "${accessor_example_BINARY_DIR}"
        RESULT_VARIABLE result
        )
    if(result)
        message(FATAL_ERROR "Failed to clone Ginkgo repository. Error: ${result}")
    endif()
    set(GINKGO_DIR "${accessor_example_BINARY_DIR}/Ginkgo" CACHE PATH "${GINKGO_DIR_INFO}" FORCE)
    message(STATUS "Ginkgo successfully cloned into \"${GINKGO_DIR}\"")
endif()

# This is everything that is needed to use the Ginkgo accessor: C++14 and the
# Ginkgo directory as include directory to have access to the accessor headers.
function(example_apply_default_target_settings target)
    target_compile_features("${target}" PUBLIC cxx_std_14)
    target_include_directories("${target}" PRIVATE
        "${GINKGO_DIR}"
        )
endfunction()
```
Now, we only need to call this function for every target that uses the accessor (after the target has been created with `add_executable` or `add_library`).

## Creating a range with an accessor

In this repository, we only use the `reduced_row_major` accessor, but all others work analogously.

For the `reduced_row_major` accessor, you need to specify:

  1. the dimensionality of the range (we use 2D even for vectors, so that vectors can be accessed with a stride), and
  2. the arithmetic and storage types.

This accessor type can then be used to create the `range<reduced_row_major<...>>` by specifying the size, stride, and storage pointer.

We showcase the creation of both constant and non-constant ranges with `reduced_row_major` accessors here:

```cpp
constexpr std::size_t dimensionality{2};
std::array<gko::acc::size_type, dimensionality - 1> m_stride{m_info.stride};
std::array<gko::acc::size_type, dimensionality - 1> x_stride{x_info.stride};
std::array<gko::acc::size_type, dimensionality - 1> res_stride{res_info.stride};
using accessor =
    gko::acc::reduced_row_major<dimensionality, ArType, StType>;
using range = gko::acc::range<accessor>;
using c_range = gko::acc::range<typename accessor::const_accessor>;
auto m_acc = c_range(m_info.size, mtx, m_stride);
auto x_acc = c_range(x_info.size, x, x_stride);
auto res_acc = range(res_info.size, res, res_stride);
```
We utilize the constant accessors for the matrix and the x vector since both storage pointers are also constant.
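
To make the difference concrete, here is a minimal sketch (assuming `ArType` and the ranges defined above): reads work through both range types, while writes only compile through the non-constant range.

```cpp
// Minimal sketch: element access through the ranges created above.
ArType value = m_acc(0, 0);  // read through the constant range (converts StType to ArType)
res_acc(0, 0) = value;       // write through the non-constant range (converts ArType to StType)
// m_acc(0, 0) = value;      // would not compile: m_acc is built on the const_accessor
```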

## Utilizing the range/accessor in a CUDA kernel

Utilizing the range in a kernel (this works the same on CPUs) is straightforward:

  1. use a templated kernel parameter in order to accept all kinds of ranges, and
  2. perform read and write operations through `operator()`.

To know which arithmetic type is used, we can either use the `accessor::arithmetic_type` type alias or detect the result type of the arithmetic operations. In this example, we use the second option:

```cpp
using ar_type = decltype(alpha * mtx(0, 0) * x(0, 0) + beta * res(0, 0));
```
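
For comparison, the first option could look roughly like the following sketch; it assumes that the range exposes its accessor as the member type `accessor`, which in turn provides `arithmetic_type`:

```cpp
// Sketch of the first option: take the arithmetic type directly from the
// range's accessor type instead of deducing it from an expression.
using ar_type = typename MtxRange::accessor::arithmetic_type;
```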

Read and write operations can be observed in the GEMV kernel here:

```cpp
res(row_idx, 0) = alpha * shared[local_id] + beta * res(row_idx, 0);
```

## Difference between using plain pointers and using the range/accessor

Here, we compare the GEMV kernel written with plain pointers:

```cpp
template <std::int64_t block_size, typename ValueType>
__global__ __launch_bounds__(block_size) void gemv(
    const matrix_info m_info, ValueType alpha, const ValueType *__restrict__ mtx,
    const matrix_info x_info, const ValueType *__restrict__ x,
    const matrix_info res_info, ValueType beta, ValueType *__restrict__ res)
{
    // expect x_info.size[1] == 1
    const std::int64_t row_idx{blockIdx.x};
    if (row_idx >= m_info.size[0]) {
        return;
    }
    const std::int64_t row_start = row_idx * m_info.stride;
    __shared__ char shared_impl[sizeof(ValueType) * block_size];
    auto shared = reinterpret_cast<ValueType *>(shared_impl);
    ValueType local_result{};
    const auto group = cg::this_thread_block();
    const auto local_id = group.thread_rank();
    for (std::int64_t col = local_id; col < m_info.size[1]; col += block_size) {
        const auto mtx_val = mtx[row_start + col];
        const auto x_val = x[col * x_info.stride];
        local_result += mtx_val * x_val;
    }
    shared[local_id] = local_result;
    reduce(group, shared, [](ValueType a, ValueType b) { return a + b; });
    if (local_id == 0) {
        const auto res_idx = row_idx * res_info.stride;
        if (beta == ValueType{0}) {
            res[res_idx] = alpha * shared[local_id];
        } else {
            res[res_idx] = alpha * shared[local_id] + beta * res[res_idx];
        }
    }
}
```

and using the range/accessor:

```cpp
template <std::int64_t block_size, typename MtxRange, typename XRange,
          typename ResRange, typename ArType>
__global__ __launch_bounds__(block_size) void acc_gemv(ArType alpha,
                                                       MtxRange mtx, XRange x,
                                                       ArType beta,
                                                       ResRange res)
{
    using ar_type = decltype(alpha * mtx(0, 0) * x(0, 0) + beta * res(0, 0));
    static_assert(std::is_same<ArType, ar_type>::value, "Types must be equal!");
    // expect x_info.size[1] == 1
    const std::int64_t row_idx{blockIdx.x};
    if (row_idx >= mtx.length(0)) {
        return;
    }
    const auto num_cols = mtx.length(1);
    __shared__ char shared_impl[sizeof(ar_type) * block_size];
    auto shared = reinterpret_cast<ar_type *>(shared_impl);
    ar_type local_result{};
    const auto group = cg::this_thread_block();
    const auto local_id = group.thread_rank();
    for (std::int64_t col = local_id; col < num_cols; col += block_size) {
        local_result += mtx(row_idx, col) * x(col, 0);
    }
    shared[local_id] = local_result;
    reduce(group, shared, [](ar_type a, ar_type b) { return a + b; });
    if (local_id == 0) {
        if (beta == ArType{0}) {
            res(row_idx, 0) = alpha * shared[local_id];
        } else {
            res(row_idx, 0) = alpha * shared[local_id] + beta * res(row_idx, 0);
        }
    }
}
```

The main differences between these are:

  1. the range/accessor kernel has fewer parameters because stride and size information are part of the ranges, which are simply passed by value (see the launch sketch below this list);
  2. with the range/accessor, we do not need to compute a 1D index, since the indices for both dimensions are passed to `operator()`;
  3. for debugging purposes (and as a showcase), we extract the arithmetic type from the accessor.
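
For completeness, a possible launch of the accessor version could look like the following sketch (the block size and grid configuration are illustrative and not taken verbatim from this repository):

```cpp
// Sketch: launch the accessor-based GEMV with one thread block per matrix row.
constexpr std::int64_t block_size = 512;
const dim3 grid(static_cast<unsigned int>(m_info.size[0]));  // one block per row
const dim3 block(block_size);

// The ranges wrap the raw device pointers, so they are simply passed by value.
acc_gemv<block_size><<<grid, block>>>(alpha, m_acc, x_acc, beta, res_acc);
```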

## Plots

So far, we have run the benchmarks and error analysis on an NVIDIA A100 and an NVIDIA V100 GPU.

### Evaluation on the A100

*Plots: A100 dot error, A100 dot flops, A100 gemv error, A100 gemv flops.*

### Evaluation on the V100

*Plots: V100 dot error, V100 dot flops, V100 gemv error, V100 gemv flops.*