Refactoring MetaDataObject out of DenseMatrix #758

corepointer · 2024-06-17T16:13:03Z

This PR moves the MetaDataObject (MDO) functionality out of DenseMatrix and generalizes it to be used by other classes derived from Structure as well.

Furthermore, this contains a performance improvement to prevent excessive allocation ID lookups and a separation of ranged and full allocations.

All tests are running except the distributed ones.

* This commit introduces the meta data object to the CSR data type * Memory pinning To prevent excessive allocation ID lookups in the hot path when using --vec, this change "pins" memory by allocation type of previous accesses.

… Pinning * This commit introduces the meta data object to the CSRMatrix data type To implement this change, handling of the AllocationDescriptors has been refactored out of DenseMatrix. * Separate handling of ranges Since tracking of ranges of data is only used in the distributed setting for now, we will handle this separately and assume always a full allocation for local computation. This should result in less unnecessary "if range not null do this, else do that". * Memory pinning To prevent excessive allocation ID lookups in the hot path, especially when using --vec, this change "pins" memory by allocation type of previous accesses. Simply put, as long as there is no different access type (e.g., call getValues() for host vs device memory) it is assumed, that the data is not changed and no query of the meta data object needs to be done. Closes daphne-eu#758

corepointer · 2024-10-18T17:13:51Z

The numerous force pushes are a result of my local clang-format disagreeing with the CI's clang-format:

 --- src/runtime/local/datastructures/AllocationDescriptorGRPC.h	(original)
+++ src/runtime/local/datastructures/AllocationDescriptorGRPC.h	(reformatted)
@@ -35,7 +35,7 @@
   public:
     AllocationDescriptorGRPC() = default;
     AllocationDescriptorGRPC(DaphneContext *ctx, const std::string &address, const DistributedData &data)
-        : ctx(ctx), workerAddress(address), distributedData(data) {};
+        : ctx(ctx), workerAddress(address), distributedData(data){};
 
     ~AllocationDescriptorGRPC() override = default;
     [[nodiscard]] ALLOCATION_TYPE getType() const override { return type; };

corepointer · 2024-10-18T17:20:24Z

Explaining the labels:

feature: CUDA handling CSRMatrix is new
Accelerator: it's (also) about CUDA ops
Distributed: the refactoring affects this component
Performance: besides this one explaining itself, the pinning and being able to run sparse stuff on GPU help with performance 💪

…not throw Changing the behavior of fileExists() to a boolean operation as suggested by the method's name. Throwing an exception us up to the caller of this method. Closes daphne-eu#867

… Pinning * This commit introduces the meta data object to the CSRMatrix data type To implement this change, handling of the AllocationDescriptors has been refactored out of DenseMatrix. * Separate handling of ranges Since tracking of ranges of data is only used in the distributed setting for now, we will handle this separately and assume always a full allocation for local computation. This should result in less unnecessary "if range not null do this, else do that". * Memory pinning To prevent excessive allocation ID lookups in the hot path, especially when using --vec, this change "pins" memory by allocation type of previous accesses. Simply put, as long as there is no different access type (e.g., call getValues() for host vs device memory) it is assumed, that the data is not changed and no query of the meta data object needs to be done. Closes daphne-eu#758

Due to the use of ptr to local var the distributed (GRPC_SYNC) mode crashed in test cases. This patch fixes this by using std::unique_ptr appropriately. Furthermore, a check for nullptr is performed before getting distributed data to add a message indicating that execution failed here.

philipportner · 2024-10-24T20:51:44Z

Hi @corepointer, thanks for putting so much effort into improving this code.

This PR is pretty big and includes quite a few important changes at the core of our data structures.

As there's a lot of changes to this core code, but no changes to the documentation, could you please update the documentation describing the overall idea behind the whole MetaDataObject and IAllocatoinDescriptor functionality?

Especially the knowledge that is hard to get from just reading the code, like what it can and what it cannot do (and what it should do?). From what I can remember of previous discussions and you mention "ranged and full allocations", but maybe I'm wrong, it should handle parts of a DenseMatrix being allocated on CPU and parts of it being moved to GPU? Stuff like that :)

Additionally, it would be great if this can explicitly be tested as part of our test suite. Proper testing of this core functionality would be incredibly beneficial for working on this code in the future. This should be doable by providing mock implementations of the IAllocationDescriptor.

Lastly, could you please provide some performance results for these changes that measure these improvements you mention concerning excessive allocation ID lookups?

I'd also be happy to do a review of the PR if you like.

pdamme · 2024-10-24T21:06:41Z

I was also just finishing reading the code: Thanks for this contribution, @corepointer. Refactoring the use of the meta data objects to also support CSRMatrix and to solve some of the performance issues of the vectorized engine runtime is highly welcome!

Major comments:

I think I have a very rough idea of how the meta data object etc. works on main, but I cannot claim that my understanding is deep enough to be able to judge all implications of your changes. While I didn't spot any obvious big problems, I totally second @philipportner's request for more developer documentation (e.g., a page about the ideas behind the meta data object and related entities and how they are used by the data structures etc.).
Likewise about the performance implications: You mentioned that the "pinning" of the allocation can make the repeated calls to getValues() in the vectorized engine faster. Do you have a concrete example where this performance improvement can be observed? A quick experiment of bin/daphne --vec test/api/cli/algorithms/kmeans.daphne r=1000000 f=1000 c=5 i=10 didn’t show any improvement compared to the main branch on my system (but I didn’t try hard to find a good example).

A few detailed comments:

CSRMatrix.h/cpp: The renaming of getValues() to getRowValues() and getColIdxs() to getColIdxsOfRow() is inconsistent. Could we call the former getValuesOfRow() (or, do we really need the renaming at all)?
AggCum.h: You introduced variables arg_inc and res_inc, which are either the numCols or the rowSkip, depending on whether the respective matrix is a vew. However, the rowSkip is meant to abstract exactly that: no matter if a matrix is an "original" or a view, the rowSkip is always the number by which one must increase a pointer in order to get to the element below.

corepointer · 2024-11-05T09:31:47Z

Thanks for the feedback @philipportner and @pdamme. And thanks for a taste of my one medicine (I frequently note the lacking documentation and test cases in code reviews) 😆 I will, of course, fix this.

Pinning performance improvement: The idea originated from an observation of @philipportner that a lot of time is spent in getDataPlacementByID(), which ran a double for loop on every get/set of a value. Do you remember (@philipportner) what that script was where you encountered the issue? I'd use that as a starting point.
Adding a getRowValues() (in addition to getValues()) was an ad-hoc decision concerning the naming and can certainly be changed.
In the AggCum.h issue I can't remember off the top of my head now what exactly the issue was. I think I found a case where the increment in the algorithm needed to be numCols for some cases and rowSkip for other cases depending on whether there is a view or a "real" matrix.

corepointer mentioned this pull request Jul 22, 2024

Dnn ops #734

Draft

corepointer force-pushed the mdo_csr_cuda_refactor branch from df4702e to cfd8053 Compare October 18, 2024 15:15

corepointer force-pushed the mdo_csr_cuda_refactor branch from cfd8053 to 17d3baa Compare October 18, 2024 15:31

corepointer force-pushed the mdo_csr_cuda_refactor branch from 17d3baa to d9d1b59 Compare October 18, 2024 17:04

corepointer force-pushed the mdo_csr_cuda_refactor branch from d9d1b59 to 9016ae9 Compare October 18, 2024 17:06

corepointer force-pushed the mdo_csr_cuda_refactor branch from 9016ae9 to 6f6da3b Compare October 18, 2024 17:10

corepointer force-pushed the mdo_csr_cuda_refactor branch from 6f6da3b to ce36921 Compare October 18, 2024 17:12

corepointer marked this pull request as ready for review October 18, 2024 17:15

corepointer added feature missing/requested features performance label for PRs of perf++ and issues of perf-- Accelerators Distributed Issues and PRs related to distributed computation labels Oct 18, 2024

corepointer requested a review from pdamme October 18, 2024 17:20

corepointer added 3 commits October 19, 2024 02:52

[DAPHNE-daphne-eu#867] Change ConfigParser::fileExists() behavior to …

3e442ed

…not throw Changing the behavior of fileExists() to a boolean operation as suggested by the method's name. Throwing an exception us up to the caller of this method. Closes daphne-eu#867

corepointer force-pushed the mdo_csr_cuda_refactor branch from ce36921 to d434bf5 Compare October 19, 2024 00:59

philipportner mentioned this pull request Nov 1, 2024

[WIP] kmeans.daph #876

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactoring MetaDataObject out of DenseMatrix #758

Refactoring MetaDataObject out of DenseMatrix #758

corepointer commented Jun 17, 2024

corepointer commented Oct 18, 2024

corepointer commented Oct 18, 2024

philipportner commented Oct 24, 2024

pdamme commented Oct 24, 2024

corepointer commented Nov 5, 2024

Refactoring MetaDataObject out of DenseMatrix #758

Are you sure you want to change the base?

Refactoring MetaDataObject out of DenseMatrix #758

Conversation

corepointer commented Jun 17, 2024

corepointer commented Oct 18, 2024

corepointer commented Oct 18, 2024

philipportner commented Oct 24, 2024

pdamme commented Oct 24, 2024

corepointer commented Nov 5, 2024