Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add occupied(n) for unordered set and map #427

Conversation

tanzby
Copy link
Contributor

@tanzby tanzby commented Aug 8, 2024

Add bool occupied(index n) function for unordered_set and unordered_map. So that we can:

struct AllocateNewBlocks {
  AllocateNewBlocks(stdgpu::unorderd_map<xx> block_map,
                    BlockBuffer buffer_buffer,
                    stdgpu::unorderd_set<xx> not_exist_block_indices)
      : buffer_buffer(buffer_buffer),
        block_map(block_map),
        not_exist_block_indices(not_exist_block_indices) {}

  __device__ void operator()(const stdgpu::index_t index) {
    if (!not_exist_block_indices.occupied(index)) { // Used here.
      return;
    }
    const BlockIndex block_index = *(not_exist_block_indices.begin() + index);
    if (const auto& [iter, is_inserted] = block_map.emplace(block_index, 0); is_inserted) {
      iter->second = buffer_view.AllocateBlock();
    }
  }

  BlockBuffer buffer_buffer;
  stdgpu::unorderd_map<xx> block_map;
  stdgpu::unorderd_set<xx> not_exist_block_indices;
}; 


stdgpu::for_each_index(thrust::cuda::par.on(stream()),
                       not_exist_block_indices().max_size(),
                       AllocateNewBlocks(block_map, block_buffer, not_exist_block_indices);

It skips the need for getting device_range.

@tanzby
Copy link
Contributor Author

tanzby commented Aug 8, 2024

@stotko help to review this PR, thanks

Copy link

codecov bot commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 0% with 4 lines in your changes missing coverage. Please review.

Project coverage is 97.19%. Comparing base (3b7d712) to head (87fe7d7).
Report is 22 commits behind head on master.

Files with missing lines Patch % Lines
src/stdgpu/impl/unordered_map_detail.cuh 0.00% 2 Missing ⚠️
src/stdgpu/impl/unordered_set_detail.cuh 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #427      +/-   ##
==========================================
- Coverage   97.34%   97.19%   -0.16%     
==========================================
  Files          32       32              
  Lines        2524     2528       +4     
==========================================
  Hits         2457     2457              
- Misses         67       71       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@stotko
Copy link
Owner

stotko commented Aug 12, 2024

Add bool occupied(index n) function for unordered_set and unordered_map. So that we can:

struct AllocateNewBlocks {
  AllocateNewBlocks(stdgpu::unorderd_map<xx> block_map,
                    BlockBuffer buffer_buffer,
                    stdgpu::unorderd_set<xx> not_exist_block_indices)
      : buffer_buffer(buffer_buffer),
        block_map(block_map),
        not_exist_block_indices(not_exist_block_indices) {}

  __device__ void operator()(const stdgpu::index_t index) {
    if (!not_exist_block_indices.occupied(index)) { // Used here.
      return;
    }
    const BlockIndex block_index = *(not_exist_block_indices.begin() + index);
    if (const auto& [iter, is_inserted] = block_map.emplace(block_index, 0); is_inserted) {
      iter->second = buffer_view.AllocateBlock();
    }
  }

  BlockBuffer buffer_buffer;
  stdgpu::unorderd_map<xx> block_map;
  stdgpu::unorderd_set<xx> not_exist_block_indices;
}; 


stdgpu::for_each_index(thrust::cuda::par.on(stream()),
                       not_exist_block_indices().max_size(),
                       AllocateNewBlocks(block_map, block_buffer, not_exist_block_indices);

It skips the need for getting device_range.

Thanks for working on this. However, I believe that exposing the occupied function is not the right way to move forward since this function really meant as an implementation detail of the base container. Even exposing begin(), for symmetry with end() (required for find()), already gives more access to the internals than typically needed.

While the use case you mentioned is fine, it may suffer from bad performance since the load factor of unordered_map/unordered_set is typically low and thus the container is only sparsely filled, which would lead to a low thread utilization in your kernel where many threads of a warp immediately return in the if statement. That is the reason for having device_range() as it allows to densely pack all occupied values.

As mentioned in #423, adding (host-only) overloads with an additional stream argument for the load() and store() function of atomic would be better as it addresses the actual underlying problem. More concretely, I see two options here:

  1. Also implement a stream-aware host-to-device memcpy function: Clean, but not directly straightforward to do as the stream is a template class and the internal memory management system is intentionally strongly decoupled from the rest of the library.
  2. Simulate the memcpy with a "no-op" transform_reduce_index to let thrust do the work for us: More like a workaround and might be inefficient in terms of performance.

@stotko
Copy link
Owner

stotko commented Nov 20, 2024

The underlying issue has been fixed in #450.

@stotko stotko closed this Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants