Pre-launch workgroupsize auto-tuning #216

tkf · 2021-02-21T23:00:37Z

If the host-side caller of a kernel needs to pre-allocate a buffer whose size depends on the workgroupsize, and the workgroupsize is not specified, the caller has to run workgroupsize auto-tuning before launching the kernel. For example, I used this approach when implementing a "mapreduce" kernel in FoldsCUDA.jl. Can we have an API for invoking workgroupsize auto-tuning before launching the kernel?
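To make the request concrete, here is a minimal sketch of the tune-then-allocate-then-launch pattern written against CUDA.jl directly, using `@cuda launch=false` and `launch_configuration`. It is not a KernelAbstractions API and not necessarily how FoldsCUDA.jl does it; the kernel `block_sums!`, the zero-length `placeholder` argument, and the input size are hypothetical and chosen only for illustration.

```julia
using CUDA

# Hypothetical kernel: each block writes one partial sum into `partials`,
# so the host must know the number of blocks (and hence the number of
# threads per block) before it can allocate `partials`.
function block_sums!(partials, xs)
    if threadIdx().x == 1
        # The first thread of each block sums the block's chunk serially.
        # Deliberately simple; a real reduction would use all threads.
        start = (blockIdx().x - 1) * blockDim().x + 1
        stop = min(start + blockDim().x - 1, length(xs))
        acc = zero(eltype(partials))
        for i in start:stop
            @inbounds acc += xs[i]
        end
        @inbounds partials[blockIdx().x] = acc
    end
    return nothing
end

xs = CUDA.ones(Float32, 10_000)

# Step 1: compile without launching. The buffer size is not known yet,
# so a zero-length array of the same type stands in for it (assumption:
# only the argument types matter for compilation).
placeholder = CUDA.zeros(Float32, 0)
kernel = @cuda launch=false block_sums!(placeholder, xs)

# Step 2: ask the occupancy API for a suggested launch configuration.
config = launch_configuration(kernel.fun)
threads = min(length(xs), config.threads)
blocks = cld(length(xs), threads)

# Step 3: allocate the buffer whose size depends on the tuned configuration.
partials = CUDA.zeros(Float32, blocks)

# Step 4: launch the already-compiled kernel with that configuration.
kernel(partials, xs; threads, blocks)
total = sum(partials)
```

The point of the sketch is the ordering: the thread count has to be known between compilation and launch, which is the step that is not exposed when the workgroupsize is chosen automatically at launch time.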

tkf · 2021-02-21T23:03:12Z

Can this be supported with dynamic localmem #11?

tkf mentioned this issue Feb 22, 2021

Auto-tuning workgroupsize when localmem consumption depends on it #215

Open
