From 86ea895c6e680d03a59283b40ae614f16a1a10ae Mon Sep 17 00:00:00 2001
From: Zhuoran Zhao <zhuoran@meta.com>
Date: Thu, 8 Feb 2024 02:01:36 -0800
Subject: [PATCH] Fix BF16 group_index_select_2d on AMD GPU (#2321)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Summary:
Pull Request resolved: https://github.com/pytorch/FBGEMM/pull/2321

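As the title says. On AMD GPUs, compiling the hipified `sparse_group_index.hip` fails because `cuda_bf16.h` is not found (log below). This change includes `hip/hip_bfloat16.h` when `USE_ROCM` is defined and moves the `cuda_bf16.h` include behind an `#elif` on `CUDA_VERSION` and `__CUDA_ARCH__`.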
```
[zhuoran@devgpu003.snc8 /data/users/zhuoran/fbsource/fbcode (7932bb4ab|remote/fbsource/stable...)]$ HIP_VISIBLE_DEVICES=7 numactl --cpunodebind=1 --membind=1 buck2 run mode/{opt,amd-gpu} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //hammer/modules/sequential/encoders/tests:hstu_bench -- --enable-multi-stream=true --enable_profiler=true --num-streams=3 --num-workers=3
Watchman fresh instance: new mergebase, cleared graph state, cleared dep files
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/tools/setup_helpers:gen_version_header to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2:substitute to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/tools/amd_build:build_amd to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/torchgen:gen to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
 ⚠  Python 3.8 is EOL, and is going away by the end of H1 2024. Upgrade //caffe2/tools/setup_helpers:generate_code to Python 3.10 now to avoid breakages. https://fburl.com/py38-sunsetting
Action failed: fbcode//deeplearning/fbgemm/fbgemm_gpu:sparse_ops_hip (hip_compile src/sparse_ops/sparse_group_index.hip (pic))
Remote command returned non-zero exit code 1
Reproduce locally: `frecli cas download-action f0569d85851723e287f08ed03c0bc831587c0a05f94c911fe0b204ddd7670d24:145`
stdout:
stderr:
buck-out/v2/gen/fbcode/2ab98e452e15a67d/deeplearning/fbgemm/fbgemm_gpu/__sparse_ops_hip_hipify_gen__/out/src/sparse_ops/sparse_group_index.hip:11:10: fatal error: 'cuda_bf16.h' file not found
#include <cuda_bf16.h>
         ^~~~~~~~~~~~~
1 error generated when compiling for gfx90a.
```
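
For reviewers who want to check which branch of the new include guard a given build takes, here is a minimal sketch. `selected_bf16_header` is a hypothetical helper and not part of FBGEMM; it only mirrors the preprocessor conditions from this patch and returns the name of the header that branch would include.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper (not in FBGEMM): mirrors the include guard from this
// patch and reports which BF16 header that guard would select for the
// macros defined in the current translation unit.
constexpr const char* selected_bf16_header() {
#if (defined(USE_ROCM))
  // ROCm build: the patch includes the HIP bfloat16 header.
  return "hip/hip_bfloat16.h";
#elif (                                                \
    (defined(CUDA_VERSION) && CUDA_VERSION < 11000) || \
    (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800)))
  // CUDA build matching the patch's #elif condition.
  return "cuda_bf16.h";
#else
  // Neither branch is taken, so no BF16 header is included here.
  return "(none)";
#endif
}
```

Built with a plain host compiler and none of the macros defined, the helper returns `"(none)"`; adding `-DUSE_ROCM` makes it return `"hip/hip_bfloat16.h"`.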

Reviewed By: nrsatish, sryap, htyu

Differential Revision: D53549323

fbshipit-source-id: 73753c91cbb4c327ff6952bfa7d889ef02b8a31f
---
 fbgemm_gpu/src/sparse_ops/sparse_group_index.cu | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/fbgemm_gpu/src/sparse_ops/sparse_group_index.cu b/fbgemm_gpu/src/sparse_ops/sparse_group_index.cu
index 353fa1eab..b8dd05529 100644
--- a/fbgemm_gpu/src/sparse_ops/sparse_group_index.cu
+++ b/fbgemm_gpu/src/sparse_ops/sparse_group_index.cu
@@ -6,12 +6,13 @@
  * LICENSE file in the root directory of this source tree.
  */
 
-#ifdef USE_ROCM
-#include <hip/hip_bf16.h>
-#else
+#if (defined(USE_ROCM))
+#include <hip/hip_bfloat16.h>
+#elif (                                                \
+    (defined(CUDA_VERSION) && CUDA_VERSION < 11000) || \
+    (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 800)))
 #include <cuda_bf16.h>
-#endif // USE_ROCM
-
+#endif
 #include "common.cuh"
 
 using Tensor = at::Tensor;