Performance sluggish on AMD RX-Vega 64 #403

Open
JSav87 opened this issue Oct 13, 2020 · 1 comment

JSav87 commented Oct 13, 2020

Hello CLBlast group,

Thank you for writing such an awesome library. I think your contribution to the world of open source is really great.

Unfortunately, I have noticed that the performance of the CLBlast GEMM really isn't much better than the same multiplication on my CPU using standard Eigen: it is perhaps a factor of 2 or 3 faster, and I would have expected a much larger gain. I ran all-tuners, updated the optimisation results, and recompiled as described on the optimisations page. I am running on the AMD RX Vega 64 GPU, the same device as in the optimisation results I recently uploaded. For the tuners or the compilation, do I need to enable some sort of extra flag for the AMD architecture?

Any help would be appreciated. I would really rather stick with CLBlast and pass this on to the users of my library.

CNugteren (Owner) commented Oct 13, 2020

Hello and thanks for the nice words.

Indeed, you don't need to do anything special to use the tuning results, except for making sure you use the latest version and have recompiled the library of course.

About the speed issue: this can depend on a lot of factors. Here are some steps to follow to get a bit more insight into your issue (I actually went through them myself with your data):

  1. Typically it is good to compare the peak performance of the device against what you get with CLBlast. You won't get 100%, but something above 50% should be attainable. According to Wikipedia, your Vega 64 should get around 10,000 GFLOPS peak, assuming we are talking about single precision (SGEMM).
  2. Now that we know that, let's look at what you got when running the tuners. The easiest way is to look at your logs (or re-run the tuner), because the xgemm tuner reports its results in GFLOPS when run in 32-bit precision. Alternatively, the final database JSON results contain the same information, but expressed as execution time. From your data I see that 0.43 ms was the best you got for 1024x1024x1024, which translates to around 5000 GFLOPS if my calculations are correct (2 x 1024^3 floating-point operations in 0.43 ms). That is about a factor of 2 below what we should hope to get in theory, but not that bad. There seems to be something special about the Vega architecture that CLBlast doesn't optimize for. Other reports here have shown that it is not easy to get good performance on it, so we shouldn't expect much beyond that number.
  3. If the numbers in point 2 are good but your final benchmark isn't, then something else could be wrong: for example, your matrix sizes are too small, or some other overhead in CLBlast starts to become an issue. Or you are not measuring correctly. So how are you measuring this? With the CLBlast 'client' software that does this measurement for you, or with your own measurement? If it is the latter, can you try compiling the clients and running something like ./clblast_client_xgemm -n 1024 -m 1024 -k 1024?
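For reference, the conversion from tuner execution time to GFLOPS used in point 2 can be sketched as follows. This is only an illustration: the function names are made up for this example, the 2*m*n*k operation count is the usual convention for GEMM, and the 10,000 GFLOPS peak is the approximate figure from point 1, not a measured value.

```python
def gemm_gflops(m, n, k, seconds):
    """GFLOPS of one m x n x k GEMM, counting one multiply and one add
    per inner-product term (2*m*n*k floating-point operations in total)."""
    return 2.0 * m * n * k / seconds / 1e9


def fraction_of_peak(gflops, peak_gflops):
    """Achieved throughput as a fraction of the device's theoretical peak."""
    return gflops / peak_gflops


best_ms = 0.43            # best xgemm tuner time reported for 1024x1024x1024
assumed_peak = 10000.0    # approximate single-precision peak, an assumption

achieved = gemm_gflops(1024, 1024, 1024, best_ms * 1e-3)
print(f"{achieved:.0f} GFLOPS, {fraction_of_peak(achieved, assumed_peak):.0%} of peak")
```

With the 0.43 ms figure this lands at roughly 5000 GFLOPS, i.e. about half of the assumed peak, which matches the estimate above.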

Other than that, the most useful piece of information here would be whether you can run a 1024x1024x1024 SGEMM with other software on your Vega GPU, for example with AMD's ROCm BLAS, and report what you get. Then we can tell whether it is a CLBlast issue.
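If you end up timing things yourself, numbers from different libraries are only comparable when they are measured the same way. A minimal sketch of such a harness, under some assumptions: `run_once` is a hypothetical callable that you would write yourself to launch one SGEMM and return only after the device has finished (for OpenCL that means waiting on the event or the queue), and the untimed warm-up run is there to keep one-time costs such as kernel compilation out of the measurement.

```python
import time


def best_time_seconds(run_once, warmup=1, repeats=10):
    """Return the fastest wall-clock time of `run_once` over `repeats` runs.

    `run_once` is a hypothetical user-supplied callable; it must block until
    the GEMM has actually completed, otherwise only the launch is timed.
    """
    for _ in range(warmup):
        run_once()  # untimed: absorbs one-time setup such as kernel compilation

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)  # keep the fastest run to reduce noise
    return best


# Stand-in workload for illustration only; replace with a real GEMM call.
seconds = best_time_seconds(lambda: sum(range(10000)))
print(f"best run: {seconds * 1e3:.3f} ms")
```

The best time can then be converted to GFLOPS as 2*m*n*k divided by the time in seconds, divided by 1e9, and compared against the CLBlast client output for the same sizes.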
