How to run Inference? #11
On Ubuntu the result of running inference_server is much shorter though...
The "Killed" message on Ubuntu is probably a CPU OOM; running the script and then running sudo dmesg should confirm that. It looks like, for some reason, the Windows drivers / torch version / CUDA version you're using doesn't support flash attention; we'll see if we can reproduce this and swap out the attention kernel.
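To confirm the OOM theory, the kernel log can be scanned for OOM-killer entries. This is a minimal, generic sketch (not code from this repo); the pattern and sample log line are illustrative assumptions, and in practice you would feed in the output of `sudo dmesg` rather than the fabricated excerpt below:

```python
import re

# Pattern for kernel-log lines produced when the Linux OOM killer fires.
OOM_RE = re.compile(r"out of memory|oom-killer|killed process", re.IGNORECASE)

def find_oom_events(dmesg_text: str) -> list[str]:
    """Return the lines of dmesg output that indicate an OOM kill."""
    return [line for line in dmesg_text.splitlines() if OOM_RE.search(line)]

# Fabricated log excerpt for illustration; real usage: pipe in `sudo dmesg`.
sample = (
    "[  100.000] usb 1-1: new high-speed USB device\n"
    "[  200.000] Out of memory: Killed process 4242 (python) total-vm:30000000kB\n"
)
print(find_oom_events(sample))
# → ['[  200.000] Out of memory: Killed process 4242 (python) total-vm:30000000kB']
```

If lines like the second one appear right after the process dies, the fix is more system RAM or swap, not a CUDA change.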
@devanshrpandey On Windows I separately tested Flash Attention with an external script and it worked. I compiled it on this machine for Torch 2.5.1 with CUDA 12.4. Thank you!
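A quick way to rule out a build/version mismatch like the one discussed here is to print the installed versions of both packages without fully importing them. This is a hedged, generic sketch (the module-to-distribution name mapping is an assumption about the usual flash-attn packaging, not something stated in this thread):

```python
import importlib.metadata
import importlib.util

# Module name -> distribution name; "flash_attn" is typically published as "flash-attn".
PKGS = {"torch": "torch", "flash_attn": "flash-attn"}

def report_versions(pkgs: dict[str, str]) -> dict[str, str]:
    """Return {module: version-or-status} for each package, without importing it."""
    out = {}
    for module, dist in pkgs.items():
        if importlib.util.find_spec(module) is None:
            out[module] = "not importable"
            continue
        try:
            out[module] = importlib.metadata.version(dist)
        except importlib.metadata.PackageNotFoundError:
            out[module] = "importable, version metadata missing"
    return out

print(report_versions(PKGS))
```

Comparing the reported torch version against the one flash-attn was compiled for (here, Torch 2.5.1 / CUDA 12.4) makes silent mismatches easy to spot.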
I encountered the same issue as previously reported in this thread. I am facing RuntimeError: No available kernel. Aborting execution. when trying to run inference_server.py. Below is my system information:
OS: Microsoft Windows 11 Professional (OsOperatingSystemSKU 48, 64-bit)
Python: 3.10.10
nvcc: NVIDIA (R) Cuda compiler driver
GPU: NVIDIA GeForce RTX 4090 (driver 32.0.15.6603)
RAM: 32 GB
I followed the installation and setup instructions as documented but ended up with this runtime error. Could you please provide any insight into what might be causing this issue and how to resolve it? Thank you for your help!
@hl2dm Looks like we need to find a magic combination of versions; "No available kernel" means you don't have Flash Attention installed.
But even though I have compiled it for Torch 2.5.1 / CUDA 12.4, it doesn't work (as described above).
Hey guys, thanks for updating the repo; it works now.
It works now, but what I don't understand is how to use it. Do I have to speak into a microphone? There is also a small problem: the connection times out. Maybe I should try using Jupyter.
@hl2dm |
Config:
Windows 10 with RTX4090
All requirements incl. flash-attn build - done!
Server:
Any advice on how to run inference?
Thank you!