How to run Inference? #11

Open
SuperMaximus1984 opened this issue Nov 4, 2024 · 8 comments

SuperMaximus1984 commented Nov 4, 2024

Config:
Windows 10 with an RTX 4090
All requirements installed, including the flash-attn build.

Server:

(venv) D:\PythonProjects\hertz-dev>python inference_server.py
Using device: cuda
<All keys matched successfully>
<All keys matched successfully>
Loaded tokenizer state dict: _IncompatibleKeys(missing_keys=[], unexpected_keys=['recon_metric.metrics.0.window', 'encoder.res_stack.0.pad_buffer', 'encoder.res_stack.1.res_block.0.conv1.pad_buffer', 'encoder.res_stack.1.res_block.1.conv1.pad_buffer', 'encoder.res_stack.1.res_block.2.conv1.pad_buffer', 'encoder.res_stack.1.res_block.3.pad_buffer', 'encoder.res_stack.2.res_block.0.conv1.pad_buffer', 'encoder.res_stack.2.res_block.1.conv1.pad_buffer', 'encoder.res_stack.2.res_block.2.conv1.pad_buffer', 'encoder.res_stack.2.res_block.3.pad_buffer', 'encoder.res_stack.3.res_block.0.conv1.pad_buffer', 'encoder.res_stack.3.res_block.1.conv1.pad_buffer', 'encoder.res_stack.3.res_block.2.conv1.pad_buffer', 'encoder.res_stack.3.res_block.3.pad_buffer', 'encoder.res_stack.4.res_block.0.conv1.pad_buffer', 'encoder.res_stack.4.res_block.1.conv1.pad_buffer', 'encoder.res_stack.4.res_block.2.conv1.pad_buffer', 'encoder.res_stack.5.res_block.0.conv1.pad_buffer', 'encoder.res_stack.5.res_block.1.conv1.pad_buffer', 'encoder.res_stack.5.res_block.2.conv1.pad_buffer', 'encoder.res_stack.6.res_block.0.conv1.pad_buffer', 'encoder.res_stack.6.res_block.1.conv1.pad_buffer', 'encoder.res_stack.6.res_block.2.conv1.pad_buffer', 'decoder.res_stack.0.res_block.2.conv1.pad_buffer', 'decoder.res_stack.0.res_block.3.conv1.pad_buffer', 'decoder.res_stack.0.res_block.4.conv1.pad_buffer', 'decoder.res_stack.1.res_block.2.conv1.pad_buffer', 'decoder.res_stack.1.res_block.3.conv1.pad_buffer', 'decoder.res_stack.1.res_block.4.conv1.pad_buffer', 'decoder.res_stack.2.res_block.2.conv1.pad_buffer', 'decoder.res_stack.2.res_block.3.conv1.pad_buffer', 'decoder.res_stack.2.res_block.4.conv1.pad_buffer', 'decoder.res_stack.3.res_block.1.conv1.pad_buffer', 'decoder.res_stack.3.res_block.2.conv1.pad_buffer', 'decoder.res_stack.3.res_block.3.conv1.pad_buffer', 'decoder.res_stack.4.res_block.1.conv1.pad_buffer', 'decoder.res_stack.4.res_block.2.conv1.pad_buffer', 'decoder.res_stack.4.res_block.3.conv1.pad_buffer', 'decoder.res_stack.5.res_block.1.conv1.pad_buffer', 'decoder.res_stack.5.res_block.2.conv1.pad_buffer', 'decoder.res_stack.5.res_block.3.conv1.pad_buffer', 'decoder.res_stack.6.pad_buffer'])
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: Memory efficient kernel not used because: (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:773.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen/native/transformers/sdp_utils_cpp.h:558.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: Flash attention kernel not used because: (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:775.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:599.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: CuDNN attention kernel not used because: (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:777.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: CuDNN attention has been runtime disabled. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:528.)
  x = F.scaled_dot_product_attention(
Traceback (most recent call last):
  File "D:\PythonProjects\hertz-dev\inference_server.py", line 166, in <module>
    audio_processor = AudioProcessor(model=model, prompt_path=args.prompt_path)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\inference_server.py", line 58, in __init__
    self.initialize_state(prompt_path)
  File "D:\PythonProjects\hertz-dev\inference_server.py", line 78, in initialize_state
    self.next_model_audio = self.model.next_audio_from_audio(self.loaded_audio.unsqueeze(0), temps=TEMPS)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\model.py", line 323, in next_audio_from_audio
    next_latents = self.next_latent(latents_in, temps)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\model.py", line 333, in next_latent
    logits1, logits2 = self.forward(model_input)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\model.py", line 313, in forward
    x = layer(x, kv=self.cache[l])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 301, in forward
    h = self.attn(x, kv)
        ^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 253, in forward
    return x + self.attn(self.attn_norm(x), kv)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 233, in forward
    return self._attend(q, k, v, kv_cache=kv)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 212, in _attend
    x = self._sdpa(q, k, v)
        ^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 195, in _sdpa
    x = F.scaled_dot_product_attention(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No available kernel. Aborting execution.

Any advice on how to run inference?
Thank you!

SuperMaximus1984 (Author)

On Ubuntu, the output of running inference_server.py is much shorter, though:

Using device: cuda
Killed

devanshrpandey (Contributor) commented Nov 6, 2024

The "Killed" message on Ubuntu is probably a CPU OOM; running the script and then checking sudo dmesg should confirm that.

It looks like, for some reason, the Windows drivers / torch version / CUDA version you're using don't support flash attention; we'll see if we can reproduce this and swap out the attention kernel.
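
In the meantime, a possible workaround is to pin PyTorch's SDPA to the math backend, which always has a kernel available. This is a minimal sketch, untested on Windows, and the helper name is mine rather than anything in the repo:

```python
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Pin SDPA to the always-available (but slower) math backend. This
# sidesteps "RuntimeError: No available kernel" when the flash,
# memory-efficient, and cuDNN kernels are all unusable on a platform.
def sdpa_math_fallback(q, k, v, is_causal=False):
    with sdpa_kernel(SDPBackend.MATH):
        return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```

The math backend is a plain eager-mode implementation, so it should run anywhere, at some cost in speed and memory.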

SuperMaximus1984 (Author)

@devanshrpandey
On Ubuntu (WSL) I have 32 GB of memory, and I get this message even before memory is full.

On Windows I separately tested Flash Attention with an external script and it worked; I compiled it on this machine for Torch 2.5.1 with CUDA 12.4.
What should I do to make your code work (change the PyTorch / CUDA version)?
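
For reference, here is the diagnostic snippet I'm using to see what this torch build reports (plain stock PyTorch calls, nothing from this repo):

```python
import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("flash SDP enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP enabled:         ", torch.backends.cuda.math_sdp_enabled())

# A tiny half-precision attention call on the GPU: this raises
# "RuntimeError: No available kernel" if every backend is unusable.
q = torch.randn(1, 8, 64, 64, device="cuda", dtype=torch.float16)
out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
print("SDPA ran, output shape:", tuple(out.shape))
```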

Thank you!

hl2dm commented Nov 7, 2024

I encountered the same issue reported above: a RuntimeError: No available kernel. Aborting execution. when trying to run inference_server.py. Below is my system information:

OsName: Microsoft Windows 11 Pro
OsOperatingSystemSKU: 48
OsArchitecture: 64-bit

Python 3.10.10

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:51:05_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

NVIDIA GeForce RTX 4090 (driver 32.0.15.6603)

32GB RAM

I followed the installation and setup instructions as documented but ended up with this runtime error. Could you please provide any insights on what might be causing this issue and how to resolve it?

Thank you for your help!

SuperMaximus1984 (Author) commented Nov 7, 2024

@hl2dm Looks like we need to find a magic combination of flash-attn / Torch / CUDA versions to make it work on Windows.

"No available kernel" means you don't have a working Flash Attention kernel. But even though I compiled flash-attn for Torch 2.5.1 / CUDA 12.4, it doesn't work (as described above).
As @devanshrpandey said, it seems they need to fix something in their code for that.
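
One thing worth noting: the flash-attn package you compile with pip and the flash kernel built into torch's scaled_dot_product_attention are separate code paths, and the warning in my log above ("Torch was not compiled with flash attention") points at the latter. A quick way to test torch's built-in flash kernel in isolation (plain PyTorch, nothing repo-specific):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Flash SDPA needs CUDA tensors in fp16/bf16; pin SDPA to that
# backend alone so a failure is unambiguous.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
try:
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        F.scaled_dot_product_attention(q, q, q)
    print("built-in flash SDPA kernel works")
except RuntimeError as e:
    print("built-in flash SDPA kernel unavailable:", e)
```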

SuperMaximus1984 (Author)

Hey guys, thanks for updating the repo; it works now.
But the conversation is strange: I mostly hear "aaa.. ughhh" and some sighs, not a constructive dialogue. Does it have something to do with inference.ipynb and the prompts there? I just launched it with the default data.

hl2dm commented Nov 10, 2024

It works now, but what I don't understand is how to use it. Do I have to speak into a microphone? There is also a small problem: the connection times out on my end. Maybe I should try using Jupyter.

SuperMaximus1984 (Author)

@hl2dm
Yeah, same question: how to use it properly and have a conversation with the LLM.
