How to run Inference? #11

Open
SuperMaximus1984 opened this issue Nov 4, 2024 · 8 comments

SuperMaximus1984 commented Nov 4, 2024

Config:
Windows 10 with an RTX 4090
All requirements installed, including the flash-attn build.

Server:

(venv) D:\PythonProjects\hertz-dev>python inference_server.py
Using device: cuda
<All keys matched successfully>
<All keys matched successfully>
Loaded tokenizer state dict: _IncompatibleKeys(missing_keys=[], unexpected_keys=['recon_metric.metrics.0.window', 'encoder.res_stack.0.pad_buffer', 'encoder.res_stack.1.res_block.0.conv1.pad_buffer', 'encoder.res_stack.1.res_block.1.conv1.pad_buffer', 'encoder.res_stack.1.res_block.2.conv1.pad_buffer', 'encoder.res_stack.1.res_block.3.pad_buffer', 'encoder.res_stack.2.res_block.0.conv1.pad_buffer', 'encoder.res_stack.2.res_block.1.conv1.pad_buffer', 'encoder.res_stack.2.res_block.2.conv1.pad_buffer', 'encoder.res_stack.2.res_block.3.pad_buffer', 'encoder.res_stack.3.res_block.0.conv1.pad_buffer', 'encoder.res_stack.3.res_block.1.conv1.pad_buffer', 'encoder.res_stack.3.res_block.2.conv1.pad_buffer', 'encoder.res_stack.3.res_block.3.pad_buffer', 'encoder.res_stack.4.res_block.0.conv1.pad_buffer', 'encoder.res_stack.4.res_block.1.conv1.pad_buffer', 'encoder.res_stack.4.res_block.2.conv1.pad_buffer', 'encoder.res_stack.5.res_block.0.conv1.pad_buffer', 'encoder.res_stack.5.res_block.1.conv1.pad_buffer', 'encoder.res_stack.5.res_block.2.conv1.pad_buffer', 'encoder.res_stack.6.res_block.0.conv1.pad_buffer', 'encoder.res_stack.6.res_block.1.conv1.pad_buffer', 'encoder.res_stack.6.res_block.2.conv1.pad_buffer', 'decoder.res_stack.0.res_block.2.conv1.pad_buffer', 'decoder.res_stack.0.res_block.3.conv1.pad_buffer', 'decoder.res_stack.0.res_block.4.conv1.pad_buffer', 'decoder.res_stack.1.res_block.2.conv1.pad_buffer', 'decoder.res_stack.1.res_block.3.conv1.pad_buffer', 'decoder.res_stack.1.res_block.4.conv1.pad_buffer', 'decoder.res_stack.2.res_block.2.conv1.pad_buffer', 'decoder.res_stack.2.res_block.3.conv1.pad_buffer', 'decoder.res_stack.2.res_block.4.conv1.pad_buffer', 'decoder.res_stack.3.res_block.1.conv1.pad_buffer', 'decoder.res_stack.3.res_block.2.conv1.pad_buffer', 'decoder.res_stack.3.res_block.3.conv1.pad_buffer', 'decoder.res_stack.4.res_block.1.conv1.pad_buffer', 'decoder.res_stack.4.res_block.2.conv1.pad_buffer', 'decoder.res_stack.4.res_block.3.conv1.pad_buffer', 'decoder.res_stack.5.res_block.1.conv1.pad_buffer', 'decoder.res_stack.5.res_block.2.conv1.pad_buffer', 'decoder.res_stack.5.res_block.3.conv1.pad_buffer', 'decoder.res_stack.6.pad_buffer'])
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: Memory efficient kernel not used because: (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:773.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: Memory Efficient attention has been runtime disabled. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen/native/transformers/sdp_utils_cpp.h:558.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: Flash attention kernel not used because: (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:775.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: Torch was not compiled with flash attention. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:599.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: CuDNN attention kernel not used because: (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:777.)
  x = F.scaled_dot_product_attention(
D:\PythonProjects\hertz-dev\transformer.py:195: UserWarning: CuDNN attention has been runtime disabled. (Triggered internally at C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:528.)
  x = F.scaled_dot_product_attention(
Traceback (most recent call last):
  File "D:\PythonProjects\hertz-dev\inference_server.py", line 166, in <module>
    audio_processor = AudioProcessor(model=model, prompt_path=args.prompt_path)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\inference_server.py", line 58, in __init__
    self.initialize_state(prompt_path)
  File "D:\PythonProjects\hertz-dev\inference_server.py", line 78, in initialize_state
    self.next_model_audio = self.model.next_audio_from_audio(self.loaded_audio.unsqueeze(0), temps=TEMPS)
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\model.py", line 323, in next_audio_from_audio
    next_latents = self.next_latent(latents_in, temps)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\model.py", line 333, in next_latent
    logits1, logits2 = self.forward(model_input)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\model.py", line 313, in forward
    x = layer(x, kv=self.cache[l])
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 301, in forward
    h = self.attn(x, kv)
        ^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 253, in forward
    return x + self.attn(self.attn_norm(x), kv)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\venv\Lib\site-packages\torch\nn\modules\module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 233, in forward
    return self._attend(q, k, v, kv_cache=kv)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 212, in _attend
    x = self._sdpa(q, k, v)
        ^^^^^^^^^^^^^^^^^^^
  File "D:\PythonProjects\hertz-dev\transformer.py", line 195, in _sdpa
    x = F.scaled_dot_product_attention(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: No available kernel. Aborting execution.

Any advice on how to run inference?
Thank you!

SuperMaximus1984 (Author)

On Ubuntu, the output of running inference_server.py is much shorter, though:

Using device: cuda
Killed

devanshrpandey (Contributor) commented Nov 6, 2024

The "Killed" message on Ubuntu is probably a CPU OOM; running the script and then checking sudo dmesg should confirm that.

It looks like, for some reason, the Windows drivers / torch version / CUDA version you're using don't support flash attention; we'll see if we can reproduce this and swap out the attention kernel.
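
In the meantime, a possible workaround is to pin PyTorch's SDPA to the math backend, which always has a kernel available. This is a minimal sketch, untested on Windows, and the helper name is mine rather than anything in the repo:

```python
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Pin SDPA to the always-available (but slower) math backend. This
# sidesteps "RuntimeError: No available kernel" when the flash,
# memory-efficient, and cuDNN kernels are all unusable on a platform.
def sdpa_math_fallback(q, k, v, is_causal=False):
    with sdpa_kernel(SDPBackend.MATH):
        return F.scaled_dot_product_attention(q, k, v, is_causal=is_causal)
```

The math backend is a plain eager-mode implementation, so it should run anywhere, at some cost in speed and memory.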

SuperMaximus1984 (Author)

@devanshrpandey
On Ubuntu (WSL) I have 32 GB of memory, and I get this message even before memory is full.

On Windows I separately tested Flash Attention with an external script and it worked; I compiled it on this machine for Torch 2.5.1 with CUDA 12.4.
What should I do to make your code work (change the PyTorch / CUDA version)?
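
For reference, here is the diagnostic snippet I'm using to see what this torch build reports (plain stock PyTorch calls, nothing from this repo):

```python
import torch

print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("flash SDP enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP enabled:         ", torch.backends.cuda.math_sdp_enabled())

# A tiny half-precision attention call on the GPU: this raises
# "RuntimeError: No available kernel" if every backend is unusable.
q = torch.randn(1, 8, 64, 64, device="cuda", dtype=torch.float16)
out = torch.nn.functional.scaled_dot_product_attention(q, q, q)
print("SDPA ran, output shape:", tuple(out.shape))
```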

Thank you!

hl2dm commented Nov 7, 2024

I encountered the same issue reported above: a RuntimeError: No available kernel. Aborting execution. when trying to run inference_server.py. Below is my system information:

OsName: Microsoft Windows 11 Pro
OsOperatingSystemSKU: 48
OsArchitecture: 64-bit

Python 3.10.10

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Nov__3_17:51:05_Pacific_Daylight_Time_2023
Cuda compilation tools, release 12.3, V12.3.103
Build cuda_12.3.r12.3/compiler.33492891_0

NVIDIA GeForce RTX 4090 (driver 32.0.15.6603)

32GB RAM

I followed the installation and setup instructions as documented but ended up with this runtime error. Could you please provide any insights on what might be causing this issue and how to resolve it?

Thank you for your help!

SuperMaximus1984 (Author) commented Nov 7, 2024

@hl2dm Looks like we need to find a magic combination of flash-attn / Torch / CUDA versions to make it work on Windows.

"No available kernel" means you don't have a working Flash Attention kernel. But even though I compiled flash-attn for Torch 2.5.1 / CUDA 12.4, it doesn't work (as described above).
As @devanshrpandey said, it seems they need to fix something in their code for that.
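
One thing worth noting: the flash-attn package you compile with pip and the flash kernel built into torch's scaled_dot_product_attention are separate code paths, and the warning in my log above ("Torch was not compiled with flash attention") points at the latter. A quick way to test torch's built-in flash kernel in isolation (plain PyTorch, nothing repo-specific):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Flash SDPA needs CUDA tensors in fp16/bf16; pin SDPA to that
# backend alone so a failure is unambiguous.
q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
try:
    with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
        F.scaled_dot_product_attention(q, q, q)
    print("built-in flash SDPA kernel works")
except RuntimeError as e:
    print("built-in flash SDPA kernel unavailable:", e)
```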

SuperMaximus1984 (Author)

Hey guys, thanks for updating the repo; it works now.
But the conversation is strange: I mostly hear "aaa.. ughhh" and some sighs, not a constructive dialogue. Does it have something to do with inference.ipynb and the prompts there? I just launched it with the default data.

hl2dm commented Nov 10, 2024

It works now, but what I don't understand is how to use it. Do I have to speak into a microphone? There is also a small problem: the connection times out on my end. Maybe I should try using Jupyter.

SuperMaximus1984 (Author)

@hl2dm
Yeah, same question: how to use it properly and have a conversation with the LLM.
