Inference problem #9

Open
maciekpoplawski opened this issue Nov 4, 2024 · 7 comments

Comments

@maciekpoplawski

Hi! Good job on the model.

But I'm having trouble testing it.
Setup: RTX 4090 + 64 GB RAM (while loading the models I'm touching 63.9 GB :) )

Tested on Windows - can't launch with the default code because of missing support for FLASH_ATTENTION.
Exchanged it for EFFICIENT_ATTENTION, and I hear the initial prompt with "bob how's it going bob" and then silence.
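(For reference, a minimal sketch of that kind of backend swap, assuming the model routes attention through torch's scaled_dot_product_attention and you're on a torch version that has torch.nn.attention.sdpa_kernel; names below are illustrative.)

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

# Dummy query/key/value tensors just to exercise the kernel selection.
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)

# FlashAttention kernels are often unavailable in Windows torch builds,
# so restrict SDPA to the memory-efficient backend instead.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```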

Unfortunately it's the same on Linux (a setup with no errors). Only the initial prompt and nothing more. Silence :(

Torch installed with this command:
pip3 install torch torchaudio --index-url https://download.pytorch.org/whl/cu118

Any tips on how to go further with this?

@maciekpoplawski
Author

In case somebody ends up here with the same problem - for inference_client.py I was missing this to launch it on Linux (Ubuntu):
sudo apt-get install libportaudio2
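A quick way to verify the fix, assuming inference_client.py talks to the audio devices through the sounddevice package (which wraps PortAudio):

```python
# If libportaudio2 is installed correctly, this should list devices
# instead of raising a PortAudio/library-not-found error.
import sounddevice as sd

print(sd.query_devices())   # all available input/output devices
print(sd.default.device)    # default (input, output) device indices
```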

@wpq3142

wpq3142 commented Nov 6, 2024

same problem!!

@calculating
Contributor

@wpq3142 did the libportaudio2 fix work for you? I've added it to the readme

@maciekpoplawski
Author

Sorry, I mixed two things in one issue.
The original issue from the first post is not resolved.
The libportaudio2 fix was needed on Ubuntu to be able to select audio devices. And it WORKS.

@KadirErturk4r

How much VRAM does it require?
I have a 16GB 3060 and got CUDA out of memory.

@calculating
Contributor

> How much VRAM does it require? I have a 16GB 3060 and got CUDA out of memory.

With our current bfloat16 implementation, 24GB.
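(Rough sizing only, not an official number: bfloat16 stores 2 bytes per parameter, so the weights alone of an N-parameter model take about 2 × N bytes, before activations and the KV cache.)

```python
# Back-of-the-envelope VRAM estimate for bfloat16 weights (illustrative numbers only).
params = 8e9                       # hypothetical parameter count
weight_gib = params * 2 / 1024**3  # bfloat16 = 2 bytes per parameter
print(f"~{weight_gib:.1f} GiB for weights alone, before activations and the KV cache")
```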

@robonxt-ai

robonxt-ai commented Nov 18, 2024

> With our current bfloat16 implementation, 24GB.

Will there be a quantized or optimized build in the future?
