fix ws VAD for codec Opus, pcm8, pcm16 #565
Conversation
Tests: Opus:
PCM:
I didn't touch the Deepgram part in those commits because I assumed it already worked. Turns out I need to do some config for PCM8 and PCM16 for DG.
Can you show how you determined it is working? @0xzre
I didn't use the app, but I read the util enough (I think) to understand how to do it more simply: open a stream with sd.InputStream(samplerate=sample_rate, channels=channels, callback=callback), then, to mimic pcm8, process the audio bytes in the callback:

# Encode data based on the selected codec
if codec == 'pcm16':
    encoded_data = (indata * 32767).astype(np.int16).tobytes()
elif codec == 'pcm8':
    encoded_data = ((indata + 1.0) * 127.5).astype(np.uint8).tobytes()

then send it.
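For context, a self-contained version of that test harness might look like the sketch below. This is my reconstruction, not the exact script: the sample rate, channel count, and the send step are assumptions.

import numpy as np
import sounddevice as sd

codec = 'pcm16'      # or 'pcm8'
sample_rate = 16000  # assumed; the device's pcm8 stream is 8kHz
channels = 1

def callback(indata, frames, time_info, status):
    # indata arrives as float32 in [-1.0, 1.0]
    if codec == 'pcm16':
        encoded_data = (indata * 32767).astype(np.int16).tobytes()
    else:  # 'pcm8': offset-binary, maps [-1.0, 1.0] onto [0, 255]
        encoded_data = ((indata + 1.0) * 127.5).astype(np.uint8).tobytes()
    # ...send encoded_data over the ws here

with sd.InputStream(samplerate=sample_rate, channels=channels, callback=callback):
    sd.sleep(10_000)  # capture for 10 seconds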
It must be tested on the app, otherwise we can't tell if it really works, thanks.
Alright, gonna turn on developer mode on the app right now. Is there anything else to set up for the ws?
backend/utils/stt/deepgram_util.py
Outdated
@@ -219,3 +221,11 @@ def connect_to_deepgram(on_message, on_error, language: str, sample_rate: int, c
        return dg_connection
    except Exception as e:
        raise Exception(f'Could not open socket: {e}')

def convert_pcm8_to_pcm16(data):
pcm8 is not 8-bit PCM.
It's 16-bit PCM at 8kHz; Deepgram supports it, as it has been handling it for a long time.
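In practice that should mean no sample conversion at all, just telling Deepgram the stream is 16-bit linear PCM at 8kHz. A minimal sketch of the relevant options, assuming the Deepgram Python SDK's LiveOptions:

from deepgram import LiveOptions

# pcm8 from the device is already 16-bit linear PCM, just at 8kHz,
# so only the connection options need to change, not the samples.
options = LiveOptions(
    language='en',
    encoding='linear16',  # 16-bit little-endian PCM
    sample_rate=8000,     # the device's pcm8 stream rate
    channels=1,
)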
@0xzre I think Joan is correct. If you convert it from 8-bit to 16-bit, it will garble the audio. This is all I had to do to deal with opus vs pcm8:

data = await self.websocket.receive_bytes()
if self.codec == "opus":
    frame_size = 160
    data = self.opus_decoder.decode(data, frame_size)
buffer.extend(data)
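For completeness, the decoder in that snippet could be created like this (a sketch assuming the opuslib bindings; 16kHz mono matches the device's Opus stream, and 160 samples per frame at 16kHz is 10ms of audio):

import opuslib

# Decoder(sample_rate, channels); 16kHz mono matches the device's Opus stream.
opus_decoder = opuslib.Decoder(16000, 1)

def decode_opus_frame(opus_packet: bytes) -> bytes:
    # 160 samples per channel at 16kHz = one 10ms frame of 16-bit PCM
    return opus_decoder.decode(opus_packet, 160)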
@josancamon19 Yup, I mistook the 8 for the 8-bit PCM encoding out there; gonna remove that processing now, thanks.
@johnmccombs1 Yeah, I recorded what the server sent to DG and it was garbled, but DG was still able to transcribe it, so I wondered why :D. Turns out it's 8kHz, thanks.
Hey @0xzre do you have a board or the device for testing with the app?
From the phone you can test pcm16 recording. Also, check main; I managed to make it work slightly with pcm8 from the device, but it's not as accurate as expected. I also included some code for exporting the bytes marked as voice to a txt and then converting it to a wav; please do this for testing the results of the VAD (a sketch of the wav conversion follows below). Try using this video https://www.youtube.com/watch?v=63EVXf_S4WQ Also you can reach out to my email directly [email protected], happy to chat more and help in any way I can.
Lastly, test scenarios where there's no voice for like 10 minutes and then there's voice. Or
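A minimal way to do that wav conversion, sketched with the stdlib wave module (the 8kHz/mono/16-bit parameters are assumptions matching the pcm8 stream; adjust as needed):

import wave

def pcm_bytes_to_wav(pcm_bytes: bytes, path: str, sample_rate: int = 8000):
    # Wrap raw 16-bit mono PCM in a WAV container so it can be played back.
    with wave.open(path, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # bytes per sample (16-bit)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm_bytes)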
backend/routers/transcribe.py
Outdated
    samples = torch.frombuffer(data, dtype=torch.int16).float() / 32768.0
elif codec in ['pcm8', 'pcm16']:
    dtype = torch.int8 if codec == 'pcm8' else torch.int16
    writeable_data = bytearray(data)
    samples = torch.frombuffer(writeable_data, dtype=dtype).float()
    samples = samples / (128.0 if codec == 'pcm8' else 32768.0)
@0xzre here is a slightly simpler approach (some of this can be combined), but once the audio has been extracted you can use the following code to process it, whether it's Opus with a 16kHz sample rate or pcm8 with an 8kHz sample rate (I haven't tested with pcm16):

def convert_audio_bytes_to_resampled_numpy(self, audio_bytes: bytes):
    audio_data = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32767.0
    # Normalize the audio to bring the peak to 1.0
    max_val = np.max(np.abs(audio_data))
    if max_val > 0:
        audio_data = audio_data / max_val
    waveform = torch.from_numpy(audio_data).unsqueeze(0)
    # Ensure the audio is mono
    if waveform.shape[0] > 1:
        waveform = waveform.mean(dim=0, keepdim=True)
    # Resample if the source sample rate is 8kHz
    if self.source_sample_rate == 8000:
        waveform = torchaudio.functional.resample(waveform, orig_freq=8000, new_freq=16000)
    resampled_np = waveform.squeeze().numpy()
    return resampled_np
Also, as a side note, I found that the audio is really quiet, so normalizing it seems to help bring it up a little without adding artifacts to the audio. However, if you want it even louder (which may help with VAD) you can also add this:

gain_factor = 1.2
audio_data = np.clip(audio_data * gain_factor, -1.0, 1.0)
Thanks! I get the idea of your code and agree, but I think normalizing naively isn't great for VAD. I'll still try normalizing it though, to see if it's better.
I don't have either of them, but I'll try with my phone, since sending the audio is all I need now. Okay, moving to a full PR. Thanks.
Okay, I've checked main and will record the wav from my code! I'll email the audio result, thanks.
Hi @0xzre, didn't get your email. Please let me know when I can review this, as it is slightly urgent.
@0xzre can you please resolve the conflicts?
Rebased, please review :) @mdmohsin7, sorry, maybe next time I'll just merge lol
audio_buffer = deque(maxlen=sample_rate * 1)  # 1 sec
databuffer = bytearray(b"")

REALTIME_RESOLUTION = 0.01
Why use 10ms chunks here?
20ms seems like a more standard size
In my testing, I haven't found a practical difference between them. I chose 10ms because being more responsive would be good. Deepgram still buffers them until a good transcription is detected, right? Although 20ms would mean sending half as often through the DG socket, I don't think 10ms would increase the cost on the DG side.
There is no increase or decrease in the cost of sending audio chunks, but 10ms is very low and not recommended;
20ms is the recommended minimum.
The server also has very high CPU usage with 10ms when multiple streams are running.
The receiver thread will be blocked by sender threads too, e.g. if you have 100 connections all doing work 10ms apart.
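For concreteness, the chunk size in bytes follows directly from the resolution. A quick worked example (the sample rates and 16-bit width are assumptions matching this PR's codecs):

REALTIME_RESOLUTION = 0.020  # 20ms, the recommended minimum

def chunk_size_bytes(sample_rate: int, sample_width: int = 2, channels: int = 1) -> int:
    # bytes per chunk = samples per chunk * bytes per sample * channels
    return int(sample_rate * REALTIME_RESOLUTION) * sample_width * channels

print(chunk_size_bytes(8000))   # 8kHz stream: 320 bytes per 20ms
print(chunk_size_bytes(16000))  # 16kHz stream: 640 bytes per 20ms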
Nice insight. Generally we should do a stress test, but going with the standard as a baseline is never wrong. I'll make the required changes soon. Thanks!
backend/routers/transcribe.py
Outdated
if len(databuffer) >= chunk_size:
    socket1.send(databuffer[:len(databuffer) - len(databuffer) % chunk_size])
    databuffer = databuffer[len(databuffer) - len(databuffer) % chunk_size:]
await asyncio.sleep(REALTIME_RESOLUTION)
Sleep here is not perfect; it can sleep for more or less than REALTIME_RESOLUTION, which causes significant drift over time, especially with a value as low as 10ms (100x per second).
Here is an example of how you could solve/offset the issue:
https://github.com/deepgram/median-streaming-latency/blob/main/latency.py#L78-L91
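The gist of that fix, as a minimal sketch (not the linked code verbatim): pace each send against an absolute deadline derived from a monotonic clock, so the error from any single sleep never accumulates:

import asyncio
import time

REALTIME_RESOLUTION = 0.020  # seconds between chunks

async def paced_send(send, chunks):
    # send: any synchronous send callable; chunks: iterable of byte chunks
    start = time.monotonic()
    for i, chunk in enumerate(chunks):
        send(chunk)
        # Sleep only until the absolute time the next chunk is due,
        # so per-sleep jitter cannot drift over a long stream.
        deadline = start + (i + 1) * REALTIME_RESOLUTION
        remaining = deadline - time.monotonic()
        if remaining > 0:
            await asyncio.sleep(remaining)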
Yeah, on a low-spec server that would surely cost us drift. I'll fix that part, and thank you for the reference code! :)
If sleep is 1ms off every 10ms, you drift 100ms per second, or 6 seconds per minute.
I have seen the drift go as high as 60 seconds per minute, which essentially causes you to stream audio very slowly, at half real-time speed.
backend/utils/stt/deepgram_util.py
Outdated
@@ -103,6 +104,7 @@ async def process_audio_dg(
def on_message(self, result, **kwargs):
    # print(f"Received message from Deepgram")  # Log when message is received
    sentence = result.channel.alternatives[0].transcript
    print(sentence)
This will spam the logs
backend/utils/stt/vad.py
Outdated
@@ -35,33 +29,51 @@ def get_speech_state(data, vad_iterator, window_size_samples=256):
# maybe like, if `end` was last, then return end? TEST THIS

if speech_dict:
    vad_iterator.reset_states()
Should the reset_states be called even if speech_dict is false?
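For reference, the Silero VADIterator pattern in question looks roughly like this (a sketch based on the silero-vad examples; windows is a hypothetical iterable of 256-sample chunks). Whether reset_states belongs inside the if, outside it, or only on timeout is exactly the behavior being questioned:

import torch

# Load Silero VAD via torch.hub, as in the silero-vad README.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

vad_iterator = VADIterator(model)

def detect(windows):
    # windows: hypothetical iterable of 256-sample float tensors (8kHz)
    for chunk in windows:
        speech_dict = vad_iterator(chunk, return_seconds=True)
        if speech_dict:
            # speech_dict carries a 'start' or 'end' boundary for this window;
            # resetting here (as in the diff) drops the model's RNN state.
            vad_iterator.reset_states()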
…less aggressive timeout
@0xzre lmk once you resolve the conflicts and are on track with the main branch. Thanks
CHANGES
Reset state: every speech check -> only on timeout, because the timeout is intended to mark the end of a person's speech, so naturally the VAD should prepare for a new context/environment. Thanks!
@mdmohsin7 Please check, conflicts resolved
@0xzre it works with pcm8 and pcm16, but does not work with opus at all. I don't get any transcripts for it (doesn't matter if I play a YouTube video or speak myself). For pcm8 and pcm16, it is missing more segments (a lot more for pcm16) than without VAD. Is this happening because, while certain segments are being processed by VAD, the immediately following segments are being lost due to VAD's processing time? Also, if there's a good amount of background noise (non-conversational noise), then speech isn't detected in most cases. Can this be improved? You can try running the App locally (even if you don't have the device) and test out pcm16 through the App's record-with-mic feature.
Have you made sure to install Opus? Were there any errors while running it?
The speech timeout is now calculated when receiving data, because processing VAD actually takes a pretty significant amount of time.
Yes, I have it installed and I don't get any errors. Please DM me on Discord https://discord.com/users/710158215723089930, and I'll share the recordings. Also, please comment out the VAD for Opus for now and clean up the code a bit.
Related issues:
Overview:
This PR improves Voice Activity Detection (VAD) for the ws stream, specifically targeting the Opus, PCM8, and PCM16 audio formats. It fixes false negatives and false positives for those codecs across many languages.
What has changed:
Testing: