
fix ws VAD for codec Opus, pcm8, pcm16 #565

Merged: 14 commits merged into BasedHardware:main on Aug 19, 2024

Conversation

@0xzre (Contributor) commented Aug 10, 2024

Related issues: Websocket silero VAD works for (opus, pcm8, pcm16) ($500)

Overview:
This PR improves Voice Activity Detection (VAD) for the ws stream, specifically targeting the Opus, PCM8, and PCM16 audio formats. It fixes false negatives and false positives for those codecs across many languages.

What has changed:

  • VAD integration: the 'silero-vad' library has been reintegrated.
  • Buffer management: uses a deque to manage an audio buffer that holds up to 1 second of audio (see the sketch below). This buffer is crucial for ensuring that VAD has enough samples to make accurate decisions.
  • Audio processing: handling is differentiated by codec, and VAD is applied to identify active speech.
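
A minimal sketch of the buffering idea, assuming a 16 kHz sample rate and illustrative names (the actual code derives the rate from the codec):

from collections import deque

SAMPLE_RATE = 16000  # assumed here for illustration; depends on the codec

# Hold at most one second of samples; older samples fall off the left end.
audio_buffer = deque(maxlen=SAMPLE_RATE * 1)

def on_samples(samples):
    # `samples` is an iterable of decoded PCM samples from one ws frame.
    audio_buffer.extend(samples)
    # Run VAD only once there is enough context to decide reliably.
    if len(audio_buffer) == audio_buffer.maxlen:
        pass  # apply silero-vad over the buffered window here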

Testing:

  • Manual testing was conducted across codecs and languages, but language coverage is still shallow: only English, Chinese, and Indonesian so far.

@josancamon19 (Contributor)

Tests:

Opus:

Could not process audio: error data length must be a multiple of '(sample_width * channels)'

PCM:

It detects properly (sometimes) that there's speech, but it never transcribes on Deepgram.

@0xzre (Contributor, Author) commented Aug 11, 2024

I didn't touch the Deepgram side in those commits because I assumed it already worked. It turns out I needed some configuration for PCM8 and PCM16 on the DG connection.
It works now, and I also tried to improve the detection time, which works well at the specified channel count and sample rate. I'll fix Opus soon. Thank you for the feedback, fren.
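
(For context, a hedged sketch of the kind of DG configuration meant here, written against Deepgram Python SDK v3-style LiveOptions; the exact options object in the repo may differ. Raw PCM carries no container header, so the encoding and sample rate must be declared explicitly:)

from deepgram import LiveOptions

def dg_options_for(sample_rate: int) -> LiveOptions:
    return LiveOptions(
        language='en',
        encoding='linear16',      # 16-bit little-endian PCM
        sample_rate=sample_rate,  # e.g. 8000 for pcm8, 16000 for pcm16
        channels=1,
    )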

@josancamon19 (Contributor)

Can you show how you determined it is working? @0xzre
I've just tested pcm8 from the app and it doesn't seem to be working, thanks.

@0xzre (Contributor, Author) commented Aug 13, 2024

I didn't use the app, but I read the util code enough (I think) to understand how to do it more simply.
I used a Python script to communicate via ws and send the audio from my mic via the sounddevice lib:

with sd.InputStream(samplerate=sample_rate, channels=channels, callback=callback):

Then, to mimic pcm8/pcm16, I process the audio bytes in the callback:

# Encode data based on the selected codec
if codec == 'pcm16':
    encoded_data = (indata * 32767).astype(np.int16).tobytes()
elif codec == 'pcm8':
    encoded_data = ((indata + 1.0) * 127.5).astype(np.uint8).tobytes()

then send it.
The server then detects speech and transcribes when I mutter some words.
After testing it more, I think 0.9 is a more suitable VAD threshold for most cases, indoor and outdoor.
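
Putting those pieces together, a hedged sketch of the kind of test client described (the endpoint URL, queue plumbing, and function names are illustrative; the actual script isn't shown in this thread):

import asyncio
import numpy as np
import sounddevice as sd
import websockets

async def stream_mic(url, codec='pcm16', sample_rate=16000, channels=1):
    queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def callback(indata, frames, time_info, status):
        # indata is float32 in [-1.0, 1.0]; encode per codec as shown above.
        if codec == 'pcm16':
            encoded = (indata * 32767).astype(np.int16).tobytes()
        else:  # 'pcm8'
            encoded = ((indata + 1.0) * 127.5).astype(np.uint8).tobytes()
        # The audio callback runs on a separate thread, so hand off safely.
        loop.call_soon_threadsafe(queue.put_nowait, encoded)

    async with websockets.connect(url) as ws:
        with sd.InputStream(samplerate=sample_rate, channels=channels,
                            callback=callback):
            while True:
                await ws.send(await queue.get())

# asyncio.run(stream_mic('ws://localhost:8000/listen'))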

@josancamon19 (Contributor)

It must be tested on the app, otherwise we can't tell if it really works, thanks.

@0xzre (Contributor, Author) commented Aug 13, 2024

Alright, I'm turning on developer mode in the app right now. Is there anything else to set up for the ws?

@@ -219,3 +221,11 @@ def connect_to_deepgram(on_message, on_error, language: str, sample_rate: int, c
        return dg_connection
    except Exception as e:
        raise Exception(f'Could not open socket: {e}')

def convert_pcm8_to_pcm16(data):
@josancamon19 (Contributor)

pcm8 is not 8-bit PCM.
It's 16-bit PCM at 8 kHz; Deepgram supports it, as it has for a long time.

@johnmccombs1:

@0xzre I think Joan is correct. If you convert it from 8-bit to 16-bit, it will garble the audio. This is all I had to do to deal with opus vs pcm8:

data = await self.websocket.receive_bytes()

if self.codec == "opus":
    frame_size = 160
    data = self.opus_decoder.decode(data, frame_size)
                
buffer.extend(data)
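
(The snippet assumes an Opus decoder already exists; a hedged sketch of constructing one with the opuslib binding, if that is the library in use; the 16 kHz mono settings are assumptions:)

import opuslib

opus_decoder = opuslib.Decoder(fs=16000, channels=1)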

@0xzre (Contributor, Author)

@josancamon19 Yup, I confused it with the 8-bit PCM encodings out there; I'll remove that processing now, thanks.
@johnmccombs1 Yeah, I recorded what the server sent to DG and it sounded garbled, yet DG was able to transcribe it, which made me wonder why :D. Turns out it's 8 kHz, thanks.

backend/utils/stt/vad.py (review thread outdated, resolved)
@josancamon19 (Contributor)

Hey @0xzre, do you have a board or the device for testing with the app?
Also, please convert this from draft to a full PR.

@josancamon19 (Contributor)

From the phone you can test pcm16 recording.

Also, check main: I managed to make it work somewhat with pcm8 from the device, but it's not as accurate as expected. I also included some code for exporting the bytes marked as voice to a txt file and then converting it to a wav; please do this to test the results of the VAD.

Try using this video https://www.youtube.com/watch?v=63EVXf_S4WQ
and submitting the resulting audio.

Also, you can reach out to my email directly at [email protected]; happy to chat more and help in any way I can.
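
The exporter josancamon19 mentions lives on main and isn't shown here; a hedged sketch of one way the txt-to-wav round trip could work (file names, the hex encoding, and the 8 kHz/16-bit assumptions are all illustrative):

import wave

def export_voice_bytes(voice_bytes, path='voice_bytes.txt'):
    # Append the VAD-flagged PCM as hex text so it survives copy/paste.
    with open(path, 'a') as f:
        f.write(voice_bytes.hex() + '\n')

def txt_to_wav(txt_path, wav_path, sample_rate=8000):
    with open(txt_path) as f:
        pcm = b''.join(bytes.fromhex(line.strip()) for line in f if line.strip())
    with wave.open(wav_path, 'wb') as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(sample_rate)  # 8000 assumed for the pcm8 stream
        w.writeframes(pcm)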

@josancamon19 (Contributor)

Lastly, test scenarios where there's no voice for about 10 minutes and then voice appears.

Or:
voice, silence, voice, with a longer silent gap.

Comment on lines 127 to 132
    samples = torch.frombuffer(data, dtype=torch.int16).float() / 32768.0
elif codec in ['pcm8', 'pcm16']:
    dtype = torch.int8 if codec == 'pcm8' else torch.int16
    writeable_data = bytearray(data)
    samples = torch.frombuffer(writeable_data, dtype=dtype).float()
    samples = samples / (128.0 if codec == 'pcm8' else 32768.0)
@johnmccombs1 commented Aug 14, 2024

@0xzre here is a slightly simpler approach (some of this can be combined). Once the audio has been extracted, you can use the following code to process it whether it's Opus at a 16 kHz sample rate or pcm8 at an 8 kHz sample rate (I haven't tested with pcm16):

    def convert_audio_bytes_to_resampled_numpy(self, audio_bytes: bytes):
        audio_data = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32767.0

        # Normalize the audio to bring the peak to 1.0
        max_val = np.max(np.abs(audio_data))
        if max_val > 0:
            audio_data = audio_data / max_val

        waveform = torch.from_numpy(audio_data).unsqueeze(0)
        
        # Ensure the audio is mono
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)
        
        # Resample if the source sample rate 8Khz
        if self.source_sample_rate == 8000:
            waveform = torchaudio.functional.resample(waveform, orig_freq=8000, new_freq=16000)
        
        resampled_np = waveform.squeeze().numpy()

        return resampled_np

Also, as a side note, I found that the audio is really quiet, so normalizing it seems to help bring it up a little without adding artifacts. However, if you want it even louder (which may help with VAD), you can also add this:

gain_factor = 1.2
audio_data = np.clip(audio_data * gain_factor, -1.0, 1.0)

@0xzre (Contributor, Author) commented Aug 15, 2024

Thanks! I get the idea of your code and agree, but I think naive normalization isn't great for VAD. I'll still try normalizing it, though, to see if it's better.

@0xzre (Contributor, Author) commented Aug 15, 2024

> Hey @0xzre, do you have a board or the device for testing with the app?
> Also, please convert this from draft to a full PR.

I don't have either, but I'll try with my phone, since sending the audio is all I need now.

Okay, moving to full PR. Thanks.

@0xzre (Contributor, Author) commented Aug 15, 2024

> From the phone you can test pcm16 recording. [...]

Okay, I've checked main and will record the wav from my code!

I will email the audio result, thanks.

@0xzre marked this pull request as ready for review on August 15, 2024, 03:24
@josancamon19 (Contributor)

Hi @0xzre, I didn't get your email. Please let me know when I can review this, as it is slightly urgent.

@mdmohsin7 (Collaborator)

@0xzre can you please resolve the conflicts?

@0xzre (Contributor, Author) commented Aug 17, 2024

Rebased, please review :) @mdmohsin7. Sorry, maybe next time I'll just merge lol.

audio_buffer = deque(maxlen=sample_rate * 1)  # 1 sec
databuffer = bytearray(b"")

REALTIME_RESOLUTION = 0.01
Contributor:

Why use 10 ms chunks here?

20 ms seems like a more standard size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my testing, I haven't found a practical difference between them. I chose 10 ms because more responsiveness seemed good; Deepgram still buffers the audio until a good transcription is detected, right? Although 20 ms would halve the number of sends over the DG socket, I don't think 10 ms increases the cost on the DG side.

Contributor:

There is no increase or decrease in the cost of sending audio chunks, but 10 ms is very low and not recommended.

20 ms is the recommended minimum.

The server also shows very high CPU usage at 10 ms when multiple streams are running.

The receiver thread will also be blocked by sender threads, e.g. if you have 100 connections all doing work 10 ms apart.
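
For concreteness, the chunk-size arithmetic behind the 10 ms vs 20 ms trade-off (sample rate, width, and channel count assumed for illustration):

SAMPLE_RATE = 16000  # Hz
SAMPLE_WIDTH = 2     # bytes per sample (16-bit PCM)
CHANNELS = 1

def chunk_bytes(duration_s):
    return int(SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS * duration_s)

print(chunk_bytes(0.01))  # 10 ms -> 320 bytes, 100 sends/sec per stream
print(chunk_bytes(0.02))  # 20 ms -> 640 bytes, 50 sends/sec per stream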

@0xzre (Contributor, Author):

Nice insight. Ideally we'd stress test, but adopting the standard as a baseline is never wrong. I'll make the required changes soon. Thanks!

if len(databuffer) >= chunk_size:
    socket1.send(databuffer[:len(databuffer) - len(databuffer) % chunk_size])
    databuffer = databuffer[len(databuffer) - len(databuffer) % chunk_size:]
await asyncio.sleep(REALTIME_RESOLUTION)
@DamienDeepgram (Contributor) commented Aug 17, 2024

Sleeping here is imprecise: it can sleep for more or less than REALTIME_RESOLUTION, which causes significant drift over time, especially with a value as low as 10 ms (100 sleeps per second).

Here is an example of how you could solve/offset the issue:

https://github.com/deepgram/median-streaming-latency/blob/main/latency.py#L78-L91
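
In the spirit of the linked latency.py, a hedged sketch of drift-compensated pacing (names are illustrative): anchor each sleep to a schedule derived from a monotonic clock instead of sleeping a fixed interval, so per-iteration error does not accumulate:

import asyncio
import time

REALTIME_RESOLUTION = 0.02  # 20 ms, the recommended minimum

async def paced_send(send_chunk):
    start = time.monotonic()
    sent = 0
    while True:
        send_chunk()
        sent += 1
        # Sleep until the next scheduled tick; an overshoot on one
        # iteration shortens the next sleep instead of accumulating.
        delay = start + sent * REALTIME_RESOLUTION - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)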

@0xzre (Contributor, Author):

Yeah, on a low-spec server that would definitely cost us drift. I'll fix that part, and thank you for the reference code! :)

Contributor:

If the sleep is 1 ms off every 10 ms, you would drift 100 ms per second, or 6 seconds per minute.

I have seen the drift go as high as 60 seconds per minute, which essentially means streaming audio very slowly, at half real-time speed.

@@ -103,6 +104,7 @@ async def process_audio_dg(
    def on_message(self, result, **kwargs):
        # print(f"Received message from Deepgram")  # Log when message is received
        sentence = result.channel.alternatives[0].transcript
        print(sentence)
Contributor:

This will spam the logs

@@ -35,33 +29,51 @@ def get_speech_state(data, vad_iterator, window_size_samples=256):
    # maybe like, if `end` was last, then return end? TEST THIS

    if speech_dict:
        vad_iterator.reset_states()
Contributor:

Should reset_states be called even if speech_dict is false?

@mdmohsin7 (Collaborator)

@0xzre lmk once you've resolved the conflicts and are up to date with the main branch. Thanks.

@0xzre (Contributor, Author) commented Aug 18, 2024

CHANGES

  • Reset state: every speech check -> only on timeout. The timeout is intended to mark the end of a person's speech, so naturally the VAD should reset to prepare for a new context/environment.
  • Data length sent to DG: send everything, without the complicated indexing, since DG already handles that.
  • Sleep logic: implemented new sleep logic that should give more accurate sleep times; it now accounts for several factors when calculating the delay. Please check @DamienDeepgram :)
  • Timeout speech duration: 0.7 sec -> 1 sec. I just don't want to be that aggressive; now VAD won't cut the natural silences between words in speech.

Thanks!

@0xzre (Contributor, Author) commented Aug 18, 2024

@mdmohsin7 Please check, I've resolved the conflicts.

@mdmohsin7 (Collaborator)

@0xzre it works with pcm8 and pcm16, but it does not work with opus at all. I don't get any transcripts for it (it doesn't matter if I play a YouTube video or speak myself).

For pcm8 and pcm16, it is missing more segments (a lot more for pcm16) than it was without VAD. Is this happening because, while certain segments are being processed by VAD, the immediately following segments are lost due to VAD's processing time? Also, if there's a good amount of background noise (non-conversational noise), speech isn't detected in most cases. Can this be improved?

You can try running the app locally (even if you don't have the device) and test pcm16 through the app's record-with-mic feature.

@0xzre (Contributor, Author) commented Aug 19, 2024

> @0xzre it works with pcm8 and pcm16, but it does not work with opus at all. [...]

Have you made sure opus is installed? Were there any errors while running it?
For the missing segments: I verified they shouldn't go missing, because processing is asynchronous, and I checked (by also recording the audio to a wav file) that the audio cuts match the not-detected debug prints. But I'll check more.
For background noise: that's why increasing gain won't directly help. I have raised the threshold to 0.7 and the timeout to 2 sec to help with that, already tested on my end.
Yeah, I've built the app on my phone already, thanks.
Also, I now add pre-speech audio (max 500 ms) that is included before speech first becomes active (see the sketch below). This helps the VAD because it sometimes detects the start of speech slightly late.
I also made the audio window for VAD detection longer to get better audio context, though it costs some server performance.
Thanks @mdmohsin7 !
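
A hedged sketch of the pre-speech idea (sizes and names are illustrative): keep up to ~500 ms of the most recent audio in a rolling deque, and when VAD first reports speech, flush that backlog ahead of the live audio so a late detection doesn't clip the first syllables:

from collections import deque

SAMPLE_RATE = 16000
PRESPEECH_BYTES = int(SAMPLE_RATE * 0.5) * 2  # 500 ms of 16-bit samples

prespeech = deque(maxlen=PRESPEECH_BYTES)
speech_active = False

def on_chunk(chunk, is_speech, send):
    global speech_active
    if is_speech:
        if not speech_active:
            # Speech just started: emit the buffered lead-in first.
            send(bytes(prespeech))
            prespeech.clear()
            speech_active = True
        send(chunk)
    else:
        speech_active = False
        prespeech.extend(chunk)  # keep rolling context for the next onset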

@0xzre (Contributor, Author) commented Aug 19, 2024

The speech timeout is now calculated from when the data is received, because processing the VAD actually takes a significant amount of time.

@mdmohsin7 (Collaborator)

> Have you made sure opus is installed? Were there any errors while running it?

Yes, I have it installed and I don't get any errors. Please DM me on Discord (https://discord.com/users/710158215723089930) and I'll share the recordings.

Also, please comment out the VAD for opus for now and clean up the code a bit.
@0xzre

@mdmohsin7 self-requested a review August 19, 2024 07:12
@mdmohsin7 merged commit bdd6c71 into BasedHardware:main Aug 19, 2024
Linked issue this PR may close: Websocket silero VAD works for (opus, pcm8, pcm16) ($500)