
fix ws VAD for codec Opus, pcm8, pcm16 #565

Merged: 14 commits merged into BasedHardware:main on Aug 19, 2024

Conversation

@0xzre (Contributor) commented Aug 10, 2024

Related issues: Websocket silero VAD works for (opus, pcm8, pcm16) ($500)

Overview:
This PR improves Voice Activity Detection (VAD) for the ws stream, specifically targeting the Opus, PCM8, and PCM16 audio formats. It fixes false negatives and false positives for those codecs across many languages.

What has changed:

  • VAD integration: the 'silero-vad' library has been reintegrated.
  • Buffer management: uses a deque to manage an audio buffer that holds up to 1 second of audio (see the sketch below). This buffer is crucial for ensuring that VAD has enough samples to make accurate decisions.
  • Audio processing: handling is differentiated by codec, and VAD is applied to identify active speech.
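
A minimal sketch of the buffering idea, assuming a 16 kHz sample rate and illustrative names (the actual code derives the rate from the codec):

from collections import deque

SAMPLE_RATE = 16000  # assumed here for illustration; depends on the codec

# Hold at most one second of samples; older samples fall off the left end.
audio_buffer = deque(maxlen=SAMPLE_RATE * 1)

def on_samples(samples):
    # `samples` is an iterable of decoded PCM samples from one ws frame.
    audio_buffer.extend(samples)
    # Run VAD only once there is enough context to decide reliably.
    if len(audio_buffer) == audio_buffer.maxlen:
        pass  # apply silero-vad over the buffered window here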

Testing:

  • Manual testing was conducted across codecs and languages, but language coverage is still shallow: only English, Chinese, and Indonesian so far.

@josancamon19 (Contributor)

Tests:

Opus:

Could not process audio: error data length must be a multiple of '(sample_width * channels)'

PCM:

It detects properly (sometimes) that there's speech, but it never transcribes on Deepgram.

@0xzre (Contributor, Author) commented Aug 11, 2024

I didn't touch the Deepgram side in those commits because I assumed it already worked. It turns out I needed some configuration for PCM8 and PCM16 on the DG connection.
It works now, and I also tried to improve the detection time, which works well at the specified channel count and sample rate. I'll fix Opus soon. Thank you for the feedback, fren.
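
(For context, a hedged sketch of the kind of DG configuration meant here, written against Deepgram Python SDK v3-style LiveOptions; the exact options object in the repo may differ. Raw PCM carries no container header, so the encoding and sample rate must be declared explicitly:)

from deepgram import LiveOptions

def dg_options_for(sample_rate: int) -> LiveOptions:
    return LiveOptions(
        language='en',
        encoding='linear16',      # 16-bit little-endian PCM
        sample_rate=sample_rate,  # e.g. 8000 for pcm8, 16000 for pcm16
        channels=1,
    )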

@josancamon19 (Contributor)

Can you show how you determined it is working? @0xzre
I've just tested pcm8 from the app and it doesn't seem to be working, thanks.

@0xzre (Contributor, Author) commented Aug 13, 2024

I didn't use the app, but I read the util code enough (I think) to understand how to do it more simply.
I used a Python script to communicate via ws and send the audio from my mic via the sounddevice lib:

with sd.InputStream(samplerate=sample_rate, channels=channels, callback=callback):

Then, to mimic pcm8/pcm16, I process the audio bytes in the callback:

# Encode data based on the selected codec
if codec == 'pcm16':
    encoded_data = (indata * 32767).astype(np.int16).tobytes()
elif codec == 'pcm8':
    encoded_data = ((indata + 1.0) * 127.5).astype(np.uint8).tobytes()

then send it.
The server then detects speech and transcribes when I mutter some words.
After testing it more, I think 0.9 is a more suitable VAD threshold for most cases, indoor and outdoor.
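
Putting those pieces together, a hedged sketch of the kind of test client described (the endpoint URL, queue plumbing, and function names are illustrative; the actual script isn't shown in this thread):

import asyncio
import numpy as np
import sounddevice as sd
import websockets

async def stream_mic(url, codec='pcm16', sample_rate=16000, channels=1):
    queue = asyncio.Queue()
    loop = asyncio.get_running_loop()

    def callback(indata, frames, time_info, status):
        # indata is float32 in [-1.0, 1.0]; encode per codec as shown above.
        if codec == 'pcm16':
            encoded = (indata * 32767).astype(np.int16).tobytes()
        else:  # 'pcm8'
            encoded = ((indata + 1.0) * 127.5).astype(np.uint8).tobytes()
        # The audio callback runs on a separate thread, so hand off safely.
        loop.call_soon_threadsafe(queue.put_nowait, encoded)

    async with websockets.connect(url) as ws:
        with sd.InputStream(samplerate=sample_rate, channels=channels,
                            callback=callback):
            while True:
                await ws.send(await queue.get())

# asyncio.run(stream_mic('ws://localhost:8000/listen'))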

@josancamon19 (Contributor)

It must be tested on the app, otherwise we can't tell if it really works, thanks.

@0xzre (Contributor, Author) commented Aug 13, 2024

Alright, I'm turning on developer mode in the app right now. Is there anything else to set up for the ws?

@@ -219,3 +221,11 @@ def connect_to_deepgram(on_message, on_error, language: str, sample_rate: int, c
        return dg_connection
    except Exception as e:
        raise Exception(f'Could not open socket: {e}')

def convert_pcm8_to_pcm16(data):
@josancamon19 (Contributor)

pcm8 is not 8-bit PCM.
It's 16-bit PCM at 8 kHz; Deepgram supports it, as it has for a long time.

@johnmccombs1:

@0xzre I think Joan is correct. If you convert it from 8-bit to 16-bit, it will garble the audio. This is all I had to do to deal with opus vs pcm8:

data = await self.websocket.receive_bytes()

if self.codec == "opus":
    frame_size = 160
    data = self.opus_decoder.decode(data, frame_size)
                
buffer.extend(data)
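
(The snippet assumes an Opus decoder already exists; a hedged sketch of constructing one with the opuslib binding, if that is the library in use; the 16 kHz mono settings are assumptions:)

import opuslib

opus_decoder = opuslib.Decoder(fs=16000, channels=1)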

@0xzre (Contributor, Author)

@josancamon19 Yup, I confused it with the 8-bit PCM encodings out there; I'll remove that processing now, thanks.
@johnmccombs1 Yeah, I recorded what the server sent to DG and it sounded garbled, yet DG was able to transcribe it, which made me wonder why :D. Turns out it's 8 kHz, thanks.

backend/utils/stt/vad.py (review thread outdated, resolved)
@josancamon19 (Contributor)

Hey @0xzre, do you have a board or the device for testing with the app?
Also, please convert this from draft to a full PR.

@josancamon19 (Contributor)

From the phone you can test pcm16 recording.

Also, check main: I managed to make it work somewhat with pcm8 from the device, but it's not as accurate as expected. I also included some code for exporting the bytes marked as voice to a txt file and then converting it to a wav; please do this to test the results of the VAD.

Try using this video https://www.youtube.com/watch?v=63EVXf_S4WQ
and submitting the resulting audio.

Also, you can reach out to my email directly at [email protected]; happy to chat more and help in any way I can.
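
The exporter josancamon19 mentions lives on main and isn't shown here; a hedged sketch of one way the txt-to-wav round trip could work (file names, the hex encoding, and the 8 kHz/16-bit assumptions are all illustrative):

import wave

def export_voice_bytes(voice_bytes, path='voice_bytes.txt'):
    # Append the VAD-flagged PCM as hex text so it survives copy/paste.
    with open(path, 'a') as f:
        f.write(voice_bytes.hex() + '\n')

def txt_to_wav(txt_path, wav_path, sample_rate=8000):
    with open(txt_path) as f:
        pcm = b''.join(bytes.fromhex(line.strip()) for line in f if line.strip())
    with wave.open(wav_path, 'wb') as w:
        w.setnchannels(1)            # mono
        w.setsampwidth(2)            # 16-bit samples
        w.setframerate(sample_rate)  # 8000 assumed for the pcm8 stream
        w.writeframes(pcm)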

@josancamon19 (Contributor)

Lastly, test scenarios where there's no voice for about 10 minutes and then voice appears.

Or:
voice, silence, voice, with a longer silent gap.

Comment on lines 127 to 132
    samples = torch.frombuffer(data, dtype=torch.int16).float() / 32768.0
elif codec in ['pcm8', 'pcm16']:
    dtype = torch.int8 if codec == 'pcm8' else torch.int16
    writeable_data = bytearray(data)
    samples = torch.frombuffer(writeable_data, dtype=dtype).float()
    samples = samples / (128.0 if codec == 'pcm8' else 32768.0)
@johnmccombs1 commented Aug 14, 2024

@0xzre here is a slightly simpler approach (some of this can be combined). Once the audio has been extracted, you can use the following code to process it whether it's Opus at a 16 kHz sample rate or pcm8 at an 8 kHz sample rate (I haven't tested with pcm16):

    def convert_audio_bytes_to_resampled_numpy(self, audio_bytes: bytes):
        audio_data = np.frombuffer(audio_bytes, dtype=np.int16).astype(np.float32) / 32767.0

        # Normalize the audio to bring the peak to 1.0
        max_val = np.max(np.abs(audio_data))
        if max_val > 0:
            audio_data = audio_data / max_val

        waveform = torch.from_numpy(audio_data).unsqueeze(0)
        
        # Ensure the audio is mono
        if waveform.shape[0] > 1:
            waveform = waveform.mean(dim=0, keepdim=True)
        
        # Resample if the source sample rate 8Khz
        if self.source_sample_rate == 8000:
            waveform = torchaudio.functional.resample(waveform, orig_freq=8000, new_freq=16000)
        
        resampled_np = waveform.squeeze().numpy()

        return resampled_np

Also, as a side note, I found that the audio is really quiet, so normalizing it seems to help bring it up a little without adding artifacts. However, if you want it even louder (which may help with VAD), you can also add this:

gain_factor = 1.2
audio_data = np.clip(audio_data * gain_factor, -1.0, 1.0)

@0xzre (Contributor, Author) commented Aug 15, 2024

Thanks! I get the idea of your code and agree, but I think naive normalization isn't great for VAD. I'll still try normalizing it, though, to see if it's better.

@0xzre (Contributor, Author) commented Aug 15, 2024

> Hey @0xzre, do you have a board or the device for testing with the app?
> Also, please convert this from draft to a full PR.

I don't have either, but I'll try with my phone, since sending the audio is all I need now.

Okay, moving to full PR. Thanks.

@0xzre (Contributor, Author) commented Aug 15, 2024

> From the phone you can test pcm16 recording. [...]

Okay, I've checked main and will record the wav from my code!

I will email the audio result, thanks.

@0xzre marked this pull request as ready for review on August 15, 2024, 03:24
@josancamon19 (Contributor)

Hi @0xzre, I didn't get your email. Please let me know when I can review this, as it is slightly urgent.

@mdmohsin7 (Collaborator)

@0xzre can you please resolve the conflicts?

@0xzre (Contributor, Author) commented Aug 17, 2024

Rebased, please review :) @mdmohsin7. Sorry, maybe next time I'll just merge lol.

audio_buffer = deque(maxlen=sample_rate * 1)  # 1 sec
databuffer = bytearray(b"")

REALTIME_RESOLUTION = 0.01
Contributor:

Why use 10 ms chunks here?

20 ms seems like a more standard size.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my testing, I haven't found a practical difference between them. I chose 10 ms because more responsiveness seemed good; Deepgram still buffers the audio until a good transcription is detected, right? Although 20 ms would halve the number of sends over the DG socket, I don't think 10 ms increases the cost on the DG side.

Contributor:

There is no increase or decrease in the cost of sending audio chunks, but 10 ms is very low and not recommended.

20 ms is the recommended minimum.

The server also shows very high CPU usage at 10 ms when multiple streams are running.

The receiver thread will also be blocked by sender threads, e.g. if you have 100 connections all doing work 10 ms apart.
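
For concreteness, the chunk-size arithmetic behind the 10 ms vs 20 ms trade-off (sample rate, width, and channel count assumed for illustration):

SAMPLE_RATE = 16000  # Hz
SAMPLE_WIDTH = 2     # bytes per sample (16-bit PCM)
CHANNELS = 1

def chunk_bytes(duration_s):
    return int(SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS * duration_s)

print(chunk_bytes(0.01))  # 10 ms -> 320 bytes, 100 sends/sec per stream
print(chunk_bytes(0.02))  # 20 ms -> 640 bytes, 50 sends/sec per stream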

@0xzre (Contributor, Author):

Nice insight. Ideally we'd stress test, but adopting the standard as a baseline is never wrong. I'll make the required changes soon. Thanks!

if len(databuffer) >= chunk_size:
    socket1.send(databuffer[:len(databuffer) - len(databuffer) % chunk_size])
    databuffer = databuffer[len(databuffer) - len(databuffer) % chunk_size:]
await asyncio.sleep(REALTIME_RESOLUTION)
@DamienDeepgram (Contributor) commented Aug 17, 2024

Sleeping here is imprecise: it can sleep for more or less than REALTIME_RESOLUTION, which causes significant drift over time, especially with a value as low as 10 ms (100 sleeps per second).

Here is an example of how you could solve/offset the issue:

https://github.com/deepgram/median-streaming-latency/blob/main/latency.py#L78-L91
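
In the spirit of the linked latency.py, a hedged sketch of drift-compensated pacing (names are illustrative): anchor each sleep to a schedule derived from a monotonic clock instead of sleeping a fixed interval, so per-iteration error does not accumulate:

import asyncio
import time

REALTIME_RESOLUTION = 0.02  # 20 ms, the recommended minimum

async def paced_send(send_chunk):
    start = time.monotonic()
    sent = 0
    while True:
        send_chunk()
        sent += 1
        # Sleep until the next scheduled tick; an overshoot on one
        # iteration shortens the next sleep instead of accumulating.
        delay = start + sent * REALTIME_RESOLUTION - time.monotonic()
        if delay > 0:
            await asyncio.sleep(delay)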

@0xzre (Contributor, Author):

Yeah, on a low-spec server that would definitely cost us drift. I'll fix that part, and thank you for the reference code! :)

Contributor:

If the sleep is 1 ms off every 10 ms, you would drift 100 ms per second, or 6 seconds per minute.

I have seen the drift go as high as 60 seconds per minute, which essentially means streaming audio very slowly, at half real-time speed.

@@ -103,6 +104,7 @@ async def process_audio_dg(
    def on_message(self, result, **kwargs):
        # print(f"Received message from Deepgram")  # Log when message is received
        sentence = result.channel.alternatives[0].transcript
        print(sentence)
Contributor:

This will spam the logs

@@ -35,33 +29,51 @@ def get_speech_state(data, vad_iterator, window_size_samples=256):
    # maybe like, if `end` was last, then return end? TEST THIS

    if speech_dict:
        vad_iterator.reset_states()
Contributor:

Should reset_states be called even if speech_dict is false?

@mdmohsin7 (Collaborator)

@0xzre lmk once you've resolved the conflicts and are up to date with the main branch. Thanks.

@0xzre (Contributor, Author) commented Aug 18, 2024

CHANGES

  • Reset state: every speech check -> only on timeout. The timeout is intended to mark the end of a person's speech, so naturally the VAD should reset to prepare for a new context/environment.
  • Data length sent to DG: send everything, without the complicated indexing, since DG already handles that.
  • Sleep logic: implemented new sleep logic that should give more accurate sleep times; it now accounts for several factors when calculating the delay. Please check @DamienDeepgram :)
  • Timeout speech duration: 0.7 sec -> 1 sec. I just don't want to be that aggressive; now VAD won't cut the natural silences between words in speech.

Thanks!

@0xzre (Contributor, Author) commented Aug 18, 2024

@mdmohsin7 Please check, I've resolved the conflicts.

@mdmohsin7 (Collaborator)

@0xzre it works with pcm8 and pcm16, but it does not work with opus at all. I don't get any transcripts for it (it doesn't matter if I play a YouTube video or speak myself).

For pcm8 and pcm16, it is missing more segments (a lot more for pcm16) than it was without VAD. Is this happening because, while certain segments are being processed by VAD, the immediately following segments are lost due to VAD's processing time? Also, if there's a good amount of background noise (non-conversational noise), speech isn't detected in most cases. Can this be improved?

You can try running the app locally (even if you don't have the device) and test pcm16 through the app's record-with-mic feature.

@0xzre (Contributor, Author) commented Aug 19, 2024

> @0xzre it works with pcm8 and pcm16, but it does not work with opus at all. [...]

Have you made sure opus is installed? Were there any errors while running it?
For the missing segments: I verified they shouldn't go missing, because processing is asynchronous, and I checked (by also recording the audio to a wav file) that the audio cuts match the not-detected debug prints. But I'll check more.
For background noise: that's why increasing gain won't directly help. I have raised the threshold to 0.7 and the timeout to 2 sec to help with that, already tested on my end.
Yeah, I've built the app on my phone already, thanks.
Also, I now add pre-speech audio (max 500 ms) that is included before speech first becomes active (see the sketch below). This helps the VAD because it sometimes detects the start of speech slightly late.
I also made the audio window for VAD detection longer to get better audio context, though it costs some server performance.
Thanks @mdmohsin7 !
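
A hedged sketch of the pre-speech idea (sizes and names are illustrative): keep up to ~500 ms of the most recent audio in a rolling deque, and when VAD first reports speech, flush that backlog ahead of the live audio so a late detection doesn't clip the first syllables:

from collections import deque

SAMPLE_RATE = 16000
PRESPEECH_BYTES = int(SAMPLE_RATE * 0.5) * 2  # 500 ms of 16-bit samples

prespeech = deque(maxlen=PRESPEECH_BYTES)
speech_active = False

def on_chunk(chunk, is_speech, send):
    global speech_active
    if is_speech:
        if not speech_active:
            # Speech just started: emit the buffered lead-in first.
            send(bytes(prespeech))
            prespeech.clear()
            speech_active = True
        send(chunk)
    else:
        speech_active = False
        prespeech.extend(chunk)  # keep rolling context for the next onset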

@0xzre (Contributor, Author) commented Aug 19, 2024

The speech timeout is now calculated from when the data is received, because processing the VAD actually takes a significant amount of time.

@mdmohsin7 (Collaborator)

> Have you made sure opus is installed? Were there any errors while running it?

Yes, I have it installed and I don't get any errors. Please DM me on Discord (https://discord.com/users/710158215723089930) and I'll share the recordings.

Also, please comment out the VAD for opus for now and clean up the code a bit.
@0xzre

@mdmohsin7 self-requested a review August 19, 2024 07:12
@mdmohsin7 merged commit bdd6c71 into BasedHardware:main Aug 19, 2024
Linked issue this PR may close: Websocket silero VAD works for (opus, pcm8, pcm16) ($500)