
Improve speech recognition and remove postprocessing #837

Closed
14 tasks done
josancamon19 opened this issue Sep 14, 2024 · 7 comments
josancamon19 commented Sep 14, 2024

Refactoring STT system

https://artificialanalysis.ai/speech-to-text

It points to https://www.speechmatics.com/ as the leader in WER %.


Deepgram's WER is about 40% worse, which forces us to do postprocessing with whisper-x.

I also tried AssemblyAI; unfortunately its streaming only works for English, so it's discarded.

Speechmatics is marginally better than AssemblyAI, works with all languages, and has interesting future-proof features.

NOTE: I will run the exact same pipeline in Soniox first (we already have 10k in credits), but I'm unsure whether I trust their accuracy numbers, since the WER comparison was done by Soniox themselves, and before the latest models were released.

Still, the reason for testing Soniox first is that a good portion of the pipeline is already integrated, so it shouldn't take long.


  • Set up the Speechmatics websocket concurrently with the existing Deepgram websocket.
  • Add a settings dropdown in the app that allows selecting the transcription model (only while testing).
  • Test both options in 10 scenarios: (Deepgram + postprocessing) vs. (Speechmatics + postprocessing).
  • Write a script to view a line-by-line comparison between them.
    • Prompt GPT to compare the three transcripts for each scenario and judge which has better accuracy.
    • (Maybe) use groq whisper v3 as the source of truth and compute WER against it.
  • If the tests show Speechmatics within 5-10% of the whisper-x results, remove postprocessing.
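A minimal way to score the scenario transcripts in that WER step, with no external dependencies, is a word-level Levenshtein distance against the reference transcript. This is just a sketch: the model names and sentences below are made up for illustration, and a real run would normalize punctuation and numerals first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical transcripts of the same recording (illustrative only).
reference = "okay so the meeting is moved to friday at three"
candidates = {
    "deepgram": "okay so meeting is moved to friday at three",
    "speechmatics": "okay so the meeting is moved to friday at three",
}
for model, text in candidates.items():
    print(f"{model}: WER {wer(reference, text):.2%}")  # deepgram 10.00%, speechmatics 0.00%
```

A library like jiwer adds the text normalization (punctuation, contractions, casing rules) that this sketch skips, which matters for fair cross-model comparisons.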

Important:

  • Need to double-check scalability (no response from them yet).

  • Need to ask for free credits; it's 4x more expensive than Deepgram (no response).

  • Speechmatics will only be supported for Opus; 1.0.2 will continue using Deepgram.

Add ons:

  • A VAD implementation will be needed; finish that ticket, especially for Opus.
  • Push more users to migrate: run a "campaign" to help users move from 1.0.2 to 1.0.4 in < 30 days so we can deprecate pcm8.
    • Understand the data (how many users are still on pcm8?).
  • Improve speech recognition: make sure the file is being sent correctly (use the raw .wav audio instead of the saved Opus-encoded bytes), and double-check the duration at which it performs well 90% of the time.
@josancamon19 josancamon19 self-assigned this Sep 14, 2024
josancamon19 commented:

How the WER tests were made by artificialanalysis (screenshot attached).

josancamon19 commented Sep 14, 2024 (image only).

kodjima33 (Collaborator) commented:

@josancamon19 can you please specify which languages are required to complete the task? That will help me figure out more quickly whom to ask to do it.

josancamon19 commented Sep 23, 2024

Note: I had an issue with the Speechmatics diarization results; my hunch is that its diarization is still better than Deepgram's.
For WER I used jiwer.
For DER I used pyannote.metrics.diarization.DiarizationErrorRate.

Average WER Table

Model         Average WER
soniox        20.04%
speechmatics  20.85%
fal_whisperx  21.80%
deepgram      31.59%

Average DER Table

Rank  Model         Average DER
1     deepgram      24.07%
1     soniox        24.32%
2     fal_whisperx  27.93%
3     speechmatics  1258.00%

How was this computed:

  • For WER, groq-whisper-large-v3 was used as the reference, and the other models' results were computed against it.
  • For DER, pyannote diarization 3.1 (via https://pyannote.ai/) was used as the reference.
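For context on the DER numbers above: DER is (missed speech + false alarm + speaker confusion) / total reference speech, scored under the best mapping between hypothesis and reference speaker labels, which is why it can exceed 100% when a system badly over-segments. A simplified frame-level sketch of that computation (not pyannote's actual implementation, which works on time intervals with an optimal assignment; the segments below are toy data, not our test recordings):

```python
from itertools import permutations

FRAME = 0.01  # quantize time into 10 ms frames

def to_frames(segments, duration):
    """segments: list of (start_sec, end_sec, speaker). One label (or None) per frame."""
    n = int(round(duration / FRAME))
    frames = [None] * n
    for start, end, spk in segments:
        for i in range(int(round(start / FRAME)), min(int(round(end / FRAME)), n)):
            frames[i] = spk
    return frames

def der(reference, hypothesis, duration):
    """Simplified DER: frame errors / total reference speech frames."""
    ref = to_frames(reference, duration)
    hyp = to_frames(hypothesis, duration)
    ref_spk = sorted({s for s in ref if s is not None})
    hyp_spk = sorted({s for s in hyp if s is not None})
    best = None
    # Brute-force every one-to-one mapping of hypothesis labels onto reference
    # labels (None padding = "unmatched speaker"); fine for a handful of speakers.
    for perm in permutations(ref_spk + [None] * len(hyp_spk), len(hyp_spk)):
        mapping = dict(zip(hyp_spk, perm))
        errors = 0
        for r, h in zip(ref, hyp):
            h = mapping[h] if h is not None else None
            if r != h:
                errors += 1  # missed speech, false alarm, or speaker confusion
        best = errors if best is None else min(best, errors)
    total_speech = sum(1 for r in ref if r is not None)
    return best / total_speech

reference = [(0.0, 4.0, "A"), (4.0, 8.0, "B")]
hypothesis = [(0.0, 5.0, "spk0"), (5.0, 8.0, "spk1")]
print(f"DER: {der(reference, hypothesis, 8.0):.2%}")  # spk0->A, spk1->B; 1s confused: 12.50%
```

Note the metric is invariant to hypothesis label names (spk0/spk1 vs A/B), which is exactly what the optimal mapping step buys us.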

In English, Soniox's overall WER is better than Speechmatics's, sometimes by a huge margin, but across multiple recording scenarios Speechmatics was the more consistently reliable in WER.

Considering that, we will use Soniox for now, as we already have credits, and Speechmatics costs 2.5x more than Soniox (4.5x more than Deepgram).

Deepgram was slightly better on speaker diarization; users' perception favored Speechmatics (but the pipeline had issues computing Speechmatics's results), so I'm not sure how good it really is.

Still, Soniox is cheaper and very good.

Postprocessing:
groq+pyannote is definitely a better pipeline than fal_whisperx.

From the results, there's no benefit to using fal whisperx: even though some of its results were almost as good as groq-whisper-large-v3 (within ~1% in WER and DER), it is still very unreliable, sometimes outputting only 20% of the expected transcript, or outputting nonsense.

Thus postprocessing will be removed.

beastoin self-assigned this Sep 24, 2024.

beastoin (Collaborator) commented:
👋

josancamon19 commented:

I've asked Speechmatics for credits three times with no response; I'll keep pushing, but there's not much we can do for now.
