Is there a way to do post processing of transcripts via LLM? #608

Closed · 3 tasks · josancamon19 opened this issue Aug 16, 2024 · 2 comments

Comments

@josancamon19
Contributor

Is your feature request related to a problem? Please describe.
I spent two hours attempting this; here are my findings.

Prompt attempted:

You are a helpful assistant for correcting transcriptions of recordings. You will be given a list of voice segments; each segment contains the fields (speaker id, text, and seconds [start, end]).

The transcription has a Word Error Rate of about 15% in English; in other languages it could be up to 25%, and it is especially bad at speaker diarization.

Your task is to improve the transcript by taking the following steps:

1. Make the conversation coherent: if someone reads it, it should be clear what the conversation is about. Keep the estimated WER in mind; errors can include missing words, incorrectly transcribed words, missing connectors, punctuation, etc.

2. The speaker ids are most likely inaccurate. Assign the correct speaker id to each segment by understanding the whole conversation. For example:
- The transcript could list 4 different speakers, but by analyzing the overall context one can discover there were only 2, and the speaker identification incorrectly split them into multiple speakers.
- The transcript could list 1 or 2 speakers when in reality there were 3.
- A speaker id might be assigned incorrectly: a conversation could have speaker 0 saying "Hi, how are you" and then speaker 0 also saying "I'm doing great, thanks for asking", which of course would be wrong.

Considerations:
- Return a list of segments of the same size as the input.
- Do not change the order of the segments.

Transcript segments:
##

Plus LangChain parsing instructions appended to the prompt.
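
For reference, a minimal sketch (not the exact code used) of how the prompt above could be paired with LangChain parsing instructions. The Segment fields mirror the ones named in the prompt (speaker id, text, and seconds [start, end]); the variable names and sample data are made up for illustration.

```python
import json

from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field


class Segment(BaseModel):
    speaker_id: int = Field(description="corrected speaker id")
    text: str = Field(description="corrected segment text")
    start: float = Field(description="start time in seconds")
    end: float = Field(description="end time in seconds")


class ImprovedTranscript(BaseModel):
    segments: list[Segment]


parser = PydanticOutputParser(pydantic_object=ImprovedTranscript)

# The correction prompt quoted above (abbreviated here), followed by the
# parser's format instructions and the serialized input segments.
correction_prompt = "You are a helpful assistant for correcting transcriptions of recordings. ..."
input_segments = [
    {"speaker_id": 0, "text": "Hi, how are you", "start": 0.0, "end": 1.4},
    {"speaker_id": 0, "text": "I'm doing great, thanks for asking", "start": 1.6, "end": 3.2},
]

full_prompt = (
    correction_prompt
    + "\n\n"
    + parser.get_format_instructions()
    + "\n\nTranscript segments:\n"
    + json.dumps(input_segments)
)
# full_prompt is then sent to the model, and the raw reply is fed to parser.parse().
```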

The hypothesis was: if transcripts can be improved during memory creation, the context for future chat or proactivity will be much better.

This was an attempt to take the transcript segments and run a post-processing pass over them.

Unfortunately, LLMs are not accurate once the transcript gets large: they drop content, change the whole conversation, add segments, remove them, etc.
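
One mitigation that would pair with this: since the model sometimes drops, reorders, or invents segments, reject any output that doesn't line up with the input and fall back to the original. A rough sketch (not the project's actual code), assuming segments are dicts with start/end in seconds:

```python
def is_valid_rewrite(original: list[dict], improved: list[dict]) -> bool:
    """Accept the LLM output only if it keeps the same segments in the same order."""
    if len(improved) != len(original):
        return False
    for orig, new in zip(original, improved):
        # Timing must be preserved; only text and speaker id are allowed to change.
        if abs(orig["start"] - new["start"]) > 0.01 or abs(orig["end"] - new["end"]) > 0.01:
            return False
    return True


def post_process(original: list[dict], improved: list[dict]) -> list[dict]:
    # Fall back to the unmodified transcript when the rewrite can't be trusted.
    return improved if is_valid_rewrite(original, improved) else original
```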

Next steps to try:

  • Better prompting
  • Simpler fixes to the transcript; maybe 1-5% better?
  • Try the prompt with rawSegments, i.e. the segments before the app combines them, and store (raw, combined, improved) segments in the db (sketched below).

If this works, create a script that migrates existing transcripts in the db.
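
A rough sketch of what that migration could look like, assuming each memory document keeps raw and combined segments and gains an improved_segments field. The store API (get_memories, update_memory) and field names here are hypothetical, not omi's actual backend:

```python
from typing import Callable


def migrate_existing_transcripts(db, improve_fn: Callable[[list[dict]], list[dict]]) -> None:
    """Backfill improved segments for memories that don't have them yet."""
    for memory in db.get_memories():                  # hypothetical store API
        if memory.get("improved_segments"):
            continue                                  # already migrated
        raw = memory.get("raw_segments") or memory.get("segments", [])
        if not raw:
            continue                                  # nothing to improve
        db.update_memory(memory["id"], {              # hypothetical store API
            "improved_segments": improve_fn(raw),     # LLM post-processing step
        })
```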

Findings:
GPT-4o outperforms all the other models tried.

@kodjima33 kodjima33 moved this to Backlog in omi TODO Aug 16, 2024
@josancamon19
Contributor Author

The main goal of this was/is:

  • Improve transcript coherence
  • Improve diarization and speaker identification, e.g. correctly setting is_user = True

@josancamon19
Contributor Author

The answer is no, there's no reliable way to do this right now, so I moved it to not planned.

@josancamon19 josancamon19 closed this as not planned Sep 24, 2024
@github-project-automation github-project-automation bot moved this to Done in omi TODO Sep 24, 2024