
Whisper normalization for evals #47

Open
pcuenca opened this issue Feb 7, 2024 · 0 comments

pcuenca commented Feb 7, 2024

The transformers version of the Whisper tokenizer has an EnglishTextNormalizer (https://github.com/huggingface/transformers/blob/d9deddb4c18410a14952537a91099319ecedb869/src/transformers/models/whisper/tokenization_whisper.py#L529) that is initialized with the contents of this file. There's also a BasicTextNormalizer and a few related utilities.

These normalizers are not applied during regular use of the tokenizer; they can be enabled by passing custom flags to decode. This usually happens during quality evaluation, as explained in this PR, or as seen in the Open ASR leaderboard, which contains a hardcoded version of the English normalization file.
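For context, a basic normalizer of this kind roughly lowercases the text, drops bracketed annotations, strips punctuation, and collapses whitespace before computing metrics like WER. Here's a rough sketch of that idea (this is an illustration, not the exact transformers implementation; the function name and details are made up):

```python
import re
import unicodedata

def basic_normalize(text: str) -> str:
    """Sketch of a basic ASR eval normalizer: lowercase, remove
    bracketed/parenthesized annotations, replace punctuation and
    symbols with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[<\[][^>\]]*[>\]]", "", text)  # drop [tags] and <tags>
    text = re.sub(r"\([^)]*\)", "", text)          # drop (parenthesized) spans
    # replace punctuation, symbol, and mark characters with spaces
    text = "".join(
        " " if unicodedata.category(c)[0] in "MSP" else c for c in text
    )
    return re.sub(r"\s+", " ", text).strip()

print(basic_normalize("Hello, World! (laughs) It's fine."))
# → hello world it s fine
```

Running both the reference and the hypothesis transcripts through the same normalizer keeps the metric from penalizing cosmetic differences in casing and punctuation.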

It'd be interesting to add these features as opt-in capabilities, but they are really not required until we want to run evaluations in Swift. Opening this issue for future reference.

h/t @ZachNagengast for his help diving into this.
