Adding catalan language #8

Open
ccoreilly opened this issue Jun 17, 2021 · 9 comments
Labels
enhancement New feature or request

Comments

@ccoreilly

I would like to contribute by adding support for the Catalan language to gruut (and gruut-ipa / ipa2kaldi), but I am not sure about the g2p model.

I have a Phonetisaurus g2p model that outputs CMU phonemes, along with the corresponding dictionary. Would that suffice, or should the model output IPA phonemes? I could maybe manually map the CMU phonemes to IPA and retrain the model.

I have also seen you have extracted g2p models from espeak-ng, how could I do so? Or have you converted a lexicon to its IPA phonetic representation with espeak and then trained a g2p model based on that?

@synesthesiam
Contributor

Hi @ccoreilly, thanks for offering to volunteer!

When adding a new language, my first step is to add the phonemes to gruut-ipa. These should be IPA, and I usually just use a Wikipedia page.

If you can manually map the CMU phonemes to IPA, that would be great. If you follow the convention here for English, it will be possible for gruut-ipa to convert between the CMU and IPA phonemes automatically.
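A manual CMU-to-IPA mapping of the kind described above could be sketched roughly as follows. This is only an illustration: the `CMU_TO_IPA` table holds a handful of common English ARPAbet pairs, not a Catalan inventory, and the helper names are hypothetical. It does not reproduce the file format gruut-ipa uses for its own mappings, and it simply drops stress digits rather than converting them.

```python
# Illustrative subset only: common ARPAbet-to-IPA pairs for English.
# This is NOT a Catalan phoneme inventory and NOT gruut-ipa's mapping format.
CMU_TO_IPA = {
    "AA": "ɑ",
    "IY": "i",
    "UW": "u",
    "K": "k",
    "T": "t",
    "S": "s",
    "SH": "ʃ",
    "NG": "ŋ",
}


def cmu_to_ipa(phonemes):
    """Map a list of CMU symbols to IPA, dropping stress digits (AA1 -> AA)."""
    ipa = []
    for symbol in phonemes:
        base = symbol.rstrip("012")
        if base not in CMU_TO_IPA:
            # Fail loudly so unmapped phonemes are noticed before retraining.
            raise KeyError(f"no IPA mapping for CMU phoneme {symbol!r}")
        ipa.append(CMU_TO_IPA[base])
    return ipa


def convert_lexicon_line(line):
    """Convert one 'word PH1 PH2 ...' lexicon line to 'word ipa1 ipa2 ...'."""
    word, *phonemes = line.split()
    return " ".join([word, *cmu_to_ipa(phonemes)])


print(convert_lexicon_line("tea T IY1"))
```

Running every lexicon line through `convert_lexicon_line` would give an IPA lexicon that a g2p model could then be retrained on.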

> I have also seen you have extracted g2p models from espeak-ng, how could I do so?

I created a small script for this. I start by creating a list of words, usually just the words from my lexicon plus a list of frequent words in the language (I have one for Catalan). Make sure to lower-case and de-duplicate the words. Then I create the espeak-ng lexicon like this:

./espeak_word.sh < words.txt > lexicon.espeak.txt
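The word-list preparation mentioned above (lower-casing and de-duplicating before running the script) might look something like this minimal sketch. The function name `normalize_words` is hypothetical and is not part of gruut.

```python
def normalize_words(lines):
    """Lower-case, strip, and de-duplicate words, keeping first-seen order."""
    seen = set()
    words = []
    for line in lines:
        word = line.strip().lower()
        # Skip blank lines and words that were already emitted.
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words


print(normalize_words(["Casa", "casa ", "", "Gos"]))
```

In practice the input would be the lines of the lexicon word list concatenated with the frequent-word list, and the result would be written one word per line to `words.txt`.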

After that, converting it to a database is straightforward:

python3 -m gruut.lexicon2db --casing lower --lexicon lexicon.espeak.txt --database espeak/lexicon.db

I train separate g2p models for IPA and espeak-ng phonemes. See below for instructions on that, and let me know if you have any questions 🙂

G2P

Recent versions of gruut no longer use Phonetisaurus at runtime, to reduce runtime dependencies. I'm hoping to add support for reading the g2p FSTs in pure Python, but for now I'm using a different framework.

Training still needs Phonetisaurus, however, for the initial alignment of the corpus. If you're using my phonetisaurus Python package, you can get this when you train a model:

phonetisaurus train --corpus g2p.corpus --model g2p.fst lexicon.txt

The g2p.corpus file contains the alignments for all words in the lexicon. You use this to train a model in my new framework like this:

python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf
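The two training steps above could be chained from a small script, sketched here under the assumption that both commands are on the PATH and accept exactly the arguments shown in this comment. The helper names `build_commands` and `run_pipeline` are hypothetical.

```python
import subprocess


def build_commands(lexicon, corpus, fst, crf):
    """Return the two training commands from this thread as argument lists."""
    return [
        # Step 1: Phonetisaurus training, which also writes the aligned corpus.
        ["phonetisaurus", "train", "--corpus", corpus, "--model", fst, lexicon],
        # Step 2: train the gruut g2p model from that aligned corpus.
        ["python3", "-m", "gruut.g2p", "train", "--corpus", corpus, "--output", crf],
    ]


def run_pipeline(lexicon="lexicon.txt", corpus="g2p.corpus",
                 fst="g2p.fst", crf="g2p/model.crf"):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(lexicon, corpus, fst, crf):
        subprocess.run(command, check=True)
```

With `check=True`, a failing first step raises `subprocess.CalledProcessError`, so the second step never runs on a missing or partial corpus.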

@synesthesiam added the enhancement label Jun 23, 2021
@ccoreilly
Author

Thanks for the thorough response, Michael! I have been a bit busy lately but will make time to contribute.

@mlrober

mlrober commented Nov 3, 2021

Hi Michael,

I'm trying to add a new language and have created model.fst and model.corpus with Phonetisaurus.
However, when I run the command below to get "model.crf":

python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf

I get this error:

zsh: killed python3 -m gruut.g2p train --corpus g2p.corpus --output g2p/model.crf

That's all it prints. Any ideas or troubleshooting steps to get past this, or is there another way to get model.crf?

@synesthesiam
Contributor

How big is your pronunciation dictionary? Is it eating up all of your memory?
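One way to test the memory hypothesis would be to train on a random subset of the corpus first and see whether the process still gets killed. The sketch below assumes the aligned corpus has one entry per line; the function names are hypothetical and not part of gruut.

```python
import random


def sample_lines(lines, fraction, seed=0):
    """Keep roughly `fraction` of the lines, chosen reproducibly."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    # Each line is kept independently, so the file is read in a single pass.
    return [line for line in lines if rng.random() < fraction]


def write_sample(src_path, dst_path, fraction, seed=0):
    """Write a sampled copy of a corpus file; return the number of lines kept."""
    with open(src_path, encoding="utf-8") as src:
        kept = sample_lines(src, fraction, seed)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(kept)
    return len(kept)
```

For example, `write_sample("g2p.corpus", "g2p.small.corpus", 0.1)` would keep about a tenth of the entries. If training succeeds on the smaller file, running out of memory on the full corpus becomes the more likely explanation.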

@mlrober

mlrober commented Nov 5, 2021

Thanks for the reply. The corpus file is 23M in size.
Is that too big to train on? What would be the ideal size?

@ccoreilly
Author

@mlrober are you working on Catalan or another language? (I haven't had the time, so it'd be great if your questions were specific to the Catalan language :)

@mlrober

mlrober commented Nov 5, 2021 via email

@synesthesiam
Contributor

I guess we can consider this thread as "adding a new language" more generally 🙂

@mlrober, can you clarify what "howling all the nos in loss" means? Sorry, I can't quite interpret it 😕

@mlrober

mlrober commented Nov 6, 2021 via email

synesthesiam pushed a commit that referenced this issue Jul 3, 2024
3 participants