
Steps involved to add a new language? #265

Open
fracpete opened this issue Apr 27, 2020 · 6 comments
@fracpete

So far, I've only come across readily available language models, etc., for the various STT/TTS plugins.
My question is, what steps are necessary in order to add a completely new language, e.g., an indigenous one, to Naomi? Code and/or configuration changes? What would be necessary to use DeepSpeech in such a scenario (I presume some form of training on a new audio corpus)?

@aaronchantrill
Contributor

aaronchantrill commented Apr 28, 2020

Hey Peter. That's a good question. There is a partial answer here: https://support.projectnaomi.com/forums/topic/where-to-start-developing-a-acoustic-model-for-esperanto-for-cmusphinx/ which discusses adding Esperanto specifically, although nobody has yet gone through the whole process of adding a completely new language.

As far as changes to the core Naomi code go, you would have to add your language as an option to the get_language method in naomi/commandline.py if you want people to be able to select it during the initial configuration. (That really isn't the right place for this function, so I imagine it will move eventually; it should also be modified to key off the list of .po files in naomi/data/locale instead of a hard-coded list.) Once a language is selected, Naomi should switch to communicating in that language, but the translation files for French and German currently need to be updated - see pull request #249.
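
For illustration, keying the language list off the .po files might look something like this (the function name is just a placeholder, not existing Naomi code):

```python
import os

LOCALE_DIR = os.path.join("naomi", "data", "locale")

def available_languages():
    # Build the selectable language list from the .po translation files
    # that actually ship in naomi/data/locale, instead of a hard-coded list.
    return sorted(
        os.path.splitext(name)[0]
        for name in os.listdir(LOCALE_DIR)
        if name.endswith(".po")
    )
```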

The process of adding a new language could be broken down into three projects: STT (speech to text), TTI (text to intent), and TTS (text to speech).

With Speech to Text, you first have to decide which engine you want to work with. Pocketsphinx and DeepSpeech are both good choices for a more or less complete solution, and Kaldi is a good choice once you are more comfortable with the concepts used in STT. They all have tutorials that discuss building a new speech recognition model.

Speech to text is generally broken down into three main concepts: the acoustic model, the phoneme-to-grapheme mapping (the pronunciation dictionary), and the language model. The language model flows directly into the next project, Text to Intent, since it uses expectations about likely word sequences to determine what was most likely heard.
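
To make those three pieces concrete, here is roughly where each one plugs in if you were using the pocketsphinx Python bindings (the model paths below are placeholders for whatever you build for your language, and the exact API may differ between pocketsphinx versions):

```python
from pocketsphinx import Decoder

# The three pieces you would build for a new language (paths are placeholders):
config = Decoder.default_config()
config.set_string('-hmm', 'model/acoustic')              # acoustic model directory
config.set_string('-dict', 'model/pronunciation.dict')   # phoneme <-> grapheme mapping
config.set_string('-lm', 'model/language.lm.bin')        # language model
decoder = Decoder(config)

decoder.start_utt()
with open('utterance.raw', 'rb') as f:  # 16 kHz, 16-bit mono PCM audio
    decoder.process_raw(f.read(), False, True)
decoder.end_utt()
print(decoder.hyp().hypstr if decoder.hyp() else '(no hypothesis)')
```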

For Text to Intent, you currently have to modify the intents() methods in each of the speechhandler plugins, and also generate new .po gettext translation files for the core and the plugins.

The intents() methods return a list of things the user might say to activate the plugin, which then gets fed into whichever Text to Intent plugin you are using. Since there are different numbers of ways to say things in different languages, it just didn't work to use literal translations here. This gives an intent author better control over how an intent is constructed in a specific language, but it does require someone who is adding a new language to do a lot more work and modify every speechhandler plugin. There are instructions for writing intents here: https://projectnaomi.com/dev/docs/developer/plugins/speechhandler_plugin.html
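
The linked documentation has the authoritative format; purely to illustrate the idea, an intents() method looks roughly like this, with a separate block of template phrases per locale (the keys here are simplified, not the exact schema):

```python
def intents(self):
    # Simplified illustration only: each locale carries its own template
    # phrases, so adding a language means adding a locale block per plugin.
    return {
        'WeatherIntent': {
            'locale': {
                'en-US': {'templates': ["WHAT IS THE WEATHER", "WILL IT RAIN TODAY"]},
                'fr-FR': {'templates': ["QUEL TEMPS FAIT IL", "VA T IL PLEUVOIR AUJOURD HUI"]},
            },
            'action': self.handle
        }
    }
```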

I have considered defining a JSON format file for holding intents, so that someone adding a new translation would be adding new files, not modifying the intents() method of the plugin itself. One benefit of using that kind of file structure is that it could provide a fairly easy method of identifying plugins in the Naomi Plugin Exchange by the locales they are configured to work with, and possibly even allow people to attach additional translations to remotely hosted plugins without having to modify the plugin itself. Currently there is no indication on the Naomi Plugin Exchange of which languages a plugin supports.
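
Purely hypothetical, since this format doesn't exist yet, but I imagine something like one JSON file per locale inside each plugin, with a small loader along these lines:

```python
import json
import os

def load_intent_files(plugin_dir):
    # Hypothetical layout: <plugin_dir>/intents/<locale>.json, one file per language.
    intents = {}
    intent_dir = os.path.join(plugin_dir, "intents")
    for name in os.listdir(intent_dir):
        if name.endswith(".json"):
            locale = os.path.splitext(name)[0]
            with open(os.path.join(intent_dir, name), encoding="utf-8") as f:
                intents[locale] = json.load(f)
    return intents
```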

Generating the .po files is simply a matter of running "update_languages.sh -l <locale_identifier>" with the locale identifier for your language. If there is no standard locale identifier for the language you want to add, you can just make one up; it only has to be consistent within Naomi. This generates a set of files named <locale_identifier>.po. Unfortunately, you then have to go in and manually translate all the phrases in those .po files. This is what allows Naomi to translate its responses into another language, either to display on the screen or to say to the user.
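
Once a .po file is translated (and compiled to a .mo), the actual lookup is ordinary Python gettext. As a generic sketch - the domain name, locale identifier, and directory layout below are assumptions for illustration, not Naomi's actual loading code:

```python
import gettext

# Assumes the translated .po has been compiled to something like
#   naomi/data/locale/<locale_identifier>/LC_MESSAGES/naomi.mo
translation = gettext.translation(
    'naomi', localedir='naomi/data/locale', languages=['mi-NZ'], fallback=True
)
_ = translation.gettext
print(_("What can I do for you?"))  # displayed/spoken in the new language if translated
```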

Last, you need Naomi to be able to say the response, so you need a Text to Speech system that is trained to speak your language and is therefore able to pronounce the words being fed to it. If there is a mismatch between the locale and the voice, the result can be difficult to understand (as an analogy, I have been told by a Swiss friend that pronouncing Maori words correctly is much easier if you try to pronounce them as if they were German rather than English). Voice building can be a pretty complex task. Here are instructions for building a new voice for the Mary TTS system: https://github.com/marytts/marytts/wiki/VoiceImportToolsTutorial and for Festival: http://www.cstr.ed.ac.uk/projects/festival/manual/festival_24.html#SEC99

So it is certainly not easy, but it can be done. If someone were interested in doing this and documenting their progress, I think that would be incredibly helpful to others. The whole process is a lot easier if you can find a ready-made STT model and TTS voice. If you are generating a new language from scratch, you will need a lot of labeled recordings. Naomi is able to help with that, especially if you can find a cloud provider that already offers STT and TTS in the target language. Then, once you have customized the intents and built translation files, you could use Naomi normally with audiolog enabled to build up a collection of labeled samples, which could be used to build both the STT models and the TTS voice.

Please let me know if you have more questions or if you see any mistakes above. I hope that's helpful and not overly dense. I could go into a lot more detail, and would be willing to work directly with someone attempting to do this, especially if they would be willing to help document the process. Do you have a specific use case in mind?

@fracpete
Author

Thanks for the detailed reply; I will have to mull that over a bit, as there is quite a bit of work involved. A possible use case would be to add Maori as a language.
BTW, what do you think of meta-STT and meta-TTS wrappers?
For example, you could take the output of your base STT and push that through Google Translate (from Maori to English) to avoid having to update all your plugins to handle the new language. Any English text that has to be spoken could be translated back, e.g. through Google Translate again, before pushing it out through a Maori TTS. I could imagine that this kind of translation approach might work relatively well, as long as the incoming and outgoing text is relatively simple and short.
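
As a rough sketch of what I mean - the translate() helper here is just a stand-in for whatever translation backend would be used, and the method names are made up rather than Naomi's actual plugin interface:

```python
def translate(text, source, target):
    # Stand-in for a real translation backend (Google Translate or similar);
    # not an actual API call.
    raise NotImplementedError

class TranslatingSTT:
    """Wrap an existing STT engine so the rest of Naomi only sees English."""
    def __init__(self, base_stt, locale="mi"):
        self.base_stt = base_stt
        self.locale = locale

    def transcribe(self, audio):
        native_text = self.base_stt.transcribe(audio)      # e.g. Maori text
        return translate(native_text, self.locale, "en")   # English for the plugins

class TranslatingTTS:
    """Translate English responses back before a native-voice TTS speaks them."""
    def __init__(self, base_tts, locale="mi"):
        self.base_tts = base_tts
        self.locale = locale

    def say(self, english_text):
        self.base_tts.say(translate(english_text, "en", self.locale))
```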

@aaronchantrill
Contributor

aaronchantrill commented May 1, 2020

I agree, it's a lot of work. I don't know that Google Translate could help all that much, though. My experience with Google Translate is that it works well enough to get the intent across, but rarely goes beyond pidgin-level language.

My biggest concern is with regard to third-party speechhandler plugins. I'd like to have an easy way for someone to add a translation "pack" to someone else's plugin, consisting of a .po translation file and an intent file, and for another user to then download the plugin with the added translation files.

I have thought about expanding the "update_translations.py" program to use Google Translate to generate the translation files, which should work fine in our current state without leaking usage data to Google. That way a user could quickly generate all the translation files needed to get Naomi up and running in a new language, and later go back and fix the translations.

For the time being, I see Naomi as more of a development kit than a finished product. My hope is that through being able to experiment with different technologies, we will eventually be able to come up with an effective platform that runs locally rather than in the cloud. Once something really seems to be working, Naomi could be used as a template to write a system optimized for the specific plugins.

Have you ever used a piece of software called Simon (https://userbase.kde.org/Simon)? It used to be part of the KDE desktop, but the lead developer went to work for Apple's Siri division a few years ago and the project sort of fell apart after that. The idea was to have a simple means of controlling the desktop using voice, sort of like the scene in Blade Runner where Harrison Ford is zooming in on an image on his computer. One thing it included was the ability to train STT systems, especially the Julius STT engine. One of my visions for Naomi is to provide that same ability to train speech recognition engines, which can then be re-applied to your own projects.

You would still need both Speech to Text and Text to Speech systems with acoustic models optimized for the specific language, which I consider to be the most difficult part. Doing the actual translations is pretty straightforward.

@aaronchantrill
Contributor

@fracpete can you suggest what form you would like this project to take? Do you want some specific documentation around it? This is kind of a big question, and given that Naomi uses plugins, it's impossible to give a specific set of instructions that covers every use case. At the same time, there are some definite steps that would always have to be done, and those can be documented along with some more general information about generating new STT and TTS models for those who need to. I'm just not sure how to resolve this.

@fracpete
Author

fracpete commented May 4, 2020

We had a discussion around this today. For the time being, we will concentrate on getting a handle on STT (DeepSpeech) and TTS (MaryTTS), with building up a speech corpus as the first step. Once model performance is satisfactory, we will look into tighter integration with Naomi.

@aaronchantrill
Contributor

aaronchantrill commented May 4, 2020

Wow, that's awesome. Definitely keep me informed.

What I have learned from working with Naomi, though, is that you don't need perfect recognition to get good comprehension at the intent level, especially if you are willing to accept a pretty simple model where only one intent is triggered at a time.

Adding some humorous responses can be a good strategy for generating some good will with users and keeping them engaged when the computer is having some trouble understanding them, as long as they don't happen too often.

The point is that speech recognition will never be perfect, since even humans process language at multiple levels to make sense of it. I have some hearing loss, so often what I actually hear is garbled, but I can usually work out the speaker's intent from context.

A good illustration of the process our brains engage in while listening is the "Mares eat oats" song, which can sound like gibberish until you get to the "wouldn't you?" part and realize that the whole thing has been in English all along.

Since I started verifying/correcting transcriptions with the NaomiSTTTrainer.py software, I have gained a lot of understanding of how exactly the computer hears things and how the sounds get matched up with a meaningful sentence. Often a little nudging in the language model is enough to get much better comprehension, and that is why edit distances and Soundex-style matching can be a huge help when dealing with spoken language.
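
As a tiny illustration of that kind of nudging (generic Python, not anything from NaomiSTTTrainer.py itself), you can snap a slightly garbled transcription onto the closest expected phrase with difflib:

```python
import difflib

EXPECTED_PHRASES = [
    "what time is it",
    "what is the weather",
    "play some music",
]

def snap_to_expected(heard, cutoff=0.6):
    # Return the closest expected phrase if it is similar enough,
    # otherwise keep the raw transcription.
    matches = difflib.get_close_matches(heard, EXPECTED_PHRASES, n=1, cutoff=cutoff)
    return matches[0] if matches else heard

print(snap_to_expected("what is the whether"))  # -> "what is the weather"
```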
