Separation of phonemes #33
Comments
@nam-ak That's a good question, and I also don't know how @carmo-evan achieved this -- perhaps with a script? I know that there are existing syllabification parsers for various languages, but it is not a simple or error-free process, and the algorithm for each language would need to be quite different. Overall, I think it would be good to have syllabification added for all languages as an eventual goal, so if you have any ideas for automating this in a reasonably accurate way, or if you want to submit a pull request adding syllable parsing for a particular language, that would be very welcome. 😄
Are those syllabification parsers only for orthographic words, and not for their IPA phonetic transcriptions?
Yes, I fully agree. Which of course brings us back to the reason that most of the languages in the project don't currently include syllabification... Interestingly, the database for Dutch (which has not been merged yet) seems to include dot-separated phonemes as well. I assume that this is also a case where the source dictionary already incorporates these.
Perhaps @VincentCCL might be able to shed some light on how syllabification was carried out for the Dutch data. For example, was it added manually, or through an automated process?
For the orthographic transcriptions, we have manually defined syllable borders. For the phonetic transcriptions this is not 100% clear, but our documentation does not say that they are not manual, so we assume they were either made manually or transferred from the orthographic transcriptions. The source is part of the CELEX data, cf. https://kdutch.ivdnt.org/wiki/Lexica#CELEX_and_WebCelex
"WebCelex is a webbased interface to the CELEX lexical databases of English, Dutch and German." https://catalog.ldc.upenn.edu/LDC96L14 "For each language, this data set contains detailed information on: orthography (variations in spelling, hyphenation)" So I assume that in this lexical database the English and German phonetic transcriptions are also separated by dots, i.e. that it includes syllabification of the IPA transcriptions. Can you confirm? I don't know how to access this database.
I'll check whether I have access to the non-Dutch data -- I am only familiar with the Dutch CELEX.
Any updates on this? I'm interested in it.
The phonemes are separated by dots on the pull request "Create pt-BR.txt" by @carmo-evan but not in any other language in this project. I have no idea how he did it and if it's possible to do it automatically in the other languages.
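For anyone who wants to experiment with doing this automatically, here is a minimal sketch of one common heuristic (maximal onset): every vowel is treated as a syllable nucleus, and the consonants between two nuclei are split so that the following syllable gets the longest onset the language allows. This is not how @carmo-evan produced the pt-BR data, which is unknown. The function names, `TOY_VOWELS` and `TOY_ONSETS` are made up for illustration and do not describe any real language; a usable version would need a per-language phoneme inventory and onset list, and would still make errors.

```python
# Toy inventory, assumed purely for illustration (not real language data).
TOY_VOWELS = {"a", "e", "i", "o", "u"}
TOY_ONSETS = {("k",), ("s",), ("t",), ("r",), ("t", "r"), ("k", "r")}


def syllabify(phonemes, vowels, legal_onsets):
    """Group a sequence of phonemes into syllables using the maximal-onset heuristic."""
    phonemes = list(phonemes)
    nuclei = [i for i, p in enumerate(phonemes) if p in vowels]
    if not nuclei:
        # No vowel found: return the input as a single unit (or nothing if empty).
        return [phonemes] if phonemes else []

    boundaries = []  # start index of every syllable after the first
    for prev, nxt in zip(nuclei, nuclei[1:]):
        # Consonants between the two nuclei occupy phonemes[prev + 1:nxt].
        # Try the longest possible onset first, then shorter ones.
        for candidate in range(prev + 1, nxt + 1):
            onset = tuple(phonemes[candidate:nxt])
            if not onset or onset in legal_onsets:
                boundaries.append(candidate)
                break

    syllables, begin = [], 0
    for boundary in boundaries:
        syllables.append(phonemes[begin:boundary])
        begin = boundary
    syllables.append(phonemes[begin:])
    return syllables


def to_dotted(syllables):
    """Render syllables in the dot-separated style used in the pt-BR file."""
    return ".".join("".join(s) for s in syllables)


# Made-up word: the cluster "str" is split as "s" + "tr" because "tr" is a
# legal onset in the toy inventory and "str" is not.
print(to_dotted(syllabify(list("kastro"), TOY_VOWELS, TOY_ONSETS)))  # kas.tro
```

The input is a list of phoneme strings rather than a raw string, so multi-character IPA symbols (affricates, diphthongs, length marks) can be passed as single tokens. Stress marks and syllabic consonants are not handled in this sketch.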