Separation of phonemes #33

Open
nam-ak opened this issue Oct 2, 2022 · 8 comments

@nam-ak

nam-ak commented Oct 2, 2022

The phonemes are separated by dots in the pull request "Create pt-BR.txt" by @carmo-evan, but not in any other language in this project. I have no idea how he did it, or whether it could be done automatically for the other languages.

@dohliam
Member

dohliam commented Oct 2, 2022

@nam-ak That's a good question, and I also don't know how @carmo-evan achieved this -- perhaps with a script? I know that there are existing syllabification parsers for various languages, but it is not a simple or error-free process, and the algorithm for each language would need to be quite different.

Overall, I think it would be good to have syllabification added for all languages as an eventual goal, so if you have any ideas for automating this in a reasonably accurate way, or if you want to submit a pull request adding syllable parsing for a particular language, that would be very welcome. 😄
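
To illustrate why the algorithm would differ per language, here is a minimal sketch of one common heuristic, the maximal onset principle: each consonant cluster between two vowels gives as many consonants as possible to the following syllable, as long as they form a permitted onset. This is not what any contributor to this project used. The vowel set and the two onset inventories below are toy assumptions made up for the example, and stress marks and diacritics are not handled.

```python
def syllabify(phonemes, vowels, legal_onsets):
    """Split a list of phoneme strings into syllables (maximal onset heuristic)."""
    nuclei = [i for i, p in enumerate(phonemes) if p in vowels]
    if not nuclei:
        return [list(phonemes)]
    syllables = []
    start = 0
    for cur, nxt in zip(nuclei, nuclei[1:]):
        cluster = phonemes[cur + 1:nxt]  # consonants between two vowels
        split = len(cluster)             # default: whole cluster stays in the coda
        for k in range(len(cluster) + 1):
            # The first match is the longest suffix that is a permitted onset.
            if tuple(cluster[k:]) in legal_onsets:
                split = k
                break
        boundary = cur + 1 + split
        syllables.append(list(phonemes[start:boundary]))
        start = boundary
    syllables.append(list(phonemes[start:]))
    return syllables


def dotted(syllables):
    """Join syllables with dots, mirroring the dot-separated style in pt-BR.txt."""
    return ".".join("".join(s) for s in syllables)


# Toy inventories (assumptions for illustration, not real phonologies).
VOWELS = {"ɛ", "ə"}
ONSETS_WITH_S_CLUSTERS = {("k",), ("s",), ("t",), ("ɹ",), ("t", "ɹ"), ("s", "t", "ɹ")}
ONSETS_WITHOUT_S_CLUSTERS = {("k",), ("s",), ("t",), ("ɹ",), ("t", "ɹ")}

phonemes = ["ɛ", "k", "s", "t", "ɹ", "ə"]
print(dotted(syllabify(phonemes, VOWELS, ONSETS_WITH_S_CLUSTERS)))
print(dotted(syllabify(phonemes, VOWELS, ONSETS_WITHOUT_S_CLUSTERS)))
```

The same phoneme sequence gets a different boundary depending on which onsets are allowed, so every language would need its own inventory, and even then the result is only one of several defensible analyses.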

@nam-ak
Author

nam-ak commented Dec 3, 2022

Are those syllabification parsers only for written words, and not for their IPA phonetic transcriptions?
I don't think it's possible to automate this for several languages, even English, because unfortunately there is no straightforward syllabification method that is accepted by a majority of linguists. I think the syllabification @carmo-evan included in pt-BR.txt was already in the database of whatever dictionary he used to create the data for this open source project.

@dohliam
Member

dohliam commented Dec 3, 2022

Yes, I fully agree. Which of course brings us back to the reason that most of the languages in the project don't currently include syllabification...

Interestingly, the database for Dutch (which has not been merged yet) seems to include dot-separated phonemes as well. I assume that this is also a case where the source dictionary already incorporates these.
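
One rough way to check which data files already carry syllable boundaries is to count how many transcriptions contain a dot. The sketch below assumes a line format of `word<TAB>/ipa/` with optional comma-separated variants; that format and the two sample lines are assumptions for illustration, not entries copied from the project's files.

```python
def parse_entry(line):
    """Return (word, [transcriptions]) from an assumed 'word<TAB>/ipa/, /ipa/' line."""
    word, _, ipa_field = line.rstrip("\n").partition("\t")
    transcriptions = [t.strip().strip("/") for t in ipa_field.split(",") if t.strip()]
    return word, transcriptions


def syllables(transcription):
    """Split a dot-separated transcription into its syllables."""
    return transcription.split(".")


def dotted_share(lines):
    """Fraction of transcriptions that contain at least one syllable dot."""
    total = 0
    with_dots = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        _, transcriptions = parse_entry(line)
        for t in transcriptions:
            total += 1
            if "." in t:
                with_dots += 1
    return with_dots / total if total else 0.0


# Made-up sample lines in the assumed format.
sample = ["casa\t/ˈka.zɐ/\n", "house\t/haʊs/\n"]
print(parse_entry(sample[0]))
print(syllables("ˈka.zɐ"))
print(dotted_share(sample))
```

A share near 1.0 for a file would suggest that its source dictionary shipped syllabified transcriptions, while a share near 0.0 would suggest it did not. Monosyllabic words never contain a dot, so the share is only a rough indicator.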

@dohliam
Member

dohliam commented Dec 3, 2022

Perhaps @VincentCCL might be able to shed some light on how syllabification was carried out for the Dutch data. For example, was it added manually, or through an automated process?

@VincentCCL

VincentCCL commented Dec 5, 2022

For the orthographic transcriptions, we have manually defined syllable borders -- for the phonetic transcription this is not 100% clear, but our documentation does not say that they are not manual -- so we assume they have been made manually, or transferred from the orthographic transcript. The source is part of the CELEX data, cf https://kdutch.ivdnt.org/wiki/Lexica#CELEX_and_WebCelex

@nam-ak
Author

nam-ak commented Dec 6, 2022

> For the orthographic transcriptions, we have manually defined syllable borders -- for the phonetic transcription this is not 100% clear, but our documentation does not say that they are not manual -- so we assume they have been made manually, or transferred from the orthographic transcript. The source is part of the CELEX data, cf https://kdutch.ivdnt.org/wiki/Lexica#CELEX_and_WebCelex

"WebCelex is a webbased interface to the CELEX lexical databases of English, Dutch and German."

https://catalog.ldc.upenn.edu/LDC96L14
On this CELEX2 catalog page I found the following information:

"For each language, this data set contains detailed information on:

・orthography (variations in spelling, hyphenation)
・phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress)
・morphology (derivational and compositional structure, inflectional paradigms)
・syntax (word class, word class-specific subcategorizations, argument structures)
・word frequency (summed word and lemma counts, based on recent and representative text corpora)"

So I assume that in this lexical database the English and German phonetic transcriptions are also separated by dots, i.e. that the IPA transcriptions are syllabified. Can you confirm? I don't know how to access this database.
Perhaps, if the license allows it, we could replace the existing English and German data with it and merge that?

@VincentCCL

I'll check whether I have access to the non-Dutch data -- I am only familiar with the Dutch CELEX.

@oamamao

oamamao commented Feb 2, 2023

Any updates on this? I'm interested in it.
