Separation of phonemes #33

nam-ak · 2022-10-02T09:39:59Z

In the pull request "Create pt-BR.txt" by @carmo-evan, the phonemes are separated by dots, but that is not the case for any other language in this project. I have no idea how he did it, or whether it's possible to do it automatically for the other languages.

dohliam · 2022-10-02T23:42:00Z

@nam-ak That's a good question, and I also don't know how @carmo-evan achieved this -- perhaps with a script? I know that there are existing syllabification parsers for various languages, but it is not a simple or error-free process, and the algorithm for each language would need to be quite different.

Overall, I think it would be good to have syllabification added for all languages as an eventual goal, so if you have any ideas on how to automate this in a reasonably accurate way, or if you want to submit a pull request to add syllable parsing for a particular language, that would be very welcome. 😄

nam-ak · 2022-12-03T10:37:17Z

Are those syllabification parsers just for written words and not for their IPA phonetic transcriptions?
I ask because I don't think it's possible to automate this in several languages, even in English, since unfortunately there is no straightforward syllabification method that is accepted by a majority of linguists. I think the syllabification in pt-BR.txt was already in the database of whatever dictionary @carmo-evan used to create the data for this open source project.

dohliam · 2022-12-03T18:10:35Z

Yes, I fully agree. Which of course brings us back to the reason that most of the languages in the project don't currently include syllabification...

Interestingly, the database for Dutch (which has not been merged yet) seems to include dot-separated phonemes as well. I assume that this is also a case where the source dictionary already incorporates these.
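
As a rough illustration of the dot-separated format discussed here, the sketch below splits a transcription on dots and estimates what share of a file's entries already carry syllable dots. It assumes a tab-separated `word<TAB>/transcription/` line format, and the function names `split_syllables` and `syllabified_share` are hypothetical, not part of the project.

```python
def split_syllables(transcription: str) -> list[str]:
    """Split a dot-separated IPA transcription into its syllables."""
    # Drop surrounding whitespace and any enclosing slashes or brackets.
    core = transcription.strip().strip("/[]")
    # Ignore empty pieces produced by stray leading or trailing dots.
    return [syllable for syllable in core.split(".") if syllable]


def syllabified_share(lines) -> float:
    """Return the fraction of entries whose transcription contains a dot."""
    total = 0
    dotted = 0
    for line in lines:
        line = line.rstrip("\n")
        # Assumed format: headword, a tab, then the transcription.
        if "\t" not in line:
            continue
        _word, ipa = line.split("\t", 1)
        total += 1
        if "." in ipa:
            dotted += 1
    return dotted / total if total else 0.0
```

A check like this only reports whether dots are present; it says nothing about whether the syllable boundaries are correct, which still depends on the source dictionary.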

dohliam · 2022-12-03T18:14:41Z

Perhaps @VincentCCL might be able to shed some light on how syllabification was carried out for the Dutch data. For example, was it added manually, or through an automated process?

VincentCCL · 2022-12-05T10:32:22Z

For the orthographic transcriptions, we have manually defined syllable borders -- for the phonetic transcription this is not 100% clear, but our documentation does not say that they are not manual -- so we assume they have been made manually, or transferred from the orthographic transcript. The source is part of the CELEX data, cf https://kdutch.ivdnt.org/wiki/Lexica#CELEX_and_WebCelex

nam-ak · 2022-12-06T23:00:28Z

> For the orthographic transcriptions, we have manually defined syllable borders -- for the phonetic transcription this is not 100% clear, but our documentation does not say that they are not manual -- so we assume they have been made manually, or transferred from the orthographic transcript. The source is part of the CELEX data, cf https://kdutch.ivdnt.org/wiki/Lexica#CELEX_and_WebCelex

"WebCelex is a webbased interface to the CELEX lexical databases of English, Dutch and German."

On the CELEX2 page at https://catalog.ldc.upenn.edu/LDC96L14 I found the following information:

"For each language, this data set contains detailed information on:

・orthography (variations in spelling, hyphenation)
・phonology (phonetic transcriptions, variations in pronunciation, syllable structure, primary stress)
・morphology (derivational and compositional structure, inflectional paradigms)
・syntax (word class, word class-specific subcategorizations, argument structures)
・word frequency (summed word and lemma counts, based on recent and representative text corpora)"

So I assume that in this lexical database the phonetic transcriptions for English and German are also separated by dots, i.e. that the IPA transcriptions are syllabified. Can you confirm? I don't know how to access this database.
Perhaps, if the license allows it, we could also replace the existing data for English and German with the CELEX data and merge it?

VincentCCL · 2022-12-07T07:45:38Z

I'll check whether I have access to the non-Dutch data -- I am only familiar with the Dutch CELEX.

oamamao · 2023-02-02T21:03:48Z

Any updates on this? I'm interested in it.
