Taiwanese-Hokkien_Mandarin_CM_Dataset

Dataset for the Findings of EMNLP 2022 paper "Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien".

Title

Exploring Methods for Building Dialects-Mandarin Code-Mixing Corpora: A Case Study in Taiwanese Hokkien

Abstract

In natural language processing (NLP), code-mixing (CM) is a challenging task, especially when the mixed languages include dialects. In Southeast Asian countries such as Singapore, Indonesia, and Malaysia, Hokkien-Mandarin is the most widespread code-mixed language pair among Chinese immigrants, and it is also common in Taiwan. However, dialects such as Hokkien often suffer from scarce resources and lack an official writing system, which limits the development of dialect CM research. In this paper, we propose a method to construct a Hokkien-Mandarin CM dataset to mitigate this limitation, overcome the morphological issue under the Sino-Tibetan language family, and offer an efficient Hokkien word segmentation method through a linguistics-based toolkit. Furthermore, we use our proposed dataset and employ transfer learning to train XLM (the cross-lingual language model) for translation tasks. To fit the code-mixing scenario, we adapt XLM slightly. We found that by using linguistic knowledge, rules, and language tags, the model produces good results on CM data translation while maintaining monolingual translation quality.

Full Paper

Paper.pdf

Note

The Articut Taiwanese tokenizer is now available; please refer to https://github.com/Droidtown/ArticutAPI_Taigi

Files

README.md
testcm.min-zh.min
testcm.min-zh.zh
traincm.min-zh.min
traincm.min-zh.zh
validcm.min-zh.min
validcm.min-zh.zh
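
The data files come in pairs that share a split prefix (`train`, `valid`, `test`) and differ only in the final suffix. The following is a minimal sketch of how one such pair might be loaded. It rests on assumptions this README does not confirm: that the two files of a pair are line-aligned (line i of one corresponds to line i of the other), that they are UTF-8 encoded with one sentence per line, and that `.min` holds the Hokkien-Mandarin code-mixed side while `.zh` holds the Mandarin side. The helper names `read_lines`, `read_parallel`, and `load_split` are hypothetical and not part of this repository.

```python
from pathlib import Path
from typing import List, Tuple, Union

PathLike = Union[str, Path]


def read_lines(path: PathLike) -> List[str]:
    """Read a UTF-8 text file and return its lines without the trailing newline."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def read_parallel(src_path: PathLike, tgt_path: PathLike) -> List[Tuple[str, str]]:
    """Pair line i of src_path with line i of tgt_path.

    Assumes the two files are line-aligned; raises ValueError if the
    line counts differ, since the pairing would then be unreliable.
    """
    src_lines = read_lines(src_path)
    tgt_lines = read_lines(tgt_path)
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} vs {len(tgt_lines)}"
        )
    return list(zip(src_lines, tgt_lines))


def load_split(data_dir: PathLike, split: str) -> List[Tuple[str, str]]:
    """Load one split ('train', 'valid', or 'test') from data_dir.

    File names follow the pattern visible in the file list,
    e.g. traincm.min-zh.min and traincm.min-zh.zh.
    """
    base = Path(data_dir)
    return read_parallel(
        base / f"{split}cm.min-zh.min",
        base / f"{split}cm.min-zh.zh",
    )


# Example usage (assumes the repository has been cloned into the current directory):
#   pairs = load_split(".", "train")
#   first_min, first_zh = pairs[0]
```

Checking the line counts before pairing is a cheap way to detect a truncated or mismatched download before the data is passed to a translation model.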
