
Table of contents

  1. Introduction
  2. Experimental Results
  3. Citation

ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining

ViHealthBERT is a strong baseline language model for Vietnamese in the healthcare domain.

  • We empirically investigate our model with different training strategies, achieving state-of-the-art (SOTA) performance on three downstream tasks: named entity recognition (on the COVID-19 and ViMQ datasets), acronym disambiguation, and summarization.

  • We introduce two Vietnamese healthcare-domain datasets: an acronym dataset (acrDrAid) and an FAQ summarization dataset. Our acrDrAid dataset is annotated with 135 sets of keywords.

Details can be found in our paper; the proceedings version is available from the ACL Anthology at https://aclanthology.org/2022.lrec-1.35.

Experimental Results

| Model         | Mac-F1* | Mic-F1* | Mac-F1** | Mic-F1** |
|---------------|---------|---------|----------|----------|
| PhoBERT-base  | 0.942   | 0.920   | 0.847    | 0.8224   |
| PhoBERT-large | 0.945   | 0.931   | 0.8524   | 0.8257   |
| ViHealthBERT  | 0.9677  | 0.9677  | 0.8601   | 0.8432   |

Overview of experimental results on the COVID-19 and ViMQ datasets (Mac-F1 = macro F1, Mic-F1 = micro F1). * refers to the COVID-19 dataset, ** refers to the ViMQ dataset.
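For readers unfamiliar with the two metrics: macro F1 averages per-class F1 scores with equal weight, while micro F1 computes a single F1 over the pooled counts across all classes. A minimal sketch with scikit-learn, using made-up labels purely for illustration:

```python
from sklearn.metrics import f1_score

# Toy gold and predicted entity labels, purely illustrative (not from the paper).
y_true = ["SYMPTOM", "DRUG", "SYMPTOM", "DISEASE", "DRUG"]
y_pred = ["SYMPTOM", "DRUG", "DISEASE", "DISEASE", "DRUG"]

print(f1_score(y_true, y_pred, average="macro"))  # unweighted mean of per-class F1
print(f1_score(y_true, y_pred, average="micro"))  # F1 over pooled counts across classes
```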

Hugging Face

| Model                                 | #params | Arch. | Tokenizer      |
|---------------------------------------|---------|-------|----------------|
| demdecuong/vihealthbert-base-word     | 135M    | base  | Word-level     |
| demdecuong/vihealthbert-base-syllable | 135M    | base  | Syllable-level |
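
A minimal usage sketch with the Hugging Face transformers library for extracting contextual features. The model IDs come from the table above; the example sentence is made up, and the assumption that the word-level checkpoint expects pre-segmented input (as with PhoBERT) is ours, which is why the syllable-level checkpoint is used here:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Syllable-level checkpoint: takes plain Vietnamese text, no word segmentation required.
tokenizer = AutoTokenizer.from_pretrained("demdecuong/vihealthbert-base-syllable")
model = AutoModel.from_pretrained("demdecuong/vihealthbert-base-syllable")

# "The patient has a high fever and a dry cough." (illustrative sentence)
sentence = "Bệnh nhân bị sốt cao và ho khan."

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Last-layer hidden states, shape (1, sequence_length, 768) for a base architecture.
features = outputs.last_hidden_state
```

For demdecuong/vihealthbert-base-word, we assume the input should first be word-segmented (e.g., with VnCoreNLP's RDRSegmenter), following the convention of word-level Vietnamese models such as PhoBERT.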

If you find our work helpful, please cite:

@InProceedings{minh-EtAl:2022:LREC,
  author    = {Minh, Nguyen and Tran, Vu Hoang and Hoang, Vu and Ta, Huy Duc and Bui, Trung Huu and Truong, Steven Quoc Hung},
  title     = {ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining},
  booktitle = {Proceedings of the Language Resources and Evaluation Conference},
  month     = {June},
  year      = {2022},
  address   = {Marseille, France},
  publisher = {European Language Resources Association},
  pages     = {328--337},
  url       = {https://aclanthology.org/2022.lrec-1.35}
}

Please cite our repo when it is used to help produce published results or is incorporated into other software.

Contact

[email protected]
