A code implementation for the EMNLP 2023 paper "Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond"
@inproceedings{liu-etal-2023-task,
title = "Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond",
author = "Liu, Siyang and
Deng, Naihao and
Sabour, Sahand and
Jia, Yilin and
Huang, Minlie and
Mihalcea, Rada",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.emnlp-main.944",
doi = "10.18653/v1/2023.emnlp-main.944",
pages = "15264--15281",
}
1. training a sentencepiece vocabulary using your downstream corpus
see example in ./vocab_files/target_unigram_model_for_psyqa/train_spm_model.py
If you want to build a specialized vocabulary for other datasets, please see: ./vocab_files/target_unigram_model_for_psyqa/vocabulary_build.py
2. save the base vocabulary into a folder
create a directory under ./vocab_files, and put all vocab files and config files under ./vocab_files/{dir}. See an example in ./vocab_files/merged_vocab_from_llama_base_for_psyqa
3. run Build_TAT_from_BaseTokenizer in create_task_adptive_tokenizer_from_base.py
this script will build a task-adaptive tokenizer and save the newly merged vocab file into the output
./data/PsyQa/loading_script.py will automatically prepare dataset we need. You usually just need this script.
You may need two environments to run: open and follow the following command in ./install.sh
see ./train.sh, and change some parameters accordingly
see ./generate.sh, and change some parameters accordingly