Task Adaptive Tokenization

Code implementation for the EMNLP 2023 paper "Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond".

@inproceedings{liu-etal-2023-task,
    title = "Task-Adaptive Tokenization: Enhancing Long-Form Text Generation Efficacy in Mental Health and Beyond",
    author = "Liu, Siyang  and
      Deng, Naihao  and
      Sabour, Sahand  and
      Jia, Yilin  and
      Huang, Minlie  and
      Mihalcea, Rada",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.emnlp-main.944",
    doi = "10.18653/v1/2023.emnlp-main.944",
    pages = "15264--15281",
}

Create a Task-adaptive Tokenizer

1. Train a SentencePiece vocabulary on your downstream corpus

See the example in ./vocab_files/target_unigram_model_for_psyqa/train_spm_model.py

To build a specialized vocabulary for another dataset, see ./vocab_files/target_unigram_model_for_psyqa/vocabulary_build.py

2. Save the base vocabulary into a folder

Create a directory under ./vocab_files and put all vocab files and config files of the base tokenizer in ./vocab_files/{dir}. See the example in ./vocab_files/merged_vocab_from_llama_base_for_psyqa

3. Run Build_TAT_from_BaseTokenizer in create_task_adptive_tokenizer_from_base.py

This script builds a task-adaptive tokenizer and saves the newly merged vocab file to the output location.
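Steps 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the code in train_spm_model.py: the helper names (spm_train_kwargs, train_task_vocab, stage_base_vocab), the default vocabulary size, and the example paths are all assumptions made for this sketch. The only library call used is sentencepiece's SentencePieceTrainer.train with its input, model_prefix, vocab_size, and model_type options. The unigram model type is assumed from the example directory name.

```python
from pathlib import Path
import shutil


def spm_train_kwargs(corpus_path, model_prefix, vocab_size=8000):
    """Build the arguments for SentencePieceTrainer.train (step 1)."""
    return {
        "input": str(corpus_path),          # plain-text corpus, one sentence per line
        "model_prefix": str(model_prefix),  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,           # assumed value; tune for your corpus
        "model_type": "unigram",            # assumed from the example folder name
    }


def train_task_vocab(corpus_path, model_prefix, vocab_size=8000):
    """Train a SentencePiece model on the downstream corpus."""
    # Imported lazily so the other helpers work without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(
        **spm_train_kwargs(corpus_path, model_prefix, vocab_size)
    )
    return Path(f"{model_prefix}.model")


def stage_base_vocab(files, vocab_root, name):
    """Copy base vocab and config files into <vocab_root>/<name> (step 2)."""
    target = Path(vocab_root) / name
    target.mkdir(parents=True, exist_ok=True)
    for f in files:
        shutil.copy2(f, target / Path(f).name)
    return target
```

For example, train_task_vocab("corpus.txt", "./vocab_files/my_task/spm") would write spm.model and spm.vocab into a hypothetical ./vocab_files/my_task directory (which must already exist), and stage_base_vocab(base_files, "./vocab_files", "my_base") would collect the base tokenizer files in one folder before running step 3.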

Reproducing the Project

download data

./data/PsyQa/loading_script.py automatically prepares the dataset we need. In most cases, this script is all you need.

setting up the environment

You may need two environments to run the project: open ./install.sh and follow the commands in it.

training models

See ./train.sh and adjust the parameters as needed.

generating

See ./generate.sh and adjust the parameters as needed.
