Metadata Pretraining Towards Instruction Finetuning

We pretrain unidirectional language models on 4B tokens of Ukrainian text from UberText 2.0. We enrich document text with weakly structured metadata, such as titles, tags, and publication years, enabling metadata-conditioned text generation and text-conditioned metadata prediction at the same time. We pretrain GPT-2 Small, Medium, and Large models on a single GPU, reporting training times, bits per character (BPC) on BrUK, and BERTScore and BLEURT scores for title generation on 1000 News from the Future documents.
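
The exact serialization is defined in the paper and training scripts; the snippet below is only a hypothetical sketch of the idea (field names and layout are illustrative assumptions, not the format uk4b actually uses). Metadata and body text share one token stream, so the model can condition on metadata to generate text, or on text to predict metadata.

```python
# Hypothetical serialization: the field markers, ordering, and any special
# tokens used by uk4b may differ. Metadata and body text are concatenated
# into a single training example.
def serialize_document(title: str, tags: list[str], year: int, text: str) -> str:
    header = f"title: {title}\ntags: {', '.join(tags)}\nyear: {year}"
    return f"{header}\n\n{text}"

print(serialize_document("Example title", ["news", "technology"], 2023, "Document body ..."))
```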

Install haloop to access the model: https://pypi.org/project/haloop/

See the video on metadata pretraining (2m33s).

Model checkpoints are available at https://a.wilab.org.ua/gpt/. BLEURT/BERTScore evaluation on News from the Future is available at lang-uk/bleurt_eval.

Next, we venture into formatting POS and NER datasets as instructions and train low-rank attention adapters, performing these tasks as constrained text generation. See the instruction finetuning video (2m50s): https://www.youtube.com/watch?v=NDXJ9hXtf-o

POS and NER adapters can be trained using examples/Makefile.
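
As a purely hypothetical illustration of what formatting a NER example as an instruction can look like (the template below is an assumption, not the prompt format used by the uk4b adapters):

```python
# Hypothetical instruction template; the actual prompt format used by the
# uk4b adapters may differ. Each labelled sentence becomes an
# (instruction, response) pair that the adapter learns to complete.
def ner_to_instruction(sentence: str, entities: list[tuple[str, str]]) -> tuple[str, str]:
    instruction = f"Find the named entities in the following sentence:\n{sentence}"
    response = "\n".join(f"{span}: {label}" for span, label in entities)
    return instruction, response

instruction, response = ner_to_instruction(
    "Taras Shevchenko was born in Moryntsi.",
    [("Taras Shevchenko", "PERSON"), ("Moryntsi", "LOCATION")],
)
print(instruction)
print(response)
```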

This repository fuses karpathy/nanoGPT and asivokon/unlp-2023-shared-task.

Authors:

  • Volodymyr Kyrylov @proger
  • Dmytro Chaplynskyi @dchaplinsky

Erratum

When reporting BPC results in the UNLP paper, we made a mistake when converting to the log-2 base. The true measurements are larger by a factor of ~2.08. The corrected measurements are reported in commit 1c5dc381, and the updated table is available in the preprint: https://wilab.org.ua/uk4b.pdf
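
For reference, a minimal sketch of the textbook bits-per-character computation, assuming cross-entropy losses measured in nats; the evaluation code in this repository may differ in details:

```python
import math

# Textbook bits per character: total cross-entropy in nats divided by the
# number of characters times ln 2. For example, 2.0 nats per character
# corresponds to about 2.89 bits per character.
def bits_per_character(total_nll_nats: float, num_characters: int) -> float:
    return total_nll_nats / (num_characters * math.log(2))
```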