ByT5 encoder #73

Open
kylebgorman opened this issue Jun 27, 2023 · 0 comments
Labels
enhancement New feature or request

kylebgorman commented Jun 27, 2023

As a postlude to #72, I propose we make it possible to use ByT5 as the source (e.g., --source_encoder_arch byt5_base) and/or feature encoder. ByT5 is a byte-based pretrained transformer; in this mode we would be fine-tuning it.

This should become much easier once #72 is complete: we'd just implement a new encoder, ByT5, in yoyodyne/models/modules/byt5.py. In the constructor, you'd use the transformers.T5EncoderModel.from_pretrained class method to instantiate an encoder; ByT5 is released in five sizes (small, base, large, xl, xxl), and we could just add the first four.
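A minimal sketch of what the loading step could look like, not a settled design. The arch names (`byt5_small`, etc.) and the helper names `checkpoint_for` and `load_byt5_encoder` are hypothetical, modeled on the `byt5_base` example above; the checkpoint names are the ones published on the Hugging Face Hub. How this plugs into the encoder interface from #72 is left open.

```python
# Hypothetical mapping from yoyodyne-style arch names to Hugging Face Hub
# checkpoints. The arch names are assumptions, not existing flags.
BYT5_CHECKPOINTS = {
    "byt5_small": "google/byt5-small",
    "byt5_base": "google/byt5-base",
    "byt5_large": "google/byt5-large",
    "byt5_xl": "google/byt5-xl",
}


def checkpoint_for(arch: str) -> str:
    """Returns the Hub checkpoint name for a (hypothetical) arch name."""
    try:
        return BYT5_CHECKPOINTS[arch]
    except KeyError:
        raise ValueError(f"Unknown ByT5 encoder arch: {arch!r}") from None


def load_byt5_encoder(arch: str):
    """Loads the pretrained encoder stack (no decoder) for fine-tuning."""
    # Imported lazily so this module can be imported without transformers
    # installed; the weights are downloaded on first use.
    from transformers import T5EncoderModel

    return T5EncoderModel.from_pretrained(checkpoint_for(arch))
```

Calling the returned model on a batch of input IDs should give an output whose `last_hidden_state` has shape (batch, sequence length, hidden size), which is what a downstream decoder would attend over.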

I don't think I'd go about adding access to just any HuggingFace encoder though, as their tokenizers will be incompatible. If we think there are going to be a lot more of these, we could add some lightweight (i.e., built-in, not plug-in) registration mechanism that gives you one place to declare that HuggingFace encoder X is compatible with this library.

The tricky bit is: how does the model's tokenizer interact with our dataset config tokenization? Maybe we can just bypass theirs and add byte as a special-case separator option.
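To illustrate the bypass idea, here is a sketch of byte-level encoding done on our side. It assumes ByT5's vocabulary convention (IDs 0, 1 and 2 reserved for padding, end-of-sequence and unknown, with each UTF-8 byte mapped to its value plus 3); that assumption should be checked against `transformers.ByT5Tokenizer` before relying on it. The function names are hypothetical.

```python
from typing import Iterable, List

# Assumed ByT5 special-token IDs; verify against transformers.ByT5Tokenizer.
PAD_ID = 0
EOS_ID = 1
UNK_ID = 2
BYTE_OFFSET = 3  # Byte value b is assumed to map to ID b + 3.


def encode_bytes(text: str, add_eos: bool = True) -> List[int]:
    """Maps a string to byte-level IDs, one ID per UTF-8 byte."""
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    if add_eos:
        ids.append(EOS_ID)
    return ids


def decode_bytes(ids: Iterable[int]) -> str:
    """Inverts encode_bytes, skipping special and out-of-range IDs."""
    raw = bytes(
        i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")
```

Note that non-ASCII characters become several IDs each, so source sequence lengths are counted in bytes rather than characters.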

Here are some early notes on how to do this.
