
Add functionality to suppress tokens during generation #978

Draft · wants to merge 2 commits into base: master
Conversation

@abheesht17 (Collaborator) commented Apr 10, 2023

Resolves #975

@abheesht17 abheesht17 marked this pull request as draft April 10, 2023 13:59
@abheesht17 abheesht17 marked this pull request as ready for review April 10, 2023 15:06
@chenmoneygithub (Contributor) commented:
Thanks, Abi! I am a little unsure whether a blocklist of token IDs would work as expected, mainly because of subword tokenization. Common English curse words are usually a single token, so it could work for those, but it does not work for phrases or for words that the subword tokenizer splits into multiple pieces. Blocklisting words is a complex task as I see it; we will need to discuss it a bit more. Thanks!
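To illustrate the concern above: a subword tokenizer often has no single ID for a given word, so suppressing one token ID cannot block the whole word. This is a minimal sketch with a toy greedy longest-match tokenizer and a hypothetical vocabulary (real BPE/WordPiece vocabularies differ, but the splitting behavior is similar in spirit):

```python
# Hypothetical subword vocabulary, for illustration only.
vocab = {"un": 0, "believ": 1, "able": 2}

def toy_tokenize(text, vocab):
    """Greedy longest-match tokenization: a crude stand-in for
    subword tokenizers such as BPE or WordPiece."""
    ids = []
    while text:
        for end in range(len(text), 0, -1):
            piece = text[:end]
            if piece in vocab:
                ids.append(vocab[piece])
                text = text[end:]
                break
        else:
            raise ValueError(f"no vocabulary piece matches {text!r}")
    return ids

# "unbelievable" has no single token ID; it splits into three pieces,
# so a blocklist containing any one ID would also suppress unrelated
# words that share that subword.
print(toy_tokenize("unbelievable", vocab))  # [0, 1, 2]
```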

@mattdangerw (Member) commented:
Yeah, I think the short status update is that we should leave this for discussion after the 0.5 release. Logit manipulation in general is a good topic for discussion in our generation flow.

@abheesht17 (Collaborator, Author) commented:
Gotcha! We will, however, have to do this for Whisper in any case: https://huggingface.co/openai/whisper-tiny/blob/main/generation_config.json#L126. I guess we can just modify the logits in the next() function for Whisper instead of changing the sampler API.

@mattdangerw (Member) commented:

> Gotcha! We will, however, have to do this for Whisper in any case: https://huggingface.co/openai/whisper-tiny/blob/main/generation_config.json#L126. I guess we can just modify the logits in the next() function for Whisper instead of changing the sampler API.

Good to know! I think we will leave Whisper (like all the seq2seq stuff) out of 0.5, but it does sound like we need to think through this soon.

Successfully merging this pull request may close these issues.

Consider a suppressed_tokens arg in generate()