Unknown PostProcessor type: Sequence #122

Open
jovisaib opened this issue Sep 5, 2024 · 3 comments

jovisaib commented Sep 5, 2024

It seems that the following post-processor block in tokenizer.json is not supported:

  "post_processor": {
    "type": "Sequence",
    "processors": [
      {
        "type": "ByteLevel",
        "add_prefix_space": true,
        "trim_offsets": false,
        "use_regex": true
      },
      {
        "type": "TemplateProcessing",
        "single": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          }
        ],
        "pair": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },

It throws the following error:

Unknown PostProcessor type: Sequence

The failure originates in the factory at Tokenizers/PostProcessor.swift:39, which has no case for the Sequence type:

struct PostProcessorFactory {
    static func fromConfig(config: Config?) -> PostProcessor? {
        guard let config = config else { return nil }
        guard let typeName = config.type?.stringValue else { return nil }
        let type = PostProcessorType(rawValue: typeName)
        switch type {
          case .TemplateProcessing: return TemplateProcessing(config: config)
          case .ByteLevel         : return ByteLevelPostProcessor(config: config)
          case .RobertaProcessing : return RobertaProcessing(config: config)
          default                 : fatalError("Unsupported PostProcessor type: \(typeName)")
        }
    }
}

The original implementation in Rust can be found here: https://github.com/huggingface/tokenizers/blob/25aee8b88c8de3c5a52e2f9cb6281d6df00ad516/tokenizers/src/processors/sequence.rs#L18-L36
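
For illustration, a minimal Swift sketch of the chaining idea behind the Rust `Sequence` processor: each processor in the list runs on the output of the previous one. Everything here is an assumption made for the example. `SketchPostProcessor`, `PassThroughStep`, `PrependTokenStep` and `SequenceStep` are hypothetical names, not the library's types, and the sketch works on plain token strings, whereas the Rust code operates on full encodings. Merging the pair into a single list after the first step is a simplification of this sketch.

```swift
// Hypothetical stand-in for the library's PostProcessor protocol.
protocol SketchPostProcessor {
    func postProcess(tokens: [String], tokensPair: [String]?) -> [String]
}

// Stand-in for a step that adds no tokens (roughly the role ByteLevel plays here).
struct PassThroughStep: SketchPostProcessor {
    func postProcess(tokens: [String], tokensPair: [String]?) -> [String] {
        tokens + (tokensPair ?? [])
    }
}

// Stand-in for a template step that prepends one special token.
struct PrependTokenStep: SketchPostProcessor {
    let token: String
    func postProcess(tokens: [String], tokensPair: [String]?) -> [String] {
        [token] + tokens + (tokensPair ?? [])
    }
}

// The Sequence idea: apply each step in order to the previous step's output.
struct SequenceStep: SketchPostProcessor {
    let steps: [SketchPostProcessor]
    func postProcess(tokens: [String], tokensPair: [String]?) -> [String] {
        var current = tokens
        var pair = tokensPair
        for step in steps {
            current = step.postProcess(tokens: current, tokensPair: pair)
            pair = nil // the pair has been merged into `current` by the first step
        }
        return current
    }
}

let llamaLike = SequenceStep(steps: [
    PassThroughStep(),
    PrependTokenStep(token: "<|begin_of_text|>"),
])
let output = llamaLike.postProcess(tokens: ["Hello", "Ġworld"], tokensPair: nil)
print(output) // ["<|begin_of_text|>", "Hello", "Ġworld"]
```

In the real factory, a `Sequence` case would presumably build each child through `PostProcessorFactory.fromConfig` from the `processors` array, but how that array is read from `Config` is not shown in this thread.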

This should be simple to fix, and I will look for a solution over the weekend, though maybe you have already found one.

You can assign it to me and I will have it ready as soon as possible.

jovisaib commented Sep 5, 2024

Checking Hugging Face transformers, I see that BertProcessing should also be added.
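
As a rough sketch of what BertProcessing does: it wraps a single sequence as `[CLS] A [SEP]` and a pair as `[CLS] A [SEP] B [SEP]`. The type name `BertProcessingSketch` and its initializer are hypothetical and not taken from the library. A real port would read the `cls` and `sep` tokens from the tokenizer config rather than take them as strings.

```swift
// Hypothetical sketch; names and signature are assumptions for illustration.
struct BertProcessingSketch {
    let cls: String // classification token, e.g. "[CLS]"
    let sep: String // separator token, e.g. "[SEP]"

    func postProcess(tokens: [String], tokensPair: [String]? = nil) -> [String] {
        // Single sequence: [CLS] A [SEP]
        var result = [cls] + tokens + [sep]
        // Pair: append B [SEP] after the first segment
        if let tokensPair = tokensPair {
            result += tokensPair + [sep]
        }
        return result
    }
}

let bert = BertProcessingSketch(cls: "[CLS]", sep: "[SEP]")
let single = bert.postProcess(tokens: ["hello"])
let pair = bert.postProcess(tokens: ["hello"], tokensPair: ["world"])
print(single) // ["[CLS]", "hello", "[SEP]"]
print(pair)   // ["[CLS]", "hello", "[SEP]", "world", "[SEP]"]
```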

DePasqualeOrg commented Sep 25, 2024

This is needed for the Llama 3.2 models that were released today. It looks like this isn't a huge thing, so I'll see if I can port the Rust implementation to Swift.

DePasqualeOrg commented

#129
