I am working on a university assignment that involves extracting Named Entities (NE) from Polish text using a BERT-based model. I have chosen the FastPDN model from Hugging Face clarin-pl/FastPDN and prepared it using the utils/convert_model.py script.
I created a TokenClassificationConfig based on one of the examples (the config and special_tokens_map files are downloaded from Hugging Face; vocab.json likewise, except that I extracted all its keys and saved them to a txt file, one per line):
let input = ["Nazywam się Jan Kowalski i mieszkam we Wrocławiu."];let config = TokenClassificationConfig::new(ModelType::Bert,ModelResource::Torch(Box::new(LocalResource::from(PathBuf::from(model_path)))),LocalResource::from(PathBuf::from(model_config_path)),LocalResource::from(PathBuf::from(vocab_path)),Some(LocalResource::from(PathBuf::from(merge_path))),//merges resource only relevant with ModelType::Robertafalse,//lowercasefalse,None,LabelAggregationOption::Mode,);
Initially, I encountered issues with tokenization when using the BertTokenizer. The output tokens did not match the expected format, leading to incorrect predictions when using the predict_full_entities method.
```rust
let tokenizer = BertTokenizer::from_file_with_special_token_mapping(
    vocab_path,
    false,
    false,
    special_tokens,
)?;
println!("{:?}", tokenizer.tokenize(input[0]));

let ner_model = NERModel::new_with_tokenizer(config, TokenizerOption::Bert(tokenizer))?;
let output = ner_model.predict_full_entities(&input);
for entity in output {
    println!("{entity:?}");
}
```
Upon switching to a tokenizer created from a tokenizer.json file (using TokenizerOption::from_hf_tokenizer_file), the tokenization improved significantly. The tokens now correctly represent the words and punctuation in the input text.
```rust
let tok_opt = TokenizerOption::from_hf_tokenizer_file(tokenizer_path, special_tokens).unwrap();
println!("{:?}", tok_opt.tokenize(input[0]));
let ner_model = NERModel::new_with_tokenizer(config, tok_opt)?;
```
But now I encountered a runtime panic during the prediction phase:
```
thread 'main' panicked at <path>/rust-bert/src/pipelines/token_classification.rs:1113:51:
slice index starts at 50 but ends at 49
```
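For completeness, here is the full minimal program, assembled from the snippets above; the file paths are placeholders for my local copies of the converted model files, not the exact values I use:

```rust
use std::path::PathBuf;

use rust_bert::pipelines::common::{ModelResource, ModelType, TokenizerOption};
use rust_bert::pipelines::ner::NERModel;
use rust_bert::pipelines::token_classification::{
    LabelAggregationOption, TokenClassificationConfig,
};
use rust_bert::resources::LocalResource;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder paths to the locally converted FastPDN files.
    let model_path = "fastpdn/rust_model.ot";
    let model_config_path = "fastpdn/config.json";
    let vocab_path = "fastpdn/vocab.txt";
    let merge_path = "fastpdn/merges.txt";
    let tokenizer_path = "fastpdn/tokenizer.json";
    let special_tokens = "fastpdn/special_tokens_map.json";

    let input = ["Nazywam się Jan Kowalski i mieszkam we Wrocławiu."];

    let config = TokenClassificationConfig::new(
        ModelType::Bert,
        ModelResource::Torch(Box::new(LocalResource::from(PathBuf::from(model_path)))),
        LocalResource::from(PathBuf::from(model_config_path)),
        LocalResource::from(PathBuf::from(vocab_path)),
        Some(LocalResource::from(PathBuf::from(merge_path))), // only relevant for ModelType::Roberta
        false, // lowercase
        false,
        None,
        LabelAggregationOption::Mode,
    );

    let tok_opt = TokenizerOption::from_hf_tokenizer_file(tokenizer_path, special_tokens).unwrap();
    println!("{:?}", tok_opt.tokenize(input[0]));

    let ner_model = NERModel::new_with_tokenizer(config, tok_opt)?;
    let output = ner_model.predict_full_entities(&input); // <- the panic happens here
    for entity in output {
        println!("{entity:?}");
    }
    Ok(())
}
```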
Environment:
Rust version: 1.77.2
PyTorch version: 2.2.0
tch version: v0.15.0
rust-bert: local copy of the repository (current version from the main branch)
I would be grateful if you could help.
EDIT: trying to use BertTokenizer was a complete mistake on my part; the model apparently uses a customized tokenizer that is slightly different from the base BERT one.
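For anyone running into the same confusion, one quick way to see how the shipped tokenizer differs from a stock BERT WordPiece setup is to peek at tokenizer.json. A minimal sketch, assuming serde_json and the file downloaded from the model repo:

```rust
use std::fs;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Print the parts of tokenizer.json that typically differ from a stock
    // BERT tokenizer: the model type, normalizer, and pre-tokenizer.
    let raw = fs::read_to_string("tokenizer.json")?;
    let tok: serde_json::Value = serde_json::from_str(&raw)?;
    println!("model type:    {}", tok["model"]["type"]);
    println!("normalizer:    {}", tok["normalizer"]);
    println!("pre_tokenizer: {}", tok["pre_tokenizer"]);
    Ok(())
}
```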