Possible Tokenization Issues #13
@arnavgupta16 Please also share the file you are using as input.

Sure, I've added it to the repo.
@arnavgupta16 Hi, I don't seem to be able to replicate this issue. Encoding and decoding text does add a CLS token to the front of the text - the tokenizer should probably have had functionality to automatically strip that, sorry - but here is a diff before and after encoding. No added spaces. FWIW, I'm 99.9% sure that your problem has to do with extracting text from the PDF, because when I extract text from the PDF it already includes some suspicious extra spaces. Closing because this isn't a SHIBA issue.

Edit: just for reference, the test I ran in a Jupyter notebook:

```python
import PyPDF2
from shiba import CodepointTokenizer

tokenizer = CodepointTokenizer()

def decode_tokens(encoded_chunks):
    decoded_text = ""
    for tokens in encoded_chunks:
        decoded_text += tokenizer.decode(tokens)
    return decoded_text

pdf_path = "Speech-of-Barack-Obama.pdf"

def extract_text_from_pdf(pdf_path):
    """Extract text from a given PDF file."""
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text

def split_text(text, max_length=1800):
    """Split text into chunks within the max length allowed by SHIBA."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]

text = extract_text_from_pdf(pdf_path)

# Split text into manageable chunks
text_chunks = split_text(text)

before = text_chunks[1]
a = tokenizer.encode_batch([before])
after = tokenizer.decode(a['input_ids'][0])
before, after
```
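For reference, stripping that leading CLS token manually is straightforward. A minimal sketch, assuming the CLS id is always the first element returned in `input_ids` and that nothing else is appended for a single short chunk (the sample string is a hypothetical stand-in for one of the PDF text chunks above):

```python
from shiba import CodepointTokenizer

tokenizer = CodepointTokenizer()

# Hypothetical stand-in for one of the PDF text chunks above.
chunk = "Yes we can."

encoded = tokenizer.encode_batch([chunk])

# Decoding all ids keeps the leading CLS codepoint at the front of the string.
with_cls = tokenizer.decode(encoded['input_ids'][0])

# Dropping the first id before decoding removes it (assumes CLS is always first).
without_cls = tokenizer.decode(encoded['input_ids'][0][1:])

print(with_cls == chunk)     # False: one extra character at the front
print(without_cls == chunk)  # True, provided nothing else was appended or padded
```

The extra spaces reported in the original issue, on the other hand, were already present in the text returned by `page.extract_text()`, so they come from PDF extraction rather than from the tokenizer.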
https://github.com/arnavgupta16/shiba-canine_tokenization/blob/main/main.py
Originally posted by @arnavgupta16 in #12 (comment)