
Possible Tokenization Issues #13

Closed
Mindful opened this issue Nov 5, 2024 · 3 comments

Mindful commented Nov 5, 2024

@Mindful you can check this repo; after encoding and decoding there's a random gap in each line, at the same point each time, in a pattern.
[Screenshot 2024-11-04 at 10:56:30 PM: decoded output showing the gaps]

https://github.com/arnavgupta16/shiba-canine_tokenization/blob/main/main.py

Originally posted by @arnavgupta16 in #12 (comment)

Mindful commented Nov 5, 2024

@arnavgupta16 Please also share the file you are using as input.

@arnavgupta16

Sure, I've added it to the repo.

Mindful commented Nov 9, 2024

@arnavgupta16 Hi, I don't seem to be able to replicate this issue. Encoding and decoding text does add a CLS token to the front of the text (the tokenizer probably should have had functionality to strip that automatically, sorry), but here is a diff before and after encoding: no added spaces.
The CLS token will always be at the beginning, so it's easy to remove and shouldn't be a problem.
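
For reference, stripping it would look something like this (a rough sketch, relying only on the CLS id being the first element of input_ids):

from shiba import CodepointTokenizer

tokenizer = CodepointTokenizer()
text = "some example input"
encoded = tokenizer.encode_batch([text])
ids = encoded['input_ids'][0]
# The CLS id is always prepended, so decoding everything after the first id
# gives back the original text.
clean = tokenizer.decode(ids[1:])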

FWIW, I'm 99.9% sure your problem has to do with extracting text from the PDF, because when I extract text from the PDF it already includes some suspicious extra spaces. Closing because this isn't a SHIBA issue.

[Image: diff of a text chunk before and after encoding, showing no added spaces]
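
If you want to see the extraction issue yourself, something like this minimal check (independent of SHIBA, using the same PyPDF2 calls as the test below) will show the stray spaces coming straight out of the PDF:

import PyPDF2

with open("Speech-of-Barack-Obama.pdf", "rb") as f:
    reader = PyPDF2.PdfReader(f)
    raw = reader.pages[0].extract_text()

# repr() makes the extra spaces visible before any tokenization happens.
print(repr(raw[:500]))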

Edit: just for reference, the test I ran in a Jupyter notebook:

import PyPDF2
from shiba import CodepointTokenizer

tokenizer = CodepointTokenizer()

def decode_tokens(encoded_chunks):
    decoded_text = ""
    for tokens in encoded_chunks:
        decoded_text += tokenizer.decode(tokens)
    return decoded_text

pdf_path = "Speech-of-Barack-Obama.pdf"

def extract_text_from_pdf(pdf_path):
    """Extract text from a given PDF file."""
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text


def split_text(text, max_length=1800):
    """Split text into chunks within the max length allowed by SHIBA."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]

text = extract_text_from_pdf(pdf_path)
# Split text into manageable chunks
text_chunks = split_text(text)

# Encode one chunk, then decode it to compare against the original text.
before = text_chunks[1]
a = tokenizer.encode_batch([before])
after = tokenizer.decode(a['input_ids'][0])

before, after
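
As a quick extra check (not in the original notebook, and assuming the prepended CLS id is the only difference the round trip introduces), this should come out True:

# True if encoding/decoding adds nothing beyond the leading CLS id.
tokenizer.decode(a['input_ids'][0][1:]) == before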

Mindful closed this as completed Nov 9, 2024