
Possible Tokenization Issues #13

Closed
Mindful opened this issue Nov 5, 2024 · 3 comments

Mindful commented Nov 5, 2024

@Mindful you can check this repo; after encoding and decoding there's a random gap in each line, at the same point each time, in a pattern.
[Screenshot 2024-11-04 at 10:56:30 PM: decoded output showing the gaps]

https://github.com/arnavgupta16/shiba-canine_tokenization/blob/main/main.py

Originally posted by @arnavgupta16 in #12 (comment)

Mindful commented Nov 5, 2024

@arnavgupta16 Please also share the file you are using as input.

@arnavgupta16

Sure, I've added it to the repo.

Mindful commented Nov 9, 2024

@arnavgupta16 Hi, I don't seem to be able to replicate this issue. Encoding and decoding text does add a CLS token to the front of the text (the tokenizer probably should have had functionality to strip that automatically, sorry), but here is a diff before and after encoding: no added spaces.
The CLS token will always be at the beginning, so it's easy to remove and shouldn't be a problem.
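
For reference, stripping it would look something like this (a rough sketch, relying only on the CLS id being the first element of input_ids):

from shiba import CodepointTokenizer

tokenizer = CodepointTokenizer()
text = "some example input"
encoded = tokenizer.encode_batch([text])
ids = encoded['input_ids'][0]
# The CLS id is always prepended, so decoding everything after the first id
# gives back the original text.
clean = tokenizer.decode(ids[1:])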

FWIW, I'm 99.9% sure your problem has to do with extracting text from the PDF, because when I extract text from the PDF it already includes some suspicious extra spaces. Closing because this isn't a SHIBA issue.

[Image: diff of a text chunk before and after encoding, showing no added spaces]
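
If you want to see the extraction issue yourself, something like this minimal check (independent of SHIBA, using the same PyPDF2 calls as the test below) will show the stray spaces coming straight out of the PDF:

import PyPDF2

with open("Speech-of-Barack-Obama.pdf", "rb") as f:
    reader = PyPDF2.PdfReader(f)
    raw = reader.pages[0].extract_text()

# repr() makes the extra spaces visible before any tokenization happens.
print(repr(raw[:500]))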

Edit: just for reference, the test I ran in a Jupyter notebook:

import PyPDF2
from shiba import CodepointTokenizer

tokenizer = CodepointTokenizer()

def decode_tokens(encoded_chunks):
    decoded_text = ""
    for tokens in encoded_chunks:
        decoded_text += tokenizer.decode(tokens)
    return decoded_text

pdf_path = "Speech-of-Barack-Obama.pdf"

def extract_text_from_pdf(pdf_path):
    """Extract text from a given PDF file."""
    text = ""
    with open(pdf_path, "rb") as file:
        reader = PyPDF2.PdfReader(file)
        for page in reader.pages:
            text += page.extract_text()
    return text


def split_text(text, max_length=1800):
    """Split text into chunks within the max length allowed by SHIBA."""
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]

text = extract_text_from_pdf(pdf_path)
# Split text into manageable chunks
text_chunks = split_text(text)

# Encode one chunk, then decode it to compare against the original text.
before = text_chunks[1]
a = tokenizer.encode_batch([before])
after = tokenizer.decode(a['input_ids'][0])

before, after
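
As a quick extra check (not in the original notebook, and assuming the prepended CLS id is the only difference the round trip introduces), this should come out True:

# True if encoding/decoding adds nothing beyond the leading CLS id.
tokenizer.decode(a['input_ids'][0][1:]) == before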

Mindful closed this as completed Nov 9, 2024