Does ScatterText somehow combine tokens? #132

Open
mikkokotila opened this issue Jul 31, 2023 · 5 comments

mikkokotila commented Jul 31, 2023

I have many cases where two tokens such as བྱང་ཆུབ་ and སེམས་དཔ become a single thing in the scatterplot. Is this something that ScatterText is doing? The tokenizer I'm using does not do that.

@JasonKessler (Owner)

Could you please provide a runnable example that shows this? It's possible the tokenizer is merging those two words into a single token, or that Scattertext ended up aligning them in the labeling phase.

@JasonKessler (Owner)

But please submit a reproducible example of where this occurs. Otherwise, there's nothing I can do to look into this.

@mikkokotila (Author)

I just wanted to first check whether you are aware of any way something like that could happen.

By way of background: I'm very familiar with the tokenizer I'm using (botok), and I'm 100% sure it is not the cause of this, as I've stared at material tokenized by it for thousands of hours.

Here is the data I'm using:

tibetan_strings.txt

Here is the one-liner to read it, to ensure consistency with the way I have it:

open('tibetan_strings.txt', 'r').readlines()[0].split(' ')

Here is the wrapper for the tokenizer I'm using, inspired by the chinese_nlp example:

import re
from botok import WordTokenizer
tokenizer = WordTokenizer()


class Tok(object):
    
    def __init__(self, pos, lem, orth, low, ent_type, tag):
        self.pos_ = pos
        self.lemma_ = lem
        self.lower_ = low
        self.orth_ = orth
        self.ent_type_ = ent_type
        self.tag_ = tag
    
    def __repr__(self): return self.orth_
    
    def __str__(self): return self.orth_


class Doc(object):
    
    def __init__(self, sents, raw):
        self.sents = sents
        self.string = raw
        self.text = raw
    
    def __str__(self):
        return ' '.join(str(sent) for sent in self.sents)
    
    def __repr__(self):
        return self.__str__()
    
    def __iter__(self):
        for sent in self.sents:
            for tok in sent:
                yield tok


class Sentence(object):
    
    def __init__(self, toks, raw):
        self.toks = toks
        self.raw = raw
    
    def __iter__(self):
        for tok in self.toks:
            yield tok
    
    def __str__(self):
        return ' '.join([str(tok) for tok in self.toks])
    
    def __repr__(self):
        return self.raw

import bokit
punct_list = bokit.utils.create_punctuation_list()

punct_str = "|".join(map(re.escape, punct_list))  # Escape special characters
punct_re = re.compile(r'^({})+$'.format(punct_str))  # Create the regex pattern


def tibetan_nlp(doc, entity_type=None, tag_type=None):
    
    toks = []
    
    for tok_obj in tokenizer.tokenize(doc):
        tok = tok_obj['text_unaffixed']
        pos = tok_obj['pos']

        if tok.strip() == '':
            pos = 'SPACE'
        elif punct_re.match(tok):
            pos = 'PUNCT'
        
        token = Tok(pos,
                    tok_obj['lemma'],
                    tok,          # orth_: the original surface form
                    tok.lower(),  # lower_: lowercased form (a no-op for Tibetan script)
                    ent_type='' if entity_type is None else entity_type.get(tok, ''),
                    tag='' if tag_type is None else tag_type.get(tok, ''))

        toks.append(token)
    
    return Doc([Sentence(toks, doc)], doc)
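
For completeness, here is a minimal sketch of how a wrapper like this is typically plugged into Scattertext, following the chinese_nlp pattern. The DataFrame construction and the alternating category labels below are placeholders for illustration; only tibetan_nlp and the file-reading one-liner come from the code above. Listing corpus.get_terms() shows exactly which strings Scattertext will plot, independently of what botok produced.

import pandas as pd
import scattertext as st

# Read the documents the same way as above; the category labels are
# placeholders purely for illustration.
docs = open('tibetan_strings.txt', 'r').readlines()[0].split(' ')
df = pd.DataFrame({
    'text': docs,
    'category': ['a' if i % 2 == 0 else 'b' for i in range(len(docs))],
})

# Pass the custom nlp callable so Scattertext parses with botok instead of spaCy.
corpus = st.CorpusFromPandas(df,
                             category_col='category',
                             text_col='text',
                             nlp=tibetan_nlp).build()

# The terms Scattertext will actually plot; inspecting this list shows
# exactly which strings end up as labels in the scatterplot.
print(corpus.get_terms()[:50])

Comparing this list against the raw tokenizer output narrows down where a combined form is first introduced.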

@JasonKessler (Owner)

I realize you have a lot of experience with this tokenizer, but have you programmatically checked the tokenizer's output on this file to verify that the token in question isn't there?

@mikkokotila (Author)

> I realize you have a lot of experience with this tokenizer, but have you programmatically checked the tokenizer's output on this file to verify that the token in question isn't there?

Yes
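
For reference, such a programmatic check could look roughly like the following. The suspect string is only an example of a merged form as it might appear in the plot, and the token access mirrors the wrapper posted above.

from botok import WordTokenizer

tokenizer = WordTokenizer()

# Example of a merged form observed in the scatterplot; substitute the actual
# term (with or without a separating space, depending on how it is displayed).
suspect = 'བྱང་ཆུབ་སེམས་དཔ'

docs = open('tibetan_strings.txt', 'r').readlines()[0].split(' ')

# Collect every surface form botok emits for this file, using the same
# field as the wrapper above.
surface_forms = set()
for doc in docs:
    for tok_obj in tokenizer.tokenize(doc):
        surface_forms.add(tok_obj['text_unaffixed'])

# If the merged form is absent here, it was not produced by the tokenizer
# and must have been introduced downstream.
print(suspect in surface_forms)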
