Does ScatterText somehow combine tokens? #132

Open
mikkokotila opened this issue Jul 31, 2023 · 5 comments

mikkokotila commented Jul 31, 2023

I have many cases where two tokens such as བྱང་ཆུབ་ and སེམས་དཔ become a single thing in the scatterplot. Is this something that ScatterText is doing? The tokenizer I'm using does not do that.

@JasonKessler (Owner)

Could you please provide a runnable example that shows this? It's possible the tokenizer is merging those two words into a single token, or that Scattertext ended up aligning them in the labeling phase.

@JasonKessler (Owner)

But please submit a reproducible example of where this occurs. Otherwise, there's nothing I can do to look into this.

@mikkokotila (Author)

I just wanted to first check whether you are aware of any way something like that could happen.

By way of background: I'm very familiar with the tokenizer I'm using (botok), and I'm 100% sure it is not the cause of this, as I've stared at material tokenized by it for thousands of hours.

Here is the data I'm using:

tibetan_strings.txt

Here is the one-liner to read it, to ensure consistency with the way I have it:

open('tibetan_strings.txt', 'r').readlines()[0].split(' ')

Here is the wrapper for the tokenizer I'm using, inspired by the chinese_nlp example:

import re
from botok import WordTokenizer
tokenizer = WordTokenizer()


class Tok(object):
    
    def __init__(self, pos, lem, orth, low, ent_type, tag):
        self.pos_ = pos
        self.lemma_ = lem
        self.lower_ = low
        self.orth_ = orth
        self.ent_type_ = ent_type
        self.tag_ = tag
    
    def __repr__(self): return self.orth_
    
    def __str__(self): return self.orth_


class Doc(object):
    
    def __init__(self, sents, raw):
        self.sents = sents
        self.string = raw
        self.text = raw
    
    def __str__(self):
        return ' '.join(str(sent) for sent in self.sents)
    
    def __repr__(self):
        return self.__str__()
    
    def __iter__(self):
        for sent in self.sents:
            for tok in sent:
                yield tok


class Sentence(object):
    
    def __init__(self, toks, raw):
        self.toks = toks
        self.raw = raw
    
    def __iter__(self):
        for tok in self.toks:
            yield tok
    
    def __str__(self):
        return ' '.join([str(tok) for tok in self.toks])
    
    def __repr__(self):
        return self.raw

import bokit
punct_list = bokit.utils.create_punctuation_list()

punct_str = "|".join(map(re.escape, punct_list))  # Escape special characters
punct_re = re.compile(r'^({})+$'.format(punct_str))  # Create the regex pattern


def tibetan_nlp(doc, entity_type=None, tag_type=None):
    
    toks = []
    
    for tok_obj in tokenizer.tokenize(doc):
        tok = tok_obj['text_unaffixed']
        pos = tok_obj['pos']

        if tok.strip() == '':
            pos = 'SPACE'
        elif punct_re.match(tok):
            pos = 'PUNCT'
        
        token = Tok(pos,
                    tok_obj['lemma'],
                    tok,          # orth_: the original surface form
                    tok.lower(),  # lower_: lowercased form (a no-op for Tibetan script)
                    ent_type='' if entity_type is None else entity_type.get(tok, ''),
                    tag='' if tag_type is None else tag_type.get(tok, ''))

        toks.append(token)
    
    return Doc([Sentence(toks, doc)], doc)
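
For completeness, here is a minimal sketch of how a wrapper like this is typically plugged into Scattertext, following the chinese_nlp pattern. The DataFrame construction and the alternating category labels below are placeholders for illustration; only tibetan_nlp and the file-reading one-liner come from the code above. Listing corpus.get_terms() shows exactly which strings Scattertext will plot, independently of what botok produced.

import pandas as pd
import scattertext as st

# Read the documents the same way as above; the category labels are
# placeholders purely for illustration.
docs = open('tibetan_strings.txt', 'r').readlines()[0].split(' ')
df = pd.DataFrame({
    'text': docs,
    'category': ['a' if i % 2 == 0 else 'b' for i in range(len(docs))],
})

# Pass the custom nlp callable so Scattertext parses with botok instead of spaCy.
corpus = st.CorpusFromPandas(df,
                             category_col='category',
                             text_col='text',
                             nlp=tibetan_nlp).build()

# The terms Scattertext will actually plot; inspecting this list shows
# exactly which strings end up as labels in the scatterplot.
print(corpus.get_terms()[:50])

Comparing this list against the raw tokenizer output narrows down where a combined form is first introduced.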

@JasonKessler (Owner)

I realize you have a lot of experience with this tokenizer, but have you programmatically checked the tokenizer's output on this file to verify that the token in question isn't there?

@mikkokotila (Author)

> I realize you have a lot of experience with this tokenizer, but have you programmatically checked the tokenizer's output on this file to verify that the token in question isn't there?

Yes
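
For reference, such a programmatic check could look roughly like the following. The suspect string is only an example of a merged form as it might appear in the plot, and the token access mirrors the wrapper posted above.

from botok import WordTokenizer

tokenizer = WordTokenizer()

# Example of a merged form observed in the scatterplot; substitute the actual
# term (with or without a separating space, depending on how it is displayed).
suspect = 'བྱང་ཆུབ་སེམས་དཔ'

docs = open('tibetan_strings.txt', 'r').readlines()[0].split(' ')

# Collect every surface form botok emits for this file, using the same
# field as the wrapper above.
surface_forms = set()
for doc in docs:
    for tok_obj in tokenizer.tokenize(doc):
        surface_forms.add(tok_obj['text_unaffixed'])

# If the merged form is absent here, it was not produced by the tokenizer
# and must have been introduced downstream.
print(suspect in surface_forms)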
