Python stops working when setting Chinese words #47

Open
eromoe opened this issue Dec 18, 2017 · 1 comment

Comments


eromoe commented Dec 18, 2017

Hello,

Python stops working (the interpreter crashes) on both Python 2.7 and 3.6.
[image]

My code looks like this:

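    import datrie

    # htmls_2_text, text_2_sentences and count_wrodfreq_range are helpers not shown here
    text = htmls_2_text(input_dir)
    sentences = text_2_sentences(text)
    wf = count_wrodfreq_range(sentences)  # Counter-like: word -> frequency

    print('Init trie')

    # The trie alphabet is every unique character that occurs in the text
    trie = datrie.Trie(''.join(set(text)))

    print('Update counter dict:', len(wf))
    for k, v in wf.most_common(3000):
        try:
            trie[k] = v
        except Exception as e:
            print(k, v)

    # trie.update(wf.most_common(3000))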

The failure cannot be caught as an exception. It did not happen when adding 1000 words, but it sometimes (not always) happens when adding 2000 words.
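Since the failure cannot be caught with `try`/`except`, one way to observe it is to run the trie-building code in a child interpreter and look at its exit status. This is only a minimal sketch: `run_isolated` and `SCRIPT` are hypothetical names, not part of datrie, and the sketch assumes the code above has been saved as a string.

```python
import subprocess
import sys


def run_isolated(code, timeout=60):
    """Run `code` in a fresh Python interpreter and return its exit status."""
    result = subprocess.run([sys.executable, '-c', code], timeout=timeout)
    return result.returncode


# Hypothetical usage: SCRIPT would hold the trie-building code above as a string.
# status = run_isolated(SCRIPT)
# print('child exit status:', status)  # non-zero means the child did not exit cleanly
```

A non-zero status from the child means it did not exit cleanly, and the parent process survives to report which run failed.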

test data

text :
test.txt

wf (please load it with pickle.load):
freq.txt


eromoe commented Dec 18, 2017

I think the problem is caused by encoding:

    import datrie
    text = htmls_2_text(input_dir)
    unique_text = ''.join(set(text))
    trie = datrie.Trie(unique_text)
    trie['今天天气真好'] = 111
    trie['今天好'] = 222
    trie['今天'] = 444

    print(trie.items())

    # Output:
    # [('今义', 444), ('今义义傲兢于', 111), ('今义于', 222)]

The words that come back are wrong.
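To check this automatically instead of reading the printed list, the stored items can be compared with what was inserted. The helper below is a sketch that only relies on the object having an `items()` method, as `trie.items()` does above; `find_mangled_keys` is a hypothetical name.

```python
def find_mangled_keys(mapping, expected):
    """Return the (key, value) pairs of `expected` that `mapping` does not return unchanged."""
    stored = dict(mapping.items())
    return [(k, v) for k, v in expected.items() if stored.get(k) != v]


# Hypothetical usage with the data from this issue:
# expected = {'今天天气真好': 111, '今天好': 222, '今天': 444}
# trie = datrie.Trie(unique_text)
# for k, v in expected.items():
#     trie[k] = v
# print(find_mangled_keys(trie, expected))  # an empty list means every key round-tripped
```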


I tried to locate the error:

unique chars: https://pastebin.com/n2i280i8

Error

u = ''.join(set('今天天气真好' + unique_text[:400]))

gives [('今天', 444), ('今天天气I好', 111), ('今天好', 222)]

Correct

u = ''.join(set('今天天气真好' + unique_text[:396]))
u = ''.join(set('今天天气真好' + unique_text[:396]+unique_text[398:400]))
u = ''.join(set('今天天气真好' + unique_text[396:400]))

These all give correct results.
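The manual slicing above could be automated by growing the alphabet one character at a time and stopping at the first prefix length that fails. This is a sketch under assumptions: `first_bad_prefix_length` and the `round_trip_ok` callback are hypothetical names, and the callback is expected to build a trie from the given alphabet, insert the test words and report whether they come back unchanged. The alphabet is sorted so that its order is the same on every run, since the iteration order of a `set` of strings can differ between interpreter runs.

```python
def first_bad_prefix_length(chars, required, round_trip_ok):
    """Return the smallest n for which the alphabet set(required + chars[:n]) fails, else None."""
    for n in range(len(chars) + 1):
        # Sort for a reproducible character order across runs.
        alphabet = ''.join(sorted(set(required + chars[:n])))
        if not round_trip_ok(alphabet):
            return n
    return None


# Hypothetical usage with the data from this issue:
# def round_trip_ok(alphabet):
#     trie = datrie.Trie(alphabet)
#     trie['今天天气真好'] = 111
#     return dict(trie.items()) == {'今天天气真好': 111}
#
# print(first_bad_prefix_length(unique_text, '今天天气真好', round_trip_ok))
```

A linear scan is used rather than a binary search because nothing here shows that failures are monotonic in the prefix length.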
