
UTF #2

Open
ruvsv opened this issue Jul 21, 2019 · 3 comments

@ruvsv

ruvsv commented Jul 21, 2019

Unfortunately, I have a problem with non-ASCII character sets. Can you add UTF-8 support?

@antihutka
Owner

I fixed vocabulary loading and implemented string decoding for >1 byte tokens. Unicode mode now works fine, and every reasonable tokenization scheme should work too.
Please let me know if you hit any other problems, or close this issue if everything is OK.
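
For readers following along, here is a minimal sketch of what decoding tokens longer than one byte can look like. It is not the code from the repository: the `decode_tokens` helper and the `idx_to_token` mapping are hypothetical names, and the sketch assumes each vocabulary entry is stored as a `bytes` value. The bytes are joined first and decoded once, so a character split across several byte-level tokens is still reassembled correctly.

```python
def decode_tokens(indices, idx_to_token):
    """Join the byte string of every token, then decode the result once as UTF-8."""
    raw = b"".join(idx_to_token[i] for i in indices)
    # errors="replace" keeps sampling output printable even if the model
    # emits an incomplete or invalid byte sequence.
    return raw.decode("utf-8", errors="replace")


# Hypothetical vocabulary with whole-character tokens (1 to 3 bytes each).
char_vocab = {1: b"a", 2: "я".encode("utf-8"), 3: "€".encode("utf-8")}
char_text = decode_tokens([1, 2, 3], char_vocab)

# Hypothetical byte-level vocabulary: "я" is split across two tokens.
byte_vocab = {1: b"\xd1", 2: b"\x8f"}
byte_text = decode_tokens([1, 2], byte_vocab)
```

Decoding each token on its own would fail in the byte-level case, because neither half of the character is valid UTF-8 by itself.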

@ruvsv
Author

ruvsv commented Aug 28, 2019

Sorry for the late reply.
It's UTF-8 text. With torch-rnn everything works fine.

python3 train.py --input-h5 /root/data/e_kolt.h5 --input-json /root/data/e_kolt.json --device cuda
2019-08-28 03:37:52,564 - train - INFO - Creating model
longest token 4
0-Embedding
1-GRIDGRU
2-GRIDGRU
3-Linear
[Embedding(210, 128), GRIDGRU(), GRIDGRU(), Linear(in_features=128, out_features=210, bias=True)]
2019-08-28 03:37:52,572 - train - INFO - Created model with 448722 parameters
2019-08-28 03:37:52,572 - train - INFO - Loading data
2019-08-28 03:37:52,574 - dataloader - INFO - Loaded 76009 items from test
2019-08-28 03:37:52,574 - dataloader - INFO - Loaded 76009 items from val
2019-08-28 03:37:52,575 - dataloader - INFO - Loaded 608075 items from train
2019-08-28 03:37:52,577 - dataloader - INFO - No zeroes found in data, assuming one-based indexes
Traceback (most recent call last):
  File "train.py", line 70, in <module>
    double_seq_on = [int(x) for x in args.double_seq_on.split(',')]
  File "train.py", line 70, in <listcomp>
    double_seq_on = [int(x) for x in args.double_seq_on.split(',')]
ValueError: invalid literal for int() with base 10: ''

@antihutka
Owner

This was caused by incorrect parsing of the default --double-seq-on value that I introduced in the last commit. It should be fixed now.
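
To illustrate the failure mode in the traceback: splitting an empty string on a comma yields a list containing one empty string, and `int('')` then raises the `ValueError`. The sketch below shows one way to parse such a comma-separated option safely. It is an illustration only, not the actual fix committed to the repository, and `parse_int_list` is a hypothetical helper name.

```python
def parse_int_list(value):
    """Parse a string like '1,2,3' into [1, 2, 3]; an empty or blank string gives []."""
    # Skip empty pieces so that '' (the default) and trailing commas are tolerated.
    return [int(part) for part in value.split(',') if part.strip()]


# ''.split(',') returns [''], not [], which is how int('') was reached.
empty_split = ''.split(',')

default_layers = parse_int_list('')      # empty default, no layers selected
chosen_layers = parse_int_list('0, 2')   # int() ignores surrounding whitespace
```

Filtering out blank pieces keeps an empty default valid while still raising an error for input that is genuinely malformed, such as `'a,b'`.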
