UTF #2

ruvsv · 2019-07-21T12:56:50Z

Unfortunately, I have a problem with non-ASCII character sets. Can you add UTF-8 support?

antihutka · 2019-07-22T07:40:28Z

I fixed vocabulary loading and implemented string decoding for >1 byte tokens. Unicode mode now works fine, and every reasonable tokenization scheme should work too.
Please let me know if you hit any other problems, or close this issue if everything is OK.
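For readers wondering what decoding tokens longer than one byte involves, here is a minimal sketch. It is not the code from the repository: the `decode_tokens` helper and the byte-valued `idx_to_token` mapping are assumptions made for illustration. The idea is to join the raw bytes of all tokens first and decode once, so a character split across two tokens still decodes correctly.

```python
def decode_tokens(indices, idx_to_token):
    """Join the raw bytes of each token, then decode the result as UTF-8.

    idx_to_token is a hypothetical mapping from index to bytes; a token may
    hold a whole multi-byte character or only part of one.
    """
    raw = b"".join(idx_to_token[i] for i in indices)
    # errors="replace" keeps sampling from crashing on an incomplete sequence
    return raw.decode("utf-8", errors="replace")


# Hypothetical vocabulary: token 3 and token 4 together form the bytes of "€"
idx_to_token = {
    1: b"a",
    2: "ж".encode("utf-8"),
    3: b"\xe2\x82",
    4: b"\xac",
}

text = decode_tokens([1, 2, 3, 4], idx_to_token)
print(text)
```

Decoding token by token instead would fail on tokens 3 and 4, since neither is valid UTF-8 on its own.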

ruvsv · 2019-08-28T03:43:22Z

Sorry for the late reply.
It's UTF-8 text. With torch-rnn everything works fine.

python3 train.py --input-h5 /root/data/e_kolt.h5 --input-json /root/data/e_kolt.json --device cuda
2019-08-28 03:37:52,564 - train - INFO - Creating model
longest token 4
0-Embedding
1-GRIDGRU
2-GRIDGRU
3-Linear
[Embedding(210, 128), GRIDGRU(), GRIDGRU(), Linear(in_features=128, out_features=210, bias=True)]
2019-08-28 03:37:52,572 - train - INFO - Created model with 448722 parameters
2019-08-28 03:37:52,572 - train - INFO - Loading data
2019-08-28 03:37:52,574 - dataloader - INFO - Loaded 76009 items from test
2019-08-28 03:37:52,574 - dataloader - INFO - Loaded 76009 items from val
2019-08-28 03:37:52,575 - dataloader - INFO - Loaded 608075 items from train
2019-08-28 03:37:52,577 - dataloader - INFO - No zeroes found in data, assuming one-based indexes
Traceback (most recent call last):
  File "train.py", line 70, in <module>
    double_seq_on = [int(x) for x in args.double_seq_on.split(',')]
  File "train.py", line 70, in <listcomp>
    double_seq_on = [int(x) for x in args.double_seq_on.split(',')]
ValueError: invalid literal for int() with base 10: ''

antihutka · 2019-09-02T11:09:39Z

This was caused by incorrect parsing of the default --double-seq-on value that I introduced in the last commit. It should be fixed now.
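For anyone who hits the same traceback elsewhere: `''.split(',')` returns `['']`, not an empty list, so `int('')` raises the `ValueError` shown above. A minimal sketch of parsing that tolerates an empty default follows. The helper name `parse_int_list` is hypothetical and this is not necessarily how train.py was fixed.

```python
def parse_int_list(value):
    """Parse a comma-separated string such as "1,3" into a list of ints.

    Empty pieces are skipped, so an empty string yields an empty list
    instead of raising ValueError.
    """
    return [int(part) for part in value.split(",") if part.strip()]


print(parse_int_list(""))
print(parse_int_list("1,3"))
```

Filtering out blank pieces also makes a trailing comma such as `"1,3,"` harmless.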
