fix bug during w2v training with utf8 characters #76
@pakhy2380 Please fix the lint error (do not use single quotes). Could you also add a test case for this patch?
@chiwanpark Is there any case that |
@pakhy2380 Is there additional memory overhead when we use the UTF-8 dtype? If so, how large is it? If the overhead is negligible, please apply the dtype to |
@pakhy2380 BTW, you may need to clean up the commit history (or sign our CLA). Currently, this PR contains commits from multiple authors, @pakhy2380 and @hugh-ga, and the latter hasn't agreed to the CLA.
@chiwanpark I will apply the same data type to the original code.
new code (only iid)
new code (including uid)
test case info
sample (3 rows)
@pakhy2380 There is a lint error for unused packages ( |
@pakhy2380 LGTM, I'm merging...
bug
when training w2v with Korean words (utf-8 characters), idmap['cols'] couldn't get utf-8
as-is: dtype=Sx
to be: dtype=h5py.string_dtype('utf-8')
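
For readers unfamiliar with the failure mode, here is a minimal standard-library sketch of why a fixed-width byte-string dtype (Sx) is a poor fit for Korean text. It is an illustration only, not the code from this PR: the item IDs are hypothetical, and it assumes the usual behavior that an S-type field holds raw bytes and that NumPy encodes str values as ASCII when converting them to an S dtype. In h5py, string_dtype('utf-8') with no length argument gives a variable-length UTF-8 string type, which avoids both problems shown here.

```python
# Hypothetical item IDs (not taken from the PR's test data).
item_ids = ["apple", "사과", "바나나"]


def fits_ascii(text):
    """Return True if the text can be encoded as ASCII bytes."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


# Problem 1: Korean text cannot be encoded as ASCII at all.
ascii_ok = [fits_ascii(s) for s in item_ids]

# Problem 2: a fixed width chosen from the character count is too small,
# because each Hangul syllable takes 3 bytes in UTF-8.
char_lens = [len(s) for s in item_ids]
utf8_lens = [len(s.encode("utf-8")) for s in item_ids]

# A UTF-8 encode/decode round trip preserves the original strings.
roundtrip = [s.encode("utf-8").decode("utf-8") for s in item_ids]

print(ascii_ok)   # [True, False, False]
print(char_lens)  # [5, 2, 3]
print(utf8_lens)  # [5, 6, 9]
```

The trade-off raised in the review still applies: variable-length strings may carry some storage overhead compared with a fixed-width field, so it is worth measuring on real data.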