training using documents separated by sentence #3

Open
xiaoouwang opened this issue Jan 7, 2022 · 1 comment

xiaoouwang commented Jan 7, 2022

Hello,

Thanks for this fantastic implementation.

I'm wondering whether it's possible to use sentences as training units, since the context window is normally confined to the sentence, right? If we use documents, the last word of a sentence will have a right window of 5 words taken from the next sentence, which shouldn't be included.
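To make the concern concrete, here is a minimal sketch in plain Python. It does not use svd2vec; `context_pairs` is a hypothetical helper written only for this illustration, and it assumes a simple symmetric window. It counts the (center, context) pairs that exist only because the window crossed a sentence boundary.

```python
from collections import Counter


def context_pairs(tokens, window):
    """Return all (center, context) pairs within a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentences = [
    "this is a test right left".split(),
    "this is the second test left right".split(),
]

# Treat both sentences as one document by concatenating their tokens.
document = [tok for sent in sentences for tok in sent]

doc_pairs = context_pairs(document, window=2)
sent_pairs = [p for sent in sentences for p in context_pairs(sent, window=2)]

# Multiset difference: pairs produced only when the window spans two sentences.
crossing = Counter(doc_pairs) - Counter(sent_pairs)

for pair in sorted(crossing):
    print(pair)
```

With sentence-level units these extra pairs never appear; with document-level units they are counted like any other co-occurrence.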

One could argue that it suffices to pass a list of tokenized sentences as input; however,

from svd2vec import svd2vec
# Two short tokenized sentences, each passed as its own document
documents = [
    "this is a test right left".split(),
    "this is the second test left right".split(),
]
svd = svd2vec(documents, window=2, min_count=1, size=2)

gives

test_svd.py 3 <module>
svd = svd2vec(documents, window=2, min_count=1,size=2)

core.py 146 __init__
self.weighted_count_matrix_file = self.skipgram_weighted_count_matrix()

core.py 234 skipgram_weighted_count_matrix
(self.vocabulary_len, self.vocabulary_len), np.dtype('float16'))

temporary_array.py 17 __init__
matrix = self.load(erase=True)

temporary_array.py 23 load
return np.memmap(self.file_name, shape=self.shape, dtype=self.dtype, mode='w+')

memmap.py 267 __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)

ValueError:
cannot mmap an empty file
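The last frame of the traceback is `mmap.mmap`, and the `core.py` frame shows that the requested shape is `(self.vocabulary_len, self.vocabulary_len)`. One possible reading, which is an assumption and has not been checked against svd2vec's source, is that the vocabulary ended up empty, so zero bytes were requested for a freshly created file. The sketch below uses only the standard library to show that mapping a zero-byte file raises this kind of `ValueError`; `try_mmap_empty_file` is a hypothetical helper name.

```python
import mmap
import tempfile


def try_mmap_empty_file():
    """Try to memory-map a zero-byte file; return the error message if it fails."""
    with tempfile.TemporaryFile() as f:
        try:
            # Length 0 means "map the whole file", which is empty here.
            mmap.mmap(f.fileno(), 0)
        except ValueError as exc:
            return str(exc)
    return None


message = try_mmap_empty_file()
print(message)
```

If that reading is right, the failure would come from the size of the co-occurrence matrix, not from passing sentences as such.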

As one might expect, the error disappears when each token list is made longer:

from svd2vec import svd2vec
# Same two sentences, each token list repeated 100 times
documents = [
    "this is a test right left".split() * 100,
    "this is the second test left right".split() * 100,
]
svd = svd2vec(documents, window=2, min_count=1, size=2)

Thanks again!

@xiaoouwang
Author

I then found that when the corpus contains many documents, each individual document can be short:

from svd2vec import svd2vec
# 20 short documents: each sentence doubled, the pair repeated 10 times
documents = [
    "this is a test right left".split() * 2,
    "this is the second test left right".split() * 2,
] * 10
svd = svd2vec(documents, window=2, min_count=0, size=4)

This one works.

So why not use the sentence as the unit? Does the author of the paper specify this, or is it for computational reasons?

Thanks :D
