training using documents separated by sentence #3

xiaoouwang · 2022-01-07T01:53:47Z

Hello,

Thanks for this fantastic implementation.

I'm wondering whether it's possible to use sentences as training units, since the context window is normally applied within a sentence, right? If we use documents, the last word of a sentence will have a right window of 5 words that shouldn't have been included.

One could argue that it suffices to pass a list of tokenized sentences as input; however,
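To make the concern concrete, here is a minimal sketch in plain Python. It does not use svd2vec; `context_pairs` is a hypothetical helper written only for this illustration, and the two toy sentences are made up. It compares the (center, context) pairs produced when the window slides over the concatenated document with the pairs produced when the window is restricted to each sentence.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs within `window` positions on each side."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Two toy sentences (illustrative data, all tokens distinct)
sentences = [["the", "cat", "sat"], ["dogs", "bark", "loudly"]]

# Document-level: the sentences are concatenated before the window is applied
document = [tok for sent in sentences for tok in sent]
doc_pairs = set(context_pairs(document, window=2))

# Sentence-level: the window never leaves its sentence
sent_pairs = set()
for sent in sentences:
    sent_pairs.update(context_pairs(sent, window=2))

# Pairs that exist only because the window crossed the sentence boundary
cross_boundary = sorted(doc_pairs - sent_pairs)
print(cross_boundary)
```

Every pair in `cross_boundary` links a word near the end of the first sentence with a word near the start of the second, which is exactly the kind of co-occurrence the question is about.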

from svd2vec import svd2vec
documents = [
    "this is a test right left".split(),
    "this is the second test left right".split(),
]
svd = svd2vec(documents, window=2, min_count=1, size=2)

gives

test_svd.py 3 <module>
svd = svd2vec(documents, window=2, min_count=1,size=2)

core.py 146 __init__
self.weighted_count_matrix_file = self.skipgram_weighted_count_matrix()

core.py 234 skipgram_weighted_count_matrix
(self.vocabulary_len, self.vocabulary_len), np.dtype('float16'))

temporary_array.py 17 __init__
matrix = self.load(erase=True)

temporary_array.py 23 load
return np.memmap(self.file_name, shape=self.shape, dtype=self.dtype, mode='w+')

memmap.py 267 __new__
mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)

ValueError: cannot mmap an empty file
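The final frame of this traceback can be reproduced without svd2vec or NumPy. The sketch below uses only the standard library; `try_mmap_empty_file` is a hypothetical helper name. It shows that `mmap.mmap` refuses to map a zero-byte file when asked to map the whole file. Whether the matrix in `core.py` really ended up with a size of zero bytes is an assumption, not something the traceback confirms.

```python
import mmap
import tempfile


def try_mmap_empty_file():
    """Try to memory-map a zero-byte file; return the error message, if any."""
    with tempfile.TemporaryFile() as f:
        try:
            # length=0 asks mmap to map the whole file, which is 0 bytes here
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE):
                return None
        except ValueError as exc:
            return str(exc)


print(try_mmap_empty_file())
```

If the assumption holds, checking `vocabulary_len` before the matrix is allocated would be one way to confirm it.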

As one might expect, the error disappears when each document is longer:

from svd2vec import svd2vec
documents = [
    "this is a test right left".split() * 100,
    "this is the second test left right".split() * 100,
]
svd = svd2vec(documents, window=2, min_count=1, size=2)

Thanks again!

xiaoouwang · 2022-01-07T03:18:34Z

I then found that when there are many documents, each document can be short:

from svd2vec import svd2vec
documents = [
    "this is a test right left".split() * 2,
    "this is the second test left right".split() * 2,
] * 10
svd = svd2vec(documents, window=2, min_count=0, size=4)

This one works.

So why not use the sentence as the unit? Does the author of the paper specify that, or is it for computational reasons?

Thanks :D
