I think there is a problem with the logic in this part:
words_count = Counter(words)
words = [w for w in words if words_count[w] > 50]
In [19]:
vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}
In [20]:
print("total words: {}".format(len(words)))
print("unique words: {}".format(len(set(words))))
total words: 8623686
unique words: 6791
In [21]:
int_words = [vocab_to_int[w] for w in words]
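Actually, vocab_to_int just maps each word to the position of its first occurrence
t = 1e-5  # value of t
threshold = 0.9  # drop-probability threshold
And then this index is used here to compute word frequencies?? Can anyone tell me what is going on?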
int_word_counts = Counter(int_words)
total_count = len(int_words)
word_freqs = {w: c/total_count for w, c in int_word_counts.items()}
prob_drop = {w: 1 - np.sqrt(t / word_freqs[w]) for w in int_word_counts}
Subsample the words
train_words = [w for w in int_words if prob_drop[w] < threshold]
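One way to check what the indices represent is to run the same steps on a tiny corpus. The sketch below is only an illustration: the toy word list and the helper name drop_probability are made up here, and math.sqrt stands in for np.sqrt so the snippet has no dependencies. It compares the counts taken over the integer IDs with the counts taken over the words themselves, then applies the same 1 - sqrt(t / freq) expression as the prob_drop line.

```python
from collections import Counter
import math

# Toy corpus (made up for illustration; not the dataset from the notebook).
words = ["the", "cat", "the", "dog", "the", "cat"]

# Same construction as in the notebook: enumerating a set gives every
# distinct word an arbitrary but unique integer ID.
vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}

int_words = [vocab_to_int[w] for w in words]

# Count occurrences of the words and of their IDs, then compare.
word_counts = Counter(words)
int_word_counts = Counter(int_words)
counts_match = all(
    int_word_counts[vocab_to_int[w]] == n for w, n in word_counts.items()
)


def drop_probability(freq, t=1e-5):
    """Hypothetical helper: the 1 - sqrt(t / freq) expression from prob_drop."""
    return 1 - math.sqrt(t / freq)


total_count = len(int_words)
word_freqs = {i: n / total_count for i, n in int_word_counts.items()}
prob_drop = {i: drop_probability(f) for i, f in word_freqs.items()}

print("counts match:", counts_match)
for i, p in sorted(prob_drop.items(), key=lambda kv: -kv[1]):
    print(int_to_vocab[i], round(p, 4))
```

If the comparison prints True, Counter(int_words) is counting how often each word occurs, just keyed by ID instead of by string. On a corpus this small every word gets a drop probability close to 1, so the toy numbers only show the direction of the effect: more frequent words get a higher drop probability.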