Am Anfang war das Wort - am Ende die Phrase. ("In the beginning was the Word - in the end, the phrase.")
_~~~ Stanislaw Jerzy Lec ~~~_
`wort` is a Python library for creating count-based distributional semantic word vectors. It adopts a `scikit-learn`-like API and is built on top of `numpy`, `scipy` and `scikit-learn`.
First, clone the repository and check out the release:

```bash
git clone https://github.com/tttthomasssss/wort.git
cd wort
git checkout tags/v0.1.0
```

Then install the package:

```bash
pip install -e .
```

Or:

```bash
python setup.py install
```
```python
from wort.corpus_readers import TextStreamReader
from wort.vsm import VSMVectorizer

# Create PPMI vectors with a symmetric window of 5 from a lowercased corpus, discarding all items occurring less than 100 times
wort = VSMVectorizer(window_size=5, weighting='ppmi', min_frequency=100, lowercase=True)

corpus_path = 'path/to/corpus/on/disk.txt'
corpus = TextStreamReader(corpus_path)

wort.fit(corpus)  # Depending on the size of the corpus, this can take a while...

# Serialise model for later usage
wort.save_to_file('some/path/to/store/the/model')
```
Creating meaningful word vector representations requires a lot of data (e.g. all of Wikipedia or all of Project Gutenberg). `wort` expects 1 line in the corpus file to correspond to 1 document in the corpus (e.g. 1 Wikipedia article or 1 book from Project Gutenberg).

`wort` provides a few basic corpus readers in `wort.corpus_readers` to deal with corpora in `txt`, `csv`/`tsv` and `gzip` format (assuming 1 line = 1 document).
```python
corpus_path = 'path/to/corpus'

# Reading txt files
from wort.corpus_readers import TextStreamReader

corpus = TextStreamReader(corpus_path)

# Reading csv/tsv files
from wort.corpus_readers import CSVStreamReader

corpus = CSVStreamReader(corpus_path, delimiter='\t')  # tsv file, the default assumes delimiter=',' (csv file)

# Reading gzip files
from wort.corpus_readers import GzipStreamReader

corpus = GzipStreamReader(corpus_path)
```
Any of the `corpus` objects can then be passed to the `fit()` method:

```python
from wort.vsm import VSMVectorizer

wort = VSMVectorizer(...)
wort.fit(corpus)
```
`wort` requires two passes over the corpus: the first pass extracts the vocabulary and the second pass constructs the count co-occurrence matrix given that vocabulary.
So far, `wort` offers a range of PPMI-based parameterisations (in addition to some common `scikit-learn` `Vectorizer` options):
- `weighting`: `wort` currently supports `weighting='ppmi'` (support for `weighting='plmi'` and `weighting='pnpmi'` is about to be implemented). However, a callable can be passed as well (a sketch of custom callables is shown after this list); it needs to accept 4 values: the raw PMI matrix (a `sparse.csr_matrix`), the matrix of joint probabilities P(w, c) (**!!!ATTENTION: Currently `None` is passed instead of the matrix!!!**), a `numpy.ndarray` vector representing P(w) and a `numpy.ndarray` vector representing P(c).
- `window_size`: Size of the sliding window; accepts symmetric windows (e.g. `window_size=5` or `window_size=(5, 5)`) or asymmetric windows (e.g. `window_size=(1, 5)`).
- `context_window_weighting`: Weighting of the items within the sliding window; the default is `context_window_weighting='constant'`, but a range of other schemes are supported (so far `'aggressive'`, `'very_aggressive'`, `'harmonic'` (that's what `GloVe` is doing), `'distance'` (that's what `word2vec` is doing), `'sigmoid'`, `'inverse_sigmoid'`, `'absolute_sigmoid'`, `'inverse_absolute_sigmoid'`). Again, a callable can be passed as well; it needs to accept a `distance` parameter, representing the distance from the current word, and a `window_size` parameter, representing the size of the window under consideration (note that this may not be equivalent to the `window_size` parameter used to create the `wort` object).
- `min_frequency`: Words with a frequency < `min_frequency` will be filtered and discarded.
- `binary`: If set to `True`, converts the count-based co-occurrence matrix to a binary indicator matrix.
- `sppmi_shift`: Subtracts `sppmi_shift` from all non-zero entries of the final PMI matrix. This is equivalent to the number of negative samples in `word2vec`; see Levy & Goldberg (2014) and Levy et al. (2015) for more information.
- `cds`: Context distribution smoothing, performs `p(c) ** cds`; typically `cds=0.75` was found to perform particularly well, again see Levy et al. (2015) for more information.
- `dim_reduction`: Performs dimensionality reduction on the PMI matrix; currently only `dim_reduction='svd'` is supported.
- `svd_dim`: Dimensionality of the reduced space.
- `svd_eig_weighting`: Eigenvalue weighting of the SVD-reduced space; Levy et al. (2015) found that `svd_eig_weighting=0.5` or `svd_eig_weighting=0.0` perform better than using SVD the "correct" way.
- `add_context_vectors`: After reducing the dimensionality, the word and context vectors can be added together; see Pennington et al. (2014) and Levy et al. (2015) for more information.
- `word_white_list`: In academic settings one often evaluates the quality of word vectors on some word similarity dataset. The words in those datasets should obviously not be discarded by the `min_frequency` filter, thus `wort` allows passing a white list of words that must not be discarded under any circumstances.
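The following is a minimal sketch of the two callable hooks above. The function names and weighting schemes are made up for illustration, the signatures follow the descriptions in this list, and since the expected return value of a custom `weighting` callable is not spelled out here, the sketch simply assumes it returns the transformed matrix:

```python
import numpy as np

from wort.vsm import VSMVectorizer

# Hypothetical custom `weighting` callable; it receives the raw PMI matrix
# (a sparse.csr_matrix), the joint probability matrix P(w, c) (currently None
# is passed instead), a vector representing P(w) and a vector representing P(c).
# Assumption: the callable returns the transformed matrix.
def positive_pmi(PMI, P_wc, p_w, p_c):
    PMI.data = np.maximum(PMI.data, 0)  # keep only positive PMI values (essentially PPMI)
    PMI.eliminate_zeros()
    return PMI

# Hypothetical custom `context_window_weighting` callable; it receives the distance
# of a context item from the current word (assumed to be a non-zero offset) and the
# size of the window under consideration.
def inverse_square(distance, window_size):
    return 1.0 / (distance ** 2)

wort = VSMVectorizer(window_size=5, min_frequency=100,
                     weighting=positive_pmi,
                     context_window_weighting=inverse_square)
```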
With all of these options in mind, creating a `wort` object is as simple as creating a `CountVectorizer` or a `TfidfVectorizer` in `scikit-learn`:
```python
from wort.vsm import VSMVectorizer

wort = VSMVectorizer(window_size=(1, 7), weighting='ppmi', context_window_weighting='harmonic', min_frequency=100, cds=0.75)
wort.fit(...)
```
Given that fitting a distributional model takes a significant amount of time, it is advisable (read: necessary!) to save the models to disk after they've been fitted:
```python
from wort.vsm import VSMVectorizer

wort = VSMVectorizer(...)
wort.fit(...)

# Save model to disk
wort.save_to_file(path='path/to/some/location')
```
The function `save_to_file()` stores the most important assets (not all of them, to reduce disk space usage) to disk: the final PMI matrix, an index file mapping numbers to words, an inverted index performing the opposite mapping and the word probability distribution P(w).
Once a number of different `wort` models have been created and serialised, loading an existing model is equally simple:
```python
from wort.vsm import VSMVectorizer

# Load model from disk
wort = VSMVectorizer.load_from_file(path='path/to/existing/wort/model')
```
Accessing word vectors adopts a `dict`-style approach:
```python
from wort.vsm import VSMVectorizer

# Load wort model from disk
wort = VSMVectorizer.load_from_file(path='path/to/existing/wort/model')

v_book = wort['book']
```
The vector for `'book'` is a `1 x N` `scipy.sparse.csr_matrix`, where `N` is the dimensionality of the vector space, which can be queried by:

```python
wort.get_vector_size()  # Returns an integer
```
Checking whether a word is present in the model can be done by:

```python
'book' in wort  # Returns True or False
```
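For example, the sparse row vectors can be compared directly with `scikit-learn`'s `cosine_similarity`; a minimal sketch, where the word pair and model path are just placeholders:

```python
from sklearn.metrics.pairwise import cosine_similarity

from wort.vsm import VSMVectorizer

# Load an existing `wort` model from disk (placeholder path)
wort = VSMVectorizer.load_from_file(path='path/to/existing/wort/model')

# Guard against out-of-vocabulary words before indexing into the model
if 'book' in wort and 'paper' in wort:
    # cosine_similarity accepts scipy sparse matrices, so the 1 x N rows
    # returned by the model can be compared directly
    sim = cosine_similarity(wort['book'], wort['paper'])[0, 0]
    print('cos(book, paper) = {}'.format(sim))
```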
The most common (though arguably not the ideal) evaluation strategy for word vectors is an "intrinsic" evaluation on word similarity tasks, where the cosine similarity between the vectors of a word pair is compared against (aggregated) human similarity judgements.

Over the years a number of word similarity datasets have been created, of which `wort` currently supports the following:
- WS353 (`key='ws353'`), see Finkelstein et al. (2001) - Placing Search in Context: The Concept Revisited
- WS353 (similarity) (`key='ws353_similarity'`), see Agirre et al. (2009) - A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches
- WS353 (relatedness) (`key='ws353_relatedness'`), see Agirre et al. (2009) - A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches
- SimLex-999 (`key='simlex999'`), see Hill et al. (2014) - SimLex-999: Evaluating Semantic Models with (Genuine) Similarity Estimation
- MEN (`key='men'`), see Bruni et al. (2014) - Multimodal Distributional Semantics
- Mechanical Turk (MTurk) (`key='mturk'`), see Radinsky et al. (2011) - A word at a time: computing word relatedness using temporal semantic analysis
- Rare Words (rw) (`key='rw'`), see Luong et al. (2013) - Better word representations with recursive neural networks for morphology
- MC30 (`key='mc30'`), see Miller & Charles (1991) - Contextual correlates of semantic similarity
- RG65 (`key='rg65'`), see Rubenstein & Goodenough (1965) - Contextual correlates of synonymy
Evaluating a `wort` model on one of these datasets is straightforward:
```python
# Evaluate `wort` model on SimLex-999
from wort import evaluation
from wort.vsm import VSMVectorizer

# Load `wort` model from disk
wort = VSMVectorizer.load_from_file(path='path/to/existing/wort_model')

evaluation.intrinsic_word_similarity_evaluation(wort_model=wort, datasets=['simlex999'])
```
Furthermore, `wort` supports batched evaluation of a number of different `wort` models on all available word similarity datasets via the script in `wort/tools`:

```bash
./tools/batch_intrinsic_word_similarity_evaluation.sh -i path/to/wort/models -p naming_pattern_of_wort_models
```
The model fitting process can be broken down into 3 individual steps (4 if dimensionality reduction is performed):

- Vocabulary extraction (can easily take 1 hour)
- Co-occurrence matrix construction (can easily take 3 hours or more)
- PMI transformation (~ a few minutes)
- Dimensionality reduction (depending on the number of dimensions, can take anything from a few seconds to several hours)
To optimise model throughput when multiple parameters are investigated (e.g. different window sizes, context weighting functions, context distribution smoothing values, sppmi shifts, etc.), `wort` employs a caching scheme (enabled with `cache_intermediary_results=True` in the `VSMVectorizer` constructor) that re-uses results from previous processing steps by noticing that:

- The vocabulary stays the same, independent of the options affecting the co-occurrence matrix construction (e.g. `window_size`, `context_window_weighting`)
- The co-occurrence matrix stays the same, independent of the options affecting the PMI calculation (e.g. `cds`, `sppmi_shift`, `weighting`)
- The PMI matrix stays the same, independent of the options affecting the dimensionality reduction (e.g. `svd_dim`, `svd_eig_weighting`, `add_context_vectors`)
Thus, `wort` re-uses whatever it can when past model configurations match the current configuration in order to optimise the time spent on creating models.
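As a minimal sketch of how the cache pays off (parameter values are illustrative and `corpus` is any corpus reader as shown above), two models that only differ in a PMI-level option share the cached vocabulary and co-occurrence matrix:

```python
from wort.vsm import VSMVectorizer

# Both models below share the same vocabulary and co-occurrence matrix; only the
# PMI-level option `cds` differs, so the second fit() can re-use the cached
# intermediary results produced by the first.
common = dict(window_size=5, weighting='ppmi', min_frequency=100,
              cache_intermediary_results=True)

wort_plain = VSMVectorizer(cds=1.0, **common)
wort_plain.fit(corpus)  # `corpus` is any reader from wort.corpus_readers

wort_smoothed = VSMVectorizer(cds=0.75, **common)
wort_smoothed.fit(corpus)
```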
With time the cache will grow and potentially occupy a large amount of disk space, in which case the cache can be deleted by executing the `delete_cache.sh` script in `wort/tools` (by default `wort` uses `~/.wort_data/model_cache` as the cache location):

```bash
./tools/delete_cache.sh -v -p /path/to/cache
```
This small example illustrates a complete use case of `wort` (note that the `scikit-learn` parameter search API is not yet supported, but will be soon!). In practice, it is recommended to split the creation of `wort` models and their evaluation into two separate steps:
```python
import os
import json

from wort import evaluation
from wort.corpus_readers import TextStreamReader
from wort.datasets import get_men_words
from wort.datasets import get_simlex_999_words
from wort.datasets import get_ws353_words
from wort.vsm import VSMVectorizer

# Load Corpus
corpus_path = 'path/to/corpus/on/disk'
corpus = TextStreamReader(corpus_path)

# Define word_white_list of words needed for evaluation
white_list = get_men_words() | get_simlex_999_words() | get_ws353_words()

# Investigate effect of `window_size`, `context_window_weighting` and `cds`
window_size_values = [1, 2, 5, 10]
context_window_weighting_values = ['constant', 'harmonic', 'aggressive']
cds_values = [1.0, 0.75, 0.5]

min_frequency = 100
wort_base_path = 'path/to/location/on/disk'

collected_model_paths = []  # Note that this is just for illustrative purposes, in practice the bash script in wort/tools should be used

# Create `wort` models
for window_size in window_size_values:
    for weighting in context_window_weighting_values:
        for cds in cds_values:
            print('Running with configuration: window_size={}; context_window_weighting={}; cds={}...'.format(window_size, weighting, cds))

            model_name = 'wort_model_window-{}_weighting-{}_cds-{}'.format(window_size, weighting, cds)
            model_out_path = os.path.join(wort_base_path, model_name)
            collected_model_paths.append(model_out_path)

            wort = VSMVectorizer(window_size=window_size, context_window_weighting=weighting, cds=cds,
                                 min_frequency=min_frequency, word_white_list=white_list)
            wort.fit(corpus)

            print('Storing wort model at {}...'.format(model_out_path))
            wort.save_to_file(path=model_out_path)
# Evaluate `wort` models
for model_path in collected_model_paths:
    wort = VSMVectorizer.load_from_file(path=model_path)
    results = evaluation.intrinsic_word_similarity_evaluation(wort_model=wort, datasets=['men', 'simlex999', 'ws353'])

    print('Performance scores of {}:'.format(model_path))
    print(json.dumps(results, indent=4))
    print('====================================================================================')
```