GloVe Integration #3333
Replies: 6 comments
-
Hi @ttymck,
Some disorganised thoughts on this: I really like self-contained packages that provide C extensions. So, your wrapper should include the C source files, which would be linked into the Cython interface. You can see an example of using static linking for this in my Blis wrapper: https://github.com/explosion/cython-blis
The other question around this is, why GloVe as opposed to FastText? One thing I really like about GloVe is the two-phase process, where first we count and then the algorithm runs. I think there's always been a huge missed opportunity around this. We should be using approximate counting, so that we can count a huge text corpus with a limited memory budget. RaRe's bounter library provides good approximate counting: https://github.com/RaRe-Technologies/bounter
The other thing that would be cool to get right is the text preprocessing, and integration between that and spaCy. Especially for more exotic preprocessing, like merging phrases, lemmatization, etc. I don't know exactly what the solution to this should look like.
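To make the approximate-counting idea concrete, here is a minimal sketch of a count-min sketch used to count co-occurring word pairs in fixed memory. It is an illustration only: it is not how bounter is implemented, and the names CountMinSketch and count_cooccurrences, the default width and depth, and the left-context-only window are all assumptions made for the example. The Stanford GloVe counter also weights pairs by distance, which is left out here.

```python
import hashlib
from typing import Iterator, List


class CountMinSketch:
    """Fixed-memory approximate counter (illustrative, hypothetical class)."""

    def __init__(self, width: int = 2048, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        # depth rows of width counters: memory stays fixed however many keys are seen.
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key: str) -> Iterator[int]:
        # One hash per row, obtained by salting BLAKE2b with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                key.encode("utf-8"),
                digest_size=8,
                salt=row.to_bytes(2, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.width

    def add(self, key: str, count: int = 1) -> None:
        for row, col in enumerate(self._buckets(key)):
            self.table[row][col] += count

    def estimate(self, key: str) -> int:
        # Collisions can only inflate a counter, so the minimum over rows
        # is an upper bound on the true count that is as tight as possible.
        return min(self.table[row][col] for row, col in enumerate(self._buckets(key)))


def count_cooccurrences(tokens: List[str], sketch: CountMinSketch, window: int = 2) -> None:
    """Count (context, word) pairs where context is within `window` tokens to the left."""
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), i):
            sketch.add(f"{tokens[j]}\t{word}")


tokens = "the cat sat on the mat the cat ran".split()
sketch = CountMinSketch()
count_cooccurrences(tokens, sketch)
the_cat = sketch.estimate("the\tcat")  # never smaller than the true count
```

The trade-off is that estimates may overcount when keys collide, but they never undercount, and the memory budget is chosen up front through width and depth rather than growing with the vocabulary.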
-
As a user, I would be very interested in a package like this.
-
I want to add here that I came across mittens from Roam Analytics: https://github.com/roamanalytics/mittens From the README:
I will still be pursuing a thin wrapper around C executables. Thank you, @honnibal, for your timely feedback:
I'm working on a first pass of this, and will keep all of these in mind when doing so!
-
Managed to get a prototype working: https://github.com/ttymck/crucyble/blob/master/test/test_glove.py Aside from re-implementing the Other considerations include:
However, I fear that pursuing such features will result in straying too far from the "pure" Stanford implementation. I haven't yet decided whether that is a primary concern; opinions welcome.
-
See #2154 for FastText. It would be better to have both GloVe and FastText (and other common embedding vectors).
-
Feature description
I am interested in integrating GloVe more closely with the Python ecosystem. There currently exist two public re-implementations of the GloVe algorithm in the Python runtime (pure Python: https://github.com/hans/glove.py and Cython: https://github.com/maciejkula/glove-python). I am interested in providing a thin Cython wrapper around the stanfordnlp distribution of GloVe (a C program).
My proposed implementation would refactor the extant source to accept files instead of stdin/stdout, and a call to my GloVe.train() method would return a Path object pointing to the file on disk containing the word vectors. I think this would integrate nicely with the from_glove() method in spaCy.
I am wondering:
Could the feature be a custom component or spaCy plugin?
If so, we will tag it as project idea so other users can take it on.
I believe this would make sense as a plugin, with an emphasis on compatibility with other scientific packages (numpy, pandas, nltk, etc.).
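As a rough illustration of how a returned Path could be consumed downstream, the sketch below reads a vectors file in the plain-text layout written by the Stanford GloVe tools (one word per line, followed by space-separated floats). The function name read_glove_vectors is hypothetical, and the proposed GloVe.train() call is not shown since its interface is not settled in this thread.

```python
from pathlib import Path
from typing import Dict, List


def read_glove_vectors(path: Path) -> Dict[str, List[float]]:
    """Parse a GloVe text vectors file into a word -> vector mapping.

    Hypothetical helper: assumes one entry per line, formatted as
    "word v1 v2 ... vN" with single spaces between fields.
    """
    vectors: Dict[str, List[float]] = {}
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            parts = line.rstrip("\n").split(" ")
            if len(parts) < 2:
                continue  # skip blank or malformed lines
            word = parts[0]
            # Ignore empty fields left by any trailing space.
            vectors[word] = [float(value) for value in parts[1:] if value]
    return vectors
```

A caller would pass in whatever Path the training step returned and could then hand the resulting mapping to numpy, pandas, or spaCy as needed.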