Getting vectors from BPE using user_hooks #3823
Replies: 1 comment
The code looks right to me, yes. However:
Unfortunately there's no easy way to allow this in spaCy at the moment. The models work by first vectorizing the document, and then looking up the integer IDs for the vectors in a table. We could modify it so that there's some sort of back-off system, but you can't do it as a library user at the moment, unfortunately. It'll take some modifications within spaCy. We need to rethink the way the vectors work in the pretrained models anyway, as currently they rely on a global variable, which is messing things up in subprocesses and also causing problems if multiple models are loaded for the same language.
Feature description
Following #3761, I tried a script for getting the vector of an unknown word as the average of the vectors of all the subword units in its byte-pair-encoding (BPE) decomposition, using SentencePiece. By defining a small vocabulary, each row of the model vocab would contain a subword unit associated with a word vector trained with GloVe, word2vec, etc. The method is demonstrated in BPEmb.
This would, in theory, guarantee that every OOV word has an associated vector derived from its BPE segmentation, while keeping the vocabulary and vectors table relatively small. It is also very useful for languages such as German or Norwegian, where compounding can produce a practically unlimited number of word forms.
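For illustration, here is a minimal sketch of the subword-averaging idea. The SentencePiece model path, the vector dimensionality, and the `subword_vectors` table (a dict mapping each BPE piece to a pretrained vector, e.g. from a BPEmb download) are assumptions for the example, not part of the original post:

```python
import numpy
import sentencepiece as spm

# Hypothetical inputs: a trained SentencePiece model and a dict mapping
# each subword piece to a pretrained vector (e.g. from BPEmb).
sp = spm.SentencePieceProcessor()
sp.Load("bpe.model")

def bpe_vector(word, subword_vectors, dim=300):
    """Average the vectors of the word's BPE pieces."""
    pieces = sp.encode_as_pieces(word)
    found = [subword_vectors[p] for p in pieces if p in subword_vectors]
    if not found:
        # No known pieces: return a zero vector of the expected size.
        return numpy.zeros((dim,), dtype="float32")
    return numpy.mean(found, axis=0)
```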
At the moment I wrote an override function for the `doc.user_token_hooks["vector"]` property and a pipeline component that installs it. Maybe it's a little hacky; I would like to make sure I'm using the `user_hooks` in the correct way, so that this vector representation for OOV words could be used as input to the different pipeline components like "tagger", "parser", "ner", "textcat", etc. Could the feature be a custom component or a spaCy plugin?
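A minimal sketch of such a component, in the spaCy v2 style this discussion refers to. The class name, constructor argument, and the `oov_vector_fn` callable (e.g. the `bpe_vector` sketch above) are assumptions; note that, per the reply above, the hook only changes what `token.vector` returns and does not by itself feed the statistical models:

```python
class OOVVectorHook(object):
    """Pipeline component that installs a token-level `vector` hook,
    so OOV tokens fall back to a BPE-based vector. Hypothetical sketch."""

    name = "oov_vector_hook"

    def __init__(self, oov_vector_fn):
        # oov_vector_fn: callable mapping a string to a vector,
        # e.g. lambda w: bpe_vector(w, subword_vectors).
        self.oov_vector_fn = oov_vector_fn

    def __call__(self, doc):
        # Install the hook on every processed Doc.
        doc.user_token_hooks["vector"] = self.token_vector
        return doc

    def token_vector(self, token):
        # Keep spaCy's own vector when the token is in the vectors table.
        # Don't call token.vector here: that would re-enter this hook.
        if token.has_vector:
            return token.vocab.get_vector(token.orth)
        return self.oov_vector_fn(token.text)
```

It would be registered with the v2 API, e.g. `nlp.add_pipe(OOVVectorHook(lambda w: bpe_vector(w, subword_vectors)), first=True)`, so the hook is in place before any downstream components run.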
BPE segmentation could also be the subject of a PR for related use cases: handling OOV words, reducing model size, tokenization, segmentation applied before pretraining (as in #828), etc.
Best regards