Getting vectors from BPE using user_hooks #3823
Replies: 1 comment
The code looks right to me, yes. However:
Unfortunately there's no easy way to allow this in spaCy at the moment. The models work by first vectorizing the document, and then looking up the integer IDs for the vectors in a table. We could modify it so that there's some sort of back-off system, but you can't do it as a library user at the moment, unfortunately. It'll take some modifications within spaCy. We need to rethink the way the vectors work in the pretrained models anyway, as currently they rely on a global variable, which is messing things up in subprocesses and also causing problems if multiple models are loaded for the same language.
Feature description
Following #3761, I tried a script for getting the vector of an unknown word as the average of the vectors of all the subword units in its byte-pair-encoding (BPE) decomposition, using SentencePiece. By defining a small vocabulary, each row of the model vocab would contain a subword unit associated with a word vector trained with GloVe, word2vec, etc. The method is demonstrated in BPEmb.
This would, in theory, guarantee that every OOV word has an associated vector derived from its BPE segmentation, while keeping the vocabulary and vectors table relatively small. It is also very useful for languages such as German or Norwegian, where compounding can produce a practically unlimited number of word forms.
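For illustration, here is a minimal sketch of the subword-averaging idea. The SentencePiece model path, the vector dimensionality, and the `subword_vectors` table (a dict mapping each BPE piece to a pretrained vector, e.g. from a BPEmb download) are assumptions for the example, not part of the original post:

```python
import numpy
import sentencepiece as spm

# Hypothetical inputs: a trained SentencePiece model and a dict mapping
# each subword piece to a pretrained vector (e.g. from BPEmb).
sp = spm.SentencePieceProcessor()
sp.Load("bpe.model")

def bpe_vector(word, subword_vectors, dim=300):
    """Average the vectors of the word's BPE pieces."""
    pieces = sp.encode_as_pieces(word)
    found = [subword_vectors[p] for p in pieces if p in subword_vectors]
    if not found:
        # No known pieces: return a zero vector of the expected size.
        return numpy.zeros((dim,), dtype="float32")
    return numpy.mean(found, axis=0)
```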
At the moment I wrote an override function for the `doc.user_token_hooks["vector"]` property and a pipeline component that installs it. Maybe it's a little hacky; I would like to make sure I'm using the `user_hooks` in the correct way, so that this vector representation for OOV words could be used as input to the different pipeline components like "tagger", "parser", "ner", "textcat", etc. Could the feature be a custom component or a spaCy plugin?
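A minimal sketch of such a component, in the spaCy v2 style this discussion refers to. The class name, constructor argument, and the `oov_vector_fn` callable (e.g. the `bpe_vector` sketch above) are assumptions; note that, per the reply above, the hook only changes what `token.vector` returns and does not by itself feed the statistical models:

```python
class OOVVectorHook(object):
    """Pipeline component that installs a token-level `vector` hook,
    so OOV tokens fall back to a BPE-based vector. Hypothetical sketch."""

    name = "oov_vector_hook"

    def __init__(self, oov_vector_fn):
        # oov_vector_fn: callable mapping a string to a vector,
        # e.g. lambda w: bpe_vector(w, subword_vectors).
        self.oov_vector_fn = oov_vector_fn

    def __call__(self, doc):
        # Install the hook on every processed Doc.
        doc.user_token_hooks["vector"] = self.token_vector
        return doc

    def token_vector(self, token):
        # Keep spaCy's own vector when the token is in the vectors table.
        # Don't call token.vector here: that would re-enter this hook.
        if token.has_vector:
            return token.vocab.get_vector(token.orth)
        return self.oov_vector_fn(token.text)
```

It would be registered with the v2 API, e.g. `nlp.add_pipe(OOVVectorHook(lambda w: bpe_vector(w, subword_vectors)), first=True)`, so the hook is in place before any downstream components run.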
BPE segmentation could also be the subject of a PR for related use cases: handling OOV words, reducing model size, tokenization, segmentation applied before pretraining (as in #828), etc.
Best regards