Average embeddings using SentenceEmbeddings #5974
Hello, I am trying to create document-level embeddings using the 100-dimensional GloVe model and am having issues with the SentenceEmbeddings annotator. My understanding is that by setting the inputCols to ["document", "embeddings"], the embeddings should be averaged across the whole document. What I am finding is that the output vector appears to concatenate the vectors for each sentence. In the example below, I provide three sentences, so I expect a single 100-dimensional vector that is the average of the embeddings across all 3 sentences, but instead I get a vector of length 300. If I provide 4 sentences, I get a length of 400, and so on. Am I defining my pipeline incorrectly for producing document-level embeddings? I am using sparknlp 3.2.1 and Spark 3.1.0. Thank you in advance for any help!
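For reference, a minimal sketch of the kind of pipeline described above (the original snippet is not reproduced in this thread, so the exact annotators, column names, and test sentences here are assumptions based on the description):

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector, Tokenizer, WordEmbeddingsModel, SentenceEmbeddings
)

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# 100-dimensional GloVe word embeddings
word_embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Intended to average the word embeddings over the whole document
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("document_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    sentence_embeddings
])

data = spark.createDataFrame(
    [["First sentence. Second sentence. Third sentence."]]
).toDF("text")

result = pipeline.fit(data).transform(data)
# With spark-nlp 3.2.1 this reportedly yields a 300-dimensional vector
# (3 sentences x 100 dims) instead of the expected 100-dimensional average.
```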
Hi,

Since you are using WordEmbeddingsModel, and this annotator (GloVe) has no maximum sequence length, unlike the transformers (BERT, ALBERT, RoBERTa, etc.), you can simply remove the SentenceDetector and use document in setInputCols for all the annotators that come after the DocumentAssembler. Could you please try that and see what happens? A sketch of that change is below.

UPDATE: This should be fixed in our new release: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/3.2.3
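A minimal sketch of the suggested workaround, assuming a pipeline like the one in the question (the SentenceDetector is dropped and every downstream annotator reads from the document column; column names are illustrative):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, SentenceEmbeddings

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# No SentenceDetector: tokenize directly over the whole document
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Averages the GloVe vectors over the single document annotation,
# yielding one 100-dimensional vector per row
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("document_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings
])
```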