Average embeddings using SentenceEmbeddings #5974
Hello, I am trying to create document-level embeddings using the 100-dimensional GloVe model and am having issues with the SentenceEmbeddings annotator. My understanding is that by setting the inputCols to ["document", "embeddings"], the embeddings should be averaged across the whole document. What I am finding is that the output vector appears to concatenate the vectors for each sentence. In the example below, I provide three sentences, so I expect a single 100-dimensional vector that is the average of the embeddings across all 3 sentences, but instead I get a vector of length 300. If I provide 4 sentences, I get a length of 400, and so on. Am I defining my pipeline incorrectly for producing document-level embeddings? I am using sparknlp 3.2.1 and Spark 3.1.0. Thank you in advance for any help!
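For reference, a minimal sketch of the kind of pipeline described above (the original snippet is not reproduced in this thread, so the exact annotators, column names, and test sentences here are assumptions based on the description):

```python
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector, Tokenizer, WordEmbeddingsModel, SentenceEmbeddings
)

spark = sparknlp.start()

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

sentence_detector = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")

tokenizer = Tokenizer() \
    .setInputCols(["sentence"]) \
    .setOutputCol("token")

# 100-dimensional GloVe word embeddings
word_embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["sentence", "token"]) \
    .setOutputCol("embeddings")

# Intended to average the word embeddings over the whole document
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("document_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline(stages=[
    document_assembler,
    sentence_detector,
    tokenizer,
    word_embeddings,
    sentence_embeddings
])

data = spark.createDataFrame(
    [["First sentence. Second sentence. Third sentence."]]
).toDF("text")

result = pipeline.fit(data).transform(data)
# With spark-nlp 3.2.1 this reportedly yields a 300-dimensional vector
# (3 sentences x 100 dims) instead of the expected 100-dimensional average.
```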
Hi,

Since you are using WordEmbeddingsModel, and this annotator (GloVe) has no maximum sequence length, unlike the transformers (BERT, ALBERT, RoBERTa, etc.), you can simply remove the SentenceDetector and use document in setInputCols for all the annotators that come after the DocumentAssembler. Could you please try that and see what happens? A sketch of that change is below.

UPDATE: This should be fixed in our new release: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/3.2.3
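A minimal sketch of the suggested workaround, assuming a pipeline like the one in the question (the SentenceDetector is dropped and every downstream annotator reads from the document column; column names are illustrative):

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddingsModel, SentenceEmbeddings

document_assembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

# No SentenceDetector: tokenize directly over the whole document
tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

word_embeddings = WordEmbeddingsModel.pretrained("glove_100d") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

# Averages the GloVe vectors over the single document annotation,
# yielding one 100-dimensional vector per row
sentence_embeddings = SentenceEmbeddings() \
    .setInputCols(["document", "embeddings"]) \
    .setOutputCol("document_embeddings") \
    .setPoolingStrategy("AVERAGE")

pipeline = Pipeline(stages=[
    document_assembler,
    tokenizer,
    word_embeddings,
    sentence_embeddings
])
```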