[Bug]: Changing fastembed version but with the same embedder I have different vector with the same text #373

Open
giovannialbero1992 opened this issue Oct 24, 2024 · 5 comments

Comments

@giovannialbero1992

What happened?

I have two environments, one with fastembed 0.3.4 and the other with fastembed 0.1.3. With the same embedder and the same text, the two environments produce different vectors.
The embedder used is: https://huggingface.co/intfloat/multilingual-e5-large

What Python version are you on? e.g. python --version

python 3.10.11

Version

0.2.7 (Latest)

What os are you seeing the problem on?

Linux

Relevant stack traces and/or logs

No response

@giovannialbero1992
Author

Here is a step-by-step guide to reproduce what I'm observing.

Step 1

Run a Docker container with Python 3.10.11

docker run -d -i -t python:3.10 bash

Step 2

Get the container's ID with docker ps, then open a shell inside the container

docker exec -ti <CONTAINER ID> bash 

Step 3

Install vim

apt update && apt install vim

Step 4

Create a Python file named embedder.py and insert this code

from langchain_community.embeddings import FastEmbedEmbeddings

embedder = FastEmbedEmbeddings(model_name="intfloat/multilingual-e5-large")

text = "Hello world"
embedding = embedder.embed_query(text)
print(embedding)

Step 5

Install dependencies

pip install langchain_core==0.1.22
pip install langchain==0.1.4
pip install fastembed==0.1.3

Step 6

Run the script and get the result

python embedder.py

First part of the vector

[0.024819795042276382, -0.023618297651410103, -0.006692419294267893, -0.04708532989025116, 0.0343518927693367, -0.026183584704995155, -0.029025807976722717, 0.041693683713674545, 0.060204412788152695, -0.015606507658958435, 0.02012583799660206, 0.03693017736077309, ...

Step 7

Upgrade the fastembed version

pip install fastembed==0.3.4

Step 8

Run the script and get the result

python embedder.py

First part of the vector

[-0.005152239464223385, 0.005240725819021463, 0.008123699575662613, -0.039657339453697205, 0.009418696165084839, -0.035511959344148636, -0.04110070690512657, 0.03789035230875015, 0.05153501033782959, -0.024316389113664627, 0.037706244736909866, 0.019727017730474472, ...

Step 9

Compare the two results: the vectors differ
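To put a number on the difference, the two outputs can be compared with cosine similarity. The snippet below is a minimal sketch that uses only the first 12 components printed in Steps 6 and 8, because the rest of each vector is elided in this thread. The helper name `cosine_similarity` is illustrative and not part of fastembed; for a meaningful figure, run it on the full vectors.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First 12 components from Step 6 (fastembed 0.1.3); the remainder is elided.
prefix_old = [
    0.024819795042276382, -0.023618297651410103, -0.006692419294267893,
    -0.04708532989025116, 0.0343518927693367, -0.026183584704995155,
    -0.029025807976722717, 0.041693683713674545, 0.060204412788152695,
    -0.015606507658958435, 0.02012583799660206, 0.03693017736077309,
]

# First 12 components from Step 8 (fastembed 0.3.4); the remainder is elided.
prefix_new = [
    -0.005152239464223385, 0.005240725819021463, 0.008123699575662613,
    -0.039657339453697205, 0.009418696165084839, -0.035511959344148636,
    -0.04110070690512657, 0.03789035230875015, 0.05153501033782959,
    -0.024316389113664627, 0.037706244736909866, 0.019727017730474472,
]

similarity = cosine_similarity(prefix_old, prefix_new)
print(f"cosine similarity over the first {len(prefix_old)} components: {similarity:.3f}")
```

A value noticeably below 1.0 on the full vectors would confirm that the two versions embed the same text differently, rather than differing only by floating-point noise.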

@giovannialbero1992 giovannialbero1992 changed the title [Bug]: Changing fastembed version but same embedder I have different vector with the same text [Bug]: Changing fastembed version but with the same embedder I have different vector with the same text Oct 24, 2024
@I8dNLo
Contributor

I8dNLo commented Oct 29, 2024

I could reproduce this, but my last output (with 0.3.4) is:
[-0.0010747660417109728, -0.0015742044197395444, 0.01378690730780363, -0.03357434272766113, 0.0050786384381353855 ...
The first output matches yours exactly.

@I8dNLo
Contributor

I8dNLo commented Oct 29, 2024

Yep, times change:
You are looking at a very early release.

Since release 0.2.0 the behavior has stayed as it is now. Please use a current version of fastembed.

@giovannialbero1992
Author

Thanks @I8dNLo for the test.
I don't know why your last output differs from mine, but you see a difference between the two versions either way.

I checked the code and saw that the previous version prepended query: to the text before embedding it.
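One way to test this hypothesis is to add the prefix by hand in the newer version and check whether the old vector comes back. The helper below is a hypothetical sketch: `with_e5_prefix` is not a fastembed or LangChain function, and the exact prefix string `"query: "` (with a trailing space) is an assumption based on the convention documented for E5 models, which also use `"passage: "` for documents. Whether this reproduces the 0.1.3 output exactly has not been verified here.

```python
def with_e5_prefix(text: str, kind: str = "query") -> str:
    """Prepend an E5-style prefix ("query: " or "passage: ") to the text.

    Hypothetical helper; the prefix format is an assumption, not taken
    from the fastembed source.
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return f"{kind}: {text}"


text = "Hello world"
prefixed = with_e5_prefix(text)
print(prefixed)

# With the embedder from Step 4 and fastembed 0.3.4 installed, one could then run:
# embedding = embedder.embed_query(prefixed)
# and compare the result against the vector from Step 6.
```

If the prefixed text under 0.3.4 yields the Step 6 vector, the prefix handling alone explains the change. If not, other differences between the releases are involved.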

@giovannialbero1992
Author

Unfortunately, the update is disruptive for my RAG system because it produces different results.
I will have to plan a migration somehow.
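A possible starting point for such a migration is to record which fastembed version produced each stored vector and to re-embed only the stale ones. The sketch below is purely illustrative: `StoredVector` and `needs_reembedding` are hypothetical names, and the record layout is an assumption rather than the schema of any particular vector store.

```python
from dataclasses import dataclass


@dataclass
class StoredVector:
    doc_id: str
    text: str
    embedder_version: str  # fastembed version that produced the stored vector


def needs_reembedding(records, current_version):
    """Return the records whose vectors came from a different fastembed version."""
    return [r for r in records if r.embedder_version != current_version]


# Hypothetical sample data. In a real system the current version could be read
# with importlib.metadata.version("fastembed").
records = [
    StoredVector("a", "Hello world", "0.1.3"),
    StoredVector("b", "Another document", "0.3.4"),
]

stale = needs_reembedding(records, current_version="0.3.4")
print([r.doc_id for r in stale])
```

Queries and stored documents should be embedded with the same version, so the stale records would need to be re-embedded before the upgraded embedder serves queries.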
