Questions about the BM25 implementation approach #46

Open
qianxianyang opened this issue Nov 13, 2024 · 15 comments

Comments

@qianxianyang

qianxianyang commented Nov 13, 2024

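Hello,
When implementing BM25, Milvus appears to turn each document into an embedding by treating the current document as the query and the remaining documents as the docs. For an actual query, an embedding vector is derived from the IDF values, and the inner product of the two vectors is used as the similarity score.
This still differs somewhat from the original BM25 formula.
Could you explain the motivation behind this implementation, and how do the different implementations compare in performance?
BM25 formula: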

$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}$$
@codingjaguar
Collaborator

[Screenshot 2024-11-13 at 18 15 26]

Basically, at search time the vector weights can be computed so that the dot product of the query sparse vector and the document sparse vector is equivalent to the BM25 equation.

Milvus 2.5, which will be released in a week, adds native BM25 support and accepts text as input (so users don't need to compute the document and query vectors themselves).
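
One way to see this equivalence, as a sketch under the usual BM25 definitions and not necessarily how Milvus stores its vectors, is to split each summand of the formula into a document-side weight and a query-side weight. Here $c(t, Q)$ is a symbol introduced for this note: the number of times term $t$ occurs in the query.

$$d_t = \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}, \qquad q_t = c(t, Q) \cdot \text{IDF}(t)$$

$$\langle q, d \rangle = \sum_{t} q_t \cdot d_t = \text{BM25}(D, Q)$$

Terms missing from either the query or the document contribute zero, so the sum over the vocabulary reduces to the sum over query terms in the BM25 formula.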

@xiaofan-luan

@qianxianyang
Milvus 2.5 natively integrates BM25. Cheers!

@xiaofan-luan

Essentially, the corpus vector encodes TF, while the query vector encodes the query's TF and IDF.
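
As a rough illustration of that split, the sketch below builds a document vector from the saturated, length-normalized TF term and a query vector from query term count times IDF, then checks that their dot product matches the textbook BM25 sum. The toy corpus, the `K1` and `B` values, the smoothed IDF variant, and all function names are assumptions made for this example, not Milvus internals.

```python
import math
from collections import Counter

K1, B = 1.2, 0.75  # commonly used BM25 parameters (assumed, not taken from Milvus)


def idf(term, docs):
    # Smoothed IDF variant (an assumption; other variants work the same way here).
    n = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))


def length_norm(doc, docs):
    # k1 * (1 - b + b * |D| / avgdl) from the BM25 denominator.
    avgdl = sum(len(d) for d in docs) / len(docs)
    return K1 * (1 - B + B * len(doc) / avgdl)


def bm25_direct(query, doc, docs):
    # Textbook BM25: sum over every query token, duplicates included.
    tf, norm = Counter(doc), length_norm(doc, docs)
    return sum(idf(t, docs) * tf[t] * (K1 + 1) / (tf[t] + norm) for t in query)


def doc_vector(doc, docs):
    # Document side: saturated, length-normalized TF per term.
    norm = length_norm(doc, docs)
    return {t: f * (K1 + 1) / (f + norm) for t, f in Counter(doc).items()}


def query_vector(query, docs):
    # Query side: query term count times IDF.
    return {t: c * idf(t, docs) for t, c in Counter(query).items()}


def dot(q, d):
    # Sparse dot product over the terms both vectors share.
    return sum(w * d[t] for t, w in q.items() if t in d)


corpus = [
    "artificial intelligence was founded as an academic discipline".split(),
    "alan turing was the first person to conduct research in ai".split(),
    "turing was raised in southern england".split(),
]
query = "who started ai research ai".split()

for doc in corpus:
    direct = bm25_direct(query, doc, corpus)
    via_vectors = dot(query_vector(query, corpus), doc_vector(doc, corpus))
    assert math.isclose(direct, via_vectors)
    print(f"{direct:.4f} {via_vectors:.4f}")
```

Note that the document vector here depends on avgdl, so in this naive form it would have to be recomputed whenever the corpus changes; a real engine can avoid that by storing raw term statistics and applying the normalization at search time.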

@xiaofan-luan

We will soon publish performance benchmark results for Milvus 2.5 versus ES. Based on current tests, we expect roughly a 2-3x performance improvement.

@qianxianyang
Author

Thumbs up, looking forward to the test results.

@codingjaguar
Collaborator

We have released Milvus 2.5 beta with the full text search feature available (https://github.com/milvus-io/milvus/releases/tag/v2.5.0-beta). The detailed documentation will be released soon, but here is a snippet:

from pymilvus import MilvusClient, DataType, Function, FunctionType

# Connect to a running Milvus 2.5 instance (the URI is an assumed local default).
client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()

# Primary key, raw text (with the analyzer enabled), and the sparse vector the BM25 function fills in.
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)

bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)

index_params = client.prepare_index_params()

index_params.add_index(
    field_name="sparse",
    index_type="AUTOINDEX",
    metric_type="BM25"
)

# Collection operations are instance methods, so call them on the client.
client.create_collection(
    collection_name='demo',
    schema=schema,
    index_params=index_params
)

# Only raw text is inserted; the sparse field is generated by the BM25 function.
client.insert('demo', [
    {'text': 'Artificial intelligence was founded as an academic discipline in 1956.'},
    {'text': 'Alan Turing was the first person to conduct substantial research in AI.'},
    {'text': 'Born in Maida Vale, London, Turing was raised in southern England.'},
])

search_params = {
    'params': {'drop_ratio_search': 0.6},
}

# The query is passed as raw text as well.
results = client.search(
    collection_name='demo',
    data=['Who started AI research?'],
    anns_field='sparse',
    limit=3,
    search_params=search_params
)

Feel free to check it out!

@KylinMountain

@codingjaguar How do I use BM25 hybrid retrieval in LlamaIndex? I see that the default is BGE-M3. These two docs left me confused.

@wxywb
Collaborator

wxywb commented Dec 13, 2024

@codingjaguar How do I use BM25 hybrid retrieval in LlamaIndex? I see that the default is BGE-M3. These two docs left me confused.

If you want to use BM25 in the current LlamaIndex hybrid search implementation, you need to implement a BM25 embedding function as a thin wrapper around milvus-model's BM25EmbeddingFunction. Here are the suggested steps.

  1. Fit milvus-model's BM25 on the dataset you are interested in, or load the default parameters.
  2. Write a wrapper as shown in the first doc.
  3. Pass this wrapper class to the hybrid retriever.

Since Milvus 2.5 has a native BM25 implementation (Full Text Search), LlamaIndex's hybrid retriever will also be upgraded to use it as the default. Please stay tuned.

@KylinMountain

@wxywb Thanks, I have implemented the BM25 function for LlamaIndex.

@KylinMountain

@wxywb But I have another question about hybrid search:

retriever = vector_index.as_retriever(vector_store_query_mode="hybrid", filters=filters, similarity_top_k=2)

Some websites say hybrid search will return 2 nodes for BM25 and 2 nodes for vector search, so why does Milvus hybrid search return only 2 nodes?

I have debugged it, and it does indeed call hybrid search in Milvus.

@KylinMountain

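We will soon publish performance benchmark results for Milvus 2.5 versus ES. Based on current tests, we expect roughly a 2-3x performance improvement.

@xiaofan-luan Do I get this speedup just by using the BM25 in milvus-model?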

@codingjaguar
Collaborator

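We will soon publish performance benchmark results for Milvus 2.5 versus ES. Based on current tests, we expect roughly a 2-3x performance improvement.

@xiaofan-luan Do I get this speedup just by using the BM25 in milvus-model?

Right, simply use: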

bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)
schema.add_function(bm25_function)

@codingjaguar
Collaborator

@wxywb But I have another question about hybrid search:

retriever = vector_index.as_retriever(vector_store_query_mode="hybrid", filters=filters, similarity_top_k=2)

Some websites say hybrid search will return 2 nodes for BM25 and 2 nodes for vector search, so why does Milvus hybrid search return only 2 nodes?

I have debugged it, and it does indeed call hybrid search in Milvus.

cc @zc277584121 to take a look.

@wxywb
Collaborator

wxywb commented Dec 14, 2024

@wxywb But I have another question about hybrid search:

retriever = vector_index.as_retriever(vector_store_query_mode="hybrid", filters=filters, similarity_top_k=2)

Some websites say hybrid search will return 2 nodes for BM25 and 2 nodes for vector search, so why does Milvus hybrid search return only 2 nodes?

I have debugged it, and it does indeed call hybrid search in Milvus.

Milvus's hybrid search uses the RRF ranker to rerank them. RRF (Reciprocal Rank Fusion) is a method that produces a single list by considering both the sparse and the dense results.
https://github.com/run-llama/llama_index/blob/095d410249f6bd8e571275993b418af688ca2daf/llama-index-integrations/vector_stores/llama-index-vector-stores-milvus/llama_index/vector_stores/milvus/base.py#L748C1-L755C63
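
To illustrate why fusion yields one list and not two, here is a minimal sketch of Reciprocal Rank Fusion in plain Python. It is not the Milvus or LlamaIndex code: the function name `rrf_fuse`, the document IDs, and the constant `k=60` (a value commonly used for RRF) are assumptions for this example. It also assumes the fused list is cut to the requested number of results, which would explain getting 2 nodes back with `similarity_top_k=2`.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, limit=None):
    """Fuse several ranked ID lists into one list with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Each list contributes 1 / (k + rank) for every item it contains.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID to keep the order stable.
    fused = sorted(scores, key=lambda d: (-scores[d], str(d)))
    return fused if limit is None else fused[:limit]


sparse_hits = ["doc_a", "doc_b"]  # hypothetical BM25 top-2
dense_hits = ["doc_b", "doc_c"]   # hypothetical dense-vector top-2

# Three distinct candidates are merged, then cut to the requested top 2.
top_nodes = rrf_fuse([sparse_hits, dense_hits], limit=2)
print(top_nodes)
```

In this toy case `doc_b` appears in both lists, so it accumulates two contributions and ranks first, while the remaining slot goes to the better-ranked of the other two candidates.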

@wxywb
Collaborator

wxywb commented Dec 14, 2024

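We will soon publish performance benchmark results for Milvus 2.5 versus ES. Based on current tests, we expect roughly a 2-3x performance improvement.

@xiaofan-luan Do I get this speedup just by using the BM25 in milvus-model?

Actually, that speedup refers to Milvus 2.5's native BM25, not BM25EmbeddingFunction in milvus-model.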
