Questions about the BM25 implementation #46

qianxianyang · 2024-11-13T09:27:54Z

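Hello,
When implementing BM25, Milvus appears to turn each document into an embedding by treating the current document as the query and the remaining documents as the corpus. For a real query, an embedding vector is obtained from IDF, and the inner product of the two vectors is used as the similarity score.
This approach is not quite the same as the original BM25 formula.
May I ask what motivated this implementation, and how the different implementations compare in performance?
BM25 formula: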

$$\text{BM25}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}$$

codingjaguar · 2024-11-13T10:19:15Z

Basically, the sparse vectors can be constructed carefully so that, at search time, the dot product of the query sparse vector and the document sparse vector is equivalent to the BM25 equation.
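One way to see the equivalence, as a sketch that assumes each vocabulary term $t$ gets its own sparse dimension. The symbols $\mathbf{d}$, $\mathbf{q}$ and $c(t, Q)$ (the number of times $t$ occurs in the query) are introduced here for illustration and are not Milvus identifiers:

$$\mathbf{d}_t = \frac{f(t, D) \cdot (k_1 + 1)}{f(t, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})}, \qquad \mathbf{q}_t = c(t, Q) \cdot \text{IDF}(t)$$

$$\langle \mathbf{q}, \mathbf{d} \rangle = \sum_{t} \mathbf{q}_t \, \mathbf{d}_t = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{\text{avgdl}})} = \text{BM25}(D, Q)$$

Terms that do not occur in $D$ have $f(t, D) = 0$ and therefore $\mathbf{d}_t = 0$, and terms that do not occur in $Q$ have $\mathbf{q}_t = 0$, so only the shared terms contribute to the sum.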

Milvus 2.5, which will be released in about a week, adds native BM25 support and accepts text as input (so users don't need to compute the document and query vectors themselves).

xiaofan-luan · 2024-11-14T19:47:30Z

@qianxianyang
Milvus 2.5 natively integrates BM25, cheers!

xiaofan-luan · 2024-11-14T19:48:01Z

Essentially, the corpus vector captures TF, while the query vector captures the query's TF and IDF.
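A minimal Python sketch of this split, for illustration only. The tokenizer, the smoothed IDF variant, and the defaults `K1 = 1.2` and `B = 0.75` are assumptions, and none of the function names come from Milvus. Terms are used directly as sparse dimensions. The sketch checks that the direct BM25 sum and the sparse dot product give the same score:

```python
import math
import re
from collections import Counter

K1, B = 1.2, 0.75  # assumed BM25 parameters (common defaults)


def tokenize(text):
    # Naive lowercase word tokenizer, assumed for this sketch.
    return re.findall(r"\w+", text.lower())


def idf(term, docs):
    # Smoothed IDF variant (an assumption; other variants exist).
    n = sum(1 for d in docs if term in d)
    return math.log(1 + (len(docs) - n + 0.5) / (n + 0.5))


def bm25_direct(query, doc, docs, avgdl):
    # Direct evaluation of the BM25 sum over the query tokens.
    tf = Counter(doc)
    norm = K1 * (1 - B + B * len(doc) / avgdl)
    return sum(idf(q, docs) * tf[q] * (K1 + 1) / (tf[q] + norm) for q in query)


def doc_vector(doc, avgdl):
    # Document side: length-normalized, saturated TF only.
    norm = K1 * (1 - B + B * len(doc) / avgdl)
    return {t: f * (K1 + 1) / (f + norm) for t, f in Counter(doc).items()}


def query_vector(query, docs):
    # Query side: query term count times IDF.
    return {t: c * idf(t, docs) for t, c in Counter(query).items()}


def dot(q, d):
    # Sparse dot product over the terms present in the query vector.
    return sum(w * d.get(t, 0.0) for t, w in q.items())


corpus = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
docs = [tokenize(t) for t in corpus]
avgdl = sum(len(d) for d in docs) / len(docs)
query = tokenize("Who started AI research?")
qv = query_vector(query, docs)

direct_scores = [bm25_direct(query, d, docs, avgdl) for d in docs]
sparse_scores = [dot(qv, doc_vector(d, avgdl)) for d in docs]
for direct, sparse in zip(direct_scores, sparse_scores):
    assert math.isclose(direct, sparse)
```

Note that `avgdl` and the IDF values depend on the whole corpus, so a real system has to keep those statistics up to date as documents are inserted. How Milvus handles that internally is not covered by this sketch.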

xiaofan-luan · 2024-11-14T19:49:34Z

We are about to release performance benchmark results for Milvus 2.5 versus ES; based on current tests, we see roughly a 2-3x performance improvement.

qianxianyang · 2024-11-25T02:00:47Z

Thumbs up, looking forward to the test results.

codingjaguar · 2024-11-26T04:34:27Z

We have released Milvus 2.5 beta with the full text search feature available (https://github.com/milvus-io/milvus/releases/tag/v2.5.0-beta). The detailed documentation will be released soon, but here is a snippet:

from pymilvus import MilvusClient, DataType, Function, FunctionType

schema = MilvusClient.create_schema()

schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)

bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)

schema.add_function(bm25_function)

index_params = MilvusClient.prepare_index_params()

index_params.add_index(
    field_name="sparse",
    index_type="AUTOINDEX", 
    metric_type="BM25"
)

# Assumption: a Milvus 2.5 instance is reachable at this URI; adjust for your deployment.
client = MilvusClient(uri="http://localhost:19530")

client.create_collection(
    collection_name='demo',
    schema=schema,
    index_params=index_params
)

# The "sparse" field is filled in by the BM25 function, so only raw text is inserted.
client.insert('demo', [
    {'text': 'Artificial intelligence was founded as an academic discipline in 1956.'},
    {'text': 'Alan Turing was the first person to conduct substantial research in AI.'},
    {'text': 'Born in Maida Vale, London, Turing was raised in southern England.'},
])

search_params = {
    'params': {'drop_ratio_search': 0.6},
}

# The query is passed as raw text rather than as a precomputed vector.
client.search(
    collection_name='demo',
    data=['Who started AI research?'],
    anns_field='sparse',
    limit=3,
    search_params=search_params
)

Feel free to check it out!

KylinMountain · 2024-12-13T11:21:17Z

@codingjaguar How do I use BM25 hybrid retrieval in LlamaIndex? I see that the default is BGE-M3. These two documents left me confused.

wxywb · 2024-12-13T12:55:31Z

> @codingjaguar How do I use BM25 hybrid retrieval in LlamaIndex? I see that the default is BGE-M3. These two documents left me confused:
>
> https://docs.llamaindex.ai/en/stable/examples/vector_stores/MilvusHybridIndexDemo/
>
> https://milvus.io/docs/embed-with-bm25.md

If you want to use BM25 with the current LlamaIndex hybrid search implementation, you need to implement a BM25 embedding function as a thin wrapper around milvus-model's BM25EmbeddingFunction. Here are the suggested steps:

1. Fit milvus-model's BM25 on the dataset you are interested in, or load the default parameters.
2. Write a wrapper as shown in the first doc.
3. Pass this wrapper class to the hybrid retriever.
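The shape of these steps can be sketched as follows. This is a self-contained illustration only: `ToySparseModel` is a hypothetical stand-in that produces plain term counts (not real BM25 weights) so that the example runs without milvus-model or LlamaIndex installed, and `BM25SparseWrapper` is a hypothetical name. The assumption, based on the first doc, is that the retriever calls `encode_queries` and `encode_documents` and expects one `{dimension: weight}` dict per input text.

```python
from collections import Counter
from typing import Dict, List


class ToySparseModel:
    """Hypothetical stand-in for a fitted sparse model (term counts, not real BM25)."""

    def __init__(self) -> None:
        self.vocab: Dict[str, int] = {}

    def fit(self, corpus: List[str]) -> None:
        # Step 1: learn the vocabulary. A real BM25 model would also learn
        # IDF values and the average document length here.
        for text in corpus:
            for token in text.lower().split():
                self.vocab.setdefault(token, len(self.vocab))

    def encode(self, texts: List[str]) -> List[Dict[int, float]]:
        rows = []
        for text in texts:
            counts = Counter(t for t in text.lower().split() if t in self.vocab)
            rows.append({self.vocab[t]: float(c) for t, c in counts.items()})
        return rows


class BM25SparseWrapper:
    """Step 2: thin wrapper exposing the two methods the retriever is assumed to call."""

    def __init__(self, model: ToySparseModel) -> None:
        self.model = model

    def encode_queries(self, queries: List[str]) -> List[Dict[int, float]]:
        return self.model.encode(queries)

    def encode_documents(self, documents: List[str]) -> List[Dict[int, float]]:
        return self.model.encode(documents)


model = ToySparseModel()
model.fit(["alan turing researched ai", "turing was born in london"])
wrapper = BM25SparseWrapper(model)

# Step 3 (not executed here): hand `wrapper` to the hybrid retriever
# in the way the first doc describes.
query_vecs = wrapper.encode_queries(["turing ai"])
```

In a real setup the stand-in would be replaced by the fitted milvus-model object, and the wrapper would convert that model's output into the dict form above.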

Since Milvus 2.5 has a native BM25 implementation (Full Text Search), LlamaIndex's hybrid retriever will also be upgraded to use it by default. Please stay tuned.

KylinMountain · 2024-12-14T00:10:11Z

@wxywb Thanks, I have implemented the BM25 function for LlamaIndex.

KylinMountain · 2024-12-14T00:28:01Z

@wxywb But I have another question about hybrid search:

retriever = vector_index.as_retriever(vector_store_query_mode="hybrid", filters=filters, similarity_top_k=2)

Some websites say hybrid search will return 2 nodes for BM25 and 2 nodes for vector search, so why does Milvus hybrid search return only 2 nodes?

I have debugged it, and it does indeed call hybrid search in Milvus.

KylinMountain · 2024-12-14T00:29:23Z

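> We are about to release performance benchmark results for Milvus 2.5 versus ES; based on current tests, we see roughly a 2-3x performance improvement.

@xiaofan-luan Do we get this speedup just by using the BM25 in milvus-model?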

codingjaguar · 2024-12-14T02:38:48Z

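> > We are about to release performance benchmark results for Milvus 2.5 versus ES; based on current tests, we see roughly a 2-3x performance improvement.
>
> @xiaofan-luan Do we get this speedup just by using the BM25 in milvus-model?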

Right, simply using

bm25_function = Function(
    name="text_bm25_emb", # Function name
    input_field_names=["text"], # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"], # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)
schema.add_function(bm25_function)

codingjaguar · 2024-12-14T02:40:49Z

> @wxywb But I have another question about hybrid search:
> retriever = vector_index.as_retriever(vector_store_query_mode="hybrid", filters=filters, similarity_top_k=2)
> Some websites say hybrid search will return 2 nodes for BM25 and 2 nodes for vector search, so why does Milvus hybrid search return only 2 nodes?
>
> I have debugged it, and it does indeed call hybrid search in Milvus.

cc @zc277584121 to take a look.

wxywb · 2024-12-14T03:14:53Z

> @wxywb But I have another question about hybrid search:
> retriever = vector_index.as_retriever(vector_store_query_mode="hybrid", filters=filters, similarity_top_k=2)
> Some websites say hybrid search will return 2 nodes for BM25 and 2 nodes for vector search, so why does Milvus hybrid search return only 2 nodes?
>
> I have debugged it, and it does indeed call hybrid search in Milvus.

Milvus's hybrid search uses the RRF ranker to rerank the results. RRF (Reciprocal Rank Fusion) is a method for producing a single list that takes both the sparse and the dense results into account, so the two result sets are fused into one ranked list rather than returned side by side.
https://github.com/run-llama/llama_index/blob/095d410249f6bd8e571275993b418af688ca2daf/llama-index-integrations/vector_stores/llama-index-vector-stores-milvus/llama_index/vector_stores/milvus/base.py#L748C1-L755C63
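To illustrate why only 2 nodes come back, here is a minimal pure-Python sketch of Reciprocal Rank Fusion. It is not the code path used by LlamaIndex or Milvus; the function name `rrf_fuse`, the node ids, and the smoothing constant `k=60` (a commonly used value) are assumptions made for this example.

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, limit=2):
    """Fuse several ranked id lists into one list using Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, node_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[node_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    fused = sorted(scores, key=lambda n: (-scores[n], n))
    return fused[:limit]


dense_hits = ["n1", "n2"]   # hypothetical top-2 ids from the dense search
sparse_hits = ["n3", "n1"]  # hypothetical top-2 ids from the sparse (BM25) search
top = rrf_fuse([dense_hits, sparse_hits], k=60, limit=2)
```

Here `n1` appears in both lists and gets the highest fused score, and the single fused list is then cut to `limit` entries, so the caller sees 2 nodes in total rather than 2 per search.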

wxywb · 2024-12-14T03:16:29Z

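> > We are about to release performance benchmark results for Milvus 2.5 versus ES; based on current tests, we see roughly a 2-3x performance improvement.
>
> @xiaofan-luan Do we get this speedup just by using the BM25 in milvus-model?

Actually, the speedup comes from Milvus 2.5's native BM25, not from BM25EmbeddingFunction in milvus-model.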
