Questions about the BM25 implementation #46
Comments
@qianxianyang |
Essentially, the corpus vector captures TF, while the query vector captures the query's TF and IDF. |
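Concretely, one way to read that decomposition is sketched below (a toy illustration, not Milvus internals; k1, b, the corpus, and the IDF variant are illustrative assumptions):

import math
from collections import Counter

# The document-side sparse vector carries the saturated, length-normalized TF term;
# the query-side sparse vector carries query TF * IDF. Their inner product then
# reproduces a BM25-style score.
k1, b = 1.5, 0.75
corpus = [doc.lower().split() for doc in [
    "artificial intelligence was founded as an academic discipline",
    "alan turing was the first person to conduct substantial research in ai",
]]
avgdl = sum(len(d) for d in corpus) / len(corpus)
df = Counter(t for d in corpus for t in set(d))
N = len(corpus)

def doc_vector(doc):
    # TF component with BM25 saturation and document-length normalization
    tf = Counter(doc)
    return {t: f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
            for t, f in tf.items()}

def query_vector(query):
    # query TF * IDF component
    tf = Counter(query.lower().split())
    return {t: f * math.log(1 + (N - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5))
            for t, f in tf.items()}

def score(query, doc):
    qv, dv = query_vector(query), doc_vector(doc)
    return sum(qv[t] * dv.get(t, 0.0) for t in qv)  # inner product == BM25-style score

print(score("who started ai research", corpus[1]))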
We will soon publish performance benchmark results for Milvus 2.5 versus ES; based on the current tests, there should be roughly a 2-3x performance improvement. |
Nice, looking forward to the test results. |
We have released Milvus 2.5 beta with the full text search feature available (https://github.com/milvus-io/milvus/releases/tag/v2.5.0-beta). The detailed documentation will be released soon, but here is a snippet:

from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri="http://localhost:19530")  # example URI; point this at your Milvus instance

schema = client.create_schema()
schema.add_field(field_name="id", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=1000, enable_analyzer=True)
schema.add_field(field_name="sparse", datatype=DataType.SPARSE_FLOAT_VECTOR)

bm25_function = Function(
    name="text_bm25_emb",  # Function name
    input_field_names=["text"],  # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"],  # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)
schema.add_function(bm25_function)

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="sparse",
    index_type="AUTOINDEX",
    metric_type="BM25"
)

client.create_collection(
    collection_name='demo',
    schema=schema,
    index_params=index_params
)

client.insert('demo', [
    {'text': 'Artificial intelligence was founded as an academic discipline in 1956.'},
    {'text': 'Alan Turing was the first person to conduct substantial research in AI.'},
    {'text': 'Born in Maida Vale, London, Turing was raised in southern England.'},
])

search_params = {
    'params': {'drop_ratio_search': 0.6},
}
client.search(
    collection_name='demo',
    data=['Who started AI research?'],
    anns_field='sparse',
    limit=3,
    search_params=search_params
)

Feel free to check it out! |
@codingjaguar How do I use BM25 hybrid retrieval in LlamaIndex? I see the default is BGE-M3. These two docs have left me confused. |
If you want to use BM25 in the current LlamaIndex hybrid search implementation, you need to implement a sparse embedding function using milvus-model's BM25EmbeddingFunction (a thin wrapper). Here are the suggested steps.
Since Milvus 2.5 has a native BM25 implementation (Full Text Search), LlamaIndex's hybrid retriever will also be upgraded to use it as the default. Please stay tuned. |
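A rough sketch of the milvus-model piece, assuming pymilvus[model] is installed (the LlamaIndex wrapper class and its wiring are omitted; the corpus below is just an example):

from pymilvus.model.sparse import BM25EmbeddingFunction
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer

analyzer = build_default_analyzer(language="en")
bm25_ef = BM25EmbeddingFunction(analyzer)

corpus = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
]
bm25_ef.fit(corpus)  # learn IDF statistics from your corpus

doc_vectors = bm25_ef.encode_documents(corpus)                      # sparse document embeddings
query_vectors = bm25_ef.encode_queries(["Who started AI research?"])  # sparse query embeddings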
@wxywb thanks, I have implemented the BM25 function for LlamaIndex. |
@wxywb But I have another question about hybrid search:
Some websites say hybrid search will return 2 nodes from BM25 and 2 nodes from vector search, so why does Milvus hybrid search only return 2 nodes? I have debugged it and it does call hybrid search in Milvus.
@xiaofan-luan Do I get this speedup simply by using the BM25 from milvus-model? |
Right, simply using:

bm25_function = Function(
    name="text_bm25_emb",  # Function name
    input_field_names=["text"],  # Name of the VARCHAR field containing raw text data
    output_field_names=["sparse"],  # Name of the SPARSE_FLOAT_VECTOR field reserved to store generated embeddings
    function_type=FunctionType.BM25,
)
schema.add_function(bm25_function) |
cc @zc277584121 to take a look. |
Milvus's hybrid search uses the RRF Ranker to rerank them. RRF is a method that produces a single fused list, taking both the sparse and the dense results into account. |
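A hedged sketch of what that looks like with pymilvus, assuming a collection like 'demo' above that also has a dense vector field (here called "dense"; the URI and dense query vector are placeholders):

from pymilvus import MilvusClient, AnnSearchRequest, RRFRanker

client = MilvusClient(uri="http://localhost:19530")  # adjust to your deployment

dense_req = AnnSearchRequest(
    data=[[0.1, 0.2, 0.3]],             # placeholder dense query embedding
    anns_field="dense",
    param={"metric_type": "IP"},
    limit=2,
)
sparse_req = AnnSearchRequest(
    data=["Who started AI research?"],  # raw text; the BM25 function embeds it server-side
    anns_field="sparse",
    param={"metric_type": "BM25"},
    limit=2,
)

# RRF fuses the two ranked lists: score(d) = sum_i 1 / (k + rank_i(d)),
# so the final result is one merged top-k list, not 2 + 2 separate nodes.
results = client.hybrid_search(
    collection_name="demo",
    reqs=[dense_req, sparse_req],
    ranker=RRFRanker(k=60),
    limit=2,
)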
Actually, it is Milvus 2.5's native BM25, not the BM25EmbeddingFunction in milvus-model. |
Hello,
When implementing BM25, Milvus presumably embeds each document by treating the current document as the query and the remaining documents as docs. For the real query, an embedding vector is obtained via IDF, and the inner product of the two vectors is ultimately used as the similarity score.
This approach is still not quite the same as the original BM25 formula.
May I ask what the motivation behind this implementation is, and how the different implementations compare in performance?
BM25 formula:
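For reference, the standard BM25 scoring formula (presumably what the attached image showed; f(t,d) is the frequency of term t in document d, |d| the document length, avgdl the average document length, N the number of documents, and n_t the number of documents containing t):

\mathrm{BM25}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{f(t, d)\,(k_1 + 1)}{f(t, d) + k_1 \left(1 - b + b \cdot \frac{|d|}{\mathrm{avgdl}}\right)}, \qquad \mathrm{IDF}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)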