invalid input for sparse float vector #35
Comments
"bm25_msmarco_v1.json" is only for English corpus, you need to fit parameters on your own documents. Here is code example from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer
from pymilvus.model.sparse import BM25EmbeddingFunction
from pymilvus import MilvusClient, DataType
# Build a Chinese analyzer; BM25 statistics will be fit on the corpus below
analyzer = build_default_analyzer(language="zh")
docs = [
    "无机预涂板是一种具有优良性能的环保材料,常被应用于防火、抗菌、耐化学腐蚀等领域。",
    "无机预涂板以其卓越的耐火性、抗菌性和易维护性,被广泛应用于各类建筑场景。",
    "无机预涂板拥有防火、耐腐蚀、易清洁等特点,成为现代建筑中环保材料的首选。",
    "无机预涂板兼具环保和实用性,具有防火、抗菌、耐酸碱等多种优异性能。",
    "无机预涂板由于其出色的耐火性能、抗菌功能和环保特性,广泛应用于医院、实验室等场所。"
]
# Fit BM25 term statistics on the corpus, then encode the documents and the query
bm25_ef = BM25EmbeddingFunction(analyzer)
bm25_ef.fit(docs)
docs_embeddings = bm25_ef.encode_documents(docs)

query = '无机预涂板有耐火性吗?'
query_embeddings = bm25_ef.encode_queries([query])
client = MilvusClient(uri='test.db')
schema = client.create_schema(
    auto_id=True,
    enable_dynamic_field=True,
)
schema.add_field(field_name="pk", datatype=DataType.VARCHAR, is_primary=True, max_length=100)
schema.add_field(field_name="sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535)
index_params = client.prepare_index_params()
client.create_collection(collection_name="test_sparse_vector", schema=schema)
index_params.add_index(
    field_name="sparse_vector",
    index_name="sparse_inverted_index",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="IP",
)
# Create index
client.create_index(collection_name="test_sparse_vector", index_params=index_params)
search_params = {
    "metric_type": "IP",
    "params": {}
}
# Insert one entity per document; docs_embeddings[[i]] is a 1 x dim sparse row
for i in range(len(docs)):
    entity = {'sparse_vector': docs_embeddings[[i]], 'text': docs[i]}
    client.insert(collection_name="test_sparse_vector", data=entity)
results = client.search(collection_name="test_sparse_vector", data=query_embeddings[[0]], output_fields=['text'], search_params=search_params)
print(results)
```
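As a follow-up note, the fitted BM25 parameters can be persisted so later runs do not need to call fit() again. A minimal sketch continuing the example above, assuming a local path such as "bm25_zh_params.json" (the file name is illustrative):

```python
from pymilvus.model.sparse import BM25EmbeddingFunction
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer

# Persist the parameters fitted in the example above (file name is illustrative)
bm25_ef.save("bm25_zh_params.json")

# In a later run: rebuild the analyzer, load the saved parameters,
# and encode queries without refitting the whole corpus
new_bm25_ef = BM25EmbeddingFunction(build_default_analyzer(language="zh"))
new_bm25_ef.load("bm25_zh_params.json")
query_embeddings = new_bm25_ef.encode_queries(["无机预涂板有耐火性吗?"])
```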
Documents are dynamically added to Milvus and number more than 1 million. Do I have to fully fit all documents every time I execute a BM25 query?
Although it is mathematically correct that BM25 should be fit on all inserted documents, a more practical approach is to fit the parameters once on a representative sample of the corpus and reuse them, as in the sketch below.
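A rough sketch of that kind of approach (not the maintainer's exact code; `all_docs` and `sample_size` are placeholders): fit BM25 once on a random sample of the corpus, save the parameters, and reuse them for encoding instead of refitting on every query.

```python
import random

from pymilvus.model.sparse import BM25EmbeddingFunction
from pymilvus.model.sparse.bm25.tokenizers import build_default_analyzer

# Placeholder corpus: replace with your real 1M+ documents
all_docs = ["文档一 ...", "文档二 ...", "文档三 ..."]

sample_size = 50_000  # illustrative value; tune for your corpus
sample_docs = random.sample(all_docs, min(sample_size, len(all_docs)))

bm25_ef = BM25EmbeddingFunction(build_default_analyzer(language="zh"))
bm25_ef.fit(sample_docs)          # fit term statistics once, on the sample only
bm25_ef.save("bm25_params.json")  # reuse in later runs instead of refitting
```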
These documents take up about 32 GB of memory. I need to load them all into memory, then execute …
Yes, currently there are no incremental updates for BM25; this is planned. Milvus will also support native BM25, please stay tuned.
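For reference, the native BM25 mentioned here later shipped in Milvus 2.5+ as a schema-level Function: documents and queries are passed as raw text and the server computes the sparse BM25 vectors. A rough sketch based on that newer API (collection, field, and function names are illustrative; check the current documentation):

```python
from pymilvus import MilvusClient, DataType, Function, FunctionType

client = MilvusClient(uri="http://localhost:19530")

schema = client.create_schema()
schema.add_field(field_name="pk", datatype=DataType.INT64, is_primary=True, auto_id=True)
schema.add_field(field_name="text", datatype=DataType.VARCHAR, max_length=65535, enable_analyzer=True)
schema.add_field(field_name="sparse_vector", datatype=DataType.SPARSE_FLOAT_VECTOR)

# The server derives BM25 sparse vectors for `sparse_vector` from `text`
schema.add_function(Function(
    name="text_bm25",
    input_field_names=["text"],
    output_field_names=["sparse_vector"],
    function_type=FunctionType.BM25,
))

index_params = client.prepare_index_params()
index_params.add_index(
    field_name="sparse_vector",
    index_type="SPARSE_INVERTED_INDEX",
    metric_type="BM25",
)
client.create_collection("test_native_bm25", schema=schema, index_params=index_params)

client.insert(collection_name="test_native_bm25",
              data=[{"text": "无机预涂板拥有防火、耐腐蚀、易清洁等特点。"}])

# Queries are plain text; no client-side BM25 encoding is needed
results = client.search(
    collection_name="test_native_bm25",
    data=["无机预涂板有耐火性吗?"],
    anns_field="sparse_vector",
    output_fields=["text"],
    limit=3,
)
print(results)
```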
code:
traceback output:
What's the reason? How to solve it?