I would like to know whether `add_to_index` can add new documents to an already existing index without recalculating the embeddings for all previously indexed documents.
I'm not sure if I'm doing something wrong, but each time I add a new document, the embeddings for all already-indexed documents are regenerated, which makes the process increasingly slow as the index grows.
What I would like to do is index documents one by one using `add_to_index`, since I don't want to hold 100k documents in memory. Is this possible?
(I'm aware that the `add_to_index` function is still experimental, but I would appreciate knowing if I'm missing something in my approach.)
My code:
```python
def load_rag(self, index_name):
    index_path = f".ragatouille/colbert/indexes/{index_name}/"
    return RAGPretrainedModel.from_index(index_path)

def add_document(self, index_name, chunks, document_id, url):
    try:
        # Load the existing index and append the new chunks to it.
        RAG = self.load_rag(index_name)
        RAG.add_to_index(
            chunks,
            new_document_metadatas=[{"url": url, "document_id": document_id}] * len(chunks),
            split_documents=False,
        )
    except FileNotFoundError:
        # No index yet: build one from scratch with the first document's chunks.
        logging.info(f"🔔 There are no documents in the index {index_name}, the index will be created")
        RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
        RAG.index(
            collection=chunks,
            document_metadatas=[{"url": url, "document_id": document_id}] * len(chunks),
            index_name=index_name,
            split_documents=False,
        )
```
Logs of individual appending (recalculation of embeddings):