Individual documents appending with add_to_index #248

aaraya-rr · 2024-09-13T15:28:39Z

I would like to know if it is possible for add_to_index to allow adding new documents to an already existing index without having to recalculate the embeddings for all the previously indexed documents.

I’m not sure if I’m doing something wrong, but each time I add a new document, the embeddings for all the already indexed documents are regenerated, so the cost of every insertion grows with the size of the index and the process scales poorly.

What I would like to do is index documents one by one using add_to_index, since I don’t want to have 100k documents in memory. Is this possible?

(I’m aware that the add_to_index function is still experimental, but I would appreciate knowing if I’m missing something in my approach.)

My code:

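    import logging

    from ragatouille import RAGPretrainedModel

    # The two functions below are methods of a wrapper class (definition omitted).

    def load_rag(self, index_name):
        # Load an existing index from RAGatouille's default on-disk location.
        index_path = f".ragatouille/colbert/indexes/{index_name}/"
        return RAGPretrainedModel.from_index(index_path)

    def add_document(self, index_name, chunks, document_id, url):
        # One metadata entry per chunk, all pointing at the same source document.
        metadatas = [{"url": url, "document_id": document_id}] * len(chunks)
        try:
            # Append to the index if it already exists on disk.
            RAG = self.load_rag(index_name)
            RAG.add_to_index(chunks, new_document_metadatas=metadatas, split_documents=False)
        except FileNotFoundError:
            # No index yet: build it from the first document's chunks.
            logging.info(f"🔔 There are no documents in the index {index_name}, the index will be created")
            RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
            RAG.index(
                collection=chunks,
                document_metadatas=metadatas,
                index_name=index_name,
                split_documents=False,
            )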
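To make the memory constraint concrete, here is a minimal sketch of the kind of streaming I have in mind. It only groups chunks from a lazy source into fixed-size batches, so that at most one batch is held in memory at a time. The names `batched`, `index_in_batches` and `add_batch` are hypothetical helpers of mine, not part of RAGatouille, and this does not by itself stop the embeddings from being recalculated on each call; it would only reduce how many calls are made.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to batch_size items without materialising the whole input."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def index_in_batches(
    chunks: Iterable[str],
    batch_size: int,
    add_batch: Callable[[List[str]], None],
) -> int:
    """Pass each batch to add_batch (a hypothetical callback) and return the call count."""
    calls = 0
    for batch in batched(chunks, batch_size):
        add_batch(batch)
        calls += 1
    return calls


if __name__ == "__main__":
    # Stand-in callback that just records batch sizes; in my setup it would
    # wrap the add_to_index call shown above (an assumption, not tested here).
    sizes: List[int] = []
    lazy_chunks = (f"chunk {i}" for i in range(10))
    index_in_batches(lazy_chunks, 4, lambda batch: sizes.append(len(batch)))
    print(sizes)
```

With a real index, `add_batch` would be a small function that calls `RAG.add_to_index` with the batch and its matching metadata.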

Logs from appending documents one at a time (note that all passages are encoded again on each call):

[Sep 12, 18:56:39] [0] 		 #> Encoding 1164 passages..
[Sep 12, 18:56:45] [0] 		 avg_doclen_est = 208.9011993408203 	 len(local_sample) = 1,164
[Sep 12, 18:56:45] [0] 		 Creating 4,096 partitions.
[Sep 12, 18:56:45] [0] 		 *Estimated* 243,160 embeddings.
[Sep 12, 18:56:45] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/colbert_debug_chunks/plan.json ..

...

[Sep 12, 18:57:02] [0] 		 #> Encoding 1173 passages..
[Sep 12, 18:57:08] [0] 		 avg_doclen_est = 208.96163940429688 	 len(local_sample) = 1,173
[Sep 12, 18:57:08] [0] 		 Creating 4,096 partitions.
[Sep 12, 18:57:08] [0] 		 *Estimated* 245,112 embeddings.
[Sep 12, 18:57:08] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/colbert_debug_chunks/plan.json ..
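A quick check of the numbers in these logs, as a sketch: the logged estimates are consistent with the passage count multiplied by `avg_doclen_est` and truncated to an integer. That formula is my assumption about how the estimate is derived, not something the logs state. If it holds, each call is planning an index over the entire collection (1,164 and then 1,173 passages), not just over the newly added chunks.

```python
# Values copied from the two log excerpts above.
runs = [
    {"passages": 1164, "avg_doclen_est": 208.9011993408203, "logged_estimate": 243_160},
    {"passages": 1173, "avg_doclen_est": 208.96163940429688, "logged_estimate": 245_112},
]


def estimated_embeddings(passages: int, avg_doclen_est: float) -> int:
    # Assumption: estimate = number of passages * average passage length, truncated.
    return int(passages * avg_doclen_est)


for run in runs:
    estimate = estimated_embeddings(run["passages"], run["avg_doclen_est"])
    matches = estimate == run["logged_estimate"]
    print(f"{run['passages']} passages -> {estimate:,} embeddings (matches log: {matches})")

# Passages added between the two excerpts, yet every passage was encoded again.
new_passages = runs[1]["passages"] - runs[0]["passages"]
print(f"new passages between the two excerpts: {new_passages}")
```

Both estimates reproduce the logged values, and only 9 passages were added between the two excerpts.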
