Individual documents appending with add_to_index #248

Open
aaraya-rr opened this issue Sep 13, 2024 · 0 comments

Is it possible to use add_to_index to add new documents to an existing index without recomputing the embeddings of all previously indexed documents?

I’m not sure if I’m doing something wrong, but each time I add a new document, the embeddings for all the already indexed documents are regenerated, so the cost of every append grows with the size of the index and the process scales poorly.

What I would like to do is index documents one by one using add_to_index, since I don’t want to have 100k documents in memory. Is this possible?

(I’m aware that the add_to_index function is still experimental, but I would appreciate knowing if I’m missing something in my approach.)

My code:

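    import logging

    from ragatouille import RAGPretrainedModel

    # Methods of a helper class (class definition omitted).

    def load_rag(self, index_name):
        # Load the model together with an index that already exists on disk.
        index_path = f".ragatouille/colbert/indexes/{index_name}/"
        return RAGPretrainedModel.from_index(index_path)

    def add_document(self, index_name, chunks, document_id, url):
        # One metadata dict per chunk, all pointing at the same source document.
        metadatas = [{"url": url, "document_id": document_id} for _ in chunks]
        try:
            # Append to the existing index if it can be loaded.
            RAG = self.load_rag(index_name)
            RAG.add_to_index(chunks, new_document_metadatas=metadatas, split_documents=False)
        except FileNotFoundError:
            # No index on disk yet: build it from the first document's chunks.
            logging.info(f"🔔 There are no documents in the index {index_name}, the index will be created")
            RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
            RAG.index(
                collection=chunks,
                document_metadatas=metadatas,
                index_name=index_name,
                split_documents=False,
            )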

Logs from individual appends (the embeddings are recomputed for the whole collection each time):

[Sep 12, 18:56:39] [0] 		 #> Encoding 1164 passages..
[Sep 12, 18:56:45] [0] 		 avg_doclen_est = 208.9011993408203 	 len(local_sample) = 1,164
[Sep 12, 18:56:45] [0] 		 Creating 4,096 partitions.
[Sep 12, 18:56:45] [0] 		 *Estimated* 243,160 embeddings.
[Sep 12, 18:56:45] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/colbert_debug_chunks/plan.json ..

...

[Sep 12, 18:57:02] [0] 		 #> Encoding 1173 passages..
[Sep 12, 18:57:08] [0] 		 avg_doclen_est = 208.96163940429688 	 len(local_sample) = 1,173
[Sep 12, 18:57:08] [0] 		 Creating 4,096 partitions.
[Sep 12, 18:57:08] [0] 		 *Estimated* 245,112 embeddings.
[Sep 12, 18:57:08] [0] 		 #> Saving the indexing plan to .ragatouille/colbert/indexes/colbert_debug_chunks/plan.json ..
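
A possible workaround, sketched under the assumption that every add_to_index call really does re-encode the whole collection (as the logs above suggest): buffer the chunks of several documents and append them in one call per batch, so the re-encoding happens once per batch instead of once per document. This does not avoid the recomputation itself, it only reduces how often it runs. `BufferedIndexWriter`, `flush_fn`, and `batch_size` are hypothetical names for illustration and are not part of RAGatouille.

```python
from typing import Callable, Dict, List


class BufferedIndexWriter:
    """Collects chunks in memory and hands them to a callback in batches.

    Hypothetical helper: flush_fn is any callable that takes
    (chunks, metadatas) and writes them to the index.
    """

    def __init__(
        self,
        flush_fn: Callable[[List[str], List[Dict[str, str]]], None],
        batch_size: int = 1000,
    ) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush_fn = flush_fn
        self._batch_size = batch_size
        self._chunks: List[str] = []
        self._metadatas: List[Dict[str, str]] = []

    def add_document(self, chunks: List[str], document_id: str, url: str) -> None:
        # Keep one metadata dict per chunk, aligned by position with the chunk list.
        for chunk in chunks:
            self._chunks.append(chunk)
            self._metadatas.append({"url": url, "document_id": document_id})
        # Flush once enough chunks have accumulated.
        if len(self._chunks) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        # Nothing buffered: do not call the (expensive) callback.
        if not self._chunks:
            return
        chunks, metadatas = self._chunks, self._metadatas
        self._chunks, self._metadatas = [], []
        self._flush_fn(chunks, metadatas)
```

Here `flush_fn` could wrap the same `RAG.add_to_index(chunks, new_document_metadatas=metadatas, split_documents=False)` call used in my code above, with a final `flush()` after the last document. Only one batch of chunks is held in memory at a time, so `batch_size` trades memory use against the number of appends.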