Add a Recursive Chunking strategy #8548

davidsbatista · 2024-11-15T09:48:48Z

Use a set of predefined separators to split text recursively. The process follows these steps:

1. It starts with a list of separator characters, typically ordered from most to least specific (e.g., ["\n\n", "\n", " ", ""]).
2. The splitter attempts to divide the text using the first separator ("\n\n" in this case).
3. If the resulting chunks are still larger than the specified chunk size, it moves to the next separator in the list ("\n").
4. This process continues recursively, using progressively less specific separators until the chunks meet the desired size criteria.
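
A minimal sketch of the steps above, not the component's actual implementation. The function name `recursive_split` and its parameters are hypothetical, chunk size is measured in characters, separators are dropped from the output, and small pieces are not merged back together, all of which a real splitter would likely handle differently.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split `text` into chunks of at most `chunk_size` characters (hypothetical helper)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left, or the empty separator: cut every `chunk_size` characters.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks: List[str] = []
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by consecutive separators
        if len(piece) <= chunk_size:
            chunks.append(piece)
        else:
            # Piece is still too large: retry with the next, less specific separator.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    return chunks


chunks = recursive_split("aaaa bbbb\n\ncccccccccc dd", ["\n\n", "\n", " ", ""], chunk_size=5)
print(chunks)  # ['aaaa', 'bbbb', 'ccccc', 'ccccc', 'dd']
```

Here the paragraph separator is tried first, the word separator handles the pieces that are still too long, and the empty separator acts as a last-resort hard cut.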

sjrl · 2024-11-20T08:02:00Z

@davidsbatista This sounds great! One idea I had for this is some way to indicate that we'd like to utilize something like NLTK to do sentence splitting. So normally I think the list of separator characters would look like ["\n\n", ".", " "] to accomplish splitting by paragraph, then sentence, and then by word. And I was wondering if we could replace "." with something like "nltk" or some other tag to indicate we'd like to use a separate algorithm to handle the splitting.

What do you think?

sjrl · 2024-11-20T08:03:23Z

Also, I wanted to ask: will the splitting by separators (e.g. ["\n\n", ".", " "]) be handled using a regex splitter? I think supporting regex would be great, so we could provide more complicated separators to better handle complex documents and do things like header detection.
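
To illustrate the header-detection idea, here is a small sketch using Python's standard `re.split`. The helper name `split_on_pattern` and the Markdown-style header pattern are assumptions chosen for the example, not something the component is known to support.

```python
import re
from typing import List


def split_on_pattern(text: str, pattern: str) -> List[str]:
    """Split `text` wherever the regex `pattern` matches, dropping empty pieces."""
    return [piece for piece in re.split(pattern, text) if piece]


# Assumed pattern: a newline that is immediately followed by a Markdown-style header.
# The lookahead keeps the header text attached to the section it introduces.
HEADER_BOUNDARY = r"\n(?=#{1,6} )"

doc = "# Title\nintro text\n## Section\nbody"
print(split_on_pattern(doc, HEADER_BOUNDARY))
# ['# Title\nintro text', '## Section\nbody']
```

A plain string separator cannot express "split before every header", which is the kind of case a regex separator would cover.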

davidsbatista · 2024-11-20T11:09:54Z

That's a good suggestion, I will take it into consideration.

davidsbatista · 2024-11-29T16:30:03Z

> @davidsbatista This sounds great! One idea I had for this is some way to indicate that we'd like to utilize something like NLTK to do sentence splitting. So normally I think the list of separator characters would look like ["\n\n", ".", " "] to accomplish splitting by paragraph, then sentence, and then by word. And I was wondering if we could replace "." with something like "nltk" or some other tag to indicate we'd like to use a separate algorithm to handle the splitting.
>
> What do you think?

I would suggest using "sentence" and we use NLTK's sent_tokenize(text), but I now noticed that @vblagoje implemented something more robust.
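
One way the "sentence" tag could be wired in, as a rough sketch only. The registry `SPECIAL_SEPARATORS`, the helper `split_with`, and the regex-based `naive_sentence_split` are all hypothetical stand-ins; in practice the handler would be NLTK's `sent_tokenize` or the more robust SentenceSplitter mentioned above.

```python
import re
from typing import Callable, Dict, List


def naive_sentence_split(text: str) -> List[str]:
    """Stand-in sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Hypothetical registry: tags that trigger a dedicated splitting algorithm
# instead of being treated as literal separator strings.
SPECIAL_SEPARATORS: Dict[str, Callable[[str], List[str]]] = {
    "sentence": naive_sentence_split,
}


def split_with(separator: str, text: str) -> List[str]:
    """Split `text` by a special tag if one is registered, else by the literal separator."""
    handler = SPECIAL_SEPARATORS.get(separator)
    if handler is not None:
        return handler(text)
    if separator == "":
        return list(text)  # str.split("") raises ValueError, so split per character
    return [piece for piece in text.split(separator) if piece]


print(split_with("sentence", "Hello world. How are you? Fine."))
# ['Hello world.', 'How are you?', 'Fine.']
```

With this shape, a separator list such as ["\n\n", "sentence", " "] needs no special casing in the recursive loop itself.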

I think we could use the SentenceSplitter here, but maybe we can also move it out of that file into some utils package or file so that it can be reused by any component that wants to implement some splitting/chunking technique.

What do you say?

Also, this NLTKDocumentSplitter seems to be an exact copy of the DocumentSplitter except that it uses NLTK's sentence boundary detection algorithm. Maybe we could also merge these two in the future?

sjrl · 2024-12-03T07:27:43Z

> I would suggest using "sentence" and we use NLTK's sent_tokenize(text), but I now noticed that @vblagoje implemented something more robust.

That sounds good to me!

> I think we could use the SentenceSplitter here, but maybe we can also move it out of that file into some utils package or file so that it can be reused by any component that wants to implement some splitting/chunking technique.
>
> What do you say?

Yes I also agree. Let's reuse that and move it into utils.

> Also, this NLTKDocumentSplitter seems to be an exact copy of the DocumentSplitter except that it uses NLTK's sentence boundary detection algorithm. Maybe we could also merge these two in the future?

This is totally correct! I asked the same question here and it does seem like we would like to merge these two in the future. Sounds like we should open an issue for this.

davidsbatista · 2024-12-03T15:39:45Z

I've opened an issue for that one: Unify DocumentSplitter and NLTKDocumentSplitter #8600
I also have a PR to move the SentenceSplitter to its own file, so it is easily reusable: refactor: moving SentenceSplitter outside NLTKDocumentSplitter #8599

davidsbatista self-assigned this Nov 15, 2024

davidsbatista linked a pull request Dec 4, 2024 that will close this issue

feat: add RecursiveSplitter component for Document preprocessing #8605

Open

davidsbatista added this to the 2.9.0 milestone Dec 9, 2024

julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Dec 16, 2024
