splitText function is too long #140
Comments
I had similar problems in the API part a while ago. A big optimization is to call encode once, do the splitting directly on the tokens, then decode each chunk.

But even beyond that, I don't know if we still need an algorithm this complex. Right now it tries as much as possible to cut between paragraphs first, lines second, sentences third, etc., while keeping the chunks as even as possible. Maybe we could just do the same thing as in the API and cut at the chunkSize limit, or at least enforce a sentence rule (split at every full stop, encode, merge sentences until they reach the chunk size, and decode every chunk).
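The "encode once, split on tokens, decode each chunk" idea could be sketched roughly like this. Note that `encode` and `decode` below are toy whitespace-based stand-ins for a real tokenizer (e.g. a tiktoken-style encoder), not polyfire-js APIs:

```typescript
// Toy tokenizer stand-ins: a real implementation would use an actual
// BPE encoder and token IDs. These exist only to make the sketch runnable.
function encode(text: string): string[] {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

function decode(tokens: string[]): string {
  return tokens.join(" ");
}

// Encode the whole text once, slice the token array into chunkSize
// windows, and decode each window back into a string chunk.
function splitText(text: string, chunkSize: number): string[] {
  const tokens = encode(text); // single encode call, instead of one per chunk
  const chunks: string[] = [];
  for (let i = 0; i < tokens.length; i += chunkSize) {
    chunks.push(decode(tokens.slice(i, i + chunkSize)));
  }
  return chunks;
}
```

The point of the optimization is that tokenization runs once over the full input; the per-chunk work is just array slicing and decoding.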
Makes sense
…On Fri, Jul 5 2024 at 6:33 AM, Lancelot Owczarczak < ***@***.*** > wrote:
I feel like it's something we needed during the autodoc era but isn't really relevant anymore.
I've noticed that the `splitText` function is running pretty slow. When called on its own, it takes about 150 to 300 milliseconds. But when it's used on a whole list of transcripts in the frontend, it takes far too long and really slows down the app. We need `splitText` to run faster, even with a big list of transcripts, to keep the app running smoothly.
As a quick fix, I've switched to using `TokenTextSplitter` from the langchain library, which is a lot faster for my needs. But this is just a temporary solution, and it would be great to have a more permanent fix in the polyfire-js library.