splitText function is too long #140
Comments
I had similar problems in the API part a while ago. A big optimization is to call encode once, do the splitting directly on the tokens, then decode each chunk.

But even beyond that, I don't know if we still need an algorithm this complex. Right now it tries as much as possible to cut between paragraphs first, lines second, sentences third, etc., while keeping the chunks as even as possible. Maybe we could just do the same thing as in the API and cut at the chunkSize limit, or at least enforce a sentence rule (split at every full stop, encode, merge sentences until they reach the chunk size, and decode every chunk).
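The "encode once, split on tokens, decode each chunk" idea could be sketched roughly like this. Note that `encode` and `decode` below are toy whitespace-based stand-ins for a real tokenizer (e.g. a tiktoken-style encoder), not polyfire-js APIs:

```typescript
// Toy tokenizer stand-ins: a real implementation would use an actual
// BPE encoder and token IDs. These exist only to make the sketch runnable.
function encode(text: string): string[] {
  return text.split(/\s+/).filter((t) => t.length > 0);
}

function decode(tokens: string[]): string {
  return tokens.join(" ");
}

// Encode the whole text once, slice the token array into chunkSize
// windows, and decode each window back into a string chunk.
function splitText(text: string, chunkSize: number): string[] {
  const tokens = encode(text); // single encode call, instead of one per chunk
  const chunks: string[] = [];
  for (let i = 0; i < tokens.length; i += chunkSize) {
    chunks.push(decode(tokens.slice(i, i + chunkSize)));
  }
  return chunks;
}
```

The point of the optimization is that tokenization runs once over the full input; the per-chunk work is just array slicing and decoding.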
Makes sense
…On Fri, Jul 5 2024 at 6:33 AM, Lancelot Owczarczak < ***@***.*** > wrote:
I feel like it's something we needed during the autodoc era but isn't really relevant anymore.
I've noticed that the `splitText` function is running pretty slow. When called on its own, it takes about 150 to 300 milliseconds. But when it's used on a whole list of transcripts in the frontend, it takes far too long and really slows down the app. We need `splitText` to run faster, even with a big list of transcripts, to keep the app running smoothly.
As a quick fix, I've switched to using `TokenTextSplitter` from the langchain library, which is a lot faster for my needs. But this is just a temporary solution, and it would be great to have a more permanent fix in the polyfire-js library.