Can I delete Vocab entries without reloading the whole model ? #12326

probavee · 2023-02-24T10:04:02Z

Hello!
I have read several issues and discussions about vocabulary growth, and the answer is almost always "reload the model every now and then", but I don't understand why.

Context

I have a Docker container managed by a Kubernetes instance that needs to be always up, with minimal response time.
This service uses nlp.pipe with an infinite generator that yields strings as they arrive.
The flow of the service is straightforward:
API receives a string -> the string is sent to a queue -> the queue appends the string to the infinite generator -> the doc is processed and returned by the API.

Problem

My API can receive anything, so it often receives unknown tokens, which make the vocab grow and eventually lead to an OOM error.
The "reload the model" solution takes 6 seconds (using trf or lg models). So either the RAM usage doubles if I load a new model while the old one is still working, or I have 6 seconds of downtime.

My understanding

Correct me if I have misunderstood, but the reason reloading is recommended is to make sure no doc is still loaded in memory, because all docs processed by the model share the same vocab. Removing some lexemes would mean that tokens in already-processed docs could point to the wrong lexeme.

Questions

Is it a problem if I stop the pipeline (meaning no doc exists in memory, because every doc has either been returned or not yet been processed) and remove the new lexemes? Is there a proper way to do this or, if it is not possible, why not?
Thank you!

adrianeboyd · 2023-02-27T08:49:52Z

Your understanding is correct. There are two growing caches in the vocab, the lexeme cache in nlp.vocab and the string store cache in nlp.vocab.strings. I've looked into alternatives for partially resetting the caches, but because pipeline components can depend on adding an entry once and expecting it to be there in the future (more for the string store than the lexeme cache), once you consider all the details you're basically doing all the same work that you would to reload the model.
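
As a rough illustration of the two caches and of reloading only when needed, the sketch below reads both sizes with `len()` and reloads once the string store has grown past a limit. `cache_sizes`, `ReloadingPipeline`, `load`, and `max_new_strings` are hypothetical names chosen for this example, and the default threshold is arbitrary. It is one possible way to schedule the reload, not an official spaCy mechanism.

```python
import gc
from typing import Any, Callable, Tuple


def cache_sizes(nlp: Any) -> Tuple[int, int]:
    """Return (lexeme cache size, string store size) for a spaCy pipeline."""
    return len(nlp.vocab), len(nlp.vocab.strings)


class ReloadingPipeline:
    """Hypothetical wrapper that reloads once the string store grows too much."""

    def __init__(self, load: Callable[[], Any], max_new_strings: int = 1_000_000):
        self._load = load  # assumed to return a freshly loaded pipeline
        self._max_new_strings = max_new_strings  # arbitrary illustrative limit
        self.nlp = load()
        self._baseline = cache_sizes(self.nlp)[1]

    def new_strings(self) -> int:
        """Number of strings added since the last (re)load."""
        return cache_sizes(self.nlp)[1] - self._baseline

    def reload_if_grown(self) -> bool:
        """Reload when growth exceeds the limit. Call only while no Doc is alive."""
        if self.new_strings() <= self._max_new_strings:
            return False
        # Drop the old pipeline before loading the new one, so the two are
        # not resident at the same time. The cost is downtime during the load.
        self.nlp = None
        gc.collect()
        self.nlp = self._load()
        self._baseline = cache_sizes(self.nlp)[1]
        return True
```

A caller might construct this as `ReloadingPipeline(lambda: spacy.load("en_core_web_lg"))` and call `reload_if_grown()` between requests. Releasing the old pipeline first trades the doubled RAM for the reload downtime described in the question; loading the replacement before swapping would make the opposite trade.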

How often are you running into OOM errors? If this is happening very often (will vary by load obviously, but more than once a day-ish?), it might indicate a separate issue?

If RAM is always very limited or 6 seconds of downtime is an issue, it sounds like it might be worthwhile to consider having multiple servers?
