nlp.pipe() is not faster than processing by example #5431
-
Hello! I need to process incoming texts from one queue and send the results to another queue. Since processing one example at a time is too slow, I considered using the nlp.pipe() function, but as it returns a generator I have to iterate over it to get the actual results, and that doesn't give me any time saving: timing the one-by-one loop and the nlp.pipe() version gives essentially the same run time.
Is this expected, or am I just not using it right?
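Roughly, the two patterns I'm comparing look like this (a simplified sketch; the model name and texts are placeholders, not my actual code):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model
texts = ["An incoming message pulled from the queue."] * 32  # synthetic input

# One text at a time: every call runs the full pipeline on a single doc.
docs_one_by_one = [nlp(text) for text in texts]

# Batched: nlp.pipe() returns a generator, so the work only happens as
# the results are consumed (here, by list()).
docs_batched = list(nlp.pipe(texts))
```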
Replies: 5 comments
-
32 texts probably isn't enough to see a whole lot of benefit from using nlp.pipe(). It does depend a lot on the language/components, but using a much larger number of texts, try comparing the run time just for the spaCy calls, nlp(text) vs. nlp.pipe(), and I think you will see a clearer difference.
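For example, something along these lines, where the model, text, and sizes are just placeholders:

```python
import time

import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model
texts = ["This is a sample sentence to process."] * 10000  # synthetic corpus

# Time the one-by-one loop.
start = time.time()
for text in texts:
    doc = nlp(text)
print("nlp(text), one by one:", time.time() - start)

# Time the batched version; iterating the generator is what triggers the work.
start = time.time()
for doc in nlp.pipe(texts, batch_size=1000):
    pass
print("nlp.pipe():", time.time() - start)
```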
-
Thank you for your reply @adrianeboyd, but I also tried with sizes of 512 and 1024 texts, and the time increased linearly, just as it does without nlp.pipe().
-
Even if the model is suspiciously slow at processing, my main question is still about the difference between batch processing and processing one by one, which is nonexistent in my case.
-
What components do you have in your pipeline?
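You can check with nlp.pipe_names, e.g. (en_core_web_sm is just a placeholder model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # placeholder model
print(nlp.pipe_names)  # e.g. ['tagger', 'parser', 'ner']
print(nlp.pipeline)    # (name, component) pairs, including custom components
```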
-
@adrianeboyd you were right! The problem was my custom BPE tokenizer. It wasn't efficient, so I replaced it with one that supports batch processing, and now I'm so happy to see the difference. THANK YOU!
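For reference, a rough sketch of what a batch-capable custom tokenizer can look like. This uses the Hugging Face tokenizers package; the class name, vocab path, and other details are made up rather than the actual code from this thread, and how much nlp.pipe() picks up a tokenizer's pipe() method depends on the spaCy version, so treat it only as an illustration of the batching idea:

```python
from spacy.tokens import Doc
from spacy.util import minibatch
from tokenizers import Tokenizer as HFTokenizer


class BatchBPETokenizer:
    """Illustrative custom tokenizer that can encode texts in batches."""

    def __init__(self, vocab, bpe_path="bpe-tokenizer.json"):  # hypothetical path
        self.vocab = vocab
        self.bpe = HFTokenizer.from_file(bpe_path)

    def __call__(self, text):
        # Single-text path: encode one string per call.
        encoding = self.bpe.encode(text)
        return Doc(self.vocab, words=encoding.tokens)

    def pipe(self, texts, batch_size=1000):
        # Batch path: encode_batch() tokenizes many texts in one call,
        # which is where the speedup over per-text encoding comes from.
        for batch in minibatch(texts, size=batch_size):
            for encoding in self.bpe.encode_batch(list(batch)):
                yield Doc(self.vocab, words=encoding.tokens)


# Usage sketch: nlp.tokenizer = BatchBPETokenizer(nlp.vocab, "bpe-tokenizer.json")
```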