Running in GPU instance #1185

sankagirirajeev · 2023-10-10T11:12:13Z

sankagirirajeev
Oct 10, 2023

I'm running Presidio on around 60 lakh (6 million) records. It takes more than 4 minutes to process 1000k of data. Is there any way to run it as fast as possible?

omri374 · 2023-10-11T09:37:38Z

omri374
Oct 11, 2023
Maintainer

Have you looked at the batch option? Presidio has a BatchAnalyzerEngine object, see sample here. The underlying NER model also affects performance: each model has different requirements and latency, and some might be faster with a GPU.

Which model are you using? Please also share the code, as 4 minutes for 1 MB sounds like too much. Are you initializing the AnalyzerEngine multiple times?
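
For readers following along, here is a minimal sketch of the two points above: build the engines once, then feed texts through BatchAnalyzerEngine. The helper names (`chunked`, `analyze_in_batches`, `make_presidio_batch_fn`) and the chunk size are illustrative assumptions, not part of Presidio. Chunking is added here only to keep memory bounded on a large dataset.

```python
from typing import Callable, Iterable, Iterator, List


def chunked(texts: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` texts."""
    batch: List[str] = []
    for text in texts:
        batch.append(text)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def analyze_in_batches(texts: Iterable[str],
                       analyze_batch: Callable[[List[str]], list],
                       size: int = 1000) -> list:
    """Run `analyze_batch` on each chunk and flatten the results in input order."""
    results = []
    for batch in chunked(texts, size):
        results.extend(analyze_batch(batch))
    return results


def make_presidio_batch_fn(language: str = "en"):
    """Build the engines once and return a function that analyzes a list of texts."""
    # Imported lazily so the helpers above can be used without Presidio installed.
    from presidio_analyzer import AnalyzerEngine, BatchAnalyzerEngine

    analyzer = AnalyzerEngine()  # loads the NLP model: do this once, not per record
    batch_analyzer = BatchAnalyzerEngine(analyzer_engine=analyzer)

    def analyze_batch(batch: List[str]) -> list:
        return list(batch_analyzer.analyze_iterator(batch, language=language))

    return analyze_batch
```

Usage would look like `analyze_in_batches(df["content"], make_presidio_batch_fn())`, where `df` is a hypothetical DataFrame with a `content` column.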

sankagirirajeev · 2023-10-11T09:48:49Z

sankagirirajeev
Oct 11, 2023
Author

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig
import datetime
import pandas as pd
import spacy
import re

# Ask spaCy to use the GPU; call this before any model is loaded
spacy.require_gpu()

# Engines are created once, outside the loop
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
data_sample = pd.read_parquet('xxx.parquet')

for index, row in data_sample.iterrows():
    content = row['content']
    # Skip very large documents
    if len(content) < 1000000:
        results = analyzer.analyze(text=content, language='en')
        # for result in results:
        #     print("Type:", result.entity_type)
        #     print("Score:", result.score)
        #     print(result)

sankagirirajeev · 2023-10-11T11:48:54Z

sankagirirajeev
Oct 11, 2023
Author

I used BatchAnalyzerEngine:
start = timer()
analyzer_results = batch_analyzer.analyze_iterator(data_sample_content, language='en')
analyzer_results = list(analyzer_results)
print(timer()-start)

It's taking the same amount of time.

1 reply

omri374 Oct 11, 2023
Maintainer

I believe that the en_core_web_lg model is CPU optimized, but I'm not sure.
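
To make the model question concrete, here is a rough sketch of swapping the spaCy model that Presidio uses. The function names are illustrative assumptions. The model names are examples only: `en_core_web_sm` is a smaller CPU pipeline and `en_core_web_trf` is spaCy's transformer-based pipeline, which is the kind of model that typically benefits from a GPU. Whether either is faster or accurate enough on this dataset is not established in this thread, and the chosen model must be installed separately.

```python
def spacy_nlp_configuration(model_name: str, lang_code: str = "en") -> dict:
    """Return an NLP engine configuration dict for a single spaCy model."""
    return {
        "nlp_engine_name": "spacy",
        "models": [{"lang_code": lang_code, "model_name": model_name}],
    }


def build_analyzer(model_name: str, lang_code: str = "en", use_gpu: bool = False):
    """Create one AnalyzerEngine backed by the given spaCy model."""
    # Imported lazily so the config helper works without spaCy/Presidio installed.
    import spacy
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    if use_gpu:
        # Must run before the model is loaded; raises if no GPU is available.
        spacy.require_gpu()

    provider = NlpEngineProvider(
        nlp_configuration=spacy_nlp_configuration(model_name, lang_code)
    )
    nlp_engine = provider.create_engine()
    return AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=[lang_code])
```

For example, `build_analyzer("en_core_web_sm")` versus `build_analyzer("en_core_web_trf", use_gpu=True)` could be timed on the same sample to compare.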

sankagirirajeev · 2023-10-11T12:13:56Z

sankagirirajeev
Oct 11, 2023
Author

Is there any other way we can process the data?

7 replies

sankagirirajeev Oct 11, 2023
Author

It would be really helpful if I could find a solution. I need to process a large dataset within a day.
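
As a quick sanity check on the "within a day" target, the arithmetic can be sketched as below. The 6,000,000 figure comes from the 60 lakh records mentioned earlier. The sample timing in the second call is a made-up placeholder, to be replaced with a real measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_rate(total_records: int, budget_seconds: float = SECONDS_PER_DAY) -> float:
    """Records per second needed to finish within the time budget."""
    return total_records / budget_seconds


def estimated_hours(total_records: int, sample_records: int, sample_seconds: float) -> float:
    """Extrapolate total runtime in hours from a timed sample."""
    return total_records * sample_seconds / sample_records / 3600


total = 60 * 100_000  # 60 lakh = 6,000,000 records
print(round(required_rate(total), 1))  # 69.4 records per second

# Hypothetical measurement: 1,000 records took 30 seconds.
print(estimated_hours(total, sample_records=1000, sample_seconds=30))  # 50.0 hours
```

If the extrapolated time is well above 24 hours, a faster model or parallel workers would be needed to hit the target.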

omri374 Oct 11, 2023
Maintainer

Have you run any profiling? Snakeviz is a nice option: https://jiffyclub.github.io/snakeviz/
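
A minimal profiling sketch using the standard library's cProfile, which produces the stats files that Snakeviz visualizes. The `profile_call` helper and the file name are illustrative assumptions.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10, dump_path=None, **kwargs):
    """Profile one call to `func`; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    if dump_path is not None:
        # The dumped file can be opened from a shell with: snakeviz <dump_path>
        profiler.dump_stats(dump_path)

    # Build a plain-text summary sorted by cumulative time.
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return result, stream.getvalue()
```

For instance, `profile_call(analyzer.analyze, text=sample, language="en", dump_path="analyze.prof")` on a small sample would show whether time goes to the NLP model, the recognizers, or data loading.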

omri374 Oct 11, 2023
Maintainer

We have deployment examples in PySpark (using Python) and Kubernetes (using Docker). This would likely be the next step.

sankagirirajeev Oct 11, 2023
Author

Is this code useful for speeding up processing?

omri374 Oct 11, 2023
Maintainer

It allows you to run Presidio in parallel, using Spark or Kubernetes. Profiling is useful if you're interested in understanding where the bottleneck is.
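
The same idea can be tried on a single machine before moving to Spark or Kubernetes. The sketch below is not one of the deployment examples mentioned above. It swaps in Python's multiprocessing and assumes CPU workers, with one AnalyzerEngine created per worker process. All function names are illustrative.

```python
from multiprocessing import Pool

_analyzer = None  # set once in each worker process by _init_worker


def to_plain(results):
    """Convert result objects to plain tuples that are cheap to send between processes."""
    return [(r.entity_type, r.start, r.end, r.score) for r in results]


def _init_worker():
    """Load the engine once per worker, not once per record."""
    global _analyzer
    from presidio_analyzer import AnalyzerEngine
    _analyzer = AnalyzerEngine()


def _analyze(text):
    return to_plain(_analyzer.analyze(text=text, language="en"))


def analyze_parallel(texts, workers=4, chunksize=100):
    """Analyze texts across worker processes; results keep the input order."""
    with Pool(processes=workers, initializer=_init_worker) as pool:
        return pool.map(_analyze, texts, chunksize=chunksize)
```

Call `analyze_parallel(...)` from under an `if __name__ == "__main__":` guard. Each worker loads its own copy of the model, so memory use grows with the number of workers.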
