Why number of Analyzer Results is different from Anonymizer Results? #1056

neboduus · 2023-04-14T10:40:39Z

Hello everyone,

I hope you're doing well. Firstly, I'd like to thank the library creators and maintainers for providing us with such a useful and effective tool.

However, I am currently facing a challenge while designing an anonymization algorithm. I am using the library to do so and have noticed that when I use a very long legal text (which I cannot share due to privacy concerns) and run it through the code mentioned below, I end up having len(anonymized_text.items) != len(analyzer_results), which is quite strange to me.

Here's the code I am using:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider

configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "it", "model_name": "it_core_news_lg"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}

provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_italian = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_italian, supported_languages=["en", "it"]
)

analyzer_results = analyzer.analyze(text=text, language="it")
print("Analyzer Results from Italian request:")
print(analyzer_results)

anonymizer_engine = AnonymizerEngine()
anonymized_text = anonymizer_engine.anonymize(
    text=text,
    analyzer_results=analyzer_results,
)
anonymized_text.text

As I am planning to apply some text transformation on the anonymized text using some AI model, and in order to de-anonymize the transformed text afterwards, I need to match 1-to-1 the analyzer_results list with the anonymized_text.items. This will allow me to keep track of the anonymized elements and later use it to de-anonymize the transformed text.

Therefore, I would like to know if there's a way to keep track of this 1-to-1 matching.

Thank you in advance for any hint you can give me!

omri374 · 2023-04-16T08:00:26Z

Maintainer

Hi @neboduus. Thanks for the kind words.

One reason the analyzer results and the anonymizer results do not map 1-to-1 is the conflict resolution that happens inside the anonymizer. When detected entities overlap, the analyzer returns all of them, whereas the anonymizer may return only one. For example, the string [email protected] could produce two types of PII, EMAIL and URL; the anonymizer's overlap handling would keep only one of them. More on this can be found here: https://microsoft.github.io/presidio/anonymizer/#handling-overlaps-between-entities

Could this be the reason for the mismatch you experience?
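
To make the mismatch concrete, here is a minimal pure-Python sketch of overlap handling. It is a simplified illustration, not Presidio's actual implementation: the `Span` class and `resolve_overlaps` function are hypothetical names, and the example address, offsets, and scores are made up. It only models the case where one detected entity is fully covered by another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for one detected entity."""
    entity_type: str
    start: int
    end: int
    score: float


def resolve_overlaps(spans):
    """Drop every span that is covered by another span.

    A span is dropped when another span strictly contains it, or when
    another span has the same range and outranks it (higher score, or
    an earlier position in the list on a tie).
    """
    kept = []
    for i, s in enumerate(spans):
        dropped = False
        for j, o in enumerate(spans):
            if i == j:
                continue
            same_range = (o.start, o.end) == (s.start, s.end)
            strictly_contains = (
                o.start <= s.start and s.end <= o.end and not same_range
            )
            outranks = same_range and (
                o.score > s.score or (o.score == s.score and j < i)
            )
            if strictly_contains or outranks:
                dropped = True
                break
        if not dropped:
            kept.append(s)
    return kept


text = "Contact: jane@example.com"
analyzer_like = [
    Span("EMAIL_ADDRESS", 9, 25, 1.0),  # covers "jane@example.com"
    Span("URL", 14, 25, 0.5),           # covers "example.com"
]
resolved = resolve_overlaps(analyzer_like)
print(len(analyzer_like), len(resolved))
```

Two detections go in and one comes out, which is the kind of count difference described in the question.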

1 reply

neboduus Apr 17, 2023
Author

Hi @omri374. Thanks for your thorough reply! That may well be the reason. I solved the issue by tracking the replacements myself: I pass my own anonymization function through OperatorConfig, and it keeps the anonymized words and their replacements in memory so that I can easily de-anonymize.

E.g.

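import logging

from faker import Faker
from presidio_anonymizer.entities import OperatorConfig

logger = logging.getLogger(__name__)
faker = Faker()  # assumed: `faker` is a Faker instance

# Maps each fake value back to the original value it replaced
anonymization_map = {}

def fake_credit_card_number(x):
    # x is the original text that was detected as a credit card number
    credit_card_number = faker.credit_card_number()
    anonymization_map[credit_card_number] = x
    return credit_card_number

operators = {"CREDIT_CARD": OperatorConfig("custom", {"lambda": fake_credit_card_number})}

def de_anonymize(anonymized_text):
    # Swap every fake value in the text back to its original value
    de_anonymized_text = anonymized_text
    for anonymized_value, real_value in anonymization_map.items():
        logger.debug(f'Replacing back {anonymized_value} with {real_value}')
        de_anonymized_text = de_anonymized_text.replace(anonymized_value, real_value)
    return de_anonymized_text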
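
For anyone who wants to try the round trip without installing anything, here is a minimal sketch of the same idea. It is an assumption-laden variant, not the exact code from my project: it uses numbered placeholders instead of Faker, `make_placeholder` is a hypothetical helper that would play the role of the custom operator's `lambda`, and the anonymizer and the AI transformation are simulated with plain string operations.

```python
import itertools

placeholder_to_original = {}
original_to_placeholder = {}
_counter = itertools.count(1)


def make_placeholder(original):
    """Return a stable placeholder for `original` and remember the pair."""
    if original in original_to_placeholder:
        # Reuse the placeholder so repeated values stay consistent
        return original_to_placeholder[original]
    placeholder = f"<CREDIT_CARD_{next(_counter)}>"
    placeholder_to_original[placeholder] = original
    original_to_placeholder[original] = placeholder
    return placeholder


def de_anonymize(text):
    """Replace every known placeholder with its original value."""
    for placeholder, original in placeholder_to_original.items():
        text = text.replace(placeholder, original)
    return text


original_text = "Paid with 4111111111111111, refunded to 5500000000000004."

# Simulated anonymization step (stands in for the anonymizer call)
anonymized = original_text
for card in ("4111111111111111", "5500000000000004"):
    anonymized = anonymized.replace(card, make_placeholder(card))

# Simulated downstream transformation that leaves placeholders intact
transformed = "Summary: " + anonymized
restored = de_anonymize(transformed)
print(anonymized)
print(restored)
```

The round trip only works if the transformation leaves the placeholders untouched, so distinctive tokens such as `<CREDIT_CARD_1>` are likely easier to preserve than realistic fake values.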
