Hello everyone, I hope you're doing well. Firstly, I'd like to thank the library creators and maintainers for providing us with such a useful and effective tool.

I am facing a challenge while designing an anonymization algorithm with the library. When I run a very long legal text (which I cannot share due to privacy concerns) through the code below, I end up with len(anonymized_text.items) != len(analyzer_results), which is quite strange to me.

Here's the code I am using:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
configuration = {
    "nlp_engine_name": "spacy",
    "models": [
        {"lang_code": "it", "model_name": "it_core_news_lg"},
        {"lang_code": "en", "model_name": "en_core_web_lg"},
    ],
}
provider = NlpEngineProvider(nlp_configuration=configuration)
nlp_engine_with_italian = provider.create_engine()
analyzer = AnalyzerEngine(
    nlp_engine=nlp_engine_with_italian, supported_languages=["en", "it"]
)
analyzer_results = analyzer.analyze(text=text, language="it")
print("Analyzer Results from Italian request:")
print(analyzer_results)
anonymizer_engine = AnonymizerEngine()
anonymized_text = anonymizer_engine.anonymize(
    text=text,
    analyzer_results=analyzer_results,  # was results_italian, which is never defined
)
anonymized_text.text

I plan to apply a text transformation to the anonymized text using an AI model and then de-anonymize the transformed text, so I need a 1-to-1 match between anonymized_text.items and analyzer_results. Is there a way to keep track of this 1-to-1 matching?

Thank you in advance for any hint you can give me!
Hi @neboduus. Thanks for the kind words.

One reason there may be no 1-to-1 mapping between the analyzer results and the anonymizer results is the conflict resolution that happens within the anonymizer. When entities overlap, the analyzer returns multiple results, whereas the anonymizer may return only one. For example, the string [email protected] could produce two types of PII, EMAIL and URL, but the anonymization process handles overlaps and would keep only one of them. More on this can be found here: https://microsoft.github.io/presidio/anonymizer/#handling-overlaps-between-entities

Could this be the reason for the mismatch you experience?
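
To make the count mismatch concrete, here is a minimal pure-Python sketch of the idea. It does not call Presidio. The `Span` class, the `resolve_overlaps` helper, and the sample address are hypothetical, and the rule used (a result fully contained in another result is absorbed by the larger one) is a simplification of the overlap handling described on the linked page, not the library's actual implementation. The sketch also records which analyzer results each surviving span absorbed, which is one possible way to keep the link back to the original results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for an analyzer result (not a Presidio class)."""
    entity_type: str
    start: int
    end: int
    score: float


def resolve_overlaps(spans):
    """Map each surviving span to the list of input spans it absorbed.

    Simplified rule (an assumption): a span fully contained in an
    already kept span is absorbed by that span instead of being kept.
    """
    # Visit longer spans first; on equal length, higher score first.
    ordered = sorted(spans, key=lambda s: (-(s.end - s.start), -s.score))
    kept = {}
    for span in ordered:
        container = next(
            (k for k in kept if k.start <= span.start and span.end <= k.end),
            None,
        )
        if container is None:
            kept[span] = [span]
        else:
            kept[container].append(span)
    return kept


# Made-up sample text: the address yields an EMAIL span and a URL span
# for its domain part, so the analyzer side has two results.
text = "Contact: jane@example.com"
results = [
    Span("EMAIL", 9, 25, 1.0),
    Span("URL", 14, 25, 0.5),
]

mapping = resolve_overlaps(results)
print(len(results), len(mapping))  # two analyzer results, one surviving span
for survivor, absorbed in mapping.items():
    print(text[survivor.start:survivor.end], [s.entity_type for s in absorbed])
```

With a mapping like this, each anonymized item can be traced back to every analyzer result it covers, even though the two lists have different lengths.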