Presidio (from the Latin praesidium, ‘protection, garrison’) helps ensure that sensitive data is properly managed and governed. It provides fast identification and anonymization modules for private entities in text and images. It is fully customizable and pluggable, and it can be adapted to your needs and deployed in various environments.
!!! note "Note"
    Presidio is a library or SDK rather than a service. It is meant to be customized to the user's or organization's specific needs.
!!! warning "Warning"
    Presidio can help identify sensitive/PII data in structured and unstructured text. However, because Presidio uses trained ML models, there is no guarantee that it will find all sensitive information. Additional systems and protections should therefore be employed.
Our goals in developing Presidio are to:
- Allow organizations to preserve privacy in a simpler way by democratizing de-identification technologies and introducing transparency in decisions.
- Embrace extensibility and customizability to a specific business need.
- Facilitate both fully automated and semi-automated PII de-identification flows on multiple platforms.
The authors and maintainers of Presidio come from the Industry Solutions Engineering team. We work with customers on various engineering problems, and have found the proper handling of private and sensitive data a recurring challenge across many customers and industries.
!!! note "Note"
    Microsoft Presidio is not an official Microsoft product. Usage terms are defined in the repository's license.
What is the difference between Presidio and other PII detection services such as Azure Text Analytics and Amazon Comprehend?
In a nutshell, Presidio is a library that is meant to be customized, whereas SaaS tools for PII detection offer fewer customization capabilities. Most of these SaaS offerings use dedicated ML models and other logic for PII detection, and they often have better entity coverage or accuracy than Presidio.
Based on our internal research, using Presidio alongside third-party PII detection services like Azure Text Analytics can bring optimal results, mainly when the data at hand contains entity types or values not supported by the third-party service (see example here).
- Check out the installation docs.
- Take a look at the different samples.
- Try the demo website.
Presidio is a suite made up of several packages and building blocks:
- Presidio Analyzer: a package for detecting PII entities in natural language.
- Presidio Anonymizer: a package for manipulating PII entities in text (e.g. remove, redact, hash, encrypt).
- Presidio Image Redactor: a package for detecting PII entities in images using OCR.
- A set of sample deployments as Python packages or Docker containers for Kubernetes, Azure Data Factory, Spark and more.
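The split between detection and anonymization can be illustrated with a small, self-contained sketch. It does not use Presidio's API: the `Detection` class, the regex patterns, and the scores below are hypothetical stand-ins chosen for illustration. The point is the two-stage flow, in which a detector returns typed spans with confidence scores and a separate step rewrites the text.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    """A detected span (hypothetical structure, for illustration only)."""
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical patterns and confidence scores, not Presidio's recognizers.
PATTERNS = {
    "EMAIL_ADDRESS": (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), 0.9),
    "PHONE_NUMBER": (re.compile(r"\b\d{3}-\d{3}-\d{4}\b"), 0.6),
}


def detect(text):
    """Stage 1: find candidate PII spans and attach a score to each."""
    found = []
    for entity_type, (pattern, score) in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append(Detection(entity_type, match.start(), match.end(), score))
    return sorted(found, key=lambda d: d.start)


def anonymize(text, detections):
    """Stage 2: replace each span with a placeholder for its entity type."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for d in sorted(detections, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text


sample = "Call 212-555-5555 or write to jane@example.com."
detections = detect(sample)
redacted = anonymize(sample, detections)
print(redacted)  # Call <PHONE_NUMBER> or write to <EMAIL_ADDRESS>.
```

This sketch assumes detections never overlap. A real pipeline also needs a policy for resolving overlapping or conflicting spans before rewriting the text.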
Users can customize Presidio in multiple ways:
- Create new or updated PII recognizers (docs).
- Adapt Presidio to new languages (docs).
- Leverage state-of-the-art Named Entity Recognition models (docs).
- Add new types of anonymizers (docs).
- Create PII analysis and anonymization pipelines on different environments using Docker or Python (samples).
And more.
Presidio supports spaCy version 3+ for Named Entity Recognition, tokenization, lemmatization and more. We also support Stanza using the spacy-stanza package, and it is further possible to create PII recognizers leveraging other frameworks like transformers or Flair.
For more information, see the docs.
Pseudonymization is a de-identification technique in which real data is replaced with fake data in a reversible way. Because there are several approaches to this, we provide a simple sample that can be extended for more sophisticated usage. If you have a question or a request on this topic, please open an issue on the repo.
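To make the idea of reversibility concrete, here is a minimal sketch in plain Python. It is not the sample referred to above and does not use Presidio's API: the `pseudonymize` and `restore` helpers, the email regex, and the surrogate format are all assumptions made for illustration. Each real value is mapped to a stable fake value, and the mapping is kept so the substitution can be undone later.

```python
import re

# Simplified email pattern, used here only as a stand-in detector.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def pseudonymize(text, mapping):
    """Replace each email with a stable surrogate, recording real -> fake."""
    def substitute(match):
        original = match.group(0)
        if original not in mapping:
            # The same real value always gets the same surrogate.
            mapping[original] = f"user{len(mapping) + 1}@example.invalid"
        return mapping[original]

    return EMAIL.sub(substitute, text)


def restore(text, mapping):
    """Reverse the substitution using the stored mapping."""
    reverse = {fake: real for real, fake in mapping.items()}
    return EMAIL.sub(lambda m: reverse.get(m.group(0), m.group(0)), text)


mapping = {}
text = "alice@contoso.com wrote to bob@contoso.com; alice@contoso.com replied."
fake_text = pseudonymize(text, mapping)
print(fake_text)
print(restore(fake_text, mapping) == text)  # True
```

The mapping is itself sensitive, since anyone who holds it can re-identify the data, so it would need to be stored and access-controlled separately from the pseudonymized text.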
Support for structured and semi-structured data is an area we are actively looking into. We have an example implementation of using Presidio on structured/semi-structured data. See also the discussions on this topic in the Discussions section. If you have a question, suggestion, or contribution in this area, please reach out by opening an issue, starting a discussion, or contacting us directly at [email protected].
Presidio comes loaded with several PII recognizers (see list here). However, its main strength lies in how readily it can be customized for new entities, specific datasets, languages, or use cases. For a recommended process for improving detection accuracy, see these guidelines.
Some PII recognizers are less specific than others. A driver's license number, for example, could be any 9-digit number. While Presidio leverages context words and other logic to improve the detection quality, it could still falsely detect non-entity values as PII entities.
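The role of context words can be sketched in a few lines of plain Python. This is not Presidio's implementation: the word list, the two score values, and the five-word window are hypothetical values picked for illustration. A bare 9-digit number gets a low score, and the score is raised only when a related word appears shortly before it.

```python
import re

NINE_DIGITS = re.compile(r"\b\d{9}\b")

# Hypothetical context words and scores, chosen only for illustration.
CONTEXT_WORDS = {"driver", "license", "licence", "permit"}
BASE_SCORE = 0.3      # a bare 9-digit number is weak evidence
BOOSTED_SCORE = 0.65  # the same number near a context word is stronger
WINDOW = 5            # how many preceding words to inspect


def score_candidates(text):
    """Return (value, score) for every 9-digit number in the text."""
    results = []
    for match in NINE_DIGITS.finditer(text):
        preceding = re.findall(r"[a-z]+", text[:match.start()].lower())[-WINDOW:]
        has_context = any(word in CONTEXT_WORDS for word in preceding)
        results.append((match.group(0), BOOSTED_SCORE if has_context else BASE_SCORE))
    return results


print(score_candidates("Driver license number: 123456789"))  # [('123456789', 0.65)]
print(score_candidates("Order 123456789 has shipped"))        # [('123456789', 0.3)]
```

With an acceptance threshold of, say, 0.5, only the first number would be returned. The order number is still a candidate, which is why a low threshold or missing context logic leads to false positives.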
To reduce false positives, you can try to:
- Change the acceptance threshold, which defines the minimum confidence value a detected entity must have to be returned.
- Remove unnecessary PII recognizers, if the dataset does not contain these entities.
- Update/replace the logic of specific recognizers to better suit a specific dataset or use case.
- Replace PII recognizers with those coming from third-party services.
Any PII identification logic will make errors, and there is a trade-off between false positives (text falsely detected as PII) and false negatives (PII entities that are not detected).
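The trade-off can be seen by sweeping the acceptance threshold over a small set of labeled detections. The data below is made up for illustration and the `evaluate` helper is hypothetical; it only counts outcomes and is not part of Presidio.

```python
# Hypothetical detections: (confidence score, whether the span is really PII).
LABELED = [
    (0.95, True), (0.85, True), (0.65, True), (0.60, False),
    (0.40, True), (0.35, False), (0.30, False), (0.10, False),
]


def evaluate(threshold):
    """Count true positives, false positives and false negatives."""
    returned = [is_pii for score, is_pii in LABELED if score >= threshold]
    true_pos = sum(returned)
    false_pos = len(returned) - true_pos
    # Real PII whose score fell below the threshold is missed.
    false_neg = sum(is_pii for _, is_pii in LABELED) - true_pos
    return true_pos, false_pos, false_neg


for threshold in (0.2, 0.5, 0.9):
    tp, fp, fn = evaluate(threshold)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn}")
# threshold=0.2: TP=4 FP=3 FN=0
# threshold=0.5: TP=3 FP=1 FN=1
# threshold=0.9: TP=1 FP=0 FN=3
```

Raising the threshold removes false positives but starts to miss real entities. Which point on this curve is acceptable depends on the dataset and on the cost of each kind of error.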
In addition to Presidio, we maintain a repo focused on evaluation of models and PII recognizers here. It also features a simple PII data generator.
The main Presidio modules (analyzer, anonymizer, image-redactor) can be used both as a Python package and as a dockerized REST API. See the different deployment samples for example deployments.
To contribute, first review the contribution guidelines, then feel free to reach out by opening an issue, posting a discussion, or emailing us at [email protected].
Please see the security information.