Anonymizes sensitive data in a relational database

igor-pcholkin/Anonymizer


Getting Started

What is the dataprotection application?

The dataprotection application is a command-line tool that safeguards sensitive data by anonymizing it within an enterprise's relational database. Anonymization reduces the risk of unintended disclosure while keeping the data useful for business processing. The European Union's General Data Protection Regulation (GDPR) names pseudonymization as a safeguard for stored personal data on people in the EU, and data that has been truly anonymized falls outside its scope.

What is sensitive data?

Sensitive information is any data that can be linked or attributed to a specific person, company or client. Examples include data about customers: company name, names of official representatives, phone numbers, emails, addresses, credit cards, birth dates, medical information.

Why should sensitive data be kept secure?

Securing sensitive data is crucial for every digital business. When sensitive information about customers is leaked to the public, it can result in direct and indirect losses for the company, including reputational and financial damage.

Anonymization approach

In the dataprotection application, anonymization is accomplished through data masking. This technique replaces sensitive values, such as personal identification numbers, credit card numbers, social security numbers, and email addresses, with fictitious or randomized values. The process preserves the structure and, where possible, the semantic meaning of the data while safeguarding its confidentiality.
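
The masking idea can be illustrated with a minimal sketch. This is not the application's own code: the class name MaskingSketch and the method mask are hypothetical, and the sketch assumes that "preserving structure" means keeping length, letter case, and punctuation positions.

```java
import java.util.Random;

// Hypothetical helper for illustration; not part of the dataprotection code base.
class MaskingSketch {

    // Replaces each ASCII letter with a random letter of the same case and each
    // ASCII digit with a random digit. Every other character (punctuation,
    // spaces, non-ASCII letters) is copied unchanged, so the value keeps its shape.
    static String mask(String value, Random random) {
        StringBuilder masked = new StringBuilder(value.length());
        for (char c : value.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                masked.append((char) ('a' + random.nextInt(26)));
            } else if (c >= 'A' && c <= 'Z') {
                masked.append((char) ('A' + random.nextInt(26)));
            } else if (c >= '0' && c <= '9') {
                masked.append((char) ('0' + random.nextInt(10)));
            } else {
                masked.append(c);
            }
        }
        return masked.toString();
    }

    public static void main(String[] args) {
        Random random = new Random();
        // The output differs on every run; only the layout of the input survives.
        System.out.println(mask("4111-1111-1111-1111", random));
        System.out.println(mask("John.Smith@example.com", random));
    }
}
```

Because only the character classes are kept, a masked value still fits the same column type and passes simple format checks, but it no longer identifies anyone. A real anonymizer would also need to handle non-ASCII letters, which this sketch leaves untouched.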

How does the application work?

The dataprotection application retrieves data from a relational database, anonymizes it, and writes it back to the database. Anonymization follows the rules specified in a transformation configuration file. This file specifies which columns in which database schemas and tables should be anonymized, and how: each column is mapped to the ID of the anonymizer to use.

What anonymizers are available?

Currently, the following anonymizers are available; more may be introduced in the future. The ID of each anonymizer is shown on the left and is the value to specify in the transformation YAML file.

name - replaces alphabetic letters with random ones while preserving the initial letters
email - replaces an email address with a random-looking one
address - obfuscates addresses while preserving their appearance: spaces and punctuation marks are kept and the underlying data is obscured
birthdate - replaces the original date with a new random date
ccard - replaces credit card numbers with randomly generated ones
pid - a basic "one-size-fits-all" anonymizer for various personal identification codes, social security numbers, and taxpayer numbers
post - replaces post codes with other values taken from the same column in the database
ip - replaces an IP address with another random IP address
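
Two of the behaviours listed above can be sketched in a few lines. The code below is an illustration under assumptions, not the project's implementation: AnonymizerSketches, anonymizeName, and shuffleColumn are hypothetical names, "initial letters" is taken to mean the first letter of each word, and the "post" behaviour is approximated by shuffling values that were already read from one column.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketches of the "name" and "post" behaviours; not the project's classes.
class AnonymizerSketches {

    // Keeps the first letter of every word and every non-letter character,
    // and replaces the remaining letters with random ASCII letters of the same case.
    static String anonymizeName(String name, Random random) {
        StringBuilder result = new StringBuilder(name.length());
        boolean wordStart = true;
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (!Character.isLetter(c)) {
                result.append(c);      // spaces, hyphens, apostrophes stay in place
                wordStart = true;      // the next letter starts a new word
            } else if (wordStart) {
                result.append(c);      // preserved initial letter
                wordStart = false;
            } else {
                char base = Character.isUpperCase(c) ? 'A' : 'a';
                result.append((char) (base + random.nextInt(26)));
            }
        }
        return result.toString();
    }

    // Returns the same values in a random order, so every row receives a value
    // that really occurs in the column. A row may keep its own value by chance.
    static List<String> shuffleColumn(List<String> values, Random random) {
        List<String> shuffled = new ArrayList<>(values);
        Collections.shuffle(shuffled, random);
        return shuffled;
    }

    public static void main(String[] args) {
        Random random = new Random();
        System.out.println(anonymizeName("Anna Maria", random));
        System.out.println(shuffleColumn(List.of("LV-1010", "LV-2020", "LV-3030"), random));
    }
}
```

Reusing values from the same column keeps the replacements realistic (they are valid post codes for the data set), at the cost of leaving the overall set of values unchanged.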

How to add a custom anonymizer

  • Add a new anonymizer class (with a test) to the com.discreet.dataprotection.anonymizer package. Extend either BaseAnonymizer or CharSequenceAnonymizer.
  • Register the anonymizer class with a corresponding alias in AnonymizerTable.java.
  • Use the anonymizer alias as the value mapped to a DB column in the transformations.yaml file.
  • (Optional) To use the anonymizer in the auto-detection process, map it to the corresponding column name in columnToAnonymizer.properties.
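
The steps above can be sketched as follows. This README does not show the method signatures of BaseAnonymizer or the layout of AnonymizerTable.java, so everything here is an assumption: BaseAnonymizerStandIn, its anonymize method, the TABLE map, and the "phone" alias are hypothetical stand-ins that only show the shape of the work. Check the real classes in the repository before copying anything.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

class CustomAnonymizerSketch {

    // Stand-in for the project's BaseAnonymizer; the real API may differ.
    abstract static class BaseAnonymizerStandIn {
        abstract String anonymize(String input);
    }

    // Hypothetical custom anonymizer: randomizes the digits of a phone number
    // and keeps '+', spaces, brackets, and dashes where they were.
    static class PhoneAnonymizer extends BaseAnonymizerStandIn {
        private final Random random;

        PhoneAnonymizer(Random random) {
            this.random = random;
        }

        @Override
        String anonymize(String input) {
            if (input == null) {
                return null; // NULL column values are passed through
            }
            StringBuilder result = new StringBuilder(input.length());
            for (char c : input.toCharArray()) {
                if (c >= '0' && c <= '9') {
                    result.append((char) ('0' + random.nextInt(10)));
                } else {
                    result.append(c);
                }
            }
            return result.toString();
        }
    }

    // Stand-in for the registration step in AnonymizerTable.java: alias -> anonymizer.
    static final Map<String, BaseAnonymizerStandIn> TABLE = new HashMap<>();

    static {
        TABLE.put("phone", new PhoneAnonymizer(new Random()));
    }

    public static void main(String[] args) {
        // The alias "phone" is what a transformations.yaml entry would refer to.
        System.out.println(TABLE.get("phone").anonymize("+371 (2) 555-01"));
    }
}
```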

Additional functions

The application autonomously identifies which database columns are suitable for anonymization by leveraging metadata information about the database. This information can be provided in two ways: through a Data Definition Language (DDL) file containing statements used to create data tables in a database schema, or via a direct connection to the database schema.
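
One plausible way to picture the detection step is a lookup from column names to anonymizer IDs. The sketch below assumes, without confirmation from the source, that columnToAnonymizer.properties holds lines of the form "column name = anonymizer ID" and that matching is a case-insensitive exact match. DetectionSketch, load, and detect are hypothetical names, and the column names are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch of name-based detection; the real matching rules may differ.
class DetectionSketch {

    // Parses properties text such as "email=email"; the file format is an assumption.
    static Properties load(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // Returns column -> anonymizer ID for every column whose lower-cased name
    // has an entry in the mapping; columns without an entry are skipped.
    static Map<String, String> detect(List<String> columnNames, Properties columnToAnonymizer) {
        Map<String, String> detected = new LinkedHashMap<>();
        for (String column : columnNames) {
            String anonymizerId = columnToAnonymizer.getProperty(column.toLowerCase(Locale.ROOT));
            if (anonymizerId != null) {
                detected.put(column, anonymizerId);
            }
        }
        return detected;
    }

    public static void main(String[] args) {
        Properties mapping = load("email=email\nfirst_name=name\nzip=post\n");
        // Column names would come from the DDL file or from database metadata.
        System.out.println(detect(List.of("id", "FIRST_NAME", "email", "created_at"), mapping));
    }
}
```

Whatever the actual rules are, the result of detection is only a proposal, which is why the generated file is meant to be reviewed before it is used for anonymization.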

Running the application

The application can be started using the supplied script: dp.bat for Windows or dp.sh for Linux or macOS. The application requires Java 19 or newer, which must be installed separately. When run without arguments, the application prints the available command-line arguments. At the moment the application works in two modes, "transform" and "detect", each represented by a corresponding command and related options:

  • transform
    anonymize database
    • -tfn, --transformationsFileName=[transformationsFileName]
      yaml file describing for each schema/table how each column should be anonymized
  • detect
    auto-detect which tables and columns in a given schema can be anonymized. The result is a temporary transformations.yaml file which can later be used with the "transform" command.
    • -dbe, --dbEngine=[dbEngine]
      one of mysql, oracle, postgresql, db2, jtds, sybase, sqlserver, mariadb, derby, hive, h2, informix
    • -dsn, --defaultSchemaName=[defaultSchemaName]
      name to use if the schema name is missing in the DDL schema
    • -iid, --ignoreMissingIds
      ignore DB tables whose ID columns are missing or were not detected
    • -sfn, --schemaFileName=[schemaFileName]
      ddl schema file name to use for auto-detection of DB table columns which can be anonymized
    • -sn, --schemaName=[schemaName]
      schema name to use for reading metadata from DB and auto-detection of DB table columns which can be anonymized

Configuration

  • The conf/application.properties file sets the DB connection URL and credentials. This single DB is used for both anonymization and data analysis.
  • AnonymizerTable.java is used to register a custom anonymizer.
  • The columnToAnonymizer.properties file is used when adding a new anonymizer that should take part in auto-detection of database columns for anonymization.

Use cases

1a. Auto-detect data eligible for anonymization using a supplied DDL schema file:

./dp.sh detect -dbe mysql -dsn test -sfn schema.sql -iid

  • This automatically identifies database columns in the schema that likely require anonymization and generates a transformations.yaml file, which can be refined by manual editing and then used for the actual anonymization. At this stage, no data in the database schema is altered. The -dbe argument supplies the DB dialect. -iid is an optional but recommended option that lets the application ignore the error and continue processing if it fails to detect the ID column for a database table. Without this option, the application reports an error and halts the auto-detection process.

or:
1b. Auto-detect data eligible for anonymization using a supplied schema name:

./dp.sh detect -sn test

  • In this step, the application behaves similarly to the previous one, but instead of relying on a DDL file, it loads schema metadata directly from the database. This approach is useful when a DDL file is unavailable.
  • In the provided example, "test" represents the name of the schema to be analyzed in the database as defined in the application.properties file.

1c. Create the transformations.yaml file manually, using transformations.yaml.example as a template and following this pattern:

[schema name]:
  [db table name]:
    anonymizers:
      [column name]: [anonymizer id]
      ...
  ...
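
To make the pattern concrete, the following sketch builds the same nesting (schema, table, "anonymizers", column to anonymizer ID) in memory and prints it as text. It is an illustration only: TransformationsSketch and render are hypothetical, the table and column names (customers, first_name, email, zip) are made up, and the application itself reads this file rather than generating it this way.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that prints a schema -> table -> column -> anonymizer ID
// mapping in the layout of the pattern above.
class TransformationsSketch {

    static String render(Map<String, Map<String, Map<String, String>>> schemas) {
        StringBuilder yaml = new StringBuilder();
        schemas.forEach((schema, tables) -> {
            yaml.append(schema).append(":\n");
            tables.forEach((table, columns) -> {
                yaml.append("  ").append(table).append(":\n");
                yaml.append("    anonymizers:\n");
                columns.forEach((column, anonymizerId) ->
                        yaml.append("      ").append(column).append(": ")
                            .append(anonymizerId).append("\n"));
            });
        });
        return yaml.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps the entries in insertion order when printing.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("first_name", "name");
        columns.put("email", "email");
        columns.put("zip", "post");

        Map<String, Map<String, String>> tables = new LinkedHashMap<>();
        tables.put("customers", columns);

        Map<String, Map<String, Map<String, String>>> schemas = new LinkedHashMap<>();
        schemas.put("test", tables);

        System.out.print(render(schemas));
    }
}
```

Each value on the right-hand side must be one of the anonymizer IDs listed earlier (or the alias of a registered custom anonymizer).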

2. Use the generated transformations.yaml file for the actual anonymization of the selected schema:

./dp.sh transform -tfn transformations.yaml

Reference Documentation

For further reference, please consider the following links:
