Free-text anonymization prototype and efficiency evaluation of different models
- Create a virtualenv using python3 (follow https://virtualenvwrapper.readthedocs.io/en/latest/install.html):
  `virtualenv venv`
- Activate the virtualenv:
  `. venv/bin/activate`
- Install requirements:
  `pip install -r requirements.txt`
- Download a Spanish base model (we recommend the large one; you can verify the download with the check below):
  `python -m spacy download es_core_news_lg`
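To confirm the model downloaded correctly, a quick sanity check using spaCy's standard API (not part of this repo, just an illustration) is:

```python
import spacy

# Load the Spanish model downloaded above; raises OSError if it is missing.
nlp = spacy.load("es_core_news_lg")

# Run it on a short sample and print the entities it recognizes.
doc = nlp("Juan Pérez vive en Madrid y trabaja para Telefónica.")
print([(ent.text, ent.label_) for ent in doc.ents])
```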
- For every anonymization you make, you will find information in a log file located in the folder `logs`.
- There's a configuration file called `configuration.py` where you can find variables that are used in the code and that you can modify.
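The exact contents depend on this repo, but as a rough, hypothetical sketch of the kind of variables you can expect there (only `ENTITIES_LIST` is a name taken from this README; the values and the other variable name are made up):

```python
# configuration.py -- illustrative sketch only; check the real file in this repo.

# Entity labels the anonymizer works with. ENTITIES_LIST does exist in
# configuration.py, but these values are just an example.
ENTITIES_LIST = ["PER", "LOC", "ORG", "MISC"]

# Hypothetical default of the kind the evaluate_efficiency section refers to
# (the real variable name may differ).
DEFAULT_RESULTS_FILE_NAME = "resultados_eficiencia.csv"
```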
Use the `anonymize_doc` function when you want to anonymize a text. For that you have two (2) options:
- free text
- csv file

Parameters:
- `text`: Free text to be anonymized.
- `save_file`: Flag that indicates if you want to save the file.
- `origin_path`: Path from which to obtain the file to be anonymized.
- `file_name`: The name of the file to be anonymized.
- `column_to_use`: Column of the file (only one) from which to obtain the text to be anonymized; indicate the column position (the first index is zero, see the sketch after this list).
- `include_titles`: Flag that indicates if the file to be anonymized includes titles (a header row).
- `destination_folder`: Path where the anonymized file is going to be saved.
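As an illustration of the zero-based column index (the layout of `dataset1.csv` below is an assumption, not the real dataset):

```python
import csv

# Hypothetical layout for data/dataset1.csv: id, date, free-text note.
# --column_to_use=2 selects the third column because indexing starts at zero.
with open("data/dataset1.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    # If the file has a header row (what --include_titles indicates), skip it here:
    # next(reader)
    for row in reader:
        print(row[2])  # the text that anonymize_doc would process
```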
- If you want to anonymize free text and see the results in the console:
  `python tasks.py anonymize_doc --text="this is the text to anonymize"`
- If you want to anonymize free text and save it to a file in the folder `results`:
  `python tasks.py anonymize_doc --text="this is the text to anonymize" --save_file --destination_folder="./results"`
- If you want to anonymize text from the third column of the file `dataset1.csv` located in the folder `data` and DON'T save it (the results will be shown in the console, but we encourage you to save them to a file in order to analyze them better):
  `python tasks.py anonymize_doc --origin_path="./data" --file_name="dataset1.csv" --column_to_use=2`
- If you want to anonymize text from the third column of the file `dataset1.csv` located in the folder `data` and save it to a file in the folder `results`:
  `python tasks.py anonymize_doc --origin_path="./data" --file_name="dataset1.csv" --column_to_use=2 --save_file --destination_folder="./results"`
- If you want to anonymize text from the third column of the file `dataset1.csv` located in the folder `data` and save it to a file in the folder `results`, and the file includes titles, you should indicate it by using the parameter `include_titles`:
  `python tasks.py anonymize_doc --origin_path="./data" --file_name="dataset1.csv" --column_to_use=2 --save_file --destination_folder="./results" --include_titles`
- If there's an error in the parameters you send, you will see in the console what is wrong (and suggestions to fix it).
- If you add the flag `save_file` when you anonymize free text, the file will be given a default name such as `texto_anonimizado.txt`.
- If you anonymize text from a csv file, the output file will be named like the original file but with `_anonimizado` added at the end (see the sketch below).
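As a small illustration of that naming rule (assuming the original extension is kept; this is not code from the repo):

```python
from pathlib import Path

# Build the output name described above from the original file name.
original = Path("./data/dataset1.csv")
anonymized = original.stem + "_anonimizado" + original.suffix
print(anonymized)  # dataset1_anonimizado.csv
```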
Use the `evaluate_efficiency` function when you want to compare the anonymization made by the model against expected annotations. In order to achieve that, you should provide a text file to be anonymized (txt) and a file with the corresponding annotations (json). You can find an example JSON annotations file in the folder `data`: `annotations_example_eval_efficiency.json`.
Parameters:
- `origin_path`: Path to the file to be anonymized in order to evaluate efficiency.
- `file_name`: The name of the file to be anonymized in order to evaluate efficiency (MUST be txt).
- `json_origin_path`: Path to the json file with the annotations for the document indicated above.
- `json_file_name`: The name of the json file with the annotations for the document indicated above (MUST be json).
- `destination_folder`: Path where the comparison between the anonymization and the annotations will be saved.
- `results_file_name`: The name of the file where the comparison results will be added (it will be a csv file). The default name is obtained from the configuration file and reported in the console.
In the results file you will find four columns:
- Entidad: the entity type, which should be included in the list of entities declared in the configuration file.
- Modelo: number (integer) of entities recognized by the model.
- Esperado: number (integer) of entities identified by the annotations (this is the ideal number of entities the model should recognize).
- Efectividad: model efficiency (percentage).

The efficiency is calculated as: Efectividad = Modelo / Esperado * 100.
Based on that, if the model detects more entities than expected, the efficiency can be greater than 100% (weird, isn't it?). That means either that entities are missing from the annotations or that the model is marking things as entities that are not. Whatever the case, it should be reviewed.
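As a minimal sketch of that calculation (the entity names and counts below are made up, not output from this tool):

```python
# Illustration of the Efectividad formula; the numbers are invented.
counts = {
    # entity: (Modelo, Esperado)
    "PER": (8, 10),
    "LOC": (5, 4),   # more detections than annotated -> over 100%
}

for entity, (modelo, esperado) in counts.items():
    efectividad = modelo / esperado * 100
    print(f"{entity}: {efectividad:.1f}%")
# PER: 80.0%
# LOC: 125.0%  -> review the annotations or the model output
```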
- If you want to evaluate the efficiency for `testing_data/test_eval_ef.txt`, using annotations from `testing_data/annotations_test_eval_ef.json` and saving the results in the folder `results` without setting a `results_file_name`:
  `python tasks.py evaluate_efficiency --origin_path="./testing_data" --file_name="test_eval_ef.txt" --json_origin_path="./testing_data" --json_file_name="annotations_test_eval_ef.json" --destination_folder="./results"`
- If you want to evaluate the efficiency for `testing_data/test_eval_ef.txt`, using annotations from `testing_data/annotations_test_eval_ef.json` and saving the results in the folder `results` with the file name `results_evaluate_efficiency`:
  `python tasks.py evaluate_efficiency --origin_path="./testing_data" --file_name="test_eval_ef.txt" --json_origin_path="./testing_data" --json_file_name="annotations_test_eval_ef.json" --destination_folder="./results" --results_file_name=results_evaluate_efficiency`
- If there's an error in the parameters you send, you will see in the console what is wrong (and suggestions to fix it).
- To see all entities that are being detected, check `ENTITIES_LIST` in `configuration.py`.
Open a console and run the command
`python tasks.py [function you want to use] [parameters for that function]`
- Linting:
  `pre-commit install`
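  After installing the hooks, they run automatically on each commit; to lint the whole repository manually you can use pre-commit's standard command (not specific to this repo):
  `pre-commit run --all-files`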