DF_NLP

A collection of scripts related to digital forensics NLP

Table of contents
  1. About the project
  2. Requirements
  3. Getting started
  4. Usage
  5. Data
  6. Roadmap
  7. Contributing
  8. Versions
  9. Author
  10. License
  11. Acknowledgement

About the project

This repository gathers several NLP tools used on digital forensics data. It allows:

  • the extraction of full texts and abstracts from a raw corpus;
  • the benchmarking of several automatic term extraction (ATE) methods against the SemEval-2017 Task 10 corpus; and
  • the extraction of candidate terms from digital forensics full texts and abstracts.

Requirements

To use this package you will need the latest version of Python, Git, and access to the LRCFS GitHub.

Getting started

Create and activate your virtual environment:

# Example with the venv package
python3 -m venv ~/.venv/myenv
source ~/.venv/myenv/bin/activate

Download the latest version of DF_NLP and install it:

git clone git@github.com:LRCFS/DF_NLP.git
python3 -m pip install -e ./DF_NLP

Install all the dependencies and test the package:

cd ./DF_NLP
make init
make tests

Usage

Corpus creation

Corpus creation is handled by corpus.py. To create the corpus you will need the CSL JSON bibliography and the authentication keys for the Elsevier, Springer and IEEE APIs. The IEEE API has a usage limit (200 queries per day), so you may need to provide a threshold value to skip the IEEE references already handled in a previous run of the script. The new threshold value to use is provided each time such an exception is raised.

To create the corpus, use one of the following commands:

# You can run the script without providing any threshold
./DF_NLP/corpus.py ./path_to_input.json ./path_to_output_dir/ ./path_to_API_keys.json

# or provide one (here the first 344 IEEE references will be skipped)
./DF_NLP/corpus.py ./path_to_input.json ./path_to_output_dir/ ./path_to_API_keys.json --threshold 345

The file containing the API keys must have the following format:

{
    "elsevier": "Elsevier_token",
    "springer": "Springer_token",
    "ieee": "IEEE_token"
}

The script will create an output file named corpus.json containing the main data of the references, enriched with abstracts and paths to the full texts (when available). It will also write all the full texts to the output directory.

ATE methods benchmarking

A training corpus can be used to benchmark five ATE methods.

These ATE methods can be benchmarked using one or more of the following scoring methods:

  • PRF, which computes three scores for the extraction: precision, recall and the F-measure, which balances the two. The F-measure takes a parameter beta that gives more weight to precision (beta < 1) or to recall (beta > 1); see the sketch after this list.
  • Precision@K (P@K), which computes the precision over the terms ranked above rank K.
  • Bpref, which evaluates how many incorrect terms are ranked higher than correct ones.
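
As a reference for these scores, below is a minimal sketch of how PRF and Precision@K are typically computed from a ranked list of extracted terms and a gold-standard term set. It only illustrates the definitions above (including the role of beta and K); it is not the implementation used by benchmarking.py:

# Illustrative sketch only: standard definitions of the PRF and
# Precision@K scores, not the code used by benchmarking.py.

def prf(extracted, gold, beta=1.0):
    """Precision, recall and F-measure of the extracted terms."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # beta < 1 favours precision, beta > 1 favours recall
    f_measure = (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
    return precision, recall, f_measure


def precision_at_k(ranked_terms, gold, k=500):
    """Precision restricted to the k highest-ranked candidate terms."""
    gold = set(gold)
    top_k = ranked_terms[:k]
    if not top_k:
        return 0.0
    return sum(1 for term in top_k if term in gold) / len(top_k)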

The default benchmark uses the PRF scoring:

# You can use the default value for beta (beta = 1)
./benchmarking.py ./path_to_the_input_dir/ ./path_to_the_output_dir/
# Or provide your own value
./benchmarking.py ./path_to_the_input_dir/ ./path_to_the_output_dir/ --beta 1.5

But you can also choose which scoring methods to use:

# You can use Bpref
./benchmarking.py ./path_to_the_input_dir/ ./path_to_the_output_dir/ --scoring Bpref

# You can use several scoring methods
./benchmarking.py ./path_to_the_input_dir/ ./path_to_the_output_dir/ --scoring PRF Bpref

Finally, when using P@K you can either keep the default ranks or provide your own:

# Use default ranks (500, 1000 and 5000)
./benchmarking.py ./path_to_the_input_dir/ ./path_to_the_output_dir/ --scoring P@K

# Or provide your own ranks
./benchmarking.py ./path_to_the_input_dir/ ./path_to_the_output_dir/ --scoring PRF P@K --ranks 64 128 256 412

Corpus term extraction

Once you have chosen the relevant ATE method you can run the ATE pipeline on your corpus. You can use one or more ATE methods; the default is Weirdness (sketched after the commands below).

# You can use Weirdness by default
./ate.py ./path_to_corpus.json ./path_to_output_dir/

# Or select one or more ATE methods
./ate.py ./path_to_corpus.json ./path_to_output_dir/ --ate Basic Cvalue
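
For context, Weirdness is usually defined as the ratio between a term's relative frequency in the domain corpus and its relative frequency in a general reference corpus, so domain-specific terms score higher. The function below is only an illustrative sketch of that definition, not the code used by ate.py:

# Illustrative sketch only: the usual Weirdness ratio, not the code used by ate.py.

def weirdness(domain_count, domain_size, general_count, general_size):
    """Relative frequency of a term in the domain corpus divided by its
    relative frequency in a general reference corpus."""
    domain_rel = domain_count / domain_size
    general_rel = general_count / general_size
    return float("inf") if general_rel == 0 else domain_rel / general_rel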

Data

The data directory contains several input and output files related to this project:

  • biblio.json, which is the input file for the corpus creation.
  • corpus.zip, an archive containing corpus.json and the full texts.
  • Training_set_SemEval_2017_Task_10, the directory containing the text and annotation files for the benchmark.

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Versions

For the versions available, see the tags on this repository.

Author(s)

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgement
