SLRKIT command

The slrkit command manages an slr-kit project. It automates and handles all the phases of the document analysis. To do so, it relies on a set of configuration files that are stored in the slr-kit project and are created and managed by the slrkit command itself.

SLR-KIT projects

An SLR-KIT project is a collection of files generated by the SLR-KIT scripts. All of these files are generated from a set of documents that the user wishes to analyze. A project is also a git repository: the slrkit command initializes this repository when the project is first created and helps track only the meaningful files.

Anatomy of a project

An SLR-KIT project is a directory that contains all the files related to an analysis. This directory must also contain a META.toml file and a project configuration directory.

META.toml

This file contains all the metadata about the project. It must be a TOML version 1.0.0 file. It must contain two dictionaries: Project and Source.

Project dictionary

This dictionary contains information about the project, such as the name of the project, a description, the location of the configuration directory, and more. The allowed keys and their meanings are described in the following table:

| Key | Description | Type |
| --- | ----------- | ---- |
| Author | Information about the author of the project | string |
| Config | Name of the configuration directory | string |
| Description | Description of the project | string |
| Keywords | List of keywords related to the documents | list of strings |
| Name | Name of the project | string |
Source dictionary

This dictionary contains information about the source of the documents analyzed in the project. The allowed keys and their meanings are described in the following table:

| Key | Description | Type |
| --- | ----------- | ---- |
| URL | URL of the site used to retrieve the documents | string |
| Query | Query string used to retrieve the documents | string |
| Date | Date on which the documents were retrieved | string |
| Origin | Description of the origin of the documents | string |

The Origin key is meant to be used when the documents are retrieved without the help of a bibliographical search engine. In this case the URL and Query keys shall be left empty.
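
For illustration, a minimal META.toml might look like this (all values are hypothetical; Origin is omitted because in this example the documents come from a search engine):

[Project]
Name = "myproject"
Author = "Jane Doe"
Description = "Systematic review on topic modeling"
Keywords = ["topic modeling", "LDA"]
Config = "slrkit.conf"

[Source]
URL = "https://www.scopus.com"
Query = "TITLE-ABS-KEY(topic AND modeling)"
Date = "2021-05-01"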

The project configuration directory

This directory is located inside the project directory. Its name is saved in the META.toml file under the Project.Config key. The default name for this directory is slrkit.conf, but a different name may be used. This directory contains all the configuration files used by the project. The configuration files must be TOML v. 1.0.0 files. Information about each file can be found in the documentation of each slrkit sub-command. Each relative path included in the configuration files is considered relative to the project root directory. The configuration directory also contains a log directory with all the log files produced during the project. All the scripts that write a log use the slr-kit.log file saved in the log directory.

The slrkit command

The slrkit command is the tool used to handle a project. It uses the META.toml file and the files in the project configuration directory to automate the operations. It is composed of several sub-commands that handle and automate all the phases of a project.

Usage:

python3 slrkit.py [-C /path/to/project] sub-command sub-command-arguments ...

The sub-commands are:

  • init: initialize an slr-kit project;
  • import: import a bibliographic database, converting it to the csv format used by slr-kit;
  • journals: extract and filter a list of journals. Requires a sub-command;
  • acronyms: extract acronyms from texts;
  • preprocess: run the preprocess stage in an slr-kit project;
  • terms: extract and handle lists of terms in an slr-kit project. Requires a sub-command;
  • fawoc: run FAWOC in an slr-kit project;
  • topics: extract topics from the documents of an slr-kit project;
  • report: run the report creation script in an slr-kit project;
  • record: record a snapshot of the project in the underlying git repository;
  • stopwords: extract a list of terms classified as stopwords from the terms file;
  • readme: create and commit a README.md file using the information in META.toml;
  • build: re-create the non-versioned files after a git clone.

Each command operates on the directory from which the slrkit command is run. The -C option changes the working directory to the specified one before running the sub-command.

Commands workflow

Usually the workflow is the following:

  1. initialize a project with the init command. Fill in the information in the META.toml file and save in the project the bibliographical database with the information on the papers. The name of this file must be written in the import.toml file in the configuration directory;
  2. import the data from the bibliographical database into a csv file with the import command;
  3. (optional) create a list of the journals that have published the papers with the journals extract command. This list can be reviewed and classified to exclude papers from non-relevant journals;
  4. (optional) review the list of journals with the fawoc journals command;
  5. (optional) use the classification made in the step above to mark the papers that come from a discarded journal. This step can be done with the journals filter command;
  6. (optional) extract a list of acronyms with the acronyms command. This list can be reviewed to find the relevant acronyms;
  7. (optional) classify the acronyms with the fawoc acronyms command;
  8. select the stop-words that have to be filtered from the papers. The stop-words must be stored in one or more files. Their names must be included in the preprocess.toml file in the configuration directory;
  9. (optional) if there are lists of terms that are surely relevant, these lists must be stored in the project, and their names must be included in the preprocess.toml file;
  10. prepare the text for the subsequent processing with the preprocess command;
  11. generate the list of terms with the terms generate command;
  12. classify the terms with the fawoc terms command;
  13. extract the topics and retrieve the document-topic association with the topics extract command;
  14. prepare a report with some statistics about the papers with the report command.

Before running any command, it is highly recommended to review its settings file to check that everything is correct.
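
For reference, assuming META.toml, import.toml and preprocess.toml have been filled in as described above, a minimal session covering the non-optional steps looks like this (the project name is hypothetical):

python3 slrkit.py init myproject
python3 slrkit.py import
python3 slrkit.py preprocess
python3 slrkit.py terms generate
python3 slrkit.py fawoc terms
python3 slrkit.py topics extract
python3 slrkit.py report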

Optionally, the topics optimize command (faster) or the lda_grid_search command (slower, and not directly available; see below) can be used to find the best LDA model.

The record command is designed to record the meaningful files of the project in a git repository. Its use is highly recommended.

The stopwords command retrieves the list of stopwords identified during the classification of the terms. The list created by this command can be used to refine the generation of the terms.

Commands reference

init

Initialize the current directory as an SLR-KIT project. Usage:

python3 slrkit.py init [--author AUTHOR] [--description DESCRIPTION] [--no-backup] name

The name argument is the name of the project. It will be used as a prefix for all the suggested file names. The --author option specifies the project author, while the --description option specifies the project description. The command creates the META.toml file with the information given on the command line. The user shall complete the content of this file.

This command also creates the configuration directory, which is populated with all the configuration files handled by the slrkit command. The file format is TOML version 1.0.0. The name of each file is the name of the corresponding slrkit sub-command (e.g. preprocess.toml is the configuration file for the slrkit preprocess command), and each file contains a key for each parameter of the corresponding script. Refer to the documentation of each script and command for additional information about the configuration parameters. In each file, comments explain each parameter, and good default names are suggested for the output files of each script.

The init command also copies the ga_param.toml file to the configuration directory under the name optimize_lda_ga_params.toml. This file provides the parameters used in the optimization performed by the topics optimize command. See the documentation of that command for more information.

This command can be executed on an already initialized project. In this case, the information in META.toml is updated with the values given on the command line; all the other fields are left untouched. The configuration files are updated as well: if one or more options are missing, they are filled with their default values, while the other values are not changed. The original toml files are backed up in the configuration directory before any modification. The backups have the same names as the original files, with the extension .bak. If the user gives the --no-backup option, no backup is performed.

The init sub-command also initializes the git repository of the project. A .gitignore file is provided. Its content is produced by collecting the output of the to_ignore function of each module. Each module is imported and, if it defines a to_ignore function, that function is called with the content of the configuration file of the script as a dictionary. The function must return a list of file names to ignore. If there is something wrong in the configuration data, the function must raise a ValueError exception with the reason of the error. The message of the exception is used to create the error message shown to the user.
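
For illustration, a minimal to_ignore function inside a script module might look like this sketch (the configuration key names depend on the actual script):

# Sketch of a to_ignore function; config is the parsed
# configuration file of the script as a dictionary.
def to_ignore(config):
    output = config.get('output')
    if not output:
        # The exception message is shown to the user.
        raise ValueError('the output file name is missing')
    # File names returned here end up in .gitignore.
    return [output]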

A first commit is recorded with:

  • the META.toml file;
  • all the configuration files;
  • the provided .gitignore.

import

This command imports a bibliographical database into the project, converting it to the csv format used by all the scripts. The output of this command will be called the abstracts file in the rest of this document.

Usage:

python3 slrkit.py import [--list_columns]

The import sub-command uses the import.toml configuration file and runs the import_biblio.py script. It imports the database into a csv file usable by the other commands. Each paper is assigned a progressive identification number in the id column. All the selected columns are imported from the input file. The citation count of each paper is also retrieved and imported in the citation column. If the --list_columns option is set, the command only outputs the list of available columns of the input file specified in the configuration file, and no data is imported.

The import.toml has the following structure:

  • input_file: path to the bibliographical database to import. Important: this field is not pre-filled by the init command; the user must fill it before running the import command. This file is committed to the git repository by the record command;
  • type: type of the database to import. Currently, the only supported type is RIS;
  • output: name of the output file. It is pre-filled with <project-name>_abstracts.csv;
  • columns: comma separated list of columns to import. It is pre-filled with title,abstract,year,journal,citation.
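
A filled-in import.toml might look like this (the input file name is hypothetical):

input_file = "scopus_export.ris"
type = "RIS"
output = "myproject_abstracts.csv"
columns = "title,abstract,year,journal,citation"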

journals

This command allows the user to retrieve a list of journals and classify them, in order to filter out the non-relevant ones and the papers published in them.

This command accepts two sub-commands:

  • extract: extracts the list of journals from the abstracts file;
  • filter: uses the manual classification of the list of journals to filter out the papers published in the non-relevant journals.

Usage:

python3 slrkit.py journals {extract, filter}

If the journals command is invoked without a sub-command, the extract sub-command is run.

journals extract

The extract sub-command produces a list in the format used by FAWOC. The structure is the following:

  • id: a progressive identification number;
  • term: the name of the journal;
  • label: the label added by FAWOC to the journal. This field is left blank by the extract sub-command;
  • count: the number of papers published in the journal.

FAWOC will move the count field to the fawoc_data.tsv file.

The extract sub-command uses the journals_extract.toml configuration file and runs the journal_lister.py script. The journals_extract.toml file has the following structure:

  • abstract_file: name of the abstracts file. It is pre-filled with <project-name>_abstracts.csv;
  • outfile: name of the output file. It is pre-filled with <project-name>_journals.csv.

journals filter

The filter sub-command filters the papers using the manual classification of the list of journals. It adds the status column to the abstracts file. This column will have the value good for the papers published in a journal classified with the relevant or the keyword label. All the papers from journals not classified as relevant or keyword will be marked with the rejected value in the status column.

The filter sub-command uses the journals_filter.toml configuration file and runs the filter_paper.py script. The journals_filter.toml file has the following structure:

  • abstract_file: name of the abstract file. This file is used as both input and output. It is pre-filled with <project-name>_abstracts.csv;
  • journal_file: name of the journal list file produced by journals extract. It is pre-filled with <project-name>_journals.csv.

acronyms

This command extracts acronyms from the papers. Its output format is suitable to be used with FAWOC to classify which acronyms are relevant. If the input file (the abstracts file) contains the status column created by the journals filter command, the acronyms command uses that column to filter out the papers published in the rejected journals. The output of this command will be called the acronyms file in the rest of this document.

Usage:

python3 slrkit.py acronyms

The acronyms sub-command uses the acronyms.toml configuration file and runs the acronyms.py script. The output is in tsv format and has the following structure (suitable for FAWOC):

  • id: a progressive identification number;
  • term: the acronym in the form extended-acronym | (abbreviation);
  • label: the label added by FAWOC to the acronym. This field is left blank by the acronyms command.

No fawoc_data file is produced, so no count field is available for FAWOC.
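
For example, a hypothetical row of the acronyms file could be (tab-separated, with the label column left blank):

23	Latent Dirichlet Allocation | (LDA)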

After a successful execution, the command updates the preprocess.toml file, setting its acronyms field to the name of the output file. All the commands consider only the acronyms classified with the relevant or the keyword label; all the other acronyms are ignored.

The acronyms.toml has the following structure:

  • datafile: input file. It is pre-filled with the value <project-name>_abstracts.csv;
  • output: output file. It is pre-filled with the value <project-name>_acronyms.csv;
  • columns: name of the column of datafile with the text to process. It is pre-filled with the value abstract.

preprocess

The preprocess sub-command prepares the documents for the subsequent processing steps.

Usage:

python3 slrkit.py preprocess

If the input file (the abstracts file) contains the status column created by the journals filter command, the preprocess command uses that column to filter out the papers published in the rejected journals. It also filters the stop-words using the lists of words provided by the user. No default list of stop-words is used: the user must provide their own lists.

This command also uses the acronyms file to find the acronyms in the text and mark them as relevant words. Only the acronyms with the relevant or the keyword label are considered.

The preprocess command also marks as relevant all the terms provided by the user in the relevant terms lists. The user can also choose how the command marks these terms. The input of this command is the abstracts file. The output is the abstracts file without the papers discarded because published in rejected journals, with a new column added containing the preprocessed text of each paper. More information can be found in the preprocess.py section of the README.md.

The preprocess sub-command uses the preprocess.toml configuration file. This file has the following structure:

  • datafile: the name of the abstracts file that will be used as input. This field is pre-filled with <project-name>_abstracts.csv;
  • output: output file name. This field is pre-filled with <project-name>_preproc.csv;
  • placeholder: placeholder used to mark the barriers (the stop-words and the punctuation). This character is also used as prefix and suffix of the placeholders for the relevant terms and the acronyms. It is pre-filled with the character @;
  • stop-words: lists of stop-words provided by the user. No other lists are used, so the user shall provide their own;
  • relevant-term: lists of relevant terms. This field is particular: each element must be a list of at least one and at most two items. The first item is the name of a file with a list of relevant terms. The second item, if present, is the marker to be used for all the terms in that list; each of these terms will be replaced with <placeholder><marker><placeholder>. If the marker is omitted, the command replaces every term with a placeholder, followed by all the words of the term joined with the _ character, followed by another placeholder (see the example after this list);
  • acronyms: name of the acronyms file. If the acronyms command was run before, it is pre-filled with <project-name>_acronyms.csv;
  • target-column: name of the column containing the document text. It is pre-filled with abstract;
  • output-column: name of the column added to the output, containing the preprocessed text. It is pre-filled with abstract_lem;
  • input-delimiter: input file field delimiter. It is pre-filled with \t;
  • output-delimiter: output file field delimiter. It is pre-filled with \t;
  • rows: number of rows of the input file to process. If empty, all the rows are processed;
  • language: language of the text. It must be an ISO 639-1 two-letter code. Pre-filled with en;
  • regex: csv file with dataset-specific regex substitutions to apply to the text.
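
As an illustration, a possible preprocess.toml might look like this (file names and the DOMAIN marker are hypothetical; optional fields such as rows and regex are omitted):

datafile = "myproject_abstracts.csv"
output = "myproject_preproc.csv"
placeholder = "@"
stop-words = ["stop_words.txt"]
# each element is [file] or [file, marker]
relevant-term = [["relevant_terms.txt"], ["domain_terms.txt", "DOMAIN"]]
acronyms = "myproject_acronyms.csv"
target-column = "abstract"
output-column = "abstract_lem"
input-delimiter = "\t"
output-delimiter = "\t"
language = "en"

With this configuration, every term of the second list would be replaced with @DOMAIN@, while each term of the first list would be replaced with its own placeholder-delimited form.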

The output of this command will be called the preprocess file in the rest of this document.

terms

This command allows the user to generate and handle lists of terms.

This command accepts one sub-command:

  • generate: generate a list of terms that have to be classified.

Usage:

python3 slrkit.py terms {generate}

If the terms command is invoked without a sub-command, the generate sub-command is run.

terms generate

The generate sub-command generates a list of terms from the documents in the preprocess file. This command runs the gen_terms.py script. The format of this list is the one used by FAWOC. The structure is the following:

  • id: a progressive identification number;
  • term: the n-gram;
  • label: the label added by FAWOC to the n-gram. This field is left blank by the terms generate command.

This command also produces the fawoc_data.tsv file, with the following structure:

  • id: the identification number of the term;
  • term: the term;
  • count: the number of occurrences of the term.

The output of this command will be called the terms file in the rest of this document.

The terms generate sub-command uses the terms_generate.toml configuration file. It has the following structure:

  • datafile: name of the input file (the preprocess file). It is pre-filled with <project-name>_preproc.csv;
  • output: name of the output file. It is pre-filled with <project-name>_terms.csv;
  • stdout: if true, the command also prints the output to the standard output;
  • n-grams: maximum size of an n-gram. All the n-grams with lengths from one word up to this number of words are generated. By default, this field is filled with 4;
  • min-frequency: minimum number of occurrences of an n-gram. All the n-grams with fewer occurrences than this value are discarded. Pre-filled with 5;
  • placeholder: placeholder used to mark the barriers in the preprocess stage. All the n-grams containing this character, or containing words that start and end with this character, are discarded. It is pre-filled with the character @;
  • column: column of the input file with the text to process. Pre-filled with abstract_lem;
  • delimiter: field delimiter used by the input file. Pre-filled with \t.

fawoc

The fawoc command runs FAWOC on a list produced by the previous commands. This command accepts three sub-commands:

  • terms: run FAWOC on the terms file;
  • journals: run FAWOC on the journals file;
  • acronyms: run FAWOC on the acronyms file.

Usage:

python3 slrkit.py fawoc {terms, journals, acronyms} [--input LABEL] [--width WIDTH]

The optional arguments are passed to FAWOC and override the corresponding values in the configuration file:

  • --input: label to review;
  • --width: width of the FAWOC windows, in number of columns.

If the fawoc command is invoked without a sub-command, the terms sub-command is run. Each sub-command writes to its own profiler file, in the log directory of the project.

fawoc terms

The fawoc terms sub-command allows the user to classify the terms file. This command uses the fawoc_terms.toml configuration file that has the following structure:

  • datafile: file to classify. Pre-filled with <project-name>_terms.csv;
  • input: label to review;
  • dry-run: if true, FAWOC does not write anything to the datafile on exit;
  • no-auto-save: if true, auto-saving is disabled;
  • no-profile: if true, no data is written to the profiler file;
  • width: width of the FAWOC windows in columns.

The profiler file for this sub-command is fawoc_terms_profiler.log in the log directory of the project.

fawoc journals

The fawoc journals sub-command allows the user to classify the journals file. This command uses the fawoc_journals.toml configuration file. Its structure is the same as that of fawoc_terms.toml. The only difference is that the datafile field is pre-filled with <project-name>_journals.csv.

The profiler file for this sub-command is fawoc_journals_profiler.log in the log directory of the project.

fawoc acronyms

The fawoc acronyms sub-command allows the user to classify the acronyms file. This command uses the fawoc_acronyms.toml configuration file. Its structure is the same as that of fawoc_terms.toml. The only difference is that the datafile field is pre-filled with <project-name>_acronyms.csv.

The profiler file for this sub-command is fawoc_acronyms_profiler.log in the log directory of the project.

topics

The topics command extracts topics from the documents of the project.

This command accepts the following sub-commands:

  • extract: extracts topics from the documents;
  • optimize: optimizes the parameters of the topic extraction algorithm and uses those parameters to extract topics.

The sub-command is always required.

topics extract

This sub-command trains an LDA model and outputs the extracted topics and the association between topics and documents.

Usage:

python3 slrkit.py topics extract [--config CONFIG | --directory DIRECTORY] [--uuid UUID] [--id ID]

Optional arguments:

  • --config | -c CONFIG: specifies a configuration file different from the default one;
  • --directory | -d DIRECTORY: specifies the path to the directory with the results of the optimization phase;
  • --uuid | -u UUID: UUID of the model stored in the result directory;
  • --id ID: 0-based id of the model stored in the result directory. The association between id and model is stored in the results.csv file of the result directory. This file is sorted by coherence, so the id 0 is the best model. If both --uuid and this option are missing and --directory is present, --id is assumed to be 0.

The --config and --directory options are mutually exclusive, and so are the --uuid and --id options. The --directory option, in conjunction with --uuid or --id, allows the user to select one model from a run of the topics optimize command (or the lda_ga.py script). If one of the --uuid/--id options is present, --directory is required, otherwise the command ends with an error.

This command runs the lda.py script. The topics extract sub-command uses, by default, the lda.toml configuration file that has the following structure:

  • preproc_file: name of the preprocess file. Pre-filled with <project-name>_preproc.csv;
  • terms_file: name of the terms file. Pre-filled with <project-name>_terms.csv;
  • outdir: path to the directory where to save the results. Pre-filled with the path to project directory;
  • text-column: column of the preprocess file to elaborate. Pre-filled with abstract_lem;
  • title-column: column in the preprocess file to use as document title. Pre-filled with title;
  • topics: number of topics to extract. Pre-filled with 20;
  • alpha: alpha parameter of LDA. Pre-filled with auto;
  • beta: beta parameter of LDA. Pre-filled with auto;
  • no_below: keep tokens which are contained in at least this number of documents. Pre-filled with 20;
  • no_above: keep tokens which are contained in no more than this fraction of documents (fraction of total corpus size, not an absolute number). Pre-filled with 0.5;
  • seed: seed to be used in training;
  • model: if true, the LDA model is saved in the directory <outdir>/lda_model. The model is saved with the name "model";
  • no-relevant: if set, use only the terms labelled as keyword in the terms file;
  • load-model: path to a directory where a previously trained model is saved. Inside this directory, the model named "model" is searched for. The loaded model is used with the dataset file to generate the topics and the topic-document association;
  • no_timestamp: if true, no timestamp is added to the output file names;
  • placeholder: placeholder for the barriers. Pre-filled with @;
  • delimiter: field delimiter used in the preprocess file. Pre-filled with \t.

The command takes care of setting the PYTHONHASHSEED environment variable to 0, so setting the seed value is enough to obtain reproducible runs.

More information on the PYTHONHASHSEED variable can be found in the Python documentation.

topics optimize

This sub-command runs the lda_ga.py script to find the best combination of parameters for an LDA model.

Usage:

python3 slrkit.py topics optimize

The topics optimize sub-command uses the optimize_lda.toml configuration file that has the following structure:

  • preproc_file: name of the preprocess file. Pre-filled with <project-name>_preproc.csv;
  • terms_file: name of the terms file. Pre-filled with <project-name>_terms.csv;
  • ga_params: path of the file with the parameters used by the GA. Pre-filled with the absolute path to the optimize_lda_ga_params.toml file in the configuration directory;
  • outdir: path to the directory where to save the results. Pre-filled with the path to project directory;
  • text-column: column of the preprocess file to elaborate. Pre-filled with abstract_lem;
  • title-column: column in the preprocess file to use as document title. Pre-filled with title;
  • seed: seed to be used in training;
  • placeholder: placeholder for the barriers. Pre-filled with @;
  • delimiter: field delimiter used in the preprocess file. Pre-filled with \t;
  • no_timestamp: if true, no timestamp is added to the output file names.

The ga_params file has the following structure:

  • limits: this section contains the ranges of the parameters;
    • min_topics: minimum number of topics;
    • max_topics: maximum number of topics;
    • max_no_below: maximum value of the no-below parameter. The minimum is always 1. A value of -1 means a tenth of the number of documents;
    • min_no_above: minimum value of the no-above parameter. The maximum is always 1.
  • algorithm: this section contains the parameters used by the GA:
    • mu: number of individuals that will pass each generation;
    • lambda: number of individuals that are generated at each generation;
    • initial: size of the initial population;
    • generations: number of generations;
    • tournament_size: number of individuals randomly selected for the selection tournament.
  • probabilities: this section contains the probabilities used by the script:
    • mutate: probability of mutation;
    • component_mutation: probability of mutation of each individual component;
    • mate: probability of crossover (also called mating);
    • no_filter: probability that a new individual is created with no term filter (no_above = no_below = 1);
  • mutate: this section contains the parameters of the Gaussian distributions used by the mutation for each parameter:
    • topics.mu and topics.sigma are the mean value and the standard deviation for the topics parameter;
    • alpha_val.mu and alpha_val.sigma are the mean value and the standard deviation for the value of the alpha parameter;
    • beta.mu and beta.sigma are the mean value and the standard deviation for the beta parameter;
    • no_above.mu and no_above.sigma are the mean value and the standard deviation for the no_above parameter;
    • no_below.mu and no_below.sigma are the mean value and the standard deviation for the no_below parameter;
    • alpha_type.mu and alpha_type.sigma are the mean value and the standard deviation for the type of the alpha parameter.

Refer to the documentation of the lda_ga.py script in README.md for more information about the behaviour of the script and the GA parameters.
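
A sketch of a possible ga_params file follows; all the numeric values are hypothetical and are not meant as recommended settings:

[limits]
min_topics = 5
max_topics = 20
max_no_below = -1
min_no_above = 0.1

[algorithm]
mu = 10
lambda = 20
initial = 50
generations = 20
tournament_size = 4

[probabilities]
mutate = 0.2
component_mutation = 0.5
mate = 0.5
no_filter = 0.05

[mutate]
topics = { mu = 0.0, sigma = 5.0 }
alpha_val = { mu = 0.0, sigma = 0.1 }
beta = { mu = 0.0, sigma = 0.1 }
no_above = { mu = 0.0, sigma = 0.1 }
no_below = { mu = 0.0, sigma = 10.0 }
alpha_type = { mu = 0.0, sigma = 1.0 }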

The script outputs all the trained models in <outdir>/<date>_<time>_lda_results/models/<UUID>. The command also outputs the topics and the document-topic association for each trained model.

For each trained model, a toml file is produced with all the parameters already set to use the corresponding model with the lda.py script or the topics extract command. These toml files are saved in <outdir>/<date>_<time>_lda_results/toml/<UUID>.toml, and can be loaded in the lda.py script or in the topics extract command using the --config option. The command also outputs a tsv file in <outdir>/<date>_<time>_lda_results/results.csv with the following format:

  • id: progressive identification number;
  • topics: number of topics;
  • alpha: alpha value;
  • beta: beta value;
  • no_below: no-below value;
  • no_above: no-above value;
  • coherence: coherence score of the model;
  • times: time spent evaluating this model;
  • seed: seed used;
  • uuid: UUID of the model;
  • num_docs: number of documents;
  • num_not_empty: number of documents not empty after filtering.

The script also outputs the extracted topics and the topic-document association produced by the best model. The topics are output in <outdir>/lda_terms-topics_<date>_<time>.json and the topics assigned to each document in <outdir>/lda_docs-topics_<date>_<time>.json. A txt file with a summary of the results is also produced, with name <outdir>/lda_info_<date>_<time>.txt.

The command takes care of setting the PYTHONHASHSEED environment variable to 0, so setting the seed value is enough to obtain reproducible runs.

More information on the PYTHONHASHSEED variable can be found in the Python documentation.

report

The report command produces some reports with statistics about the papers analyzed by the topics extract command. This command runs the topic_report.py script.

Usage:

python3 slrkit.py report [docs_topics_file terms_topics_file]

With no arguments, the command searches for all the lda_docs-topics*.json and lda_terms-topics*.json files in the current directory and uses the most recent one of each type. Files with these names are the ones produced by the topics extract command and contain the association between documents and topics and the association between terms and topics. The docs_topics_file and terms_topics_file arguments allow the user to select a different set of JSON files.

The command uses the report.toml configuration file that has the following structure:

  • abstract_file: name of the abstracts file of the project. It is pre-filled with <project-name>_abstracts.csv;
  • dir: output directory where the templates and the reports are saved. If empty, the current directory is used;
  • minyear: minimum year to consider. If empty, the minimum year found in the data is used;
  • maxyear: maximum year to consider. If empty, the maximum year found in the data is used;
  • plotsize: number of topics to be displayed in each subplot saved in the report directory;
  • compact: if true, the command creates a compact table for the topics;
  • no_stats: if true, the topics table does not show the statistics about terms.

On the first run, the command copies report_template.md and report_template.tex from the report_template directory inside this repository to the current project. These two files are used to create the reports. The user can customize the two copied templates as desired.

The command creates a directory named report<timestamp> containing:

  • the report (called report.md) in markdown format;
  • the report (called report.tex) in LaTeX format;
  • a figure in png format (called reportyear.png) used by the two reports above;
  • a directory tables with some LaTeX files used by the LaTeX report.

For information about the statistics reported, refer to the topic_report.py documentation in the README file.

record

The record command creates a commit in the git repository of the project. This commit records all the data and the configuration of the project.

Usage:

python3 slrkit.py record [--clean] [--rm] message

The message argument is the commit message to use for the commit. It cannot be the empty string. The optional arguments are:

  • --clean: this flag tells the command to clean the repository index of the files not referenced in the configuration files. These files are left in the project, but they become untracked;
  • --rm: this flag tells the command to clean the project by removing the files not referenced in the configuration files. This flag removes these files from the repository index and from the file-system. Use with caution.

The command records the following files:

  1. the modifications made to the META.toml file;
  2. all the modified configuration files;
  3. all the modifications made to the .gitignore file;
  4. the README.md file if present;
  5. the bibliographic database used as input by the import command;
  6. the journals file;
  7. the acronyms file;
  8. the stop-words lists used by the preprocess command, if any;
  9. the relevant terms lists used by the preprocess command, if any;
  10. the terms file, with the corresponding fawoc_data.tsv file;
  11. all the profiler files created by the fawoc sub-commands.

The names of the files listed from 5 to 11 are taken from the configuration files of the commands that generate or use them. These files are committed only if they exist in the project at the moment the record command is run. If one of these files is deleted, or its name is no longer referenced in the configuration files, the record command does not remove the file from the repository unless the --clean flag is set. With the --rm flag, the record command removes and deletes the files that are no longer referenced in the configuration files. Use this option with caution.

The record command does not use any configuration file.

Auto-discovery of the file to record

The record command uses the to_record function of all the scripts used by the slrkit command to retrieve the list of files to record. The command imports each script and searches for the to_record function. If present, this function is called with the content of the configuration file of the script as a python dictionary. The function must return a list of file names to record. If there is something wrong in the configuration data, the function must raise a ValueError exception with the reason of the error. The message of the exception is used by the record command to create the error message shown to the user.
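
A minimal to_record function might look like this sketch (the key name is hypothetical and depends on the script):

# Sketch of a to_record function; config is the parsed
# configuration file of the script as a dictionary.
def to_record(config):
    input_file = config.get('input_file')
    if not input_file:
        # The exception message is shown to the user.
        raise ValueError('input_file is missing from the configuration')
    # File names returned here are committed by the record command.
    return [input_file]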

stopwords

The stopwords command extracts a list of terms classified as stopwords from the terms file. The command searches for the terms labelled as stopword in the terms file (the file that is the input of the fawoc terms command) and outputs the list of these terms (one per line). The created file is added to the stop-words list in preprocess.toml.

Usage:

python3 slrkit.py stopwords [--no-add] output

The output argument is the output file of the command. The --no-add optional argument prevents adding the output file to the stop-words list in preprocess.toml.

build

The build command executes the commands required to re-create the files that are not versioned. This command is helpful after cloning a slrkit project. For more information, see the section Exchanging slrkit projects with git below.

The command executes the following commands in order:

  1. import;
  2. journals filter;
  3. preprocess.

Usage:

python3 slrkit.py build

readme

The readme command creates and commits to git a README.md file for the project. The information is taken from the META.toml file. More precisely, the following information is used:

  • from the Project section:
    • Name;
    • Author;
    • Description;
  • from the Source section:
    • URL if present or Origin;
    • Date;
    • Query.

If one or more of these fields are empty, the command simply skips the corresponding README part. After the README is created, it is committed to the git repository.

Usage:

python3 slrkit.py readme

Commands not available

This section documents some commands that are not directly available. They can be activated and used by modifying the code of the slrkit command.

They are:

  • lda_grid_search: grid search optimization of the LDA parameters.

lda_grid_search

How to activate

Modify the argparse sub-parser of the topics optimize sub-command to accept a boolean option named grid-search. All the other code is ready.

How it works

The lda_grid_search command performs a grid search on the LDA model parameters and outputs all the trained models.

The command searches for the best combination (in terms of coherence) of the number of topics, alpha, beta, no-below and no-above parameters. It tries all the possible combinations of the parameters, discarding all the cases that result in all the documents being empty.

This command uses the lda_grid_search.toml configuration file that has the following format:

  • preproc_file: name of the preprocess file. Pre-filled with <project-name>_preproc.csv;
  • terms_file: name of the terms file. Pre-filled with <project-name>_terms.csv;
  • outdir: path to the directory where to save the results. Pre-filled with the path to project directory;
  • text-column: column of the preprocess file to elaborate. Pre-filled with abstract_lem;
  • title-column: column in the preprocess file to use as document title. Pre-filled with title;
  • min-topics: minimum number of topics to test. Pre-filled with 5;
  • max-topics: maximum number of topics to test. Pre-filled with 20;
  • step-topics: step used to create the grid of topics values. Pre-filled with 1;
  • seed: seed to be used in training;
  • plot-show: if true, a plot of the coherence is shown;
  • plot-save: if true, the plot of the coherence is saved as <outdir>/lda_plot.pdf;
  • placeholder: placeholder for the barriers. Pre-filled with @;
  • delimiter: field delimiter used in the preprocess file. Pre-filled with \t.

The command runs the lda_grid_search.py script. Refer to its documentation in the README for the criteria used to set up the grid of parameters.

Each trained model is assigned a UUID. The command outputs all the models in <outdir>/<date>_<time>_lda_results/<UUID>. It also outputs a tsv file in <outdir>/<date>_<time>_lda_results/results.csv with the following format:

  • id: progressive identification number;
  • corpus: descriptor of the corpus used. It has the form (labels, no_below, no_above), with labels the list of labels considered when filtering the documents (relevant and keyword or keyword alone). no_below and no_above have the same meaning as below;
  • no_below: no-below value;
  • no_above: no-above value;
  • topics: number of topics;
  • alpha: alpha value;
  • beta: beta value;
  • coherence: coherence score of the model;
  • times: time spent evaluating this model;
  • seed: seed used;
  • uuid: UUID of the model;
  • num_docs: number of documents;
  • num_not_empty: number of documents not empty after filtering.

The command takes care of setting the PYTHONHASHSEED environment variable to 0, so setting the seed value is enough to obtain reproducible runs.

More information on the PYTHONHASHSEED variable can be found in the Python documentation.

Exchanging slrkit projects with git

A slrkit project is a git repository, so it is possible to record the work done and exchange it using a remote repository. Since the record command tracks only the configuration of a project and the files that cannot be recreated directly using the slrkit commands, cloning/pulling a slrkit project requires some steps to recreate the missing files.

In particular, the following commands must be run:

  • import: to recreate the abstracts file;
  • journals filter: to mark the excluded papers. This is mandatory if a journals file is present in the repository;
  • preprocess: to recreate the preprocess file used by the lda-related commands.

After these commands, the working directory is ready to run any lda-related command.

The build command executes these commands in order.

Auto-discovery of the configuration parameters

The slrkit.py code tries to auto-discover the configuration parameters of a script. This is done using the ArgParse class from the slrkit_utils.argument_parser module of the slrkit_utils repository. This class works like the standard ArgumentParser class of the argparse python module, but it collects information about each argument and stores it in the slrkit_arguments dictionary. Using this dictionary, slrkit.py can find the name of each argument, its default value, whether it is optional or required, and all the other annotations. With this information, slrkit.py can automatically create the default configuration files, and can easily pass the values in the configuration file to the command.

Script adaptation

A script can be run as a command if it is adapted to do so. First, the script must be importable from the slrkit.py code. Second, the script module must define a function named init_argparse that takes no arguments and returns the ArgParse object used by the script itself. Third, the script information must be registered in the SCRIPTS dictionary (see below). Finally, the module must have a function (the name can be chosen freely) that accepts an argparse Namespace as argument and executes all the logic of the script. This Namespace is the one returned by the ArgParse object after the command line parsing.

The slrkit.py code uses these features to handle and run the script. The script code is imported by the slrkit.py code. The init_argparse function and the ArgParse object are used to handle the arguments of the script, to create a default configuration file for the command, to handle the configuration file and to prepare the arguments for the script. The function with the logic of the script is called by slrkit.py with all the required arguments.
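
Putting it together, the skeleton of an adapted script might look like this sketch (the script name, its arguments and the custom keyword arguments shown here are hypothetical examples):

# example_script.py - sketch of a script adapted to run as a slrkit command
from slrkit_utils.argument_parser import ArgParse

def init_argparse():
    # Takes no arguments and returns the ArgParse object of the script.
    parser = ArgParse(description='example script')
    parser.add_argument('datafile', help='input file', input=True,
                        suggest_suffix='_abstracts.csv')
    parser.add_argument('output', help='output file', output=True,
                        suggest_suffix='_example.csv')
    return parser

def example_script(args):
    # All the logic of the script goes here; args is the parsed Namespace.
    print('would process', args.datafile, 'into', args.output)

if __name__ == '__main__':
    example_script(init_argparse().parse_args())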

The ArgParse class

This class is defined in the slrkit_utils.argument_parser module of the slrkit_utils repository. The ArgParse class is a subclass of the argparse.ArgumentParser class that collects information about the configured arguments. The collected information is stored in the slrkit_arguments attribute, a dictionary where each key is the name of an argument and the value is a dictionary that contains all the collected information.

This is done with the overridden add_argument method, which collects standard information like:

  • the name of the argument (stored as the key of slrkit_arguments);
  • the name of the destination of the argument read from the command line (stored as dest);
  • the default value (value);
  • the type (type);
  • the help string (help);
  • the choice keyword argument that is the collection of allowable values for this argument (choice).

The overridden method can also accept some other custom attributes in the form of keyword arguments. They are:

  • input: bool value, default False, flags an argument as an input file;
  • output: bool value, default False, flags an argument as an output file;
  • non_standard: bool value, default False, specifies that this argument must be handled in a special way (currently this attribute is not used);
  • logfile: bool value, default False, specifies that this argument is the path of a logfile;
  • suggest_suffix: str value, default None, suffix to suggest to the user for the value of this argument;
  • cli_only: bool value, default False, specifies that this argument is intended to be used on the command line only.

These attributes are stored in the argument dictionary using their names as keys. In addition, the required attribute is stored in the dictionary. This is a boolean value that tells whether the argument is required or optional.

The action attribute is also stored. This is the Action object used by the argument parser to handle the argument and to store the correct value of the argument. This attribute can be used to store the argument value from the configuration file in the same way the argument parser does.

The input attribute is used to detect which arguments are input files coming from other stages. The output attribute is used to identify which argument is the output file of a script. The dependency system of the slrkit command uses these attributes to correctly suggest the default names of the input and output files in the configuration files and to suggest which command must be run if one or more inputs are missing.

The file name suggestion in the configuration file also uses the suggest_suffix attribute. If an argument has this attribute set, its value is used to create the default value during the configuration file creation. The default name will be <project name><suggest_suffix>.

The logfile attribute is used to mark the argument with the path to the log file in order to collect all the project logs in the log directory inside the project configuration directory.

Configuration files creation

The slrkit.py code creates the configuration files using the content of the slrkit_arguments attribute of the ArgParse object of each script that is configured as a command. For each script argument not flagged as cli_only or logfile, a corresponding entry is created in the configuration file. The entry has the same name as the key of the slrkit_arguments dictionary. The value field is used as the default value of each entry, unless suggest_suffix is specified; in that case, the file name suggestion is performed as specified above. For each entry, the text of the help value of slrkit_arguments is provided as a comment. Moreover, a comment stating whether the value is required or not is also produced.

The dependencies system

In the slrkit.py code, the SCRIPTS dictionary stores the information regarding the scripts used as commands. The key of this dictionary is the name of the command. If a command has some sub-commands, the corresponding key will be <command name>_<sub-command name>. Each entry of this dictionary has the following structure:

  • module: name of the module of the script of the command without the .py extension;
  • additional_init: boolean value that tells if this command requires additional actions to be performed during the project initialization. An example is the topics optimize command, which requires the optimize_lda_ga_params.toml file to be copied in the configuration directory and the ga_params entry of optimize_lda.toml to be updated accordingly;
  • depends: list of the dependencies of the command;
  • no_config: boolean value that tells if a command does not use a configuration file. If it is True, the command does not use a configuration file, and so no configuration file is created by the init command.

The depends list contains an element for each input file of the script that is produced by another command. Each element is the name of the command that produces that file. The order of the elements must match the order of the corresponding inputs in the ArgParse argument declarations. For instance, if a script takes two inputs, and the first one depends on the output of the preprocess command while the second one depends on the output of the terms generate command, the corresponding depends list will be ['preprocess', 'terms_generate'].
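
For illustration, the SCRIPTS entry of such a hypothetical script could look like this sketch (not the actual content of slrkit.py):

# Hypothetical entry of the SCRIPTS dictionary in slrkit.py
'example': {
    'module': 'example_script',  # example_script.py, without the extension
    'additional_init': False,    # no extra actions during project init
    'depends': ['preprocess', 'terms_generate'],
    'no_config': False,          # the command uses a configuration file
}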

The slrkit.py code uses the depends list in this way:

  1. the list of the inputs of a script (the argument flagged as input) is retrieved. The order of definition of each argument is preserved;
  2. for each input, the corresponding entry (the entry with the same index) in the depends list is taken;
  3. the entry is used to find the output (the argument flagged as output) of the command named in the depends entry on which this input depends;
  4. this information is used both to provide a default value for each input during the configuration files creation and to suggest which command must be run if an input is missing.

The commands listed in the SCRIPTS dictionary are the only ones that are handled in the configuration file creation phase of the init command.

The prepare_script_arguments function

The prepare_script_arguments function handles the content of a configuration file and creates the Namespace with the arguments for a script.

The function takes the following arguments:

  • config: content of the config file;
  • config_dir: path to the config file directory;
  • confname: name of the config file;
  • script_args: information about the script arguments. This dictionary is the slrkit_arguments attribute of the ArgParse object of the script.

The function returns the Namespace with the argument values. All the arguments are filled using the values in the configuration file. The arguments flagged as cli_only in script_args are filled with the default value taken from script_args. The argument flagged as logfile is filled with a path to a log file in the log directory inside the configuration directory. The arguments flagged as non_standard are not processed by the function and must be handled by the code that runs the command.

The prepare_script_arguments function also returns a dictionary with the inputs and a dictionary with the outputs of the script. These dictionaries have the names of the arguments as keys and the values of the arguments as items.