A Python tool for extracting GCMD (Global Change Master Directory) keywords used by metadata records from AODN's GeoNetwork catalog via the CSW service.
This tool helps the AODN Metadata Governance Officer generate on-demand reports of GCMD keyword usage.
It works with the CSW service of both GeoNetwork3 and GeoNetwork4.
- Python 3.10
- Poetry
- Conda (recommended for creating a virtual environment)
- Install Conda (if not already installed):
  Follow the instructions at Conda Installation.
- Create and activate a Conda virtual environment:
    conda create -n gcmd_extractor python=3.10
    conda activate gcmd_extractor
- Install Poetry (if not already installed):
    curl -sSL https://install.python-poetry.org | python3 -
  Make sure to add Poetry to your PATH as instructed during the installation.
- Clone the repository:
    # after cloning the repo with the git clone command
    cd geonetwork-gcmd-extractor
- Install dependencies using Poetry:
    poetry install
Configuration is defined in config/config.json; for example, you can change the CSW service source URL there.
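If you prefer to inspect or adjust the configuration from Python rather than editing the file by hand, a minimal sketch might look like the following. The key name `csw_url` and the helper names are assumptions made for illustration; check config/config.json for the keys the tool actually uses.

```python
import json
from pathlib import Path


def load_config(path="config/config.json"):
    """Read a JSON config file and return its contents as a dict."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)


def override_csw_url(config, new_url, key="csw_url"):
    """Return a copy of the config with the CSW URL replaced.

    "csw_url" is a hypothetical key name, not necessarily the one
    used in this repository's config file.
    """
    updated = dict(config)  # shallow copy, so the original dict is untouched
    updated[key] = new_url
    return updated


def save_config(config, path="config/config.json"):
    """Write the config dict back to disk as indented JSON."""
    with Path(path).open("w", encoding="utf-8") as fh:
        json.dump(config, fh, indent=2)
```

Returning a copy from `override_csw_url` keeps the loaded configuration unchanged until you explicitly call `save_config`.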
Run the script:
poetry run python main.py
For usage instructions on the available parameters:
poetry run python main.py --help
The repository also includes an NLP-based implementation that fuzzily groups similar texts despite typos, plurals, and case differences. For example:
Inputs:
["Sea surface tempoerature", "SEA SURFACE TEMPERATUR", "car", "cars", "elephant", "ellephent", "antarticca"]
Outputs:
['SEA SURFACE TEMPERATURE', 'CAR', 'ELEPHANT', 'ANTARCTICA']
This module is not used in the processor class; it is included for reference only. To use it, after running poetry install, you may need to run poetry run download-spacy-model and then import it where needed:
from utils.nlp_grouping import GroupingSimilarTexts
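To give a feel for the fuzzy grouping idea, here is a rough standalone sketch that uses only the standard library's `difflib`. It is not the repository's spaCy-based `GroupingSimilarTexts` implementation, and it only groups similar strings; it does not correct spelling to a canonical form. The function name `group_similar` and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def group_similar(texts, threshold=0.8):
    """Greedily group strings whose similarity ratio meets the threshold.

    Each string is upper-cased and whitespace-normalised, then compared
    against the first member of every existing group. It joins the first
    group that is similar enough, otherwise it starts a new group.
    """
    groups = []
    for text in texts:
        norm = " ".join(text.upper().split())
        for group in groups:
            if SequenceMatcher(None, norm, group[0]).ratio() >= threshold:
                group.append(norm)
                break
        else:
            # No existing group was close enough.
            groups.append([norm])
    return groups


if __name__ == "__main__":
    samples = [
        "Sea surface tempoerature", "SEA SURFACE TEMPERATUR",
        "car", "cars", "elephant", "ellephent", "antarticca",
    ]
    for group in group_similar(samples):
        print(group)
```

A character-level ratio like this is sensitive to the threshold, especially for short words, so treat the value as something to tune against your own keyword list.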
Output files are generated in the outputs folder.