pip install -r requirements.txt
Install Popper for its pdftotext
command. For example, on macOS:
brew install poppler
Install Pandoc to convert DOCX to text. For example, on macOS:
brew install pandoc
Install Tesseract OCR to convert PDF to text. For example, on macOS:
brew install tesseract
The commands automatically download:
- Punkt Tokenizer Models from NLTK, for sentence splitting
- paraphrase-multilingual-MiniLM-L12-v2 from Hugging Face, for sentence similarity
./manage.py --help
Download data, for example:
./manage.py download-ted 2022 01 2022 12
Transform TED XML data to CSV, for example:
./manage.py xml2csv 2022 01 2022 12 2022.csv
Extract sentences from CSV, for example:
./manage.py csv2corpus 2022.csv corpus-furniture.csv 391
./manage.py csv2corpus 2022.csv corpus-textiles.csv 18 395 98311 98312 5083 5082 98313
./manage.py csv2corpus 2022.csv corpus-cleaning.csv 90911200 90919 98341130 98341110
Extract green requirements from PDF documents, for example:
./manage.py pdf2queries 'Criteria for Furniture.pdf' queries-furniture.csv 6 27
Download data, for example:
./manage.py download-do data/do
Transform DOCX, BMP, PNG, JPEG and PDF to text files:
./manage.py any2txt data/do
Extract sentences from text files:
./manage.py txt2corpus data/do corpus-do.csv spanish
Perform a semantic similarity search, for example:
./manage.py search corpus-furniture.csv queries-furniture.csv 0.7
Check the frequencies of values in columns using codelists:
qsv index 2022.csv
qsv frequency -l 0 -s MONTH,FORM,LG,CPV2,CPV3,CPV4,CPV5,ECONOMIC_CRITERIA_DOC,TECHNICAL_CRITERIA_DOC,AC_PROCUREMENT_DOC,AC_PRICE,SUITABILITY_ANY,ECONOMIC_FINANCIAL_INFO_ANY,ECONOMIC_FINANCIAL_MIN_LEVEL_ANY,TECHNICAL_PROFESSIONAL_INFO_ANY,TECHNICAL_PROFESSIONAL_MIN_LEVEL_ANY,PERFORMANCE_CONDITIONS_ANY,AC_QUALITY_ANY,AC_COST_ANY,CRITERIA_CANDIDATE_ANY 2022.csv | sort
Column | Description | Required | Format | Example |
---|---|---|---|---|
MONTH | The monthly package | ✓ | YYYY-MM | 2022-01 |
FORM | The form number | ✓ | codelist | F02 |
LG | The document language | ✓ | codelist | DE |
URI_DOC | The notice URL | ✓ | URL | |
URL_DOCUMENT_ANY | Whether URI_DOCUMENT is set |
✓ | boolean | |
URI_DOCUMENT | The access URL for procurement documents | URL | ||
CPV2 | The first 2 digits of CPV_MAIN |
✓ | codelist | 30 |
CPV3 | The first 3 digits of CPV_MAIN |
✓ | codelist | 301 |
CPV4 | The first 4 digits of CPV_MAIN |
✓ | codelist | 3019 |
CPV5 | The first 5 digits of CPV_MAIN |
✓ | codelist | 30197 |
CPV_MAIN | Main CPV code | ✓ | codelist | 30197630 |
SUITABILITY_ANY | Whether SUITABILITY is set |
boolean | ||
SUITABILITY | Suitability to pursue the professional activity, including requirements relating to enrolment on professional or trade registers | Python list | ||
ECONOMIC_CRITERIA_DOC | Whether the notice defers to procurement documents for economic criteria | boolean | ||
ECONOMIC_FINANCIAL_INFO_ANY | Whether ECONOMIC_FINANCIAL_INFO_ANY is set |
boolean | ||
ECONOMIC_FINANCIAL_INFO | List and brief description of economic selection criteria | Python list | ||
ECONOMIC_FINANCIAL_MIN_LEVEL_ANY | Whether ECONOMIC_FINANCIAL_MIN_LEVEL_ANY is set |
boolean | ||
ECONOMIC_FINANCIAL_MIN_LEVEL | Minimum level(s) of economic standards possibly required | Python list | ||
TECHNICAL_CRITERIA_DOC | Whether the notice defers to procurement documents for technical criteria | boolean | ||
TECHNICAL_PROFESSIONAL_INFO_ANY | Whether TECHNICAL_PROFESSIONAL_INFO_ANY is set |
boolean | ||
TECHNICAL_PROFESSIONAL_INFO | List and brief description of technical selection criteria | Python list | ||
TECHNICAL_PROFESSIONAL_MIN_LEVEL_ANY | Whether TECHNICAL_PROFESSIONAL_MIN_LEVEL_ANY is set |
boolean | ||
TECHNICAL_PROFESSIONAL_MIN_LEVEL | Minimum level(s) of technical standards possibly required | Python list | ||
PERFORMANCE_CONDITIONS_ANY | Whether PERFORMANCE_CONDITIONS_ANY is set |
boolean | ||
PERFORMANCE_CONDITIONS | Contract performance conditions | Python list | ||
CPV_ADDITIONAL | Additional CPV code(s) | codelist, colon-separated | ||
AC_PROCUREMENT_DOC | Whether non-price criteria are stated only in procurement documents | boolean | ||
AC_PRICE | Whether price is a criterion | boolean | ||
AC_QUALITY_ANY | Whether AC_QUALITY_ANY is set |
boolean | ||
AC_QUALITY | The names of the quality criteria | Python list | ||
AC_COST_ANY | Whether AC_COST_ANY is set |
boolean | ||
AC_COST | The names of the cost criteria | Python list | ||
CRITERIA_CANDIDATE_ANY | Whether CRITERIA_CANDIDATE_ANY is set |
✓ | boolean | |
CRITERIA_CANDIDATE | Objective criteria for choosing the limited number of candidates | Python list |
- Train new model
- Use GPU acceleration (CUDA)