HPLT - High Performance Language Technologies
A space that combines petabytes of natural language data with large-scale model training
Repositories
Showing 10 of 17 repositories
- warc2text-runner Public
Scripts for parallelized extraction of plain text from WARC archives, aiming at a common and reproducible extraction approach.
- OpusCleaner Public
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
- monolingual-multilingual-instruction-tuning Public
Monolingual or Multilingual Instruction Tuning: Which Makes a Better Alpaca