Releases: huggingface/datatrove
v0.4.0
What's Changed
- Readme nits by @hynky1999 in #280
- Fixed a bug in the reader pipeline where the document count was always lower than the actual number of documents by the number of files by @lyuwen in #286
- Fix languages listify bug by @BramVanroy in #294
- [Fixbug] Ensure only one task will be launched for each srun cmd by @silverriver in #296
- [fixbug]: Fixed the issue in MinhashBuildIndex where get_datafolder w… by @Youggls in #307
- FineWeb-2: multilingual, numpy 2.0, minhash improvements by @guipenedo and @hynky1999 in #285:
  - upgrades to support numpy 2.0
  - added additional word tokenizers and revamped word tokenizer assignment mechanism
  - MinHash optimizations + new rust tool to speed up step3
  - MinHash cluster sizes feature
  - fixed memory leaks from some word tokenizers
  - updated url blocklists
  - added caching to some word tokenization calls
  - glotlid support
  - general bugfixes
New Contributors
- @lyuwen made their first contribution in #286
- @BramVanroy made their first contribution in #294
- @silverriver made their first contribution in #296
- @Youggls made their first contribution in #307
Full Changelog: v0.3.0...v0.4.0
v0.3.0
What's Changed
- Added c4 badwords filter, added batch tokenization to tokenscounter by @guipenedo in #160
- Add a skip parameter to all readers (defaults to zero) by @rantav in #167
- Adds n-gram based decontamination by @guipenedo in #172
- Fix: Handle Non-dict Objects in to_dict Without Errors by @justHungryMan in #139
- Adds `tasks_per_job` to slurm executor by @guipenedo in #153
- Unsigned int tokenizer and srun args by @marianna13 in #154
- Enhance BaseReader to allow custom adapters access to instance variables by @justHungryMan in #169
- remove ListFilter from the process_common_crawl_dump example by @QasidSaleem in #181
- Hf dataset update by @hynky1999 in #170
- Optimize URLFilter and add option to disable integrated wordlists by @its5Q in #174
- Add progress for files by @hynky1999 in #176
- Make colorization configurable for both files and console output by @guipenedo in #185
- Migrate dedup to xxhash by @guipenedo in #179
- [WIP] Multi-Lingual Tokenization by @beme248 in #147
- Add more word tokenizers by @vsabolcec in #187
- Speed up CI with uv by @guipenedo in #188
- Url Index + missing hash_config struct inference by @hynky1999 in #191
- Migrate pipeline blocks to new word tokenizers by @guipenedo in #189
- Fix snapshot representation and numeric conversion in example Code (fineweb) by @justHungryMan in #192
- Extend randomize_start feature to local executor by @justHungryMan in #193
- Add description for randomize_start by @justHungryMan in #194
- Allow an integer parameter for 'randomize_start' in executor/base.py by @justHungryMan in #199
- Issues w/ DatatroveFolderDataset by @TJ-Solergibert in #203
- code consistency about randomize_start_duration by @justHungryMan in #207
- feat(ci): add trufflehog secrets detection by @McPatate in #211
- fix(ci): remove unnecessary permissions by @McPatate in #212
- Add label_only option to LanguageFilter by @justHungryMan in #210
- Fixes text normalization by @hynky1999 in #218
- Summary stats by @hynky1999 in #158
- Speedup json writer by @its5Q in #175
- add alternative fasttext lid models by @guipenedo in #226
- Adds paths_file to readers by @guipenedo in #228
- Add an example for filtering an HF dataset and push to hub by @loubnabnl in #201
- checks if min_num_sentences is disabled or not before computing the n… by @QasidSaleem in #232
- DocumentTokenizerContextShuffler fixes by @sippycoder in #229
- add dependencies lid.py, io.py #239 by @aiqwe in #241
- Add withdirs to extra_options only when not using glob_pattern by @olga1988olga in #244
- Add token and char count to histogram stats by @guipenedo in #251
- fix correct type inference for cached filesystems by @hynky1999 in #257
- Simple enhancement for readability by @aiqwe in #253
- Fix `test_basic_article_trafilatura` test failure by @tylerjthomas9 in #264
- Update MinhashConfig with detailed settings and add default language … by @justHungryMan in #252
- Update README.md by @shizhediao in #276
- Implement zstd Compression Support for JSONL and Parquet Files by @justHungryMan in #230
- Update filter_hf_dataset.py by @shizhediao in #274
- Add expand_metadata Option to JsonlWriter by @justHungryMan in #268
- Add shuffle option on huggingface reader by @justHungryMan in #224
New Contributors
- @rantav made their first contribution in #167
- @QasidSaleem made their first contribution in #181
- @its5Q made their first contribution in #174
- @beme248 made their first contribution in #147
- @vsabolcec made their first contribution in #187
- @TJ-Solergibert made their first contribution in #203
- @McPatate made their first contribution in #211
- @loubnabnl made their first contribution in #201
- @sippycoder made their first contribution in #229
- @aiqwe made their first contribution in #241
- @olga1988olga made their first contribution in #244
- @tylerjthomas9 made their first contribution in #264
- @shizhediao made their first contribution in #276
Full Changelog: v0.2.0...v0.3.0
v0.2.0
What's Changed
- Adds multi node parallelism to local executor by @guipenedo in #85
- Changed fsx default filepath for logging output to user's home by @Anacheron51 in #86
- [`Docs`] Fix typos by @StandardAI in #91
- bugfix stats file not being saved to s3 by @guipenedo in #92
- Fix url stats by @thomwolf in #89
- Efficiency: np.fromiter instead of np.array by @giorgioangel in #88
- Adds language option for nltk by @guipenedo in #94
- Fix compression type by @jordane95 in #95
- Decoupled reading logic from DedupReader by @guipenedo in #98
- Support for arbitrary fasttext models by @guipenedo in #99
- Adds citation by @guipenedo in #101
- Adds parquet writer by @guipenedo in #103
- Utilities to efficiently parallelize the upload of dataset files to the HuggingFace hub by @guipenedo in #105
- Adding doc strings + adding a faster tokenized doc merger by @thomwolf in #90
- Add email on slurm and extend fasttext filter functionalities by @thomwolf in #111
- Add `jobs_status` command. by @lvwerra in #113
- Re-enable `datasets` test by @mariosasko in #114
- Update warc.py by @jordane95 in #115
- Bug fix: when file is empty by @jordane95 in #126
- Load tokenizer using `from_file` by @guipenedo in #122
- Adds `depends=` to LocalPipelineExecutor by @guipenedo in #100
- Improve C4 filter and dedup by @guipenedo in #124
- Adds option to shuffle input files in readers by @guipenedo in #128
- update Trafilatura version by @adbar in #130
- Changes to text normalization + FTFY and lines symbol formatters by @guipenedo in #133
- Minor Terminology and Documentation Updates for Local Tokenizer Loading by @justHungryMan in #134
- add requeue and QOS slurm options by @marianna13 in #144
- Fix substring dedup range by @jordane95 in #132
- Line dedup min remove words option by @guipenedo in #146
- New options for FastTextClassifierFilter: apply on sentence or paragraph (line) level by @guipenedo in #151
- Url deduplication by @hynky1999 in #145
- Fix race conditions during download/extraction by @hynky1999 in #155
- Adds PII removal by @guipenedo in #156
- Pypi Publish Action by @hynky1999 in #159
New Contributors
- @Anacheron51 made their first contribution in #86
- @StandardAI made their first contribution in #91
- @giorgioangel made their first contribution in #88
- @lvwerra made their first contribution in #113
- @adbar made their first contribution in #130
- @justHungryMan made their first contribution in #134
- @marianna13 made their first contribution in #144
- @hynky1999 made their first contribution in #145
Full Changelog: v0.0.1...v0.2.0