Releases · huggingface/datatrove · GitHub

06 Dec 18:43

guipenedo

v0.4.0 Latest

What's Changed

Readme nits by @hynky1999 in #280
Fixed a bug in the reader pipeline where the document count was always lower than the actual number of documents by the number of files by @lyuwen in #286
Fix languages listify bug by @BramVanroy in #294
[Fixbug] Ensure only one task will be launched for each srun cmd by @silverriver in #296
[fixbug]: Fixed the issue in MinhashBuildIndex where get_datafolder w… by @Youggls in #307
FineWeb-2: multilingual, numpy 2.0, minhash improvements by @guipenedo and @hynky1999 in #285:
- upgrades to support numpy 2.0
- added additional word tokenizers and revamped word tokenizer assignment mechanism
- MinHash optimizations + new rust tool to speed up step3
- MinHash cluster sizes feature
- fixed memory leaks from some word tokenizers
- updated url blocklists
- added caching to some word tokenization calls
- glotlid support
- general bugfixes
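
To make the reader document-count fix (#286) concrete, here is a minimal sketch of how a per-file off-by-one can add up to an undercount equal to the number of files. It is a hypothetical illustration only: `count_docs_buggy` and `count_docs_fixed` are invented names, and this is not the actual datatrove code or necessarily the real cause of the bug.

```python
# Hypothetical illustration, not datatrove code: each "file" is a list of documents.

def count_docs_buggy(files):
    """Undercounts by one per non-empty file: uses the last index as the count."""
    total = 0
    for docs in files:
        last_index = 0
        for last_index, _ in enumerate(docs):
            pass
        total += last_index  # bug: index of the last document, not the count
    return total


def count_docs_fixed(files):
    """Counts every document that is actually read."""
    total = 0
    for docs in files:
        total += sum(1 for _ in docs)
    return total


files = [["a", "b", "c"], ["d", "e"]]
print(count_docs_buggy(files), count_docs_fixed(files))
```

With two non-empty files, the buggy count falls short of the fixed count by exactly two, which matches the symptom described in the changelog entry.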

New Contributors

@lyuwen made their first contribution in #286
@BramVanroy made their first contribution in #294
@silverriver made their first contribution in #296
@Youggls made their first contribution in #307

Full Changelog: v0.3.0...v0.4.0

Contributors

silverriver, BramVanroy, and 4 other contributors

28 Aug 15:47

guipenedo

v0.3.0

What's Changed

Added c4 badwords filter, added batch tokenization to tokenscounter by @guipenedo in #160
Add a skip parameter to all readers (defaults to zero) by @rantav in #167
Adds n-gram based decontamination by @guipenedo in #172
Fix: Handle Non-dict Objects in to_dict Without Errors by @justHungryMan in #139
Adds tasks_per_job to slurm executor by @guipenedo in #153
Unsigned int tokenizer and srun args by @marianna13 in #154
Enhance BaseReader to allow custom adapters access to instance variables by @justHungryMan in #169
remove ListFilter from the process_common_crawl_dump example by @QasidSaleem in #181
Hf dataset update by @hynky1999 in #170
Optimize URLFilter and add option to disable integrated wordlists by @its5Q in #174
Add progress for files by @hynky1999 in #176
Make colorization configurable for both files and console output by @guipenedo in #185
Migrate dedup to xxhash by @guipenedo in #179
[WIP] Multi-Lingual Tokenization by @beme248 in #147
Add more word tokenizers by @vsabolcec in #187
Speed up CI with uv by @guipenedo in #188
Url Index + missing hash_config struct inference by @hynky1999 in #191
Migrate pipeline blocks to new word tokenizers by @guipenedo in #189
Fix snapshot representation and numeric conversion in example Code (fineweb) by @justHungryMan in #192
Extend randomize_start feature to local executor by @justHungryMan in #193
Add description for randomize_start by @justHungryMan in #194
Allow an integer parameter for 'randomize_start' in executor/base.py by @justHungryMan in #199
Issues w/ DatatroveFolderDataset by @TJ-Solergibert in #203
Code consistency for randomize_start_duration by @justHungryMan in #207
feat(ci): add trufflehog secrets detection by @McPatate in #211
fix(ci): remove unnecessary permissions by @McPatate in #212
Add label_only option to LanguageFilter by @justHungryMan in #210
Fixes text normalization by @hynky1999 in #218
Summary stats by @hynky1999 in #158
Speedup json writer by @its5Q in #175
add alternative fasttext lid models by @guipenedo in #226
Adds paths_file to readers by @guipenedo in #228
Add an example for filtering an HF dataset and push to hub by @loubnabnl in #201
checks if min_num_sentences is disabled or not before computing the n… by @QasidSaleem in #232
DocumentTokenizerContextShuffler fixes by @sippycoder in #229
add dependencies lid.py, io.py #239 by @aiqwe in #241
Add withdirs to extra_options only when not using glob_pattern by @olga1988olga in #244
Add token and char count to histogram stats by @guipenedo in #251
fix correct type inference for cached filesystems by @hynky1999 in #257
Simple enhancement for readability by @aiqwe in #253
Fix test_basic_article_trafilatura test failure by @tylerjthomas9 in #264
Update MinhashConfig with detailed settings and add default language … by @justHungryMan in #252
Update README.md by @shizhediao in #276
Implement zstd Compression Support for JSONL and Parquet Files by @justHungryMan in #230
Update filter_hf_dataset.py by @shizhediao in #274
Add expand_metadata Option to JsonlWriter by @justHungryMan in #268
Add shuffle option on huggingface reader by @justHungryMan in #224
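
As a rough picture of what n-gram based decontamination (#172) means in general, the sketch below flags a document when it shares any word n-gram with a benchmark text. It assumes whitespace tokenization and lowercase matching; the function names are hypothetical, and datatrove's own block may tokenize, normalize, and match differently.

```python
# Generic n-gram overlap check; all names here are illustrative, not datatrove APIs.

def word_ngrams(text, n):
    """Return the set of lowercase word n-grams in text (empty if text is too short)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(benchmark_texts, n):
    """Collect every n-gram that appears in any benchmark text."""
    index = set()
    for text in benchmark_texts:
        index |= word_ngrams(text, n)
    return index


def is_contaminated(doc_text, index, n):
    """True if the document shares at least one n-gram with the benchmark index."""
    return not word_ngrams(doc_text, n).isdisjoint(index)


index = build_index(["What is the capital of France"], n=4)
print(is_contaminated("Quiz: what is the capital of France? Paris", index, n=4))
print(is_contaminated("The weather is nice today in Lisbon", index, n=4))
```

A larger `n` makes the check stricter (fewer false positives), while a smaller `n` catches more paraphrased overlap at the cost of flagging common phrases.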

New Contributors

@rantav made their first contribution in #167
@QasidSaleem made their first contribution in #181
@its5Q made their first contribution in #174
@beme248 made their first contribution in #147
@vsabolcec made their first contribution in #187
@TJ-Solergibert made their first contribution in #203
@McPatate made their first contribution in #211
@loubnabnl made their first contribution in #201
@sippycoder made their first contribution in #229
@aiqwe made their first contribution in #241
@olga1988olga made their first contribution in #244
@tylerjthomas9 made their first contribution in #264
@shizhediao made their first contribution in #276

Full Changelog: v0.2.0...v0.3.0

Contributors

rantav, guipenedo, and 15 other contributors

22 Apr 17:18

guipenedo

v0.2.0

What's Changed

Adds multi node parallelism to local executor by @guipenedo in #85
Changed fsx default filepath for logging output to user's home by @Anacheron51 in #86
[Docs] Fix typos by @StandardAI in #91
bugfix stats file not being saved to s3 by @guipenedo in #92
Fix url stats by @thomwolf in #89
Efficiency: np.fromiter instead of np.array by @giorgioangel in #88
Adds language option for nltk by @guipenedo in #94
Fix compression type by @jordane95 in #95
Decoupled reading logic from DedupReader by @guipenedo in #98
Support for arbitrary fasttext models by @guipenedo in #99
Adds citation by @guipenedo in #101
Adds parquet writer by @guipenedo in #103
Utilities to efficiently parallelize the upload of dataset files to the HuggingFace hub by @guipenedo in #105
Adding doc strings + adding a faster tokenized doc merger by @thomwolf in #90
Add email on slurm and extend fasttext filter functionalities by @thomwolf in #111
Add jobs_status command. by @lvwerra in #113
Re-enable datasets test by @mariosasko in #114
Update warc.py by @jordane95 in #115
Bug fix: when file is empty by @jordane95 in #126
Load tokenizer using from_file by @guipenedo in #122
Adds depends= to LocalPipelineExecutor by @guipenedo in #100
Improve C4 filter and dedup by @guipenedo in #124
Adds option to shuffle input files in readers by @guipenedo in #128
update Trafilatura version by @adbar in #130
Changes to text normalization + FTFY and lines symbol formatters by @guipenedo in #133
Minor Terminology and Documentation Updates for Local Tokenizer Loading by @justHungryMan in #134
add requeue and QOS slurm options by @marianna13 in #144
Fix substring dedup range by @jordane95 in #132
Line dedup min remove words option by @guipenedo in #146
New options for FastTextClassifierFilter: apply on sentence or paragraph (line) level by @guipenedo in #151
Url deduplication by @hynky1999 in #145
Fix race conditions during download/extraction by @hynky1999 in #155
Adds PII removal by @guipenedo in #156
Pypi Publish Action by @hynky1999 in #159
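
For readers unfamiliar with URL deduplication (#145), the idea can be sketched as keeping only the first document seen for each normalized URL. The normalization below (lowercased host, trailing slash stripped, scheme, query, and fragment ignored) is an assumption made for illustration, and `normalize_url` and `dedup_by_url` are hypothetical names rather than datatrove functions.

```python
from urllib.parse import urlsplit

# Illustrative only: the normalization rules are assumptions, not datatrove's.

def normalize_url(url):
    """Reduce a URL to 'host/path', ignoring scheme, query, and fragment."""
    parts = urlsplit(url.strip())
    return f"{parts.netloc.lower()}{parts.path.rstrip('/')}"


def dedup_by_url(docs):
    """Keep the first document for each normalized URL, preserving input order."""
    seen = set()
    kept = []
    for doc in docs:
        key = normalize_url(doc["url"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(doc)
    return kept


docs = [
    {"url": "https://Example.com/a/", "text": "first"},
    {"url": "http://example.com/a", "text": "duplicate"},
    {"url": "https://example.com/b", "text": "other"},
]
print([d["text"] for d in dedup_by_url(docs)])
```

An in-memory set like this only works for small inputs; deduplicating at crawl scale typically needs hashed signatures written to disk and merged across workers.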

New Contributors

@Anacheron51 made their first contribution in #86
@StandardAI made their first contribution in #91
@giorgioangel made their first contribution in #88
@lvwerra made their first contribution in #113
@adbar made their first contribution in #130
@justHungryMan made their first contribution in #134
@marianna13 made their first contribution in #144
@hynky1999 made their first contribution in #145

Full Changelog: v0.0.1...v0.2.0

Contributors

adbar, guipenedo, and 10 other contributors

07 Feb 15:10

guipenedo

v0.0.1

First release
