Releases: huggingface/datatrove

v0.4.0

06 Dec 18:43
842b241

What's Changed

  • Readme nits by @hynky1999 in #280
  • Fixed a bug in the reader pipeline where the document count was always lower than the actual number of documents by the number of files. by @lyuwen in #286
  • Fix languages listify bug by @BramVanroy in #294
  • [Fixbug] Ensure only one task will be launched for each srun cmd by @silverriver in #296
  • [fixbug]: Fixed the issue in MinhashBuildIndex where get_datafolder w… by @Youggls in #307
  • FineWeb-2: multilingual, numpy 2.0, minhash improvements by @guipenedo and @hynky1999 in #285:
    • upgrades to support numpy 2.0
    • added additional word tokenizers and revamped word tokenizer assignment mechanism
    • MinHash optimizations + new Rust tool to speed up step 3
    • MinHash cluster sizes feature
    • fixed memory leaks from some word tokenizers
    • updated url blocklists
    • added caching to some word tokenization calls
    • GlotLID support
    • general bugfixes

Full Changelog: v0.3.0...v0.4.0

v0.3.0

28 Aug 15:47
d95e0ee

Full Changelog: v0.2.0...v0.3.0

v0.2.0

22 Apr 17:18
6d06210

Full Changelog: v0.0.1...v0.2.0

v0.0.1

07 Feb 15:10
bd3c89a

First release