All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
3.2.0 - 2024-08-14
- make pycld2 and fasttext libraries optional
- replace langid.py library with py3langid
- update github workflows and include Python 3.12 tests
- OpusRead interface using moses format (requires opustools >= 1.6.2)
3.1.0 - 2024-06-05
- support for lingua-based language detection (#65)
- remove Python 3.7 support
- fix score method in SentenceEmbeddingFilter (#71)
- fix filter and filterfalse methods in SentenceEmbeddingFilter
3.0.0 - 2023-10-11
- opusfilter-autogen script for automatic filter config generation
- score_direction, accept_threshold, and reject_threshold properties for filters
- refactor code and move auxiliary methods to opusfilter.util
- update varikn installation instructions (installable from PyPI)
- update github workflows and include Python 3.11 tests
- update library version requirements to support Python 3.11
- use xxhash instead of pyhash for hash functions
- use opus-fast-mosestokenizer instead of fast-mosestokenizer
- install eflomal from PyPI and use the new interface in WordAlignFilter
- remove Python 3.6 support
- catch NotImplementedError from beautifulsoup 4.11.2
- catch ParserRejectedMarkup from beautifulsoup 4.12.0
2.6.0 - 2022-11-30
- add slice missing from the enabled steps
- improve documentation
- import slow libraries only when needed
- use chunks for the filter method of SentenceEmbeddingFilter
- change RepetitionFilter to use single score for consistency with the threshold
- allow float thresholds for AverageWordLengthFilter
- remove unnecessary code from RegExpSub
- add setuptools version requirement
2.5.1 - 2022-09-28
- add missing document file
2.5.0 - 2022-09-28
- map_space_to option for Jieba and MeCab tokenizers to preserve existing space characters in input
- parallel processing options for filter, score, and preprocess steps
- re-organize documentation and support building it with sphinx
- catch TypeError exceptions from BeautifulSoup in HtmlTagFilter
2.4.0 - 2022-04-05
- an option to write filter scores to a file with opusfilter-test
- new filters: AlphabetRatioFilter, RegExpFilter, SimilarityFilter, SentenceEmbeddingFilter
- support for Japanese word segmentation using MeCab as a tokenizer
- preprocessing methods for subword segmentation (BPESegmentation, MorfessorSegmentation)
- subword segmentation support for the n-gram language models and language model filters
- allow per-language parameters for LengthFilter, LengthRatioFilter, LongWordFilter, and AverageWordLengthFilter
- fix documentation for train_alignment parameters
2.3.1 - 2022-01-28
- fix bug in classifier training without development set
2.3.0 - 2022-01-18
- new OpusFilterRuntimeError exception for having e.g. empty training data
- option to save scores from the training data when creating word alignment priors
- RepetitionFilter for filtering segments with repeated substrings
- new preprocessor for sentence splitting monolingual data
- method-specific options for LanguageIDFilter
- chunksize option to the common section
- LMClassifierFilter for classification based on n-gram language models
- add workdir attribute to the FilterABC base class and make the filters use it for any file parameters
- increase default chunksize in FilterPipeline from 10000 to 100000
- refactor and clean up code
2.2.0 - 2021-11-23
- support for Chinese word segmentation using jieba as a tokenizer (#27)
2.1.2 - 2021-11-11
- fix wrong keyword argument name in opusfilter-duplicates
2.1.1 - 2021-10-19
- move "How to contribute" to docs/CONTRIBUTING.md
- fix setuptools requirement (#21)
- fix version requirement for pandas (>=1.0.0)
2.1.0 - 2021-08-31
- replace PyYAML with ruamel.yaml
- support for variables in the YAML configuration (#13)
- support for fasttext-based language detection (#20)
- suppress_prompts parameter for opus_read (#19)
- download and write steps
- "How to contribute" section to README.md
- changelog
- bibliography and improved references
2.0.0 - 2021-06-01
- extend to n-lingual parallel data instead of just bilingual data
- switch tokenizer to fast-mosestokenizer
- new commands: opusfilter-diagram, opusfilter-duplicates, opusfilter-test
- new filters: LongestCommonSubstringFilter, AverageWordLengthFilter
- new steps: preprocess
- set "latest" as the default corpus release for opus_read (#5)
- overlap option for remove_duplicates
- lower threshold option for CrossEntropyFilter
- github CI workflow for flake8 and unittests
- behaviour of simple filters on empty segments
1.0.1 - 2020-05-25
- improved logging, documentation, and project files
- prevent UnboundLocalError for empty output after filter
1.0.0 - 2020-04-10
First tagged version.