Monotextor v1.1: Title in the Sky

Released by @ZJaume on 31 May, 14:32 (commit 92f2214)

Added

  • Apply Monofixer to document titles.
  • Detect sensitive data in paragraphs.
  • Support for compressed preverticals.
  • New paragraph id format (prevertical2text).
  • Remove tabs, newlines and carriage returns that would generate additional lines or fields when normalization is disabled (Monofixer).
  • Detect Serbo-Croatian script (FastSpell).
  • Automatic installation of Hunspell dictionaries (FastSpell).

Fixed

  • Python 3.10 compatibility.
  • Check that the Monocleaner model exists.
  • Snakemake re-running every rule even when no files had changed.
  • Encoding errors in sentence splitting that produced unexpected offsets in document metadata.
  • Warning format when the paragraph id exceeds the total number of paragraphs.
  • Monotextor imports in bitextor_split.
  • Correct names in stat files.

Changed

  • Group Serbo-Croatian under hbs (FastSpell).
  • Better langid coverage for Icelandic (FastSpell).
  • Filter sentences by Monocleaner score and language id.
  • Remove hardcoded Monocleaner threshold.
  • Use pigz in rules that are parallelized.
  • Updated installation instructions.
  • Update Snakemake.
  • Update lxml.