
Releases: echogarden-project/echogarden

v1.8.2

15 Oct 21:51

Features

  • whisper: new option timestampAccuracy, with possible values medium or high. medium uses a reduced subset of attention heads for alignment, which makes it fast to compute. high uses all attention heads for alignment, and is thus more accurate at the word level, but slower for larger models. Defaults to medium (a usage sketch follows this list)
  • whisper.cpp: new options temperature, temperatureIncrement and enableFlashAttention. Using flash attention can significantly improve performance in some cases. Note: enabling flash attention automatically disables the enableDTW option, since the two don't seem to work together
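
A usage sketch for the new options via the Node.js API: the option names are taken from this release, while the recognize() call shape and the nested whisper / whisperCpp option keys are assumptions based on the project's usual conventions, so treat this as illustrative rather than authoritative.

```ts
import * as Echogarden from 'echogarden'

// Integrated Whisper engine: request higher word-level timestamp accuracy.
const whisperResult = await Echogarden.recognize('speech.mp3', {
  engine: 'whisper',
  whisper: {
    model: 'medium',
    timestampAccuracy: 'high', // 'medium' (default, faster) or 'high' (more accurate, slower for larger models)
  },
})

// whisper.cpp engine: enable flash attention and adjust sampling temperature.
// Note: enabling flash attention automatically disables the enableDTW option.
const whisperCppResult = await Echogarden.recognize('speech.mp3', {
  engine: 'whisper.cpp',
  whisperCpp: { // option key assumed
    enableFlashAttention: true,
    temperature: 0.0,
    temperatureIncrement: 0.2,
  },
})
```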

Fixes

  • whisper.cpp: derive correct model name for large-v3-turbo
  • whisper and whisper.cpp: report an error when the model is set to large-v3-turbo and a translation task is requested (large-v3-turbo doesn't support translation tasks)

Full Changelog: v1.8.1...v1.8.2

v1.8.1

10 Oct 22:37

Fixes

  • whisper alignment: ensure the resulting timeline always includes all words, even if not all transcript tokens were decoded.

Full Changelog: v1.8.0...v1.8.1

v1.8.0

10 Oct 21:23

Enhancements

  • Use PFFFT library (WASM port) with SIMD support, instead of KissFFT, for FFT operations
  • MDX-NET source separation: further speed improvements, mostly due to faster FFT operations. With DirectML (Windows) or CUDA (Linux) GPU acceleration, it now reaches up to 30x real-time on an NVIDIA RTX 2060 and 13th Gen Core i3 for the default model, and about 17x real-time for the 3 higher quality models.
  • whisper.cpp: use updated packages
  • whisper: by default, use optimized alignment heads for all model sizes (increases recognition speed by reducing the alignment time for each part). Can be enabled or disabled using the new option useOptimizedAlignmentHeads (see the sketch after the Fixes list below)
  • Source separation: ensure output audio never clips
  • Optimizations in various audio processing operations. Less copying in memory

Fixes

  • whisper alignment engine: fix an issue where the model would decode too many tokens for a single part, eventually leading to a crash due to an ONNX runtime error. The maximum number of decoded tokens per part is now configurable using maxTokensPerPart, and defaults to 250 (see the sketch after this list)
  • MDX-NET source separation: reduce default logging verbosity. Can be made more verbose by setting logLevel to trace
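
A minimal sketch combining the alignment-related options mentioned above (useOptimizedAlignmentHeads from the Enhancements list and maxTokensPerPart from the fix above). The option names come from these notes; the align() call shape and the nested whisper options key are assumptions.

```ts
import * as Echogarden from 'echogarden'

// Whisper-based alignment, with the options introduced in this release.
const alignmentResult = await Echogarden.align('speech.mp3', 'This is the transcript text.', {
  engine: 'whisper',
  whisper: {
    useOptimizedAlignmentHeads: true, // enabled by default; set to false to use all heads
    maxTokensPerPart: 250,            // default: 250; caps decoded tokens per part
  },
})
```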

Documentation

  • Officially document cuda ONNX provider support for all engines that depend on ONNX models. Supported on Linux only, and can often be faster than DirectML, even when used within Windows WSL (Ubuntu). Requires manual installation of CUDA Toolkit 12.x and cuDNN 9.x
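
As a rough illustration, on Linux the cuda provider can be requested like any other ONNX provider. decoderProvider appears by name later in these notes (v1.6.0); pointing it at cuda here, and the rest of the call shape, are assumptions.

```ts
import * as Echogarden from 'echogarden'

// Requires CUDA Toolkit 12.x and cuDNN 9.x to be installed manually (Linux only).
const result = await Echogarden.recognize('speech.mp3', {
  engine: 'whisper',
  whisper: {
    model: 'small.en',
    decoderProvider: 'cuda', // on Windows, 'dml' (DirectML) is the GPU option instead
  },
})
```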

Full Changelog: v1.7.0...v1.8.0

v1.7.0

09 Oct 10:26

New features

  • MDX-NET now includes 3 new, higher quality models: UVR_MDXNET_Main, Kim_Vocal_1 and Kim_Vocal_2. These models produce cleaner sound with fewer artifacts, and are about 3x slower on CPU than the existing ones, but still fast on GPU
  • Google Translate text-to-text translation: add 2 customization options: tld (sets the top-level domain, like com for google.com or co.uk for google.co.uk) and maxCharactersPerPart (maximum number of characters in each text part sent to the server)
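
A sketch of how the two Google Translate options above might be passed. Only tld and maxCharactersPerPart are taken from this release; the translateText() function name, the engine value, and the nested googleTranslate option key are hypothetical stand-ins, so consult the project's API documentation for the actual surface.

```ts
import * as Echogarden from 'echogarden'

// Hypothetical call name and option nesting; only tld and maxCharactersPerPart
// come from this release's notes.
const translation = await Echogarden.translateText('How are you today?', {
  engine: 'google-translate',
  sourceLanguage: 'en',
  targetLanguage: 'fr',
  googleTranslate: {
    tld: 'co.uk',               // query google.co.uk rather than google.com
    maxCharactersPerPart: 2000, // maximum characters per text part sent to the server
  },
})
```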

Enhancements

  • MDX-NET source separation implementation has been partially rewritten, with substantially better performance, reduced memory usage, and GPU support. With an NVIDIA RTX 2060 GPU (over DirectML) and a 13th Gen Core i3, it now achieves 20x real-time processing speed, bringing it closer in performance to Python implementations like the ones in Ultimate Vocal Remover and Python Audio Separator

Behavioral changes

  • MDX-NET will now use the dml ONNX execution provider (DirectML-based GPU acceleration) on Windows by default, if available

Fixes

  • Text-to-text translation: fix several issues
  • Google Translate text-to-text translation: Improve and fix several issues. Ensure translated output preserves the line break structure of the original

Full Changelog: v1.6.2...v1.7.0

v1.6.2

06 Oct 19:09

Fixes

  • dtw-ra: split fragments into chunks based on total character count, rather than fragment count (currently capped at a maximum of 1000 characters per chunk).

Full Changelog: v1.6.1...v1.6.2

v1.6.1

04 Oct 16:56

Enhancements

  • Log ONNX provider used in Whisper session

Fixes

  • Preserve paragraphs in Google Translate output text

Full Changelog: v1.6.0...v1.6.1

v1.6.0

04 Oct 06:00

New features

  • Initial support for text-to-text translation (Google Translate engine)
  • openai-cloud STT engine: support for custom OpenAI-API-compatible speech-to-text providers, like Groq
  • Support for the new large-v3-turbo Whisper model in both the integrated Whisper engine and the whisper.cpp engine (see the example after this list).
  • Add 6 new VITS voices
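
As a rough sketch, the new model can be selected by name in either engine (the model identifier is from this release; the call shape and nested option key are assumptions). Note that, per the v1.8.2 fixes above, large-v3-turbo does not support translation tasks.

```ts
import * as Echogarden from 'echogarden'

// Integrated Whisper engine with the new turbo model (transcription only).
const turboResult = await Echogarden.recognize('speech.mp3', {
  engine: 'whisper',
  whisper: {
    model: 'large-v3-turbo',
  },
})
```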

Enhancements

  • Whisper (integrated engine): hash the seed before using it (ensures that nearby seeds like 0, 1, 2, 3, 4 produce more distinct results)
  • whisper.cpp: use updated builds

Behavioral changes

  • Whisper (integrated engine): on Windows x64, may now use GPU-accelerated decoding (decoderProvider=dml) for larger models (small*, medium* and large*)
  • alignTimelineTranslation / e5 engine: reduce default DTW window's token count to 20,000 tokens

Removed features

  • whisper.cpp: removed internal package support for cublas-1.8.0, due to build issues with the latest VS2022 and very long build times.
  • Removed the optional dependency on the unused package speaker, due to security vulnerabilities and its native module requirements.
  • whisper: removed support for the large model keyword (large-v3-turbo is currently the only one supported).

Fixes

  • When deriving a sentence / segment timeline from a word timeline, ensure sentences never break within words, by temporarily masking potential sentence-ending characters in the body of a word. Attempts to resolve issues #67 and #58
  • dtw-ra: when producing an alignment reference for a set of fragments, process the fragments in chunks, rather than all at once (currently uses a maximum of 1000 fragments per chunk). Should resolve issue #64
  • whisper.cpp: add a workaround for a rare whisper.cpp issue with missing time offsets, by falling back to the last known end offset when they are not included. Should resolve issue #65
  • Don't error when DTW length is less than 2 (fixes rare issue with Whisper's internal alignment)
  • Fix logging in timeline translation alignment

Full Changelog: v1.5.0...v1.6.0

v1.5.0

26 May 15:04

New features

  • Speech-to-transcript-and-translation alignment aligns a translated transcript to the spoken audio, with the assistance of the transcript in the original language. Supports 100 source and target languages. It uses a two-stage approach: first, conventional alignment is performed between the spoken audio and its native-language transcript. Then, the resulting timeline is aligned to the translated text using cross-language semantic text-to-text alignment
  • Timeline-to-translation alignment accepts a timeline and a translated transcript, and performs the second stage independently. This allows reusing a previously aligned transcript with multiple translations, or applying the operation to the timeline output of speech synthesis or recognition (a rough sketch follows this list)
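
A rough sketch of the second, standalone stage described above. alignTimelineTranslation and the e5 engine are named in the v1.6.0 notes below; the exact signature and option names used here are assumptions, so treat this purely as an illustration.

```ts
import { readFile } from 'node:fs/promises'
import * as Echogarden from 'echogarden'

// Reuse a previously produced timeline (from alignment, synthesis or recognition)
// and align it to a translated transcript. Signature and options are assumed.
const timeline = JSON.parse(await readFile('aligned-timeline.json', 'utf8'))

const translationAlignment = await Echogarden.alignTimelineTranslation(
  timeline,
  'Bonjour à tous, et bienvenue.', // translated transcript text
  { engine: 'e5', targetLanguage: 'fr' },
)
```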

Enhancements

  • Add support for passing cuda as an ONNX provider. The latest onnxruntime-node now supports it, but only on Linux (for Windows, use dml - DirectML)

Behavioral changes

  • Passing a subtitle file to synthesis operations now ignores the cues and splits into sentences based on punctuation alone
  • API operations for speech-to-translation now include separate properties for source and target languages

Fixes

  • Timeline uncropping now correctly handles the edge case where a timestamp is higher than the audio duration (this can occur due to rounding or numerical imprecision)
  • Mel spectrogram conversion now handles the case where a filterbank is wider than the maximum frequency

Full Changelog: v1.4.4...v1.5.0

v1.4.4

15 May 06:54

Enhancements

  • DTW speech alignment: use an optimized Euclidean distance computation function with a fully unrolled loop when the vector size is exactly 13 (typical MFCC vector size)
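
For illustration only, a fully unrolled 13-element Euclidean distance could look like the sketch below; this is not necessarily the project's actual implementation, just a minimal example of the technique described above.

```ts
// Euclidean distance for exactly 13-dimensional vectors (typical MFCC size),
// with the loop fully unrolled to avoid per-iteration overhead.
function euclideanDistance13(a: Float32Array, b: Float32Array): number {
  const d0 = a[0] - b[0]
  const d1 = a[1] - b[1]
  const d2 = a[2] - b[2]
  const d3 = a[3] - b[3]
  const d4 = a[4] - b[4]
  const d5 = a[5] - b[5]
  const d6 = a[6] - b[6]
  const d7 = a[7] - b[7]
  const d8 = a[8] - b[8]
  const d9 = a[9] - b[9]
  const d10 = a[10] - b[10]
  const d11 = a[11] - b[11]
  const d12 = a[12] - b[12]

  return Math.sqrt(
    d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5 + d6 * d6 +
    d7 * d7 + d8 * d8 + d9 * d9 + d10 * d10 + d11 * d11 + d12 * d12)
}
```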

Fixes

  • eSpeak: Prevent using vertical bar separators (|) in the exact set of voices that (incorrectly) pronounce them: roa/an (Aragonese), art/eo (Esperanto), trk/ky (Kirghiz), zlw/pl (Polish), zle/uk (Ukrainian)
  • Add missing entry for Latin (la) in language code parser

Full Changelog: v1.4.3...v1.4.4

v1.4.3

12 May 04:04

Fixes

  • eSpeak: Bring back the | workaround, but only when the language isn't Polish

Full Changelog: v1.4.2...v1.4.3 (note: release v1.4.2 was unintentionally not committed to GitHub, so this includes its changes as well)