To reproduce, run aeneas (latest devel) with four different configurations, either enabling or disabling cew and using a sample rate of either 16000 or 22050 for ffmpeg. Then look at the last 4 lines of each output file:
tail -n4 *.srt
==> sonnet-cew=False|ffmpeg_sample_rate=16000.srt <==
15
00:00:53,200 --> 00:00:53,240
To eat the world's due, by the grave and thee.
==> sonnet-cew=False|ffmpeg_sample_rate=22050.srt <==
15
00:00:48,000 --> 00:00:53,240
To eat the world's due, by the grave and thee.
==> sonnet-cew=True|ffmpeg_sample_rate=16000.srt <==
15
00:00:48,080 --> 00:00:53,240
To eat the world's due, by the grave and thee.
==> sonnet-cew=True|ffmpeg_sample_rate=22050.srt <==
15
00:00:48,000 --> 00:00:53,240
To eat the world's due, by the grave and thee.
Note that the last segment starts at roughly 48 seconds except for the combination cew=False|ffmpeg_sample_rate=16000, where it starts at 53.2 seconds instead.
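For comparing the four outputs without reading them by eye, a small helper can extract the start time of the last cue. This is only a sketch: the function names `srt_time_to_seconds` and `last_cue_start` are made up for illustration and are not part of aeneas.

```python
import re

# Matches an SRT timestamp such as 00:00:53,200 (hours:minutes:seconds,millis).
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def srt_time_to_seconds(stamp):
    """Convert one SRT timestamp string to seconds as a float."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError("not an SRT timestamp: %r" % stamp)
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def last_cue_start(srt_text):
    """Return the start time (in seconds) of the last cue in an SRT document."""
    # Every cue has exactly one timing line of the form "start --> end".
    starts = [line.split("-->")[0] for line in srt_text.splitlines() if "-->" in line]
    if not starts:
        raise ValueError("no cue timing lines found")
    return srt_time_to_seconds(starts[-1])
```

Applied to the tails shown above, the cew=False|ffmpeg_sample_rate=16000 file is the only one whose last cue starts after 53 seconds.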
Here's a snippet from the verbose output of a run with that configuration, highlighting important lines with ----->:
[DEBU] Synthesizer: Synthesizing text...
[DEBU] ESPEAKTTSWrapper: Calling TTS engine via C extension or subprocess
[DEBU] ESPEAKTTSWrapper: C extension 'cew' disabled
[DEBU] ESPEAKTTSWrapper: Running the pure Python code
[DEBU] ESPEAKTTSWrapper: Synthesizing multiple via subprocess...
[DEBU] ESPEAKTTSWrapper: Calling TTS engine using multiple generic function...
[DEBU] ESPEAKTTSWrapper: Determining codec and sample rate...
[DEBU] ESPEAKTTSWrapper: Reading codec and sample rate from OUTPUT_AUDIO_FORMAT
[DEBU] ESPEAKTTSWrapper: Determining codec and sample rate... done
[DEBU] ESPEAKTTSWrapper: codec: pcm_s16le
-----> ESPEAKTTSWrapper: sample rate: 22050
[DEBU] ESPEAKTTSWrapper: Examining fragment 0 (no cache)...
[DEBU] ESPEAKTTSWrapper: Language to voice code: 'eng' => 'en'
[DEBU] ESPEAKTTSWrapper: Calling helper function
[DEBU] ESPEAKTTSWrapper: Synthesizer helper called with output_file_path=None => creating temporary output file
[DEBU] ESPEAKTTSWrapper: Temporary output file path is '/tmp/tmp30di9k3w.wav'
[DEBU] ESPEAKTTSWrapper: TTS engine reads text from stdin
[DEBU] ESPEAKTTSWrapper: Creating arguments list...
[DEBU] ESPEAKTTSWrapper: Creating arguments list... done
[DEBU] ESPEAKTTSWrapper: Calling TTS engine...
[DEBU] ESPEAKTTSWrapper: Calling with arguments '['espeak', '-v', 'en', '-w', '/tmp/tmp30di9k3w.wav']'
[DEBU] ESPEAKTTSWrapper: Calling with text '1'
[DEBU] ESPEAKTTSWrapper: Passing text via stdin...
[DEBU] ESPEAKTTSWrapper: Passing text via stdin... done
[DEBU] ESPEAKTTSWrapper: TTS engine wrote audio data to file
[DEBU] ESPEAKTTSWrapper: Calling TTS ... done
[DEBU] ESPEAKTTSWrapper: Reading audio data...
[DEBU] AudioFile: Loading audio data...
[DEBU] AudioFile: self.file_format is None or not good => converting self.file_path
[DEBU] AudioFile: Temporary PCM16 mono WAVE file: '/tmp/tmp_ow6yas8.wav'
[DEBU] AudioFile: Converting audio file to mono...
-----> FFMPEGWrapper: Calling with arguments '['ffmpeg', '-i', '/tmp/tmp30di9k3w.wav', '-ac', '1', '-ar', '16000', '-y', '-map_metadata', '-1', '-flags', '+bitexact', '-f', 'wav', '/tmp/tmp_ow6yas8.wav']'
[DEBU] FFMPEGWrapper: Call completed
[DEBU] FFMPEGWrapper: Returning output file path '/tmp/tmp_ow6yas8.wav'
[DEBU] AudioFile: Converting audio file to mono... done
[DEBU] AudioFile: Deleted temporary audio file: '/tmp/tmp_ow6yas8.wav'
[DEBU] AudioFile: Sample length: 0.638
-----> AudioFile: Sample rate: 16000
[DEBU] AudioFile: Audio format: pcm16
[DEBU] AudioFile: Audio channels: 1
[DEBU] AudioFile: Loading audio data... done
What happens is this:
Since cew is disabled, the synthesized audio for each line is concatenated in Python code.
A buffer is allocated to hold the concatenated audio and its sample rate is determined to be 22050.
Before a file is loaded into the buffer, it is converted to mono using ffmpeg.
Since the ffmpeg command specifies a sample rate of 16000, the samples loaded into the buffer do not have the expected sample rate of 22050.
When the concatenated buffer is written to a file, the sample rate is set to 22050, causing the audio to appear sped up.
As a consequence, all timestamps are out of sync.
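The size of the distortion can be estimated with a little arithmetic. The sketch below is not aeneas code; `apparent_duration` is a hypothetical helper that only illustrates what happens when samples recorded at one rate are labelled with another. It uses the 0.638 s sample length from the log above.

```python
def apparent_duration(true_duration_s, actual_rate, declared_rate):
    """Duration a clip appears to have when its samples, recorded at
    actual_rate, are written out with a header claiming declared_rate."""
    n_samples = round(true_duration_s * actual_rate)
    return n_samples / declared_rate


# Fragment from the log: 0.638 s of audio resampled by ffmpeg to 16000 Hz,
# then stored in a buffer that is later written out as 22050 Hz.
shrunk = apparent_duration(0.638, actual_rate=16000, declared_rate=22050)
print("apparent duration: %.3f s" % shrunk)

# Every fragment shrinks by the same factor, 16000 / 22050.
print("scale factor: %.4f" % (16000 / 22050))
```

Each synthesized fragment therefore ends up at roughly 73% of its real length, which would explain why the synthesized wave no longer lines up with the real audio.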
I'd like to fix this, but beforehand I'd like to know why things are done this way. Evidently the true sample rate of the file is known once the data gets loaded, so is it ever necessary to set the sample rate beforehand?
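To make the question concrete, here is a minimal sketch of the alternative I have in mind: take the sample rate from the first file that is actually loaded and refuse to mix files that disagree. It uses only Python's standard `wave` module rather than the aeneas classes, and `concatenate_wavs` is a hypothetical name, so treat it as an illustration of the idea and not as a proposed patch.

```python
import wave


def concatenate_wavs(input_paths, output_path):
    """Concatenate PCM WAV files, deriving the output format from the data.

    The format (channels, sample width, sample rate) is read from the first
    input instead of being fixed in advance; a later file with a different
    format raises an error instead of being silently mislabelled.
    Returns the sample rate of the output file.
    """
    params = None
    chunks = []
    for path in input_paths:
        with wave.open(path, "rb") as src:
            current = (src.getnchannels(), src.getsampwidth(), src.getframerate())
            if params is None:
                params = current
            elif current != params:
                raise ValueError(
                    "%s has format %r, expected %r" % (path, current, params)
                )
            chunks.append(src.readframes(src.getnframes()))
    if params is None:
        raise ValueError("no input files given")
    with wave.open(output_path, "wb") as dst:
        dst.setnchannels(params[0])
        dst.setsampwidth(params[1])
        dst.setframerate(params[2])
        dst.writeframes(b"".join(chunks))
    return params[2]
```

With this approach the output header always matches the samples it describes, whatever rate ffmpeg was asked to produce.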
Hi Yorwba... Any chance you could tell me how to fix this bug? I'm working on a lot of audio/text syncing with 22kHz files and so this fix would be a life saver for me :-) Peter