Sample rate mismatch leads to incorrect timing #260

Open
Yorwba opened this issue Aug 14, 2020 · 2 comments
@Yorwba

Yorwba commented Aug 14, 2020

To reproduce, run aeneas (latest devel) with four different configurations, either enabling or disabling cew and using a sample rate of either 16000 or 22050 for ffmpeg:

for conf in cew={True,False}'|'ffmpeg_sample_rate={16000,22050}; do
    python -m aeneas.tools.execute_task -v aeneas/tools/res/audio.mp3 aeneas/tools/res/plain.txt 'task_language=eng|is_text_type=plain|os_task_file_format=srt' -r="$conf" sonnet-"$conf".srt
done

Then look at the last 4 lines of each:

tail -n4 *.srt
==> sonnet-cew=False|ffmpeg_sample_rate=16000.srt <==
15
00:00:53,200 --> 00:00:53,240
To eat the world's due, by the grave and thee.


==> sonnet-cew=False|ffmpeg_sample_rate=22050.srt <==
15
00:00:48,000 --> 00:00:53,240
To eat the world's due, by the grave and thee.


==> sonnet-cew=True|ffmpeg_sample_rate=16000.srt <==
15
00:00:48,080 --> 00:00:53,240
To eat the world's due, by the grave and thee.


==> sonnet-cew=True|ffmpeg_sample_rate=22050.srt <==
15
00:00:48,000 --> 00:00:53,240
To eat the world's due, by the grave and thee.

Note that the last segment starts at roughly 48 seconds except for the combination cew=False|ffmpeg_sample_rate=16000, where it starts at 53.2 seconds instead.
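To compare the four outputs numerically rather than by eye, the timing lines can be converted to seconds. This is only a small sketch: the helper name `cue_times` is made up for illustration, and the timing strings are copied from the `tail` output above.

```python
import re

# Matches an SRT timing line such as "00:00:48,080 --> 00:00:53,240".
CUE = re.compile(r"(\d+):(\d+):(\d+),(\d+) --> (\d+):(\d+):(\d+),(\d+)")


def cue_times(line):
    """Return (start, end) in seconds for one SRT timing line."""
    match = CUE.match(line.strip())
    if match is None:
        raise ValueError(f"not an SRT timing line: {line!r}")
    h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, match.groups())
    start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
    end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
    return start, end


# Timing line of the last segment (15) for each configuration, as shown above.
last_cues = {
    "cew=False|ffmpeg_sample_rate=16000": "00:00:53,200 --> 00:00:53,240",
    "cew=False|ffmpeg_sample_rate=22050": "00:00:48,000 --> 00:00:53,240",
    "cew=True|ffmpeg_sample_rate=16000": "00:00:48,080 --> 00:00:53,240",
    "cew=True|ffmpeg_sample_rate=22050": "00:00:48,000 --> 00:00:53,240",
}

for conf, line in last_cues.items():
    start, end = cue_times(line)
    print(f"{conf}: starts at {start:.2f}s, lasts {end - start:.2f}s")
```

In the broken configuration the last segment is only a few hundredths of a second long, while the other three give it roughly five seconds.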

Here's a snippet from the verbose output of a run with that configuration, highlighting important lines with ----->:

[DEBU] Synthesizer: Synthesizing text...
[DEBU] ESPEAKTTSWrapper: Calling TTS engine via C extension or subprocess
[DEBU] ESPEAKTTSWrapper: C extension 'cew' disabled
[DEBU] ESPEAKTTSWrapper: Running the pure Python code
[DEBU] ESPEAKTTSWrapper: Synthesizing multiple via subprocess...
[DEBU] ESPEAKTTSWrapper: Calling TTS engine using multiple generic function...
[DEBU] ESPEAKTTSWrapper: Determining codec and sample rate...
[DEBU] ESPEAKTTSWrapper: Reading codec and sample rate from OUTPUT_AUDIO_FORMAT
[DEBU] ESPEAKTTSWrapper: Determining codec and sample rate... done
[DEBU] ESPEAKTTSWrapper:   codec:       pcm_s16le
-----> ESPEAKTTSWrapper:   sample rate: 22050
[DEBU] ESPEAKTTSWrapper: Examining fragment 0 (no cache)...
[DEBU] ESPEAKTTSWrapper: Language to voice code: 'eng' => 'en'
[DEBU] ESPEAKTTSWrapper: Calling helper function
[DEBU] ESPEAKTTSWrapper: Synthesizer helper called with output_file_path=None => creating temporary output file
[DEBU] ESPEAKTTSWrapper: Temporary output file path is '/tmp/tmp30di9k3w.wav'
[DEBU] ESPEAKTTSWrapper: TTS engine reads text from stdin
[DEBU] ESPEAKTTSWrapper: Creating arguments list...
[DEBU] ESPEAKTTSWrapper: Creating arguments list... done
[DEBU] ESPEAKTTSWrapper: Calling TTS engine...
[DEBU] ESPEAKTTSWrapper: Calling with arguments '['espeak', '-v', 'en', '-w', '/tmp/tmp30di9k3w.wav']'
[DEBU] ESPEAKTTSWrapper: Calling with text '1'
[DEBU] ESPEAKTTSWrapper: Passing text via stdin...
[DEBU] ESPEAKTTSWrapper: Passing text via stdin... done
[DEBU] ESPEAKTTSWrapper: TTS engine wrote audio data to file
[DEBU] ESPEAKTTSWrapper: Calling TTS ... done
[DEBU] ESPEAKTTSWrapper: Reading audio data...
[DEBU] AudioFile: Loading audio data...
[DEBU] AudioFile: self.file_format is None or not good => converting self.file_path
[DEBU] AudioFile: Temporary PCM16 mono WAVE file: '/tmp/tmp_ow6yas8.wav'
[DEBU] AudioFile: Converting audio file to mono...
-----> FFMPEGWrapper: Calling with arguments '['ffmpeg', '-i', '/tmp/tmp30di9k3w.wav', '-ac', '1', '-ar', '16000', '-y', '-map_metadata', '-1', '-flags', '+bitexact', '-f', 'wav', '/tmp/tmp_ow6yas8.wav']'
[DEBU] FFMPEGWrapper: Call completed
[DEBU] FFMPEGWrapper: Returning output file path '/tmp/tmp_ow6yas8.wav'
[DEBU] AudioFile: Converting audio file to mono... done
[DEBU] AudioFile: Deleted temporary audio file: '/tmp/tmp_ow6yas8.wav'
[DEBU] AudioFile: Sample length:  0.638
-----> AudioFile: Sample rate:    16000
[DEBU] AudioFile: Audio format:   pcm16
[DEBU] AudioFile: Audio channels: 1
[DEBU] AudioFile: Loading audio data... done

What happens is this:

  1. Since cew is disabled, the synthesized audio for each line is concatenated in Python code.
  2. A buffer is allocated to hold the concatenated audio and its sample rate is determined to be 22050.
  3. Before a file is loaded into the buffer, it is converted to mono using ffmpeg.
  4. Since the ffmpeg command specifies a sample rate of 16000, the samples loaded into the buffer do not have the expected sample rate of 22050.
  5. When the concatenated buffer is written to a file, the sample rate is set to 22050, causing the audio to appear sped up.
  6. As a consequence, all timestamps are out of sync.
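
The size of the distortion in steps 4 and 5 can be worked out with a few lines of arithmetic. The sketch below is not aeneas code; it assumes the "Sample length" value in the log above is in seconds, and the function name `apparent_duration` is made up for illustration.

```python
ACTUAL_RATE = 16000   # rate ffmpeg produced (-ar 16000 in the log)
ASSUMED_RATE = 22050  # rate the concatenation buffer was labelled with


def apparent_duration(true_duration, actual_rate, assumed_rate):
    """Duration a clip seems to have when its samples are read at the wrong rate."""
    n_samples = round(true_duration * actual_rate)
    return n_samples / assumed_rate


true_len = 0.638  # "Sample length" reported by AudioFile for fragment 0
apparent = apparent_duration(true_len, ACTUAL_RATE, ASSUMED_RATE)

print(f"{true_len:.3f}s of audio appears as {apparent:.3f}s")
print(f"speed-up factor: {ASSUMED_RATE / ACTUAL_RATE:.3f}")
```

Every synthesized fragment shrinks by the same factor of 16000/22050, so the synthesized reference audio as a whole is about 27% shorter than it should be.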

I'd like to fix this, but beforehand I'd like to know why things are done this way. Evidently the true sample rate of the file is known once the data gets loaded, so is it ever necessary to set the sample rate beforehand?
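
One possible direction, sketched purely as an illustration: let the buffer adopt the sample rate of the first fragment it receives and refuse fragments with a different rate. `ConcatBuffer` is a hypothetical stand-in, not the actual aeneas class, and whether this fits the real code depends on the answer to the question above.

```python
class ConcatBuffer:
    """Hypothetical concatenation buffer; not part of aeneas."""

    def __init__(self):
        self.sample_rate = None  # unknown until the first fragment arrives
        self.samples = []

    def append(self, samples, sample_rate):
        """Add a fragment, taking the rate from the data instead of a preset."""
        if self.sample_rate is None:
            self.sample_rate = sample_rate
        elif sample_rate != self.sample_rate:
            raise ValueError(
                f"fragment has sample rate {sample_rate}, "
                f"buffer has {self.sample_rate}"
            )
        self.samples.extend(samples)

    @property
    def duration(self):
        """Length of the buffered audio in seconds."""
        if self.sample_rate is None:
            return 0.0
        return len(self.samples) / self.sample_rate


buf = ConcatBuffer()
buf.append([0] * 16000, 16000)  # one second of silence at 16000 Hz
buf.append([0] * 8000, 16000)   # half a second more at the same rate
print(buf.sample_rate, buf.duration)
```

With a check like this, the mismatch would surface as an error at load time instead of as silently shifted timestamps.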

@readbeyond readbeyond added the bug label Jan 21, 2021
@readbeyond readbeyond added this to the 2.0.0 milestone Jan 21, 2021
@WalkaboutPianoMan

Hi Yorwba... Any chance you could tell me how to fix this bug? I'm working on a lot of audio/text syncing with 22kHz files and so this fix would be a life saver for me :-) Peter

@seancondev

It took me 3 days of debugging before I found this issue...
Downsampling fixes the issue.
