Error on Speech-to-transcript alignment for large audio files #51
Based on the log messages, it's likely the downsampling is failing due to the large audio memory size. The downsampling is done via the Speex resampler WebAssembly module. The current, standard version of WebAssembly doesn't support arrays of about 2GB / 4GB or more (I'm not sure what the exact limit is). WASM64 will remove that limit, but it's just a proposal and isn't deployed in any runtime (without a flag). To perform the downsampling, the entire audio is passed to the WebAssembly module, so it looks like it may be failing to allocate it.

I could try to pass it in chunks. I think the Speex resampling library supports that. I'll look into it.

For now, you can try downsampling to a mono 16kHz wave file using another tool like ffmpeg or, say, foobar2000, and then using the downsampled wave file as input. If that format is detected, no downsampling should be performed.

Other than the downsampling, I don't think the alignment and recognition operations require passing the entire audio to WebAssembly. There may still be a limit of 2^32 elements for the Float32Array audio sample data, but that's about 16GB of memory (4 bytes per element).
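The chunked approach mentioned above can be sketched roughly as follows. This is not the project's actual code: `resampleChunk` is a stand-in (a simple averaging decimator), where the real implementation would call the Speex resampler's streaming API, which keeps filter state between chunks. The point is only that each call to WebAssembly then needs to allocate a small chunk rather than the entire multi-gigabyte buffer.

```typescript
// Stand-in for a single resampler call (simple N:1 averaging decimator,
// e.g. 48kHz -> 16kHz with factor 3). The real code would invoke the
// Speex resampler WASM module here.
function resampleChunk(samples: Float32Array, factor: number): Float32Array {
	const outLength = Math.floor(samples.length / factor)
	const out = new Float32Array(outLength)

	for (let i = 0; i < outLength; i++) {
		let sum = 0

		for (let j = 0; j < factor; j++) {
			sum += samples[i * factor + j]
		}

		out[i] = sum / factor
	}

	return out
}

// Feed the audio to the resampler in fixed-size chunks, so no single
// WASM allocation has to hold the whole recording.
function resampleInChunks(samples: Float32Array, factor: number, chunkSize: number): Float32Array {
	// Align chunk boundaries to the decimation factor, so the chunked
	// result is identical to processing the whole buffer at once
	// (a stateful resampler would handle boundaries internally instead).
	chunkSize -= chunkSize % factor

	if (chunkSize === 0) {
		chunkSize = factor
	}

	const parts: Float32Array[] = []
	let totalOutputLength = 0

	for (let offset = 0; offset < samples.length; offset += chunkSize) {
		const part = resampleChunk(samples.subarray(offset, offset + chunkSize), factor)

		parts.push(part)
		totalOutputLength += part.length
	}

	// Concatenate the per-chunk outputs
	const out = new Float32Array(totalOutputLength)
	let writeOffset = 0

	for (const part of parts) {
		out.set(part, writeOffset)
		writeOffset += part.length
	}

	return out
}
```

With aligned chunk boundaries, `resampleInChunks` produces the same output as a single whole-buffer call, while bounding the size of any individual allocation.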
I also now noticed that I didn't compile
Thank you for your quick answer.
What should I do next? Thanks!
Seems like what is happening now is that the synthesized speech, produced as part of the DTW alignment process, is causing the same error while being downsampled to 16kHz, so the workaround doesn't work in all cases.

Anyway, I'm working on this right now. I've already modified the Speex resampler to process in chunks. This should solve the issue with the maximum WASM memory size in a more thorough way, so this particular error shouldn't occur. It seems to handle 1 hour audio files fine. Now I'm testing with longer audio files, like 3.5 hours or more.

The Speex resampler now works fine with arbitrary sizes. But I see now that the wave file encoder / decoder I wrote isn't handling wave files or buffers larger than 4GB, because they are beyond the standard specification. I'm working on handling these larger files by ignoring some of the chunk sizes and parsing in a special way that works with the kind of WAVE output ffmpeg produces in those cases.
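The idea of ignoring unreliable chunk sizes can be illustrated like this (a minimal sketch with hypothetical names, not the project's actual decoder): the 32-bit RIFF size fields can't represent sizes of 4 GiB or more, and encoders typically write a placeholder such as `0xFFFFFFFF`, so for the `data` chunk the stated size is distrusted whenever it disagrees with the actual file length.

```typescript
// Read a 4-character RIFF chunk identifier
function readChunkId(bytes: Uint8Array, offset: number): string {
	let id = ''

	for (let i = 0; i < 4; i++) {
		id += String.fromCharCode(bytes[offset + i])
	}

	return id
}

// Scan top-level RIFF chunks and return the byte offset of the body of
// the chunk with the given identifier, or -1 if it isn't found.
function findChunkBody(bytes: Uint8Array, chunkId: string): number {
	const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)

	let offset = 12 // Skip 'RIFF', the RIFF size field, and 'WAVE'

	while (offset + 8 <= bytes.length) {
		const id = readChunkId(bytes, offset)
		const statedSize = view.getUint32(offset + 4, true) // Sizes are little-endian

		if (id === chunkId) {
			return offset + 8
		}

		// Chunk bodies are padded to an even length
		offset += 8 + statedSize + (statedSize % 2)
	}

	return -1
}

// Determine the real 'data' chunk size: if the stated 32-bit size is the
// 0xFFFFFFFF sentinel, or disagrees with the bytes actually remaining in
// the file (as happens when the field overflowed), trust the file length.
function getDataChunkSize(bytes: Uint8Array, dataBodyOffset: number): number {
	const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength)

	const statedSize = view.getUint32(dataBodyOffset - 4, true)
	const actualSize = bytes.length - dataBodyOffset

	if (statedSize === 0xffffffff || statedSize !== actualSize) {
		return actualSize
	}

	return statedSize
}
```

This works because the `data` chunk is conventionally the last chunk in the file, so everything after its header can be taken as sample data.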
Not sure if it helps, but: the downsampled audio.wav file I tested with is 1.06 GB in size (the original mp3 file was about 132MB).
I've made the changes to the wave decoder and encoder to support lengths larger than 4 GiB. I can now align a 3.5 hour audio file in about 5 minutes using the default settings.

I have 32GB of RAM, so it works reasonably fast, but anything longer than about 3.5 to 4 hours could become very slow (it will need to swap to disk). If this is an audiobook, it's probably better to first split the audio into chapters manually. Maybe it would also help to handle these longer durations using multi-pass processing.

I'll publish the new version soon so you can test it.
I've released a new version. It should work with long audio durations, say, up to 3 to 4 hours, but a 9 hour file may be a bit too much for the default settings. If you want to process the entire thing at once, there are settings you can try adjusting, and if you have an NVIDIA GPU, that can help as well.

I'll try to experiment with these durations, and see if I can maybe tune the defaults.
Thank you for the update. I've tested some values for the params you suggested. The DTW cost matrix memory size was 3999.7MB - surprisingly (I only have 16GB of RAM on my MacBook M1).
I plan to try some more combinations of the params and values, to see if I can get an improved result.
This is the current logic used to set the DTW window duration when it's not given:

```ts
if (options.dtw!.windowDuration == null) {
	const sourceAudioDuration = getRawAudioDuration(sourceRawAudio)

	if (sourceAudioDuration < 5 * 60) { // If up to 5 minutes, set window to one minute
		options.dtw!.windowDuration = 60
	} else if (sourceAudioDuration < 60 * 60) { // If up to 1 hour, set window to 20% of total duration
		options.dtw!.windowDuration = Math.ceil(sourceAudioDuration * 0.2)
	} else { // If 1 hour or more, set window to 12 minutes
		options.dtw!.windowDuration = 12 * 60
	}
}
```

So for an audio duration beyond 1 hour, the window is always set to 12 minutes (I didn't remember that, actually, since I wrote this more than a year ago). This means that the alignment only looks in a range of 12 minutes around the interpolated location in the synthesized reference to try to find the best matching frame. If you're giving it several hours, it's very likely that a 12 minute window may not be enough, especially if the audio contains areas of music, etc., that are not filtered out. (The reason I set it to 12 minutes is that larger sizes would quickly get to tens of gigabytes of memory - that was actually before I added support for lower granularities, so maybe I should readjust the logic.) This can happen even with a lower granularity set.

The second pass, if requested, would by default use a window of 15 seconds to refine the alignment found in the first pass. It may not be necessary for subtitles.

Try increasing the window size to 20 minutes (20 * 60 seconds), 30 minutes (30 * 60 seconds), or more. With these durations, though, it's likely the memory usage would become well over 10GB or 20GB. It depends on the length of the audio you give it.

Also, another aspect I found is that Node 22 now allows buffers and typed arrays of arbitrary lengths (I've allocated a 35 GiB buffer successfully), but Node 20 and earlier allow only up to 4 GiB. I made some changes to try to reduce the peak memory when converting the wave buffer to raw audio, but it still produces several copies in memory. If you're passing the input as a 16kHz mono wave file, it should avoid some of these issues. If you aren't using Node 22, that may actually be the only way to load multi-hour audio files, currently.
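To see why larger windows get expensive, note that the cost matrix has roughly (total frames) × (window frames) cells, so memory grows linearly with both the audio length and the window duration. A back-of-the-envelope estimator, where the frame rate and bytes-per-cell values are illustrative assumptions rather than the project's actual internals:

```typescript
// Rough estimate of DTW cost matrix memory. The defaults (8 frames per
// second, i.e. one frame per 125ms as might be used at a coarse
// granularity, and 4 bytes per cell) are illustrative assumptions, not
// the project's exact internals.
function estimateDtwCostMatrixBytes(
	audioDurationSeconds: number,
	windowDurationSeconds: number,
	framesPerSecond = 8,
	bytesPerCell = 4): number {

	const totalFrames = audioDurationSeconds * framesPerSecond

	// The window can't be wider than the audio itself
	const windowFrames = Math.min(windowDurationSeconds, audioDurationSeconds) * framesPerSecond

	return totalFrames * windowFrames * bytesPerCell
}
```

Under these assumptions, 1 hour of audio with a 12 minute window comes out to roughly 0.66 GB, and doubling either the audio length or the window duration doubles the memory.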
I released a new version, and reworked the auto-selection of granularities and window durations.
I might change this in the future, of course.

I tested this on a 6 hour audio file, and the memory requirement of the first pass was about 5GB - 7GB (the size actually depends on how many words the speech contains, which impacts the synthesized reference size), and the second pass was much smaller. The two passes are actually so fast that they take a minority of the overall time; the other processing stages now take most of it. I think it did it in about 300 seconds (5 minutes).

I also attempted to fix some core issues with how the DTW algorithm works with smaller window sizes, so now I'm more confident the multi-pass should work correctly, and I can select it by default. I removed all of the related warnings.

There was also a completely unreported eSpeak issue that I got with about 20% of the multi-hour audiobooks I downloaded from YouTube, causing an error due to a missing marker (it was caused by isolated
Thank you for the changes. |
Hello, I'm trying to use the speech-to-transcript alignment on large audio files (6+ hours duration).
I'm getting the error below. I tried with two different files (one almost 6h, the other 9.1h).
I wasn't able to find this exact error anywhere else, which is quite strange.
I'm on a MacBook M1 (latest stable macOS). I have ffmpeg 6.1.1.
I can provide the files I used, if needed.
I tried manually with smaller chunks of audio & text (cut from the same files) and it seems to work.
However, I can't do this manually for all the files.
How can I debug & work around this issue?
Thanks in advance! Great work btw!