Audio loudness normalization #181
Comments
First of all, equalisation is a specific term for manipulating the frequency content of a sound. The issue described here is called loudness normalization, so the issue is due for renaming. While this is a great goal, it's rather hard to enforce live. The speakers have to have a reliable way of measuring their output loudness. The suitable measurement units are short-term LUFS or RMS, but not every piece of software provides meters for those. |
So, SReview (the tool that we use for postprocessing and transcoding videos, and which we repurposed this year to also handle upload and preprocessing) actually has built-in support for loudness normalization, using bs1770gain. However, since that software mangles the audio in more ways than just "loudness normalization", which was causing bugs in SReview (apart from it being written by a right-wing extremist nazi), it was disabled for the upload processing for FOSDEM 2021. I recently implemented loudness normalization using the ffmpeg loudnorm filter, which should allow me to disable the bs1770gain implementation; so if FOSDEM 2022 is still going to be online (in whole or in part), this issue should be fixed. |
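A minimal sketch of what such a two-pass loudnorm run can look like (not SReview's actual code; file names, loudness targets, and the output audio codec are illustrative assumptions): the first pass only measures, the second pass feeds the measured values back so the filter can apply a linear gain where possible, while the video stream is copied untouched.

```python
# Hypothetical two-pass loudnorm sketch (not SReview's implementation).
# Targets (EBU R.128-ish) and file names are illustrative assumptions.
import json
import subprocess

TARGET = "I=-23:TP=-1.0:LRA=11"

def measure(infile):
    """Pass 1: let loudnorm measure the input; it prints a JSON block on stderr."""
    cmd = ["ffmpeg", "-hide_banner", "-nostats", "-i", infile,
           "-af", f"loudnorm={TARGET}:print_format=json",
           "-f", "null", "-"]
    stderr = subprocess.run(cmd, capture_output=True, text=True).stderr
    # The flat JSON object is the last {...} block loudnorm writes to stderr.
    start, end = stderr.rindex("{"), stderr.rindex("}") + 1
    return json.loads(stderr[start:end])

def normalize(infile, outfile):
    """Pass 2: hand the measured values back so loudnorm can stay linear."""
    m = measure(infile)
    af = (f"loudnorm={TARGET}:linear=true"
          f":measured_I={m['input_i']}:measured_TP={m['input_tp']}"
          f":measured_LRA={m['input_lra']}:measured_thresh={m['input_thresh']}"
          f":offset={m['target_offset']}")
    subprocess.run(["ffmpeg", "-y", "-i", infile, "-af", af,
                    "-c:v", "copy", "-c:a", "aac", outfile], check=True)

if __name__ == "__main__":
    normalize("talk.mp4", "talk-normalized.mp4")
```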
The issue with loudnorm is that it doesn't guarantee linear normalization even in double-pass mode. I've bumped into this with my ffmpeg-loudnorm-helper thingy. For some dynamic content, loudnorm doesn't apply compression/limiting to fit peaks into the required range and falls back to dynamic mode, which sometimes results in sudden jumps of loudness and overall inferior sound compared to a proper chain of compression and loudness normalization. |
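As a small illustration of that failure mode (again only a sketch, assuming the second pass is run with print_format=json as above): loudnorm reports in its stats whether it actually stayed linear, so a pipeline can at least detect the fallback and flag the file for attention.

```python
# Hypothetical helper: inspect the stats JSON from a loudnorm pass
# (keys as emitted by ffmpeg's loudnorm filter with print_format=json).
def fell_back_to_dynamic(stats: dict) -> bool:
    """True if loudnorm gave up on linear normalization for this file."""
    return stats.get("normalization_type", "").lower() != "linear"
```

As far as I understand, the fallback happens when the source loudness range exceeds the target LRA, or when a purely linear gain would push the true peak over the ceiling, which is also why the compressor-first suggestion later in this thread helps.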
Darn, didn't know that (I only just wrote the loudnorm based normalization). In that case I suppose we're not there yet :-/ Do you have any better suggestions for implementing audio loudness normalization? |
This will probably be a long comment. The thing is, for the best results you need to control both the dynamic range and the final loudness of the content. Doing it by hand is super easy thanks to the multiple metering and visual cues available, and usually takes a few back-and-forth passes. Also, standard tools process audio in 32-bit floating point, which eliminates the issue of clipping at the intermediate stages. A dumb approach could be something along the lines of the following algorithm, which remotely resembles the manual routine:
Assumptions:
To gracefully deal with the second assumption, the tool could track a few moving averages (akin to short-term LUFS) on the time scales of points 2 and 3; if they never cross some predetermined levels, the audio already fits into the required dynamic-range corridor and the corresponding processing stage can be skipped. This routine is based on the approach described in my article, which I proposed as a go-to reference for voice processing for FOSDEM: https://indiscipline.github.io/post/voice-sound-reference/#strategies-for-applying-processing

Manual processing has the benefit of not necessarily relying on compression for dealing with dynamics in stages 2 and 3, as an engineer can clearly see the portions of the audio which fall below/above the average and adjust the gain accordingly. This is what loudnorm seems to be trying to simulate, but the volume swings are often unnatural and the timing of the gain adjustments is unreliable. Tracking loudness state ("the average volume shifted", i.e. the distance to the microphone changed; "short volume outlier", i.e. a loud phrase, an exciting moment, etc.; "loud sounds happening", i.e. laughter, coughing, dropping things, etc.) and applying simple gain corrections based on it would be preferable to just relying on compression, but that requires a much more complex solution.

UPD: Proper dynamic processing units allow reacting to loudness measured not only in peak but in RMS, which in a way gets one closer to the state tracking I described above. You can set up multiple such units to process audio based on different time resolutions.

Also, I still suggest renaming the issue. |
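A toy illustration of the moving-average / corridor idea described above (not code from this thread; window, hop, and corridor values are made-up assumptions): track a short-term RMS level, and only run a dynamics stage if that level actually leaves the corridor.

```python
# Toy sketch of the "moving averages" idea: a short-term RMS track
# (3 s window, 100 ms hop, roughly the time scale of short-term LUFS,
# but without K-weighting). Window, hop, and corridor width are
# illustrative values; `samples` is float PCM in [-1, 1].
import numpy as np

def short_term_rms_db(samples, rate, win_s=3.0, hop_s=0.1):
    win, hop = int(win_s * rate), int(hop_s * rate)
    frames = [samples[i:i + win] for i in range(0, len(samples) - win, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    return 20 * np.log10(np.maximum(rms, 1e-9))  # dBFS

def needs_dynamics_stage(samples, rate, corridor_db=6.0):
    """True if the short-term level ever strays out of the corridor,
    i.e. the dynamics-processing stage cannot be skipped."""
    track = short_term_rms_db(samples, rate)
    return bool(np.any(np.abs(track - np.median(track)) > corridor_db))
```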
Ah, another addition. My previous post is all about post-processing. This is suboptimal compared to properly adjusting the sound on the recording end. If the settings were off and the sound was recorded distorted, or too noisy, or mangled by overeager noise reduction or abysmal codecs, then there's almost nothing you can do about it in post-processing. Preparing some standard protocol for setting things up for the recording can go a long way. |
Yeah, okay, I'm afraid that sounds a bit too complex... My problem is that I'm dealing with audio which can be literally anything:
and I want something "reasonable" to roll out. For each case, that would be:
Without any manual work (because that's the whole design goal of SReview: "do as much automated as possible"), I know I'm asking for AI or some magic code that will DWIM without any effort, but I don't need perfection; I just want to get as close as possible to those three results. The alternative is that a (too) small team will have to manually balance 600+ videos in two weeks, and that's just not possible.

I just found that ffmpeg also has an "ebur128" filter, which I guess I can look into for more options, but for now I'll stick with loudnorm, accept that it won't be perfect, and put this on the back burner in case I have time left at some undefined point in the future (yeah, right). Alternatively, patches are definitely welcome ;-) |
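For what it's worth, the ebur128 filter mentioned above is useful purely as a meter. A sketch (file name and the naive parsing of ffmpeg's human-readable summary are assumptions) that pulls the integrated loudness and loudness range out of the end-of-run report, e.g. to see how far off a talk is before deciding whether to touch it at all:

```python
# Hypothetical measurement-only sketch using ffmpeg's ebur128 filter.
# It parses the summary that ffmpeg prints on stderr at the end of the run.
import re
import subprocess

def measure_r128(infile):
    cmd = ["ffmpeg", "-hide_banner", "-nostats", "-i", infile,
           "-af", "ebur128=peak=true", "-f", "null", "-"]
    err = subprocess.run(cmd, capture_output=True, text=True).stderr
    # The filter also logs per-frame values, so take the last match,
    # which comes from the final summary block.
    integrated = float(re.findall(r"I:\s*(-?[\d.]+)\s*LUFS", err)[-1])
    lra = float(re.findall(r"LRA:\s*(-?[\d.]+)\s*LU", err)[-1])
    return integrated, lra

if __name__ == "__main__":
    i, lra = measure_r128("talk.mp4")
    print(f"integrated: {i} LUFS, loudness range: {lra} LU")
```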
The logical simplification of the steps I proposed is to stick one gentle compressor before loudnorm to decrease the number of fallbacks to dynamic normalization. This will hurt case 3 a bit, but unfortunately that is the rarest case and you probably aren't optimizing for it. |
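That suggestion maps onto a single ffmpeg audio filter chain. Here is a hedged sketch (every setting is an illustrative guess, not a tuned value): a gentle 2:1 acompressor ahead of loudnorm, meant to be used in both passes of the two-pass run sketched earlier so that the measurement already sees the compressed signal.

```python
# Hypothetical "gentle compressor ahead of loudnorm" chain; all settings are
# illustrative guesses. acompressor's threshold is a linear amplitude
# (0.0625 is roughly -24 dBFS).
def compressed_loudnorm_chain(loudnorm_args="I=-23:TP=-1.0:LRA=11"):
    """Build an -af value: soft 2:1 compression, then loudnorm.

    For a two-pass run, use this chain in both passes (appending the
    measured_* values to loudnorm_args in the second pass), so that the
    measurement is taken on the compressed signal.
    """
    return ("acompressor=threshold=0.0625:ratio=2:attack=20:release=250,"
            f"loudnorm={loudnorm_args}")
```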
It's the rarest at FOSDEM, but it's not always the rarest. It's true that I'm not optimizing for it, though; and if there's a conference where case 3 is guaranteed to always apply, it's always possible to just disable the normalization in SReview, so then it shouldn't hurt anymore. Thanks for your input, you've given me some food for thought. Not sure I'll find the time to implement this any time soon, but at least I know how to improve matters should it be necessary. |
I'll be glad to help further. Feel free to contact me on matrix with any questions. Not sure I'll be able to contribute any code, though. |
Note that the Sennheiser AVX series microphones we use at FOSDEM have a relatively sophisticated automatic gain control built in. Playing with a normalisation algorithm on top of that quickly becomes rather hairy...
|
On Fri, Jun 25, 2021 at 10:43:22AM -0700, Mark Van den Borre wrote:
> Note that the Sennheiser AVX series microphones we use at FOSDEM have a
> relatively sophisticated automatic gain control built in.
Not as relevant as you might think, for three reasons:
- Automatic gain in the microphone deals with short-term loudness, not the long-term loudness over the whole talk that I'm trying to deal with.
- This issue is mostly about handling prerecorded videos, rather than postprocessing and post-event transcoding. Even if we end up deciding that there is no need for loudness normalization in postprocessing (I think there is, but...), I still want to deal with audio loudness in SReview's preprocessing for FOSDEM if we need to do it again, but the code is the same for both cases.
- While SReview was originally written for FOSDEM, it's grown beyond that, and I want it to work in more situations than "just" FOSDEM.
At any rate, when I rewrote the audio loudness normalization functionality a while back, I made it much easier to also completely disable loudness normalization, so we can very easily switch it off if we want to.
> Playing with a normalisation algorithm on top of that quickly becomes
> rather hairy...
Not really. If the audio loudness levels are similar over the entire video, there isn't *that* much that SReview can do badly.
|
Will https://github.com/complexlogic/rsgain/ be of any help here? |
From that page:
That's not what we are trying to do; we want to create an audio stream that has the correct loudness levels, rather than leaving it at "original" values and adding tags (so a media player can correct it). There are a number of standards for audio loudness levels which we try to follow, and which are adhered to by most TV broadcasters; following them means you can play your video on a TV set and you won't need to adjust your volume (hopefully...).

Additionally, rsgain is meant for a music library, and that is reflected in the container formats it supports: none of them are containers that support video, there are only audio containers.

So, thanks for the suggestion, but no, that won't be of any help. |
@yoe am I right that once |
Possibly, but it probably also won't give us an EBU R.128 style loudness normalization, so it isn't very useful really. |
Why is EBU R.128 so important? It looks like ReplayGain 2.0 is newer. |
ReplayGain is meant for portable media players; EBU R.128 is meant for broadcast audio. The two do not serve similar purposes. "Newer" is irrelevant here :-) |
I still don't get why the loudness normalization of ReplayGain (i.e. normalization for media files) should be worse than EBU R.128 (i.e. normalization for broadcast streams). If it is not worse, then why is it not useful? |
It's almost the same thing (ReplayGain 2.0 is based on ITU BS.1770-3, while EBU R.128 is roughly ITU BS.1770-2), but it's just a draft and not a finished and agreed-upon standard. EBU R.128 is already in use and very likely might be enforced in some way or another during delivery, so it's prudent to conform. |
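For context (a hedged note, since the thread doesn't spell this out): both specifications measure loudness with the ITU BS.1770 algorithm, so the underlying number is essentially the same; they differ mainly in gating details and in what they do with the result (R.128 normalizes the programme itself to -23 LUFS, while ReplayGain 2.0 stores a gain tag relative to a -18 LUFS reference). Roughly, the BS.1770 measurement is:

```latex
% Loudness of the gated programme, per ITU-R BS.1770:
% z_i is the mean square of the K-weighted signal of channel i over the
% gated blocks; G_i are channel weights (1.0 for L/R/C, 1.41 for the
% surround channels; LFE is excluded).
L_K = -0.691 + 10 \log_{10} \Bigl( \sum_i G_i \, z_i \Bigr) \ \text{LUFS}
```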
Sigh. We already use the ffmpeg loudnorm filter, which does approximately the same thing as rsgain (although optimizations can certainly be added, as explained before). This comes for free with ffmpeg, which is already a dependency. This "rsgain" thing that you point to does not show any advantages over ffmpeg, but adds extra dependencies (that we then have to install on our systems) and doesn't support video files -- which means you have to extract the audio, perform the normalization, and then join the audio back together again. We used to do this when audio normalization was still implemented using bs1770gain.

So, please accept that I've looked at the problem space, understand it reasonably well, and know how to deal with it. Suggesting a switch to $tool (where $tool is "not ffmpeg") is not helpful, unless it is accompanied by a thorough technical explanation that shows you know how audio normalization works and why $tool would be better than an ffmpeg-based approach.

Thanks, |
To expand on this a bit more: there are a million values for $tool which claim they can do audio normalization "automatically" and "for free", and they're all lying, because audio normalization is not really something you can do automatically; the human ear is a very complicated and weird thing. It's reasonably easy to accomplish for audio that has been meticulously edited by audio professionals (such as a music album), but getting it to work correctly on a bunch of audio from sources that can be literally anything, from "the worn-out built-in microphone of an old laptop at too large a distance" to "a professional recording microphone used correctly", is a completely different story. This is not a simple "ah, I know this thing that I used on my music library, so let's just use that" thing. |
Not relevant since no more prerecordings + remote speakers. |
It is still relevant, as we still want to do audio normalization in postprocessing for things that happened on-site. |
That's exactly what I expect from open source. :) I really appreciate all the talks and explanations that tell me things I would never be able to discover otherwise.

The more I study acoustics in rooms and recording for movies, the more excited I get about the whole field. When I talk to people about sound, they all listen like a child to a fairytale. This topic is fascinating exactly because it is ubiquitous and weird.

AI systems can do this perfectly well, like they do with pictures, but for that they need to copy the human expertise, and for that some humans need to share this expertise with the world to make it possible. That would be a humane approach to AI. Also, given that "AI", or rather "ML", has probably been part of digital sound processing from the very beginning. |
Do we still have this in the rooms? The microphones we use (Sennheiser AVX) seem to be doing a great job with their AGC, and the rest is up to the video team not to screw up the levels when setting up the system. I think the only issue I've had was when someone really screwed things up, and there's a limit to how much we can protect against that :) |
An often-heard complaint was that talks were too quiet and the Q&A following them too loud... and there is a wide variation in volume across all the talks...