shader_debugprintf: support new VVL-DEBUG-PRINTF message and fix VVL version check for API selection #1187

Open
wants to merge 10 commits into
base: main

SRSaunders
Contributor

@SRSaunders SRSaunders commented Oct 9, 2024

Description

Fixes two issues that arose with Vulkan SDK 1.3.296:

  1. Supports new VVL-DEBUG-PRINTF callback message. Previous SDKs used WARNING-DEBUG-PRINTF or UNKNOWN-DEBUG-PRINTF. Without this fix the debug data is not available in the UI Overlay.
  2. Fixes my incorrect assumption that the Vulkan instance version matched the SDK version for all platforms - true on macOS but not true for Windows and Linux. This version is used to set the API level for the sample, which is important for performance and to avoid a previous defect in the Vulkan Validation layer. I have replaced the instance version check with a Validation Layer version check which is portable across all platforms: Win, Linux, macOS. Without this fix, performance is poor on Windows and Linux when using Vulkan SDK 1.3.296.
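As a rough illustration of the first fix, the sketch below matches all three message ID names mentioned above. The helper name `is_debug_printf_message` and the table `kDebugPrintfIds` are hypothetical and not taken from the sample; only the three ID strings come from this description.

```cpp
#include <array>
#include <string_view>

// Message ID names listed in the PR description. Older SDKs used the first
// two; SDK 1.3.296 switched to the third.
constexpr std::array<std::string_view, 3> kDebugPrintfIds = {
    "WARNING-DEBUG-PRINTF",
    "UNKNOWN-DEBUG-PRINTF",
    "VVL-DEBUG-PRINTF",
};

// Hypothetical helper: returns true when the message ID name identifies
// shader debugPrintfEXT output. A null name is treated as "no match".
inline bool is_debug_printf_message(const char *message_id_name)
{
    if (message_id_name == nullptr)
    {
        return false;
    }
    const std::string_view id{message_id_name};
    for (const std::string_view known : kDebugPrintfIds)
    {
        if (id == known)
        {
            return true;
        }
    }
    return false;
}
```

In a debug utils messenger callback this would typically be called with `pCallbackData->pMessageIdName`, so that old and new SDKs both route printf output to the UI overlay.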

Fixes #1184.

Tested on Windows 10, Manjaro Linux, and macOS Ventura using Vulkan SDKs 1.3.290 and 1.3.296.

I hope this is the last time I have to fix this. It seems that VVL changes can easily break this sample.

General Checklist:

Please ensure the following points are checked:

  • My code follows the coding style
  • I have reviewed file licenses
  • I have commented any added functions (in line with Doxygen)
  • I have commented any code that could be hard to understand
  • My changes do not add any new compiler warnings
  • My changes do not add any new validation layer errors or warnings
  • I have used existing framework/helper functions where possible
  • My changes do not add any regressions
  • I have tested every sample to ensure everything runs correctly
  • This PR describes the scope and expected impact of the changes I am making

Note: The Samples CI runs a number of checks including:

  • I have updated the header Copyright to reflect the current year (CI build will fail if Copyright is out of date)
  • My changes build on Windows, Linux, macOS and Android. Otherwise I have documented any exceptions

If this PR contains framework changes:

  • I did a full batch run using the batch command line argument to make sure all samples still work properly

Sample Checklist

If your PR contains a new or modified sample, these further checks must be carried out in addition to the General Checklist:

  • I have tested the sample on at least one compliant Vulkan implementation
  • If the sample is vendor-specific, I have tagged it appropriately
  • I have stated on what implementation the sample has been tested so that others can test on different implementations and platforms
  • Any dependent assets have been merged and published in downstream modules
  • For new samples, I have added a paragraph with a summary to the appropriate chapter in the readme of the folder that the sample belongs to e.g. api samples readme
  • For new samples, I have added a tutorial README.md file to guide users through what they need to know to implement code using this feature. For example, see conditional_rendering
  • For new samples, I have added a link to the Antora navigation so that the sample will be listed at the Vulkan documentation site

Collaborator

@SaschaWillems SaschaWillems left a comment

Thank you very much for this PR. I do have some remarks though, mostly related to comment and code structure. I think it's important that people can easily follow and understand the changes ;)

@SRSaunders
Contributor Author

SRSaunders commented Oct 18, 2024

Thanks @SaschaWillems for the feedback. I am away on vacation this week, but will make the requested changes when I am back.

UPDATE: Back now and changes submitted in 0dc4963.

asuessenbach
asuessenbach previously approved these changes Oct 22, 2024
@SaschaWillems
Collaborator

No idea why, but with this PR and the latest SDK (1.3.296) on Windows, this sample is now again running at less than 1 fps. Forcing it to use VK 1.2 is somehow even slower (0 or inf fps).

If I force VK 1.0 performance is fine, but I don't get any debug output.

Not sure what is happening here and why this sample is so problematic. The debug printf sample from my own samples repo works just fine no matter the api version :/

@SRSaunders
Contributor Author

No idea why, but with this PR and the latest SDK (1.3.296) on Windows, this sample is now again running at less than 1 fps. Forcing it to use VK 1.2 is somehow even slower (0 or inf fps).

Very strange. Can I ask you to recheck before and after this PR, but being careful with your SDK version selection and project gen/build? I did a lot of testing with old and new SDKs on Windows 10, Linux and macOS before submitting originally. I will go back and test again to see if I can somehow duplicate what you are seeing.

If I force VK 1.0 performance is fine, but I don't get any debug output.

Debug PrintF requires Vulkan 1.1 or later. So no surprise that you are not getting debug output with API 1.0.

The debug printf sample from my own samples repo works just fine no matter the api version

I suspect your repo's sample relies on the intrinsic Debug PrintF capability at the shader level on Windows. However, this is not cross-platform portable, whereas the Vulkan-Samples one uses the VVL version of the feature all the time. Perhaps that is why you are seeing a difference, at least on Windows. Again, I will go back and see if I can verify this.

@SaschaWillems
Collaborator

It also happens with the old code (before this PR). I only have SDK 1.3.296 installed.

So probably a regression in the validation layers?

@SRSaunders
Contributor Author

SRSaunders commented Oct 23, 2024

Ok, I have rechecked this PR on Windows 10, and even fast-forwarded my local branch to current main HEAD just to make sure. I am using Vulkan SDK 1.3.296.0 with my Radeon RX6600XT GPU. My Vulkan Configurator has been reset to default settings.

Before this PR I get:
[Screenshot: main only]

After this PR I get:
[Screenshot: shader_debugprintf FF]

Is it possible that your Vulkan Configurator has a custom setting that is interfering with the sample? Or possibly a difference between AMD and nVidia GPUs? Just grasping at straws since I cannot duplicate your issue and the 1.3.296 VVL seems to be working correctly using API 1.1 for debug printf.

Contributor

@asuessenbach asuessenbach left a comment

With this change, I have to distinguish two cases:

  1. VulkanConfigurator is running
    VK_EXT_LAYER_SETTINGS_EXTENSION_NAME is available
    instance creation is done by VulkanSample::create_instance (line 469)
    render speed is high
    debug_utils_message_callback is never called, thus no debugprintf output
  2. VulkanConfigurator is not running
    VK_EXT_LAYER_SETTINGS_EXTENSION_NAME is not available
    instance creation is done locally (line 523)
    render speed is extremely low
    debug_utils_message_callback is called, with higher rate than the frame rate

Note, in case 2, you're using VkValidationFeaturesEXT, which is part of VK_EXT_VALIDATION_FEATURES_EXTENSION_NAME. But you don't ask for it in the ShaderDebugPrintf constructor (or anywhere else). And in fact, that extension is not supported on my machine. Strange, that the VVL doesn't cry there.

@SaschaWillems
Collaborator

That would explain why it's so slow for me. I never ran that sample with the VulkanConfigurator running. That's case 2.

@SRSaunders
Contributor Author

SRSaunders commented Oct 24, 2024

Thanks @asuessenbach for pointing out the missing VK_EXT_validation_features extension. I have made a few changes that might make a difference as follows:

  1. Moved layer settings out of the constructor, and into ShaderDebugPrintf::create_instance(). Now it will run only if the VK_EXT_layer_settings extension is available. This part is for encapsulation only and will not change behaviour.
  2. Added and enabled the VK_EXT_validation_features extension when the VK_EXT_layer_settings extension is not available at runtime. This might change behaviour, but I am concerned about @asuessenbach's comment that the extension is not available on his machine. I'm not sure how that is possible.
  3. Fixed an incorrect string comparison operation for VK_EXT_layer_settings in [HPP]Instance::[HPP]Instance(). This was my mistake from an earlier PR. This could have prevented proper specification of the validation layer feature settings when VK_EXT_layer_settings is active. Again, this could change behaviour.
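The exact comparison that was wrong is not shown in this thread. As an assumption, one common way such a check goes wrong in C++ is comparing two `const char *` extension names with `==`, which compares addresses rather than characters. A minimal sketch of a content-based lookup, using the hypothetical helper name `contains_extension`:

```cpp
#include <cstring>
#include <vector>

// Hypothetical helper: returns true when `wanted` appears in `names`,
// comparing the characters of each entry rather than the pointer values.
inline bool contains_extension(const std::vector<const char *> &names, const char *wanted)
{
    for (const char *name : names)
    {
        // `name == wanted` would compare addresses and can miss an equal
        // string that happens to live at a different location in memory.
        if (std::strcmp(name, wanted) == 0)
        {
            return true;
        }
    }
    return false;
}
```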

These changes may not be the final solution as I have observed the following when testing:

  1. Linux (Manjaro) using Vulkan 1.3.295 (from pkg mgr) and VVL 1.3.290 (from pkg mgr): this PR works properly (good frame rate, debug data available) when running with vkconfig and without. VK_EXT_layer_settings is only available when vkconfig is active. In this case the debug data is available both in the UI and in the stdout console. No performance issues are visible in either case.
  2. macOS (Ventura) using Vulkan SDK 1.3.296: this PR works properly (good frame rate, debug data available) when running with vkconfig and without. VK_EXT_layer_settings is available both when vkconfig is inactive and active - this is a difference vs Linux. In the latter case (vkconfig active) the debug data is available both in the UI and in the stdout console. No performance issues are present. Also tested with Vulkan SDK 1.3.290 and the results are the same - no performance problems. The only issue is that vkconfig does not appear to recognize the repeated message limit for the new VVL-* messages (vs. the previous INFO-* or WARNING-* messages, etc). A minor issue but likely a bug.
  3. Windows 10 using Vulkan SDK 1.3.296 with my AMD 6600XT GPU: this PR works properly (good frame rate, debug data available) when running without vkconfig only. When vkconfig is active, the sample will not start and complains about an unsupported extension during vkCreateInstance(). However, VK_EXT_layer_settings is available during enumeration when vkconfig is active. Something very strange is going on here - either a bug on the Windows side or something I do not understand. I am not sure how VK_EXT_layer_settings can be enumerated but not supported. See my console output in this case:

[Screenshot: nolayerext]

In summary:

  1. Linux: works properly using VVL 1.3.290 with and without vkconfig. Can't test VVL 1.3.296 since it is not yet available as a package for my Manjaro distro.
  2. macOS: works properly using VVL 1.3.290 and 1.3.296 with and without vkconfig.
  3. Windows 10 on AMD 6600XT GPU: works properly using VVL 1.3.296 without vkconfig only.

Lastly, I thought VK_EXT_layer_settings was meant to replace and deprecate VK_EXT_validation_features. I don't understand why VK_EXT_layer_settings is available all the time on macOS, but for Windows and Linux seems to be enabled only when vkconfig is running. This seems incorrect to me. Can you explain this?

@SRSaunders
Contributor Author

SRSaunders commented Oct 24, 2024

Ok, I think I have finally figured it out. It appears that you don't need to actually enable the VK_EXT_layer_settings extension in order to use it. I’m not sure if this is a feature or a bug. In any case, I have updated the sample and [HPP]Instance::[HPP]Instance() to check for availability of the extension vs. enablement. This approach works across all platforms and behaviours appear to be consistent now:

  1. Sample is tolerant of Vulkan SDK versions: tested against VVL 1.3.290 (Win, Linux, macOS) and 1.3.296 (Win, macOS)
  2. Sample is tolerant of vkconfig running or not running. The only thing to be careful of when running vkconfig is to make sure "Limit Duplicated Messages" is turned off - otherwise debug callback messages will be suppressed and the debug output UI will be blank.

@asuessenbach
Contributor

AFAIK, those two extensions (VK_EXT_layer_settings and VK_EXT_validation_features) are not supported by any NVIDIA GPU, but are provided by a layer injected by for example the VulkanConfigurator. That might explain why it's that slow.

Besides that, just to make sure it has been noted: As VK_EXT_validation_features is deprecated in favour of VK_EXT_layer_settings, using VK_EXT_validation_features would just be a fallback solution. Don't know, if it's worth to have that. And you should bail out in a friendly way, if none of those extensions is available, maybe with a hint to the VulkanConfigurator.
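The preference order suggested here (layer settings first, validation features as a fallback, and a friendly exit otherwise) could be sketched roughly as follows. The enum `PrintfSetup` and both function names are hypothetical, and the wording of the hint is only an example.

```cpp
#include <string>

// Hypothetical enum describing which configuration path a sample would take.
enum class PrintfSetup
{
    LayerSettings,             // VK_EXT_layer_settings is available
    ValidationFeatures,        // fall back to the deprecated VK_EXT_validation_features
    Unavailable                // neither extension was found
};

// Prefer the newer extension, then the deprecated one, then give up.
inline PrintfSetup choose_printf_setup(bool has_layer_settings, bool has_validation_features)
{
    if (has_layer_settings)
    {
        return PrintfSetup::LayerSettings;
    }
    if (has_validation_features)
    {
        return PrintfSetup::ValidationFeatures;
    }
    return PrintfSetup::Unavailable;
}

// Human-readable explanation for logging or for a message shown before exiting.
inline std::string describe(PrintfSetup setup)
{
    switch (setup)
    {
        case PrintfSetup::LayerSettings:
            return "Enabling debugPrintfEXT through VK_EXT_layer_settings.";
        case PrintfSetup::ValidationFeatures:
            return "Enabling debugPrintfEXT through VK_EXT_validation_features (deprecated fallback).";
        case PrintfSetup::Unavailable:
            break;
    }
    return "Neither VK_EXT_layer_settings nor VK_EXT_validation_features is available; "
           "check that the validation layer is installed or try the Vulkan Configurator.";
}
```

Whichever path is chosen, the fallback only makes sense if `VK_EXT_validation_features` is actually requested at instance creation before `VkValidationFeaturesEXT` is chained in.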

@SaschaWillems
Collaborator

Welp, still sub 1 fps for me with latest SDK and vkconfig NOT running.

Just let me know when it's in a state where I should test.

If we can't get this to work, we may simply go back to the initial version and maybe remove the debug output and tell people to attach a graphics debugger.

@SRSaunders
Contributor Author

Thanks @asuessenbach for the info re nVidia GPUs. I have an AMD card and I guess this is the difference here.

@SaschaWillems would you please test using this PR with vkconfig running and let me know the result? I presume you are using an nVidia GPU - please confirm.

If this works, and as @asuessenbach suggests, I will try to detect this condition and offer a message to nVidia users.

@SaschaWillems
Collaborator

If this works, and as @asuessenbach suggests, I will try to detect this condition and offer a message to nVidia users.

If we get to a point where we have to show a message under certain conditions to users of a certain vendor we're not heading where I'd like our samples to head. I'd rather remove the output debug stuff then.

@SRSaunders
Contributor Author

@SaschaWillems I understand. However I’d still like to track this down if possible and you testing on Nvidia with vkconfig active would give more information. I can’t do this test myself. Thx.

@SaschaWillems
Collaborator

SaschaWillems commented Oct 24, 2024

Windows 11 23H2, nvidia RTX 4070, latest Vulkan developer driver, SDK 1.3.296.

And I get <1 fps even with vkconfig up and running:

[Screenshot]

I'm pretty sure that the sample ran fine when I initially wrote it, but not sure why it no longer does.

Can't rule out a configuration issue on my side 100%, but not sure where to start looking.

@SRSaunders
Contributor Author

I just added a minor hygiene change to use vk::ExtensionProperties vs. VkExtensionProperties in HPPInstance(). Also updated some comments and decided to explicitly request required GPU features for debugPrintfEXT as per docs.

More importantly, I was able to find an nVidia GPU to test this. I have narrowed down what causes the slowdown and am now convinced it is a VVL debugPrintfEXT defect on that GPU platform. Simply by disabling the following debugPrintfEXT feature enablement lines I can restore FPS performance on nVidia machines for both vkconfig running and not running cases. Unfortunately this drops the debug info, but hopefully this is a temporary thing until this issue can be addressed.

...
	//add_layer_setting(layerSetting);
...
	instance_create_info.pNext = nullptr; //&validation_features;
...

I will respond on the other thread to @spencer-lunarg to see if he can help.

@spencer-lunarg
Contributor

@SRSaunders before, we had the slowdown on Vulkan 1.1, while 1.2/1.3 were good... is that still the case, or is it now for all versions?

@SRSaunders
Contributor Author

SRSaunders commented Oct 25, 2024

When using an nVidia GPU with SDK 1.3.296, it slows down for all API versions. When using SDK 1.3.290 with the same setup (nVidia GPU), the sample works properly when using API 1.2 - as expected per previous discussion.

For AMD GPUs (and Apple Silicon on macOS) with SDK 1.3.296 everything works properly when using API 1.1

@spencer-lunarg
Contributor

ok, so the problem has been isolated down to an NVIDIA GPU (I was testing on Intel and found no issues)... Later tonight I will be back at my desk and can try again on my NVIDIA machine

@spencer-lunarg
Contributor

@SRSaunders I saw

It appears that you don't need to actually enable the VK_EXT_layer_settings extension in order to use it

So DebugPrintf came out in 2020 and Layer Settings came out in 2023, so it was designed to be used the "old" way; in 2023 we made it possible to use it with Layer Settings (which is what vkconfig uses as well)

In 2024 we have been slowly pouring more effort into GPU-AV, and DebugPrintf was there, so we merged it in, added features, and fixed bugs, so a lot has touched it recently. I tried to update the VVL docs for it https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/docs/debug_printf.md

It is not in the 1.3.296 SDK (didn't want to rush it), but the next SDK will let you set VK_LAYER_PRINTF_ONLY_PRESET=1, for people who "just want to turn it on now" and not fuss with settings

@SRSaunders
Contributor Author

SRSaunders commented Oct 26, 2024

I tried to update the VVL docs for it https://github.com/KhronosGroup/Vulkan-ValidationLayers/blob/main/docs/debug_printf.md

@spencer-lunarg thanks for updating the docs. This is the same doc I referred back to today when digging into the issue we are facing here.

... but the next SDK will let you set VK_LAYER_PRINTF_ONLY_PRESET=1, for people who "just want to turn it on now" and not fuss with settings

Good to know there will be an even simpler way to enable debugPrintfEXT going forward.

While we are on the topic of VK_EXT_layer_settings, I would like to ask about the different behaviours of this extension cross-platform. For instance, on Windows and Linux, it seems this extension is not advertised during enumeration unless vkconfig is running. And in addition, if you try to enable it on Windows (even with vkconfig running), it will bail out with an extension not found error during vkCreateInstance() - kind of unexpected. This is in contrast to what happens on macOS where the extension is available at all times (not just when vkconfig is running), and can be enabled during instance creation without bailing out. Perhaps I am off base or don't understand it properly, but I thought that VK_EXT_layer_settings was meant to allow setting of layer features under program control during instance creation, and should not depend on whether vkconfig is running or not. Would you help clarify this for me?

@spencer-lunarg
Contributor

@SRSaunders

but I thought that VK_EXT_layer_settings was meant to allow setting of layer features under program control during instance creation, and should not depend on whether vkconfig is running or not

That is correct... something is wrong here (and of course it involves 3 different things) ... do me a favor, for this thing, create a new VVL issue and I can work with Christope (who maintains both VkConfig and wrote the VK_EXT_layer_settings extension) to figure out what is going on.

The layer has to define its settings in a way that can be used; for example, you can set printf_to_stdout by setting VK_EXT_layer_settings::pSettingName to "printf_to_stdout". We (VVL) also use the layer setting code in VUL so that it will automatically allow people to set VK_LAYER_PRINTF_TO_STDOUT=1 to do the same thing
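The one pairing given above (printf_to_stdout and VK_LAYER_PRINTF_TO_STDOUT) suggests a simple naming pattern: prefix the setting name with `VK_LAYER_` and upper-case it. The sketch below encodes only that pattern. It is an assumption generalised from a single example, the function name `layer_setting_env_name` is hypothetical, and it is not the actual VUL implementation.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: derives an environment variable name from a layer
// setting name by prefixing "VK_LAYER_" and upper-casing every character.
// Only the printf_to_stdout pairing is confirmed by the discussion above.
inline std::string layer_setting_env_name(const std::string &setting_name)
{
    std::string env = "VK_LAYER_";
    for (const unsigned char c : setting_name)
    {
        env.push_back(static_cast<char>(std::toupper(c)));
    }
    return env;
}
```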

What gets ugly is that we have legacy ways to do settings sadly...until the commit right after we branched for 1.3.296, DebugPrintf and GPU-AV both did Shader Instrumentation, but would step over each other, so we needed a way to prevent both used at the same time. This was done with this very ugly GPU_BASED_DEBUG_PRINTF enum thing and honestly I want no one to waste time thinking about it, the next SDK will have a simple printf_enable setting to turn off/on DebugPrintf

@spencer-lunarg
Contributor

Also want to clarify that enables=VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT is correct what you have, it is the "original" way of setting things, seems the issue

So tested, I have a laptop with AMD Radeon 780M integrated and NVIDIA 4060 (using env variables to turn on, not vkconfig, shouldn't really matter in theory)

  • Samples ToT | AMD - 1.3.290 SDK - slow (1 FPS)
  • Samples ToT | AMD - 1.3.296 SDK - normal
  • Samples ToT | AMD - ToT VVL - normal
  • Samples PR 1187 | AMD - 1.3.290 SDK - slow (1 FPS)
  • Samples PR 1187 | AMD - 1.3.296 SDK - normal
  • Samples PR 1187 | AMD - ToT VVL - normal

With NVIDIA

  • Samples ToT | NVIDIA - 1.3.290 SDK - slow (1 FPS)
  • Samples ToT | NVIDIA - 1.3.296 SDK - slow
  • Samples ToT | NVIDIA - ToT VVL - slow
  • Samples PR 1187 | NVIDIA - 1.3.290 SDK - slow (1 FPS)
  • Samples PR 1187 | NVIDIA - 1.3.296 SDK - slow
  • Samples PR 1187 | NVIDIA - ToT VVL - slow

so basically this PR is not fixing/breaking things from my end... I agree this seems like something on the VVL side (we reopened the issue and I will look more into this weekend) but everything in this PR seems sane ... it should not take "hacks" to get DebugPrintf to work, if it does, we failed in VVL badly

... from a quick glance this seems to be a sync issue within how GPU-AV is working (we had to start using timeline semaphores under the hood to get things to be faster in GPU-AV, before we did a big vkQueueWaitIdle, ... but we share this code with Sync Validation, which has also been active and I feel we regressed here)

@SRSaunders
Contributor Author

@spencer-lunarg I have raised issue KhronosGroup/Vulkan-ValidationLayers#8760 as requested.

so basically this PR is not fixing/breaking things from my end... I agree this seems like something on the VVL side (we reopened the issue and I will look more into this weekend) but everything in this PR seems sane

Thanks for reviewing and good to know this PR appears correct from your side. I look forward to any discoveries you make regarding slowdowns we are seeing with nVidia GPUs.

…gPrintfEXT

(cherry picked from commit 3365c7d974ae1cb7222cf35fdbe82accfa3fd926)
@SRSaunders
Contributor Author

SRSaunders commented Oct 31, 2024

Following interaction with the VVL team, I think this is now ready for review. A couple of learnings:

  1. The VK_EXT_layer_settings extension is trickier than I first realized. On Windows and Linux it is primarily a layer instance extension and not a driver extension. On macOS things are a bit different where VK_EXT_layer_settings is made visible by the MoltenVK driver as well. However, to detect its availability in general you need to query the layer and not the driver during instance enumeration. Thanks to @spencer-lunarg + team and armed with this new (to me) information, I have modified the sample to now do proper detection. This results in all platforms using the layer settings path independent of whether vkconfig is running or not. The legacy VkValidationFeatureEnableEXT path is still present in the code, but will be used only with older SDKs that don't have VK_EXT_layer_settings within the VVL layer.
  2. The above findings also made me realize that the general framework for adding a layer setting (in [HPP]Instance()) was not using the correct criteria to chain in the VkLayerSettingsCreateInfoEXT struct during instance creation. Testing for the presence of VK_EXT_layer_settings in the driver will not give the right answer. And looking for which layer to check would only add a bunch of unnecessary complexity to the code. So I simplified things and now leave it up to the sample to determine if layer settings is supported, and if so, to push a layer setting using add_layer_setting(). In [HPP]Instance() I now only check for the presence of layer setting entries in the required_layer_settings vector. This puts the onus on the sample to make sure layer settings entries are supported. However, this is not very risky since vkCreateInstance() will throw away any layer settings that don't match and should not complain.
  3. Lastly, the observed performance slow-downs on various platforms and SDK versions have these mitigations:
    a) The slow-downs that @spencer-lunarg observed with older SDKs in shader_debugprintf: support new VVL-DEBUG-PRINTF message and fix VVL version check for API selection #1187 (comment) have been solved. While I was picking the correct API version 1.2 for older SDKs, I had added code that explicitly requested support for timeline semaphores. It turns out that while the VVL debugPrintfEXT feature requires this and implicitly enables it under the covers, explicitly requesting support in the sample breaks performance. I am not sure why this is the case, but I will leave it to @spencer-lunarg to decide if this is an issue or not. Nonetheless, I have removed this from the sample and older SDKs now work without performance degradation.
    b) The slow-down that was visible for nVidia GPUs running SDK 1.3.296 is apparently solved by gpu: Skip present submission Vulkan-ValidationLayers#8766, which will be available with the next SDK. I have not been able to test this yet, but will report back once I can. UPDATED: fix verified as working on my nVidia GPU machine. However, the only workaround for now is to use SDK 1.3.290 for nVidia GPU users until the next SDK is released.
    c) By fixing the logic to test for the VVL version (and not the instance version) when selecting API 1.1 vs. 1.2, this PR does solve performance issues observed with SDK 1.3.296 for Windows AMD, Linux AMD, and macOS. I'm not sure about Linux nVidia as I cannot test that combination.
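The first learning above (ask the layer, not the driver, whether it provides VK_EXT_layer_settings) can be sketched without pulling in the Vulkan headers by treating the enumeration call as a callable. In real code the callable would wrap `vkEnumerateInstanceExtensionProperties`, whose first parameter is the layer name (null queries the implementation and implicit layers). The alias `ExtensionEnumerator` and the function `layer_provides_extension` are hypothetical names used only for this illustration.

```cpp
#include <functional>
#include <string>
#include <vector>

// Stand-in for vkEnumerateInstanceExtensionProperties: given a layer name
// (empty string meaning "no layer, query the driver"), returns the names of
// the instance extensions reported for it.
using ExtensionEnumerator = std::function<std::vector<std::string>(const std::string &layer_name)>;

// Hypothetical helper: true when the named layer itself reports the extension.
inline bool layer_provides_extension(const ExtensionEnumerator &enumerate,
                                     const std::string         &layer_name,
                                     const std::string         &extension_name)
{
    for (const std::string &name : enumerate(layer_name))
    {
        if (name == extension_name)
        {
            return true;
        }
    }
    return false;
}
```

A sample would call this with `"VK_LAYER_KHRONOS_validation"` and `"VK_EXT_layer_settings"`, and push its layer settings only when the result is true.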

@spencer-lunarg
Contributor

It turns out that while the VVL debugPrintfEXT feature requires this and implicitly enables it under the covers, explicitly requesting support in the sample breaks performance.

Is this still true? I tried to add it explicitly in this PR and didn't see it slow down (the issue was the old pre-1.3.290 SDK and should be patched now)

@SRSaunders
Contributor Author

It turns out that while the VVL debugPrintfEXT feature requires this and implicitly enables it under the covers, explicitly requesting support in the sample breaks performance.

Is this still true? I tried to add it explicitly in this PR and didn't see it slow down (the issue was the old pre-1.3.290 SDK and should be patched now)

I observed this performance impact when using SDK 1.3.290 with API 1.2 and timeline semaphores explicitly enabled. With SDK 1.3.296 using API 1.1 with timeline semaphores enabled, performance was fine. So I removed the explicit enablement of timeline semaphores and now I get consistent performance using: a) API 1.2 with SDKs <= 1.3.290, and b) API 1.1 with SDKs >= 1.3.296 (aside from the nVidia issue mentioned above).
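The selection rule described here (API 1.2 for SDKs up to 1.3.290, API 1.1 from 1.3.296 onwards) might look roughly like the sketch below. It assumes the validation layer version is available as a packed integer, for example from the layer's `VkLayerProperties`. The thread does not say where the cut-off lies between 290 and 296, so placing it at 296 is an assumption, and all names are hypothetical.

```cpp
#include <cstdint>

// Packs a version using the same bit layout as VK_MAKE_API_VERSION with
// variant 0: major in bits 22+, minor in bits 12-21, patch in bits 0-11.
constexpr std::uint32_t make_version(std::uint32_t major, std::uint32_t minor, std::uint32_t patch)
{
    return (major << 22U) | (minor << 12U) | patch;
}

constexpr std::uint32_t kApi11 = make_version(1, 1, 0);
constexpr std::uint32_t kApi12 = make_version(1, 2, 0);

// Assumed cut-off: the first validation layer version that performs well
// with API 1.1. Versions 291 to 295 are not covered by the discussion.
constexpr std::uint32_t kFirstVvlUsingApi11 = make_version(1, 3, 296);

// Hypothetical helper: picks the API version to request for the sample,
// based on the validation layer version rather than the instance version.
constexpr std::uint32_t select_api_version(std::uint32_t vvl_version)
{
    return vvl_version >= kFirstVvlUsingApi11 ? kApi11 : kApi12;
}
```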

@SRSaunders
Contributor Author

Given the long discussion and in case it wasn't clear, this PR is now ready to go.

Note for nVidia users (Windows) using SDK 1.3.296: A fix is also required from the VVL which will ship in the next SDK. In the interim, nVidia users should run using this PR combined with SDK 1.3.290 for the shader_debugprintf sample.

Successfully merging this pull request may close these issues.

shader_debugprintf problems with new VulkanSDK 1.3.296
4 participants