[BUG]: Driver timeouts with RDNA3 GPUs on Windows + Vulkan #10720

Closed
Wolver32 opened this issue Jan 25, 2024 · 25 comments · Fixed by #11223

Comments

@Wolver32

Wolver32 commented Jan 25, 2024

Describe the Bug

I reported this issue on Discord one or two weeks ago, but I decided to post it here as well since I've seen other people having similar issues. At the time I didn't provide a GS Dump, but now I am doing so. The dump was recorded in VCS while playing in Software mode; when switching to Vulkan, the driver timeout occurs between frames 480 and 500, usually around 490. Here's the link: https://drive.usercontent.google.com/download?id=19zbfSnzsAGUvRRz2T5x2beABn7KF1mxA&export=download
(check the comment below in case this link doesn't work anymore)

From what I understand, this seems to be happening only on RDNA3 cards. If you want me to test other stuff with these 2 games (GTA LCS and VCS) and/or post more dumps, feel free to ask.

Reproduction Steps

In VCS I can always reproduce the issue when going over the wooden bridge near the military base; in LCS I haven't found a spot where I can trigger it consistently. I haven't done extensive testing with other APIs, but in around 30 minutes of continuous playtime in VCS with DX12 I didn't see a single crash.

I also did a few tests on Fedora Silverblue with one of the latest nightly AppImage builds. RADV never crashed, so I would mark this as a Windows-only issue.

Expected Behavior

No response

PCSX2 Revision

v1.7.5509

Operating System

Windows 11

If Linux - Specify Distro

No response

CPU

Intel Core i5-12600KF

GPU

AMD Radeon RX 7800 XT

GS Settings

  • Aspect Ratio = 16:9
  • FMV Aspect Ratio = 16:9
  • Internal Resolution = 8x
  • Anisotropic Filtering = 16x
    (Happens even at default settings)

Emulation Settings

No response

GS Window Screenshots

No response

Logs & Dumps

No response

@refractionpcsx2
Member

Okay, here's a local copy of the dump. I recompressed it to xz (down from over 600 MB to about 200 MB), then compressed it into zip volumes.

Because GitHub sucks and doesn't let you upload anything without a known extension, I had to add .zip to the end of all the files, so you will need to remove that before extracting.

This is simply so that if the Google Drive link dies for any reason, we have a copy on GitHub too. I wouldn't recommend putting in the effort unless you need to :P

Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.001.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.002.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.003.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.004.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.005.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.006.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.007.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.008.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.009.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.010.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.011.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.012.zip
Grand Theft Auto - Vice City Stories_SLUS-21590_20240125192703.gs.zip.013.zip
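
For anyone who does need the local copy, removing the extra .zip by hand from 13 files is tedious. Here is a minimal Python sketch that strips only the final .zip from each downloaded volume. The function name and the assumption that all volumes sit in one folder are mine, not something stated in this thread.

```python
from pathlib import Path


def strip_extra_zip_suffix(folder: Path) -> list[Path]:
    """Rename 'name.gs.zip.001.zip' to 'name.gs.zip.001' for every volume in folder."""
    renamed = []
    # Match only files that end in '.zip.NNN.zip' so unrelated archives are left alone.
    for path in sorted(folder.glob("*.zip.[0-9][0-9][0-9].zip")):
        target = path.with_suffix("")  # drops only the last suffix ('.zip')
        path.rename(target)
        renamed.append(target)
    return renamed


if __name__ == "__main__":
    # Assumes the script is run from the folder holding the downloaded volumes.
    for new_path in strip_extra_zip_suffix(Path(".")):
        print(new_path.name)
```

After renaming, the .001 file should be recognisable as the first part of a split archive by a tool that supports multi-volume archives (7-Zip, for example).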

@stenzek
Contributor

stenzek commented Jan 26, 2024

I can't reproduce any GPU crash on my NV GPU, nor are any validation errors showing. So it's either an AMD driver issue, or something the validation layers aren't checking for.

Can't really do much about it, or narrow it down to the exact draw/shader, as I don't have one of these GPUs.

@Wolver32
Author

It'd be nice if someone with a non-RDNA3 AMD GPU could do a test. I saw a few people on the Discord server discussing RDNA3-specific crashes, so I assumed that other AMD GPUs aren't affected. With my old RTX 3060 I never experienced any crashes.

I own 2 other games (Gran Turismo 3 and 4), those don't seem to cause driver timeouts when running under Vulkan... But I should probably test them more to confirm that they're not affected.

I am no expert and I really have no idea what could cause these timeouts, but if the Mesa driver works just fine on Linux, I think there's probably something weird going on with the AMD Windows driver.

@Wolver32 Wolver32 changed the title [BUG]: GTA LCS and VCS trigger driver timeouts with RDNA3 GPUs on Windows + Vulkan [BUG]: Driver timeouts with RDNA3 GPUs on Windows + Vulkan Jan 31, 2024
@Wolver32
Author

Wolver32 commented Jan 31, 2024

Today I had a bit of free time and so I did a few more tests...

Update 1: I played Gran Turismo 3 for around 50 minutes, Windows + Vulkan, 8x resolution, 16x AF and Basic Blending Accuracy. I had zero issues while playing, everything looked normal. I then decided to crank Blending Accuracy all the way up to Maximum, the 7800 XT should have the horsepower to handle it. Result: driver timeouts after around 5-10 minutes of play. While I can't reproduce it consistently (like I managed to do in VCS), I changed the title of this bug report and made it more generic as I'm sure there are other games that trigger driver timeouts in particular conditions. I tried messing around with Blending Accuracy in VCS, but even with Minimum the crash occurs at the exact same spot.

Update 2: I decided to try older PCSX2 builds with VCS. The first one I chose was v1.7.3722-64bit-AVX2-Qt from December 16th 2022, with Vulkan, 8x resolution, 16x AF and Blending Accuracy set to Maximum. Result: the crash didn't occur on the usual bridge, but after around 15 minutes in the ferris wheel area. I then picked an even older build, v1.7.2264-64bit-AVX2 from January 23rd 2022 (one of the earliest Vulkan implementations), and ran VCS again with the previous settings... After around 30 minutes of gameplay I didn't have a single crash! Of course I can't be 100% sure that this build will never trigger driver timeouts, but it definitely seems more stable than current ones. Maybe some newer Vulkan extension that doesn't play too well with the AMD Windows driver?

This weekend I should have time to test more stuff.

@Wolver32
Author

Wolver32 commented Feb 4, 2024

Here I am with another report, this will be a long one with lots of numbers... I decided to test whether resolution and/or blending accuracy have an impact on driver timeouts (spoiler: yes). To do this I used the GS Dump that I posted here, and each time I took note of the frame where the driver crashed. I started by checking all resolutions at default settings (including Blending Accuracy, which is set to Basic), with 3 passes for each resolution:

  • Native (PS2): 118 118 118
  • 1.25x Native: 118 118 118
  • 1.5x Native: 118 118 118
  • 1.75x Native: 118 122 118
  • 2x Native: 118 120 128
  • 2.25x Native: 189 180 188
  • 2.5x Native: 248 248 248
  • 2.75x Native: 324 324 324
  • 3x Native: 378 378 378
  • 3.5x Native: 388 388 388
  • 4x Native: 390 400 402
  • 5x Native: 430 426 424
  • 6x Native: 440 440 440
  • 7x Native: 444 444 444
  • 8x Native: 490 490 490
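
To make the trend in these numbers easier to check, here is a small Python sketch that takes the median of the three passes per resolution and tests whether the crash frame ever moves earlier as the scale goes up. The data is copied from the list above; the helper names are only illustrative.

```python
from statistics import median

# Crash frames copied from the list above: resolution scale -> three passes
# (Vulkan, Basic Blending Accuracy).
CRASH_FRAMES = {
    1.0: (118, 118, 118),
    1.25: (118, 118, 118),
    1.5: (118, 118, 118),
    1.75: (118, 122, 118),
    2.0: (118, 120, 128),
    2.25: (189, 180, 188),
    2.5: (248, 248, 248),
    2.75: (324, 324, 324),
    3.0: (378, 378, 378),
    3.5: (388, 388, 388),
    4.0: (390, 400, 402),
    5.0: (430, 426, 424),
    6.0: (440, 440, 440),
    7.0: (444, 444, 444),
    8.0: (490, 490, 490),
}


def median_by_scale(data):
    """Return {scale: median crash frame}, ordered by increasing scale."""
    return {scale: median(frames) for scale, frames in sorted(data.items())}


def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    values = list(values)
    return all(a <= b for a, b in zip(values, values[1:]))


medians = median_by_scale(CRASH_FRAMES)
for scale, frame in medians.items():
    print(f"{scale:>5}x  median crash frame {frame}")
print("crash frame never moves earlier:", is_non_decreasing(medians.values()))
```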

image

With Blending Accuracy set to its default setting (Basic), it's clear that a driver timeout is triggered later at higher internal resolutions. I then decided to take 3 resolutions to test the other Blending Accuracy settings: Native (which gave the worst results), 3x (~1080p, seems to be a good middle ground) and 8x (best results). Let's see the results for the other Blending Accuracy levels...

Minimum Blending Accuracy

  • Native (PS2): 118 118 118
  • 3x Native: 378 378 330
  • 8x Native: 492 490 490

Medium Blending Accuracy

  • Native (PS2): 118 118 118
  • 3x Native: 378 330 330
  • 8x Native: 490 492 490

High Blending Accuracy

  • Native (PS2): 118 118 118
  • 3x Native: 378 378 330
  • 8x Native: 492 490 492

Full Blending Accuracy

  • Native (PS2): 118 118 118
  • 3x Native: 378 378 330
  • 8x Native: 490 490 490

There's really not much to say here, the results seem to be in line with Basic. There's however one more level that I haven't mentioned yet, and it gave weird results...

Maximum Blending Accuracy

  • Native (PS2): PASSED PASSED PASSED
  • 3x Native: 118 118 118
  • 8x Native: 2 CRASH 2
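
For a side-by-side view, the per-level results can be compared against the Basic baseline with a short Python sketch like the one below. The numbers are copied from this comment; treating PASSED and CRASH as non-numeric outcomes that are skipped when taking the median is my own simplification.

```python
from statistics import median

# Crash frames copied from the lists in this comment (three passes each).
# Non-numeric outcomes are kept as the strings used in the report.
LEVELS = {
    "Basic":   {"native": (118, 118, 118), "3x": (378, 378, 378), "8x": (490, 490, 490)},
    "Minimum": {"native": (118, 118, 118), "3x": (378, 378, 330), "8x": (492, 490, 490)},
    "Medium":  {"native": (118, 118, 118), "3x": (378, 330, 330), "8x": (490, 492, 490)},
    "High":    {"native": (118, 118, 118), "3x": (378, 378, 330), "8x": (492, 490, 492)},
    "Full":    {"native": (118, 118, 118), "3x": (378, 378, 330), "8x": (490, 490, 490)},
    "Maximum": {"native": ("PASSED",) * 3, "3x": (118, 118, 118), "8x": (2, "CRASH", 2)},
}


def median_shift(level_name):
    """Median crash frame minus the Basic median, per resolution.

    Returns None for a resolution where no pass produced a numeric frame.
    """
    baseline = LEVELS["Basic"]
    shifts = {}
    for res, runs in LEVELS[level_name].items():
        frames = [r for r in runs if isinstance(r, int)]
        shifts[res] = median(frames) - median(baseline[res]) if frames else None
    return shifts


for name in LEVELS:
    print(name, median_shift(name))
```

A shift of 0 means the level crashes on the same frame as Basic; a negative value means it crashes earlier.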

Maximum Blending Accuracy completely changes everything: now Native resolution gives by far the best results by completing the whole GS Dump, while 8x either doesn't load at all or is able to show only 2 frames.
Considering that this is the first time I've seen my GS Dump replayed in its entirety on Windows + Vulkan, I went in game for a quick test... I managed to play for around 3-4 minutes and then the screen went black, but after around 10 seconds it came back to life and emulation surprisingly continued. This happened 2-3 times, but eventually a driver timeout was triggered. So the combo Native + Maximum Blending Accuracy seems the most "stable", but it will eventually crash.
Feel free to ask me to test particular settings, I can't really interpret these numbers.

EXTRA
Probably irrelevant, but around 30% of the time that a driver timeout is triggered, an assertion error window appears. The error message points to line 508 in GS/Renderers/Common/GSRenderer.cpp

@stenzek
Contributor

stenzek commented Apr 23, 2024

Could you please re-test on the latest release? Apparently the feedback loop extension was causing GPU crashes on RDNA3, which I've now dropped, so with any luck this should be resolved, if it's the same issue.

@Wolver32
Author

Wolver32 commented Apr 23, 2024

I don't have much time today, but I was still able to do a quick test with GTA VCS. I used 8x resolution scale, 16x Anisotropic Filtering and Maximum Blending Accuracy; these settings used to cause driver timeouts instantly.
With the latest release my old GS Dump passes and even during gameplay I was able to go past that bridge. I went around the map for around half an hour and the game behaved and performed just as I expected, but my session was unfortunately ended by another driver timeout:
image

It happened seemingly at random as I was just trying to land a helicopter, yesterday when I was testing another build posted on the Discord server I experienced it after the "Welcome to Vice City" loading screen (when going to the other island). Still, there's clearly a major improvement.

@stenzek
Contributor

stenzek commented Apr 23, 2024

Oh well. Will have to wait until someone with an RDNA3 GPU looks into it then. I'm not going to rush out and buy one just for PCSX2 :P

@refractionpcsx2
Member

refractionpcsx2 commented May 2, 2024

One thing you could try is to pick a GS Dump that crashes, then enable the GS dumping options (you might need to enable advanced settings from the Tools menu, then you'll find the "Debug" section in the settings). Set the start draw to 0, set the number of draws to something like 10000, set a folder to dump to, and tick "Dump GS Draws". Then open the GS dump and wait for it to crash (don't forget to disable this again when you next reload).

Once that happens, go to the folder you told it to dump to, find the first file near the bottom which doesn't start with vsync, and tell us its name; it should be something like 00013_vertex.txt.

Also tell us which dump you used.

If you can do this run a couple of times on the same GS Dump and make sure it's the same number every time, that would be very helpful!
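
If scrolling through thousands of dumped files gets tedious, the lookup described above can be scripted. This is only a sketch: it assumes the draw files have zero-padded numeric prefixes (like 00013_vertex.txt), so that sorting by name matches draw order, and that the per-frame files all start with "vsync". The function name is made up for illustration.

```python
from pathlib import Path
from typing import Optional


def last_draw_file(dump_dir: Path) -> Optional[str]:
    """Return the name-sorted last file in dump_dir that does not start with 'vsync'."""
    names = sorted(
        p.name
        for p in dump_dir.iterdir()
        if p.is_file() and not p.name.startswith("vsync")
    )
    return names[-1] if names else None


if __name__ == "__main__":
    import sys

    # Usage: python last_draw.py <dump folder>
    print(last_draw_file(Path(sys.argv[1])))
```

Running it after each crash (and clearing the folder between runs) gives one file name per run to compare.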

@Wolver32
Author

Wolver32 commented May 2, 2024

Recent versions now always pass with my old GS Dump that I posted here, and during gameplay crashes seem random, so I'm not able to trigger a driver timeout reliably. I therefore used the dump on build v1.7.5720 (20th April 2024), which is one of the last builds that still shows the resolution- and blending-accuracy-dependent behaviour with my old GS Dump.

With your suggested settings, I performed 10 tests, 5 with default blending accuracy and the other 5 with maximum blending accuracy. The other settings are Vulkan + 8x Resolution Scale + 16x AF.

Default Blending Accuracy: the last non-vsync file across all 5 tests was 10001_vertex.txt which, without knowing anything, seems rather uninteresting given that number of draws was set to 10000 (and those files start from 1 and not 0).

Maximum Blending Accuracy: the last non-vsync file across all 5 tests was 04755_vertex.txt. I'll upload here the vertex.txt and context.txt files from the last run:
04755_context.txt
04755_vertex.txt

@refractionpcsx2
Member

So Maximum died? confused why it's a lower number :D

@Wolver32
Author

Wolver32 commented May 2, 2024

Maximum + 8x Res dies almost immediately (around frame 2), while Default + 8x Res dies around frame 490.
I guess that's why I get a lower number with maximum blending accuracy.

@refractionpcsx2
Member

okay, when you did each run, did you delete the old files after noting the last one? Just wanna make sure we're looking at the right draw

@Wolver32
Author

Wolver32 commented May 2, 2024

Yep

@refractionpcsx2
Member

and just to clarify, it always died on the same one?

@Wolver32
Author

Wolver32 commented May 2, 2024

Yes, all 5 tests (with max blending accuracy) ended with that exact file.

@refractionpcsx2
Member

perfect, thanks! :)

@stenzek
Contributor

stenzek commented May 3, 2024

I still have no idea what the actual issue here is, but given the feedback loop extension was apparently problematic, I wonder if using local reads, which formalize programmable blending, will help.

Try #11179. You'll need the latest driver, as of about a week ago (AMD was late adding the extension, NV's had it for months).
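
Before testing, it may be worth confirming that the installed driver actually exposes the extension. The sketch below checks a saved extension listing (for example, the text output of the `vulkaninfo` tool) for an exact extension name. The name used in the example, VK_KHR_dynamic_rendering_local_read, is my assumption about which extension "local reads" refers to; it is not stated in this thread.

```python
import re


def exposes_extension(listing: str, name: str) -> bool:
    """True if `name` appears in `listing` as a whole identifier.

    The lookarounds stop a longer extension name that merely contains
    `name` as a prefix or suffix from counting as a match.
    """
    pattern = rf"(?<![A-Za-z0-9_]){re.escape(name)}(?![A-Za-z0-9_])"
    return re.search(pattern, listing) is not None


if __name__ == "__main__":
    import sys

    # Usage: python check_ext.py vulkaninfo_output.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        text = f.read()
    # Assumed extension name, see the note above.
    print(exposes_extension(text, "VK_KHR_dynamic_rendering_local_read"))
```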

@Wolver32
Author

Wolver32 commented May 3, 2024

Updated the driver and tried your build, unfortunately it's a regression compared to recent nightly builds.
8x Resolution + Maximum Blending Accuracy now triggers a driver timeout around frame 660, which is a lot better than build v1.7.5720 (crashes at frame 2) but recent nightly builds don't have a problem with my GS Dump at these settings. Even in gameplay I got a crash after like 30 seconds.

@stenzek
Contributor

stenzek commented May 3, 2024

Oh well. Like I said above, it won't be fixed until someone with the knowledge and hardware looks into it. I'm out of random ideas to try, and there isn't any spec violation going on the current validation layers can detect.

@Wolver32
Author

Wolver32 commented May 3, 2024

Thank you for trying though, I can imagine the struggle when you don't have the hardware. Unfortunately my knowledge here is very limited, all I can do is help test stuff... And whenever anyone wants me to try things, feel free to ask :)
For me personally it doesn't change that much as I'll continue to use PCSX2 on Linux, RADV works flawlessly and I have yet to encounter an issue there.

@stenzek
Contributor

stenzek commented May 10, 2024

So like an idiot I went out and bought a GPU to debug this (and other uses too). It is completely random, and not a specific draw. But I think I may have found a workaround.

Give #11223 a shot. I can't trigger a crash in any of the games that I could before (Ratchet, VCS).

@Wolver32
Author

Oh wow, ok tomorrow I'll give it a shot!

@Wolver32
Author

I think you fixed it!
I had around 40+ minutes of uninterrupted gameplay in VCS, yesterday around 20-ish, no driver timeouts to report. I'll still try a bit more in the coming days (especially with other games), but the results I got this time are more than promising. Thank you for investing time and money into this :)

One unrelated and minor thing that I noticed while testing is the water rendering being off... I don't know if this is again an RDNA3-exclusive thing, but this is what I'm seeing on Windows:
VCS_Windows

While on Linux everything looks normal:
VCS_Linux

I initially noticed it while playing on #11223, but then I also tried to see if it's the same on the last nightly build and the answer is yes. Those screenshots were taken from build 1.7.5799, same settings used on both OSes. I decided to post it here for now as it could be another RDNA3 issue, but if needed I can open another bug report.

@stenzek
Contributor

stenzek commented May 12, 2024

Post a GS dump of it. Can't do much without one.
