
ResourcePool::allocate() may starve other threads, causing hitches or stutters with dlssg enabled #33

Open
Nukem9 opened this issue Mar 4, 2024 · 3 comments
Labels
need-info Need additional info

Comments

@Nukem9
Member

Nukem9 commented Mar 4, 2024

When ResourcePool::allocate() reaches a limit (e.g. the VRAM budget or maximum queue depth), it spins in a busy loop hoping another thread comes around and frees existing allocations via ResourcePool::recycle(). If that busy loop exceeds its time limit, it falls back to a brand new allocation instead. I'm assuming this loop doesn't execute under normal circumstances because Streamline doesn't spawn threads and games rarely parallelize slEvaluateFeature() calls.

However, once DLSS-G is enabled there are suddenly three threads competing with each other: a game (present) thread, a sl.pacer thread, and a sl.dlssg thread. The game and sl.pacer threads often contend on ResourcePool's mutex, leading to a problem where ::recycle() is unable to make progress after ::allocate() enters its busy loop. Streamline tries to mitigate this deadlock with the following code:

float resourcePoolWaitUs = bytesAvailable > footprint.totalBytes && allocated.second.size() < m_maxQueueSize ? 500.0f : 100000.0f;
// Use more precise timer
extra::AverageValueMeter meter;
meter.begin();
// Prevent deadlocks, time out after a reasonable wait period.
// See comments above about the wait time and VRAM consumption.
while (items.second.empty() && meter.getElapsedTimeUs() < resourcePoolWaitUs)
{
    lock.unlock();
    // Better than sleep for modern CPUs with hyper-threading
    YieldProcessor();
    lock.lock();
    meter.end();
}

There's an oversight on line 144: std::mutex does not guarantee fairness. YieldProcessor() is a single instruction and makes no real difference. Unlocking and relocking might wake other threads, but ::allocate() can reacquire the lock before anybody else gets a chance. This is often the case on my machine.

Based on my not-so-scientific testing, I usually hit that 100000us pause in games every 1-2 seconds, which results in a stuttery mess. This only occurs with vertical sync enabled through the Nvidia Control Panel; games are smooth with vertical sync off.

I annotated an Nsight trace while trying to understand what's happening. Possibly useful for someone.

@jake-nv
Collaborator

jake-nv commented Mar 5, 2024

Thanks for the detailed analysis and report. We're discussing this, but it will likely take a while to arrive at a fix everyone is happy with.

@kirillNVIDIA

Yeah - we need Sleep(1) instead of YieldProcessor() there. I think it should be a rare condition though. This happens only when there are no resources to recycle. Normally the presenting thread should release the resource by the time rendering thread needs it. To fix it properly - we need a detailed description of the repro case so we can repro the bad case, then fix it, and then verify that the issue is fixed.

@jake-nv jake-nv added the need-info Need additional info label Mar 26, 2024
@Nukem9
Member Author

Nukem9 commented Mar 28, 2024

> Yeah - we need Sleep(1) instead of YieldProcessor() there. I think it should be a rare condition though.

Agreed. Although I was kind of hoping you guys would use a condition variable instead.

> Normally the presenting thread should release the resource by the time rendering thread needs it. To fix it properly - we need a detailed description of the repro case so we can repro the bad case, then fix it, and then verify that the issue is fixed.

There's little information to add beyond what's posted above, and there's no easy repro. VRAM exhaustion is not a factor. What I do know is that I can reproduce minute stutters in a number of games (say, Cyberpunk 2077) in areas with light CPU load, probably because the kernel thread scheduler doesn't preempt the busy loop. I don't plan on root-causing it as there's no source code or symbols available for sl.dlss_g.dll.

No amount of fiddling with settings seems to change things, so I've accepted it as a consequence of my HW/OS (Windows Server 2022) configuration. I binary patched various Streamline DLLs and that's a good enough "fix" for me.

Given the rarity, I don't think it's worth spending time investigating.
