Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop poking AMDGPU SMU I2C #194

Open
K900 opened this issue Mar 23, 2021 · 15 comments
Open

Stop poking AMDGPU SMU I2C #194

K900 opened this issue Mar 23, 2021 · 15 comments

Comments

@K900
Copy link

K900 commented Mar 23, 2021

As of this writing, ddcutil will attempt to poke all i2c-N devices available on the system to detect monitors. However, on recent AMD cards (specifically Navi2x variants, i.e. RX 6000 series), even trying to read from the SMU I2C bus can get the GPU into an inconsistent state (see 1). This should eventually be resolved via a more correct I2C implementation on the kernel side (again, see 1), but it might be worth it to also add a filter on the ddcutil side to make sure the SMU bus is never touched - nothing good (for ddcutil) can ever come out of it anyway.

@rockowitz
Copy link
Owner

rockowitz commented Mar 23, 2021

Is this something you've actually experienced with ddcutil, or just a general heads up?

In normal operation, ddcutil prefilters the i2c devices using /sys. If the device has a class attribute, it must either be x03 (display) or x0a (docking station). If no class attribute is found, it checks the device name, possibly augmented by the driver name. In particular, if the name starts with "smu", the device is ignored. (Probing smu devices was found to hang the system with Linux running on a Mac G5.)

If you are seeing the problem on ddcutil, we can probe further. Normally I'd just ask for the output of ddcutil environment --very-verbose, but the environment command is bit more promiscuous in choosing which I2C devices to examine, so there's a chance that it could trigger the problem.

@K900
Copy link
Author

K900 commented Mar 23, 2021

It is, unfortunately, a very real issue. Output attached.
env.txt

The SMU bus reports class 0x03 (which is probably also wrong?), and is named "AMDGPU SMU", which is probably why it passes the "starts with smu" check.

@rockowitz
Copy link
Owner

I've added "AMDGPU SMU" to the list of names for devices that should be ignored, and pushed the change to branch 1.1.0-dev. Let me know if it works.

The problem is that the device class (e.g. x03) is tied to the device, not to individual /dev/i2c buses. (see for example line 2342 of the environment output.) This actually makes sense, since the entire device is a video adapter. So we're left with looking at the device name. Name prefixes, or driver/name combinations, are added as they are encountered, which is infrequent - "AMDGPU SMU" increases the list size from 7 to 8.

@K900
Copy link
Author

K900 commented Mar 23, 2021

Thanks! Bad news though, in order:

  1. I've just tried to compile it and I'm hitting a bunch of stringop-overflow and stringflow-truncation warnings. Build with CFLAGS="-Wno-string-overflow -Wno-stringop-truncation" helped for now, but it looks like those are actually worth fixing:
In function ‘assemble_sysfs_path2’,
    inlined from ‘rpt2_attr_binary’ at sysfs_util.c:299:4:
sysfs_util.c:244:7: error: ‘strncat’ specified bound 4096 equals destination size [-Werror=stringop-overflow=]
  244 |       strncat(buffer, "/", bufsz);
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~
sysfs_util.c:238:4: error: ‘strncpy’ specified bound 4096 equals destination size [-Werror=stringop-truncation]
  238 |    strncpy(buffer, fn_segment, bufsz);
      |    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sysfs_util.c:246:7: error: ‘strncat’ specified bound 4096 equals destination size [-Werror=stringop-truncation]
  246 |       strncat(buffer, segment, bufsz);
      |       ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sysfs_util.c:246:7: error: ‘strncat’ specified bound 4096 equals destination size [-Werror=stringop-overflow=]
  1. The ignorable_i2c_device_sysfs_name check is never hit, because we do have a class, and that class is considered "OK" here: https://github.com/rockowitz/ddcutil/blob/1.1.0-dev/src/util/sysfs_i2c_util.c#L191

@rockowitz rockowitz added the bug label Mar 26, 2021
@rockowitz
Copy link
Owner

The current 1.1.0-dev branch contains a fix to the check if a /dev/i2c device is ignorable. First the device name is checked. If the device name is not on the explicit ignorable list, then the class is checked.

It also (hopefully) addresses the stringop-overflow and stringop-truncation warnings. These were introduced in a point release of gcc 8. Modifying the code so that it compiles without warning no matter what release of gcc or clang is in use, and is still clean to read, turned out to be an ugly little problem. Even glib function g_strlcpy() gets flagged. See, for example, this long stackoverflow discussion. (I inserted "hopefully" because the full set of distribution releases and architecture variants have not yet been checked.)

@K900
Copy link
Author

K900 commented Mar 26, 2021

The latest version compiles cleanly and works perfectly, thanks! Any chance we can get this backported to the older stable releases as well?

@rockowitz
Copy link
Owner

What is the context in which backporting a fix to a prior release is important? The only situation in which I could see this mattering is if an application is using the shared library and can't rebuild to use the latest version. I'm not aware of any such application.

@K900
Copy link
Author

K900 commented Mar 27, 2021

Distribution policies, mostly. Debian/Ubuntu generally don't allow new feature releases into their stable branches, so 1.0.x to 1.0.x+1 is OK, but 1.0.x to 1.1.y won't be.

@ltuikov
Copy link

ltuikov commented Jul 10, 2021

Hi @rockowitz,

This bug is a kernel bug, specifically in the way the amdgpu driver communicated with the AMD SMU.
It has been fixed by the tip of:

https://gitlab.freedesktop.org/ltuikov/linux/-/commits/i2c-rework-luben

ddcutil can now freely scan AMDGPU SMU busses. Filtering out "AMDGPU SMU" should be reverted, especially when we merge the tip of my branch into mainline.

I'm using ddcutil-0.9.9 on the kernel above and it scans the AMDGPU SMU bus without a problem.

Regards,
Luben

@K900
Copy link
Author

K900 commented Jul 10, 2021

ddcutil doesn't really gain anything by scanning the SMU bus anyway, and people may be running it on older kernels. I'd just keep it on the blocklist.

@ltuikov
Copy link

ltuikov commented Jul 10, 2021

It should eventually be phased out.

It was unfortunate that this but was added to the kernel more than 2 years ago when some things were moved around.

@K900
Copy link
Author

K900 commented Jul 10, 2021

Maybe add a kernel version check?

@rockowitz
Copy link
Owner

In general, ddcutil avoids probing SMU devices if possible. Not only can this cause hard failures, but probing I2C buses is an expensive operation, to be avoided when we know a priori that a /dev/i2c device is not a monitor. There's no harm in leaving the check in. and avoids tying ddcutil to all but recent kernels.

@rockowitz
Copy link
Owner

@ltuikov Luben, I assume you're an appropriate person to ask the following questions. If not, can you pass it on or redirect me?

I'm continuing to see problem reports with Navi GPU cards, e.g. #223, #224. What's the minimum kernel release that contains the SMU changes? I'll tell people to ensure they're at that kernel level and, if so, to post a bug report on, I assume, https://gitlab.freedesktop.org/drm/amd/-/issues. What would you like to see in the bug report?

Regards,
Sanford

@ltuikov
Copy link

ltuikov commented Sep 29, 2021

@rockowitz Hi,
The issue is resolved by at least these two patches, as shown in Linus's master branch:

544dcd74b7093a drm/amd/pm: Fix a bug in semaphore double-lock
5810323ba69289 drm/amd/pm: Fix a bug communicating with the SMU (v5)

If you have at least these two commits, then the SMU will not get in an "inconsistent state", and you don't need filtering. It just works. You can use a previous version of ddcutil, presumably without filtering.

Regards,
Luben

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants