Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

process going to uninterruptible sleep #159

Closed
tdolega opened this issue Dec 16, 2020 · 7 comments
Closed

process going to uninterruptible sleep #159

tdolega opened this issue Dec 16, 2020 · 7 comments

Comments

@tdolega
Copy link

tdolega commented Dec 16, 2020

After I changed from Nvidia GTX1070 to AMD RX6800, ddcutil no longer works. It hangs with "D" state, so it can't be killed even with kill -9 and is using 100% of 1 core while locked indefinitely. Even turning off the system takes 3 minutes because system is having hard time killing ddcutil. I use Arch Linux with kernel 5.9.14 and also tried 5.10 from git. I can give all information that can be helpful with further investigation.

@tdolega tdolega changed the title process going to uninterruptible slep process going to uninterruptible sleep Dec 16, 2020
@rockowitz rockowitz added the bug label Dec 16, 2020
@rockowitz
Copy link
Owner

rockowitz commented Dec 16, 2020 via email

@tdolega
Copy link
Author

tdolega commented Dec 16, 2020

With Nvidia I was using proprietary driver. With AMD I'm using open source one.

I'm running only one command at the time, but it doesn't matter what command it will be.

ddcutil setvcp 10 100 --timeout-i2c-io --trace i2c --tid --ts
[   2798][  0.001](i2c_detect_buses              ) Starting.  i2c_buses = (nil)
[   2798][  0.004](i2c_check_bus                 ) Starting. busno=3, buf_info=0x55f4b3519000
[   2798][  0.004](i2c_open_bus                  ) Starting. busno=3, callopts=0x80
[   2798][  0.004](i2c_open_bus                  ) Done.     busno=3, Returning file descriptor: 3
[   2798][  0.004](i2c_get_parsed_edid_by_fd     ) Starting. fd=3
[   2798][  0.004](i2c_get_raw_edid_by_fd        ) Getting EDID. File descriptor = 3, filename=/dev/i2c-3
[   2798][  0.004](i2c_set_addr                  ) fd=3, addr=0x50, i2c_force_slave_addr_flag=false, callopts=CALLOPT_ERR_MSG
[   2798][  0.004](i2c_set_addr                  ) addr = 0x50. Returning OK(0): success
[   2798][  0.004](i2c_get_edid_bytes_directly   ) Getting EDID. File descriptor = 3, filename=/dev/i2c-3, read_bytewise=false
^C

dmesg after running this command is unreadable, it is getting spammed super fast with this messages (or numbers rather)

  +0,020998] data: 48 a2 00 00 00 00 b0 e4 0d 3a b1 8f ff ff b0 e4
[  +0,000002] data: 0d 3a b1 8f ff ff 00 00
[  +0,016000] data: 48 b8 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[  +0,000002] data: 00 00 00 00 00 00 00 00
[  +0,020996] data: 48 ce 00 00 00 00 00 00 00 00 e0 e4 0d 3a b1 8f
[  +0,000001] data: ff ff e0 e4 0d 3a b1 8f
[  +0,016001] data: 48 e4 ff ff a0 01 e8 8c ff ff ff ff 00 00 00 00
[  +0,000002] data: 00 00 00 00 00 e5 0d 3a
[  +0,015988] data: 48 fa b1 8f ff ff 00 e5 0d 3a b1 8f ff ff 60 df
[  +0,000001] data: 0d 3a b1 8f ff ff 30 ac
[  +0,021007] data: 49 10 0d 3a b1 8f ff ff 20 e5 0d 3a b1 8f ff ff
[  +0,000001] data: 20 e5 0d 3a b1 8f ff ff
[  +0,015998] data: 49 26 70 53 34 31 b1 8f ff ff 00 00 00 00 00 00
[  +0,000000] data: 00 00 00 00 00 00 00 00
[  +0,020999] data: 49 3c 00 00 00 00 00 00 00 00 00 00 01 00 00 00
[  +0,000001] data: 00 00 00 00 00 00 00 00
[  +0,017002] data: 49 52 00 00 00 00 40 8a 63 8c ff ff ff ff 00 00
[  +0,000000] data: 00 00 00 00 00 00 f8 e3
[  +0,020975] data: 49 68 0d 3a b1 8f ff ff 00 00 00 00 21 00 00 00
[  +0,000001] data: 00 00 00 00 00 00 00 00

when looking at it scrolling, I can see that the number is getting bigger and bigger with time

I'm running ddcutil-git from AUR (https://aur.archlinux.org/packages/ddcutil-git/) but for some reason it's version 0.9.9

Interesting thing is that this bug occurs even when I'm not using "sudo", as I presented above. iirc it shouldn't do anything as a user

After running sudo ddcutil detect those are first lines in dmesg:

[gru16 22:13] audit: type=1101 audit(1608153200.173:149): pid=5341 uid=1000 auid=1000 ses=2 msg='op=PAM:accounting grantors=pam_unix,pam_permit,pam_time acct="tsd" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/3 res=success'
[  +0,000149] audit: type=1110 audit(1608153200.173:150): pid=5341 uid=1000 auid=1000 ses=2 msg='op=PAM:setcred grantors=pam_faillock,pam_permit,pam_env,pam_faillock acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/3 res=success'
[  +0,002434] audit: type=1105 audit(1608153200.177:151): pid=5341 uid=1000 auid=1000 ses=2 msg='op=PAM:session_open grantors=pam_limits,pam_unix,pam_permit acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/3 res=success'
[  +0,013582] data: 00 00 00 00 00 00 00 00 80 d0 55 59 75 e2 ab 29
[  +0,000002] data: 63 70 75 31 35 00 00 00
[  +0,020996] data: 00 16 20 d2 55 59 75 e2 ab 39 20 d8 55 59 75 e2
[  +0,000001] data: ab 31 c8 d8 55 59 75 e2
[  +0,020754] data: 00 2c ab c9 76 65 72 73 69 6f 6e 00 e0 d8 55 59
[  +0,000000] data: 75 e2 ab d9 10 d9 55 59
[  +0,016243] data: 00 42 75 e2 ab d1 62 61 6e 6b 39 00 00 00 72 63
[  +0,000001] data: 5f 63 6f 72 65 00 2e 74
[  +0,021001] data: 00 58 65 78 74 00 00 00 68 dc 55 59 75 e2 ab f1

...

@rockowitz
Copy link
Owner

I've been puzzling over how to proceed. If the hang were happening on a machine in front of me, there are lots of little things I would check, but I don't want to send you off on lots of goose chases. My best guess is still that there's a bug in amdgpu. But it would be arrogant to dismiss the possibility of a bug in ddcutil.

That said, here are some observations and things to try.

  • What are the values of /sys/module/amdgpu/parameters/dc and /sys/module/amdgpu/parameters/hw_i2c?

  • Re ddcutil working without sudo, running as root is not needed if your userid is a member of a group (typically i2c) having r/w permissions on the /dev/i2c devices. And of course sudo is not needed within its timeout period.

  • Do "i2cdetect" or "get-edid | read-edid" trigger the lockup? What about ddccontrol? If any do, it's clearly an amdgpu problem.

  • Re the dmesg output, I assume earlier entries are lost because dmesg is a circular log. Try using "dmesg -wF > somefile".

  • I'm not familiar with logging in Arch. Presumably there are some log files in /var/log to check.

  • Use journalctl.

  • AUR has the current production release (0.9.9) of ddcutil. The most recent code is branch 1.0.0-dev on github.com/rockowitz/ddcutil. Please build from that. The remaining tests require the latest code.

  • I've added additional tracing to function i2c_get_edid_bytes_directly() to pinpoint which system call causes the hang. (I'm assuming that the lock up always occurs in that function.) Please run "ddcutil --trace i2c --tid --ts". Assuming the hang always occurs in i2c_get_edid_bytes_directly(), option --timeout-i2c-io does not apply to this code path.

  • "Utility" options --f1 etc. facilitate testing without having to modify the parser. Currently, option --f1 causes the EDID to be read using the standard (and more complex) i2c code path. Note that --timeout-i2c-io" does apply in this case. So the command to run is "ddcutil --f1 --timeout-i2c-io --tid --ts --trace i2c"

@tduboys
Copy link

tduboys commented Jan 19, 2021

Hi, I have the same issue on a AMD RX 6800 with amdgpu driver, on a fresh Arch install, both with ddcutil 0.9.9 and 1.0.0-dev

For your questions :

$ cat  /sys/module/amdgpu/parameters/dc
-1
$ cat  /sys/module/amdgpu/parameters/hw_i2c
0

I dont know how to use i2cdetect but :

$ i2cdetect -l
i2c-3   i2c             AMDGPU SMU                              I2C adapter
i2c-10  i2c             AMDGPU DM aux hw bus 2                  I2C adapter
i2c-1   smbus           SMBus PIIX4 adapter port 2 at 0b00      SMBus adapter
i2c-8   i2c             AMDGPU DM aux hw bus 0                  I2C adapter
i2c-6   i2c             AMDGPU DM i2c hw bus 2                  I2C adapter
i2c-4   i2c             AMDGPU DM i2c hw bus 0                  I2C adapter
i2c-2   smbus           SMBus PIIX4 adapter port 1 at 0b20      SMBus adapter
i2c-0   smbus           SMBus PIIX4 adapter port 0 at 0b00      SMBus adapter
i2c-9   i2c             AMDGPU DM aux hw bus 1                  I2C adapter
i2c-7   i2c             AMDGPU DM i2c hw bus 3                  I2C adapter
i2c-5   i2c             AMDGPU DM i2c hw bus 1                  I2C adapter
$ get-edid
This is read-edid version 3.0.2. Prepare for some fun.
Attempting to use i2c interface
No EDID on bus 0
No EDID on bus 1
No EDID on bus 2
(infinite sleep, unable to kill…) 

It seems that there is the same issue with get-edid command.

$ ddcutil --trace i2c --tid --ts environment
[   5017][  0.001](main                          ) Starting ddcutil execution, Tue Jan 19 21:16:08 2021
The following tests probe the runtime environment using multiple overlapping methods.
[   5017][  0.001](i2c_detect_buses              ) Starting.  i2c_buses = (nil)
[   5017][  0.003](i2c_check_bus                 ) Starting. busno=3, buf_info=0x563b3f306080
[   5017][  0.003](i2c_open_bus                  ) Starting. busno=3, callopts=0x80
[   5017][  0.003](i2c_open_bus                  ) Done.     busno=3, Returning file descriptor: 3
[   5017][  0.003](i2c_get_parsed_edid_by_fd     ) Starting. fd=3, filename=/dev/i2c-3
[   5017][  0.003](i2c_get_raw_edid_by_fd        ) Getting EDID. File descriptor = 3, filename=/dev/i2c-3
[   5017][  0.003](i2c_set_addr                  ) fd=3, addr=0x50, filename=/dev/i2c-3, i2c_force_slave_addr_flag=false, callopts=CALLOPT_ERR_MSG
[   5017][  0.003](i2c_set_addr                  ) addr = 0x50. filename = /dev/i2c-3, Returning OK(0): success
[   5017][  0.003](i2c_get_raw_edid_by_fd        ) Trying EDID read. tryctr=0, max_tries=2, read_bytewise=false, using local io
[   5017][  0.003](i2c_get_edid_bytes_directly   ) Getting EDID. File descriptor = 3, filename=/dev/i2c-3, read_bytewise=false
[   5017][  0.003](i2c_get_edid_bytes_directly   ) write_before_read = true
(infinite sleep…)

I'm not familiar to this tool, but it's used by Clight.

Edit: I just got a strange stuff. When my 2 last command was blocked (get-edid and ddcutil --trace i2c --tid --ts environment), I launched ddcutil --f1 --timeout-i2c-io --tid --ts --trace i2c environment, and after some seconds, the command answered, and exactly at this time, the 2 old one stopped.
And now, all commands are working…
Edit2: After a reboot, it takes 285 seconds to answer. I will put logs in attachment.
ddcutil.log
dmesg.log

And there is a bug on amdgpu drm project for that https://gitlab.freedesktop.org/drm/amd/-/issues/1314

@rockowitz
Copy link
Owner

rockowitz commented Jan 20, 2021 via email

@tdolega
Copy link
Author

tdolega commented Apr 26, 2021

While using kernel 5.10.32-1 (Arch linux-lts) the issue is still there, but on newest 5.11.16 i have no problem with ddcutil at all, so i guess it was fixed somewhere between this versions.

@rockowitz
Copy link
Owner

Did you upgrade to ddcutil 1.1.0 along the the kernel update? Release 1.1.0 contains a fix to keep ddcutil from probing a /dev/i2c device if its name is AMDGPU SMU, which is used to report monitor state. The fix was based on the report in issue #194, which also seems to describe your configuration (RX600, amdgpu).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants