CM5 random I/O freezes with CQE enabled for eMMC #6512

Open
sairon opened this issue Dec 3, 2024 · 7 comments

Comments

@sairon

sairon commented Dec 3, 2024

Describe the bug

When CQE is enabled for the eMMC interface on CM5 (CM5104032), I/O operations sometimes freeze completely and do not recover. We noticed this on random occasions after the system had already booted, but the most reliable trigger was heavier I/O on the first system boot, when a swapfile is created (copying a few hundred MB of zeroes to an ext4 FS). The issue is also reproducible with the bcm2712-rpi-cm5-cm5io.dtb device tree, so it is not limited to the downstream bcm2712-rpi-cm5-ha-yellow.dtb. To rule out the bootloader being involved (we're using U-Boot), we confirmed that loading the kernel image directly with the default bootloader doesn't resolve the problem. However, removing supports-cqe from the sdio1 node reliably fixes it (verified over a couple of hundred boots already).
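
For anyone comparing setups, a quick way to see whether the running device tree still carries the property is to search the live tree for it. This is only a sketch: it assumes the live tree is exposed at /proc/device-tree, and the helper name cqe_nodes is made up for illustration. The presence of the property says nothing about whether the driver actually enabled CQE.

```shell
# List device tree nodes that carry the supports-cqe property.
# $1 is the tree root; it defaults to the live tree (an assumption about
# where the kernel exposes it).
cqe_nodes() {
    root="${1:-/proc/device-tree}"
    # -H makes find follow the root if it is itself a symlink,
    # which /proc/device-tree usually is.
    find -H "$root" -name supports-cqe -exec dirname {} \;
}
```

Running cqe_nodes with no arguments prints one node path per line, or nothing if the property has been removed everywhere.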

Steps to reproduce the behaviour

Boot HA OS for the first time. If it doesn't freeze, remove the swapfile (swapoff /mnt/data/swapfile && rm /mnt/data/swapfile) and reboot.

Device(s)

Other

System

Tested on Home Assistant OS 14.0.rc2 (using 6.6.51 kernel based on stable_20241008 tag).

Logs

[  242.826099]       Tainted: G         C         6.6.51-haos-raspi #54
[  242.832463] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.840429] INFO: task jbd2/mmcblk0p7-:300 blocked for more than 120 seconds.
[  242.847572]       Tainted: G         C         6.6.51-haos-raspi #54
[  242.853928] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.861789] INFO: task jbd2/mmcblk0p8-:344 blocked for more than 120 seconds.
[  242.868926]       Tainted: G         C         6.6.51-haos-raspi #54
[  242.875277] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.883149] INFO: task systemd-timesyn:569 blocked for more than 120 seconds.
[  242.890282]       Tainted: G         C         6.6.51-haos-raspi #54
[  242.896628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.904522] INFO: task dockerd:606 blocked for more than 120 seconds.
[  242.910958]       Tainted: G         C         6.6.51-haos-raspi #54
[  242.917304] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  242.925200] INFO: task runc:[2:INIT]:1504 blocked for more than 120 seconds.
[  242.932249]       Tainted: G         C         6.6.51-haos-raspi #54
[  242.938595] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
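
To get a compact list of which tasks were reported as hung, the messages above can be filtered down to task name and PID. This is a rough sketch, assuming the message format shown in this log; hung_tasks is just an illustrative name.

```shell
# Read kernel log lines on stdin and print each hung task once,
# as "name (pid N)".
hung_tasks() {
    # The greedy first group keeps colons inside task names such as
    # "runc:[2:INIT]"; the PID is the digits after the last colon.
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than.*/\1 (pid \2)/p' | sort -u
}
```

Typical use would be something like dmesg | hung_tasks.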

Additional context

This seems similar to #6349, but the fixes for it targeted only a limited set of SD cards.

@P33M
Contributor

P33M commented Dec 4, 2024

Please post the complete contents of dmesg.
Can you reproduce the issue on Raspberry Pi OS?

@ViRb3

ViRb3 commented Dec 7, 2024

> using 6.6.51 kernel based on stable_20241008 tag

This looks too old to contain the CQE fixes from my issue you linked above. Those fixes were merged on 2024-10-18 (#6419), which is 10 days later than your build.

FWIW, on Raspberry Pi OS 6.6.62-1+rpt1 (2024-11-25), all 3 of my RPi5 run without issue with CQE enabled.

@P33M
Contributor

P33M commented Dec 9, 2024

> > using 6.6.51 kernel based on stable_20241008 tag
>
> This looks too old to contain the CQE fixes from my issue you linked above. Those fixes were merged on 2024-10-18 (#6419), which is 10 days later than your build.
>
> FWIW, on Raspberry Pi OS 6.6.62-1+rpt1 (2024-11-25), all 3 of my RPi5 run without issue with CQE enabled.

Good spot, the first fix in that series is for a bug that will affect eMMC as well and would result in the described hang.

@sairon
Author

sairon commented Dec 9, 2024

@P33M I was able to reproduce it on Raspberry Pi OS. Here are the steps I took after flashing the latest RPi OS Lite image and running it through the first boot:

  • Added dtoverlay=uart4-pi5 to config.txt (as that's where HA Yellow has its USB-UART connected).
  • Set console to that serial:
# cat /boot/firmware/cmdline.txt 
console=ttyAMA4,115200 root=PARTUUID=c79d899d-02 rootfstype=ext4 fsck.repair=yes rootwait
  • Created /usr/lib/systemd/system/haos-swapfile.service:
[Unit]
Description=Swap test
DefaultDependencies=no
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/usr/libexec/swapfile

[Install]
WantedBy=swap.target
  • Created /usr/libexec/swapfile:
#!/bin/sh
set -e

swapfile="/mnt/data/swapfile"
# Swap space in 4k blocks
swapsize="$(awk '/MemTotal/{ print int($2 * 0.33 / 4) }' /proc/meminfo)"

if [ ! -s "${swapfile}" ] || [ "$(stat "${swapfile}" -c '%s')" -lt $((swapsize * 4096)) ]; then
	# Check free space (in 4k blocks)
	if [ "$(stat -f /mnt/data -c '%f')" -lt "${swapsize}" ]; then
		echo "[WARNING] Not enough space to allocate swapfile"
		exit 1
	fi

	echo "[INFO] Creating swapfile of size $((swapsize *4))k"
	umask 0077
	dd if=/dev/zero of="${swapfile}" bs=4k count="${swapsize}"
	echo "[INFO] Created swapfile"
fi

if ! swaplabel "${swapfile}" > /dev/null 2>&1; then
	/usr/lib/systemd/systemd-makefs swap "${swapfile}"
fi
  • Necessary chores:
mkdir -p /mnt/data
systemctl enable haos-swapfile
chmod +x /usr/libexec/swapfile

Then I ran a pexpect testing script that keeps rebooting the device, removing the created file if the boot is successful, power-cycling the device if it doesn't boot within ~5 minutes.
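
For illustration, the loop described above looks roughly like this. It is not the pexpect script itself, only a shell sketch of the same logic. The three hooks (wait_for_boot, remove_swapfile_and_reboot, power_cycle) are hypothetical and have to be supplied by whatever controls the serial console and the power supply.

```shell
# Run a number of boot cycles and count how many booted and how many froze.
# Hypothetical hooks the caller must define beforehand:
#   wait_for_boot SECONDS       - succeeds if a login prompt shows up in time
#   remove_swapfile_and_reboot  - deletes the swapfile on the device, reboots
#   power_cycle                 - cuts and restores power to the device
run_cycles() {
    cycles="$1"
    timeout_s="$2"
    ok=0
    frozen=0
    i=0
    while [ "$i" -lt "$cycles" ]; do
        if wait_for_boot "$timeout_s"; then
            ok=$((ok + 1))
            remove_swapfile_and_reboot
        else
            frozen=$((frozen + 1))
            power_cycle
        fi
        i=$((i + 1))
    done
    echo "ok=${ok} frozen=${frozen}"
}
```

With the hooks in place, run_cycles 100 300 would do 100 boots with a 5 minute timeout each and print the totals at the end.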

Here is the serial output for a failed boot: rpios-cqe-frozen-onlyoverlay.txt
Here it is with our device tree (decompiled as bcm2712-rpi-cm5-ha-yellow.dts.txt), without the DT overlay and with the console set to ttyAMA2 (otherwise the same conditions as above): haos-cqe-frozen.txt

I tried doing everything from scratch today, writing down my steps along the way, to rule out any other factors. Then I followed the steps again and saved the logs attached above. I don't get the hung task message in RPi OS (maybe it's enabled by some kernel option downstream?), but when it freezes, it doesn't recover even after tens of minutes. What is strange (though it may be a fluke) is that after the test has run for a long time, it kind of "stabilizes" and doesn't happen as often as on the first boot after the OS is flashed. I saw this on both HAOS and RPi OS.

Also, this test was conducted on an 8GB RAM/8GB eMMC module (preproduction rev 0.2, which I used because I have access to the BCM UART there), but it was reproducible with a production CM5104032 too.

@ViRb3

ViRb3 commented Dec 9, 2024

The latest download without running apt upgrade comes with kernel 6.6.51, which I believe is exactly the one containing the CQE bugs. Please try upgrading and see if it reproduces there.
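
A quick sanity check before starting a long test run is to compare the running kernel against the 6.6.62 release mentioned earlier in this thread. The sketch below assumes a sort that supports -V (GNU coreutils does); kernel_at_least is an illustrative name. The version number is only a rough proxy, since a downstream build may or may not include the relevant commits.

```shell
# Succeed if the kernel version is at least the given minimum.
# $1 = minimum version, $2 = version to check (defaults to uname -r).
kernel_at_least() {
    min="$1"
    cur="${2:-$(uname -r)}"
    # Drop local suffixes such as "+rpt-rpi-2712" or "-haos-raspi".
    cur="${cur%%[-+]*}"
    # Version-sort both and check that the minimum comes first.
    lowest="$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n 1)"
    [ "$lowest" = "$min" ]
}
```

For example, kernel_at_least 6.6.62 || echo "kernel predates 6.6.62" would flag a freshly flashed image that has not been upgraded yet.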

@sairon
Author

sairon commented Dec 9, 2024

> Good spot, the first fix in that series is for a bug that will affect eMMC as well and would result in the described hang.

Which commit exactly do you have in mind? Backtracking through internal comms, I see that I tested for the issue with efecbda (although again with HAOS only) and it still had the issue.

I would like to avoid chasing ghosts, as the bug obviously has some racy trigger and it can be quite time-consuming to rule out reliably.

(EDIT: Anyway, since I'm about to call it a day in a while, I will prepare another test run from scratch and also perform an APT upgrade after the first boot.)

@P33M
Contributor

P33M commented Dec 10, 2024

The Raspberry Pi OS apt package is at 6.6.62. This corresponds to:

 zcat /usr/share/doc/linux-image-6.6.62+rpt-rpi-2712/changelog.Debian.gz  | grep "Linux commit" | head -n 1
  * Linux commit: dd2394360860d15146c96635796a75b05bb32b61

i.e. dd23943
This should not have CQE/eMMC issues.
