Kernel Crashes on Raspberry Pi Zero 2 W #6438

Open
pmessan opened this issue Oct 24, 2024 · 33 comments

Comments

@pmessan

pmessan commented Oct 24, 2024

Describe the bug

We have been experiencing kernel crashes on our Zero 2 W devices at seemingly random times. We are not sure what triggers them, but we see them more often (though not always) when our system tests stress the BLE subsystem and the USB gadget subsystem (HID / CDC output) at the same time. We have recorded at least three crashes on kernel version 6.6.50 and at least one crash on kernel version 6.6.56.

We are using an external wireless chip connected to the Zero 2 W via GPIO, and have confirmed with the chip provider that the crash does not originate from their kernel driver.

Steps to reproduce the behaviour

  1. Connect multiple BLE peripherals to the Pi (we had 10)
  2. Initiate BLE transmissions at regular intervals from each peripheral (we had each device sending data at a 1-second interval)
  3. Send data to the USB HID stack at around the same interval
  4. Let this run for at least an hour (on the device running the 6.6.56 kernel we saw the crash after ~1h)

Device(s)

Raspberry Pi Zero 2 W

System

$ uname -a
Linux 6.6.56-v7 #1 SMP Fri Oct 11 11:14:35 UTC 2024 armv7l GNU/Linux

$ vcgencmd version
Sep 13 2024 16:00:14 
Copyright (c) 2012 Broadcom
version ddfba3e3c234500025b545512b4b214f28e453e9 (clean) (release) (start_cd)

$ cat /etc/rpi-issue
Raspberry Pi reference 2024-03-15
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 11096428148f0f2be3985ef3126ee71f99c7f1c2, stage2

Logs

From the device running the 6.6.56 kernel:

Oct 15 12:04:14 [alert] kernel: [ 2922.738218] 8<--- cut here ---
Oct 15 12:04:14 [alert] kernel: [ 2922.738255] Unable to handle kernel NULL pointer dereference at virtual address 00000014 when read
Oct 15 12:04:14 [alert] kernel: [ 2922.738290] [00000014] *pgd=06336835, *pte=00000000, *ppte=00000000
Oct 15 12:04:14 [emerg] kernel: [ 2922.738334] Internal error: Oops: 17 [#1] SMP ARM
Oct 15 12:04:14 [warning] kernel: [ 2922.738357] Modules linked in: xt_recent usb_f_ecm u_ether usb_f_mass_storage usb_f_hid dwc2 roles cmac algif_hash aes_arm_bs crypto_simd cryptd algif_skcipher af_alg btnxpuart bluetooth ecdh_generic ecc crc8 moal(O) mlan(O) cfg80211 xt_hl ip6t_rt rfkill ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink binfmt_misc sc16is7xx regmap_i2c i2c_bcm2835 raspberrypi_hwmon raspberrypi_gpiomem fixed uio_pdrv_genirq uio w5100_spi w5100 libcomposite i2c_dev deflate zstd ubifs ubi ofpart spi_nor mtd spi_bcm2835 drm fuse drm_panel_orientation_quirks backlight ip_tables x_tables ipv6 overlay
Oct 15 12:04:14 [warning] kernel: [ 2922.738787] CPU: 0 PID: 487 Comm: python3 Tainted: G           O       6.6.56-v7 #1
Oct 15 12:04:14 [warning] kernel: [ 2922.738824] Hardware name: BCM2835
Oct 15 12:04:14 [warning] kernel: [ 2922.738841] PC is at memcg_charge_kernel_stack+0xc/0x9c
Oct 15 12:04:14 [warning] kernel: [ 2922.738878] LR is at copy_process+0xcc4/0x1d8c
Oct 15 12:04:14 [warning] kernel: [ 2922.738906] pc : [<80119d90>]    lr : [<8011c444>]    psr: 60000013
Oct 15 12:04:14 [warning] kernel: [ 2922.738921] sp : 9fd5de00  ip : 00000000  fp : 00010000
Oct 15 12:04:14 [warning] kernel: [ 2922.738934] r10: 9a87fe04  r9 : ffffffff  r8 : 9fd5debc
Oct 15 12:04:14 [warning] kernel: [ 2922.738952] r7 : 821ba380  r6 : 00000800  r5 : 003d0f00  r4 : 9a87fe04
Oct 15 12:04:14 [warning] kernel: [ 2922.738970] r3 : 00005526  r2 : 00005525  r1 : 00000000  r0 : 00000000
Oct 15 12:04:14 [warning] kernel: [ 2922.738987] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Oct 15 12:04:14 [warning] kernel: [ 2922.739007] Control: 10c5383d  Table: 04ee406a  DAC: 00000055
Oct 15 12:04:14 [alert] kernel: [ 2922.739022] Register r0 information: NULL pointer
Oct 15 12:04:14 [alert] kernel: [ 2922.739042] Register r1 information: NULL pointer
Oct 15 12:04:14 [alert] kernel: [ 2922.739058] Register r2 information: non-paged memory
Oct 15 12:04:14 [alert] kernel: [ 2922.739075] Register r3 information: non-paged memory
Oct 15 12:04:14 [alert] kernel: [ 2922.739092] Register r4 information: non-slab/vmalloc memory
Oct 15 12:04:14 [alert] kernel: [ 2922.739110] Register r5 information: non-paged memory
Oct 15 12:04:14 [alert] kernel: [ 2922.739127] Register r6 information: non-paged memory
Oct 15 12:04:14 [alert] kernel: [ 2922.739144] Register r7 information: slab task_struct start 821ba380 pointer offset 0 size 4544
Oct 15 12:04:14 [alert] kernel: [ 2922.739182] Register r8 information: 2-page vmalloc region starting at 0x9fd5c000 allocated at kernel_clone+0xac/0x3a8
Oct 15 12:04:14 [alert] kernel: [ 2922.739217] Register r9 information: non-paged memory
Oct 15 12:04:14 [alert] kernel: [ 2922.739234] Register r10 information: non-slab/vmalloc memory
Oct 15 12:04:14 [alert] kernel: [ 2922.739253] Register r11 information: non-paged memory
Oct 15 12:04:14 [alert] kernel: [ 2922.739269] Register r12 information: NULL pointer
Oct 15 12:04:14 [emerg] kernel: [ 2922.739286] Process python3 (pid: 487, stack limit = 0xee79fc33)
Oct 15 12:04:14 [emerg] kernel: [ 2922.739305] Stack: (0x9fd5de00 to 0x9fd5e000)
Oct 15 12:04:14 [emerg] kernel: [ 2922.739325] de00: 9a87fe04 003d0f00 00000800 8011c444 00000dc2 0000065f 00000000 ffffffff
Oct 15 12:04:14 [emerg] kernel: [ 2922.739349] de20: 8011d654 00000024 9a9c8dc4 003d0f00 840e0000 9fd5df38 00000000 00000000
Oct 15 12:04:14 [emerg] kernel: [ 2922.739373] de40: 00000000 8116b41c 00000000 00000000 00000000 00000000 00000001 8032b324
Oct 15 12:04:14 [emerg] kernel: [ 2922.739397] de60: 9fd5dee4 8601bca8 9a8d9a80 8032b980 00000000 00000000 82231340 00000000
Oct 15 12:04:14 [emerg] kernel: [ 2922.739421] de80: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000255
Oct 15 12:04:14 [emerg] kernel: [ 2922.739445] dea0: 821ba380 6b1e7000 8601bca8 9fd5dfb0 82231340 00000040 8601bca8 00000000
Oct 15 12:04:14 [emerg] kernel: [ 2922.739469] dec0: 00000000 00000000 00000000 08159964 00000000 003d0f00 6b1e8468 9fd5df38
Oct 15 12:04:14 [emerg] kernel: [ 2922.739493] dee0: 6b1e7ef8 00000000 821ba380 00000078 00100000 8011d654 00000a55 00000000
Oct 15 12:04:14 [emerg] kernel: [ 2922.739517] df00: 00000000 00000000 00000000 08159964 8b1e079c 003d0f00 6b1e8468 6b1e8900
Oct 15 12:04:14 [emerg] kernel: [ 2922.739541] df20: 6b1e7ef8 80100298 821ba380 00000078 7eacb938 8011dcf8 003d0f00 00000000
Oct 15 12:04:14 [emerg] kernel: [ 2922.739565] df40: 6b1e8468 6b1e8468 6b1e8468 00000000 00000000 00000000 6b1e7ef8 00000000
Oct 15 12:04:14 [emerg] kernel: [ 2922.739588] df60: 6b1e8900 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Oct 15 12:04:14 [emerg] kernel: [ 2922.739612] df80: 00000000 00000000 00000000 08159964 ffffffff 6b1e8468 7eacb9fc 7eacb936
Oct 15 12:04:14 [emerg] kernel: [ 2922.739636] dfa0: 00000078 80100040 6b1e8468 7eacb9fc 003d0f00 6b1e7ef8 6b1e8468 6b1e8900
Oct 15 12:04:14 [emerg] kernel: [ 2922.739661] dfc0: 6b1e8468 7eacb9fc 7eacb936 00000078 6a9e8000 7eacb937 6a9e8000 7eacb938
Oct 15 12:04:14 [emerg] kernel: [ 2922.739685] dfe0: 003d0f00 7eacb860 76e12da4 76e10a8c 20000010 003d0f00 00000000 00000000
Oct 15 12:04:14 [emerg] kernel: [ 2922.739711]  memcg_charge_kernel_stack from copy_process+0xcc4/0x1d8c
Oct 15 12:04:14 [emerg] kernel: [ 2922.739742]  copy_process from kernel_clone+0xac/0x3a8
Oct 15 12:04:14 [emerg] kernel: [ 2922.739765]  kernel_clone from sys_clone+0x78/0x9c
Oct 15 12:04:14 [emerg] kernel: [ 2922.739788]  sys_clone from ret_fast_syscall+0x0/0x4c
Oct 15 12:04:14 [emerg] kernel: [ 2922.739811] Exception stack(0x9fd5dfa8 to 0x9fd5dff0)
Oct 15 12:04:14 [emerg] kernel: [ 2922.739830] dfa0:                   6b1e8468 7eacb9fc 003d0f00 6b1e7ef8 6b1e8468 6b1e8900
Oct 15 12:04:14 [emerg] kernel: [ 2922.739854] dfc0: 6b1e8468 7eacb9fc 7eacb936 00000078 6a9e8000 7eacb937 6a9e8000 7eacb938
Oct 15 12:04:14 [emerg] kernel: [ 2922.739877] dfe0: 003d0f00 7eacb860 76e12da4 76e10a8c
Oct 15 12:04:14 [emerg] kernel: [ 2922.739896] Code: ea0168e4 e92d4070 e52de004 e28dd004 (e5903014)
Oct 15 12:04:14 [warning] kernel: [ 2922.739943] ---[ end trace 0000000000000000 ]---

From a device running the 6.6.50 kernel:

Oct 11 07:54:04 [alert] kernel: [62993.490295] 8<--- cut here ---
Oct 11 07:54:04 [alert] kernel: [62993.490329] Unable to handle kernel paging request at virtual address 7974744d when read
Oct 11 07:54:04 [alert] kernel: [62993.490349] [7974744d] *pgd=00000000
Oct 11 07:54:04 [emerg] kernel: [62993.490368] Internal error: Oops: 5 [#1] SMP ARM
Oct 11 07:54:04 [warning] kernel: [62993.490385] Modules linked in: xt_recent usb_f_hid dwc2 roles cmac algif_hash aes_arm_bs crypto_simd cryptd algif_skcipher af_alg btnxpuart bluetooth ecdh_generic ecc crc8 moal(O) mlan(O) cfg80211 xt_hl ip6t_rt rfkill ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink binfmt_misc sc16is7xx regmap_i2c raspberrypi_hwmon i2c_bcm2835 raspberrypi_gpiomem uio_pdrv_genirq fixed uio w5100_spi w5100 libcomposite i2c_dev deflate zstd ubifs ubi ofpart spi_nor mtd spi_bcm2835 drm fuse drm_panel_orientation_quirks backlight ip_tables x_tables ipv6 overlay
Oct 11 07:54:04 [warning] kernel: [62993.490690] CPU: 0 PID: 145 Comm: systemd-udevd Tainted: G           O       6.6.50-v7 #1
Oct 11 07:54:04 [warning] kernel: [62993.490711] Hardware name: BCM2835
Oct 11 07:54:04 [warning] kernel: [62993.490721] PC is at __kmem_cache_alloc_node+0xc4/0x4c8
Oct 11 07:54:04 [warning] kernel: [62993.490748] LR is at __kmem_cache_alloc_node+0x44/0x4c8
Oct 11 07:54:04 [warning] kernel: [62993.490765] pc : [<803630c4>]    lr : [<80363044>]    psr: a0000013
Oct 11 07:54:04 [warning] kernel: [62993.490783] sp : 9f93dd10  ip : 0c022d19  fp : 9f93dd10
Oct 11 07:54:04 [warning] kernel: [62993.490799] r10: 00012db9  r9 : 00000036  r8 : 810057bc
Oct 11 07:54:04 [warning] kernel: [62993.490814] r7 : 00000dc0  r6 : 00000000  r5 : 7974742d  r4 : 81401100
Oct 11 07:54:04 [warning] kernel: [62993.490831] r3 : 00000020  r2 : 9ef7c378  r1 : 1dfe4000  r0 : 00012db8
Oct 11 07:54:04 [warning] kernel: [62993.490849] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Oct 11 07:54:04 [warning] kernel: [62993.490870] Control: 10c5383d  Table: 0374c06a  DAC: 00000055
Oct 11 07:54:04 [alert] kernel: [62993.490885] Register r0 information: non-paged memory
Oct 11 07:54:04 [alert] kernel: [62993.490905] Register r1 information: non-paged memory
Oct 11 07:54:04 [alert] kernel: [62993.490922] Register r2 information: non-slab/vmalloc memory
Oct 11 07:54:04 [alert] kernel: [62993.490943] Register r3 information: non-paged memory
Oct 11 07:54:04 [alert] kernel: [62993.490960] Register r4 information: slab kmem_cache start 81401100 pointer offset 0 size 124
Oct 11 07:54:04 [alert] kernel: [62993.490999] Register r5 information: non-paged memory
Oct 11 07:54:04 [alert] kernel: [62993.491016] Register r6 information: NULL pointer
Oct 11 07:54:04 [alert] kernel: [62993.491032] Register r7 information: non-paged memory
Oct 11 07:54:04 [alert] kernel: [62993.491049] Register r8 information: non-slab/vmalloc memory
Oct 11 07:54:04 [alert] kernel: [62993.491068] Register r9 information: non-paged memory
Oct 11 07:54:04 [alert] kernel: [62993.491085] Register r10 information: non-paged memory
Oct 11 07:54:04 [alert] kernel: [62993.491102] Register r11 information: 2-page vmalloc region starting at 0x9f93c000 allocated at kernel_clone+0xac/0x3a8
Oct 11 07:54:04 [alert] kernel: [62993.491138] Register r12 information: non-paged memory
Oct 11 07:54:04 [emerg] kernel: [62993.491156] Process systemd-udevd (pid: 145, stack limit = 0x6381dcca)
Oct 11 07:54:04 [emerg] kernel: [62993.491177] Stack: (0x9f93dd10 to 0x9f93e000)
Oct 11 07:54:04 [emerg] kernel: [62993.491194] dd00:                                     00001000 f8ea3954 82e8d0cd 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491219] dd20: 0eba24e4 ffffffff 83ef0114 00000036 80435d3c 81401100 f8ea3954 00000dc0
Oct 11 07:54:04 [emerg] kernel: [62993.491243] dd40: 9f93dddc 00000000 80b18ff0 8031a1d4 80435d3c f8ea3954 f2db9254 35302d75
Oct 11 07:54:04 [emerg] kernel: [62993.491267] dd60: 6465762d 7569cc34 81a2f030 8270eb40 f8ea3954 82e8d0b0 9f93dddc 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491292] dd80: 80b18ff0 80435d3c 82e8d0b0 81a2f030 819d6e80 9f93de40 82e8dff8 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491316] dda0: 00000000 80472880 9f93dddc 819d6e80 82e8d000 00001000 000000b0 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491340] ddc0: 00000007 9f93ddd4 860f9f00 00000000 00001000 00000000 00000000 82e8d0b8
Oct 11 07:54:04 [emerg] kernel: [62993.491364] dde0: 00000015 00000000 66e99076 87ef8dd2 22edd69e 81a2f030 00000000 860f9f00
Oct 11 07:54:04 [emerg] kernel: [62993.491387] de00: 00000000 00000000 81d1c700 8270eb60 00000000 80473c98 00000000 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491411] de20: 00000dc0 8036333c 00000000 00000000 00000000 00000000 00000000 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491435] de40: 7569cc34 f8ea3954 00000004 81f24894 00000000 00000000 00000000 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491459] de60: 00000000 00000000 00000000 00000000 00000000 87ef8dd2 00000000 9f93df68
Oct 11 07:54:04 [emerg] kernel: [62993.491483] de80: 8270eb40 81a2f030 81a2f030 00000000 860f9f00 81da5c00 81d1c700 80435938
Oct 11 07:54:04 [emerg] kernel: [62993.491507] dea0: 00000000 00000000 0000000f 87ef8dd2 000007ff 40020000 7e9ac658 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491530] dec0: 00000000 860f9f70 00000001 00000000 00000000 00000000 00000000 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491554] dee0: 00000000 00000000 00000000 00000000 00000000 87ef8dd2 00000000 40020000
Oct 11 07:54:04 [emerg] kernel: [62993.491578] df00: 860f9f00 9f93df68 81a2f030 80100298 81d1c700 87ef8dd2 81d1c700 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491602] df20: 860f9f00 9f93df68 81a2f030 81a2f0c0 81d1c700 000000d9 00000000 80393d5c
Oct 11 07:54:04 [emerg] kernel: [62993.491627] df40: 81d1c700 80ae6604 860f9f14 0127b4a8 860f9f02 0127b4a8 00008000 860f9f00
Oct 11 07:54:04 [emerg] kernel: [62993.491651] df60: 80100298 803942b4 80393fc8 00000000 00000000 00000000 0127b4c8 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491675] df80: 00008000 00000000 81d1c700 87ef8dd2 00000000 0127b4a8 0127b4a8 0127b4c8
Oct 11 07:54:04 [emerg] kernel: [62993.491699] dfa0: 000000d9 80100288 0127b4a8 0127b4a8 0000000f 0127b4c8 00008000 7fffffff
Oct 11 07:54:04 [emerg] kernel: [62993.491723] dfc0: 0127b4a8 0127b4a8 0127b4c8 000000d9 00000000 7e9ac874 0127b4a8 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491747] dfe0: 0059fbf8 7e9ac688 76d7c048 76d7bf84 80000010 0000000f 00000000 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.491774]  __kmem_cache_alloc_node from __kmalloc+0x4c/0x180
Oct 11 07:54:04 [emerg] kernel: [62993.491806]  __kmalloc from ext4_htree_store_dirent+0x30/0x100
Oct 11 07:54:04 [emerg] kernel: [62993.491839]  ext4_htree_store_dirent from htree_dirblock_to_tree+0x188/0x39c
Oct 11 07:54:04 [emerg] kernel: [62993.491872]  htree_dirblock_to_tree from ext4_htree_fill_tree+0xe0/0x368
Oct 11 07:54:04 [emerg] kernel: [62993.491899]  ext4_htree_fill_tree from ext4_readdir+0x71c/0xad0
Oct 11 07:54:04 [emerg] kernel: [62993.491928]  ext4_readdir from iterate_dir+0x88/0x16c
Oct 11 07:54:04 [emerg] kernel: [62993.491958]  iterate_dir from sys_getdents64+0x6c/0x10c
Oct 11 07:54:04 [emerg] kernel: [62993.491985]  sys_getdents64 from __sys_trace_return+0x0/0x10
Oct 11 07:54:04 [emerg] kernel: [62993.492012] Exception stack(0x9f93dfa8 to 0x9f93dff0)
Oct 11 07:54:04 [emerg] kernel: [62993.492032] dfa0:                   0127b4a8 0127b4a8 0000000f 0127b4c8 00008000 7fffffff
Oct 11 07:54:04 [emerg] kernel: [62993.492056] dfc0: 0127b4a8 0127b4a8 0127b4c8 000000d9 00000000 7e9ac874 0127b4a8 00000000
Oct 11 07:54:04 [emerg] kernel: [62993.492078] dfe0: 0059fbf8 7e9ac688 76d7c048 76d7bf84
Oct 11 07:54:04 [emerg] kernel: [62993.492098] Code: e1a0b00d e58d600c e594301c e280a001 (e7953003) 
Oct 11 07:54:04 [warning] kernel: [62993.492118] ---[ end trace 0000000000000000 ]---

Additional context

No response

@pelwell
Contributor

pelwell commented Oct 24, 2024

It's almost certainly experiencing under-voltage. What is the power source, and how are the attached peripherals powered?

@pmessan
Author

pmessan commented Oct 24, 2024

In most cases, the Pi is powered via a micro USB cable that connects the USB port on the Pi to a USB 2.0/3.0 port on a laptop.

In one case, the Pi was powered through an AmazonBasics USB cable from a USB power adapter rated 5.0 V | 1.2 A | 6 W.

By attached peripherals I assume you mean the external wireless chip. It is wired to the 1.8V pin of the Pi for power.

I have some follow-up questions:

  • How can you tell that this is caused by under-voltage? How can we verify this on our side?
  • How can we prevent under-voltage in the future?

@pelwell
Contributor

pelwell commented Oct 24, 2024

a USB power adapter with a rating of 5.0 V | 1.2 A | 6W

1.2A is criminally low.

It is wired to the 1.8V pin of the pi for power.

Which 1.8V pin is that?

How can you tell that this is caused by under-voltage?

Because the failures are in common, non-Pi-specific code, are different every time, and they happen under load. And because I've seen it before.

How can we verify this on our side?

You could try monitoring the supply rails, but you may find that the dips are too short and too localised to detect.

How can we prevent under-voltage in the future?

The limitation is in how much power can be routed through the Z2W's power infrastructure. You may need an external source of 1.8V that is derived directly from the (better) power supply, not via the Z2W.
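
One low-effort check alongside probing the rails is the firmware's own under-voltage flag. The sketch below polls `vcgencmd get_throttled` and decodes its bit field. The function names and the polling interval are illustrative assumptions, and a clean result does not rule out dips that are too brief for the firmware to flag.

```python
import subprocess
import time

# Bit positions reported by `vcgencmd get_throttled`.
THROTTLE_FLAGS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list:
    """Decode a line such as 'throttled=0x50005' into flag descriptions."""
    value = int(output.strip().split("=", 1)[1], 16)
    return [name for bit, name in sorted(THROTTLE_FLAGS.items()) if value & (1 << bit)]


def read_throttled() -> str:
    """Run vcgencmd on the Pi and return its raw output."""
    result = subprocess.run(
        ["vcgencmd", "get_throttled"], capture_output=True, text=True, check=True
    )
    return result.stdout


def monitor(interval_s: float = 1.0) -> None:
    """Hypothetical helper: print a timestamped line whenever any flag is set."""
    while True:
        flags = decode_throttled(read_throttled())
        if flags:
            print(time.strftime("%H:%M:%S"), ", ".join(flags))
        time.sleep(interval_s)
```

Calling `monitor()` on the device while the stress test runs would show whether the firmware ever reports under-voltage; the "has occurred" bits stay set until reboot, so a single read after a crash-free run is also informative.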

@pelwell
Contributor

pelwell commented Oct 24, 2024

You could try with over_voltage=2 in config.txt as a workaround, but you may find that it makes no difference.

@pmessan
Author

pmessan commented Oct 24, 2024

1.2A is criminally low

😅

Which 1.8V pin is that?

Apologies, I got my facts wrong; it seems we are drawing from 5V_USB instead.

Because the failures are in common, non-Pi-specific code, are different every time, and they happen under load. And because I've seen it before.

Thanks for the tip, will keep this in mind for future use 😄

You could try with over_voltage=2 in config.txt as a workaround, but you may find that it makes no difference.

Thanks, will try this without expecting too much. Based on this info, changing our PSUs would be a more impactful start, IMO.

@pelwell
Contributor

pelwell commented Oct 24, 2024

Good luck.

@pmessan
Author

pmessan commented Oct 28, 2024

Hello again,
I have a follow-up question:

  • This is our config.txt. It turns out we have been using over_voltage=2 already. Do you think with all the options we have disabled we could still run into under-voltage due to increased power consumption?

@pelwell
Contributor

pelwell commented Oct 28, 2024

With over_voltage=2 you are probably at the limit of what the on-board PMIC can deliver. You can try going to over_voltage=4, etc., but in practice it will be clamped. What does vcgencmd get_config over_voltage_avs report?
If the Pi has a decent power supply (our Micro-USB supply is 2.5A) and most/all peripherals are independently powered (e.g. on a powered hub) then you should be fine.

@pmessan
Author

pmessan commented Oct 28, 2024

What does vcgencmd get_config over_voltage_avs report?

$ vcgencmd get_config over_voltage_avs
over_voltage_avs=6250

If the Pi has a decent power supply (our Micro-USB supply is 2.5A) and most/all peripherals are independently powered (e.g. on a powered hub) then you should be fine.

And what if the peripherals are drawing power from the Z2W's on-board circuitry (USB)?

@pelwell
Contributor

pelwell commented Oct 28, 2024

And what if the peripherals are drawing power from the Z2W's on-board circuitry (USB)?

I can't quantify how much power is available for external peripherals, a) because I've not done the measurements, and b) because it varies from chip to chip according to the natural variability you get across the silicon wafers, but the short answer is you may not be fine.

@pmessan
Author

pmessan commented Oct 30, 2024

Hello,

We've experienced another instance of a crash on kernel 6.6.57.
We have a Z2W (with the BLE peripheral as described previously) connected to a 5V 3A power supply via a 3m micro USB cable from Amazon Basics.

There were 2 BLE devices connected via this peripheral, with no active data transmission. A connection to a TCP server was also established, but no data was coming in. We're not 100% sure how heavily the 4 cores were loaded, but the core temperature measured a few minutes before the crash was ~38 deg C.

Logs

Oct 30 09:08:13 [alert] kernel: [ 1220.541326] 8<--- cut here ---
Oct 30 09:08:13 [alert] kernel: [ 1220.541358] Unable to handle kernel NULL pointer dereference at virtual address 00000004 when read
Oct 30 09:08:13 [alert] kernel: [ 1220.541378] [00000004] *pgd=00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.541398] Internal error: Oops: 5 [#1] SMP ARM
Oct 30 09:08:13 [warning] kernel: [ 1220.541413] Modules linked in: xt_recent dwc2 roles cmac algif_hash aes_arm_bs crypto_simd cryptd algif_skcipher af_alg btnxpuart bluetooth ecdh_generic ecc crc8 moal(O) mlan(O) cfg80211 xt_hl rfkill ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink binfmt_misc sc16is7xx regmap_i2c raspberrypi_hwmon i2c_bcm2835 raspberrypi_gpiomem fixed uio_pdrv_genirq uio w5100_spi w5100 libcomposite i2c_dev deflate zstd ubifs ubi ofpart spi_nor drm mtd spi_bcm2835 fuse drm_panel_orientation_quirks backlight ip_tables x_tables ipv6 overlay
Oct 30 09:08:13 [warning] kernel: [ 1220.541716] CPU: 0 PID: 150 Comm: systemd-udevd Tainted: G           O       6.6.57-v7 #1
Oct 30 09:08:13 [warning] kernel: [ 1220.541736] Hardware name: BCM2835
Oct 30 09:08:13 [warning] kernel: [ 1220.541747] PC is at rb_insert_color+0x1c/0x170
Oct 30 09:08:13 [warning] kernel: [ 1220.541783] LR is at ext4_htree_store_dirent+0xd4/0x100
Oct 30 09:08:13 [warning] kernel: [ 1220.541810] pc : [<80ace89c>]    lr : [<80438fe8>]    psr: 60000013
Oct 30 09:08:13 [warning] kernel: [ 1220.541827] sp : 9f981d84  ip : 00000000  fp : 80b19170
Oct 30 09:08:13 [warning] kernel: [ 1220.541843] r10: 00000000  r9 : 9f981ddc  r8 : 8308d080
Oct 30 09:08:13 [warning] kernel: [ 1220.541858] r7 : 5925aa36  r6 : 84e3fe00  r5 : 84e3f700  r4 : 3c8f6260
Oct 30 09:08:13 [warning] kernel: [ 1220.541876] r3 : 84e3fd08  r2 : 00000000  r1 : 84e3fe00  r0 : 84e3f708
Oct 30 09:08:13 [warning] kernel: [ 1220.541894] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
Oct 30 09:08:13 [warning] kernel: [ 1220.541914] Control: 10c5383d  Table: 037fc06a  DAC: 00000055
Oct 30 09:08:13 [alert] kernel: [ 1220.541929] Register r0 information: slab kmalloc-64 start 84e3f700 pointer offset 8 size 64
Oct 30 09:08:13 [alert] kernel: [ 1220.541972] Register r1 information: slab kmalloc-64 start 84e3fe00 pointer offset 0 size 64
Oct 30 09:08:13 [alert] kernel: [ 1220.542008] Register r2 information: NULL pointer
Oct 30 09:08:13 [alert] kernel: [ 1220.542027] Register r3 information: slab kmalloc-64 start 84e3fd00 pointer offset 8 size 64
Oct 30 09:08:13 [alert] kernel: [ 1220.542063] Register r4 information: non-paged memory
Oct 30 09:08:13 [alert] kernel: [ 1220.542080] Register r5 information: slab kmalloc-64 start 84e3f700 pointer offset 0 size 64
Oct 30 09:08:13 [alert] kernel: [ 1220.542115] Register r6 information: slab kmalloc-64 start 84e3fe00 pointer offset 0 size 64
Oct 30 09:08:13 [alert] kernel: [ 1220.542150] Register r7 information: non-paged memory
Oct 30 09:08:13 [alert] kernel: [ 1220.542167] Register r8 information: non-slab/vmalloc memory
Oct 30 09:08:13 [alert] kernel: [ 1220.542187] Register r9 information: 2-page vmalloc region starting at 0x9f980000 allocated at kernel_clone+0xac/0x3a8
Oct 30 09:08:13 [alert] kernel: [ 1220.542225] Register r10 information: NULL pointer
Oct 30 09:08:13 [alert] kernel: [ 1220.542242] Register r11 information: non-slab/vmalloc memory
Oct 30 09:08:13 [alert] kernel: [ 1220.542261] Register r12 information: NULL pointer
Oct 30 09:08:13 [emerg] kernel: [ 1220.542277] Process systemd-udevd (pid: 150, stack limit = 0xec1a5737)
Oct 30 09:08:13 [emerg] kernel: [ 1220.542298] Stack: (0x9f981d84 to 0x9f982000)
Oct 30 09:08:13 [emerg] kernel: [ 1220.542317] 1d80:          80438fe8 8308d080 81be10a0 81aef640 9f981e40 8308dff8 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542342] 1da0: 00000000 80475b0c 9f981ddc 81aef640 8308d000 00001000 00000080 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542366] 1dc0: 00000006 9f981dd4 82723a80 00000000 00001000 00000000 00000000 8308d088
Oct 30 09:08:13 [emerg] kernel: [ 1220.542390] 1de0: 0000000d 00000000 6718e9cb f05487c7 32c85657 81be10a0 00000000 82723a80
Oct 30 09:08:13 [emerg] kernel: [ 1220.542414] 1e00: 00000000 00000000 824811c0 84e3fe20 00000000 80476f54 00000000 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542438] 1e20: 00000dc0 80363a5c 00000000 00000000 00000000 000081d6 00000000 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542462] 1e40: 3c8f6260 5925aa36 00000004 81ff4894 00000000 00000000 00000000 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542485] 1e60: 00000000 00000000 00000000 00000000 00000000 f05487c7 00000000 9f981f68
Oct 30 09:08:13 [emerg] kernel: [ 1220.542509] 1e80: 81be10a0 84e3fe00 81be10a0 00000000 82723a80 81d65c00 824811c0 80438b9c
Oct 30 09:08:13 [emerg] kernel: [ 1220.542533] 1ea0: 00000000 00000000 0000000f f05487c7 000007ff 40020000 7ecde658 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542556] 1ec0: 00000000 82723af0 00000001 00000000 00000000 00000000 00000000 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542580] 1ee0: 00000000 00000000 00000000 00000000 00000000 f05487c7 00000000 40020000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542604] 1f00: 82723a80 9f981f68 81be10a0 80100298 824811c0 f05487c7 824811c0 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542628] 1f20: 82723a80 9f981f68 81be10a0 81be1130 824811c0 000000d9 00000000 80396da0
Oct 30 09:08:13 [emerg] kernel: [ 1220.542652] 1f40: 824811c0 80aea768 82723a94 00f23e50 82723a82 00f23e50 00008000 82723a80
Oct 30 09:08:13 [emerg] kernel: [ 1220.542676] 1f60: 80100298 803972f8 8039700c 00000000 00000000 00000000 00f23e70 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542700] 1f80: 00008000 00000000 824811c0 f05487c7 00000000 00f23e50 00f23e50 00f23e70
Oct 30 09:08:13 [emerg] kernel: [ 1220.542724] 1fa0: 000000d9 80100288 00f23e50 00f23e50 0000000f 00f23e70 00008000 7fffffff
Oct 30 09:08:13 [emerg] kernel: [ 1220.542748] 1fc0: 00f23e50 00f23e50 00f23e70 000000d9 00000000 7ecde874 00f23e50 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542772] 1fe0: 00545bf8 7ecde688 76d9c048 76d9bf84 80000010 0000000f 00000000 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.542796]  rb_insert_color from ext4_htree_store_dirent+0xd4/0x100
Oct 30 09:08:13 [emerg] kernel: [ 1220.542835]  ext4_htree_store_dirent from htree_dirblock_to_tree+0x188/0x39c
Oct 30 09:08:13 [emerg] kernel: [ 1220.542870]  htree_dirblock_to_tree from ext4_htree_fill_tree+0xe0/0x368
Oct 30 09:08:13 [emerg] kernel: [ 1220.542899]  ext4_htree_fill_tree from ext4_readdir+0x788/0xae0
Oct 30 09:08:13 [emerg] kernel: [ 1220.542930]  ext4_readdir from iterate_dir+0x88/0x16c
Oct 30 09:08:13 [emerg] kernel: [ 1220.542962]  iterate_dir from sys_getdents64+0x6c/0x10c
Oct 30 09:08:13 [emerg] kernel: [ 1220.542991]  sys_getdents64 from __sys_trace_return+0x0/0x10
Oct 30 09:08:13 [emerg] kernel: [ 1220.543018] Exception stack(0x9f981fa8 to 0x9f981ff0)
Oct 30 09:08:13 [emerg] kernel: [ 1220.543037] 1fa0:                   00f23e50 00f23e50 0000000f 00f23e70 00008000 7fffffff
Oct 30 09:08:13 [emerg] kernel: [ 1220.543061] 1fc0: 00f23e50 00f23e50 00f23e70 000000d9 00000000 7ecde874 00f23e50 00000000
Oct 30 09:08:13 [emerg] kernel: [ 1220.543083] 1fe0: 00545bf8 7ecde688 76d9c048 76d9bf84
Oct 30 09:08:13 [emerg] kernel: [ 1220.543103] Code: e5932000 e3120001 112fff1e e52de004 (e592c004) 
Oct 30 09:08:13 [warning] kernel: [ 1220.543147] ---[ end trace 0000000000000000 ]---

Could this still be caused by under-voltage?

@pelwell
Contributor

pelwell commented Oct 30, 2024

Yes - this crash is in ext4 filesystem code, which is very well tested and unrelated to the test you are doing. I bet that if you repeat the same test and get another crash, it will be somewhere else (although there may be "hot spots" because of the way the code is structured).
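
One way to check that prediction is to tabulate where each oops landed. The sketch below pulls the process name and faulting symbol out of log lines in the format posted in this issue. The function name and regular expressions are illustrative assumptions, not an existing tool.

```python
import re
from collections import Counter

COMM_RE = re.compile(r"Comm: (\S+)")
PC_RE = re.compile(r"PC is at ([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def crash_sites(log_text: str) -> list:
    """Return (process, faulting symbol) pairs, one per oops in the log."""
    sites = []
    comm = None
    for line in log_text.splitlines():
        match = COMM_RE.search(line)
        if match:
            # The "CPU: ... Comm: ..." line precedes "PC is at ..." in each oops.
            comm = match.group(1)
            continue
        match = PC_RE.search(line)
        if match:
            sites.append((comm, match.group(1)))
            comm = None
    return sites


# Lines copied from the three crash reports in this issue.
SAMPLE = """\
[ 2922.738787] CPU: 0 PID: 487 Comm: python3 Tainted: G           O       6.6.56-v7 #1
[ 2922.738841] PC is at memcg_charge_kernel_stack+0xc/0x9c
[62993.490690] CPU: 0 PID: 145 Comm: systemd-udevd Tainted: G           O       6.6.50-v7 #1
[62993.490721] PC is at __kmem_cache_alloc_node+0xc4/0x4c8
[ 1220.541716] CPU: 0 PID: 150 Comm: systemd-udevd Tainted: G           O       6.6.57-v7 #1
[ 1220.541747] PC is at rb_insert_color+0x1c/0x170
"""

if __name__ == "__main__":
    for (comm, symbol), count in Counter(crash_sites(SAMPLE)).items():
        print(f"{count}x {symbol} (in {comm})")
```

On the three reports so far this yields three different faulting symbols, although the two systemd-udevd crashes both pass through ext4_htree_store_dirent, which may be one of the "hot spots" mentioned above.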

@pmessan
Author

pmessan commented Oct 30, 2024

We have a Z2W (with BLE peripheral as described previously) connected to 5V 3A power supply via 3m micro USB cable from Amazon Basics.

Point of correction: I just got confirmation that the test was done with a 1m USB cable.

If this is the case, how can under-voltage be occurring in this scenario?

@pelwell
Contributor

pelwell commented Oct 30, 2024

I'm not a hardware engineer, so there may be some small inaccuracies in what I say, but the basic idea is correct.

The job of the PMIC is to keep the voltage rails at constant levels by pumping charge/adjusting the current when it sees the voltage start to drift. When there is a sudden change in current demand it takes a while for the PMIC to notice and increase the current. If the sudden change is large the PMIC may not be able to keep up with demand, and the voltage drops.

However, it can be worse than that if 1) there isn't enough current available from the supply, or 2) the amount of current that the PMIC can control is limited. It sounds as if 1) may not be the case now, but you do need to take account of voltage drop in the cable - particularly if the cable is not permanently attached. Our power supplies are pre-compensated - they run at 5.1V (say) so that after the loss in the cable 5V is delivered to the Pi. But even if 1) isn't happening, 2) can still be a limiting factor, particularly if you have unpowered peripherals drawing more current.
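
The cable loss can be estimated with Ohm's law. In the sketch below, the per-metre resistance is an assumed figure for thin conductors, not a measurement of the cable used in these tests, and the function name is illustrative.

```python
def voltage_at_pi(supply_v: float, current_a: float, cable_m: float,
                  ohm_per_m: float) -> float:
    """Voltage left at the Pi after the round-trip drop in the cable.

    Current flows out on the 5V wire and back on the ground wire, so the
    conductor length is twice the cable length.
    """
    drop = current_a * ohm_per_m * 2 * cable_m
    return supply_v - drop


# ASSUMPTION: 0.2 ohm/m, roughly the resistance of thin (about 28 AWG) copper
# power conductors. Real cables vary, and connectors add further resistance.
ASSUMED_OHM_PER_M = 0.2

if __name__ == "__main__":
    for length_m in (1.0, 3.0):
        for supply_v in (5.0, 5.1):
            volts = voltage_at_pi(supply_v, 1.0, length_m, ASSUMED_OHM_PER_M)
            print(f"{supply_v:.1f} V supply, {length_m:.0f} m cable, "
                  f"1 A load: {volts:.2f} V at the Pi")
```

Under these assumed numbers the drop scales linearly with both cable length and load current, which is why a short, thick cable and a pre-compensated supply both help.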

@adam-mujaj

Hi Phil,
Since hardware issues are being considered, I want to jump in on the conversation. I'm Adam, the electrical engineering manager.
One other aspect that I see is the 3.3V power rail. We are using this to power some other components in the system. However, I cannot find a specification for how much current is available on this rail for external components. Note that in our application the onboard Wi-Fi chipset is disabled. Is there such a specification?

@pelwell
Contributor

pelwell commented Oct 31, 2024

I've not seen that kind of information - have you tried our Industrial Applications support (https://www.raspberrypi.com/contact/#get-in-touch)? - but if I were in your position and had heard the kind of things I had told you I would lash up an external 3.3V supply to power the other devices to see if it solves the instability.

@adam-mujaj

I will reach out to them to ask. And yes I was already thinking we should try applying a separate 3.3V source. The challenge is that this is not happening consistently. In fact, as far as I am aware only this one customer is seeing the crashes. The other customers have the devices working fine, including with the 1.2A supply.

@pelwell
Contributor

pelwell commented Oct 31, 2024

How many units does this one customer have? There is variation, with "fast" parts requiring more current.

@JamesH65
Contributor

JamesH65 commented Nov 1, 2024

If it's just one customer and all other things being equal, is there a chance this could be an EMC problem? Does the problematic customer have any heavy machinery starting up in the vicinity that might be spiking things?

@adam-mujaj

That's also something we have considered, but it seems unlikely to me. The facility is a large open warehouse. The only machinery is the roller doors for truck loading/unloading. It does not even appear to use forklifts or anything like that.

This customer has four devices, all exhibiting the behaviour.

@pmessan
Author

pmessan commented Nov 4, 2024

Just got confirmation from the customer: "no large electricity consumers or other machines are used in the building, so I would rule that out." (Translated from German).

@toni-proglove

Hello,
During our investigation, we came to suspect that something might be wrong at the MMC layer.

[Image: screenshot showing the CMD53 errors]

As you can see, there are a lot of CMD53 read/write errors. As mentioned in the first post, our external WiFi chip is connected to the GPIO lines (SDIO bus). Do you think this might be related to our crashing issue?

I have also tried enabling CONFIG_MMC_DEBUG in the kernel and setting loglevel=7, but it seems I have missed something. Any advice on debugging would be helpful.

@pelwell
Contributor

pelwell commented Nov 25, 2024

110 is ETIMEDOUT, which means that the wireless chip did not respond in a timely fashion - but at an SDIO level, it's not just that the firmware didn't come up with an answer in time. This is more circumstantial evidence for a hardware problem, and that includes power problems.
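For anyone decoding these codes by hand, here is a small sketch that maps a few Linux errno values to their names. The table is deliberately tiny and `errno_name` is a hypothetical helper. The values are the generic Linux ones, and the notes on what each usually means for MMC/SDIO requests are a general rule of thumb rather than something read from this log.

```shell
# Map a few Linux errno values to their symbolic names.
# Accepts the value with or without a leading minus sign,
# since the kernel reports errors as negative numbers.
errno_name() {
    case "${1#-}" in
        5)   echo EIO ;;        # generic I/O error
        84)  echo EILSEQ ;;     # for MMC requests, typically a CRC/format error
        110) echo ETIMEDOUT ;;  # for MMC requests, no response in time
        *)   echo "unknown ($1)" ;;
    esac
}

errno_name -110
```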

@toni-proglove

Thanks for the quick response. Any advice on debugging the SDIO or MMC layer? I am asking because I would like to take some parts out of the equation.
As power problems were mentioned, what is your opinion on setting over_voltage to 3 or 4? Would that help?

@pelwell
Contributor

pelwell commented Nov 25, 2024

If the error is no response, I doubt that any kernel logging is going to tell you any more. How is the WiFi chip powered?

what is your opinion on setting over_voltage to 3 or 4? Would that help ?

That depends on the exact silicon composition of the Zero 2 W(s) in question. If vcgencmd measure_volts is already 1.4V then there's no headroom, but the chances are there is still room to go up. Maxing out the voltage is not going to damage the Pi, but running for extended periods at an unnecessarily high voltage could potentially shorten its life.

@toni-proglove

The output of vcgencmd measure_volts is volt=1.3563V
As for the WiFi chip, it is connected to RPi 3V3 rail.
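For reference, the headroom between that reading and the 1.4V figure mentioned above can be worked out as follows. This is a minimal sketch: `volt_headroom_mv` is a hypothetical helper, and the 1.4V ceiling is just the number quoted in this thread, passed as a default that can be overridden.

```shell
# Print the headroom in millivolts between a `vcgencmd measure_volts`
# reading (e.g. "volt=1.3563V") and a ceiling voltage (default 1.4).
volt_headroom_mv() {
    reading=${1#volt=}    # strip the "volt=" prefix
    reading=${reading%V}  # strip the trailing "V"
    awk -v v="$reading" -v max="${2:-1.4}" \
        'BEGIN { printf "%.1f\n", (max - v) * 1000 }'
}

volt_headroom_mv "volt=1.3563V"
```

On the device itself this could be run as `volt_headroom_mv "$(vcgencmd measure_volts)"`. For the reading above it gives 43.7 mV.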

@pelwell
Contributor

pelwell commented Nov 25, 2024

Have you disabled the onboard WiFi chip? If not,

dtoverlay=disable-wifi
dtoverlay=disable-bt

should save some power.

@toni-proglove

Yes, both are disabled.

@pelwell
Contributor

pelwell commented Nov 25, 2024

In that case you may have to look at providing a separate 3.3V rail, if adding a few voltage steps doesn't help.

@toni-proglove

I ran an experiment, adding an artificial load across the 3V3 rail.
Before connecting the HAT to pi zero 2w: 3.277V
After adding the HAT to pi zero 2w: 3.276V
After adding the additional load in form of a 2R resistor: 3.258V

The Pi continued with its normal operation.

As you can see, the HAT is not loading the 3V3 rail that much.
If I increase the load by switching from 2R to 1R, the voltage drop is big. The 3V3 rail falls to around 2.5 to 2.7V and then the Pi reboots, which is expected, as most likely the brown-out reset kicks in.

@pelwell
Contributor

pelwell commented Nov 27, 2024

That won't show you the response to sudden changes in load, which can be worse.

@toni-proglove

Could you provide some hints on debugging sdhci or mmc? Simply enabling CONFIG_MMC_DEBUG and setting loglevel to 7 does not work.

@pelwell
Contributor

pelwell commented Nov 29, 2024

# To enable mmc tracing
$ echo 1 | sudo tee /sys/kernel/debug/tracing/events/mmc/enable
# Do something, then view the output with:
$ sudo cat /sys/kernel/debug/tracing/trace
# To clear the current trace
$ echo -n | sudo tee /sys/kernel/debug/tracing/trace
# To disable tracing
$ echo 0 | sudo tee /sys/kernel/debug/tracing/events/mmc/enable
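Once a trace has been captured, something like the following sketch could narrow it down to the failing CMD53 transfers. `filter_cmd53_errors` is a hypothetical helper, and the field names it matches (`cmd_opcode=`, `cmd_err=`, `data_err=`) are an assumption about the mmc_request_done trace event format, so check them against a line from your own trace first.

```shell
# Read mmc trace text on stdin and print only the CMD53 request lines
# that report a negative command or data error code.
# Assumed field names: cmd_opcode=, cmd_err=, data_err=.
filter_cmd53_errors() {
    grep 'cmd_opcode=53 ' | grep -E '(cmd_err|data_err)=-[0-9]+'
}

# Example use on the device (needs root and debugfs mounted):
#   sudo cat /sys/kernel/debug/tracing/trace | filter_cmd53_errors
```

Lines with an error value of 0 are dropped, so the output should contain only the failed transfers and their timestamps.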
