Kernel Crashes on Raspberry Pi Zero 2 W #6438
Comments
It's almost certainly experiencing under-voltage. What is the power source, and how are the attached peripherals powered? |
In most cases, the Pi is powered via a micro USB cable connected to the USB port on the Pi and a USB 2.0/3.0 port on a laptop. We had one case where the Pi was connected via an AmazonBasics USB cable to a USB power adapter rated 5.0 V | 1.2 A | 6 W. By attached peripherals I assume you mean the external wireless chip. It is wired to the 1.8V pin of the Pi for power. I have some follow-up questions:
|
1.2A is criminally low.
Which 1.8V pin is that?
Because the failures are in common, non-Pi-specific code, are different every time, and they happen under load. And because I've seen it before.
You could try monitoring the supply rails, but you may find that the dips are too short and too localised to detect.
The limitation is in how much power can be routed through the Z2W's power infrastructure. You may need an external source of 1.8V that is derived directly from the (better) power supply, not via the Z2W. |
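Alongside monitoring the rails, the firmware's own under-voltage flags can be read with `vcgencmd get_throttled`. The sketch below decodes that value; the helper name `decode_throttled` and the sample value are hypothetical, and the bit meanings are taken from the Raspberry Pi documentation. A clean reading does not rule out dips that are too short or too localised for the detector to catch.

```python
# Bit positions in the value reported by `vcgencmd get_throttled`.
THROTTLED_BITS = {
    0: "under-voltage detected",
    1: "ARM frequency capped",
    2: "currently throttled",
    3: "soft temperature limit active",
    16: "under-voltage has occurred",
    17: "ARM frequency capping has occurred",
    18: "throttling has occurred",
    19: "soft temperature limit has occurred",
}


def decode_throttled(output: str) -> list[str]:
    """Decode a line such as 'throttled=0x50005' into readable flags."""
    # Take the text after '=' and parse it as hexadecimal.
    value = int(output.strip().split("=", 1)[1], 16)
    return [name for bit, name in sorted(THROTTLED_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # Illustrative value only, not taken from the reporter's device.
    for flag in decode_throttled("throttled=0x50005"):
        print(flag)
```

The bits from 16 upwards are sticky since boot, so checking them after a crash-prone test run is more useful than polling the live bits.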
You could try with |
😅
Apologies, I got my facts wrong, it seems we are drawing from 5V_USB instead
Thanks for the tip, will keep this in mind for future use 😄
Thanks, will try this not expecting too much. Changing our PSUs would be a more impactful start IMO based on this info |
Good luck. |
Hello again,
|
With over_voltage=2 you are probably at the limit of what the on-board PMIC can deliver. You can try going to over_voltage=4, etc., but in practice it will be clamped. What does |
$ vcgencmd get_config over_voltage_avs
over_voltage_avs=6250
And what if the peripherals are drawing power from the Z2W's on-board circuitry (USB)? |
I can't quantify how much power is available for external peripherals, a) because I've not done the measurements, and b) because it varies from chip to chip according to the natural variability you get across the silicon wafers, but the short answer is you may not be fine. |
Hello, We've experienced another instance of a crash on kernel 6.6.57. There were 2 BLE devices connected via this peripheral, with no active data transmission. There was also a connection established to a TCP server, but no data was coming in. We're not 100% sure how heavily the 4 cores were loaded, but the core temperature measured a few minutes before was ~38 deg C. Logs
Could this still be caused by undervoltage? |
Yes - this crash is in ext4 filesystem code which is very well tested and unrelated to the test you are doing. I bet if you repeat the same test and get another crash that it will be somewhere else (although there may be "hot spots" because of the way the code is structured). |
Point of correction, I just got confirmation that the test was done with a 1m USB cable. If this is the case, how can under-voltage be occurring in this scenario? |
I'm not a hardware engineer, so there may be some small inaccuracies in what I say, but the basic idea is correct. The job of the PMIC is to keep the voltage rails at constant levels by pumping charge/adjusting the current when it sees the voltage start to drift. When there is a sudden change in current demand it takes a while for the PMIC to notice and increase the current. If the sudden change is large the PMIC may not be able to keep up with demand, and the voltage drops. However, it can be worse than that if 1) there isn't enough current available from the supply, or 2) the amount of current that the PMIC can control is limited. It sounds as if 1) may not be the case now, but you do need to take account of voltage drop in the cable - particularly if the cable is not permanently attached. Our power supplies are pre-compensated - they run at 5.1V (say) so that after the loss in the cable 5V is delivered to the Pi. But even if 1) isn't happening, 2) can still be a limiting factor, particularly if you have unpowered peripherals drawing more current. |
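The cable-loss point can be made concrete with Ohm's law. This is a minimal sketch: the function name `voltage_at_pi` and the 0.25 ohm round-trip cable resistance are assumptions for illustration, not measurements from this thread, and real cables vary widely.

```python
def voltage_at_pi(supply_volts: float, current_amps: float, cable_ohms: float) -> float:
    """Voltage left at the Pi's input after the cable's resistive drop (V = I * R)."""
    return supply_volts - current_amps * cable_ohms


if __name__ == "__main__":
    # Hypothetical round-trip cable resistance (both conductors plus connectors).
    cable_ohms = 0.25
    # Compare a plain 5.0 V supply with a pre-compensated 5.1 V one at 1.2 A.
    for supply in (5.0, 5.1):
        at_pi = voltage_at_pi(supply, 1.2, cable_ohms)
        print(f"{supply:.1f} V supply -> {at_pi:.2f} V at the Pi")
```

This only models the steady-state drop. The transient dips described above come on top of it and are not captured by this arithmetic.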
Hi Phil, |
I've not seen that kind of information - have you tried our Industrial Applications support (https://www.raspberrypi.com/contact/#get-in-touch)? - but if I were in your position and had heard the kind of things I had told you I would lash up an external 3.3V supply to power the other devices to see if it solves the instability. |
I will reach out to them to ask. And yes I was already thinking we should try applying a separate 3.3V source. The challenge is that this is not happening consistently. In fact, as far as I am aware only this one customer is seeing the crashes. The other customers have the devices working fine, including with the 1.2A supply. |
How many units does this one customer have? There is variation, with "fast" parts requiring more current. |
If it's just one customer and all other things being equal, is there a chance this could be an EMC problem? Does the problematic customer have any heavy machinery starting up in the vicinity that might be spiking things? |
That's also something we have considered, but it seems unlikely to me. The facility is a large open warehouse. The only machinery is the roller doors for truck loading/unloading. It does not even appear to use forklifts or anything like that. This customer has four devices, all exhibiting the behaviour. |
Just got confirmation from the customer: "no large electricity consumers or other machines are used in the building, so I would rule that out." (Translated from German). |
Hello, As you can see, there are a lot of cmd53 write/read errors. As mentioned in the first post, our external WiFi chip is connected to the GPIO lines (SDIO bus). Do you think it might be related to our crashing issue? I have also tried enabling CONFIG_MMC_DEBUG in the kernel and setting loglevel=7, but it seems I have missed something. Any advice on debugging would be helpful. |
110 is ETIMEDOUT, which means that the wireless chip did not respond in a timely fashion - but at an SDIO level, it's not just that the firmware didn't come up with an answer in time. This is more circumstantial evidence for a hardware problem, and that includes power problems. |
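For reading such error codes in logs, a small lookup can help. This is a sketch under stated assumptions: the table holds only a few Linux errno values that commonly show up from the MMC/SDIO layer, and the helper name `errno_name` is hypothetical. The values are listed explicitly rather than taken from Python's `errno` module, because that module reflects the machine running the script, which may not be Linux.

```python
# Linux errno values (asm-generic numbering), listed explicitly so the lookup
# does not depend on the platform running this script.
LINUX_ERRNO = {
    5: "EIO",          # generic I/O error
    84: "EILSEQ",      # used by the MMC layer for CRC errors
    110: "ETIMEDOUT",  # no response within the timeout
}


def errno_name(code: int) -> str:
    """Map a kernel-style return code (often printed as a negative number) to its name."""
    return LINUX_ERRNO.get(abs(code), f"unknown ({code})")


if __name__ == "__main__":
    print(errno_name(-110))
```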
Thanks for the quick response. Any advice on debugging the SDIO or MMC layer? I am asking because I would like to take some parts out of the equation. |
If the error is no response, I doubt that any kernel logging is going to tell you any more. How is the WiFi chip powered?
That depends on the exact silicon composition of the Zero 2 W(s) in question. If |
The output of vcgencmd measure_volts is volt=1.3563V |
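To log this reading over time rather than take a single sample, the output can be parsed as sketched below. The helper names `parse_volts` and `read_volts` are hypothetical. `read_volts` assumes `vcgencmd` is on the PATH and only works on a Pi, and a polling loop built on it will still miss dips shorter than the sampling interval.

```python
import re
import subprocess


def parse_volts(output: str) -> float:
    """Parse a line such as 'volt=1.3563V' into a float number of volts."""
    match = re.fullmatch(r"volt=([0-9]+(?:\.[0-9]+)?)V", output.strip())
    if match is None:
        raise ValueError(f"unexpected measure_volts output: {output!r}")
    return float(match.group(1))


def read_volts(rail: str = "core") -> float:
    """Query the firmware for one rail; only works on a Raspberry Pi."""
    result = subprocess.run(
        ["vcgencmd", "measure_volts", rail],
        capture_output=True, text=True, check=True,
    )
    return parse_volts(result.stdout)


if __name__ == "__main__":
    # Parse the value quoted in the comment above; no hardware access needed.
    print(parse_volts("volt=1.3563V"))
```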
Have you disabled the onboard WiFi chip? If not,
should save some power. |
Yes, both are disabled. |
In that case you may have to look at providing a separate 3.3V rail, if adding a few voltage steps doesn't help. |
I did an experiment by adding an artificial load in parallel on the 3V3 rail. The Pi continued with its normal operation. As you can see, the HAT is not loading the 3V3 rail that much. |
That won't show you the response to sudden changes in load, which can be worse. |
Could you provide some hints on debugging sdhci or mmc? Simply enabling CONFIG_MMC_DEBUG and setting loglevel to 7 does not work. |
|
Describe the bug
We have been experiencing kernel crashes on our Zero 2 W devices at seemingly random instances. We are not quite sure what triggers these crashes, but we see them more often (but not always) when we perform system tests stressing the BLE subsystem and the USB gadget subsystem at the same time (HID / CDC output). We have recorded at least 3 crashes on kernel version 6.6.50 and at least one crash on kernel version 6.6.56.
We are using an external wireless chip connected to the Zero 2 W via GPIO, and have confirmed with the chip provider that the crash does not originate from their kernel driver.
Steps to reproduce the behaviour
Device(s)
Raspberry Pi Zero 2 W
System
$ uname -a
Linux 6.6.56-v7 #1 SMP Fri Oct 11 11:14:35 UTC 2024 armv7l GNU/Linux
$ vcgencmd version
Sep 13 2024 16:00:14
Copyright (c) 2012 Broadcom
version ddfba3e3c234500025b545512b4b214f28e453e9 (clean) (release) (start_cd)
$ cat /etc/rpi-issue
Raspberry Pi reference 2024-03-15
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 11096428148f0f2be3985ef3126ee71f99c7f1c2, stage2
Logs
From the device running the 6.6.56 kernel:
From a device running with a 6.6.50 kernel
Additional context
No response