Describe the bug
On a Raspberry Pi 5, the system occasionally enters a complete freeze under high NVMe I/O load.
The device remains reachable via ICMP (ping works), but:
- SSH becomes unresponsive
- systemd-journald stops processing logs
- no further kernel messages are emitted
- the system does not recover without a power cycle
The issue occurs very rarely (weeks/months of stable uptime), but is reproducible by triggering heavy I/O, e.g.:
- scrubbing a long Frigate video timeline (NVMe read-heavy)
- querying a full year of data in openHAB backed by InfluxDB
This looks like an I/O or PCIe/NVMe stall rather than a userspace crash or OOM condition.
Steps to reproduce the behaviour
- Raspberry Pi 5 running Raspberry Pi OS (64-bit), kernel 6.6.x+rpt-rpi-2712
- System booted from NVMe (via Pineboards AI Bundle, M-Key)
- Run Docker containers including:
  - Frigate (video recordings on NVMe)
  - InfluxDB
  - openHAB
  - RaspberryMatic / OpenCCU
- Trigger high sustained NVMe load, e.g.:
  - Scrub through a long Frigate video timeline
  - Query a large time range (e.g. 1 year) in openHAB that hits InfluxDB
- After some time (minutes), the system freezes:
  - ping still works
  - SSH and journald hang
  - no recovery without power cycle
The issue does not happen under normal load and may require weeks to reoccur.
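For reproduction without waiting on Frigate/InfluxDB, something like the fio run below might approximate the same sustained NVMe read pressure. This is only a sketch and unverified as a trigger; it assumes fio is installed, and the file name, size, and job parameters are arbitrary:

```
# Hypothetical synthetic load: sustained direct random reads against a file on
# the NVMe root filesystem, roughly mimicking timeline scrubbing / large queries.
# Assumes fio is installed (sudo apt install fio); all parameters are arbitrary.
sudo fio --name=nvme-read-stress \
         --filename=/var/tmp/fio-testfile \
         --size=8G \
         --rw=randread \
         --bs=128k \
         --iodepth=32 \
         --ioengine=libaio \
         --direct=1 \
         --runtime=600 \
         --time_based \
         --numjobs=4
```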
Device(s)
Raspberry Pi 5
System
Hardware:
- Raspberry Pi 5 Model B
- Pineboards AI Bundle (NVMe M-Key + E-Key, incl. Hailo 8L – not actively used)
- NVMe SSD (used as root filesystem and for Docker data)
Kernel:
- 6.6.74+rpt-rpi-2712 (Raspberry Pi kernel)
OS:
- Raspberry Pi OS 64-bit (Bookworm)
Storage:
- Root filesystem on NVMe
USB devices:
- Homematic RF stick (directly connected via USB extension cable)
- Zigbee + Z-Wave + Amber dongles (via active USB hub)
  - the issue also occurred previously with a passive hub
Containers (Docker):
- openHAB
- InfluxDB
- Frigate
- RaspberryMatic / OpenCCU
Logs
Relevant observations from logs before/during freezes:
- NVMe timeouts and aborts:
  nvme nvme0: I/O timeout, aborting
  I/O error, dev nvme0n1
- Journald overwhelmed or stalled:
  /dev/kmsg buffer overrun, some messages lost
  systemd-journald watchdog timeout
- No classic OOM signature before the freeze
- No soft-lockup watchdog messages
- After reboot, previous journal sometimes marked as uncleanly shut down
Full logs around the freeze are attached to this issue.
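If useful, I can also attach NVMe health and controller error-log data; the commands below are what I would run (assuming nvme-cli and smartmontools are installed, and that /dev/nvme0 is the correct device on this system):

```
# Assumes nvme-cli and smartmontools are installed
# (sudo apt install nvme-cli smartmontools).
sudo nvme smart-log /dev/nvme0   # SMART/health data for the controller
sudo nvme error-log /dev/nvme0   # controller error-log entries (aborted commands etc.)
sudo smartctl -a /dev/nvme0      # full SMART report as a cross-check
```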
Additional context
- The system can run perfectly stably for weeks or months
- The freeze correlates strongly with high NVMe I/O pressure
- USB power issues are unlikely:
  - an active USB hub is in use
  - the freeze still occurs
- Ping remaining functional suggests:
  - the kernel is still partially alive
  - likely blocked kernel threads (I/O / PCIe / NVMe path)
- Similar behaviour has been observed with both Frigate (video I/O) and InfluxDB (large queries)
If helpful, I can:
- test kernel parameters (e.g. NVMe power state limits; see the sketch below)
- enable hung task panic for better diagnostics
- provide additional traces on the next occurrence
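For the first two points, this is roughly what I would try. It is only a sketch: the specific values are my own assumptions rather than recommended settings, and the hung-task sysctls only exist if hung task detection is compiled into the kernel.

```
# Disable NVMe APST and PCIe ASPM via the kernel command line.
# On Raspberry Pi OS Bookworm this is /boot/firmware/cmdline.txt; the options
# are appended to the existing single line (values are assumptions, not tuned):
#   nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

# Make a hung task panic instead of silently stalling, so the next freeze
# leaves a stack trace of the blocked I/O path (timeout value is arbitrary):
sudo sysctl -w kernel.hung_task_timeout_secs=120
sudo sysctl -w kernel.hung_task_panic=1

# On demand, dump blocked (D-state) tasks to the kernel log while the system
# is still responsive, as a baseline for comparison (requires sysrq enabled):
echo w | sudo tee /proc/sysrq-trigger
```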