Diagnosing a Silent Crash with No Logs

Node-A went down on a Sunday morning and left me nothing.

No kernel panic. No crash dump. No syslog entries from the window that mattered. The machine was just... off. And when I brought it back up, the logs picked up exactly where the last sync had left them — because I was running log2ram, which keeps /var/log in RAM and only flushes it to disk periodically and on clean shutdown. A hard lockup is not a clean shutdown.

This is the story of how I diagnosed a silent Proxmox host crash using only external telemetry, and what I found.


The Setup

Node-A — the Millennium Falcon — is my AI/ML compute node. It runs an NVIDIA RTX 4000 Ada passed through to a VM via VFIO/IOMMU for local LLM inference and RAG workloads. That detail matters.
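
For context, passthrough like this is typically wired up by binding the card to vfio-pci at boot. A sketch of the modprobe config, with placeholder PCI IDs rather than my card's actual ones:

```
# /etc/modprobe.d/vfio.conf -- vendor:device IDs below are placeholders.
# Find the real ones with: lspci -nn | grep -i nvidia
options vfio-pci ids=10de:aaaa,10de:bbbb
```

The point of the binding is that the host kernel's vfio-pci driver, not the NVIDIA driver, owns the device whenever the VM isn't using it. That detail comes back later.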

The broader observability stack across the cluster is:

  • Telegraf agents on every host, collecting system metrics every 10 seconds
  • InfluxDB 2.x on Node-B (192.168.20.41), org: TheAlliance, bucket: telegraf
  • Grafana on Node-B (192.168.20.40), dashboards pulling from InfluxDB
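
The Telegraf side of that stack is only a few lines of configuration. A sketch of the relevant parts of telegraf.conf, assuming stock plugin names, the default InfluxDB port, and a token supplied via environment variable:

```toml
[agent]
  interval = "10s"            # the 10-second collection cadence

[[inputs.cpu]]
[[inputs.mem]]
[[inputs.system]]             # provides the "uptime" field used later

[[outputs.influxdb_v2]]
  urls = ["http://192.168.20.41:8086"]
  token = "${INFLUX_TOKEN}"   # placeholder; real token lives in the environment
  organization = "TheAlliance"
  bucket = "telegraf"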

When Node-A locked up, this stack kept running on the other nodes. InfluxDB kept receiving data from everything except FCM2250 (Node-A's hostname) — and that gap in the time-series data was the first clue.


What I Had (and Didn't Have)

Available:

  • InfluxDB metrics from FCM2250 up to the moment of lockup
  • Metrics from the other nodes (confirming the stack was healthy)
  • Post-reboot system uptime — which I could use as a reference point

Not available:

  • /var/log/syslog from the crash window — flushed to RAM, never written
  • /var/log/kern.log — same problem
  • Any kernel panic output — there wasn't one
  • MCE (Machine Check Exception) logs — nothing

log2ram is a useful optimization in normal operation. In a hard lockup scenario it erases your evidence. Lesson learned.


Finding the Exact Crash Timestamp

The key insight: InfluxDB records an uptime field under the system measurement. If I can find the last datapoint where uptime was still climbing — and then find the first datapoint after reboot — I can bracket the crash to within seconds.

The Flux query to find the last pre-crash datapoint:

from(bucket: "telegraf")
  |> range(start: 2026-02-09T15:50:00Z, stop: 2026-02-09T16:00:00Z)
  |> filter(fn: (r) => r._measurement == "system")
  |> filter(fn: (r) => r._field == "uptime")
  |> filter(fn: (r) => r.host == "FCM2250")
  |> tail(n: 10)

The last datapoint returned: 15:54:50 UTC — 7:54:50 AM Pacific.

After that timestamp, silence. The next FCM2250 datapoint in InfluxDB was after the manual reboot, with a fresh low uptime value. The crash window was confirmed.
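
The arithmetic for turning that fresh uptime value into a boot time is trivial, but worth spelling out: the host booted roughly "uptime" seconds before the first post-reboot sample. The post-reboot timestamp and uptime below are placeholders for illustration, not the real values from my cluster:

```shell
# Last sample before the lockup (real) and first sample after reboot
# (placeholder timestamp and uptime value, for illustration only).
last_pre_crash="2026-02-09T15:54:50Z"
first_post="2026-02-09T17:30:40Z"
first_post_uptime=95   # seconds of uptime reported in that first sample

# The host booted that many seconds before the first sample (GNU date),
# so the outage is bracketed by last_pre_crash and boot.
boot=$(date -u -d "$first_post - $first_post_uptime seconds" +%Y-%m-%dT%H:%M:%SZ)
echo "crash after : $last_pre_crash"
echo "booted again: $boot"
```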


What Was the System Doing?

With the exact timestamp known, I could pull the full system state from the 5 minutes leading up to the crash:

from(bucket: "telegraf")
  |> range(start: 2026-02-09T15:49:00Z, stop: 2026-02-09T15:55:00Z)
  |> filter(fn: (r) => r.host == "FCM2250")
  |> filter(fn: (r) => r._measurement == "cpu" or
                       r._measurement == "mem" or
                       r._measurement == "system")

Results:

Metric                       Value at crash
CPU idle                     99.8%
Memory used                  7.4%
Load average                 ~0.02
Active network connections   minimal
Disk I/O                     negligible

The system was completely idle. No workload. No scheduled job. Nothing that would explain a crash from the software side.

This ruled out:

  • OOM condition
  • CPU thermal throttle or overload
  • A runaway process or VM consuming resources


Ruling Out Software Causes

With logs gone and metrics showing idle, I went looking for anything that could explain a hard lockup without generating a kernel panic.

MCE errors? — No MCE log entries survived. The absence isn't conclusive on its own, but combined with the idle state and clean memory metrics, hardware-level CPU/memory faults moved down the list.

Kernel bug? — Possible, but an idle system with nothing unusual in the pre-crash metrics makes a random kernel crash unlikely.

VFIO/IOMMU? — This was the one active subsystem doing something non-trivial. Node-A had the RTX 4000 Ada configured for passthrough, and even when the GPU VM isn't running, the vfio-pci driver still owns the device at the host level.

This is a known failure mode.


The Root Cause: PCIe Bus Stall from VFIO Passthrough

After digging into the pattern, this fits the documented behavior of certain GPU + VFIO combinations:

The GPU enters a low-power PCIe ASPM (Active State Power Management) state. The host attempts to wake it. The device either doesn't respond or responds in a way that stalls the PCIe bus. The kernel — unable to recover — freezes the entire host. No panic, no log. Just silence.
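
You can see whether ASPM is in play for a given device from lspci. The slot address and the canned output below are examples of what the two relevant lines look like, not my host's actual state:

```shell
# On the live host (01:00.0 is an example slot; find yours via lspci):
#   lspci -vv -s 01:00.0 | grep -E 'LnkCap:|LnkCtl:'
# Canned sample of the two lines that matter, so this snippet runs anywhere.
# LnkCap advertises what the link supports; LnkCtl shows what's enabled.
sample='LnkCap: Port #0, Speed 16GT/s, Width x16, ASPM L0s L1
LnkCtl: ASPM L1 Enabled; RCB 64 bytes'
printf '%s\n' "$sample" | grep -c 'ASPM'   # both lines mention ASPM
```

If LnkCtl reports ASPM enabled on a passed-through GPU's link, the stall scenario described above is on the table.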

The RTX 4000 Ada is not uniquely problematic here. This behavior has been observed across a range of Ada Lovelace and Ampere GPUs in VFIO configurations, particularly when the host has ASPM enabled and the GPU is in a low-power state between VM sessions.

The contributing factors in my setup:

  • GPU in VFIO passthrough with ASPM active
  • System fully idle (no workload keeping the GPU busy or the PCIe link active)
  • log2ram ensuring no crash evidence survived


The Fix

Two kernel parameters added to GRUB:

pcie_aspm=off pci=noaer

  • pcie_aspm=off — disables PCIe Active State Power Management entirely, preventing the low-power state transitions that trigger the stall
  • pci=noaer — disables PCIe Advanced Error Reporting, which can cause the kernel to take fatal action on recoverable errors

Editing GRUB on Proxmox (Debian):

nano /etc/default/grub

Find the GRUB_CMDLINE_LINUX_DEFAULT line and append both parameters:

GRUB_CMDLINE_LINUX_DEFAULT="quiet pcie_aspm=off pci=noaer"

Then update GRUB and reboot:

update-grub
reboot

Gotcha: I hit a syntax error the first time — I had a stray quote in the GRUB config that prevented the parameters from applying. If your node comes back up and the parameters aren't showing in /proc/cmdline, check for syntax issues in /etc/default/grub before assuming the fix didn't work.

Verify after reboot:

cat /proc/cmdline

You should see pcie_aspm=off pci=noaer in the output.
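
If you'd rather script the check than eyeball it, a small loop works. The cmdline value here is a canned sample so the snippet is self-contained; on a real host, read it from /proc/cmdline instead:

```shell
# Canned sample; on a real host use: cmdline=$(cat /proc/cmdline)
cmdline="quiet pcie_aspm=off pci=noaer"

# Check for each expected parameter as a whole space-delimited token.
for p in pcie_aspm=off pci=noaer; do
  case " $cmdline " in
    *" $p "*) echo "$p: present" ;;
    *)        echo "$p: MISSING" ;;
  esac
done
```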


What I Changed After

Beyond the kernel parameters, two operational changes:

1. Disabled log2ram on Node-A

The performance optimization isn't worth losing crash evidence. On a node running VFIO passthrough with a known-problematic failure mode, I need kernel logs to survive hard lockups.

systemctl disable log2ram
systemctl stop log2ram

2. Added uptime monitoring to Grafana

I already had the data — I just wasn't watching it visually. I added a panel tracking uptime across all three nodes so a sudden drop to zero is immediately obvious on the dashboard, not something I have to go query for after the fact.
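
The panel query is a small variation on the earlier ones. A sketch, where v.timeRangeStart, v.timeRangeStop, and v.windowPeriod are the standard Grafana dashboard variables for a Flux datasource:

```flux
from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "system")
  |> filter(fn: (r) => r._field == "uptime")
  |> aggregateWindow(every: v.windowPeriod, fn: last, createEmpty: false)
```

With no host filter, this yields one series per node; a drop to zero (or a gap) on any of them is the crash signature.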


What This Demonstrated

The VFIO lockup itself is a known and manageable problem. What made this incident interesting — and worth documenting — was the diagnostic process:

The observability stack I built for general monitoring became the only way to investigate this crash. Without Telegraf writing metrics to an external InfluxDB every 10 seconds, I would have had nothing. No timestamp, no system state, no starting point for root cause analysis.

That's the argument for external telemetry. Not because your system will crash, but because when it does, you want the evidence to be somewhere other than the thing that just died.


Quick Reference

Item                 Detail
Affected node        Node-A / FCM2250 / Millennium Falcon
Crash timestamp      2026-02-09 15:54:50 UTC
Root cause           PCIe bus stall — RTX 4000 Ada in VFIO passthrough
Diagnostic source    External telemetry (Telegraf/InfluxDB)
Fix                  pcie_aspm=off pci=noaer kernel parameters
Secondary fix        log2ram disabled on Node-A

Full forensic writeup and supporting Flux queries available in the technical-writeups repository.