The Telemetry Core: Full-Stack Observability & Environmental Logic

Category: Monitoring / Data Analytics / Incident Response


Post 003.1 mapped every service to its node. But an inventory only tells you what's deployed. It doesn't tell you if it's healthy, how it's performing, or what happened when something went wrong at 7:55 on a Sunday morning.

That's the Telemetry Core's job.

This pipeline isn't dashboards for the sake of pretty graphs. It's the forensic backbone of the fleet. On February 9, 2026, the Telemetry Core was the only system that preserved evidence of a crash after every log on the affected host was destroyed.


1. The TIG Stack

Three components, 10-second granularity across all three nodes.

Telegraf (The Collector). Universal data agent on every node. Tracks CPU load, RAM usage, disk I/O, network throughput, and kernel-level stats. Pushes metrics to InfluxDB on the CR90 Corvette (Node-B) at 10-second intervals. Creates an unbroken external record of every node's vitals, even if the node itself goes dark.

InfluxDB 2.x (The Vault). Time-series database queried with the Flux language. Stores the cluster's historical telemetry on Node-B's ECC-protected ZFS storage. ECC RAM catches bit-flips before the data reaches disk; ZFS checksumming catches silent corruption once it's there. The metrics you query months later are the same metrics that were written.
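
That pairing of an external record and a queryable vault is what makes a dead node visible from the outside. A minimal sketch of a Flux deadman check; the "telegraf" bucket name and the 60-second cutoff are placeholders, not the fleet's actual config:

    import "experimental"

    // Surface any host whose most recent CPU datapoint is older than
    // 60 seconds. A node that goes dark stops pushing to Node-B, so
    // its last point ages past the cutoff and the host shows up here.
    from(bucket: "telegraf")
        |> range(start: -10m)
        |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
        |> group(columns: ["host"])
        |> last()
        |> filter(fn: (r) => r._time < experimental.subDuration(d: 1m, from: now()))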

Grafana (The Visualizer). Primary GUI for Mission Control. Cluster-wide dashboards are actively being built out.


2. Metrics vs. Logs

Two sides of the same event. I run both for a 360-degree view.

Metrics (the Pulse) are handled by the TIG Stack on Node-B. They answer how much. If a CPU spike occurs on the Millennium Falcon, the TIG stack flags it.

Logs (the Security Footprint) are handled by Wazuh SIEM on the Gozanti Cruiser (Node-C). They answer who. While TIG shows a traffic spike, Wazuh tells you if that traffic is a brute-force SSH attempt or a legitimate backup sync.

Alerting (the Tactical Response) is handled by n8n on Node-B. Wazuh detection events trigger webhooks. n8n processes them into real-time Discord notifications via Admiral Ackbar (the fleet's alert bot). Previously this pipeline also included automated IP blocking through the OPNsense API: 2,847 IPs auto-blocked, <3s mean response time. Automated enforcement is being rebuilt for the UniFi Dream Machine platform. Detection and alerting remain fully operational.
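
The n8n workflow itself is built in a GUI, so there is no config to paste here. For a feel of the detect-and-notify shape, here is the same idea expressed as a native Flux task on the metrics side instead; the bucket, the 90% threshold, and the webhook URL are all placeholders, and the real pipeline fires from Wazuh events, not disk metrics:

    import "http"
    import "json"

    option task = {name: "disk-alert-sketch", every: 1m}

    // Check the latest disk usage per host; for anything over 90%,
    // post a message to a Discord webhook (placeholder URL).
    from(bucket: "telegraf")
        |> range(start: -1m)
        |> filter(fn: (r) => r._measurement == "disk" and r._field == "used_percent")
        |> last()
        |> filter(fn: (r) => r._value > 90.0)
        |> map(fn: (r) => ({r with status: http.post(
            url: "https://discord.com/api/webhooks/PLACEHOLDER",
            headers: {"Content-Type": "application/json"},
            data: json.encode(v: {content: "${r.host}: disk at ${string(v: r._value)}%"}),
        )}))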

Combining metrics and logs enables Root Cause Analysis: seeing an event and understanding both the technical impact and the intent behind it.


3. The Proof: When Monitoring Saves the Investigation

Earlier this year, the Millennium Falcon (Node-A) hard-locked. No kernel panic. No pstore dump. No journal entries. The node was running log2ram, which holds system logs in RAM. The instantaneous failure prevented a disk sync. Nine days of logs, gone.

With nothing left on the crashed host, I pivoted to InfluxDB on the Corvette. Telegraf metrics were intact. Using Flux queries, I reconstructed a 10-second-resolution timeline of the final five minutes:

Metric           | Pre-Crash Value                           | Anomaly?
CPU idle         | 99.78–99.85% (all 24 cores)               | No
Memory used      | 7.4%                                      | No
Network traffic  | ~120 KB/s, zero errors                    | No
System uptime    | Incrementing normally until 15:54:50 UTC  | Stopped
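
A sketch of the kind of Flux query behind that table, scoped to the five minutes before the 15:54:50 UTC hard-lock; the bucket and host tag values are assumptions, and the real queries live in the writeup:

    // Pull Node-A's CPU idle across the final five minutes,
    // averaged into the stack's native 10-second windows.
    from(bucket: "telegraf")
        |> range(start: 2026-02-09T15:50:00Z, stop: 2026-02-09T15:55:00Z)
        |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
        |> filter(fn: (r) => r.host == "millennium-falcon")
        |> aggregateWindow(every: 10s, fn: mean, createEmpty: true)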

The system was idle at the moment of death. Not resource exhaustion. Not a software crash. Combined with the absence of a kernel panic or Machine Check Exception (MCE), the evidence pointed to a PCIe bus stall from the NVIDIA GPU under VFIO passthrough.

Without the Telemetry Core, the conclusion would have been "it crashed, we don't know why." With it: complete root cause analysis with documented mitigations. Full writeup in the technical-writeups repo.


4. Environmental Monitoring

Software stability depends on physical health. I don't just monitor the virtual layer.

Thermal Guard: Telegraf pulls IPMI and sensor data to track the RTX 4000's temperature and node fan speeds. Hardware throttling kills performance. If the environment approaches a thermal threshold, the Telemetry Core detects the trend before the hardware is forced to downclock. Intervention at the physical layer before the application layer feels the impact.
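
In Flux terms, trend detection is a derivative, not a snapshot. A sketch, assuming the IPMI data lands in an "ipmi_sensor" measurement with a "gpu_temp" sensor name (both assumptions, as is the 0.5 °C/min line):

    // Fire when the GPU is heating faster than 0.5 degC per minute:
    // catch the climb early, not the throttle event after it.
    from(bucket: "telegraf")
        |> range(start: -15m)
        |> filter(fn: (r) => r._measurement == "ipmi_sensor" and r._field == "value")
        |> filter(fn: (r) => r.name == "gpu_temp")
        |> derivative(unit: 1m, nonNegative: true)
        |> filter(fn: (r) => r._value > 0.5)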


In the Alliance, we don't guess why a service went offline. We consult the Telemetry Core. And when something catastrophic happens, external telemetry is the difference between a shrug and a root cause.

Next up: each node in detail, starting with the one that crashed.

Next: Post 004, Node-A: The Millennium Falcon