Post 017
Why Build an Observability Stack
Running 25+ services across 3 nodes without observability is running blind. You find out something's wrong when a user (you) notices a service is down. By then, the problem has been happening for hours and the evidence is gone.
The TIG stack (Telegraf collecting metrics, InfluxDB storing them, Grafana visualizing them) gives you fine-grained, near-real-time visibility into everything running in the cluster. And when things go catastrophically wrong, like a silent hard lockup with zero local logs, external telemetry becomes your only forensic evidence. (Post 007)
Architecture
┌─────────────────────────────────┐
│ Grafana (Node-B) │
│ 192.168.20.40:3000 │
│ Dashboards, alerting, SSO │
└──────────┬──────────────────────┘
│ Flux queries
┌──────────▼──────────────────────┐
│ InfluxDB 2.x (Node-B) │
│ 192.168.20.41:8086 │
│ Org: TheAlliance │
│ Bucket: telegraf │
└──────────▲──────────────────────┘
│ Telegraf write API
┌─────────────────┼────────────────────┐
│ │ │
┌─────────┴───┐ ┌────────┴──────┐ ┌─────────┴───┐
│ Telegraf │ │ Telegraf │ │ Telegraf │
│ Node-A │ │ Node-B │ │ Node-C │
│ (Falcon) │ │ (Corvette) │ │ (Gozanti) │
└─────────────┘ └───────────────┘ └─────────────┘
All components run on Node-B (CR90 Corvette) except the Telegraf agents, which run on every host. InfluxDB uses the TheAlliance org with a telegraf bucket.
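Before pointing agents at it, it is worth confirming that the InfluxDB endpoint answers at all. A minimal sanity check, assuming the IP and port from the diagram:

```shell
# Check InfluxDB health from any node (IP/port from the architecture diagram).
# A healthy 2.x instance returns a small JSON body with "status": "pass".
curl -s http://192.168.20.41:8086/health
```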
Telegraf Configuration
Telegraf agents collect system metrics every 10 seconds and ship them to InfluxDB. The collection interval matters - during the VFIO lockup investigation, 10-second resolution let me pinpoint the exact moment Node-A stopped reporting.
Agent Config (/etc/telegraf/telegraf.conf)
[global_tags]
node = "falcon" # Change per host: falcon, corvette, gozanti
[agent]
interval = "10s"
round_interval = true
flush_interval = "10s"
hostname = "" # Uses system hostname
[[outputs.influxdb_v2]]
urls = ["http://192.168.20.41:8086"]
token = "${INFLUX_TOKEN}"
organization = "TheAlliance"
bucket = "telegraf"
[[inputs.cpu]]
percpu = true
totalcpu = true
[[inputs.mem]]
[[inputs.disk]]
ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]
[[inputs.diskio]]
[[inputs.net]]
[[inputs.system]]
[[inputs.processes]]
[[inputs.kernel]]
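The ${INFLUX_TOKEN} reference above is resolved from Telegraf's environment at startup. One way to supply it, assuming a Debian-style install where the systemd unit reads an environment file:

```shell
# /etc/default/telegraf -- environment file read by the telegraf systemd unit
# (path assumed; adjust for your distro/package)
INFLUX_TOKEN=<your-influxdb-write-token>
```

After editing the file, restart the agent (sudo systemctl restart telegraf) so the new token is picked up.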
NVIDIA GPU Metrics (Node-A Only)
Node-A runs the RTX 4000 Ada via VFIO passthrough. Telegraf has a native nvidia_smi input plugin:
[[inputs.nvidia_smi]]
bin_path = "/usr/bin/nvidia-smi"
This collects GPU temperature, utilization, memory usage, fan speed, and power draw - all at the same 10-second interval.
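To confirm the plugin can actually reach the driver before waiting on dashboards, Telegraf can run a one-shot collection and print the results instead of shipping them. A sketch:

```shell
# One-shot collection: prints gathered nvidia_smi metrics to stdout, writes nothing
telegraf --test --config /etc/telegraf/telegraf.conf --input-filter nvidia_smi
```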
InfluxDB Setup
InfluxDB 2.x uses Flux as its query language. The initial setup:
influx setup \
--org TheAlliance \
--bucket telegraf \
--username admin \
--password <password> \
--retention 30d \
--force
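The setup command mints an all-access operator token; the Telegraf agents should get a narrower one. A sketch of creating a write-only token scoped to the telegraf bucket (the bucket ID comes from the list command):

```shell
# Look up the bucket ID, then mint a write-only token for the agents
influx bucket list --org TheAlliance
influx auth create --org TheAlliance \
  --write-bucket <bucket-id> \
  --description "telegraf agents"
```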
Useful Flux Queries
CPU usage across all nodes (last hour):
from(bucket: "telegraf")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "cpu")
|> filter(fn: (r) => r._field == "usage_idle")
|> filter(fn: (r) => r.cpu == "cpu-total")
|> map(fn: (r) => ({r with _value: 100.0 - r._value}))
|> aggregateWindow(every: 1m, fn: mean)
Find exact moment a node stopped reporting (forensic query):
from(bucket: "telegraf")
|> range(start: 2026-02-09T15:00:00Z, stop: 2026-02-09T16:00:00Z)
|> filter(fn: (r) => r._measurement == "cpu")
|> filter(fn: (r) => r.host == "FCM2250")
|> filter(fn: (r) => r.cpu == "cpu-total")
|> last()
This query returned 15:54:50 UTC - the last data point from Node-A before the VFIO lockup. That timestamp became the anchor for the entire forensic investigation.
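A complementary way to make the gap itself visible is to count points per 10-second window; empty windows then render as zeros on a graph. A sketch using the same filters as above:

```
from(bucket: "telegraf")
  |> range(start: 2026-02-09T15:00:00Z, stop: 2026-02-09T16:00:00Z)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> filter(fn: (r) => r.host == "FCM2250")
  |> filter(fn: (r) => r.cpu == "cpu-total")
  |> filter(fn: (r) => r._field == "usage_idle")
  |> aggregateWindow(every: 10s, fn: count, createEmpty: true)
```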
GPU metrics (Node-A):
from(bucket: "telegraf")
|> range(start: -5m)
|> filter(fn: (r) => r._measurement == "nvidia_smi")
|> last()
Grafana Dashboards
Grafana connects to InfluxDB as a data source using the Flux query language.
Data Source Configuration
Type: InfluxDB
Query Language: Flux
URL: http://192.168.20.41:8086
Organization: TheAlliance
Token: (InfluxDB API token)
Default Bucket: telegraf
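The same data source can be provisioned from a file instead of clicked together in the UI, which keeps it reproducible across rebuilds. A sketch, assuming Grafana's standard provisioning directory (filename is arbitrary):

```yaml
# /etc/grafana/provisioning/datasources/influxdb.yaml (sketch)
apiVersion: 1
datasources:
  - name: InfluxDB
    type: influxdb
    access: proxy
    url: http://192.168.20.41:8086
    jsonData:
      version: Flux
      organization: TheAlliance
      defaultBucket: telegraf
    secureJsonData:
      token: $INFLUX_TOKEN
```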
Dashboard Panels
The primary infrastructure dashboard includes:
- CPU utilization - per-node, per-core, stacked time series
- Memory usage - total/used/available per node
- Disk I/O - read/write throughput per volume
- Network throughput - per-interface bandwidth
- GPU stats - temperature, utilization, VRAM, power (Node-A only)
- System uptime - per-node, with alert thresholds
SSO Integration
Grafana authenticates through Authentik via OIDC (Post 016). No separate Grafana credentials - users log in once through Authentik and land in their dashboards.
How This Stack Saved an Investigation
On February 9, 2026, Node-A hard-locked. No kernel panic. No journal entries. No crash dump. log2ram had buffered everything in RAM, and the hard lockup never triggered a clean shutdown to flush to disk.
The only evidence that survived was in InfluxDB - on Node-B. The Telegraf agent on Node-A had been sending metrics every 10 seconds until the exact moment of the lockup. The gap in the time-series data was the first clue. The Flux queries above pinpointed the timestamp. From there, I could confirm the system was idle at crash time (ruling out load-induced failure), verify no MCE errors occurred (ruling out memory), and trace the root cause to a PCIe bus stall from the NVIDIA GPU under VFIO passthrough.
Without external telemetry, this would have been an unsolvable mystery. The TIG stack turned "it crashed and I don't know why" into a documented postmortem with a root cause and mitigation.
Full writeup: Post 007 - Diagnosing a Silent Crash with No Logs
What I'd Do Differently
- Longer retention for forensic data - 30 days is fine for operational metrics, but forensic investigations sometimes need to look back further. I'd set up a separate bucket with 90-day retention for critical host metrics.
- Alerting from Grafana - Currently using Uptime Kuma for service-level alerts and Admiral Ackbar for Wazuh events. Grafana's native alerting could unify both into a single alert pipeline. On the roadmap.
- Disable log2ram on critical nodes - The TIG stack saved me, but the root problem was log2ram eating local evidence. Persistent logging is worth the disk writes on nodes where forensic data matters.
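The separate longer-retention bucket is a one-liner, sketched here with an assumed name:

```shell
# 90-day bucket for critical host metrics (bucket name assumed)
influx bucket create --org TheAlliance --name telegraf-forensic --retention 90d
```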
Related: Post 007 - VFIO Forensic Postmortem | Post 010 - Grafana OAuth with Authentik