Post 017


Why Build an Observability Stack

Running 25+ services across 3 nodes without observability means flying blind. You find out something's wrong when a user (you) notices a service is down. By then, the problem has been happening for hours and the evidence is gone.

The TIG stack (Telegraf collecting metrics, InfluxDB storing them, Grafana visualizing them) gives you 10-second-resolution visibility into everything running in the cluster. And when things go catastrophically wrong, like a silent hard lockup with zero local logs, external telemetry becomes your only forensic evidence. (Post 007)


Architecture

                           ┌─────────────────────────────────┐
                           │        Grafana (Node-B)         │
                           │      192.168.20.40:3000         │
                           │   Dashboards, alerting, SSO     │
                           └──────────┬──────────────────────┘
                                      │ Flux queries
                           ┌──────────▼──────────────────────┐
                           │      InfluxDB 2.x (Node-B)      │
                           │      192.168.20.41:8086         │
                           │   Org: TheAlliance              │
                           │   Bucket: telegraf              │
                           └──────────▲──────────────────────┘
                                      │ Telegraf write API
                    ┌─────────────────┼────────────────────┐
                    │                 │                    │
          ┌─────────▼───┐    ┌────────▼──────┐   ┌─────────▼───┐
          │  Telegraf   │    │  Telegraf     │   │  Telegraf   │
          │  Node-A     │    │  Node-B       │   │  Node-C     │
          │  (Falcon)   │    │  (Corvette)   │   │  (Gozanti)  │
          └─────────────┘    └───────────────┘   └─────────────┘

All components run on Node-B (CR90 Corvette) except the Telegraf agents, which run on every host. InfluxDB is set up with an org named TheAlliance and a bucket named telegraf.


Telegraf Configuration

Telegraf agents collect system metrics every 10 seconds and ship them to InfluxDB. The collection interval matters - during the VFIO lockup investigation, 10-second resolution let me pinpoint the exact moment Node-A stopped reporting.

Agent Config (/etc/telegraf/telegraf.conf)

[global_tags]
  node = "falcon"  # Change per host: falcon, corvette, gozanti

[agent]
  interval = "10s"
  round_interval = true
  flush_interval = "10s"
  hostname = ""  # Uses system hostname

[[outputs.influxdb_v2]]
  urls = ["http://192.168.20.41:8086"]
  token = "${INFLUX_TOKEN}"
  organization = "TheAlliance"
  bucket = "telegraf"

[[inputs.cpu]]
  percpu = true
  totalcpu = true

[[inputs.mem]]

[[inputs.disk]]
  ignore_fs = ["tmpfs", "devtmpfs", "devfs", "iso9660", "overlay", "aufs", "squashfs"]

[[inputs.diskio]]

[[inputs.net]]

[[inputs.system]]

[[inputs.processes]]

[[inputs.kernel]]
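
The ${INFLUX_TOKEN} reference in the output section is resolved from Telegraf's environment, so the token never has to live in the config file. On systemd hosts, the packaged telegraf unit reads an environment file (typically /etc/default/telegraf on Debian-based systems; the exact path is an assumption about your distro's package):

# /etc/default/telegraf - root-owned, mode 0600
INFLUX_TOKEN=<telegraf write token>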

NVIDIA GPU Metrics (Node-A Only)

Node-A runs the RTX 4000 Ada via VFIO passthrough. Telegraf has a native nvidia_smi input plugin:

[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"

This collects GPU temperature, utilization, memory usage, fan speed, and power draw - all at the same 10-second interval.


InfluxDB Setup

InfluxDB 2.x uses Flux as its query language. The initial setup:

influx setup \
  --org TheAlliance \
  --bucket telegraf \
  --username admin \
  --password <password> \
  --retention 30d \
  --force
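
The admin token created by influx setup is all-access and shouldn't be handed to every Telegraf agent. A scoped write-only token can be minted per agent; a sketch with the influx CLI (the bucket ID comes from influx bucket list):

influx bucket list --org TheAlliance   # note the telegraf bucket ID

influx auth create \
  --org TheAlliance \
  --write-bucket <telegraf-bucket-id> \
  --description "telegraf agent write token"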

Useful Flux Queries

CPU usage across all nodes (last hour):

from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> filter(fn: (r) => r._field == "usage_idle")
  |> filter(fn: (r) => r.cpu == "cpu-total")
  |> map(fn: (r) => ({r with _value: 100.0 - r._value}))
  |> aggregateWindow(every: 1m, fn: mean)

Find exact moment a node stopped reporting (forensic query):

from(bucket: "telegraf")
  |> range(start: 2026-02-09T15:00:00Z, stop: 2026-02-09T16:00:00Z)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> filter(fn: (r) => r.host == "FCM2250")
  |> filter(fn: (r) => r.cpu == "cpu-total")
  |> last()

This query returned 15:54:50 UTC - the last data point from Node-A before the VFIO lockup. That timestamp became the anchor for the entire forensic investigation.

GPU metrics (Node-A):

from(bucket: "telegraf")
  |> range(start: -5m)
  |> filter(fn: (r) => r._measurement == "nvidia_smi")
  |> last()

Grafana Dashboards

Grafana connects to InfluxDB as a data source using the Flux query language.

Data Source Configuration

Type:         InfluxDB
Query Language: Flux
URL:          http://192.168.20.41:8086
Organization: TheAlliance
Token:        (InfluxDB API token)
Default Bucket: telegraf
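
Rather than clicking through the UI, the same data source can be provisioned from a file so it survives container rebuilds. A sketch of a Grafana provisioning file (the file path follows Grafana's standard layout; the token variable name is an assumption):

# /etc/grafana/provisioning/datasources/influxdb.yaml
apiVersion: 1
datasources:
  - name: InfluxDB
    type: influxdb
    access: proxy
    url: http://192.168.20.41:8086
    jsonData:
      version: Flux
      organization: TheAlliance
      defaultBucket: telegraf
    secureJsonData:
      token: $INFLUX_GRAFANA_TOKEN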

Dashboard Panels

The primary infrastructure dashboard includes:

  • CPU utilization - per-node, per-core, stacked time series
  • Memory usage - total/used/available per node
  • Disk I/O - read/write throughput per volume
  • Network throughput - per-interface bandwidth
  • GPU stats - temperature, utilization, VRAM, power (Node-A only)
  • System uptime - per-node, with alert thresholds
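
As an example of what backs one of these panels, a memory panel could use a query along these lines, mirroring the CPU query earlier but using Grafana's built-in dashboard variables for the time range:

from(bucket: "telegraf")
  |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
  |> filter(fn: (r) => r._measurement == "mem")
  |> filter(fn: (r) => r._field == "used_percent")
  |> aggregateWindow(every: v.windowPeriod, fn: mean)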

SSO Integration

Grafana authenticates through Authentik via OIDC (Post 016). No separate Grafana credentials - users log in once through Authentik and land in their dashboards.


How This Stack Saved an Investigation

On February 9, 2026, Node-A hard-locked. No kernel panic. No journal entries. No crash dump. log2ram had buffered everything in RAM, and the hard lockup never triggered a clean shutdown to flush to disk.

The only evidence that survived was in InfluxDB - on Node-B. The Telegraf agent on Node-A had been sending metrics every 10 seconds until the exact moment of the lockup. The gap in the time-series data was the first clue. The Flux queries above pinpointed the timestamp. From there, I could confirm the system was idle at crash time (ruling out load-induced failure), verify no MCE errors occurred (ruling out memory), and trace the root cause to a PCIe bus stall from the NVIDIA GPU under VFIO passthrough.
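
The gap itself can be surfaced mechanically rather than eyeballed. Counting points per 10-second window with createEmpty: true makes missing intervals show up as zero-count rows; a sketch:

from(bucket: "telegraf")
  |> range(start: 2026-02-09T15:00:00Z, stop: 2026-02-09T16:00:00Z)
  |> filter(fn: (r) => r._measurement == "cpu" and r.host == "FCM2250")
  |> filter(fn: (r) => r.cpu == "cpu-total" and r._field == "usage_idle")
  |> aggregateWindow(every: 10s, fn: count, createEmpty: true)
  |> filter(fn: (r) => r._value == 0)

Every row this returns is a 10-second window where the agent went silent.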

Without external telemetry, this would have been an unsolvable mystery. The TIG stack turned "it crashed and I don't know why" into a documented postmortem with a root cause and mitigation.

Full writeup: Post 007 - Diagnosing a Silent Crash with No Logs


What I'd Do Differently

  1. Longer retention for forensic data - 30 days is fine for operational metrics, but forensic investigations sometimes need to look back further. I'd set up a separate bucket with 90-day retention for critical host metrics.

  2. Alerting from Grafana - Currently using Uptime Kuma for service-level alerts and Admiral Ackbar for Wazuh events. Grafana's native alerting could unify both into a single alert pipeline. On the roadmap.

  3. Disable log2ram on critical nodes - The TIG stack saved me, but the root problem was log2ram eating local evidence. Persistent logging is worth the disk writes on nodes where forensic data matters.
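
The separate forensic bucket from point 1 would be a one-liner with the influx CLI (the bucket name is an assumption):

influx bucket create \
  --org TheAlliance \
  --name telegraf-forensic \
  --retention 90d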


Related: Post 007 - VFIO Forensic Postmortem | Post 010 - Grafana OAuth with Authentik