The Blueprint: What Production-Grade Actually Means
Category: Infrastructure Philosophy / Series Introduction
The About page covers who I am, and my initial post, Why This Exists, covers why the Alliance Fleet exists. This post is where the technical story begins.
Over the next several entries I'm walking through the full infrastructure: hardware decisions, service placement, security architecture, observability design, and the incidents that stress-tested all of it. Not as a tutorial. As an engineering narrative. Every post maps to a real decision, a tradeoff, or a problem I solved with what was available.
Before diving into nodes and services, I want to establish the design philosophy that holds it together. The difference between a homelab that "works" and one that mirrors production isn't the hardware budget. It's the architecture.
What Production-Grade Means Here
After 7+ years in enterprise IT, I've seen environments that looked solid on paper and crumbled under the first unexpected failure. I've also seen lean setups that survived because someone thought about failure modes before deploying.
The Alliance is built on four principles:
Fault Domain Isolation. Every node has a defined role. If the GPU triggers a PCIe bus stall on the Millennium Falcon (Node-A), the identity platform, SIEM, and observability pipeline keep running on the CR90 Corvette (Node-B) and Gozanti Cruiser (Node-C). If Node-B goes down for a ZFS scrub, security monitoring on Node-C doesn't blink.
Identity as Infrastructure. In every corporate environment I've worked in, identity sprawl was the quiet disaster. Twelve passwords per user, no MFA, no audit trail. The fleet runs Authentik-ChainCode (the universal digital ID for the Alliance) with OIDC/SAML across 15+ services and enforced MFA on every one. Identity isn't a feature. It's a foundational control.
Observability as a Forensic Tool. Dashboards are nice. What matters is whether your monitoring pipeline can reconstruct a timeline when everything on the crashed host is gone. The TIG Stack proved this during a real incident. That story is coming.
Documentation as Proof of Work. If it isn't documented, it didn't happen. Every decision, incident, and tradeoff in this series exists in writing.
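The fault-domain principle above can be sketched as a placement check. This is a minimal illustration, not real fleet config: the node names mirror the layout described in this post, but the service identifiers and the `surviving_services` helper are hypothetical.

```python
# Hypothetical service-placement manifest mirroring the three-node layout
# described above. Service names are illustrative, not actual config.
PLACEMENT = {
    "node-a-millennium-falcon": {"gpu-compute", "ai-ml-vm"},
    "node-b-cr90-corvette": {"authentik-sso", "tig-stack", "zfs-storage"},
    "node-c-gozanti-cruiser": {"wazuh-siem", "dns-filter", "reverse-proxy"},
}

def surviving_services(failed_node: str) -> set[str]:
    """Return the services still running if one node fails."""
    return {
        svc
        for node, svcs in PLACEMENT.items()
        if node != failed_node
        for svc in svcs
    }

# If Node-A locks up, identity and the SIEM keep running:
after_a = surviving_services("node-a-millennium-falcon")
assert "authentik-sso" in after_a and "wazuh-siem" in after_a

# If Node-B goes down for a ZFS scrub, security monitoring continues:
after_b = surviving_services("node-b-cr90-corvette")
assert "wazuh-siem" in after_b
```

The point of the sketch: fault isolation is a property you can state and check, not just an intention. If a future change colocated the SIEM with the GPU workload, a check like this would fail before the next PCIe stall proves it the hard way.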
The Series Roadmap
Post 02: The Architecture of the Alliance. The "Measure Twice, Cut Once" philosophy. Why MFF hardware, how provisioning works, and the fault domain logic behind the three-node layout.
Post 03.1: The Fleet Manifest. Full service inventory. Every VM and LXC mapped to its node, resource allocation, and purpose.
Post 03.2: The Telemetry Core. The observability pipeline. TIG Stack, metrics vs. logs, environmental monitoring, and how external telemetry saved an investigation when all local logs were destroyed.
Post 04: Node-A, The Millennium Falcon. AI/ML compute. GPU passthrough via VFIO, the Tantive-III VM, performance benchmarks, and the lockup incident that validated fault isolation.
Post 05: Node-B, The CR90 Corvette. Data and operations hub. ECC + ZFS, Authentik SSO, the observability pipeline, and why every stateful service lives on the same node.
Post 06: Node-C, The Gozanti Cruiser. Security sentinel. Wazuh SIEM, DNS filtering, the DMZ reverse proxy, and the OPNsense-to-UniFi migration.
Each post stands on its own and builds on what came before. The About page is the "what" and "why." This series is the "how" and the "what went wrong."
Let's get into it.