Project Post 001
Running LLM inference and image generation on a homelab GPU node, and everything that went wrong along the way.
The Problem
Node-A (Millennium Falcon) has an NVIDIA RTX 4000 Ada with 20GB of VRAM. For months after being installed, it did absolutely nothing. Every AI interaction in the Alliance Fleet routed through external cloud APIs. BD-1's Discord responses, n8n workflow automations, ad-hoc queries: all of it went out over the internet to OpenAI or Anthropic endpoints and came back on a round trip that cost money every single time.
Token-based API pricing is unpredictable at scale. When workflows chain multiple calls and a Discord bot is fielding requests throughout the day, the bill compounds. Beyond cost, every one of those API calls shipped fleet data off-network. Server configurations, operational context, personal details packed into prompts and sent to third-party infrastructure. For a homelab built on zero-trust principles, that felt like a contradiction.
Latency was the final irritant. Cloud round trips add 200 to 500 milliseconds of network overhead before a single token generates. For interactive use through BD-1 or webhook-triggered n8n workflows, that delay is the difference between a tool that feels responsive and one that feels sluggish.
The RTX 4000 Ada had the compute to solve all three problems. It just needed a workload.
The Solution
Tantive-III is a dedicated VM on Node-A, purpose-built for GPU inference. The Ada is passed through to the VM via VFIO, and the entire inference stack runs in Docker Compose: Ollama for LLM inference, OpenWebUI for browser-based chat, AnythingLLM for RAG document workflows, and ComfyUI for image generation.
The stack sits on the Services VLAN (192.168.20.0/24) and is reverse-proxied through NPM with wildcard SSL. OpenWebUI is exposed externally at llm.tima.dev. Everything else is internal-only.
| Component | Port | Purpose |
|---|---|---|
| Ollama | 11434 | LLM inference engine |
| OpenWebUI | 3000 | Chat interface (llm.tima.dev) |
| AnythingLLM | 3001 | RAG document workflows |
| ComfyUI | 8188 | Image generation (manual launch only) |
The design goal was straightforward: give BD-1 a local inference backend, enable private low-latency responses for fleet operations, and unlock on-device image generation. No more paying per token for tasks a local 8B model handles just fine.
Implementation
VFIO Passthrough
GPU passthrough is the foundation of the entire platform, and it required more tuning than expected. IOMMU has to be enabled at the host level, and the NVIDIA card has to be isolated from the host kernel before Proxmox boots.
The GRUB configuration on Node-A tells the story of every stability issue encountered along the way:
```
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_aspm=off pci=noaer nmi_watchdog=1"
```
The first two parameters (intel_iommu=on iommu=pt) are standard for VFIO device assignment. The rest are not standard. They are scar tissue.
pcie_aspm=off disables PCIe Active State Power Management. The Ada's power state transitions caused silent lockups during VFIO passthrough. When the GPU tried to enter a low-power state while attached to the VM, the PCIe bus would hang and take Node-A with it. Disabling ASPM eliminates that failure mode entirely.
pci=noaer suppresses PCIe Advanced Error Reporting. AER flood events from the GPU during VFIO attach and detach cycles were filling kernel logs and contributing to instability. Useful diagnostics in theory, destabilizing in practice.
nmi_watchdog=1 enables the NMI watchdog. This was added after the first VFIO lockup incident. If the kernel hard-locks, the NMI watchdog detects the stall and can force a panic-and-reboot instead of requiring someone to physically walk over and power-cycle Node-A. A last-resort safety net, but one that has already justified its existence.
After updating GRUB and rebooting, the NVIDIA GPU is blacklisted from the host via vfio-pci module configuration so Proxmox does not claim it before Tantive-III starts.
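The host-side isolation described above might look like the following; the PCI IDs are placeholders (read the real ones from `lspci -nn`), not values from this deployment:

```
# /etc/modprobe.d/vfio.conf -- bind the GPU (and its audio function) to
# vfio-pci before any host driver can claim it.
# Replace 10de:xxxx,10de:yyyy with the IDs shown by `lspci -nn`.
options vfio-pci ids=10de:xxxx,10de:yyyy

# /etc/modprobe.d/blacklist-nvidia.conf -- keep host GPU drivers off the card.
blacklist nouveau
blacklist nvidia

# Then rebuild the initramfs and reboot:
#   update-initramfs -u -k all && reboot
```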
VM Constraints
Tantive-III runs Ubuntu Server inside Proxmox with the NVIDIA driver installed in the guest. The VM configuration has one hard constraint that was discovered empirically through repeated boot failures:
The VM RAM ceiling is 12GB. Allocating more than 12GB causes a PCIe initialization stall during boot. The GPU passthrough negotiation fails when the VM's memory footprint exceeds a threshold that interferes with IOMMU mapping. The VM either hangs entirely or starts without GPU access. 12GB was found to be the stable ceiling after multiple rounds of testing. It does not move.
CPU type is set to host for passthrough compatibility, and the VM is bridged to the Services VLAN with a static IP.
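Put together, the relevant lines of a Proxmox VM config for this setup might look like the excerpt below. The PCI address, bridge name, and VLAN tag are illustrative assumptions; only the 12GB ceiling and `cpu: host` come from the post:

```
# /etc/pve/qemu-server/<vmid>.conf (excerpt, hypothetical values)
bios: ovmf                           # UEFI guest, typical for GPU passthrough
machine: q35                         # PCIe-native machine type
cpu: host                            # passthrough-compatible CPU type
memory: 12288                        # the empirically-found 12GB ceiling
hostpci0: 0000:01:00,pcie=1          # the passed-through Ada (placeholder address)
net0: virtio,bridge=vmbr0,tag=20     # bridged to the Services VLAN
```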
Docker Compose Stack
The four services run in a single Docker Compose file. The configuration is mostly standard containerized deployment, with one critical exception in the restart policies:
Ollama, OpenWebUI, and AnythingLLM all use restart: unless-stopped, which is normal for persistent services. ComfyUI uses restart: "no". This is deliberate and non-negotiable. The reason is covered in the stability section below, but the short version: an automatic restart policy on ComfyUI nearly destroyed Node-A. Twice.
Ollama and ComfyUI both reserve the GPU via NVIDIA container runtime. OpenWebUI connects to Ollama's API internally. AnythingLLM does the same for RAG queries. The dependency chain is clean: Ollama is the inference engine, everything else is a consumer.
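A trimmed Compose sketch of those policies, for illustration only — service names, image tags, and ports beyond what the post states are assumptions; the restart policies and the NVIDIA device reservation are the parts that mirror the deployment:

```yaml
services:
  ollama:
    image: ollama/ollama
    ports: ["11434:11434"]
    restart: unless-stopped          # persistent service, safe to auto-restart
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  comfyui:
    restart: "no"                    # deliberate: an auto-restart crash loop
                                     # exhausted VRAM and locked up the host
```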
Model Installation
The primary inference model is llama3:8b running Q4_0 quantization, which uses roughly 4.7GB of VRAM when loaded. That leaves comfortable headroom within the 20GB budget for context window allocation and concurrent requests through OpenWebUI.
deepseek-coder-v2:16b is planned as the secondary model for code generation tasks, though models loaded simultaneously in Ollama consume VRAM additively. Ollama keeps the most recently used model warm in VRAM for fast response times, which means capacity planning is not just about model size but about which models are active at any given moment.
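That capacity math can be sketched as a small helper. The numbers in the example are the post's own (a 20GB card, roughly 4.7GB for llama3:8b at Q4_0); the function itself is illustrative, not part of the fleet's tooling:

```shell
#!/bin/sh
# vram_headroom_mib TOTAL ALLOC... : subtract each resident allocation
# (all values in MiB) from the card's total and print what is left.
# Exits nonzero when the plan is over budget.
vram_headroom_mib() {
  total="$1"; shift
  used=0
  for alloc in "$@"; do
    used=$((used + alloc))
  done
  left=$((total - used))
  echo "$left"
  [ "$left" -ge 0 ]
}

# Example: a 20GB card (20480 MiB) with llama3:8b Q4_0 resident (~4700 MiB)
# leaves headroom for KV cache, concurrent requests, and a second model.
```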
The VFIO Lockup Incidents
Two separate lockup incidents shaped every stability decision on this platform. Both presented identically: Node-A hard-locked, no SSH, no console, requiring a physical power cycle. But the root causes were completely different, and that distinction matters.
Incident v1: The Silent Crash
Tantive-III stopped responding with no obvious trigger. No error logs, no warning, nothing. The only evidence came from InfluxDB telemetry after the fact. Telegraf metrics from the VM flatlined at a specific timestamp, while Node-A's host-level metrics continued briefly before they too stopped.
The diagnosis was forensic, pieced together from time-series gaps. The GPU entered a bad state that propagated up through the PCIe bus and took the entire host down. This is a known failure mode with VFIO passthrough, and the GRUB hardening parameters (pcie_aspm=off, pci=noaer, nmi_watchdog=1) were all added in response to this incident.
Incident v2: The Crash Loop
This one had a clear culprit. ComfyUI crashed during an image generation job. Docker, configured with restart: unless-stopped, faithfully restarted it. ComfyUI crashed again. Docker restarted it again. Each restart attempt allocated VRAM without the previous allocation being properly released. The crash loop rapidly exhausted all 20GB of VRAM, the GPU driver stalled on PCIe, and Node-A locked up identically to v1.
Same symptoms. Entirely different mechanism. If the investigation had assumed v2 was "the same VFIO bug" as v1, the crash-loop VRAM exhaustion would have been missed completely.
The fix was surgical: ComfyUI now runs with restart: "no". If it crashes, it stays down until a human verifies VRAM is clear and deliberately restarts it. A manual launch is inconvenient. A host-level lockup is unacceptable.
The Lesson
Same symptoms do not mean the same root cause. Always validate the root cause independently, even when the failure looks familiar. Incident v1's forensic diagnosis informed the investigation of v2, but the actual mechanism was completely different. Skipping that validation would have left a ticking time bomb in the Docker restart policy.
VRAM Budget and GPU Mode Switching
VRAM on the Ada is a zero-sum game. 20GB total, shared across every GPU workload on Tantive-III.
Ollama keeps approximately 12GB warm when a model is loaded, holding model weights in VRAM for fast response times. ComfyUI needs roughly 8GB for Stable Diffusion workflows. That math only works if they never run simultaneously. Running both at once risks exceeding the 20GB ceiling, triggering out-of-memory conditions, and potentially repeating the crash-loop scenario from v2.
This is why GPU mode switching exists. The fleet treats Ollama inference and ComfyUI image generation as mutually exclusive GPU modes. Scripts handle the transition: shut down one workload, verify via nvidia-smi that VRAM is fully released, then start the other. BD-1 exposes this through slash commands (/gpu status and /gpu mode) so mode transitions can be triggered from Discord without SSH access.
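A sketch of that transition logic, written so the polling step can be tested without a GPU. The service names and the `nvidia-smi` query are assumptions based on standard tooling, not copied from the fleet's actual scripts:

```shell
#!/bin/sh
# wait_for_vram_release FLOOR_MIB TRIES QUERY_CMD
# Poll QUERY_CMD (which must print current VRAM usage in MiB) until usage
# drops below FLOOR_MIB, or give up after TRIES attempts.
wait_for_vram_release() {
  floor="$1"; tries="$2"; query="$3"
  i=0
  while [ "$i" -lt "$tries" ]; do
    used=$($query)
    if [ "$used" -lt "$floor" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  return 1
}

# switch_to_comfy: the full flow as described above. Defined here but never
# run automatically -- a human decides when to switch modes.
switch_to_comfy() {
  docker compose stop ollama
  wait_for_vram_release 500 30 \
    "nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits" \
    || { echo "VRAM not released; aborting" >&2; return 1; }
  docker compose up -d comfyui
}
```

The query command is injected as an argument so the polling loop can be exercised against canned values instead of live hardware.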
The VRAM watchdog runs continuously and feeds into n8n orchestration and Discord alerts. It operates on tiered responses: warning at 80% utilization, alerting at 90%, and taking protective action before a hard ceiling hit. Without this monitoring, a VRAM leak could cascade silently into another lockup.
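The tiered thresholds translate to a small classifier. The 80% and 90% cut points are from the post; the function name and output strings are illustrative:

```shell
#!/bin/sh
# vram_tier PCT : map a VRAM utilization percentage onto the watchdog's
# tiers -- ok below 80%, warn at 80-89%, alert at 90% and above.
vram_tier() {
  pct="$1"
  if [ "$pct" -ge 90 ]; then
    echo alert
  elif [ "$pct" -ge 80 ]; then
    echo warn
  else
    echo ok
  fi
}
```

In the pipeline described above, each tier would map to a different n8n action: warn posts a Discord alert, alert triggers protective action before the hard ceiling is hit.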
Tradeoffs
What Local Inference Sacrifices
Model size hits a hard ceiling. The Ada's 20GB of VRAM handles 7B to 13B parameter models comfortably, but 70B+ models either do not fit or run partially offloaded to CPU at unusable speeds. Cloud APIs serve 100B+ parameter models with no VRAM concern. For tasks that genuinely require frontier-model reasoning, local inference cannot compete.
Inference speed is GPU-bottlenecked. The Ada is a workstation card, not a datacenter accelerator. Token generation for larger quantized models is measurably slower than cloud endpoints backed by A100 or H100 clusters.
Model availability lags behind. New open-weight models need quantization, testing, and VRAM profiling before deployment. Cloud APIs just update an endpoint. Every model swap on Tantive-III is a mini capacity-planning exercise.
What It Gains
Fixed cost. The hardware already exists. There is no per-token billing, no usage spikes, no invoice surprises. The operational overhead is real, but it is predictable.
Privacy. Fleet data never leaves the network. No prompt context, no server configurations, no operational details transiting third-party infrastructure. For a homelab built on zero-trust principles, this is the point.
Latency. Local inference on the Ada eliminates 200 to 500 milliseconds of network round-trip overhead. For BD-1's Discord responses and n8n webhook-triggered workflows, the difference is tangible.
When to Reconsider
The local-first approach holds as long as workloads fit within the Ada's envelope. If BD-1 or fleet workflows ever demand frontier-model quality that requires 70B+ parameters, or if inference latency becomes a bottleneck for time-critical automations, the calculus shifts. A hybrid approach, local for private fleet operations and cloud for heavy-lift tasks, is the logical next step. The infrastructure already supports cloud API calls through n8n. The switch is not architectural. It is a policy decision about which prompts are worth paying for.
Operational Reality
Running local inference is not a deploy-and-forget workload.
GPU mode switching requires coordination. Transitioning between Ollama and ComfyUI means verifying VRAM state, stopping one service, confirming release, and starting the other. The scripts and BD-1 slash commands automate the mechanical steps, but a human still decides when to switch.
VRAM monitoring is continuous. The watchdog, n8n orchestration, and Discord alerts form a pipeline that exists because without it, a leak could cascade into a lockup with no warning.
Manual tuning never ends. New models need VRAM profiling. Quantization choices (Q4_K_M versus Q5_K_M versus Q8) directly impact quality and VRAM footprint. Context window sizes affect memory allocation. Every configuration change ripples through the capacity budget.
The NMI watchdog at the kernel level provides the last line of defense. If, despite all the monitoring and safeguards, the kernel still locks up, NMI triggers automatic recovery rather than leaving Node-A dead until someone physically intervenes. It has already been needed. It will probably be needed again.
What This Proves
This project is not just "I installed Ollama." It is GPU passthrough engineering on commodity hardware, failure analysis using forensic telemetry when the system gives you nothing but silence, capacity planning under hard physical constraints, and the operational judgment to accept complexity in exchange for control.
The Ada earns its keep. The trade-offs are known, the failure modes are documented, and the operational overhead is the price of keeping fleet data off someone else's servers.
This post is part of the Alliance Fleet project documentation series on Holocron Labs. Related posts cover the SIEM Automation Pipeline and the Zero-Trust Identity Platform.