Post 018


The Goal

Run local LLM inference at production speed, 50+ tokens/second on 70B parameter models, without sending a single byte of data to an external API. The hardware: an NVIDIA RTX 4000 Ada Generation GPU in a Proxmox hypervisor. The method: VFIO/IOMMU passthrough to a dedicated VM.


The Hardware

Host:     Node-A (FCM2250 / Millennium Falcon)
CPU:      Intel Core Ultra 9 285K
RAM:      64GB DDR5
GPU:      NVIDIA RTX 4000 Ada Generation (16GB VRAM)
Storage:  NVMe (LVM-thin)

The RTX 4000 Ada sits in Node-A's PCIe slot. The goal is to pass the entire GPU through to a single VM (Tantive-III) so it has exclusive, bare-metal-equivalent access to the hardware.


IOMMU and VFIO Prerequisites

Enable IOMMU in BIOS

For Intel CPUs, enable VT-d (Intel Virtualization Technology for Directed I/O) in the BIOS. This is the hardware feature that allows the hypervisor to map PCIe devices to specific VMs.

Kernel Boot Parameters

Edit /etc/default/grub on the Proxmox host:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_aspm=off pci=noaer"
  • intel_iommu=on - Enables IOMMU
  • iommu=pt - Passthrough mode for better performance
  • pcie_aspm=off - Disables PCIe Active State Power Management (mitigation for the lockup - see below)
  • pci=noaer - Disables Advanced Error Reporting (prevents kernel log spam from GPU power state transitions)
Apply the new command line and reboot:

update-grub
reboot
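Before rebooting, it's worth sanity-checking that every flag actually made it into the GRUB line - a typo here means IOMMU silently stays off. A minimal sketch (the helper name is mine, not part of this setup):

```shell
# Flags the passthrough setup depends on
required="intel_iommu=on iommu=pt pcie_aspm=off pci=noaer"

check_grub_line() {
  # $1 = the GRUB_CMDLINE_LINUX_DEFAULT value to inspect
  missing=""
  for p in $required; do
    case " $1 " in
      *" $p "*) ;;
      *) missing="$missing $p" ;;
    esac
  done
  if [ -n "$missing" ]; then echo "missing:$missing"; else echo "ok"; fi
}

# On the host, against the real file:
# line=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' /etc/default/grub)
# check_grub_line "$line"
```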

Verify IOMMU Groups

# Count IOMMU groups (-mindepth 1 excludes the base directory from the count)
find /sys/kernel/iommu_groups/ -mindepth 1 -maxdepth 1 -type d | wc -l

The GPU should appear in its own IOMMU group. If it shares a group with other devices, you'll need ACS override patches or a different PCIe slot.

# Find the GPU's IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
  echo "IOMMU Group $(basename $(dirname $(dirname $d))): $(lspci -nns $(basename $d))"
done | grep -i nvidia

Load VFIO Modules

echo "vfio" >> /etc/modules
echo "vfio_iommu_type1" >> /etc/modules
echo "vfio_pci" >> /etc/modules
echo "vfio_virqfd" >> /etc/modules   # built into vfio since kernel 6.2; omit on newer kernels

Blacklist the NVIDIA drivers on the host (the GPU belongs to the VM, not the hypervisor):

echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf

Bind the GPU to VFIO:

# Get the GPU's PCI IDs
lspci -nn | grep -i nvidia
# Output: 02:00.0 VGA compatible controller [0300]: NVIDIA Corporation [10de:xxxx] (rev a1)

echo "options vfio-pci ids=10de:xxxx,10de:yyyy" >> /etc/modprobe.d/vfio.conf

Replace xxxx and yyyy with your GPU and audio device IDs.
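The IDs can also be scraped straight out of lspci instead of copied by hand, which avoids transposing a hex digit. A hedged sketch (the function name is illustrative):

```shell
# Pull every NVIDIA vendor:device pair (GPU at .0, HDMI audio at .1) from
# `lspci -nn` output and join them with commas for the vfio-pci ids= option.
nvidia_ids() {
  grep -i nvidia | grep -o '\[10de:[0-9a-f]\{4\}\]' | tr -d '[]' | paste -sd, -
}

# On the host:
# echo "options vfio-pci ids=$(lspci -nn | nvidia_ids)" >> /etc/modprobe.d/vfio.conf
```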

update-initramfs -u
reboot

VM Creation: Tantive-III

The Debian ISO Gotcha

ProxMenux created the VM shell, but the Debian ISO URL it supplied was stale:

wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-13.0.0-amd64-netinst.iso
# 404 Not Found

The filename changes with point releases. Find the current one:

wget -qO- https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/ | grep -oP 'debian-[0-9.]+-amd64-netinst\.iso' | head -1
# debian-13.3.0-amd64-netinst.iso

Download the correct ISO:

wget -O /var/lib/vz/template/iso/debian-13.3.0-amd64-netinst.iso \
  https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-13.3.0-amd64-netinst.iso
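debian-cd publishes a SHA512SUMS file in the same directory as the images, so the download can be verified before the VM ever boots it. A small sketch (the helper is mine):

```shell
# Verify a downloaded ISO against Debian's published checksums.
verify_iso() {
  # $1 = checksum file; run from the directory containing the ISO.
  # --ignore-missing skips entries for images we didn't download.
  sha512sum -c --ignore-missing "$1"
}

# On the host:
# cd /var/lib/vz/template/iso
# wget -q https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/SHA512SUMS
# verify_iso SHA512SUMS
```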

VM Configuration

VM ID:    201
Name:     Tantive-III
CPU:      8 cores (host type)
RAM:      32GB
Disk 1:   128GB NVMe (fast-lvm, scsi0)
Disk 2:   128GB NVMe (fast-lvm, scsi1)
GPU:      PCIe passthrough (hostpci0: 0000:02:00, pcie=1)
Network:  virtio, bridge=vmbr0

The VM config in /etc/pve/qemu-server/201.conf:

hostpci0: 0000:02:00,pcie=1
scsi0: fast-lvm:vm-201-disk-1,discard=on,size=128G,ssd=1

Partition Expansion

The VM booted with only 3GB usable despite 128GB allocated. Full fix documented in Post 011 - involves kpartx, parted GPT repair, and resize2fs from the Proxmox host.

After expansion:

df -h /
# /dev/sda1  126G  2.4G  118G  2% /

NVIDIA Driver Installation

Inside the Tantive-III VM:

apt update && apt upgrade -y
apt install -y nvidia-driver firmware-misc-nonfree
reboot

After reboot:

nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.xx              Driver Version: 550.xx              CUDA Version: 12.x   |
|-----------------------------------------+----------------------+-----------------------|
| GPU  Name               Persistence-M   | Bus-Id        Disp.A | Volatile Uncorr. ECC  |
| Fan  Temp  Perf         Pwr:Usage/Cap   |         Memory-Usage | GPU-Util   Compute M. |
|=========================================+======================+=======================|
|   0  NVIDIA RTX 4000 Ada Gen      Off   | 00000000:01:00.0 Off |                  Off  |
| 30%  35C   P8             9W /  130W    |     0MiB / 16376MiB  |      0%       Default |
+-----------------------------------------+----------------------+-----------------------+

GPU visible, driver loaded, 16GB VRAM available.


AI Stack Deployment

With the GPU accessible, deploy the AI services:

Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3:8b
ollama pull llama3:70b
ollama pull codellama:34b

Performance: ~50 tokens/second on the 70B model with the RTX 4000 Ada.
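That number is easy to reproduce: Ollama's /api/generate response (with "stream": false) reports eval_count (tokens generated) and eval_duration (nanoseconds), so tokens/second is eval_count / eval_duration * 1e9. A sketch assuming curl and a local instance - the awk helper is mine, not an Ollama tool:

```shell
# Compute tokens/second from an Ollama generate response on stdin.
tokens_per_second() {
  awk -F'[,:}]' '{
    for (i = 1; i <= NF; i++) {
      if ($i ~ /"eval_count"/)    count = $(i+1)
      if ($i ~ /"eval_duration"/) dur   = $(i+1)
    }
    if (dur > 0) printf "%.1f\n", count / dur * 1e9
  }'
}

# Against a live instance (assumes a pulled model):
# curl -s http://localhost:11434/api/generate \
#   -d '{"model":"llama3:8b","prompt":"Why is the sky blue?","stream":false}' \
#   | tokens_per_second
```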

OpenWebUI

# Named volume keeps users and chat history across container recreation
docker run -d --name openwebui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --restart always \
  ghcr.io/open-webui/open-webui:main

AnythingLLM

For RAG (Retrieval-Augmented Generation) over local documents:

docker run -d --name anythingllm \
  -p 3001:3001 \
  -e STORAGE_DIR=/app/server/storage \
  -v /home/user/anythingllm:/app/server/storage \
  mintplexlabs/anythingllm

Connected to the Ollama instance for embedding and inference. 500+ documents ingested into the RAG pipeline.

ComfyUI

For image generation workloads:

# Clone and install ComfyUI with GPU support
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
python3 -m venv venv && . venv/bin/activate
pip install torch torchvision torchaudio   # CUDA-enabled wheels on Linux
pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188
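To survive VM reboots, ComfyUI can run under systemd. A hedged unit-file sketch - the install path, venv location, and user are assumptions, not details from this deployment:

```
[Unit]
Description=ComfyUI
After=network-online.target

[Service]
User=comfy
WorkingDirectory=/opt/ComfyUI
ExecStart=/opt/ComfyUI/venv/bin/python main.py --listen 0.0.0.0 --port 8188
Restart=on-failure

[Install]
WantedBy=multi-user.target
```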

The Lockup and the Kernel Parameters

Two weeks after deployment, Node-A hard-locked. No kernel panic, no crash dump, no journal entries. The full investigation is in Post 007, but the short version:

Root cause: PCIe bus stall induced by the NVIDIA GPU under VFIO passthrough. The GPU entered a power state transition that caused the PCIe bus to hang, which cascaded into a full system lockup.

Mitigation:

# Added to GRUB_CMDLINE_LINUX_DEFAULT
pcie_aspm=off    # Disable PCIe Active State Power Management
pci=noaer        # Disable Advanced Error Reporting

These parameters prevent the GPU from entering the power states that trigger the stall. The system has been stable since.
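With pcie_aspm=off in effect, the GPU's link control should report ASPM Disabled, which makes for a quick post-reboot check. A sketch (the parser is mine):

```shell
# Extract the ASPM state from `lspci -vv` output for one device on stdin.
aspm_state() {
  sed -n 's/.*LnkCtl:.*ASPM \([A-Za-z0-9]*\).*/\1/p' | head -1
}

# On the host (02:00.0 is the GPU's address from lspci -nn):
# lspci -vv -s 02:00.0 | aspm_state   # expect: Disabled
```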


Current State

Service      Port   Status
-----------  -----  --------------------------------
Ollama       11434  Running - 6 models pulled
OpenWebUI    3000   Running - Authentik SSO
AnythingLLM  3001   Running - 500+ doc RAG pipeline
ComfyUI      8188   Running - text-to-image workflows

GPU metrics flow through Telegraf → InfluxDB → Grafana with real-time temperature, utilization, VRAM, and power monitoring.
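For reference, the Telegraf side of that pipeline can be as small as the stock nvidia_smi input plus an InfluxDB v2 output. A hedged sketch - the URL, token variable, org, and bucket are placeholders, not this lab's actual values:

```toml
[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"
  timeout = "5s"

[[outputs.influxdb_v2]]
  urls = ["http://influxdb.lan:8086"]
  token = "$INFLUX_TOKEN"
  organization = "homelab"
  bucket = "gpu"
```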

All inference happens locally. Zero data egress. The entire AI stack is air-gapped from external APIs.


Related: Post 007 - VFIO Forensic Postmortem | Post 011 - Partition Expansion | Post 017 - TIG Stack Observability