Post 018
The Goal
Run local LLM inference at production speed, 50+ tokens/second on 70B parameter models, without sending a single byte of data to an external API. The hardware: an NVIDIA RTX 4000 Ada Generation GPU in a Proxmox hypervisor. The method: VFIO/IOMMU passthrough to a dedicated VM.
The Hardware
Host: Node-A (FCM2250 / Millennium Falcon)
CPU: Intel Core Ultra 9 285K
RAM: 64GB DDR5
GPU: NVIDIA RTX 4000 Ada Generation (16GB VRAM)
Storage: NVMe (LVM-thin)
The RTX 4000 Ada sits in Node-A's PCIe slot. The goal is to pass the entire GPU through to a single VM (Tantive-III) so it has exclusive, bare-metal-equivalent access to the hardware.
IOMMU and VFIO Prerequisites
Enable IOMMU in BIOS
For Intel CPUs, enable VT-d (Intel Virtualization Technology for Directed I/O) in the BIOS. This is the hardware feature that allows the hypervisor to map PCIe devices to specific VMs.
Kernel Boot Parameters
Edit /etc/default/grub on the Proxmox host:
GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt pcie_aspm=off pci=noaer"
intel_iommu=on - Enables the IOMMU
iommu=pt - Passthrough mode for better performance
pcie_aspm=off - Disables PCIe Active State Power Management (mitigation for the lockup - see below)
pci=noaer - Disables Advanced Error Reporting (prevents kernel log spam from GPU power state transitions)
update-grub
reboot
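After the reboot, it's worth confirming the flags actually reached the running kernel. A minimal sketch - on the host you'd read /proc/cmdline directly; the sample string here stands in for it so the loop is self-contained:

```shell
# Each flag must appear verbatim on the kernel command line.
# On the host: cmdline=$(cat /proc/cmdline)
cmdline='BOOT_IMAGE=/boot/vmlinuz quiet intel_iommu=on iommu=pt pcie_aspm=off pci=noaer'
for flag in intel_iommu=on iommu=pt pcie_aspm=off pci=noaer; do
  case " $cmdline " in
    *" $flag "*) echo "$flag: present" ;;
    *)           echo "$flag: MISSING" ;;
  esac
done
```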
Verify IOMMU Groups
# Count IOMMU groups (the parent directory adds 1 to the total);
# a result greater than 1 means the IOMMU is active and populated
find /sys/kernel/iommu_groups/ -maxdepth 1 -type d | wc -l
The GPU should appear in its own IOMMU group. If it shares a group with other devices, you'll need ACS override patches or a different PCIe slot.
# Find the GPU's IOMMU group
for d in /sys/kernel/iommu_groups/*/devices/*; do
echo "IOMMU Group $(basename $(dirname $(dirname $d))): $(lspci -nns $(basename $d))"
done | grep -i nvidia
Load VFIO Modules
echo "vfio" >> /etc/modules
echo "vfio_iommu_type1" >> /etc/modules
echo "vfio_pci" >> /etc/modules
echo "vfio_virqfd" >> /etc/modules   # not needed on kernel 6.2+ (merged into vfio)
Blacklist the NVIDIA drivers on the host (the GPU belongs to the VM, not the hypervisor):
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf
echo "blacklist nvidia" >> /etc/modprobe.d/blacklist.conf
Bind the GPU to VFIO:
# Get the GPU's PCI IDs
lspci -nn | grep -i nvidia
# Output: 02:00.0 VGA compatible controller [0300]: NVIDIA Corporation [10de:xxxx] (rev a1)
echo "options vfio-pci ids=10de:xxxx,10de:yyyy" >> /etc/modprobe.d/vfio.conf
Replace xxxx and yyyy with your GPU and audio device IDs.
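The two IDs can be scraped straight out of the lspci output rather than copied by hand. A sketch using a sample line (the device ID 10de:27b2 here is illustrative - use whatever your own lspci reports):

```shell
# On the host, replace the sample with: lspci -nn | grep -i nvidia
sample='02:00.0 VGA compatible controller [0300]: NVIDIA Corporation AD104GL [10de:27b2] (rev a1)'
# [vendor:device] is the only bracketed token with a colon-separated hex pair
ids=$(echo "$sample" | grep -oP '\[\K[0-9a-f]{4}:[0-9a-f]{4}(?=\])')
echo "$ids"
```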
update-initramfs -u
reboot
VM Creation: Tantive-III
The Debian ISO Gotcha
ProxMenux created the VM shell, but the Debian ISO URL it pointed at was stale:
wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-13.0.0-amd64-netinst.iso
# 404 Not Found
The filename changes with point releases. Find the current one:
wget -qO- https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/ | grep -oP 'debian-[0-9.]+-amd64-netinst\.iso' | head -1
# debian-13.3.0-amd64-netinst.iso
Download the correct ISO:
wget -O /var/lib/vz/template/iso/debian-13.3.0-amd64-netinst.iso \
https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-13.3.0-amd64-netinst.iso
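Since only the point-release number changes, the discovery step and the download can be chained. A sketch of the filename parsing (the filename is hard-coded here so the snippet runs offline; in practice it would come from the wget|grep pipeline above):

```shell
# Extract the point-release version so the URL can be built instead of hardcoded
iso='debian-13.3.0-amd64-netinst.iso'
ver=$(echo "$iso" | grep -oP 'debian-\K[0-9.]+(?=-amd64)')
echo "https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-${ver}-amd64-netinst.iso"
```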
VM Configuration
VM ID: 201
Name: Tantive-III
CPU: 8 cores (host type)
RAM: 32GB
Disk 1: 128GB NVMe (fast-lvm, scsi0)
Disk 2: 128GB NVMe (fast-lvm, scsi1)
GPU: PCIe passthrough (hostpci0: 0000:02:00, pcie=1)
Network: virtio, bridge=vmbr0
The VM config in /etc/pve/qemu-server/201.conf:
hostpci0: 0000:02:00,pcie=1
scsi0: fast-lvm:vm-201-disk-1,discard=on,size=128G,ssd=1
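For context, a sketch of what the rest of 201.conf plausibly looks like given the spec above. Every value outside the two lines already shown is illustrative (the MAC is a placeholder); q35 plus OVMF is the usual machine type and firmware pairing for PCIe passthrough:

```
bios: ovmf
cores: 8
cpu: host
hostpci0: 0000:02:00,pcie=1
machine: q35
memory: 32768
name: Tantive-III
net0: virtio=XX:XX:XX:XX:XX:XX,bridge=vmbr0
scsi0: fast-lvm:vm-201-disk-1,discard=on,size=128G,ssd=1
scsi1: fast-lvm:vm-201-disk-2,discard=on,size=128G,ssd=1
scsihw: virtio-scsi-single
```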
Partition Expansion
The VM booted with only 3GB usable despite 128GB allocated. Full fix documented in Post 011 - involves kpartx, parted GPT repair, and resize2fs from the Proxmox host.
After expansion:
df -h /
# /dev/sda1 126G 2.4G 118G 2% /
NVIDIA Driver Installation
Inside the Tantive-III VM:
apt update && apt upgrade -y
# nvidia-driver lives in Debian's non-free component - make sure
# contrib, non-free, and non-free-firmware are enabled in sources.list
apt install -y nvidia-driver firmware-misc-nonfree
reboot
After reboot:
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.xx                 Driver Version: 550.xx         CUDA Version: 12.x     |
|-----------------------------------------+------------------------+----------------------|
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 4000 Ada Gene...    Off | 00000000:01:00.0   Off |                  Off |
| 30%   35C    P8              9W / 130W  |       0MiB / 16376MiB  |      0%      Default |
+-----------------------------------------------------------------------------------------+
GPU visible, driver loaded, 16GB VRAM available.
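For scripting and monitoring, nvidia-smi's query mode is far easier to consume than the banner above. A sketch of parsing its CSV output - the sample line stands in for the real command, which is shown in the comment:

```shell
# Real command: nvidia-smi --query-gpu=temperature.gpu,utilization.gpu,memory.used \
#                          --format=csv,noheader,nounits
csv='35, 0, 0'   # sample output: temp C, util %, VRAM MiB
IFS=', ' read -r temp util mem <<EOF
$csv
EOF
echo "temp=${temp}C util=${util}% vram=${mem}MiB"
```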
AI Stack Deployment
With the GPU accessible, deploy the AI services:
Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3:8b
ollama pull llama3:70b
ollama pull codellama:34b
Performance: ~50 tokens/second on the 70B model with the RTX 4000 Ada.
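A throughput figure like that can be computed from Ollama's own response metadata: each /api/generate reply includes eval_count (tokens generated) and eval_duration (reported in nanoseconds). The arithmetic, with sample values chosen for illustration:

```shell
# tokens/second = eval_count / (eval_duration converted to seconds)
eval_count=512            # sample: tokens generated in one response
eval_duration=10240000000 # sample: 10.24 s, as nanoseconds
tps=$(awk -v c="$eval_count" -v d="$eval_duration" 'BEGIN { printf "%.1f", c / (d / 1e9) }')
echo "${tps} tokens/s"
```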
OpenWebUI
# Named volume persists chats and users across container recreation
docker run -d --name openwebui \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v openwebui-data:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:main
AnythingLLM
For RAG (Retrieval-Augmented Generation) over local documents:
docker run -d --name anythingllm \
-p 3001:3001 \
-v /home/user/anythingllm:/app/server/storage \
mintplexlabs/anythingllm
Connected to the Ollama instance for embedding and inference. 500+ documents ingested into the RAG pipeline.
ComfyUI
For image generation workloads:
# Clone and install ComfyUI with GPU support, isolated in a venv
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
python3 -m venv venv && . venv/bin/activate
pip install -r requirements.txt
python main.py --listen 0.0.0.0 --port 8188
The Lockup and the Kernel Parameters
Two weeks after deployment, Node-A hard-locked. No kernel panic, no crash dump, no journal entries. The full investigation is in Post 007, but the short version:
Root cause: PCIe bus stall induced by the NVIDIA GPU under VFIO passthrough. The GPU entered a power state transition that caused the PCIe bus to hang, which cascaded into a full system lockup.
Mitigation:
# Added to GRUB_CMDLINE_LINUX_DEFAULT
pcie_aspm=off # Disable PCIe Active State Power Management
pci=noaer # Disable Advanced Error Reporting
These parameters prevent the GPU from entering the power states that trigger the stall. The system has been stable since.
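Whether the mitigation actually took effect can be read back from the GPU's link control register. On the host you'd run `lspci -vv -s 02:00.0 | grep LnkCtl`; a sample of that output stands in here so the check is self-contained:

```shell
# With pcie_aspm=off on the kernel command line, the GPU's LnkCtl
# line should report ASPM Disabled
lnkctl='LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+'
case "$lnkctl" in
  *'ASPM Disabled'*) echo "ASPM off - mitigation active" ;;
  *)                 echo "ASPM still enabled - check kernel cmdline" ;;
esac
```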
Current State
| Service | Port | Status |
|---|---|---|
| Ollama | 11434 | Running - 6 models pulled |
| OpenWebUI | 3000 | Running - Authentik SSO |
| AnythingLLM | 3001 | Running - 500+ doc RAG pipeline |
| ComfyUI | 8188 | Running - text-to-image workflows |
GPU metrics flow through Telegraf → InfluxDB → Grafana with real-time temperature, utilization, VRAM, and power monitoring.
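The GPU leg of that pipeline needs nothing exotic - Telegraf ships a built-in nvidia_smi input that polls the same binary. A minimal config fragment (path and timeout are illustrative):

```toml
[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"
  timeout = "5s"
```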
All inference happens locally. Zero data egress. The entire AI stack is air-gapped from external APIs.
Related: Post 007 - VFIO Forensic Postmortem | Post 011 - Partition Expansion | Post 017 - TIG Stack Observability