NPM + UniFi Firewall Rule Ordering - The Silent Traffic Drop


The Symptom

Nginx Proxy Manager was running. SSL certificates were valid. DNS resolved correctly. But when I hit https://grafana.tima.dev, NPM returned a 502 Bad Gateway. Sometimes it timed out entirely. Other services on the same VLAN worked fine - Authentik loaded, Portainer loaded. Grafana didn't.

The worst part: it was intermittent. Sometimes Grafana loaded on the first try. Sometimes it took 30 seconds. Sometimes it never loaded at all.


The Environment

NPM:       192.168.1.101 (VLAN 1 - Management)
Grafana:   192.168.20.40:3000 (VLAN 20 - Services)
Firewall:  UniFi Dream Machine Pro

NPM sits on VLAN 1. Grafana sits on VLAN 20. For NPM to proxy traffic to Grafana, the request has to cross the inter-VLAN boundary through the UDM Pro's firewall.


The Debugging Path

Check 1: Is Grafana Actually Running?

ssh root@192.168.20.40
systemctl status grafana-server
curl -s http://localhost:3000 | head -5

Grafana was running and responding locally. Not a Grafana problem.

Check 2: Can NPM Reach Grafana?

# From the NPM container
curl -v http://192.168.20.40:3000

This is where it got interesting. Sometimes the curl succeeded instantly. Sometimes it hung for 30 seconds and then either succeeded or timed out. The inconsistency pointed to a network-level issue, not an application issue.

Check 3: Is It a DNS Problem?

# From NPM
nslookup 192.168.20.40
ping -c 5 192.168.20.40

Ping worked reliably. DNS wasn't involved (NPM was configured with the IP directly). Not a DNS problem.

Check 4: Firewall Rules

This is where the bug was hiding.


The Root Cause: Firewall Rule Ordering

The UniFi Dream Machine evaluates firewall rules top to bottom, first match wins. My inter-VLAN rules looked like this (simplified):

Priority  Action   Source        Destination    Port
1         Allow    Management    Services       (all)
2         Block    All           All            (all)

Looks fine, right? Management VLAN can reach Services VLAN, everything else is blocked. And indeed, the initial SYN packet from NPM (VLAN 1) to Grafana (VLAN 20) was allowed by Rule 1.

The problem was the return traffic.

When Grafana responded, the return packet traveled from VLAN 20 → VLAN 1. The firewall evaluated this as a new flow from Services → Management. Rule 1 only allowed Management → Services. The return packet hit Rule 2 (Block All) and was silently dropped.

Why It Was Intermittent

The UDM Pro has stateful packet inspection. Once a connection is established, return traffic is automatically allowed by the connection tracking table. But connection tracking entries have a timeout. If the connection tracker expired the entry between requests (due to keep-alive timing, proxy connection pooling, or just bad luck), the next return packet was evaluated as a new flow - and dropped.

This explains the intermittency:

  • Fast requests: Connection tracking entry still active → return traffic allowed → works
  • Slow requests / new connections: Entry expired → return traffic evaluated as new flow → dropped by Rule 2 → timeout or 502
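The connection-tracking behavior above can be observed directly on the UDM Pro, which runs Linux under UniFi OS. A sketch, assuming the `conntrack` tool is present on the device - the sample entry below is illustrative, not captured output:

```shell
# On the UDM Pro shell, list tracked flows toward Grafana
# (assumes `conntrack` is available on this firmware):
#   conntrack -L -d 192.168.20.40 -p tcp
#
# A tracked entry looks roughly like this; the third field is the
# remaining lifetime of the entry in seconds:
entry="tcp 6 431999 ESTABLISHED src=192.168.1.101 dst=192.168.20.40 sport=51234 dport=3000"

# Extract the remaining lifetime - once it hits zero, the next return
# packet is evaluated as a brand-new flow against the rule list
lifetime=$(echo "$entry" | awk '{ print $3 }')
echo "seconds until this entry expires: $lifetime"
```

Watching that lifetime count down between requests makes the intermittency tangible: requests that arrive while the entry is alive succeed, requests that arrive after it expires hit the Block rule.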

The Fix

Option A: Add a bidirectional allow rule

Priority  Action   Source        Destination    Port
1         Allow    Management    Services       (all)
2         Allow    Services      Management     (all)  ← Added
3         Block    All           All            (all)

This explicitly allows return traffic from Services → Management, regardless of connection tracking state.

Option B: Use "Established/Related" rule (more precise)

Priority  Action   Source        Destination    Port      State
1         Allow    Management    Services       (all)     New, Established
2         Allow    Services      Management     (all)     Established, Related
3         Block    All           All            (all)

Rule 2 only allows return traffic for connections that were initiated from Management. New connections from Services → Management are still blocked by Rule 3.

I went with Option A for simplicity. In a homelab with a single operator, the additional precision of Option B doesn't justify the added rule complexity. In an enterprise environment with multiple trust zones, Option B is the correct approach.
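Under the hood, both options map onto ordinary netfilter rules. A hypothetical iptables translation of Option B, for illustration only - the UDM Pro manages its own chains, so these are not rules you would apply by hand:

```shell
# Rule 1: Management (VLAN 1) may open new connections to Services (VLAN 20)
iptables -A FORWARD -s 192.168.1.0/24 -d 192.168.20.0/24 \
    -m conntrack --ctstate NEW,ESTABLISHED -j ACCEPT

# Rule 2: Services may only answer connections that Management initiated
iptables -A FORWARD -s 192.168.20.0/24 -d 192.168.1.0/24 \
    -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# Rule 3: everything else between the VLANs is dropped
iptables -A FORWARD -j DROP
```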


Verification

After adding the bidirectional rule:

# From NPM - rapid sequential requests
for i in {1..20}; do
    curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" http://192.168.20.40:3000
    sleep 2
done

All 20 requests returned 200 with consistent sub-second response times. No more timeouts, no more 502s.

Tested across all NPM-proxied services:

Service       Before Fix                   After Fix
Grafana       Intermittent 502             ✅ Consistent
Wazuh         Slow load                    ✅ Fast
n8n           Occasional timeout           ✅ Consistent
Vaultwarden   Worked (same VLAN as NPM)    ✅ No change

Services on the same VLAN as NPM (VLAN 1) were never affected - their traffic doesn't cross the firewall boundary.


Why This Is Easy to Miss

  1. The error message is misleading. NPM reports 502 Bad Gateway, which makes you look at Grafana (the backend), not the network path between them.

  2. It's intermittent. Stateful firewalls mask the problem by allowing return traffic for active connections. You only see the failure when connection tracking entries expire.

  3. Unidirectional rules look correct. "Allow Management to Services" reads as complete. The mental model is "I'm allowing traffic between these VLANs." But a unidirectional rule only allows traffic in one direction - the return path is a separate decision.

  4. The firewall doesn't log dropped inter-VLAN traffic by default. Unless you enable logging on the Block All rule, silently dropped return packets leave no trace.
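To make those drops visible during debugging, enable logging on the Block rule in the UniFi UI, then watch the kernel log on the UDM itself. The log paths and rule prefixes vary by firmware version, so treat these commands as assumptions to adapt rather than exact recipes:

```shell
# SSH into the UDM Pro (enable SSH under device settings first)
ssh root@192.168.1.1

# Logged netfilter hits carry SRC=/DST=/DPT= fields; the rule-name
# prefix depends on firmware, so grep on the addresses instead:
dmesg -w | grep 'DST=192.168.1.101'
tail -f /var/log/messages | grep -i 'dpt=3000'
```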


Prevention

For any inter-VLAN communication in a UniFi environment:

  1. Always create bidirectional rules - if A needs to talk to B, create rules for both A→B and B→A (or use Established/Related for the return path)
  2. Enable logging on deny rules - at least temporarily during setup, so you can see what's being dropped
  3. Test with rapid sequential requests - a single curl might succeed due to connection tracking. Twenty requests with 2-second gaps will expose the intermittent failure
  4. Check firewall rules before application config - if the same service works via direct IP but fails through a reverse proxy on a different VLAN, the problem is almost certainly in the firewall path
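Point 3 is easy to script. A small helper, assuming only curl and awk are available - `count_failures` is a name introduced here for illustration, not an NPM or UniFi tool:

```shell
# count_failures: reads one HTTP status code per line on stdin and
# prints how many were not 200
count_failures() {
    awk '$1 != 200 { n++ } END { print n+0 }'
}

# Usage sketch: 20 requests with 2-second gaps, then one failure count.
# Against a unidirectional ruleset this tends to surface a nonzero count;
# after the fix it should print 0.
# for i in {1..20}; do
#     curl -s -o /dev/null -w '%{http_code}\n' http://192.168.20.40:3000
#     sleep 2
# done | count_failures
```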

The Takeaway

Inter-VLAN firewall rules are not bidirectional by default. If your reverse proxy sits on a different VLAN than your backend services, you need explicit rules for both the request path and the return path. Stateful connection tracking will mask this issue intermittently, making it one of the most frustrating network bugs to diagnose.


Related: Post 013 - NPM Rebuild + Cloudflare Migration | Post 014 - Tailscale Split DNS | Post 028 - AdGuard Home