Fixing Stale HNS Endpoint Routing Errors in Windows Kubernetes Nodes
What's the problem?
Windows Kubernetes clusters can suffer DNS timeouts and traffic black-holing caused by stale Host Network Service (HNS) endpoints in L2Bridge configurations.
Why does this happen?
The issue arises from a synchronization gap where HNS fails to purge remote endpoint records when their IP addresses are reassigned to local pods. This creates 'zombie' entries that intercept traffic, causing routing conflicts between local and remote endpoints.
Code Example
// The logic fix implements an explicit cleanup during the sync cycle.
// allEndpoints is the set of local and remote endpoints gathered earlier
// in the sync (not shown).
func (proxier *Proxier) syncProxyRules() {
    // 1. Identify stale remote endpoints where IP matches a local endpoint
    staleIDs := proxier.identifyStaleEndpoints(allEndpoints)
    // 2. Explicitly purge zombie HNS records before rule calculation
    for _, id := range staleIDs {
        hns.DeleteEndpoint(id)
    }
    // 3. Proceed to apply valid network state
    proxier.applyRules(allEndpoints)
}
How to fix it
Upgrade your kube-proxy component to the latest stable release that includes the proactive HNS reconciliation patch. If the issue persists, ensure your node environment is updated to support explicit HNS endpoint cleanup via the syncProxyRules lifecycle hooks. You can force a state reconciliation by restarting the kube-proxy service, which triggers the updated logic to purge stale IDs before re-applying network rules.
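Step 1 of the sync cycle above, finding remote endpoints whose IP now belongs to a local endpoint, written out as a stand-alone Go program. This is an added example, not kube-proxy's own code: the `Endpoint` struct, the `canonical` helper and the sample records are assumptions, and it goes slightly beyond the text by normalising IPv4-mapped IPv6 addresses before comparing.

```go
package main

import (
	"fmt"
	"net/netip"
)

// Endpoint is a hypothetical, simplified view of one HNS endpoint record.
type Endpoint struct {
	ID      string
	IP      string
	IsLocal bool // false: remote endpoint for a pod on another node
}

// canonical parses an address and unmaps IPv4-in-IPv6, so "10.244.1.5"
// and "::ffff:10.244.1.5" compare equal.
func canonical(s string) (netip.Addr, bool) {
	a, err := netip.ParseAddr(s)
	return a.Unmap(), err == nil
}

// identifyStaleEndpoints returns IDs of remote endpoints whose IP is now
// owned by a local endpoint. Unparsable IPs are skipped, never flagged.
func identifyStaleEndpoints(all []Endpoint) []string {
	local := make(map[netip.Addr]bool)
	for _, ep := range all {
		if a, ok := canonical(ep.IP); ok && ep.IsLocal {
			local[a] = true
		}
	}
	var stale []string
	for _, ep := range all {
		if a, ok := canonical(ep.IP); ok && !ep.IsLocal && local[a] {
			stale = append(stale, ep.ID)
		}
	}
	return stale
}

func main() {
	// Constructed sample records, not taken from a real node.
	all := []Endpoint{
		{"local-a", "10.244.1.5", true},
		{"remote-zombie", "::ffff:10.244.1.5", false}, // same IP, stale
		{"remote-ok", "10.244.2.9", false},
		{"remote-bad", "not-an-ip", false},
	}
	fmt.Println(identifyStaleEndpoints(all))
}
```

Only remote records are ever returned, so a cleanup loop driven by this list cannot delete the endpoint of a running local pod.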