Fixing Stale Windows HNS Endpoint Conflicts in Kubernetes

#Kubernetes #Windows #HNS #Networking #kube-proxy #L2Bridge #Troubleshooting

What's the problem?

Windows Kubernetes clusters can suffer intermittent DNS timeouts and network connectivity failures when a pod IP is reused while a stale L2Bridge HNS endpoint for that IP still exists.

Why does this happen?

The issue occurs when a pod IP is recycled: the Windows Host Networking Service (HNS) retains a stale remote endpoint for an IP that now belongs to a local pod. kube-proxy lacks a reconciliation mechanism to detect these collisions, so traffic is incorrectly routed to a remote node instead of the local pod.

Code Example

/* Logic added to pkg/proxy/winkernel/proxier.go */

// localEndpoints and remoteEndpoints are both keyed by pod IP.
if _, isLocal := localEndpoints[ip]; isLocal {
    // The local endpoint takes priority over any remote endpoint with the same IP.
    if remoteEP, isRemote := remoteEndpoints[ip]; isRemote {
        klog.V(4).InfoS("Cleaning stale remote endpoint due to local collision", "ip", ip)
        // Remove the stale object from HNS, then drop it from the in-memory map.
        deleteRemoteEndpoint(remoteEP)
        delete(remoteEndpoints, ip)
    }
}

// Enforce cleanup during syncProxyRules
defer hns.deleteAllRemoteEndpointsWithDupIP(conflictingEndpoints)

How to fix it

To resolve this, update kube-proxy to a version that includes this HNS synchronization logic. The fix uses dual-key identification of endpoints and gives local endpoints authority over remote ones. During each syncProxyRules cycle, kube-proxy now explicitly detects IP collisions and immediately deletes the stale remote HNS endpoint objects before finalizing network routing, so that endpoint state stays consistent.
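
The collision rule described above can be tried in isolation with a small standalone Go program. This is only a minimal sketch, not the kube-proxy implementation: the `endpoint` type and the `reconcile` function are hypothetical names introduced for illustration, and the HNS calls are replaced by plain in-memory slices. It assumes that a remote endpoint sharing an IP with a local endpoint is always the stale one.

```go
package main

import (
	"fmt"
	"sort"
)

// endpoint is a hypothetical, simplified stand-in for an HNS endpoint record.
type endpoint struct {
	ID      string // illustrative endpoint identifier
	IP      string // pod IP address
	IsLocal bool   // true if the pod runs on this node
}

// reconcile applies the "local wins" rule: any remote endpoint whose IP is
// also held by a local endpoint is reported as stale and dropped from the
// returned list. The stale IDs are returned sorted.
func reconcile(eps []endpoint) (kept []endpoint, staleIDs []string) {
	// First pass: collect every IP owned by a local endpoint.
	localIPs := make(map[string]bool)
	for _, ep := range eps {
		if ep.IsLocal {
			localIPs[ep.IP] = true
		}
	}
	// Second pass: separate colliding remote endpoints from the rest.
	for _, ep := range eps {
		if !ep.IsLocal && localIPs[ep.IP] {
			staleIDs = append(staleIDs, ep.ID)
			continue
		}
		kept = append(kept, ep)
	}
	sort.Strings(staleIDs)
	return kept, staleIDs
}

func main() {
	eps := []endpoint{
		{ID: "local-a", IP: "10.244.1.5", IsLocal: true},
		{ID: "remote-stale", IP: "10.244.1.5", IsLocal: false},
		{ID: "remote-ok", IP: "10.244.2.7", IsLocal: false},
	}
	kept, stale := reconcile(eps)
	fmt.Println("stale remote endpoints to delete:", stale)
	fmt.Println("endpoints kept:", len(kept))
}
```

In a real proxier, each ID in the stale list would be passed to an HNS delete call before the remaining endpoints are used to program load-balancer policies. The sketch only computes which endpoints would be affected.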