Resolving Intermittent Connection Failures: Stale HNS Endpoints in Windows L2Bridge

#Kubernetes #Windows #kube-proxy #HNS #Networking #L2Bridge #Troubleshooting

What's the problem?

Windows Kubernetes nodes on L2Bridge networks can see intermittent DNS timeouts and misrouted traffic when stale HNS endpoints survive Pod IP recycling. Removing those stale endpoints restores stable networking.

Why does this happen?

In L2Bridge networks, Pod IPs are frequently recycled. When a remote endpoint persists in the Windows Host Networking Service (HNS) after its IP has been reassigned to a local Pod, kube-proxy misroutes traffic to the stale remote entry instead of the new local Pod.

Code Example

// Logic implemented in pkg/proxy/winkernel/hns.go to reconcile endpoint states
if localEndpointExists && existingEp.IsRemote {
    // Flag the stale remote HNS endpoint for deletion
    remoteEPsWithDupIP[existingEp.hnsID] = true
    log.Info("Detected stale remote HNS endpoint; scheduling deletion", "ID", existingEp.hnsID)
}

// Deferred call runs when the enclosing function returns, so cleanup occurs before new rules are applied
defer deleteAllRemoteEndpointsWithDupIP(remoteEPsWithDupIP)
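
The excerpt above depends on kube-proxy internals, so it cannot be run on its own. The following is a minimal, self-contained sketch of the same idea, not the actual kube-proxy implementation. The endpoint type, the findStaleRemoteEndpoints and pruneEndpoints helpers, and the sample IDs and IPs are all hypothetical names chosen for illustration.

```go
package main

import "fmt"

// endpoint is a hypothetical, simplified stand-in for an HNS endpoint record.
type endpoint struct {
	hnsID    string
	ip       string
	isRemote bool
}

// findStaleRemoteEndpoints returns the IDs of remote endpoints whose IP
// is also held by a local endpoint (the duplicate-IP condition).
func findStaleRemoteEndpoints(eps []endpoint) map[string]bool {
	// First pass: record every IP owned by a local endpoint.
	localIPs := make(map[string]bool)
	for _, ep := range eps {
		if !ep.isRemote {
			localIPs[ep.ip] = true
		}
	}
	// Second pass: flag remote endpoints that collide with a local IP.
	stale := make(map[string]bool)
	for _, ep := range eps {
		if ep.isRemote && localIPs[ep.ip] {
			stale[ep.hnsID] = true
		}
	}
	return stale
}

// pruneEndpoints returns the endpoints left after dropping the flagged IDs.
// A real implementation would ask HNS to delete each flagged endpoint instead.
func pruneEndpoints(eps []endpoint, stale map[string]bool) []endpoint {
	kept := make([]endpoint, 0, len(eps))
	for _, ep := range eps {
		if !stale[ep.hnsID] {
			kept = append(kept, ep)
		}
	}
	return kept
}

func main() {
	eps := []endpoint{
		{hnsID: "ep-remote-old", ip: "10.244.1.7", isRemote: true},
		{hnsID: "ep-local-new", ip: "10.244.1.7", isRemote: false},
		{hnsID: "ep-remote-ok", ip: "10.244.2.9", isRemote: true},
	}
	stale := findStaleRemoteEndpoints(eps)
	for _, ep := range pruneEndpoints(eps, stale) {
		fmt.Println(ep.hnsID, ep.ip)
	}
}
```

In this sample, only ep-remote-old is flagged, because it shares 10.244.1.7 with a local endpoint. The remote endpoint on 10.244.2.9 has no local counterpart and is kept.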

How to fix it

Upgrade to a kube-proxy version that contains the endpoint reconciliation logic, so that stale HNS entries are pruned proactively. If manual remediation is required, restart the kube-proxy pod to trigger a full synchronization cycle. With the fix in place, that cycle forces the cleanup of remote HNS entries that conflict with local Pod IP assignments.