Fixing Kubernetes kube-proxy High CPU and Latency in Large-Scale Clusters

#Kubernetes #kube-proxy #PerformanceTuning #iptables #Networking #Scalability

What's the problem?

In large Kubernetes clusters, kube-proxy suffers periodic CPU spikes and elevated network latency caused by unnecessary full iptables synchronization cycles. The fix is to suppress these timer-driven full syncs so the proxy relies on incremental updates instead.

Why does this happen?

kube-proxy's 'iptables' mode triggers a periodic 'full sync' of network rules every 30 minutes, regardless of cluster size. Each full sync rebuilds and reloads the entire rule set, so in large-scale environments with over 1,000 endpoints this creates massive I/O overhead and CPU spikes, causing network latency and rule flapping.

Code Example

/* Logic modification in pkg/proxy/iptables/proxier.go */

// Original: Forces full sync based on time threshold
doFullSync := proxier.needFullSync || (time.Since(proxier.lastFullSync) > proxyutil.FullSyncPeriod)

// Optimized: Respects largeClusterMode to suppress periodic timers
doFullSync := proxier.needFullSync || 
    ((time.Since(proxier.lastFullSync) > proxyutil.FullSyncPeriod) && !proxier.largeClusterMode)

How to fix it

To resolve this, change the proxier logic so that timer-based full synchronizations are suppressed when operating in large-cluster mode, letting incremental updates carry steady-state traffic. Note that this is a modification to kube-proxy's sync logic rather than a configuration flag: the 'largeClusterMode' check gates the periodic timer so that a full sync runs only when strictly necessary, such as after detected state corruption, rather than on a fixed interval. This reduces resource contention and stabilizes node performance during steady-state operation.