Devoured - May 01, 2026
Kubernetes v1.36: Staleness Mitigation and Observability for Controllers (6 minute read)

DevOps

Kubernetes v1.36 introduces staleness mitigation for controllers to prevent them from taking incorrect actions based on outdated cache data.

What: The release adds atomic FIFO processing to client-go and implements staleness checks in four high-contention controllers (ReplicaSet, DaemonSet, Job, and StatefulSet) that verify cache resource versions before acting on objects.
Why it matters: Controller staleness is a subtle but serious issue where outdated local caches can cause controllers to take incorrect actions, miss updates, or delay responses, often only discovered when things go wrong in production.
Takeaway: If you're building Kubernetes controllers with client-go, you can use the new ConsistencyStore interface and LastStoreSyncResourceVersion() function to implement staleness mitigation in your own controllers.
Deep dive
  • Controllers maintain local caches of cluster state for performance, but these caches can become outdated during restarts, API server outages, or when events arrive out of order, leading to incorrect controller actions
  • The new AtomicFIFO feature in client-go enables atomic batch processing of operations, ensuring the queue remains consistent even when events arrive out of order during initial list operations
  • Controllers now track the resource version of objects they've written to the API server and compare it against their cache's resource version before taking action, skipping reconciliation if the cache is stale
  • The four updated controllers were chosen because they act on pods, which typically experience the highest contention in Kubernetes clusters
  • The ConsistencyStore interface provides three key functions: WroteAt (records the resource version at which the controller wrote an object), EnsureReady (checks whether the cache has caught up before reconciliation), and Clear (removes tracking for deleted objects)
  • Controllers track both the resource version of the objects they manage (e.g., ReplicaSets) and the resource versions of dependent objects (e.g., pods owned by those ReplicaSets)
  • New metrics include stale_sync_skips_total to count skipped syncs due to stale caches, and store_resource_version to expose the latest resource version of each shared informer
  • All staleness mitigation features are enabled by default in v1.36 but can be disabled per-controller using feature gates like StaleControllerConsistencyDaemonSet
  • The feature implements "read your own writes" semantics, ensuring controllers see their own updates before taking further action
  • SIG API Machinery is working with controller-runtime to bring these capabilities to all controllers built with that framework, enabling automatic staleness mitigation without custom implementation
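
The "read your own writes" check described in the list above can be sketched roughly as follows. This is a minimal illustration, not the client-go API: the writeTracker type, its method signatures, and the Skips counter are hypothetical, and only the method names WroteAt, EnsureReady, and Clear are borrowed from the summary. The sketch also assumes resource versions can be parsed and compared as unsigned integers.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
)

// writeTracker is a hypothetical stand-in for the ConsistencyStore idea.
// It is NOT the client-go interface; only the method names mirror the summary.
type writeTracker struct {
	mu    sync.Mutex
	wrote map[string]uint64 // object key -> resource version of our last write
	skips int               // similar in spirit to stale_sync_skips_total
}

func newWriteTracker() *writeTracker {
	return &writeTracker{wrote: make(map[string]uint64)}
}

// parseRV assumes resource versions are decimal integers (an assumption of this sketch).
func parseRV(rv string) (uint64, error) {
	return strconv.ParseUint(rv, 10, 64)
}

// WroteAt records the resource version returned by a successful write to key.
func (t *writeTracker) WroteAt(key, rv string) error {
	v, err := parseRV(rv)
	if err != nil {
		return err
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if v > t.wrote[key] {
		t.wrote[key] = v
	}
	return nil
}

// EnsureReady reports whether a cache at cacheRV has caught up with our
// last recorded write to key. A stale cache increments the skip counter.
func (t *writeTracker) EnsureReady(key, cacheRV string) (bool, error) {
	c, err := parseRV(cacheRV)
	if err != nil {
		return false, err
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if c < t.wrote[key] {
		t.skips++
		return false, nil
	}
	return true, nil
}

// Clear forgets the tracked write for a deleted object.
func (t *writeTracker) Clear(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.wrote, key)
}

// Skips returns how many reconciliations were skipped because the cache was stale.
func (t *writeTracker) Skips() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return t.skips
}

func main() {
	t := newWriteTracker()
	_ = t.WroteAt("default/web", "105")

	// The cache is still at version 100, so reconciliation should be skipped.
	ready, _ := t.EnsureReady("default/web", "100")
	fmt.Println(ready, t.Skips()) // false 1

	// Once the cache reaches version 105, the controller can proceed.
	ready, _ = t.EnsureReady("default/web", "105")
	fmt.Println(ready, t.Skips()) // true 1
}
```

In a real controller, a skipped key would typically be requeued so that reconciliation is retried once the informer cache catches up.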
Decoder
  • Staleness: When a controller's local cache contains outdated information about cluster state, potentially causing it to make decisions based on incorrect data
  • Reconciliation: The process where a controller compares desired state with actual state and takes action to align them, reading state from its local cache and writing changes through the API server
  • Informer: A client-go component that watches the Kubernetes API server for changes and maintains a local cache of objects a controller cares about
  • Resource version: A version identifier assigned to Kubernetes objects that increases with each update, used to track whether cached data is current
  • FIFO queue: First-in-first-out queue used by controllers to process events in order
  • client-go: The official Kubernetes client library for Go, used to build controllers and interact with the Kubernetes API
  • kube-controller-manager: The Kubernetes component that runs core controllers like ReplicaSet, DaemonSet, Job, and StatefulSet controllers
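
To make the FIFO queue entry above more concrete, here is a small sketch of atomic batch enqueueing: a whole batch of events is appended under a single lock, so a consumer never observes a partially applied batch. The batchFIFO and delta types are hypothetical names chosen for illustration. This is not how client-go implements its queue, only one simple way to get the same all-or-nothing visibility.

```go
package main

import (
	"fmt"
	"sync"
)

// delta is a hypothetical event: an object key plus the resource version observed.
type delta struct {
	Key string
	RV  uint64
}

// batchFIFO is an illustrative queue, not client-go's implementation.
type batchFIFO struct {
	mu    sync.Mutex
	items []delta
}

// AddBatch appends every delta in the batch under one lock acquisition,
// so a concurrent Pop sees either none of the batch or all of it.
func (q *batchFIFO) AddBatch(batch []delta) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, batch...)
}

// Pop removes and returns the oldest delta; ok is false when the queue is empty.
func (q *batchFIFO) Pop() (d delta, ok bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.items) == 0 {
		return delta{}, false
	}
	d = q.items[0]
	q.items = q.items[1:]
	return d, true
}

// Len returns the number of queued deltas.
func (q *batchFIFO) Len() int {
	q.mu.Lock()
	defer q.mu.Unlock()
	return len(q.items)
}

func main() {
	q := &batchFIFO{}
	// An initial list result is enqueued as one batch.
	q.AddBatch([]delta{{"default/pod-a", 7}, {"default/pod-b", 9}})

	// Items come back out in the order they were added.
	for {
		d, ok := q.Pop()
		if !ok {
			break
		}
		fmt.Println(d.Key, d.RV)
	}
}
```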
Original article

Kubernetes v1.36 introduces new features to combat "staleness" in controllers, where outdated local caches cause controllers to take incorrect actions or miss updates. It adds atomic FIFO processing to client-go and implements staleness checks in four high-contention controllers (ReplicaSet, DaemonSet, Job, and StatefulSet), which now verify cache resource versions before acting. The update also includes new metrics such as stale_sync_skips_total to monitor when controllers skip syncs due to stale data. All features are enabled by default and can be controlled via feature gates.