Kubernetes v1.36: Mutable Pod Resources for Suspended Jobs (beta) (3 minute read)
Kubernetes v1.36 now lets you modify CPU, memory, and GPU resource requests on suspended Jobs without deleting them, enabling smarter resource allocation for batch and ML workloads.
What: A beta feature in Kubernetes v1.36 that allows queue controllers and administrators to update resource specifications (CPU, memory, GPU, extended resources) in the pod template of a suspended Job before it starts running, eliminating the need to delete and recreate Jobs when resource requirements change.
Why it matters: The exact resource needs of batch and machine learning workloads are often unknown up front, because the optimal allocation depends on current cluster conditions. Previously, changing resources meant deleting and recreating the Job, losing its metadata, status, and history.
Takeaway: If you run Kubernetes v1.36 or later, try it out: create a suspended Job, edit its resources with kubectl edit, then resume it by setting spec.suspend to false.
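The workflow in the takeaway can be sketched as three payloads: a suspended Job manifest, a patch that changes its resources, and a patch that resumes it. This is a minimal Python sketch rather than the upstream example; the Job name train-job, the container name trainer, the image, and the nvidia.com/gpu resource name are all placeholder assumptions.

```python
import json

GPU = "nvidia.com/gpu"  # assumed extended resource name, not from the article


def suspended_job(name, container, image, gpus):
    """Return a batch/v1 Job manifest that starts suspended."""
    count = str(gpus)
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "suspend": True,  # no Pods are created until this becomes False
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [{
                        "name": container,
                        "image": image,
                        "resources": {
                            "requests": {GPU: count},
                            "limits": {GPU: count},
                        },
                    }],
                }
            },
        },
    }


def resource_patch(container, gpus):
    """Patch body that changes one container's GPU request and limit.

    Intended as a strategic merge patch, where containers are matched by name.
    """
    count = str(gpus)
    resources = {"requests": {GPU: count}, "limits": {GPU: count}}
    return {"spec": {"template": {"spec": {
        "containers": [{"name": container, "resources": resources}],
    }}}}


def resume_patch():
    """Patch body that resumes the Job."""
    return {"spec": {"suspend": False}}


# Placeholder values for illustration only.
job = suspended_job("train-job", "trainer", "example.com/trainer:latest", 4)
patch = resource_patch("trainer", 2)
resume = resume_patch()
print(json.dumps(patch))
```

Each dictionary serializes with json.dumps. The manifest could be fed to kubectl apply -f -, and the two patch bodies to kubectl patch job train-job -p '<json>', which uses a strategic merge patch by default.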
Deep dive
- The feature, first introduced as alpha in v1.35, is now beta in v1.36, with the MutablePodResourcesForSuspendedJobs feature gate enabled by default
- You can modify resource requests and limits for containers and init containers while a Job has spec.suspend set to true
- For a Job that was running and then suspended, all active Pods must terminate (status.active equals 0) before resource changes are accepted, which prevents inconsistency between existing Pods and the updated template
- The use case focuses on queue controllers like Kueue that manage cluster resources and need to adjust Job allocations based on current availability
- Example scenario: a training Job initially requesting 4 GPUs can be scaled down to 2 GPUs if that's what the cluster can provide, rather than being deleted
- Also useful for CronJobs, which can run with reduced resources when the cluster is under heavy load instead of failing outright
- No new API types were introduced; existing Job and pod template structures handle this through relaxed validation rules
- Standard resource validation still applies (limits must be greater than or equal to requests, extended resources must be whole numbers)
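- When using this feature with Jobs that may have failed Pods, consider setting podReplacementPolicy: Failed, so replacement Pods are created only after the old ones have fully terminated, which helps prevent resource contention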
- For Dynamic Resource Allocation (DRA) workloads, resourceClaimTemplates remain immutable and must be recreated separately
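The acceptance conditions in the list above can be illustrated with a small Python sketch. It is a simplified model, not the API server's validation code: the quantity parser handles only plain numbers and the m, Ki, Mi, and Gi suffixes, and treating any resource name that contains "/" as an extended resource is an assumption made for brevity.

```python
from decimal import Decimal

# Simplified subset of Kubernetes quantity suffixes (assumption for this sketch).
_SUFFIXES = {
    "m": Decimal("0.001"),
    "Ki": Decimal(1024),
    "Mi": Decimal(1024) ** 2,
    "Gi": Decimal(1024) ** 3,
}


def parse_quantity(text):
    """Convert a quantity string such as "500m" or "2Gi" to a Decimal."""
    for suffix, factor in _SUFFIXES.items():
        if text.endswith(suffix):
            return Decimal(text[: -len(suffix)]) * factor
    return Decimal(text)


def can_update_resources(job):
    """Resources are mutable only while suspended with no active Pods."""
    suspended = job.get("spec", {}).get("suspend", False)
    active = job.get("status", {}).get("active", 0)
    return bool(suspended) and active == 0


def validate_resources(resources):
    """Return a list of error strings for one container's resources."""
    errors = []
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    for section in (requests, limits):
        for name, raw in section.items():
            # Assumption: a "/" in the name marks an extended resource.
            if "/" in name and parse_quantity(raw) % 1 != 0:
                errors.append(f"{name}: extended resources must be whole numbers")
    for name, raw in requests.items():
        if name in limits and parse_quantity(limits[name]) < parse_quantity(raw):
            errors.append(f"{name}: limit must be >= request")
    return errors
```

For example, a Job with spec.suspend set to true but status.active equal to 2 fails can_update_resources, and a container requesting 2Gi of memory with a 1Gi limit produces one error from validate_resources.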
Decoder
- Suspended Job: A Kubernetes Job with spec.suspend set to true, meaning it won't create Pods until resumed
- Pod template: The specification within a Job that defines how Pods should be created, including container images and resource requirements
- Resource requests/limits: CPU, memory, and GPU specifications that define minimum guaranteed resources (requests) and maximum allowed resources (limits) for containers
- Queue controller: Software like Kueue that manages job queuing and resource allocation across a Kubernetes cluster based on priorities and availability
- DRA (Dynamic Resource Allocation): A Kubernetes mechanism for managing specialized hardware resources beyond standard CPU and memory
Original article
Kubernetes v1.36 promoted to beta the ability to modify CPU, memory, GPU, and other resource requests in suspended Jobs' pod templates, eliminating the need to delete and recreate Jobs when resource requirements change. The feature, enabled by default, lets queue controllers and administrators adjust resources before Jobs start running. It is particularly useful for batch and machine learning workloads where optimal allocation depends on current cluster conditions.
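To make the queue-controller use case concrete, here is a hypothetical Python sketch of how a controller might choose a reduced GPU count and express the change as a JSON Patch (RFC 6902). It does not describe how Kueue or any real controller is implemented; the function names, the minimum threshold, and the nvidia.com/gpu resource name are assumptions for illustration.

```python
def escape_pointer(token):
    """Escape a JSON Pointer token (RFC 6901): "~" first, then "/"."""
    return token.replace("~", "~0").replace("/", "~1")


def fit_gpus(requested, available, minimum=1):
    """Pick a GPU count the cluster can provide, or None to keep waiting."""
    granted = min(requested, available)
    return granted if granted >= minimum else None


def gpu_json_patch(container_index, gpus, resource="nvidia.com/gpu"):
    """Build JSON Patch operations that set the GPU request and limit.

    The "replace" operation assumes both fields already exist in the template.
    """
    base = f"/spec/template/spec/containers/{container_index}/resources"
    key = escape_pointer(resource)
    return [
        {"op": "replace", "path": f"{base}/{section}/{key}", "value": str(gpus)}
        for section in ("requests", "limits")
    ]


# A Job asked for 4 GPUs but only 2 are free (illustrative numbers).
granted = fit_gpus(requested=4, available=2)
ops = gpu_json_patch(0, granted) if granted is not None else []
```

The request and limit are set to the same value here because both are updated together in this sketch; a patch like this would be applied while the Job is still suspended, followed by setting spec.suspend to false.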