This article gives you a technical foundation for Azure Kubernetes Service (AKS) cluster upgrades by covering upgrade options and common scenarios. For in-depth guidance tailored to your needs, use the scenario-based navigation paths at the end of this article.
What this article covers
This technical reference covers the following AKS upgrade fundamentals:
- Manual versus automated upgrade options and when to use each.
- Common upgrade scenarios with specific recommendations.
- Optimization techniques for performance and minimal disruption.
- Troubleshooting guidance for capacity, drain failures, and timing issues.
- Validation processes and pre-upgrade checks.
Use this hub to understand upgrade mechanics, troubleshoot issues, optimize upgrade settings, and learn about technical implementation.
For more information, see these related articles:
- To upgrade your production AKS clusters, see AKS production upgrade strategies.
- To get upgrade patterns for AKS clusters with stateful workloads, see Stateful workload upgrade patterns.
- To use the scenario hub to help you choose the right AKS upgrade approach, see AKS upgrade scenarios: Choose your path.
If you're new to AKS upgrades, start with the upgrade scenarios hub for guided, scenario-based assistance.
Quick navigation
Your situation | Recommended path |
---|---|
Production cluster needs an upgrade | Production upgrade strategies |
Database/stateful workloads | Stateful workload patterns |
First-time upgrade or basic cluster | Basic AKS cluster upgrade |
Multiple environments or fleet | Upgrade scenarios hub |
Node pools or Windows nodes | Node pool upgrades |
Specific node pool only | Single node pool upgrade |
Upgrade options
Perform manual upgrades
Manual upgrades let you control when your cluster upgrades to a new Kubernetes version. These upgrades are useful for testing or targeting a specific version:
- Upgrade an AKS cluster
- Upgrade multiple AKS clusters via Azure Kubernetes Fleet Manager
- Upgrade the node image
- Customize node surge upgrade
- Process node OS updates
Configure automatic upgrades
Automatic upgrades keep your cluster on a supported version and up to date. Use them when you want upgrades applied on a schedule without manual intervention:
- Automatically upgrade an AKS cluster
- Automatically upgrade multiple AKS clusters via Azure Kubernetes Fleet Manager
- Use planned maintenance to schedule and control upgrades
- Stop AKS cluster upgrades automatically on API breaking changes (preview)
- Automatically upgrade AKS cluster node operating system images
- Apply security updates to AKS nodes automatically by using GitHub actions
Special considerations for node pools that span multiple availability zones
AKS uses best-effort zone balancing in node groups. During an upgrade surge, the zones for surge nodes in virtual machine scale sets are unknown ahead of time, which can temporarily cause an unbalanced zone configuration. AKS deletes surge nodes after the upgrade and restores the original zone balance.
To keep zones balanced, set surge to a multiple of three nodes. Persistent volume claims that use Azure locally redundant storage disks are zone bound and might cause downtime if surge nodes are in a different zone. Use a pod disruption budget (PDB) to maintain high availability during drains.
Optimize upgrades to improve performance and minimize disruptions
Combine planned maintenance window, max surge, PDB, node drain timeout, and node soak time to increase the likelihood of successful, low-disruption upgrades:
- Planned maintenance window: Schedule auto-upgrade during low-traffic periods. We recommend at least four hours.
- Max surge: Higher values speed upgrades but might disrupt workloads. We recommend 33% for production.
- Max unavailable: Use when capacity is limited.
- Pod disruption budget: Set to limit pods down during upgrades. Validate for your service.
- Node drain timeout: Configure pod eviction wait duration. The default is 30 minutes.
- Node soak time: Stagger upgrades to minimize downtime. The default is 0 minutes.
Upgrade settings | How extra nodes are used | Expected behavior |
---|---|---|
maxSurge=5, maxUnavailable=0 | 5 surge nodes | Five nodes are surged for upgrade. |
maxSurge=5, maxUnavailable=0 | 0-4 surge nodes | Upgrade fails because of insufficient surge nodes. |
maxSurge=0, maxUnavailable=5 | N/A | Five existing nodes are drained for upgrade. |
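When max surge is given as a percentage instead of a node count, it helps to know how many surge nodes that implies for a given pool size. The following is a minimal sketch, assuming that fractional nodes are rounded up; `surge_nodes` is a hypothetical helper for illustration and isn't part of the Azure CLI.

```shell
# Sketch: convert a percentage maxSurge into a surge node count.
# surge_nodes is a hypothetical helper, not an Azure CLI command.
# Assumption: fractional nodes are rounded up.
surge_nodes() {
  # $1 = current node count in the pool
  # $2 = maxSurge percentage, without the % sign
  # Integer ceiling of (node count * percent / 100)
  echo $(( ($1 * $2 + 99) / 100 ))
}

surge_nodes 10 33   # prints 4
surge_nodes 30 33   # prints 10
```

Under this assumption, a 10-node pool with max surge of 33% needs quota and IP addresses for four extra nodes during the upgrade.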
Note
Before you upgrade, check for API breaking changes and review the AKS release notes to avoid disruptions.
Validations used in the upgrade process
AKS performs pre-upgrade validations to ensure cluster health:
- API breaking changes: Detects deprecated APIs.
- Kubernetes upgrade version: Ensures a valid upgrade path.
- PDB configuration: Checks for misconfigured PDBs (for example, maxUnavailable=0).
- Quota: Confirms enough quota for surge nodes.
- Subnet: Verifies sufficient IP addresses.
- Certificates/service principals: Detects expired credentials.
These checks help to minimize upgrade failures and provide early visibility into issues.
Common upgrade scenarios and recommendations
Scenario 1: Capacity constraints
If your cluster is limited by product tier or regional capacity, upgrades might fail when surge nodes can't be provisioned. This situation is common with specialized product tiers (like GPU nodes) or in regions with limited resources. Errors such as SKUNotAvailable, AllocationFailed, or OverconstrainedAllocationRequest might occur if maxSurge is set too high for available capacity.
Recommendations to prevent or resolve
- Use maxUnavailable to upgrade by using existing nodes instead of surging new ones. For more information, see Customize unavailable nodes during upgrade.
- Lower maxSurge to reduce extra capacity needs. For more information, see Customize node surge upgrade.
- For security-only updates, use security patch reimages that don't require surge nodes. For more information, see Apply security and kernel updates to Linux nodes in Azure Kubernetes Service.
Scenario 2: Node drain failures and PDBs
Upgrades require draining nodes (evicting pods). Drains can fail if:
- Pods are slow to terminate (long shutdown hooks or persistent connections).
- Strict PDBs block pod evictions.
Example error message:
Code: UpgradeFailed
Message: Drain node ... failed when evicting pod ... failed with Too Many Requests error. This error is often caused by a restrictive PDB policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget. PDB debug info: ... blocked by pdb ... with 0 unready pods.
Recommendations to prevent or resolve
- Set maxUnavailable in PDBs to allow at least one pod to be evicted.
- Increase pod replicas so that the disruption budget can tolerate evictions.
- Use undrainableNodeBehavior to allow upgrades to proceed even if some nodes can't be drained:
  - Schedule (default): Delete node and surge replacement to reduce capacity.
  - Cordon (recommended): Node is cordoned and labeled as kubernetes.azure.com/upgrade-status=Quarantined.

  Example command:

  az aks nodepool update \
    --resource-group <resource-group-name> \
    --cluster-name <cluster-name> \
    --name <node-pool-name> \
    --undrainable-node-behavior Cordon

  The following example output shows the undrainable node behavior updated:

  "upgradeSettings": {
    "drainTimeoutInMinutes": null,
    "maxSurge": "1",
    "nodeSoakDurationInMinutes": null,
    "undrainableNodeBehavior": "Cordon"
  }
Max Blocked Nodes Allowed (preview)
The Max Blocked Nodes Allowed (preview) feature lets you specify how many nodes that fail to drain (blocked nodes) are tolerated during upgrades or similar operations. This feature works only if the undrainable node behavior property is set. Otherwise, the command returns an error.
Note
If you don't explicitly set Max Blocked Nodes Allowed, it defaults to the value of max surge. If max surge isn't set, the default is typically 10%, so Max Blocked Nodes Allowed also defaults to 10%.
Prerequisites
The Azure CLI aks-preview extension version 18.0.0b9 or later is required to use this feature.
Example command:
az aks nodepool update \
  --cluster-name jizenMC1 \
  --name nodepool1 \
  --resource-group jizenTestMaxBlockedNodesRG \
  --max-surge 1 \
  --undrainable-node-behavior Cordon \
  --max-blocked-nodes 2 \
  --drain-timeout 5
- Extend drain timeout if workloads need more time. (The default is 30 minutes.)
- Test PDBs in staging, monitor upgrade events, and use blue-green deployments for critical workloads. For more information, see Blue-green deployment of AKS clusters.
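To see why a strict PDB blocks a drain and why adding replicas helps, consider how many voluntary evictions a budget permits. The following sketch models a PDB that uses minAvailable: the number of allowed disruptions is the count of healthy pods minus the minAvailable value. The helper name `allowed_disruptions` is hypothetical and exists only for illustration.

```shell
# Sketch: how many pods can be evicted at once under a minAvailable PDB.
# allowed_disruptions is a hypothetical helper, not a kubectl command.
allowed_disruptions() {
  # $1 = healthy pod count, $2 = minAvailable from the PDB
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 3 3   # prints 0: every eviction is refused, so the drain fails
allowed_disruptions 4 3   # prints 1: one extra replica lets the drain proceed
```

On a live cluster, kubectl get pdb reports the same figure in its ALLOWED DISRUPTIONS column, so a value of 0 before an upgrade is a sign that drains are likely to be blocked.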
Verify undrainable nodes
Blocked nodes are cordoned, so no new pods are scheduled on them, and they're marked with the label "kubernetes.azure.com/upgrade-status: Quarantined".
Verify the label on any blocked nodes when there's a drain node failure on upgrade:
kubectl get nodes --show-labels=true
Resolve undrainable nodes
Remove the responsible PDB:
kubectl delete pdb <pdb-name>
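Remove the kubernetes.azure.com/upgrade-status: Quarantined label. The trailing hyphen after the label key tells kubectl to remove the label:
kubectl label nodes <node-name> kubernetes.azure.com/upgrade-status-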
Optionally, delete the blocked node:
az aks nodepool delete-machines \
  --cluster-name <cluster-name> \
  --machine-names <machine-name> \
  --name <node-pool-name> \
  --resource-group <resource-group-name>
After you finish this step, you can reconcile the cluster status by performing any update operation without the optional fields as outlined in az aks. Alternatively, you can scale the node pool to the same number of nodes as the count of upgraded nodes. This action ensures that the node pool gets to its intended original size. AKS prioritizes the removal of the blocked nodes. This command also restores the cluster provisioning status to Succeeded. In the following example, 2 is the total number of upgraded nodes.
# Update the cluster to restore the provisioning status
az aks update --resource-group <resource-group-name> --name <cluster-name>
# Scale the node pool to restore the original size
az aks nodepool scale --resource-group <resource-group-name> --cluster-name <cluster-name> --name <node-pool-name> --node-count 2
Scenario 3: Slow upgrades
Conservative settings or node-level issues can delay upgrades, which affects your ability to stay current with patches and improvements.
Common causes of slow upgrades include:
- Low maxSurge or maxUnavailable values (limits parallelism).
- High soak times (long waits between node upgrades).
- Drain failures (see Node drain failures).
Recommendations to prevent or resolve
- Use maxSurge=33%, maxUnavailable=1 for production.
- Use maxSurge=50%, maxUnavailable=2 for dev/test.
- Use OS Security Patch for fast, targeted patching (avoids full node reimaging).
- Enable undrainableNodeBehavior to avoid upgrade blockers.
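The effect of parallelism on upgrade duration can be approximated by counting upgrade rounds. The following sketch uses a deliberately simplified model that assumes nodes are upgraded in batches of a fixed size and ignores drain, soak, and provisioning time. The helper `upgrade_rounds` is hypothetical.

```shell
# Sketch: rough number of upgrade rounds for a node pool.
# upgrade_rounds is a hypothetical helper; the batch model is an assumption.
upgrade_rounds() {
  # $1 = node count
  # $2 = nodes upgraded in parallel (must be at least 1)
  # Integer ceiling of (node count / parallel nodes)
  echo $(( ($1 + $2 - 1) / $2 ))
}

upgrade_rounds 20 1   # prints 20
upgrade_rounds 20 7   # prints 3
```

In this model, a 20-node pool takes 20 rounds when one node is upgraded at a time but only 3 rounds with seven in parallel, which is why low maxSurge or maxUnavailable values tend to dominate upgrade time.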
Scenario 4: IP exhaustion
Surge nodes require more IPs. If the subnet is near capacity, node provisioning can fail (for example, Error: SubnetIsFull). This scenario is common with Azure Container Networking Interface, high maxPods, or large node counts.
Recommendations to prevent or resolve
- Ensure that your subnet has enough IPs for all nodes, surge nodes, and pods. The formula is Total IPs = (Number of nodes + maxSurge) * (1 + maxPods).
- Reclaim unused IPs or expand the subnet (for example, from /24 to /22).
- Lower maxSurge if subnet expansion isn't possible:

  az aks nodepool update \
    --resource-group <resource-group-name> \
    --cluster-name <cluster-name> \
    --name <node-pool-name> \
    --max-surge 10%

- Monitor IP usage with Azure Monitor or custom alerts.
- Reduce maxPods per node, clean up orphaned load balancer IPs, and plan subnet sizing for high-scale clusters.
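As a worked example of the subnet sizing formula, the following sketch compares the IPs an upgrade needs against what a subnet can offer. It assumes maxSurge is expressed as a node count and that Azure reserves five addresses in each subnet. The helpers `required_ips` and `usable_ips` and the sample values (10 nodes, 3 surge nodes, 30 pods per node) are hypothetical.

```shell
# Sketch: check whether a subnet is large enough for an upgrade.
# required_ips and usable_ips are hypothetical helpers, not Azure CLI commands.
required_ips() {
  # $1 = node count, $2 = maxSurge as a node count, $3 = maxPods per node
  # Total IPs = (Number of nodes + maxSurge) * (1 + maxPods)
  echo $(( ($1 + $2) * (1 + $3) ))
}

usable_ips() {
  # $1 = subnet prefix length (for example, 24 for a /24)
  # Assumption: Azure reserves 5 addresses in each subnet
  echo $(( (1 << (32 - $1)) - 5 ))
}

needed=$(required_ips 10 3 30)
for prefix in 24 22; do
  available=$(usable_ips "$prefix")
  if [ "$needed" -le "$available" ]; then
    echo "/$prefix: fits ($needed needed, $available usable)"
  else
    echo "/$prefix: too small ($needed needed, $available usable)"
  fi
done
```

With these sample values, the upgrade needs 403 addresses, which exceeds the 251 usable addresses of a /24 but fits within the 1,019 usable addresses of a /22.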
Frequently asked questions
Can I use open-source tools for validation?
Yes. Many open-source tools integrate well with AKS upgrade processes:
- kube-no-trouble (kubent): Scans for deprecated APIs before upgrades.
- Trivy: Security scanning for container images and Kubernetes configurations.
- Sonobuoy: Kubernetes conformance testing and cluster validation.
- kube-bench: Security benchmark checks against Center for Internet Security standards.
- Polaris: Validation of Kubernetes best practices.
- kubectl-neat: Clean up Kubernetes manifests for validation.
How do I validate API compatibility before upgrading?
Run deprecation checks by using tools like kubent:
# Install and run API deprecation scanner
kubectl apply -f https://github.com/doitintl/kube-no-trouble/releases/latest/download/knt-full.yaml
# Check for deprecated APIs in your cluster
kubectl run knt --image=doitintl/knt:latest --rm -it --restart=Never -- \
-c /kubeconfig -o json > api-deprecation-report.json
# Review findings
cat api-deprecation-report.json | jq '.[] | select(.deprecated==true)'
What makes AKS upgrades different from other Kubernetes platforms?
AKS provides several unique advantages:
- Native Azure integration with Azure Traffic Manager, Azure Load Balancer, and networking.
- Azure Kubernetes Fleet Manager for coordinated multicluster upgrades.
- Automatic node image patching without manual node management.
- Built-in validation for quota, networking, and credentials.
- Azure support for upgrade-related issues.
Choose your upgrade path
This article provided you with a technical foundation. Now select your scenario-based path.
Ready to execute?
If you have... | Then go to... |
---|---|
Production environment | Production upgrade strategies: Battle-tested patterns for zero-downtime upgrades |
Databases or stateful apps | Stateful workload patterns: Safe upgrade patterns for data persistence |
Multiple environments | Upgrade scenarios hub: Decision tree for complex setups |
Basic cluster | Upgrade an AKS cluster: Step-by-step cluster upgrade |
Still deciding?
Use the upgrade scenarios hub for a guided decision tree that considers your:
- Downtime tolerance
- Environment complexity
- Risk profile
- Timeline constraints
Next tasks
- Review AKS patch and upgrade guidance for best practices and planning tips before you start any upgrade.
- Always check for API breaking changes and validate your workload's compatibility with the target Kubernetes version.
- Test upgrade settings (such as maxSurge, maxUnavailable, and PDBs) in a staging environment to minimize production risk.
- Monitor upgrade events and cluster health throughout the process.