---
name: kubernetes-operations
description: |
  Kubernetes and OpenShift cluster operations, maintenance, and lifecycle management.
  Use this skill when: (1) Performing cluster upgrades (K8s, OCP, EKS, GKE, AKS)
  (2) Backup and disaster recovery (etcd, Velero, cluster state)
  (3) Node management: drain, cordon, scaling, replacement
  (4) Capacity planning and cluster scaling
  (5) Certificate rotation and management
  (6) etcd maintenance and health checks
  (7) Resource quota and limit range management
  (8) Namespace lifecycle management
  (9) Cluster migration and workload portability
  (10) Monitoring and alerting configuration
  (11) Log aggregation setup
  (12) Cost optimization and resource rightsizing
metadata:
  author: cluster-skills
  version: "1.0.0"
---
Kubernetes / OpenShift Cluster Operations
Day-2 operations, maintenance, and lifecycle management for production clusters.
Current Versions & Documentation (January 2026)
| Platform | Current Version | Upgrade Path | Documentation |
|---|---|---|---|
| Kubernetes | 1.31.x | 1.30 → 1.31 | https://kubernetes.io/docs/tasks/administer-cluster/ |
| OpenShift | 4.17.x | 4.16 → 4.17 | https://docs.openshift.com/container-platform/4.17/ |
| EKS | 1.31 | Rolling updates | https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html |
| AKS | 1.31 | Blue-green or rolling | https://learn.microsoft.com/azure/aks/upgrade-cluster |
| GKE | 1.31 | Surge upgrades | https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster |
Key Tools & Versions
| Tool | Version | Install | Purpose |
|---|---|---|---|
| kubeadm | 1.31.x | Package manager | Cluster bootstrap |
| Velero | 1.15.x | Helm/CLI | Backup & restore |
| kube-prometheus-stack | v67.x | Helm | Monitoring |
| VPA | 1.3.x | kubectl apply | Vertical scaling |
| Cluster Autoscaler | 1.31.x | Helm | Node autoscaling |
| Karpenter | 1.1.x | Helm | AWS node provisioning |
Command Usage Convention
IMPORTANT: This skill uses `kubectl` as the primary command. When working with:
- OpenShift/ARO clusters: Replace `kubectl` with `oc`
- Standard Kubernetes (AKS, EKS, GKE): Use `kubectl` as shown
Node Operations
Node Lifecycle
# View node status
kubectl get nodes -o wide
# Detailed node info
kubectl describe node ${NODE_NAME}
# Check node resources
kubectl top nodes
# Node labels and taints
kubectl get nodes --show-labels
kubectl describe node ${NODE_NAME} | grep -A 5 Taints
Drain and Cordon
# Cordon: Mark node unschedulable (no new pods)
kubectl cordon ${NODE_NAME}
# Drain: Evict pods safely
kubectl drain ${NODE_NAME} \
--ignore-daemonsets \
--delete-emptydir-data \
--grace-period=60 \
--timeout=300s
# Force drain (use with caution)
kubectl drain ${NODE_NAME} \
--ignore-daemonsets \
--delete-emptydir-data \
--force \
--grace-period=30
# Uncordon: Allow scheduling again
kubectl uncordon ${NODE_NAME}
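Drains are throttled by PodDisruptionBudgets: evictions are refused while a PDB would be violated, which is the most common reason a drain hangs. A minimal sketch of a PDB for a hypothetical `web` workload (name, label, and `minAvailable` are illustrative, not taken from this skill):

```yaml
# Illustrative PDB: keeps at least 2 replicas of a hypothetical "web" app
# available, so a drain evicts its pods gradually instead of all at once.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
  namespace: ${NAMESPACE}
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```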
Cluster Autoscaler Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.31.0
          command:
            - ./cluster-autoscaler
            - --v=4
            - --cloud-provider=${CLOUD_PROVIDER}
            - --nodes=${MIN}:${MAX}:${NODE_GROUP}
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --scale-down-utilization-threshold=0.5
            - --skip-nodes-with-local-storage=false
            - --skip-nodes-with-system-pods=true
            - --balance-similar-node-groups=true
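On EKS, Karpenter (listed in the tools table) can replace or complement the Cluster Autoscaler for node provisioning. A minimal NodePool sketch for the Karpenter v1 API, assuming an `EC2NodeClass` named `default` already exists; all values are illustrative:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumed to exist; not defined in this skill
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
  limits:
    cpu: "1000"                  # cap on total CPU the pool may provision
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 5m
```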
Backup and Recovery
etcd Backup
# Backup etcd (run on control plane node)
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%Y%m%d-%H%M%S).db \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key
# Verify backup
ETCDCTL_API=3 etcdctl snapshot status /backup/etcd-snapshot.db --write-out=table
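The disaster recovery checklist at the end of this skill assumes an etcd restore. A minimal sketch with etcdctl (restore into a fresh data directory, then repoint the static etcd pod; paths are illustrative, and newer etcd releases recommend `etcdutl` for the same operation):

```bash
# Restore the snapshot into a new data directory (never overwrite the live one)
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-snapshot.db \
  --data-dir /var/lib/etcd-restore

# Point the static etcd pod at the restored directory, e.g. by editing the
# hostPath volume in /etc/kubernetes/manifests/etcd.yaml, then let kubelet
# restart etcd followed by the API server.
```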
Velero Backup (v1.15.x)
# Install Velero CLI
brew install velero
# Install Velero server with AWS provider
velero install \
--provider aws \
--bucket ${BUCKET_NAME} \
--secret-file ./credentials-velero \
--backup-location-config region=${REGION} \
--snapshot-location-config region=${REGION} \
--plugins velero/velero-plugin-for-aws:v1.10.0 \
--use-node-agent
# Create backup
velero backup create ${BACKUP_NAME} \
--include-namespaces ${NAMESPACES} \
--ttl 720h \
--default-volumes-to-fs-backup
# Create scheduled backup
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--include-namespaces ${NAMESPACES} \
--ttl 168h
# Restore from backup
velero restore create --from-backup ${BACKUP_NAME}
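After creating or restoring a backup, confirm it actually completed; partially-failed backups are easy to miss. The `${RESTORE_NAME}` variable below is illustrative:

```bash
# Inspect backup status, per-resource details, and warnings/errors
velero backup describe ${BACKUP_NAME} --details
velero backup logs ${BACKUP_NAME}

# Same checks for restores
velero restore describe ${RESTORE_NAME} --details
velero restore logs ${RESTORE_NAME}
```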
Velero Backup Manifest
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: ${BACKUP_NAME}
  namespace: velero
spec:
  includedNamespaces:
    - ${NAMESPACE_1}
    - ${NAMESPACE_2}
  excludedResources:
    - events
    - events.events.k8s.io
  storageLocation: default
  volumeSnapshotLocations:
    - default
  ttl: 720h0m0s
  snapshotVolumes: true
  hooks:
    resources:
      - name: backup-hook
        includedNamespaces:
          - ${NAMESPACE}
        labelSelector:
          matchLabels:
            app: database
        pre:
          - exec:
              container: database
              command:
                - /bin/sh
                - -c
                - "pg_dump -U postgres > /backup/pre-backup.sql"
              onError: Fail
              timeout: 120s
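The scheduled backup created earlier with the CLI can also be declared as a Schedule resource so it can live in Git alongside the Backup manifest. A minimal sketch mirroring the daily 02:00 backup (values are illustrative):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"
  template:                      # same fields as a Backup spec
    includedNamespaces:
      - ${NAMESPACES}
    ttl: 168h0m0s
```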
Cluster Upgrades
Pre-Upgrade Checklist
#!/bin/bash
# pre-upgrade-check.sh
echo "=== Cluster Version ==="
kubectl version   # the --short flag was removed in kubectl 1.28; output is concise by default
echo -e "\n=== Node Status ==="
kubectl get nodes
echo -e "\n=== Pods Not Running ==="
kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded
echo -e "\n=== PDBs That May Block Drain ==="
kubectl get pdb -A
echo -e "\n=== Pending PVCs ==="
kubectl get pvc -A --field-selector=status.phase=Pending
echo -e "\n=== Deprecated APIs in Use ==="
kubectl get --raw /metrics | grep apiserver_requested_deprecated_apis
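A PDB whose status currently allows zero disruptions will stall the node drains that every upgrade strategy below depends on. A quick check, assuming jq is available:

```bash
# List PDBs that allow no disruptions right now (these will block drains)
kubectl get pdb -A -o json | jq -r \
  '.items[] | select(.status.disruptionsAllowed == 0) | "\(.metadata.namespace)/\(.metadata.name)"'
```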
AKS Upgrade (Azure)
# Check current version and available upgrades
az aks get-versions --location ${LOCATION} -o table
az aks get-upgrades --resource-group ${RG} --name ${CLUSTER} -o table
# Upgrade control plane and node pools
az aks upgrade --resource-group ${RG} --name ${CLUSTER} \
--kubernetes-version 1.31.0
# Use blue-green upgrade with max surge
az aks nodepool upgrade --resource-group ${RG} --cluster-name ${CLUSTER} \
--name ${NODEPOOL} --kubernetes-version 1.31.0 \
--max-surge 33%
# Enable auto-upgrade channel
az aks update --resource-group ${RG} --name ${CLUSTER} \
--auto-upgrade-channel stable
EKS Upgrade
# Update control plane
aws eks update-cluster-version \
--name ${CLUSTER_NAME} \
--kubernetes-version 1.31
# Wait for completion
aws eks wait cluster-active --name ${CLUSTER_NAME}
# Update EKS add-ons
for addon in vpc-cni coredns kube-proxy eks-pod-identity-agent; do
aws eks update-addon --cluster-name ${CLUSTER_NAME} \
--addon-name $addon \
--resolve-conflicts PRESERVE
done
# Update managed node groups
aws eks update-nodegroup-version \
--cluster-name ${CLUSTER_NAME} \
--nodegroup-name ${NODEGROUP_NAME}
GKE Upgrade
# Check available versions
gcloud container get-server-config --region ${REGION}
# Upgrade control plane
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
--master --cluster-version 1.31
# Upgrade node pools
gcloud container clusters upgrade ${CLUSTER} --region ${REGION} \
--node-pool ${POOL} \
--cluster-version 1.31
# Enable release channel
gcloud container clusters update ${CLUSTER} --region ${REGION} \
--release-channel regular
OpenShift Upgrade
# Check available updates
oc adm upgrade
# View current version and channel
oc get clusterversion
oc get clusterversion version -o jsonpath='{.spec.channel}'
# Change channel
oc adm upgrade channel stable-4.17
# Start upgrade
oc adm upgrade --to-latest
# OR upgrade to specific version
oc adm upgrade --to=4.17.5
# Monitor upgrade progress
watch -n 10 'oc get clusterversion && oc get clusteroperators'
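On large OpenShift clusters it is common to pause the worker MachineConfigPool so the control plane upgrades first and worker reboots can be scheduled separately. Unpause promptly; leaving a pool paused for long periods can block certificate rotation:

```bash
# Pause worker node rollouts during the control-plane portion of the upgrade
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":true}}'

# Resume worker updates when the maintenance window opens
oc patch machineconfigpool/worker --type merge --patch '{"spec":{"paused":false}}'
```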
Resource Management
Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
    persistentvolumeclaims: "10"
    requests.storage: 100Gi
Limit Ranges
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: ${NAMESPACE}
spec:
  limits:
    - type: Container
      default:
        cpu: 500m
        memory: 512Mi
      defaultRequest:
        cpu: 100m
        memory: 128Mi
      max:
        cpu: "4"
        memory: 8Gi
      min:
        cpu: 50m
        memory: 64Mi
Check Resource Usage
# Namespace resource usage vs quota
kubectl describe quota -n ${NAMESPACE}
# Pod resource usage
kubectl top pods -n ${NAMESPACE} --sort-by=memory
kubectl top pods -n ${NAMESPACE} --sort-by=cpu
# Node resource allocation
kubectl describe nodes | grep -A 5 "Allocated resources"
Certificate Management
Check Certificate Expiry
# kubeadm certificates
kubeadm certs check-expiration
# Manual check
openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates
# Check all certs
for cert in /etc/kubernetes/pki/*.crt; do
echo "=== $cert ==="
openssl x509 -in $cert -noout -dates
done
Rotate Certificates
# Renew all certificates (kubeadm)
kubeadm certs renew all
# Restart control plane components
crictl pods --name kube-apiserver -q | xargs crictl stopp
crictl pods --name kube-controller-manager -q | xargs crictl stopp
crictl pods --name kube-scheduler -q | xargs crictl stopp
Monitoring Setup
Prometheus Stack (kube-prometheus-stack v67.x)
# Add Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install
helm upgrade --install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.replicas=2 \
--set prometheus.prometheusSpec.resources.requests.memory=2Gi \
--set alertmanager.alertmanagerSpec.replicas=3 \
--set grafana.persistence.enabled=true
# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
Custom ServiceMonitor
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ${APP_NAME}
  namespace: monitoring
  labels:
    release: prometheus
spec:
  namespaceSelector:
    matchNames:
      - ${NAMESPACE}
  selector:
    matchLabels:
      app.kubernetes.io/name: ${APP_NAME}
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
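The same operator also reconciles PrometheusRule objects, which covers the alerting side of monitoring configuration. A minimal alert sketch; the expression, threshold, and rule name are illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ${APP_NAME}-alerts
  namespace: monitoring
  labels:
    release: prometheus          # must match the Prometheus CR's ruleSelector
spec:
  groups:
    - name: ${APP_NAME}.rules
      rules:
        - alert: HighPodRestartRate
          expr: increase(kube_pod_container_status_restarts_total{namespace="${NAMESPACE}"}[15m]) > 3
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pods in ${NAMESPACE} are restarting frequently"
```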
Cost Optimization
VerticalPodAutoscaler
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ${APP_NAME}-vpa
  namespace: ${NAMESPACE}
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ${APP_NAME}
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
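Before trusting `updateMode: "Auto"`, review what the recommender would actually apply; a quick way to read the current recommendations:

```bash
# Show target/lower/upper-bound recommendations for the VPA
kubectl describe vpa ${APP_NAME}-vpa -n ${NAMESPACE}

# Or pull just the recommendation block from the status
kubectl get vpa ${APP_NAME}-vpa -n ${NAMESPACE} -o yaml | grep -A 20 recommendation
```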
Namespace Lifecycle
Namespace Template
apiVersion: v1
kind: Namespace
metadata:
  name: ${NAMESPACE}
  labels:
    app.kubernetes.io/managed-by: cluster-skills
    environment: ${ENVIRONMENT}
    team: ${TEAM}
  annotations:
    owner: ${OWNER_EMAIL}
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: default-quota
  namespace: ${NAMESPACE}
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: ${NAMESPACE}
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
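The default-deny policy above also blocks DNS egress, which breaks most workloads immediately. A common companion policy that re-allows DNS to kube-dns; it assumes the standard `kubernetes.io/metadata.name` namespace label is present:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns
  namespace: ${NAMESPACE}
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```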
Disaster Recovery
Full Cluster Recovery Checklist
- Restore etcd - see the etcd backup/restore commands in the Backup and Recovery section
- Verify Control Plane:
  kubectl get nodes
  kubectl get pods -n kube-system
  kubectl cluster-info
- Restore Workloads (Velero):
  velero restore create --from-backup ${BACKUP_NAME}
- Verify Application Health:
  kubectl get pods -A
  kubectl get svc -A
- Verify DNS and Networking:
  kubectl run dns-test --image=busybox --rm -it --restart=Never -- nslookup kubernetes