We spent three months optimizing a production Kubernetes cluster that was costing $2,400/month on AWS EKS. By the end, we were running the same workloads — same traffic, same SLAs — for $890/month. No migrations. No rewrites. Just better resource configuration and smarter autoscaling.
Here’s exactly what we changed, with the YAML configs we used.
## The Root Problem: Overprovisioning Is the Default
Most Kubernetes clusters run at 25–35% actual utilization. The rest is allocated but unused — CPUs sitting idle, memory reserved but never touched. This happens because of a completely rational fear: nobody wants to be the person whose pod got OOMKilled in production at 2 AM.
So engineers set resource requests high. Then they set limits even higher. Then the cluster autoscaler provisions nodes to satisfy those inflated requests. You end up paying for 3x the compute you actually need.
Here’s a real example from the cluster we optimized. This was a typical API service deployment:
```yaml
# Before: the "better safe than sorry" configuration
resources:
  requests:
    cpu: "1000m"
    memory: "2Gi"
  limits:
    cpu: "2000m"
    memory: "4Gi"
```
Actual usage from `kubectl top pods` averaged 120m CPU and 340Mi memory. That’s 12% CPU utilization and 17% memory utilization against requests. The scheduler reserved 1 full CPU core and 2 GiB of memory for a pod using a fraction of that.
Multiply this across 28 pods in the cluster and you get the $2,400/month bill.
## Step 1: Measure Actual Usage Before Changing Anything
You cannot optimize what you haven’t measured. Before touching any resource specs, collect at least 7 days of actual usage data — ideally 14 days to capture weekly traffic patterns.
Quick check with kubectl:
```bash
# Current resource requests vs actual usage
kubectl top pods -n production --containers

# Node-level utilization
kubectl top nodes
```
This gives you a snapshot, but you need historical data. That means Prometheus.
Key Prometheus metrics to track:
| Metric | What It Tells You |
|---|---|
| `container_cpu_usage_seconds_total` | Actual CPU consumed per container |
| `container_memory_working_set_bytes` | Real memory in use (not cache) |
| `kube_pod_container_resource_requests` | What the scheduler reserved |
| `kube_pod_container_resource_limits` | The ceiling before throttling/OOMKill |
| `node_cpu_seconds_total` | Node-level CPU consumption |
The query that matters most — request vs actual ratio:
```promql
# CPU request efficiency (higher = more waste)
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)
/
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)
```
If this ratio is above 3, you’re paying for 3x the CPU you’re using. We found ratios between 4 and 8 across most namespaces.
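The same pattern applies to memory. This is an assumed variant of the CPU query above, not one pulled from our dashboards, but it follows directly from the metrics table:

```promql
# Memory request efficiency (higher = more waste)
sum(kube_pod_container_resource_requests{resource="memory"}) by (namespace)
/
sum(container_memory_working_set_bytes{container!=""}) by (namespace)
```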
## Step 2: Right-Size Resource Requests
This is where most of the savings come from. Not autoscaling, not spot instances — just setting resource requests to match actual usage with a reasonable buffer.
Here’s the before and after for our five highest-cost deployments:
| Service | CPU Request (Before) | CPU Request (After) | Memory Request (Before) | Memory Request (After) |
|---|---|---|---|---|
| api-gateway | 1000m | 200m | 2Gi | 512Mi |
| auth-service | 500m | 100m | 1Gi | 256Mi |
| worker-processor | 2000m | 500m | 4Gi | 1Gi |
| web-frontend | 500m | 150m | 1Gi | 384Mi |
| notification-svc | 250m | 80m | 512Mi | 192Mi |
The formula we used: set requests to the P95 of actual usage + 20% buffer. Set limits to 2x requests for CPU (allows bursting) and 1.5x for memory (prevents runaway leaks without being too tight).
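As a sketch, that sizing rule can be expressed in a few lines of Python. The `right_size_cpu` helper is hypothetical (we did this math in a spreadsheet), and it assumes you have exported per-container CPU samples in millicores from Prometheus:

```python
import math
import statistics

def right_size_cpu(samples_millicores):
    """Request = P95 of observed usage + 20% buffer; CPU limit = 2x request."""
    p95 = statistics.quantiles(samples_millicores, n=100)[94]
    request = math.ceil(p95 * 1.2)
    return request, request * 2  # (request, limit) in millicores

# e.g. a week of 5-minute CPU samples for one container (illustrative data)
samples = [110, 120, 125, 130, 118] * 20
request, limit = right_size_cpu(samples)
```

For memory, swap the 2x limit multiplier for 1.5x, per the formula above.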
```yaml
# After: based on actual measured usage
resources:
  requests:
    cpu: "200m"
    memory: "512Mi"
  limits:
    cpu: "400m"
    memory: "768Mi"
```
This single change reduced total cluster resource requests by about 65%, which meant the cluster autoscaler could pack workloads onto fewer nodes.
Result: 5 nodes dropped to 2 nodes. That alone cut the bill from $2,400 to roughly $960.
## Step 3: Autoscaling — HPA vs VPA vs Cluster Autoscaler
These three autoscalers solve different problems. Using the wrong one (or misconfiguring the right one) either wastes money or creates instability.
### Horizontal Pod Autoscaler (HPA)
Use when: Your workload handles more load by running more replicas. Stateless web servers, API services, queue consumers.
Don’t use when: Your workload is a single-instance database, a stateful service that can’t easily scale horizontally, or something that takes 5+ minutes to start.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75
```
Key decisions in this config:

- `minReplicas: 2` — not 1. You need at least 2 for availability during rolling deploys.
- `scaleDown.stabilizationWindowSeconds: 300` — wait 5 minutes before scaling down. Prevents flapping during brief traffic dips.
- `scaleUp` is faster than `scaleDown` — you want to react quickly to load increases, but be cautious about removing capacity.
- Target utilization at 70%, not 50% — a 50% target means you always have double the pods you need. 70% gives headroom without being wasteful.
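For intuition about why the target matters, the HPA's core scaling rule is `desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric)`. A quick sketch of that arithmetic:

```python
import math

def desired_replicas(current_replicas, current_util, target_util):
    """HPA core rule: scale replica count by the ratio of observed to target utilization."""
    return math.ceil(current_replicas * current_util / target_util)

# 4 pods averaging 90% CPU against a 70% target -> scale up to 6
print(desired_replicas(4, 90, 70))
```

With a 50% target, the same 4 pods at 90% would jump to 8 replicas, which is why a too-low target keeps you permanently overprovisioned.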
### Vertical Pod Autoscaler (VPA)
Use when: You have workloads where the right resource request is hard to guess — batch jobs with variable resource needs, services with unpredictable memory profiles.
Don’t use when: You’re already using HPA on the same metric. VPA and HPA on CPU will fight each other.
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-processor-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker-processor
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: worker
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "4Gi"
        controlledResources: ["cpu", "memory"]
```
A word of caution: VPA in `Auto` mode restarts pods to apply new resource values. In production, start with `updateMode: "Off"` — this gives you recommendations without applying them. Review the recommendations, then switch to `Auto` once you trust them.
```bash
# Check VPA recommendations
kubectl describe vpa worker-processor-vpa -n production
```
### Cluster Autoscaler
The Cluster Autoscaler adds or removes nodes based on pending pods. It’s reactive — a pod can’t be scheduled, so a new node spins up. A node is underutilized, so it gets drained and removed.
The cost impact is indirect but significant. Every time you right-size pod requests or scale down replicas, the cluster autoscaler can consolidate onto fewer nodes.
Configuration that matters for cost:
```yaml
# cluster-autoscaler deployment args
- --scale-down-utilization-threshold=0.5
- --scale-down-delay-after-add=10m
- --scale-down-delay-after-delete=1m
- --scale-down-unneeded-time=5m
- --skip-nodes-with-local-storage=false
- --balance-similar-node-groups=true
```
- `scale-down-utilization-threshold=0.5` — remove a node if it’s below 50% utilized. This is the default, and it’s reasonable. Don’t raise it aggressively (say, 0.7) or you’ll constantly churn nodes.
- `scale-down-delay-after-add=10m` — don’t immediately remove a node that was just added. Gives workloads time to stabilize.
## Step 4: Spot Instances for Non-Critical Workloads
Spot instances (AWS), preemptible VMs (GCP), and spot VMs (Azure) cost 60–90% less than on-demand. The catch: the cloud provider can reclaim them with two minutes’ notice.
Workloads that tolerate spot well:
- CI/CD runners
- Batch processing jobs
- Dev/staging environments
- Stateless workers behind a queue (if one dies, another picks up the message)
- HPA-managed services with enough replicas that losing 1–2 pods is fine
Workloads that should stay on-demand:
- Databases and stateful services
- Services with only 1–2 replicas where losing one means downtime
- Anything with startup times over 5 minutes
### Setting Up Mixed Node Pools on EKS
```yaml
# managed node groups with mixed instances
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-east-1
managedNodeGroups:
  - name: on-demand-critical
    instanceType: m5.large
    minSize: 2
    maxSize: 4
    labels:
      workload-type: critical
    taints:
      - key: dedicated
        value: critical
        effect: NoSchedule
  - name: spot-workers
    instanceTypes:
      - m5.large
      - m5a.large
      - m5d.large
      - m4.large
    spot: true
    minSize: 0
    maxSize: 10
    labels:
      workload-type: spot-tolerant
```
Diversifying instance types in the spot pool is critical. If you only bid on one instance type and that pool gets reclaimed, all your spot nodes go away at once. Using 4+ instance types from different families spreads that risk.
Use node affinity and tolerations to control placement:
```yaml
# For critical workloads — on-demand only
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: workload-type
                operator: In
                values:
                  - critical
  tolerations:
    - key: dedicated
      operator: Equal
      value: critical
      effect: NoSchedule
```
```yaml
# For spot-tolerant workloads
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: workload-type
                operator: In
                values:
                  - spot-tolerant
```
### Pod Disruption Budgets
If a spot node gets reclaimed, Kubernetes drains it. A PodDisruptionBudget ensures you don’t lose too many replicas at once:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-gateway
```
This guarantees at least 2 api-gateway pods are always running, even during node drains.
## Step 5: Schedule Non-Production Environments
Dev and staging clusters running 24/7 are pure waste. A dev cluster that runs from 7 AM to 8 PM on weekdays is up for only 65 of the week’s 168 hours, which cuts node costs by roughly 60% compared to running around the clock.
For EKS, scale the node group to zero outside working hours:
```bash
# Scale down at 8 PM
aws eks update-nodegroup-config \
  --cluster-name dev-cluster \
  --nodegroup-name dev-nodes \
  --scaling-config minSize=0,maxSize=3,desiredSize=0

# Scale up at 7 AM
aws eks update-nodegroup-config \
  --cluster-name dev-cluster \
  --nodegroup-name dev-nodes \
  --scaling-config minSize=1,maxSize=3,desiredSize=2
```
Automate this with a CronJob or a Lambda triggered by EventBridge.
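One possible sketch of the scale-down half as a Kubernetes CronJob. The `eks-scaler` service account (and the IAM permissions behind it, e.g. via IRSA) is an assumption here, not something from our setup; note also that CronJob schedules run in the cluster’s timezone, UTC by default:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dev-scale-down
  namespace: ops
spec:
  schedule: "0 20 * * 1-5"   # 8 PM, weekdays
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: eks-scaler   # hypothetical SA with eks:UpdateNodegroupConfig
          restartPolicy: OnFailure
          containers:
            - name: scale-down
              image: amazon/aws-cli:latest
              args:
                - eks
                - update-nodegroup-config
                - --cluster-name=dev-cluster
                - --nodegroup-name=dev-nodes
                - --scaling-config=minSize=0,maxSize=3,desiredSize=0
```

A matching CronJob with the 7 AM schedule and the scale-up config handles the morning side.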
## The Full Breakdown
Here’s where our $2,400/month went and what changed:
| Change | Monthly Savings | Effort |
|---|---|---|
| Right-sized resource requests | $720 | 2 days of measuring + updating manifests |
| Removed 3 unnecessary nodes (autoscaler consolidation) | $480 | Automatic after right-sizing |
| Moved workers to spot instances | $210 | Half day to set up mixed pools |
| Scheduled dev namespace off-hours | $100 | 1 hour with CronJobs |
| Total savings | $1,510/month | ~3 days of work |
Final bill: $890/month for the same workloads, same availability, same performance.
## Common Mistakes to Avoid
Setting requests equal to limits. This puts the pod in the “Guaranteed” QoS class in Kubernetes. It means no bursting, no flexibility. Every pod gets exactly what it requests, even if it only uses 10% of it. Use Guaranteed QoS only for your most latency-sensitive services.
Autoscaling on the wrong metric. HPA on CPU is fine for compute-bound services. But if your service is I/O-bound (waiting on database queries, external APIs), CPU will stay low even when the service is overloaded. Scale on custom metrics like request latency or queue depth instead.
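For example, with the Prometheus Adapter exposing a per-pod request-rate metric, the `metrics` section of an HPA can target throughput instead of CPU. The metric name below is hypothetical; yours depends on what your adapter config exposes:

```yaml
# HPA metrics block scaling on request rate instead of CPU
metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second   # hypothetical, via Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "100"
```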
Ignoring namespace resource quotas. Without quotas, one team can accidentally claim all cluster resources. Set per-namespace quotas:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-api-quota
  namespace: team-api
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    limits.cpu: "8"
    limits.memory: "16Gi"
```
Not using `container_memory_working_set_bytes`. The metric `container_memory_usage_bytes` includes filesystem cache, which inflates memory numbers. Always use `container_memory_working_set_bytes` for right-sizing decisions — it reflects actual application memory.
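To see how much cache is inflating your numbers, diff the two metrics. A quick sanity-check query along these lines:

```promql
# Per-pod gap between cache-inflated and working-set memory
sum(container_memory_usage_bytes{namespace="production", container!=""}) by (pod)
-
sum(container_memory_working_set_bytes{namespace="production", container!=""}) by (pod)
```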
## Start With Measurement
If you take one thing from this post: run `kubectl top pods --containers` across your namespaces right now and compare to your resource requests. The gap between those numbers is what you’re overpaying.
Most clusters can cut 30–50% without any architectural changes. It’s just configuration work. Tedious, unglamorous configuration work that saves real money every month.
If you want continuous visibility into this gap without manually querying Prometheus, Xplorr tracks resource efficiency across your clusters and flags overprovisioned workloads automatically. But even without tooling — start measuring. The numbers will surprise you.