
We spent three months optimizing a production Kubernetes cluster that was costing $2,400/month on AWS EKS. By the end, we were running the same workloads — same traffic, same SLAs — for $890/month. No migrations. No rewrites. Just better resource configuration and smarter autoscaling.

Here’s exactly what we changed, with the YAML configs we used.

The Root Problem: Overprovisioning Is the Default

Most Kubernetes clusters run at 25–35% actual utilization. The rest is allocated but unused — CPUs sitting idle, memory reserved but never touched. This happens because of a completely rational fear: nobody wants to be the person whose pod got OOMKilled in production at 2 AM.

So engineers set resource requests high. Then they set limits even higher. Then the cluster autoscaler provisions nodes to satisfy those inflated requests. You end up paying for 3x the compute you actually need.

Here’s a real example from the cluster we optimized. This was a typical API service deployment:

# Before: the "better safe than sorry" configuration
resources:
  requests:
    cpu: "1000m"
    memory: "2Gi"
  limits:
    cpu: "2000m"
    memory: "4Gi"

Actual usage from kubectl top pods averaged 120m CPU and 340Mi memory. That’s 12% CPU utilization and 17% memory utilization against requests. The scheduler reserved 1 full CPU core and 2 GiB of memory for a pod using a fraction of that.

Multiply this across 28 pods in the cluster and you get the $2,400/month bill.

Step 1: Measure Actual Usage Before Changing Anything

You cannot optimize what you haven’t measured. Before touching any resource specs, collect at least 7 days of actual usage data — ideally 14 days to capture weekly traffic patterns.

Quick check with kubectl:

# Current resource requests vs actual usage
kubectl top pods -n production --containers

# Node-level utilization
kubectl top nodes

This gives you a snapshot, but you need historical data. That means Prometheus.

Key Prometheus metrics to track:

  • container_cpu_usage_seconds_total: actual CPU consumed per container
  • container_memory_working_set_bytes: real memory in use (not cache)
  • kube_pod_container_resource_requests: what the scheduler reserved
  • kube_pod_container_resource_limits: the ceiling before throttling/OOMKill
  • node_cpu_seconds_total: node-level CPU consumption

The query that matters most — request vs actual ratio:

# CPU request efficiency (higher = more waste)
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)
/
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (namespace)

If this ratio is above 3, you’re paying for 3x the CPU you’re using. We found ratios between 4 and 8 across most namespaces.
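The same check works for memory. Use working-set bytes so filesystem cache doesn't inflate the numbers (a sketch; tighten the label matchers to fit your setup):

```
# Memory request efficiency (higher = more waste)
sum(kube_pod_container_resource_requests{resource="memory"}) by (namespace)
/
sum(container_memory_working_set_bytes{container!=""}) by (namespace)
```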

Step 2: Right-Size Resource Requests

This is where most of the savings come from. Not autoscaling, not spot instances — just setting resource requests to match actual usage with a reasonable buffer.

Here’s the before and after for our five highest-cost deployments:

Service            CPU request (before → after)    Memory request (before → after)
api-gateway        1000m → 200m                    2Gi → 512Mi
auth-service       500m → 100m                     1Gi → 256Mi
worker-processor   2000m → 500m                    4Gi → 1Gi
web-frontend       500m → 150m                     1Gi → 384Mi
notification-svc   250m → 80m                      512Mi → 192Mi

The formula we used: set requests to the P95 of actual usage + 20% buffer. Set limits to 2x requests for CPU (allows bursting) and 1.5x for memory (prevents runaway leaks without being too tight).

# After: based on actual measured usage
resources:
  requests:
    cpu: "200m"
    memory: "512Mi"
  limits:
    cpu: "400m"
    memory: "768Mi"

This single change reduced total cluster resource requests by about 65%, which meant the cluster autoscaler could pack workloads onto fewer nodes.

Result: 5 nodes dropped to 2 nodes. That alone cut the bill from $2,400 to roughly $960.
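To get the P95 figure that the formula needs, a Prometheus subquery over your measurement window works well. This is a sketch assuming 14 days of retention; the container label is a placeholder for your own workload:

```
# P95 of CPU usage per pod over 14 days, in cores
quantile_over_time(0.95,
  sum by (pod) (rate(container_cpu_usage_seconds_total{container="api-gateway"}[5m]))[14d:5m]
)
```

Multiply the result by 1.2 for the 20% buffer and that's your new CPU request.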

Step 3: Autoscaling — HPA vs VPA vs Cluster Autoscaler

These three autoscalers solve different problems. Using the wrong one (or misconfiguring the right one) either wastes money or creates instability.

Horizontal Pod Autoscaler (HPA)

Use when: Your workload handles more load by running more replicas. Stateless web servers, API services, queue consumers.

Don’t use when: Your workload is a single-instance database, a stateful service that can’t easily scale horizontally, or something that takes 5+ minutes to start.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-gateway-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-gateway
  minReplicas: 2
  maxReplicas: 8
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 25
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Percent
          value: 50
          periodSeconds: 60
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 75

Key decisions in this config:

  • minReplicas: 2 — not 1. You need at least 2 for availability during rolling deploys.
  • scaleDown.stabilizationWindowSeconds: 300 — wait 5 minutes before scaling down. Prevents flapping during brief traffic dips.
  • scaleUp is faster than scaleDown — you want to react quickly to load increases, but be cautious about removing capacity.
  • Target utilization at 70%, not 50% — a 50% target means you always have double the pods you need. 70% gives headroom without being wasteful.

Vertical Pod Autoscaler (VPA)

Use when: You have workloads where the right resource request is hard to guess — batch jobs with variable resource needs, services with unpredictable memory profiles.

Don’t use when: You’re already using HPA on the same metric. VPA and HPA on CPU will fight each other.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-processor-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker-processor
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: worker
        minAllowed:
          cpu: "100m"
          memory: "128Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "4Gi"
        controlledResources: ["cpu", "memory"]

A word of caution: VPA in Auto mode restarts pods to apply new resource values. In production, start with updateMode: "Off" — this gives you recommendations without applying them. Review the recommendations, then switch to Auto once you trust them.

# Check VPA recommendations
kubectl describe vpa worker-processor-vpa -n production

Cluster Autoscaler

The Cluster Autoscaler adds or removes nodes based on pending pods. It’s reactive — a pod can’t be scheduled, so a new node spins up. A node is underutilized, so it gets drained and removed.

The cost impact is indirect but significant. Every time you right-size pod requests or scale down replicas, the cluster autoscaler can consolidate onto fewer nodes.

Configuration that matters for cost:

# cluster-autoscaler deployment args
- --scale-down-utilization-threshold=0.5
- --scale-down-delay-after-add=10m
- --scale-down-delay-after-delete=1m
- --scale-down-unneeded-time=5m
- --skip-nodes-with-local-storage=false
- --balance-similar-node-groups=true

  • scale-down-utilization-threshold=0.5: remove a node if it's below 50% utilized. The default is already 0.5, which is reasonable. Don't raise it aggressively (say, to 0.7) or you'll constantly churn nodes.
  • scale-down-delay-after-add=10m: don't immediately remove a node that was just added. Gives workloads time to stabilize.

Step 4: Spot Instances for Non-Critical Workloads

Spot instances (AWS), preemptible VMs (GCP), or spot VMs (Azure) cost 60–90% less than on-demand. The catch: the cloud provider can reclaim them with as little as two minutes' notice.

Workloads that tolerate spot well:

  • CI/CD runners
  • Batch processing jobs
  • Dev/staging environments
  • Stateless workers behind a queue (if one dies, another picks up the message)
  • HPA-managed services with enough replicas that losing 1–2 pods is fine

Workloads that should stay on-demand:

  • Databases and stateful services
  • Services with only 1–2 replicas where losing one means downtime
  • Anything with startup times over 5 minutes

Setting Up Mixed Node Pools on EKS

# managed node group with mixed instances
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: production-cluster
  region: us-east-1

managedNodeGroups:
  - name: on-demand-critical
    instanceType: m5.large
    minSize: 2
    maxSize: 4
    labels:
      workload-type: critical
    taints:
      - key: dedicated
        value: critical
        effect: NoSchedule

  - name: spot-workers
    instanceTypes:
      - m5.large
      - m5a.large
      - m5d.large
      - m4.large
    spot: true
    minSize: 0
    maxSize: 10
    labels:
      workload-type: spot-tolerant

Diversifying instance types in the spot pool is critical. If you rely on a single instance type and that capacity pool gets reclaimed, all your spot nodes go away at once. Using 4+ instance types from different families spreads that risk.

Use node affinity and tolerations to control placement:

# For critical workloads — on-demand only
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: workload-type
                operator: In
                values:
                  - critical
  tolerations:
    - key: dedicated
      operator: Equal
      value: critical
      effect: NoSchedule
# For spot-tolerant workloads
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          preference:
            matchExpressions:
              - key: workload-type
                operator: In
                values:
                  - spot-tolerant

Pod Disruption Budgets

If a spot node gets reclaimed, Kubernetes drains it. A PodDisruptionBudget ensures you don’t lose too many replicas at once:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-gateway-pdb
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-gateway

This keeps at least 2 api-gateway pods running through voluntary disruptions like node drains. (A PDB can't protect against an outright node failure, but it does make spot reclaims and rolling maintenance safe.)

Step 5: Schedule Non-Production Environments

Dev and staging clusters running 24/7 are pure waste. A dev cluster that runs from 7 AM to 8 PM on weekdays is up for only 65 of the week's 168 hours, roughly 61% less runtime (and cost, if the nodes scale to zero) than one running around the clock.
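The arithmetic, for anyone adapting this to a different schedule:

```shell
# Hours a 7 AM–8 PM weekday schedule runs vs. a full 24/7 week
awk 'BEGIN {
  running = 13 * 5      # 13 hours/day, 5 days/week
  total   = 24 * 7      # 168 hours in a week
  printf "running %d of %d hours, saving %.0f%%\n", running, total, 100 * (1 - running/total)
}'
# → running 65 of 168 hours, saving 61%
```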

For EKS, scale the node group to zero outside working hours:

# Scale down at 8 PM
aws eks update-nodegroup-config \
  --cluster-name dev-cluster \
  --nodegroup-name dev-nodes \
  --scaling-config minSize=0,maxSize=3,desiredSize=0

# Scale up at 7 AM
aws eks update-nodegroup-config \
  --cluster-name dev-cluster \
  --nodegroup-name dev-nodes \
  --scaling-config minSize=1,maxSize=3,desiredSize=2

Automate this with a CronJob or a Lambda triggered by EventBridge.
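A sketch of the in-cluster scale-down CronJob, assuming the official amazon/aws-cli image and a service account bound (via IRSA) to an IAM role that allows eks:UpdateNodegroupConfig — the names here are placeholders:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: scale-down-dev
  namespace: ops
spec:
  schedule: "0 20 * * 1-5"   # 8 PM, weekdays (cluster timezone)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: dev-scaler   # assumed: IRSA-bound to an IAM role
          restartPolicy: OnFailure
          containers:
            - name: scale-down
              image: amazon/aws-cli:2.15.0
              # the image's entrypoint is `aws`, so args start at the subcommand
              args:
                - eks
                - update-nodegroup-config
                - --cluster-name
                - dev-cluster
                - --nodegroup-name
                - dev-nodes
                - --scaling-config
                - minSize=0,maxSize=3,desiredSize=0
```

A mirror CronJob on "0 7 * * 1-5" with the scale-up config brings the environment back in the morning.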

The Full Breakdown

Here’s where our $2,400/month went and what changed:

Change                                                   Monthly savings   Effort
Right-sized resource requests                            $720              2 days of measuring + updating manifests
Removed 3 unnecessary nodes (autoscaler consolidation)   $480              Automatic after right-sizing
Moved workers to spot instances                          $210              Half a day to set up mixed pools
Scheduled dev namespace off-hours                        $100              1 hour with CronJobs
Total savings                                            $1,510/month      ~3 days of work

Final bill: $890/month for the same workloads, same availability, same performance.

Common Mistakes to Avoid

Setting requests equal to limits. This is called “Guaranteed” QoS class in Kubernetes. It means no bursting, no flexibility. Every pod gets exactly what it requests, even if it only uses 10% of it. Use Guaranteed QoS only for your most latency-sensitive services.

Autoscaling on the wrong metric. HPA on CPU is fine for compute-bound services. But if your service is I/O-bound (waiting on database queries, external APIs), CPU will stay low even when the service is overloaded. Scale on custom metrics like request latency or queue depth instead.
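For instance, scaling a queue consumer on backlog depth instead of CPU looks like this in an autoscaling/v2 HPA. This is a sketch: it assumes an external metrics adapter (such as KEDA or prometheus-adapter) is exposing a queue_depth metric, and the metric and label names are placeholders:

```yaml
metrics:
  - type: External
    external:
      metric:
        name: queue_depth            # assumed: exposed by your metrics adapter
        selector:
          matchLabels:
            queue: notifications
      target:
        type: AverageValue
        averageValue: "100"          # aim for ~100 queued messages per replica
```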

Ignoring namespace resource quotas. Without quotas, one team can accidentally claim all cluster resources. Set per-namespace quotas:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-api-quota
  namespace: team-api
spec:
  hard:
    requests.cpu: "4"
    requests.memory: "8Gi"
    limits.cpu: "8"
    limits.memory: "16Gi"

Not using container_memory_working_set_bytes. The metric container_memory_usage_bytes includes filesystem cache, which inflates memory numbers. Always use container_memory_working_set_bytes for right-sizing decisions — it reflects actual application memory.
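A quick way to see how much cache is inflating your numbers (a sketch; narrow the label matchers as needed):

```
# Bytes of page cache per pod: usage minus working set
sum by (pod) (
    container_memory_usage_bytes{namespace="production", container!=""}
  - container_memory_working_set_bytes{namespace="production", container!=""}
)
```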

Start With Measurement

If you take one thing from this post: run kubectl top pods --containers across your namespaces right now and compare to your resource requests. The gap between those numbers is what you’re overpaying.

Most clusters can cut 30–50% without any architectural changes. It’s just configuration work. Tedious, unglamorous configuration work that saves real money every month.

If you want continuous visibility into this gap without manually querying Prometheus, Xplorr tracks resource efficiency across your clusters and flags overprovisioned workloads automatically. But even without tooling — start measuring. The numbers will surprise you.

