
We spent three weeks auditing a mid-stage startup’s AWS infrastructure across three accounts — dev, staging, and production. Their monthly bill was $47K. When we finished, it was $33K. Same performance. Same reliability. No service degradation.

The waste wasn’t in one place. It was spread across dozens of small, boring things that nobody had looked at in months. Here’s the exact process we followed, with the commands and queries you can run today.

The 5 Real Cost Leaks Nobody Talks About

Before we get to the audit process, let’s talk about where money actually disappears. It’s not where most people think.

1. Idle EC2 Instances

This is the obvious one, but the scale still surprises people. In the 3-account setup we audited, we found 14 instances running 24/7 that were doing functionally nothing — test boxes from Q3, a Jenkins controller that had been replaced by GitHub Actions, two “temporary” bastion hosts.

Combined cost: $2,840/month.

The tricky part isn’t finding them. It’s that nobody wants to turn them off because “someone might need that.” Create a spreadsheet. Tag the owners. Give them a week to respond. Then terminate.

2. Unused EBS Volumes and Forgotten Snapshots

EBS volumes persist after instance termination unless you explicitly set DeleteOnTermination. Most people don't. The result: a graveyard of volumes sitting in the "available" state, attached to nothing.

But snapshots are the real sleeper. Every automated backup, every AMI you created for a deploy pipeline, every “just in case” snapshot — they accumulate silently. We found 4.2 TB of snapshots older than 18 months, costing $210/month. Nobody knew they existed.

3. NAT Gateway Data Processing Charges

This one catches even experienced engineers. A NAT Gateway costs $0.045/hour ($32.40/month) just to exist. But the real cost is $0.045 per GB of data processed through it.

If your private subnet instances are pulling container images, downloading packages, or hitting external APIs, that traffic goes through NAT. We found one account pushing 800 GB/month through a single NAT Gateway: $32.40/month for the gateway + $36/month in processing fees.

Multiply that across three availability zones (a common HA pattern), and you're looking at roughly $205/month just for NAT.
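Those figures are easy to sanity-check locally. This sketch plugs in the hourly and per-GB rates quoted above (verify them against current AWS pricing for your region):

```shell
# Back-of-envelope NAT cost check, using the rates quoted above
hourly=0.045      # NAT Gateway hourly charge, USD
per_gb=0.045      # data processing charge, USD per GB
gb_month=800      # observed traffic through one gateway
azs=3             # one gateway per AZ

gateway=$(awk -v h="$hourly" 'BEGIN { printf "%.2f", h * 720 }')   # ~720 hours/month
processing=$(awk -v r="$per_gb" -v g="$gb_month" 'BEGIN { printf "%.2f", r * g }')
total=$(awk -v g="$gateway" -v p="$processing" -v n="$azs" 'BEGIN { printf "%.2f", n * (g + p) }')
echo "per gateway: \$$gateway fixed + \$$processing processing; x$azs AZs = \$$total/month"
```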

4. Cross-AZ and Cross-Region Data Transfer

AWS charges $0.01/GB for cross-AZ traffic, billed in each direction. That sounds tiny until you realize your microservices are chatting across AZs thousands of times per second.

A service sending 50 MB/s of cross-AZ traffic pays $1,296/month in data transfer on the sending side alone. We found this in a Kafka setup where producers and consumers were spread across three AZs with no topology awareness.
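To see how fast a sustained stream adds up, convert the rate to a monthly volume. A rough sketch, assuming decimal GB, a 30-day month, and the $0.01/GB rate on one side of the transfer:

```shell
# 50 MB/s sustained, priced at $0.01/GB on the sending side
mb_s=50
gb_month=$(awk -v m="$mb_s" 'BEGIN { printf "%.0f", m * 86400 * 30 / 1000 }')
cost=$(awk -v g="$gb_month" 'BEGIN { printf "%.0f", g * 0.01 }')
echo "$gb_month GB/month -> \$$cost/month"
```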

5. CloudWatch Log Ingestion and Retention

CloudWatch Logs charges $0.50/GB for ingestion. If you’re logging at DEBUG level in production (we’ve all done it), or if your application logs full request/response bodies, the bill grows fast.

One service in our audit was ingesting 180 GB/month of logs: $90/month. The retention was set to “Never expire.” Setting a 30-day retention and dropping the log level to INFO cut this to $15/month.
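Both fixes are one-liners. The first command ranks log groups by stored bytes so you know where to aim; the log group name in the second is a placeholder:

```shell
# Rank log groups by stored bytes to find the heavy hitters
aws logs describe-log-groups \
  --query 'reverse(sort_by(logGroups, &storedBytes))[:10].{name:logGroupName,bytes:storedBytes,retention:retentionInDays}' \
  --output table

# Cap retention at 30 days (replace the group name with your own)
aws logs put-retention-policy \
  --log-group-name "/aws/app/example-service" \
  --retention-in-days 30
```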

The AWS Cost Audit Process (Step by Step)

Here’s the exact sequence we run. It takes about a day per account for the first pass.

Step 1: Get the Lay of the Land with Cost Explorer

Before touching anything, understand where the money goes. Pull a 3-month breakdown by service:

aws ce get-cost-and-usage \
  --time-period Start=2026-01-01,End=2026-03-18 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
  --output table

This gives you a ranked list. In most accounts, EC2, RDS, and S3 will be your top three. Don’t waste time optimizing a $12/month service when EC2 is costing you $18K.

Step 2: Find Idle EC2 Instances

Pull CPU utilization for all running instances over the last 14 days:

# List all running instances with their types and launch times
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{
    ID:InstanceId,
    Type:InstanceType,
    LaunchTime:LaunchTime,
    Name:Tags[?Key==`Name`]|[0].Value
  }' \
  --output table

Then for each instance, check CloudWatch:

aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-03-04T00:00:00Z \
  --end-time 2026-03-18T00:00:00Z \
  --period 86400 \
  --statistics Average Maximum \
  --output table

Decision criteria:

Avg CPU (14d) | Max CPU (14d) | Action
< 5%          | < 10%         | Almost certainly idle. Investigate and terminate.
< 15%         | < 30%         | Right-sizing candidate. Drop one or two instance sizes.
15–40%        | < 60%         | Probably well-sized, but check memory and network too.
> 40%         | > 70%         | Leave it alone. This instance is doing real work.

Don’t just look at CPU. An instance might have low CPU but high memory usage (common with Redis, Elasticsearch, or Java apps with large heaps). Check memory via CloudWatch agent metrics or SSH in and run free -h.
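Checking instances one at a time gets tedious. A small loop covers the whole account in one pass (a sketch; the GNU `date -d` flag may need adjusting on macOS):

```shell
# Average CPU over the last 14 days for every running instance
start=$(date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ)
end=$(date -u +%Y-%m-%dT%H:%M:%SZ)

for id in $(aws ec2 describe-instances \
    --filters "Name=instance-state-name,Values=running" \
    --query 'Reservations[].Instances[].InstanceId' --output text); do
  avg=$(aws cloudwatch get-metric-statistics \
    --namespace AWS/EC2 --metric-name CPUUtilization \
    --dimensions Name=InstanceId,Value="$id" \
    --start-time "$start" --end-time "$end" \
    --period 86400 --statistics Average \
    --query 'avg(Datapoints[].Average)' --output text)
  echo "$id avg_cpu_14d=${avg}"
done
```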

Step 3: Hunt Orphaned EBS Volumes

# Find all unattached EBS volumes with their size and creation date
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[].{
    ID:VolumeId,
    Size:Size,
    Type:VolumeType,
    Created:CreateTime
  }' \
  --output table

For gp3 volumes, you’re paying $0.08/GB/month. A forgotten 500 GB volume is $40/month. We typically find 10–30 orphaned volumes per account.

Before deleting: snapshot anything you’re unsure about. A snapshot of a 500 GB volume with 50 GB of actual data costs about $2.50/month — much cheaper than the volume itself.

# Snapshot before deleting
aws ec2 create-snapshot \
  --volume-id vol-0abc123 \
  --description "Backup before cleanup - March 2026" \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=cleanup,Value=safe-to-delete-after-30d}]'

# Then delete the volume
aws ec2 delete-volume --volume-id vol-0abc123

Step 4: Clean Up Old Snapshots

# Find snapshots older than 180 days
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<`2025-09-18`].{
    ID:SnapshotId,
    Size:VolumeSize,
    Created:StartTime,
    Description:Description
  }' \
  --output table

Cross-reference with your AMI list to make sure you’re not deleting a snapshot that backs a live AMI:

# Get all snapshot IDs referenced by AMIs
aws ec2 describe-images --owners self \
  --query 'Images[].BlockDeviceMappings[].Ebs.SnapshotId' \
  --output text

Any snapshot not in that list and older than your retention policy is a delete candidate.
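You can mechanize the cross-reference with `comm` (a sketch; the date literal matches the 180-day cutoff used above):

```shell
# Snapshot IDs referenced by AMIs
aws ec2 describe-images --owners self \
  --query 'Images[].BlockDeviceMappings[].Ebs.SnapshotId' --output text \
  | tr '\t' '\n' | sort -u > ami-snaps.txt

# Snapshot IDs older than the cutoff
aws ec2 describe-snapshots --owner-ids self \
  --query 'Snapshots[?StartTime<`2025-09-18`].SnapshotId' --output text \
  | tr '\t' '\n' | sort -u > old-snaps.txt

# Old snapshots that no AMI references: delete candidates
comm -23 old-snaps.txt ami-snaps.txt
```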

Step 5: Audit NAT Gateway Traffic

# Check NAT Gateway data processing over the last 30 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-0abc123 \
  --start-time 2026-02-18T00:00:00Z \
  --end-time 2026-03-18T00:00:00Z \
  --period 2592000 \
  --statistics Sum \
  --output json

If you’re seeing high NAT traffic from container image pulls, consider setting up a VPC endpoint for ECR ($7.20/month per AZ) instead of routing through NAT. For S3 access, a Gateway VPC Endpoint is free and eliminates NAT charges for S3 traffic entirely.

Quick NAT cost reduction checklist:

  • Add a Gateway VPC Endpoint for S3 (free, immediate savings)
  • Add a Gateway VPC Endpoint for DynamoDB if applicable (also free)
  • Add Interface VPC Endpoints for ECR, CloudWatch, and SSM if traffic justifies $7.20/month/AZ each
  • Check if any public-facing services are unnecessarily in private subnets routing through NAT
  • Consider NAT instances for dev/staging environments where HA isn’t critical ($3.80/month for a t4g.nano vs $32.40/month for NAT Gateway)
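The S3 Gateway Endpoint from the first checklist item takes a single call. The VPC and route table IDs below are placeholders, and the region in the service name should match yours:

```shell
# Free Gateway VPC Endpoint for S3; S3-bound traffic stops flowing through NAT
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0abc123 \
  --vpc-endpoint-type Gateway \
  --service-name com.amazonaws.us-east-1.s3 \
  --route-table-ids rtb-0abc123 rtb-0def456
```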

Step 6: Review Data Transfer Patterns

Data transfer costs hide inside your EC2 line item. To break them out:

aws ce get-cost-and-usage \
  --time-period Start=2026-02-01,End=2026-03-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=USAGE_TYPE \
  --filter '{
    "Dimensions": {
      "Key": "SERVICE",
      "Values": ["Amazon Elastic Compute Cloud - Compute"]
    }
  }' \
  --output table

Look for usage types containing DataTransfer, InterZone, or Regional. If cross-AZ transfer is significant, consider:

  • Enabling topology-aware routing in Kubernetes (if applicable)
  • Placing tightly coupled services in the same AZ with fallback to other AZs
  • Using ElastiCache or DAX in the same AZ as the consuming service

The $14K/Month Breakdown

Here’s exactly where the savings came from in our 3-account audit:

Category                          | Monthly Savings | Action Taken
Idle EC2 instances (14 instances) | $2,840          | Terminated after owner confirmation
EC2 right-sizing (8 instances)    | $3,200          | Downsized from m5.2xlarge/4xlarge to m5.xlarge
Orphaned EBS volumes (23 volumes) | $680            | Snapshotted and deleted
Old snapshots (4.2 TB)            | $210            | Deleted snapshots older than 180 days
NAT Gateway consolidation         | $390            | Replaced 3 NAT GWs with 1 + added VPC endpoints
Dev/staging scheduling            | $4,100          | Shut down non-prod from 8pm–8am + weekends
CloudWatch log retention          | $420            | Set 30-day retention, reduced log levels
Unused Elastic IPs (11)           | $40             | Released
Idle load balancers (3)           | $48             | Deleted
RDS right-sizing (1 instance)     | $1,900          | db.r5.2xlarge → db.r5.large (CPU was at 8%)
Total                             | $13,828         |

The single biggest win was dev/staging scheduling. Engineers don't work at 3am on Saturday, but those environments were running 24/7. A simple Lambda function calling ec2:StopInstances and ec2:StartInstances on an EventBridge (formerly CloudWatch Events) schedule saved $4,100/month with zero impact on anyone.
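If your non-prod instances carry a consistent tag, the same effect is available straight from the CLI. A sketch assuming an `env` tag (an assumption, not the audited account's actual scheme); run the stop at 8pm and the mirror-image `start-instances` at 8am from any scheduler:

```shell
# Stop every running dev/staging instance (tag key and values are assumptions)
ids=$(aws ec2 describe-instances \
  --filters "Name=tag:env,Values=dev,staging" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text)

[ -n "$ids" ] && aws ec2 stop-instances --instance-ids $ids
```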

Common Mistakes to Avoid

Optimizing the Wrong Things

If your bill is $30K/month and $22K of that is EC2, don’t spend a week optimizing your $180 S3 storage costs. Start with the biggest line item. Always.

Premature Commitment Purchases

Don’t buy Reserved Instances or Savings Plans until you’ve finished right-sizing. If you commit to an m5.2xlarge for a year and then realize the workload only needs an m5.large, you’re locked into paying for the larger instance.

Right-size first. Stabilize for 2–4 weeks. Then commit.

Ignoring Data Transfer Costs

Data transfer doesn’t show up as its own service in Cost Explorer by default. It’s buried inside other service line items. Engineers often optimize compute and storage while completely ignoring the $2K/month in cross-AZ and internet egress charges.

Downsizing When You Shouldn’t

Not every idle-looking instance should be downsized. Some workloads are bursty — low CPU 95% of the time, then spike to 80% during batch processing or deploys. Check the maximum CPU over 14 days, not just the average.

Also, don’t downsize your production database during a cost-cutting sprint without load testing first. The savings from going db.r5.xlarge to db.r5.large aren’t worth a P1 incident when your next traffic spike hits.

The Ongoing Process

A one-time audit saves money. A recurring process keeps it saved. Here’s what we recommend:

  • Weekly: review Cost Explorer for any service with >10% week-over-week increase
  • Monthly: run the orphaned resources check (EBS volumes, snapshots, EIPs, idle LBs)
  • Monthly: review CloudWatch dashboards for under-utilized instances
  • Quarterly: re-evaluate Reserved Instance and Savings Plan coverage
  • On every architecture change: estimate data transfer costs before deploying

Set up AWS Budgets with alerts at 80% and 100% of your expected monthly spend. It takes five minutes and has saved us from surprise bills more than once.
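Creating that budget from the CLI is equally quick. The account ID, amount, and email below are placeholders:

```shell
# Monthly cost budget with alerts at 80% and 100% of actual spend
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{"BudgetName":"monthly-total","BudgetLimit":{"Amount":"33000","Unit":"USD"},"TimeUnit":"MONTHLY","BudgetType":"COST"}' \
  --notifications-with-subscribers '[
    {"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":80,"ThresholdType":"PERCENTAGE"},
     "Subscribers":[{"SubscriptionType":"EMAIL","Address":"team@example.com"}]},
    {"Notification":{"NotificationType":"ACTUAL","ComparisonOperator":"GREATER_THAN","Threshold":100,"ThresholdType":"PERCENTAGE"},
     "Subscribers":[{"SubscriptionType":"EMAIL","Address":"team@example.com"}]}
  ]'
```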

Wrapping Up

AWS cost optimization isn’t a one-time project. It’s a habit. The accounts that stay lean are the ones where someone looks at the bill every week and asks “what changed?”

The good news: the first audit always has the biggest wins. If you’ve never done a structured cost review, there’s almost certainly $5K–$15K/month sitting in your account waiting to be reclaimed. Start with the commands above, work through the checklist, and you’ll find it.

If you want to automate the ongoing monitoring piece — catching cost anomalies, tracking idle resources, and getting alerts before waste accumulates — that’s exactly what Xplorr does. But the audit process above works with nothing but the AWS CLI and an afternoon.

