We spent three weeks auditing a mid-stage startup’s AWS infrastructure across three accounts — dev, staging, and production. Their monthly bill was $47K. When we finished, it was $33K. Same performance. Same reliability. No service degradation.
The waste wasn’t in one place. It was spread across dozens of small, boring things that nobody had looked at in months. Here’s the exact process we followed, with the commands and queries you can run today.
The 5 Real Cost Leaks Nobody Talks About
Before we get to the audit process, let’s talk about where money actually disappears. It’s not where most people think.
1. Idle EC2 Instances
This is the obvious one, but the scale still surprises people. In the 3-account setup we audited, we found 14 instances running 24/7 that were doing functionally nothing — test boxes from Q3, a Jenkins controller that had been replaced by GitHub Actions, two “temporary” bastion hosts.
Combined cost: $2,840/month.
The tricky part isn’t finding them. It’s that nobody wants to turn them off because “someone might need that.” Create a spreadsheet. Tag the owners. Give them a week to respond. Then terminate.
2. Unused EBS Volumes and Forgotten Snapshots
EBS volumes persist after instance termination unless DeleteOnTermination is set — and it defaults to true only for the root volume, not for any additional volumes you attach. The result: a graveyard of volumes in the `available` state, attached to nothing.
But snapshots are the real sleeper. Every automated backup, every AMI you created for a deploy pipeline, every “just in case” snapshot — they accumulate silently. We found 4.2 TB of snapshots older than 18 months, costing $210/month. Nobody knew they existed.
3. NAT Gateway Data Processing Charges
This one catches even experienced engineers. A NAT Gateway costs $0.045/hour ($32.40/month) just to exist. But the real cost is $0.045 per GB of data processed through it.
If your private subnet instances are pulling container images, downloading packages, or hitting external APIs, that traffic goes through NAT. We found one account pushing 800 GB/month through a single NAT Gateway: $32.40/month for the gateway plus $36/month in processing fees.
Multiply that across three availability zones (a common HA pattern), each pushing similar traffic, and you’re looking at roughly $205/month just for NAT.
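The arithmetic is simple enough to script. A quick sketch of the scenario above — three gateways, 800 GB processed through each — using us-east-1 rates (check your region’s pricing page):

```shell
# Estimate monthly NAT Gateway cost: hourly charge per gateway plus per-GB processing.
gateways=3
gb_each=800
total=$(awk -v n="$gateways" -v gb="$gb_each" \
  'BEGIN { printf "%.2f", n * (0.045 * 720 + gb * 0.045) }')
echo "estimated NAT cost: \$$total/month"
```

Swap in your own gateway count and the monthly GB from the CloudWatch query in Step 5 to get your number.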
4. Cross-AZ and Cross-Region Data Transfer
AWS charges $0.01/GB for cross-AZ traffic — in each direction. That sounds tiny until you realize your microservices are chatting across AZs thousands of times per second.
A service sustaining 50 MB/s of cross-AZ traffic costs $1,296/month in data transfer alone — counting just one direction. We found this in a Kafka setup where producers and consumers were spread across three AZs with no topology awareness.
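To turn a sustained transfer rate into a monthly dollar figure for your own services, the conversion is a one-liner (decimal GB, 30-day month, one direction only):

```shell
# Convert a sustained cross-AZ rate in MB/s into a monthly bill at $0.01/GB.
rate_mb_s=50
monthly=$(awk -v r="$rate_mb_s" \
  'BEGIN { printf "%.0f", r * 86400 * 30 / 1000 * 0.01 }')
echo "~\$$monthly/month in cross-AZ transfer"
```

Double the result if both sides of the flow cross an AZ boundary.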
5. CloudWatch Log Ingestion and Retention
CloudWatch Logs charges $0.50/GB for ingestion. If you’re logging at DEBUG level in production (we’ve all done it), or if your application logs full request/response bodies, the bill grows fast.
One service in our audit was ingesting 180 GB/month of logs: $90/month. The retention was set to “Never expire.” Setting a 30-day retention and dropping the log level to INFO cut this to $15/month.
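The retention fix is one API call per log group. A minimal sketch, dry-run by default so you can review before applying — the log group names here are placeholders:

```shell
# Set a 30-day retention policy on each listed log group.
# DRY_RUN=true (the default) just prints the commands; set DRY_RUN=false to apply.
DRY_RUN=${DRY_RUN:-true}
for group in /aws/lambda/example-service /ecs/example-api; do
  cmd="aws logs put-retention-policy --log-group-name $group --retention-in-days 30"
  if [ "$DRY_RUN" = true ]; then
    echo "DRY RUN: $cmd"
  else
    $cmd
  fi
done
```

You can feed the loop from `aws logs describe-log-groups` to cover everything in the account.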
The AWS Cost Audit Process (Step by Step)
Here’s the exact sequence we run. It takes about a day per account for the first pass.
Step 1: Get the Lay of the Land with Cost Explorer
Before touching anything, understand where the money goes. Pull a 3-month breakdown by service:
```shell
aws ce get-cost-and-usage \
  --time-period Start=2026-01-01,End=2026-03-18 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=SERVICE \
  --output table
```
This gives you a ranked list. In most accounts, EC2, RDS, and S3 will be your top three. Don’t waste time optimizing a $12/month service when EC2 is costing you $18K.
Step 2: Find Idle EC2 Instances
Pull CPU utilization for all running instances over the last 14 days:
```shell
# List all running instances with their types and launch times
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].{
    ID:InstanceId,
    Type:InstanceType,
    LaunchTime:LaunchTime,
    Name:Tags[?Key==`Name`]|[0].Value
  }' \
  --output table
```
Then for each instance, check CloudWatch:
```shell
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123def456 \
  --start-time 2026-03-04T00:00:00Z \
  --end-time 2026-03-18T00:00:00Z \
  --period 86400 \
  --statistics Average Maximum \
  --output table
```
Decision criteria:
| Avg CPU (14d) | Max CPU (14d) | Action |
|---|---|---|
| < 5% | < 10% | Almost certainly idle. Investigate and terminate. |
| < 15% | < 30% | Right-sizing candidate. Drop one or two instance sizes. |
| 15–40% | < 60% | Probably well-sized, but check memory and network too. |
| > 40% | > 70% | Leave it alone. This instance is doing real work. |
Don’t just look at CPU. An instance might have low CPU but high memory usage (common with Redis, Elasticsearch, or Java apps with large heaps). Check memory via CloudWatch agent metrics or SSH in and run `free -h`.
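If you’re scripting the audit, the decision table can be encoded directly. A sketch assuming CPU values are whole-number percentages (round your CloudWatch averages first); borderline cases that fall between the table’s rows default to “leave-alone” on the theory that not touching a working instance is the safe failure mode:

```shell
# Classify an instance by its 14-day average and maximum CPU, per the table above.
classify() {
  avg=$1; max=$2
  if   [ "$avg" -lt 5 ]   && [ "$max" -lt 10 ]; then echo "idle"
  elif [ "$avg" -lt 15 ]  && [ "$max" -lt 30 ]; then echo "right-size"
  elif [ "$avg" -le 40 ]  && [ "$max" -lt 60 ]; then echo "check-memory"
  else echo "leave-alone"; fi
}
classify 3 8      # → idle
classify 12 25    # → right-size
classify 55 85    # → leave-alone
```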
Step 3: Hunt Orphaned EBS Volumes
```shell
# Find all unattached EBS volumes with their size and creation date
aws ec2 describe-volumes \
  --filters "Name=status,Values=available" \
  --query 'Volumes[].{
    ID:VolumeId,
    Size:Size,
    Type:VolumeType,
    Created:CreateTime
  }' \
  --output table
```
For gp3 volumes, you’re paying $0.08/GB/month. A forgotten 500 GB volume is $40/month. We typically find 10–30 orphaned volumes per account.
Before deleting: snapshot anything you’re unsure about. A snapshot of a 500 GB volume with 50 GB of actual data costs about $2.50/month — much cheaper than the volume itself.
```shell
# Snapshot before deleting
aws ec2 create-snapshot \
  --volume-id vol-0abc123 \
  --description "Backup before cleanup - March 2026" \
  --tag-specifications 'ResourceType=snapshot,Tags=[{Key=cleanup,Value=safe-to-delete-after-30d}]'

# Then delete the volume
aws ec2 delete-volume --volume-id vol-0abc123
```
Step 4: Clean Up Old Snapshots
```shell
# Find snapshots older than 180 days
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<`2025-09-18`].{
    ID:SnapshotId,
    Size:VolumeSize,
    Created:StartTime,
    Description:Description
  }' \
  --output table
```
Cross-reference with your AMI list to make sure you’re not deleting a snapshot that backs a live AMI:
```shell
# Get all snapshot IDs referenced by AMIs
aws ec2 describe-images --owners self \
  --query 'Images[].BlockDeviceMappings[].Ebs.SnapshotId' \
  --output text
```
Any snapshot not in that list and older than your retention policy is a delete candidate.
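The cross-reference itself is a set difference. A sketch with hard-coded sample IDs — in practice, fill both lists from the two describe commands above:

```shell
# Print snapshots that are old AND not backing any AMI.
old_snaps="snap-aaa snap-bbb snap-ccc"   # from describe-snapshots (sample data)
ami_snaps="snap-bbb"                     # from describe-images (sample data)
candidates=""
for snap in $old_snaps; do
  case " $ami_snaps " in
    *" $snap "*) ;;                      # referenced by an AMI — keep it
    *) candidates="$candidates $snap" ;; # old and unreferenced — delete candidate
  esac
done
echo "delete candidates:$candidates"
```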
Step 5: Audit NAT Gateway Traffic
```shell
# Check NAT Gateway data processing over the last 30 days
aws cloudwatch get-metric-statistics \
  --namespace AWS/NATGateway \
  --metric-name BytesOutToDestination \
  --dimensions Name=NatGatewayId,Value=nat-0abc123 \
  --start-time 2026-02-18T00:00:00Z \
  --end-time 2026-03-18T00:00:00Z \
  --period 2592000 \
  --statistics Sum \
  --output json
```
If you’re seeing high NAT traffic from container image pulls, consider setting up a VPC endpoint for ECR ($7.20/month per AZ) instead of routing through NAT. For S3 access, a Gateway VPC Endpoint is free and eliminates NAT charges for S3 traffic entirely.
Quick NAT cost reduction checklist:
- Add a Gateway VPC Endpoint for S3 (free, immediate savings)
- Add a Gateway VPC Endpoint for DynamoDB if applicable (also free)
- Add Interface VPC Endpoints for ECR, CloudWatch, and SSM if traffic justifies $7.20/month/AZ each
- Check if any public-facing services are unnecessarily in private subnets routing through NAT
- Consider NAT instances for dev/staging environments where HA isn’t critical ($3.80/month for a t4g.nano vs $32.40/month for NAT Gateway)
Step 6: Review Data Transfer Patterns
Data transfer costs hide inside your EC2 line item. To break them out:
```shell
aws ce get-cost-and-usage \
  --time-period Start=2026-02-01,End=2026-03-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=DIMENSION,Key=USAGE_TYPE \
  --filter '{
    "Dimensions": {
      "Key": "SERVICE",
      "Values": ["Amazon Elastic Compute Cloud - Compute"]
    }
  }' \
  --output table
```
Look for usage types containing DataTransfer, InterZone, or Regional. If cross-AZ transfer is significant, consider:
- Enabling topology-aware routing in Kubernetes (if applicable)
- Placing tightly coupled services in the same AZ with fallback to other AZs
- Using ElastiCache or DAX in the same AZ as the consuming service
The $14K/Month Breakdown
Here’s exactly where the savings came from in our 3-account audit:
| Category | Monthly Savings | Action Taken |
|---|---|---|
| Idle EC2 instances (14 instances) | $2,840 | Terminated after owner confirmation |
| EC2 right-sizing (8 instances) | $3,200 | Downsized from m5.2xlarge/4xlarge to m5.xlarge |
| Orphaned EBS volumes (23 volumes) | $680 | Snapshotted and deleted |
| Old snapshots (4.2 TB) | $210 | Deleted snapshots older than 180 days |
| NAT Gateway consolidation | $390 | Replaced 3 NAT GWs with 1 + added VPC endpoints |
| Dev/staging scheduling | $4,100 | Shut down non-prod from 8pm–8am + weekends |
| CloudWatch log retention | $420 | Set 30-day retention, reduced log levels |
| Unused Elastic IPs (11) | $40 | Released |
| Idle load balancers (3) | $48 | Deleted |
| RDS right-sizing (1 instance) | $1,900 | db.r5.2xlarge → db.r5.large (CPU was at 8%) |
| Total | $13,828 | |
The single biggest win was dev/staging scheduling. Engineers don’t work at 3am on Saturday, but those environments were running 24/7. A simple Lambda function using ec2:StopInstances and ec2:StartInstances on an EventBridge (formerly CloudWatch Events) schedule saved $4,100/month with zero impact on anyone.
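It’s worth seeing why scheduling is such a large lever. Shutting down 8pm–8am on weekdays plus full weekends eliminates most of the week — pure arithmetic, no AWS calls:

```shell
# Fraction of weekly instance-hours eliminated by the 8pm–8am + weekends schedule.
pct=$(awk 'BEGIN {
  off = 5 * 12 + 2 * 24        # weekday nights + full weekends = 108 hours
  printf "%.0f", 100 * off / (7 * 24)
}')
echo "non-prod instance-hours eliminated: ${pct}%"
```

Nearly two-thirds of non-prod on-demand spend disappears with a schedule nobody notices.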
Common Mistakes to Avoid
Optimizing the Wrong Things
If your bill is $30K/month and $22K of that is EC2, don’t spend a week optimizing your $180 S3 storage costs. Start with the biggest line item. Always.
Premature Commitment Purchases
Don’t buy Reserved Instances or Savings Plans until you’ve finished right-sizing. If you commit to an m5.2xlarge for a year and then realize the workload only needs an m5.large, you’re locked into paying for the larger instance.
Right-size first. Stabilize for 2–4 weeks. Then commit.
Ignoring Data Transfer Costs
Data transfer doesn’t show up as its own service in Cost Explorer by default. It’s buried inside other service line items. Engineers often optimize compute and storage while completely ignoring the $2K/month in cross-AZ and internet egress charges.
Downsizing When You Shouldn’t
Not every idle-looking instance should be downsized. Some workloads are bursty — low CPU 95% of the time, then spike to 80% during batch processing or deploys. Check the maximum CPU over 14 days, not just the average.
Also, don’t downsize your production database during a cost-cutting sprint without load testing first. The savings from going db.r5.xlarge to db.r5.large aren’t worth a P1 incident when your next traffic spike hits.
The Ongoing Process
A one-time audit saves money. A recurring process keeps it saved. Here’s what we recommend:
- Weekly: review Cost Explorer for any service with >10% week-over-week increase
- Monthly: run the orphaned resources check (EBS volumes, snapshots, EIPs, idle LBs)
- Monthly: review CloudWatch dashboards for under-utilized instances
- Quarterly: re-evaluate Reserved Instance and Savings Plan coverage
- On every architecture change: estimate data transfer costs before deploying
Set up AWS Budgets with alerts at 80% and 100% of your expected monthly spend. It takes five minutes and has saved us from surprise bills more than once.
Wrapping Up
AWS cost optimization isn’t a one-time project. It’s a habit. The accounts that stay lean are the ones where someone looks at the bill every week and asks “what changed?”
The good news: the first audit always has the biggest wins. If you’ve never done a structured cost review, there’s almost certainly $5K–$15K/month sitting in your account waiting to be reclaimed. Start with the commands above, work through the checklist, and you’ll find it.
If you want to automate the ongoing monitoring piece — catching cost anomalies, tracking idle resources, and getting alerts before waste accumulates — that’s exactly what Xplorr does. But the audit process above works with nothing but the AWS CLI and an afternoon.