We set up cost monitoring and alerting for a 40-person engineering team running workloads across AWS and GCP. Within the first week, the alerts caught a $6,000 misconfiguration — someone had launched 8 p3.2xlarge GPU instances for a one-off ML experiment and forgotten to terminate them. They’d been running for 11 days at $3.06/hour each before the anomaly alert fired.
That one catch paid for six months of the monitoring effort. Here’s exactly how we set it up.
Why Most Cost Alerting Fails
Before getting into the setup, it’s worth understanding why cost alerts have a bad reputation. Most teams that “already have alerting in place” actually have one of these:
- A single budget alert at 80% of a number someone picked a year ago. The threshold is either too high (never fires) or too low (fires every month and gets ignored).
- Alerts going to a shared email or Slack channel that nobody monitors. The alert fires, gets buried under 200 other notifications, and nobody acts on it.
- No anomaly detection — only absolute thresholds. If your bill is $50,000/month and growing 10% month-over-month, a $55,000 threshold fires every month. But a sudden jump from $50,000 to $58,000 in a single week? That’s the one you actually need to catch, and threshold alerts miss it.
Effective cost alerting needs four things: budget tracking, anomaly detection, enforcement mechanisms, and the right routing.
Setting Up Budget Alerts
AWS: Budgets + SNS
AWS Budgets is the built-in tool. It’s not fancy, but it works. Here’s a setup that covers the common cases.
Create an SNS topic for cost alerts:
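aws sns create-topic --name cloud-cost-alerts \
  --region us-east-1
# The topic's access policy must allow budgets.amazonaws.com to publish to it.
# Subscribe your team's email address. Slack webhooks expect a different payload
# than SNS sends, so Slack delivery needs a relay (for example a small Lambda).
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts \
  --protocol email \
  --notification-endpoint [email protected]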
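SNS wraps each alert in its own envelope, while a Slack incoming webhook expects a JSON body with a `text` field. One way to bridge the two is a small Lambda subscribed to the topic. The sketch below is illustrative only: the function names and the `SLACK_WEBHOOK_URL` environment variable are assumptions, not part of the original setup.

```python
import json
import os
import urllib.request


def sns_event_to_slack_payload(event: dict) -> dict:
    """Turn an SNS-triggered Lambda event into a Slack webhook payload."""
    lines = []
    for record in event.get("Records", []):
        sns = record.get("Sns", {})
        subject = sns.get("Subject") or "Cost alert"
        message = sns.get("Message", "")
        lines.append(f"*{subject}*\n{message}")
    return {"text": "\n\n".join(lines) or "Cost alert (empty SNS event)"}


def lambda_handler(event, context):
    # SLACK_WEBHOOK_URL is a hypothetical environment variable name.
    payload = sns_event_to_slack_payload(event)
    request = urllib.request.Request(
        os.environ["SLACK_WEBHOOK_URL"],
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return {"status": response.status}
```

Keeping the formatting in a pure function makes it easy to test without touching the network.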
Create a monthly budget with graduated alerts:
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
"BudgetName": "monthly-total-spend",
"BudgetLimit": {
"Amount": "15000",
"Unit": "USD"
},
"TimeUnit": "MONTHLY",
"BudgetType": "COST",
"CostFilters": {},
"CostTypes": {
"IncludeTax": true,
"IncludeSubscription": true,
"UseBlended": false,
"IncludeRefund": false,
"IncludeCredit": false
}
}' \
--notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 50,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [{
"SubscriptionType": "SNS",
"Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
}]
},
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [{
"SubscriptionType": "SNS",
"Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
}]
},
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 100,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [{
"SubscriptionType": "SNS",
"Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
}]
},
{
"Notification": {
"NotificationType": "FORECASTED",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 100,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [{
"SubscriptionType": "SNS",
"Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
}]
}
]'
The four alert types here are deliberate:
| Alert | When It Fires | Purpose |
|---|---|---|
| 50% actual | Mid-month | Sanity check — are we tracking normally? |
| 80% actual | ~Day 24 normally | Warning — likely to exceed if anything unusual happens |
| 100% actual | Budget exceeded | Action required — investigate what pushed over |
| 100% forecasted | When AWS projects overage | Early warning — fires before you actually exceed |
The forecasted alert is the most valuable. AWS uses your spending trend to predict the month-end total. If you’re on pace to exceed budget by day 15, you get two weeks to act instead of finding out after the fact.
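To make the idea of a forecasted alert concrete, here is a deliberately naive linear projection. It is not how AWS computes its forecast (AWS does not use this formula as far as we know); it only illustrates why a forecast can warn you weeks before the actual threshold is crossed. The function names and the sample numbers are hypothetical.

```python
import calendar
from datetime import date


def forecast_month_end(spend_to_date: float, as_of: date) -> float:
    """Project month-end spend, assuming the average daily rate so far continues.

    spend_to_date is treated as spend through the end of the as_of day.
    """
    days_in_month = calendar.monthrange(as_of.year, as_of.month)[1]
    daily_rate = spend_to_date / as_of.day
    return daily_rate * days_in_month


def forecast_exceeds_budget(spend_to_date: float, as_of: date, budget: float) -> bool:
    """True when the linear projection lands above the budget."""
    return forecast_month_end(spend_to_date, as_of) > budget


if __name__ == "__main__":
    # Hypothetical: $9,000 spent by June 15 against the $15,000 monthly budget.
    projected = forecast_month_end(9000.0, date(2025, 6, 15))
    print(f"Projected month-end spend: ${projected:,.2f}")
    print("Over budget:", forecast_exceeds_budget(9000.0, date(2025, 6, 15), 15000.0))
```

In this example actual spend is only at 60% of budget on day 15, so none of the actual-spend alerts above 50% have fired yet, but the projection is already over the limit.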
Create per-service budgets for your top cost drivers:
# EC2-specific budget
aws budgets create-budget \
--account-id 123456789012 \
--budget '{
"BudgetName": "ec2-monthly-budget",
"BudgetLimit": {
"Amount": "8000",
"Unit": "USD"
},
"TimeUnit": "MONTHLY",
"BudgetType": "COST",
"CostFilters": {
"Service": ["Amazon Elastic Compute Cloud - Compute"]
}
}' \
--notifications-with-subscribers '[
{
"Notification": {
"NotificationType": "ACTUAL",
"ComparisonOperator": "GREATER_THAN",
"Threshold": 80,
"ThresholdType": "PERCENTAGE"
},
"Subscribers": [{
"SubscriptionType": "SNS",
"Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
}]
}
]'
Azure: Cost Management Alerts
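Azure has built-in budget alerts through Cost Management. You can create them in the portal, but here’s the CLI approach. The az consumption command group is in preview and its supported flags vary by CLI version, so check az consumption budget create --help before relying on the notification settings shown here; if they are not accepted, the same notification structure can be submitted through the Budgets REST API or an ARM template: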
# Create a budget with alert conditions
az consumption budget create \
--budget-name "monthly-total-spend" \
--amount 12000 \
--category Cost \
--time-grain Monthly \
--start-date 2026-03-01 \
--end-date 2027-03-01 \
--resource-group-filter "" \
--notifications '{
"Actual_GreaterThan_80_Percent": {
"enabled": true,
"operator": "GreaterThan",
"threshold": 80,
"contactEmails": ["[email protected]"],
"thresholdType": "Actual"
},
"Forecasted_GreaterThan_100_Percent": {
"enabled": true,
"operator": "GreaterThan",
"threshold": 100,
"contactEmails": ["[email protected]"],
"thresholdType": "Forecasted"
}
}'
Azure also supports action groups, which can trigger Logic Apps, Azure Functions, or webhooks. These are useful for automated responses like sending a Slack message or tagging resources for review.
GCP: Budget Alerts via Billing API
GCP budgets are configured per billing account. The console is the easiest path, but you can automate with Terraform:
resource "google_billing_budget" "monthly_budget" {
billing_account = "012345-6789AB-CDEF01"
display_name = "Monthly Total Spend"
budget_filter {
projects = ["projects/my-project-id"]
}
amount {
specified_amount {
currency_code = "USD"
units = "10000"
}
}
threshold_rules {
threshold_percent = 0.5
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 0.8
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 1.0
spend_basis = "CURRENT_SPEND"
}
threshold_rules {
threshold_percent = 1.0
spend_basis = "FORECASTED_SPEND"
}
all_updates_rule {
pubsub_topic = google_pubsub_topic.budget_alerts.id
}
}
resource "google_pubsub_topic" "budget_alerts" {
name = "budget-alerts"
}
GCP budget alerts publish to Pub/Sub, which you can wire to a Cloud Function that posts to Slack, sends email, or triggers automated remediation.
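As a rough sketch of the Cloud Function side, the helper below decodes a budget notification and builds a one-line summary that could be posted to Slack or email. It assumes the background-function event shape, where `event["data"]` is base64-encoded JSON, and the field names (`costAmount`, `budgetAmount`, `budgetDisplayName`, `currencyCode`) follow the budget notification format as we understand it; verify both against the current GCP documentation. The function name is hypothetical.

```python
import base64
import json


def summarize_budget_notification(event: dict) -> str:
    """Decode a budget Pub/Sub message and return a one-line summary."""
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    cost = float(payload["costAmount"])
    budget = float(payload["budgetAmount"])
    # Avoid dividing by zero if a budget amount is missing or zero.
    pct = (cost / budget * 100) if budget else 0.0
    name = payload.get("budgetDisplayName", "unknown budget")
    currency = payload.get("currencyCode", "USD")
    return f"{name}: {cost:,.2f} {currency} spent, {pct:.0f}% of {budget:,.2f} budget"
```

Note that GCP publishes budget notifications repeatedly, not only when a threshold is crossed, so a real handler would usually deduplicate before alerting.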
Anomaly Detection: Catching What Budgets Miss
Budgets catch predictable overages. Anomalies catch surprises — a new service someone launched, a misconfigured auto-scaling policy, a data transfer cost that spiked because a logging pipeline started shipping to a different region.
Threshold-Based Detection
The simplest approach: compare today’s spend to the same day last week. If it’s more than X% higher, alert.
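# Daily anomaly check: compare yesterday's spend to the same weekday one week earlier
import boto3
from datetime import datetime, timedelta, timezone

ce = boto3.client('ce')
today = datetime.now(timezone.utc).date()
# Cost Explorer data lags, so use the most recent complete day
yesterday = today - timedelta(days=1)
week_before = yesterday - timedelta(days=7)  # same weekday, previous week

def get_daily_cost(date):
    # 'End' is exclusive, so this window covers exactly one day
    result = ce.get_cost_and_usage(
        TimePeriod={
            'Start': str(date),
            'End': str(date + timedelta(days=1))
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost']
    )
    return float(
        result['ResultsByTime'][0]['Total']['UnblendedCost']['Amount']
    )

def send_alert(current, baseline, pct):
    # Placeholder: replace with an SNS publish or a Slack call
    print(f"Daily spend ${current:.2f} vs ${baseline:.2f} ({pct:+.0f}%)")

current_cost = get_daily_cost(yesterday)
baseline_cost = get_daily_cost(week_before)

# Skip the comparison when the baseline is zero to avoid dividing by zero
if baseline_cost > 0:
    pct_change = ((current_cost - baseline_cost) / baseline_cost) * 100
    if pct_change > 25:
        # Fire alert: "Daily spend increased 25%+ vs same day last week"
        send_alert(current_cost, baseline_cost, pct_change)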
This catches the obvious stuff. But it has blind spots: if your costs are growing steadily at 5% per week, the weekly comparison never fires because each day is only slightly above the previous week.
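A quick simulation makes the blind spot visible. The sketch below uses hypothetical numbers (a $1,000/day starting point) and the same 25% threshold: with steady 5% weekly growth the check never fires, yet spend ends up roughly 80% higher after twelve weeks.

```python
WEEKLY_GROWTH = 0.05        # steady 5% week-over-week growth, as described above
ALERT_THRESHOLD = 25        # percent, same threshold as the daily check
START_DAILY_COST = 1000.0   # hypothetical starting daily spend in USD


def simulate(weeks: int) -> list[tuple[int, float, float, bool]]:
    """Return (week, daily_cost, pct_change_vs_prior_week, alert_fired) rows."""
    rows = []
    previous = START_DAILY_COST
    for week in range(1, weeks + 1):
        current = previous * (1 + WEEKLY_GROWTH)
        pct_change = (current - previous) / previous * 100
        rows.append((week, current, pct_change, pct_change > ALERT_THRESHOLD))
        previous = current
    return rows


if __name__ == "__main__":
    rows = simulate(12)
    for week, cost, pct, fired in rows:
        print(f"week {week:2d}: ${cost:8.2f}/day  {pct:4.1f}% vs last week  alert={fired}")
    total_growth = (rows[-1][1] / START_DAILY_COST - 1) * 100
    print(f"Total growth after 12 weeks: {total_growth:.0f}%")
```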
Better: Rolling Average Comparison
Compare the last 3 days against the trailing 30-day average. This catches both sudden spikes and gradual drift:
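# Get 30-day cost breakdown, grouped by service
import boto3
from datetime import datetime, timedelta, timezone

ce = boto3.client('ce')
today = datetime.now(timezone.utc).date()

result = ce.get_cost_and_usage(
    TimePeriod={
        'Start': str(today - timedelta(days=30)),
        'End': str(today)  # exclusive, so the window ends with yesterday
    },
    Granularity='DAILY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)
# Each entry in result['ResultsByTime'] holds one day's per-service costs under 'Groups'
# Calculate per-service rolling averages
# Alert if any service's 3-day average exceeds 30-day average by 40%+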
Grouping by service is important. A 10% increase in total spend could be normal growth. But a 200% increase in a single service — like CloudWatch suddenly costing 5x more because someone enabled detailed monitoring on 500 instances — is almost always a problem.
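The per-service comparison can be sketched as a pure function over the `ResultsByTime` list that Cost Explorer returns for a daily, service-grouped query. The function name and the `min_increase_usd` floor are our own additions (the floor keeps tiny services from triggering on percentage alone); the 3-day window and 40% threshold follow the rule described above.

```python
from collections import defaultdict


def find_service_anomalies(results_by_time, recent_days=3,
                           pct_threshold=40.0, min_increase_usd=10.0):
    """Flag services whose recent daily average exceeds their full-window average.

    Returns a sorted list of (service, window_avg, recent_avg) tuples.
    """
    n_days = len(results_by_time)
    if n_days < recent_days:
        return []

    # service -> daily costs, oldest first; days without a charge stay at 0.0
    costs = defaultdict(lambda: [0.0] * n_days)
    for i, day in enumerate(results_by_time):
        for group in day.get("Groups", []):
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            costs[group["Keys"][0]][i] = amount

    anomalies = []
    for service, series in costs.items():
        window_avg = sum(series) / n_days
        recent_avg = sum(series[-recent_days:]) / recent_days
        if window_avg <= 0:
            continue
        increase = recent_avg - window_avg
        pct = increase / window_avg * 100
        if pct >= pct_threshold and increase >= min_increase_usd:
            anomalies.append((service, round(window_avg, 2), round(recent_avg, 2)))
    return sorted(anomalies)
```

With the query above, usage would look like `find_service_anomalies(result['ResultsByTime'])`, followed by whatever alerting call you already use.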
AWS Cost Anomaly Detection
AWS has a built-in anomaly detection service that uses ML. It’s actually decent and worth enabling:
aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "service-level-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'
# Create a subscription to get notified.
# Use the MonitorArn returned by the command above in MonitorArnList.
# SNS subscribers require IMMEDIATE frequency; DAILY and WEEKLY summaries are email-only.
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "cost-anomaly-alerts",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/abc123"],
    "Subscribers": [{
      "Type": "SNS",
      "Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
    }],
    "Threshold": 100,
    "Frequency": "IMMEDIATE"
  }'
The Threshold: 100 means only alert on anomalies with an estimated impact of $100 or more. Set this based on your scale — for a $5,000/month account, $100 is meaningful. For a $500,000/month account, set it to $1,000 or higher to avoid noise.
What Triggers False Positives
After running anomaly detection for 6 months, here are the most common false positive triggers:
- Month-start reserved instance charges — RI fees post on the 1st of the month, causing a predictable “spike”
- Savings Plan renewals — same issue, large charge on renewal date
- End-of-month data transfer — analytics and reporting jobs that run monthly
- AWS Marketplace charges — third-party software charges that bill irregularly
Build an exception list for these known patterns rather than raising the global threshold, which would mask real anomalies.
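One way to represent such an exception list is a small table of known patterns checked before an anomaly is routed. This is a minimal sketch: the class, the pattern entries, and the substring matching are all hypothetical and would need tuning to the service names that appear on your own bill.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class KnownPattern:
    """A recurring charge that should not page anyone."""
    service_substring: str                     # matched case-insensitively
    days_of_month: Optional[frozenset] = None  # None means "any day"
    reason: str = ""


# Hypothetical entries mirroring the false positives listed above.
KNOWN_PATTERNS = [
    KnownPattern("elastic compute cloud", frozenset({1}), "RI fees post on the 1st"),
    KnownPattern("savings plans", None, "Savings Plan renewal"),
    KnownPattern("marketplace", None, "Irregular third-party billing"),
]


def suppression_reason(service: str, anomaly_date: date,
                       patterns=KNOWN_PATTERNS) -> Optional[str]:
    """Return why an anomaly is expected, or None if it should alert."""
    name = service.lower()
    for pattern in patterns:
        day_matches = (pattern.days_of_month is None
                       or anomaly_date.day in pattern.days_of_month)
        if pattern.service_substring in name and day_matches:
            return pattern.reason
    return None
```

Because each exception is scoped to a service (and optionally a day of the month), an EC2 spike on the 14th still alerts even though the RI charge on the 1st does not.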
Budget Enforcement: Stopping Spend, Not Just Alerting
Alerts tell you about a problem. Enforcement prevents it.
AWS: Service Control Policies
For non-production accounts, you can use SCPs to hard-block expensive actions:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveEC2InDev",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringNotLike": {
          "ec2:InstanceType": [
            "t3.*",
            "t3a.*",
            "m5.large",
            "m5.xlarge"
          ]
        }
      }
    },
    {
      "Sid": "DenyGPUInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringLike": {
          "ec2:InstanceType": ["p3*", "p4*", "g4dn.*", "g5.*"]
        }
      }
    }
  ]
}
This would have prevented our $6,000 GPU incident entirely. The engineer would have gotten a permission denied error, asked the team lead, and either gotten an exception or found a cheaper alternative.
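Before attaching an allowlist like this, it helps to sanity-check which instance types the patterns actually cover. The sketch below approximates IAM's wildcard matching with Python's `fnmatchcase`; it is not IAM's evaluator and does not replace testing the policy in a sandbox account. The function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Allowlist patterns from the first SCP statement.
ALLOWED_PATTERNS = ["t3.*", "t3a.*", "m5.large", "m5.xlarge"]


def is_denied(instance_type: str, allowed=ALLOWED_PATTERNS) -> bool:
    """True if the instance type matches none of the allowlist patterns."""
    return not any(fnmatchcase(instance_type, pattern) for pattern in allowed)


if __name__ == "__main__":
    for candidate in ["t3.micro", "t3a.small", "m5.large", "m5.2xlarge",
                      "p3.2xlarge", "p4d.24xlarge"]:
        verdict = "denied" if is_denied(candidate) else "allowed"
        print(f"{candidate:14s} {verdict}")
```

A check like this also surfaces pattern mistakes early: a pattern written as `p4.*` would not match `p4d.24xlarge`, because the literal `p4.` prefix never appears in that name.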
Azure: Spending Limits and Policy
Azure supports spending limits on certain subscription types, and Azure Policy can restrict VM sizes:
{
"if": {
"allOf": [
{
"field": "type",
"equals": "Microsoft.Compute/virtualMachines"
},
{
"not": {
"field": "Microsoft.Compute/virtualMachines/sku.name",
"in": [
"Standard_B2s",
"Standard_D2s_v3",
"Standard_D4s_v3"
]
}
}
]
},
"then": {
"effect": "deny"
}
}
Avoiding Alert Fatigue
The biggest risk with cost alerting isn’t missing alerts — it’s getting too many, training your team to ignore them, and then missing the one that matters.
Mistakes we’ve seen:
- Setting 5 budget thresholds per service, 12 services — that’s 60 potential alerts per month. Nobody reads alert #47.
- Sending everything to #cloud-costs Slack channel — a shared channel with 200 members. Everyone assumes someone else is handling it.
- No severity levels — a $50 overage gets the same alert as a $5,000 anomaly.
- Alerting on percentage, not absolute dollars — a 200% increase on a $3/month service is $6. Not worth waking anyone up.
What actually works:
| Alert Type | Routing | Frequency |
|---|---|---|
| Weekly cost digest | Team leads via email | Weekly (Monday AM) |
| Budget 80% threshold | FinOps team Slack DM | When triggered |
| Budget 100% exceeded | Engineering manager + FinOps | When triggered |
| Anomaly > $500 impact | On-call engineer + FinOps | When detected |
| Anomaly > $2,000 impact | VP Engineering + FinOps | When detected |
The escalation path matters. Low-severity alerts go to the people who can investigate during business hours. High-severity alerts go to people who can approve immediate action.
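The routing table translates fairly directly into code. In this sketch the recipient labels are placeholders, and sending anomalies of $500 or less to the weekly digest is our assumption, since the table does not say where they go.

```python
def route_anomaly(impact_usd: float) -> list[str]:
    """Pick recipients for a cost anomaly based on estimated dollar impact."""
    if impact_usd > 2000:
        return ["vp-engineering", "finops"]
    if impact_usd > 500:
        return ["on-call-engineer", "finops"]
    # Assumption: smaller anomalies are not paged, only summarized weekly.
    return ["weekly-digest"]


def route_budget(percent_of_budget: float) -> list[str]:
    """Pick recipients for a budget threshold alert."""
    if percent_of_budget >= 100:
        return ["engineering-manager", "finops"]
    if percent_of_budget >= 80:
        return ["finops"]
    return []


if __name__ == "__main__":
    print(route_anomaly(6000))  # the GPU incident from the introduction
    print(route_anomaly(6))     # a 200% jump on a $3/month service
    print(route_budget(85))
```

Routing on absolute dollars rather than percentage is what keeps the $6 increase out of anyone's pager.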
The Weekly Cost Digest
The single most effective cost practice isn’t an alert — it’s a weekly email that shows each team their spend for the past 7 days, the trend vs. the previous 4 weeks, and the top 3 cost drivers.
We generate ours with a Lambda that runs every Monday at 8 AM:
- Pull 7-day costs from Cost Explorer, grouped by tag (team)
- Pull 30-day rolling average for comparison
- Highlight any service that increased 20%+ week-over-week
- Format as HTML email, send via SES
This isn’t alerting — it’s awareness. When engineers see their team’s costs every week, they make different decisions. The team that noticed their CloudWatch costs creeping up by $200/week went and fixed their log verbosity settings without anyone asking them to.
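The core of such a digest can be sketched without any AWS calls: take daily costs per team, compare the last 7 days with the 7 days before, and render a small HTML table. This is not the Lambda we run; the function names and input shape are hypothetical, and fetching from Cost Explorer and sending through SES are left out.

```python
from html import escape


def build_digest(daily_costs: dict[str, list[float]], flag_pct: float = 20.0) -> list[dict]:
    """Summarize the last 7 days per name and flag large week-over-week jumps.

    daily_costs maps a team or service name to daily costs, oldest first,
    with at least 14 entries so two full weeks can be compared.
    """
    rows = []
    for name, series in daily_costs.items():
        this_week = sum(series[-7:])
        last_week = sum(series[-14:-7])
        change_pct = (this_week - last_week) / last_week * 100 if last_week else None
        rows.append({
            "name": name,
            "this_week": round(this_week, 2),
            "change_pct": None if change_pct is None else round(change_pct, 1),
            "flagged": change_pct is not None and change_pct >= flag_pct,
        })
    # Biggest spenders first
    return sorted(rows, key=lambda row: row["this_week"], reverse=True)


def render_html(rows: list[dict]) -> str:
    """Render digest rows as a minimal HTML table."""
    body = []
    for row in rows:
        change = "n/a" if row["change_pct"] is None else f"{row['change_pct']:+.1f}%"
        marker = " (review)" if row["flagged"] else ""
        body.append(
            f"<tr><td>{escape(row['name'])}</td>"
            f"<td>${row['this_week']:,.2f}</td><td>{change}{marker}</td></tr>"
        )
    header = "<tr><th>Team</th><th>Last 7 days</th><th>vs. prior week</th></tr>"
    return "<table>" + header + "".join(body) + "</table>"
```

The resulting HTML string is what would be handed to the email step.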
Putting It Together
Here’s the full monitoring stack we recommend:
- Per-account monthly budgets with alerts at 50%, 80%, 100% actual + 100% forecasted
- Per-service budgets for your top 3 cost drivers (usually EC2, RDS, and S3 or data transfer)
- AWS Cost Anomaly Detection (or equivalent) with $100+ threshold
- Custom daily anomaly check comparing per-service spend to 30-day rolling average
- SCP enforcement in non-production accounts to block expensive resource types
- Weekly cost digest to all engineering team leads
- Quarterly budget review to adjust thresholds as your infrastructure grows
The first four can be set up in a day. The SCP work takes another day if you already have AWS Organizations. The weekly digest is a weekend project.
Total effort: about a week. And if it catches even one $6,000 misconfiguration — which it will — it pays for itself immediately.
If you’d rather not build all of this from scratch, Xplorr provides cross-cloud cost monitoring with built-in anomaly detection and team-level alerting out of the box. But the principles here apply regardless of tooling — the important thing is having alerts that are actionable, routed correctly, and not so noisy that your team learns to ignore them.