
We set up cost monitoring and alerting for a 40-person engineering team running workloads across AWS and GCP. Within the first week, the alerts caught a $6,000 misconfiguration: someone had launched 8 p3.2xlarge GPU instances for a one-off ML experiment and forgotten to terminate them. They'd been running for 11 days at $3.06/hour each before the anomaly alert fired (8 × $3.06 × 24 hours × 11 days ≈ $6,460).

That one catch paid for six months of the monitoring effort. Here’s exactly how we set it up.

Why Most Cost Alerting Fails

Before getting into the setup, it’s worth understanding why cost alerts have a bad reputation. Most teams that “already have alerting in place” actually have one of these:

  1. A single budget alert at 80% of a number someone picked a year ago. The threshold is either too high (never fires) or too low (fires every month and gets ignored).
  2. Alerts going to a shared email or Slack channel that nobody monitors. The alert fires, gets buried under 200 other notifications, and nobody acts on it.
  3. No anomaly detection — only absolute thresholds. If your bill is $50,000/month and growing 10% month-over-month, a $55,000 threshold fires every month. But a sudden jump from $50,000 to $58,000 in a single week? That’s the one you actually need to catch, and threshold alerts miss it.

Effective cost alerting needs four things: budget tracking, anomaly detection, enforcement mechanisms, and the right routing.

Setting Up Budget Alerts

AWS: Budgets + SNS

AWS Budgets is the built-in tool. It’s not fancy, but it works. Here’s a setup that covers the common cases.

Create an SNS topic for cost alerts:

aws sns create-topic --name cloud-cost-alerts \
  --region us-east-1

# Subscribe your team's Slack webhook or email
aws sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts \
  --protocol email \
  --notification-endpoint [email protected]
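
One detail the commands above leave out: the SNS topic's access policy has to allow the AWS cost services to publish to it, otherwise notifications are silently dropped. A minimal sketch, assuming the topic ARN used in this article; the function names are illustrative, and `costalerts.amazonaws.com` is included on the assumption that the same topic will also receive Cost Anomaly Detection notifications.

```python
import json


def cost_alert_topic_policy(
    topic_arn,
    services=("budgets.amazonaws.com", "costalerts.amazonaws.com"),
):
    """Build an SNS access policy that lets AWS cost services publish."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCostServicesPublish",
                "Effect": "Allow",
                "Principal": {"Service": list(services)},
                "Action": "SNS:Publish",
                "Resource": topic_arn,
            }
        ],
    }


def apply_policy(topic_arn):
    """Attach the policy. This overwrites the topic's current policy,
    so merge in any existing statements you still need first."""
    import boto3  # imported lazily so the builder works without AWS access

    sns = boto3.client("sns", region_name="us-east-1")
    sns.set_topic_attributes(
        TopicArn=topic_arn,
        AttributeName="Policy",
        AttributeValue=json.dumps(cost_alert_topic_policy(topic_arn)),
    )
```

Calling `apply_policy("arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts")` once after creating the topic should be enough; it is worth sending a test notification from the Budgets console to confirm delivery.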

Create a monthly budget with graduated alerts:

aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "monthly-total-spend",
    "BudgetLimit": {
      "Amount": "15000",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {},
    "CostTypes": {
      "IncludeTax": true,
      "IncludeSubscription": true,
      "UseBlended": false,
      "IncludeRefund": false,
      "IncludeCredit": false
    }
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 50,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [{
        "SubscriptionType": "SNS",
        "Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
      }]
    },
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [{
        "SubscriptionType": "SNS",
        "Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
      }]
    },
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [{
        "SubscriptionType": "SNS",
        "Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
      }]
    },
    {
      "Notification": {
        "NotificationType": "FORECASTED",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 100,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [{
        "SubscriptionType": "SNS",
        "Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
      }]
    }
  ]'

The four alert types here are deliberate:

| Alert | When It Fires | Purpose |
| --- | --- | --- |
| 50% actual | Mid-month | Sanity check: are we tracking normally? |
| 80% actual | ~Day 24 normally | Warning: likely to exceed if anything unusual happens |
| 100% actual | Budget exceeded | Action required: investigate what pushed over |
| 100% forecasted | When AWS projects overage | Early warning: fires before you actually exceed |

The forecasted alert is the most valuable. AWS uses your spending trend to predict the month-end total. If you’re on pace to exceed budget by day 15, you get two weeks to act instead of finding out after the fact.
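
To make the day-15 reasoning concrete, here is a naive run-rate projection. This is not AWS's forecasting model, only a rough sketch of the idea; the function name and the $9,000 figure are made up for illustration.

```python
import calendar
from datetime import date


def projected_month_end(spend_to_date, as_of):
    """Project month-end spend from the average daily run rate so far.

    Assumes `as_of.day` full days of spend are included in spend_to_date.
    """
    days_in_month = calendar.monthrange(as_of.year, as_of.month)[1]
    daily_rate = spend_to_date / as_of.day
    return daily_rate * days_in_month


# Hypothetical numbers: $9,000 spent by March 15 against the $15,000 budget.
forecast = projected_month_end(9000.0, date(2026, 3, 15))
over_budget = forecast > 15000.0
```

At $600/day, a 31-day month projects well past the $15,000 limit, which is the situation the forecasted alert is meant to surface while there is still time to act.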

Create per-service budgets for your top cost drivers:

# EC2-specific budget
aws budgets create-budget \
  --account-id 123456789012 \
  --budget '{
    "BudgetName": "ec2-monthly-budget",
    "BudgetLimit": {
      "Amount": "8000",
      "Unit": "USD"
    },
    "TimeUnit": "MONTHLY",
    "BudgetType": "COST",
    "CostFilters": {
      "Service": ["Amazon Elastic Compute Cloud - Compute"]
    }
  }' \
  --notifications-with-subscribers '[
    {
      "Notification": {
        "NotificationType": "ACTUAL",
        "ComparisonOperator": "GREATER_THAN",
        "Threshold": 80,
        "ThresholdType": "PERCENTAGE"
      },
      "Subscribers": [{
        "SubscriptionType": "SNS",
        "Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
      }]
    }
  ]'

Azure: Cost Management Alerts

Azure has built-in budget alerts through Cost Management. You can create them in the portal, but here’s the CLI approach:

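# Create a budget with alert conditions
# Check `az consumption budget create --help` first: the command is in preview,
# and some CLI versions do not accept --notifications. If yours does not, define
# the notifications in the portal, an ARM/Bicep template, or the Budgets REST API.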
az consumption budget create \
  --budget-name "monthly-total-spend" \
  --amount 12000 \
  --category Cost \
  --time-grain Monthly \
  --start-date 2026-03-01 \
  --end-date 2027-03-01 \
  --resource-group-filter "" \
  --notifications '{
    "Actual_GreaterThan_80_Percent": {
      "enabled": true,
      "operator": "GreaterThan",
      "threshold": 80,
      "contactEmails": ["[email protected]"],
      "thresholdType": "Actual"
    },
    "Forecasted_GreaterThan_100_Percent": {
      "enabled": true,
      "operator": "GreaterThan",
      "threshold": 100,
      "contactEmails": ["[email protected]"],
      "thresholdType": "Forecasted"
    }
  }'

Azure also supports action groups, which can trigger Logic Apps, Azure Functions, or webhooks. These are useful for automated responses like sending a Slack message or tagging resources for review.

GCP: Budget Alerts via Billing API

GCP budgets are configured per billing account. The console is the easiest path, but you can automate with Terraform:

resource "google_billing_budget" "monthly_budget" {
  billing_account = "012345-6789AB-CDEF01"
  display_name    = "Monthly Total Spend"

  budget_filter {
    projects = ["projects/my-project-id"]
  }

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "10000"
    }
  }

  threshold_rules {
    threshold_percent = 0.5
    spend_basis       = "CURRENT_SPEND"
  }
  threshold_rules {
    threshold_percent = 0.8
    spend_basis       = "CURRENT_SPEND"
  }
  threshold_rules {
    threshold_percent = 1.0
    spend_basis       = "CURRENT_SPEND"
  }
  threshold_rules {
    threshold_percent = 1.0
    spend_basis       = "FORECASTED_SPEND"
  }

  all_updates_rule {
    pubsub_topic = google_pubsub_topic.budget_alerts.id
  }
}

resource "google_pubsub_topic" "budget_alerts" {
  name = "budget-alerts"
}

GCP budget alerts publish to Pub/Sub, which you can wire to a Cloud Function that posts to Slack, sends email, or triggers automated remediation.
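
One thing to handle in that function: Pub/Sub receives budget status messages several times a day, not only when a threshold is crossed, so the handler has to inspect the payload before alerting. A minimal sketch, assuming a Pub/Sub-triggered function in the background-function style (`event["data"]` is base64-encoded JSON); the function names are illustrative, and the `print` stands in for whatever Slack or email call you use.

```python
import base64
import json


def format_budget_alert(pubsub_data_b64):
    """Return an alert string, or None when no threshold has been crossed."""
    payload = json.loads(base64.b64decode(pubsub_data_b64).decode("utf-8"))
    actual = payload.get("alertThresholdExceeded")
    forecast = payload.get("forecastThresholdExceeded")
    if actual is None and forecast is None:
        return None  # routine status update, nothing to report
    kind = "actual" if actual is not None else "forecasted"
    threshold = actual if actual is not None else forecast
    return (
        f"{payload['budgetDisplayName']}: {kind} spend crossed "
        f"{threshold:.0%} of budget "
        f"({payload['costAmount']:.2f} of {payload['budgetAmount']:.2f} "
        f"{payload['currencyCode']})"
    )


def handle_pubsub(event, context):
    """Entry point for the Pub/Sub-triggered function."""
    message = format_budget_alert(event["data"])
    if message:
        print(message)  # replace with a post to Slack or email
```

Once a threshold has been crossed, later messages in the same period keep reporting it, so a production version would also need to remember which thresholds it has already announced.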

Anomaly Detection: Catching What Budgets Miss

Budgets catch predictable overages. Anomalies catch surprises — a new service someone launched, a misconfigured auto-scaling policy, a data transfer cost that spiked because a logging pipeline started shipping to a different region.

Threshold-Based Detection

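The simplest approach: compare the most recent complete day's spend to the same weekday one week earlier. If it's more than X% higher, alert.

# Daily anomaly check: latest complete day vs. the same weekday a week earlier
import boto3
from datetime import datetime, timedelta, timezone

ce = boto3.client('ce')
sns = boto3.client('sns', region_name='us-east-1')
TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts'

today = datetime.now(timezone.utc).date()
yesterday = today - timedelta(days=1)         # most recent complete day
week_before = yesterday - timedelta(days=7)   # same weekday, one week earlier

def get_daily_cost(date):
    result = ce.get_cost_and_usage(
        TimePeriod={
            'Start': str(date),
            'End': str(date + timedelta(days=1))  # End date is exclusive
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost']
    )
    return float(
        result['ResultsByTime'][0]['Total']['UnblendedCost']['Amount']
    )

def send_alert(current, baseline, pct):
    # One possible implementation: publish to the SNS topic created earlier
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject='Daily cloud spend anomaly',
        Message=(f'Spend for {yesterday} was ${current:,.2f}, up {pct:.0f}% '
                 f'from ${baseline:,.2f} on {week_before}.')
    )

# Cost Explorer data can lag, so yesterday's figure may still be partial
today_cost = get_daily_cost(yesterday)
baseline_cost = get_daily_cost(week_before)

# Skip the comparison when there is no baseline spend (avoids dividing by zero)
if baseline_cost > 0:
    pct_change = ((today_cost - baseline_cost) / baseline_cost) * 100
    if pct_change > 25:
        # Fire alert: "Daily spend increased 25%+ vs same day last week"
        send_alert(today_cost, baseline_cost, pct_change)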

This catches the obvious stuff. But it has blind spots: if your costs are growing steadily at 5% per week, the weekly comparison never fires because each day is only slightly above the previous week.

Better: Rolling Average Comparison

Compare the last 3 days against the trailing 30-day average. This catches both sudden spikes and gradual drift:

# Get 30-day cost breakdown
result = ce.get_cost_and_usage(
    TimePeriod={
        'Start': str(today - timedelta(days=30)),
        'End': str(today)
    },
    Granularity='DAILY',
    Metrics=['UnblendedCost'],
    GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)

# Calculate per-service rolling averages
# Alert if any service's 3-day average exceeds 30-day average by 40%+

Grouping by service is important. A 10% increase in total spend could be normal growth. But a 200% increase in a single service — like CloudWatch suddenly costing 5x more because someone enabled detailed monitoring on 500 instances — is almost always a problem.
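
The per-service comparison that the comments above leave open could be sketched like this. It assumes `results_by_time` is the `ResultsByTime` list from the grouped `get_cost_and_usage` call, in chronological order; the function name and return shape are illustrative, and the baseline here is the average over the whole window, including the most recent days.

```python
from collections import defaultdict


def find_service_anomalies(results_by_time, recent_days=3, pct_threshold=40.0):
    """Flag services whose recent daily average exceeds the window average."""
    total_days = len(results_by_time)
    if total_days < recent_days:
        return []  # not enough history to compare

    # Build one cost list per service; days with no charge stay at 0.0
    daily = defaultdict(lambda: [0.0] * total_days)
    for i, day in enumerate(results_by_time):
        for group in day.get("Groups", []):
            service = group["Keys"][0]
            daily[service][i] = float(group["Metrics"]["UnblendedCost"]["Amount"])

    anomalies = []
    for service, costs in daily.items():
        baseline = sum(costs) / total_days
        recent = sum(costs[-recent_days:]) / recent_days
        if baseline <= 0:
            continue  # no spend to compare against
        pct_over = (recent - baseline) / baseline * 100
        if pct_over >= pct_threshold:
            anomalies.append(
                (service, round(recent, 2), round(baseline, 2), round(pct_over, 1))
            )
    # Largest relative increase first
    return sorted(anomalies, key=lambda a: a[3], reverse=True)
```

Each returned tuple is `(service, recent_avg, baseline_avg, pct_over)`. Large accounts may also need to follow `NextPageToken` when fetching the data, which this sketch does not do.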

AWS Cost Anomaly Detection

AWS has a built-in anomaly detection service that uses ML. It’s actually decent and worth enabling:

aws ce create-anomaly-monitor \
  --anomaly-monitor '{
    "MonitorName": "service-level-monitor",
    "MonitorType": "DIMENSIONAL",
    "MonitorDimension": "SERVICE"
  }'

# Create a subscription to get notified
aws ce create-anomaly-subscription \
  --anomaly-subscription '{
    "SubscriptionName": "cost-anomaly-alerts",
    "MonitorArnList": ["arn:aws:ce::123456789012:anomalymonitor/abc123"],
    "Subscribers": [{
      "Type": "SNS",
      "Address": "arn:aws:sns:us-east-1:123456789012:cloud-cost-alerts"
    }],
    "Threshold": 100,
    "Frequency": "IMMEDIATE"
  }'

The Threshold: 100 means only alert on anomalies with an estimated impact of $100 or more. Set this based on your scale: for a $5,000/month account, $100 is meaningful; for a $500,000/month account, set it to $1,000 or higher to avoid noise. Two caveats: SNS subscribers require the IMMEDIATE frequency (DAILY and WEEKLY are for email subscribers), and recent API versions deprecate Threshold in favor of ThresholdExpression, so check the current CLI reference before copying this verbatim.

What Triggers False Positives

After running anomaly detection for 6 months, here are the most common false positive triggers:

  • Month-start reserved instance charges — RI fees post on the 1st of the month, causing a predictable “spike”
  • Savings Plan renewals — same issue, large charge on renewal date
  • End-of-month data transfer — analytics and reporting jobs that run monthly
  • AWS Marketplace charges — third-party software charges that bill irregularly

Build an exception list for these known patterns rather than raising the global threshold, which would mask real anomalies.
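
An exception list can be as simple as a table of known patterns checked before an alert goes out. The sketch below is one way to structure it; the `KnownPattern` type, the service names, and the dollar caps are all hypothetical placeholders. The cap is there so that an unusually large charge still alerts even when it lands on an expected day.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class KnownPattern:
    service: str
    days_of_month: Optional[FrozenSet[int]]  # None matches any day
    max_impact_usd: float  # anything larger still alerts


# Hypothetical entries; replace with the patterns you actually observe.
KNOWN_PATTERNS = [
    KnownPattern("Amazon Elastic Compute Cloud - Compute", frozenset({1}), 1200.0),
    KnownPattern("Savings Plans for AWS Compute usage", frozenset({1}), 3000.0),
]


def is_known_pattern(service, when, impact_usd, patterns=KNOWN_PATTERNS):
    """True when the anomaly matches a documented, expected charge."""
    for p in patterns:
        if p.service != service:
            continue
        day_ok = p.days_of_month is None or when.day in p.days_of_month
        if day_ok and impact_usd <= p.max_impact_usd:
            return True
    return False
```

An anomaly handler would call `is_known_pattern(...)` first and only route the alert when it returns False.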

Budget Enforcement: Stopping Spend, Not Just Alerting

Alerts tell you about a problem. Enforcement prevents it.

AWS: Service Control Policies

For non-production accounts, you can use SCPs to hard-block expensive actions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyExpensiveEC2InDev",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "ForAnyValue:StringNotLike": {
          "ec2:InstanceType": [
            "t3.*",
            "t3a.*",
            "m5.large",
            "m5.xlarge"
          ]
        }
      }
    },
    {
      "Sid": "DenyGPUInstances",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "ForAnyValue:StringLike": {
          "ec2:InstanceType": ["p3*", "p4*", "g4dn.*", "g5.*"]
        }
      }
    }
  ]
}

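This would have prevented our $6,000 GPU incident entirely, provided the experiment ran in an account covered by the policy. The engineer would have gotten a permission denied error, asked the team lead, and either gotten an exception or found a cheaper alternative. Note that the allow-list statement already blocks GPU types on its own; the explicit GPU deny is a backstop that keeps working if someone later widens the allow-list.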
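
Before attaching an SCP like this, it helps to sanity-check which instance types the allow-list would actually let through. The sketch below approximates IAM's wildcard matching with Python's `fnmatchcase`, which is close enough for simple `*` patterns but is not the real policy evaluator; the function name is illustrative.

```python
from fnmatch import fnmatchcase

# Allow-list patterns copied from the DenyExpensiveEC2InDev statement
ALLOWED_PATTERNS = ["t3.*", "t3a.*", "m5.large", "m5.xlarge"]


def denied_by_allow_list(instance_type, allowed=ALLOWED_PATTERNS):
    """True when the type matches none of the allowed patterns."""
    return not any(fnmatchcase(instance_type, pattern) for pattern in allowed)


checks = {
    t: denied_by_allow_list(t)
    for t in ["t3.medium", "m5.xlarge", "m5.2xlarge", "p3.2xlarge"]
}
```

Here `t3.medium` and `m5.xlarge` pass, while `m5.2xlarge` and `p3.2xlarge` are denied. For an authoritative answer, test the policy in a sandbox account.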

Azure: Spending Limits and Policy

Azure supports spending limits on certain subscription types, and Azure Policy can restrict VM sizes:

{
  "if": {
    "allOf": [
      {
        "field": "type",
        "equals": "Microsoft.Compute/virtualMachines"
      },
      {
        "not": {
          "field": "Microsoft.Compute/virtualMachines/sku.name",
          "in": [
            "Standard_B2s",
            "Standard_D2s_v3",
            "Standard_D4s_v3"
          ]
        }
      }
    ]
  },
  "then": {
    "effect": "deny"
  }
}

Avoiding Alert Fatigue

The biggest risk with cost alerting isn’t missing alerts — it’s getting too many, training your team to ignore them, and then missing the one that matters.

Mistakes we’ve seen:

  1. Setting 5 budget thresholds per service, 12 services — that’s 60 potential alerts per month. Nobody reads alert #47.
  2. Sending everything to #cloud-costs Slack channel — a shared channel with 200 members. Everyone assumes someone else is handling it.
  3. No severity levels — a $50 overage gets the same alert as a $5,000 anomaly.
  4. Alerting on percentage, not absolute dollars — a 200% increase on a $3/month service is $6. Not worth waking anyone up.

What actually works:

| Alert Type | Routing | Frequency |
| --- | --- | --- |
| Weekly cost digest | Team leads via email | Weekly (Monday AM) |
| Budget 80% threshold | FinOps team Slack DM | When triggered |
| Budget 100% exceeded | Engineering manager + FinOps | When triggered |
| Anomaly > $500 impact | On-call engineer + FinOps | When detected |
| Anomaly > $2,000 impact | VP Engineering + FinOps | When detected |

The escalation path matters. Low-severity alerts go to the people who can investigate during business hours. High-severity alerts go to people who can approve immediate action.
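
The anomaly rows of that routing scheme reduce to a small lookup on dollar impact. A minimal sketch: the function name is illustrative, and leaving anomalies under $500 to the weekly digest is our assumption, since the routing scheme does not say where they go.

```python
def route_anomaly(impact_usd):
    """Return who should be notified for an anomaly of the given size."""
    if impact_usd > 2000:
        return ["VP Engineering", "FinOps"]
    if impact_usd > 500:
        return ["On-call engineer", "FinOps"]
    # Assumption: smaller anomalies are left for the weekly digest
    return []
```

Routing on absolute dollars rather than percentage keeps a 200% jump on a $3/month service from paging anyone.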

The Weekly Cost Digest

The single most effective cost practice isn’t an alert — it’s a weekly email that shows each team their spend for the past 7 days, the trend vs. the previous 4 weeks, and the top 3 cost drivers.

We generate ours with a Lambda that runs every Monday at 8 AM:

  1. Pull 7-day costs from Cost Explorer, grouped by tag (team)
  2. Pull 30-day rolling average for comparison
  3. Highlight any service that increased 20%+ week-over-week
  4. Format as HTML email, send via SES
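
The comparison logic behind steps 1 and 3 might look like the sketch below. It assumes the Cost Explorer results have already been reduced to `{service: cost}` dictionaries for one team; the function name, return shape, and sample numbers are illustrative, and the SES formatting step is left out.

```python
def build_digest(this_week, last_week, top_n=3, increase_pct=20.0):
    """Summarize a team's week: total, top cost drivers, and notable increases."""
    total = sum(this_week.values())
    top = sorted(this_week.items(), key=lambda kv: kv[1], reverse=True)[:top_n]

    flagged = []
    for service, cost in this_week.items():
        previous = last_week.get(service, 0.0)
        # Services with no spend last week are skipped here; a fuller
        # version might list them separately as "new this week".
        if previous > 0 and (cost - previous) / previous * 100 >= increase_pct:
            flagged.append(service)

    return {
        "total": round(total, 2),
        "top_drivers": [service for service, _ in top],
        "flagged": sorted(flagged),
    }


# Hypothetical weekly costs for one team
digest = build_digest(
    {"EC2": 1200.0, "RDS": 400.0, "CloudWatch": 300.0, "S3": 100.0},
    {"EC2": 1150.0, "RDS": 400.0, "CloudWatch": 100.0, "S3": 95.0},
)
```

In this example only CloudWatch is flagged: it tripled week-over-week, while EC2 and S3 grew by roughly 4-5%.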

This isn’t alerting — it’s awareness. When engineers see their team’s costs every week, they make different decisions. The team that noticed their CloudWatch costs creeping up by $200/week went and fixed their log verbosity settings without anyone asking them to.

Putting It Together

Here’s the full monitoring stack we recommend:

  1. Per-account monthly budgets with alerts at 50%, 80%, 100% actual + 100% forecasted
  2. Per-service budgets for your top 3 cost drivers (usually EC2, RDS, and S3 or data transfer)
  3. AWS Cost Anomaly Detection (or equivalent) with $100+ threshold
  4. Custom daily anomaly check comparing per-service spend to 30-day rolling average
  5. SCP enforcement in non-production accounts to block expensive resource types
  6. Weekly cost digest to all engineering team leads
  7. Quarterly budget review to adjust thresholds as your infrastructure grows

The first four can be set up in a day. The SCP work takes another day if you already have AWS Organizations. The weekly digest is a weekend project.

Total effort: about a week. And if it catches even one $6,000 misconfiguration — which it will — it pays for itself immediately.

If you’d rather not build all of this from scratch, Xplorr provides cross-cloud cost monitoring with built-in anomaly detection and team-level alerting out of the box. But the principles here apply regardless of tooling — the important thing is having alerts that are actionable, routed correctly, and not so noisy that your team learns to ignore them.

