
A cloud cost anomaly is any unexpected, statistically significant deviation in your cloud spending from its established baseline. In plain terms: when a service suddenly costs a lot more than it normally does, and you didn’t expect that to happen.

The word “anomaly” sounds technical. The concept is simple. You spend roughly $150/day on Lambda. One day you spend $1,600. That’s an anomaly.

How Rolling Average Detection Works

The most reliable method for detecting anomalies in cloud costs is comparing current spend against a rolling average of recent spend. Here is how it works:

  1. Calculate the baseline: Take daily spend for a given service over the past 7 days and compute the average. This becomes the expected spend for today.
  2. Measure the deviation: Compare today’s actual spend against that baseline. Express it as a percentage.
  3. Apply a threshold: If the deviation exceeds a defined threshold (commonly 50%), flag it as an anomaly.

Why 7 days? Long enough to smooth out normal day-of-week variation (most services spend less on weekends). Short enough to adapt to legitimate step changes in your usage, like a new feature launch that permanently increases compute spend.

Why 50%? It’s high enough to avoid false positives from normal variance, but low enough to catch real problems before they compound significantly. Some teams set this lower (30%) or higher (100%) based on their tolerance for noise vs. sensitivity.

The math for a given service on a given account:

rolling_avg = sum(daily_cost, last 7 days) / 7
deviation = (today_cost - rolling_avg) / rolling_avg * 100
if deviation > 50: alert
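The formula above translates directly into a few lines of code. Here is a minimal sketch (the function name `detect_anomaly` and the sample numbers are illustrative, not from any particular tool):

```python
def detect_anomaly(daily_costs, today_cost, threshold_pct=50.0):
    """Flag today's spend as anomalous if it deviates from the
    rolling average of the given history by more than threshold_pct."""
    rolling_avg = sum(daily_costs) / len(daily_costs)
    deviation = (today_cost - rolling_avg) / rolling_avg * 100
    return deviation > threshold_pct, deviation

# A week of Lambda spend hovering around $150/day, then a $1,600 day
history = [148.0, 152.0, 149.5, 151.0, 150.0, 147.5, 152.0]
is_anomaly, pct = detect_anomaly(history, 1600.0)
# The $1,600 day is roughly a 967% deviation from the $150 baseline
```

In practice you would run this once per service per account, pulling the daily figures from your billing export.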

AWS has native anomaly detection in Cost Explorer. It uses machine learning rather than a simple rolling average, which is more sophisticated but also more of a black box — you don’t always understand why it did or didn’t fire.

Real Examples of Anomalies

Anomalies tend to fall into a few categories. Here are real-world patterns (with names removed):

The Lambda Runaway

A fintech startup ran an event-processing Lambda that normally processed a few thousand events per day at trivial cost. A bug in a client SDK sent a malformed event that caused the Lambda to enter an infinite retry loop. In two hours, it invoked 4.7 million times, generating a $12,000 charge.

The anomaly would have been trivially detectable from cost data: Lambda spend going from $2/day to thousands of dollars in a matter of hours. Without alerting, it was only caught when the monthly bill arrived.

Detection time with anomaly alerting: Under 2 hours. Detection time without it: End of month.

The Accidental Data Transfer

A media company rebuilt their image processing pipeline and accidentally configured it to read assets from an S3 bucket in us-west-2 while the processing instances ran in us-east-1. Cross-region data transfer is $0.02/GB — not scary. Until you’re processing 500,000 images per day averaging 8MB each, which generates 4TB of cross-region transfer per day: $80/day, or $2,400/month in transfer fees alone.

This particular anomaly was a step change rather than a spike — it went from near-zero to a consistent elevated level — but the rolling average comparison caught it after the second day of elevated spend.
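Step changes from a near-zero baseline expose one practical wrinkle: dividing by a tiny rolling average produces absurdly large percentages, so detectors commonly floor the baseline at a small dollar amount. A sketch of that guard (the `min_baseline` parameter is an assumption, not a standard setting):

```python
def pct_deviation(history, today, min_baseline=1.0):
    # Floor the baseline so a near-zero history doesn't blow the
    # percentage up toward infinity
    baseline = max(sum(history) / len(history), min_baseline)
    return (today - baseline) / baseline * 100

# Cross-region transfer jumping from ~$0.10/day to $80/day (a step change)
quiet_week = [0.10, 0.12, 0.08, 0.11, 0.09, 0.10, 0.12]
day1 = pct_deviation(quiet_week, 80.0)
# On day two the first elevated day enters the window and raises the
# baseline, but the deviation still clears a 50% threshold easily
day2 = pct_deviation(quiet_week[1:] + [80.0], 80.0)
```

Because the elevated spend keeps feeding the window, a sustained step change stays above threshold for several days, giving the alert more than one chance to fire.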

DDoS Cost Impact

A SaaS company with auto-scaling EC2 behind an ALB got hit by a volumetric DDoS attack on a Friday afternoon. Their autoscaling group scaled from 4 instances to 80 instances in 15 minutes. The attack lasted 4 hours before they could respond.

The cost impact was approximately $800 in extra EC2 hours and $200 in ALB request fees. More significantly, it revealed that their autoscaling maximum of 200 instances would have cost them $8,000+ if the attack had continued through the weekend. They lowered the maximum and added WAF rate limiting the following Monday.

Why You Should Care

Cost anomalies have two kinds of impact:

Direct financial impact. The most obvious. A Lambda runaway, a data transfer bug, or a forgotten autoscaling maximum costs real money — and the longer it runs undetected, the more it costs.

Indirect operational signal. Cost anomalies often indicate operational problems before other signals do. A Lambda that’s running 1,000x its normal invocation count isn’t just expensive — it’s probably in an error loop that’s also impacting users. A data transfer spike often indicates misconfigured routing. Treating cost as an observability signal gives you an early warning system that complements your application metrics.

Setting Up Detection

The minimum viable setup for anomaly detection:

  1. Enable AWS Cost Anomaly Detection (Cost Explorer → Anomaly Detection → Create monitor). Free to enable, alerts via email or SNS.
  2. Set up a Slack notification. Route anomaly alerts to a channel someone actually reads. Email is too easy to ignore.
  3. Set thresholds thoughtfully. A $5 anomaly on a $10/month service is a 50% spike — but it’s $5. Most teams set a minimum dollar threshold (e.g., alert only if absolute impact > $50) alongside the percentage threshold.
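Combining the percentage and dollar thresholds from step 3 is straightforward. A minimal sketch, assuming a 7-day history and the example values from the list above (the function name and defaults are illustrative):

```python
def should_alert(history, today, pct_threshold=50.0, min_dollars=50.0):
    # Require both a large relative deviation and a meaningful
    # absolute dollar impact before alerting.
    baseline = sum(history) / len(history)
    pct_dev = (today - baseline) / baseline * 100
    return pct_dev > pct_threshold and (today - baseline) > min_dollars

# A $0.30/day service spiking ~67%: suppressed (only $0.20 of impact)
noisy_small = should_alert([0.30] * 7, 0.50)
# A $150/day service spiking to $1,600: alerts on both dimensions
real_spike = should_alert([150.0] * 7, 1600.0)
```

The dollar floor is what keeps low-spend services from dominating your alert channel while still catching spikes that actually matter.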

For multi-cloud environments, you need equivalent setup on Azure and GCP — or a unified tool that monitors all three from a single place and provides consistent alerting.

The goal isn’t to alert on everything. It’s to ensure that when something genuinely unusual happens to your cloud spend, you find out within hours — not weeks.


Xplorr monitors for cost anomalies across AWS, Azure, and GCP with unified alerting via email and Slack. Request beta access — free for early teams.
