When the Cloud Crashes, So Does Your Wallet: How One AWS Outage Could Double Your Observability Bill

Observability · Oct 22, 2025

When AWS sneezes, the internet catches a cold, and your observability budget catches fire.

Every time the cloud hiccups, thousands of AWS-powered services, SDKs, and libraries light up your logs like a Christmas tree. The pattern is always the same: a minor network blip, a failed API call, a retry loop gone wild, and suddenly you’re ingesting millions of identical error logs that say absolutely nothing new.

The Usual Suspects

  • [LaunchDarkly] Will retry stream connection in 5000 milliseconds
  • sqs-consumer was requested to be stopped
  • Received I/O error (Connection reset by peer) for streaming request - will retry
  • KinesisConsumer error: GetRecords operation failed — retrying in 2 seconds

Each message is a clone. Each one costs money. And the kicker? You’re paying Datadog, New Relic, Elastic, or Splunk by the volume and/or message count of the logs you send, not by their value.

During an AWS outage, a retry storm like this can multiply your daily ingest by 5–10x. You’re literally paying to watch your systems fail… repeatedly. Yikes.

The Hidden Cost of AWS Outages

In the recent AWS outage reported by Reuters, customers across several regions experienced widespread connectivity and timeout errors because underlying AWS networking and regional dependencies degraded. The ripple effect was brutal: every microservice built on those APIs began retrying in unison, flooding logs with endless “failed to fetch,” “connection reset,” and “will retry” messages.

During the outage (20 October 2025), our system detected surges caused by retry loops and cascading failures. In one observed case, a single app’s log rate jumped from 500 lines per second to over 25,000, a 50x spike. When AWS hiccups, telemetry systems don’t just blink, they erupt. And when they do, your Datadog bill grows as fast as your error logs.

The Fix: Log Aggregation Policies for AWS Observability Cost Control

A log aggregation policy groups together repeated or near-identical messages before they ever reach your observability backend.

Instead of sending 10,000 identical entries like:
Received I/O error (Connection reset by peer) for streaming request - will retry

You send one enriched event like:
Received I/O error (Connection reset by peer) for streaming request - will retry [occurred 10,000 times in 60s]

Same insight. 99.99% less cost.
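To make the idea concrete, here is a minimal sketch of that kind of policy in Python. The class and method names (LogAggregator, ingest, flush) and the 60-second window are purely illustrative assumptions, not any vendor’s API: identical lines are buffered for the window, then each distinct message is emitted once with an occurrence count.

```python
from collections import defaultdict

# Minimal sketch, not any vendor's API: buffer identical log lines for a
# fixed window and emit one enriched event per distinct message.
class LogAggregator:
    def __init__(self, window_seconds: int = 60):
        self.window_seconds = window_seconds
        self.counts: dict[str, int] = defaultdict(int)

    def ingest(self, message: str) -> None:
        # Count the line instead of forwarding it downstream immediately.
        self.counts[message] += 1

    def flush(self) -> list[str]:
        # Called once per window (e.g. by a timer in a real pipeline):
        # collapse each distinct message into a single enriched event.
        enriched = [
            f"{msg} [occurred {count:,} times in {self.window_seconds}s]"
            if count > 1 else msg
            for msg, count in self.counts.items()
        ]
        self.counts.clear()
        return enriched


agg = LogAggregator(window_seconds=60)
for _ in range(10_000):
    agg.ingest("Received I/O error (Connection reset by peer) for streaming request - will retry")
print(agg.flush()[0])
# Received I/O error (Connection reset by peer) for streaming request - will retry [occurred 10,000 times in 60s]
```

In a real pipeline the flush would run on a timer and forward the enriched events to your backend; the point is that the backend only ever sees one line per distinct message per window.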

Aggregation policies can be as simple as fingerprinting messages by regex or as smart as detecting retry loops in real time using telemetry processors in OpenTelemetry Collector, Vector, or intelligent log aggregators like Sawmills.
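Fingerprinting by regex can be as small as masking the variable parts of a message so that retries differing only by a delay or an ID land in the same bucket. The snippet below is an illustrative Python sketch of that idea, not an OpenTelemetry Collector or Vector configuration; the fingerprint function and the "<N>" placeholder are assumptions for the example.

```python
import re
from collections import Counter

# Illustrative regex fingerprinting: digit runs (retry delays, offsets, IDs)
# are masked so near-identical retry messages share one fingerprint.
def fingerprint(message: str) -> str:
    return re.sub(r"\d+", "<N>", message)

lines = [
    "[LaunchDarkly] Will retry stream connection in 5000 milliseconds",
    "[LaunchDarkly] Will retry stream connection in 10000 milliseconds",
    "KinesisConsumer error: GetRecords operation failed — retrying in 2 seconds",
    "KinesisConsumer error: GetRecords operation failed — retrying in 4 seconds",
]

for pattern, count in Counter(fingerprint(line) for line in lines).items():
    print(f"{pattern} [occurred {count} times]")
# [LaunchDarkly] Will retry stream connection in <N> milliseconds [occurred 2 times]
# KinesisConsumer error: GetRecords operation failed — retrying in <N> seconds [occurred 2 times]
```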

Why You Need This Before the Next AWS Outage

You can’t turn this on mid-chaos. Once an outage begins, your agents and forwarders are already choking. Your ingest quotas are full, your dashboards are lagging, and your CFO is about to notice the Datadog overage alert.

Building aggregation rules before disaster strikes is like adding a surge protector to your AWS observability stack. It keeps your systems informative, not noisy — and your bills predictable, not catastrophic.

In Short

Cloud outages are inevitable. Paying 10x more to watch them happen is not.

Aggregate your logs. Collapse the noise. Save your signal and your observability budget.

When the next AWS outage hits, you’ll thank yourself for not letting “Connection reset by peer” drain your Datadog, Elastic, or AWS CloudWatch Logs bill line by line.