
Why your Datadog bill keeps growing, and why switching vendors won’t fix it

Observability
Apr
15
2026

Most articles about Datadog costs start with the wrong diagnosis. They assume the problem is Datadog. Too expensive. Too aggressive on pricing. Too many add-ons. Too many billing dimensions. And from there, the recommendation is predictable: switch tools. But for most engineering teams, that is not the real problem. The real problem is that observability data grows faster than anyone manages it.

That growth is not accidental. It is built into how modern systems are developed. Every new service emits logs. Every deployment adds new dimensions. Every feature creates more requests, more traces, more span attributes, more metric tags, more environments, more retries, more background jobs, more dashboards, and more “just in case” instrumentation. Teams do not have a stable telemetry footprint. They have a compounding one.

That is why Datadog bills feel impossible to control. Not because the product is uniquely broken, but because most teams are feeding it an ever-expanding stream of telemetry without any real mechanism to govern what should exist, what should be retained, and what is actually worth paying to index, store, and query.

And that is also why “more storage” or “a cheaper vendor” does not solve the problem. Storage gets cheaper. Vendors compete. But telemetry volume grows to consume the savings and then keeps going. The system expands until it hits a new breaking point.

The issue is not where you send the data. The issue is that nobody is managing the data before it gets there.

Observability cost is a data growth problem, not a tooling problem

Modern software generates telemetry as a byproduct of everything it does.

A single customer request might hit an API gateway, an auth service, a core application service, a cache, a queue, a worker, a database, and a third-party API. That one request can produce logs from each layer, metrics from each service, and dozens of spans in a distributed trace. Multiply that by millions of requests per day and the volume becomes enormous before anyone has decided whether all of that data is useful.

And the key word here is before. Observability systems are extremely good at collecting data. They are much worse at deciding whether the data deserved to exist in the first place. So the default mode in most organizations is simple:

  • developers emit more telemetry because it helps them debug
  • platform and DevOps teams inherit the bill
  • finance sees costs climb
  • everyone agrees something should be done
  • almost nothing changes at the source

The result is telemetry sprawl: too many logs, too many time series, too many traces, too many dimensions, too much retention, too much indexing, and almost no durable process for reducing any of it.

Why log volume is an always-growing problem

Logs are the easiest place to see the problem because they are the easiest thing to create. When a team is under pressure, they do not open a debate about long-term telemetry economics. They add a log line.

  • Need to debug a checkout flow? Add a few logs.
  • Need visibility into a flaky integration? Add more logs.
  • Launching a new feature? Emit logs around every state transition.
  • Trying to understand an incident? Turn on debug logging.
  • Unsure whether a piece of code will ever fail in production? Log everything.

This behavior is rational. Logs are cheap to add locally and useful in the moment. The person writing them pays almost none of the long-term cost. That means log volume tends to grow monotonically: it rarely gets cleaned up with the same energy with which it was created. And the growth is not just linear with traffic. It compounds with system complexity.

This often leads people to assume it is just a storage problem. It is not. The cost of logs is not only about storing bytes on disk. It is about ingesting them, parsing them, enriching them, shipping them over the network, indexing them for search, replicating them for durability, and querying them fast enough to be useful during an incident. Hot, searchable observability data is fundamentally more expensive than cold storage. That is why “storage keeps getting cheaper” does not rescue you. Even if raw storage were free, the expensive parts would remain:

  • ingest pipelines
  • indexes
  • query engines
  • retention tiers
  • transport
  • operational overhead
  • human review and governance

A cheaper vendor does not change that dynamic. It may lower the unit price for a while, but telemetry systems are elastic: whatever headroom you create gets consumed by more data. If your team has no discipline around telemetry generation, a lower price per GB simply delays the next billing shock. This is why observability cost keeps returning as a board-level or finance-level issue. The underlying data production function is still expanding.

Why manual data management almost never happens

Everyone says telemetry should be managed. Almost nobody actually manages it.

Not because teams are lazy, but because ownership is broken. Developers care about shipping, reliability, and debugging velocity. If adding more logs or tags makes debugging easier, they will do it. That is often the right local decision.

DevOps, SRE, and platform teams care more about cost, consistency, and operational hygiene. They are the ones who notice runaway volumes and cardinality explosions. But they often do not own the application code, cannot safely edit business-logic instrumentation, and do not have the organizational authority to continuously police every team’s telemetry habits.

So telemetry governance falls into a dead zone:

  • the people creating the data do not feel the cost directly
  • the people feeling the cost do not control the source directly

Then add the human reality: telemetry review is tedious. No engineer wants to spend their week answering questions like:

  • Which log lines were queried in the last 90 days?
  • Which fields are redundant?
  • Which tags are exploding cardinality?
  • Which services are sampling traces poorly?
  • Which debug logs were left enabled after the incident?
  • Which metrics should be aggregated upstream instead of emitted raw?

This work is repetitive, cross-cutting, and never finished. It does not map cleanly to a sprint. It does not have an obvious single owner. And because it is ongoing, not one-time, it usually loses to feature work. So the organization does what organizations always do with tedious, important, multi-owner work: it postpones it until the bill becomes painful.

Custom metrics: where small instrumentation mistakes turn into huge bills

Custom metrics are usually where teams discover that observability pricing is not just about volume. It is about cardinality: the number of unique combinations of tag values attached to a metric. That sounds abstract until it hits the bill. Suppose you emit a metric called request.duration and tag it with:

  • service=checkout
  • env=prod
  • region=us-east-1

This is manageable. But now imagine someone adds:

  • endpoint
  • http_method
  • status_code
  • pod_name
  • customer_tier
  • user_id

Now you do not have one metric. You have a combinatorial explosion of time series. Even if only one tag is bad, it can wreck the economics. user_id is the classic example: if 100,000 users hit a service, then request.duration{user_id=*} can create 100,000 distinct time series from that dimension alone. Add endpoint, region, and status_code, and the number can jump into the millions.

The worst part is that engineers often do this unintentionally. They are not trying to generate millions of billable series. They are trying to make the metric “more useful.” They want to be able to break latency down by user, or session, or cart, or request. But those identifiers belong in logs or traces when truly needed, not in metric tags, which are meant to aggregate behavior across populations. Other common cardinality traps include:

  • request_id
  • session_id
  • cart_id
  • trace_id
  • pod_uid
  • container_id
  • build_sha
  • raw URLs with embedded IDs
  • dynamic feature flag values
  • free-form error messages as labels

Ephemeral infrastructure makes it worse. In Kubernetes environments, tags like pod name or container ID churn constantly. Even if the total traffic is stable, the number of unique tag values keeps rotating, which keeps creating new series. This is why cardinality is such a nasty problem: it does not look dangerous in code review. A single extra label feels harmless. But in a live system, it multiplies across traffic, services, deploys, and time. And once the data is flowing, people are reluctant to remove the tags because they fear losing visibility. So the expensive mistake becomes permanent.
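
The multiplication described above is easy to demonstrate. Here is a minimal sketch (the tag names and counts are illustrative, not from any real system) that estimates the worst-case number of time series a metric can produce as the product of its tags' distinct value counts:

```python
from math import prod

def worst_case_series(tag_cardinalities: dict[str, int]) -> int:
    """Upper bound on billable time series: the product of the
    number of distinct values each tag can take."""
    return prod(tag_cardinalities.values())

# A small, well-scoped tag set stays manageable.
safe = {"service": 20, "env": 3, "region": 4}
print(worst_case_series(safe))            # 240 series

# One unbounded identifier multiplies everything by its cardinality.
risky = dict(safe, user_id=100_000)
print(worst_case_series(risky))           # 24,000,000 series
```

The bound is a worst case because not every combination occurs in practice, but it shows why a single label change that looks harmless in code review can move the bill by orders of magnitude.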

APM has the same problem, just in a different shape

Application Performance Monitoring feels safer because traces are richer and more obviously valuable. But APM costs can spiral for the same reason: distributed systems create far more trace data than humans actually use. A trace is not one event. It is a tree. One request enters your edge service and then fans out:

  • auth check
  • feature flag lookup
  • database query
  • cache read
  • downstream service call
  • message queue publish
  • worker execution
  • third-party API call

That single request can produce dozens of spans. In a complex microservices environment, it can produce hundreds. At 100% capture, high-traffic systems generate enormous trace volume, most of it representing totally normal requests that nobody will ever inspect. That means you are paying premium prices to preserve perfect forensic detail for overwhelmingly uninteresting traffic.

Worse, many teams rely on simple head-based sampling, where the decision to keep or drop a trace is made at the start of the request. That is easy to configure, but it is blunt. If you sample aggressively at the head, you can miss the slow or anomalous traces you actually care about. If you sample conservatively, you keep too much, and the bill explodes. Tail-based sampling is better because it lets you keep the traces that end in errors, high latency, or rare conditions. But it is operationally harder to set up and maintain, especially across many services and teams.

So what happens in practice? Teams either:

  • leave capture too high because they are afraid of losing visibility, or
  • set one sampling rule and never revisit it as architecture changes

Neither scales. APM volume also grows in subtle ways:

  • richer span attributes increase payload size
  • more services create deeper traces
  • retries and async workflows duplicate span activity
  • new frameworks auto-instrument more operations
  • engineers keep custom attributes “just in case”

And just like with logs and metrics, the organizational incentive is to emit first and govern later.
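
The tail-based approach mentioned above makes the keep/drop decision only after a trace completes, so it can key on outcomes rather than guessing up front. A minimal sketch of that decision, where the field names, thresholds, and baseline rate are all assumptions for illustration:

```python
import random

def keep_trace(trace: dict, baseline_rate: float = 0.01,
               latency_slo_ms: float = 500.0) -> bool:
    """Decide, after the trace is complete, whether to keep it.

    Errors and latency outliers are always kept at full fidelity;
    routine, healthy traffic is kept at a small random baseline.
    """
    if any(span.get("error") for span in trace["spans"]):
        return True                          # errors: always keep
    if trace["duration_ms"] > latency_slo_ms:
        return True                          # tail latency: always keep
    return random.random() < baseline_rate   # routine traffic: ~1%

failed = {"duration_ms": 42.0, "spans": [{"error": True}]}
slow   = {"duration_ms": 2200.0, "spans": [{"error": False}]}
print(keep_trace(failed), keep_trace(slow))  # True True
```

The operational difficulty the article points to is not this logic; it is that making the decision post-completion requires buffering spans from every participating service until the trace finishes, across many teams and deploys.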

Why switching vendors usually recreates the same problem somewhere else

This is why replacing Datadog is often a disappointing strategy. Yes, you may negotiate a better rate elsewhere. Yes, another platform may price logs, metrics, or traces differently. But if the underlying telemetry stream is still uncontrolled, you are mostly moving the same problem into a different contract. The data does not become smaller because the logo changed. In fact, migrations often make governance worse in the short term.

Teams are relearning tooling, rebuilding dashboards, reworking alerts, and translating instrumentation. During that period, almost nobody is focused on telemetry hygiene. They are focused on continuity. So the easiest path is to keep sending everything.

Then, six or twelve months later, the new bill starts climbing too. That is why “cheaper vendor” is usually an arbitrage opportunity, not a solution.

What actually solves the problem: manage telemetry before it becomes billable

The durable fix is upstream control. Instead of treating observability platforms as the place where you clean things up, treat them as destinations that should only receive data that has already been shaped, reduced, and prioritized. That means managing telemetry in the pipeline before it hits Datadog’s billing meter. At a practical level, that includes:

  • filtering low-signal logs before ingestion
  • removing or aggregating high-cardinality metric dimensions
  • enforcing cardinality limits centrally
  • sampling traces based on value, not fear
  • preserving complete fidelity for errors, latency outliers, and rare events
  • routing data differently by use case, retention need, or cost sensitivity

This is the difference between “observability tooling” and “telemetry management.” Datadog is excellent at storing, querying, visualizing, and alerting on telemetry. A telemetry management layer decides what telemetry deserves to arrive there in the first place. That distinction matters because once bad telemetry is already inside your vendor, you are already paying for most of the mistake.

Why static rules are still not enough

The traditional answer is to write exclusion filters, audit dashboards, tune sampling, and maintain collector configs. That helps, but it does not fundamentally solve the labor problem, because telemetry is not static. New services appear. Traffic patterns change. Developers add new tags. Frameworks introduce new instrumentation. An incident causes debug mode to get enabled. A new team ships a high-volume job. A well-meaning engineer adds user_id to a metric.

If your only control plane is a pile of hand-maintained rules, the organization is still depending on humans to continuously notice and react to every new source of waste. That is exactly the kind of work that does not happen reliably.

Why an agentic telemetry layer is the real answer

This is where an agentic telemetry management system becomes much more interesting than a basic pipeline. A basic pipeline can execute rules. An agentic system can continuously identify waste, understand why it is waste, and adapt the pipeline as the system changes. That matters because observability cost is not a one-time cleanup project. It is a permanent control problem. The system needs to notice things like:

  • this log source is generating huge volume but is never queried
  • this metric tag is creating extreme cardinality with almost no debugging value
  • this service is tracing routine traffic at full fidelity when only errors and tail latency matter
  • this new deployment introduced a dynamic label that multiplied time series count overnight

That is the operational gap Mills is designed to fill. Instead of asking teams to run endless audits and manually maintain suppression rules, Mills sits upstream of Datadog and continuously optimizes the telemetry stream before it becomes expensive. Your dashboards, alerts, and workflows stay in Datadog. The difference is that Datadog receives less noise and more signal.

That is how cost reduction becomes sustainable instead of episodic.

The real choice is not Datadog or not Datadog

The real choice is this: do you want to keep paying observability bills that grow automatically with every increase in system complexity? Or do you want a control layer that manages telemetry growth before it reaches the point where pricing, retention, and indexing become a crisis? Because that crisis is not going away on its own. Log volume will keep growing.

Metric cardinality will keep exploding when left unchecked. APM will keep capturing more than humans can ever inspect. Storage will get cheaper and still not be cheap enough. Vendors will discount and still not solve the underlying data problem. Without real telemetry management, data volume will continue increasing until it hits the next breaking point. That is why the winning strategy is not to rip out your observability stack. It is to govern the data flowing into it.

And that is exactly where Mills fits: upstream of Datadog, inside the telemetry path, continuously reducing waste, preserving the signals that matter, and cutting observability cost without forcing your teams to migrate the tools they already know.


Expert observability migration tips

  1. Don't expect a vendor switch to solve a data problem. Negotiating a better rate elsewhere is arbitrage, not a fix. If your telemetry stream is uncontrolled, the same logs, metrics, and traces follow you into the new contract.
  2. Audit your metric tags before they audit your budget. Cardinality is where small instrumentation decisions turn into large billing surprises. A single tag like user_id or session_id can create millions of distinct time series from what looks like one metric in code.
  3. Manage the pipeline, not just the platform. Observability tools are built to store, query, and alert on telemetry. They are not built to decide what deserves to arrive in the first place.
  4. Fix the ownership gap before it fixes your budget for you. Telemetry governance fails because the people creating the data do not feel the cost, and the people feeling the cost do not own the source. That dead zone is where runaway volumes live.
  5. Replace static suppression rules with agentic telemetry management. Hand-maintained filters help, but they decay the moment something changes. The more durable approach is a system that continuously analyzes the stream, identifies waste in context, and adapts as your architecture evolves.
Erez Rusovsky
Chief Product Officer & Co-founder, Sawmills

Previously CEO at Rollout (acquired by CloudBees). Seasoned DevOps and telemetry pipeline expert.