What actually drives your Datadog cost, and the upstream control for each meter

Observability
Jul
2
2026

Datadog cost is driven by a small number of usage meters that grow independently of each other: log ingest and indexing, custom metric cardinality, APM events (spans), and billable hosts, all multiplied by retention tier choices. The bill rarely grows because Datadog raised a price. It grows because your services emit more telemetry every quarter and nothing upstream decides what deserves to arrive. If you want to control Datadog cost, you control the inputs to those meters, in the pipeline, before they bill.

That is the part most "Datadog pricing explained" posts miss. They walk through the public pricing page line by line, which is useful the first time and useless the second, because the page tells you the unit price, not why your unit count keeps climbing. The platform team renewing the contract already knows the rate card. What they need is a causal map from the telemetry their developers ship to the specific meter it lands on, and a place to intervene that is not "ask every team to log less."

This article walks each meter, shows where the cost actually originates, and points at the control surface for each one. The framing throughout: cost is the outcome of telemetry decisions, and those decisions are made long before Datadog's meter runs.

Datadog cost is a data-growth problem

Telemetry volume is not stable. It compounds. Every new service ships with default instrumentation. Every deploy adds dimensions. Every incident turns on debug logging that someone forgets to turn off. Splunk's State of Observability 2025, which surveyed 1,855 ITOps and engineering professionals, frames telemetry as a growing input to the business rather than a fixed cost, and that is exactly the problem for the budget owner: the input grows on its own.

So when the Datadog bill is up 40% year over year, the honest diagnosis is usually not "Datadog got more expensive." It is "we sent Datadog 40% more billable telemetry and nobody governed it." That is why switching vendors tends to be arbitrage rather than a fix. We made this argument at length in why your Datadog bill keeps growing: the data follows you into the new contract, and six months later the new bill climbs too.

The useful move is to stop treating the bill as one number and start treating it as four meters with four different growth functions. Each one has a distinct origin in your code and a distinct upstream control.

Log management: the two-meter system

Datadog logs bill on two separate meters, and conflating them is the most common reason a log bill surprises people. The log management billing docs lay it out: you pay to ingest every gigabyte of logs, regardless of whether you ever index them, and you pay again to index the log events you want to keep searchable, priced by event count with a multiplier for the retention tier (3, 7, 15, or 30 days).

This means two failure modes stack. A service can flood ingest with verbose logs that you never index, and you pay for the ingest anyway. Or you can index a reasonable volume but at 30-day retention "to be safe," and the retention multiplier compounds every event. The classic incident pattern is debug logging enabled during an outage and left on for weeks, running ingest at several times normal volume against logs nobody queries.
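To make the two failure modes concrete, here is a minimal sketch of the two-meter arithmetic. The rates and volumes are hypothetical placeholders chosen for easy math, not Datadog's price list, and `LogRates` and `monthly_log_cost` are names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple


@dataclass(frozen=True)
class LogRates:
    """Hypothetical rates for illustration only; substitute your contract's numbers."""
    ingest_per_gb: float
    # Retention tier in days -> price per million indexed events.
    index_per_million_events: Mapping[int, float]


def monthly_log_cost(ingested_gb: float, indexed_events: int,
                     retention_days: int, rates: LogRates) -> Tuple[float, float]:
    """Return (ingest_cost, index_cost); the two meters are computed independently."""
    ingest_cost = ingested_gb * rates.ingest_per_gb
    index_cost = (indexed_events / 1_000_000) * rates.index_per_million_events[retention_days]
    return ingest_cost, index_cost


RATES = LogRates(ingest_per_gb=0.25, index_per_million_events={7: 1.0, 30: 2.0})

# Failure mode 1: logs that are ingested but never indexed still hit the ingest meter.
never_indexed = monthly_log_cost(1_000, 0, 7, RATES)

# Failure mode 2: the same indexed volume costs more at a longer retention tier.
seven_day = monthly_log_cost(1_000, 200_000_000, 7, RATES)
thirty_day = monthly_log_cost(1_000, 200_000_000, 30, RATES)
```

With these made-up rates, the never-indexed case still costs 250 on ingest, and moving the same 200 million indexed events from the 7-day tier to the 30-day tier doubles the indexing line from 200 to 400 while ingest stays flat.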

The control surface is the pipeline. Before logs reach Datadog's ingest meter, a collector can drop health-check and readiness noise, sample high-volume info logs while keeping all errors, redact and shrink oversized payloads, and route audit logs to a cheaper destination than your hot searchable tier. The decision of which collector to run for this is its own question; we compared the two leading options in Vector vs OTel Collector for log collection. The point is that a log dropped at the node never touches either log meter.
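The filtering logic described above can be sketched in a few lines. This is an illustration of the decision, not the configuration of any particular collector: the record fields (`level`, `path`, `id`), the probe paths, and the 10% default are all assumptions made for the example.

```python
import hashlib

# Hypothetical probe paths and severity names; adjust to your own services.
HEALTH_PATHS = frozenset({"/healthz", "/readyz"})
ALWAYS_KEEP_LEVELS = frozenset({"error", "critical"})


def _bucket(key: str) -> int:
    """Map a key to a stable bucket in [0, 100) so sampling is deterministic."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def should_forward(record: dict, keep_percent: int = 10) -> bool:
    """Decide whether a log record continues on to the ingest meter."""
    level = str(record.get("level", "info")).lower()
    if level in ALWAYS_KEEP_LEVELS:
        return True  # errors are never sampled away
    if record.get("path") in HEALTH_PATHS:
        return False  # health-check and readiness noise is dropped
    # Everything else is sampled by a hash of the record id.
    return _bucket(str(record.get("id", ""))) < keep_percent
```

Hashing the record id rather than calling a random number generator means the same record always gets the same decision, which makes the rule easier to reason about and to test.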

Custom metrics: where cardinality becomes the biggest line item

Custom metrics are where small instrumentation decisions turn into the largest billing surprises, because Datadog does not bill per metric name. It bills per unique combination of metric name and tag values. Datadog's custom metrics billing docs are explicit: each distinct tag combination is a separate billable custom metric, counted as an hourly average over the month.

The math is multiplicative, not additive. Take a single metric:

http.request.duration{service, env, region}

That is a handful of series. Now a well-meaning engineer adds tags to make it "more useful":

http.request.duration{service, env, region, endpoint, status_code, pod_name, user_id}

user_id alone, across 50,000 users, turns each existing series of that metric into as many as 50,000 billable custom metrics. Add endpoint, status_code, and pod_name, and the combinations multiply again; Kubernetes pod churn keeps minting new series as pods rotate, even when traffic is flat. The series count jumps into the millions, and because nobody wants to lose visibility, the tags rarely get removed. The expensive mistake becomes permanent.
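A rough way to see the multiplication is to compute the ceiling on series count from the number of distinct values per tag. The cardinalities below are hypothetical, and the product is an upper bound: the billable count is the number of combinations that actually get reported, which the second helper counts from sample points.

```python
from math import prod


def worst_case_series(tag_cardinalities: dict) -> int:
    """Upper bound on series: the product of distinct values per tag."""
    return prod(tag_cardinalities.values())


def observed_series(points) -> int:
    """Count distinct (metric name, tag values) combinations actually seen."""
    return len({(name, tuple(sorted(tags.items()))) for name, tags in points})


# Hypothetical cardinalities for illustration.
base = {"service": 2, "env": 2, "region": 3}
expanded = {**base, "endpoint": 10, "status_code": 5, "user_id": 50_000}

base_ceiling = worst_case_series(base)          # 2 * 2 * 3
expanded_ceiling = worst_case_series(expanded)  # base * 10 * 5 * 50_000
```

Under these assumed numbers the ceiling goes from 12 series to 30,000,000. Real traffic will not fill every combination, but the shape of the growth is the point: each added tag multiplies rather than adds.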

Datadog offers Metrics without Limits, which lets you ingest everything and then allowlist which tags stay queryable, so the billable cardinality is the queryable cardinality rather than the raw cardinality. That helps. It also still means the decision about which tags matter is being made inside Datadog, after ingest, by whoever remembers to configure it. The upstream alternative is to enforce an approved label list at the pipeline, so the high-cardinality tag is stripped before Datadog ever counts it. We covered the specific patterns in best practices for high-cardinality metrics in Datadog.
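As a sketch of what enforcing an approved label list can look like in a pipeline stage, the snippet below strips any tag that is not on an allowlist and merges the points that collapse into the same series. The allowlist contents and function names are hypothetical, and summing is only correct for counter-style metrics; gauges and distributions need a different merge rule.

```python
# Hypothetical policy: the only tags allowed to reach the metrics backend.
ALLOWED_TAGS = frozenset({"service", "env", "region", "endpoint", "status_code"})


def strip_tags(tags: dict, allowed=ALLOWED_TAGS) -> dict:
    """Drop every tag that is not on the approved list."""
    return {key: value for key, value in tags.items() if key in allowed}


def aggregate_counters(points, allowed=ALLOWED_TAGS) -> dict:
    """Sum counter values whose series become identical after stripping.

    points: iterable of (metric_name, tags_dict, value).
    Returns {(metric_name, sorted_tag_items): summed_value}.
    """
    totals = {}
    for name, tags, value in points:
        key = (name, tuple(sorted(strip_tags(tags, allowed).items())))
        totals[key] = totals.get(key, 0) + value
    return totals


points = [
    ("http.requests", {"service": "api", "env": "prod", "user_id": "u1"}, 3),
    ("http.requests", {"service": "api", "env": "prod", "user_id": "u2"}, 4),
    ("http.requests", {"service": "api", "env": "staging", "user_id": "u1"}, 1),
]
collapsed = aggregate_counters(points)
```

Here three incoming series differ only by user_id and env, so after stripping they collapse to two: one per environment, with the prod values summed to 7.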

APM events: paying full fidelity for traffic nobody inspects

APM bills on two things: per APM host, and for ingested span volume above your plan's included allotment, with an additional meter for spans you choose to retain in indexed, searchable form. The cost driver is span volume, and span volume does not track request count one-to-one: each request produces a tree of spans.

One request fans out across an auth check, a feature-flag lookup, several database queries, a cache read, two downstream service calls, and a queue publish. That single request can produce dozens of spans, hundreds in a deep microservice graph. At 100% capture on a high-traffic service, you are paying premium rates to preserve perfect forensic detail for overwhelmingly normal requests that no engineer will ever open.
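The fan-out is easy to underestimate, so here is a small sketch that counts spans for one request by walking a call graph. The graph is a made-up example loosely following the request described above, with one span assumed per call; it must be acyclic for the recursion to terminate.

```python
def spans_per_request(call_graph: dict, operation: str) -> int:
    """Count spans for one request: one for this call plus all of its children."""
    children = call_graph.get(operation, [])
    return 1 + sum(spans_per_request(call_graph, child) for child in children)


# Hypothetical call graph; repeated entries mean the call happens more than once.
CALL_GRAPH = {
    "checkout": [
        "auth", "feature_flag",
        "db_query", "db_query", "db_query",
        "cache_read",
        "inventory_svc", "pricing_svc",
        "queue_publish",
    ],
    "inventory_svc": ["db_query", "db_query"],
    "pricing_svc": ["db_query", "cache_read"],
}

spans = spans_per_request(CALL_GRAPH, "checkout")
daily_spans = spans * 1_000_000  # assumed one million requests per day
```

Even this shallow graph yields 14 spans per request, so a million requests a day becomes 14 million spans before any deeper service adds its own subtree.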

Most teams reach for head-based sampling, where the keep-or-drop decision is made at the start of the trace. It is easy to configure and blunt: sample aggressively and you lose the slow, error-bearing traces you actually needed; sample conservatively and the span meter overflows. Tail-based sampling is the better model because it decides after the trace completes, so you can keep every trace that ended in an error or high latency and drop the boring successes. It is operationally harder to run across many services, which is exactly why teams set one sampling rule and never revisit it as the architecture changes.

The control surface is a tail-sampling stage in the pipeline that keeps errors and latency outliers at full fidelity and samples the rest by value, before the spans hit the APM event meter.
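The keep-or-drop rule for a completed trace can be sketched as follows. This is a simplified illustration of the decision only: the `Span` type, the 500 ms threshold, and the 5% baseline are assumptions, and the longest span is used as a stand-in for the trace's end-to-end latency. A production tail sampler also has to buffer spans until the trace completes, which is the operationally hard part and is not shown.

```python
import hashlib
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def keep_trace(spans: Sequence[Span], latency_threshold_ms: float = 500.0,
               baseline_keep_percent: int = 5) -> bool:
    """Decide, after the trace has completed, whether to keep all of its spans."""
    if not spans:
        return False
    if any(span.is_error for span in spans):
        return True  # every error-bearing trace is kept
    if max(span.duration_ms for span in spans) >= latency_threshold_ms:
        return True  # latency outliers are kept
    # Normal, successful traces are sampled by a stable hash of the trace id.
    digest = hashlib.sha256(spans[0].trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100 < baseline_keep_percent
```

Because the error and latency checks run before the hash, lowering the baseline percentage only ever thins out the boring successes.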

Infrastructure hosts and the autoscaling tax

Infrastructure Monitoring bills per host per hour. In a static fleet this is predictable. In an autoscaling Kubernetes environment it is not, because Datadog's hourly billing counts transient hosts. When a workload scales out to absorb a traffic burst, every short-lived node generates a billable hourly record, and unless you exclude them deliberately, you pay for capacity that existed for twenty minutes.
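To see why short-lived nodes add up, here is a deliberately simplified model of the hourly counting described above: a host is counted once for every clock hour in which it existed. The exact counting rules depend on your plan, so treat this as an illustration of the effect rather than a billing calculator; the host names and lifetimes are hypothetical.

```python
def billable_host_hours(lifetimes) -> int:
    """Count distinct (host, clock hour) pairs.

    lifetimes: iterable of (host, start_minute, end_minute), end exclusive,
    with minutes measured from the start of the billing window.
    """
    buckets = set()
    for host, start, end in lifetimes:
        if end <= start:
            continue  # ignore empty or inverted intervals
        first_hour = start // 60
        last_hour = (end - 1) // 60
        for hour in range(first_hour, last_hour + 1):
            buckets.add((host, hour))
    return len(buckets)


fleet = [
    ("node-a", 0, 180),   # steady node, alive for three full hours
    ("burst-1", 10, 30),  # twenty minutes inside a single hour
    ("burst-2", 50, 70),  # twenty minutes straddling an hour boundary
]
total = billable_host_hours(fleet)
```

In this model the steady node accounts for three host-hours, and the two twenty-minute burst nodes account for three more between them, because the one that straddles an hour boundary is counted in both hours.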

This meter is less about telemetry shape and more about hygiene: tagging and excluding ephemeral and non-production hosts, making sure containers above the per-host threshold are accounted for, and not double-counting nodes that report through multiple integrations. It is the smallest of the four meters for most telemetry-heavy teams, but it is the one finance understands fastest, so it is often where the cost conversation starts even when the real money is in metrics and logs.

Retention tiers: the multiplier on every mistake above

Retention is not a separate meter so much as a multiplier on the others. Indexed logs at 30-day retention cost more per event than at 7-day. Indexed spans retained for searchability cost more than spans that are sampled and dropped. The trap is that retention gets set once, defensively, on the high end, and then the volume it applies to grows underneath it.

The discipline here is to match retention to the value class of the data, not to set one tier for everything. Audit logs and payment-state events may justify long, searchable retention. Routine info logs and successful-request spans do not. If your pipeline already classifies telemetry by value to make sampling decisions, that same classification should drive retention routing, so the expensive tier only ever holds the data that earns it.
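A minimal sketch of retention routing driven by value class might look like this. The classes, the record fields (`category`, `level`), and the tier assignments are hypothetical policy choices made for the example, not a recommendation for any specific workload.

```python
from enum import Enum


class ValueClass(Enum):
    AUDIT = "audit"
    ERROR = "error"
    ROUTINE = "routine"


# Hypothetical policy: retention tier in days for each value class.
RETENTION_DAYS = {
    ValueClass.AUDIT: 30,
    ValueClass.ERROR: 15,
    ValueClass.ROUTINE: 3,
}


def classify(record: dict) -> ValueClass:
    """Assign a value class from fields the pipeline already has."""
    if record.get("category") in ("audit", "payment_state"):
        return ValueClass.AUDIT
    if str(record.get("level", "")).lower() in ("error", "critical"):
        return ValueClass.ERROR
    return ValueClass.ROUTINE


def retention_days(record: dict) -> int:
    """Pick the retention tier that matches the record's value class."""
    return RETENTION_DAYS[classify(record)]
```

The useful property is that the policy lives in one table: tightening retention for routine data is a one-line change, and new services inherit it without anyone editing per-service settings.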

Where to actually control Datadog cost

Notice that every control surface above sits in the same place: upstream of Datadog's meter, inside the telemetry path. That is not a coincidence. Datadog is excellent at storing, querying, visualizing, and alerting on telemetry. It is not designed to decide what telemetry deserves to arrive in the first place. By the time data has crossed the ingest meter, you have already paid for most of the mistake.

There is a FinOps frame for this worth naming, because the budget owner is usually being asked to defend the bill to finance. The FinOps Foundation framework describes three phases: Inform, Optimize, and Operate. Most observability teams are stuck between Inform and Optimize. They can see the bill grow and they can run a one-time cleanup, but they have no continuous Operate phase, no standing governance that holds the line as new services and new tags appear every week. Static exclusion filters and hand-tuned sampling configs are an Optimize-phase tactic. They decay the moment something changes, because telemetry is not static and the people maintaining the rules are not watching every deploy.

That is the gap. Controlling Datadog cost is not a quarterly project. It is a permanent control problem, and the controls have to live where the data flows.

How Sawmills approaches this

Every control surface in this article lives in one place: upstream of Datadog, inside the telemetry path, applied continuously. That is the work Sawmills does. Sawmills is the agentic telemetry operator that sits in front of Datadog and governs the four meters before they bill. It strips disallowed high-cardinality labels off metrics so the custom-metric count stays bounded, samples logs and spans by value while keeping every error and latency outlier, drops health-check and readiness noise before the ingest meter sees it, and routes data to retention tiers that match its value. Your dashboards, monitors, and workflows stay in Datadog. Datadog simply receives less noise and more signal.

The reason this is an operator and not a one-time cleanup is that telemetry never stops changing. A new service ships, a developer adds a tag, an incident enables debug mode, a deploy multiplies series overnight. Sawmills runs continuously inside the guardrails the platform team defines, so cost control becomes a standing Operate-phase capability rather than a quarterly fire drill. The platform team sets policy. Developers self-serve fixes inside it. The agent enforces it in real time. Customers like BigID have used this approach to cut log volume by 81% without losing the telemetry they operate on.

If your Datadog bill is growing faster than your traffic and you want to see exactly which meters are driving it and how much comes off when you govern the inputs, schedule a demo to watch Sawmills run against a telemetry stream like yours.

FAQs for controlling your Datadog cost

How much does Datadog cost?

There is no single number, because Datadog bills across roughly twenty product lines, each with its own meter. For most teams the bill is dominated by Infrastructure Monitoring (per host), Log Management (per GB ingested plus per indexed event), Custom Metrics (per unique tag combination), and APM (per host plus per million spans). Check the current pricing page for rates, but your total is a function of your telemetry volume, not just the rate card.

Why is my Datadog bill so high?

Usually one of four causes: a high-cardinality tag added to a custom metric, debug logging left on after an incident, APM sampling set too generously on a service that scaled up, or autoscaling hosts billed hourly. The cardinality case is the most common and the most expensive.

What are Datadog custom metrics and why do they cost so much?

A custom metric is a unique combination of a metric name and its tag values, and each combination is billed separately. Adding one high-cardinality tag like user_id or pod_name can turn a single metric into tens of thousands of billable series.

Does switching away from Datadog reduce cost?

Switching vendors moves the same telemetry into a different contract. If the underlying data stream is ungoverned, the new bill grows the same way. The durable fix is reducing what you send, not where you send it.

What is the difference between Datadog log ingest and indexing cost?

Ingest is charged per gigabyte for every log that arrives, whether or not you index it. Indexing is charged per event for the logs you keep searchable, with a multiplier for the retention tier. You can pay for ingest on logs you never index.

Erez Rusovsky, Chief Product Officer & Co-founder, Sawmills
Previously CEO at Rollout (acquired by CloudBees). Seasoned DevOps and telemetry pipeline expert.