High-cardinality metrics are the stealth tax of Datadog. They look harmless when you add “just one more tag,” but at scale they quietly multiply into millions of time series and a Datadog invoice that makes your finance team question your life choices.
This guide covers:
- How Datadog actually bills custom metrics
- Why high-cardinality metrics blow up your Datadog bill
- Best practices for high-cardinality metrics in Datadog
- What you, as a DevOps / SRE leader, can do to detect, fix, and prevent this mess
The goal: keep observability powerful and rich without lighting your budget on fire.
1. What “high cardinality” actually means in Datadog
In Datadog, a custom metric isn’t just the metric name.
A single custom metric is defined by the unique combination of:
- Metric name
- Tag key/value pairs
- Host
Each distinct combination is a separate time series, and in Datadog’s world, a billable custom metric.
Example:
- checkout.requests{env:prod,region:eu-west-1,status:200}
That’s one series.
Add a user_id tag and suddenly:
- checkout.requests{env:prod,region:eu-west-1,status:200,user_id:123}
- checkout.requests{env:prod,region:eu-west-1,status:200,user_id:124}
- …
Each one is a different custom metric.
High cardinality simply means “a tag (or combination of tags) with a huge number of possible values,” for example:
- user_id, customer_id, tenant_id
- request_id, trace_id, span_id
- pod_name, container_id
- url, path including IDs or random strings
On a toy app, this doesn’t matter. On a real platform with dozens of services and Kubernetes everywhere, it absolutely does.
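To make the fan-out concrete, here is what those two cases look like from application code, sketched with datadogpy's DogStatsD client (the metric name and tags are the same illustrative ones as above):

```python
from datadog import statsd  # datadogpy's DogStatsD client

# Bounded tags: env, region and status each have a handful of values,
# so this stays a small, predictable set of time series.
statsd.increment(
    "checkout.requests",
    tags=["env:prod", "region:eu-west-1", "status:200"],
)

# Add user_id and every distinct user becomes a brand-new time series:
# with 100k active users, this one call site fans out into 100k billable
# custom metrics, multiplied by every other tag combination.
statsd.increment(
    "checkout.requests",
    tags=["env:prod", "region:eu-west-1", "status:200", "user_id:123"],
)
```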
2. How Datadog bills custom metrics
To control cost, you need to understand the mechanics.
2.1 How custom metrics are counted
Conceptually:
- Datadog looks at all your distinct time series per hour
- A “custom metric” is one of those unique metric+tags+host combinations
- It computes your bill based on the average number of distinct custom metrics per hour over the month
So if your cardinality doubles during peak hours, your “average distinct custom metrics per hour” goes up, and so does your bill.
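A toy calculation makes the averaging concrete (plain Python, made-up numbers):

```python
# Made-up numbers: cardinality doubles for 8 peak hours each day.
baseline, peak = 200_000, 400_000
hourly_distinct_series = [baseline] * 16 + [peak] * 8

avg_per_hour = sum(hourly_distinct_series) / len(hourly_distinct_series)
print(f"{avg_per_hour:,.0f}")  # ~266,667 -- the figure billing is based on,
                               # well above the 200k off-peak baseline
```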
2.2 Allocations and overages
Very roughly (exact numbers depend on plan and contract):
- Each host includes a certain pool of custom metrics
- That pool is shared at account level
- Once you exceed the pool, you’re charged per custom metric per month
Which is why a “small” change that adds hundreds of thousands of new tag combinations can create a very non-small surprise on the invoice.
2.3 Metric type matters too
Some metric types multiply this further:
- Histograms and distributions each create several underlying series per metric name (count, sum, min, max, average, and optionally percentiles)
- Those series are multiplied by each tag combination
- Percentiles especially can be expensive if you allow them with very high-cardinality tag sets
If you’re careless, a single distribution metric with four or five high-card tags and multiple percentiles can fan out into tens or hundreds of thousands of billable custom metrics.
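Here is a rough back-of-the-envelope sketch of that fan-out; the exact number of series per distribution depends on which aggregations and percentiles you enable, so treat the constants as illustrative:

```python
# Tag combinations actually observed for one distribution metric
services, endpoints, regions, statuses = 50, 40, 4, 5
tag_combinations = services * endpoints * regions * statuses   # 40,000

# Aggregations kept per combination (count/sum/min/max/avg) plus any
# percentiles you enable (e.g. p50, p75, p90, p95, p99)
base_aggregations = 5
percentiles = 5
series_per_combination = base_aggregations + percentiles

print(tag_combinations * series_per_combination)  # 400,000 billable series
```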
2.4 Log-based metrics
Metrics created from logs (logs-to-metrics) are also counted as custom metrics.
Common footgun: creating log-based metrics grouped by attributes like user_id or request_id. Every new value becomes a new time series. The bill follows.
3. Why high cardinality wrecks your Datadog bill
The core problem: cardinality is multiplicative.
Rough sketch:
- Start with requests_total (no tags) → 1 series
- Add env with 3 values → 3 series
- Add region with 4 values → 12 series
- Add service with 50 microservices → 600 series
- Add user_id with 100k active users → 60,000,000 series
Each series is:
- Stored
- Indexed (depending on your configuration)
- Included in your hourly distinct custom metrics count
So that “quick change” where someone added user_id or session_id to a core metric is not a cosmetic tweak; it’s a cost explosion.
The important insight:
High cardinality isn’t bad by definition.
Unintentional, unbounded cardinality on core metrics is bad.
4. First job of a DevOps leader: visibility and attribution
You can’t optimize what you can’t see. So step one is turning Datadog back on itself.
4.1 Use Datadog’s own usage data
Datadog exposes usage metrics and UI pages for:
- Total custom metrics over time
- Custom metrics by metric name
- Custom metrics broken down by tag
Use these to build:
- A dashboard of custom metric volume over time
- A breakdown of “top N metrics by custom metric count”
- A breakdown by service, team, env, region, etc.
You want to be able to answer:
- Which metrics are our biggest cost drivers?
- Which services/teams own those metrics?
- Which tags are contributing most to cardinality?
Without attribution, every optimization discussion turns into a blame fog.
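As a starting point, you can pull this from the estimated-usage metrics with a small script. A hedged sketch using datadogpy; the usage metric name and its metric_name grouping tag should be verified against your account and plan before you rely on the numbers:

```python
import os
import time

from datadog import api, initialize

initialize(
    api_key=os.environ["DD_API_KEY"],
    app_key=os.environ["DD_APP_KEY"],
)

now = int(time.time())
one_day_ago = now - 24 * 3600

# Estimated custom-metric usage over the last day, broken down per metric.
# The usage metric and its metric_name tag are assumptions -- confirm the
# exact names available on your plan and Datadog site.
resp = api.Metric.query(
    start=one_day_ago,
    end=now,
    query="avg:datadog.estimated_usage.metrics.custom.by_metric{*} by {metric_name}",
)

def latest(series):
    points = [v for _, v in series.get("pointlist", []) if v is not None]
    return points[-1] if points else 0.0

# Print the top 20 metric names by estimated custom-metric count.
for series in sorted(resp.get("series", []), key=latest, reverse=True)[:20]:
    print(series.get("scope"), int(latest(series)))
```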
4.2 Tag-level usage analysis
Next, look at whether tags are actually used:
- Which tags are used in dashboard queries, monitors, SLOs?
- Which tags are never used to filter or group?
- Which tags are very high-cardinality and low-usage?
You can do this via APIs, internal scripts, or governance tools:
- Enumerate your metrics and their tags
- Compare to the tags that appear in queries
- Flag “high-cardinality, low value” tags per metric
This gives you a list of suspects to investigate with each owning team.
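A sketch of the "flag the suspects" step, assuming you have already collected per-tag cardinality and the set of tags that actually show up in queries (all data below is illustrative):

```python
# Per-metric tag cardinality you've collected elsewhere (illustrative data).
tag_cardinality = {
    ("checkout.requests", "user_id"): 120_000,
    ("checkout.requests", "status"): 8,
    ("checkout.requests", "region"): 4,
    ("payments.latency", "pod_name"): 3_500,
}

# (metric, tag) pairs that actually appear in dashboards, monitors, or SLOs.
tags_used_in_queries = {
    ("checkout.requests", "status"),
    ("checkout.requests", "region"),
}

CARDINALITY_THRESHOLD = 1_000

suspects = [
    (metric, tag, count)
    for (metric, tag), count in tag_cardinality.items()
    if count > CARDINALITY_THRESHOLD and (metric, tag) not in tags_used_in_queries
]

for metric, tag, count in sorted(suspects, key=lambda s: -s[2]):
    print(f"{metric} / {tag}: ~{count:,} values, never used in a query")
```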
5. Detecting telemetry waste: common patterns
Once you have visibility, you’ll start seeing the same offenders again and again.
5.1 High-cardinality tag anti-patterns
Red flags:
- User-level tags:
  - user_id, customer_id, tenant_id, account_id, email
- Per-request/session tags:
  - session_id, request_id, trace_id, span_id
- Free-form identifiers:
  - url or path with IDs and query strings baked in
- Low-value infra tags:
  - pod_name, container_id, auto-generated hostnames
Those are often perfectly fine in logs and traces. They’re almost always dangerous on core metrics.
5.2 Hidden cardinality from integrations
Another classic source of surprise:
- Cloud integrations (AWS, GCP, Azure) automatically adding a swarm of instance tags
- Kubernetes integrations tagging everything with pod, node, replica set, etc.
- Third-party integrations that propagate their own tags into your metrics
Ask yourself:
- Do we actually use these tags in queries?
- Do we need them on every metric, or only on infrastructure metrics?
- Are we accidentally copying cloud tags into application-level metrics?
Often, simply trimming integration tags can remove a huge chunk of your custom metric volume.
5.3 Unused metrics and tags
A lot of cost comes from metrics and dimensions that nobody uses anymore:
- Old experiments and temporary dashboards
- Historical migrations (“we’re still emitting both the legacy metric and the new one… three years later”)
- Copy-paste metrics where someone took an internal metric from one team and just changed the name
You want scripts or processes that regularly ask:
- Which metrics haven’t been queried in X days?
- Which tags on those metrics have never appeared in a query?
- Can we turn these off, or at least slim them down?
This is “garbage collection for observability.”
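A minimal sketch of that garbage-collection pass, assuming you have already assembled "last queried" timestamps per metric from whatever audit data or tooling you have (the data here is made up):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=90)
now = datetime.now(timezone.utc)

# Metric name -> when it last appeared in any dashboard, monitor, or notebook
# query (None = never). Illustrative data.
last_queried = {
    "checkout.requests": now - timedelta(days=3),
    "legacy.checkout.requests": now - timedelta(days=400),
    "experiment.cart_ab_test": None,
}

for metric, seen in sorted(last_queried.items()):
    if seen is None or now - seen > STALE_AFTER:
        print(f"candidate for removal or slimming: {metric}")
```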
6. Fixing it: reducing cardinality and cost without losing observability
Once you know where the waste is, you can attack it in layers.
6.1 Clean up integrations and auto-tags
Start with the lowest-risk, highest-impact changes:
- Audit cloud integrations:
  - Disable metric sets you don’t use
  - Limit which tags get propagated from infrastructure to application metrics
- Audit Kubernetes tags:
  - Prefer stable tags like namespace, deployment, service over pod_name or container_id
- Stop copying every environment metadata field into metric tags out of habit
In many environments, you can shave 20–40% of custom metric volume with this alone.
6.2 Use Datadog’s Metrics without Limits (MwL)
Metrics without Limits lets you:
- Ingest metrics with all their tags
- Choose which tags are indexed for querying and grouping
- Drop other tags from indexing so they don’t contribute to custom metric counts
A practical pattern:
- For each high-volume metric, look at which tags are actually used in queries over the last 30 days.
- Keep the core ones (e.g., env, service, region, status, maybe endpoint).
- Drop ephemeral and user-level tags from indexing (user_id, session_id, pod_name, etc.).
You still get rich telemetry at ingest, but you’re only paying Datadog to index the dimensions you actually care about in practice.
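For illustration, the same pattern can be applied programmatically by creating a tag configuration for the metric. This is a hedged sketch against Datadog's v2 metric tag configuration endpoint; double-check the path and payload against the current API reference before relying on it:

```python
import os

import requests

DD_API = "https://api.datadoghq.com"  # adjust for your Datadog site
METRIC = "checkout.requests"

# Keep only the tag keys that queries actually use; other tags are still
# ingested but no longer indexed (and so no longer counted).
payload = {
    "data": {
        "type": "manage_tags",
        "id": METRIC,
        "attributes": {
            "metric_type": "count",
            "tag_keys": ["env", "service", "region", "status", "endpoint"],
        },
    }
}

resp = requests.post(
    f"{DD_API}/api/v2/metrics/{METRIC}/tags",
    json=payload,
    headers={
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    },
    timeout=30,
)
resp.raise_for_status()
```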
6.3 Redesign metrics around “intentional cardinality”
This is the heart of best practices for high-cardinality metrics in Datadog: you design metrics, you don’t just sprinkle tags.
Guidelines:
- Prefer bounded dimensions:
  - env, region, az, service, team, version, endpoint
- Avoid per-user and per-request cardinality on core metrics
- Use coarser segments when you do need “per-something”:
  - Per-tenant instead of per-user
  - Per-plan (“enterprise vs SMB”) instead of per-customer ID
- Be explicit:
  - Each metric should have a stated purpose (“used in SLO X, dashboard Y”) and an owner
If a tag isn’t required for debugging, alerting, or SLOs, it probably doesn’t belong on the metric.
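Here is a small sketch of intentional cardinality in application code, again with the DogStatsD client; plan_tier_for is a hypothetical helper standing in for however you map a customer to a bounded segment:

```python
from datadog import statsd

def plan_tier_for(customer_id: str) -> str:
    # Hypothetical lookup: collapse 100k customer IDs into a handful of
    # bounded segments ("enterprise", "smb", "free", ...).
    return "enterprise" if customer_id.startswith("ent-") else "smb"

def record_checkout(customer_id: str, region: str, status_code: int) -> None:
    statsd.increment(
        "checkout.requests",
        tags=[
            "env:prod",
            "service:checkout",
            f"region:{region}",
            f"status:{status_code}",
            f"plan:{plan_tier_for(customer_id)}",  # bounded, unlike customer_id
        ],
    )
```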
6.4 Use logs and traces for ultra-high-cardinality detail
Some things are legitimately high-cardinality, like user identities or individual requests. That’s fine — just don’t model them as metrics.
Pattern:
- Keep very detailed, high-cardinality data (user IDs, trace IDs, raw URLs, stack traces) in logs and traces
- Derive low-cardinality metrics from those signals:
  - Errors per service/region/status
  - Latency percentiles per endpoint/region
- Avoid grouping log-based metrics by unbounded attributes like user_id or request_id
Metrics give you trends and aggregates. Logs and traces give you forensic detail. Use each for what it’s good at.
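A sketch of that split in code: the unbounded identifiers ride along on the log line (shipped however your log pipeline ships them), while the metric keeps only bounded tags:

```python
import logging

from datadog import statsd

log = logging.getLogger("checkout")

def handle_checkout_error(user_id: str, request_id: str, region: str, err: Exception) -> None:
    # Forensic, unbounded detail goes to the log line (and on to Datadog
    # via whatever log shipper you already run).
    log.error(
        "checkout failed",
        extra={"usr_id": user_id, "request_id": request_id, "error": str(err)},
    )

    # The metric only carries the bounded dimensions you alert and group on.
    statsd.increment(
        "checkout.errors",
        tags=["env:prod", "service:checkout", f"region:{region}"],
    )
```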
6.5 Introduce a telemetry governance / policy layer
At some point, “tell teams to be careful with tags” stops working.
This is where a governance layer comes in: a service or process that sits between “engineers emit stuff” and “Datadog bills us for stuff.”
It typically does things like:
- Scan Datadog for metric and tag usage
- Identify high-cardinality tags and metrics that are rarely or never queried
- Recommend or enforce policies like:
  - “No user-level tags on core metrics”
  - “These 10 tags are globally disallowed on metrics”
  - “This particular tag is only allowed for service X as an exception”
- Keep configuration in code (Git) so changes are reviewable and auditable
This shifts you from “everyone does whatever they want” to “we have paved roads and guardrails.”
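A toy sketch of the enforcement piece, with the policy expressed as plain data in a Git-tracked module; every rule and name here is illustrative:

```python
# policy.py -- lives in Git, so changes go through code review.
GLOBALLY_DISALLOWED_TAGS = {"user_id", "request_id", "session_id", "email"}

PER_METRIC_EXCEPTIONS = {
    # metric name -> tag keys allowed despite the global rule
    "fraud.scoring.decisions": {"session_id"},
}

def violations(metric: str, tag_keys: set) -> set:
    """Return the tag keys on this metric that break the global policy."""
    allowed = PER_METRIC_EXCEPTIONS.get(metric, set())
    return (tag_keys & GLOBALLY_DISALLOWED_TAGS) - allowed

# Feed it the tag keys discovered for each metric:
print(violations("checkout.requests", {"env", "status", "user_id"}))  # {'user_id'}
```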
7. Keeping it fixed: monitors, budgets, and culture
Cardinality naturally tends to creep. You need ongoing control loops.
7.1 Set metric budgets and monitors
Use Datadog’s usage metrics to:
- Create dashboards for:
  - Total custom metrics
  - Custom metrics per team/service
- Create alerting on:
  - Sudden spikes in custom metric volume
  - Forecasted crossings of agreed thresholds
Add on top:
- Per-team “metric budgets”
  - E.g. “Team Payments: target 50k indexed custom metrics, hard limit 70k”
- A regular review (monthly / quarterly) where you look at:
  - Top cost drivers
  - New metrics added
  - Remediation actions taken
That’s how you keep costs aligned with business value.
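As one concrete building block, a monitor on the estimated-usage metric can serve as the alerting half of a budget. A hedged sketch with datadogpy; adapt the usage metric name, notification handle, and thresholds to your account:

```python
import os

from datadog import api, initialize

initialize(
    api_key=os.environ["DD_API_KEY"],
    app_key=os.environ["DD_APP_KEY"],
)

# Page when estimated custom-metric usage crosses the agreed hard limit;
# warn a bit earlier so there's time to react.
api.Monitor.create(
    type="metric alert",
    query="avg(last_4h):avg:datadog.estimated_usage.metrics.custom{*} > 3000000",
    name="Custom metric usage above agreed budget",
    message=(
        "Estimated custom metric usage crossed the account budget. "
        "Check the top-metrics dashboard and recent deploys. @slack-observability"
    ),
    options={"thresholds": {"critical": 3000000, "warning": 2500000}},
)
```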
7.2 Build cardinality awareness into the SDLC
Observability design should be part of how code ships.
Tactics:
- Add checklist items to PR templates:
  - “Are new metrics high-cardinality?”
  - “Do we really need these tags?”
- Require a short rationale for new production metrics:
  - “This metric will be used by dashboard X and alert Y; we need tags A, B, C to debug incidents.”
- Limit who can create log-based metrics and who can change Metrics without Limits config
That way, “just add a tag” becomes a conscious tradeoff instead of a reflex.
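If you want tooling on top of the checklist, even a crude CI lint can catch the most obvious offenders before they ship. A minimal, string-matching sketch that will miss dynamically built tags; treat it as a nudge, not a gate:

```python
import sys
from pathlib import Path

DISALLOWED = ("user_id:", "request_id:", "session_id:", "email:")

def scan(path: Path) -> list:
    findings = []
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if "statsd." in line or "tags=" in line:
            for bad in DISALLOWED:
                if bad in line:
                    findings.append(f"{path}:{lineno}: disallowed tag key '{bad.rstrip(':')}'")
    return findings

if __name__ == "__main__":
    problems = [finding for arg in sys.argv[1:] for finding in scan(Path(arg))]
    print("\n".join(problems))
    sys.exit(1 if problems else 0)
```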
7.3 Track a few cost-health KPIs
Define a small set of KPIs that tell you if things are getting better or worse, for example:
- Total indexed custom metrics over time
- Indexed custom metrics per host
- Indexed vs ingested custom metrics (an “indexing efficiency” measure)
- Custom metrics per team / service
Keep them on a shared dashboard and review them the same way you review latency or error budgets.
8. Bringing it all together
Best practices for high-cardinality metrics in Datadog boil down to a few principles:
- Understand the billing model
  - A “custom metric” is a unique metric+tags+host combination, averaged hourly over the month.
- Make cardinality visible and attributable
  - You can’t fix what you can’t see, and you can’t change what you can’t assign to an owner.
- Eliminate obvious waste
  - Clean up integrations, auto-tags, unused metrics, and unnecessary percentiles.
- Use Datadog’s own levers
  - Metrics without Limits, log-to-metric design, distribution settings: these are big hammers, use them.
- Wrap it all in governance
  - A lightweight policy layer, plus dashboards and alerts on custom metric usage.
- Change the culture
  - Metrics and tags are a product. They’re designed, reviewed, versioned, and pruned, not just sprayed around.
Done right, you get the good parts of high cardinality (rich debugging when you need it) without being crushed by unbounded custom metric costs. Observability stays fast and useful, your Datadog bill becomes predictable, and your finance team stops secretly plotting to unplug your agents.


