Best Practices for High-Cardinality Metrics in Datadog

Observability · Dec 19, 2025

High-cardinality metrics are the stealth tax of Datadog. They look harmless when you add “just one more tag,” but at scale they quietly multiply into millions of time series and a Datadog invoice that makes your finance team question your life choices.

This guide covers:

  • How Datadog actually bills custom metrics
  • Why high-cardinality metrics blow up your Datadog bill
  • Best practices for high-cardinality metrics in Datadog
  • What you, as a DevOps / SRE leader, can do to detect, fix, and prevent this mess

The goal: keep observability powerful and rich without lighting your budget on fire.

1. What “high cardinality” actually means in Datadog

In Datadog, a custom metric isn’t just the metric name.

A single custom metric is defined by the unique combination of:

  • Metric name
  • Tag key/value pairs
  • Host

Each distinct combination is a separate time series, and in Datadog’s world, a billable custom metric.

Example:

  • checkout.requests{env:prod,region:eu-west-1,status:200}

That’s one series.

Add a user_id tag and suddenly:

  • checkout.requests{env:prod,region:eu-west-1,status:200,user_id:123}
  • checkout.requests{env:prod,region:eu-west-1,status:200,user_id:124}

Each one is a different custom metric.
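
To make this concrete, here’s a minimal sketch using the official datadog Python package (it assumes a local Datadog Agent listening on the default DogStatsD port). Every distinct tag set below turns into its own time series:

    # Minimal DogStatsD sketch; assumes a local Agent on the default port 8125.
    from datadog import initialize, statsd

    initialize(statsd_host="127.0.0.1", statsd_port=8125)

    # One tag combination -> one time series -> one billable custom metric.
    statsd.increment(
        "checkout.requests",
        tags=["env:prod", "region:eu-west-1", "status:200"],
    )

    # An unbounded tag like user_id means every distinct value creates a
    # brand-new time series on top of the existing combinations.
    for user_id in ("123", "124"):  # now imagine 100k of these
        statsd.increment(
            "checkout.requests",
            tags=["env:prod", "region:eu-west-1", "status:200", f"user_id:{user_id}"],
        )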

High cardinality simply means “a tag (or combination of tags) with a huge number of possible values,” for example:

  • user_id, customer_id, tenant_id
  • request_id, trace_id, span_id
  • pod_name, container_id
  • url or path values with IDs or random strings baked in

On a toy app, this doesn’t matter. On a real platform with dozens of services and Kubernetes everywhere, it absolutely does.

2. How Datadog bills custom metrics

To control cost, you need to understand the mechanics.

2.1 How custom metrics are counted

Conceptually:

  • Datadog looks at all your distinct time series per hour
  • A “custom metric” is one of those unique metric+tags+host combinations
  • It computes your bill based on the average number of distinct custom metrics per hour over the month

So if your cardinality doubles during peak hours, your “average distinct custom metrics per hour” goes up, and so does your bill.
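
A toy illustration of that averaging, with made-up hourly numbers:

    # Illustrative only: how the hourly averaging behaves over one day.
    off_peak = [100_000] * 16        # 16 quiet hours with 100k distinct series
    peak = [200_000] * 8             # 8 peak hours where cardinality doubles
    hourly_distinct_series = off_peak + peak

    billable = sum(hourly_distinct_series) / len(hourly_distinct_series)
    print(billable)                  # 133333.33... -> peak hours drag the average up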

2.2 Allocations and overages

Very roughly (exact numbers depend on plan and contract):

  • Each host includes a certain pool of custom metrics
  • That pool is shared at account level
  • Once you exceed the pool, you’re charged per custom metric per month

Which is why a “small” change that adds hundreds of thousands of new tag combinations can create a very non-small surprise on the invoice.

2.3 Metric type matters too

Some metric types multiply this further:

  • Histograms and distributions create multiple series per metric (count, sum, average, max, percentiles, and so on)
  • Those series are multiplied by each tag combination
  • Percentiles especially can be expensive if you allow them with very high-cardinality tag sets

If you’re careless, a single distribution metric with four or five high-card tags and multiple percentiles can fan out into tens or hundreds of thousands of billable custom metrics.
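
A quick back-of-the-envelope, with assumed aggregation counts (the exact number of series per histogram or distribution depends on which aggregations and percentiles you enable):

    # Rough fan-out estimate for a single distribution metric.
    aggregations = 5                     # e.g. count, sum, min, max, avg
    percentiles = 5                      # e.g. p50, p75, p90, p95, p99 if enabled
    tag_combinations = 3 * 4 * 50 * 20   # env * region * service * endpoint = 12,000

    series = tag_combinations * (aggregations + percentiles)
    print(series)                        # 120,000 billable series from "one" metric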

2.4 Log-based metrics

Metrics created from logs (logs-to-metrics) are also counted as custom metrics.

Common footgun: creating log-based metrics grouped by attributes like user_id or request_id. Every new value becomes a new time series. The bill follows.

3. Why high cardinality wrecks your Datadog bill

The core problem: cardinality is multiplicative.

Rough sketch:

  • Start with requests_total (no tags) → 1 series
  • Add env with 3 values → 3 series
  • Add region with 4 values → 12 series
  • Add service with 50 microservices → 600 series
  • Add user_id with 100k active users → 60,000,000 series

Each series is:

  • Stored
  • Indexed (depending on your configuration)
  • Included in your hourly distinct custom metrics count

So that “quick change” where someone added user_id or session_id to a core metric is not a cosmetic tweak; it’s a cost explosion.

The important insight:
High cardinality isn’t bad by definition.
Unintentional, unbounded cardinality on core metrics is bad.

4. First job of a DevOps leader: visibility and attribution

You can’t optimize what you can’t see. So step one is turning Datadog back on itself.

4.1 Use Datadog’s own usage data

Datadog exposes usage metrics and UI pages for:

  • Total custom metrics over time
  • Custom metrics by metric name
  • Custom metrics broken down by tag

Use these to build:

  • A dashboard of custom metric volume over time
  • A breakdown of “top N metrics by custom metric count”
  • A breakdown by service, team, env, region, etc.

You want to be able to answer:

  • Which metrics are our biggest cost drivers?
  • Which services/teams own those metrics?
  • Which tags are contributing most to cardinality?

Without attribution, every optimization discussion turns into a blame fog.
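
As a starting point, here’s a hedged sketch that pulls the top custom metrics by hourly average from the usage metering API. The endpoint, parameters, and response fields reflect the docs at the time of writing; verify them against the current API reference for your site and plan:

    # Hedged sketch: top custom metrics by hourly average (usage metering API).
    import os

    import requests

    DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
    headers = {
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    }

    resp = requests.get(
        f"https://api.{DD_SITE}/api/v1/usage/top_avg_metrics",
        headers=headers,
        params={"month": "2025-12", "limit": 25},
        timeout=30,
    )
    resp.raise_for_status()

    for entry in resp.json().get("usage", []):
        print(f'{entry["metric_name"]}: ~{entry["avg_metric_hour"]} series/hour')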

4.2 Tag-level usage analysis

Next, look at whether tags are actually used:

  • Which tags are used in dashboard queries, monitors, SLOs?
  • Which tags are never used to filter or group?
  • Which tags are very high-cardinality and low-usage?

You can do this via APIs, internal scripts, or governance tools:

  • Enumerate your metrics and their tags
  • Compare to the tags that appear in queries
  • Flag “high-cardinality, low value” tags per metric

This gives you a list of suspects to investigate with each owning team.
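
Here’s one hedged way to build that suspect list with the Metrics APIs: enumerate recently active metrics, pull the indexed tags on each, and flag tag keys with lots of distinct values. Double-check the endpoint paths and response shapes against the current docs before relying on this:

    # Hedged sketch: find tag keys with many distinct values per metric.
    import os
    import time
    from collections import Counter

    import requests

    DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
    headers = {
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    }
    base = f"https://api.{DD_SITE}"

    # Metrics that reported in the last 24 hours (v1 active metrics list).
    active = requests.get(
        f"{base}/api/v1/metrics",
        headers=headers,
        params={"from": int(time.time()) - 86400},
        timeout=30,
    ).json().get("metrics", [])

    for metric in active[:20]:  # sample a handful to stay under rate limits
        tags = requests.get(
            f"{base}/api/v2/metrics/{metric}/all-tags",
            headers=headers,
            timeout=30,
        ).json().get("data", {}).get("attributes", {}).get("tags", [])
        # Count distinct values per tag key; big counts = high-cardinality suspects.
        values_per_key = Counter(t.split(":", 1)[0] for t in tags)
        suspects = {key: n for key, n in values_per_key.items() if n > 100}
        if suspects:
            print(metric, suspects)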

5. Detecting telemetry waste: common patterns

Once you have visibility, you’ll start seeing the same offenders again and again.

5.1 High-cardinality tag anti-patterns

Red flags:

  • User-level tags:
    • user_id, customer_id, tenant_id, account_id, email
  • Per-request/session tags:
    • session_id, request_id, trace_id, span_id
  • Free-form identifiers:
    • url or path with IDs and query strings baked in
  • Low-value infra tags:
    • pod_name, container_id, auto-generated hostnames

Those are often perfectly fine in logs and traces. They’re almost always dangerous on core metrics.

5.2 Hidden cardinality from integrations

Another classic source of surprise:

  • Cloud integrations (AWS, GCP, Azure) automatically adding a swarm of instance tags
  • Kubernetes integrations tagging everything with pod, node, replica set, etc.
  • Third-party integrations that propagate their own tags into your metrics

Ask yourself:

  • Do we actually use these tags in queries?
  • Do we need them on every metric, or only on infrastructure metrics?
  • Are we accidentally copying cloud tags into application-level metrics?

Often, simply trimming integration tags can remove a huge chunk of your custom metric volume.

5.3 Unused metrics and tags

A lot of cost comes from metrics and dimensions that nobody uses anymore:

  • Old experiments and temporary dashboards
  • Historical migrations (“we’re still emitting both the legacy metric and the new one… three years later”)
  • Copy-paste metrics where someone took an internal metric from one team and just changed the name

You want scripts or processes that regularly ask:

  • Which metrics haven’t been queried in X days?
  • Which tags on those metrics have never appeared in a query?
  • Can we turn these off, or at least slim them down?

This is “garbage collection for observability.”
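
A hedged sketch of that garbage collection, using only the metrics and monitors APIs (a fuller version would also scan dashboard and SLO queries, and this naive substring match will miss templated queries):

    # Hedged sketch: flag emitted metrics that no monitor query references.
    import os
    import time

    import requests

    DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
    headers = {
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    }
    base = f"https://api.{DD_SITE}"

    # Everything that reported in the last 7 days.
    emitted = set(
        requests.get(
            f"{base}/api/v1/metrics",
            headers=headers,
            params={"from": int(time.time()) - 7 * 86400},
            timeout=30,
        ).json().get("metrics", [])
    )

    # All monitor queries, concatenated into one searchable blob.
    monitors = requests.get(f"{base}/api/v1/monitor", headers=headers, timeout=30).json()
    monitor_queries = " ".join(m.get("query", "") for m in monitors)

    unreferenced = sorted(m for m in emitted if m not in monitor_queries)
    print(f"{len(unreferenced)} of {len(emitted)} emitted metrics appear in no monitor")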

6. Fixing it: reducing cardinality and cost without losing observability

Once you know where the waste is, you can attack it in layers.

6.1 Clean up integrations and auto-tags

Start with the lowest-risk, highest-impact changes:

  • Audit cloud integrations:
    • Disable metric sets you don’t use
    • Limit which tags get propagated from infrastructure to application metrics
  • Audit Kubernetes tags:
    • Prefer stable tags like namespace, deployment, service over pod_name or container_id
  • Stop copying every environment metadata field into metric tags out of habit

In many environments, you can shave 20–40% of custom metric volume with this alone.

6.2 Use Datadog’s Metrics without Limits (MwL)

Metrics without Limits lets you:

  • Ingest metrics with all their tags
  • Choose which tags are indexed for querying and grouping
  • Drop other tags from indexing so they don’t contribute to custom metric counts

A practical pattern:

  1. For each high-volume metric, look at which tags are actually used in queries over the last 30 days.
  2. Keep the core ones (e.g., env, service, region, status, maybe endpoint).
  3. Drop ephemeral and user-level tags from indexing (user_id, session_id, pod_name, etc.).

You still get rich telemetry at ingest, but you’re only paying Datadog to index the dimensions you actually care about in practice.
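
If you prefer code over clicks, the Metrics v2 tag configuration endpoint is the API surface behind Metrics without Limits. The payload shape below reflects the docs at the time of writing; confirm it (and the metric’s actual type) before wiring it into automation:

    # Hedged sketch: pin the indexed tag keys for one high-volume metric.
    import os

    import requests

    DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
    headers = {
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    }

    metric = "checkout.requests"
    payload = {
        "data": {
            "type": "manage_tags",
            "id": metric,
            "attributes": {
                # Only these tag keys stay queryable; user_id, session_id,
                # pod_name, etc. are still ingested but no longer fan out
                # into billable indexed series.
                "tags": ["env", "service", "region", "status", "endpoint"],
                # Must match the metric's actual type (gauge/count/rate/distribution).
                "metric_type": "count",
            },
        }
    }

    resp = requests.post(
        f"https://api.{DD_SITE}/api/v2/metrics/{metric}/tags",
        headers=headers,
        json=payload,
        timeout=30,
    )
    resp.raise_for_status()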

6.3 Redesign metrics around “intentional cardinality”

This is the heart of best practices for high-cardinality metrics in Datadog: you design metrics, you don’t just sprinkle tags.

Guidelines:

  • Prefer bounded dimensions:
    • env, region, az, service, team, version, endpoint
  • Avoid per-user and per-request cardinality on core metrics
  • Use coarser segments when you do need “per-something”:
    • Per-tenant instead of per-user
    • Per-plan (“enterprise vs SMB”) instead of per-customer ID
  • Be explicit:
    • Each metric should have a stated purpose (“used in SLO X, dashboard Y”) and an owner

If a tag isn’t required for debugging, alerting, or SLOs, it probably doesn’t belong on the metric.
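
For example, here’s the per-plan-instead-of-per-user pattern in instrumentation code. plan_for_user is a hypothetical lookup stubbed out here; in reality it would hit your own user service or a cache:

    # Illustrative only: map an unbounded identifier to a bounded dimension
    # before tagging, instead of tagging with the identifier itself.
    from datadog import statsd

    def plan_for_user(user_id: str) -> str:
        # Hypothetical lookup; returns one of a handful of plan names.
        return "enterprise"

    def record_checkout(user_id: str, region: str, status: int) -> None:
        statsd.increment(
            "checkout.requests",
            tags=[
                "env:prod",
                f"region:{region}",                # bounded: a handful of regions
                f"status:{status}",                # bounded: a few status codes
                f"plan:{plan_for_user(user_id)}",  # bounded: per-plan, not per-user
                # deliberately NOT f"user_id:{user_id}"
            ],
        )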

6.4 Use logs and traces for ultra-high-cardinality detail

Some things are legitimately high-cardinality, like user identities or individual requests. That’s fine — just don’t model them as metrics.

Pattern:

  • Keep very detailed, high-cardinality data (user IDs, trace IDs, raw URLs, stack traces) in logs and traces
  • Derive low-cardinality metrics from those signals:
    • Errors per service/region/status
    • Latency percentiles per endpoint/region
  • Avoid grouping log-based metrics by unbounded attributes like user_id or request_id

Metrics give you trends and aggregates. Logs and traces give you forensic detail. Use each for what it’s good at.
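
A small illustration of that split: the high-cardinality identifiers go into a structured log line (picked up by the Agent or your log shipper), while the metric keeps only bounded tags:

    # Illustrative only: forensic detail in logs, aggregates in metrics.
    import json
    import logging

    from datadog import statsd

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("checkout")

    def handle_checkout(user_id: str, request_id: str, region: str, status: int) -> None:
        # High-cardinality identifiers live in the log event, not in metric tags.
        log.info(json.dumps({
            "event": "checkout.request",
            "user_id": user_id,
            "request_id": request_id,
            "region": region,
            "status": status,
        }))
        # The metric only carries bounded dimensions.
        statsd.increment(
            "checkout.requests",
            tags=["env:prod", f"region:{region}", f"status:{status}"],
        )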

6.5 Introduce a telemetry governance / policy layer

At some point, “tell teams to be careful with tags” stops working.

This is where a governance layer comes in: a service or process that sits between “engineers emit stuff” and “Datadog bills us for stuff.”

It typically does things like:

  • Scan Datadog for metric and tag usage
  • Identify high-cardinality tags and metrics that are rarely or never queried
  • Recommend or enforce policies like:
    • “No user-level tags on core metrics”
    • “These 10 tags are globally disallowed on metrics”
    • “This particular tag is only allowed for service X as an exception”
  • Keep configuration in code (Git) so changes are reviewable and auditable

This shifts you from “everyone does whatever they want” to “we have paved roads and guardrails.”
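
As a deliberately simplified illustration (real governance tooling does much more), a policy check like this can live in Git and run in CI against proposed metric changes:

    # Hypothetical policy check: compare a metric's tag keys against a
    # version-controlled denylist, with narrowly scoped exceptions.
    DISALLOWED_TAG_KEYS = {
        "user_id", "customer_id", "email",
        "request_id", "session_id", "trace_id", "span_id",
        "pod_name", "container_id",
    }

    EXCEPTIONS = {
        ("billing-service", "customer_id"),  # explicit, reviewed exception
    }

    def violations(service: str, metric: str, tag_keys: set[str]) -> list[str]:
        return [
            f"{metric}: tag '{key}' is disallowed for {service}"
            for key in sorted(tag_keys & DISALLOWED_TAG_KEYS)
            if (service, key) not in EXCEPTIONS
        ]

    print(violations("checkout", "checkout.requests", {"env", "region", "user_id"}))
    # ["checkout.requests: tag 'user_id' is disallowed for checkout"]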

7. Keeping it fixed: monitors, budgets, and culture

Cardinality naturally tends to creep. You need ongoing control loops.

7.1 Set metric budgets and monitors

Use Datadog’s usage metrics to:

  • Create dashboards for:
    • Total custom metrics
    • Custom metrics per team/service

  • Create alerting on:
    • Sudden spikes in custom metric volume
    • Forecasted crossings of agreed thresholds

Add on top:

  • Per-team “metric budgets”
    • E.g. “Team Payments: target 50k indexed custom metrics, hard limit 70k”

  • A regular review (monthly / quarterly) where you look at:
    • Top cost drivers
    • New metrics added
    • Remediation actions taken

That’s how you keep costs aligned with business value.
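
For the alerting piece, here’s a hedged sketch that creates a monitor on Datadog’s estimated custom metrics usage metric via the monitors API. The metric name and thresholds are assumptions; check the usage metrics docs and pick numbers that match your own contract:

    # Hedged sketch: alert when estimated custom metric usage exceeds budget.
    # The metric name below is an assumption; verify it for your account.
    import os

    import requests

    DD_SITE = os.environ.get("DD_SITE", "datadoghq.com")
    headers = {
        "DD-API-KEY": os.environ["DD_API_KEY"],
        "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
    }

    monitor = {
        "name": "Custom metric volume above budget",
        "type": "metric alert",
        "query": "avg(last_4h):avg:datadog.estimated_usage.metrics.custom{*} > 500000",
        "message": "Indexed custom metrics are above the agreed budget.",
        "options": {"thresholds": {"critical": 500000, "warning": 400000}},
    }

    resp = requests.post(
        f"https://api.{DD_SITE}/api/v1/monitor",
        headers=headers,
        json=monitor,
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json()["id"])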

7.2 Build cardinality awareness into the SDLC

Observability design should be part of how code ships.

Tactics:

  • Add checklist items to PR templates:
    • “Are new metrics high-cardinality?”
    • “Do we really need these tags?”

  • Require a short rationale for new production metrics:
    • “This metric will be used by dashboard X and alert Y; we need tags A, B, C to debug incidents.”

  • Limit who can create log-based metrics and who can change Metrics without Limits config

That way, “just add a tag” becomes a conscious tradeoff instead of a reflex.

7.3 Track a few cost-health KPIs

Define a small set of KPIs that tell you if things are getting better or worse, for example:

  • Total indexed custom metrics over time
  • Indexed custom metrics per host
  • Indexed vs ingested custom metrics (an “indexing efficiency” measure)
  • Custom metrics per team / service

Keep them on a shared dashboard and review them the same way you review latency or error budgets.

8. Bringing it all together

Best practices for high-cardinality metrics in Datadog boil down to a few principles:

  1. Understand the billing model
    • A “custom metric” is a unique metric+tags+host combination; billing is based on the average number of distinct combinations per hour over the month.

  2. Make cardinality visible and attributable
    • You can’t fix what you can’t see, and you can’t change what you can’t assign to an owner.

  3. Eliminate obvious waste
    • Clean up integrations, auto-tags, unused metrics, and unnecessary percentiles.

  4. Use Datadog’s own levers
    • Metrics without Limits, log-to-metric design, distribution settings — these are big hammers, use them.

  5. Wrap it all in governance
    • A lightweight policy layer, plus dashboards and alerts on custom metric usage.

  6. Change the culture
    • Metrics and tags are a product. They’re designed, reviewed, versioned, and pruned — not just sprayed around.

Done right, you get the good parts of high cardinality (rich debugging when you need it) without being crushed by unbounded custom metric costs. Observability stays fast and useful, your Datadog bill becomes predictable, and your finance team stops secretly plotting to unplug your agents.