Metric cardinality is the silent amplifier in observability. Every distinct combination of a metric name and its labels produces a separate time series. When cardinality grows, memory usage rises, indexes bloat, ingestion pipelines stall, queries fan out across too many series, and SaaS bills climb. During incidents, the pain shows up as slow dashboards, missing data, and engineers flying blind.
This post explains the problem in depth and how to navigate it with policy and guardrails that work in both self-hosted stacks and SaaS. The specific examples use Prometheus and Datadog. The same ideas apply if you run VictoriaMetrics, M3, Grafana Cloud, or similar systems.
What metric cardinality actually means
Think of a metric as a template and labels as switches. Each unique setting of those switches becomes its own time series. If a request counter has labels for service, status_code, region, and endpoint, the number of series is roughly the product of the distinct values for each label. Add histograms and each bucket becomes another series per label set. Add exemplars or per-instance dimensions and the count multiplies again.
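A quick worked example, with illustrative label counts: a counter http_requests_total carrying service (20 values) × endpoint (50) × status_code (6) × region (4) is roughly 24,000 series. Put the same labels on a latency histogram with 12 buckets and you get 24,000 × (12 buckets + _sum + _count) = 336,000 series from a single metric name.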
There is a difference between high cardinality and high churn. Cardinality counts how many unique series exist. Churn describes how fast new series are created and old ones disappear. Churn stresses write paths and compaction; cardinality stresses memory, index size, and query fanout. You can have low cardinality with high churn or the reverse, and both can hurt.
Why the problem keeps showing up
First, instrumentation is easy to get wrong. It is tempting to include user_id, session, request_id, raw URL paths, container IDs, or commit SHAs as labels. Each of those has either unbounded or near-unbounded value sets. One such label can turn a tidy counter into millions of series.
Second, modern platforms attach a lot of automatic metadata. Kubernetes, cloud providers, and service meshes inject tags for pods, nodes, zones, clusters, versions, and more. Those dimensions multiply with your app labels. What looks like three harmless labels can become hundreds of effective combinations once the platform adds its own fields.
Third, histograms and high-cardinality dimensions interact badly. A latency histogram with ten buckets across five label combinations is already fifty time series for one metric, before counting the _sum and _count series. If the label values are unbounded, you can end up creating new series every scrape.
Fourth, scale dynamics matter. Autoscaling, ephemeral jobs, and blue/green rollouts all create fresh label values. Even without unbounded labels, a large fleet can create a large cross product of values that repeats across regions and clusters. If you export per-instance metrics for a thousand pods and then add per-endpoint and per-status labels, the total can jump by orders of magnitude.
Finally, query patterns magnify the pain. The same label sprawl that inflates writes also forces the query engine to touch far more series. Time to first sample goes up, cache hit rates go down, and query concurrency limits get consumed by requests that fan out too widely. The worst part is that this usually reveals itself while you are firefighting, not when everything is quiet.
The blast radius in self-hosted vs SaaS
In self-hosted Prometheus-like systems, high cardinality grows the head block and indexes, which increases memory pressure and garbage collection time. Compaction and WAL (write-ahead log) replay take longer, and the querier has to scan more postings lists. When this happens across tenants, a single noisy service can starve others. And the stakes are not merely dashboards: for many teams Prometheus sits on the critical path, powering alerting and even autoscaling via metrics adapters. If Prometheus becomes slow or unavailable, alerts fire late or not at all, autoscalers do not react, and traffic surges can snowball into user-visible outages.
Guardrails on cardinality are therefore availability guardrails: they reduce TSDB overload, shorten WAL recovery after restarts, and help HA pairs and remote_write replicas keep pace so continuity is preserved during failures.
In SaaS, the mechanics are different but the outcome rhymes. Datadog defines a custom metric as the unique combination of metric name and tag values, including host. More unique combinations mean more billable metrics and more index entries. Datadog offers ways to control which tags remain queryable, but the ingestion cost of sending noisy labels can still show up in performance or price if you do not control them at the source.
Rule-of-thumb math (why it explodes): custom_metrics ≈ (distinct metric names) × ∏(distinct values per tag in the submitted tag set) [and often × hosts when host is present].
Example for one hot metric http.requests with tags env(3), region(3), service(25), endpoint(30), status_code(6); with and without host:
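Without host: 3 × 3 × 25 × 30 × 6 = 40,500 distinct combinations, each a billable custom metric, from one metric name. With host included on a fleet of, say, 200 hosts (an illustrative number), the same metric balloons to 40,500 × 200 = 8,100,000.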
Even modest, “reasonable” tag choices become untenable at scale. Two pragmatic levers in Datadog: first, aggressively trim the submitted tag set at the agent level (for example, keep env, region, service, endpoint, and status_code; drop pod_uid, container_id, image_sha, node, owner, and raw_url) so the multiplicative base stays small; second, restrict which tags are indexed and queryable for high-volume metrics. Neither lever eliminates the need to prevent unbounded tags at the source, especially for the unknown unknowns.
A tale of two stacks: Self Managed (Prometheus) and Cloud (Datadog)
Prometheus gives you direct visibility into series growth. You can query total active series, look at series added per scrape to find churn, and use the TSDB status endpoint or promtool to see which metrics and label pairs dominate. It is transparent, which is both a blessing for diagnosis and a warning: the system will ingest whatever you feed it unless you put relabeling rules and limits in place.
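A few ad-hoc queries go a long way here; they run against the metrics Prometheus exposes about itself, so no extra tooling is needed (the topk query touches every series, so use it sparingly on a large TSDB):

# Total active series in the head block
prometheus_tsdb_head_series

# Churn: new series created per scrape, by job
sum by (job) (scrape_series_added)

# Top 10 metric names by series count (expensive; run ad hoc, not in a dashboard)
topk(10, count by (__name__) ({__name__=~".+"}))

For heavier digging, the /api/v1/status/tsdb endpoint reports the highest-cardinality metric names and label pairs, and promtool tsdb analyze can produce the same breakdown offline against a data directory or snapshot.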
Datadog shifts the pressure to pricing and indexing. Because a metric plus its tag set drives cost, a single high-cardinality tag can turn a cheap counter into an expensive one. Datadog provides features to allowlist which tags are indexed and to inspect tag cardinality. These features are useful, but they are not a substitute for disciplined labeling. If you attach user_id or raw URL to a hot metric, you will feel it sooner or later.
What tools exist today, and where they fall short
Prometheus has status endpoints and offline analysis that surface the heavy hitters. Datadog has explorers and tag controls. These are valuable, but they tend to be reactive: they tell you where the fire is, but they do not stop bad labels from shipping in the first place. Collectors such as the OpenTelemetry Collector, or the Prometheus scrape pipeline itself, can drop or rewrite labels before storage, but you have to configure them up front, and they do not protect against labels nobody anticipated. Offline analysis can be useful if the team has the time, but it is more of an audit tool than a guardrail.
Per-tenant isolation (one Prometheus per team, or distinct tenants with per-tenant limits in Cortex, Grafana Mimir, or Thanos) does contain the blast radius: runaway labels stay inside that tenant, and you can enforce caps on max series, ingest rate, label and value length, and query or series fanout. But it does not scale well on its own, because each tenant duplicates scrape configs, rules, retention, HA, upgrades, and on-call.
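For a sense of what those caps look like in practice, here is a minimal sketch using key names that follow Grafana Mimir's limits configuration (Cortex uses very similar settings); the numbers are placeholders, and per-tenant overrides layer on top of these defaults:

limits:
  # Write-path caps
  ingestion_rate: 50000                # samples per second per tenant
  ingestion_burst_size: 100000
  max_global_series_per_user: 1500000  # active-series cap per tenant
  max_global_series_per_metric: 100000 # stops one metric from eating the whole budget
  max_label_names_per_series: 30
  max_label_value_length: 1024
  # Read-path caps
  max_fetched_series_per_query: 100000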
The policy that actually works
The answer is not one tool. It is a small set of rules enforced in multiple places to ensure that one mistake cannot take down the entire observability stack.
- Define a safe label vocabulary. Publish an allowlist of stable labels such as env, region, service, status_code, and HTTP method. Publish a denylist that forbids unbounded labels such as user_id, session, request_id, container_id, pod_uid, trace_id, and raw URLs with query strings. Keep the lists short so they are memorable.
- Smart guardrails for the unknown unknowns. Continuously measure active time series per metric and per service; learn a rolling baseline (for example, the prior 7 days with seasonality). When a meaningful increase is detected (say, more than 3× over 15 minutes, or a sustained anomaly score), automatically notify the owning team and temporarily block or aggregate only the specific metric for that service at the collector or scrape edge. Scope actions to the service and metric, not just the tenant, to limit blast radius. The block auto-lifts once series counts return to normal, and every action is recorded for audit and postmortem. A sketch of the detection side follows below.
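The detection half of that guardrail can start as ordinary Prometheus rules. The sketch below is minimal and makes assumptions: application series carry a service label, the baseline is simply the same time yesterday rather than a seasonal model, and the all-series matcher in the recording rule is expensive on very large TSDBs, so narrow it to the metrics you care about in production. The enforcement half, temporarily aggregating or dropping the offending metric at the collector edge, still needs automation around your configuration management; these rules only provide the signal and the audit trail.

groups:
  - name: cardinality-guardrails
    rules:
      # Active series per service (narrow the matcher in production)
      - record: service:active_series:count
        expr: count by (service) ({__name__=~".+"})
      # Fires when a service's series count more than triples versus the same time
      # yesterday; needs a day of recorded history before it can evaluate.
      - alert: ActiveSeriesSurge
        expr: service:active_series:count > 3 * (service:active_series:count offset 1d)
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Active series for {{ $labels.service }} more than tripled vs. yesterday; look for a new unbounded label"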
Implementation notes and examples
The examples below are intentionally minimal. They are meant to show where to place the guardrails, not to be copy-and-paste policy for every environment.
Prometheus scrape-time hygiene
Use metric_relabel_configs to remove risky labels and drop obviously per-request series. Keep the regexes simple and reviewable.
scrape_configs:
  - job_name: "apps"
    static_configs:
      - targets: ["app:9100"]
    metric_relabel_configs:
      # Drop series whose url label carries a query string.
      # This must run before the labeldrop below, otherwise the url label is already gone.
      - source_labels: ["url"]
        regex: '.*\?.*'
        action: drop
      # Drop metrics that encode per-request detail (example pattern)
      - source_labels: ["__name__"]
        regex: "http_request_.*_by_user.*"
        action: drop
      # Drop unbounded labels by key (relabel regexes are fully anchored)
      - action: labeldrop
        regex: "user_id|session|request_id|pod_uid|container_id|trace_id|url"
You can achieve similar filtering in the OpenTelemetry Collector or in vmagent. The key is to do it before the data reaches the TSDB or the SaaS index.
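For teams routing metrics through the OpenTelemetry Collector, the sketch below shows the same idea. It assumes a recent Collector in which the attributes processor applies to metric data points, and the receiver and exporter shown are placeholders for whatever your pipeline actually uses:

receivers:
  otlp:                        # placeholder; use whatever feeds your metrics pipeline
    protocols:
      grpc: {}

processors:
  batch: {}
  attributes/strip-unbounded:
    actions:
      - key: user_id
        action: delete
      - key: request_id
        action: delete
      - key: pod_uid
        action: delete
      - key: container_id
        action: delete

exporters:
  prometheusremotewrite:       # placeholder; any metrics exporter works the same way
    endpoint: "http://mimir.example:9009/api/v1/push"

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/strip-unbounded, batch]
      exporters: [prometheusremotewrite]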
Datadog indexing and agent settings
In Datadog, keep the ingest simple and the index lean. In practice that means two moves:
- Restrict which tags are indexed for a given high-volume metric so you do not explode custom metric counts. Start with env, region, service, endpoint, and status_code. Revisit monthly.
- Reduce agent tag cardinality in Kubernetes so the agent does not attach dozens of unnecessary tags to every point. Keep only the tags that show up in dashboards and alerts.
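For the second move, the agent's tag cardinality settings are the usual knob. A minimal datadog.yaml sketch follows; the accepted values are low, orchestrator, and high, and it is worth confirming the keys against your agent version:

# datadog.yaml
checks_tag_cardinality: low      # tags attached by integration checks
dogstatsd_tag_cardinality: low   # tags attached to custom DogStatsD metrics

In Kubernetes these are typically set through the corresponding DD_CHECKS_TAG_CARDINALITY and DD_DOGSTATSD_TAG_CARDINALITY environment variables on the agent.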
Neither step removes the need for collection-time or pipeline filtering. If you keep emitting unbounded labels, you will either pay for them or lose the ability to query the dimensions you care about.
The best defense against cardinality explosions
Observability is a shared resource. Metric cardinality is how that resource gets overdrawn. The best defense is a small, boring set of rules that everyone understands, enforced at the points where they can fail safe: the allow/deny lists, the pipelines, and CI. Start there, and your dashboards will stay fast when you need them most, and your finance team will not send you surprise screenshots at the end of the month.