High Cardinality in Metrics: Challenges, Causes, and Solutions

Observability | Published Oct 16, 2024 | Updated Feb 19, 2026

As engineers, we love data. But there's a point where having too much data (or, more specifically, too many unique values) can become a headache. This is exactly what we mean by "high cardinality." It's one of the biggest drivers of observability cost overruns, and it affects every team managing metrics in tools like Datadog, Splunk, Prometheus, or any time-series database. Let's dive into what this term really means, why it's such a challenge, and some real-life examples you've likely run into.

What Is High Cardinality?

In data terms, cardinality refers to the number of unique values in a dataset. High cardinality means that a column or field has a very large number of unique values. Think about the typical metric labels we handle as DevOps engineers or software developers: user IDs, request paths, and device IDs often come with a seemingly endless variety of values.

High cardinality is not inherently bad, but it becomes problematic in observability systems, where it leads to increased costs, slow queries, and inefficient storage usage. It's the difference between collecting insightful metrics and drowning in overly granular data that is difficult to use.

How Does High Cardinality Escalate Quickly in Metrics?

High cardinality in metrics can escalate quickly in certain situations, often catching engineers by surprise. This usually happens when fields that are expected to have manageable numbers of unique values start accumulating more unique entries due to changes in usage patterns or increased system complexity. Let's dive deeper into a few common scenarios where high cardinality in metrics can escalate, with some code examples to illustrate how these situations arise:

User-Specific Labels in Metrics: Consider a situation where metrics are labeled with user IDs to track specific user actions. If millions of users are interacting with your service daily, each user ID represents a unique label value.

# Example: Creating metrics with user-specific labels
from prometheus_client import Counter

user_action_counter = Counter('user_actions', 'Count of user actions', ['user_id'])

def track_user_action(user_id):
    user_action_counter.labels(user_id=user_id).inc()

# Millions of users generating unique metrics
for user_id in range(1, 1000000):
    track_user_action(f'user_{user_id}')

In this example, each user generates a unique label (user_id), leading to a massive number of unique metric series. This can quickly overwhelm Prometheus or any other time-series database due to the sheer number of unique time series being created.

A better approach would be to aggregate user actions by cohorts or user segments rather than using individual user IDs as labels, reducing the overall number of unique metric series.
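As a rough sketch of that idea (the hash-based bucketing and the cohort count of 100 are assumptions for illustration, not part of the original example), user IDs can be folded into a bounded set of cohort labels:

# Sketch: aggregating user actions into a fixed set of cohorts instead of per-user labels
import zlib
from prometheus_client import Counter

COHORT_COUNT = 100  # assumed fixed upper bound on label values

cohort_action_counter = Counter('user_actions_by_cohort', 'Count of user actions per cohort', ['cohort'])

def track_user_action_by_cohort(user_id):
    # Hash the user ID into one of COHORT_COUNT stable buckets; cardinality is now bounded
    cohort = zlib.crc32(user_id.encode()) % COHORT_COUNT
    cohort_action_counter.labels(cohort=f'cohort_{cohort}').inc()

# The same million users now produce at most 100 time series
for user_id in range(1, 1000000):
    track_user_action_by_cohort(f'user_{user_id}')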

Instance-Level Metrics for High-Scale Deployments: If you are monitoring metrics at the instance level across a large-scale deployment, each instance may generate unique metrics. For cloud environments where instances are created and destroyed frequently, this can result in high cardinality.

# Example: Tracking instance metrics
from prometheus_client import Gauge

instance_cpu_usage = Gauge('instance_cpu_usage', 'CPU usage per instance', ['instance_id'])

def log_instance_cpu(instance_id, cpu_usage):
    instance_cpu_usage.labels(instance_id=instance_id).set(cpu_usage)

# Simulating multiple instances with unique IDs
for instance_id in range(1, 10000):
    log_instance_cpu(f'instance_{instance_id}', 75.0)

Here, each cloud instance (instance_id) generates a unique metric series, and as instances are scaled up or down, the cardinality becomes extremely high. Instead of tracking each instance individually, consider aggregating metrics by service or availability zone.
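A minimal sketch of what zone-level aggregation could look like (the zone names and the client-side averaging are assumptions for illustration; in practice the zone usually comes from instance metadata):

# Sketch: reporting average CPU usage per availability zone instead of per instance
from collections import defaultdict
from prometheus_client import Gauge

zone_cpu_usage = Gauge('zone_avg_cpu_usage', 'Average CPU usage per availability zone', ['zone'])

def report_zone_cpu(readings):
    # readings: iterable of (zone, cpu_usage) pairs collected from individual instances
    by_zone = defaultdict(list)
    for zone, cpu in readings:
        by_zone[zone].append(cpu)
    for zone, values in by_zone.items():
        zone_cpu_usage.labels(zone=zone).set(sum(values) / len(values))

# Thousands of instances collapse into a handful of series, one per zone
readings = [('us-east-1a' if i % 2 else 'us-east-1b', 75.0) for i in range(1, 10000)]
report_zone_cpu(readings)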

Dynamic Endpoint Monitoring: Metrics that track performance of individual endpoints can quickly escalate in cardinality if those endpoints contain dynamic components, such as user-specific paths.

# Example: Monitoring dynamic endpoints
from prometheus_client import Histogram

endpoint_latency = Histogram('endpoint_latency', 'Latency of requests to endpoints', ['endpoint'])

def log_endpoint_latency(endpoint, latency):
    endpoint_latency.labels(endpoint=endpoint).observe(latency)

# Logging latency for multiple dynamic endpoints
for i in range(1, 10000):
    log_endpoint_latency(f'/users/{i}/profile', 0.2)

Each unique endpoint (e.g., /users/12345/profile) adds a new metric series, leading to high cardinality. A more efficient way would be to generalize endpoints by stripping dynamic components, such as using /users/{user_id}/profile as a static label.
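For example, a small normalization step can be applied before the label is set (the regex below is a simplistic assumption that only handles numeric path segments):

# Sketch: normalizing dynamic path segments before using them as a label
import re
from prometheus_client import Histogram

normalized_latency = Histogram('normalized_endpoint_latency', 'Latency of requests to normalized endpoints', ['endpoint'])

def normalize_endpoint(path):
    # Replace numeric segments with a placeholder so /users/12345/profile becomes /users/{id}/profile
    return re.sub(r'/\d+', '/{id}', path)

def log_normalized_latency(path, latency):
    normalized_latency.labels(endpoint=normalize_endpoint(path)).observe(latency)

# All of these requests now collapse into a single '/users/{id}/profile' series
for i in range(1, 10000):
    log_normalized_latency(f'/users/{i}/profile', 0.2)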

IoT Device Metrics: In IoT environments, each device often reports metrics independently. With thousands of devices sending frequent telemetry, each with a unique identifier, the number of time series can explode.

# Example: Metrics from IoT devices
from prometheus_client import Gauge

device_temperature = Gauge('device_temperature', 'Temperature reported by IoT devices', ['device_id'])

def report_device_temperature(device_id, temperature):
    device_temperature.labels(device_id=device_id).set(temperature)

# Thousands of devices generating telemetry
for device_id in range(1, 10000):
    report_device_temperature(f'device_{device_id}', 22.5)

Each device_id creates a unique metric series. Aggregating devices by region or model can help reduce the number of unique labels and, consequently, the cardinality of the metrics.
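One hedged sketch of that approach is to record a temperature distribution per region instead of a gauge per device (the region names, bucket boundaries, and device-to-region mapping here are assumptions for illustration):

# Sketch: recording a temperature distribution per region instead of one gauge per device
from prometheus_client import Histogram

region_temperature = Histogram(
    'region_temperature_celsius',
    'Distribution of temperatures reported by IoT devices, by region',
    ['region'],
    buckets=(0, 10, 20, 30, 40, 50),  # assumed bucket boundaries
)

def report_region_temperature(region, temperature):
    # Series count depends on regions and buckets, not on how many devices report
    region_temperature.labels(region=region).observe(temperature)

# Thousands of devices, but only a handful of regions
for device_id in range(1, 10000):
    report_region_temperature('eu-west' if device_id % 2 else 'us-east', 22.5)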

Hidden Causes of High Cardinality in Metrics

There are also scenarios that developers typically don't think of when designing metrics, which can lead to unexpected high cardinality:

Frequent Updates with Dynamic Labels: Metrics that include frequently changing label values, such as timestamps or request IDs, can lead to unexpected high cardinality. For example, if you include a request ID in your labels, every request generates a new unique label, leading to an explosion in the number of metric series.

Auto-Scaling and Ephemeral Infrastructure: In environments that use auto-scaling, new instances are frequently spun up and destroyed. If each instance has unique identifiers included in metric labels, such as instance IDs or container IDs, this can cause the number of unique series to grow rapidly. Ephemeral infrastructure makes it particularly easy to underestimate the impact of cardinality.

Debug Labels: Adding detailed debug information as labels to metrics might seem like a good idea during development or troubleshooting, but it can drastically increase cardinality if those labels are not removed afterward. Labels like debug_mode=true or trace_id can introduce an enormous variety of values, especially in production environments.

User-Agent Metrics: Including user-agent strings as labels to track metrics per browser or device type can lead to high cardinality because user-agent strings can vary significantly. Instead of including full user-agent strings, consider categorizing by browser type or version.
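A minimal sketch of that categorization (the substring matching below is a deliberate simplification; a production system would typically use a proper user-agent parsing library):

# Sketch: collapsing raw user-agent strings into a small set of browser families
from prometheus_client import Counter

requests_by_browser = Counter('requests_by_browser', 'Requests grouped by browser family', ['browser'])

def browser_family(user_agent):
    # Crude substring matching for illustration; order matters because Chrome/Edge UAs also contain "Safari"
    ua = user_agent.lower()
    if 'firefox' in ua:
        return 'firefox'
    if 'edg' in ua:
        return 'edge'
    if 'chrome' in ua:
        return 'chrome'
    if 'safari' in ua:
        return 'safari'
    return 'other'

def track_request_browser(user_agent):
    requests_by_browser.labels(browser=browser_family(user_agent)).inc()

track_request_browser('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36')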

Custom Tags: Allowing users to define their own tags or custom attributes can lead to unpredictable cardinality. For example, metrics labeled with user-defined tags can result in a wide variety of unique values, which are difficult to predict and control.

Why Is High Cardinality a Problem in Metrics?

  • Increased Costs: Storage of high-cardinality metrics can get expensive. Every unique label combination generates a separate time series, and if you're generating millions of distinct time series, the storage costs add up. Observability platforms like Datadog charge per custom metric, and Splunk bills based on ingest volume — uncontrolled cardinality directly inflates both. Organizations sending high-cardinality data to these tools often see their observability spending double or triple without any corresponding gain in visibility.
  • Slow Queries: High cardinality makes querying metrics significantly slower. Systems like Prometheus, Datadog, and Grafana struggle when asked to aggregate across many unique time series. Queries that need to aggregate metrics labeled by user IDs or instance IDs, for example, may time out or take an impractical amount of time to execute.
  • Scaling Issues: High cardinality puts immense pressure on the underlying infrastructure, especially in distributed systems. Time-series databases have to manage huge numbers of unique series, which can lead to data imbalance across nodes, leaving some nodes overwhelmed and degrading query performance, availability, and overall reliability.
  • Operational Complexity: Managing a high number of unique metric series can lead to operational complexity. Maintaining indices, managing query performance, and ensuring system stability all become more challenging as cardinality grows. Engineers need to spend more time on infrastructure maintenance and performance tuning, reducing the time available for feature development or improving user experience.

How to Solve High Cardinality Issues

Addressing high cardinality requires a combination of strategies to reduce the number of unique label values in your metrics. These methods work at the instrumentation level — for pipeline-level solutions that handle cardinality automatically, see the telemetry pipeline section below. Here are some effective techniques to help you get started:

Aggregate Labels: Instead of using highly unique labels (e.g., user IDs or instance IDs), consider aggregating metrics at a higher level, such as user cohorts, regions, or service tiers. This allows you to retain valuable insights without incurring the cost of extremely high cardinality.

Bucketing: Group data into predefined buckets to reduce the number of unique values. For instance, rather than storing exact latency values or detailed metrics per request, bucket these values into ranges (e.g., 0-100ms, 100-200ms, etc.). This can significantly lower the number of unique metric series while still providing a clear performance overview.
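In Prometheus terms, this is what a histogram with explicit buckets does. A minimal sketch, with assumed bucket boundaries:

# Sketch: bucketing latency into fixed ranges instead of tracking exact per-request values
from prometheus_client import Histogram

request_latency = Histogram(
    'request_latency_seconds',
    'Request latency bucketed into fixed ranges',
    buckets=(0.1, 0.2, 0.5, 1.0, 2.0),  # assumed boundaries: 0-100ms, 100-200ms, and so on
)

def observe_request(latency_seconds):
    # No matter how many distinct latency values occur, the series count is fixed by the bucket list
    request_latency.observe(latency_seconds)

observe_request(0.137)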

Label Whitelisting: Implement strict policies around which labels can be added to your metrics. Review your metrics and eliminate unnecessary labels that do not provide value. This can prevent metrics from including unpredictable labels, such as request IDs or other dynamically generated values.
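A hedged sketch of how such a policy could be enforced in code (the allow-list, metric name, and label names are assumptions for illustration):

# Sketch: enforcing an allow-list of labels before a metric is recorded
from prometheus_client import Counter

ALLOWED_LABELS = {'service', 'region', 'status'}  # assumed policy

http_requests = Counter('http_requests', 'HTTP requests', ['service', 'region', 'status'])

def record_request(**labels):
    # Silently drop labels that are not on the allow-list (e.g. request_id) instead of creating new series
    filtered = {k: v for k, v in labels.items() if k in ALLOWED_LABELS}
    http_requests.labels(**filtered).inc()

record_request(service='checkout', region='us-east', status='200', request_id='abc-123')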

Use Exemplars: Instead of adding high-cardinality labels to all metrics, use exemplars to trace specific data points. Exemplars can add additional context to a small sample of metrics without significantly increasing overall cardinality. This approach works well in systems like Prometheus.
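A minimal sketch, assuming a recent version of prometheus_client with exemplar support (treat the exact version support as an assumption and check your client library before relying on this):

# Sketch: attaching a trace ID as an exemplar instead of a metric label
from prometheus_client import Histogram

checkout_latency = Histogram('checkout_latency_seconds', 'Checkout request latency')

def observe_checkout(latency_seconds, trace_id):
    # The exemplar rides along with a single observation; it does not create a new time series
    checkout_latency.observe(latency_seconds, exemplar={'trace_id': trace_id})

observe_checkout(0.42, 'abc123def456')

Note that exemplars are only exposed via the OpenMetrics exposition format, so the scrape side needs to support it as well.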

Sample Data: Consider sampling your metrics when tracking high-cardinality values. For instance, instead of logging every single request, log only a sample (e.g., 1%) of requests that contain high-cardinality information. This helps to keep cardinality under control while still retaining insight into the behavior of the system.
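A minimal sketch of that pattern (the 1% rate and metric names are assumptions; a label-free counter keeps the full request count intact):

# Sketch: recording the high-cardinality label for only a small sample of requests
import random
from prometheus_client import Counter

all_requests = Counter('all_requests', 'All requests, with no high-cardinality labels')
sampled_requests = Counter('sampled_requests', 'Sampled requests with endpoint detail', ['endpoint'])

SAMPLE_RATE = 0.01  # assumed 1% sample

def track_request(endpoint):
    all_requests.inc()  # cheap, label-free counter covers every request
    if random.random() < SAMPLE_RATE:
        # Only ~1% of requests contribute to the high-cardinality series
        sampled_requests.labels(endpoint=endpoint).inc()

track_request('/users/12345/profile')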

Normalize Labels: Remove or generalize dynamic parts of labels. For example, instead of labeling metrics with full URLs, replace dynamic components with placeholders (e.g., /users/{user_id}/profile). This approach helps reduce the number of distinct label combinations, significantly lowering cardinality.

Periodic Cleanup: Introduce automated jobs to clean up old, unused, or low-value metrics from your time-series database. High cardinality often results from historical data being retained longer than necessary. Periodic cleanup ensures that only relevant data is retained, which helps maintain system efficiency.
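If Prometheus is the backend, one hedged way to script this is the TSDB admin API (this sketch assumes Prometheus runs at localhost:9090 with --web.enable-admin-api turned on and that the requests package is available):

# Sketch: deleting stale, high-cardinality series via the Prometheus TSDB admin API
import requests

PROMETHEUS_URL = 'http://localhost:9090'  # assumed address

def delete_series(selector):
    # Mark matching series for deletion, then compact tombstones to reclaim space
    requests.post(f'{PROMETHEUS_URL}/api/v1/admin/tsdb/delete_series', params={'match[]': selector}).raise_for_status()
    requests.post(f'{PROMETHEUS_URL}/api/v1/admin/tsdb/clean_tombstones').raise_for_status()

# Example: drop an old per-user metric that is no longer needed
delete_series('user_actions_total{user_id=~".+"}')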

Batch Metrics by Service Level: Aggregate metrics at the service level rather than the individual instance or user level. Metrics like response times or error rates can be tracked by the service or availability zone, which provides sufficient insight for monitoring without the overhead of tracking every individual instance.

The Real Cost of High Cardinality: What It Does to Your Observability Bill

The techniques above work at the instrumentation level, but they require discipline across every team that ships code. In practice, cardinality creeps back in. A new microservice adds user-level labels. A deploy includes debug tags that never get removed. An auto-scaling event spins up thousands of ephemeral containers, each with a unique instance ID in the metrics.

This is where the financial impact gets concrete. Datadog charges per custom metric, and each unique label combination counts as a separate custom metric. A single service with a high-cardinality label can generate tens of thousands of custom metrics, sending your Datadog bill into overage territory. Splunk's license model bills on ingest volume, so high-cardinality data that expands into more time series means more data stored and more license cost.

The pattern is the same across observability platforms: high cardinality → more data → higher costs → budget overruns. Reducing cardinality at the source helps, but at enterprise scale, companies need something that handles it automatically across the entire observability stack — before the data reaches your tools. This is where telemetry pipeline management comes in.

How Telemetry Pipelines Solve High Cardinality at Scale

A telemetry pipeline sits between your applications and your observability platforms (Datadog, Splunk, New Relic, Elastic, Grafana, Honeycomb). It processes telemetry data in transit, which makes it the ideal place to handle cardinality problems before they hit your backend and your bill.

Here's what a telemetry pipeline can do for high cardinality:

Detect cardinality issues automatically. Instead of manually auditing metrics, AI-powered telemetry management tools like Sawmills identify which metrics have spiking cardinality and flag the specific labels causing the explosion — with suggestions for how to fix them. This saves DevOps teams hours of investigation.

Apply aggregation and label reduction in transit. The pipeline can strip high-cardinality labels, aggregate metrics by service or region, or normalize dynamic values, all before sending data to Datadog, Splunk, or Prometheus. You get the cost optimization without changing application code or handling complex configuration changes across services.

Enforce governance policies. Define rules about which labels are allowed, set cardinality thresholds per metric, and let the pipeline automatically throttle or drop metrics that exceed limits. This prevents cardinality from spiking back up after you've cleaned it.

Route data based on value. Not all high-cardinality data is wasteful. Some of it is genuinely useful for debugging. A telemetry pipeline lets you route high-cardinality metrics to low-cost storage for ad-hoc analysis while sending aggregated summaries to your primary observability platform, reducing costs while keeping the data accessible.

Tools like Sawmills, Cribl, and Edge Delta offer these capabilities, and companies looking for alternatives to manual cardinality management are increasingly adopting them. Sawmills differentiates through AI-powered recommendations that identify cardinality optimization opportunities and implement fixes with a single click. For companies managing multiple services and handling telemetry ingestion across multiple environments, this is the difference between fighting cardinality fires manually and having a system that manages it.