
How to Lower Your Observability Bill Without Losing Visibility

Observability
May 12, 2025

If your observability bill keeps growing but your insights aren’t, you’re not alone. In this article, we break down what’s really driving up your telemetry costs—things like duplicate logs, noisy metadata, and data you probably don’t even need—and show you how to fix it.

Prefer to read instead? Or want to dive deeper into the examples and YAML config? Everything we cover in the video is explained in the blog below.

The Problem: We're drowning in telemetry data

Telemetry data has become incredibly complex—logs, metrics, and traces are now coming from dozens of sources, each with its own format, granularity, and retention policy. And the volume? It’s exploding. This explosion in data is driving runaway costs and creating chaos:

  • Microservices, Kubernetes, and autoscaling have led to massive growth in logs, metrics, and traces
  • Teams set up their own pipelines with no guardrails, creating inconsistent policies for volume, cardinality, and retention
  • There’s no visibility into what’s being ingested or why—so redundant and low-value data quietly eats up budget
  • And one bad config or noisy log can break RCA or blow the entire budget

And by the way, most of that data is never even queried—over 80%, in fact.

DevOps Bears the Burden

While developers are shipping the telemetry, DevOps is the one paying for it. And the system isn’t set up for accountability:

  • App teams aren’t measured on data hygiene or observability cost
  • Developers often don’t know their code is generating high-cardinality metrics or spammy logs
  • Most telemetry issues can only be fixed in the code—but DevOps doesn’t own the code

What can we do to stop the flood of irrelevant telemetry?

Most observability pipelines are full of logs that no one ever looks at. These messages come from frameworks, libraries, health checks, and infrastructure—but rarely help during an incident or postmortem.

For example, it’s common for debug logs like DEBUG: Entering function getUser() to remain active in production. These logs can generate thousands of lines per minute across a fleet of services—all of which are stored and indexed, even if no one ever queries them.

Health checks are another silent offender. Requests to /readiness or /liveness endpoints happen constantly, and each one creates a nearly identical log line. The same goes for access logs for static files like /logo.png or /favicon.ico. They’re technically correct, but operationally useless.

You’ll often find infrastructure logs too—messages like “Container xyz started”—that lack context and don’t offer insight into what the application is doing.

Individually, these messages may seem harmless. But they quietly add up to a massive volume of ingest—and a massive chunk of your observability bill.

The solution is to clean at the source. Instead of letting junk telemetry flow into your backend, apply filters at the edge. Use OpenTelemetry processors or your log shipper to drop low-severity logs or known-noise patterns.

Here’s how you’d drop all DEBUG logs before they enter your pipeline:

processors:
  filter/drop_debug_logs:
    logs:
      exclude:
        match_type: strict
        severity_texts:
          - DEBUG

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [filter/drop_debug_logs]
      exporters: [logging]

This kind of filtering can cut log volume significantly—without losing any real observability.
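
The same approach works for the health-check and static-file noise described earlier. Below is a minimal sketch that assumes your receiver stores the request path in a url.path log attribute; the attribute name and paths are placeholders, so match them to what your logs actually contain.

processors:
  filter/drop_noise_paths:
    logs:
      log_record:
        # Each condition drops matching records; multiple conditions are ORed together
        - 'attributes["url.path"] == "/readiness"'
        - 'attributes["url.path"] == "/liveness"'
        - 'attributes["url.path"] == "/favicon.ico"'
        - 'attributes["url.path"] == "/logo.png"'

Add filter/drop_noise_paths to the processors list of the same logs pipeline as the DEBUG filter above.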

When too much of the same thing becomes a problem

Sometimes the problem isn’t the content of the log—it’s the fact that it’s being emitted thousands of times.

Think of an error like failed to connect to database. You need to know about it. But you don’t need 10,000 identical lines per minute telling you the same thing.

Repeated errors like authentication failures, timeouts, or “resource not found” logs often flood observability systems during an outage or retry loop. Each message is valid. But after the first few, they stop telling you anything new.

Rather than index every duplicate line, a better approach is to sample or aggregate. Sampling keeps only a portion of repeated messages—like 1 in every 100; in an OpenTelemetry Collector pipeline, the probabilistic_sampler processor (which supports logs as well as traces) can do this. Aggregation combines them into summaries that show volume over time.

processors:
  probabilistic_sampler/duplicate_errors:
    # Keep roughly 1 in every 100 of the log records that reach this processor
    sampling_percentage: 1

When you're logging the same thing over and over again

High-volume logs don’t always come from errors. Sometimes it’s just normal usage—API calls that succeed, dashboards that load, and users that log in. But when the same request is made over and over again, those logs become a cost multiplier.

Take these real examples:

[15/Mar/2025:08:01:44 +0000]  GET  /api/users  200  44ms
[15/Mar/2025:08:01:46 +0000]  GET  /api/users  200  46ms
[15/Mar/2025:08:01:48 +0000]  GET  /api/users  200  43ms
[15/Mar/2025:08:01:50 +0000]  GET  /api/users  200  45ms

Each of these is technically different, but operationally the same. Indexing each one adds no value.

Instead, aggregate those entries into a single line:

[SUMMARY] /api/users — 124 requests in last 60s, avg latency: 45ms, error rate: 0%

This reduces log volume while still capturing the behavior you care about.
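
If you want the collector to do this aggregation for you, one option is the logdedup processor, assuming your opentelemetry-collector-contrib build includes it. It won’t compute average latency or error rates, and lines that differ in a field like latency only collapse if that field is excluded from the comparison, but it does turn floods of identical records into a single record with a hit count.

processors:
  logdedup:
    # Emit one representative record per unique log every 60 seconds,
    # with a log_count attribute recording how many times it was seen
    interval: 60s
    log_count_attribute: log_count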

Third-party tools are talking too much

It’s not just your code that’s creating telemetry. Open-source tools and infrastructure components often emit logs at high frequency—even when nothing’s wrong.

CoreDNS, for instance, logs every DNS resolution. ArgoCD logs every application sync. gRPC libraries can flood logs with retry attempts. External-secrets and ORMs also generate constant state updates.

These logs rarely surface in queries or alerts. But they’re still being stored and billed.

You can reduce this noise by identifying the sources—such as container names or namespaces—and filtering by severity. For example, you might choose to only keep warning and error logs from ArgoCD.

...
processors:
  filter/argocd_errors_only:
    logs:
      log_record:
        # Drop ArgoCD container logs below WARN; logs from other sources pass through untouched
        - 'IsMatch(resource.attributes["k8s.container.name"], "argocd-.*") and severity_number < SEVERITY_NUMBER_WARN'

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [filter/argocd_errors_only]
      exporters: [logging]

Metadata bloat: the invisible cost driver

Even when a log message is valuable, the metadata attached to it can cause trouble.

Kubernetes, container runtimes, and cloud environments attach dozens of fields to every log line: pod UIDs, image IDs, node IPs, file paths, and more. These fields are often identical across thousands of messages—and rarely queried.

Yet every field adds weight to the payload, increases cardinality, and drives up storage and indexing costs.

If you’re not using a field to filter or alert, drop it.

processors:
  attributes/drop_fields:
    actions:
      - key: kubernetes.pod.replicaSet
        action: delete
      - key: cloud.datacenter
        action: delete
      - key: filesystem.log_dir
        action: delete
      - key: filesystem.log_file
        action: delete
      - key: kubernetes.pod.uid
        action: delete

service:
  pipelines:
    logs:
      receivers: [your_log_receiver]
      processors: [attributes/drop_fields]
      exporters: [your_exporter]

Not everything needs to be a log

Some telemetry is useful—but doesn’t need to be in your logs.

Access logs for static files, regular status updates like queue depth, or cache hits are better represented as metrics—or sent to cold storage if you need to retain them.

These logs are rarely queried, but they come with full indexing and storage costs if you treat them like application logs.

Instead, you can route them elsewhere. One way to do this in an OpenTelemetry Collector pipeline is to count health-check records with the count connector, expose the resulting metric to Prometheus, and drop the original log lines before they reach your log backend:

...
connectors:
  count:
    logs:
      http_healthcheck_requests:
        description: Number of health-check log records seen
        conditions:
          - 'attributes["log_type"] == "healthcheck"'

processors:
  filter/drop_healthchecks:
    logs:
      log_record:
        # Remove health-check records from the log pipeline once they are counted
        - 'attributes["log_type"] == "healthcheck"'

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    logs/counting:
      receivers: [filelog]
      exporters: [count]
    logs/backend:
      receivers: [filelog]
      processors: [filter/drop_healthchecks]
      exporters: [logging]
    metrics:
      receivers: [count]
      exporters: [prometheus]
...

The retention trap

Even when data is useful, keeping it around too long can become a hidden cost driver. Retention policies are often set broadly across teams and data types—"keep everything for 30 days"—whether or not that data is actually needed.

The result is an expensive log warehouse full of telemetry that was never queried past hour one. Retention-based pricing models make this worse: the longer your data sits, the more you pay.

The fix: apply more targeted retention strategies. Keep critical logs for 30 days, but drop routine traffic or health checks after 24 hours. Some data may only need to live in cold storage (like S3), or not at all.

You can also export filtered or time-bound subsets to external storage before deletion, ensuring auditability without high retention cost.
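
As a sketch of that last idea, assuming the awss3 exporter from the Collector contrib distribution and a purely illustrative bucket name, you could tee routine logs into object storage while your primary backend keeps only a short retention window.

exporters:
  awss3:
    s3uploader:
      region: us-east-1
      s3_bucket: telemetry-archive
      s3_prefix: routine-logs

service:
  pipelines:
    logs/archive:
      # Cheap long-term copy in S3; the main logs pipeline keeps exporting to
      # your primary backend, where retention can stay short
      receivers: [filelog]
      exporters: [awss3]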

Why manual optimization doesn't scale

Could you fix all of this manually? Sure. But you’d need to:

  • Audit log messages across every microservice
  • Work with dev teams to reduce verbosity
  • Coordinate consistent filters, samplers, and routes
  • Manually tune attributes and retention per log type

And you'd have to repeat that work every time new services or dependencies are added.

The truth is, most teams don’t have time to babysit their observability pipelines. Manual optimization takes effort, and without dedicated ownership, it never becomes a priority—until the bill arrives.

Where Sawmills fits in

That’s where Sawmills comes in.

Sawmills is the first smart telemetry management platform—built on the OpenTelemetry Collector and powered by AI. It gives DevOps and platform teams real-time control over what gets ingested, stored, and routed—without having to build and maintain filters by hand.

With Sawmills, you can:

  • Cut observability costs by 50–90% without losing visibility
  • Drop junk data at the edge with intelligent, AI-driven filtering
  • Route logs, metrics, and traces by team, use case, or environment
  • Apply retention and policy enforcement at the pipeline level
  • Act in context, directly from the UI—drop a field, filter a pattern, or reroute traffic with one click

It’s not a dashboard. It’s not another backend. It’s the control layer that helps you scale observability without scaling cost.

If you're ready to stop paying for noise, book a demo and see how Sawmills helps you get the data you need—without everything you don’t.