Telemetry pipelines: the control layer between your services and your observability bill

Observability
Jun 15, 2026

A telemetry pipeline is the layer that sits between the services emitting logs, metrics, and traces and the backend that stores and queries them. It collects telemetry from many sources, shapes it in flight by filtering, sampling, transforming, and enriching, and routes each signal to the right destination. Everything that decides what telemetry deserves to exist, in what shape, and at what cost happens here, before the data hits a billing meter.

Most teams discover the pipeline backward. They run Datadog or Grafana Cloud or Splunk, the bill climbs faster than traffic, and someone goes looking for the place to intervene. The answer is almost never the backend. By the time data lands there, you have already paid to ingest, parse, index, and store it. The leverage is upstream, in the pipeline, where you still get to decide what to keep. Grafana's 2025 Observability Survey found that 74% of respondents now treat cost as a top priority when selecting observability tools, and that 41% increased their observability spend in the past year. The pipeline is where that pressure gets resolved or ignored.

This is a reference for the platform engineer who owns that decision. It covers what a telemetry pipeline actually is, how the components fit together, where the cost and reliability decisions live, and why a pipeline that works on day one quietly stops working six months later.

What a telemetry pipeline actually is

Strip away the vendor framing and a telemetry pipeline is three stages: ingest, process, route. In the OpenTelemetry world, the OpenTelemetry Collector is the canonical implementation, and its component model maps directly onto those stages.

Receivers ingest telemetry. They speak OTLP, scrape Prometheus endpoints, tail container log files, pull kubelet stats, or accept Jaeger and Fluentd formats. Processors transform telemetry while it is in memory: dropping it, sampling it, redacting fields, enriching it with Kubernetes metadata, batching it. Exporters send the result to one or more backends. A pipeline wires a chain of these together for a single signal type, and one Collector can run many pipelines at once.

service:
  pipelines:
    logs:
      receivers: [filelog, otlp]
      processors: [memory_limiter, k8sattributes, filter/drop_noise, batch]
      exporters: [otlphttp/backend]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite/backend]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/backend]
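The pipeline names above only resolve if each referenced component is defined elsewhere in the same file. A minimal sketch of the receiver and exporter definitions follows, assuming the default OTLP ports. The endpoints, log path, scrape target, and job name are placeholders, not values from any real deployment, and the processors are configured separately.

```yaml
# Sketch only: hostnames, paths, and job names are placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  filelog:
    include:
      - /var/log/pods/*/*/*.log
  prometheus:
    config:
      scrape_configs:
        - job_name: example-app
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:9090"]

exporters:
  otlphttp/backend:
    endpoint: https://otlp.example.com
  prometheusremotewrite/backend:
    endpoint: https://metrics.example.com/api/v1/write
```

Component IDs follow a type/name convention, so otlphttp/backend is an otlphttp exporter that happens to be named backend. That is what lets one Collector run several differently configured instances of the same component.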

The pipeline is not a forwarder. It is a processing engine. The difference matters because the forwarder mental model leads teams to treat the pipeline as plumbing and put all their effort into the backend, where the expensive decisions are no longer theirs to make. We have written before about how the OTel Collector's modular design is exactly what makes it so capable and so easy to misoperate, and the same tension runs through every telemetry pipeline regardless of which tool implements it.

The three jobs, and why the middle one is where the money is

Ingest is mostly a coverage problem. You either collect from a source or you do not, and the OpenTelemetry Collector configuration model makes adding a receiver cheap. Routing is mostly a correctness problem: audit logs go to the durable store, errors go to the fast-query backend, debug logs go to cheap cold storage or nowhere.
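One way to express that kind of routing is the routing connector from the Collector contrib distribution, which acts as an exporter for an intake pipeline and a receiver for the downstream ones. The sketch below is a two-way split under stated assumptions: log.class is a hypothetical resource attribute set upstream, and otlphttp/durable and otlphttp/cold are hypothetical exporters defined elsewhere. Routing errors by severity would need a condition on the log record rather than the resource, so check the connector documentation for the contexts your version supports.

```yaml
connectors:
  routing:
    error_mode: ignore
    # Anything that matches no rule falls through to cheap storage.
    default_pipelines: [logs/cold]
    table:
      # "log.class" is a hypothetical resource attribute set upstream.
      - statement: route() where attributes["log.class"] == "audit"
        pipelines: [logs/audit]

service:
  pipelines:
    logs/intake:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [routing]
    logs/audit:
      receivers: [routing]
      processors: [batch]
      exporters: [otlphttp/durable]
    logs/cold:
      receivers: [routing]
      processors: [batch]
      exporters: [otlphttp/cold]
```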

The middle stage, processing, is where cost is decided, and it is the stage most teams underinvest in. A pipeline that ingests everything and routes everything is just an expensive relay. The processors are what let you keep the signal and drop the noise before either becomes billable.

Three processors carry most of that weight. The filter processor drops telemetry that matches a condition. The transform processor reshapes attributes and bodies using OTTL. The tail_sampling processor decides which traces to keep based on what actually happened in them. Get these three right and the bill follows. Get them wrong and no backend negotiation will save you.

Where the cost decisions live

Start with the filter processor, because dropping data you never needed is the cheapest win available. Health-check and readiness-probe traffic is the canonical example: high volume, zero diagnostic value, and trivially matchable.

processors:
  filter/drop_noise:
    # Ignore OTTL evaluation errors instead of failing the batch.
    error_mode: ignore
    # A record or span is dropped when any listed condition is true.
    logs:
      log_record:
        - 'IsMatch(body, ".*GET /healthz.*")'
        - 'IsMatch(body, ".*GET /readyz.*")'
        - 'IsMatch(body, ".*kube-probe.*")'
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'

Next is cardinality, which is where small instrumentation mistakes turn into large bills. Cardinality is the number of unique attribute combinations on a metric, and it multiplies. A request.duration metric tagged with service and env is cheap: with, say, 20 services across 3 environments, it tops out at 60 series. Add user_id, pod_name, and request_id and you have manufactured millions of time series from a single metric definition, because each new attribute multiplies that ceiling by its number of distinct values. Ephemeral Kubernetes infrastructure makes this worse, because pod names and container IDs churn constantly even when traffic is flat. The pipeline is the right place to strip these dimensions, using the transform processor and OTTL to delete the offending keys before they reach the backend.

processors:
  transform/reduce_cardinality:
    error_mode: ignore
    metric_statements:
      # Statements run against every metric data point.
      - context: datapoint
        statements:
          # delete_key removes the attribute if it is present.
          - delete_key(attributes, "user_id")
          - delete_key(attributes, "request_id")
          - delete_key(attributes, "pod_name")

Traces are the third lever, and head-based sampling is the wrong default for production. Deciding whether to keep a trace at its start is blind to whether that trace ended in an error or a two-second latency spike, which are exactly the traces you want. Tail sampling makes the decision after spans complete, so you can keep every error and every slow request while sampling routine traffic down hard.

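processors:
  tail_sampling:
    # How long to wait after a trace's first span before deciding.
    decision_wait: 10s
    # Maximum number of traces held in memory at once.
    num_traces: 50000
    # A trace is kept if any policy below votes to sample it.
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 10}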

There is an architectural constraint hiding in that config. Tail sampling has to hold all spans for a trace in memory until it decides, which means every span for a given trace must reach the same Collector instance. That rules out making the decision at the node level, where spans for one request scatter across agents. Tail sampling belongs at a gateway, often fronted by the loadbalancing exporter that routes by trace ID. This single constraint shapes the whole topology of a serious pipeline.
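The trace-ID routing described above can be sketched with the loadbalancing exporter on the tier in front of the gateways. This is a minimal example under stated assumptions: the headless Service hostname is hypothetical, in-cluster traffic is assumed to be plaintext, and the gateway Collectors it resolves to run the tail_sampling processor.

```yaml
# Tier in front of the gateways: fan spans out by trace ID so that
# every span of a trace lands on the same gateway instance.
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumption: plaintext in-cluster traffic
    resolver:
      dns:
        # Hypothetical headless Service that resolves to every gateway pod.
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [loadbalancing]
```

When gateway pods are added or removed, the set of backends changes and some traces in flight can be split across instances, so it is worth testing sampling behavior during a scale event rather than only at steady state.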

Agent and gateway: where each decision belongs

The deployment question that follows is whether to run the pipeline as an agent, a gateway, or both. The OpenTelemetry deployment guidance lays out the patterns, but the practical answer for most Kubernetes teams is both, with a clear division of labor.

Agents run as a DaemonSet, one per node, with access to the host filesystem and local pod telemetry. They are the right place for node-local work: tailing container logs, doing first-pass noise filtering, redacting obvious secrets, and buffering against transient downstream slowness. Doing this work at the edge means low-value logs never leave the node, which cuts network egress and the load on everything downstream.

Gateways run as a standalone Deployment that receives telemetry from the agents. They are the right place for decisions that need a global view: tail sampling, cross-service routing, centralized policy enforcement, and final export to backends.

The processor order inside each tier matters. The recommended ordering puts memory_limiter first to protect the Collector from OOM, then enrichment with k8sattributes, then filtering and sampling, then transformation, with batch last just before the exporters. Reorder these and you either pay to transform data you are about to drop or you batch before you filter and lose the efficiency batching was supposed to buy.
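A minimal sketch of an agent-tier logs pipeline in that order is shown below. The gateway address, the transform/redact name, and the redaction pattern are illustrative assumptions, and the k8sattributes association rules and RBAC that a real DaemonSet needs are omitted.

```yaml
# Agent tier (DaemonSet). The gateway address is a placeholder.
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25
  k8sattributes: {}
  filter/drop_noise:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(body, ".*kube-probe.*")'
  transform/redact:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          # Illustrative pattern only; real secret detection needs more care.
          - replace_pattern(body, "password=[^ ]+", "password=[REDACTED]")
  batch: {}

exporters:
  otlp/gateway:
    endpoint: otel-gateway.observability.svc.cluster.local:4317
    tls:
      insecure: true   # assumption: plaintext in-cluster traffic

service:
  pipelines:
    logs:
      receivers: [filelog]
      # memory_limiter first, enrich, filter, transform, batch last.
      processors: [memory_limiter, k8sattributes, filter/drop_noise, transform/redact, batch]
      exporters: [otlp/gateway]
```

Filtering runs before redaction here so the regex replacement is only paid for on records that survive the drop rules.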

If you are choosing the collection engine for the agent tier specifically, the tradeoffs between Vector and the OTel Collector for log collection come down to whether your immediate problem is log-volume ergonomics or org-wide telemetry standardization.

Reliability is a pipeline decision too

Cost gets the attention, but reliability is what determines whether the pipeline survives a real incident. Every pipeline needs an explicit answer to one question: what happens when the destination slows down or goes offline.

The building blocks are memory_limiter to cap memory pressure, exporter sending queues with retry, and per-stream backpressure policy. The judgment call is per signal class. Audit logs and severe errors usually justify disk buffering and blocking backpressure, because losing them is worse than slowing ingestion. Debug and info logs usually justify dropping under pressure, because protecting the node and the critical telemetry matters more than preserving low-value data. A single global backpressure policy is how teams either lose evidence during an incident or crash a node trying to preserve logs nobody will read.

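processors:
  memory_limiter:
    # How often memory usage is measured.
    check_interval: 1s
    # Hard limit, as a percentage of total available memory.
    limit_percentage: 80
    # Headroom for spikes; the soft limit is 80 - 25 = 55 percent.
    spike_limit_percentage: 25
  batch:
    # Send once this many items are buffered, regardless of the timeout.
    send_batch_size: 8192
    # Send after this long, regardless of batch size.
    timeout: 200ms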
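The per-signal judgment call lives mostly on the exporters, through their retry and sending queue settings. The sketch below assumes two hypothetical exporters, one for audit logs and one for debug logs, plus the file_storage extension from the contrib distribution. The endpoints, directory, and queue sizes are placeholders to tune, not recommendations.

```yaml
extensions:
  file_storage/queue:
    # The directory must exist and be writable by the Collector.
    directory: /var/lib/otelcol/queue

exporters:
  otlphttp/audit:
    endpoint: https://audit.example.com
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0s   # 0 removes the retry deadline
    sending_queue:
      enabled: true
      queue_size: 5000
      storage: file_storage/queue   # persist the queue to disk
  otlphttp/debug:
    endpoint: https://logs.example.com
    retry_on_failure:
      enabled: true
      max_elapsed_time: 60s   # give up on a batch after a minute
    sending_queue:
      enabled: true
      queue_size: 500   # small in-memory queue for low-value data

service:
  extensions: [file_storage/queue]
```

Whether a full queue blocks the pipeline or rejects new data depends on the Collector version and queue settings, so verify that behavior against the exporter helper documentation for the release you run.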

This is also why benchmarking a pipeline on raw throughput is misleading. The numbers that matter are the ones under your real transforms, your real compression, and your real failure modes: destination unavailable for five, fifteen, and sixty minutes, with disk filling and memory climbing. A pipeline that wins a synthetic throughput test can still fall over the first time a backend has a bad afternoon.

Why a working pipeline quietly stops working

A telemetry pipeline is not a configuration you write once. It is a control problem that never stops, because the telemetry it governs never stops changing.

New services ship with default instrumentation and nobody trims it. A deployment introduces a dynamic label that multiplies time series overnight. An incident flips on debug logging and it stays on for a month. A new team starts emitting a high-volume event that no filter rule anticipated. Every one of these is a new source of waste, and a static pipeline config does not notice any of them. The CNCF community has been blunt about this: pipelines optimized for ingestion rather than insight give teams the illusion of control while the volume, and the bill, keep climbing.

The traditional response is to write more rules. Exclusion filters, sampling configs, cardinality limits, hand-maintained OTTL. That helps until the next change, and the next change always comes. The labor of continuously noticing new waste, understanding why it is waste, and adjusting the pipeline without breaking the signal someone pages on is exactly the kind of cross-cutting, never-finished work that loses to feature delivery every sprint. The dead zone is structural: the developers creating the telemetry do not feel the cost, and the platform team feeling the cost does not own the source. This is the same dynamic we traced in detail in why your Datadog bill keeps growing, and it is why a cheaper backend never fixes the problem. The data follows you into the new contract.

How Sawmills approaches this

OpenTelemetry solved the wire format, the data model, and the collection engine. What it does not do is decide what telemetry should exist, what to sample, what to drop, or how to keep those decisions current as services multiply. The Collector executes the policy you give it. It does not write the policy. That is the gap.

Sawmills is the agentic operator that runs in that gap. It sits in the telemetry pipeline, upstream of Datadog, Splunk, Grafana, or whatever backend you already run, and it works the processing stage continuously: identifying which log sources generate volume nobody queries, which metric attributes are detonating cardinality with no debugging value, and which services are tracing routine traffic at full fidelity when only errors and tail latency matter. Your DevOps team defines the guardrails. Developers self-serve fixes inside them. The agent decides what to filter, sample, transform, and route in real time, and it adapts as your architecture changes rather than waiting for the next quarterly audit. Built on OpenTelemetry, it operates the collectors and pipelines you already have instead of asking you to migrate.

The outcome is a pipeline that keeps the signal you operate on and stops paying for the noise you do not, without a standing rotation of engineers maintaining suppression rules by hand. Schedule a demo to see Sawmills running as the operator on a telemetry pipeline shaped like yours.