The OpenTelemetry Collector in production: the processor pipeline that decides what your backend ever sees

Observability
Jul 2, 2026
Jul 01, 2026

Many teams install the OpenTelemetry Collector from a Helm chart, point it at a backend, watch data flow into Datadog or Grafana or New Relic, and consider the job done. The pipeline works. Telemetry arrives. Dashboards populate. And then six months later the ingest bill is up 40%, the collector pods are getting OOMKilled during traffic spikes, and nobody can explain why the trace they needed during last week's incident was never sampled.

None of that is the collector failing. It is the collector doing exactly what it was configured to do. The receivers receive, the exporters export, and the part in the middle that actually decides what telemetry deserves to reach a backend sits empty or runs defaults. The OTel Collector is the most consequential component in a production telemetry stack and the one most teams configure the least. A recurring r/OpenTelemetry thread on collector performance at scale is the same question asked a dozen ways: it ran fine in the demo, so why does it fall over in production?

The answer is that a demo pipeline and a production pipeline share a config format and nothing else. This piece is about the difference, and specifically about the processor decisions that determine your bill, your signal quality, and whether your on-call engineer can find the trace they need at 3am.

The collector is a processing engine, not a forwarder

Start with the framing, because the wrong mental model produces the wrong configs. The OpenTelemetry Collector is not a log shipper that happens to speak OTLP. It is a vendor-neutral processing engine built from five component types, and the processing is the point.

Receivers ingest telemetry (OTLP, prometheus, filelog, kubeletstats, and dozens more). Exporters send it onward to one or more backends. Processors transform telemetry in flight, and this is where filtering, enrichment, sampling, redaction, and batching happen. Connectors bridge one pipeline into another, so the spanmetrics connector can read spans and emit derived metrics. Extensions add capabilities like health checks and pprof that are not tied to telemetry flow.
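To make the five component types concrete, here is a minimal sketch of a config that declares one of each. The endpoint is a placeholder, and the spanmetrics connector is shown with default settings purely to illustrate how a connector appears as an exporter in one pipeline and a receiver in another.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

connectors:
  spanmetrics:            # reads spans, emits derived metrics

exporters:
  otlphttp/backend:
    endpoint: https://backend.example.com   # placeholder, not a real endpoint

extensions:
  health_check:
  pprof:

service:
  extensions: [health_check, pprof]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [spanmetrics, otlphttp/backend]   # connector used as an exporter
    metrics:
      receivers: [spanmetrics]                     # same connector used as a receiver
      processors: [batch]
      exporters: [otlphttp/backend]
```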

Those components get wired into pipelines, one per signal type, inside the service block. A single collector commonly runs three at once:

service:
  pipelines:
    logs:
      receivers: [filelog, otlp]
      processors: [memory_limiter, k8sattributes, filter/drop_noise, batch]
      exporters: [otlphttp/backend]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheusremotewrite/backend]
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlphttp/backend]


Read that processors list on each pipeline. That ordered array is the entire governance surface of your telemetry stack. Everything you will ever decide about what to keep, what to drop, what to enrich, and what to pay for is expressed there. The receivers and exporters are plumbing. The processors are policy.

Processor order is the whole game

Processors execute in the order you list them, and the order changes both correctness and cost. The OTel project documents a recommended ordering that holds for almost every production pipeline, and it is worth understanding why each position matters rather than copying it blindly.

memory_limiter goes first. It protects the collector from running itself out of memory by refusing data when usage crosses a threshold, which forces backpressure upstream instead of an OOMKill. If it runs after expensive processors, the collector has already spent the memory you were trying to protect.

processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 25

Enrichment comes next. Processors like k8sattributes and resource add context (pod name, namespace, deployment) that later filtering and routing rules depend on. You cannot filter on k8s.namespace.name before the processor that attaches it has run.
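A minimal enrichment sketch might look like the following. The metadata list and the deployment.environment value are illustrative assumptions; extract only the fields your later filter and routing rules actually reference.

```yaml
processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.node.name
  resource/env:
    attributes:
      - key: deployment.environment
        value: production      # assumed value for illustration
        action: upsert
```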

Filtering and sampling come third, after enrichment but before transformation, because there is no reason to spend CPU transforming telemetry you are about to drop. Transformation runs fourth. And batch always runs last, just before the exporters, grouping telemetry to cut network round-trips and per-request overhead on the backend side.

processors:
  batch:
    send_batch_size: 8192
    timeout: 200ms
    send_batch_max_size: 10000

Get this order wrong and you do not get an error. You get a pipeline that quietly wastes CPU, drops the wrong data, or enriches records that were already discarded. The collector will run your bad ordering forever without complaint, which is exactly why so many teams ship it.
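Put together, the ordering described above could be sketched like this for a logs pipeline. The transform/redact processor and the attribute it deletes are hypothetical stand-ins for whatever transformation you run; the other names match the snippets earlier in this post.

```yaml
processors:
  transform/redact:            # hypothetical transformation step
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - delete_key(attributes, "http.request.header.authorization")

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors:
        - memory_limiter       # 1. protect the process before doing any work
        - k8sattributes        # 2. enrich, so later rules can see namespace and pod
        - filter/drop_noise    # 3. drop before spending CPU on transformation
        - transform/redact     # 4. transform only what survived
        - batch                # 5. batch just before export
      exporters: [otlphttp/backend]
```

Swapping any two entries in that list still produces a valid config, which is why the ordering has to be reviewed by a person rather than left to validation.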

Where cost gets decided: filtering, cardinality, and sampling

If observability spend is climbing faster than headcount or traffic, the cause is almost always telemetry that should never have reached the backend. As we argued in "Why your Datadog bill keeps growing," this is a data-growth problem, not a tooling problem, and the collector is the place to solve it because it sits upstream of the billing meter. Three processors do most of the work.

The filter processor drops telemetry that matches a condition. Health-check spam is the canonical example, and dropping it at the agent means it never crosses the network or hits ingest:

processors:
  filter/drop_noise:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(body, ".*GET /healthz.*")'
        - 'IsMatch(body, ".*GET /readyz.*")'
    traces:
      span:
        - 'attributes["http.route"] == "/healthz"'
        - 'attributes["http.route"] == "/readyz"'

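The transform processor, driven by OTTL, handles cardinality. Metric cost scales with the number of unique attribute combinations, which is bounded by the product of each attribute's distinct values. A metric with 5 methods, 20 routes, and 4 status classes tops out at 400 time series; add a single unbounded attribute like user_id or pod_uid with 100,000 values and the ceiling becomes 40 million. One caveat: once a key is deleted, data points that differed only by that key share an identical attribute set, so check how your backend treats them or aggregate after the delete. OTTL lets you drop the offending key before it becomes billable: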

processors:
  transform/cardinality:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user_id")
          - delete_key(attributes, "request_id")

Traces are where the largest savings hide, because full-fidelity capture preserves perfect forensic detail for traffic nobody will ever inspect. The tail_sampling processor decides after spans complete, so you can keep every error and every slow trace while sampling routine traffic down hard:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow
        type: latency
        latency: {threshold_ms: 1000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 10}

That policy set is what most production trace pipelines actually need. A trace is kept if any one of the policies matches, so the result is: keep everything that ended in an error, keep everything slower than one second, and keep a 10% sample of the boring remainder. Head-based sampling with the probabilistic_sampler processor is cheaper but blind to span content, so it will happily discard the error trace you needed. The tradeoff is that tail sampling has to hold spans in memory for the decision_wait window, which has a deployment consequence we will get to next.
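For contrast, a head-based setup is a sketch as small as this. The 10% figure is chosen only to mirror the tail-sampling example; the decision is made from the trace ID alone, without waiting for the trace to finish.

```yaml
processors:
  probabilistic_sampler:
    sampling_percentage: 10   # keeps roughly 10% of traces, errors included or not
```

There is no window to hold spans in, which is why it costs so little memory, and no view of status or latency, which is why it cannot favor the traces you care about.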

Agent, gateway, or both: the deployment decision that constrains everything

How you deploy the collector decides which of those processors you can even use. There are three patterns, and the choice is not cosmetic.

An agent runs as a DaemonSet, one collector per node, with access to the host filesystem and local pod telemetry. It is the natural place for log collection and node-local filtering. A gateway runs as a standalone collector cluster that receives telemetry from agents and SDKs and centralizes processing and routing. A sidecar runs a collector in each application pod, which gives isolation at the cost of multiplying resource usage across every replica.

Most production stacks run agent plus gateway: agents do node-local collection and cheap filtering, gateways do centralized policy enforcement, sampling, and routing to backends. The reason this matters for the previous section is that tail sampling does not work on agents for any trace that crosses nodes. All spans for a single trace must arrive at the same collector instance for a tail decision to be made, and in a DaemonSet those spans are scattered across every node the request touched. Tail sampling belongs on the gateway, and if you run more than one gateway replica you need the loadbalancing exporter in the tier that feeds those replicas, routing spans by trace ID so a whole trace lands on one instance.
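As a rough sketch, the agent side of that arrangement could look like the config below. The headless Service hostname is a hypothetical name, and the insecure TLS setting assumes in-cluster traffic you have decided not to encrypt; adjust both to your environment.

```yaml
exporters:
  loadbalancing:
    routing_key: traceID          # all spans of one trace go to one backend
    protocol:
      otlp:
        tls:
          insecure: true          # assumption: plaintext inside the cluster
    resolver:
      dns:
        # hypothetical headless Service that resolves to every gateway pod
        hostname: otel-gateway-headless.observability.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]
```

The gateways behind that hostname then run the tail_sampling processor, each one seeing complete traces for its share of trace IDs.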

This is the kind of constraint that does not show up in a demo, where everything runs in one collector on one node and tail sampling appears to work fine. It shows up in production, the first time you scale the gateway horizontally and your sampling decisions start fragmenting because half the spans for each trace went to the wrong replica.

Where collectors break under load

Reliability, not throughput, is what determines whether the collector survives a real incident. The collector can sustain serious volume (the project runs continuous load tests on every commit to the contrib repo), but raw capacity is rarely the thing that fails. Configuration is.

The k8sattributes processor is a frequent culprit. It enriches telemetry with pod metadata obtained from the Kubernetes API, which means it needs RBAC permissions and adds memory and latency on the enrichment path. Under a pod-churn storm, where ephemeral pods rotate faster than the processor can resolve metadata, it becomes a bottleneck precisely when you have the most telemetry to process.
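One way to shrink that exposure on agents is to restrict the processor to pods on its own node. The sketch below assumes the DaemonSet injects the node name into an environment variable called KUBE_NODE_NAME through the downward API, and the ClusterRole name is made up for illustration.

```yaml
# Collector config on the agent
processors:
  k8sattributes:
    filter:
      node_from_env_var: KUBE_NODE_NAME   # only track pods scheduled on this node
---
# RBAC the processor needs to read pod metadata
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector-k8sattributes      # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "watch", "list"]
```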

Backpressure is the second. When a backend slows down or goes offline, the collector has to decide what happens to the telemetry it cannot export: buffer it, apply backpressure upstream, or drop it. That decision should differ by signal value. Audit logs and error traces deserve durable buffering. Debug logs do not. A single global queue policy is how you either lose the evidence you needed or crash a node trying to preserve data that did not matter.
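A per-signal queue policy can be sketched by giving each class of data its own exporter instance. The exporter names, endpoint, directory, and every number here are assumptions for illustration, not recommendations.

```yaml
extensions:
  file_storage/queue:
    directory: /var/lib/otelcol/queue       # assumed path; must exist and be writable

exporters:
  otlphttp/audit:                            # high-value data: durable queue, long retries
    endpoint: https://backend.example.com   # placeholder
    sending_queue:
      enabled: true
      storage: file_storage/queue           # queue survives a collector restart
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 10m
  otlphttp/debug:                            # low-value data: small in-memory queue, no retries
    endpoint: https://backend.example.com   # placeholder
    sending_queue:
      enabled: true
      queue_size: 500
    retry_on_failure:
      enabled: false

service:
  extensions: [file_storage/queue]
```

Each pipeline then lists the exporter that matches the value of what it carries, so a backend outage costs you debug logs before it costs you audit evidence.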

Then there is scaling. The collector is stateless enough to scale horizontally, but as we covered in "Autoscaling your collector with KEDA," CPU-based autoscaling is the wrong signal for a telemetry workload whose pressure shows up as queue depth and memory long before CPU saturates. The collector exposes its own internal metrics for exactly this reason, and they are the signal you should scale on. This operational surface (memory limits, queue behavior, enrichment latency, scaling triggers) is what keeps surfacing in the failures we wrote about in "Observability pipelines love the OTel Collector until the config hits the fan."
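As a hedged sketch of what scaling on queue depth could look like (not a reproduction of the KEDA post), the ScaledObject below scales a gateway Deployment on exporter queue utilization. The Deployment name, Prometheus address, replica bounds, and threshold are all assumptions, and it presumes the collector's internal metrics are already being scraped into that Prometheus.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: otel-gateway                      # hypothetical name
spec:
  scaleTargetRef:
    name: otel-gateway                    # hypothetical gateway Deployment
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090   # assumed address
        # fraction of the exporter queue in use on the busiest replica
        query: max(otelcol_exporter_queue_size / otelcol_exporter_queue_capacity)
        threshold: "0.5"
```

If the gateway runs tail sampling behind the loadbalancing exporter, remember that adding replicas changes which instance owns which trace IDs, so traces in flight during a scale event may be split.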

The config is easy. The policy is not.

Here is the uncomfortable part. Everything above is mechanically straightforward. The YAML is documented, the processors are well-tested, and OpenTelemetry is now the second-highest-velocity project in the CNCF with a contributor base that keeps the components solid. The collector will faithfully execute any policy you give it.

What the collector does not do is decide what that policy should be. It cannot tell you that the user_id attribute on request.duration is the thing wrecking your cardinality, or that one team's debug logs got left on after an incident, or that a new service shipped last week tracing routine traffic at 100% fidelity. It executes the rules. It does not author them, notice when they go stale, or adapt them as services multiply and traffic patterns shift.

That is the real work, and it is continuous. A telemetry pipeline is not a project you finish. New services appear, frameworks auto-instrument more operations, developers add tags, and every change is a chance for waste to creep back in. The collector gives you the control surface. Someone, or something, still has to operate it.

How Sawmills approaches this

Sawmills runs as the operator. Built on the OTel Collector, it analyzes the telemetry flowing through your pipelines in real time, identifies the log sources, metric attributes, and trace patterns that cost the most while delivering the least signal, and applies the filtering, cardinality, and sampling decisions through the same processors described above. Your collectors, your backends, and your dashboards stay exactly where they are.

The division of labor is the point. Your platform team defines the guardrails (what must always be kept, what can be sampled, what should never be indexed), and Sawmills operates the pipeline continuously inside them, adapting as new services and new traffic patterns appear instead of waiting for the next quarterly audit or the next billing shock. Developers self-serve fixes in Slack or Teams without filing a ticket against the platform team. The collector keeps executing policy. Sawmills decides what the policy should be and keeps it current.

If your collector pipeline is running close to defaults and your ingest bill is climbing while your signal quality is not, that gap is exactly what Sawmills closes. Schedule a demo to see Sawmills operating a live collector pipeline against a telemetry stream like yours.