
Observability pipelines love the OTel Collector — until the config hits the fan

Observability · May 16, 2025

OpenTelemetry Collector (OTel Collector) is the Swiss Army knife of modern observability — versatile, open-source, and everywhere. It's the thing you plug in when you want your telemetry to flow freely between dozens of protocols and destinations, all without getting handcuffed to one vendor's vision of the truth. It's your Prometheus receiver, your Jaeger exporter, your trace processor, your gateway to telemetry Nirvana.

Kind of.

Because as much as the Collector promises streamlined observability, what it often delivers is... a lot of work.

Why it's so popular

For starters, OTel Collector speaks every protocol you can think of — OTLP, Jaeger, Prometheus, Zipkin, and more — and it talks to just about every backend you'd want to send data to. With one agent, you get a vendor-neutral clearinghouse for all your telemetry. You configure it once, and boom: switch backends without rewriting code.

Its modular pipeline (receivers → processors → exporters) means you can drop in encryption, filtering, batching, or transformation logic in one place rather than littering it across your services. It's stateless, scalable, and built in Go — fast enough to keep up with high-throughput data pipelines and lean enough to run anywhere.
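
To make that concrete, here's a minimal sketch of a Collector pipeline in YAML (the endpoints and the otlphttp backend are placeholders, not a recommendation):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}   # group telemetry before export to cut per-request overhead

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Swap out the exporter block and the same pipeline feeds a different backend; that's the vendor-neutral pitch in practice.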

But with great power comes a metric ton of YAML, unpredictable behavior, and a headache of operational complexity.

The limitations of OpenTelemetry Collector

OTel Collector can be a powerful tool — it's just not a magic one. Here's where it struggles, and what that means for teams relying on it to be the glue in their observability pipeline.

1. Processing performance is a bottleneck waiting to happen

Every processor you add to your pipeline eats into throughput. Regex-heavy filtering rules? Welcome to head-of-line blocking. Long pipelines? You're likely straining memory, not just CPU — especially during garbage collection.
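
To put a shape on that, here's a sketch of the kind of regex filtering that tends to become the hot spot, using the contrib filter processor with an illustrative health-check pattern (exact syntax varies across Collector versions):

```yaml
processors:
  filter:
    error_mode: ignore
    traces:
      span:
        # every span in the stream gets tested against this regex
        - 'IsMatch(attributes["http.target"], ".*/(healthz|readyz|livez).*")'
```

Stack up a few dozen conditions like this and the pipeline can spend more time matching than moving data.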

Despite its promise of scale, the Collector is stateless. That's great for real-time streaming, but forget about stateful operations like long-window aggregation. You're left hacking around the limits with retries, storage queues, or, worse, silent data drops.

2. Who watches the watcher?

OTel Collector emits its own telemetry, sure — but using that to debug the system running your telemetry can feel like peering into a mirror held by another mirror. No native UI. No built-in alerts for dropped spans. If something breaks, you're hunting through logs (that you hopefully remembered to export).

Want to detect data loss? You'll need verbose metrics, a Prometheus scrape setup, and dashboards that actually surface pipeline health. Otherwise, it's entirely possible for data to vanish without a trace — while your dashboards pretend everything is fine.
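
In practice, that setup usually means turning up the Collector's internal telemetry and scraping it from outside; a rough sketch (field names and the otelcol_* metric names have shifted between Collector versions):

```yaml
# Collector config (excerpt): expose internal metrics in Prometheus format
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
---
# prometheus.yml (excerpt): scrape the Collector and alert on counters such as
# otelcol_exporter_send_failed_spans and otelcol_processor_dropped_spans
scrape_configs:
  - job_name: otel-collector
    static_configs:
      - targets: ["otel-collector:8888"]
```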

3. Timeouts and retries aren't what you think

Out of the box, the Collector's retry logic is rudimentary. If an exporter goes down, queues fill and telemetry disappears unless you've explicitly configured persistence and backoff logic. There's no concept of circuit breakers or pipeline isolation — so a bad exporter can tank your whole flow.

Batch sizes and timeouts? You're responsible for tuning them. And the defaults aren't always your friend.
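
Here's roughly where those knobs live: a sketch of exporter-level timeout, retry, and queue settings with a persistent queue (values are illustrative, and exact fields depend on your Collector version):

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/queue   # persist queued data across restarts

exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318   # placeholder backend
    timeout: 10s
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 5m
    sending_queue:
      enabled: true
      queue_size: 5000
      storage: file_storage   # without this, the queue is memory-only and dies with the process

service:
  extensions: [file_storage]
```

Even with all of that, there's still no circuit breaker: once retries exhaust max_elapsed_time, the data is simply dropped.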

4. Deployment is complicated. Really complicated.

The Collector runs as a sidecar, a DaemonSet, or a centralized gateway, and each has its trade-offs. Load balancing isn't built in, and autoscaling on CPU or memory metrics? Not a reliable proxy for telemetry throughput.

You'll need to manage memory tuning, queue sizes, retry behavior, batch thresholds — and updates are frequent, sometimes breaking, and hard to centralize. There's no RBAC or multi-tenancy support either. If you want tenant isolation, bring your own infrastructure.
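
Most of that tuning lands in the same few processors; a sketch of the usual memory and batching knobs (the numbers are illustrative, not recommendations):

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1500        # start refusing data before the pod gets OOM-killed
    spike_limit_mib: 300   # headroom for sudden bursts
  batch:
    send_batch_size: 8192
    timeout: 5s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter conventionally goes first
      exporters: [otlphttp]
```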

5. Configuration is verbose and brittle

YAML, by its nature, is hard to reason about at scale. Collector configs grow massive quickly. There's no include mechanism, no native support for modularity, and no hot-reloading. You're stuck restarting services for every small tweak — and hoping you didn't fat-finger an indent.

For newcomers, the learning curve is real: which processors to use? In what order? With what side effects? Good luck.
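
The closest thing to built-in relief is environment-variable substitution, which at least keeps endpoints and secrets from being hard-coded into every copy of the file; a small sketch (the ${env:...} form is the one current Collector docs describe):

```yaml
exporters:
  otlphttp:
    endpoint: ${env:OTLP_ENDPOINT}           # e.g. https://backend.example.com:4318
    headers:
      authorization: ${env:OTLP_AUTH_TOKEN}  # keeps secrets out of the file itself
```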

6. Stateless architecture limits advanced use cases

The Collector's stateless design is efficient, but it cuts off more complex telemetry workflows. Tail-based sampling requires extra routing. Correlation across spans or traces? That's a job for another service entirely. You can add persistent storage — but querying it isn't really supported.
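
Tail-based sampling is the canonical example: every span of a trace has to reach the same Collector instance before a decision can be made, which usually means a two-tier setup with the contrib load-balancing exporter in front of a sampling tier. A sketch, with placeholder hostnames and illustrative thresholds:

```yaml
# Tier 1 (routing layer): keep every span of a trace on the same sampling Collector
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # illustrative only; use real TLS in production
    resolver:
      dns:
        hostname: otel-sampling.internal   # placeholder headless service
---
# Tier 2 (sampling layer): decide once the whole trace has arrived
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
```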

7. Extensibility is real, but rough

Want to extend the Collector? You'd better know Go. There's no dynamic plugin system and no Lua or Python scripting support. You're building and compiling from source.
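
The supported route is the OpenTelemetry Collector Builder (ocb), which compiles a custom distribution from a manifest like this one (module versions are illustrative and drift quickly):

```yaml
dist:
  name: my-otelcol          # hypothetical custom distribution
  output_path: ./build

receivers:
  - gomod: go.opentelemetry.io/collector/receiver/otlpreceiver v0.100.0

processors:
  - gomod: go.opentelemetry.io/collector/processor/batchprocessor v0.100.0
  - gomod: github.com/open-telemetry/opentelemetry-collector-contrib/processor/tailsamplingprocessor v0.100.0

exporters:
  - gomod: go.opentelemetry.io/collector/exporter/otlphttpexporter v0.100.0
```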

OTTL, the OpenTelemetry Transformation Language, is a bright spot — letting you write in-config expressions — but it's still limited. And when you compare this to GUI-based routing tools? The difference is stark.
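
For what it's worth, here's the kind of thing OTTL handles well today: a sketch of the transform processor rewriting span attributes entirely in config (the attribute names are illustrative):

```yaml
processors:
  transform:
    error_mode: ignore
    trace_statements:
      - context: span
        statements:
          # tag the environment if the SDK didn't, and drop a sensitive header
          - 'set(attributes["deployment.environment"], "production") where attributes["deployment.environment"] == nil'
          - 'delete_key(attributes, "http.request.header.authorization")'
```

Anything much past attribute edits and routing, though, and you're back to writing Go.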

So is it worth it?

Yes — with caveats. OTel Collector is still one of the most powerful, flexible tools for observability data routing. But you have to know its trade-offs and design around its limitations.

If you need deep filtering, rich correlation, or GUI-based routing, you'll hit walls. If you're fine with a lean, programmable, open-source agent? It's still the gold standard.

Have your cake and eat it too

The OpenTelemetry Collector is a powerful tool, but it asks for trade-offs: performance tuning, manual observability, brittle YAML, and endless config wrangling. If your team is feeling the weight of those trade-offs, you're not alone.

At Sawmills, we help teams cut through the noise with smarter telemetry pipelines built on top of open standards. Think: better defaults, deeper visibility, and control without complexity. Less duct tape. More signal.

You don't have to fight your tooling to make observability work. You just need the right foundation. Let's build it together.

Talk to us to learn more.