
What your logs should be telling you: beyond root cause analysis

Observability
May 16, 2025

In April, Sawmills presented at KubeCon's Observability Day, addressing a growing challenge: why modern logs are failing engineers, and what we can do about it.

Logs should be the foundation of observability — clear, factual records of what happened inside our systems. But in today's complex environments, logs have morphed into something else: fragmented, inconsistent, and bloated, forcing DevOps and engineers to look for the proverbial needle in a haystack.

At Sawmills, we believe telemetry data management is being revolutionized by artificial intelligence. We also know developers can manage telemetry more efficiently by changing how they think about it — not just as raw data, but as narrative infrastructure that must be shaped intentionally to make root cause analysis faster, cheaper, and more reliable.

In this conference recap, we’ll share some of the practical techniques engineers and DevOps can use to make telemetry more efficient while also improving RCA. We’ll also share what we believe is the future of telemetry management: the AI revolution that is sweeping through observability.

Logs: the foundation of observability — and its weakest link

In theory, logs should be the most powerful signal in an engineer's toolbox. They capture everything: user actions, system decisions, database calls, external API responses, and error states.

But in practice, logs often fail at their most important job: helping teams reconstruct what actually happened. Why? Most logs today capture isolated events, not complete stories. They're built for the convenience of the developer writing them, not for the engineer debugging a live incident six months later.

Over time, this has created a systemic problem. The more distributed and complex systems become, the more fragmented the available telemetry data becomes. Logs exist, but the connective tissue that makes them meaningful is missing.

When RCA depends on stitching together thousands of atomic facts by hand, every investigation becomes slower, more expensive, and more prone to human error.

Why root cause analysis is still harder than it should be

Even companies with strong observability practices struggle with RCA — not because their engineers lack skill, but because the raw material they're working with is fundamentally flawed. When we examine real production systems, we consistently see several repeating problems:

  • Logs are repetitive to the point of absurdity. If a service experiences a transient error like a failed database connection, it doesn't just log the initial failure. It logs every retry attempt, every timeout, and every circuit breaker trip — flooding logs with noise that obscures the original event.
  • Logs are often unstructured or inconsistently structured. One team logs timestamps as ts, another as time, another inside a nested metadata field. Some logs are free-text strings; others are semi-structured JSON. Critical fields like user_id or request_id appear in some logs and are missing from others (see the example after this list).
  • Third-party services inject their own chaos. A Kubernetes cluster generates thousands of logs per hour from orchestration events, node status updates, and network health probes — most with little relevance to application behavior but still swamping the telemetry stream.
  • Logs are often isolated snapshots rather than connected flows. Each log knows about its specific event — a request received, a database write, a transaction committed — but not where it fits in the broader story.
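
To make that inconsistency concrete, here are three hypothetical records describing the same event, each shaped differently. The field names and values are invented for illustration, not taken from any particular system:

# Three hypothetical records for the same kind of event, each shaped differently.
log_a = {"ts": "2025-05-16T10:00:00Z", "level": "error", "msg": "db connection failed", "user_id": "u-123"}
log_b = {"time": 1747389600, "severity": "ERROR", "message": "db connection failed"}  # no user_id at all
log_c = {"metadata": {"timestamp": "2025-05-16 10:00:00"}, "body": "db connection failed", "request_id": "r-789"}

# Any query or correlation step now needs three different code paths just to
# answer "when did this happen, and who was affected?"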

The result? Engineers diagnosing issues face a tedious, error-prone manual reconstruction process: scrolling through endless log entries, correlating events by timestamps and hostnames, racing to piece together a coherent timeline before SLA clocks expire.

And the consequences extend beyond lost time and frustration. All this redundant, low-value telemetry — retries, noise, fragmented context — doesn't just make RCA harder. It massively inflates observability costs. Every unnecessary log ingested, indexed, and stored drives up bills from platforms like Datadog, New Relic, and others.

For better root cause analysis, we need better data

The real challenge lies in the structure and nature of telemetry data itself. In conversations with hundreds of engineers, we consistently see six major issues that cripple RCA and inflate observability costs:

1. Burst errors and warning patterns

Transient failures flood logs with nearly identical error messages in rapid succession. Authentication failures, connection issues, and resource errors often occur in bursts during service degradation. Instead of one clear signal, you get hundreds of noisy, redundant logs — overwhelming alerting systems and operators alike.
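
One way to tame a burst like that is to fingerprint near-identical messages and emit a single summary per time window. Here's a minimal Python sketch of the idea; the 60-second window and the shape of the event dicts are assumptions for illustration, not a prescription:

from collections import Counter
from datetime import datetime

def collapse_bursts(events, window_seconds=60):
    """Collapse near-identical error events into one summary per time window.

    Each event is assumed to be a dict with a 'timestamp' (datetime) and a
    'message' (str); a real pipeline would fingerprint messages more carefully.
    """
    buckets = Counter()
    for event in events:
        window_index = int(event["timestamp"].timestamp() // window_seconds)
        buckets[(event["message"], window_index)] += 1

    # One summary record per (message, window) instead of hundreds of raw lines,
    # e.g. 500 "connection refused" lines in a minute become a single count=500 event.
    return [
        {
            "message": message,
            "window_start": datetime.fromtimestamp(window_index * window_seconds),
            "count": count,
        }
        for (message, window_index), count in buckets.items()
    ]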

2. Verbose debug and trace logs

Inside production telemetry, it's common to find full request/response payloads, SQL queries with parameters, and method entry/exit traces logged at DEBUG level. In isolation, each log might seem harmless. But at scale, these verbose outputs drown critical insights under gigabytes of irrelevant detail. Engineers dig through massive database query dumps and API payloads just to find what matters.

3. Third-party service logs

Infrastructure components generate their own telemetry — often at overwhelming volume. Kubernetes orchestrators, service meshes, DNS servers, CI/CD pipelines, and secret managers all emit status updates, sync events, and configuration logs. While valuable for platform operations, these logs often pollute application observability pipelines, making it harder to correlate application failures with system behavior.

4. Health and status check noise

Modern systems are filled with constant background health probes: /health, /liveness, /readiness endpoints being polled every few seconds. Additionally, services broadcast regular "healthy" status updates and self-test results. The intention is good — confirming operational status — but without careful routing, these floods of positive signals add huge volume with almost no diagnostic value when things actually go wrong.

5. Multi-line and unstructured logs

Logs that span multiple lines — such as full stack traces, formatted JSON payloads, or tabular SQL outputs — create parsing nightmares. Inconsistent formatting between services makes it almost impossible to automate correlation. Stack traces might be captured as raw text dumps; SQL queries might be embedded inside free-text logs. Instead of clean, searchable telemetry, you get sprawling, brittle fragments.
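
One practical mitigation is to fold continuation lines back into the record they belong to before shipping them. Here's a minimal Python sketch that assumes a new record begins with a timestamp and treats everything else as a continuation; real services will need their own start patterns:

import re

# Assumption for this sketch: a new record starts with an ISO-style timestamp;
# anything else (stack-trace frames, wrapped JSON, SQL) continues the previous record.
START_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}")

def fold_multiline(lines):
    """Fold raw lines so each stack trace or wrapped payload becomes one record."""
    records = []
    for line in lines:
        if START_PATTERN.match(line) or not records:
            records.append(line.rstrip())
        else:
            records[-1] += "\n" + line.rstrip()
    return records

raw = [
    "2025-05-16 10:00:00 ERROR payment failed",
    "Traceback (most recent call last):",
    '  File "billing.py", line 42, in charge',
    "KeyError: 'order_id'",
    "2025-05-16 10:00:01 INFO retrying payment",
]

# fold_multiline(raw) yields two records: the ERROR line with its full traceback
# attached, and the INFO line.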

6. Redundant metadata attributes

Most modern logs are bloated with repetitive metadata: container IDs, node hashes, cloud provider info, cluster names, deployment versions, and parent/child ownership relationships. Each piece might be useful individually — but repeated on every single line, they add massive overhead to telemetry ingestion, indexing, and storage costs without materially improving RCA capabilities.

What can you do? Focus on hygiene first

There are practical steps teams can start taking right now to make their logs more actionable and their root cause analysis faster:

  • Convert unstructured logs into structured formats. Move away from free-text logs and adopt structured formats like JSON. Consistent field names and machine-readable logs make correlation, querying, and summarization dramatically easier.
  • Sample repeated log entries. Instead of recording every identical retry, sample redundant events intelligently to capture patterns without overwhelming the system.
  • Aggregate similar log lines. Group together repeated actions into higher-level summaries. For example, instead of 500 connection errors, create one aggregate event describing the burst.
  • Convert incorrect log types into proper metrics. If a signal is high-frequency and low-variance — like successful heartbeats, retry counts, or cache hits — it should probably be recorded as a metric, not a flood of log lines.
  • Eliminate pure junk (DEBUG in production). Debug-level logs, verbose payloads, stack traces — all have value during development, but in production, they often contribute little to RCA. Filter them out or route them away from primary pipelines.
  • Route low-value logs to object storage. Not everything needs to live in your real-time observability stack. Route verbose or historical logs to cheaper storage tiers where they can be queried if needed, but don't add latency and cost to live troubleshooting.
  • Standardize attributes and remove duplication. Maintain an allow list of approved attribute names and make sure every attribute that reaches your pipeline is accounted for; anything outside the list should be dropped or normalized (see the sketch after this list).
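
As a starting point, here's a minimal Python sketch of two of the ideas above: converting a free-text line into structured JSON, then keeping only an allow list of attributes. The line format, field names, and allow list are illustrative assumptions, not a standard:

import json
import re

# Illustrative pattern for a free-text line like:
#   "2025-05-16 10:00:00 ERROR auth-service Login failed for user u-123"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+ \S+) (?P<level>\w+) (?P<service>\S+) (?P<message>.*)$"
)

# Attributes allowed through; everything else is dropped to curb metadata bloat.
ALLOWED_ATTRIBUTES = {"timestamp", "level", "service", "message", "request_id", "user_id"}

def to_structured(line, extra_attributes=None):
    """Turn a free-text log line into a structured, allow-listed JSON record."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return json.dumps({"message": line, "parse_error": True})

    record = match.groupdict()
    record.update(extra_attributes or {})
    # Keep only standardized attributes; repeated container/cluster metadata is dropped.
    return json.dumps({k: v for k, v in record.items() if k in ALLOWED_ATTRIBUTES})

print(to_structured(
    "2025-05-16 10:00:00 ERROR auth-service Login failed for user u-123",
    extra_attributes={"user_id": "u-123", "container_id": "abc123", "node_hash": "f00d"},
))
# -> {"timestamp": "2025-05-16 10:00:00", "level": "ERROR", "service": "auth-service",
#     "message": "Login failed for user u-123", "user_id": "u-123"}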

By focusing on these principles, teams can dramatically reduce their telemetry footprint, improve RCA speed, and lower their overall observability costs.

Aggregation = less noise and more insight

One of the most powerful upgrades you can make to your telemetry is aggregation. At first glance, aggregation seems purely about reducing data volume. Instead of logging every single heartbeat, retry, or health check, you group them into summary events.

This immediately shrinks your telemetry footprint — but the real value goes much deeper. Aggregation doesn't just condense telemetry. It amplifies your understanding of system behavior.

Before aggregation:

Imagine a system that performs a /health check every few seconds. Raw telemetry fills up with hundreds of nearly identical lines:

[OK] /health responded in 42ms
[OK] /health responded in 44ms
[OK] /health responded in 43ms
[OK] /health responded in 45ms

(…and so on)

Each line says almost nothing on its own. The real patterns — stability, fluctuations, anomalies — are invisible in the noise.

After aggregation:

Instead of capturing every probe, you group them intelligently:

"Over the last 10 minutes: 120 successful health checks, average response time 43.1ms, min 41ms, max 45ms."

Suddenly, you don't just know that the system was healthy — you know how healthy it was, how consistent, and whether any outliers emerged. You move from data overload to operational insight.
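
In code, that roll-up can be as simple as the sketch below, which assumes each probe result arrives as a (status, response_ms) pair collected over the window:

from statistics import mean

def summarize_health_checks(results, window_minutes=10):
    """Summarize raw health-check probes into one aggregate event.

    `results` is assumed to be a list of (status, response_ms) tuples
    gathered over the window, e.g. ("OK", 42).
    """
    ok_times = [ms for status, ms in results if status == "OK"]
    return {
        "window_minutes": window_minutes,
        "successful_checks": len(ok_times),
        "failed_checks": len(results) - len(ok_times),
        "avg_response_ms": round(mean(ok_times), 1) if ok_times else None,
        "min_response_ms": min(ok_times) if ok_times else None,
        "max_response_ms": max(ok_times) if ok_times else None,
    }

# 120 probes like ("OK", 42), ("OK", 44), ... collapse into a single event:
# {"window_minutes": 10, "successful_checks": 120, "failed_checks": 0,
#  "avg_response_ms": 43.1, "min_response_ms": 41, "max_response_ms": 45}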

Why aggregation improves RCA:

  • Faster anomaly detection. When health check times spike, you see the deviation immediately in the summary — without sifting through thousands of individual entries.
  • Better trend visibility. Patterns of minor degradation, retry bursts, or rising latencies become obvious when events are grouped and statistically summarized.
  • Smarter telemetry pipelines. Aggregated events carry more meaningful metadata — averages, minimums, maximums — in a fraction of the data footprint.

You gain richer context while spending less on ingestion and storage. Aggregation isn't about hiding detail. It's about surfacing the details that actually matter — and filtering out the noise that doesn't.

Bringing context to your logs

One of the biggest challenges with logs is the lack of context. By their nature, logs are isolated — each log line stands alone, with no built-in connection to what came before or after. When troubleshooting, you chase log after log, trying to piece together what happened. It's easy to get lost because the system doesn't preserve the cause-and-effect story. And that's a hard problem to fix — because that's how logs have always been built.

At Sawmills, we asked ourselves: what if we could reconstruct the narrative?

We realized that most modern logs carry some kind of connection — a trace ID, a transaction ID, a session ID. By mapping those connections, we could group related log lines, build a tree representing the flow of events, and use a large language model to generate a single structured summary that tells the full story of the transaction. Instead of five scattered log lines for a single authentication flow, we can generate one clean log that says, "Request processing started, authentication successful, request processing completed."
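
The pipeline behind this is Sawmills' own, but the grouping step can be sketched in a few lines of Python. The trace_id field, the record shape, and the summarize_flow placeholder below are assumptions for illustration; in practice, the summarization step is where an LLM (or a template) produces the narrative:

from collections import defaultdict

def group_by_trace(logs):
    """Group structured log records by a shared trace/transaction ID.

    Each record is assumed to be a dict with 'trace_id', 'timestamp',
    and 'message' keys; records without a trace_id stay ungrouped.
    """
    flows = defaultdict(list)
    for record in logs:
        flows[record.get("trace_id", "untraced")].append(record)
    for records in flows.values():
        records.sort(key=lambda r: r["timestamp"])
    return flows

def summarize_flow(records):
    """Placeholder for the narrative step; a template or an LLM would go here."""
    return " -> ".join(r["message"] for r in records)

# Five scattered lines for one authentication flow become a single summary like:
# "Request processing started -> authentication successful -> request processing completed"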

Even in large, messy environments, this method holds. We see that 80% of telemetry volume is typically generated by 20% of transaction patterns. By summarizing the frequent flows, we dramatically cut telemetry noise — and for the first time, observability becomes about seeing the real story, not hunting for fragments.

The future of telemetry data management  

In modern cloud-native environments, telemetry grows faster than any human team can curate. The future is using AI to actively manage telemetry in real time:

  • Automatically correlating related events across services
  • Summarizing log flows into coherent narratives
  • Optimizing what data gets ingested, stored, surfaced and routed
  • Proactively eliminating noise before it hits your observability bills

That's why we founded Sawmills. To give teams real control over their telemetry pipelines — not just to clean up logs after the fact, but to intelligently shape their data as it flows through their systems.

With Sawmills, you can:

  • Cut observability ingestion and storage costs by 50–90%
  • Improve RCA speed with AI-driven summarization and aggregation
  • Enforce telemetry policies and quotas that keep data quality high and budgets predictable
  • Take back control from bloated observability bills and runaway data growth

The goal isn't just smaller logs or faster queries. It's smarter systems — where the right data tells the right story, at the right time.