
Observability agents are coming. Your telemetry is not ready for them

Observability
Jun 17, 2025

Imagine you’re in the middle of an incident. PagerDuty is blaring. Your CPU metrics are off the charts. Traffic is spiking. And your logs? A sprawling mess of unstructured text, throttled APIs, and timestamp chaos. Now imagine that an agent, not a human but a well-trained LLM, could step in, diagnose the problem, and even resolve it.

But here’s the catch: Without the right context, even the best agent falls short. And context lives in your telemetry.

Not all telemetry is created equal

Metrics, traces, and logs form the core of modern observability. Metrics reveal symptoms, like a fever spiking when a service slows down. Traces provide structured context, showing the detailed path a request took, including exceptions, variables, timing, and bottlenecks. Logs complement this with the gritty narrative: often-unstructured details, odd behaviors, and raw stack traces. Yet despite their richness, these signals frequently remain siloed, noisy, and inconsistent.

Telemetry, as it exists in most environments, is built for humans. But agents aren’t humans. To be effective, agents need structured context: information humans carry implicitly but LLMs must receive explicitly. Agents also work best with a compressed, distilled form of telemetry rather than raw data, which lets them reason clearly instead of drowning in noise. Filtering, structuring, and optimizing telemetry isn’t just helpful; it’s essential for agent usability.

The agent’s retrieval nightmare

Now, let’s say you wire your agent directly into your observability platform and give it the green light to query. What happens next?

First, the agent hits a wall of API limitations. These endpoints were built for dashboards, not for autonomous reasoning. They’re slow, often throttled, and return too much or too little. Second, there's the issue of retention. The agent might need logs from last week, but your platform only keeps them for 3-7 days. Too bad.

Then there’s the structure problem. Logs are free-form. They’re human-readable but often meaningless to a machine unless heavily parsed. Finally, your telemetry is probably fragmented. Metrics live in Prometheus, logs in Datadog, traces in OpenTelemetry. Your agent needs to hopscotch across systems to get a complete view. That’s not intelligence. It’s scavenging.

The context challenge

The challenge isn't about technical constraints like context window size. Instead, it's about the completeness and clarity of the information provided. Agents depend entirely on the telemetry and context you supply to navigate incidents effectively.

An incident often starts as a simple alert: a PagerDuty notification, a CPU spike, a latency increase. But the alert is just the symptom. The real task is constructing a complete picture from fragmented, unstructured telemetry scattered across logs, metrics, and traces.

Agents face specific hurdles: fragmented data sources scattered across platforms, inconsistent data structures that require constant translation, and limited retention windows that create gaps in historical context. On top of that, critical institutional knowledge often lives only in the heads of experienced SREs: common issue patterns, known false alarms, and proven resolutions that aren’t captured in any system.

Addressing these challenges means systematically structuring telemetry data, enriching it with this implicit operational knowledge, and optimizing for rapid retrieval. When context is clear and structured, agents can effectively diagnose and resolve incidents at scale. Without it, they remain limited to handling only superficial symptoms while the root causes persist.

Making telemetry machine-readable

To fix this, you need to reimagine your pipeline. That means designing your telemetry flow, from ingestion to retrieval, with the agent in mind. Here’s how to get started (illustrative code sketches for each step follow the list):

  • Start with filtering and cleaning: Apply rules to discard irrelevant logs (e.g., health checks, debug messages), normalize noisy metrics, and remove duplicate traces. This eliminates the signal-to-noise problem that overwhelms agents with irrelevant data.
  • Aggregate similar events: Group logs with similar patterns, repeated errors, and recurring messages instead of storing each occurrence individually. Transform hundreds of "connection timeout" entries into clustered insights showing frequency, affected services, and time patterns. This horizontal aggregation reveals systemic issues that would be invisible when buried in repetitive log noise.
  • Enrich telemetry with metadata: Add useful context such as environment, deployment version, and ownership tags to all telemetry records. This enables agents to filter precisely and understand the operational context of each event.
  • Standardize formats: Use a common schema for logs, metrics, and traces. Flatten nested structures and align field naming across services so agents can correlate events without translation overhead.
  • Compact and summarize: Instead of line-by-line logs, auto-generate vertical summaries that capture entire transaction spans or application flows using heuristic or ML techniques. Rather than parsing through 50 individual log lines for a checkout transaction, an agent receives a cohesive narrative: "User checkout failed at payment validation step due to expired card, retry attempted twice, fallback to manual review triggered." This condenses complex traces and transaction flows into digestible, contextual summaries that preserve the critical decision points and failure modes while stripping out the verbose implementation details that obscure the core story.
  • Store in optimized formats: Save data in Parquet files, bucketed by time, service, and severity. This reduces cost and speeds up query performance. The key is partitioning data so agents can precisely slice what they need: an investigation into a checkout incident from the last hour queries only that specific time/service/severity partition instead of scanning everything. While time, service, and severity form the basic partitioning strategy, you can expand it with additional dimensions like deployment version, region, or request type, depending on your incident patterns. This approach turns broad data scans into targeted lookups, letting agents rapidly pivot between different slices of telemetry during complex incident investigations.
  • Enable semantic retrieval: Move beyond keyword-based searches that trap agents in data silos. Vector embeddings allow agents to understand meaning and relationships, finding conceptually related issues across services even when they use different terminology. Generate embeddings that capture not just log content, but the full operational context, including service relationships, deployment state, business impact, and historical patterns. Store these enriched embeddings in a vector database to enable agents to reason across system boundaries and connect distributed failures that traditional searches miss entirely.
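
To make these steps concrete, here are a few minimal sketches. First, filtering and cleaning: the sketch below drops low-value records before they ever reach an agent. The field names (level, message) and the drop rules are illustrative assumptions, not a prescribed schema.

```python
import re

# Illustrative drop rules: health checks and periodic housekeeping output
# add volume without adding diagnostic signal.
DROP_PATTERNS = [
    re.compile(r"GET /healthz"),
    re.compile(r"connection pool stats"),
]

def keep(record: dict) -> bool:
    """Return True if a log record is worth forwarding to the agent pipeline."""
    if record.get("level") == "debug":
        return False
    message = record.get("message", "")
    return not any(p.search(message) for p in DROP_PATTERNS)

raw = [
    {"level": "info", "message": "GET /healthz 200 2ms"},
    {"level": "error", "message": "connection timeout to payments-db"},
    {"level": "debug", "message": "cache warmup complete"},
]
cleaned = [r for r in raw if keep(r)]  # only the payments-db timeout survives
```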
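
For the aggregation step, a sketch of horizontal grouping: similar messages collapse into a template with a count, the set of affected services, and a time span. The regex-based templating is a simple stand-in; production pipelines often use log-pattern mining instead.

```python
import re
from collections import defaultdict
from datetime import datetime

def template(message: str) -> str:
    """Collapse variable parts (hex ids, numbers) so similar messages cluster together."""
    msg = re.sub(r"0x[0-9a-fA-F]+", "<id>", message)
    return re.sub(r"\d+", "<n>", msg)

def aggregate(records: list[dict]) -> dict:
    """Group records by message template; track frequency, affected services, and time span."""
    clusters = defaultdict(lambda: {"count": 0, "services": set(), "first": None, "last": None})
    for r in records:
        c = clusters[template(r["message"])]
        ts = datetime.fromisoformat(r["timestamp"])
        c["count"] += 1
        c["services"].add(r["service"])
        c["first"] = ts if c["first"] is None else min(c["first"], ts)
        c["last"] = ts if c["last"] is None else max(c["last"], ts)
    return clusters

records = [
    {"service": "checkout", "timestamp": "2025-06-17T10:00:01", "message": "connection timeout after 3000 ms"},
    {"service": "checkout", "timestamp": "2025-06-17T10:00:07", "message": "connection timeout after 3012 ms"},
]
for tmpl, c in aggregate(records).items():
    print(tmpl, c["count"], sorted(c["services"]), c["first"], c["last"])
```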
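
Enrichment and standardization can share one normalization pass: map each source-specific record onto a common flat schema, then attach environment, version, and ownership tags. The source shapes and the metadata table here are hypothetical; in practice the tags would come from your service catalog or deployment system.

```python
# Hypothetical ownership and deployment metadata, keyed by service name.
SERVICE_METADATA = {
    "checkout": {"team": "payments", "env": "prod", "version": "2025.06.3"},
}

COMMON_FIELDS = ("timestamp", "service", "severity", "message")

def normalize(record: dict, source: str) -> dict:
    """Flatten a source-specific record onto one schema and enrich it with operational tags."""
    if source == "vendor_logs":  # hypothetical vendor-specific field names
        flat = {
            "timestamp": record["date"],
            "service": record["service"],
            "severity": record["status"],
            "message": record["message"],
        }
    else:  # assume the record is already close to the common schema
        flat = {k: record.get(k) for k in COMMON_FIELDS}
    flat.update(SERVICE_METADATA.get(flat["service"], {}))
    return flat

print(normalize(
    {"date": "2025-06-17T10:00:07", "service": "checkout", "status": "error",
     "message": "connection timeout to payments-db"},
    source="vendor_logs",
))
```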
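
Vertical summarization can start far simpler than an ML model. The heuristic below keeps a flow’s opening line, up to two error lines, and the final outcome; it is only a sketch of the idea, and a real pipeline might hand the same lines to an LLM or a trained summarizer instead.

```python
def summarize_flow(lines: list[str]) -> str:
    """Heuristic vertical summary: first line, up to two error lines, and the final outcome."""
    errors = [l for l in lines if "error" in l.lower() or "failed" in l.lower()]
    parts = [lines[0]] + errors[:2]
    if lines[-1] not in parts:
        parts.append(lines[-1])
    return " -> ".join(parts)

flow = [
    "checkout started for order 8841",
    "payment validation failed: card expired",
    "retry 1 failed: card expired",
    "fallback to manual review triggered",
]
print(summarize_flow(flow))  # prints a single-line narrative of the flow's key steps
```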
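
For storage, a sketch of time/service/severity partitioning using pandas with the pyarrow engine (both assumed to be installed). The directory layout is Hive-style, so an agent investigating a checkout error only reads the matching partition.

```python
import pandas as pd

# A small batch of normalized telemetry records (schema assumed from the steps above).
df = pd.DataFrame([
    {"date": "2025-06-17", "service": "checkout", "severity": "error",
     "timestamp": "2025-06-17T10:00:07", "message": "connection timeout to payments-db"},
    {"date": "2025-06-17", "service": "search", "severity": "info",
     "timestamp": "2025-06-17T10:00:09", "message": "reindex completed"},
])

# Writes telemetry/date=2025-06-17/service=checkout/severity=error/... and so on;
# a query scoped to that partition never touches the rest of the data.
df.to_parquet("telemetry/", engine="pyarrow", partition_cols=["date", "service", "severity"])
```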
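
Finally, a sketch of semantic retrieval over enriched summaries. The sentence-transformers model name is just one common choice, and the in-memory cosine search stands in for a real vector database; what matters is that the embedded text carries operational context rather than raw log lines.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed embedding model; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Embed enriched summaries rather than raw log lines so the vectors carry context.
documents = [
    "checkout service, prod, v2025.06.3: payment validation failing on expired cards, fallback to manual review",
    "search service, prod: reindex latency elevated after deploy, no user-facing impact",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

def search(query: str, k: int = 1) -> list[tuple[str, float]]:
    """Return the k documents most semantically similar to the query."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q  # cosine similarity, since vectors are normalized
    top = np.argsort(scores)[::-1][:k]
    return [(documents[i], float(scores[i])) for i in top]

print(search("card errors during the purchase flow"))
```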

These steps form the foundation of a telemetry architecture built for autonomous agents. They bridge the gap between raw data and machine understanding, ensuring agents can get to the right answer faster.

Agents today are trained to behave like junior SREs. They follow runbooks, grep through logs, and ask the same questions a human would. But they need this context to operate. Context that lives not just in logs, but in CPU spikes, flame graphs, support tickets, and even Slack threads.

Telemetry is the bedrock of this context. But it has to be usable. When it isn’t, the agent flounders. When it is, the agent accelerates resolution, scales your incident response, and augments your team, rather than slowing it down.

Sawmills is leading the agentic telemetry revolution

Sawmills helps you ingest telemetry from every corner of your stack, including logs, metrics, and traces. We clean the junk, stitch the context, and transform your data into structured, agent-ready formats.

Whether you’re using an open-source LLM, building custom internal agents, or evaluating a shiny new SRE-agent startup, we make sure your telemetry is no longer a bottleneck. Our platform writes to storage formats optimized for both cost and speed, and we embed the data in vector databases so agents can retrieve what they need without delay.

With Sawmills, you don’t have to fight your observability tooling. You make it smarter. And you make your agents capable.

TL;DR for DevOps leaders

If you're plugging agents into raw, unoptimized telemetry streams, you're flying blind. The key to faster, smarter incident resolution is structured context, and that starts with transforming your logs, metrics, and traces into agent-readable formats.

Let Sawmills do the heavy lifting. Clean, optimize, and serve your telemetry—not just for humans, but for the future.