
Managing telemetry deployments while safeguarding infrastructure

Observability
Oct 20, 2025

In 2024, OpenAI shipped an update meant to improve observability across its Kubernetes clusters. The update backfired, triggering a three-hour outage that took down ChatGPT, its API, and associated services.

The culprit? A new telemetry service that overwhelmed the Kubernetes control plane by making too many resource-intensive API calls. What was supposed to give OpenAI more visibility ended up turning the lights off.

In short, a system designed to monitor infrastructure caused the infrastructure to crash. 

The importance of telemetry guardrails 

OpenAI isn't alone. As systems grow and scale, so do the volume and complexity of their telemetry data. Your observability stack isn't a passive sidecar. It interacts deeply with your services, your network, and your control plane. If it misbehaves, whether through excessive API calls, unchecked cardinality, or traffic spikes, the blast radius can be enormous.

This is why a smart telemetry pipeline isn't just about collecting data. It's about having control over that data. Understanding what's flowing where. And most importantly, setting guardrails that can prevent your visibility layer from taking down your business.

Here are some critical practices every DevOps team should consider to avoid telemetry-induced outages:

Guardrails on telemetry behavior. Implement resource limits (CPU, memory, bandwidth caps) on observability agents, circuit breakers that disable telemetry collection when it impacts application health, and rate limiting with intelligent sampling to ensure monitoring never becomes the cause of an outage. As a concrete example, consider capping the Datadog Agent's resources. Hard CPU and memory caps are typically enforced at the container level rather than inside datadog.yaml itself: on Kubernetes, you set requests and limits on the Agent container, either directly in the DaemonSet spec or through the Datadog Helm chart.

  • requests: Guarantees the specified resources for the container. The scheduler uses this to decide where to place the pod.
  • limits: Caps the maximum resources the container can consume. If the container tries to exceed these limits, it is throttled (CPU) or killed (memory).
# Kubernetes resource requests and limits for the Datadog Agent container
# (set on the Agent DaemonSet, or via the Helm chart's agent container resources)
resources:
  requests:
    cpu: 200m       # guaranteed CPU; used by the scheduler for pod placement
    memory: 256Mi   # guaranteed memory
  limits:
    cpu: 500m       # hard cap at half a core; the container is throttled beyond this
    memory: 512Mi   # hard cap; the container is OOM-killed if it exceeds this
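
Beyond per-agent caps, the same guardrails can live in the pipeline itself. As a minimal sketch, assuming an OpenTelemetry Collector (contrib distribution) fronts your backends: the memory_limiter processor refuses new data when the Collector approaches its memory budget, acting as a crude circuit breaker, and the probabilistic_sampler keeps only a fraction of traces so a burst can't be amplified downstream. The endpoint, limits, and sampling percentage here are illustrative, not recommendations.

# OpenTelemetry Collector: backpressure and sampling as telemetry guardrails
receivers:
  otlp:
    protocols:
      grpc: {}                 # receive OTLP over gRPC on the default port

processors:
  memory_limiter:
    check_interval: 1s         # how often memory usage is checked
    limit_mib: 400             # memory budget; data is refused above this
    spike_limit_mib: 100       # headroom reserved for short spikes
  probabilistic_sampler:
    sampling_percentage: 20    # keep roughly 1 in 5 traces
  batch: {}                    # batch before export to smooth outgoing traffic

exporters:
  otlp:
    endpoint: backend.example.com:4317   # assumed backend address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp]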

Quotas and budget alerts. Establish strict quotas on telemetry volume, cardinality, and API call rates for each service or team. For on-premises systems, these quotas prevent resource exhaustion and keep the system stable. In cloud environments, they are crucial for cost control, preventing runaway expenses from unexpected data surges. Couple these quotas with automated alerts that notify teams when they approach or exceed their limits, allowing for proactive intervention before an outage or bill shock occurs. This dual approach of hard limits and early warnings turns telemetry from a potential liability into a predictable, manageable resource.
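
What these quotas look like in practice is tool-specific. The YAML below is a hypothetical policy schema, not any real product's configuration format; it only illustrates the kinds of per-team limits and warning thresholds worth writing down and alerting on (the team name, numbers, and channel are made up):

# Hypothetical per-team telemetry quota policy (illustrative schema only)
team: checkout
quotas:
  logs_gb_per_day: 200           # hard cap on daily log ingestion
  metric_cardinality: 500000     # max unique metric series
  trace_spans_per_second: 2000   # sustained span ingestion rate
alerts:
  warn_at_percent: 80            # notify the team as they approach a quota
  page_at_percent: 100           # page on-call when a hard limit is hit
  channel: "#checkout-observability"   # assumed notification channel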

Isolated control planes. Keep telemetry systems on separate infrastructure to create a firewall between your observability platform and production services, ensuring that issues like data volume spikes or processing bottlenecks in your monitoring systems can't bring down customer-facing applications. This separation means dedicated compute, network, and storage resources for telemetry collection and processing, preventing resource contention with production workloads.
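
On Kubernetes, a common way to get this isolation is a dedicated, tainted node pool for the observability stack, so telemetry workloads can never compete with production pods for compute. A minimal sketch, assuming nodes labeled node-pool=observability and tainted with dedicated=observability:NoSchedule (the deployment name and image are assumptions):

# Run the telemetry pipeline on dedicated, tainted nodes only
apiVersion: apps/v1
kind: Deployment
metadata:
  name: telemetry-gateway          # assumed name for the pipeline/gateway deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: telemetry-gateway
  template:
    metadata:
      labels:
        app: telemetry-gateway
    spec:
      nodeSelector:
        node-pool: observability   # schedule only onto the dedicated pool
      tolerations:
        - key: dedicated
          operator: Equal
          value: observability
          effect: NoSchedule       # tolerate the taint that keeps other workloads off
      containers:
        - name: gateway
          image: otel/opentelemetry-collector-contrib:latest   # assumed image
          resources:
            requests:
              cpu: 500m
              memory: 1Gi
            limits:
              cpu: 1000m
              memory: 2Gi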

Protecting against telemetry spikes. Unexpected surges in telemetry volume, often caused by error storms, deployment issues, or cascading failures, can quickly overwhelm observability backends and even impact production infrastructure. Mechanisms such as adaptive rate limiting, dynamic sampling (which becomes more aggressive during surges), and automatic throttling based on ingestion thresholds can contain these bursts before they affect your systems. There is currently no off-the-shelf product that covers this end to end, so it requires a custom solution; Sawmills.ai is actively developing policies to address it.
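
In the meantime, there are building blocks you can assemble yourself. For traces, for example, the OpenTelemetry Collector's tail_sampling processor (contrib) can keep error traces while capping everything else at a fixed rate; a sketch with illustrative numbers:

# Cap trace throughput at the pipeline level during surges
processors:
  tail_sampling:
    decision_wait: 10s             # buffer traces briefly before deciding
    policies:
      - name: errors-always        # always keep traces that contain errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: throughput-ceiling   # everything else is capped at a fixed rate
        type: rate_limiting
        rate_limiting:
          spans_per_second: 500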

Tenant-level separation for blast radius control. In multi-tenant environments, a single tenant's problems (for example, a misconfigured application generating excessive telemetry) can compromise the observability of other tenants. Implementing tenant-level separation, through isolated data pipelines, dedicated resources, or strict access controls, prevents one tenant's issues from cascading and impacting the stability or performance of the entire observability platform. This limits the blast radius of failures and maintains service quality for all users.
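
One hedged sketch of how this can look in an OpenTelemetry Collector pipeline: the contrib routing processor splits traffic by a tenant resource attribute so each tenant's data flows through its own exporter (with its own quotas and limits). The attribute name, tenant values, and endpoints below are assumptions:

# Route telemetry to per-tenant exporters based on a resource attribute
processors:
  routing:
    from_attribute: tenant               # resource attribute identifying the tenant
    attribute_source: resource
    default_exporters: [otlp/shared]     # anything unlabeled goes to a shared pipeline
    table:
      - value: tenant-a
        exporters: [otlp/tenant-a]       # tenant-a gets its own backend and limits
      - value: tenant-b
        exporters: [otlp/tenant-b]

exporters:
  otlp/shared:
    endpoint: shared-backend.example.com:4317
  otlp/tenant-a:
    endpoint: tenant-a-backend.example.com:4317
  otlp/tenant-b:
    endpoint: tenant-b-backend.example.com:4317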

Canary deployments for observability changes. Telemetry updates should roll out in phases with proper monitoring and rollback mechanisms, just like code. Treat observability configuration changes the same way you treat code deployments: roll them out to a small canary group first while monitoring for performance impact. This phased approach lets you catch resource-heavy telemetry changes early and roll back before they degrade the entire production environment, preventing what looks like a minor logging update from triggering a major outage.
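
A lightweight way to do this on Kubernetes, assuming your agents or collectors run as a DaemonSet: run a second DaemonSet pinned to a few canary-labeled nodes with the new telemetry configuration, watch its resource usage and downstream impact, then promote the change to the main DaemonSet. A sketch of the canary half (names and image are assumptions):

# Canary DaemonSet: new telemetry config on a handful of labeled nodes only
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: telemetry-agent-canary
spec:
  selector:
    matchLabels:
      app: telemetry-agent
      track: canary
  template:
    metadata:
      labels:
        app: telemetry-agent
        track: canary
    spec:
      nodeSelector:
        telemetry-canary: "true"    # only nodes explicitly opted into the canary
      containers:
        - name: agent
          image: otel/opentelemetry-collector-contrib:latest   # assumed agent image
          volumeMounts:
            - name: config
              mountPath: /etc/otelcol-contrib
      volumes:
        - name: config
          configMap:
            name: telemetry-agent-config-canary   # the new, to-be-validated config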

Monitor the monitors. Your observability platform requires its own dedicated monitoring to detect problems before they cascade into production outages. Track key health metrics like agent CPU/memory usage, ingestion rates, buffer depths, pipeline latency, and backend storage capacity, with alerts that fire when these indicators approach dangerous thresholds. 

Examples of Datadog alerts for volume explosions (one is sketched as a monitors-as-code definition after the list):

  • Ingestion Rate Anomaly: Alert when datadog.estimated_usage.log_events.ingested.count for a specific service or region deviates significantly from its historical baseline (e.g., a 3-sigma deviation over a 5-minute window).
  • Buffer Depth Exceeded: Alert when datadog.agent.log_agent.buffer_size for any agent exceeds 80% of its configured maximum for more than 1 minute, indicating potential backpressure.
  • Pipeline Latency Spike: Alert when datadog.estimated_usage.log_events.pipeline_latency for a critical pipeline increases by more than 50% compared to the previous 10 minutes.
  • Backend Storage Capacity: Alert when datadog.estimated_usage.log_events.storage.daily_retained_bytes grows by more than 20% in a 24-hour period for a specific index or retention policy.
  • Agent CPU/Memory Usage: Alert when system.cpu.usage or system.mem.used for Datadog agents on critical hosts exceeds 90% for more than 5 minutes.
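
As a concrete illustration, the agent resource alert above could be captured as a monitors-as-code definition. The YAML shape here is illustrative rather than any official format (in practice you would create the monitor through the Datadog UI, API, or Terraform); the query string follows Datadog's standard monitor query syntax and reuses the metric name from the list, while the tag scope and notification handle are assumptions:

# Illustrative monitors-as-code definition for the agent resource alert
name: "Datadog Agent CPU saturation on critical hosts"
type: metric alert
query: "avg(last_5m):avg:system.cpu.usage{role:critical} by {host} > 90"
message: |
  Host CPU has exceeded 90% for 5 minutes on a critical host.
  Check agent resource limits and recent telemetry configuration changes.
  @slack-observability-oncall
options:
  thresholds:
    critical: 90
    warning: 80        # early warning before the agent becomes a problem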

Smart telemetry management starts with visibility and control

Sawmills is the first smart telemetry management platform that gives teams deep visibility into their data.

From proactive spike protection to intelligent routing, shaping, and filtering of high-volume data, Sawmills helps you ensure that your observability system is stable, performant, and cost-efficient. Monitor telemetry flows in real-time, set guardrails that prevent data surges from destabilizing your infrastructure, and eliminate wasteful spending on redundant or low-value data.

OpenAI's 2024 downtime illustrated what happens when a telemetry system inadvertently disrupts the product experience. For three hours, one of the world's most widely used services was impacted by cascading telemetry failures. A smarter pipeline that protects your systems from telemetry spikes while helping you control costs can prevent this kind of failure.

Sawmills gives you the visibility and control to keep your observability infrastructure resilient before issues reach production.

Explore Sawmills.ai.