Kubernetes observability for platform teams, from kubelet metrics to cost-controlled telemetry

Observability

Jul 2026

A Kubernetes cluster does not generate telemetry as a side effect of running workloads. It generates telemetry as the dominant output of running workloads. By the time a single ReplicaSet has cycled through three rolling deploys, it has produced more log lines than the application has produced business events. Multiply across the dozens of services in a typical production cluster, the hundreds of Pods, the thousands of restarts per week, and the engineering organization that started instrumenting with the goal of better debugging is now operating an observability stack that costs more than the platform it observes.

This is not a failure of the engineers who added the instrumentation. They were each making a local decision that improved their service. It is also not a failure of the vendor. The vendor charges for what it stores, indexes, and queries. The failure is structural. Telemetry produced by developers gets paid for by a different team, and the team that pays does not own the source.

For most companies running Kubernetes at scale, that team is the platform team. This is a guide for them: what to instrument, where to collect, how to enforce policy across services they don't write code for, and how to bend the cost trajectory before the next renewal conversation.

Why Kubernetes observability is a platform team problem

In a Kubernetes environment, the unit of ownership that matters for observability is the cluster, not the service. Each service contributes telemetry. The cluster aggregates it. The collectors run as cluster-level resources. The RBAC that gates them is cluster-scoped. The cost is billed against the cluster, not the team that owns the service.

This is the structural reason Kubernetes observability lives in platform teams. Developers write instrumentation, but platform teams own the collectors, the dashboards, the alert routing, the retention policies, the export quotas, and the budget conversations with finance. Developers see "my service is noisy" and adjust their loggers if they have time. Platform teams see "the Datadog bill is up 40% this quarter" and have no obvious lever to pull, because the data is created by dozens of services they don't ship code for.

The ownership gap is the same one that breaks most centralized initiatives in a Kubernetes-first organization. Platform teams have authority over the substrate. They don't have authority over what runs on it. So most of what they do becomes reactive: write a runbook, publish best practices, escalate when a service produces an obvious outlier.

The structural fix is not "platform engineers should write better instructions." The fix is that policy decisions about telemetry should be made centrally and enforced automatically at the substrate, not delegated to dozens of teams and hoped for. The platform team defines the rules. The runtime enforces them. Developers move fast within them.

The signals that actually matter

Most Kubernetes observability content treats "logs, metrics, traces" as the canonical triad. In a Kubernetes cluster, that triad is incomplete. There are five distinct telemetry sources, each producing different data with different cost characteristics.

Pod and container logs. Written by the container runtime to /var/log/pods/<namespace>_<pod>_<uid>/<container>/<rotation>.log and symlinked from /var/log/containers. Collected by a node-local agent that reads the host filesystem. This is the highest-volume signal in most clusters.

Kubelet stats. Resource usage from the kubelet's /stats/summary endpoint, fed by cAdvisor inside the kubelet. CPU, memory, network, filesystem per Pod and per container. Available via the OTel kubeletstats receiver.

kube-state-metrics. Cluster state exposed as Prometheus metrics, scraped from a separate Deployment that watches the Kubernetes API. Deployment counts, Pod phases, Node conditions, container restart counts. Not from the kubelet. Not from the runtime. From the API server.

Application metrics. Emitted by the application itself, typically via the OTel SDK or the Prometheus client library. Custom counters, histograms, business-relevant metrics. This is where cardinality lives.

Distributed traces. Emitted by the application's OTel SDK, propagated across services via W3C TraceContext. Spans for each operation, attributes for context, links across asynchronous boundaries.

Each source covers what the others cannot. Logs explain what happened inside a single service. Kubelet stats show resource pressure at the substrate. Kube-state-metrics show the cluster's intended state diverging from reality. Application metrics encode business behavior. Traces stitch the request across services.
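As a rough illustration of how the node-local and cluster-state sources map onto OpenTelemetry Collector receivers, here is a minimal sketch. It assumes the contrib distribution of the Collector, a K8S_NODE_NAME environment variable injected through the downward API, and a kube-state-metrics Service reachable at the hypothetical address shown. Application metrics and traces arrive over OTLP.

```yaml
receivers:
  # Pod and container logs, read from the host filesystem.
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    include_file_path: true
    operators:
      # Parses the container runtime's log format; available in
      # recent contrib releases (check your version).
      - type: container

  # Kubelet stats from the local Node's /stats/summary endpoint.
  kubeletstats:
    collection_interval: 30s
    auth_type: serviceAccount
    # Assumption: K8S_NODE_NAME is set from spec.nodeName.
    endpoint: "https://${env:K8S_NODE_NAME}:10250"
    insecure_skip_verify: true

  # kube-state-metrics, scraped as a Prometheus target.
  # Run this in ONE collector instance, not in every DaemonSet
  # agent, or each Node will scrape the same data.
  prometheus:
    config:
      scrape_configs:
        - job_name: kube-state-metrics
          scrape_interval: 30s
          static_configs:
            # Hypothetical Service name and Namespace.
            - targets: ["kube-state-metrics.kube-system.svc:8080"]

  # Application metrics and distributed traces from OTel SDKs.
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
```

This is a receivers fragment only. Processors, exporters, and the service pipelines that wire them together are omitted.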

A specific confusion worth clearing up: kubelet stats and kube-state-metrics are not the same thing. The kubelet sees what is actually running on the Node. Kube-state-metrics sees what the API server has recorded. When a Node fails, the kubelet stops reporting. Kube-state-metrics keeps reporting the Pod in its last recorded phase, usually Running, until the control plane marks the Node NotReady and evicts the Pod. You need both.

The Kubernetes documentation covers the substrate's logging architecture in detail at kubernetes.io/docs/concepts/cluster-administration/logging. Read it once.

Collector deployment patterns

There are three canonical patterns for deploying observability collectors in Kubernetes. Each has specific failure modes at scale.

DaemonSet (agent per Node). The collector runs as a Pod on every Node, scheduled by a DaemonSet controller. The agent reads node-local logs from /var/log/pods via a hostPath mount, scrapes the kubelet endpoint on its own Node, and receives OTLP traffic from local Pods. This is the cheapest pattern for log collection at high volume because no log data crosses the network until the agent forwards it. The trade-off is that each agent must be configured identically, and tail-based trace sampling does not work at this layer because spans for one trace can land on different Nodes.

Sidecar (per Pod). The collector runs as a container inside each application Pod. Strong isolation between tenants, per-app configuration possible. The trade-off is that you double the resource footprint across the cluster and inherit the operational complexity of managing sidecar definitions through admission webhooks or Pod spec injection. Sidecar is appropriate for high-isolation workloads (payment processors, regulated data) or for vendor agents that don't function as DaemonSets. For general-purpose observability, DaemonSet beats sidecar in nearly every case.

Gateway (cluster-level service). A standalone collector Deployment, typically 2 to 10 replicas behind a Service, receives telemetry from agents, sidecars, or directly from apps. The gateway is where centralized processing happens: filtering, transformation, redaction, tail-based sampling, routing to multiple backends. Tail-based sampling across several replicas also requires that every span of a trace reaches the same replica, which a plain Service does not guarantee, so the senders need trace-ID-aware load balancing. The trade-off is that the gateway becomes a critical-path service. A gateway outage means telemetry loss.
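To make the DaemonSet pattern concrete, here is a minimal manifest sketch. The names, Namespace, ConfigMap, and image tag are placeholders rather than a recommended setup. It shows the pieces the pattern depends on: a read-only hostPath mount for /var/log/pods, the Node name injected so the agent can reach its own kubelet, and a host port so local Pods can send OTLP to the agent on their Node.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-agent              # hypothetical name
  namespace: observability      # hypothetical Namespace
spec:
  selector:
    matchLabels:
      app: otel-agent
  template:
    metadata:
      labels:
        app: otel-agent
    spec:
      serviceAccountName: otel-collector   # hypothetical ServiceAccount
      containers:
        - name: collector
          # Pin a specific tag in practice; "latest" is a placeholder.
          image: otel/opentelemetry-collector-contrib:latest
          args: ["--config=/etc/otelcol/config.yaml"]
          securityContext:
            # Assumption: log files on the host are root-owned.
            runAsUser: 0
          env:
            # Lets the agent address the kubelet on its own Node.
            - name: K8S_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          ports:
            # One option for node-local OTLP: apps send to the host IP.
            - containerPort: 4317
              hostPort: 4317
          volumeMounts:
            - name: varlogpods
              mountPath: /var/log/pods
              readOnly: true
            - name: config
              mountPath: /etc/otelcol
      volumes:
        - name: varlogpods
          hostPath:
            path: /var/log/pods
        - name: config
          configMap:
            name: otel-agent-config   # hypothetical ConfigMap
```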

The recommended architecture for most production clusters is DaemonSet plus Gateway. Agents do node-local collection (especially logs and kubelet stats). The gateway does centralized policy enforcement, tail-based trace sampling, and backend routing.

[Apps] → [DaemonSet Agent (per Node)] → [Gateway (cluster-level)] → [Backend(s)]

This is the architecture the OTel Collector deployment docs recommend at opentelemetry.io/docs/collector/deployment, and it is what most platform teams converge on after trying the alternatives.
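A minimal sketch of the traces path through this architecture follows, written as two Collector configs in one listing. It assumes the contrib distribution, a headless Service in front of the gateway (the hostname is hypothetical), plaintext traffic inside the cluster, and a placeholder backend endpoint. The sampling percentages are illustrative.

```yaml
# --- agent config (DaemonSet, one per Node) ---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch: {}
exporters:
  # Sends all spans of a trace to the same gateway replica,
  # which tail-based sampling requires.
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumption: in-cluster plaintext
    resolver:
      dns:
        # Hypothetical headless Service for the gateway Pods.
        hostname: otel-gateway-headless.observability.svc.cluster.local
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loadbalancing]
---
# --- gateway config (Deployment, several replicas) ---
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  tail_sampling:
    decision_wait: 10s
    policies:
      # A trace is kept if any policy matches.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
  batch: {}
exporters:
  otlp/backend:
    endpoint: backend.example.com:4317   # placeholder backend
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp/backend]
```

The split mirrors the division of labor described above: the agent only receives and forwards, while sampling decisions live in the gateway where whole traces are visible.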

Multi-tenant boundaries

When multiple teams share a cluster, observability becomes a tenant-isolation problem. The collector's ServiceAccount needs cluster-wide permissions to read Pod metadata, scrape kubelet endpoints, and discover services. Without isolation, any Pod can emit telemetry attributed to any tenant. Without attribution, the platform team cannot tell which team's behavior drove the bill.

Three patterns hold up in production:

Per-namespace collectors. Each team's Namespace runs its own collector instance. Strong isolation, operational overhead that scales with team count. Appropriate when team boundaries are firm and shared infrastructure costs are charged back.

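Centralized collector with tenant routing. A single DaemonSet plus Gateway architecture. The gateway routes telemetry to per-tenant backends based on the Namespace name:

processors:
  # Note: recent Collector releases deprecate this processor in favor of
  # the routing connector. Check the version you run.
  routing:
    # Look the key up in resource attributes. The default source is the
    # request context (gRPC metadata or HTTP headers), which does not carry it.
    attribute_source: resource
    # Set by the k8sattributes processor, which must run earlier in the pipeline.
    from_attribute: k8s.namespace.name
    table:
      - value: team-payments
        exporters: [otlp/payments-backend]
      - value: team-checkout
        exporters: [otlp/checkout-backend]
    # Anything from an unlisted Namespace goes to the shared backend.
    default_exporters: [otlp/shared-backend]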

This works when the platform team trusts the Namespace attribution and the tenants accept a shared collector instance.

Centralized collector with mandatory enrichment. The k8sattributes processor enriches every span, metric, and log with the source Pod's Kubernetes metadata. Cost can then be attributed to the Namespace, Deployment, or an owner annotation. This is the pattern most platform teams converge on. Documentation lives at the k8sattributesprocessor README.

processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.namespace.name
        - k8s.node.name
      annotations:
        - tag_name: cost.team
          key: cost-center
          from: pod
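For the annotation rule above to produce a cost.team attribute, the Pod itself has to carry the cost-center annotation, which for a Deployment means setting it on the Pod template rather than on the Deployment's own metadata. A minimal sketch, assuming a hypothetical checkout workload (the names, image, and annotation value are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout                 # hypothetical workload
  namespace: team-checkout
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
      annotations:
        # Read by the k8sattributes rule (key: cost-center, from: pod)
        # and attached to telemetry as cost.team.
        cost-center: checkout-team
    spec:
      containers:
        - name: app
          image: registry.example.com/checkout:1.0.0   # placeholder image
```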

The k8sattributes processor lists and watches objects through the Kubernetes API, which adds load on the API server and requires the right RBAC. The required permissions are minimal but specific:

rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch"]

Without these permissions, the processor cannot look up Pod metadata. Telemetry passes through unenriched and downstream attribution silently breaks. The collector keeps running. The cost-attribution dashboard quietly stops being useful. This is one of the most common production failure modes in Kubernetes observability stacks and it usually goes undetected until the next billing cycle.
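The rules only take effect once they are bound to the ServiceAccount the collector actually runs as, and a missing binding fails in the same quiet way. A minimal sketch, assuming the rules above live in a ClusterRole named otel-collector-k8sattributes and the collector runs in an observability Namespace (all names here are hypothetical):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-collector
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector-k8sattributes
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  # Assumed to contain the rules listed above.
  name: otel-collector-k8sattributes
subjects:
  - kind: ServiceAccount
    name: otel-collector
    namespace: observability
```

The collector's Pod spec also has to set serviceAccountName to this ServiceAccount. Otherwise the Pod runs as the Namespace's default account and the binding does nothing.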

The cost trajectory most teams underestimate

Teams often underestimate the cost trajectory of Kubernetes observability. The reason is structural. Telemetry volume in a Kubernetes cluster grows faster than the workload it observes, and Kubernetes-specific dynamics make that growth invisible until the bill arrives.

Three drivers compound:

Pod and container churn. Every Pod restart, every rolling deploy, every scheduled CronJob creates new Pod UIDs and container IDs. If those identifiers are used as metric labels, every churn event creates new time series. A service that runs five Pods and redeploys daily creates 35 new Pod UIDs per week (5 Pods × 7 deploys) and at least as many new container IDs. Every metric that carries k8s.pod.uid or container.id as a label gains that many new series per week, even though the service's actual behavior has not changed. The vendor bills per active series. The team had no way to know.

Log accumulation per service. A new microservice ships with at least one logger. Most ship with several. Log volume scales with service count. If each service produces 1 to 10 GB per day at moderate traffic, a platform team running 100 services is shipping between 100 GB and 1 TB of logs per day to whichever backend bills them. Adding twenty services adds another 20 to 200 GB per day with no decision made about whether the new logs deserve to exist.

Retention drift. Default retention configurations in Helm charts and Terraform modules are usually too generous. A team configures 30-day retention "just to be safe" for one application. The configuration gets copied. The retention multiplier compounds with volume. The bill grows even when usage does not.

The math from any one of these is enough to derail a budget. Combined, they produce the cost curves platform teams describe to finance as "we don't know what changed." Something did change. Several things changed. The platform team did not see them because the changes were normal Kubernetes behavior, not exceptional events.

The mitigation is not vigilance. Vigilance fails for the same reason any human-in-the-loop process fails at scale. It is tedious, it is continuous, and it competes with feature work. Mitigation requires that policy decisions about telemetry live somewhere they can be enforced continuously, without manual review.

From operator-by-hand to operator-by-policy

The state of Kubernetes observability for most platform teams looks like this. A Helm-deployed collector with the default values. Maybe a few custom processors added during the last cost-cutting exercise. A dashboard the team built two quarters ago that no one updates. A runbook with three rules that are not enforced. A monthly cost review where the team identifies a couple of outliers and asks the responsible service owner to fix them.

This is operator-by-hand. It works until it does not. When the cluster runs 20 services, one platform engineer can keep up. When it runs 200 services, the platform team falls behind and the gap compounds. New cardinality bombs land before the last ones are fixed. The default Helm values stop matching the cluster's actual needs. The runbook becomes legacy documentation.

The direction the industry is moving in is operator-by-policy. The platform team defines the rules: which attributes are approved as metric labels, which log severities are allowed in production, which traces must be sampled, which Namespaces can export to which backends. The runtime enforces them. The collector becomes an execution engine for policy, not a hand-maintained configuration.
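Two of those rules can be sketched with processors from the Collector's contrib distribution. This is an illustration under assumptions, not a complete policy: the approved label list is hypothetical, and the processor names after the slash are arbitrary.

```yaml
processors:
  # "Which log severities are allowed in production":
  # drop every log record below INFO. Records with no severity
  # set are left alone.
  filter/prod-severity:
    error_mode: ignore
    logs:
      log_record:
        - severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO

  # "Which attributes are approved as metric labels":
  # keep only an allowlist of data point attributes.
  transform/approved-labels:
    error_mode: ignore
    metric_statements:
      - context: datapoint
        statements:
          # Hypothetical allowlist; everything else is removed.
          - keep_keys(attributes, ["http.request.method", "http.response.status_code"])
```

One caveat on the allowlist: removing attributes can leave several data points with identical label sets. Depending on the backend, those may need to be aggregated before export.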

This is the role of an agentic telemetry operator. The agent continuously evaluates incoming telemetry. Low-value telemetry gets dropped, redacted, sampled, or routed differently before it reaches the backend. The platform team's cognitive load drops from "review what is happening" to "review what the agent caught."

The opportunity is that Kubernetes makes this enforcement possible at the substrate. Every signal flows through collectors the platform team controls. There is one place to apply policy. What has been missing is an operator-class layer that takes advantage of it. That is what comes next.

How Sawmills approaches this

Sawmills is an agentic telemetry operator built for the Kubernetes telemetry path. It analyzes telemetry in stream without hand-maintained rules, while also enabling platform and DevOps teams to apply Sawmills fixes with a single click or within existing engineering workflows.

Sawmills evaluates every metric and log. Cardinality limits hold without manual reviews. Forbidden attributes never reach the backend. Sampling decisions adapt to traffic. Cost stays bounded as services multiply.

Sawmills works alongside the existing observability backend, whether Datadog, Splunk, New Relic, Grafana Cloud, or something else. What changes is that the work platform teams do month-to-month moves from manual maintenance to AI-powered management. The operator does the rest.
