Log aggregation in Kubernetes: the pipeline that survives 5,000 pods, not the tool that demos well at 50

A pod writes a log line to stdout. The container runtime captures it and writes it to a file on the node, and the kubelet rotates that file when it hits 10 MiB. The pod gets rescheduled to another node an hour later, and the file is gone. Multiply that by every service, every replica, every restart, and you have the reason log aggregation exists in Kubernetes: the logs you need to debug an incident do not survive the thing that produced them.
The usual response is to pick a tool. Someone deploys the EFK stack, or Loki, or wires up a vendor agent, and the demo works. Logs show up in a dashboard. Then the cluster grows from 50 nodes to 500, a noisy service starts emitting a million lines a minute, and the same setup either falls behind, drops lines, or sends a bill that makes the CFO ask what changed.
The tool is the least interesting decision here. What determines whether aggregation holds is the architecture underneath it and who owns the policy that governs what gets collected. Both of those are platform-team problems, and both are usually decided by default rather than on purpose.
What aggregation actually has to do in a cluster
Strip away the branding and every Kubernetes log aggregation pipeline does the same four things. It reads logs off each node, enriches them with the Kubernetes metadata that makes them queryable, ships them somewhere durable, and does this without losing data when pods churn.
The reading step is more specific than most guides admit. Kubernetes does not ship container logs anywhere; it only manages them as files on the node. The container runtime (containerd or CRI-O) writes them to /var/log/pods/<namespace>_<pod>_<uid>/<container>/<rotation>.log, with symlinks under /var/log/containers. The kubelet handles rotation, by default at 10 MiB per file with five files retained per container, per the Kubernetes logging architecture docs. A collector that reads these files has to track inodes across rotation events, or it will either duplicate lines or skip them when a file rolls.
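On clusters using a CRI runtime, the rotation thresholds are exposed as kubelet settings. The fragment below is a minimal sketch showing the two relevant KubeletConfiguration fields at their documented defaults; how this file reaches your nodes depends on your distribution and is not covered here.

```yaml
# KubeletConfiguration fragment: container log rotation.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi   # rotate a container's log file once it reaches this size
containerLogMaxFiles: 5     # keep at most this many log files per container
```

Raising either value buys a slow collector more time before unread lines are rotated away, at the cost of node disk.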
Enrichment is what turns a raw line into something you can actually query. A log that says connection refused is useless without knowing which pod, namespace, deployment, and node produced it. That metadata does not live in the log line. It lives in the Kubernetes API, and the pipeline has to join the two. This is the step that quietly drives a lot of downstream cost, because every attribute you attach is a dimension your backend has to store and index.
The architecture that scales: collect at the node, enforce at the gateway
There are three canonical ways to deploy collectors in Kubernetes, and only one combination holds up across cluster sizes. Sidecars give you strong per-pod isolation but put a collector in every pod, so the collector footprint grows with pod count instead of node count, and every pod spec gains a container. A bare DaemonSet collects efficiently but gives you no central place to apply policy. The pattern that scales is a DaemonSet for collection plus a gateway for processing, documented in the OpenTelemetry collector deployment guide.
The DaemonSet runs one collector per node. Because it reads logs from the local filesystem, there is no network hop between the app and the collector, which is what makes it cheap at high log volume. A minimal OpenTelemetry filelog setup looks like this:
receivers:
  filelog:
    include: [ /var/log/pods/*/*/*.log ]
    include_file_path: true
    operators:
      - type: container
processors:
  k8sattributes:
    auth_type: serviceAccount
    extract:
      metadata:
        - k8s.namespace.name
        - k8s.pod.name
        - k8s.deployment.name
        - k8s.node.name
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc:4317
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8sattributes]
      exporters: [otlp]
The k8sattributes processor needs RBAC to watch the API, specifically get, list, and watch on pods and namespaces, and typically on replicasets as well if you want k8s.deployment.name resolved. Skipping the RBAC is the single most common reason a walkthrough fails to reproduce, and it is why the filelog receiver and k8sattributes processor docs are worth reading before you copy a config. We covered the mechanics of reading logs off the node in more depth in Container Logs in Kubernetes: How to View and Collect Them.
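The RBAC requirement can be sketched as the manifests below. The names otel-agent and otel-agent-k8sattributes are hypothetical, and the observability namespace is borrowed from the gateway endpoint in the config above. The replicasets rule is an assumption based on how the processor commonly resolves k8s.deployment.name; check the processor docs for the version you run.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: otel-agent                  # hypothetical name, used by the DaemonSet pods
  namespace: observability
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-agent-k8sattributes    # hypothetical name
rules:
  # Core API group: what the processor watches to enrich records.
  - apiGroups: [""]
    resources: ["pods", "namespaces"]
    verbs: ["get", "list", "watch"]
  # Assumption: needed to map a pod to its Deployment via its ReplicaSet.
  - apiGroups: ["apps"]
    resources: ["replicasets"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-agent-k8sattributes
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-agent-k8sattributes
subjects:
  - kind: ServiceAccount
    name: otel-agent
    namespace: observability
```

The DaemonSet pod spec then has to set serviceAccountName to the same account, or the processor will start without permission to enrich anything.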
The gateway is a separate deployment, usually a handful of replicas behind a Service. The DaemonSets forward to it, and it becomes the one place where filtering, redaction, routing, and sampling happen before anything reaches a backend. Centralizing policy here is the entire point. It is also why the gateway becomes a critical-path service, which leads directly to where this architecture breaks.
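A gateway config along these lines is one way to picture that role. It is a sketch, not a production setup: the backend URL is a placeholder, the memory limits are illustrative, and the policy processors (filtering, redaction, sampling) would be added to the pipeline between memory_limiter and batch.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317      # matches the port the node agents export to
processors:
  memory_limiter:                   # refuse data before the process runs out of memory
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  batch: {}                         # default batching before export
exporters:
  otlphttp:
    endpoint: https://logs-backend.example.com   # placeholder backend, not a real endpoint
service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```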
Where it breaks at scale
A pipeline that works at 50 pods can fail at 5,000 in ways that have nothing to do with the tool you picked.
Log rotation under load is the first failure mode. When a high-volume pod rotates a file every few seconds and the collector is already behind, the inode-tracking logic that handles rotation cleanly at low volume starts duplicating or dropping lines. This shows up as gaps in the exact incident window you most need, because that is when volume spikes.
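One partial mitigation on the collector side is to persist read offsets, so a restarted agent resumes where it stopped instead of re-reading or skipping. The fragment below is a sketch using the file_storage extension; the directory is a hypothetical path and is assumed to be a hostPath volume that outlives the collector pod. It does not recover lines from files that were rotated away before the collector read them; it only narrows the window.

```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/checkpoints   # hypothetical; assumed to be a hostPath mount
receivers:
  filelog:
    include: [ /var/log/pods/*/*/*.log ]
    include_file_path: true
    storage: file_storage     # persist per-file read offsets across restarts
    poll_interval: 200ms      # how often the receiver checks for new data and new files
    operators:
      - type: container
service:
  extensions: [file_storage]  # fragment: the pipelines section is unchanged and omitted
```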
Multiline reassembly is the second. A stack trace is one logical event spread across dozens of physical lines. The collector reassembles it with a start-pattern regex, and that regex is brittle. Mixed formats in one stream, leading whitespace that differs from the framework default, or interleaved output from multiple processes all break it. The result is either fragmented traces or a single log entry that swallows half a file.
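With the filelog receiver, reassembly is typically done with a recombine operator placed after the container parser. The sketch below assumes every new event begins with an ISO-style date, which is exactly the kind of assumption that breaks on mixed formats. The size and flush limits are illustrative guards against the runaway-entry case.

```yaml
receivers:
  filelog:
    include: [ /var/log/pods/*/*/*.log ]
    include_file_path: true
    operators:
      - type: container
      - type: recombine
        combine_field: body
        # Assumption: a line starting with YYYY-MM-DD opens a new event;
        # anything else is treated as a continuation of the previous one.
        is_first_entry: body matches "^\\d{4}-\\d{2}-\\d{2}"
        max_log_size: 1MiB        # flush rather than let one entry grow without bound
        force_flush_period: 5s    # emit a pending entry if no new first line arrives
```

A service that logs in a different format needs its own pattern, which is why a single cluster-wide regex rarely survives contact with more than a few teams.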
Attribute cardinality is the third, and it is the one that shows up on the invoice. Every Kubernetes label you attach during enrichment becomes a dimension downstream. Promote pod_name to an indexed label, and you have created a label whose value changes on every restart, which is exactly the kind of unbounded cardinality that makes label-indexed backends slow and content-indexed backends expensive. The pipeline, not the backend, is where you decide which attributes are worth their cost.
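As a sketch of making that decision in the pipeline, the resource processor can delete attributes before export. Which keys to drop is an assumption here; k8s.pod.uid and k8s.pod.name are used as examples of values that change on every restart, and many teams will prefer to keep the pod name as a non-indexed field rather than remove it.

```yaml
processors:
  resource/drop_high_cardinality:   # the suffix after the slash is an arbitrary label
    attributes:
      - key: k8s.pod.uid            # unique per pod instance
        action: delete
      - key: k8s.pod.name           # changes on every restart for Deployment-managed pods
        action: delete
```

The processor only takes effect once it is listed in the logs pipeline, after k8sattributes.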
The gateway itself is the fourth. Once all egress flows through it, a gateway outage means telemetry loss, and a traffic spike means it needs to scale fast or back-pressure into the agents. Scaling it on CPU alone usually scales it on the wrong signal. We wrote about scaling the collector on telemetry throughput instead in You've Been Autoscaling Your Collector All Wrong.
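On the agent side, the exporter's queue and retry settings decide what happens while the gateway is slow or unavailable. The values below are illustrative, not recommendations; the endpoint is the one used in the agent config earlier.

```yaml
exporters:
  otlp:
    endpoint: otel-gateway.observability.svc:4317
    retry_on_failure:
      enabled: true
      initial_interval: 5s      # first retry delay
      max_interval: 30s         # cap on the backoff between retries
      max_elapsed_time: 300s    # give up on a batch after this long
    sending_queue:
      enabled: true
      queue_size: 5000          # buffered items held while the gateway catches up
```

An in-memory queue like this absorbs short spikes only. Once it fills, the agent starts rejecting data, so the gateway still has to scale on a signal that tracks throughput.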
The backend choice is downstream of the pipeline
By the time you are comparing Loki against Elasticsearch, the important decisions are already made. They are worth understanding, but they do not rescue a bad pipeline.
The core difference is what gets indexed. Elasticsearch indexes the full content of every log line, which makes arbitrary full-text search fast and makes storage expensive. Grafana Loki indexes only labels and stores log bodies as compressed chunks in object storage, which is why its label-based design maps cleanly onto Kubernetes and why teams routinely report storage costs well below a content-indexed stack for the same retention. The trade is real: Loki is excellent when you know which service you are looking at and want to filter by namespace or pod, and weaker for ad-hoc search across arbitrary content where Elasticsearch still wins.
But notice what both inherit. If the pipeline ships a million health-check logs a minute, Loki stores a million cheap chunks and Elasticsearch indexes a million expensive documents, and both are storing noise. The label cardinality you generate upstream determines Loki's query performance. The volume you fail to drop upstream determines Elasticsearch's index size. Choosing a cheaper backend lowers the unit cost of a decision you should not have made. It does not unmake it.
Aggregation is a policy problem, not a plumbing problem
Here is the gap most guides miss. They explain how to wire a collector to a backend, and almost none of them name the actual structural problem: developers create logs because logs are cheap to add in code, and the platform team pays for them because volume compounds at the destination. Nobody owns the cleanup, so it never happens.
The fix is to decide what deserves to be aggregated, and to enforce that decision in the pipeline rather than in a quarterly review. The highest-value moves are unglamorous. Drop kube-probe and health-check traffic at the node before it enters the gateway. Sample chatty INFO logs from services that emit the same line thousands of times a minute, keeping enough to debug and dropping the rest. Route by value so error logs go to a searchable hot tier and debug output goes to cheap archive or nowhere. Strip the high-cardinality attributes that no one queries. None of this is a tool feature. It is policy, and it has to be enforced continuously because every new service ships with default instrumentation that ignores it.
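The first of those moves can be sketched with the filter processor, which drops any log record matching one of its conditions. The match on kube-probe assumes the probe's user agent appears in the log body, as it does in typical HTTP access logs; the severity rule assumes services set a severity at all. Sampling and tiered routing need additional components and are left out.

```yaml
processors:
  filter/drop_noise:            # the suffix after the slash is an arbitrary label
    error_mode: ignore          # on an evaluation error, keep the record rather than fail
    logs:
      log_record:
        # Drop access-log lines generated by kubelet health probes.
        - 'IsMatch(body, "kube-probe")'
        # Drop anything below INFO, but leave records with no severity alone.
        - 'severity_number != SEVERITY_NUMBER_UNSPECIFIED and severity_number < SEVERITY_NUMBER_INFO'
```

Placed in the node agent's pipeline, this removes the traffic before it crosses the network. Placed in the gateway, it keeps the rule in one file that the platform team owns.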
That is the difference between a pipeline that holds at 5,000 pods and one that merely demoed well at 50. The first one has an owner and a set of rules that travel with every new service. The second one has a dashboard and a growing bill.
How Sawmills approaches this
Sawmills analyzes telemetry data in-stream. Instead of a platform engineer hand-editing collector configs every time a service is added or a log volume spikes, the Sawmills agent watches what is flowing through the pipeline and enforces the drop, sample, redact, and route decisions, inside the guardrails the platform team defines. When a new service ships with noisy default instrumentation, the policy already applies to it, rather than waiting for someone to notice it on next month's invoice.
This is the platform-team frame in practice. A team running ten services can keep aggregation honest by hand. A team running two hundred cannot, because the work scales with service count, not with headcount. Sawmills makes the operator-per-policy model scale with the cluster, so aggregation stays governed as services come and go and the underlying OpenTelemetry collectors keep doing what they do best: executing the policy, not deciding it.
If your Kubernetes log volume is growing faster than your ability to reason about what is in it, schedule a demo to see Sawmills enforce aggregation policy continuously against a cluster shaped like yours.


