Telemetry volume grows faster than most teams expect. Logs, metrics, traces, Kubernetes events, and infrastructure signals all need to be collected, shaped, filtered, enriched, routed, and protected before they land in storage or analysis tools.
That is where an observability pipeline comes in. Instead of shipping every raw event downstream, DevOps teams use a pipeline to control what gets collected, what gets transformed, what gets dropped, and where each signal goes. For a broader primer, see our guide to observability pipelines.
This article focuses on one practical decision inside that pipeline: Vector vs OpenTelemetry Collector for log collection.
Both can collect Kubernetes logs. Both can transform telemetry. Both can run as agents or gateways. But they feel very different once you operate them in production.
A detailed evaluation for teams planning the rollout
The biggest difference between Vector and OTel Collector is not whether they can collect logs. They both can.
The difference is what each one optimizes for.
Vector is built around observability data pipelines. Its configuration is organized as sources, transforms, and sinks. That makes it easy to read a log flow from top to bottom: collect here, transform there, send over there. Vector positions itself as an observability data pipeline for collecting, transforming, and routing logs and metrics. See the Vector GitHub repository.
OTel Collector is built around OpenTelemetry’s broader architecture. It receives telemetry, processes it, and exports it through pipelines made of receivers, processors, and exporters. Those pipelines can operate on logs, metrics, and traces, which makes OTel Collector a natural fit for teams standardizing on OpenTelemetry across their stack. See the OpenTelemetry Collector architecture docs.
That leads to a simple operational distinction:
Vector feels like a log pipeline first. OTel Collector feels like a telemetry standard first.
Both are valid. The right choice depends on whether your immediate problem is log volume and transformation, or long-term telemetry standardization across logs, metrics, and traces.
How Vector handles log collection
Vector’s config model is direct:
sources → transforms → sinks
A source receives data. A transform changes, filters, samples, enriches, or routes data. A sink sends data to a destination.
A minimal Kubernetes log collection pipeline looks like this:
data_dir: /var/lib/vector

sources:
  kubernetes_logs:
    type: kubernetes_logs

transforms:
  normalize:
    type: remap
    inputs:
      - kubernetes_logs
    source: |
      .collector = "vector"
      .environment = "${ENVIRONMENT:-unknown}"
      parsed, err = parse_json(.message)
      if err == null && is_object(parsed) {
        . = merge(., parsed)
      }
      .severity = .severity ?? .level ?? .log_level ?? "INFO"
      .service.name = .service.name ?? .service ?? .kubernetes.container_name ?? "unknown"
      if exists(.authorization) {
        .authorization = "[REDACTED]"
      }
      if exists(.request.headers.authorization) {
        .request.headers.authorization = "[REDACTED]"
      }

  drop_noise:
    type: filter
    inputs:
      - normalize
    condition: |
      !(
        contains(string!(.message), "/healthz") ||
        contains(string!(.message), "/readyz") ||
        contains(string!(.message), "kube-probe") ||
        .severity == "DEBUG"
      )

sinks:
  outbound:
    type: http
    inputs:
      - drop_noise
    uri: "https://telemetry-gateway.example.com/logs"
    method: post
    compression: gzip
    encoding:
      codec: json
The important thing is readability. A DevOps engineer can usually understand the flow without already knowing a deep telemetry framework.
Vector’s transformation language, VRL, is designed specifically for observability data. The docs describe it as an expression-oriented language for transforming logs and metrics, with built-in functions tailored to observability use cases. See the VRL documentation.
That makes Vector especially strong when the work looks like this:
parse JSON
normalize severity
rename fields
drop health checks
redact secrets
route audit logs separately
sample low-value info logs
remove high-cardinality Kubernetes metadata
For log-heavy pipelines, that work is not occasional. It is the job.
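For instance, the sampling task above can be a single transform. The sketch below is a minimal example wired after the normalize transform from the config above; the transform name and sample rate are illustrative, not a recommendation.
transforms:
  sample_low_value:
    type: sample
    inputs:
      - normalize
    # Keep roughly 1 in 10 of the events that reach this transform; route
    # errors, warnings, and audit events around it rather than through it.
    rate: 10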
How OTel Collector handles log collection
OTel Collector uses this model:
receivers → processors → exporters
Those components are then wired together inside service pipelines.
A comparable OTel Collector log pipeline looks like this:
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: end
    operators:
      - type: container

processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25

  filter/drop_noise:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(log.body, ".*GET /healthz.*")'
        - 'IsMatch(log.body, ".*GET /readyz.*")'
        - 'IsMatch(log.body, ".*kube-probe.*")'

  transform/normalize:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(log.attributes["collector"], "otelcol")
          - set(log.attributes["environment"], resource.attributes["deployment.environment"])
          - set(log.attributes["service.name"], resource.attributes["service.name"]) where log.attributes["service.name"] == nil
          - delete_key(log.attributes, "authorization")
          - delete_key(log.attributes, "password")

  batch:
    send_batch_size: 8192
    timeout: 200ms

exporters:
  otlphttp/outbound:
    endpoint: https://telemetry-gateway.example.com
    compression: gzip

service:
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - memory_limiter
        - filter/drop_noise
        - transform/normalize
        - batch
      exporters:
        - otlphttp/outbound

The OTel version is more verbose, but the structure is powerful. It gives platform teams a standard way to assemble pipelines for logs, metrics, and traces. The Collector architecture supports one or more pipelines, each with receivers, optional processors, and exporters. See the OpenTelemetry Collector architecture docs.
That matters when you want one collector strategy across many teams.
For example:
logs pipeline:
filelog → memory_limiter → filter → transform → batch → exporter
metrics pipeline:
prometheus → memory_limiter → attributes → batch → exporter
traces pipeline:
otlp → memory_limiter → tail_sampling → batch → exporter
Vector can handle multiple telemetry types too, but OTel Collector is more naturally aligned with the OpenTelemetry ecosystem and data model.
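As a small illustration of that point, a Vector config can carry metrics next to the log pipeline shown earlier. This is a sketch only; it assumes the host_metrics source and prometheus_exporter sink, and the component names and listen address are illustrative.
sources:
  node_metrics:
    type: host_metrics

sinks:
  node_metrics_out:
    type: prometheus_exporter
    inputs:
      - node_metrics
    # Exposes collected host metrics for a Prometheus-compatible scraper
    address: 0.0.0.0:9598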
Performance comparison
Performance is where teams often want a clean winner. In practice, there is no universal answer.
Collector performance changes based on log size, parsing rules, multiline handling, metadata enrichment, regex use, batching, compression, buffering, downstream latency, CPU limits, memory limits, and failure behavior.
Still, there are useful signals. Vector publishes sizing guidance that is easy to use during early capacity planning. In its examples, Vector estimates around 10 MiB/s per vCPU for unstructured logs and around 25 MiB/s per vCPU for structured logs, metrics, and traces. See Vector’s sizing guidance.
That lets teams do rough math before load testing:
Expected unstructured log volume: 200 MiB/s
Vector planning estimate: 10 MiB/s per vCPU
Initial capacity estimate: 20 vCPU
Then add headroom and test with your real transforms
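Translated into Kubernetes terms, that estimate usually lands as DaemonSet resource requests in the Helm values. The numbers below are illustrative only, assuming roughly 20 MiB/s of unstructured logs per node against the ~10 MiB/s-per-vCPU planning figure; treat them as a starting point for your own load tests.
resources:
  requests:
    cpu: "2"          # ~20 MiB/s per node / ~10 MiB/s per vCPU
    memory: 512Mi
  limits:
    cpu: "3"          # headroom for bursts and transform overhead
    memory: 1Gi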
OpenTelemetry takes a different approach. The OTel project publishes Collector benchmark infrastructure, and its docs state that load tests run on every commit to the opentelemetry-collector-contrib repository. Those tests run Collector binaries with different configurations and send traffic through them. See the OpenTelemetry Collector benchmark docs.
That is useful for ecosystem reliability. It does not give you the same simple sizing formula, but it does show that Collector performance is continuously tested.
A Kubernetes log collector benchmark also gives a useful, workload-specific comparison. In the benchmark’s 100-Pod scenario, Vector reached 25,000 logs/sec, while OpenTelemetry Collector reached 20,500 logs/sec. At roughly 10,000 logs/sec, Vector used 0.412 CPU and OTel Collector used 0.491 CPU. In the same 10,000 logs/sec test, OTel Collector used 106.83 MiB of mean memory, while Vector used 153.50 MiB. See the VictoriaMetrics log collector benchmark.
That benchmark should not be treated as universal truth. It used specific versions, a specific Kubernetes setup, official Helm chart defaults, a 1-core CPU limit, a 1 GiB memory limit, and no performance tuning. The authors also disclosed collector-specific edge cases around rotation and backlog behavior. See the benchmark writeup and the benchmark source code.
The practical read: for DevOps teams, the performance decision should come from a realistic test plan:
normal load:
current production logs/sec and MiB/sec
burst load:
3x to 5x expected production volume
failure mode:
destination unavailable for 5, 15, and 60 minutes
measure:
CPU
memory
disk growth
queue growth
dropped records
duplicate records
p95/p99 latency
restart recovery
malformed records
The performance takeaway is straightforward: Vector has stronger evidence for log-pipeline efficiency and practical sizing. OTel Collector has stronger evidence for ecosystem-level testing and standardization. Neither replaces your own benchmark.
Installation and first deployment
Both tools are easy to install. The difference is how many architectural decisions you need to make upfront.
Vector installation example
A basic Vector Helm install for Kubernetes agent mode:
helm repo add vector https://helm.vector.dev
helm repo update
helm upgrade --install vector vector/vector \
--namespace observability \
--create-namespace \
--set role=Agent
A more practical first values file:
role: Agent

customConfig:
  data_dir: /var/lib/vector

  api:
    enabled: true
    address: 0.0.0.0:8686

  sources:
    kubernetes_logs:
      type: kubernetes_logs

  transforms:
    add_context:
      type: remap
      inputs:
        - kubernetes_logs
      source: |
        .collector = "vector"
        .cluster = "${CLUSTER_NAME:-unknown}"
        .environment = "${ENVIRONMENT:-unknown}"

    drop_health_checks:
      type: filter
      inputs:
        - add_context
      condition: |
        !(
          contains(string!(.message), "/healthz") ||
          contains(string!(.message), "/readyz")
        )

  sinks:
    outbound:
      type: http
      inputs:
        - drop_health_checks
      uri: "https://telemetry-gateway.example.com/logs"
      method: post
      compression: gzip
      encoding:
        codec: json
Vector’s early setup is easy to explain:
Install agent
Collect Kubernetes logs
Add transforms
Send logs downstream
That simplicity matters when a platform team wants adoption across multiple service teams.
OTel Collector installation example
A basic OTel Collector Helm install:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
--namespace observability \
--create-namespace \
--set image.repository=otel/opentelemetry-collector-k8s \
--set mode=daemonset
For Kubernetes log collection, the Helm chart supports a logsCollection preset. The chart docs note that this feature requires an agent collector deployment and a Collector image that includes the filelog receiver, such as the Kubernetes Collector image. See the OpenTelemetry Helm chart docs.
A practical values file:
mode: daemonset

image:
  repository: otel/opentelemetry-collector-k8s

presets:
  logsCollection:
    enabled: true
    includeCollectorLogs: false

config:
  processors:
    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25

    filter/drop_noise:
      error_mode: ignore
      logs:
        log_record:
          - 'IsMatch(log.body, ".*GET /healthz.*")'
          - 'IsMatch(log.body, ".*GET /readyz.*")'

    batch:
      send_batch_size: 8192
      timeout: 200ms

  exporters:
    otlphttp/outbound:
      endpoint: https://telemetry-gateway.example.com
      compression: gzip

  service:
    pipelines:
      logs:
        processors:
          - memory_limiter
          - filter/drop_noise
          - batch
        exporters:
          - otlphttp/outbound
OTel Collector setup is not hard, but teams need to make more choices:
- Which Collector distribution?
- DaemonSet, Deployment, or gateway?
- Which receivers are included?
- Which processors are available?
- Which pipelines handle logs, metrics, and traces?
- Which exporters are approved?
That extra structure is worthwhile when the collector becomes a shared platform standard.
Ongoing filtering and transformation
This is where the difference becomes obvious. Installation happens once. Filtering and transformation happen forever. Your team will eventually need to answer questions like:
- Can we drop health checks only in production?
- Can we keep all error logs but sample info logs?
- Can we redact authorization headers?
- Can we parse this legacy plain-text format?
- Can we remove high-cardinality Kubernetes labels?
- Can we route audit logs to a different destination?
- Can we prove a drop rule did not remove useful data?
Vector filtering and transformation
Vector’s VRL is usually more comfortable for log-heavy work. Example: parse JSON, normalize fields, redact secrets, and mark routing flags.
transforms:
  app_log_policy:
    type: remap
    inputs:
      - kubernetes_logs
    drop_on_error: false
    source: |
      .collector = "vector"
      .event.original = .message
      parsed, err = parse_json(.message)
      if err == null && is_object(parsed) {
        . = merge(., parsed)
      }
      .severity = upcase(string!(.severity ?? .level ?? .log_level ?? "INFO"))
      .service.name = .service.name ?? .service ?? .kubernetes.container_name ?? "unknown"
      if exists(.password) {
        .password = "[REDACTED]"
      }
      if exists(.token) {
        .token = "[REDACTED]"
      }
      if exists(.authorization) {
        .authorization = "[REDACTED]"
      }
      if exists(.request.headers.authorization) {
        .request.headers.authorization = "[REDACTED]"
      }
      del(.kubernetes.pod_uid)
      del(.kubernetes.container_id)
      .routing.keep = .severity == "ERROR" || .severity == "FATAL"
      .routing.audit = exists(.audit_event) || contains(string!(.message), "AUDIT")
      .routing.low_value = contains(string!(.message), "/healthz") || .severity == "DEBUG"
Then split streams by policy:
transforms:
  important_logs:
    type: filter
    inputs:
      - app_log_policy
    condition: '.routing.keep == true || .routing.audit == true'

  standard_logs:
    type: filter
    inputs:
      - app_log_policy
    condition: '.routing.keep != true && .routing.audit != true && .routing.low_value != true'
This is compact. The policy is readable. The transform logic stays close to the log stream. For teams that frequently add, tune, or roll back log rules, that is a real advantage.
OTel Collector filtering and transformation
OTel Collector uses processors. The transform processor modifies telemetry using OTTL statements, and those statements execute against incoming telemetry in the order specified by the configuration. See the OTel transform processor docs.
Example: drop noisy logs.
processors:
  filter/drop_low_value:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(log.body, ".*GET /healthz.*")'
        - 'IsMatch(log.body, ".*GET /readyz.*")'
        - 'log.severity_number < SEVERITY_NUMBER_INFO'
Example: normalize and redact fields.
processors:
  transform/normalize_logs:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(log.attributes["collector"], "otelcol")
          - set(log.attributes["event.original"], log.body)
          - set(log.attributes["service.name"], resource.attributes["service.name"]) where log.attributes["service.name"] == nil
          - delete_key(log.attributes, "password")
          - delete_key(log.attributes, "token")
          - delete_key(log.attributes, "authorization")
Then wire the processors into the pipeline:
service:
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - memory_limiter
        - filter/drop_low_value
        - transform/normalize_logs
        - batch
      exporters:
        - otlphttp/outbound
This is more verbose than Vector, but it is also easier to standardize. A platform team can define approved processor patterns and apply them across logs, metrics, and traces.
The tradeoff is day-to-day ergonomics. For log-first parsing, VRL often feels more natural. For OpenTelemetry-wide governance, OTTL fits better.
Production config example: Vector
Here is a fuller Vector example for Kubernetes log collection with parsing, redaction, filtering, routing, and disk buffering.
role: Agent

customConfig:
  data_dir: /var/lib/vector

  acknowledgements:
    enabled: true

  api:
    enabled: true
    address: 0.0.0.0:8686

  sources:
    kubernetes_logs:
      type: kubernetes_logs
      glob_minimum_cooldown_ms: 10000

  transforms:
    parse_normalize_redact:
      type: remap
      inputs:
        - kubernetes_logs
      drop_on_error: false
      source: |
        .collector = "vector"
        .cluster = "${CLUSTER_NAME:-unknown}"
        .environment = "${ENVIRONMENT:-unknown}"
        .event.original = .message
        parsed, err = parse_json(.message)
        if err == null && is_object(parsed) {
          . = merge(., parsed)
        }
        .severity = upcase(string!(.severity ?? .level ?? .log_level ?? "INFO"))
        .service.name = .service.name ?? .service ?? .kubernetes.container_name ?? "unknown"
        if exists(.authorization) {
          .authorization = "[REDACTED]"
        }
        if exists(.request.headers.authorization) {
          .request.headers.authorization = "[REDACTED]"
        }
        if exists(.token) {
          .token = "[REDACTED]"
        }
        if exists(.password) {
          .password = "[REDACTED]"
        }
        del(.kubernetes.pod_uid)
        del(.kubernetes.container_id)
        .routing.audit = exists(.audit_event) || contains(string!(.message), "AUDIT")
        .routing.error = .severity == "ERROR" || .severity == "FATAL"
        .routing.noise = contains(string!(.message), "/healthz") ||
          contains(string!(.message), "/readyz") ||
          contains(string!(.message), "kube-probe") ||
          .severity == "DEBUG"

    audit_logs:
      type: filter
      inputs:
        - parse_normalize_redact
      condition: '.routing.audit == true'

    error_logs:
      type: filter
      inputs:
        - parse_normalize_redact
      condition: '.routing.error == true && .routing.audit != true'

    standard_logs:
      type: filter
      inputs:
        - parse_normalize_redact
      condition: '.routing.audit != true && .routing.error != true && .routing.noise != true'

  sinks:
    audit_out:
      type: http
      inputs:
        - audit_logs
      uri: "https://telemetry-gateway.example.com/logs/audit"
      method: post
      compression: gzip
      encoding:
        codec: json
      buffer:
        type: disk
        max_size: 21474836480
        when_full: block

    error_out:
      type: http
      inputs:
        - error_logs
      uri: "https://telemetry-gateway.example.com/logs/errors"
      method: post
      compression: gzip
      encoding:
        codec: json
      buffer:
        type: disk
        max_size: 10737418240
        when_full: block

    standard_out:
      type: http
      inputs:
        - standard_logs
      uri: "https://telemetry-gateway.example.com/logs/standard"
      method: post
      compression: gzip
      encoding:
        codec: json
      buffer:
        type: disk
        max_size: 5368709120
        when_full: drop_newest
The useful design choice here is that audit, error, and standard logs do not share the same durability policy.
audit logs:
block when buffer is full
error logs:
block when buffer is full
standard logs:
drop newest when buffer is full
That is exactly the kind of policy separation teams need in production.
Audit logs and high-severity error logs may be worth preserving even if the downstream system is slow. Standard informational logs may not deserve the same treatment. Separating these streams helps teams protect important data without letting low-value telemetry destabilize the collection layer.
Production config example: OTel Collector
Here is a comparable OTel Collector configuration.
mode: daemonset

image:
  repository: otel/opentelemetry-collector-k8s

presets:
  logsCollection:
    enabled: true
    includeCollectorLogs: false

config:
  receivers:
    filelog:
      include:
        - /var/log/pods/*/*/*.log
      start_at: end
      include_file_path: true
      operators:
        - type: container

  processors:
    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25

    filter/drop_noise:
      error_mode: ignore
      logs:
        log_record:
          - 'IsMatch(log.body, ".*GET /healthz.*")'
          - 'IsMatch(log.body, ".*GET /readyz.*")'
          - 'IsMatch(log.body, ".*kube-probe.*")'

    transform/normalize_redact:
      error_mode: ignore
      log_statements:
        - context: log
          statements:
            - set(log.attributes["collector"], "otelcol")
            - set(log.attributes["event.original"], log.body)
            - set(log.attributes["service.name"], resource.attributes["service.name"]) where log.attributes["service.name"] == nil
            - set(log.attributes["environment"], resource.attributes["deployment.environment"]) where log.attributes["environment"] == nil
            - delete_key(log.attributes, "authorization")
            - delete_key(log.attributes, "token")
            - delete_key(log.attributes, "password")

    transform/mark_routes:
      error_mode: ignore
      log_statements:
        - context: log
          statements:
            - set(log.attributes["routing.audit"], true) where IsMatch(log.body, ".*AUDIT.*")
            - set(log.attributes["routing.error"], true) where log.severity_number >= SEVERITY_NUMBER_ERROR

    batch:
      send_batch_size: 8192
      timeout: 200ms

  exporters:
    otlphttp/outbound:
      endpoint: https://telemetry-gateway.example.com
      compression: gzip
      sending_queue:
        enabled: true
        queue_size: 10000
      retry_on_failure:
        enabled: true
        initial_interval: 1s
        max_interval: 30s
        max_elapsed_time: 300s

  service:
    pipelines:
      logs:
        receivers:
          - filelog
        processors:
          - memory_limiter
          - filter/drop_noise
          - transform/normalize_redact
          - transform/mark_routes
          - batch
        exporters:
          - otlphttp/outbound
This configuration is more componentized than the Vector example. That is good for platform governance, but it can be more tedious for teams that mostly need fast log-specific changes.
A strong OTel rollout usually includes shared config templates:
base processors:
memory_limiter
batch
security processors:
redact known sensitive attributes
cost processors:
drop health checks
sample noisy logs
metadata processors:
add cluster, namespace, service, environment
Once those patterns are approved, service teams can inherit the standard pipeline instead of writing everything from scratch.
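A minimal sketch of such a shared template is shown below, assuming teams merge it into their Collector config; the processor names, attribute keys, and the CLUSTER_NAME and ENVIRONMENT environment variables are illustrative.
processors:
  # Base: protect the process and batch efficiently
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:
    send_batch_size: 8192
    timeout: 200ms

  # Security: redact known sensitive attributes
  transform/redact_common:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - delete_key(log.attributes, "authorization")
          - delete_key(log.attributes, "password")
          - delete_key(log.attributes, "token")

  # Cost: drop health checks
  filter/drop_health_checks:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(log.body, ".*GET /healthz.*")'
        - 'IsMatch(log.body, ".*GET /readyz.*")'

  # Metadata: add cluster and environment
  transform/add_metadata:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(log.attributes["cluster"], "${env:CLUSTER_NAME}")
          - set(log.attributes["environment"], "${env:ENVIRONMENT}")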
Reliability and backpressure
Performance gets attention, but reliability determines whether the collector survives real production incidents.
Every team should define what happens when:
- the destination slows down
- the destination goes offline
- a node restarts
- the collector restarts
- a service starts emitting 10x more logs
- disk fills up
- memory pressure increases
- a Kubernetes log file rotates under load
Vector reliability considerations
Vector’s disk buffering and explicit sink behavior make reliability policy easy to express.
Example:
sinks:
  critical_logs:
    type: http
    inputs:
      - audit_logs
      - error_logs
    uri: "https://telemetry-gateway.example.com/logs/critical"
    method: post
    compression: gzip
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 53687091200
      when_full: block
For critical logs, when_full: block makes sense because losing audit or severe error logs may be worse than slowing ingestion.
For lower-value logs:
sinks:
  standard_out:
    type: http
    inputs:
      - standard_logs
    uri: "https://telemetry-gateway.example.com/logs/standard"
    method: post
    compression: gzip
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 5368709120
      when_full: drop_newest
That policy says: protect the node and preserve critical telemetry first.
Vector’s buffering model distinguishes between memory buffers and disk buffers. Memory buffers are faster but less durable. Disk buffers are better suited for handling downstream slowdowns or temporary failures. See the Vector buffering model docs.
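Below is a sketch of that distinction, reusing the standard_logs and audit_logs transforms from the production example above; the sink names and sizes are illustrative.
sinks:
  fast_path:
    type: http
    inputs:
      - standard_logs
    uri: "https://telemetry-gateway.example.com/logs/standard"
    method: post
    compression: gzip
    encoding:
      codec: json
    buffer:
      type: memory
      max_events: 10000        # fast, but lost on restart
      when_full: drop_newest

  durable_path:
    type: http
    inputs:
      - audit_logs
    uri: "https://telemetry-gateway.example.com/logs/audit"
    method: post
    compression: gzip
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 10737418240    # ~10 GiB, survives restarts and slow destinations
      when_full: block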
OTel Collector reliability considerations
OTel Collector reliability is usually built from memory limiting, batching, exporter queues, retries, and horizontal scaling.
Example:
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25

  batch:
    send_batch_size: 8192
    timeout: 200ms

exporters:
  otlphttp/outbound:
    endpoint: https://telemetry-gateway.example.com
    compression: gzip
    sending_queue:
      enabled: true
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s
The key is processor order. In most production configs, memory limiting comes early, filtering happens before expensive transforms, and batching happens near the end.
service:
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - memory_limiter
        - filter/drop_noise
        - transform/normalize
        - batch
      exporters:
        - otlphttp/outbound
OTel gives teams strong building blocks, but it expects operators to understand how the pieces interact.
The OpenTelemetry scaling docs recommend watching memory limiter behavior, refused telemetry, exporter queue size, and queue capacity when deciding whether to scale collectors. See the OpenTelemetry Collector scaling docs.
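A hedged starting point for that monitoring is the Collector's own internal metrics. The snippet and metric names below are illustrative; the telemetry configuration keys and metric names vary across Collector versions and distributions, so verify them against your build.
service:
  telemetry:
    metrics:
      level: detailed
      # Exposes internal metrics in Prometheus format (key names differ in newer versions)
      address: 0.0.0.0:8888
# Signals commonly watched when deciding whether to scale:
#   otelcol_exporter_queue_size vs otelcol_exporter_queue_capacity
#   otelcol_processor_refused_log_records (memory_limiter pressure)
#   otelcol_exporter_send_failed_log_records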
Community and ecosystem
Vector has a strong, focused community around observability pipelines. Its repository positions it as a high-performance, end-to-end observability data pipeline for collecting, transforming, and routing logs and metrics. See the Vector repository.
OTel Collector has the larger ecosystem advantage. OpenTelemetry is a CNCF Incubating project, and CNCF lists broad contributor and organizational participation across the project. See the CNCF OpenTelemetry project page.
That matters for long-term platform strategy.
Choose Vector when your team wants a focused, efficient log pipeline with approachable transformation semantics.
Choose OTel Collector when your team wants to align collection, instrumentation, semantic conventions, and export patterns around OpenTelemetry.
Common mistakes when evaluating Vector vs OTel Collector
Mistake 1: benchmarking only raw throughput
Raw throughput matters, but it is not enough. A collector that wins a simple benchmark may lose once you add JSON parsing, multiline logs, metadata enrichment, regex redaction, compression, retries, and destination-specific behavior. Your benchmark should include the exact things your production pipeline will do.
At minimum, test:
logs per second
MiB per second
CPU per MiB
memory at steady state
memory during downstream failure
disk buffer growth
exporter queue growth
p50, p95, and p99 latency
dropped records
duplicate records
restart recovery
Mistake 2: ignoring backpressure policy
You need to decide what happens when the destination is slow or unavailable. For audit logs, you may want disk buffering and backpressure. For debug logs, you may prefer dropping. For application info logs, you may want sampling. For security logs, you may need a separate high-durability route. There is no universal answer. Reliability policy should match log value.
Mistake 3: treating all logs equally
Not every log deserves the same path. A payment failure event, an authentication anomaly, a customer-impacting error, and a routine health check should not receive identical treatment. The pipeline should reflect business value.
A useful classification might look like this:
must keep:
audit logs
security events
payment state changes
customer-impacting errors
usually keep:
warnings
application errors
deploy events
dependency failures
safe to reduce:
health checks
debug logs
routine success messages
duplicate retries
high-volume low-value status logs
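One way to make that classification operational is to tag every event with a value class at collection time and key routing and retention off the tag. The sketch below uses a Vector remap transform and assumes the parse_normalize_redact transform from the earlier production example has already normalized severity; the field name and match conditions are illustrative.
transforms:
  classify_by_value:
    type: remap
    inputs:
      - parse_normalize_redact
    source: |
      .value_class = "usually_keep"
      if contains(string!(.message), "/healthz") || .severity == "DEBUG" {
        .value_class = "safe_to_reduce"
      }
      if exists(.audit_event) || contains(string!(.message), "AUDIT") || .severity == "FATAL" {
        .value_class = "must_keep"
      }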
Mistake 4: putting every rule in one giant config
Both Vector and OTel Collector configs can become hard to maintain if every team adds rules without structure. A better model is:
global safety rules:
redact secrets
remove obviously risky fields
environment rules:
drop debug logs in production
keep more detail in staging
service rules:
parse service-specific formats
normalize service-specific fields
routing rules:
send audit logs separately
send errors separately
reduce low-value logs
This makes ownership clearer and reviews easier.
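Expressed in Vector terms, that layering can be reflected directly in how transforms are named and chained. The sketch below is illustrative; the component names and conditions are examples, and the same idea maps to named processors in an OTel Collector config.
transforms:
  global_redact_secrets:          # global safety rules, owned by platform/security
    type: remap
    inputs:
      - kubernetes_logs
    source: |
      if exists(.authorization) { .authorization = "[REDACTED]" }
      if exists(.password) { .password = "[REDACTED]" }

  env_drop_debug:                 # environment rule, enabled per cluster values
    type: filter
    inputs:
      - global_redact_secrets
    condition: '!(.level == "debug")'

  service_parse_checkout:         # service rule, owned by the service team
    type: remap
    inputs:
      - env_drop_debug
    source: |
      parsed, err = parse_json(.message)
      if err == null && is_object(parsed) {
        . = merge(., parsed)
      }

  route_audit:                    # routing rule, shared policy
    type: filter
    inputs:
      - service_parse_checkout
    condition: 'exists(.audit_event)'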
Mistake 5: dropping logs without validating the impact
A drop rule that saves money can also remove the only clue you need during an incident.
Before rolling out a major filter, test it against real logs. Sample what would be dropped. Confirm with service owners. Monitor error rates, alert quality, and investigation workflows after rollout.
Cost reduction is useful only if the remaining telemetry still helps teams operate the system.
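One low-risk way to do that, sketched below with Vector components, is to route the events the proposed rule would drop to a sampled review output for a while before enabling the drop. The names, the example condition, and the sample rate are illustrative, and it assumes the parse_normalize_redact transform from the earlier production example.
transforms:
  would_be_dropped:
    type: filter
    inputs:
      - parse_normalize_redact
    # The same condition as the proposed drop rule, used here only to observe impact
    condition: 'contains(string!(.message), "/internal/status") || .severity == "DEBUG"'

  sample_for_review:
    type: sample
    inputs:
      - would_be_dropped
    rate: 100                     # keep ~1% of the candidates for inspection

sinks:
  drop_candidate_review:
    type: http
    inputs:
      - sample_for_review
    uri: "https://telemetry-gateway.example.com/logs/review"
    method: post
    compression: gzip
    encoding:
      codec: json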
For teams dealing with messy plain-text logs, our guide to AI-powered unstructured to structured log transformation covers how unstructured logs can be converted into cleaner, queryable fields without turning every new format into a manual regex project.
Practical rollout plan
Phase 1: inventory your telemetry
Start with a current-state map.
sources:
Kubernetes container logs
node logs
application JSON logs
plain-text legacy logs
audit logs
ingress logs
control-plane logs
destinations:
search
alerting
security review
long-term archive
cost-optimized storage
analytics
Then classify logs by value.
must keep:
security events
audit logs
payment or billing state changes
customer-impacting errors
usually keep:
application errors
warnings
deploy events
dependency failures
often reduce:
health checks
debug logs
success messages
routine retries
duplicate request logs
Phase 2: build two realistic proof-of-concept pipelines
Do not compare Vector and OTel Collector with toy configs.
Build realistic configs.
For Vector:
kubernetes_logs
→ parse JSON
→ redact sensitive fields
→ drop health checks
→ normalize service and severity
→ disk-buffered export
For OTel Collector:
filelog receiver
→ memory_limiter
→ filter processor
→ transform processor
→ batch processor
→ OTLP export
The goal is not to prove one tool can run. The goal is to prove which one your team can operate.
Phase 3: test under normal, burst, and failure conditions
Test three modes.
steady state:
expected production volume
burst:
3x to 5x expected production volume
failure:
destination unavailable for 5, 15, and 60 minutes
Measure:
CPU
memory
disk buffer growth
log latency
missing logs
duplicates
collector restart behavior
queue size
destination retry behavior
This is where collector differences become real.
Phase 4: evaluate operator experience
Ask the people who will own the system to complete real tasks.
drop a noisy endpoint
redact a nested field
route audit logs separately
add environment metadata
remove high-cardinality labels
debug a failed transform
estimate data reduction
roll back a bad rule
This is often more revealing than the benchmark.
A collector that performs well but is hard for your team to change safely may not be the right collector.
Phase 5: standardize the winning pattern
After testing, standardize the pattern.
For Vector, that may mean shared transforms, standard sink templates, and clear rules for when to use disk buffering.
For OTel Collector, that may mean approved receiver, processor, and exporter templates for each environment.
Either way, treat collector configuration like production code. Review it. Test it. Version it. Roll it out gradually.
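A minimal sketch of that discipline is a CI step that validates collector configuration on every change. The workflow below uses GitHub Actions syntax purely as an illustration; the paths are hypothetical, and it assumes the runner has the vector and otelcol binaries available and that your versions expose the validate subcommands, which is worth confirming for your distributions.
name: validate-collector-config
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest        # assumes vector and otelcol are installed in earlier steps
    steps:
      - uses: actions/checkout@v4
      - name: Validate Vector config
        run: vector validate config/vector/*.yaml
      - name: Validate OTel Collector config
        run: otelcol validate --config=config/otel/collector.yaml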
When to choose Vector
Choose Vector when most of these are true:
Your main problem is log volume.
You need high-throughput node-level collection.
You want readable, compact transformation logic.
Your team frequently writes parsing, redaction, and filtering rules.
You need clear disk buffering and backpressure behavior.
Your telemetry strategy is log-first.
You want service teams to understand the pipeline quickly.
Vector is especially strong for edge filtering.
A common Vector-first architecture looks like this:
Kubernetes nodes
→ Vector DaemonSet
→ parse
→ redact
→ drop obvious noise
→ buffer
→ send downstream
This pattern is useful when the cost and volume problem starts at the node. If you can remove low-value logs before they leave the cluster, you reduce network usage, gateway pressure, and downstream ingestion costs.
Vector is not only a collector in this model. It is a programmable edge pipeline.
When to choose OTel Collector
Choose OTel Collector when most of these are true:
Your company is standardizing on OpenTelemetry.
Logs are only one part of your telemetry strategy.
You need one architecture for logs, metrics, and traces.
You want vendor-neutral telemetry semantics.
You need broad receiver, processor, and exporter coverage.
You have platform engineering capacity to manage collector configs.
You want a common collector pattern across many teams and environments.
OTel Collector is especially strong as a telemetry backbone.
A common OTel-first architecture looks like this:
Applications and infrastructure
→ OTel Collector agents
→ memory limiter
→ resource detection
→ filtering
→ transformation
→ batching
→ OTel Collector gateway
→ routing
→ aggregation
→ export to approved destinations
This architecture is less about making one log pipeline elegant and more about creating an organization-wide telemetry control plane.
That is OTel Collector’s biggest advantage: it fits naturally into a broader OpenTelemetry strategy.
When to use both
Many mature teams should consider using both. That does not mean doubling complexity for no reason. It means using the right collector in the right part of the pipeline. A hybrid architecture can look like this:
High-volume Kubernetes logs
→ Vector agents for edge parsing, filtering, and buffering
→ OTel Collector gateway for standardized export and routing
→ downstream storage, search, alerting, or analytics destinations
Or:
Application traces and metrics
→ OTel Collector agents
→ OTel Collector gateway
Noisy application logs
→ Vector agents
→ shared downstream telemetry pipeline
This is often the most practical enterprise answer. Use OTel Collector where standardization matters most. Use Vector where log-path efficiency and transformation ergonomics matter most.
Final recommendation
Use Vector when log collection is the main problem. Vector is the better fit when your team needs high-throughput log collection, readable transforms, fast edge filtering, practical buffering, and frequent log-specific policy changes. Its sources → transforms → sinks model is easy to understand, and VRL is comfortable for parsing, redaction, normalization, and routing.
Use OTel Collector when telemetry standardization is the main problem. OTel Collector is the better fit when your team wants one vendor-neutral collector architecture for logs, metrics, and traces. It is more verbose, but its receiver/processor/exporter model fits well when platform teams need reusable patterns across many services and environments.
Use both when the architecture calls for it. A common mature pattern is:
Vector at the edge for noisy, high-volume logs
OTel Collector as the standard telemetry gateway
Downstream storage, search, alerting, or analytics destinations
The right answer is not the collector with the best logo, the most stars, or the fastest synthetic benchmark. The right answer is the one your team can operate safely when production volume spikes, downstream systems slow down, and the business asks why observability costs doubled.
For log-first pipelines, start with Vector. For OpenTelemetry-wide standardization, start with OTel Collector. For complex environments, test both against your real log volume, transformations, buffering requirements, and destination behavior before standardizing.
How Sawmills simplifies this
Whether you choose Vector, OTel Collector, or both, the install is the easy part. The work that consumes your team is everything after: finding which telemetry is burning budget, deciding what to sample, aggregate, transform, route, or drop, and rolling those changes out safely. Sawmills handles that work continuously.
Mills, the agentic telemetry operator at the core of Sawmills, analyzes your flows in real time, applies the policies your DevOps team defines, and runs the pipeline autonomously within those guardrails. DevOps owns the strategy. Developers self-serve fixes in Slack or Teams. Built on OpenTelemetry, Sawmills works with the collectors and backends you already run. See Sawmills in action.
Five Expert Tips for Vector and OTel Collector in Production
1. Evaluate operator experience as a real test phase, not an afterthought. Before standardizing, have the people who will own the collector complete the tasks they will repeat for years: drop a noisy endpoint, redact a nested field, route audit logs separately, debug a failed transform, roll back a bad rule. A collector that benchmarks well but is hard to change safely will cost you more in incident response than it ever saves in throughput.
2. Classify your logs by business value before writing a single rule. Audit logs, payment state changes, and customer-impacting errors do not deserve the same path as health checks and debug logs. If your pipeline cannot distinguish these classes, every cost-cutting rule is a gamble. Build the must-keep, usually-keep, and safe-to-reduce categories first. Then write the transforms.
3. Set backpressure per stream. Block on critical, drop on low-value. Vector makes this explicit with per-sink when_full: block versus drop_newest. OTel Collector requires more deliberate wiring through separate exporters and sending queues. Either way, audit and error logs should not share a buffer policy with debug logs. A single global policy is how you either lose evidence or crash a node.
4. Validate drop rules against real logs before rolling them out. A rule that saves money can also remove the only clue you need during an incident. Sample what would be dropped. Confirm with service owners. Watch error rates, alert quality, and investigation workflows after rollout. Cost reduction only counts if the remaining telemetry still helps you operate the system.
5. Benchmark with your transforms, your destinations, and your failure modes. Synthetic throughput numbers do not survive contact with JSON parsing, regex redaction, multiline logs, compression, retries, and 60-minute downstream outages. Test normal load, 3-5x burst, and destination unavailable for 5, 15, and 60 minutes. Anything less is not a real benchmark.
