Telemetry volume grows faster than most teams expect. Logs, metrics, traces, Kubernetes events, and infrastructure signals all need to be collected, shaped, filtered, enriched, routed, and protected before they land in storage or analysis tools.
That is where an observability pipeline comes in. Instead of shipping every raw event downstream, DevOps teams use a pipeline to control what gets collected, what gets transformed, what gets dropped, and where each signal goes. For a broader primer, see our guide to observability pipelines.
This article focuses on one practical decision inside that pipeline: Vector vs OpenTelemetry Collector for log collection.
Both can collect Kubernetes logs. Both can transform telemetry. Both can run as agents or gateways. But they feel very different once you operate them in production.
A detailed evaluation for teams planning the rollout
The biggest difference between Vector and OTel Collector is not whether they can collect logs. They both can.
The difference is what each one optimizes for.
Vector is built around observability data pipelines. Its configuration is organized as sources, transforms, and sinks. That makes it easy to read a log flow from top to bottom: collect here, transform there, send over there. Vector positions itself as an observability data pipeline for collecting, transforming, and routing logs and metrics. See the Vector GitHub repository.
OTel Collector is built around OpenTelemetry’s broader architecture. It receives telemetry, processes it, and exports it through pipelines made of receivers, processors, and exporters. Those pipelines can operate on logs, metrics, and traces, which makes OTel Collector a natural fit for teams standardizing on OpenTelemetry across their stack. See the OpenTelemetry Collector architecture docs.
That leads to a simple operational distinction:
Vector feels like a log pipeline first. OTel Collector feels like a telemetry standard first.
Both are valid. The right choice depends on whether your immediate problem is log volume and transformation, or long-term telemetry standardization across logs, metrics, and traces.
How Vector handles log collection
Vector’s config model is direct:
sources → transforms → sinks
A source receives data. A transform changes, filters, samples, enriches, or routes data. A sink sends data to a destination.
A minimal Kubernetes log collection pipeline looks like this:
data_dir: /var/lib/vector

sources:
  kubernetes_logs:
    type: kubernetes_logs

transforms:
  normalize:
    type: remap
    inputs:
      - kubernetes_logs
    source: |
      .collector = "vector"
      .environment = "${ENVIRONMENT:-unknown}"
      parsed, err = parse_json(.message)
      if err == null && is_object(parsed) {
        . = merge(., parsed)
      }
      .severity = .severity ?? .level ?? .log_level ?? "INFO"
      .service.name = .service.name ?? .service ?? .kubernetes.container_name ?? "unknown"
      if exists(.authorization) {
        .authorization = "[REDACTED]"
      }
      if exists(.request.headers.authorization) {
        .request.headers.authorization = "[REDACTED]"
      }

  drop_noise:
    type: filter
    inputs:
      - normalize
    condition: |
      !(
        contains(string!(.message), "/healthz") ||
        contains(string!(.message), "/readyz") ||
        contains(string!(.message), "kube-probe") ||
        .severity == "DEBUG"
      )

sinks:
  outbound:
    type: http
    inputs:
      - drop_noise
    uri: "https://telemetry-gateway.example.com/logs"
    method: post
    compression: gzip
    encoding:
      codec: json
The important thing is readability. A DevOps engineer can usually understand the flow without already knowing a deep telemetry framework.
Vector’s transformation language, VRL, is designed specifically for observability data. The docs describe it as an expression-oriented language for transforming logs and metrics, with built-in functions tailored to observability use cases. See the VRL documentation.
That makes Vector especially strong when the work looks like this:
parse JSON
normalize severity
rename fields
drop health checks
redact secrets
route audit logs separately
sample low-value info logs
remove high-cardinality Kubernetes metadata
For log-heavy pipelines, that work is not occasional. It is the job.
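For instance, the sampling task above can be a single transform. The sketch below is a minimal example wired after the normalize transform from the config above; the transform name and sample rate are illustrative, not a recommendation.
transforms:
  sample_low_value:
    type: sample
    inputs:
      - normalize
    # Keep roughly 1 in 10 of the events that reach this transform; route
    # errors, warnings, and audit events around it rather than through it.
    rate: 10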
How OTel Collector handles log collection
OTel Collector uses this model:
receivers → processors → exporters
Those components are then wired together inside service pipelines.
A comparable OTel Collector log pipeline looks like this:
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: end
    operators:
      - type: container

processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25

  filter/drop_noise:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(log.body, ".*GET /healthz.*")'
        - 'IsMatch(log.body, ".*GET /readyz.*")'
        - 'IsMatch(log.body, ".*kube-probe.*")'

  transform/normalize:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(log.attributes["collector"], "otelcol")
          - set(log.attributes["environment"], resource.attributes["deployment.environment"])
          - set(log.attributes["service.name"], resource.attributes["service.name"]) where log.attributes["service.name"] == nil
          - delete_key(log.attributes, "authorization")
          - delete_key(log.attributes, "password")

  batch:
    send_batch_size: 8192
    timeout: 200ms

exporters:
  otlphttp/outbound:
    endpoint: https://telemetry-gateway.example.com
    compression: gzip

service:
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - memory_limiter
        - filter/drop_noise
        - transform/normalize
        - batch
      exporters:
        - otlphttp/outbound

The OTel version is more verbose, but the structure is powerful. It gives platform teams a standard way to assemble pipelines for logs, metrics, and traces. The Collector architecture supports one or more pipelines, each with receivers, optional processors, and exporters. See the OpenTelemetry Collector architecture docs.
That matters when you want one collector strategy across many teams.
For example:
logs pipeline:
filelog → memory_limiter → filter → transform → batch → exporter
metrics pipeline:
prometheus → memory_limiter → attributes → batch → exporter
traces pipeline:
otlp → memory_limiter → tail_sampling → batch → exporter
Vector can handle multiple telemetry types too, but OTel Collector is more naturally aligned with the OpenTelemetry ecosystem and data model.
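As a small illustration of that point, a Vector config can carry metrics next to the log pipeline shown earlier. This is a sketch only; it assumes the host_metrics source and prometheus_exporter sink, and the component names and listen address are illustrative.
sources:
  node_metrics:
    type: host_metrics

sinks:
  node_metrics_out:
    type: prometheus_exporter
    inputs:
      - node_metrics
    # Exposes collected host metrics for a Prometheus-compatible scraper
    address: 0.0.0.0:9598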
Performance comparison
Performance is where teams often want a clean winner. In practice, there is no universal answer.
Collector performance changes based on log size, parsing rules, multiline handling, metadata enrichment, regex use, batching, compression, buffering, downstream latency, CPU limits, memory limits, and failure behavior.
Still, there are useful signals. Vector publishes sizing guidance that is easy to use during early capacity planning. In its examples, Vector estimates around 10 MiB/s per vCPU for unstructured logs and around 25 MiB/s per vCPU for structured logs, metrics, and traces. See Vector’s sizing guidance.
That lets teams do rough math before load testing:
Expected unstructured log volume: 200 MiB/s
Vector planning estimate: 10 MiB/s per vCPU
Initial capacity estimate: 20 vCPU
Then add headroom and test with your real transforms
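Translated into Kubernetes terms, that estimate usually lands as DaemonSet resource requests in the Helm values. The numbers below are illustrative only, assuming roughly 20 MiB/s of unstructured logs per node against the ~10 MiB/s-per-vCPU planning figure; treat them as a starting point for your own load tests.
resources:
  requests:
    cpu: "2"          # ~20 MiB/s per node / ~10 MiB/s per vCPU
    memory: 512Mi
  limits:
    cpu: "3"          # headroom for bursts and transform overhead
    memory: 1Gi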
OpenTelemetry takes a different approach. The OTel project publishes Collector benchmark infrastructure, and its docs state that load tests run on every commit to the opentelemetry-collector-contrib repository. Those tests run Collector binaries with different configurations and send traffic through them. See the OpenTelemetry Collector benchmark docs.
That is useful for ecosystem reliability. It does not give you the same simple sizing formula, but it does show that Collector performance is continuously tested.
A Kubernetes log collector benchmark also gives a useful, workload-specific comparison. In the benchmark’s 100-Pod scenario, Vector reached 25,000 logs/sec, while OpenTelemetry Collector reached 20,500 logs/sec. At roughly 10,000 logs/sec, Vector used 0.412 CPU and OTel Collector used 0.491 CPU. In the same 10,000 logs/sec test, OTel Collector used 106.83 MiB of mean memory, while Vector used 153.50 MiB. See the VictoriaMetrics log collector benchmark.
That benchmark should not be treated as universal truth. It used specific versions, a specific Kubernetes setup, official Helm chart defaults, a 1-core CPU limit, a 1 GiB memory limit, and no performance tuning. The authors also disclosed collector-specific edge cases around rotation and backlog behavior. See the benchmark writeup and the benchmark source code.
The practical read: for DevOps teams, the performance decision should come from a realistic test plan:
normal load:
current production logs/sec and MiB/sec
burst load:
3x to 5x expected production volume
failure mode:
destination unavailable for 5, 15, and 60 minutes
measure:
CPU
memory
disk growth
queue growth
dropped records
duplicate records
p95/p99 latency
restart recovery
malformed records
The performance takeaway is straightforward: Vector has stronger evidence for log-pipeline efficiency and practical sizing. OTel Collector has stronger evidence for ecosystem-level testing and standardization. Neither replaces your own benchmark.
Installation and first deployment
Both tools are easy to install. The difference is how many architectural decisions you need to make upfront.
Vector installation example
A basic Vector Helm install for Kubernetes agent mode:
helm repo add vector https://helm.vector.dev
helm repo update
helm upgrade --install vector vector/vector \
--namespace observability \
--create-namespace \
--set role=Agent
A more practical first values file:
role: Agent

customConfig:
  data_dir: /var/lib/vector

  api:
    enabled: true
    address: 0.0.0.0:8686

  sources:
    kubernetes_logs:
      type: kubernetes_logs

  transforms:
    add_context:
      type: remap
      inputs:
        - kubernetes_logs
      source: |
        .collector = "vector"
        .cluster = "${CLUSTER_NAME:-unknown}"
        .environment = "${ENVIRONMENT:-unknown}"

    drop_health_checks:
      type: filter
      inputs:
        - add_context
      condition: |
        !(
          contains(string!(.message), "/healthz") ||
          contains(string!(.message), "/readyz")
        )

  sinks:
    outbound:
      type: http
      inputs:
        - drop_health_checks
      uri: "https://telemetry-gateway.example.com/logs"
      method: post
      compression: gzip
      encoding:
        codec: json
Vector’s early setup is easy to explain:
Install agent
Collect Kubernetes logs
Add transforms
Send logs downstream
That simplicity matters when a platform team wants adoption across multiple service teams.
OTel Collector installation example
A basic OTel Collector Helm install:
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
--namespace observability \
--create-namespace \
--set image.repository=otel/opentelemetry-collector-k8s \
--set mode=daemonset
For Kubernetes log collection, the Helm chart supports a logsCollection preset. The chart docs note that this feature requires an agent collector deployment and a Collector image that includes the filelog receiver, such as the Kubernetes Collector image. See the OpenTelemetry Helm chart docs.
A practical values file:
mode: daemonset

image:
  repository: otel/opentelemetry-collector-k8s

presets:
  logsCollection:
    enabled: true
    includeCollectorLogs: false

config:
  processors:
    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25

    filter/drop_noise:
      error_mode: ignore
      logs:
        log_record:
          - 'IsMatch(log.body, ".*GET /healthz.*")'
          - 'IsMatch(log.body, ".*GET /readyz.*")'

    batch:
      send_batch_size: 8192
      timeout: 200ms

  exporters:
    otlphttp/outbound:
      endpoint: https://telemetry-gateway.example.com
      compression: gzip

  service:
    pipelines:
      logs:
        processors:
          - memory_limiter
          - filter/drop_noise
          - batch
        exporters:
          - otlphttp/outbound
OTel Collector setup is not hard, but teams need to make more choices:
- Which Collector distribution?
- DaemonSet, Deployment, or gateway?
- Which receivers are included?
- Which processors are available?
- Which pipelines handle logs, metrics, and traces?
- Which exporters are approved?
That extra structure is worthwhile when the collector becomes a shared platform standard.
Ongoing filtering and transformation
This is where the difference becomes obvious. Installation happens once. Filtering and transformation happen forever. Your team will eventually need to answer questions like:
- Can we drop health checks only in production?
- Can we keep all error logs but sample info logs?
- Can we redact authorization headers?
- Can we parse this legacy plain-text format?
- Can we remove high-cardinality Kubernetes labels?
- Can we route audit logs to a different destination?
- Can we prove a drop rule did not remove useful data?
Vector filtering and transformation
Vector’s VRL is usually more comfortable for log-heavy work. Example: parse JSON, normalize fields, redact secrets, and mark routing flags.
transforms:
  app_log_policy:
    type: remap
    inputs:
      - kubernetes_logs
    drop_on_error: false
    source: |
      .collector = "vector"
      .event.original = .message
      parsed, err = parse_json(.message)
      if err == null && is_object(parsed) {
        . = merge(., parsed)
      }
      .severity = upcase(string!(.severity ?? .level ?? .log_level ?? "INFO"))
      .service.name = .service.name ?? .service ?? .kubernetes.container_name ?? "unknown"
      if exists(.password) {
        .password = "[REDACTED]"
      }
      if exists(.token) {
        .token = "[REDACTED]"
      }
      if exists(.authorization) {
        .authorization = "[REDACTED]"
      }
      if exists(.request.headers.authorization) {
        .request.headers.authorization = "[REDACTED]"
      }
      del(.kubernetes.pod_uid)
      del(.kubernetes.container_id)
      .routing.keep = .severity == "ERROR" || .severity == "FATAL"
      .routing.audit = exists(.audit_event) || contains(string!(.message), "AUDIT")
      .routing.low_value = contains(string!(.message), "/healthz") || .severity == "DEBUG"
Then split streams by policy:
transforms:
  important_logs:
    type: filter
    inputs:
      - app_log_policy
    condition: '.routing.keep == true || .routing.audit == true'

  standard_logs:
    type: filter
    inputs:
      - app_log_policy
    condition: '.routing.keep != true && .routing.audit != true && .routing.low_value != true'
This is compact. The policy is readable. The transform logic stays close to the log stream. For teams that frequently add, tune, or roll back log rules, that is a real advantage.
OTel Collector filtering and transformation
OTel Collector uses processors. The transform processor modifies telemetry using OTTL statements, and those statements execute against incoming telemetry in the order specified by the configuration. See the OTel transform processor docs.
Example: drop noisy logs.
processors:
  filter/drop_low_value:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(log.body, ".*GET /healthz.*")'
        - 'IsMatch(log.body, ".*GET /readyz.*")'
        - 'log.severity_number < SEVERITY_NUMBER_INFO'
Example: normalize and redact fields.
processors:
  transform/normalize_logs:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(log.attributes["collector"], "otelcol")
          - set(log.attributes["event.original"], log.body)
          - set(log.attributes["service.name"], resource.attributes["service.name"]) where log.attributes["service.name"] == nil
          - delete_key(log.attributes, "password")
          - delete_key(log.attributes, "token")
          - delete_key(log.attributes, "authorization")
Then wire the processors into the pipeline:
service:
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - memory_limiter
        - filter/drop_low_value
        - transform/normalize_logs
        - batch
      exporters:
        - otlphttp/outbound
This is more verbose than Vector, but it is also easier to standardize. A platform team can define approved processor patterns and apply them across logs, metrics, and traces.
The tradeoff is day-to-day ergonomics. For log-first parsing, VRL often feels more natural. For OpenTelemetry-wide governance, OTTL fits better.
Production config example: Vector
Here is a fuller Vector example for Kubernetes log collection with parsing, redaction, filtering, routing, and disk buffering.
role: Agent

customConfig:
  data_dir: /var/lib/vector

  acknowledgements:
    enabled: true

  api:
    enabled: true
    address: 0.0.0.0:8686

  sources:
    kubernetes_logs:
      type: kubernetes_logs
      glob_minimum_cooldown_ms: 10000

  transforms:
    parse_normalize_redact:
      type: remap
      inputs:
        - kubernetes_logs
      drop_on_error: false
      source: |
        .collector = "vector"
        .cluster = "${CLUSTER_NAME:-unknown}"
        .environment = "${ENVIRONMENT:-unknown}"
        .event.original = .message
        parsed, err = parse_json(.message)
        if err == null && is_object(parsed) {
          . = merge(., parsed)
        }
        .severity = upcase(string!(.severity ?? .level ?? .log_level ?? "INFO"))
        .service.name = .service.name ?? .service ?? .kubernetes.container_name ?? "unknown"
        if exists(.authorization) {
          .authorization = "[REDACTED]"
        }
        if exists(.request.headers.authorization) {
          .request.headers.authorization = "[REDACTED]"
        }
        if exists(.token) {
          .token = "[REDACTED]"
        }
        if exists(.password) {
          .password = "[REDACTED]"
        }
        del(.kubernetes.pod_uid)
        del(.kubernetes.container_id)
        .routing.audit = exists(.audit_event) || contains(string!(.message), "AUDIT")
        .routing.error = .severity == "ERROR" || .severity == "FATAL"
        .routing.noise = contains(string!(.message), "/healthz") ||
          contains(string!(.message), "/readyz") ||
          contains(string!(.message), "kube-probe") ||
          .severity == "DEBUG"

    audit_logs:
      type: filter
      inputs:
        - parse_normalize_redact
      condition: '.routing.audit == true'

    error_logs:
      type: filter
      inputs:
        - parse_normalize_redact
      condition: '.routing.error == true && .routing.audit != true'

    standard_logs:
      type: filter
      inputs:
        - parse_normalize_redact
      condition: '.routing.audit != true && .routing.error != true && .routing.noise != true'

  sinks:
    audit_out:
      type: http
      inputs:
        - audit_logs
      uri: "https://telemetry-gateway.example.com/logs/audit"
      method: post
      compression: gzip
      encoding:
        codec: json
      buffer:
        type: disk
        max_size: 21474836480
        when_full: block

    error_out:
      type: http
      inputs:
        - error_logs
      uri: "https://telemetry-gateway.example.com/logs/errors"
      method: post
      compression: gzip
      encoding:
        codec: json
      buffer:
        type: disk
        max_size: 10737418240
        when_full: block

    standard_out:
      type: http
      inputs:
        - standard_logs
      uri: "https://telemetry-gateway.example.com/logs/standard"
      method: post
      compression: gzip
      encoding:
        codec: json
      buffer:
        type: disk
        max_size: 5368709120
        when_full: drop_newest
The useful design choice here is that audit, error, and standard logs do not share the same durability policy.
audit logs:
block when buffer is full
error logs:
block when buffer is full
standard logs:
drop newest when buffer is full
That is exactly the kind of policy separation teams need in production.
Audit logs and high-severity error logs may be worth preserving even if the downstream system is slow. Standard informational logs may not deserve the same treatment. Separating these streams helps teams protect important data without letting low-value telemetry destabilize the collection layer.
Production config example: OTel Collector
Here is a comparable OTel Collector configuration.
mode: daemonset

image:
  repository: otel/opentelemetry-collector-k8s

presets:
  logsCollection:
    enabled: true
    includeCollectorLogs: false

config:
  receivers:
    filelog:
      include:
        - /var/log/pods/*/*/*.log
      start_at: end
      include_file_path: true
      operators:
        - type: container

  processors:
    memory_limiter:
      check_interval: 5s
      limit_percentage: 80
      spike_limit_percentage: 25

    filter/drop_noise:
      error_mode: ignore
      logs:
        log_record:
          - 'IsMatch(log.body, ".*GET /healthz.*")'
          - 'IsMatch(log.body, ".*GET /readyz.*")'
          - 'IsMatch(log.body, ".*kube-probe.*")'

    transform/normalize_redact:
      error_mode: ignore
      log_statements:
        - context: log
          statements:
            - set(log.attributes["collector"], "otelcol")
            - set(log.attributes["event.original"], log.body)
            - set(log.attributes["service.name"], resource.attributes["service.name"]) where log.attributes["service.name"] == nil
            - set(log.attributes["environment"], resource.attributes["deployment.environment"]) where log.attributes["environment"] == nil
            - delete_key(log.attributes, "authorization")
            - delete_key(log.attributes, "token")
            - delete_key(log.attributes, "password")

    transform/mark_routes:
      error_mode: ignore
      log_statements:
        - context: log
          statements:
            - set(log.attributes["routing.audit"], true) where IsMatch(log.body, ".*AUDIT.*")
            - set(log.attributes["routing.error"], true) where log.severity_number >= SEVERITY_NUMBER_ERROR

    batch:
      send_batch_size: 8192
      timeout: 200ms

  exporters:
    otlphttp/outbound:
      endpoint: https://telemetry-gateway.example.com
      compression: gzip
      sending_queue:
        enabled: true
        queue_size: 10000
      retry_on_failure:
        enabled: true
        initial_interval: 1s
        max_interval: 30s
        max_elapsed_time: 300s

  service:
    pipelines:
      logs:
        receivers:
          - filelog
        processors:
          - memory_limiter
          - filter/drop_noise
          - transform/normalize_redact
          - transform/mark_routes
          - batch
        exporters:
          - otlphttp/outbound
This configuration is more componentized than the Vector example. That is good for platform governance, but it can be more tedious for teams that mostly need fast log-specific changes.
A strong OTel rollout usually includes shared config templates:
base processors:
memory_limiter
batch
security processors:
redact known sensitive attributes
cost processors:
drop health checks
sample noisy logs
metadata processors:
add cluster, namespace, service, environment
Once those patterns are approved, service teams can inherit the standard pipeline instead of writing everything from scratch.
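A minimal sketch of such a shared template is shown below, assuming teams merge it into their Collector config; the processor names, attribute keys, and the CLUSTER_NAME and ENVIRONMENT environment variables are illustrative.
processors:
  # Base: protect the process and batch efficiently
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25
  batch:
    send_batch_size: 8192
    timeout: 200ms

  # Security: redact known sensitive attributes
  transform/redact_common:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - delete_key(log.attributes, "authorization")
          - delete_key(log.attributes, "password")
          - delete_key(log.attributes, "token")

  # Cost: drop health checks
  filter/drop_health_checks:
    error_mode: ignore
    logs:
      log_record:
        - 'IsMatch(log.body, ".*GET /healthz.*")'
        - 'IsMatch(log.body, ".*GET /readyz.*")'

  # Metadata: add cluster and environment
  transform/add_metadata:
    error_mode: ignore
    log_statements:
      - context: log
        statements:
          - set(log.attributes["cluster"], "${env:CLUSTER_NAME}")
          - set(log.attributes["environment"], "${env:ENVIRONMENT}")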
Reliability and backpressure
Performance gets attention, but reliability determines whether the collector survives real production incidents.
Every team should define what happens when:
- the destination slows down
- the destination goes offline
- a node restarts
- the collector restarts
- a service starts emitting 10x more logs
- disk fills up
- memory pressure increases
- a Kubernetes log file rotates under load
Vector reliability considerations
Vector’s disk buffering and explicit sink behavior make reliability policy easy to express.
Example:
sinks:
  critical_logs:
    type: http
    inputs:
      - audit_logs
      - error_logs
    uri: "https://telemetry-gateway.example.com/logs/critical"
    method: post
    compression: gzip
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 53687091200
      when_full: block
For critical logs, when_full: block makes sense because losing audit or severe error logs may be worse than slowing ingestion.
For lower-value logs:
sinks:
  standard_out:
    type: http
    inputs:
      - standard_logs
    uri: "https://telemetry-gateway.example.com/logs/standard"
    method: post
    compression: gzip
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 5368709120
      when_full: drop_newest
That policy says: protect the node and preserve critical telemetry first.
Vector’s buffering model distinguishes between memory buffers and disk buffers. Memory buffers are faster but less durable. Disk buffers are better suited for handling downstream slowdowns or temporary failures. See the Vector buffering model docs.
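Below is a sketch of that distinction, reusing the standard_logs and audit_logs transforms from the production example above; the sink names and sizes are illustrative.
sinks:
  fast_path:
    type: http
    inputs:
      - standard_logs
    uri: "https://telemetry-gateway.example.com/logs/standard"
    method: post
    compression: gzip
    encoding:
      codec: json
    buffer:
      type: memory
      max_events: 10000        # fast, but lost on restart
      when_full: drop_newest

  durable_path:
    type: http
    inputs:
      - audit_logs
    uri: "https://telemetry-gateway.example.com/logs/audit"
    method: post
    compression: gzip
    encoding:
      codec: json
    buffer:
      type: disk
      max_size: 10737418240    # ~10 GiB, survives restarts and slow destinations
      when_full: block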
OTel Collector reliability considerations
OTel Collector reliability is usually built from memory limiting, batching, exporter queues, retries, and horizontal scaling.
Example:
processors:
  memory_limiter:
    check_interval: 5s
    limit_percentage: 80
    spike_limit_percentage: 25

  batch:
    send_batch_size: 8192
    timeout: 200ms

exporters:
  otlphttp/outbound:
    endpoint: https://telemetry-gateway.example.com
    compression: gzip
    sending_queue:
      enabled: true
      queue_size: 10000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s
The key is processor order. In most production configs, memory limiting comes early, filtering happens before expensive transforms, and batching happens near the end.
service:
  pipelines:
    logs:
      receivers:
        - filelog
      processors:
        - memory_limiter
        - filter/drop_noise
        - transform/normalize
        - batch
      exporters:
        - otlphttp/outbound
OTel gives teams strong building blocks, but it expects operators to understand how the pieces interact.
The OpenTelemetry scaling docs recommend watching memory limiter behavior, refused telemetry, exporter queue size, and queue capacity when deciding whether to scale collectors. See the OpenTelemetry Collector scaling docs.
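A hedged starting point for that monitoring is the Collector's own internal metrics. The snippet and metric names below are illustrative; the telemetry configuration keys and metric names vary across Collector versions and distributions, so verify them against your build.
service:
  telemetry:
    metrics:
      level: detailed
      # Exposes internal metrics in Prometheus format (key names differ in newer versions)
      address: 0.0.0.0:8888
# Signals commonly watched when deciding whether to scale:
#   otelcol_exporter_queue_size vs otelcol_exporter_queue_capacity
#   otelcol_processor_refused_log_records (memory_limiter pressure)
#   otelcol_exporter_send_failed_log_records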
Community and ecosystem
Vector has a strong, focused community around observability pipelines. Its repository positions it as a high-performance, end-to-end observability data pipeline for collecting, transforming, and routing logs and metrics. See the Vector repository.
OTel Collector has the larger ecosystem advantage. OpenTelemetry is a CNCF Incubating project, and CNCF lists broad contributor and organizational participation across the project. See the CNCF OpenTelemetry project page.
That matters for long-term platform strategy.
Choose Vector when your team wants a focused, efficient log pipeline with approachable transformation semantics.
Choose OTel Collector when your team wants to align collection, instrumentation, semantic conventions, and export patterns around OpenTelemetry.
Common mistakes when evaluating Vector vs OTel Collector
Mistake 1: benchmarking only raw throughput
Raw throughput matters, but it is not enough. A collector that wins a simple benchmark may lose once you add JSON parsing, multiline logs, metadata enrichment, regex redaction, compression, retries, and destination-specific behavior. Your benchmark should include the exact things your production pipeline will do.
At minimum, test:
logs per second
MiB per second
CPU per MiB
memory at steady state
memory during downstream failure
disk buffer growth
exporter queue growth
p50, p95, and p99 latency
dropped records
duplicate records
restart recovery
Mistake 2: ignoring backpressure policy
You need to decide what happens when the destination is slow or unavailable. For audit logs, you may want disk buffering and backpressure. For debug logs, you may prefer dropping. For application info logs, you may want sampling. For security logs, you may need a separate high-durability route. There is no universal answer. Reliability policy should match log value.
Mistake 3: treating all logs equally
Not every log deserves the same path. A payment failure event, an authentication anomaly, a customer-impacting error, and a routine health check should not receive identical treatment. The pipeline should reflect business value.
A useful classification might look like this:
must keep:
audit logs
security events
payment state changes
customer-impacting errors
usually keep:
warnings
application errors
deploy events
dependency failures
safe to reduce:
health checks
debug logs
routine success messages
duplicate retries
high-volume low-value status logs
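One way to make that classification operational is to tag every event with a value class at collection time and key routing and retention off the tag. The sketch below uses a Vector remap transform and assumes the parse_normalize_redact transform from the earlier production example has already normalized severity; the field name and match conditions are illustrative.
transforms:
  classify_by_value:
    type: remap
    inputs:
      - parse_normalize_redact
    source: |
      .value_class = "usually_keep"
      if contains(string!(.message), "/healthz") || .severity == "DEBUG" {
        .value_class = "safe_to_reduce"
      }
      if exists(.audit_event) || contains(string!(.message), "AUDIT") || .severity == "FATAL" {
        .value_class = "must_keep"
      }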
Mistake 4: putting every rule in one giant config
Both Vector and OTel Collector configs can become hard to maintain if every team adds rules without structure. A better model is:
global safety rules:
redact secrets
remove obviously risky fields
environment rules:
drop debug logs in production
keep more detail in staging
service rules:
parse service-specific formats
normalize service-specific fields
routing rules:
send audit logs separately
send errors separately
reduce low-value logs
This makes ownership clearer and reviews easier.
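Expressed in Vector terms, that layering can be reflected directly in how transforms are named and chained. The sketch below is illustrative; the component names and conditions are examples, and the same idea maps to named processors in an OTel Collector config.
transforms:
  global_redact_secrets:          # global safety rules, owned by platform/security
    type: remap
    inputs:
      - kubernetes_logs
    source: |
      if exists(.authorization) { .authorization = "[REDACTED]" }
      if exists(.password) { .password = "[REDACTED]" }

  env_drop_debug:                 # environment rule, enabled per cluster values
    type: filter
    inputs:
      - global_redact_secrets
    condition: '!(.level == "debug")'

  service_parse_checkout:         # service rule, owned by the service team
    type: remap
    inputs:
      - env_drop_debug
    source: |
      parsed, err = parse_json(.message)
      if err == null && is_object(parsed) {
        . = merge(., parsed)
      }

  route_audit:                    # routing rule, shared policy
    type: filter
    inputs:
      - service_parse_checkout
    condition: 'exists(.audit_event)'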
Mistake 5: dropping logs without validating the impact
A drop rule that saves money can also remove the only clue you need during an incident.
Before rolling out a major filter, test it against real logs. Sample what would be dropped. Confirm with service owners. Monitor error rates, alert quality, and investigation workflows after rollout.
Cost reduction is useful only if the remaining telemetry still helps teams operate the system.
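One low-risk way to do that, sketched below with Vector components, is to route the events the proposed rule would drop to a sampled review output for a while before enabling the drop. The names, the example condition, and the sample rate are illustrative, and it assumes the parse_normalize_redact transform from the earlier production example.
transforms:
  would_be_dropped:
    type: filter
    inputs:
      - parse_normalize_redact
    # The same condition as the proposed drop rule, used here only to observe impact
    condition: 'contains(string!(.message), "/internal/status") || .severity == "DEBUG"'

  sample_for_review:
    type: sample
    inputs:
      - would_be_dropped
    rate: 100                     # keep ~1% of the candidates for inspection

sinks:
  drop_candidate_review:
    type: http
    inputs:
      - sample_for_review
    uri: "https://telemetry-gateway.example.com/logs/review"
    method: post
    compression: gzip
    encoding:
      codec: json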
For teams dealing with messy plain-text logs, our guide to AI-powered unstructured to structured log transformation covers how unstructured logs can be converted into cleaner, queryable fields without turning every new format into a manual regex project.
Practical rollout plan
Phase 1: inventory your telemetry
Start with a current-state map.
sources:
Kubernetes container logs
node logs
application JSON logs
plain-text legacy logs
audit logs
ingress logs
control-plane logs
destinations:
search
alerting
security review
long-term archive
cost-optimized storage
analytics
Then classify logs by value.
must keep:
security events
audit logs
payment or billing state changes
customer-impacting errors
usually keep:
application errors
warnings
deploy events
dependency failures
often reduce:
health checks
debug logs
success messages
routine retries
duplicate request logs
Phase 2: build two realistic proof-of-concept pipelines
Do not compare Vector and OTel Collector with toy configs.
Build realistic configs.
For Vector:
kubernetes_logs
→ parse JSON
→ redact sensitive fields
→ drop health checks
→ normalize service and severity
→ disk-buffered export
For OTel Collector:
filelog receiver
→ memory_limiter
→ filter processor
→ transform processor
→ batch processor
→ OTLP export
The goal is not to prove one tool can run. The goal is to prove which one your team can operate.
Phase 3: test under normal, burst, and failure conditions
Test three modes.
steady state:
expected production volume
burst:
3x to 5x expected production volume
failure:
destination unavailable for 5, 15, and 60 minutes
Measure:
CPU
memory
disk buffer growth
log latency
missing logs
duplicates
collector restart behavior
queue size
destination retry behavior
This is where collector differences become real.
Phase 4: evaluate operator experience
Ask the people who will own the system to complete real tasks.
drop a noisy endpoint
redact a nested field
route audit logs separately
add environment metadata
remove high-cardinality labels
debug a failed transform
estimate data reduction
roll back a bad rule
This is often more revealing than the benchmark.
A collector that performs well but is hard for your team to change safely may not be the right collector.
Phase 5: standardize the winning pattern
After testing, standardize the pattern.
For Vector, that may mean shared transforms, standard sink templates, and clear rules for when to use disk buffering.
For OTel Collector, that may mean approved receiver, processor, and exporter templates for each environment.
Either way, treat collector configuration like production code. Review it. Test it. Version it. Roll it out gradually.
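A minimal sketch of that discipline is a CI step that validates collector configuration on every change. The workflow below uses GitHub Actions syntax purely as an illustration; the paths are hypothetical, and it assumes the runner has the vector and otelcol binaries available and that your versions expose the validate subcommands, which is worth confirming for your distributions.
name: validate-collector-config
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest        # assumes vector and otelcol are installed in earlier steps
    steps:
      - uses: actions/checkout@v4
      - name: Validate Vector config
        run: vector validate config/vector/*.yaml
      - name: Validate OTel Collector config
        run: otelcol validate --config=config/otel/collector.yaml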
When to choose Vector
Choose Vector when most of these are true:
Your main problem is log volume.
You need high-throughput node-level collection.
You want readable, compact transformation logic.
Your team frequently writes parsing, redaction, and filtering rules.
You need clear disk buffering and backpressure behavior.
Your telemetry strategy is log-first.
You want service teams to understand the pipeline quickly.
Vector is especially strong for edge filtering.
A common Vector-first architecture looks like this:
Kubernetes nodes
→ Vector DaemonSet
→ parse
→ redact
→ drop obvious noise
→ buffer
→ send downstream
This pattern is useful when the cost and volume problem starts at the node. If you can remove low-value logs before they leave the cluster, you reduce network usage, gateway pressure, and downstream ingestion costs.
Vector is not only a collector in this model. It is a programmable edge pipeline.
When to choose OTel Collector
Choose OTel Collector when most of these are true:
Your company is standardizing on OpenTelemetry.
Logs are only one part of your telemetry strategy.
You need one architecture for logs, metrics, and traces.
You want vendor-neutral telemetry semantics.
You need broad receiver, processor, and exporter coverage.
You have platform engineering capacity to manage collector configs.
You want a common collector pattern across many teams and environments.
OTel Collector is especially strong as a telemetry backbone.
A common OTel-first architecture looks like this:
Applications and infrastructure
→ OTel Collector agents
→ memory limiter
→ resource detection
→ filtering
→ transformation
→ batching
→ OTel Collector gateway
→ routing
→ aggregation
→ export to approved destinations
This architecture is less about making one log pipeline elegant and more about creating an organization-wide telemetry control plane.
That is OTel Collector’s biggest advantage: it fits naturally into a broader OpenTelemetry strategy.
When to use both
Many mature teams should consider using both. That does not mean doubling complexity for no reason. It means using the right collector in the right part of the pipeline. A hybrid architecture can look like this:
High-volume Kubernetes logs
→ Vector agents for edge parsing, filtering, and buffering
→ OTel Collector gateway for standardized export and routing
→ downstream storage, search, alerting, or analytics destinations
Or:
Application traces and metrics
→ OTel Collector agents
→ OTel Collector gateway
Noisy application logs
→ Vector agents
→ shared downstream telemetry pipeline
This is often the most practical enterprise answer. Use OTel Collector where standardization matters most. Use Vector where log-path efficiency and transformation ergonomics matter most.
Final recommendation
Use Vector when log collection is the main problem. Vector is the better fit when your team needs high-throughput log collection, readable transforms, fast edge filtering, practical buffering, and frequent log-specific policy changes. Its sources → transforms → sinks model is easy to understand, and VRL is comfortable for parsing, redaction, normalization, and routing.
Use OTel Collector when telemetry standardization is the main problem. OTel Collector is the better fit when your team wants one vendor-neutral collector architecture for logs, metrics, and traces. It is more verbose, but its receiver/processor/exporter model fits well when platform teams need reusable patterns across many services and environments.
Use both when the architecture calls for it. A common mature pattern is:
Vector at the edge for noisy, high-volume logs
OTel Collector as the standard telemetry gateway
Downstream storage, search, alerting, or analytics destinations
The right answer is not the collector with the best logo, the most stars, or the fastest synthetic benchmark. The right answer is the one your team can operate safely when production volume spikes, downstream systems slow down, and the business asks why observability costs doubled.
For log-first pipelines, start with Vector. For OpenTelemetry-wide standardization, start with OTel Collector. For complex environments, test both against your real log volume, transformations, buffering requirements, and destination behavior before standardizing.
How Sawmills simplifies this
Whether you choose Vector, OTel Collector, or both, the install is the easy part. The work that consumes your team is everything after: finding which telemetry is burning budget, deciding what to sample, aggregate, transform, route, or drop, and rolling those changes out safely. Sawmills handles that work continuously.
Mills, the agentic telemetry operator at the core of Sawmills, analyzes your flows in real time, applies the policies your DevOps team defines, and runs the pipeline autonomously within those guardrails. DevOps owns the strategy. Developers self-serve fixes in Slack or Teams. Built on OpenTelemetry, Sawmills works with the collectors and backends you already run. See Sawmills in action.
Five Expert Tips for Vector and OTel Collector in Production
1. Evaluate operator experience as a real test phase, not an afterthought. Before standardizing, have the people who will own the collector complete the tasks they will repeat for years: drop a noisy endpoint, redact a nested field, route audit logs separately, debug a failed transform, roll back a bad rule. A collector that benchmarks well but is hard to change safely will cost you more in incident response than it ever saves in throughput.
2. Classify your logs by business value before writing a single rule. Audit logs, payment state changes, and customer-impacting errors do not deserve the same path as health checks and debug logs. If your pipeline cannot distinguish these classes, every cost-cutting rule is a gamble. Build the must-keep, usually-keep, and safe-to-reduce categories first. Then write the transforms.
3. Set backpressure per stream. Block on critical, drop on low-value. Vector makes this explicit with per-sink when_full: block versus drop_newest. OTel Collector requires more deliberate wiring through separate exporters and sending queues. Either way, audit and error logs should not share a buffer policy with debug logs. A single global policy is how you either lose evidence or crash a node.
4. Validate drop rules against real logs before rolling them out. A rule that saves money can also remove the only clue you need during an incident. Sample what would be dropped. Confirm with service owners. Watch error rates, alert quality, and investigation workflows after rollout. Cost reduction only counts if the remaining telemetry still helps you operate the system.
5. Benchmark with your transforms, your destinations, and your failure modes. Synthetic throughput numbers do not survive contact with JSON parsing, regex redaction, multiline logs, compression, retries, and 60-minute downstream outages. Test normal load, 3-5x burst, and destination unavailable for 5, 15, and 60 minutes. Anything less is not a real benchmark.
