Best Practices for Using OpenTelemetry and AWS

Observability
Jul 17, 2025

Part 3 of 3: Security, performance optimization, and advanced operational practices

As you scale your observability implementation with OpenTelemetry and AWS, you'll encounter new challenges around security, performance, cost management, and operational complexity. This final part of our comprehensive guide focuses on advanced best practices that will help you build a production-ready, enterprise-grade observability platform.

We'll cover critical areas such as resource attribution and metadata management, data volume optimization through intelligent sampling, security and compliance considerations, performance and scalability strategies, and operational excellence practices. These advanced techniques will help you maximize the value of your observability investment while maintaining security, performance, and cost efficiency.

Whether you're operating at scale with hundreds of services or building a foundation for future growth, these practices will ensure your observability platform can evolve with your business needs while maintaining operational excellence.

Implementing observability with OpenTelemetry and AWS requires careful planning and adherence to best practices. Here's a comprehensive guide organized by key areas:

1. Resource Attribution and Metadata

Use Consistent Resource Attributes:

# Example resource configuration for all services
resource:
  attributes:
    - key: service.name
      value: "user-service"  # Consistent service naming
    - key: service.version
      value: "1.2.3"  # Semantic versioning
    - key: service.namespace
      value: "production"  # Environment identification
    - key: deployment.environment
      value: "prod"  # Standard environment tags
    - key: team
      value: "backend"  # Team ownership
    - key: cost.center
      value: "engineering"  # Cost allocation

Best Practices:

  • Standardize naming conventions across all services
  • Include business context like team, cost center, and project
  • Use semantic conventions from the OpenTelemetry specification
  • Add infrastructure metadata like region, availability zone, and instance type
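
For services instrumented in code, the same attributes can be attached through the SDK. A minimal Python sketch using the SDK's Resource API (the attribute values are illustrative):

# Set consistent resource attributes in code with the Python SDK
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

resource = Resource.create({
    "service.name": "user-service",      # consistent service naming
    "service.version": "1.2.3",          # semantic versioning
    "service.namespace": "production",
    "deployment.environment": "prod",
    "team": "backend",                   # business context
    "cost.center": "engineering",
})

# Every span created through this provider inherits the resource
trace.set_tracer_provider(TracerProvider(resource=resource))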

2. Data Volume Management and Sampling

Implement Intelligent Sampling:

# Collector configuration with sampling
processors:
  # Head-based sampling: cheap, decided up front; use either this or
  # tail sampling below, not both in the same pipeline
  probabilistic_sampler:
    hash_seed: 22  # Consistent hash so every collector samples the same traces
    sampling_percentage: 10  # Sample 10% of traces
  # Tail-based sampling: decided once the whole trace has been seen
  tail_sampling:
    decision_wait: 10s  # Wait time before making a sampling decision
    num_traces: 50000  # Maximum traces held in memory
    expected_new_traces_per_sec: 1000  # Expected trace rate
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]  # Always sample errors
      - name: slow-policy
        type: latency
        latency:
          threshold_ms: 1000  # Always sample requests slower than 1s
      - name: default-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5  # Sample 5% of remaining requests
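
Head-based sampling can also happen in the SDK, before spans ever reach the collector. A minimal Python sketch that mirrors the 10% ratio above (root spans are sampled probabilistically; child spans follow the parent's decision):

# SDK-side head sampling: 10% of new traces, children follow the parent
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))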


Data Reduction Strategies:

  • Filter unnecessary attributes before export
  • Use batch processors to reduce API calls
  • Implement cardinality limits to prevent metric explosion
  • Drop verbose logs in production environments

3. Security and Compliance

Secure Communication:

# TLS configuration for secure communication
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/ssl/certs/collector.crt  # TLS certificate
          key_file: /etc/ssl/private/collector.key  # Private key
          ca_file: /etc/ssl/certs/ca.crt  # CA certificate

IAM Best Practices:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "xray:PutTraceSegments",
        "xray:PutTelemetryRecords"
      ],
      "Resource": "*"
    }
  ]
}

Keep the policy narrow by granting only the write actions the collector actually needs; X-Ray's Put* actions do not support resource-level scoping, so "Resource" must remain "*".


Security Recommendations:

  • Use least-privilege IAM roles (for example, IAM Roles for Service Accounts on EKS) instead of long-lived credentials
  • Enable TLS on every collector endpoint, as shown above
  • Keep certificates and credentials in AWS Secrets Manager or Kubernetes secrets, never in the collector config
  • Restrict collector endpoints to in-cluster traffic with security groups and network policies

4. Performance and Scalability

Collector Configuration Optimization:

# Performance-optimized collector configuration
processors:
  batch:
    timeout: 1s  # Batch timeout
    send_batch_size: 1024  # Optimal batch size
    send_batch_max_size: 2048  # Maximum batch size
    metadata_keys: ["service.name", "service.version"]  # Include metadata
    metadata_cardinality_limit: 1000  # Limit cardinality

  memory_limiter:
    check_interval: 1s  # Memory check interval
    limit_mib: 1000  # Memory limit in MiB
    spike_limit_mib: 200  # Spike memory limit

  # Automatically populate host and runtime resource attributes instead
  # of copying them by hand
  resourcedetection:
    detectors: [env, system, eks]
    timeout: 2s


Scaling Strategies:

  • Use DaemonSet deployment for node-level collection
  • Implement horizontal scaling with multiple collector instances
  • Monitor collector resource usage and adjust limits accordingly
  • Use load balancing for high-availability setups

5. Monitoring and Alerting

Collector Health Monitoring:

# Health check configuration
# The health check endpoint is exposed by the health_check extension
# (not a receiver)
extensions:
  health_check:
    endpoint: 0.0.0.0:13133  # Health check endpoint

processors:
  transform:
    trace_statements:
      - context: span
        statements:
          - set(attributes["collector.version"], "1.0.0")  # Add collector version
          - set(attributes["collector.instance"], "${env:HOSTNAME}")  # Instance info via env var expansion

service:
  extensions: [health_check]


Comprehensive Alerting Strategy:

# CloudWatch Alarms for collector health; assumes the collector's own
# metrics are published to CloudWatch under the AWS/OTel namespace
Resources:
  CollectorErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ADOTCollectorErrorRate
      AlarmDescription: Alert when the collector fails to export spans
      MetricName: otelcol_exporter_send_failed_spans
      Namespace: AWS/OTel
      Statistic: Sum
      Period: 300
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2

  CollectorBatchSizeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ADOTCollectorBatchSize
      AlarmDescription: Alert when average batch send size is high, a sign of backpressure
      MetricName: otelcol_processor_batch_batch_send_size
      Namespace: AWS/OTel
      Statistic: Average
      Period: 300
      Threshold: 1000
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2

6. Data Quality and Consistency

Standardize Telemetry Data:

# Consistent attribute naming across services
STANDARD_ATTRIBUTES = {
    "http.method": "GET",
    "http.status_code": 200,
    "http.url": "/api/users",
    "user.id": "12345",
    "request.id": "req-abc123",
    "business.operation": "user_lookup"
}

# Use consistent span naming
def create_span_name(operation, resource):
    return f"{resource}.{operation}"  # e.g., "user_service.create_user"


Data Validation:

  • Validate attribute types and values before sending
  • Use consistent naming conventions across all services
  • Implement data quality checks in the collector
  • Monitor for data anomalies and inconsistencies
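
One lightweight way to implement the first two checks on the application side is to validate attributes before they are attached to a span. A hedged sketch (the allowed types follow the OpenTelemetry attribute specification):

# Validate attribute types before sending; OpenTelemetry attribute
# values must be str, bool, int, float, or a sequence of one of those
from opentelemetry import trace

ALLOWED_TYPES = (str, bool, int, float)

def validate_attributes(attributes: dict) -> dict:
    """Keep only attributes whose values are legal OpenTelemetry types."""
    clean = {}
    for key, value in attributes.items():
        if isinstance(value, ALLOWED_TYPES):
            clean[key] = value
        elif isinstance(value, (list, tuple)) and all(
            isinstance(v, ALLOWED_TYPES) for v in value
        ):
            clean[key] = list(value)
        # Anything else is dropped rather than sent as bad data
    return clean

tracer = trace.get_tracer("data-quality-demo")
with tracer.start_as_current_span("user_lookup") as span:
    span.set_attributes(validate_attributes({"user.id": "12345", "retries": 3}))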

7. Cost Optimization

AWS Cost Management:

# Cost-optimized configuration
exporters:
  awsemf:
    region: us-west-2
    # Publish only the custom metrics you need
    metric_declarations:
      - dimensions: [[ServiceName, Operation]]
        metric_name_selectors:
          - "request_duration"
          - "error_rate"

  awscloudwatchlogs:
    region: us-west-2
    log_group_name: "/aws/eks/my-cluster/application"
    log_stream_name: "{PodName}"
    log_retention: 7  # Expire log events after 7 days

  awsxray:
    region: us-west-2
    # The awsxray exporter does not sample; reduce trace volume (and
    # cost) with the sampling processors shown earlier


Cost Optimization Strategies:

  • Sample traces aggressively for high-volume, low-value endpoints
  • Set short log retention periods for verbose application logs
  • Limit custom metric dimensions, since each unique combination is billed as a separate metric
  • Filter debug-level telemetry before it leaves the collector
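
On the metrics side, SDK views are one way to cap dimensions at the source. A sketch assuming the Python metrics SDK (the instrument name and attribute keys are illustrative):

# Keep only two dimensions on a high-volume histogram; every other
# attribute is dropped before export, capping cardinality and cost
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.view import View

view = View(
    instrument_name="request_duration",
    attribute_keys={"service.name", "operation"},
)
provider = MeterProvider(views=[view])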

8. Operational Excellence

Deployment and Configuration Management:

# GitOps-friendly configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: adot-collector-config
  namespace: aws-otel
  labels:
    app: adot-collector
    version: "1.0.0"
    environment: production
data:
  config.yaml: |
    # Configuration managed via GitOps
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
    # ... rest of configuration


Operational Best Practices:

  • Use GitOps workflows for configuration management
  • Implement blue-green deployments for collector updates
  • Maintain configuration versioning and rollback capabilities
  • Document configuration changes and their impact
  • Regular backup and recovery testing of collector configurations

9. Troubleshooting and Debugging

Common Issues and Solutions:

High Memory Usage:

# Memory optimization
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  batch:
    timeout: 1s
    send_batch_size: 512  # Reduce batch size


Network Connectivity Issues:

# Troubleshooting commands
# Check the collector's health endpoint (health_check extension, port 13133)
kubectl exec -n aws-otel deployment/adot-collector -- curl -v http://localhost:13133/

# Check AWS credentials (requires the AWS CLI in the image or a debug sidecar)
kubectl exec -n aws-otel deployment/adot-collector -- aws sts get-caller-identity

# Check collector logs
kubectl logs -n aws-otel -l app=adot-collector --tail=100 -f


Debugging Tools:

  • Enable debug logging for troubleshooting
  • Use collector health endpoints for monitoring
  • Implement distributed tracing for collector operations
  • Monitor collector metrics in CloudWatch

10. Future-Proofing and Evolution

Planning for Scale:

  • Design for multi-region deployment from the start
  • Plan for service mesh integration (Istio, Linkerd)
  • Consider hybrid cloud scenarios and multi-cloud strategies
  • Stay updated with OpenTelemetry specification changes
  • Evaluate new AWS observability features as they become available

Migration Strategies:

  • Gradual migration from existing monitoring solutions
  • A/B testing of new observability patterns
  • Feature flags for new telemetry collection
  • Backward compatibility during transitions
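
To illustrate the feature-flag approach, a minimal Python sketch that gates tracing setup behind an environment variable (the flag name is hypothetical):

# Gate new telemetry behind a flag so it can be rolled back without a redeploy
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider

if os.environ.get("ENABLE_OTEL_TRACING", "false").lower() == "true":
    trace.set_tracer_provider(TracerProvider())
# Otherwise the default no-op provider stays in place, so instrumentation
# calls remain safe but emit nothing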

Advanced Implementation Patterns

11. Multi-Tenant Observability

Tenant Isolation Strategies:

# Multi-tenant collector configuration
processors:
  # Promote tenant attributes from spans onto the resource so telemetry
  # can be partitioned by tenant downstream
  groupbyattrs:
    keys:
      - tenant.id
      - tenant.environment

  filter:
    spans:
      include:
        match_type: regexp
        attributes:
          - key: tenant.id
            value: ".*"  # Include all tenants

Data Segregation:

  • Separate log groups per tenant
  • Custom namespaces for tenant-specific metrics
  • X-Ray trace filtering by tenant
  • Cost allocation by tenant
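
On the application side, one way to guarantee every span carries the tenant attribute is a custom SpanProcessor. A minimal Python sketch (tenant resolution is a placeholder you would wire to your request context):

# Stamp every span with the tenant at creation time
from opentelemetry import trace
from opentelemetry.sdk.trace import SpanProcessor, TracerProvider

def current_tenant_id() -> str:
    return "tenant-a"  # placeholder: resolve from request context in practice

class TenantSpanProcessor(SpanProcessor):
    def on_start(self, span, parent_context=None):
        span.set_attribute("tenant.id", current_tenant_id())

provider = TracerProvider()
provider.add_span_processor(TenantSpanProcessor())
trace.set_tracer_provider(provider)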

12. High Availability and Disaster Recovery

Multi-Region Deployment:

# Multi-region collector configuration
exporters:
  awscloudwatchlogs:
    region: us-west-2  # Primary region
    log_group_name: "/aws/eks/my-cluster/application"
  awscloudwatchlogs/backup:
    region: us-east-1  # Backup region
    log_group_name: "/aws/eks/my-cluster/application-backup"


Disaster Recovery Strategies:

  • Cross-region replication of critical telemetry data
  • Backup collector instances in secondary regions
  • Automated failover procedures
  • Data retention policies for compliance

13. Compliance and Governance

Data Governance:

# Compliance-focused configuration
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Remove PII from telemetry data
          - delete_key(attributes, "user.email")
          - delete_key(attributes, "user.phone")
          # Add compliance metadata
          - set(attributes["compliance.data_classification"], "public")
          - set(attributes["compliance.retention_period"], "90d")


Compliance Features:

  • PII detection and removal from telemetry data
  • Data classification and retention policies
  • Audit logging for data access
  • Compliance reporting and dashboards
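
The same scrubbing can be applied in application code before attributes ever reach a span. A hedged sketch that drops denylisted keys and masks anything matching an email pattern (the key list and regex are illustrative):

# Scrub PII from attributes before they are attached to a span
import re

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
DENYLIST = {"user.email", "user.phone"}  # illustrative known-PII keys

def scrub(attributes: dict) -> dict:
    clean = {}
    for key, value in attributes.items():
        if key in DENYLIST:
            continue  # drop known-PII keys outright
        if isinstance(value, str):
            value = EMAIL_RE.sub("[REDACTED]", value)  # mask embedded emails
        clean[key] = value
    return clean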

14. Performance Optimization

Advanced Performance Tuning:

# Performance-optimized configuration
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048
    metadata_keys: ["service.name", "service.version"]
    metadata_cardinality_limit: 1000

  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200

  # Advanced sampling for high-volume services
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 5  # Lower sampling for high-volume services

  # Tail sampling for important traces
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: default-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 1  # Very low sampling for normal requests


Performance Monitoring:

  • Collector performance metrics monitoring
  • Resource utilization optimization
  • Network performance analysis
  • Bottleneck identification and resolution

15. Security Hardening

Advanced Security Configuration:

# Security-hardened configuration
extensions:
  # Basic auth is provided by the basicauth extension, referenced from
  # the receiver below; credentials come from the environment
  basicauth/server:
    htpasswd:
      inline: |
        ${env:COLLECTOR_USERNAME}:${env:COLLECTOR_PASSWORD}

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/ssl/certs/collector.crt
          key_file: /etc/ssl/private/collector.key
          ca_file: /etc/ssl/certs/ca.crt
        auth:
          authenticator: basicauth/server

processors:
  # Remove sensitive data
  transform:
    trace_statements:
      - context: span
        statements:
          - delete_key(attributes, "password")
          - delete_key(attributes, "token")
          - delete_key(attributes, "secret")

service:
  extensions: [basicauth/server]


Security Best Practices:

  • TLS encryption for all communications
  • Authentication and authorization for collector access
  • Sensitive data filtering from telemetry
  • Network security with VPC and security groups
  • Regular security audits and penetration testing

Implementation Roadmap

Phase 1: Foundation (Weeks 1-2)

  1. Set up basic ADOT collector on EKS
  2. Implement basic instrumentation for one service
  3. Configure CloudWatch and X-Ray exporters
  4. Create basic dashboards and alerts

Phase 2: Scaling (Weeks 3-4)

  1. Instrument all services with OpenTelemetry
  2. Implement sampling strategies for cost optimization
  3. Set up comprehensive alerting and monitoring
  4. Create service-specific dashboards

Phase 3: Optimization (Weeks 5-6)

  1. Implement security hardening and compliance features
  2. Optimize performance and resource utilization
  3. Set up multi-region deployment for high availability
  4. Implement advanced monitoring and troubleshooting

Phase 4: Advanced Features (Weeks 7-8)

  1. Implement multi-tenant observability
  2. Set up disaster recovery procedures
  3. Advanced compliance and governance features
  4. Performance optimization and tuning

Success Metrics

Technical Metrics:

  • Collector uptime: >99.9%
  • Data processing latency: <1 second
  • Error rate: <0.1%
  • Resource utilization: <80% CPU, <80% memory

Business Metrics:

  • Mean time to detection (MTTD): <5 minutes
  • Mean time to resolution (MTTR): <30 minutes
  • Monitoring coverage: 100% of critical services
  • Cost per service: <$50/month

Operational Metrics:

  • Alert accuracy: >95% (low false positive rate)
  • Dashboard adoption: >80% of teams using dashboards
  • Incident reduction: >50% reduction in production incidents
  • Developer productivity: >20% reduction in time spent debugging

Summary

In this final part of our comprehensive guide, we've explored advanced best practices for using OpenTelemetry and AWS to build a production-ready, enterprise-grade observability platform:

Key advanced practices covered:

  1. Resource attribution and metadata management for consistent observability
  2. Data volume management through intelligent sampling and filtering
  3. Security and compliance considerations for enterprise environments
  4. Performance and scalability optimization for high-volume systems
  5. Comprehensive monitoring and alerting strategies
  6. Data quality and consistency across all services
  7. Cost optimization and resource management
  8. Operational excellence with GitOps and automation
  9. Troubleshooting and debugging techniques
  10. Future-proofing and evolution strategies

Advanced implementation patterns:

  • Multi-tenant observability for SaaS applications
  • High availability and disaster recovery strategies
  • Compliance and governance features for regulated industries
  • Performance optimization for high-scale deployments
  • Security hardening for enterprise security requirements

Implementation outcomes:

  • Enterprise-grade observability platform that scales with your business
  • Comprehensive security and compliance features
  • Cost-effective operations through intelligent optimization
  • Operational excellence through automation and best practices
  • Future-ready architecture that can evolve with new technologies

Next steps:

  • Implement these practices incrementally, starting with your most critical services
  • Establish feedback loops to continuously improve your observability platform
  • Regular review and optimization of your implementation
  • Stay updated with OpenTelemetry and AWS observability features
  • Share knowledge and best practices across your organization

This comprehensive approach to observability with OpenTelemetry and AWS will provide you with deep insights into your applications, enable faster incident response, improve user experience, and support your business growth with a robust, scalable, and cost-effective observability platform.

This concludes our three-part comprehensive guide to implementing observability with AWS and OpenTelemetry. For additional resources and support, visit the OpenTelemetry documentation and AWS observability services.