
Best Practices for Monitoring with AWS and OpenTelemetry

Observability
Jul 16, 2025

In part 1 of our series on AWS and OpenTelemetry, we discussed setting up OpenTelemetry infrastructure with AWS services. The next critical step is implementing comprehensive monitoring strategies. Effective monitoring goes beyond simply collecting data—it involves creating actionable insights, setting up intelligent alerting, and building operational excellence.

This guide focuses on monitoring best practices specifically designed for AWS CloudWatch, AWS X-Ray, and OpenTelemetry observability stacks. You'll learn how to establish monitoring baselines, create comprehensive alerting strategies, build service-specific dashboards, and implement distributed tracing for complex microservices.

Whether you're a DevOps engineer, SRE, or developer, these practices will help you build a monitoring system that not only detects issues but also provides the context needed to resolve them quickly and prevent future occurrences.

Best Practices for Monitoring

1. Establish Monitoring Baselines

Define SLOs (Service Level Objectives) for critical business metrics:

  • Availability: Target 99.9% uptime for critical services
  • Latency: P95 response time under 500ms for user-facing APIs
  • Error Rate: Less than 1% error rate for production services
  • Throughput: Monitor request rates and capacity planning
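The availability target above translates directly into an error budget. A quick sketch (the numbers are illustrative):

```python
# Translate an availability SLO into a concrete error budget.

def error_budget_minutes(slo_percent: float, window_days: int) -> float:
    """Minutes of allowed downtime for a given SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)

# A 99.9% SLO over a 30-day window allows roughly 43.2 minutes of downtime.
print(f"Monthly error budget: {error_budget_minutes(99.9, 30):.1f} minutes")
```

Burning through the budget faster than the window elapses is itself a useful alert condition.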

Set up baseline alarms for error rates (target: <1%), latency (target: <500ms), and throughput:

# Example: Baseline SLO monitoring
Resources:
  AvailabilitySLO:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ServiceAvailability
      AlarmDescription: Service availability below SLO target
      MetricName: availability_percentage
      Namespace: my-service
      Statistic: Average
      Period: 300
      Threshold: 99.9
      ComparisonOperator: LessThanThreshold
      EvaluationPeriods: 2

  LatencySLO:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ServiceLatency
      AlarmDescription: Service latency above SLO target
      MetricName: request_duration_p95
      Namespace: my-service
      Statistic: Average
      Period: 300
      Threshold: 500
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2

Monitor golden signals: latency, traffic, errors, and saturation:

  • Latency: Response time percentiles (P50, P95, P99)
  • Traffic: Request rate, concurrent users, data transfer
  • Errors: Error rates, failed requests, exceptions
  • Saturation: Resource utilization (CPU, memory, disk, network)
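To make the latency signal concrete, here is a minimal nearest-rank percentile sketch; in practice these values come from your metrics backend rather than hand-rolled code:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))  # 1-indexed rank
    return ordered[rank - 1]

latencies = [120, 95, 400, 230, 80, 1500, 310, 140, 260, 90]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(latencies, p)} ms")
```

Note how a single 1500 ms outlier dominates P95 and P99 while leaving P50 untouched; that is exactly why alarms watch the tail percentiles.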

Create service-specific dashboards for different teams (DevOps, SRE, Business):

  • Engineering teams: Technical metrics, performance, errors
  • Business teams: User engagement, conversion rates, revenue impact
  • Operations teams: Infrastructure health, capacity, costs

2. Implement Comprehensive Alerting Strategy

Multi-level alerting approach with different severity levels and response times:

# Example: Multi-level alerting strategy
Resources:
  # Critical alerts - immediate response required
  CriticalErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: CriticalErrorRate
      AlarmDescription: Critical error rate threshold exceeded
      MetricName: error_rate
      Namespace: my-service
      Statistic: Average
      Period: 60  # 1 minute evaluation
      Threshold: 5  # 5% error rate
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1  # Immediate alert
      AlarmActions:
        - !Ref CriticalSNSTopicArn
        - !Ref PagerDutyIntegrationArn

  # Warning alerts - investigate within 15 minutes
  WarningLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighLatencyWarning
      AlarmDescription: Response time degradation detected
      MetricName: request_duration_p95
      Namespace: my-service
      Statistic: Average
      Period: 300  # 5 minute evaluation
      Threshold: 1000  # 1 second
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2  # 2 consecutive periods
      AlarmActions:
        - !Ref WarningSNSTopicArn

  # Info alerts - monitor trends
  InfoTrafficAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: TrafficSpike
      AlarmDescription: Unusual traffic pattern detected
      MetricName: requests_total
      Namespace: my-service
      Statistic: Sum
      Period: 300
      Threshold: 1000  # 1000 requests in 5 minutes
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref InfoSNSTopicArn

Alert fatigue prevention strategies:

  • Use anomaly detection instead of static thresholds where appropriate
  • Implement alert correlation to group related alerts
  • Set up alert suppression during maintenance windows
  • Create alert escalation policies with timeouts
  • Regular alert review and cleanup of unused or noisy alerts
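The anomaly-detection idea can be illustrated with a trailing mean-and-stddev band; this is a sketch of the concept, not CloudWatch's anomaly-detection algorithm:

```python
# Flag a data point only when it falls outside mean +/- k standard
# deviations of a trailing window, instead of using a static threshold.
import statistics

def is_anomalous(history, value, k=3.0):
    """True if value deviates more than k stddevs from the trailing window."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > k * stdev

baseline = [100, 102, 98, 101, 99, 103, 97, 100]  # normal request rate
print(is_anomalous(baseline, 104))  # within the band: False
print(is_anomalous(baseline, 160))  # spike flagged: True
```

The band adapts as the baseline shifts, which is what makes this approach quieter than a fixed threshold during normal traffic growth.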

3. Create Service-Specific Dashboards

Comprehensive service dashboard with multiple widgets for different perspectives:

# Example: Create comprehensive service dashboard
aws cloudwatch put-dashboard \
    --dashboard-name "MyService-Comprehensive" \
    --dashboard-body '{
        "widgets": [
            {
                "type": "metric",
                "properties": {
                    "metrics": [
                        ["my-service", "requests_total", "endpoint", "/api/process"],
                        [".", "error_rate", ".", "."],
                        [".", "request_duration_p95", ".", "."],
                        [".", "active_connections", ".", "."]
                    ],
                    "period": 300,
                    "stat": "Sum",
                    "region": "us-west-2",
                    "title": "Service Overview",
                    "view": "timeSeries",
                    "stacked": false
                }
            },
            {
                "type": "log",
                "properties": {
                    "query": "SOURCE \"/aws/eks/my-cluster/application\"\n| fields @timestamp, @message\n| filter @message like /ERROR/\n| sort @timestamp desc\n| limit 20",
                    "region": "us-west-2",
                    "title": "Recent Errors",
                    "view": "table"
                }
            },
            {
                "type": "metric",
                "properties": {
                    "metrics": [
                        ["my-service", "request_duration_p50", "endpoint", "/api/process"],
                        [".", "request_duration_p95", ".", "."],
                        [".", "request_duration_p99", ".", "."]
                    ],
                    "period": 300,
                    "stat": "Average",
                    "region": "us-west-2",
                    "title": "Response Time Percentiles",
                    "view": "timeSeries"
                }
            },
            {
                "type": "metric",
                "properties": {
                    "metrics": [
                        ["AWS/ECS", "CPUUtilization", "ServiceName", "my-service"],
                        [".", "MemoryUtilization", ".", "."]
                    ],
                    "period": 300,
                    "stat": "Average",
                    "region": "us-west-2",
                    "title": "Resource Utilization",
                    "view": "timeSeries"
                }
            }
        ]
    }'

Dashboard best practices:

  • Group related metrics in logical sections
  • Use consistent time ranges across widgets
  • Include both current values and trends
  • Add context with annotations for deployments and incidents
  • Create role-based dashboards (engineering, business, operations)

4. Implement Distributed Tracing for Complex Microservices

Enable X-Ray Insights for automatic anomaly detection:

  • Automatic detection of unusual patterns in traces
  • Anomaly scoring based on historical data
  • Root cause analysis for performance issues
  • Service dependency mapping and impact analysis

Set up trace sampling to balance visibility with cost:

# Example: Intelligent trace sampling configuration
# (head-based and tail-based sampling shown together; pick one per pipeline)
processors:
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 10  # Head-based: sample 10% of traces by default
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]  # Always sample errors
      - name: slow-policy
        type: latency
        latency:
          threshold_ms: 1000  # Always sample slow requests
      - name: default-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5  # Sample 5% of normal requests
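The tail-sampling policies above boil down to simple decision logic: always keep errors, always keep slow traces, otherwise keep a small deterministic fraction. A sketch with hypothetical trace dicts (not the collector's internal API):

```python
import hashlib

def keep_trace(trace, percent=5.0):
    if trace["status"] == "ERROR":       # error-policy: always sample errors
        return True
    if trace["duration_ms"] > 1000:      # slow-policy: always sample slow requests
        return True
    # default-policy: deterministic hash of the trace ID keeps ~percent%
    digest = int(hashlib.sha256(trace["trace_id"].encode()).hexdigest(), 16)
    return (digest % 10000) < percent * 100

print(keep_trace({"trace_id": "a1", "status": "ERROR", "duration_ms": 20}))  # True
print(keep_trace({"trace_id": "a2", "status": "OK", "duration_ms": 2500}))   # True
```

Hashing the trace ID (rather than rolling a random number) means every collector instance makes the same keep/drop decision for a given trace, which is what makes tail sampling consistent across a fleet.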

Create service maps to visualize dependencies:

  • Automatic service discovery and dependency mapping
  • Performance impact analysis across service boundaries
  • Bottleneck identification in distributed systems
  • Change impact assessment for deployments

Monitor trace latency across service boundaries:

  • End-to-end latency tracking for user journeys
  • Service boundary performance monitoring
  • Database and external service latency tracking
  • Cache performance and hit rate analysis

Track business transactions end-to-end:

  • User journey mapping from frontend to backend
  • Business process monitoring (checkout, payment, etc.)
  • Cross-service transaction correlation
  • Business impact analysis for technical issues

5. Monitor the Collector Itself

Collector health monitoring to ensure the observability system is working:

# Collector health monitoring
Resources:
  CollectorHealthAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ADOTCollectorHealth
      AlarmDescription: ADOT collector is not healthy
      MetricName: otelcol_health_check
      Namespace: AWS/OTel
      Statistic: Average
      Period: 60
      Threshold: 1
      ComparisonOperator: LessThanThreshold
      EvaluationPeriods: 2

  CollectorMemoryAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ADOTCollectorMemory
      AlarmDescription: ADOT collector memory usage high
      MetricName: otelcol_memory_usage
      Namespace: AWS/OTel
      Statistic: Average
      Period: 300
      Threshold: 80  # 80% memory usage
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2

  CollectorErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ADOTCollectorErrors
      AlarmDescription: ADOT collector experiencing errors
      MetricName: otelcol_exporter_sent_failed_requests
      Namespace: AWS/OTel
      Statistic: Sum
      Period: 300
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2

Key collector metrics to monitor:

  • Health status: Collector pod health and readiness
  • Memory usage: Prevent OOM kills and performance issues
  • Error rates: Failed exports to AWS services
  • Processing latency: Time to process and export telemetry
  • Queue depth: Backlog of unprocessed telemetry data

6. Business Metrics and KPIs

Track business transactions (orders, payments, user actions):

# Example: Business transaction tracking
from opentelemetry import trace, metrics

tracer = trace.get_tracer("order-service")
meter = metrics.get_meter("order-service")
order_counter = meter.create_counter("orders_total")
revenue_gauge = meter.create_histogram("order_revenue")

def process_order(order_data):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("business.transaction_type", "order")
        span.set_attribute("business.order_value", order_data["total"])
        span.set_attribute("business.customer_id", order_data["customer_id"])

        # Track business metrics
        order_counter.add(1, {
            "status": "processing",
            "payment_method": order_data["payment_method"]
        })

        # Process the order
        result = perform_order_processing(order_data)

        # Update business metrics based on result
        if result["success"]:
            order_counter.add(1, {"status": "completed"})
            revenue_gauge.record(order_data["total"])
        else:
            order_counter.add(1, {"status": "failed"})

        return result

# Example: Revenue impact monitoring
Resources:
  RevenueImpactAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: RevenueImpact
      AlarmDescription: Critical failure impacting revenue
      MetricName: revenue_impact_per_minute
      Namespace: business-metrics
      Statistic: Sum
      Period: 60
      Threshold: 100  # $100 revenue impact per minute
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref CriticalBusinessSNSTopicArn

Create executive dashboards with business metrics:

  • High-level KPIs: Revenue, user growth, conversion rates
  • Business health indicators: Customer satisfaction, support tickets
  • Operational efficiency: Cost per transaction, resource utilization
  • Market trends: User behavior patterns, seasonal variations

Correlate technical metrics with business outcomes:

  • Performance impact: How latency affects conversion rates
  • Error correlation: Which errors impact revenue most
  • Capacity planning: User growth vs. infrastructure scaling
  • Cost optimization: Resource efficiency vs. business value
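One concrete way to quantify the latency-conversion relationship is a Pearson correlation over aligned time buckets. The numbers below are illustrative, not real data:

```python
# Correlate a technical metric (hourly P95 latency) with a business
# outcome (conversion rate) over the same time buckets.
from statistics import fmean

def pearson(xs, ys):
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

p95_latency_ms = [220, 240, 310, 450, 600, 820]
conversion_pct = [3.1, 3.0, 2.8, 2.4, 2.0, 1.6]
r = pearson(p95_latency_ms, conversion_pct)
print(f"latency vs. conversion: r = {r:.2f}")  # strongly negative
```

A strongly negative r like this one is the kind of evidence that turns "latency matters" from an assertion into a prioritized engineering investment.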

7. Log Management and Analysis

Advanced log analysis queries for operational insights:

# Find slow database queries
fields @timestamp, @message, duration
| filter @message like /database/
| filter duration > 100
| sort duration desc
| limit 50

# Identify error patterns
fields @timestamp, @message, error_type
| filter @message like /ERROR/
| stats count() as error_count by error_type
| sort error_count desc

# Monitor API usage patterns
fields @timestamp, @message, endpoint, user_id
| filter @message like /API/
| stats count() as request_count by endpoint, user_id
| sort request_count desc

# Detect security anomalies
fields @timestamp, @message, ip_address, user_agent
| filter @message like /authentication/ and @message like /failed/
| stats count() as failure_count by ip_address
| sort failure_count desc
| limit 20

# Performance trend analysis
fields @timestamp, @message, response_time
| filter @message like /request/
| stats avg(response_time) by bin(5m)
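The error-pattern query can be prototyped locally before committing it to Logs Insights: the same grouping is a regex plus a Counter. The log format here is a hypothetical example:

```python
import re
from collections import Counter

logs = [
    "2025-07-16T10:01:02 ERROR DatabaseTimeout connection pool exhausted",
    "2025-07-16T10:01:09 ERROR DatabaseTimeout connection pool exhausted",
    "2025-07-16T10:02:14 INFO request completed in 120ms",
    "2025-07-16T10:03:30 ERROR ValidationError missing field 'email'",
]

# Group log lines by the error type that follows the ERROR level token.
error_types = Counter(
    m.group(1)
    for line in logs
    if (m := re.search(r"ERROR (\w+)", line))
)
for error_type, count in error_types.most_common():
    print(f"{error_type}: {count}")
```

Once the pattern proves useful, the same extraction becomes a metric filter or a log-based metric so the insight runs continuously.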

Log-based metrics for custom business insights:

  • Error rate trends by service and endpoint
  • User behavior patterns and usage analytics
  • Security event monitoring and threat detection
  • Compliance audit trails and data access logs

Log retention and archival strategies:

  • Hot storage: Recent logs (7-30 days) for active monitoring
  • Warm storage: Historical logs (30-90 days) for analysis
  • Cold storage: Long-term logs (90+ days) for compliance
  • Automated archival based on log age and importance

8. Performance Monitoring

Monitor resource utilization (CPU, memory, disk, network):

# Example: Resource monitoring alarms
Resources:
  HighCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighCPUUtilization
      AlarmDescription: CPU utilization above 80%
      MetricName: CPUUtilization
      Namespace: AWS/ECS
      Statistic: Average
      Period: 300
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2

  HighMemoryAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighMemoryUtilization
      AlarmDescription: Memory utilization above 85%
      MetricName: MemoryUtilization
      Namespace: AWS/ECS
      Statistic: Average
      Period: 300
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2

Track application performance (response times, throughput):

  • Response time percentiles: P50, P95, P99 for different endpoints
  • Throughput monitoring: Requests per second, concurrent users
  • Queue depth monitoring: Backlog of pending requests
  • Cache performance: Hit rates, eviction rates, miss penalties

Set up capacity planning alerts for resource exhaustion:

  • Predictive scaling: Alert before resources are exhausted
  • Trend analysis: Monitor resource usage growth patterns
  • Auto-scaling triggers: Automatic resource provisioning
  • Cost optimization: Right-sizing based on actual usage
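Predictive scaling can start as simply as a linear trend projection; a sketch with illustrative daily disk-usage samples:

```python
# Fit a least-squares linear trend to daily usage samples and estimate
# how many days remain until capacity is exhausted.

def days_until_full(usage_pct, capacity=100.0):
    """Project days until capacity from a least-squares linear trend."""
    n = len(usage_pct)
    xs = range(n)
    mx = sum(xs) / n
    my = sum(usage_pct) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, usage_pct))
             / sum((x - mx) ** 2 for x in xs))
    if slope <= 0:
        return None  # usage flat or shrinking
    return (capacity - usage_pct[-1]) / slope

usage = [60, 62, 63, 65, 66, 68, 70]  # daily disk usage, percent
print(f"Projected days until full: {days_until_full(usage):.0f}")
```

Alerting when the projection drops below your provisioning lead time gives you time to scale before saturation, rather than reacting to it.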

Monitor database performance and connection pools:

  • Query performance: Slow query detection and optimization
  • Connection pool health: Pool utilization and connection errors
  • Database metrics: Read/write ratios, lock contention
  • Replication lag: For read replicas and multi-region setups

Track cache hit rates and performance:

  • Cache efficiency: Hit rates, miss rates, eviction rates
  • Cache performance: Response times, memory usage
  • Cache warming: Pre-loading frequently accessed data
  • Cache invalidation: Impact of cache clears on performance

9. Security and Compliance Monitoring

Monitor authentication and authorization events:

# Authentication monitoring
fields @timestamp, @message, user_id, ip_address, result
| filter @message like /authentication/
| stats count() as event_count by result, user_id
| sort event_count desc

# Authorization failures
fields @timestamp, @message, user_id, resource, action
| filter @message like /authorization/ and @message like /denied/
| stats count() as denial_count by user_id, resource
| sort denial_count desc

Track API access patterns for anomalies:

  • Rate limiting: Monitor for unusual request patterns
  • Geographic anomalies: Access from unexpected locations
  • Time-based patterns: Unusual access times or patterns
  • User behavior: Deviations from normal usage patterns
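Counting authentication failures per source IP is the simplest version of this kind of anomaly check; a sketch with hypothetical events (the limit is a tunable, not a recommendation):

```python
from collections import Counter

events = [
    ("203.0.113.7", "failed"), ("203.0.113.7", "failed"),
    ("203.0.113.7", "failed"), ("198.51.100.2", "success"),
    ("203.0.113.7", "failed"), ("192.0.2.9", "failed"),
]

FAILED_LOGIN_LIMIT = 3  # failures per window; tune to your traffic

# Count failures per source IP and flag any IP over the limit.
failures = Counter(ip for ip, result in events if result == "failed")
suspicious = [ip for ip, n in failures.items() if n >= FAILED_LOGIN_LIMIT]
print(suspicious)
```

The same aggregation expressed as a Logs Insights query plus a metric filter is what feeds the failed-login alarm shown below.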

Set up alerts for security events (failed logins, privilege escalations):

# Example: Security monitoring alarms
Resources:
  FailedLoginAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighFailedLogins
      AlarmDescription: High rate of failed login attempts
      MetricName: failed_logins_per_minute
      Namespace: security-metrics
      Statistic: Sum
      Period: 60
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref SecuritySNSTopicArn

  PrivilegeEscalationAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: PrivilegeEscalation
      AlarmDescription: Privilege escalation detected
      MetricName: privilege_escalation_events
      Namespace: security-metrics
      Statistic: Sum
      Period: 300
      Threshold: 1
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref SecuritySNSTopicArn

Monitor compliance metrics (data retention, audit logs):

  • Data retention compliance: Ensure logs are retained for required periods
  • Audit trail completeness: Verify all required events are logged
  • Access review monitoring: Track privileged access and changes
  • Regulatory compliance: GDPR, HIPAA, SOX compliance monitoring

Track certificate expiration and SSL/TLS issues:

  • Certificate monitoring: Expiration dates and renewal tracking
  • SSL/TLS configuration: Security protocol versions and cipher suites
  • Certificate validation: Chain of trust verification
  • Automated renewal: Integration with certificate management systems
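A certificate-expiry check reduces to parsing the notAfter timestamp (shown here in OpenSSL's text format) and comparing it against a renewal window. The date and threshold are illustrative, and the reference time is fixed so the example is deterministic:

```python
from datetime import datetime, timezone

def days_to_expiry(not_after: str, now: datetime) -> int:
    """Days remaining on a certificate given its notAfter string."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - now).days

now = datetime(2025, 7, 16, tzinfo=timezone.utc)
remaining = days_to_expiry("Sep 14 12:00:00 2025 GMT", now)
print(f"Certificate expires in {remaining} days")
if remaining < 30:
    print("Renewal window: schedule rotation now")
```

Run on a schedule against every endpoint's certificate, this becomes a custom metric that the alarm patterns earlier in this guide can watch.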

10. Cost and Resource Optimization

Monitor AWS service costs and set up billing alerts:

# Example: Cost monitoring alarms
Resources:
  CostAnomalyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: CostAnomaly
      AlarmDescription: Unusual cost increase detected
      MetricName: EstimatedCharges
      Namespace: AWS/Billing
      Statistic: Maximum
      Period: 86400  # Daily
      Threshold: 100  # Estimated month-to-date charges exceed $100
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref CostManagementSNSTopicArn

Track resource utilization to identify waste:

  • Idle resource detection: Unused EC2 instances, EBS volumes
  • Over-provisioned resources: Instances with low utilization
  • Storage optimization: Unused S3 buckets, old snapshots
  • Network cost optimization: Data transfer patterns and costs

Set up cost anomaly detection for unexpected charges:

  • Daily cost tracking: Monitor daily spending patterns
  • Service-level cost analysis: Break down costs by AWS service
  • Region cost optimization: Identify cost differences across regions
  • Reserved instance optimization: Track RI utilization and savings

Monitor data transfer costs and optimize network usage:

  • Cross-region transfer costs: Minimize unnecessary data transfer
  • CDN optimization: CloudFront usage and cache hit rates
  • API Gateway costs: Request patterns and optimization opportunities
  • VPC endpoint costs: Private connectivity cost analysis

Track storage costs and implement lifecycle policies:

  • S3 lifecycle policies: Automatic transition to cheaper storage tiers
  • EBS snapshot management: Old snapshot cleanup and optimization
  • RDS storage optimization: Database storage usage and growth
  • Backup cost optimization: Backup retention and storage costs

11. Incident Response and On-Call

Create runbooks for common issues:

# Example: High Error Rate Runbook

## Symptoms
- Error rate > 5% for more than 2 minutes
- Increased response times
- User complaints about service failures

## Immediate Actions
1. Check service health endpoints
2. Review recent deployments
3. Check database connectivity
4. Verify external service dependencies

## Investigation Steps
1. Check CloudWatch logs for error patterns
2. Review X-Ray traces for slow operations
3. Check resource utilization (CPU, memory)
4. Verify network connectivity

## Resolution Steps
1. Rollback recent changes if necessary
2. Scale up resources if needed
3. Restart unhealthy instances
4. Update monitoring thresholds

## Post-Incident
1. Document root cause
2. Update runbooks
3. Review monitoring gaps
4. Schedule post-mortem

Set up escalation policies for different alert severities:

  • P0 (Critical): Immediate escalation to on-call engineer
  • P1 (High): Escalation after 15 minutes if not acknowledged
  • P2 (Medium): Escalation after 1 hour if not resolved
  • P3 (Low): Escalation after 4 hours if not addressed

Implement incident response workflows with tools like PagerDuty:

  • Alert routing: Route alerts to appropriate teams
  • Escalation chains: Automatic escalation when alerts aren't acknowledged
  • Incident creation: Automatic incident creation from critical alerts
  • Status page updates: Automatic status page updates during incidents

Create post-incident review processes to learn from failures:

  • Incident documentation: Detailed incident reports and timelines
  • Root cause analysis: Systematic analysis of failure causes
  • Action item tracking: Follow-up on improvement items
  • Process improvement: Update procedures based on lessons learned

Set up automated remediation for common issues:

  • Auto-scaling: Automatic resource scaling based on load
  • Health check recovery: Automatic restart of unhealthy instances
  • Circuit breaker patterns: Automatic failure isolation
  • Rollback automation: Automatic rollback of failed deployments
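The circuit-breaker pattern mentioned above can be sketched in a few lines: after consecutive failures the breaker opens and fails fast until a cool-down elapses. Thresholds here are illustrative:

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # open the circuit
            raise
        self.failures = 0  # success closes the circuit
        return result
```

Wrapping calls to a flaky dependency (say, a hypothetical `breaker.call(fetch_inventory, sku)`) means that while the circuit is open, callers get an immediate error instead of piling up on a struggling service.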

12. Continuous Improvement

Regular review of alert thresholds based on historical data:

  • Threshold optimization: Adjust thresholds based on actual patterns
  • Seasonal adjustments: Account for business cycles and patterns
  • Performance improvements: Update thresholds as systems improve
  • False positive reduction: Eliminate noisy alerts

A/B testing of monitoring strategies to optimize effectiveness:

  • Alert sensitivity testing: Test different threshold levels
  • Dashboard effectiveness: Measure dashboard usage and value
  • Response time optimization: Test different escalation policies
  • Tool evaluation: Test new monitoring tools and approaches

Feedback loops from on-call teams to improve alerting:

  • Alert quality surveys: Regular feedback on alert usefulness
  • False positive tracking: Monitor and reduce false positives
  • Response time analysis: Track time to acknowledge and resolve
  • Team satisfaction: Monitor on-call team satisfaction and burnout

Regular cleanup of unused dashboards and alarms:

  • Dashboard audit: Review and remove unused dashboards
  • Alarm cleanup: Remove or update outdated alarms
  • Metric optimization: Remove unused custom metrics
  • Cost optimization: Reduce monitoring costs through cleanup

Documentation updates for monitoring procedures:

  • Runbook maintenance: Keep runbooks up to date
  • Process documentation: Document monitoring procedures
  • Tool documentation: Keep tool usage guides current
  • Knowledge sharing: Share monitoring best practices across teams

Summary

In this second part of our comprehensive guide, we've explored advanced monitoring strategies and operational excellence practices for AWS and OpenTelemetry observability stacks:

Key monitoring practices covered:

  1. Establishing monitoring baselines with SLOs and golden signals
  2. Implementing comprehensive alerting with multi-level severity
  3. Creating service-specific dashboards for different stakeholders
  4. Leveraging distributed tracing for complex microservices
  5. Monitoring the monitoring system itself
  6. Tracking business metrics and correlating with technical data
  7. Advanced log analysis and management strategies
  8. Performance monitoring across all system components
  9. Security and compliance monitoring for enterprise requirements
  10. Cost optimization and resource management
  11. Incident response and on-call best practices
  12. Continuous improvement processes

Operational excellence outcomes:

  • Proactive issue detection before users are impacted
  • Faster incident resolution with comprehensive context
  • Better resource utilization through intelligent monitoring
  • Improved user experience through performance optimization
  • Cost-effective operations through monitoring optimization
  • Compliance readiness through comprehensive audit trails

Next steps:

  • In Part 3, we'll explore Best Practices for Using OpenTelemetry and AWS, covering security, performance optimization, cost management, and future-proofing strategies
  • Implement these monitoring practices incrementally, starting with critical services
  • Establish feedback loops to continuously improve your monitoring strategy
  • Regularly review and optimize your monitoring setup

This monitoring foundation will enable you to build a robust, scalable observability platform that provides deep insights into your system's performance and user experience while supporting operational excellence and business growth.