In part 1 of our series on AWS and OpenTelemetry, we covered setting up OpenTelemetry infrastructure with AWS services. The next critical step is implementing comprehensive monitoring strategies. Effective monitoring goes beyond simply collecting data: it means creating actionable insights, setting up intelligent alerting, and building operational excellence.
This guide focuses on monitoring best practices specifically designed for AWS CloudWatch, AWS X-Ray, and OpenTelemetry observability stacks. You'll learn how to establish monitoring baselines, create comprehensive alerting strategies, build service-specific dashboards, and implement distributed tracing for complex microservices.
Whether you're a DevOps engineer, SRE, or developer, these practices will help you build a monitoring system that not only detects issues but also provides the context needed to resolve them quickly and prevent future occurrences.
Best Practices for Monitoring
1. Establish Monitoring Baselines
Define SLOs (Service Level Objectives) for critical business metrics:
- Availability: Target 99.9% uptime for critical services
- Latency: P95 response time under 500ms for user-facing APIs
- Error Rate: Less than 1% error rate for production services
- Throughput: Monitor request rates and capacity planning
Set up baseline alarms for error rates (target: <1%), latency (target: <500ms), and throughput:
# Example: Baseline SLO monitoring
Resources:
  AvailabilitySLO:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ServiceAvailability
      AlarmDescription: Service availability below SLO target
      MetricName: availability_percentage
      Namespace: my-service
      Statistic: Average
      Period: 300
      Threshold: 99.9
      ComparisonOperator: LessThanThreshold
      EvaluationPeriods: 2

  LatencySLO:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ServiceLatency
      AlarmDescription: Service latency above SLO target
      MetricName: request_duration_p95
      Namespace: my-service
      Statistic: Average
      Period: 300
      Threshold: 500
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2
Monitor the four golden signals (latency, traffic, errors, and saturation); an instrumentation sketch follows this list:
- Latency: Response time percentiles (P50, P95, P99)
- Traffic: Request rate, concurrent users, data transfer
- Errors: Error rates, failed requests, exceptions
- Saturation: Resource utilization (CPU, memory, disk, network)
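The first three signals are usually emitted from application code. A minimal sketch using the OpenTelemetry Python API is below; the metric names, attribute keys, and wrapper function are illustrative assumptions, not part of the setup from part 1. Saturation typically comes from infrastructure metrics (for example Container Insights) rather than application code.
# Sketch: recording latency, traffic, and error metrics with the OpenTelemetry Python API.
# Metric names and attribute keys below are illustrative assumptions.
import time

from opentelemetry import metrics

meter = metrics.get_meter("my-service")

request_counter = meter.create_counter(
    "requests_total", description="Traffic: total requests received")
error_counter = meter.create_counter(
    "errors_total", description="Errors: failed requests")
latency_histogram = meter.create_histogram(
    "request_duration_ms", unit="ms",
    description="Latency: per-request duration (percentiles derived downstream)")

def handle_request(endpoint, do_work):
    """Wrap a request handler and emit latency, traffic, and error metrics."""
    attributes = {"endpoint": endpoint}
    start = time.monotonic()
    try:
        return do_work()
    except Exception:
        error_counter.add(1, attributes)
        raise
    finally:
        request_counter.add(1, attributes)
        latency_histogram.record((time.monotonic() - start) * 1000, attributes)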
Create service-specific dashboards for different teams (DevOps, SRE, Business):
- Engineering teams: Technical metrics, performance, errors
- Business teams: User engagement, conversion rates, revenue impact
- Operations teams: Infrastructure health, capacity, costs
2. Implement Comprehensive Alerting Strategy
Multi-level alerting approach with different severity levels and response times:
# Example: Multi-level alerting strategy
Resources:
  # Critical alerts - immediate response required
  CriticalErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: CriticalErrorRate
      AlarmDescription: Critical error rate threshold exceeded
      MetricName: error_rate
      Namespace: my-service
      Statistic: Average
      Period: 60  # 1 minute evaluation
      Threshold: 5  # 5% error rate
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1  # Immediate alert
      AlarmActions:
        - !Ref CriticalSNSTopicArn
        - !Ref PagerDutyIntegrationArn

  # Warning alerts - investigate within 15 minutes
  WarningLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighLatencyWarning
      AlarmDescription: Response time degradation detected
      MetricName: request_duration_p95
      Namespace: my-service
      Statistic: Average
      Period: 300  # 5 minute evaluation
      Threshold: 1000  # 1 second
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2  # 2 consecutive periods
      AlarmActions:
        - !Ref WarningSNSTopicArn

  # Info alerts - monitor trends
  InfoTrafficAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: TrafficSpike
      AlarmDescription: Unusual traffic pattern detected
      MetricName: requests_total
      Namespace: my-service
      Statistic: Sum
      Period: 300
      Threshold: 1000  # 1000 requests in 5 minutes
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref InfoSNSTopicArn
Alert fatigue prevention strategies:
- Use anomaly detection instead of static thresholds where appropriate (see the sketch after this list)
- Implement alert correlation to group related alerts
- Set up alert suppression during maintenance windows
- Create alert escalation policies with timeouts
- Regular alert review and cleanup of unused or noisy alerts
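For the first item, CloudWatch anomaly detection alarms replace a static threshold with a band learned from the metric's history. A hedged sketch using boto3 is below; the namespace and metric name mirror the examples above, and the SNS topic ARN is a placeholder.
# Sketch: anomaly-detection alarm instead of a static latency threshold (boto3).
# Namespace/metric names mirror the earlier examples; the SNS ARN is a placeholder.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

cloudwatch.put_metric_alarm(
    AlarmName="LatencyAnomaly",
    AlarmDescription="p95 latency outside the expected band",
    ComparisonOperator="GreaterThanUpperThreshold",
    EvaluationPeriods=2,
    ThresholdMetricId="band",
    Metrics=[
        {
            "Id": "m1",
            "MetricStat": {
                "Metric": {"Namespace": "my-service", "MetricName": "request_duration_p95"},
                "Period": 300,
                "Stat": "Average",
            },
        },
        {
            "Id": "band",
            # Band width of 2 standard deviations; widen to reduce noise.
            "Expression": "ANOMALY_DETECTION_BAND(m1, 2)",
            "Label": "expected latency range",
        },
    ],
    AlarmActions=["arn:aws:sns:us-west-2:123456789012:warning-topic"],  # placeholder
)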
3. Create Service-Specific Dashboards
Comprehensive service dashboard with multiple widgets for different perspectives:
# Example: Create comprehensive service dashboard
aws cloudwatch put-dashboard \
  --dashboard-name "MyService-Comprehensive" \
  --dashboard-body '{
    "widgets": [
      {
        "type": "metric",
        "properties": {
          "metrics": [
            ["my-service", "requests_total", "endpoint", "/api/process"],
            [".", "error_rate", ".", "."],
            [".", "request_duration_p95", ".", "."],
            [".", "active_connections", ".", "."]
          ],
          "period": 300,
          "stat": "Sum",
          "region": "us-west-2",
          "title": "Service Overview",
          "view": "timeSeries",
          "stacked": false
        }
      },
      {
        "type": "log",
        "properties": {
          "query": "SOURCE \"/aws/eks/my-cluster/application\"\n| fields @timestamp, @message\n| filter @message like /ERROR/\n| sort @timestamp desc\n| limit 20",
          "region": "us-west-2",
          "title": "Recent Errors",
          "view": "table"
        }
      },
      {
        "type": "metric",
        "properties": {
          "metrics": [
            ["my-service", "request_duration_p50", "endpoint", "/api/process"],
            [".", "request_duration_p95", ".", "."],
            [".", "request_duration_p99", ".", "."]
          ],
          "period": 300,
          "stat": "Average",
          "region": "us-west-2",
          "title": "Response Time Percentiles",
          "view": "timeSeries"
        }
      },
      {
        "type": "metric",
        "properties": {
          "metrics": [
            ["AWS/ECS", "CPUUtilization", "ServiceName", "my-service"],
            [".", "MemoryUtilization", ".", "."]
          ],
          "period": 300,
          "stat": "Average",
          "region": "us-west-2",
          "title": "Resource Utilization",
          "view": "timeSeries"
        }
      }
    ]
  }'
Dashboard best practices:
- Group related metrics in logical sections
- Use consistent time ranges across widgets
- Include both current values and trends
- Add context with annotations for deployments and incidents
- Create role-based dashboards (engineering, business, operations)
4. Implement Distributed Tracing for Complex Microservices
Enable X-Ray Insights for automatic anomaly detection:
- Automatic detection of unusual patterns in traces
- Anomaly scoring based on historical data
- Root cause analysis for performance issues
- Service dependency mapping and impact analysis
Set up trace sampling to balance visibility with cost:
# Example: Intelligent trace sampling configuration
processors:
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: 10  # Sample 10% of traces by default

  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]  # Always sample errors
      - name: slow-policy
        type: latency
        latency:
          threshold_ms: 1000  # Always sample slow requests
      - name: default-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 5  # Sample 5% of normal requests
Create service maps to visualize dependencies:
- Automatic service discovery and dependency mapping
- Performance impact analysis across service boundaries
- Bottleneck identification in distributed systems
- Change impact assessment for deployments
Monitor trace latency across service boundaries:
- End-to-end latency tracking for user journeys
- Service boundary performance monitoring
- Database and external service latency tracking
- Cache performance and hit rate analysis
Track business transactions end-to-end (a propagation sketch follows this list):
- User journey mapping from frontend to backend
- Business process monitoring (checkout, payment, etc.)
- Cross-service transaction correlation
- Business impact analysis for technical issues
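One way to correlate a business transaction across service boundaries is to carry an identifier in OpenTelemetry baggage and copy it onto each service's spans. A minimal sketch, assuming W3C trace context and baggage propagation are already configured between services; the attribute key and service names are illustrative.
# Sketch: propagating a business transaction id across services via baggage.
# Assumes context/baggage propagation is configured; names are illustrative.
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("checkout-service")

def start_checkout(order_id: str):
    # Attach the business identifier to baggage so downstream services can read it.
    ctx = baggage.set_baggage("business.order_id", order_id)
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("checkout") as span:
            span.set_attribute("business.order_id", order_id)
            # ... call payment, inventory, and shipping services here ...
            pass
    finally:
        context.detach(token)

def handle_payment_request():
    # In a downstream service, copy the propagated value onto the local span
    # so traces can be filtered by business transaction.
    order_id = baggage.get_baggage("business.order_id")
    with tracer.start_as_current_span("charge_card") as span:
        if order_id:
            span.set_attribute("business.order_id", order_id)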
5. Monitor the Collector Itself
Collector health monitoring to ensure the observability system is working:
# Collector health monitoring
Resources:
  CollectorHealthAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ADOTCollectorHealth
      AlarmDescription: ADOT collector is not healthy
      MetricName: otelcol_health_check
      Namespace: AWS/OTel
      Statistic: Average
      Period: 60
      Threshold: 1
      ComparisonOperator: LessThanThreshold
      EvaluationPeriods: 2

  CollectorMemoryAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ADOTCollectorMemory
      AlarmDescription: ADOT collector memory usage high
      MetricName: otelcol_memory_usage
      Namespace: AWS/OTel
      Statistic: Average
      Period: 300
      Threshold: 80  # 80% memory usage
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2

  CollectorErrorAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: ADOTCollectorErrors
      AlarmDescription: ADOT collector experiencing errors
      MetricName: otelcol_exporter_sent_failed_requests
      Namespace: AWS/OTel
      Statistic: Sum
      Period: 300
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2
Key collector metrics to monitor:
- Health status: Collector pod health and readiness (a probe sketch follows this list)
- Memory usage: Prevent OOM kills and performance issues
- Error rates: Failed exports to AWS services
- Processing latency: Time to process and export telemetry
- Queue depth: Backlog of unprocessed telemetry data
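The health alarm above assumes a heartbeat metric is being published somewhere. If you rely on the collector's health_check extension instead, a small sidecar or scheduled job can probe it and publish that heartbeat; the sketch below assumes the extension is enabled on its default port 13133, and the namespace and metric name are assumptions that must match whatever the alarm watches.
# Sketch: probe the collector's health_check extension and publish a heartbeat metric.
# Assumes the health_check extension listens on its default port 13133; the custom
# namespace below is hypothetical and must match the alarm configuration.
import boto3
import requests

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

def report_collector_health(endpoint="http://localhost:13133/"):
    try:
        healthy = requests.get(endpoint, timeout=2).status_code == 200
    except requests.RequestException:
        healthy = False

    cloudwatch.put_metric_data(
        Namespace="observability/collector",  # hypothetical custom namespace
        MetricData=[{
            "MetricName": "otelcol_health_check",
            "Value": 1.0 if healthy else 0.0,
            "Unit": "None",
        }],
    )
    return healthy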
6. Business Metrics and KPIs
Track business transactions (orders, payments, user actions):
# Example: Business transaction tracking
from opentelemetry import metrics, trace

tracer = trace.get_tracer(__name__)
meter = metrics.get_meter(__name__)

# Counter for order volume by status; a histogram backs revenue_gauge so that
# .record() works with the OpenTelemetry Python metrics API.
order_counter = meter.create_counter("orders_total", description="Orders processed by status")
revenue_gauge = meter.create_histogram("order_revenue", description="Revenue per completed order")

def process_order(order_data):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("business.transaction_type", "order")
        span.set_attribute("business.order_value", order_data["total"])
        span.set_attribute("business.customer_id", order_data["customer_id"])

        # Track business metrics
        order_counter.add(1, {
            "status": "processing",
            "payment_method": order_data["payment_method"],
        })

        # Process the order
        result = perform_order_processing(order_data)

        # Update business metrics based on result
        if result["success"]:
            order_counter.add(1, {"status": "completed"})
            revenue_gauge.record(order_data["total"])
        else:
            order_counter.add(1, {"status": "failed"})

        return result
# Example: Revenue impact monitoring
Resources:
  RevenueImpactAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: RevenueImpact
      AlarmDescription: Critical failure impacting revenue
      MetricName: revenue_impact_per_minute
      Namespace: business-metrics
      Statistic: Sum
      Period: 60
      Threshold: 100  # $100 revenue impact per minute
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref CriticalBusinessSNSTopicArn
Create executive dashboards with business metrics:
- High-level KPIs: Revenue, user growth, conversion rates
- Business health indicators: Customer satisfaction, support tickets
- Operational efficiency: Cost per transaction, resource utilization
- Market trends: User behavior patterns, seasonal variations
Correlate technical metrics with business outcomes (a correlation sketch follows this list):
- Performance impact: How latency affects conversion rates
- Error correlation: Which errors impact revenue most
- Capacity planning: User growth vs. infrastructure scaling
- Cost optimization: Resource efficiency vs. business value
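A practical way to start correlating the two is to pull a technical series and a business series over the same window with a single GetMetricData call and compare them. A sketch using boto3, where the metric names and namespaces are assumed to match the earlier examples:
# Sketch: fetch p95 latency and a business conversion metric side by side for correlation.
# Metric names and namespaces are assumptions consistent with the earlier examples.
import datetime

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

end = datetime.datetime.now(datetime.timezone.utc)
start = end - datetime.timedelta(hours=24)

response = cloudwatch.get_metric_data(
    StartTime=start,
    EndTime=end,
    MetricDataQueries=[
        {
            "Id": "latency_p95",
            "MetricStat": {
                "Metric": {"Namespace": "my-service", "MetricName": "request_duration_p95"},
                "Period": 300,
                "Stat": "Average",
            },
        },
        {
            "Id": "conversion_rate",
            "MetricStat": {
                "Metric": {"Namespace": "business-metrics", "MetricName": "conversion_rate"},
                "Period": 300,
                "Stat": "Average",
            },
        },
    ],
)

# Each result has aligned Timestamps/Values lists, ready for plotting or a
# simple correlation check in a notebook.
for result in response["MetricDataResults"]:
    print(result["Id"], list(zip(result["Timestamps"][:3], result["Values"][:3])))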
7. Log Management and Analysis
Advanced log analysis queries for operational insights:
# Find slow database queries
fields @timestamp, @message, @duration
| filter @message like /database/
| filter @duration > 100
| sort @duration desc
| limit 50

# Identify error patterns
fields @timestamp, @message, @error_type
| filter @message like /ERROR/
| stats count(*) as error_count by @error_type
| sort error_count desc

# Monitor API usage patterns
fields @timestamp, @message, @endpoint, @user_id
| filter @message like /API/
| stats count(*) as request_count by @endpoint, @user_id
| sort request_count desc

# Detect security anomalies
fields @timestamp, @message, @ip_address, @user_agent
| filter @message like /authentication/
| filter @message like /failed/
| stats count(*) as failure_count by @ip_address
| sort failure_count desc
| limit 20

# Performance trend analysis
fields @timestamp, @message, @response_time
| filter @message like /request/
| stats avg(@response_time) as avg_response_time by bin(5m)
Log-based metrics for custom business insights:
- Error rate trends by service and endpoint
- User behavior patterns and usage analytics
- Security event monitoring and threat detection
- Compliance audit trails and data access logs
Log retention and archival strategies (a retention-policy sketch follows this list):
- Hot storage: Recent logs (7-30 days) for active monitoring
- Warm storage: Historical logs (30-90 days) for analysis
- Cold storage: Long-term logs (90+ days) for compliance
- Automated archival based on log age and importance
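Retention tiers like these can be enforced directly on CloudWatch log groups. A sketch using boto3, where the log group names and retention periods are illustrative assumptions; archival to S3 or Glacier would be handled separately (for example with export tasks or lifecycle policies).
# Sketch: enforce retention tiers on CloudWatch log groups with boto3.
# Log group names and retention periods are illustrative assumptions.
import boto3

logs = boto3.client("logs", region_name="us-west-2")

RETENTION_DAYS = {
    "/aws/eks/my-cluster/application": 30,  # hot: active monitoring
    "/aws/lambda/my-service-worker": 14,    # hot: shorter-lived functions
    "/my-service/audit": 90,                # warm: analysis and audits
}

for log_group, days in RETENTION_DAYS.items():
    # retentionInDays accepts a fixed set of values (1, 3, 5, 7, 14, 30, 60, 90, ...).
    logs.put_retention_policy(logGroupName=log_group, retentionInDays=days)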
8. Performance Monitoring
Monitor resource utilization (CPU, memory, disk, network):
# Example: Resource monitoring alarms
Resources:
  HighCPUAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighCPUUtilization
      AlarmDescription: CPU utilization above 80%
      MetricName: CPUUtilization
      Namespace: AWS/ECS
      Statistic: Average
      Period: 300
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2

  HighMemoryAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighMemoryUtilization
      AlarmDescription: Memory utilization above 85%
      MetricName: MemoryUtilization
      Namespace: AWS/ECS
      Statistic: Average
      Period: 300
      Threshold: 85
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 2
Track application performance (response times, throughput):
- Response time percentiles: P50, P95, P99 for different endpoints
- Throughput monitoring: Requests per second, concurrent users
- Queue depth monitoring: Backlog of pending requests
- Cache performance: Hit rates, eviction rates, miss penalties
Set up capacity planning alerts for resource exhaustion:
- Predictive scaling: Alert before resources are exhausted
- Trend analysis: Monitor resource usage growth patterns
- Auto-scaling triggers: Automatic resource provisioning
- Cost optimization: Right-sizing based on actual usage
Monitor database performance and connection pools:
- Query performance: Slow query detection and optimization
- Connection pool health: Pool utilization and connection errors
- Database metrics: Read/write ratios, lock contention
- Replication lag: For read replicas and multi-region setups
Track cache hit rates and performance (an alarm sketch follows this list):
- Cache efficiency: Hit rates, miss rates, eviction rates
- Cache performance: Response times, memory usage
- Cache warming: Pre-loading frequently accessed data
- Cache invalidation: Impact of cache clears on performance
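For an ElastiCache for Redis cluster, hit rate can be alarmed on from the AWS/ElastiCache namespace, which publishes a CacheHitRate metric for Redis. A hedged sketch using boto3, where the cluster id, threshold, and SNS topic are placeholders:
# Sketch: alarm when the Redis cache hit rate drops. Cluster id, threshold,
# and SNS topic ARN are placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

cloudwatch.put_metric_alarm(
    AlarmName="LowCacheHitRate",
    AlarmDescription="Cache hit rate below 80% for 15 minutes",
    Namespace="AWS/ElastiCache",
    MetricName="CacheHitRate",
    Dimensions=[{"Name": "CacheClusterId", "Value": "my-service-cache-001"}],  # placeholder
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=80,
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:us-west-2:123456789012:warning-topic"],  # placeholder
)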
9. Security and Compliance Monitoring
Monitor authentication and authorization events:
# Authentication monitoring
fields @timestamp, @message, @user_id, @ip_address, @result
| filter @message like /authentication/
| stats count(*) as attempt_count by @result, @user_id
| sort attempt_count desc

# Authorization failures
fields @timestamp, @message, @user_id, @resource, @action
| filter @message like /authorization/
| filter @message like /denied/
| stats count(*) as denial_count by @user_id, @resource
| sort denial_count desc
Track API access patterns for anomalies:
- Rate limiting: Monitor for unusual request patterns
- Geographic anomalies: Access from unexpected locations
- Time-based patterns: Unusual access times or patterns
- User behavior: Deviations from normal usage patterns
Set up alerts for security events (failed logins, privilege escalations):
# Example: Security monitoring alarms
Resources:
  FailedLoginAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: HighFailedLogins
      AlarmDescription: High rate of failed login attempts
      MetricName: failed_logins_per_minute
      Namespace: security-metrics
      Statistic: Sum
      Period: 60
      Threshold: 10
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref SecuritySNSTopicArn

  PrivilegeEscalationAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: PrivilegeEscalation
      AlarmDescription: Privilege escalation detected
      MetricName: privilege_escalation_events
      Namespace: security-metrics
      Statistic: Sum
      Period: 300
      Threshold: 1
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref SecuritySNSTopicArn
Monitor compliance metrics (data retention, audit logs):
- Data retention compliance: Ensure logs are retained for required periods
- Audit trail completeness: Verify all required events are logged
- Access review monitoring: Track privileged access and changes
- Regulatory compliance: GDPR, HIPAA, SOX compliance monitoring
Track certificate expiration and SSL/TLS issues (an expiry-check sketch follows this list):
- Certificate monitoring: Expiration dates and renewal tracking
- SSL/TLS configuration: Security protocol versions and cipher suites
- Certificate validation: Chain of trust verification
- Automated renewal: Integration with certificate management systems
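If certificates live in AWS Certificate Manager, expiration can be checked programmatically and surfaced as a custom metric that feeds the security-metrics alarms above. A sketch with boto3; the custom metric name is an assumption, and note that ACM also publishes DaysToExpiry under AWS/CertificateManager, which can be alarmed on without any custom code.
# Sketch: report days until expiry for ACM certificates as a custom metric.
# The metric name is an assumption; pagination of list_certificates is omitted.
import datetime

import boto3

acm = boto3.client("acm", region_name="us-west-2")
cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

def report_certificate_expiry():
    for summary in acm.list_certificates()["CertificateSummaryList"]:
        cert = acm.describe_certificate(
            CertificateArn=summary["CertificateArn"])["Certificate"]
        not_after = cert.get("NotAfter")
        if not not_after:
            continue  # e.g., certificate still pending validation
        days_left = (not_after - datetime.datetime.now(datetime.timezone.utc)).days
        cloudwatch.put_metric_data(
            Namespace="security-metrics",  # matches the namespace used above
            MetricData=[{
                "MetricName": "certificate_days_to_expiry",
                "Dimensions": [{"Name": "DomainName", "Value": cert["DomainName"]}],
                "Value": days_left,
                "Unit": "Count",
            }],
        )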
10. Cost and Resource Optimization
Monitor AWS service costs and set up billing alerts:
# Example: Cost monitoring alarms (AWS/Billing metrics are published only in us-east-1)
Resources:
  CostAnomalyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: CostAnomaly
      AlarmDescription: Unusual cost increase detected
      MetricName: EstimatedCharges
      Namespace: AWS/Billing
      Dimensions:
        - Name: Currency
          Value: USD
      Statistic: Maximum
      Period: 86400  # Daily
      Threshold: 100  # Alert when month-to-date charges exceed $100
      ComparisonOperator: GreaterThanThreshold
      EvaluationPeriods: 1
      AlarmActions:
        - !Ref CostManagementSNSTopicArn
Track resource utilization to identify waste:
- Idle resource detection: Unused EC2 instances, EBS volumes
- Over-provisioned resources: Instances with low utilization
- Storage optimization: Unused S3 buckets, old snapshots
- Network cost optimization: Data transfer patterns and costs
Set up cost anomaly detection for unexpected charges:
- Daily cost tracking: Monitor daily spending patterns
- Service-level cost analysis: Break down costs by AWS service
- Region cost optimization: Identify cost differences across regions
- Reserved instance optimization: Track RI utilization and savings
Monitor data transfer costs and optimize network usage:
- Cross-region transfer costs: Minimize unnecessary data transfer
- CDN optimization: CloudFront usage and cache hit rates
- API Gateway costs: Request patterns and optimization opportunities
- VPC endpoint costs: Private connectivity cost analysis
Track storage costs and implement lifecycle policies (a lifecycle sketch follows this list):
- S3 lifecycle policies: Automatic transition to cheaper storage tiers
- EBS snapshot management: Old snapshot cleanup and optimization
- RDS storage optimization: Database storage usage and growth
- Backup cost optimization: Backup retention and storage costs
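The first item can be applied with a bucket lifecycle configuration. A sketch using boto3, where the bucket name, prefix, and transition windows are placeholder assumptions:
# Sketch: S3 lifecycle policy that tiers objects down and eventually expires them.
# Bucket name, prefix, and day counts are placeholder assumptions.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-service-logs",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # warm tier
                    {"Days": 90, "StorageClass": "GLACIER"},      # cold tier
                ],
                "Expiration": {"Days": 365},  # delete after the compliance window
            }
        ]
    },
)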
11. Incident Response and On-Call
Create runbooks for common issues:
# Example: High Error Rate Runbook
## Symptoms
- Error rate > 5% for more than 2 minutes
- Increased response times
- User complaints about service failures
## Immediate Actions
1. Check service health endpoints
2. Review recent deployments
3. Check database connectivity
4. Verify external service dependencies
## Investigation Steps
1. Check CloudWatch logs for error patterns
2. Review X-Ray traces for slow operations
3. Check resource utilization (CPU, memory)
4. Verify network connectivity
## Resolution Steps
1. Rollback recent changes if necessary
2. Scale up resources if needed
3. Restart unhealthy instances
4. Update monitoring thresholds
## Post-Incident
1. Document root cause
2. Update runbooks
3. Review monitoring gaps
4. Schedule post-mortem
Set up escalation policies for different alert severities:
- P0 (Critical): Immediate escalation to on-call engineer
- P1 (High): Escalation after 15 minutes if not acknowledged
- P2 (Medium): Escalation after 1 hour if not resolved
- P3 (Low): Escalation after 4 hours if not addressed
Implement incident response workflows with tools like PagerDuty:
- Alert routing: Route alerts to appropriate teams
- Escalation chains: Automatic escalation when alerts aren't acknowledged
- Incident creation: Automatic incident creation from critical alerts
- Status page updates: Automatic status page updates during incidents
Create post-incident review processes to learn from failures:
- Incident documentation: Detailed incident reports and timelines
- Root cause analysis: Systematic analysis of failure causes
- Action item tracking: Follow-up on improvement items
- Process improvement: Update procedures based on lessons learned
Set up automated remediation for common issues (a remediation sketch follows this list):
- Auto-scaling: Automatic resource scaling based on load
- Health check recovery: Automatic restart of unhealthy instances
- Circuit breaker patterns: Automatic failure isolation
- Rollback automation: Automatic rollback of failed deployments
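A common automated-remediation pattern is a Lambda function subscribed to the critical alarm's SNS topic that forces a fresh deployment of the affected ECS service, replacing unhealthy tasks. A minimal sketch, with the cluster and service names taken from environment variables as assumptions:
# Sketch: Lambda handler that remediates an unhealthy ECS service when a critical
# CloudWatch alarm fires via SNS. Cluster/service names come from environment
# variables and are assumptions; subscribe the function to the alarm's SNS topic.
import json
import os

import boto3

ecs = boto3.client("ecs")

def handler(event, context):
    for record in event.get("Records", []):
        alarm = json.loads(record["Sns"]["Message"])
        if alarm.get("NewStateValue") != "ALARM":
            continue
        # Force a new deployment so unhealthy tasks are replaced.
        ecs.update_service(
            cluster=os.environ["ECS_CLUSTER"],
            service=os.environ["ECS_SERVICE"],
            forceNewDeployment=True,
        )
    return {"status": "ok"}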
12. Continuous Improvement
Regular review of alert thresholds based on historical data:
- Threshold optimization: Adjust thresholds based on actual patterns
- Seasonal adjustments: Account for business cycles and patterns
- Performance improvements: Update thresholds as systems improve
- False positive reduction: Eliminate noisy alerts
A/B testing of monitoring strategies to optimize effectiveness:
- Alert sensitivity testing: Test different threshold levels
- Dashboard effectiveness: Measure dashboard usage and value
- Response time optimization: Test different escalation policies
- Tool evaluation: Test new monitoring tools and approaches
Feedback loops from on-call teams to improve alerting:
- Alert quality surveys: Regular feedback on alert usefulness
- False positive tracking: Monitor and reduce false positives
- Response time analysis: Track time to acknowledge and resolve
- Team satisfaction: Monitor on-call team satisfaction and burnout
Regular cleanup of unused dashboards and alarms:
- Dashboard audit: Review and remove unused dashboards
- Alarm cleanup: Remove or update outdated alarms
- Metric optimization: Remove unused custom metrics
- Cost optimization: Reduce monitoring costs through cleanup
Documentation updates for monitoring procedures:
- Runbook maintenance: Keep runbooks up to date
- Process documentation: Document monitoring procedures
- Tool documentation: Keep tool usage guides current
- Knowledge sharing: Share monitoring best practices across teams
Summary
In this second part of our comprehensive guide, we've explored advanced monitoring strategies and operational excellence practices for AWS and OpenTelemetry observability stacks:
Key monitoring practices covered:
- Establishing monitoring baselines with SLOs and golden signals
- Implementing comprehensive alerting with multi-level severity
- Creating service-specific dashboards for different stakeholders
- Leveraging distributed tracing for complex microservices
- Monitoring the monitoring system itself
- Tracking business metrics and correlating with technical data
- Advanced log analysis and management strategies
- Performance monitoring across all system components
- Security and compliance monitoring for enterprise requirements
- Cost optimization and resource management
- Incident response and on-call best practices
- Continuous improvement processes
Operational excellence outcomes:
- Proactive issue detection before users are impacted
- Faster incident resolution with comprehensive context
- Better resource utilization through intelligent monitoring
- Improved user experience through performance optimization
- Cost-effective operations through monitoring optimization
- Compliance readiness through comprehensive audit trails
Next steps:
- In Part 3, we'll explore Best Practices for Using OpenTelemetry and AWS, covering security, performance optimization, cost management, and future-proofing strategies
- Implement these monitoring practices incrementally, starting with critical services
- Establish feedback loops to continuously improve your monitoring strategy
- Review and optimize your monitoring setup on a regular cadence
This monitoring foundation will enable you to build a robust, scalable observability platform that provides deep insights into your system's performance and user experience while supporting operational excellence and business growth.